Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
Recently, there has been high demand for deploying DeepSeek-R1 and V3 locally, possibly because the official service often suffers from being busy and some organizations have data privacy concerns. While single-machine deployment offers infrastructure simplicity, the models' 671B FP8 parameter configuration exceeds the practical memory limits of a standard 8-GPU machine. Quantization is a widely used technique for reducing model memory consumption. However, it is unclear how DeepSeek-R1 and V3 perform after quantization. This technical report presents the first quantitative evaluation of multi-bitwidth quantization across the complete DeepSeek model spectrum. Key findings reveal that 4-bit quantization incurs little performance degradation versus FP8 while enabling single-machine deployment on standard NVIDIA GPU devices. We further propose DQ3_K_M, a dynamic 3-bit quantization method that significantly outperforms the traditional Q3_K_M variant on various benchmarks and is comparable with the 4-bit quantization (Q4_K_M) approach on most tasks. Moreover, DQ3_K_M supports single-machine deployment configurations on both NVIDIA H100/A100 and Huawei 910B. Our implementation of DQ3_K_M is released at https://github.com/UnicomAI/DeepSeek-Eval and contains optimized 3-bit quantized variants of both DeepSeek-R1 and DeepSeek-V3.
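To make the deployment claims concrete, the sketch below works out the weight-only memory arithmetic for a 671B-parameter model at different bitwidths. It is a minimal illustration, not the paper's method: the bits-per-weight values used for the 4-bit (Q4_K_M-like) and 3-bit (DQ3_K_M-like) settings are assumed round numbers, the per-card memory sizes are assumptions, and KV-cache, activation, and runtime overheads are ignored.

```python
# Back-of-the-envelope weight-memory arithmetic for a 671B-parameter model.
# Assumptions (not from the paper): ~4.5 and ~3.5 bits per weight for the
# 4-bit and 3-bit formats, 80 GB per H100/A100 card, 64 GB per 910B card.

N_PARAMS = 671e9                                   # DeepSeek-R1 / V3 parameter count
BUDGETS_GB = {
    "8x NVIDIA H100/A100 (80 GB each)": 8 * 80,    # assumed per-card memory
    "8x Huawei 910B (64 GB each)": 8 * 64,         # assumed per-card memory
}

def weight_footprint_gb(bits_per_weight: float) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return N_PARAMS * bits_per_weight / 8 / 1e9

for name, bpw in [
    ("FP8 (8 bpw)", 8.0),
    ("~4-bit, Q4_K_M-like (4.5 bpw assumed)", 4.5),
    ("~3-bit, DQ3_K_M-like (3.5 bpw assumed)", 3.5),
]:
    gb = weight_footprint_gb(bpw)
    verdicts = "  ".join(
        f"{budget_name}: {'fits' if gb < budget else 'too large'}"
        for budget_name, budget in BUDGETS_GB.items()
    )
    print(f"{name:42s} ~{gb:5.0f} GB   {verdicts}")
```

Under these assumptions the FP8 weights alone (~671 GB) exceed an 8x80 GB budget, while the ~4-bit and ~3-bit footprints leave headroom on a single machine; real deployments additionally need memory for the KV cache and activations, so the weight-only numbers understate the true requirement.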
Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian
Subjects: Computing Technology, Computer Technology
Enbo Zhao, Yi Shen, Shuming Shi, Jieyun Huang, Zhihao Chen, Ning Wang, Siqi Xiao, Jian Zhang, Kai Wang, Shiguo Lian. Quantitative Analysis of Performance Drop in DeepSeek Model Quantization [EB/OL]. (2025-05-05) [2025-07-16]. https://arxiv.org/abs/2505.02390.