WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
Post-Training Quantization (PTQ) effectively compresses neural networks to very few bits using only a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. However, at extremely low bits the increase in quantization error is significant, hindering optimal performance.
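The record's abstract does not give the exact formulation, but the core idea, rescaling the quantized weight-activation product so that it better matches the full-precision output on calibration data, can be sketched roughly as below. This is a minimal, hypothetical illustration: the per-output-channel least-squares fit, the `fit_product_scaling` name, and the toy layer sizes are assumptions, not the authors' method.

```python
# Hypothetical sketch: compensate low-bit weight quantization error with one
# extra scale per output channel applied to the weight-activation product.
# The least-squares objective below is an assumption, not the paper's exact scheme.
import numpy as np

def quantize_weights(w, n_bits=2):
    """Uniform symmetric per-output-channel weight quantization with rounding."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax      # per-row (output channel) step
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax)    # rounded integer grid
    return w_int * scale                                      # dequantized weights

def fit_product_scaling(w, w_q, calib_x):
    """Fit one factor per output channel so that alpha * (W_q x) ~ W x on calibration data."""
    y_fp = calib_x @ w.T          # (N, out) full-precision outputs
    y_q = calib_x @ w_q.T         # (N, out) quantized outputs
    num = np.sum(y_fp * y_q, axis=0)
    den = np.sum(y_q * y_q, axis=0) + 1e-12
    return num / den              # alpha, shape (out,)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))            # toy layer: 64 output channels, 128 inputs
calib_x = rng.normal(size=(256, 128))     # small calibration batch
w_q = quantize_weights(w, n_bits=2)
alpha = fit_product_scaling(w, w_q, calib_x)

err_plain = np.mean((calib_x @ w.T - calib_x @ w_q.T) ** 2)
err_scaled = np.mean((calib_x @ w.T - alpha * (calib_x @ w_q.T)) ** 2)
print(f"MSE without product scaling: {err_plain:.4f}, with: {err_scaled:.4f}")
```

On a toy example like this, the fitted factors can only reduce the calibration-set output error relative to plain rounding, which is the kind of compensation the abstract describes.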
Saved in:
| Main Authors: | Geunjae Choi, Kamin Lee, Nojun Kwak |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Post-training quantization (PTQ); low-bit quantization; weight-activation product scaling; channel-wise grouping; ASIC |
| Online Access: | https://ieeexplore.ieee.org/document/10982219/ |
| author | Geunjae Choi, Kamin Lee, Nojun Kwak |
|---|---|
| collection | DOAJ |
| description | Post-Training Quantization (PTQ) effectively compresses neural networks to very few bits using only a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. However, at extremely low bits the increase in quantization error is significant, hindering optimal performance. Previous second-order error-based PTQ methods relied solely on quantization scale values and weight rounding. We introduce a weight-activation product scaling method that, used alongside weight rounding and scale value adjustments, effectively reduces quantization error even at very low bits. The proposed method compensates for the errors resulting from quantization, thereby achieving results closer to the original model. It also limits the potential increase in computational and memory complexity through channel-wise grouping, shifting, and channel mixing techniques. Our method is validated on various CNN-based models and extended to ViT and object detection models, showing strong generalization across architectures. The proposed approach improves accuracy in 2/4-bit quantization with less than 1.5% computational overhead, and hardware-level simulation on a silicon-proven ASIC NPU confirms negligible latency overhead, making it practical for real-time edge deployment. |
| format | Article |
| id | doaj-art-dbcfcf4d0fea4d618c638a4f789dd279 |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| doi | 10.1109/ACCESS.2025.3566307 |
| published in | IEEE Access, vol. 13, pp. 79534-79547, 2025 |
| author ORCIDs | Geunjae Choi: https://orcid.org/0009-0003-6502-8207; Kamin Lee: https://orcid.org/0009-0005-4608-147X; Nojun Kwak: https://orcid.org/0000-0002-1792-0327 |
| affiliation | Graduate School of Convergence Science and Technology, Seoul National University, Gwanak-gu, Seoul, Republic of Korea (all authors) |
| title | WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling |
| topic | Post-training quantization (PTQ); low-bit quantization; weight-activation product scaling; channel-wise grouping; ASIC |
| url | https://ieeexplore.ieee.org/document/10982219/ |
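The abstract also mentions channel-wise grouping to keep the overhead of the extra scaling factors small. A hypothetical sketch of one way such grouping could look (the sort-and-average criterion and the group count are assumptions, not taken from the paper):

```python
# Hypothetical sketch of channel-wise grouping: instead of storing one extra
# scaling factor per channel, channels are clustered into a few groups that
# share a factor, reducing storage and per-channel multiply overhead.
import numpy as np

def group_scales(alpha, n_groups=4):
    """Share one scaling factor among channels with similar per-channel values."""
    order = np.argsort(alpha)                    # sort channels by their fitted scale
    groups = np.array_split(order, n_groups)     # contiguous groups in sorted order
    grouped = np.empty_like(alpha)
    for g in groups:
        grouped[g] = alpha[g].mean()             # one shared value per group
    return grouped

alpha = np.random.default_rng(1).normal(loc=1.0, scale=0.05, size=64)
alpha_grouped = group_scales(alpha, n_groups=4)
print("distinct scales stored:", len(np.unique(alpha_grouped)))  # typically 4 instead of 64
```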