WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling

Post-Training Quantization (PTQ) compresses neural networks to very few bits using only a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. However, at extremely low bits, the increase in qu...


Bibliographic Details
Main Authors: Geunjae Choi, Kamin Lee, Nojun Kwak
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10982219/
_version_ 1849311814587777024
author Geunjae Choi
Kamin Lee
Nojun Kwak
author_facet Geunjae Choi
Kamin Lee
Nojun Kwak
author_sort Geunjae Choi
collection DOAJ
description Post-Training Quantization (PTQ) compresses neural networks to very few bits using only a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. At extremely low bits, however, the quantization error grows significantly, hindering optimal performance. Previous second-order error-based PTQ methods relied solely on quantization scale values and weight rounding. We introduce a weight-activation product scaling method that, used alongside weight rounding and scale-value adjustment, effectively reduces quantization error even at very low bits. The proposed method compensates for the errors introduced by quantization, yielding results closer to the original model, and it limits the potential increase in computational and memory complexity through channel-wise grouping, shifting, and channel-mixing techniques. Our method is validated on various CNNs and extended to ViT and object detection models, showing strong generalization across architectures. The proposed approach improves accuracy in 2/4-bit quantization with less than 1.5% computational overhead, and hardware-level simulation on a silicon-proven ASIC NPU confirms that this accuracy gain comes with negligible latency overhead, making the method practical for real-time edge deployment.
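The abstract describes compensating quantization error by scaling the weight-activation product rather than adjusting weights alone. As a rough, generic illustration only (this is not the paper's WAPS-Quant algorithm — the symmetric per-channel quantizer, the least-squares factor `alpha`, and all names below are assumptions of this sketch), a per-output-channel correction can be fit on calibration activations so that the scaled quantized product best matches the full-precision product:

```python
import numpy as np

def quantize_per_channel(w, n_bits=2):
    # Symmetric uniform quantizer with one scale per output channel (row).
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # (out_channels, in_features), full precision
X = rng.normal(size=(16, 32))   # calibration activations

W_q = quantize_per_channel(W, n_bits=2)
Y, Y_q = W @ X, W_q @ X         # full-precision vs. quantized products

# Per-output-channel least-squares factor minimizing ||Y_c - alpha_c * Y_q,c||^2:
#   alpha_c = <Y_c, Y_q,c> / ||Y_q,c||^2
alpha = (Y * Y_q).sum(axis=1) / (Y_q * Y_q).sum(axis=1)
Y_corrected = alpha[:, None] * Y_q

err_before = np.linalg.norm(Y - Y_q)
err_after = np.linalg.norm(Y - Y_corrected)
# The least-squares choice of alpha can never increase the L2 error.
assert err_after <= err_before
```

Because each `alpha_c` is the closed-form least-squares solution for its channel, the corrected product is never worse than the uncorrected one on the calibration data; the paper's channel-wise grouping and channel mixing presumably address keeping such per-channel factors cheap at inference time.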
format Article
id doaj-art-dbcfcf4d0fea4d618c638a4f789dd279
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-dbcfcf4d0fea4d618c638a4f789dd2792025-08-20T03:53:17ZengIEEEIEEE Access2169-35362025-01-0113795347954710.1109/ACCESS.2025.356630710982219WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product ScalingGeunjae Choi0https://orcid.org/0009-0003-6502-8207Kamin Lee1https://orcid.org/0009-0005-4608-147XNojun Kwak2https://orcid.org/0000-0002-1792-0327Graduate School of Convergence Science and Technology, Seoul National University, Gwanak-gu, Seoul, Republic of KoreaGraduate School of Convergence Science and Technology, Seoul National University, Gwanak-gu, Seoul, Republic of KoreaGraduate School of Convergence Science and Technology, Seoul National University, Gwanak-gu, Seoul, Republic of KoreaPost-Training Quantization (PTQ) compresses neural networks to very few bits using only a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. At extremely low bits, however, the quantization error grows significantly, hindering optimal performance. Previous second-order error-based PTQ methods relied solely on quantization scale values and weight rounding. We introduce a weight-activation product scaling method that, used alongside weight rounding and scale-value adjustment, effectively reduces quantization error even at very low bits. The proposed method compensates for the errors introduced by quantization, yielding results closer to the original model, and it limits the potential increase in computational and memory complexity through channel-wise grouping, shifting, and channel-mixing techniques. Our method is validated on various CNNs and extended to ViT and object detection models, showing strong generalization across architectures. The proposed approach improves accuracy in 2/4-bit quantization with less than 1.5% computational overhead, and hardware-level simulation on a silicon-proven ASIC NPU confirms that this accuracy gain comes with negligible latency overhead, making the method practical for real-time edge deployment.https://ieeexplore.ieee.org/document/10982219/Post-training quantization (PTQ)low-bit quantizationweight-activation product scalingchannel-wise groupingASIC
spellingShingle Geunjae Choi
Kamin Lee
Nojun Kwak
WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
IEEE Access
Post-training quantization (PTQ)
low-bit quantization
weight-activation product scaling
channel-wise grouping
ASIC
title WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
title_full WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
title_fullStr WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
title_full_unstemmed WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
title_short WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
title_sort waps quant low bit post training quantization using weight activation product scaling
topic Post-training quantization (PTQ)
low-bit quantization
weight-activation product scaling
channel-wise grouping
ASIC
url https://ieeexplore.ieee.org/document/10982219/
work_keys_str_mv AT geunjaechoi wapsquantlowbitposttrainingquantizationusingweightactivationproductscaling
AT kaminlee wapsquantlowbitposttrainingquantizationusingweightactivationproductscaling
AT nojunkwak wapsquantlowbitposttrainingquantizationusingweightactivationproductscaling