Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models
Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but require high computational costs for inference. Post-training quantization is a widely adopted approach to reduce these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks of modern LLMs like the LLaMA family. Specifically, severe local quantization errors arise from excessively large activation magnitudes, which we refer to as activation spikes, and these errors significantly degrade model performance. Our analysis reveals a systematic pattern: the spikes predominantly occur in the feed-forward network (FFN) layers at the early and late layers of the model and are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrate that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.
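The failure mode the abstract describes, a handful of extreme activation values inflating the quantization scale for an entire tensor, is easy to reproduce. Below is a minimal, self-contained Python sketch of symmetric per-tensor INT8 quantization; the activation distribution and the spike magnitude are invented for illustration and are not taken from the paper.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor INT8 quantize-dequantize round trip."""
    scale = np.abs(x).max() / 127.0           # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale       # dequantize to measure error

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096).astype(np.float32)  # typical activations

# Without a spike, the round-trip error is small.
err_clean = np.abs(acts - quantize_int8(acts)).mean()

# A single "activation spike" (one token's excessively large value)
# stretches the scale, so all other values collapse onto a few INT8 bins.
spiked = acts.copy()
spiked[0] = 500.0                             # hypothetical spike magnitude
err_spiked = np.abs(spiked[1:] - quantize_int8(spiked)[1:]).mean()

print(f"mean abs error without spike:         {err_clean:.4f}")
print(f"mean abs error on the non-spike part: {err_spiked:.4f}")
```

With the spike present, the per-tensor scale grows by roughly two orders of magnitude, so ordinary activations land in only a few quantization bins; this is the "severe local quantization error" the abstract attributes to activation spikes in GLU variants.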
| Main Authors: | Jaewoo Yang, Hayun Kim, Junyung Ji, Younghoon Kim |
|---|---|
| Affiliation: | Department of Applied Artificial Intelligence, Hanyang University at Ansan, Ansan 15588, Republic of Korea (all authors) |
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Future Internet, Vol. 17, Iss. 4, Article 185 |
| ISSN: | 1999-5903 |
| DOI: | 10.3390/fi17040185 |
| Subjects: | quantization; LLM; post-training quantization; outliers |
| Online Access: | https://www.mdpi.com/1999-5903/17/4/185 |
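The abstract does not detail how QFeM and QFeP are implemented, so the sketch below is only one plausible reading of "isolating activation spikes during quantization": calibrate which linear modules see spike-prone inputs, then leave those modules unquantized while quantizing the rest. All names and values here (`SpikeAwareLinear`, the threshold, the fake-quantization step) are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn

class SpikeAwareLinear(nn.Module):
    """Wraps nn.Linear and skips INT8 quantization if calibration saw spikes.

    Hypothetical illustration of a "quantization-free module": a module whose
    inputs contain activation spikes is excluded from activation quantization
    so that one extreme value cannot inflate the per-tensor scale.
    """

    def __init__(self, linear: nn.Linear, spike_threshold: float = 50.0):
        super().__init__()
        self.linear = linear
        self.spike_threshold = spike_threshold  # assumed magnitude cutoff
        self.quantize = True                    # decided during calibration

    @torch.no_grad()
    def calibrate(self, x: torch.Tensor) -> None:
        # If any input magnitude dwarfs the typical range, mark this module
        # quantization-free.
        if x.abs().max() > self.spike_threshold:
            self.quantize = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.quantize:
            return self.linear(x)   # spike-prone module: full precision
        # Fake-quantize the input (weights kept in FP for brevity).
        scale = x.abs().amax().clamp(min=1e-8) / 127.0
        x_q = torch.clamp(torch.round(x / scale), -127, 127) * scale
        return self.linear(x_q)
```

Given the pattern the abstract reports, spikes concentrated in early and late FFN layers and on a small subset of tokens, a rule like this would leave most modules quantized and exempt only a few, which is consistent with the paper's claim that isolating spikes preserves quality under coarse-grained quantization.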