ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads
Large Language Models (LLMs) have immense potential to enhance the capabilities of Cyber-Physical-Social Intelligence (CPSI) systems, enabling them to better engage with complex cyber, physical, and social environments. However, the high inference latency of LLMs, which is inherited from the autoregressive decoding process, hinders their wide application in CPSI systems. To address this challenge, current approaches have incorporated speculative decoding to enable parallel prediction of multiple subsequent tokens, thereby achieving inference acceleration. Nevertheless, the accuracy of these decoding heads falls short of the autoregressive decoding approach. In light of these limitations, we propose ResDecode, a novel speculative decoding method characterized by its efficient and accurate decoding heads. Within the lightweight draft model, we propose a residual decoding head to compensate for the full context encoder’s limited capability to model long-range dependencies, thus improving accuracy. ResDecode demonstrates impressive results, achieving a maximum speedup ratio of 3.2× on the MT-bench compared to vanilla autoregressive decoding.
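The speculative decoding loop the abstract builds on can be sketched as follows. This is a minimal, generic greedy version with toy stand-in models (`draft_next` and `target_next` are illustrative functions, not the paper's architecture, and ResDecode's residual decoding heads are not modeled here): the cheap draft proposes a block of tokens, the expensive target verifies them in a single pass, and the longest agreeing prefix is accepted.

```python
# Hedged sketch of greedy speculative decoding with toy models.
# Both "models" map a token sequence to the next token over a vocabulary 0-9.

def draft_next(seq):
    # Cheap draft model: guesses the next token as (last token + 1) mod 10.
    return (seq[-1] + 1) % 10

def target_next(seq):
    # Expensive target model: same rule, except it emits 0 after token 7,
    # so the draft is usually right but occasionally wrong.
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10

def speculative_decode(seq, n_new, k=4):
    """Generate n_new tokens: the draft proposes k tokens autoregressively,
    the target verifies them; accept the longest matching prefix, then keep
    the target's correction (or a bonus token if all k were accepted)."""
    seq = list(seq)
    produced = 0
    while produced < n_new:
        # 1. Draft proposes k tokens (cheap, sequential).
        proposal, ctx = [], seq[:]
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target checks all k positions (conceptually one parallel pass).
        accepted, ctx = [], seq[:]
        for t in proposal:
            expect = target_next(ctx)
            if t == expect:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expect)  # target's correction, always kept
                break
        else:
            # All k proposals accepted: take a bonus token from the same pass.
            accepted.append(target_next(ctx))
        seq.extend(accepted[:n_new - produced])
        produced += len(accepted)
    return seq
```

Because at least one token is produced per verification pass, the output always matches plain greedy decoding with the target model; the speedup comes from accepting several draft tokens per pass when the models agree.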
| Main Authors: | Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Fan Yu, Hongen Shao, Xiaofeng Zou |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Tsinghua University Press, 2025-06-01 |
| Series: | Big Data Mining and Analytics |
| Subjects: | speculative decoding; efficient inference; large language models (LLMs) |
| Online Access: | https://www.sciopen.com/article/10.26599/BDMA.2024.9020074 |
| _version_ | 1849714744932433920 |
|---|---|
| author | Ziqian Zeng Jiahong Yu Qianshi Pang Zihao Wang Huiping Zhuang Fan Yu Hongen Shao Xiaofeng Zou |
| author_sort | Ziqian Zeng |
| collection | DOAJ |
| description | Large Language Models (LLMs) have immense potential to enhance the capabilities of Cyber-Physical-Social Intelligence (CPSI) systems, enabling them to better engage with complex cyber, physical, and social environments. However, the high inference latency of LLMs, which is inherited from the autoregressive decoding process, hinders their wide application in CPSI systems. To address this challenge, current approaches have incorporated speculative decoding to enable parallel prediction of multiple subsequent tokens, thereby achieving inference acceleration. Nevertheless, the accuracy of these decoding heads falls short of the autoregressive decoding approach. In light of these limitations, we propose ResDecode, a novel speculative decoding method characterized by its efficient and accurate decoding heads. Within the lightweight draft model, we propose a residual decoding head to compensate for the full context encoder’s limited capability to model long-range dependencies, thus improving accuracy. ResDecode demonstrates impressive results, achieving a maximum speedup ratio of 3.2× on the MT-bench compared to vanilla autoregressive decoding. |
| format | Article |
| id | doaj-art-9b7c6bd2e5064f4cbc0f542e5681b461 |
| institution | DOAJ |
| issn | 2096-0654 2097-406X |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Tsinghua University Press |
| record_format | Article |
| series | Big Data Mining and Analytics |
| spelling | Record doaj-art-9b7c6bd2e5064f4cbc0f542e5681b461 (indexed 2025-08-20T03:13:36Z, eng). Tsinghua University Press, Big Data Mining and Analytics, ISSN 2096-0654 / 2097-406X, published 2025-06-01, vol. 8, no. 4, pp. 779-793, DOI 10.26599/BDMA.2024.9020074. ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads. Authors and affiliations: Ziqian Zeng (Shien Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China); Jiahong Yu (School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China); Qianshi Pang (School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China); Zihao Wang (Department of Computer Science and Engineering, School of Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China); Huiping Zhuang (Shien Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China); Fan Yu (Huawei Technologies Co. Ltd., Hangzhou 310000, China); Hongen Shao (School of Future Technology, South China University of Technology, Guangzhou 511442, China); Xiaofeng Zou (School of Future Technology, South China University of Technology, Guangzhou 511442, China). Abstract as given in the description field above. URL: https://www.sciopen.com/article/10.26599/BDMA.2024.9020074. Keywords: speculative decoding; efficient inference; large language models (LLMs) |
| title | ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads |
| topic | speculative decoding efficient inference large language models (llms) |
| url | https://www.sciopen.com/article/10.26599/BDMA.2024.9020074 |