BALI—A Benchmark for Accelerated Language Model Inference


Bibliographic Details
Main Authors: Lena Jurkschat, Preetam Gattogi, Sahar Vahdati, Jens Lehmann
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: LLM inference; transformer decoder; LLM inference benchmarking; generation speed; performance analysis; inference standardization
Online Access:https://ieeexplore.ieee.org/document/11026002/
author Lena Jurkschat
Preetam Gattogi
Sahar Vahdati
Jens Lehmann
collection DOAJ
description The rise of Large Language Models (LLMs) has revolutionized natural language processing, enabling advances across diverse applications, including chatbots, live translators, content generation, virtual assistants, and domain-specific automation tools. These applications rely on real-time or near-real-time responses to process sequential LLM requests, creating a critical demand for efficient and accelerated inference. These developments have led to numerous frameworks optimizing inference speed and resource utilization. However, such frameworks are often mutually incomparable or inadequately described due to the lack of standardized benchmarks. Consequently, there is a notable lack of comparison frameworks, owing to the vast configuration space spanned by factors such as hardware specifications, inference framework parameters, and dataset variations. We propose BALI, an open-source Benchmark for Accelerated Language Model Inference, which aims to provide comprehensive analysis and standardized evaluation metrics to enhance the comparability of LLM performance across configurations. With BALI, we propose substantive measurements to evaluate and rank the efficiency of LLM frameworks across multiple aspects, including sequential decoding, parallelization, and setup efficiency. We show results mainly for small to medium-sized models (1-30B parameters) in a sequential, non-batched setup, which is highly relevant for various real-time LLM applications. These observations reveal that the design decisions for such a framework constitute an application-dependent and multidimensional challenge. Our objective is thus to provide an LLM inference benchmark with a clearly defined evaluation, incorporating multidimensional criteria to yield comparable performance assessments.
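The sequential-decoding efficiency the abstract refers to is typically summarized by time-to-first-token and decode throughput. The loop below is an illustrative sketch of such a measurement, not BALI's actual implementation; `generate_token` is a hypothetical stand-in for any framework's single-step decode call.

```python
import time

def benchmark_sequential_decoding(generate_token, prompt, max_new_tokens=32):
    """Time a sequential (non-batched) decoding loop.

    `generate_token` is a stand-in for a framework's single-step decode
    call: it takes the token sequence so far and returns the next token.
    Returns time-to-first-token (seconds) and decode throughput (tokens/s).
    """
    tokens = list(prompt)
    start = time.perf_counter()
    tokens.append(generate_token(tokens))      # first decoded token
    ttft = time.perf_counter() - start         # time-to-first-token
    for _ in range(max_new_tokens - 1):        # remaining decode steps
        tokens.append(generate_token(tokens))
    total = time.perf_counter() - start
    throughput = max_new_tokens / total        # tokens per second
    return ttft, throughput

# Toy stand-in model that just echoes the last token, to exercise the loop.
ttft, tps = benchmark_sequential_decoding(lambda ts: ts[-1], [1, 2, 3])
```

In a real benchmark the timing would be repeated over many prompts and warm-up runs to control for caching and setup effects, which is part of what makes cross-framework comparison configuration-dependent.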
format Article
id doaj-art-1cae2238947b431cafffd9ed72168330
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-1cae2238947b431cafffd9ed72168330
IEEE Access, vol. 13, pp. 98976-98989, published 2025-01-01 by IEEE. DOI: 10.1109/ACCESS.2025.3576898. IEEE Xplore document 11026002.
Authors: Lena Jurkschat (https://orcid.org/0009-0002-7332-5861), Preetam Gattogi (https://orcid.org/0009-0001-4604-2009), Sahar Vahdati, Jens Lehmann. Affiliation (all authors): Center for Interdisciplinary Digital Sciences (CIDS), ScaDS.AI Dresden/Leipzig, Technische Universität Dresden, Dresden, Germany.
title BALI—A Benchmark for Accelerated Language Model Inference
topic LLM inference
transformer decoder
LLM inference benchmarking
generation speed
performance analysis
inference standardization
url https://ieeexplore.ieee.org/document/11026002/