Benchmarking Open-Source Large Language Models for Sentiment and Emotion Classification in Indonesian Tweets

Bibliographic Details
Main Authors:	Arbi Haza Nasution, Aytug Onan, Yohei Murakami, Winda Monika, Anggi Hanafiah
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Annotation quality; emotion classification; sentiment analysis; Indonesian language processing; language models; low-resource languages
Online Access:	https://ieeexplore.ieee.org/document/11016677/

Description
Summary:	We benchmark 22 open-source large language models (LLMs) against ChatGPT-4 and human annotators on two NLP tasks—sentiment analysis and emotion classification—for Indonesian tweets. This study contributes to NLP in a relatively low-resource language (Bahasa Indonesia) by evaluating zero-shot classification performance on a labeled tweet corpus. The dataset includes sentiment labels (Positive, Negative, Neutral) and emotion labels (Love, Happiness, Sadness, Anger, Fear). We compare model predictions to human annotations and report precision, recall, and F1-score, along with inference time analysis. ChatGPT-4 achieves the highest macro F1-score (0.84) on both tasks, slightly outperforming human annotators. The best-performing open-source models—such as LLaMA3.1_70B and Gemma2_27B—achieve over 90% of ChatGPT-4’s performance, while smaller models lag behind. Notably, some mid-sized models (e.g., Phi-4 at 14B parameters) perform comparably to much larger models on select categories. However, certain classes—particularly Neutral sentiment and Fear emotion—remain challenging, with lower agreement even among human annotators. Inference time varies significantly: optimized models complete predictions in under an hour, while some large models require several days. Our findings show that state-of-the-art open models can approach closed-source LLMs like ChatGPT-4 on Indonesian classification tasks, though efficiency and consistency in edge cases remain open challenges. Future work should explore fine-tuning multilingual LLMs on Indonesian data and practical deployment strategies in real-world applications.
ISSN:	2169-3536
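
The summary reports per-class precision, recall, and F1-score together with a macro F1-score computed against human annotations. The following is a minimal sketch of how such scores could be computed, not the authors' evaluation code. The sentiment label set is taken from the summary; the function names (`per_class_scores`, `macro_f1`) and the toy gold/predicted lists are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Label set as listed in the summary; everything else here is illustrative.
SENTIMENT_LABELS = ["Positive", "Negative", "Neutral"]


def per_class_scores(gold, predicted, labels):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[g] += 1   # true class g was missed

    scores = {}
    for label in labels:
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_scores(gold, predicted, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Toy data (invented, NOT from the paper's corpus).
toy_gold = ["Positive", "Negative", "Neutral", "Positive", "Neutral", "Negative"]
toy_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Neutral"]

if __name__ == "__main__":
    for label, (p, r, f) in per_class_scores(toy_gold, toy_pred, SENTIMENT_LABELS).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    print(f"macro F1 = {macro_f1(toy_gold, toy_pred, SENTIMENT_LABELS):.2f}")
```

Because macro averaging weights every class equally, a class that is hard to predict (the summary names Neutral sentiment and Fear emotion) lowers the overall score as much as a frequent class would.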
