A benchmark for evaluating crisis information generation capabilities in LLMs

Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information. Method. CIEeval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEeval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries. Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized, ensuring a comprehensive understanding of each model's performance. Results. The manual and machine scores showed significant correlation. Under this scoring method, Claude 3.5 Sonnet performed best, excelling particularly in complex scenarios such as natural and accident disasters. Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0, while scoring slightly lower overall, showed strong performance in specific crises. Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.


Bibliographic Details
Main Authors: Ruilian Han, Lu An, Wei Zhou, Gang Li
Format: Article
Language:English
Published: University of Borås 2025-03-01
Series:Information Research: An International Electronic Journal
Subjects:
Online Access:https://publicera.kb.se/ir/article/view/47518
_version_ 1850033255882948608
author Ruilian Han
Lu An
Wei Zhou
Gang Li
author_facet Ruilian Han
Lu An
Wei Zhou
Gang Li
author_sort Ruilian Han
collection DOAJ
description Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information. Method. CIEeval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEeval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries. Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized, ensuring a comprehensive understanding of each model's performance. Results. The manual and machine scores showed significant correlation. Under this scoring method, Claude 3.5 Sonnet performed best, excelling particularly in complex scenarios such as natural and accident disasters. Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0, while scoring slightly lower overall, showed strong performance in specific crises. Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.
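The reported agreement between manual and machine scoring can be checked with a rank correlation. Below is a minimal, standard-library Python sketch of Spearman's rank correlation applied to hypothetical per-model scores; the numbers are illustrative only, not the paper's data, and the study's actual correlation test may differ:

```python
from statistics import mean

def rank(values):
    """Assign 1-based average ranks to values (ties get the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for eight models (illustrative, not the paper's data):
manual  = [4.5, 3.8, 4.1, 3.2, 4.3, 3.9, 3.5, 4.0]
machine = [4.4, 3.6, 4.2, 3.0, 4.5, 3.8, 3.4, 3.9]
print(round(spearman(manual, machine), 3))  # → 0.976
```

A coefficient near 1 would support using the machine scores as a proxy for manual annotation across the eight evaluated models.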
format Article
id doaj-art-4be574b735914ba0b7cec3f787a3a7ed
institution DOAJ
issn 1368-1613
language English
publishDate 2025-03-01
publisher University of Borås
record_format Article
series Information Research: An International Electronic Journal
spelling doaj-art-4be574b735914ba0b7cec3f787a3a7ed2025-08-20T02:58:18ZengUniversity of BoråsInformation Research: An International Electronic Journal1368-16132025-03-0130iConf10.47989/ir30iConf47518A benchmark for evaluating crisis information generation capabilities in LLMsRuilian Han0Lu An1Wei Zhou2Gang Li3Center for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, ChinaCenter for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, ChinaSchool of Information Management, Wuhan University, ChinaCenter for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, China Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information. Method. CIEeval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEeval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries. Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized, ensuring a comprehensive understanding of each model's performance. Results. The manual and machine scores showed significant correlation. Under this scoring method, Claude 3.5 Sonnet performed best, excelling particularly in complex scenarios such as natural and accident disasters. Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0, while scoring slightly lower overall, showed strong performance in specific crises. Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management. https://publicera.kb.se/ir/article/view/47518LLMsCrisis informaticsLLMs evaluationInformation generationEvaluation benchmark
spellingShingle Ruilian Han
Lu An
Wei Zhou
Gang Li
A benchmark for evaluating crisis information generation capabilities in LLMs
Information Research: An International Electronic Journal
LLMs
Crisis informatics
LLMs evaluation
Information generation
Evaluation benchmark
title A benchmark for evaluating crisis information generation capabilities in LLMs
title_full A benchmark for evaluating crisis information generation capabilities in LLMs
title_fullStr A benchmark for evaluating crisis information generation capabilities in LLMs
title_full_unstemmed A benchmark for evaluating crisis information generation capabilities in LLMs
title_short A benchmark for evaluating crisis information generation capabilities in LLMs
title_sort benchmark for evaluating crisis information generation capabilities in llms
topic LLMs
Crisis informatics
LLMs evaluation
Information generation
Evaluation benchmark
url https://publicera.kb.se/ir/article/view/47518
work_keys_str_mv AT ruilianhan abenchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT luan abenchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT weizhou abenchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT gangli abenchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT ruilianhan benchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT luan benchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT weizhou benchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms
AT gangli benchmarkforevaluatingcrisisinformationgenerationcapabilitiesinllms