A benchmark for evaluating crisis information generation capabilities in LLMs
Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study develops a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information. Method. CIEeval, an evaluation dataset, was constructed through steps including information extraction and prompt generation; it covers 26 types of crises across sub-domains such as water disasters and environmental pollution, comprising 4.8k data entries in total. Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation against multidimensional criteria, and a combination of manual and machine scoring was used to give a comprehensive view of each model's performance. Results. The manual and machine scores were significantly correlated. Under this scoring method, Claude 3.5 Sonnet performed best, particularly excelling in complex scenarios such as natural and accident disasters; although they scored slightly lower overall, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0 showed strong performance in specific crises. Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.
| Main Authors: | Ruilian Han, Lu An, Wei Zhou, Gang Li |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | University of Borås, 2025-03-01 |
| Series: | Information Research: An International Electronic Journal |
| Subjects: | LLMs; Crisis informatics; LLMs evaluation; Information generation; Evaluation benchmark |
| Online Access: | https://publicera.kb.se/ir/article/view/47518 |
| author | Ruilian Han; Lu An; Wei Zhou; Gang Li |
| collection | DOAJ |
| description |
Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information.
Method. CIEeval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEeval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries.
Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized. This approach ensured a comprehensive understanding of each model's performance.
Results. The manual and machine scores showed significant correlation. Under this scoring method, Claude 3.5 Sonnet performed best, particularly excelling in complex scenarios such as natural and accident disasters. In contrast, while scoring slightly lower overall, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0 showed strong performance in specific crises.
Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.
|
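The agreement between manual and machine scores reported in the Results can be illustrated with a rank-correlation check. The sketch below is hypothetical: the per-model score values are invented for demonstration, and the choice of Spearman's rank correlation is an assumption, since the abstract does not name the statistic used.

```python
# Illustration of the manual-vs-machine scoring agreement described in
# the Results. All score values below are invented; Spearman's rank
# correlation is an assumed choice of statistic.

def ranks(xs):
    """Average 1-based ranks of the values in xs (ties share a rank)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average rank for the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Eight models, each with a hypothetical manual and machine score.
manual = [4.6, 4.1, 4.0, 3.8, 3.9, 3.5, 3.7, 3.2]
machine = [4.5, 4.2, 3.9, 3.9, 3.7, 3.6, 3.5, 3.3]
print(round(spearman(manual, machine), 3))  # high agreement (~0.93)
```

A rank-based statistic is a natural fit here because manual and machine scoring scales need not be calibrated to each other; only the relative ordering of the models matters.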
| format | Article |
| id | doaj-art-4be574b735914ba0b7cec3f787a3a7ed |
| institution | DOAJ |
| issn | 1368-1613 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | University of Borås |
| record_format | Article |
| series | Information Research: An International Electronic Journal |
| doi | 10.47989/ir30iConf47518 |
| volume_issue | 30, iConf |
| affiliations | Ruilian Han, Lu An, Gang Li: Center for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, China. Wei Zhou: School of Information Management, Wuhan University, China |
| title | A benchmark for evaluating crisis information generation capabilities in LLMs |
| topic | LLMs; Crisis informatics; LLMs evaluation; Information generation; Evaluation benchmark |
| url | https://publicera.kb.se/ir/article/view/47518 |