Annotated data for semantic role labeling of crisis events in Indonesian Tweets
Mendeley Data

Bibliographic Details
Main Authors:	Amelia Devi Putri Ariyanto, Diana Purwitasari, Bilqis Amaliah, Chastine Fatichah, Muhammad Ghifari Taqiuddin, Haikal
Format:	Article
Language:	English
Published:	Elsevier 2025-08-01
Series:	Data in Brief
Subjects:	Semantic role labeling, Named entity recognition, Crisis event, Twitter data, Low-resource languages, Indonesian Tweets
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340925004184

Description
Summary:	Social media platforms like Twitter provide essential real-time information about crisis events. Although the generated text data is rich, its vast volume and unstructured format make manual analysis challenging. Information extraction technologies such as Semantic Role Labeling (SRL) are needed to identify the semantic roles in a sentence, such as who the victim is, what happened, when and where the event occurred, and which objects were affected, in order to speed up and facilitate the emergency response process. However, public SRL datasets remain very limited, especially for Indonesian, which is still considered a low-resource language. We aim to develop an Indonesian-language SRL dataset based on Twitter text focusing on crisis events. The dataset also includes entity labels for Named Entity Recognition (NER), another information extraction technique. Text data was obtained by crawling Twitter with specific keywords, covering 2018–2023, and then preprocessed to obtain clean data relevant to crisis events in Indonesia. Two experts then manually annotated the cleaned text following guidelines designed to maintain consistency, resulting in 99,206 tokens labeled with SRL and NER. The high inter-annotator agreement (Cohen's Kappa >0.90) indicates reliable data quality. This dataset is designed to support the development of automated information extraction models, such as SRL and NER. The results of this extraction will be used for disaster impact analysis, mapping affected areas, and planning crisis mitigation. By providing this dataset, the research opens up new opportunities for developing Natural Language Processing (NLP) in Indonesian, especially in crisis event analysis.
ISSN:	2352-3409

Similar Items