Annotated data for semantic role labeling of crisis events in Indonesian Tweets
Mendeley Data
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Data in Brief |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340925004184 |
| Summary: | Social media platforms like Twitter provide essential real-time information about crisis events. Although the text data generated is rich, its vast volume and unstructured format make manual analysis challenging. Information extraction technologies such as Semantic Role Labeling (SRL) are needed to identify a sentence's semantic roles, such as who is the victim, what happened, when and where the event occurred, and what objects are affected in the crisis text to speed up and facilitate the emergency response process. However, public SRL datasets are very limited, especially for Indonesian, which is still considered a low-resource language. We aim to develop an Indonesian-language SRL dataset based on Twitter text focusing on crisis events. This dataset includes entity labels for Named Entity Recognition (NER), another information extraction technique besides SRL. Text data was obtained through a crawling process on Twitter using specific keywords from 2018 to 2023, then preprocessed to obtain clean and relevant data for crisis events in Indonesia. The cleaned text data was then manually annotated by two experts based on guidelines designed to maintain consistency, resulting in 99,206 tokens labeled with SRL and NER. The high inter-annotator agreement value (Cohen's Kappa >0.90) indicates reliable data quality. This dataset is designed to support the development of automated models for information extraction, such as SRL and NER. The results of this extraction will be used for disaster impact analysis, mapping affected areas, and planning for crisis mitigation. By providing this dataset, the research opens up new opportunities for developing Natural Language Processing (NLP) in Indonesian, especially in crisis event analysis. |
|---|---|
| ISSN: | 2352-3409 |