Semantic classification of Indonesian consumer health questions

Bibliographic Details
Main Authors:	Raniah Nur Hanami, Rahmad Mahendra, Alfan Farizki Wicaksono
Format:	Article
Language:	English
Published:	BMC 2025-07-01
Series:	Journal of Biomedical Semantics
Subjects:	Text mining; Consumer health question-answering system; Semantic annotation scheme; Semantic type classification; Consumer health questions
Online Access:	https://doi.org/10.1186/s13326-025-00334-5

collection	DOAJ
description	Abstract
Purpose: Online consumer health forums serve as a way for the public to connect with medical professionals. While these medical forums offer a valuable service, online Question Answering (QA) forums can struggle to deliver timely answers due to the limited number of available healthcare professionals. One way to solve this problem is by developing an automatic QA system that can provide patients with quicker answers. One key component of such a system could be a module for classifying the semantic type of a question. This would allow the system to understand the patient’s intent and route them towards the relevant information.
Methods: This paper proposes a novel two-step approach to address the challenge of semantic type classification in Indonesian consumer health questions. We acknowledge the scarcity of Indonesian health domain data, a hurdle for machine learning models. To address this gap, we first introduce a novel corpus of annotated Indonesian consumer health questions. Second, we utilize this newly created corpus to build and evaluate a data-driven predictive model for classifying question semantic types. To enhance the trustworthiness and interpretability of the model’s predictions, we employ an explainable model framework, LIME. This framework facilitates a deeper understanding of the role played by word-based features in the model’s decision-making process. Additionally, it empowers us to conduct a comprehensive bias analysis, allowing for the detection of “semantic bias”, where words with no inherent association with a specific semantic type disproportionately influence the model’s predictions.
Results: The annotation process revealed moderate agreement between expert annotators. In addition, not all words with high LIME probability could be considered true characteristics of a question type. This suggests a potential bias in the data used and the machine learning models themselves. Notably, XGBoost, Naïve Bayes, and MLP models exhibited a tendency to predict questions containing the words “kanker” (cancer) and “depresi” (depression) as belonging to the DIAGNOSIS category. In terms of prediction performance, Perceptron and XGBoost emerged as the top-performing models, achieving the highest weighted average F1 scores across all input scenarios and weighting factors. Naïve Bayes performed best after balancing the data with Borderline SMOTE, indicating its promise for handling imbalanced datasets.
Conclusion: We constructed a corpus of query semantics in the domain of Indonesian consumer health, containing 964 questions annotated with their corresponding semantic types. This corpus served as the foundation for building a predictive model. We further investigated the impact of disease-biased words on model performance. These words exhibited high LIME scores, yet lacked association with a specific semantic type. We trained models using datasets with and without these biased words and found no significant difference in model performance between the two scenarios, suggesting that the models might possess an ability to mitigate the influence of such bias during the learning process.
id	doaj-art-d5d6802b39fe4e94b9f9ecff00ab5ef1
institution	Kabale University
issn	2041-1480
spelling	Raniah Nur Hanami, Rahmad Mahendra, Alfan Farizki Wicaksono (all: Faculty of Computer Science, Universitas Indonesia). Semantic classification of Indonesian consumer health questions. Journal of Biomedical Semantics (BMC, ISSN 2041-1480), vol. 16, no. 1, pp. 1-17, 2025-07-01. https://doi.org/10.1186/s13326-025-00334-5

Similar Items