A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information
Abstract Afaan Oromo is a resource-scarce language with limited tools developed for its processing, posing significant challenges for natural language tasks. The tools designed for English do not work efficiently for Afaan Oromo due to the linguistic differences and lack of well-structured resources...
Main Authors: | Etana Fikadu Dinsa, Mrinal Das, Teklu Urgessa Abebe |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2024-12-01 |
Series: | Scientific Reports |
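Subjects: | Afaan Oromo; Topic modeling; Latent Dirichlet allocation; Text analysis; Information retrieval; Classification |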
Online Access: | https://doi.org/10.1038/s41598-024-83743-3 |
_version_ | 1841559525773017088 |
---|---|
author | Etana Fikadu Dinsa; Mrinal Das; Teklu Urgessa Abebe |
author_facet | Etana Fikadu Dinsa; Mrinal Das; Teklu Urgessa Abebe |
author_sort | Etana Fikadu Dinsa |
collection | DOAJ |
description | Abstract: Afaan Oromo is a resource-scarce language with few tools developed for processing it, which poses significant challenges for natural language tasks. Tools designed for English do not work efficiently for Afaan Oromo because of linguistic differences and the lack of well-structured resources. To address this challenge, this work proposes a topic modeling framework for unstructured health-related documents in Afaan Oromo using the latent Dirichlet allocation (LDA) algorithm. None of the collected documents carries label information, which makes it difficult to categorize them or to apply supervised learning methods. We therefore use LDA, which discovers the latent topics of the documents without requiring predefined labels. The model takes a word dictionary and extracts hidden topics by evaluating word patterns and distributions across the dataset. It then extracts the most relevant topics for each document and generates a weight for each word per topic. Next, we classify the topics using their representative keywords as input and assign class labels based on human evaluation of topic coherence. The model could be applied to classifying medical documents and, using the information obtained, to finding the specialists best suited to patients’ requests. In our experiments, topic modeling with LDA achieved a promising 79.17% accuracy and 79.66% F1 score on the test documents of the dataset. |
format | Article |
id | doaj-art-f4f0f83dd2304f5c970976e9af0e858e |
institution | Kabale University |
issn | 2045-2322 |
language | English |
publishDate | 2024-12-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | doaj-art-f4f0f83dd2304f5c970976e9af0e858e; 2025-01-05T12:26:53Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2024-12-01; 141114; 10.1038/s41598-024-83743-3; A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information; Etana Fikadu Dinsa (0), Mrinal Das (1), Teklu Urgessa Abebe (2); Department of Computer Science and Engineering, Engineering and Technology, Wollega University; Department of Data Science, Indian Institute of Technology Palakkad (IIT Palakkad); Department of CSE, Adama Science and Technology University; Abstract: Afaan Oromo is a resource-scarce language with few tools developed for processing it, which poses significant challenges for natural language tasks. Tools designed for English do not work efficiently for Afaan Oromo because of linguistic differences and the lack of well-structured resources. To address this challenge, this work proposes a topic modeling framework for unstructured health-related documents in Afaan Oromo using the latent Dirichlet allocation (LDA) algorithm. None of the collected documents carries label information, which makes it difficult to categorize them or to apply supervised learning methods. We therefore use LDA, which discovers the latent topics of the documents without requiring predefined labels. The model takes a word dictionary and extracts hidden topics by evaluating word patterns and distributions across the dataset. It then extracts the most relevant topics for each document and generates a weight for each word per topic. Next, we classify the topics using their representative keywords as input and assign class labels based on human evaluation of topic coherence. The model could be applied to classifying medical documents and, using the information obtained, to finding the specialists best suited to patients’ requests. In our experiments, topic modeling with LDA achieved a promising 79.17% accuracy and 79.66% F1 score on the test documents of the dataset.; https://doi.org/10.1038/s41598-024-83743-3; Afaan Oromo; Topic modeling; Latent Dirichlet allocation; Text analysis; Information retrieval; Classification |
spellingShingle | Etana Fikadu Dinsa Mrinal Das Teklu Urgessa Abebe A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information Scientific Reports Afaan Oromo Topic modeling Latent dirichlet allocation Text analysis Information retrieval Classification |
title | A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information |
title_full | A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information |
title_fullStr | A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information |
title_full_unstemmed | A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information |
title_short | A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information |
title_sort | topic modeling approach for analyzing and categorizing electronic healthcare documents in afaan oromo without label information |
topic | Afaan Oromo; Topic modeling; Latent Dirichlet allocation; Text analysis; Information retrieval; Classification |
url | https://doi.org/10.1038/s41598-024-83743-3 |
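
The pipeline summarized in the abstract (build a word dictionary, discover latent topics without labels, produce a weight for each word per topic, then inspect the top keywords to name each topic) can be illustrated with a small sketch. This is not the authors' implementation: the record does not say which inference method or library they used, so the sketch assumes collapsed Gibbs sampling, written in plain Python. The function names (`build_dictionary`, `fit_lda`, `top_words`), the hyperparameters `alpha` and `beta`, and the toy English tokens standing in for Afaan Oromo health text are all hypothetical.

```python
import random

def build_dictionary(docs):
    """Map each distinct token to an integer id (the 'word dictionary')."""
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def fit_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method)."""
    rng = random.Random(seed)
    word2id = build_dictionary(docs)
    V = len(word2id)
    corpus = [[word2id[w] for w in doc] for doc in docs]

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in corpus]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(corpus):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed document-topic (theta) and topic-word (phi) distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
             for counts, doc in zip(doc_topic, corpus)]
    phi = [[(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
           for k in range(num_topics)]
    return word2id, theta, phi

def top_words(word2id, phi_row, n=3):
    """Return the n highest-weighted (word, weight) pairs for one topic."""
    id2word = {i: w for w, i in word2id.items()}
    ranked = sorted(range(len(phi_row)), key=lambda i: phi_row[i], reverse=True)
    return [(id2word[i], round(phi_row[i], 3)) for i in ranked[:n]]

# Toy, unlabeled corpus (placeholder English tokens, not real data).
docs = [
    ["fever", "cough", "flu", "fever"],
    ["cough", "flu", "fever", "cough"],
    ["tooth", "gum", "dentist", "tooth"],
    ["gum", "dentist", "tooth", "gum"],
]
word2id, theta, phi = fit_lda(docs, num_topics=2)

# Dominant topic per document; a human would name topics from their keywords.
dominant = [max(range(len(row)), key=row.__getitem__) for row in theta]
for k, row in enumerate(phi):
    print("topic", k, top_words(word2id, row))
print("dominant topic per document:", dominant)
```

The topic indices that come out are arbitrary, which is why the abstract's labeling step relies on people reading each topic's keywords. Scores such as the reported accuracy and F1 would then require a held-out set with reference labels, which this sketch does not include.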