A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information

Bibliographic Details
Main Authors: Etana Fikadu Dinsa, Mrinal Das, Teklu Urgessa Abebe
Format: Article
Language: English
Published: Nature Portfolio 2024-12-01
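Series: Scientific Reports
Subjects: Afaan Oromo; Topic modeling; Latent dirichlet allocation; Text analysis; Information retrieval; Classification
Online Access: https://doi.org/10.1038/s41598-024-83743-3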
collection DOAJ
description Abstract Afaan Oromo is a resource-scarce language with few tools developed for its processing, which poses significant challenges for natural language tasks. Tools designed for English do not work efficiently for Afaan Oromo because of linguistic differences and the lack of well-structured resources. To address this challenge, this work proposes a topic modeling framework for unstructured health-related documents in Afaan Oromo using the latent Dirichlet allocation (LDA) algorithm. None of the collected documents carries label information, which makes it difficult to categorize them or to apply supervised learning methods. We therefore use the LDA model, which discovers the latent topics of the documents without requiring predefined labels. The model takes a word dictionary and extracts hidden topics by evaluating word patterns and distributions across the dataset. It then extracts the most relevant topics for each document and generates a weight value for each word per topic. Next, we classify the topics using their representative keywords as input and assign class labels based on human evaluation of topic coherence. This model could be applied to classifying medical documents and to finding the specialists best suited to patients’ requests from the obtained information. In our experiments, topic modeling with LDA achieved 79.17% accuracy and a 79.66% F1 score on the test documents of the dataset.
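The pipeline summarized in the abstract (build a word dictionary, infer hidden topics, read off per-topic word weights and per-document topic weights) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a toy collapsed Gibbs sampler for LDA written with only the Python standard library, and the function names (`fit_lda`, `top_words`), the hyperparameters, and the English placeholder tokens are all assumptions made for illustration. The record does not specify the library, preprocessing, or settings used in the paper.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})    # the word dictionary
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # tokens of word v in topic k
    topic_total = [0] * num_topics                      # tokens in topic k
    assignments = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates: word weights per topic, topic weights per document.
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, phi, theta


def top_words(vocab, topic_weights, n=3):
    """Return the n highest-weighted words of one topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_weights[v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


# Hypothetical placeholder corpus (English stand-ins, not the paper's data).
docs = [
    ["heart", "blood", "pressure", "heart", "pulse"],
    ["skin", "rash", "itch", "skin", "cream"],
    ["blood", "pressure", "pulse", "heart"],
    ["rash", "cream", "itch", "skin"],
]

vocab, phi, theta = fit_lda(docs, num_topics=2, iters=100, seed=1)
for k, weights in enumerate(phi):
    print("topic", k, top_words(vocab, weights))
```

In a workflow like the one described, a human reviewer would inspect the top keywords of each topic and attach a class label to it; a document can then be assigned the label of its highest-weighted topic in `theta`, which is what makes accuracy and F1 measurable on a labeled test split.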
id doaj-art-f4f0f83dd2304f5c970976e9af0e858e
institution Kabale University
issn 2045-2322
spelling doaj-art-f4f0f83dd2304f5c970976e9af0e858e; 2025-01-05T12:26:53Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2024-12-01; volume 14, issue 1, pages 1-14; 10.1038/s41598-024-83743-3
Etana Fikadu Dinsa (Department of Computer Science and Engineering, Engineering and Technology, Wollega University)
Mrinal Das (Department of Data Science, Indian Institute of Technology Palakkad (IIT Palakkad))
Teklu Urgessa Abebe (Department of CSE, Adama Science and Technology University)
topic Afaan Oromo
Topic modeling
Latent dirichlet allocation
Text analysis
Information retrieval
Classification