InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis

Background: Infectious diseases have consistently been a significant concern in public health, requiring proactive measures to safeguard societal well-being. In this regard, regular monitoring activities play a crucial role in mitigating the adverse effects of diseases on socie...

Bibliographic Details
Main Authors: Yesim Selcuk, Eunhui Kim, Insung Ahn
Format: Article
Language: English
Published: JMIR Publications 2025-02-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2025/1/e63881
author Yesim Selcuk
Eunhui Kim
Insung Ahn
collection DOAJ
description Background: Infectious diseases have consistently been a significant concern in public health, requiring proactive measures to safeguard societal well-being. In this regard, regular monitoring activities play a crucial role in mitigating the adverse effects of diseases on society. To monitor disease trends, various organizations, such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC), collect diverse surveillance data and make them publicly accessible. However, these platforms primarily present surveillance data in English, which creates language barriers for non–English-speaking individuals and global public health efforts to accurately observe disease trends. This challenge is particularly noticeable in regions such as the Middle East, where specific infectious diseases, such as Middle East respiratory syndrome coronavirus (MERS-CoV), have seen a dramatic increase. For such regions, it is essential to develop tools that can overcome language barriers and reach more individuals to alleviate the negative impacts of these diseases. Objective: This study aims to address these issues; therefore, we propose InfectA-Chat, a cutting-edge large language model (LLM) specifically designed for the Arabic language but also incorporating English for question and answer (Q&A) tasks. InfectA-Chat leverages its deep understanding of the language to provide users with information on the latest trends in infectious diseases based on their queries. Methods: This comprehensive study was achieved by instruction tuning the AceGPT-7B and AceGPT-7B-Chat models on a Q&A task, using a dataset of 55,400 Arabic and English domain-specific instruction-following data. The performance of these fine-tuned models was evaluated using 2770 domain-specific Arabic and English instruction-following data, using the GPT-4 evaluation method. A comparative analysis was then performed against Arabic LLMs and state-of-the-art models, including AceGPT-13B-Chat, Jais-13B-Chat, Gemini, GPT-3.5, and GPT-4. Furthermore, to ensure the model had access to the latest information on infectious diseases by regularly updating the data without additional fine-tuning, we used the retrieval-augmented generation (RAG) method. Results: InfectA-Chat demonstrated good performance in answering questions about infectious diseases by the GPT-4 evaluation method. Our comparative analysis revealed that it outperforms the AceGPT-7B-Chat and InfectA-Chat (based on AceGPT-7B) models by a margin of 43.52%. It also surpassed other Arabic LLMs such as AceGPT-13B-Chat and Jais-13B-Chat by 48.61%. Among the state-of-the-art models, InfectA-Chat achieved a leading performance of 23.78%, competing closely with the GPT-4 model. Furthermore, the RAG method in InfectA-Chat significantly improved document retrieval accuracy. Notably, RAG retrieved more accurate documents based on queries when the top-k parameter value was increased. Conclusions: Our findings highlight the shortcomings of general Arabic LLMs in providing up-to-date information about infectious diseases. With this study, we aim to empower individuals and public health efforts by offering a bilingual Q&A system for infectious disease monitoring.
format Article
id doaj-art-6c49440104eb47c59e2893722b98e43a
institution Kabale University
issn 2291-9694
language English
publishDate 2025-02-01
publisher JMIR Publications
record_format Article
series JMIR Medical Informatics
spelling doaj-art-6c49440104eb47c59e2893722b98e43a 2025-02-10T19:31:36Z eng JMIR Publications JMIR Medical Informatics 2291-9694 2025-02-01 13 e63881 10.2196/63881 InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis Yesim Selcuk https://orcid.org/0009-0006-9689-1398 Eunhui Kim https://orcid.org/0000-0001-7775-3172 Insung Ahn https://orcid.org/0000-0003-1171-1206 https://medinform.jmir.org/2025/1/e63881
title InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis
url https://medinform.jmir.org/2025/1/e63881