BanglaHealth: A Bengali paraphrase dataset on health domain

In the landscape of natural language processing (NLP) research, the availability of comprehensive datasets plays a pivotal role in advancing various tasks, including paraphrasing. However, for languages such as Bengali, the availability of such datasets remains limited, particularly in specialized d...

Bibliographic Details
Main Authors: Faisal Ibn Aziz, Muhammad Nazrul Islam
Format: Article
Language: English
Published: Elsevier 2025-08-01
Series: Data in Brief
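Subjects: Natural language processing (NLP); Paraphrasing; Bengali paraphrasing; Bengali language; Health domain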
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340925004299
_version_ 1849343265911865344
author Faisal Ibn Aziz
Muhammad Nazrul Islam
author_facet Faisal Ibn Aziz
Muhammad Nazrul Islam
author_sort Faisal Ibn Aziz
collection DOAJ
description Comprehensive datasets are central to progress in natural language processing (NLP) tasks, including paraphrasing. For Bengali, however, such datasets remain scarce, particularly in specialized domains like health. This study addresses that gap by presenting a Bengali paraphrasing dataset built specifically for the health domain. Sentences were sourced from health-related content in Bengali newspapers. Because no comparable Bengali resource existed, a dedicated methodology was developed: a script for data extraction, pre-processing, translation of the Bengali sentences into English, paraphrasing of the English sentences, and translation back into Bengali to produce paraphrased versions of the original sentences. A user study with 100 participants evaluated 500 sample Bengali paraphrased sentences to identify the most suitable library (VamsiT5 Paws) for creating the dataset. In total, 200,000 sentences were extracted and paraphrased. The dataset can support paraphrasing research and the development of language models, and ultimately contribute to the broader goal of enhancing NLP capabilities in Bengali, particularly in specialized domains like health.
format Article
id doaj-art-db4648cb74d04c8ba465394bf97cd8e5
institution Kabale University
issn 2352-3409
language English
publishDate 2025-08-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-db4648cb74d04c8ba465394bf97cd8e5
2025-08-20T03:43:02Z
eng
Elsevier
Data in Brief
2352-3409
2025-08-01
61
111699
10.1016/j.dib.2025.111699
BanglaHealth: A Bengali paraphrase dataset on health domain
Faisal Ibn Aziz (0)
Muhammad Nazrul Islam (1)
Department of Computer Science and Engineering, Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh
Corresponding author.; Department of Computer Science and Engineering, Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh
In the landscape of natural language processing (NLP) research, the availability of comprehensive datasets plays a pivotal role in advancing various tasks, including paraphrasing. However, for languages such as Bengali, the availability of such datasets remains limited, particularly in specialized domains like health. Recognizing this gap, this study endeavours to address the scarcity of resources by presenting a novel Bengali paraphrasing dataset specifically tailored to the health domain. The dataset construction process involved sourcing sentences from Bengali newspapers, focusing on health-related content. Due to the dearth of existing datasets in Bengali and the specialized nature of paraphrasing, particularly in the health domain, this endeavour necessitated the development of a unique methodology. This methodology included the development of a script for data extraction, pre-processing, translation of Bengali sentences into English, paraphrasing of the English sentences, and subsequent translation back into Bengali to generate paraphrased versions of the original sentences. A user study involving 100 participants was conducted to evaluate 500 sample Bengali paraphrased sentences to identify the most suitable library (VamsiT5 Paws) for creating the paraphrasing dataset. As such, a total of 200,000 sentences were extracted and paraphrased in this process. This dataset holds significant potential to foster advancements in paraphrasing research that facilitate the development of language models; and ultimately contribute to the broader goal of enhancing NLP capabilities in Bengali, particularly in specialized domains like health.
http://www.sciencedirect.com/science/article/pii/S2352340925004299
Natural language processing (NLP)
Paraphrasing
Bengali paraphrasing
Bengali language
Health domain
spellingShingle Faisal Ibn Aziz
Muhammad Nazrul Islam
BanglaHealth: A Bengali paraphrase dataset on health domain
Data in Brief
Natural language processing (NLP)
Paraphrasing
Bengali paraphrasing
Bengali language
Health domain
title BanglaHealth: A Bengali paraphrase dataset on health domain
title_full BanglaHealth: A Bengali paraphrase dataset on health domain
title_fullStr BanglaHealth: A Bengali paraphrase dataset on health domain
title_full_unstemmed BanglaHealth: A Bengali paraphrase dataset on health domain
title_short BanglaHealth: A Bengali paraphrase dataset on health domain
title_sort banglahealth a bengali paraphrase dataset on health domain
topic Natural language processing (NLP)
Paraphrasing
Bengali paraphrasing
Bengali language
Health domain
url http://www.sciencedirect.com/science/article/pii/S2352340925004299
work_keys_str_mv AT faisalibnaziz banglahealthabengaliparaphrasedatasetonhealthdomainhuggingface
AT muhammadnazrulislam banglahealthabengaliparaphrasedatasetonhealthdomainhuggingface
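
The description field above outlines a round-trip pipeline: pre-process each Bengali sentence, translate it into English, paraphrase the English text, and translate the result back into Bengali. The following is a minimal sketch of that flow, not the authors' script. The names `ParaphrasePair`, `preprocess`, and `build_pairs` are hypothetical, the three model steps are passed in as plain callables so that no particular translation or paraphrasing library (including the VamsiT5 Paws model named in the record) is assumed, and the duplicate and identical-output filters are assumptions added for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ParaphrasePair:
    """One original sentence and its round-trip paraphrase (hypothetical type)."""
    source: str
    paraphrase: str


def preprocess(sentence: str) -> str:
    """Collapse runs of whitespace and trim both ends (an assumed, minimal cleanup)."""
    return re.sub(r"\s+", " ", sentence).strip()


def build_pairs(
    sentences: Iterable[str],
    bn_to_en: Callable[[str], str],       # Bengali -> English translator (supplied by caller)
    paraphrase_en: Callable[[str], str],  # English paraphraser (supplied by caller)
    en_to_bn: Callable[[str], str],       # English -> Bengali translator (supplied by caller)
) -> List[ParaphrasePair]:
    """Run each sentence through translate -> paraphrase -> back-translate."""
    pairs: List[ParaphrasePair] = []
    seen = set()
    for raw in sentences:
        source = preprocess(raw)
        # Assumption: skip empty lines and exact duplicates of earlier sentences.
        if not source or source in seen:
            continue
        seen.add(source)
        english = bn_to_en(source)
        reworded = paraphrase_en(english)
        candidate = preprocess(en_to_bn(reworded))
        # Assumption: drop outputs that are empty or identical to the input.
        if candidate and candidate != source:
            pairs.append(ParaphrasePair(source, candidate))
    return pairs
```

Passing the model steps as callables keeps the sketch testable with stand-in functions; in practice each callable would wrap whichever translation and paraphrasing models are actually used.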