BanglaHealth: A Bengali paraphrase dataset on health domain

In the landscape of natural language processing (NLP) research, the availability of comprehensive datasets plays a pivotal role in advancing various tasks, including paraphrasing. However, for languages such as Bengali, the availability of such datasets remains limited, particularly in specialized d...

Bibliographic Details
Main Authors:	Faisal Ibn Aziz, Muhammad Nazrul Islam
Format:	Article
Language:	English
Published:	Elsevier 2025-08-01
Series:	Data in Brief
Subjects:	Natural language processing (NLP), Paraphrasing, Bengali paraphrasing, Bengali language, Health domain
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340925004299

author	Faisal Ibn Aziz, Muhammad Nazrul Islam
author_facet	Faisal Ibn Aziz, Muhammad Nazrul Islam
author_sort	Faisal Ibn Aziz
collection	DOAJ
description	In the landscape of natural language processing (NLP) research, the availability of comprehensive datasets plays a pivotal role in advancing various tasks, including paraphrasing. However, for languages such as Bengali, the availability of such datasets remains limited, particularly in specialized domains like health. Recognizing this gap, this study endeavours to address the scarcity of resources by presenting a novel Bengali paraphrasing dataset specifically tailored to the health domain. The dataset construction process involved sourcing sentences from Bengali newspapers, focusing on health-related content. Due to the dearth of existing datasets in Bengali and the specialized nature of paraphrasing, particularly in the health domain, this endeavour necessitated the development of a unique methodology. This methodology included the development of a script for data extraction, pre-processing, translation of Bengali sentences into English, paraphrasing of the English sentences, and subsequent translation back into Bengali to generate paraphrased versions of the original sentences. A user study involving 100 participants was conducted to evaluate 500 sample Bengali paraphrased sentences to identify the most suitable library (VamsiT5 Paws) for creating the paraphrasing dataset. As such, a total of 200,000 sentences were extracted and paraphrased in this process. This dataset holds significant potential to foster advancements in paraphrasing research that facilitate the development of language models and ultimately contribute to the broader goal of enhancing NLP capabilities in Bengali, particularly in specialized domains like health.
format	Article
id	doaj-art-db4648cb74d04c8ba465394bf97cd8e5
institution	Kabale University
issn	2352-3409
language	English
publishDate	2025-08-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj-art-db4648cb74d04c8ba465394bf97cd8e5; 2025-08-20T03:43:02Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-08-01; vol. 61, article 111699; 10.1016/j.dib.2025.111699; BanglaHealth: A Bengali paraphrase dataset on health domain; Faisal Ibn Aziz, Muhammad Nazrul Islam (corresponding author); Department of Computer Science and Engineering, Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh (both authors); http://www.sciencedirect.com/science/article/pii/S2352340925004299; Natural language processing (NLP), Paraphrasing, Bengali paraphrasing, Bengali language, Health domain
title	BanglaHealth: A Bengali paraphrase dataset on health domain
topic	Natural language processing (NLP), Paraphrasing, Bengali paraphrasing, Bengali language, Health domain
url	http://www.sciencedirect.com/science/article/pii/S2352340925004299


Similar Items