BanglaHealth: A Bengali paraphrase dataset on health domain
In the landscape of natural language processing (NLP) research, the availability of comprehensive datasets plays a pivotal role in advancing various tasks, including paraphrasing. However, for languages such as Bengali, the availability of such datasets remains limited, particularly in specialized d...
| Main Authors: | Faisal Ibn Aziz, Muhammad Nazrul Islam |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-08-01 |
| Series: | Data in Brief |
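| Subjects: | Natural language processing (NLP); Paraphrasing; Bengali paraphrasing; Bengali language; Health domain |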
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340925004299 |
| _version_ | 1849343265911865344 |
|---|---|
| author | Faisal Ibn Aziz; Muhammad Nazrul Islam |
| author_facet | Faisal Ibn Aziz; Muhammad Nazrul Islam |
| author_sort | Faisal Ibn Aziz |
| collection | DOAJ |
| description | In natural language processing (NLP) research, the availability of comprehensive datasets plays a pivotal role in advancing various tasks, including paraphrasing. However, for languages such as Bengali, such datasets remain limited, particularly in specialized domains like health. Recognizing this gap, this study addresses the scarcity of resources by presenting a novel Bengali paraphrasing dataset specifically tailored to the health domain. The dataset was constructed by sourcing sentences from Bengali newspapers, focusing on health-related content. Due to the dearth of existing datasets in Bengali and the specialized nature of paraphrasing, particularly in the health domain, this endeavour necessitated the development of a unique methodology. This methodology included the development of a script for data extraction, pre-processing, translation of Bengali sentences into English, paraphrasing of the English sentences, and subsequent translation back into Bengali to generate paraphrased versions of the original sentences. A user study involving 100 participants was conducted to evaluate 500 sample Bengali paraphrased sentences and identify the most suitable library (VamsiT5 Paws) for creating the paraphrasing dataset. In total, 200,000 sentences were extracted and paraphrased in this process. This dataset holds significant potential to foster advancements in paraphrasing research, facilitate the development of language models, and ultimately contribute to the broader goal of enhancing NLP capabilities in Bengali, particularly in specialized domains like health. |
| format | Article |
| id | doaj-art-db4648cb74d04c8ba465394bf97cd8e5 |
| institution | Kabale University |
| issn | 2352-3409 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Data in Brief |
| spelling | doaj-art-db4648cb74d04c8ba465394bf97cd8e5; 2025-08-20T03:43:02Z; eng; Elsevier; Data in Brief; 2352-3409; 2025-08-01; 61; 111699; 10.1016/j.dib.2025.111699; BanglaHealth: A Bengali paraphrase dataset on health domain; Faisal Ibn Aziz (0); Muhammad Nazrul Islam (1); (0) Department of Computer Science and Engineering, Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh; (1) Corresponding author. Department of Computer Science and Engineering, Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh; http://www.sciencedirect.com/science/article/pii/S2352340925004299; Natural language processing (NLP); Paraphrasing; Bengali paraphrasing; Bengali language; Health domain |
| title | BanglaHealth: A Bengali paraphrase dataset on health domain |
| topic | Natural language processing (NLP); Paraphrasing; Bengali paraphrasing; Bengali language; Health domain |
| url | http://www.sciencedirect.com/science/article/pii/S2352340925004299 |
| work_keys_str_mv | AT faisalibnaziz banglahealthabengaliparaphrasedatasetonhealthdomainhuggingface AT muhammadnazrulislam banglahealthabengaliparaphrasedatasetonhealthdomainhuggingface |
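
The description above outlines a round-trip workflow: pre-process each Bengali sentence, translate it into English, paraphrase the English text, and translate the result back into Bengali. The following is a minimal sketch of that control flow only, not the authors' script. The function names (`normalize`, `back_translation_paraphrase`) and the three callables passed in are hypothetical stand-ins; a real run would plug in actual translation and paraphrasing models. The whitespace-only pre-processing and the filter that drops unchanged outputs are also assumptions, not steps confirmed by the record.

```python
from typing import Callable, Iterable, List, Tuple


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace (assumed stand-in for the pre-processing step)."""
    return " ".join(sentence.split())


def back_translation_paraphrase(
    sentences: Iterable[str],
    to_english: Callable[[str], str],          # hypothetical Bengali -> English translator
    paraphrase_english: Callable[[str], str],  # hypothetical English paraphraser
    to_bengali: Callable[[str], str],          # hypothetical English -> Bengali translator
) -> List[Tuple[str, str]]:
    """Return (source, paraphrase) pairs produced by the round trip."""
    pairs: List[Tuple[str, str]] = []
    for raw in sentences:
        source = normalize(raw)
        if not source:
            continue  # skip empty lines left over from extraction
        english = to_english(source)
        english_paraphrase = paraphrase_english(english)
        candidate = normalize(to_bengali(english_paraphrase))
        # Assumption: an output identical to the input is not a useful paraphrase.
        if candidate and candidate != source:
            pairs.append((source, candidate))
    return pairs


if __name__ == "__main__":
    # Toy lookup tables with placeholder strings stand in for real models.
    bn_to_en = {"bn-sentence-1": "wash your hands often"}
    en_to_en = {"wash your hands often": "clean your hands frequently"}
    en_to_bn = {"clean your hands frequently": "bn-paraphrase-1"}
    result = back_translation_paraphrase(
        ["  bn-sentence-1 ", ""],
        bn_to_en.__getitem__,
        en_to_en.__getitem__,
        en_to_bn.__getitem__,
    )
    print(result)
```

Passing the three stages in as callables keeps the sketch independent of any particular model or library, so each stage can be swapped or evaluated separately.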