BanglaBlend: A large-scale nobel dataset of bangla sentences categorized by saint and common form of bangla language (Mendeley Data)

In the vibrant linguistic landscape of Bengali, spoken by millions in Bangladesh and India, the gap between saintly and common terms is culturally and computationally significant. Recognising this, we introduce BanglaBlend, a pioneering dataset created to capture these stylistic distinctions. Bangla...

Bibliographic Details
Main Authors:	Umme Ayman, Chayti Saha, Azmain Mahtab Rahat, Sharun Akter Khushbu
Format:	Article
Language:	English
Published:	Elsevier 2025-02-01
Series:	Data in Brief
Subjects:	Bangla text classification Bangla language Text classification Natural language processing
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340924012022

_version_	1832576512627310592
author	Umme Ayman Chayti Saha Azmain Mahtab Rahat Sharun Akter Khushbu
collection	DOAJ
description	In the vibrant linguistic landscape of Bengali, spoken by millions in Bangladesh and India, the gap between saintly and common terms is culturally and computationally significant. Recognising this, we introduce BanglaBlend, a pioneering dataset created to capture these stylistic distinctions. BanglaBlend comprises 7350 annotated sentences, 3675 in saintly form and 3675 in common form, addressing a crucial need in natural language processing (NLP) resources for Bangla. The dataset supports a variety of applications. It contributes to the creation of NLP models that can detect and imitate Bengali stylistic nuances, thereby improving tasks such as text categorisation, sentiment analysis, and style translation. BanglaBlend also facilitates literary analysis, cultural heritage projects, and the creation of domain-specific texts. To achieve the best data quality, rigorous pre-processing techniques such as anonymisation and duplicate removal were used. The style labels were validated in three steps to ensure correctness. BanglaBlend is more than just a dataset; it is a cornerstone for future NLP research and development in Bangla. It is a valuable resource for studying stylistic diversity, aids the development of context-aware language models, and is an essential tool for academic research and practical applications. By making BanglaBlend freely accessible, we hope to encourage cooperation and creativity within the Bangla NLP community, thereby adding to the worldwide variety of linguistic computational resources.
format	Article
id	doaj-art-dabf1c2621454f37b1f5b2f3df86c4a1
institution	Kabale University
issn	2352-3409
language	English
publishDate	2025-02-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj-art-dabf1c2621454f37b1f5b2f3df86c4a1 | 2025-01-31T05:11:37Z | eng | Elsevier | Data in Brief | 2352-3409 | 2025-02-01 | 58 | 111240 | BanglaBlend: A large-scale nobel dataset of bangla sentences categorized by saint and common form of bangla language (Mendeley Data) | Umme Ayman (Department of Computer Science and Engineering, Daffodil International University, Bangladesh; corresponding author) | Chayti Saha (Department of Computer Science and Engineering, Daffodil International University, Bangladesh) | Azmain Mahtab Rahat (Department of Information and Communication Technology, Comilla University, Bangladesh) | Sharun Akter Khushbu (Department of Computer Science and Engineering, Daffodil International University, Bangladesh) | http://www.sciencedirect.com/science/article/pii/S2352340924012022 | Bangla text classification | Bangla language | Text classification | Natural language processing
title	BanglaBlend: A large-scale nobel dataset of bangla sentences categorized by saint and common form of bangla language (Mendeley Data)
topic	Bangla text classification Bangla language Text classification Natural language processing
url	http://www.sciencedirect.com/science/article/pii/S2352340924012022

Similar Items