BanglaBlend: A large-scale nobel dataset of bangla sentences categorized by saint and common form of bangla language (Mendeley Data)

Bibliographic Details
Main Authors: Umme Ayman, Chayti Saha, Azmain Mahtab Rahat, Sharun Akter Khushbu
Format: Article
Language: English
Published: Elsevier 2025-02-01
Series: Data in Brief
Subjects: Bangla text classification; Bangla language; Text classification; Natural language processing
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340924012022
_version_ 1832576512627310592
author Umme Ayman
Chayti Saha
Azmain Mahtab Rahat
Sharun Akter Khushbu
author_sort Umme Ayman
collection DOAJ
description Bengali (Bangla) is spoken by millions of people in Bangladesh and India, and the gap between its saintly (Sadhu) and common (Cholito) forms is culturally and computationally significant. Recognising this, we introduce BanglaBlend, a dataset created to capture these stylistic distinctions. BanglaBlend contains 7350 annotated sentences, 3675 in saintly form and 3675 in common form, addressing a need for natural language processing (NLP) resources for Bangla. The dataset supports the creation of NLP models that can detect and imitate Bengali stylistic nuances, improving tasks such as text categorisation, sentiment analysis, and style translation. BanglaBlend also facilitates literary analysis, cultural heritage projects, and the creation of domain-specific texts. To ensure data quality, pre-processing techniques such as anonymization and duplicate removal were applied, and the style labels were validated in three steps. BanglaBlend is intended as a foundation for future NLP research and development in Bangla: it is a resource for studying stylistic diversity, aids the development of context-aware language models, and serves both academic research and practical applications. By making BanglaBlend freely accessible, we hope to encourage cooperation and creativity within the Bangla NLP community, thereby adding to the worldwide variety of computational linguistic resources.
format Article
id doaj-art-dabf1c2621454f37b1f5b2f3df86c4a1
institution Kabale University
issn 2352-3409
language English
publishDate 2025-02-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj-art-dabf1c2621454f37b1f5b2f3df86c4a1 2025-01-31T05:11:37Z eng Elsevier Data in Brief 2352-3409 2025-02-01 volume 58, article 111240
Author affiliations:
Umme Ayman (corresponding author): Department of Computer Science and Engineering, Daffodil International University, Bangladesh
Chayti Saha: Department of Computer Science and Engineering, Daffodil International University, Bangladesh
Azmain Mahtab Rahat: Department of Information and Communication Technology, Comilla University, Bangladesh
Sharun Akter Khushbu: Department of Computer Science and Engineering, Daffodil International University, Bangladesh
title BanglaBlend: A large-scale nobel dataset of bangla sentences categorized by saint and common form of bangla language (Mendeley Data)
topic Bangla text classification
Bangla language
Text classification
Natural language processing
url http://www.sciencedirect.com/science/article/pii/S2352340924012022
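
Note on the pre-processing mentioned in the description: the record states that duplicate removal was applied and that the two style classes are balanced (3675 sentences each), but it does not give the procedure. The Python sketch below is only an illustration of how such a check might look. The `(sentence, label)` row layout, the label names `saint` and `common`, and the placeholder strings are assumptions made for this example; they are not taken from the dataset files.

```python
from collections import Counter


def deduplicate(rows):
    """Keep the first occurrence of each sentence, comparing whitespace-normalized text."""
    seen = set()
    unique = []
    for sentence, label in rows:
        key = " ".join(sentence.split())  # collapse runs of whitespace
        if key not in seen:
            seen.add(key)
            unique.append((key, label))
    return unique


def label_counts(rows):
    """Count how many rows carry each style label."""
    return Counter(label for _, label in rows)


# Placeholder rows (hypothetical, not real BanglaBlend sentences).
rows = [
    ("placeholder sentence one", "saint"),
    ("placeholder  sentence one", "saint"),  # same text, extra space
    ("placeholder sentence two", "common"),
]

clean = deduplicate(rows)
counts = label_counts(clean)
print(len(clean), dict(counts))
```

On the real data, one would expect `label_counts` to report equal counts for the two styles if the balance stated in the description holds.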