The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset

Script identification (SI) is easier to implement than language identification, and its identification rate is very high. For a language identification algorithm, the fewer languages it must distinguish, the higher its identification rate. However, no systematic study on SI involving multiple languag...

Bibliographic Details
Main Authors:	Mamtimin Qasim, Wushour Silamu, Minghui Qiu
Format:	Article
Language:	English
Published:	MDPI AG 2024-11-01
Series:	Data
Subjects:	script; script identification; language identification; language identification dataset
Online Access:	https://www.mdpi.com/2306-5729/9/11/134

collection	DOAJ
description	Script identification (SI) is easier to implement than language identification, and its identification rate is very high. For a language identification algorithm, the fewer languages it must distinguish, the higher its identification rate. However, no systematic study has examined SI across multiple languages or how to construct the corresponding language identification datasets. This paper therefore designs a script identification algorithm and discusses the construction of a language identification dataset based on script groups. The data sources are text corpora for 261 languages from the Leipzig Corpora Collection, grouped into 23 script groups. Because the Unicode encoding scheme assigns different scripts to different code regions, we propose a script identification algorithm based on regular expression matching; its micro F-score reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, the script identification algorithm is used to filter out other-script content from each text.
format	Article
id	doaj-art-ed204bdd78684d8d9ade9540573af4b6
institution	DOAJ
issn	2306-5729
citation	Mamtimin Qasim, Wushour Silamu, Minghui Qiu. The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset. Data, vol. 9, no. 11, article 134. MDPI AG, 2024-11-01. ISSN 2306-5729.
doi	10.3390/data9110134
affiliations	Mamtimin Qasim: School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China. Wushour Silamu: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China. Minghui Qiu: School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China.
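
Illustrative example (not part of the record): the description says the algorithm relies on Unicode placing different scripts in different code regions and on regular expression matching, and that the same identifier is used to filter other-script content out of each text. A minimal Python sketch of that idea follows. The five code-point ranges, the names SCRIPT_PATTERNS, identify_script and filter_other_scripts, and the majority-count rule are all assumptions made for illustration; the record does not list the paper's 23 script groups, its actual patterns, or its decision rule.

```python
import re

# Assumed, illustrative code-point ranges for five scripts only.
# The paper's 23 script groups and exact ranges are not given in this record.
SCRIPT_PATTERNS = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Arabic": re.compile(r"[\u0600-\u06FF]"),
    "Devanagari": re.compile(r"[\u0900-\u097F]"),
    "Han": re.compile(r"[\u4E00-\u9FFF]"),
}


def identify_script(text: str) -> str:
    """Return the script whose pattern matches the most characters.

    Assumption: a simple majority count decides the label; text with no
    character from any listed range is labelled "Unknown".
    """
    counts = {
        name: len(pattern.findall(text))
        for name, pattern in SCRIPT_PATTERNS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "Unknown"


def filter_other_scripts(text: str, script: str) -> str:
    """Drop characters that belong to any listed script other than `script`.

    Digits, punctuation and spaces are kept because they match no pattern;
    runs of whitespace left behind by removed words are collapsed.
    """
    others = [p for name, p in SCRIPT_PATTERNS.items() if name != script]
    kept = "".join(
        ch for ch in text if not any(p.match(ch) for p in others)
    )
    return " ".join(kept.split())


if __name__ == "__main__":
    sentence = "Привет hello мир"
    label = identify_script(sentence)
    print(label)
    print(filter_other_scripts(sentence, label))
```

Counting matches per script, rather than testing only the first character, keeps a sentence with a few embedded foreign words assigned to its dominant script, which is the situation the filtering step is meant to clean up.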

Similar Items