Analysis of Influence of Different Relations Types on the Quality of Thesaurus Application to Text Classification Problems

The main purpose of the article is to analyze how effectively different types of thesaurus relations can be used to solve text classification tasks. The study is based on an automatically generated thesaurus of a subject area that contains three types of relations: synonymous, hierarchi...

Bibliographic Details
Main Authors:	Nadezhda S. Lagutina, Ksenia V. Lagutina, Ivan A. Shchitov, Ilya V. Paramonov
Format:	Article
Language:	English
Published:	Yaroslavl State University 2017-12-01
Series:	Моделирование и анализ информационных систем
Subjects:	thesaurus; semantic relations; thesaurus relations; topical classification; sentiment classification
Online Access:	https://www.mais-journal.ru/jour/article/view/614

_version_	1850023940820303872
author	Nadezhda S. Lagutina; Ksenia V. Lagutina; Ivan A. Shchitov; Ilya V. Paramonov
author_facet	Nadezhda S. Lagutina; Ksenia V. Lagutina; Ivan A. Shchitov; Ilya V. Paramonov
author_sort	Nadezhda S. Lagutina
collection	DOAJ
description	The main purpose of the article is to analyze how effectively different types of thesaurus relations can be used to solve text classification tasks. The study is based on an automatically generated thesaurus of a subject area that contains three types of relations: synonymous, hierarchical, and associative. To generate the thesaurus, the authors use a hybrid method based on several linguistic and statistical algorithms for extracting semantic relations. The method makes it possible to create a thesaurus with a sufficiently large number of terms and relations among them. The authors consider two problems: topical text classification and sentiment classification of large newspaper articles. To solve them, the authors developed two approaches that complement standard algorithms with a procedure that takes thesaurus relations into account to determine semantic features of texts. The approach to topical classification combines the standard unsupervised BM25 algorithm with a procedure that takes into account the synonymous and hierarchical relations of the subject-area thesaurus. The approach to sentiment classification consists of two steps. In the first step, a thesaurus is created in which the polarity weights of terms are calculated from the term occurrences in the training set or from the weights of related thesaurus terms. In the second step, the thesaurus is used to compute the features of words from texts, and the texts are classified with the SVM or Naive Bayes algorithm. In experiments with the BBCSport, Reuters, and PubMed text corpora and a corpus of articles about American immigrants, the authors varied the types of thesaurus relations involved in the classification and the degree of their use. The results of the experiments make it possible to evaluate how effectively thesaurus relations can be applied to the classification of raw texts and to determine under what conditions particular relations have a greater or lesser effect. In particular, the most useful thesaurus relations are synonymous and hierarchical, as they provide better classification quality.
format	Article
id	doaj-art-430cfffa23324a3d86fadd9e7646169e
institution	DOAJ
issn	1818-1015 2313-5417
language	English
publishDate	2017-12-01
publisher	Yaroslavl State University
record_format	Article
series	Моделирование и анализ информационных систем
spelling	doaj-art-430cfffa23324a3d86fadd9e7646169e | 2025-08-20T03:01:14Z | eng | Yaroslavl State University | Моделирование и анализ информационных систем | 1818-1015 | 2313-5417 | 2017-12-01 | 24 | 6 | 772 | 787 | 10.18255/1818-1015-2017-6-772-787 | 448 | Analysis of Influence of Different Relations Types on the Quality of Thesaurus Application to Text Classification Problems | Nadezhda S. Lagutina (0); Ksenia V. Lagutina (1); Ivan A. Shchitov (2); Ilya V. Paramonov (3) | P.G. Demidov Yaroslavl State University; P.G. Demidov Yaroslavl State University; P.G. Demidov Yaroslavl State University; P.G. Demidov Yaroslavl State University | https://www.mais-journal.ru/jour/article/view/614 | thesaurus; semantic relations; thesaurus relations; topical classification; sentiment classification
title	Analysis of Influence of Different Relations Types on the Quality of Thesaurus Application to Text Classification Problems
topic	thesaurus; semantic relations; thesaurus relations; topical classification; sentiment classification
url	https://www.mais-journal.ru/jour/article/view/614
work_keys_str_mv	AT nadezhdaslagutina analysisofinfluenceofdifferentrelationstypesonthequalityofthesaurusapplicationtotextclassificationproblems AT kseniavlagutina analysisofinfluenceofdifferentrelationstypesonthequalityofthesaurusapplicationtotextclassificationproblems AT ivanashchitov analysisofinfluenceofdifferentrelationstypesonthequalityofthesaurusapplicationtotextclassificationproblems AT ilyavparamonov analysisofinfluenceofdifferentrelationstypesonthequalityofthesaurusapplicationtotextclassificationproblems
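
The description in this record mentions a topical classification approach that pairs the unsupervised BM25 algorithm with a procedure using synonymous and hierarchical thesaurus relations. The record does not give the details of that procedure, so the following is only a minimal sketch of one way such a combination could look, not the authors' method. The toy thesaurus, the documents, the class seed terms, the 0.5 weight for related terms, and all function names are assumptions made for illustration. The scoring function is the common Okapi BM25 formula with the usual k1 and b parameters.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: term -> related terms, grouped by relation type.
THESAURUS = {
    "football": {"synonym": ["soccer"], "hierarchical": ["sport"]},
    "tennis": {"synonym": [], "hierarchical": ["sport", "racket"]},
}

# Hypothetical tokenized corpus and per-class seed terms.
DOCS = [
    ["soccer", "match", "goal"],
    ["racket", "serve", "ace"],
    ["election", "vote"],
]
CLASS_TERMS = {"football": ["football"], "tennis": ["tennis"]}


def expand(terms, relations, weight=0.5):
    """Return {term: weight}: seed terms get 1.0, related terms get `weight`."""
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for rel in relations:
            for related in THESAURUS.get(t, {}).get(rel, []):
                # Keep the existing weight if the term is already present.
                weighted.setdefault(related, weight)
    return weighted


def bm25(query_weights, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for a weighted query over the corpus `docs`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term, w in query_weights.items():
        df = sum(1 for d in docs if term in d)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[term]
        score += w * idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score


def classify(doc, docs, class_terms, relations):
    """Pick the class whose thesaurus-expanded seed terms score highest."""
    scores = {
        label: bm25(expand(terms, relations), doc, docs)
        for label, terms in class_terms.items()
    }
    return max(scores, key=scores.get)


label = classify(DOCS[0], DOCS, CLASS_TERMS, ("synonym", "hierarchical"))
```

In this toy setup the first document never contains the seed term "football", so it scores zero without expansion and is matched only through the synonym "soccer". Varying the `relations` tuple mirrors, in a very simplified way, the idea of switching individual relation types on and off.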


Similar Items