Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method

TF-IDF (term frequency-inverse document frequency) is one of the traditional text similarity calculation methods based on statistics. Because TF-IDF does not consider the semantic information of words, it cannot accurately reflect the similarity between texts, and semantic information enhanced metho...

Bibliographic Details
Main Author: Fei Lan
Format: Article
Language: English
Published: Wiley 2022-01-01
Series: Advances in Multimedia
Online Access: http://dx.doi.org/10.1155/2022/7923262
_version_ 1832553641979936768
author Fei Lan
collection DOAJ
description TF-IDF (term frequency-inverse document frequency) is a traditional statistical method for calculating text similarity. Because TF-IDF ignores the semantic information of words, it cannot accurately reflect the similarity between texts. Methods enhanced with semantic information, in turn, distinguish poorly between text documents, because extending vectors with semantically similar terms aggravates the curse of dimensionality. To address this problem, the paper proposes a hybrid method that combines semantic understanding with TF-IDF to calculate text similarity. Building on the term similarity weighting tree (TSWT) data structure and the definition of semantic similarity from HowNet, the paper first discusses text preprocessing and filtering, and then uses the semantic information of the key terms, that is, the features whose weight exceeds a given threshold, to calculate the similarity of text documents. Experimental results with different K-means clustering methods show that the hybrid method outperforms both pure TF-IDF and the purely semantic method in accuracy, recall, and F1.
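The hybrid idea summarized in the description can be illustrated with a minimal Python sketch. It is not the paper's implementation: the TSWT data structure is not reproduced, the hand-made `SIM` table is a hypothetical stand-in for HowNet term similarities, and the smoothed IDF formula, the function names, and the symmetric best-match scoring are all assumptions chosen for illustration. The sketch computes TF-IDF weights, keeps only terms whose weight exceeds a threshold, and compares the remaining key terms by semantic similarity, weighted by TF-IDF.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumption: idf = ln(N / df) + 1, so terms that occur in every
    document keep a small positive weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log(n / df[term]) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def key_terms(vector, threshold):
    """Keep only the features whose weight is greater than the threshold."""
    return {t: w for t, w in vector.items() if w > threshold}


def term_sim(a, b, sim_table):
    """Semantic similarity of two terms; identical terms score 1.0."""
    if a == b:
        return 1.0
    return sim_table.get(frozenset((a, b)), 0.0)


def directed_sim(src, dst, sim_table):
    """Weighted average of each source term's best match in the target."""
    if not src or not dst:
        return 0.0
    total = sum(src.values())
    matched = sum(
        w * max(term_sim(t, u, sim_table) for u in dst)
        for t, w in src.items()
    )
    return matched / total


def hybrid_similarity(va, vb, sim_table, threshold=0.0):
    """Symmetric similarity over the thresholded key terms of two documents."""
    ka, kb = key_terms(va, threshold), key_terms(vb, threshold)
    return 0.5 * (directed_sim(ka, kb, sim_table) + directed_sim(kb, ka, sim_table))


# Hypothetical toy data; a real system would query HowNet instead of SIM.
docs = [
    ["car", "engine", "fast"],
    ["automobile", "motor", "fast"],
    ["banana", "fruit", "sweet"],
]
SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("engine", "motor")): 0.8,
}
vecs = tfidf_vectors(docs)
print(hybrid_similarity(vecs[0], vecs[1], SIM))
print(hybrid_similarity(vecs[0], vecs[2], SIM))
```

In this toy example the first two documents share only the literal term "fast", so plain TF-IDF would see little overlap, while the similarity table lets "car"/"automobile" and "engine"/"motor" contribute. Raising `threshold` drops low-weight terms before the semantic comparison, which is the filtering step the description refers to.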
format Article
id doaj-art-dbefe631f51d4b2c9726f02954cf2603
institution Kabale University
issn 1687-5699
language English
publishDate 2022-01-01
publisher Wiley
record_format Article
series Advances in Multimedia
spelling doaj-art-dbefe631f51d4b2c9726f02954cf2603 | 2025-02-03T05:53:40Z | eng | Wiley | Advances in Multimedia | 1687-5699 | 2022-01-01 | 2022 | 10.1155/2022/7923262 | Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method | Fei Lan, School of Electronics and Internet of Things | http://dx.doi.org/10.1155/2022/7923262
title Research on Text Similarity Measurement Hybrid Algorithm with Term Semantic Information and TF-IDF Method
url http://dx.doi.org/10.1155/2022/7923262