A Hybrid K-Means++ and Particle Swarm Optimization Approach for Enhanced Document Clustering

Bibliographic Details
Main Authors: Eisha Hassan, Fazila Malik, Qazi Waqas Khan, Nadeem Ahmad, Muhammad Sardaraz, Faten Khalid Karim, Hela Elmannai
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10855427/
Description
Summary: Document clustering has attracted many researchers, who have proposed solutions that combine different techniques, models, and algorithms. Although valued for its simplicity, the most commonly used algorithm, K-Means, suffers from issues such as the need to find the optimal value of k and the random initialization of centroids. In this paper, we propose a hybrid methodology that combines K-Means++ with the metaheuristic algorithm Particle Swarm Optimization (PSO) to overcome the challenges of both algorithms. K-Means++ is an initialization technique that selects initial centroids using a probability function so that they lie far from each other. Two methods, Elbow Analysis and Silhouette Score, were used to find the optimal value of k. The research also assesses the best-performing feature extraction techniques and creates a combined approach using their average. Although the combined approach did not outperform the existing methods, it opens the way for further exploration of merging feature extraction techniques. The experiments were performed on four datasets (20 Newsgroups, Reuters, WebKB, and Wine), and the results were evaluated using three metrics: Purity, Silhouette Score, and Jaccard Index. The proposed approach achieved a Purity of 0.95, 0.91, 0.86, and 0.89; a Silhouette Score of 0.5, 0.05, 0.05, and 0.59; and a Jaccard Index of 0.04, 0.04, 0.05, and 0.14 on the Reuters, WebKB, 20 Newsgroups, and Wine datasets, respectively.
ISSN: 2169-3536
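
Illustrative note: the summary mentions K-Means++ seeding and the Purity metric without showing either. The following is a minimal sketch of how the two are commonly defined, not the authors' implementation. The function names `kmeans_pp_init` and `purity` are hypothetical, points are assumed to be tuples of numbers, and the PSO refinement step described in the paper is not shown.

```python
import math
import random
from collections import Counter


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids by distance-weighted sampling (K-Means++ seeding).

    Hypothetical helper for illustration; `points` is a list of numeric tuples.
    """
    rng = rng or random.Random()
    # The first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen centroid.
    d2 = [math.dist(p, centroids[0]) ** 2 for p in points]
    while len(centroids) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen centroid: fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Draw the next centroid with probability proportional to d2,
            # so points far from existing centroids are favoured.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(nxt)
        # Update nearest-centroid distances with the new centroid.
        d2 = [min(old, math.dist(p, nxt) ** 2) for old, p in zip(d2, points)]
    return centroids


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority true class of their cluster."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in clusters.values())
    return majority_total / len(labels_true)
```

Under this definition, a Purity of 0.95 would mean that 95% of documents share the majority class label of the cluster they were assigned to. Purity rewards many small clusters, which may be one reason the paper reports it alongside the Silhouette Score and the Jaccard Index.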