A Hybrid K-Means++ and Particle Swarm Optimization Approach for Enhanced Document Clustering
| Main Authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10855427/ |
| Summary: | Document clustering has attracted the interest of many researchers, who have created several solutions to this problem by combining different techniques, models, and algorithms. While famous for its simplicity, the most commonly used algorithm, K-Means, suffers from issues such as finding the optimal value for k and random initialization of the centroids. In this paper, we propose a hybrid methodology combining K-Means++ with the metaheuristic algorithm PSO to overcome the challenges of both these algorithms. K-Means++ is a smart initialization technique that selects initial centroids based on a probability function so that they are farther away from each other. Two methods, Elbow Analysis and Silhouette Score, have been used to find the optimal value of k. The research also attempts to assess the results of the best-performing feature extraction techniques and create a combined approach using their average. Although the combined approach did not outperform the existing methods, it paved the way for further advancement and exploration of merging feature extraction techniques. The experiments were performed on four datasets: 20 Newsgroups, Reuters, WebKB, and Wine, and the results were evaluated using three evaluation metrics: Purity, Silhouette Score, and Jaccard Index. The proposed approach achieved a Purity of 0.95, 0.91, 0.86, and 0.89; a Silhouette Score of 0.5, 0.05, 0.05, and 0.59; and a Jaccard Index of 0.04, 0.04, 0.05, and 0.14 on the Reuters, WebKB, 20 Newsgroups, and Wine datasets, respectively. |
|---|---|
| ISSN: | 2169-3536 |
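
The summary describes K-Means++ as choosing starting centroids through a probability function that favors points far from those already chosen. The sketch below illustrates that seeding step only, using the standard squared-distance weighting. It is not the authors' implementation: the names `squared_distance` and `kmeans_pp_init`, the `seed` parameter, and the toy points are assumptions made for illustration, and the PSO stage is not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centroids with K-Means++-style seeding.

    Hypothetical helper: the first centroid is drawn uniformly at random,
    and each later one is drawn with probability proportional to its
    squared distance from the nearest centroid chosen so far.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest centroid.
    d2 = [squared_distance(p, centroids[0]) for p in points]
    while len(centroids) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen centroid: fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Far-away points get larger weights, so they are more likely picks.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(nxt)
        d2 = [min(old, squared_distance(p, nxt)) for old, p in zip(d2, points)]
    return centroids


if __name__ == "__main__":
    # Toy data (an assumption): two tight groups that are far apart.
    toy = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
    print(kmeans_pp_init(toy, k=2))
```

Because an already chosen point has weight zero, it cannot be drawn again while any other point remains at a positive distance, which is what spreads the starting centroids apart.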