Evaluating the Impact of Feature Engineering in Phishing URL Detection: A Comparative Study of URL, HTML, and Derived Features
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11031414/ |
| Summary: | Phishing attacks have evolved into sophisticated threats, making effective cybersecurity detection strategies essential. While many studies focus on either URL or HTML features, limited work has explored the comparative impact of engineered feature sets across different machine learning models. This study aims to bridge that empirical gap by evaluating the effectiveness of URL-based, HTML-based, and derived features, individually and in combination, on phishing URL detection. The proposed approach utilizes the PhishOFE dataset of 101,063 phishing and legitimate URLs. Features are organized into four sets: 1) URL only, 2) HTML only, 3) URL + HTML, and 4) URL + HTML + derived features. Ten machine learning models are employed, including Random Forest, k-Nearest Neighbors, Logistic Regression, Support Vector Machine, Naive Bayes, and advanced ensemble methods such as LightGBM, XGBoost, and CatBoost. Performance is assessed using accuracy, precision, recall, and F1-score, while permutation importance is used to evaluate feature significance. Experimental results demonstrate that ensemble models outperform traditional classifiers, with CatBoost achieving the highest accuracy of 99.45% using the complete feature set. Moreover, URL features like URLLength and NoOfSubDomain consistently rank high in importance, while derived features such as SuspiciousCharRatio and URLComplexityScore notably enhance detection performance in specific models. |
|---|---|
| ISSN: | 2169-3536 |
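
The summary names URL features such as URLLength and NoOfSubDomain and a derived feature called SuspiciousCharRatio, but this record does not define them. As a rough illustration only, the sketch below computes comparable lexical features from a raw URL. The function name `url_features`, the `SUSPICIOUS_CHARS` set, and the subdomain-counting rule are all assumptions made for this example and may differ from the definitions used in the PhishOFE dataset.

```python
from urllib.parse import urlsplit

# Hypothetical set of characters treated as "suspicious". The record does not
# say how the dataset defines SuspiciousCharRatio, so this is an assumption.
SUSPICIOUS_CHARS = set("@%=&?~_-")


def url_features(url: str) -> dict:
    """Compute a few illustrative lexical features for a single URL."""
    # hostname is None when the URL has no scheme/netloc part.
    host = urlsplit(url).hostname or ""
    labels = [part for part in host.split(".") if part]

    # Assumed rule: every label left of a two-label registered domain counts
    # as a subdomain. A real pipeline would consult a public-suffix list so
    # that hosts under suffixes like "co.uk" or IP-address hosts are handled.
    n_subdomains = max(len(labels) - 2, 0)

    suspicious = sum(ch in SUSPICIOUS_CHARS for ch in url)

    return {
        "URLLength": len(url),
        "NoOfSubDomain": n_subdomains,
        "SuspiciousCharRatio": suspicious / len(url) if url else 0.0,
    }


if __name__ == "__main__":
    print(url_features("http://login.secure.example.com/a?b=1"))
```

Features like these would form the columns of the "URL only" set. The HTML-based and remaining derived features described in the summary would need the page content and the authors' definitions, which this record does not provide.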