Evaluating the Impact of Feature Engineering in Phishing URL Detection: A Comparative Study of URL, HTML, and Derived Features

Bibliographic Details
Main Authors:	Yanche Ari Kustiawan, Khairil Imran Ghauth
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Phishing URL detection; machine learning; feature engineering; URL features; HTML features; derived features
Online Access:	https://ieeexplore.ieee.org/document/11031414/

Description
Summary:	Phishing attacks have evolved into sophisticated threats, making effective cybersecurity detection strategies essential. While many studies focus on either URL or HTML features, limited work has explored the comparative impact of engineered feature sets across different machine learning models. This study aims to bridge that empirical gap by evaluating the effectiveness of URL-based, HTML-based, and derived features, individually and in combination, on phishing URL detection. The proposed approach utilizes the PhishOFE dataset of 101,063 phishing and legitimate URLs. Features are organized into four sets: 1) URL only, 2) HTML only, 3) URL + HTML, and 4) URL + HTML + derived features. Ten machine learning models are employed, including Random Forest, k-Nearest Neighbors, Logistic Regression, Support Vector Machine, Naive Bayes, and advanced ensemble methods such as LightGBM, XGBoost, and CatBoost. Performance is assessed using accuracy, precision, recall, and F1-score, while permutation importance is used to evaluate feature significance. Experimental results demonstrate that ensemble models outperform traditional classifiers, with CatBoost achieving the highest accuracy of 99.45% using the complete feature set. Moreover, URL features like URLLength and NoOfSubDomain consistently rank high in importance, while derived features such as SuspiciousCharRatio and URLComplexityScore notably enhance detection performance in specific models.
ISSN:	2169-3536
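
As a rough illustration of the kind of URL and derived features the summary names, the sketch below computes URLLength, NoOfSubDomain, and SuspiciousCharRatio for a single URL. The feature names come from the summary; the definitions are assumptions, since this record does not give the paper's formulas. In particular, the set of "suspicious" characters and the rule that the last two host labels form the registered domain are hypothetical simplifications, and the function name extract_url_features is invented for this example.

```python
from urllib.parse import urlparse

# Assumed set of "suspicious" characters; the paper's actual list is not
# given in this record, so this choice is purely illustrative.
SUSPICIOUS_CHARS = set("@%-_=&?~")


def extract_url_features(url: str) -> dict:
    """Compute three illustrative features for one URL (hypothetical definitions)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = [part for part in host.split(".") if part]

    # Naive assumption: the last two labels are the registered domain and
    # everything before them is a subdomain. This miscounts IP-address hosts
    # and multi-part suffixes such as "co.uk".
    no_of_subdomain = max(len(labels) - 2, 0)

    # Share of characters in the full URL that belong to the assumed set.
    suspicious = sum(1 for ch in url if ch in SUSPICIOUS_CHARS)
    ratio = suspicious / len(url) if url else 0.0

    return {
        "URLLength": len(url),
        "NoOfSubDomain": no_of_subdomain,
        "SuspiciousCharRatio": ratio,
    }


if __name__ == "__main__":
    print(extract_url_features("http://login.secure.example.com/a?x=1"))
```

Feature dictionaries of this shape could be stacked into a table and passed to any of the classifiers listed in the summary; the study's own feature extraction may differ.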

Similar Items