Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection
Drowsy driving is a leading cause of commercial-vehicle traffic crashes. Fatigue detection models are increasingly trained with deep neural networks on driver video data, but coarse, incomplete high-level feature extraction and network architecture optimization remain challenges. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection, and by harnessing a Transformer architecture it extracts sophisticated, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% improvement over the prevalent end-to-end CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further raising the AUC by 4%.
| Main Authors: | Zhen Gao, Xiaowen Chen, Jingning Xu, Rongjie Yu, Heng Zhang, Jinqiu Yang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-12-01 |
| Series: | Sensors |
| Subjects: | fatigue detection; CLIP pre-trained model; Transformer; instance normalization; semantic analysis |
| Online Access: | https://www.mdpi.com/1424-8220/24/24/7948 |
| Field | Value |
|---|---|
| author | Zhen Gao; Xiaowen Chen; Jingning Xu; Rongjie Yu; Heng Zhang; Jinqiu Yang |
| author_sort | Zhen Gao |
| collection | DOAJ |
| description | Drowsy driving is a leading cause of commercial-vehicle traffic crashes. Fatigue detection models are increasingly trained with deep neural networks on driver video data, but coarse, incomplete high-level feature extraction and network architecture optimization remain challenges. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection, and by harnessing a Transformer architecture it extracts sophisticated, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% improvement over the prevalent end-to-end CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further raising the AUC by 4%. |
| format | Article |
| id | doaj-art-0bafd022e0eb470ba0593cc7f4a28d01 |
| institution | DOAJ |
| issn | 1424-8220 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Sensors |
| doi | 10.3390/s24247948 |
| affiliations | Zhen Gao, Xiaowen Chen, Jingning Xu: School of Computer Science and Technology, Tongji University, Shanghai 201804, China; Rongjie Yu: Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Shanghai 201804, China; Heng Zhang: Zhejiang Fengxing Huiyun Technology Co., Ltd., Hangzhou 311107, China; Jinqiu Yang: Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 1M8, Canada |
| title | Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection |
| topic | fatigue detection; CLIP pre-trained model; Transformer; instance normalization; semantic analysis |
| url | https://www.mdpi.com/1424-8220/24/24/7948 |
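The abstract above credits the Transformer with capturing long-term dependencies among per-frame features more flexibly than an LSTM. As a minimal illustration of why (not the authors' code, and with arbitrary random weights in place of trained ones), the sketch below runs single-head self-attention over a sequence of frame feature vectors: every output frame is a weighted mix of all frames, so a dependency between distant frames is modeled in a single step rather than propagated through a recurrent chain.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frame_feats, seed=0):
    """Single-head self-attention over a (T, d) sequence of frame features.

    Projection matrices are random stand-ins for learned weights; in a
    real model they come from training. Returns the attended features
    and the (T, T) attention map, whose rows each sum to 1.
    """
    T, d = frame_feats.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = frame_feats @ Wq, frame_feats @ Wk, frame_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # each frame attends to every frame
    return attn @ V, attn

# e.g. 8 frames, each reduced to a 16-dim feature vector
feats = np.random.default_rng(1).standard_normal((8, 16))
out, attn = self_attention(feats)
```

In a CT-Net-style pipeline the input sequence would be CLIP image-encoder features per video frame, stacked into multi-head attention layers with feed-forward blocks; this single head only shows the mechanism that gives attention its direct frame-to-frame reach.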
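The record's headline number is an AUC of 0.892. For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen positive (fatigued) sample is scored above a randomly chosen negative one (the Mann-Whitney U formulation); a hypothetical helper, unrelated to the paper's evaluation code:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the fraction of
    positive/negative pairs where the positive outranks the
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# one misranked pair out of four -> 0.75
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

An AUC of 0.892 thus means a fatigued clip outscores an alert one about 89% of the time, which is why it is a threshold-free complement to plain accuracy.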