Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection
Drowsy driving is a leading cause of commercial-vehicle traffic crashes. Fatigue detection models are increasingly trained with deep neural networks on driver video data, but coarse, incomplete high-level feature extraction and network architecture optimization remain challenges. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection, and by harnessing a Transformer architecture it extracts sophisticated, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% improvement over the prevalent end-to-end CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further raising the AUC by 4%.
| Main Authors: | Zhen Gao, Xiaowen Chen, Jingning Xu, Rongjie Yu, Heng Zhang, Jinqiu Yang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-12-01 |
| Series: | Sensors |
| Subjects: | fatigue detection; CLIP pre-trained model; Transformer; instance normalization; semantic analysis |
| Online Access: | https://www.mdpi.com/1424-8220/24/24/7948 |
| Field | Value |
|---|---|
| author | Zhen Gao; Xiaowen Chen; Jingning Xu; Rongjie Yu; Heng Zhang; Jinqiu Yang |
| author_sort | Zhen Gao |
| collection | DOAJ |
| description | Drowsy driving is a leading cause of commercial-vehicle traffic crashes. Fatigue detection models are increasingly trained with deep neural networks on driver video data, but coarse, incomplete high-level feature extraction and network architecture optimization remain challenges. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection, and by harnessing a Transformer architecture it extracts sophisticated, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% improvement over the prevalent end-to-end CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further raising the AUC by 4%. |
| format | Article |
| id | doaj-art-0bafd022e0eb470ba0593cc7f4a28d01 |
| institution | DOAJ |
| issn | 1424-8220 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Sensors |
| doi | 10.3390/s24247948 |
| affiliations | Zhen Gao, Xiaowen Chen, Jingning Xu: School of Computer Science and Technology, Tongji University, Shanghai 201804, China; Rongjie Yu: Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Shanghai 201804, China; Heng Zhang: Zhejiang Fengxing Huiyun Technology Co., Ltd., Hangzhou 311107, China; Jinqiu Yang: Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 1M8, Canada |
| title | Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection |
| topic | fatigue detection; CLIP pre-trained model; Transformer; instance normalization; semantic analysis |
| url | https://www.mdpi.com/1424-8220/24/24/7948 |
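The abstract above credits the Transformer with capturing long-term dependencies among per-frame features more flexibly than an LSTM. As a minimal illustration of why (not the authors' code, and with arbitrary random weights in place of trained ones), the sketch below runs single-head self-attention over a sequence of frame feature vectors: every output frame is a weighted mix of all frames, so a dependency between distant frames is modeled in a single step rather than propagated through a recurrent chain.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frame_feats, seed=0):
    """Single-head self-attention over a (T, d) sequence of frame features.

    Projection matrices are random stand-ins for learned weights; in a
    real model they come from training. Returns the attended features
    and the (T, T) attention map, whose rows each sum to 1.
    """
    T, d = frame_feats.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = frame_feats @ Wq, frame_feats @ Wk, frame_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # each frame attends to every frame
    return attn @ V, attn

# e.g. 8 frames, each reduced to a 16-dim feature vector
feats = np.random.default_rng(1).standard_normal((8, 16))
out, attn = self_attention(feats)
```

In a CT-Net-style pipeline the input sequence would be CLIP image-encoder features per video frame, stacked into multi-head attention layers with feed-forward blocks; this single head only shows the mechanism that gives attention its direct frame-to-frame reach.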
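The record's headline number is an AUC of 0.892. For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen positive (fatigued) sample is scored above a randomly chosen negative one (the Mann-Whitney U formulation); a hypothetical helper, unrelated to the paper's evaluation code:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the fraction of
    positive/negative pairs where the positive outranks the
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# one misranked pair out of four -> 0.75
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

An AUC of 0.892 thus means a fatigued clip outscores an alert one about 89% of the time, which is why it is a threshold-free complement to plain accuracy.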