Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection

Drowsy driving is a leading cause of commercial vehicle traffic crashes. The trend is to train fatigue detection models using deep neural networks on driver video data, but challenges remain in coarse and incomplete high-level feature extraction and in network architecture optimization. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection. By harnessing a Transformer architecture, sophisticated long-term temporal features are extracted from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% accuracy improvement over the prevalent CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) end-to-end model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further enhancing the model's AUC by 4%.

Bibliographic Details
Main Authors: Zhen Gao, Xiaowen Chen, Jingning Xu, Rongjie Yu, Heng Zhang, Jinqiu Yang
Format: Article
Language: English
Published: MDPI AG, 2024-12-01
Series: Sensors
Subjects: fatigue detection; CLIP pre-trained model; Transformer; instance normalization; semantic analysis
Online Access: https://www.mdpi.com/1424-8220/24/24/7948
author Zhen Gao; Xiaowen Chen; Jingning Xu; Rongjie Yu; Heng Zhang; Jinqiu Yang
collection DOAJ
description Drowsy driving is a leading cause of commercial vehicle traffic crashes. The trend is to train fatigue detection models using deep neural networks on driver video data, but challenges remain in coarse and incomplete high-level feature extraction and in network architecture optimization. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection. By harnessing a Transformer architecture, sophisticated long-term temporal features are extracted from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% accuracy improvement over the prevalent CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) end-to-end model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model's AUC by 7% over an ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer more flexibly captures long-term dependencies among temporal features, further enhancing the model's AUC by 4%.
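The pipeline the abstract describes (per-frame CLIP image features fed to a Transformer for temporal modeling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, pooling choice, and the stand-in for the CLIP encoder (features assumed precomputed, 512-dimensional as in CLIP ViT-B/32) are assumptions.

```python
import torch
import torch.nn as nn

class CTNetSketch(nn.Module):
    """Hedged sketch of a CLIP-Transformer fatigue pipeline.

    Assumes each video frame has already been encoded by a (frozen)
    CLIP image encoder into a 512-d semantic feature vector.
    """
    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        # Transformer mixes information across all frames at once,
        # capturing long-range temporal dependencies (vs. LSTM recurrence).
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)  # one fatigue logit per clip

    def forward(self, clip_feats):           # (batch, frames, feat_dim)
        h = self.temporal(clip_feats)        # temporal feature mixing
        return self.head(h.mean(dim=1))      # mean-pool over time -> (batch, 1)

model = CTNetSketch()
frames = torch.randn(4, 16, 512)             # 4 clips, 16 frame features each
logits = model(frames)
print(tuple(logits.shape))                   # (4, 1)
```

The sigmoid of each logit would give a per-clip fatigue probability; thresholding it yields the binary decision evaluated by the paper's AUC metric.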
format Article
id doaj-art-0bafd022e0eb470ba0593cc7f4a28d01
institution DOAJ
issn 1424-8220
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj-art-0bafd022e0eb470ba0593cc7f4a28d01
last indexed: 2025-08-20T02:43:43Z
language: eng
publisher: MDPI AG
series: Sensors, ISSN 1424-8220
published: 2024-12-01, vol. 24, no. 24, article 7948
doi: 10.3390/s24247948
title: Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection
authors and affiliations:
Zhen Gao, Xiaowen Chen, Jingning Xu: School of Computer Science and Technology, Tongji University, Shanghai 201804, China
Rongjie Yu: Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Shanghai 201804, China
Heng Zhang: Zhejiang Fengxing Huiyun Technology Co., Ltd., Hangzhou 311107, China
Jinqiu Yang: Department of Computer Science and Software Engineering, Concordia University, Montreal, QC H3G 1M8, Canada
url: https://www.mdpi.com/1424-8220/24/24/7948
keywords: fatigue detection; CLIP pre-trained model; Transformer; instance normalization; semantic analysis
title Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection
topic fatigue detection
CLIP pre-trained model
Transformer
instance normalization
semantic analysis
url https://www.mdpi.com/1424-8220/24/24/7948