Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning

Managing traffic flow through urban intersections is challenging. Conflicts involving a mix of vehicle types, combined with blind spots, make such intersections vulnerable to crashes. This paper presents a new framework based on a fine-tuned Multimodal Large Language Model (MLLM), GPT-4o, that can...


Bibliographic Details
Main Authors: Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
Format: Article
Language:English
Published: MDPI AG 2025-05-01
Series:Safety
Subjects:
Online Access:https://www.mdpi.com/2313-576X/11/2/40
_version_ 1849704750140882944
author Sari Masri
Huthaifa I. Ashqar
Mohammed Elhenawy
author_facet Sari Masri
Huthaifa I. Ashqar
Mohammed Elhenawy
author_sort Sari Masri
collection DOAJ
description Managing traffic flow through urban intersections is challenging. Conflicts involving a mix of vehicle types, combined with blind spots, make such intersections vulnerable to crashes. This paper presents a new framework based on a fine-tuned Multimodal Large Language Model (MLLM), GPT-4o, that can control intersections in real time using bird's-eye-view videos captured by drones. The fine-tuned GPT-4o model reasons logically and visually about traffic conflicts and issues instructions to drivers, helping to create a safer and more efficient traffic flow. To fine-tune and evaluate the model, we labeled a dataset comprising three months of drone videos, and their corresponding trajectories, recorded at a four-way intersection in Dresden, Germany. Preliminary results showed that the fine-tuned GPT-4o achieved an accuracy of about 77%, outperforming zero-shot baselines. Moreover, when continuous video-frame sequences were used, performance increased to about 89% on a time-serialized dataset and about 90% on an unbalanced real-world dataset, demonstrating the model's robustness under different conditions. Furthermore, experts manually scored the usefulness of the model's predicted explanations and recommendations: the model achieved an average rating of 8.99 out of 10 for explanations and 9.23 out of 10 for recommendations. The results demonstrate the advantages of combining MLLMs with structured prompts and temporal information for conflict detection, and they offer a flexible and robust prototype framework for improving the safety and effectiveness of uncontrolled intersections. The code and labeled dataset used in this study are publicly available (see Data Availability Statement).
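The framework described in the abstract pairs a structured text prompt with a temporally ordered sequence of bird's-eye-view frames. A minimal sketch of how such a request payload might be assembled for the OpenAI chat-completions API is shown below; the prompt wording, placeholder model ID, and frame encoding are illustrative assumptions, not the authors' published configuration.

```python
import base64

# Illustrative sketch only: the model ID and prompt text below are
# hypothetical stand-ins, not the fine-tuned model from the paper.
def build_conflict_request(frame_jpegs, model="ft:gpt-4o:example::demo"):
    """Assemble a chat-completions payload that pairs a structured text
    prompt with a temporally ordered sequence of BEV frames."""
    content = [{
        "type": "text",
        "text": (
            "You observe consecutive drone frames of a four-way "
            "unsignalized intersection. Decide whether a traffic conflict "
            "is developing, explain your reasoning, and recommend "
            "instructions for the drivers involved."
        ),
    }]
    for jpeg_bytes in frame_jpegs:  # preserve the temporal frame order
        b64 = base64.b64encode(jpeg_bytes).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": content}]}

payload = build_conflict_request([b"\xff\xd8fake-frame\xff\xd9"] * 3)
print(len(payload["messages"][0]["content"]))  # 1 text part + 3 image parts
```

The resulting payload could then be submitted with the OpenAI client, e.g. `client.chat.completions.create(**build_conflict_request(frames))`; the model's reply would carry the conflict decision, explanation, and recommendation described in the abstract.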
format Article
id doaj-art-7e8f30598cb248b6a64571d58d5df9dd
institution DOAJ
issn 2313-576X
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Safety
spelling doaj-art-7e8f30598cb248b6a64571d58d5df9dd2025-08-20T03:16:39ZengMDPI AGSafety2313-576X2025-05-011124010.3390/safety11020040Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and ReasoningSari Masri0Huthaifa I. Ashqar1Mohammed Elhenawy2Natural, Engineering and Technology Sciences Department, Arab American University, 13 Zababdeh, Jenin P.O. Box 240, PalestineAI and Data Science Department, Arab American University, 13 Zababdeh, Jenin P.O. Box 240, PalestineCARRS-Q and Centre for Data Science, Queensland University of Technology, Brisbane, QLD 4059, AustraliaManaging traffic flow through urban intersections is challenging. Conflicts involving a mix of vehicle types, combined with blind spots, make such intersections vulnerable to crashes. This paper presents a new framework based on a fine-tuned Multimodal Large Language Model (MLLM), GPT-4o, that can control intersections in real time using bird's-eye-view videos captured by drones. The fine-tuned GPT-4o model reasons logically and visually about traffic conflicts and issues instructions to drivers, helping to create a safer and more efficient traffic flow. To fine-tune and evaluate the model, we labeled a dataset comprising three months of drone videos, and their corresponding trajectories, recorded at a four-way intersection in Dresden, Germany. Preliminary results showed that the fine-tuned GPT-4o achieved an accuracy of about 77%, outperforming zero-shot baselines. Moreover, when continuous video-frame sequences were used, performance increased to about 89% on a time-serialized dataset and about 90% on an unbalanced real-world dataset, demonstrating the model's robustness under different conditions. Furthermore, experts manually scored the usefulness of the model's predicted explanations and recommendations: the model achieved an average rating of 8.99 out of 10 for explanations and 9.23 out of 10 for recommendations. The results demonstrate the advantages of combining MLLMs with structured prompts and temporal information for conflict detection, and they offer a flexible and robust prototype framework for improving the safety and effectiveness of uncontrolled intersections. The code and labeled dataset used in this study are publicly available (see Data Availability Statement).https://www.mdpi.com/2313-576X/11/2/40conflict detectionfine-tuningMultimodal Large Language Models (MLLMs)prompt designunsignalized intersectionsurban traffic management
spellingShingle Sari Masri
Huthaifa I. Ashqar
Mohammed Elhenawy
Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
Safety
conflict detection
fine-tuning
Multimodal Large Language Models (MLLMs)
prompt design
unsignalized intersections
urban traffic management
title Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
title_full Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
title_fullStr Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
title_full_unstemmed Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
title_short Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
title_sort leveraging bird eye view video and multimodal large language models for real time intersection control and reasoning
topic conflict detection
fine-tuning
Multimodal Large Language Models (MLLMs)
prompt design
unsignalized intersections
urban traffic management
url https://www.mdpi.com/2313-576X/11/2/40
work_keys_str_mv AT sarimasri leveragingbirdeyeviewvideoandmultimodallargelanguagemodelsforrealtimeintersectioncontrolandreasoning
AT huthaifaiashqar leveragingbirdeyeviewvideoandmultimodallargelanguagemodelsforrealtimeintersectioncontrolandreasoning
AT mohammedelhenawy leveragingbirdeyeviewvideoandmultimodallargelanguagemodelsforrealtimeintersectioncontrolandreasoning