An Image Grid Can Be Worth a Video: Zero-Shot Video Question Answering Using a VLM

Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced...

Bibliographic Details
Main Authors:	Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Image grid; video question answering; video representation; vision language model
Online Access:	https://ieeexplore.ieee.org/document/10802898/

Similar Items