Soccer-CLIP: Vision Language Model for Soccer Action Spotting

Bibliographic Details
Main Authors: Yoonho Shin, Sanghoon Park, Youngsub Han, Byoung-Ki Jeon, Soonyoung Lee, Byung Jun Kang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10916659/
Description
Summary: In the rapidly advancing field of computer vision, the application of multimodal models, specifically vision-language frameworks, has shown substantial promise for complex tasks such as video-based action spotting. This paper introduces Soccer-CLIP, a vision-language model specifically designed for soccer action spotting. Soccer-CLIP incorporates an innovative domain-specific prompt engineering strategy, leveraging large language models (LLMs) to refine textual representations for precise alignment with soccer-specific actions. Our model integrates visual and textual features to enhance the recognition accuracy of critical soccer events. Together with temporal augmentation techniques devised for the input videos, Soccer-CLIP builds on existing methodologies to address the inherent challenge of temporally sparse event annotations within video sequences. Evaluations on the SoccerNet Action Spotting benchmark demonstrate that Soccer-CLIP outperforms previous state-of-the-art models, underscoring the model's capacity to capture domain-specific contextual nuances. This work represents a significant advancement in automated sports analysis, providing a robust and adaptable framework for broader applications in video recognition and temporal action localization tasks.
ISSN: 2169-3536
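
To make the prompt-to-frame alignment idea in the abstract concrete, the following minimal Python sketch scores sampled video frames against hand-written soccer-action prompts with an off-the-shelf CLIP model. This is not the authors' code: the prompt wording, frame sampling, confidence threshold, and peak picking are illustrative assumptions, and Soccer-CLIP's LLM-refined prompts and temporal augmentation go well beyond this zero-shot baseline.

# Illustrative zero-shot sketch of CLIP-style prompt-to-frame alignment
# for soccer action spotting. NOT Soccer-CLIP's implementation: prompts,
# frame sampling, and thresholding below are assumptions for exposition.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hand-written action prompts (the paper refines such text with an LLM).
prompts = [
    "a soccer player scoring a goal",
    "a referee showing a yellow card in a soccer match",
    "a soccer player taking a corner kick",
    "an uneventful moment of soccer play",  # background class
]

# Stand-in for frames sampled from broadcast footage; in practice these
# would be decoded from the match video at a fixed rate (e.g. 2 fps).
frames = [Image.new("RGB", (224, 224)) for _ in range(8)]

inputs = processor(text=prompts, images=frames,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image has shape (num_frames, num_prompts); a softmax over
# the prompt axis gives a per-frame class distribution.
probs = out.logits_per_image.softmax(dim=-1)

# Naive spotting: flag frames where an action class (not background)
# dominates. A real pipeline would smooth scores over time and apply
# non-maximum suppression to produce sparse action timestamps.
conf, pred = probs.max(dim=-1)
for t, (c, p) in enumerate(zip(pred.tolist(), conf.tolist())):
    if c < len(prompts) - 1 and p > 0.5:
        print(f"frame {t}: candidate '{prompts[c]}' (p={p:.2f})")

Per-frame classification is the simplest way to turn image-text similarity into temporal spotting; the sparse-annotation challenge the abstract mentions is precisely why naive frame-level scoring is insufficient and why the paper adds temporal augmentation.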