MKER: multi-modal knowledge extraction and reasoning for future event prediction

Abstract: Humans can anticipate what will happen in the near future, an ability that is essential for survival but that machines still lack. To equip machines with this ability, we introduce the multi-modal knowledge extraction and reasoning (MKER) framework, which combines external commonsense knowledge, internal visual relation knowledge, and basic information to make inferences. The framework is built on an encoder-decoder structure with three essential components: a visual language reasoning module, an adaptive cross-modality feature fusion module, and a future event description generation module. The visual language reasoning module extracts the relationships among the most informative objects, together with the dynamic evolution of those relationships, from sequential scene graphs and commonsense graphs. A long short-term memory (LSTM) model is employed to track how object relationships change over time, forming a dynamic object-relationship representation. The adaptive cross-modality feature fusion module then aligns video and language information, using the object-relationship knowledge as guidance to learn a vision-language representation. Finally, the future event description generation module decodes the fused information and generates a language description of the next event. Experimental results demonstrate that MKER outperforms existing methods, and ablation studies further illustrate the effectiveness of the designed modules. This work advances the field by providing a way to predict future events, enhancing machines' ability to understand and interact with dynamic environments.
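The abstract describes a three-stage encoder-decoder pipeline. As a rough illustration of how such a pipeline could be wired together, here is a minimal PyTorch-style sketch; the class names, feature dimensions, attention-based fusion, and LSTM decoder are assumptions made for exposition and are not taken from the paper or any released code.

```python
# Illustrative sketch only: module names, shapes, and wiring are assumptions
# based on the abstract, not the authors' implementation.
import torch
import torch.nn as nn


class VisualLanguageReasoning(nn.Module):
    """Track how per-frame object-relation features (e.g. pooled from scene
    graphs and commonsense graphs) evolve over time with an LSTM."""

    def __init__(self, rel_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(rel_dim, hidden_dim, batch_first=True)

    def forward(self, rel_feats: torch.Tensor) -> torch.Tensor:
        # rel_feats: (batch, time, rel_dim) relation embeddings per frame
        out, _ = self.lstm(rel_feats)
        return out  # dynamic object-relationship representation


class AdaptiveFusion(nn.Module):
    """Fuse video and language features, using the relation representation
    as the query that guides cross-modal attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, relation, video, language):
        context = torch.cat([video, language], dim=1)  # (batch, Tv+Tl, dim)
        fused, _ = self.attn(relation, context, context)
        return fused


class EventDecoder(nn.Module):
    """Autoregressive description generator (teacher-forced here)."""

    def __init__(self, vocab: int = 10000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, fused, tokens):
        # Initialise the decoder state from the mean of the fused features.
        h0 = fused.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        c0 = torch.zeros_like(h0)
        dec, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(dec)  # (batch, seq, vocab) next-token logits


if __name__ == "__main__":
    B, T, D = 2, 8, 256
    rel = torch.randn(B, T, D)       # per-frame relation embeddings
    video = torch.randn(B, T, D)     # video features
    lang = torch.randn(B, 12, D)     # language (context caption) features
    tokens = torch.randint(0, 10000, (B, 6))

    relation = VisualLanguageReasoning()(rel)
    fused = AdaptiveFusion()(relation, video, lang)
    logits = EventDecoder()(fused, tokens)
    print(logits.shape)  # torch.Size([2, 6, 10000])
```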


Bibliographic Details
Main Authors: Chenghang Lai, Shoumeng Qiu
Format: Article
Language: English
Published: Springer, 2025-01-01
Series: Complex & Intelligent Systems
Subjects: Future event prediction; Multi-modal knowledge extraction and reasoning; External commonsense knowledge; Internal visual relation knowledge; Dynamic object relationship
Online Access: https://doi.org/10.1007/s40747-024-01741-4
author Chenghang Lai
Shoumeng Qiu
collection DOAJ
format Article
id doaj-art-f71db06eb3a94616850460ff161144c9
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-01-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
affiliation Chenghang Lai: School of Computer Science, Fudan University
Shoumeng Qiu: School of Computer Science, Fudan University
title MKER: multi-modal knowledge extraction and reasoning for future event prediction
topic Future event prediction
Multi-modal knowledge extraction and reasoning
External commonsense knowledge
Internal visual relation knowledge
Dynamic object relationship
url https://doi.org/10.1007/s40747-024-01741-4