Invariant Representation Learning in Multimedia Recommendation with Modality Alignment and Model Fusion

Bibliographic Details
Main Authors:	Xinghang Hu, Haiteng Zhang
Format:	Article
Language:	English
Published:	MDPI AG 2025-01-01
Series:	Entropy
Subjects:	multimedia recommendation; model fusion; multimodal representation
Online Access:	https://www.mdpi.com/1099-4300/27/1/56

Description
Summary:	Multimedia recommendation systems aim to accurately predict user preferences from multimodal data. However, existing methods may learn a recommendation model from spurious features, i.e., features that appear to be related to an outcome but have no causal relationship with it, leading to poor generalization ability. While previous approaches have adopted invariant learning to address this issue, they simply concatenate multimodal data without proper alignment, resulting in information loss or redundancy. To overcome these challenges, we propose a framework called M³-InvRL, designed to enhance recommendation system performance through common and modality-specific representation learning, invariant learning, and model merging. Specifically, our approach begins by learning modality-specific representations along with a common representation for each modality. To achieve this, we introduce a novel contrastive loss that aligns representations and imposes mutual information constraints to extract modality-specific features, thereby preventing generalization issues within the same representation space. Next, we generate invariant masks based on the identification of heterogeneous environments to learn invariant representations. Finally, we integrate both invariant-specific and shared invariant representations for each modality to train models and fuse them in the output space, reducing uncertainty and enhancing generalization performance. Experiments on real-world datasets demonstrate the effectiveness of our approach.
ISSN:	1099-4300
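
The last step in the summary, fusing separately trained per-modality models in the output space, can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how the outputs are combined, so the uniform average used here is an assumption, and the modality names, scores, and function names (`fuse_scores`, `top_k`) are hypothetical.

```python
from typing import Dict, List


def fuse_scores(scores_by_modality: Dict[str, List[float]]) -> List[float]:
    """Average per-item scores from several modality-specific models.

    Assumption: uniform averaging stands in for the fusion rule, which
    the abstract does not specify.
    """
    if not scores_by_modality:
        raise ValueError("need at least one modality")
    score_lists = list(scores_by_modality.values())
    n_items = len(score_lists[0])
    if any(len(s) != n_items for s in score_lists):
        raise ValueError("all modalities must score the same items")
    n_models = len(score_lists)
    return [sum(s[i] for s in score_lists) / n_models for i in range(n_items)]


def top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring items, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical preference scores for three items from three models,
# one model per modality (names and values are illustrative only).
scores = {
    "visual": [0.9, 0.2, 0.4],
    "acoustic": [0.1, 0.8, 0.5],
    "textual": [0.2, 0.7, 0.6],
}
fused = fuse_scores(scores)
ranking = top_k(fused, 2)
print(ranking)
```

In this toy case the visual model alone would rank item 0 first, while the fused scores rank items 1 and 2 ahead of it. That is the intuition behind combining outputs to reduce the influence of any single model.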

Similar Items