0

InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection

InterCLIP-MEP enhances multi-modal sarcasm detection by integrating a refined CLIP with a Memory-Enhanced Predictor, achieving state-of-the-art performance on the MMSD2.0 benchmark.

Year
2024
Venue
arXiv 2024
Authors
5
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2406.16464v5ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Sarcasm in social media, often expressed through text-image combinations, poses challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been demonstrated to overly rely on spurious cues within the textual modality, revealing a limited ability to genuinely identify sarcasm through nuanced text-image interactions. To solve this problem, we propose InterCLIP-MEP, which introduces Interactive CLIP (InterCLIP) with an efficient training strategy to extract enriched text-image representations by embedding cross-modal information directly into each encoder. Additionally, we design a Memory-Enhanced Predictor (MEP) with a dynamic dual-channel memory that stores valuable test sample knowledge during inference, acting as a non-parametric classifier for robust sarcasm recognition. Experiments on two benchmarks demonstrate that InterCLIP-MEP achieves state-of-the-art performance, with significant accuracy and F1 score improvements on MMSD and MMSD2.0. Our code is available at https://github.com/CoderChen01/InterCLIP-MEP.

Authors

5