Abstract
With technological advancements, we can now capture rich dialogue content, vocal tone, textual information, and visual data through tools like microphones, the internet, and cameras. However, relying on a single modality for emotion analysis often fails to reflect a speaker's true emotional state, because it overlooks the dynamic correlations between modalities. To address this, our study introduces a multimodal emotion recognition method that combines tensor decomposition fusion with self-supervised multi-task learning. The method first employs Tucker decomposition to substantially reduce the model's parameter count, lowering the risk of overfitting. It then builds a joint learning mechanism over multimodal and unimodal tasks and incorporates label generation to more accurately capture the emotional differences between modalities. Extensive experiments and analyses on the public CMU-MOSI and CMU-MOSEI datasets show that our method significantly outperforms existing approaches. The related code is open-sourced at https://github.com/ZhuJw31/MMER-TD.
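The parameter savings from Tucker-style fusion can be illustrated with a minimal sketch. All dimensions, ranks, and names below (`d_t`, `r_t`, `tucker_fuse`, etc.) are hypothetical and do not come from the paper: instead of a full fusion tensor over three modalities and the output, each modality is projected to a small rank and mixed through a shared core tensor.

```python
import numpy as np

# Hypothetical feature dimensions for text, audio, and video, plus output size
d_t, d_a, d_v, d_out = 64, 32, 32, 16
# A full trilinear fusion tensor would need d_t * d_a * d_v * d_out parameters
full_params = d_t * d_a * d_v * d_out

# Tucker-style fusion: low-rank factor matrices plus a small core tensor
r_t, r_a, r_v = 8, 8, 8  # assumed Tucker ranks, chosen for illustration
rng = np.random.default_rng(0)
W_t = rng.standard_normal((d_t, r_t))
W_a = rng.standard_normal((d_a, r_a))
W_v = rng.standard_normal((d_v, r_v))
core = rng.standard_normal((r_t, r_a, r_v, d_out))
tucker_params = W_t.size + W_a.size + W_v.size + core.size

def tucker_fuse(x_t, x_a, x_v):
    # Project each unimodal feature into its low-rank subspace
    z_t, z_a, z_v = x_t @ W_t, x_a @ W_a, x_v @ W_v
    # Contract the three projections against the shared core tensor
    return np.einsum('i,j,k,ijko->o', z_t, z_a, z_v, core)

fused = tucker_fuse(rng.standard_normal(d_t),
                    rng.standard_normal(d_a),
                    rng.standard_normal(d_v))
print(fused.shape)                      # (16,)
print(full_params, tucker_params)       # 1048576 vs 9216
```

With these example sizes, the factorized form uses roughly 1% of the parameters of the full fusion tensor, which is the overfitting-reduction effect the abstract describes.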
| Original language | English |
|---|---|
| Article number | 39 |
| Pages (from-to) | 1-14 |
| Number of pages | 14 |
| Journal | International Journal of Multimedia Information Retrieval |
| Volume | 13 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - Dec 2024 |
| Externally published | Yes |
Keywords
- Self-supervised
- Multi-tasking
- Emotion recognition
- Multimodal
Title: Multi-modal emotion recognition using tensor decomposition fusion and self-supervised multi-tasking