1. ERC task definition, application scenarios, and key challenges
2. Common datasets
Dataset | Conversations: Train/Valid/Test | Utterances: Train/Valid/Test (total) |
MELD | 1038/114/280 | 9989/1109/2610 (13708) |
IEMOCAP | 100/20/31 | 4810/1000/1623 (7433) |
Dataset | Max | Min | Mean |
MELD | 24 | 1 | 9.62 |
Dataset | Max | Min | Mean |
MELD | 321 | 1 | 77.33 |
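MELD: high-quality aligned data in three modalities (text, visual, audio), with 304 distinct speakers; every conversation involves three or more speakers. Every utterance is labeled with one of seven emotions (anger, disgust, fear, joy, neutral, sadness, and surprise) and one of three sentiments (neutral, positive, and negative). The original dataset already includes Valid and Test splits.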
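IEMOCAP: nearly 12 hours of two-way (dyadic) conversation data in three modalities (text, visual, audio), with 10 speakers. Every utterance is labeled with one of six emotions (happy, sad, neutral, angry, excited, and frustrated). The original dataset has no Valid split; M2FNet (CVPR2022) randomly selects 10% of the Train set as Valid.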
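A random hold-out of this kind can be sketched as follows. This is only a minimal illustration: the function name `split_train_valid`, the fixed seed, the placeholder IDs, and the choice to split at the conversation level (rather than the utterance level) are all assumptions, not details taken from M2FNet.

```python
import random

def split_train_valid(conversation_ids, valid_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training conversations as Valid.

    Assumption: the split is done per conversation, so that all utterances
    of one dialogue end up on the same side of the split.
    """
    ids = sorted(conversation_ids)   # sort first so the result is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_valid = max(1, round(len(ids) * valid_fraction))
    return ids[n_valid:], ids[:n_valid]   # (train, valid)

# Placeholder IDs for illustration; real IEMOCAP dialogue names differ.
all_ids = [f"conv_{i}" for i in range(120)]
train_ids, valid_ids = split_train_valid(all_ids)
print(len(train_ids), len(valid_ids))
```

Fixing the seed keeps the Valid set stable across runs, which matters when comparing models tuned on it.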
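Q1: The original datasets provide no conversation-level labels, yet ERC is emotion recognition over conversations. How should this be understood, and how have previous papers handled it? A1: ERC performance is evaluated at the utterance level. Unlike general ER, ERC has to take the conversational context of each utterance into account. For example, the training input of MMGCN is a batch of 16 conversations.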
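To make the "conversation as input, utterance as evaluation unit" idea concrete, here is a small sketch. It pads a batch of conversations to a common length and then scores only the real utterances. The helper names `pad_batch` and `utterance_accuracy` and the `"<pad>"` label are hypothetical and not taken from the MMGCN code.

```python
def pad_batch(conversations, pad_label="<pad>"):
    """Pad conversations (lists of utterance labels) to the same length.

    Returns (padded, mask); mask[i][j] is True for a real utterance and
    False for a padding slot.
    """
    max_len = max(len(conv) for conv in conversations)
    padded, mask = [], []
    for conv in conversations:
        n_pad = max_len - len(conv)
        padded.append(list(conv) + [pad_label] * n_pad)
        mask.append([True] * len(conv) + [False] * n_pad)
    return padded, mask

def utterance_accuracy(pred, gold, mask):
    """Accuracy over real utterances only; padded positions are ignored."""
    correct = total = 0
    for p_row, g_row, m_row in zip(pred, gold, mask):
        for p, g, m in zip(p_row, g_row, m_row):
            if m:
                total += 1
                correct += (p == g)
    return correct / total

# A batch of two conversations with 3 and 1 utterances.
gold, mask = pad_batch([["joy", "neutral", "anger"], ["sadness"]])
# Model outputs for every slot, including the padded ones.
pred = [["joy", "neutral", "joy"], ["sadness", "joy", "joy"]]
acc = utterance_accuracy(pred, gold, mask)
print(acc)  # 0.75: 3 of the 4 real utterances are correct
```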
3. Evaluation metrics
- M2FNet (CVPR2022): accuracy / weighted average F1 score
- CKE-Net (ICASSP2022): weighted F1 score
- MMGCN (ACL2021): weighted accuracy / weighted average F1 score
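For reference, the weighted average F1 score is the per-class F1 averaged with weights proportional to each class's number of gold utterances. The standard-library sketch below shows this definition next to plain accuracy. The function names are illustrative, and the papers' own evaluation scripts may differ in detail.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of utterances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def weighted_f1(gold, pred):
    """Per-class F1, averaged with weights equal to the class support."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

gold = ["neutral", "neutral", "neutral", "joy", "anger"]
pred = ["neutral", "neutral", "joy", "joy", "neutral"]
acc = accuracy(gold, pred)
wf1 = weighted_f1(gold, pred)
print(acc, wf1)
```

Because MELD and IEMOCAP are class-imbalanced, the weighted F1 is dominated by frequent classes such as neutral; a class that is never predicted correctly (anger above) still lowers the score in proportion to its support.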
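4. Feature extraction
- M2FNet (CVPR2022): RoBERTa for the text modality and ResNet18 for the visual and audio modalities. The visual extractor is trained on the CASIA WebFace database, and the audio extractor is trained on Mel spectrograms obtained from the corresponding audio signals. The authors also construct a loss function for training the feature extractors, combining a margin-based triplet loss $L_{AMT}$, a covariance loss $L_{Cov}$, and a variance loss $L_{Var}$: $L_{FE}=\lambda_1 L_{AMT} + \lambda_2 L_{Cov} + \lambda_3 L_{Var}$
- CKE-Net (ICASSP2022): no details given.
- MMGCN (ACL2021): TextCNN for the text modality, a DenseNet fine-tuned on the FER+ dataset for the visual modality, and the openSMILE toolkit with the IS10 configuration for the audio modality.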
5. Related work
5.1 MMGCN (ACL2021, Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation)
Affiliation: Renmin University of China, Qin Jin's group. Paper link
Abstract: Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users’ emotions and generate empathetic responses. However, most works focus on modeling speaker and contextual information primarily on the textual modality or simply leveraging multimodal information through feature concatenation. In order to explore a more effective way of utilizing both multimodal and long-distance contextual information, we propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work. MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency. We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other SOTA methods by a significant margin under the multimodal conversation setting.
$\mathcal{H}^{(l+1)}=\sigma\left(\left((1-\alpha)\widetilde{\mathcal{P}}\mathcal{H}^{(l)}+\alpha\mathcal{H}^{(0)}\right)\left((1-\beta^{(l)})\mathcal{L}+\beta^{(l)}\mathcal{W}^{(l)}\right)\right)$
$\beta^{(l)}=\log{(\frac{\eta}{l} + 1)}$
$\mathcal{L}$ is the identity mapping, which acts as a residual mechanism.
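The layer update above can be sketched in plain Python to show how the two residual paths interact. This is an illustration only, under stated assumptions: $\sigma$ is taken to be ReLU, the layer index $l$ starts at 1 so that $\beta^{(l)}$ is defined, $\mathcal{L}$ is implemented as an identity matrix, and the names `matmul`, `beta`, and `gcn_layer` are hypothetical rather than taken from the MMGCN code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def beta(layer, eta):
    """beta^{(l)} = log(eta / l + 1); shrinks as the layer index l grows."""
    return math.log(eta / layer + 1.0)

def gcn_layer(h, h0, p, w, layer, alpha, eta):
    """sigma(((1 - alpha) P H + alpha H0) ((1 - beta) I + beta W)).

    Assumptions: sigma is ReLU and layer >= 1.
    """
    b = beta(layer, eta)
    ph = matmul(p, h)
    # Initial residual: mix propagated features with the first-layer input.
    support = [[(1 - alpha) * x + alpha * y for x, y in zip(r, r0)]
               for r, r0 in zip(ph, h0)]
    d = len(w)
    # Identity mapping: mix the identity matrix with the layer weights.
    mix = [[(1 - b) * (1.0 if i == j else 0.0) + b * w[i][j]
            for j in range(d)] for i in range(d)]
    out = matmul(support, mix)
    return [[max(0.0, v) for v in row] for row in out]   # ReLU

# Toy check: with P = I, H = H0 and W = 0 the layer reduces to
# ReLU((1 - beta) * H).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
h = [[1.0, -2.0], [0.5, 3.0]]
out = gcn_layer(h, h, identity, zeros, layer=1, alpha=0.1, eta=0.5)
print(out)
```

Since $\beta^{(l)}$ decays with depth, deeper layers stay closer to the identity mapping, while $\alpha$ keeps a fixed share of $\mathcal{H}^{(0)}$ in every layer.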