Multimodal Representation Learning (Cross-Modal Retrieval, Video Captioning)

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

1. Motivation

The authors observe that many video-text tasks involve information at different levels of granularity, such as frame-word, clip-sentence, and video-paragraph pairs, each carrying distinct semantics.

The paper addresses two problems: 1) how to effectively exploit this hierarchical information; 2) how to model the interactions between data of different granularities and different modalities.

The proposed COOT model consists of three major components:

  • an attention-aware feature aggregation layer that leverages local temporal context (intra-level, e.g., within a clip);
  • a contextual transformer that learns interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph);
  • a cross-modal cycle-consistency loss for cross-modal interaction.

Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext

2. Model

[Figure: overview of the COOT model architecture]

3. Cross-Modal Cycle-Consistency Loss

A cycle implies a closed loop; this loss is used to learn semantic alignment between the two modalities. Its underlying assumption is: "A pair of clip and sentence will be identified as semantically aligned if they are nearest neighbors in the learned common spaces." In other words, once clip and sentence embeddings are mapped into the common space, a semantically aligned pair should be mutual nearest neighbors.

Define the clip sequence: $[\theta_{i}]_{i=1}^{n} = [\theta_{1}, \cdots, \theta_{n}]$

Define the sentence sequence: $[\delta_{i}]_{i=1}^{m} = [\delta_{1}, \cdots, \delta_{m}]$

Given a sentence $\delta_{i}$, its soft nearest neighbor among the clips is computed as: $$ \bar{\theta}_{\delta_{i}}=\sum_{j=1}^{n}\alpha_{j}\theta_{j} $$ where $\alpha_{j}$ is the similarity score of clip $\theta_{j}$ with respect to sentence $\delta_{i}$: $$ \alpha_{j} = \frac{\exp(-||\delta_{i}-\theta_{j}||^{2})}{\sum_{k=1}^{n}\exp(-||\delta_{i}-\theta_{k}||^{2})} $$
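As a quick sanity check, here is a minimal PyTorch sketch of this forward step (the function name `soft_nearest_neighbor` and the tensor shapes are my own assumptions, not from the official repo):

```python
import torch
import torch.nn.functional as F

def soft_nearest_neighbor(delta_i: torch.Tensor, clips: torch.Tensor) -> torch.Tensor:
    """Soft nearest neighbor of one sentence embedding among n clip embeddings.

    delta_i: (d,)   sentence embedding
    clips:   (n, d) clip embeddings theta_1..theta_n
    returns: (d,)   the weighted sum  sum_j alpha_j * theta_j
    """
    # alpha_j = exp(-||delta_i - theta_j||^2) / sum_k exp(-||delta_i - theta_k||^2)
    sq_dist = ((clips - delta_i) ** 2).sum(dim=-1)  # (n,)
    alpha = F.softmax(-sq_dist, dim=0)              # (n,)
    return alpha @ clips                            # (d,)
```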

Given $\bar{\theta}_{\delta_{i}}$, we can cycle back and compute a soft location: $$ \mu = \sum_{j=1}^{m}\beta_{j}\,j $$ where $\beta_{j}$ is the similarity score of sentence $\delta_{j}$ with respect to the clip vector $\bar{\theta}_{\delta_{i}}$: $$ \beta_{j} = \frac{\exp(-||\bar{\theta}_{\delta_{i}}-\delta_{j}||^{2})}{\sum_{k=1}^{m}\exp(-||\bar{\theta}_{\delta_{i}}-\delta_{k}||^{2})} $$ After this forward-backward cycle, the mapping is cross-modal cycle consistent if and only if $\mu=i$. As the paper puts it: "The sentence embedding $\delta_{i}$ is semantically cycle consistent if and only if it cycles back to the original location, i.e., $i=\mu$."
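Continuing the sketch above (it reuses the imports from the previous block; the 1-based positions are my reading of the paper's indexing):

```python
def soft_location(theta_bar: torch.Tensor, sentences: torch.Tensor) -> torch.Tensor:
    """Cycle back: soft index mu of theta_bar within the m sentence embeddings.

    theta_bar: (d,)   soft nearest neighbor from the forward step
    sentences: (m, d) sentence embeddings delta_1..delta_m
    returns:   scalar mu = sum_j beta_j * j
    """
    # beta_j = exp(-||theta_bar - delta_j||^2) / sum_k exp(-||theta_bar - delta_k||^2)
    sq_dist = ((sentences - theta_bar) ** 2).sum(dim=-1)  # (m,)
    beta = F.softmax(-sq_dist, dim=0)                     # (m,)
    # positions 1..m, matching the paper's 1-based index i
    positions = torch.arange(1, sentences.size(0) + 1, dtype=beta.dtype)
    return beta @ positions                               # scalar
```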

From this, the cross-modal cycle-consistency loss $l_{CMC}$ is: $$ l_{CMC} = ||i-\mu||^{2} $$
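Putting the two steps together, a hedged sketch of the full loss over all $m$ sentences (averaging over $i$ is my choice here; the paper defines the per-sentence term):

```python
def cmc_loss(sentences: torch.Tensor, clips: torch.Tensor) -> torch.Tensor:
    """Cross-modal cycle-consistency loss l_CMC, averaged over the m sentences."""
    m = sentences.size(0)
    loss = sentences.new_zeros(())
    for i in range(1, m + 1):                          # 1-based index i
        theta_bar = soft_nearest_neighbor(sentences[i - 1], clips)
        mu = soft_location(theta_bar, sentences)
        loss = loss + (i - mu) ** 2                    # ||i - mu||^2 for scalars
    return loss / m
```

Because both steps are soft (softmax-weighted), the loss is differentiable end to end and can be trained jointly with the retrieval objectives.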

Summary: $l_{CMC}$ starts from the basic notion of multimodal semantic alignment and strengthens the consistency between the two modalities through the cyclic mapping $A\rightarrow B, B\rightarrow A$. It operates on sentence-clip pairs whose sequence lengths differ, and aligns them implicitly rather than with explicit supervision; I feel this could be applied in future work.

4. Results

Note: in the figures below, the $R@N$ metric denotes the recall among the top-N retrieved results.

[Figures: quantitative results on benchmark datasets, reported with R@N metrics]
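For reference, a minimal sketch of how $R@N$ is typically computed for retrieval, given the rank of the ground-truth item for each query (illustrative only, not the paper's evaluation code; ranks are assumed 1-based):

```python
import torch

def recall_at_n(ranks: torch.Tensor, n: int) -> float:
    """R@N: fraction of queries whose ground-truth item ranks within the top n.

    ranks: (num_queries,) 1-based rank of the correct item for each query
    """
    return (ranks <= n).float().mean().item()
```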
