Multimodal Representation Learning (Cross-Modal Retrieval, Video Captioning)

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

1. Motivation

The authors observe that many video-text tasks involve information at different levels of granularity, such as frame-word, clip-sentence, and video-paragraph pairs, each carrying distinct semantics.

The paper addresses two problems: 1) how to effectively exploit this hierarchical information; 2) how to model the interactions between data of different granularities and different modalities.

The proposed COOT model consists of three major components:

  • an attention-aware feature aggregation layer that leverages local temporal context (intra-level, e.g., within a clip);
  • a contextual transformer that learns interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph);
  • a cross-modal cycle-consistency loss for cross-modal interaction.

Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clips and sentences or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists of three major components: an attention-aware feature aggregation layer, which leverages the local temporal context (intra-level, e.g., within a clip), a contextual transformer to learn the interactions between low-level and high-level semantics (inter-level, e.g., clip-video, sentence-paragraph), and a cross-modal cycle-consistency loss to connect video and text. The resulting method compares favorably to the state of the art on several benchmarks while having few parameters. All code is available open-source at https://github.com/gingsi/coot-videotext

2. Model

[Figure: overview of the COOT model architecture]

3. Cross-Modal Cycle-Consistency Loss

A cycle implies a closed loop; this loss is used to learn semantic alignment between the two modalities. Its underlying assumption is: "A pair of clip and sentence will be identified as semantically aligned if they are nearest neighbors in the learned common spaces." In other words, once clip and sentence embeddings are mapped into the common space, a semantically aligned pair should be mutual nearest neighbors.

Define the clip sequence: $[\theta_{i}]_{i=1}^{n} = [\theta_{1}, \cdots, \theta_{n}]$

Define the sentence sequence: $[\delta_{i}]_{i=1}^{m} = [\delta_{1}, \cdots, \delta_{m}]$

Given a sentence $\delta_{i}$, its soft nearest neighbor among the clips is computed as: $$ \bar{\theta}_{\delta_{i}}=\sum_{j=1}^{n}\alpha_{j}\theta_{j} $$ where $\alpha_{j}$ is the similarity score of clip $\theta_{j}$ with respect to sentence $\delta_{i}$: $$ \alpha_{j} = \frac{\exp(-||\delta_{i}-\theta_{j}||^{2})}{\sum_{k=1}^{n}\exp(-||\delta_{i}-\theta_{k}||^{2})} $$
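As a quick sanity check, here is a minimal PyTorch sketch of this forward step (the function name `soft_nearest_neighbor` and the tensor shapes are my own assumptions, not from the official repo):

```python
import torch
import torch.nn.functional as F

def soft_nearest_neighbor(delta_i: torch.Tensor, clips: torch.Tensor) -> torch.Tensor:
    """Soft nearest neighbor of one sentence embedding among n clip embeddings.

    delta_i: (d,)   sentence embedding
    clips:   (n, d) clip embeddings theta_1..theta_n
    returns: (d,)   the weighted sum  sum_j alpha_j * theta_j
    """
    # alpha_j = exp(-||delta_i - theta_j||^2) / sum_k exp(-||delta_i - theta_k||^2)
    sq_dist = ((clips - delta_i) ** 2).sum(dim=-1)  # (n,)
    alpha = F.softmax(-sq_dist, dim=0)              # (n,)
    return alpha @ clips                            # (d,)
```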

Given $\bar{\theta}_{\delta_{i}}$, we can cycle back and compute a soft location: $$ \mu = \sum_{j=1}^{m}\beta_{j}\,j $$ where $\beta_{j}$ is the similarity score of sentence $\delta_{j}$ with respect to the clip vector $\bar{\theta}_{\delta_{i}}$: $$ \beta_{j} = \frac{\exp(-||\bar{\theta}_{\delta_{i}}-\delta_{j}||^{2})}{\sum_{k=1}^{m}\exp(-||\bar{\theta}_{\delta_{i}}-\delta_{k}||^{2})} $$ After this forward-backward cycle, the mapping is cross-modal cycle consistent if and only if $\mu=i$. As the paper puts it: "The sentence embedding $\delta_{i}$ is semantically cycle consistent if and only if it cycles back to the original location, i.e., $i=\mu$."
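Continuing the sketch above (it reuses the imports from the previous block; the 1-based positions are my reading of the paper's indexing):

```python
def soft_location(theta_bar: torch.Tensor, sentences: torch.Tensor) -> torch.Tensor:
    """Cycle back: soft index mu of theta_bar within the m sentence embeddings.

    theta_bar: (d,)   soft nearest neighbor from the forward step
    sentences: (m, d) sentence embeddings delta_1..delta_m
    returns:   scalar mu = sum_j beta_j * j
    """
    # beta_j = exp(-||theta_bar - delta_j||^2) / sum_k exp(-||theta_bar - delta_k||^2)
    sq_dist = ((sentences - theta_bar) ** 2).sum(dim=-1)  # (m,)
    beta = F.softmax(-sq_dist, dim=0)                     # (m,)
    # positions 1..m, matching the paper's 1-based index i
    positions = torch.arange(1, sentences.size(0) + 1, dtype=beta.dtype)
    return beta @ positions                               # scalar
```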

From this, the cross-modal cycle-consistency loss $l_{CMC}$ is: $$ l_{CMC} = ||i-\mu||^{2} $$
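Putting the two steps together, a hedged sketch of the full loss over all $m$ sentences (averaging over $i$ is my choice here; the paper defines the per-sentence term):

```python
def cmc_loss(sentences: torch.Tensor, clips: torch.Tensor) -> torch.Tensor:
    """Cross-modal cycle-consistency loss l_CMC, averaged over the m sentences."""
    m = sentences.size(0)
    loss = sentences.new_zeros(())
    for i in range(1, m + 1):                          # 1-based index i
        theta_bar = soft_nearest_neighbor(sentences[i - 1], clips)
        mu = soft_location(theta_bar, sentences)
        loss = loss + (i - mu) ** 2                    # ||i - mu||^2 for scalars
    return loss / m
```

Because both steps are soft (softmax-weighted), the loss is differentiable end to end and can be trained jointly with the retrieval objectives.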

Summary: $l_{CMC}$ starts from the basic notion of multimodal semantic alignment and strengthens the consistency between the two modalities through the cyclic mapping $A\rightarrow B, B\rightarrow A$. It operates on sentence-clip pairs whose sequence lengths differ, and aligns them implicitly rather than with explicit supervision; I feel this could be applied in future work.

4. Results

Note: in the figures below, the $R@N$ metric denotes the recall among the top-N retrieved results.

[Figures: quantitative results on benchmark datasets, reported with R@N metrics]
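For reference, a minimal sketch of how $R@N$ is typically computed for retrieval, given the rank of the ground-truth item for each query (illustrative only, not the paper's evaluation code; ranks are assumed 1-based):

```python
import torch

def recall_at_n(ranks: torch.Tensor, n: int) -> float:
    """R@N: fraction of queries whose ground-truth item ranks within the top n.

    ranks: (num_queries,) 1-based rank of the correct item for each query
    """
    return (ranks <= n).float().mean().item()
```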
