Paper notes: Learning Disentangled Representations of Video with Missing Data

2023-01-31 Categories: paper notes, paper reading

NeurIPS 2020

1 intro & abstract

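A key challenge in video representation learning is that video is high-dimensional and dynamic, and its pixels follow a multi-modal distribution.
- Recent work exploits the inductive biases of video to map the high-dimensional data into a low-dimensional space
- -> These methods obtain a disentangled representation by decomposing the video frames into semantically meaningful factors
- --> However, existing methods do not model the video well when objects in it are missing
This paper aims to learn disentangled representations of video with missing data.
- It proposes DIVE (Disentangled-Imputed-Video autoEncoder), which
  - learns a video representation by decomposing the video into three latent variables: appearance, pose, and missingness
  - imputes the missing data in the video from the learned disentangled latent variables
  - uses the imputed video representation for stochastic, unsupervised video prediction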

2 related work

2.1 Disentangled representations

Unsupervised learning of disentangled representations for sequences generally falls into three categories.

VAE-based

Learning to decompose and disentangle representations for video prediction. NIPS 2018
Sequential attend, infer, repeat: Generative modelling of moving objects. NIPS 2018
Compositional video prediction. NIPS 2018
Unsupervised learning of disentangled and interpretable representations from sequential data. NIPS 2017
Structured object-aware physics prediction for video modeling and planning. ICLR 2020
R-SQAIR: Relational sequential attend, infer, repeat. arXiv 2019

GAN-based

Unsupervised learning of disentangled representations from video. NIPS 2017
Decomposing motion and content for natural video sequence prediction. arXiv 2019
Stochastic video generation with a learned prior. ICML 2018

Sum-product based

Structured object-aware physics prediction for video modeling and planning. ICLR 2020

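For video data, the most common approach is to encode the frames into latent variables and then decompose the latent representation into content and dynamics factors.
- The content of a video (objects, background, ...) is fixed
- The position of the objects in the video keeps changing
- --> However, most models can only handle video without missing data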

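2.2 Video prediction

Video prediction generally means predicting future frames from past frames.
- Typical models include LSTM, ConvLSTM, and PredRNN
- The problem with these models is that they predict deterministic frames, which does not capture the uncertainty of future frames well
The paper instead uses stochastic video prediction, which better captures the stochastic dynamics of the environment.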

3 Disentangled-Imputed-Video-autoEncoder (DIVE)

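The authors assume that each video contains at most N objects. The model observes a video sequence of K time steps and predicts frames K+1 through T.
Overall the model is a VAE that decomposes each object in the video into three latent variables: appearance, pose, and missingness.

Denote the video with missing data by $\left(\mathbf{y}^1, \cdots, \mathbf{y}^t\right)$, where each $\mathbf{y}^t$ is one frame.
- The paper aims to learn a latent representation of the video and disentangle it into three latent variables, $\mathbf{z}_i^t=\left[\mathbf{z}_{i, a}^t, \mathbf{z}_{i, p}^t, \mathbf{z}_{i, m}^t\right]$
  - $i$ indexes the $i$-th object in the video
  - $z_{i,a}^t$ is the $h$-dimensional appearance vector
  - $z_{i,p}^t$ is a 3-dimensional vector describing the pose (x and y coordinates and scale)
  - $z_{i,m}^t$ is the 0/1 missingness label (1 means the object is occluded or missing)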
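
As a rough illustration of the three latent variables above, the latent of one object at one time step could be held in a small container like the one below. This is only a sketch: the class name `ObjectLatent` and its field names are made up for these notes and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectLatent:
    """Hypothetical container for the latent of object i at time t."""

    appearance: List[float]  # z_{i,a}^t, an h-dimensional vector
    pose: List[float]        # z_{i,p}^t = [x, y, scale]
    missing: int             # z_{i,m}^t, 1 = occluded or missing

    def __post_init__(self) -> None:
        # The notes state that the pose has exactly three entries
        # and that the missingness label is binary.
        if len(self.pose) != 3:
            raise ValueError("pose must have 3 entries: x, y, scale")
        if self.missing not in (0, 1):
            raise ValueError("missing must be 0 or 1")


z = ObjectLatent(appearance=[0.1, 0.2, 0.3, 0.4], pose=[0.0, 0.0, 1.0], missing=0)
```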

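3.1 Imputation model

The imputation model updates its hidden state according to the missingness label $z_{i,m}^t$.
When no data is missing, the hidden state $\mathbf{h}_{i, y}^t$ is computed by the encoder $f_{enc}$ from the observed frame $\mathbf{y}^t$ and the state of the previous object $i-1$:
- $f_{enc}$ is a bidirectional LSTM
- $i-1$ is the previous object (I could not find in the paper how the objects in a video are ordered; readers familiar with this area are welcome to fill this in)
When data is missing, $\mathbf{y}^t$ is not available and has to be imputed. The hidden state of the imputed content is $\hat{\mathbf{h}}_{i, y}^t=\operatorname{FC}\left(\mathbf{h}_{i, p}^{t-1}\right)$:
- FC is a fully connected layer, and $\mathbf{h}_{i, p}^{t-1}$ is the pose hidden state introduced later in this subsection
Denote the resulting input vector by $\mathbf{u}_i^t$; it selects between the two hidden states according to the missingness label:
- $\mathbf{u}_i^t=\left\{\begin{array}{ll} \hat{\mathbf{h}}_{i, y}^t & \mathbf{z}_{i, m}^t=1 \\ \gamma \mathbf{h}_{i, y}^t+(1-\gamma) \hat{\mathbf{h}}_{i, y}^t & \mathbf{z}_{i, m}^t=0 \end{array}, \quad \gamma \sim \operatorname{Bernoulli}(p)\right.$
- Even when no data is missing, a mixture of $\mathbf{h}_{i, y}^t$ and $\hat{\mathbf{h}}_{i, y}^t$ is used (since $\gamma$ is Bernoulli, each step picks one of the two at random); the paper finds that this works better
- The only input is the video y with missing values, so the missingness label $z_{i,m}^t$ is not known directly; whether it is 0 or 1 is determined by the missingness inference in Section 3.2.1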
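
The selection rule for $\mathbf{u}_i^t$ can be sketched in plain Python as follows. The function name `select_hidden_state`, the list-of-floats representation, and the default `p=0.5` are assumptions made for illustration; the notes do not give the value of $p$ or any implementation details.

```python
import random


def select_hidden_state(h_obs, h_imputed, z_missing, p=0.5, rng=random):
    """Return u_i^t given the observed and imputed hidden states.

    h_obs     : encoder state h_{i,y}^t (list of floats)
    h_imputed : imputed state h_hat_{i,y}^t (list of floats)
    z_missing : missingness label, 1 = missing
    p         : Bernoulli probability that gamma = 1 (assumed value)
    """
    if z_missing == 1:
        # Data is missing: only the imputed state can be used.
        return list(h_imputed)
    # Data is present: gamma ~ Bernoulli(p) picks the observed state
    # (gamma = 1) or the imputed state (gamma = 0).
    gamma = 1 if rng.random() < p else 0
    return [gamma * ho + (1 - gamma) * hi for ho, hi in zip(h_obs, h_imputed)]


h_obs = [1.0, 2.0]
h_imputed = [10.0, 20.0]
u_missing = select_hidden_state(h_obs, h_imputed, z_missing=1)
u_present = select_hidden_state(h_obs, h_imputed, z_missing=0, p=1.0)
```

With `p=1.0` the observed state is always used when the label is 0, and with `p=0.0` the imputed state is always used, which makes the two extremes of the rule easy to check.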

The pose hidden state is updated by an LSTM:
- $\mathbf{h}_{i, p}^t=\operatorname{LSTM}\left(\mathbf{h}_{i, p}^{t-1}, \mathbf{u}_i^t\right)$

3.2 Inference model

Initially only the video data y is available, so how is z obtained?

3.2.1 missingness inference

The missingness variable is inferred as follows:
- $\mathbf{z}_{i, m}^t=H(x), \quad x \sim \mathcal{N}\left(\mu_m, \sigma_m^2\right), \quad\left[\mu_m, \sigma_m^2\right]=\operatorname{FC}\left(\mathbf{h}_{i, y}^t\right), \quad H(x)= \begin{cases}1 & x \geq 0 \\ 0 & x<0\end{cases}$
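
The sampling step in this formula can be sketched as below. In the model, $\mu_m$ and $\sigma_m^2$ come from a fully connected layer applied to $\mathbf{h}_{i, y}^t$; here they are simply passed in as numbers. The function names are hypothetical, and the sketch covers only the forward sampling, not how gradients are handled during training.

```python
import math
import random


def heaviside(x):
    """H(x) = 1 if x >= 0 else 0, as in the formula above."""
    return 1 if x >= 0 else 0


def sample_missingness(mu_m, var_m, rng=random):
    """Draw x ~ N(mu_m, var_m) and threshold it to a 0/1 label."""
    # random.gauss expects a standard deviation, so convert the variance.
    x = rng.gauss(mu_m, math.sqrt(var_m))
    return heaviside(x)


rng = random.Random(0)
labels = [sample_missingness(0.0, 1.0, rng) for _ in range(200)]
```

A strongly positive mean makes the label almost always 1 (missing), a strongly negative mean almost always 0, and a mean near 0 gives a mix of both.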

3.2.2 pose inference

$q\left(\mathbf{z}_{i, p}^{1: T} \mid \mathbf{y}^{1: K}\right)=\prod_{t=1}^K q\left(\mathbf{z}_{i, p}^t \mid \mathbf{z}_{i, p}^{1: t-1}\right), \quad \mathbf{z}_{i, p}^t=f_{\operatorname{tran}}\left(\mathbf{z}_{i, p}^{t-1}, \beta_i^t\right), \quad \beta_i^t \sim \mathcal{N}\left(\mu_p, \sigma_p^2\right), \quad\left[\mu_p, \sigma_p^2\right]=\operatorname{FC}\left(\mathbf{h}_{i, p}^t\right)$
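
The recursion $\mathbf{z}_{i, p}^t=f_{\operatorname{tran}}\left(\mathbf{z}_{i, p}^{t-1}, \beta_i^t\right)$ can be illustrated with a toy rollout. These notes do not define $f_{\operatorname{tran}}$, so the sketch below assumes a simple additive transition purely for illustration; `f_tran_additive` and `rollout_pose` are hypothetical names, and the per-step means and standard deviations stand in for the outputs of the FC layer.

```python
import random


def f_tran_additive(z_prev, beta):
    """Assumed transition: add the sampled offset beta to the previous pose."""
    return [z + b for z, b in zip(z_prev, beta)]


def rollout_pose(z0, mus, sigmas, rng=random):
    """Sample a pose trajectory starting from z0 = [x, y, scale].

    mus, sigmas : one [mu_x, mu_y, mu_scale] / [sd_x, sd_y, sd_scale]
                  pair per time step (standard deviations, not variances).
    """
    poses = []
    z = list(z0)
    for mu, sigma in zip(mus, sigmas):
        # beta_i^t ~ N(mu_p, sigma_p^2), sampled independently per coordinate.
        beta = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        z = f_tran_additive(z, beta)
        poses.append(z)
    return poses


# With zero noise the trajectory is deterministic: constant drift in x and y.
trajectory = rollout_pose(
    [0.0, 0.0, 1.0],
    mus=[[1.0, 0.5, 0.0]] * 3,
    sigmas=[[0.0, 0.0, 0.0]] * 3,
)
```

Because each pose depends on the previous pose and a freshly sampled $\beta_i^t$, repeated rollouts with nonzero standard deviations give different trajectories, which is what makes the prediction stochastic.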

3.2.3 dynamic appearance

The appearance variable changes over time.
- The paper decomposes appearance into a static component $\mathbf{a}_{i,s}$ and a dynamic component $\mathbf{a}_{i,d}$
For the static component, the authors use the inverse affine spatial transformation $\mathcal{T}^{-1}$ from "Learning to decompose and disentangle representations for video prediction":
- $\mathbf{a}_{i, s}=\operatorname{FC}\left(\mathbf{h}_{i, a}^K\right), \quad \mathbf{h}_{i, a}^{t+1}= \begin{cases}\operatorname{LSTM}_1\left(\mathbf{h}_{i, a}^t, \mathcal{T}^{-1}\left(\mathbf{y}^t ; \mathbf{z}_{i, p}^t\right)\right) & t<K \\ \operatorname{LSTM}_2\left(\mathbf{h}_{i, a}^t\right) & K \leq t<T\end{cases}$
- (Future frames are predicted autoregressively: the hidden state at step t is the input at step t+1.)
For the dynamic component, the authors model the differences between frames:
- $\mathbf{a}_{i, d}^1=\mathrm{FC}\left(\left[\mathbf{a}_{i, s}, \mathcal{T}^{-1}\left(\mathbf{y}^1 ; \mathbf{z}_{i, p}^1\right)\right]\right), \quad \mathbf{a}_{i, d}^{t+1}=\mathbf{a}_{i, d}^t+\delta_{i, d}^t, \quad \delta_{i, d}^t=\mathrm{FC}\left(\left[\mathbf{h}_{i, a}^t, \mathbf{a}_{i, s}\right]\right)$
The final appearance combines the static and dynamic components:
- $q\left(\mathbf{z}_{i, a} \mid \mathbf{y}^{1: K}\right)=\prod_t \mathcal{N}\left(\mu_a, \sigma_a^2\right), \quad\left[\mu_a, \sigma_a^2\right]=\mathrm{FC}\left(\left[\mathbf{a}_{i, s}, \gamma \mathbf{a}_{i, d}^t\right]\right), \quad \gamma \sim \operatorname{Bernoulli}(p)$
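
The residual update of the dynamic component and the gated concatenation $\left[\mathbf{a}_{i, s}, \gamma \mathbf{a}_{i, d}^t\right]$ can be sketched as follows. The learned parts (the FC layers and LSTMs) are left out: the residuals $\delta_{i, d}^t$ are passed in as plain lists, and both function names are hypothetical.

```python
def accumulate_dynamic(a_d1, deltas):
    """Apply a_d^{t+1} = a_d^t + delta^t, returning every a_d^t."""
    out = [list(a_d1)]
    for delta in deltas:
        out.append([a + d for a, d in zip(out[-1], delta)])
    return out


def appearance_features(a_static, a_dynamic_t, gamma):
    """Build [a_s, gamma * a_d^t], the vector that would be fed to the FC
    layer producing mu_a and sigma_a^2 (gamma is a 0/1 Bernoulli draw)."""
    return list(a_static) + [gamma * a for a in a_dynamic_t]


dynamic = accumulate_dynamic([1.0, 0.0], [[0.5, 0.5], [0.5, -1.0]])
with_dynamic = appearance_features([1.0], dynamic[-1], gamma=1)
static_only = appearance_features([1.0], dynamic[-1], gamma=0)
```

When $\gamma=0$ the dynamic part is zeroed out and only the static component carries information, so the static component has to capture the time-invariant appearance by itself.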

3.4 Generative model and learning

Given a video with missing data $\left(\mathbf{y}^1, \cdots, \mathbf{y}^t\right)$, denote the underlying complete video by $\left(\mathbf{x}^1, \cdots, \mathbf{x}^t\right)$. The generative distribution of the video sequence is then: $p\left(\mathbf{y}^{1: K}, \mathbf{x}^{K+1: T} \mid \mathbf{z}^{1: T}\right)=\prod_{i=1}^N p\left(\mathbf{y}_i^{1: K} \mid \mathbf{z}_i^{1: K}\right) p\left(\mathbf{x}_i^{K+1: T} \mid \mathbf{z}_i^{K+1: T}\right)$
The per-object terms are computed as follows: $p\left(\mathbf{y}_i^t \mid \mathbf{z}_{i, a}^t\right)=\mathcal{T}\left(f_{\operatorname{dec}}\left(\mathbf{z}_{i, a}^t\right) ; \mathbf{z}_{i, p}^t\right) \circ\left(1-\mathbf{z}_{i, m}^t\right), \quad p\left(\mathbf{x}_i^t \mid \mathbf{z}_{i, a}^t\right)=\mathcal{T}\left(f_{\operatorname{dec}}\left(\mathbf{z}_{i, a}^t\right) ; \mathbf{z}_{i, p}^t\right)$
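
The only difference between the two per-object terms is the factor $\left(1-\mathbf{z}_{i, m}^t\right)$: the observed frame $\mathbf{y}_i^t$ drops the object when it is missing, while the complete frame $\mathbf{x}_i^t$ always contains it. The toy sketch below shows this with nested lists. `place_patch` is a deliberately simplified stand-in for the spatial transformer $\mathcal{T}$ (integer translation only, no scaling), the decoded patch is given directly instead of coming from $f_{\operatorname{dec}}$, and all names are hypothetical.

```python
def place_patch(patch, top, left, height, width):
    """Toy stand-in for T: paste a decoded patch onto an empty canvas."""
    canvas = [[0.0] * width for _ in range(height)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            rr, cc = top + r, left + c
            if 0 <= rr < height and 0 <= cc < width:
                canvas[rr][cc] = value
    return canvas


def render_object(patch, pose, z_missing, height, width, observed=True):
    """Render one object as it appears in y (observed) or x (complete)."""
    top, left = pose
    full = place_patch(patch, top, left, height, width)
    if not observed:
        # Complete video x_i^t: the object is always drawn.
        return full
    # Observed video y_i^t: multiply by (1 - z_m), so a missing object vanishes.
    keep = 1 - z_missing
    return [[value * keep for value in row] for row in full]


patch = [[1.0, 1.0], [1.0, 1.0]]
y_visible = render_object(patch, (1, 1), z_missing=0, height=4, width=4)
y_missing = render_object(patch, (1, 1), z_missing=1, height=4, width=4)
x_complete = render_object(patch, (1, 1), z_missing=1, height=4, width=4, observed=False)
```

Here `y_missing` is an all-zero canvas while `x_complete` still contains the object, which is the sense in which the model imputes the content that the observed video lacks.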

4 Experiments

Given 10 input frames, the model predicts the next 10 frames.

Original post: https://blog.csdn.net/qq_40206371/article/details/128813349
