NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification Rongcheng Lin, Jing Xiao, Jianping Fan University of North Carolina at Charlotte

NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level … · 2019-10-22 · NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video

Feb 23, 2020

Page 1

NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification

Rongcheng Lin, Jing Xiao, Jianping Fan

University of North Carolina at Charlotte

Page 2

Final solution overview:

[Figure: architecture of the final solution, after [X. Lan 2018]. Labeled losses: individual loss, mixture loss, distill loss.]

Page 3

▪ 79M parameters ≈ 316 MB of storage using Float32
▪ 2 NVIDIA GTX 1080 Ti GPUs
▪ 400+ examples/sec with SSD
▪ About 2 days to reach the optimal results
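
The storage figure follows from four bytes per Float32 parameter. A minimal check (variable names are illustrative, not from the slides):

```python
# Storage estimate for a model whose weights are stored as Float32.
BYTES_PER_FLOAT32 = 4

num_parameters = 79_000_000  # "79M parameters" from the slide
storage_bytes = num_parameters * BYTES_PER_FLOAT32
storage_mb = storage_bytes / 1_000_000  # decimal megabytes

print(f"{storage_mb:.0f} MB")  # prints "316 MB"
```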

Page 4

Motivations: feature groups for aggregation?

ResNet ResNeXt

NetVLAD NeXtVLAD

Page 5

NetVLAD: a learnable pooling approach

[R. Arandjelović, 2016]

$$y_{jk} = \sum_{i} \alpha_k(x_i)\,(x_{ij} - c_{kj})$$

▪ $\alpha_k(x_i)$: soft assignment of frame feature i to cluster k
▪ $x_{ij} - c_{kj}$: residual vector
▪ $\sum_i$: sum over all M frames in the video
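
The aggregation formula can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code: the slide only says "soft assignment", so computing it as a softmax over a linear map of each frame is an assumption here, and the names `w`, `b`, and `netvlad_aggregate` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def netvlad_aggregate(x, w, b, c):
    """Compute y[j, k] = sum_i alpha_k(x_i) * (x[i, j] - c[k, j]).

    x: (M, N) frame features, c: (K, N) cluster centers.
    w: (N, K), b: (K,) parameters of the assumed softmax assignment.
    Returns y with shape (N, K).
    """
    alpha = softmax(x @ w + b, axis=1)   # (M, K) soft assignments
    weighted_sum = x.T @ alpha           # (N, K): sum_i alpha[i, k] * x[i, j]
    mass = alpha.sum(axis=0)             # (K,): sum_i alpha[i, k]
    return weighted_sum - c.T * mass     # subtract the weighted centers

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
M, N, K = 5, 4, 3
y = netvlad_aggregate(
    rng.normal(size=(M, N)),
    rng.normal(size=(N, K)),
    rng.normal(size=K),
    rng.normal(size=(K, N)),
)
print(y.shape)  # prints "(4, 3)"
```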

Page 6

NetVLAD: a learnable pooling approach

▪ Number of parameters: $N \times K \times (H + 2)$

▪ N (input dimension), K (cluster number), H (hidden size)

▪ e.g. N = 1024, K = 256, H = 2048 results in about 537 million parameters
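
A quick check of the example above. The function name is illustrative; the formula is the one on the slide:

```python
def netvlad_param_count(n, k, h):
    # N x K x (H + 2), as given on the slide.
    return n * k * (h + 2)

count = netvlad_param_count(n=1024, k=256, h=2048)
print(f"{count:,}")  # prints "537,395,200"
```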

Intra normalization

Page 7

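NeXtVLAD: a mixture of NetVLAD over group features

$$y_{jk} = \sum_{g} \sum_{i} \alpha_g(x_i)\,\alpha_{gk}(x_i)\,(x^{g}_{ij} - c_{kj})$$

▪ $\sum_g$: sum over all groups
▪ $\alpha_g(x_i) = \sigma(w_g^T x_i + b_g)$: attention function over groups
▪ $\sum_i \alpha_{gk}(x_i)\,(x^{g}_{ij} - c_{kj})$: group-level NetVLAD aggregation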
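
A NumPy sketch of the grouped aggregation, under assumptions not stated on the slide: the input `x` is taken as already expanded, groups are formed by reshaping the feature axis, and the per-group soft assignment is assumed to be a softmax over clusters of a linear map. All parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nextvlad_aggregate(x, groups, w_att, b_att, w_assign, b_assign, c):
    """y[j, k] = sum_g sum_i a_g(x_i) * a_gk(x_i) * (x_g[i, g, j] - c[k, j]).

    x: (M, D) frame features, D divisible by `groups`.
    w_att: (D, G), b_att: (G,)         -> sigmoid attention over groups.
    w_assign: (D, G*K), b_assign: (G*K,) -> assumed softmax assignment.
    c: (K, D // G) cluster centers. Returns y with shape (D // G, K).
    """
    m, d = x.shape
    k = c.shape[0]
    x_groups = x.reshape(m, groups, d // groups)        # (M, G, D/G)
    att = sigmoid(x @ w_att + b_att)                    # (M, G)
    logits = (x @ w_assign + b_assign).reshape(m, groups, k)
    assign = softmax(logits, axis=2)                    # (M, G, K)
    weights = att[:, :, None] * assign                  # (M, G, K)
    weighted_sum = np.einsum("igk,igj->jk", weights, x_groups)
    mass = weights.sum(axis=(0, 1))                     # (K,)
    return weighted_sum - c.T * mass

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
M, D, G, K = 5, 8, 2, 3
y = nextvlad_aggregate(
    rng.normal(size=(M, D)), G,
    rng.normal(size=(D, G)), rng.normal(size=G),
    rng.normal(size=(D, G * K)), rng.normal(size=G * K),
    rng.normal(size=(K, D // G)),
)
print(y.shape)  # prints "(4, 3)"
```

Because each group has dimension D/G, the aggregated output is G times smaller than aggregating the full D-dimensional feature with the same K.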

Page 8

NeXtVLAD: a mixture of NetVLAD on group features

[Figure labels: feature expansion, reshape operation]

▪ Number of parameters: $\lambda N \left(N + G + K\left(G + \frac{H + 1}{G}\right)\right)$

▪ N (input dimension), G (group number), K (cluster number), H (hidden size), λ (expansion factor)

▪ About $G/\lambda$ times smaller than NetVLAD with the same input and output dimensions

▪ e.g. N = 1024, G = 8, K = 256, H = 2048, λ = 2 results in about 140 million parameters
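
The two parameter formulas can be compared directly. A small sketch with illustrative function names, using the example values from the slides:

```python
def netvlad_param_count(n, k, h):
    # N x K x (H + 2)
    return n * k * (h + 2)

def nextvlad_param_count(n, g, k, h, lam):
    # lambda * N * (N + G + K * (G + (H + 1) / G))
    return lam * n * (n + g + k * (g + (h + 1) / g))

netvlad = netvlad_param_count(n=1024, k=256, h=2048)
nextvlad = nextvlad_param_count(n=1024, g=8, k=256, h=2048, lam=2)

print(f"NetVLAD:  {netvlad / 1e6:.1f}M")
print(f"NeXtVLAD: {nextvlad / 1e6:.1f}M")
print(f"ratio:    {netvlad / nextvlad:.2f} (G / lambda = {8 / 2:.0f})")
```

For these values the ratio comes out slightly below G/λ = 4, consistent with the "about G/λ times smaller" claim.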

Page 9

Overview of NeXtVLAD model

[A. Miech 2017]

▪ Reverse whitening: $\hat{x}_{ij} = x_{ij} \cdot e_j$
▪ SE Context Gating:

Page 10

Page 11

Single model performance comparison

Page 12

Questions & more details: https://www.kaggle.com/c/youtube8m-2018/discussion/63223

Code: https://github.com/linrongc/youtube-8m