Adaptive Transformers for Learning Multimodal Representations
Prajjwal
[email protected]
Abstract
The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we extend adaptive approaches to learn more about model interpretability and computational efficiency. Specifically, we study attention spans, sparse, and structured dropout methods to help understand how their attention mechanism extends for vision and language tasks. We further show that these approaches can help us learn more about how the network perceives the complexity of input sequences, sparsity preferences for different modalities, and other related phenomena.
1 Introduction
Learning richer representations from visual and text data is a central task in multi-modal learning. Attention-based methods have proven to be very useful in learning long term dependencies and forming richer representations of the input sequences. Numerous approaches (Lu et al., 2019; Su et al., 2019; Li et al., 2019; Chen et al., 2019) have been proposed for learning visiolinguistic representations with transformers. Although these approaches have provided significant improvements on various benchmarks (language and visiolinguistic), the architectures used are over-parameterized and require extensive training, lasting several weeks, on multiple objectives to form a generalized representation of the task to be addressed, which is then followed by fine-tuning on a downstream task. This workflow has become a concerning problem: it makes deep learning methodologies inaccessible and increases carbon footprints (Strubell et al., 2019). In this work, we specifically explore adaptive methods.
We refer to adaptive mechanisms as those methods that change their behavior during training/runtime and adapt stochastically to the environment based on data heuristics (parameters) learned by encountering samples from the same data distribution optimized by an objective function. Alternative approaches such as pruning, distillation (Hinton et al., 2015) and quantization are rigid to some extent and induce some form of permanent modification to the model. Adaptive methods instead force the network to learn parameters such that their behavior changes with the complexity of the input sequence as perceived by the neural network. The code to reproduce the results in this work is publicly available at https://github.com/prajjwal1/adaptive_transformer.
Current self-attention approaches assume that the attention span of a head is invariant to the complexity of an input sequence. Attention heads can learn their optimal context size (Sukhbaatar et al., 2019), which results in a reduction of FLOPS. When an optimal attention span is learned, the amount of attention given to a particular input sequence by an attention head is determined by its context size. We show that the context size varies with the emergent complexity of the sequence, and spans can help us understand how sensitive a layer is to an input sequence.
Training models with a quarter of a billion parameters is not feasible or practical for most users. One effective way to facilitate neural network scaling is by making the weights of the network sparse. This configuration allows us to perform faster training of deeper networks with relatively less compute. To make attention distributions sparse, we use α-entmax (Correia et al., 2019) to obtain the probability distribution of weights. Normalized exponential functions like softmax cannot assign a zero attention weight.
This property forces the context vector to stay dense, resulting in non-relevant sequences being considered even though the network has discarded them by assigning a deficient weight. Adaptive sparsity can make an attention head learn richer distributions by letting the behavior of the distribution oscillate between softmax and sparsemax. We show that this behavior can help us understand preferences for the density of the attention weight distribution and how it varies among heads for different modalities.
We also study a regularization method called LayerDrop (Fan et al., 2019) to understand its regularization impact on multi-modal features. If the network can learn to drop identical layers (Data Driven pruning), then it can be regarded as an adaptive depth mechanism. We specifically use the Every Other pruning method, where the user specifies the drop rate, because it offers maximal gains compared to its counterpart pruning methods. This method has proven to be effective in reducing the number of parameters and pruning layers during inference.
The contributions of this work are as follows:
• Adaptive approaches have previously been tested only with linguistic features. We extend these approaches to study how they align to capture complex relationships between different modalities. We also study the effects of aligning these approaches to understand their compatibility through ablation analysis.
• We perform interpretability analysis to learn how these approaches can enhance our understanding of attention behavior and adaptive approaches.
• We provide experimental results on the recent adaptive approaches for multi-modal input sequences.
2 Background
2.1 LXMERT
We use LXMERT (Tan and Bansal, 2019) as the baseline architecture. The adaptive approaches can be combined with any other self-attention based transformer. LXMERT uses self and cross attention layers to jointly attend to image and text inputs (the input sequence). Specifically, it takes word-level sentence embeddings and object-level image embeddings. The encoder consists of three main components: language (9 layers) and visual (5 layers) encoders (single-modality) to form textual and image representations, and a cross-modality encoder (5 layers) to jointly attend to both these representations. Cross attention is responsible for forming the mapping between ROI features and textual representations. Since the architecture used is identical, we refer the readers to (Tan and Bansal, 2019) for a detailed description of pre-training strategies. The network used has been pre-trained on four objectives: Masked Cross Modality LM, Masked Object Prediction, Cross Modality Matching, and Image Question Answering. Faster RCNN is used to extract ROI features from the input images.
2.2 Adaptive Attention Span
Unlike dynamic attention, which assumes that all attention heads require the same amount of span, learning an optimal attention span enables gathering information as per the context size determined by each attention head. A maximum upper-bound span limit is enforced on each head, which helps reduce computation and memory requirements. As proposed in (Sukhbaatar et al., 2019), different heads emphasize different context depending upon the task being addressed. We explicitly show that these spans vary significantly based on the complexity of the task. We use the same masking function with minor modification:
\[ m_z(x) = \min\Big[\max\Big[\tfrac{1}{R}(R + z - x),\, 0\Big],\, 1\Big] \tag{1} \]
Here, z acts as a model parameter. We initialize it with a Kaiming normal (He et al., 2015) distribution. m_z is coupled with the attention weights. The hyperparameter R helps control the softness of this attention distribution.
The attention head computes the similarity between the current token t and a past token r in the span [t − S, t) as:
\[ s_{tr} = x_t^{\top} Q^{\top} (K x_r + P_{t-r}) \tag{2} \]
where K, Q, and P_{t−r} denote the key and query vectors and the position embedding, respectively. In the standard setting, the attention weight distribution is obtained by applying softmax to the similarity vector:
\[ A_{tr} = \mathrm{softmax}(s_{tr}) \tag{3} \]
[Figure 1: Variation of adaptive spans in different attention layers (single- and cross-modality) as training progresses. Panels (x-axis: epoch, y-axis: attention span): Language (Self Attention), Language (Cross Attention), Vision+Language (Cross Attention), Vision (Cross Attention), Vision (Self Attention). Accuracy on the local validation set is reported per epoch. The maximum adaptive span limit was set to 1024.]
The attention weights from Equation 3 are then processed by the masking function as:

\[ A_{tr} = \frac{m_z(t-r)\,\exp(s_{tr})}{\sum_{q=t-S}^{t-1} m_z(t-q)\,\exp(s_{tq})} \tag{4} \]

The masking function is a non-increasing function that transforms the attention scores to keep them in the range [0, 1]. The parameters of m_z are updated along with the model parameters to learn the optimal span.
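To make the mechanism concrete, here is a minimal PyTorch sketch of the soft masking function in Equations 1 and 4, assuming one learnable span parameter z per head and a ramp length R. Module name, tensor shapes, and initialization details are illustrative assumptions rather than the exact released code.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask m_z(x) = min(max((R + z - x) / R, 0), 1) applied to attention weights."""

    def __init__(self, n_heads: int, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # One learnable span per head, Kaiming-normal initialized as in the paper.
        self.z = nn.Parameter(torch.empty(n_heads, 1, 1))
        nn.init.kaiming_normal_(self.z)

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, n_heads, query_len, span) raw similarities s_tr
        span = attn_scores.size(-1)
        # distance (t - r) of each attended position from the current token
        distance = torch.arange(span - 1, -1, -1, device=attn_scores.device).float()
        z = self.z.clamp(0, self.max_span)                              # keep span in bounds
        mask = ((self.ramp + z - distance) / self.ramp).clamp(0, 1)     # Eq. 1
        weights = torch.softmax(attn_scores, dim=-1) * mask             # numerator of Eq. 4
        weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)  # renormalize (Eq. 4)
        return weights
```

In this sketch, the span used by a head can be read off directly from z, which is how the per-layer spans in Figure 1 are tracked.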
2.3 Adaptive Sparse Attention
In order to make attention weights sparse, we use α-entmax as proposed in (Correia et al., 2019). Specifically, softmax is replaced with α-entmax to compute attention weights from the attention scores in Equation 3.
\[ \mathrm{Att}(Q, K, V) = \pi\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V \tag{5} \]
\[ \pi(Z)_{ij} = \alpha\text{-entmax}(z_i)_j \tag{6} \]
α plays a crucial role in determining the behavior of an attention head. If α > 1, the weight distribution moves away from softmax's dense representation towards sparse mappings as its curvature changes. For α = 2, we obtain completely sparse mappings. The value of α oscillates between 1 and 2. It is set as a network parameter, which is jointly optimized in the training process. Different values of α govern the behavior of the attention head.
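As a concrete illustration, the snippet below replaces the softmax of Equation 3 with α-entmax via the publicly available entmax package. The learnable per-head α, its initialization, and the wiring into the attention block are assumptions for this sketch, not the exact code of the released model.

```python
import torch
import torch.nn as nn
from entmax import entmax_bisect  # pip install entmax

class AlphaEntmaxAttention(nn.Module):
    """Scaled dot-product attention with a learnable alpha per head (Eqs. 5-6)."""

    def __init__(self, n_heads: int):
        super().__init__()
        # Trainable alpha, one value per head, initialized midway between softmax and sparsemax.
        self.alpha = nn.Parameter(torch.full((n_heads, 1, 1), 1.5))

    def forward(self, q, k, v):
        # q, k, v: (batch, n_heads, seq_len, d_head)
        d = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # QK^T / sqrt(d)
        alpha = self.alpha.clamp(1.01, 2.0)                        # keep alpha in (1, 2]
        weights = entmax_bisect(scores, alpha=alpha, dim=-1)       # pi(Z) in Eq. 6
        return torch.matmul(weights, v)                            # Eq. 5
```

Because entmax_bisect is differentiable with respect to α, the sparsity preference of each head is learned jointly with the rest of the network, which is what Figure 3 visualizes.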
2.4 LayerDrop
LayerDrop (Fan et al., 2019) is a method to reduce the depth of the transformer in a controlled manner. It drops identical sub-layers in the transformer as determined by a pruning strategy. We follow the Every Other strategy, which drops layers as specified by a drop rate. It has been noted that this pruning strategy works well compared to the Search on Valid and Data Driven pruning strategies. Let N denote the total number of layers in the network. Setting p = 1 implies that we are dropping one layer out of all the layers assigned to a modality, so the number of remaining layers becomes N − p. Although the network still contains the same number of parameters as the N-layer model, all operations are carried out as in an (N − p)-layer model. This strategy allows us to prune layers at inference time.
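Below is a minimal sketch of how such an encoder stack might apply structured dropout: sub-layers are skipped stochastically during training and a fixed, regularly spaced subset is pruned at inference. The module interface, the use of a probability-style drop rate (as in Fan et al.), and the indexing scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    """Stack of transformer layers with structured dropout (LayerDrop)."""

    def __init__(self, layers: nn.ModuleList, drop_rate: float = 0.5):
        super().__init__()
        self.layers = layers
        self.drop_rate = drop_rate  # probability of skipping a layer during training

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if self.training:
                # Stochastically skip this sub-layer during training.
                if torch.rand(1).item() < self.drop_rate:
                    continue
            else:
                # "Every Other" pruning at inference: drop layers at regular
                # intervals implied by the drop rate (every other layer for 0.5).
                if self.drop_rate > 0 and (i + 1) % max(1, round(1 / self.drop_rate)) == 0:
                    continue
            x = layer(x)
        return x
```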
3 Experimental Setup
Visual Question Answering To solve the VQA task, given an image and a question related to it, the network is supposed to predict the right answer from a given set of answer choices. We performed all experimentation on the VQA 2.0 dataset (Antol et al., 2015). The dataset consists of three sets: a train set containing 83k images and 444k questions, a validation set containing 41k images and 214k questions, and a test set containing 81k images and 448k questions. In this case, the network is asked to predict an answer from 3129 answer choices for a particular question.
Implementation We use the pre-trained weights provided by (Tan and Bansal, 2019). We fine-tune LXMERT to form visiolinguistic representations based on image and text sequences with the adaptive approaches mentioned above. This operation is followed by a classifier that receives the concatenated pooled features of image and text to predict the answer. Fine-tuning is performed on a single P100 GPU with a batch size of 128. Optimization is performed with Lookahead (Zhang et al., 2019) with LAMB (You et al., 2019) as the inner optimizer. The learning rate schedule is regulated by Cyclical LR (Smith, 2017), with base and max learning rates set to 1e−5 and 1e−4.
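The optimization setup can be assembled from off-the-shelf components. The sketch below assumes the torch_optimizer package for LAMB and Lookahead; the Lookahead hyperparameters (k, alpha) and the scheduler step size are not specified in the paper and are placeholders here.

```python
import torch.nn as nn
from torch.optim.lr_scheduler import CyclicLR
import torch_optimizer  # pip install torch-optimizer (assumed; not part of core PyTorch)

# Stand-in for the fine-tuned LXMERT classifier head; replace with the actual model.
model = nn.Linear(768, 3129)

# LAMB as the inner optimizer, wrapped by Lookahead (Zhang et al., 2019).
inner = torch_optimizer.Lamb(model.parameters(), lr=1e-4)
optimizer = torch_optimizer.Lookahead(inner, k=5, alpha=0.5)

# Cyclical learning rate between the base and max values reported in the paper.
scheduler = CyclicLR(inner, base_lr=1e-5, max_lr=1e-4,
                     step_size_up=2000, cycle_momentum=False)

# Per batch: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```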
4 Experimental Findings and Results
Adaptive span for understanding the complexity of the input sequence We demonstrate how learning spans can help in understanding the behavior of individual layers. Figure 1 shows how the span varies among different attention layers. Studying spans can help us understand which layers are more sensitive to the input sequences encountered during the training process.
In the case of the single-modality encoders, spans for self-attention layers for vision and language decrease monotonically, indicating that the learning behavior is somewhat similar, although the slopes tell us that the rate of learning is dissimilar. Similar behavior is seen in the cross-modality encoder for language.
Requiring a larger context size is indicative of the complexity of the sequences. When self-attention attends to both modalities, we observe that the intermediate layers responsible for forming complex representations increase their spans. This observation shows that a larger span is necessary to attend to both modalities jointly. Self-attention also requires a high span when attending to visual features in the cross-modality encoder. This observation shows that visual sequences are perceived as a more complex input to process than a language input in the cross-modality encoder.
Determining sparsity preferences for vision and language modalities with α The value of α determines whether a head favors a sparse or dense attention weight distribution. For the language modality, self-attention mostly favors sparse mappings of attention weights in intermediate layers. Similar behavior is observed inside the cross-modality encoder as well.
[Figure 2: Regularization effect of LayerDrop. Accuracy vs. epoch for the layerdrop-10-6-6, layerdrop-9-5-5, attn_span, and sparse configurations.]
This observation shows that the language modality benefits from sparse weights being assigned as the attention distribution. The value of α is restricted below 1.5 for processing visual inputs. When the vision modality is involved, heads that initially preferred sparse mappings converge towards denser mappings, indicating that this representation of attention weights is preferred. We also observe that when both modalities are involved, the network prefers an even denser weight distribution. This observation shows that the vision modality is given more preference (partly due to its perceived complexity) over language inputs when processing the sequence. Figure 3 shows the variation of α values as training progresses.
Regularization effect of LayerDrop We consider two configurations of the model. The first one has 10 language, 6 vision, and 6 cross-modality layers with the drop rate (p) set to 1 layer. In this case, the number of parameters is larger, but the FLOPS are equivalent to the standard 9-5-5 baseline configuration. The latter has the 9-5-5 configuration with p set to 1. This rate causes a FLOP reduction of 17.54%. It is observed that LayerDrop requires ∼3.5x more compute runtime for convergence during training. A possible explanation is that the additional training aids in forming a consolidated understanding of multi-modal representations. Even after ensuring the convergence of the model, the strong regularization effect (with a minimum value of p) prevents the network from achieving performance close to that of the aforementioned adaptive methods with an equivalent number of parameters used in training. Figure 2 and Table 2 show these observations.
Quantitative Analysis In this section, Table 1 compares the adaptive approaches with the baseline model and other state-of-the-art models, which rely upon the standard softmax attention mechanism.
[Figure 3: Variation of α in entmax in the first six attention heads (of 12) during an intermediate training stage of the 9-5-5 LXMERT model. Panels: Language Encoder (9 layers), Cross Modality Encoder (Language) (5 layers), Cross Modality Encoder for Vision and Language (5 layers), Cross Modality Encoder for Vision (5 layers), Vision Encoder (5 layers). X and Y axes denote epoch and α values, respectively; color codes denote different attention heads.]
[Figure 4: Top-5 confidence scores for an example input sequence. Left: Adaptive Entmax; Center: Adaptive Attention Span; Right: 10-6-6 configuration with LayerDrop (p=1).]
We notice that these approaches achieve performance close to that of standard attention mechanisms while being computationally efficient. The results are reported without any hyperparameter tuning.
Qualitative Analysis In this section, we analyze the confidence scores on complex examples to better understand the network's predictions. We usually take the class with the maximum confidence, but analyzing the confidence scores of other classes can help us learn what the network is learning about the similarity of different objects in the image. Figure 4 shows confidence scores on an example input. We observe that entmax aids in forming a consolidated understanding of contrastive features. In most cases, the top 5 confidence scores include predictions present in the ground truth. Due to sparse mapping, the network makes strong, confident predictions about one label. When trained with an adaptive attention span, the network sometimes seems unsure about the correct label, as expected from softmax behavior. It works well when a high probability is assigned to one label in the ground truth. We did not observe comparable performance from LayerDrop. In this example, the right answer is assigned a deficient score; the network does not seem to properly learn features that distinguish similar classes.
5 Ablation Analysis
We normalize attention scores with entmax instead of softmax before applying the masking function in order to use both the adaptive attention span and the sparse attention weight mapping. It is evident from Table 2 that the adaptive span works better with the denser representation of attention weights. The effect of the soft masking function is reduced when used with a sparse mapping function. We evaluate the LayerDrop method with two configurations of the network, 9-5-5 (language, vision, and cross-modality layers) and 10-6-6, with p = 1.
Model                              test-dev  test-std
BUTD (Anderson et al., 2018)          65.32     65.67
ViLBERT (Lu et al., 2019)             70.55     70.92
VLBERT (Su et al., 2019)              71.16     -
VisualBERT (Li et al., 2019)          70.80     71.00
UNITER (Chen et al., 2019)            72.27     72.46
LXMERT (Tan and Bansal, 2019)
  w/ softmax                          72.42     72.54
  w/ Adaptive Attention Span          71.62     71.72
  w/ Adaptive Sparse                  71.73     71.97
  w/ Layerdrop (10-6-6) (p=1)         66.4      66.72

Table 1: Comparison to the state-of-the-art methods with adaptive approaches on the VQA dataset.

Model                              test-dev  test-std
LXMERT (Tan and Bansal, 2019)
  w/ Attention Span and Entmax        63.07     63.33
  Default (10-6-6)                    66.35     66.57
  w/ Layerdrop (9-5-5) (p=1)          66.51     66.81

Table 2: Ablation study for adaptive approaches.
From Table 2, we see that the shallower network performs better than the deeper-layered model. This observation shows that there is a specific threshold drop rate up to which LayerDrop helps. It is plausible that this type of regularization is more favorable in deeper networks.
6 Conclusion
While attention-based approaches are becoming universal, computationally efficient methods must be favored for broader adoption of the provided pre-trained models on low-resource hardware. Adaptive methods can significantly reduce the cost incurred to train such models, as well as their carbon footprint. In this work, we extend adaptive approaches to visiolinguistic tasks to understand more about attention and adaptive mechanisms. While the empirical results are encouraging, important future work includes the exploration of more efficient adaptive and sparse mechanisms that can significantly reduce FLOPS and parameters with minimal loss in performance.
References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.

Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015.

Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.

Leslie N. Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.

Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E. Hinton. 2019. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pages 9593–9604.