
Making a Point with Pointer Networks: Arranging Shuffled Stories

Gautam Somappa    Sivaraman K S
Department of Computer Science, University of Virginia, Charlottesville, VA 22904

[gs9ed, ks6cq]@virginia.edu

Abstract

Sorting images in an album is an interesting task in computer vision and machine learning: it allows us to build systems that can generalize and exploit the temporal relationship between images. We explore this task by implementing various models and reporting their performance on the Visual Storytelling (VIST) dataset, which also includes captions associated with the images. A previous implementation of a pairwise comparison model fails to capture the overall context of the story. We therefore propose a more intuitive approach that uses a latent representation of the shuffled input sequence in a Pointer Network (ptrNet), so that the full contextual information is used to order the story. We demonstrate the effectiveness of this model through various experiments. We make the source code available on GitHub¹ and a live demo of the best model here².

1. Introduction

The ability to sequence objects and tasks is innate in humans and comes naturally with minimal supervision. A child learns to stack blocks in decreasing order of size to form a tower. Sequencing also applies to sorting words according to grammatical rules (termed linearization in NLP [10]); for instance, in the sentence "We are coming home", grammatical rules dictate the ordering of the individual words. Another popular example of sequence sorting where humans excel is forming a coherent story from a jumbled set of sentences or images, which is widely used in activity or event recognition. While sorting of absolute values has seen huge leaps in research, sorting of complex and abstract sequences like images or audio has yet to take a formidable shape. The advent of deep learning, built on complex systems of artificial neural networks, has brought massive improvements in the performance of various machine learning tasks such as image and speech recognition.

In our project, we aim to solve the problem of sorting abstract sequences using a neural model. We hope to build a system that can look at images and arrange them in a way that makes sense, much as humans arrange photo albums, and to do so without relying on metadata (such as the upload date), while taking help from complementary modalities like associated captions. A system that can sort images temporally/sequentially could also be used to sequence images in a children's book to generate a proper story, or in a crowdsourced album for major events like New Year's Eve or even the Oscars. Snapchat has similar functionality, but it is done manually: a team curates snaps from different users to create a story. Our goal is to learn the temporal structure of the entities through artificial neural networks, leading to temporal common sense [1] that lets the network generalize to sequencing tasks. We explore sequence-to-sequence (seq2seq) models and a recently proposed neural architecture called the Pointer Network (ptrNet) for the task of ordering. From this implementation of image sorting with the help of associated captions, we hope to demonstrate the model's ability to learn a temporal sequence.

¹ https://github.com/maximus009/StorySorter
² http://128.143.63.199:5000/

Figure 1. Task at a glance: shuffled input story and sorted output

2. Related Work

Chen et al. (2009) [3] use a generalized Mallows model for modeling sequences for coherence within single documents. Recently, Mostafazadeh et al. (2016) [8] presented the ROCStories dataset of 5-sentence stories with stereotypical causal and temporal relations between events. In our work, though, we use a multi-modal story dataset that contains both images and associated story-like captions. Some works in vision (Pickup et al., 2014 [9]; Basha et al., 2012 [2]) also temporally order images, typically by finding correspondences between multiple images of the same scene using geometry-based approaches. Similarly, Choi et al. (2016) [4] compose a story out of multiple short video clips; they define metrics based on scene dynamics and coherence, and use dense optical flow and patch matching. In contrast, our work deals with stories containing potentially visually dissimilar but semantically coherent sets of images and captions.

We draw inspiration from Agrawal et al. [1] to learn temporal common sense from multi-modal stories consisting of a sequence of aligned image-caption pairs, and thus start by implementing the pairwise model described in [1] to obtain baselines. We also apply Pointer Networks [11, 12, 5] to this task and compare against the baseline implementation.

3. Methodology

Here, we describe our approach to gathering the story sequences and explain the training procedure.

3.1. Data Collection

We use the Visual Storytelling (VIST) dataset for our project. VIST has 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. The dataset has three tiers of language for the same image: (1) descriptions of images in isolation (DII); (2) descriptions of images in sequence (DIS); and (3) stories for images in sequence (SIS). This tiered approach reveals the effect of temporal context and the effect of narrative language. We are most interested in stories for images in sequence, since these captions maintain relations with previous frames and thereby convey a story. The VIST dataset is split into albums, each containing 5 images arranged to convey a cohesive story. We collect 40,000 unique stories and use 30,000 for training and 10,000 for testing.

3.2. Input Features

We use a VGG16 model pre-trained on ImageNet and extract 4096-dimensional features from its penultimate layer to encode the images. To encode sentences, we pass them to skip-thought vectors [6]. That is, instead of using a word to predict its surrounding context, we encode each sentence with respect to its neighboring sentences; thus, any composition operator can be substituted as a sentence encoder and only the objective function is modified. The resulting vectors form a numpy array with as many rows as there are sentences, each row being 4800-dimensional (the combine-skip model from the paper), with the uni-skip and bi-skip models contributing 2,400 dimensions each.

We explore three models: Visual-only, Text-only, and Visual and Text features combined, and report performance for all of these models in a later section. In our multi-modal approach we concatenate the image and caption features into an 8896-dimensional (4096 + 4800) vector.

Figure 2. Architecture for Pairwise Model

Figure 3. Architecture of Pointer Network
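As an illustration of this feature pipeline, the sketch below extracts the 4096-dimensional penultimate-layer ('fc2') activations with Keras' bundled VGG16 and concatenates them with caption vectors. The skip-thought encoder is not reproduced here, so caption_feature is a hypothetical placeholder standing in for the 4800-dimensional combine-skip output.

```python
# Sketch of the image-caption feature pipeline (assumes TensorFlow/Keras).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Penultimate fully-connected layer ('fc2') of VGG16 gives a 4096-dim descriptor.
base = VGG16(weights="imagenet")
fc2_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def image_feature(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return fc2_extractor.predict(x)[0]                # shape: (4096,)

def caption_feature(sentence):
    # Placeholder for the skip-thought encoder (uni-skip + bi-skip = 4800 dims).
    return np.zeros(4800, dtype=np.float32)

def story_features(image_paths, captions):
    # One 8896-dim (4096 + 4800) vector per aligned image-caption pair.
    return np.stack([np.concatenate([image_feature(p), caption_feature(c)])
                     for p, c in zip(image_paths, captions)])
```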

3.3. Pairwise Model

For a story, the images along with their captions come in a sequence. We shuffle them and take all possible ordered pairs of image and/or caption elements. We develop pairwise scoring models that, given a pair of elements (i, j), learn to assign a score S([[σ_i < σ_j]] | i, j) indicating whether element i should be placed before element j in the permutation σ. Here, [[·]] denotes the Iverson bracket, which is 1 if its argument is true and 0 otherwise. The problem is thus treated as binary classification. Figure 2 illustrates the model when we take both the image and caption features. There are 20 ordered pairs for a story with five elements. Among these 20 vectors, we randomly sample 16 (every epoch, for every sample) and shuffle them so as to minimize any correlation between the input vectors, and train the model accordingly.
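A minimal sketch of such a pairwise comparator is given below; the hidden-layer size and optimizer settings are illustrative assumptions, as the report does not specify the exact classifier configuration.

```python
# Hypothetical pairwise comparator: given the concatenated features of elements
# i and j, predict whether i precedes j (binary classification).
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

FEAT_DIM = 8896  # 4096 image + 4800 caption dimensions per story element

x_i = Input(shape=(FEAT_DIM,))   # features of element i
x_j = Input(shape=(FEAT_DIM,))   # features of element j
h = Dense(512, activation="relu")(Concatenate()([x_i, x_j]))
score = Dense(1, activation="sigmoid")(h)   # probability that i precedes j

pairwise_model = Model(inputs=[x_i, x_j], outputs=score)
pairwise_model.compile(optimizer="adam", loss="binary_crossentropy",
                       metrics=["accuracy"])
```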


3.4. Pointer Networks

Our proposed methodology uses pointer networks (proposed by Vinyals et al. (2015) [12]) to solve the sorting problem. Pointer networks are useful when the output is drawn from the input sequence. In our approach we use the following components, inspired by the architecture proposed in [5]: an encoder LSTM and a decoder pointer network.

1. The features are passed through an encoder LSTM, which is fed x_i at each time step i, until the end of the input sequence is reached.

2. At each step, the decoder produces a vector that modulates a content-based attention mechanism over the inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input; this output gives the position in the output sequence of each entity in the input sequence, as sketched below.
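The snippet below is a minimal sketch of such a content-based pointer attention step, in the spirit of Vinyals et al. [12]; the class name, layer sizes, and variable names are illustrative rather than the exact implementation used here.

```python
import tensorflow as tf

class PointerAttention(tf.keras.layers.Layer):
    """Scores each encoder state against the current decoder state and
    returns a softmax over input positions (the 'pointer')."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units, use_bias=False)  # projects encoder states
        self.W2 = tf.keras.layers.Dense(units, use_bias=False)  # projects decoder state
        self.v = tf.keras.layers.Dense(1, use_bias=False)       # scoring vector

    def call(self, enc_outputs, dec_state):
        # enc_outputs: (batch, seq_len, hidden), dec_state: (batch, hidden)
        scores = self.v(tf.tanh(self.W1(enc_outputs) +
                                self.W2(tf.expand_dims(dec_state, 1))))
        # Softmax over input positions: the distribution "points" to the input
        # element that should occupy the current output slot.
        return tf.nn.softmax(tf.squeeze(scores, -1), axis=-1)
```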

For each story, we store the original sequence and additionally shuffle and store further sequences, so that over many epochs the model eventually sees all possible permutations of the stories in the training set. In this case, the output is five one-hot vectors corresponding to the desired position of each element in the input. Figure 3 shows the Pointer Network architecture.
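A minimal sketch of this training-pair construction, assuming each story is given as a (5, feature_dim) array in its correct order:

```python
# Build one shuffled input and its one-hot position targets for a story.
import numpy as np

def make_training_pair(story_feats, rng):
    # story_feats: (5, feat_dim) array in the original (correct) order.
    perm = rng.permutation(len(story_feats))   # order in which elements are presented
    shuffled = story_feats[perm]
    # Target for input slot k: one-hot over its true position in the story.
    targets = np.eye(len(story_feats))[perm]
    return shuffled, targets

rng = np.random.default_rng(0)
shuffled, targets = make_training_pair(np.zeros((5, 8896), dtype=np.float32), rng)
```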

We use the Adam optimizer at a learning rate of 0.01, decayed by a factor of 0.0001 every epoch, for both of the above-mentioned models.

4. Experiments and Results

We report Spearman's correlation and Kendall's tau metrics to evaluate our results. Kendall's tau [7] (τ) is computed as

τ = 1 − 2 · n_inv / C(N, 2),

where C(N, 2) = N(N − 1)/2 is the number of element pairs, n_inv is the number of pairs in the predicted sequence with incorrect relative order, and N is the length of the sequence, which is 5 in our case. A τ score of 0.5 therefore means that roughly a quarter of the element pairs in a predicted sequence are in the wrong relative order.
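For illustration, the snippet below (an illustrative helper, not a library call) computes τ for a single predicted ordering of a 5-element story:

```python
# Kendall's tau for one predicted ordering of a 5-element story.
from itertools import combinations

def kendall_tau(pred, truth=(0, 1, 2, 3, 4)):
    rank = {v: i for i, v in enumerate(truth)}   # true position of each element
    p = [rank[v] for v in pred]                  # predicted order expressed as true ranks
    n = len(p)
    n_inv = sum(1 for i, j in combinations(range(n), 2) if p[i] > p[j])
    return 1 - 2 * n_inv / (n * (n - 1) / 2)

print(kendall_tau([0, 2, 1, 3, 4]))   # one inverted pair out of 10 -> 0.8
```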

If the ranks are represented as distinct integers, which is the case here, Spearman's rank correlation (r_s) is given by

r_s = 1 − 6 · Σ d_i² / (N(N² − 1)),

where d_i is the difference between the input rank and the predicted rank of element i, and N is the number of observations, which is 5 in our case.
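Similarly, r_s can be computed directly from two rank sequences (again an illustrative helper, valid only when the ranks are distinct integers):

```python
# Spearman's rank correlation between predicted and true ranks (distinct integers).
def spearman_rs(pred_ranks, true_ranks):
    n = len(true_ranks)
    d_sq = sum((p - t) ** 2 for p, t in zip(pred_ranks, true_ranks))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

print(spearman_rs([0, 2, 1, 3, 4], [0, 1, 2, 3, 4]))   # sum of d^2 = 2 -> 0.9
```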

It is observed that the pairwise model is not able to capture the contextual information of the input sequence and is thus unable to achieve a high score. The pointer network overcomes this shortcoming and performs significantly better in qualitative experiments by capturing the intrinsic temporal intent of the stories and leveraging it to sort the shuffled input.

Model         τ      r_s
Random        0.24   0.20
Pairwise-V    0.28   0.24
Pairwise-T    0.40   0.38
Pairwise-VT   0.42   0.40
PtrNet-V      0.46   0.36
PtrNet-T      0.54   0.48
PtrNet-VT     0.60   0.52

Table 1. Kendall's tau (τ) and Spearman correlation (r_s) values on the test set

Across the six sets of experiments, we find that concatenating the visual and textual modalities performs better than using either modality individually. Of the two models trained, the pointer network performs significantly better than the pairwise comparison model: the pairwise model gives a τ score of 0.42 whereas the pointer network gives 0.60, a considerable improvement, though not as large as expected. A τ of 0.60 means that, on average, 8 of the 10 element pairs in a predicted sequence are in the correct relative order, so most entities land close to their correct positions. Figure 4 shows the average Spearman correlation for the predicted sequences across the validation set for every 100 epochs. Table 1 shows the metric values for the models and modalities tested for this task.

Figure 4. Average Spearman correlation (y-axis) against training epochs (x-axis) for the ptrNet model: V - visual features only, T - text features only, VT - concatenated features

Figure 5 illustrates the output attention weights for an input that is already sorted. The intent was to visualize the output values for a sorted input and see whether the model understands that it need not change the input at all. Although not reported, the attention map starts with uniformly distributed values of 0.2 for each of the 25 cells and converges to the identity matrix (which is also the expected ground truth for an already sorted input). The model learns to place the first entity in the sequence with very high confidence. It would be interesting to see the performance of the model using Bi-LSTMs or stacked LSTMs, to assess whether it can learn to place entities in the other positions with as high a confidence.

Figure 5. Attention map for the Pointer Network: ptrNet's output weights for an already sorted input (x-axis: predicted softmax weights for the output sequence; y-axis: ground truth of the input sequence)

We see that the model learns to predict the entities that appear toward the beginning and the end with high confidence, but softens the values for the entities that appear in the middle. Since the model is more influenced by the language component, we believe that certain words help teach the model to place entities in certain positions. This is also discussed and well illustrated in [1], where words like "overall" and "lastly" prove to be discriminative in assigning positions to entities in a story.

Acknowledgement

We would like to thank Professor Vicente Ordonez for his constant motivation and guidance and extend acknowledgment to Abhimanyu Banerjee for his useful insights.

5. Conclusion and Future Work

In this report, we address the task of sequencing shuffled stories from the corresponding image and caption pairs. To solve this, we deploy two models: a pairwise comparator model, much like the classic numeric "less than" operator but operating in the temporal sense, which we extend to sort the story; and a pointer network that works like a seq2seq model but directly outputs the expected index of each input element in the output as an attention map. We observe that visual features are not very robust for this task. Even pointer networks do not fully excel at this task, but they show promise as a direction worth pursuing, and we believe the Pointer Network is the appropriate choice of neural architecture for this problem. There is also considerable scope to improve the performance of the visual modality alone. As proposed earlier, it would help to represent the visual modality in an activity-context embedding, and we plan to try this to see whether it improves the current model. We could also train the skip-thought embeddings instead of using pre-trained weights. With a careful design, it should be possible to perform very well at this task using models inspired by Pointer Networks.

References

[1] H. Agrawal, A. Chandrasekaran, D. Batra, D. Parikh, and M. Bansal. Sort story: Sorting jumbled images and captions into stories. CoRR, abs/1606.07493, 2016.

[2] T. Basha, Y. Moses, and S. Avidan. Photo sequencing. In European Conference on Computer Vision, pages 654-667. Springer, 2012.

[3] H. Chen, S. R. K. Branavan, R. Barzilay, and D. R. Karger. Content modeling using latent permutations. CoRR, abs/1401.3488, 2014.

[4] J. Choi, T. H. Oh, and I. S. Kweon. Video-story composition via plot analysis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3122-3130, June 2016.

[5] J. Gong, X. Chen, X. Qiu, and X. Huang. End-to-end neural sentence ordering using pointer network. arXiv preprint arXiv:1611.04953, 2016.

[6] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294-3302, 2015.

[7] R. Kumar and S. Vassilvitskii. Generalized distances between rankings. In Proceedings of the 19th International Conference on World Wide Web, pages 571-580. ACM, 2010.

[8] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. F. Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696, 2016.

[9] L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, B. Scholkopf, and W. T. Freeman. Seeing the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[10] A. Schmaltz, A. M. Rush, and S. M. Shieber. Word ordering without syntax. arXiv preprint arXiv:1604.08633, 2016.

[11] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391, 2015.

[12] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. CoRR, abs/1506.03134, 2015.


Figure 6. Correct Predictions by the Model


Figure 7. Incorrect Predictions by the Model
