Combining Image and Language to Predict and Understand the ...cs231n.stanford.edu/reports/2017/posters/816.pdf · Combining Image and Language to Predict and Understand the Usefulness

Background

Results

• Cast the prediction problem as regression and classification

• Make usefulness predictions by combining two neural networks

  - CNN on an image associated with the review

  - Attention-based convolutional RNN on the text of the review

• Analyze attention weights to gain insight into what makes a review useful
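The two problem formulations are scored with different metrics on this poster: accuracy for classification and RMSE for regression. The following is a minimal sketch of both metrics in plain Python; the function names `accuracy` and `rmse` are illustrative and are not taken from the poster.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predicted class labels that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted vote counts."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared_error / len(y_true))
```

For classification, `y_true` would hold the 0/1 class labels; for regression, it would hold the raw number of useful votes.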

Using 10.2K reviews

- Class 0: reviews with 0 useful votes

- Class 1: reviews with >9 useful votes

- A subset of images from the 200,000 business pictures provided by Yelp
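The class definitions above can be sketched as a small labeling function. This is an illustration only: the poster does not say how reviews with 1 to 9 useful votes are handled, so the sketch assumes they are left out of the classification set, and the `useful_votes` field name is hypothetical.

```python
def usefulness_class(useful_votes):
    """Map a useful-vote count to a class label.

    Returns 0 for reviews with no useful votes, 1 for reviews with more
    than 9 useful votes, and None otherwise (assumed to be excluded).
    """
    if useful_votes == 0:
        return 0
    if useful_votes > 9:
        return 1
    return None


def build_classification_set(reviews):
    """Keep only reviews that fall into class 0 or class 1.

    `reviews` is a list of dicts with a hypothetical 'useful_votes' key.
    """
    labeled = []
    for review in reviews:
        label = usefulness_class(review["useful_votes"])
        if label is not None:
            labeled.append((review, label))
    return labeled
```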

[Figure: Number of ConvRNN layers vs. classification accuracy. Y-axis: 71% to 80%; x-axis: 0 to 5 layers.]

[Figure: Number of ConvRNN layers vs. regression error (RMS). Y-axis: 2.79 to 2.85; x-axis: 0 to 4 layers.]

[Figure: Classification results (all models). Y-axis: accuracy, 50% to 80%. Models: Image+ConvRNN, ConvRNN, RNN+Attention, Bi-RNN, SVM.]

[Figure: Regression results (all models). Y-axis: RMSE, 2.6 to 3.2. Models: Image+ConvRNN, ConvRNN, RNN+Attention, Bi-RNN, Linear Regression.]

Combining Image and Language to Predict and Understand the Usefulness of Yelp Reviews

David Z Liu

Challenge

• Predicting a review's usefulness is important and challenging

• Knowing a review's usefulness in advance lets businesses recommend high-quality, fresh reviews to customers and gain business insight

• For reviews of the same length, no obvious features directly indicate the usefulness of a document

• Only a small fraction of the data has a significant number of useful votes

Approach

Data Set

Link Image With Text

- Link a business photo to a review written about that business by matching words in the image caption with words in the review text

- Pick the image with the highest Jaccard similarity
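The linking step above can be sketched as follows. This is a minimal illustration, not the author's code: the tokenization rule (lowercased alphanumeric words) and the names `tokenize`, `jaccard`, and `pick_image` are assumptions, as is the choice to return no image when no caption shares a word with the review.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pick_image(review_text, captions):
    """Return (photo_id, score) for the caption most similar to the review.

    `captions` maps a photo id to its caption for one business.
    Returns (None, 0.0) when no caption shares any word with the review.
    """
    review_words = tokenize(review_text)
    best_id, best_score = None, 0.0
    for photo_id, caption in captions.items():
        score = jaccard(review_words, tokenize(caption))
        if score > best_score:
            best_id, best_score = photo_id, score
    return best_id, best_score
```

For example, `pick_image("The pizza was great", {"p1": "outside patio", "p2": "great pizza"})` selects `"p2"`, because its caption shares two of the four distinct words in the union with the review.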

Model

• Input to the model:

  - Written review to the RNN; the text is converted to 300-dimensional GloVe vectors

  - Associated image to the CNN, converted to 64x64x3

• Output from the model:

  - Classification or regression prediction
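The poster does not spell out how the image and text branches are merged. One common option, shown here purely as an assumption, is late fusion: concatenate the feature vector from each branch and apply a single linear layer, followed by a sigmoid for classification or left as a raw score for regression. The function name and all parameters below are hypothetical.

```python
import math


def fuse_and_predict(image_features, text_features, weights, bias,
                     task="classification"):
    """Concatenate two feature vectors and apply one linear output layer.

    Assumed late-fusion sketch, not the poster's confirmed architecture.
    Returns a probability of class 1 for classification, or a raw
    predicted value (e.g. useful-vote count) for regression.
    """
    fused = list(image_features) + list(text_features)
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused feature length")
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    if task == "classification":
        return 1.0 / (1.0 + math.exp(-score))  # sigmoid
    if task == "regression":
        return score
    raise ValueError("task must be 'classification' or 'regression'")
```

In a trained model, `weights` and `bias` would be learned jointly with the CNN and RNN parameters rather than supplied by hand.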

[Figure: Model architecture. Image branch (CNN) sizes: 64x64x3, 32x32x8, 16x16x16, 8x8x64, 4x4x64. Text branch (example input "I Love food and Images") sizes: 300x300x1, 200x200x16, 100x100x16, 100.]

(0 ConvRNN layers = bidirectional RNN without attention)

Visualizing Attention Weights

A typical review with the top 25 words highlighted in green based on attention intensity.

[Figure: Attention visualizations and predictions for four models: Bidirectional RNN Without Attention, Bidirectional RNN With Attention, ConvRNN, Image + ConvRNN.]

Size corresponds to the attention weight
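Selecting the most-attended words for such a visualization can be sketched as below. The sketch assumes that attention weights are a softmax over one raw score per word, which is the usual formulation but is not confirmed by the poster; `softmax` and `top_attended_words` are illustrative names.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attended_words(words, scores, k=25):
    """Return the k (word, weight) pairs with the largest attention weight."""
    if len(words) != len(scores):
        raise ValueError("need exactly one score per word")
    weights = softmax(scores)
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The returned weights could then drive the highlight color or font size of each word in the rendered review.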
