Yelp Data Challenge: User Rating Prediction Using Machine Learning and Deep Learning. Amr Koura


Apr 08, 2017



AMR koura
Transcript
Page 1: yelp data challenge

Yelp Data Challenge: User Rating Prediction Using Machine Learning and Deep Learning

Amr Koura

Page 2: yelp data challenge

Content

• Yelp Data Challenge
• Problem Definition
• Classical Machine Learning
• Deep Learning
• Docker Image
• Conclusion

Page 3: yelp data challenge

Yelp Data Challenge

Page 4: yelp data challenge

Yelp dataset

Academic dataset.
Information about local businesses.
Five JSON files:
1. business
2. check-in
3. review
4. tip
5. user

https://www.yelp.com/dataset_challenge/dataset

Page 5: yelp data challenge

User Reviews

Page 6: yelp data challenge

User Reviews

The JSON files contain user reviews. Example:

{"user_id": "Iu6AxdBYGR4A0wspR9BYHA", "review_id": "KPvLNJ21_4wbYNctrOwWdQ", "stars": 5, "date": "2014-02-13", "text": "Excellent food. Superb customer service. I miss the mario machines they used to have, but it's still a great place steeped in tradition.", "type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw"}
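A record like the one above can be read with Python's standard json module. The following is a minimal sketch, assuming each review is stored as one JSON object per line; the helper name parse_review is illustrative and is not taken from the author's repository.

```python
import json

def parse_review(line):
    """Parse one JSON-encoded review and return (text, stars)."""
    record = json.loads(line)
    return record["text"], record["stars"]

# Shortened copy of the example record shown above.
sample = (
    '{"user_id": "Iu6AxdBYGR4A0wspR9BYHA", "stars": 5, '
    '"text": "Excellent food. Superb customer service.", "type": "review"}'
)
text, stars = parse_review(sample)
```

Only the "text" and "stars" fields are needed for the prediction task; the remaining fields are ignored here.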

Page 7: yelp data challenge

Problem Definition

Page 8: yelp data challenge

“Given the text of a user review, can we predict the user's rating?”

Page 9: yelp data challenge

Simpler Problem

Assume that:

• bad rating: 1-3 stars
• good rating: 4-5 stars

We will predict whether the user liked the service (rating of 4 or 5) or not (rating of 1, 2, or 3).

Page 10: yelp data challenge

“Given the text of a user review, can we predict whether the user liked the service or not?”

Page 11: yelp data challenge

Pre-Processing

Page 12: yelp data challenge

Pre-processing

Extract the first 2000 reviews as the dataset.

Features: the review text.
Labels: Positive if rating > 3, otherwise Negative.
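The pre-processing step can be sketched as follows. This is an illustration under assumptions, not the author's code: it assumes one JSON review per line, and the names label_for and load_dataset are hypothetical.

```python
import json
from itertools import islice

def label_for(stars):
    """Map a star rating to the binary label: above 3 is Positive."""
    return "Positive" if stars > 3 else "Negative"

def load_dataset(lines, limit=2000):
    """Return (text, label) pairs for the first `limit` reviews."""
    dataset = []
    for line in islice(lines, limit):
        record = json.loads(line)
        dataset.append((record["text"], label_for(record["stars"])))
    return dataset

# Tiny in-memory stand-in for the review file (made-up reviews).
demo_lines = [
    json.dumps({"text": "Great place", "stars": 5}),
    json.dumps({"text": "Average at best", "stars": 3}),
    json.dumps({"text": "Never again", "stars": 1}),
]
demo = load_dataset(demo_lines, limit=2)
```

Because load_dataset accepts any iterable of lines, an open file object for the review file can be passed in place of demo_lines.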

Page 13: yelp data challenge

Approach

Page 14: yelp data challenge

Classical Machine Learning Algorithms

Deep Learning Approach

Page 15: yelp data challenge

Classical Machine Learning Algorithms

Page 16: yelp data challenge

Classical Machine Learning

A combination of several machine learning algorithms:

• Logistic Regression
• Naive Bayes
• Stochastic Gradient Descent
• Support Vector Machine

The results are combined by majority vote.
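Majority voting over the individual classifiers' predictions can be sketched in a few lines. This is a generic illustration, not the repository's implementation; the function name majority_vote and the vote-share return value are assumptions made for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label and the share of votes it received."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)

# Three hypothetical classifiers disagree on one review.
label, share = majority_vote(["Positive", "Negative", "Positive"])
```

With an even number of classifiers a tie is possible; in this sketch the label that appears first in the list wins a tie, so an odd number of voters is the simpler choice.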

https://github.com/amrqura/yelpRatePrediction

Page 17: yelp data challenge

Classical Machine Learning: Features

The words of the review text are used as features, as follows:

1. Extract the 3000 most common words in the training dataset.
2. For each review, check whether each of those frequent words is present.

Feature set: a 2000 × 3000 matrix. 2000 is the number of reviews; 3000 is the number of boolean values, one per frequent word, recording whether that word occurs in the review.
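The feature construction described above can be sketched as follows. The whitespace tokenizer and the names tokenize, top_words, and featurize are assumptions for illustration; the author's code may tokenize differently.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()

def top_words(texts, n=3000):
    """Return the n most common words across all training texts."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return [word for word, _ in counts.most_common(n)]

def featurize(text, vocabulary):
    """One boolean per vocabulary word: does it occur in this text?"""
    present = set(tokenize(text))
    return [word in present for word in vocabulary]

# Toy training set; with real data n would be 3000 and there would be 2000 rows.
train_texts = ["good food good service", "bad food", "good place"]
vocab = top_words(train_texts, n=2)
matrix = [featurize(t, vocab) for t in train_texts]
```

Each row of matrix corresponds to one review and each column to one frequent word, which gives the 2000 × 3000 shape when applied to the full dataset.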


Page 18: yelp data challenge

Classical Machine Learning: Training

Train the model on 80% of the reviews and validate on the remaining 20%.

Run:

$ python3 TextClassification.py
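The 80/20 split can be sketched as below. Whether the script shuffles before splitting is not stated in the slides, so the seeded shuffle here is an assumption, and train_validation_split is a hypothetical helper name.

```python
import random

def train_validation_split(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into train and validation parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 2000 reviews: just their indices.
train, validation = train_validation_split(range(2000))
```

With 2000 examples this yields 1600 training and 400 validation examples.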


Page 19: yelp data challenge

Classical Machine Learning: Results (5-fold cross-validation)

Classifier             Accuracy (%)
Naive Bayes            66.5
MultinomialNB          67.95
BernoulliNB            65.9
Logistic regression    67.15
SGD                    63.3
SVC                    54.0
LinearSVC              63.75
NuSVC                  66.35
Voted                  67.35


Page 20: yelp data challenge

Deep learning Approach

Page 21: yelp data challenge

Convolutional Neural Network for sentence classification

Page 22: yelp data challenge

CNN Architecture

https://arxiv.org/pdf/1408.5882v2.pdf

Page 23: yelp data challenge

CNN Dataset

Each review is represented by an n × k matrix, where n is the number of words and k is the length of each word vector.

We use word2vec to convert each word to a vector.
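Building the n × k input matrix can be sketched as follows. A toy dictionary stands in for a trained word2vec model, and the zero-padding, truncation, and zero vector for unknown words are assumptions made for the example rather than details confirmed by the slides.

```python
def sentence_matrix(words, embeddings, n, k):
    """Build an n x k matrix: one k-dimensional row per word.

    Unknown words map to a zero vector; short inputs are zero-padded
    and long inputs are truncated to n rows (assumed conventions).
    """
    zero = [0.0] * k
    rows = [list(embeddings.get(word, zero)) for word in words[:n]]
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows

# Toy 3-dimensional embeddings standing in for word2vec vectors.
toy_embeddings = {"great": [0.9, 0.1, 0.0], "food": [0.2, 0.7, 0.1]}
m = sentence_matrix(["great", "food", "here"], toy_embeddings, n=4, k=3)
```

Fixing n across all reviews gives every input the same shape, which the convolutional layers require.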

https://github.com/amrqura/deepYelpRatePrediction

Page 24: yelp data challenge

CNN Training

Implementation using TensorFlow.

• Number of filters: 128
• Filter sizes: 3, 4, 5
• Dropout: 0.5

Apply max pooling.

Output layer: two nodes with a softmax function.

Use the cross-entropy error function with L2 regularization.
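The main building blocks listed above can be illustrated in plain Python. This is a conceptual sketch only, not the TensorFlow implementation: it shows a single filter with max-over-time pooling, a softmax output, and a cross-entropy loss with an L2 penalty. The ReLU activation is an assumption, dropout is omitted, and all function names are hypothetical.

```python
import math

def conv_max_pool(matrix, filt, bias=0.0):
    """Slide one h x k filter over an n x k matrix (n >= h), apply ReLU
    (an assumed activation), and keep the maximum over all positions."""
    h = len(filt)
    activations = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for i in range(h):
            total += sum(x * w for x, w in zip(matrix[start + i], filt[i]))
        activations.append(max(0.0, total))
    return max(activations)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(probabilities, true_index, weights, l2_lambda):
    """Cross-entropy for the true class plus an L2 penalty on the weights."""
    cross_entropy = -math.log(probabilities[true_index])
    l2_penalty = l2_lambda * sum(w * w for w in weights)
    return cross_entropy + l2_penalty

# Toy 4 x 2 sentence matrix and one filter of size 3.
toy_matrix = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_filter = [[1, 0], [0, 1], [1, 0]]
pooled = conv_max_pool(toy_matrix, toy_filter)
probs = softmax([0.0, 0.0])
example_loss = loss(probs, 0, [1.0, 2.0], l2_lambda=0.1)
```

In the full model, each filter of each size contributes one pooled value, and the concatenated values feed the two-node softmax layer.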


Page 25: yelp data challenge

CNN: Training

5-fold cross-validation. Train the model on 1600 reviews and validate on 400.

Run:

$ python rating_prediction.py

Accuracy: 0.73549998
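The 1600/400 split follows from dividing 2000 reviews into five folds. A minimal sketch of generating those folds is shown below; contiguous, unshuffled folds and the name k_fold_indices are assumptions, and the author's script may build its folds differently.

```python
def k_fold_indices(n_examples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_size = n_examples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        stop = start + fold_size if fold < k - 1 else n_examples
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_examples))
        yield train, validation

folds = list(k_fold_indices(2000, k=5))
```

Each review is used for validation exactly once, and the reported accuracy would typically be the average over the five folds.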


Page 26: yelp data challenge

Docker Image

Page 27: yelp data challenge

Docker Image

Docker image: https://hub.docker.com/r/amrkoura/yelpchallenge/

Run:

$ docker pull amrkoura/yelpchallenge
$ docker run -t -i amrkoura/yelpchallenge /bin/bash
$ cd /src/

Run classical machine learning:

$ python3 TextClassification.py

Run deep learning:

$ python rating_prediction.py

Page 28: yelp data challenge

Conclusion

Page 29: yelp data challenge

Conclusion

Two implementations that predict the user rating from the review text:
• Classical machine learning
• Deep learning (convolutional neural network)

Public Docker image:

https://hub.docker.com/r/amrkoura/yelpchallenge/

Page 30: yelp data challenge

Thank you