yelp data challenge

Yelp Data Challenge User Rating Prediction using machine and Deep learning.

Amr Koura

Content

• Yelp Data Challenge• Problem Definition• Classical Machine Learning• Deep Learning• Docker Image• Conclusion

Yelp Data Challenge

Yelp dataset

Academic Dataset.Information about local business.Five Json files:1. business2. check-in3. review4. tip5. user

https://www.yelp.com/dataset_challenge/dataset

User Reviews

User Reviews

Jason Files contains user reviews:Example:

{"user_id": "Iu6AxdBYGR4A0wspR9BYHA", "review_id": "KPvLNJ21_4wbYNctrOwWdQ", "stars": 5, "date": "2014-02-13", "text": "Excellent food. Superb customer service. I miss the mario machines they used to have, but it's still a great place steeped in tradition.", "type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw"}

Problem Definition

“Given a user Review text , can we predict user Rating?”

Simpler Problem

Assuming That:

• bad rate (1-3)• good rate (4-5)we will predict whether the user like the

service (rate=4 or 5) or not(rate=1,2 or 3)

“Given a user Review text , can we predict whether user like service or not?”

Pre-Processing

Pre-processing

Extract the first 2000 review as dataset.

Features: user reviews text.Labels: if Rate >3: Positive Else: Negative

Approach

Classical Machine Learning

Algorithms

Deep learning Approach

Classical Machine Learning Algorithms

Classical Machine Learningcombination of several machine learning algorithms.

Logistic Regression. Naive Bayes. Stochastic gradient Descent. Support Vector Machine.

combine results using majority votes.

https://github.com/amrqura/yelpRatePrediction


Classical Machine LearningFeatures:

use words of the review text as features as the following:

1.Extract most common 3000 words in the training dataset.

In each statement examine if the frequent words exists or not.

Feature set : matrix of [2000*3000] 2000: number of statements. 3000: boolean values examine existence of frequent word.



Classical Machine LearningTraining:

Train model in 80% statements and validate on 20%.

Run:

$ python3 TextClassification.py



Classical Machine LearningResults(5- cross validation):

Naive Bayes accuracy percent: 66.5 MultinomialNB accuracy percent: 67.95 BernoulliNB accuracy percent: 65.9 Logistic regression accuracy percent: 67.15 SGD accuracy percent: 63.3 SVC accuracy percent: 54.0 LinearSVC accuracy percent: 63.75 NuSVC accuracy percent: 66.35 Voted accuracy percent: 67.35



Deep learning Approach

Convolutional Neural Network for sentence classification

CNN Architecture

https://arxiv.org/pdf/1408.5882v2.pdf

https://arxiv.org/pdf/1408.5882v2.pdf

CNN DatasetEach statement is represented by n*k matrix. n= number of words k= vector length.

we use word2vec to convert each word to vector.

https://github.com/amrqura/deepYelpRatePrediction


CNN Training

implementation using Tensorflow. number of filters=128 filter size=3,4,5. drop out=0.5

Apply max pooling.

output layer: two nodes with softmax function.

use cross-entropy error function. use L2 regularization.



CNNTraining:

5 cross validation. Train model in 1600 statements and validate on 400.

Run:

$ python rating_prediction.py

Accuracy: 0.73549998



Docker Image

Docker ImageDocker Image:

https://hub.docker.com/r/amrkoura/yelpchallenge/

Run: docker pull amrkoura/yelpchallange docker run -t -i amrkoura/yelpchallenge /bin/bash cd /src/

Run classical machine learning:

python3 TextClassification.py

Run Deep learning:

python rating_prediction.py

Conclusion

Conclusion2 implementations to predict the user rating from user review. • classical machine learning • Deep learning , convolutional neural network.

public Docker image:

https://hub.docker.com/r/amrkoura/yelpchallenge/

Thank you

yelp data challenge

Engineering