Building a high scale machine learning pipeline with Apache Spark and Kafka https://www.flickr.com/photos/sanjayaprime/5013115478 Bedő Dániel

Building a high scale machine learning pipeline with ...biconsulting.hu/letoltes/2016budapestdata/bedodaniel_spark_final.pdf · Building a high scale machine learning pipeline with

Mar 17, 2018

Page 1

Building a high scale machine learning pipeline with Apache Spark and Kafka

https://www.flickr.com/photos/sanjayaprime/5013115478

Bedő Dániel

Page 2

• biggest community-driven question & answer website in Germany

• 20 million questions, 70 million answers

• similar to Quora, Yahoo Answers

Page 3

Google Update Impact

Page 4

Ordering of answers

Page 5

supervised machine learning

determine the type of the training data

gather a training set

find a representation of the data

pick a learning algorithm

run the training algorithm

evaluate the accuracy
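The steps above can be sketched end to end on toy data. This is a minimal illustration, not the model from the talk: the single feature (answer length), the target (votes), the numbers, and the helper names `fit_line` and `mean_squared_error` are all made up for the example.

```python
# Toy walk-through of the listed steps; all data and names are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and targets."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Steps 1-3: type of data, training set, representation.
# Here: one numeric feature (answer length) mapped to a numeric target (votes).
train_x, train_y = [10.0, 20.0, 30.0, 40.0], [1.0, 2.0, 3.0, 4.0]
test_x, test_y = [50.0, 60.0], [5.0, 6.0]

# Steps 4-5: pick a learning algorithm and run it.
model = fit_line(train_x, train_y)

# Step 6: evaluate on data the model has not seen.
test_error = mean_squared_error(model, test_x, test_y)
```

Keeping a held-out test set separate from the training set is what makes the last step meaningful.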

Page 6

Regression Prototype

Page 7

Identify the problems

• Model not complex enough

• Similar inputs, different outputs?

• Not enough training data

Page 8

The old ETL pipeline

Page 9

ETL v2

Spark

Page 10

Spark ecosystem

API: Scala, Python, Java, R

Spark Streaming, Spark SQL, MLlib, GraphX

Page 11

Kafka

[Diagram: a Kafka cluster of three brokers, each holding one partition of Topic 1 (Broker 1: Partition 0, Broker 2: Partition 1, Broker 3: Partition 2). Two producers write to the cluster and three consumers read from it.]

Page 12

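Kafka topic

• scale

• parallelism

[Diagram: a producer appends messages to a topic with three partitions (partition 0, partition 1, partition 2). Each partition is an ordered log with its own offsets; the figure shows offsets 0-9, 0-7 and 0-8.]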
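How a topic spreads messages over partitions while keeping per-partition offsets can be sketched in memory. This is not a Kafka client: the `Topic` class is hypothetical, and CRC32 is only a stand-in for whatever hash a real partitioner applies to the message key.

```python
import zlib


class Topic:
    """In-memory stand-in for a partitioned topic (illustration only)."""

    def __init__(self, num_partitions):
        # One append-only list per partition; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 replaces the hash used by a real partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append to the key's partition and return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


topic = Topic(3)
first = topic.send("question-42", "created")
second = topic.send("question-42", "answered")
```

Messages with the same key land in the same partition, so their relative order is preserved, while different partitions can be consumed in parallel.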

Page 13

Parquet

id  cc  votes
1   DE  2
2   DE  3
3   AT  1
4   DE  2

SELECT votes FROM logs WHERE cc = 'AT'

push-down filters
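The effect of a pushed-down filter on a columnar layout can be imitated with plain lists. This sketch assumes the four rows are split into two row groups that keep min/max statistics per column; the split and the helper names `make_row_group` and `select_where_eq` are illustrative, not a real Parquet reader.

```python
def make_row_group(rows, columns):
    """Store each column as its own list and record min/max statistics."""
    data = {name: [row[i] for row in rows] for i, name in enumerate(columns)}
    stats = {name: (min(values), max(values)) for name, values in data.items()}
    return {"data": data, "stats": stats}


def select_where_eq(row_groups, select_col, filter_col, value):
    """SELECT select_col WHERE filter_col = value, skipping groups via stats."""
    result, scanned = [], 0
    for group in row_groups:
        low, high = group["stats"][filter_col]
        if not (low <= value <= high):
            continue  # pushed-down filter: this group cannot contain a match
        scanned += 1
        # Only the two columns the query needs are touched; "id" is never read.
        wanted = group["data"][select_col]
        flags = group["data"][filter_col]
        result.extend(v for v, f in zip(wanted, flags) if f == value)
    return result, scanned


columns = ["id", "cc", "votes"]
row_groups = [
    make_row_group([(1, "DE", 2), (2, "DE", 3)], columns),
    make_row_group([(3, "AT", 1), (4, "DE", 2)], columns),
]
votes, groups_scanned = select_where_eq(row_groups, "votes", "cc", "AT")
```

The first row group contains only `DE`, so its statistics rule it out before any values are read.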

Page 14

ETL v2

[Architecture diagram. Components shown: Services, Tracking, Rabbit MQ, Kafka, Spark Cluster, Streaming Worker, HDFS (Tracking), MySQL Master, MySQL Read Slave, Redis Master, Redis Read Slave, ElasticSearch.]

Page 15

Project Moria

Page 16

Clean training data?

Page 17

Project Angmar

• tried lots of different supervised learning methods

• feature engineering - most crucial part

• analyse the domain, chart everything

Page 18

Features

Content

• length

• syntactic complexity

• number of links

• probability of deletion

Social

• votes

• most helpful answer

• number of comments

• answered by expert

Author

• gained votes

• credibility score

• role

• deleted answer ratio

• number of answers

• number of comments

• reported answer ratio
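A few of the listed features are simple enough to sketch directly. The definitions below are assumptions for illustration: length is taken as character count, links are matched with a basic URL pattern, and the ratios divide by the author's total answers. The talk does not say how the real pipeline computes them.

```python
import re

# Assumption: a link is anything that starts with http:// or https://.
LINK_PATTERN = re.compile(r"https?://\S+")


def content_features(text):
    """Content features derived from the answer text alone."""
    return {
        "length": len(text),
        "number_of_links": len(LINK_PATTERN.findall(text)),
    }


def author_features(answers_written, answers_deleted, answers_reported):
    """Author features derived from counts over the author's history."""
    total = max(answers_written, 1)  # avoid division by zero for new authors
    return {
        "number_of_answers": answers_written,
        "deleted_answer_ratio": answers_deleted / total,
        "reported_answer_ratio": answers_reported / total,
    }


text = "See https://example.com and http://example.org"
features = {
    **content_features(text),
    **author_features(answers_written=20, answers_deleted=5, answers_reported=1),
}
```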

Page 19

The structure of the network

[Diagram: an answer vector is normalized (AV normalized) and passed through the network, which produces the output.]
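The normalization step can be sketched as min-max scaling, which maps every feature to the range 0 to 1 before it reaches the network. Min-max scaling is an assumption here; the slide only shows that the answer vector is normalized, not which method is used. The sample vectors are made up.

```python
def min_max_normalize(vectors):
    """Scale each feature column to [0, 1] across all answer vectors."""
    columns = list(zip(*vectors))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for vector in vectors:
        normalized.append([
            # A constant column carries no information, so map it to 0.0.
            (x - low) / (high - low) if high > low else 0.0
            for x, low, high in zip(vector, lows, highs)
        ])
    return normalized


# Hypothetical raw answer vectors: [length, votes, number of links]
answers = [[200, 3, 1], [50, 0, 0], [125, 6, 1]]
scaled = min_max_normalize(answers)
```

Without this step, a feature with a large numeric range such as length would dominate features with small ranges such as votes.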

Page 20

The Result

Page 21

Lambda Architecture

Batch Layer (Spark Batch)

• Calculate features for all answers

• Insert features in Redis

• Calculate Score and store in MySQL

Speed Layer (Spark Streaming)

• Listen for events

• Insert or update Redis

• Calculate Score and store in MySQL

Serving Layer
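The division of work between the layers can be sketched with two dictionaries standing in for Redis and MySQL. Everything here is hypothetical: the `score` function replaces the trained model, the event shape is invented, and the serving layer is assumed to order answers by stored score.

```python
feature_store = {}  # stands in for Redis
score_store = {}    # stands in for MySQL


def score(features):
    # Hypothetical scoring function; it stands in for the trained model.
    return 2 * features.get("votes", 0) + features.get("comments", 0)


def batch_layer(all_answers):
    """Recompute features and scores for every answer."""
    for answer_id, features in all_answers.items():
        feature_store[answer_id] = dict(features)
        score_store[answer_id] = score(feature_store[answer_id])


def speed_layer(event):
    """Apply one incoming event as an incremental update."""
    features = feature_store.setdefault(event["answer_id"], {})
    name = event["feature"]
    features[name] = features.get(name, 0) + event["delta"]
    score_store[event["answer_id"]] = score(features)


def serving_layer(answer_ids):
    """Order answers by their stored score, best first."""
    return sorted(answer_ids, key=lambda a: score_store.get(a, 0), reverse=True)


batch_layer({"a1": {"votes": 1, "comments": 0}, "a2": {"votes": 0, "comments": 1}})
speed_layer({"answer_id": "a2", "feature": "votes", "delta": 2})
ranking = serving_layer(["a1", "a2"])
```

Both layers write to the same stores, so the serving side reads one consistent view regardless of which layer produced the latest score.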

Page 22

Back pressure

• Bulk jobs insert too fast

• MySQL: sendQueue size, threads connected

• ElasticSearch: load on the instance creating the new index
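One way to apply back pressure is to check a load metric before every bulk write and pause while the sink is overloaded. The sketch below is a generic pattern, not the talk's implementation: `current_load` is a hypothetical probe (for example threads connected), and the thresholds and delays are arbitrary.

```python
import time


def throttled_bulk_insert(batches, write, current_load, max_load,
                          sleep=time.sleep, base_delay=0.1, max_delay=5.0):
    """Write batches, backing off exponentially while the sink is overloaded."""
    waits = 0
    for batch in batches:
        delay = base_delay
        while current_load() > max_load:
            sleep(delay)
            waits += 1
            delay = min(delay * 2, max_delay)  # cap the backoff
        write(batch)
    return waits


# Demo with fake probes so nothing actually sleeps.
readings = iter([10, 90, 90, 20, 30])
written, pauses = [], []
waits = throttled_bulk_insert(
    batches=[[1, 2], [3, 4], [5, 6]],
    write=written.append,
    current_load=lambda: next(readings),
    max_load=50,
    sleep=pauses.append,
)
```

Passing `sleep` and the load probe as parameters keeps the throttle testable and lets the same loop guard either MySQL or ElasticSearch writes.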

Page 23

Debugging the network

Change individual features (+1)
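Changing one feature at a time and watching the score is a simple sensitivity check. The sketch below raises each feature by 1 in isolation and records the score difference. `toy_score` is a hypothetical stand-in for the trained network, and the feature names and weights are made up.

```python
def feature_sensitivity(score_fn, features, step=1.0):
    """Return how far the score moves when each feature alone is raised by step."""
    baseline = score_fn(features)
    deltas = {}
    for name in features:
        changed = dict(features)  # copy so only one feature differs
        changed[name] += step
        deltas[name] = score_fn(changed) - baseline
    return deltas


def toy_score(f):
    # Hypothetical stand-in for the trained network.
    return 0.5 * f["votes"] + 0.25 * f["links"] - 2.0 * f["reported_ratio"]


deltas = feature_sensitivity(
    toy_score, {"votes": 4.0, "links": 2.0, "reported_ratio": 0.0}
)
```

A feature whose change moves the score in an unexpected direction, or not at all, is a good place to start looking for problems in the training data.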

Page 24

real world test (deleted vs non-deleted)

[Chart: deleted vs. non-deleted]

Page 25

Switching models

[Chart: amount of questions per score range, Old Score vs. New Score]

Page 26

Page 27

Insights

[Chart: Men vs. Women]

Page 28

Learnings

• If your use case is complex, you need a complex model

• If you have a complex model, you need lots of data

• If you have lots of data, you need an ETL pipeline that can process huge amounts of data fast

• Think about your use case first, then design the pipeline

Page 29

Questions?

You can ask them on gutefrage too :)