Felix Cheung, Principal Engineer & Spark Committer

SSR: Structured Streaming for R and Machine Learning

Jan 23, 2018

Page 1: SSR: Structured Streaming for R and Machine Learning

Felix Cheung

Principal Engineer & Spark Committer

SSR: Structured Streaming for R and Machine Learning

Page 2: SSR: Structured Streaming for R and Machine Learning
Page 3: SSR: Structured Streaming for R and Machine Learning

Disclaimer:

Apache Spark community contributions

Page 4: SSR: Structured Streaming for R and Machine Learning

Agenda

• Structured Streaming

• ML Pipeline

• R - putting it all together

• Considerations

Page 5: SSR: Structured Streaming for R and Machine Learning
Page 6: SSR: Structured Streaming for R and Machine Learning

Why Streaming?

• Faster insight at scale

• ETL

• Trends

• Latest data to static data

• Continuous Learning

Page 7: SSR: Structured Streaming for R and Machine Learning

Spark Streaming

1. Receiver

2. Direct DStream

3. Structured Streaming

Page 8: SSR: Structured Streaming for R and Machine Learning

Structured Streaming

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Page 9: SSR: Structured Streaming for R and Machine Learning

Structured Streaming

• "Streaming Logical Plan"

– Extending Dataset/DataFrame to include incremental execution of unbounded input

– aka Rinse & Repeat

Page 10: SSR: Structured Streaming for R and Machine Learning

Same

• Transformations:

map

filter

aggregate

window

join* (*some limitations)

Page 11: SSR: Structured Streaming for R and Machine Learning

Better

• Trigger

• Consistency

• Fault Tolerance

• Event time – late data, watermark

Page 12: SSR: Structured Streaming for R and Machine Learning

Execution Plan

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

Page 13: SSR: Structured Streaming for R and Machine Learning

SS in a Circuit

DataFrame → DataFrame, Trigger

Page 14: SSR: Structured Streaming for R and Machine Learning

Source

File

Kafka

Socket

MQTT

Page 15: SSR: Structured Streaming for R and Machine Learning

Sink

File (new formats in 2.1+)

Console

Memory (aka Temp View)

Foreach

Kafka (new in 2.2)

Page 16: SSR: Structured Streaming for R and Machine Learning

Output Mode

Append (default)

Complete

Update (new in 2.1.1)

Page 17: SSR: Structured Streaming for R and Machine Learning

Streaming & ML Don't Mix*

Page 18: SSR: Structured Streaming for R and Machine Learning

ML Pipeline Model

Page 19: SSR: Structured Streaming for R and Machine Learning

Remember the SS Flow?

DataFrame → DataFrame

Page 20: SSR: Structured Streaming for R and Machine Learning

ML Pipeline fit()

• Essentially an Action

• Results in a Model

• Sink start() also an Action

• Structured Streaming circuit must be completed with Sink start()

Page 21: SSR: Structured Streaming for R and Machine Learning

R to the Rescue

Page 22: SSR: Structured Streaming for R and Machine Learning

R

• Statistical computing and graphics

• 10.7k+ packages on CRAN

Page 23: SSR: Structured Streaming for R and Machine Learning

Why Streaming in R

• Single integrated job for everything

1. Ingest

2. ETL

3. Machine Learning

• Use your favorite packages - freedom to choose

• rkafka – last published 2015

Page 24: SSR: Structured Streaming for R and Machine Learning
Page 25: SSR: Structured Streaming for R and Machine Learning

SparkR

• DataFrame API like R data.frame, dplyr

– Full Spark optimizations

• SQL, Session, Catalog

• “Spark Packages”

• ML

• R-native UDF

• SS

Page 26: SSR: Structured Streaming for R and Machine Learning

Native R UDF

• User-Defined Functions - custom transformation

• Apply by Partition

• Apply by Group
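
The two apply modes above correspond to SparkR's `dapply` (one R call per partition) and `gapply` (one R call per group). The following is a minimal sketch, not the talk's own code: the columns `topic` and `text` and the helper names `add_nchar`, `summarise_group`, and `run_on_spark` are assumptions made for illustration. The Spark wiring sits inside a function so the UDF bodies can be tried on a local data.frame first.

```r
# UDF for "apply by partition": receives one partition as an R data.frame
# and must return a data.frame matching the declared output schema.
add_nchar <- function(pdf) {
  data.frame(topic = as.character(pdf$topic),
             text = as.character(pdf$text),
             n_chars = as.integer(nchar(as.character(pdf$text))),
             stringsAsFactors = FALSE)
}

# UDF for "apply by group": receives the grouping key and that group's rows.
summarise_group <- function(key, gdf) {
  data.frame(topic = as.character(key[[1]]),
             n_rows = as.integer(nrow(gdf)),
             stringsAsFactors = FALSE)
}

# Spark wiring (assumes SparkR is installed and a session can be started).
run_on_spark <- function(local_df) {
  SparkR::sparkR.session()
  sdf <- SparkR::createDataFrame(local_df)
  by_partition <- SparkR::dapply(sdf, add_nchar,
    SparkR::structType(SparkR::structField("topic", "string"),
                       SparkR::structField("text", "string"),
                       SparkR::structField("n_chars", "integer")))
  by_group <- SparkR::gapply(sdf, "topic", summarise_group,
    SparkR::structType(SparkR::structField("topic", "string"),
                       SparkR::structField("n_rows", "integer")))
  list(by_partition = SparkR::collect(by_partition),
       by_group = SparkR::collect(by_group))
}

# Local dry run without Spark, as a sanity check of the UDF bodies.
news <- data.frame(topic = c("a", "a", "b"),
                   text = c("spark", "streaming in r", "ml"),
                   stringsAsFactors = FALSE)
local_partition <- add_nchar(news)
local_group <- summarise_group(list("a"), news[news$topic == "a", ])
```

Calling `run_on_spark(news)` would run the same two functions through Spark; the schema passed to `dapply` and `gapply` has to list exactly the columns each function returns.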

Page 27: SSR: Structured Streaming for R and Machine Learning

Parallel Processing By Partition

Page 28: SSR: Structured Streaming for R and Machine Learning

https://spark-summit.org/east-2017/events/scalable-data-science-with-sparkr/

Page 29: SSR: Structured Streaming for R and Machine Learning

Native R UDF = DF Transform

DataFrame → DataFrame

DataFrame → DataFrame

Page 30: SSR: Structured Streaming for R and Machine Learning

SS in R

1. DataStreamReader/Writer

2. StreamingQuery

3. Extending DataFrame (isStreaming)
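
As a rough illustration of the StreamingQuery handle, the sketch below waits on a running query for a fixed time and then stops it. The helper name `run_for` and the default timeout are assumptions; `isActive`, `awaitTermination`, `lastProgress`, and `stopQuery` are the SparkR calls for a query returned by `write.stream`.

```r
# Let a running StreamingQuery work for a bounded time, then stop it.
# `query` is assumed to be the object returned by SparkR::write.stream().
run_for <- function(query, timeout_ms = 10000) {
  if (SparkR::isActive(query)) {
    # Blocks until the query ends or timeout_ms milliseconds have passed.
    SparkR::awaitTermination(query, timeout_ms)
  }
  SparkR::lastProgress(query)  # prints the most recent progress update
  SparkR::stopQuery(query)
  invisible(query)
}
```

A streaming DataFrame can be told apart from a static one with `SparkR::isStreaming(df)`, which is the check item 3 refers to.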

Page 31: SSR: Structured Streaming for R and Machine Learning

About Demo

• Create a job to discover trending news topics

– Structured Streaming

– Machine Learning with native R package in UDF

Page 32: SSR: Structured Streaming for R and Machine Learning

Demo!

https://goo.gl/0v6YxF

Page 33: SSR: Structured Streaming for R and Machine Learning

Demo

• SS – read text stream from Kafka

• R-UDF – a partition with lines of text

– RTextTools – text vector into DTM – scrubbing

– LDA

– terms

• SQL – group by words, count

• SS – write to console
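
The steps above can be sketched end to end as follows. This is not the demo's code: plain tokenization stands in for the RTextTools and LDA step, and the names `extract_terms` and `start_trending_terms` are hypothetical. It also assumes the Spark Kafka source package is already available to the session.

```r
# Stand-in for the demo's RTextTools + LDA step: lower-case, strip
# punctuation, split on whitespace, and emit one row per term.
# Input: one partition as a data.frame with a string column `value`.
extract_terms <- function(pdf) {
  lines <- tolower(as.character(pdf$value))
  lines <- gsub("[^a-z ]", " ", lines)
  words <- unlist(strsplit(lines, " +"))
  words <- words[nchar(words) > 2]   # drop empties and very short tokens
  data.frame(word = as.character(words), stringsAsFactors = FALSE)
}

# Wiring: Kafka stream -> R UDF per partition -> group/count -> console.
# Assumes SparkR is installed and the Kafka source is on the classpath.
start_trending_terms <- function(servers, topic) {
  SparkR::sparkR.session()
  raw <- SparkR::read.stream("kafka",
                             kafka.bootstrap.servers = servers,
                             subscribe = topic)
  lines <- SparkR::selectExpr(raw, "cast(value as string) as value")
  terms <- SparkR::dapply(lines, extract_terms,
    SparkR::structType(SparkR::structField("word", "string")))
  counts <- SparkR::count(SparkR::groupBy(terms, "word"))
  SparkR::write.stream(counts, "console", outputMode = "complete")
}

# Local check of the UDF body on an ordinary data.frame.
demo_terms <- extract_terms(
  data.frame(value = c("Spark Streaming, in R!", "ML & spark"),
             stringsAsFactors = FALSE))
```

Running the UDF locally first, as in the last lines, follows the "run separately first" advice for debugging native R code.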

Page 34: SSR: Structured Streaming for R and Machine Learning

Read DataFrame vs Stream

read.df(datapath, source = "parquet")

read.stream("kafka",

kafka.bootstrap.servers = servers,

subscribe = topic)

Page 35: SSR: Structured Streaming for R and Machine Learning

Streaming WordCount in 1 line

library(magrittr)

kbsrvs <- "kafka-0.broker.kafka.svc.cluster.local:9092"

topic <- "test1"

read.stream("kafka", kafka.bootstrap.servers = kbsrvs, subscribe = topic) %>%

selectExpr("explode(split(cast(value as string), ' ')) as word") %>%

group_by("word") %>%

count() %>%

write.stream("console", outputMode = "complete")

Page 36: SSR: Structured Streaming for R and Machine Learning

Challenges

Page 37: SSR: Structured Streaming for R and Machine Learning

Streaming and ML

• Streaming – small batch

• ML – sometimes large data to build model

=> pre-trained model

=> online machine learning

• Adapting to data schema, pattern changes

• Updating model (when?)

Page 38: SSR: Structured Streaming for R and Machine Learning

Practical Implementation

- LSI – online training

- Online LDA

- kNN

- k-means with predict on new data
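
To illustrate the last item, here is a small base-R sketch of "k-means with predict on new data": the model is fit once on historical data, and new rows are assigned to the nearest learned centroid. The synthetic data and the helper `predict_cluster` are assumptions for illustration; base R's `kmeans` has no predict method, so the nearest-centroid step is written out by hand.

```r
set.seed(42)

# Historical (batch) data: two well-separated 2-D clusters, 50 rows each.
train <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
model <- kmeans(train, centers = 2, nstart = 5)

# Assign each new row to the nearest centroid (squared Euclidean distance).
predict_cluster <- function(centers, newdata) {
  newdata <- as.matrix(newdata)
  apply(newdata, 1, function(row) {
    which.min(colSums((t(centers) - row)^2))
  })
}

# "New data", as it might arrive in a later micro-batch.
new_points <- rbind(c(0.1, -0.2), c(4.8, 5.1))
assigned <- predict_cluster(model$centers, new_points)
```

In a streaming job, only `model$centers` would need to travel with the UDF, which keeps the per-batch work cheap compared with refitting.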

Page 39: SSR: Structured Streaming for R and Machine Learning

SS Considerations

• Schema of DataFrame from Kafka:

key (binary), value (binary), topic, partition, offset, timestamp, timestampType

• OutputMode requirements
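
Both considerations can be sketched briefly. The Kafka source delivers `key` and `value` as binary columns, so they are usually cast before use; and "complete" output mode is only accepted when the query aggregates. The helper names `decode_kafka` and `choose_output_mode` are hypothetical, and the mode rule is a deliberate simplification.

```r
# Decode the Kafka payload columns to strings and keep the metadata columns.
# `stream_df` is assumed to come from read.stream("kafka", ...).
decode_kafka <- function(stream_df) {
  SparkR::selectExpr(stream_df,
                     "cast(key as string) as key",
                     "cast(value as string) as value",
                     "topic", "partition", "offset", "timestamp")
}

# Pick an output mode from the shape of the query: "complete" needs an
# aggregation; a query without one falls back to "append".
choose_output_mode <- function(has_aggregation) {
  if (has_aggregation) "complete" else "append"
}

mode_for_counts <- choose_output_mode(TRUE)
mode_for_passthrough <- choose_output_mode(FALSE)
```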

Page 40: SSR: Structured Streaming for R and Machine Learning

ML with R-UDF

• Native code UDF can break the job

- e.g. ML packages can be sensitive to empty rows

- more data checks are needed in real life

• Debugging can be challenging – run separately first

• UDF must return data that matches the declared schema

• Model as state to distribute to each UDF instance
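
The points above can be combined into one defensive pattern, sketched here under stated assumptions: input columns `id`, `x`, `y`, an integer output schema of `id` and `cluster`, and a pre-trained model reduced to a matrix of centroids. The name `make_scoring_udf` is hypothetical. The model is captured in the closure, so it travels with the function to each UDF instance.

```r
# Build a partition UDF around a pre-trained model (a matrix of centroids).
make_scoring_udf <- function(model_centers) {
  # Zero-row result with the exact output columns and types.
  empty <- data.frame(id = integer(0), cluster = integer(0))
  function(pdf) {
    # Guard 1: empty partitions do occur in small streaming batches.
    if (is.null(pdf) || nrow(pdf) == 0) return(empty)
    # Guard 2: drop rows the model cannot score.
    pdf <- pdf[stats::complete.cases(pdf[, c("x", "y")]), , drop = FALSE]
    if (nrow(pdf) == 0) return(empty)
    # Guard 3: a failure in native code returns an empty, schema-valid
    # result instead of breaking the whole job.
    tryCatch({
      pts <- as.matrix(pdf[, c("x", "y")])
      cluster <- apply(pts, 1, function(p) {
        which.min(colSums((t(model_centers) - p)^2))
      })
      data.frame(id = as.integer(pdf$id), cluster = as.integer(cluster))
    }, error = function(e) empty)
  }
}

# Local run before handing the function to dapply().
centers <- rbind(c(0, 0), c(5, 5))
score <- make_scoring_udf(centers)
scored <- score(data.frame(id = 1:3,
                           x = c(0.2, NA, 4.9),
                           y = c(-0.1, 1.0, 5.2)))
```

Swallowing errors keeps the stream alive but hides problems, so a real job would likely log the error or count dropped rows as well.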

Page 41: SSR: Structured Streaming for R and Machine Learning

Future – SSR

• Configurable trigger

• Watermark for late data

Page 42: SSR: Structured Streaming for R and Machine Learning

Thank You.

github: https://github.com/felixcheung

linkedin: http://linkd.in/1OeZDb7

blog: http://bit.ly/1E2z6OI