Apache Spark for applied machine learning, by Friso van Vollenhoven (@fzk, [email protected]), GoDataDriven, proudly part of the Xebia Group
Transcript
Page 1: Apache Spark Talk for Applied machine learning

GoDataDriven: proudly part of the Xebia Group

@fzk [email protected]

Apache Spark for applied machine learning

Friso van Vollenhoven

Page 2: Apache Spark Talk for Applied machine learning

Page 3: Apache Spark Talk for Applied machine learning

Page 4: Apache Spark Talk for Applied machine learning

This talk is about tools.

Page 5: Apache Spark Talk for Applied machine learning

Page 6: Apache Spark Talk for Applied machine learning
Page 7: Apache Spark Talk for Applied machine learning
Page 8: Apache Spark Talk for Applied machine learning

Page 9: Apache Spark Talk for Applied machine learning

Page 10: Apache Spark Talk for Applied machine learning

Page 11: Apache Spark Talk for Applied machine learning

Page 12: Apache Spark Talk for Applied machine learning
Page 13: Apache Spark Talk for Applied machine learning
Page 14: Apache Spark Talk for Applied machine learning

Resilient Distributed Dataset

• Immutable set of records (e.g. tuples)

• Distributed across a cluster of workers

• Stored in RAM or on disk (partially)

• Built through transformations

• Automatically rebuilt on failure

• Possibly replicated
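
The bullets above (immutable, partitioned, built through transformations, rebuilt on failure) can be illustrated with a toy model. The sketch below is plain single-process Python, not Spark's API or implementation; `ToyRDD`, `compute` and `lose_partition` are hypothetical names chosen for this illustration. It only shows the lineage idea: each dataset remembers its parent and the function that derived it, so a lost partition can be recomputed instead of restored from a copy.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self,
                 source: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None):
        self._source = source   # only set on the root dataset: a list of partitions
        self._parent = parent   # lineage: the dataset this one was derived from
        self._fn = fn           # per-partition function applied to the parent
        self._cache = {}        # partition index -> materialised records

    def num_partitions(self) -> int:
        if self._parent is not None:
            return self._parent.num_partitions()
        return len(self._source)

    def map(self, f: Callable) -> "ToyRDD":
        # A transformation never mutates this dataset; it returns a new one.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, i: int) -> list:
        # Serve partition i from the cache, or rebuild it from the lineage.
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a worker failure that drops one materialised partition.
        self._cache.pop(i, None)

    def collect(self) -> list:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2], [3, 4]])   # two partitions
squared = base.map(lambda x: x * x)      # new dataset, base is unchanged
first = squared.collect()
squared.lose_partition(1)                # "failure": partition 1 is gone
rebuilt = squared.collect()              # recomputed from base via the lineage
```

Here `rebuilt` equals `first` even though one partition was dropped in between, because the missing partition is derived again from its parent rather than read back from a replica.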

Page 15: Apache Spark Talk for Applied machine learning

Operations

• Operate on RDDs

• Create a new RDD

• Or materialise an RDD and return data

• Transformations: map, filter, groupBy, etc.

• Actions: count, collect, reduce, save, etc.
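
The split between transformations and actions can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark code: `LazyDataset` and `traced_double` are hypothetical names, and everything runs in one process. It shows that transformations only describe a new dataset, while an action is what actually runs the work and returns data.

```python
from functools import reduce as _reduce
from typing import Callable, Iterable, Iterator


class LazyDataset:
    """Toy model of lazy transformations and eager actions (not Spark's API)."""

    def __init__(self, make_iter: Callable[[], Iterator]):
        # A recipe for producing the records; nothing is computed yet.
        self._make_iter = make_iter

    @classmethod
    def from_list(cls, records: Iterable) -> "LazyDataset":
        data = list(records)
        return cls(lambda: iter(data))

    # Transformations: return a new dataset, do no work.
    def map(self, f: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (f(x) for x in self._make_iter()))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._make_iter() if pred(x)))

    # Actions: run the recipe and return data to the caller.
    def collect(self) -> list:
        return list(self._make_iter())

    def count(self) -> int:
        return sum(1 for _ in self._make_iter())

    def reduce(self, f: Callable):
        return _reduce(f, self._make_iter())


calls = []  # records every element the mapped function actually sees


def traced_double(x):
    calls.append(x)
    return 2 * x


numbers = LazyDataset.from_list(range(1, 6))
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = len(calls)                  # still 0: nothing has run
result = evens_doubled.collect()                  # action: runs filter + map
total = evens_doubled.reduce(lambda a, b: a + b)  # action: runs them again
```

Note that in this toy each action re-runs the whole chain, which is why `calls` grows on every action. Keeping a materialised result around between actions is a separate concern that the sketch leaves out.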

Page 16: Apache Spark Talk for Applied machine learning
Page 17: Apache Spark Talk for Applied machine learning
Page 18: Apache Spark Talk for Applied machine learning

The good parts

• Language bindings for Java, Scala and Python

• Works interactively from a shell:

• Scala + IPython (notebook)

• Plays nice with Hadoop

• Deploy on top of YARN cluster manager

• Read data from HDFS

• Hadoop-like fault tolerance

Page 19: Apache Spark Talk for Applied machine learning

The better part?

https://github.com/Bridgewater/scala-notebook

Page 20: Apache Spark Talk for Applied machine learning
Page 21: Apache Spark Talk for Applied machine learning
Page 22: Apache Spark Talk for Applied machine learning
Page 23: Apache Spark Talk for Applied machine learning
Page 24: Apache Spark Talk for Applied machine learning
Page 25: Apache Spark Talk for Applied machine learning
Page 26: Apache Spark Talk for Applied machine learning

https://github.com/Sotera/spark-distributed-louvain-modularity

Page 27: Apache Spark Talk for Applied machine learning

We’re hiring / Questions? / Thank you!

@fzk [email protected]

Friso van Vollenhoven