Introduction to Apache Spark
These training materials were produced as part of the İstanbul Big Data Eğitim ve Araştırma Merkezi (Istanbul Big Data Education and Research Center) Project, no. TR10/16/YNY/0036, carried out under the Istanbul Development Agency's (İSTKA) 2016 Innovative and Creative Istanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.
A Major Step Backwards?
MapReduce is a step backward in database access:
Schemas are good
Separation of the schema from the application is good
High-level access languages are good
MapReduce is a poor implementation
Brute force and only brute force (no indexes, for example)
MapReduce is not novel
MapReduce is missing features
Bulk loader, indexing, updates, transactions…
MapReduce is incompatible with DBMS tools
Source: Blog post by DeWitt and Stonebraker
Need for High-Level Languages
MapReduce is great for one-pass, large-scale data processing
But writing Java programs for everything is verbose and slow
Data scientists don’t want to write Java
But it is inefficient for multi-pass algorithms
No efficient primitives for data sharing and iterative tasks
State between steps goes to the distributed file system
Slow due to replication & disk storage
Move it onto a cluster
In a cluster setting, using MapReduce and Hadoop is slow
Hadoop writes intermediate results to disk and is complex to program
How to improve this?
Example: Iterative Apps
[Diagram: an iterative job reads its input from the file system, then writes and re-reads intermediate state from the file system between iter. 1, iter. 2, ...; interactive use reads the same input from the file system once per query (query 1, query 2, query 3) to produce result 1, result 2, result 3]
Commonly spend 90% of time doing I/O
What we need is…
Resilient
Checkpointing
Fast, does not always store to disk
Replayable
Embarrassingly Parallel
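These are the properties that Spark's RDDs are designed to provide. A minimal PySpark sketch of how they surface in the API, assuming a running SparkContext `sc` and a hypothetical input file data.txt:

rdd = sc.textFile("data.txt")                   # replayable: lineage records how to rebuild each partition
clean = rdd.filter(lambda line: line.strip())   # transformations run embarrassingly parallel, per partition
clean.persist()                                 # fast: keep results in memory instead of writing to disk
sc.setCheckpointDir("/tmp/spark-checkpoints")   # resilient: long lineages can be checkpointed
clean.checkpoint()
print(clean.count())                            # an action triggers (re)computation only where needed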
Scala
Scala has many other nice features:
A type system that makes sense.
Traits.
Implicit conversions.
Pattern Matching.
XML literals, Parser combinators, ...
Spark: A Brief History
[Timeline, 2002-2014: 2002 MapReduce @ Google; 2004 MapReduce paper; 2006 Hadoop @ Yahoo!; 2008 Hadoop Summit; 2010 Spark paper; 2014 Apache Spark becomes a top-level Apache project]
Why better than Hadoop?
• In-memory as opposed to on-disk processing
• Data can be cached in memory or on disk for future use
• Fast: up to 100 times faster, since it works in memory rather than on disk
• Easier to use than Hadoop while staying functional; runs a general DAG of operations
• APIs in Java, Scala, Python, R
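A small PySpark sketch of the caching point above (hypothetical log path, assumes a SparkContext `sc`): the first action reads from HDFS, and later queries over the cached RDD are served from memory.

logs = sc.textFile("hdfs:///data/access.log")           # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.cache()                                           # mark for in-memory reuse
print(errors.count())                                    # first action scans HDFS and fills the cache
print(errors.filter(lambda l: "timeout" in l).count())   # subsequent queries read from memory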
WordCount MapReduce vs Spark
WordCount in 50+ lines of Java MapReduce vs. WordCount in 3 lines of Spark
Word Count Example
text_file = sc.textFile("hdfs://.../wordsList.txt")
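The line above only loads the file; a sketch of how the remaining two of the "3 lines of Spark" typically look (the output path is illustrative):

counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://.../wordcount-output")     # illustrative output location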