Apache Spark Concepts - Spark SQL, GraphX, Streaming Petr Zapletal Cake Solutions
Page 1: Spark Concepts - Spark SQL, Graphx, Streaming

Apache Spark

Concepts - Spark SQL, GraphX, Streaming

Petr Zapletal Cake Solutions

Apache Spark and Big Data

1) History and market overview

2) Installation

3) MLlib and Machine Learning on Spark

4) Porting R code to Scala and Spark

5) Concepts - Spark SQL, GraphX, Streaming

6) Spark’s distributed programming model

7) Deployment

Table of contents

● Resilient Distributed Datasets

● Spark SQL

● GraphX

● Spark Streaming

● Q & A

Spark Modules

Resilient Distributed Datasets

● Immutable, distributed collection of records

● Lazily evaluated; can be cached or persisted

● Rich set of operations (transformations and actions)

● Can be created from data in storage or from another RDD
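
The points above can be illustrated with a small plain-Python model. This is a sketch only and does not use the Spark API: the `LazyDataset` class and its methods are hypothetical names chosen for illustration. It shows that transformations only record a recipe, that an action triggers the computation, and that caching avoids recomputation.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark): lazy and optionally cached."""

    def __init__(self, source: Callable[[], Iterable], cached: bool = False):
        self._source = source                      # recipe that produces the records
        self._cached = cached                      # keep results after first evaluation?
        self._materialized: Optional[List] = None  # filled on first use when cached

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: also lazy.
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self) -> "LazyDataset":
        # Same recipe, but results are kept once they have been computed.
        return LazyDataset(self._source, cached=True)

    def _iterate(self) -> Iterator:
        if self._cached:
            if self._materialized is None:
                self._materialized = list(self._source())
            return iter(self._materialized)
        return iter(self._source())

    def collect(self) -> List:
        # Action: this is the point where evaluation actually happens.
        return list(self._iterate())


calls = []  # records every invocation of square() so laziness is observable


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset(lambda: range(5))
squares = numbers.map(square).cache()
evens = squares.filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # still 0: only recipes exist so far
first = evens.collect()            # triggers evaluation of the whole chain
second = evens.collect()           # reuses the cached squares
calls_after_actions = len(calls)   # square() ran once per input element
```

In this model `cache()` returns a new object rather than mutating the existing one, which mirrors the immutability bullet; real Spark differs in many details (partitioning, lineage-based recovery, storage levels).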

Spark SQL

● Spark’s interface for working with structured or semi-structured data

● Structured data

o known set of fields for each record - schema

● Main capabilities

o load data from a variety of structured sources

o query the data with SQL

o integration between Spark (Java, Scala and Python API) and SQL (joining RDDs and SQL tables, using SQL functionality)

More than SQL

● Unified interface for structured data

SchemaRDD

● RDD of row objects, each representing a record

● Known schema (i.e. data fields) of its rows

● Behaves like a regular RDD but is stored in a more efficient manner

● Adds new operations, especially running SQL queries

● Can be created from

o external data sources

o results of queries

o regular RDD

● Used in the ML Pipeline API

Getting Started

● Entry points:

o HiveContext

superset of SQLContext functionality, including Hive-related features

o SQLContext

Query Example

● Loads an input JSON file into a SchemaRDD

● Uses the context to execute a query
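
As a rough analogue of loading JSON records and querying them with SQL, here is a sketch that swaps in Python's built-in `sqlite3` module in place of Spark SQL. It is not Spark code. The table name `people`, the fields `name` and `age`, and the sample records are assumptions made up for illustration.

```python
import json
import sqlite3

# Hypothetical input: one JSON object per line, all with the same fields.
json_lines = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": 19}',
    '{"name": "Carol", "age": 45}',
]

# Step 1: load the JSON records.
records = [json.loads(line) for line in json_lines]

# Step 2: derive a schema (the set of fields) from the first record.
columns = list(records[0].keys())

# Step 3: register the records as a table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE people ({', '.join(columns)})")
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    f"INSERT INTO people VALUES ({placeholders})",
    [tuple(record[column] for column in columns) for record in records],
)

# Step 4: use the connection (playing the role of the context) to run a query.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name"
).fetchall()
print(rows)
```

The shape of the workflow (load, attach a schema, register under a name, query) is what the sketch is meant to convey; the engine underneath is entirely different from Spark's.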

Loading and Saving Data

● Supports a number of structured data sources

o Apache Hive

data warehouse infrastructure on top of Hadoop

summarization, querying (SQL-like interface) and analysis

o Parquet

column-oriented storage format in Hadoop ecosystem

efficient storage of records with nested fields

o JSON

o RDDs

o JDBC/ODBC Server

connecting Business Intelligence tools

remote access to Spark cluster

GraphX

● New Spark API for graphs and graph-parallel computation

● Resilient Distributed Property Graph (RDPG, extends RDD)

o directed multigraph (parallel edges are allowed)

o properties attached to each vertex and edge

● Common graph operations (subgraph computation, joining vertices, ...)

● Growing collection of graph algorithms

Motivation

● Growing scale and importance of graph data

● Application of data-parallel algorithms to graph computation is inefficient

● Graph-parallel systems (Pregel, PowerGraph, ...) designed for efficient execution of graph algorithms

o do not address graph construction & transformation

o limited fault tolerance & data mining support

Performance Comparison

Property Graph

● Directed multigraph with user-defined objects attached to each vertex and edge

Triplet View

● Logical join of vertex and edge properties
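
A minimal plain-Python sketch of what this logical join produces, assuming a tiny hypothetical graph. The `Triplet` type and the `triplets` function are illustrative names, not the GraphX API.

```python
from typing import Dict, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One edge together with the properties of both of its endpoints."""
    src_id: int
    src_attr: str
    edge_attr: str
    dst_id: int
    dst_attr: str


# Hypothetical property graph: vertex id -> vertex property.
vertices: Dict[int, str] = {1: "alice", 2: "bob", 3: "carol"}

# Edges as (source id, destination id, edge property).
# Two edges go from 1 to 2, which a multigraph permits.
edges: List[Tuple[int, int, str]] = [
    (1, 2, "follows"),
    (2, 3, "follows"),
    (1, 2, "likes"),
]


def triplets(
    vertices: Dict[int, str], edges: List[Tuple[int, int, str]]
) -> List[Triplet]:
    # Join each edge with the properties of its source and destination vertex.
    return [
        Triplet(src, vertices[src], attr, dst, vertices[dst])
        for src, dst, attr in edges
    ]


triplet_view = triplets(vertices, edges)
for t in triplet_view:
    print(f"{t.src_attr} {t.edge_attr} {t.dst_attr}")
```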

Graph Operations

● Basic information (numEdges, numVertices, inDegrees, ...)

● Views (vertices, edges, triplets)

● Caching (persist, cache, ...)

● Transformation (mapVertices, mapEdges, ...)

● Structure modification (reverse, subgraph, ...)

● Neighbour aggregation (collectNeighbors, aggregations, ...)

● Pregel API

● Graph builders (various I/O operations)

● ...
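
To make a few of these operation families concrete, here is a plain-Python sketch over an edge list. The function names echo the operations listed above, but the signatures and the sample graph are assumptions for illustration and are not the GraphX API.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int, str]  # (source id, destination id, edge property)

# Hypothetical sample graph.
vertices: Dict[int, str] = {1: "alice", 2: "bob", 3: "carol"}
edges: List[Edge] = [
    (1, 2, "follows"),
    (2, 3, "follows"),
    (3, 1, "follows"),
    (1, 3, "likes"),
]


def in_degrees(edges: List[Edge]) -> Counter:
    # Basic information: number of incoming edges per vertex.
    return Counter(dst for _src, dst, _attr in edges)


def reverse(edges: List[Edge]) -> List[Edge]:
    # Structure modification: flip the direction of every edge.
    return [(dst, src, attr) for src, dst, attr in edges]


def subgraph(
    vertices: Dict[int, str],
    edges: List[Edge],
    keep_vertex: Callable[[int, str], bool],
) -> Tuple[Dict[int, str], List[Edge]]:
    # Structure modification: keep the vertices that satisfy the predicate
    # and only those edges whose two endpoints both survive.
    kept = {vid: attr for vid, attr in vertices.items() if keep_vertex(vid, attr)}
    kept_edges = [e for e in edges if e[0] in kept and e[1] in kept]
    return kept, kept_edges


degrees = in_degrees(edges)
reversed_edges = reverse(edges)
sub_vertices, sub_edges = subgraph(vertices, edges, lambda vid, attr: vid != 2)
```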

Graph Algorithms

● Built-in algorithms

o PageRank, Connected Components, Triangle Count, ...
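
For readers unfamiliar with PageRank, the following is a textbook power-iteration version in plain Python. It is a sketch under stated assumptions and not GraphX's implementation: the damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from vertices without outgoing edges, and the sample edges are all choices made here for illustration.

```python
from typing import Dict, Hashable, List, Tuple


def pagerank(
    edges: List[Tuple[Hashable, Hashable]],
    num_iter: int = 50,
    damping: float = 0.85,
) -> Dict[Hashable, float]:
    """Power-iteration PageRank; the returned ranks sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)

    # Adjacency list; parallel edges simply appear more than once.
    out_links: Dict[Hashable, list] = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {v: 1.0 / n for v in nodes}
    for _ in range(num_iter):
        # Rank held by vertices with no outgoing edges is spread uniformly.
        dangling = sum(ranks[v] for v in nodes if not out_links[v])
        new_ranks = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        # Every other vertex splits its rank evenly across its outgoing edges.
        for v in nodes:
            if out_links[v]:
                share = damping * ranks[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_ranks[dst] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: a and b both point to c, and c points back to a.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

In this example `c` collects rank from two vertices and passes it on to `a`, while `b` has no incoming edges and ends up last.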

Demo

Spark Streaming

● Scalable, high-throughput, fault-tolerant stream processing

Architecture

● Streams are chopped up into batches

● Each batch is processed in Spark

● Results pushed out in batches
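
The three steps above can be sketched in plain Python as follows. This is only a model of the micro-batch idea, not Spark Streaming code; the `chop_into_batches` helper, the one-second interval, and the sample events are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, List, Tuple


def chop_into_batches(
    events: List[Tuple[float, Any]], batch_interval: float
) -> List[List[Any]]:
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Intervals that received no events are omitted in this simplified model.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical stream: (arrival time in seconds, value).
events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]

# Step 1: the stream is chopped into batches (one per second here).
batches = chop_into_batches(events, batch_interval=1.0)

# Step 2: each batch is processed with an ordinary batch computation.
# Step 3: one result is pushed out per batch.
results = [sum(batch) for batch in batches]
print(results)
```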

Streaming Word Count
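
A minimal plain-Python sketch of a word count applied batch by batch, assuming the stream has already been split into batches of text lines. It models the idea only and is not Spark Streaming code; the sample batches and the `word_count` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def word_count(batch: List[str]) -> Dict[str, int]:
    # Split every line of the batch into words and count each word.
    words = (word for line in batch for word in line.split())
    return dict(Counter(words))


# Hypothetical batches of text lines, one list per batch interval.
batches = [
    ["spark streaming", "spark sql"],
    ["graphx"],
    [],  # an interval in which no data arrived
]

# Each batch is counted independently of the others.
per_batch_counts = [word_count(batch) for batch in batches]
for counts in per_batch_counts:
    print(counts)
```

The counts are not carried over from one batch to the next, so the same word can restart from zero in every interval.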

StreamingContext

● Entry point for all streaming functionality

o define input sources

o apply stream transformations

o apply output operations to DStreams

o start and stop the streaming process

● Limitations

o once started, no new computations can be added

o once stopped, cannot be restarted

o only one active StreamingContext per JVM

Discretized Streams

● Basic abstraction, represents a continuous stream of data

● Referred to as DStreams

● Implemented as a series of RDDs

Stateless Transformations

● Processing of each batch does not depend on previous batches

● Transformation is separately applied to every batch

o map, flatMap, filter, reduce, groupBy, …

● Combining data from multiple DStreams

o join, cogroup, union, ...

Stateful Transformations

● Use data or intermediate results from previous batches to compute the result of the current batch

● Windowed operations

o act over a sliding window of time periods

● updateStateByKey

o maintain state while continuously updating it with new information

● Require checkpointing
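
Both kinds of stateful transformation can be modelled in plain Python over a list of per-batch counts. This is a sketch, not Spark Streaming code: the helper names, the window measured in whole batches, and the choice to emit only full windows are simplifying assumptions made here.

```python
from collections import Counter
from typing import List


def windowed_counts(
    batch_counts: List[Counter], window_length: int, slide_interval: int
) -> List[Counter]:
    """Sum per-batch counts over a sliding window (both sizes in batches).

    Only full windows are emitted in this simplified model.
    """
    results = []
    for end in range(window_length, len(batch_counts) + 1, slide_interval):
        window = batch_counts[end - window_length:end]
        results.append(sum(window, Counter()))
    return results


def update_state(state: Counter, batch_count: Counter) -> Counter:
    # Merge the counts of the new batch into the running per-key state.
    new_state = Counter(state)
    new_state.update(batch_count)
    return new_state


# Hypothetical per-batch key counts, one Counter per batch interval.
batch_counts = [
    Counter(a=1),
    Counter(a=2, b=1),
    Counter(b=3),
    Counter(a=1),
]

# Windowed operation: each result covers the two most recent batches.
windows = windowed_counts(batch_counts, window_length=2, slide_interval=1)

# Running state: every batch updates the totals accumulated so far.
state = Counter()
state_history = []
for batch_count in batch_counts:
    state = update_state(state, batch_count)
    state_history.append(dict(state))
```

Because each result depends on earlier batches, the running state has to be kept somewhere between intervals, which is the role checkpointing plays in the bullet above.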

Output Operations

● Specify what needs to be done with the final transformed data

● Pushing to external DB, printing, …

● If not performed, DStream is not evaluated

Input Sources

● Built-in support for a number of different data sources

● Often in additional libraries (e.g. spark-streaming-kafka)

● HDFS

● Akka Actor Stream

● Apache Kafka

● Apache Flume

● Twitter Stream

● Kinesis

● Custom Sources

● ...

Demo

Conclusion

● RDD recap

● Spark Modules Overview

o Spark SQL

o GraphX

o Spark Streaming

Questions