Top Banner
Page 1 Magellan: Geospatial Analytics on Spark Ram Sriharsha Spark and Data Science Architect, Hortonworks
42

Apache con big data 2015 magellan

Apr 15, 2017

Download

Data & Analytics

Ram Sriharsha
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Apache con big data 2015 magellan

Page 1

Magellan: Geospatial Analytics on SparkRam Sriharsha

Spark and Data Science Architect, Hortonworks

Page 2: Apache con big data 2015 magellan

Page 2

Agenda• Geospatial Analytics• Geospatial Data Formats• Challenges• Magellan• Spark SQL and Catalyst: An Intro• How does Magellan use Spark SQL?• Demo• Q & A

Page 3: Apache con big data 2015 magellan

Page 3

Geospatial Context is useful

Where do people go on weekends?Does usage pattern change with time?Predict the drop off point of a user?Predict the location where next pick up can be expected?

Identify crime hotspotsHow do these hotspots evolve with time?Predict the likelihood of crime occurring at a given neighborhood

Predict climate at fairly granular levelClimate insurance: do I need to buy insurance for my crops?Climate as a factor in crime: Join climate dataset with Crimes

Page 4: Apache con big data 2015 magellan

Page 4

Geospatial Data is pervasive

Page 5: Apache con big data 2015 magellan

Page 5

Obscure Data Formats• ESRI Shapefile format

–SHP–SHX–DBX (Dbase File Format, very few legitimate parsers exist)

• GeoJSON• Open Source Parsers exist (GIS 4 Hadoop)

Page 6: Apache con big data 2015 magellan

Page 6

Why do we need a proper framework?• No standardized way of dealing with data and metadata• Esoteric Data Formats and Coordinate Systems• No optimizations for efficient joins • APIs are too low level• Language integration simplifies exploratory analytics• Commonly used algorithms can be made available at scale

–Map matching–Geohash indices–Markov models

Page 7: Apache con big data 2015 magellan

Page 7

What do we want to support?

• Parse geospatial data and metadata into Shapes + Metadata Map• Python and Scala support• Geometric Queries

–efficiently!–simple and intuitive syntax!

• Scalable implementations of common algorithms–Map Matching–Geohash Indexing–Spatial Joins

Page 8: Apache con big data 2015 magellan

Page 8

Where are we at?• Magellan available on Github (https://github.com/harsha2010/magellan)• Can parse and understand most widely used formats

–GeoJSON, ESRIShapefile

• All geometries supported• 1.0.3 released (http://spark-packages.org/package/harsha2010/magellan)• Broadcast join available for common scenarios• Work in progress (targeted 1.0.4)

–Geohash Join optimization–Map Matching algorithm using Markov Models

• Python and Scala support• Please give it a try and give us feedback!

Page 9: Apache con big data 2015 magellan

Page 9

Geospatial Data Structures

Page 10: Apache con big data 2015 magellan

Page 10

Operations• intersection• union• Symmetrical difference• Difference• Convex hull

Page 11: Apache con big data 2015 magellan

Page 11

Queries• contains (within)• Covers (covered-by)• intersects• touches

Page 12: Apache con big data 2015 magellan

Page 12

Geospatial Queries

Is a triplet of points (A, B, C) Clockwise or Counter Clockwise ordered?

Page 13: Apache con big data 2015 magellan

Page 13

Geospatial Queries

Ray Tracing Algorithm:Draw a ray from point and countthe # of times it intersects polygon

Page 14: Apache con big data 2015 magellan

Page 14

Accelerating Spatial Queries• Bounding Box• Geohashing• R-Tree (and other) indices

Page 15: Apache con big data 2015 magellan

Page 15

Magellan• Basic Abstraction in terms of Shape

–Point, PolyLine, Polygon–Supports multiple implementations (currently uses ESRI-Java-API)

• SQL Data Type = Shape–Efficient: Construct once and use

• Operations supported as SQL operations–within, intersects, contains etc.

• Allows efficient Join implementations using Catalyst–Broadcast join already available–Geohash based join algorithm in progress

Page 16: Apache con big data 2015 magellan

Page 16

load

Page 17: Apache con big data 2015 magellan

Page 17

query metadata

Page 18: Apache con big data 2015 magellan

Page 18

geometric query

Page 19: Apache con big data 2015 magellan

Page 19

Python API

Page 20: Apache con big data 2015 magellan

Page 20

Python API, join

Page 21: Apache con big data 2015 magellan

Page 21

Why spark?• DataFrames

– Intuitive manipulation of distributed structured data

• Catalyst Optimizer–Push predicates to Data Source, allows optimized filters

• Memory Optimized Execution Engine

Page 22: Apache con big data 2015 magellan

Page 22

The Spark ecosystem

Spark Core

Spark SQL

Spark Streaming

ML-Lib GraphX

Distributed Compute Engine• Speed, ease of use and fast prototyping• Open source• Powerful abstractions• Python, R , Scala, Java support

Page 23: Apache con big data 2015 magellan

Page 23

Spark DataFrames are intuitive

RDD

DataFrame

dept name age

Bio H Smith 48

CS A Turing 54

Bio B Jones 43

Phys E Witten 61

Page 24: Apache con big data 2015 magellan

Page 24

Spark DataFrames are fast!

Page 25: Apache con big data 2015 magellan

Page 25

Spark SQL under the covers

Page 26: Apache con big data 2015 magellan

Page 26

Catalyst• Rows, Data Types• Expressions• Operators• Rules• Optimization

Page 27: Apache con big data 2015 magellan

Page 27

Rows and Data Types• Standard SQL Data Types

–Date, Int, Long, String, etc

• Complex Data Types–Array, Map, Struct, etc

• Custom Data Types• Row = Collection of Data Types

–Represents a single row

Page 28: Apache con big data 2015 magellan

Page 28

Expressions• Literals• Arithmetic Expressions

–maxOf, unaryMinus

• Predicates–Not, and, in, case when

• Cast• String Expressions

–substring, like, startsWith

Page 29: Apache con big data 2015 magellan

Page 29

Optimizations• Constant Folding• Predicate Pushdown

–Combine Filters–Push Predicate Through Join

• Null Propagation• Boolean Simplification

Page 30: Apache con big data 2015 magellan

Page 30

Execution Engine• Data Sources to read data into Data Frames

–Supports extending pushdowns to data store

• Optimized in memory layout–ORC, Tungsten etc.

• Spark Strategies–Convert logical plan -> physical plan–Rules based on statistics

Page 31: Apache con big data 2015 magellan

Page 31

How does Magellan use Catalyst?• Custom Data Source

–Parses GeoJSON, ESRIShapefile etc into (Shape, Metadata) pairs–Returns a DataFrame with columns (point, polygon, polyline, metadata)–Overrides PrunedFilteredScan–Outputs Shape instances

• Custom Data Type–Point, Polygon, Polyline instances of Shape–Each Shape has a python counterpart. –Each Shape is its own SQL type (=> no serialization overhead for SQL -> Scala and back)

• Magellan Context–Overrides Spark Planner allowing custom join implementations

• Python wrappers

Page 32: Apache con big data 2015 magellan

Page 32

Leveraging Data Sources

Page 33: Apache con big data 2015 magellan

Page 33

Leveraging Catalyst

Binary Expression

Page 34: Apache con big data 2015 magellan

Page 34

Leveraging Catalyst

Spatial Join + Predicate Pushdown

Page 35: Apache con big data 2015 magellan

Page 35

Python Bindings

Page 36: Apache con big data 2015 magellan

Page 36

• Adds Custom Python Data Types–Point, PolyLine, Polygon wrappers around Scala Data Types

• Wraps coordinate transformations and expressions• Custom Picklers and Unpicklers

–Serialize to and from Scala

Page 37: Apache con big data 2015 magellan

Page 37

Future Work• Geohash Indices• Spatial Join Optimization• Map Matching Algorithms• Improving pyspark bindings

Page 38: Apache con big data 2015 magellan

Page 38

geohashing

Page 39: Apache con big data 2015 magellan

Page 39

Base 32 encoding

Page 40: Apache con big data 2015 magellan

Page 40

geohashing

Page 41: Apache con big data 2015 magellan

Page 41

Map Matching• Given sequence of points (representing a trip?) what was the road path taken?• Challenges

–Error in GPS measurements–Error in coordinate projections–Time Gap between measurements, cannot just snap to nearest road

Page 42: Apache con big data 2015 magellan

Page 42

DemoUber use case?