Real-Time Analytics with MemSQL and Spark
Neil Dahlke, Engineer
2016 November 4
About Me: Neil Dahlke, Engineer
MemSQL • Real-time database for transactions / analytics
Formerly Globus • High performance data transfer for research scientists
Past talks • Real-time, Geospatial, Maps
Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke
WHAT WE ARE SEEING
A WORLD OF CONNECTED MACHINES AND PEOPLE
WHAT WE ARE SEEING: Sensors. Applications. Machines. And us. Generating more data every single day.
By 2020, over 20 billion connected things will be in use across a range of industries.
REAL-TIME INPUTS
• Sensors
• Logs
• Events
• Streaming
Inserts
Upserts
Queries
LIVE OUTPUTS
• Dashboards
• Business Intelligence
• Applications
• Predictive Analytics
WHAT DO REAL-TIME BUSINESSES NEED?
FAST DATA INGEST
The volume of data that can be ingested into the database
LOW LATENCY QUERIES
The time it takes to execute queries and receive results
HIGH CONCURRENCY
The ability to scale simultaneous operations
A massively scalable database and ingest solution allowed for massive growth, real-time analytic applications and faster, targeted.
Before
Kafka
• Component we kept
S3
• Persisted all logs to cold storage for eventual analysis
Hadoop
• Nightly map-reduce jobs
Redshift
• Took a full day to load data from the previous day
• Load times reaching the point of overlap caused a data crisis
Why was this bad for their business operations?
• No real-time access to analytics
• No SQL interface for analysts and data scientists
• Massive nightly Hadoop batch jobs (late data)
• Unfiltered and incomplete data (silos)
• Expensive
Why was this bad for their data operations?
• Too slow
• Not scalable
• No deduplication (aka not exactly-once)
• Low concurrency
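The deduplication point can be illustrated with an idempotent write: if every event carries a unique key and the table enforces it, replaying a batch after a failure does not create duplicate rows. The sketch below is not the architecture from the talk. It uses Python's built-in SQLite as a stand-in for the database so that it runs anywhere, and the table name `repins`, the columns, and the helper functions are hypothetical. MySQL-compatible databases typically express the same idea with `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3


def make_db():
    """Create an in-memory table keyed by a unique event id (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repins ("
        " event_id TEXT PRIMARY KEY,"
        " pin_id INTEGER NOT NULL,"
        " ts INTEGER NOT NULL)"
    )
    return conn


def ingest(conn, events):
    """Insert events idempotently: rows whose event_id already exists are skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO repins (event_id, pin_id, ts) VALUES (?, ?, ?)",
        events,
    )
    conn.commit()


def count_rows(conn):
    """Return the number of stored events."""
    return conn.execute("SELECT COUNT(*) FROM repins").fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    # The batch itself contains a duplicate of event "e1".
    batch = [("e1", 10, 1000), ("e2", 11, 1001), ("e1", 10, 1000)]
    ingest(conn, batch)
    ingest(conn, batch)  # simulate the same batch being replayed after a failure
    print(count_rows(conn))
```

Because the write is keyed on `event_id`, at-least-once delivery from the upstream queue still leaves each event stored exactly once.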
FAST DATA INGEST
LOW LATENCY QUERIES
HIGH CONCURRENCY
How It Works Now
After
TECHNICAL BENEFITS
• Instant accuracy to the latest re-pin
• 1 GB/sec totaling 72 TB/day
THE PINTEREST REAL-TIME ARCHITECTURE
RESULTS
REAL-TIME ANALYTICS
• Accelerated ingest time by 200,000x
• 1 GB/sec totaling 72 TB/day
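As a rough sanity check on the throughput figures, the conversion between a per-second rate and a daily total can be sketched as below. It assumes decimal units (1 TB = 1000 GB), which the slides do not specify, and the function names are illustrative. A sustained 1 GB/sec works out to 86.4 TB/day, so one reading, offered here as an assumption rather than something the slides state, is that 1 GB/sec is a peak or rounded rate and the 72 TB/day total reflects a somewhat lower average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tb_per_day(gb_per_sec):
    """Daily volume in TB for a sustained rate in GB/sec (decimal units assumed)."""
    return gb_per_sec * SECONDS_PER_DAY / 1000


def avg_gb_per_sec(tb_per_day_total):
    """Average rate in GB/sec implied by a daily total in TB (decimal units assumed)."""
    return tb_per_day_total * 1000 / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(tb_per_day(1.0), 1))     # daily volume at a constant 1 GB/sec
    print(round(avg_gb_per_sec(72), 2))  # average rate implied by 72 TB/day
```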
Visualizing The Data
Demo built using
• Mapbox
• Websockets
• Tornado web server
When an image is re-pinned, the circles on the globe expand, showing higher-volume areas
Reads data from MemSQL directly
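One way the server side of such a demo could shape its data is sketched below. It is an assumption, not the demo's actual code: it takes query result rows of the hypothetical form (latitude, longitude, re-pin count) and turns them into a GeoJSON FeatureCollection serialized as a JSON string. A Tornado `WebSocketHandler` could then push that string to the browser with `write_message`, and Mapbox could size each circle from the `repins` property.

```python
import json


def rows_to_geojson(rows):
    """Convert (lat, lon, repins) rows into a GeoJSON FeatureCollection dict."""
    features = []
    for lat, lon, repins in rows:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"repins": repins},
        })
    return {"type": "FeatureCollection", "features": features}


def to_message(rows):
    """Serialize the rows as a JSON string suitable for a websocket text frame."""
    return json.dumps(rows_to_geojson(rows))


if __name__ == "__main__":
    # Hypothetical aggregated rows, as a SQL query might return them.
    sample = [(37.77, -122.42, 5), (51.51, -0.13, 2)]
    print(to_message(sample))
```

Keeping the row-to-GeoJSON step as a pure function makes it easy to test without a running database or web server.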
DEMO
Questions?
More Info
http://www.odbms.org/blog/2015/04/powering-big-data-at-pinterest-interview-with-krishna-gade/
https://gigaom.com/2015/02/18/pinterest-is-experimenting-with-memsql-for-real-time-data-analytics/
https://www.infoq.com/news/2015/03/pinterest-memsql-spark-streaming
http://blog.memsql.com/pinterest-apache-spark-use-case/
https://engineering.pinterest.com/blog/real-time-analytics-pinterest
Resources
https://github.com/memsql/memsql-spark-connector
http://docs.memsql.com/docs/streamliner-administration
http://docs.memsql.com/docs/pipelines-overview
https://github.com/memsql/memsql-docker-quickstart
Thank You