Real-Time Analytics with MemSQL and Spark

Neil Dahlke, Engineer

2016 November 4

About Me: Neil Dahlke, Engineer

MemSQL • real-time database for transactions / analytics

Formerly Globus • high performance data transfer for research scientists

Past talks • Real-time, Geospatial, Maps

Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke

WHAT WE ARE SEEING

A WORLD OF CONNECTED MACHINES AND PEOPLE

WHAT WE ARE SEEING: Sensors. Applications. Machines. And us. Generating more data every single day.

By 2020, over 20 billion connected things will be in use across a range of industries.

REAL-TIME INPUTS • Sensors • Logs • Events • Streaming • Inserts • Upserts

LIVE OUTPUTS • Queries • Dashboards • Business Intelligence • Applications • Predictive Analytics

WHAT DO REAL TIME BUSINESSES NEED?

FAST DATA INGEST • The volume of data that can be ingested into the database

LOW LATENCY QUERIES • The time it takes to execute queries and receive results

HIGH CONCURRENCY • The ability to scale simultaneous operations


A massively scalable database and ingest solution allowed for massive growth, real-time analytic applications and faster, targeted.

Kafka • Component we kept

S3 • Persisted all logs to cold storage for eventual analysis

Hadoop • Nightly map-reduce jobs

Redshift • Took a full day to load the previous day's data • Load windows began to overlap, causing a data crisis

Before

Why was this bad for their business operations?

• No real-time access to analytics
• No SQL interface for analysts and data scientists
• Massive nightly Hadoop batch jobs (late data)
• Unfiltered and incomplete data (silos)
• Expensive

Why was this bad for their data operations?

• Too slow
• Not scalable
• No deduplication (aka not exactly-once)
• Low concurrency
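
The deduplication point can be illustrated with a minimal sketch. It is not the pipeline described in this talk: it uses Python's built-in sqlite3 module as a stand-in database, and the `repins` table, the `event_id` key, and the `ingest` helper are hypothetical names. The idea shown is a common one: when a stream may deliver an event more than once, writing each event under a unique key makes replays harmless, so the table ends up with an effectively exactly-once view of the data.

```python
import sqlite3

def ingest(conn, events):
    """Write events idempotently: a replayed event_id is skipped, not duplicated."""
    conn.executemany(
        "INSERT OR IGNORE INTO repins (event_id, pin_id, city) VALUES (?, ?, ?)",
        events,
    )
    conn.commit()

# In-memory stand-in database (assumption: not the database used in the talk).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repins (event_id TEXT PRIMARY KEY, pin_id INTEGER, city TEXT)"
)

batch = [("e1", 101, "Chicago"), ("e2", 102, "Austin")]
ingest(conn, batch)
# Simulate redelivery of the same batch after a restart, plus one new event.
ingest(conn, batch + [("e3", 101, "Chicago")])

row_count = conn.execute("SELECT COUNT(*) FROM repins").fetchone()[0]
print(row_count)  # 3: the two replayed events were not written twice
```

MySQL-style SQL spells the same idea as INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE; which form fits depends on whether a replayed event should be skipped or should overwrite the stored row.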

FAST DATA INGEST • LOW LATENCY QUERIES • HIGH CONCURRENCY

How It Works Now

After

TECHNICAL BENEFITS • Instant accuracy to the latest re-pin • 1 GB/sec totaling 72 TB/day

THE PINTEREST REAL-TIME ARCHITECTURE

REAL-TIME ANALYTICS

Accelerated ingest time by 200,000x

1 GB/sec totaling 72 TB/day
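
As a quick unit check on these figures, the sketch below converts between a per-second rate and a daily total. It assumes decimal units (1 TB = 1000 GB) and a constant rate over 24 hours; neither assumption comes from the slides, and the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def tb_per_day(gb_per_sec):
    """Daily volume in TB for a sustained rate in GB/sec (1 TB = 1000 GB)."""
    return gb_per_sec * SECONDS_PER_DAY / 1000

def avg_gb_per_sec(tb_per_day_total):
    """Average rate in GB/sec implied by a daily total in TB."""
    return tb_per_day_total * 1000 / SECONDS_PER_DAY

sustained = tb_per_day(1.0)     # 86.4 TB/day at a constant 1 GB/sec
implied = avg_gb_per_sec(72.0)  # about 0.83 GB/sec averaged over the day
```

Under those assumptions, 72 TB/day corresponds to an average a little under 1 GB/sec, so the 1 GB/sec figure reads naturally as a rounded or peak rate.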

RESULTS

Visualizing The Data

Visualizing the Data

Demo built using
• Mapbox
• Websockets
• Tornado web server

When an image is re-pinned, the circles on the globe expand, showing higher-volume areas

Reads data from MemSQL directly
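
One way a demo like this could turn database rows into expanding circles is sketched below. This is an assumption about the general approach, not the demo's actual code: the `(lat, lon)` input rows, the `to_geojson` helper, the one-degree grid, and the `count` property are all hypothetical. It buckets re-pin locations into grid cells and emits a GeoJSON FeatureCollection with a count per cell, which a map's circle layer could then map to radius.

```python
import json
from collections import Counter

def to_geojson(rows, cell_degrees=1.0):
    """Bucket (lat, lon) rows into grid cells and return a GeoJSON dict."""
    counts = Counter()
    for lat, lon in rows:
        # Snap each point to the south-west corner of its grid cell.
        cell = (
            (lat // cell_degrees) * cell_degrees,
            (lon // cell_degrees) * cell_degrees,
        )
        counts[cell] += 1
    features = [
        {
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"count": n},
        }
        for (lat, lon), n in sorted(counts.items())
    ]
    return {"type": "FeatureCollection", "features": features}

# Hypothetical re-pin locations: two near Chicago, one near Austin.
rows = [(41.88, -87.63), (41.90, -87.65), (30.27, -97.74)]
collection = to_geojson(rows)
payload = json.dumps(collection)  # the string a websocket handler might push
```

Aggregating on the server keeps each pushed message small: the browser receives one feature per busy cell rather than one per re-pin.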

DEMO

Questions?

More Info

http://www.odbms.org/blog/2015/04/powering-big-data-at-pinterest-interview-with-krishna-gade/

https://gigaom.com/2015/02/18/pinterest-is-experimenting-with-memsql-for-real-time-data-analytics/

https://www.infoq.com/news/2015/03/pinterest-memsql-spark-streaming

http://blog.memsql.com/pinterest-apache-spark-use-case/

https://engineering.pinterest.com/blog/real-time-analytics-pinterest


Resources

https://github.com/memsql/memsql-spark-connector

http://docs.memsql.com/docs/streamliner-administration

http://docs.memsql.com/docs/pipelines-overview

https://github.com/memsql/memsql-docker-quickstart

Thank You
