End User Panel on Real-Time Data Analytics Building Predictive Applications with Real-Time Data Pipelines and Streamliner Eric Frenkiel, CEO and Co-Founder, MemSQL

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Apr 15, 2017


Data & Analytics

MemSQL
Transcript
Page 1: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

End User Panel on Real-Time Data Analytics

Building Predictive Applications with Real-Time Data Pipelines and Streamliner

Eric Frenkiel, CEO and Co-Founder, MemSQL

Page 2: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Going Real-Time is the Next Phase for Big Data

More Devices

More Interconnectivity

More User Demand

…and companies are at risk of being left behind

Page 3: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

MemSQL Architecture

Streaming Data Warehouse

Streaming

Integrated streaming with Streamliner

Database

High volume transactions for structured and unstructured data

Data Warehouse

Fast, scalable SQL for immediate analytics

Page 4: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Applications and Technology Trends

Real-Time Analytics Risk-Management Personalization

Portfolio Tracking Monitoring and Detection

Internet of Things | Real-Time Data Pipelines | Operationalizing Apache Spark

Page 5: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Put Apache Spark in the fast lane. Persist. Perform. Perfect.

Page 6: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Changing the Way the World Invests

Noah Zucker, Vice President – Tactical Engineering, Novus Partners

Scalable Portfolio Intelligence with MemSQL

Page 7: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

100+ Investment Managers, $2 Trillion AUM Research Platform: 10,000+ Institutions Founded 2007, Privately Held

We help investors discover their true investment acumen and risk

About Novus

Page 8: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

True Investment Acumen and Risk…at Scale

Page 9: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Top-Tier Client List

Page 10: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

24/7 ETL Handholding

Overnight Failure = Business Hours Slowdown

Scala worker pool limited by the database

Non-trivial code changes needed to shard and scale

Before MemSQL…

Page 11: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Today’s Portfolio Intelligence…Right Now

Before MemSQL: 90 Min.

With MemSQL: 2 Min.

Diagram labels: Customer Data, Persistent Store, ETL, Analytics (Scala)

Page 12: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

First-Class JSON Support…Happy Developers

memsql> select * from tasks t where t.task::uid::%clientId = 7;
+---------+--------------------------------------------------------------+
| task_id | task                                                         |
+---------+--------------------------------------------------------------+
|       3 | {"uid":{"clientId":7,"id":1009,"which":"P"},"user":"noahlz"} |
+---------+--------------------------------------------------------------+
1 row in set (0.00 sec)

Salat

Page 13: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Client team focuses on service, not ETL

Predictable application performance

Scala workers: 12 → 126

Add servers to scale; no code changes needed

With MemSQL…

Page 14: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

http://www.novus.com

http://tech.novus.com

@NovusCode

Page 15: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Ian Hansen, Software Engineering Manager, DigitalOcean

ETL Tools for Small Teams

Page 16: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Problem: Business Intelligence Slows as We Grow

• Data lives in SQL

• Easy to ask new questions in SQL

But…

• Business Intelligence tasks taking longer

• Database isn’t built for quick aggregations

Page 17: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Solution: Scale-out SQL Database

• SQL team stays powerful

• Quick to iterate with quick answers

• Prepare for the future!

Page 18: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Problem: Data isn’t in MemSQL

Plus

You don’t have an engineer on your team

It’s hard to get an engineer’s time

You’ve got a job to do… (which is taking more and more time)

Page 19: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Solution: ETL Using REPLACE INTO

• MySQL SQL flavor (available in MemSQL)

• Handles new rows and updates on rows

• Easy to write

• Query source database then replace into target database

Many other scale-out SQL databases don’t have equivalent
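The "query source, then REPLACE INTO target" step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite (which also accepts `REPLACE INTO`) stands in for both the source database and MemSQL, and the `accounts` table, its columns, and the `sync_table` helper are hypothetical names that do not come from the talk.

```python
import sqlite3

# Hypothetical schema; the primary key is what lets REPLACE INTO detect existing rows.
SCHEMA = "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, status TEXT)"


def sync_table(source, target):
    """Query the source table, then REPLACE INTO the target table."""
    rows = source.execute("SELECT id, name, status FROM accounts").fetchall()
    # REPLACE INTO inserts new rows and overwrites rows whose key already exists.
    target.executemany(
        "REPLACE INTO accounts (id, name, status) VALUES (?, ?, ?)", rows
    )
    target.commit()
    return len(rows)


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute(SCHEMA)

source.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)", [(1, "web", "new"), (2, "db", "new")]
)
source.commit()
sync_table(source, target)

# One row changes and one row is added at the source; the same sync handles both.
source.execute("UPDATE accounts SET status = 'active' WHERE id = 1")
source.execute("INSERT INTO accounts VALUES (3, 'cache', 'new')")
source.commit()
sync_table(source, target)

result = target.execute("SELECT id, name, status FROM accounts ORDER BY id").fetchall()
```

Because the same statement covers both inserts and updates, the sync can simply be rerun on a schedule without tracking which rows are new.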

Page 20: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Problem: Now Load JSON Event Data

• ~300K events per day

• Many different types of JSON events

Page 21: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Solution: MemSQL Loader + JSON Type

Only loads new files (or files whose content has changed)

Parallelizes the process

Transformation script is simple: return id and raw JSON data

SQL team unaffected by new JSON events

./memsql-loader load /opt/events/** --table events --script=/opt/events-etl --file-id-column file_id --columns id,data
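A transformation script of the kind described ("return id and raw json data") might look roughly like the sketch below. It is not the script from the talk. It assumes newline-delimited JSON input, an `id` key on every event, and tab-separated output in the order given by `--columns id,data`; the `transform` function name and the sample events are made up for illustration.

```python
import io
import json


def transform(lines, out):
    """Write one 'id<TAB>raw json' row per non-empty input line."""
    count = 0
    for line in lines:
        raw = line.strip()
        if not raw:
            continue  # skip blank lines
        event = json.loads(raw)  # parse only to pull out the id
        # Re-serialize compactly; JSON escaping keeps literal tabs and newlines out of the row.
        data = json.dumps(event, separators=(",", ":"))
        out.write(f"{event['id']}\t{data}\n")
        count += 1
    return count


# Hypothetical events of two different types; the script does not care about the type.
sample = io.StringIO(
    '{"id": 1, "type": "signup"}\n\n{"id": 2, "type": "resize", "size": "2gb"}\n'
)
out = io.StringIO()
n = transform(sample, out)
rows = out.getvalue().splitlines()
```

Run as a standalone script, the same function would be called as `transform(sys.stdin, sys.stdout)`. Since the JSON body is passed through untouched, new event types need no change to the script, which is what keeps the SQL team unaffected.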

Page 22: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Problem: Processing Data on Select

• Need computed value in SQL query

• Computing the value slows down queries

• Computed value used on many queries

• e.g. domain from a URL string

Page 23: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Solution: Persistent Columns

Pre-compute result and save it on the row

Automatically updated if row changes

No need to alter ETL pipeline

ALTER TABLE events ADD COLUMN ( referring_domain AS substring_index(substring(data::$referrer, (locate('//', data::$referrer)) + 2), '/', 1) PERSISTED varchar(255))

Page 24: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Solution: Persistent Columns

Use pre-computed value in select

memsql> select data, referring_domain from events limit 2;
+-------------------------------------+------------------+
| data                                | referring_domain |
+-------------------------------------+------------------+
| {"referrer":"http://example.com/b"} | example.com      |
| {"referrer":"http://example.com/a"} | example.com      |
+-------------------------------------+------------------+
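To make the persisted-column expression easier to follow, here is a hedged Python rendering of the same string logic (`locate`, `substring`, `substring_index`). The function name is borrowed from the column; everything else is a line-by-line translation for illustration, not code from the talk.

```python
def referring_domain(referrer):
    """Mirror substring_index(substring(s, locate('//', s) + 2), '/', 1)."""
    # locate() is 1-based and returns 0 when nothing is found; str.find() is
    # 0-based and returns -1, so adding 2 lands on the same character either way.
    start = referrer.find("//") + 2
    rest = referrer[start:]        # substring(s, pos): everything after '//'
    return rest.split("/", 1)[0]   # substring_index(rest, '/', 1): up to the first '/'


domains = [
    referring_domain(url)
    for url in ("http://example.com/b", "http://example.com/a")
]
```

With the two referrers shown in the query output, `domains` holds `example.com` twice, matching the `referring_domain` column.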

Page 25: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Tools

• REPLACE INTO syntax

• JSON native type

• MemSQL Loader

• Persistent columns

• Now, MemSQL Streamliner

Page 26: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

We Want More Data

Page 27: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

We are Hiring

Page 28: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Mike DePrizio, Senior Architect, Akamai Technologies

Unlocking Revenue with In-Memory Technology

Page 29: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

We are the leading provider of cloud services for delivering, optimizing and securing online content and business applications

CORPORATE STATS (2014):

• $1.96B Revenue

• 1,300 Locations

• 5,000+ Customers

• 5,100+ Employees

OUR HISTORY: Founded 1998 and rooted in MIT technology—solving Internet congestion with math not hardware

Page 30: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

The Business of Billing

Billing domino effect: Akamai → Customers → Sub-customers

Daily billing requires:

• Fast data delivery

• Accurate data

Old Model: Generating a bill at end of month for customer services

New Model: Generating a bill at the end of every day for sub-customer services

Page 31: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Current Billing Data Management

Gather logs from 190,000+ servers in 1400 locations in 110 countries

Multiple PBs/day aggregate/reduce into relevant billing data feed

Typical data record: 3 key fields plus metrics

Load resulting data record into our RDBMS system

Page 32: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Greatest Challenges

• Current system cannot handle expected throughput

• Difficult to quickly scale up existing environments

• New model will generate 10x+ data

Page 33: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Deploying MemSQL

Application: Daily sub-customer billing

Problem: Existing RDBMS pipeline loads were maxed out at 150-300K upserts/second and could not keep up with the projected size of the new billing model

Results: MemSQL cluster performs at 1.9 million upserts/second, allowing the transition from monthly to daily billing

Billing Data resource usage statistics

INSERT... ON DUPLICATE KEY UPDATE... (1.9 million/sec)

Billing Application

• Compute sub-customer charges daily

• Roll up sub-customer usage by customer/cloud provider

• More sophisticated platform offers customers better service, partners new business opportunities
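The upsert-then-roll-up pattern on this slide can be sketched as follows. This is an assumption-heavy illustration, not Akamai's pipeline: SQLite stands in for MemSQL, so the MySQL-style `INSERT ... ON DUPLICATE KEY UPDATE` is written with SQLite's `ON CONFLICT ... DO UPDATE` (SQLite 3.24 or newer), and the `usage` table, its three key fields, and its two metrics are hypothetical names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical record layout: three key fields plus metrics.
db.execute("""
    CREATE TABLE usage (
        customer TEXT, sub_customer TEXT, day TEXT,
        bytes INTEGER, hits INTEGER,
        PRIMARY KEY (customer, sub_customer, day)
    )
""")


def upsert(records):
    """Insert new keys; add the metrics onto rows whose key already exists."""
    db.executemany("""
        INSERT INTO usage (customer, sub_customer, day, bytes, hits)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (customer, sub_customer, day) DO UPDATE SET
            bytes = bytes + excluded.bytes,
            hits = hits + excluded.hits
    """, records)
    db.commit()


upsert([
    ("acme", "s1", "2015-09-30", 100, 1),
    ("acme", "s1", "2015-09-30", 50, 2),   # same key: metrics accumulate
    ("acme", "s2", "2015-09-30", 10, 1),
    ("globex", "s9", "2015-09-30", 7, 1),
])

# Roll up sub-customer usage by customer and day.
rollup = db.execute("""
    SELECT customer, day, SUM(bytes), SUM(hits)
    FROM usage GROUP BY customer, day ORDER BY customer
""").fetchall()
```

Keeping one row per key and accumulating metrics in place means the daily charge computation reads a compact table rather than the raw feed.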

Page 34: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Results Speak for Themselves

2M upserts/second on AWS EC2 instances

Scalability on commodity hardware

Meeting our billing windows

Unlocking revenue

Page 35: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Adapt PoC for real-world situations

Continue scaling linearly

Optimize results with small cluster deployment 

What Next?

Page 36: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Eric Frenkiel, MemSQL CEO and co-founder

September 30, 2015 • New York, NY

Introducing MemSQL Streamliner

Page 37: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

One click deployment of integrated Apache Spark

Put Spark in the Fast Lane

• GUI pipeline setup

• Multiple data pipelines

• Real-time transformation

Eliminates batch ETL

Open source on GitHub

Introducing the MemSQL Streamliner

Page 38: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Simple Deployment Process

Application

Page 39: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

1. Deploy MemSQL

Cluster

In-Memory | Distributed | Relational

Application

Page 40: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

2. Deploy Spark

Cluster

Application

Page 41: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Kafka Connects to Each Node

Cluster

Application

Page 42: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Streamliner Architecture

First of many integrated Apache Spark solutions

Other Real-Time Data Sources

Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

Page 43: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Streamliner ETL Detail

Other Real-Time Data Sources

Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

STREAMLINER

Custom

Future Extractor

JSON

Custom

Future Transformer

Extract Transform Load
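The Extract, Transform, Load stages named on this slide can be pictured as three small composable steps. The sketch below uses plain Python generators purely as an illustration: it does not reproduce Streamliner's actual interfaces (which run on Apache Spark), and the function names, the sample messages, and the in-memory `table` list are all hypothetical.

```python
import json


def extract(source):
    """Extract: yield raw messages from any iterable source (a queue, a file, ...)."""
    for raw in source:
        yield raw


def json_transform(records, columns):
    """Transform: parse each JSON message and project it onto the target columns."""
    for raw in records:
        doc = json.loads(raw)
        yield tuple(doc.get(c) for c in columns)


def load(rows, table):
    """Load: append rows to the destination; a list stands in for a database table."""
    count = 0
    for row in rows:
        table.append(row)
        count += 1
    return count


# Hypothetical sensor messages flowing through the three stages.
messages = ['{"sensor": "S1", "value": 0.7}', '{"sensor": "S2", "value": 1.4}']
table = []
loaded = load(json_transform(extract(messages), ("sensor", "value")), table)
```

Because each stage only consumes an iterator and produces another, a custom extractor or transformer can be swapped in without touching the other two stages.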

Page 44: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Building Predictive Applications

Streamliner Input

User Jar

SAS Generated PMML

Industrial Equipment

Sensor Data

S1 S2 S3 P1 P2 P3

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1

Page 45: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Streamliner Benefits

Build end-to-end data pipelines in minutes

Reduce data latency from days or hours to ZERO

Support thousands of concurrent users running real-time queries

Give users immediate access to fresh data via innovative applications 

Page 46: Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

THE GAME

See MemSQL Streamliner in Action at Booth #831