NEW USAGE MODEL FOR REAL-TIME ANALYTICS
WILLIAM L. BAIN, CEO AT SCALEOUT SOFTWARE, INC.
Jul 07, 2015
Using In-Memory Models of Real-World Systems for Operational Intelligence
Copyright © 2014 by ScaleOut Software, Inc.
Big Data Hispano, November 17, 2014
Bill Bain, CEO ([email protected])

Agenda

• What Is Operational Intelligence?
• Example: Tracking Cable Viewers
• Implementing OI Using an In-Memory Data Grid:
    • Distributing the Data Across a Cluster
    • Integrating Data-Parallel Analysis
    • Building an In-Memory Model
• More Examples of In-Memory Models
• Comparison to Spark and Storm
• Implementing an Example in Financial Services
• Using In-Memory Hadoop MapReduce for OI

About the Speaker

• Dr. William Bain, Founder & CEO
    • Career focused on parallel computing: Bell Labs, Intel, Microsoft
    • 3 prior start-ups; the last was acquired by Microsoft, and its product now ships as Network Load Balancing in Windows Server
• ScaleOut Software develops and markets In-Memory Data Grids, software middleware for:
    • Scaling application performance and
    • Providing operational intelligence using
    • In-memory data storage and computing
• Nine years in the market, 400 customers, 10,000 servers

[Figure: sample customers]

Online Systems Need Operational Intelligence

Goal: Provide immediate feedback to a system handling live data. A few examples:
• Ecommerce: for personalized, real-time recommendations
• Equity trading: to minimize risk during a trading day
• Reservations systems: to identify issues, reroute, etc.
• Credit cards & wire transfers: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues

Example: Track Cable TV Viewers

• Goals:
    • Make real-time, personalized upsell offers.
    • Immediately respond to service issues.
    • Track aggregate behavior to identify patterns, e.g.:
        • Total instantaneous incoming event rate
        • Most popular programs and # viewers by zip code
• Requirements:
    • Track events from 10M cable boxes with 25K events/sec (2.2B/day).
    • Correlate, cleanse, and enrich events per rules (e.g., ignore fast channel switches, match channels to programs).
    • Be able to feed enriched events to recommendation engine within 5 sec.
    • Immediately examine any cable box (e.g., box status) & track statistics.

©2011 Tammy Bruce presents LiveWire

The Result: An OI Platform

Based on a simulated workload for the San Diego metropolitan area:
• Continuously correlates and enriches telemetry from 10M simulated set-top boxes (from synthetic load generator).
• Processes more than 30K events/second.
• Enriches events with program information every second.
• Tracks aggregate statistics (e.g., top 10 programs by zip code) every 10 secs.

[Figure: Real-Time Dashboard]

Real-Time vs. Batch Analytics

Big Data Analytics

Real-time (“Operational Intelligence”):
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Seconds to minutes
• Best uses:
    • Tracking live data
    • Immediately identifying trends and capturing opportunities
    • Providing immediate feedback

Batch (“Business Intelligence”):
• Static data sets
• Petabytes
• Disk storage
• Minutes to hours
• Best uses:
    • Analyzing warehoused data
    • Mining for long-term trends

[Diagram: products placed along a real-time to batch spectrum: AnalyticsServer, hServer, Hadoop, IBM, Teradata, SAS, SAP]

Integrated View of Analytics

• Operational intelligence can co-exist with business intelligence:
    • Processes streaming data close to its sources.
    • Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
    • Transforms data for storage in the data warehouse (ETL).
    • Data warehouse provides “strategic” guidance.
• Using the same tool set (e.g., Hadoop MapReduce) lowers TCO:
    • Leverages common skill set.
    • Simplifies design (e.g., loading data into HDFS).

Challenges for Operational Intelligence

• To keep up with fast-growing “live” workloads & maintain fast response times:
    • Track state of entities within a live system.
    • Reliably process updates to data set in real time.
• To identify and respond to trends in fast-changing data:
    • Enrich & evaluate “live” data set in real time.
    • Respond to identified patterns within seconds.

[Chart: Growth in Web Servers, 1995 to 2010, in millions. Source: Netcraft]
[Chart: Growth in “Big Data”, 2000 to 2013, in exabytes]

“More data has been created in the past three years than in the past 40,000.”

In-Memory Architecture for Operational Intelligence

• In-memory data grid (IMDG) holds active entities undergoing state changes in memory.
• Backing store optionally holds large population of entities.
• IMDG processes incoming stream of state changes.
• Analytics engine examines entities in real time and generates alerts within seconds as needed.

In-Memory Data Grid

In-Memory Data Grid (IMDG) stores “live” data in a cluster:
• Fits in the business logic layer:
    • Follows object-oriented view of data (vs. relational view).
    • Stores collections of Java/.NET/C++ objects shared by multiple clients.
    • Uses create/read/update/delete and query APIs to access data.
• Implemented across a cluster of servers or VMs:
    • Scales storage and throughput by adding servers.
    • Provides high availability in case a server fails.

IMDGs Use Object-Oriented Model

• IMDG’s collections of objects act like process collections:
    • Unstructured, typically instances of a class (stored as serialized blobs)
    • Individually accessible / updatable
• IMDG adds attributes:
    • Accessible by global key
    • Queryable by properties
    • Highly available
    • Optional timeouts
    • Distributed locking
    • Integration with a backing store
    • Optional dependency relationships
    • Asynchronous event handling

Basic “CRUD” APIs:
• Create(key, obj, tout)
• Read(key)
• Update(key, obj)
• Delete(key)
and…
• Lock(key)
• Unlock(key)

[Diagram: an object stored under its key]
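
The CRUD and locking calls listed above can be illustrated with a single-process stand-in. This is a minimal sketch, not ScaleOut's client API: the class name `LocalGridSketch`, the method signatures, the millisecond timeout, and the owner-string locking scheme are all assumptions made for illustration. A real IMDG would partition and replicate the objects across servers.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical single-process stand-in for the CRUD + lock API; not a real IMDG client. */
final class LocalGridSketch {
    private static final class Entry {
        final Object value;
        final long expiresAtMillis;
        Entry(Object value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
        boolean isExpired(long nowMillis) { return nowMillis >= expiresAtMillis; }
    }

    private final ConcurrentHashMap<String, Entry> objects = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, String> lockOwners = new ConcurrentHashMap<>();

    /** Create(key, obj, tout): stores obj unless a live object already uses the key. */
    boolean create(String key, Object obj, long timeoutMillis) {
        long now = System.currentTimeMillis();
        Entry fresh = new Entry(obj, now + timeoutMillis);
        Entry stored = objects.compute(key,
                (k, old) -> (old == null || old.isExpired(now)) ? fresh : old);
        return stored == fresh;
    }

    /** Read(key): returns the object, or null if it is absent or has timed out. */
    Object read(String key) {
        Entry e = objects.get(key);
        return (e == null || e.isExpired(System.currentTimeMillis())) ? null : e.value;
    }

    /** Update(key, obj): replaces a live object and keeps its original expiry time. */
    boolean update(String key, Object obj) {
        long now = System.currentTimeMillis();
        return objects.computeIfPresent(key,
                (k, old) -> old.isExpired(now) ? null : new Entry(obj, old.expiresAtMillis)) != null;
    }

    /** Delete(key): removes the entry; returns false if the key was not stored. */
    boolean delete(String key) { return objects.remove(key) != null; }

    /** Lock(key): succeeds only if no other owner currently holds the key. */
    boolean lock(String key, String owner) { return lockOwners.putIfAbsent(key, owner) == null; }

    /** Unlock(key): succeeds only for the owner that holds the lock. */
    boolean unlock(String key, String owner) { return lockOwners.remove(key, owner); }
}
```

A caller would typically lock a key, read and update the object, then unlock, so that concurrent clients do not overwrite each other's changes.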

In-Memory, Data-Parallel Computing

• Integrates with IMDG data storage to minimize data motion.
• Ex.: Parallel Method Invocation (PMI), an object-oriented version of data-parallel computing from the HPC community:
    • Selects objects using a parallel query on data hosted in the IMDG.
    • Runs user-defined methods in parallel across the cluster and merges results.

[Diagram: In-Memory Data Grid Runs Data-Parallel Computation. Stages: Analyze Data (Eval), Combine Results (Merge)]
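
The eval/merge pattern can be approximated inside one JVM with a parallel stream. This is a minimal sketch under stated assumptions: `EvalMergeSketch` and `invoke` are hypothetical names rather than the PMI API, the threads of one process stand in for grid servers, and the merge function is assumed to be associative so that results can be combined in any grouping.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

/** Hypothetical single-JVM illustration of the eval/merge pattern; not the ScaleOut API. */
final class EvalMergeSketch {
    /**
     * Applies eval to every selected object with a shared parameter, then
     * pair-wise merges the results. Returns empty if nothing was selected.
     */
    static <T, P, R> Optional<R> invoke(List<T> selected, P param,
                                        BiFunction<T, P, R> eval,
                                        BinaryOperator<R> merge) {
        return selected.parallelStream()
                .map(obj -> eval.apply(obj, param)) // Eval: analyze each object
                .reduce(merge);                     // Merge: combine results pair-wise
    }

    public static void main(String[] args) {
        List<Integer> shareCounts = List.of(10, 20, 30);
        Optional<Double> total =
                invoke(shareCounts, 2.5, (shares, price) -> shares * price, Double::sum);
        System.out.println(total.orElse(0.0));
    }
}
```

The eval function must not return null, because `Stream.reduce` rejects null elements.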

Achieving Linear Speedup

Avoid data motion (network or disk I/O), which limits throughput.

In-Memory Model of “Live” Entities

Object-oriented model tracks and analyzes real-world entities.

[Diagram: In-Memory State in “IMDG”; NoSQL Storage; Real-Time Data-Parallel Analysis]

Example: Cable Set-Top Boxes

• Each cable box is represented as an object in the IMDG:
    • Object holds raw & enriched event streams, viewer parameters, and statistics.
    • IMDG captures incoming events by updating objects.
• IMDG uses data-parallel computation to:
    • immediately enrich box objects to generate alerts to the recommendation engine, and
    • continuously collect and report global statistics.
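
A per-box object of the kind described above might look like the following. This is a minimal sketch: the class `SetTopBoxSketch`, the 5-second dwell threshold, and the channel-to-program map are illustrative assumptions, not details from the demo. It shows the two rules named in the requirements: ignoring fast channel switches and matching channels to programs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of one cable box held in the grid. */
final class SetTopBoxSketch {
    /** Assumed threshold: shorter stays on a channel count as channel surfing. */
    static final long MIN_DWELL_MILLIS = 5_000;

    private final Map<Integer, String> programGuide;      // channel -> program title
    private final List<String> enrichedEvents = new ArrayList<>();
    private int currentChannel = -1;                       // -1 means nothing tuned yet
    private long tunedAtMillis;

    SetTopBoxSketch(Map<Integer, String> programGuide) {
        this.programGuide = programGuide;
    }

    /** Handles one raw channel-change event coming from the box. */
    void onChannelChange(int channel, long timestampMillis) {
        boolean watchedLongEnough =
                currentChannel >= 0 && timestampMillis - tunedAtMillis >= MIN_DWELL_MILLIS;
        if (watchedLongEnough) {
            // Enrich: record the program that was actually watched on the previous channel.
            enrichedEvents.add(programGuide.getOrDefault(currentChannel, "unknown"));
        }
        currentChannel = channel;
        tunedAtMillis = timestampMillis;
    }

    /** Returns a read-only copy of the enriched events recorded so far. */
    List<String> enrichedEvents() {
        return List.copyOf(enrichedEvents);
    }
}
```

In a grid deployment, each incoming event would update the object stored under that box's key, and a data-parallel pass over all box objects would collect the global statistics.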

Example in Ecommerce: Inventory Management

Fast map/reduce reconciles inventory and order systems for an online retailer:
• Challenge: Inventory and online order management are handled by different applications.
    • Reconciled once per day.
    • Inaccurate orders reduce margins.
• Solution:
    • Host SKUs in IMDG updated in real time by order & inventory systems.
    • Use MapReduce to reconcile in two minutes.
    • Enables real-time reconciliation to ensure accurate orders.
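
The reconciliation idea can be sketched with SKU objects that both systems update and a parallel pass that flags mismatches. This is an illustrative sketch only: `InventoryReconcileSketch`, its method names, and the "orders exceed stock" rule are assumptions, and a parallel stream filter stands in for the MapReduce job mentioned on the slide.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

/** Hypothetical in-memory SKU collection updated by two independent systems. */
final class InventoryReconcileSketch {
    static final class Sku {
        final AtomicLong onHand = new AtomicLong();   // updated by the inventory system
        final AtomicLong ordered = new AtomicLong();  // updated by the order system
    }

    private final Map<String, Sku> skus = new ConcurrentHashMap<>();

    private Sku sku(String skuId) {
        return skus.computeIfAbsent(skuId, id -> new Sku());
    }

    void inventoryReceived(String skuId, long units) { sku(skuId).onHand.addAndGet(units); }

    void orderPlaced(String skuId, long units) { sku(skuId).ordered.addAndGet(units); }

    /** Data-parallel pass: ids of SKUs whose ordered units exceed stock, sorted by id. */
    List<String> oversold() {
        return skus.entrySet().parallelStream()
                .filter(e -> e.getValue().ordered.get() > e.getValue().onHand.get())
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Because the pass runs over objects already in memory, it can be repeated far more often than a once-per-day batch reconciliation.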

Example: Web Shopping

• IMDG holds customer information for active Web users.
• IMDG saves/retrieves customer information from backing store.
• Web browsers send activity information to analytics engine.
• IMDG updates customer history and preferences.
• Analytics engine identifies browsing and buying patterns.
• Analytics engine makes suggestions in real time. Also sends email follow-ups.

Example: Retail Shopping

Brick-and-mortar stores use OI to compete with the online experience:
• IMDG tracks opt-in customers to make recommendations.
• RFID tags identify product selection and availability in showroom.
• Analytics engine sends real-time advisories to sales staff via tablet.

Comparison: IMDGs to Spark

Focus: accelerating business intelligence using in-memory computing:
• In-memory computing to accelerate and extend Hadoop MapReduce using data-parallel operators in Scala.
• Stores data as “resilient distributed datasets” (RDDs):
    • Distributed across cluster
    • Immutable
    • Hold data from/output to HDFS.
    • Manages data stream as a sequence of RDDs.
• Comparison to IMDG:
    • Not designed for operational systems:
        • Lacks high availability (uses lineage).
    • Intended for data-parallel operations:
        • Lacks CRUD APIs on individual objects.

Comparison to Storm

• Focus: continuous processing of input streams
    • Storm implements pipelined execution of tasks by “bolts” on incoming data streams.
    • Streams can be distributed to bolts with configurable mappings.
    • Developer controls the number of tasks per bolt.
• Storm uses a centralized master node and ZooKeeper for fault tolerance.
• Issues:
    • Managing global state
    • Minimizing data motion
    • Complexity / tuning

Implementing an Example in FinServ

• Hedge fund tracks a set of hedging strategies:
    • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc.
    • Each strategy contains a list of holdings and rules for managing the holdings (such as target allocations).
• Updates to market data continuously arrive during the trading day.
• The challenge: update and analyze a large population of hedging strategies to immediately alert traders.

In-Memory Model

• The IMDG holds hedging strategies as an object-oriented collection.
• Updates to market data are managed as a series of snapshot objects.
• The IMDG performs repeated data-parallel analysis on hedging strategies to generate alerts.
• Merges alerts and feeds to traders in real time.
• IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.

Implementing the Analysis

Step 1: Select all objects using parallel query of strategy objects:
• Query spec matches data’s object-oriented properties.
• Selected objects are fed to the analysis engine on each local server.

Java Example: Parallel Query

    import java.util.Set;

    public class Portfolio {
        private long id;
        private Set<Stock> longPositions;
        private Set<Stock> shortPositions;
        private double totalValue;
        private Region region;
        private boolean alerted; // alert for trading

        @SossIndexAttribute // query-able property
        public double getTotalValue() {…}

        @SossIndexAttribute // query-able property
        public Region getRegion() {…}

        public Set<Long> evalPositions(MarketSnapshot ms) {…}
    }

    NamedCache pset = CacheFactory.getCache("portfolios");

    Set<Portfolio> res = pset.queryObjects(Portfolio.class,
        and(greaterThan("totalValue", 1000000),
            equals("region", Region.US)));

Implementing the Analysis

Step 2: Create parallel methods to update and analyze the queried collection of hedging strategies:
• “Eval” method applies market snapshot to an instance of a strategy object:
    • Compare to a MapReduce mapper; adds an input parameter.
    • Updates the strategy object’s positions.
    • Analyzes the positions for a deviation from allowed rules.
    • Optionally generates an alert.
• “Merge” method combines alerts across the collection of strategies:
    • Compare to a MapReduce combiner.
    • Uses binary combining.
    • Is applied globally to the object collection by the IMDG (unlike a MapReduce reducer).
• Note: both methods access hydrated objects, avoiding the need for CRUD access.

Java Example: Parallel Method Invocation

• Create a method to analyze a queried portfolio and another method to pair-wise merge the result sets of alerted portfolios:

    public class PortfolioAnalysis
            implements Invokable<Portfolio, MarketSnapshot, Set<Long>> {

        public Set<Long> eval(Portfolio p, MarketSnapshot ms)
                throws InvokeException {
            // update portfolio and return id if alerted:
            return p.evalPositions(ms);
        }

        public Set<Long> merge(Set<Long> set1, Set<Long> set2)
                throws InvokeException {
            set1.addAll(set2);
            return set1; // merged set of alerted portfolio ids
        }
    }

Java Example: Parallel Method Invocation

• Run a parallel method invocation on a queried set of portfolios and return the set of ids for alerted portfolios:

    NamedCache pset = CacheFactory.getCache("portfolios");

    InvokeResult alertedPortfolios = pset.invoke(
        PortfolioAnalysis.class,
        Portfolio.class,
        and(greaterThan("totalValue", 1000000), // query spec
            equals("region", Region.US)),
        marketSnapshot, // parameters
        ...);

    System.out.println("The alerted portfolios are " + alertedPortfolios.getResult());

Running the Analysis

• IMDG ships user’s code and libraries to its servers.
• IMDG automatically schedules analysis operations across all grid servers and cores:
    • The analysis runs on all objects selected by the parallel query.
    • Each grid server analyzes its locally stored objects to minimize data motion.
• Parallel execution ensures fast completion time:
    • IMDG automatically distributes workload across servers/cores.
    • Scaling the IMDG automatically handles larger data sets.

Merging the Results

• The IMDG automatically merges all analysis results:
    • The IMDG first merges all results within each grid server in parallel.
    • It then merges results across all grid servers to create one combined result.
• Efficient parallel merge minimizes the delay in combining all results.
• The IMDG delivers the combined result to the invoking application as one object.
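
The cross-server stage of the merge can be pictured as a binary tree of pair-wise merges. The sketch below is an assumption about how such a merge could be organized, based on the deck's mentions of binary combining and O(logN) global combining. `TreeMergeSketch` and `Outcome` are hypothetical names, and the rounds run sequentially here, whereas a grid would run each round's merges concurrently on different servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

/** Hypothetical illustration of combining per-server results in pair-wise rounds. */
final class TreeMergeSketch {
    /** The combined value plus the number of pair-wise rounds that were needed. */
    static final class Outcome<R> {
        final R value;
        final int rounds;
        Outcome(R value, int rounds) {
            this.value = value;
            this.rounds = rounds;
        }
    }

    /** Merges one partial result per server; takes about log2(N) rounds for N servers. */
    static <R> Outcome<R> mergeAcrossServers(List<R> perServerResults, BinaryOperator<R> merge) {
        if (perServerResults.isEmpty()) {
            throw new IllegalArgumentException("no results to merge");
        }
        List<R> current = new ArrayList<>(perServerResults);
        int rounds = 0;
        while (current.size() > 1) {
            List<R> next = new ArrayList<>();
            // Merge neighbors pair-wise; each pair is independent of the others.
            for (int i = 0; i + 1 < current.size(); i += 2) {
                next.add(merge.apply(current.get(i), current.get(i + 1)));
            }
            // An unpaired last element moves on to the next round unchanged.
            if (current.size() % 2 == 1) {
                next.add(current.get(current.size() - 1));
            }
            current = next;
            rounds++;
        }
        return new Outcome<>(current.get(0), rounds);
    }
}
```

With 75 partial results, for example, the loop finishes in 7 rounds rather than the 74 sequential steps a single reducer would need.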

Output: Real-Time Alerts

• In-memory analysis delivers a set of alerts to traders every 300 msec.
• Enables the trader to examine strategy details in real time.

Sample Performance Results for PMI

• Measured a similar financial services application (back testing stock trading strategies on stock histories).
• Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory.
• IMDG handled a continuous stream of updates (1.1 GB/s).
• Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling.
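
The throughput figure can be checked with simple arithmetic. The sketch below assumes 1 TB = 1,024 GB (the reading that matches the quoted 250 GB/s); `ThroughputCheck` is a hypothetical helper, and the per-server figure is derived here rather than stated on the slide.

```java
/** Back-of-the-envelope check of the slide's figures; assumes 1 TB = 1,024 GB. */
final class ThroughputCheck {
    static double gbPerSecond(double gigabytes, double seconds) {
        return gigabytes / seconds;
    }

    public static void main(String[] args) {
        double aggregate = gbPerSecond(1024.0, 4.1); // whole grid
        double perServer = aggregate / 75;           // 75 EC2 servers
        System.out.printf("aggregate: %.1f GB/s, per server: %.2f GB/s%n",
                aggregate, perServer);
    }
}
```

The aggregate comes to roughly 250 GB/s, or a little over 3.3 GB/s per server.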

In-Memory MapReduce

Benefits:
• Enables use of standard Hadoop MapReduce for operational intelligence.
• Accelerates data access by holding data in memory.
• Analyzes and updates “live” data.
• Reduces overheads of standard Hadoop distributions:
    • Batch scheduling
    • Disk access
    • Data shuffling
    • Mandatory key sorting
• Enables new features, e.g.:
    • Global combining, optional sorting

Running MapReduce on an IMDG

• A Hadoop distribution does not have to be installed unless HDFS is used.
• The developer starts MapReduce applications from a remote workstation.
• The IMDG automatically builds a reusable “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
• Results are stored in the IMDG, HDFS, or optionally globally merged and returned to the remote workstation.

Run In-Memory MR with YARN

• YARN transparently integrates batch and in-memory MapReduce into a single execution framework with shared access to HDFS.
• For example, IMDG can transparently run Apache Hive in-memory.

[Figures: Example of ScaleOut hServer with Hortonworks; Example of Hive Running on IMDG]

Implementing MapReduce

Run MapReduce as two PMI phases:
• Data can be input from either the IMDG or an external data source.
• Works with any input/output format compatible with the Apache distribution.
• IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
    • Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
    • Minimizes data motion between the mappers and reducers.
    • Allows optional sorting.
• Output of a single reducer/combiner optionally can be globally merged.
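
The two-phase structure can be mimicked in one JVM with a word count. This is a minimal sketch, not hServer code: `TwoPhaseWordCountSketch` is a hypothetical name, parallel streams stand in for PMI, and a concurrent map stands in for intermediate storage in the IMDG. Keys are left unsorted, in line with sorting being optional.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;

/** Hypothetical single-JVM word count run as a map phase followed by a reduce phase. */
final class TwoPhaseWordCountSketch {
    static Map<String, Integer> run(List<String> lines) {
        // Intermediate results stay in memory, grouped by key (stand-in for the grid).
        Map<String, ConcurrentLinkedQueue<Integer>> intermediate = new ConcurrentHashMap<>();

        // Phase 1 (map): each line is processed in parallel and emits (word, 1) pairs.
        lines.parallelStream().forEach(line -> {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    intermediate.computeIfAbsent(word, w -> new ConcurrentLinkedQueue<>()).add(1);
                }
            }
        });

        // Phase 2 (reduce): each key's values are summed in parallel; no sorting step.
        return intermediate.entrySet().parallelStream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "be quick")));
    }
}
```

Neither phase waits on a batch scheduler, and the pairs never leave memory between the phases, which is where the slide's claimed savings come from.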

Accessing IMDG Data for M/R

• IMDG adds grid input format for accessing key/value pairs held in the IMDG.
• MapReduce programs optionally can output results to IMDG with grid output format.
• Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
• Applications can access and update key/value pairs as operational data during analysis.

Optional Caching of HDFS Data

• IMDG adds Dataset Record Reader (wrapper) to cache HDFS data during program execution.
• Hadoop automatically retrieves data from IMDG on subsequent runs.
• Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
• Tests with the Terasort benchmark have demonstrated 11X faster access latency compared with HDFS without the IMDG.

In-Memory Storage Models

IMDG needs multiple in-memory storage models:
• Named cache, optimized for rich semantics on large objects:
    • Property-based query
    • Distributed locking
    • Access from remote grids
• Named map, optimized for efficient storage and bulk analysis (e.g., MapReduce):
    • Highly efficient object storage
    • Pipelined, bulk-access mechanisms

In-Memory Storage Optimizations

In-Memory Concurrent Map:
• Stores key/value pairs in chunks.
• Allows CRUD operations on key/value pairs.
• Automatically organizes chunks into splits.
• Uses per-split hash table to access keys and manage multi-valued keys.
• Stores shuffled data set between mappers and reducers.
• Pipelines chunks to mappers and from reducers.
• Optionally uses memory-mapped files to reduce access latency.
• Provides support for sorting keys.

In-Memory M/R Optimizations

• MapReduce optimizations:
    • Optional sorting
    • Optional multicast of parameters to mappers
    • Optional O(logN) global combining (avoids single, sequential reducer)
    • Optional HDFS caching
    • Optional reuse of JVMs across jobs
• Measured performance:
    • Startup times reduced to a few milliseconds.
    • Word count benchmark shows 20X speedup.
    • Real-world example shows >40X speedup.
• Current limitations:
    • No specific security for multi-tenancy
    • Intermediate data must fit in the IMDG

Accelerating Start-Up Times

• Re-use in-memory context across MapReduce jobs:

    public static void main(String argv[]) throws Exception {
        // Configure and load the invocation grid
        InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid").
            // Add JAR files as IG dependencies
            addJar("main-job.jar").
            addJar("first-library.jar").
            // Add classes as IG dependencies
            addClass(MyMapper.class).
            addClass(MyReducer.class).
            // Define custom JVM parameters
            setJVMParameters("-Xms512M -Xmx1024M").
            load();

        // Run 10 jobs on the same invocation grid
        for (int i = 0; i < 10; i++) {
            Configuration conf = new Configuration();

            // The preloaded invocation grid is passed as the parameter to the job
            Job job = new HServerJob(conf, "Job number " + i, false, grid);

            //......Configure the job here.........

            // Run the job
            job.waitForCompletion(true);
        }

        // Unload the invocation grid when we are done
        grid.unload();
    }

Recap

• Online systems need operational intelligence on “live” data for immediate feedback.
• Operational intelligence can be implemented using an IMDG integrated with data-parallel analysis.
• IMDGs track “live” state:
    • Model real-world entities as a highly available object collection.
    • Enable updates to track changes.
    • Use data-parallel computation for immediate feedback with low latency.
• Can run standard MapReduce.

Thank you!

Parallel Query Example (C#)

• Mark class properties as indexes for query:

    class Stock {
        [SossIndex]
        public string Ticker { get; set; }
        public decimal TotalShares { get; set; }
        public decimal Price { get; set; }
    }

• Define a query using these properties:

    NamedCache cache = CacheFactory.GetCache("Stocks");
    var q = from s in cache.QueryObjects<Stock>()
            where s.Ticker == "GOOG" || s.Ticker == "ORCL"
            select s;

    Console.WriteLine("{0} Stocks found", q.Count());

Example of Analysis Code (C#)

• Create a method to analyze each queried stock object:

    static decimal eval(Stock stock, StockCalcParams calcParams)
    {
        return stock.Price * stock.TotalShares;
    }

• Create a method to pair-wise merge the results:

    static decimal merge(decimal r1, decimal r2)
    {
        return r1 + r2;
    }

Invoking the Parallel Analysis (C#)

• Run a parallel method invocation:

    NamedCache cache = CacheFactory.GetCache("Stocks");

    decimal valueOfSelectedStocks =
        (from s in cache.QueryObjects<Stock>()
         where s.Ticker == "GOOG" || s.Ticker == "ORCL"
         select s)
        .Invoke(new StockCalcParams(…), new Func<Stock, StockCalcParams, decimal>(eval))
        .Merge(new Func<decimal, decimal, decimal>(merge));

    Console.WriteLine("The value of selected stocks is {0}", valueOfSelectedStocks);

17th ~ 18th NOV 2014, MADRID (SPAIN)