Performing Real-Time Analytics with In-Memory Data Grids. Copyright © 2013 by ScaleOut Software, Inc. Cloud Expo, June 10, 2013. Mikhail Sobolev ([email protected]), David Brinker ([email protected])

Real-time analysis using an in-memory data grid - Cloud Expo 2013

Nov 28, 2014

ScaleOut technical session at Cloud Expo 2013 in NY. Covers the use of in-memory data grids for real-time analysis of fast-changing data. Includes a financial services example.
Transcript
  • 1. Performing Real-Time Analytics with In-Memory Data Grids
      - Copyright 2013 by ScaleOut Software, Inc.
      - Cloud Expo, June 10, 2013
      - Mikhail Sobolev ([email protected]), David Brinker ([email protected])

  • 2. Agenda
      - What is an In-Memory Data Grid (IMDG)?
      - Top Benefits of IMDGs
      - The Need for Real-Time Analytics
      - Example: A Platform for Managing Hedging Strategies
      - Using an IMDG to Perform Real-Time Analysis
      - Benchmark Results
      - Integrating an IMDG into Hadoop

  • 3. About the Speakers
      - Dr. Mikhail Sobolev, Lead Java Architect
          - Ph.D. from Moscow Institute of Physics and Technology
          - Research and consulting focus in parallel computing
          - Responsible for development of scalable software services in Java
      - David Brinker, COO
          - 20 years software business and executive management experience
          - Mentor Graphics, Cadence, Webridge
      - Company: ScaleOut Software
          - Develops and markets IMDG products
          - Founded in September 2003
          - Offices in Bellevue, WA and Beaverton, OR
          - Eight years market experience in Windows & Linux

  • 4. ScaleOut Software Products
      - ScaleOut StateServer
          - Flagship product
          - IMDG middleware for Windows and Linux
          - Industry-leading performance and ease of use
      - ScaleOut GeoServer adds
          - WAN-based data replication for DR
          - Breakthrough technology for global data access
      - ScaleOut Analytics Server adds
          - Real-time data analysis for operational data
          - Comprehensive management tools
      - ScaleOut hServer adds
          - 1st step for Hadoop real-time analytics
          - Accelerates data access and execution
      - [Diagram: ScaleOut StateServer In-Memory Data Grid made up of four grid services]

  • 5. What is an In-Memory Data Grid?
      - In-memory storage for fast updates and retrieval of live data
      - Fits in the business logic layer:
          - Stores collections of Java/.NET objects shared by multiple clients.
          - Uses create/read/update/delete and query APIs to access data.
      - Implemented across a cluster of servers or VMs:
          - Scales storage and throughput by adding servers.
          - Provides high availability in case a server fails.

  • 6. Scaling Data Access Using an IMDG
      - Example: Cloud-Hosted App
      - Application runs as multiple virtual servers (VS).
      - Application instances store and retrieve LOB data from cloud-based file system or database.
      - Applications need fast, scalable storage for live data.
      - In-memory data grid runs as multiple virtual servers to provide elastic in-memory storage for live data.

  • 7. Where IMDGs Are Deployed
      - As a vertical storage tier:
          - Runs as middleware software.
          - Adds missing storage layer to boost performance.
          - Uses out-of-process memory.
          - Avoids repeated trips to a backing store.
      - As a horizontal storage tier:
          - Allows data sharing among servers.
          - Scales performance & capacity.
          - Adds high availability.
          - Can be used independently of backing storage.
      - [Diagram: processor cache, L2 cache, in-process application memory, out-of-process in-memory data grid, backing storage]

  • 8. The Secret to Fast Access Time
      - IMDG incorporates a client-side in-process cache (near cache):
          - Transparent to the application
          - Holds recently accessed data
      - Boosts performance:
          - Eliminates repeated network data transfers & deserialization
          - Reduces access times to near in-process latency
      - Is automatically updated if the grid is updated
      - Supports various coherency models (coherent, polled, event-driven)
      - [Diagram: in-process application memory, in-process client-side cache, out-of-process in-memory data grid]

  • 9. Global Data Integration
      - IMDGs enable seamless data access across on-premise sites and cloud-based deployments:
          - Automatically access remote data as needed.
          - Efficiently manage WAN bandwidth.
          - Enable full data coherency across sites.
      - Supports multiple usage models:
          - Replication for DR
          - Remote access
          - Synchronized read/write

  • 10. Example: Web Farm Cloud-Bursting
      - IMDG bridges on-premise and cloud-based in-memory storage of Web session state.
      - IMDG automatically migrates session-state objects into the cloud on demand.
      - This enables seamless access to data across multiple sites.

  • 11. Top Benefits of IMDGs
      - In-Memory Data Grid is middleware software which provides:
          1. Fast access time for fast-changing, live data
          2. Scalable throughput and storage capacity to match a growing workload and keep response times low
          3. High availability to prevent data loss if a grid server (or network link) fails
          4. Shared access to data across the server farm
          5. Global data access across multiple sites and the cloud
          6. And fast data analysis for quickly and easily mining data using map/reduce
      - [Chart: Access Latency vs. Throughput, comparing Grid and DBMS; annotations: Faster, Scales]

  • 12. IMDGs Analyze Live Data
      - Traditional big data analysis platforms analyze offline data:
          - Example: Hadoop
          - Very large, static datasets
          - Data is often copied from other disk-based storage systems to a distributed file system for analysis.
      - IMDGs store and analyze online data:
          - Fast-changing, operational data
          - Data storage is memory-based.
          - Data motion is minimized for fast, continuous analysis.

  • 13. Online Systems Need Real-Time Analysis
      - A few examples:
          - Equity trading: to minimize risk during a trading day
          - Ecommerce: to optimize real-time shopping activity
          - Reservations systems: to identify issues, reroute, etc.
          - Credit cards: to detect fraud in real time
          - Smart grids: to optimize power distribution & detect issues

  • 14. Example in Financial Services
      - A platform for managing hedging strategies:
          - A hedge fund manages a set of hedging strategies.
          - Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc.
          - Each strategy contains a list of holdings and rules for managing the holdings (such as target allocations).
          - Updates to market data continuously arrive during the trading day.
      - Challenge: The hedge fund must be able to quickly update and analyze its hedging strategies and provide alerts to traders.

  • 15. The Result: Real-Time Alerts
      - Deliver a stream of alerts to traders within a few seconds.
      - Enable the trader to examine strategy details in real time (figure not included in transcript).

  • 16. The Solution: Real-Time Analytics Using an IMDG
      - The IMDG holds the set of strategy objects as an in-memory collection.
      - Updates to market data continuously flow through the IMDG.
      - The IMDG performs repeated map/reduce analysis on hedging strategies every second.
      - Each analysis iteration both updates and analyzes every strategy object.
      - The IMDG collects alerts after each analysis and delivers them to the trader.

  • 17. The Analysis Algorithm
      - Analyze every selected strategy object in parallel within the IMDG:
          - Update the strategy's positions with latest market prices.
          - Evaluate the strategy's rules to see if a trade is needed.
          - Example: Alert if current allocation exceeds target threshold.
          - Generate an alert if holdings need to be changed.
      - Merge the results across all strategy objects to create a set of alerts.

  • 18. Shipping Analysis Code to the IMDG
      - IMDG creates Java or .NET execution environment for analysis:
          - Spans all IMDG servers.
          - Ensures tight integration with memory-based data storage.
      - IMDG client ships jars/assemblies to IMDG servers for execution:
          - Keeps development model simple.
          - Optionally allows pre-staging for multiple runs to shorten startup time.
          - Optionally allows automatic re-staging if code changes between runs.
      - Client starts analysis:
          - Sends invocation to the IMDG.
          - IMDG returns analysis results.

  • 19. Running the Analysis
      - The parallel analysis executes in three steps.
      - Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
      - Note: Query spec matches data's object-oriented properties.

  • 20. Running the Analysis: Step 2
      - Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
          - The analysis runs on all objects selected by the parallel query.
          - Each grid server analyzes its locally stored objects to minimize data motion.
      - Parallel execution ensures fast completion time:
          - IMDG automatically distributes workload across servers/cores.
          - Scaling the IMDG automatically handles larger data sets.

  • 21. IMDG Minimizes Data Motion
      - File-based map/reduce must move data to memory for analysis.
      - IMDG's memory-based computation engine analyzes data in place.
      - [Diagram: file system / database and M/R server memory, versus grid servers in the in-memory data grid]

  • 22. Running the Analysis: Step 3
      - Step 3: The IMDG automatically merges all analysis results.
          - The IMDG first merges all results within each grid server in parallel.
          - It then merges results across all grid servers to create one combined result.
      - Efficient parallel merge minimizes the delay in combining all results.
      - The IMDG delivers the combined result to the trader's display as one object.

  • 23. Benchmark Results
      - Running a similar analysis algorithm (stock back-testing) within an IMDG:
          - IMDG hosted in Amazon cloud using 75 servers.
          - IMDG holds 1 TB of stock history data in memory.
          - IMDG handles continuous stream of updates (1.1 GB/s) while performing real-time analysis on live data.
          - Entire data set analyzed in 4.1 seconds (250 GB/s).
          - IMDG scales linearly by adding servers as workload grows.

  • 24. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
      - Typically used for very large, static, offline datasets
      - Data is held on disk in a file system (HDFS) or DBMS
      - Data is often copied from other disk-based storage systems to HDFS for analysis.

  • 25. Comparison of IMDGs and Hadoop

                                | IMDG                                          | Hadoop
        Data set size           | Gigabytes -> terabytes                        | Terabytes -> petabytes
        Data repository         | In-memory                                     | File / database
        Data view               | Queried object collection                     | File-based key/value pairs
        Development time        | Low                                           | High
        Automatic scalability   | Yes                                           | Application dependent
        Best use                | Real-time analysis of live, memory-based data | Batch analysis of large, static datasets
        I/O overhead            | Low                                           | High
        Cluster mgt.            | Simple                                        | Complex
        High availability       | Memory-based                                  | File-based

  • 26. Enabling Hadoop to Perform Real-Time Analysis
      - Survey result from Strata 2013: 93% of Hadoop users would benefit from real-time data analytics.
      - Strategy: Integrate IMDG into Hadoop.
      - How:
          - Stage data in IMDG for fast access.
          - Thereby allow updates to data during Hadoop execution.
          - Automatically retrieve data from HDFS as necessary.
          - Enable unchanged Hadoop program structure.
          - Combine scalability of Hadoop map/reduce and IMDG.

  • 27. Accessing IMDG Data in Hadoop
      - IMDG adds Hadoop grid record reader for accessing key/value pairs held in the IMDG.
      - Hadoop programs optionally can output results to IMDG with grid record writer.
      - Applications can access and update key/value pairs as live data during analysis.
      - Grid record reader and writer optimize access to key/value pairs to eliminate network overhead.

  • 28. Using IMDG as an HDFS Cache
      - IMDG adds wrapper for HDFS record reader to cache HDFS data during program execution.
      - Hadoop automatically retrieves data from IMDG on subsequent runs.
      - Wrapper accesses IMDG to store and retrieve data with minimum network overhead.
      - Useful in multiple what-if analyses on one data set
      - Tests with Terasort benchmark have demonstrated 11X lower access latency over HDFS without IMDG.

  • 29. Summary
      - IMDGs use in-memory storage to scale access to data for applications which process live, fast-changing data.
      - IMDGs can be deployed in the cloud and provide global data integration across sites.
      - Many applications need to perform real-time analytics on live data.
      - IMDGs can meet this need, delivering results in seconds instead of minutes or hours.
      - Hadoop was not designed for real-time analytics, but IMDGs can enable Hadoop to accelerate access to data.

  • 30. In-Memory Data Grids for Server Farms & Cloud Computing
      - www.scaleoutsoftware.com
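
Editor's sketch (not part of the slides): the three-step analysis described in slides 17 and 19 to 22 (parallel query, per-server analysis of locally stored objects, then a two-level merge) can be illustrated in plain Java. This is a minimal sketch under stated assumptions, not ScaleOut's API: the class `HedgeAnalysisSketch`, the records `Strategy` and `Alert`, the method names, and the sample data are all hypothetical, and each inner list merely stands in for the objects held by one grid server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from a real IMDG product.
class HedgeAnalysisSketch {

    // Assumed shape of a strategy object: holdings plus one allocation rule.
    record Strategy(String name, String sector, Map<String, Integer> shares,
                    String watchedSymbol, double maxAllocation) {}

    record Alert(String strategy, String symbol, double allocation) {}

    // Step 2 (per object): price the holdings with the latest market data and
    // alert if the watched symbol's allocation exceeds the target threshold.
    static List<Alert> analyze(Strategy s, Map<String, Double> prices) {
        double total = 0.0;
        for (Map.Entry<String, Integer> e : s.shares().entrySet()) {
            total += e.getValue() * prices.getOrDefault(e.getKey(), 0.0);
        }
        double watched = s.shares().getOrDefault(s.watchedSymbol(), 0)
                * prices.getOrDefault(s.watchedSymbol(), 0.0);
        double allocation = total == 0.0 ? 0.0 : watched / total;
        if (allocation > s.maxAllocation()) {
            return List.of(new Alert(s.name(), s.watchedSymbol(), allocation));
        }
        return List.of();
    }

    // Step 3: merging two partial results is list concatenation (associative).
    static List<Alert> merge(List<Alert> a, List<Alert> b) {
        List<Alert> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Work done on one simulated grid server: query (step 1), analyze (step 2),
    // and merge the local results (first level of step 3).
    static List<Alert> analyzePartition(List<Strategy> local, String sector,
                                        Map<String, Double> prices) {
        return local.stream()
                .filter(s -> s.sector().equals(sector))
                .map(s -> analyze(s, prices))
                .reduce(List.of(), HedgeAnalysisSketch::merge);
    }

    // Partitions are processed in parallel, then merged across "servers"
    // (second level of step 3) into one combined result.
    static List<Alert> runAnalysis(List<List<Strategy>> partitions, String sector,
                                   Map<String, Double> prices) {
        return partitions.parallelStream()
                .map(p -> analyzePartition(p, sector, prices))
                .reduce(List.of(), HedgeAnalysisSketch::merge);
    }

    // Made-up sample data: two partitions, three strategies, three prices.
    static List<Alert> demo() {
        Map<String, Double> prices = Map.of("ACME", 10.0, "BOLT", 20.0, "CORE", 5.0);
        List<List<Strategy>> partitions = List.of(
                List.of(new Strategy("tech-growth", "high-tech",
                                Map.of("ACME", 80, "BOLT", 10), "ACME", 0.6),
                        new Strategy("energy-core", "energy",
                                Map.of("CORE", 100), "CORE", 0.5)),
                List.of(new Strategy("tech-balanced", "high-tech",
                                Map.of("ACME", 10, "BOLT", 20), "ACME", 0.5)));
        return runAnalysis(partitions, "high-tech", prices);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch the merge is plain concatenation, so the combined result does not depend on how the work is split across partitions. A real grid would run the per-partition step on the server that already stores the objects, which is the data-motion point made in slide 21.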