Greenplum is a Scale-Out Architecture on standard commodity hardware
…
MPP
• Queries shipped to each node simultaneously
• Executed in parallel on each segment instance
• Multiple pipelines to data
• Highly scalable topology
• Locks and buffers not shared

Traditional
• Single database buffer used by all user operations
• More locks mean a more complex lock-management system
• Single pipe to data
• Limited scalability
Greenplum Polymorphic Data Storage
[Figure: a timeline of monthly partitions, Jan ’09 through Nov ’09, stored in mixed formats: row-oriented with fast compression, column-oriented with fast compression, and column-oriented with archival compression]
• Greenplum Database’s engine provides a flexible storage model
  – Four table types: heap, row-oriented, column-oriented, external
  – Block compression: Gzip (levels 1-9), QuickLZ
• Storage types can be mixed within a database, and even within a table
  – Fully configurable via table DDL and partitioning syntax
  – You may also choose to index some partitions and not others
• Gives customers the choice of processing model for any table or partition
  – Supports ILM scenarios – denser packing of older partitions, etc.
  – Tables/partitions of different storage types can be joined together without restriction
  – Highly tuned – e.g. columnar does efficient pre-projection and parallel execution
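The mixed-storage configuration above is expressed directly in table DDL. A sketch of what such DDL can look like, wrapped in Python only so it can be printed and checked (table, column, and partition names are hypothetical; the WITH options follow Greenplum's documented syntax, where Gzip-style compression is spelled zlib):

```python
# Hypothetical table mixing storage types per partition (names invented;
# the appendonly/orientation/compresstype options follow Greenplum DDL syntax).
ddl = """
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=row, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- older months: column-oriented with dense archival compression
    PARTITION archive START ('2009-01-01') END ('2009-07-01')
        WITH (appendonly=true, orientation=column,
              compresstype=zlib, compresslevel=9),
    -- recent months: row-oriented with fast compression
    PARTITION recent START ('2009-07-01') END ('2010-01-01')
        WITH (appendonly=true, orientation=row, compresstype=quicklz)
)
"""
print(ddl.strip())
```

Partitions of both orientations still join and query as one logical table; only the on-disk layout differs.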
Key Technical Requirements for HPA

Technical values
• Performance – massively parallel architecture
• Load speeds – 10 TB/hr
• Integration with SAS
• In-database analytics using Java, PL/R, etc.
• Integration with many more BI and analytical tools
• Integration with Hadoop for unstructured data analysis

Financial value
• Lower total cost of ownership
• Best price/performance ratio in the industry for an EDW/analytical appliance

Operational values
• No index maintenance
• Backup and recovery solution
• Most robust disaster-recovery solution in the industry
• Strong technical and customer support organization backing
• A 44 TB table where the query planner executes a sequential scan. There are 1,218 million rows of data and 1,000 columns, with 5 concurrent users running the same query on a monthly data set.
• As a baseline: a single node on a typical high-end server with a single controller can read about 1.5 GB per second into the database. A DBMS deployed on a single node can therefore scan our 44 TB in roughly 8.1 hours, and with 5 concurrent users sharing that node, each query effectively takes about 40.7 hours.
• If we deploy over 8 nodes in a Greenplum cluster, the aggregate I/O bandwidth increases linearly to 12 GB/sec. Our query will complete in about 61 minutes.
• If we compress the rows, we can read more data with each I/O. Compression ratios vary, but 2.5X is a reasonable estimate, so our effective scan rate improves by 2.5X and our query completes in 24.4 minutes.
• Partitioning allows us to split the data on each segment by a known value, by month in our example, and, where possible, read only the selected partitions. Scanning only 1/84th of the table (7 years x 12 months), our query completes in 17.4 seconds.
• Columnar compression is more effective than row-based compression. 10X columnar compression is a conservative estimate, 4 times better than the 2.5X row compression already built into our example, so our table scan now completes in 4.35 seconds.
• Columnar projection lets us perform I/O on only the columns we are interested in, say 500 of the 1,000 columns in our example. By reading only 50% of the data we halve our I/O, and the table scan completes in 2.175 seconds. If 5 people were executing the same query concurrently, each with an equal share of system resources, each person's query would complete in about 10.9 seconds.
• Note that a query touching two months touches twice as much data and would complete in 4.35 seconds, four months in 8.7 seconds, and so on; the approach is scalable and robust.
• Also note that joins are implemented using the same shared-nothing approach, meaning that they scale as well.
• We can apply indexes if necessary to further improve query performance.
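The scan-time walkthrough above reduces to simple arithmetic. A minimal sketch of the model, using only the rates and ratios stated in the text (real-world throughput and compression ratios vary):

```python
# Back-of-envelope scan-time model for the 44 TB example.
# Every constant below is an assumption taken from the walkthrough.
TABLE_TB = 44.0
NODE_SCAN_GBPS = 1.5          # sequential read rate per node, GB/s
NODES = 8
ROW_COMPRESSION = 2.5         # row-oriented block compression ratio
PARTITION_FRACTION = 1 / 84   # one month out of 7 years x 12 months
COLUMNAR_GAIN = 10 / 2.5      # 10x columnar vs. 2.5x row compression
PROJECTION_FRACTION = 0.5     # read 500 of the 1,000 columns
USERS = 5

def scan_seconds(tb, gbps):
    """Seconds to stream `tb` terabytes at `gbps` GB/s."""
    return tb * 1000 / gbps

single = scan_seconds(TABLE_TB, NODE_SCAN_GBPS)          # ~8.1 hours
queued = single * USERS                                  # ~40.7 h, 5 users
cluster = scan_seconds(TABLE_TB, NODE_SCAN_GBPS * NODES) # ~61 minutes
compressed = cluster / ROW_COMPRESSION                   # ~24.4 minutes
partitioned = compressed * PARTITION_FRACTION            # ~17.5 seconds
columnar = partitioned / COLUMNAR_GAIN                   # ~4.4 seconds
projected = columnar * PROJECTION_FRACTION               # ~2.2 seconds
per_user = projected * USERS                             # ~10.9 s, 5 users

print(f"cluster scan: {cluster/60:.1f} min, final: {projected:.2f} s")
```

Each optimization multiplies into the previous one, which is why the final figure is roughly four orders of magnitude better than the single-node baseline.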
• SAS Enterprise Miner models execute within the Greenplum database.
• Automatically translates and publishes the model as a scoring function inside the database.
• High-performance model scoring with faster time to results.
• Products: SAS Scoring Accelerator. Note: this will only be available for Greenplum in the next version release, 9.3, slated for the end of this year.
• The Greenplum Database supports up to 2^48 rows per table. One Greenplum customer, Fox Interactive Media, has a trillion-row fact table and is adding a further 3 TB per day in a true mixed-workload environment supporting production reporting, ad-hoc data mining, and operational data services.
• Another online eCommerce client, at the last site visit, had approximately 21 TB in their Greenplum instance across 10 nodes. They load between 10-30 million rows a day, but the challenge is frequency and complexity rather than size: 2,000 Informatica workflows per day and complex hourly loads (up to 300 Greenplum loads per batch, with 9,000 Greenplum loads every day).
• They have 5,000 tables, 350,000 columns, 4,000 views, and 1,600 indexes, with both relational and dimensional models, heavily relational/3NF since Greenplum replaced a legacy Teradata DW. There are hourly metadata/schema/table changes in response to the hourly data loads.
• This client averages around a million SQL statements per day, with heavy spikes during peak hours, and maintains a Cognos reporting SLA of 100k queries per hour. They have over 1,000 Cognos users; 50% of the workload is Cognos, mostly small statements. 25% is financial reporting and 10% is CRM. The remaining 15% is ad-hoc work by power users and analysts, with many significantly large queries of 25-50 slices (and up to 100 slices). They have dependent views to 4 levels of nesting: view (great-grandchild) -> view (grandchild) -> view (child) -> view -> table.
The Australian Tax Office uses Greenplum as an investigative tool in their Compliance and Audit Logging Unit. They are an extremely happy reference customer, citing Greenplum's ability to pull in data from multiple sources and quickly analyze it without needing to create complex data models or even indices.
Some SAS & Greenplum Customers

RWS in Singapore used MS SQL Server as their reporting environment. Their reporting and ETL processes were very slow, and the DWH environment was limited in scalability. They were looking for an in-database platform that could work with SAS. We won a competitive PoC last quarter, and the solution is currently being implemented; they will use Greenplum and SAS as an EDW to store and analyze customer trends.

AIS, a telco in Thailand, migrated a Teradata DWH as well as two Oracle DWHs onto a single Greenplum cluster, demonstrating the schema independence of the database. The system has expanded to 70 TB across 32 servers. AIS uses SAS as their analytical platform.
The Inland Revenue Service was running an Oracle DWH and had problems with analytical report processing times. We won this deal in Q3, and it is currently in the implementation phase.
Samsung Life Insurance had a 50 TB Sybase DWH that they had spent 8 years building. They ran out of performance headroom but were able to migrate the entire environment to Greenplum in 3 months. Of their approximately 400,000 reports across 4 tools (SAS, WebFOCUS, MSTR, OLAP), only about 100 required tuning.