Greenplum is a Scale-Out Architecture on standard commodity hardware
…
MPP
• Queries shipped to each node simultaneously
• Executed in parallel on each segment instance
• Multiple pipelines to data
• Highly scalable topology
• Locks and buffers not shared

Traditional
• Single database buffer used by all user operations
• More locks mean a more complex lock-management system
• Single pipe to data
• Limited scalability
Greenplum Polymorphic Data Storage
[Figure: a timeline of monthly partitions, Jan ’09 through Nov ’09, stored in mixed formats: row-oriented with fast compression, column-oriented with fast compression, and column-oriented with archival compression]
• Greenplum Database’s engine provides a flexible storage model
  – Four table types: heap, row-oriented, column-oriented, external
  – Block compression: Gzip (levels 1-9), QuickLZ
• Storage types can be mixed within a database, and even within a table
  – Fully configurable via table DDL and partitioning syntax
  – You may also choose to index some partitions and not others
• Gives customers the choice of processing model for any table or partition
  – Supports ILM scenarios – denser packing of older partitions, etc.
  – Tables/partitions of different storage types can be joined together without restriction
  – Highly tuned – e.g. columnar does efficient pre-projection and parallel execution
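The mixed-storage configuration above is expressed directly in table DDL. A sketch of what such DDL can look like, wrapped in Python only so it can be printed and checked (table, column, and partition names are hypothetical; the WITH options follow Greenplum's documented syntax, where Gzip-style compression is spelled zlib):

```python
# Hypothetical table mixing storage types per partition (names invented;
# the appendonly/orientation/compresstype options follow Greenplum DDL syntax).
ddl = """
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=row, compresstype=quicklz)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- older months: column-oriented with dense archival compression
    PARTITION archive START ('2009-01-01') END ('2009-07-01')
        WITH (appendonly=true, orientation=column,
              compresstype=zlib, compresslevel=9),
    -- recent months: row-oriented with fast compression
    PARTITION recent START ('2009-07-01') END ('2010-01-01')
        WITH (appendonly=true, orientation=row, compresstype=quicklz)
)
"""
print(ddl.strip())
```

Partitions of both orientations still join and query as one logical table; only the on-disk layout differs.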
Key Technical Requirements for HPA

Technical values
• Performance – massively parallel architecture
• Load speeds – 10 TB/hr
• Integration with SAS
• In-database analytics using Java, PL/R, etc.
• Integration with many more BI and analytical tools
• Integration with Hadoop for unstructured data analysis

Financial value
• Lower total cost of ownership
• Best price/performance ratio in the industry for an EDW/analytical appliance

Operational values
• No index maintenance
• Backup and recovery solution
• Most robust disaster-recovery solution in the industry
• Strong technical and customer support organization backing
• A 44 TB table where the query planner executes a sequential scan. There are 1,218 million rows of data and 1,000 columns, with 5 concurrent users running the same query on a monthly data set.
• As a baseline: a single node on a typical high-end server with a single controller can read about 1.5 GB per second into the database. A DBMS deployed on a single node can therefore scan our 44 TB in roughly 8.1 hours, and with 5 concurrent users sharing that node, each query effectively takes about 40.7 hours.
• If we deploy over 8 nodes in a Greenplum cluster, the aggregate I/O bandwidth increases linearly to 12 GB/sec. Our query will complete in about 61 minutes.
• If we compress the rows, we can read more data with each I/O. Compression ratios vary, but 2.5X is a reasonable estimate, so our effective scan rate improves by 2.5X and our query completes in 24.4 minutes.
• Partitioning allows us to split the data on each segment by a known value, by month in our example, and, where possible, read only the selected partitions. Scanning only 1/84th of the table (7 years x 12 months), our query completes in 17.4 seconds.
• Columnar compression is more effective than row-based compression. 10X columnar compression is a conservative estimate, 4 times better than the 2.5X row compression already built into our example, so our table scan now completes in 4.35 seconds.
• Columnar projection lets us perform I/O on only the columns we are interested in, say 500 of the 1,000 columns in our example. By reading only 50% of the data we halve our I/O, and the table scan completes in 2.175 seconds. If 5 people were executing the same query concurrently, each with an equal share of system resources, each person's query would complete in about 10.9 seconds.
• Note that a query touching two months touches twice as much data and would complete in 4.35 seconds, four months in 8.7 seconds, and so on; the approach is scalable and robust.
• Also note that joins are implemented using the same shared-nothing approach, meaning that they scale as well.
• We can apply indexes if necessary to further improve query performance.
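The scan-time walkthrough above reduces to simple arithmetic. A minimal sketch of the model, using only the rates and ratios stated in the text (real-world throughput and compression ratios vary):

```python
# Back-of-envelope scan-time model for the 44 TB example.
# Every constant below is an assumption taken from the walkthrough.
TABLE_TB = 44.0
NODE_SCAN_GBPS = 1.5          # sequential read rate per node, GB/s
NODES = 8
ROW_COMPRESSION = 2.5         # row-oriented block compression ratio
PARTITION_FRACTION = 1 / 84   # one month out of 7 years x 12 months
COLUMNAR_GAIN = 10 / 2.5      # 10x columnar vs. 2.5x row compression
PROJECTION_FRACTION = 0.5     # read 500 of the 1,000 columns
USERS = 5

def scan_seconds(tb, gbps):
    """Seconds to stream `tb` terabytes at `gbps` GB/s."""
    return tb * 1000 / gbps

single = scan_seconds(TABLE_TB, NODE_SCAN_GBPS)          # ~8.1 hours
queued = single * USERS                                  # ~40.7 h, 5 users
cluster = scan_seconds(TABLE_TB, NODE_SCAN_GBPS * NODES) # ~61 minutes
compressed = cluster / ROW_COMPRESSION                   # ~24.4 minutes
partitioned = compressed * PARTITION_FRACTION            # ~17.5 seconds
columnar = partitioned / COLUMNAR_GAIN                   # ~4.4 seconds
projected = columnar * PROJECTION_FRACTION               # ~2.2 seconds
per_user = projected * USERS                             # ~10.9 s, 5 users

print(f"cluster scan: {cluster/60:.1f} min, final: {projected:.2f} s")
```

Each optimization multiplies into the previous one, which is why the final figure is roughly four orders of magnitude better than the single-node baseline.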
• SAS Enterprise Miner models execute within the Greenplum database.
• Automatically translates and publishes the model as a scoring function inside the database.
• High-performance model scoring with faster time to results.
• Products: SAS Scoring Accelerator. Note: this will only be available for Greenplum in the next version release, 9.3, slated for the end of this year.
• The Greenplum Database supports up to 2^48 rows per table. One Greenplum customer, Fox Interactive Media, has a trillion-row fact table and is adding a further 3 TB per day in a true mixed-workload environment supporting production reporting, ad-hoc data mining, and operational data services.
• Another online eCommerce client, at the last site visit, had approximately 21 TB in their Greenplum instance across 10 nodes. They load between 10-30 million rows a day, but the challenge is frequency and complexity rather than size: 2,000 Informatica workflows per day and complex hourly loads (up to 300 Greenplum loads per batch, with 9,000 Greenplum loads every day).
• They have 5,000 tables, 350,000 columns, 4,000 views, and 1,600 indexes, with both relational and dimensional models, heavily relational/3NF since Greenplum replaced a legacy Teradata DW. There are hourly metadata/schema/table changes in response to the hourly data loads.
• This client averages around a million SQL statements per day, with heavy spikes during peak hours, and maintains a Cognos reporting SLA of 100k queries per hour. They have over 1,000 Cognos users; 50% of the workload is Cognos, mostly small statements. 25% is financial reporting and 10% is CRM. The remaining 15% is ad-hoc work by power users and analysts, with many significantly large queries of 25-50 slices (and up to 100 slices). They have dependent views to 4 levels of nesting: view (great-grandchild) -> view (grandchild) -> view (child) -> view -> table.
The Australian Tax Office uses Greenplum as an investigative tool in their Compliance and Audit Logging Unit. They are an extremely happy reference customer, citing Greenplum's ability to pull in data from multiple sources and quickly analyze it without needing to create complex data models or even indices.
Some SAS & Greenplum Customers

RWS in Singapore used MS SQL Server as their reporting environment. Their reporting and ETL processes were very slow, and the DWH environment was limited in scalability. They were looking for an in-database platform that could work with SAS. We won a competitive PoC last quarter, and the solution is currently being implemented; they will use Greenplum and SAS as an EDW to store and analyze customer trends.

AIS, a telco in Thailand, migrated a Teradata DWH as well as two Oracle DWHs onto a single Greenplum cluster, demonstrating the schema independence of the database. The system has expanded to 70 TB across 32 servers. AIS uses SAS as their analytical platform.
The Inland Revenue Service was running an Oracle DWH and had problems with analytical report processing times. We won this deal in Q3, and it is currently in the implementation phase.
Samsung Life Insurance had a 50 TB Sybase DWH that they had spent 8 years building. They ran out of performance headroom but were able to migrate the entire environment to Greenplum in 3 months. Of their approximately 400,000 reports across 4 tools (SAS, WebFOCUS, MSTR, OLAP), only about 100 required tuning.