Capgemini / Pivotal Alliance Confidential–Do Not Distribute Just Do It Immersion Program © Copyright 2014 Pivotal. All rights reserved.
Greenplum DB Technical Overview aka GPDB
John Funk
Business Data Lake Architecture
Pivotal HD Architecture
[Figure: Pivotal HD architecture stack, with Apache and Pivotal components distinguished]
• Apache: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume
• Resource Management & Workflow: YARN, ZooKeeper
• Pivotal HD Enterprise: Command Center (Configure, Deploy, Monitor, Manage), Data Loader, Spring, Unified Storage Service, Hadoop Virtualization Extension
• HAWQ – Advanced Database Services: ANSI SQL + Analytics, Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, MADlib Algorithms
• GemFire XD – Real-Time Database Services: ANSI SQL + In-Memory, Distributed In-Memory Store, Query, Transactions, Ingestion, Processing, Hadoop Driver – Parallel with Compaction
Where should we put Data?
• When do I need it? Now / Later
• What do I want to do with it? Singular event processing (OLTP Analytics), Transactions / Exploratory Analytics / Structured, deep analytics
• How do I need to store it? Temporarily / I want to, but am not required / I must and am required to
• How will I query/search? Structured, regular / Using an alternative index (other source) / Unstructured, unknown / Ad hoc SQL
• Where is it coming from? Events, stream, message stream / File / ETL
• Possible targets: GemFire XD / Pivotal HD / GP / Hadoop + GP / All 3 solutions
Big Data: Industry Perspective
Retail • CRM – Customer Scoring • Store Siting and Layout • Fraud Detection / Prevention • Supply Chain Optimization
Advertising & Public Relations • Demand Signaling • Ad Targeting • Sentiment Analysis • Customer Acquisition
Financial Services • Algorithmic Trading • Risk Analysis • Fraud Detection • Portfolio Analysis
Media & Telecommunications • Network Optimization • Customer Scoring • Churn Prevention • Fraud Prevention
Manufacturing • Product Research • Engineering Analytics • Process & Quality Analysis • Distribution Optimization
Energy • Smart Grid • Exploration
Government • Market Governance • Counter-Terrorism • Econometrics • Health Informatics
Healthcare & Life Sciences • Pharmaco-Genomics • Bio-Informatics • Pharmaceutical Research • Clinical Outcomes Research
Internet of Things
Value of 1% efficiency improvement
Virtuous Cycle of Innovation
Key elements of Industrial Internet
Extreme Performance for Analytics
• Performance through parallelism
– True performance through a shared-nothing MPP architecture
– In-place, incremental scaling
– Optimized for analytic workloads
– Parallel Function Execution
• Simple and automatic
– Just load and query like any database
– Tables are automatically distributed across nodes
• Flexibility and choice
– True column- and row-based storage
– Deep Hadoop integration
– Broad partner support
– Support for the deployment options right for you
Architecture: Performance Via Parallelism
• Scale-out architecture on standard commodity hardware
• Automatic parallelization
– Load and query like any database
– Automatically distributed tables across all nodes
– No need for manual partitioning or tuning
• Extremely scalable MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
– On-line expansion when adding nodes
[Figure: Greenplum Database layers: Loading, Interconnect, Compute, Storage]
Performance: Parallel Query Optimizer
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal ‘SQL pushing’ to segments
• Directly inserts ‘motion’ nodes for inter-segment communication
[Figure: physical execution plan from SQL or MapReduce. Nodes shown, top to bottom: Gather Motion 4:1 (Slice 3); Sort; HashAggregate; HashJoin; Redistribute Motion 4:4 (Slice 1); HashJoin; Hash; Hash; HashJoin; Hash; Broadcast Motion 4:4 (Slice 2); Seq Scan on motion; Seq Scan on customer; Seq Scan on line item; Seq Scan on orders]
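The three motion types named in the plan can be sketched in a few lines of Python. This is only an illustration of the idea, assuming four segments held as in-memory lists; the function names, the modulo hash and the sample tables are hypothetical and are not Greenplum internals.

```python
from collections import defaultdict

NUM_SEGMENTS = 4  # matches the "4:4" and "4:1" motions in the plan


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment for an integer key (hypothetical stand-in hash)."""
    return key % num_segments


def redistribute(per_segment_rows, key_index, num_segments=NUM_SEGMENTS):
    """Redistribute motion (N:N): rehash every row on the join key."""
    out = [[] for _ in range(num_segments)]
    for rows in per_segment_rows:
        for row in rows:
            out[segment_for(row[key_index], num_segments)].append(row)
    return out


def broadcast(per_segment_rows, num_segments=NUM_SEGMENTS):
    """Broadcast motion (N:N): every segment receives a full copy."""
    all_rows = [row for rows in per_segment_rows for row in rows]
    return [list(all_rows) for _ in range(num_segments)]


def gather(per_segment_rows):
    """Gather motion (N:1): collect every segment's rows in one place."""
    return [row for rows in per_segment_rows for row in rows]


def local_hash_join(left, right, left_key, right_key):
    """Hash join that runs independently inside one segment."""
    table = defaultdict(list)
    for r in right:
        table[r[right_key]].append(r)
    return [l + r for l in left for r in table.get(l[left_key], [])]


# orders(order_id, cust_id) start out spread by order_id, not by cust_id.
orders = [[(1, 10), (5, 11)], [(2, 11)], [(3, 12)], [(4, 10)]]
# customers(cust_id, name) are placed by cust_id.
customers = redistribute(
    [[(10, "ann"), (11, "bob"), (12, "cy")], [], [], []], key_index=0
)

# Move orders so matching cust_ids sit on the same segment, join locally,
# then gather the per-segment results.
orders_by_cust = redistribute(orders, key_index=1)
joined = gather(
    local_hash_join(o, c, 1, 0) for o, c in zip(orders_by_cust, customers)
)
```

After the redistribute step each segment can join its own slice without talking to the others, which is why the planner inserts the motion before the join rather than after it.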
Performance: Dynamic Pipelining
• A supercomputing-based “soft-switch” responsible for
– Efficiently pumping streams of data between motion nodes during query-plan execution
– Delivering messages, moving data, collecting results, and coordinating work among the segments in the system
[Figure: Dynamic Pipelining software interconnect]
Architecture: Scalability with Scale-Out
Advantages:
• Scale In-Place
• No Forklifting
• Immediately Usable
Simple Process:
• Connect New Hardware
• Simple Restart
• Schedule Redistribution of Existing Data
[Figure: new segment servers added alongside the existing ones; query planning & dispatch stays on the master]
Loading: Industry’s Fastest
• Industry leading performance at 10+TB per-hour per-rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enable complex data transformations “in-flight”
• Transparent interfaces to loading via support files, application, and services
Greenplum load rates scale linearly with the number of racks, others do not.
For example, two racks = >20TB/H
[Figure: single rack comparison of load rates for Greenplum, Oracle Exadata, Netezza and Teradata]
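The linear-scaling claim reduces to simple arithmetic. The sketch below takes the slide's 10 TB/hour per rack figure as a given input and assumes perfectly linear scaling; the function names are illustrative only.

```python
def aggregate_load_rate(racks, per_rack_tb_per_hour=10.0):
    """Total load rate in TB/hour, assuming perfectly linear scaling."""
    if racks < 0:
        raise ValueError("racks must be non-negative")
    return racks * per_rack_tb_per_hour


def hours_to_load(dataset_tb, racks, per_rack_tb_per_hour=10.0):
    """Time to load a dataset at the aggregate rate."""
    return dataset_tb / aggregate_load_rate(racks, per_rack_tb_per_hour)


two_rack_rate = aggregate_load_rate(2)        # the slide's two-rack example
time_for_100_tb = hours_to_load(100.0, racks=2)
```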
Loading: Massively-Parallel Ingest
• Fast Parallel Load & Unload
– No Master Node bottleneck
– 10+ TB/Hour per Rack
– Linear scalability
• Low Latency
– Data immediately available
– No intermediate stores
– No data “reorganization”
• Load/Unload To & From:
– File Systems
– ETL Products
– Hadoop Distributions
Extreme speed and immediate usability from files, ETL & Hadoop
[Figure: external sources (ETL, file systems; loading, streaming, etc.) connect over the gNet network interconnect to the master servers (SQL; query planning & dispatch) and the segment servers (query processing & data storage)]
Most Powerful Data Loading Capabilities
• Industry leading performance at 16+TB per-hour per-rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enable complex data transformations “in-flight”
• Transparent interfaces to loading via support files, application, and services
Greenplum load rates scale linearly with the number of racks, others do not.
For example, two racks = >32TB/H
[Figure: single rack comparison of load rates for Greenplum, Oracle Exadata, Netezza and Teradata]
Multi-Level Partitioning
• Hash Distribution to evenly spread data across all segment instances
• Range Partition within a segment instance to minimize scan work
[Figure: a data set hash-distributed across Segments 1A–1D, 2A–2D and 3A–3D; within each segment the data is range-partitioned by month, Jan 2007 through Dec 2007]
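The two levels can be sketched together: a hash picks the segment, and a month key picks the partition inside that segment. This is a minimal in-memory model, assuming CRC32 as a stand-in hash and a made-up sales table; Greenplum's actual hash function and storage layout are not shown here.

```python
import zlib
from collections import defaultdict
from datetime import date

# Twelve segment instances, named as in the figure (1A..3D).
SEGMENTS = [f"{host}{inst}" for host in (1, 2, 3) for inst in "ABCD"]


def segment_for(distribution_key):
    """Level 1, hash distribution (CRC32 is a hypothetical stand-in)."""
    return SEGMENTS[zlib.crc32(distribution_key.encode()) % len(SEGMENTS)]


def partition_for(day):
    """Level 2, range partition: one partition per calendar month."""
    return day.strftime("%Y-%m")


# storage[segment][partition] -> list of rows
storage = defaultdict(lambda: defaultdict(list))


def insert(customer_id, sale_date, amount):
    row = (customer_id, sale_date, amount)
    storage[segment_for(customer_id)][partition_for(sale_date)].append(row)


def scan_month(year, month):
    """Every segment scans only the one partition that can match."""
    wanted = f"{year:04d}-{month:02d}"
    rows = []
    for seg in SEGMENTS:
        rows.extend(storage[seg].get(wanted, []))
    return rows


# Hypothetical data: 1200 sales spread evenly over the months of 2007.
for i in range(1200):
    insert(f"cust{i}", date(2007, i % 12 + 1, 1), i)

march_rows = scan_month(2007, 3)
```

A query restricted to March touches one of the twelve monthly partitions on each segment, so every segment does roughly a twelfth of the scan work and all segments do it in parallel.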
Architecture: Polymorphic Storage™
• Enable Information Lifecycle Management (ILM)
• Storage types can be mixed within a table or database
– Four table types: heap, row-oriented append, column-oriented append and external
• Rich compression functionality, definable column by column
– Blockwise: Gzip1-9 & QuickLZ
– Streamwise: RLE (levels 1-4)
• Flexible indexing, partitioning, and more
[Figure: table ‘CUSTOMER’ partitioned by month, Mar ‘11 through Nov ‘11; row-oriented for HOT DATA, column-oriented for COLD DATA]
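To show why run-length encoding (RLE) pairs naturally with column-oriented storage, here is a minimal sketch. It assumes a toy table held as Python tuples and is not Greenplum's on-disk format; the table contents and function names are invented for illustration.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Row-oriented view: each tuple is one record (hypothetical data).
rows = [
    ("2011-03", "CA", 10),
    ("2011-03", "CA", 12),
    ("2011-03", "CA", 7),
    ("2011-04", "NY", 3),
    ("2011-04", "NY", 9),
    ("2011-04", "CA", 4),
]

# Column-oriented view: values of one column are stored together,
# so repeated values sit next to each other and compress well.
state_column = [row[1] for row in rows]
encoded = rle_encode(state_column)
```

In the row-oriented layout the repeated state values are separated by other fields, which is why this kind of compression is usually described per column.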
MANAGEABILITY, EXTENSIONS
Administration: Simple Tools
• Single console for both Database and Hadoop
• Administration
– Start, Stop Database
– Recover, Rebalance Segments
• Interactive view of System Metrics
– Real-time
– Historic (Configurable by time period)
• In-depth view for System Health
– Hardware health
– Software (Database, Hadoop)
• Query Monitoring
– Search, Prioritize, Cancel Queries
– View Query’s Execution Plan
• Workload Management
– Configure Resource Queues
– Prioritize Users
Administration: Workload Management
Work smarter, not harder.
Connection Management
• Control over how many users can be connected.
• Provides pooling (to allow large numbers) and caps (to restrict numbers if desired)
• Intelligently frees and reacquires temporarily idle session resources
User-Based Resource Queues
• Each user is assigned to a resource queue that performs ‘admission control’ of queries into the database
• Allows DBAs to control the total number or total cost of queries allowed in at any point in time
Dynamic Query Prioritization
• Patent-pending technique of dynamically balancing resources across running queries
• Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
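Admission control by a resource queue can be sketched as a small class. This is a simplified model under stated assumptions: a queue has an optional cap on active queries and an optional cap on total estimated cost, and waiting queries are admitted in arrival order. The class and its attribute names are hypothetical, not Greenplum's implementation.

```python
from collections import deque


class ResourceQueue:
    """Toy admission control: limit active queries and/or total cost."""

    def __init__(self, name, max_active=None, max_cost=None):
        self.name = name
        self.max_active = max_active
        self.max_cost = max_cost
        self.running = {}        # query id -> estimated cost
        self.waiting = deque()   # (query id, cost) in arrival order

    def _fits(self, cost):
        if self.max_active is not None and len(self.running) >= self.max_active:
            return False
        if self.max_cost is not None:
            if sum(self.running.values()) + cost > self.max_cost:
                return False
        return True

    def submit(self, query_id, cost=0.0):
        """Admit the query if limits allow, otherwise make it wait."""
        if not self.waiting and self._fits(cost):
            self.running[query_id] = cost
            return "running"
        self.waiting.append((query_id, cost))
        return "queued"

    def finish(self, query_id):
        """Release a slot and admit waiting queries that now fit."""
        del self.running[query_id]
        while self.waiting and self._fits(self.waiting[0][1]):
            waiting_id, cost = self.waiting.popleft()
            self.running[waiting_id] = cost


adhoc = ResourceQueue("adhoc", max_active=2)
states = [adhoc.submit(q) for q in ("q1", "q2", "q3")]
adhoc.finish("q1")
```

The third query waits until one of the first two completes, which is the behaviour the slide calls admission control.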
Security: Authentication & Authorization
User Authentication
Authenticate With:
• Database Passwords
• LDAP
• Active Directory
• Kerberos/GSSAPI
• RADIUS
• Digital Certs.
• Pluggable Auth. (PAM)
Role Management
Manage Roles:
• Identify Users and Groups
• Grant/Revoke Access to: Databases, Tables, External Tables, Functions, Languages, Schemas, Etc.
• Grant Permissions: Select; Insert, Update, Delete; Rules; Connect; Execute; Etc.
Connection Management
Connections:
• Where to Listen
• # of Connections
• Pools
• Encryption
• Authentication Methods
Security: Standards & Certs.
Networks Encrypted using SSL, TLS
Database encryption supported using pgcrypto
• Algorithms: AES 128, 192, 256, DES, 3-DES and many others
Authentication
• MD5 (default, set at install time)
• SHA-256
• SHA-256-FIPS
Local Passwords Encrypted
• Superuser can change the password hashing algorithm
• Using GUC: password_hash_algorithm
• GUC can be set either system-wide or at the session level
Standards
• Federal Standard FIPS 140-2 compliant
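The difference between the two hashing choices can be sketched with Python's hashlib. PostgreSQL's MD5 format is the string md5 followed by the MD5 hex digest of the password concatenated with the role name; the SHA-256 layout below simply mirrors that and is an assumption made for illustration, as are the function name and the sample credentials.

```python
import hashlib


def hash_password(password, role, algorithm="MD5"):
    """Return a stored-password string for the chosen algorithm.

    The "sha256" prefix and layout are assumed for illustration;
    check the product documentation for the exact stored format.
    """
    salted = (password + role).encode("utf-8")
    if algorithm == "MD5":
        return "md5" + hashlib.md5(salted).hexdigest()
    if algorithm == "SHA-256":
        return "sha256" + hashlib.sha256(salted).hexdigest()
    raise ValueError(f"unsupported algorithm: {algorithm}")


md5_value = hash_password("s3cret", "gpadmin")
sha_value = hash_password("s3cret", "gpadmin", algorithm="SHA-256")
```

MD5 yields a 32-character digest and SHA-256 a 64-character one, so switching the algorithm changes the stored value for every role whose password is set afterwards.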
Data Load Options
SQL INSERT
• Standard row-by-row insert – slowest method
– INSERT INTO tableX VALUES ('John', 'Doe', 'Manager')
• All data is passed through Master server
PostgreSQL COPY command
• Inserts data from a file or stdin (another query) – faster than SQL INSERT
– COPY tableX FROM {file | STDIN}
• All data is passed through Master server
Parallel loading with gpfdist/gpload
• Segment servers connect directly to external files served via gpfdist
• Load bypasses Master server
• Segment servers load in parallel
• External tables point to the streamed files
– CREATE EXTERNAL TABLE ext_table LOCATION (gpfdist://dir/*)
– CREATE TABLE tableY AS SELECT * FROM ext_table
• Integrated with Informatica PowerExchange and Pentaho
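The external-table statements on the slide are abbreviated; a complete statement also needs a column list and a FORMAT clause. The sketch below renders a fuller version as text. The host name, port, file pattern, column names and helper functions are all hypothetical, and the strings are built without any escaping, so this is for illustration rather than production use.

```python
def external_table_ddl(name, columns, host, port, pattern, delimiter="|"):
    """Render a readable external table that reads files served by gpfdist."""
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE {name} ({cols})\n"
        f"LOCATION ('gpfdist://{host}:{port}/{pattern}')\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


def parallel_load_sql(target, external_table):
    """Render the statement that pulls rows in through the segments."""
    return f"INSERT INTO {target} SELECT * FROM {external_table};"


ddl = external_table_ddl(
    "ext_table",
    [("first_name", "text"), ("last_name", "text"), ("title", "text")],
    host="etlhost",      # hypothetical ETL server running gpfdist
    port=8081,           # hypothetical port
    pattern="*.txt",
)
load = parallel_load_sql("tableY", "ext_table")
```

Because each segment fetches its share of the files directly from gpfdist, the rows never funnel through the master, unlike the INSERT and COPY paths.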
Parallelized ETL with Greenplum
One server, running Pentaho PDI and gpload, provides parallelized data loading.
Multiple ETL Servers (DIA Module)
Multiple ETL servers, each running Pentaho PDI and gpload. Even more parallelism for data loading.
HIGH AVAILABILITY, BACKUP, SUPPORT
Availability: Multi-Level Redundancy
[Figure: a client connects over a redundant interconnect to a primary master and a standby master (linked by sync & failover processes) and to the segment servers; primary data segments A1, B1, C1, A2, B2, C2 each have a mirror copy (mirror data); storage has RAID 5 protection]
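Segment mirroring can be sketched as a placement rule plus a failover lookup. The rule used here (each host's mirrors live on the next host in the list) is an assumption chosen for simplicity, and the host and segment names only echo the figure; real mirror layouts are configurable.

```python
def place_mirrors(hosts, segments_per_host):
    """Map each segment to (primary_host, mirror_host).

    Mirrors go on the next host so that no segment's primary and
    mirror share a machine. Requires at least two hosts.
    """
    if len(hosts) < 2:
        raise ValueError("mirroring needs at least two hosts")
    layout = {}
    for index, host in enumerate(hosts):
        mirror_host = hosts[(index + 1) % len(hosts)]
        for number in range(1, segments_per_host + 1):
            layout[f"{host}{number}"] = (host, mirror_host)
    return layout


def serving_hosts(layout, failed_hosts):
    """Which host serves each segment, or None if both copies are down."""
    result = {}
    for segment, (primary, mirror) in layout.items():
        if primary not in failed_hosts:
            result[segment] = primary
        elif mirror not in failed_hosts:
            result[segment] = mirror
        else:
            result[segment] = None
    return result


layout = place_mirrors(["A", "B", "C"], segments_per_host=2)
after_one_failure = serving_hosts(layout, failed_hosts={"A"})
after_two_failures = serving_hosts(layout, failed_hosts={"A", "B"})
```

With one host down every segment is still served from its mirror; losing a host and the host that carries its mirrors takes those segments offline, which is why mirroring is combined with RAID and a standby master.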
Backup in a Nutshell
• Option 1: custom external tables
– Good control over which tables/data to back up
– Enables incremental backup
– Doesn’t include other objects, such as roles, resource queues, etc.
• Option 2: pg_dump
– Free utility
– Not parallelized
– Creates one dump file on the master
• Option 3: gpcrondump
– Free utility
– Parallelized backup
– Creates SQL files on the master and segment hosts
– Must restore to same number of hosts/segments
– Incremental backup not supported
• Option 4: EMC Data Domain
Backup/Restore with EMC Data Domain
• Integration options
– NFS: Data Domain device mounted as NFS storage
– DD Boost: Native, client-side deduplication. Supported in GPDB 4.2 and higher
• Drastic reduction in backup storage requirement
• Back up all segment servers in parallel directly to Data Domain
• Data Domain integrates seamlessly into standard Greenplum full backup, data export and data restore procedures
[Figure: Full Appliance + Data Domain, connected via Boost or NFS over 2 × 10 Gbit IP]
Backup/Restore with EMC Data Domain
Backup and restore between remote and primary sites
• Ideal for configurations with RPO and RTO requirements that can be specified in hours
• Supports:
– Collection Replication for DD Boost backup
– Directory-level replication for NFS backup
– Encryption over the WAN
[Figure: Data Domain Replication over LAN/WAN between two sites, each with a Greenplum DCA and a Data Domain system]
Customer Support Services
• Remote Technical Support
– 24x7 technical support and remote troubleshooting
– Customer-managed case severity level
– Four-hour response objective
• Onsite Support (DCA Only)
– Installation of replacement parts
– Replacement parts shipped for next business day arrival
– GP SW upgrade included
• Proactive Service
– Secure remote monitoring for hardware
– Notification of engineering technical advisories
– Built-in tools maximize stability and performance
• Secure Self-Help
– 24x7 access to eService support tools including knowledgebase, forums, and appropriately licensed software updates
Deployment Options
GREENPLUM DCA
Deployment Choice & Flexibility
Modular Appliances
• Modular Flexibility
• Database, Hadoop and ETL Modules
• Future Partner-Specific Modules
• Common Admin and Network Mgmt.
• Incremental Scalability
• Rapid Deployment
Software Editions
• Deploy on your x86 hardware
• Certified Configurations
• Perpetual or Subscription Lic.
• Community Editions
GREENPLUM DCA
Modular Options
• Modules:
– Greenplum Database
– Greenplum Hadoop
– Greenplum Data Integration Accelerator
– Partner Modules
• From ¼ to 12 Racks
• Incremental Scale
• Reduced Racking
• Reduced Enterprise Networking
[Figure: the 1st rack holds a Greenplum Database Module (required) plus three functional modules; additional racks hold four functional modules each. Functional modules are added in ¼ rack increments and can be Greenplum Database Modules, a Greenplum DIA Module or a Greenplum HD Module]
Greenplum DB Analytics
Extensible for Analytics: In-Database Analytical Algorithms
• Bringing the power of parallelism to commonly used modeling and analytics functions
• In-database analytics
– SAS – HPA, Access, and Scoring Accelerator
– MADlib – An open-source library of advanced analytics functions
– Analytics extensions supported, including
• GraphLab – Analytics for graph data
• PostGIS – Geospatial support, PL/R – Statistical Computing, PL/Java, PL/Perl, etc.
• GPText – Massively parallel text processing
Stored Procedures Support
• Extends SQL with user-defined logic
• Written in SQL
• Used for deploying reusable logic
[Figure: Greenplum Database stack: ODBC/JDBC client access; a Data Access & Query Layer offering SQL, SQL 2003/2008 OLAP, Stored Procedures, MapReduce, In-Database Analytics and SAS; Polymorphic Storage; Greenplum gNet]
SQL 2003/2008 OLAP Support
• Simple aggregates
• Window functions
• Excellent for BI
[Figure: Greenplum Database stack diagram]
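A window function computes a value per row over a related set of rows, without collapsing them the way GROUP BY does. The sketch below runs standard window syntax against an in-memory SQLite database purely as a stand-in engine; it assumes SQLite 3.25 or newer, and the table and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100), ("east", 2, 150), ("west", 1, 80), ("west", 2, 120)],
)

# Running total per region, plus a rank across all rows by amount.
rows = conn.execute(
    """
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total,
           RANK() OVER (ORDER BY amount DESC) AS overall_rank
    FROM sales
    ORDER BY region, month
    """
).fetchall()
```

Each input row is kept in the output and gains the running total and rank columns, which is the pattern BI reports rely on for period-to-date and top-N figures.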
MapReduce Support
• Java-based programming
• Command-line accessible
• Run SQL and MapReduce against the same data
[Figure: Greenplum Database stack diagram]
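The MapReduce programming model itself is small enough to sketch in a few lines. Python is used here only to keep the examples in one language; the driver function and the word-count job are illustrative and do not reflect Greenplum's MapReduce interface.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def tokenize(line):
    """Map step: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def total(key, values):
    """Reduce step: sum the counts collected for one word."""
    return sum(values)


counts = map_reduce(["big data", "big tables"], tokenize, total)
```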
In-Database Analytics Support
• GPText for unstructured data
• PostGIS for Geospatial analysis
• MADlib for scalable in-database analytics
• User-Written Analytical Algorithms
[Figure: Greenplum Database stack diagram]
Greenplum: A powerful platform for machine learning
Machine learning on Relational data
Regressions, Classification, Clustering, High Dimensionality Reduction, Cross validation and many more…
Machine Learning on Graph data
Recommender Systems, Connected Components, PageRank, Triangle Counting, Subgraph Centrality, Spectral Clustering and many more…
Rich Machine Learning Library
• Features:
– Rich set of SQL Machine Learning algorithms from MADlib 1.4 added
– GraphLab 2.2 supported (beta)
– UDF support in R, Python, and Java.
• Benefits:
– Analyze relational and graph data together, without needing data movement.
– Scalable machine learning algorithms help run rapid data science experiments on big data.
– Design custom algorithms using popular languages like R, Python and Java.
User-Written Analytical Algorithms
• Broad Choice of Development Language
– R, C, Java, Python, Perl
• Multiple Execution Models
– User Defined Aggregate – Scalar Result
– User Defined Function – List Result
– User Defined Table Function – Tabular Result
• Can Be Embedded Within:
– SQL, Stored Procedures, MapReduce Maps
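Two of the execution models can be sketched with Python's built-in sqlite3 module, used here only as a stand-in SQL engine. The sketch registers one user-defined aggregate and one plain scalar function and calls both from SQL; the table, the function names and the data are invented, and table functions are not shown.

```python
import sqlite3


class ValueRange:
    """User-defined aggregate: many rows in, one scalar (max - min) out."""

    def __init__(self):
        self.lowest = None
        self.highest = None

    def step(self, value):
        if value is None:
            return
        self.lowest = value if self.lowest is None else min(self.lowest, value)
        self.highest = value if self.highest is None else max(self.highest, value)

    def finalize(self):
        return None if self.lowest is None else self.highest - self.lowest


def to_fahrenheit(celsius):
    """User-defined scalar function: one value in, one value out."""
    return celsius * 9 / 5 + 32


conn = sqlite3.connect(":memory:")
conn.create_aggregate("value_range", 1, ValueRange)
conn.create_function("to_fahrenheit", 1, to_fahrenheit)

conn.execute("CREATE TABLE readings (sensor TEXT, celsius INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)", [("a", 10), ("a", 25), ("b", 0)]
)

# Both user-written routines are embedded directly in ordinary SQL.
rows = conn.execute(
    """
    SELECT sensor, value_range(celsius), to_fahrenheit(MAX(celsius))
    FROM readings
    GROUP BY sensor
    ORDER BY sensor
    """
).fetchall()
```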
MADlib In-Database Analytic Library
• Scalable, in-database analytic library
– Parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data
• Open-source, to enable extensibility and growth
• Fully Parallelized
• Can be customized by users
• Collaboration of developers from Greenplum, University of California at Berkeley and other commercial entities
MADlib In-Database Analytical Functions
Descriptive Statistics
• Quantile
• Profile
• CountMin (Cormode-Muthukrishnan) Sketch-based Estimator
• FM (Flajolet-Martin) Sketch-based Estimator
• MFV (Most Frequent Values) Sketch-based Estimator
• Frequency
• Histogram
• Bar Chart
• Box Plot Chart
Modeling
• Correlation Matrix
• Association Rule Mining
• K-Means Clustering
• Naïve Bayes Classification
• Linear Regression
• Logistic Regression
• Support Vector Machines
• SVD Matrix Factorisation
• Decision Trees/CART
• Latent Dirichlet Allocation Topic Modeling
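As one example from the list, a CountMin sketch estimates item frequencies in fixed memory and never underestimates them. The following is a minimal pure-Python sketch of the general technique, not MADlib's implementation; the width, depth, SHA-256 based hashing and the merge method are choices made here for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator; estimates are never too low."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-looking hash per row, derived from SHA-256.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item):
        return min(self.rows[row][col] for row, col in self._columns(item))

    def merge(self, other):
        """Combine two sketches of equal shape, e.g. one per data slice."""
        for row in range(self.depth):
            for col in range(self.width):
                self.rows[row][col] += other.rows[row][col]


left, right = CountMinSketch(), CountMinSketch()
for word in ["gpdb", "hawq", "gpdb"]:
    left.add(word)
for word in ["gpdb", "madlib"]:
    right.add(word)
left.merge(right)
gpdb_estimate = left.estimate("gpdb")
```

Because sketches of the same shape add cell by cell, each slice of the data can build its own sketch and the results can be merged afterwards, which is what makes the structure a good fit for parallel execution.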
GPText for Text Analytics
• Full text indexing and search
• Join structured and text in single query
• Database security and availability features
• Parallel, linearly scalable performance
• No-Cost – bundled into GPDB
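The idea of joining a text search with structured data can be sketched with a tiny inverted index. This is a conceptual illustration only and says nothing about how GPText is implemented; the tickets, customer tiers and function names are all hypothetical.

```python
import re
from collections import defaultdict


def terms_of(text):
    """Lowercase a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Inverted index: term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms_of(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = terms_of(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Structured data: customer id -> tier. Text data: ticket id -> (customer, body).
customer_tier = {1: "gold", 2: "silver"}
tickets = {
    101: (1, "Disk failure on segment host"),
    102: (2, "Slow query on segment"),
    103: (1, "Billing question"),
}

index = build_index({tid: body for tid, (_, body) in tickets.items()})
hits = search(index, "segment")
# "Join" the text hits with the structured attribute in one pass.
gold_hits = sorted(t for t in hits if customer_tier[tickets[t][0]] == "gold")
```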
Spatial Analytics (PostGIS)
• Integrated PostGIS 2.0 includes support for Geography data type, Geometry data type & previous PostGIS 1.4 features.
– Enables polygons that cover the poles or cross the dateline
– Easily allows users to work with latitude/longitude data without having to know about projections
– No other map projection works for big organizations with truly global data
• Open-GIS Compatible
• GIS Data Types
• OpenGIS Simple Feature Access
PostGIS 2.0 features are available with GPDB 4.2.6 via the Greenplum Package Manager
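Why latitude/longitude data needs more than planar arithmetic shows up at the dateline. The sketch below uses the haversine formula on a sphere with a mean Earth radius; it is a simplified stand-in for geodetic distance and not what PostGIS computes, and the two sample points are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Two points on the equator, half a degree either side of the dateline.
naive_degrees_apart = abs(179.5 - (-179.5))           # planar view: 359 degrees
across_dateline_km = great_circle_km(0.0, 179.5, 0.0, -179.5)
```

Treating longitude as a flat x-axis puts the two points almost a full circle apart, while on the globe they are one degree of longitude from each other.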
Integrated with Tools/Languages, incl. R
• Load PivotalR Library
• Create the “houses” object as a proxy object in R. The data is not loaded into R
• List the columns in the table and preview the first 3 rows of data (the limit is passed through to the db)
• Run a linear regression. This is executed in-database.
• Examine the resulting model
• The model is stored in-database, greatly simplifying the development of scoring applications
SAS
• SAS Scoring Accelerator
• SAS High Performance Analytics (HPA)
• SAS Access
• SAS Grid
[Figure: Greenplum Database stack diagram]
Deep SAS Integration
• SAS/Access for Greenplum
– Fast, transparent and secure access to Greenplum data from SAS
• SAS High-Performance Analytics for Greenplum
– Closely-Integrated In-Memory Analytics
– Accelerates Computation
– Eliminates Most Data Movement
– Shares Segment Servers with Greenplum DB
• SAS Scoring Accelerator for Greenplum
– Execute SAS Models in Parallel In-Database
• SAS Grid for Greenplum
– Accelerate SAS Model Execution for Load and Run
– Integrated As Part of Greenplum DCA
– Leverages DCA’s High-Speed Interconnect
– Reduce Load on and Cost of Data Center Networks
A Mature Enterprise Platform
CLIENT ACCESS & TOOLS
• CLIENT ACCESS: ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd PARTY TOOLS: BI Tools, ETL Tools, Data Mining, etc.
• ADMIN TOOLS: Greenplum Command Center, Greenplum Package Manager
PRODUCT FEATURES
• LOADING & EXT. ACCESS: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access
• STORAGE & DATA ACCESS: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes – Btree, Bitmap, etc., External Table Support
• LANGUAGE SUPPORT: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics, Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)
GREENPLUM DATABASE ADAPTIVE SERVICES
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Online System Expansion
• Workload Management
CORE MPP ARCHITECTURE
• Shared-Nothing MPP
• Parallel Query Optimizer
• Polymorphic Data Storage™
• Parallel Dataflow Engine
• gNet™ Software Interconnect
• Scatter/Gather Streaming™ Data Loading
GPDB Delivers
• Massively Parallel Analytics Performance
• Industry-Leading Load Speed
• No-Forklift Scalability
• Rich SQL with Schema Agnosticism
• In-Database Analytical Extensions
• SAS Acceleration Options
• Industry-Leading Workload Mgmt.
• Parallel Co-Processing with Hadoop
• Availability and Multi-Level Redundancy
• Rich, Easy-to-Use Administration Tools
• Big-Data-Capable Backup Facilities
• Information and User Security