1 © Copyright 2012 EMC Corporation. All rights reserved.
Integrated Analysis of Structured and Unstructured Data, with Application Cases: Greenplum Enables Big Data Analytics
Jimmy Chiu (邱垂吉), Technical Consultant, EMC Greenplum Taiwan
• Volume: data volumes approaching multiple petabytes
• Velocity: data being generated and ingested for analysis in real-time
• Variety: tabular, documents, e-mail, metering, network, video, image, audio
• Complexity: different standards, domain rules, and storage formats per data type
Big Data: Volume, Variety, Velocity, Value + Complexity
[Figure: data types (Transactional Data, Documents, Smart Grid, Images, Audio, Video, Text) arranged around Volume, Velocity, Variety, and Complexity]
• New insights on customers, products, and operations
• Contextual and location-aware delivery to any device
Source: Gartner, March 2011
Sample Big Data Scenarios
AUTO INSURANCE IN P&C INSURANCE
LOAN PROCESSING IN BANKING
SMART GRID ANALYTICS IN UTILITIES/ENERGY
VIDEO ANALYTICS IN RETAIL
PROACTIVE EMERGENCY RESPONSE IN HEALTHCARE
REAL-TIME STATISTICAL PROCESS CONTROL IN MANUFACTURING
Big Data Analytics For Competitive Advantage
Today’s Business Model: Suppliers, Customers, Inventory, Physical Assets, Distribution, Services, Mass Marketing, Manufacturing
Big Data Analytics Business Model: Customers, Suppliers, Inventory, Physical Assets, Distribution, Services, Personal Marketing, Additional Profits, Manufacturing
• Who are my most valuable customers?
• What are my most important products?
• What are my most successful campaigns?
Big Data meets Fast Data
Social and Personal – Every Minute:
•Google gets more than 2 million search queries
•About 47,000 people download an App
•Some 100,000 tweets hit Twitter
•Almost 300,000 people log on to Facebook
Business and Transactional:
•CERN (European Organization for Nuclear Research) generates 40TB/sec of scientific data
•Wal-Mart – 1 million transactions per hour
•The world’s top trading systems currently execute trades in under 50 microseconds
•New York Stock Exchange generates 1TB of new trading data daily
Working together, they enable entirely New Business Models
Big Data allows you to find opportunities you didn’t know you had. Fast Data allows you to respond to opportunities before they are gone.
In the Financial Services Industry, large quantities of historical data need to be processed against a growing number of fast-moving data feeds. Batch processing is no longer a suitable solution!
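As a rough illustration of historical data being processed against a fast-moving feed, the sketch below builds a baseline once from history and then checks each incoming value as it arrives, instead of waiting for a batch run. The function names, the sample numbers, and the 3-sigma threshold are all assumptions made for this example; none of them come from a Greenplum product.

```python
from statistics import mean, pstdev


def build_baseline(history):
    """Big Data side: summarize the historical values once."""
    return mean(history), pstdev(history)


def flag_outliers(feed, baseline, threshold=3.0):
    """Fast Data side: test each tick on arrival against the baseline."""
    average, spread = baseline
    for tick in feed:
        # Skip the check when history has no spread (avoids dividing by zero).
        if spread and abs(tick - average) / spread > threshold:
            yield tick


# Hypothetical price history and live feed, invented for illustration.
history = [100, 101, 99, 100, 100]
feed = [100, 101, 105, 99]

alerts = list(flag_outliers(feed, build_baseline(history)))
print(alerts)
```

Because `flag_outliers` is a generator, each alert is available as soon as its tick has been read, which is the point of contrast with batch processing.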
Effective Customer Segmentation is all about blending Structured and Unstructured Data
– Transaction data (structured data) tells you what the customer did.
– Unstructured data can tell you why they did it, why some others did not, what else they need or want, and what problems they may have.
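A minimal sketch of this blending, assuming invented sample data: spend totals come from transaction records (what the customer did), and a crude keyword score over free-text comments stands in for a real text-analytics step (a hint at why). The keyword lists, the `high_spend` threshold, and the segment labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data, invented for illustration only.
transactions = [
    {"customer": "C1", "amount": 120.0},
    {"customer": "C1", "amount": 80.0},
    {"customer": "C2", "amount": 15.0},
    {"customer": "C3", "amount": 300.0},
]
comments = [
    {"customer": "C2", "text": "Delivery was late and support never answered"},
    {"customer": "C3", "text": "Great service, will buy again"},
]

# Tiny keyword lists standing in for a real text-analytics step.
NEGATIVE = {"late", "never", "broken", "refund"}
POSITIVE = {"great", "love", "again"}


def spend_by_customer(rows):
    """Structured side: total spend per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def sentiment_by_customer(rows):
    """Unstructured side: positive minus negative keyword hits per customer."""
    scores = defaultdict(int)
    for row in rows:
        words = {w.strip(".,!?").lower() for w in row["text"].split()}
        scores[row["customer"]] += len(words & POSITIVE) - len(words & NEGATIVE)
    return dict(scores)


def segment(transactions, comments, high_spend=100.0):
    """Blend both views into one label per customer."""
    spend = spend_by_customer(transactions)
    sentiment = sentiment_by_customer(comments)
    segments = {}
    for customer, total in spend.items():
        score = sentiment.get(customer, 0)
        value = "high-value" if total >= high_spend else "low-value"
        mood = "unhappy" if score < 0 else "happy" if score > 0 else "no-signal"
        segments[customer] = f"{value}/{mood}"
    return segments


segments = segment(transactions, comments)
print(segments)
```

Neither source alone separates C1 from C3: both are high spenders, and only the comment text shows that C3 is satisfied while C1 has given no signal.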
Big Data Architecture Requirements
• Multiple data types: structured, semi-structured, unstructured
• Integrated data stores: real-time, traditional, data warehouse
• Modern development tools: Java, lightweight messages, mobile-enabled
• Cloud-enabled: elastic scale, self-healing
Beware point solutions – integration is critical!
Solving Big Data challenge involves more than just managing volumes of data.
― Gartner
Greenplum Overview
Greenplum Product Line
Architecture of Greenplum
• Master servers optimize queries for the most efficient query execution
• MPP Scatter/Gather streaming for fast loading of data
• Interconnect for continuous pipelining of data processing
• Segment servers process queries close to the data in parallel
• Flexible framework for processing large datasets: process large datasets with support for both SQL and MapReduce
Master Master
SQL
MapReduce
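The master/segment split described on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not Greenplum's implementation: rows are hash-distributed across four in-memory "segments", each segment aggregates only its own rows in parallel, and a "master" step merges the partial results. The segment count, the CRC32 hash, the thread pool, and the sample rows are all choices made for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # assumed segment count for the illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (deterministic)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def scatter(rows, num_segments=NUM_SEGMENTS):
    """Distribute rows so each segment owns a disjoint slice of the data."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row["order_id"], num_segments)].append(row)
    return segments


def local_aggregate(segment_rows):
    """Work done on one segment: sum sales per region over local rows only."""
    partial = Counter()
    for row in segment_rows:
        partial[row["region"]] += row["sales"]
    return partial


def gather(partials):
    """Master step: merge the partial sums returned by the segments."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


def run_query(rows):
    """Roughly: SELECT region, SUM(sales) ... GROUP BY region."""
    segments = scatter(rows)
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(local_aggregate, segments))
    return gather(partials)


# Hypothetical order rows, invented for illustration.
rows = [
    {"order_id": 1, "region": "north", "sales": 10},
    {"order_id": 2, "region": "south", "sales": 5},
    {"order_id": 3, "region": "north", "sales": 30},
    {"order_id": 4, "region": "south", "sales": 20},
]
result = run_query(rows)
print(result)
```

The same shape reads as MapReduce: `local_aggregate` plays the map-and-combine role on each node and `gather` plays the reduce role, which is one way to see why a single engine can serve both styles.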
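Greenplum MPP Shared-Nothing Architecture
• Shared disk (e.g. Oracle RAC): several DB nodes on the intranet, all attached over SAN/FC to one shared SAN disk
• Shared everything (e.g. Unix server): a single DB with its own disk
• Shared nothing (e.g. Greenplum): a master on the intranet in front of several DB nodes, each with its own disk (MPP)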
Benefits of the Greenplum Database Architecture
• Simplicity
– Parallelism is automatic: no manual partitioning required
– No complex tuning required: just load and query
– HA
– Best-of-breed x86 and Ethernet networking technologies
• Scalability
– Linear scalability
– Each node adds storage, query performance, and loading performance
• Flexibility
– Full parallelism for SQL92, SQL99, SQL2003 OLAP, and MapReduce
– Any schema (star, snowflake, 3NF, hybrid, etc.)
– Rich extensibility and language support (Perl, Python, R, C, etc.)
– Structured, semi-structured, and unstructured data
Greenplum and Hadoop
Analytics
• Structured: ERP/CRM
• Semi-Structured: Machine Data, Logs
• Unstructured: Images/Sound
Ad-hoc Analysis
Dynamic Data
Batch reporting on static data
Big Data Analytics: The Power of Data Co-Processing
Greenplum Chorus: Analytic Productivity & Tool Integration
Data Access And Query: SQL, MapReduce, SAS, MADlib, Mahout, R, and others
Greenplum Database: SQL Engine For Structured Data
• In-database Advanced Analytics
• Extreme performance on commodity hardware
Greenplum Hadoop: MapReduce Engine For Unstructured Data
• Enterprise-ready Apache Hadoop
• Faster, more dependable, and easier to use
Parallel data exchange between Greenplum Database and Greenplum Hadoop
Network: Parallel Loading Of All Data Types
Greenplum Commander: End-to-end Platform Management & Control
Greenplum Hadoop
• Greenplum HD
– Enterprise-ready Apache Hadoop
– Proven at Scale in 1,000 node Analytics Workbench
– Single product with 2 storage options (Isilon & HDFS)
• Enterprise Edition becomes Greenplum MR:
– Advanced features
– 100% API compatible
– Software-only product
AWB Update
Analytics Workbench Operational!
•1025 nodes operational
•1011 nodes with GPHD installed
•8 projects in total have been onboarded, ranging from university collaboration to partner technology evaluation
Proposals accepted by customer engagement team – [email protected]
•Engagement team will learn project objectives
•JEDI council approves or rejects each project based on technical feasibility and alignment with company goals
•Projects informed of decisions and timelines
Cluster access via - http://portal.analyticsworkbench.com/
Apache Hadoop Pain Points
Operability and Manageability
• Poor Job and Application Monitoring Solution
• Non-existent Performance Monitoring
• Complex System Configuration and Manageability
• No Data Format Interoperability & Storage Abstractions
Performance
• Poor Dimensional Lookup Performance
• Very Poor Random Access and Serving Performance
Greenplum MR: Enterprise Edition Stack
100% APACHE INTERFACE
Distributed File System
MapReduce Framework (MapRed)
Pig
Hive
HBase
Zookeeper
Enhanced Monitoring
Greenplum MR: Enterprise Edition
Enterprise-Ready Hadoop Platform for Unstructured Data
Faster
• 2–5x faster than Apache Hadoop
Reliable
• High Availability
• Mirroring
Easier to Use
• NFS mountable
• Graphical System Management
Greenplum MR Simple Management
• Health Monitoring
• Cluster Administration
• Application Provisioning
Rack Level Monitoring
Greenplum MR Delivers True Return on Investment
• Eliminates all single points of failure
• High Availability for JobTracker, NameNode & NFS
• Snapshots allow point-in-time data protection and recovery.
• Mirroring for business continuity includes wide area replication support.
• NFS direct access to simply load and access data directly in a Hadoop cluster
• Enables standard tools and utilities to work directly on data contained in Hadoop
• Heatmap user interface provides full cluster visibility and control.
• Speeds jobs by 2X – 5X
• Provides faster performance with ½ the hardware
• Substantial capital and operating expense
savings
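To make the NFS point in the list above concrete: once a cluster is exposed as an NFS mount, ordinary file I/O should be enough to load and read data, with no Hadoop client library in the code. The sketch below assumes exactly that and nothing more. A temporary directory stands in for the mount point, and the helper names and file name are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def load_csv(mount_point, name, rows):
    """Write rows with ordinary file I/O; no Hadoop client library involved."""
    target = Path(mount_point) / name
    with target.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return target


def count_lines(path):
    """Read the data back the way any standard tool (wc, grep, ...) could."""
    with Path(path).open() as handle:
        return sum(1 for _ in handle)


# A temporary directory stands in for the NFS mount point of the cluster.
# On a real system this would be wherever the administrator mounted it.
with tempfile.TemporaryDirectory() as fake_mount:
    written = load_csv(fake_mount, "events.csv", [["id", "value"], [1, "a"], [2, "b"]])
    line_count = count_lines(written)

print(line_count)
```

Pointing `mount_point` at a real mount is the only change this sketch would need, which is the practical meaning of "standard tools and utilities work directly on the data".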
EMC Greenplum
Fastest data loading Advanced analytics
DATA IN DECISIONS OUT IN-DATABASE ANALYTICS
Scatter/Gather Streaming technology for the world’s fastest data loading
•Eliminate data load bottlenecks
•Clean and integrate new data
•Several loading options, ranging from bulk load updates to micro-batching for near real-time processing
Optimized for fast query execution and linear scalability
•Move processing closer to data
•Shared-nothing, massively parallel processing (MPP) scale-out architecture
•Computing is automatically optimized and distributed across resources
• Provides the best concurrent multi-workload performance
Unified data access for greater insight and value from data
•Enable parallel analysis across the enterprise
•Open platform with broad language support
•Certified enterprise connectivity and integration with most business intelligence; extract, transform, and load (ETL); and management products
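The micro-batching option mentioned under data loading can be sketched generically: records are grouped into small fixed-size batches and each batch is handed to a loader as soon as it fills, rather than waiting for one large bulk load. This is an illustration of the idea only. `micro_batches`, `load`, and the `load_batch` callback are hypothetical names, and the callback stands in for whatever actually writes to the database.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load(records, batch_size, load_batch):
    """Feed records to load_batch in small groups; return the batch count."""
    count = 0
    for batch in micro_batches(records, batch_size):
        load_batch(batch)
        count += 1
    return count


# Stand-in loader: collect batches in a list instead of writing to a database.
loaded = []
batches = load(range(10), batch_size=4, load_batch=loaded.append)
print(batches, loaded)
```

A smaller `batch_size` moves the behavior toward near real-time at the cost of more load calls; a very large one degenerates into a bulk load.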
EMC Big Data Analytics Reference Architecture
Stages: Data Input, Integration, Data Stores and Access, Data Analysis, Presentation & Delivery
• Data Sources: Multimedia, Web/Social, ERP, CRM, POS, Mobile, Documents, Machine Data, LOB data
• Integration: Quality, MDM, ETL, Map-Reduce
• Data Stores and Access: Enterprise Data Warehouse, Data Marts (BU 1, BU 2, BU 3), Federated Data Warehouse, SQL Stores, NoSQL Stores (Key Values, Documents, Other NoSQL), Hadoop (HDFS, Ecosystem*), parallel data exchange
• Data Analysis: Statistics, Data Mining, Operations Research, Neural Nets, Genetic Algorithms, OLAP, Map-Reduce, BI as a Service
• Presentation & Delivery: Alerts, Reports, Dashboards, Spreadsheets, Mobile, Data Visualization
Traditional path: structured data sources, traditional data integration, traditional data warehousing
Big data analytics ramifications
*Hadoop Ecosystem includes: Hive, Pig, Mahout, HBase, ZooKeeper, Oozie, Sqoop, Avro
Architecture for Business Value
• Sources: DBs, Files (.csv, .txt)
• Load: ETL
• GPDB: Analytics tools (SAS, R, MADlib and more), self-developed analytics apps
• MapRFS (GPMR) and HBase: Analytics tools (Mahout), self-developed analytics apps
• Access interfaces: JDBC, ODBC, Java API
• GPMR: MapRFS: C++; MR: C++; Performance: 2~5X; High Availability; Stable
• SAS & MADlib: In GPDB, In Memory
• Chorus for Collaboration
• Outcome: Business Value
Big Data And EMC
1. Petabyte Scale Data Storage
2. Unified Analytics Platform
3. Data Science
4. New Analytic Applications
SAS / Greenplum Product Overview
SAS High Performance Computing
SAS Access for Integration
Provides integration capability to a number of databases
Allows for increased performance of Base SAS Procs
Products: SAS Access for Greenplum
SAS In-Database Processing
Requires SAS Enterprise Miner in order to be of value
Will lead to significant improvement in performance
Products: SAS Access for Greenplum, SAS Grid Manager, SAS Enterprise Miner, SAS Scoring Accelerator for Greenplum
SAS In-Memory Analytics
New functionality from SAS that requires dedicated database appliance
Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
Products: SAS Access for Greenplum, SAS Grid Manager, SAS High Performance Analytics
SAS and Greenplum UAP Integrated Architecture
Data Science Team: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User, Data Platform Admin
Greenplum Chorus - Analytic Productivity Layer
SAS Analytics
Private/Hybrid Cloud Infrastructure or Appliance
SAS Business Intelligence
SAS Information Management
Greenplum Database Greenplum Hadoop
Data Access & Query Layer (SAS ACCESS, SQL, MapReduce)
Structured & Unstructured Data
Analyze Petabytes Of Current Data
Virtual, Scale Out Architecture
Self-Service
Iterative, Agile
Transparent, Real-time Collaboration
In A Single Unified Analytics Platform