© Copyright 2010 EMC Corporation. All rights reserved.
BIG DATA
IS CHANGING THE WORLD
IN THIS DECADE THE DIGITAL UNIVERSE
WILL GROW 44X FROM 0.9 ZETTABYTES TO 35.2 ZETTABYTES
Source: 2010 IDC Digital Universe Study
90% OF THE DIGITAL UNIVERSE IS
UNSTRUCTURED
Source: 2011 IDC Digital Universe Study
Big Data Has Arrived
• Geophysical Exploration
• Medical Imaging
• Video Surveillance
• Mobile Sensors
• Video Rendering
• Gene Sequencing
• Smart Grids
• Social Media
• Electronic Payments
Deliver Better Healthcare With Big Data
Billion Dollar Specialty Care Service Provider
[Chart: Quality Of Patient Care, comparing Legacy System & Traditional Data with New System & Big Data]
• Treatment Pathways On Summary Data
• Treatment Pathways On All The Data
• Social & Economic Factors
• International Results
• Individual Patient History
Increase Profit Margins With Big Data
Retail Banking Firm Aligns Offers To Customers
[Chart: Customer Profit, comparing Legacy System & Traditional Data with New System & Big Data]
• Agent “Best Guess”
• Profit-Based Recommendations
• User Based Recommendations
• Identify “At-Risk” Customers
Classifying and segmenting Big Data
• Rich content stores—original intellectual property or value-added
– Media, VOD, content creation, special effects, satellite imagery, GIS data
• Generated from workflow—must be managed/processed quickly & cheaply
– Manufacturing, simulation, electronic design
• Develop new intellectual property based on big data
– Pharmaceutical companies doing customised drug development
• Companies, public sector, utilities mining data for business advantage
• Some mine consumer data—higher-volume and potentially higher-value
Big Data is File & Unstructured Data
By 2012, 80% of all storage capacity sold will be for file-based data
[Chart: exabytes by year, 2009 to 2014. File Based: 60.7% CAGR; Block Based: 21.8% CAGR]
Source: IDC
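To put the two CAGR figures in perspective, the sketch below compounds each rate over the chart's time axis. The five-year span (2009 to 2014) is an assumption read from the axis labels, and the function name is illustrative only.

```python
def growth_multiple(cagr: float, years: int) -> float:
    """Return the total growth factor implied by a compound annual growth rate."""
    return (1.0 + cagr) ** years

# CAGR figures quoted on the slide (source: IDC)
FILE_CAGR = 0.607
BLOCK_CAGR = 0.218
YEARS = 5  # assumption: the 2009-2014 span shown on the chart's x-axis

file_multiple = growth_multiple(FILE_CAGR, YEARS)
block_multiple = growth_multiple(BLOCK_CAGR, YEARS)

print(f"File-based capacity grows about {file_multiple:.1f}x over {YEARS} years")
print(f"Block-based capacity grows about {block_multiple:.1f}x over {YEARS} years")
```

Under that assumption, file-based capacity grows by roughly an order of magnitude while block-based capacity less than triples, which is the gap the slide is pointing at.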
Why is Big Data appearing now?
Source: IDC
Gartner’s 3 V’s of Big Data: Volume, Velocity, Variety
“The Internet of Things”
• Massive explosion of smart devices, all sending, receiving, storing data
– Human-oriented devices: handhelds, tablets, cameras
– Non-human-oriented devices: sensors, embedded CPUs
• Social networking messages & data grow exponentially
– Twitter feeds, Facebook updates, LinkedIn messages
• Increasingly, business is conducted digitally – or digitized
• Big Data is global – any source to any target
Source: GoGlobe
Companies want to store big data—Why?
• Google – Originally thought of as “search engine”
– Now: Storing the Internet, storing every search query
• Facebook, Twitter – Just social media?
– Storing every message you send, monitoring every market trend
• Amazon – your every purchase, forever
• Carriers – Storing location-based data on everyone
Social Networking Analysis
Courtesy of NSF Workshop on Social Modeling
The race is on
• Big Data leads to the Optimised Organisation
• It takes a long time to build a functioning data warehouse and analytics tools, and to connect them to the business
• Many companies have a head start
• Every CIO needs to consider Big Data in their strategy to stay ahead
– How to manage, how to leverage
A little retailer I once knew
• Why can Amazon beat everyone on price?
• Purchase information used to adjust supply chain
• Shipping and logistics adjusted according to conditions on the ground and supply chain
• Other customers’ information used to provide recommendations, improve experience
• Not just Amazon: Tesco, Carrefour, Metro, etc. are all taking advantage
How do we make decisions?
• Good data is hard to get, so often on no data at all
• Often on information from peers, colleagues, reports, or because it’s always been done that way
• Many companies fail because they fail to detect shifts in consumer demand
• The Internet has made customers more segmented, and causes customer choice to change faster
Moving to a Data-Driven Model
• Managing with the facts
• Making a science out of data!
• Experimental model: different from BI
• Moving from “gut feel” to rational, scientific decisions
Big-Data-based Decisions
• Unlock value by making information transparent and usable at higher frequency
• More accurate information (e.g. inventories, trends)
• Tailor products more precisely
• Sophisticated analytics make for better decisions
• Better products (via web feedback, sensors, etc.)
Source: McKinsey
What holds back big data?
• Not ICT: compute & storage are getting bigger, cheaper, easier
• Not the quantity of data (see slide 1)
• Not the value: large-scale Big Data projects generally have great ROI
• Real problems are organisational change and talent acquisition
How are people doing it?
• Enterprises will be ingesting more than 1 PB of data per day within 5 years
• Big data is often largely unstructured
• Hadoop is an open-source, Java-based framework for storing and analyzing big data
• Big data can mean billions to trillions of files
– Each file can be gigabytes to terabytes in size
– This means petabytes to exabytes of data
• Examples of big data techniques: directed graph analysis, collaborative filtering, A/B testing, association rule learning, classification, natural language processing, data mining, pattern matching, sentiment analysis, comparative effectiveness, clinical decision support
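As a rough sanity check on the 1 PB per day ingest figure, the sketch below converts it into a sustained transfer rate. It assumes decimal (SI) units and perfectly even ingest around the clock; both are simplifications, and the function name is illustrative.

```python
PETABYTE = 10**15                 # assumption: decimal petabyte, in bytes
GIGABYTE = 10**9                  # decimal gigabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_rate_gb_per_s(bytes_per_day: int) -> float:
    """Average throughput in GB/s needed to absorb a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY / GIGABYTE

rate = sustained_rate_gb_per_s(PETABYTE)
print(f"1 PB/day is about {rate:.1f} GB/s sustained")
```

A rate in the region of ten-plus gigabytes per second, held continuously, is well beyond what a single server or storage controller typically handles, which is why the following slides turn to scale-out designs.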
How do you manage and design for Big Data?
• Scale and parallelism are the keys
– Big data is far too big to process sequentially
– Too much coming in too quickly
– Example: Banks seeking to process market data more quickly, reducing decision-making time from days to minutes
• Answer: Scale-out storage and scale-out processing
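The split, process, merge pattern behind scale-out processing can be sketched in a few lines. This is a minimal illustration, not Hadoop code: the function names and sample text are made up, and Python threads stand in for what would be separate machines in a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition: list[str]) -> Counter:
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines: list[str], workers: int = 4) -> Counter:
    """Split the input, process partitions concurrently, then merge the results."""
    # Round-robin split so each worker gets roughly the same number of lines.
    partitions = [lines[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reduce step: merge the partial counts as workers finish.
        for partial in pool.map(count_words, partitions):
            total.update(partial)
    return total

# Hypothetical sample input
sample = [
    "big data is big",
    "data is changing the world",
    "scale out storage and scale out processing",
]
result = parallel_word_count(sample, workers=2)
print(result.most_common(3))
```

Because each partition is processed without reference to the others, adding workers (or nodes) increases throughput without changing the logic; only the final merge is sequential.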
Cramming big data onto traditional models
[Diagram labels: Storage, Network, Server; Scalability, Performance, Management, Availability, Cost]
A different idea – scale-out
[Diagram labels: Storage, Network, Server; Scalability, Performance, Management, Availability, Cost]
Enterprise Hadoop: Greenplum & Isilon
• Easier and more reliable
– Packaged Hadoop distribution with Isilon storage
• Purpose-built Hadoop infrastructure
– Faster, less risk
• Sharing expertise to address the talent gap
– Architecture, data science, and roadmap services
• Proven at scale with worldwide support
– 24x7 one call Hadoop support from EMC
– Key component of Greenplum UAP
– Unstructured data processing
Increasing Demand for Advanced Analytics
• Complex
– Deep, rich analysis of big data sets
– Ad hoc, interactive analysis, not structured reports
• Timely
– On-going, frequent analysis (e.g. daily, weekly)
– Insights delivered in minutes/seconds
• Actionable
– Forward looking, predictive insight
– Create new business value
EMC Greenplum: Purpose-built for Big Data
• EMC Greenplum is a shared-nothing, massively parallel processing (MPP) data warehouse system
• Core principle of data computing is to move the processing dramatically closer to the data and to the people
• Fast Data Loading
• Extreme Performance & Elastic Scalability
• Unified Data Access
© Copyright 2011 EMC Corporation. All rights reserved. EMC Confidential – NDA Required
MPP Shared-Nothing Architecture
• Greenplum’s Massively Parallel Processing (MPP) Database has extreme scalability on general purpose systems
• Automatic parallelization
– Load and query like any database
• Scan and process in parallel
– Extremely scalable and I/O optimized
• Linear scalability by adding nodes
– Each adds storage, query performance and loading performance
[Diagram: Master Servers (query planning and dispatch); Network Interconnect; Segment Servers (storage and query processing); MapReduce; External Sources (MPP loading, streaming, etc.)]
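The shared-nothing idea can be illustrated with a toy model: rows are hash-distributed so that each segment owns a disjoint slice of the data, segments aggregate locally, and the master only merges partial results. This is a conceptual sketch, not Greenplum's implementation; the segment count, function names, and sample rows are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size

def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments

def distribute(rows, num_segments: int = NUM_SEGMENTS):
    """Loading: each row lands on exactly one segment, which then owns it."""
    segments = [[] for _ in range(num_segments)]
    for customer, amount in rows:
        segments[segment_for(customer, num_segments)].append((customer, amount))
    return segments

def segment_sum(segment_rows):
    """Segment-side work: aggregate only the rows stored locally."""
    totals = defaultdict(float)
    for customer, amount in segment_rows:
        totals[customer] += amount
    return dict(totals)

def master_query(segments):
    """Master-side work: dispatch to every segment and merge the partial results."""
    merged = {}
    for partial in map(segment_sum, segments):
        # Keys cannot overlap between segments, because all rows for one
        # customer hash to the same segment.
        merged.update(partial)
    return merged

# Hypothetical sample data: (customer, purchase amount)
rows = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 7.0), ("bob", 1.0)]
segments = distribute(rows)
totals = master_query(segments)
print(totals)
```

In this model, adding a segment adds both storage and processing for its share of the rows, which is the intuition behind the linear-scalability bullet above.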
EMC Hadoop.
Open Source.
Fully Supported By EMC.
The EMC Big Data “Stack”
• Act: Documentum xCP
• Analyze: Greenplum, Hadoop
• Store: Isilon and Atmos
1. Petabyte Scale
2. Structured & Unstructured
3. Real Time
4. Collaborative
THANK YOU
HAVE A GREAT CONFERENCE!