Big Data Cloud Storage Technology Comparison
Tony Pearson, IBM Master Inventor and Senior Managing Consultant
June 26, 2012
© 2011 IBM Corporation
IBM NWA
© Copyright IBM Corporation 2007
Agenda
• What is Big Data?
• InfoSphere BigInsights
• Infrastructure and Storage Considerations
• Concluding Thoughts
An Explosion of Data
• 4.6 billion mobile phones worldwide
• 1.3 billion RFID tags in 2005; 30 billion RFID tags today
• 2 billion Internet users by 2011
• Twitter processes 7 terabytes of data every day
• Facebook processes 10 terabytes of data every day
• World Data Centre for Climate: 220 terabytes of Web data and 9 petabytes of additional data
• Capital market data volumes grew 1,750% from 2003 to 2006
Information Overload… But Lacking Insight
• 2009: 800,000 petabytes
• 2020: 35 zettabytes
• 44x as much data and content over the coming decade
• 80% of the world’s data is unstructured
• 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
• 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
• 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
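The 44x figure follows from the two volumes quoted on this slide. A quick check, assuming decimal units (1 zettabyte = 1,000,000 petabytes); the variable names are illustrative only:

```python
# Growth factor implied by the slide's 2009 and 2020 data volumes.
# Assumption: decimal (SI) units, so 1 zettabyte = 1,000,000 petabytes.
PETABYTES_PER_ZETTABYTE = 1_000_000

volume_2009_pb = 800_000                        # 800,000 petabytes in 2009
volume_2020_pb = 35 * PETABYTES_PER_ZETTABYTE   # 35 zettabytes in 2020

growth_factor = volume_2020_pb / volume_2009_pb
print(f"Growth 2009 to 2020: {growth_factor:.2f}x")  # 43.75x, which the slide rounds to 44x
```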
The Big Data Opportunity
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
• Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
• Velocity: Streaming data and large volume data movement
• Volume: Scale from terabytes to zettabytes
Where did this begin…
• Apache Hadoop: open source framework for harnessing large volumes of unstructured data
- Inspired by Google technologies (MapReduce, GFS)
- Originally built to address scalability problems of web search and analytics
• Enables applications to run on thousands of nodes and leverage petabytes of data in a highly parallel, cost-effective manner
- CPU (processing) + disks (storage) = Hadoop node
- Nodes can be combined into clusters
- New nodes can be added dynamically
- Provides simple scalable growth
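The MapReduce model that inspired Hadoop can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle` and `reduce_phase` and the sample input are hypothetical, chosen only to show the map, group-by-key and reduce steps that Hadoop distributes across nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)

# Hypothetical input; on a real cluster each node would map its own local data.
lines = ["big data big insight", "data in context"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each map call touches only its own line and each reduce call only its own key, both steps can run in parallel on separate nodes, which is what makes adding nodes a simple way to scale.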
How IBM BigInsights extends Hadoop capability
Traditional / Non-traditional data sources
InfoSphere BigInsights (Internet Scale Analytics)
Extreme storage capacity
Log Analytics
Scientific Research
Climate modelling
Risk Exposure
Failure Analysis
Text Processing
Delivering enterprise-ready software
• Advanced Analytics
• Performance & Availability
• Security Hardened Architecture
• Management Disciplines
• Developer Value
Infrastructure for the range of BigInsights deployments
Value
Characteristics
• Optimized for cost-effective scale-out
• Classic Hadoop architecture
• Redundancy provided by Hadoop
Typical customer use cases
• Customer sentiment analysis
• Internet behavior and buying pattern analysis
Enterprise
Characteristics
• Enterprise-class features
• Options to support business-critical workloads
Typical customer use cases
• Financial fraud detection
• Risk analysis
• Data warehouse offload for “cold” data
Performance
Characteristics
• Highest performance
• Compute and I/O intensive workload options
Typical customer use cases
• Email compliance analysis
• Credit card fraud detection
• Media analytics
Technology Comparison
• Internal Storage in System x Servers
- Block-level access
- Use GPFS-Shared Nothing Cluster (SNC)
- Typical for most Hadoop installations
• External Storage
• DCS3700
- Block-level access
- 60 drives in 4U drawer
- Designed for sequential workloads
- Use GPFS-Shared Nothing Cluster
• SONAS
- File-level access
- Designed for unstructured data content used in Big Data analytics
Based on the IBM System x3630 M3: Ultra-dense, storage-rich server for Big Data
BigInsights Hardware Foundation
Rack-Level Features
• Up to 20 System x3630 M3 nodes
• Up to 840TB storage
• Up to 240 cores
• Up to 3,840GB memory
• Up to two 10Gb Ethernet or 40Gb InfiniBand switches
• Scalable to multi-rack configurations
Available Enterprise and Performance Features
• Redundant storage
• Redundant networking
• High performance cores
• Increased memory
• High performance networking
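Dividing the rack-level maximums by the 20-node maximum gives the per-node configuration they imply. This is a back-of-the-envelope sketch that assumes the maximums are spread evenly over all 20 nodes; the slide does not state per-node maximums directly.

```python
# Rack-level maximums taken from the slide.
max_nodes = 20
max_storage_tb = 840
max_cores = 240
max_memory_gb = 3840

# Assumption: resources are spread evenly across all 20 nodes.
storage_per_node_tb = max_storage_tb / max_nodes   # 42.0
cores_per_node = max_cores / max_nodes             # 12.0
memory_per_node_gb = max_memory_gb / max_nodes     # 192.0

print(f"Per node: {storage_per_node_tb:.0f} TB, "
      f"{cores_per_node:.0f} cores, {memory_per_node_gb:.0f} GB memory")
```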
BigInsights Value Node Features
Value Data Node
• IBM System x3630 M3
• Two Intel Xeon E5620 CPUs
• Data: 12 x 2TB NL SAS HDDs
• OS: 1 x 2TB NL SAS HDD
• 48GB DDR3 RDIMMs
Value Management Node (JobTracker, NameNode, Console)
• IBM System x3630 M3
• Two Intel Xeon E5620 CPUs
• Data: 4 x 2TB NL SAS HDDs
• OS: 2 x 2TB NL SAS HDD, RAID1
• 96GB DDR3 RDIMMs
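The data drives of a Value Data Node translate into capacity as follows. The raw figure comes straight from the slide; the effective figure is a rough sketch that assumes three copies of every block, which is the default replication factor in Hadoop's HDFS and is not stated in this deck.

```python
# Value Data Node figures from the slide.
data_drives = 12
drive_size_tb = 2

raw_data_tb = data_drives * drive_size_tb   # 24 TB of raw data capacity per node

# Assumption (not from the slide): three copies of every block, the default
# replication factor in Hadoop's HDFS.
replication_factor = 3
effective_tb = raw_data_tb / replication_factor   # 8.0 TB of unique data per node

print(f"Raw: {raw_data_tb} TB, effective at {replication_factor}x replication: {effective_tb:.0f} TB")
```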
IBM Storage Product Positioning – Primary Data
[Figure: positioning chart of IBM primary-data storage products. Tiers: Entry Level, Midrange, Enterprise. Categories: Sequential, Random, Distributed, Mainframe Optimized, Unified Storage, NAS for all servers, Flash & Stash, High Performance Computing and Big Data. Products: DS3500, DCS3700, DS5000, Storwize V7000, Storwize V7000 Unified, XIV, SVC, DS8000, SONAS, N3000, N6000, N7000. The chart also carries several SSD labels.]
[Figure: bar chart with a vertical axis from 0 to 2000; the slide title, series labels and footnotes [1] and [2] are not recoverable from the source]
• Query languages like Pig and JAQL need good random I/O performance
• Sort requires better sequential throughput
• GPFS is twice as fast as HDFS for both of the above
• For document index lookups, client-side caching is a big win: 17x throughput speedup
• Proven data integrity
• Replicated metadata services
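Why client-side caching helps repeated index lookups can be shown with a small sketch. It uses Python's standard `functools.lru_cache`; the `lookup_document` function, the simulated remote-read counter and the access pattern are hypothetical stand-ins for a client reading index entries from a cluster file system, and the sketch makes no attempt to reproduce the 17x figure.

```python
from functools import lru_cache

# Counts simulated trips to the storage cluster (hypothetical stand-in).
stats = {"remote_reads": 0}

@lru_cache(maxsize=1024)
def lookup_document(doc_id):
    """Hypothetical index lookup; the body runs only on a cache miss."""
    stats["remote_reads"] += 1
    return f"index-entry-{doc_id}"

# A skewed access pattern: a few popular documents requested many times.
requests = [1, 2, 1, 1, 3, 2, 1, 2, 3, 1]
results = [lookup_document(doc_id) for doc_id in requests]

info = lookup_document.cache_info()
print(f"{len(requests)} lookups, {stats['remote_reads']} remote reads, {info.hits} cache hits")
```

The benefit depends entirely on how often the same entries are requested again; a workload with no repeated lookups would see no cache hits.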
File System                          GPFS                                          HDFS
Robust                               No single point of failure                    NameNode vulnerability
Data Integrity                       High                                          Evidence of data loss
Scale                                Thousands of nodes                            Thousands of nodes
POSIX Compliance                     Full – supports a wide range of applications  Limited
Data Management                      Security, Backup, Replication                 Limited
MapReduce Performance                Good                                          Good
Workload Isolation                   Supports disk isolation                       No support
Traditional Application Performance  Good                                          Poor performance with random reads and writes
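The POSIX compliance and traditional application rows are related: a POSIX file system lets an application read from, and overwrite bytes in, the middle of an existing file. A minimal sketch of that access pattern, using a local temporary file as a stand-in for a POSIX-compliant file system (the file contents are made up for illustration). HDFS, by contrast, is designed around write-once, append-oriented files, so this in-place update pattern is not available there.

```python
import os
import tempfile

# Create a small file on the local (POSIX) file system.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"AAAAAAAAAA")   # 10 bytes
    path = f.name

# Random write: seek into the middle of the existing file and overwrite in place.
with open(path, "r+b") as f:
    f.seek(4)
    f.write(b"BB")

# Random read: jump straight to an offset without scanning from the start.
with open(path, "rb") as f:
    f.seek(3)
    middle = f.read(4)
    f.seek(0)
    whole = f.read()

os.remove(path)
print(middle, whole)
```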
Evolution of the global namespace: GPFS Active File Management (AFM)
• 1993: GPFS introduced concurrent file system access from multiple nodes.
• 2005: Multi-cluster expands the global namespace by connecting multiple sites.
• 2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.
High level view of Scale-Out NAS Storage (SONAS)
Benchmark Performance: 403,326 IOPS, single file system (SPECsfs2008.nfs)
• SONAS Release 1.2
• Single file system over 900TB usable
• 10 Interface Nodes; each with:
- Maximum 144 GB of memory
- One active 10GbE port
• 8 Storage Pods; each with:
- 2 Storage nodes and 240 drives
- Drive type: 15K RPM SAS hard drives
- Data Protection: the drives were configured in RAID ranks
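The benchmark configuration can be broken down into a few derived figures. These are simple averages computed from the numbers on the slide, assuming load is spread evenly; they ignore caching in the interface nodes, so the per-drive value is not a measurement of physical disk I/O.

```python
# Benchmark configuration figures from the slide.
total_iops = 403_326
interface_nodes = 10
storage_pods = 8
drives_per_pod = 240

total_drives = storage_pods * drives_per_pod        # 1,920 drives in total
iops_per_interface_node = total_iops / interface_nodes
iops_per_drive = total_iops / total_drives

print(f"Drives: {total_drives}")
print(f"Average IOPS per interface node: {iops_per_interface_node:,.1f}")
print(f"Average IOPS per drive: {iops_per_drive:.1f}")
```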
IBM Scale Out Network Attached Storage (SONAS)
• Enterprise-class solution for IP-based file system storage
• One global repository for application and user files
- One huge file system, or up to 256 file systems per SONAS
• Enterprise solution for all applications, departments and users
- Provision and monitor usage by application, file, department or whatever makes sense to the business
- Includes ability to report usage and access patterns for chargeback
- Capacity managed centrally
- Extremely high utilization rates
• Simplified management of petabytes of storage
• Independently scalable performance and capacity eliminates trade-offs
• Cloud-ready
Concluding Thought: IBM’s Value
• A complete stack for Big Data
- Others require multi-vendor solutions
• Embracing the open source community
- Product support and additional offerings
- In-field expertise to ensure client success
• Enterprise-class focus
- Performance tested
- Administrative and development tooling
- Deep integration with information management software inside and outside IBM
- Security and governance
- High availability and backup
• System x and System Storage
- Industry-leading innovation and technology
- Best-in-class reliability and availability
- #1 in customer satisfaction
Thank You!
About the Speaker
Mr. Tony Pearson, Master Inventor, Senior Managing Consultant, IBM System Storage
Tony Pearson is a Master Inventor and Senior Managing Consultant for the IBM System Storage™ product line. Tony joined IBM Corporation in 1986 in Tucson, Arizona, USA, and has lived there ever since. In his current role, Tony presents briefings on storage topics covering the entire System Storage product line, Tivoli storage software products, and topics related to Cloud Computing. He interacts with clients, speaks at conferences and events, and leads client workshops to help clients with strategic planning for IBM’s integrated set of storage management software, hardware, and virtualization products.
Tony writes the “Inside System Storage” blog, which is read by hundreds of clients, IBM sales reps and IBM Business Partners every week. This blog was rated one of the top 10 blogs for the IT storage industry by “Networking World” magazine, and the #1 most-read IBM blog on IBM’s developerWorks. The blog has been published in a series of books, Inside System Storage: Volume I through IV.
Over the past years, Tony has worked in development, marketing and customer care positions for various storage hardware and software products. Tony has a Bachelor of Science degree in Software Engineering, and a Master of Science degree in Electrical Engineering, both from the University of Arizona. Tony holds 19 IBM patents for inventions on storage hardware and software products.
9000 S. Rita Road, Bldg 9070, Mail 9070, Tucson, AZ 85744
+1 520-799-4309 (Office)
Additional Resources
Email: [email protected]
Twitter: http://twitter.com/az99Øtony
Blog: http://ibm.co/brAeZØ
Books: http://www.lulu.com/spotlight/99Ø_tony
IBM Expert Network: http://www.slideshare.net/az99Øtony
Trademarks and disclaimers
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries.
Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind.
The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.
Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography.
Photographs shown may be engineering prototypes. Changes may be incorporated in production models.
© IBM Corporation 2012. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml.
ZSP03490-USEN-00