Intel® Distribution for Apache Hadoop*
Ram Lakshminarayan, Asia Pac – BDM Datacenter
Dec 23, 2015
Other brands and names are the property of their respective owners.
From the dawn of civilization until 2003, we humans created 5 exabytes of information.
Now we create that same amount of information in two days! The digital
universe of data was forecast to reach 2.72 zettabytes (ZB) in 2012, and it is
predicted to double every two years after that.
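The doubling claim above can be turned into a rough projection. The following is a minimal sketch, assuming the 2.72 ZB figure for 2012 as the starting point and a clean doubling every two years; the function name and parameters are illustrative and not from the source.

```python
def projected_zettabytes(year, base_year=2012, base_zb=2.72, doubling_years=2.0):
    """Project the size of the digital universe in zettabytes.

    Assumes exponential growth that doubles every `doubling_years` years,
    starting from `base_zb` in `base_year` (figures quoted on the slide).
    """
    return base_zb * 2 ** ((year - base_year) / doubling_years)


if __name__ == "__main__":
    for year in (2012, 2014, 2016, 2020):
        print(year, round(projected_zettabytes(year), 2), "ZB")
```

Under these assumptions the volume grows sixteenfold between 2012 and 2020, which is why storage and compute need to scale out rather than up.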
What is Big Data?
Datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze*
Volume, variety, value and velocity
*“Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute
[Figure: data volume over time, structured (relational) data vs. unstructured (multi-structured) data]
Intelligent Transportation System (Shanghai)
Volume: massive scale & growth
Variety: many different forms
Value: predictive analytics
Velocity: near real-time processing
• Logs/records: 9TB/day
• Image: 900TB/day
• Video: 3PB/day
• Near real-time image/video processing needed
• Near real-time queries required
• Deep, complex analysis for traffic prediction, criminal detection, …
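To put the Shanghai figures in perspective, the daily volumes can be summed. This is a small sketch, assuming decimal units (1 PB = 1000 TB); the dictionary and function names are illustrative.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB

# Daily ingest per source in terabytes, as quoted on the slide.
DAILY_TB = {
    "logs/records": 9,
    "image": 900,
    "video": 3 * TB_PER_PB,
}


def total_daily_tb(sources):
    """Return the combined daily ingest in terabytes."""
    return sum(sources.values())


def share(sources, name):
    """Return the fraction of the daily ingest contributed by one source."""
    return sources[name] / total_daily_tb(sources)


if __name__ == "__main__":
    total = total_daily_tb(DAILY_TB)
    print(f"Total: {total} TB/day ({total / TB_PER_PB:.3f} PB/day)")
    for name in DAILY_TB:
        print(f"{name}: {share(DAILY_TB, name):.1%}")
```

On these numbers video and images dominate the ingest, while the structured logs and records are a small fraction of the total.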
Big Data usage across industries
• National, Public and Cyber Security
• Education
• Government
• Healthcare
• Retail
• Manufacturing
• Telecommunication
• Financial Services
Big Data opportunity, a vertical industry view
Source: Gartner
Hadoop Introduction
Source: http://blog.spec-india.com
Source: http://www.bodhtree.com
Hadoop is:
• A flexible, extensible open source framework
Hadoop includes:
• Storage (HDFS)
• NoSQL database (HBase)
• Distributed compute (MapReduce)
• Plus more utilities
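The distributed compute model listed above can be illustrated with a word count. This is a minimal single-process sketch of the map, shuffle and reduce phases in plain Python; it does not use the Hadoop API, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the map and reduce functions run in parallel on the nodes that hold the data blocks, and the shuffle moves intermediate pairs across the network; the sketch only shows the data flow.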
Intel’s Foundational Technologies Offer Advanced Solutions for Big Data Analytics
Responsive, Energy Efficient, High Availability, Secure, Choice
Big Data Building Blocks
Compute
• Intel® Xeon® Product Family E3-E5-E7
• Intel® Atom™
• Intel® Xeon Phi™
Network
• Intel® Ethernet Controllers
• Intel® Ethernet Adapters
• Intel® Ethernet Switch Silicon
• Intel® True Scale Fabric
Storage
• Intelligent Storage (1)
• Scale-out Storage (1)
• Scale-up Storage (1)
• Intel® SSD 710 series, DC S3700 (SATA)
• Intel® SSD 910 series (PCIe)
Software & Technologies
• Intel® Distribution for Apache Hadoop
• Intel® Data Center Manager
• Intel® Node Manager
• Intel® Expressway Service Gateway
• Intel® Cache Acceleration Software
• Intel’s Lustre
• Intel® VT and Intel® TXT
• Intel® AES-NI
(1) Xeon-based storage systems are available in a wide range of configuration options from the industry’s leading storage vendors
What is in it for us?
Intel’s Role in Big Data
Investing in Solution Research and Services for Big Data
• Accelerating big data analytics through a faster and more effective CPU, storage, I/O and network platform.
• Driving innovation in big data applications by providing an optimized software stack and services.
• Fostering the growth of the big data ecosystem through broad collaboration with partners.
Intel® Distribution for Apache Hadoop
What did we launch…?
• Focus on near real-time analytics w/ HBase & Hive enhancements
• Access control, encryption, secure data movement
• Job throughput efficiency for HDFS
• Dynamic replication for HDFS & HBase
• Intel optimized total solution architecture: distro, storage, network, compute
Intel Supported Distribution Subscription
[Chart: Open Source vs. Optimized Intel IA/Distro, values 700 and 3500 on a 0 to 3500 axis; 5X Performance for Real-time jobs]
HBase as the data store. Query all CDR in month
− Inserting 10000 records/second/server
− Read from disk: >400 query/second/server
Intel® Manager for Hadoop* Software: Deployment, Configuration, Monitoring, Alerting and Security
• HDFS*: Hadoop Distributed File System
• MapReduce: Distributed Processing Framework
• HBase*: Columnar Storage
• ZooKeeper*: Coordination
• Flume: Log Collector
• Sqoop: Data Exchange
• Pig*: Scripting
• Hive*: SQL-Like Query
• Oozie*: Workflow
• Mahout*: Data Mining
• R-connector
Intel® Manager for Apache Hadoop
Compatible with Intel or Other Popular Distributions
• Quick cluster/node deployment
• Tab navigation between components
• Guided wizards, tasks, workflows
• Single pane config for MapReduce fair or capacity scheduling
• Tuning controls for HBase data
Driving The Key Pillars for Big Data
Providing cross-stack optimizations using Hadoop as lead vehicle and open source as adoption driver
Intel IA Architecture
Performance
Management
Cloud Enablement
Security
Flash Storage
Caching & Non-volatile Memory Throughput
Distributed Tables Across Data Centers
Snapshots
File based encryption MapReduce Jobs
Access Control List at cell level
SSE Instruction Sets
Infiniband
AES-NI Encryption
HDFS Cross Data Center Replication
Archival for cold data on HDFS
OS Kernel caching
Hot file replication
API AuthN Data Movement
NETWORK, STORAGE, COMPUTE
Ensuring Scale-out architectures work best on Intel platforms
Intel Platform Benefits for Big Data
TeraSort for 1TB Data: >4 Hours to ~7 Minutes
Baseline (>4 Hours): Intel® Xeon 5690, 7200 HDD, 1GbE Adapters
• Intel® Xeon® E5-2690 processor: ~50% improved
• Intel® SSD 520 Series: ~80% improved
• Intel® 10GbE Adapters: ~50% improved
• Deploy Intel Distribution for Apache Hadoop*: ~40% improved (~7 mins)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Source: Intel Internal testing
For more information go to: intel.com/performance
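One way to read the TeraSort numbers is as successive run-time reductions that compound. The sketch below assumes that each "improved" percentage is a reduction in remaining run time applied on top of the previous upgrade, and that the ">4 Hours" baseline is exactly 240 minutes; both are interpretations, not statements from the source.

```python
BASELINE_MINUTES = 240.0  # assumption: ">4 Hours" taken as exactly 4 hours

# (upgrade, assumed fractional reduction in remaining run time)
STEPS = [
    ("Intel Xeon E5-2690 processor", 0.50),
    ("Intel SSD 520 Series", 0.80),
    ("Intel 10GbE Adapters", 0.50),
    ("Intel Distribution for Apache Hadoop", 0.40),
]


def apply_improvements(baseline_minutes, steps):
    """Apply each reduction to the run time left after the previous step."""
    remaining = baseline_minutes
    timeline = []
    for name, reduction in steps:
        remaining *= 1.0 - reduction
        timeline.append((name, remaining))
    return timeline


if __name__ == "__main__":
    for name, minutes in apply_improvements(BASELINE_MINUTES, STEPS):
        print(f"{name}: {minutes:.1f} min")
```

Under these assumptions the final figure comes out at about 7.2 minutes, which is consistent with the "~7 mins" result quoted on the slide.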
Government - Smart Traffic Intelligent Transport System
Hadoop for Predictive Analytics
Crime prevention, Info sharing, Predictive Traffic Analytics
Machine Generated Data: Embedded HBase client in camera for real-time inserts of structured/unstructured data
• 30000+ camera data collection points
• 2 billion HBase records
• Petabytes of traffic data
• Terabytes of images
• 1 week of Data mining
Results: Automated queries for traffic violation
Crime Prevention: ID fake licenses <1 minute
[Diagram labels: Traffic Routing, App Servers, Regional Data Collection, Distributed Processing Across District Nodes, Derived Analytics Services, Crime Prevention, Citizen Traffic Services]
Telco - China Mobile Group Guangdong
Hadoop & Xeon optimized Big Data storage & analytics
Challenge: Deliver real-time access to Call Data Records (CDR) for billing self-service
Solution: Chose Hadoop + Xeon over RDBMS to remove data access bottlenecks, increase storage, and scale the system
Benefits: Lower TCO, 30x performance increase, stable operation, analytics on subscriber usage for targeted promotions
Data Characteristics:
• 30TB billing data/month
• Real-time retrieval of 30 days CDRs
• 300k records/second, 800k insert speed/sec
• 15 analytics queries
• 133 server nodes
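The cluster-wide figures above can be broken down per node for a rough sizing check. This is a back-of-the-envelope sketch, assuming the load is spread evenly over all 133 nodes and a 30-day month; neither assumption is stated in the source.

```python
NODES = 133
RECORDS_PER_SECOND = 300_000   # "300k records/second"
INSERTS_PER_SECOND = 800_000   # "800k insert speed/sec"
BILLING_TB_PER_MONTH = 30
DAYS_PER_MONTH = 30            # assumption: 30-day month


def per_node(total_per_second, nodes):
    """Average per-node rate, assuming an even spread across nodes."""
    return total_per_second / nodes


if __name__ == "__main__":
    print(f"Records/s per node: {per_node(RECORDS_PER_SECOND, NODES):.0f}")
    print(f"Inserts/s per node: {per_node(INSERTS_PER_SECOND, NODES):.0f}")
    print(f"Billing data: {BILLING_TB_PER_MONTH / DAYS_PER_MONTH:.1f} TB/day")
```

Real clusters rarely balance perfectly, so these averages are a lower bound on what the busiest node has to sustain.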