Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter

Dec 23, 2015


Spencer Fleming
Transcript
Page 1: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.

Intel® Distribution for Apache Hadoop*

Ram Lakshminarayan Asia Pac – BDM Datacenter

Page 2: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.

Other brands and names are the property of their respective owners.

From the dawn of civilization until 2003, we humans created 5 exabytes of information. Now we create that same amount of information in two days! In 2012, the digital universe of data will expand to 2.72 zettabytes (ZB). It is then predicted to double every two years.
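The doubling claim can be turned into a rough projection. The sketch below assumes smooth exponential growth starting from the 2.72 ZB figure quoted for 2012; the function name and the continuous-growth model are illustrative assumptions, not something the slides specify.

```python
def projected_zettabytes(year, base_year=2012, base_zb=2.72, doubling_years=2.0):
    """Project the size of the digital universe in zettabytes.

    Assumes the volume doubles every `doubling_years` years, starting
    from `base_zb` in `base_year` (figures taken from the slide).
    """
    return base_zb * 2 ** ((year - base_year) / doubling_years)


for year in (2012, 2014, 2016, 2020):
    print(year, round(projected_zettabytes(year), 2), "ZB")
```

Under this assumption, four doublings between 2012 and 2020 multiply the 2012 figure by 16.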

Page 3: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


What is Big Data?


Datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze*

Volume, variety, value and velocity

*”Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute

[Chart: data volume over time, structured (relational) data vs. unstructured (multi-structured) data]

Intelligent Transportation System (Shanghai)

Volume: massive scale & growth

Variety: many different forms

Value: predictive analytics

Velocity: near-realtime processing

• Logs/records: 9TB/day

• Image: 900TB/day

• Video: 3PB/day

• Near realtime image/video processing needed

• Near realtime queries required

• Deep, complex analysis for traffic prediction, criminal detection, …
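To put the daily volumes above in perspective, they can be converted into a sustained ingest rate. This is a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1000 TB) and data arriving evenly over 24 hours, neither of which the slide states, and the names are illustrative.

```python
# Daily volumes quoted on the slide, in terabytes (assuming 1 PB = 1000 TB).
DAILY_TB = {"logs_records": 9, "images": 900, "video": 3000}

SECONDS_PER_DAY = 24 * 60 * 60


def total_tb_per_day(volumes):
    """Sum the per-source daily volumes."""
    return sum(volumes.values())


def sustained_gb_per_second(tb_per_day):
    """Average rate in GB/s if the data arrived evenly over one day."""
    return tb_per_day * 1000 / SECONDS_PER_DAY


total = total_tb_per_day(DAILY_TB)
print(total, "TB/day")
print(round(sustained_gb_per_second(total), 1), "GB/s sustained")
```

Video dominates the total, so any real sizing exercise would be driven mostly by the video pipeline.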

Page 4: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Big Data usage across industries

National, Public and Cyber Security

Education, Government, Healthcare

Retail, Manufacturing, Telecommunication, Financial Services

Page 5: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Big Data opportunity, a vertical industry view

Source: Gartner

Page 6: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Hadoop Introduction

Source: http://blog.spec-india.com

Source: http://www.bodhtree.com

Hadoop is:
• A flexible, extensible open source framework

Hadoop includes:
• Storage (HDFS)
• NoSQL database (HBase)
• Distributed compute (MapReduce)
• Plus more utilities
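The MapReduce model listed above can be illustrated with a small word count. This is a single-process sketch in plain Python, not the Hadoop API: the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel across nodes with the input stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data on Hadoop", "Hadoop stores big data"]))
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be spread over many machines, which is what makes the model scale out.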

Page 7: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Intel’s Foundational Technologies Offer Advanced Solutions for Big Data Analytics

Responsive, Energy Efficient, High Availability, Secure, Choice

Big Data Building Blocks

Compute
• Intel® Xeon® Product Family E3-E5-E7
• Intel® Atom™
• Intel® Xeon Phi™

Network
• Intel® Ethernet Controllers
• Intel® Ethernet Adapters
• Intel® Ethernet Switch Silicon
• Intel® True Scale Fabric

Storage
• Intelligent Storage¹
• Scale-out Storage¹
• Scale-up Storage¹
• Intel® SSD 710 series, DC S3700 (SATA)
• Intel® SSD 910 series (PCIe)

Software & Technologies
• Intel® Distribution for Apache Hadoop
• Intel® Data Center Manager
• Intel® Node Manager
• Intel® Expressway Service Gateway
• Intel® Cache Acceleration Software
• Intel’s Lustre
• Intel® VT and Intel® TXT
• Intel® AES-NI

¹ Xeon-based storage systems are available in a wide range of configuration options from the industry’s leading storage vendors

What is in it for us?

Page 8: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Intel’s Role in Big Data

Investing in Solution Research and Services for Big Data

• Accelerating big data analytics through faster and more effective CPU, storage, I/O, and network platforms.
• Driving innovation in big data applications by providing an optimized software stack and services.
• Fostering the growth of the big data ecosystem through broad collaboration with partners.

Page 9: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Intel® Distribution for Apache Hadoop

What did we launch…?

• Focus on near real-time analytics w/ HBase & Hive enhancements
• Access control, encryption, secure data movement
• Job throughput efficiency for HDFS
• Dynamic replication for HDFS & HBase
• Intel optimized total solution architecture: distro, storage, network, compute

Intel Supported Distribution Subscription

[Chart: 5X Performance for Real-time jobs; bars labeled Open Source and Optimized Intel IA/Distro, values 700 and 3500, axis from 0 to 3500]

HBase as the data store. Query all CDR in month
− Inserting 10000 records/second/server
− Read from disk: >400 query/second/server

Intel® Manager for Hadoop* Software: Deployment, Configuration, Monitoring, Alerting and Security

• HDFS*: Hadoop Distributed File System
• MapReduce: Distributed Processing Framework
• HBase*: Columnar Storage
• Zookeeper*: Coordination
• Flume: Log Collector
• Sqoop: Data Exchange
• Pig*: Scripting
• Hive*: SQL-Like Query
• Oozie*: Workflow
• Mahout*: Data Mining
• R-connector

Page 10: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Intel® Manager for Apache Hadoop

Compatible with Intel or Other Popular Distributions

• Quick cluster/node deployment
• Tab navigate between components
• Guided wizards, tasks, workflows
• Single pane config for MapReduce fair or capacity scheduling
• Tuning controls for HBase data

Page 11: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Intel IA Architecture

Performance

Management

Cloud Enablement

Providing cross-stack optimizations using Hadoop as lead vehicle and open source as adoption driver

Driving The Key Pillars for Big Data

Flash Storage

Caching & Non-volatile Memory Throughput

Distributed Tables Across Data Centers

Snapshots

File based encryption MapReduce Jobs

Access Control List at cell level

SSE Instruction Sets

InfiniBand

AES-NI Encryption

HDFS Cross Data Center Replication

Security

Archival for cold data on HDFS

OS Kernel caching

Hot file replication

API AuthN Data Movement

NETWORK

STORAGE

COMPUTE

Ensuring Scale-out architectures work best on Intel platforms

Page 12: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Intel Platform Benefits for Big Data

TeraSort for 1TB Data: >4 Hours to 7 Minutes

Starting configuration: Intel® Xeon 5690, 7200 HDD, 1GbE Adapters

• Intel® Xeon® E5-2690 processor: ~50% improved
• Intel® SSD 520 Series: ~80% improved
• Intel® 10GbE Adapters: ~50% improved
• Deploy Intel Distribution for Apache Hadoop*: ~40% improved

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Source: Intel Internal testing

For more information go to: intel.com/performance

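One way to read the TeraSort figures is that each percentage is a reduction in the remaining run time, applied one upgrade after another. The slide does not state this, so the sketch below is an assumed interpretation; it uses 240 minutes as a stand-in for ">4 hours", and the names are illustrative.

```python
BASELINE_MINUTES = 4 * 60  # slide says ">4 hours"; 240 minutes is a lower bound

# (upgrade, assumed fractional run-time reduction), in the order on the slide
STEPS = [
    ("Xeon E5-2690 processor", 0.50),
    ("SSD 520 Series", 0.80),
    ("10GbE adapters", 0.50),
    ("Intel Distribution for Apache Hadoop", 0.40),
]


def apply_steps(baseline_minutes, steps):
    """Return (upgrade, remaining minutes) after each successive reduction."""
    remaining = baseline_minutes
    timeline = []
    for name, reduction in steps:
        remaining *= 1.0 - reduction
        timeline.append((name, remaining))
    return timeline


for name, minutes in apply_steps(BASELINE_MINUTES, STEPS):
    print(f"{name}: {minutes:.1f} min")
```

Under this reading the four reductions compound to about 7.2 minutes, which is consistent with the quoted drop from more than 4 hours to roughly 7 minutes.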

Page 13: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Government - Smart Traffic Intelligent Transport System

Hadoop for Predictive Analytics

Crime prevention, Info sharing,

Predictive Traffic Analytics

Machine Generated Data: Embedded HBase client in camera for real-time inserts of structured/unstructured data

30,000+ camera data collection points

2 billion HBase records

Petabytes of traffic data

Terabytes of images

1 week of Data mining

Results: Automated queries for traffic violation

Crime Prevention: ID fake licenses <1 minute

Traffic Routing

App Servers

Regional Data Collection

Distributed Processing Across District Nodes

Derived Analytics Services

Crime Prevention

Citizen Traffic Services

Page 14: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.


Telco - China Mobile Group Guangdong

Hadoop & Xeon optimized Big Data storage & analytics

Challenge: Deliver real time access to Call Data Records (CDR) for billing self service

Solution: Chose Hadoop + Xeon over RDBMS to remove data access bottlenecks, increase storage, and scale system

Benefits: Lower TCO, 30x performance increase, stable operation, analytics on subscriber usage for targeted promotions

Data Characteristics:

• 30TB billing data/month

• Real-time retrieval of 30 days CDRs

• 300k records/second, 800k insert speed/sec

• 15 analytics queries

• 133 server nodes
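The cluster-wide figures above can be broken down per node as a rough sanity check. The sketch assumes the load is spread evenly over all 133 nodes and that a month has 30 days; the slide states neither, and the names are illustrative.

```python
NODES = 133
INSERTS_PER_SECOND = 800_000   # "800k insert speed/sec"
RECORDS_PER_SECOND = 300_000   # "300k records/second"
BILLING_TB_PER_MONTH = 30
DAYS_PER_MONTH = 30            # assumption for the daily average


def per_node(cluster_total, nodes=NODES):
    """Average share of a cluster-wide rate handled by one node."""
    return cluster_total / nodes


print(f"inserts/s per node: {per_node(INSERTS_PER_SECOND):.0f}")
print(f"records/s per node: {per_node(RECORDS_PER_SECOND):.0f}")
print(f"billing data per day: {BILLING_TB_PER_MONTH / DAYS_PER_MONTH:.1f} TB")
```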

Analytics

Page 15: Intel® Distribution for Apache Hadoop* Ram Lakshminarayan Asia Pac – BDM Datacenter.