Coordinating the Many Tools of Big Data, Alan F. Gates (@alanfgates), Strata 2013

Strata feb2013

Jan 27, 2015

alanfgates

Slides from Strata talk "Coordinating the Many Tools of Big Data"
Transcript
Page 1: Strata feb2013

Coordinating the Many Tools of Big Data

Alan F. Gates

@alanfgates

Strata 2013

Page 2: Strata feb2013

Big Data = Terabytes, Petabytes, …

Page 2© Hortonworks 2013

Image Credit: Gizmodo

Page 3: Strata feb2013

But It Is Also Complex Algorithms

• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data:

w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell(f(x; w^{(t)}), y)
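A minimal sketch of this kind of update, assuming a linear model f(x; w) = w[0]*x + w[1] and a squared-error loss. The function name `sgd`, the toy data, and the step-size schedule are illustrative assumptions and are not taken from the Twitter talk.

```python
import random


def sgd(examples, gamma0=0.5, decay=0.001, epochs=200, seed=0):
    """Fit y ~ w[0]*x + w[1] by stochastic gradient descent on squared loss."""
    rng = random.Random(seed)
    examples = list(examples)  # copy so shuffling does not mutate the caller's list
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            gamma = gamma0 / (1.0 + decay * t)  # decaying step size gamma(t)
            err = (w[0] * x + w[1]) - y         # f(x; w) - y
            grad = [err * x, err]               # gradient of 0.5 * err**2 w.r.t. w
            # One update: w(t+1) = w(t) - gamma(t) * gradient
            w = [w[0] - gamma * grad[0], w[1] - gamma * grad[1]]
            t += 1
    return w


# Noise-free toy data drawn from y = 2x + 1 on x in [0, 1].
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(11)]
w = sgd(points)
```

Each pass touches one example at a time, which is why the technique fits a per-record UDF style of processing.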

Page 4: Strata feb2013

And New Tools

• Apache Hadoop brings with it a large selection of tools and paradigms
– Apache HBase, Apache Cassandra – Distributed, high-volume reads and writes of individual data records
– Apache Hive – SQL
– Apache Pig, Cascading – Data flow programming for ETL, data modeling, and exploration
– Apache Giraph – Graph processing
– MapReduce – Batch processing
– Storm, S4 – Stream processing
– Plus lots of commercial offerings

Page 5: Strata feb2013

Pre-Cloud: One Tool per Machine

• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS).

[Diagram: separate systems for Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, and Data Mart]

Page 6: Strata feb2013

Cloud: Many Tools One Platform

• Users no longer want to be concerned with which platform their data is on; they just want to apply the tool to it
• SQL is no longer the only or primary data access tool

[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, and OLTP together on one platform]

Page 7: Strata feb2013

Upside - Pick the Right Tool for the Job

Page 8: Strata feb2013

Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user defined function interfaces

Page 9: Strata feb2013

Downside – Wasted Developer Time

• Wastes developer time since each tool supplies redundant functionality

[Diagram: two component stacks labeled Pig and Hive, each with a Parser, Optimizer, Physical Planner, and Executor; a Metadata component is also shown]

Page 10: Strata feb2013

Downside – Wasted Developer Time

• Wastes developer time since each tool supplies redundant functionality

[Diagram: the Pig and Hive stacks (Parser, Optimizer, Physical Planner, Executor, plus Metadata), now marked with an Overlap label]

Page 11: Strata feb2013

Conclusion: We Need Services

• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense

Page 12: Strata feb2013

Hadoop = Distributed Data Operating System

Service                      Hadoop Component
Table management             Hive
Access to metadata           HCatalog
User authentication          Knox
Resource management          YARN
Notification                 HCatalog
REST/Connectors              WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing   Tez

Legend: Exists; Pieces exist in this component; New Project

Page 14: Strata feb2013

HCatalog – Table Management

• Opens up Hive’s tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
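The idea of a table paradigm that hides storage details can be illustrated with a toy sketch. This is not HCatalog's API: the `Table` class, the `catalog` dictionary, and the two "tools" below are hypothetical stand-ins, assuming only that a catalog records a schema and a storage format for each table.

```python
import csv
import io
import json


class Table:
    """Hypothetical catalog entry: a name, a schema, and a storage format."""

    def __init__(self, name, schema, fmt, raw):
        self.name = name
        self.schema = schema  # list of (column name, Python type)
        self.fmt = fmt        # "csv" or "jsonl" in this toy
        self.raw = raw        # stand-in for the bytes stored on disk

    def rows(self):
        """Yield rows as dicts, regardless of how the data is stored."""
        if self.fmt == "csv":
            for rec in csv.reader(io.StringIO(self.raw)):
                yield {col: typ(val) for (col, typ), val in zip(self.schema, rec)}
        elif self.fmt == "jsonl":
            for line in self.raw.splitlines():
                obj = json.loads(line)
                yield {col: typ(obj[col]) for col, typ in self.schema}
        else:
            raise ValueError("unknown storage format: " + self.fmt)


SCHEMA = [("state", str), ("price", float)]
catalog = {
    "sales_csv": Table("sales_csv", SCHEMA, "csv", "CA,4.0\nNY,6.5\n"),
    "sales_json": Table(
        "sales_json", SCHEMA, "jsonl",
        '{"state": "CA", "price": 4.0}\n{"state": "NY", "price": 6.5}',
    ),
}


def total_price(table_name):
    """One 'tool': sums a column without knowing the storage format."""
    return sum(row["price"] for row in catalog[table_name].rows())


def states(table_name):
    """Another 'tool': lists distinct states through the same code path."""
    return sorted({row["state"] for row in catalog[table_name].rows()})
```

Both tools see the same rows whichever format backs the table, which is the shared data model and shared code path the bullets describe.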

Page 15: Strata feb2013

HCatalog – Table Management

[Diagram: Hive on top of the Metastore]

Page 16: Strata feb2013

HCatalog – Table Management

[Diagram: Hive, Pig (via HCatLoader), and MapReduce (via HCatInputFormat) all access the Metastore]

Page 17: Strata feb2013

HCatalog – Table Management

[Diagram: Hive, Pig (via HCatLoader), and MapReduce (via HCatInputFormat) share the Metastore; external systems reach it over REST through WebHCat]
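As a sketch of what REST access from an external system might look like, the snippet below only builds a request URL and does not contact a server. The host, database, table, and user are hypothetical, and the `/templeton/v1/ddl/...` path and port 50111 reflect the WebHCat (Templeton) conventions as I understand them, so check them against your deployment.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build a URL asking WebHCat to describe one table (assumed path layout)."""
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database), quote(table)
    )
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host, table, and user names.
url = webhcat_table_url("webhcat.example.com", "default", "sales", "alice")
```

An external system would issue an HTTP GET against such a URL and parse the JSON description of the table it gets back.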

Page 18: Strata feb2013

Tez – Moving Beyond MapReduce

• Low level data-processing execution engine
• Use it for the base of MapReduce, Hive, Pig, Cascading etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN

Page 19: Strata feb2013

Pig/Hive-MR versus Pig/Hive-Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: Pig/Hive on MapReduce runs the query as three jobs (Job 1, Job 2, Job 3) separated by I/O synchronization barriers]

Page 20: Strata feb2013

Pig/Hive-MR versus Pig/Hive-Tez

[Diagram: for the same query, Pig/Hive on MapReduce runs three jobs (Job 1, Job 2, Job 3) separated by I/O synchronization barriers, while Pig/Hive on Tez runs a single job]
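The contrast between three jobs and a single job can be imitated in a few lines. This is a toy, not Tez or MapReduce: the tables `a`, `b`, and `c` are made-up data shaped like the query on the slide, `list(...)` stands in for writing intermediate output between jobs, and chained generators stand in for a pipelined plan.

```python
from collections import defaultdict

# Hypothetical tiny tables matching the columns used in the query.
a = [
    {"id": 1, "itemId": 10, "state": "CA"},
    {"id": 2, "itemId": 11, "state": "CA"},
    {"id": 3, "itemId": 10, "state": "NY"},
]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]


def join(left, right, key):
    """Hash join that yields merged rows lazily, one at a time."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index.get(l[key], ()):
            yield {**l, **r}


def aggregate(rows):
    """GROUP BY state, returning (count, average price) per state."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc[row["state"]][0] += 1
        acc[row["state"]][1] += row["price"]
    return {state: (n, total / n) for state, (n, total) in acc.items()}


def run_as_separate_jobs():
    """Each step fully materializes its output before the next step starts."""
    materialized = []
    ab = list(join(a, b, "id"))         # "Job 1" output written out
    materialized.append(len(ab))
    abc = list(join(ab, c, "itemId"))   # "Job 2" output written out
    materialized.append(len(abc))
    return aggregate(abc), materialized  # "Job 3"


def run_as_single_pipeline():
    """Rows stream through both joins into the aggregation."""
    return aggregate(join(join(a, b, "id"), c, "itemId"))


staged_result, materialized = run_as_separate_jobs()
pipelined_result = run_as_single_pipeline()
```

Both paths compute the same answer; the difference is that the staged version holds every intermediate row set in full before moving on, which is the cost the barriers in the diagram represent.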

Page 21: Strata feb2013

FastQuery: Beyond Batch with YARN

Tez Generalizes Map-Reduce

Simplified execution plans process data more efficiently

Always-On Tez Service

Low latency processing for all Hadoop data processing

Page 22: Strata feb2013

Knox – Single Sign On

Page 23: Strata feb2013

Today’s Access Options

• Direct Access
– Access services via REST (WebHDFS, WebHCat)
– Need knowledge of and access to whole cluster
– Security handled by each component in the cluster
– Kerberos details exposed to users
• Gateway / Portal Nodes
– Dedicated nodes behind firewall
– Users SSH to node to access Hadoop services

[Diagram: direct access, where the user calls the Hadoop cluster over REST; gateway access, where the user connects over SSH to a gateway node in front of the Hadoop cluster]

Page 24: Strata feb2013

Knox Design Goals

• Operators can firewall cluster without end user access to “gateway node”

• Users see one cluster end-point that aggregates capabilities for data access, metadata and job control

• Provide perimeter security to make Hadoop security setup easier

• Enable integration with enterprise and cloud identity management environments
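To make the "one cluster end-point" goal concrete, the sketch below builds the URL a client would use for the same WebHDFS listing with and without a gateway. Nothing here contacts a server. The hostnames and the cluster name `default` are hypothetical, and the `/gateway/<cluster>/webhdfs/v1` layout and port 8443 are assumptions based on later Knox releases rather than anything stated in the talk.

```python
from urllib.parse import quote, urlencode


def direct_webhdfs_url(namenode_host, path, op, port=50070):
    """Direct access: the client must know an internal NameNode address."""
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        namenode_host, port, quote(path), urlencode({"op": op})
    )


def gateway_webhdfs_url(gateway_host, cluster, path, op, port=8443):
    """Gateway access: one external endpoint fronts the cluster (assumed layout)."""
    return "https://{}:{}/gateway/{}/webhdfs/v1{}?{}".format(
        gateway_host, port, quote(cluster), quote(path), urlencode({"op": op})
    )


# Hypothetical hosts and path.
direct = direct_webhdfs_url("nn1.internal", "/data/sales", "LISTSTATUS")
via_gateway = gateway_webhdfs_url("knox.example.com", "default", "/data/sales", "LISTSTATUS")
```

With the gateway form, only the gateway host has to be reachable from outside the firewall, and internal hostnames stay hidden from the client.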

Page 25: Strata feb2013

Perimeter Verification & Authentication

[Diagram: Client → {REST} → Knox Gateway → Hadoop cluster (WebHDFS, WebHCat, Hive, HCat, NN, JT, DNs); the gateway's Authentication and Verification steps connect to a User Store and an ID Provider (each KDC, AD, LDAP)]

• Verification
– Verify identity token
– SAML, propagation of identity
• Authentication
– Establish identity at Gateway to authenticate with LDAP + AD
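The split between authentication and verification can be sketched as a toy gateway. This is not Knox code: an in-memory dictionary stands in for LDAP/AD, an HMAC-signed string stands in for a SAML token, and every name below (`SECRET`, `USER_STORE`, `authenticate`, `verify`, `gateway_request`) is hypothetical.

```python
import hashlib
import hmac

SECRET = b"gateway-signing-key"   # hypothetical; never hard-code a real key
USER_STORE = {"alice": "s3cret"}  # toy stand-in for LDAP/AD, plaintext for brevity


def _sign(user):
    """Return a hex signature binding the user name to the gateway's key."""
    return hmac.new(SECRET, user.encode("utf-8"), hashlib.sha256).hexdigest()


def authenticate(user, password):
    """Authentication: establish identity at the gateway, return a token."""
    stored = USER_STORE.get(user)
    if stored is None or not hmac.compare_digest(stored, password):
        return None
    return "{}:{}".format(user, _sign(user))


def verify(token):
    """Verification: check the identity token, return the user or None."""
    user, _, signature = token.partition(":")
    if hmac.compare_digest(signature, _sign(user)):
        return user
    return None


def gateway_request(token, service):
    """Forward a request to a cluster service only if the token verifies."""
    user = verify(token)
    if user is None:
        return (401, None)
    return (200, "forwarded to {} as {}".format(service, user))


token = authenticate("alice", "s3cret")
ok = gateway_request(token, "WebHDFS")
rejected = gateway_request("alice:forged", "WebHDFS")
```

The cluster services behind the gateway never see the password; they only receive requests the gateway has already tied to a verified identity.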

Page 26: Strata feb2013

© Hortonworks 2012

Thank You
