TCloud Computing, Inc. Hadoop Product Family and Ecosystem

Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Transcript
Page 1: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

TCloud Computing, Inc.

Hadoop Product Family and Ecosystem

Page 2: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Agenda

• What is Big Data?

• Big Data Opportunities

• Hadoop

– Introduction to Hadoop

– Hadoop 2.0

– What’s next for Hadoop?

• Hadoop ecosystem

• Conclusion

Page 3: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Big Data?

A set of files A database A single file

Page 5: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Big Data Expands on 4 Fronts

• Volume: MB → GB → TB → PB

• Velocity: batch → periodic → near real-time → real-time

• Variety

• Veracity

http://whatis.techtarget.com/definition/3Vs

Page 6: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Big Data Opportunities

http://www.sap.com/corporate-en/news.epx?PressID=21316

Page 7: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Big Data Revenue by Market Segment 2012


http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

Page 8: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Big Data Market Forecast 2012-2017


http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017

Page 9: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hadoop Solutions

The most common problems Hadoop can solve

Page 10: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Threat Analysis/Trade Surveillance

• Challenge:

– Detecting threats in the form of fraudulent activity or attacks

• Large data volumes involved

• Like looking for a needle in a haystack

• Solution with Hadoop:

– Parallel processing over huge datasets

– Pattern recognition to identify anomalies, i.e., threats

• Typical Industry:

– Security, Financial Services

Page 11: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Recommendation Engine

• Challenge:

– Using user data to predict which products to recommend

• Solution with Hadoop:

– Batch processing framework

• Allows execution in parallel over large datasets

– Collaborative filtering

• Collecting ‘taste’ information from many users

• Utilizing information to predict what similar users like

• Typical Industry

– ISP, Advertising

Page 12: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Walmart Case

(Diagram: Friday + beer + diapers → revenue?)

Page 13: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3


http://tech.naver.jp/blog/?p=2412

Page 14: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hadoop!

Page 15: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

• Apache Hadoop project

– Inspired by Google's MapReduce and Google File System papers

• Open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware

• Open-source software + commodity hardware

– Reduces IT costs

Page 16: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hadoop Concepts

• Distribute the data as it is initially stored in the system

• Moving Computation is Cheaper than Moving Data

• Individual nodes can work on data local to those nodes

• Users can focus on developing applications.

Page 17: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hadoop 2.0

• Hadoop 2.2.0 is expected to GA in Fall 2013

• HDFS Federation

• HDFS High Availability (HA)

• Hadoop YARN (MapReduce 2.0)

Page 18: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HDFS Federation - Limitation of Hadoop 1.0

• Scalability

– Storage scales horizontally - namespace doesn’t

• Performance

– File system operations throughput limited by a single node

• Poor isolation

– All the tenants share a single namespace

Page 19: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HDFS Federation

• Multiple independent NameNodes and Namespace Volumes in a cluster

– Namespace Volume = Namespace + Block Pool

• Block Storage as generic storage service

– Set of blocks for a Namespace Volume is called a Block Pool

– DNs store blocks for all the Namespace Volumes – no partitioning

Page 20: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HDFS Federation

(Diagram: a single namespace in Hadoop 1.x vs. multiple federated namespaces in Hadoop 2.0, e.g. /home/, /app/Hive, /app/HBase)

http://hortonworks.com/blog/an-introduction-to-hdfs-federation/

Page 21: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HDFS High Availability (HA)

• The Secondary NameNode is not a standby NameNode

• http://www.youtube.com/watch?v=hEqQMLSXQlY

Page 22: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HDFS High Availability (HA)

https://issues.apache.org/jira/browse/HDFS-1623

Page 23: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why do we need YARN

• Scalability

– Maximum Cluster size – 4,000 nodes

– Maximum concurrent tasks – 40,000

• Single point of failure

– Failure kills all queued and running jobs

• Lacks support for alternate paradigms

– Iterative applications implemented using MapReduce are 10x slower

– Example: K-Means, PageRank

Page 24: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hadoop YARN

http://hortonworks.com/hadoop/yarn/

Page 25: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Role of YARN

• Resource Manager

– Per-cluster

– Global resource scheduler

– Hierarchical queues

• Node Manager

– Per-machine agent

– Manages the life-cycle of container

– Container resource monitoring

• Application Master

– Per-application

– Manages application scheduling and task execution

– E.g. MapReduce Application Master

(Diagram: the JobTracker's responsibilities are split into the Resource Manager and per-application Application Masters)

Page 26: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hadoop YARN architectural

http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/

• Container

– Basic unit of allocation

– e.g., Container A = 2 GB memory, 1 CPU

– Fine-grained resource allocation

– Replaces the fixed map/reduce slots (a request sketch follows below)
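As a rough sketch of how an Application Master requests such a container, the snippet below uses the YARN AMRMClient Java API. The 2 GB / 1 vCPU figures mirror the example above; the class name is illustrative, and registration with the Resource Manager plus the allocate() heartbeat loop are omitted for brevity.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
  public static void main(String[] args) throws Exception {
    // Normally this runs inside an Application Master that has already
    // registered with the Resource Manager via registerApplicationMaster()
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new YarnConfiguration());
    amClient.start();

    // "Container A": 2 GB of memory and 1 virtual core, a fine-grained
    // request instead of a fixed map or reduce slot
    Resource capability = Resource.newInstance(2048, 1);
    Priority priority = Priority.newInstance(0);
    amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

    // Allocated containers come back on subsequent allocate() heartbeats;
    // launching them via the Node Manager is omitted here
    amClient.stop();
  }
}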

Page 27: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What’s next for Hadoop?

• Real-time

– Apache Tez

• Part of Stinger

– Spark

• SQL in Hadoop

– Stinger

• Its immediate aim, a 100x performance increase for Hive, is more ambitious than any other effort

• Based on industry standard SQL, the Stinger Initiative improves HiveQL to deliver SQL compatibility.

– Shark

Page 28: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What’s next for Hadoop?

• Security: Data encryption

– HADOOP-9331: Hadoop crypto codec framework and crypto codec implementations

• HADOOP-9332: Crypto codec implementations for AES

• HADOOP-9333: Hadoop crypto codec framework based on compression codec

• MAPREDUCE-5025: Key Distribution and Management for supporting crypto codec in Map Reduce

• 2013/09/28 Hadoop in Taiwan 2013

– Hadoop Security: Now and future

– Session B, 16:00~16:40

Page 29: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

The Hadoop Ecosystems

Page 30: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Growing Hadoop Ecosystem

• The term ‘Hadoop’ is taken to be the combination of HDFS and MapReduce

• There are numerous other projects surrounding Hadoop

– Typically referred to as the ‘Hadoop Ecosystem’

• Zookeeper

• Hive and Pig

• HBase

• Flume

• Other Ecosystem Projects

– Sqoop

– Oozie

– Mahout

Page 31: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

The Ecosystem is the System

• Hadoop has become the kernel of the distributed operating system for Big Data

• No one uses the kernel alone

• A collection of projects at Apache

Page 32: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Relation Map

(Ecosystem stack diagram)

• MapReduce Runtime (Distributed Programming Framework)

• Hadoop Distributed File System (HDFS)

• HBase (Column NoSQL DB)

• Sqoop/Flume (Data Integration)

• Oozie (Job Workflow & Scheduling)

• Pig/Hive (Analytical Language)

• Mahout (Data Mining)

• YARN

• ZooKeeper (Coordination)

• Tez (near real-time processing)

• Spark (in-memory) / Shark

Page 33: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

ZooKeeper – Coordination Framework

(Ecosystem relation map repeated from Page 32)

Page 34: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is ZooKeeper

• A centralized service for

– Maintaining configuration information

– Providing distributed synchronization

• A set of tools to build distributed applications that can safely handle partial failures

• ZooKeeper was designed to store coordination data

– Status information

– Configuration

– Location information

Page 35: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why use ZooKeeper?

• Manage configuration across nodes

• Implement reliable messaging

• Implement redundant services

• Synchronize process execution
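Putting the configuration use case above into code, here is a minimal sketch of the ZooKeeper Java client; the connect string, znode path, and payload are illustrative, not from the slides.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble (illustrative host names)
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000,
        new Watcher() {
          public void process(WatchedEvent event) { /* connection/session events */ }
        });

    // Publish a piece of configuration as a persistent znode
    if (zk.exists("/demo-config", false) == null) {
      zk.create("/demo-config", "max.workers=8".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any process in the cluster can read (and watch) the same value
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}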

Page 36: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

ZooKeeper Architecture

– All servers store a copy of the data (in memory)

– A leader is elected at startup

– 2 roles – leader and follower

• Followers serve clients; all updates go through the leader

• Update responses are sent when a majority of servers have persisted the change

– HA support

Page 37: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HBase – Column NoSQL DB

(Ecosystem relation map repeated from Page 32)

Page 38: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Structured Data vs. Raw Data

Page 39: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HBase – Inspired by

• Apache open source project

• Inspired by Google BigTable

• Non-relational, distributed database written in Java

• Coordinated by ZooKeeper

Page 40: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Row & Column Oriented

Page 41: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

HBase – Data Model

• Cells are “versioned”

• Table rows are sorted by row key

• Region – a row range [start-key:end-key]

Page 42: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

When to use HBase

• Need random, low latency access to the data

• Application has a flexible schema where each row is slightly different

– Add columns on the fly

• Most columns are NULL in each row
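To make this concrete, here is a minimal sketch of the HBase Java client API of that era (HTable, Put, Get); the table name, column family, and qualifiers are illustrative, and the table is assumed to already exist with an 'info' family.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml / ZooKeeper quorum
    HTable table = new HTable(conf, "users");

    // Rows may carry different columns; new qualifiers are added on the fly
    Put put = new Put(Bytes.toBytes("user#1001"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2013-09-28"));
    table.put(put);

    // Random, low-latency read by row key
    Result result = table.get(new Get(Bytes.toBytes("user#1001")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
    table.close();
  }
}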

Page 43: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Flume / Sqoop – Data Integration Framework

(Ecosystem relation map repeated from Page 32)

Page 44: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What's the problem with data collection?

• Data collection is currently a priori and ad hoc

• A priori – decide what you want to collect ahead of time

• Ad hoc – each kind of data source goes through its own collection path

Page 45: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Flume (and how can it help?)

• A distributed data collection service

• Efficiently collects, aggregates, and moves large amounts of data

• Fault tolerant, with many failover and recovery mechanisms

• A one-stop solution for collecting data of all formats
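As a minimal sketch of feeding events into such a collection service, the snippet below uses Flume's RPC client SDK; it assumes a Flume agent with an Avro source is already listening at the (illustrative) host and port.

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a Flume agent's Avro source (illustrative host/port)
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-1", 41414);
    try {
      // Each append() hands one event to the agent, which routes it through
      // its channel and sink (for example, on to HDFS)
      Event event = EventBuilder.withBody("user=42 action=click", Charset.forName("UTF-8"));
      client.append(event);
    } finally {
      client.close();
    }
  }
}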

Page 46: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

An example flow

Page 47: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Sqoop

• Easy, parallel database import/export

• What do you want to do?

– Insert data from RDBMS to HDFS

– Export data from HDFS back into RDBMS

Page 48: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Sqoop

• A suite of tools that connect Hadoop and database systems

• Import tables from databases into HDFS for deep analysis

• Export MapReduce results back to a database for presentation to end-users

• Provides the ability to import from SQL databases straight into your Hive data warehouse
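As a rough sketch of the import described above, Sqoop 1 can also be driven programmatically through its Sqoop.runTool entry point (equivalent to the sqoop command line); the JDBC URL, credentials, table, and target directory are purely illustrative.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
  public static void main(String[] args) {
    // Equivalent to: sqoop import --connect ... --table orders --target-dir ...
    String[] importArgs = new String[] {
        "import",
        "--connect", "jdbc:mysql://db-host/sales",   // source RDBMS (illustrative)
        "--username", "etl",
        "--password", "secret",
        "--table", "orders",                         // table to pull into HDFS
        "--target-dir", "/user/etl/orders",          // HDFS destination
        "--num-mappers", "4"                         // parallel import tasks
    };
    System.exit(Sqoop.runTool(importArgs));
  }
}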

Page 49: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

How Sqoop helps

• The Problem

– Structured data in traditional databases cannot be easily combined with complex data stored in HDFS

• Sqoop (SQL-to-Hadoop)

– Easy import of data from many databases to HDFS

– Generate code for use in MapReduce applications

Page 50: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why Sqoop

• JDBC-based implementation

– Works with many popular database vendors

• Auto-generation of tedious user-side code

– Write MapReduce applications to work with your data, faster

• Integration with Hive

– Allows you to stay in a SQL-based environment

Page 51: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Pig / Hive – Analytical Language

(Ecosystem relation map repeated from Page 32)

Page 52: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why Hive and Pig?

• Although MapReduce is very powerful, it can also be complex to master

• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code

• Many organizations have programmers who are skilled at writing code in scripting languages

• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce

– Hive was initially developed at Facebook, Pig at Yahoo!

Page 53: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hive – Developed by Facebook

• What is Hive?

– An SQL-like interface to Hadoop

• Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop

– MapReduce for execution

– HDFS for storage

• Hive Query Language

– Basic-SQL : Select, From, Join, Group-By

– Equi-Join, Multi-Table Insert, Multi-Group-By

– Batch query

SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid

Page 54: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hive/MR V.S. Hive/Tez

http://www.slideshare.net/adammuise/2013-jul-23thughivetuningdeepdive

Page 55: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Pig

• A high-level scripting language (Pig Latin)

• Processes data one step at a time

• A simple way to write MapReduce programs

• Easy to understand

• Easy to debug

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

– Initiated by Yahoo!

Page 56: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Hive vs. Pig

                      Hive                               Pig
Language              HiveQL (SQL-like)                  Pig Latin, a scripting language
Schema                Table definitions stored in a      A schema is optionally defined
                      metastore                          at runtime
Programmatic Access   JDBC, ODBC                         PigServer

Page 57: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

WordCount Example

• Input:

Hello World Bye World
Hello Hadoop Goodbye Hadoop

• For the given sample input the map emits:

<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

• The reduce just sums up the values:

<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

Page 58: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

WordCount Example In MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits <word, 1> for every token in each input line
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
    job.waitForCompletion(true);
  }
}

Page 59: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;

Page 60: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input'

OVERWRITE INTO TABLE wordcount;

SELECT token, count(*) FROM wordcount GROUP BY token;

Page 61: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Spark / Shark - Analytical Language

(Ecosystem relation map repeated from Page 32)

Page 62: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why Spark

• MapReduce is too slow

• Spark aims to make data analytics fast, both fast to run and fast to write

• When you need iterative algorithms

Page 63: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Spark

• In-memory distributed computing framework

• Created by UC Berkeley AMPLab in 2010

• Targets problems that Hadoop MR is bad at

– Iterative algorithms (machine learning)

– Interactive data mining

• More general purpose than Hadoop MR

• Active contributions from ~15 companies
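As a rough sketch of the iterative pattern Spark targets, the snippet below uses Spark's Java API (Java chosen to match the deck's other code); the input path and per-iteration predicate are placeholders. The point is that the dataset is loaded and cached in memory once, then re-scanned by every pass, instead of being re-read from HDFS by a fresh MapReduce job each iteration.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class IterativeSparkSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "iterative-sketch");

    // Load once, keep the RDD in memory across iterations
    JavaRDD<String> points = sc.textFile("hdfs:///data/points.txt").cache();

    for (int i = 0; i < 10; i++) {
      final int iteration = i;
      // Each pass re-uses the cached data; with plain MapReduce this would
      // be a separate job re-reading its input from HDFS
      long hits = points.filter(new Function<String, Boolean>() {
        public Boolean call(String line) {
          return line.length() % 10 == iteration;   // stand-in for real per-iteration logic
        }
      }).count();
      System.out.println("iteration " + i + ": " + hits);
    }
    sc.stop();
  }
}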

Page 64: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

BDAS, the Berkeley Data Analytics Stack

https://amplab.cs.berkeley.edu/software/

Page 65: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What's Different between Hadoop and Spark

(Diagram: Spark pipelines Map(), Join(), Cache(), and other transforms across data sources in memory, while Hadoop chains separate Map and Reduce stages through HDFS)

http://spark.incubator.apache.org

Page 66: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Shark

• A data analytic (warehouse) system that

– Port of Apache Hive to run on Spark

– Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

– Speedups of up to 40x over Hive

– Scales out and is fault-tolerant

– Supports low-latency, interactive queries through in-memory computing

Page 67: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Shark Architecture

(Architecture diagram: CLI and Thrift/JDBC clients connect to the Driver; the SQL Parser, Query Optimizer, Physical Plan Execution, and Cache Mgr. run on Spark, backed by the Hive Meta Store and HDFS/HBase)

Page 68: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Oozie – Job Workflow & Scheduling

(Ecosystem relation map repeated from Page 32)

Page 69: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Oozie?

• A Java web application

• Oozie is a workflow scheduler for Hadoop

• Crond for Hadoop

(Diagram: a workflow DAG of Job 1 through Job 5)

Page 70: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why Oozie

• Why use Oozie instead of just cascading jobs one after another?

• Major flexibility

– Start, Stop, Suspend, and re-run jobs

• Oozie allows you to restart from a failure

– You can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes

Page 71: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

How it is triggered

• Time

– Execute your workflow every 15 minutes

• Time and Data

– Materialize your workflow every hour, but only run them when the input data is ready.

(Diagram: time-only triggers fire at 00:15, 00:30, 00:45, 01:00; time-and-data triggers materialize at 01:00, 02:00, 03:00, 04:00 but ask "Input Data Exists?" before running on Hadoop)

Page 72: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Oozie use criteria

• Need to launch, control, and monitor jobs from your Java apps (see the client sketch below)

– Java Client API / Command Line Interface

• Need to control jobs from anywhere

– Web Service API

• Have jobs that you need to run every hour, day, or week

• Need to receive notification when a job is done

– Email when a job is complete
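A minimal sketch of the Java Client API mentioned above; the Oozie URL, application path, and job properties are illustrative and assume a workflow application has already been deployed to HDFS.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieClientSketch {
  public static void main(String[] args) throws Exception {
    // Point the client at the Oozie server (illustrative URL)
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Job configuration: where the deployed workflow lives, plus its parameters
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/apps/daily-wf");
    conf.setProperty("inputDir", "/user/etl/input/2013-09-28");
    conf.setProperty("outputDir", "/user/etl/output/2013-09-28");

    // Submit and start the workflow, then poll its status
    String jobId = oozie.run(conf);
    System.out.println("Submitted workflow " + jobId);
    while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10 * 1000);
    }
    System.out.println("Final state: " + oozie.getJobInfo(jobId).getStatus());
  }
}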

Page 73: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Mahout – Data Mining

(Ecosystem relation map repeated from Page 32)

Page 74: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

What is Mahout

• Machine-learning tool

• Distributed and scalable machine learning algorithms on the Hadoop platform

• Makes building intelligent applications easier and faster

Page 75: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Why Mahout

• Current state of ML libraries

– Lack Community

– Lack Documentation and Examples

– Lack Scalability

– Are Research oriented

Page 76: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Mahout – scale

• Scale to large datasets

– Hadoop MapReduce implementations that scale linearly with data

• Scalable to support your business case

– Mahout is distributed under a commercially friendly Apache Software license

• Scalable community

– Vibrant, responsive and diverse

Page 77: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Mahout – four use cases

• Mahout machine learning algorithms

– Recommendation mining: takes users' behavior and finds items a specified user might like

– Clustering: takes e.g. text documents and groups them based on related document topics

– Classification: learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the appropriate category

– Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together

Page 78: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Use case Example

• Predict what the user likes based on

– His/Her historical behavior

– The aggregate behavior of people similar to him/her
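This use case maps directly onto Mahout's (single-machine) Taste recommender API; the sketch below uses an illustrative ratings file and neighborhood size, and a distributed, MapReduce-based variant also exists.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Historical behavior: CSV lines of userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // "People similar to him/her": the 10 most similar users by Pearson correlation
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 items predicted for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}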

Page 79: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Conclusion

• Big Data Opportunities

– The market is still growing

• Hadoop 2.0

– Federation

– HA

– YARN

• What’s next for Hadoop

– Real-time query

– Data encryption

• What other projects are included in the Hadoop ecosystem

– Different projects for different purposes

– Choose the right tools for your needs

Page 80: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Recap – Hadoop Ecosystem

(Ecosystem relation map repeated from Page 32)

Page 81: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Questions?

Page 82: Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3

Thank you!