Distributed Decision Tree Learning for Mining Big Data Streams

May 08, 2015

Arinto Murdopo

Master Thesis Presentation
Page 1: Distributed Decision Tree Learning for Mining Big Data Streams

Distributed Decision Tree Learning for Mining Big Data Streams

Master Thesis presentation by: Arinto Murdopo EMDC [email protected]

Supervisors: Albert Bifet Gianmarco de Francisci Morales Ricard Gavaldà

Page 2: Distributed Decision Tree Learning for Mining Big Data Streams

Big Data

200 million users

400 million tweets/day

1+ TB/day to Hadoop

2.7 TB/day follower update

4.5 billion likes/day

350 million photos/day

Volume

Velocity

Variety

May 2013

March 2013

May 2013

Page 3: Distributed Decision Tree Learning for Mining Big Data Streams

Machine Learning (ML)

Make sense of the data, but how?

Machine Learning = learn & adapt based on data

Due to the 3Vs, we should:

1. Distribute, to scale

2. Stream, to be fast

3. Distribute and stream, to scale and be fast

Page 4: Distributed Decision Tree Learning for Mining Big Data Streams

Are We Satisfied?

[Figure: comparison using the labels "scale", "fast", and "loose-coupling"]

We want machine learning frameworks that are scalable, fast, and loosely coupled

Page 5: Distributed Decision Tree Learning for Mining Big Data Streams

SAMOA

Scalable Advanced Massive Online Analysis

Distributed Streaming Machine Learning Framework:

• Fast, using streaming model

• Scale, on top of distributed stream processing engines (SPEs): Storm and S4

• Loose-coupling between ML algorithms and SPEs

Page 6: Distributed Decision Tree Learning for Mining Big Data Streams

Contributions

SAMOA

• Architecture and Abstractions

• Stream Processing Engine Adapter

• Integration with Storm

Vertical Hoeffding Tree

• Higher throughput than MOA when the number of attributes is high

Page 7: Distributed Decision Tree Learning for Mining Big Data Streams

SAMOA Architecture

[Figure: SAMOA sits between the ML algorithms (classification methods, clustering methods, frequent pattern mining) and the SPEs (S4, Storm, other SPEs)]

Page 8: Distributed Decision Tree Learning for Mining Big Data Streams

SAMOA Abstractions

To develop distributed ML algorithms

[Figure labels: Topology, Processor, Processing Item (PI), Entrance Processing Item (EPI), Stream, Content Events, Grouping, Parallelism Hint (n, z), External Event Source]

Page 9: Distributed Decision Tree Learning for Mining Big Data Streams

SAMOA SPE-adapter

• Transforms the abstractions into SPE-specific runtime components

• Abstract factory pattern to decouple API and SPE

• Platform developers need to provide

1. PI and EPI

2. Stream

3. Grouping
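The decoupling described on this slide can be sketched as follows. This is a minimal illustration of the abstract factory pattern, not SAMOA's actual API: the names ComponentFactory, LocalFactory, build_topology, and the dictionary-based components are all hypothetical stand-ins chosen for the example.

```python
from abc import ABC, abstractmethod


class ComponentFactory(ABC):
    """Hypothetical factory interface that each platform would implement."""

    @abstractmethod
    def create_entrance_pi(self, name): ...

    @abstractmethod
    def create_pi(self, name, parallelism): ...

    @abstractmethod
    def create_stream(self, source, destination, grouping): ...


class LocalFactory(ComponentFactory):
    """Stand-in platform (an assumption): it only records what was requested."""

    def create_entrance_pi(self, name):
        return {"kind": "local-EPI", "name": name}

    def create_pi(self, name, parallelism):
        return {"kind": "local-PI", "name": name, "parallelism": parallelism}

    def create_stream(self, source, destination, grouping):
        return {"kind": "local-stream", "from": source["name"],
                "to": destination["name"], "grouping": grouping}


def build_topology(factory):
    """Algorithm-side code: it talks only to the factory interface."""
    source = factory.create_entrance_pi("source")
    learner = factory.create_pi("learner", parallelism=4)
    stream = factory.create_stream(source, learner, grouping="shuffle")
    return [source, learner, stream]


# Swapping the platform means passing a different factory; build_topology is unchanged.
topology = build_topology(LocalFactory())
```

Under these assumptions, supporting another SPE would mean writing one more factory subclass that returns that engine's runtime components, while the algorithm code stays the same.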

Page 10: Distributed Decision Tree Learning for Mining Big Data Streams

SAMOA SPE-adapter

Examples of SPE-specific runtime components from SPE-adapter

[Figure annotation: Focus of this thesis]

Page 11: Distributed Decision Tree Learning for Mining Big Data Streams

Storm

• Distributed stream processing engine

• MapReduce-like programming model

[Figure: a Storm topology drawn as a DAG. Spouts S1 and S2 emit streams of tuples (stream A, stream B) to bolts B1 to B5. Annotation: stores useful information (data storage). Terms: Stream, Spout, Bolt, DAG, Tuples]

Page 12: Distributed Decision Tree Learning for Mining Big Data Streams

SAMOA-Storm Integration

Mapping between Storm and SAMOA

1. Spout → Entrance Processing Item (EPI)

2. Bolt → Processing Item (PI)

• Use composition for EPI and PI

3. Bolt Stream & Spout Stream → Stream

• Storm pull model
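The mapping above can be illustrated with a small sketch. It does not use the real Storm or SAMOA classes: SpoutLike, BoltLike, EntranceProcessor, Processor, and run are hypothetical stand-ins, written only to show composition (the wrapper holds a platform-neutral object instead of inheriting from it) and a pull-style loop in which the engine asks the spout for the next tuple.

```python
class EntranceProcessor:
    """Hypothetical platform-neutral source: hands out one event per call."""

    def __init__(self, events):
        self._events = iter(events)

    def next_event(self):
        # Returns None once the source is exhausted.
        return next(self._events, None)


class Processor:
    """Hypothetical platform-neutral processing logic."""

    def __init__(self):
        self.seen = []

    def process(self, event):
        self.seen.append(event)


class SpoutLike:
    """Stand-in for a spout: it has an EntranceProcessor (composition)."""

    def __init__(self, entrance_processor):
        self._entrance = entrance_processor

    def next_tuple(self):
        # Pull model: the engine calls this whenever it wants more data.
        return self._entrance.next_event()


class BoltLike:
    """Stand-in for a bolt: it has a Processor and delegates to it."""

    def __init__(self, processor):
        self._processor = processor

    def execute(self, tup):
        self._processor.process(tup)


def run(spout, bolt):
    """Toy engine loop: keep pulling from the spout until it runs dry."""
    while True:
        tup = spout.next_tuple()
        if tup is None:
            break
        bolt.execute(tup)


processor = Processor()
run(SpoutLike(EntranceProcessor(["a", "b", "c"])), BoltLike(processor))
```

Because the wrappers only delegate, the processing logic stays free of engine-specific base classes, which is the point of choosing composition here.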

Page 13: Distributed Decision Tree Learning for Mining Big Data Streams

Contributions so far ..

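[Figure: SAMOA architecture. The algorithm and API layer sits on the SPE-adapter (samoa-SPE), with modules samoa-S4, samoa-storm, and samoa-other-SPEs targeting S4, Storm, and other SPEs. An ML-adapter connects to MOA and other ML frameworks.]

Flexibility

Scalability

Extensibility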

Page 14: Distributed Decision Tree Learning for Mining Big Data Streams

Next Contribution…

Distributed algorithm implementation:

Vertical Hoeffding Tree

Decision tree:

• Classification

• Divide and conquer

• Easy to interpret

Page 15: Distributed Decision Tree Learning for Mining Big Data Streams

Sample Dataset

ID Code  Outlook   Temperature  Humidity  Windy  Play
a        sunny     hot          high      false  no
b        sunny     hot          high      true   no
c        overcast  hot          high      false  yes
d        rainy     mild         high      false  yes
…        …         …            …         …      …

Outlook, Temperature, Humidity, and Windy are attributes; Play is the class.

Each row is a datum (an instance) used to build the tree.

Page 16: Distributed Decision Tree Learning for Mining Big Data Streams

Decision Tree

[Figure: decision tree. Root: outlook, with branches sunny, overcast, and rainy. Split nodes: humidity (branches normal, high) and windy (branches true, false). Leaf nodes: Y or N.]

Page 17: Distributed Decision Tree Learning for Mining Big Data Streams

Very Fast Decision Tree (VFDT)

• Pioneering decision tree learner for data streams

• Information Gain + Gain Ratio + Hoeffding bound

• The Hoeffding bound decides whether the difference in information gain between the two best attributes is large enough to split

• Often called Hoeffding Tree
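The split test mentioned above can be sketched in a few lines. The bound used is the standard Hoeffding bound, epsilon = sqrt(R^2 * ln(1/delta) / (2n)), where R is the range of the split criterion, delta is the allowed error probability, and n is the number of instances seen at the leaf. The function names and the numeric values below are illustrative assumptions, not taken from the thesis.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the gap between the two best attributes exceeds epsilon.

    For information gain the range R is log2(number of classes).
    """
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative numbers only: the same gain gap of 0.2 is not enough
# evidence after 100 instances, but is enough after 1000.
few = should_split(0.30, 0.10, n_classes=2, delta=1e-7, n=100)
many = should_split(0.30, 0.10, n_classes=2, delta=1e-7, n=1000)
```

The bound shrinks as n grows, so a leaf that cannot yet justify a split simply waits for more instances.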

Page 18: Distributed Decision Tree Learning for Mining Big Data Streams

Distributed Decision Tree

Types of parallelism

• Horizontal

  • Partition the data by the instance

• Vertical

  • Partition the data by the attribute

• Task

  • Tree leaf nodes grow in parallel
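The difference between horizontal and vertical parallelism can be made concrete with a toy partitioner. This is a minimal sketch: the round-robin assignment, the function names, and the use of the four visible rows of the sample dataset are assumptions for illustration, and the class label is left out for brevity.

```python
def horizontal_partition(instances, workers):
    """Each worker receives whole instances (round-robin by instance)."""
    parts = [[] for _ in range(workers)]
    for i, instance in enumerate(instances):
        parts[i % workers].append(instance)
    return parts


def vertical_partition(instances, workers):
    """Each worker receives every instance, but only a slice of its attributes."""
    parts = [[] for _ in range(workers)]
    for instance in instances:
        slices = [{} for _ in range(workers)]
        for j, (attribute, value) in enumerate(instance.items()):
            slices[j % workers][attribute] = value
        for w in range(workers):
            parts[w].append(slices[w])
    return parts


rows = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "true"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": "false"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": "false"},
]

horizontal = horizontal_partition(rows, workers=2)
vertical = vertical_partition(rows, workers=2)
```

With two workers, the horizontal split gives each worker two complete rows, while the vertical split gives each worker all four rows restricted to two of the four attributes.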

Page 19: Distributed Decision Tree Learning for Mining Big Data Streams

MOA Hoeffding Tree Profiling

CPU time breakdown, 200 attributes:

• Learn: 70%

• Split: 24%

• Other: 6%

Page 20: Distributed Decision Tree Learning for Mining Big Data Streams

Vertical Hoeffding Tree

[Figure: VHT topology. Processing items: source PI, model-aggregator PI, local-statistic PI, evaluator PI. Streams: source, attribute, control, local-result, result. Parallelism labels on the PIs: 1, z, n.]

Page 21: Distributed Decision Tree Learning for Mining Big Data Streams

Evaluation

Metrics:

• Accuracy

• Throughput

Input data:

• Random Tree Generator

• Text Generator – resembles tweets

Cluster: 3 shared nodes; 48 GB of RAM, Intel Xeon CPU E5620 @ 2.4 GHz (16 processors), Linux kernel 2.6.18

Page 22: Distributed Decision Tree Learning for Mining Big Data Streams

VHT Iteration 1 (VHT1)

• Goal: verify algorithm correctness (same accuracy as MOA)

• Utilized 2 internal queues: instances queue and local-result queue

• Achieved the same accuracy, but throughput was low. Proceed with VHT2

Page 23: Distributed Decision Tree Learning for Mining Big Data Streams

VHT Iteration 2 (VHT2)

Goal: improve VHT1 throughput

• Kryo serializer: 2.5x throughput improvement

• long identifier instead of String

• Remove the 2 internal queues of VHT1, so instances are discarded while attempting to split

Page 24: Distributed Decision Tree Learning for Mining Big Data Streams

tree-10

Around 8.2% difference in accuracy

Page 25: Distributed Decision Tree Learning for Mining Big Data Streams

tree-100

Same trend as tree-10 (7.9% difference in accuracy)

Page 26: Distributed Decision Tree Learning for Mining Big Data Streams

No. Leaf Nodes VHT2 – tree-100

Very close and very high accuracy

Page 27: Distributed Decision Tree Learning for Mining Big Data Streams

Accuracy VHT2 – text-1000

Accuracy is low when the number of attributes increases

Page 28: Distributed Decision Tree Learning for Mining Big Data Streams

Throughput VHT2 – tree-generator

Low throughput for dense instances with a low number of attributes

Page 29: Distributed Decision Tree Learning for Mining Big Data Streams

Throughput VHT2 – text-generator

Higher throughput than MHT

Page 30: Distributed Decision Tree Learning for Mining Big Data Streams

[Figure: Profiling results for text-1000 with 1000000 instances. Execution time (seconds) per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial.]

Minimizing t_comm will increase throughput

Page 31: Distributed Decision Tree Learning for Mining Big Data Streams

[Figure: Profiling results for text-10000 with 100000 instances. Execution time (seconds) per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial.]

Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec
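To put the two throughput figures in perspective, the ratio can be worked out directly. The variable names are illustrative; the two numbers are the ones reported on the slide.

```python
vht2_par3 = 2631  # inst/sec, reported for VHT2-par-3
mht = 507         # inst/sec, reported for MHT

speedup = vht2_par3 / mht
print(f"VHT2-par-3 reaches about {speedup:.1f}x the throughput of MHT")
```

That is roughly a 5.2x throughput advantage for VHT2-par-3 on this text-10000 workload.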

Page 32: Distributed Decision Tree Learning for Mining Big Data Streams

Future Work

• Open Source

• Evaluation layer in SAMOA architecture

• Online classification algorithms that are based on horizontal parallelism

Page 33: Distributed Decision Tree Learning for Mining Big Data Streams

Conclusions

Mining big data streams is challenging

• Systems need to satisfy the 3Vs of big data.

SAMOA – Distributed Streaming ML Framework

• Architecture and Abstractions

• Stream Processing Engine (SPE) adapter

• SAMOA Integration with Storm

Vertical Hoeffding Tree

• Higher throughput than MOA when the number of attributes is high