Distributed Decision Tree Learning for Mining Big Data Streams


Master thesis presentation by Arinto Murdopo, EMDC

Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà

Big Data

200 million users

400 million tweets/day


1+ TB/day to Hadoop

2.7 TB/day follower update

4.5 billion likes/day

350 million photos/day

Volume

Velocity

Variety

May 2013

March 2013

May 2013

Machine Learning (ML)


Make sense of the data, but how?

Machine Learning = learn & adapt based on data

Due to the 3Vs, we should:

1. Distribute, to scale

2. Stream, to be fast

3. Distribute and stream, to scale and be fast

Are We Satisfied?

[Figure: frameworks labeled with the properties scale, fast, and loose-coupling]

We want machine learning frameworks that are scalable, fast, and loosely coupled

SAMOA

Scalable Advanced Massive Online Analysis

Distributed Streaming Machine Learning Framework:

• Fast, using a streaming model

• Scalable, on top of distributed stream processing engines (SPEs) (Storm and S4)

• Loose coupling between ML algorithms and SPEs


Contributions

SAMOA

• Architecture and Abstractions

• Stream Processing Engine Adapter

• Integration with Storm

Vertical Hoeffding Tree

• Higher throughput than MOA for a high number of attributes

SAMOA Architecture

[Figure: SAMOA sits between the ML algorithms (classification methods, clustering methods, frequent pattern mining) and the SPEs (S4, Storm, other SPEs)]

SAMOA Abstractions

To develop distributed ML algorithms

[Figure: components of a SAMOA topology: external event source, entrance processing item (EPI), processing item (PI), processor, stream, content events, grouping, parallelism hint]

SAMOA SPE-adapter

• Transforms the abstractions into SPE-specific runtime components

• Abstract factory pattern to decouple API and SPE

• Platform developers need to provide

1. PI and EPI

2. Stream

3. Grouping
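The abstract factory idea can be sketched as follows. This is a minimal illustration in Python, not SAMOA's actual API: `ComponentFactory`, `LocalFactory`, `LocalPI`, `LocalStream`, `key_grouping`, and `build_topology` are hypothetical names, the "platform" is a plain in-process engine, and the EPI is left out for brevity. The point is that `build_topology` only talks to the factory interface, so the same algorithm code could run on another engine by swapping the factory.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class ComponentFactory(ABC):
    """Platform-neutral factory: algorithm code only sees this interface."""

    @abstractmethod
    def create_pi(self, processor, parallelism=1):
        ...

    @abstractmethod
    def create_stream(self):
        ...


class LocalStream:
    """In-process stream: routes each event to one replica per subscriber."""

    def __init__(self):
        self.subscribers = []  # (processing item, grouping function) pairs

    def put(self, key, event):
        for pi, grouping in self.subscribers:
            pi.deliver(grouping(key, pi.parallelism), event)


class LocalPI:
    """In-process processing item with `parallelism` logical replicas."""

    def __init__(self, processor, parallelism):
        self.processor = processor
        self.parallelism = parallelism
        self.seen = defaultdict(list)  # replica index -> processed results

    def subscribe(self, stream, grouping):
        stream.subscribers.append((self, grouping))

    def deliver(self, replica, event):
        self.seen[replica].append(self.processor(event))


class LocalFactory(ComponentFactory):
    """What a platform developer would supply for this toy engine."""

    def create_pi(self, processor, parallelism=1):
        return LocalPI(processor, parallelism)

    def create_stream(self):
        return LocalStream()


def key_grouping(key, parallelism):
    # Key grouping: the same integer key always reaches the same replica.
    return key % parallelism


def build_topology(factory):
    # Algorithm code: depends on the factory interface, not on an engine.
    stream = factory.create_stream()
    pi = factory.create_pi(lambda event: event * 2, parallelism=2)
    pi.subscribe(stream, key_grouping)
    return stream, pi
```

Calling `build_topology(LocalFactory())` and then `stream.put(key, event)` a few times shows events with the same key landing on the same replica.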


SAMOA SPE-adapter

Examples of SPE-specific runtime components from the SPE-adapter

[Figure: SPE-specific runtime components, with the focus of this thesis highlighted]

Storm

• Distributed Streaming Processing Engine

• MapReduce-like programming model

[Figure: example Storm topology, a DAG of spouts (S1, S2) and bolts (B1 to B5) connected by streams of tuples (stream A, stream B); useful information is stored in data storage]

SAMOA-Storm Integration

Mapping between Storm and SAMOA

1. Spout → Entrance Processing Item (EPI)

2. Bolt → Processing Item (PI)

• Use composition for EPI and PI

3. Bolt Stream & Spout Stream → Stream

• Storm pull model
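The mapping above can be illustrated with a small sketch. It is written in Python with stand-in classes and does not use Storm's or SAMOA's real classes: `EntranceProcessor`, `SpoutAdapter`, `BoltAdapter`, and `run` are hypothetical names. The adapters hold a platform-neutral processor (composition) rather than inheriting from it, and the engine repeatedly asks the spout for the next tuple (pull model).

```python
class EntranceProcessor:
    """Platform-neutral source logic (hypothetical interface)."""

    def __init__(self, items):
        self._items = iter(items)

    def next_event(self):
        # Returns None once the source is exhausted.
        return next(self._items, None)


class SpoutAdapter:
    """Stand-in for a spout that *has* an entrance processor (composition)."""

    def __init__(self, entrance_processor):
        self.entrance_processor = entrance_processor
        self.emitted = []

    def next_tuple(self):
        # Pull model: the engine calls this; the source never pushes.
        event = self.entrance_processor.next_event()
        if event is not None:
            self.emitted.append(event)
        return event


class BoltAdapter:
    """Stand-in for a bolt that *has* a processor (composition)."""

    def __init__(self, processor):
        self.processor = processor
        self.results = []

    def execute(self, tup):
        self.results.append(self.processor(tup))


def run(spout, bolt, polls):
    # Toy engine loop: poll the spout, hand each emitted tuple to the bolt.
    for _ in range(polls):
        tup = spout.next_tuple()
        if tup is not None:
            bolt.execute(tup)
```

Polling more often than there are events is harmless here: an exhausted source simply yields nothing.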


Contributions so far ..

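[Figure: SAMOA layers. The algorithm and API layer sits on top; the SPE-adapter connects it to S4, Storm, and other SPEs through the samoa-SPE modules (samoa-S4, samoa-storm, samoa-other-SPEs); the ML-adapter connects it to MOA and other ML frameworks]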

Flexibility

Scalability

Extensibility

Next Contribution… Distributed Algorithm implementation:


Decision tree:

• Classification

• Divide and conquer

• Easy to interpret


Sample Dataset

ID Code  Outlook   Temperature  Humidity  Windy  Play
a        sunny     hot          high      false  no
b        sunny     hot          high      true   no
c        overcast  hot          high      false  yes
d        rainy     mild         high      false  yes
…        …         …            …         …      …

Outlook, Temperature, Humidity, and Windy are attributes; Play is the class. Each row is a datum (an instance) used to build the tree.

Decision Tree

[Figure: decision tree for the sample dataset. The root splits on outlook (sunny, overcast, rainy); split nodes test humidity (normal, high) and windy (true, false); leaf nodes predict Y or N]

Very Fast Decision Tree (VFDT)

• Pioneering decision tree algorithm for streaming data

• Information Gain + Gain Ratio + Hoeffding bound

• The Hoeffding bound decides whether the difference in information gain between the two best attributes is large enough to split

• Often called Hoeffding Tree
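The split decision can be made concrete with the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split criterion, delta is the allowed error probability, and n is the number of instances seen at the leaf. The sketch below is a minimal Python illustration, not the thesis code; the function names are hypothetical, and it assumes information gain as the criterion, so R = log2(number of classes).

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the observed
    mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    value_range = math.log2(n_classes)  # range of information gain
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

Because epsilon shrinks as n grows, a leaf that cannot decide yet only needs to wait for more instances before the same gain difference becomes sufficient.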


Distributed Decision Tree

Types of parallelism

• Horizontal: partition the data by instance

• Vertical: partition the data by attribute

• Task: tree leaf nodes grow in parallel
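The difference between horizontal and vertical partitioning can be shown on a few labeled instances. This is an illustrative Python sketch with hypothetical function names; round-robin assignment is an assumption made for simplicity, not something the slides specify.

```python
def horizontal_partition(instances, n_workers):
    """Each worker receives whole instances (round-robin by instance)."""
    parts = [[] for _ in range(n_workers)]
    for i, instance in enumerate(instances):
        parts[i % n_workers].append(instance)
    return parts


def vertical_partition(instances, n_workers):
    """Each worker receives a slice of the attributes of every instance."""
    parts = [[] for _ in range(n_workers)]
    for attrs, label in instances:
        for w in range(n_workers):
            # Attribute index j goes to worker j % n_workers.
            subset = {j: v for j, v in enumerate(attrs) if j % n_workers == w}
            parts[w].append((subset, label))
    return parts
```

With horizontal partitioning each worker sees fewer instances but all attributes; with vertical partitioning each worker sees every instance but only some attributes, which is why it pays off when the number of attributes is high.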

MOA Hoeffding Tree Profiling

CPU time breakdown, 200 attributes:

• Learn: 70%

• Split: 24%

• Other: 6%


[Figure: Vertical Hoeffding Tree topology. The source PI, model-aggregator PI, local-statistic PI, and evaluator PI are connected by the source, attribute, control, local-result, and result streams]
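To give a feel for what a local-statistic PI might do, here is a simplified Python sketch; it is an assumption-laden illustration, not the thesis implementation. The class `LocalStatistic` and its methods `on_attribute` and `on_control` are hypothetical names: the first accumulates class counts for events arriving on the attribute stream, and the second answers a control event with the two best attributes by information gain, which would travel back as a local-result. Combining local results and applying the Hoeffding bound in the model aggregator is not shown.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (bits) of a Counter of class counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)


class LocalStatistic:
    """Keeps per-attribute, per-value class counts for its attribute slice."""

    def __init__(self):
        # attribute -> attribute value -> Counter of class labels
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def on_attribute(self, attr, value, label):
        self.counts[attr][value][label] += 1

    def on_control(self):
        """Return up to two (information gain, attribute) pairs, best first."""
        results = []
        for attr, by_value in self.counts.items():
            class_totals = Counter()
            for value_counts in by_value.values():
                class_totals.update(value_counts)
            n = sum(class_totals.values())
            # Weighted entropy remaining after splitting on this attribute.
            remainder = sum(sum(vc.values()) / n * entropy(vc)
                            for vc in by_value.values())
            results.append((entropy(class_totals) - remainder, attr))
        return sorted(results, reverse=True)[:2]
```

Each worker only ever touches the counters of the attributes routed to it, so the expensive learning step is spread across workers.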

Evaluation

Metrics:

• Accuracy

• Throughput

Input data:

• Random Tree Generator

• Text Generator – resembles tweets

Cluster: 3 shared nodes with 48 GB of RAM, Intel Xeon CPU E5620 @ 2.4 GHz (16 processors), Linux kernel 2.6.18
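The two metrics can be computed in one pass over a stream. The Python sketch below assumes a test-then-train loop (predict on each instance, then learn from it), which the slides do not state explicitly; `MajorityClass` is a toy stand-in classifier and `evaluate` is a hypothetical helper, neither taken from SAMOA or MOA.

```python
import time
from collections import Counter


class MajorityClass:
    """Toy classifier: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.label_counts = Counter()

    def predict(self, features):
        if not self.label_counts:
            return None  # nothing learned yet
        return self.label_counts.most_common(1)[0][0]

    def learn(self, features, label):
        self.label_counts[label] += 1


def evaluate(classifier, stream):
    """Return (accuracy, throughput in instances per second)."""
    correct = total = 0
    start = time.perf_counter()
    for features, label in stream:
        if classifier.predict(features) == label:
            correct += 1
        classifier.learn(features, label)  # train only after testing
        total += 1
    elapsed = time.perf_counter() - start
    throughput = total / elapsed if elapsed > 0 else float("inf")
    return correct / total, throughput
```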


VHT iteration 1 (VHT1)

• Goal: verify algorithm correctness (same accuracy as MOA)

• Used 2 internal queues: an instances queue and a local-result queue

• Achieved the same accuracy, but throughput is low, so we proceed with VHT2

VHT Iteration 2 (VHT2)

Goal: improve VHT1 throughput

• Kryo serializer: 2.5x throughput improvement

• long identifier instead of String

• Remove the 2 internal queues of VHT1; instances are discarded while a split is being attempted

tree-10

Around 8.2% difference in accuracy

tree-100


Same trend as tree-10 (7.9% difference in accuracy)

No. Leaf Nodes VHT2 – tree-100


Very close and very high accuracy

Accuracy VHT2 – text-1000

Low accuracy when the number of attributes increases

Throughput VHT2 – tree-generator

Not good for dense instances and a low number of attributes

Throughput VHT2 – text-generator

Higher throughput than MHT (the MOA Hoeffding Tree)

[Figure: Profiling results for text-1000 with 1000000 instances. Execution time (seconds) per classifier (VHT2-par-3 and MHT), broken down into t_calc, t_comm, and t_serial]

Minimizing t_comm will increase throughput

[Figure: Profiling results for text-10000 with 100000 instances. Execution time (seconds) per classifier (VHT2-par-3 and MHT), broken down into t_calc, t_comm, and t_serial]

Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec (about 5.2x higher for VHT2-par-3)

Future Work

• Open Source

• Evaluation layer in SAMOA architecture

• Online classification algorithms that are based on horizontal parallelism

Conclusions

Mining big data streams is challenging

• Systems need to satisfy the 3Vs of big data

SAMOA – Distributed Streaming ML Framework

• Architecture and Abstractions

• Stream Processing Engine (SPE) adapter

• SAMOA Integration with Storm


Vertical Hoeffding Tree

• Higher throughput than MOA for a high number of attributes
