A General Framework for Mining Massive Data Streams
Geoff Hulten, advised by Pedro Domingos
Transcript
Page 1:

A General Framework for Mining Massive Data Streams

Geoff Hulten, advised by Pedro Domingos

Page 2:

Mining Massive Data Streams

• High-speed data streams abundant
  – Large retailers
  – Long distance & cellular phone call records
  – Scientific projects
  – Large Web sites

• Build model of the process creating data

• Use model to interact more efficiently

Page 3:

Growing Mismatch Between Algorithms and Data

• State of the art data mining algorithms
  – One shot learning
  – Work with static databases
  – Maximum of 1 million – 10 million records

• Properties of data streams
  – Data stream exists over months or years
  – 10s – 100s of millions of new records per day
  – Process generating data changing over time

Page 4:

The Cost of This Mismatch

• Fraction of data we can effectively mine shrinking towards zero

• Models learned from heuristically selected samples of data

• Models out of date before being deployed

Page 5:

Need New Algorithms

• Monitor a data stream and have a model available at all times

• Improve the model as data arrives

• Adapt the model as process generating data changes

• Have quality guarantees

• Work within strict resource constraints

Page 6:

Solution: General Framework

• Applicable to algorithms based on discrete search

• Semi-automatically converts algorithm to meet our design needs

• Uses sampling to select data size for each search step

• Extensions to continuous searches and relational data

Page 7:

Outline

• Introduction

• Scaling up Decision Trees

• Our Framework for Scaling

• Other Applications and Results

• Conclusion

Page 8:

Decision Trees

• Examples: ⟨x1, …, xD, y⟩

• Encode: y = F(x1, …, xD)

• Nodes contain tests

• Leaves contain predictions

[Figure: example decision tree. The root tests Gender?; the Male branch leads to a False leaf, and the Female branch leads to an Age? test whose < 25 and >= 25 branches lead to False and True leaves.]

Page 9:

Decision Tree Induction

DecisionTree(Data D, Tree T, Attributes A)
  If D is pure
    Let T be a leaf predicting the class in D
    Return
  Let X be the best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X
    Let D^ be the portion of D with value V for X
    Let T^ be the child of T for V
    DecisionTree(D^, T^, A - X)
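A minimal Python sketch of this recursion, assuming G() is information gain; the helper names (entropy, information_gain) and the dict-based tree representation are illustrative choices, not code from the thesis.

from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, attr):
    """G(): reduction in entropy from splitting `data` on attribute `attr`."""
    gain = entropy([y for _, y in data])
    n = len(data)
    for value in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def decision_tree(data, attributes):
    """data: list of (attribute-dict, label) pairs; attributes: set of names."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attributes:      # D is pure (or nothing left to split on)
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = max(attributes, key=lambda a: information_gain(data, a))
    children = {}
    for value in {x[best] for x, _ in data}:         # one child per value V of X
        subset = [(x, y) for x, y in data if x[best] == value]
        children[value] = decision_tree(subset, attributes - {best})
    return {"split": best, "children": children}

Called as decision_tree(examples, set(examples[0][0])), where each example is an (attribute dict, label) pair, it returns a nested dict of tests and leaf predictions as described above.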

Page 10:

VFDT (Very Fast Decision Tree)

• To pick the split attribute for a node, looking at only a few examples may be sufficient

• Given a stream of examples:
  – Use the first ones to pick the split at the root
  – Sort succeeding ones to the leaves
  – Pick the best attribute there
  – Continue…

• Leaves predict the most common class

• A very fast, incremental, anytime decision tree induction algorithm

Page 11:

How Much Data?

• Make sure the best attribute is better than the second
  – That is: ΔG = G(X1) − G(X2) > 0

• Using a sample, so we need the Hoeffding bound
  – Collect data until ΔG > ε, where ε = sqrt(R² ln(1/δ) / (2n))
    (R is the range of G, δ the allowed probability of error, and n the number of examples seen)
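A small Python sketch of this test (our reconstruction, not thesis code): compute ε from the Hoeffding bound and declare a winner only once the observed gap in G exceeds it. For information gain on a c-class problem, the range R is log2(c).

import math

def hoeffding_epsilon(R, delta, n):
    """With probability at least 1 - delta, the average of n observations of a
    variable with range R lies within epsilon of its true mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def can_pick_winner(g_best, g_second, n, num_classes, delta=1e-7):
    """True once G(X1) - G(X2) > epsilon, so X1 can be chosen with confidence 1 - delta."""
    eps = hoeffding_epsilon(math.log2(num_classes), delta, n)
    return (g_best - g_second) > eps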

Page 12:

Core VFDT Algorithm

Procedure VFDT(Stream, δ)
  Let T = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream
    Sort (X, y) to a leaf using T
    Update sufficient statistics at that leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split the leaf on the best attribute
      For each branch
        Start a new leaf and initialize its sufficient statistics
  Return T

[Figure: example tree grown by VFDT, with an x1? test (male/female) at the root, one branch ending in a y=0 leaf and the other leading to an x2? test (> 65 / <= 65) whose leaves predict y=0 and y=1.]
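A Python sketch of the per-leaf bookkeeping this loop needs (illustrative only; it shows the sufficient statistics and the Hoeffding test at one leaf, and omits routing examples down the tree and creating new leaves):

import math
from collections import defaultdict

def hoeffding_epsilon(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

class Leaf:
    """Sufficient statistics at one leaf: counts[attr][value][class]."""
    def __init__(self, attributes):
        self.attributes = list(attributes)
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in self.attributes}
        self.class_counts = defaultdict(int)
        self.n = 0

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[a][x[a]][y] += 1

    def _entropy(self, cc):
        total = sum(cc.values())
        return -sum(c / total * math.log2(c / total) for c in cc.values() if c)

    def gain(self, attr):
        g = self._entropy(self.class_counts)
        for cc in self.counts[attr].values():
            g -= sum(cc.values()) / self.n * self._entropy(cc)
        return g

def try_split(leaf, delta=1e-7, num_classes=2):
    """Return the attribute to split on if G(best) - G(2nd best) > epsilon, else None."""
    gains = sorted(((leaf.gain(a), a) for a in leaf.attributes), reverse=True)
    if len(gains) < 2 or leaf.n == 0:
        return None
    eps = hoeffding_epsilon(math.log2(num_classes), delta, leaf.n)
    return gains[0][1] if gains[0][0] - gains[1][0] > eps else None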

Page 13:

Quality of Trees from VFDT

• The model may contain incorrect splits; is it still useful?

• Bound the difference with the infinite-data tree
  – The chance that an arbitrary example takes a different path

• Intuition: an example on level i of the tree has i chances to go through a mistaken node

• Result: E[Δ(HT_δ, DT_∞)] ≤ δ / p, where Δ is the probability that the two trees disagree on an example, HT_δ is the tree VFDT learns with parameter δ, DT_∞ is the tree learned from infinite data, and p is the probability that an example reaches a leaf

Page 14:

Complete VFDT System

• Memory management
  – Memory dominated by sufficient statistics
  – Deactivate less promising leaves when needed

• Ties
  – Wasteful to decide between identical attributes

• Check for splits periodically

• Pre-pruning
  – Only make splits that improve the value of G(.)

• Early stop on bad attributes

Page 15:

VFDT (Continued)

• Bootstrap with traditional learner

• Rescan dataset when time available

• Time changing data streams

• Post pruning

• Continuous attributes

• Batch mode

Page 16:

Experiments

• Compared VFDT and C4.5 (Quinlan, 1993)

• Same memory limit for both (40 MB)
  – 100k examples for C4.5

• VFDT settings: δ = 10^-7, τ = 5%

• Domains: 2 classes, 100 binary attributes

• Fifteen synthetic trees 2.2k – 500k leaves

• Noise from 0% to 30%

Page 17:
Page 18:
Page 19:

Running Times

• Pentium III at 500 MHz running Linux

• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds

• VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process

• VFDT processes 32k examples per second (excluding I/O)

Page 20:
Page 21:

Real World Data Sets: Trace of UW Web Requests

• Stream of Web page requests from UW
• One week: 23k clients, 170 orgs, 244k hosts, 82.8M requests (peak: 17k/min), 20 GB
• Goal: improve the cache by predicting requests
• 1.6M examples, 61% default class
• C4.5 on 75k examples, 2975 secs
  – 73.3% accuracy
• VFDT: ~3000 secs, 74.3% accurate

Page 22:

Outline

• Introduction

• Scaling up Decision Trees

• Our Framework for Scaling

• Overview of Applications and Results

• Conclusion

Page 23:

Data Mining as Discrete Search

• Initial state
  – Empty, prior, or random

• Search operators
  – Refine structure

• Evaluation function
  – Likelihood, among many others

• Goal state
  – Local optimum, etc.

Page 24:

Data Mining As Search

[Figure: a search over candidate model structures; each state is evaluated against the training data, with scores such as 1.5, 1.7, 1.8, 1.9, and 2.0.]

Page 25:

Example: Decision Tree

[Figure: decision tree learning as search. Starting from a root node, candidate refinements X1? … Xd? are scored against the training data (e.g., 1.5, 1.7) and the best one is expanded further.]

• Initial state
  – Root node

• Search operators
  – Turn any leaf into a test on an attribute

• Evaluation
  – Entropy reduction, where entropy = −Σ pi lg(pi), summed over the values of y

• Goal state
  – No further gain
  – Post-prune

Page 26:

Overview of Framework

• Cast the learning algorithm as a search (a sketch of the resulting loop appears below)

• Begin monitoring the data stream
  – Use each example to update sufficient statistics where appropriate (then discard it)
  – Periodically pause and use statistical tests
    • Take steps that can be made with high confidence

• Monitor old search decisions
  – Change them when the data stream changes
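A minimal Python sketch of that loop, under assumed interfaces (update, candidate_steps, score, apply, and n_examples are placeholders invented for illustration, not an API from the thesis):

import math

def hoeffding_epsilon(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def mine_stream(stream, state, delta=1e-7, check_every=10_000, R=1.0):
    """Generic framework loop: fold each example into sufficient statistics,
    then periodically take whichever search step wins by more than epsilon."""
    for i, example in enumerate(stream, 1):
        state.update(example)                     # the example is then discarded
        if i % check_every:                       # pause only periodically
            continue
        steps = sorted(state.candidate_steps(), key=state.score, reverse=True)
        if len(steps) < 2:
            continue
        eps = hoeffding_epsilon(R, delta, state.n_examples())
        if state.score(steps[0]) - state.score(steps[1]) > eps:
            state.apply(steps[0])                 # a step we can take with high confidence
    return state

Monitoring and revising old decisions (the last bullet above) is elided here; the Alternate Searches slide sketches how it is done for trees.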

Page 27:

How Much Data is Enough?

[Figure: two candidate search steps, X1? and Xd?, scored on the training data (1.65 and 1.38).]

Page 28:

How Much Data is Enough?

[Figure: the same two candidate steps, X1? and Xd?, scored on a sample of the data: 1.6 ± ε versus 1.4 ± ε.]

• Use statistical bounds
  – Normal distribution
  – Hoeffding bound: ε = sqrt(R² ln(1/δ) / (2n))

• Applies to scores that are averages over examples

• Can select a winner if
  – Score1 > Score2 + ε

Page 29:

Global Quality Guarantee

• δ – probability of error in single decision

• b – branching factor of search

• d – depth of search

• c – number of checks for winner

δ* = δ · b · d · c (a union bound over all the individual decisions made during the search)
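As a worked example with hypothetical numbers (not from the slides): with δ = 10^-7 per decision, branching factor b = 100, depth d = 20, and c = 10 checks per decision, δ* = 10^-7 · 100 · 20 · 10 = 2 · 10^-3, i.e. the entire search is correct with probability about 0.998.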

Page 30:

Identical States And Ties

• Fails if states are identical (or nearly so)

• τ – user supplied tie parameter

• Select a winner early if the alternatives differ by less than τ
  – Score1 > Score2 + ε, or ε <= τ
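A one-function Python sketch of the combined test (names are illustrative):

import math

def select_winner(score1, score2, R, delta, tau, n):
    """True if score1 beats score2 by more than the Hoeffding epsilon, or if
    epsilon has shrunk below tau (a near-tie, so either choice is acceptable)."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return score1 > score2 + eps or eps <= tau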

Page 31:

Dealing with Time Changing Concepts

• Maintain a window of the most recent examples
• Keep the model up to date with this window
• Effective when window size is similar to the concept drift rate

• Traditional approach
  – Periodically reapply the learner
  – Very inefficient!

• Our approach
  – Monitor the quality of old decisions as the window shifts
  – Correct decisions in a fine-grained manner
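A minimal sketch of the window bookkeeping this implies (illustrative, not the actual CVFDT code): each arriving example updates the statistics, and the example that falls out of the window is subtracted back out, so old decisions can be re-checked against the current statistics.

from collections import deque

class WindowedStats:
    """Class counts over a sliding window of the most recent examples."""
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.class_counts = {}

    def add(self, x, y):
        self.window.append((x, y))
        self.class_counts[y] = self.class_counts.get(y, 0) + 1
        if len(self.window) > self.window_size:
            _, old_y = self.window.popleft()      # forget the oldest example
            self.class_counts[old_y] -= 1

In CVFDT, per-window statistics like these are kept at internal nodes as well as leaves, which is what makes it possible to notice that a split chosen long ago is no longer the best one.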

Page 32:

Alternate Searches

• When a new test looks better, grow an alternate sub-tree
• Replace the old sub-tree when the new one is more accurate
• This smoothly adjusts to changing concepts

[Figure: a tree rooted at a Gender? test with false/true leaves, alongside alternate sub-trees rooted at other tests (Pets?, College?, Hair?) being grown in parallel.]
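A rough Python sketch of the alternate sub-tree mechanism (field and method names are our own, invented for illustration): a node tracks the accuracy of its current sub-tree and of an alternate being grown in parallel, and the caller swaps in the alternate once it proves more accurate on recent examples.

class Node:
    def __init__(self, test=None, prediction=None):
        self.test = test               # attribute tested here (None for a leaf)
        self.prediction = prediction
        self.children = {}
        self.alternate = None          # alternate sub-tree being grown, if any
        self.correct = 0               # recent accuracy of this sub-tree
        self.alt_correct = 0           # recent accuracy of the alternate
        self.seen = 0

def monitor(node, x, y, predict, min_examples=1000):
    """Score the node and its alternate on one labeled example; return the
    alternate if it has become the more accurate sub-tree, else the node."""
    node.seen += 1
    node.correct += int(predict(node, x) == y)
    if node.alternate is not None:
        node.alt_correct += int(predict(node.alternate, x) == y)
        if node.seen >= min_examples and node.alt_correct > node.correct:
            return node.alternate      # caller replaces the old sub-tree
    return node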

Page 33:

RAM Limitations

• Each search requires a sufficient statistics structure

• Decision Tree
  – O(avc) RAM

• Bayesian Network
  – O(c^p) RAM
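As a rough worked example (our numbers, reading a, v, c as attributes, values per attribute, and classes): with the synthetic domains from the experiments (100 binary attributes, 2 classes), each active leaf needs on the order of 100 · 2 · 2 = 400 counters, so 100,000 active leaves already hold about 40 million counters. This is why less promising leaves are temporarily deactivated when memory runs low, as the next slide illustrates.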

Page 34:

RAM Limitations

[Figure: parts of the model marked Active versus Temporarily inactive, illustrating how less promising portions are deactivated to fit in RAM.]

Page 35:

Outline

• Introduction

• Data Mining as Discrete Search

• Our Framework for Scaling

• Application to Decision Trees

• Other Applications and Results

• Conclusion

Page 36:

Applications

• VFDT (KDD ’00) – Decision Trees
• CVFDT (KDD ’01) – VFDT + concept drift
• VFBN & VFBN2 (KDD ’02) – Bayesian Networks

• Continuous Searches
  – VFKM (ICML ’01) – k-means clustering
  – VFEM (NIPS ’01) – EM for mixtures of Gaussians

• Relational Data Sets
  – VFREL (submitted) – Feature selection in relational data

Page 37:

CVFDT Experiments

Page 38:

Activity Profile for VFBN

Page 39:

Other Real World Data Sets

• Trace of all Web requests from the UW campus
  – Use clustering to find good locations for proxy caches

• KDD Cup 2000 data set
  – 700k page requests from an e-commerce site
  – Categorize pages into 65 categories; predict which a session will visit

• UW CSE data set
  – 8 million sessions over two years
  – Predict which of 80 level-2 directories each visits

• Web crawl of .edu sites
  – Two data sets, each with two million Web pages
  – Use relational structure to predict which will increase in popularity over time

Page 40:

Related Work

• DB Mine: A Performance Perspective (Agrawal, Imielinski, Swami ’93)
  – Framework for scaling rule learning

• RainForest (Gehrke, Ramakrishnan, Ganti ’98)
  – Framework for scaling decision trees

• ADtrees (Moore, Lee ’97)
  – Accelerate computing sufficient statistics

• PALO (Greiner ’92)
  – Accelerate hill-climbing search via sampling

• DEMON (Ganti, Gehrke, Ramakrishnan ’00)
  – Framework for converting incremental algorithms for time-changing data streams

Page 41:

Future Work

• Combine the framework for discrete search with the frameworks for continuous search and relational learning

• Further study time-changing processes

• Develop a language for specifying data stream learning algorithms

• Use the framework to develop novel algorithms for massive data streams

• Apply the algorithms to more real-world problems

Page 42:

Conclusion

• Framework helps scale up learning algorithms based on discrete search

• Resulting algorithms:
  – Work on databases and data streams
  – Work with limited resources
  – Adapt to time-changing concepts
  – Learn in time proportional to concept complexity
    • Independent of the amount of training data!

• Benefits have been demonstrated in a series of applications