A General Framework for Mining Massive Data Streams
Geoff Hulten, advised by Pedro Domingos
Transcript
Page 1:

A General Framework for Mining Massive Data Streams

Geoff Hulten, advised by Pedro Domingos

Page 2:

Mining Massive Data Streams

• High-speed data streams abundant
  – Large retailers
  – Long distance & cellular phone call records
  – Scientific projects
  – Large Web sites

• Build model of the process creating data

• Use model to interact more efficiently

Page 3:

Growing Mismatch Between Algorithms and Data

• State of the art data mining algorithms
  – One shot learning
  – Work with static databases
  – Maximum of 1 million – 10 million records

• Properties of data streams
  – Data stream exists over months or years
  – 10s – 100s of millions of new records per day
  – Process generating data changing over time

Page 4:

The Cost of This Mismatch

• Fraction of data we can effectively mine shrinking towards zero

• Models learned from heuristically selected samples of data

• Models out of date before being deployed

Page 5:

Need New Algorithms

• Monitor a data stream and have a model available at all times

• Improve the model as data arrives

• Adapt the model as process generating data changes

• Have quality guarantees

• Work within strict resource constraints

Page 6:

Solution: General Framework

• Applicable to algorithms based on discrete search

• Semi-automatically converts algorithm to meet our design needs

• Uses sampling to select data size for each search step

• Extensions to continuous searches and relational data

Page 7:

Outline

• Introduction

• Scaling up Decision Trees

• Our Framework for Scaling

• Other Applications and Results

• Conclusion

Page 8:

Decision Trees

• Examples: ⟨x1, …, xD, y⟩

• Encode: y = F(x1, …, xD)

• Nodes contain tests

• Leaves contain predictions

[Figure: example decision tree. The root tests Gender?; the Male branch leads to a False leaf, and the Female branch leads to an Age? test whose < 25 and >= 25 branches lead to False and True leaves.]

Page 9:

Decision Tree Induction

DecisionTree(Data D, Tree T, Attributes A)
  If D is pure
    Let T be a leaf predicting the class in D
    Return
  Let X be the best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X
    Let D^ be the portion of D with value V for X
    Let T^ be the child of T for V
    DecisionTree(D^, T^, A - X)
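A minimal Python sketch of this recursion, assuming G() is information gain; the helper names (entropy, information_gain) and the dict-based tree representation are illustrative choices, not code from the thesis.

from collections import Counter
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, attr):
    """G(): reduction in entropy from splitting `data` on attribute `attr`."""
    gain = entropy([y for _, y in data])
    n = len(data)
    for value in {x[attr] for x, _ in data}:
        subset = [y for x, y in data if x[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def decision_tree(data, attributes):
    """data: list of (attribute-dict, label) pairs; attributes: set of names."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or not attributes:      # D is pure (or nothing left to split on)
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = max(attributes, key=lambda a: information_gain(data, a))
    children = {}
    for value in {x[best] for x, _ in data}:         # one child per value V of X
        subset = [(x, y) for x, y in data if x[best] == value]
        children[value] = decision_tree(subset, attributes - {best})
    return {"split": best, "children": children}

Called as decision_tree(examples, set(examples[0][0])), where each example is an (attribute dict, label) pair, it returns a nested dict of tests and leaf predictions as described above.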

Page 10:

VFDT (Very Fast Decision Tree)

• To pick the split attribute for a node, looking at only a few examples may be sufficient

• Given a stream of examples:
  – Use the first ones to pick the split at the root
  – Sort succeeding ones to the leaves
  – Pick the best attribute there
  – Continue…

• Leaves predict the most common class

• A very fast, incremental, anytime decision tree induction algorithm

Page 11:

How Much Data?

• Make sure the best attribute is better than the second
  – That is: ΔG = G(X1) − G(X2) > 0

• Using a sample, so we need the Hoeffding bound
  – Collect data until ΔG > ε, where ε = sqrt(R² ln(1/δ) / (2n))
    (R is the range of G, δ the allowed probability of error, and n the number of examples seen)
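A small Python sketch of this test (our reconstruction, not thesis code): compute ε from the Hoeffding bound and declare a winner only once the observed gap in G exceeds it. For information gain on a c-class problem, the range R is log2(c).

import math

def hoeffding_epsilon(R, delta, n):
    """With probability at least 1 - delta, the average of n observations of a
    variable with range R lies within epsilon of its true mean."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def can_pick_winner(g_best, g_second, n, num_classes, delta=1e-7):
    """True once G(X1) - G(X2) > epsilon, so X1 can be chosen with confidence 1 - delta."""
    eps = hoeffding_epsilon(math.log2(num_classes), delta, n)
    return (g_best - g_second) > eps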

Page 12:

Core VFDT Algorithm

Procedure VFDT(Stream, δ)
  Let T = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream
    Sort (X, y) to a leaf using T
    Update sufficient statistics at that leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split the leaf on the best attribute
      For each branch
        Start a new leaf and initialize its sufficient statistics
  Return T

[Figure: example tree grown by VFDT, with an x1? test (male/female) at the root, one branch ending in a y=0 leaf and the other leading to an x2? test (> 65 / <= 65) whose leaves predict y=0 and y=1.]
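A Python sketch of the per-leaf bookkeeping this loop needs (illustrative only; it shows the sufficient statistics and the Hoeffding test at one leaf, and omits routing examples down the tree and creating new leaves):

import math
from collections import defaultdict

def hoeffding_epsilon(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

class Leaf:
    """Sufficient statistics at one leaf: counts[attr][value][class]."""
    def __init__(self, attributes):
        self.attributes = list(attributes)
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in self.attributes}
        self.class_counts = defaultdict(int)
        self.n = 0

    def update(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[a][x[a]][y] += 1

    def _entropy(self, cc):
        total = sum(cc.values())
        return -sum(c / total * math.log2(c / total) for c in cc.values() if c)

    def gain(self, attr):
        g = self._entropy(self.class_counts)
        for cc in self.counts[attr].values():
            g -= sum(cc.values()) / self.n * self._entropy(cc)
        return g

def try_split(leaf, delta=1e-7, num_classes=2):
    """Return the attribute to split on if G(best) - G(2nd best) > epsilon, else None."""
    gains = sorted(((leaf.gain(a), a) for a in leaf.attributes), reverse=True)
    if len(gains) < 2 or leaf.n == 0:
        return None
    eps = hoeffding_epsilon(math.log2(num_classes), delta, leaf.n)
    return gains[0][1] if gains[0][0] - gains[1][0] > eps else None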

Page 13:

Quality of Trees from VFDT

• The model may contain incorrect splits; is it still useful?

• Bound the difference with the infinite-data tree
  – The chance that an arbitrary example takes a different path

• Intuition: an example on level i of the tree has i chances to go through a mistaken node

• Result: E[Δ(HT_δ, DT_∞)] ≤ δ / p, where Δ is the probability that the two trees disagree on an example, HT_δ is the tree VFDT learns with parameter δ, DT_∞ is the tree learned from infinite data, and p is the probability that an example reaches a leaf

Page 14:

Complete VFDT System

• Memory management
  – Memory dominated by sufficient statistics
  – Deactivate less promising leaves when needed

• Ties
  – Wasteful to decide between identical attributes

• Check for splits periodically

• Pre-pruning
  – Only make splits that improve the value of G(.)

• Early stop on bad attributes

Page 15:

VFDT (Continued)

• Bootstrap with traditional learner

• Rescan dataset when time available

• Time changing data streams

• Post pruning

• Continuous attributes

• Batch mode

Page 16:

Experiments

• Compared VFDT and C4.5 (Quinlan, 1993)

• Same memory limit for both (40 MB)
  – 100k examples for C4.5

• VFDT settings: δ = 10^-7, τ = 5%

• Domains: 2 classes, 100 binary attributes

• Fifteen synthetic trees 2.2k – 500k leaves

• Noise from 0% to 30%

Page 17:
Page 18:
Page 19:

Running Times

• Pentium III at 500 MHz running Linux

• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds

• VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process

• VFDT processes 32k examples per second (excluding I/O)

Page 20:
Page 21:

Real World Data Sets: Trace of UW Web Requests

• Stream of Web page requests from UW
• One week: 23k clients, 170 orgs, 244k hosts, 82.8M requests (peak: 17k/min), 20 GB
• Goal: improve the cache by predicting requests
• 1.6M examples, 61% default class
• C4.5 on 75k examples, 2975 secs
  – 73.3% accuracy
• VFDT: ~3000 secs, 74.3% accurate

Page 22:

Outline

• Introduction

• Scaling up Decision Trees

• Our Framework for Scaling

• Overview of Applications and Results

• Conclusion

Page 23:

Data Mining as Discrete Search

• Initial state
  – Empty, prior, or random

• Search operators
  – Refine structure

• Evaluation function
  – Likelihood, among many others

• Goal state
  – Local optimum, etc.

Page 24:

Data Mining As Search

[Figure: a search over candidate model structures; each state is evaluated against the training data, with scores such as 1.5, 1.7, 1.8, 1.9, and 2.0.]

Page 25:

Example: Decision Tree

[Figure: decision tree learning as search. Starting from a root node, candidate refinements X1? … Xd? are scored against the training data (e.g., 1.5, 1.7) and the best one is expanded further.]

• Initial state
  – Root node

• Search operators
  – Turn any leaf into a test on an attribute

• Evaluation
  – Entropy reduction, where entropy = −Σ pi lg(pi), summed over the values of y

• Goal state
  – No further gain
  – Post-prune

Page 26:

Overview of Framework

• Cast the learning algorithm as a search (a sketch of the resulting loop appears below)

• Begin monitoring the data stream
  – Use each example to update sufficient statistics where appropriate (then discard it)
  – Periodically pause and use statistical tests
    • Take steps that can be made with high confidence

• Monitor old search decisions
  – Change them when the data stream changes
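A minimal Python sketch of that loop, under assumed interfaces (update, candidate_steps, score, apply, and n_examples are placeholders invented for illustration, not an API from the thesis):

import math

def hoeffding_epsilon(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def mine_stream(stream, state, delta=1e-7, check_every=10_000, R=1.0):
    """Generic framework loop: fold each example into sufficient statistics,
    then periodically take whichever search step wins by more than epsilon."""
    for i, example in enumerate(stream, 1):
        state.update(example)                     # the example is then discarded
        if i % check_every:                       # pause only periodically
            continue
        steps = sorted(state.candidate_steps(), key=state.score, reverse=True)
        if len(steps) < 2:
            continue
        eps = hoeffding_epsilon(R, delta, state.n_examples())
        if state.score(steps[0]) - state.score(steps[1]) > eps:
            state.apply(steps[0])                 # a step we can take with high confidence
    return state

Monitoring and revising old decisions (the last bullet above) is elided here; the Alternate Searches slide sketches how it is done for trees.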

Page 27:

How Much Data is Enough?

[Figure: two candidate search steps, X1? and Xd?, scored on the training data (1.65 and 1.38).]

Page 28:

How Much Data is Enough?

[Figure: the same two candidate steps, X1? and Xd?, scored on a sample of the data: 1.6 ± ε versus 1.4 ± ε.]

• Use statistical bounds
  – Normal distribution
  – Hoeffding bound: ε = sqrt(R² ln(1/δ) / (2n))

• Applies to scores that are averages over examples

• Can select a winner if
  – Score1 > Score2 + ε

Page 29:

Global Quality Guarantee

• δ – probability of error in single decision

• b – branching factor of search

• d – depth of search

• c – number of checks for winner

δ* = δ · b · d · c (a union bound over all the individual decisions made during the search)
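As a worked example with hypothetical numbers (not from the slides): with δ = 10^-7 per decision, branching factor b = 100, depth d = 20, and c = 10 checks per decision, δ* = 10^-7 · 100 · 20 · 10 = 2 · 10^-3, i.e. the entire search is correct with probability about 0.998.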

Page 30:

Identical States And Ties

• Fails if states are identical (or nearly so)

• τ – user supplied tie parameter

• Select a winner early if the alternatives differ by less than τ
  – Score1 > Score2 + ε, or ε <= τ
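A one-function Python sketch of the combined test (names are illustrative):

import math

def select_winner(score1, score2, R, delta, tau, n):
    """True if score1 beats score2 by more than the Hoeffding epsilon, or if
    epsilon has shrunk below tau (a near-tie, so either choice is acceptable)."""
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    return score1 > score2 + eps or eps <= tau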

Page 31:

Dealing with Time Changing Concepts

• Maintain a window of the most recent examples
• Keep the model up to date with this window
• Effective when window size is similar to the concept drift rate

• Traditional approach
  – Periodically reapply the learner
  – Very inefficient!

• Our approach
  – Monitor the quality of old decisions as the window shifts
  – Correct decisions in a fine-grained manner
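A minimal sketch of the window bookkeeping this implies (illustrative, not the actual CVFDT code): each arriving example updates the statistics, and the example that falls out of the window is subtracted back out, so old decisions can be re-checked against the current statistics.

from collections import deque

class WindowedStats:
    """Class counts over a sliding window of the most recent examples."""
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.class_counts = {}

    def add(self, x, y):
        self.window.append((x, y))
        self.class_counts[y] = self.class_counts.get(y, 0) + 1
        if len(self.window) > self.window_size:
            _, old_y = self.window.popleft()      # forget the oldest example
            self.class_counts[old_y] -= 1

In CVFDT, per-window statistics like these are kept at internal nodes as well as leaves, which is what makes it possible to notice that a split chosen long ago is no longer the best one.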

Page 32:

Alternate Searches

• When a new test looks better, grow an alternate sub-tree
• Replace the old sub-tree when the new one is more accurate
• This smoothly adjusts to changing concepts

[Figure: a tree rooted at a Gender? test with false/true leaves, alongside alternate sub-trees rooted at other tests (Pets?, College?, Hair?) being grown in parallel.]
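A rough Python sketch of the alternate sub-tree mechanism (field and method names are our own, invented for illustration): a node tracks the accuracy of its current sub-tree and of an alternate being grown in parallel, and the caller swaps in the alternate once it proves more accurate on recent examples.

class Node:
    def __init__(self, test=None, prediction=None):
        self.test = test               # attribute tested here (None for a leaf)
        self.prediction = prediction
        self.children = {}
        self.alternate = None          # alternate sub-tree being grown, if any
        self.correct = 0               # recent accuracy of this sub-tree
        self.alt_correct = 0           # recent accuracy of the alternate
        self.seen = 0

def monitor(node, x, y, predict, min_examples=1000):
    """Score the node and its alternate on one labeled example; return the
    alternate if it has become the more accurate sub-tree, else the node."""
    node.seen += 1
    node.correct += int(predict(node, x) == y)
    if node.alternate is not None:
        node.alt_correct += int(predict(node.alternate, x) == y)
        if node.seen >= min_examples and node.alt_correct > node.correct:
            return node.alternate      # caller replaces the old sub-tree
    return node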

Page 33:

RAM Limitations

• Each search requires a sufficient statistics structure

• Decision Tree
  – O(avc) RAM

• Bayesian Network
  – O(c^p) RAM
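As a rough worked example (our numbers, reading a, v, c as attributes, values per attribute, and classes): with the synthetic domains from the experiments (100 binary attributes, 2 classes), each active leaf needs on the order of 100 · 2 · 2 = 400 counters, so 100,000 active leaves already hold about 40 million counters. This is why less promising leaves are temporarily deactivated when memory runs low, as the next slide illustrates.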

Page 34:

RAM Limitations

[Figure: parts of the model marked Active versus Temporarily inactive, illustrating how less promising portions are deactivated to fit in RAM.]

Page 35:

Outline

• Introduction

• Data Mining as Discrete Search

• Our Framework for Scaling

• Application to Decision Trees

• Other Applications and Results

• Conclusion

Page 36:

Applications

• VFDT (KDD ’00) – Decision Trees
• CVFDT (KDD ’01) – VFDT + concept drift
• VFBN & VFBN2 (KDD ’02) – Bayesian Networks

• Continuous Searches
  – VFKM (ICML ’01) – k-means clustering
  – VFEM (NIPS ’01) – EM for mixtures of Gaussians

• Relational Data Sets
  – VFREL (submitted) – Feature selection in relational data

Page 37:

CVFDT Experiments

Page 38:

Activity Profile for VFBN

Page 39:

Other Real World Data Sets

• Trace of all Web requests from the UW campus
  – Use clustering to find good locations for proxy caches

• KDD Cup 2000 data set
  – 700k page requests from an e-commerce site
  – Categorize pages into 65 categories; predict which a session will visit

• UW CSE data set
  – 8 million sessions over two years
  – Predict which of 80 level-2 directories each visits

• Web crawl of .edu sites
  – Two data sets, each with two million Web pages
  – Use relational structure to predict which will increase in popularity over time

Page 40:

Related Work

• DB Mine: A Performance Perspective (Agrawal, Imielinski, Swami ’93)
  – Framework for scaling rule learning

• RainForest (Gehrke, Ramakrishnan, Ganti ’98)
  – Framework for scaling decision trees

• ADtrees (Moore, Lee ’97)
  – Accelerate computing sufficient statistics

• PALO (Greiner ’92)
  – Accelerate hill-climbing search via sampling

• DEMON (Ganti, Gehrke, Ramakrishnan ’00)
  – Framework for converting incremental algorithms for time-changing data streams

Page 41:

Future Work

• Combine the framework for discrete search with the frameworks for continuous search and relational learning

• Further study time-changing processes

• Develop a language for specifying data stream learning algorithms

• Use the framework to develop novel algorithms for massive data streams

• Apply the algorithms to more real-world problems

Page 42:

Conclusion

• Framework helps scale up learning algorithms based on discrete search

• Resulting algorithms:
  – Work on databases and data streams
  – Work with limited resources
  – Adapt to time-changing concepts
  – Learn in time proportional to concept complexity
    • Independent of the amount of training data!

• Benefits have been demonstrated in a series of applications