Bellwether Analysis: Predicting Global Aggregates from Local Regions

Raghu Ramakrishnan
Yahoo! Research and University of Wisconsin-Madison

Bee-Chung Chen, Jude Shavlik, Pradeep Tamma
University of Wisconsin-Madison

Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma. Bellwether Cubes, VLDB 2006.

Motivating Example

• A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie) by using its historical database

– By looking at the features and profits of previous (similar) movies, we want to predict the expected total profit (total US sales at the end of the release year) for the new movie

• Wait a year and write a query! If you can’t wait, read this paper

– The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods).

• Example “region-based” features: 1st week sales in Peoria, week-to-week sales growth in Wisconsin, etc.

• Gathering this data has a cost (e.g., marketing expenses, waiting time)

• Problem statement: Find the most predictive region features that can be obtained within a given “cost budget”


Key Ideas

• Large datasets are rarely labeled with the targets that we wish to learn to predict
  – But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining

• We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  – The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case


Outline

• Motivating example
• Basic bellwether analysis
• Subset bellwether analysis
  – Bellwether trees
  – Bellwether cubes
• Experimental results
• Conclusion


Motivating Example

• A company wants to predict the first year’s worldwide profit for a new item, by using its historical database

• Database Schema:

Profit Table: Time, Location, CustID, ItemID, Profit

Item Table: ItemID, Category, R&D Expense

Ad Table: Time, Location, ItemID, AdExpense, AdSize

• The combination of the underlined attributes forms a key


A Straightforward Approach

• Build a regression model to predict item profit

• There is much room for accuracy improvement!

By joining and aggregating tables in the historical database (Profit Table, Item Table, Ad Table) we can create a training set:

ItemID | Category | R&D Expense | Profit
1      | Laptop   | 500K        | 12,000K
2      | Desktop  | 100K        | 8,000K
…      | …        | …           | …

Category and R&D Expense are the item-table features; Profit is the target.

An example regression model:

$\text{Profit} = \beta_0 + \beta_1\,\text{Laptop} + \beta_2\,\text{Desktop} + \beta_3\,\text{RdExpense}$
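The join-and-aggregate step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy rows, the tuple layout, and the function name `build_training_set` are assumptions made for the example. It sums profit per item from the Profit table and joins the result with the Item table to produce one feature vector and one target per item.

```python
from collections import defaultdict

# Hypothetical toy rows; the column order follows the slide's schema.
# Profit table: (time_week, location, cust_id, item_id, profit)
profit_rows = [
    (1, "WI", 101, 1, 40.0),
    (1, "KR", 102, 1, 60.0),
    (2, "WI", 103, 2, 25.0),
    (3, "MD", 104, 2, 55.0),
]
# Item table: item_id -> (category, rd_expense)
item_table = {1: ("Laptop", 500.0), 2: ("Desktop", 100.0)}


def build_training_set(profit_rows, item_table):
    """Aggregate profit per item, then join with the item table."""
    total_profit = defaultdict(float)
    for _time, _loc, _cust, item_id, profit in profit_rows:
        total_profit[item_id] += profit  # Sum(Profit) grouped by ItemID

    rows = []
    for item_id, (category, rd_expense) in sorted(item_table.items()):
        features = [
            1.0,                                   # intercept term
            1.0 if category == "Laptop" else 0.0,  # Laptop indicator
            1.0 if category == "Desktop" else 0.0, # Desktop indicator
            rd_expense,                            # RdExpense
        ]
        rows.append((item_id, features, total_profit[item_id]))
    return rows


training_set = build_training_set(profit_rows, item_table)
for item_id, features, target in training_set:
    print(item_id, features, target)
```

Any regression learner can then be trained on the `(features, target)` pairs; the choice of learner is left open here.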


Using Regional Features

• Example region: [1st week, Korea]
• Regional features:
  – Regional Profit: The 1st week profit in Korea
  – Regional Ad Expense: The 1st week ad expense in Korea
• A possibly more accurate model:

$\text{Profit}_{[1\text{yr},\,\text{All}]} = \beta_0 + \beta_1\,\text{Laptop} + \beta_2\,\text{Desktop} + \beta_3\,\text{RdExpense} + \beta_4\,\text{Profit}_{[1\text{wk},\,\text{KR}]} + \beta_5\,\text{AdExpense}_{[1\text{wk},\,\text{KR}]}$

• Problem: Which region should we use?
  – The smallest region that improves the accuracy the most
  – We give each candidate region a cost
  – The most “cost-effective” region is the bellwether region

Bellwether Analysis

Basic Bellwether Problem



• Historical database: $DB$
• Training item set: $I$
• Candidate region set: $R$
  – E.g., { [1-n week, Location] }
• Target generation query: $\tau_i(DB)$ returns the target value of item $i \in I$
  – E.g., $\text{sum}(\text{Profit})\ \sigma_{i,\,[1\text{-}52,\,\text{All}]}\ \text{ProfitTable}$
• Feature generation query: $\phi_{i,r}(DB)$, $i \in I_r$ and $r \in R$
  – $I_r$: The set of items in region $r$
  – E.g., $[\text{Category}_i,\ \text{RdExpense}_i,\ \text{Profit}_{i,[1\text{-}n,\,\text{Loc}]},\ \text{AdExpense}_{i,[1\text{-}n,\,\text{Loc}]}]$
• Cost query: $\kappa_r(DB)$, $r \in R$, the cost of collecting data from $r$
• Predictive model: $h_r(x)$, $r \in R$, trained on $\{(\phi_{i,r}(DB), \tau_i(DB)) : i \in I_r\}$
  – E.g., linear regression model

[Figure: Location domain hierarchy. Levels: All, Country, State. Example values: All at the top; countries CA, US, KR; states AL, WI]



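[Figure: candidate regions drawn on a grid of weeks (1, 2, 3, 4, 5, …, 52) by locations (KR, USA, …, WI, WY, …)]

Features $\phi_{i,r}(DB)$ (feature generation query): aggregate over data records in region r = [1-2, USA]

ItemID | Category | … | Profit[1-2,USA] | …
…      | …        | … | …               | …
i      | Desktop  |   | 45K             |
…      | …        | … | …               | …

Target $\tau_i(DB)$ (target generation query): Total Profit in [1-52, All]

ItemID | Total Profit
…      | …
i      | 2,000K
…      | …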

For each region r, build a predictive model $h_r(x)$, and then choose the bellwether region $r$ such that:

• Coverage(r), the fraction of all items in the region, $\ge$ minimum coverage support
• Cost(r, DB) $\le$ cost threshold
• Error($h_r$) is minimized
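The search just described can be sketched in a few lines. This is only an illustration under stated assumptions: the toy `sales` data, the cost model `LOCATION_RATE`, the one-feature linear model, and the use of training RMSE as the error measure are all hypothetical choices made for the example, not details taken from the slides.

```python
import math

# Hypothetical toy data: sales[item][(week, location)] = profit in that week.
sales = {
    "A": {(1, "WI"): 10.0, (2, "WI"): 12.0, (1, "MD"): 5.0, (2, "MD"): 9.0},
    "B": {(1, "WI"): 20.0, (2, "WI"): 18.0, (1, "MD"): 11.0, (2, "MD"): 3.0},
    "C": {(1, "WI"): 30.0, (2, "WI"): 33.0, (1, "MD"): 14.0, (2, "MD"): 16.0},
    "D": {(1, "WI"): 40.0, (2, "WI"): 41.0, (1, "MD"): 22.0, (2, "MD"): 2.0},
}
target = {"A": 100.0, "B": 200.0, "C": 300.0, "D": 400.0}  # total profit per item
LOCATION_RATE = {"WI": 2.0, "MD": 1.0}  # assumed cost per week of data collection
REGIONS = [(n, loc) for loc in ("WI", "MD") for n in (1, 2)]  # [1-n week, loc]


def cost(region):
    n_weeks, loc = region
    return n_weeks * LOCATION_RATE[loc]


def items_in(region):
    """Items that have at least one record inside the region."""
    n_weeks, loc = region
    return sorted(i for i, recs in sales.items()
                  if any(l == loc and w <= n_weeks for (w, l) in recs))


def regional_profit(item, region):
    """Feature: Sum(Profit) of the item over weeks 1..n at the location."""
    n_weeks, loc = region
    return sum(p for (w, l), p in sales[item].items() if l == loc and w <= n_weeks)


def fit_rmse(xs, ys):
    """Fit y = a + b*x by least squares and return the training RMSE."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / n)


def find_bellwether(regions, budget, min_coverage):
    """Return (region, error) of the lowest-error feasible region, or None."""
    best = None
    for region in regions:
        items = items_in(region)
        if not items or cost(region) > budget:
            continue  # over budget, or no data at all
        if len(items) / len(sales) < min_coverage:
            continue  # too few items covered
        err = fit_rmse([regional_profit(i, region) for i in items],
                       [target[i] for i in items])
        if best is None or err < best[1]:
            best = (region, err)
    return best


print(find_bellwether(REGIONS, budget=3.0, min_coverage=0.5))
```

A held-out or cross-validated error would be a more robust choice than training RMSE; the training error is used here only to keep the sketch short.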


Experiment on a Mail Order Dataset

[Figure: Error-vs-Budget plot. X-axis: Budget (5 to 85). Y-axis: RMSE (Root Mean Square Error), 0 to 30,000. Series: Bel Err, Avg Err, Smp Err. Annotated region: [1-8 month, MD]]

• Bel Err: The error of the bellwether region found using a given budget

• Avg Err: The average error of all the cube regions with costs under a given budget

• Smp Err: The error of a set of randomly sampled (non-cube) regions with costs under a given budget


Experiment on a Mail Order Dataset

[Figure: Uniqueness plot. X-axis: Budget (5 to 85). Y-axis: Fraction of indistinguishables (0 to 0.9). Annotated region: [1-8 month, MD]]

• Y-axis: Fraction of regions that are as good as the bellwether region
  – The fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region

• We have 99% confidence that [1-8 month, MD] is a quite unique bellwether region


Basic Bellwether Computation

• OLAP-style bellwether analysis
  – Candidate regions: Regions in a data cube
  – Queries: OLAP-style aggregate queries
    • E.g., Sum(Profit) over a region

• Efficient computation:
  – Use iceberg cube techniques to prune infeasible regions (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
    • Infeasible regions: Regions with cost > B or coverage < C
  – Share computation by generating the features and target values for all the feasible regions all together
    • Exploit distributive and algebraic aggregate functions
    • Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation

[Figure: candidate regions on a grid of weeks (1 to 52) by locations (KR, …, USA, WI, …, WY)]
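The idea of sharing computation across regions can be illustrated with Sum(Profit), which is distributive. The sketch below is an assumption-laden toy, not the paper's algorithm: the `PARENT` hierarchy, the row layout, and the function names are hypothetical, and iceberg-style pruning is not shown. One scan of the rows rolls each record up the location hierarchy, and running sums then give Profit[1-n, loc] for every n without rescanning.

```python
from collections import defaultdict

# Hypothetical location hierarchy: state -> country; every chain ends at "All".
PARENT = {"WI": "USA", "WY": "USA"}


def ancestors(loc):
    """Return the location followed by all of its ancestors up to 'All'."""
    chain = [loc]
    while loc in PARENT:
        loc = PARENT[loc]
        chain.append(loc)
    chain.append("All")
    return chain


def all_region_profits(rows, n_weeks):
    """Compute Profit[1-n, loc] for every item, n, and hierarchy level.

    rows: iterable of (week, location, item, profit), scanned exactly once.
    Returns a dict keyed by (item, n, loc).
    """
    weekly = defaultdict(float)  # (item, loc, week) -> profit in that week
    for week, loc, item, profit in rows:
        for level in ancestors(loc):  # roll up: Sum is distributive
            weekly[(item, level, week)] += profit

    cumulative = {}
    for item, loc in {(i, l) for (i, l, _w) in weekly}:
        running = 0.0
        for n in range(1, n_weeks + 1):  # prefix sums over weeks 1..n
            running += weekly.get((item, loc, n), 0.0)
            cumulative[(item, n, loc)] = running
    return cumulative


rows = [
    (1, "WI", "i", 10.0),
    (2, "WI", "i", 5.0),
    (1, "WY", "i", 3.0),
    (2, "KR", "i", 7.0),
]
profits = all_region_profits(rows, n_weeks=2)
print(profits[("i", 2, "USA")], profits[("i", 2, "All")])
```

The same pattern applies to any distributive or algebraic aggregate: keep a small summary per finest-grained cell and combine summaries instead of rescanning the data.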

Bellwether Analysis

Subset Bellwether Problem


Subset-Based Bellwether Prediction

• Motivation: Different subsets of items may have different bellwether regions
  – E.g., the bellwether region for laptops may be different from the bellwether region for clothes

• Two approaches: Bellwether Tree and Bellwether Cube

[Figure: Bellwether Tree. Splits on R&D Expense (threshold 50K, Yes/No branches) and on Category (Desktop, Laptop); the leaves are the bellwether regions [1-2, WI], [1-3, MD], and [1-1, NY]]

Bellwether Cube (rows: Category; columns: R&D Expenses):

Category         | Low      | Medium   | High
Software, OS     | [1-3,CA] | [1-1,NY] | [1-2,CA]
…                | …        | …        | …
Hardware, Laptop | [1-4,MD] | [1-1,NY] | [1-3,WI]
…                | …        | …        | …


Bellwether Tree

• How to build a bellwether tree
  – Similar to regression tree construction
  – Starting from the root node, recursively split the current leaf node using the “best split criterion”
    • A split criterion partitions a set of items into disjoint subsets
    • Pick the split that reduces the error the most
  – Stop splitting when the number of items in the current leaf node falls under a threshold value
  – Prune the tree to avoid overfitting

[Figure: example bellwether tree (splits on R&D Expense at 50K with Yes/No branches, and on Category: Desktop vs. Laptop; leaf regions [1-2, WI], [1-3, MD], [1-1, NY]) and a schematic tree with nodes numbered 1 to 9]
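One step of the construction, choosing the best split for a single node, might look like the sketch below. Everything concrete in it is an assumption made for illustration: the toy items, the two candidate regions "WI" and "MD", the one-feature linear model, and the use of training SSE as the error. For each candidate split it partitions the items, finds the bellwether region of each part, and keeps the split with the lowest total error.

```python
def sse_of_fit(xs, ys):
    """Fit y = a + b*x by least squares and return the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def bellwether(items, regions):
    """Return (region, sse) of the lowest-error region for this item subset."""
    scored = [(sse_of_fit([it["x"][r] for it in items],
                          [it["y"] for it in items]), r) for r in regions]
    sse, region = min(scored)
    return region, sse


def best_split(items, regions, splits):
    """Return (split name, total child SSE, {branch: bellwether region})."""
    best = None
    for name, key_fn in splits.items():
        parts = {}
        for it in items:  # a split criterion partitions the items
            parts.setdefault(key_fn(it), []).append(it)
        if len(parts) < 2:
            continue  # not a real split
        leaves = {b: bellwether(subset, regions) for b, subset in parts.items()}
        total = sum(sse for _region, sse in leaves.values())
        if best is None or total < best[1]:
            best = (name, total, {b: r for b, (r, _sse) in leaves.items()})
    return best


# Hypothetical items: "x" holds one regional feature per candidate region.
items = [
    {"cat": "Laptop", "rd": 30, "x": {"WI": 1.0, "MD": 5.0}, "y": 10.0},
    {"cat": "Laptop", "rd": 60, "x": {"WI": 2.0, "MD": 1.0}, "y": 20.0},
    {"cat": "Laptop", "rd": 90, "x": {"WI": 3.0, "MD": 4.0}, "y": 30.0},
    {"cat": "Desktop", "rd": 40, "x": {"WI": 4.0, "MD": 1.0}, "y": 5.0},
    {"cat": "Desktop", "rd": 70, "x": {"WI": 1.0, "MD": 2.0}, "y": 10.0},
    {"cat": "Desktop", "rd": 20, "x": {"WI": 3.0, "MD": 3.0}, "y": 15.0},
]
regions = ["WI", "MD"]
splits = {
    "rd<=50": lambda it: it["rd"] <= 50,
    "category": lambda it: it["cat"],
}
print(best_split(items, regions, splits))
```

A full tree builder would apply `best_split` recursively, stop when a node holds too few items, and prune afterwards, as the bullets above describe.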


Problem of Naïve Tree Construction

• A naïve bellwether tree construction algorithm will scan the dataset n × m times
  – n is the number of nodes
  – m is the number of candidate split criteria

• Idea: Extend the RainForest framework [Gehrke et al., 98]

[Figure: schematic tree with nodes numbered 1 to 9. For each node: try all candidate split criteria to find the best one, which requires scanning the dataset m times]


Bellwether Cube

Bellwether cube at the finer level (rows: Category; columns: R&D Expenses):

Category          | Low            | Medium         | High
Software, OS      | [1-3,CA]: 0.05 | [1-1,NY]: 0.03 | [1-2,CA]: 0.02
…                 | …              | …              | …
Hardware, Laptop  | NULL           | [1-1,NY]: 0.02 | [1-4,WI]: 0.03
Hardware, Desktop | [1-4,MD]: 0.17 | [1-4,WA]: 0.01 | NULL
…                 | …              | …              | …

Rollup to the coarser level (Drilldown goes the other way):

Category | Low            | Medium         | High
Software | [1-4,CA]: 0.10 | [1-2,CA]: 0.05 | [1-2,CA]: 0.03
Hardware | [1-4,MD]: 0.08 | [1-1,IL]: 0.03 | [1-4,WI]: 0.05
…        | …              | …              | …

The number in a cell is the error of the bellwether region for that subset of items.

Item Hierarchy: Category
• Level hierarchy: All (Any), Division, Category
• Tree: Any; Software, Hardware; Desktop, Laptop, Others

Item Hierarchy: R&D Expense
• Level hierarchy: All (Any), Range, Expense
• Tree: Any; Low, Medium, High, with boundaries at 100K and 1M


Problem of Naïve Cube Construction

• A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
  – A basic bellwether search involves building a model for each candidate region

[Figure: bellwether cube lattice over Category (Software: OS, …; Hardware: Laptop, …; Any) and R&D Expenses (Low, Medium, High; Any). For each cell: build a model for each candidate region]


Efficient Cube Construction

• Idea: Transform model construction into computation of distributive or algebraic aggregate functions
  – Let $S_1, \dots, S_n$ partition $S$
    • $S = S_1 \cup \dots \cup S_n$ and $S_i \cap S_j = \emptyset$ for $i \ne j$
  – Distributive function: $f(S) = F(\{f(S_1), \dots, f(S_n)\})$
    • E.g., $\text{Count}(S) = \text{Sum}(\{\text{Count}(S_1), \dots, \text{Count}(S_n)\})$
  – Algebraic function: $f(S) = F(\{G(S_1), \dots, G(S_n)\})$
    • $G(S_i)$ returns a fixed-length vector of values
    • E.g., $\text{Avg}(S) = F(\{G(S_1), \dots, G(S_n)\})$
      – $G(S_i) = [\text{Sum}(S_i), \text{Count}(S_i)]$
      – $F(\{[a_1, b_1], \dots, [a_n, b_n]\}) = \text{Sum}(\{a_i\}) / \text{Sum}(\{b_i\})$
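The Avg example above translates directly into code. The sketch below is illustrative only; the function names `summarize` (playing the role of G) and `combine` (playing the role of F) and the sample partitions are assumptions made for the example.

```python
def summarize(values):
    """G(S_i): a fixed-length summary [Sum(S_i), Count(S_i)] of one partition."""
    return (sum(values), len(values))


def combine(summaries):
    """F: merge per-partition summaries into Avg(S) without revisiting the data."""
    total = sum(s for s, _count in summaries)
    count = sum(c for _sum, c in summaries)
    return total / count


# Hypothetical partition S_1, S_2, S_3 of a set S.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
avg_from_parts = combine([summarize(p) for p in partitions])

# Direct computation over the whole set, for comparison.
flat = [v for p in partitions for v in p]
avg_direct = sum(flat) / len(flat)

# Count is distributive: the counts of the parts simply add up.
count_from_parts = sum(len(p) for p in partitions)

print(avg_from_parts, avg_direct, count_from_parts)
```

The point of the fixed-length summary is that a higher-level cube cell only needs the summaries of its children, never their raw records.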



• Build models for each finest-grained cell
• For higher-level cells, use data cube computation techniques to compute the aggregate functions

[Figure: bellwether cube lattice over Category (Software: OS, …; Hardware: Laptop, …; Any) and R&D Expenses (Low, Medium, High; Any). For each finest-grained cell: build models to find the bellwether region. For each higher-level cell: compute aggregate functions to find the bellwether region]



• Classification models:
  – Use the prediction cube [Chen et al., 05] execution framework

• Regression models: (Weighted linear regression model; builds on work in Chen-Dong-Han-Wah-Wang, VLDB 02)
  – Having the sum of squared error (SSE) for each candidate region is sufficient to find the bellwether region
  – SSE(S) is an algebraic function, where S is a set of items
  – $\text{SSE}(S) = q(\{ g(S_k) : k = 1, \dots, n \})$
    • $S_1, \dots, S_n$ partition $S$
    • $g(S_k) = \langle Y_k' W_k Y_k,\ X_k' W_k X_k,\ X_k' W_k Y_k \rangle$
    • $q(\{\langle A_k, B_k, C_k \rangle : k = 1, \dots, n\}) = \sum_k A_k - \left(\sum_k C_k\right)' \left(\sum_k B_k\right)^{-1} \left(\sum_k C_k\right)$

where $Y_k$ is the vector of target values for set $S_k$ of items, $X_k$ is the matrix of features for set $S_k$ of items, and $W_k$ is the weight matrix for set $S_k$ of items.
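The algebraic form of SSE can be checked on a small case. The sketch below restricts the model to an intercept plus one feature so that the matrix inverse is a 2x2 formula; that restriction, the diagonal weights, the function names, and the sample data are all assumptions made for illustration. `stats` plays the role of g and `sse_from_stats` the role of q.

```python
def stats(xs, ys, ws):
    """g(S_k) for the model y = b0 + b1*x with diagonal weights ws.

    Returns (A, B, C) = (Y'WY, X'WX, X'WY), where each row of X is [1, x].
    """
    sw = sum(ws)
    swx = sum(w * x for x, w in zip(xs, ws))
    swxx = sum(w * x * x for x, w in zip(xs, ws))
    a = sum(w * y * y for y, w in zip(ys, ws))
    b = [[sw, swx], [swx, swxx]]
    c = [sum(w * y for y, w in zip(ys, ws)),
         sum(w * x * y for x, y, w in zip(xs, ys, ws))]
    return a, b, c


def sse_from_stats(parts):
    """q: SSE(S) = sum A_k - (sum C_k)' (sum B_k)^(-1) (sum C_k)."""
    a = sum(p[0] for p in parts)
    b = [[sum(p[1][i][j] for p in parts) for j in range(2)] for i in range(2)]
    c = [sum(p[2][i] for p in parts) for i in range(2)]
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]  # assumed non-zero here
    # beta = B^(-1) C, using the explicit 2x2 inverse
    beta0 = (b[1][1] * c[0] - b[0][1] * c[1]) / det
    beta1 = (b[0][0] * c[1] - b[1][0] * c[0]) / det
    return a - (c[0] * beta0 + c[1] * beta1)


# Hypothetical partition of four items into S_1 and S_2, unit weights.
s1 = stats([0.0, 1.0], [0.0, 1.0], [1.0, 1.0])
s2 = stats([2.0, 3.0], [0.0, 1.0], [1.0, 1.0])
sse_merged = sse_from_stats([s1, s2])

# The same four items treated as a single set.
whole = stats([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0], [1.0] * 4)
sse_whole = sse_from_stats([whole])

print(sse_merged, sse_whole)
```

Because the three statistics add across partitions, a higher-level cell can obtain the SSE of every candidate region from its children's statistics without refitting on the raw items.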

Bellwether Analysis

Experimental Results


Experimental Results: Summary

• We have shown the existence of bellwether regions on a real mail-order dataset

• We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets

• We show our computation techniques improve efficiency by orders of magnitude

• We show our computation techniques scale linearly in the size of the dataset


Characteristics of Bellwether Trees & Cubes

Dataset generation:
• Use a random tree to generate different bellwether regions for different subsets of items

Parameters:
• Noise
• Concept complexity: # of tree nodes

Result:
• Bellwether trees & cubes have better accuracy than basic bellwether search
• Increasing noise increases error
• Increasing complexity increases error

[Figure: RMSE (0 to 3) vs. Noise (0.05, 0.5, 1, 2) for basic, cube, and tree; 15 nodes]

[Figure: RMSE (0 to 2) vs. Number of nodes (3, 7, 15, 31, 63) for basic, cube, and tree; noise level: 0.5]


Efficiency Comparison

[Figure: running time in seconds (0 to 3000) vs. thousands of examples (100 to 300). Naïve computation methods: naive cube, naive tree. Our computation techniques: RF tree, single-scan cube, optimized cube]


Scalability

[Figure: running time in seconds (0 to 1200) vs. millions of examples (2.5 to 10) for single-scan cube and optimized cube]

[Figure: running time in seconds (0 to 7000) vs. millions of examples (2.5 to 10) for RF tree]


Conclusion

• Promising data mining paradigm:
  – Using OLAP queries to generate features and even targets for mining
  – Using data-mining models as building blocks in the mining process, rather than thinking of them as the end result
  – Exploiting the nested structure of OLAP queries to achieve efficient computation

[Figure: diagram relating Database, Multi-Dimensional View, Subset Selection & Aggregation, and Data Mining]
