Bellwether Analysis: Predicting Global Aggregates from Local Regions
Raghu Ramakrishnan
Yahoo! Research, University of Wisconsin-Madison
Bee-Chung Chen, Jude Shavlik, Pradeep Tamma
University of Wisconsin-Madison
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma. Bellwether Cubes, VLDB 2006.
Motivating Example
• A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie) by using its historical database
– By looking at the features and profits of previous (similar) movies, we want to predict the expected total profit (total US sales at the end of the release year) for the new movie
• Wait a year and write a query! If you can’t wait, read this paper
– The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods).
• Example “region-based” features: 1st week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
• Gathering this data has a cost (e.g., marketing expenses, waiting time)
• Problem statement: Find the most predictive region features that can be obtained within a given “cost budget”
Key Ideas
• Large datasets are rarely labeled with the targets that we wish to learn to predict
– But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining
• We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
– The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case
Outline
• Motivating example
• Basic bellwether analysis
• Subset bellwether analysis
– Bellwether trees
– Bellwether cubes
• Experimental results
• Conclusion
Motivating Example
• A company wants to predict the first year’s worldwide profit for a new item, by using its historical database
• Database Schema:
Profit Table (Time, Location, CustID, ItemID, Profit)
Item Table (ItemID, Category, R&D Expense)
Ad Table (Time, Location, ItemID, AdExpense, AdSize)
• The combination of the underlined attributes forms a key
A Straightforward Approach
• Build a regression model to predict item profit
• By joining and aggregating tables in the historical database we can create a training set:

Profit Table (Time, Location, CustID, ItemID, Profit)
Item Table (ItemID, Category, R&D Expense)
Ad Table (Time, Location, ItemID, AdExpense, AdSize)

ItemID  Category  R&D Expense  Profit
1       Laptop    500K         12,000K
2       Desktop   100K         8,000K
…       …         …            …

(Category and R&D Expense are the item-table features; Profit is the target.)

• An example regression model: $\text{Profit} = \beta_0 + \beta_1\,\text{Laptop} + \beta_2\,\text{Desktop} + \beta_3\,\text{RdExpense}$
• There is much room for accuracy improvement!
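The join-and-aggregate step that produces the training set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy rows, the dictionary layout, and the function name `build_training_set` are assumptions made for the example, and the values are made up so that the totals match the table above.

```python
from collections import defaultdict

# Hypothetical toy rows mirroring the slide's schema (values are made up).
profit_table = [
    # (time_week, location, cust_id, item_id, profit in K)
    (1, "WI", 101, 1, 7000.0),
    (2, "MD", 102, 1, 5000.0),
    (1, "WI", 103, 2, 8000.0),
]
item_table = {
    # item_id: (category, rd_expense in K)
    1: ("Laptop", 500.0),
    2: ("Desktop", 100.0),
}


def build_training_set(profit_rows, items):
    """Join Profit and Item on ItemID and aggregate profit per item."""
    total_profit = defaultdict(float)
    for _time, _loc, _cust, item_id, profit in profit_rows:
        total_profit[item_id] += profit  # sum(Profit) grouped by ItemID

    rows = []
    for item_id, (category, rd_expense) in sorted(items.items()):
        features = [
            1.0,                                    # intercept term (beta_0)
            1.0 if category == "Laptop" else 0.0,   # Laptop indicator (beta_1)
            1.0 if category == "Desktop" else 0.0,  # Desktop indicator (beta_2)
            rd_expense,                             # RdExpense (beta_3)
        ]
        rows.append((item_id, features, total_profit[item_id]))
    return rows


training_set = build_training_set(profit_table, item_table)
for item_id, features, target in training_set:
    print(item_id, features, target)
```

Each returned row pairs a feature vector with the target, so any regression routine can be fit on it afterwards.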
Using Regional Features
• Example region: [1st week, Korea]
• Regional features:
– Regional Profit: The 1st week profit in Korea
– Regional Ad Expense: The 1st week ad expense in Korea
• A possibly more accurate model:
$\text{Profit}[\text{1yr, All}] = \beta_0 + \beta_1\,\text{Laptop} + \beta_2\,\text{Desktop} + \beta_3\,\text{RdExpense} + \beta_4\,\text{Profit}[\text{1wk, KR}] + \beta_5\,\text{AdExpense}[\text{1wk, KR}]$
• Problem: Which region should we use?
– The smallest region that improves the accuracy the most
– We give each candidate region a cost
– The most “cost-effective” region is the bellwether region
Bellwether Analysis
Basic Bellwether Problem
Basic Bellwether Problem
• Historical database: $DB$
• Training item set: $I$
• Candidate region set: $R$
– E.g., { [1-n week, Location] }
• Target generation query: $\tau_i(DB)$ returns the target value of item $i \in I$
– E.g., sum(Profit) over the ProfitTable records of item $i$ in region [1-52, All]
• Feature generation query: $\phi_{i,r}(DB)$, $i \in I_r$ and $r \in R$
– $I_r$: The set of items in region $r$
– E.g., $[\text{Category}_i,\ \text{RdExpense}_i,\ \text{Profit}_{i,[1\text{-}n,\ \text{Loc}]},\ \text{AdExpense}_{i,[1\text{-}n,\ \text{Loc}]}]$
• Cost query: $\kappa_r(DB)$, $r \in R$, the cost of collecting data from $r$
• Predictive model: $h_r(x)$, $r \in R$, trained on $\{(\phi_{i,r}(DB),\ \tau_i(DB)) : i \in I_r\}$
– E.g., linear regression model
Figure: Location domain hierarchy, with levels All, Country (e.g., CA, US, KR), and State (e.g., AL, WI).
Basic Bellwether Problem
Figure: candidate regions laid out as a grid of time (weeks 1, 2, 3, 4, 5, …, 52) by location (KR, USA, …, WI, WY, …).

Features $\phi_{i,r}(DB)$: aggregate over data records in region r = [1-2, USA]

ItemID  Category  …  Profit[1-2,USA]  …
…       …         …  …                …
i       Desktop      45K
…       …         …  …                …

Target $\tau_i(DB)$: total profit in [1-52, All]

ItemID  Total Profit
…       …
i       2,000K
…       …

For each region r, build a predictive model $h_r(x)$, and then choose as the bellwether region the r such that:
• Coverage(r), the fraction of all items in the region, ≥ minimum coverage support
• Cost(r, DB) ≤ cost threshold
• Error($h_r$) is minimized
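The selection rule above can be sketched as a small search loop. This is a simplified sketch under stated assumptions, not the authors' code: each region contributes a single regional feature, the model is a one-variable least-squares line, and the error is the training RMSE rather than a cross-validated estimate. The names `find_bellwether`, `fit_simple_regression`, and the toy data are hypothetical.

```python
import math


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return mean_y - b * mean_x, b


def rmse(xs, ys, a, b):
    """Root mean square error of the line y = a + b*x on (xs, ys)."""
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))


def find_bellwether(regions, targets, budget, min_coverage):
    """regions: {name: {"cost": float, "features": {item_id: value}}}
    targets: {item_id: target value}. Returns (best_region, best_error)."""
    best = (None, math.inf)
    for name, region in regions.items():
        items = [i for i in region["features"] if i in targets]
        coverage = len(items) / len(targets)
        if region["cost"] > budget or coverage < min_coverage or len(items) < 2:
            continue  # infeasible: too costly, too little coverage, or too few items
        xs = [region["features"][i] for i in items]
        ys = [targets[i] for i in items]
        a, b = fit_simple_regression(xs, ys)
        err = rmse(xs, ys, a, b)  # training error, used here as a stand-in
        if err < best[1]:
            best = (name, err)
    return best


# Made-up data: four items and four candidate regions.
targets = {1: 100.0, 2: 200.0, 3: 300.0, 4: 400.0}
regions = {
    "[1-1, WI]": {"cost": 5.0, "features": {1: 10.0, 2: 30.0, 3: 20.0, 4: 45.0}},
    "[1-2, MD]": {"cost": 8.0, "features": {1: 11.0, 2: 21.0, 3: 31.0, 4: 41.0}},
    "[1-52, All]": {"cost": 90.0, "features": {1: 100.0, 2: 200.0, 3: 300.0, 4: 400.0}},
    "[1-1, KR]": {"cost": 3.0, "features": {1: 9.0}},
}
best_region, best_error = find_bellwether(regions, targets, budget=10.0, min_coverage=0.5)
print(best_region, best_error)
```

In this toy setup [1-52, All] is excluded by the budget and [1-1, KR] by the coverage constraint, so the search only compares the two remaining regions.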
Experiment on a Mail Order Dataset
Figure: Error-vs-Budget plot. X-axis: Budget (5 to 85). Y-axis: RMSE (Root Mean Square Error, 0 to 30000). Curves: Bel Err, Avg Err, Smp Err. The region [1-8 month, MD] is marked on the plot.
• Bel Err: The error of the bellwether region found using a given budget
• Avg Err: The average error of all the cube regions with costs under a given budget
• Smp Err: The error of a set of randomly sampled (non-cube) regions with costs under a given budget
Experiment on a Mail Order Dataset
Figure: Uniqueness plot. X-axis: Budget (5 to 85). Y-axis: Fraction of indistinguishables (0 to 0.9). The region [1-8 month, MD] is marked on the plot.
• Y-axis: Fraction of regions that are as good as the bellwether region
– The fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
• We have 99% confidence that [1-8 month, MD] is a quite unique bellwether region
Basic Bellwether Computation
• OLAP-style bellwether analysis
– Candidate regions: Regions in a data cube
– Queries: OLAP-style aggregate queries
• E.g., Sum(Profit) over a region
• Efficient computation:
– Use iceberg cube techniques to prune infeasible regions (Beyer-Ramakrishnan, SIGMOD 99; Han-Pei-Dong-Wang, SIGMOD 01)
• Infeasible regions: Regions with cost > B or coverage < C
– Share computation by generating the features and target values for all the feasible regions together
• Exploit distributive and algebraic aggregate functions
• Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
Figure: the grid of candidate regions, time (weeks 1, 2, 3, 4, 5, …, 52) by location (KR, USA, …, WI, WY, …).
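The idea of sharing computation across regions can be sketched as follows. This is an illustrative sketch, not the authors' algorithm: it assumes regions of the form [1-n, location], a two-level location hierarchy given by a hypothetical `state_to_country` mapping, and Sum(Profit) as the only aggregate. The function name `regional_profits` and the rows are made up.

```python
from collections import defaultdict


def regional_profits(profit_rows, state_to_country, max_week):
    """One scan over (week, state, item_id, profit) rows, then roll-ups.
    Returns {(n, location): {item_id: profit summed over weeks 1..n}}."""
    # Finest-grained cells: profit per (week, state, item), filled in one scan.
    cell = defaultdict(float)
    for week, state, item_id, profit in profit_rows:
        cell[(week, state, item_id)] += profit

    # Roll up Location (state -> country). Sum is distributive, so the
    # country-level value is the sum of the state-level values.
    for (week, state, item_id), profit in list(cell.items()):
        cell[(week, state_to_country[state], item_id)] += profit

    # Roll up Time with running sums: region [1-n] reuses region [1-(n-1)].
    result = defaultdict(dict)
    keys = {(loc, item_id) for (_week, loc, item_id) in cell}
    for loc, item_id in keys:
        running = 0.0
        for n in range(1, max_week + 1):
            running += cell.get((n, loc, item_id), 0.0)
            result[(n, loc)][item_id] = running
    return result


rows = [
    (1, "WI", 1, 10.0),
    (2, "WI", 1, 5.0),
    (1, "MD", 1, 7.0),
    (2, "MD", 2, 4.0),
]
result = regional_profits(rows, {"WI": "US", "MD": "US"}, max_week=2)
print(result[(2, "US")])
```

The base table is read once, and every coarser region is derived from already-aggregated cells. Iceberg-style pruning would additionally skip regions whose cost exceeds B or whose coverage falls below C; this sketch does not implement that step.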
Bellwether Analysis
Subset Bellwether Problem
Subset-Based Bellwether Prediction
• Motivation: Different subsets of items may have different bellwether regions
– E.g., The bellwether region for laptops may be different from the bellwether region for clothes
• Two approaches: Bellwether Tree and Bellwether Cube

Figure (Bellwether Tree): internal nodes split on R&D Expense ≤ 50K (Yes/No) and on Category (Desktop/Laptop); the leaves hold bellwether regions [1-2, WI], [1-3, MD], and [1-1, NY].

Figure (Bellwether Cube): rows are Category, columns are R&D Expenses.

Category           Low       Medium    High
Software / OS      [1-3,CA]  [1-1,NY]  [1-2,CA]
Software / …       …         …         …
Hardware / Laptop  [1-4,MD]  [1-1,NY]  [1-3,WI]
Hardware / …       …         …         …
…                  …         …         …
Bellwether Tree
• How to build a bellwether tree
– Similar to regression tree construction
– Starting from the root node, recursively split the current leaf node using the “best split criterion”
• A split criterion partitions a set of items into disjoint subsets
• Pick the split that reduces the error the most
– Stop splitting when the number of items in the current leaf node falls under a threshold value
– Prune the tree to avoid overfitting
Figure: the example bellwether tree (splits on R&D Expense ≤ 50K and Category; leaves [1-2, WI], [1-3, MD], [1-1, NY]), shown next to a tree whose nodes are numbered 1 through 9.
Problem of Naïve Tree Construction
• A naïve bellwether tree construction algorithm will scan the dataset $n \cdot m$ times
– $n$ is the number of nodes
– $m$ is the number of candidate split criteria
• Idea: Extending the RainForest framework [Gehrke et al., 98]
Figure: a tree whose nodes are numbered 1 through 9.
For each node:
• Try all candidate split criteria to find the best one
• It needs to scan the dataset $m$ times
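One way to avoid scanning the data once per candidate split is to collect, in a single pass, sufficient statistics for every (split criterion, branch, region) combination and derive the errors from them. The sketch below illustrates that idea for one tree node. It is loosely inspired by the RainForest approach but is not the authors' algorithm; it assumes one regional feature per region and a one-variable least-squares model, and `best_split`, `sse_from_stats`, and the toy examples are hypothetical.

```python
import math
from collections import defaultdict


def sse_from_stats(n, sx, sy, sxx, sxy, syy):
    """SSE of the least-squares line y = a + b*x, computed from running sums."""
    if n == 0:
        return 0.0
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    cov = sxy - sx * sy / n
    return var_y - (cov * cov / var_x if var_x > 0 else 0.0)


def best_split(examples, split_criteria):
    """examples: iterable of (item_attrs, {region: feature}, target).
    split_criteria: {name: function(item_attrs) -> branch label}.
    Scans the examples once; returns (best split name, its total SSE)."""
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0, 0.0])
    for attrs, region_features, y in examples:  # the single scan
        for split_name, branch_of in split_criteria.items():
            branch = branch_of(attrs)
            for region, x in region_features.items():
                s = stats[(split_name, branch, region)]
                s[0] += 1
                s[1] += x
                s[2] += y
                s[3] += x * x
                s[4] += x * y
                s[5] += y * y

    # For each (split, branch), keep the error of its best region.
    branch_error = defaultdict(lambda: math.inf)
    for (split_name, branch, _region), s in stats.items():
        key = (split_name, branch)
        branch_error[key] = min(branch_error[key], sse_from_stats(*s))

    # A split's error is the sum of its branches' best-region errors.
    total = defaultdict(float)
    for (split_name, _branch), err in branch_error.items():
        total[split_name] += err
    return min(total.items(), key=lambda kv: kv[1])


examples = [
    ({"category": "Laptop", "rd": 30}, {"WI": 1.0, "MD": 5.0}, 10.0),
    ({"category": "Laptop", "rd": 80}, {"WI": 2.0, "MD": 1.0}, 20.0),
    ({"category": "Laptop", "rd": 40}, {"WI": 3.0, "MD": 4.0}, 30.0),
    ({"category": "Desktop", "rd": 90}, {"WI": 4.0, "MD": 1.0}, 5.0),
    ({"category": "Desktop", "rd": 20}, {"WI": 1.0, "MD": 2.0}, 10.0),
    ({"category": "Desktop", "rd": 70}, {"WI": 2.0, "MD": 3.0}, 15.0),
]
criteria = {
    "category": lambda a: a["category"],
    "rd<=50": lambda a: a["rd"] <= 50,
}
split_name, split_sse = best_split(examples, criteria)
print(split_name, split_sse)
```

In the toy data, WI predicts laptops exactly and MD predicts desktops exactly, so splitting on category beats splitting on R&D expense.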
Bellwether Cube
Bellwether cube at the finest level (rows: Category; columns: R&D Expenses):

Division  Category  Low             Medium          High
Software  OS        [1-3,CA]: 0.05  [1-1,NY]: 0.03  [1-2,CA]: 0.02
          …         …               …               …
Hardware  Laptop    NULL            [1-1,NY]: 0.02  [1-4,WI]: 0.03
          Desktop   [1-4,MD]: 0.17  [1-4,WA]: 0.01  NULL
          …         …               …               …
…         …         …               …               …

Rollup leads to, and drilldown returns from, the Division-level cube:

Division  Low             Medium          High
Software  [1-4,CA]: 0.10  [1-2,CA]: 0.05  [1-2,CA]: 0.03
Hardware  [1-4,MD]: 0.08  [1-1,IL]: 0.03  [1-4,WI]: 0.05
…         …               …               …

The number in a cell is the error of the bellwether region for that subset of items.

Item hierarchy: Category
– Levels: Any, Division, Category
– Hierarchy tree nodes: All; Hardware, Software; Desktop, Laptop; Others
Item hierarchy: R&D Expense
– Levels: Any, Range, Expense
– Hierarchy tree nodes: All; Low, Medium, High (range boundaries at 100K and 1M)
Problem of Naïve Cube Construction
• A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
– A basic bellwether search involves building a model for each candidate region
Figure: the lattice of cube cells, from the finest level (Category by R&D Expenses: Low, Medium, High) up through Division (Software, Hardware, …) to Any.
For each cell:
• Build a model for each candidate region
Efficient Cube Construction
• Idea: Transform model construction into computation of distributive or algebraic aggregate functions
– Let $S_1, \ldots, S_n$ partition $S$
• $S = S_1 \cup \cdots \cup S_n$ and $S_i \cap S_j = \emptyset$ for $i \neq j$
– Distributive function: $\alpha(S) = F(\{\alpha(S_1), \ldots, \alpha(S_n)\})$
• E.g., $\text{Count}(S) = \text{Sum}(\{\text{Count}(S_1), \ldots, \text{Count}(S_n)\})$
– Algebraic function: $\alpha(S) = F(\{G(S_1), \ldots, G(S_n)\})$
• $G(S_i)$ returns a fixed-length vector of values
• E.g., $\text{Avg}(S) = F(\{G(S_1), \ldots, G(S_n)\})$
– $G(S_i) = [\text{Sum}(S_i), \text{Count}(S_i)]$
– $F(\{[a_1, b_1], \ldots, [a_n, b_n]\}) = \text{Sum}(\{a_i\}) / \text{Sum}(\{b_i\})$
Efficient Cube Construction
• Build models for each finest-grained cell
• For higher-level cells, use data cube computation techniques to compute the aggregate functions
Figure: the lattice of cube cells, from the finest level (Category by R&D Expenses: Low, Medium, High) up through Division (Software, Hardware, …) to Any.
For each finest-grained cell:
• Build models to find the bellwether region
For each higher-level cell:
• Compute aggregate functions to find the bellwether region
Efficient Cube Construction
• Classification models:
– Use the prediction cube [Chen et al., 05] execution framework
• Regression models: (Weighted linear regression model; builds on work in Chen-Dong-Han-Wah-Wang, VLDB 02)
– Having the sum of squared error (SSE) for each candidate region is sufficient to find the bellwether region
– SSE(S) is an algebraic function, where S is a set of items
– $\text{SSE}(S) = q(\{ g(S_k) : k = 1, \ldots, n \})$
• $S_1, \ldots, S_n$ partition $S$
• $g(S_k) = \langle Y_k^\top W_k Y_k,\ X_k^\top W_k X_k,\ X_k^\top W_k Y_k \rangle$
• $q(\{\langle A_k, B_k, C_k \rangle : k = 1, \ldots, n\}) = \sum_k A_k - \left(\sum_k C_k\right)^\top \left(\sum_k B_k\right)^{-1} \left(\sum_k C_k\right)$
where
• $Y_k$ is the vector of target values for set $S_k$ of items
• $X_k$ is the matrix of features for set $S_k$ of items
• $W_k$ is the weight matrix for set $S_k$ of items
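The algebraic form of SSE can be checked numerically with a small sketch. The code below is an illustration under assumptions, not the authors' implementation: weights are taken to be diagonal and are passed as a list `w`, each row of `X` is one item's feature vector with a leading 1 for the intercept, and a small Gaussian-elimination helper `solve` stands in for the matrix inverse. The data values are made up.

```python
def g(X, y, w):
    """Per-partition summary (A, B, C) = (Y'WY, X'WX, X'WY), diagonal weights w."""
    p = len(X[0])
    A = sum(wi * yi * yi for wi, yi in zip(w, y))
    B = [[sum(wi * row[a] * row[b] for wi, row in zip(w, X)) for b in range(p)]
         for a in range(p)]
    C = [sum(wi * row[a] * yi for wi, row, yi in zip(w, X, y)) for a in range(p)]
    return A, B, C


def solve(M, v):
    """Solve M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z


def q(summaries):
    """SSE(S) = sum A_k - (sum C_k)' (sum B_k)^{-1} (sum C_k)."""
    p = len(summaries[0][2])
    A = sum(s[0] for s in summaries)
    B = [[sum(s[1][a][b] for s in summaries) for b in range(p)] for a in range(p)]
    C = [sum(s[2][a] for s in summaries) for a in range(p)]
    beta = solve(B, C)  # (sum B)^{-1} (sum C), the fitted coefficients
    return A - sum(c * b for c, b in zip(C, beta))


# Two partitions S_1 and S_2 of a four-item set (made-up values).
X1, y1, w1 = [[1.0, 1.0], [1.0, 2.0]], [1.0, 3.0], [1.0, 1.0]
X2, y2, w2 = [[1.0, 3.0], [1.0, 4.0]], [2.0, 5.0], [1.0, 1.0]

sse_from_parts = q([g(X1, y1, w1), g(X2, y2, w2)])
sse_whole = q([g(X1 + X2, y1 + y2, w1 + w2)])
print(sse_from_parts, sse_whole)
```

Both calls give the same SSE up to floating-point rounding, so a higher-level cell can be evaluated from the summaries of its child cells without refitting a model on the raw items.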
Bellwether Analysis
Experimental Results
Experimental Results: Summary
• We have shown the existence of bellwether regions on a real mail-order dataset
• We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
• We show our computation techniques improve efficiency by orders of magnitude
• We show our computation techniques scale linearly in the size of the dataset
Characteristics of Bellwether Trees & Cubes
Dataset generation:
• Use a random tree to generate different bellwether regions for different subsets of items
Parameters:
• Noise
• Concept complexity: # of tree nodes
Result:
• Bellwether trees & cubes have better accuracy than basic bellwether search
• Increasing noise increases error
• Increasing complexity increases error
Figure: RMSE vs. Noise (0.05, 0.5, 1, 2) for basic, cube, and tree, with 15 nodes.
Figure: RMSE vs. Number of nodes (3, 7, 15, 31, 63) for basic, cube, and tree, at noise level 0.5.
Efficiency Comparison
Figure: running time in seconds (0 to 3000) vs. thousands of examples (100 to 300).
• Naïve computation methods: naive cube, naive tree
• Our computation techniques: RF tree, single-scan cube, optimized cube
Scalability
Figure: running time in seconds (0 to 1200) vs. millions of examples (2.5 to 10) for single-scan cube and optimized cube.
Figure: running time in seconds (0 to 7000) vs. millions of examples (2.5 to 10) for RF tree.
Conclusion
• Promising data mining paradigm:
– Using OLAP queries to generate features and even targets for mining
– Using data-mining models as building blocks in the mining process, rather than thinking of them as the end result
– Exploit the nested structure of OLAP queries to achieve efficient computation
Figure: Database, Multi-Dimensional View (Subset Selection & Aggregation), Data Mining.