Feature Selection, Theory Building and Association Rules Network School of Information Technologies Sanjay Chawla.

Dec 19, 2015

Feature Selection, Theory Building and Association Rules Network

School of Information Technologies

Sanjay Chawla

Acknowledgement

› This is joint work with

- Gaurav Pandey; Simon Poon; Bavani Arunasalam; Joseph Davis

- Gaurav Pandey, Sanjay Chawla, Simon Poon, Bavani Arunasalam and Joseph Davis. "Association Rules Network: Definition and Applications." Statistical Analysis and Data Mining 1(4), 260-279, 2009.

- Earlier versions appeared in PAKDD (2003) and ICDE (2004)

Outline

› The Context of Data Mining

› Common Data Mining Techniques

› From Patterns to Structure

› Association Rules Network (ARN)

› ARN to Hypothesis Generation

› Case Studies

Context of Data Mining

› Large parts of human activity now have an observational data footprint

- Going for groceries – point of sales (POS) data

- Visit to a doctor – electronic health record & insurance data

- Work – email; documents; web browsing (web logs); phone logs

- Entertainment – Netflix; Facebook; Twitter

- Surveillance – CCTV

› Scientific activity

- Large Hadron Collider (LHC)

- 15 petabytes of data per year;

- one 5 GB DVD every 5 sec;

- … but Google processes 24 petabytes/day!

Jim Gray: The Fourth Paradigm of Science

› First Paradigm (until ~1000 years ago)

- Science was an empirical exercise

- about description (catalogue stars; plants, etc.)

› Second Paradigm (last few hundred years)

- Science became theoretical

- Models and generalizations

› Third Paradigm (last few decades)

- Science became computational

- Solve complex models through computational simulation

› Fourth Paradigm (today)

- Science is about data exploration

- Unify theory, experiment and simulation

Data Mining Definition and Activity

› Within Computer Science/AI, Data Mining is one component in a larger process known as KDD (Knowledge Discovery in Databases):

- Data Extraction → Transformation → Data Mining Task → Action

› Obviously strong overlap with Statistics and Machine Learning

› Statistics (hypothesis validation)

- form hypothesis → run experiment → collect data → statistical techniques to test hypothesis → conclusion

› Machine Learning (function estimation and stability)

- data → fit function f → perturb data → is f valid?

What is a theory?

› Computer Science Theory

- Computational Models (Church-Turing Hypothesis / Quantum Computing)

- Computational Complexity (P vs. NP)

- Approximation Algorithms

› Theory – A system of ideas and statements held as explanation

› Database Theory

- Relational Algebra

- Declarative Interface

- Query Language and Query Optimization

- ACID

› Statistics

- Statistical Inference

- Maximum Likelihood Estimation

Data Mining Theory

› Statistics starts with a hypothesis;

- which suggests an experiment

- Apply statistical techniques to the experimental data to make objective judgements about the validity of the hypothesis

- (e.g., the Universe is expanding; examine the redshift of distant stars)

› The validity of statistical theories is predicated on several mathematical theorems, e.g.,

- the ubiquity of the Normal distribution through the Central Limit Theorem

› Data Mining starts with observational data;

- Computational correlation analysis suggests candidate hypotheses

- Primary focus on computational efficiency and on reducing the semantic gap

- Space-time tradeoff; and more recent space-accuracy tradeoff (e.g., Bloom Filters)

- Search for pruning invariants.

Taxonomy of Theories

Theory Type | Description
Analysis | Says what is: analysis and description
Explanation | What, why, where, how, but no prediction
Prediction | What is and what will be; testable propositions but no causal explanation
Explanation and Prediction | Everything above and causal explanation
Design and Action | Says how to do something; prescriptive; recipes for building an artifact

Shirley Gregor: The Nature of Theory in Information Systems; MISQ 2006

Data Mining: The task of hypothesis generation

Common Data Mining Tasks

› Classification/Regression

- Fundamentally about function estimation

- Decision Trees; Logistic Regression; SVM

› Clustering

- Essentially lossy-compression

- K-means; spectral clustering

› Association Rule or Pattern Mining

- Multivariate discrete correlation generation; lossless compression

- Apriori; FP-Trees; Closed itemsets

› Outlier/Anomaly Detection

- Identification of "weak" signals

- Distance-based and Density-based outlier detection;

- direct search for middle eigenvalues

Association Rules

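› Let I be a set of items.

› Let T be a set of transactions, where each transaction t ∈ T is a subset of I (t ⊆ I).

› Let α be an itemset, i.e., a subset of I.

› An association rule r is an implication of the form r: λ → δ, where λ and δ are itemsets.

$$\mathrm{sup}(\alpha) = \frac{|\{t \in T \mid \alpha \subseteq t\}|}{|T|}$$

$$\mathrm{sup}(r) = \mathrm{sup}(\lambda \cup \delta)$$

$$\mathrm{conf}(r) = \frac{\mathrm{sup}(\lambda \cup \delta)}{\mathrm{sup}(\lambda)}$$

TID | Items
1 | a, b, e
2 | b, e, f, g
3 | a, b, g
4 | a, b, e, f, g

sup({b,e,f}) = 2/4 = 0.5

conf({b,e} → f) = 2/3 ≈ 0.67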
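
The two worked values above can be checked mechanically. The following is a minimal Python sketch, not code from the talk: the function names `support` and `confidence` and the dictionary layout are assumptions made for illustration, and exact fractions are used so the results match the slide's 2/4 and 2/3.

```python
from fractions import Fraction

# The four transactions from the TID/Items table (TID -> set of items).
transactions = {
    1: {"a", "b", "e"},
    2: {"b", "e", "f", "g"},
    3: {"a", "b", "g"},
    4: {"a", "b", "e", "f", "g"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """conf(antecedent -> consequent) = sup(antecedent | consequent) / sup(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"b", "e", "f"}, transactions))        # 1/2
print(confidence({"b", "e"}, {"f"}, transactions))   # 2/3
```

Note that `confidence` divides by the antecedent's support, so it is only defined for antecedents that occur in at least one transaction.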

Association Rules

› Association Rule Mining (ARM) is easily the most "researched" topic in data mining

› The research has mainly focused along three dimensions:

1. Efficient data structures to mine frequent itemsets

a. An itemset is frequent if its support is greater than minsup.

2. Extend ARM to measures beyond support and confidence

a. but preserve anti-monotonicity

3. Condensed representation of all frequent itemsets

a. prune itemsets which are "redundant".
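
To make the role of anti-monotonicity concrete, here is a small Apriori-style sketch in Python. It is a simplified illustration, not an optimized miner and not code from the talk: the function name `apriori` is an assumption, and it follows the slide's convention that an itemset is frequent when its support is strictly greater than minsup.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support > minsup."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)

    def sup(itemset):
        return sum(1 for t in tx if itemset <= t) / n

    # Level 1: frequent single items.
    level = {}
    for item in sorted({i for t in tx for i in t}):
        s = sup(frozenset([item]))
        if s > minsup:
            level[frozenset([item])] = s
    frequent = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (anti-monotonicity): a k-itemset can only be frequent
        # if every one of its (k-1)-subsets is frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            s = sup(c)
            if s > minsup:
                level[c] = s
        frequent.update(level)
        k += 1
    return frequent

# The four transactions from the earlier Association Rules slide.
data = [{"a", "b", "e"}, {"b", "e", "f", "g"}, {"a", "b", "g"}, {"a", "b", "e", "f", "g"}]
result = apriori(data, minsup=0.5)
print(sorted("".join(sorted(s)) for s in result))
```

With minsup = 0.5 the three candidate 3-itemsets are all discarded in the prune step without a single pass over the data, because each contains an infrequent pair.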

Trivia: Association Rule Mining Top Citations (Google Scholar)

From Patterns to Structure

Applied association rule mining to US Census Data for Elderly People

Lesson: pruning in context; remove (hyper) cycles and redundant paths

r1: immigrant=no → income=below50k

r2: sex=female; age < 75 → immigrant=no

r3: immigrant=no → sex=female

r4: urban=no → income=below50k

r5: urban=no → sex=female
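
The cycle in these five rules can be found mechanically. The sketch below is a simplification made for illustration, not the procedure from the talk: it reads each rule as antecedent → consequent, flattens every rule into ordinary item-to-item edges (ignoring that r2 needs both antecedent items), and runs a depth-first search. The name `find_cycle` is hypothetical.

```python
# The five census rules, read as antecedent -> consequent.
rules = {
    "r1": ({"immigrant=no"}, "income=below50k"),
    "r2": ({"sex=female", "age<75"}, "immigrant=no"),
    "r3": ({"immigrant=no"}, "sex=female"),
    "r4": ({"urban=no"}, "income=below50k"),
    "r5": ({"urban=no"}, "sex=female"),
}

def find_cycle(rules):
    """Return the names of rules that form a cycle, or None if there is none."""
    # Flatten each rule into item -> item edges labelled with the rule name.
    edges = {}
    for name, (antecedent, consequent) in rules.items():
        for item in sorted(antecedent):
            edges.setdefault(item, []).append((consequent, name))

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}

    def dfs(node, path):
        colour[node] = GREY
        for nxt, rule in edges.get(node, []):
            step = path + [(node, rule)]
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # Back edge: the cycle starts where `nxt` sits on the path.
                start = next(i for i, (item, _) in enumerate(step) if item == nxt)
                return [r for _, r in step[start:]]
            if state == WHITE:
                found = dfs(nxt, step)
                if found:
                    return found
        colour[node] = BLACK
        return None

    for item in list(edges):
        if colour.get(item, WHITE) == WHITE:
            found = dfs(item, [])
            if found:
                return found
    return None

print(find_cycle(rules))   # ['r3', 'r2']
```

Rules r2 and r3 point at each other through sex=female and immigrant=no; dropping r3 leaves the rule set acyclic.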

Association Rule Network: Definition

An ARN is a backward directed hypergraph (B-graph) without cycles and redundant paths

Hypergraphs

› A graph G=(V,E); V is a set of vertices and E is a subset of the Cartesian product V × V.

› A hypergraph G=(V,H); V is the set of vertices and H is a subset of the powerset of V

- A hyperedge is a set of vertices.

- No good visual metaphor for hyperedges (like a line for an edge).

› The connection with the association rule problem is clear:

- V is the set of items

- H is the set of transactions
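
As a small illustration of this correspondence, the Python sketch below stores the four transactions from the earlier Association Rules slide as hyperedges over the item set. The helper names `degree` and `is_ordinary_graph`, and the tuple encoding of a backward hyperedge, are assumptions made for this example.

```python
# Items are vertices; each transaction is one hyperedge (a set of vertices).
V = {"a", "b", "e", "f", "g"}
H = [frozenset("abe"), frozenset("befg"), frozenset("abg"), frozenset("abefg")]

def degree(vertex, hyperedges):
    """Number of hyperedges containing the vertex (the item's support count)."""
    return sum(1 for h in hyperedges if vertex in h)

def is_ordinary_graph(hyperedges):
    """True when every hyperedge joins exactly two vertices, as a graph edge does."""
    return all(len(h) == 2 for h in hyperedges)

# A backward hyperedge (B-arc), the shape used for rules:
# a set of tail vertices pointing at a single head vertex.
b_arc = (frozenset({"b", "e"}), "f")   # the rule {b, e} -> f

print(degree("b", H), degree("f", H))  # 4 2
print(is_ordinary_graph(H))            # False
```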

ARN: Formal Definition

› Given a set of association rules R and a single item z, an ARN(R,z) is a weighted B-graph such that

i. There is a rule (hyperedge) whose consequent is the singleton item z

ii. Each hyperedge corresponds to a rule whose consequent is a singleton. The weight of the hyperedge is the confidence of the rule

iii. The node representing z is reachable from every other node in the ARN

iv. No node p ≠ z is reachable from z
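
One way to obtain a structure with these properties is to grow it backwards from z, level by level. The Python sketch below is a simplified illustration under stated assumptions, not the construction algorithm from the cited paper: `build_arn` is a hypothetical name, the rules are the five census rules from the earlier slide, and the confidence values are placeholders invented for the example, not results from the census data.

```python
def build_arn(rules, z):
    """Grow a layered rule network backwards from the goal item z.

    rules maps a rule name to (antecedent, consequent, confidence); the
    antecedent is a frozenset of items and the consequent a single item.
    Returns (level, kept): the level of each node and the kept rule names.
    """
    level = {z: 0}          # item -> number of hyperedges between it and z
    kept = []
    frontier = [z]
    while frontier:
        next_frontier = []
        for target in frontier:
            for name, (antecedent, consequent, _conf) in rules.items():
                if consequent != target:
                    continue
                # Reject a rule if any antecedent item already sits at the same
                # or a shallower level than the target: accepting it would
                # close a cycle or add a second, redundant path to z.
                if any(level.get(item, level[target] + 1) <= level[target]
                       for item in antecedent):
                    continue
                kept.append(name)
                for item in sorted(antecedent):
                    if item not in level:
                        level[item] = level[target] + 1
                        next_frontier.append(item)
        frontier = next_frontier
    return level, kept

# Confidence values below are placeholders for illustration only.
rules = {
    "r1": (frozenset({"immigrant=no"}), "income=below50k", 0.9),
    "r2": (frozenset({"sex=female", "age<75"}), "immigrant=no", 0.8),
    "r3": (frozenset({"immigrant=no"}), "sex=female", 0.7),
    "r4": (frozenset({"urban=no"}), "income=below50k", 0.6),
    "r5": (frozenset({"urban=no"}), "sex=female", 0.5),
}
level, kept = build_arn(rules, "income=below50k")
print(kept)    # ['r1', 'r4', 'r2']
print(level)
```

Every kept hyperedge runs from nodes at level k+1 to a node at level k, so z is reachable from every node and nothing is reachable from z. In this example r3 is dropped because it would close a cycle with r2, and r5 because urban=no already reaches z directly through r4.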

Case Studies

› Case Study 1 – Open Source Software (sourceforge.net) – which ones tend to become popular and why?

- ~24K records; 46 attributes.

› Case Study 2 – Multifactor Productivity Growth (MFP) – what factors contribute towards firm/company level productivity?

- ~3K records; 32 variables.

› In both these examples there is a lack of "general theory", so they are good test cases for data mining

Case Study 1: ARN formed from Open Source Data (Downloads=High)

Case Study 1: Clusters formed from ARN have natural semantics

Case Study 2: Measuring Firm Level Productivity

› In Economics, understanding the factors that contribute towards productivity growth is an important problem.

› For example, a simple measure of employee productivity is:

$$\text{employee productivity} = \frac{\text{total value of output} - \text{total cost (capital and raw materials)}}{\text{total number of employees}}$$

› In practice much more complex measures are used.

› Economists (and businesses) are interested in understanding which factors contribute towards productivity.

Case Study 2: Measuring Firm Level Productivity

› Four years of data was obtained from the Australian Bureau of Statistics (ABS)

What is a natural data mining technique to use for these kinds of problems?

Case Study 2: Variables

Case Study 2: MFP (Positive Growth)

Case Study 2: MFP (Negative Growth)

Candidate Hypothesis

Mathematical Modeling

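$$\mathrm{MFP} = \beta_0 + \beta_1\,\mathrm{UC} + \beta_2\,\mathrm{StockAdjust} + \beta_3\,\mathrm{BudgetPlan} + \beta_4\,\mathrm{TQM} + \beta_5\,\mathrm{IntroGood} + \mathrm{industry} + \varepsilon$$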

Summary

› Data Mining is presented as a mechanism for candidate hypothesis generation

› This viewpoint allows data mining to be integrated into the traditional theory building/scientific process.

› Demonstrated:

- Association Rules (Patterns) → ARN (Structure) → Knowledge

› Future: Engineering issues and Causality!