
Classification and Prediction (novel.ict.ac.cn/files/Day 4.pdf)

Jul 21, 2020


2014-05-08


Classification

Jian Pei: Big Data Analytics -- Classification 2

Classification and Prediction

•  Classification: predict categorical class labels
  – Build a model for a set of classes/concepts
  – Classify loan applications (approve/decline)
•  Prediction: model continuous-valued functions
  – Predict the economic growth in 2015

Classification: A 2-step Process

•  Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    •  Each tuple/sample belongs to a predefined class
  – Classification rules, decision trees, or math formulae
•  Model application: classify unseen objects
  – Estimate accuracy of the model using an independent test set
  – Acceptable accuracy → apply the model to classify tuples with unknown class labels

Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training data:

Name  Rank            Years  Tenured
Mike  Assistant Prof  3      No
Mary  Assistant Prof  7      Yes
Bill  Prof            2      Yes
Jim   Associate Prof  7      Yes
Dave  Assistant Prof  6      No
Anne  Associate Prof  3      No

Classifier (model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Model Application

Testing data (used to estimate the accuracy of the classifier):

Name     Rank            Years  Tenured
Tom      Assistant Prof  2      No
Merlisa  Associate Prof  7      No
George   Prof            5      Yes
Joseph   Assistant Prof  7      Yes

Unseen data: (Jeff, Professor, 4). Tenured?
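
The two steps can be tried out on the slide's own data. The snippet below is a minimal sketch, not part of the original deck: the function name `predict_tenured` and the spelled-out rank strings are assumptions made for illustration. It applies the rule from the Model Construction slide to the testing data, then to the unseen tuple.

```python
def predict_tenured(rank, years):
    """Rule from the Model Construction slide: professor OR years > 6."""
    return "Yes" if rank == "Prof" or years > 6 else "No"

# Testing data from the slide: (name, rank, years, actual label).
TEST_DATA = [
    ("Tom", "Assistant Prof", 2, "No"),
    ("Merlisa", "Associate Prof", 7, "No"),
    ("George", "Prof", 5, "Yes"),
    ("Joseph", "Assistant Prof", 7, "Yes"),
]

# Step 1: estimate accuracy on the independent test set.
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in TEST_DATA)
accuracy = correct / len(TEST_DATA)

# Step 2: classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Prof", 4)
print(accuracy, jeff)
```

The rule misclassifies only Merlisa (7 years but not tenured), so three of the four test tuples are correct.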

Supervised/Unsupervised Learning

•  Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
•  Unsupervised learning (clustering)
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Data Preparation

•  Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
•  Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
•  Data transformation
  – Generalize and/or normalize data

Measurements of Quality

•  Prediction accuracy
•  Speed and scalability
  – Construction speed and application speed
•  Robustness: handle noise and missing values
•  Scalability: build model for large training data sets
•  Interpretability: understandability of models

Decision Tree Induction

•  Decision tree representation
•  Construction of a decision tree
•  Inductive bias and overfitting
•  Scalable enhancements for large databases

Decision Tree

•  A node in the tree: a test of some attribute
•  A branch: a possible value of the attribute
•  Classification
  – Start at the root
  – Test the attribute
  – Move down the tree branch

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes

Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

Appropriate Problems

•  Instances are represented by attribute-value pairs
  – Extensions of decision trees can handle real-valued attributes
•  Disjunctive descriptions may be required
•  The training data may contain errors or missing values


Basic Algorithm ID3

•  Construct a tree in a top-down recursive divide-and-conquer manner
  – Which attribute is the best at the current node?
  – Create a node for each possible attribute value
  – Partition training data into descendant nodes
•  Conditions for stopping recursion
  – All samples at a given node belong to the same class
  – No attribute remains for further partitioning
    •  Majority voting is employed for classifying the leaf
  – There is no sample at the node
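
The recursion can be sketched in a few lines of Python. This is a minimal illustration rather than the original ID3 code: the names `entropy`, `gain`, and `id3`, the nested-dict tree representation, and the inline copy of the PlayTennis table are all assumptions. Branches are created only for attribute values that occur at the node, so the "no sample at the node" case never arises in this sketch.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temp", "Humid", "Wind", "PlayTennis"]
TABLE = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in TABLE.splitlines()]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target):
    """Information gain of splitting rows on attr."""
    result = entropy([r[target] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        result -= len(subset) / len(rows) * entropy(subset)
    return result

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Stop: pure node, or no attribute left (majority vote at the leaf).
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    # One child per observed value of the best attribute.
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

tree = id3(ROWS, COLUMNS[:-1], "PlayTennis")
print(tree)
```

On this table the sketch selects Outlook at the root, then Humid under Sunny and Wind under Rain.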

Which Attribute Is the Best?

•  The attribute most useful for classifying examples
•  Information gain and gini index
  – Statistical properties
  – Measure how well an attribute separates the training examples

Entropy

•  Measure homogeneity of examples

$$\mathrm{Entropy}(S) \equiv \sum_{i=1}^{c} -p_i \log_2 p_i$$

  – S is the training data set, and p_i is the proportion of S belonging to class i
•  The smaller the entropy, the purer the data set

Information Gain

•  The expected reduction in entropy caused by partitioning the examples according to an attribute

$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v

Example

Training data: the 14-example PlayTennis table from the Training Dataset slide (9 Yes, 5 No).

$$\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$

$$\begin{aligned}
\mathrm{Gain}(S, \mathit{Wind}) &= \mathrm{Entropy}(S) - \sum_{v \in \{Weak,\, Strong\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\
&= \mathrm{Entropy}(S) - \frac{8}{14}\,\mathrm{Entropy}(S_{Weak}) - \frac{6}{14}\,\mathrm{Entropy}(S_{Strong}) \\
&= 0.94 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.00 = 0.048
\end{aligned}$$
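
The same arithmetic can be checked for every attribute. The following is a minimal sketch under stated assumptions: the helper names `entropy` and `gain` and the inline copy of the PlayTennis table are illustrative, not taken from the slides.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temp", "Humid", "Wind", "PlayTennis"]
TABLE = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in TABLE.splitlines()]

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target="PlayTennis"):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    result = entropy([r[target] for r in rows])
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        result -= len(subset) / len(rows) * entropy(subset)
    return result

total_entropy = entropy([r["PlayTennis"] for r in ROWS])
gains = {attr: gain(ROWS, attr) for attr in COLUMNS[:-1]}
best = max(gains, key=gains.get)
print(round(total_entropy, 3), {a: round(g, 3) for a, g in gains.items()}, best)
```

Wind comes out at about 0.048, matching the worked example, while Outlook has the largest gain (about 0.247) and would be chosen at the root.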

Hypothesis Space Search in Decision Tree Building

•  Hypothesis space: the set of possible decision trees
•  ID3: simple-to-complex, hill-climbing search
  – Evaluation function: information gain


Capabilities and Limitations

•  The hypothesis space is complete
•  Maintains only a single current hypothesis
•  No backtracking
  – May converge to a locally optimal solution
•  Use all training examples at each step
  – Make statistics-based decisions
  – Not sensitive to errors in individual examples

Natural Bias

•  The information gain measure favors attributes with many values
•  An extreme example
  – Attribute “date” may have the highest information gain
  – A very broad decision tree of depth one
  – Inapplicable to any future data

Alternative Measures

•  Gain ratio: penalize attributes like date by incorporating split information

$$\mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$

•  Split information is sensitive to how broadly and uniformly the attribute splits the data

$$\mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

  – S_1, …, S_c are the subsets of S produced by the c values of attribute A
•  Gain ratio can be undefined or very large
  – Only test attributes with above-average gain

Measuring Inequality

•  Lorenz curve
  – X-axis: quintiles
  – Y-axis: accumulative share of income earned by the plotted quintile
  – Gap between the actual lines and the mythical line: the degree of inequality
•  Gini index
  – Gini = 0: even distribution
  – Gini = 1: perfectly unequal
  – The greater the distance, the more unequal the distribution

Gini Index (Adjusted)

•  A data set T contains examples from n classes

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

  – p_j is the relative frequency of class j in T
•  A data set T of size N is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively

$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

•  The attribute that provides the smallest gini_split(T) is chosen to split the node
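
The two formulas translate directly into code. This is a minimal sketch: the function names `gini` and `gini_split` are assumptions, and the two candidate splits use class counts read off the PlayTennis table (Humid: High has 3 Yes and 4 No, Normal has 6 Yes and 1 No; Wind: Weak has 6 Yes and 2 No, Strong has 3 Yes and 3 No).

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(part1, part2):
    """Size-weighted gini of a binary split."""
    n = len(part1) + len(part2)
    return len(part1) / n * gini(part1) + len(part2) / n * gini(part2)

# Class labels on each side of the two candidate binary splits.
humid = (["Yes"] * 3 + ["No"] * 4, ["Yes"] * 6 + ["No"] * 1)
wind = (["Yes"] * 6 + ["No"] * 2, ["Yes"] * 3 + ["No"] * 3)

humid_score = gini_split(*humid)
wind_score = gini_split(*wind)
print(round(humid_score, 4), round(wind_score, 4))
```

Between these two candidates, Humid gives the smaller gini_split (about 0.367 versus 0.429), so it would be preferred.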

Extracting Classification Rules

•  Classification rules can be extracted from a decision tree
•  Each path from the root to a leaf → an IF-THEN rule
  – All attribute-value pairs along a path form a conjunctive condition
  – The leaf node holds the class prediction
  – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
•  Rules are easy to understand
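
Path-to-rule extraction is a short recursive walk. The sketch below is illustrative only: it assumes the tree is stored as a nested dict (attribute → value → subtree or class label) and reuses the Outlook tree from the Decision Tree slide; `extract_rules` is a hypothetical helper name.

```python
TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def extract_rules(tree, conditions=(), target="PlayTennis"):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: holds the class prediction
        lhs = " AND ".join(f'{attr} = "{value}"' for attr, value in conditions)
        return [f'IF {lhs} THEN {target} = "{tree}"']
    (attr, branches), = tree.items()  # an internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attr, value),), target)
    return rules

RULES = extract_rules(TREE)
for rule in RULES:
    print(rule)
```

Each of the five leaves yields one rule whose condition is the conjunction of the attribute-value pairs on its path.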


Inductive Bias

•  Inductive bias: the set of assumptions that, together with the training data, deductively justifies the classification of future instances
  – Preferences of the classifier construction
•  Shorter trees are preferred over longer trees
•  Trees that place high information gain attributes close to the root are preferred

Why Prefer Short Trees?

•  Occam’s razor: prefer the simplest hypothesis that fits the data
  – “One should not increase, beyond what is necessary, the number of entities required to explain anything”
  – Also known as the principle of parsimony
•  Fewer short trees than long trees
•  A short tree is less likely to be a statistical coincidence

Overfitting

•  A decision tree T may overfit the training data
  – if ∃ an alternative tree T’ s.t. T has a higher accuracy than T’ over the training examples, but T’ has a higher accuracy than T over the entire distribution of data
•  Why overfitting?
  – Noisy data

[Figure: trees T and T’ compared on the training data and on all data]

Avoid Overfitting

•  Prepruning: stop growing the tree earlier
  – Difficult to choose an appropriate threshold
•  Postpruning: remove branches from a “fully grown” tree
  – Use an independent set of data to prune
•  Key: how to determine the correct final tree size

Determine the Final Tree Size

•  Separate training (2/3) and testing (1/3) sets
•  Use cross validation, e.g., 10-fold cross validation
•  Use all the data for training
  – Apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve accuracy over the entire distribution
•  Use the minimum description length (MDL) principle
  – Halt growth of the tree when the encoding is minimized

Enhancements

•  Allow for attributes of continuous values
  – Dynamically discretize continuous attributes
•  Handle missing attribute values
•  Attribute construction
  – Create new attributes based on existing ones that are sparsely represented
  – Reduce fragmentation, repetition, and replication


The Evaluation Issues

•  The accuracy of a classifier can be evaluated using a test data set
  – The test set is a part of the available labeled data set
•  But how can we evaluate the accuracy of a classification method?
  – A classification method can generate many classifiers
•  What if the available labeled data set is too small?

Holdout Method

•  Partition the available labeled data set into two disjoint subsets: the training set and the test set
  – 50-50
  – 2/3 for training and 1/3 for testing
•  Build a classifier using the training set
•  Evaluate the accuracy using the test set

Limitations of Holdout Method

•  Fewer labeled examples for training
•  The classifier highly depends on the composition of the training and test sets
  – The smaller the training set, the larger the variance
•  If the test set is too small, the evaluation is not reliable
•  The training and test sets are not independent

Cross-Validation

•  Each record is used the same number of times for training and exactly once for testing
•  K-fold cross-validation
  – Partition the data into k equal-sized subsets
  – In each round, use one subset as the test set, and use the remaining subsets together as the training set
  – Repeat k times
  – The total error is the sum of the errors in k rounds
•  Leave-one-out: k = n
  – Utilize as much data as possible for training
  – Computationally expensive
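
The k-fold procedure can be sketched as follows. This is an illustrative outline, not a reference implementation: `k_fold_indices`, `cross_validate`, `train_fn`, and `error_fn` are hypothetical names, the folds are contiguous rather than shuffled, and fold sizes may differ by one when k does not divide n.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(records, k, train_fn, error_fn):
    """Sum the test errors over the k rounds.

    train_fn(training_records) -> model
    error_fn(model, test_records) -> number of misclassified records
    """
    total_errors = 0
    for train_idx, test_idx in k_fold_indices(len(records), k):
        model = train_fn([records[i] for i in train_idx])
        total_errors += error_fn(model, [records[i] for i in test_idx])
    return total_errors

folds = list(k_fold_indices(10, 3))
print([len(test) for _, test in folds])
```

Calling `k_fold_indices(n, n)` gives the leave-one-out case, with n rounds of one test record each.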

Bootstrap

•  Use a bootstrap sample as the training set, use the tuples not in the training set as the test set
•  .632 bootstrap: compute the overall accuracy by combining the accuracies of each bootstrap sample with the accuracy computed from a classifier using the whole data set as the training set

$$acc_{bootstrap} = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times \varepsilon_i + 0.368 \times acc_{all}\right)$$

  – ε_i is the accuracy of the i-th bootstrap sample, and acc_all is the accuracy of the classifier trained on the whole data set

Confidence Interval for Accuracy

•  Suppose a classifier C is tested on a test set of n cases, and the accuracy is acc
•  How much confidence can we have in acc?
•  We need to estimate the confidence interval of a given model accuracy


Binomial Experiments

•  When a coin is flipped, it has a probability p of landing heads up
•  If the coin is flipped N times, what is the probability that the number of heads X equals v?

$$P(X = v) = \binom{N}{v} p^{v} (1 - p)^{N - v}$$

  – Expectation (mean): Np
  – Variance: Np(1 - p)

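Confidence Level and Approximation

[Figure: standard normal density; the area between Z_{α/2} and Z_{1-α/2} is 1 - α]

•  Approximating using normal distribution

$$P\left(Z_{\alpha/2} < \frac{acc - p}{\sqrt{p(1 - p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$$

•  Confidence interval for p:

$$\frac{2 \cdot N \cdot acc + Z_{\alpha/2}^{2} \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^{2} + 4 \cdot N \cdot acc - 4 \cdot N \cdot acc^{2}}}{2\,(N + Z_{\alpha/2}^{2})}$$

•  Z_α: the bound at confidence level (1 - α)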
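
The interval formula is easy to evaluate numerically. The following is a minimal sketch: the function name `accuracy_interval` and the example numbers (accuracy 0.8 on N = 100 test cases) are assumptions chosen for illustration, and z = 1.96 is the usual normal bound for a 95% confidence level.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, given observed acc on n cases."""
    center = 2 * n * acc + z ** 2
    spread = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (center - spread) / denom, (center + spread) / denom

low, high = accuracy_interval(0.8, 100)
print(round(low, 3), round(high, 3))

# The interval narrows as the test set grows.
low_big, high_big = accuracy_interval(0.8, 1000)
print(round(low_big, 3), round(high_big, 3))
```

With 100 test cases the interval is roughly 0.711 to 0.867, which shows how loose an accuracy estimate from a small test set can be.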

Accuracy Can Be Misleading …

•  Consider a data set with 99% of the examples in the negative class and 1% in the positive class
•  A classifier that predicts everything as negative has an accuracy of 99%, though it does not work for the positive class at all!
•  Imbalanced class distributions are common in many applications
  – Medical applications, fraud detection, …

Performance Evaluation Matrix

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   a (TP)      b (FN)
CLASS     Class=No    c (FP)      d (TN)

$$\mathrm{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$

Confusion matrix: used for imbalanced class distributions

Performance Evaluation Matrix

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   a (TP)      b (FN)
CLASS     Class=No    c (FP)      d (TN)

•  True positive rate (TPR, sensitivity) = TP / (TP + FN)
•  True negative rate (TNR, specificity) = TN / (TN + FP)
•  False positive rate (FPR) = FP / (TN + FP)
•  False negative rate (FNR) = FN / (TP + FN)

Recall and Precision

•  Target class is more important than the other classes

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   a (TP)      b (FN)
CLASS     Class=No    c (FP)      d (TN)

•  Precision p = TP / (TP + FP)
•  Recall r = TP / (TP + FN)
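
The confusion-matrix measures can be computed together. This is a small sketch: the function name `rates` and the counts (1,000 records with 10 positives) are hypothetical, chosen to resemble the 99%/1% imbalance discussed earlier.

```python
def rates(tp, fn, fp, tn):
    """Measures derived from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # same as TPR (sensitivity)
        "tnr": tn / (tn + fp),      # specificity
        "fpr": fp / (tn + fp),
        "fnr": fn / (tp + fn),
    }

# Hypothetical classifier on 10 positive and 990 negative records.
m = rates(tp=6, fn=4, fp=14, tn=976)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Accuracy is 0.982 here even though precision is only 0.3 and recall 0.6, which is why the target-class measures are reported separately.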


Fallout

•  Type I errors (false positives): a negative object is classified as positive
  – Precision is not related directly to this error rate! (Precision p = TP / (TP + FP))
  – Fallout: the type I error rate, FP / (FP + TN)
•  Type II errors (false negatives): a positive object is classified as negative
  – Captured by recall

Fβ Measure

•  How can we summarize precision and recall into one metric?
  – Using the harmonic mean between the two

$$F\text{-measure}\ (F) = \frac{2rp}{r + p} = \frac{2\,TP}{2\,TP + FP + FN}$$

•  Fβ measure

$$F_\beta = \frac{(\beta^2 + 1)\,rp}{r + \beta^2 p} = \frac{(\beta^2 + 1)\,TP}{(\beta^2 + 1)\,TP + \beta^2\,FN + FP}$$

  – β = 0, Fβ is the precision
  – β = ∞, Fβ is the recall
  – 0 < β < ∞, Fβ is a tradeoff between the precision and the recall
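
Both forms of Fβ can be checked against each other. The snippet is a minimal sketch: the function names and the counts (TP = 6, FP = 14, FN = 4) are hypothetical and only serve to exercise the formulas.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta from precision p and recall r."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (recall + b2 * precision)

def f_beta_counts(tp, fp, fn, beta=1.0):
    """The same measure written in terms of confusion-matrix counts."""
    b2 = beta ** 2
    return (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp)

tp, fp, fn = 6, 14, 4
p, r = tp / (tp + fp), tp / (tp + fn)   # precision 0.3, recall 0.6

f1 = f_beta(p, r)                  # harmonic mean of p and r
f0 = f_beta(p, r, beta=0.0)        # reduces to the precision
f_large = f_beta(p, r, beta=1e6)   # approaches the recall
print(f1, f0, f_large, f_beta_counts(tp, fp, fn))
```

Values of β above 1 weight recall more heavily, and values below 1 weight precision more heavily.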

Weighted Accuracy

•  A more general metric

$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$

Measure    w1      w2   w3  w4
Recall     1       1    0   0
Precision  1       0    1   0
Fβ         β² + 1  β²   1   0
Accuracy   1       1    1   1

ROC Curve

•  Receiver Operating Characteristic (ROC)
•  Example: a 1-dimensional data set containing 2 classes; any point located at x > t is classified as positive

ROC Curve

(TPR, FPR):
•  (0,0): declare everything to be negative class
•  (1,1): declare everything to be positive class
•  (1,0): ideal
•  Diagonal line:
  – Random guessing
  – Below diagonal line: prediction is opposite of the true class

Figure from [Tan, Steinbach, Kumar]


Comparing Two Classifiers

Figure from [Tan, Steinbach, Kumar]


Cost-Sensitive Learning

•  In some applications, misclassifying some classes may be disastrous
  – Tumor detection, fraud detection
•  Using a cost matrix

                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL    Class=Yes   -1          100
CLASS     Class=No    1           0
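
A cost matrix turns a confusion matrix into a single total cost. The sketch below uses the cost values from the slide; the two confusion matrices (`model_1`, `model_2`) and the helper names are hypothetical and only illustrate how the comparison can differ from plain accuracy.

```python
# Cost matrix from the slide, keyed by (actual class, predicted class).
COST = {
    ("Yes", "Yes"): -1, ("Yes", "No"): 100,
    ("No", "Yes"): 1,   ("No", "No"): 0,
}

def total_cost(counts):
    """Sum of cost * count over the four confusion-matrix cells."""
    return sum(COST[cell] * n for cell, n in counts.items())

def accuracy(counts):
    correct = counts[("Yes", "Yes")] + counts[("No", "No")]
    return correct / sum(counts.values())

# Two hypothetical classifiers evaluated on the same 1,000 records.
model_1 = {("Yes", "Yes"): 6, ("Yes", "No"): 4, ("No", "Yes"): 14, ("No", "No"): 976}
model_2 = {("Yes", "Yes"): 9, ("Yes", "No"): 1, ("No", "Yes"): 60, ("No", "No"): 930}

print(accuracy(model_1), total_cost(model_1))
print(accuracy(model_2), total_cost(model_2))
```

The second model is less accurate but misses fewer positives, so under this cost matrix its total cost is much lower.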

Sampling for Imbalanced Classes

•  Consider a data set containing 100 positive examples and 1,000 negative examples
•  Undersampling: use a random sample of 100 negative examples and all positive examples
  – Some useful negative examples may be lost
  – Run undersampling multiple times, use the ensemble of multiple base classifiers
  – Focused undersampling: remove negative samples that are not useful for classification, e.g., those far away from the decision boundary


Oversampling

•  Replicate the positive examples until the training set has an equal number of positive and negative examples

•  For noisy data, may cause overfitting

Significance Tests

•  Are two algorithms different in effectiveness?
  – The null hypothesis: there is NO difference
  – The alternative hypothesis: there is a difference, i.e., B is better than A (the baseline method)
•  Matched pair experiments: the rankings that are compared are based on the same set of queries for both algorithms
•  Possible errors of significance tests
  – Type I: the null hypothesis is rejected when it is true
  – Type II: the null hypothesis is accepted when it is false
•  The power of a hypothesis test: the probability that the test will reject the null hypothesis correctly
  – Reducing the type II errors

Procedure of Comparison

•  Using a set of data sets
•  Procedure
  – Compute the effectiveness measure for every data set
  – Compute a test statistic based on a comparison of the effectiveness measures for each data set
    •  E.g., the t-test, the Wilcoxon signed-rank test, and the sign test
  – Compute a P-value: the probability that a test statistic value at least that extreme could be observed if the null hypothesis were true
  – The null hypothesis is rejected if the P-value ≤ α, where α is the significance level, which is used to minimize the type I errors
•  One-sided (one-tailed) tests: whether B is better than A (the baseline method)
  – Two-sided tests: whether A and B are different; the P-value is doubled


Distribution of Test Statistics


T-test

•  Assuming data values are sampled from normal distributions
  – In a matched pair experiment, assuming the difference between the effectiveness values is a sample from a normal distribution
•  The null hypothesis: the mean of the distribution of differences is 0

$$t = \frac{\overline{B - A}}{\sigma_{B-A}} \cdot \sqrt{N}$$

  – $\overline{B - A}$ is the mean of the differences, $\sigma_{B-A}$ is the standard deviation of the differences

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

Example

$$\overline{B - A} = 21.4, \qquad \sigma_{B-A} = 29.1, \qquad t = 2.33$$

P-value = 0.02, significant at a level of α = 0.05: the null hypothesis can be rejected

Issues in T-test

•  Data is assumed to be sampled from normal distributions
  – Generally inappropriate for effectiveness measures
  – However, experiments showed that the t-test produces very similar results to the randomization test, which does not assume any distribution (the most powerful nonparametric test)
•  The t-test assumes that the evaluation data is measured on an interval scale
  – Effectiveness measures are ordinal: the magnitude of the differences is not significant
  – Use the Wilcoxon signed-rank test and the sign test, which make fewer assumptions about the effectiveness measure, but are less powerful

Wilcoxon Signed-Rank Test

•  Assumption: the differences between the effectiveness values can be ranked, but the magnitude is not important

$$w = \sum_{i=1}^{N} R_i$$

  – R_i is a signed rank, N is the number of non-zero differences
•  Procedure
  – The differences are sorted by their absolute values in increasing order
  – Differences are assigned rank values (ties are assigned the average rank)
  – The rank values are given the sign of the original difference
•  The null hypothesis: the sum of the positive ranks will be the same as the sum of the negative ranks

Example

•  The non-zero differences in rank order of absolute value: 2, 9, 10, 24, 25, 25, 41, 60, 70
•  The signed ranks: -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9
•  w = 35
•  P-value = 0.025, significant at a level of α = 0.05: the null hypothesis can be rejected
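
The signed-rank procedure can be reproduced on this example. The following is a minimal sketch: `signed_ranks` is a hypothetical helper, and the signs of the differences are read off the signed ranks listed in the example. It computes only the statistic w, not the P-value.

```python
import math

def signed_ranks(differences):
    """Rank non-zero differences by absolute value; ties get the average rank."""
    nonzero = sorted((d for d in differences if d != 0), key=abs)
    ranks = []
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        # Each rank takes the sign of its original difference.
        ranks.extend(math.copysign(average_rank, d) for d in nonzero[i:j])
        i = j
    return ranks

# Differences B - A, with signs taken from the example's signed ranks.
DIFFS = [-2, 9, 10, -24, 25, 25, 41, 60, 70]
RANKS = signed_ranks(DIFFS)
w = sum(RANKS)
print(RANKS, w)
```

The two tied differences of 25 share ranks 5 and 6, so each receives 5.5, and the signed ranks sum to w = 35 as in the example.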

Sign Test

•  Completely ignore the magnitude of the differences
  – In practice, we may require a 5-10% difference before two results are considered different
•  The null hypothesis: P(B > A) = P(A > B) = ½
•  Sum up the number of pairs where B > A


Example

•  B > A in 7 pairs out of 10
•  P-value = 0.17: the probability of observing at least 7 successes out of 10 trials where the probability of success is 0.5
•  Cannot reject the null hypothesis
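
The P-value in this example is a binomial tail probability. The snippet below is a minimal sketch; `sign_test_p_value` is a hypothetical helper name, and it computes the one-sided value only.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """One-sided P-value: P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

p_value = sign_test_p_value(7, 10)
print(p_value)
```

The result, 176/1024 ≈ 0.17, is well above 0.05, so the null hypothesis cannot be rejected.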

Intuition – Bayesian Classification

•  More hockey fans in Canada than in US
  – Which country is Tom, a hockey fan, from?
  – Predicting Canada has a better chance to be right
•  Prior probability P(Canadian) = 5%: reflects the background knowledge that 5% of the total population is Canadian
•  P(hockey fan | Canadian) = 30%: the probability that a Canadian is also a hockey fan
•  Posterior probability P(Canadian | hockey fan): the probability that a hockey fan is from Canada

Bayes Theorem

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

•  Find the maximum a posteriori (MAP) hypothesis

$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

  – Require background knowledge
  – Computational cost

Naïve Bayes Classifier

•  Assumption: attributes are independent (given the class)
•  Given a tuple (a_1, a_2, …, a_n), predict its class as

$$C = \arg\max_{C_i} P(a_1, a_2, \ldots, a_n \mid C_i)\,P(C_i) = \arg\max_{C_i} P(C_i) \prod_{j=1}^{n} P(a_j \mid C_i)$$

  – $\arg\max f(x)$: the value of x that maximizes f(x)
•  Example:

$$\arg\max_{x \in \{1, 2, -3\}} x^2 = -3$$

Example: Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

Data sample X = (Outlook=sunny, Temp=mild, Humid=high, Wind=weak)

P(Yes|X) ∝ P(X|Yes) P(Yes) = 0.014
P(No|X) ∝ P(X|No) P(No) = 0.027

Will she play tennis? No
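
The two scores can be recomputed directly from the 14 rows. This is a minimal sketch with assumed names (`nb_score`, `ROWS`) and plain frequency estimates without smoothing. Counting from the table gives P(X|Yes) P(Yes) = (2/9)(4/9)(3/9)(6/9)(9/14) ≈ 0.014 and P(X|No) P(No) = (3/5)(2/5)(4/5)(2/5)(5/14) ≈ 0.027.

```python
COLUMNS = ["Outlook", "Temp", "Humid", "Wind", "PlayTennis"]
TABLE = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in TABLE.splitlines()]

def nb_score(x, label, rows=ROWS, target="PlayTennis"):
    """P(C) * product over attributes of P(a_j | C), from raw frequencies."""
    in_class = [r for r in rows if r[target] == label]
    score = len(in_class) / len(rows)             # prior P(C)
    for attr, value in x.items():
        matches = sum(r[attr] == value for r in in_class)
        score *= matches / len(in_class)          # P(a_j | C)
    return score

X = {"Outlook": "Sunny", "Temp": "Mild", "Humid": "High", "Wind": "Weak"}
scores = {label: nb_score(X, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

With these counts the No score is the larger one, which agrees with the training row (Sunny, Mild, High, Weak, No).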

Probability of Infrequent Values

•  (outlook = Sunny, temp = high, humid = low, wind = weak)?
•  P(humid = low) = 0, because no tuple in the training dataset has humid = low, so the whole product becomes 0


Smoothing

•  Suppose an attribute has n different values: a_1, …, a_n
•  Assume a small enough value ε > 0
•  Let P_i be the frequency of a_i: P_i = (# tuples having a_i) / (total # of tuples)
•  Estimate

$$P(a_i) = \epsilon + \frac{1 - n\epsilon}{n}\,P_i$$

Handling Continuous Attributes

•  Discretization
•  Probability density estimation

Density Estimation

•  Let μ_ij and σ_ij² be the mean and variance of attribute X_i over all samples of class C_j, respectively

$$P(X_i = x_i \mid C_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$

Characteristics of Naïve Bayes

•  Robust to isolated noise points
  – Such points are averaged out in probability computation
•  Insensitive to missing values
•  Robust to irrelevant attributes
  – Distributions on such attributes are almost uniform
•  Correlated attributes degrade the performance

Bayes Error Rate

•  The error rate of the ideal naïve Bayes classifier

$$Err = \int_{0}^{\hat{x}} P(\mathrm{Crocodile} \mid X)\,dX + \int_{\hat{x}}^{\infty} P(\mathrm{Alligator} \mid X)\,dX$$

  – x̂ is the decision boundary between the two classes

Pros and Cons

•  Pros
  – Easy to implement
  – Good results obtained in many cases
•  Cons
  – A (too) strong assumption: independent attributes
•  How to handle dependent/correlated attributes?
  – Bayesian belief networks


Bayesian Networks

•  A Bayesian belief network allows a subset of the variables to be conditionally independent
•  A graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of joint probability distribution

[Figure: directed graph with edges X → Z, Y → Z, Y → P]

•  Nodes: random variables
•  Links: dependency
•  X and Y are the parents of Z, and Y is the parent of P
•  No dependency between Z and P
•  The graph has no loops or cycles

Bayesian Belief Network: Example

[Figure: network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

$$P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid \mathrm{Parents}(Z_i))$$


Training Bayesian Networks

•  Given both the network structure and all variables observable: learn only the CPTs

•  Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning

•  Network structure unknown, all variables observable: search through the model space to reconstruct graph topology

•  Unknown structure, all hidden variables: no good algorithms known for this purpose


Associative Classification

•  Mine possible association rules (PR) of the form condset → c
   – condset: a set of attribute-value pairs
   – c: a class label
•  Build the classifier
   – Organize rules in decreasing precedence based on confidence and support
•  Classification
   – Use the first matching rule to classify an unknown case
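The ordering and first-match steps can be sketched as follows. This is a minimal illustration, not the CBA implementation: the `Rule` class and both function names are hypothetical, the rules are assumed to be already mined, and ties in confidence are broken by support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    condset: frozenset  # set of (attribute, value) pairs
    label: str          # class label c
    confidence: float
    support: float


def build_classifier(rules):
    """Order rules by decreasing confidence, breaking ties by decreasing support."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))


def classify(ordered_rules, case, default=None):
    """Return the label of the first rule whose condset is contained in the case."""
    items = set(case.items())
    for rule in ordered_rules:
        if rule.condset <= items:
            return rule.label
    return default  # no rule matches the case
```

A case is passed as a dictionary from attribute to value; the `default` label for unmatched cases is an added convenience that the slide does not discuss.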


Associative Classification Methods

•  CBA (Classification By Association: Liu, Hsu & Ma, KDD’98)
   – Mine possible association rules of the form cond-set (a set of attribute-value pairs) → class label
   – Build classifier: organize rules in decreasing precedence based on confidence and then support
•  CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01)
   – Classification: statistical analysis on multiple rules


CMAR – Model Generation

•  Classification based on Multiple Association Rules
•  Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
•  Rule pruning whenever a rule is inserted into the tree
   – Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
   – Prune rules where the rule antecedent and class are not positively correlated, based on a χ² test of statistical significance


CMAR – Classification

•  Classification based on generated/pruned rules

•  If only one rule satisfies tuple X, assign the class label of the rule

•  If a rule set S satisfies X, CMAR
   – Divides S into groups according to class labels
   – Uses a weighted χ² measure to find the strongest group of rules, based on the statistical correlation of rules within a group
   – Assigns X the class label of the strongest group

Classification by Aggregating Emerging Patterns

•  Emerging pattern (EP): a pattern frequent in one class of data but infrequent in others
   – age<=30 is frequent in class “buys_computer=yes” and infrequent in class “buys_computer=no”
   – Rule: age<=30 → buys_computer=yes
•  G. Dong & J. Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD’99


How to Mine Emerging Patterns?

•  Border differential
   – Max-patterns in D1 w.r.t. min_sup=90%
   – Max-patterns in D2 w.r.t. min_sup=10%
   – X is a pattern covered by a max-pattern in D1 but not by a max-pattern in D2 → X is an emerging pattern
•  Method
   – Mine max-patterns in D1 and D2, respectively
   – Compare the two sets of borders to find the “maximal” patterns that are frequent in D1 and infrequent in D2
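The condition that the border comparison tests can be checked directly for a single candidate pattern. The sketch below is a brute-force illustration and not the border-differential algorithm: it scans both data sets instead of comparing max-patterns, and the function names are hypothetical. The two thresholds default to the 90% and 10% used above.

```python
def support(pattern, dataset):
    """Fraction of transactions in `dataset` that contain every item of `pattern`."""
    pattern = frozenset(pattern)
    return sum(pattern <= frozenset(t) for t in dataset) / len(dataset)


def is_emerging(pattern, d1, d2, min_sup_d1=0.9, min_sup_d2=0.1):
    """True if `pattern` is frequent in d1 and infrequent in d2."""
    return support(pattern, d1) >= min_sup_d1 and support(pattern, d2) < min_sup_d2
```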


Instance-based Methods

•  Instance-based learning
   – Store training examples and delay the processing until a new instance must be classified (“lazy evaluation”)
•  Typical approaches
   – K-nearest neighbor approach: instances represented as points in a Euclidean space
   – Locally weighted regression: construct local approximation
   – Case-based reasoning: use symbolic representations and knowledge-based inference


The K-Nearest Neighbor Method

•  Instances are points in an n-D space
•  The k-nearest neighbors (KNN) are defined in terms of the Euclidean distance
   – Return the most common value among the k training examples nearest to the query point
•  Discrete-/real-valued target functions

[Figure: a query point xq surrounded by positive (+) and negative (−) training examples]


KNN Methods

•  For continuous-valued target functions, return the mean value of the k nearest neighbors

•  Distance-weighted nearest neighbor algorithm
   – Give greater weights to closer neighbors: $w \equiv \dfrac{1}{d(x_q, x_i)^2}$
•  Robust to noisy data by averaging k-nearest neighbors
•  Curse of dimensionality
   – Distance could be dominated by irrelevant attributes
   – Remedy: stretch the axes or eliminate the least relevant attributes
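The plain vote and the distance-weighted vote can be compared in a few lines. This is a minimal sketch for discrete class labels only; the function name and the handling of an exact match (distance zero) are assumptions, not part of the slides.

```python
import math
from collections import defaultdict


def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` by a (distance-weighted) vote of its k nearest neighbors.

    `train` is a list of (point, label) pairs; points are equal-length tuples.
    """
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = math.dist(point, query)
        if weighted:
            if d == 0.0:
                return label  # exact match: avoid dividing by zero
            votes[label] += 1.0 / d ** 2  # w = 1 / d(xq, xi)^2
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)
```

With one very close neighbor of one class and two distant neighbors of another, the unweighted vote follows the majority while the weighted vote follows the close neighbor.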


Case-based Reasoning

•  Lazy evaluation + analysis of similar instances

•  Methodology
   – Instances represented by rich symbolic descriptions (e.g., function graphs)
   – Combine multiple retrieved cases
   – Tight coupling between case retrieval, knowledge-based reasoning, and problem solving


Lazy vs. Eager Learning

•  Efficiency: lazy learning uses less training time but more prediction time
•  Accuracy
   – Lazy: effectively uses a richer hypothesis space
   – Eager: must commit to a single hypothesis that covers the entire instance space


Artificial Neural Networks

•  (To some extent) simulating biological neural networks

•  Basic mechanisms – Perceptrons – Multilayer networks

•  Essential algorithm: BACKPROPAGATION


Perceptrons

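•  Input: a vector of real values
•  Calculate a linear combination of inputs
   – Output 1 if the result is positive, -1 otherwise

[Figure: perceptron unit with inputs x1, x2, …, xn, a constant input x0 = 1, and weights w0, w1, w2, …, wn feeding a summation node Σ]

$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$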


Why Perceptrons?

•  A perceptron is a hyperplane decision surface

•  Perceptrons represent all primitive Boolean functions – AND, OR, …

[Figure: a linearly separable set of + and − points, and a linearly inseparable one]

•  AND: w0 = -0.8, w1 = w2 = 0.5
•  OR: w0 = -0.3, w1 = w2 = 0.5
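The two weight settings can be checked against the truth tables. The sketch below assumes that Boolean inputs are encoded as 0 (false) and 1 (true), which the slide does not state; under that encoding both settings behave as AND and OR.

```python
def perceptron_output(weights, inputs):
    """Threshold unit: weights[0] is the bias weight w0 paired with x0 = 1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1


# Weight settings from the slide: (w0, w1, w2).
AND_WEIGHTS = (-0.8, 0.5, 0.5)
OR_WEIGHTS = (-0.3, 0.5, 0.5)

# All four input combinations, assuming a 0/1 encoding of false/true.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_table = [perceptron_output(AND_WEIGHTS, x) for x in INPUTS]
or_table = [perceptron_output(OR_WEIGHTS, x) for x in INPUTS]
```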


Training Perceptrons

•  Begin with random weights
•  Iteratively apply the perceptron to each training example, modifying the weights whenever it misclassifies an example
   – wi ← wi + Δwi
   – Δwi = η(t - o)xi
   – t: the target output for the current example
   – o: the output generated by the perceptron
   – η: a positive constant called the learning rate
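The training rule can be sketched as a short loop. To keep the example reproducible it starts from zero weights rather than random ones, which is an assumption of this sketch; the function names, the stopping test, and the epoch limit are also illustrative choices.

```python
def predict(weights, x):
    """Perceptron output in {-1, +1}; weights[0] pairs with the constant input x0 = 1."""
    net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if net > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply wi <- wi + eta * (t - o) * xi until an epoch has no mistakes."""
    n = len(examples[0][0])
    weights = [0.0] * (n + 1)  # assumption: zeros instead of random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(weights, x)
            if o != t:
                mistakes += 1
                xb = (1.0,) + tuple(x)  # prepend the constant input x0 = 1
                for i, xi in enumerate(xb):
                    weights[i] += eta * (t - o) * xi
        if mistakes == 0:
            break
    return weights
```

On linearly separable data such as the OR function the loop stops after a few epochs; on inseparable data it simply runs until `max_epochs`.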


Why Does the Training Work?

•  The training example is correctly classified
   – (t - o) = 0 → Δwi = 0
•  The target output is +1 and the perceptron outputs -1
   – (t - o) = 2
   – If xi > 0, increasing wi will increase the output
   – If xi < 0, decreasing wi will increase the output

Update rule: wi ← wi + Δwi, where Δwi = η(t - o)xi


The Sigmoid Unit

•  Similar to perceptron, except for the output function

•  σ: the sigmoid or logistic function

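[Figure: sigmoid unit with inputs x1, x2, …, xn, a constant input x0 = 1, and weights w0, w1, w2, …, wn feeding a summation node Σ]

$$\mathrm{net} = \sum_{i=0}^{n} w_i x_i, \qquad o = \sigma(\mathrm{net}) = \frac{1}{1 + e^{-\mathrm{net}}}$$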
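A sigmoid unit differs from the perceptron only in the last step, as this small sketch shows. The function names are illustrative, and the direct use of `math.exp` assumes inputs of moderate size (a very large negative `net` would overflow).

```python
import math


def sigmoid(net):
    """Logistic function: 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


def sigmoid_unit(weights, inputs):
    """Output of a sigmoid unit; weights[0] pairs with the constant input x0 = 1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(net)
```

The output is a smooth value between 0 and 1 instead of a hard -1/+1 decision, which is what makes gradient-based training possible.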


Multilayer Networks

•  Each layer has some perceptrons/sigmoid units

•  A unit connects to all units in the neighboring layers


Backpropagation Algorithm

•  Training a multilayer network
•  Create a feed-forward network
•  Initialize all network weights to small random numbers (e.g., -0.5 to 0.5)
•  Until the termination condition is met, do
   – For each training example
      •  Propagate the input forward through the network
      •  Propagate the errors backward through the network
      •  Update each network weight
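The loop above can be sketched for the smallest interesting case. The code assumes one hidden layer of sigmoid units, a single sigmoid output, squared error, and a fixed number of epochs as the termination condition; none of these choices is fixed by the slides, and all function names are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output (bias input = 1)."""
    xb = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hidden]
    hb = [1.0] + hidden
    out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return hidden, out


def backprop_step(x, target, w_hidden, w_out, eta=0.5):
    """One in-place weight update for squared error on a single example."""
    hidden, out = forward(x, w_hidden, w_out)
    xb = [1.0] + list(x)
    hb = [1.0] + hidden
    # Error term of the output unit.
    delta_out = out * (1.0 - out) * (target - out)
    # Error terms of the hidden units, propagated backward through w_out.
    delta_hidden = [h * (1.0 - h) * w_out[j + 1] * delta_out
                    for j, h in enumerate(hidden)]
    # Update every weight in proportion to its error term and its input.
    for j in range(len(w_out)):
        w_out[j] += eta * delta_out * hb[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] += eta * delta_hidden[j] * xb[i]


def train(examples, n_hidden=2, epochs=1000, seed=0):
    """Initialize weights in [-0.5, 0.5] and run a fixed number of epochs."""
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in examples:
            backprop_step(x, t, w_hidden, w_out)
    return w_hidden, w_out
```

The hidden error terms are computed before any weight is changed, so the backward pass uses the same weights as the forward pass.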


Termination Conditions

•  A fixed number of iterations
•  Once the error on the training examples is below some threshold
•  Once the error on the test set meets some criterion


When Are ANNs Good?

•  The training data are noisy and complex – Example: sensor data, image data

•  Problems for which more symbolic representations are often used – ANNs have a capability similar to that of decision trees


Appropriate Problems for ANNs

•  Many attributes
•  Target function may be discrete- or real-valued, or a vector of multiple attributes
•  Errors in training examples
•  Long training time is acceptable
•  Fast classification time is required
•  Understandability is unimportant


Support Vector Machines (SVM)

[Figure: two separating hyperplanes, one with a small margin and one with a large margin, with the support vectors marked]


Linear SVM

•  Given a set of points $x_i \in \mathbb{R}^n$ with labels $y_i \in \{-1, 1\}$
•  The SVM finds a hyperplane separating the positive and negative samples


Separate Samples by Projection

•  For linearly inseparable data, project the data to high dimensional space where it is linearly separable

[Figure: three labeled points at −1, 0, and +1 on a line, and the labeled points (1,0), (0,0), and (0,1) in the plane after projection]
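The idea can be illustrated with the three points from the figure. The sketch rests on two assumptions that the slide does not state: the points −1 and +1 are positive and 0 is negative, and the projection is φ(x) = (max(−x, 0), max(x, 0)), which is simply one mapping that reproduces the three 2-D points shown. The separating line is likewise a hand-picked example.

```python
def phi(x):
    """Hypothetical projection of a 1-D point to 2-D: (max(-x, 0), max(x, 0))."""
    return (max(-x, 0.0), max(x, 0.0))


def classify(z):
    """Linear decision rule in the projected space: z1 + z2 - 0.5 > 0."""
    return 1 if z[0] + z[1] - 0.5 > 0 else -1


# Assumed labels: the two outer points are positive, the middle one is negative.
points = {-1.0: 1, 0.0: -1, 1.0: 1}
predictions = {x: classify(phi(x)) for x in points}
```

No single threshold on the line separates the outer points from the middle one, but after the projection one straight line does.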


Non-linear SVM

•  Learn a hyperplane
•  Use quadratic programming techniques
•  With kernels, an SVM can learn very complex functions


Non-linear SVM: An Example


Errors in Classification

•  Bias: the difference between the real class boundary and the decision boundary of a classification model

•  Variance: variability in the training data set
•  Intrinsic noise in the target class: the target class can be non-deterministic – instances with the same attribute values can have different class labels


Bias

Figure from [Tan, Steinbach, Kumar]


One or More?

•  What if a medical doctor is not sure about a case?
   – Joint diagnosis: use a group of doctors with different expertise
   – The wisdom of the crowd is often more accurate
•  All eager learning methods make predictions using a single classifier induced from training data
   – A single classifier may have low confidence in some cases
•  Ensemble methods: construct a set of base classifiers and take a vote on predictions in classification


Ensemble Classifiers

[Figure: the three steps of building an ensemble]
•  Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D
•  Step 2: build multiple classifiers C1, C2, …, Ct-1, Ct
•  Step 3: combine the classifiers into C*, where C*(x) = Vote(C1(x), …, Ct(x))

Figure from [Tan, Steinbach, Kumar]


Why May Ensemble Method Work?

•  Suppose there are two classes and each base classifier has an error rate of 35%

•  What if we use 25 base classifiers?
   – If all base classifiers are identical, the ensemble error rate is still 35%
   – If base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong

$$\sum_{i=13}^{25} \binom{25}{i}\, 0.35^{i}\, 0.65^{25-i} = 0.06$$
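The sum is easy to evaluate numerically. The sketch below computes the probability that more than half of n independent base classifiers are wrong; the function name and its parameters are illustrative.

```python
from math import comb


def ensemble_error(n_classifiers=25, eps=0.35):
    """P(more than half of n independent base classifiers are wrong)."""
    k_min = n_classifiers // 2 + 1  # 13 when n_classifiers = 25
    return sum(
        comb(n_classifiers, i) * eps ** i * (1 - eps) ** (n_classifiers - i)
        for i in range(k_min, n_classifiers + 1)
    )
```

Calling the function with a base error rate of 0.5 returns 0.5, which shows why base classifiers must do better than random guessing for the ensemble to help.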

Ensemble Error Rate

Figure from [Tan, Steinbach, Kumar]


Ensemble Classifiers – When?

•  The base classifiers should be independent of each other

•  Each base classifier should do better than a classifier that performs random guessing


How to Construct Ensemble?

•  Manipulating the training set: derive multiple training sets and build a base classifier on each

•  Manipulating the input features: use only a subset of features in a base classifier

•  Manipulating the class labels: if there are many classes, for each base classifier randomly divide the classes into two subsets A and B and train on the resulting two-class problem; for a test case, if a base classifier predicts its class as A, all classes in A receive a vote

•  Manipulating the learning algorithm, e.g., using different network configurations in an ANN


Bootstrap

•  Given an original training set T, derive a training set T’ by repeatedly sampling uniformly with replacement

•  If T has n tuples, each tuple has a probability p = 1 - (1 - 1/n)^n of being selected in T’
   – When n → ∞, p → 1 - 1/e ≈ 0.632

•  Use the tuples not in T’ as the test set
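Both the probability and the split can be sketched briefly. The function names are illustrative, and the fixed seed is only there to make the example reproducible.

```python
import random


def inclusion_probability(n):
    """Probability that a given tuple appears at least once in n draws."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_split(data, seed=0):
    """Draw len(data) items with replacement; unused items form the test set."""
    rng = random.Random(seed)
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    train = [data[i] for i in chosen]
    picked = set(chosen)
    test = [data[i] for i in range(n) if i not in picked]
    return train, test
```

For n = 1000 the inclusion probability is already within 0.001 of the limit 0.632.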


Bagging

•  Run bootstrap k times to obtain k base classifiers
•  A test instance is assigned to the class that receives the highest number of votes
•  Strength: reduce the variance of base classifiers – good for unstable base classifiers
   – Unstable classifiers: sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANN

•  For stable classifiers (e.g., linear discriminant analysis and kNN classifiers), bagging may even degrade the performance since each bootstrap sample contains fewer distinct training tuples (about 63.2% of the original set)

•  Less overfitting on noisy data


Boosting

•  Assign a weight to each training example
   – Initially, each example is assigned a weight 1/n
•  Weights can be used in one of the following ways
   – As a sampling distribution to draw a set of bootstrap samples from the original training set
   – By a base classifier to learn a model biased towards heavier examples
•  Adaptively change the weights at the end of each boosting round
   – The weight of a correctly classified example decreases
   – The weight of an incorrectly classified example increases
•  Each round generates a base classifier


Critical Design Choices in Boosting

•  How are the weights of the training examples updated at the end of each boosting round?

•  How are the predictions made by base classifiers combined?


AdaBoost

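•  Each base classifier carries an importance score related to its error rate
   – Error rate of base classifier $C_i$: $\varepsilon_i = \dfrac{1}{N} \sum_{j=1}^{N} w_j\, I\big(C_i(x_j) \neq y_j\big)$
   – $w_j$: the weight of example $(x_j, y_j)$; $I(p) = 1$ if $p$ is true, and 0 otherwise
   – Importance score: $\alpha_i = \dfrac{1}{2} \ln\!\left(\dfrac{1 - \varepsilon_i}{\varepsilon_i}\right)$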


How Does Importance Score Work?


Weight Adjustment in AdaBoost

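•  Weight update from round j to round j+1:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$$

where $Z_j$ is the normalization factor ensuring $\sum_i w_i^{(j+1)} = 1$

   – If any intermediate round generates an error rate of more than 50%, the weights are reverted back to 1/n

•  The ensemble error rate is bounded

$$e_{\text{ensemble}} \le \prod_i \sqrt{\varepsilon_i (1 - \varepsilon_i)}$$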
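One boosting round can be sketched as follows. The code assumes that the example weights already sum to 1 and therefore uses the plain weighted error Σ w I(prediction ≠ label), without the additional 1/N factor; it also assumes an error rate strictly between 0 and 1 and leaves out the reset to 1/n. The function name is illustrative.

```python
import math


def adaboost_round(weights, predictions, labels):
    """Return (error rate, importance score, updated weights) for one round.

    Assumes `weights` sum to 1 and that the error rate is strictly between 0 and 1.
    """
    eps = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Shrink the weights of correct examples, grow those of misclassified ones.
    unnormalized = [
        w * math.exp(-alpha if p == y else alpha)
        for w, p, y in zip(weights, predictions, labels)
    ]
    z = sum(unnormalized)  # normalization factor Z
    return eps, alpha, [w / z for w in unnormalized]
```

With five equally weighted examples and one mistake, the error rate is 0.2, the importance score is ln 2, and the misclassified example ends up carrying half of the total weight.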


Random Forests

Figure from [Tan, Steinbach, Kumar]


Random Forests

•  Using decision trees as base classifiers
•  Each tree uses only a subset of features in classification
•  Bagging can be used to generate training sets for decision trees


Forest-RI (Random Input)

•  Each tree is built using a random subset of features
•  A tree is grown to its entirety without pruning
•  The smaller the sets of features used by decision trees, the less correlated the trees
•  The larger the sets of features used by decision trees, the more accurate the trees
•  Tradeoff: m = log2(d) + 1
   – m: size of the random feature subset
   – d: number of features in the training set
•  If d is too small, it is hard to obtain independent feature sets
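The subset size and the random choice can be sketched as below. Rounding log2(d) down to an integer is an assumption of this sketch, as are the function names; the subset is drawn once per tree, following the description above.

```python
import math
import random


def num_features(d):
    """Subset size m = log2(d) + 1, with log2(d) rounded down (an assumption)."""
    return int(math.log2(d)) + 1


def random_feature_subset(d, rng):
    """Pick m distinct feature indices out of d, uniformly at random."""
    return sorted(rng.sample(range(d), num_features(d)))
```

For example, 16 features give subsets of size 5 and 100 features give subsets of size 7.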


Forest-RC

•  When the number of features in the training set is too small, use linear combinations of features to generate new features to build decision trees

•  Randomly select a set of features L
•  Linearly combine features in L using coefficients generated from a uniform distribution in the range of [-1, 1]

•  At each node, m such randomly combined new features are generated, and the best of them is used to split the node

Prediction and Time

•  “Prediction is very difficult, especially about the future.” (Niels Bohr, 1885 - 1962)
•  “An economist is an expert who will know tomorrow why the things he predicted yesterday didn’t happen today.” (Laurence J. Peter, 1919 - 1988)
•  “If something anticipated arrives too late it finds us numb, wrung out from waiting, and we feel – nothing at all. The best things arrive on time.” (Dorothy Gilman, A New Kind of Country, 1978)


Early Diagnosis

•  A retrospective study of the clinical data of infants admitted to a neonatal intensive care unit found that infants who were diagnosed with sepsis had abnormal heartbeat time series patterns in the 24 hours preceding the diagnosis

•  Monitoring the heartbeat time series data and classifying the time series data as early as possible may lead to early diagnosis and effective therapy


Online Traffic Classification

•  By observing only the first five packets of a TCP connection, the application associated with the traffic flow can be classified accurately

•  The applications of online traffic can be identified without waiting for the TCP flow to end


Objectives in Classification

•  Objective function for optimization: quality – how well does the learned model approximate the latent mechanism?
   – Accuracy
   – Recall
   – F-measure
   – Many other measures

•  Time is not a first-class citizen in the problem setting!


Early Classification – Problem

•  Data: time series / temporal sequences with class labels

•  Task: construct a model that can produce class label predictions as early as possible
   – How can we measure the earliness?
   – Expected classification time (other measures are possible)
•  Constraint: classification quality should be at a satisfactory level
   – How can we constrain the quality? An important issue


The Price on Time

•  In the classical classification problem, time is not considered as a factor

•  In early classification, time has a price — a model becomes more and more expensive as it consumes longer and longer prefixes


[Figure: the complexity of the model concerned versus time, in early classification and in traditional classification]

Example


[Figure: example time series from Class Star and Class Diamond, with Feature A and Feature B marked]

Challenges

•  How to incorporate time in classification?
•  How to avoid overfitting in time?
•  How to make a balanced tradeoff between time and quality?
•  Simplification
   – All time series are normalized and aligned
   – All time series have the same length
   – For each time moment, we can get a snapshot of the distribution of the time series, in which each time series is a point


Example


[Figure: finding the earliest classification time among the moments t1, t2, and t3 on the time axis]

A Nearest Neighbor Idea

•  Exploiting the extreme of local features: using 1NN to classify
•  A time series can be used reliably in 1NN classification if its local neighborhood belongs to the same class
   – For a time series s, the reverse 1NNs of s use s in classification; s* is a reverse 1NN of s if s is the 1NN of s*
   – The local neighborhood of s can be captured by the reverse 1NNs of s
   – The local neighborhood of s is pure if all time series in the neighborhood belong to the same class as s does
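The reverse-1NN definitions translate almost directly into code. The sketch below assumes equal-length time series compared by Euclidean distance, breaks ties by taking the first nearest neighbor, and treats a series without any reverse 1NN as pure; these choices and the function names are assumptions.

```python
import math


def nearest_neighbor(i, series):
    """Index of the 1NN of series[i] among the other series (Euclidean distance)."""
    return min(
        (j for j in range(len(series)) if j != i),
        key=lambda j: math.dist(series[i], series[j]),
    )


def reverse_nearest_neighbors(i, series):
    """Indices j such that series[i] is the 1NN of series[j]."""
    return [
        j for j in range(len(series))
        if j != i and nearest_neighbor(j, series) == i
    ]


def is_pure(i, series, labels):
    """True if every reverse 1NN of series[i] has the same class as series[i].

    A series with no reverse 1NN is treated as pure (vacuously true).
    """
    return all(labels[j] == labels[i] for j in reverse_nearest_neighbors(i, series))
```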


[Figure: an impure and a pure local neighborhood of a time series s]

Minimum Prediction Length

•  The length of the shortest prefix that can be trusted to classify other time series as accurately as the full-length time series


[Figure: time series a, b, c, and d of full length L = 50, with MPL = 33 and MPL = 41 marked; for prefix lengths below 33, NN(s) = a[1,1]; for prefix lengths of at least 41, NN(s) = d[1,41] (red class)]


Extracting Interpretable Features

•  Using features can improve generality of classifiers and help to avoid overfitting

•  Features can characterize the latent data generating mechanism

•  In some applications, such as medical diagnosis, users want to know not only what (i.e., the class labels), but also why (i.e., the manifesting features explaining the decision process)


Shapelets as Features

•  A shapelet is a pair (s, δ): a time series segment s and a distance threshold δ

•  Optimization objective — best information gain
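How a shapelet (s, δ) is matched against a time series can be sketched as follows. The slides do not define the distance, so the sketch assumes the common choice: the smallest Euclidean distance between the segment and any window of the same length. The function names are illustrative, and the series is assumed to be at least as long as the segment.

```python
import math


def shapelet_distance(segment, series):
    """Smallest Euclidean distance between `segment` and any equally long window."""
    m = len(segment)
    return min(
        math.dist(segment, series[i:i + m])
        for i in range(len(series) - m + 1)
    )


def matches(shapelet, series):
    """True if the series contains a window within distance delta of the segment."""
    segment, delta = shapelet
    return shapelet_distance(segment, series) <= delta
```

A shapelet then acts as a Boolean feature of a time series, and candidate shapelets can be compared by the information gain of the split they induce.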


[Figure: shapelet Feature A and Feature B extracted from the example time series of Class Diamond and Class Star]

Local Distinctive Shapelets

•  Finding shapelets shared by some time series that belong to the same class but not by the other time series
•  Finding shapelets that appear early in time series
•  Rank features according to their precision, recall, and earliness
   – Use a minimum threshold to avoid overfitting
   – More sophisticated methods may be used


Early Classification Is Accurate


[Figure: bar chart of accuracy (axis from 0 to 100) on the data sets ECG, GunPoint, CBF, Syn-Con, Wafer, Olive, and Two-Patterns for the methods EDSC-CHE, EDSC-KDE, ECTS, and FULL 1NN]

Earliness


[Figure: bar chart of the percentage of average prediction length (axis from 0 to 100) on the data sets ECG, GunPoint, CBF, Syn-Con, Wafer, Olive, and Two-Patterns for the methods EDSC-CHE, EDSC-KDE, ECTS, and FULL 1NN]

Features (Gun-Point Data Set)


Summary

•  Early classification is useful in a few important applications
   – Medical and health informatics applications
   – Intrusion detection
   – Security and safety
   – Some of our algorithms are being tested for astronaut tiredness prediction in a spaceship project
•  Early prediction may be connected to many frontiers of data mining research
   – Subspace feature extraction
   – Concise and non-redundant feature representation
•  The problem has not been thoroughly investigated


Open Problems

•  How to balance earliness and classification quality?
   – Progressive and interactive early classification?
•  How to “rescue” mis-classified samples over time?
   – Re-classification?
•  Theoretical foundation for early classification?
•  Earliness-aware data mining?
•  A more general model for cost-sensitive learning?