Final Review
This is not a comprehensive review; it highlights certain key areas.
Jan 16, 2016
Top-Level Data Mining Tasks
• At the highest level, data mining tasks can be divided into:
– Prediction tasks (supervised learning)
• Use some variables to predict unknown or future values of other variables
• Classification
• Regression
– Description tasks (unsupervised learning)
• Find human-interpretable patterns that describe the data
• Clustering
• Association rule mining
Classification: Definition
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class, which is to be predicted.
• Find a model for the class attribute as a function of the values of the other attributes.
– The model maps a record to a class value.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
• Can you think of classification tasks?
Classification
• Simple linear
• Decision trees (entropy, GINI)
• Naïve Bayesian
• Nearest neighbor
• Neural networks
Regression
• Predict a value of a given continuous (numerical) variable based on the values of other variables
• Greatly studied in statistics
• Examples:
– Predicting sales amounts of a new product based on advertising expenditure
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices
Clustering
• Given a set of data points, find clusters so that
– Data points in the same cluster are similar
– Data points in different clusters are dissimilar
You try it on the Simpsons. How can we cluster these 5 “data points”?
Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on occurrences of other items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers
• But the properties of the attribute values can be different
– ID has no limit, but age has a maximum and minimum value
Types of Attributes
• There are different types of attributes
– Nominal (categorical)
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

[Example tree: the root tests outlook; the sunny branch tests humidity (high → no, normal → yes), the overcast branch predicts yes, and the rain branch tests wind (strong → no, weak → yes)]
How do we construct the decision tree?
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
– There are no samples left
• Pre-pruning/post-pruning
How To Split Records
• Random split
– The tree can grow huge
– These trees are hard to understand
– Larger trees are typically less accurate than smaller trees
• Principled criterion
– Selection of an attribute to test at each node: choosing the most useful attribute for classifying examples. How?
– Information gain
• Measures how well a given attribute separates the training examples according to their target classification
• This measure is used to select among the candidate attributes at each step while growing the tree
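As a sketch of how information gain drives attribute selection, here is a minimal Python version; the helper names and the toy data are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting (rows, labels) on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy data: attribute 0 perfectly separates the classes, attribute 1 is useless.
rows = [("sunny", "x"), ("sunny", "y"), ("rain", "x"), ("rain", "y")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0 (perfect split)
print(information_gain(rows, labels, 1))  # 0.0 (uninformative split)
```

The attribute with the highest gain would be chosen as the test at the current node.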
Advantages/Disadvantages of Decision Trees
• Advantages:
– Easy to understand (doctors love them!)
– Easy to generate rules
• Disadvantages:
– May suffer from overfitting
– Classifies by rectangular partitioning (so it does not handle correlated features very well)
– Can be quite large; pruning is necessary
– Does not handle streaming data easily
Overfitting (another view)
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
– There may be noise in the training data that the tree is erroneously fitting.
– The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
[Figure: accuracy on the training data vs. on the test data, plotted against hypothesis complexity/size of the tree (number of nodes)]
Notes on Overfitting
• Overfitting results in decision trees (models in general) that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors
Evaluation
• Accuracy
• Recall/Precision/F-measure
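A minimal sketch of how these evaluation measures are computed for a single positive class; the function name and toy labels are illustrative:

```python
def evaluate(actual, predicted, positive):
    """Accuracy, precision, recall, and F-measure for one positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

actual    = ["+", "+", "+", "-", "-", "-"]
predicted = ["+", "+", "-", "+", "-", "-"]
acc, prec, rec, f1 = evaluate(actual, predicted, positive="+")
print(acc, prec, rec, f1)  # all 2/3 for this toy case
```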
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier, also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea.
Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.
Go through all the examples on the slides and be ready to generate tables similar to the ones presented in class and the one you created for your HW assignment.
Smoothing
Bayesian Classifiers
• Bayesian classifiers use Bayes theorem, which says

p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.
• p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj. This is just how frequent the class cj is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
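The "pick the most probable class" idea can be sketched in a few lines. This is an illustrative naïve Bayes with raw counts and no smoothing, so unseen attribute values get probability zero; the data and function names are made up for the example:

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count-based estimates of p(cj) and p(attribute value | cj)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attr index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def classify(row, class_counts, value_counts):
    """Pick the class maximizing p(cj) * prod_i p(d_i | cj); p(d) is ignored."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for cj, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            score *= value_counts[(i, cj)][value] / count
        if score > best_score:
            best_class, best_score = cj, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(classify(("rain", "mild"), *model))  # 'yes'
```

Smoothing (see the note below) would replace the raw count ratios so that a single zero count cannot wipe out a whole class.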
Bayesian Classification
– A statistical method for classification
– A supervised learning method
– Assumes an underlying probabilistic model and applies Bayes theorem
– Can solve diagnostic and predictive problems
– Particularly suited when the dimensionality of the input is high
– In spite of the over-simplified independence assumption, it often performs well in many complex real-world situations
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan); fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
Nearest-Neighbor Classifiers
• Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute its distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
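The three-step classification procedure can be sketched directly, using Euclidean distance and a majority vote; the training records are a toy example:

```python
import math
from collections import Counter

def knn_classify(training, unknown, k):
    """Majority vote over the k training records nearest to `unknown`.

    `training` is a list of (point, class_label) pairs.
    """
    neighbors = sorted(
        training,
        key=lambda rec: math.dist(rec[0], unknown),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(training, (2, 2), k=3))  # 'A'
print(knn_classify(training, (8, 7), k=3))  # 'B'
```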
Up to now we have assumed that the nearest-neighbor algorithm uses the Euclidean distance; however, this need not be the case…

Minkowski: D(Q, C) = ( Σ_{i=1..n} |q_i − c_i|^p )^(1/p)
Euclidean (p = 2): D(Q, C) = sqrt( Σ_{i=1..n} (q_i − c_i)^2 )
– Manhattan (p = 1)
– Max (p = ∞)
Other options: Mahalanobis, weighted Euclidean
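A quick sketch of the alternative distance functions named above (Mahalanobis and weighted Euclidean are omitted since they need a covariance matrix or a weight vector):

```python
def minkowski(q, c, p):
    """Minkowski distance: (sum_i |q_i - c_i|^p)^(1/p)."""
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1 / p)

def chebyshev(q, c):
    """The p -> infinity limit of Minkowski: max coordinate difference."""
    return max(abs(qi - ci) for qi, ci in zip(q, c))

q, c = (0.0, 0.0), (3.0, 4.0)
print(minkowski(q, c, p=1))  # 7.0  (Manhattan)
print(minkowski(q, c, p=2))  # 5.0  (Euclidean)
print(chebyshev(q, c))       # 4.0  (Max)
```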
Strengths and Weaknesses
• Strengths:
– Simple to implement and use
– Comprehensible: easy to explain a prediction
– Robust to noisy data by averaging over the k nearest neighbors
– The distance function can be tailored using domain knowledge
– Can learn complex decision boundaries
• Much more expressive than linear classifiers and decision trees
• More on this later
• Weaknesses:
– Needs a lot of space to store all examples
– Takes much more time to classify a new example than a parsimonious model (need to compute the distance to all other examples)
– The distance function must be designed carefully with domain knowledge
Perceptrons
• The perceptron is a type of artificial neural network, which can be seen as the simplest kind of feedforward neural network: a linear classifier
• Introduced in the late 1950s
• Perceptron convergence theorem (Rosenblatt, 1962):
– The perceptron will learn to classify any linearly separable set of inputs
• XOR function (no linear separation)
• The perceptron is a network that is:
– single-layer
– feed-forward: data only travels in one direction
Perceptron: Artificial Neuron Model
• Model the network as a graph, with cells as nodes and synaptic connections as weighted edges from node i to node j with weight wji
• The input value received by a neuron is calculated by summing the weighted input values from its input links:

in = Σ_{i=0..n} w_i x_i   (in vector notation, w · x)

• The output is then produced by applying a threshold function to this sum
Examples (step activation function)

AND:
In1  In2 | Out
 0    0  |  0
 0    1  |  0
 1    0  |  0
 1    1  |  1

OR:
In1  In2 | Out
 0    0  |  0
 0    1  |  1
 1    0  |  1
 1    1  |  1

NOT:
In | Out
 0 |  1
 1 |  0

The threshold t can be folded into the weighted sum Σ_{i=0..n} w_i x_i by taking w_0 = −t (with a fixed input x_0 = 1).
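The truth tables above can be reproduced with a single perceptron per function. A sketch with a step activation; the weights and thresholds shown are one possible choice, not the only one:

```python
def perceptron_output(weights, threshold, inputs):
    """Step activation: fire (1) iff the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# AND: both inputs must contribute to reach the threshold.
AND = lambda x1, x2: perceptron_output([1, 1], 2, [x1, x2])
# OR: a single active input is enough.
OR = lambda x1, x2: perceptron_output([1, 1], 1, [x1, x2])
# NOT: a negative weight inverts the single input.
NOT = lambda x: perceptron_output([-1], 0, [x])

print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
print([OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 1]
print([NOT(x) for x in [0, 1]])                                  # [1, 0]
```

XOR has no such weight/threshold choice, which is exactly the "no linear separation" point made earlier.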
Summary of Neural Networks
• When are neural networks useful?
– Instances are represented by attribute-value pairs
• Particularly when attributes are real-valued
– The target function is
• Discrete-valued
• Real-valued
• Vector-valued
– Training examples may contain errors
– Fast evaluation times are necessary
• When not?
– Fast training times are necessary
– Understandability of the learned function is required
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
K-means Clustering
• A partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple
– A K-means tutorial is available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering
1. Ask the user how many clusters they’d like (e.g., k = 3)
2. Randomly guess k cluster center locations
3. Each data point finds out which center it’s closest to
4. Each center finds the centroid of the points it owns…
5. …and jumps there
6. Repeat until terminated!
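The steps above can be sketched as a minimal Lloyd's-algorithm implementation on toy 2-D points; here "terminated" means the centers stop moving:

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Assign points to the nearest center, recenter, and repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step 3: find the closest center
            i = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        new_centers = [                      # steps 4-5: move to the centroid
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # step 6: stop when nothing moves
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # one center near each group of points
```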
[Figures: successive K-means iterations on a 2-D scatter plot (axes: expression in condition 1 vs. expression in condition 2), with centers k1, k2, k3 moving at each step until convergence]
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
– Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Agglomerative is the most common
How to Define Inter-Cluster Similarity
[Figure: two candidate clusters of points p1…p5 and their proximity matrix; the question is which entries of the matrix define the similarity between the two clusters]
• MIN
• MAX
• Group average
• Distance between centroids
• Other methods driven by an objective function
– Ward’s method uses squared error
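A minimal agglomerative sketch that makes the linkage choice explicit: passing min gives MIN (single link) and max gives MAX (complete link). The data and function names are a toy example:

```python
import math

def agglomerative(points, k, linkage=min):
    """Merge the closest pair of clusters until only k clusters remain.

    `linkage` aggregates the pairwise point distances between two clusters:
    min gives single link (MIN), max gives complete link (MAX).
    """
    clusters = [[p] for p in points]            # start: every point is a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(p, q)
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))     # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(points, k=3, linkage=min))
```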
DBSCAN
• DBSCAN is a density-based algorithm
– Density = the number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point
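The three point types can be computed directly from their definitions. A sketch; note the slide says "more than" MinPts, while this version uses "at least" MinPts counting the point itself, as conventions vary between descriptions:

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point core, border, or noise per the DBSCAN definitions."""
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    core = [p for p in points if len(neighbors(p)) >= min_pts]
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(math.dist(p, c) <= eps for c in core):
            labels[p] = "border"   # near a core point, but not dense itself
        else:
            labels[p] = "noise"
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (9, 9)]
labels = classify_points(points, eps=1.5, min_pts=4)
print(labels)  # the dense square is core, (2, 2) is border, (9, 9) is noise
```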
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
• Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Rule Mining
• We are interested in rules that are:
– non-trivial (and possibly unexpected)
– actionable
– easily explainable
Support and Confidence
• Find all the rules X → Y with minimum confidence and support
– Support = probability that a transaction contains {X, Y}
• i.e., the ratio of transactions in which X and Y occur together to all transactions in the database
– Confidence = conditional probability that a transaction having X also contains Y
• i.e., the ratio of transactions in which X and Y occur together to those in which X occurs
• In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of the LHS:

Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
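Using the transaction table from the association-rule slide earlier, support and confidence can be computed directly from their definitions (a sketch; the function names are illustrative):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support(LHS u RHS) / Support(LHS)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
print(support(transactions, {"Milk", "Coke"}))                 # 0.6
print(confidence(transactions, {"Diaper", "Milk"}, {"Beer"}))  # 2/3
```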
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
(Slide credit: CS583, Bing Liu, UIC)
The Apriori Algorithm
• The best-known algorithm for this task
• Two steps:
– Find all itemsets that have minimum support (frequent itemsets, also called large itemsets)
– Use the frequent itemsets to generate rules
• E.g., given the frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
one rule from this itemset is
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
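A sketch of the first Apriori step (finding frequent itemsets) on a toy transaction set; the rule-generation step is omitted. Candidates of size k are built from frequent itemsets of size k−1, and any candidate with an infrequent subset is pruned before counting:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (the first Apriori step)."""
    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    min_count = min_support * len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items if support_count(s) >= min_count}
    k = 1
    while level:
        frequent.update({s: support_count(s) for s in level})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support_count(c) >= min_count
        }
    return frequent

transactions = [frozenset(t) for t in
                [{"Beer", "Diaper"}, {"Beer", "Diaper", "Milk"},
                 {"Diaper", "Milk"}, {"Beer", "Milk"}]]
freq = apriori(transactions, min_support=0.5)
print(sorted(tuple(sorted(s)) for s in freq))
```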
Associations: Pros and Cons
• Pros
– Can quickly mine patterns describing business/customers/etc. without major effort in problem formulation
– Virtual items allow much flexibility
– An unparalleled tool for hypothesis generation
• Cons
– Unfocused
• Not clear exactly how to apply mined “knowledge”
• Only hypothesis generation
– Can produce many, many rules!
• There may be only a few nuggets among them (or none)
Association Rules
• Association rule types:
– Actionable rules: contain high-quality, actionable information
– Trivial rules: information already well known to those familiar with the business
– Inexplicable rules: have no explanation and do not suggest action
• Trivial and inexplicable rules occur most often