Final Review
This is not a comprehensive review; it highlights certain key areas.
Jan 16, 2016
Top-Level Data Mining Tasks
• At the highest level, data mining tasks can be divided into:
– Prediction tasks (supervised learning)
• Use some variables to predict unknown or future values of other variables
• Classification
• Regression
– Description tasks (unsupervised learning)
• Find human-interpretable patterns that describe the data
• Clustering
• Association rule mining
Classification: Definition
• Given a collection of records (the training set)
– Each record contains a set of attributes; one of the attributes is the class, which is to be predicted.
• Find a model for the class attribute as a function of the values of the other attributes.
– The model maps a record to a class value.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model.
• Can you think of classification tasks?
Classification
• Simple linear
• Decision trees (entropy, GINI)
• Naïve Bayesian
• Nearest neighbor
• Neural networks
Regression
• Predict a value of a given continuous (numerical) variable based on the values of other variables
• Greatly studied in statistics
• Examples:
– Predicting sales amounts of a new product based on advertising expenditure
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices
Clustering
• Given a set of data points, find clusters so that
– Data points in the same cluster are similar
– Data points in different clusters are dissimilar
You try it on the Simpsons. How can we cluster these 5 “data points”?
Association Rule Discovery
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on occurrences of other items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers
• But the properties of the attribute values can be different
– ID has no limit, but age has a maximum and minimum value
Types of Attributes
• There are different types of attributes
– Nominal (categorical)
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Decision Tree Representation
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification

[Example tree: the root tests outlook; the sunny branch tests humidity (high → no, normal → yes), the overcast branch predicts yes, and the rain branch tests wind (strong → no, weak → yes)]
How do we construct the decision tree?
• Basic algorithm (a greedy algorithm)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they can be discretized in advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
– There are no samples left
• Pre-pruning/post-pruning
How To Split Records
• Random split
– The tree can grow huge
– These trees are hard to understand
– Larger trees are typically less accurate than smaller trees
• Principled criterion
– Selection of an attribute to test at each node: choosing the most useful attribute for classifying examples. How?
– Information gain
• Measures how well a given attribute separates the training examples according to their target classification
• This measure is used to select among the candidate attributes at each step while growing the tree
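As a sketch of how information gain drives attribute selection, here is a minimal Python version; the helper names and the toy data are illustrative, not from the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting (rows, labels) on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy data: attribute 0 perfectly separates the classes, attribute 1 is useless.
rows = [("sunny", "x"), ("sunny", "y"), ("rain", "x"), ("rain", "y")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0 (perfect split)
print(information_gain(rows, labels, 1))  # 0.0 (uninformative split)
```

The attribute with the highest gain would be chosen as the test at the current node.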
Advantages/Disadvantages of Decision Trees
• Advantages:
– Easy to understand (doctors love them!)
– Easy to generate rules
• Disadvantages:
– May suffer from overfitting
– Classifies by rectangular partitioning (so it does not handle correlated features very well)
– Can be quite large; pruning is necessary
– Does not handle streaming data easily
Overfitting (another view)
• Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
– There may be noise in the training data that the tree is erroneously fitting.
– The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends.
[Figure: accuracy on the training data vs. on the test data, plotted against hypothesis complexity/size of the tree (number of nodes)]
Notes on Overfitting
• Overfitting results in decision trees (models in general) that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors
Evaluation
• Accuracy
• Recall/Precision/F-measure
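A minimal sketch of how these evaluation measures are computed for a single positive class; the function name and toy labels are illustrative:

```python
def evaluate(actual, predicted, positive):
    """Accuracy, precision, recall, and F-measure for one positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    accuracy = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

actual    = ["+", "+", "+", "-", "-", "-"]
predicted = ["+", "+", "-", "+", "-", "-"]
acc, prec, rec, f1 = evaluate(actual, predicted, positive="+")
print(acc, prec, rec, f1)  # all 2/3 for this toy case
```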
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier, also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea.
Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.
Go through all the examples on the slides and be ready to generate tables similar to the ones presented in class and the one you created for your HW assignment.
Smoothing
Bayesian Classifiers
• Bayesian classifiers use Bayes theorem, which says

p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.
• p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
• p(cj) = probability of occurrence of class cj. This is just how frequent the class cj is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
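The "pick the most probable class" idea can be sketched in a few lines. This is an illustrative naïve Bayes with raw counts and no smoothing, so unseen attribute values get probability zero; the data and function names are made up for the example:

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count-based estimates of p(cj) and p(attribute value | cj)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attr index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def classify(row, class_counts, value_counts):
    """Pick the class maximizing p(cj) * prod_i p(d_i | cj); p(d) is ignored."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for cj, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            score *= value_counts[(i, cj)][value] / count
        if score > best_score:
            best_class, best_score = cj, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(classify(("rain", "mild"), *model))  # 'yes'
```

Smoothing (see the note below) would replace the raw count ratios so that a single zero count cannot wipe out a whole class.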
Bayesian Classification
– A statistical method for classification
– A supervised learning method
– Assumes an underlying probabilistic model and applies Bayes theorem
– Can solve diagnostic and predictive problems
– Particularly suited when the dimensionality of the input is high
– In spite of the over-simplified independence assumption, it often performs well in many complex real-world situations
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan); fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
Nearest-Neighbor Classifiers
• Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute its distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
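The three-step classification procedure can be sketched directly, using Euclidean distance and a majority vote; the training records are a toy example:

```python
import math
from collections import Counter

def knn_classify(training, unknown, k):
    """Majority vote over the k training records nearest to `unknown`.

    `training` is a list of (point, class_label) pairs.
    """
    neighbors = sorted(
        training,
        key=lambda rec: math.dist(rec[0], unknown),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(training, (2, 2), k=3))  # 'A'
print(knn_classify(training, (8, 7), k=3))  # 'B'
```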
Up to now we have assumed that the nearest-neighbor algorithm uses the Euclidean distance; however, this need not be the case…

Minkowski: D(Q, C) = ( Σ_{i=1..n} |q_i − c_i|^p )^(1/p)
Euclidean (p = 2): D(Q, C) = sqrt( Σ_{i=1..n} (q_i − c_i)^2 )
– Manhattan (p = 1)
– Max (p = ∞)
Other options: Mahalanobis, weighted Euclidean
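A quick sketch of the alternative distance functions named above (Mahalanobis and weighted Euclidean are omitted since they need a covariance matrix or a weight vector):

```python
def minkowski(q, c, p):
    """Minkowski distance: (sum_i |q_i - c_i|^p)^(1/p)."""
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1 / p)

def chebyshev(q, c):
    """The p -> infinity limit of Minkowski: max coordinate difference."""
    return max(abs(qi - ci) for qi, ci in zip(q, c))

q, c = (0.0, 0.0), (3.0, 4.0)
print(minkowski(q, c, p=1))  # 7.0  (Manhattan)
print(minkowski(q, c, p=2))  # 5.0  (Euclidean)
print(chebyshev(q, c))       # 4.0  (Max)
```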
Strengths and Weaknesses
• Strengths:
– Simple to implement and use
– Comprehensible: easy to explain a prediction
– Robust to noisy data by averaging over the k nearest neighbors
– The distance function can be tailored using domain knowledge
– Can learn complex decision boundaries
• Much more expressive than linear classifiers and decision trees
• More on this later
• Weaknesses:
– Needs a lot of space to store all examples
– Takes much more time to classify a new example than a parsimonious model (need to compute the distance to all other examples)
– The distance function must be designed carefully with domain knowledge
Perceptrons
• The perceptron is a type of artificial neural network, which can be seen as the simplest kind of feedforward neural network: a linear classifier
• Introduced in the late 1950s
• Perceptron convergence theorem (Rosenblatt, 1962):
– The perceptron will learn to classify any linearly separable set of inputs
• XOR function (no linear separation)
• The perceptron is a network that is:
– single-layer
– feed-forward: data only travels in one direction
Perceptron: Artificial Neuron Model
• Model the network as a graph, with cells as nodes and synaptic connections as weighted edges from node i to node j with weight wji
• The input value received by a neuron is calculated by summing the weighted input values from its input links:

in = Σ_{i=0..n} w_i x_i   (in vector notation, w · x)

• The output is then produced by applying a threshold function to this sum
Examples (step activation function)

AND:
In1  In2 | Out
 0    0  |  0
 0    1  |  0
 1    0  |  0
 1    1  |  1

OR:
In1  In2 | Out
 0    0  |  0
 0    1  |  1
 1    0  |  1
 1    1  |  1

NOT:
In | Out
 0 |  1
 1 |  0

The threshold t can be folded into the weighted sum Σ_{i=0..n} w_i x_i by taking w_0 = −t (with a fixed input x_0 = 1).
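The truth tables above can be reproduced with a single perceptron per function. A sketch with a step activation; the weights and thresholds shown are one possible choice, not the only one:

```python
def perceptron_output(weights, threshold, inputs):
    """Step activation: fire (1) iff the weighted input sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# AND: both inputs must contribute to reach the threshold.
AND = lambda x1, x2: perceptron_output([1, 1], 2, [x1, x2])
# OR: a single active input is enough.
OR = lambda x1, x2: perceptron_output([1, 1], 1, [x1, x2])
# NOT: a negative weight inverts the single input.
NOT = lambda x: perceptron_output([-1], 0, [x])

print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
print([OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 1]
print([NOT(x) for x in [0, 1]])                                  # [1, 0]
```

XOR has no such weight/threshold choice, which is exactly the "no linear separation" point made earlier.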
Summary of Neural Networks
• When are neural networks useful?
– Instances are represented by attribute-value pairs
• Particularly when attributes are real-valued
– The target function is
• Discrete-valued
• Real-valued
• Vector-valued
– Training examples may contain errors
– Fast evaluation times are necessary
• When not?
– Fast training times are necessary
– Understandability of the learned function is required
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
K-means Clustering
• A partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters, K, must be specified
• The basic algorithm is very simple
– A K-means tutorial is available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering
1. Ask the user how many clusters they’d like (e.g., k = 3)
2. Randomly guess k cluster center locations
3. Each data point finds out which center it’s closest to
4. Each center finds the centroid of the points it owns…
5. …and jumps there
6. Repeat until terminated!
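The steps above can be sketched as a minimal Lloyd's-algorithm implementation on toy 2-D points; here "terminated" means the centers stop moving:

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Assign points to the nearest center, recenter, and repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step 3: find the closest center
            i = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        new_centers = [                      # steps 4-5: move to the centroid
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # step 6: stop when nothing moves
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))  # one center near each group of points
```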
[Figures: successive K-means iterations on a 2-D scatter plot (axes: expression in condition 1 vs. expression in condition 2), with centers k1, k2, k3 moving at each step until convergence]
Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
– Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Agglomerative is the most common
How to Define Inter-Cluster Similarity
[Figure: two candidate clusters of points p1…p5 and their proximity matrix; the question is which entries of the matrix define the similarity between the two clusters]
• MIN
• MAX
• Group average
• Distance between centroids
• Other methods driven by an objective function
– Ward’s method uses squared error
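A minimal agglomerative sketch that makes the linkage choice explicit: passing min gives MIN (single link) and max gives MAX (complete link). The data and function names are a toy example:

```python
import math

def agglomerative(points, k, linkage=min):
    """Merge the closest pair of clusters until only k clusters remain.

    `linkage` aggregates the pairwise point distances between two clusters:
    min gives single link (MIN), max gives complete link (MAX).
    """
    clusters = [[p] for p in points]            # start: every point is a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(p, q)
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))     # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerative(points, k=3, linkage=min))
```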
DBSCAN
• DBSCAN is a density-based algorithm
– Density = the number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points in the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point
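The three point types can be computed directly from their definitions. A sketch; note the slide says "more than" MinPts, while this version uses "at least" MinPts counting the point itself, as conventions vary between descriptions:

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point core, border, or noise per the DBSCAN definitions."""
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    core = [p for p in points if len(neighbors(p)) >= min_pts]
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(math.dist(p, c) <= eps for c in core):
            labels[p] = "border"   # near a core point, but not dense itself
        else:
            labels[p] = "noise"
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (9, 9)]
labels = classify_points(points, eps=1.5, min_pts=4)
print(labels)  # the dense square is core, (2, 2) is border, (9, 9) is noise
```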
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
• Applications:
– Market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Rule Mining
• We are interested in rules that are:
– non-trivial (and possibly unexpected)
– actionable
– easily explainable
Support and Confidence
• Find all the rules X → Y with minimum confidence and support
– Support = probability that a transaction contains {X, Y}
• i.e., the ratio of transactions in which X and Y occur together to all transactions in the database
– Confidence = conditional probability that a transaction having X also contains Y
• i.e., the ratio of transactions in which X and Y occur together to those in which X occurs
• In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of the LHS:

Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS)
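Using the transaction table from the association-rule slide earlier, support and confidence can be computed directly from their definitions (a sketch; the function names are illustrative):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support(LHS u RHS) / Support(LHS)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
print(support(transactions, {"Milk", "Coke"}))                 # 0.6
print(confidence(transactions, {"Diaper", "Milk"}, {"Beer"}))  # 2/3
```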
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
(Slide credit: CS583, Bing Liu, UIC)
The Apriori Algorithm
• The best-known algorithm for this task
• Two steps:
– Find all itemsets that have minimum support (frequent itemsets, also called large itemsets)
– Use the frequent itemsets to generate rules
• E.g., given the frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
one rule from this itemset is
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
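A sketch of the first Apriori step (finding frequent itemsets) on a toy transaction set; the rule-generation step is omitted. Candidates of size k are built from frequent itemsets of size k−1, and any candidate with an infrequent subset is pruned before counting:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (the first Apriori step)."""
    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    min_count = min_support * len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items if support_count(s) >= min_count}
    k = 1
    while level:
        frequent.update({s: support_count(s) for s in level})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support_count(c) >= min_count
        }
    return frequent

transactions = [frozenset(t) for t in
                [{"Beer", "Diaper"}, {"Beer", "Diaper", "Milk"},
                 {"Diaper", "Milk"}, {"Beer", "Milk"}]]
freq = apriori(transactions, min_support=0.5)
print(sorted(tuple(sorted(s)) for s in freq))
```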
Associations: Pros and Cons
• Pros
– Can quickly mine patterns describing business/customers/etc. without major effort in problem formulation
– Virtual items allow much flexibility
– An unparalleled tool for hypothesis generation
• Cons
– Unfocused
• Not clear exactly how to apply mined “knowledge”
• Only hypothesis generation
– Can produce many, many rules!
• There may be only a few nuggets among them (or none)
Association Rules
• Association rule types:
– Actionable rules: contain high-quality, actionable information
– Trivial rules: information already well known to those familiar with the business
– Inexplicable rules: have no explanation and do not suggest action
• Trivial and inexplicable rules occur most often