Statistical Techniques
Chapter 10
Jan 03, 2016
10.1 Linear Regression Analysis
Simple Linear Regression

y = ax + b

The least-squares criterion gives the slope and intercept as

a = (Σxy − n·x̄·ȳ) / (Σx² − n·x̄²)
b = ȳ − a·x̄

Multiple Linear Regression

f(x1, x2, x3, ..., xn) = a1·x1 + a2·x2 + a3·x3 + ... + an·xn + c
Table 10.1 • District Office Building Data

Space   Offices   Entrances   Age   Value
2310    2         2           20    $142,000
2333    2         2           12    $144,000
2356    3         1.5         33    $151,000
2379    3         2           43    $150,000
2402    2         3           53    $139,000
2425    4         2           23    $169,000
2448    2         1.5         99    $126,000
2471    2         2           34    $142,900
2494    3         3           23    $163,000
2517    4         4           55    $169,000
2540    2         3           22    $149,000
Multiple Linear Regression with Excel
A Regression Equation for the District Office Building Data

Value = 27.64·Space + 12529.77·Offices + 2553.21·Entrances − 234.24·Age + 52317.83
Table 10.2 • Regression Statistics for the Office Building Data

–234.2371645   2553.211   12529.77   27.64139   52317.83
13.26801148    530.6692   400.0668   5.429374   12237.36
0.996747993    970.5785   #N/A       #N/A       #N/A
459.7536742    6          #N/A       #N/A       #N/A
1732393319     5652135    #N/A       #N/A       #N/A
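As a sketch of how the Table 10.2 coefficients arise, an ordinary least-squares fit of the Table 10.1 data (here with NumPy's lstsq rather than Excel) recovers the same regression equation:

```python
# Least-squares fit of the district office building data (Table 10.1).
import numpy as np

# Columns: Space, Offices, Entrances, Age
X = np.array([
    [2310, 2, 2.0, 20], [2333, 2, 2.0, 12], [2356, 3, 1.5, 33],
    [2379, 3, 2.0, 43], [2402, 2, 3.0, 53], [2425, 4, 2.0, 23],
    [2448, 2, 1.5, 99], [2471, 2, 2.0, 34], [2494, 3, 3.0, 23],
    [2517, 4, 4.0, 55], [2540, 2, 3.0, 22],
], dtype=float)
y = np.array([142000, 144000, 151000, 150000, 139000, 169000,
              126000, 142900, 163000, 169000, 149000], dtype=float)

# Append a column of ones so the intercept is estimated too.
A = np.column_stack([X, np.ones(len(y))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
space, offices, entrances, age, intercept = coeffs
```

The fitted values match the first row of the Table 10.2 output: roughly 27.64 for Space, 12529.77 for Offices, 2553.21 for Entrances, −234.24 for Age, and 52317.83 for the intercept.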
[Figure: Scatter plot of Assessed Value (from $0 to $180,000) against Floor Space (2200 to 2600 square feet) for the office building data.]
Regression Trees

[Figure: A regression tree. Internal nodes (Test 1 through Test 4) route instances with >= / < comparisons; the leaves hold linear regression models LRM1 through LRM5.]
[Figure: A regression tree for trip-cost data. Internal nodes test the attributes Trips, Amt, and TotCost against thresholds (e.g., TotCost <= 246 vs. > 246, Amt <= 178 vs. > 178, Trips <= 7.5 vs. > 7.5); the nine leaves hold linear regression models LRM1 through LRM9.]
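A minimal sketch of the idea behind the trees above: choose one split by least squared error and fit a separate model in each leaf. For brevity, the leaf "model" here is just the mean, and the data values are illustrative, not from the text:

```python
# A one-split regression tree (a "stump") with per-leaf models.

def fit_leaf(ys):
    """The simplest leaf model: predict the leaf mean."""
    return sum(ys) / len(ys)

def build_stump(xs, ys):
    """Try every split point; keep the one minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = fit_leaf(left), fit_leaf(right)
        err = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.5, 5.2, 20.0, 20.5, 19.8]
predict = build_stump(xs, ys)
```

A full regression tree applies the same split search recursively and fits a linear regression model, rather than a mean, in each leaf.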
10.2 Logistic Regression

Transforming the Linear Regression Model

Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.
The Logistic Regression Model

p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))

where e is the base of natural logarithms, often denoted as exp.
[Figure: The logistic function: P(y = 1 | x) plotted against x for x from −6 to 6. The S-shaped curve rises from near 0.0 to near 1.0.]
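The model can be sketched directly, here with a = 1 and c = 0 (the standard sigmoid):

```python
# The logistic curve: p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c)).
import math

def logistic(x, a=1.0, c=0.0):
    z = a * x + c
    return math.exp(z) / (1 + math.exp(z))
```

At x = 0 the probability is exactly 0.5; toward x = ±6 it approaches 1 and 0, as in the plot.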
Logistic Regression: An Example

ax + c = 0.0001(Income) + 19.827(CreditCardIns) − 8.314(Sex) − 0.415(Age) + 17.691
Table 10.3 • Logistic Regression: Dependent Variable = Life Insurance Promotion

                  Credit Card               Life Insurance   Computed
Instance  Income  Insurance     Sex   Age   Promotion        Probability
1         40K     0             1     45    0                0.007
2         30K     0             0     40    1                0.987
3         40K     0             1     42    0                0.024
4         30K     1             1     43    1                1.000
5         50K     0             0     38    1                0.999
6         20K     0             0     55    0                0.049
7         30K     1             1     35    1                1.000
8         20K     0             1     27    0                0.584
9         30K     0             1     43    0                0.005
10        30K     0             0     41    1                0.981
11        40K     0             0     43    1                0.985
12        20K     0             1     29    1                0.380
13        50K     0             0     39    1                0.999
14        40K     0             1     55    0                0.000
15        20K     1             0     19    1                1.000
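A sketch of the computation behind Table 10.3, using the coefficients shown above. The signs on Sex and Age are inferred so that the computed values track the table; Sex is coded 1 for male, and Income is in dollars:

```python
# Logistic probability for the life insurance promotion example.
import math

def life_ins_probability(income, credit_card_ins, sex, age):
    # Coefficients as recovered from the slide; signs on Sex and Age
    # are inferred from the probabilities in Table 10.3.
    z = (0.0001 * income + 19.827 * credit_card_ins
         - 8.314 * sex - 0.415 * age + 17.691)
    return math.exp(z) / (1 + math.exp(z))

# Instance 2: income 30K, no credit card insurance, female, age 40.
p2 = life_ins_probability(30000, 0, 0, 40)
```

With these coefficients, instance 2 scores near the table's 0.987 and instance 1 (a 45-year-old male with 40K income) scores near its 0.007; small differences come from rounding in the recovered coefficients.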
10.3 Bayes Classifier
P(H | E) = P(E | H) · P(H) / P(E)

where H is the hypothesis to be tested and E is the evidence associated with the hypothesis.
Bayes Classifier: An Example

Table 10.4 • Data for Bayes Classifier

Magazine    Watch       Life Insurance   Credit Card
Promotion   Promotion   Promotion        Insurance     Sex
Yes         No          No               No            Male
Yes         Yes         Yes              Yes           Female
No          No          No               No            Male
Yes         Yes         Yes              Yes           Male
Yes         No          Yes              No            Female
No          No          No               No            Female
Yes         Yes         Yes              Yes           Male
No          No          No               No            Male
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female
The Instance to be Classified
Magazine Promotion = Yes
Watch Promotion = Yes
Life Insurance Promotion = No
Credit Card Insurance = No
Sex = ?
Table 10.5 • Counts and Probabilities for Attribute Sex

                   Magazine         Watch            Life Insurance   Credit Card
                   Promotion        Promotion        Promotion        Insurance
Sex                Male   Female    Male   Female    Male   Female    Male   Female
Yes                4      3         2      2         2      3         2      1
No                 2      1         4      2         4      1         4      3
Ratio: yes/total   4/6    3/4       2/6    2/4       2/6    3/4       2/6    1/4
Ratio: no/total    2/6    1/4       4/6    2/4       4/6    1/4       4/6    3/4
Computing the Probability for Sex = Male

P(sex = male | E) = P(E | sex = male) · P(sex = male) / P(E)
Conditional Probabilities for Sex = Male
P(magazine promotion = yes | sex = male) = 4/6
P(watch promotion = yes | sex = male) = 2/6
P(life insurance promotion = no | sex = male) = 4/6
P(credit card insurance = no | sex = male) = 4/6
P(E | sex = male) = (4/6) (2/6) (4/6) (4/6) = 8/81
The Probability for Sex = Male Given Evidence E

P(sex = male | E) = 0.0593 / P(E)

The Probability for Sex = Female Given Evidence E

P(sex = female | E) = 0.0281 / P(E)
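The computation above can be sketched with exact fractions, using the counts from Table 10.5 and the class priors from Table 10.4 (6 males, 4 females):

```python
# Naive Bayes scores for the instance to be classified.
from fractions import Fraction as F

# P(E | class) is the product of the conditional probabilities for the
# evidence: magazine = yes, watch = yes, life insurance = no, credit card = no.
p_e_male = F(4, 6) * F(2, 6) * F(4, 6) * F(4, 6)      # = 8/81
p_e_female = F(3, 4) * F(2, 4) * F(1, 4) * F(3, 4)    # = 9/128

# Multiply by the class priors P(male) = 6/10 and P(female) = 4/10.
score_male = p_e_male * F(6, 10)       # ~0.0593 / P(E)
score_female = p_e_female * F(4, 10)   # ~0.0281 / P(E)

prediction = "male" if score_male > score_female else "female"
```

Since both scores share the same denominator P(E), it can be ignored for classification; the larger numerator (0.0593 for male) wins.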
Zero-Valued Attribute Counts

A zero-valued attribute count is replaced by the corrected ratio

(n + k·p) / (d + k)

where n is the count for the attribute value within the class, d is the total number of instances in the class, k is a value between 0 and 1 (usually 1), and p is an equal fractional part of the total number of possible values for the attribute.
Missing Data

With the Bayes classifier, missing data items are ignored.
Numeric Data

f(x) = (1 / (√(2π)·σ)) · e^(−(x − μ)² / (2σ²))

where
e = the exponential function
μ = the class mean for the given numerical attribute
σ = the class standard deviation for the attribute
x = the attribute value
Table 10.6 • Addition of Attribute Age to the Bayes Classifier Dataset

Magazine    Watch       Life Insurance   Credit Card
Promotion   Promotion   Promotion        Insurance     Age   Sex
Yes         No          No               No            45    Male
Yes         Yes         Yes              Yes           40    Female
No          No          No               No            42    Male
Yes         Yes         Yes              Yes           30    Male
Yes         No          Yes              No            38    Female
No          No          No               No            55    Female
Yes         Yes         Yes              Yes           35    Male
No          No          No               No            27    Male
Yes         No          No               No            43    Male
Yes         Yes         Yes              No            41    Female
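A sketch of the density function, applied to the male ages from Table 10.6; the sample mean and standard deviation stand in for the class parameters:

```python
# Normal probability density for a numeric attribute value.
import math

def normal_pdf(x, mu, sigma):
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

ages_male = [45, 42, 30, 35, 27, 43]     # male ages from Table 10.6
mu = sum(ages_male) / len(ages_male)     # class mean (37.0)
var = sum((a - mu) ** 2 for a in ages_male) / (len(ages_male) - 1)
sigma = math.sqrt(var)                   # class standard deviation

# Density score for age = 45 given sex = male.
density = normal_pdf(45, mu, sigma)
```

The density value replaces the ratio that would be used for a categorical attribute when multiplying the conditional probabilities.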
10.4 Clustering Algorithms

Agglomerative Clustering

1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.
Agglomerative Clustering: An Example

Table 10.7 • Five Instances from the Credit Card Promotion Database

Instance   Income Range   Magazine Promotion   Watch Promotion   Life Insurance Promotion   Sex
I1         40–50K         Yes                  No                No                         Male
I2         25–35K         Yes                  Yes               Yes                        Female
I3         40–50K         No                   No                No                         Male
I4         25–35K         Yes                  Yes               Yes                        Male
I5         50–60K         Yes                  No                Yes                        Female
Table 10.8 • Agglomerative Clustering: First Iteration

        I1     I2     I3     I4     I5
I1      1.00
I2      0.20   1.00
I3      0.80   0.00   1.00
I4      0.40   0.80   0.20   1.00
I5      0.40   0.60   0.20   0.40   1.00

Table 10.9 • Agglomerative Clustering: Second Iteration

          I1 I3   I2     I4     I5
I1 I3     0.80
I2        0.33    1.00
I4        0.47    0.80   1.00
I5        0.47    0.60   0.40   1.00
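The first merge step can be sketched from the Table 10.8 scores (higher means more similar). The 0.80 tie between I1–I3 and I2–I4 is broken here by insertion order, matching the choice of I1 and I3 shown in Table 10.9:

```python
# Pick the two most similar clusters from the Table 10.8 scores.
sim = {
    ("I1", "I2"): 0.20, ("I1", "I3"): 0.80, ("I1", "I4"): 0.40,
    ("I1", "I5"): 0.40, ("I2", "I3"): 0.00, ("I2", "I4"): 0.80,
    ("I2", "I5"): 0.60, ("I3", "I4"): 0.20, ("I3", "I5"): 0.20,
    ("I4", "I5"): 0.40,
}

# Step 2(a): the pair with the highest similarity score merges first.
best_pair = max(sim, key=sim.get)
```

A full implementation would then recompute the scores between the merged cluster and every remaining cluster (giving the Table 10.9 values) and repeat until one cluster remains.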
Choosing a Final Clustering

• Compare the average within-cluster similarity to the overall similarity.
• Compare the similarity within each cluster to the similarity between clusters.
• Examine the rule sets generated by each saved clustering.
Conceptual Clustering
1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level.
a. Place the new instance into an existing cluster.
b. Create a new concept cluster having the new instance as its only member.
Table 10.10 • Data for Conceptual Clustering

Instance   Tails   Color   Nuclei
I1         One     Light   One
I2         Two     Light   Two
I3         Two     Dark    Two
I4         One     Dark    Three
I5         One     Light   Two
I6         One     Light   Two
I7         One     Light   Three
[Figure: A conceptual clustering hierarchy for the Table 10.10 data. The root node N covers all seven instances (P(N) = 7/7); its children include N1 (P(N1) = 3/7), N2 (P(N2) = 2/7), and N4 (P(N4) = 2/7), with N1 further divided into N3 (P(N3) = 1/3) and N5 (P(N5) = 2/3). Each node stores the conditional probabilities P(V|C) and P(C|V) for the values of the attributes Tails (One/Two), Color (Light/Dark), and Nuclei (One/Two/Three).]
COBWEB (Fisher, 1987)

• Heuristic measure of partition quality
• Category utility
Expectation Maximization
1. Similar to the K-Means procedure
2. Makes use of the finite Gaussian mixtures model
3. The mixture model assigns each individual data instance a probability
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0
[Figure: A plot of the Table 3.6 instances, with both axes ranging from 0 to about 7.]
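The K-Means steps can be sketched on the Table 3.6 points with K = 2; the initial centers are fixed here (rather than chosen at random, as step 2 specifies) so the run is reproducible:

```python
# K-Means on the Table 3.6 instances, K = 2.
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5),
          (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, centers):
    while True:
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        # Step 4: recompute each center as its cluster's mean.
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters
        ]
        # Step 5: stop when the centers no longer change.
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans(points, [(1.0, 1.5), (5.0, 6.0)])
```

With these starting centers the algorithm converges quickly, leaving the outlying point (5.0, 6.0) in a cluster of its own.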
Expectation Maximization
1. Guess initial values for the parameters.
2. Until a termination criterion is achieved:
a. Use the probability density function for normal distributions to compute the cluster
probability for each instance.
b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
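A one-dimensional sketch of the loop above, with two Gaussians whose variances and priors are held fixed so that only the means are re-estimated; the data values are illustrative, not from the text:

```python
# Simplified EM for a two-component, one-dimensional Gaussian mixture.
import math

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

def pdf(x, mu):
    """Normal density with unit variance."""
    return math.exp(-((x - mu) ** 2) / 2) / math.sqrt(2 * math.pi)

mu1, mu2 = 0.0, 6.0              # Step 1: initial parameter guesses
for _ in range(50):              # Step 2: iterate to (near) convergence
    # 2(a): per-instance cluster probabilities (the expectation step)
    w = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in data]
    # 2(b): re-estimate the means, weighted by those probabilities
    mu1 = sum(wi * x for wi, x in zip(w, data)) / sum(w)
    mu2 = sum((1 - wi) * x for wi, x in zip(w, data)) / sum(1 - wi for wi in w)
```

Each instance keeps a probability of membership in both clusters; the means settle near the two groups of data values.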
Table 10.11 • An EM Clustering of Gamma-Ray Burst Data

              Cluster 0   Cluster 1   Cluster 2
# Instances   518         340         321

Log Fluence
  Mean        –5.6670     –4.8131     –6.3657
  SD          0.4088      0.5301      0.5812

Log HR321
  Mean        0.0538      0.2949      0.5478
  SD          0.3018      0.1939      0.2766

Log T90
  Mean        1.2709      1.7159      –0.3794
  SD          0.4906      0.3793      0.4825
10.5 Heuristics or Statistics?

Inductive problem-solving methods:
• Query and visualization techniques
• Machine learning techniques
• Statistical techniques
Query and Visualization Techniques

• Query tools and OLAP tools
  – Unable to find hidden patterns
• Visualization tools
  – Decision trees, bar and pie charts, histograms, maps, surface plot diagrams
  – Applied after a data mining process to help us understand what has been discovered
Machine Learning and Statistical Techniques
1. Statistical techniques typically assume an underlying distribution for the data whereas machine learning techniques do not.
2. Machine learning techniques tend to have a human flavor.
3. Machine learning techniques are better able to deal with missing and noisy data.
4. Most machine learning techniques are able to explain their behavior.
5. Statistical techniques tend to perform poorly with large datasets.