Page 1: Last lecture summary

Last lecture summary

Page 2: Last lecture summary

Cluster analysis

• unsupervised
• hierarchical clustering
  – agglomerative
  – divisive
  – dendrogram
• partitional
  – k-means

[Figure: dendrogram over the cities BA, NA, RM, FL, MI, TO, with dissimilarity levels 138, 219, 255, 268, 295]

Page 3: Last lecture summary

• different methods to get the distance between objects within two clusters
  – single linkage
  – complete linkage
  – average linkage
  – centroid linkage
  – Ward's method

• k-means
  – divides data points into some prespecified number of clusters without a hierarchical structure

Page 4: Last lecture summary

Cluster validation

• external
  – comparison with a known labeling

• internal
  – no labeling is known; assess the intracluster compactness and intercluster separation
  – CPCC for hierarchical clustering
  – silhouettes

• relative

Page 5: Last lecture summary

Information theory (new stuff)

Page 6: Last lecture summary

• mathematical theory of the measurement of information
  – does not deal with a semantic (i.e. meaning-based) definition of information
  – it quantifies information and measures its transmission
  – information is coupled with a sender (source), a receiver and a channel (means of communication)

[Figure: Source (Sender) → Channel → Receiver]

Page 7: Last lecture summary

• information is inherently linked with uncertainty and surprise
  – Imagine you attend a biology symposium.
  – The speaker tells you they have a PhD in biology.
  – Does it surprise you?
  – The speaker tells you they play a djembe drum.
  – Does it surprise you?
  – In the first case the information content was low; we were rather certain about the speaker's degree.
  – In the second case we were given a large amount of information, as we were uncertain about how the speaker spends their leisure time.

Page 8: Last lecture summary

• Thus we could decide to define information as a degree of uncertainty.
  – Before we roll a die, our uncertainty about the result is 6 (6 possible equally likely outcomes).
  – After rolling the die our uncertainty is zero.
  – Therefore we have gained an amount of information of 6.

• Pitfalls of this simplistic approach:
  – Two dice – after rolling both of them, have we gained 6 × 6 = 36 units of information?
  – However, intuitively, the information after rolling two dice is just 2 times more than the information after rolling one die. Not 6 times!

• In other words, information feels additive.

Page 9: Last lecture summary

• Which function would you use to capture information's additivity?
  – i.e. if I have 6 × 6 possible outcomes, I get only 6 + 6 = 12 units of information
  – the logarithm

• This is precisely the definition of information by Hartley (1928):

I(N) = log(N),   N … number of possible results

• This definition is useful for equiprobable results, but will fail for biased dice.

Page 10: Last lecture summary

• If number 6 turns up 50% of the time, the information 6 provides is less than that provided by e.g. 3.

• Incorporate the probability into the definition of information (Shannon, 1948).

ai … possible results (values of the random variable)
p(ai), pi … probability of obtaining the value ai

I(ai) = log( 1/p(ai) ) = -log p(ai) = -log pi
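As a quick numeric illustration of this definition, here is a minimal Python sketch (the probabilities are illustrative values, not taken from the lecture):

    import math

    def information(p):
        """Self-information, in bits, of an outcome that occurs with probability p."""
        return -math.log2(p)

    # Any face of a fair die: ~2.585 bits
    print(information(1/6))
    # A biased die where 6 turns up 50% of the time: seeing a 6 gives only 1 bit
    print(information(0.5))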

Page 11: Last lecture summary

Information units

• The unit to measure the information should be as simple as possible.

• Simplest experiment possible – just one outcome. Too simple, it yields no information!

• Two possible results: I(a1) = I(a2) = log(2).
• If we use base 2 for the logarithm, we obtain log2(2) = 1.
• We say that this amount of information is one bit.
  – A bit is the amount of information conveyed by an experiment with two equally probable outcomes.

Page 12: Last lecture summary

• Other units based on other logarithm bases
  – nat – information conveyed by an experiment with e (≈ 2.718281828459045) equiprobable outcomes … ln()
  – dit, digit, hartley … log10()

• Shannon was, however, working in the field of communication (Bell Labs).
• He was not thinking about experiments like rolling dice.
• The outcomes of his experiments were the possible symbols emitted by a source, and he was interested in analyzing the average information emitted by a source.

Page 13: Last lecture summary

• More formally, a memoryless source emits messages using a given alphabet S = [a1, …, an] with emission probabilities P = [p1, …, pn].

• Shannon defined the entropy of the source X as

H(X) = -Σ(i=1..N) pi log pi = Σ(i=1..N) pi I(ai)

[Figure: Source with S = [a, b, c, d, …, z], P = [5, 8, 4, 9, …] → Channel → Receiver]

Page 14: Last lecture summary

What is entropy?

• H(X) is the weighted mean (expectation) of -log(pi)
• H(X) is the expectation of the information of the source
• H(X) is the average information emitted by the source
• H(X) is the measure of uncertainty of the source
  – if it is zero, there is no uncertainty (i.e. no information was transmitted)

H(X) = -Σ(i=1..N) pi log pi = Σ(i=1..N) pi I(ai)
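The entropy formula above is easy to check numerically; a minimal Python sketch (the probability vectors are arbitrary examples, not from the slides):

    import math

    def entropy(probs):
        """H(X) = -sum(pi * log2(pi)); zero-probability symbols contribute nothing."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit  (fair coin)
    print(entropy([1/6] * 6))    # ~2.585 bits (fair die)
    print(entropy([1.0]))        # 0.0 bits (single certain outcome, no information)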

Page 15: Last lecture summary

Entropy of a genome

• A genome consists of the symbols A, C, T, G.
• Their probabilities for the bacterium Escherichia coli:
  – 24.6% A, 25.4% C, 24.6% T, 25.4% G
• I(A) = I(T) = 2.023 bits, I(G) = I(C) = 1.977 bits
• H(Eco) = 0.246*2.023 + 0.254*1.977 + 0.246*2.023 + 0.254*1.977 = 1.999 bits
  – Close to 2 bits, as expected. 2 bits is the maximum information we can get from an experiment with 4 outcomes.
• The entropy is the average information per symbol!

Page 16: Last lecture summary

• Entropy of another organism, the bacterium Thermus aquaticus:
  – 15.3% A, 15.3% T, 34.6% C, 34.6% G
  – H(Taq) = 1.876 bits … lower, as expected from the %GC content.
  – The decrease (0.123 bits) may not seem huge. But if we consider the size of a typical bacterial genome (4 Mbp), then such a decrease gains relevance.
  – The E. coli genome is able to encode 492 000 bits more than that of T. aquaticus.

• Which makes you wonder: if the ancestral microorganism was living in conditions similar to those of T. aquaticus, wouldn't it have chosen another set of bases that gave it maximal information-encoding capacity in those conditions?

Page 17: Last lecture summary

Noise and conditional entropy

• Information theory is concerned mainly with how information is transmitted from a source to a receiver by means of a channel.
  – Roll a die, observe the result – the channel is almost noise-free.

• You throw a coin from the balcony on the 25th floor. Down there is your friend who shouts (no mobiles allowed!) the result back to you.
• Now the channel is not so nice anymore. We have noise.

Page 18: Last lecture summary

• Assume that we mishear “head” for “tail” one in every 100 (1%) coin tosses.

• If we are using a fair coin, the source entropy is H(X) = -0.5*log2(0.5)*2 = 1 bit

• We can factor in noise now, and compute the entropy after the coin has been tossed and we have heard the shouted result – the conditional entropy: H(X|Y) = -0.99*log2(0.99) - 0.01*log2(0.01) = 0.081 bits.
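A tiny numeric check of this slide (assuming, as the slide does, a fair coin and a symmetric 1% mishearing rate):

    import math

    def h2(p):
        """Binary entropy in bits."""
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(h2(0.5))    # H(X)   = 1 bit for the fair coin
    print(h2(0.01))   # H(X|Y) ~ 0.081 bits after hearing the (1% unreliable) shout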

Page 19: Last lecture summary

• Conditional entropy expresses our (as receiver) uncertainty on the result after the experiment has been carried out.

• X is the result of the experiment and Y is what we hear (or mishear) as being the result.
  – We toss the coin; the outcome is X.
  – The friend shouts; we hear Y.
  – Knowing Y (the result of the experiment as we perceive it), H(X|Y) expresses our remaining uncertainty over X.

Page 20: Last lecture summary

Mutual information

• H(X|Y) represents our uncertainty over X once we know Y.

• It is intimately linked to the channel over which the original message X travels in order to reach us as Y.

• We are also implying that H(X|Y) is a measure of information loss (the increase in uncertainty due to transmission through the channel).

• Mutual information: I(X,Y) = H(X) - H(X|Y)

Page 21: Last lecture summary

• So far we have been talking about source, receiver and channel.

• However, you can consider a random variable X and ask how much information is received when a specific value for this variable is observed.

• The amount of information can be viewed as the ‘degree of surprise’ on learning the value of X.

• You can easily calculate I(xi) and H(X) if you can estimate the probabilities with which the variable takes its values, or if you know the variable's probability distribution.

I(xi) = -log pi          H(X) = Σ(i=1..N) pi I(ai)

Page 22: Last lecture summary

• Further, you can consider two random variables X and Y.

• Now, you may want to quantify the remaining entropy (i.e. uncertainty) of a random variable X given that the value of another random variable Y is known.

• Conditional entropy of a random variable X given that the value of the other random variable Y is known – H(X|Y)

H(X|Y) = -Σy P(y) Σx P(x|y) log2 P(x|y)
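The formula can be evaluated directly from a joint distribution P(X,Y); a minimal sketch (the joint table below encodes the noisy-coin example from the earlier slides, assuming a fair coin and a 1% error rate):

    import math
    from collections import defaultdict

    def conditional_entropy(joint):
        """H(X|Y) = -sum_y P(y) sum_x P(x|y) log2 P(x|y); joint maps (x, y) -> P(x, y)."""
        p_y = defaultdict(float)
        for (x, y), p in joint.items():
            p_y[y] += p
        return -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

    joint = {('H', 'H'): 0.495, ('H', 'T'): 0.005,
             ('T', 'H'): 0.005, ('T', 'T'): 0.495}
    print(conditional_entropy(joint))   # ~0.081 bits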

Page 23: Last lecture summary

• Mutual information between the variables X and Y

I(X,Y) = H(X) - H(X|Y)

• The mutual information represents the reduction in uncertainty about X as a consequence of the observation Y.

• Mutual information measures the information that X and Y share.

• It measures how much knowing one of these variables reduces our uncertainty about the other.
  – If X and Y are independent, then knowing X does not give any information about Y; their mutual information is zero.
  – If X and Y are identical, all information conveyed by X is shared with Y; knowing X determines Y. I(X,Y) is the same as the uncertainty (entropy) contained in Y (or X) alone.

• I(X,Y) ≥ 0 (it is non-negative)
• I(X,Y) = I(Y,X) (it is symmetric)
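These properties can be checked with a short sketch that computes I(X,Y) = H(X) - H(X|Y) from a joint distribution (the two toy distributions are illustrative examples):

    import math
    from collections import defaultdict

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def mutual_information(joint):
        """I(X,Y) = H(X) - H(X|Y), computed from a joint distribution P(X,Y)."""
        p_x, p_y = defaultdict(float), defaultdict(float)
        for (x, y), p in joint.items():
            p_x[x] += p
            p_y[y] += p
        h_x = entropy(p_x.values())
        h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)
        return h_x - h_x_given_y

    independent = {(x, y): 0.25 for x in 'HT' for y in 'HT'}   # X, Y independent fair coins
    identical = {('H', 'H'): 0.5, ('T', 'T'): 0.5}             # Y is a copy of X
    print(mutual_information(independent))   # 0.0 bits
    print(mutual_information(identical))     # 1.0 bit = H(X)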

Page 24: Last lecture summary

Decision trees

Page 25: Last lecture summary

Weather   Light  Ground condition  Umpires' decision
Sunny     Good   Dry               Play
Overcast  Good   Dry               Play
Raining   Good   Dry               No play
Overcast  Poor   Dry               No play
Overcast  Poor   Damp              No play
Raining   Poor   Damp              No play
Overcast  Good   Damp              Play
Sunny     Poor   Dry               Play

Intelligent Bioinformatics: The application of artificial intelligence techniques to bioinformatics problems, Keedwell

Umpires’ decision to play a cricket match

Page 26: Last lecture summary

Intelligent Bioinformatics: The application of artificial intelligence techniques to bioinformatics problems, Keedwell

Page 27: Last lecture summary

• Supervised
• Used both for
  – classification – classification tree
  – regression – regression tree

• Advantages
  – relatively undemanding in computational terms
  – provide clear, explicit reasoning of their decision making in the form of symbolic decision trees, which can be converted to sets of rules
  – accurate and, in more recent guises, increasingly robust in the face of noise

Page 28: Last lecture summary

• Task - determine, from the data, the rules the umpires are explicitly or implicitly using to determine whether play should take place.

• How to split the data so that each subset in the data uniquely identifies a class in the data?

• In our case, divide up the set of training examples into two smaller sets that completely encapsulate each class ‘Play’ and ‘No play’.

Page 29: Last lecture summary

• Each division is known as a test and splits the dataset into subsets according to the value of the attribute.

• E.g. if a test on 'Light' is performed, this gives
  – Light = Good: yields 4 examples, 3 of class 'Play', 1 of 'No play'
  – Light = Poor: yields 4 examples, 3 of class 'No play', 1 of 'Play'

Light = Good:
Weather   Light  Ground condition  Decision
Sunny     Good   Dry               Play
Overcast  Good   Dry               Play
Raining   Good   Dry               No play
Overcast  Good   Damp              Play

Light = Poor:
Weather   Light  Ground condition  Decision
Overcast  Poor   Dry               No play
Overcast  Poor   Damp              No play
Raining   Poor   Damp              No play
Sunny     Poor   Dry               Play

Page 30: Last lecture summary

• The above test on ‘Light’ separates the samples into two subsets, each with three examples of one class and one of another.

• This test has been chosen at random.
• Is it really the best way of splitting the data?
• A measurement of the effectiveness of each attribute/feature is required. This measure must reflect the distribution of examples over the classes in the problem.

Page 31: Last lecture summary

Gain criterion

• Based on the amount of information that a test on the data conveys.

• The information contained within a test is related to the probability of selecting one training example from that class.

• T – training set, Cj – a particular class
  – What is the probability?

Weather   Light  Ground  Class
Sunny     Good   Dry     Play
Overcast  Good   Dry     Play
Raining   Good   Dry     No play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play
Raining   Poor   Damp    No play
Overcast  Good   Damp    Play
Sunny     Poor   Dry     Play

freq(Cj, T) / |T|

Page 32: Last lecture summary

• What is the information conveyed by selecting one training example from class Cj?

-log2( freq(Cj, T)/|T| )

• What is the expected information from the whole training set?

info(T) = -Σ(j=1..k) freq(Cj, T)/|T| * log2( freq(Cj, T)/|T| )

• What is this quantity called?
  – Entropy

Page 33: Last lecture summary

• OK, we know the information measure for the entire training set.

• Each test that is devised by the algorithm must be compared with this to determine how much of an improvement (if any) is seen in classification.

• Now consider a similar measurement after T has been partitioned by a test x.

info_x(T) = Σ(i=1..n) |Ti|/|T| * info(Ti)

Page 34: Last lecture summary

• What is info_x(T) called?
• Conditional entropy
  – the entropy of the training set on condition that the split x has been performed

• Information gain measures the information yielded by a test x. It is defined as

gain(x) = info(T) - info_x(T)

• So what is information gain actually?
  – the mutual information between the test x and the class

• The gain criterion selects the test that maximizes the information gain.

Page 35: Last lecture summary

|T| = ?                  |T| = 8
j = ?                    j = 1, 2
freq('Play', T) = ?      freq('Play', T) = 4
freq('No play', T) = ?   freq('No play', T) = 4

info(T) = ?

Weather   Light  Ground  Class
Sunny     Good   Dry     Play
Sunny     Poor   Dry     Play
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play
Raining   Poor   Damp    No play
Raining   Good   Dry     No play

info(T) = -Σ(j=1..k) freq(Cj, T)/|T| * log2( freq(Cj, T)/|T| )

info(T) = -4/8 * log2(4/8) - 4/8 * log2(4/8) = 1.0
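A one-line numeric check of this calculation (class counts taken from the table above):

    import math

    counts = {'Play': 4, 'No play': 4}
    total = sum(counts.values())
    info_T = -sum(c / total * math.log2(c / total) for c in counts.values())
    print(info_T)   # 1.0 bit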

Page 36: Last lecture summary

Weather   Light  Ground  Class
Sunny     Good   Dry     Play
Sunny     Poor   Dry     Play
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play
Raining   Poor   Damp    No play
Raining   Good   Dry     No play

info_x(T) = Σ(i=1..n) |Ti|/|T| * info(Ti)

split on x = weather

i = ?            i = 1, 2, 3
i = 1            weather = 'sunny'
|T1|/|T| = ?     |T1|/|T| = 2/8
info(T1) = ?

info(Ti) = -Σ(j=1..k) freq(Cj, Ti)/|Ti| * log2( freq(Cj, Ti)/|Ti| )

info(T1) = -2/2 * log2(2/2) - 0/2 * log2(0/2)
info_x(T) = 2/8 * info(T1) + …

Page 37: Last lecture summary

Weather   Light  Ground  Class
Sunny     Good   Dry     Play
Sunny     Poor   Dry     Play
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play
Raining   Poor   Damp    No play
Raining   Good   Dry     No play

split on x = weather

i = 2            weather = 'overcast'
|T2|/|T| = ?     |T2|/|T| = 4/8
info(T2) = ?

info_x(T) = Σ(i=1..n) |Ti|/|T| * info(Ti)
info(Ti) = -Σ(j=1..k) freq(Cj, Ti)/|Ti| * log2( freq(Cj, Ti)/|Ti| )

info(T2) = -2/4 * log2(2/4) - 2/4 * log2(2/4)
info_x(T) = 2/8 * info(T1) + 4/8 * info(T2) + …

Page 38: Last lecture summary

Weather   Light  Ground  Class
Sunny     Good   Dry     Play
Sunny     Poor   Dry     Play
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play
Raining   Poor   Damp    No play
Raining   Good   Dry     No play

split on x = weather

i = 3            weather = 'raining'
|T3|/|T| = ?     |T3|/|T| = 2/8
info(T3) = ?

info_x(T) = Σ(i=1..n) |Ti|/|T| * info(Ti)
info(Ti) = -Σ(j=1..k) freq(Cj, Ti)/|Ti| * log2( freq(Cj, Ti)/|Ti| )

info(T3) = -0/2 * log2(0/2) - 2/2 * log2(2/2)
info_x(T) = 2/8 * info(T1) + 4/8 * info(T2) + 2/8 * info(T3)

Page 39: Last lecture summary

• info_weather(T) = 0.5 bits
  – Gain = 1.0 - 0.5 = 0.5
• test 'Light'
  – Gain = 0.189
• test 'Ground'
  – Gain = 0.049

• Choose the split with maximum Gain.
  – i.e. split by Weather first.

• 'Sunny' and 'Raining' are clean; they contain just one class.
• However, 'Overcast' contains both classes.

Weather   Light  Ground  Class
Sunny     Good   Dry     Play
Sunny     Poor   Dry     Play
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play
Raining   Poor   Damp    No play
Raining   Good   Dry     No play
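The gains quoted on this slide can be reproduced with a short Python sketch (the rows are the umpires' table above; the helper names are ad hoc, not from any particular library):

    import math
    from collections import Counter

    # (Weather, Light, Ground, Class) rows of the umpires' table
    data = [
        ('Sunny', 'Good', 'Dry', 'Play'),       ('Sunny', 'Poor', 'Dry', 'Play'),
        ('Overcast', 'Good', 'Dry', 'Play'),    ('Overcast', 'Good', 'Damp', 'Play'),
        ('Overcast', 'Poor', 'Dry', 'No play'), ('Overcast', 'Poor', 'Damp', 'No play'),
        ('Raining', 'Poor', 'Damp', 'No play'), ('Raining', 'Good', 'Dry', 'No play'),
    ]
    ATTRS = {'Weather': 0, 'Light': 1, 'Ground': 2}

    def info(rows):
        """info(T): entropy of the class distribution of a set of rows."""
        counts = Counter(r[-1] for r in rows)
        n = len(rows)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def gain(rows, attr):
        """gain(x) = info(T) - info_x(T) for a split on the given attribute."""
        idx = ATTRS[attr]
        info_x = 0.0
        for value in set(r[idx] for r in rows):
            subset = [r for r in rows if r[idx] == value]
            info_x += len(subset) / len(rows) * info(subset)
        return info(rows) - info_x

    for attr in ATTRS:
        print(attr, round(gain(data, attr), 3))   # Weather 0.5, Light 0.189, Ground 0.049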

Page 40: Last lecture summary

• So the algorithm now proceeds by investigating which of the two remaining features ('Light' or 'Ground') can classify the dataset correctly.

• Now our training set consists only of those instances with 'Weather' = 'Overcast'.

Weather   Light  Ground  Class
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play

Page 41: Last lecture summary

• info(T) = -2/4 * log2(2/4) – 2/4 * log2(2/4) = 1.0 bit

• info_light(T) = 2/4 * (-2/2 * log2(2/2) - 0/2 * log2(0/2)) + 2/4 * (-0/2 * log2(0/2) - 2/2 * log2(2/2)) = 0 bits
  Gain = 1.0 - 0.0 = 1.0

• info_ground(T) = 2/4 * (-1/2 * log2(1/2) - 1/2 * log2(1/2)) + 2/4 * (-1/2 * log2(1/2) - 1/2 * log2(1/2)) = 1.0 bit
  Gain = 1.0 - 1.0 = 0.0

Weather   Light  Ground  Class
Overcast  Good   Dry     Play
Overcast  Good   Damp    Play
Overcast  Poor   Dry     No play
Overcast  Poor   Damp    No play

(subsets considered: Light = Good / Poor, Ground = Dry / Damp)

Page 42: Last lecture summary

Intelligent Bioinformatics: The application of artificial intelligence techniques to bioinformatics problems, Keedwell

1. split – Weather
   – Sunny and Raining – fully classified as Play and No play, respectively
2. split – Light
   – Good – Play, Poor – No play
3. End

Page 43: Last lecture summary

Gain ratio

• The gain criterion is biased towards tests which have many subsets.
• A revised gain measure taking into account the size of the subsets created by a test is called the gain ratio.

• In our example, the split by 'Weather' yielded three subsets; the splits by the other two attributes yielded only two subsets.

• Gain is biased towards 'Weather' (Gain = 0.5), while gain ratio corrects for this bias (it equals 0.33). However, the split by 'Weather' still wins.

split info(x) = -Σ(i=1..n) |Ti|/|T| * log2( |Ti|/|T| )

gain ratio(x) = gain(x) / split info(x)
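Continuing the sketch from the gain computation a few slides back (it reuses data, ATTRS and gain defined there), split info and gain ratio can be added as follows:

    import math
    from collections import Counter

    def split_info(rows, attr):
        """split info(x) = -sum(|Ti|/|T| * log2(|Ti|/|T|)) over the subsets created by attr."""
        idx = ATTRS[attr]
        n = len(rows)
        return -sum(s / n * math.log2(s / n) for s in Counter(r[idx] for r in rows).values())

    def gain_ratio(rows, attr):
        return gain(rows, attr) / split_info(rows, attr)

    print(split_info(data, 'Weather'))   # 1.5 bits (subsets of size 2, 4 and 2)
    print(gain_ratio(data, 'Weather'))   # 0.5 / 1.5 ~ 0.33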

Page 44: Last lecture summary

• J. Ross Quinlan, C4.5: Programs for machine learning (book)

“In my experience, the gain ratio criterion is robust and typically gives a consistently better choice of test than the gain criterion”.

• However, Mingers J.¹ finds that though the gain ratio leads to smaller trees (which is good), it has a tendency to favor unbalanced splits in which one subset is much smaller than the others.

¹ Mingers J., "An empirical comparison of selection measures for decision-tree induction", Machine Learning 3(4), 319-342, 1989

Page 45: Last lecture summary

Continuous data

• How to split on real, continuous data?
• Use a threshold and the comparison operators <, ≤, >, ≥ (e.g. "if Light ≥ 6 then Play" for a Light variable ranging between 1 and 10).

• If a continuous variable in the data set has n distinct values, there are n-1 possible tests.

• The algorithm evaluates each of these splits, and it is actually not expensive.
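A minimal sketch of how the candidate thresholds can be enumerated and scored with information gain (the numeric 'Light' readings are invented for illustration; they are not part of the lecture data):

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def best_threshold(values, labels):
        """Try one split between each pair of consecutive distinct values (at most n-1 tests)."""
        total = entropy(labels)
        distinct = sorted(set(values))
        best_thr, best_gain = None, -1.0
        for lo, hi in zip(distinct, distinct[1:]):
            thr = (lo + hi) / 2
            left = [l for v, l in zip(values, labels) if v <= thr]
            right = [l for v, l in zip(values, labels) if v > thr]
            g = total - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if g > best_gain:
                best_thr, best_gain = thr, g
        return best_thr, best_gain

    light = [8, 3, 7, 2, 4, 1, 6, 9]   # invented Light readings between 1 and 10
    decision = ['Play', 'No play', 'Play', 'No play',
                'No play', 'No play', 'Play', 'Play']
    print(best_threshold(light, decision))   # (5.0, 1.0): splitting at Light > 5 separates the classes cleanly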

Page 46: Last lecture summary

Pruning

• A decision tree overfits, i.e. it learns to reproduce the training data exactly.

• Strategy to prevent overfitting – pruning:
  – Build the whole tree.
  – Prune the tree back, so that complex branches are consolidated into smaller (less accurate on the training data) sub-branches.
  – The pruning method uses some estimate of the expected error.

Page 47: Last lecture summary

Regression tree

Regression tree for predicting the price of 1993-model cars.

All features have been standardized to have zero mean and unit variance.

The R² of the tree is 0.85, which is significantly higher than that of a multiple linear regression fit to the same data (R² = 0.8).

Page 48: Last lecture summary

Algorithms, programs

• ID3, C4.5, C5.0 (Linux) / See5 (Windows) (Ross Quinlan)
• Only classification
• ID3
  – uses information gain
• C4.5
  – extension of ID3
  – Improvements over ID3
    • handling both continuous and discrete attributes (threshold)
    • handling training data with missing attribute values
    • pruning trees after creation
• C5.0/See5
  – Improvements over C4.5 (for comparison see http://www.rulequest.com/see5-comparison.html)
    • speed
    • memory usage
    • smaller decision trees

Page 49: Last lecture summary

• CART (Leo Breiman)
  – Classification and Regression Trees
  – only binary splits (unlike C4.5, whose splits are not binary)
  – splitting criterion – Gini impurity (index)
    • not based on information theory

• Both C4.5 and CART are robust tools.
• No method is always superior – experiment!