Learning
B. Ross, Cosc 4f79
Page 1: Learning


Learning

• Machine learning is an area of AI concerned with the automatic learning of knowledge

• some ways that machine learning can be used in expert systems

1. increase efficiency of inference engine and knowledge base processing
2. testing the knowledge base

3. use learning principles to acquire knowledge itself

4. ???

• Most learning techniques exploit heuristics: problem-specific information which makes the search for a solution more efficient

• Without heuristics, typical learning problems either take too long to execute effectively, or produce results which are too large & general to be useful

Page 2: Learning


Learning

1. Increase inference engine efficiency

• 20-80 principle: 20% of rules in KB account for 80% of diagnoses

• this 20% of the rules should take precedence, in order to make execution faster

• otherwise, roughly half of the KB needs to be looked at for every diagnosis, which is a waste of time for most (80%) problems

• However, it is also possible that the set of rules most often used can vary according to who, where, and how the expert system is used

• One way to fix this: keep a record of the number of times a high-level rule was successful in making a diagnosis

eg. record(rule12, 102). record(rule6, 25). etc

• Save this information in a file, and reload it every session.

• Use these rule stats to determine the order in which diagnoses are to be executed
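
A minimal Prolog sketch of this bookkeeping, building on the record/2 facts above (bump_count/1, ordered_rules/1, and save_counts/1 are illustrative names, not taken from Schnupp):

    % usage counts for the high-level rules, as reloaded from the statistics file
    :- dynamic record/2.

    record(rule12, 102).
    record(rule6, 25).

    % bump_count(+Rule): increment a rule's count after it succeeds in a diagnosis
    bump_count(Rule) :-
        ( retract(record(Rule, N)) -> N1 is N + 1 ; N1 = 1 ),
        assertz(record(Rule, N1)).

    % ordered_rules(-Rules): rules sorted by descending success count; this is
    % the order in which the high-level diagnoses should be attempted
    ordered_rules(Rules) :-
        findall(N-R, record(R, N), Pairs),
        keysort(Pairs, Ascending),
        reverse(Ascending, Descending),
        findall(R, member(_-R, Descending), Rules).

    % save_counts(+File): write the counts out so they can be reloaded next session
    save_counts(File) :-
        open(File, write, S),
        forall(record(R, N), format(S, 'record(~q, ~d).~n', [R, N])),
        close(S).

Reloading at the start of the next session is then just a matter of consulting the saved statistics file before any diagnosis is attempted.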

Page 3: Learning


Learning

1. ordering rules: p.118 Schnupp


Page 4: Learning


Learning

• Another possibility is to ask some preliminary questions to determine general high-level information, and then order the high-level inference accordingly

p.112-113 Schnupp

Page 5: Learning


Learning

2. Knowledge acquisition

(i) Learning rules from examples

• user inputs typical examples (decision table) for a given rule domain; or, can process a database to automatically generate production rules

• system constructs a rule or set of rules (decision tree) for this example set

• can then generate production rules from this tree

• trivial to make comprehensive tree but more involved to make a minimal one

• inductive inference: learning technique which constructs a general rule from specific examples (compare with mathematical induction)

• popular algorithm: Quinlan's ID3, used in shells such as VP-expert, ExpertEase, RuleMaster, and others
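
As a sketch of generating production rules from the induced tree (the third bullet above), assuming the tree is represented with leaf(Conclusion) and node(Attribute, [Value-Subtree, ...]) terms; the representation, the rule(Conditions, Conclusion) format, and the predicate names are all illustrative:

    % tree_rules(+Tree, -Rules): one production rule per root-to-leaf path
    tree_rules(Tree, Rules) :-
        findall(rule(Conds, Concl), tree_path(Tree, [], Conds, Concl), Rules).

    % tree_path(+Tree, +Acc, -Conditions, -Conclusion): walk one branch,
    % accumulating the Attr=Value tests met along the way
    tree_path(leaf(Concl), Acc, Conds, Concl) :-
        reverse(Acc, Conds).
    tree_path(node(Attr, Branches), Acc, Conds, Concl) :-
        member(Value-Sub, Branches),
        tree_path(Sub, [Attr=Value|Acc], Conds, Concl).

    % ?- tree_rules(node(sky, [clear-leaf(no), cloudy-leaf(yes)]), R).
    %    R = [rule([sky=clear], no), rule([sky=cloudy], yes)]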

Page 6: Learning


Learning

• Decision table: table of attribute values, with one or more conclusions
– each row is an example or true instance of attribute values --> conclusion

• Convenient for classification & identification problems

• Could create production rules directly from table: one rule per row

• induction: tries to generalize information in table, disregarding superfluous information, and yielding an efficient smaller decision tree

– results in “smarter” system

– good example of machine learning: computer tries to generalize and abstract from examples

• Given a table, can look at conclusions, and see if particular attributes have any effect on them. If not, then disregard those attributes when deriving that conclusion.

• There exist one or more "minimal" sets of tests for a table; however, finding this minimal set can be intractable in general

Page 7: Learning


ID3 definitions

• entropy: measure of how much an attribute matches the conclusion
– 1:1 match -- low entropy (high information content, low uncertainty)
  eg. a:1, b:2, c:3, d:4
– one attribute value maps to many different conclusions -- high entropy (low info content, high uncertainty)
  eg. a:1, a:2, a:3, a:4

• Information or entropy: mathematical measurement of an entity that provides the answer to a question, or certainty about an outcome, eg. to describe whether a coin will be heads or tails
– a value between 0 and 1, measured in bits
– 1 bit = enough information to answer a yes/no question about a fair, random event (because log2(2) = 1)
– 4 events require log2(4) = 2 bits, etc.
– Note: log2(X) = ln(X) / ln(2)
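
Some Prolog systems provide a base-2 log as a built-in arithmetic function; otherwise it is a one-liner via the change-of-base note above (log2/2 is an illustrative predicate name):

    % log2(+X, -L): base-2 logarithm, using log2(X) = ln(X) / ln(2)
    log2(X, L) :- L is log(X) / log(2).

    % ?- log2(2, B).    % B = 1.0  (one fair yes/no question = 1 bit)
    % ?- log2(4, B).    % B = 2.0  (4 equally likely events = 2 bits)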

Page 8: Learning


ID3 definitions

• Information content: average entropy of the different events, weighted by the probabilities of those events
– Formula: IC = -Pr(E1) log2 Pr(E1) - Pr(E2) log2 Pr(E2) - ... - Pr(Ek) log2 Pr(Ek)
– where attribute A has events E1,...,Ek
– eg. fair coin: IC(0.5, 0.5) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit
– eg. heavily weighted coin: IC(0.01, 0.99) = -0.01 log2 0.01 - 0.99 log2 0.99 = 0.08 bits
– eg. always heads: IC(0, 1) = 0 bits
– Note: if an event has probability 0, don't include its term in the equation (log 0 is undefined); instead, treat that term as 0

• Information gain: difference between the original information content and the new information content after attribute A is selected
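
A small Prolog sketch of the IC formula (ic/2 is an illustrative name, and foldl/4 is assumed from SWI-Prolog's library); per the note above, zero-probability events are simply skipped:

    % ic(+Probs, -IC): information content of a list of event probabilities,
    % IC = -Pr(E1) log2 Pr(E1) - ... - Pr(Ek) log2 Pr(Ek)
    ic(Probs, IC) :-
        foldl(add_term, Probs, 0.0, IC).

    % a zero-probability event contributes nothing (log 0 is undefined)
    add_term(0,   Acc, Acc) :- !.
    add_term(0.0, Acc, Acc) :- !.
    add_term(P, Acc, Acc1) :-
        Acc1 is Acc - P * log(P) / log(2).

    % ?- ic([0.5, 0.5], IC).      % IC = 1.0 (fair coin)
    % ?- ic([0.01, 0.99], IC).    % IC is about 0.0808, the 0.08 bits above
    % ?- ic([0, 1], IC).          % IC = 0.0 (always heads)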

Page 9: Learning


ID3 Algorithm

• minimal trees are useful when the example table is completely adequate for the observations of interest, and user input is certain
– for uncertain input, additional tests are required for support

• ID3: induction algorithm that tries to find a small tree efficiently
– not guaranteed to be minimal, but it is generally small

1. Given the example set C, find the attribute A that gives the highest information gain

--> this is the most discriminatory test for distinguishing the data

--> ideally, if we have X entropy before, then after testing A we have 0 entropy (ie. a perfect decision strategy!)

2. Add it as the next node in tree

3. Partition the examples into subtables, and recurse (fills in remainder of tree)

Page 10: Learning


ID3 algorithm

• consider p positive and n negative examples
– probability of positives: p+ = p / (p + n)
– probability of negatives: p- = n / (p + n)
– compute the above from the example set

• information content for each attribute value:
– IC(Vi) = -(p+) log2 (p+) - (p-) log2 (p-)
– where Vi is a value of attribute A, and p+ / p- are computed within the subset of examples having that value

• information content of the table after attribute A is used as the test:
– B(C, A) = sum over i of [ Pr(value of A is Vi) * IC(Ci) ]
– for attribute A with values Vi (i = 1,...,#values)
– Ci is the subset of examples corresponding to each Vi

• We wish to select the attribute which maximizes the information gain at that node in the tree. Repeat the following for every attribute in the example set:
– compute p+, p- for each value of the attribute
– compute IC for each (attribute, value) pair
– compute the overall B(C, A) for that attribute

• --> select the attribute A which maximizes the information gain, ie. the one that yields the lowest information content after it is applied --> the lowest B(C, A) value
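
Continuing the earlier sketch, B(C, A) can be computed for one attribute by projecting the example set onto Value-Class pairs for that attribute; this reuses ic/2 from the previous sketch plus SWI-Prolog's include/3 and foldl/4, and all names are illustrative. The attribute with the lowest resulting B is the one selected.

    % b_value(+Pairs, -B): weighted information content B(C, A) for one attribute,
    % where Pairs holds one Value-Class pair per example (Class is pos or neg),
    % e.g. [north-pos, north-neg, south-pos, ...]
    b_value(Pairs, B) :-
        length(Pairs, Total),
        findall(V, member(V-_, Pairs), Vs),
        sort(Vs, Values),                         % the distinct attribute values Vi
        foldl(weighted_ic(Pairs, Total), Values, 0.0, B).

    % add Pr(A = Vi) * IC(Ci) for the subset Ci of examples with value Vi
    weighted_ic(Pairs, Total, V, Acc, Acc1) :-
        include(has_value(V), Pairs, Subset),
        length(Subset, N),
        include(is_class(pos), Subset, Pos),
        length(Pos, P),
        Pplus is P / N,
        Pminus is (N - P) / N,
        ic([Pplus, Pminus], IC),                  % ic/2 from the earlier sketch
        Acc1 is Acc + (N / Total) * IC.

    has_value(V, V-_).
    is_class(C, _-C).

    % ?- b_value([a-pos, a-pos, b-neg, b-neg], B).
    %    B = 0.0  (each value predicts the class exactly, so this attribute wins)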

Page 11: Learning


ID3: entropy extremes

attr. 1   attr. 2   value
   a         c         x
   b         c         y
   a         c         x
   a         c         x

IC(C) = -(3/4) log2 (3/4) - (1/4) log2 (1/4)
      = (0.75)(0.415) + (0.25)(2.0) = 0.811

• attr 1:
– IC(a) = -1 log2 1 - 0 log2 0 = 0
– IC(b) = -1 log2 1 - 0 log2 0 = 0
– B(C, attr1) = Pr(a)*IC(a) + Pr(b)*IC(b) = 0 + 0 = 0
– Gain: IC(C) - B(C, attr1) = 0.811 - 0 = 0.811
--> maximum gain! All values are precisely predicted using attribute 1.

• attr 2:
– IC(c) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811
– B(C, attr2) = Pr(c)*IC(c) = 1 * 0.811 = 0.811
– Gain: IC(C) - B(C, attr2) = 0.811 - 0.811 = 0
--> minimum gain; no information gained at all from using attribute 2.

• Note: we can simply select the attribute yielding the minimum information content “B” for the table; computing the gain is redundant (it doesn't help us).
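
These two extremes can be checked with the ic/2 and b_value/2 sketches from the earlier slides, reading value x as pos and y as neg:

    % attr 1: each value predicts the conclusion exactly  -->  B = 0
    % ?- b_value([a-pos, b-neg, a-pos, a-pos], B1).      % B1 = 0.0
    % attr 2: a single value covers all four rows  -->  B = IC(C)
    % ?- b_value([c-pos, c-neg, c-pos, c-pos], B2).      % B2 = 0.8113
    % ?- ic([0.75, 0.25], IC).                           % IC = 0.8113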

Page 12: Learning


ID3: example (from Durkin, pp.496-498)

• IC(C) = -(4/8) log2 (4/8) - (4/8) log2 (4/8) = 1 (initial info content of all examples)

• (a) test "wind":
– IC(North) = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971
– IC(South) = -(1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.918
– B(C, "wind") = (5/8)(0.971) + (3/8)(0.918) = 0.951
– gain = IC(C) - B(C, "wind") = 1 - 0.951 = 0.049

• (b) test "sky":
– "clear": all examples are negative (IC = 0)
– IC(cloudy) = -(4/5) log2 (4/5) - (1/5) log2 (1/5) = 0.722
– B(C, "sky") = (3/8)(0) + (5/8)(0.722) = 0.451
– gain = IC(C) - B(C, "sky") = 1 - 0.451 = 0.549

• (c) test "barometer": gain is 0.156

• therefore "sky" gives the highest info gain, and is selected
• the algorithm partitions the example set for each new subcategory, and recurses

• Note: we're simply finding the attribute that yields the smallest information content for the remaining table after it is applied
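
The same numbers fall out of the earlier b_value/2 sketch if the eight examples are projected onto each attribute; the pos/neg splits below (North 3+/2-, South 1+/2-, clear all negative, cloudy 4+/1-) are the counts implied by the IC values above and by the later slides:

    % ?- b_value([north-pos, north-pos, north-pos, north-neg, north-neg,
    %             south-pos, south-neg, south-neg], B).
    %    B = 0.9512  (the 0.951 above)
    % ?- b_value([clear-neg, clear-neg, clear-neg,
    %             cloudy-pos, cloudy-pos, cloudy-pos, cloudy-pos, cloudy-neg], B).
    %    B = 0.4512  (the 0.451 above)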

Page 13: Learning


Example (cont)

• Must now apply ID3 to remaining examples that have differing result values --> new layers of decision tree

• (a) barometer:
– IC(rising) = -1 log2 1 - 0 log2 0 = 0
– IC(steady) = -1 log2 1 - 0 log2 0 = 0
– IC(falling) = -(1/2) log2 (1/2) - (1/2) log2 (1/2) = 1
– B(C, barometer) = (2/5)(0) + (1/5)(0) + (2/5)(1) = 0.4

• (b) wind:
– IC(south) = -(1/2) log2 (1/2) - (1/2) log2 (1/2) = 1
– IC(north) = -1 log2 1 - 0 = 0
– B(C, wind) = (2/5)(1) + (3/5)(0) = 0.4

• --> choose either
• note that you'll need both attributes together to classify the remaining table

Page 14: Learning


Example

• If we choose ‘barometer’, then remaining table left is:

– 5 Cloudy, Falling, North +

– 7 Cloudy, Falling, South -
• Should be obvious that ‘wind’ is the only possibility now.

Page 15: Learning


Example: final tree

Page 16: Learning


ID3

• ID3 generalizes to multivalued classifications (not just plus and minus): information content expression extends to multiple categories...

– IC(a, b, c) = -pa log2 pa - pb log2 pb - pc log2 pc

• Note: the final tree can have “no data” leaves, meaning that the example set does not cover that combination of tests
– can presume that such a leaf is “impossible” wrt the examples
– otherwise, this implies that the example set is missing information
– --> must assume that the examples are complete and correct; this is the responsibility of the knowledge engineer!

Page 17: Learning


Learning

Inductive Inference:

p.129-132

Page 18: Learning


Learning

Inductive Inference

Page 19: Learning


Learning

Page 20: Learning


Learning

ID3 algorithm

p. 134-5
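
The referenced listing (pp. 134-5) appears as an image in the original slides and is not reproduced in this transcript. As a rough stand-in, here is a minimal sketch of the recursion from the ID3 Algorithm slide, reusing ic/2 and b_value/2 from the earlier sketches and SWI-Prolog's list library; the Class-AVs / Attr=Value representation and all predicate names are illustrative, not the book's code, and clashing examples are not handled:

    % Each example is Class-AVs, with Class one of pos/neg and AVs a list of
    % Attr=Value pairs, e.g.  pos-[sky=cloudy, wind=north, barometer=rising]

    % id3(+Examples, +Attrs, -Tree)
    id3([], _, leaf(no_data)) :- !.                  % "no data" leaf (see a later slide)
    id3(Examples, _, leaf(Class)) :-                 % pure subset: make a leaf
        setof(C, AVs^member(C-AVs, Examples), [Class]), !.
    id3(Examples, Attrs, node(Best, Branches)) :-
        best_attribute(Examples, Attrs, Best),       % step 1: lowest B(C, A)
        select(Best, Attrs, RestAttrs),              % step 2: add node, drop that attribute
        setof(V, value_of(Best, Examples, V), Values),
        findall(V-Sub,                               % step 3: partition and recurse
                ( member(V, Values),
                  include(has_attr_value(Best, V), Examples, Part),
                  id3(Part, RestAttrs, Sub) ),
                Branches).

    % pick the attribute giving the smallest B(C, A) over this example set
    best_attribute(Examples, Attrs, Best) :-
        findall(B-A, ( member(A, Attrs), attr_b(A, Examples, B) ), Scored),
        keysort(Scored, [_-Best|_]).

    % attr_b(+Attr, +Examples, -B): project the table onto Value-Class pairs
    % for Attr and score it with b_value/2 from the earlier sketch
    attr_b(Attr, Examples, B) :-
        findall(V-C, ( member(C-AVs, Examples), member(Attr=V, AVs) ), Pairs),
        b_value(Pairs, B).

    value_of(Attr, Examples, V) :- member(_-AVs, Examples), member(Attr=V, AVs).
    has_attr_value(Attr, V, _-AVs) :- member(Attr=V, AVs).

    % ?- id3([neg-[sky=clear,  wind=north],
    %         pos-[sky=cloudy, wind=north],
    %         pos-[sky=cloudy, wind=south]],
    %        [sky, wind], Tree).
    %    Tree = node(sky, [clear-leaf(neg), cloudy-leaf(pos)])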

Page 21: Learning


Learning

Page 22: Learning


Learning

Page 23: Learning


Learning

Page 24: Learning


Possible ID3 problems

• clashing examples: need more data attributes, or must correct knowledge

• continuous values (floating point): must create ranges, eg. 0 < x < 5 (see the small sketch after this list)

• noise: if compressing a database, noise can unduly influence decision tree

• trees can be too large: production rules are therefore large too

– break up table and create hierarchy of tables (structured induction)

• flat rules: only one or more conclusions, no intermediate rules
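
One common workaround for the continuous-value problem above is to discretize readings into named ranges before induction; a tiny illustrative sketch, with cut points taken from the 0 < x < 5 example:

    % range_of(+X, -Range): map a continuous value onto a discrete range symbol
    % (the cut points 0 and 5, and the range names, are illustrative only)
    range_of(X, low)  :- X =< 0, !.
    range_of(X, mid)  :- X < 5,  !.      % covers 0 < x < 5
    range_of(_, high).

    % ?- range_of(3.7, R).    % R = mid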

Page 25: Learning


ID3 enhancements

• If you have a large example set, can use a subset (“window”) of examples
– then need to verify that the resulting decision tree is valid, in case that window wasn't inclusive or had noise
– knowledge bases must be 100% correct!

• Data mining: finding trends in large databases
– ID3 is one tool used there
– don't care about 100% correctness
– rather, a good random sample of the database may yield enough useful information
– can also apply to the entire database, in an attempt to categorize information into useful classes and trends

• C4.5 is the successor of ID3. It uses a more advanced heuristic.

Page 26: Learning


Learning: Genetic algorithms

• Another way to create small decision trees from examples: genetic programming

• 1. Create a population of random decision trees
• 2. Repeat until a suitably correct and small tree is found:
– a. Rate all trees based on: (i) size, (ii) how many examples they cover, (iii) how many examples they miss
– --> fitness score
– b. Create a new population:
– (i) mate trees using Crossover: swap subtrees between parents
– (ii) mutate trees using Mutation: random change to a tree

• This will search the space of decision trees for a correct and small-sized tree
• Much slower than ID3, but possibly better results
• Remember: finding the smallest tree for an example set is NP-complete
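
A sketch of the rating step (2a), using the same leaf/node tree representation as the earlier ID3 sketch; the weights, predicate names, and the choice of "lower is better" are illustrative assumptions:

    % classify(+Tree, +AVs, -Class): run one example's Attr=Value list down a tree
    classify(leaf(Class), _, Class).
    classify(node(Attr, Branches), AVs, Class) :-
        member(Attr=V, AVs),
        member(V-Sub, Branches),
        classify(Sub, AVs, Class).

    % tree_size(+Tree, -Size): total number of nodes and leaves
    tree_size(leaf(_), 1).
    tree_size(node(_, Branches), Size) :-
        findall(S, ( member(_-Sub, Branches), tree_size(Sub, S) ), Sizes),
        sum_list(Sizes, Sum),
        Size is Sum + 1.

    % fitness(+Tree, +Examples, -F): lower is better; missed examples are
    % penalized heavily, size lightly (the weights 10 and 1 are arbitrary)
    fitness(Tree, Examples, F) :-
        include(misclassified(Tree), Examples, Missed),
        length(Missed, M),
        tree_size(Tree, Size),
        F is 10 * M + Size.

    % an example counts as missed if the tree gives the wrong class, or no class at all
    misclassified(Tree, Class-AVs) :-
        \+ classify(Tree, AVs, Class).

Crossover and mutation then operate directly on these node/leaf terms, swapping or regenerating subtrees; that part of the loop is omitted here.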

Page 27: Learning


Learning comments

• Inductive inference is a convenient way of obtaining productions automatically from databases or user examples.

• Good for classification & diagnosis problems

• Assumes that domain is deterministic: that particular premises lead to only one conclusion, not multiple ones

• Need to have: good data, no noise, no clashes
– assumes that the entire universe of interest is encapsulated in the example set or database
– clashes probably mean you need to identify more attributes

• ID3 doesn't guarantee a minimal tree, but its entropy measure is a heuristic that often generates a small tree

• Note that a minimal rule is not necessarily desirable: when you discard attributes, you discard information which might be relevant, especially later when the system is being upgraded

• Not desirable to use this technique on huge tables with many attributes. A better approach is to modularise the data hierarchically.