Knowledge Discovery in Databases T4: Machine Learning
P. Berka, 2019
Machine Learning
The field of machine learning is concerned with
the question of how to construct computer
programs that automatically improve with
experience.
(Mitchell, 1997)
Things learn when they change their behavior in
a way that makes them perform better in the
future.
(Witten, Frank, 1999)
Types of learning:
- knowledge acquisition
- skill refinement
[Figure: Relation between machine learning and data mining]
[Figure: General scheme of a learning system. A learning module selects a representation and turns object descriptions into knowledge; a decision-making module then uses this knowledge to derive a decision for a described object.]
Learning methods:
1. rote learning,
2. learning from instruction (learning by being told),
3. learning by analogy, instance-based learning, lazy
learning,
4. explanation-based learning,
5. learning from examples,
6. learning from observation and discovery.
Learning methods:
- statistical methods: regression methods, discriminant analysis, cluster analysis,
- symbolic machine learning methods: decision trees and rules, case-based reasoning (CBR),
- sub-symbolic machine learning methods: neural networks, Bayesian networks, or genetic algorithms.
Feedback during learning:
- pre-classified examples (supervised learning)
- a small number of pre-classified examples and a large number of examples without a known class (semi-supervised learning)
- the algorithm can query the teacher for the class membership of unclassified examples (active learning)
- indirect hints derived from the teacher's behavior (apprenticeship learning)
- no feedback (unsupervised learning)
Representation of examples:
1. attributes: categorical (binary, nominal, ordinal) and numeric
   [hair=black & height=180 & beard=yes & education=univ]
2. relations
   father(jan_lucembursky, karel_IV)
Algorithms:
- batch: all examples are processed at once
- incremental: examples are processed sequentially; the system can be "re-trained"
Learning methods:
- empirical: uses a large set of (training) examples and limited (or no) background knowledge
- analytic: uses extensive background knowledge and a few (one or even no) illustrative examples
Principles of empirical concept learning
1. Examples of the same class have similar characteristics (similarity-based learning), i.e. examples of the same class create clusters in the attribute space. The goal of learning is to find and represent these clusters.
The "garbage in, garbage out" problem: hence the importance of data understanding and preprocessing.
2. General knowledge is inferred from a finite set of examples (inductive learning).
Examples are divided into 2 (or 3) sets:
o training set to build a model
o (validation set to tune the parameters)
o testing set to test the model
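A minimal Python sketch of such a split, assuming scikit-learn is available; the DataFrame and the column name "class" are made up for illustration:

```python
# A minimal sketch of a training / validation / test split;
# the DataFrame and the target column name "class" are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({"a": range(10), "b": range(10, 20),
                     "class": [0, 1] * 5})
X, y = data.drop(columns=["class"]), data["class"]

# First split off the test set, then carve a validation set out of the
# remaining part: 60 % training, 20 % validation, 20 % test overall.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
```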
General definition of (supervised) machine
learning
Analyzed data:
$$\mathbf{D} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$
Rows in the table represent objects (examples, instances)
Columns in the table correspond to attributes (variables)
When we add a target attribute to the data table, we obtain data suitable for supervised learning methods (so-called training data):
$$\mathbf{D}_{TR} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} & y_1 \\ x_{21} & x_{22} & \cdots & x_{2m} & y_2 \\ \vdots & & & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} & y_n \end{pmatrix}$$
Classification task: to find knowledge (represented by a decision function f) that assigns a value of the target attribute y to an object described by values of the input attributes x:

$$f: \mathbf{x} \rightarrow y.$$
During classification we infer, from the values of the input attributes x of an object, the value of the target attribute:

$$\hat{y} = f(\mathbf{x}).$$

The derived value ŷ can differ from the real value y. We can thus compute for every object $\mathbf{o}_i \in D_{TR}$ the classification error $Q_f(\mathbf{o}_i, \hat{y}_i)$:
for a numeric attribute C e.g. as

$$Q_f(\mathbf{o}_i, \hat{y}_i) = (y_i - \hat{y}_i)^2,$$

for a categorical attribute C e.g. as

$$Q_f(\mathbf{o}_i, \hat{y}_i) = \begin{cases} 1 & \text{if } y_i \neq \hat{y}_i \\ 0 & \text{if } y_i = \hat{y}_i. \end{cases}$$
We can compute the overall error Err(f,DTR) for the whole
training set DTR e.g. as mean error:
$$\mathrm{Err}(f, D_{TR}) = \frac{1}{n} \sum_{i=1}^{n} Q_f(\mathbf{o}_i, \hat{y}_i).$$
The goal of learning is to find such knowledge f*, that will
minimize this error
$$\mathrm{Err}(f^*, D_{TR}) = \min_f \mathrm{Err}(f, D_{TR}).$$
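A small Python sketch of these two per-object error measures and of the mean error over a training set; the arrays are made up for illustration:

```python
import numpy as np

# Hypothetical true and predicted target values for a training set.
y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.8, 0.1, 0.9, 1.0])

# Per-object squared error for a numeric target: Q = (y - y_hat)^2.
squared_errors = (y_true - y_pred) ** 2

# Per-object 0/1 error for a categorical target: Q = 1 iff y != y_hat.
zero_one_errors = (y_true != y_pred).astype(float)

# Overall error Err(f, D_TR) as the mean of per-object errors.
mse = squared_errors.mean()
error_rate = zero_one_errors.mean()
print(mse, error_rate)
```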
1. Learning as search
Looking for both the structure and the parameters of the model, e.g. the number of clusters and their location.
Models as cluster descriptions:
MGM – most general model (one cluster for all
examples)
MSM – most specific model(s) (each example
creates a cluster)
M1 is more general than M2, M2 is more specific
than M1
The number of ways to partition n examples into clusters is given by the Bell numbers:

$$B(n+1) = \sum_{k=0}^{n} \binom{n}{k} B(k), \qquad B(0) = 1$$

n:    1  2  3  4   5   10
B(n): 1  2  5  15  52  115975
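A short Python sketch of this recurrence, as a direct transcription using math.comb:

```python
from math import comb

def bell(n):
    """Compute the Bell number B(n) via B(n+1) = sum_k C(n,k) * B(k)."""
    b = [1]  # B(0) = 1
    for i in range(n):
        b.append(sum(comb(i, k) * b[k] for k in range(i + 1)))
    return b[n]

# Reproduces the table above.
print([bell(n) for n in (1, 2, 3, 4, 5, 10)])  # [1, 2, 5, 15, 52, 115975]
```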
Search methods:
- Direction
  - top-down (from general to specific models)
  - bottom-up (from specific to general models)
- Strategy
  - blind (we consider every possibility of specializing/generalizing the given model)
  - heuristic (we use some criterion to select only the "best" possibilities of specializing/generalizing the given model)
  - random
- Bandwidth
  - single (we consider only one transformation of the current model)
  - parallel (we consider more transformations)
Entia non sunt multiplicanda praeter necessitatem.
("Entities should not be multiplied beyond necessity.")
(William of Ockham, 1285 – 1327)
Example:
Let us assume that both the input attributes and the target attribute are categorical; let us call a value of an attribute a category.
1. an atomic formula that expresses a property of an object:

$$A_j(v_{jk})(\mathbf{o}_i) = \begin{cases} 1 & \text{for } x_{ij} = v_{jk} \\ 0 & \text{for } x_{ij} \neq v_{jk} \end{cases}$$
2. the set of objects that fulfill the given property:

$$A_j(v_{jk}) = \{\mathbf{o}_i : x_{ij} = v_{jk}\}$$
Combinations are created from categories using logical AND:

$$\mathrm{Comb}[A_{j_1}(v_{j_1k_1}), A_{j_2}(v_{j_2k_2}), \ldots, A_{j_l}(v_{j_lk_l})] = A_{j_1}(v_{j_1k_1}) \wedge A_{j_2}(v_{j_2k_2}) \wedge \ldots \wedge A_{j_l}(v_{j_lk_l})$$

1. $$\mathrm{Comb}(\mathbf{o}_i) = \begin{cases} 1 & \text{if } x_{ij_1} = v_{j_1k_1} \wedge x_{ij_2} = v_{j_2k_2} \wedge \ldots \wedge x_{ij_l} = v_{j_lk_l} \\ 0 & \text{else} \end{cases}$$

2. $$\mathrm{Comb} = \{\mathbf{o}_i : x_{ij_1} = v_{j_1k_1} \wedge x_{ij_2} = v_{j_2k_2} \wedge \ldots \wedge x_{ij_l} = v_{j_lk_l}\}.$$
Comb covers object $\mathbf{o}_i$ iff $\mathrm{Comb}(\mathbf{o}_i) = 1$.
We can create supercombinations by adding categories to a
combination and create subcombinations by removing
categories from a combination.
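A possible Python encoding of a combination and its covering test; the dict representation and the attribute names are illustrative only:

```python
# A combination as a dict {attribute: required category}; an example as a
# dict {attribute: value}. Both encodings are hypothetical illustrations.
def covers(comb, example):
    """Return True iff the combination covers the example."""
    return all(example.get(attr) == val for attr, val in comb.items())

comb = {"income": "high", "unemployed": "no"}
example = {"income": "high", "account": "high", "unemployed": "no"}
print(covers(comb, example))  # True

# A supercombination adds a category (more specific, covers fewer objects);
# a subcombination removes one (more general, covers more objects).
```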
Partial ordering between combinations:
If combination Comb1 is a subcombination of
combination Comb2, then combination Comb1 is more
general than combination Comb2 and combination
Comb2 is more specific than combination Comb1.
If combination Comb1 is more general than combination
Comb2, then Comb1 covers at least all objects that are
covered by Comb2. (downward-closure property)
The resulting knowledge will be represented by
combinations that cover only examples of given class.
Combination Comb is consistent iff it covers only examples of a single class:

$$\exists C(v_t)\ \forall \mathbf{o}_i \in D_{TR}:\ \mathrm{Comb}(\mathbf{o}_i) = 1 \Rightarrow y_i = v_t$$
Example data (translated from Czech; attributes: income (příjem), account (konto), sex (pohlaví), unemployed (nezaměstnaný), car (auto), housing (bydlení), credit (úvěr)):

income  account  sex     unemployed  car  housing  credit
high    high     female  no          yes  own      Yes
high    high     male    no          yes  own      Yes
low     low      male    no          yes  rented   No
high    high     male    no          no   rented   Yes
Combination Comb (a hypothesis representing the concept "credit") can contain the following values of an attribute:
- "?" to indicate that the value of this attribute is irrelevant,
- a specific value of the attribute,
- "∅" to indicate that no value of this attribute is applicable.
Hypothesis space
We can traverse the hypothesis space using two methods:
from general to specific (top-down, specialization),
from specific to general (bottom-up, generalization).
[Figure: Hypothesis space for the example data (attribute order: income, account, sex, unemployed, car, housing), ordered from the most general hypothesis [?, ?, ?, ?, ?, ?] at the top, through hypotheses such as [high, ?, ?, ?, ?, ?], [?, high, ?, ?, ?, ?], [?, ?, female, ?, ?, ?] and [?, ?, ?, ?, ?, own], then [high, high, ?, no, ?, ?] and its specializations (e.g. [high, high, ?, no, yes, ?], [high, high, ?, no, ?, own], [high, high, male, no, ?, ?]), down to the training examples themselves (e.g. [high, high, female, no, yes, own]) and the empty hypothesis [∅, ∅, ∅, ∅, ∅, ∅] at the bottom.]
[Figure: Trace of the Find-S algorithm on the example data. Starting from the most specific hypothesis, the positive examples [high, high, female, no, yes, own], [high, high, male, no, yes, own] and [high, high, male, no, no, rented] generalize the hypothesis step by step, yielding S: [high, high, ?, no, ?, ?].]
Find-S algorithm
1. Initialize h to the most specific hypothesis in H
2. For each positive training example x
   2.1. For each attribute ai of hypothesis h:
        if the value of attribute ai does not correspond to x,
        then replace the value of ai by the next more general value that corresponds to x
3. Output h
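A runnable Python sketch of Find-S for this conjunctive representation; None stands for the empty value "∅", and the positive examples are taken from the translated table:

```python
# Find-S for conjunctive hypotheses: None = empty value, "?" = any value.
POSITIVES = [
    ("high", "high", "female", "no", "yes", "own"),
    ("high", "high", "male",   "no", "yes", "own"),
    ("high", "high", "male",   "no", "no",  "rented"),
]

def find_s(positives):
    # 1. Start with the most specific hypothesis (all attributes empty).
    h = [None] * len(positives[0])
    # 2. Minimally generalize h on every positive example.
    for x in positives:
        for i, v in enumerate(x):
            if h[i] is None:    # empty -> take the example's value
                h[i] = v
            elif h[i] != v:     # conflicting values -> generalize to "?"
                h[i] = "?"
    # 3. Output h.
    return h

print(find_s(POSITIVES))  # ['high', 'high', '?', 'no', '?', '?']
```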
[Figure: Final version space for the example data.
G: {[high, ?, ?, ?, ?, ?], [?, high, ?, ?, ?, ?]}
intermediate hypotheses: [high, ?, ?, no, ?, ?], [high, high, ?, ?, ?, ?], [?, high, ?, no, ?, ?]
S: {[high, high, ?, no, ?, ?]}]
Candidate-Elimination algorithm
1. Initialize G to the set of maximally general hypotheses in H
2. Initialize S to the set of maximally specific hypotheses in H
3. For each example x
   3.1. if x is a positive example then
        remove from G any hypothesis inconsistent with x
        for each hypothesis s in S that is not consistent with x
            remove s from S
            add to S every minimal generalization h of s such that h is consistent with x and some member of G is more general than h
        remove from S any hypothesis that is more general than another hypothesis in S
   3.2. if x is a negative example then
        remove from S any hypothesis inconsistent with x
        for each hypothesis g in G that is not consistent with x
            remove g from G
            add to G every minimal specialization h of g such that h is consistent with x and some member of S is more specific than h
        remove from G any hypothesis that is more specific than another hypothesis in G
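The following Python sketch runs Candidate-Elimination on the translated example table. It is simplified for the conjunctive representation (each hypothesis in S has a unique minimal generalization, so S stays a singleton); the encoding (None for "∅", "?" for any value) and all names are illustrative:

```python
# Candidate-Elimination for conjunctive hypotheses over the example data.
DATA = [  # (income, account, sex, unemployed, car, housing) -> credit
    (("high", "high", "female", "no", "yes", "own"),    True),
    (("high", "high", "male",   "no", "yes", "own"),    True),
    (("low",  "low",  "male",   "no", "yes", "rented"), False),
    (("high", "high", "male",   "no", "no",  "rented"), True),
]
N = 6
DOMAINS = [sorted({x[i] for x, _ in DATA}) for i in range(N)]

def consistent(h, x):
    """True iff hypothesis h covers example x (None covers nothing)."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def more_general_eq(h1, h2):
    """True iff h1 is more general than or equal to h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def generalize(s, x):
    """The unique minimal generalization of s that covers positive x."""
    if None in s:
        return tuple(x)
    return tuple(sv if sv == xv else "?" for sv, xv in zip(s, x))

def specializations(g, x):
    """Minimal specializations of g that exclude negative x."""
    for i, gv in enumerate(g):
        if gv == "?":
            for v in DOMAINS[i]:
                if v != x[i]:
                    yield g[:i] + (v,) + g[i + 1:]

G, S = [("?",) * N], [(None,) * N]
for x, positive in DATA:
    if positive:
        G = [g for g in G if consistent(g, x)]
        S = [generalize(s, x) if not consistent(s, x) else s for s in S]
        S = [s for s in S if any(more_general_eq(g, s) for g in G)]
    else:
        S = [s for s in S if not consistent(s, x)]
        new_G = []
        for g in G:
            if not consistent(g, x):
                new_G.append(g)
                continue
            new_G += [h for h in specializations(g, x)
                      if any(more_general_eq(h, s) for s in S)]
        # keep only the maximally general hypotheses
        G = [g for g in new_G
             if not any(h != g and more_general_eq(h, g) for h in new_G)]

print("S:", S)  # [('high', 'high', '?', 'no', '?', '?')]
print("G:", G)  # [('high', '?', ...), ('?', 'high', ...)]
```

The final S and G boundaries match the version space shown in the figure above.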
2. Learning as approximation
Looking “only” for parameters of the model
within a given class of models
Example:
using data points $[x_i, y_i]$ to find the parameters of a linear function that best fits the data:

$$f(x) = q_1 x + q_0$$
least squares method:
the problem of finding the minimum of the overall error

$$\min \sum_i (y_i - f(x_i))^2$$

is transformed into solving the equation

$$\frac{d}{dq} \sum_i (y_i - f(x_i))^2 = 0$$
[Figure: Data points fitted by a linear function y = f(x).]
solution:
1) analytic (we know the type of the function)
solving the equations for the parameters of the function:

$$q_0 = \frac{(\sum_k y_k)(\sum_k x_k^2) - (\sum_k x_k y_k)(\sum_k x_k)}{n(\sum_k x_k^2) - (\sum_k x_k)^2}$$

$$q_1 = \frac{n(\sum_k x_k y_k) - (\sum_k x_k)(\sum_k y_k)}{n(\sum_k x_k^2) - (\sum_k x_k)^2}$$
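A NumPy sketch of these closed-form formulas on made-up data points:

```python
import numpy as np

# Illustrative data points [x_i, y_i], roughly along y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
n = len(x)

# Closed-form least-squares estimates for f(x) = q1*x + q0.
denom = n * (x**2).sum() - x.sum()**2
q0 = (y.sum() * (x**2).sum() - (x*y).sum() * x.sum()) / denom
q1 = (n * (x*y).sum() - x.sum() * y.sum()) / denom
print(q0, q1)  # approx. 1.09 and 1.94, close to the line y = 2x + 1
```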
2) numeric (we do not know the type of function)
gradient methods
$$\nabla \mathrm{Err}(\mathbf{q}) = \left[\frac{\partial \mathrm{Err}}{\partial q_0}, \frac{\partial \mathrm{Err}}{\partial q_1}, \ldots, \frac{\partial \mathrm{Err}}{\partial q_Q}\right].$$
Modification of the knowledge $\mathbf{q} = [q_0, q_1, \ldots, q_Q]$ according to the rule

$$q_j \leftarrow q_j + \Delta q_j,$$

where

$$\Delta q_j = -\eta \frac{\partial \mathrm{Err}}{\partial q_j}$$

and η is a parameter expressing the "step" used to approach the minimum of the function Err.
E.g. for the error function

$$\mathrm{Err}(f, D_{TR}) = \frac{1}{2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{2}\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i)\big)^2$$

and the expected function f as a linear combination of the inputs

$$f(\mathbf{x}) = \mathbf{q} \cdot \mathbf{x},$$
we can derive the gradient of the function Err as

$$\frac{\partial \mathrm{Err}}{\partial q_j} = \frac{\partial}{\partial q_j}\,\frac{1}{2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{2}\sum_{i=1}^{n} 2\,(y_i - \hat{y}_i)\,\frac{\partial}{\partial q_j}(y_i - \hat{y}_i) = \sum_{i=1}^{n}(y_i - \hat{y}_i)\,\frac{\partial}{\partial q_j}(y_i - \mathbf{q}\cdot\mathbf{x}_i) = \sum_{i=1}^{n}(y_i - \hat{y}_i)(-x_{ij}).$$
So

$$\Delta q_j = \eta \sum_{i=1}^{n} (y_i - \hat{y}_i)\, x_{ij}.$$
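A Python sketch of gradient descent with this update rule; the data, the learning rate η and the iteration count are made up, and the first column of X is fixed to 1 so that q0 acts as the intercept:

```python
import numpy as np

# Illustrative data (same points as in the analytic example above).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.1, 2.9, 5.2, 6.8])

q = np.zeros(X.shape[1])   # initial parameters [q0, q1]
eta = 0.05                 # learning-rate parameter (the "step")

for _ in range(2000):
    y_hat = X @ q                  # f(x) = q . x
    q += eta * (y - y_hat) @ X     # delta q_j = eta * sum_i (y_i - y_hat_i) * x_ij

print(q)  # converges towards the least-squares solution [~1.09, ~1.94]
```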
Problem: the method may converge only to a local minimum.
Tuning hyperparameters:
Incorporating search into learning as approximation, i.e. searching for the optimal class of models:
- specifying the number of clusters for k-means clustering
- specifying the type of function for regression analysis
- specifying the topology of a neural network
Approaches: heuristic methods, e.g. genetic algorithms
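As a minimal illustration of the k-means case, a blind search over the number of clusters scored by the silhouette coefficient; the data and the search range are made up, and a heuristic method such as a genetic algorithm would replace the exhaustive loop:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Blind search over the hyperparameter k (number of clusters).
best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k)  # 2 for this data
```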