Astroinformatics - dame.dsf.unina.it/ASTROINFOEDU/brescia-MLsupervised-partI.pdf

Astroinformatics

What’s Machine Learning

M. Brescia - Astroinformatics

"Field of study that gives computers the ability to learn without being explicitly programmed."
Arthur Samuel (1959)

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Tom Mitchell (1998)

Machine Learning is a scientific discipline concerned with the design and development of algorithms that allow computers to learn complex patterns and to make intelligent decisions based on data-driven resources (sensors, databases).

ML origins: from Aristotle to Darwin

The Greek philosopher Aristotle was one of the first to attempt to codify "right thinking", that is, irrefutable reasoning processes. His syllogisms provided patterns for argument structures that always yielded correct conclusions when given correct premises. For example, "Socrates is a man; all men are mortal; therefore, Socrates is mortal." These laws of thought were supposed to govern the operation of the mind; their study initiated the field called logic.

By 1965, programs existed that could, in principle, solve any solvable problem described in logical notation. The so-called logicist tradition within artificial intelligence hopes to build on such programs to create intelligent systems, and ML theory represents their demonstration discipline. This direction was reinforced by integrating the ML paradigm with statistical principles following Darwin's law of natural evolution.

ML supervised paradigm

In supervised ML we have a set of data points or observations for which we know the desired output, expressed in terms of categorical classes, numerical or logical variables, or as a generic observed description of any real problem. The desired output provides some level of supervision, in that it is used by the learning model to adjust parameters or make decisions, allowing it to predict the correct output for new data.

When the algorithm is able to correctly predict the class of observations, we call it a classifier. Some classifiers can also provide results in a probabilistic sense, i.e. the probability of a data point belonging to a class. We usually refer to such model behavior as probabilistic classification.


ML supervised process (1/2)

Pre-processing of data: build input patterns appropriate for feeding into our supervised learning algorithm. This includes scaling and preparation of data;

Create data sets for training and evaluation: randomly split the universe of data patterns. The training set is made of the data used by the classifier to learn their internal feature correlations, whereas the evaluation set is used to validate the already trained model in order to get an error rate (or other validation measures) that can help to identify the performance and accuracy of the classifier. Typically you will use more training data than validation data;

Training of the model: we execute the model on the training data set. The output result consists of a model that (in the successful case) has learned how to predict the outcome when new unknown data are submitted;
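The random split into training and evaluation sets can be sketched as follows. This is only an illustration: the function name `split_patterns`, the 80/20 proportion and the fixed seed are assumptions, not values prescribed by these slides.

```python
import random


def split_patterns(patterns, train_fraction=0.8, seed=0):
    """Randomly split a list of patterns into training and validation sets.

    train_fraction and seed are illustrative defaults (assumptions).
    """
    shuffled = list(patterns)               # copy, the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # reproducible random order
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


patterns = list(range(100))                 # stand-in for 100 data patterns
train_set, validation_set = split_patterns(patterns)
print(len(train_set), len(validation_set))
```

Every pattern ends up in exactly one of the two sets, and more data go to training than to validation.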


ML supervised process (2/2)

Validation: after we have created the model, we must of course test its performance in terms of accuracy, completeness and contamination (or its dual, the purity). It is particularly crucial to do this on data that the model has not seen yet. This is the main reason why in the previous steps we separated the data set into training patterns and a subset of the data not used for training.

Verify model: verify and measure the generalization capabilities of the model. It is very easy to learn every single combination of input vectors and their mappings to the output as observed on the training data, and we can achieve a very low error in doing that, but how do the very same rules or mappings perform on new data that may have different input to output mappings? If the classification error of the validation set is higher than the training error, then we have to go back and adjust the model parameters.

Use: if validation was successful, the model has correctly learned the underlying real problem. We can then proceed to use the model to classify/predict new data.


Glossary

§ Data can be tables, images, streaming vectors. They may be represented in the form of numbers, percentages, pixel values, literals, strings, probabilities, or any other entity giving information on a physical/conceptual/simulated event or phenomenon of our world.

§ Dataset is a set of samples representing a problem. All samples must be expressed in a uniform way (i.e. same dimensions and representation).

§ Pattern is a sequence of symbols/values identifying a single sample of any dataset.

§ Feature is an atomic element of a pattern, i.e. a number or symbol representing one characteristic of the pattern (carrier of hidden information).

§ Target (supervised dataset) is usually a label (number/symbol) or a set of labels representing the solution (desired/known output) of a single pattern. If unknown or missing, the pattern belongs to the unsupervised category of datasets.

§ Base of Knowledge (BoK) or KB (Knowledge Base) is the ensemble of datasets in which the patterns contain the target (known solutions to a real problem). It is always available for supervised ML.


Examples of BoK - Astronomy


UMAG,GMAG,RMAG,IMAG,ZMAG,nuv,fuv,YMAG,JMAG,HMAG,KMAG,w1,w2,w3,w4,zspec
20.38,20.46,20.32,20.09,20.04,0.65,3.21,19.28,18.963,19.286,17.505,16.828,15.238,12.238,8.579,1.824
19.465,19.368,19.193,19.015,0.219,1.397,18.29,17.76,16.97,15.77,14.26,13.2,10.76,8.158,0.459,1.934
17.995,17.934,17.873,1.865,0.132,16.863,16.597,15.902,14.75,13.33,12.28,9.5,7.37,0.478,20.49,2.247
20.13,20.36,1.43,4.22,19.906,19.409,18.427,17.935,17.076,15.589,12.619,8.863,1.4365,8.15,0.45,1.93

Dataset: a set of galaxies observed by a space telescope. Each pattern has 15 features (magnitudes, i.e. fluxes measured in different wavelength bands) plus one target (the spectroscopic redshift of each galaxy, zspec).

Usually such a BoK is made of hundreds of thousands of galaxies (patterns). The ML problem is to learn to predict the redshift of new objects observed in further space missions.


Examples of BoK - Astronomy

Dataset: large multi-band image of a nebula, with millions of patterns of galaxies and stars (their spectra). Features are peaks in the object spectrum and the target is the type of object. ML problem: learn to classify objects (such as star/galaxy separation).

[Figure: examples labeled "star" and "galaxy".]


Examples of BoK – wine classification

Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

14) Target class:
    1) Aglianico
    2) Falanghina
    3) Lacryma christi

14.23 1.71 2.43 15.6 127 2.8 3.06 .28 2.29 5.64 1.04 3.92 1065 1
13.2 1.78 2.14 11.2 100 2.65 2.76 .26 1.28 4.38 1.05 3.4 1050 1
13.16 2.36 2.67 18.6 101 2.8 3.24 .3 2.81 5.68 1.03 3.17 1185 2
14.37 1.95 2.5 16.8 113 3.85 3.49 .24 2.18 7.8 .86 3.45 1480 3
13.24 2.59 2.87 21 118 2.8 2.69 .39 1.82 4.32 1.04 2.93 735 1
14.2 1.76 2.45 15.2 112 3.27 3.39 .34 1.97 6.75 1.05 2.85 1450 3
14.39 1.87 2.45 14.6 96 2.5 2.52 .3 1.98 5.25 1.02 3.58 1290 2
14.06 2.15 2.61 17.6 121 2.6 2.51 .31 1.25 5.05 1.06 3.58 1295 2
…


ML Functionalities

In the DM scenario, the choice of the ML model should always be accompanied by the functionality domain, i.e. the functional context in which the exploration of data is performed. More precisely, several ML models can be used in the same functionality domain.

Examples of such domains are:

• Dimensional reduction;
• Classification;
• Regression;
• Clustering;
• Segmentation;
• Forecasting;
• Data Model Filtering;
• Statistical data analysis;


The core of Machine Learning

Whatever the functionality or the model of interest in the machine learning context, the key point is always the concept of LEARNING.

In practice, keeping in mind the functional taxonomy described in the previous slide, there are essentially four kinds of learning related to ML for DM:

• Learning by association;
• Learning by classification;
• Learning by grouping (clustering);
• Learning by prediction (regression);


Learning by association

Learning by association consists of the identification of any structure hidden in the data. It does not mean identifying which specific classes the patterns belong to, but predicting the values of any feature attribute by simply recalling it, i.e. by associating it with a particular state or sample of the real problem.

It is evident that in the case of association we are dealing with very generic problems, i.e. those requiring less precision than in the classification case. In fact, the complexity grows with the range of possible multiple values for the feature attributes, potentially causing a mismatch in the association results.

In practical terms, fixed percentage thresholds are given in order to reduce the occurrence of mismatches for different association rules, based on the experience with that problem and the related data. The representation of data for associative learning is thus based on the labeling of features with non-numerical values or on alphanumeric coding.


Learning by classification

Classification learning is often simply called "supervised" learning, because the process of learning the right assignment of a label to a datum, representing its category or "class", is usually done by examples. Learning by examples stands for a training scheme operating under the supervision of an oracle, able to provide the correct, already known, outcome for each of the training samples. This outcome is precisely a class or category of the examples. Its representation depends on the available Base of Knowledge (BoK) and on its intrinsic nature, but in most cases it is based on a series of numerical attributes, related to the extracted BoK, organized and submitted in a homogeneous way.

The success of classification learning is usually evaluated by trying out the acquired feature description on an independent set of data, having known output but never submitted to the model before.


Learning by clustering

Whenever there is no class attribution, clustering learning is used to group data that show naturally similar features. Of course, the challenge of a clustering experiment is to find these clusters and assign the input data to them.

The data could be given in the form of categorical/numerical tables, and the success of a clustering process could be evaluated in terms of human experience on the problem or a posteriori by means of a second step of the experiment, in which a classification learning process is applied in order to learn an intelligent mechanism for how new data samples should be clustered.

The clustering can be performed top-down (from the largest clusters down to single items) or bottom-up (from single items up to larger clusters). Both types may be represented by dendrograms.


Learning by prediction

Prediction learning is slightly different from the classification scheme. In this case the outcome is a numerical value instead of a class label (this is often called REGRESSION). The numeric prediction is related to a quantitative result, because the predicted value is much more interesting than the structure of the concept behind the numerical outcome.

Predict the value of a given continuous-valued variable based on the values of other variables, assuming any linear or nonlinear dependency among them.

Examples:
• Predicting sales amounts of a new product based on expenditure.
• Predicting wind speed as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.

Regression types

In Machine Learning, regression is a supervised search for an association from a domain $\mathbb{R}^n$ to a domain $\mathbb{R}^m$, with n > m.

Basically there are two types: curve fitting and function approximation. The first tries to validate the hypothesis that the data distribution follows a given function; the second tries to find a function correlating the data without any a priori assumption on the functional shape of the data distribution.

• Curve fitting: given a pair of vectors (x, y) and the guess, i.e. the associated functional shape, the system predicts the best parameters identifying the guess (association);
• Function approximation: given a pair of vectors (x, y), the system predicts the best model to identify the data correlation (e.g. a black box able to approximate any unknown analytic expression);


Regression is the attempt to explain the variation in a dependent variable using the variation in independent variables.

Regression is thus an explanation of causation.

If the independent variable(s) sufficiently explain the variation in the dependent variable, the model can be used for prediction.

Linear regression

• Given a dataset of the form $(x_1, y_1), \dots, (x_n, y_n)$, find a linear function that, given the vector $x_i$, predicts the value $y_i$ as $y_i' = w^T x_i$
– Find a vector of weights $w$ that minimizes the sum of squared errors $\sum_i (y_i' - y_i)^2$
– Several techniques exist for solving the problem.


[Figure: straight-line fit of the dependent variable (y) versus the independent variable (x), $y' = b_0 + b_1 x \pm \epsilon$, where $b_0$ is the y intercept, $b_1 = \Delta y / \Delta x$ is the slope and $\epsilon$ is the error.]

The output of a regression is a function that predicts the dependent variable based upon the values of the independent variables.

Simple regression fits a straight line to the data.

Regression line

[Figure: straight lines in the plane of the dependent variable y versus the independent variable x, one positive (+) and one negative (-).]

A line is positive if increasing x values correspond to increasing y values.

A line is negative if increasing x values correspond to decreasing y values.


Regression line

[Figure: scatter plot of the observations, dependent variable y versus independent variable x.]


Regression line

[Figure: regression line (R line) drawn through the observations, dependent variable y versus independent variable x.]

A line that fits all the different points (observations) is called the Regression line.

Generally speaking, given a set of observations about any phenomenon, the idea is to predict the expected relationship between the variables of the problem, that is, to find the best fit for the estimation. In these cases we want to evaluate the distances between the estimated and the actual observations, with the aim of minimizing those distances (or estimation errors).


Regression line

[Figure: regression line (R line) through the observations, dependent variable y versus independent variable x, showing for one observation the estimated value, the actual value and the error between them, with positive and negative distances marked.]

Note that the distances are sometimes positive, sometimes negative. If we simply sum them, positive and negative distances cancel and the sum can be zero, so it would not bring any useful information about the prediction performance.

In fact, there could exist infinitely many estimation lines for which the sum of distances is 0. The most useful way is to take the squared distances instead, because the negative ones change sign and all of them contribute information:

$e^2 \geq 0 \;\rightarrow\;$ find the $\min \sum e^2$
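The cancellation argument can be checked numerically. The sketch below uses five illustrative observations and two arbitrarily chosen candidate lines (both are assumptions made for the example): the plain sum of errors is zero for both lines, while the sum of squared errors tells them apart.

```python
# Illustrative observations (x, y); chosen only for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def residuals(b0, b1):
    """Errors (actual minus estimated) for the line y_hat = b0 + b1 * x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Two different candidate lines, both passing through the point (3, 4).
flat = residuals(4.0, 0.0)    # horizontal line y_hat = 4
steep = residuals(1.0, 1.0)   # line y_hat = 1 + x

sum_flat, sum_steep = sum(flat), sum(steep)
sq_flat = sum(e * e for e in flat)
sq_steep = sum(e * e for e in steep)

print(sum_flat, sum_steep)    # both sums of errors cancel to zero
print(sq_flat, sq_steep)      # the sums of squared errors differ
```

Only the squared errors can rank the candidate lines, which is why the least squares criterion minimizes them.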


Regression line

A line that best fits all the different points (observations) is called the Regression line.

A Regression line is easily obtained by the least squares method, which tries to minimize the differences (errors) between the estimated values and the actual values.

Therefore we can define the regression line as the unique line which minimizes the sum of squared errors.


Regression line

[Figure: regression line (R line) through the observations, with the estimated value, the actual value and the error marked.]

$\hat{y} = b_0 + b_1 x$

$b_0$: y intercept

$b_1$: slope


Regression line


If the slope $b_1$ is positive then the R line is positive as well. It is negative when $b_1$ is negative.


Least squares method

Now we want to calculate regression lines using the least squares method.

Let's consider some observations.

[Figure: observations plotted with x from 1 to 5 and y from 1 to 6.]

We can draw the means of the two variables, mean(x) and mean(y).

Let's trace the regression line:

$\hat{y} = b_0 + b_1 x$

It turns out that the least squares R line has to go through the intersection between the two means, i.e. the point (mean(x), mean(y)).



We calculate the distances between the actual values and their means.

$x$      $y$      $x - \bar{x}$    $y - \bar{y}$
1        2        -2               -2
2        4        -1                0
3        5         0                1
4        4         1                0
5        5         2                1
mean: $\bar{x} = 3$, $\bar{y} = 4$



Take also the squared distances for x and the product between the two distances.

$x$      $y$      $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})^2$    $(x - \bar{x})(y - \bar{y})$
1        2        -2               -2               4                    4
2        4        -1                0               1                    0
3        5         0                1               0                    0
4        4         1                0               1                    0
5        5         2                1               4                    2
mean: $\bar{x} = 3$, $\bar{y} = 4$                  sum = 10             sum = 6

$\hat{y} = b_0 + b_1 x$



The slope of the R line is the ratio between the sums of the last two columns, the sum of the products over the sum of the squared x distances:

$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{6}{10} = 0.6$



To calculate the intercept term $b_0$, we exploit the fact that we know at least one value assumed by $\hat{y}$: $\hat{y} = 4$ at $x = 3$, because the R line must cross the point (x=3, y=4).

$4 = \hat{y} = b_0 + 0.6 \cdot 3 \;\rightarrow\; b_0 = 2.2$

$\hat{y} = b_0 + b_1 x = 2.2 + 0.6x$
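The worked example can be reproduced in a few lines. This is a minimal sketch of the same steps (means, distances from the means, ratio of sums, intercept from the point of means); the function name `least_squares_line` is introduced here for illustration only.

```python
# The five observations of the worked example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def least_squares_line(xs, ys):
    """Return (b0, b1) of the simple regression line y_hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of the distances from the means (last table column).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared x distances from the mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                  # slope
    b0 = mean_y - b1 * mean_x       # the line passes through (mean_x, mean_y)
    return b0, b1


b0, b1 = least_squares_line(xs, ys)
print(f"y_hat = {b0:.1f} + {b1:.1f} x")
```

With these data the slope is 6/10 and the intercept follows from the point of means (3, 4).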


Regression line vs mean line

The idea is to calculate the distances of the actual values from the mean and the distances of the estimated values from the mean. Then we compare them.

$y - \bar{y}$: distance of the observed value from the mean
$\hat{y} - \bar{y}$: distance of the estimated value from the mean
$y - \hat{y}$: distance between observed and estimated values (error)

$SST = \sum (y - \bar{y})^2$ (Sum of Squares Total)
$SSR = \sum (\hat{y} - \bar{y})^2$ (Sum of Squares due to Regression)
$SSE = \sum (y - \hat{y})^2$ (Sum of Squares due to Errors)

$SST = SSR + SSE$
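As a quick check, the three sums of squares can be computed for the five observations of the least squares example and its line $\hat{y} = 2.2 + 0.6x$. The variable names below are illustrative.

```python
# Observations and fitted line of the least squares example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

mean_y = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]                      # estimated values

sst = sum((y - mean_y) ** 2 for y in ys)               # total
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)          # due to regression
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # due to errors

print(sst, round(ssr, 6), round(sse, 6))
```

Up to floating point rounding, the decomposition SST = SSR + SSE holds for the least squares line (here 6 = 3.6 + 2.4).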


Classification via Regression

• Instead of predicting the class of a record, we want to predict the probability of the class, given the record

• The problem of predicting continuous values is called a regression problem

• General approach: find a continuous function that models the continuous points.


Classification via Regression

• Assume a linear classification boundary $w \cdot x = 0$, with $w \cdot x > 0$ on the positive side and $w \cdot x < 0$ on the negative side.

• For the positive class, the bigger the value of $w \cdot x$, the further the point is from the classification boundary and the higher our certainty of membership in the positive class. Define $P(C_+|x)$ as an increasing function of $w \cdot x$.

• For the negative class, the smaller the value of $w \cdot x$, the further the point is from the classification boundary and the higher our certainty of membership in the negative class. Define $P(C_-|x)$ as a decreasing function of $w \cdot x$.


Logistic Regression

The logistic function:

$f(t) = \frac{1}{1 + e^{-t}}$

$P(C_+|x) = \frac{1}{1 + e^{-w \cdot x}}$

$P(C_-|x) = \frac{e^{-w \cdot x}}{1 + e^{-w \cdot x}}$

$\log \frac{P(C_+|x)}{P(C_-|x)} = w \cdot x$

Logistic Regression: find the vector $w$ that maximizes the probability of the observed data.
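A small numerical sketch of the logistic model follows. The weight vector `w` and the record `x` are hypothetical values chosen for illustration; the code only evaluates the class probabilities for a given `w`, it does not fit `w` to data.

```python
import math


def logistic(t):
    """The logistic function f(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def class_probabilities(w, x):
    """Return (P(C+|x), P(C-|x)) for weights w and record x."""
    score = sum(wi * xi for wi, xi in zip(w, x))   # the dot product w . x
    p_pos = logistic(score)
    return p_pos, 1.0 - p_pos


w = [2.0, -1.0]          # hypothetical weights
x = [1.5, 0.5]           # hypothetical record, so w . x = 2.5
p_pos, p_neg = class_probabilities(w, x)
log_odds = math.log(p_pos / p_neg)
print(p_pos, p_neg, log_odds)
```

The two probabilities sum to one, and the logarithm of their ratio recovers $w \cdot x$, as stated by the log-odds relation.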


Logistic Regression

• Produces a probability estimate for the class membership, which is often very useful.

• The weights can be useful for understanding the feature importance.

• Works for relatively large datasets.

• Fast to apply.


Statistical estimation

The Mean is computed by adding all of the numbers in the data together and dividing by the number of elements contained in the data set.

The Median of a data set depends on whether the number of elements in the data set is odd or even. First reorder the data set from the smallest to the largest; if the number of elements is odd, the Median is the element in the middle of the data set. If the number of elements is even, the Median is the average of the two middle terms.

The Mode of a data set is the element that occurs most often. It is not uncommon for a data set to have more than one mode. This happens when two or more elements occur with equal frequency in the data set. A data set with two modes is called bimodal. A data set with three modes is called trimodal.

The Range of a data set is the difference between the largest value and the smallest value contained in the data set. First reorder the data set from smallest to largest, then subtract the first element from the last element.
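These four estimators are available in, or easily built from, the Python standard library. The sketch below applies them to a small illustrative data set (the values are an assumption made for the example); `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

data = [1, 1, 2, 2, 4, 6, 9]            # illustrative data set

mean = statistics.mean(data)            # sum of elements / number of elements
median = statistics.median(data)        # middle element of the sorted data
modes = statistics.multimode(data)      # all most frequent elements
data_range = max(data) - min(data)      # largest minus smallest value

print(mean, median, modes, data_range)
```

Here two elements occur with the same highest frequency, so the data set is bimodal.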



Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters, but that the number and nature of the parameters are flexible and not fixed in advance.

Examples:

• A histogram is a simple non-parametric estimate of a probability distribution.
• Skewness, kurtosis, MAD, NMAD, RMSE are also simple non-parametric estimators.
• Kernel density estimation provides better estimates of the density than histograms.
• Non-parametric regression methods have been developed based on kernels, splines, and wavelets.
• Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis without any distributional assumption.
• KNNs (K-Nearest Neighbors) classify the unseen instance based on the K points in the training set which are nearest to it.
• A Support Vector Machine (SVM with a Gaussian kernel) is a non-parametric large-margin classifier.


Regression indicators

By supposing that the target (desired) value of any regression problem is t, while the output of the learning model is o, the following statistical operators are useful for quality evaluation.

$\Delta z = t - o$, normalized: $\Delta z' = \frac{t - o}{1 + t}$

$bias = \frac{\sum_{i=1}^{N} |\Delta z_i|}{N}$

standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} \left( \Delta z_i - \frac{\sum_{i=1}^{N} \Delta z_i}{N} \right)^2}{N}}$

$MAD = Median(|\Delta z - Median(\Delta z)|)$

$NMAD = 1.4826 \times MAD$

$RMS = \sqrt{\frac{\sum_{i=1}^{N} (\Delta z_i)^2}{N}}$

$RMS = \sqrt{bias^2 + \sigma^2}$ (this relation is exact when the bias is taken as the signed mean $\frac{1}{N}\sum_{i=1}^{N} \Delta z_i$, the same mean that appears inside $\sigma$)

Although the RMS and the standard deviation are in principle different, they sometimes differ very little when the errors are sufficiently small, and a little more in some error bins.

The MAD is the median of the absolute deviations from the data's median. Example: the data set (1, 1, 2, 2, 4, 6, 9) has a median value of 2. The absolute deviations about 2 are (1, 1, 0, 0, 2, 4, 7), which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the MAD for this data is 1.
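The indicators can be sketched in plain Python as below. The target and output values are hypothetical numbers made up for the example, and the helper name `mad` is an assumption. The signed mean of the residuals is computed alongside the mean absolute residual, because the relation RMS² = mean² + σ² holds exactly for the signed mean.

```python
import math
import statistics


def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


targets = [0.50, 1.20, 0.80, 2.00, 1.50]   # hypothetical targets t
outputs = [0.45, 1.30, 0.80, 1.70, 1.55]   # hypothetical model outputs o

dz = [t - o for t, o in zip(targets, outputs)]
dz_norm = [(t - o) / (1 + t) for t, o in zip(targets, outputs)]

n = len(dz)
mean_dz = sum(dz) / n                                   # signed mean residual
mean_abs_dz = sum(abs(d) for d in dz) / n               # mean absolute residual
sigma = math.sqrt(sum((d - mean_dz) ** 2 for d in dz) / n)
rms = math.sqrt(sum(d * d for d in dz) / n)
nmad = 1.4826 * mad(dz)

print(mean_dz, mean_abs_dz, sigma, rms, nmad)
print(mad([1, 1, 2, 2, 4, 6, 9]))           # the MAD example of the slide
```

The last line reproduces the MAD example: the median of the absolute deviations of (1, 1, 2, 2, 4, 6, 9) from its median is 1.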


Regression indicators


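Mean and standard deviation derive from the first two moments of a distribution: the mean $\mu = E[X]$ is the first moment, and the variance $\sigma^2$ is the second central moment. Moments of higher order are also important for a complete analysis:

k-th central moment: $\mu_k = E[(X - \mu)^k]$

• $\mu = E[X]$ = first moment (mean)

• $\sigma^2$ = 2nd central moment = $E[(X - \mu)^2]$

• Skewness = $\frac{\text{3rd central moment}}{\sigma^3} = \frac{E[(X - \mu)^3]}{\sigma^3}$

• Kurtosis = $\frac{\text{4th central moment}}{\sigma^4} = \frac{E[(X - \mu)^4]}{\sigma^4}$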
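A minimal sketch of these moment-based estimators on a sample follows, using population formulas (division by N). The function names and the sample values are illustrative assumptions.

```python
def central_moment(values, k):
    """k-th central moment of a sample: mean of (v - mean)^k."""
    n = len(values)
    mu = sum(values) / n
    return sum((v - mu) ** k for v in values) / n


def skewness(values):
    """3rd central moment divided by sigma^3."""
    sigma = central_moment(values, 2) ** 0.5
    return central_moment(values, 3) / sigma ** 3


def kurtosis(values):
    """4th central moment divided by sigma^4."""
    sigma = central_moment(values, 2) ** 0.5
    return central_moment(values, 4) / sigma ** 4


symmetric = [1, 2, 3, 4, 5]        # illustrative symmetric sample
right_tail = [1, 1, 1, 2, 10]      # illustrative sample with a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tail))
```

A symmetric sample has zero skewness, while a long right tail gives a positive value.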




Key features of Mean, Standard Deviation, Median and Mode are:

• Centered
• Fixed score distribution
• Unimodal, symmetrical

Benefits:

• Best measure for symmetrical distributions
• Influenced by all data
• Most reliable
• Good for interval and ratio data

Limitations:

• Presence of outliers may dramatically affect the estimation


Outliers evaluation in regression

The MAD is a robust measure of statistical dispersion. Moreover, the MAD is less sensitive to outliers in a data set than the standard deviation. In the standard deviation, the distances from the mean are squared, so large deviations are weighted more heavily, and thus outliers can heavily influence it. In the MAD, the deviations of a small number of outliers are irrelevant.

Therefore, the MAD can give more reliable information on the dispersion of the data, without being influenced by the outlier distribution.

Besides this consideration, the percentage of outliers is very important, especially to identify anomalies within the data (serendipity):

• % |Δz| > 1𝜎
• % |Δz| > 2𝜎
• % |Δz| > 3𝜎
• % |Δz| > 4𝜎
• …
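The outlier percentages can be computed as in the sketch below. The residuals are hypothetical values with one deliberately large error, and the function name `outlier_fractions` is an assumption made for the example.

```python
import math


def outlier_fractions(dz, thresholds=(1, 2, 3, 4)):
    """Percentage of residuals with |dz| > k * sigma, for each k."""
    n = len(dz)
    mean = sum(dz) / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in dz) / n)
    return {
        k: 100.0 * sum(1 for d in dz if abs(d) > k * sigma) / n
        for k in thresholds
    }


# Hypothetical residuals: nine small errors and one large outlier.
dz = [0.01, -0.02, 0.03, -0.01, 0.00, 0.02, -0.03, 0.01, -0.01, 0.50]
fractions = outlier_fractions(dz)
print(fractions)
```

Note that the single large residual also inflates sigma itself, which is the sensitivity of the standard deviation to outliers described above.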

[Figure: outlier objects, with gravitational lens candidates and peculiar objects highlighted. Blue dots: blazars; green dots: unknown; red triangles: gravitationally lensed QSOs.]


KNN based Classifiers

• Basic idea:

– If it walks like a duck, quacks like a duck, then it's probably a duck

[Figure: compute the distance between the test record and the training records, then choose k of the "nearest" records.]



Requires three things

– The set of stored records

– Distance Metric to compute distance between records

– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:

– Compute distance to other training records

– Identify k nearest neighbors

– Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
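The three steps above can be sketched as follows. This is a minimal brute-force version with Euclidean distance and majority vote; the stored records, their labels and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two records of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_classify(records, labels, unknown, k):
    """Label the unknown record by majority vote among its k nearest records."""
    # 1) compute the distance to every stored record and sort by it
    by_distance = sorted(
        zip(records, labels), key=lambda rl: euclidean(rl[0], unknown)
    )
    # 2) keep the k nearest, 3) take the majority vote of their labels
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical stored records with two features each.
records = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]

predicted = knn_classify(records, labels, (1.1, 1.0), k=3)
print(predicted)
```

Sorting all distances costs more than necessary for large training sets; it is used here only to keep the sketch short.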

[Figure: the unknown record X with its (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors.]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.



1-NN based Classifiers

The Voronoi diagram defines the classification boundary.

[Figure: Voronoi diagram; the highlighted area takes the class of the green point.]


• Compute distance between two points:
– Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

• Determine the class from the nearest neighbor list:
– Take the majority vote of class labels among the k-nearest neighbors
– Weigh the vote according to distance
  • weight factor, $w = 1/d^2$
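Distance weighting can change the outcome of the vote. In the sketch below, the neighbor list is hypothetical and the function name `weighted_vote` is an assumption; each neighbor contributes 1/d² to the score of its class.

```python
from collections import defaultdict


def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for the k nearest records.

    Each neighbor votes with weight 1/d^2 (assumes every distance d > 0).
    """
    scores = defaultdict(float)
    for d, label in neighbors:
        scores[label] += 1.0 / (d * d)
    return max(scores, key=scores.get)


# Hypothetical 3 nearest neighbors: one very close star, two farther galaxies.
neighbors = [(0.5, "star"), (2.0, "galaxy"), (2.5, "galaxy")]
winner = weighted_vote(neighbors)
print(winner)
```

A plain majority vote over the same three neighbors would pick "galaxy"; with the 1/d² weights the single close neighbor dominates.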



• Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes



• Scaling issues

– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes

– Example:

• height of a person may vary from 1.5m to 1.8m

• weight of a person may vary from 50Kg to 130Kg

• income of a person may vary from 10K to 1M €
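One common remedy is to rescale each attribute to the same interval before computing distances. The sketch below uses min-max scaling to [0, 1]; this particular scaling method and the sample values are assumptions made for illustration, other choices (e.g. standardization) are equally possible.

```python
def min_max_scale(column):
    """Rescale a list of values linearly to the interval [0, 1].

    Assumes the column is not constant (max > min).
    """
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical attribute columns with very different ranges.
heights_m = [1.5, 1.65, 1.8]
incomes_eur = [10_000, 505_000, 1_000_000]

scaled_heights = min_max_scale(heights_m)
scaled_incomes = min_max_scale(incomes_eur)
print(scaled_heights, scaled_incomes)
```

After scaling, a difference across the full range of height counts as much in the distance as a difference across the full range of income.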



• Problem with Euclidean measure:

– High dimensional data

• curse of dimensionality

– Can produce counter-intuitive results

1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1     d = 1.4142

1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1     d = 1.4142

Solution: Normalize the vectors to unit length
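The effect of the normalization can be verified on the two pairs of vectors shown above. The helper names `euclidean` and `unit` are introduced here for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


# First pair: the vectors share ten of their eleven non-zero components.
a = [1] * 11 + [0]
b = [0] + [1] * 11
# Second pair: the vectors share no non-zero component.
c = [1] + [0] * 11
d = [0] * 11 + [1]

raw_ab, raw_cd = euclidean(a, b), euclidean(c, d)
norm_ab = euclidean(unit(a), unit(b))
norm_cd = euclidean(unit(c), unit(d))
print(round(raw_ab, 4), round(raw_cd, 4))
print(round(norm_ab, 4), round(norm_cd, 4))
```

Before normalization both pairs are equally distant; after it, the nearly identical pair becomes much closer than the pair with nothing in common.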



• k-NN classifiers are lazy learners

– They do not build models explicitly

– Unlike eager learners such as decision tree induction

• Classifying unknown records is relatively expensive

– K-NN complexity O(NxD) unlike DTs, having O(NxlogD)

– Need for structures to retrieve nearest neighbors fast.

• The Nearest Neighbor Search problem.



• Two-dimensional kd-trees
– A data structure for answering nearest neighbor queries in $\mathbb{R}^2$

• kd-tree construction algorithm
– Select the x or y dimension (alternating between the two)
– Partition the space into two with a line passing through the median point
– Repeat recursively in the two partitions as long as there are enough points or limited by the number d (features of parameter space)
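The construction algorithm can be sketched as a short recursive function. The dictionary-based node layout, the `leaf_size` stopping rule and the sample points are assumptions made for this illustration; only the construction is shown, not the nearest neighbor query.

```python
def build_kdtree(points, depth=0, leaf_size=1):
    """Build a 2-dimensional kd-tree from a list of (x, y) points.

    Splitting stops when a partition holds at most leaf_size points
    (an illustrative stopping rule).
    """
    if len(points) <= leaf_size:
        return {"points": points}                 # leaf node
    axis = depth % 2                              # alternate x (0) and y (1)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                       # index of the median point
    return {
        "axis": axis,
        "split": ordered[mid][axis],              # splitting line coordinate
        "left": build_kdtree(ordered[:mid], depth + 1, leaf_size),
        "right": build_kdtree(ordered[mid:], depth + 1, leaf_size),
    }


def count_points(node):
    """Total number of points stored in the leaves below a node."""
    if "points" in node:
        return len(node["points"])
    return count_points(node["left"]) + count_points(node["right"])


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]   # hypothetical data
tree = build_kdtree(points)
print(tree["axis"], tree["split"], count_points(tree))
```

The root splits on x at the median point, its children split on y, and so on, so every point ends up in exactly one leaf.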



2-dimensional kd-trees

[Figure: nearest neighbors search in a 2-dimensional kd-tree.]

region(v): all the black points in the subtree of v


A binary tree:

Size O(n)

Depth O(logn)

Construction time O(nlogn)

Query time: worst case O(n), but for many cases O(logn)


kNN features


• Intrinsically multiclass

• Robustness to outliers

• Works w/ "small" learning set

• Scalability (large learning set)

• Prediction accuracy

• Parameter tuning

• Accuracy
