Lecture 10: Introduction to Machine Learning
Course: Biomedical Informatics
Parisa Rashidi
Fall 2014

Reminder
Your project progress reports are due on Tuesday, 10/28.
~2 pages in length (excluding references), formatted using the IEEE template:
http://www.ieee.org/conferences_events/conferences/publishing/templates.html

Agenda
Today:
- Introduction to machine learning
- Different types of machine learning methods
- Walkthrough: a machine learning process
Later:
- More machine learning methods
- NLP

Software
RapidMiner:
http://sourceforge.net/projects/rapidminer/files/1.%20RapidMiner/5.3/

Artificial Intelligence
Artificial Intelligence (AI) has many subfields:
- Machine Learning (ML)
- Natural Language Processing (NLP)
- Vision

What is Learning?
Machine learning is programming computers to optimize a performance criterion using example data or past experience.

"You were not made to live like beasts, but to follow virtue and knowledge." (Dante Alighieri)

* Roberto Battiti and Mauro Brunato, The LION Way: Machine Learning plus Intelligent Optimization.

What We Talk About When We Talk About Learning
Learning general models from data of particular examples.
Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
- Example 1: adverse drug-drug interactions
- Example 2: customer behavior: "People who bought Blink also bought David and Goliath" (www.amazon.com)
Goal: build a model that is a good and useful approximation to the data.

Relation with Other Fields
ML draws on ideas from many fields:
- Statistics
- Control Theory
- Computer Science
- Optimization
- Neuroscience
- Economics
- Statistical Physics

To Understand ML
You need:
- Basic knowledge of computer science
- Linear algebra
- Calculus
- Probability and statistics
- Optimization

Example ML Algorithms
- Linear regression
- Decision trees, neural networks, support vector machines, ...

[Figure: a simple decision tree splitting on Total Energy (very low -> stand, very high -> run, low -> check Main Frequency: low -> sit, high -> walk), shown alongside an illustration of support vector machines.]

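As a quick illustration, here is a minimal scikit-learn sketch of learning such a decision tree. The tiny activity dataset and its feature values are invented for the example.

```python
# Minimal sketch: learn a small decision tree from (hypothetical)
# activity data with two features: [total_energy, main_frequency].
from sklearn.tree import DecisionTreeClassifier

X = [[0.05, 0.1],  # very low energy              -> stand
     [0.5, 0.2],   # low energy, low frequency    -> sit
     [0.6, 2.0],   # low energy, high frequency   -> walk
     [9.0, 3.0]]   # very high energy             -> run
y = ["stand", "sit", "walk", "run"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[8.5, 2.7]]))   # likely "run"
```
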
Generic Applications
Almost everywhere:
- Speech recognition, face recognition, search engines, bioinformatics, fraud detection
And it will be everywhere:
- Smart homes, smart vehicles, smart cities

Biomedical Applications
- Mobile health monitoring solutions
- Electronic Health Record (EHR) mining
- Genome-wide association studies (GWAS)
- Smart homes for the elderly
- Biomarker discovery

Challenges & Competitions
- Example: predict the likelihood that an HIV patient's infection will become less severe
- A great way to improve your skills (and maybe make some money!)
- Many other competitions at Kaggle: http://www.kaggle.com/competitions

Supervised vs. Unsupervised Learning

Supervised Machine Learning
Goal is prediction.
Example:
- Input: examples of benign and malignant tumors, defined in terms of tumor shape, radius, ...
- Output: predict whether a previously unseen example is benign or malignant

[Diagram: tumor examples -> machine learning algorithm -> model; new instance -> model -> "benign or malignant?"]

Supervised Learning Toy Example: Classification
Example: surgery risk — differentiating between low-risk and high-risk patients.

[Figure: patients plotted by cell size uniformity (x) and cell shape uniformity (y).]

Rule: IF x > a AND y > b THEN low-risk

Supervised Learning Toy Example: Regression
Example: child mortality.
- x: maternal education
- y: child mortality

Model: y = g(x | θ), where g(·) is the model and θ its parameters.
Linear case: y = wx + w0.

[Figure: child mortality plotted against maternal education, with a fitted line.]

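A minimal sketch of fitting the linear case y = wx + w0 by least squares; the (education, mortality) pairs below are made up for illustration.

```python
# Fit y = w*x + w0 by least squares on made-up data.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # maternal education (years)
y = np.array([90.0, 70.0, 55.0, 38.0, 20.0])  # child mortality (per 1000)

w, w0 = np.polyfit(x, y, deg=1)               # theta = (w, w0)
print(f"y = {w:.2f}x + {w0:.2f}")
```
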
Supervised Learning: Uses
- Prediction of future cases: use the rule to predict the output for future inputs
- Knowledge extraction: the rule is easy to understand
- Compression: the rule is simpler than the data it explains
- Outlier detection: exceptions that are not covered by the rule, e.g., fraud

Unsupervised Machine Learning
Also known as data mining.
Goal is knowledge discovery.
Example:
- Input: a DNA sequence as a long string over {A, C, G, T}
- Output: frequent subsequences (gene patterns)

[Diagram: DNA sequence "AACGTAACGGGACTCCAC" -> data mining algorithm -> model -> gene pattern, e.g., "AC"]

Unsupervised Learning Example: Learning Associations
It started with market basket analysis.
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.

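A small sketch of estimating such a conditional probability from raw baskets; the transactions below are invented for illustration.

```python
# Estimate P(Y | X) = (# baskets with X and Y) / (# baskets with X).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
X, Y = "bread", "milk"

n_x = sum(1 for b in baskets if X in b)
n_xy = sum(1 for b in baskets if X in b and Y in b)
print(f"P({Y} | {X}) = {n_xy / n_x:.2f}")   # 2 of 3 bread baskets -> 0.67
```
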
Unsupervised Learning
Learning what normally happens; no labels.
Example method — clustering: grouping similar instances.
Example applications:
- Image compression: color quantization
- Bioinformatics: learning motifs

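For example, a minimal k-means sketch with scikit-learn; the 2-D points are invented, and the algorithm groups them without any labels.

```python
# Group similar instances into 2 clusters; no labels are provided.
from sklearn.cluster import KMeans

X = [[1, 1], [1.5, 2], [1, 2], [8, 8], [8.5, 9], [9, 8]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g., [0 0 0 1 1 1]: one cluster per group
```
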
You Don't Always Need Machine Learning!
Machine learning definition (supervised): the ability to learn and to improve with experience instead of using pre-determined rules.
Consider the following two tasks:
- Recognizing handwritten digits
- Testing for prime numbers: Is m a prime number? Solution: test divisors up to sqrt(m) to see whether m can be factored into two values. No learning needed.

You Don't Always Need Machine Learning!
Unsupervised learning definition (rather unofficial): automatic analysis of data to extract previously unknown interesting patterns.
Consider the following two tasks:
- DNA sequence mining
- Regular expression matching: find all patterns matching the regular expression A*C. Solution: simple string matching (a finite state machine). No learning needed.

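For instance, the A*C task needs only a regular-expression engine; this sketch reuses the DNA string from the earlier slide.

```python
# Plain pattern matching, no learning: find substrings matching A*C.
import re

sequence = "AACGTAACGGGACTCCAC"
print(re.findall(r"A*C", sequence))   # ['AAC', 'AAC', 'AC', 'C', 'C', 'AC']
```
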
When Is Learning Needed?
There is no need to learn to calculate payroll.
Learning is used when:
- Human expertise does not exist (navigating on Mars)
- Humans are unable to explain their expertise (speech recognition)
- The solution changes over time (routing on a computer network)
- The solution needs to be adapted to particular cases (user biometrics)

Supervised vs. Unsupervised Learning
Supervised learning ("learn from my example"):
- Goal: a program that performs a task as well as humans.
- TASK: well defined (the target function)
- EXPERIENCE: training data provided by a human
- PERFORMANCE: metric error/accuracy on the task
Unsupervised learning ("see what you can find"):
- Goal: to find some kind of structure in the data.
- TASK: vaguely defined
- No EXPERIENCE: no labeled data
- No PERFORMANCE metric (but there are some evaluation metrics)

* Takis Metaxas, CS 315: Web Search and Data Mining

Terminology
A Simple Example: Tumor Classification
Benign: -1; Malignant: +1

| Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class Label |
|---|---|---|---|---|---|---|---|---|
| 2 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | -1 |
| 2 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | +1 |
| 3 | 2 | 1 | 1 | 1 | 2 | 5 | 4 | ? |

(In the original dataset the class label is encoded as benign = 2, malignant = 4; here it is mapped to -1/+1.)

Terminology: Feature
Features = the set of attributes associated with an example (aka independent variable in statistics).
Each column of the table above (except the class label) is a feature.

Terminology: Instance
Example = an instance of data = data point = x_i.
Each row of the table above is a data instance.

Terminology: Label
Label = class = the feature to be predicted = the category associated with an object.
Denoted by y_i (aka dependent variable in statistics).
The label is usually provided by an expert.
In the table above, the class label column is the label.

Data Representation
We usually represent data in a matrix: each row is an instance, each column is a feature, and the labels (-1, +1, ?) form a separate vector y.

From the data matrix X we can form:
- the covariance matrix, X^T X (feature x feature)
- the Gram matrix, X X^T (instance x instance)

Note: we can also assign a probability to each label (we'll discuss this later).

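A small NumPy sketch of this view, using the two labeled rows of the tumor table; calling X^T X the covariance matrix glosses over mean-centering, as the slide does.

```python
# Data matrix X (instances x features) and label vector y.
import numpy as np

X = np.array([[2, 5, 1, 1, 1, 2, 1, 3],
              [2, 5, 4, 4, 5, 7, 10, 3]])
y = np.array([-1, +1])

print((X.T @ X).shape)   # (8, 8): feature-by-feature matrix
print((X @ X.T).shape)   # (2, 2): Gram matrix, instance-by-instance
```
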
Summary of Key Terms
- Instance = example = data point
- Feature = independent variable
- Class label = dependent variable
- Decision boundary = separates examples in different classes

Algorithms
Availability of Labeled Data
- Supervised learning: all data is labeled
- Semi-supervised learning: a small amount of data is labeled
- Unsupervised learning: data is not labeled
- Transfer learning: labeled data is available in another domain
- Active learning: the algorithm has access to a human oracle to ask for the labels of a few data points

[Decision chart: "Do you have labeled data?" Yes -> supervised; a little -> semi-supervised; no -> unsupervised; in another domain -> transfer learning; by asking an oracle -> active learning.]

Task Type
What is your output type?
- Categorical -> classification task (classifier)
- Continuous -> regression task
- Ordered -> ranking task

Input Representation
The most common type: simple records in tables.
- Can be analyzed using regular machine learning techniques.
- Most other data types are converted to this type. (Not always: there are methods that directly process other data types.)

A simple record:

| ID | WGT | HGT | Cholesterol | Risk (Class) |
|---|---|---|---|---|
| 1 | high | short | 260 | high |
| 2 | high | med | 254 | high |
| 3 | high | tall | 142 | med |

Input Representation (cont.)
- Image, video: preprocessed using vision techniques.
- Text: preprocessed using NLP techniques.
- Continuous measures along time (time series): preprocessed using time series analysis.
- Graphs: preprocessed using graph theory tools.

More Details
Important Steps
1. Determine relevant features (expert knowledge).
2. Collect data (and label data).
3. Split labeled data into training and test datasets.
4. Use training data to train the machine learning algorithm.
5. Predict labels of examples in test data.
6. Evaluate the algorithm.

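A minimal end-to-end sketch of steps 3-6 with scikit-learn; its bundled breast-cancer dataset stands in for data we would collect and label ourselves.

```python
# Steps 3-6: split, train, predict, evaluate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Step 3: split labeled data into training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 4: train the learning algorithm on the training data.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Steps 5-6: predict test labels and evaluate.
print(accuracy_score(y_te, clf.predict(X_te)))
```
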
Features Are Important!
- Should be rich enough to capture the problem
- Should be simple enough to allow learning the model
- Too many features: makes learning more difficult
- Not enough features: impacts generalization power

Feature Extraction
- Typically results in a significant reduction in dimensionality
- Domain-specific

* Image taken from Jeff Howbert's slides

How to Split Data?
- Holdout: training set, (validation set), test set
- K-fold cross-validation, e.g., 10-fold cross-validation

Methods of Sampling
- Holdout: e.g., reserve 2/3 for training and 1/3 for testing
- Random subsampling
- Cross-validation: partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
- Stratified sampling
- Bootstrap: sampling with replacement

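A sketch of 10-fold cross-validation with scikit-learn; each instance is tested exactly once, by a model trained on the other nine folds.

```python
# 10-fold cross-validation: average accuracy over the folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())
```
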
Decision Boundary
We seek to find this boundary.

[Figure: labeled benign and malignant examples plotted by x1 (radius) and x2 (uniformity), with an outlier, the true decision boundary, and the learned decision boundary.]

Why Noise?
Noise might be due to different reasons:
- Imprecision in recording the input data
- Errors in labeling data
- We might not have considered additional features (latent, or hidden, features)
When there is noise, the decision boundary becomes more complex.

Overfitting
Data are well described by our model, but the predictions do not generalize to new data. Typical causes:
- A very rich hypothesis space
- A training set that is too small

Overfitting and Underfitting
Underfitting: the hypothesis is less complex than the actual function.
- Example: using a straight line to model data generated by a third-order polynomial.
Overfitting: the hypothesis is more complex than the actual function.
- Example: using a fifth-order polynomial to model data generated by a second-order polynomial.

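A small NumPy sketch of both failure modes: synthetic data generated by a second-order polynomial, fit with degrees 1 (underfits), 2 (matches), and 5 (overfits).

```python
# Training error by polynomial degree on quadratic data plus noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = 2 * x**2 + 0.1 * rng.standard_normal(x.size)

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, mse)   # degree 5 chases the noise: lowest training error
```
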
Bias-Variance
- Bias = assumptions, restrictions on the model
- Variance = variation of the prediction of the model
- Simple linear model => high bias (under-fitting)
- Complex model => high variance (over-fitting)

Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for model comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation
Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.

Confusion matrix:

|                  | PREDICTED Class=Yes | PREDICTED Class=No |
|------------------|---------------------|--------------------|
| ACTUAL Class=Yes | a                   | b                  |
| ACTUAL Class=No  | c                   | d                  |

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation
Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Cost Matrix

| C(i|j)           | PREDICTED Class=Yes | PREDICTED Class=No |
|------------------|---------------------|--------------------|
| ACTUAL Class=Yes | C(Yes|Yes)          | C(No|Yes)          |
| ACTUAL Class=No  | C(Yes|No)           | C(No|No)           |

C(i|j): cost of misclassifying a class j example as class i

Computing Cost of Classification

Cost matrix C(i|j):

| ACTUAL \ PREDICTED | + | - |
|---|---|---|
| + | -1 | 100 |
| - | 1 | 0 |

Model M1:

| ACTUAL \ PREDICTED | + | - |
|---|---|---|
| + | 150 | 40 |
| - | 60 | 250 |

Accuracy = 80%
Cost = 3910

Model M2:

| ACTUAL \ PREDICTED | + | - |
|---|---|---|
| + | 250 | 45 |
| - | 5 | 200 |

Accuracy = 90%
Cost = 4255

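These numbers can be checked directly; here is a short sketch reproducing the accuracy and cost of M1 and M2 from the tables above.

```python
# Keys are (actual, predicted); values come from the tables above.
cost = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
m1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
m2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}

for name, cm in (("M1", m1), ("M2", m2)):
    acc = (cm[("+", "+")] + cm[("-", "-")]) / sum(cm.values())
    total_cost = sum(n * cost[k] for k, n in cm.items())
    print(name, f"accuracy = {acc:.0%}, cost = {total_cost}")
# M1: accuracy = 80%, cost = 3910; M2: accuracy = 90%, cost = 4255
```
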
Limitation of Accuracy
Consider a 2-class problem:
- Number of Class 0 examples = 9990
- Number of Class 1 examples = 10
If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
Accuracy is misleading because the model does not detect any Class 1 example.

Other Measures
Precision (p) = a / (a + c) = true positives / all items predicted as positive
Recall (r) = a / (a + b) = true positives / all actual positive items
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

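As a worked example, computed on model M1's counts from the cost slide (a = 150 TP, b = 40 FN, c = 60 FP).

```python
# Precision, recall, and F-measure from confusion-matrix counts.
a, b, c = 150, 40, 60    # TP, FN, FP (model M1 above)

p = a / (a + c)          # precision
r = a / (a + b)          # recall
f = 2 * r * p / (r + p)  # F-measure = 2a / (2a + b + c)
print(f"p = {p:.2f}, r = {r:.2f}, F = {f:.2f}")
```
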
Triple Tradeoff
There is a tradeoff between three factors:
- Complexity of the hypothesis space, C
- Amount of training data, N
- Generalization error on new data, E
As N increases, E decreases. As C increases, E first decreases and then increases.

Learning Curve
A learning curve shows how accuracy (or error) changes with varying sample size.

More on Bias vs. Variance
Typical learning curve for high variance:
- Test error is still decreasing as m increases. Suggests a larger training set will help.
- Large gap between training and test error.

* Andrew Y. Ng, Advice for Applying Machine Learning, Stanford

More on Bias vs. Variance
Typical learning curve for high bias:
- Even the training error is unacceptably high.
- Small gap between training and test error.

* Andrew Y. Ng, Advice for Applying Machine Learning, Stanford

Diagnosis
Fixes to try:

| Solution                            | Fixes the problem of |
|-------------------------------------|----------------------|
| Try getting more training examples  | high variance        |
| Try a smaller set of features       | high variance        |
| Try a larger set of features        | high bias            |
| Try different features              | high bias            |

* Andrew Y. Ng, Advice for Applying Machine Learning, Stanford

Model Evaluation (revisited)
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for model comparison: how to compare the relative performance among competing models? We will look at this next time!

Putting It All Together
Differentiate between walking and jogging using an accelerometer (Kwapisz et al., SIGKDD Explorations, 2010):
1. Data preprocessing: sample d = (x, y, z) at 60 Hz; segment and label the stream.
2. Feature extraction: compute f_1, f_2, f_3, ... for each segment.
3. Feature selection: select some of the features.
4. Train: e.g., the simple decision tree model from earlier, splitting on total energy and main frequency to distinguish sit, stand, walk, and run.
5. Evaluate.

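A rough sketch of the feature-extraction step above, assuming raw (x, y, z) samples at 60 Hz; the window length and the two features (total energy, main frequency) follow the tree, but their exact definitions here are invented.

```python
# Turn one segment of accelerometer data into a feature vector.
import numpy as np

def extract_features(window):
    """window: (n_samples, 3) array of x, y, z readings."""
    magnitude = np.linalg.norm(window, axis=1)       # per-sample magnitude
    total_energy = np.sum(magnitude ** 2)
    spectrum = np.abs(np.fft.rfft(magnitude - magnitude.mean()))
    main_frequency_bin = int(np.argmax(spectrum))    # dominant frequency bin
    return [total_energy, main_frequency_bin]

window = np.random.rand(180, 3)   # 3 s at 60 Hz (synthetic)
print(extract_features(window))   # f_1, f_2 for one labeled segment
```
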
References
Slides partially based on:
- Lecture notes for E. Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, 2010 (V1.0).

Resources for You
Tools
- RapidMiner
- Weka
- R
- scikit-learn
- Matlab
More here: https://sites.google.com/site/parisar/links (you can also find some publicly available free e-books on machine learning)

Resources: Datasets
- UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
- UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html
- Statlib: http://lib.stat.cmu.edu/
- Delve: http://www.cs.utoronto.ca/~delve/

Resources: Journals
- IEEE Transactions on Knowledge and Data Engineering
- Journal of Machine Learning Research (www.jmlr.org)
- Machine Learning
- Neural Computation
- Neural Networks
- IEEE Transactions on Neural Networks
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Annals of Statistics
- Journal of the American Statistical Association
- ...

Resources: Conferences
- International Conference on Knowledge Discovery and Data Mining (KDD)
- International Conference on Machine Learning (ICML)
- European Conference on Machine Learning (ECML)
- Neural Information Processing Systems (NIPS)
- Uncertainty in Artificial Intelligence (UAI)
- Computational Learning Theory (COLT)
- International Conference on Artificial Neural Networks (ICANN)
- International Conference on AI & Statistics (AISTATS)
- International Conference on Pattern Recognition (ICPR)
- ...