Page 1: Part XIV

Computing Concepts for Bioinformatics

http://amadeus.biosci.arizona.edu/~nirav

Introduction to Machine Learning, Data Mining and Knowledge Discovery

Introduction to WEKA
Final Project
MySQL exercise

Page 2: Part XIV

Systems Biology: Confluence of omics

[Diagram: Systems Biology at the center, linking Genomics, Functional Genomics, Metabolomics, Proteomics, Pharmacogenomics, Modelling, Clinical, Pathways]

Page 3: Part XIV

The players:

[Diagram: Data Mining and Knowledge Discovery at the intersection of Statistics, Machine Learning, Databases, and Data Visualization]

Page 4: Part XIV

Useful Websites:

Obtaining WEKA http://www.cs.waikato.ac.nz/ml/weka/

Data Mining http://www.kdnuggets.com/dmcourse/index.html

Page 5: Part XIV

Statistics, Machine Learning and Data Mining

Statistics:
  more theory-based
  more focused on testing hypotheses

Machine learning:
  more heuristic
  focused on improving the performance of a learning agent
  also looks at real-time learning and robotics, areas not part of data mining

Data Mining and Knowledge Discovery:
  integrates theory and heuristics
  focuses on the entire process of knowledge discovery, including data cleaning, learning, and the integration and visualization of results

Distinctions are fuzzy

witten&eibe

Page 6: Part XIV

Knowledge Discovery Definition

Knowledge Discovery in Data is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996

Page 7: Part XIV

Many Names of Data Mining

Data Fishing, Data Dredging (1960-): used by statisticians

Data Mining (1990-): used by the database and business communities

Knowledge Discovery in Databases (1989-): used by the AI and machine learning community

AKA Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently: Data Mining and Knowledge Discovery are used interchangeably

Piatetsky-Shapiro

Page 8: Part XIV

Major Data Mining Tasks

Classification: predicting an item's class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
...

Piatetsky-Shapiro

Page 9: Part XIV

Finding patterns

Goal: programs that detect patterns and regularities in the data

Strong patterns make for good predictions

Problem 1: most patterns are not interesting
Problem 2: patterns may be inexact (or spurious)
Problem 3: data may be garbled or missing

Page 10: Part XIV

Machine learning techniques

Algorithms for acquiring structural descriptions from examples

Structural descriptions represent patterns explicitly:
  Can be used to predict the outcome in a new situation
  Can be used to understand and explain how the prediction is derived (may be even more important)

Methods originate from artificial intelligence, statistics, and research on databases

witten&eibe

Page 11: Part XIV

Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...

Given a set of points from known classes, what is the class of a new point?

Page 12: Part XIV

Classification: Linear Regression

Linear Regression: w0 + w1 x + w2 y >= 0

Regression computes the weights wi from the data so as to minimize the squared error, i.e. to 'fit' the data

Not flexible enough
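As a minimal sketch of how such a linear decision rule is applied (the weights below are made-up values for illustration, not fitted from any real data):

#!/usr/bin/perl
# Apply a linear decision boundary w0 + w1*x + w2*y >= 0 to a point.
# The weights are illustrative placeholders; in practice they would be
# fitted to the training data by minimizing the squared error.
use strict;
use warnings;

my ($w0, $w1, $w2) = (-6.0, 1.0, 0.5);    # hypothetical fitted weights

sub classify {
    my ($x, $y) = @_;
    return ($w0 + $w1 * $x + $w2 * $y >= 0) ? 'blue' : 'green';
}

print classify(5, 4), "\n";   # -6 + 5 + 2   =  1.0 -> blue
print classify(2, 3), "\n";   # -6 + 2 + 1.5 = -2.5 -> green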

Page 13: Part XIV

Classification: Decision Trees

[Scatter plot of the data over attributes X and Y, partitioned at X = 5, X = 2, and Y = 3]

if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue

Page 14: Part XIV

Classification: Neural Nets

Can select more complex regions

Can be more accurate

Also can overfit the data – find patterns in random noise

Page 15: Part XIV

The weather problem

Outlook    Temperature  Humidity  Windy  Play
sunny      85           85       false  no
sunny      80           90       true   no
overcast   83           86       false  yes
rainy      70           96       false  yes
rainy      68           80       false  yes
rainy      65           70       true   no
overcast   64           65       true   yes
sunny      72           95       false  no
sunny      69           70       false  yes
rainy      75           80       false  yes
sunny      75           70       true   yes
overcast   72           90       true   yes
overcast   81           75       false  yes
rainy      71           91       true   no

Given past data, can you come up with the rules for Play / Not Play?

What is the game?

Page 16: Part XIV

The weather problem

Conditions for playing golf:

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
...       ...          ...       ...    ...

If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes

witten&eibe

Page 17: Part XIV

Weather data with mixed attributes

Some attributes have numeric values:

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...       ...          ...       ...    ...

If outlook = sunny and humidity > 83 then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity < 85 then play = yes

If none of the above then play = yes

witten&eibe
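These rules translate directly into code. A minimal Perl sketch (not from the original slides), assuming a day is passed in as attribute => value pairs matching the table's columns:

#!/usr/bin/perl
# Apply the weather rule set above to one day's observations.
use strict;
use warnings;

sub play {
    my %day = @_;
    return 'no'  if $day{outlook} eq 'sunny' and $day{humidity} > 83;
    return 'no'  if $day{outlook} eq 'rainy' and $day{windy};
    return 'yes' if $day{outlook} eq 'overcast';
    return 'yes' if $day{humidity} < 85;
    return 'yes';    # "if none of the above then play = yes"
}

# First row of the table: sunny, 85, 85, false -> no
print play(outlook => 'sunny', temperature => 85,
           humidity => 85, windy => 0), "\n";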

Page 18: Part XIV

The contact lenses data

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None

witten&eibe

Page 19: Part XIV

A complete and correct rule set

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard

If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

witten&eibe

Page 20: Part XIV

A decision tree for this problem

witten&eibe

Page 21: Part XIV

Classifying iris flowers

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
...
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
...
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...

witten&eibe

Page 22: Part XIV

Predicting CPU performance

Example: 209 different computer configurations; a linear regression function

     Cycle time (ns)  Main memory (Kb)  Cache (Kb)  Channels       Performance
     MYCT             MMIN     MMAX     CACH        CHMIN  CHMAX   PRP
1    125              256      6000     256         16     128     198
2    29               8000     32000    32          8      32      269
...
208  480              512      8000     32          0      0       67
209  480              1000     4000     0           0      0       45

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
            + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

witten&eibe
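Encoding the fitted function directly, a small sketch (coefficients copied verbatim from the formula above; as with any regression, the predictions only approximate the actual PRP values):

#!/usr/bin/perl
# Predicted CPU performance from the linear regression function above.
use strict;
use warnings;

sub prp {
    my ($myct, $mmin, $mmax, $cach, $chmin, $chmax) = @_;
    return -55.9 + 0.0489 * $myct + 0.0153 * $mmin + 0.0056 * $mmax
                 + 0.6410 * $cach - 0.2700 * $chmin + 1.480 * $chmax;
}

# Machine 208 from the table (actual PRP: 67)
printf "predicted PRP = %.1f\n", prp(480, 512, 8000, 32, 0, 0);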

Page 23: Part XIV

Soybean classification

Attribute                  Number of values  Sample value

Environment
  Time of occurrence       7                 July
  Precipitation            3                 Above normal
  ...

Seed
  Condition                2                 Normal
  Mold growth              2                 Absent
  ...

Fruit
  Condition of fruit pods  4                 Normal
  Fruit spots              5                 ?

Leaves
  Condition                2                 Abnormal
  Leaf spot size           3                 ?
  ...

Stem
  Condition                2                 Abnormal
  Stem lodging             2                 Yes
  ...

Roots
  Condition                3                 Normal

Diagnosis                  19                Diaporthe stem canker

witten&eibe

Page 24: Part XIV

The role of domain knowledge

If leaf condition is normal
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then diagnosis is rhizoctonia root rot

If leaf malformation is absent
and stem condition is abnormal
and stem cankers is below soil line
and canker lesion color is brown
then diagnosis is rhizoctonia root rot

But in this domain, "leaf condition is normal" implies "leaf malformation is absent"!

witten&eibe

Page 25: Part XIV

Learning as search

Inductive learning: find a concept description that fits the data

Example: rule sets as description language; an enormous, but finite, search space

Simple solution:
  enumerate the concept space
  eliminate descriptions that do not fit the examples
  the surviving descriptions contain the target concept

witten&eibe

Page 26: Part XIV

Enumerating the concept space

Search space for the weather problem:
  4 x 4 x 3 x 3 x 2 = 288 possible combinations
  With 14 rules: 288^14, i.e. about 2.7 x 10^34 possible rule sets

Solution: candidate-elimination algorithm

Other practical problems:
  More than one description may survive
  No description may survive:
    the language is unable to describe the target concept
    or the data contains noise

witten&eibe

Page 27: Part XIV

The version space

Space of consistent concept descriptions

Completely determined by two sets:
  L: most specific descriptions that cover all positive examples and no negative ones
  G: most general descriptions that do not cover any negative examples and cover all positive ones

Only L and G need be maintained and updated

But: still computationally very expensive
And: does not solve the other practical problems

witten&eibe

Page 28: Part XIV

Machine Learning with WEKA

Page 29: Part XIV

WEKA: the bird

Copyright: Martin Kramer ([email protected])

Page 30: Part XIV

WEKA: the software

Machine learning/data mining software written in Java (distributed under the GNU General Public License)

Used for research, education, and applications

Complements “Data Mining” by Witten & Frank

Main features:
  Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
  Graphical user interfaces (incl. data visualization)
  Environment for comparing learning algorithms

Page 31: Part XIV

WEKA: versions

There are several versions of WEKA:
  WEKA 3.0: “book version”, compatible with the description in the data mining book
  WEKA 3.2: “GUI version”, adds graphical user interfaces (the old book version is command-line only)
  WEKA 3.4: “latest stable”, with lots of improvements

The next slides are based on the latest snapshot of WEKA 3.3

Page 32: Part XIV

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

WEKA only deals with “flat” files
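Since WEKA reads only such flat files, data from other sources must first be converted. Purely as an illustration (not from the slides), a Perl sketch that wraps a CSV file in a minimal ARFF header, assuming a header row, all-numeric attributes, and a yes/no class in the last column:

#!/usr/bin/perl
# csv2arff.pl: minimal sketch wrapping a CSV file in an ARFF header.
# Assumes a header row, all-numeric attributes, and a nominal yes/no
# class in the last column; real data would need per-attribute types.
use strict;
use warnings;

my ($in, $relation) = @ARGV;
open my $fh, '<', $in or die "$in: $!";

chomp(my $header = <$fh>);
my @names = split /,/, $header;
my $class = pop @names;

print "\@relation $relation\n\n";
print "\@attribute $_ numeric\n" for @names;
print "\@attribute $class { yes, no }\n\n";   # assumed class labels
print "\@data\n";
print while <$fh>;

Run as, e.g., perl csv2arff.pl mydata.csv mydata > mydata.arff.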

Page 33: Part XIV


Page 34: Part XIV
Page 35: Part XIV
Page 36: Part XIV

Explorer: pre-processing the data

Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary

Data can also be read from a URL or from an SQL database (using JDBC)

Pre-processing tools in WEKA are called "filters"

WEKA contains filters for:
  Discretization, normalization, resampling, attribute selection, transforming and combining attributes, ...

Page 37: Part XIV
Page 38: Part XIV
Page 39: Part XIV
Page 40: Part XIV
Page 41: Part XIV
Page 42: Part XIV
Page 43: Part XIV
Page 44: Part XIV

Explorer: building “classifiers”

Classifiers in WEKA are models for predicting nominal or numeric quantities

Implemented learning schemes include:
  Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes nets, ...

"Meta"-classifiers include:
  Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, ...

Page 45: Part XIV
Page 46: Part XIV
Page 47: Part XIV
Page 48: Part XIV
Page 49: Part XIV
Page 50: Part XIV
Page 51: Part XIV
Page 52: Part XIV
Page 53: Part XIV
Page 54: Part XIV
Page 55: Part XIV
Page 56: Part XIV
Page 57: Part XIV
Page 58: Part XIV
Page 59: Part XIV
Page 60: Part XIV
Page 61: Part XIV
Page 62: Part XIV
Page 63: Part XIV
Page 64: Part XIV
Page 65: Part XIV
Page 66: Part XIV
Page 67: Part XIV
Page 68: Part XIV
Page 69: Part XIV
Page 70: Part XIV
Page 71: Part XIV
Page 72: Part XIV
Page 73: Part XIV
Page 74: Part XIV
Page 75: Part XIV
Page 76: Part XIV
Page 77: Part XIV


Page 78: Part XIV


Page 79: Part XIV

Final Project Involves:
  Data Aggregation
  Data Visualization
  Typical laboratory environment

Will use:
  Perl
  MySQL
  GFF
  Public websites (Ensembl, GBrowse, etc.)

Page 80: Part XIV

Final Project

Two groups in different labs working on the same region of the genome

Team members gather specific information and perform specific tasks

Method to visualize all the information in a genome browser

Page 81: Part XIV

Final Project: Due dates

Description available on my site as a PDF

I will put all the data and hints up by midnight, Dec 5th

Due on Dec 15th, 4:00 PM

Page 82: Part XIV

MySQL exercise

Using haplo.csv from hw-3 (class13):

Create a MySQL table haplo_scores and load the data (from haplo.csv) into the table

Write an SQL statement to show the samples where the methods disagree (feel free to use a web-based tool)

Create a program mysql_sieve.pl to save the output into a file called disagree.txt (a sketch follows below)

Now modify the above script to show the samples that both methods agree on, and save the results into a file called agree.txt

How many rows are in each file?
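A minimal sketch of mysql_sieve.pl using the DBI module; the column names (sample, method1, method2) and the connection details are placeholders that would have to match the real haplo_scores table:

#!/usr/bin/perl
# mysql_sieve.pl: write the samples where the two methods disagree.
use strict;
use warnings;
use DBI;

# Placeholder database, user, and password; replace with your own.
my $dbh = DBI->connect('DBI:mysql:database=class;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# For agree.txt, change <> to = in the WHERE clause.
my $sth = $dbh->prepare(q{
    SELECT sample, method1, method2
    FROM   haplo_scores
    WHERE  method1 <> method2
});
$sth->execute;

open my $out, '>', 'disagree.txt' or die "disagree.txt: $!";
while (my @row = $sth->fetchrow_array) {
    print $out join("\t", @row), "\n";
}
close $out;
$dbh->disconnect;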

Page 83: Part XIV

Gratitude

Susan Miller
Gavin Nelson
Biochemistry, for providing access to this lab
IGERT program