Computing Concepts for Bioinformatics
http://amadeus.biosci.arizona.edu/~nirav
Introduction to Machine Learning, Data Mining and Knowledge Discovery
Introduction to WEKA
Final Project
MySQL exercise
Data Fishing, Data Dredging (1960-): used by statisticians
Data Mining (1990-): used by the database and business communities
Knowledge Discovery in Databases (1989-): used by the AI and machine learning community
Also known as Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently, Data Mining and Knowledge Discovery are used interchangeably
Piatetsky-Shapiro
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
Piatetsky-Shapiro
Finding patterns
Goal: programs that detect patterns and regularities in the data
Strong patterns lead to good predictions
Problem 1: most patterns are not interesting
Problem 2: patterns may be inexact (or spurious)
Problem 3: data may be garbled or missing
Machine learning techniques
Algorithms for acquiring structural descriptions from examples
Structural descriptions represent patterns explicitly
Can be used to predict the outcome in a new situation
Can be used to understand and explain how a prediction is derived (may be even more important)
Methods originate from artificial intelligence, statistics, and research on databases
witten&eibe
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...
Given a set of points from known classes, what is the class of a new point?
Classification: Linear Regression
Linear Regression: w0 + w1 x + w2 y >= 0
Regression computes the weights wi from the data to minimize the squared error, i.e. to ‘fit’ the data
Not flexible enough
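A minimal sketch of the idea on this slide, assuming the two classes are encoded as +1 and -1 and the weights are found by solving the least-squares normal equations. The toy points, the colour labels and the function names are illustrative assumptions, not part of the original slides.

```python
# Hypothetical toy data: (x, y) points with a class label.
points = [(3, 3), (4, 3), (3, 4), (4, 4), (0, 0), (1, 0), (0, 1), (1, 1)]
labels = ["blue"] * 4 + ["green"] * 4
targets = [1.0 if c == "blue" else -1.0 for c in labels]  # encode classes as +1 / -1


def fit_least_squares(points, targets):
    """Return [w0, w1, w2] minimizing the sum of (w0 + w1*x + w2*y - t)^2."""
    rows = [[1.0, x, y] for x, y in points]
    n = 3
    # Augmented matrix [A^T A | A^T t] of the normal equations.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def classify(w, x, y):
    """Apply the linear decision rule w0 + w1*x + w2*y >= 0."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


w = fit_least_squares(points, targets)
print("weights:", w)
print("training predictions:", [classify(w, x, y) for x, y in points])
```

For this symmetric toy set the fitted boundary works out to the line x + y = 4. A single straight line is all this model can express, which is the sense in which it is not flexible enough for classes that are not linearly separable.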
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: points in the X-Y plane, with splits marked at X = 2, X = 5 and Y = 3]
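The nested rules on this slide translate directly into code. This is a sketch only: the function name and the sample points are assumptions chosen for illustration.

```python
def classify(x, y):
    """Walk the decision tree from the slide, one test per level."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# Hypothetical query points, one per leaf of the tree.
for point in [(6, 1), (3, 4), (3, 1), (1, 1)]:
    print(point, "->", classify(*point))
```

Each test splits the plane with an axis-parallel line, so the green region here is the rectangle 2 < X <= 5, Y <= 3.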
Classification: Neural Nets
Can select more complex regions
Can be more accurate
Also can overfit the data – find patterns in random noise
The weather problem
Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no
Given past data, can you come up with the rules for Play / Not Play?
What is the game?
The weather problem
Conditions for playing golf
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
…         …            …         …      …
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
witten&eibe
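The rule set above can be read as a decision list: rules are tried top to bottom and the first one that matches decides, with "none of the above" as the default. The sketch below assumes that ordered reading. The dictionary keys, the lowercase values and the names RULES, DEFAULT and predict are illustrative choices, and only the four rows shown on the slide are used.

```python
# Each rule is (condition, outcome); conditions take one day as a dict.
RULES = [
    (lambda d: d["outlook"] == "sunny" and d["humidity"] == "high", "no"),
    (lambda d: d["outlook"] == "rainy" and d["windy"], "no"),
    (lambda d: d["outlook"] == "overcast", "yes"),
    (lambda d: d["humidity"] == "normal", "yes"),
]
DEFAULT = "yes"  # "If none of the above then play = yes"


def predict(day):
    """Return the outcome of the first rule whose condition holds."""
    for condition, outcome in RULES:
        if condition(day):
            return outcome
    return DEFAULT


# The four rows listed on the slide (temperature is unused by the rules).
days = [
    ({"outlook": "sunny", "humidity": "high", "windy": False}, "no"),
    ({"outlook": "sunny", "humidity": "high", "windy": True}, "no"),
    ({"outlook": "overcast", "humidity": "high", "windy": False}, "yes"),
    ({"outlook": "rainy", "humidity": "normal", "windy": False}, "yes"),
]
for day, play in days:
    print(day, "predicted:", predict(day), "actual:", play)
```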
Weather data with mixed attributes
Some attributes have numeric values
Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
witten&eibe
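As a quick check, the numeric rules above can be run against the 14 instances from the first weather table. The sketch assumes the rules are applied in the order listed. The function name play and the tuple layout are illustrative.

```python
# (outlook, temperature, humidity, windy, play) for the 14 weather instances.
WEATHER = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def play(outlook, humidity, windy):
    """Apply the five rules in order; temperature is not used by any rule."""
    if outlook == "sunny" and humidity > 83:
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity < 85:
        return "yes"
    return "yes"  # none of the above


correct = sum(
    play(outlook, humidity, windy) == actual
    for outlook, _temperature, humidity, windy, actual in WEATHER
)
print(f"{correct} of {len(WEATHER)} instances classified correctly")
```

Under that ordered reading the rules reproduce the Play column for all 14 instances. Agreement on the training data says nothing about how well the rules would do on new days.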
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
witten&eibe
A complete and correct rule set
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
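One way to read "complete and correct" is that every one of the 24 instances is matched by at least one rule (complete) and every rule that matches gives the listed recommendation (correct). The sketch below checks exactly that, treating the rules as an unordered set. The attribute keys and the names DATA, RULES and predictions are illustrative assumptions. The labels are listed in the same order as the contact lenses table.

```python
from itertools import product

FIELDS = ("age", "prescription", "astigmatic", "tear_rate")

# All 24 attribute combinations, in table order (tear rate varies fastest).
COMBINATIONS = product(
    ["young", "pre-presbyopic", "presbyopic"],
    ["myope", "hypermetrope"],
    ["no", "yes"],
    ["reduced", "normal"],
)
LABELS = (
    "none soft none hard none soft none hard "   # young
    "none soft none hard none soft none none "   # pre-presbyopic
    "none none none hard none soft none none"    # presbyopic
).split()
DATA = [(dict(zip(FIELDS, combo)), label)
        for combo, label in zip(COMBINATIONS, LABELS)]

# The nine rules: (conditions that must all hold, recommendation).
RULES = [
    ({"tear_rate": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def predictions(instance):
    """Return the set of recommendations made by every rule that matches."""
    return {outcome for conditions, outcome in RULES
            if all(instance[key] == value for key, value in conditions.items())}


# An instance fails if no rule matches it or any matching rule disagrees.
failures = [instance for instance, label in DATA
            if predictions(instance) != {label}]
print(len(DATA), "instances,", len(failures), "failures")
```

Several instances are matched by more than one rule, so the check compares the whole set of matching outcomes with the single expected label instead of stopping at the first match.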
WEKA: the software
Machine learning/data mining software written in Java (distributed under the GNU General Public License)
Used for research, education, and applications
Complements “Data Mining” by Witten & Frank
Main features:
Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
Graphical user interfaces (incl. data visualization)
Environment for comparing learning algorithms
WEKA: versions
There are several versions of WEKA:
WEKA 3.0: “book version” compatible with description in data mining book
WEKA 3.2: “GUI version” adds graphical user interfaces (old book version is command-line only)
WEKA 3.4: “Latest Stable” with lots of improvements
The next slides are based on the latest snapshot of WEKA 3.3
@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}