Visualization and Machine Learning for exploratory data analysis
Xiaochun Li (1,2)
1 Division of Biostatistics, Indiana University School of Medicine
2 Regenstrief Institute
May 2, 2008 / CCBB Journal Club
Objective
In mining large-scale datasets, methods are needed to
- search for patterns, e.g., biologically important gene sets or samples
- present data structure succinctly
Both are essential in the analysis.
Visualization
An essential part of exploratory data analysis, and of reporting the results:
- plot data as is
- plot data after simple summarization
- plot data based on more advanced methods
Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
Mass Spec: pairs plot
[Figure: pairs plot of four spectra (spec 1-4); the lower-left panels show pairwise Pearson correlations between 0.59 and 0.99.]
The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower-left panels show the Pearson correlation coefficients of pairs of spectra.
Mass Spec: pairs plot
[Figure: pairs plot of four spectra (spec 1-4); the lower-left panels show pairwise Pearson correlations between 0.96 and 0.99.]
The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower-left panels show the Pearson correlation coefficients of pairs of spectra.
Mass Spec: MDS in 3-D
[Figure: 3-D scatter plot of the first, second and third MDS coordinates of the 39 spectra.]
Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
Silhouette plot: visualize clustering results
[Figure: cluster dendrogram from hclust(*, "complete") on d.s.nocut; leaves are labeled by group, with groups D and G intermixed.]
Dendrogram of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
[Figure: cluster dendrogram from hclust(*, "complete") on d.s.cut; leaves are labeled by group, with groups A, D and G now forming separate clusters.]
Dendrogram of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
[Figure: silhouette plot, whole spectrum. n = 39, 3 clusters C_j; average silhouette width 0.57.
j : n_j | ave_{i in C_j} s_i
1 : 17 | 0.67
2 : 16 | 0.48
3 :  6 | 0.56]
Silhouette plot of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.
Silhouette plot: visualize clustering results
[Figure: silhouette plot, m/z < 1000 removed. n = 39, 3 clusters C_j; average silhouette width 0.65.
j : n_j | ave_{i in C_j} s_i
1 : 13 | 0.82
2 : 13 | 0.60
3 : 13 | 0.53]
Silhouette plot of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.
Silhouette plot: silhouette width
For each observation i, the silhouette width s_i is defined as follows:
- a_i = average dissimilarity between i and all other points of the cluster to which i belongs
- for all other clusters C, put d_{i,C} = average dissimilarity of i to all observations of C
- b_i = min_C d_{i,C}, which can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest cluster to which it does not belong
- s_i = (b_i − a_i) / max(a_i, b_i)
A sketch of computing these widths in R follows below.
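A minimal sketch of the computation in R, using the silhouette() function from the cluster package; the simulated two-group data and k-means labels are illustrative assumptions, not the spectra from the slides:

library(cluster)

set.seed(1)
## two well-separated simulated groups of 25 points each
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2)

## silhouette() takes integer cluster labels and a dissimilarity matrix
sil <- silhouette(km$cluster, dist(x))
summary(sil)   # per-cluster and overall average silhouette widths
plot(sil)      # produces a silhouette plot like the ones above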
Visualization: R tools
- classical MDS: cmdscale
- 2-D and 3-D scatter plots: plot and the R package scatterplot3d
(See the sketch below.)
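A minimal sketch of these tools using R's built-in eurodist distance matrix as an illustrative stand-in for the spectra:

library(scatterplot3d)

## classical MDS: embed the distance matrix in 3 dimensions
mds <- cmdscale(eurodist, k = 3)

## 2-D scatter plot of the first two coordinates
plot(mds[, 1], mds[, 2],
     xlab = "first coordinate", ylab = "second coordinate")

## 3-D scatter plot of all three coordinates
scatterplot3d(mds[, 1], mds[, 2], mds[, 3],
              xlab = "first coordinate",
              ylab = "second coordinate",
              zlab = "third coordinate")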
Machine Learning
Machine learning: computational and statistical approaches to extract important patterns and trends hidden in large data sets.
- Supervised: predict an outcome y based on X, a number of inputs (variables). E.g., predict the class labels "tumor" or "normal" based on gene expression.
- Unsupervised: no y; describe the associations and patterns among X. E.g., which subset of genes has similar expression? Which subgroup of patients has similar gene expression profiles?
Random forests are a combination of tree predictors, each grown using an i.i.d. sequence of random vectors {θ_k}.
Example - bagging (bootstrap aggregation); see the sketch below:
- bootstrap samples are drawn from the training set, where θ_k is the vector of counts in n boxes resulting from sampling with replacement
- a tree is grown from each bootstrap sample
- the class is assigned by majority vote
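A minimal sketch of bagging in R, using classification trees from the rpart package on the built-in iris data; the data set and the tree learner are illustrative assumptions, not prescribed by the slides:

library(rpart)

set.seed(2)
n_trees <- 100
n <- nrow(iris)
votes <- matrix(NA_character_, nrow = n, ncol = n_trees)  # class votes per tree
inbag <- matrix(FALSE, nrow = n, ncol = n_trees)          # which cases each tree saw

for (k in seq_len(n_trees)) {
  idx <- sample(n, n, replace = TRUE)   # theta_k: one bootstrap draw
  inbag[unique(idx), k] <- TRUE
  tree <- rpart(Species ~ ., data = iris[idx, ])
  votes[, k] <- as.character(predict(tree, iris, type = "class"))
}

## assign class by majority vote over the ensemble
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)   # training accuracy of the bagged ensemble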
Improving prediction
- a single tree has poor accuracy for problems with many variables, each carrying very little information, e.g., genomics data sets
- combining trees grown using random features can improve accuracy
Assessing performance
- training error (the error rate on the training set) does not indicate performance on new data
- overfitting → small training error but poor generalization error
- data that were not used to grow a particular tree are needed to assess the performance of that tree
For a given case (X, Y) and a given ensemble of classifiers,
margin = (proportion of votes for the right class) − max over other classes (proportion of votes for that class)
generalization error: PE* = P_{X,Y}(margin < 0)
s ≡ strength = E_{X,Y}(margin)
ρ̄ ≡ correlation: the mean value of the correlation between any two trees in the forest
Thm 1.2: the generalization error converges.
Thm 2.3: the generalization error is bounded, PE* ≤ ρ̄(1 − s²)/s². (A numeric illustration follows below.)
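A minimal numeric illustration of the Thm 2.3 bound, with assumed (made-up) values for the mean correlation and strength:

## assumed values, purely for illustration
rho_bar <- 0.2   # mean correlation between trees
s       <- 0.6   # strength of the forest

rho_bar * (1 - s^2) / s^2   # upper bound on PE*: about 0.356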
Strategy: minimize correlation while keeping strength.
Use randomly selected inputs or combinations of inputs at each node to grow each tree (see the sketch below):
- Random input selection (Forest-RI): at each node, select F variables at random to split on; grow the tree to maximum size and do not prune.
- Random feature selection (Forest-RC): the same idea, but with F features - "linear combinations of L randomly selected variables" with random coefficients, e.g. runif(L, -1, 1) ⇒ further reduces correlation.
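A minimal sketch of Forest-RI using the randomForest package (Breiman and Cutler's reference implementation); mtry plays the role of F, and by default the trees are grown to maximum size and not pruned:

library(randomForest)

set.seed(3)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,   # number of trees in the forest
                   mtry  = 2)     # F = 2 variables tried at each split
print(rf)   # confusion matrix and out-of-bag error estimate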
Out-of-bag estimates of error, strength and correlation
Bagging makes it possible to estimate the generalization error without a test set. Why? In any bootstrap sample, about 1/3 of the cases in the original training set are left out due to sampling with replacement: (1 − 1/n)^n ≈ e^{−1} ≈ 1/3.
- For each (x, y), aggregate the votes over trees grown without (x, y) - the out-of-bag classifier.
- Out-of-bag estimate of generalization error = error rate of the out-of-bag classifier.
- The same idea gives out-of-bag estimates of strength and correlation.
(A sketch continuing the bagging example appears below.)
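A minimal sketch of the out-of-bag estimate, reusing the votes and inbag matrices from the bagging sketch earlier:

## for each case, aggregate votes only over trees that never saw it
oob_pred <- sapply(seq_len(n), function(i) {
  v <- votes[i, !inbag[i, ]]
  if (length(v) == 0) NA_character_ else names(which.max(table(v)))
})
mean(oob_pred != iris$Species, na.rm = TRUE)   # out-of-bag error estimate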
- Random forests do not overfit - an effective tool for prediction.
- Fast in computation.
- Out-of-bag estimates gauge the performance of the forest.
- Forests give results competitive with boosting and adaptive bagging, without progressively changing the training set. Their accuracy indicates that they reduce bias.
- Random inputs and random features produce good results in classification, but less so in regression.
Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression - an extension of LDA.
- many hyperplanes could classify the data
- we are interested in the one achieving maximum separation (margin) between the two classes
- mathematically, for (y_i, x_i), y_i = ±1, i = 1, ..., n:
  min (1/2)||w||²  s.t.  y_i(x_i'w − b) ≥ 1   (if separable)
  min (1/2)||w||² + λ Σ_{i=1}^{n} ξ_i  s.t.  ξ_i ≥ 0, y_i(x_i'w − b) ≥ 1 − ξ_i   (if not separable)
A sketch of fitting a linear SVM in R follows below.
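A minimal sketch of a linear SVM in R with the e1071 package (an interface to libsvm); the cost argument corresponds to the penalty on the slack variables ξ_i above, and the two-class subset of iris is an illustrative assumption:

library(e1071)

## a two-class problem: drop one of the three iris species
two <- iris[iris$Species != "setosa", ]
two$Species <- droplevels(two$Species)

fit <- svm(Species ~ ., data = two,
           kernel = "linear",
           cost   = 1)   # larger cost penalizes slack more heavily
table(predicted = predict(fit, two), truth = two$Species)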
Are we only interested in a predictive black box, or are we also interested in which features predict?
- when p ≫ n, it is easy to find classifiers that separate the data - are they meaningful?
- if features are suspected to be sparse, most features are irrelevant; automatic feature selection is needed, e.g., the LASSO or an SVM with an L1 penalty (see the sketch below)
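A minimal sketch of automatic feature selection with the LASSO via the glmnet package, on simulated p ≫ n data; all names and sizes here are illustrative assumptions:

library(glmnet)

set.seed(5)
n <- 100; p <- 500                     # p >> n
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(2, -2, 1.5, rep(0, p - 3))   # only 3 truly relevant features
y <- rbinom(n, 1, plogis(X %*% beta))

## alpha = 1 gives the LASSO (L1) penalty; lambda chosen by cross-validation
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
selected <- which(coef(cvfit, s = "lambda.min")[-1] != 0)  # drop the intercept
selected   # indices of the retained features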
Summary
- Visualization is an important aspect of EDA: "a picture is worth a thousand words."
- Supervised learning allows one to select features and to classify (predict).
- Unsupervised learning allows the study of associations among features, feature selection, and clustering.