Introduction, Experiments, Future Work, References
Experience with Weka by Predictive Classification on Gene-Expression Data
Matěj Holec
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Intelligent Data Analysis lab
http://ida.felk.cvut.cz
March 17, 2011
Matej Holec Experience with Weka by Predictive Classification on Gene-Expression Data
Outline
1 Introduction
    Motivation
    Biological Background and Data
    Tools: Weka and R
2 Experiments
    Integrating Multiple-Platform Expression Data
    XGENE.ORG
    Comparative Evaluation of Set-Level Techniques
3 Future Work
4 References
    Software
    Bibliography
Motivation
Bridging the gap between systems biology and machine learning.
Biological databases:
    NCBI: National Center for Biotechnology Information
    EBI: European Bioinformatics Institute
    GenomeNet: Japanese network of databases and computational services for genome research
    The Gene Ontology (GO): vocabulary of terms for describing gene product characteristics and annotation data
    ...
Short Introduction to Biology
The genome of a human cell consists of ∼30,000 genes.
A cell is an integrated device of several thousand types of interacting proteins.
A cell responds to internal and external environmental signals by producing appropriate proteins.
[Figure: Central dogma of molecular biology]
Cellular Pathway and a Fully Coupled Flux Example
DNA Microarrays
Pitfalls of Microarray Technology
    Results are hard to interpret (the "gene list" syndrome).
    Curse of dimensionality: microarray (MA) data cover tens of thousands of genes in only tens of samples.
    Noise in microarray data.
    Experiments are still expensive.
Set-level Approach
    Use prior knowledge.
WEKA (Waikato Environment for Knowledge Analysis)
    Machine learning software written in Java
    Licensed under the GNU GPL
    Versions: book 3.4.18, stable 3.6.4, developer 3.7.3
    Supports data pre-processing, classification, regression, clustering, association rules, and visualization
Using Weka in Java Code
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.SMO;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    // import . . . ;

    // Input data
    DataSource source = new DataSource("iris.arff");
    Instances instances = source.getDataSet();
    // . . . (elided on the slide; train and test are defined here)

    // Create classifier with options
    SMO classifier = new SMO();

    // Train and evaluate the classifier
    classifier.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(classifier, test);

    // Print summary on the testing instances
    System.out.print(eval.toSummaryString());
Using Weka in R
    library(RWeka)
    file <- "dataset.arff"
    splitR <- 66
    instances <- read.arff(file)
    # shuffle instances
    instances <- instances[sample(nrow(instances)), ]
    # get training and testing data
    ntrain <- round(nrow(instances) * splitR / 100)
    ntest <- nrow(instances) - ntrain
    train <- instances[1:ntrain, ]
    test <- instances[(ntrain + 1):(ntest + ntrain), ]
    # train and evaluate the classifier
    cl <- SMO(Class ~ ., data = train, control = NULL)
    evaluate_Weka_classifier(cl, newdata = test)
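The shuffle-and-split step of the R snippet can also be written in plain Java, which is one possible way to obtain training and testing parts before calling Weka from Java. The sketch below is an illustration only: the class name PercentageSplit, the fixed seed, and the use of row indices instead of Weka Instances are assumptions, not part of the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: shuffle row indices and cut them into a training and a testing part. */
final class PercentageSplit {

    /** Number of training rows for n rows and a split given in percent (66 on the slide). */
    static int trainSize(int n, double splitPercent) {
        return (int) Math.round(n * splitPercent / 100.0);
    }

    /** Returns [trainIndices, testIndices] after a seeded shuffle of 0..n-1. */
    static List<List<Integer>> split(int n, double splitPercent, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            idx.add(i);
        }
        Collections.shuffle(idx, new Random(seed));
        int nTrain = trainSize(n, splitPercent);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(idx.subList(0, nTrain)));   // training rows
        parts.add(new ArrayList<>(idx.subList(nTrain, n)));   // testing rows
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(150, 66, 1L);
        System.out.println(parts.get(0).size() + " training, " + parts.get(1).size() + " testing");
    }
}
```

Fixing the seed makes the split reproducible, which the R version does not guarantee unless set.seed is called first.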
Integrating Multiple-Platform Expression Data through Gene Set Features
Goals:
    Integrate data from heterogeneous platforms using gene sets.
    Determine whether biologically defined gene sets are more informative than random gene sets.
Gene set features used for the integration process:
    1 Gene ontology terms
    2 Cellular pathways
    3 Fully coupled fluxes (strongly co-expressed genes)
Integrating Multiple-Platform Expression Data (contd)
    1 Preparation (quantile normalization)
    2 Gene set feature construction and data integration
    3 Analysis by learning curves (Weka Experimenter)
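The slides only name quantile normalization as the preparation step. As a rough illustration, the textbook version of the procedure can be sketched in Java as follows; the class name, the genes-by-samples array layout, and the simplified tie handling (tied values receive neighbouring ranks in sort order) are assumptions, and the tool actually used in the study is not stated.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of textbook quantile normalization; not the author's implementation. */
final class QuantileNormalization {

    /** data[g][s] is the expression of gene g in sample s. Returns a normalized copy. */
    static double[][] normalize(double[][] data) {
        int genes = data.length;
        int samples = data[0].length;
        Integer[][] order = new Integer[samples][];
        double[] rankMean = new double[genes];

        for (int s = 0; s < samples; s++) {
            final int col = s;
            Integer[] idx = new Integer[genes];
            for (int g = 0; g < genes; g++) {
                idx[g] = g;
            }
            // Sort the gene indices of this sample by expression value.
            Arrays.sort(idx, Comparator.comparingDouble(g -> data[g][col]));
            order[s] = idx;
            // Accumulate the mean value observed at each rank across samples.
            for (int r = 0; r < genes; r++) {
                rankMean[r] += data[idx[r]][s] / samples;
            }
        }

        // Replace every value by the mean of its rank, so all samples share one distribution.
        double[][] out = new double[genes][samples];
        for (int s = 0; s < samples; s++) {
            for (int r = 0; r < genes; r++) {
                out[order[s][r]][s] = rankMean[r];
            }
        }
        return out;
    }
}
```

After normalization every sample (column) contains the same set of values, only in a different gene order, which is what makes data from different platforms comparable on a common scale.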
(Q1) Classifiers based on single genes vs. classifiers based on biologically meaningful gene sets.
(Q2) Classifiers based on biologically meaningful gene sets vs. classifiers based on randomly constructed gene sets.
(Q3) Classifiers learned from single-platform data vs. classifiers learned from data integrated from heterogeneous platforms.
Combining multiple-platform data did not have a detrimental effect on classification performance.
XGENE.ORG: Main Page
XGENE.ORG: Results
XGENE.ORG: Short Description
    Web application for cross-genome, multiple-platform analysis of gene expression.
    Functionality is provided by an easy-to-extend plugin system (R, Weka, ...).
    Executes tasks in a grid environment (currently not working).
Comparative Evaluation of Set-Level Techniques in Predictive Classification of Gene Expression Samples
    Set-level analysis typically yields more compact and interpretable results.
    The set-level strategy can be adopted by ML algorithms.
Q1 Which state-of-the-art set-level analysis technique yields better classification?
Q2 How does classification accuracy depend on using functionally defined gene sets compared to random ones?
Q3 How accurate are classifiers based on set-level features compared to gene-based classifiers?
Experimental settings
Input data:
    Microarray experiment data (NCBI GEO)
    Functionally defined gene sets (KEGG, KO)
Data Flow
[Figure: data-flow diagram. Numbered boxes: 1. Prior gene sets; 2. Rank gene sets; 3. Select gene sets; 4. Aggregate; 5. Learn classifier; 6. Data set; 7. Testing fold. Unnumbered boxes: Training fold (Data Set \ Testing Fold); Test classifier.]
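The Aggregate step turns the expression values of a gene set's members into a single feature per sample. A minimal sketch of one such aggregation, the arithmetic mean of the member genes, is given below. The class and method names, the map-based data layout, and the choice to skip genes missing from a sample are assumptions made for illustration; the slides do not define the aggregation methods in detail.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one averaged feature per gene set for a single sample. */
final class GeneSetAggregation {

    /** Mean expression of the genes in one set; genes missing from the sample are skipped. */
    static double average(Map<String, Double> sample, List<String> geneSet) {
        double sum = 0.0;
        int n = 0;
        for (String gene : geneSet) {
            Double value = sample.get(gene);
            if (value != null) {
                sum += value;
                n++;
            }
        }
        // NaN marks a gene set with no measured member in this sample.
        return n == 0 ? Double.NaN : sum / n;
    }

    /** Builds the set-level feature vector, keeping the iteration order of geneSets. */
    static Map<String, Double> aggregate(Map<String, Double> sample,
                                         Map<String, List<String>> geneSets) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : geneSets.entrySet()) {
            features.put(e.getKey(), average(sample, e.getValue()));
        }
        return features;
    }
}
```

Applied to every sample, this replaces tens of thousands of gene-level attributes with one attribute per selected gene set, and the resulting table is what a classifier would be trained on in the Learn classifier step.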
Experiment Settings
ML Experiments in Weka – technical summary
    30 datasets
    6 Weka algorithms (SMO, J48, 1-NN, 3-NN, NB, ZeroR)
    Total number of ML experiments: 1,470,600
    Speed of Weka experiment execution:
    $\frac{30 \times 49020}{105 \times 60} \approx 233$ [experiments/sec]
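The throughput figure can be rechecked with a few lines of Java. The sketch assumes that 49020 is the number of experiments per dataset (30 × 49020 reproduces the stated total) and that 105 is a running time in minutes (suggested by the factor 60); neither reading is spelled out on the slide, and the class name is hypothetical.

```java
/** Hypothetical helper that rechecks the arithmetic of the technical summary. */
final class ExperimentThroughput {

    /** Total experiments, assuming the same number of experiments for every dataset. */
    static long totalExperiments(int datasets, int experimentsPerDataset) {
        return (long) datasets * experimentsPerDataset;
    }

    /** Experiments per second, assuming the running time is given in minutes. */
    static double perSecond(long experiments, double minutes) {
        return experiments / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long total = totalExperiments(30, 49020);
        System.out.printf("%d experiments, %.1f per second%n", total, perSecond(total, 105));
    }
}
```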
Analysis
Results were obtained by a (two-sided) Wilcoxon test (significance level 0.05, Bonferroni-Dunn adjustment).

Factor                      Better alternative   Worse alternative
1. Gene sets                genuine              random
2. Ranking algo             global, ig           sam-gs, gsea
3. Sets forming features    high ranking         low ranking
3. Sets forming features    1:10                 1
4. Aggregation∗             setsig, svd          avg

∗ Difference not significant if Factor 3 is 1:10.
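When several factors are tested at once, the significance threshold has to account for the number of comparisons. The sketch below shows the plain Bonferroni correction (each p-value multiplied by the number of tests) as a simplified stand-in; how the Bonferroni-Dunn adjustment was configured in the study is not stated, the Wilcoxon test itself is not implemented here, and the p-values in the usage note are hypothetical.

```java
/** Hypothetical sketch of a plain Bonferroni correction for a family of p-values. */
final class BonferroniAdjustment {

    /** Multiplies each p-value by the number of tests and caps the result at 1. */
    static double[] adjust(double[] pValues) {
        double[] adjusted = new double[pValues.length];
        for (int i = 0; i < pValues.length; i++) {
            adjusted[i] = Math.min(1.0, pValues[i] * pValues.length);
        }
        return adjusted;
    }

    /** A comparison is significant if its adjusted p-value is below alpha. */
    static boolean[] significant(double[] pValues, double alpha) {
        double[] adjusted = adjust(pValues);
        boolean[] result = new boolean[pValues.length];
        for (int i = 0; i < pValues.length; i++) {
            result[i] = adjusted[i] < alpha;
        }
        return result;
    }
}
```

For example, with three hypothetical raw p-values 0.001, 0.02 and 0.4 and alpha = 0.05, only the first comparison remains significant, because 0.02 × 3 already exceeds 0.05.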
Conclusion
    1 The study determined the suitability of various set-level methods.
    2 Classifiers based on aggregated gene-set features outperform baseline experiments.
    3 Gene-set based features allow easier interpretation and data compression.
    4 Dependencies among gene set members are still ignored.
Future Work
XGENE.ORG ver 0.2
    Support for semiautomatic workflows that allow complicated ML tasks to be defined.
    Full support for the grid environment.
    Easy-to-debug environment (based on Java).
Experimental analysis of pathway modes (elementary pathways).
Improve set-level techniques to take structural knowledge into account.
WEKA
    WEKA: http://www.cs.waikato.ac.nz/ml/weka/
    Documentation: http://weka.wikispaces.com/
    Using Weka in Java code: http://weka.wikispaces.com/Use+Weka+in+your+Java+code
    Related projects: http://www.cs.waikato.ac.nz/ml/weka/index_related.html
Bioconductor: http://www.bioconductor.org/
Rcpp (facilitates integration of R and C++):
    http://dirk.eddelbuettel.com/code/rcpp.html
    http://cran.r-project.org/web/.../Rcpp/
Set-level analysis
    A. Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005.
    Jelle Goeman and Peter Bühlmann. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics, 23(8):980–987, 2007.
    Minca Mramor et al. On utility of gene set signatures in gene expression-based cancer class prediction. Machine Learning in Systems Biology, 2010.
Biological databases
    The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, 25, 2000.
    Minoru Kanehisa et al. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research, 38:355–360, 2010.
Thank you for your attention