Alison Motsinger-Reif, PhD Bioinformatics Research Center Department of Statistics North Carolina State University [email protected] Pathway and Gene Set Analysis Part 1

Pathway and Gene Set Analysis Part 1 - biostat.washington.edu · Alison Motsinger-Reif, PhD Bioinformatics Research Center Department of Statistics North Carolina State University

Aug 10, 2019

Alison Motsinger-Reif, PhD
Bioinformatics Research Center
Department of Statistics
North Carolina State University
[email protected]

Pathway and Gene Set Analysis Part 1

The early steps of a microarray study

• Scientific Question (biological)
• Study design (biological/statistical)
• Conducting Experiment (biological)
• Preprocessing/Normalizing Data (statistical)
• Finding differentially expressed genes (statistical)

A data example

• Lee et al (2005) compared adipose tissue (abdominal subcutaneous adipocytes) between obese and lean Pima Indians
• Samples were hybridised on HGu95e-Affymetrix arrays (12639 genes/probe sets)
• Available as GDS1498 on the GEO database
• We selected the male samples only
  – 10 obese vs 9 lean

The “Result”

Probe Set ID   log.ratio   pvalue   adj.p
73554_at          1.4971   0.0000   0.0004
91279_at          0.8667   0.0000   0.0017
74099_at          1.0787   0.0000   0.0104
83118_at         -1.2142   0.0000   0.0139
81647_at          1.0362   0.0000   0.0139
84412_at          1.3124   0.0000   0.0222
90585_at          1.9859   0.0000   0.0258
84618_at         -1.6713   0.0000   0.0258
91790_at          1.7293   0.0000   0.0350
80755_at          1.5238   0.0000   0.0351
85539_at          0.9303   0.0000   0.0351
90749_at          1.7093   0.0000   0.0351
74038_at         -1.6451   0.0000   0.0351
79299_at          1.7156   0.0000   0.0351
72962_at          2.1059   0.0000   0.0351
88719_at         -3.1829   0.0000   0.0351
72943_at         -2.0520   0.0000   0.0351
91797_at          1.4676   0.0000   0.0351
78356_at          2.1140   0.0001   0.0359
90268_at          1.6552   0.0001   0.0421

What happened to the Biology???

Slightly more informative results

Probe Set ID | Gene Symbol | Gene Title | go biological process term | go molecular function term | log.ratio | pvalue | adj.p
73554_at | CCDC80 | coiled-coil domain containing 80 | --- | --- | 1.4971 | 0.0000 | 0.0004
91279_at | C1QTNF5 /// MFRP | C1q and tumor necrosis factor related protein 5 /// membrane frizzled-related protein | visual perception /// embryonic development /// response to stimulus | --- | 0.8667 | 0.0000 | 0.0017
74099_at | --- | --- | --- | --- | 1.0787 | 0.0000 | 0.0104
83118_at | RNF125 | ring finger protein 125 | immune response /// modification-dependent protein catabolic process | protein binding /// zinc ion binding /// ligase activity /// metal ion binding | -1.2142 | 0.0000 | 0.0139
81647_at | --- | --- | --- | --- | 1.0362 | 0.0000 | 0.0139
84412_at | SYNPO2 | synaptopodin 2 | --- | actin binding /// protein binding | 1.3124 | 0.0000 | 0.0222
90585_at | C15orf59 | chromosome 15 open reading frame 59 | --- | --- | 1.9859 | 0.0000 | 0.0258
84618_at | C12orf39 | chromosome 12 open reading frame 39 | --- | --- | -1.6713 | 0.0000 | 0.0258
91790_at | MYEOV | myeloma overexpressed (in a subset of t(11;14) positive multiple myelomas) | --- | --- | 1.7293 | 0.0000 | 0.0350
80755_at | MYOF | myoferlin | muscle contraction /// blood circulation | protein binding | 1.5238 | 0.0000 | 0.0351
85539_at | PLEKHH1 | pleckstrin homology domain containing, family H (with MyTH4 domain) member 1 | --- | binding | 0.9303 | 0.0000 | 0.0351
90749_at | SERPINB9 | serpin peptidase inhibitor, clade B (ovalbumin), member 9 | anti-apoptosis /// signal transduction | endopeptidase inhibitor activity /// serine-type endopeptidase inhibitor activity /// serine-type endopeptidase inhibitor activity /// protein binding | 1.7093 | 0.0000 | 0.0351
74038_at | --- | --- | --- | --- | -1.6451 | 0.0000 | 0.0351
79299_at | --- | --- | --- | --- | 1.7156 | 0.0000 | 0.0351
72962_at | BCAT1 | branched chain aminotransferase 1, cytosolic | G1/S transition of mitotic cell cycle /// metabolic process /// cell proliferation /// amino acid biosynthetic process /// branched chain family amino acid metabolic process /// branched chain family amino acid biosynthetic process /// branched chain family amino acid biosynthetic process | catalytic activity /// branched-chain-amino-acid transaminase activity /// branched-chain-amino-acid transaminase activity /// transaminase activity /// transferase activity /// identical protein binding | 2.1059 | 0.0000 | 0.0351
88719_at | C12orf39 | chromosome 12 open reading frame 39 | --- | --- | -3.1829 | 0.0000 | 0.0351
72943_at | --- | --- | --- | --- | -2.0520 | 0.0000 | 0.0351
91797_at | LRRC16A | leucine rich repeat containing 16A | --- | --- | 1.4676 | 0.0000 | 0.0351
78356_at | TRDN | triadin | muscle contraction | receptor binding | 2.1140 | 0.0001 | 0.0359
90268_at | C5orf23 | chromosome 5 open reading frame 23 | --- | --- | 1.6552 | 0.0001 | 0.0421

If we are lucky, some of the top genes mean something to us

But what if they don’t?

And what are the results for other genes with similar biological functions?

How to incorporate biological knowledge

• The type of knowledge we deal with is rather simple: we know groups/sets of genes that, for example,
  – Belong to the same pathway
  – Have a similar function
  – Are located on the same chromosome, etc…
• We will assume these groupings to be given, i.e. we will not yet discuss methods used to detect pathways, networks, gene clusters
• We will later!

What is a pathway?

• No clear definition
  – Wikipedia: “In biochemistry, metabolic pathways are series of chemical reactions occurring within a cell. In each pathway, a principal chemical is modified by chemical reactions.”
  – These pathways describe enzymes and metabolites
• But often the word “pathway” is also used to describe gene regulatory networks or protein interaction networks
• In all cases a pathway describes a biological function very specifically

What is a Gene Set?

• Just what it says: a set of genes!
  – All genes involved in a pathway are an example of a Gene Set
  – All genes corresponding to a Gene Ontology term are a Gene Set
  – All genes mentioned in a paper of Smith et al might form a Gene Set
• A Gene Set is a much more general and less specific concept than a pathway
• Still: we will sometimes use the two words interchangeably, as the analysis methods are mainly the same

Where Do Gene Sets/Lists Come From?

• Molecular profiling e.g. mRNA, protein
  – Identification → Gene list
  – Quantification → Gene list + values
  – Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, Transcription factor binding sites (ChIP)
• Genetic screen e.g. of knockout library
• Association studies (Genome-wide)
  – Single nucleotide polymorphisms (SNPs)
  – Copy number variants (CNVs)
  – ……..

What is Gene Set/Pathway analysis?

• The aim is to give one number (score, p-value) to a Gene Set/Pathway
  – Are many genes in the pathway differentially expressed (up-regulated/down-regulated)?
  – Can we give a number (p-value) to the probability of observing these changes just by chance?

Goals

• Pathway and gene set data resources
  • Gene attributes
  • Database resources
    • GO, KEGG, WikiPathways, MSigDB
  • Gene identifiers and issues with mapping
• Differences between pathway analysis tools
  • Self contained vs. competitive tests
  • Cut-off methods vs. global methods
  • Issues with multiple testing


Gene Attributes

• Functional annotation
  – Biological process, molecular function, cell location
• Chromosome position
• Disease association
• DNA properties
  – TF binding sites, gene structure (intron/exon), SNPs
• Transcript properties
  – Splicing, 3’ UTR, microRNA binding sites
• Protein properties
  – Domains, secondary and tertiary structure, PTM sites
• Interactions with other genes


Database Resources

• Use functional annotation to aggregate genes into pathways/gene sets
• A number of databases are available
  – Different analysis tools link to different databases
  – Too many databases to go into detail on every one
  – Commonly used resources:
    • GO
    • KEGG
    • MSigDB
    • WikiPathways

Pathway and Gene Set data resources

• The Gene Ontology (GO) database
  – http://www.geneontology.org/
  – GO offers a relational/hierarchical database
  – Parent nodes: more general terms
  – Child nodes: more specific terms
  – At the end of the hierarchy there are genes/proteins
  – At the top there are 3 parent nodes: biological process, molecular function and cellular component
• Example: we search the database for the term “inflammation”

The genes on our array that code for one of the 44 gene products would form the corresponding “inflammation” gene set

What is the Gene Ontology (GO)?

• Set of biological phrases (terms) which are applied to genes:
  – protein kinase
  – apoptosis
  – membrane
• Ontology: A formal system for describing knowledge

GO Structure

• Terms are related within a hierarchy
  – is-a
  – part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child

GO Structure

[Figure: example GO graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, connected by is-a and part-of relations]

Species independent. Some lower-level terms are specific to a group, but higher level terms are not
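The hierarchy can be pictured as a directed graph in which each term points to its parents. The Python sketch below is illustrative only: the edges are a guess at what the slide's figure shows rather than a copy of the GO database, and the gene names and annotations are made up. It shows why a gene annotated to a specific term also belongs to the gene sets of all more general terms.

```python
# Toy GO fragment. The edges are assumptions based on the slide's figure,
# not an extract from the real ontology.
PARENTS = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}

# Hypothetical direct annotations: gene -> set of GO terms.
ANNOTATIONS = {
    "geneA": {"chloroplast membrane"},
    "geneB": {"mitochondrial membrane"},
    "geneC": {"chloroplast"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following is-a/part-of edges upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def gene_set(term, annotations=ANNOTATIONS, parents=PARENTS):
    """Genes annotated to the term itself or to any of its descendants."""
    return {
        gene
        for gene, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    }


print(sorted(ancestors("chloroplast membrane")))
print(sorted(gene_set("membrane")))
```

Because "chloroplast membrane" has two parents, geneA ends up in the gene sets for both "membrane" and "chloroplast", which is one reason GO-derived gene sets overlap.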

What GO Covers?

• GO terms divided into three aspects:
  – cellular component
  – molecular function
  – biological process

Examples shown on the slide: glucose-6-phosphate isomerase activity (molecular function), cell division (biological process)

Terms

• Where do GO terms come from?
  – GO terms are added by editors at EBI and gene annotation database groups
  – Terms added by request
  – Experts help with major development
  – 27734 terms, 98.9% with definitions.
    • 16731 biological_process
    • 2385 cellular_component
    • 8618 molecular_function

Annotations

• Genes are linked, or associated, with GO terms by trained curators at genome databases
  – Known as ‘gene associations’ or GO annotations
  – Multiple annotations per gene
• Some GO annotations created automatically

Annotation Sources

• Manual annotation
  – Created by scientific curators
    • High quality
    • Small number (time-consuming to create)
• Electronic annotation
  – Annotation derived without human validation
    • Computational predictions (accuracy varies)
    • Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin

Evidence Types

• ISS: Inferred from Sequence/Structural Similarity
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
• IC: Inferred by Curator
• ND: No Data available

• IEA: Inferred from electronic annotation
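One way to act on the evidence codes is to decide which of them to trust before building a gene set. The following is a minimal Python sketch, assuming annotations arrive as (gene, GO ID, evidence code) tuples. The gene names are hypothetical, and leaving ND out of the trusted codes is a choice made here, not a rule stated on the slides.

```python
# Evidence codes from the slide that involve a curator (ND left out on purpose).
CURATED_CODES = {"ISS", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "NAS", "IC"}

# Hypothetical annotation records: (gene, GO ID, evidence code).
annotations = [
    ("geneA", "GO:0006954", "IDA"),
    ("geneB", "GO:0006954", "IEA"),
    ("geneC", "GO:0006954", "TAS"),
    ("geneD", "GO:0006954", "ND"),
    ("geneE", "GO:0000724", "IMP"),
]


def genes_for_term(records, go_id, allowed_codes):
    """Genes annotated to go_id with at least one allowed evidence code."""
    return {gene for gene, term, code in records
            if term == go_id and code in allowed_codes}


curated = genes_for_term(annotations, "GO:0006954", CURATED_CODES)
with_electronic = genes_for_term(annotations, "GO:0006954",
                                 CURATED_CODES | {"IEA"})

print(sorted(curated))
print(sorted(with_electronic))
```

Comparing the two sets shows how much of a gene set rests on electronic annotation alone, which is the practical meaning of "be aware of annotation origin".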

Species Coverage

• All major eukaryotic model organism species
• Human via GOA group at UniProt
• Several bacterial and parasite species through TIGR and GeneDB at Sanger
• New species annotations in development

Variable Coverage

Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.

Contributing Databases

– Berkeley Drosophila Genome Project (BDGP)
– dictyBase (Dictyostelium discoideum)
– FlyBase (Drosophila melanogaster)
– GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei)
– UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases
– Gramene (grains, including rice, Oryza)
– Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus)
– Rat Genome Database (RGD) (Rattus norvegicus)
– Reactome
– Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae)
– The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana)
– The Institute for Genomic Research (TIGR): databases on several bacterial species
– WormBase (Caenorhabditis elegans)
– Zebrafish Information Network (ZFIN): (Danio rerio)

GO Slim Sets

• GO has too many terms for some uses
  – Summaries (e.g. Pie charts)
• GO Slim is an official reduced set of GO terms
  – Generic, plant, yeast

GO Software Tools

• GO resources are freely available to anyone without restriction
  – Includes the ontologies, gene associations and tools developed by GO
• Other groups have used GO to create tools for many purposes
  – http://www.geneontology.org/GO.tools

Accessing GO: QuickGO

http://www.ebi.ac.uk/ego/

Other Ontologies

http://www.ebi.ac.uk/ontology-lookup

KEGG pathway database

• KEGG = Kyoto Encyclopedia of Genes and Genomes
  – http://www.genome.jp/kegg/pathway.html
  – The pathway database gives far more detailed information than GO
    • Relationships between genes and gene products
  – But: this detailed information is only available for selected organisms and processes
  – Example: Adipocytokine signaling pathway

KEGG pathway database

• Clicking on the nodes in the pathway leads to more information on genes/proteins
  – Other pathways the node is involved with
  – Entries in Gene/Protein databases
  – References
  – Sequence information
• Ultimately this allows us to find the corresponding genes on the microarray and define a Gene Set for the pathway

WikiPathways

• http://www.wikipathways.org
• A wikipedia for pathways
  – One can see and download pathways
  – But also edit and contribute pathways
• The project is linked to the GenMAPP and PathVisio analysis/visualisation tools

MSigDB

• MSigDB = Molecular Signatures Database, http://www.broadinstitute.org/gsea/msigdb
• Related to the analysis program GSEA
• MSigDB offers gene sets based on various groupings
  – Pathways
  – GO terms
  – Chromosomal position, …

Some Warnings

• In many cases the definition of a pathway/gene set in a database might differ from that of a scientist
• The nodes in pathways are often proteins or metabolites; the activity of the corresponding gene set is not necessarily a good measurement of the activity of the pathway
• There are many more resources out there (BioCarta, BioPAX)
• Commercial packages often use their own pathway/gene set definitions (Ingenuity, Metacore, Genomatix, …)
• Genes in a gene set are usually not given by a Probe Set ID, but refer to some gene database (Entrez IDs, Unigene IDs)
• Conversion can lead to errors!



Sources of Gene Attributes

• Ensembl BioMart (eukaryotes)
  – http://www.ensembl.org
• Entrez Gene (general)
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
  – E.g. SGD: http://www.yeastgenome.org/
• Many others…..

Ensembl BioMart

• Convenient access to gene list annotation
  – Select genome
  – Select filters
  – Select attributes to download

Gene and Protein Identifiers

• Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
  – E.g. Social Insurance Number, Entrez Gene ID 41232
• Gene and protein information stored in many databases
  – → Genes have many IDs
• Records for: Gene, DNA, RNA, Protein
  – Important to recognize the correct record type
  – E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins.

NCBI Database Links

http://www.ncbi.nlm.nih.gov/Database/datamodel/data_nodes.swf

NCBI: U.S. National Center for Biotechnology Information

Part of National Library of Medicine (NLM)

Common Identifiers

Species-specific
  HUGO HGNC       BRCA2
  MGI             MGI:109337
  RGD             2219
  ZFIN            ZDB-GENE-060510-3
  FlyBase         CG9097
  WormBase        WBGene00002299 or ZK1067.1
  SGD             S000002187 or YDL029W

Annotations
  InterPro        IPR015252
  OMIM            600185
  Pfam            PF09104
  Gene Ontology   GO:0000724
  SNPs            rs28897757

Experimental Platform
  Affymetrix      208368_3p_s_at
  Agilent         A_23_P99452
  CodeLink        GE60169
  Illumina        GI_4502450-S

Gene
  Ensembl         ENSG00000139618
  Entrez Gene     675
  Unigene         Hs.34012

RNA transcript
  GenBank         BC026160.1
  RefSeq          NM_000059
  Ensembl         ENST00000380152

Protein
  Ensembl         ENSP00000369497
  RefSeq          NP_000050.2
  UniProt         BRCA2_HUMAN or A1YBP1_HUMAN
  IPI             IPI00412408.1
  EMBL            AF309413
  PDB             1MIU

Red = Recommended

Identifier Mapping

• So many IDs!
  – Mapping (conversion) is a headache
• Four main uses
  – Searching for a favorite gene name
  – Link to related resources
  – Identifier translation
    • E.g. Genes to proteins, Entrez Gene to Affy
  – Unification during dataset merging
    • Equivalent records

ID Mapping Services

• Synergizer
  – http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
  – http://www.ensembl.org
• UniProt
  – http://www.uniprot.org/

ID Mapping Challenges

• Avoid errors: map IDs correctly
• Gene name ambiguity – not a good ID
  – e.g. FLJ92943, LFS1, TRP53, p53
  – Better to use the standard gene symbol: TP53
• Excel error-introduction
  – OCT4 is changed to October-4
• Problems reaching 100% coverage
  – E.g. due to version issues
  – Use multiple sources to increase coverage

Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80
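The mapping problems above can be made concrete with a small Python sketch. It assumes two lookup tables are already available: aliases to official symbols, and symbols to probe sets. The alias list and the BRCA2 probe set come from the slides; the TP53 probe ID is a made-up placeholder, and a real analysis would load both tables from a mapping service rather than hard-code them.

```python
# Alias -> official symbol, using the TP53 example from the slide.
ALIAS_TO_SYMBOL = {
    "FLJ92943": "TP53",
    "LFS1": "TP53",
    "TRP53": "TP53",
    "p53": "TP53",
}

# Official symbol -> probe set IDs on the array.
# The TP53 entry is a hypothetical placeholder, not a real probe set ID.
SYMBOL_TO_PROBES = {
    "BRCA2": ["208368_3p_s_at"],
    "TP53": ["hypothetical_probe_1_at"],
}


def map_gene_set(names):
    """Translate gene names to probe sets and report what could not be mapped."""
    probes = set()
    unmapped = []
    for name in names:
        symbol = ALIAS_TO_SYMBOL.get(name, name)  # normalise aliases first
        hits = SYMBOL_TO_PROBES.get(symbol)
        if hits:
            probes.update(hits)
        else:
            unmapped.append(name)
    coverage = 1 - len(unmapped) / len(names) if names else 0.0
    return probes, unmapped, coverage


probes, unmapped, coverage = map_gene_set(["p53", "BRCA2", "OCT4"])
print(sorted(probes), unmapped, round(coverage, 2))
```

Returning the unmapped names and the coverage alongside the result keeps silent losses visible, so that a gene set which shrinks during conversion is noticed before it is tested.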


Aims of Analysis

• Reminder: The aim is to give one number (score, p-value) to a Gene Set/Pathway
  – Are many genes in the pathway differentially expressed (up-regulated/down-regulated)?
  – Can we give a number (p-value) to the probability of observing these changes just by chance?
  – Similar to single gene analysis, statistical hypothesis testing plays an important role

General differences between analysis tools

• Self contained vs competitive test
  – The distinction between “self-contained” and “competitive” methods goes back to Goeman and Bühlmann (2007)
  – A self-contained method only uses the values for the genes of a gene set
    • The null hypothesis here is: H = {“No genes in the Gene Set are differentially expressed”}
  – A competitive method compares the genes within the gene set with the other genes on the arrays
    • Here we test against H: {“The genes in the Gene Set are not more differentially expressed than other genes”}

Example: Analysis for the GO-Term “inflammatory response” (GO:0006954)

Back to the Real Data Example

• Using Bioconductor software we can find 96 probe sets on the array corresponding to this term
• 8 out of these have a p-value < 5%
• How many significant genes would we expect by chance?
• Depends on how we define “by chance”

The “self-contained” version

• By chance (i.e. if it is NOT differentially expressed) a gene should be significant with a probability of 5%
• We would expect 96 x 5% = 4.8 significant genes
• Using the binomial distribution we can calculate the probability of observing 8 or more significant genes as p = 0.108, i.e. not quite significant
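The binomial calculation can be checked in a few lines of Python. This sketch uses only the numbers quoted on the slide (96 probe sets, 8 significant, 5% chance per gene) and, like the slide, treats the genes as independent. The function name is made up for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_genes = 96    # probe sets annotated to the GO term
n_sig = 8       # of these, p-value < 5%
alpha = 0.05    # chance that a null gene is called significant

expected = n_genes * alpha
p_self_contained = binomial_upper_tail(n_sig, n_genes, alpha)

print(f"expected by chance: {expected:.1f}")
print(f"P(8 or more significant): {p_self_contained:.3f}")  # roughly 0.108
```

The independence assumption is the weak point of this test, since genes in the same set tend to be co-expressed.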

The “competitive” version

• Overall 1272 out of 12639 genes are significant in this data set (10.1%)
• If we randomly pick 96 genes we would expect 96 x 10.1% = 9.7 genes to be significant “by chance”
• A p-value can be calculated based on the 2x2 table
• Tests for association: Chi-Square-Test or Fisher’s exact test

             In GS    Not in GS
sig              8         1264
non-sig         88        11279

P-value from Fisher’s exact test (one-sided): 0.733, i.e. very far from being significant
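A one-sided Fisher's exact test on a 2x2 table is a hypergeometric tail probability, so it can be sketched without any statistics library. The Python code below is a minimal version under that definition; the function name and argument order are choices made here. It should give a large p-value for the slide's table, in line with the slide's conclusion, although the exact figure can differ with the software and the one-sided convention used.

```python
from math import comb


def fisher_one_sided(sig_in, sig_out, nonsig_in, nonsig_out):
    """P(at least sig_in significant genes in the set) under random sampling.

    The gene set is treated as a random draw of (sig_in + nonsig_in) genes
    from all genes on the array, without replacement.
    """
    total = sig_in + sig_out + nonsig_in + nonsig_out
    total_sig = sig_in + sig_out
    set_size = sig_in + nonsig_in
    upper = min(total_sig, set_size)
    favourable = sum(
        comb(total_sig, k) * comb(total - total_sig, set_size - k)
        for k in range(sig_in, upper + 1)
    )
    return favourable / comb(total, set_size)


# 2x2 table from the slide.
p_competitive = fisher_one_sided(8, 1264, 88, 11279)
print(f"one-sided p-value: {p_competitive:.3f}")
```

Python integers have arbitrary precision, so the very large binomial coefficients are exact and only the final division is done in floating point.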

Competitive Tests

• Competitive results depend highly on how many genes are on the array and previous filtering
  – On a small targeted array where all genes are changed, a competitive method might detect no differential Gene Sets at all
• Competitive tests can also be used with small sample sizes, even for n=1
  – BUT: The result gives no indication of whether it holds for a wider population of subjects, the p-value concerns a population of genes!
• Competitive tests typically give less significant results than self-contained (as seen with the example)
• Fisher’s exact test (competitive) is probably the most widely used method!

Cut-off methods vs whole gene list methods

• A problem with both tests discussed so far is that they rely on an arbitrary cut-off
• If we call a gene significant for 10% alpha threshold the results will change
  – In our example the binomial test yields p = 0.022, i.e. for this cut-off the result is significant!
• We also lose information by reducing a p-value to a binary (“significant”, “non-significant”) variable
  – It should make a difference whether the non-significant genes in the set are nearly significant or completely non-significant

[Figure: P-value histogram for inflammation genes; x-axis: pvalue[incl] from 0.0 to 1.0, y-axis: Frequency]

• We can study the distribution of the p-values in the gene set
• If no genes are differentially expressed this should be a uniform distribution
• A peak on the left indicates that some genes are differentially expressed
• We can test this for example by using the Kolmogorov-Smirnov-Test
• Here p = 0.082, i.e. not quite significant
• This would be a “self-contained” test, as only the genes in the gene set are being used


Kolmogorov-Smirnov Test

• The KS test compares an observed with an expected cumulative distribution
• The KS statistic is given by the maximum deviation between the two

[Figure: Observed and expected cumulative distribution; x-axis: x (0.0 to 1.0), y-axis: Fn(x) (0.0 to 1.0)]
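A minimal sketch of the KS test against the uniform distribution is given here, assuming the gene-set p-values are available as a plain list. The function name `ks_uniform` and the example p-values are invented for illustration, and the p-value uses the standard asymptotic Kolmogorov series, which is only an approximation for small gene sets.

```python
import math

def ks_uniform(pvalues):
    """One-sample KS test of p-values against Uniform(0, 1).

    Returns (D, approximate p-value), where D is the maximum deviation
    between the empirical CDF Fn(x) and the expected CDF F(x) = x.
    """
    xs = sorted(pvalues)
    n = len(xs)
    # The empirical CDF jumps from i/n to (i+1)/n at the i-th sorted value,
    # so the largest gap must occur just before or just after a jump.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    # Asymptotic tail: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 * k^2 * n * D^2)
    lam = n * d * d
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))

# Hypothetical gene-set p-values with a peak on the left
example = [0.001, 0.004, 0.01, 0.02, 0.03, 0.08, 0.15, 0.3, 0.55, 0.9]
d, p = ks_uniform(example)
print(f"D = {d:.3f}, approximate p = {p:.4f}")
```

For very small gene sets an exact KS distribution would be preferable to the asymptotic series used here.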


Histogram of the ranks of p-values for inflammation genes

[Figure: histogram of p.rank[incl] (x-axis, 0 to 14000) against Frequency (y-axis, 0 to 15)]

• Alternatively, we could look at the distribution of the RANKS of the p-values in our gene set
• This would be a competitive method, i.e. we compare our gene set with the other genes
• Again, one can use the Kolmogorov-Smirnov test to test for uniformity
• Here: p = 0.851, i.e. very far from significance
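The rank transformation behind this competitive variant can be sketched as follows. The helper name `scaled_ranks` and the toy p-values are assumptions for illustration; ties are broken by position here, which is only one of several possible conventions.

```python
def scaled_ranks(all_pvalues, gene_set_indices):
    """Rank every p-value on the array (1 = smallest) and return the
    ranks of the gene-set members divided by the number of genes.

    Under the competitive null, the scaled ranks of the gene set should
    look roughly uniform on (0, 1].
    """
    n = len(all_pvalues)
    order = sorted(range(n), key=lambda i: all_pvalues[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    return [rank[i] / n for i in gene_set_indices]

# Toy array of 8 genes; the gene set consists of genes 1, 2 and 6
all_p = [0.50, 0.01, 0.20, 0.90, 0.70, 0.35, 0.04, 0.60]
print(scaled_ranks(all_p, [1, 2, 6]))
```

The scaled ranks can then be compared with Uniform(0, 1) using any one-sample KS test.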


Other general issues

• Direction of change
– In our example we didn't differentiate between up- or down-regulated genes
– That can be achieved by repeating the analysis for p-values from one-sided tests
• E.g. we could find GO terms that are significantly up-regulated
– With most software both approaches are possible
• Multiple testing
– As we are testing many gene sets, we expect some significant findings "by chance" (false positives)
– Controlling the false discovery rate is tricky: the gene sets do overlap, so they will not be independent!
• Even more tricky in GO analysis, where certain GO terms are subsets of others
– The Bonferroni method is the most conservative, but always works!
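The Bonferroni adjustment itself is a one-liner: multiply each gene-set p-value by the number of gene sets tested and cap at 1. The gene-set names and p-values below are hypothetical and serve only to show the mechanics.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Hypothetical raw p-values for three gene sets
gene_set_pvalues = {"set_A": 0.0004, "set_B": 0.022, "set_C": 0.733}
adjusted = dict(zip(gene_set_pvalues,
                    bonferroni(list(gene_set_pvalues.values()))))
significant = [name for name, p in adjusted.items() if p <= 0.05]
print(adjusted)
print(significant)
```

Because the adjustment makes no assumption about how the tests depend on each other, it remains valid for overlapping gene sets, at the cost of power.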


Multiple Testing for Pathways

• Resampling strategies (dependence between genes)
– The methods we used so far in our example assume that genes are independent of each other… if this is violated, the p-values are incorrect
– Resampling of group/phenotype labels can correct for this
– We give an example for our data set


Example Resampling Approach

1. Calculate the test statistic, e.g. the percentage of significant genes in the gene set
2. Randomly re-shuffle the group labels (lean, obese) between the samples
3. Repeat the analysis for the re-shuffled data set and calculate a re-shuffled version of the test statistic
4. Repeat 2 and 3 many times (thousands…)
5. We obtain a distribution of re-shuffled % of significant genes: the percentage of re-shuffled values that are at least as large as the one observed in 1 is our p-value
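The five steps can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data are synthetic, "significant" is approximated by a Welch |t| above a fixed cutoff rather than a per-gene p-value threshold, re-shuffled values are counted when they are at least as large as the observed one, and the p-value uses the common (count + 1) / (permutations + 1) convention.

```python
import math
import random

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_stat(a, b):
    """Welch t-statistic comparing two groups of expression values."""
    (ma, va), (mb, vb) = mean_var(a), mean_var(b)
    diff = ma - mb
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def pct_significant(expr, labels, cutoff=2.0):
    """Step 1: share of gene-set genes with |t| above the cutoff."""
    hits = 0
    for row in expr:
        obese = [x for x, g in zip(row, labels) if g == "obese"]
        lean = [x for x, g in zip(row, labels) if g == "lean"]
        hits += abs(t_stat(obese, lean)) > cutoff
    return hits / len(expr)

def permutation_pvalue(expr, labels, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = pct_significant(expr, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(n_perm):                              # step 4
        rng.shuffle(shuffled)                            # step 2
        if pct_significant(expr, shuffled) >= observed:  # step 3
            count += 1
    return observed, (count + 1) / (n_perm + 1)          # step 5

# Synthetic gene set: 20 genes (rows), 10 obese and 9 lean samples (columns);
# the first 5 genes are shifted upwards in the obese group.
rng = random.Random(42)
labels = ["obese"] * 10 + ["lean"] * 9
expr = [[rng.gauss(1.0 if (g < 5 and lab == "obese") else 0.0, 1.0)
         for lab in labels] for g in range(20)]
observed, p = permutation_pvalue(expr, labels, n_perm=1000)
print(f"observed share = {observed:.2f}, permutation p = {p:.3f}")
```

Whole label vectors are shuffled, so every gene in the set sees the same re-labelling of the samples in each permutation.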


Resampling Approach

• The reshuffling takes gene-to-gene correlations into account
• Many programs also offer to resample the genes: this does NOT take correlations into account
• Roughly speaking:
– Resampling phenotypes: corresponds to a self-contained test
– Resampling genes: corresponds to a competitive test
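For contrast, gene resampling can be sketched as drawing random gene sets of the same size from the whole array and asking how often they contain at least as many significant genes as the real set. The function name, the gene identifiers and the significance flags below are all hypothetical.

```python
import random

def gene_resampling_pvalue(significant, gene_set, n_resample=1000, seed=0):
    """Competitive resampling of genes.

    significant: dict mapping every gene on the array to True/False
    gene_set:    list of gene names in the set of interest
    """
    rng = random.Random(seed)
    genes = list(significant)
    k = len(gene_set)
    observed = sum(significant[g] for g in gene_set) / k
    count = 0
    for _ in range(n_resample):
        draw = rng.sample(genes, k)   # random gene set of the same size
        if sum(significant[g] for g in draw) / k >= observed:
            count += 1
    return observed, (count + 1) / (n_resample + 1)

# Hypothetical array of 100 genes, of which the first 10 are significant
flags = {f"g{i}": i < 10 for i in range(100)}
obs, p = gene_resampling_pvalue(flags, [f"g{i}" for i in range(5, 15)])
print(f"observed share = {obs:.2f}, resampling p = {p:.3f}")
```

Genes are drawn independently of each other here, which is why this scheme ignores gene-to-gene correlations.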


Resampling Approaches

• Genes being present more than once
– Common approaches
• Combine duplicates (average, median, maximum, …)
• Ignore (i.e. treat duplicates like different genes)
• Using summary statistics vs. using all data
– Our examples used p-values as data summaries
– Other approaches use fold-changes, signal-to-noise ratios, etc.
– Some methods are based on the original data for the genes in the gene set rather than on a summary statistic
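Combining duplicates amounts to grouping probe-level values by gene and applying a summary function. This is a minimal sketch with invented probe and gene identifiers; `combine_duplicates` is not taken from any particular package.

```python
from collections import defaultdict
from statistics import median

def combine_duplicates(probe_values, probe_to_gene, combine=median):
    """Collapse probe-level values to one value per gene.

    combine can be any function of a list, e.g. statistics.mean,
    statistics.median or max.
    """
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        by_gene[probe_to_gene[probe]].append(value)
    return {gene: combine(values) for gene, values in by_gene.items()}

# Hypothetical probe sets: two of them map to the same gene
probe_values = {"probe_1": 1.0, "probe_2": 3.0, "probe_3": 2.0}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
print(combine_duplicates(probe_values, probe_to_gene))
print(combine_duplicates(probe_values, probe_to_gene, combine=max))
```

The choice of summary changes which genes count as significant, so it is worth fixing before the gene set test is run.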


Resampling Approaches

• The resampling approaches are highly computationally intensive
• New methods are being developed to speed this up
– Empirical approximations of permutations
– Empirical pathway analysis, without permutation.
• Zhou YH, Barry WT, Wright FA. Biostatistics. 2013 Jul;14(3):573-85. doi:10.1093/biostatistics/kxt004. Epub 2013 Feb 20.


Summary

• Databases
• Choice makes a difference
• Not all use the same IDs: watch out
• Major differences between methods
• Issues with multiple testing
• The next lecture will go into more detail on a few methods


Questions?