Bioinformatics History and Introduction Luce Skrabanek ICB, WMC January 28, 2010 http://chagall.med.cornell.edu/BioinfoCourse/
What IS bioinformatics?
• Current definitions vary widely:
– The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data. (Attwood and Parry-Smith, 1999)
– The use of computers in solving information problems in the life sciences, mainly, it involves the creation of extensive electronic databases on genomes, protein sequences, etc. Secondarily, it involves techniques such as the three-dimensional modeling of biomolecules and biologic systems. (21 Mar 1998, CancerWEB)
– “I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, in particular genetic information.” (Richard Durbin, Head of Informatics at the Sanger Center)
– The storage, manipulation and analysis of biological information via computer science. Bioinformatics is an essential infrastructure underpinning biological research (the Roslin Institute)
– At the beginning of the “genomic revolution”, a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. […] The field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. (NCBI)
– The application of computational sciences (computer science, mathematics, statistics) to advance research in the life sciences (agriculture, basic biology, medicine). (U. of Trieste)
Even job titles undecided
• A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to write interfaces for effective use of the tools.
• A bioinformatician, on the other hand, is a trained individual who only knows to use bioinformatics tools without a deeper understanding.
• Thus, a bioinformaticist is to *.omics as a mechanical engineer is to an automobile. A bioinformatician is to *.omics as a technician is to an automobile.
BioinformaticsWeb
Not just “informatics”
• Bioinformatics is the field of science in which biology, computer science, mathematics and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within bioinformatics:
– the development of new algorithms and statistics with which to assess relationships among members of large data sets
– the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
– the development and implementation of tools that enable efficient access and management of different types of information.
• Need to have biological knowledge to know what questions to ask
Bag of Tools
• Bioinformatics is interdisciplinary
• Synthesis of tools from many fields
– Biology
– Computer science
– Mathematics/Statistics
• (Usually) don’t have to reinvent the wheel
Margaret Dayhoff (1925-1983)
Hwa A Lim
Paul Berg (1926-)
Types of data available
• Enormous amounts of data available publicly
– DNA/RNA sequence
– SNPs
– protein sequence
– protein structure
– protein function
– organism-specific databases
– genomes
– gene expression
– biomolecular interactions
– molecular pathways
– scientific literature
– disease information
http://www.ncbi.nlm.nih.gov/
The blue area shows the total number of bases in GenBank, including those from whole genome shotgun (WGS) sequencing projects. The checkered area shows only the non-WGS portion.
In release 175.0, there are now over 110 billion bases in GenBank, and almost 160 billion bases in the WGS division.
Growth of PDB
http://www.rcsb.org/pdb/
The bad news
• Huge numbers of errors in the databases
– wrong positions of genes
– exon-intron boundary errors
– contaminating sequences
– sequence discrepancies/variations
– frameshift errors
– annotation errors
– spelling mistakes
– incorrectly joined contigs
Mouse build 32
Finding bioinformatics resources
• Google!
• Databases:
– http://www.expasy.org/links.html
• Programs:
– sites with compendia, e.g., http://www.bioinformatik.de/cgi-bin/browse/Catalog/Software/Online_Tools/
• Literature searches
Programs
• Must know at least the principles behind the programs
• Don’t just treat them as a black box
• To understand the results, the user should have some idea of:
– how they work
– what assumptions they make
Some common analysis tools
• Homology searching (e.g., BLAST)
• Sequence alignment (e.g., ClustalW)
• Phylogenetics (e.g., PHYLIP)
• Functional patterns (e.g., HMMER)
• Gene prediction (e.g., GenScan)
• Regulatory region analysis (e.g., MatInspector)
• RNA structure (e.g., UNAFold)
• Protein structure (e.g., JPred)
Scalability
• Huge volumes of data available to us
– Complete genomes, NGS
• Necessary computational resources now available to deal with these amounts of data
– 8 GB (~human genome) can be stored on an iPod
– Tree of life can be stored in 1 TB
– Raw data from 1 NGS experiment = 1 TB
• Tools and techniques have to be efficient and scalable
• Huge amounts of ‘parts’ data
– Sequence - nucleotide and protein
– Structure
– Function
– Biochemical information
– Protein-protein interactions, complexes
– Protein-DNA complexes
– Kinetics of reactions
Integrated together into “Systems Biology”
• The study of the interactions between the components of a biological system
• How those interactions give rise to the function and behavior that we see
Where do we go from here?
Mathematical modeling
• Biological systems can be represented by ODEs
– compartments
– stochastic methods for low concentration components
• Systems modeling can:
– effectively integrate “parts” information
– help reveal non-intuitive properties
– teach us how cells store information and ‘compute’
• Quantitative models of pathways and networks
– predict cellular responses to external stimuli
– model effects of perturbations on the system
– predict how to ‘correct’ disease states
– identify control points in the system
Ravi Iyengar lab
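As a minimal sketch of what "represented by ODEs" can mean in practice, the toy model below integrates dx/dt = k_syn - k_deg * x (constant synthesis, first-order degradation of a single component) with the forward Euler method. The model, the function name `simulate`, and all parameter values are illustrative assumptions and are not taken from the lecture or from any published pathway model.

```python
def simulate(k_syn, k_deg, x0=0.0, dt=0.01, t_end=50.0):
    """Integrate dx/dt = k_syn - k_deg * x with forward Euler steps."""
    steps = int(round(t_end / dt))
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * (k_syn - k_deg * x)  # one Euler step
        trajectory.append(x)
    return trajectory


# Hypothetical parameters: a baseline system and a "perturbed" one in which
# degradation is twice as fast.
baseline = simulate(k_syn=2.0, k_deg=0.5)
perturbed = simulate(k_syn=2.0, k_deg=1.0)

# For this model the steady state is k_syn / k_deg.
print(f"baseline steady state  ~ {baseline[-1]:.3f}")
print(f"perturbed steady state ~ {perturbed[-1]:.3f}")
```

Even in this one-variable case the model predicts how a perturbation (a changed degradation rate) shifts the steady state, which is the kind of question the quantitative pathway models on this slide address at much larger scale.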
Protein-protein interaction networks in the Drosophila melanogaster cell
Giot et al, Science, 2003
Recent example
• miRNAs discovered in 1993 in C. elegans
– Aberration?
– One of those strange worm phenomena?
– Then was found to be conserved in other organisms
• Bioinformatics methods used
– Alignments
– RNA secondary structure & free energy
– Scoring
– Concept of a pipeline
• Note interplay between wet and dry labs
MicroRNA background
• MicroRNAs (miRNAs)
– Short (21-22 nt) sequences
– Involved in regulation by translation inhibition
– Tend to be tissue- or developmental stage-specific
– Similar in some ways to siRNAs
• Long primary miRNAs (pri-miRNA, possibly 1000s of nt) transcribed from miRNA gene by RNA pol II or RNA pol III
• Pre-miRNA (70 nt) created from pri-miRNA by Drosha and Pasha
• Mature miRNA created from pre-miRNA by Dicer
• Perfect or near-perfect target complementarity leads to transcript degradation
– lin4, let7: first miRNA and targets discovered (in C. elegans)
– Conserved across species
– 50% of cases found within introns of genes (also found in protein-coding and intergenic regions)
– miRNA genes often found to be clustered, transcribed as polycistrons
[Figure: miRNA and 3’ UTR; adapted from Li et al, Mamm Genome 2009]
Basic research questions
1. How can we identify new miRNAs?
– Initially done experimentally by direct cloning of short RNA molecules
– Results dominated by a few highly expressed miRNAs
2. How can we find their target sites?
3. How are miRNA genes regulated?
• Discover miRNAs in Drosophila
• Discover miRNA targets in Drosophila
• Discover miRNA targets in mammals
Note general methodology
• Formulate hypothesis
• Develop model incorporating background knowledge
• Run analysis
• Validate results
• Refine hypothesis/model
1. Identifying new miRNAs in Drosophila
• miRNAs created by Dicer from pre-miRNA
• pre-miRNA: ~70 nt which forms a long hairpin-shaped stem-loop
• Pre-miRNA, miRNA conserved across species
• Low (strongly negative) minimum folding free energy, i.e. a stable hairpin
• More differences are allowed in the loop region than in the stem
Finding new miRNAs: miRseeker
Identify conserved genomic regions (between Drosophila melanogaster and Drosophila pseudoobscura)
Identify and rank stem-loop structures (look at both forward and reverse complement of sequence)
Evaluate pattern of divergence of potential miRNAs
Add evidence from a third organism (Anopheles gambiae)
Lai et al, Genome Biology, 2003
24 Drosophila pre-miRNAs
Identifying conserved genomic regions
• Align repeat-masked D. melanogaster genomic contigs with D. pseudoobscura contigs
• Eliminate all annotated sequences:
– Remove exons, transposable elements, snRNA, snoRNA, tRNA, rRNA
• 51.3/90.2 Mb of intronic and intergenic sequence aligned
Identify stem-loop structures
• RNA secondary structure program (MFOLD) used to detect stem-loop structures
– Look at longest helical (paired) arm
– Calculate free energy of arm
– Penalize internal loops of increasing size
– Penalize asymmetric loops and bulged nucleotides
Lai et al, Genome Biology, 2003
Evaluation of stem-loops
Validation
• Using Northern blots
• Of the 124 top hits
– 18 are reference set members
– 24 were validated
– 14 were false positives
• Expression profiles and abundance of computationally derived miRNAs much more heterogeneous than those discovered experimentally
• Estimated that Drosophilid genomes may contain ~110 miRNAs
Reference set (15%)
Validated (TP) (19%)
Not valid (FP) (11%)
Untested (55%)
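The chart percentages can be reproduced from the counts given for the 124 top hits. The short check below is only arithmetic on those slide numbers; the dictionary keys are labels chosen here for readability.

```python
# Counts reported on the slide for the 124 top-ranked candidates.
total_top_hits = 124
counts = {
    "reference set": 18,
    "validated (TP)": 24,
    "not valid (FP)": 14,
}
# Everything not accounted for above was never tested on a Northern blot.
counts["untested"] = total_top_hits - sum(counts.values())

percentages = {name: round(100 * n / total_top_hits) for name, n in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {counts[name]} of {total_top_hits} ({pct}%)")
```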
Computational approach summary
• Sequence/structure conservation-based
– Heavily dependent on use of conservation to filter out “uninteresting” hairpins
• Machine-learning (SVM, HMM, NB)
– Feature classifiers that distinguish between a positive and negative training set
• Experimental data-driven
– Next generation “deep sequencing”
2. Identifying miRNA targets
• Background knowledge:
– Often in the 3’ UTR (unlike in plants, where they are predominantly in the coding region)
– The first 8-10 nt are more important in determining binding than the last 12-14
– Tend to be less complementary to their targets than plant miRNAs
– Target sites tend to be conserved across species
Pipeline to identify miRNA targets in Drosophila - miRanda
Enright et al, Genome Biology, 2003
Find complementary sequence matches in 3’ UTRs (modified Smith-Waterman algorithm)
Calculate free energy (stability) of miRNA/UTR binding (ΔG, kcal/mol)
Estimate evolutionary conservation (sequence conservation; relative positioning within the 3’ UTR)
73 known Drosophila miRNAs
Sequence matching: problems
• miRNAs are very small (21-22 nt)
– Enormous number of potential targets with complementary sequence
– BLAST does not scale.
• Low-complexity sequences
– Signal to noise problem
• Standard sequence analysis packages generally not applicable
– Looking for complementarity, not similarity
• i.e. A:U G:C not A:A G:G etc.
– Wobble pairing permitted
• G:U and U:G base pairs
• Small number of known cases to work with
Sequence matching algorithm
• Modified Smith-Waterman algorithm
– Instead of looking for matching nucleotides, finds complementary nucleotides
– Allows GU ‘wobble’ pairs (but downweight them)
– Scoring system weighted so that complementarity to the first 11 bases of the miRNA is more greatly rewarded
– Non-complementarity also more heavily penalized in that region
– Known miRNAs bind 3’ UTRs at multiple sites
• Additive scoring system for all target sites predicted in a UTR
• Calculate free energy of binding (Vienna RNA package)
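The complementarity-based alignment described above can be sketched as a small local-alignment routine. This is a simplified illustration, not the miRanda implementation: the score values, the linear gap penalty, the weighting factor and all function names are assumptions chosen for the example, and the free-energy step (done with the Vienna RNA package in the original pipeline) is left out.

```python
# Illustrative scores (assumptions, not the published miRanda parameters).
PAIR_SCORES = {
    ("A", "U"): 5, ("U", "A"): 5,
    ("G", "C"): 5, ("C", "G"): 5,
    ("G", "U"): 2, ("U", "G"): 2,  # wobble pairs allowed but downweighted
}
MISMATCH = -3
GAP = -8               # simple linear gap penalty (assumption)
WEIGHTED_REGION = 11   # first 11 miRNA bases count more (from the slide)
REGION_WEIGHT = 2      # hypothetical multiplier for that region

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def pair_score(mirna_base, utr_base, mirna_index):
    """Score one miRNA/UTR base pair; rewards and penalties are larger near the 5' end."""
    score = PAIR_SCORES.get((mirna_base, utr_base), MISMATCH)
    if mirna_index < WEIGHTED_REGION:
        score *= REGION_WEIGHT
    return score


def best_complementarity_score(mirna, utr):
    """Best local score for pairing mirna (5'->3') against utr (5'->3')."""
    # Pairing is antiparallel, so the UTR is read 3'->5' during alignment.
    target = utr[::-1]
    rows, cols = len(mirna) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + pair_score(mirna[i - 1], target[j - 1], i - 1)
            table[i][j] = max(0, diagonal, table[i - 1][j] + GAP, table[i][j - 1] + GAP)
            best = max(best, table[i][j])
    return best


mirna = "UGAGGUAGUAGGUUGUAUAGUU"            # let-7 sequence, used only as an example
perfect_site = reverse_complement(mirna)   # an artificial, fully complementary site
print(best_complementarity_score(mirna, "AAAAAA" + perfect_site + "CCCCCC"))
```

A standard similarity search would reward the miRNA aligned against a copy of itself; here that pairing scores poorly, while the reverse complement scores highest, which is the change of question (complementarity, not similarity) that the slide emphasizes.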
Evolutionary conservation
• Used conservation as a way of keeping only the most likely miRNA target candidates
• Used Drosophila pseudoobscura and Anopheles gambiae as closely related species:
– Required >=80% sequence similarity of target site with D. pseudoobscura
– Required >=60% seq id with A. gambiae
• Also, require that the location of the target site in the UTR is equivalent
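A conservation filter of this kind reduces to computing percent identity over an aligned target site and comparing it with a threshold. The sketch below assumes the two site sequences are already aligned and of equal length (with "-" for gaps); the function names and the example sequences are hypothetical, and the check on equivalent position within the UTR is not modeled.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns; columns that are gaps in both are ignored."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def passes_conservation(site, ortholog_site, threshold):
    """True if the orthologous site is at least `threshold` percent identical."""
    return percent_identity(site, ortholog_site) >= threshold


# Thresholds from the slide; the sequences are made up for illustration.
PSEUDOOBSCURA_THRESHOLD = 80.0
GAMBIAE_THRESHOLD = 60.0

mel_site = "ACGUACGUAC"
pse_site = "ACGUACGUAA"   # 9 of 10 columns identical
ano_site = "ACGAAAGUAA"   # 7 of 10 columns identical

print(passes_conservation(mel_site, pse_site, PSEUDOOBSCURA_THRESHOLD))
print(passes_conservation(mel_site, ano_site, GAMBIAE_THRESHOLD))
```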
Control sequences
• 100 sets of 73 random miRNAs generated
– Conserved D. melanogaster miRNA nucleotide frequencies
• Analysis run independently for each set
• Results and counts averaged over all 100 sets
• Overall FP rate: 35%
– Number of random hits / number of “real” hits
• If only targets that have ≥2 conserved sites in a UTR are counted, the FP rate drops to 9%
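The control procedure can be sketched as follows: draw random sequences with the same nucleotide frequencies as the real miRNAs, run the same analysis on each random set, and divide the average number of random hits by the number of real hits. The function names, the two example miRNA sequences, and the hit counts passed to `false_positive_rate` are hypothetical; only the definition of the FP rate and the figure of 100 sets come from the slide.

```python
import random
from collections import Counter


def nucleotide_frequencies(sequences):
    """Overall base frequencies across a collection of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def random_mirna_set(real_mirnas, rng):
    """One control set: same number and lengths as the real miRNAs,
    with bases drawn from the real nucleotide frequencies."""
    freqs = nucleotide_frequencies(real_mirnas)
    bases = sorted(freqs)
    weights = [freqs[base] for base in bases]
    return ["".join(rng.choices(bases, weights=weights, k=len(m))) for m in real_mirnas]


def false_positive_rate(random_hit_counts, real_hit_count):
    """Average number of random hits divided by the number of 'real' hits."""
    mean_random_hits = sum(random_hit_counts) / len(random_hit_counts)
    return mean_random_hits / real_hit_count


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
real_mirnas = ["UGAGGUAGUAGGUUGUAUAGUU", "UCCCUGAGACCUCAAGUGUGA"]  # stand-ins for the 73
control_sets = [random_mirna_set(real_mirnas, rng) for _ in range(100)]

# Made-up hit counts, purely to show the arithmetic.
print(false_positive_rate([30, 40], 100))
```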
Validation
• Initial validation: application to experimentally verified targets
– 9/10 known target genes for three miRNAs correctly identified
– BUT biased in favor of this since the method is based on the background knowledge derived from these
• For 73 Drosophila miRNAs, 701 predicted target genes (out of ~9,805/13,500 genes in the genome)
– Many transcription factors and other genes involved in development
→ One-to-many and many-to-one relationships
Pipeline to identify mammalian targets - TargetScan
Lewis et al, Cell, 2003
Find “seed matches” in the 3’ UTR (match bases 2-8 of the miRNA exactly)
Extend the seed matches
Evaluate the folding free energy
79 conserved mammalian miRNAs
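The first step of this pipeline, finding exact seed matches, can be sketched in a few lines: take bases 2-8 of the miRNA, reverse-complement them, and search the 3’ UTR for that 7-mer. The function names and the example UTR are hypothetical, and the later steps (extending the match and evaluating folding free energy) are not shown.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_match_positions(mirna, utr):
    """0-based start positions in utr of exact matches to miRNA bases 2-8."""
    seed = mirna[1:8]                # bases 2-8 in 1-based numbering
    site = reverse_complement(seed)  # what the UTR must contain to pair with the seed
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)  # allow overlapping occurrences
    return positions


mirna = "UGAGGUAGUAGGUUGUAUAGUU"        # let-7 sequence, used only as an example
utr = "AAACUACCUCAGGGCUACCUCUU"         # made-up UTR fragment with two seed sites
print(seed_match_positions(mirna, utr))
```

Searching for a short exact string is cheap enough to run across every UTR in a genome, which is why the seed step comes first and the more expensive steps are applied only to the surviving candidates.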
Controls
• Shuffled sequences - have fewer matches than the real miRNA
• Preserve all relevant compositional features
– Expected frequency of seed matches to the UTR dataset
– Expected frequency of matching to the 3’ end of the miRNA
– Observed count of seed matches in the UTR dataset
– Predicted free energy of the RNA duplex
• Each shuffled control sequence also has the same length and base composition as the parent
• Signal:noise ratio = 3.2:1
– 5.7 “real” targets vs. 1.8 targets found with control sequences
– Approximately a 31% FP rate
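A shuffled control that keeps the parent's length and base composition is straightforward to sketch, as is the signal:noise arithmetic. The code below covers only those two points; it does not attempt the additional constraints listed above (matched seed-match frequencies, matched duplex free energy), and the function name and seed value are assumptions.

```python
import random
from collections import Counter


def shuffled_control(mirna, rng):
    """Shuffle the bases of a miRNA: same length, same base composition."""
    bases = list(mirna)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(2003)  # fixed seed only so the sketch is reproducible
parent = "UGAGGUAGUAGGUUGUAUAGUU"  # let-7 sequence, used only as an example
control = shuffled_control(parent, rng)
print(parent, Counter(parent))
print(control, Counter(control))

# Numbers from the slide.
real_targets = 5.7
control_targets = 1.8
signal_to_noise = real_targets / control_targets   # about 3.2
noise_fraction = control_targets / real_targets    # about 0.32
print(f"signal:noise = {signal_to_noise:.1f}:1, noise fraction = {noise_fraction:.2f}")
```

The directly computed noise fraction is close to 0.32; the slide's figure of approximately 31% presumably comes from inverting the rounded 3.2:1 ratio.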
Validation
• Luciferase reporter assays used to test 15 (out of >400) predicted targets
– Experimental support for 11/15
• Mammalian miRNA targets have diverse functions (unlike plants, where miRNAs almost exclusively involved in developmental processes)
– Enriched in developmental function, transcription
– Also in nucleic acid binding and transcriptional regulator activity
Examine results
• Added in dog and chicken conservation
• Looked at flanking sequence of control and real matches in the UTRs
[Figure: anchoring As; Lewis et al, Cell, 2005]
Modify model - TargetScanS
• Targets identified by conserved complementarity to nucleotides 2-7 of the miRNA
• A conserved Adenosine at nucleotide 1
• Often, a conserved Adenosine at nucleotide 8
• Don’t look past nucleotide 8 anymore
• Don’t calculate free energy anymore
• Potentially, thousands of mammalian targets
Not the end of the story…
• Many programs are claimed to be able to discover miRNA targets in mammals
– TargetScanS - Lewis et al, MIT
– miRanda - Enright et al, SKI
– DIANA-MicroT - Hatzigeorgiou et al, UPenn
– rna22 - Rigoutsos et al, IBM
– PicTar - Rajewsky et al, NYU
– RNAhybrid - Rehmsmeier et al, Bielefeld
• Different algorithms/models give different results
User frustration
• Anil Jeqqa, posting on the miRNA Nature forums, reports:
– “I was looking at and comparing the miRNA target gene predictions from five commonly used algorithms, viz., miRanda, targetScanS, PicTar, microT and mirTarget. Surprisingly, there is so little overlap! And I also did a comparison with the entries in TarBase (that houses about 100 experimentally validated miRNA-gene pairs) and surprisingly almost all of the five prediction algorithms perform quite badly.” (from the miRNA forum on the Nature forums, 27 August, 2007)
Evaluation comparison
Alexiou et al, Bioinformatics, 2009
A single accurate algorithm is better than a combination of predictions. Better specificity of a combination is achieved at a higher price in sensitivity
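The trade-off in that conclusion can be illustrated with a toy example using set operations. Everything here is made up for illustration: the gene identifiers, the two prediction sets, and the validated set are hypothetical, and precision (the fraction of predictions that are validated) is used as a simple stand-in for specificity, which may differ from the definition used in the cited evaluation.

```python
def sensitivity(predicted, validated):
    """Fraction of validated targets that were predicted."""
    return len(predicted & validated) / len(validated)


def precision(predicted, validated):
    """Fraction of predictions that are validated targets."""
    return len(predicted & validated) / len(predicted) if predicted else 0.0


# Hypothetical data: six validated targets and predictions from two tools.
validated = {"g1", "g2", "g3", "g4", "g5", "g6"}
tool_a = {"g1", "g2", "g3", "g4", "g7", "g8"}
tool_b = {"g1", "g2", "g5", "g9"}

combinations = {
    "tool A": tool_a,
    "tool B": tool_b,
    "intersection": tool_a & tool_b,  # predicted by both tools
    "union": tool_a | tool_b,         # predicted by either tool
}
for name, predicted in combinations.items():
    print(f"{name}: sensitivity={sensitivity(predicted, validated):.2f}, "
          f"precision={precision(predicted, validated):.2f}")
```

In this toy case, requiring agreement between the tools removes the false positives but also discards most of the true targets, while taking the union recovers more targets at the cost of more false positives.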
Future directions
• Advent of experimental data gives excellent benchmarking opportunities as well as providing new data to refine hypotheses
– SILAC: measures the levels of many proteins concurrently
• Baek et al Nature 2008
• Selbach et al Nature 2008
– HITS-CLIP: identification and sequencing of target sites for miRNAs
• Chi et al Nature 2009
• Look for target sites outside the 3’ UTR
• Combinatorial effect of miRNAs
– Coordinated regulation by multiple miRNAs (which may also be co-transcribed in the same pri-miRNA)
• See review by Bartel (Cell 2009) for a discussion of other challenges
Important points
• This type of analysis follows the same basic procedure as a ‘normal’ wet lab scientific experiment
– Background information
– Hypothesis/model
– Controls
– Validation
– Modify model and repeat
• Many of the techniques used here are well-known, some are modified
• Availability of complete genomes, scalable algorithms and computational resources crucial to this type of analysis
Knowledge of the biology informs the bioinformatics