Introduction to Databases, Part 2
Shifra Ben-Dor, Irit Orr
And now, for the molecules and databases...
• DNA
• RNA
• Protein
DNA sequences
• Genes are encoded in genomic sequences.
• Genes are transcribed into pre-mRNAs (including coding, intronic, and 5' and 3' untranslated regions).
• Pre-mRNAs are spliced (introns removed) into mRNAs, which are translated into proteins.
• mRNAs are copied to cDNAs.
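As a rough illustration of the steps above, the following Python sketch follows a made-up two-exon gene from genomic DNA to pre-mRNA, spliced mRNA, protein, and cDNA. The sequence, the exon coordinates, the function names, and the partial codon table are all invented for this example; real genes are far longer and real tools use the full genetic code.

```python
# Toy gene: exon 1 + intron (GT...AG) + exon 2. All values are invented.
GENOMIC = "GGCATGAAA" + "GTAAGTCCCAG" + "GCTTGATAAGG"
EXONS = [(0, 9), (20, 31)]  # 0-based, end-exclusive coordinates

# Deliberately partial codon table: only the codons this toy mRNA uses.
CODONS = {"AUG": "M", "AAA": "K", "GCU": "A",
          "UGA": "*", "UAA": "*", "UAG": "*"}


def transcribe(dna):
    """Pre-mRNA has the sense-strand sequence with U in place of T."""
    return dna.replace("T", "U")


def splice(pre_mrna, exons):
    """Keep the exon slices and drop the introns between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODONS[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def reverse_transcribe(mrna):
    """First-strand cDNA: the reverse complement of the mRNA, written as DNA."""
    complement = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(mrna))


pre_mrna = transcribe(GENOMIC)
mrna = splice(pre_mrna, EXONS)
protein = translate(mrna)
cdna = reverse_transcribe(mrna)
print(mrna, protein, cdna)
```

Note that the 5' and 3' untranslated regions remain in the mRNA and the cDNA but do not appear in the protein.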
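[Figure: gene structure at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG, stop codon, and polyA site, TTS. Pre-mRNA: exons 1-4 with ATG, stop codon, and polyA site. mRNA: cap, 5' UTR, CDS (ATG to stop, exons 1-4), 3' UTR, polyA. Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.]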
International DNA databases
GenBank at NCBI: http://www.ncbi.nlm.nih.gov/
EMBL at EBI: http://www.ebi.ac.uk/embl/
DDBJ in Japan: http://www.ddbj.nig.ac.jp/
Data sources for DNA databases
• Direct scientist submission
• Genome sequencing labs and groups
• Scientific literature
• Patent applications
• EMBL, GenBank, and DDBJ collaborate to collect all sequence data reported around the world.
International DNA databases
All of these databases:
• Have official releases every 2-3 months.
• Have weekly (or daily) updates.
• Are divided into sublibraries for easier searching.
DNA database divisions
• PRI - primate (human, monkey)
• ROD - rodent (mouse, rat)
• MAM - other mammalian (bovine, cat)
• VRT - other vertebrate (chicken)
• INV - invertebrate
• PLN - plant, fungal, and algal
• BCT - bacteria
• VRL - viruses
• PHG - bacteriophage
• SYN - synthetic (plasmids, vectors)
• UNA - unannotated sequences
• PAT - patent sequences
• EST - Expressed Sequence Tags
• STS - Sequence Tagged Sites
• GSS - Genome Survey Sequences
• HTG - High Throughput Genomic Sequences
• HTC - High Throughput cDNA Sequences
Short Read and Trace Archives
The output of large-scale sequencing projects and next-generation sequencing is stored in separate databases. NCBI is phasing out the SRA, but the data will be available in GEO, the database for microarray results.
Genomic databases
• Specialized resources that are:
  – Species specific
  – Sequencing technique specific
• Display whole chromosomes (not a specific sequence).
Sources of mRNAs
• Experimental
  – Clone new gene
  – Clone gene from database
  – 2-hybrid system, RNA-Seq...
• Database
  – "Typical" cDNA
  – Full-length cDNA
  – EST
[Figure: an mRNA (5' mG cap, AAAA tail) compared with a full-length cDNA and a typical cDNA, each shown with a TTTT primer.]
RefSeq, NCBI (Reference Sequence database)
✵ Definition
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
RefSeq from NCBI
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with reviewed records indicated
RefSeq record status
• The RefSeq COMMENT block indicates the status of the record and the GenBank sequence data that was used to provide the record.
• In addition, the COMMENT may identify a collaboration which supplied the defining sequence information for the genome, gene, or protein.
The level of curation may differ between different collaborating groups.
RefSeq
✵ Status Codes: RefSeq records are provided with a status code, which indicates the level of review a RefSeq record has undergone.
• Reviewed*
• Provisional
• Predicted
• Genome Annotation
• Validated*
• Model
• Inferred
• WGS
* Curated
STATUS: Definition
REVIEWED: The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes reviewing available sequence data and frequently also includes a review of the literature and other sources of information.
VALIDATED: The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.
PROVISIONAL: The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.
PREDICTED: The RefSeq record is predicted and has not been subject to individual review. The transcript may represent an ab initio prediction or may be partially supported by other transcript data; in both cases, the protein is predicted.
INFERRED: The RefSeq record is inferred by genome sequence analysis. There is no same-organism experimental support for the full extent of the sequence; there may be some level of support by homology.
MODEL: The RefSeq record is predicted by genome sequence analysis. The record may represent an ab initio prediction, or may have some level of transcript or homology support.
GENOME ANNOTATION: This identifies RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing and are not subject to individual review or revision between builds.
WGS: The RefSeq record represents a collection of whole genome shotgun (WGS) sequences. This status code is applied to genomic records.
Accession format    Molecule type
NC_123456           Complete genome, complete chromosome, complete sequence
NG_123456           Genomic region
NM_123456           mRNA
NR_123456           Non-coding RNA
NP_123456           Protein
NT_123456           Genomic contig (from BACs)
NW_123456           Genomic contig (from WGS)
XM_123456           mRNA (taken from genomic seq)
XR_123456           RNA (taken from genomic seq)
XP_123456           Protein (taken from genomic seq)
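Because the two-letter prefix carries the molecule type, an accession can be classified with a simple lookup. The sketch below is only an illustration of the table above: the function name and the regular expression are hypothetical, and accepting an optional ".version" suffix (as in NM_123456.2) is an assumption that goes beyond what the table shows.

```python
import re

# Prefix meanings copied from the accession table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome, or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic seq)",
    "XR": "RNA (taken from genomic seq)",
    "XP": "protein (taken from genomic seq)",
}

# Two letters, underscore, digits; the optional ".N" version is an assumption.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_123456", "XP_123456.2", "AK123456"]:
    print(acc, "->", classify_refseq(acc))
```

An accession without the underscore, such as the GenBank-style AK123456, is not a RefSeq accession and returns None here.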
What is the difference between RefSeq and GenBank? GenBank is:
• An archival database that includes publicly available DNA sequences submitted by individual laboratories and large-scale sequencing projects.
• Accession numbers are assigned to these submitted sequences.
• Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL), and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage.
• As an archival database, GenBank is very redundant for some loci.
• Sequence records are owned by the original submitter and cannot be altered by a third party.
What is the difference between RefSeq and GenBank? RefSeq is:
• Sequences are derived from GenBank and provide non-redundant, curated data.
• Records represent current knowledge.
• RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.
• Some records include additional sequence information that was never submitted to an archival database but is available in the literature.
• Some sequence records are provided through collaboration, and thus may not be available in any one GenBank record.
• RefSeq sequences are not submitted primary sequences.
Various High Throughput Collections: NEDO, DKFZ, HRI, Genoscope
• Full-length cDNA libraries from various tissues were subtracted and normalized to reduce redundancy.
• Clones were end-sequenced to further reduce redundancy.
• Whole inserts were sequenced to get mRNA sequences.
• [KIAA, done by Kazusa, was a project for long cDNAs (over 4 kb), which may not be full-length.]
MGC - Mammalian Gene Collection
The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human, mouse, rat, and cow gene. Zebrafish and Xenopus have their own projects (ZGC and XGC).
MGC produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue.
5' EST reads were generated from each library. Several algorithms were applied to select putative full-ORF clones. Targeted cloning or synthesis was used to finish.
Sources of mRNAs
Source                                Accession numbers
• Individual labs                     various
• RefSeq                              XX_123456
Full-length sequencing projects:
• Riken, NEDO (FLJ), HRI,             AK, CR
  DKFZ, Genoscope, [KIAA]...          [AB, D]
• MGC                                 BC, CT
Sources of mRNAs
• Experimental
  – Clone new gene
  – Clone gene from database
  – 2-hybrid system
• Database
  – "Typical" cDNA
  – Full-length cDNA
  – EST
RNA
RNA, cDNA, and ESTs
[Figure: an mRNA with exon 1, exon 2, and exon 3, the corresponding cDNA clone, and two ESTs derived from it. Adapted with permission from Adam Sartiel.]
Uses of ESTs
- prediction of coding regions
- detection of alternative splicing
- clustering to form "genes"
Problems with clustering:
- incomplete coverage breaks genes up
- gene families
Problems with ESTs
- low copy number genes
- rare tissues
- mistakes
- enrichment of 3' ends of genes
- incomplete coverage of genes
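To see why incomplete coverage breaks a gene into several clusters, here is a toy sketch of clustering by shared subsequences. It is not how production EST clustering works (real pipelines rely on sequence alignment and tolerate sequencing errors); the transcript, the k-mer length of 8, and the function names are all made up for illustration.

```python
def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Single-linkage clustering: ESTs sharing at least one k-mer are joined."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST that contained it
    for idx, est in enumerate(ests):
        for kmer in kmers(est, k):
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# One invented 60-base transcript, sampled by three ESTs.
TRANSCRIPT = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAGTTCACGTACGGATCCTTAGCA"
ests = [
    TRANSCRIPT[0:25],   # 5' region
    TRANSCRIPT[15:40],  # overlaps the first EST by 10 bases
    TRANSCRIPT[45:60],  # 3' end, no overlap with the other two
]
clusters = cluster_ests(ests)
print(clusters)
```

All three ESTs come from the same gene, yet the third forms its own cluster because no read covers bases 40 to 44. Gene families cause the opposite problem: related genes share sequence and may be merged into one cluster.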
Entrez Gene at NCBI
With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information.
Entrez Gene - A database for gene-specific information.
It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.
The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable, and tracked integers as identifiers.
Entrez Gene at NCBI
The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains, and external databases) is updated as new information becomes available.
Entrez Gene data is used by other NCBI resources such as BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS, and NCBI's genome annotation pipeline.
Data reliability in databases
The huge amount of data collected in databases presents a lot of problems:
– Data accuracy
– Sequence redundancy
– Inconsistent nomenclature
– Inaccurate annotation
– Sequence contamination (vectors, bacterial)
Data reliability in databases
• The database staff notify the authors that an error (or contamination) was detected in their sequence entry.
• However, it takes time to correct the data.
• Meanwhile the error propagates, because many of the proteins in the protein database are translated from the DNA sequence database.
Data reliability in databases
• A lot of the sequences in the database are quite "old". They have not been updated since they were submitted, even though technology and data have advanced considerably.
Gene symbols
Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers.
Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.
Ideally symbols should be no longer than six characters in length.
Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
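The two mechanical rules above (character set and length) can be checked automatically. The sketch below encodes only what this slide states; the function name is hypothetical, and actual approved symbols may follow further rules that are not captured here.

```python
import re

# Upper-case Latin letters, optionally combined with Arabic numerals,
# with at least one letter present.
_SYMBOL_PATTERN = re.compile(r"(?=.*[A-Z])[A-Z0-9]+")


def check_symbol(symbol, max_length=6):
    """Return a list of guideline problems; an empty list means none found."""
    problems = []
    if _SYMBOL_PATTERN.fullmatch(symbol) is None:
        problems.append("use upper-case Latin letters, optionally with Arabic numerals")
    if len(symbol) > max_length:
        problems.append("ideally no longer than %d characters" % max_length)
    return problems


for candidate in ["ABCA1", "nrg1", "NEUREGULIN1"]:
    print(candidate, check_symbol(candidate))
```

The length limit is treated as a guideline, not a hard rule, which is why it is reported as a problem to review instead of raising an error.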
HUGO Gene Nomenclature Committee
• This committee is responsible for the approval of a unique symbol for each gene.
• It also designates a longer and more descriptive name.
• The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.
• However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database. (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl)
Gene Symbols
Symbol: ABCA1
Full name: ATP-binding cassette, sub-family A (ABC1), member 1
Cytogenetic location: 9q31
MIM number: 600046
PubMed ID: 8088782
Taxonomy Databases
• An international effort is under way across all sequence databases to create a unified taxonomic tag for submitted sequences.
Problem: each sequence depositor gives "his" name for the species.
Solution: unified taxonomy ID
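A unified taxonomy ID works like a lookup key: many depositor-supplied names map to one identifier. The minimal sketch below uses a hand-written table for illustration only. The IDs 9606 (Homo sapiens) and 10090 (Mus musculus) are the NCBI Taxonomy identifiers for those species; the list of name variants and the function name are invented.

```python
# Hypothetical table of name variants; a real resource holds far more.
TAXONOMY_IDS = {
    "homo sapiens": 9606,
    "h. sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "m. musculus": 10090,
    "mouse": 10090,
}


def taxonomy_id(name):
    """Map a free-text species name to its taxonomy ID, or None if unknown."""
    normalized = " ".join(name.lower().split())  # ignore case and extra spaces
    return TAXONOMY_IDS.get(normalized)


for name in ["Human", "Homo  sapiens", "MOUSE", "yeti"]:
    print(repr(name), "->", taxonomy_id(name))
```

Once every record carries the ID, a search by ID retrieves all of a species' sequences regardless of which name the depositor typed.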
Protein databases
• There are many different protein databases containing different types of information:
  – Primary amino acid sequence
  – Secondary structure
  – 3D structure
  – Protein family domains
  – Consensus active sites
Sources of Protein
• Proteins that have been worked on experimentally
• mRNA whose product has been worked on experimentally (no actual protein sequencing done)
• Translated DNA (mRNA) sequences
Protein Primary Sequence Databases
• Usually contain a description of the protein entry (annotation), the amino acid sequence, and sometimes links to other related databases.
• Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases.
UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
• The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-reference.
• The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.
• The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
Swiss-Prot Database (primary database)
• Swiss-Prot annotation includes:
  – Description of protein function
  – Protein domain structure
  – Post-translational modifications
  – Protein variants
• Sequence entries are composed of different line types, each with their own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.
Swiss-Prot Database
Swiss-Prot differs from other protein databases by the following criteria:
• Annotation
• Minimal redundancy
• Integration with other databases
Swiss-Prot Database
Annotation
In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.
The core data consists of the sequence, the citation information (bibliographical references), and the taxonomic data (description of the biological source of the protein).
The annotation consists of the description of:
• Function(s) of the protein
• Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
• Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, etc.
• Secondary structure
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(s) of/in the protein
• Sequence conflicts, variants, etc.
Swiss-Prot Database
To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.
Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.
Swiss-Prot Database
Minimal Redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot, they try as much as possible to merge all these data so as to minimize the redundancy of the database.
If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
Swiss-Prot Database
Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences, and protein tertiary structures) as well as with specialized data collections.
Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.
TrEMBL database
• TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.
• TrEMBL contains entries not yet integrated in Swiss-Prot.
• Combines information not in other databases, like microarray data, population variation studies, proteomics
• Powerful querying options
• Only for human proteins
NR database (primary databases from NCBI)
• The NR protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to PIR, Swiss-Prot, PRF, and PDB (sequences from solved structures).
Data reliability in Protein databases
• About 30% of the proteins in the databases have erroneous sequences due to:
  – Missing exons in the DNA translation.
  – Introns mistakenly translated.
• Another common problem is the assigning of functions to "new" proteins, based on sequence similarity.
Data reliability in Protein databases
• For example:
  – Protein A is similar to Protein B.
  – Protein B annotation is based on Protein A annotation (which has an error).
  – Annotation of Protein A is corrected by the group working on it. This correction does not appear or reflect in Protein B annotation.
  – When Proteins C and D are also based on the erroneous annotation on B, the problem…
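The chain in the example above can be modeled as records that remember where their annotation was copied from. In this hypothetical sketch the record layout, the "kinase" and "phosphatase" labels, and the function name are invented; the point is only that correcting A leaves every record downstream of it stale unless the copy links are followed.

```python
# Hypothetical records: each annotation notes the record it was copied from.
records = {
    "A": {"function": "kinase", "copied_from": None},
    "B": {"function": "kinase", "copied_from": "A"},
    "C": {"function": "kinase", "copied_from": "B"},
    "D": {"function": "kinase", "copied_from": "B"},
}


def inherited_from(records, root):
    """Return every record whose annotation traces back to root."""
    found = []
    frontier = [root]
    while frontier:
        current = frontier.pop(0)
        for name in sorted(records):
            if records[name]["copied_from"] == current:
                found.append(name)
                frontier.append(name)
    return found


# The group working on Protein A corrects its entry...
records["A"]["function"] = "phosphatase"

# ...but nothing updates the records that copied the old annotation.
stale = [
    name
    for name in inherited_from(records, "A")
    if records[name]["function"] != records["A"]["function"]
]
print(stale)
```

In real databases the copy links are usually not recorded at all, which makes stale annotations even harder to find than in this sketch.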
Text searching pitfalls
• It finds exactly what you type (try pseudogene vs. psuedogene).
• Older records may have different annotation, from gene names on…
• human vs. Homo sapiens
• Gene symbols vs. full gene name (for example, neuregulin vs. nrg1)
• Most sites use Boolean operators (AND, OR, BUT NOT).
• You can do (or add) a field-specific tag, but each site has a different way of adding it to a search. For example, NCBI uses square brackets [ ].
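As a sketch of how field tags and Boolean operators combine in an NCBI-style query, the code below only builds the query string and a search URL; it does not contact NCBI. The helper names are hypothetical. The field names Organism and Gene Name and the E-utilities esearch address are given as commonly documented by NCBI and should be checked against the current NCBI help pages before use.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def clause(text, field=None):
    """Format one clause, quoting multi-word text and adding a [field] tag."""
    quoted = '"%s"' % text if " " in text else text
    return "%s[%s]" % (quoted, field) if field else quoted


def entrez_term(clauses, operator="AND"):
    """Join clauses with one Boolean operator (AND, OR, or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    return (" %s " % operator).join(clauses)


def esearch_url(db, term):
    """Build (but do not send) an esearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = entrez_term([
    clause("homo sapiens", "Organism"),
    clause("NRG1", "Gene Name"),
])
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

Restricting each word to a field avoids some of the pitfalls above: "human" appearing anywhere in a record is not the same as a record whose organism is Homo sapiens. A misspelled term such as psuedogene is still searched exactly as typed.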
Remember:
Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!