Introduction to Databases part 2 - Biological computing

Introduction to Databases part 2

Shifra Ben-Dor, Irit Orr

And now, for the molecules and databases...

•  DNA

•  RNA

•  Protein

DNA sequences

•  Genes are encoded in genomic sequences.

•  Genes are transcribed into pre-mRNAs (including coding, intronic, 5' and 3' untranslated regions).

•  Pre-mRNAs are spliced (introns removed) to give mRNAs, which are translated into proteins.

•  mRNAs are copied to cDNAs.
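The path from genomic sequence to protein can be sketched in a few lines of Python. This is only an illustration: the 29-base "gene", its exon coordinates and the function names are invented for the example, and the sequence is written in the DNA alphabet (as a cDNA would be stored in a database). Only the standard genetic code table is real.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical toy gene: exon 1, one intron (GT...AG), exon 2.
GENOMIC = "GGATGGCC" + "GTAAGTTAGCAG" + "AAATGATTT"
EXONS = [(0, 8), (20, 29)]  # half-open (start, end) coordinates, invented


def splice(genomic, exons):
    """Join the exons, i.e. remove the introns."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = splice(GENOMIC, EXONS)
print(mrna)                # spliced message: 5' UTR, CDS, 3' UTR
print(translate(mrna))     # protein from the spliced message
print(translate(GENOMIC))  # protein if the intron is translated by mistake
```

Translating the unspliced sequence gives a different, wrong protein, which is one way translated entries in a protein database can end up erroneous.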

[Figure: gene structure at three levels. Genomic DNA: promoter, TSS, exons 1-4 with ATG and Stop, polyA site, TTS. Pre-mRNA: exons 1-4 with ATG, Stop and polyA site. mRNA: Cap, 5' UTR, CDS (ATG to Stop, exons 1-4), 3' UTR, PolyA. Modified from Zhang MQ, Nat Rev Genet. 2002 Sep;3(9):698-709.]

International DNA databases

•  GenBank at NCBI
   http://www.ncbi.nlm.nih.gov/

•  EMBL at EBI
   http://www.ebi.ac.uk/embl/

•  DDBJ in Japan
   http://www.ddbj.nig.ac.jp/

Data sources for DNA databases

•  Direct scientist submission

•  Genome sequencing labs and groups

•  Scientific literature

•  Patent applications

•  EMBL, GenBank and DDBJ collaborate to collect all sequence data reported around the world.

International DNA databases

All of these databases:

•  Have official releases every 2-3 months.

•  Have weekly (or daily) updates.

•  Are divided into sublibraries for easier searching.

DNA database divisions

•  PRI - primate (human, monkey)
•  ROD - rodent (mouse, rat)
•  MAM - other mammalian (bovine, cat)
•  VRT - other vertebrate (chicken)
•  INV - invertebrate
•  PLN - plant, fungal, and algal
•  BCT - bacteria
•  VRL - viruses
•  PHG - bacteriophage
•  SYN - synthetic (plasmids, vectors)
•  UNA - unannotated sequences
•  PAT - patent sequences

•  EST - Expressed Sequence Tags
•  STS - Sequence Tagged Sites
•  GSS - Genome Survey Sequences
•  HTG - High Throughput Genomic Sequences
•  HTC - High Throughput cDNA Sequences
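When records are handled in scripts, the three-letter division codes are convenient lookup keys. The sketch below simply copies the list above into a dictionary. The function name and the "unknown division" fallback are made up for the example, and the split into organism-based and other divisions follows the two groups on the slide.

```python
# Division codes as listed on the slide.
ORGANISM_DIVISIONS = {
    "PRI": "primate (human, monkey)",
    "ROD": "rodent (mouse, rat)",
    "MAM": "other mammalian (bovine, cat)",
    "VRT": "other vertebrate (chicken)",
    "INV": "invertebrate",
    "PLN": "plant, fungal, and algal",
    "BCT": "bacteria",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic (plasmids, vectors)",
    "UNA": "unannotated sequences",
    "PAT": "patent sequences",
}
OTHER_DIVISIONS = {
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Sites",
    "GSS": "Genome Survey Sequences",
    "HTG": "High Throughput Genomic Sequences",
    "HTC": "High Throughput cDNA Sequences",
}
DIVISIONS = {**ORGANISM_DIVISIONS, **OTHER_DIVISIONS}


def describe_division(code):
    """Return the description for a division code (case-insensitive)."""
    return DIVISIONS.get(code.strip().upper(), "unknown division")


print(describe_division("rod"))
print(describe_division("HTG"))
```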

Short Read and Trace Archives

The output of large-scale sequencing projects and next-generation sequencing is stored in separate databases. NCBI is phasing out the SRA, but the data will be available in GEO, the database for microarray results.

Genomic databases

•  Specialized resources that are:
   – Species specific
   – Sequencing technique specific

•  Display whole chromosomes (not a specific sequence).

Sources of mRNAs

•  Experimental
   – Clone new gene
   – Clone gene from database
   – 2-hybrid system, RNA-Seq...

•  Database
   – "Typical" cDNA
   – Full-length cDNA
   – EST

[Figure: an mRNA (5' mG cap, AAAA tail) compared with a full-length cDNA and a typical cDNA; labels: TTTT, AAAA, primer.]

RefSeq at NCBI (Reference Sequence database)

✵  Definition

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

RefSeq from NCBI

•  Non-redundancy
•  Explicitly linked nucleotide and protein sequences
•  Updates to reflect current knowledge of sequence data and biology
•  Data validation and format consistency
•  Distinct accession series
•  Ongoing curation by NCBI staff and collaborators, with reviewed records indicated

RefSeq record Status

•  The RefSeq COMMENT block indicates the Status of the record and the GenBank sequence data that was used to provide the record.

•  In addition, the COMMENT may identify a collaboration which supplied the defining sequence information for the genome, gene, or protein.

The level of curation may differ between different collaborating groups.

RefSeq

✵  Status Codes: RefSeq records are provided with a status code, which indicates the level of review a RefSeq record has undergone.

•  Reviewed*
•  Validated*
•  Provisional
•  Predicted
•  Inferred
•  Model
•  Genome Annotation
•  WGS

* Curated

STATUS: Definition

REVIEWED: The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes reviewing available sequence data and frequently also includes a review of the literature and other sources of information.

VALIDATED: The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.

PROVISIONAL: The RefSeq record has not yet been subject to individual review and is thought to be well supported and to represent a valid transcript and protein.

PREDICTED: The RefSeq record is predicted and has not been subject to individual review. The transcript may represent an ab initio prediction or may be partially supported by other transcript data; in both cases, the protein is predicted.

INFERRED: The RefSeq record is inferred by genome sequence analysis. There is no same-organism experimental support for the full extent of the sequence; there may be some level of support by homology.

MODEL: The RefSeq record is predicted by genome sequence analysis. The record may represent an ab initio prediction, or may have some level of transcript or homology support.

GENOME ANNOTATION: This identifies RefSeq records provided by the NCBI Genome Annotation process. These records are provided via automated processing and are not subject to individual review or revision between builds.

WGS: The RefSeq record represents a collection of whole genome shotgun (WGS) sequences. This status code is applied to genomic records.

Accession Format    Molecule Type

NC_123456    Complete genome, complete chromosome, complete sequence
NG_123456    Genomic region
NM_123456    mRNA
NR_123456    Non-coding RNA
NP_123456    Protein
NT_123456    Genomic contig (from BACs)
NW_123456    Genomic contig (from WGS)
XM_123456    mRNA (taken from genomic seq)
XR_123456    RNA (taken from genomic seq)
XP_123456    Protein (taken from genomic seq)
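Because the two-letter prefix carries the molecule type, an accession can be classified without looking up the record. The following is a minimal sketch using only the prefixes in the table above. The function name is hypothetical, and accepting an optional ".version" suffix (as in NM_123456.2) is an assumption added for convenience.

```python
import re

# Prefix -> molecule type, copied from the table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome or sequence",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "NT": "genomic contig (from BACs)",
    "NW": "genomic contig (from WGS)",
    "XM": "mRNA (taken from genomic seq)",
    "XR": "RNA (taken from genomic seq)",
    "XP": "protein (taken from genomic seq)",
}

# Two letters, underscore, digits, optional ".version" (assumed format).
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_123456", "XP_123456.2", "AK123456"]:
    print(acc, "->", classify_refseq(acc))
```

An accession without the underscore, such as a GenBank-style AK or BC number, is not a RefSeq record, and the function returns None for it.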

What is the difference between RefSeq and GenBank?

GenBank:

•  Is an archival database that includes publicly available DNA sequences submitted by individual laboratories and large-scale sequencing projects.

•  Accession numbers are assigned to these submitted sequences.

•  Submitted sequence data is exchanged between NCBI's GenBank, the EMBL Data Library (EMBL) and the DNA Data Bank of Japan (DDBJ) to achieve comprehensive worldwide coverage.

•  As an archival database, GenBank is very redundant for some loci.

•  Sequence records are owned by the original submitter and cannot be altered by a third party.

What is the difference between RefSeq and GenBank?

RefSeq:

•  Sequences are derived from GenBank and provide non-redundant curated data.

•  Records represent current knowledge.

•  RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.

•  Some records include additional sequence information that was never submitted to an archival database but is available in the literature.

•  Some sequence records are provided through collaboration, and thus may not be available in any one GenBank record.

•  RefSeq sequences are not submitted primary sequences.

Various High Throughput Collections: Nedo, DKFZ, HRI, Genoscope

•  Full-length cDNA libraries from various tissues were subtracted and normalized to reduce redundancy.

•  Clones were end-sequenced to further reduce redundancy.

•  Whole inserts were sequenced to get mRNA sequences.

•  [KIAA, done by Kazusa, was a project for long cDNAs (over 4 kb), which may not be full-length]

MGC - Mammalian Gene Collection

The NIH Mammalian Gene Collection (MGC) seeks to identify and sequence a representative full open reading frame (ORF) clone for each human, mouse, rat and cow gene. Zebrafish and Xenopus have their own projects (ZGC and XGC).

MGC produced over 80 cDNA libraries enriched for full-length cDNAs derived from human tissue and cell lines, and mouse tissue.

5' EST reads were generated from each library. Several algorithms are applied to select putative full ORF clones. Targeted cloning or synthesis was used to finish.

Sources of mRNAs                    Accession Numbers

•  Individual labs                  various

•  RefSeq                           XX_123456

Full-length sequencing projects:

•  Riken, Nedo (FLJ), HRI           AK, CR
   DKFZ, Genoscope, [KIAA]...       [AB, D]

•  MGC                              BC, CT

Sources of mRNAs

•  Experimental
   – Clone new gene
   – Clone gene from database
   – 2-hybrid system

•  Database
   – "Typical" cDNA
   – Full-length cDNA
   – EST

RNA

RNA, cDNA, and ESTs

[Figure: mRNA (exon 1, exon 2, exon 3), cDNA, cDNA clone, and two ESTs.]

Adapted with permission from Adam Sartiel

Uses of ESTs

- prediction of coding regions
- detection of alternative splicing
- clustering to form "genes"

Problems with clustering:
- incomplete coverage breaks genes up
- gene families
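The first clustering problem can be made concrete with a toy example. The sketch below is not how any production EST clustering pipeline works. It is a deliberately simple stand-in that puts two ESTs in the same cluster when they share at least one exact k-mer. The transcript, the EST coordinates and the choice k = 8 are all invented for the illustration.

```python
def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group ESTs that share an exact k-mer (union-find on EST indices)."""
    parent = list(range(len(ests)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first EST containing it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq, k):
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Hypothetical transcript and three ESTs read from it.
TRANSCRIPT = "ATGGCCAAGTCTGACCTGATCGTTACGGAGCTTCAAGGTACCTGA"
ESTS = [
    TRANSCRIPT[0:20],   # 5' end
    TRANSCRIPT[12:32],  # overlaps the first EST by 8 bases
    TRANSCRIPT[36:45],  # 3' end, separated by an uncovered gap
]

print(cluster_ests(ESTS))
```

All three ESTs come from one transcript, yet the uncovered stretch between the second and third leaves the 3' EST in a cluster of its own, so the single gene appears as two "genes". The gene-family problem is the reverse case: similar sequences from different genes end up merged.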

Problems with ESTs

- low copy number genes
- rare tissues
- mistakes
- enrichment of 3' ends of genes
- incomplete coverage of genes

With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information.

Entrez Gene at NCBI

Entrez Gene - A database for gene-specific information.

It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis.

The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable and tracked integers as identifiers.

Entrez Gene at NCBI

The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains and external databases) is updated as new information becomes available.

Entrez Gene data is used by other NCBI resources such as BLAST, GEO, HomoloGene, Map Viewer, UniGene, UniSTS and NCBI's genome annotation pipeline.

Data reliability in databases

The huge amount of data collected in databases presents a lot of problems:

– Data accuracy
– Sequence redundancy
– Inconsistent nomenclature
– Inaccurate annotation
– Sequence contamination (vectors, bacterial)

•  The database staff notify the authors that an error (or contamination) was detected in their sequence entry.

•  However, it takes time to correct the data.

•  Meanwhile the error is propagated, because a lot of the proteins in the protein database are translated from the DNA sequence database.

•  A lot of the sequences in the database are quite "old". They have not been updated since they were submitted, even though technology and data have moved on considerably.

Gene symbols

Gene symbols are designated by upper-case Latin letters or by a combination of upper-case letters and Arabic numbers.

Symbols should be short in order to be useful, and should not attempt to represent all known information about a gene.

Ideally symbols should be no longer than six characters in length.

Based on classical genetic guidelines, it is recommended that gene symbols are either underlined or italicized when referring to genotypic information (phenotypic information is represented in standard fonts).
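The character and length guidelines above are easy to check mechanically. The sketch below is a hypothetical helper, not an official validator from the nomenclature committee. It tests only the two rules stated here, plus the assumption that a symbol starts with a letter.

```python
import re

# Upper-case Latin letters and Arabic numbers; starting with a letter
# is an assumption made for this example.
SYMBOL_RE = re.compile(r"^[A-Z][A-Z0-9]*$")


def check_symbol(symbol, max_len=6):
    """Return a list of guideline problems (empty if none were found)."""
    problems = []
    if not SYMBOL_RE.match(symbol):
        problems.append("use upper-case Latin letters and Arabic numbers only")
    if len(symbol) > max_len:
        problems.append(f"longer than {max_len} characters")
    return problems


for candidate in ["ABCA1", "nrg1", "NEUREGULIN1"]:
    print(candidate, check_symbol(candidate))
```

The six-character limit is described above as an ideal rather than a hard rule, so a long symbol is reported as a problem to review rather than rejected outright.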

HUGO Gene Nomenclature Committee

•  This committee is responsible for the approval of a unique symbol for each gene.

•  It also assigns a longer and more descriptive name.

•  The committee makes considerable efforts to use symbols acceptable to workers in the field, but sometimes it is not possible to use exactly what has previously appeared in the literature.

•  However, wherever the committee is aware of such symbols, they are listed as aliases in the Genew database (http://www.gene.ucl.ac.uk/cgibin/nomenclature/searchgenes.pl).

Gene Symbols

Symbol:                ABCA1
Full name:             ATP-binding cassette, sub-family A (ABC1), member 1
Cytogenetic location:  9q31
MIM number:            600046
PubMed ID:             8088782

Taxonomy Databases

•  An international effort is under way for all sequence databases to create a unified taxonomic tag for the sequences submitted.

•  Problem: each sequence depositor gives "his" name for the species.

•  Solution: a unified taxonomy ID.
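The idea of a unified taxonomy ID can be illustrated with a small lookup. Here 9606 and 10090 are the NCBI Taxonomy identifiers for Homo sapiens and Mus musculus. The synonym table and the function name are invented for the example, and a real database resolves far more names than this.

```python
# Depositor-supplied names mapped to one taxonomy ID per species.
# The synonym list is a made-up sample, not a complete resource.
TAXONOMY_IDS = {
    "homo sapiens": 9606,
    "h. sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "m. musculus": 10090,
    "mouse": 10090,
}


def taxonomy_id(name):
    """Normalize case and spacing, then look up the taxonomy ID."""
    key = " ".join(name.lower().split())
    return TAXONOMY_IDS.get(key)


for submitted in ["Human", "  Homo  sapiens ", "mouse", "Xenopus"]:
    print(repr(submitted), "->", taxonomy_id(submitted))
```

Once every record carries the numeric ID, "human" and "Homo sapiens" entries can be retrieved together, whatever name the depositor typed.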

Protein databases

•  There are many different protein databases containing different types of information:
   – Primary amino acid sequence
   – Secondary structure
   – 3D structure
   – Protein family domains
   – Consensus active sites

Sources of Protein

•  Proteins that have been worked on experimentally

•  mRNA whose product has been worked on experimentally (no actual protein sequencing done)

•  Translated DNA (mRNA) sequences

Protein Primary Sequence Databases

•  Usually contain a description of the protein entry (annotation), the amino acid sequence and sometimes links to other related databases.

•  Swiss-Prot, from the University of Geneva (now the Swiss Institute of Bioinformatics), is a curated protein database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases.

UniProt (Universal Protein Resource) is the world's most comprehensive catalog of information on proteins. It is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

•  The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-reference.

•  The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches.

•  The UniProt Archive (UniParc) is a comprehensive repository, reflecting the history of all protein sequences.
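The UniRef idea of combining sequences into a single record can be sketched for the simplest case, identical sequences. This is only a toy. The accessions are placeholders, the function name is hypothetical, and grouping "closely related" but non-identical sequences would need a real similarity measure, which is not attempted here.

```python
from collections import defaultdict


def collapse_identical(records):
    """Merge records with identical sequences.

    records: iterable of (accession, sequence) pairs.
    Returns a list of (representative accession, member accessions).
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)
    # The first accession seen for a sequence acts as the representative.
    return [(members[0], members) for members in groups.values()]


# Placeholder accessions and short made-up sequences.
RECORDS = [("ACC1", "MAKLV"), ("ACC2", "maklv"), ("ACC3", "MAKLVT")]

for representative, members in collapse_identical(RECORDS):
    print(representative, members)
```

A search then has to scan one entry per group instead of every member, which is the speed-up the UniRef bullet refers to.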

Swiss-Prot Database (primary database)

•  Swiss-Prot annotation includes:
   – Description of protein function
   – Protein domain structure
   – Post-translational modifications
   – Protein variants

•  Sequence entries are composed of different line types, each with their own format. For standardization purposes the format of Swiss-Prot follows as closely as possible that of the EMBL (DNA) Database.

Swiss-Prot Database

Swiss-Prot differs from other protein databases by the following criteria:

•  Annotation

•  Minimal redundancy

•  Integration with other databases

Annotation

In Swiss-Prot, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation.

The core data consists of the sequence, the citation information (bibliographical references) and the taxonomic data (description of the biological source of the protein).

The annotation consists of the description of:

•  Function(s) of the protein
•  Post-translational modification(s), for example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
•  Domains and sites, for example calcium-binding regions, ATP-binding sites, zinc fingers, etc.
•  Secondary structure
•  Quaternary structure, for example homodimer, heterotrimer, etc.
•  Similarities to other proteins
•  Disease(s) associated with deficiency(s) of/in the protein
•  Sequence conflicts, variants, etc.

To obtain this information, Swiss-Prot uses, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins.

Swiss-Prot also makes use of external experts, who have been recruited to send their comments and updates concerning specific groups of proteins.

Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. Swiss-Prot tries as much as possible to merge all these data so as to minimize the redundancy of the database.

If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

Integration with other databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.

Swiss-Prot is currently cross-referenced with ~100 different databases. Cross-references are provided in the form of pointers to information related to Swiss-Prot entries and found in data collections other than Swiss-Prot.

TrEMBL database

•  TrEMBL is a computer-annotated supplement of Swiss-Prot that contains all the translations of the EMBL (DNA) database.

•  TrEMBL contains entries not yet integrated in Swiss-Prot.

•  Combines information not in other databases, like microarray data, population variation studies, proteomics

•  Powerful querying options

•  Only for human proteins

NR database (primary databases from NCBI)

•  The NR Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to PIR, Swiss-Prot, PRF, PDB (sequences from solved structures).

Data reliability in Protein databases

•  About 30% of the proteins in the databases have erroneous sequences due to:
   – Missing exons in the DNA translation.
   – Introns mistakenly translated.

•  Another common problem is the assigning of functions to "new" proteins, based on sequence similarity.

•  For example:
   – Protein A is similar to Protein B.
   – Protein B annotation is based on Protein A annotation (which has an error).
   – Annotation of Protein A is corrected by the group working on it. This correction is not reflected in the Protein B annotation.
   – When Proteins C and D are also based on the erroneous annotation on B, the problem…
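The A, B, C, D example is a chain of annotation transfers, and the set of entries that inherit an error can be found by walking that chain. The sketch below assumes the provenance of each annotation was recorded, which real databases often do not do. The dictionary and the function name are hypothetical.

```python
from collections import deque


def affected_by(source, derived_from):
    """List entries whose annotation descends from `source`.

    derived_from maps each protein to the protein its annotation
    was copied from (a made-up provenance record).
    """
    children = {}
    for protein, parent in derived_from.items():
        children.setdefault(parent, []).append(protein)

    affected = []
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in sorted(children.get(current, [])):
            if child not in affected:  # guards against circular records
                affected.append(child)
                queue.append(child)
    return affected


# The example from the slide: B was annotated from A, C and D from B.
DERIVED_FROM = {"B": "A", "C": "B", "D": "B"}

print(affected_by("A", DERIVED_FROM))
```

Correcting Protein A alone fixes one record. Every entry returned by the traversal still carries the old annotation until it is revisited.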

Text searching pitfalls

•  It finds exactly what you type (try pseudogene vs. psuedogene)

•  Older records may have different annotation, from gene names on…

•  human vs. Homo sapiens

•  Gene symbols vs. full gene name (for example neuregulin vs. nrg1)

•  Most sites use Boolean operators (AND, OR, BUT NOT)

•  Can do (or add) a field-specific tag, but each site has a different way of adding it to a search; for example, NCBI uses square brackets [ ]
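Several of these pitfalls can be reproduced with a toy search. In the sketch below the record titles are invented, `text_search` is a plain case-insensitive substring match standing in for a database text search, and `build_query` only assembles a string in the square-bracket style mentioned above. It does not contact any server, and the field names shown are examples that should be checked against the site's own help pages.

```python
def text_search(term, records):
    """Return the records that contain `term` (case-insensitive)."""
    return [record for record in records if term.lower() in record.lower()]


def build_query(clauses, operator="AND"):
    """Join (term, field) pairs into a query such as 'x[Field] AND y'."""
    parts = [f"{term}[{field}]" if field else term for term, field in clauses]
    return f" {operator} ".join(parts)


# Invented record titles, including one with a misspelling.
RECORDS = [
    "Homo sapiens neuregulin 1 (NRG1), mRNA",
    "Human NRG1 pseudogene",
    "Mus musculus psuedogene fragment",
]

print(text_search("pseudogene", RECORDS))  # misses the misspelled record
print(text_search("human", RECORDS))       # misses the "Homo sapiens" record
print(text_search("neuregulin", RECORDS))  # misses the symbol-only record
print(build_query([("neuregulin", "Title"), ("human", "Organism")]))
```

Each search returns only part of what a person would consider relevant, so it pays to try synonyms, symbols and alternative spellings.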

Remember:

Text searching is NOT sequence similarity searching! You may not find all related sequences by text searching!
