BioinfRes SoSe 18
Bioinformatics Resources - GenBank -
Lecture & Exercises: Prof. B. Rost, Dr. L. Richter, J. Reeb
Institut für Informatik I12

Preliminary Schedule
* These exercises can earn you a bonus
April 13th   Intro, General Overview (sheet 1)
April 20th   Sequence Databases (sheet 2)
April 27th   Sequence Databases (sheet 3)
May 4th      Structure Databases (sheet 4)
May 11th     Lecture cancelled
May 18th     SQL (sheet 5)
May 25th     SQL, NoSQL (sheet 6)
June 1st     Lecture cancelled
June 8th     NoSQL 2 (sheet 7)
June 15th    MongoDB, JavaScript (sheet 8)
June 22nd    Node.js Applications (sheet 9)
June 29th    PredictProtein
July 6th     Wrap Up, Q&A
July 20th    Exam

National Center for Biotechnology Information, NCBI
http://nihrecord.nih.gov/newsletters/2013/07_19_2013/images/milestonesPic6.jpg
● first ideas in the mid-1980s
● division of the National Library of Medicine (NLM) inside the National Institutes of Health (NIH)
● political mission
● founded in 1988
● David Lipman

NCBI's political mission as defined by the bill:
1. design, develop, implement, and manage automated systems for the collection, storage, retrieval, analysis, and dissemination of knowledge concerning human molecular biology, biochemistry, and genetics;
2. perform research into advanced methods of computer-based information processing capable of representing and analyzing the vast number of biologically important molecules and compounds;
3. enable persons engaged in biotechnology research and medical care to use systems developed under paragraph (1) and methods described in paragraph (2); and
4. coordinate, as much as is practicable, efforts to gather biotechnology information on an international basis.
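
Selected NCBI Accomplishments
[Timeline figure]
● milestones shown, 1990-1997: BLAST, GenBank at NCBI, NCBI website, Genomes, OMIM, PubMed
● years marked: 1990, 1992, 1994, 1995, 1996, 1997
● milestones shown, 1999-2008: Human Genome, PubMed Central, Entrez Gene / DTDs, NIH Public Access, Genome Reference Consortium, 1000 Genomes Project
● years marked: 1999, 2000, 2003, 2005, 2007, 2008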

NCBI Resources
● NCBI currently hosts a large collection of resources: http://www.ncbi.nlm.nih.gov/guide/all/
● grouped according to various criteria
  - metadata, project-centric
  - method oriented
  - topic oriented
● sorted into the sections: databases, downloads, submissions, tools, how-tos

GenBank's Origin
● Walter Goad, Los Alamos National Laboratory
● Los Alamos Sequence Database 1979
● Creation and release of GenBank in 1982
● End of 1982: 2000 sequences
● Move to NCBI in 1992
http://www.lanl.gov/science-innovation/features/innovations/images/light/thumbnails/21.jpg

Minutes from the 20th anniversary of GenBank in 2002
“.... Among them is a memo on Los Alamos National Laboratory stationery dated May 9, 1980, that reads: Monday, May 12 at 10:30 Steve Simon invites you for cake and coffee to celebrate 100,000 bases now in the DNA sequence library.”
taken from https://www.genomeweb.com/genbank-turns-20

Growth of GenBank and WGS
- doubling approx. every 18 months; diagram for release 225, Apr. 2018
- current version, release 225: 260,189,141,631 bases in GenBank, 2,784,740,996,536 bases in WGS
- taken from http://www.ncbi.nlm.nih.gov/genbank/statistics
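
The doubling claim can be turned into a quick back-of-the-envelope projection. The sketch below assumes constant exponential growth with the 18-month doubling time quoted above; the function name is made up for illustration, and real growth will deviate from this idealized curve.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Project a base count forward, assuming a fixed doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# GenBank base count for release 225 (Apr. 2018), as quoted on the slide
genbank_release_225 = 260_189_141_631

# 36 months ahead corresponds to two doublings under this assumption
print(f"{projected_bases(genbank_release_225, 36):.3e}")
```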

Growth of GenBank and WGS
- current release 225: 208,452,303 sequences in GenBank, 621,379,029 sequences in WGS
- taken from http://www.ncbi.nlm.nih.gov/genbank/statistics, release 225, Apr. 2018

References for GenBank
● one current citation source: “GenBank”. Nucleic Acids Res. 2014 Jan; 42(Database issue): D32-7. doi: 10.1093/nar/gkt1030. Epub 2013 Nov 11.
● PMID: 24217914
● the most recent: “GenBank”. Nucleic Acids Res. 2018 Jan 4; 46(D1): D41-D47. Published online 2017 Nov 13. doi: 10.1093/nar/gkx1094
● PMCID: PMC5753231

References for GenBank
● more general, for NCBI services: “Database resources of the National Center for Biotechnology Information”. Nucleic Acids Res. 2016 Jan 4; 44(Database issue): D7-D19. Published online 2015 Nov 28. doi: 10.1093/nar/gkv1290
● GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), together with the EMBL Nucleotide Sequence Database (EMBL-Bank), which is part of the European Nucleotide Archive (ENA), and the DNA Data Bank of Japan (DDBJ)

Fastest-Growing Divisions
Division  Description                  Release 197 (8/2013)   Annual increase (%)
WGS*      Whole-genome shotgun data    500,420,412,665        62.4
TSA*      Transcriptome shotgun data   8,633,123,935          49.9
PHG       Phages                       119,812,712            42.5
VRL       Viruses                      1,757,202,472          22.9
BCT       Bacteria                     10,281,048,518         21.8
ENV       Environmental samples        3,743,277,434          10.9
INV       Invertebrates                2,737,140,464          9.8
PAT       Patented sequences           13,290,161,247         9.7
PLN       Plants                       5,963,882,822          8.8
GSS       Genome survey sequences      23,726,384,753         8.1
VRT       Other vertebrates            3,068,956,026          6.3
MAM       Other mammals                911,342,025            5.6
...       ...                          ...                    ...
TOTAL     All GenBank sequences        654,613,333,676        45.1
For comparison, from Release 219:
WGS*      Whole-genome shotgun data    2,035,032,639,807
TSA*      Transcriptome shotgun data   149,038,907,599
* not distributed with the release; these data have their own project-specific sections on the server

Top Organisms (Rel. 207)
Organism                       Entries       Non-WGS base pairs
Homo sapiens                   20,921,637    17,714,786,437
Mus musculus                   9,727,522     9,995,696,539
Rattus norvegicus              2,193,812     6,526,236,496
Bos taurus                     2,227,298     5,410,360,312
Zea mays                       4,177,175     5,201,714,457
Sus scrofa                     3,297,029     4,895,127,638
Danio rerio                    1,727,668     3,133,901,682
Triticum aestivum              1,796,780     1,927,718,314
...                            ...           ...
Oryza sativa Japonica Group    1,376,410     1,265,556,227
...                            ...           ...
Arabidopsis thaliana           2,578,785     1,202,100,008
...                            ...

Top Organisms (Rel. 219)
Organism                       Entries       Non-WGS base pairs
Homo sapiens                   24,231,652    18,893,466,733
Mus musculus                   9,883,173     10,229,286,664
Rattus norvegicus              2,197,781     6,528,984,315
Bos taurus                     2,229,235     5,429,379,063
Zea mays                       4,197,803     5,227,077,026
Sus scrofa                     3,298,802     5,071,347,463
Hordeum vulgare ssp. vulgare   1,346,798     3,235,834,212
Danio rerio                    1,729,033     3,190,913,255
Ovis canadensis canadensis     72            2,590,574,434
Triticum aestivum              1,812,814     1,942,831,630
...                            ...           ...
Oryza sativa Japonica Group    1,378,262     1,642,328,218
...                            ...           ...
Escherichia coli               118,884       1,571,576,668
...                            ...

Distribution of Sequence Files (Rel. 207)
Division  Number of files
BCT       178
CON       317
ENV       81
EST       478
HTG       142
INV       126
PAT       219
PLN       107
TSA       175
VRL       34
Release 207 consists of 2333 text files in total. Release 225 consists of 3120 text files in total.

Distribution of Sequence Files (Rel. 219)
Division  Number of files
BCT       350
CON       359
ENV       97
EST       483
HTG
INV       153
PAT       290
PHG       4
PLN       145
PRI       56
SYN       10
TSA       230
VRL       48
Release 219 consists of 2225 text files in total.

Database Files (Rel. 225)
● GenBank comes as a set of compressed text files available via FTP
● see ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
● 3120 ASCII files (the division files plus additional list files) in the range of 0.7-520 MB
● uncompressed ~885 GB
● each file consists of two portions

Database Files
● Part 1: highly conserved database file header

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
GBBCT1.SEQ          Genetic Sequence Data Bank
                          April 15 2015

                NCBI-GenBank Flat File Release 207.0

                     Bacterial Sequences (Part 1)

   51396 loci,    92682287 bases, from    51396 reported sequences
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

● Part 2: sequence entries for the division described in the header

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
GBSMP.SEQ          Genetic Sequence Data Bank
                         December 15 1992

                 GenBank Flat File Release 74.0

                     Structural RNA Sequences

      2 loci,       236 bases, from     2 reported sequences

LOCUS       AAURRA        118 bp ss-rRNA            RNA       16-JUN-1986
DEFINITION  A.auricula-judae (mushroom) 5S ribosomal RNA.
ACCESSION   K03160
VERSION     K03160.1
KEYWORDS    5S ribosomal RNA; ribosomal RNA.
SOURCE      A.auricula-judae (mushroom) ribosomal RNA.
  ORGANISM  Auricularia auricula-judae
            Eukaryota; Fungi; Eumycota; Basidiomycotina; Phragmobasidiomycetes;
            Heterobasidiomycetidae; Auriculariales; Auriculariaceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Huysmans,E., Dams,E., Vandenberghe,A. and De Wachter,R.
  TITLE     The nucleotide sequences of the 5S rRNAs of four mushrooms and
            their use in studying the phylogenetic position of basidiomycetes
            among the eukaryotes
  JOURNAL   Nucleic Acids Res. 11, 2871-2880 (1983)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
BASE COUNT       27 a     34 c     34 g     23 t
ORIGIN      5' end of mature rRNA.
        1 atccacggcc ataggactct gaaagcactg catcccgtcc gatctgcaaa gttaaccaga
       61 gtaccgccca gttagtacca cggtggggga ccacgcggga atcctgggtg ctgtggtt
//

LOCUS       ABCRRAA       118 bp ss-rRNA            RNA       15-SEP-1990
DEFINITION  Acetobacter sp. (strain MB 58) 5S ribosomal RNA, complete sequence.
ACCESSION   M34766
VERSION     M34766.1
KEYWORDS    5S ribosomal RNA.
SOURCE      Acetobacter sp. (strain MB 58) rRNA.
  ORGANISM  Acetobacter sp.
            Prokaryotae; Gracilicutes; Scotobacteria; Aerobic rods and cocci;
            Azotobacteraceae.
REFERENCE   1  (bases 1 to 118)
  AUTHORS   Bulygina,E.S., Galchenko,V.F., Govorukhina,N.I., Netrusov,A.I.,
            Nikitin,D.I., Trotsenko,Y.A. and Chumakov,K.M.
  TITLE     Taxonomic studies of methylotrophic bacteria by 5S ribosomal RNA
            sequencing
  JOURNAL   J. Gen. Microbiol. 136, 441-446 (1990)
FEATURES             Location/Qualifiers
     rRNA            1..118
                     /note="5S ribosomal RNA"
BASE COUNT       27 a     40 c     32 g     17 t      2 others
ORIGIN
        1 gatctggtgg ccatggcggg agcaaatcag ccgatcccat cccgaactcg gccgtcaaat
       61 gccccagcgc ccatgatact ctgcctcaag gcacggaaaa gtcggtcgcc gccagayy
//
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
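
The two-portion layout (file header, then entries terminated by `//`) suggests a simple way to cut a flat file into entries. The following is a minimal sketch, assuming that the first `LOCUS` record marks the end of the header; `split_entries` is a hypothetical helper, not part of any NCBI tool.

```python
def split_entries(flat_file_text: str) -> list[str]:
    """Return the sequence entries of a GenBank flat file as strings.

    Lines before the first LOCUS record (the file header) and lines after
    an entry terminator (e.g. a trailing ruler) are skipped.
    """
    entries: list[str] = []
    current: list[str] = []
    in_entry = False
    for line in flat_file_text.splitlines():
        if line.startswith("LOCUS"):
            in_entry = True
        if not in_entry:
            continue  # header or other text outside an entry
        current.append(line)
        if line.rstrip() == "//":  # entry terminator
            entries.append("\n".join(current))
            current = []
            in_entry = False
    return entries
```

Applied to the example file above, this would yield two entries, one starting with `LOCUS       AAURRA` and one with `LOCUS       ABCRRAA`.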

The GenBank Flat File Format
● a sequence entry consists of many records (lines)
● each record consists of two parts
● Part 1: columns 1-10 / entry field name
● Part 2: remainder of the line with the content

Part 1/1
● a keyword, beginning in column 1 of the record (e.g., REFERENCE is a keyword)
● a subkeyword beginning in column 3, with columns 1 and 2 blank (e.g., AUTHORS is a subkeyword of REFERENCE)
● or a subkeyword beginning in column 4, with columns 1, 2, and 3 blank (e.g., PUBMED is a subkeyword of REFERENCE)

Part 1/2
● blank characters, indicating that this record is a continuation of the information under the keyword or subkeyword above it
● a code, beginning in column 6, indicating the nature of an entry (feature key) in the FEATURES table

Part 1/3
● a number, ending in column 9 of the record:
  - this number occurs in the portion of the entry describing the actual nucleotide sequence and designates the numbering of sequence positions
● two slashes (//) in positions 1 and 2, marking the end of an entry

Part 2
● The second part of each sequence entry record contains the information appropriate to its keyword
● in positions 13 to 80 for keywords
● in positions 11 to 80 for the sequence
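
The column rules for part 1 can be checked mechanically. The sketch below is one possible reading of those rules, with category names chosen for illustration; it only looks at columns 1-10 and does not validate the content in part 2.

```python
def classify_record(line: str) -> str:
    """Classify a flat-file record by what appears in columns 1-10."""
    if not line.strip():
        return "blank"
    if line.startswith("//"):
        return "end-of-entry"      # two slashes in positions 1 and 2
    head = line[:10].ljust(10)     # part 1: columns 1-10
    if head.strip() == "":
        return "continuation"      # blank part 1
    if head[0] != " ":
        return "keyword"           # begins in column 1
    if head[:9].strip().isdigit() and head[8] != " ":
        return "sequence"          # number ending in column 9
    if head[:2] == "  " and head[2] != " ":
        return "subkeyword"        # begins in column 3
    if head[:3] == "   " and head[3] != " ":
        return "subkeyword"        # begins in column 4
    if head[:5] == "     " and head[5] != " ":
        return "feature-key"       # begins in column 6
    return "unknown"


print(classify_record("  AUTHORS   Huysmans,E., Dams,E."))
print(classify_record("     rRNA            1..118"))
```

The sequence check comes before the subkeyword checks because a large position number can already start in column 3 or 4.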

Entry Field Types (incomplete)
● Locus: a short mnemonic name for the entry, chosen to suggest the sequence's definition; mandatory keyword / exactly one record
● Definition: a concise description of the sequence; mandatory keyword / one or more records
● Accession:
  - the primary accession number is a unique, unchanging identifier assigned to each GenBank sequence record
  - to be used for citations from GenBank
  - mandatory keyword / one or more records

Entry Field Types (incomplete)
● Version:
  - compound identifier consisting of the primary accession number and a numeric version number associated with the current version of the sequence data in the record
  - optionally followed by an integer identifier (a "GI") assigned to the sequence by NCBI
  - mandatory keyword / exactly one record

Entry Field Types (incomplete)
● DBLINK: provides cross-references to resources that support the existence of a sequence record; optional keyword / one or more records
● Keywords: short phrases describing gene products and other information about an entry; mandatory keyword in all annotated entries / one or more records

Entry Field Types (incomplete)
● Source: common name of the organism or the name most frequently used in the literature; mandatory keyword in all annotated entries / one or more records / includes one subkeyword
● Organism: formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines); mandatory subkeyword in all annotated entries / two or more records

Entry Field Types (incomplete)
● Reference:
  - citations for all articles containing data reported in this entry
  - includes seven subkeywords and may repeat
  - mandatory keyword / one or more records
● Journal: lists the journal name, volume, year, and page numbers of the citation; mandatory subkeyword / one or more records
● optional subkeywords: Authors, Consortium, Title, Medline, Pubmed, Remark

Entry Field Types (incomplete)
● Features: table containing information on portions of the sequence that code for proteins and RNA molecules, as well as on sites of biological significance; optional keyword / one or more records
● Origin:
  - specification of how the first base of the reported sequence is operationally located within the genome
  - mandatory keyword / exactly one record
  - followed by sequence data (multiple records)
● //: entry termination symbol; mandatory at the end of an entry / exactly one record
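
To see how these field types map onto an entry, the records can be grouped under their top-level keyword. This is a rough sketch only: it folds subkeywords and continuation lines into the enclosing keyword, and the set of keywords it treats as mandatory is taken from the (incomplete) list above. Both function names are hypothetical.

```python
MANDATORY = {"LOCUS", "DEFINITION", "ACCESSION", "VERSION", "REFERENCE", "ORIGIN"}


def entry_fields(entry: str) -> dict[str, list[str]]:
    """Group the lines of one entry under their top-level keyword."""
    fields: dict[str, list[str]] = {}
    current = None
    for line in entry.splitlines():
        if line.startswith("//"):
            break                                # entry termination symbol
        if line and line[0] != " ":
            current = line[:12].strip()          # keyword starts in column 1
            fields.setdefault(current, []).append(line[12:].strip())
        elif current is not None and line.strip():
            fields[current].append(line.strip()) # subkeyword or continuation
    return fields


def missing_mandatory(fields: dict[str, list[str]]) -> list[str]:
    """Return the mandatory keywords that do not occur in the entry."""
    return sorted(MANDATORY - fields.keys())
```

A repeated keyword such as REFERENCE simply accumulates all of its lines in one list here; a fuller parser would keep each occurrence separate.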

Detailed Locus Format
Columns  Contents
01-05    'LOCUS'
06-12    spaces
13-28    Locus name
29-29    space
30-40    Length of sequence, right-justified
41-41    space
42-43    bp
44-44    space
45-47    spaces, ss- (single-stranded), ds- (double-stranded), or ms- (mixed-stranded)
48-53    NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), mRNA (messenger RNA), uRNA (small nuclear RNA), left-justified
54-55    spaces
56-63    'linear' followed by two spaces, or 'circular'
64-64    space
65-67    The division code
68-68    space
69-79    Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
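
Because the LOCUS record is column-based, it can be read with plain string slices. The sketch below follows the column table above (1-based columns become 0-based slices); the example line is constructed for illustration and is not a real GenBank record. Note that the older example entries shown earlier predate this layout and would not parse with these offsets.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS record into its fixed-width fields."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS record")
    return {
        "name": line[12:28].strip(),         # columns 13-28
        "length": int(line[29:40]),          # columns 30-40, right-justified
        "strandedness": line[44:47].strip(), # columns 45-47
        "molecule": line[47:53].strip(),     # columns 48-53
        "topology": line[55:63].strip(),     # columns 56-63
        "division": line[64:67],             # columns 65-67
        "date": line[68:79],                 # columns 69-79
    }


# Hypothetical 79-column line assembled according to the table
example = (
    "LOCUS" + " " * 7 + "EXAMPLE1".ljust(16) + " " + "118".rjust(11)
    + " bp " + "ss-" + "RNA".ljust(6) + "  " + "linear".ljust(8)
    + " " + "BCT" + " " + "15-MAR-1991"
)
print(parse_locus(example))
```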

Accession Format
● six or eight characters
● six-character format:
  - single uppercase letter
  - 5 digits
● eight-character format:
  - two uppercase letters
  - 6 digits
● the primary accession number is always the first one
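
The two accession formats can be expressed as a regular expression. This sketch covers only the six- and eight-character formats listed above, so accessions in any other format would be rejected; the helper names are invented for this example.

```python
import re

# one uppercase letter + 5 digits, or two uppercase letters + 6 digits
ACCESSION_RE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def is_accession(text: str) -> bool:
    """True if text matches one of the two formats described above."""
    return ACCESSION_RE.fullmatch(text) is not None


def primary_accession(accession_field: str) -> str:
    """The primary accession number is the first one in the ACCESSION field."""
    return accession_field.split()[0]


print(is_accession("K03160"), is_accession("k03160"))
```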

Features (incomplete)
● authoritative source: http://www.insdc.org/documents/feature-table
● the feature table:
  - contains information about genes and gene products
  - contains information about regions of biological significance
  - can enumerate differences between various reports
  - provides cross-references to other data collections
  - allows hierarchical relations between the features

Layout
● first line of the feature table is a header
● includes the keyword ‘FEATURES’ and the column header ‘Location/Qualifiers’
● each feature consists of:
  - a descriptor line containing a feature key and a location
  - a continuation line for the location may follow
  - feature qualifiers may follow the descriptor line
  - key: columns 6-20; location starts in column 22
  - qualifiers on subsequent lines at column 22, starting with a ‘/’

A Few Frequent Features
● CDS: sequence coding for amino acids in protein (includes stop codon)
● exon: region that codes for part of spliced mRNA
● gene: region that defines a functional gene, possibly including upstream (promoter, enhancer, etc.) and downstream control elements, and for which a name has been assigned
● mRNA: messenger RNA
● ... more than 60 features currently

Location and Qualifiers
● Location:
  - a location can be: a single base, a span of bases, a site between two bases, a join of sequences, ...
  - examples: 23, 23..56, 23^24, join(23..56,87..110)
● Qualifiers:
  - format: from column 22, /qualifier_name[=value]
  - types: free text, enumeration or controlled vocabulary, citations, sequences, feature labels
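
The four example locations can be parsed with a few lines of code. The sketch below handles only those simple forms (single base, span, site between bases, flat join); other location syntax defined by the INSDC feature table, and nested operators, are deliberately left out. The tuple layout is an arbitrary choice for illustration.

```python
import re


def parse_location(location: str) -> list[tuple[str, int, int]]:
    """Parse a simple feature location into (kind, start, end) tuples."""
    location = location.strip()
    if location.startswith("join(") and location.endswith(")"):
        parts: list[tuple[str, int, int]] = []
        for piece in location[5:-1].split(","):
            parts.extend(parse_location(piece))
        return parts
    span = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if span:
        return [("span", int(span.group(1)), int(span.group(2)))]
    site = re.fullmatch(r"(\d+)\^(\d+)", location)
    if site:
        return [("site", int(site.group(1)), int(site.group(2)))]
    if location.isdigit():
        return [("base", int(location), int(location))]
    raise ValueError(f"unsupported location: {location!r}")


for text in ("23", "23..56", "23^24", "join(23..56,87..110)"):
    print(text, parse_location(text))
```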

Database Cross References /db_xref
● http://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/
● Qualifier: /db_xref="database:identifier"
● Definition: database cross-reference: pointer to related information in another database
● Scope: all feature keys
● Example: /db_xref="Swiss-Prot:P12345"
● currently more than 120 databases available

Anatomy of a GenBank Flat File
[Figure sequence: an example record, highlighting in turn]
● Locus line
● Accession number, version and GI number
● Feature table with annotations

Useful Resources from NCBI
● Materials:
● Electronic bookshelf
● http://www.ncbi.nlm.nih.gov/education/factsheets/
● ftp://ftp.ncbi.nih.gov/pub/factsheets/Factsheet_Books.pdf
● NCBI manuals
● textbooks

Useful Resources from NCBI
● Processes, e.g. the Prokaryotic Genome Annotation Pipeline
● designed for bacterial and archaeal genomes
● multi-level process covering prediction of protein-coding genes and of other functional genome units such as rRNAs, tRNAs, small RNAs, pseudogenes, control regions, repeats, insertion elements, and so forth
● combination of ab initio prediction and homology-based methods

Useful Resources from NCBI
● reference databases: RefSeq
● http://www.ncbi.nlm.nih.gov/refseq/
● comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins
● stable reference for genome annotation, esp. the RefSeqGene subset
● reference sequences
● reference coordinates
● accessible via BLAST, Entrez and FTP

RefSeq
● created by:
  - Eukaryotic Genome Annotation Pipeline
  - Prokaryotic Genome Annotation Pipeline
  - manual curation
  - submission to INSDC members
● reflects current knowledge of sequence data and biology
● format consistency
● accession number contains an “_”

RefSeq Growth

Databases Accessible via Entrez
http://www.ncbi.nlm.nih.gov/gquery/

Computation: BLAST at NCBI

Searching the NCBI / Entrez
● the Entrez Programming Utilities (E-utilities) provide an integrated search interface to the different NCBI databases
● base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
● more than 40 databases
● stable interface of nine server-side programs
● http://www.ncbi.nlm.nih.gov/books/NBK25501/

Entrez Guidelines
● if you use the E-utilities against the guidelines, you might be banned!
● more than 100 requests: run them on weekends or outside US peak times (9 pm - 5 am, EST)
● not more than 3 requests per second
● provide email and tool name: &tool=<...>&email=<...>
● registering your email and tool name with NCBI may relax these restrictions
● supported by Biopython
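
A client can enforce the request limit itself. The sketch below is one simple way to do so, assuming a single-threaded script; the `Throttle` class and `with_identity` helper are illustrative names, not part of the E-utilities or Biopython, and the clock and sleep functions are injectable only so the logic can be tested without waiting.

```python
import time


class Throttle:
    """Space out calls so that at most max_per_second requests are sent."""

    def __init__(self, max_per_second: float = 3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Block until the next request may be sent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


def with_identity(params: dict, tool: str, email: str) -> dict:
    """Add the &tool and &email parameters asked for by the guidelines."""
    return {**params, "tool": tool, "email": email}
```

Calling `throttle.wait()` immediately before each request keeps successive requests at least a third of a second apart.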

Constructing URLs
● parameter: &lowerCaseName
● exception: &WebEnv
● no required order
● null values and inappropriate parameters are generally ignored
● no spaces, use + instead
● use URL encodings for special characters, e.g. %22 for ", %23 for # or %40 for @
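
In Python, the standard library already applies these encoding rules. The sketch below builds E-utility URLs with `urllib.parse.urlencode`, which turns spaces into `+` and percent-encodes special characters; `build_url` is a hypothetical helper, and the base URL uses `https` as an assumption (the slides give the `http` form).

```python
from urllib.parse import urlencode

# Assumption: same host and path as on the slides, but with https
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(program: str, **params) -> str:
    """Build an E-utility URL; parameters set to None are left out."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}{program}?{query}"


print(build_url("esearch.fcgi", db="pubmed",
                term="science[journal] AND breast cancer AND 2008[pdat]"))
```

Parameter names are passed through unchanged, so the mixed-case `WebEnv` can be given as a keyword argument like any other.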

E-utilities
● EInfo
● ESearch
● EPost
● ESummary
● EFetch
● ELink
● EGQuery
● ESpell
● ECitMatch

External Interfaces to Entrez / API
● there are a number of APIs to access the various services from NCBI, described at:
● http://www.ncbi.nlm.nih.gov/books/NBK25501/
● base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
● basic searching:
  - esearch.fcgi?db=<database>&term=<query>
  - Input: Entrez database (&db); any Entrez text query (&term)
  - Output: list of UIDs matching the Entrez query
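
The basic search step can be sketched as follows: build the ESearch URL, then read the UIDs out of the XML response. The sample response is hand-written for illustration (the IDs are made up) and assumes the usual `eSearchResult`/`IdList`/`Id` layout of the XML output; no network request is made here, and the `https` base URL is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # assumed https


def esearch_url(db: str, term: str) -> str:
    """URL for esearch.fcgi?db=<database>&term=<query>."""
    return BASE_URL + "esearch.fcgi?" + urlencode({"db": db, "term": term})


def parse_id_list(xml_text: str) -> list[str]:
    """Extract the UIDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./IdList/Id")]


# Hand-written sample response with made-up UIDs
SAMPLE = (
    "<eSearchResult><Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)
print(esearch_url("pubmed", "breast cancer"))
print(parse_id_list(SAMPLE))
```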

ESearch
● text search
● eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
● responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query

ESummary
● document summary downloads
● eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi
● responds to a list of UIDs from a given database with the corresponding document summaries

EGQuery
● global query
● eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi
● responds to a text query with the number of records matching the query in each Entrez database

EInfo
● database statistics
● eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
● provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases
● without &db: lists all available databases

EFetch
● data record downloads
● eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
● responds to a list of UIDs in a given database with the corresponding data records in a specified format

ELink
● Entrez links
● eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi
● responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database

ELink
● checks for the existence of a specified link from a list of one or more UIDs
● creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs

EPost
● UID uploads
● eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
● accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset

ESpell
● spelling suggestions
● eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi
● retrieves spelling suggestions for a text query in a given database

ECitMatch
● batch citation searching in PubMed
● eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi
● retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings

Identifiers
● records are identified by an integer ID called a UID
● UIDs are database specific, e.g. GI numbers, PMIDs, MMDB-IDs
● UIDs serve as both input and output of the E-utilities
● especially useful in combination with the History server
● a full description of parameters and syntax can be found at: http://www.ncbi.nlm.nih.gov/books/NBK25499/

Selected UIDs
Entrez Database     UID common name   E-utility database name
Books               Book ID           books
Conserved Domains   PSSM-ID           cdd
dbVar               dbVar ID          dbvar
EST                 GI number         nucest
Gene                Gene ID           gene
Genome              Genome ID         genome
MeSH                MeSH ID           mesh
NCBI Web Site       Web Site ID       ncbisearch
Nucleotide          GI number         nuccore
PubMed              PMID              pubmed
...                 ...               ...

Entrez Core Engine
● EGQuery, ESearch, and ESummary
● two tasks:
  - assemble a list of UIDs that match a text query (ESearch)
  - retrieve a brief summary record called a Document Summary (DocSum) for each UID (ESummary)
● EGQuery: global version of ESearch
● esearch.fcgi?db=database&term=query
  esummary.fcgi?db=database&id=uid1,uid2,uid3,...
● these calls can be expanded into more complex Entrez queries

Entrez Databases (EInfo, EFetch, and ELink)
● EInfo:
  - provides detailed information about each database
  - including lists of the indexing fields in the database
  - available links to other Entrez databases

Entrez Databases (EInfo, EFetch, and ELink)
● added value to the raw data:
  - EFetch supports a variety of display formats: UID lists in XML and plain text (&retmode) for all databases; other formats (&rettype) are database specific
  - http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly
  - efetch.fcgi?db=database&id=uid1,uid2,uid3&rettype=report_type&retmode=data_mode
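
The EFetch URL pattern can be filled in programmatically. The sketch below only builds the URL and does not download anything; the combination `rettype=fasta`, `retmode=text` for the protein database is used as an example value and should be checked against the table of valid values linked above. The `https` base URL is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # assumed https


def efetch_url(db: str, uids: list[str], rettype: str, retmode: str) -> str:
    """URL for efetch.fcgi with a comma-separated UID list."""
    params = {
        "db": db,
        "id": ",".join(uids),
        "rettype": rettype,
        "retmode": retmode,
    }
    # keep the commas of the UID list readable instead of encoding them as %2C
    return BASE_URL + "efetch.fcgi?" + urlencode(params, safe=",")


print(efetch_url("protein", ["15718680", "157427902"], "fasta", "text"))
```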

Entrez Databases (EInfo, EFetch, and ELink)
● added value to the raw data:
  - links to records in other Entrez databases, returned as lists of associated UIDs
  - UIDs must be valid in the source database (&dbfrom)
  - elink.fcgi?dbfrom=protein&db=gene&id=15718680,157427902

Entrez History Server
● simple: in the GUI, accessible via the respective tabs
● you can temporarily store sets of UIDs as input for later queries through other tools
● each list of UIDs is specified by:
  - &query_key (integer label)
  - &WebEnv (cookie string)

Creation of a Stored UID List
● EPost:
  - EPost can be used to upload a UID list
  - returns &query_key and &WebEnv
● ESearch:
  - stores the results if given &usehistory=y
● ELink:
  - stores the results if given &cmd=neighbor_history

Usage of Stored UID Lists
● use of stored lists: esummary.fcgi?db=database&WebEnv=webenv&query_key=key
● one web environment can hold multiple result lists
● lists in the same web environment can be combined with AND, OR, NOT
● by default every call creates a new environment
● -> give &WebEnv in subsequent calls to store the lists in the same web environment
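
Using a stored list means carrying two values from one response into the next request. The sketch below assumes the ESearch XML response exposes them as `QueryKey` and `WebEnv` elements; the sample response and its `EXAMPLE_WEBENV` value are made up, and the `https` base URL is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # assumed https


def parse_history(xml_text: str) -> tuple[str, str]:
    """Return (query_key, webenv) from a response produced with usehistory=y."""
    root = ET.fromstring(xml_text)
    query_key = root.findtext("QueryKey")
    webenv = root.findtext("WebEnv")
    if query_key is None or webenv is None:
        raise ValueError("response carries no history information")
    return query_key, webenv


def esummary_from_history(db: str, query_key: str, webenv: str) -> str:
    """URL for esummary.fcgi?db=database&WebEnv=webenv&query_key=key."""
    params = {"db": db, "WebEnv": webenv, "query_key": query_key}
    return BASE_URL + "esummary.fcgi?" + urlencode(params)


# Hand-written sample response; the WebEnv string is a placeholder
SAMPLE = ("<eSearchResult><Count>5</Count><QueryKey>1</QueryKey>"
          "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>")
key, env = parse_history(SAMPLE)
print(esummary_from_history("pubmed", key, env))
```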

Sketching Pipelines
● get DocSummaries or entries for keywords or IDs:
  - ESearch -> ESummary/EFetch
  - EPost -> ESummary/EFetch
● filter/limit a record set:
  - EPost/ELink -> ESearch
● more advanced queries:
  - ESearch -> ELink -> ESummary/EFetch
  - EPost -> ELink -> ESearch -> EFetch
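
The first pipeline (ESearch -> EFetch) can be sketched as a function that chains two requests through the History server. To keep the example free of network access, the HTTP call is passed in as a `fetch` callable; that parameter, the function name, the default `rettype`/`retmode` values and the `https` base URL are all assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET
from typing import Callable
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"  # assumed https


def search_then_fetch(db: str, term: str, fetch: Callable[[str], str],
                      rettype: str = "abstract", retmode: str = "text") -> str:
    """ESearch with usehistory=y, then EFetch on the stored result list."""
    search_url = BASE_URL + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "usehistory": "y"})
    root = ET.fromstring(fetch(search_url))
    query_key = root.findtext("QueryKey")
    webenv = root.findtext("WebEnv")
    fetch_url = BASE_URL + "efetch.fcgi?" + urlencode(
        {"db": db, "query_key": query_key, "WebEnv": webenv,
         "rettype": rettype, "retmode": retmode})
    return fetch(fetch_url)
```

In a real script, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read().decode()`, combined with the request limits from the Entrez guidelines.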

Storing Results
● storing results:
  - esearch.fcgi?db=<database>&term=<query>&usehistory=y
  - input: any Entrez text query (&term); Entrez database (&db); &usehistory=y
  - output: web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of UIDs matching the Entrez query
  - example: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science%5bjournal%5d+AND+breast+cancer+AND+2008%5bpdat%5d&usehistory=y

Associating Search Results
● associating search results with existing search results:
  - esearch.fcgi?db=<database>&term=<query1>&usehistory=y
  - esearch.fcgi?db=<database>&term=<query2>&usehistory=y&WebEnv=$web1
  - Input: any Entrez text query (&term); Entrez database (&db); &usehistory=y; existing web environment (&WebEnv) from a prior E-utility call
  - Output: web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of UIDs matching the Entrez query

E-utility Webinar
● https://www.youtube.com/watch?v=iCFVVexp30o