UBC Bioinformatics Centre Bioinformatics: Understanding the data in databases MedGen 505, January 9, 2003 Francis Ouellette Director, UBC Bioinformatics.

UBC Bioinformatics Centre

Bioinformatics: Understanding the data in databases

MedGen 505, January 9, 2003
Francis Ouellette
Director, UBC Bioinformatics Centre
Vancouver, BC, [email protected]

Copyright 2002 UBC Bioinformatics Centre

http://vanbug.org

• Monthly bioinformatics seminar
• The second Thursday of every month
• Attended by academics, industry and government types.
• Talk followed by beer and pizza.
• Tonight @ 6:00 at the Chan Centre at the BCRI (CMMT).
• Nat Goodman, a senior research scientist from the ISB, aka “the IT guy”


http://bioinformatics.ubc.ca


Bioinformatics is about understanding how life works. It is a hypothesis-driven science.


In bioinformatics, we use software tools and biological databases to ask questions.


At the UBC Bioinformatics Centre (UBiC) we bring together scientists who share the vision of advancing computational biology, while also working with bench scientists to validate the hypotheses we generate.


Structure

• Director

• Associate Director

• 6 adjunct faculty

• 4 more to be recruited

• Another recruitment already in progress

• Director of Operation and Strategy

• Chief Soft. Dev.
• Chief Bioinformatics
• Chief Systems
• Chief Training and Support
• Chief Web Development


UBiC: the vision

Basic Research: Gene Identification, Comparative Genomics, Algorithm development

Large Scale Bioinformatics: BLAST, IDB, PeGASys

Support & Training: CBW, WWW, Workshops


The UBC Bioinformatics Centre:



Ouellette Lab projects

• Core facility: training and support
• GeneComber: an ab initio gene finding algorithm.
• IDB: the Integral DataBase system
• PeGASys: Parallel genome annotation system
• GeMS: Genomic Mutational Signature Sequences.


http://bioinformatics.ca



Canadian Bioinformatics Workshop Series

Bioinformatics

Genomics Proteomics Developing the Tools

Intro Programming


Bioinformatics is about bringing biological themes together with the help of computer tools and biological databases. Computational biology can lead us to new insights or directions.


BLAST Result

Basic Local Alignment Search Tool


Genetic Analysis of Cancer in Families

The Genetic Predisposition to Cancer

PubMed Text Neighboring

• Common terms could indicate similar subject matter
• Statistical method
• Weights based on term frequencies within document and within the database as a whole
• Some terms are better than others
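The weighting idea in the bullets above can be sketched with a generic TF-IDF score and cosine similarity. This is only an illustration of "term frequency within the document and within the database"; it is not PubMed's actual neighboring algorithm, and the function names `tfidf_vectors` and `cosine` are hypothetical. The three titles are taken from these slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (generic TF-IDF, an assumption)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


titles = [
    "Genetic Analysis of Cancer in Families",
    "The Genetic Predisposition to Cancer",
    "The Transcriptional Program in the Response of Human Fibroblasts to Serum",
]
vectors = tfidf_vectors(titles)
sim_cancer_pair = cosine(vectors[0], vectors[1])
sim_unrelated_pair = cosine(vectors[0], vectors[2])
print(sim_cancer_pair, sim_unrelated_pair)
```

With such a tiny corpus, words like "of" and "in" still carry some weight; in a large database their document frequency would be high and their weight close to zero, which is the sense in which some terms are better than others.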


Micro-array analysis:

[Figure 1 and Figure 4 of the paper below]

Science Jan 1 1999: 83-87

The Transcriptional Program in the Response of Human Fibroblasts to Serum

Vishwanath R. Iyer, Michael B. Eisen, Douglas T. Ross, Greg Schuler, Troy Moore, Jeffrey C. F. Lee, Jeffrey M. Trent, Louis M. Staudt, James Hudson Jr., Mark S. Boguski, Deval Lashkari, Dari Shalon, David Botstein, Patrick O. Brown


VAST Result

Ferredoxin

•Halobacterium marismortui

•Chlorella fusca

VAST: Vector Alignment Search Tool


Computational Biology Analysis

Q Gln NH2-C(=O)-CH2-CH2-

R Arg NH2-C(=NH2+)-NH-CH2-CH2-CH2-


Structural Interactions

Other interactions occurring within this structure (blue). In this case Glutaminyl-tRNA Synthetase interacting with AMP.


Positional Cloning

Steps: Genetic Mapping, Physical Mapping, Transcript Mapping, Gene Sequencing

Materials and results: Family Studies, Chromosome Interval, Large-Insert Clones, Candidate Genes, Disease Mutation

Normal: Met (ATG) Val (GTC) Ser (TCA) Leu (CTG) Gln (CAA) Pro (CCG) Cys (TGT)

Mutant: Met (ATG) Val (GTC) Ser (TCA) Leu (CTG) STOP (TAA) *


Positional Candidate Cloning

Steps: Genetic Mapping, Computer Search, Gene Sequencing

Materials and results: Family Studies, Chromosome Interval, Candidate Genes, Disease Mutation

Normal: Met (ATG) Val (GTC) Ser (TCA) Leu (CTG) Gln (CAA) Pro (CCG) Cys (TGT)

Mutant: Met (ATG) Val (GTC) Ser (TCA) Leu (CTG) STOP (TAA) *
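The disease mutation in the codon figure can be reproduced in a few lines. This sketch assumes the figure reads as the coding sequence ATG GTC TCA CTG CAA CCG TGT with a C-to-T change in the Gln codon (CAA to TAA); the codon table is deliberately limited to the codons that appear there, and `translate` is a hypothetical helper.

```python
# Only the codons that appear in the figure; a full table would have 64 entries.
CODON_TABLE = {
    "ATG": "M", "GTC": "V", "TCA": "S", "CTG": "L",
    "CAA": "Q", "CCG": "P", "TGT": "C", "TAA": "*",
}


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


normal = "ATGGTCTCACTGCAACCGTGT"
# C -> T at the first base of the fifth codon (0-based index 12): CAA becomes TAA.
mutant = normal[:12] + "T" + normal[13:]

normal_protein = translate(normal)
mutant_protein = translate(mutant)
print(normal_protein)  # full-length peptide
print(mutant_protein)  # truncated at the new stop codon
```

A single base change is enough to truncate the protein, which is why the final step of either cloning strategy is sequencing the candidate gene in affected families.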


What does it mean to do CB?

• Like to work with sequences, structures, expression arrays, interaction of molecules and genetic maps.
• Like the whole systems approach
• Like the IT component, and the power it provides for crunching through lots of data
• Like clear answers
• Like to do Science


Doing CB means to be …

• Database user

• Tool user

• Database developer

• Tool developer

• Training, practicing or developing

• Doing bioinformatics experiments


Bioinformatics experiments:

BLAST search
Sequence Alignment

Reagents:

• Sequence
• Databases

Method:

• P-P BLASTP
• N-P BLASTX
• P-N TBLASTN
• N-N BLASTN
• N (P) – N (P) TBLASTX

Interpretation:

• Similarity
• Hypothesis testing

Know your reagents

Know your methods

Do your controls
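The "Method" table on this slide maps the type of query and the type of database onto a BLAST program. A small sketch of that lookup follows; the function name `choose_blast_program` and its `compare_as_protein` flag are hypothetical, and only the five program names come from the slide.

```python
def choose_blast_program(query, database, compare_as_protein=False):
    """Pick a BLAST program from the query and database molecule types.

    query, database: "protein" or "nucleotide".
    compare_as_protein: for nucleotide vs nucleotide only, translate both
    sides in six frames and compare at the protein level (TBLASTX).
    """
    table = {
        ("protein", "protein"): "blastp",        # P-P
        ("nucleotide", "protein"): "blastx",     # N-P, query is translated
        ("protein", "nucleotide"): "tblastn",    # P-N, database is translated
        ("nucleotide", "nucleotide"): "blastn",  # N-N
    }
    if (query, database) not in table:
        raise ValueError(f"unknown molecule types: {query!r}, {database!r}")
    if compare_as_protein:
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"                         # N (P) - N (P)
    return table[(query, database)]


print(choose_blast_program("nucleotide", "protein"))
```

Choosing the program is part of knowing your methods: a translated search such as BLASTX or TBLASTN compares at the protein level even when one side of the comparison is DNA.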


Nature 409:452



Part 1. The Databases

1. GenBank: The Nucleotide Sequence Database
2. PubMed: The Bibliographic Database
3. Macromolecular Structure Databases
4. The Taxonomy Project
5. The Single Nucleotide Polymorphism Database
6. The Gene Expression Omnibus (GEO)
7. Online Mendelian Inheritance in Man (OMIM)
8. The NCBI BookShelf: Searchable Biomedical Books
9. PubMed Central (PMC)
10. The SKY/CGH Database

Part 2. Data Flow and Processing

11. Sequin: A Sequence Submission and Editing Tool
12. The Processing of Biological Sequence Data at NCBI
13. Genome Assembly and Annotation Process

Part 3. Querying and Linking the Data

14. The Entrez Search and Retrieval System
15. The BLAST Sequence Analysis Tool
16. LinkOut: Linking to External Resources from Entrez
17. The Reference Sequence (RefSeq) Project
18. LocusLink: A Directory of Genes
19. Using the Map Viewer to Explore Genomes
20. UniGene: A Unified View of the Transcriptome
21. The Clusters of Orthologous Groups (COGs)

Part 4. User Support

22. User Services: Helping You Find Your Way
23. Exercises: Using Map Viewer

Glossary


The challenge of the information space:

Nucleotide records                  14,976,310
Nucleotides                     15,849,921,438
Protein sequences                    1,793,850
3D structures                           16,500
Interactions                             6,181
Expression data points             >20,000,000
Human Unigene Clusters                  96,109
Maps and Complete Genomes                1,600
Different taxonomy Nodes               229,799
Human dbSNP                          4,116,188
Human RefGenes records                  17,984
bp in Human Contigs > 500 kb     1,154,596,000
PubMed records                      11,692,207
OMIM records                            13,346

Jan 2002


The challenge of the information space:

Nucleotide records                  22,318,883
Nucleotides                     28,507,990,166
Protein sequences                    2,955,588
3D structures                           19,392
Interactions & complexes                 7,119
Expression data points             >40,000,000
Human Unigene Cluster                  115,523
Maps and Complete Genomes                2,698
Different taxonomy Nodes               278,402
Human dbSNP                          4,892,258
Human RefSeq records                    20,008
bp in Human Contigs > 500 kb         1,451,804
PubMed records                      12,319,105
OMIM records                            14,116

Jan 2003
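To put the two snapshots side by side, the year-over-year change can be computed directly from the numbers on the Jan 2002 and Jan 2003 slides. This is a small arithmetic sketch; only rows that are directly comparable between the two slides are included, and `percent_growth` is a hypothetical helper.

```python
# (Jan 2002, Jan 2003) counts copied from the two slides.
counts = {
    "Nucleotide records": (14_976_310, 22_318_883),
    "Nucleotides": (15_849_921_438, 28_507_990_166),
    "Protein sequences": (1_793_850, 2_955_588),
    "3D structures": (16_500, 19_392),
    "PubMed records": (11_692_207, 12_319_105),
    "OMIM records": (13_346, 14_116),
}


def percent_growth(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


growth = {name: percent_growth(old, new) for name, (old, new) in counts.items()}
for name, pct in sorted(growth.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {pct:6.1f}%")
```

Sequence data grows far faster than the curated and bibliographic resources, which is one way to read "the challenge of the information space".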


Databases

• Organized array of information
• Place where you put things in, and (if all is well) you should be able to get them out again.
• Resource for other databases and tools.
• Simplify the information space by specialization.
• Bonus: Allows you to make discoveries.


Databases

Information system: the UBC library, Google, Entrez, SRS

Query system: a list you look at, a catalogue, indexed files, SQL, grep

Storage system: boxes, PC binary files, Unix text files, bookshelves

Data: GenBank flat file, PDB file, Interaction Record, title of a book, book


“... the more closely and elegantly a model follows a real phenomenon, the more useful it is in predicting or understanding the natural phenomenon it mimics.”

Ostell, Wheelan & Kans on the “NCBI data model”

from “Bioinformatics, a Practical Guide to the Analysis of Genes and Proteins.”, Baxevanis and Ouellette, Eds. 2001

Using the NCBI data model

[Diagram: in the NCBI data model, Genomes, Structures (MMDB structure:function, VAST), Protein Sequences, GenBank nucleotide sequences, MEDLINE (PubMed online journals, full text), Expression Data, SNP Data, maps and BIND interaction:function records are connected to one another by links and accession numbers.]



Primary Data

• DNA sequences
• RNA sequences
• Protein sequences
– In most cases protein sequences are interpreted sequences.
• 3D structures
• Expression data
• Polymorphism data
• Interaction data


Databases: some examples

• Primary (archival)
– DDBJ/EMBL/GenBank
– TrEMBL
– UniProt
– PDB
– Medline
– BIND

• Secondary (curated)
– LocusLink
– RefSeq
– Taxon
– Swiss-Prot
– PROSITE
– OMIM
– SGD
– FlyBase
– GO


What is GenBank?

GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.

http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
Benson et al., 2002, Nucleic Acids Res. 29:12-17


[Diagram: the three collaborating databases exchange submissions and updates.]

GenBank: NCBI (NIH), queried with Entrez

EMBL: EBI, queried with SRS

DDBJ: CIB (NIG), queried with getentry




GenBank Flat File (GBFF)

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

Features (AA seq)

DNA Sequence

Header
• Title
• Taxonomy
• Citation


Abstract Syntax Notation (ASN.1)


FASTA

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQD
TPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEW
TSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRKVK
KPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAAR
RSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
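A FASTA record is just a definition line starting with ">" followed by sequence lines, so it is easy to read by hand. The sketch below parses the GCN4 record from this slide (truncated to its first 120 residues to keep the example short) and splits the NCBI-style definition line on "|". The function names are hypothetical, and the field layout gi|number|database|accession|name is assumed from this one example only.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_defline(header):
    """Split 'gi|121066|sp|P03069|GCN4_YEAST description' into named fields."""
    identifiers, _, description = header.partition(" ")
    fields = identifiers.split("|")
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": fields[3],
        "name": fields[4],
        "description": description,
    }


# First two sequence lines of the record on the slide (truncated on purpose).
example = """>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQD
TPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEW
"""
header, sequence = parse_fasta(example)[0]
info = parse_defline(header)
print(info["accession"], len(sequence))
```

FASTA keeps only the identifier, a one-line description and the sequence; everything else in the GenBank flat file (features, references, taxonomy) is lost in the conversion.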


Graphical Representation


ASN.1

FASTA

Graphical

GenPept

GenBank

MMDB

Swiss-Prot

EMBL


Outline

• GenBank dissection
– identifiers
– divisions
– format/structure
– features
– file conversions


Organismal Divisions

Code  Division                  Used in which database?
BCT   Bacterial                 DDBJ - GenBank
FUN   Fungal                    EMBL
HUM   Homo sapiens              DDBJ - EMBL
INV   Invertebrate              all
MAM   Other mammalian           all
ORG   Organelle                 EMBL
PHG   Phage                     all
PLN   Plant                     all
PRI   Primate (also see HUM)    all (not same data in all)
PRO   Prokaryotic               EMBL
ROD   Rodent                    all
SYN   Synthetic and chimeric    all
VRL   Viral                     all
VRT   Other vertebrate          all


Functional Divisions

PAT   Patent
EST   Expressed Sequence Tags
STS   Sequence Tagged Site
GSS   Genome Survey Sequence
HTG   High Throughput Genome (unfinished)
HTC   High throughput cDNA (unfinished)

Organismal divisions:

BCT FUN INV MAM PHG PLN PRI ROD SYN VRL VRT
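The division code is the three-letter field near the end of the LOCUS line, so a record can be sorted into "functional" or "organismal" with a simple lookup. The sketch below assumes the whitespace-separated LOCUS layout used in the sample records of these slides (name, length, bp, molecule type, division, date); the helper names are hypothetical and the code tables are copied from the two division slides.

```python
FUNCTIONAL_DIVISIONS = {
    "PAT": "Patent",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genome (unfinished)",
    "HTC": "High throughput cDNA (unfinished)",
}

ORGANISMAL_DIVISIONS = {
    "BCT": "Bacterial", "FUN": "Fungal", "HUM": "Homo sapiens",
    "INV": "Invertebrate", "MAM": "Other mammalian", "ORG": "Organelle",
    "PHG": "Phage", "PLN": "Plant", "PRI": "Primate",
    "PRO": "Prokaryotic", "ROD": "Rodent", "SYN": "Synthetic and chimeric",
    "VRL": "Viral", "VRT": "Other vertebrate",
}


def division_from_locus(locus_line):
    """Return the division code, assumed to be the second-to-last field."""
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return fields[-2]


def classify_division(code):
    """Return (kind, description) for a division code."""
    if code in FUNCTIONAL_DIVISIONS:
        return "functional", FUNCTIONAL_DIVISIONS[code]
    if code in ORGANISMAL_DIVISIONS:
        return "organismal", ORGANISMAL_DIVISIONS[code]
    raise KeyError(f"unknown division code: {code}")


mrna = "LOCUS HSU40282 1789 bp mRNA PRI 21-MAY-1998"
est = "LOCUS AA675481 524 bp mRNA EST 28-NOV-1997"
print(classify_division(division_from_locus(mrna)))
print(classify_division(division_from_locus(est)))
```

Note that a mouse EST sits in the EST division, not in ROD: functional divisions take precedence over the organism the sequence came from.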


Guiding Principles

In GenBank, records are grouped for various reasons: understanding this grouping is key to using this database and taking full advantage of it.


LOCUS, Accession, NID and protein_id

LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.

ACCESSION: A unique identifier for that record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publications.

VERSION: New system where the accession and version play the same function as the accession and gi number.

Nucleotide gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.

PID: Protein Identifier: g, e or d prefix to gi number. Can have one or two on one CDS.

Protein gi: Geninfo identifier (gi), a unique integer which will change every time the sequence changes.

protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.


Accession.version

LOCUS, Accession, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001

     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

LOCUS: HSU40282
ACCESSION: U40282
VERSION: U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
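The identifiers on this slide can be pulled apart with a couple of regular expressions. The sketch below parses the VERSION line of the ILK record and splits an Accession.version identifier into its two parts; the function names and the exact patterns are assumptions that fit the examples shown here, not a full specification of GenBank identifier syntax.

```python
import re

# Matches lines such as "VERSION     U40282.1  GI:3150001".
VERSION_LINE = re.compile(r"^VERSION\s+([A-Z]+\d+)\.(\d+)\s+GI:(\d+)\s*$")
ACCESSION_VERSION = re.compile(r"^([A-Z]+\d+)\.(\d+)$")


def parse_version_line(line):
    """Return (accession, version, gi) from a VERSION line."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return accession, int(version), int(gi)


def split_accession_version(identifier):
    """Split 'U40282.1' or 'AAC16892.1' into (accession, version)."""
    match = ACCESSION_VERSION.match(identifier)
    if match is None:
        raise ValueError(f"not an accession.version: {identifier!r}")
    return match.group(1), int(match.group(2))


nucleotide = parse_version_line("VERSION     U40282.1  GI:3150001")
protein = split_accession_version("AAC16892.1")
print(nucleotide, protein)
```

The accession part stays the same when a record is updated; only the version suffix (and the gi) changes when the sequence itself changes, which is what makes Accession.version suitable for tracking exact sequences.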


Sample GenBank mRNA Record

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1789)
  AUTHORS   Hannigan,G.E., Leung-Hagesteijn,C., Fitz-Gibbon,L.,
            Coppolino,M.G., Radeva,G., Filmus,J., Bell,J.C. and Dedhar,S.
  TITLE     Regulation of cell adhesion and anchorage-dependent growth by a
            new beta 1-integrin-linked protein kinase
  JOURNAL   Nature 379 (6560), 91-96 (1996)
  MEDLINE   96135142
REFERENCE   2  (bases 1 to 1789)
  AUTHORS   Dedhar,S. and Hannigan,G.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-NOV-1995) Shoukat Dedhar, Cancer Biology Research,
            Sunnybrook Health Science Centre and University of Toronto, 2075
            Bayview Avenue, North York, Ont. M4N 3M5, Canada

Fields labelled on the slide: LOCUS, length, mol-type, Division, Create/update, DEF line, accession, Accession.version, gi, Taxonomy, Cit-Art, Cit-Sub


Sample GenBank Record

FEATURES             Location/Qualifiers
     source          1..1789
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11p15"
                     /cell_line="HeLa"
     gene            1..1789
                     /gene="ILK"
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
                     /translation="MDDIFTQCREGNAVAVRLWLDNTENDLNQGDDHGFSPLHWACRE
                     . . . DK"
BASE COUNT      443 a    488 c    480 g    378 t
ORIGIN
        1 gaattcatct gtcgactgct accacgggag ttccccggag aaggatcctg cagcccgagt
 < ...>
     1681 ggcgggctca gagctttgtc acttgccaca tggtgtcttc caacatggga gggatcagcc
     1741 ccgcctgtca caataaagtt tattatgaaa aaaaaaaaaa aaaaaaaaa
//

Fields labelled on the slide: BioSource, gene, coding sequence, sequence
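Two quick consistency checks can be run on this record: the BASE COUNT should add up to the length on the LOCUS line, and the CDS span should be a whole number of codons. The sketch below handles only simple `start..end` locations (no joins or complements), and the helper names are hypothetical; the protein length assumes the last codon of a "complete cds" is the stop codon.

```python
import re


def parse_simple_location(location):
    """Parse 'start..end' (1-based, inclusive); other location forms are not handled."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def cds_summary(location):
    """Return (length_in_nt, codons, protein_length) for a simple CDS location."""
    start, end = parse_simple_location(location)
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = length // 3
    return length, codons, codons - 1  # last codon assumed to be the stop


def parse_base_count(line):
    """Turn 'BASE COUNT 443 a 488 c 480 g 378 t' into {'a': 443, ...}."""
    return {base: int(count) for count, base in re.findall(r"(\d+)\s+([acgt])\b", line)}


ilk_cds = cds_summary("157..1515")
base_count = parse_base_count("BASE COUNT 443 a 488 c 480 g 378 t")
total_bases = sum(base_count.values())
print(ilk_cds, total_bases)
```

Checks like these are cheap "controls" when converting or hand-editing flat files: if the base count no longer matches the LOCUS length, something was lost along the way.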


EST: Expressed Sequence Tag

Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.

Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/


LOCUS       AA675481      524 bp    mRNA            EST       28-NOV-1997
DEFINITION  vr72d07.s1 Knowles Solter mouse 2 cell Mus musculus cDNA clone
            IMAGE:1134253 5' similar to TR:G992993 G992993 MYOSIN LIGHT
            CHAIN KINASE. ;, mRNA sequence.
ACCESSION   AA675481
VERSION     AA675481.1  GI:2652718
KEYWORDS    EST.
SOURCE      house mouse.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
...
COMMENT     Contact: Marra M/Mouse EST Project
            WashU-HHMI Mouse EST Project
            Washington University School of MedicineP
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            Email: [email protected]
            This clone is available royalty-free through LLNL ; contact the
            IMAGE Consortium ([email protected]) for further information.
            MGI:615525
            Possible reversed clone: similarity on wrong strand
            High quality sequence stop: 469.

Fields labelled on the slide: DIVISION, KEYWORD, DEF line, Comment (contact: my friend Marco Marra)


FEATURES             Location/Qualifiers
     source          1..524
                     /organism="Mus musculus"
                     /strain="B6D2 F1/J"
                     /note="Organ: embryo; Vector: pBluescribe (modified);
                     Site_1: MluI; Site_2: SalI; Cloned unidirectionally from
                     mRNA prepared from 13,500 2-cell stage embryos. Primer:
                     SalI(dT): 5'-CGGTCGACCGTCGACCGTTTTTTTTTTTTTTT-3'. cDNAs
                     were cloned into the MluI/SalI sites of a modified
                     pBluescribe vector using commercial linkers (NEB). Average
                     insert size: 1.2 kb."
                     /db_xref="taxon:10090"
                     /clone="1134253"
                     /clone_lib="Knowles Solter mouse 2 cell"
                     /tissue_type="embryo"
                     /dev_stage="2-cell"
                     /lab_host="DH10B"
BASE COUNT      168 a    111 c    115 g    130 t
ORIGIN
        1 ctcagttgta gacagtgagc cagtcagatt tactgttaaa gtaacaggag aacccaagcc
       61 ggaaattaca tggtggtttg aaggagaaat actgcaggat ggagaagact atcagtacat
      121 cgaaagaggt gaaacttact gcctgtattt accggaaacc ttcccagaag atggaggaga
      181 gtacatgtgt aaggcagtca acaataaagg ctcagcagcg agcacctgca ttcttaccat
      241 tgaaatggat gactactagg cttccctctg tccttgggac tctctctctc gctgcatctc
      301 tgtggagggg ccaaaaagga gaccagaggt gccactataa ctgacttaat ctttccccaa
      361 atcttcctct taagaacttc tcatgcatat caggttcatt accatgctgt gcaaagtcaa
      421 agcatagctg acagaaaagg gaaataaatg tacccattct gtcagaacta agacagaagc
      481 ttcgtattta tagaactaag acttaacata tacagtttgc atga
//

Field labelled on the slide: BioSource


STS

Sequence Tagged Sites are operationally unique sequences. Each one identifies the pair of primers used in a PCR assay, which generates a mapping reagent that maps to a single position within the genome.

Also see: http://www.ncbi.nlm.nih.gov/dbSTS/

http://www.ncbi.nlm.nih.gov/genemap/


GSS: Genome Survey Sequences

Genome Survey Sequences are similar in nature to ESTs, except that the sequences are genomic in origin, rather than cDNA (mRNA).

The GSS division contains:

• random "single pass read" genome survey sequences.
• single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences

Also see: http://www.ncbi.nlm.nih.gov/dbGSS/


LOCUS       FR0029137     445 bp    DNA             GSS       30-JUN-1998
DEFINITION  Fugu rubripes GSS sequence, clone 037G16aE9, genomic survey
            sequence.
ACCESSION   AL031006
VERSION     AL031006.1  GI:3286795
KEYWORDS    GSS; genome survey sequence.
SOURCE      Fugu rubripes.
  ORGANISM  Fugu rubripes
            Eukaryota; Metazoa; Chordata; Vertebrata; Actinopterygii;
            Neopterygii; Teleostei; Euteleostei; Acanthopterygii;
            Percomorpha; Tetraodontiformes; Tetraodontoidei;
            Tetraodontidae; Fugu.
REFERENCE   1  (bases 1 to 445)
  AUTHORS   Elgar,G., Clark,M., Smith,S., Meek,S., Warner,S., Umrania,Y.,
            Williams,G. and Brenner,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-JUN-1998) MRC Human Genome Mapping Project
            Resource Centre, Hinxton, Cambridge, CB10 1SB, UK. Email:
            [email protected]
            Vector: pBluescript II KS
            V_type: phagemid
            PRIMER: KS
            DESCR: One pass dye-terminator sequencing of cosmid cloned
            genomic sequence.

Fields labelled on the slide: DIVISION, KEYWORD


Genome Survey Sequences

FEATURES             Location/Qualifiers
     source          1..445
                     /organism="Fugu rubripes"
                     /db_xref="taxon:31033"
                     /clone_lib="cosmid 037G16"
                     /clone="037G16aE9"
BASE COUNT      124 a     96 c     97 g    126 t      2 others
ORIGIN
        1 atcctgcagt gaggcagaac agggnctgtt tccatttttt gtctgtcagt ttaaacagtg
       61 gtcggccgta aaagtcctcc gaaaacccac aaagcctttg cctatcgttc caaatcttac
      121 atgggtaagt gcaaacattt aactcaagat aagtgccttt gagataacaa aacctctttt
      181 ttcaagagag tcttggaagc gtacacacct acagcgtagc tgtttttacc tcagatgaat
      241 gtctttggna tgagggaggg aaccagatac ctggtgaaaa cccatgcaga cttgcggaga
      301 gcactgtgaa accctctggt actgagccct gaaacttcat gttgtgaggc aacagtgctt
      361 accaaaagtt tatcctgcaa ctgctattta acttctgtta gcctctgttt tggagaccac
      421 atgagttaaa tacggtttgt tgaaa
//


HTG: High Throughput Genome

High Throughput Genome Sequences are records of unfinished genome sequencing efforts. Unfinished records have gaps in the nucleotide sequence, low accuracy, and no annotations.

Also see:
http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955


HTGS in GenBank

Phase     Division   Acc        gi
phase 1   HTG        AC000003   1556454
phase 2   HTG        AC000003   2182283
phase 3   PRI        AC000003   2204282


HTGS in GenBank

• Unfinished Record
– Sequencing will be unfinished
– Phase 1 or phase 2
– HTG division
– KEYWORDS: HTG; HTGS_PHASE1 or 2

• Finished record
– Sequencing will be finished
– Phase 3
– Organismal division it belongs to: PRI, INV or PLN
– KEYWORDS: HTG


HTGS: phase 1

LOCUS       HSAC000003 120000 bp    DNA             HTG       20-SEP-1996
DEFINITION  *** SEQUENCING IN PROGRESS *** Chromosome 17 genomic sequence;
            HTGS phase 1, 6 unordered pieces.
ACCESSION   AC000003
KEYWORDS    HTG; HTGS_PHASE1.
...
COMMENT     *** *** *** WARNING: Phase 1 High Throughput Genome Sequence *** *** ***
            * This sequence is unfinished. It consists of 6 contigs for
            * which the order is not known; their order in this record is
            * arbitrary. In some cases, the exact lengths of the gaps
            * between the contigs are also unknown; these gaps are presented
            * as runs of N as a convenience only. When sequencing is complete,
            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    22526: contig of 22526 bp in length
            *    22527    23035: gap of unknown length
            *    23036    33919: contig of 10884 bp in length
            *    33920    34427: gap of unknown length
            *    34428    61877: contig of 27450 bp in length
...
//

Fields labelled on the slide: DIVISION, KEYWORD, WARNING


gap of unknown length

HTGS Phase 1

            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    33214: contig of 33214 bp in length
            *    33215    33250: gap of unknown length
            *    33251    35134: contig of 1884 bp in length
...

    33061 ggagagcttc agggagactc tgcggaatag caggttgtaa tcttccggtt cgatagtcga
    33121 taaatgtctg gtttaccttc agccgaaacg cgggagaaat ccagcctgcg tactccacag
    33181 cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn
    33241 nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc
    33301 ctgtctaccc tccctcttcc ccttcctccc caaatctatc agtaaagacc accttgctgt
    33361 gggcagctag ctgaaagaga ccatctgcct taggaatagc ctacactaga ttcaaactac
    33421 aaagaagcag gttgggggaa agaggaagtg aggatttcaa gtcaagaaag catcctgcct
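The gap announced in the COMMENT block can be recovered from the sequence itself by looking for runs of "n". The sketch below reads two of the ORIGIN-style lines shown above (a position number followed by blocks of ten bases) and reports each run in 1-based record coordinates. The helper names and the `min_len` threshold are assumptions; a threshold is used so that isolated ambiguous bases are not reported as gaps.

```python
import re


def parse_origin_lines(lines):
    """Join numbered sequence lines into (first_position, sequence)."""
    first = None
    expected = None
    chunks = []
    for line in lines:
        fields = line.split()
        start = int(fields[0])
        seq = "".join(fields[1:])
        if first is None:
            first = start
        elif start != expected:
            raise ValueError(f"lines are not contiguous at position {start}")
        expected = start + len(seq)
        chunks.append(seq)
    return first, "".join(chunks)


def find_gaps(first, sequence, min_len=10):
    """Return (start, end) of each run of at least min_len 'n', 1-based inclusive."""
    pattern = re.compile("n{" + str(min_len) + ",}")
    return [(first + m.start(), first + m.end() - 1) for m in pattern.finditer(sequence)]


origin_lines = [
    "33181 cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn",
    "33241 nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc",
]
first, sequence = parse_origin_lines(origin_lines)
gaps = find_gaps(first, sequence)
print(gaps)
```

The run found here matches the "33215 33250: gap of unknown length" line in the COMMENT block, a reminder that the number of N characters in a phase 1 record says nothing about the real size of the gap.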


HTGS phase 3

LOCUS       AC000003   122228 bp    DNA             PRI       07-OCT-1997
DEFINITION  Homo sapiens chromosome 17, clone 104H12, complete sequence.
ACCESSION   AC000003
NID         g2204282
KEYWORDS    HTG.
...
COMMENT     The Staden databases, finishing information, and all
            chromatographic files used in the assembly of this clone are
            available from our anonymous ftp site. All repeats were
            identified using RepeatMasker: Smit, A.F.A. & Green, P.
            (1996-1997)
            http://ftp.genome.washington.edu/RM/RepeatMasker.html.
FEATURES             Location/Qualifiers
     source          1..122228
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="104H12"
                     /clone_lib="Research Genetics/Cal Tech CITB978SK-B
                     (plates 1-194)"
                     /chromosome="17"
     repeat_region   261..370
                     /rpt_family="MLT1B"

Fields labelled on the slide: DIVISION, KEYWORD



LocusLink


http://nar.oupjournals.org/content/vol31/issue1/


Genome Projects: discussion point

• Whole genome assembly
• “Bermuda agreement”
• HTG Finished
• What is it to be “finished”
• 1:10,000 error rate?
• How useful is an unfinished genome?
• Reference genomes
• TPA and RefSeq


In Closing ...

• Able to recognize various data formats, and know what their primary use is.

• Know, understand and utilize all types of sequence identifiers.

• Know and understand various feature types present in the GenBank flat files.

• Know and understand the various GenBank divisions.


Resources

• WWW:

– http://www.ncbi.nlm.nih.gov

– http://www.ddbj.nig.ac.jp/

– http://www.ebi.ac.uk/

– http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html

– http://www.expasy.ch/sprot/

– http://www.rcsb.org/pdb/index.html

– http://www.ncbi.nlm.nih.gov/Omim/

– http://genome-www.stanford.edu/Saccharomyces/

– http://nar.oupjournals.org/content/vol30/issue1/

– http://nar.oupjournals.org/content/vol31/issue1/
