UBC Bioinformatics Centre
Bioinformatics: Understanding the data in databases
MedGen 505, January 9, 2003
Francis Ouellette
Director, UBC Bioinformatics Centre
Vancouver, BC, Canada
[email protected]
Jan 20, 2016
Copyright 2002 UBC Bioinformatics Centre
http://vanbug.org
• Monthly bioinformatics seminar
• The second Thursday of every month
• Attended by academics, industry and government types
• Talk followed by beer and pizza
• Tonight @ 6:00 at the Chan Centre at the BCRI (CMMT)
• Nat Goodman, a senior research scientist from the ISB, aka “the IT guy”
http://bioinformatics.ubc.ca
Bioinformatics is about understanding how life works. It is a hypothesis-driven science.
In bioinformatics, we use software tools and biological databases to ask questions.
At the UBC Bioinformatics Centre (UBiC) we bring together scientists who share the vision of making advances in computational biology, and we work with bench scientists to validate the hypotheses we generate.
Structure
• Director
• Associate Director
• 6 adjunct faculty
• 4 more to be recruited
• Another recruitment already in progress
• Director of Operation and Strategy
• Chief Soft. Dev.
• Chief Bioinformatics
• Chief Systems
• Chief Training and Support
• Chief Web Development
UBiC: the vision
• Basic Research: Gene Identification, Comparative Genomics, Algorithm development
• Support & Training: CBW, WWW, Workshops
• Large Scale Bioinformatics: BLAST, IDB, PeGASys
The UBC Bioinformatics Centre:
Ouellette Lab projects
• Core facility: training and support
• GeneComber: an ab initio gene finding algorithm
• IDB: the Integral DataBase system
• PeGASys: Parallel genome annotation system
• GeMS: Genomic Mutational Signature Sequences
http://bioinformatics.ca
Canadian Bioinformatics Workshop Series
• Bioinformatics
• Genomics
• Proteomics
• Developing the Tools
• Intro Programming
Bioinformatics is about bringing biological themes together with the help of computer tools and biological databases. Computational biology can lead us to new insights or directions.
BLAST Result
Basic Local Alignment Search Tool
Genetic Analysis of Cancer in Families
The Genetic Predisposition to Cancer
PubMed Text Neighboring
• Common terms could indicate similar subject matter
• Statistical method
• Weights based on term frequencies within document and within the database as a whole
• Some terms are better than others
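The term-weighting idea in these bullets can be sketched with a generic TF-IDF scheme and cosine similarity. This is an illustration only and is not claimed to be PubMed's actual neighboring algorithm; the function names and the three toy "documents" (titles that appear in these slides) are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using TF * IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # A term found in every document gets weight 0: it carries no signal.
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (titles only, for illustration).
docs = [
    "genetic analysis of cancer in families",
    "the genetic predisposition to cancer",
    "the transcriptional program in the response of human fibroblasts to serum",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this toy corpus the two cancer genetics titles score as closer neighbors than the cancer title and the fibroblast title. A real system would also remove stop words such as "of" and "in", which is one way that some terms are better than others.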
Micro-array analysis:
Figure 4, Figure 1
Science Jan 1 1999: 83-87
The Transcriptional Program in the Response of Human Fibroblasts to Serum
Vishwanath R. Iyer, Michael B. Eisen, Douglas T. Ross, Greg Schuler, Troy Moore, Jeffrey C. F. Lee, Jeffrey M. Trent, Louis M. Staudt, James Hudson Jr., Mark S. Boguski, Deval Lashkari, Dari Shalon, David Botstein, Patrick O. Brown
VAST Result
Ferredoxin
•Halobacterium marismortui
•Chlorella fusca
• Vector
• Alignment
• Search
• Tool
Computational Biology Analysis
Q Gln NH2-C(=O)-CH2-CH2-
R Arg NH2-C(=NH2+)-NH-CH2-CH2-CH2-
Structural Interactions
Other interactions occurring within this structure (blue). In this case Glutaminyl-tRNA Synthetase interacting with AMP.
Positional Cloning
Family Studies → (Genetic Mapping) → Chromosome Interval → (Physical Mapping) → Large-Insert Clones → (Transcript Mapping) → Candidate Genes → (Gene Sequencing) → Disease Mutation
Normal: Met ATG, Val GTC, Ser TCA, Leu CTG, Gln CAA, Pro CCG, Cys TGT
Mutant: Met ATG, Val GTC, Ser TCA, Leu CTG, STOP TAA *
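The disease mutation on this slide changes a Gln codon (CAA) into a STOP codon (TAA). A minimal sketch of how such a change can be found by comparing and translating the two reading frames follows. The codon table holds only the codons shown on the slide, and the function names are assumptions for illustration.

```python
# Only the codons that appear on the slide; a real tool would use the
# full genetic code table.
CODONS = {
    "ATG": "Met", "GTC": "Val", "TCA": "Ser", "CTG": "Leu",
    "CAA": "Gln", "CCG": "Pro", "TGT": "Cys", "TAA": "STOP",
}

def translate(dna):
    """Translate codon by codon, stopping at the first STOP codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODONS[dna[i:i + 3]]
        protein.append(residue)
        if residue == "STOP":
            break
    return protein

def first_difference(normal, mutant):
    """Return the 1-based position and bases of the first mismatch, or None."""
    for pos, (n, m) in enumerate(zip(normal, mutant), start=1):
        if n != m:
            return pos, n, m
    return None

normal = "ATGGTCTCACTGCAACCGTGT"
mutant = "ATGGTCTCACTGTAACCGTGT"
print(translate(normal))
print(translate(mutant))
print(first_difference(normal, mutant))
```

A single C to T change at the first base of the fifth codon is enough to truncate the protein after Leu.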
Positional Candidate Cloning
Family Studies → (Genetic Mapping) → Chromosome Interval → (Computer Search) → Candidate Genes → (Gene Sequencing) → Disease Mutation
Normal: Met ATG, Val GTC, Ser TCA, Leu CTG, Gln CAA, Pro CCG, Cys TGT
Mutant: Met ATG, Val GTC, Ser TCA, Leu CTG, STOP TAA *
What does it mean to do CB?
• Like to work with sequences, structures, expression arrays, interaction of molecules and genetic maps
• Like the whole systems approach
• Like the IT component, and the power it provides for crunching through lots of data
• Like clear answers
• Like to do Science
Doing CB means to be …
• Database user
• Tool user
• Database developer
• Tool developer
• Training, practicing or developing
• Doing bioinformatics experiments
Bioinformatics experiments:
BLAST search
Sequence Alignment
Reagents:
• Sequence
• Databases
Method:
• P-P BLASTP
• N-P BLASTX
• P-N TBLASTN
• N-N BLASTN
• N (P) – N (P) TBLASTX
Interpretation:
• Similarity
• Hypothesis testing
Know your reagents
Know your methods
Do your controls
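"Know your methods" starts with picking the right program for the query and database types. The mapping listed under Method can be written down as a small lookup. This is a sketch; the function name and the `translate_both` flag are assumptions for illustration and not part of any BLAST interface.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
# "N" = nucleotide, "P" = protein.
BLAST_PROGRAMS = {
    ("P", "P"): "BLASTP",
    ("N", "P"): "BLASTX",   # nucleotide query translated, protein database
    ("P", "N"): "TBLASTN",  # protein query, nucleotide database translated
    ("N", "N"): "BLASTN",
}

def choose_blast(query_type, db_type, translate_both=False):
    """Pick a BLAST program from the query and database molecule types."""
    if translate_both:
        # TBLASTX compares translated query against translated database.
        if (query_type, db_type) != ("N", "N"):
            raise ValueError("TBLASTX needs a nucleotide query and database")
        return "TBLASTX"
    return BLAST_PROGRAMS[(query_type, db_type)]

print(choose_blast("N", "P"))
print(choose_blast("N", "N", translate_both=True))
```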
Nature 409:452
Part 1. The Databases
1. GenBank: The Nucleotide Sequence Database
2. PubMed: The Bibliographic Database
3. Macromolecular Structure Databases
4. The Taxonomy Project
5. The Single Nucleotide Polymorphism Database
6. The Gene Expression Omnibus (GEO)
7. Online Mendelian Inheritance in Man (OMIM)
8. The NCBI BookShelf: Searchable Biomedical Books
9. PubMed Central (PMC)
10. The SKY/CGH Database
Part 2. Data Flow and Processing
11. Sequin: A Sequence Submission and Editing Tool
12. The Processing of Biological Sequence Data at NCBI
13. Genome Assembly and Annotation Process
Part 3. Querying and Linking the Data
14. The Entrez Search and Retrieval System
15. The BLAST Sequence Analysis Tool
16. LinkOut: Linking to External Resources from Entrez
17. The Reference Sequence (RefSeq) Project
18. LocusLink: A Directory of Genes
19. Using the Map Viewer to Explore Genomes
20. UniGene: A Unified View of the Transcriptome
21. The Clusters of Orthologous Groups (COGs)
Part 4. User Support
22. User Services: Helping You Find Your Way
23. Exercises: Using Map Viewer
Glossary
The challenge of the information space:
Jan 2002
Nucleotide records            14,976,310
Nucleotides                   15,849,921,438
Protein sequences             1,793,850
3D structures                 16,500
Interactions                  6,181
Expression data points        >20,000,000
Human UniGene Clusters        96,109
Maps and Complete Genomes     1,600
Different taxonomy Nodes      229,799
Human dbSNP                   4,116,188
Human RefGenes records        17,984
bp in Human Contigs > 500 kb  1,154,596,000
PubMed records                11,692,207
OMIM records                  13,346
The challenge of the information space:
Jan 2003
Nucleotide records            22,318,883
Nucleotides                   28,507,990,166
Protein sequences             2,955,588
3D structures                 19,392
Interactions & complexes      7,119
Expression data points        >40,000,000
Human UniGene Clusters        115,523
Maps and Complete Genomes     2,698
Different taxonomy Nodes      278,402
Human dbSNP                   4,892,258
Human RefSeq records          20,008
bp in Human Contigs > 500 kb  1,451,804
PubMed records                12,319,105
OMIM records                  14,116
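To put the two snapshots side by side, the year-over-year growth can be computed directly from the counts on these slides. This is a small worked calculation; only a few of the rows are included, and the variable and function names are illustrative.

```python
# Counts copied from the Jan 2002 and Jan 2003 "information space" slides.
jan_2002 = {
    "Nucleotide records": 14_976_310,
    "Nucleotides": 15_849_921_438,
    "Protein sequences": 1_793_850,
    "3D structures": 16_500,
    "PubMed records": 11_692_207,
}
jan_2003 = {
    "Nucleotide records": 22_318_883,
    "Nucleotides": 28_507_990_166,
    "Protein sequences": 2_955_588,
    "3D structures": 19_392,
    "PubMed records": 12_319_105,
}

def percent_growth(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

growth = {k: percent_growth(jan_2002[k], jan_2003[k]) for k in jan_2002}
for name, pct in growth.items():
    print(f"{name}: {pct:.1f}%")
```

In one year the number of nucleotides grew by roughly 80%, while PubMed grew by only a few percent: sequence data is where the information space expands fastest.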
Databases
• Organized array of information
• Place where you put things in, and (if all is well) you should be able to get them out again.
• Resource for other databases and tools.
• Simplify the information space by specialization.
• Bonus: Allows you to make discoveries.
Databases
• Information system: the UBC library, Google, Entrez, SRS
• Query system: a list you look at, a catalogue, indexed files, SQL, grep
• Storage system: boxes, bookshelves, PC binary files, Unix text files
• Data: GenBank flat file, PDB file, Interaction Record, title of a book, book
“... the more closely and elegantly a model follows a real phenomenon, the more useful it is in predicting or understanding the natural phenomenon it mimics.”
Ostell, Wheelan & Kans on the “NCBI data model”
from “Bioinformatics, a Practical Guide to the Analysis of Genes and Proteins.”, Baxevanis and Ouellette, Eds. 2001
Using the NCBI data model
[Diagram: data types linked through the NCBI data model, with example protein and nucleotide sequences]
• Genomes
• Structures
• Protein Sequences
• GenBank
• MEDLINE
• CMMT
• Expression Data
• SNP Data
• PubMed online Journals (Full text)
• Links: Accession Numbers; Accession Numbers - Map
• MMDB structure:function
• VAST
• BIND interaction:function
Primary Data
• DNA sequences
• RNA sequences
• Protein sequences
  – In most cases protein sequences are interpreted sequences.
• 3D structures
• Expression data
• Polymorphism data
• Interaction data
Databases: some examples
• Primary (archival)
  – DDBJ/EMBL/GenBank
  – TrEMBL
  – UniProt
  – PDB
  – Medline
  – BIND
• Secondary (curated)
  – LocusLink
  – RefSeq
  – Taxon
  – Swiss-Prot
  – PROSITE
  – OMIM
  – SGD
  – FlyBase
  – GO
What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
Benson et al., 2002, Nucleic Acids Res. 29:12-17
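[Diagram: the three collaborating nucleotide databases]
• GenBank (NCBI, NIH): accessed through Entrez; accepts submissions and updates
• EMBL (EBI): accessed through SRS; accepts submissions and updates
• DDBJ (CIB, NIG): accessed through getentry; accepts submissions and updates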
GenBank Flat File (GBFF)
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center
            for Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:[email protected], Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//
Record sections:
• Header: Title, Taxonomy, Citation
• Features (AA seq)
• DNA Sequence
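The three parts labelled on this slide (header, features, DNA sequence) can be separated with a few lines of code. This is a minimal sketch that assumes the standard layout, in which FEATURES and ORIGIN begin in the first column and the record ends with `//`. The function name and the toy record are hypothetical, and a production parser would handle many more cases.

```python
def split_flat_file(record):
    """Split one GenBank flat-file record into header, features, sequence.

    Lines between FEATURES and ORIGIN (including BASE COUNT, when present)
    are returned in the features part by this simple split.
    """
    header, features, origin = [], [], []
    section = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith("ORIGIN"):
            section = origin
            continue
        elif line.startswith("//"):
            break
        section.append(line)
    # Drop position numbers and spaces to recover the bare sequence.
    sequence = "".join(ch for line in origin for ch in line if ch.isalpha())
    return header, features, sequence

# Abbreviated, hypothetical record used only to exercise the function.
example = """LOCUS       EXAMPLE        12 bp    mRNA
DEFINITION  Toy record.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 tcagtttttt tt
//"""
header, features, sequence = split_flat_file(example)
print(len(header), len(features), sequence)
```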
Abstract Syntax Notation (ASN.1)
FASTA
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFA
LPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADK
AIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNR
KQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
The definition line starts with ">" and carries the identifiers and the title.
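The definition line of this FASTA record packs several identifiers into one pipe-delimited string. A small sketch of how it can be taken apart follows. It assumes the five-field `gi|number|db|accession|locus` layout seen in this example; other databases use different field layouts, and the function name is illustrative.

```python
def parse_defline(defline):
    """Parse a FASTA definition line of the form
    >gi|<gi>|<db>|<accession>|<locus> <description>
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifiers, _, description = defline[1:].partition(" ")
    fields = identifiers.split("|")
    return {
        "gi": int(fields[1]),
        "db": fields[2],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }

info = parse_defline(
    ">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4"
)
print(info)
```

Here `sp` marks a Swiss-Prot entry, `P03069` is its accession and `GCN4_YEAST` its entry name.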
Graphical Representation
[Diagram: ASN.1 at the centre, surrounded by the other formats]
• ASN.1
• FASTA
• Graphical
• GenPept
• GenBank
• MMDB
• Swiss-Prot
• EMBL
Outline
• GenBank dissection
  – identifiers
  – divisions
  – format/structure
  – features
  – file conversions
Organismal Divisions
Code  Division                  Used in which database?
BCT   Bacterial                 DDBJ - GenBank
FUN   Fungal                    EMBL
HUM   Homo sapiens              DDBJ - EMBL
INV   Invertebrate              all
MAM   Other mammalian           all
ORG   Organelle                 EMBL
PHG   Phage                     all
PLN   Plant                     all
PRI   Primate (also see HUM)    all (not same data in all)
PRO   Prokaryotic               EMBL
ROD   Rodent                    all
SYN   Synthetic and chimeric    all
VRL   Viral                     all
VRT   Other vertebrate          all
Functional Divisions
PAT  Patent
EST  Expressed Sequence Tags
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High throughput cDNA (unfinished)
Organismal divisions:
BCT FUN INV MAM PHG PLN PRI ROD SYN VRL VRT
Guiding Principles
In GenBank, records are grouped for various reasons: understanding this is key to using and fully taking advantage of this database.
LOCUS, Accession, NID and protein_id
LOCUS: Unique string of up to 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system where the accession and version play the same function as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
PID: Protein identifier: g, e or d prefix to a gi number. A CDS can have one or two.
Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
LOCUS, Accession, gi and PID
LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
Identifiers in this record:
LOCUS: HSU40282
ACCESSION: U40282
VERSION (Accession.version): U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
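The VERSION line ties the stable accession to a specific sequence revision and its gi. A short sketch of pulling the three values apart follows. The regular expression is an assumption that fits the records shown in these slides, the function names are illustrative, and `U40282.2` is a hypothetical later version used only for the comparison.

```python
import re

def parse_version_line(line):
    """Parse a flat-file VERSION line such as 'VERSION U40282.1 GI:3150001'."""
    match = re.match(r"VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:(\d+)", line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return accession, int(version), int(gi)

def same_record(id_a, id_b):
    """Two Accession.version strings name the same record if the accession
    part matches; the version says which sequence revision is meant."""
    return id_a.split(".")[0] == id_b.split(".")[0]

print(parse_version_line("VERSION     U40282.1  GI:3150001"))
print(same_record("U40282.1", "U40282.2"))  # U40282.2 is hypothetical
```

Cite the accession when you mean the record, and Accession.version (or the gi) when you mean one exact sequence.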
Sample GenBank mRNA Record
LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1789)
  AUTHORS   Hannigan,G.E., Leung-Hagesteijn,C., Fitz-Gibbon,L.,
            Coppolino,M.G., Radeva,G., Filmus,J., Bell,J.C. and Dedhar,S.
  TITLE     Regulation of cell adhesion and anchorage-dependent growth by a
            new beta 1-integrin-linked protein kinase
  JOURNAL   Nature 379 (6560), 91-96 (1996)
  MEDLINE   96135142
REFERENCE   2  (bases 1 to 1789)
  AUTHORS   Dedhar,S. and Hannigan,G.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-NOV-1995) Shoukat Dedhar, Cancer Biology Research,
            Sunnybrook Health Science Centre and University of Toronto, 2075
            Bayview Avenue, North York, Ont. M4N 3M5, Canada
Labelled fields: LOCUS, length, mol-type, Division, Create/update date, DEF line, accession, Accession.version, gi, Taxonomy, Cit-Art, Cit-Sub
Sample GenBank Record
FEATURES             Location/Qualifiers
     source          1..1789
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11p15"
                     /cell_line="HeLa"
     gene            1..1789
                     /gene="ILK"
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
                     /translation="MDDIFTQCREGNAVAVRLWLDNTENDLNQGDDHGFSPLHWACRE
                     . . . DK"
BASE COUNT      443 a    488 c    480 g    378 t
ORIGIN
        1 gaattcatct gtcgactgct accacgggag ttccccggag aaggatcctg cagcccgagt
          < ...>
     1681 ggcgggctca gagctttgtc acttgccaca tggtgtcttc caacatggga gggatcagcc
     1741 ccgcctgtca caataaagtt tattatgaaa aaaaaaaaaa aaaaaaaaa
//
Labelled parts: BioSource (source feature), gene, coding sequence (CDS), sequence
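Two quick sanity checks follow from this record: the BASE COUNT line should add up to the length on the LOCUS line, and a complete CDS should span a whole number of codons. The sketch below works under those assumptions; the helper names are illustrative and only simple `start..end` locations are handled.

```python
import re

def base_count_total(line):
    """Sum the counts on a BASE COUNT line, e.g. '443 a 488 c 480 g 378 t'."""
    return sum(int(n) for n in re.findall(r"(\d+)\s+[a-z]+", line))

def cds_codons(location):
    """Number of codons in a simple 'start..end' CDS location."""
    start, end = (int(x) for x in location.split(".."))
    length = end - start + 1
    if length % 3:
        raise ValueError("CDS length is not a multiple of three")
    return length // 3

locus_length = 1789  # from the LOCUS line of U40282
print(base_count_total("BASE COUNT 443 a 488 c 480 g 378 t") == locus_length)
print(cds_codons("157..1515"))
```

For U40282 the CDS 157..1515 covers 1359 bases, that is 453 codons including the stop codon.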
EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see: http://www.ncbi.nlm.nih.gov/dbEST/ http://www.ncbi.nlm.nih.gov/UniGene/
LOCUS       AA675481      524 bp    mRNA            EST       28-NOV-1997
DEFINITION  vr72d07.s1 Knowles Solter mouse 2 cell Mus musculus cDNA clone
            IMAGE:1134253 5' similar to TR:G992993 G992993 MYOSIN LIGHT
            CHAIN KINASE. ;, mRNA sequence.
ACCESSION   AA675481
VERSION     AA675481.1  GI:2652718
KEYWORDS    EST.
SOURCE      house mouse.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
...
COMMENT     Contact: Marra M/Mouse EST Project
            WashU-HHMI Mouse EST Project
            Washington University School of MedicineP
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            Email: [email protected]
            This clone is available royalty-free through LLNL ; contact the
            IMAGE Consortium ([email protected]) for further information.
            MGI:615525
            Possible reversed clone: similarity on wrong strand
            High quality sequence stop: 469.
Callouts: DIVISION (EST), KEYWORD (EST), DEF line, Comment ("my friend Marco Marra")
FEATURES             Location/Qualifiers
     source          1..524
                     /organism="Mus musculus"
                     /strain="B6D2 F1/J"
                     /note="Organ: embryo; Vector: pBluescribe (modified);
                     Site_1: MluI; Site_2: SalI; Cloned unidirectionally from
                     mRNA prepared from 13,500 2-cell stage embryos. Primer:
                     SalI(dT): 5'-CGGTCGACCGTCGACCGTTTTTTTTTTTTTTT-3'. cDNAs
                     were cloned into the MluI/SalI sites of a modified
                     pBluescribe vector using commercial linkers (NEB). Average
                     insert size: 1.2 kb."
                     /db_xref="taxon:10090"
                     /clone="1134253"
                     /clone_lib="Knowles Solter mouse 2 cell"
                     /tissue_type="embryo"
                     /dev_stage="2-cell"
                     /lab_host="DH10B"
BASE COUNT      168 a    111 c    115 g    130 t
ORIGIN
        1 ctcagttgta gacagtgagc cagtcagatt tactgttaaa gtaacaggag aacccaagcc
       61 ggaaattaca tggtggtttg aaggagaaat actgcaggat ggagaagact atcagtacat
      121 cgaaagaggt gaaacttact gcctgtattt accggaaacc ttcccagaag atggaggaga
      181 gtacatgtgt aaggcagtca acaataaagg ctcagcagcg agcacctgca ttcttaccat
      241 tgaaatggat gactactagg cttccctctg tccttgggac tctctctctc gctgcatctc
      301 tgtggagggg ccaaaaagga gaccagaggt gccactataa ctgacttaat ctttccccaa
      361 atcttcctct taagaacttc tcatgcatat caggttcatt accatgctgt gcaaagtcaa
      421 agcatagctg acagaaaagg gaaataaatg tacccattct gtcagaacta agacagaagc
      481 ttcgtattta tagaactaag acttaacata tacagtttgc atga
//
Callout: BioSource (source feature)
STS
Sequence Tagged Sites are operationally unique sequences, each identified by the primer pair used in a PCR assay. The assay generates a mapping reagent that maps to a single position within the genome.
Also see: http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/
GSS: Genome Survey Sequences
Genome Survey Sequences are similar in nature to ESTs, except that their sequences are genomic in origin, rather than cDNA (mRNA).
The GSS division contains:
• random "single pass read" genome survey sequences
• single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences
Also see: http://www.ncbi.nlm.nih.gov/dbGSS/
LOCUS       FR0029137     445 bp    DNA             GSS       30-JUN-1998
DEFINITION  Fugu rubripes GSS sequence, clone 037G16aE9, genomic survey
            sequence.
ACCESSION   AL031006
VERSION     AL031006.1  GI:3286795
KEYWORDS    GSS; genome survey sequence.
SOURCE      Fugu rubripes.
  ORGANISM  Fugu rubripes
            Eukaryota; Metazoa; Chordata; Vertebrata; Actinopterygii;
            Neopterygii; Teleostei; Euteleostei; Acanthopterygii;
            Percomorpha; Tetraodontiformes; Tetraodontoidei;
            Tetraodontidae; Fugu.
REFERENCE   1  (bases 1 to 445)
  AUTHORS   Elgar,G., Clark,M., Smith,S., Meek,S., Warner,S., Umrania,Y.,
            Williams,G. and Brenner,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-JUN-1998) MRC Human Genome Mapping Project Resource
            Centre, Hinxton, Cambridge, CB10 1SB, UK. Email: [email protected]
            Vector: pBluescript II KS V_type: phagemid PRIMER: KS DESCR: One
            pass dye-terminator sequencing of cosmid cloned genomic sequence.
Callouts: DIVISION (GSS), KEYWORD (GSS)
Genome Survey Sequences
FEATURES             Location/Qualifiers
     source          1..445
                     /organism="Fugu rubripes"
                     /db_xref="taxon:31033"
                     /clone_lib="cosmid 037G16"
                     /clone="037G16aE9"
BASE COUNT      124 a     96 c     97 g    126 t      2 others
ORIGIN
        1 atcctgcagt gaggcagaac agggnctgtt tccatttttt gtctgtcagt ttaaacagtg
       61 gtcggccgta aaagtcctcc gaaaacccac aaagcctttg cctatcgttc caaatcttac
      121 atgggtaagt gcaaacattt aactcaagat aagtgccttt gagataacaa aacctctttt
      181 ttcaagagag tcttggaagc gtacacacct acagcgtagc tgtttttacc tcagatgaat
      241 gtctttggna tgagggaggg aaccagatac ctggtgaaaa cccatgcaga cttgcggaga
      301 gcactgtgaa accctctggt actgagccct gaaacttcat gttgtgaggc aacagtgctt
      361 accaaaagtt tatcctgcaa ctgctattta acttctgtta gcctctgttt tggagaccac
      421 atgagttaaa tacggtttgt tgaaa
//
HTG: High Throughput Genome
High Throughput Genome Sequences are records from unfinished genome sequencing efforts. Unfinished records have gaps in the nucleotide sequence, low accuracy, and no annotations.
Also see: http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955
HTGS in GenBank
Phase    Division  Accession  gi
phase 1  HTG       AC000003   1556454
phase 2  HTG       AC000003   2182283
phase 3  PRI       AC000003   2204282
HTGS in GenBank
• Unfinished record
  – Sequencing is unfinished
  – Phase 1 or phase 2
  – HTG division
  – KEYWORDS: HTG; HTGS_PHASE1 or 2
• Finished record
  – Sequencing is finished
  – Phase 3
  – Organismal division it belongs to: PRI, INV or PLN
  – KEYWORDS: HTG
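The rules on this slide are enough to tell a finished record from an unfinished one using only the division code and the KEYWORDS line. A minimal sketch of that decision follows; the function name and the returned labels are assumptions for illustration.

```python
def htg_status(division, keywords):
    """Classify a record using the division and KEYWORDS rules on the slide."""
    keywords = {k.strip() for k in keywords.rstrip(".").split(";")}
    if "HTG" not in keywords:
        return "not an HTG record"
    if division == "HTG":
        # Still in the HTG division: unfinished, phase 1 or 2.
        for phase in ("1", "2"):
            if f"HTGS_PHASE{phase}" in keywords:
                return f"unfinished (phase {phase})"
        return "unfinished (phase not stated)"
    # Moved to its organismal division but still tagged HTG: finished.
    return "finished (phase 3)"

print(htg_status("HTG", "HTG; HTGS_PHASE1."))
print(htg_status("PRI", "HTG."))
```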
HTGS: phase 1
LOCUS       HSAC000003  120000 bp    DNA             HTG       20-SEP-1996
DEFINITION  *** SEQUENCING IN PROGRESS *** Chromosome 17 genomic sequence;
            HTGS phase 1, 6 unordered pieces.
ACCESSION   AC000003
KEYWORDS    HTG; HTGS_PHASE1.
...
COMMENT     *** *** *** WARNING: Phase 1 High Throughput Genome Sequence *** *** ***
            * This sequence is unfinished. It consists of 6 contigs for
            * which the order is not known; their order in this record is
            * arbitrary. In some cases, the exact lengths of the gaps
            * between the contigs are also unknown; these gaps are presented
            * as runs of N as a convenience only. When sequencing is complete,
            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    22526: contig of 22526 bp in length
            *    22527    23035: gap of unknown length
            *    23036    33919: contig of 10884 bp in length
            *    33920    34427: gap of unknown length
            *    34428    61877: contig of 27450 bp in length
...
//
Callouts: DIVISION (HTG), KEYWORD (HTGS_PHASE1), WARNING (definition line and comment)
HTGS Phase 1
            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    33214: contig of 33214 bp in length
            *    33215    33250: gap of unknown length
            *    33251    35134: contig of 1884 bp in length
...
    33061 ggagagcttc agggagactc tgcggaatag caggttgtaa tcttccggtt cgatagtcga
    33121 taaatgtctg gtttaccttc agccgaaacg cgggagaaat ccagcctgcg tactccacag
    33181 cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn
    33241 nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc
    33301 ctgtctaccc tccctcttcc ccttcctccc caaatctatc agtaaagacc accttgctgt
    33361 gggcagctag ctgaaagaga ccatctgcct taggaatagc ctacactaga ttcaaactac
    33421 aaagaagcag gttgggggaa agaggaagtg aggatttcaa gtcaagaaag catcctgcct
Callout: gap of unknown length (the run of n at 33215..33250)
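The gap listed in the comment can be recovered from the sequence itself by looking for runs of n. The sketch below assumes contiguous ORIGIN-style lines, each starting with the 1-based position of its first base; the function name is illustrative and the two input lines are copied from this slide.

```python
import re

def find_gaps(origin_lines):
    """Return (start, end) coordinates of runs of 'n' in ORIGIN-style lines.

    Each line starts with the 1-based position of its first base.
    Assumes the lines are contiguous and in order.
    """
    first_pos = int(origin_lines[0].split()[0])
    sequence = "".join("".join(line.split()[1:]) for line in origin_lines)
    return [
        (first_pos + m.start(), first_pos + m.end() - 1)
        for m in re.finditer(r"n+", sequence)
    ]

lines = [
    "33181 cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn",
    "33241 nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc",
]
print(find_gaps(lines))
```

The run of n spans 33215 to 33250, matching the "gap of unknown length" entry in the COMMENT block. The length of the run is a placeholder, not a measurement of the real gap.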
HTGS phase 3
LOCUS       AC000003    122228 bp    DNA             PRI       07-OCT-1997
DEFINITION  Homo sapiens chromosome 17, clone 104H12, complete sequence.
ACCESSION   AC000003
NID         g2204282
KEYWORDS    HTG.
...
COMMENT     The Staden databases, finishing information, and all
            chromatographic files used in the assembly of this clone are
            available from our anonymous ftp site. All repeats were
            identified using RepeatMasker: Smit, A.F.A. & Green, P.
            (1996-1997) http://ftp.genome.washington.edu/RM/RepeatMasker.html.
FEATURES             Location/Qualifiers
     source          1..122228
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="104H12"
                     /clone_lib="Research Genetics/Cal Tech CITB978SK-B (plates
                     1-194)"
                     /chromosome="17"
     repeat_region   261..370
                     /rpt_family="MLT1B"
Callouts: DIVISION (PRI), KEYWORD (HTG)
Locus Link
http://nar.oupjournals.org/content/vol31/issue1/
Genome Projects: discussion point
• Whole genome assembly
• “Bermuda agreement”
• HTG Finished
• What is it to be “finished”?
• 1:10,000 error rate?
• How useful is an unfinished genome?
• Reference genomes
• TPA and RefSeq
In Closing ...
• Able to recognize various data formats, and know what their primary use is.
• Know, understand and utilize all types of sequence identifiers.
• Know and understand various feature types present in the GenBank flat files.
• Know and understand the various GenBank divisions.
Resources
• WWW:
– http://www.ncbi.nlm.nih.gov
– http://www.ddbj.nig.ac.jp/
– http://www.ebi.ac.uk/
– http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
– http://www.expasy.ch/sprot/
– http://www.rcsb.org/pdb/index.html
– http://www.ncbi.nlm.nih.gov/Omim/
– http://genome-www.stanford.edu/Saccharomyces/
– http://nar.oupjournals.org/content/vol30/issue1/
– http://nar.oupjournals.org/content/vol31/issue1/