Molecular Biology Databases

Lawrence Hunter, Ph.D.
Director, Computational Bioscience Program
University of Colorado School of Medicine
Larry.Hunter@uchsc.edu
http://compbio.uchsc.edu/Hunter

Dec 16, 2015



Tour of the major molecular biology databases

• A database is an indexed collection of information
• There is a tremendous amount of information about biomolecules in publicly available databases.
• Today, we will just look at some of the main databases and what kind of information they contain.


Data about Databases

• Nucleic Acids Research publishes an annual database issue. The 2009 issue lists 1170 editorially selected databases (link on course web site)
• Small excerpt from the A's:
  – AARSDB: Aminoacyl-tRNA synthetase sequences
  – ABCdb: ABC transporters
  – AceDB: C. elegans, S. pombe, and human sequences and genomic information
  – ACTIVITY: Functional DNA/RNA site activity
  – ALFRED: Allele frequencies and DNA polymorphisms


Located Sequence Features

• Indexing relevant data isn’t always easy
  – Naming schemes are always in flux, vary across communities, and are often controversial.
  – Descriptions of phenotypes are very difficult to standardize (even many clinical ones)
• Genome sequences provide a clear reference
  – A “located sequence feature” (place on a chromosome) is unambiguous and biologically meaningful
  – Closely related to the molecular concept of a gene.
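
One way to picture a located sequence feature is as a small record of coordinates. The sketch below is only an illustration: the class name, the field names, and the coordinates are all made up, and the `assembly` field reflects an added assumption that coordinates only stay unambiguous relative to a stated genome build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocatedFeature:
    """A stretch of a chromosome, identified purely by where it sits."""
    assembly: str     # genome build the coordinates refer to (assumption)
    chromosome: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    strand: str       # "+" or "-"

    def length(self) -> int:
        return self.end - self.start + 1

    def overlaps(self, other: "LocatedFeature") -> bool:
        """True when both features share at least one base position."""
        return (
            self.assembly == other.assembly
            and self.chromosome == other.chromosome
            and self.start <= other.end
            and other.start <= self.end
        )


# Made-up coordinates, for illustration only.
gene = LocatedFeature("demo-build", "chr4", 1_000, 5_000, "-")
variant = LocatedFeature("demo-build", "chr4", 2_500, 2_500, "+")
elsewhere = LocatedFeature("demo-build", "chr7", 2_500, 2_500, "+")

print(gene.length())             # 4001
print(gene.overlaps(variant))    # True
print(gene.overlaps(elsewhere))  # False
```

Because the record carries no name at all, two communities that disagree about what to call a gene can still agree on whether they mean the same place.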


What can be discovered about a gene by a database search?

• Best to have specific informational goals:
  – Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
  – Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
  – Structural information: associated protein structures, fold types, structural domains
  – Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
  – Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases


Using a database

• How to get information out of a database:
  – Summaries: how many entries, average or extreme values; rates of change, most recent entries, etc.
  – Browsing: getting a sense of the kind and quality of information available, e.g. checking familiar records
  – Search: looking for specific, predefined information
• “Key” to searching a database:
  – Must identify the element(s) of the database that are of interest somehow:
    • Gene name, symbol, location or other identifying information.
    • Sequences of genes, mRNAs, proteins, etc.
    • A cross-reference from another database or a database-generated ID.
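
The three ways of getting information out (summaries, browsing, search by key) can be sketched on a toy in-memory collection. Everything here is hypothetical: the records, the identifiers, and the function names are invented for illustration and do not correspond to any real database.

```python
from statistics import mean

# Hypothetical records; identifiers and values are invented.
records = [
    {"id": "REC1", "symbol": "GENEA", "length": 1200, "added": "2008-03-01"},
    {"id": "REC2", "symbol": "GENEB", "length": 450, "added": "2008-11-15"},
    {"id": "REC3", "symbol": "GENEC", "length": 3100, "added": "2009-01-20"},
]

# The index: a mapping from a key (here the gene symbol) to its record.
by_symbol = {rec["symbol"]: rec for rec in records}


def summarize(recs):
    """Summaries: counts, average and extreme values, most recent entry."""
    lengths = [rec["length"] for rec in recs]
    newest = max(recs, key=lambda rec: rec["added"])  # ISO dates sort as text
    return {
        "entries": len(recs),
        "mean_length": mean(lengths),
        "max_length": max(lengths),
        "most_recent": newest["id"],
    }


def browse(recs, n=2):
    """Browsing: look at the first few records to judge kind and quality."""
    return recs[:n]


def search(index, key):
    """Search: fetch one specific record by its key, or None if absent."""
    return index.get(key)


print(summarize(records))
print(browse(records))
print(search(by_symbol, "GENEB"))
```

The search only works because a key was chosen in advance; a different key (location, sequence, a cross-reference) would need its own index.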


Searching for information about genes and their products

• Gene and gene product databases are often organized by sequence
  – Genomic sequence encodes all traits of an organism.
  – Gene products are uniquely described by their sequences.
  – Similar sequences among biomolecules indicate both similar function and an evolutionary relationship
• Macromolecular sequences provide biologically meaningful keys for searching databases
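
As a minimal sketch of what "sequence as key" means, a sequence can literally be used as the lookup key of a table. The sequences and annotations below are toy strings invented for illustration, and `normalize` and `lookup` are hypothetical helper names.

```python
def normalize(seq: str) -> str:
    """Drop whitespace and upper-case, so equivalent spellings share one key."""
    return "".join(seq.split()).upper()


# Toy annotations keyed by sequence (invented, not real genes).
annotations = {
    normalize("ATGGCC TTT"): "toy record 1",
    normalize("atgaaacccggg"): "toy record 2",
}


def lookup(seq: str):
    """Exact-match lookup: returns the annotation or None."""
    return annotations.get(normalize(seq))


print(lookup("atggcc\nttt"))    # toy record 1
print(lookup("ATGAAACCCGGT"))   # None: one base differs, so no exact match
```

The second lookup shows the limit of exact keys: a single changed base already misses, which is why sequence searches are built around similarity and not identity.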


Searching sequence databases

• Starting from a sequence alone, find information about it

• Many kinds & sources of input sequences
  – Genomic, expressed, protein (amino acid vs. nucleic acid)
  – Complete or fragmentary sequences
• Goal is to retrieve a set of similar sequences.
  – Exact matches are rare, and not always interesting
  – Both small differences (mutations) and large ones (regions not required for function) within “similar” sequences can be biologically important.
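
To make "retrieve a set of similar sequences" concrete, here is a deliberately crude sketch that ranks toy sequences by the fraction of short subsequences (k-mers) they share with the query. This is a stand-in technique chosen for brevity, not how BLAST or any NCBI tool scores matches; the sequences, names, and the 0.3 cutoff are arbitrary assumptions.

```python
def kmers(seq: str, k: int = 3) -> set:
    """All distinct substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets (0.0 = nothing shared, 1.0 = same set)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def search(query: str, database: dict, k: int = 3, min_score: float = 0.3):
    """Return (score, name) pairs at or above min_score, best first."""
    scored = [(similarity(query, seq, k), name) for name, seq in database.items()]
    return sorted((hit for hit in scored if hit[0] >= min_score), reverse=True)


# Toy database, invented for illustration.
database = {
    "near": "ATGGCCAAGTTT",      # one base differs from the query
    "fragment": "GGCCAAG",       # an internal piece of the query
    "far": "CCCCCCCCCCCC",       # unrelated
}

for score, name in search("ATGGCCAAGTTC", database):
    print(f"{name}: {score:.2f}")
```

Both the point-mutated sequence and the fragment come back as hits, while the unrelated one is dropped, which mirrors the slide's point that exact matches are not the goal.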


What might we want to know about a sequence?

• Is this sequence similar to any known genes? How close is the best match? Significance?

• What do we know about that gene?
  – Genomic (chromosomal location, allelic information, regulatory regions, etc.)
  – Structural (known structure? structural domains? etc.)
  – Functional (molecular, cellular & disease)
• Evolutionary information:
  – Is this gene found in other organisms?
  – What is its taxonomic tree?


NCBI and Entrez

• One of the most useful and comprehensive database collections is the NCBI, part of the National Library of Medicine.
  – Home to GenBank, PubMed & many other familiar DBs.
• NCBI provides interesting summaries, browsers, and search tools
• Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, chromosomal location, diseases, articles, keywords...
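
The slides use the Entrez web pages. The same kind of search can also be phrased as a URL for NCBI's E-utilities, which these slides do not cover; the sketch below only builds the request and does not send it. The helper name `esearch_url` and the example query are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (not mentioned in the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch request URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


# Example query: a gene name restricted to one organism.
print(esearch_url("gene", "alcohol dehydrogenase AND human[Organism]"))
```

Opening the printed URL in a browser should return the matching record IDs; NCBI's own usage guidelines apply to anything beyond occasional manual requests.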


BLAST: Searching with a sequence

• The goal is to find other sequences that are more similar to the query than would be expected by chance (and therefore are likely homologous).
• Can start with nucleotide or amino acid sequence, and search for either (or both)
• Many options
  – E.g. ignore low information (repetitive) sequence, set significance critical value
  – Defaults are not always appropriate: READ THE NCBI EDUCATION PAGES!
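
"More similar than expected by chance" is usually quantified as an expected number of chance hits, E = K · m · n · exp(−λS), where S is the alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The sketch below just evaluates that formula; the values of K and λ, the lengths, and the 0.01 cutoff are placeholders chosen for illustration, not values from the slides or from any particular BLAST run.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def is_significant(e_value, threshold=0.01):
    """Apply a significance critical value (the threshold is a user choice)."""
    return e_value <= threshold


# Placeholder statistical parameters and sizes (assumptions).
LAM, K = 0.3, 0.1
QUERY_LEN, DB_LEN = 300, 1_000_000

weak = expected_hits(30, QUERY_LEN, DB_LEN, LAM, K)
strong = expected_hits(120, QUERY_LEN, DB_LEN, LAM, K)

print(f"score 30:  E = {weak:.3g}, significant: {is_significant(weak)}")
print(f"score 120: E = {strong:.3g}, significant: {is_significant(strong)}")
```

Note that E grows with database size, so the same alignment can be significant against a small database and not against a large one; that is one reason the default settings deserve a second look.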


Main BLAST page


A demonstration sequence

atgcacttgagcagggaagaaatccacaaggactcaccagtctcctggtctgcagagaagacagaatcaacatgagcacagcaggaaaagtaatcaaatgcaaagcagctgtgctatgggagttaaagaaacccttttccattgaggaggtggaggttgcacctcctaaggcccatgaagttcgtattaagatggtggctgtaggaatctgtggcacagatgaccacgtggttagtggtaccatggtgaccccacttcctgtgattttaggccatgaggcagccggcatcgtggagagtgttggagaaggggtgactacagtcaaaccaggtgataaagtcatcccactcgctattcctcagtgtggaaaatgcagaatttgtaaaaacccggagagcaactactgcttgaaaaacgatgtaagcaatcctcaggggaccctgcaggatggcaccagcaggttcacctgcaggaggaagcccatccaccacttccttggcatcagcaccttctcacagtacacagtggtggatgaaaatgcagtagccaaaattgatgcagcctcgcctctagagaaagtctgtctcattggctgtggattttcaactggttatgggtctgcagtcaatgttgccaaggtcaccccaggctctacctgtgctgtgtttggcctgggaggggtcggcctatctgctattatgggctgtaaagcagctggggcagccagaatcattgcggtggacatcaacaaggacaaatttgcaaaggccaaagagttgggtgccactgaatgcatcaaccctcaagactacaagaaacccatccaggaggtgctaaaggaaatgactgatggaggtgtggatttttcatttgaagtcatcggtcggcttgacaccatgatggcttccctgttatgttgtcatgaggcatgtggcacaagtgtcatcgtaggggtacctcctgattcccaaaacctctcaatgaaccctatgctgctactgactggacgtacctggaagggagctattcttggtggctttaaaagtaaagaatgtgtcccaaaacttgtggctgattttatggctaagaagttttcattggatgcattaataacccatgttttaccttttgaaaaaataaatgaaggatttgacctgcttcactctgggaaaagtatccgtaccattctgatgttttgagacaatacagatgttttcccttgtggcagtcttcagcctcctctaccctacatgatctggagcaacagctgggaaatatcattaattctgctcatcacagattttatcaataaattacatttgggggctttccaaagaaatggaaattgatgtaaaattatttttcaagcaaatgtttaaaatccaaatgagaactaaataaagtgttgaacatcagctggggaattgaagccaataaaccttccttcttaaccatt


• Major choices:
  – Translation
  – Database
  – Filters
  – Restrictions
  – Matrix
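
The "Translation" choice matters because a nucleotide query can be compared to protein databases only after being read as codons. A minimal sketch of that step, using the standard genetic code and a short stretch copied from the demonstration sequence; the function name and the `frame` parameter are illustrative, and real tools translate all six reading frames.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward reading frame; '*' marks a stop codon,
    'X' marks any codon containing a character other than A, C, G, T."""
    seq = dna.upper()[frame:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# A short stretch taken from inside the demonstration sequence.
print(translate("atgagcacagcaggaaaagta"))  # MSTAGKV
```

Shifting `frame` by one or two bases gives a completely different protein string, which is why the reading frame (or searching all of them) is one of the major choices.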


Formatted BLAST output


Close hit: Macaque ADH alpha


Distant hit: L-threonine 3-dehydrogenase from a thermophilic bacterium


Parameters


Click on:


Taxonomy report (link from “Results of BLAST” page)


What did we just do?

• Identify loci (genes) associated with the sequence. Input was human Alcohol Dehydrogenase 1A

• For each particular “hit”, we can look at that sequence and its alignment in more detail.

• See similar sequences, and the organisms in which they are found.

• But there’s much more that can be found on these genes, even just inside NCBI…


BLink: Precomputed BLAST


Conserved domains


NCBI version of KEGG & EcoCyc


More from Entrez Gene


And more…


PubMed


Gene Expression


Detailed expression information


Genome map view


OMIM


NCBI is not all there is...

• Links to non-NCBI databases (see also “Link Out”)
  – Reactome for pathways (also KEGG)
  – HGNC for nomenclature
  – HPRD protein information
  – Regulatory / binding site DBs (e.g. CREB; some not linked)
  – iHOP (Information Hyperlinked over Proteins)
• Other important gene/protein resources not linked:
  – UniProt (most carefully annotated)
  – PDB (main macromolecular structure repository)
  – UCSC (best genome viewer & many useful ‘tracks’)
  – DIP / MINT (protein-protein interactions)
  – More: InterPro, MetaCyc, Enzyme, etc.


Gene Names (not easy!)


Protein reference db


Take home messages

• There are a lot of molecular biology databases, containing a lot of valuable information

• Not even the best databases have everything (or the best of everything)

• These databases are moderately well cross-linked, and there are “linker” databases

• Sequence is a good identifier, maybe even better than gene name!
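
The cross-linking point can be sketched as a small graph walk: given pairs of identifiers that some "linker" resource declares equivalent, collect everything reachable from one starting ID. Every identifier and database prefix below is invented for illustration, and `build_links` and `linked_ids` are hypothetical helper names.

```python
from collections import defaultdict, deque

# Invented cross-reference pairs: (identifier in one database, identifier in another).
XREFS = [
    ("dbA:0001", "dbB:X17"),
    ("dbB:X17", "dbC:P-42"),
    ("dbA:0002", "dbC:P-99"),
]


def build_links(pairs):
    """Turn cross-reference pairs into a symmetric lookup table."""
    links = defaultdict(set)
    for left, right in pairs:
        links[left].add(right)
        links[right].add(left)
    return links


def linked_ids(start, links):
    """All identifiers reachable from `start` by following cross-references."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in links.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


links = build_links(XREFS)
print(sorted(linked_ids("dbA:0001", links)))  # ['dbB:X17', 'dbC:P-42']
```

Here dbA:0001 reaches dbC:P-42 only through dbB, which is the "moderately well cross-linked" situation: the record is findable, but only if the intermediate link exists.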


Homework

• Pick a favorite gene (or, if you don’t know any, how about looking up one of my favorites, PPAR-Delta) and gather information about it from at least five distinct resources.
• Readings:
  – Nucleic Acids Research online Molecular Biology Database Collection in 2009. Nucl. Acids Res. 2009 37: D1-D4. doi:10.1093/nar/gkn942
    • also, browse some of the entries themselves.
  – NCBI tutorial, Entrez: Making use of its power.