PRINCIPLES OF BIOINFORMATICS
BIO540/STA569/CSI660, Fall 2010
Lecture 4 (Sep‐15‐2010): Bioinformatics Databases
Igor Kuznetsov
Department of Epidemiology & Biostatistics, Cancer Research Center, University at Albany
Reading: Zvelebil & Baum, Chapter 3.

PRINCIPLES OF BIOINFORMATICS - Albanyberg/bio540/BIO540 Lectures/lecture_4.pdf · 2011-08-29 · PRINCIPLES OF BIOINFORMATICS BIO540/STA569/CSI660, Fall 2010 Lecture 4 (Sep‐15‐2010)

Jul 15, 2020




Genomics ‐ the first of many ‘omics’ disciplines

The genome is the entire genetic material (DNA) of an individual organism.

Genomics is a new scientific discipline that studies the genomes of various organisms.

Genomics includes efforts to determine the entire DNA sequence of an organism’s genome (“sequencing the genome”), mapping of individual genes within the genome, and studies of interactions between genes within the genome.

There are many other ‘omics’ disciplines: proteomics, metabolomics, etc.


Human Genome Trivia

• # of chromosomes: 46.

• # of base pairs: ~ 3 billion.

• Total length of stretched DNA: 2 meters.

• # of protein coding genes: ~ 25,000.

• # of proteins: ~ 50,000.

• # of RNA‐coding genes: ~ 6,000.

• Human genome is the only genome that was sequenced by its own species. 

How much biologically meaningful information is encoded in the human genome?

• We know the function for about 2% of the nucleotides in the human genome (that is, 6 × 10^7 out of 3 × 10^9 nucleotides).

• We know very little about the remaining 98% of so‐called “intergenic regions”.
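The 2% figure follows directly from the two counts quoted on the slide. A minimal check of that arithmetic (the variable names are only illustrative):

```python
# Figures quoted on the slide.
known_nucleotides = 6e7   # nucleotides whose function is known
genome_size = 3e9         # approximate size of the human genome in nucleotides

fraction_known = known_nucleotides / genome_size
print(f"known: {fraction_known:.0%}, unknown: {1 - fraction_known:.0%}")
# prints "known: 2%, unknown: 98%"
```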


Raw genomic DNA sequence

[Slide shows a full page of raw, unannotated genomic DNA sequence, beginning CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGAGATAGCATAGG...]


http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html


Yet another step in genomic data analysis – Structural genomics

• Structural genomics is the determination of three‐dimensional structures of novel proteins.

A typical output of a structural genomics experiment: a protein 3D structure composed of individual atoms (colored spheres).


Growth of the Protein Databank (PDB)


Proteome

• The proteome is the entire set of proteins expressed by a genome, cell, tissue or organism. 

• More specifically, it is the set of expressed proteins in a given type of cells or an organism at a given time under defined conditions. 

• Proteome defines the PROTEins expressed by the genOME.

Proteomics

• Proteomics is a recent scientific discipline that aims to study all the proteins expressed by a genome at a given moment.

• Proteomics involves the identification of all the proteins in the body and the determination of their role in physiological and pathophysiological functions.


Proteomics vs. Genomics

• ~25,000 genes in the human genome, but genes aren’t functional end‐products.

• The functional end products: ∼50,000 proteins (many genes encode multiple proteins).

• For a living organism, the genome is mostly static, while the proteome is highly dynamic – proteins are continuously produced, modified, and degraded.

• Proteins can be modified by post‐translational modification in response to the physiological state (stress, drug treatment, disease, etc.).


Completely sequenced genomes (as of Feb. 2010)


Informatics challenges 

• Various ‘omics’ projects produce ever larger and more diverse datasets.
• These datasets need to be organized, stored, and analyzed.
• This requires adequate information technology (IT) infrastructure and support capable of handling and analyzing these datasets.
• IT support for molecular biology research is provided by Bioinformatics.
• Databases are the backbone of bioinformatics.


Database Systems for Bioinformatics

• A database is a repository of information that has a specific structure that enables the user to enter and extract the data. Database structure consists of files or tables, each containing numerous records and fields.

• The two most popular types of bioinformatics databases are:

Flat file databases
Relational databases (RDBMS)


Flat file databases

The simplest form of a database, where data, such as nucleotide or amino acid sequences, are stored as one large text file or a collection of text files. These databases are called “flat” because they are flat like a sheet of paper.

FASTA file format ‐ the most primitive bioinformatics flat file format

>gi|45387601|ref|NP_991149.1| prion protein [Danio rerio]
MHSKFKLFSFLNCLLLLAVLLPVAQSRRGGGFGRGGGRGGGWGGSSSGRAGWGAAGGHHRAPPVHTGHMG
HIGHTGHTGHTGSSGHGVGKVAGAAAAGALGGMLVGHGLSSMGRPGYGYGYGGYGGHGYGYGHGYGHGHG
HGGHGGHSGDHNETDADYYLDGAASGHAYSCVTVFGLMMSFLIGHFLS
>gi|684|emb|CAA39368.1| prion protein [Bos taurus]
MVKSHIGSWILVLFVAMWSDVGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ
PHGGGWGQPHGGGWGQPHGGGWGQPHGGGGWGQGGTHGQWNKPSKPKTNMKHVAGAAAAGAVVGGLGGYM
LGSAMSRPLIHFGSDYEDRYYRENMHRYPNQVYYRPVDQYSNQNNFVHDCVNITVKEHTVTTTTKGENFT
ETDIKMMERVVEQMCITQYQRESQAYYQRGASVILFSSPPVILLISFLIFLIVG
>gi|147907216|ref|NP_001082180.1| prion protein [Xenopus laevis]
MPQSLWTCLVLISLICTLTVSSKKSGGGKSKTGGWNTGSNRNPNYPGGYPGNTGGSWGQQPYNPSGYNKQ
WKPPKSKTNMKSVAIGAAAGAIGGYMLGNAVGRMSYQFNNPMESRYYNDYYNQMPNRVYRPMYRGEEYVS
EDRFVRDCYNMSVTEYIIKPTEGKNNSELNQLDTTVKSQIIREMCITEYRRGSGFKVLSNPWLILTITLF
VYFVIE
>gi|2330626|emb|CAA04236.1| Prion protein [Ovis aries]
MVKSHIGSWILVLFVAMWSDVGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ
PHGGGWGQPHGGGWGQPHGGGGWGQGGSHSQWNKPSKPKTNMKHVAGAAAAGAVVGGLGGYMLGSAMSRP
LIHFGNDYEDRYYRENMYRYPNQVYYRPVDQYSNQNNFVHDCVNITVKQHTVTTTTKGENFTETDIKIME
QVVEQMCITQYQRESQAYYQRGASVILFSSPPVILLISFLIFLIVG
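Because a FASTA file is plain text, reading it takes only a few lines of code. The sketch below is one possible reader, not a reference implementation: the function name parse_fasta is made up for illustration, the first record is a truncated copy of the Bos taurus entry above, and the second record is an invented toy entry.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # a '>' line starts a new record
            records[header] = ""
        elif header is not None:
            records[header] += line       # sequence lines are concatenated
    return records


# First record: truncated Bos taurus entry from the slide.
# Second record: hypothetical toy entry, for illustration only.
example = """\
>gi|684|emb|CAA39368.1| prion protein [Bos taurus]
MVKSHIGSWILVLFVAMWSDVGLCKKRPKPGGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQPHGGGWGQ
PHGGGWGQPHGGGWGQPHGGGWGQPHGGGGWGQGGTHGQWNKPSKPKTNMKHVAGAAAAGAVVGGLGGYM
>toy|1| hypothetical second record
MKV
"""

records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    print(header, len(sequence))
```

Note that finding one record this way means scanning the whole file, since a flat file has no index.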


GenBank flat file format

[Figure: a GenBank flat file record with its fields labeled]


Limitations of flat file databases

• Hard to integrate.
• Hard to search.
• Hard to update (e.g., need to download the entire database once in a while).
• In general, hard to handle efficiently.
• Mostly used to distribute data.

A better solution: Relational Database

• A relational database stores all its data within multiple tables.
• A table is a set of rows and columns.
• Each table is linked to other tables by a shared field called a key.
• Uses Structured Query Language (SQL) to access, retrieve, and update data.
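The idea of tables linked by a key can be tried out in a few lines. The sketch below uses SQLite through Python's standard sqlite3 module rather than a full database server, purely so that it runs without setup. The table and column names are invented for illustration; the accessions and organism names are taken from the FASTA example earlier in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two tables; protein.organism_id is the key that links them.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE protein (
    accession   TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id)
);
""")

conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "Bos taurus"), (2, "Ovis aries")])
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                 [("CAA39368.1", "prion protein", 1),
                  ("CAA04236.1", "Prion protein", 2)])

# SQL query that follows the shared key from one table to the other.
rows = conn.execute("""
    SELECT p.accession, o.name
    FROM protein AS p
    JOIN organism AS o ON o.organism_id = p.organism_id
    ORDER BY p.accession
""").fetchall()
print(rows)
# prints [('CAA04236.1', 'Ovis aries'), ('CAA39368.1', 'Bos taurus')]
```

Storing the organism name once and referring to it by key is what makes updates cheap: renaming an organism touches one row instead of every sequence record.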


Two tables from a relational database

[Figure: two tables linked by a shared key field]


• MySQL is the most popular open source relational database.
• Free for most users.
• Integrates well with PHP, Perl and other scripting languages.
• Works well with Linux and Apache.
• Supports many platforms including Windows, UNIX/Linux, Mac OS X, etc.
• Most bioinformatics applications, big and small, use MySQL.


Primary vs. Derivative (secondary) databases

Primary Database
• Original submissions by experimentalists
• Database staff organize but don’t add additional information

Derivative (secondary) Database
• Curated by human experts
• Computationally derived
• Combination of the above two


Primary vs. Derivative (secondary) databases

[Figure labels: Labs, Sequencing Centers → GenBank (updated ONLY by submitters); Curators, Algorithms → RefSeq, Genome Assembly, UniGene (updated continuously). Example sequence shown: TATAGCCGAGCTCCGATACCGATGACAA.]


Data quality issues

• Since primary databases are formed from all user submissions, the amount of garbage data can be significant.
• Some high‐throughput sequencing experiments can get ~1% of all bases wrong.

• The quality of primary database subsets decreases in the following order:

Manually curated ‐> Automatically curated ‐> Not curated
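To get a feel for what a ~1% per-base error rate means, here is a rough back-of-the-envelope calculation. The read length of 100 bases is a hypothetical value chosen for illustration, and the calculation assumes that errors at different positions are independent, which real sequencing data only approximates.

```python
error_rate = 0.01    # ~1% of bases wrong, as quoted above
read_length = 100    # hypothetical read length, chosen for illustration

expected_errors = error_rate * read_length
# Probability that every base in the read is correct (independence assumed).
p_error_free = (1 - error_rate) ** read_length

print(f"expected wrong bases per read: {expected_errors:.1f}")
print(f"chance a read has no errors:   {p_error_free:.3f}")
```

Under these assumptions a read carries about one wrong base on average, and only a little over a third of reads are error-free.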


A general scheme of an on‐line database

[Figure labels: Local computer (Data, Output); Remote computer (Some computer program(s), Program output)]


Two major bioinformatics mega‐portals

• USA ‐ NCBI (The National Center for Biotechnology Information). The home of the GenBank sequence database.
http://www.ncbi.nih.gov/

• European Union ‐ EBI (The European Bioinformatics Institute). The home of the UniProt sequence database.
http://www.ebi.ac.uk/


NCBI: National Center for Biotechnology Information

Bethesda, MD

Created in 1988 as a part of the National Library of Medicine at NIH

– Establish public databases
– Develop research in computational biology
– Develop bioinformatics software tools
– Disseminate biomedical information


NCBI Databases and Services

• GenBank ‐ largest primary sequence database
• Free public access to biomedical literature
– PubMed – free article abstracts search
– PubMed‐Central – full‐text article access
• Entrez ‐ integrated molecular and literature databases
• BLAST – fastest sequence search service
• VAST ‐ structure similarity searches
• Software and Databases for download
• Many other services and databases…


GenBank
http://www.ncbi.nlm.nih.gov/genbank/

• Three ways to search GenBank:
– Search GenBank for sequence identifiers and annotations with Entrez.
– Search GenBank sequences using BLAST (Basic Local Alignment Search Tool).
– Search, link, and download sequences using NCBI e‐utilities (a set of software programs).

• The Reference Sequence (RefSeq) database is a curated collection of DNA, RNA, and protein sequences built by NCBI. Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
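As an illustration of the e‐utilities route, the sketch below builds a request URL for EFetch, the e‐utility that downloads records. The helper name efetch_url is made up for this example, and the accession is the Danio rerio prion protein from the FASTA slide. The code only constructs the URL and does not contact NCBI; current usage limits and parameters should be checked against the NCBI documentation before running real queries.

```python
from urllib.parse import urlencode

# Base address of the EFetch e-utility.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(database: str, accession: str) -> str:
    """Build a URL that asks EFetch for one record in FASTA text format."""
    params = {
        "db": database,       # Entrez database to query, e.g. "protein"
        "id": accession,      # record identifier
        "rettype": "fasta",   # return the record as FASTA
        "retmode": "text",    # plain text rather than XML
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


url = efetch_url("protein", "NP_991149.1")
print(url)
```

The resulting URL can be pasted into a browser or passed to urllib.request.urlopen to retrieve the sequence.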


EBI: European Bioinformatics Institute
http://www.ebi.ac.uk

• The structure of EBI services is similar to that of NCBI. The core databases reflect the methods used by biologists to collect information on how cells and organisms work:

‐ DNA/RNA/protein sequences
‐ Protein structure
‐ Whole genomes
‐ Gene expression experiments
‐ Literature databases
‐ Software databases

• UniProt – second largest primary sequence database. It consists of several components, each optimized for different uses:

‐ UniProtKB/Swiss‐Prot is manually annotated and reviewed.
‐ UniProtKB/TrEMBL is automatically annotated and is not reviewed.

• The sequences and information in UniProt are accessible via text search, BLAST search, and FTP.


PDB ‐ the primary database of experimental structures

Protein Databank: http://www.rcsb.org

Covers: all proteins for which there are published x‐ray and NMR structures (plus some theoretical predictions).

Provides:
• Structures
• Headers (authors, publications, experimental data)
• Sequences
• Links to structural classification information
• Lots of other tools


Sources of completely sequenced genomes

• ENSEMBL consortium:
http://www.ensemblgenomes.org

• UCSC genome browser:
http://genome.ucsc.edu

• TIGR, prokaryotic genomes:
http://www.tigr.org


Databases of protein‐protein interactions (PPIs)

• Protein‐protein interactions (PPIs) affect all processes in the cell. Examples: replication, transcription, translation, signal transduction, etc.

• Yeast – 6,000 proteins, 3 PPI per protein, 18,000 PPIs.
• Over 100,000 medically‐relevant PPIs in the human cell.

• Databases of experimentally determined PPIs become increasingly important for reconstructing biological pathways and modeling cellular processes.

• A comprehensive list of databases of PPI: http://mips.gsf.de/proj/ppi/


The yeast interactome


Visualization of PPI networks

• Visualization of PPI is a popular application of scientific visualization techniques. PPI networks are represented as graphs.
• This task is not straightforward because of the density of the graphs.

• http://www.cytoscape.org/

• Cytoscape is an open source bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data.
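The graph representation mentioned above can be sketched in a few lines: proteins are nodes and each interaction is an undirected edge. The protein names and interaction pairs below are placeholders invented for illustration, not real data, and this is only one simple way to store such a network.

```python
from collections import defaultdict

# Hypothetical interaction pairs; the protein names are placeholders.
interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# PPIs are undirected, so each pair is stored in both directions.
graph = defaultdict(set)
for p, q in interactions:
    graph[p].add(q)
    graph[q].add(p)

# Degree: the number of interaction partners of each protein.
degree = {protein: len(partners) for protein, partners in graph.items()}

# Density: fraction of all possible protein pairs that actually interact.
n = len(graph)
density = len(interactions) / (n * (n - 1) / 2)

print(degree)
print(f"density = {density:.2f}")
```

In a real interactome the number of nodes runs into the thousands, which is why drawing the graph legibly is the hard part.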


The Gene Ontology (GO) database
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

• GO provides a way to capture and represent the biological knowledge in a standardized database framework.

• GO is a controlled vocabulary that can be applied to all organisms. It is used to describe gene products ‐ proteins and RNA ‐ in any organism.

• GO includes:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships to each other


GO is structured as Directed Acyclic Graphs (DAG)
GO terms are nodes in the graph

[Figure: a DAG whose nodes are the terms cell, membrane, chloroplast, mitochondrial membrane, and chloroplast membrane, connected by is-a and part-of edges]
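A DAG differs from a tree in that a term may have more than one parent. The sketch below encodes the five terms from the figure and collects all ancestors of a term. The specific is-a and part-of edges are an assumption about what the figure shows, and the names parents and ancestors are made up for this example.

```python
# Each term maps to a list of (relationship, parent term) pairs.
# The edges are assumed for illustration.
parents = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term):
    """Return every term reachable by following parent edges upward."""
    found = set()
    stack = [term]
    while stack:
        for _relationship, parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("chloroplast membrane")))
# prints ['cell', 'chloroplast', 'membrane']
```

Here "chloroplast membrane" has two parents, which is exactly the situation a tree could not represent.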


GO: Three ontologies

Gene product:
• What does it do? Molecular Function
• What processes is it involved in? Biological Process
• Where does it act? Cellular Component


Molecular Function ‐ activities or “jobs” of a gene product

insulin binding
insulin receptor activity


Biological Process ‐ a commonly recognized series of biological events  

Transcription is a biological process


Cellular Component: where a gene product acts
