Genomics and bioinformatics summary

1. Gene finding: computer searches, cDNAs, ESTs

2. Microarrays

3. Use BLAST to find homologous sequences

4. Multiple sequence alignments (MSAs)

5. Trees quantify sequence and evolutionary relationships

6. Protein sequences are evolutionary clocks

7. Some public databases and protein sequence analysis tools

Finding genes -- computer searches

Computer searches locate most genes in prokaryotes, Archaea, and yeast, but only ~1/3 of human genes are identified correctly.

Criteria

Protein start, stop signals, splicing signals . . .

Codon bias

Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast . . .)

Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs) . . .
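The start and stop signal criterion above can be sketched as a simple open reading frame (ORF) scan. This is a minimal illustration under stated assumptions, not the gene finder the slides refer to: it scans only the given strand, uses the standard stop codons, and ignores splicing, codon bias, and genome comparisons. The function name `find_orfs` and the `min_codons` cutoff are hypothetical.

```python
# Minimal ORF scan: three reading frames on one strand, ATG ... stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop ORFs (end exclusive)."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return sorted(orfs)

toy = "CCATGAAATTTGGGTAACC"                    # made-up sequence
print(find_orfs(toy))
```

A scan like this also shows why small genes are hard: lowering `min_codons` quickly produces many chance ORFs.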

Finding genes -- cDNA synthesis

Synthesizing “cDNA” (complementary DNA)

1. Extract RNA

2. Hybridize polyT primer

3. Synthesize DNA strand 1 using reverse transcriptase.

4. Fragment RNA strand using RNaseH.

5. Synthesize DNA strand 2 using DNA polymerase.

Sequences of random cDNAs provide ESTs (Expressed Sequence Tags).
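The strand relationships in the steps above can be sketched in a few lines. This is only an illustration of base pairing, assuming a made-up mRNA with a short poly(A) tail; the function names are hypothetical.

```python
# Base pairing from mRNA (with U) to DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """DNA strand 1 (made by reverse transcriptase), written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))

def second_strand_cdna(mrna):
    """DNA strand 2 carries the mRNA sequence with T in place of U."""
    return mrna.upper().replace("U", "T")

mrna = "AUGGCCAAAA"                  # toy mRNA ending in a poly(A) tail
print(first_strand_cdna(mrna))       # starts with the poly(T) primer
print(second_strand_cdna(mrna))
```

Strand 1 begins with a run of T because synthesis is primed from the poly(A) tail.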

Microarrays quantify expressed genes by hybridization

1. Label cDNAs with red fluorophore in one condition and green fluorophore in another reference condition.

2. Mix red and green DNA and hybridize to a “microarray”.

Red genes enriched in reference

Yellow genes (green + red) = expressed equally in both

Green genes enriched in experiment

Each spot is a different synthetic oligonucleotide complementary to a specific gene.
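One common way to turn the two fluorescence signals for a spot into a number is a log2 ratio of experiment to reference intensity. The sketch below assumes that convention; the intensity floor and the two-fold threshold are hypothetical choices, not values from the slides.

```python
import math

def log2_ratio(experiment, reference, floor=1.0):
    """log2(experiment / reference); the floor avoids dividing by zero."""
    return math.log2(max(experiment, floor) / max(reference, floor))

def classify(experiment, reference, threshold=1.0):
    """Label a spot; threshold=1.0 means a two-fold change (an assumption)."""
    ratio = log2_ratio(experiment, reference)
    if ratio >= threshold:
        return "enriched in experiment"
    if ratio <= -threshold:
        return "enriched in reference"
    return "similar in both"

# Made-up intensities: (experiment, reference)
for spot in [(800, 200), (100, 400), (300, 310)]:
    print(spot, round(log2_ratio(*spot), 2), classify(*spot))
```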

“Cluster analysis” identifies patterns of gene expression

1. Similar patterns of expression are placed next to each other. Groups of genes with similar patterns form a hierarchical “tree”. For example, the two major branches of the tree comprise activated genes (left, green) and repressed genes (right, red).

2. Genes with similar expression patterns (e.g. A-E) often function together.

[Figure: clustered expression map; axes: Genes, Conditions]
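The idea of placing similar expression patterns next to each other can be sketched by scoring every pair of genes with a correlation coefficient and picking the most similar pair, which is the first merge step of a hierarchical clustering. Pearson correlation is an assumption here (the slides do not name a similarity measure), and the gene names and values are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def most_similar_pair(profiles):
    """Return (correlation, gene_a, gene_b) for the best-correlated pair."""
    names = sorted(profiles)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if best is None or r > best[0]:
                best = (r, a, b)
    return best

# Toy log-ratio profiles across four conditions
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.0, 3.2, 3.9],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(most_similar_pair(profiles))
```

Repeating the merge on the remaining groups builds the tree; real analyses usually rely on an established clustering library rather than a hand-written loop.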

“Tiling” microarrays can find transcribed sequences

Each spot has a different synthetic oligonucleotide complementary to a different segment of the genome (e.g., every 100 bp). Spots that hybridize reveal transcribed regions.

Microarray coding capacity ~16 M bases
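Choosing tiling probes at a fixed spacing can be sketched as below. The 100 bp step comes from the text; the 25-base probe length, the function name, and the toy genome are assumptions for illustration only.

```python
def tile_probes(genome, probe_len=25, step=100):
    """Return (start, segment) pairs, one genome segment every `step` bases."""
    last_start = len(genome) - probe_len
    return [(start, genome[start:start + probe_len])
            for start in range(0, last_start + 1, step)]

genome = "ACGT" * 75                 # made-up 300 bp sequence
probes = tile_probes(genome)
print(len(probes), [start for start, _ in probes])
```

Each spot would carry the oligonucleotide complementary to one of these segments, so the number of spots needed grows as genome length divided by the step.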

Find similar sequences (homologs) with BLAST

The most related human protein identified by a BLAST search of the human genome using the sequence of M. tuberculosis PknB Ser/Thr protein kinase is . . . ELKL motif kinase 1.

Query = the part of the PknB sequence that matches ELKL-1.

Subject = ELKL-1.

Expect = expectation value = the number of hits of this quality expected by chance in a database of this size (5e-24 = 5 x 10^-24; is this a big number or small?)

Identities = # of exact amino acid matches in the alignment.

Positives = # of identities plus conservative changes, as defined by the residues that tend to replace each other in homologous proteins.

NP_004945.2 = sequence ID for ELKL-1.
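To get a feel for the Expect value, the standard relation between a bit score S' and the expected number of chance hits is E = m x n x 2^(-S'), where m and n are the effective query and database lengths. The sketch below applies that relation; the search-space numbers are hypothetical and are not the ones used for the search shown here.

```python
def expect_value(bit_score, query_len, database_len):
    """E = m * n * 2**(-S'), with S' the bit score."""
    return query_len * database_len * 2.0 ** (-bit_score)

search_space = (300, 5_000_000)      # hypothetical effective lengths (m, n)
for score in (50, 108):
    print(score, expect_value(score, *search_space))
```

Each additional bit of score halves E, so a score above 100 bits corresponds to a vanishingly small number of chance hits.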

>ref|NP_004945.2| ELKL motif kinase 1 [Homo sapiens]
Length = 691

Score = 108 bits (270), Expect = 5e-24
Identities = 87/296 (29%), Positives = 135/296 (45%), Gaps = 21/296 (7%)

Query: 11  YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA 70
           Y L + +G G  ++V LAR +   ++VAVK++        S    FR E +    LNHP
Sbjct: 20  YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN 78

Query: 71  IVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSH 130
           IV +++  E E       Y+VMEY  G  + D +   G M  K A         A+ + H
Sbjct: 79  IVKLFEVIETEKTL----YLVMEYASGGEVFDYLVAHGRMKEKEARAKFRQIVSAVQYCH 134

Query: 131 QNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARG 190
           Q  I+HRD+K  N+++ A   +K+ DFG +      GN +       G+  Y +PE  +G
Sbjct: 135 QKFIVHRDLKAENLLLDADMNIKIADFGFSNEFT-FGNKLD---TFCGSPPYAAPELFQG 190

Query: 191 DSVDA-RSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHE-GLSADLD 248
              D    DV+SLG +LY +++G  PF G +      + +RE  +    R    +S D +
Sbjct: 191 KKYDGPEVDVWSLGVILYTLVSGSLPFDGQN-----LKELRERVLRGKYRIPFYMSTDCE 245

Query: 249 AVVLKALAKNPENRYQTAAEMRADLVRVHNGEPPEAPKV-----LTDAERTSLLSS 299
            ++ K L  NP  R      M+   + V + +    P V       D  RT L+ S
Sbjct: 246 NLLKKFLILNPSKRGTLEQIMKDRWMNVGHEDDELKPYVEPLPDYKDPRRTELMVS 301
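The summary line of a hit can be checked directly from the aligned sequences. The sketch below counts identities and gaps for the first row of the alignment (query residues 11-70) and recomputes the reported percentages; the percentages in this output are consistent with truncation rather than rounding (135/296 is 45.6% but is reported as 45%). Counting positives would need a substitution matrix and is left out. The function names are hypothetical.

```python
def count_identities_and_gaps(query, sbjct):
    """Count identical columns and gap characters in two aligned rows."""
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = query.count("-") + sbjct.count("-")
    return identities, gaps

def percent(count, alignment_length):
    """Integer percentage, truncated as in the output shown above."""
    return count * 100 // alignment_length

# First row of the PknB / ELKL motif kinase 1 alignment
query = "YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA"
sbjct = "YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN"
print(count_identities_and_gaps(query, sbjct))

# Totals reported for the whole alignment
print(percent(87, 296), percent(135, 296), percent(21, 296))
```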

Ser/Thr Protein kinases diverge rapidly

Multiple Sequence Alignment (MSA) of the N-terminal ~90 residues of M. tuberculosis PknB (bottom) and Ser/Thr protein kinases of known structure. The histogram at the bottom shows % identity at each position. Only a few residues are absolutely conserved (functional sites!). The MSA defines the beginning of the kinase domain. Insertions often occur in loops.
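A per-position identity histogram like the one described can be computed column by column. The sketch below is one plausible definition (the share of sequences carrying the most common residue in each column); the slide does not state how its histogram was calculated, and the four short sequences are made up, not real kinase sequences.

```python
from collections import Counter

def column_identity(msa):
    """Percent of sequences sharing the most common residue in each column."""
    n_sequences = len(msa)
    result = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]   # gaps never count as matches
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        result.append(100.0 * top / n_sequences)
    return result

# Toy alignment (hypothetical sequences of equal aligned length)
msa = ["GKGSFG",
       "GRGAFG",
       "GEG-YG",
       "GSGTFA"]
identity = column_identity(msa)
print(identity)
print([i for i, pct in enumerate(identity) if pct == 100.0])  # fully conserved columns
```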


Histones evolve slowly

Core H3 proteins (that have the same function) are nearly identical in eukaryotes (left). Archaeal H3s and specialized H3 proteins that bind at centromeres show much more divergence (bottom sequences and tree branches, right).

[Figure: MSA (Multiple Sequence Alignment) of H3 proteins, left; tree, right]

Protein sequences are evolutionary clocks

Assuming that organisms diverged from a common ancestor and sequence changes accumulate at constant rates, the number of changes in homologous proteins gives information about the time that each sequence has been evolving independently.

Average rate of change of proteins of different function.

[Figure: scale from fast to slow]
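Under the constant-rate assumption stated above, a divergence time follows from the fraction of sites that differ. The sketch assumes the simplest form, T = d / (2r): both lineages accumulate changes independently (hence the factor of 2), and repeated changes at the same site are ignored. The rate and counts are made-up numbers.

```python
def divergence_time(differences, aligned_sites, rate_per_site_per_year):
    """Years since two sequences shared an ancestor, T = d / (2 * r)."""
    d = differences / aligned_sites          # fraction of sites that differ
    return d / (2.0 * rate_per_site_per_year)

# Hypothetical: 10 differences in 100 aligned residues, 1e-9 changes/site/year
print(divergence_time(10, 100, 1e-9))
```

A fast-evolving protein (larger r) reaches the same d in less time, which is why slowly changing proteins are more useful for dating very old divergences.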

Tree of life (Sequences = biological clocks)

A tree derived by clustering sequences of a typical protein family (pterin-4a-hydroxylase) recapitulates the tree of life. Evolutionary relationships are seen at the molecular level in virtually every shared protein and RNA!
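Clustering sequences into a tree can be sketched with pairwise distances and average-linkage (UPGMA) merging. UPGMA is a stand-in chosen for brevity; the slide does not say which clustering method produced its tree. The four aligned sequences and their labels are invented for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned, ungapped positions that differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs):
    """Cluster aligned sequences; return the tree as nested tuples."""
    clusters = {name: [name] for name in seqs}       # tree label -> member names
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}

    def cluster_distance(a, b):
        # Average distance over all member pairs (average linkage)
        pairs = [(m, n) for m in clusters[a] for n in clusters[b]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))

# Made-up aligned sequences
seqs = {
    "human": "MKTAYIAKQR",
    "mouse": "MKTAYIAKQK",
    "fly":   "MKSAYLAKHR",
    "yeast": "MRSGYLSRHN",
}
print(upgma(seqs))
```

The nesting of the returned tuples mirrors the branching order: the closest pair is joined first and sits deepest in the tree.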

Some web sites for bioinformatics

Nucleic acid sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide

Protein sequences
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein

Structure Coordinates: Protein Data Bank
http://www.rcsb.org/pdb/

Programs

BLAST sequence similarity calculation
http://www.ncbi.nlm.nih.gov/BLAST/

BLAST bacterial genomes
http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi

PHD secondary structure predictor and motif search
http://www.embl-heidelberg.de/predictprotein/predictprotein.html

PHYRE fold predictor
http://www.sbg.bio.ic.ac.uk/~phyre/

Multicoil: Coiled coil prediction
http://multicoil.lcs.mit.edu/cgi-bin/multicoil/

Many nucleic acid and protein sequence-analysis tools
http://au.expasy.org/

Predict transmembrane helices
http://www.cbs.dtu.dk/services/THMM-2.0/

Predict signal sequences
http://www.cbs.dtu.dk/services/SignalP/

Genomics and bioinformatics summary

1. Gene finding: computer searches, cDNAs, ESTs

2. Microarrays

3. Use BLAST to find homologous sequences

4. Multiple sequence alignments (MSAs)

5. Trees quantify sequence and evolutionary relationships

6. Protein sequences are evolutionary clocks

7. Lots of public databases and protein sequence analysis tools
