Dec 21, 2015
Genomics and bioinformatics summary
1. Gene finding: computer searches, cDNAs, ESTs
2. Microarrays
3. Use BLAST to find homologous sequences
4. Multiple sequence alignments (MSAs)
5. Trees quantify sequence and evolutionary relationships
6. Protein sequences are evolutionary clocks
7. Some public databases and protein sequence analysis tools
Finding genes -- computer searches
Computer searches locate most genes in prokaryotes, Archaea, and yeast, but only ~1/3 of human genes are identified correctly.
Criteria
Protein start, stop signals, splicing signals . . .
Codon bias
Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast . . .)
Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs) . . .
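The start/stop-signal criterion can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: forward strand only, ATG as the sole start signal, no splicing, and no codon-bias or genome-comparison scoring. The function name find_orfs and the min_codons cutoff are illustrative choices, not part of any gene-finding program named in these slides.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop ORFs on the forward strand.

    min_codons counts the start and stop codons; it is an illustrative cutoff.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start signal in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ORF
    return sorted(orfs)

example = "CCATGAAATTTGGGTAACC"                 # made-up sequence
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

A scan like this misses exactly the hard cases listed above: very short genes fall under the length cutoff, and spliced genes have no single uninterrupted reading frame.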
Finding genes -- cDNA synthesis
Synthesizing “cDNA” (complementary DNA)
1. Extract RNA
2. Hybridize polyT primer
3. Synthesize DNA strand 1 using reverse transcriptase.
4. Fragment RNA strand using RNaseH.
5. Synthesize DNA strand 2 using DNA polymerase.
Sequences of random cDNAs provide ESTs (Expressed Sequence Tags).
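The strand relationships in the steps above can be sketched in a few lines. Strand 1 is the reverse complement of the mRNA (so it begins with the polyT primer), and strand 2 is the reverse complement of strand 1. The function names and the short mRNA are made up for illustration; real cDNA synthesis involves priming and enzyme details that this ignores.

```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """DNA strand 1 (written 5'->3'): reverse complement of the mRNA."""
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna.upper()))

def second_strand_cdna(strand1):
    """DNA strand 2 (written 5'->3'): reverse complement of strand 1."""
    return "".join(DNA_TO_DNA[base] for base in reversed(strand1.upper()))

mrna = "AUGGCCUAAAAAA"                  # toy mRNA ending in a short polyA tail
strand1 = first_strand_cdna(mrna)       # starts with the polyT primer
strand2 = second_strand_cdna(strand1)   # same sequence as the mRNA, with T for U
print(strand1)
print(strand2)
```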
Microarrays quantify expressed genes by hybridization
1. Label cDNAs from one condition with a red fluorophore and cDNAs from a reference condition with a green fluorophore.
2. Mix red and green DNA and hybridize to a “microarray”.
Red genes enriched in reference
Yellow genes (green + red) =
Green genes enriched in experiment
Each spot is a different synthetic oligonucleotide complementary to a specific gene.
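One common way to turn the two fluorescence signals at a spot into a number is a log2 ratio of red to green intensity. The sketch below assumes that approach; the intensity values and the twofold threshold are hypothetical, and the code deliberately speaks of the "red-labeled" and "green-labeled" samples rather than deciding which one is the reference.

```python
import math

def log2_ratio(red, green):
    """log2(red/green): positive means more signal from the red-labeled sample."""
    return math.log2(red / green)

def spot_color(red, green, threshold=1.0):
    """Classify a spot; threshold=1.0 means a twofold difference (an assumed cutoff)."""
    ratio = log2_ratio(red, green)
    if ratio >= threshold:
        return "red"       # enriched in the red-labeled sample
    if ratio <= -threshold:
        return "green"     # enriched in the green-labeled sample
    return "yellow"        # similar signal from both samples

# Hypothetical background-corrected intensities for three spots.
spots = {"gene1": (800, 100), "gene2": (100, 400), "gene3": (500, 450)}
for name, (red, green) in spots.items():
    print(name, round(log2_ratio(red, green), 2), spot_color(red, green))
```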
“Cluster analysis” identifies patterns of gene expression
1. Similar patterns of expression are placed next to each other. Groups of genes with similar patterns form a hierarchical “tree”. For example, the two major branches of the tree comprise activated genes (left, green) and repressed genes (right, red).
2. Genes with similar expression patterns (e.g. A-E) often function together.
Figure axes: Genes, Conditions
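A minimal sketch of how such a hierarchical tree can be built, assuming average-linkage agglomerative clustering on the distance 1 - Pearson correlation. The slides do not say which algorithm or distance produced the figure, so both are assumptions here, and the gene names and expression values are invented.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two expression profiles (undefined for constant profiles)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def cluster(profiles):
    """Average-linkage clustering; returns the tree as nested tuples of gene names."""
    # Each node is (subtree, list of member genes).
    nodes = [(name, [name]) for name in sorted(profiles)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                pairs = [(a, b) for a in nodes[i][1] for b in nodes[j][1]]
                dist = sum(1.0 - pearson(profiles[a], profiles[b])
                           for a, b in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        merged = ((nodes[i][0], nodes[j][0]), nodes[i][1] + nodes[j][1])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

# Invented expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
tree = cluster(profiles)
print(tree)
```

In this toy data the rising genes (A, B) and the falling genes (C, D) end up on the two major branches, which mirrors the activated and repressed branches described above.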
“Tiling” microarrays can find transcribed sequences
Each spot has a different synthetic oligonucleotide complementary to a different segment of the genome (e.g. every 100 bp). Spots that hybridize reveal transcribed regions.
Microarray coding capacity ~16 M bases
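The tiling idea can be sketched as follows: place a probe every 100 bp, then merge runs of adjacent hybridizing probes into candidate transcribed regions. This assumes each probe stands for the 100 bp segment starting at its position; the function names and the list of hybridizing positions are hypothetical.

```python
def tile_starts(genome_length, spacing=100):
    """Start coordinates of probes placed every `spacing` bp."""
    return list(range(0, genome_length, spacing))

def transcribed_regions(hybridized_starts, spacing=100):
    """Merge adjacent hybridizing tiles into (start, end) regions."""
    regions = []
    for start in sorted(hybridized_starts):
        if regions and start == regions[-1][1]:
            # This tile begins where the previous region ends: extend it.
            regions[-1] = (regions[-1][0], start + spacing)
        else:
            regions.append((start, start + spacing))
    return regions

print(tile_starts(500))
print(transcribed_regions([200, 300, 400, 900]))   # hypothetical positive spots
```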
Find similar sequences (homologs) with BLAST
The most closely related human protein identified by a BLAST search of the human genome using the sequence of the M. tuberculosis PknB Ser/Thr protein kinase is ELKL motif kinase 1 (ELKL-1).
Query = the part of the PknB sequence that matches ELKL-1.
Subject = ELKL-1.
Expect = expectation value = the number of hits of this quality expected by chance in a database of this size (5e-24 = 5 x 10^-24; is this a big number or small?).
Identities = number of exact amino acid matches in the alignment.
Positives = number of identities plus conservative changes, as defined by the residues that tend to replace each other in homologous proteins.
NP_004945.2 = sequence ID for ELKL-1.
>ref|NP_004945.2| ELKL motif kinase 1 [Homo sapiens]
Length = 691
Score = 108 bits (270), Expect = 5e-24
Identities = 87/296 (29%), Positives = 135/296 (45%), Gaps = 21/296 (7%)

Query: 11  YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA 70
           Y L + +G G  ++V LAR +   ++VAVK++        S    FR E +    LNHP
Sbjct: 20  YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN 78

Query: 71  IVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSH 130
           IV +++  E E       Y+VMEY  G  + D +   G M  K A         A+ + H
Sbjct: 79  IVKLFEVIETEKTL----YLVMEYASGGEVFDYLVAHGRMKEKEARAKFRQIVSAVQYCH 134

Query: 131 QNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARG 190
           Q  I+HRD+K  N+++ A   +K+ DFG +      GN +       G+  Y +PE  +G
Sbjct: 135 QKFIVHRDLKAENLLLDADMNIKIADFGFSNEFT-FGNKLD---TFCGSPPYAAPELFQG 190

Query: 191 DSVDA-RSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHE-GLSADLD 248
              D    DV+SLG +LY +++G  PF G +      + +RE  +    R    +S D +
Sbjct: 191 KKYDGPEVDVWSLGVILYTLVSGSLPFDGQN-----LKELRERVLRGKYRIPFYMSTDCE 245

Query: 249 AVVLKALAKNPENRYQTAAEMRADLVRVHNGEPPEAPKV-----LTDAERTSLLSS 299
            ++ K L  NP  R      M+   + V + +    P V       D  RT L+ S
Sbjct: 246 NLLKKFLILNPSKRGTLEQIMKDRWMNVGHEDDELKPYVEPLPDYKDPRRTELMVS 301
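The Identities and Gaps figures in the report can be recomputed directly from the aligned Query and Sbjct rows. The sketch below copies those rows from the alignment above; the function name alignment_stats is made up, and the use of integer division for the percentages is an assumption chosen because it matches the values printed in the report. Positives are not recomputed because that needs a substitution matrix.

```python
# Aligned rows copied from the BLAST report ("-" marks a gap).
QUERY = (
    "YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA"
    "IVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSH"
    "QNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARG"
    "DSVDA-RSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHE-GLSADLD"
    "AVVLKALAKNPENRYQTAAEMRADLVRVHNGEPPEAPKV-----LTDAERTSLLSS"
)
SBJCT = (
    "YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN"
    "IVKLFEVIETEKTL----YLVMEYASGGEVFDYLVAHGRMKEKEARAKFRQIVSAVQYCH"
    "QKFIVHRDLKAENLLLDADMNIKIADFGFSNEFT-FGNKLD---TFCGSPPYAAPELFQG"
    "KKYDGPEVDVWSLGVILYTLVSGSLPFDGQN-----LKELRERVLRGKYRIPFYMSTDCE"
    "NLLKKFLILNPSKRGTLEQIMKDRWMNVGHEDDELKPYVEPLPDYKDPRRTELMVS"
)

def alignment_stats(query, sbjct):
    """Return (identities, gaps, alignment length) for two aligned rows."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    gaps = query.count("-") + sbjct.count("-")
    return identities, gaps, len(query)

identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```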
Ser/Thr Protein kinases diverge rapidly
Multiple Sequence Alignment (MSA) of the N-terminal ~90 residues of M. tuberculosis PknB (bottom) and Ser/Thr protein kinases of known structure. The histogram at the bottom shows % identity at each position. Only a few residues are absolutely conserved (functional sites!). The MSA defines the beginning of the kinase domain. Insertions often occur in loops.
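A per-position % identity histogram like the one described can be computed from an MSA column by column. The sketch below defines % identity at a position as the share of sequences carrying the most common residue there, which is one reasonable definition but not necessarily the one used for the figure. The three short sequences are a toy alignment, not the real kinase sequences.

```python
from collections import Counter

def column_identity(msa):
    """Percent of sequences sharing the most common residue at each column."""
    percents = []
    for column in zip(*msa):
        residues = [r for r in column if r != "-"]   # gaps never count as matches
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        percents.append(100.0 * top / len(msa))
    return percents

# Toy alignment (hypothetical sequences of equal aligned length).
msa = [
    "GKGSFG",
    "GEGAYG",
    "GSG-FG",
]
percents = column_identity(msa)
conserved = [i for i, p in enumerate(percents) if p == 100.0]
print([round(p) for p in percents])
print("absolutely conserved columns:", conserved)
```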
Histones evolve slowly
Core H3 proteins (that have the same function) are nearly identical in eukaryotes (left). Archaeal H3s and specialized H3 proteins that bind at centromeres show much more divergence (bottom sequences and tree branches, right).
Figure panels: MSA (Multiple Sequence Alignment), left; Tree, right.
Protein sequences are evolutionary clocks
Assuming that organisms diverged from a common ancestor and sequence changes accumulate at constant rates, the number of changes in homologous proteins gives information about the time that each sequence has been evolving independently.
Figure: Average rate of change of proteins of different function (scale from fast to slow).
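Under the constant-rate assumption stated above, the arithmetic is simple: two lineages that split t years ago have each been accumulating changes for t years, so the observed distance d (changes per site) is about 2 x rate x t, and t = d / (2 x rate). The sketch below uses this relation; it ignores multiple changes at the same site, and the numbers in the example are hypothetical rather than measured rates.

```python
def divergence_time(differences, aligned_sites, rate_per_site_per_year):
    """Estimate years since two sequences shared an ancestor.

    Assumes a constant rate and no repeated changes at a site (both simplifications).
    """
    distance = differences / aligned_sites          # changes per site
    return distance / (2 * rate_per_site_per_year)  # both lineages accumulate changes

# Hypothetical example: 10 differences in 100 aligned residues,
# with an assumed rate of 1e-9 changes per site per year.
years = divergence_time(10, 100, 1e-9)
print(f"{years:.2e} years")
```

A fast-changing protein family reaches the same number of differences in less time than a slow-changing one, which is why the rate must be known for the family being compared.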
Tree of life (Sequences = biological clocks)
A tree derived by clustering sequences of a typical protein family (pterin-4a-hydroxylase) recapitulates the tree of life. Evolutionary relationships are seen at the molecular level in virtually every shared protein and RNA!
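Clustering sequences into a tree starts from pairwise distances. A minimal sketch, assuming the simplest measure (p-distance: the fraction of aligned, ungapped positions that differ), is shown below. The slides do not state which distance or tree method was used, and the three short sequences are invented, not real pterin-4a-hydroxylase sequences.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned positions (gaps skipped) at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(aligned):
    """Pairwise p-distances keyed by (name1, name2) in alphabetical order."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(sorted(aligned), 2)
    }

# Invented aligned sequences for illustration only.
aligned = {
    "human": "MKVLAAGI",
    "mouse": "MKVLAAGV",
    "yeast": "MRVIASGV",
}
distances = distance_matrix(aligned)
closest = min(distances, key=distances.get)
print(distances)
print("closest pair:", closest)
```

A tree-building step would join the closest pair first and then attach the more distant sequence, so more closely related organisms end up on neighboring branches.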
Some web sites for bioinformatics
Nucleic acid sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide
Protein sequences
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
Structure coordinates: Protein Data Bank
http://www.rcsb.org/pdb/
Programs
BLAST sequence similarity calculation
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST bacterial genomes
http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi
PHD secondary structure predictor and motif search
http://www.embl-heidelberg.de/predictprotein/predictprotein.html
PHYRE fold predictor
http://www.sbg.bio.ic.ac.uk/~phyre/
Multicoil: coiled coil prediction
http://multicoil.lcs.mit.edu/cgi-bin/multicoil/
Many nucleic acid and protein sequence-analysis tools
http://au.expasy.org/
Predict transmembrane helices
http://www.cbs.dtu.dk/services/THMM-2.0/
Predict signal sequences
http://www.cbs.dtu.dk/services/SignalP/
Genomics and bioinformatics summary
1. Gene finding: computer searches, cDNAs, ESTs
2. Microarrays
3. Use BLAST to find homologous sequences
4. Multiple sequence alignments (MSAs)
5. Trees quantify sequence and evolutionary relationships
6. Protein sequences are evolutionary clocks
7. Lots of public databases and protein sequence analysis tools