MCB 5472 Lecture 3 Feb 10/14 (1) Types of homology (2) BLAST
Dec 18, 2015
Homology references
“Homology: a personal view on some of the problems”
Fitch WM (2000) Trends Genet. 16: 227-231
“Orthologs, paralogs, and evolutionary genomics”
Koonin EV (2005) Annu. Rev. Genet. 39: 309-338
What is homology?
• Owen 1843: “the same organ in different animals under every variety of form and function”
• Huxley (post Darwin): homology evidence of evolution
  – Similarity is due to descent from a common ancestor
What is homology?
• Homology is a statement about shared ancestry
  – Two things either share a common ancestor (are homologous) or do not
[Figure: species tree of Species 1, Species 2 and Species 3 with genes a, b and c at the tips. Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
These are all homologs (common ancestor)
Ohno 1970: “Evolution by Gene Duplication”
• New genes arise by gene duplication
  – One copy retains ancestral function
  – Other copy diverges functionally
• “Homolog” as a single term therefore is a sloppy fit
  – What kind of ancestor do homologs share?
[Figure: species tree of Species 1, Species 2 and Species 3 with genes a, b and c at the tips. Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
These are all homologs (common ancestor)
These are all orthologs (vertical descent)
Homology and Function
• Homology and function are two different concepts
• Strict orthology and functional conservation often correlate but this is not absolute
• Basis for annotating genomes based on similarity to previous work
Fitch 1970: “Orthologs” and “Paralogs”
• “Orthologs”: genes related by vertical descent
• “Paralogs”: genes related by gene duplication
[Figure: species tree of Species 1, Species 2 and Species 3 with genes a, b, c and d at the tips; a duplication in species 1 is marked. Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
Genes a, b, c & d are homologs (common ancestor)
Genes a & b are paralogs (related by duplication)
Orthology/paralogy is somewhat relative
• Depends on the depth of duplication relative to common ancestry
• “Co-orthologs”: paralogs formed in a lineage after speciation, relative to other lineages (Koonin 2005)
[Figure: species tree of Species 1, Species 2 and Species 3 with genes a, b, c and d at the tips; a duplication in species 1 is marked. Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
Genes a & b are paralogs (related by duplication)
Genes a & b are co-orthologs of genes c & d (duplication followed speciation)
[Figure: species tree with genes a, b, c, d, e and f at the tips; duplications are marked in species 1 (genes a, b) and in the ancestor of species 2 & 3 (genes c, d, e, f). Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
Genes a & b are paralogs (duplication)
Genes a & b are co-orthologs of c, d, e & f (duplication followed speciation)
[Figure: species tree with genes a, b, c, d, e and f at the tips; duplications are marked in species 1 (genes a, b) and in the ancestor of species 2 & 3 (genes c, d, e, f). Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
Genes c & d are orthologs (common ancestor)
Genes e & f are orthologs (common ancestor)
Genes c & d are paralogs of genes e & f (duplication preceded speciation)
[Figure: species tree with genes a, b, c and f at the tips; duplications are marked in species 1 (genes a, b) and in the ancestor of species 2 & 3, followed by loss of d and loss of e. Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
Gene c is a paralog of gene f even though it doesn’t seem so (the duplication still preceded speciation; copies d and e were subsequently lost)
Xenologs
• Bacteria exchange DNA between distant relatives by horizontal gene transfer (HGT)
  – Increasingly recognized in eukaryotes too
• Gene tree does not match species tree
[Figure: species tree with genes a, b, c and d at the tips; Species 1 is labeled at two tips. An arrow marks HGT from the species 2 ancestor to species 1. Internal nodes: common ancestor of species 1, 2, 3; common ancestor of species 2, 3.]
Gene c is a xenolog relative to the others
Other “-logs”
• Inparalogs: duplication follows speciation
• Outparalogs: duplication precedes speciation
• Synologs: arising from organism fusion
• Orthology & paralogy can get quite complicated when multiple duplications happened at different moments in time
• Gene loss & HGT can always confound: one often has to rely on external evidence to reconstruct the speciation history
  – E.g., other genes not thought to be horizontally transferred, average signal of multiple genes
[Figure: gene tree with tips labeled Species 1, Species 2 and Species 3]
Discuss: how are these genes related to each other? Three possibilities
How to determine orthologs
• Most detailed: phylogenetic trees
  – Can be computationally expensive
• Reciprocal BLAST hit (RBH/BBH)
  – Simplest, computationally cheap, less accurate & more complicated with many genomes
• More complicated RBH clustering
  – OrthoMCL, Inparanoid
[Figure: Genome A vs. Genome B]
RBH orthologs
[Figure: best BLAST matches drawn between genes of Genome A and genes of Genome B]
• Best matches in both directions: ortholog (two examples shown)
• Best match in only 1 direction: not ortholog
• Different matches in each direction: not ortholog (two examples shown)
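The RBH rule above can be sketched in a few lines. This is a minimal illustration, not how OrthoMCL or Inparanoid work: the input format (tuples of query, subject, bit score) and the function names `best_hits` and `reciprocal_best_hits` are assumptions made for this example, and the hit data are invented.

```python
def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring subject per query.

    `hits` is an iterable of (query, subject, bit_score) tuples
    (hypothetical input format for this sketch).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented toy data: A1 and B1 pick each other; A2 picks B1, but B1 picks A1.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 80.0), ("A2", "B1", 120.0)]
b_vs_a = [("B1", "A1", 248.0), ("B1", "A2", 118.0), ("B2", "A1", 79.0)]

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

Only the A1/B1 pair is reported; A2 and B2 each have a best match in one direction only, so they are not called orthologs.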
BLAST
• Standard method to identify homologous sequences
  – Not for comparing two sequences directly; use NEEDLE instead for this (global vs. local alignment methods)
• Requires database to query sequence against
• Probably the most common scientific experiment
Different BLAST types
• BLASTn: nucleotide query vs nucleotide database
• BLASTp: protein query vs protein database
• BLASTx: translated nucleotide query vs protein database
• tBLASTn: protein query vs translated nucleotide database
• tBLASTx: translated nucleotide query vs translated nucleotide database
• Nucleotides translated in all six reading frames
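As an illustration of what "translated in all six frames" means, here is a minimal sketch. It assumes the standard genetic code and an input containing only A, C, G and T; the function names are made up for this example, and real BLAST handles ambiguity codes and alternative genetic codes that this sketch ignores.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Stop codons are written as `*`. Only one of the six frames of this toy sequence reads as a start codon followed by a stop, which is why translated searches have to try all six.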
Implementations
• blastall: older command line version
  – Altschul et al. 1990 J. Mol. Biol. 215:403-410
• BLAST+: newer command line version
  – Camacho et al. 2009 BMC Bioinformatics 10:421
  – Faster than blastall
• Web BLAST:
  – www.blast.ncbi.nlm.nih.gov/Blast.cgi
  – Web version of BLAST+
Databases
• All BLAST queries are done vs. a database
• Examples:
  – NCBI’s “nr” queries against all of GenBank
  – WebBLAST has preformatted databases for different taxonomic groups, other NCBI divisions (e.g., Refseq, Genomes)
• Command line allows custom databases
  – e.g., lab genomes
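A custom database search with BLAST+ takes two commands: `makeblastdb` to format a FASTA file, then a search program such as `blastp`. The sketch below only builds the argument lists so they can be inspected before running. The file and database names (`lab_genomes.faa`, `lab_db`, `query.faa`, `hits.tsv`) are hypothetical placeholders, and the helper function names are made up for this example.

```python
import shlex


def makeblastdb_cmd(fasta_path, db_name, dbtype="prot"):
    """Argument list that formats a FASTA file as a BLAST+ database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastp_cmd(query_path, db_name, out_path, evalue=1e-5):
    """Argument list for a protein search with tabular output (-outfmt 6)."""
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_path,
    ]


# Hypothetical file names, used only for illustration.
db_cmd = makeblastdb_cmd("lab_genomes.faa", "lab_db")
search_cmd = blastp_cmd("query.faa", "lab_db", "hits.tsv")
print(shlex.join(db_cmd))
print(shlex.join(search_cmd))
```

With BLAST+ installed, each list could be passed to `subprocess.run(cmd, check=True)`; for a nucleotide database, `dbtype="nucl"` and `blastn` would be the corresponding choices.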
WebBLAST (BLASTn)
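[Screenshot: WebBLAST BLASTn search form, with the input sequence, database and BLAST type fields labeled]
Megablast is optimized for highly similar sequences; BLASTn tolerates more divergent ones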
BLAST: Step 1
• Break sequence into words
  – Protein: 2-3 amino acids
  – Nucleotide: 16-256 nucleotides
• Goal: exact word matches
  – Computational speedup
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014
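Step 1 can be sketched as follows: cut the query into overlapping words and look them up in an index of the subject's words. This is a simplified illustration of the idea, not BLAST's actual data structures; the function names and sequences are invented for the example.

```python
def words(sequence, k):
    """All overlapping k-letter words of a sequence, with their start positions."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def exact_word_matches(query, subject, k):
    """(query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = {}
    for pos, word in words(subject, k):
        index.setdefault(word, []).append(pos)
    return [
        (q_pos, s_pos)
        for q_pos, word in words(query, k)
        for s_pos in index.get(word, [])
    ]


query_words = words("MKVLAT", 3)
seeds = exact_word_matches("MKVLAT", "GGKVLAA", 3)
print(query_words)
print(seeds)
```

The speedup comes from the dictionary lookup: each query word is checked against the index in roughly constant time instead of being compared with every position of every database sequence.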
Substitution matrices
• Evolutionarily, some substitutions are more common than others
  – Some amino acids are common (e.g., Leu) and some are rare (e.g., Trp)
  – Some substitutions are more feasible than others (e.g., Leu -> Ile vs. Leu -> Arg)
• Substitution matrices therefore weight alignments by these probabilities
BLOSUM matrices
• Alignments of a set of divergent reference sequences
  – BLOSUM62: sequences 62% identical
  – BLOSUM80: sequences 80% identical
• Substitution frequency calculated for each reference set and used to derive substitution matrix
• Henikoff & Henikoff (1992) PNAS 89:10915-10919
• Also: M. Dayhoff’s PAM matrices from 1978
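The step from substitution frequencies to matrix entries is a log-odds calculation: how often a pair is observed in the reference alignments compared with how often it would be expected by chance from the residue frequencies. The sketch below shows only that step, with invented frequencies; the scaling to half-bit units is an assumption of this example, and the published procedure involves more than is shown here.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, same_residue, scale=2.0):
    """Rounded log-odds score for one residue pair.

    Expected frequency is freq_a * freq_b for identical residues and
    2 * freq_a * freq_b for an unordered pair of different residues.
    `scale=2.0` expresses the score in half-bit units (assumed here).
    """
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return round(scale * math.log2(observed_pair_freq / expected))


# Invented frequencies, for illustration only.
frequent_swap = log_odds_score(0.02, 0.10, 0.05, same_residue=False)
rare_swap = log_odds_score(0.0025, 0.10, 0.05, same_residue=False)
print(frequent_swap, rare_swap)
```

A pair seen more often than chance predicts gets a positive score, and a pair seen less often gets a negative one, which is how a feasible substitution such as Leu -> Ile ends up scoring higher than Leu -> Arg.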
BLAST: Step 2
• Use substitution matrix to find synonymous words above some scoring threshold
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014
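Step 2 can be illustrated by enumerating candidate words and keeping those that score at or above a threshold against the query word. The scoring scheme below is a hypothetical toy (identical +5, same chemical group +2, otherwise -3) over a reduced alphabet, not a real BLOSUM matrix, and the threshold of 12 is arbitrary.

```python
from itertools import product

# Hypothetical toy scoring, NOT a real substitution matrix.
GROUPS = ["ILVM", "KRH", "DE", "ST"]
ALPHABET = "".join(GROUPS)


def toy_score(a, b):
    """+5 for identical residues, +2 within a group, -3 otherwise."""
    if a == b:
        return 5
    if any(a in group and b in group for group in GROUPS):
        return 2
    return -3


def neighborhood(word, threshold):
    """All words over ALPHABET whose score against `word` is >= threshold."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


neighbors = neighborhood("LKD", 12)
print(sorted(neighbors))
```

With these toy scores the word itself and its single conservative substitutions pass the threshold, so a database word such as LRD can seed an alignment even though it does not match LKD exactly.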
BLAST: Step 3
• Find matching words in the database
• Extend word matches between query and matching sequence in both directions until extension score drops below threshold
  – First without gaps
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014
BLAST: Step 4
• If initial alignment good enough, redo with gaps and calculate statistics
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001014
BLAST score
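$$S = \left(\sum M_{ij}\right) - cO - dG$$
where
• $S$: score
• $\sum M_{ij}$: sum of scores from the substitution matrix
• $c$: # of gaps
• $O$: penalty for opening a gap
• $d$: total length of gaps
• $G$: per-residue penalty for gap extension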
Gap opening penalty typically significantly larger than gap extension penalty
Why?
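A small worked example of the score formula S = (ΣM_ij) − cO − dG may help with the discussion. The per-position scores and the penalties (opening 11, extension 1) are illustrative values chosen for this sketch.

```python
def alignment_score(pair_scores, gap_lengths, gap_open, gap_extend):
    """S = sum(M_ij) - c*O - d*G for one gapped alignment."""
    c = len(gap_lengths)  # number of gaps
    d = sum(gap_lengths)  # total length of gaps
    return sum(pair_scores) - c * gap_open - d * gap_extend


# Same aligned residues and same total gap length (3), arranged differently.
pair_scores = [4] * 10
one_long_gap = alignment_score(pair_scores, [3], gap_open=11, gap_extend=1)
three_short_gaps = alignment_score(pair_scores, [1, 1, 1], gap_open=11, gap_extend=1)
print(one_long_gap, three_short_gaps)
```

Both alignments have the same aligned residues and the same number of gapped positions, yet the one with a single long gap scores much higher: the opening penalty is paid once rather than three times.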
Questions:
1. Why do gap opening and extension penalties differ?
2. Why is BLAST a local aligner vs. global?
Local alignment
• Sequence extensions do not necessarily extend to sequence ends
  – Domains vs entire proteins
• Can be multiple query->reference matches
  – i.e., alignment can be broken, each with own statistics
• Can be multiple reference matches to the same query
Sequence masking
• Low-complexity regions can arise convergently
  – Small hydrophobic amino acids common in transmembrane helices
• Violates homology assumption, therefore often excluded from BLAST search
Comparing BLAST scores
• Different BLASTs can use different parameters, e.g., matrices & gap penalties
• “Bit scores” normalize for this
[Equation figure: the bit score is computed from the raw score using terms that depend on the matrix and gap penalties]
E-values
• What is the likelihood that the sequence similarity is due to chance vs. actual homology?
• Larger databases are more likely to include chance matches
E-values
$$E = \frac{n \times m}{2^{S'}}$$
where
• $E$: E-value
• $S'$: bit score
• $n$: total # of residues in the database
• $m$: length of the query sequence
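The E-value formula, E = (n × m) / 2^S', is simple enough to evaluate directly. The sketch below uses invented query and database sizes to show how the same bit score becomes less convincing as the database grows.

```python
def e_value(bit_score, query_length, database_residues):
    """E = (n * m) / 2**S' with n = database residues, m = query length."""
    return database_residues * query_length / 2 ** bit_score


# Invented sizes: a 250-residue query against a small and a large database.
e_small_db = e_value(40, 250, 1_000_000)
e_large_db = e_value(40, 250, 1_000_000_000)
print(e_small_db, e_large_db)
```

Each additional bit of score halves the E-value, while a database 1000 times larger multiplies it by 1000, so a hit that looks significant in a small custom database may not be in a large one.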
E-values
• The E-value is the expected number of random matches with a score >= the calculated score
• Smaller E-values therefore reflect greater probability of true homology
• Typically 1e-5 operationally used as a threshold for considering sequences as homologous