Comparative genomics
Chapter 15

Features that are investigated include
1. Genome size variation
2. Base composition biases
3. Gene order
   1. Colinearity
   2. Identification of orthologs
   3. Functional predictions
4. Whole genome alignments
   1. Identify genes
   2. Improve gene annotation
   3. Identify regulatory regions (phylogenetic footprinting)
5. Evolution of genomes
Comparative genomics - Fordham University, cschweikert/cisc4020/Lecture 8.pdf
Comparative genomics Genome Size
• There are huge differences in genome sizes among related organisms
• More complex organisms do not necessarily have larger genomes.
• “C-value paradox”
• Repetitive elements are responsible for a large proportion of the differences
• Amount of DNA affects cell size, speed of cell cycle, and other traits
Lungfish 140 Gb!
Comparative genomics Aligning Genomic sequences
Dotplots to compare orthologous BACs: Barley BAC 635P2 (horizontal) vs. Rice BAC 36I05 (vertical)
Dubcovsky et al. Plant Physiology. 125:1342-1353
[Figure: dotplot with genes G1, G2, G3, G4 (G4a, G4b), G5 and elements Ta-Ti labeled along the axes]
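The idea behind a dotplot can be sketched in a few lines of Python. This is a minimal illustration, not the method used for the barley/rice figure: the word length, the toy sequences, and the function names `dotplot` and `render` are all assumptions made for the example. A dot is placed wherever a short exact word occurs in both sequences, so conserved regions show up as diagonal runs.

```python
def dotplot(seq_x, seq_y, word=4):
    """Return (x, y) start positions where a word of length `word` matches exactly."""
    # Index every word of the horizontal sequence by its start position.
    index = {}
    for i in range(len(seq_x) - word + 1):
        index.setdefault(seq_x[i:i + word], []).append(i)
    # Look up every word of the vertical sequence in that index.
    hits = []
    for j in range(len(seq_y) - word + 1):
        for i in index.get(seq_y[j:j + word], []):
            hits.append((i, j))
    return hits


def render(hits, len_x, len_y):
    """Draw the hits as a text grid: x runs left to right, y runs top to bottom."""
    grid = [["." for _ in range(len_x)] for _ in range(len_y)]
    for i, j in hits:
        grid[j][i] = "*"
    return "\n".join("".join(row) for row in grid)


# Toy sequences (hypothetical): conserved ends, diverged middle.
horizontal = "ACGTACGTTTGACCA"
vertical = "ACGTACGAAAGACCA"
hits = dotplot(horizontal, vertical, word=4)
print(render(hits, len(horizontal), len(vertical)))
```

Real BAC comparisons use longer words or scored windows and both strands; the sketch only checks the forward strand.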
What have we learned?
Dubcovsky and Dvorak 2007, Science 316: 1862-1866
• Small groups of genes are dispersed within a sea of repetitive elements
• Rapid divergence in intergenic regions: comparison of orthologous A and Am VRN2 regions
  • 1.1 MYA divergence
  • 31% similarity
• Fast decay of % similarity is also affecting genes
• Fast rate of gene deletions and inactivation by retro insertions.
• Small phenotypic changes in polyploid wheat. Valuable source of diversity
[Figure: comparison of the Am genome (T. monococcum) and the A genome (durum), label 20 kb. Plot "Decay of synteny in intergenic regions": C(t) from 0.0 to 1.0 against million years from 0 to 12, with points for A-Am, A-B-D wheat and wheat-barley. Inset label: Human vs. Chimp, 7 mya, 3% divergence.]
Comparative genomics Composition bias
Effect of repetitive elements in GC content
• Grass genes 51% GC, Arabidopsis genes 44% GC
• The overall GC content in wheat (46%) is 2% higher than in Arabidopsis or rice
• The GC content of barley and wheat is very similar to the GC content of the most abundant retroelements.
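As a small illustration of how GC percentages like those above are computed, the sketch below counts G and C among the unambiguous bases of a sequence and also reports GC content in sliding windows, which is one way to see repeat-rich regions shifting the overall value. The function names and the toy sequences are assumptions for the example.

```python
def gc_content(seq):
    """Percent G+C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


def gc_windows(seq, window, step):
    """Return (start, percent GC) for each window of length `window`."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_content("GGCCAATT"))
print(gc_windows("GGGGAAAA", window=4, step=4))
```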
Comparative genomics Gene order
Different organisms show different rates of chromosome rearrangements
[Figure: Comparative maps of the wheat genome described in terms of the rice genome (A) and the Aegilops umbellulata genome (B). (A) Data from Kurata et al. (1994), Van Deynze et al. (1995). (B) Data from Zhang et al. (1998).]
A. Wheat-rice: 50 million years
B. Wheat-Aegilops: 5 million years
• Human – Mouse (80 million years): many rearrangements (180 blocks) but some large conserved regions!
• Human – Chimpanzee (5 million years): almost completely colinear.
• Colinearity allows transfer of genetic mapping data between organisms.
• Colinearity allows functional analysis of human genes via precise deletion of their colinear regions in the mouse genome
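Counting conserved blocks, as in the human-mouse comparison, can be sketched as follows. This is a simplified illustration under stated assumptions: orthologs are already identified, both genomes are treated as a single chromosome, and a block is a run of genes whose order in the second genome is strictly consecutive (forward or inverted). The function name `colinear_blocks` and the input encoding are hypothetical.

```python
def colinear_blocks(order_in_b):
    """Split genes into colinear blocks.

    order_in_b[k] is the rank in genome B of the k-th gene along genome A.
    A block continues while ranks change by +1 (same orientation)
    or by -1 (inverted segment), without switching direction.
    """
    if not order_in_b:
        return []
    blocks = [[order_in_b[0]]]
    direction = 0  # 0 = not yet known, +1 = forward, -1 = inverted
    for prev, cur in zip(order_in_b, order_in_b[1:]):
        step = cur - prev
        if step in (1, -1) and direction in (0, step):
            blocks[-1].append(cur)
            direction = step
        else:
            blocks.append([cur])
            direction = 0
    return blocks


# Eight orthologs; genes 3-5 of genome B are inverted relative to genome A.
blocks = colinear_blocks([0, 1, 2, 5, 4, 3, 6, 7])
print(len(blocks), blocks)
```

A fully colinear pair of genomes gives a single block; each rearrangement breakpoint adds one.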
Homology search for the mouse genome
• Homology search of all genes in the mouse genome
• 27% in other metazoans
• 29% in other eukaryotes
• 6% in other chordates
• 14% in other mammals
• Less than 1% rodent specific
[Figure: pie chart with slices labeled Eukaryote, Metazoan, Chordate specific, Mammal specific, Rodent specific, Other]
Comparative genomics Aligning Genomic sequences
Dotplots to discover duplications in rice
• A large genome duplication occurred approximately 70 million years ago, before the divergence of most grass subfamilies.
• Diploid grasses are actually ancient polyploids
• Figure shows the arrangement of duplicated protein-encoding genes in rice in the order found in the current sequence assembly. Both X and Y axes represent 45,174 genes in their chromosomal order.
• Colors indicate same (red) or opposite (green) transcriptional orientations.
• Differential gene loss contributes to apparent incongruities in comparative grass genomics.
• Paterson et al. 2004 PNAS
• A similar result has been described for Arabidopsis, suggesting that this species is also an ancient polyploid
Comparative Genomics. Gene Annotation
Though most human and mouse orthologs have similar structure, some differences are observed:
• Comparison of the human and mouse spermidine synthase genes revealed an additional intron in the human gene that is not found in the mouse homolog
[Figure: structure of the human and mouse genes; label 5,500 bp]
Algorithms for Aligning Genomic sequences
Finding orthologous regions between two genomes is nontrivial.
Existing methods based on dynamic programming algorithms (e.g. Needleman and Wunsch / Smith and Waterman) or hashing (e.g. BLAST / FASTA) rapidly run out of memory.
Even Megablast and BLAT cannot handle large genome comparisons.
Specific programs
• BLASTZ
• LAGAN
• AVID

BLASTZ (http://zpicture.dcode.org/, see zpicture.pdf)
• Local alignment program. Aligns 2 Mb in <1 min.
• First masks regions in seq. 1 repeated in seq. 2
• Searches for stretches of 19 bp with 12 matches (1 transition OK)
• After the initial match a gap-free extension is performed until a cumulative score reaches a threshold (3000)
• If the threshold is met, the region is realigned, now allowing gaps
• Alignments with scores >5000 move to the next phase
• Scores are calculated using refined substitution matrices based on aligned human-mouse non-coding sequences
• Connects individual alignments separated by <50 kb

[Figure: substitution matrices used by BLAST and BLASTZ; panels labeled BLASTN and BLASTZ]
BLASTZ matrix based on observed substitutions in aligned conserved regions of mouse and human. BLASTZ aligned mouse sequences to 40% of the human genome.
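The seeding and gap-free extension steps listed for BLASTZ can be sketched roughly as below. Several things here are assumptions and not taken from the slide: the positions of the 12 required matches within the 19 bp seed (the `SEED` pattern), the toy scores in `score_pair` (stand-ins for the real substitution matrix), and the drop-off rule used to stop the extension. The sketch only shows the shape of the procedure: test a spaced seed that tolerates one transition, then extend without gaps and compare the cumulative score to a threshold.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# Assumed 12-of-19 pattern: '1' = position must match, '0' = ignored.
SEED = "1110100110010101111"


def seed_hit(a, b, i, j, seed=SEED):
    """True if a[i:] and b[j:] agree at all '1' positions, allowing one transition."""
    if i + len(seed) > len(a) or j + len(seed) > len(b):
        return False
    transitions = 0
    for k, flag in enumerate(seed):
        if flag != "1":
            continue
        x, y = a[i + k], b[j + k]
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            transitions += 1
            if transitions > 1:
                return False
        else:
            return False  # a transversion at a required position
    return True


def score_pair(x, y):
    """Toy scores (assumed): match 100, transition -30, transversion -120."""
    if x == y:
        return 100
    return -30 if (x, y) in TRANSITIONS else -120


def extend_gap_free(a, b, i, j, length, x_drop=300):
    """Extend the diagonal through a[i:i+length] / b[j:j+length] both ways, no gaps.

    Each direction stops when the running score falls more than `x_drop`
    below its best value; the best cumulative score is returned.
    """
    core = sum(score_pair(a[i + k], b[j + k]) for k in range(length))

    def one_direction(p, q, step):
        best = running = 0
        while 0 <= p < len(a) and 0 <= q < len(b):
            running += score_pair(a[p], b[q])
            if running > best:
                best = running
            elif best - running > x_drop:
                break
            p += step
            q += step
        return best

    right = one_direction(i + length, j + length, 1)
    left = one_direction(i - 1, j - 1, -1)
    return core + left + right


seq = "ACGT" * 10
if seed_hit(seq, seq, 10, 10):
    total = extend_gap_free(seq, seq, 10, 10, len(SEED))
    print(total, total >= 3000)  # 3000 is the threshold quoted on the slide
```

With the toy scores the numbers are only loosely comparable to the thresholds of 3000 and 5000 quoted for BLASTZ; the gapped realignment and chaining phases are not shown.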
ZPicture http://zpicture.dcode.org/
zPicture is a pair-wise alignment and visualization tool that compares two sequences using the local alignment program BlastZ.
Multiple sequences: http://zpicture.dcode.org/multiz.php
[Figure: zPicture alignment Rice (AY013245) vs. Barley (AY013246) [no gene annotation provided so all in red]. Labels: Gene 1 4548-7287; Gene 2 8663-13534; R; Gene 3 14166-22601; Gene 4 24136-29057; Repeat]
Comparative genomics LAGAN
Limited Area Global Alignment of Nucleotides
http://lagan.stanford.edu/lagan_web/index.shtml
LAGAN: Global pair-wise and multiple alignment of finished sequences.
• Detects closely and distantly related sequences
• If some of the sequences are in a draft format your query will be redirected to AVID
• Multiple alignment will be visualized by VISTA
• This is the only alignment program available through the VISTA server that produces true multiple alignments.
• LAGAN performs better than BLASTZ for distantly related organisms
• Shuffle-LAGAN: detects rearrangements and inversions
AVID
AVID: global pair-wise alignment.
• Fast alignment of large seq.
• Detects weak homologies
• One of the sequences should be finished, but all others can be either finished or in a draft format.
• For all finished sequences in the set, AVID generates all-against-all pair-wise alignments
• Draft sequences are aligned to the finished sequence.
• View using VISTA
• Similar to LAGAN in that it uses anchors
[Figure: AVID anchoring steps. Clean matches (1st) and repeat matches; set of non-overlapping, non-crossing matches (red); recursion: repeat for each inter-anchor region from the previous step]
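The anchor-selection step (choosing a set of non-overlapping, non-crossing matches) can be illustrated with a simple chaining routine. This is a generic sketch, not AVID's actual procedure: matches are assumed to be exact, equal-length segments given as `(start_a, start_b, length)`, and the chain maximizing total matched length is found with a quadratic dynamic program. The function name `chain_anchors` is hypothetical.

```python
def chain_anchors(matches):
    """Pick the subset of matches with the largest total length such that
    the matches neither overlap nor cross in either sequence.

    matches: list of (start_a, start_b, length) tuples.
    Returns the chosen matches in order along sequence A.
    """
    ms = sorted(matches)
    n = len(ms)
    if n == 0:
        return []
    best = [m[2] for m in ms]  # best chain length ending at each match
    prev = [-1] * n            # back-pointer to the previous match in the chain
    for k in range(n):
        a_k, b_k, len_k = ms[k]
        for p in range(k):
            a_p, b_p, len_p = ms[p]
            # p must end before k starts in BOTH sequences.
            ends_before = a_p + len_p <= a_k and b_p + len_p <= b_k
            if ends_before and best[p] + len_k > best[k]:
                best[k] = best[p] + len_k
                prev[k] = p
    # Trace back from the best-scoring end point.
    k = max(range(n), key=lambda t: best[t])
    chain = []
    while k != -1:
        chain.append(ms[k])
        k = prev[k]
    return chain[::-1]


matches = [(0, 0, 10), (12, 50, 5), (20, 15, 8), (40, 30, 10), (35, 5, 4)]
anchors = chain_anchors(matches)
print(anchors)
```

Following the recursion described above, the same selection would then be repeated on the sequence between each pair of consecutive anchors, using shorter or weaker matches.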
VISTA
• Option to use AVID or LAGAN.
• If one of the sequences is not finished, uses AVID.
• Can align a large sequence to genome assemblies