How to Build a Horse
Megan Smedinghoff
2
Background
In February 2007, the Broad Institute released a draft genome of the horse (Equus caballus)
The project cost $15 million and was funded by the National Human Genome Research Institute and the National Institutes of Health
300,000 Bacterial Artificial Chromosomes were provided by the University of Veterinary Medicine in Hanover, Germany and the Helmholtz Centre for Infection Research in Braunschweig, Germany
3
Horse Genome Statistics
The horse genome contains approximately 2.7 billion base pairs
The assembly was done using 6.8-fold coverage
The sequenced horse was a Thoroughbred mare named Twilight from Cornell University
Twilight posing for a picture at Cornell
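As a rough sanity check on these statistics, the sketch below works out how many sequenced bases 6.8-fold coverage of a 2.7 billion bp genome implies. The 750 bp read length is an assumption made only for illustration, and the function names are hypothetical.

```python
def total_sequenced_bases(genome_size_bp, coverage):
    """Coverage is total sequenced bases divided by genome size, so invert it."""
    return genome_size_bp * coverage


def approx_read_count(genome_size_bp, coverage, read_length_bp):
    """Rough number of reads needed, ignoring trimming and failed reads."""
    return total_sequenced_bases(genome_size_bp, coverage) / read_length_bp


GENOME_SIZE_BP = 2.7e9   # from the slide
COVERAGE = 6.8           # from the slide
READ_LENGTH_BP = 750     # assumed read length, not a project figure

bases = total_sequenced_bases(GENOME_SIZE_BP, COVERAGE)
reads = approx_read_count(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
print(f"about {bases / 1e9:.1f} billion bases in about {reads / 1e6:.1f} million reads")
```

Under these assumptions the assembler has to handle tens of millions of reads, which is why the overlap step is the expensive part.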
4
Why Sequence the Horse?
Allows scientists to study diseases that primarily affect horses such as Glanders
SNP information can be used to connect DNA to physical characteristics and explain differences between breeds
Lots of general information about mammals can be gained by looking at the horse since very few large mammals have been sequenced
5
How the Horse Genome Affects Us
There are over 80 known genetic conditions in the horse that are analogous to human disorders
Horses have some conditions traditionally found in humans such as allergies and arthritis
Having the complete horse genome helps infer the order of evolution
Horse Racing?
6
Project Proposal
Reassemble the horse genome using the Celera Assembler
Use existing UMD software to compare my assembly with the Broad assembly and produce a reconciled horse genome
Deposit the improved assembly in GenBank
Advisor: Jim Yorke
7
Introduction to Genome Sequencing
[Figure: shotgun sequencing workflow]
DNA target sample
SHEAR
SIZE SELECT (e.g., 10Kbp ± 8% std. dev.)
LIGATE & CLONE (vector)
SEQUENCE (primer; end reads, or mates; 750bp)
Slide courtesy of Art Delcher
8
How Genomes are Assembled
Trim the Reads
Calculate Overlaps
Build Unitigs
Build Contigs
Build Scaffolds
Closure
9
Assembly: Calculating Overlaps
Compare every pair of reads to find every overlap of at least a minimum length (~40bp)
Must compare forward and reverse orientation of each pair of reads
Comparisons must take into account the possibility of sequencing errors and use alignment algorithms such as Smith-Waterman
[Figure: the four possible orientations of a pair of reads]
Read A 5’ to 3’, Read B 5’ to 3’
Read A 5’ to 3’, Read B 3’ to 5’
Read A 3’ to 5’, Read B 5’ to 3’
Read A 3’ to 5’, Read B 3’ to 5’
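The orientation check can be sketched as follows. This is a minimal illustration that only finds exact suffix-to-prefix matches, so it ignores sequencing errors; a real overlapper would use an error-tolerant alignment such as Smith-Waterman. The function names are hypothetical, and the 40 bp default is taken from the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of a that exactly equals a prefix of b, or 0 if below min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def best_overlap(a, b, min_len=40):
    """Try all four orientation combinations and return the best one."""
    candidates = {}
    for a_name, a_seq in (("A fwd", a), ("A rev", reverse_complement(a))):
        for b_name, b_seq in (("B fwd", b), ("B rev", reverse_complement(b))):
            candidates[(a_name, b_name)] = suffix_prefix_overlap(a_seq, b_seq, min_len)
    orientation = max(candidates, key=candidates.get)
    return orientation, candidates[orientation]


# Toy reads with a small min_len so the example stays readable.
print(best_overlap("GGGGACGTAC", "ACGTACTTTT", min_len=4))
```

Checking both reads in both orientations also covers the case where Read B comes before Read A, because a suffix of B matching a prefix of A is the same overlap seen from the opposite strand.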
10
Assembly: Creating Unitigs
A unitig is a set of reads that have been linked together based on overlaps
A unitig has no ambiguities
Unitig
Reads
11
Assembly: Creating Unitigs (cont.)
Best Buddy Algorithm for Unitig Assembly:
If the longest overlap with read A is read B and the longest overlap with read B is read A, then reads A and B are best buddies
[Figure: two overlap layouts of reads A, B, C, and D]
First layout: Read A and Read B are best buddies
Second layout: Read A and Read B are NOT best buddies
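The best buddy rule can be sketched in a few lines. This is a simplified illustration, not the Celera implementation: it assumes overlap lengths have already been computed, treats each read as a single unit, and breaks ties by whichever overlap is seen first. The function names and the input layout are hypothetical.

```python
def longest_partner(overlaps):
    """Map each read to the read it has its longest overlap with.

    overlaps: dict mapping a (read1, read2) tuple to an overlap length.
    """
    best = {}
    for (r1, r2), length in overlaps.items():
        for read, other in ((r1, r2), (r2, r1)):
            if read not in best or length > best[read][1]:
                best[read] = (other, length)
    return {read: partner for read, (partner, _) in best.items()}


def best_buddies(overlaps):
    """Return the pairs of reads that are each other's longest overlap."""
    partner = longest_partner(overlaps)
    return {
        tuple(sorted((read, other)))
        for read, other in partner.items()
        if partner.get(other) == read
    }


# A and B are mutual best overlaps; C and D are not anyone's buddy.
print(best_buddies({("A", "B"): 300, ("A", "C"): 100, ("B", "D"): 150}))
```

In the second case on the slide, read B has a longer overlap with some other read than with read A, so the pair fails the mutual test even though A still prefers B.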
12
Assembly: Creating Contigs
A contig is a set of overlapping unitigs
Contigs are assembled by using mate pair information
Since we know the distance between mates and the orientation of the mates, we can infer the placement of the unitigs
Unitig A Unitig B
Read 1 Read 2
Read 1 and Read 2 are mates
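The placement inference can be sketched as simple arithmetic. The sketch assumes both unitigs lie on the same strand, that Read 1 points forward in Unitig A and Read 2 points backward in Unitig B, and that the mate distance is measured between the outer ends of the two reads. The function and parameter names are hypothetical.

```python
def infer_unitig_gap(mate_distance, unitig_a_length, read1_start_in_a, read2_end_in_b):
    """Estimate the gap between Unitig A and Unitig B from one mate pair.

    mate_distance: expected distance between the outer ends of the two mates
    read1_start_in_a: offset of the start of Read 1 within Unitig A
    read2_end_in_b: offset of the far end of Read 2 within Unitig B

    A negative result means the two unitigs are expected to overlap.
    """
    span_in_a = unitig_a_length - read1_start_in_a  # part of the insert inside A
    span_in_b = read2_end_in_b                      # part of the insert inside B
    return mate_distance - span_in_a - span_in_b


# A 10 Kbp insert with 1,000 bp inside Unitig A and 1,500 bp inside Unitig B.
print(infer_unitig_gap(10_000, 5_000, 4_000, 1_500))
```

Because insert sizes vary around their mean, a single mate pair gives only an approximate position; several consistent pairs give a more reliable estimate.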
13
Assembly: Building Scaffolds
Scaffolds are built from contigs
The orientation and approximate distances between contigs are inferred from mate pair information
When possible, the gaps between contigs are filled in with leftover sequence
Scaffold
Contig A Contig B
Reads
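A toy version of scaffold building is sketched below. It assumes the mate pair links between contigs have already been grouped into bundles, each carrying one gap estimate per mate pair, and it requires at least two supporting links before trusting a join. That threshold, the input layout, and the function name are assumptions for illustration; orientation handling is left out.

```python
from statistics import mean


def build_scaffold(link_bundles, min_links=2):
    """Order contigs into one scaffold using bundles of mate pair links.

    link_bundles: list of (left_contig, right_contig, [gap estimates])
    Returns a list alternating contig names and estimated gap sizes.
    """
    next_contig = {}
    for left, right, gaps in link_bundles:
        if len(gaps) >= min_links:  # ignore weakly supported joins
            next_contig[left] = (right, round(mean(gaps)))

    # The scaffold starts at the contig that nothing links into.
    starts = set(next_contig) - {right for right, _ in next_contig.values()}
    if len(starts) != 1:
        raise ValueError("expected exactly one starting contig")

    current = starts.pop()
    scaffold, seen = [current], {current}
    while current in next_contig:
        right, gap = next_contig[current]
        if right in seen:
            raise ValueError("links form a cycle")
        scaffold += [gap, right]
        seen.add(right)
        current = right
    return scaffold


bundles = [("A", "B", [480, 520]), ("B", "C", [1000, 1100, 900]), ("C", "D", [50])]
print(build_scaffold(bundles))
```

Here contig D is left out because only one mate pair supports the join from C to D.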
14
Arachne Assembler
24-mer indexing
Any two reads that share at least one 24-mer are paired
Each pair is scored
Contigs are created by merging paired pairs
Repeat regions are avoided during contig assembly but used during scaffold assembly
Subreads are placed after scaffold assembly
Serafim Batzoglou, Arachne Author
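The k-mer indexing idea can be sketched as follows. This is only an illustration of pairing reads that share a k-mer, not Arachne's code: it looks at the forward strand only and does no scoring or repeat filtering. The function names are hypothetical, and the default k of 24 comes from the slide.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k=24):
    """Map every k-mer to the set of read names that contain it."""
    index = defaultdict(set)
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_pairs(reads, k=24):
    """Return every pair of reads that shares at least one k-mer."""
    pairs = set()
    for names in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Toy reads with a small k so the shared k-mer is easy to spot.
toy_reads = {"r1": "ACGTACGGA", "r2": "TTACGGATT", "r3": "CCCCCCCC"}
print(candidate_pairs(toy_reads, k=4))
```

The index avoids comparing every read against every other read: only reads that land in the same k-mer bucket are ever paired.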
15
Celera Assembler
Find overlaps of at least 40bp with less than 6% error
Overlaps are found using 22-mers
After overlaps are calculated, Celera does error correction using a voting algorithm
Contigs are assembled using best buddy algorithm
Scaffolds are assembled from mate pair information
Scaffold gaps are filled when possible
Gene Myers, former vice president of Celera Genomics
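The overlap acceptance rule on this slide (at least 40bp, less than 6% error) can be sketched as a simple filter. The sketch assumes the two overlapping regions have already been aligned without gaps, so the error rate is just the fraction of mismatched positions; the function names are hypothetical and this is not the Celera Assembler's own code.

```python
def overlap_error_rate(a_region, b_region):
    """Fraction of mismatched positions between two aligned, ungapped regions."""
    if not a_region or len(a_region) != len(b_region):
        raise ValueError("regions must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a_region, b_region))
    return mismatches / len(a_region)


def accept_overlap(a_region, b_region, min_len=40, max_error=0.06):
    """Apply the slide's thresholds: at least min_len long, under max_error."""
    if len(a_region) < min_len:
        return False
    return overlap_error_rate(a_region, b_region) < max_error


region = "ACGT" * 12                      # 48 bp overlap region
noisy = "C" + region[1:5] + "A" + region[6:]  # same region with 2 mismatches
print(accept_overlap(region, noisy))
```

With 2 mismatches in 48 bp the error rate is about 4.2%, which passes; a third mismatch would push it to 6.25% and the overlap would be rejected.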
16
Project Expectations
Fall 2007
Produce Celera Assembly
Spring 2008
Produce Reconciled Assembly
General Goals
Tackle the unexpected problems that accompany genome assembly
Document my work
Validate my work wherever possible
17
Validation
Genome assemblies are not perfect
I plan to validate my assembly by comparing it to the current draft
I expect about 1.5% difference between the Celera Assembly and the Broad Assembly
I will use MUMmer to measure similarity between genomes
18
MUMmer
MUMmer is a piece of software created by CBCB that is used to compare genomes
MUMmer locates strings of at least 18bp that are present in each genome
Plotting the results makes it easy to see insertions, deletions, inversions, etc.
Graphs courtesy of Adam Phillippy
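The idea of locating shared strings can be sketched with a naive seed-and-extend search. This is an illustration only: MUMmer itself relies on a suffix-tree data structure and is far more efficient, and this sketch reports every maximal exact match without checking that it is unique. The function name is hypothetical, and the default of 18 comes from the slide.

```python
def shared_matches(ref, qry, min_len=18):
    """Return (ref_pos, qry_pos, length) for maximal exact matches of at least min_len."""
    # Index every min_len-long substring of the reference by position.
    seeds = {}
    for i in range(len(ref) - min_len + 1):
        seeds.setdefault(ref[i:i + min_len], []).append(i)

    matches = set()
    for j in range(len(qry) - min_len + 1):
        for i in seeds.get(qry[j:j + min_len], []):
            # Skip seeds that could be extended to the left; the match is
            # reported once, from its leftmost seed.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            length = min_len
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1
            matches.add((i, j, length))
    return sorted(matches)


# Toy sequences with a small min_len; they share the string ACGTACGG.
print(shared_matches("TTTTACGTACGGCCCC", "GGACGTACGGAA", min_len=5))
```

Plotting each match as a line from (ref_pos, qry_pos) with the given length produces the kind of dot plot in which insertions, deletions, and inversions show up as breaks or reversed diagonals.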
19
Implementation Details
I plan to use the Genome cluster at the University of Maryland to produce my assembly
Much of my project will utilize existing software
I intend to use Perl to write any additional scripts that may be needed
20
Time Permitting
The University of Maryland has recently produced a lot of software for the genome assembly pipeline, much of which has not been tested on large genomes
I hope to use programs like the UMD overlapper and Figaro to see how these programs affect my assembly
Mihai Pop
James White
21
Acknowledgements
James Yorke, Aleksey Zimin, and the Genome Group for advising me on the nature of this project
Steven Salzberg, Art Delcher, and Adam Phillippy for giving lectures and producing slides on genome assembly topics
Gene Myers paper on Drosophila
Serafim Batzoglou paper on Arachne
Wikipedia