
How to Build a Horse

Megan Smedinghoff

2

Background

In February 2007, the Broad Institute released a draft genome of the horse (Equus caballus)

The project cost $15 million and was funded by the National Human Genome Research Institute and the National Institutes of Health

300,000 Bacterial Artificial Chromosomes were provided by the University of Veterinary Medicine in Hanover, Germany and the Helmholtz Centre for Infection Research in Braunschweig, Germany

3

Horse Genome Statistics

The horse genome contains approximately 2.7 billion base pairs

The assembly was done using 6.8-fold coverage

The sequenced horse was a thoroughbred mare named Twilight from Cornell University

Twilight posing for a picture at Cornell
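
As a rough illustration of what 6.8-fold coverage means for a 2.7 billion base pair genome, the arithmetic can be sketched as follows. The 750 bp read length is an assumption borrowed from the sequencing slide later in this deck, and the function names are hypothetical.

```python
# Back-of-the-envelope coverage arithmetic (a sketch, not project data).
GENOME_SIZE_BP = 2.7e9   # approximate horse genome size from the slide
COVERAGE = 6.8           # fold coverage from the slide
READ_LENGTH_BP = 750     # ASSUMPTION: read length taken from the sequencing slide


def bases_sequenced(genome_size_bp, coverage):
    """Total bases sequenced = genome size x fold coverage."""
    return genome_size_bp * coverage


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Approximate number of reads needed to reach the given coverage."""
    return bases_sequenced(genome_size_bp, coverage) / read_length_bp


if __name__ == "__main__":
    total = bases_sequenced(GENOME_SIZE_BP, COVERAGE)
    reads = reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
    print(f"bases sequenced: {total:.3e}")
    print(f"reads needed:    {reads:.3e}")
```

Under these assumptions the total comes to about 18.4 billion sequenced bases, or roughly 24.5 million reads.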

4

Why Sequence the Horse?

Allows scientists to study diseases that primarily affect horses, such as glanders

SNP information can be used to connect DNA to physical characteristics and explain differences between breeds

Lots of general information about mammals can be gained by looking at the horse since very few large mammals have been sequenced

5

How the Horse Genome Affects Us

There are over 80 known genetic conditions in the horse that are analogous to human disorders

Horses have some conditions traditionally found in humans such as allergies and arthritis

Having the complete horse genome helps infer the order of evolution

Horse Racing?

6

Project Proposal

Reassemble the horse genome using the Celera Assembler

Use existing UMD software to compare my assembly with the Broad assembly and produce a reconciled horse genome

Deposit the improved assembly in GenBank

Advisor: Jim Yorke

7

Introduction to Genome Sequencing

[Figure: sequencing workflow]

DNA target sample

SHEAR

SIZE SELECT (e.g., 10 Kbp ± 8% std. dev.)

LIGATE & CLONE (into vector)

SEQUENCE (from primer): End Reads (Mates), 750bp

Slide courtesy of Art Delcher

8

How Genomes are Assembled

Trim the Reads

Calculate Overlaps

Build Unitigs

Build Contigs

Build Scaffolds

Closure

9

Assembly: Calculating Overlaps

Compare every possible combination of reads to find every overlap of a certain length (~40bp)

Must compare forward and reverse orientation of each pair of reads

Comparisons must take into account the possibility of sequencing errors and use alignment algorithms such as Smith-Waterman

[Figure: the four orientation combinations]

Read A 5’→3’, Read B 5’→3’

Read A 5’→3’, Read B 3’→5’

Read A 3’→5’, Read B 5’→3’

Read A 3’→5’, Read B 3’→5’
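
The overlap step can be sketched in a few lines of Python. This is a minimal illustration, not the assembler's implementation: it compares every pair of reads in both orientations, but it tolerates errors only as mismatches (no insertions or deletions), whereas a real overlapper would use an alignment algorithm such as Smith-Waterman. The function names and the `max_error` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def suffix_prefix_overlap(a, b, min_len=40, max_error=0.06):
    """Length of the longest suffix of a that matches a prefix of b.

    ASSUMPTION: errors are modelled as mismatches only; max_error is the
    allowed mismatch fraction. Returns 0 if no overlap of min_len exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_error * length:
            return length
    return 0


def find_overlaps(reads, min_len=40, max_error=0.06):
    """Compare every ordered pair of reads, with the second read in both
    forward ("+") and reverse-complement ("-") orientation."""
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            for b_seq, orient in ((b, "+"), (reverse_complement(b), "-")):
                n = suffix_prefix_overlap(a, b_seq, min_len, max_error)
                if n:
                    overlaps.append((i, j, orient, n))
    return overlaps
```

The all-against-all loop is quadratic in the number of reads, which is why production assemblers first narrow the candidates with a k-mer index before aligning.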

10

Assembly: Creating Unitigs

A unitig is a set of reads that have been linked together based on overlaps

A unitig has no ambiguities

Unitig

Reads

11

Assembly: Creating Unitigs (cont.)

Best Buddy Algorithm for Unitig Assembly:

If the longest overlap with read A is read B and the longest overlap with read B is read A, then reads A and B are best buddies

[Figure, example 1 (reads A, B, C, D): Read A and Read B are best buddies]

[Figure, example 2 (reads A, B, C, D): Read A and Read B are NOT best buddies]
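
The best buddy rule can be sketched as follows. This is a simplified illustration under stated assumptions: overlaps are given as a hypothetical dictionary mapping a pair of read names to an overlap length, each read is treated as a whole (a real assembler would track the two ends of a read separately), and ties go to the first overlap seen.

```python
def longest_partner(overlaps):
    """Map each read to the read it shares its longest overlap with.

    overlaps: dict {(read_x, read_y): overlap_length}, treated as symmetric.
    """
    best = {}
    for (x, y), length in overlaps.items():
        for read, other in ((x, y), (y, x)):
            if read not in best or length > best[read][1]:
                best[read] = (other, length)
    return {read: other for read, (other, _) in best.items()}


def best_buddy_pairs(overlaps):
    """Return the pairs of reads that are each other's longest overlap."""
    partner = longest_partner(overlaps)
    pairs = {
        tuple(sorted((read, other)))
        for read, other in partner.items()
        if partner.get(other) == read
    }
    return sorted(pairs)


if __name__ == "__main__":
    # A and B prefer each other, so they are best buddies.
    print(best_buddy_pairs({("A", "B"): 100, ("B", "C"): 60, ("C", "D"): 80}))
    # B's longest overlap is with C, so A and B are not best buddies.
    print(best_buddy_pairs({("A", "B"): 50, ("B", "C"): 90}))
```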

12

Assembly: Creating Contigs

A contig is a set of overlapping unitigs

Contigs are assembled by using mate pair information

Since we know the distance between mates and the orientation of the mates, we can infer the placement of the unitigs

Unitig A Unitig B

Read 1 Read 2

Read 1 and Read 2 are mates
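
To illustrate how a mate pair constrains the placement of two unitigs, here is a minimal sketch. It assumes Read 1 lies forward on Unitig A, Read 2 lies reversed on Unitig B, and the insert runs from the start of Read 1 to the end of Read 2. The coordinate convention and the function name are hypothetical.

```python
def estimate_gap(unitig_a_len, read1_start, read2_end, insert_size):
    """Estimate the gap between Unitig A and Unitig B from one mate pair.

    read1_start: start of Read 1, measured from the left end of Unitig A.
    read2_end:   end of Read 2, measured from the left end of Unitig B.
    insert_size: expected distance spanned by the mates (e.g. 10,000 bp).

    A negative result suggests that the two unitigs overlap.
    """
    covered_on_a = unitig_a_len - read1_start  # insert bases lying on Unitig A
    covered_on_b = read2_end                   # insert bases lying on Unitig B
    return insert_size - covered_on_a - covered_on_b


if __name__ == "__main__":
    # 2,000 bp of the insert lie on A and 1,500 bp on B, leaving the rest as gap.
    print(estimate_gap(unitig_a_len=5000, read1_start=3000,
                       read2_end=1500, insert_size=10000))
```

Because insert sizes vary (the sequencing slide quotes a standard deviation of about 8%), a single mate pair gives only an approximate distance; combining several pairs that link the same two unitigs tightens the estimate.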

13

Assembly: Building Scaffolds

Scaffolds are built from contigs

The orientation and approximate distances between contigs are inferred from mate pair information

When possible, the gaps between contigs are filled in with leftover sequence

Scaffold

Contig A Contig B

Reads

14

Arachne Assembler

24-mer indexing

Any two reads that share at least one 24-mer are paired

Each pair is scored

Contigs are created by merging paired pairs

Repeat regions are avoided during contig assembly but used during scaffold assembly

Subreads are placed after scaffold assembly

Serafim Batzoglou, Arachne author
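
The 24-mer indexing idea can be sketched as below. This is an illustration of k-mer indexing in general, not Arachne's code: it ignores reverse-complement matches and the scoring step, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k=24):
    """Map every k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_pairs(reads, k=24):
    """Return all pairs of reads that share at least one k-mer."""
    pairs = set()
    for read_ids in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


if __name__ == "__main__":
    # Tiny example with k=4 so the shared k-mer is easy to see ("GTAC").
    print(candidate_pairs(["ACGTAC", "GTACGG", "TTTTTT"], k=4))
```

Only the candidate pairs found this way need to be aligned and scored, which avoids comparing every read against every other read.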

15

Celera Assembler

Find overlaps of at least 40bp with less than 6% error

Overlaps are found using 22-mers

After overlaps are calculated, Celera does error correction using a voting algorithm

Contigs are assembled using best buddy algorithm

Scaffolds are assembled from mate pair information

Scaffold gaps are filled when possible

Gene Myers, former vice president of Celera Genomics
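
The slides do not describe how the voting works, so the following is only a simplified majority-vote sketch and not the Celera Assembler's actual procedure. It assumes gap-free alignments supplied as hypothetical `(offset, sequence)` pairs, where `sequence[0]` lines up with `read[offset]`, and it changes a base only when at least `min_votes` overlapping reads agree on a different base and none supports the original.

```python
from collections import Counter


def correct_by_vote(read, aligned_reads, min_votes=2):
    """Correct likely sequencing errors in read by majority vote.

    aligned_reads: list of (offset, sequence); offset may be negative when
    the overlapping read starts before the read being corrected.
    """
    corrected = []
    for pos, base in enumerate(read):
        votes = Counter()
        for offset, seq in aligned_reads:
            idx = pos - offset  # position of this column within seq
            if 0 <= idx < len(seq):
                votes[seq[idx]] += 1
        if votes:
            winner, count = votes.most_common(1)[0]
            # Replace only when the overlapping reads outvote the base
            # unanimously and in sufficient number.
            if winner != base and count >= min_votes and votes[base] == 0:
                base = winner
        corrected.append(base)
    return "".join(corrected)


if __name__ == "__main__":
    noisy = "ACGTTCGA"  # position 3 disagrees with every overlapping read
    overlapping = [(0, "ACGATC"), (2, "GATCGA"), (-1, "TACGAT")]
    print(correct_by_vote(noisy, overlapping))
```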

16

Project Expectations

Fall 2007

Produce Celera Assembly

Spring 2008

Produce Reconciled Assembly

General Goals

Tackle the unexpected problems that accompany genome assembly

Document my work

Validate my work wherever possible

17

Validation

Genome assemblies are not perfect

I plan to validate my assembly by comparing it to the current draft

I expect about 1.5% difference between the Celera Assembly and the Broad Assembly

I will use Mummer to measure similarity between genomes

18

Mummer

Mummer is a piece of software created by CBCB that is used to compare genomes

Mummer locates strings of at least 18bp that are present in each genome

Plotting the results makes it easy to see insertions, deletions, inversions, etc.

Graphs courtesy of Adam Phillippy
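
The idea of locating shared strings of at least 18bp can be illustrated with a naive sketch. This is not Mummer's algorithm or interface: it uses a plain k-mer lookup, reports every maximal exact match rather than only unique ones, and ignores the reverse strand. The function name is hypothetical.

```python
from collections import defaultdict


def exact_matches(ref, qry, min_len=18):
    """Find maximal exact matches of at least min_len between two sequences.

    Returns a list of (ref_start, qry_start, length) tuples.
    """
    # Index every min_len-mer of the reference by its start position.
    index = defaultdict(list)
    for i in range(len(ref) - min_len + 1):
        index[ref[i:i + min_len]].append(i)

    matches = []
    for j in range(len(qry) - min_len + 1):
        for i in index.get(qry[j:j + min_len], ()):
            # Skip seeds that extend to the left: they belong to a longer
            # match that is reported from its leftmost seed.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            length = min_len
            while (i + length < len(ref) and j + length < len(qry)
                   and ref[i + length] == qry[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


if __name__ == "__main__":
    # Small example with min_len=4; the shared string is "GACCATG".
    print(exact_matches("TTGACCATGGAA", "CCGACCATGCTT", min_len=4))
```

Plotting each match as a diagonal segment from (ref_start, qry_start) gives a dot plot in which insertions and deletions appear as shifted diagonals.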

19

Implementation Details

I plan to use the Genome cluster at University of Maryland to produce my assembly

Much of my project will utilize existing software

I intend to use Perl to write any additional scripts that may be needed

20

Time Permitting

The University of Maryland has recently produced a lot of software for the genome assembly pipeline, much of which has not been tested on large genomes

I hope to use programs like the UMD overlapper and Figaro to see how these programs affect my assembly

Mihai Pop

James White

21

Acknowledgements

James Yorke, Aleksey Zimin, and the Genome Group for advising me on the nature of this project

Steven Salzberg, Art Delcher, and Adam Phillippy for giving lectures and producing slides on genome assembly topics

Gene Myers' paper on Drosophila

Serafim Batzoglou's paper on Arachne

Wikipedia
