Introduction to Sequencing Unix Data cleaning
NGS Sequence Analysis
General Process
• Simpler version of the first two bubbles of Fig. 1 in Ekblom
Sample Preparation
Sequencing
Data Cleaning
Quality Control
Contig Assembly
Scaffold Assembly
Validation and QC
Draft Genome
Annotation
Quality and Cleaning
Is it important?
• Depends on the assembler
• Depends on level of contamination
• Depends on depth
Zhou & Rokas, Mol. Ecol. 23, 1679-1700, 2014.
Saving results on RCAC
• What do you do when you have used up your space?
o Save files that you are not actively using to Fortress. Every RCAC user has access to Fortress, a petabyte archive
– files should be large (preferably at least a Gbyte); use tar to archive whole directories
– read about htar and hsi in the RCAC documentation to learn how to do this
– Good practice is to archive your starting data before you make ANY changes to it
Mapping
Removing Contaminants
• Contaminant can be removed by mapping reads to the sequence of the contaminating sequence and keeping unmapped reads
• Many read mappers: BWA, Bowtie2, HISAT2, BBMap, to name a few
• Mapped reads frequently need to be manipulated – samtools
• converting from SAM to BAM is slow, and SAM takes lots of disk space, but Bowtie2 writes SAM output
• solution: use Unix pipes to send the Bowtie2 output straight to samtools
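The "keep the unmapped reads" step can be sketched in Python. This is only an illustration, assuming plain-text SAM records as input; the function names and the toy records are hypothetical. In practice `samtools view -f 12` makes the same selection (read unmapped and mate unmapped).

```python
# SAM FLAG bits, as defined in the SAM format specification
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def pair_is_unmapped(sam_line):
    """Return True if neither this read nor its mate aligned."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & READ_UNMAPPED) and bool(flag & MATE_UNMAPPED)


def keep_uncontaminated(sam_lines):
    """Yield records for pairs that did not map to the contaminant."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        if pair_is_unmapped(line):
            yield line


# Toy records (hypothetical read names, reference name and positions)
records = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",           # both mates unmapped
    "r2\t99\tmt\t100\t42\t4M\t=\t200\t104\tACGT\tIIII",  # properly paired
    "r3\t73\tmt\t100\t42\t4M\t=\t100\t0\tACGT\tIIII",    # only the mate is unmapped
]
kept = [line.split("\t")[0] for line in keep_uncontaminated(records)]
print(kept)
```

Requiring both mates to be unmapped is the conservative choice: a pair with one mitochondrial mate is treated as contaminant.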
Mapping
Using pipes
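#!/bin/sh -l
#PBS -N bowtie_monascus_mt
#PBS -q scholar
#PBS -l nodes=1:ppn=16
#PBS -l walltime=120:00:00

module load samtools
module load bowtie2

cd $PBS_O_WORKDIR

# Map the paired reads to the mitochondrial index and pipe the SAM output
# straight into samtools, so no intermediate SAM file is written to disk.
# Note: "samtools sort - <prefix>" is the samtools 0.1.x form; samtools 1.x
# expects "samtools sort -o mitochondrial_raw.sorted.bam -" instead.
bowtie2 --very-sensitive-local -a --maxins 700 --phred33 -p 16 -x mitochondria \
    -1 ../raw/Monpu1.genome.rawReads.r1.fq \
    -2 ../raw/Monpu1.genome.rawReads.r2.fq \
    | samtools view -uS - \
    | samtools sort - mitochondrial_raw.sorted

samtools index mitochondrial_raw.sorted.bam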
Mapping
Bowtie output
• Monascus vs several fungal mt genomes
• Concordantly = correct orientation, correct spacing
74991761 reads; of these:
  74991761 (100.00%) were paired; of these:
    72690295 (96.93%) aligned concordantly 0 times
    81653 (0.11%) aligned concordantly exactly 1 time
    2219813 (2.96%) aligned concordantly >1 times
    ----
    72690295 pairs aligned concordantly 0 times; of these:
      4104 (0.01%) aligned discordantly 1 time
    ----
    72686191 pairs aligned 0 times concordantly or discordantly; of these:
      145372382 mates make up the pairs; of these:
        145003825 (99.75%) aligned 0 times
        55302 (0.04%) aligned exactly 1 time
        313255 (0.22%) aligned >1 times
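A quick check of the arithmetic in this summary, as a Python sketch. The counts are copied from the output above; the "overall" rate is computed here as the fraction of mates that aligned at least once, which is an assumption about how one would summarize the run and not a line taken from the slide.

```python
# Counts copied from the Bowtie2 summary above
pairs = 74991761
concordant_once = 81653
concordant_multi = 2219813
mates_unaligned = 145003825  # individual mates (not pairs) that aligned 0 times

total_mates = 2 * pairs
aligned_mates = total_mates - mates_unaligned

# Percentage of mates that aligned somewhere, in any configuration
overall_rate = 100.0 * aligned_mates / total_mates
# Percentage of pairs that aligned concordantly (once or more than once)
concordant_rate = 100.0 * (concordant_once + concordant_multi) / pairs

print(f"mates aligned at least once: {overall_rate:.2f}%")
print(f"pairs aligned concordantly:  {concordant_rate:.2f}%")
```

Only a few percent of the reads map to the mitochondrial references, so removing them costs very little nuclear coverage.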
Genome Assembly
Whole Genome Shotgun (WGS) Assembly Problems
• Genome is large, many eukaryotes have 10^9 bases
• Reads are short, 100-150 bases
• Genomes contain repetitive regions
○ Centromeres, telomeres, satellite (heterochromatin)
○ transposons
○ homopolymer repeats / microsatellites
• Genomes contain duplicated segments
• Many genomes are diploid, chromosomes may vary in sequence and/or structure
• Cells contain organelles with their own DNA (mitochondria, chloroplast)
• Cells contain parasites
• Laboratory contamination is always present
Genome Assembly
Repeats - Repetitive sequences are common
[Figure: repeat classes along a chromosome; labels: short repeats, LINEs, RT, DT, SINEs, telomeres, centromeres]
Genome Assembly
Shotgun Assembly
• Previous method – Overlap-Layout-Consensus
○ Align all sequence reads to find how they overlap
– Human genome: 3x10^9 bases x 20X coverage / 100 base reads = 600 M reads
– Many pairwise alignments are required: ~1.8x10^17 comparisons
– memory (assume 2 byte integers): ~3.6x10^17 bytes, i.e. hundreds of millions of Gbytes
○ Figure out the best layout (hard)
○ Generate consensus (easy)
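The numbers in the Overlap-Layout-Consensus estimate can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch using the slide's own assumptions (one stored 2-byte value per read pair); the variable names are illustrative.

```python
genome_size = 3_000_000_000   # bases, human genome
coverage = 20                 # 20X
read_length = 100             # bases per read
bytes_per_score = 2           # "assume 2 byte integers"

n_reads = genome_size * coverage // read_length
n_pairs = n_reads * (n_reads - 1) // 2   # all-against-all read comparisons
memory_bytes = n_pairs * bytes_per_score

print(f"reads:       {n_reads:.2e}")
print(f"comparisons: {n_pairs:.2e}")
print(f"memory:      {memory_bytes / 1e9:.2e} GB")
```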
• De Bruijn Graph method
○ Break reads up into kmers – subsequences of length k
– How many kmers in a genome: Genome size – k + 1
○ Overlap kmers (Burrows-Wheeler transform)
○ Construct De Bruijn graph(s)
○ Find path through graph(s) → contig
○ Use paired-end and mate-pair sequences to create scaffolds
○ Fill remaining gaps (gap closing)
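Breaking a sequence into kmers is a one-liner. A minimal Python sketch, using the 29-base example sequence from the De Bruijn graph slides of this lecture (the function name `kmers` is just for illustration):

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"   # 29 bases
four_mers = kmers(sequence, 4)

# Number of kmers = sequence length - k + 1
print(len(sequence), len(four_mers))
print(sorted(set(four_mers)))
```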
Genome Assembly
De Bruijn based assemblers
• Among others: Velvet, ABySS, ALLPATHS, SOAPdenovo, Minia, Meraculous, SPAdes
Fig 3. Flicek & Birney, 2009
Genome Assembly
Velvet
• One of the first De Bruijn assemblers
• Pruning
o tips – a chain of nodes disconnected on one end
– caused by sequencing errors OR coverage gaps
– errors tend to be short (rule: trim if < 2 kmer)
– errors tend to have low multiplicity at junction
o bubbles – paths that leave and return
– caused by sequence variation (SNPs)
– length/multiplicity rule: shorter, higher multiplicity paths are preferred
o Erroneous connections
– duplicate sequences + errors
– errors will have low coverage, so will areas with low coverage
Genome Assembly
De Bruijn Graph
• Perfect case
○ No repeats
○ with fragmentation
• Overlap reads to get consensus
○ But, aligning fragments is too time consuming
○ Use kmer-based approach instead
Sequence (29 bases): AATGCGCTACGTAGGGTAATATAAGACCA
24 random fragments, 6 base words (6mers):
AATGCG AATGCG ATGCGC TGCGCT TGCGCT GCGCTA GCGCTA GCTACG GCTACG GCTACG CTACGT TACGTA
CGTAGG GTAGGG TAGGGT TAGGGT GGTAAT ATATAA TATAAG ATAAGA ATAAGA ATAAGA TAAGAC AGACCA
Reality
• Reads 100-250 bases
• Coverage typically >50
Genome Assembly
De Bruijn Graph
• Perfect case
• Easy because
○ complete coverage
○ every kmer is unique
• Looking up overlapping kmers is efficient using the Burrows-Wheeler transform
Sequence (29 bases): AATGCGCTACGTAGGGTAATATAAGACCA
26 4mers (k=4), one starting at each position:
AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA GTAG TAGG AGGG
GGGT GGTA GTAA TAAT AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA
kmers list:
AAGA AATA AATG ACCA ACGT AGAC AGGG ATAA ATAT ATGC CGCT CGTA CTAC
GACC GCGC GCTA GGGT GGTA GTAA GTAG TAAG TAAT TACG TAGG TATA TGCG
Reality
• Fragments usually 100 bases
• k = 25 to 100
Genome Assembly
De Bruijn Graph
• Perfect case
• Overlap kmers to make De Bruijn graph
[Figure: the 29 base sequence AATGCGCTACGTAGGGTAATATAAGACCA, its 26 4mers (k=4), and the kmers list]
De Bruijn graph (nodes as laid out in the figure):
AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA
GTAA TAAT AATA ATAA TAAG AAGA AGAC GACC ACCA
GTAG TAGG AGGG GGGT GGTA
ATAT TATA
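The graph on this slide can be built directly from the kmers list. A minimal Python sketch, assuming (as the figure appears to) that each distinct 4mer is a node and that an arrow joins two kmers when the last 3 bases of one equal the first 3 bases of the other; `build_graph` is a hypothetical helper name.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(kmer_set):
    """Map each kmer to the kmers that overlap it by k-1 bases."""
    by_prefix = defaultdict(list)
    for kmer in kmer_set:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], []))
            for kmer in sorted(kmer_set)}


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
graph = build_graph(set(kmers(sequence, 4)))

for node, successors in graph.items():
    print(node, "->", " ".join(successors) or "(end)")
```

Nodes with more than one successor (for example CGTA) are where the graph branches. Because edges come from overlap alone, a few of them join kmers that are not adjacent in the original sequence.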
Genome Assembly
De Bruijn graph – collapse unbranched paths
Before collapsing:
AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA
GTAA TAAT AATA ATAA TAAG AAGA AGAC GACC ACCA
GTAG TAGG AGGG GGGT GGTA
ATAT TATA
After collapsing unbranched paths:
AATGCGCTACGTA
GTAA TAAT AATA ATAA TAAGACCA
GTAGGGTA ATAT TATA
Euler 1736
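Collapsing unbranched paths can be sketched as follows. This is an illustrative Python version, not the method of any particular assembler: a chain is extended while the current node has exactly one successor and that successor has exactly one predecessor. Isolated cycles with no branch point are ignored for simplicity. The helper names are hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(kmer_set):
    """Map each kmer to the kmers that overlap it by k-1 bases."""
    by_prefix = defaultdict(list)
    for kmer in kmer_set:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], []))
            for kmer in sorted(kmer_set)}


def collapse(graph):
    """Merge chains of nodes without branching into single sequences."""
    preds = defaultdict(list)
    for node, successors in graph.items():
        for nxt in successors:
            preds[nxt].append(node)

    def continues_chain(node):
        # One predecessor, and that predecessor has no other successor
        return len(preds[node]) == 1 and len(graph[preds[node][0]]) == 1

    contigs = []
    for node in graph:
        if continues_chain(node):
            continue  # interior of a chain; reached from the chain's start
        contig, current = node, node
        while len(graph[current]) == 1 and continues_chain(graph[current][0]):
            current = graph[current][0]
            contig += current[-1]   # each step adds one new base
        contigs.append(contig)
    return contigs


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
contigs = collapse(build_graph(set(kmers(sequence, 4))))
print(sorted(contigs, key=len, reverse=True))
```

With this rule ATAT and TATA merge into ATATA, whereas the figure draws them as two separate boxes; the long unbranched stretches match the figure.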
Genome Assembly
De Bruijn Graph
• Perfect case
• Compress nodes in De Bruijn graph
• Eulerian path
○ Visit each box only once
○ Only works with unique kmers
○ In real life, one must visit nodes more than once due to repeats
[Figure: the 29 base sequence AATGCGCTACGTAGGGTAATATAAGACCA tiled into its 26 4mers (k=4)]
AATGCGCTACGTA
GTAA TAAT AATA ATAA TAAGACCA
GTAGGGTA ATAT TATA
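An Eulerian path can be found with Hierholzer's algorithm. The Python sketch below uses the common textbook formulation, in which each kmer is an edge from its (k-1)-base prefix to its (k-1)-base suffix; this differs from the one-box-per-kmer drawing on the slide and is an assumption of the example. When (k-1)mers repeat, more than one Eulerian path may exist, so the result uses every kmer exactly once but is not guaranteed to equal the original sequence.

```python
from collections import defaultdict


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def eulerian_sequence(kmer_list):
    """Spell a sequence that uses every kmer in kmer_list exactly once."""
    graph = defaultdict(list)     # (k-1)mer -> list of successor (k-1)mers
    balance = defaultdict(int)    # out-degree minus in-degree
    for kmer in kmer_list:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # An Eulerian path starts at the node with one more exit than entry
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))

    # Hierholzer's algorithm, iterative form
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    return path[0] + "".join(node[-1] for node in path[1:])


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
assembled = eulerian_sequence(kmers(sequence, 4))
print(assembled)
print(sorted(kmers(assembled, 4)) == sorted(kmers(sequence, 4)))
```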
Genome Assembly
De Bruijn Graph
• Perfect case
• Eulerian path
○ Visit each box only once
○ Prune improbable paths
– Sequence depth – low coverage paths are likely to be errors
– Paired ends give information about path
[Figure: the 29 base sequence AATGCGCTACGTAGGGTAATATAAGACCA tiled into its 26 4mers (k=4)]
AATGCGCTACGTA
GTAA TAAT AATA ATAA TAAGACCA
GTAGGGTA ATAT TATA
pruned paths
Genome Assembly
De Bruijn Graph
• Perfect case
• Eulerian path
○ Prune graph
[Figure: the 29 base sequence AATGCGCTACGTAGGGTAATATAAGACCA tiled into its 26 4mers (k=4)]
AATGCGCTACGTA GTAA TAAT AATA ATAA TAAGACCA GTAGGGTA ATAT TATA
rearranged from previous slide
AATGCGCTACGTA GTAA TAAT AATA ATAA TAAGACCA GTAGGGTA ATAT TATA
improbable paths deleted
Genome Assembly
De Bruijn Graph
• Perfect case
• Eulerian path
○ Visit each box only once
○ Only works with unique kmers
[Figure: the 29 base sequence AATGCGCTACGTAGGGTAATATAAGACCA tiled into its 26 4mers (k=4)]
Path through the collapsed graph:
AATGCGCTACGTA → GTAGGGTA → GTAA → TAAT → AATA → ATAT → TATA → ATAA → TAAGACCA
AATGCGCTACGTAGGGTAATATAAGACCA Reconstructed consensus
AATGCGCTACGTAGGGTAATATAAGACCA Original
Collapsed graph: AATGCGCTACGTA GTAA TAAT AATA ATAA TAAGACCA GTAGGGTA ATAT TATA
Genome Assembly
De Bruijn Graph
• Perfect case
○ with random fragmentation
Sequence (29 bases): AATGCGCTACGTAGGGTAATATAAGACCA
23 random fragments, 6 base reads:
TAGGGT AATGCG CTACGT GCGCTA GTAGGG TGCGCT GCTACG GCTACG TACGTA ATAAGA TGCGCT CGTAGG
AGACCA ATGCGC GCTACG AATGCG ATAAGA TAAGAC ATAAGA TATAAG GCGCTA ATATAA TAGGGT
kmers list:
AAGA AATA AATG ACCA ACGT AGAC AGGG ATAA ATAT ATGC CGCT CGTA CTAC
GACC GCGC GCTA GGGT GGTA GTAA GTAG TAAG TAAT TACG TAGG TATA TGCG
Reality
• Reads 100-250 bases (not 6)
• Coverage typically >50
• kmer typically 25 - 100
Genome Assembly
De Bruijn Graph
• Perfect case
○ with fragmentation
○ Exactly the same 4mers, therefore exactly the same De Bruijn graph
○ What if GGTAAT is missing?
Sequence (29 bases): AATGCGCTACGTAGGGTAATATAAGACCA
24 random fragments, 6 base reads:
AATGCG AATGCG ATGCGC TGCGCT TGCGCT GCGCTA GCGCTA GCTACG GCTACG GCTACG CTACGT TACGTA
CGTAGG GTAGGG TAGGGT TAGGGT GGTAAT ATATAA TATAAG ATAAGA ATAAGA ATAAGA TAAGAC AGACCA
kmers list:
AAGA AATA AATG ACCA ACGT AGAC AGGG ATAA ATAT ATGC CGCT CGTA CTAC
GACC GCGC GCTA GGGT GGTA GTAA GTAG TAAG TAAT TACG TAGG TATA TGCG
Genome Assembly
De Bruijn Graph
• Perfect case
○ with fragmentation
• Fragments may not completely overlap
Sequence (29 bases): AATGCGCTACGTAGGGTAATATAAGACCA
24 random fragments, 6 base reads (GGTAAT is absent):
AATGCG AATGCG ATGCGC TGCGCT TGCGCT GCGCTA GCGCTA GCTACG GCTACG GCTACG CTACGT TACGTA
CGTAGG GTAGGG TAGGGT TAGGGT TAGGGT ATATAA TATAAG ATAAGA ATAAGA ATAAGA TAAGAC AGACCA
kmers list:
AAGA AATA AATG ACCA ACGT AGAC AGGG ATAA ATAT ATGC CGCT CGTA CTAC
GACC GCGC GCTA GGGT GGTA GTAA GTAG TAAG TAAT TACG TAGG TATA TGCG
Contig consensus (two contigs; no read bridges the gap):
AATGCGCTACGTAGGGT    ATATAAGACCA
Genome Assembly
De Bruijn Graph
• Perfect data
○ With small repeats
○ Repeats cause cycles in the De Bruijn graph
Sequence: AATGCCGTACGTACGTAAATATAAGACCA
Reads:
AATGCC AATGCC ATGCCG TGCCGT CGTACG GTACGT TACGTA CGTAAA GTACGT
TACGTA TAAATA AAATAT ATAAGA ATAAGA TAAGAC AAGACC AGACCA
kmers list:
AAAT AAGA AATA AATG ACCA ACGT AGAC ATAA ATAT ATGC CCGT
CGTA GACC GCCG GTAA GTAC TAAA TAAG TACG TATA TGCC
De Bruijn Graph (nodes as laid out in the figure):
AATG ATGC TGCC GCCG CCGT CGTA GTAC TACG ACGT
GTAA TAAA AAAT
AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA
De Bruijn Graph
• Perfect data
○ With small repeats
[Figure: sequence AATGCCGTACGTACGTAAATATAAGACCA, its 6 base reads, and the kmers list, as on the previous slide]
Compressed De Bruijn Graph:
AATGCCGT CGTA GTACGT
GTAA TAAAT
AATA ATAT TATA ATAA TAAGACCA
Genome Assembly
De Bruijn Graph
• Perfect data
○ With small repeats
[Figure: sequence AATGCCGTACGTACGTAAATATAAGACCA, its 6 base reads, and the kmers list, as on the previous slides]
Pruning De Bruijn Graph:
AATGCCGT CGTA GTACGT
AATA ATAT TATA ATAA TAAGACCA
GTAA TAAAT
De Bruijn graph assembly with repeats
• Repeats cause expansions/contractions if repeat length ≥ kmer
AATGCCGT CGTA GTACGT
ATAT TATA ATAAGACCA GTAAATA
AATGCCGT CGTA GTACGT
CGTA GTAAATA
ATAT TATA ATAT TATA ATAAGACCA
AATGCCGTACGTA....AATATATAAGACCA Assembled
AATGCCGTACGTACGTAAATATA...AGACCA Original
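Counting kmers shows why the repeat is a problem: the kmers inside it occur more than once, so the graph cannot tell how many times to go round the cycle. A short Python sketch using the example sequence from these slides:

```python
from collections import Counter


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequence = "AATGCCGTACGTACGTAAATATAAGACCA"
counts = Counter(kmers(sequence, 4))

# kmers seen more than once come from the repeated region
repeated = {kmer: n for kmer, n in counts.items() if n > 1}

print(len(counts), "distinct 4mers out of", sum(counts.values()))
print(repeated)
```

The same idea is used in practice: kmers (and contigs) with unusually high depth are flagged as likely repeats.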
Genome Assembly
De Bruijn Graph
• Perfect data
○ With sequence errors
Sequence: AATGCCGTACGTACGTAAATATAAGACCA
Reads (some contain errors):
AATGCC AATGCG TTGCCG TGCCGT CGTACG GTACGT TACGTA CGTAAA GTACGT
TACGTA TAATTA AAATAT TTAAGA ATAAGA TAAGAC AAGACC AGTCCA
kmers list:
AAAT AAGA AATA AATG AATT ACCA ACGT AGAC AGTC ATAA ATAT ATGC ATTA CCGT
CGTA GACC GCCG GTAA GTAC TAAA TAAG TAAT TACG TATA TGCC TGCG TTAA TTGC
De Bruijn Graph (nodes as laid out in the figure):
AATG ATGC TGCC GCCG CCGT CGTA GTAC TACG ACGT
GTAA TAAA AAAT
AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA
Nodes that come only from read errors: AGTC TAAT TGCG TTAA TTGC AATT ATTA
TGCCGT
Genome Assembly
De Bruijn Graph
• Perfect data
○ With sequence errors
[Figure: sequence AATGCCGTACGTACGTAAATATAAGACCA, reads containing errors, and the kmers list, as on the previous slide]
De Bruijn Graph, partly compressed (nodes as laid out in the figure):
AATG ATGC CGTA GTACGT
GTAA TAAA AAAT
AATA ATAT TATA ATAA TAAGACCA
Nodes that come only from read errors: AGTC TAAT TGCG TTAA TTGC AATT ATTA
TGCC
TGCCGT
Genome Assembly
De Bruijn Graph
• Perfect data
○ With sequence errors
○ Sequence errors create extra tips and bubbles
[Figure: sequence AATGCCGTACGTACGTAAATATAAGACCA, reads containing errors, and the kmers list, as on the previous slides]
De Bruijn Graph, compressed (nodes as laid out in the figure):
AATG ATGC CGTA GTACGT
GTAA TAAAT
AATAT TATA ATAA TAAGACCA
Nodes that come only from read errors: AGTC TAAT TGCG TTAA TTGC AATTA
TGCC
Figure annotations: tips, bubble
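A single wrong base in a read can create up to k kmers that do not exist in the genome, which is where the extra tips and bubbles come from. The Python sketch below illustrates this with a made-up read: the 12-base read and the position of the error are hypothetical, chosen only for the demonstration.

```python
def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequence = "AATGCCGTACGTACGTAAATATAAGACCA"
k = 4
genome_kmers = set(kmers(sequence, k))

# A hypothetical error-free read taken from the sequence
correct_read = sequence[5:17]            # CGTACGTACGTA

# Introduce one substitution (G -> C) in the middle of the read
error_pos = 5
bad_read = correct_read[:error_pos] + "C" + correct_read[error_pos + 1:]

# kmers from the bad read that are absent from the genome
novel = [kmer for kmer in kmers(bad_read, k) if kmer not in genome_kmers]

print(bad_read)
print(novel)
```

Each erroneous kmer is normally seen only once or a few times, while true kmers are seen at roughly the sequencing depth; that difference in multiplicity is what pruning relies on.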
Genome Assembly
• Practical Issues
o many methods use a kmer approach, what k should you use?
– typically k ranges from 25 to over 100
– large k gives more unique matches
– large k misses more overlaps due to errors/SNPs
o Sensitivity to presence of adapters?
o Sensitivity to genomic repeats?
Genome Assembly
What kmer should you use?
• Short kmer, e.g., 25 base
o not very affected by errors
o may have random/incorrect matches
o cannot distinguish repeats and duplications
• Long kmer, e.g., 70-100
o Significant possibility of error causing missed overlap
o Kmer may be cut-off by end of read
o few or no random matches, more specific
• Generalizations
o small kmer better for low coverage and small genomes
o large kmer better for repetitive sequences and large genomes
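The trade-off can be put into rough numbers. The Python sketch below assumes independent errors at a fixed per-base rate, which is a simplification; the 1% error rate and 100-base read length are hypothetical values chosen for illustration.

```python
def error_free_fraction(k, error_rate):
    """Probability that a kmer contains no sequencing error."""
    return (1.0 - error_rate) ** k


def kmers_per_read(read_length, k):
    """Number of kmers one read contributes (0 if the read is shorter than k)."""
    return max(read_length - k + 1, 0)


error_rate = 0.01     # hypothetical per-base error rate
read_length = 100     # hypothetical read length

for k in (25, 51, 75, 99):
    print(k,
          kmers_per_read(read_length, k),
          round(error_free_fraction(k, error_rate), 3))
```

Longer kmers are more specific, but each read yields fewer of them and a larger share of them contain an error, so long kmers need higher coverage.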
Genome Assembly
Genome size from kmer distribution
• total kmers = 197.4 M
• "good" peak at multiplicity ~180-500
o 22.99 M good kmers = estimated genome size
o average coverage 389.9
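The estimate can be sketched in Python from a kmer histogram. This assumes the "good kmers" on the slide are the distinct kmers whose multiplicity falls inside the main peak, and that the genome size is approximately that count. The histogram below is toy data, not the data behind the slide, and `summarize_histogram` is a hypothetical helper.

```python
def summarize_histogram(histogram, low, high):
    """Estimate genome size and coverage from a kmer histogram.

    histogram maps multiplicity -> number of distinct kmers seen that often.
    Only kmers with low <= multiplicity <= high are treated as "good".
    """
    good = {m: n for m, n in histogram.items() if low <= m <= high}
    distinct = sum(good.values())                     # ~ genome size
    occurrences = sum(m * n for m, n in good.items())
    coverage = occurrences / distinct if distinct else 0.0
    return distinct, coverage


# Toy histogram: many low-multiplicity (error) kmers plus a peak near 10
toy_histogram = {1: 1000, 2: 200, 9: 50, 10: 100, 11: 50}

size, coverage = summarize_histogram(toy_histogram, low=5, high=20)
print(size, coverage)
```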
Genome Assembly
What kmer should you use?
• Options:
o Run with many k, and choose the best
o Run with many k, and merge together
o Use a program to predict best kmer
– VelvetK – uses number of reads and genome size to estimate
– KmerGenie – calculates based on kmer distribution
   sample to get kmer distribution
   fit to gaussians
   choose k with largest number of non-noise kmers
– Jellyfish
– KAT
– khmer
[Figure: Human chr 14, GAGE benchmark]
Genome Assembly
Scaffolding and gap filling/closing
• Scaffold – contigs with defined order and spacing, but with sequence gaps
o uses paired end and mate pair information (no mate pairs for Monascus)
o most assemblers include a scaffolder, standalone scaffolders include
– SSPACE (easy to use)
– Bambus 2
– OPERA
• Scaffolding is error prone
Genome Assembly
Scaffolding and gap filling/closing
• Gap Filling/closing
o Some assemblers include a gap filler
– SOAPdenovo, ALLPATHS-LG
o Standalone
– GapCloser (from SOAPdenovo)
– PAGIT (ABACAS and IMAGE)
– GapFiller
– FinIS
Genome Assembly
Problems with De Bruijn assembly
• Sequence is fragmented due to lack of overlap
○ Get high coverage NGS (50X +)
• Read errors cause bubbles (false edges and nodes in De Bruijn graph)
○ hard to distinguish errors from natural variation (heterozygosity)
○ kmer coverage distinguishes errors from correct sequence and can identify and correct random sequencing errors
○ Differences in pruning strategy are the biggest difference in methods
• Kmer issues
○ too short – many bubbles and false overlaps
○ too long – overlaps missed due to sequence errors
○ kmers should be long enough to be unique in coding regions
• Repeats cause misassemblies
○ Repeats have higher coverage (depth)
Genome Assembly
Repeats
• If sequence reads are shorter than repeats, you cannot assemble past the repeats
• Repeats are the single biggest cause of errors in assembly
What happens to these repeats when you overlap?
Repeats overlap each other and assemble together. Unique sequence is left in a separate contig
Genome Assembly
Repeats
• Repeats result in systematic errors in assembly
○ compression
○ expansion
[Figure labels: pairs too close (red and blue); blue pairs in wrong orientation]
Genome Assembly
Shotgun Assembly
• Scaffolding
○ Start with highest quality contigs with unique coverage
– kmer count tells you depth; unique contigs have lower depth than repeats
○ Use mate pairs with 1000 - 9000 base separation
– sizing errors make estimate of gap less accurate
– chimeras could be a problem
○ Mate pairs allow neighboring contigs and direction to be established
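How a spanning pair gives a gap estimate can be sketched as simple arithmetic. This Python example is illustrative only: it assumes both contigs are already oriented the same way, that read 1 lies on the left contig and read 2 on the right contig, and that the library insert size is known. All coordinates and the function name are hypothetical.

```python
def estimate_gap(insert_size, contig_a_length, read1_start, read2_end):
    """Estimate the gap between contig A (left) and contig B (right).

    read1_start: 0-based start of read 1 on contig A
    read2_end:   0-based exclusive end of read 2 on contig B
    """
    covered_on_a = contig_a_length - read1_start   # read 1 start to end of A
    covered_on_b = read2_end                       # start of B to read 2 end
    return insert_size - covered_on_a - covered_on_b


# Three hypothetical mate pairs from a 3000-base library spanning one gap
pairs = [(4200, 700), (4500, 1100), (3900, 350)]
gaps = [estimate_gap(3000, 5000, start, end) for start, end in pairs]

print(gaps)
print(sum(gaps) / len(gaps))
```

The estimates differ from pair to pair because real insert sizes vary around the nominal value, which is why sizing errors make the gap estimate less accurate; averaging over many pairs helps.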
Genome Assembly
Repeats
• Must have additional information to place distant reads with respect to each other
• paired end reads
○ hundreds of bases between R1 and R2
• mate-pair reads
○ thousands of bases between R1 / R2
• long clones such as fosmids
○ not used much today
Mate Pairs: One common technique
• isolate long fragment, e.g., 5 kb
• circularize (including tag at junction)
• fragment
• isolate desired size fragment
• attach adapters and sequence
[Figure: Read 1 and Read 2 on the circularized fragment; the cloning strategy reverses paired-end orientation]
Long-Insert Mate-Pairs
Genome Assembly
Scaffolding and gap filling/closing
• Scaffold – contigs with defined order and spacing, but with sequence gaps
○ uses paired end and mate pair information
○ most assemblers include a scaffolder
• Gap closing: sequentially add more reads, building from the ends of gaps
[Figure labels: no reads OR repeats]
Genome Assembly
Populus
• Science. 2006 Sep 15;313(5793):1596-604
• 485 Mb (cytogenetic estimate 550 Mb)
• 2447 scaffolds
• 95% of genome
• 45,500 "genes"
• 19 Linkage groups
• Evidence for two whole genome duplications
Genome Assembly
Populus
• Clone and sequence statistics
Insert size  Vector   Number reads  Number reads    Number bases     Number bases after  % bases  % of
(Kb)                  (x10^-6)      used (x10^-6)   Qual > 20 (Mb)   trimming (Mb)       used     total
2.0 - 4.0    plasmid  4.45          2.75            2.76             1.73                62.7     56.4
4.5 - 7.5    plasmid  2.58          1.62            1.78             1.04                58.4     33.4
38 - 41      fosmid   0.65          0.43            0.41             0.30                73.1      9.8
Total                 7.69          4.80            4.95             3.07                62.0
Genome Assembly
• Assembly of Linkage Group II. 1 Mb spans are colored in alternating black and white strips.
• The innermost track (black) shows the fingerprint map clone coverage. Each circle represents 5X coverage.
• The next outer track (red) shows the coverage provided by singletons.
• The next track shows anchored contigs, coded with an alternating color scheme.
• The final inside track shows the sequence position of individual clones in each contig, colored by map contig assignment.
• The first outer track shows the sequence position of clones that lack map contig assignment.
• The second outer track shows the coverage provided by the singletons.
Genome Assembly
Populus
• How do you know your assembly is accurate?
• Need external knowledge
• Mapping of scaffolds to chromosomes using microsatellites
Genome Assembly
Populus
• How do you know your assembly is accurate?
• Need external knowledge
• Mapping BACs to chromosomes using FISH
Genome Assembly
Junk DNA
• Garbage you throw away
• Junk you keep (but may not have an immediate need or use for)
• Junk or garbage?
51.8 % repeats