Introduction to Sequencing Unix Data cleaning

Jan 18, 2022


dariahiddleston
Page 1: Introduction to Sequencing Unix Data cleaning

Introduction to Sequencing, Unix, Data cleaning

Page 2: Introduction to Sequencing Unix Data cleaning

NGS Sequence Analysis

General Process

• Simpler version of the first two bubbles of fig 1 in Ekblom

Sample Preparation → Sequencing → Data Cleaning → Quality Control → Contig Assembly → Scaffold Assembly → Validation and QC → Draft Genome → Annotation

Page 3: Introduction to Sequencing Unix Data cleaning

Sequencing Basics

Illumina Sequencing

Page 4: Introduction to Sequencing Unix Data cleaning

Quality and Cleaning

Is it important?

• Depends on the assembler

• Depends on level of contamination

• Depends on depth

Zhou & Rokas, Mol. Ecol. 23, 1679-1700, 2014.

Page 5: Introduction to Sequencing Unix Data cleaning

Saving results on RCAC

• What do you do when you have used up your space?

o Save files that you are not actively using to Fortress. Every RCAC user has access to Fortress, a petabyte-scale archive

– files should be large (preferably at least a gigabyte); use tar to archive whole directories

– read about htar and hsi in the RCAC documentation to learn how to do this

– Good practice is to archive your starting data before you make ANY changes to it

Page 6: Introduction to Sequencing Unix Data cleaning

Mapping

Removing Contaminants

• Contaminants can be removed by mapping reads to the contaminant's sequence and keeping the unmapped reads

• Many read mappers: BWA, Bowtie2, HISAT2, BBMap, to name a few

• Mapped reads frequently need to be manipulated – samtools

• Converting from SAM to BAM is slow, and SAM takes lots of disk space, but Bowtie2 writes SAM output

• Solution: use Unix pipes to send Bowtie2 output directly to samtools
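The "keep the unmapped reads" step can be sketched in a few lines of Python. This is only an illustration: it assumes plain-text SAM input and uses the FLAG bits from the SAM specification (0x4 = read unmapped, 0x8 = mate unmapped). The function name and the three example pairs are invented for the sketch, not taken from the course data.

```python
# SAM FLAG bits defined by the SAM specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8

def unmapped_pair_names(sam_lines):
    """Return names of read pairs in which neither mate aligned."""
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & READ_UNMAPPED and flag & MATE_UNMAPPED:
            names.add(name)
    return names

# Hypothetical records: names, positions and sequences are made up.
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "pair1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",          # both mates unmapped (first)
    "pair1\t141\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",         # both mates unmapped (second)
    "pair2\t99\tmt1\t10\t42\t4M\t=\t60\t54\tACGT\tIIII",   # properly paired
    "pair2\t147\tmt1\t60\t42\t4M\t=\t10\t-54\tTTGA\tIIII",
    "pair3\t73\tmt1\t5\t30\t4M\t=\t5\t0\tACGT\tIIII",      # mapped, mate unmapped
    "pair3\t133\tmt1\t5\t0\t*\t=\t5\t0\tTTGA\tIIII",       # unmapped, mate mapped
]
print(sorted(unmapped_pair_names(example_sam)))
```

Requiring both bits is the conservative choice: a pair is kept as "clean" only if neither mate hit the contaminant. With samtools the same selection is usually expressed as a flag filter (`-f 12`).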

Page 7: Introduction to Sequencing Unix Data cleaning

Mapping

Practical Genomics - Course Introduction 7

Using pipes

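#!/bin/sh -l
#PBS -N bowtie_monascus_mt
#PBS -q scholar
#PBS -l nodes=1:ppn=16
#PBS -l walltime=120:00:00

module load samtools
module load bowtie2

# run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# map the paired reads, convert SAM to uncompressed BAM and sort, all through pipes
# (older samtools syntax: "sort - PREFIX" writes PREFIX.bam;
#  samtools 1.3 and later expect: samtools sort -o mitochondrial_raw.sorted.bam -)
bowtie2 --very-sensitive-local -a --maxins 700 --phred33 -p 16 -x mitochondria \
    -1 ../raw/Monpu1.genome.rawReads.r1.fq \
    -2 ../raw/Monpu1.genome.rawReads.r2.fq \
    | samtools view -uS - \
    | samtools sort - mitochondrial_raw.sorted

samtools index mitochondrial_raw.sorted.bam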

Page 8: Introduction to Sequencing Unix Data cleaning

Mapping

Bowtie output

• Monascus vs several fungal mt genomes

• Concordantly = correct orientation, correct spacing

74991761 reads; of these:
  74991761 (100.00%) were paired; of these:
    72690295 (96.93%) aligned concordantly 0 times
    81653 (0.11%) aligned concordantly exactly 1 time
    2219813 (2.96%) aligned concordantly >1 times
    ----
    72690295 pairs aligned concordantly 0 times; of these:
      4104 (0.01%) aligned discordantly 1 time
    ----
    72686191 pairs aligned 0 times concordantly or discordantly; of these:
      145372382 mates make up the pairs; of these:
        145003825 (99.75%) aligned 0 times
        55302 (0.04%) aligned exactly 1 time
        313255 (0.22%) aligned >1 times
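The counts above can be combined into a single overall alignment rate. The sketch below assumes the usual paired-end bookkeeping: every aligned pair contributes two reads, and mates that aligned on their own contribute one each. The function name is illustrative; the numbers are the ones on the slide.

```python
def overall_alignment(pairs, conc_once, conc_multi, disc_once, mates_once, mates_multi):
    """Return (aligned_reads, total_reads) for a paired-end Bowtie2 summary."""
    aligned_pairs = conc_once + conc_multi + disc_once  # both mates aligned
    aligned_reads = 2 * aligned_pairs + mates_once + mates_multi
    return aligned_reads, 2 * pairs

aligned, total = overall_alignment(
    pairs=74991761, conc_once=81653, conc_multi=2219813,
    disc_once=4104, mates_once=55302, mates_multi=313255,
)
print(f"{aligned} of {total} reads aligned ({100 * aligned / total:.2f}%)")
```

For a contaminant screen a low rate is what you want: about 3% of the reads look mitochondrial and the rest are kept.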

Page 9: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Whole Genome Shotgun (WGS) Assembly Problems

• Genome is large, many eukaryotes have 10^9 bases

• Reads are short, 100-150 bases

• Genomes contain repetitive regions

○ Centromeres, telomeres, satellite (heterochromatin)

○ transposons

○ homopolymer repeats / microsatellites

• Genomes contain duplicated segments

• Many genomes are diploid, chromosomes may vary in sequence and/or structure

• Cells contain organelles with their own DNA (mitochondria, chloroplast)

• Cells contain parasites

• Laboratory contamination is always present

Page 10: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Human Genome

• Eukaryotic genomes are often highly repetitive

Repeats 64%

Page 11: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Repeats - Repetitive sequences are common

short repeats

LINEs RT DT SINEs

Telomeres

Centromeres

Page 12: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Shotgun Assembly

• Previous method – Overlap-Layout-Consensus

○ Align all sequence reads to find how they overlap

– Human genome: 3×10^9 bases × 20X coverage / 100 base reads = 600 M reads

– Many pairwise alignments are required: about 1.8×10^17 comparisons (n(n-1)/2 for n = 600 M reads)

– memory (assume 2 byte integers) about 3.6×10^17 bytes, on the order of a billion Gb

○ Figure out the best layout (hard)

○ Generate consensus (easy)

• De Bruijn Graph method

○ Break reads up into kmers – subsequences of length k

– How many kmers in a genome: Genome size – k + 1

○ Overlap kmers (Burrows-Wheeler transform)

○ Construct De Bruijn graph(s)

○ Find path through graph(s) → contig

○ Use paired-end and mate-pair sequences to create scaffolds

○ Fill remaining gaps (gap closing)
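The back-of-the-envelope numbers for Overlap-Layout-Consensus, and the kmer count formula, can be checked with a short calculation. This is only a sketch; the function names are made up, and the 2-byte entry size is the slide's own assumption.

```python
def olc_cost(genome_size, coverage, read_length, bytes_per_entry=2):
    """Reads needed, all-against-all comparisons, and bytes to store one entry per comparison."""
    n_reads = genome_size * coverage // read_length
    comparisons = n_reads * (n_reads - 1) // 2  # every unordered pair of reads
    return n_reads, comparisons, comparisons * bytes_per_entry

def kmer_count(sequence_length, k):
    """Number of kmers (overlapping windows of length k) in a sequence."""
    return max(sequence_length - k + 1, 0)

n_reads, comparisons, memory_bytes = olc_cost(3_000_000_000, 20, 100)
print(f"{n_reads:.1e} reads, {comparisons:.1e} comparisons, {memory_bytes:.1e} bytes")
print(kmer_count(29, 4))  # the 29 base toy sequence used on the later slides
```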

Page 13: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn based assemblers

• Among others: Velvet, ABySS, ALLPATHS, SOAPdenovo, Minia, Meraculous, SPAdes

Fig 3. Flicek & Birney, 2009

Page 14: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Velvet

• One of the first De Bruijn assemblers

• Pruning

o tips – a chain of nodes disconnected on one end

– caused by sequencing errors OR coverage gaps

– errors tend to be short (rule: trim if shorter than 2 kmer lengths)

– errors tend to have low multiplicity at the junction

o bubbles – paths that leave and return

– caused by sequence variation (SNPs)

– length/multiplicity rule: shorter, higher multiplicity paths are preferred

o Erroneous connections

– duplicate sequences + errors

– errors will have low coverage, but so will areas with low coverage

Page 15: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

○ No repeats

○ with fragmentation

• Overlap reads to get consensus

○ But aligning fragments is too time consuming

○ Use kmer-based approach instead

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

24 random fragments (6 base words = 6mers): AATGCG ×2, ATGCGC, TGCGCT ×2, GCGCTA ×2, GCTACG ×3, CTACGT, TACGTA, CGTAGG, GTAGGG, TAGGGT ×2, GGTAAT, ATATAA, TATAAG, ATAAGA ×3, TAAGAC, AGACCA

Reality: reads 100-250 bases, coverage typically >50

Page 16: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

• Easy because

○ complete coverage

○ every kmer is unique

• Looking up overlapping kmers is efficient using the Burrows-Wheeler transform

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

26 4mers (k=4)

kmers list (sorted): AAGA AATA AATG ACCA ACGT AGAC AGGG ATAA ATAT ATGC CGCT CGTA CTAC GACC GCGC GCTA GGGT GGTA GTAA GTAG TAAG TAAT TACG TAGG TATA TGCG

Reality: fragments usually 100 bases, k = 25-100
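As a quick check of the slide's numbers, the kmers of the toy sequence can be listed with a sliding window. A minimal sketch follows; the helper name `kmers` is ours.

```python
def kmers(seq, k):
    """All overlapping windows of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
four_mers = kmers(sequence, 4)

print(len(sequence), "bases,", len(four_mers), "4mers,", len(set(four_mers)), "distinct")
print(" ".join(sorted(four_mers)))
```

All 26 4mers are distinct here, which is what makes this the "perfect case". Real genomes repeat kmers, as the later slides show.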

Page 17: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

• Overlap kmers to make De Bruijn graph

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

26 4mers (k=4); kmers list as on the previous slide

De Bruijn graph nodes, listed in sequence order:

AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA

GTAG TAGG AGGG GGGT GGTA

GTAA TAAT AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA
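One way to picture "overlap kmers to make a De Bruijn graph" is the sketch below, which follows the slide's drawing: each 4mer is a node, and an edge joins two 4mers when the last 3 bases of one equal the first 3 bases of the next. The function names are illustrative. Many assemblers instead store kmers as edges between (k-1)mers, which is an equivalent view.

```python
from collections import defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_graph(kmer_set):
    """Map each kmer to the kmers that overlap it by k-1 bases on the right."""
    by_prefix = defaultdict(list)
    for km in sorted(kmer_set):
        by_prefix[km[:-1]].append(km)
    return {km: list(by_prefix.get(km[1:], [])) for km in sorted(kmer_set)}

sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
graph = build_graph(set(kmers(sequence, 4)))

# Nodes with more than one successor are where the path through the graph is ambiguous.
branching = {km: nxt for km, nxt in graph.items() if len(nxt) > 1}
for km, nxt in branching.items():
    print(km, "->", ", ".join(nxt))
```

Even though every 4mer is unique, some 3 base overlaps (AAT, ATA, GTA, TAA) occur twice, so the graph branches. Those branches are what the following slides collapse around and prune.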

Page 18: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn graph – collapse unbranched paths

Before: the 26 4mer nodes of the previous slide

After collapsing:

AATGCGCTACGTA

GTAA TAAT AATA ATAA TAAGACCA

GTAGGGTA ATAT TATA

Euler, 1736
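Collapsing unbranched paths can be sketched as follows. It assumes the node-per-kmer graph drawn on the slides and uses a strict rule: two neighbouring nodes are merged when the first has exactly one successor and the second has exactly one predecessor. Function names are ours, and isolated cycles are ignored to keep the sketch short.

```python
from collections import defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def collapse_unbranched(kmer_set):
    """Merge chains of kmer nodes into longer nodes."""
    by_prefix, by_suffix = defaultdict(list), defaultdict(list)
    for km in kmer_set:
        by_prefix[km[:-1]].append(km)
        by_suffix[km[1:]].append(km)
    succ = {km: by_prefix.get(km[1:], []) for km in kmer_set}
    pred = {km: by_suffix.get(km[:-1], []) for km in kmer_set}

    def mergeable(a, b):
        # Edge a -> b is unbranched: a has one way out, b has one way in.
        return len(succ[a]) == 1 and len(pred[b]) == 1

    merged = []
    for km in sorted(kmer_set):
        if len(pred[km]) == 1 and mergeable(pred[km][0], km):
            continue  # km lies inside a chain and is emitted with the chain's start
        path = [km]
        while len(succ[path[-1]]) == 1 and mergeable(path[-1], succ[path[-1]][0]):
            path.append(succ[path[-1]][0])
        merged.append(path[0] + "".join(node[-1] for node in path[1:]))
    return merged

sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
collapsed = collapse_unbranched(set(kmers(sequence, 4)))
print(collapsed)
```

On this sequence the sketch reproduces the long nodes on the slide (AATGCGCTACGTA, GTAGGGTA, TAAGACCA). Under the strict rule it also joins ATAT and TATA into ATATA, which the slide draws as two separate boxes.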


Page 20: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

• Compress nodes in De Bruijn graph

• Eulerian path

○ Visit each box only once

○ Only works with unique kmers

○ In real life, one must visit nodes more than once due to repeats

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

26 4mers (k=4)

Compressed graph:

AATGCGCTACGTA

GTAA TAAT AATA ATAA TAAGACCA

GTAGGGTA ATAT TATA

Page 21: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

• Eulerian path

○ Visit each box only once

○ Prune improbable paths

– Sequence depth – low coverage paths are likely to be errors

– Paired ends give information about path

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

26 4mers (k=4)

Compressed graph, with pruned paths marked:

AATGCGCTACGTA

GTAA TAAT AATA ATAA TAAGACCA

GTAGGGTA ATAT TATA

Page 22: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

• Eulerian path

○ Prune graph

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

26 4mers (k=4)

Graph rearranged from previous slide: AATGCGCTACGTA GTAA TAAT AATA ATAA TAAGACCA GTAGGGTA ATAT TATA

Same graph with improbable paths deleted

Page 23: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

• Eulerian path

○ Visit each box only once

○ Only works with unique kmers

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

26 4mers (k=4)

Path through the compressed graph:

AATGCGCTACGTA → GTAGGGTA → GTAA → TAAT → AATA → ATAT → TATA → ATAA → TAAGACCA

AATGCGCTACGTAGGGTAATATAAGACCA Reconstructed consensus

AATGCGCTACGTAGGGTAATATAAGACCA Original
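The Eulerian-path idea can be sketched with the textbook formulation, in which each 4mer is an edge between its first and last 3 bases, and a path that uses every edge exactly once spells out a sequence. This uses Hierholzer's algorithm as an illustration; it is not how any particular assembler is implemented, and the names are ours.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def eulerian_reconstruction(kmer_list):
    """Spell a sequence that uses every kmer exactly once (Hierholzer's algorithm)."""
    graph = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per (k-1)mer
    for km in sorted(kmer_list):
        left, right = km[:-1], km[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    starts = [node for node, diff in balance.items() if diff == 1]
    stack = [starts[0] if starts else min(graph)]
    path = []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is final in the path
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
result = eulerian_reconstruction(kmers(sequence, 4))
print(result)
print(result == sequence)
```

The result always has the same length and exactly the same 4mers as the original, but it need not be the same sequence. Because some 3 base overlaps repeat, more than one Eulerian path exists (for example AATATAATGCGCTACGTAGGGTAAGACCA), which is why the slides prune using depth and paired-end information.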

Page 24: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

○ with random fragmentation

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

23 random fragments (6 base reads): TAGGGT, AATGCG, CTACGT, GCGCTA, GTAGGG, TGCGCT, GCTACG, GCTACG, TACGTA, ATAAGA, TGCGCT, CGTAGG, AGACCA, ATGCGC, GCTACG, AATGCG, ATAAGA, TAAGAC, ATAAGA, TATAAG, GCGCTA, ATATAA, TAGGGT

kmers list: the same 26 4mers as on the earlier slides

Reality: reads 100-250 bases (not 6), coverage typically >50, kmer typically 25-100

Page 25: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

○ with fragmentation

○ Exactly the same 4mers, therefore exactly the same De Bruijn graph

○ What if GGTAAT is missing?

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

24 random fragments (6 base reads): AATGCG ×2, ATGCGC, TGCGCT ×2, GCGCTA ×2, GCTACG ×3, CTACGT, TACGTA, CGTAGG, GTAGGG, TAGGGT ×2, GGTAAT, ATATAA, TATAAG, ATAAGA ×3, TAAGAC, AGACCA

kmers list: the same 26 4mers as on the earlier slides
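The question "What if GGTAAT is missing?" can be explored directly by comparing the 4mers available with and without that read. The sketch below uses the 24 fragments as read off the slide; the variable and function names are ours.

```python
def kmer_set(reads, k):
    """Distinct kmers found in a collection of reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

reads = [
    "AATGCG", "AATGCG", "ATGCGC", "TGCGCT", "TGCGCT", "GCGCTA", "GCGCTA",
    "GCTACG", "GCTACG", "GCTACG", "CTACGT", "TACGTA", "CGTAGG", "GTAGGG",
    "TAGGGT", "TAGGGT", "GGTAAT", "ATATAA", "TATAAG", "ATAAGA", "ATAAGA",
    "ATAAGA", "TAAGAC", "AGACCA",
]

all_kmers = kmer_set(reads, 4)
remaining = kmer_set([r for r in reads if r != "GGTAAT"], 4)
lost = all_kmers - remaining
print(sorted(lost))
```

GGTAAT is the only read covering GGTA, GTAA and TAAT, so dropping it removes three nodes in a row and nothing connects the two sides of the graph any more.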

Page 26: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect case

○ with fragmentation

• Fragments may not completely overlap

29 base sequence: AATGCGCTACGTAGGGTAATATAAGACCA

24 random fragments (6 base reads), GGTAAT now missing: AATGCG ×2, ATGCGC, TGCGCT ×2, GCGCTA ×2, GCTACG ×3, CTACGT, TACGTA, CGTAGG, GTAGGG, TAGGGT ×3, ATATAA, TATAAG, ATAAGA ×3, TAAGAC, AGACCA

kmers list: the same 26 4mers as on the earlier slides

Contig consensus: two contigs, because no read spans the gap

AATGCGCTACGTAGGGT

ATATAAGACCA

Page 27: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect data

○ With small repeats

○ Repeats cause cycles in the De Bruijn graph

Sequence: AATGCCGTACGTACGTAAATATAAGACCA

Reads: AATGCC, CGTACG, ATAAGA, AATGCC, GTACGT, ATAAGA, ATGCCG, TACGTA, TAAGAC, TGCCGT, CGTAAA, AAGACC, GTACGT, TAAATA, AGACCA, TACGTA, AAATAT

kmers list: AAAT AAGA AATA AATG ACCA ACGT AGAC ATAA ATAT ATGC CCGT CGTA GACC GCCG GTAA GTAC TAAA TAAG TACG TATA TGCC

De Bruijn graph nodes:

AATG ATGC TGCC GCCG CCGT CGTA GTAC TACG ACGT (cycle: CGTA → GTAC → TACG → ACGT → CGTA)

GTAA TAAA AAAT AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA
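Counting how often each 4mer occurs makes the repeat visible. This is a minimal sketch using only the toy sequence on the slide; the helper name is ours.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Multiplicity of every kmer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

sequence = "AATGCCGTACGTACGTAAATATAAGACCA"
counts = kmer_counts(sequence, 4)
repeated = {km: n for km, n in counts.items() if n > 1}

print(len(counts), "distinct 4mers out of", sum(counts.values()))
print(repeated)
```

The four repeated 4mers are exactly the nodes of the cycle. The graph records that the cycle exists but not how many times to go around it, which is the source of the expansions and contractions discussed on a later slide.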

Page 28: Introduction to Sequencing Unix Data cleaning

De Bruijn Graph

• Perfect data

○ With small repeats

Sequence: AATGCCGTACGTACGTAAATATAAGACCA

Reads and kmers list: as on the previous slide

Compressed De Bruijn graph:

AATGCCGT CGTA GTACGT

GTAA TAAAT

AATA ATAT TATA ATAA TAAGACCA

Page 29: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect data

○ With small repeats

Sequence: AATGCCGTACGTACGTAAATATAAGACCA

Reads and kmers list: as on the previous slides

Pruning De Bruijn graph:

AATGCCGT CGTA GTACGT

GTAA TAAAT

AATA ATAT TATA ATAA TAAGACCA

Page 30: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn graph assembly with repeats

• Repeats cause expansions/contractions if repeat length ≥ kmer

Pruned graph:

AATGCCGT CGTA GTACGT

GTAAATA

ATAT TATA ATAAGACCA

Paths through the graph:

AATGCCGT CGTA GTACGT

CGTA GTAAATA

ATAT TATA ATAT TATA ATAAGACCA

AATGCCGTACGTA....AATATATAAGACCA Assembled

AATGCCGTACGTACGTAAATATA...AGACCA Original

Page 31: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect data

○ With sequence errors

Sequence: AATGCCGTACGTACGTAAATATAAGACCA

Reads (with errors): AATGCC, CGTACG, TTAAGA, AATGCG, GTACGT, ATAAGA, TTGCCG, TACGTA, TAAGAC, TGCCGT, CGTAAA, AAGACC, GTACGT, TAATTA, AGTCCA, TACGTA, AAATAT

kmers list: AAAT AAGA AATA AATG AATT ACCA ACGT AGAC AGTC ATAA ATAT ATGC ATTA CCGT CGTA GACC GCCG GTAA GTAC TAAA TAAG TAAT TACG TATA TGCC TGCG TTAA TTGC

De Bruijn graph nodes:

AATG ATGC TGCC GCCG CCGT CGTA GTAC TACG ACGT

GTAA TAAA AAAT AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA

Extra nodes created by the errors: AGTC TAAT TGCG TTAA TTGC AATT ATTA

Page 32: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect data

○ With sequence errors

Sequence, reads and kmers list: as on the previous slide

Compressed De Bruijn graph nodes:

AATG ATGC TGCC TGCCGT CGTA GTACGT

GTAA TAAA AAAT

AATA ATAT TATA ATAA TAAGACCA

Extra nodes created by the errors: AGTC TAAT TGCG TTAA TTGC AATT ATTA

Page 33: Introduction to Sequencing Unix Data cleaning

Genome Assembly

De Bruijn Graph

• Perfect data

○ With sequence errors

○ Sequence errors create extra tips and bubbles

Sequence, reads and kmers list: as on the previous slides

Compressed De Bruijn graph nodes:

AATG ATGC TGCC TGCCGT CGTA GTACGT

GTAA TAAAT

AATAT TATA ATAA TAAGACCA

Nodes created by the errors, labelled "tips" and "bubble" in the figure: AGTC TAAT TGCG TTAA TTGC AATTA
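To see how few errors it takes to clutter the graph, the sketch below lists the 4mers that occur in the error-containing reads but nowhere in the true sequence. Comparing against the known sequence is only possible in a toy example; a real assembler does not know the genome and has to rely on kmer coverage instead. The names are ours.

```python
def kmer_set(seqs, k):
    """Distinct kmers found in a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

genome = "AATGCCGTACGTACGTAAATATAAGACCA"
reads_with_errors = [
    "AATGCC", "CGTACG", "TTAAGA", "AATGCG", "GTACGT", "ATAAGA",
    "TTGCCG", "TACGTA", "TAAGAC", "TGCCGT", "CGTAAA", "AAGACC",
    "GTACGT", "TAATTA", "AGTCCA", "TACGTA", "AAATAT",
]

novel = kmer_set(reads_with_errors, 4) - kmer_set([genome], 4)
print(len(novel), "error kmers:", " ".join(sorted(novel)))
```

Five single-base errors produce nine 4mers that do not belong to the sequence. Each shows up as an extra node, which is where the tips and bubbles come from.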

Page 34: Introduction to Sequencing Unix Data cleaning

Genome Assembly

• Practical Issues

o Many methods use a kmer approach; what k should you use?

– typically k ranges from 25 to over 100

– large k gives more unique matches

– large k misses more overlaps due to errors/SNPs

o Sensitivity to presence of adapters?

o Sensitivity to genomic repeats?

Page 35: Introduction to Sequencing Unix Data cleaning

Genome Assembly

What kmer should you use?

• Short kmer, e.g., 25 bases

o not very affected by errors

o may have random/incorrect matches

o cannot distinguish repeats and duplications

• Long kmer, e.g., 70-100

o Significant possibility of error causing missed overlap

o Kmer may be cut off by end of read

o few or no random matches, more specific

• Generalizations

o small kmer better for low coverage and small genomes

o large kmer better for repetitive sequences and large genomes
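The trade-off can be put into rough numbers. The sketch below assumes 100 base reads and independent errors at 1% per base; both values are illustrative assumptions, not figures from the slides.

```python
def kmers_per_read(read_length, k):
    """How many kmers a single read contributes."""
    return max(read_length - k + 1, 0)

def p_error_free(k, per_base_error):
    """Probability that a kmer contains no sequencing error (independent errors assumed)."""
    return (1 - per_base_error) ** k

READ_LENGTH = 100      # assumed read length
PER_BASE_ERROR = 0.01  # assumed error rate

for k in (25, 70, 100):
    print(k, kmers_per_read(READ_LENGTH, k), round(p_error_free(k, PER_BASE_ERROR), 3))
```

Under these assumptions a 25mer is error free about 78% of the time and each read yields 76 of them. A 100mer is error free only about 37% of the time and each read yields exactly one, which matches the "missed overlap" and "cut off by end of read" points above.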

Page 36: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Genome size from kmer distribution

• total kmers = 197.4 M

• "good" peak ~180-500

o 22.99 M good kmers = estimated genome size

o average coverage 389.9
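One reading of this slide is that the genome size estimate is the number of distinct kmers whose multiplicity falls inside the "good" peak, and the coverage is the mean multiplicity of those kmers. The sketch below works under that assumption. The histogram is a made-up toy, not the course data, and the function name is ours.

```python
def genome_size_from_histogram(histogram, low, high):
    """histogram maps multiplicity -> number of distinct kmers seen that many times.

    Returns (distinct kmers in the good peak, their mean multiplicity).
    """
    good = {mult: n for mult, n in histogram.items() if low <= mult <= high}
    distinct = sum(good.values())
    occurrences = sum(mult * n for mult, n in good.items())
    return distinct, occurrences / distinct

# Hypothetical histogram: many low-multiplicity (error) kmers, a peak near 390,
# and a few high-multiplicity (repeat) kmers.
toy_histogram = {1: 5000, 2: 800, 180: 100, 390: 700, 500: 200, 900: 10}

size, coverage = genome_size_from_histogram(toy_histogram, 180, 500)
print(size, coverage)
```

Kmers below the peak are mostly sequencing errors and kmers above it are mostly repeats, so both are left out of the estimate. This is why repeat-rich genomes tend to be underestimated by this approach.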

Page 37: Introduction to Sequencing Unix Data cleaning

Genome Assembly

What kmer should you use?

• Options:

o Run with many k, and choose the best

o Run with many k, and merge together

o Use a program to predict best kmer

– VelvetK – uses number of reads and genome size to estimate

– KmerGenie – calculates based on kmer distribution

sample to get kmer distribution

fit to Gaussians

choose k with largest number of non-noise kmers

– jellyfish

– kat

– khmer

[Figure: Human chr 14, GAGE benchmark]

Page 38: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Scaffolding and gap filling/closing

• Scaffold – contigs with defined order and spacing, but with sequence gaps

o uses paired end and mate pair information (no mate pairs for Monascus)

o most assemblers include a scaffolder; standalone scaffolders include

– SSPACE (easy to use)

– Bambus 2

– Opera

• Scaffolding is error prone

Page 39: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Scaffolding and gap filling/closing

• Gap Filling/closing

o Some assemblers include a gap filler

– SOAPdenovo, Allpaths-LG

o Standalone

– GapCloser (from SOAPdenovo)

– PAGIT (Abacas and Image)

– GapFiller

– FinIS

Page 40: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Problems with De Bruijn assembly

• Sequence is fragmented due to lack of overlap

○ Get high coverage NGS (50X +)

• Read errors cause bubbles (false edges and nodes in De Bruijn graph)

○ hard to distinguish errors from natural variation (heterozygosity)

○ kmer coverage distinguishes errors from correct sequence and can identify and correct random sequencing errors

○ Differences in pruning strategy are the biggest difference in methods

• Kmer issues

○ too short – many bubbles and false overlaps

○ too long – overlaps missed due to sequence errors

○ kmers should be long enough to be unique in coding regions

• Repeats cause misassemblies

○ Repeats have higher coverage (depth)

Page 41: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Repeats

• If sequence reads are shorter than repeats, you cannot assemble past the repeats

• Repeats are the single biggest cause of errors in assembly

What happens to these repeats when you overlap

Repeats overlap each other and assemble together. Unique sequence is left in a separate contig

Page 42: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Repeats

• Repeats result in systematic errors in assembly

○ compression

○ expansion

[Figure labels: pairs too close (red and blue); blue pairs in wrong orientation]

Page 43: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Shotgun Assembly

• Scaffolding

○ Start with highest quality contigs with unique coverage

– kmer count tells you depth, unique have lower depth than repeat

○ Use mate-pairs - 1000 - 9000 base separation

– sizing errors make estimate of gap less accurate

– chimeras could be a problem

○ Mate pairs allow neighboring contigs and direction to be established

Page 44: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Repeats

• Must have additional information to place distant reads with respect to each other

• paired end reads

○ hundreds of bases between R1 and R2

• mate-pair reads

○ thousands of bases between R1 / R2

• long clones such as fosmids

○ not used much today

Long-Insert Mate-Pairs

Mate Pairs: One common technique

• isolate long fragment, e.g., 5 kb

• circularize (including tag at junction)

• fragment

• isolate desired size fragment

• attach adapters and sequence

• cloning strategy reverses paired-end orientation

[Figure labels: Read 1, Read 2]

Page 45: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Scaffolding and gap filling/closing

• Scaffold – contigs with defined order and spacing, but with sequence gaps

○ uses paired end and mate pair information

○ most assemblers include a scaffolder

• Gap closing: sequentially add more reads, building from the ends of gaps

[Figure labels: No reads OR Repeats]

Page 46: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Populus

• Science. 2006 Sep 15;313(5793):1596-604

• 485 Mb (cytogenetic estimate 550 Mb)

• 2447 scaffolds

• 95% of genome

• 45,500 "genes"

• 19 Linkage groups

• Evidence for two whole genome duplications

Page 47: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Populus

• Clone and sequence statistics

Insert size (kb)  Vector   Reads (×10^-6)  Reads used (×10^-6)  Bases Qual > 20 (Mb)  Bases after trimming (Mb)  % bases used  % of total
2.0 - 4.0         plasmid  4.45            2.75                 2.76                  1.73                       62.7          56.4
4.5 - 7.5         plasmid  2.58            1.62                 1.78                  1.04                       58.4          33.4
38 - 41           fosmid   0.65            0.43                 0.41                  0.30                       73.1          9.8
Total                      7.69            4.80                 4.95                  3.07                       62.0

Page 48: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Populus

• Not all reads can be incorporated into assembly

○ Contaminants

○ No hits

Page 49: Introduction to Sequencing Unix Data cleaning

Genome Assembly

• Assembly of Linkage Group II. 1 Mb spans are colored in alternating black and white strips.

• The innermost track (black) shows the fingerprint map clone coverage. Each circle represents 5X coverage.

• The next outer track (red) shows the coverage provided by singletons.

• The next track shows anchored contigs, coded with an alternating color scheme.

• The final inside track shows the sequence position of individual clones in each contig, colored by map contig assignment.

• The first outer track shows the sequence position of clones that lack map contig assignment.

• The second outer track shows the coverage provided by the singletons.

Page 50: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Populus

• How do you know your assembly is accurate?

• Need external knowledge

• Mapping of scaffolds to chromosomes using microsatellites

Page 51: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Populus

• How do you know your assembly is accurate?

• Need external knowledge

• Mapping BACs to chromosomes using FISH

Page 52: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Junk DNA

• Garbage you throw away

• Junk you keep (but may not have an immediate need or use for)

• Junk or garbage?

51.8 % repeats

Page 53: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Human Genome

Repeats 64%

Page 54: Introduction to Sequencing Unix Data cleaning

Genome Assembly

Populus

• Transposable elements

○ Kinds and numbers vary with the species