Introduction to Sequencing Unix Data cleaning

NGS Sequence Analysis

General Process

• Simpler version of the first two bubbles of Fig. 1 in Ekblom

Sample Preparation

Sequencing

Data Cleaning

Quality Control

Contig Assembly

Scaffold Assembly

Validation and QC

Draft Genome

Annotation

Sequencing Basics

Illumina Sequencing

Quality and Cleaning

Is it important?

• Depends on the assembler

• Depends on level of contamination

• Depends on depth

Zhou & Rokas, Molec. Ecol. 23, 1679-1700, 2014.

Saving results on RCAC

• What do you do when you have used up your space?

o Save files that you are not actively using to Fortress. Every RCAC user has access to Fortress, a petabyte archive

– files should be large (preferably at least a gigabyte); use tar to archive whole directories

– read about htar and hsi in the RCAC documentation to learn how to do this

– good practice is to archive your starting data before you make ANY changes to it
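
The advice to bundle a whole directory into one large archive can be sketched in Python with the standard tarfile module. This is only an illustration of the idea: on RCAC the documented tools are htar and hsi, and the function name and paths below are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_directory(src_dir, archive_path):
    """Pack a whole directory into one gzip-compressed tar file.

    Returns the member names stored in the archive, so the result can be
    checked before any original files are removed.
    """
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(src, arcname=src.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

A call such as archive_directory("raw", "raw.tar.gz") (hypothetical names) would produce one file suitable for transfer to the archive.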

Mapping

Removing Contaminants

• Contaminants can be removed by mapping reads to the contaminant's sequence and keeping the unmapped reads

• Many read mappers: BWA, Bowtie2, HISAT2, BBMap, to name a few

• Mapped reads frequently need to be manipulated – samtools

• Converting from SAM to BAM as a separate step is slow, and SAM takes lots of disk space, but Bowtie2 writes SAM output

• Solution: use unix pipes to send the Bowtie2 output directly to samtools

Mapping

Using pipes

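#!/bin/sh -l
#PBS -N bowtie_monascus_mt
#PBS -q scholar
#PBS -l nodes=1:ppn=16
#PBS -l walltime=120:00:00

module load samtools
module load bowtie2

cd "$PBS_O_WORKDIR"

# Map the paired reads against the mitochondrial index and pipe the SAM output
# straight into samtools, so no intermediate SAM file is written to disk.
bowtie2 --very-sensitive-local -a --maxins 700 --phred33 -p 16 -x mitochondria \
    -1 ../raw/Monpu1.genome.rawReads.r1.fq \
    -2 ../raw/Monpu1.genome.rawReads.r2.fq \
    | samtools view -uS - \
    | samtools sort - mitochondrial_raw.sorted
# Note: "samtools sort - <prefix>" is the legacy syntax. samtools 1.3 and later
# expect "samtools sort -o mitochondrial_raw.sorted.bam -" instead.

# Index the sorted BAM file
samtools index mitochondrial_raw.sorted.bam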

Mapping

Bowtie output

• Monascus vs several fungal mt genomes

• Concordantly = correct orientation, correct spacing

74991761 reads; of these:
  74991761 (100.00%) were paired; of these:
    72690295 (96.93%) aligned concordantly 0 times
    81653 (0.11%) aligned concordantly exactly 1 time
    2219813 (2.96%) aligned concordantly >1 times
    ----
    72690295 pairs aligned concordantly 0 times; of these:
      4104 (0.01%) aligned discordantly 1 time
    ----
    72686191 pairs aligned 0 times concordantly or discordantly; of these:
      145372382 mates make up the pairs; of these:
        145003825 (99.75%) aligned 0 times
        55302 (0.04%) aligned exactly 1 time
        313255 (0.22%) aligned >1 times
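
The counts in this report can be combined into a single fraction of mates that aligned at all. The sketch below uses only the numbers shown above; the formula (each aligned pair contributes two mates, plus mates that aligned singly) is an assumption about how the categories nest, and the variable names are made up for the illustration.

```python
# Counts copied from the Bowtie2 report above
pairs = 74_991_761
concordant_once = 81_653
concordant_multi = 2_219_813
discordant = 4_104
mates_once = 55_302
mates_multi = 313_255

# Every aligned pair contributes two aligned mates; the last two categories
# count individual mates from pairs that did not align as a pair.
aligned_mates = 2 * (concordant_once + concordant_multi + discordant) \
    + mates_once + mates_multi
total_mates = 2 * pairs
unaligned_mates = total_mates - aligned_mates
overall_rate = aligned_mates / total_mates

print(f"aligned mates:   {aligned_mates}")
print(f"unaligned mates: {unaligned_mates}")
print(f"overall rate:    {overall_rate:.2%}")
```

The unaligned count reproduces the 145003825 mates reported as aligning 0 times, which is a useful check that the categories were read correctly. Only about 3% of the mates hit a mitochondrial genome, so nearly all reads would be kept after this cleaning step.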

Genome Assembly

Whole Genome Shotgun (WGS) Assembly Problems

• Genomes are large; many eukaryotes have 10^9 bases

• Reads are short, 100-150 bases

• Genomes contain repetitive regions

○ Centromeres, telomeres, satellite (heterochromatin)

○ transposons

○ homopolymer repeats / microsatellites

• Genomes contain duplicated segments

• Many genomes are diploid, chromosomes may vary in sequence and/or structure

• Cells contain organelles with their own DNA (mitochondria, chloroplast)

• Cells contain parasites

• Laboratory contamination is always present

Genome Assembly

Human Genome

• Eukaryotic genomes are often highly repetitive

Repeats 64%

Genome Assembly

Repeats - Repetitive sequences are common

short repeats

LINEs, RT, DT, SINEs

Telomeres

Centromeres

Genome Assembly

Shotgun Assembly

• Previous method – Overlap-Layout-Consensus

○ Align all sequence reads to find how they overlap

– Human genome: 3 x 10^9 bases x 20X coverage / 100-base reads = 600 M reads

– Many pairwise alignments are required: about 1.8 x 10^17 comparisons

– memory (assume 2 byte integers) ~ a billion Gb

○ Figure out the best layout (hard)

○ Generate consensus (easy)
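
The Overlap-Layout-Consensus numbers above follow from simple arithmetic, sketched here. The all-against-all count n(n-1)/2 and the 10^9 bytes per gigabyte convention are assumptions of this sketch; the slide's "a billion Gb" is an order-of-magnitude figure.

```python
genome_size = 3 * 10**9      # bases in the human genome
coverage = 20                # 20X
read_length = 100            # bases per read
bytes_per_score = 2          # one 2-byte integer per comparison

# Number of reads needed to reach the target coverage
n_reads = genome_size * coverage // read_length

# Every read compared against every other read, each pair counted once
comparisons = n_reads * (n_reads - 1) // 2

# Memory if one 2-byte score is stored per comparison
memory_gigabytes = comparisons * bytes_per_score / 10**9

print(f"reads:       {n_reads:.2e}")
print(f"comparisons: {comparisons:.2e}")
print(f"memory (GB): {memory_gigabytes:.2e}")
```

Under these assumptions the memory comes out at hundreds of millions of gigabytes, which is why the overlap step does not scale to short-read data sets.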

• De Bruijn Graph method

○ Break reads up into kmers – subsequences of length k

– How many kmers in a genome: genome size - k + 1

○ Overlap kmers (Burrows-Wheeler transform)

○ Construct De Bruijn graph(s)

○ Find a path through the graph(s) to produce contigs

○ Use paired-end and mate-pair sequences to create scaffolds

○ Fill remaining gaps (gap closing)
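
As a minimal sketch of the first step, the function below breaks a sequence into kmers and confirms the "length - k + 1" count. The sequence is the 29-base example used in the De Bruijn graph slides; the function name is chosen for this illustration.

```python
def kmers(seq, k):
    """Return every subsequence of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"   # 29-base example sequence
four_mers = kmers(sequence, 4)

print(len(sequence))          # sequence length
print(len(four_mers))         # length - k + 1
print(len(set(four_mers)))    # number of distinct 4-mers
```

For this sequence every 4-mer is distinct, which is the "perfect case" the later slides rely on.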

Genome Assembly

De Bruijn based assemblers

• Among others: Velvet, ABySS, ALLPATHS, SOAPdenovo, Minia, Meraculous, SPAdes

Fig 3. Flicek & Birney, 2009

Genome Assembly

Velvet

• One of the first De Bruijn assemblers

• Pruning

o tips – a chain of nodes disconnected on one end

– caused by sequencing errors OR coverage gaps

– errors tend to be short (rule: trim if < 2 kmer)

– errors tend to have low multiplicity at junction

o bubbles – paths that leave and return

– caused by sequence variation (SNPs)

– length/multiplicity rule: shorter, higher multiplicity paths are preferred

o Erroneous connections

– duplicate sequences + errors

– errors will have low coverage, so will areas with low coverage

Genome Assembly

De Bruijn Graph

• Perfect case

○ No repeats

○ with fragmentation

• Overlap reads to get consensus

○ But, aligning fragments is too time consuming

○ Use kmer-based approach instead

Sequence (29 bases):
AATGCGCTACGTAGGGTAATATAAGACCA

24 random fragments, 6-base words (6-mers):
AATGCG AATGCG ATGCGC TGCGCT TGCGCT GCGCTA
GCGCTA GCTACG GCTACG GCTACG CTACGT TACGTA
CGTAGG GTAGGG TAGGGT TAGGGT GGTAAT ATATAA
TATAAG ATAAGA ATAAGA ATAAGA TAAGAC AGACCA

Reality: reads are 100-250 bases; coverage is typically >50

Genome Assembly

De Bruijn Graph

• Perfect case

• Easy because

○ complete coverage

○ every kmer is unique

• Looking up overlapping kmers is efficient using the Burrows-Wheeler transform


29-base sequence, 26 4-mers (k=4), in order of position:
AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA GTAG TAGG AGGG
GGGT GGTA GTAA TAAT AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA

kmers list (sorted):
AAGA AATA AATG ACCA ACGT AGAC AGGG ATAA ATAT ATGC CGCT CGTA CTAC
GACC GCGC GCTA GGGT GGTA GTAA GTAG TAAG TAAT TACG TAGG TATA TGCG

Reality: fragments are usually 100 bases; k = 25-100

Genome Assembly


• Overlap kmers to make De Bruijn graph
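
A minimal sketch of the overlap step, assuming the graph is drawn as on these slides: each distinct kmer is a node, and kmer a links to kmer b when the last k-1 bases of a equal the first k-1 bases of b. Many assemblers instead use (k-1)-mers as nodes and kmers as edges, and the slides mention the Burrows-Wheeler transform for lookup; a plain dictionary is swapped in here for clarity. Function names are chosen for the illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every subsequence of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def overlap_graph(kmer_list):
    """Map each distinct kmer to the kmers that overlap it by k-1 bases."""
    nodes = sorted(set(kmer_list))
    by_prefix = defaultdict(list)
    for km in nodes:
        by_prefix[km[:-1]].append(km)      # index kmers by their first k-1 bases
    # successors of km: kmers whose prefix equals the suffix of km
    return {km: list(by_prefix.get(km[1:], [])) for km in nodes}


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
graph = overlap_graph(kmers(sequence, 4))
branching = sorted(km for km, succ in graph.items() if len(succ) > 1)

print(graph["CGTA"])    # nodes that can follow CGTA
print(branching)        # nodes with more than one way forward
```

Even though every 4-mer in this sequence is unique, several nodes have two possible successors, because the shared 3-base overlaps (for example GTA) are not unique. These branch points are where the path through the graph becomes ambiguous.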


[Figure: the 26 4-mers (k=4) of the 29-base sequence and the sorted kmers list, linked by overlap into a De Bruijn graph]

De Bruijn graph nodes:
AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA
GTAA TAAT AATA ATAA TAAG AAGA AGAC GACC ACCA
GTAG TAGG AGGG GGGT GGTA
ATAT TATA

Genome Assembly

De Bruijn graph – collapse unbranched paths

Before (one node per kmer):
AATG ATGC TGCG GCGC CGCT GCTA CTAC TACG ACGT CGTA
GTAA TAAT AATA ATAA TAAG AAGA AGAC GACC ACCA
GTAG TAGG AGGG GGGT GGTA
ATAT TATA

After collapsing unbranched paths:
AATGCGCTACGTA
GTAA TAAT AATA ATAA TAAGACCA
GTAGGGTA
ATAT TATA

[Figure label: Euler 1736]
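
Collapsing unbranched paths can be sketched as follows, assuming the kmer-as-node graph drawn on the slides. Two neighboring nodes are merged only when the first has exactly one successor and the second has exactly one predecessor. This is an illustrative implementation, not the code of any particular assembler, and it ignores isolated cycles that have no branching node.

```python
from collections import defaultdict


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def overlap_graph(kmer_list):
    nodes = sorted(set(kmer_list))
    by_prefix = defaultdict(list)
    for km in nodes:
        by_prefix[km[:-1]].append(km)
    return {km: list(by_prefix.get(km[1:], [])) for km in nodes}


def collapse_unbranched(graph):
    """Merge chains of nodes that have a single way in and a single way out."""
    preds = defaultdict(list)
    for node, succs in graph.items():
        for s in succs:
            preds[s].append(node)

    def mergeable(a, b):
        # a -> b is collapsed only if it is a's sole exit and b's sole entry
        return graph[a] == [b] and preds[b] == [a]

    collapsed = []
    for node in graph:
        p = preds[node]
        if len(p) == 1 and mergeable(p[0], node):
            continue                      # interior of a chain, not a start
        label, cur = node, node
        while len(graph[cur]) == 1 and mergeable(cur, graph[cur][0]) \
                and graph[cur][0] != node:
            cur = graph[cur][0]
            label += cur[-1]              # extend the label by one base
        collapsed.append(label)
    return collapsed


sequence = "AATGCGCTACGTAGGGTAATATAAGACCA"
nodes = collapse_unbranched(overlap_graph(kmers(sequence, 4)))
print(sorted(nodes))
```

The long nodes AATGCGCTACGTA, GTAGGGTA and TAAGACCA match the slide. Under this merge rule ATAT and TATA also join into ATATA, whereas the slide draws them as two separate nodes.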

Genome Assembly


• Compress nodes in De Bruijn graph

• Eulerian path

○ Visit each box only once

○ Only works with unique kmers

○ In real life, one must visit nodes more than once due to repeats

[Figure: compressed De Bruijn graph of the 29-base sequence (26 4-mers, k=4); labeled nodes include AATGCGCTACGTA, GTAGGGTA, ATAT, TATA]

Genome Assembly


• Eulerian path

○ Prune improbable paths

– Sequence depth – low coverage paths are likely to be errors

– Paired ends give information about path

[Figure: compressed De Bruijn graph of the 29-base sequence (26 4-mers, k=4) with pruned paths marked; labeled nodes include AATGCGCTACGTA, GTAGGGTA, ATAT, TATA]

Genome Assembly


• Eulerian path

○ Prune graph

[Figure: graph of the 29-base sequence (26 4-mers, k=4), rearranged from previous slide, improbable paths deleted]

AATGCGCTACGTA GTAA TAAT AATA ATAA TAAGACCA GTAGGGTA ATAT TATA

Genome Assembly


• Eulerian path

○ Only works with unique kmers

29-base sequence, 26 4-mers (k=4)

Path through the graph:
AATGCGCTACGTA GTAGGGTA
GTAA TAAT AATA ATAT TATA ATAA TAAGACCA

AATGCGCTACGTAGGGTAATATAAGACCA   Reconstructed consensus
AATGCGCTACGTAGGGTAATATAAGACCA   Original


Genome Assembly


○ with random fragmentation

29-base sequence; 23 random fragments (6-base reads):
TAGGGT AATGCG CTACGT GCGCTA GTAGGG TGCGCT
GCTACG GCTACG TACGTA ATAAGA TGCGCT CGTAGG
AGACCA ATGCGC GCTACG AATGCG ATAAGA TAAGAC
ATAAGA TATAAG GCGCTA ATATAA TAGGGT

[Figure: kmers list derived from the fragments]

Reality: reads are 100-250 bases (not 6); coverage is typically >50; kmer is typically 25-100

Genome Assembly



○ Exactly the same 4mers, therefore exactly the same De Bruijn graph

○ What if GGTAAT is missing?

Fragments covering the end of the 29-base sequence:
TAGGGT TAGGGT
GGTAAT ATATAA
TATAAG ATAAGA ATAAGA ATAAGA

[Figure: kmers list]

Genome Assembly



• Fragments may not completely overlap

[Figure: 29-base sequence and kmers list]

Fragments:
TAGGGT TAGGGT TAGGGT
ATATAA TATAAG
ATAAGA ATAAGA ATAAGA

Contig consensus:
AATGCGCTACGTAGGGT
ATATAAGACCA

Genome Assembly

De Bruijn Graph

• Perfect data

○ With small repeats

○ Repeats cause cycles in the De Bruijn graph

Sequence:
AATGCCGTACGTACGTAAATATAAGACCA

Reads:
AATGCC CGTACG ATAAGA AATGCC GTACGT ATAAGA
ATGCCG TACGTA TAAGAC TGCCGT CGTAAA AAGACC
GTACGT TAAATA AGACCA TACGTA AAATAT

kmers list (sorted):
AAAT AAGA AATA AATG ACCA ACGT AGAC ATAA ATAT ATGC CCGT
CGTA GACC GCCG GTAA GTAC TAAA TAAG TACG TATA TGCC

De Bruijn graph nodes:
AATG ATGC TGCC GCCG CCGT CGTA GTAC TACG ACGT
GTAA TAAA AAAT
AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA

De Bruijn Graph

• Perfect data

○ With small repeats

Compressed De Bruijn graph nodes:
AATGCCGT CGTA GTACGT
AATA ATAT TATA ATAA TAAGACCA
GTAA TAAAT

Genome Assembly

De Bruijn Graph

• Perfect data

○ With small repeats

[Figure: pruning the De Bruijn graph; labeled nodes GTAA, TAAAT]

De Bruijn graph assembly with repeats

• Repeats cause expansions/contractions if repeat length ≥ kmer

Path through the compressed graph:
AATGCCGT CGTA GTACGT
CGTA GTAAATA
ATAT TATA ATAT TATA ATAAGACCA

Assembled: AATGCCGTACGTA....AATATATAAGACCA
Original:  AATGCCGTACGTACGTAAATATA...AGACCA
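
Why the repeat cannot be resolved is visible from the kmer multiplicities alone. The sketch below counts the 4-mers of the repeat-containing example sequence; kmers seen more than once are the ones that force the path to revisit a node, and the graph itself does not record how many times to go around.

```python
from collections import Counter

sequence = "AATGCCGTACGTACGTAAATATAAGACCA"   # example sequence with a small repeat
k = 4

counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
repeated = {km: n for km, n in counts.items() if n > 1}

print(len(counts))   # distinct 4-mers
print(repeated)      # 4-mers that occur more than once
```

All repeated 4-mers come from the CGTACGTACGTA stretch, which is where the assembled sequence above is contracted.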

Genome Assembly

○ With sequence errors

Sequence reads (including errors):
AATGCC CGTACG TTAAGA AATGCG GTACGT ATAAGA
TTGCCG TACGTA TAAGAC TGCCGT CGTAAA AAGACC
GTACGT TAATTA AGTCCA TACGTA AAATAT

kmers list (sorted):
AAAT AAGA AATA AATG AATT ACCA ACGT AGAC AGTC ATAA ATAT ATGC ATTA CCGT CGTA
GACC GCCG GTAA GTAC TAAA TAAG TAAT TACG TATA TGCC TGCG TTAA TTGC

De Bruijn graph nodes:
AATG ATGC TGCC GCCG CCGT CGTA GTAC TACG ACGT
AATA ATAT TATA ATAA TAAG AAGA AGAC GACC ACCA
GTAA TAAA AAAT
Extra nodes: AGTC, TAAT, TGCG, TTAA, TTGC, AATT, ATTA, TGCCGT

Genome Assembly


○ With sequence errors

[Figure: partially compressed De Bruijn graph built from the reads with errors]

Nodes:
AATG ATGC CGTA GTACGT
GTAA TAAA AAAT
AGTC TAAT TGCG TTAA TTGC
AATT ATTA
TGCC TGCCGT

Genome Assembly


○ With sequence errors

○ Sequence errors create extra tips and bubbles

[Figure: compressed De Bruijn graph with tips and a bubble marked]

Nodes:
AATG ATGC CGTA GTACGT
AATAT TATA ATAA TAAGACCA
GTAA TAAAT
AGTC TAAT TGCG TTAA TTGC AATTA TGCC

Genome Assembly

• Practical Issues

o many methods use a kmer approach; what k should you use?

– typically k ranges from 25 to over 100

– large k gives more unique matches

– large k misses more overlaps due to errors/SNPs

o Sensitivity to presence of adapters?

o Sensitivity to genomic repeats?

Genome Assembly

What kmer should you use?

• Short kmer, e.g., 25 bases

o not very affected by errors

o may have random/incorrect matches

o cannot distinguish repeats and duplications

• Long kmer, e.g., 70-100

o Significant possibility of error causing missed overlap

o Kmer may be cut-off by end of read

o few or no random matches, more specific

• Generalizations

o small kmer better for low coverage and small genomes

o large kmer better for repetitive sequences and large genomes
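
The trade-off between short and long kmers can be put in rough numbers. The sketch below assumes 100-base reads and an independent 1% per-base error rate; both values are illustrative assumptions, not figures from the slides.

```python
def kmers_per_read(read_length, k):
    """Number of kmers a single read yields (zero if the read is shorter than k)."""
    return max(read_length - k + 1, 0)


def prob_error_free(k, error_rate):
    """Chance that one kmer contains no sequencing error, assuming independent errors."""
    return (1.0 - error_rate) ** k


READ_LENGTH = 100     # assumed read length
ERROR_RATE = 0.01     # assumed per-base error rate

for k in (25, 70, 100):
    print(k, kmers_per_read(READ_LENGTH, k),
          round(prob_error_free(k, ERROR_RATE), 3))
```

With these assumptions a 25-mer is error-free a bit more than three quarters of the time and each read yields 76 of them, while a 100-mer is error-free only about a third of the time and each read yields just one. This is the sense in which long kmers miss overlaps and are cut off by the end of the read.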

Genome Assembly

Genome size from kmer distribution

• total kmers = 197.4 M

• "good" peak ~180-500

o 22.99 M good kmers = estimated genome size

o average coverage 389.9

o average coverage 389.9
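
One way to read the numbers above: the histogram counts distinct kmers at each multiplicity, the kmers inside the "good" peak are taken as genomic, and their number approximates the genome size. The sketch below implements that reading on a toy histogram. The histogram values and the peak boundaries are invented for the illustration, and the interpretation itself is an assumption.

```python
def summarize_peak(histogram, low, high):
    """Summarize the kmers whose multiplicity lies in [low, high].

    histogram maps multiplicity -> number of distinct kmers seen that often.
    Returns (distinct kmers in the peak, their mean multiplicity).
    """
    in_peak = {m: n for m, n in histogram.items() if low <= m <= high}
    distinct = sum(in_peak.values())
    mean_coverage = sum(m * n for m, n in in_peak.items()) / distinct
    return distinct, mean_coverage


# Toy histogram: many low-multiplicity (error) kmers and a peak near 50
toy_histogram = {1: 1000, 2: 300, 48: 10, 49: 20, 50: 40, 51: 20, 52: 10}
genome_size_estimate, average_coverage = summarize_peak(toy_histogram, 40, 60)

print(genome_size_estimate, average_coverage)
```

In the toy data most distinct kmers are low-multiplicity noise, just as only 22.99 M of the 197.4 M kmers on the slide fall in the good peak.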

Genome Assembly

What kmer should you use?

• Options:

o Run with many k, and choose the best

o Run with many k, and merge together

o Use a program to predict best kmer

– Velvetk – uses number of reads and genome size to estimate

– kmergenie – calculate based on kmer distribution

sample to get kmer distribution

fit to gaussians

choose k with largest number of non-noise kmers

– jellyfish

– kat

– khmer

Human chr 14

GAGE benchmark

Genome Assembly

Scaffolding and gap filling/closing

• Scaffold – contigs with defined order and spacing, but with sequence gaps

o uses paired end and mate pair information (no mate pairs for Monascus)

o most assemblers include a scaffolder; standalone scaffolders include

– sspace (easy to use)

– bambus 2

– opera

• Scaffolding is error prone

Genome Assembly

Scaffolding and gap filling/closing

• Gap Filling/closing

o Some assemblers include a gap filler

– SOAPdenovo, Allpaths-LG

o Standalone

– GapCloser (from SOAPdenovo)

– PAGIT (Abacas and Image)

– GapFiller

– FinIS

Genome Assembly

Problems with De Bruijn assembly

• Sequence is fragmented due to lack of overlap

○ Get high coverage NGS (50X +)

• Read errors cause bubbles (false edges and nodes in De Bruijn graph)

○ hard to distinguish errors from natural variation (heterozygosity)

○ kmer coverage distinguishes errors from correct sequence and can identify and correct random sequencing errors

○ Differences in pruning strategy are the biggest difference in methods

• Kmer issues

○ too short – many bubbles and false overlaps

○ too long – overlaps missed due to sequence errors

○ kmers should be long enough to be unique in coding regions

• Repeats cause misassemblies

○ Repeats have higher coverage (depth)
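
The point that kmer coverage separates errors from correct sequence can be sketched with a count threshold. The reads and the threshold below are invented for the illustration: four error-free reads plus one read carrying a single wrong base.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each kmer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_coverage_kmers(counts, min_count):
    """Kmers seen fewer than min_count times, which are likely to contain errors."""
    return sorted(km for km, n in counts.items() if n < min_count)


# Hypothetical reads; the last one has an error (AATGAG instead of AATGCG)
reads = ["AATGCG", "AATGCG", "ATGCGC", "ATGCGC", "AATGAG"]
counts = kmer_counts(reads, 4)
suspect = low_coverage_kmers(counts, 2)

print(suspect)
```

Only the kmers that overlap the wrong base fall below the threshold. A real heterozygous site would produce the same pattern at roughly half the normal depth, which is why errors and natural variation are hard to tell apart at low coverage.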

Genome Assembly

Repeats

• If sequence reads are shorter than repeats, you cannot assemble past the repeats

• Repeats are the single biggest cause of errors in assembly

What happens to these repeats when you overlap

Repeats overlap each other and assemble together. Unique sequence is left in a separate contig

Genome Assembly

Repeats

• Repeats result in systematic errors in assembly

○ compression

○ expansion

[Figure labels: pairs too close (red and blue); blue pairs in wrong orientation]

Genome Assembly

Shotgun Assembly

• Scaffolding

○ Start with highest quality contigs with unique coverage

– kmer count tells you depth; unique sequence has lower depth than repeats

○ Use mate pairs with 1000-9000 base separation

– sizing errors make estimate of gap less accurate

– chimeras could be a problem

○ Mate pairs allow neighboring contigs and direction to be established
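
How mate pairs give the spacing between two neighboring contigs can be sketched with simple geometry. The sketch assumes read 1 maps to the left contig, read 2 to the right contig, and the library insert size is known. The coordinates, the 5000-base insert size, and the use of a median over several pairs are all assumptions of this illustration, not the method of any specific scaffolder.

```python
from statistics import median


def estimate_gap(left_contig_length, insert_size, pairs):
    """Estimate the gap between two contigs from mate pairs that span it.

    pairs holds (start of read 1 in the left contig,
                 end of read 2 in the right contig), both 0-based.
    insert = (bases left in contig 1) + gap + (bases used in contig 2)
    """
    gaps = [insert_size - (left_contig_length - start1) - end2
            for start1, end2 in pairs]
    return median(gaps)


# Hypothetical data: 1000-base left contig, 5000-base mate-pair library
spanning_pairs = [(800, 300), (900, 180), (850, 260)]
gap = estimate_gap(1000, 5000, spanning_pairs)

print(gap)
```

Because each pair's true insert size varies around the nominal value, single-pair estimates scatter, which is the "sizing errors" caveat above. A median over many pairs also limits the influence of chimeric pairs.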

Genome Assembly

Repeats

• Must have additional information to place distant reads with respect to each other

• paired end reads

○ hundreds of bases between R1 and R2

• mate-pair reads

○ thousands of bases between R1 / R2

• long clones such as fosmids

○ not used much today

Mate Pairs: One common technique

• isolate long fragment, e.g., 5 kb

• circularize (including tag at junction)

• fragment

• isolate desired size fragment

• attach adapters and sequence

Read 1

Read 2

cloning strategy reverses paired-end orientation

Long-Insert Mate-Pairs

Genome Assembly

Scaffolding and gap filling/closing

• Scaffold – contigs with defined order and spacing, but with sequence gaps

○ uses paired end and mate pair information

○ most assemblers include a scaffolder

Gap closing: sequentially add more reads, building in from the ends of gaps

No reads OR repeats

Genome Assembly

Populus

• Science. 2006 Sep 15;313(5793):1596-604

• 485 Mb (cytogenetic estimate 550 Mb)

• 2447 scaffolds

• 95% of genome

• 45,500 "genes"

• 19 Linkage groups

• Evidence for two whole genome duplications

Genome Assembly

Populus

• Clone and sequence statistics

Insert Size  Vector   Reads     Reads Used  Bases Qual > 20  Bases After Trimming  % Bases  % of
(Kb)                  (x10^-6)  (x10^-6)    (Mb)             (Mb)                  Used     Total
2.0 - 4.0    plasmid  4.45      2.75        2.76             1.73                  62.7     56.4
4.5 - 7.5    plasmid  2.58      1.62        1.78             1.04                  58.4     33.4
38 - 41      fosmid   0.65      0.43        0.41             0.30                  73.1     9.8
Total                 7.69      4.80        4.95             3.07                  62.0

Genome Assembly

Populus

• Not all reads can be incorporated into assembly

○ Contaminants

○ No hits

Genome Assembly

• Assembly of Linkage Group II. 1 Mb spans are colored in alternating black and white strips.

• The innermost track (black) shows the fingerprint map clone coverage. Each circle represents 5X coverage.

• The next outer track (red) shows the coverage provided by singletons.

• The next track shows anchored contigs, coded with an alternating color scheme.

• The final inside track shows the sequence position of individual clones in each contig, colored by map contig assignment.

• The first outer track shows the sequence position of clones that lack map contig assignment.

• The second outer track shows the coverage provided by the singletons.

Genome Assembly

Populus

• How do you know your assembly is accurate?

• Need external knowledge

• Mapping of scaffolds to chromosomes using microsatellites

Genome Assembly

Populus

• How do you know your assembly is accurate?

• Need external knowledge

• Mapping BACs to chromosomes using FISH

Genome Assembly

Junk DNA

• Garbage you throw away

• Junk you keep (but may not have an immediate need or use for)

• Junk or garbage?

51.8 % repeats

Genome Assembly

Human Genome

Repeats 64%

Genome Assembly

Populus

• Transposable elements

○ Kinds and numbers vary with the species
