Approaches for in silico finishing of microbial genome sequences
Frederico Schmitt Kremer¹, Alan John Alexander McBride¹ and
Luciano da Silva Pinto¹
¹Programa de Pós-Graduação em Biotecnologia (PPGB), Centro de
Desenvolvimento Tecnológico, Universidade Federal de Pelotas,
Pelotas, Brazil.
Abstract
The introduction of next-generation sequencing (NGS) had a
significant effect on the availability of genomic information,
leading to an increase in the number of sequenced genomes from a
large spectrum of organisms. Unfortunately, due to the limitations
imposed by short-read sequencing platforms, most of these newly
sequenced genomes remained as “drafts”, incomplete representations
of the whole genetic content. Previous genome sequencing studies
indicated that finishing a genome sequenced by NGS, even a
bacterial one, may require additional sequencing to fill the gaps,
making the entire process very expensive. As such, several in
silico approaches have been developed to optimize genome
assemblies and facilitate the finishing process. The present review
aims to explore some free (open source, in many cases) tools that
are available to facilitate genome finishing.
Keywords: microbial genetics, molecular microbiology, genomics,
microbiology, draft genomes.
Received: September 25, 2016; Accepted: March 13, 2017.
Introduction
The advent of the second generation of sequencing platforms,
usually referred to as Next Generation Sequencing (NGS)
technologies, promoted a marked growth in the availability of
genomic data in public databases, mainly due to the drastic
reduction in the cost-per-base (Chain et al., 2009). Compared to
the Sanger sequencing technique (Sanger et al., 1977), NGS
platforms, like Illumina HiSeq, IonTorrent PGM, Roche 454 FLX, and
ABI SOLiD, are able to generate a significantly higher throughput,
resulting in a high sequencing coverage. However, they typically
have a lower accuracy (in terms of average Phred score in the raw
data) and, in most cases, can only generate short reads (Liu et
al., 2012). The term “short-read” is commonly used to refer to the
data generated by platforms such as Illumina, IonTorrent and SOLiD,
as the length of their reads usually ranges from 30 bp (e.g.,
SOLiD) to ~120 bp (e.g., Illumina HiSeq), which is shorter than the
length usually obtained by Sanger sequencing (~1 kb) and by the
PacBio and Oxford Nanopore platforms (commonly referred to as
“long-reads”).
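As a reminder of what the Phred score mentioned above measures, the following minimal sketch converts between a Phred quality value Q and the estimated base-calling error probability P, using the standard relation Q = -10 log10(P). The function names are illustrative, and the ASCII offset of 33 (the Sanger/Illumina 1.8+ FASTQ encoding) is an assumption; older files may use a different offset.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score implied by an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def mean_error_prob(quality_string: str, offset: int = 33) -> float:
    """Average per-base error probability of a FASTQ quality string.

    The offset of 33 (Phred+33) is assumed here, not detected.
    """
    probs = [phred_to_error_prob(ord(c) - offset) for c in quality_string]
    return sum(probs) / len(probs)
```

Under this relation, Q20 corresponds to an expected error rate of 1 in 100 bases and Q30 to 1 in 1,000. Averaging the probabilities rather than the scores avoids overstating the quality of reads that mix very good and very poor bases.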
The higher throughput obtained by NGS has stimulated the
development of new algorithms and tools that are capable of dealing
with a larger volume of data and generating genomic assemblies in
a reasonable time. Traditional sequence assemblers, like Phrap
(http://www.phrap.org/phredphrap/) and CAP3 (Huang and Madan,
1999), were replaced with new ones, such as Velvet (Zerbino and
Birney, 2008), ABySS (Simpson et al., 2009), Ray (Boisvert et al.,
2010), SPAdes (Bankevich et al., 2012) and SOAPdenovo (Luo et al.,
2012), many of which are based on the De Bruijn graph algorithm
(Pevzner et al., 2001; Compeau et al., 2011). Many of these
programs can assemble data from different sequencing platforms, an
approach called “hybrid assembly”; however, all of them exhibit
some limitations and, even after several correction and
optimization steps, the generation of a finished genome continues
to represent a complex task (Alkan et al., 2011).
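To make the De Bruijn graph idea concrete, the sketch below builds a graph in which nodes are (k-1)-mers and every k-mer found in a read adds an edge from its prefix to its suffix, and then extends a path while the next step is unambiguous. It is a toy illustration only: the function names are hypothetical, and it leaves out error correction, reverse complements and coverage handling, so it does not reflect how any of the assemblers cited above is actually implemented.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Spell a contig from `start`, stopping at the first branch, merge or cycle."""
    # Count incoming edges so that merge points (in-degree > 1) end the walk.
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]  # each step adds exactly one new base
        visited.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
```

The walk stops wherever a node has more than one successor or predecessor. Repeats longer than k-1 bases create exactly such branches, which is one reason short-read assemblies end up fragmented into many contigs.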
Genome finishing is achieved by converting a set of contigs or
scaffolds into complete sequences that represent the full genomic
content of the organism without unknown regions (gaps) (Mardis et
al., 2002), and aims at a “closed” and full representation of the
chromosome organization. In contrast, a draft genome may be
composed of a few or many contigs or scaffolds (Land et al., 2015).
Although useful for many studies, the unfinished and fragmented
nature of draft genomes may hinder analyses in comparative and
structural genomics (Ricker et al., 2012). In addition, some genes
may be missed if located in a region without coverage (e.g., edges
of contigs/scaffolds), or due to assembly errors (e.g., repetitive
regions that are “collapsed” into a single one) (Klassen and
Currie, 2012).
Genome finishing represents a relevant step to reduce data loss
and lead to a more accurate representation of the genomic features
of the organism of interest. The traditional way to close gapped
genomes includes: (1) design of primers based on the edges of
adjacent contigs, (2) PCR amplification, (3) Sanger sequencing, and
(4) local assembly, usually with (5) manual curation. As this
process is very time consuming, new in vitro approaches were
developed to speed it up, including multiplex PCR (Tettelin et al.,
1999), optical mapping (Latreille et al., 2007), and hybrid
sequencing and assembly (Ribeiro et al., 2012). This, however,
usually results in a drastic increase in the cost of the sequencing
project. Therefore, it is not surprising that, of the 87,956
prokaryote genomes available in GenBank up to December 2016
(GenBank release 217), only a small fraction (~6,586) is finished
(http://www.ncbi.nlm.nih.gov/genbank/).
Genetics and Molecular Biology, 40, 3, 553-576 (2017)
Copyright © 2017, Sociedade Brasileira de Genética. Printed in Brazil
DOI: http://dx.doi.org/10.1590/1678-4685-GMB-2016-0230
Send correspondence to Frederico Schmitt Kremer. Centro de
Desenvolvimento Tecnológico, Universidade Federal de Pelotas,
Campus Universitário, S/N, CEP 96160-000, Capão do Leão, RS,
Brazil. E-mail: [email protected].
Review Article
In recent years, the availability of third-generation sequencing
technologies, such as PacBio SMRT and Oxford Nanopore, has also
provided another way to achieve finished genomes (Ribeiro et al.,
2012). As these platforms usually generate reads with length
> 10 kb, the assembly algorithms have to deal with fewer
ambiguities and problematic regions (Koren and Phillippy, 2015).
This makes the reconstruction of the chromosome sequences easier
but, just like second-generation technologies, both platforms have
their own limitations. Earlier versions of the PacBio SMRT
platform, for example, presented a high error rate, so it was
recommended to correct part of these errors using short-read data
(e.g., Illumina) (Koren et al., 2012; Au et al., 2012; Salmela and
Rivals, 2014) before using the results for any analysis. Although
this problem has been minimized by improvements in the sequencing
chemistry and base-calling process over recent years, the time
required for sequencing, the cost of each run, and the price of the
equipment itself are still drawbacks, and these are some of the
reasons why it is not more frequently applied. In the case of
Oxford Nanopore, there is a limited number of tools that can be
used to pre-process and analyze its data, and some types of errors
(e.g., under-representation of specific k-mers) are still recurrent
(Deschamps et al., 2016). Therefore, second-generation platforms
are still the most popular ones for a wide variety of applications,
and remain the cheaper alternative to obtain genomic data.
Previous reviews that focused on genome assembly and whole-genome
sequencing analysis have already described some of the in silico
tools that have the potential to improve a genome assembly without
the need for experimental data. Edwards and Holt (2013), Dark
(2013) and Vincent et al. (2016) have provided very comprehensive
descriptions of some of the main steps of data analysis in
microbial genomes sequenced by next-generation technologies,
including de novo assembly, reference-based contig ordering, genome
annotation and comparative genomics, and these are useful starting
points for those who are new to microbial genomics with NGS.
Nagarajan et al. (2010) have also made some useful considerations
regarding genome assembly and presented some methodologies for
genome finishing, but limited to a small set of approaches that
they have developed to improve assemblies. Ramos et al. (2013)
exemplified some assembly and post-assembly methods for genomes
sequenced with IonTorrent platforms. Finally, Hunt et al. (2014)
have also made a complete and accurate benchmark analysis of
different tools for assembly scaffolding based on
paired-end/mate-pair data. However, although very instructive,
these reviews describe some specific procedures but do not give an
in-depth view of the different finishing strategies, nor of the
algorithms that are used in the background by the different tools
from each category. Therefore, a more complete description of these
tools may be useful, especially for those researchers who are
starting to work with microbial genome analysis and want to achieve
better assemblies without relying on re-sequencing or manual
gap-closing.
The present review aims at discussing some of the categories of
software that may be applied in the process of assembly finishing,
especially in the context of microbial genome sequencing projects.
A particular focus is placed on those that do not require
additional experimental and/or new sequencing data. The review
intentionally focuses on tools that are freely available, at least
for academic use, and omits proprietary software, like the CLC
Genomics Workbench (http://www.clcbio.com/), for example. This was
not due to anticipated differences in the performance or quality of
the results, but to adhere to the original intention: an overview
of tools that could be used by most researchers. As different
approaches have been developed to improve genomic assemblies, the
description of these programs was divided into five main
categories: scaffolding (placement of contigs into larger sequences
by using experimental data or information from closely related
genomes, and joining them by gap regions), assembly integration
(generation of a consensus assembly using multiple assemblies for
the same genome), gap closing (solving gaps by identifying their
correct sequence), error correction (removal of misidentified bases
or misassembled regions) and assembly evaluation (quantification of
the reliability of a genome assembly and identification of its
erroneous regions) (Table 1). These different categories of
programs can be combined, according to the type of data that is
available (e.g., sequencing platform and library that was used),
and may help to reduce the fragmentation and improve the
reliability of a genome assembly. Based on the categories of
software that are discussed in the present review, we have created
the flowchart shown in Figure 1, which may be of help in choosing
the most appropriate approach for genome finishing, depending on
the type of data that is available. Examples of genome projects
that used some of the tools discussed in the present review are
shown in the Supplementary material (Table S1).
Table 1 - Overview of the tools described in the present review.

Category: Scaffolding
Tool: ABySS
Main features:
- Paired-end scaffolding.
- Scaffolding feature already integrated in the ABySS de novo assembly pipeline.
- Uses the estimated distances generated by the program DistanceEst (from the same package) as input.
- Allows the scaffolding using long-reads, such as those generated by PacBio and Oxford Nanopore platforms.
Dependencies*:
- boost libraries: www.boost.org/
- OpenMPI: http://www.open-mpi.org
- sparse-hash library: http://goog-sparsehash.sourceforge.net/
Reference: (Simpson et al., 2009)
Download link/webserver: http://www.bcgsc.ca/platform/bioinfo/software/abyss

Category: Scaffolding
Tool: Bambus 2
Main features:
- Paired-end scaffolding.
- Can be easily integrated with assembly projects that are built on top of the AMOS package.
- Supports the scaffolding of metagenomes.
- Requires experience with the AMOS package and its data formats.
Dependencies*:
- AMOS package (Treangen et al., 2011): http://amos.sourceforge.net/
Reference: (Koren et al., 2011)
Download link/webserver: https://sourceforge.net/projects/amos/

Category: Scaffolding
Tool: MIP
Main features:
- Paired-end scaffolding.
- Supports both paired-end and mate-pair (long range) reads.
Dependencies*:
- lpsolve library: http://sourceforge.net/projects/lpsolve/
- lemon library: http://lemon.cs.elte.hu/
Reference: (Salmela et al., 2011)
Download link/webserver: https://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/

Category: Scaffolding
Tool: OPERA
Main features:
- Paired-end scaffolding.
- Identifies potential spurious connections caused by chimeric reads and repetitive genomic elements that may affect the reliability of the scaffolding.
- Contigs identified as misassembled may be used in the construction of more than one scaffold, but sometimes this may lead to new assembly errors.
Dependencies*:
- BWA (Li and Durbin, 2009): http://bio-bwa.sourceforge.net/
- Bowtie (Langmead et al., 2009): http://bowtie-bio.sourceforge.net/
- Samtools (Li et al., 2009): http://samtools.sourceforge.net/
Reference: (Gao et al., 2011)
Download link/webserver: https://sourceforge.net/projects/operasf

Category: Scaffolding
Tool: SCARPA
Main features:
- Paired-end scaffolding.
- Only uses for scaffolding those contigs with length greater than the N50 of the assembly.
- Allows multiple libraries to be used in the same scaffolding project.
Dependencies*: None
Reference: (Donmez and Brudno, 2013)
Download link/webserver: http://compbio.cs.toronto.edu/hapsembler/scarpa.html

Category: Scaffolding
Tool: SGA
Main features:
- Paired-end scaffolding.
- Scaffolding feature already integrated in the SGA assembly pipeline, which is optimized for Illumina data and large genomes.
- Uses the estimated distances generated by the program DistanceEst (from the package ABySS) as input, along with the read mapping file in .BAM format.
- Allows multiple libraries to be used in the same scaffolding project.
Dependencies*:
- Bamtools (Barnett et al., 2011): https://github.com/pezmaster31/bamtools
- BWA (Li and Durbin, 2009): http://bio-bwa.sourceforge.net/
- Samtools (Li et al., 2009): http://samtools.sourceforge.net/
- Sparse-hash library: http://goog-sparsehash.sourceforge.net/
Reference: (Simpson and Durbin, 2012)
Download link/webserver: https://github.com/jts/sga

Category: Scaffolding
Tool: SOPRA
Main features:
- Paired-end scaffolding.
- Developed to improve the assemblies generated by Velvet and SSAKE, and requires the .AFG files.
- Supports data from early Illumina and ABI SOLiD platforms, including paired-end and mate-pair reads.
- Is not fully automated, so it is necessary to run different scripts for each step of the scaffolding.
Dependencies*: None
Reference: (Dayarian et al., 2010)
Download link/webserver: http://www.physics.rutgers.edu/~anirvans/SOPRA/

Category: Scaffolding
Tool: SSPACE
Main features:
- Paired-end scaffolding.
- Trims the edges of the contigs, as they are more prone to assembly errors.
- Requires information about the paired-end library, including mean size of the insert, standard deviation and the relative orientation of the mates.
Dependencies*: None
Reference: (Boetzer et al., 2011)
Download link/webserver: http://www.baseclear.com/genomics/bioinformatics/basetools/

Category: Scaffolding
Tool: SSPACE-LongRead
Main features:
- Paired-end scaffolding.
- Allows the scaffolding using long-reads, such as those generated by PacBio and Oxford Nanopore platforms.
Dependencies*: None
Reference: (Boetzer and Pirovano, 2014)
Download link/webserver: http://www.baseclear.com/genomics/bioinformatics/basetools/

Category: Scaffolding
Tool: MUMmer
Main features:
- Single reference-based scaffolding.
- The result of the alignment must be post-processed to obtain the scaffolds.
Dependencies*: (none listed)
Reference: (Kurtz et al., 2004)
Download link/webserver: http://mummer.sourceforge.net/

Category: Scaffolding
Tool: ABACAS
Main features:
- Single reference-based scaffolding.
- Useful when the reference and the target genome are closely related, and the genome to be scaffolded is not larger than the reference genome.
- Not optimized for bacteria with two or more replicons/chromosomes (e.g., Leptospira genus).
- Allows the design of primers for gap-closing.
Dependencies*:
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
- Primer3 (Koressaar and Remm, 2007; Untergasser et al., 2012): http://primer3.ut.ee/
Reference: (Assefa et al., 2009)
Download link/webserver: http://abacas.sourceforge.net/

Category: Scaffolding
Tool: CONTIGuator
Main features:
- Single reference-based scaffolding.
- Useful when the target genome is composed of more than one chromosome/replicon.
- Allows a more sensitive identification of syntenic regions, if compared to ABACAS, as it applies a BLAST search after MUMmer.
Dependencies*:
- ABACAS (Assefa et al., 2009): http://abacas.sourceforge.net/
- BioPython (Python package): http://biopython.org/
- BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
- Primer3 (Koressaar and Remm, 2007; Untergasser et al., 2012): http://primer3.ut.ee/
Reference: (Galardini et al., 2011)
Download link/webserver: http://contiguator.sourceforge.net/

Category: Scaffolding
Tool: Mauve
Main features:
- Single reference-based scaffolding.
- Can be used both through a command line interface (CLI) and a graphical user interface (GUI).
- Allows the identification of genomic inversions and translocations.
- Not optimized for bacteria with two or more replicons/chromosomes.
Dependencies*:
- Java: https://www.java.com/
Reference: (Darling et al., 2004; Rissman et al., 2009)
Download link/webserver: http://darlinglab.org/mauve/mauve.html

Category: Scaffolding
Tool: FillScaffolds
Main features:
- Single reference-based scaffolding.
- Not optimized for bacteria with two or more replicons/chromosomes.
- Results may require post-processing to reconstruct the sequence of the scaffold.
Dependencies*:
- Java: https://www.java.com/
Reference: (Muñoz et al., 2010)
Download link/webserver: Supplementary data of Muñoz et al. (2010). http://dx.doi.org/10.1186/1471-2105-11-304

Category: Scaffolding
Tool: SIS
Main features:
- Single reference-based scaffolding.
- Allows the identification of genomic inversions.
- Not optimized for bacteria with two or more replicons/chromosomes.
Dependencies*:
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Dias et al., 2012)
Download link/webserver: http://marte.ic.unicamp.br:8747

Category: Scaffolding
Tool: CAR
Main features:
- Single reference-based scaffolding.
- Allows the identification of genomic inversions and translocations.
- Also available as a webserver.
- Not optimized for bacteria with two or more replicons/chromosomes.
Dependencies*:
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
- PHP: https://php.net/
Reference: (Lu et al., 2014)
Download link/webserver: http://genome.cs.nthu.edu.tw/CAR/

Category: Scaffolding
Tool: RACA
Main features:
- Multiple reference-based scaffolding.
- Optimized for large genomes and genomes with multiple chromosomes.
- Can also use paired-end data.
Dependencies*: None
Reference: (Kim et al., 2013)
Download link/webserver: http://bioen-compbio.bioen.illinois.edu/RACA/

Category: Scaffolding
Tool: Ragout
Main features:
- Multiple reference-based scaffolding.
- Uses phylogenetic information to identify the most probable orientation of the contigs/scaffolds.
Dependencies*:
- Networkx (Python package): http://networkx.github.io/
- Newick (Python package): http://www.daimi.au.dk/~mailund/newick.html
- Sibelia: http://github.com/bioinf/Sibelia
Reference: (Kolmogorov et al., 2014)
Download link/webserver: https://github.com/fenderglass/Ragout

Category: Scaffolding
Tool: MeDuSa
Main features:
- Multiple reference-based scaffolding.
- Accepts both finished and draft genomes as reference.
Dependencies*:
- BioPython (Python package): http://biopython.org/
- Java: https://www.java.com/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Bosi et al., 2015)
Download link/webserver: https://github.com/combogenomics/medusa

Category: Assembly integration
Tool: Minimus
Main features:
- Can be easily integrated with assembly projects that are built on top of the AMOS package.
- Requires experience with the AMOS package and its data formats.
Dependencies*:
- AMOS package (Treangen et al., 2011): http://amos.sourceforge.net/
Reference: (Sommer et al., 2007)
Download link/webserver: https://sourceforge.net/projects/amos/

Category: Assembly integration
Tool: Reconciliator
Main features:
- Corrects the misassembled regions in a target assembly by comparing it to an alternative assembly for the same genome.
- Identifies repetitive regions that suffered compressions or expansions.
Dependencies*:
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Zimin et al., 2008)
Download link/webserver: http://www.genome.umd.edu/

Category: Assembly integration
Tool: MAIA
Main features:
- Allows the integration of two or more assemblies.
- Accepts a reference genome to perform scaffolding, which is useful for those contigs without correspondence in the other assemblies.
Dependencies*:
- Matlab: https://www.mathworks.com/
- MUMmer: http://mummer.sourceforge.net/
- GAIMC (Matlab toolbox): http://github.com/dgleich/gaimc
Reference: (Nijkamp et al., 2010)
Download link/webserver: http://bioinformatics.tudelft.nl

Category: Assembly integration
Tool: CISA
Main features:
- Allows the integration of three or more assemblies.
- Corrects misassembled regions and compressed/expanded repeated regions.
Dependencies*:
- BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Lin and Liao, 2013)
Download link/webserver: http://sb.nhri.org.tw/CISA/

Category: Assembly integration
Tool: GAA
Main features:
- Uses the alignment between the different contigs in the set of assemblies to generate an assembly graph, which is explored to identify the minimal set of independent paths.
Dependencies*:
- BLAT (Kent, 2002): https://genome.ucsc.edu/
- GSMapper: http://454.com/
Reference: (Yao et al., 2012)
Download link/webserver: http://sourceforge.net/projects/gaa-wugi/

Category: Assembly integration
Tool: Mix
Main features:
- Generates an extension graph that represents the connections between the contigs.
- Filters the alignment to reduce the ambiguities caused by repetitive sequences.
Dependencies*:
- Networkx (Python package): http://networkx.lanl.gov/
- BioPython (Python package): http://biopython.org/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Soueidan et al., 2013)
Download link/webserver: https://github.com/cbib/MIX

Category: Assembly integration
Tool: GAM / GAM-NGS
Main features:
- Requires the read files to perform the assembly integration.
- One of the assemblies to be merged is defined as “master”, while the others are defined as “slaves”.
- Allows the identification of misassembled regions in the master, which are corrected before the generation of the final assembly.
Dependencies*:
- cmake: https://cmake.org/
- zlib library: http://www.zlib.net/
- boost libraries: www.boost.org/
- sparse-hash library: http://goog-sparsehash.sourceforge.net/
Reference: (Casagrande et al., 2009; Vicedomini et al., 2013)
Download link/webserver: https://github.com/vice87/gam-ngs

Category: Assembly integration
Tool: Zorro
Main features:
- Requires the read files to perform the assembly integration.
- Remaps the reads back to the contigs and identifies misassembled and repetitive regions based on the coverage.
- Splits the misassembled contigs and performs the assembly integration using Minimus.
Dependencies*:
- AMOS (Treangen et al., 2011): http://amos.sourceforge.net/
- BioPerl (Perl module): http://bioperl.org
- Bowtie (Langmead et al., 2009): http://bowtie-bio.sourceforge.net/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Argueso et al., 2009)
Download link/webserver: http://lge.ibi.unicamp.br/zorro/

Category: Gap closing
Tool: GapCloser
Main features:
- Gap-closing feature already integrated in the SOAPdenovo de novo assembly pipeline.
- Performs a local reassembly in the gap region using the reads located at the edges of the surrounding contigs.
Dependencies*: None
Reference: (Li et al., 2010)
Download link/webserver: http://soap.genomics.org.cn/

Category: Gap closing
Tool: IMAGE
Main features:
- Iteratively performs a remapping of the reads to the contigs, followed by the selection of those that overlap the gap region and a local reassembly.
Dependencies*: None
Reference: (Tsai et al., 2010)
Download link/webserver: https://sourceforge.net/projects/image2

Category: Gap closing
Tool: GapFiller
Main features:
- Iteratively performs a remapping of the reads to the contigs, followed by the selection of those that overlap the gap region and a local reassembly.
- Requires information about the paired-end library, including mean size of the insert, its standard deviation and the relative orientation of the mates.
Dependencies*: None
Reference: (Boetzer and Pirovano, 2012)
Download link/webserver: http://www.baseclear.com/genomics/bioinformatics/basetools

Category: Gap closing
Tool: Enly
Main features:
- Iteratively performs a remapping of the reads to the contigs, followed by the selection of those that overlap the gap region and a local reassembly.
- If a reference genome is provided, a new scaffolding step can be performed to improve the assembly.
Dependencies*:
- BioPython (Python package): http://biopython.org/
- BLAST and BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- Cdbfasta/cdbyank: http://compbio.dfci.harvard.edu/tgi/software/
- EMBOSS: http://emboss.sourceforge.net/
- Minimo assembler (Treangen et al., 2011): http://amos.sourceforge.net/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
- Phrap: http://www.phrap.org/phredphrapconsed.html
Reference: (Fondi et al., 2014)
Download link/webserver: http://enly.sourceforge.net/

Category: Gap closing
Tool: FGAP
Main features:
- Uses alternative assemblies of the target genome to identify regions that overlap the gap.
Dependencies*:
- Matlab: https://www.mathworks.com/
Reference: (Piro et al., 2014)
Download link/webserver: http://www.bioinfo.ufpr.br/fgap/

Category: Gap closing
Tool: Sealer
Main features:
- Performs a local re-assembly of the gap regions using different k-mer settings, which may help in solving regions with repetitive sequences.
Dependencies*:
- boost libraries: www.boost.org/
- sparse-hash library: http://goog-sparsehash.sourceforge.net/
- OpenMPI: http://www.open-mpi.org
Reference: (Paulino et al., 2015)
Download link/webserver: https://github.com/bcgsc/abyss/tree/sealer-release

Category: Gap closing
Tool: GMCLoser
Main features:
- May use both paired-end reads and alternative assemblies to perform the gap-closing.
- Applies a likelihood analysis to avoid the effect of misassemblies in the alternative assemblies.
Dependencies*:
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
- BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- Bowtie (Langmead et al., 2009): http://bowtie-bio.sourceforge.net/
- YASS (Noé and Kucherov, 2005): http://bioinfo.lifl.fr/yass
Reference: (Kosugi et al., 2015)
Download link/webserver: https://sourceforge.net/projects/gmcloser/

Category: Gap closing
Tool: MapRepeat
Main features:
- Performs a reference-based scaffolding using a closely related genome provided by the user.
- Uses a reference-guided assembly to perform the gap-closing process.
Dependencies*:
- BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- BioPython (Python package): http://biopython.org/
- MIRA: http://mira-assembler.sourceforge.net
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (Mariano et al., 2015)
Download link/webserver: http://github.com/dcbmariano/maprepeat

Category: Gap closing
Tool: GapBlaster
Main features:
- Allows a manual gap-closing using an alternative assembly of the target genome.
Dependencies*:
- BLAST and BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
Reference: (de Sá et al., 2016)
Download link/webserver: https://sourceforge.net/projects/gapblaster2015/

Category: Assembly evaluation
Tool: REAPR
Main features:
- Calculates the accuracy of the assembly based on the coverage after remapping the reads back to the scaffolds.
- Misassembled regions can be identified as they usually present a discrepant coverage.
- A new set of scaffolds is generated by splitting the regions identified as misassembled.
Dependencies*:
- File::Basename, File::Copy, File::Spec, File::Spec::Link, Getopt::Long and List::Util (Perl modules): http://www.cpan.org/
- R: https://www.r-project.org/
Reference: (Hunt et al., 2013)
Download link/webserver: http://www.sanger.ac.uk/science/tools/reapr

Category: Assembly evaluation
Tool: QUAST
Main features:
- Calculates several assembly metrics, such as C+G%, N50 and L50.
- Can be used to compare different assemblies for the same genome, and/or compare them to a reference genome.
Dependencies*:
- boost libraries: www.boost.org/
- Java: https://www.java.com/
- Matplotlib (Python package): http://matplotlib.org
- Time::HiRes (Perl module): http://www.cpan.org/
Reference: (Gurevich et al., 2013)
Download link/webserver: http://bioinf.spbau.ru/quast

Category: Assembly evaluation
Tool: ALE
Main features:
- Calculates the accuracy of the assembly based on the k-mers and C+G% distribution along the scaffolds.
- Doesn’t require a reference genome.
Dependencies*:
- Matplotlib (Python package): http://matplotlib.org
- Mpmath (Python package): http://mpmath.org
- Numpy (Python package): http://www.numpy.org
- Pymix (Python package): http://www.pymix.org/pymix
- Setuptools (Python package): https://github.com/pypa/setuptools
Reference: (Clark et al., 2013)
Download link/webserver: http://www.alescore.org

Category: Assembly evaluation
Tool: CGAL
Main features:
- Calculates the accuracy of the assembly based on the coverage after remapping the reads back to the scaffolds.
Dependencies*: None
Reference: (Rahman and Pachter, 2013)
Download link/webserver: http://bio.math.berkeley.edu/cgal/

Category: Assembly evaluation
Tool: GMvalue
Main features:
- Aligns the assembly to a reference genome (or alternative assembly) to identify misassembled regions.
- A new set of scaffolds is generated by splitting the regions identified as misassembled.
Dependencies*:
- MUMmer (Kurtz et al., 2004): http://mummer.sourceforge.net/
- BLAST+ (Altschul et al., 1990; Camacho et al., 2009): ftp://ftp.ncbi.nlm.nih.gov/blast/
- Bowtie (Langmead et al., 2009): http://bowtie-bio.sourceforge.net/
- YASS (Noé and Kucherov, 2005): http://bioinfo.lifl.fr/yass
Reference: (Kosugi et al., 2015)
Download link/webserver: https://sourceforge.net/projects/gmcloser/

Category: Assembly correction
Tool: iCORN
Main features:
- Requires paired-end reads.
- Interactively identifies and corrects short misassemblies, such as base-substitutions and short INDELs.
Dependencies*:
- SNP-o-matic (Manske and Kwiatkowski, 2009): https://snpomatic.svn.sourceforge.net/svnroot/snpomatic
- SSAHA Pileup (Ning et al., 2001): ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
Reference: (Otto et al., 2010)
Download link/webserver: http://icorn.sourceforge.net/

Category: Assembly correction
Tool: SEQuel
Main features:
- Requires paired-end reads.
- Interactively identifies and corrects short misassemblies, such as base-substitutions and short INDELs.
- Performs a local reassembly of the misassembled regions using information from k-mers and paired-end reads.
Dependencies*:
- Java: https://www.java.com/
- JGraphT (Java library): http://jgrapht.org/
Reference: (Ronen et al., 2012)
Download link/webserver: http://bix.ucsd.edu/SEQuel/

Category: Assembly correction
Tool: GFinisher
Main features:
- Doesn’t require paired-end reads.
- Integrates a reference-guided scaffolding step and gap-closing procedures, along with the assembly correction process.
- Identifies misassembled regions based on the GC-Skew distribution.
Dependencies*:
- Java: https://www.java.com/
Reference: (Guizelini et al., 2016)
Download link/webserver: http://gfinisher.sourceforge.net/

* = Considering a computer running UNIX, Linux or MacOS operating systems (OSs). As Make, sed, awk, GCC, Perl, Bash, Python and the GNU/Unix standard utility set are already included in most of the distributions/versions of these OSs, these programs were not listed as dependencies.
Scaffolding
By definition, a contig is a contiguous sequence that has no unknown
regions or assembly gaps (but may contain “N” characters that
represent base-calling errors) (Staden, 1979). A scaffold, on the
other hand, consists of two or more contigs that have been joined
according to some linkage information (e.g., paired-end reads,
genome maps) (Huson et al., 2002). Paired-end or mate-pair libraries
can be very useful in de novo genome assembly, and several tools use
the relative position information to connect contigs into scaffolds
(Hunt et al., 2014). Similarly, with the increase in the
availability of genomic sequences from a wide variety of species,
other scaffolding alternatives were developed that use one or
multiple genomes as references to order the contigs.
Paired-end scaffolding
Most de novo genome assemblers integrate scaffolding steps after
contig construction, although it is also possible to use third-party
tools to obtain a more reliable result. The A5 assembly pipeline
(Tritt et al., 2012), for example, uses the de novo assembler IDBA
(Peng et al., 2010) to construct the contigs and SSPACE (Boetzer et
al., 2011) to generate scaffolds. Scaffolding with paired-end reads
usually consists of the alignment of reads to the contigs, followed
by the identification of connections between different contigs using
the relative-orientation information and the estimated insert size.
ABySS (Simpson et al., 2009), SOPRA (Dayarian et al., 2010),
SOAPdenovo (Li et al., 2010), Bambus 2 (Koren et al., 2011), MIP
(Salmela et al., 2011), Opera (Gao et al., 2011), SSPACE (Boetzer et
al., 2011), SLIQ (Roy et al., 2012), SGA (Simpson and Durbin, 2012),
SCARPA (Donmez and Brudno, 2013), WiseScaffolder (Farrant et al.,
2015) and ScaffoldScaffolder (Bodily et al., 2016) are examples of
scaffolding tools based on paired-end information. More recently,
the use of long-reads was also incorporated into scaffolding tools
such as AHA (Bashir et al., 2012) and SSPACE-LongRead (Boetzer and
Pirovano, 2014).
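As an illustration of the linking step described above, the
following minimal Python sketch counts read pairs whose mates map to
different contigs and keeps only the connections supported by a
minimum number of pairs. It is not taken from any of the cited
tools: the function name, the input layout and the min_links
threshold are assumptions made for the example, and orientation and
insert-size checks are deliberately left out.

```python
from collections import Counter


def count_links(pairs, min_links=2):
    """Count read pairs that connect two different contigs.

    pairs: iterable of (contig_of_mate1, contig_of_mate2) tuples
           (hypothetical input; real tools read this from alignments).
    Returns a dict {(contig_a, contig_b): number_of_pairs} holding only
    the connections supported by at least min_links pairs.
    """
    links = Counter()
    for contig_a, contig_b in pairs:
        if contig_a == contig_b:
            # Both mates fall in the same contig: no linkage information.
            continue
        # Store the edge with a canonical (sorted) key so that
        # (c1, c2) and (c2, c1) are counted together.
        links[tuple(sorted((contig_a, contig_b)))] += 1
    return {edge: n for edge, n in links.items() if n >= min_links}


pairs = [("c1", "c2"), ("c2", "c1"), ("c1", "c1"), ("c2", "c3")]
print(count_links(pairs, min_links=2))
```

Requiring more than one supporting pair is a simple way to discard
connections created by isolated chimeric or mis-mapped pairs.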
ABySS (Simpson et al., 2009): The program abyss-scaffold, which
comes with the ABySS assembly package (Simpson et al., 2009), uses
the estimated mate-distance distribution in paired-end reads to
connect contigs and generate scaffolds. The distance distribution
can be calculated by DistanceEst, which is also part of the package
and is also used by other assembly pipelines, such as SGA (Simpson
and Durbin, 2012). ABySS was developed to be used with both small
and large genomes, and can be executed on a computer cluster by
using the Message Passing Interface (MPI), which is also useful for
high-coverage data and when dealing with multiple libraries. Like
most scaffolding programs, abyss-scaffold can also be used to
scaffold contigs generated by third-party programs. Finally, ABySS
also supports scaffolding with long-reads by using BWA-MEM for read
alignment (Li and Durbin, 2009). The source code of the ABySS
package is available at
http://www.bcgsc.ca/platform/bioinfo/software/abyss, and it was
developed to work on the Linux operating system.
Figure 1 - A flowchart demonstrating how and when the different
genome finishing approaches can be combined according to the data
that are available to the user. (a) Scaffolding using paired-end
reads or long-reads, which is directly dependent on the way the
genome was sequenced (platform, library), and is sometimes performed
as part of the de novo assembly process. (b) Assembly integration,
which consists of the combination of different de novo assemblies
and the generation of a consensus/extended assembly. Some programs
use only the assemblies as input, while others also use the
sequencing reads. (c) The standard contig-ordering approach based on
a single reference genome, which consists of the identification of
synteny blocks that guide the orientation of the contigs in the
draft genome, without taking into account the occurrence of genome
inversions or other rearrangements. (d) The rearrangement-aware
contig ordering, which identifies potential sites of inversions and
translocations based on signatures in the alignment against the
reference genome. (e) The multiple-reference contig ordering, which
may be more appropriate in those cases where there is no finished
reference genome, but there is a relatively high number of closely
related drafts, or when there is no apparent closest reference to be
used. (f) Assembly correction, which consists of the removal of
short misassemblies, including base-substitutions and short
insertions and deletions. (g) Gap-closing, which consists of the
joining of adjacent contigs that used to be separated by a gap. (h)
Assembly evaluation, which may help to assess the reliability of the
assembly.
-
SOPRA (Dayarian et al., 2010): This scaffolding tool was designed to
improve assemblies generated by Velvet (Zerbino and Birney, 2008)
and SSAKE (Warren et al., 2007), and targets the earlier sequencing
platforms from Illumina and ABI SOLiD. The program parses the
read-placing file generated by these assemblers and extracts
information on paired-end/mate-pair reads, which is used to
calculate the mean distance between pairs and the correct
orientation. Based on this file, SOPRA also infers the connections
between contigs by searching for those read pairs whose mates are in
different contigs. The program is not fully automated, so each step
of data processing must be executed by a different script before the
main scaffolding process. Another drawback is the limited support
for different de novo assemblers, as it requires read-placing files
in AFG format, which is produced by only a few assemblers nowadays
(e.g., Velvet, Ray and AMOS). SOPRA can be obtained from the website
http://www.physics.rutgers.edu/~anirvans/SOPRA/.
Bambus 2 (Koren et al., 2011): is part of the AMOS package (Treangen
et al., 2011). It is a scaffolding tool for both genomes and
metagenomes, and an update, targeting NGS data, of the Sanger-based
program Bambus (Pop et al., 2004). The program requires read-placing
information to construct a contig graph, and explores the graph to
find consistent connections between the contigs. As Bambus 2 can
also be used to scaffold metagenomic assemblies, unlike other
programs, it considers the effect of DNA samples containing mixes of
closely related organisms on the assembly process and reduces the
chance of fragmentation and mis-joining by analyzing the molecular
variants. However, Bambus 2 is not as simple to use as other
scaffolding tools, as it requires some experience with the AMOS
tools to generate its input file and process the outputs (Treangen
et al., 2011). The AMOS package can be obtained from the SourceForge
repository: https://sourceforge.net/projects/amos/.
MIP (Salmela et al., 2011): uses the concept of mixed integer
programming to generate a set of scaffolds from a genome assembly
and a set of paired-end/mate-pair reads. First, readaligner (Mäkinen
et al., 2010) is used to map the read pairs back to the contigs.
Then, the pairs are filtered to remove inconsistent connections, and
the distances between the contigs are estimated based on the mean
distance between the mates, which is calculated for each library
used in the assembly. The connections in the generated scaffold
graph have a minimum and maximum estimated length, derived from the
library information. The MIP source code, along with usage
instructions, is available at
https://www.cs.helsinki.fi/u/lmsalmel/mip-scaffolder/.
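The idea of deriving minimum and maximum gap lengths from the
library information can be sketched as below. This is a generic
illustration rather than the MIP implementation: the function name,
the coordinate convention (0-based, end exclusive, insert size
measured from the start of the first mate to the end of the second)
and the tolerance of k standard deviations are all assumptions made
for the example.

```python
def gap_bounds(len_a, start_a, end_b, insert_mean, insert_sd, k=3.0):
    """Estimate lower and upper bounds for the gap between two contigs.

    len_a:       length of the left contig (A)
    start_a:     0-based start of the forward mate on contig A
    end_b:       end (exclusive) of the reverse mate on contig B
    insert_mean: mean insert size of the library
    insert_sd:   standard deviation of the insert size
    k:           tolerance in standard deviations (assumed value)
    """
    # Part of the insert that is already explained by contig sequence:
    # from the first mate to the end of A, plus from the start of B to
    # the end of the second mate.
    covered = (len_a - start_a) + end_b
    lower = insert_mean - k * insert_sd - covered
    upper = insert_mean + k * insert_sd - covered
    return lower, upper


# A pair with one mate 100 bp from the end of A and the other ending
# 150 bp into B, from a library with insert size 500 +/- 20.
print(gap_bounds(len_a=1000, start_a=900, end_b=150,
                 insert_mean=500, insert_sd=20))
```

A negative lower bound simply means that the two contigs may overlap
instead of being separated by a gap.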
Opera (Gao et al., 2011): takes as inputs a collection of contigs
and mapped reads and generates a scaffold graph based on the
paired-end information. First, the program filters the connections
between contigs to remove possible mis-joining errors caused by
chimeric pairs. The graph is contracted, and the optimum orientation
of the contigs inside the scaffolds is inferred by a dynamic
programming algorithm that explores the search space. The algorithm
can also infer the occurrence of repeated genomic regions, which are
usually assembled into a single contig in the case of short-reads.
In this case, repeated regions are identified by comparing the
coverage of the contigs to the mean coverage of the whole genome and
selecting those with a value greater than 1.5 times the genomic
mean. The identification of these regions allows a contig to be used
in more than one scaffold, which can provide a better assembly of
repeated regions, but can also result in misassemblies. Opera can be
obtained from its SourceForge repository
https://sourceforge.net/projects/operasf/.
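The coverage criterion described above can be illustrated with a few
lines of Python. Only the 1.5-fold threshold comes from the text;
the function name, the input layout and the use of a length-weighted
mean as the "genomic mean" are assumptions of this sketch, not a
description of Opera's code.

```python
def flag_repeat_contigs(contigs, factor=1.5):
    """Return the names of contigs whose coverage suggests a repeat.

    contigs: dict {name: (length, mean_coverage)} (hypothetical input).
    factor:  coverage threshold relative to the genomic mean.
    """
    total_length = sum(length for length, _ in contigs.values())
    # Length-weighted mean coverage, used here as the genomic mean
    # (an assumption of this sketch).
    genome_mean = sum(length * cov
                      for length, cov in contigs.values()) / total_length
    return sorted(name for name, (_, cov) in contigs.items()
                  if cov > factor * genome_mean)


contigs = {
    "c1": (10000, 30.0),
    "c2": (8000, 32.0),
    "c3": (500, 95.0),   # short contig with about 3x the typical coverage
}
print(flag_repeat_contigs(contigs))
```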
SGA (String Graph Assembler) (Simpson and Durbin, 2012): is a de
novo genome assembler developed for the memory-efficient assembly of
small and large genomes by applying the method proposed by Myers
(2005). As part of its assembly pipeline, SGA also provides a
scaffolding tool that connects contigs into scaffolds using
information from read alignments (in .BAM format), which can be
generated by a wide variety of mapping tools (Li et al., 2008;
Langmead et al., 2009; Lunter and Goodson, 2011), and the estimated
distance between mates, generated by DistanceEst from the ABySS
package. SGA also supports scaffolding from multiple libraries with
different insert sizes, and was optimized to work with Illumina
data. SGA is available from the GitHub repository
https://github.com/jts/sga.
SCARPA (Donmez and Brudno, 2013): uses paired-end information to
generate scaffolds, but takes into account that not only chimeric
reads but also misassembled sequences may be responsible for
inconsistent linkages between contigs. It estimates the mean and
standard deviation of the distance between the mates, but only uses
information from those contigs with a length greater than the
assembly N50. The connections between the contigs are estimated
based on the mate information and the calculated metrics, and if
more than one library is provided, SCARPA processes the scaffolding
iteratively, starting from the library with the smallest insert
size. The program can be obtained from the URL
http://compbio.cs.toronto.edu/hapsembler/scarpa.html.
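To make the N50-based filtering concrete, the sketch below computes
the N50 of an assembly and then estimates the mean and standard
deviation of the mate distance using only pairs placed in contigs
longer than that value. The function names and the input layout are
hypothetical, and the code is a simplified illustration of the idea
described above, not SCARPA's implementation.

```python
from statistics import mean, stdev


def n50(lengths):
    """Length L such that contigs of length >= L hold half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


def library_stats(contig_lengths, mate_distances):
    """Mean and standard deviation of the mate distance.

    contig_lengths: dict {contig: length}
    mate_distances: dict {contig: [distances of pairs whose mates both
                    map inside this contig]} (hypothetical input)
    Only contigs longer than the assembly N50 are used. At least two
    distances must remain, otherwise stdev() raises an error.
    """
    threshold = n50(list(contig_lengths.values()))
    selected = [d
                for name, dists in mate_distances.items()
                if contig_lengths[name] > threshold
                for d in dists]
    return mean(selected), stdev(selected)


contig_lengths = {"c1": 500, "c2": 400, "c3": 300, "c4": 200, "c5": 100}
mate_distances = {"c1": [290, 300, 310], "c5": [50, 60]}
print(n50(list(contig_lengths.values())))
print(library_stats(contig_lengths, mate_distances))
```

Restricting the estimate to long contigs avoids the bias introduced
by short contigs, in which only pairs with unusually small inserts
can have both mates mapped.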
SSPACE (Boetzer et al., 2011) and SSPACE-LongRead (Boetzer and
Pirovano, 2014): these scaffolding programs are currently
distributed by BaseClear (http://www.baseclear.com), which also
distributes the gap-closing program GapFiller (Boetzer and Pirovano,
2012). SSPACE requires information about the paired-end library,
including the mean and standard deviation of the distance between
the mates and the expected orientation, whose values can be
predicted with the script “estimate_insert_size.pl”, distributed
along with the program. The user may choose between BWA (Li and
Durbin, 2009) and Bowtie (Langmead et al., 2009) for read mapping,
and may set the minimum number of connections required to link two
contigs, the number of bases that will be removed from the borders
of the contigs (as they usually contain errors), and the number of
iterations. For SSPACE-LongRead, the target assembly is aligned to a
collection of long-reads using BLASR (Chaisson and Tesler, 2012) and
the alignments are filtered and refined to find the best
orientation. Both SSPACE and SSPACE-LongRead can be requested from
the BaseClear website
http://www.baseclear.com/genomics/bioinformatics/basetools/.
Hunt et al. (2014) performed an extensive comparison of the
scaffolding tools and demonstrated that the quality of the resulting
scaffolds is directly affected by the read-mapping program and the
complexity of the genome. For the tested datasets, the best results
were obtained by SGA, SOPRA and SSPACE, although all tested tools
presented a certain percentage of mis-joined scaffolds in their
outputs. Some scaffolding tools (e.g., SGA) use pre-aligned reads as
input, so the user is able to test and choose the read mapper. In
this case, it is important to try different read mappers, taking
into account that platform-specific bias and read quality may have a
drastic effect on the quality of the alignment (Hatem et al., 2013;
Caboche et al., 2014). When using mate-pair libraries (long-insert
paired-end reads) it is also important to check whether the
scaffolder was designed to support them, or whether it was designed
only for standard, short-insert, paired-end reads. As mate-pairs may
present a relatively high rate of “false mates”, special care should
be taken when working with this type of data.
Single reference-based scaffolding
In many cases the pairing information is not enough to generate a
reliable reconstruction of the genome’s structure, or the genome was
simply not sequenced using paired reads, but with single-end
sequencing. In order to overcome this, some tools were developed to
use a reference genome as a template to perform the contig ordering
and relative positioning (Figure 2). Software like MUMmer (Kurtz et
al., 2004), ABACAS (Assefa et al., 2009), CONTIGuator (Galardini et
al., 2011) and Mauve (Darling et al., 2004; Rissman et al., 2009)
are able to identify the most probable orientation of the contigs,
but may generate incorrect results in the case of genome inversions
or translocations. On the other hand, SIS (Dias et al., 2012), CAR
(Lu et al., 2014), and FillScaffolds (Muñoz et al., 2010) consider
the occurrence of changes in the genomic structure and take these
phenomena into account during the analysis, generating a more
accurate reconstruction. All these tools use information from a
single genome as reference; more recently, however, some tools, such
as Ragout (Kolmogorov et al., 2014) and MeDuSa (Bosi et al., 2015),
were developed to use information from multiple reference genomes,
allowing an evolutionary-based inference of structural
rearrangements. These multiple reference-based contig ordering tools
will be discussed in the next section.
Figure 2 - Reference-based contig ordering. (a) The program takes a
set of contigs (or scaffolds) and (b) aligns these to a reference
genome to identify the most probable relative orientation of the
sequences in the draft genome. (c) Regions not covered by the
contigs represent gaps and may be sequencing/assembly artifacts or
natural deletions. Based on the relative position of each contig, a
scaffold is created.
MUMmer (Kurtz et al., 2004): is a genome-scale sequence alignment
tool which can be applied to align a set of contigs/scaffolds to a
reference genome, allowing a wide variety of applications in genomic
analysis and NGS data processing, including reference-guided
scaffolding. The two main algorithms of the MUMmer package are
NUCmer, which performs a standard DNA-DNA alignment, and PROmer,
which performs an alignment of the six reading frames of both
sequences (leading to a more sensitive result, especially in the
case of highly divergent organisms). The package also includes other
tools, such as delta-filter, which can be used to remove the
ambiguities in the alignments and select those that are more
relevant for the analysis. Many scaffolding tools, like ABACAS
(Assefa et al., 2009), CONTIGuator (Galardini et al., 2011) and
MeDuSa (Bosi et al., 2015), are built on top of MUMmer and take
advantage of its performance, but also add new features to improve
the output. MUMmer itself does not provide the sequence of the
scaffold, just the positions of the alignments. Therefore, it is
necessary to post-process the results to obtain the sequence of the
scaffolds. MUMmer can be obtained from its SourceForge repository
http://mummer.sourceforge.net/.
ABACAS (Algorithm-based Automatic Contiguation of Assembled
Sequences) (Assefa et al., 2009): can use NUCmer or PROmer from the
MUMmer (Kurtz et al., 2004) package to align the contigs against a
reference genome. The regions that do not have an equivalent
sequence in the contig set are filled with Ns, indicating gaps.
ABACAS can also be used to design PCR primers to amplify the unknown
regions by integrating Primer3 (Koressaar and Remm, 2007;
Untergasser et al., 2012). ABACAS can be obtained from its
SourceForge repository http://abacas.sourceforge.net/, and as part
of the PAGIT package (Swain et al., 2012), available at
http://www.sanger.ac.uk/science/tools/pagit.
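The post-processing needed to turn alignment positions into a
scaffold sequence, with uncovered regions filled with Ns, can be
sketched as follows. This is a simplified illustration and not the
ABACAS code: the function names and the placement tuples are
hypothetical, each contig is assumed to have a single placement on
the reference, and only the gaps between consecutive contigs are
filled.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def build_pseudomolecule(contigs, placements):
    """Join contigs in reference order, padding gaps with Ns.

    contigs:    dict {name: sequence}
    placements: list of (name, ref_start, ref_end, strand) tuples with
                0-based, end-exclusive reference coordinates
                (hypothetical layout; in practice derived from
                alignment coordinates).
    """
    parts = []
    cursor = None  # reference position reached by the previous contig
    for name, ref_start, ref_end, strand in sorted(placements,
                                                   key=lambda p: p[1]):
        if cursor is not None:
            # Uncovered reference region between consecutive contigs.
            parts.append("N" * max(ref_start - cursor, 0))
        seq = contigs[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
        cursor = ref_end if cursor is None else max(cursor, ref_end)
    return "".join(parts)


contigs = {"c1": "AAAC", "c2": "GGT"}
placements = [("c2", 10, 13, "-"), ("c1", 2, 6, "+")]
print(build_pseudomolecule(contigs, placements))
```

Overlapping placements are simply concatenated here; a real tool
would need to resolve them, for example by trimming or merging the
overlapping ends.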
CONTIGuator (Galardini et al., 2011): uses ABACAS (Assefa et al.,
2009) to perform the contig ordering, but adds support for multiple
references, which may be useful in the case of organisms that have
more than one chromosome. BLAST (Altschul et al., 1990; Camacho et
al., 2009) is used to align the input contigs with the reference
sequences to identify the correct reference for each sequence. Then,
ABACAS (Assefa et al., 2009) is used, and its results are integrated
with the BLAST alignment to generate a final assembly. CONTIGuator
can be obtained from its SourceForge repository
http://contiguator.sourceforge.net/, and is also available as a
webserver http://combo.dbe.unifi.it/contiguator.
Mauve (Darling et al., 2004; Rissman et al., 2009): is an alignment
tool that can handle and align multiple genomes and identify regions
of high similarity called Locally Collinear Blocks (LCBs). One of
the program’s features, Mauve Contig Mover, performs contig ordering
using the same algorithm (Rissman et al., 2009). The program runs in
an iterative mode, generating and optimizing the contig orientations
based on the reference until no further change can improve the
model. A directory is generated for each iteration that contains
inputs to visualize the genome in Mauve and a FASTA file with the
sorted contigs. The Mauve aligner can be obtained from the URL
http://darlinglab.org/mauve/mauve.html.
FillScaffolds (Muñoz et al., 2010): analyzes the genomic distance
between the contig set and a reference genome and generates an
ordered sequence by identifying orthologous genes. It considers the
effects of the evolutionary distance in the case of missing genes,
and then uses the position of the orthologs present in the reference
to order the contigs. The source code of FillScaffolds is available
as supplementary data of the Muñoz et al. (2010) paper at:
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-304.
SIS (Scaffolds from Inversion Signatures) (Dias et al., 2012): takes
as input a set of contigs in FASTA format and a coordinate file
generated by NUCmer or PROmer (Kurtz et al., 2004) after these
contigs have been aligned with the reference sequence. Using the
coordinates, the program searches for inversion signatures and
generates a collection of orientations of the sequences that can be
used to construct the scaffolds. The source code of SIS can be
obtained from the URL http://marte.ic.unicamp.br:8747.
CAR (Contig Assembly using Rearrangements) (Lu et al., 2014): uses
NUCmer and PROmer in combination, unlike ABACAS (Assefa et al.,
2009) and SIS (Dias et al., 2012), which use the result of only one
of them. Based on the coordinates, CAR uses a block permutation
model to generate the contig order by considering not only the
effect of genomic inversions, but also the occurrence of
transpositions (Li et al., 2013). CAR can be used from the webserver
http://genome.cs.nthu.edu.tw/CAR/, where the source code is also
available for download.
Considering the main algorithm of each program, it is important to
keep in mind that the most appropriate tool for a given task will
depend on the organism and the availability of reference genomes.
ABACAS (Assefa et al., 2009) is very useful if the reference genome
is larger than the target genome (considering the sum of the length
of all contigs, and that all contigs have a homologous region in the
reference), and the primer designing tools might be helpful in some
cases; however, its sensitivity decreases in cases of structural
divergence. In such cases, other tools, like CONTIGuator (Galardini
et al., 2011) and Mauve (Darling et al., 2004; Rissman et al.,
2009), may be more effective. Finally, SIS (Dias et al., 2012) and
CAR (Lu et al., 2014) are indicated if the draft genome may present
genomic inversions or transpositions. For most applications, these
tools usually provide reliable results, especially for organisms
that do not show a very variable genomic organization, and/or when
there are enough finished genomes to properly choose the best
reference. However, in some situations it may be necessary to
evaluate different tools and references to check which one provides
the best results. Finally, a single reference may also lead to an
“overfitted” ordering, especially when the reference is smaller, or
in the case of genomic inversions and translocations.
Multiple reference scaffolding
Sometimes it is very difficult to identify the most appropriate
reference genome to use for contig ordering, especially when
structural rearrangements are common events in the genus/species of
interest. Additionally, when using BLAST to identify the most
closely related strain from a database of already finished genomes,
it is not unusual to find different strains as the best hit for each
contig. Finally, there are also those cases where no finished genome
is available, but there are draft genomes of related strains. In
these cases it would not be appropriate to use programs that take
into account alignments against only one reference, so data from
multiple organisms should be considered. The use of multiple
references is relatively recent and is another consequence of the
advent of NGS, as there are more draft genomes than finished ones
available in public databases. Examples of algorithms and programs
that use this approach are RACA (Kim et al., 2013), Ragout
(Kolmogorov et al., 2014) and MeDuSa (Bosi et al., 2015).
RACA (Reference-Assisted Chromosome Assembly) (Kim et al., 2013):
uses local sequence alignment to identify co-linear synteny blocks.
The synteny blocks are filtered using a length threshold, and based
on the reference genomes, the probability of each synteny block
being adjacent to the others is calculated. This probability can
also be combined with paired-end information to identify the most
probable set of scaffolds. The source code of RACA can be obtained
from the URL http://bioen-compbio.bioen.illinois.edu/RACA/.
Ragout (Kolmogorov et al., 2014): uses phylogenetic information and
synteny blocks to order a set of contigs from a target genome using
multiple reference genomes. First, Sibelia (Minkin et al., 2013) is
used to identify synteny blocks shared by the target and the
reference sequences. Based on the synteny, the nucleotides of the
genomes are represented as sequences of blocks, and the best “block
orientation” is identified by maximum parsimony, taking into account
the block order in the reference genomes. The source code of Ragout
can be obtained from its repository at GitHub
https://github.com/fenderglass/Ragout.
MeDuSa (Multi-Draft based Scaffolder) (Bosi et al., 2015): is a
graph-based scaffolder that uses information from multiple
references, which can be finished or draft genomes. The program uses
NUCmer to align the target genome to the references and constructs a
weighted graph based on the alignments, in which the nodes are
connected by identifying those contigs that aligned to the same
sequence in the references. In the next step, the orientation of
each contig is assigned based on the alignment information and the
most probable order identified in the graph. The source code of
MeDuSa can be obtained from the repository at GitHub
https://github.com/combogenomics/medusa.
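A weighted contig graph of the kind described above can be sketched
in a few lines. The version below is a deliberately simplified
illustration, not MeDuSa's algorithm: contigs are nodes, and an edge
gains one unit of weight each time two contigs are placed
consecutively on the same reference sequence. The function name, the
input tuples and the weighting scheme are assumptions made for the
example.

```python
from collections import Counter


def adjacency_weights(hits):
    """Build a weighted contig-adjacency graph from multiple references.

    hits: list of (reference, ref_sequence, ref_start, contig) tuples,
          one per contig placement (hypothetical layout; in practice
          derived from the alignment coordinates).
    Returns a Counter {(contig_a, contig_b): weight}, where the weight
    is the number of reference sequences on which the two contigs are
    placed consecutively.
    """
    by_sequence = {}
    for reference, ref_sequence, ref_start, contig in hits:
        by_sequence.setdefault((reference, ref_sequence), []).append(
            (ref_start, contig))

    weights = Counter()
    for placed in by_sequence.values():
        ordered = [contig for _, contig in sorted(placed)]
        for left, right in zip(ordered, ordered[1:]):
            if left != right:
                weights[tuple(sorted((left, right)))] += 1
    return weights


hits = [
    ("refA", "chr", 100, "c1"), ("refA", "chr", 900, "c2"),
    ("refA", "chr", 1800, "c3"),
    ("refB", "s1", 50, "c2"), ("refB", "s1", 700, "c1"),
    ("refB", "s2", 10, "c3"),   # draft reference split into two sequences
]
print(dict(adjacency_weights(hits)))
```

In this toy input the c1-c2 adjacency is supported by both
references, while c2-c3 is supported only by the finished one, so a
scaffolder following the heaviest edges would trust the former more.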
Assembly integration
Different assemblers, or even the same assembler executed with
different configurations, may produce different results. Minimum
coverage, coverage cut-offs, minimum contig length, and k-mer size
are just some examples of parameters that can affect the decisions
of the assembler during the construction of the contigs (Baker,
2012). The way low-quality reads are treated, or how the correct
paths in the assembly graph are constructed, is also different for
each program. As different assemblies may present different
representations of a given region in the genome, the construction of
a “consensus assembly” can be an effective method of reducing
assembly errors and generating an optimized set of contigs. This
process, which is sometimes called “assembly reconciliation,”
“assembly merging,” or “assembly integration,” can receive as input
only a set of assemblies, as implemented in Minimus (Sommer et al.,
2007), Reconciliator (Zimin et al., 2008), MAIA (Nijkamp et al.,
2010), CISA (Lin and Liao, 2013), GAA (Yao et al., 2012) and Mix
(Soueidan et al., 2013), or both a set of assemblies and the reads
used for the assembly, as is the case with GAM-NGS (Vicedomini et
al., 2013) and Zorro (Argueso et al., 2009).
Minimus (Sommer et al., 2007): is an assembly tool from the AMOS
package (Treangen et al., 2011). Initially conceived to perform the
assembly of small genomes, it was later adapted for assembly
integration. The main algorithm is based on the
overlap-layout-consensus paradigm (Peltola et al., 1984), which
involves taking a set of sequences and performing several alignments
to identify overlaps. The information provided by the alignments is
used to construct a graph that is minimized by a combination of
algorithms (Myers, 1995, 2005) to generate a final assembly. Minimus
is available as part of the AMOS package, which can be obtained from
the SourceForge repository https://sourceforge.net/projects/amos/.
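The overlap step of the overlap-layout-consensus paradigm can be
illustrated with the sketch below, which finds exact suffix-prefix
overlaps between sequences and stores them as edges of an overlap
graph. It is a toy example and not the Minimus implementation: real
tools use alignments that tolerate mismatches and indels, and the
function names and the min_overlap value are assumptions.

```python
def overlap_length(left, right, min_overlap=3):
    """Longest suffix of `left` that exactly matches a prefix of `right`.

    Returns 0 if no overlap of at least min_overlap bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def overlap_graph(sequences, min_overlap=3):
    """Directed overlap graph as a dict {(name_a, name_b): overlap}.

    sequences: dict {name: sequence}. An edge (a, b) means that the end
    of sequence a overlaps the start of sequence b.
    """
    edges = {}
    for name_a, seq_a in sequences.items():
        for name_b, seq_b in sequences.items():
            if name_a == name_b:
                continue
            size = overlap_length(seq_a, seq_b, min_overlap)
            if size:
                edges[(name_a, name_b)] = size
    return edges


sequences = {"s1": "ACGTTGCA", "s2": "TGCAGGTC", "s3": "GGTCAAAA"}
print(overlap_graph(sequences))
```

The layout step would then walk this graph (here s1, s2, s3) and the
consensus step would merge the overlapping ends into one sequence.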
Reconciliator (Zimin et al., 2008): uses NUCmer, from the MUMmer
package (Kurtz et al., 2004), to identify assembly errors by
comparing a template with a secondary assembly. With the alignments,
the tool is able to identify the regions that have possibly suffered
compression or expansion due to assembly errors in repetitive DNA
sequences. The source code of Reconciliator is available from the
URL http://www.genome.umd.edu/.
MAIA (Multiple Assembly Integrator) (Nijkamp et al., 2010): uses the
overlap-layout-consensus paradigm in a similar way to Minimus to
construct a graph based on the overlaps identified by MUMmer (Kurtz
et al., 2004). The connections in the graph are used to construct a
new assembly, and contigs that have no connection can be integrated
with the assembly using a reference genome as a template. MAIA was
implemented in the Matlab programming language and is available as a
Matlab package that can be obtained from the URL
http://bioinformatics.tudelft.nl.
GAA (Graph Accordance Assembly) (Yao et al., 2012): is an assembly
integration program based on a data structure of the same name.
Taking a set of contigs as input, the tool uses BLAT (Kent, 2002) to
generate alignments, identify overlaps and then generate a graph
that represents the connections between the contigs. GAA is
available from the SourceForge repository
http://sourceforge.net/projects/gaa-wugi/.
CISA (Contig Integrator for Sequence Assembly) (Lin and Liao, 2013):
uses a four-step algorithm to generate the merged assembly. First, a
set of representative contigs is chosen from the individual
assemblies. Assembly errors are identified by aligning all sets to
one another, and any regions that are present in only one sequence
are considered to be erroneous. In the event of errors, the contigs
are broken at the incorrect portion into smaller sequences. The
third step consists of generating several alignments using BLAST
(Altschul et al., 1990; Camacho et al., 2009) and NUCmer (Kurtz et
al., 2004) to identify the optimal length of repetitive sequences.
The information generated in the third step is used to construct the
merged assembly in the final stage of the program. CISA can be
obtained from the URL http://sb.nhri.org.tw/CISA/.
Mix (Soueidan et al., 2013): uses alignments generated by NUCmer
(Kurtz et al., 2004) to generate an extension graph where the
contigs are connected by their borders. The alignments are filtered
to remove repetitive sequences, and this information is used to
generate a graph. Finally, the algorithm parses the graph to
identify the Maximal Independent Longest Path Set (MILPS) that
represents the final assembly. The Mix source code is available at
the GitHub repository https://github.com/cbib/MIX.
GAM (Genomic Assemblies Merger) (Casagrande et al., 2009) and
GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing)
(Vicedomini et al., 2013): GAM takes an assembly as a template,
which is referred to as the “master”, and extends it using one or
more sets of auxiliary assemblies (called “slaves”)