
Optimizing taxonomic classification of marker gene amplicon sequences

Nicholas A. Bokulich1#*, Benjamin D. Kaehler2#*, Jai Ram Rideout1, Matthew Dillon1, Evan Bolyen1, Rob Knight3, Gavin A. Huttley2#, J. Gregory Caporaso1,4,#

1 The Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, AZ, USA

2 Research School of Biology, Australian National University, Canberra, Australia

3 Departments of Pediatrics and Computer Science & Engineering, and Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA

4 Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA

* These authors contributed equally

# Corresponding authors

Gregory Caporaso
Department of Biological Sciences
1298 S Knoles Drive
Building 56, 3rd Floor
Northern Arizona University
Flagstaff, AZ, USA
(303) 523-5485
(303) 523-4015 (fax)
Email: gregcaporaso@gmail.com

Nicholas Bokulich
The Pathogen and Microbiome Institute
PO Box 4073
Flagstaff, Arizona 86011-4073, USA
Email: [email protected]

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.3208v2 | CC BY 4.0 Open Access | rec: 17 Jan 2018, publ: 17 Jan 2018

Benjamin Kaehler
Research School of Biology
46 Sullivans Creek Road,
The Australian National University,
Acton ACT 2601, Australia
Email: benjamin.kaehler@anu.edu.au

Gavin Huttley
Research School of Biology
46 Sullivans Creek Road,
The Australian National University,
Acton ACT 2601, Australia
Email: [email protected]

Abstract

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis.

Results: We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data.

Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.

Background

High-throughput sequencing technologies have transformed our ability to explore complex microbial communities, offering insight into microbial impacts on human health [1] and global ecosystems [2]. This is achieved most commonly by sequencing short, conserved marker genes amplified with ‘universal’ PCR primers, such as 16S rRNA genes for bacteria and archaea, or internal transcribed spacer (ITS) regions for fungi. Targeted marker-gene primers can also be used to profile specific taxa or functional groups, such as nifH genes [3]. These sequences often are compared against an annotated reference sequence database to determine the likely taxonomic origin of each sequence with as much specificity as possible. Accurate and specific taxonomic information is a crucial component of many experimental designs.

Challenges in this process include the short length of typical sequencing reads with current technology, sequencing and PCR errors [4], selection of appropriate marker genes that contain sufficient heterogeneity to differentiate target species but that are homogeneous enough in some regions to design broad-spectrum primers, quality of reference sequence annotations [5], and selection of a method that accurately predicts the taxonomic affiliation of millions of sequences at low computational cost. Numerous methods have been developed for taxonomy classification of DNA sequences, but few have been directly compared in the specific case of short marker-gene sequences.

We introduce q2-feature-classifier, a QIIME 2 (https://qiime2.org/) plugin for taxonomy classification of marker-gene sequences. QIIME 2 is the successor to the QIIME [6] microbiome analysis package. The q2-feature-classifier plugin supports use of any of the numerous machine-learning classifiers available in scikit-learn [7][8] for marker gene taxonomy classification, and currently provides two alignment-based taxonomy consensus classifiers based on BLAST+ [9] and vsearch [10]. We evaluate the latter two methods and the scikit-learn multinomial naive Bayes classifier (labelled “Naive Bayes” in the Results section) for the first time. We show that the classifiers provided in q2-feature-classifier match or outperform the classification accuracy of several widely-used methods for sequence classification, and that performance of the naive Bayes classifier can be significantly increased by providing it with information regarding expected taxonomic composition.

We also developed tax-credit (https://github.com/caporaso-lab/tax-credit-code/ and https://github.com/caporaso-lab/tax-credit-data/), an extensible computational framework for evaluating taxonomy classification accuracy. This framework streamlines the process of methods benchmarking by compiling multiple different test data sets, including mock communities [11] and simulated sequence reads. It additionally stores pre-computed results from previously evaluated methods, including the results presented here, and provides a framework for parameter sweeps and method optimization. tax-credit could be used as an evaluation framework by other research groups in the future, or its raw data could be easily extracted for integration in another evaluation framework.

Results

We used tax-credit to optimize and compare multiple marker-gene sequence taxonomy classifiers. We evaluated two commonly used classifiers that are wrapped in QIIME 1 (RDP Classifier (version 2.2) [12], legacy BLAST (version 2.2.22) [13]), two QIIME 1 alignment-based consensus taxonomy classifiers (the default UCLUST classifier available in QIIME 1 (based on version 1.2.22q) [14], and SortMeRNA (version 2.0 29/11/2014) [15]), two alignment-based consensus taxonomy classifiers newly released in q2-feature-classifier (based on BLAST+ (version 2.6.0) [9] and vsearch (version 2.0.3) [10]), and a new multinomial naive Bayes machine-learning classifier in q2-feature-classifier (see Materials and Methods for information about q2-feature-classifier methods and source code availability). We performed parameter sweeps to determine optimal parameter configurations for each method.


Mock community evaluations

We first benchmarked classifier performance on mock communities, which are artificially constructed mixtures of microbial cells or DNA combined at known ratios [11]. We utilized 15 bacterial 16S rRNA gene mock communities and 4 fungal internal transcribed spacer (ITS) mock communities (Table 1) sourced from mockrobiota [11], a public repository for mock community data. Mock communities are useful for method benchmarking because: 1) unlike for simulated communities, they allow quantitative assessments of method performance under actual operating conditions, i.e., incorporating real sequencing errors that can be difficult to model accurately; and 2) unlike for natural community samples, the actual composition of a mock community is known in advance, allowing quantitative assessments of community profiling accuracy.

An additional priority was to test the effect of setting class weights on classification accuracy for the naive Bayes classifier implemented in q2-feature-classifier. In machine learning, class weights or prior probabilities are vectors of weights that specify the frequency at which each class is expected to be observed (and should be distinguished from the use of this term under Bayesian inference as a probability distribution of weights vectors). An alternative to setting class weights is to assume that each query sequence is equally likely to belong to any of the taxa that are present in the reference sequence database. This assumption, known as uniform class priors in the context of a naive Bayes classifier, is made by the RDP classifier [12], and its impact on marker-gene classification accuracy has yet to be validated. Making either assumption, that the class weights are uniform or known to some extent, will affect results and cannot be avoided. The mock communities have taxonomic abundances that are far from uniform over the set of reference taxonomies, as any real data set must. We can therefore use them to assess the impact of making assumptions regarding class weights. Where we have set the class weights to the known taxonomic composition of a sample, we have labelled the results “bespoke”.
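
The influence of class weights on a naive Bayes classification can be illustrated with a minimal sketch. This is not the q2-feature-classifier implementation (which builds on scikit-learn); it assumes a toy multinomial naive Bayes model over k-mer counts with Laplace smoothing, and the taxon names, sequences, and prior values are hypothetical. When two taxa have identical reference sequences the likelihoods tie, so the posterior is decided entirely by the class weights.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k):
    """Count k-mers per taxon from a {taxon: [sequences]} mapping."""
    counts = {taxon: Counter() for taxon in references}
    for taxon, seqs in references.items():
        for seq in seqs:
            counts[taxon].update(kmers(seq, k))
    vocabulary = set().union(*counts.values())
    return counts, vocabulary


def posterior(query, counts, vocabulary, priors, k, alpha=1.0):
    """Posterior probability of each taxon given the query (toy model)."""
    log_scores = {}
    for taxon, kmer_counts in counts.items():
        total = sum(kmer_counts.values()) + alpha * len(vocabulary)
        log_p = math.log(priors[taxon])  # class weight enters here
        for kmer in kmers(query, k):
            log_p += math.log((kmer_counts[kmer] + alpha) / total)
        log_scores[taxon] = log_p
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}


# Hypothetical reference set: two taxa with identical marker sequences.
references = {"taxon_a": ["ACGTACGTAC"], "taxon_b": ["ACGTACGTAC"]}
counts, vocabulary = train(references, k=4)

uniform = posterior("ACGTACGT", counts, vocabulary,
                    {"taxon_a": 0.5, "taxon_b": 0.5}, k=4)
bespoke = posterior("ACGTACGT", counts, vocabulary,
                    {"taxon_a": 0.9, "taxon_b": 0.1}, k=4)
```

With uniform weights the query is split evenly between the two taxa; with the hypothetical bespoke weights the posterior follows the expected composition, which is the intuition behind the “bespoke” results.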

We evaluated classifier performance accuracy on mock community sequences classified at taxonomic levels from class through species. Mock community sequences were classified using the Greengenes 99% OTUs 16S rRNA gene or UNITE 99% OTUs ITS reference sequences for bacterial and fungal mock communities, respectively. As expected, classification accuracy decreased as classification depth increased, and all methods could predict the taxonomic affiliation of mock community sequences down to genus level with median F-measures exceeding 0.8 across all parameter sets (minimum: UCLUST F=0.81, maximum: Naive Bayes Bespoke F=1.00) (Figure 1A). However, species affiliation was predicted with much lower and more variable accuracy among method configurations (median F-measure minimum: UCLUST F=0.42, maximum: Naive Bayes Bespoke F=0.95), highlighting the importance of parameter optimization (discussed in more detail below). Figure 1A illustrates line plots of mean F-measure at each taxonomic level, averaged across all classifier configurations; hence, classifier performance is underestimated for some classifiers that are strongly affected by parameter configurations or for which a wider range of parameters were tested (e.g., Naive Bayes). Comparing only optimized methods (i.e., the top-performing parameter configurations for each method), Naive Bayes Bespoke achieved significantly higher F-measure (paired t-test P<0.05) (Figure 1B), recall, taxon detection rate, taxon accuracy rate (Figure 1C), and lower Bray-Curtis dissimilarity than all other methods (Figure 1D).
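
Bray-Curtis dissimilarity is not defined in this section. As a minimal sketch, assuming the standard definition applied to the expected and observed taxon abundances of a mock community (the taxon names and abundances below are hypothetical), it can be computed as follows; 0 means the observed profile matches the expected one exactly and 1 means the two share no taxa.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(expected) | set(observed)
    # Sum of absolute abundance differences over all taxa in either profile.
    diff = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    total = sum(expected.values()) + sum(observed.values())
    return diff / total if total else 0.0


# Hypothetical mock community: one of two expected genera is misassigned.
expected = {"g__Lactobacillus": 0.5, "g__Escherichia": 0.5}
observed = {"g__Lactobacillus": 0.5, "g__Shigella": 0.5}
dissimilarity = bray_curtis(expected, observed)
```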

Mock communities are necessarily simplistic, and cannot assess method performance across a diverse range of taxa. Sequences matching the expected mock community sequences are not removed from the reference database prior to classification, in order to replicate normal operating conditions and assess recovery of expected sequences. However, this approach may implicitly bias toward methods that find an exact match to the query sequences, and does not approximate natural microbial communities in which few or no detected sequences exactly match the reference sequences. Hence, we performed simulated sequence read classifications (described below) to further test classifier performance.

Cross-validated taxonomy classification

Simulated sequence reads, derived from reference databases, allow us to assess method performance across a greater diversity of sequences than a single mock community generally encompasses. We first evaluated classifier performance using stratified k-fold cross-validation of taxonomy classification to simulated reads. The k-fold cross-validation strategy is modified slightly to account for the hierarchical nature of taxonomic classifications, which all of the classifiers in this study (with the exception of legacy BLAST) handle by assigning the lowest (i.e., most specific) taxonomic level where the classification surpasses some user-defined “confidence” or “consensus” threshold (see Materials and Methods). The modification is to truncate any expected taxonomy in each test set to the maximum level at which an instance of that taxonomy exists in the training set. Simulated reads were generated from Greengenes 99% OTUs 16S rRNA gene or UNITE 99% OTUs ITS reference sequences with species-level annotations. Greengenes 16S rRNA gene simulated reads were generated from full-length 16S rRNA genes (primers 27F/1492R) and V4 (primers 515F/806R) and V1-3 sub-domains (primers 27F/534R). The simulated reads do not incorporate artificial sequencing errors (see Materials and Methods for more details). In this set of tests and below for novel taxa, the “bespoke” classifier had prior probabilities that were inferred from the training set each time it was trained.
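
The truncation step described above can be sketched as follows. This is an illustrative reading rather than the tax-credit code: it assumes taxonomies are semicolon-delimited strings, and the Greengenes-style rank prefixes and function names are hypothetical.

```python
def prefixes(taxonomy):
    """All lineage prefixes of a semicolon-delimited taxonomy string."""
    levels = [level.strip() for level in taxonomy.split(";")]
    return [tuple(levels[:i]) for i in range(1, len(levels) + 1)]


def truncate_expected(expected, training_taxonomies):
    """Truncate an expected taxonomy to the deepest level seen in training."""
    known = {p for t in training_taxonomies for p in prefixes(t)}
    best = ()
    for prefix in prefixes(expected):
        if prefix not in known:
            break  # deeper levels cannot be known once a level is missing
        best = prefix
    return ";".join(best)


# Hypothetical training set that lacks the test sequence's species.
training = [
    "k__Bacteria;f__Lactobacillaceae;g__Lactobacillus;s__brevis",
    "k__Bacteria;f__Streptococcaceae;g__Streptococcus;s__mutans",
]
truncated = truncate_expected(
    "k__Bacteria;f__Lactobacillaceae;g__Lactobacillus;s__casei", training
)
```

Here the expected label is cut back to the genus, because no training sequence carries the species label; a classifier is then scored against a label it could in principle have learned.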

Classification of cross-validated reads performed better at coarser levels of classification (Figure 2A), similar to the trend observed in mock community results. For bacterial sequences, average classification accuracy for all methods declined from near-perfect scores at family level (V4 domain median F-measure minimum: BLAST+ F=0.92, maximum: legacy BLAST F=0.99), but still retained accurate scores at species level (median minimum: BLAST+ F=0.76, maximum: SortMeRNA F=0.84), relative to some mock community data sets (Figure 2A). Fungal sequences exhibited similar performance, with the exception that mean BLAST+ and vsearch performance was markedly lower at all taxonomic levels, indicating high sensitivity to parameter configurations, and species-level F-measures were in general much lower (median minimum: BLAST+ F=0.17, maximum: UCLUST F=0.45) than those of bacterial sequence classifications (Figure 2A).

Species-level classifications of 16S rRNA gene simulated sequences were best with optimized UCLUST and SortMeRNA configurations for V4 domain, and Naive Bayes and RDP for V1-3 domain and full-length 16S rRNA gene sequences (Figure 2B). UCLUST achieved the highest F-measure for ITS classification (F = 0.51). However, all optimized classifiers achieved similar F-measure ranges, with the exception of legacy BLAST for ITS sequences (Figure 2B).

Species-level classification performance of 16S rRNA gene simulated reads was significantly correlated between each sub-domain and the full-length gene sequences (Figure 2C). In our tests, full-length sequences exhibited slightly lower accuracy than V1-3 and V4 sub-domains. The relative performance of full-length 16S rRNA genes versus hypervariable sub-domain reads is variable in the literature [12,16–21], and our results add another data point to the ongoing discussion of this topic. Nevertheless, species-level classifications yielded strong correlation between method configurations (Figure 2C) and optimized method performance (Figure 2B), suggesting that primer choice impacts classification accuracy uniformly across all methods. Hence, we focused on V4 sub-domain reads for downstream analyses.

Novel taxon classification evaluation

Novel taxon classification offers a unique perspective on classifier behavior, assessing how classifiers perform when challenged with a “novel” clade that is not represented in the reference database [22–25]. An ideal classifier should identify the nearest taxonomic lineage to which this taxon belongs, but no further. In this evaluation, a reference database is subsampled k times to generate query and reference sequence sets, as for cross-validated classification, but two important distinctions exist: 1) the reference database used for classification excludes any sequence that matches the taxonomic affiliation of the query sequences at taxonomic level L, the taxonomic rank at which classification is being attempted; and 2) this is performed at each taxonomic level, in order to assess classification performance when each method encounters a “novel” species, genus, family, etc.

Due to these differences, interpretation of novel taxon classification results is different from that of mock community and cross-validated classifications. For the latter, classification accuracy may be assessed at each taxonomic level for each classification result: mean classification accuracy at family level and species level evaluate the same results but focus on different taxonomic levels of classification. For novel taxa, however, different query and reference sequences are compiled for classification at each taxonomic level and separate classifications are performed for each. Hence, classifications at family and species level are independent events: one assesses how accurately each method performs when it encounters a “novel” family that is not represented in the reference database, the other when a “novel” species is encountered.

Novel taxon evaluations employ a suite of modified metrics, to provide more information on what types of classification errors occur. Precision, recall, and F-measure calculations at each taxonomic level L assess whether an accurate taxonomy classification was made at level L-1: for example, a “novel” species should be assigned a genus, because the correct species class is not represented within the reference database. Any species-level classification in this scenario is an overclassification (affecting both recall and precision) [25]. Overclassification is one of the key metrics for novel taxa evaluation, indicating the degree to which novel sequences will be interpreted as known organisms. This overclassification is often highly undesirable because it leads, for example, to the incorrect classification of unknown but harmless environmental sequences as known pathogens. Novel sequences that are classified within the correct clade, but to a less specific level than L, are underclassified (affecting recall but not precision) [25]. Sequences that are classified into a completely different clade are misclassified (affecting both recall and precision) [25].
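
The three error types can be made concrete with a small sketch. This is one possible reading of the definitions above, not the tax-credit implementation: it assumes lineages are tuples of rank labels, that the expected lineage has already been truncated to level L-1, and that a prediction shorter than the expected lineage but consistent with it counts as underclassified. The function and label names are hypothetical.

```python
def novel_taxon_outcome(predicted, expected):
    """Label a prediction against an expected lineage truncated at level L-1."""
    if predicted == expected:
        return "match"
    shared = min(len(predicted), len(expected))
    if predicted[:shared] != expected[:shared]:
        # The prediction leaves the correct clade at some shared rank.
        return "misclassification"
    if len(predicted) > len(expected):
        # Correct clade, but more specific than the reference can support.
        return "overclassification"
    # Correct clade, but less specific than the expected lineage.
    return "underclassification"


# Hypothetical "novel" species: the genus is the deepest correct answer.
expected = ("k__Bacteria", "f__Lactobacillaceae", "g__Lactobacillus")
over = novel_taxon_outcome(expected + ("s__casei",), expected)
under = novel_taxon_outcome(expected[:2], expected)
mis = novel_taxon_outcome(("k__Bacteria", "f__Bacillaceae"), expected)
```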

Precision, recall, and F-measure all gradually increase from average scores near 0.0 at class level, reaching peak scores at genus level for bacteria and species level for fungi (Figure 3A-C). These trends are paired with gradual decreases in underclassification and misclassification rates for all classification methods, indicating that all classifiers perform poorly when they encounter sequences with no known match at the class, order, or family levels (Figure 3D-F). At species level, UCLUST, BLAST+, and vsearch achieved significantly better F-measures than all other methods for 16S rRNA gene classifications (P<0.05) (Figure 3G). UCLUST achieved significantly better F-measures than all other methods for ITS classifications (Figure 3G). Over-, under-, and misclassification scores are less informative for optimizing classifiers for real use cases, as most methods could be optimized to yield near-zero scores for each of these metrics separately, but only through extreme configurations, leading to F-measures that would be unacceptable under any scenario. Note that all comparisons were made between methods optimized to maximize (or minimize) a single metric, and hence the configurations that maximize precision are frequently different from those that maximize recall or other metrics. This trade-off between different metrics is discussed in more detail below.

The novel taxon evaluation provides an estimate of classifier performance given a specific reference database, but its generalization is limited by the quality of the reference databases available and by the label-based approach used for partitioning and evaluation. Mislabeled and polyphyletic clades in the database, e.g., Clostridium group, increase the probability of misclassification. A complementary analysis based on sequence similarity between a novel query and top reference hit could mitigate this issue. However, we choose to apply a label-based approach, as it better reflects the biological problem that users can expect to encounter; i.e., using a particular reference sequence database (which will contain some quantity of mislabeled and polyphyletic taxa inherent to currently available resources), how likely is a classifier to misclassify a taxonomic label?

Multi-evaluation method optimization

The mock community and cross-validation classification evaluations yielded similar trends in configuration performance, but optimizing parameter choices for the novel taxa generally led to suboptimal choices for the mock community and cross-validation tests (Figure 4). We sought to determine the relationship between method configuration performance for each evaluation, and use this information to select configurations that perform best across all evaluations. For 16S rRNA gene sequence species-level classification, method configurations that achieve maximum F-measures for mock and cross-validated sequences perform poorly for novel taxon classification (Figure 4B). Optimization is more straightforward for genus-level classification of 16S rRNA gene sequences (Figure 4A) and for fungal sequences (Figure 4C-D), for which configuration performance (measured as mean F-measure) is maximized by similar configurations among all three evaluations.

To identify optimal method configurations, we set accuracy score minimum thresholds for each evaluation by identifying natural breaks in the range of quality scores, selecting methods and parameter ranges that met these criteria. Table 2 lists method configurations that maximize species-level classification accuracy scores for mock community, cross-validated, and novel taxon evaluations under several common operating conditions. “Balanced” configurations are recommended for general use, and are methods that maximize F-measure scores. “Precision” and “Recall” configurations maximize precision and recall scores, respectively, for mock, cross-validated, and novel-taxa classifications (Table 2). “Novel” configurations optimize F-measure scores for novel taxon classification, and secondarily for mock and cross-validated performance (Table 2). These configurations are recommended for use with sample types that are expected to contain large proportions of unidentified species, for which overclassification can be excessive. However, these configurations may not perform optimally for classification of known species (i.e., underclassification rates will be higher). For fungi, the same configurations recommended for “Precision” perform well for novel taxon classification (Table 2). For 16S rRNA gene sequences, BLAST+, UCLUST, and vsearch consensus classifiers perform best for novel taxon classification (Table 2).


Computational runtime

High-throughput sequencing platforms (and experiments) continue to yield increasing sequence counts, which, even after quality filtering and dereplication or operational taxonomic unit clustering steps common to most microbiome analysis pipelines, may exceed thousands of unique sequences that need classification. Increasing numbers of query sequences and reference sequences may lead to unacceptable runtimes, and under some experimental conditions the top-performing method (based on precision, recall, or some other metric) may be insufficient to handle large numbers of sequences within an acceptable time frame. For example, quick turnarounds may be vital under clinical scenarios as microbiome evaluation becomes common clinical practice, or commercial scenarios, when large sample volumes and client expectations may constrain turnaround times and method selection.

We assessed computational runtime as a linear function of 1) the number of query sequences and 2) the number of reference sequences. Linear dependence is empirically evident in Figure 5. For both of these metrics, the slope is the most important measure of performance. The intercept may include the amount of time taken to train the classifier, preprocess the reference sequences, load preprocessed data, or other “setup” steps that will diminish in significance as sequence counts grow, and hence is negligible.
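
A slope and intercept of this kind can be estimated with an ordinary least-squares fit. The sketch below is a generic illustration, not the analysis code used for Figure 5; the sequence counts and timings are made-up values chosen to lie exactly on a line.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up timings: 5 s of setup plus 0.002 s per query sequence.
query_counts = [1000, 2000, 4000, 8000]
runtimes = [5.0 + 0.002 * n for n in query_counts]
slope, intercept = fit_line(query_counts, runtimes)
```

In this toy example the slope recovers the per-sequence cost and the intercept recovers the fixed setup time, which is why the slope is the quantity that matters as sequence counts grow.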

UCLUST (0.000028 s/sequence), vsearch (0.000072 s/sequence), BLAST+ (0.000080 s/sequence), and legacy BLAST (0.000100 s/sequence) all exhibit shallow slopes with increasing numbers of reference sequences. Naive Bayes (0.000483 s/sequence) and SortMeRNA (0.000543 s/sequence) yield moderately higher slopes, and RDP (0.001696 s/sequence) demonstrates the steepest slope (Figure 5A). For runtime as a function of query sequence count, UCLUST (0.002248 s/sequence), RDP (0.002920 s/sequence), and SortMeRNA (0.003819 s/sequence) have relatively shallow slopes (Figure 5B). Naive Bayes (0.022984 s/sequence), BLAST+ (0.026222 s/sequence), and vsearch (0.030190 s/sequence) exhibit greater slopes. Legacy BLAST (0.133292 s/sequence) yielded a slope several times higher than that of any other method, rendering this method impractical for large data sets.
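
To put the query-count slopes in practical terms, the following sketch multiplies each reported slope by a number of query sequences. It deliberately ignores the intercept (setup time), so the values are rough lower bounds rather than predictions; the choice of 100,000 query sequences and the function name are hypothetical.

```python
# Slopes (seconds per query sequence) as reported above for Figure 5B.
QUERY_SLOPES = {
    "UCLUST": 0.002248,
    "RDP": 0.002920,
    "SortMeRNA": 0.003819,
    "Naive Bayes": 0.022984,
    "BLAST+": 0.026222,
    "vsearch": 0.030190,
    "legacy BLAST": 0.133292,
}


def estimated_query_seconds(method, n_queries):
    """Slope-only runtime estimate; setup time (the intercept) is ignored."""
    return QUERY_SLOPES[method] * n_queries


n_queries = 100_000  # hypothetical number of unique query sequences
estimates = {m: estimated_query_seconds(m, n_queries) for m in QUERY_SLOPES}
# UCLUST comes to roughly 225 s, legacy BLAST to roughly 13,329 s (about 3.7 h).
```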

Discussion

We have developed and validated several machine-learning and alignment-based classifiers provided in q2-feature-classifier and benchmarked these classifiers, as well as other common classification methods, to evaluate their strengths and weaknesses for marker-gene amplicon sequence classification across a range of parameter settings for each (Table 2).

Each classifier required some degree of optimization to define top-performing parameter configurations, with the sole exception of QIIME 1’s legacy BLAST wrapper, which was unaffected by its only user-defined parameter, e-value, over a range of 10^-10 to 1000. For all other methods, performance varied widely depending on parameter settings, and a single method could achieve among the worst performance with one configuration but among the best performance with another. Configurations greatly affected accuracy with mock community, cross-validated, and novel taxon evaluations, indicating that optimization is necessary under a variety of performance conditions, and optimization for one condition may not necessarily translate to another. Mock community and cross-validated evaluations exhibited similar results, but novel taxon evaluations selected different optimal configurations for most methods (Figure 4), indicating that configurations optimized to one condition, e.g., high-recall classification of known sequences, may be less suited for other conditions, e.g., classification of novel sequences. Table 2 lists the top-performing configuration for each method for several standard performance conditions.

Optimal configurations also varied among different evaluation metrics. Precision and recall, in particular, exhibited some mutual opposition, such that methods increasing precision reduced recall. For this reason, F-measure, the harmonic mean of precision and recall, is a useful metric for choosing configurations that are well balanced for average performance. “Balanced” method configurations, which maximize F-measure scores for mock, cross-validated, and novel taxon evaluations (Table 2), are best suited for a wide range of user conditions. The naive Bayes classifier with k-mer lengths of 6 or 7 and confidence = 0.7 (or confidence ≥ 0.9 if using bespoke class weights), RDP with confidence = 0.6-0.7, and UCLUST (minimum consensus = 0.51, minimum similarity = 0.9, max accepts = 3) perform best under these conditions (Table 2). Performance is dramatically improved using bespoke class weights for 16S rRNA sequences (Figure 4A-B), though this approach is developmental and only applicable when the expected composition of samples is known in advance (a scenario that is becoming increasingly common with the increasing quantity of public microbiome data, and which could be aided by microbiome data sharing resources such as Qiita (http://qiita.microbio.me)). For ITS sequences, the naive Bayes classifier with k-mer lengths of 6 or 7 and confidence ≥ 0.9, or RDP with confidence = 0.7-0.9, perform best, and the effects of bespoke class weights are less pronounced (Figure 4C-D).
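
The balancing behavior of the F-measure follows directly from its definition as the harmonic mean of precision and recall. A minimal sketch, with hypothetical precision and recall values:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical configurations with the same arithmetic mean (0.7).
balanced = f_measure(0.7, 0.7)   # evenly balanced precision and recall
lopsided = f_measure(0.9, 0.5)   # high precision bought at the cost of recall
```

The lopsided configuration scores about 0.64 against 0.70 for the balanced one, even though their arithmetic means are equal, so maximizing F-measure favors configurations that do not sacrifice one metric for the other.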

However, some users may require high-precision classifiers when false positives may be more damaging to the outcome, e.g., for detection of pathogens in a sample. Precision scores are maximized by naive Bayes and RDP classifiers with high confidence settings (Table 2). Optimizing for precision will significantly damage recall by yielding a high number of false negatives.

Other users may require high-recall classifiers when false negatives and underclassification hinder interpretation, but false positives (mostly overclassification to a closely related species) are less damaging. For example, in environments with high numbers of unidentified species, a high-precision classifier may yield large numbers of unclassified sequences; in such cases, a second pass with a high-recall configuration (Table 2) may provide useful inference of what taxa are most similar to these unclassified sequences. When recall is optimized, precision tends to suffer slightly (leading to similar F-measure scores to “balanced” configurations) but novel taxon classification accuracy is minimized, as these configurations tend to overclassify (Table 2). Any user prioritizing recall ought to be aware of and acknowledge these risks, e.g., when sharing or publishing their results, and understand that many of the species-level classifications may be wrong, particularly if the samples are expected to contain many uncharacterized species. For 16S rRNA gene sequences, naive Bayes bespoke classifiers with k-mer lengths between 12 and 32 and confidence = 0.5 yield maximal recall scores, but RDP (confidence = 0.5) and naive Bayes (uniform class weights, confidence = 0.5, k-mer length = 11, 12, or 18) also perform well (Table 2). Fungal recall scores are maximized by the same configurations recommended for “Balanced” classification, i.e., naive Bayes classifiers with k-mer lengths of 6 to 7 and confidence between 0.92 and 0.98, or RDP with confidence between 0.7 and 0.9 (Table 2).

Runtime requirements may also be the chief concern dictating method selection for some users. QIIME 1’s UCLUST wrapper provides the fastest runtime while still achieving reasonably good performance for most evaluations; Naive Bayes, RDP, and BLAST+ also delivered reasonably low runtime requirements, and outperform UCLUST on most other evaluation metrics.

This study did not compare methods for classification of shotgun metagenome sequencing datasets, which present a series of unique challenges that do not exist for marker-gene amplicon sequence data. These include much higher unique sequence counts (making runtime a greater priority), the use of fully sequenced genomes as reference sequences, and different analysis and quality control protocols. Metagenome sequences also exhibit heterogeneous coverage and length, unlike marker-gene amplicon sequences, which typically have uniform start sites and read lengths within a single sequencing run. A recent benchmark of metagenome taxonomic profiling methods describes similar results to our benchmark of marker-gene sequence classifiers: most profilers perform well from phylum to family level but performance degrades at genus and species levels; different methods display superior performance according to different performance metrics; and parameter configuration dramatically impacts performance [26]. In the current study we focused on benchmarking and optimizing classifiers for marker-gene amplicon sequence data, in light of the distinct needs of metagenome and marker-gene sequence datasets.

Conclusions

The classification methods provided in q2-feature-classifier will support improved taxonomy classification of marker-gene amplicon sequences, and are released as a free, open-source plugin for use with QIIME 2. We demonstrate that these methods perform as well as or better than other leading taxonomy classification methods on a number of performance metrics. The naive Bayes, vsearch, and BLAST+ consensus classifiers described here are released for the first time in QIIME 2, with optimized “balanced” configurations (Table 2) set as defaults.

We also present the results of a benchmark of several widely used taxonomy classifiers for marker-gene amplicon sequences, and recommend the top-performing methods and configurations for the most common user scenarios. Our recommendations for “balanced” methods (Table 2) will be appropriate for most users who are classifying 16S rRNA gene or fungal ITS sequences, but other users may prioritize high-precision (low false-positive) or high-recall (low false-negative) methods.


We have also shown that great potential exists for improving the accuracy of taxonomy classifications by appropriately setting class weights for the machine learning classifiers. Currently, no tools exist that allow users to generate appropriate values for these class weights in real applications. Compiling appropriate class weights for different sample types could be a promising approach to further improve taxonomic classification of marker gene sequence reads.

Methods

Mock communities

All mock communities were sourced from mockrobiota [11]. Raw fastq files were demultiplexed and processed using tools available in QIIME 2 (version 2017.4) (https://qiime2.org/). Reads were demultiplexed with q2-demux (https://github.com/qiime2/q2-demux) and quality filtered and dereplicated with q2-dada2 [4]. Representative sequence sets for each dada2 sequence variant were used for taxonomy classification with each classification method.

The inclusion of multiple mock community samples is important to avoid overfitting; optimizing method performance to a small set of data could result in overfitting to the specific community compositions or conditions under which those data were generated, which reduces the robustness of the classifier.

Cross-validated simulated reads

The simulated reads used here were derived from the reference databases using the “Cross-validated classification performance” notebooks in our project repository. The reference databases were either Greengenes or UNITE (99% OTUs) that were cleaned according to taxonomic label to remove sequences with ambiguous or null labels. Reference sequences were trimmed to simulate amplification using standard PCR primers and to slice out the first 250 bases downstream (3’) of the forward primer. The bacterial primers used were 27F/1492R [27] to simulate full-length 16S rRNA gene sequences, 515F/806R [28] to simulate 16S rRNA gene V4 domain sequences, and 27F/534R [29] to simulate 16S rRNA gene V1-3 domain sequences; the fungal primers used were BITSf/B58S3r [30] to simulate ITS1 internal transcribed spacer DNA sequences. The exact sequences were used for cross validation, and were not altered to simulate any sequencing error; thus, our benchmarks simulate denoised sequence data [4] and isolate classifier performance from the effects of sequencing errors. Each database was stratified by taxonomy and 10-fold randomised cross-validation datasets were generated using scikit-learn’s library functions. Where a taxonomic label had fewer than 10 instances, taxonomies were amalgamated to make sufficiently large strata. If, as a result, a taxonomy in any test set was not present in the corresponding training set, the expected taxonomy label was truncated to the nearest common taxonomic rank observed in the training set (e.g., Lactobacillus casei would become Lactobacillus). The notebook detailing simulated read generation (for both cross-validated and novel taxon reads) prior to taxonomy classification is available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/dataset-generation.ipynb.
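The truncation of expected labels described above can be sketched in a few lines of Python. This is an illustration only, not the code from the project notebooks: the function name `truncate_to_training` and the semicolon-delimited lineage format are assumptions made for the example.

```python
def truncate_to_training(expected, training_labels, sep=";"):
    """Trim `expected` to its longest lineage prefix seen in the training set."""
    # Collect every lineage prefix observed among the training labels.
    observed = set()
    for label in training_labels:
        ranks = [r.strip() for r in label.split(sep)]
        for i in range(1, len(ranks) + 1):
            observed.add(tuple(ranks[:i]))

    ranks = [r.strip() for r in expected.split(sep)]
    # Drop terminal ranks until the remaining lineage occurs in training.
    while ranks and tuple(ranks) not in observed:
        ranks.pop()
    return sep.join(ranks)


# Hypothetical two-rank lineages, shortened for readability.
training = ["g__Lactobacillus;s__brevis", "g__Lactobacillus;s__delbrueckii"]
print(truncate_to_training("g__Lactobacillus;s__casei", training))  # g__Lactobacillus
```

A label that is fully represented in the training set is returned unchanged, and a label that shares no rank with the training set is reduced to an empty string.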

The evaluation of classification performance was also slightly modified from a standard machine-learning scenario, because the classifiers in this study are able to refuse classification below the taxonomic level at which they are confident for a given sample. This also accommodates the taxonomy truncation that we performed for this test. The methodology was consistent with that used below for novel taxon evaluations, but we defer this description to the next section.

“Novel taxon” simulation analysis

“Novel taxon” classification analysis was performed to test the performance of classifiers when assigning taxonomy to sequences that are not represented in a reference database, e.g., as a simulation of what occurs when a method encounters an undocumented species [22–25]. In this analysis, simulated amplicons were filtered from those used for the cross-validation analysis. For all sequences present in each test set, sequences sharing taxonomic affiliation at a given taxonomic level L (e.g., to species level) in the corresponding training set were removed. Taxa are stratified among query and test sets such that for each query taxonomy at level L, no reference sequences match that taxonomy, but at least one reference sequence will match the taxonomic lineage at level L-1 (e.g., same genus but different species). An ideal classifier would assign taxonomy to the nearest common taxonomic lineage (e.g., genus), but would not “overclassify” [25] to near neighbors (e.g., assign species-level taxonomy when species X is removed from the reference database). For example, a “novel” sequence representing the species Lactobacillus brevis should be classified as “Lactobacillus”, without species-level annotation, in order to be considered a true positive in this analysis. As described above for cross-validated reads, these novel taxa simulated communities were also tested in both bacterial (B) and fungal (F) databases on simulated amplicons trimmed to simulate 250-nt sequencing reads.

Novel taxon classification performance is evaluated using precision, recall, F-measure, overclassification rates, underclassification rates, and misclassification rates [25] for each taxonomic level (phylum to species), computed with the following definitions (see below, Performance analyses using simulated reads, for full description of precision, recall, and F-measure calculations):

1) A true positive is considered the nearest correct lineage contained in the reference database. For example, if Lactobacillus brevis is removed from the reference database and used as a query sequence, the only correct taxonomy classification would be “Lactobacillus”, without species-level classification.

2) A false positive would be either a classification to a different Lactobacillus species (Overclassification), or any genus other than Lactobacillus (Misclassification).

3) A false negative occurs if an expected taxonomy classification (e.g., “Lactobacillus”) is not observed in the results. Note that this will be the modified taxonomy expected when using a naive reference database, and is not the same as the true taxonomic affiliation of a query sequence in the novel taxa analysis. A false negative results from misclassification, overclassification, or when the classification contains the correct basal lineage, but does not assign a taxonomy label at level L (Underclassification). E.g., classification as “Lactobacillaceae”, but no genus-level classification.
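The three definitions above amount to comparing a predicted lineage with the expected (truncated) lineage rank by rank. The following minimal sketch shows one way to express that comparison; the function name `score_novel_taxon` and the list-of-ranks representation are assumptions for illustration and are not taken from the tax-credit code.

```python
def score_novel_taxon(predicted, expected):
    """Label one prediction against the expected (truncated) lineage.

    Both arguments are lists of rank labels ordered from basal to terminal.
    """
    if predicted == expected:
        return "true positive"
    if predicted[:len(expected)] == expected:
        # Correct lineage, but extended beyond the expected rank.
        return "overclassification"
    if predicted == expected[:len(predicted)]:
        # Correct basal lineage, but stops before the expected rank.
        return "underclassification"
    return "misclassification"


expected = ["f__Lactobacillaceae", "g__Lactobacillus"]
print(score_novel_taxon(expected + ["s__casei"], expected))                       # overclassification
print(score_novel_taxon(["f__Lactobacillaceae"], expected))                       # underclassification
print(score_novel_taxon(["f__Lactobacillaceae", "g__Pediococcus"], expected))     # misclassification
```

Under the definitions given in the text, overclassification and misclassification each count as both a false positive and a false negative, while underclassification counts only as a false negative.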

Taxonomy classification

Representative sequences for all analyses (mock community, cross-validated, and novel taxa) were classified taxonomically using the following taxonomy classifiers and setting sweeps:

1. q2-feature-classifier multinomial naive Bayes classifier. Varied k-mer length in {4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32} and confidence threshold in {0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1}.

2. BLAST+ [9] local sequence alignment, followed by consensus taxonomy classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See description below.

3. vsearch [10] global sequence alignment, followed by consensus taxonomy classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See description below.

4. Ribosomal Database Project (RDP) naïve Bayesian classifier [12] (QIIME1 wrapper), with confidence thresholds from 0.0 to 1.0 in steps of 0.1.

5. Legacy BLAST [13] (QIIME1 wrapper) varying e-value thresholds from 1e-9 to 1000.

6. SortMeRNA [15] (QIIME1 wrapper) varying minimum consensus fraction from 0.51 to 0.99; similarity from 0.8 to 0.9; max accepts from 1 to 10; and coverage from 0.8 to 0.9.

7. UCLUST [14] (QIIME1 wrapper) varying minimum consensus fraction from 0.51 to 0.99; similarity from 0.8 to 0.9; and max accepts from 1 to 10.

With the exception of the UCLUST classifier, we have only benchmarked the performance of open-source, free, marker-gene-agnostic classifiers, i.e., those that can be trained/aligned on a reference database of any marker gene. Hence, we excluded classifiers that can only assign taxonomy to a particular marker gene (e.g., only bacterial 16S rRNA genes) and those that rely on specialized or unavailable reference databases and cannot be trained on other databases, effectively restricting their use for other marker genes and custom databases.

Classification of bacterial/archaeal 16S rRNA gene sequences was made using the Greengenes (13_8 release) [5] reference sequence database preclustered at 99% ID, with amplicons for the domain of interest extracted using primers 27F/1492R [27], 515F/806R [28], or 27F/534R [29] with q2-feature-classifier’s extract_reads method. Classification of fungal ITS sequences was made using the UNITE database (version 7.1 QIIME developer release) [31] preclustered at 99% ID. For the cross validation and novel taxon classification tests we prefiltered to remove sequences with incomplete or ambiguous taxonomies (containing the substrings ‘unknown’, ‘unidentified’, or ‘_sp’ or terminating at any level with ‘__’).
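The prefiltering rule can be read literally as a small predicate over taxonomy strings. The sketch below takes that literal reading (case-sensitive substring matching, semicolon-delimited ranks); the function name `has_complete_taxonomy` and the example labels are hypothetical, and the project notebooks may implement the filter differently.

```python
def has_complete_taxonomy(taxonomy, sep=";"):
    """Return True if a taxonomy string passes the prefilter described in the text."""
    # Reject labels containing any of the listed substrings.
    if any(token in taxonomy for token in ("unknown", "unidentified", "_sp")):
        return False
    # Reject labels where any rank is only a bare prefix such as "s__".
    return not any(rank.strip().endswith("__") for rank in taxonomy.split(sep))


labels = [
    "k__Bacteria; g__Lactobacillus; s__brevis",
    "k__Bacteria; g__Lactobacillus; s__",
    "k__Fungi; g__unidentified; s__unidentified",
]
kept = [label for label in labels if has_complete_taxonomy(label)]
print(kept)  # ['k__Bacteria; g__Lactobacillus; s__brevis']
```

Note that a literal substring test for ‘_sp’ also matches any name that happens to contain those characters, so a production filter would likely need a more specific pattern.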

The notebooks detailing taxonomy classification sweeps of mock communities are available at https://github.com/caporaso-lab/tax-credit-data/tree/0.1.0/ipynb/mock-community. Cross-validated read classification sweeps are available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/cross-validated/taxonomy-assignment.ipynb. Novel taxon classification sweeps are available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/taxonomy-assignment.ipynb.

Runtime analyses

The tax-credit framework employs two different runtime metrics: as a function of 1) the number of query sequences or 2) the number of reference sequences. Taxonomy classifier runtimes were logged while performing classifications of pseudorandom subsets of 1, 2,000, 4,000, 6,000, 8,000, and 10,000 sequences from the Greengenes 99% OTU database. Each subset was drawn once then used for all of the tests as appropriate. All runtimes were computed on the same Linux workstation (Ubuntu 16.04.2 LTS, Intel Xeon CPU E7-4850 v3 @ 2.20GHz, 1TB memory). The exact commands used for runtime analysis are presented in the “Runtime analyses” notebook in the project repository (https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/runtime/analysis.ipynb).

Performance analyses using simulated reads

Cross-validated and novel taxa reads are evaluated using the classic precision, recall, and F-measure metrics [5] (novel taxa use the standard calculations as described below, but modified definitions for true positive (TP), false positive (FP), and false negative (FN), as described above for novel taxon classification analysis).

Precision, recall, and F-measure are calculated as follows:

○ Precision = TP/(TP+FP) or the fraction of sequences that were classified correctly at level L.

○ Recall = TP/(TP+FN) or the fraction of expected taxonomic labels that were predicted at level L.

○ F-measure = 2 × Precision × Recall / (Precision + Recall), or the harmonic mean of precision and recall.
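These three formulas translate directly into code. The helper below is a minimal sketch; the function name `precision_recall_f` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-measure from TP, FP, and FN counts."""
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Example counts: 8 correct classifications, 2 incorrect, 8 expected labels missed.
precision, recall, f_measure = precision_recall_f(tp=8, fp=2, fn=8)
print(precision, recall, round(f_measure, 3))  # 0.8 0.5 0.615
```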

The Jupyter notebook detailing commands used for evaluation of cross-validated read classifications is available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/cross-validated/evaluate-classification.ipynb. The notebook for evaluation of novel taxon classifications is available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/evaluate-classification.ipynb.


Performance analyses using mock communities

The Jupyter notebook detailing commands used for evaluation of mock communities, including the three evaluation types described below, is available at https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/mock-community/evaluate-classification-accuracy.ipynb.

Precision and Recall

Classic precision, recall, and F-measure are used to calculate mock community classification accuracy, using the definitions given above for simulated reads. These metrics require knowing the expected classification of each sequence, which we determine by performing a gapless alignment between each representative sequence in the mock community and the marker-gene sequences of each microbial strain added to the mock community. These “expected sequences” are provided for the mock communities in mockrobiota [11]. Representative sequences are assigned the taxonomy of the best alignment, and any representative sequence with more than 3 mismatches to the expected sequences is excluded from precision/recall calculations. If a representative sequence aligns to more than one expected sequence equally well, all top hits are accepted as the “correct” classification. This scenario is rare and typically only occurred when different strains of the same species were added to the same mock community to intentionally produce this challenge (e.g., for mock-12 as described by [4]). Precision, recall, and F-measure are then calculated by comparing the “expected” classification for each mock community sequence to the classifications predicted by each taxonomy classifier using the full reference databases, as described above.
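The assignment of expected taxonomies can be illustrated with a simple gapless comparison. The sketch below is an assumption-laden stand-in, not the procedure in the evaluation notebook: it slides the shorter sequence along the longer one without gaps, counts mismatches at each offset, applies the 3-mismatch cutoff from the text, and keeps all tied best hits. The names `min_mismatches` and `expected_taxonomy`, the dictionary layout, and the toy sequences are hypothetical.

```python
def min_mismatches(query, reference):
    """Fewest mismatches when sliding the shorter sequence along the longer one, without gaps."""
    short, long_ = sorted((query, reference), key=len)
    best = len(short)
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        best = min(best, sum(a != b for a, b in zip(short, window)))
    return best


def expected_taxonomy(query, expected, max_mismatches=3):
    """Return all best-matching taxa, or [] if the best hit exceeds the cutoff.

    `expected` maps a taxon name to its expected sequence and must not be empty.
    """
    scores = {taxon: min_mismatches(query, seq) for taxon, seq in expected.items()}
    best = min(scores.values())
    if best > max_mismatches:
        return []  # excluded from precision/recall calculations
    return sorted(taxon for taxon, score in scores.items() if score == best)


expected = {"taxon_A": "TTACGTACGTACTT", "taxon_B": "TTACGTTCGTACTT"}
print(expected_taxonomy("ACGTACGTAC", expected))  # ['taxon_A']
```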

Taxon accuracy rate and taxon detection rate

Taxon accuracy rate (TAR) and taxon detection rate (TDR) are used for qualitative compositional analyses of mock communities. As the true taxonomy labels for each sequence in a mock community are not known with absolute certainty, TAR and TDR are useful alternatives to precision and recall that instead rely on the presence/absence of expected taxa, or microbiota that are intentionally added to the mock community. In practice, TAR/TDR are complementary metrics to precision/recall and should provide similar results if the expected classifications for mock community representative sequences are accurate.

At a given taxonomic level, a classification is a:

○ true positive (TP), if that taxon is both observed and expected.

○ false positive (FP), if that taxon is observed but not expected.

○ false negative (FN), if a taxon is expected but not observed.

These are used to calculate TAR and TDR as:

○ TAR = TP/(TP+FP) or the fraction of observed taxa that were expected at level L.

○ TDR = TP/(TP+FN) or the fraction of expected taxa that are observed at level L.
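Because TAR and TDR depend only on the presence or absence of taxa, they reduce to set operations. The following sketch assumes taxa are given as collections of labels at a single taxonomic level; the function name `taxon_accuracy_detection` and the zero-denominator convention are assumptions for illustration.

```python
def taxon_accuracy_detection(observed, expected):
    """Return (TAR, TDR) for collections of observed and expected taxon labels."""
    observed, expected = set(observed), set(expected)
    tp = len(observed & expected)   # observed and expected
    fp = len(observed - expected)   # observed but not expected
    fn = len(expected - observed)   # expected but not observed
    # Assumed convention: report 0.0 when a denominator is zero.
    tar = tp / (tp + fp) if tp + fp else 0.0
    tdr = tp / (tp + fn) if tp + fn else 0.0
    return tar, tdr


tar, tdr = taxon_accuracy_detection(
    observed={"g__Lactobacillus", "g__Bacillus", "g__Escherichia"},
    expected={"g__Lactobacillus", "g__Bacillus", "g__Listeria", "g__Pseudomonas"},
)
print(round(tar, 3), tdr)  # 0.667 0.5
```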

Bray-Curtis Dissimilarity

Bray-Curtis dissimilarity [32] is used to measure the degree of dissimilarity between two samples as a function of the abundance of each species label present in each sample, treating each species as equally related. This is a useful metric for evaluating classifier performance by assessing the relative distance between each predicted mock community composition (abundance of taxa in a sample based on results of a single classifier) and the expected composition of that sample. For each classifier, Bray-Curtis distances between the expected and observed taxonomic compositions are calculated for each sample in each mock community dataset; this yields a single expected-observed distance for each individual observation. The distance distributions for each method are then compared statistically using paired or unpaired t-tests to assess whether one method (or configuration) performs consistently better than another.
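For reference, the Bray-Curtis dissimilarity between two compositions is the sum of absolute abundance differences divided by the sum of all abundances. The sketch below computes it for compositions stored as dictionaries mapping taxon labels to abundances; the function name `bray_curtis` and the example compositions are hypothetical.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two {taxon: abundance} compositions."""
    taxa = set(expected) | set(observed)
    # Taxa missing from one composition contribute an abundance of zero.
    difference = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)
    total = sum(expected.values()) + sum(observed.values())
    return difference / total if total else 0.0


expected = {"s__brevis": 0.5, "s__casei": 0.5}
observed = {"s__brevis": 0.5, "s__delbrueckii": 0.5}
print(bray_curtis(expected, observed))  # 0.5
```

A value of 0 means the predicted composition matches the expected composition exactly, and a value of 1 means the two share no taxa.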

New taxonomy classifiers

We describe q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a plugin for QIIME 2 (https://qiime2.org/) that performs multi-class taxonomy classification of marker-gene sequence reads. In this work we compare the consensus BLAST+ and vsearch methods and the naive Bayes scikit-learn classifier. The software is free and open-source.


Machine learning taxonomy classifiers

The q2-feature-classifier plugin allows users to apply any of the suite of machine learning classifiers available in scikit-learn (http://scikit-learn.org) to the problem of taxonomy classification of marker-gene sequences. It functions as a lightweight wrapper that transforms the problem into a standard document classification problem. Advanced users can input any appropriate scikit-learn classifier pipeline, which can include a range of feature extraction and transformation steps as well as specifying a machine learning algorithm.

The plugin provides a default method, which is to extract k-mer counts from reference sequences and train the scikit-learn multinomial naive Bayes classifier, and it is this method that we test extensively here. Specifically, the pipeline consists of a sklearn.feature_extraction.text.HashingVectorizer feature extraction step followed by a sklearn.naive_bayes.MultinomialNB classification step. The use of a hashing feature extractor allows the use of significantly longer k-mers than the 8-mers that are used by RDP Classifier, and we tested up to 32-mers. As with most scikit-learn classifiers, class weights can be set when training the multinomial naive Bayes classifiers. In the naive Bayes setting, setting class weights means that class priors are not derived from the training data or set to be uniform, as they are for the RDP Classifier. For more detail on how class weights enter the calculations please refer to the scikit-learn User Guide (http://scikit-learn.org).
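The role of class priors in a multinomial naive Bayes classifier over k-mer counts can be illustrated with a self-contained sketch. This is a plain-Python stand-in and not the scikit-learn pipeline used by the plugin: it stores explicit k-mer counts rather than hashed features, its posterior is not the plugin's confidence calculation, and the names `kmer_counts`, `train_nb`, and `classify_nb` as well as the toy sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train_nb(seqs, labels, k, alpha=0.001, class_prior=None):
    """Fit a multinomial naive Bayes model on k-mer counts."""
    per_class = defaultdict(Counter)
    for seq, label in zip(seqs, labels):
        per_class[label].update(kmer_counts(seq, k))
    vocab = {kmer for counts in per_class.values() for kmer in counts}
    if class_prior is None:
        # Default: priors taken from class frequencies in the training set.
        class_prior = {c: n / len(labels) for c, n in Counter(labels).items()}
    log_prior = {c: math.log(class_prior[c]) for c in per_class}
    log_like = {}
    for c, counts in per_class.items():
        # Additive smoothing with parameter alpha.
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {m: math.log((counts[m] + alpha) / total) for m in vocab}
    return log_prior, log_like


def classify_nb(seq, k, log_prior, log_like):
    """Return the most probable class and its posterior probability."""
    query = kmer_counts(seq, k)
    scores = {}
    for c, table in log_like.items():
        # k-mers never seen in training are ignored.
        scores[c] = log_prior[c] + sum(n * table[m] for m, n in query.items() if m in table)
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best] - top) / norm


seqs, labels = ["ACGTACGT", "ACGTACGA"], ["taxon_A", "taxon_B"]
model = train_nb(seqs, labels, k=4)
weighted = train_nb(seqs, labels, k=4, class_prior={"taxon_A": 0.1, "taxon_B": 0.9})
print(classify_nb("ACGTAC", 4, *model)[0])     # taxon_A
print(classify_nb("ACGTAC", 4, *weighted)[0])  # taxon_B
```

With equal priors the query is assigned to taxon_A because one of its k-mers is twice as frequent in that class; supplying priors that strongly favour taxon_B reverses the decision, which is the mechanism by which class weights can change classifications.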


In most settings, it is highly unlikely that the assumption of uniform weights is correct. That assumption is that each of the taxa in the reference database is equally likely to appear in each sample. Setting class weights to more realistic values can greatly aid the classifier in making more accurate predictions, as we show in this work. When testing the mock communities we made use of the fact that the sequence compositions were known a priori for the bespoke classifier. For the simulated reads studies, we allowed the classifier to set the class weights from the class frequencies observed in each training set for the bespoke classifier.

For this study, we performed two parameter sweeps on the mock communities: an initial broad sweep to optimize feature extraction parameters and then a more focussed sweep to optimise k-mer length and confidence parameter settings. These sweeps included varying the assumptions regarding class weights. The focussed sweeps were also performed for the cross-validated and novel taxa evaluations, but only for the assumption of uniform class priors. The results for the focussed sweeps across all datasets are those which are compared against the other classifiers in this work.

The broad sweeps used a modified scikit-learn pipeline which consisted of the sklearn.feature_extraction.text.HashingVectorizer, followed by the sklearn.feature_extraction.text.TfidfTransformer, then the sklearn.naive_bayes.MultinomialNB. We performed a full grid search over the parameters shown in Table 3. The conclusion from the initial sweep was that the TfidfTransformer step did not significantly improve classification, that n_features should be set to 8192, feature vectors should be normalised using L2 normalisation and that the alpha parameter for the naive Bayes classifier should be set to 0.001. Please see https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/mock-community/evaluate-classification-accuracy-nb-extra.ipynb for details.

Consensus taxonomy alignment-based classifiers

Two new classifiers implemented in q2-feature-classifier perform consensus taxonomy classification based on alignment of a query sequence to a reference sequence. The methods classify_consensus_vsearch and classify_consensus_blast use the global aligner vsearch [10] or the local aligner BLAST+ [9], respectively, to return up to maxaccepts reference sequences that align to the query with at least perc_identity similarity. A consensus taxonomy is then assigned to the query sequence by determining the taxonomic lineage on which at least min_consensus of the aligned sequences agree. This consensus taxonomy is truncated at the taxonomic level at which less than min_consensus of taxonomies agree. For example, if a query sequence is classified with maxaccepts=3, min_consensus=0.51, and the following top hits:

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__delbrueckii

The taxonomy label assigned will be k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis. However, if min_consensus=0.99, the taxonomy label assigned will be k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus.
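The consensus rule in this example can be sketched as a rank-by-rank vote over the returned hits. The code below is a simplified illustration under stated assumptions and is not the plugin's implementation: the function name `consensus_taxonomy` is hypothetical, hits are assumed to be semicolon-delimited lineage strings, and agreement at each rank is measured as a fraction of all hits.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus=0.51, sep=";"):
    """Return the deepest lineage on which at least `min_consensus` of hits agree."""
    lineages = [[rank.strip() for rank in hit.split(sep)] for hit in hits]
    consensus = []
    while True:
        depth = len(consensus)
        # Labels at this depth from hits that still follow the consensus lineage.
        candidates = [lin[depth] for lin in lineages
                      if len(lin) > depth and lin[:depth] == consensus]
        if not candidates:
            break
        label, count = Counter(candidates).most_common(1)[0]
        if count / len(lineages) < min_consensus:
            break  # truncate where agreement falls below the threshold
        consensus.append(label)
    return (sep + " ").join(consensus)


prefix = "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus"
hits = [prefix + "; s__brevis", prefix + "; s__brevis", prefix + "; s__delbrueckii"]
print(consensus_taxonomy(hits, min_consensus=0.51))  # ends in s__brevis
print(consensus_taxonomy(hits, min_consensus=0.99))  # ends in g__Lactobacillus
```

Two of the three hits (0.67) agree on s__brevis, which satisfies a threshold of 0.51 but not 0.99, so the stricter setting stops at the genus.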

Declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable


Availability of data and materials

Mock community sequence data used in this study are publicly available in mockrobiota [11] under the study identities listed in Table 1. All other data generated in this study, and all new software, is available in our GitHub repositories under the BSD license. The tax-credit repository can be found at: https://github.com/caporaso-lab/tax-credit, and static versions of all analysis notebooks, which contain all code and analysis results, can be viewed there. The q2-feature-classifier repository can be accessed at https://github.com/qiime2/q2-feature-classifier; as a QIIME 2 core plugin, it is automatically installed any time QIIME 2 (https://qiime2.org/) is installed.

Project name: q2-feature-classifier
Project home page: https://github.com/qiime2/q2-feature-classifier
Operating system(s): macOS, Linux
Programming language: Python
Other requirements: QIIME 2
License: BSD-3-Clause
Any restrictions to use by non-academics: None

Project name: tax-credit
Project home page: https://github.com/caporaso-lab/tax-credit
Operating system(s): macOS, Linux
Programming language: Python
Other requirements: None (QIIME 2 required for some optional functions)
License: BSD-3-Clause
Any restrictions to use by non-academics: None

Funding

This work was funded in part by National Science Foundation award 1565100 to JGC and RK, awards from the Alfred P. Sloan Foundation to JGC and RK, awards from the Partnership for Native American Cancer Prevention (NIH/NCI U54CA143924 and U54CA143925) to JGC, and National Health and Medical Research Council of Australia award APP1085372 to GAH, JGC and RK. These funding bodies had no role in the design of the study, the collection, analysis, or interpretation of data, or in writing the manuscript.

Acknowledgments

The authors thank Stephen Gould and Cheng Soon Ong for advice on machine learning optimisation.

Authors’ Contributions

NAB, RK, and JGC conceived and designed tax-credit. NAB, BDK, JGC, and JRR contributed to tax-credit. BDK, MD, JGC, and NAB contributed to q2-feature-classifier. BDK, JGC, MD, JRR, and EB provided QIIME 2 integration with q2-feature-classifier. JGC and GAH provided materials and support. NAB, BDK, JGC, and GAH wrote the manuscript with input from all co-authors.

CompetingInterests776

Theauthorsdeclarethattheyhavenocompetinginterests.777

778

Tables and Figures

Table 1. Mock communities currently integrated in tax-credit.

Study ID*  Target gene**  Platform      Species  Strains  Citation
mock-1     16S            HiSeq         46       48       [33]
mock-2     16S            MiSeq         46       48       [33]
mock-3     16S            MiSeq         21       21       [33]
mock-4     16S            MiSeq         21       21       [33]
mock-5     16S            MiSeq         21       21       [33]
mock-7     16S            HiSeq         67       67       [34]
mock-8     16S            HiSeq         67       67       [11]
mock-9     ITS            HiSeq         13       16       [11]
mock-10    ITS            HiSeq         13       16       [11]
mock-12    16S            MiSeq         26       27       [4]
mock-16    16S            MiSeq         56       59       [35]
mock-18    16S            MiSeq         15       15       [36]
mock-19    16S            MiSeq         15       27       [36]
mock-20    16S            MiSeq         20       20       [37]
mock-21    16S            MiSeq         20       20       [37]
mock-22    16S            MiSeq         20       20       [37]
mock-23    16S            MiSeq         20       20       [37]
mock-24    ITS            MiSeq         8        8        [38]
mock-26    ITS            FLX Titanium  11       11       [39]

*All studies are available on mockrobiota [11] at https://github.com/caporaso-lab/mockrobiota/tree/master/data/[studyID]

**Abbreviations: 16S = 16S rRNA gene; HiSeq = Illumina HiSeq; MiSeq = Illumina MiSeq.

Table 2. Optimized methods configurations for standard operating conditions.

                                                    Mock                 Cross-validated      Novel taxa
Target         Condition  Method       Parameters   F      P      R      F      P      R      F      P      R      Threshold
16S rRNA gene  Balanced   NB-bespoke   [6,6]:0.9    0.705  0.98   0.582  0.827  0.931  0.744  0.165  0.243  0.125  F = (0.49, 0.8, 0.1)
16S rRNA gene  Balanced   NB-bespoke   [6,6]:0.92   0.705  0.98   0.581  0.825  0.936  0.737  0.165  0.251  0.123  F = (0.7, 0.8, 0.15)
16S rRNA gene  Balanced   NB-bespoke   [6,6]:0.94   0.703  0.98   0.579  0.822  0.942  0.729  0.162  0.259  0.118
16S rRNA gene  Balanced   NB-bespoke   [7,7]:0.92   0.712  0.978  0.592  0.831  0.931  0.751  0.151  0.221  0.115
16S rRNA gene  Balanced   NB-bespoke   [7,7]:0.94   0.708  0.978  0.586  0.829  0.936  0.743  0.157  0.239  0.117
16S rRNA gene  Balanced   naive-bayes  [7,7]:0.7    0.495  0.797  0.38   0.819  0.886  0.761  0.115  0.138  0.099
16S rRNA gene  Balanced   rdp          0.6          0.564  0.798  0.457  0.815  0.868  0.768  0.102  0.128  0.084
16S rRNA gene  Balanced   rdp          0.7          0.55   0.799  0.438  0.812  0.892  0.746  0.124  0.173  0.096
16S rRNA gene  Balanced   uclust       0.51:0.9:3   0.498  0.746  0.392  0.846  0.876  0.817  0.154  0.201  0.126
16S rRNA gene  Precision  NB-bespoke   [6,6]:0.98   0.676  0.987  0.537  0.803  0.956  0.692  0.163  0.303  0.111  P = (0.94, 0.95, 0.25)
16S rRNA gene  Precision  NB-bespoke   [7,7]:0.98   0.687  0.98   0.551  0.815  0.951  0.713  0.164  0.283  0.115
16S rRNA gene  Precision  rdp          1            0.239  0.941  0.16   0.632  0.968  0.469  0.12   0.457  0.069
16S rRNA gene  Recall     NB-bespoke   [12,12]:0.5  0.754  0.8    0.721  0.815  0.83   0.801  0.053  0.058  0.049  R = (0.47, 0.75, 0.04)
16S rRNA gene  Recall     NB-bespoke   [14,14]:0.5  0.758  0.802  0.726  0.811  0.826  0.797  0.052  0.057  0.048  R = (0.7, 0.75, 0.04)
16S rRNA gene  Recall     NB-bespoke   [16,16]:0.5  0.755  0.785  0.732  0.808  0.825  0.792  0.052  0.058  0.047
16S rRNA gene  Recall     NB-bespoke   [18,18]:0.5  0.772  0.803  0.748  0.805  0.823  0.789  0.055  0.061  0.05
16S rRNA gene  Recall     NB-bespoke   [32,32]:0.5  0.937  0.966  0.913  0.788  0.818  0.76   0.054  0.067  0.045
16S rRNA gene  Recall     naive-bayes  [11,11]:0.5  0.567  0.77   0.479  0.793  0.82   0.768  0.059  0.065  0.055
16S rRNA gene  Recall     naive-bayes  [12,12]:0.5  0.567  0.769  0.479  0.79   0.816  0.765  0.059  0.064  0.055
16S rRNA gene  Recall     naive-bayes  [18,18]:0.5  0.564  0.764  0.477  0.779  0.807  0.753  0.057  0.063  0.051
16S rRNA gene  Recall     rdp          0.5          0.577  0.791  0.48   0.816  0.848  0.787  0.068  0.079  0.06
16S rRNA gene  Novel      blast+       10:0.51:0.8  0.436  0.723  0.325  0.816  0.896  0.749  0.225  0.332  0.171  F = (0.4, 0.8, 0.2)
16S rRNA gene  Novel      uclust       0.76:0.9:5   0.467  0.775  0.348  0.84   0.938  0.76   0.219  0.358  0.158
16S rRNA gene  Novel      vsearch      10:0.51:0.8  0.45   0.74   0.342  0.814  0.891  0.75   0.226  0.333  0.171
16S rRNA gene  Novel      vsearch      10:0.51:0.9  0.45   0.74   0.342  0.82   0.896  0.755  0.219  0.338  0.162
Fungi          Balanced   naive-bayes  [6,6]:0.94   0.874  0.935  0.827  0.481  0.57   0.416  0.374  0.438  0.327  F = (0.85, 0.45, 0.37)
Fungi          Balanced   naive-bayes  [6,6]:0.96   0.874  0.935  0.827  0.495  0.597  0.423  0.399  0.473  0.344
Fungi          Balanced   naive-bayes  [6,6]:0.98   0.874  0.935  0.827  0.505  0.629  0.423  0.426  0.52   0.361
Fungi          Balanced   naive-bayes  [7,7]:0.98   0.874  0.935  0.827  0.485  0.596  0.409  0.388  0.47   0.33
Fungi          Balanced   NB-bespoke   [6,6]:0.94   0.928  0.968  0.915  0.48   0.567  0.416  0.371  0.433  0.325
Fungi          Balanced   NB-bespoke   [6,6]:0.96   0.928  0.968  0.915  0.491  0.59   0.42   0.393  0.466  0.34
Fungi          Balanced   NB-bespoke   [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
Fungi          Balanced   NB-bespoke   [7,7]:0.98   0.935  0.97   0.921  0.487  0.596  0.412  0.386  0.466  0.329
Fungi          Balanced   rdp          0.7          0.929  0.939  0.922  0.479  0.572  0.413  0.382  0.451  0.332
Fungi          Balanced   rdp          0.8          0.924  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
Fungi          Balanced   rdp          0.9          0.922  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379
Fungi          Precision  naive-bayes  [6,6]:0.98   0.874  0.935  0.827  0.505  0.629  0.423  0.426  0.52   0.361  P = (0.92, 0.6, 0.3)
Fungi          Precision  NB-bespoke   [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
Fungi          Precision  rdp          0.8          0.924  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
Fungi          Precision  rdp          0.9          0.922  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379
Fungi          Precision  rdp          1            0.821  0.943  0.742  0.461  0.81   0.322  0.459  0.774  0.327
Fungi          Recall     NB-bespoke   [6,6]:0.92   0.938  0.971  0.924  0.467  0.544  0.409  0.353  0.407  0.312  R = (0.9, 0.4, 0.3)
Fungi          Recall     NB-bespoke   [6,6]:0.94   0.928  0.968  0.915  0.48   0.567  0.416  0.371  0.433  0.325
Fungi          Recall     NB-bespoke   [6,6]:0.96   0.928  0.968  0.915  0.491  0.59   0.42   0.393  0.466  0.34
Fungi          Recall     NB-bespoke   [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
Fungi          Recall     NB-bespoke   [7,7]:0.96   0.935  0.969  0.921  0.47   0.56   0.404  0.357  0.422  0.31
Fungi          Recall     NB-bespoke   [7,7]:0.98   0.935  0.97   0.921  0.487  0.596  0.412  0.386  0.466  0.329
Fungi          Recall     rdp          0.7          0.929  0.939  0.922  0.479  0.572  0.413  0.382  0.451  0.332
Fungi          Recall     rdp          0.8          0.924  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
Fungi          Recall     rdp          0.9          0.922  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379
Fungi          Novel      naive-bayes  [6,6]:0.98   0.874  0.935  0.827  0.505  0.629  0.423  0.426  0.52   0.361  F = (0.85, 0.45, 0.4)
Fungi          Novel      NB-bespoke   [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
Fungi          Novel      rdp          0.8          0.923  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
Fungi          Novel      rdp          0.9          0.921  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379

a F = F-measure, P = precision, R = recall
b Naive Bayes parameters: k-mer range, confidence
c RDP parameters: confidence
d BLAST+/vsearch parameters: max accepts, minimum consensus, minimum percent identity
e UCLUST parameters: minimum consensus, similarity, max accepts
f Threshold describes the score cutoffs used to define optimal method ranges, in the format: [metric = (mock score, cross-validated score, novel-taxa score)]. If two cutoffs are given, the second indicates a higher cutoff used to select parameters for the developmental NB-bespoke method, and the configurations listed are the union of the two cutoffs: the second cutoff for selecting NB-bespoke, the first for selecting all other methods.

Table 3. Naive Bayes broad grid search parameters

Step                                               Parameter    Values
sklearn.feature_extraction.text.HashingVectorizer  n_features   1024, 8192, 65536
sklearn.feature_extraction.text.HashingVectorizer  ngram_range  [4,4], [8,8], [16,16], [4,16]
sklearn.feature_extraction.text.TfidfTransformer   norm         l1, l2, None
sklearn.feature_extraction.text.TfidfTransformer   use_idf      True, False
sklearn.naive_bayes.MultinomialNB                  alpha        0.001, 0.01, 0.1
sklearn.naive_bayes.MultinomialNB                  class_prior  None, array of class weights
post processing                                    confidence   0, 0.2, 0.4, 0.6, 0.8


Figure 1. Classifier performance on mock community datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). A, Average F-measure for each taxonomy classification method (averaged across all configurations and all mock community datasets) from class to species level. Error bars = 95% confidence intervals. B, Average F-measure for each optimized classifier (averaged across all mock communities) at species level. C, Average taxon accuracy rate for each optimized classifier (averaged across all mock communities) at species level. D, Average Bray-Curtis distance between the expected mock community composition and its composition as predicted by each optimized classifier (averaged across all mock communities) at species level. Violin plots show median (white point), quartiles (black bars), and kernel density estimation (violin) for each score distribution. Violins with different lower-case letters have significantly different means (paired t-test false detection rate-corrected P < 0.05).


Figure 2. Classifier performance on cross-validated sequence datasets. Classification accuracy of 16S rRNA gene V4 sub-domain (first row), V1-3 sub-domain (second row), full-length 16S rRNA gene (third row), and fungal ITS sequences (fourth row). A, Average F-measure for each taxonomy classification method (averaged across all configurations and all cross-validated sequence datasets) from class to species level. Error bars = 95% confidence intervals. B, Average F-measure for each optimized classifier (averaged across all cross-validated sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t-test false detection rate-corrected P < 0.05). C, correlation between F-measure performance for each method/configuration classification of V4 sub-domain (x-axis), V1-3 sub-domain (y-axis), and full-length 16S rRNA gene sequences (z-axis). Inset lists the Pearson R2 value for each pairwise correlation; each correlation is significant (P < 0.001).


Figure 3. Classifier performance on novel-taxa simulated sequence datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). A-F, Average F-measure (A), precision (B), recall (C), overclassification (D), underclassification (E), and misclassification (F) for each taxonomy classification method (averaged across all configurations and all novel taxa sequence datasets) from phylum to species level. Error bars = 95% confidence intervals. B, Average F-measure for each optimized classifier (averaged across all novel taxa sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t-test false detection rate-corrected P < 0.05).


Figure 4. Classification accuracy comparison between mock community, cross-validated, and novel taxa evaluations. Scatterplots show mean F-measure scores for each method configuration, averaged across all samples, for classification of 16S rRNA genes at genus level (A) and species level (B), and fungal ITS sequences at genus level (C) and species level (D).


Figure 5. Runtime performance comparison of taxonomy classifiers. Runtime (s) for each taxonomy classifier either varying the number of query sequences and keeping a constant 10000 reference sequences (A) or varying the number of reference sequences and keeping a constant 1 query sequence (B).


