Optimizing taxonomic classification of marker gene amplicon sequences

Nicholas A. Bokulich1#*, Benjamin D. Kaehler2#*, Jai Ram Rideout1, Matthew Dillon1, Evan Bolyen1, Rob Knight3, Gavin A. Huttley2#, J. Gregory Caporaso1,4,#

1 The Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, AZ, USA
2 Research School of Biology, Australian National University, Canberra, Australia
3 Departments of Pediatrics and Computer Science & Engineering, and Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
4 Department of Biological Sciences, Northern Arizona University, Flagstaff, AZ, USA

* These authors contributed equally

# Corresponding authors

Gregory Caporaso
Department of Biological Sciences
1298 S Knoles Drive
Building 56, 3rd Floor
Northern Arizona University
Flagstaff, AZ, USA
(303) 523-5485
(303) 523-4015 (fax)
Email: gregcaporaso@gmail.com

Nicholas Bokulich
The Pathogen and Microbiome Institute
PO Box 4073
Flagstaff, Arizona 86011-4073, USA
Email: [email protected]
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.3208v2 | CC BY 4.0 Open Access | rec: 17 Jan 2018, publ: 17 Jan 2018

Benjamin Kaehler
Research School of Biology
46 Sullivans Creek Road,
The Australian National University,
Acton ACT 2601, Australia
Email: benjamin.kaehler@anu.edu.au

Gavin Huttley
Research School of Biology
46 Sullivans Creek Road,
The Australian National University,
Acton ACT 2601, Australia
Email: [email protected]

Abstract

Background: Taxonomic classification of marker-gene sequences is an important step in microbiome analysis.
Results: We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based taxonomy classifiers that meet or exceed the accuracy of existing methods for marker-gene amplicon sequence classification. We evaluated and optimized several commonly used taxonomic classification methods (RDP, BLAST, UCLUST) and several new methods (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods of VSEARCH, BLAST+, and SortMeRNA) for classification of marker-gene amplicon sequence data.
Conclusions: Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for a range of standard operating conditions. q2-feature-classifier and our evaluation framework, tax-credit, are both free, open-source, BSD-licensed packages available on GitHub.

Background

High-throughput sequencing technologies have transformed our ability to explore complex microbial communities, offering insight into microbial impacts on human health [1] and global ecosystems [2]. This is achieved most commonly by sequencing short, conserved marker genes amplified with ‘universal’ PCR primers, such as 16S rRNA genes for bacteria and archaea, or internal transcribed spacer (ITS) regions for fungi. Targeted marker-gene primers can also be used to profile specific taxa or functional groups, such as nifH genes [3]. These sequences often are compared against an annotated reference sequence database to determine the likely taxonomic origin of each sequence with as much specificity as possible. Accurate and specific taxonomic information is a crucial component of many experimental designs.
Challenges in this process include the short length of typical sequencing reads with current technology, sequencing and PCR errors [4], selection of appropriate marker genes that contain sufficient heterogeneity to differentiate target species but that are homogeneous enough in some regions to design broad-spectrum primers, quality of reference sequence annotations [5], and selection of a method that accurately predicts the taxonomic affiliation of millions of sequences at low computational cost. Numerous methods have been developed for taxonomy classification of DNA sequences, but few have been directly compared in the specific case of short marker-gene sequences.
We introduce q2-feature-classifier, a QIIME 2 (https://qiime2.org/) plugin for taxonomy classification of marker-gene sequences. QIIME 2 is the successor to the QIIME [6] microbiome analysis package. The q2-feature-classifier plugin supports use of any of the numerous machine-learning classifiers available in scikit-learn [7][8] for marker gene taxonomy classification, and currently provides two alignment-based taxonomy consensus classifiers based on BLAST+ [9] and vsearch [10]. We evaluate the latter two methods and the scikit-learn multinomial naive Bayes classifier (labelled “Naive Bayes” in the Results section) for the first time. We show that the classifiers provided in q2-feature-classifier match or outperform the classification accuracy of several widely-used methods for sequence classification, and that performance of the naive Bayes classifier can be significantly increased by providing it with information regarding expected taxonomic composition.
We also developed tax-credit (https://github.com/caporaso-lab/tax-credit-code/ and https://github.com/caporaso-lab/tax-credit-data/), an extensible computational framework for evaluating taxonomy classification accuracy. This framework streamlines the process of methods benchmarking by compiling multiple different test data sets, including mock communities [11] and simulated sequence reads. It additionally stores pre-computed results from previously evaluated methods, including the results presented here, and provides a framework for parameter sweeps and method optimization. tax-credit could be used as an evaluation framework by other research groups in the future, or its raw data could be easily extracted for integration in another evaluation framework.

Results

We used tax-credit to optimize and compare multiple marker-gene sequence taxonomy classifiers. We evaluated two commonly used classifiers that are wrapped in QIIME 1 (RDP Classifier (version 2.2) [12], legacy BLAST (version 2.2.22) [13]), two QIIME 1 alignment-based consensus taxonomy classifiers (the default UCLUST classifier available in QIIME 1 (based on version 1.2.22q) [14], and SortMeRNA (version 2.0 29/11/2014) [15]), two alignment-based consensus taxonomy classifiers newly released in q2-feature-classifier (based on BLAST+ (version 2.6.0) [9] and vsearch (version 2.0.3) [10]), and a new multinomial naive Bayes machine-learning classifier in q2-feature-classifier (see Materials and Methods for information about q2-feature-classifier methods and source code availability). We performed parameter sweeps to determine optimal parameter configurations for each method.
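
To make the idea of an alignment-based consensus taxonomy concrete, the following is a minimal sketch of one plausible consensus rule: walk down the ranks and keep each label only while a sufficient fraction of the top hits agree on it. The function name `consensus_taxonomy`, the semicolon-delimited lineage format, and the exact stopping rule are assumptions made for illustration. This is not the code used by any of the classifiers evaluated here.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage supported by >= min_consensus of the hits.

    hit_taxonomies: list of semicolon-delimited lineage strings, one per
    accepted alignment hit (hypothetical input format).
    """
    if not hit_taxonomies:
        return []
    lineages = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    consensus = []
    for depth in range(max(len(lin) for lin in lineages)):
        # Count full prefixes so that agreement is measured on the whole
        # lineage down to this depth, not on the rank label alone.
        prefixes = Counter(
            tuple(lin[:depth + 1]) for lin in lineages if len(lin) > depth
        )
        best, count = prefixes.most_common(1)[0]
        supported = count / len(lineages) >= min_consensus
        if not supported or list(best[:depth]) != consensus:
            break
        consensus = list(best)
    return consensus


hits = [
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Firmicutes; g__Bacillus",
    "k__Bacteria; p__Firmicutes; g__Clostridium",
]
print(consensus_taxonomy(hits, min_consensus=0.51))
print(consensus_taxonomy(hits, min_consensus=0.75))
```

With three hits, two of which agree at genus level, a threshold of 0.51 keeps the genus label while a threshold of 0.75 stops at phylum. This shows how a stricter consensus setting trades specificity for caution.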
Mock community evaluations

We first benchmarked classifier performance on mock communities, which are artificially constructed mixtures of microbial cells or DNA combined at known ratios [11]. We utilized 15 bacterial 16S rRNA gene mock communities and 4 fungal internal transcribed spacer (ITS) mock communities (Table 1) sourced from mockrobiota [11], a public repository for mock community data. Mock communities are useful for method benchmarking because: 1) unlike for simulated communities, they allow quantitative assessments of method performance under actual operating conditions, i.e., incorporating real sequencing errors that can be difficult to model accurately; and 2) unlike for natural community samples, the actual composition of a mock community is known in advance, allowing quantitative assessments of community profiling accuracy.
An additional priority was to test the effect of setting class weights on classification accuracy for the naive Bayes classifier implemented in q2-feature-classifier. In machine learning, class weights or prior probabilities are vectors of weights that specify the frequency at which each class is expected to be observed (and should be distinguished from the use of this term under Bayesian inference as a probability distribution of weights vectors). An alternative to setting class weights is to assume that each query sequence is equally likely to belong to any of the taxa that are present in the reference sequence database. This assumption, known as uniform class priors in the context of a naive Bayes classifier, is made by the RDP classifier [12], and its impact on marker-gene classification accuracy has yet to be validated. One of the two assumptions, that the class weights are uniform or that they are known to some extent, must be made, and the choice will affect results. The mock communities have taxonomic abundances that are far from uniform over the set of reference taxonomies, as is the case for any real data set. We can therefore use them to assess the impact of making assumptions regarding class weights. Where we have set the class weights to the known taxonomic composition of a sample, we have labelled the results “bespoke”.
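
The effect of class weights can be illustrated with a toy k-mer naive Bayes classifier written in plain Python. This is only a sketch under simplifying assumptions. It is not the scikit-learn classifier used in q2-feature-classifier; the helper names (`kmers`, `train`, `posterior`), the k-mer length of 3, the smoothing constant, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(reference, k=3, alpha=1.0):
    """reference: {taxon: [sequences]} -> {taxon: {kmer: log-likelihood}}."""
    counts = {
        taxon: Counter(km for s in seqs for km in kmers(s, k))
        for taxon, seqs in reference.items()
    }
    vocab = set().union(*counts.values())
    model = {}
    for taxon, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)  # additive smoothing
        model[taxon] = {km: math.log((c[km] + alpha) / total) for km in vocab}
    return model


def posterior(model, query, class_weights=None, k=3):
    """Posterior over taxa; class_weights=None means uniform class priors."""
    taxa = sorted(model)
    if class_weights is None:
        class_weights = {t: 1.0 / len(taxa) for t in taxa}
    log_scores = {}
    for t in taxa:
        # k-mers never seen in training are ignored (a simplifying assumption)
        log_lik = sum(model[t][km] for km in kmers(query, k) if km in model[t])
        log_scores[t] = math.log(class_weights[t]) + log_lik
    top = max(log_scores.values())
    norm = sum(math.exp(v - top) for v in log_scores.values())
    return {t: math.exp(v - top) / norm for t, v in log_scores.items()}


# Two taxa with identical reference sequences: the read alone cannot
# separate them, so the class weights decide the outcome.
model = train({"taxon_A": ["ACGTACGT"], "taxon_B": ["ACGTACGT"]})
print(posterior(model, "ACGTAC"))
print(posterior(model, "ACGTAC", {"taxon_A": 0.9, "taxon_B": 0.1}))
```

When the sequence evidence is ambiguous, uniform weights split the posterior evenly, whereas weights reflecting an expected composition shift it toward the taxon that is expected to be more common. All weights in this sketch must be strictly positive, because their logarithm is taken.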
We evaluated classifier accuracy on mock community sequences classified at taxonomic levels from class through species. Mock community sequences were classified using the Greengenes 99% OTUs 16S rRNA gene or UNITE 99% OTUs ITS reference sequences for bacterial and fungal mock communities, respectively. As expected, classification accuracy decreased as classification depth increased, and all methods could predict the taxonomic affiliation of mock community sequences down to genus level with median F-measures exceeding 0.8 across all parameter sets (minimum: UCLUST F=0.81, maximum: Naive Bayes Bespoke F=1.00) (Figure 1A). However, species affiliation was predicted with much lower and more variable accuracy among method configurations (median F-measure minimum: UCLUST F=0.42, maximum: Naive Bayes Bespoke F=0.95), highlighting the importance of parameter optimization (discussed in more detail below). Figure 1A shows line plots of mean F-measure at each taxonomic level, averaged across all classifier configurations; hence, classifier performance is underestimated for some classifiers that are strongly affected by parameter configurations or for which a wider range of parameters was tested (e.g., Naive Bayes). Comparing only optimized methods (i.e., the top-performing parameter configurations for each method), Naive Bayes Bespoke achieved significantly higher F-measure (paired t-test P<0.05) (Figure 1B), recall, taxon detection rate, taxon accuracy rate (Figure 1C), and lower Bray-Curtis dissimilarity than all other methods (Figure 1D).
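
Bray-Curtis dissimilarity compares the observed taxonomic profile of a mock community with its expected composition. The sketch below uses the standard definition (sum of absolute abundance differences divided by the sum of all abundances). The function name `bray_curtis` and the toy profiles are hypothetical and are not taken from tax-credit.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles.

    Returns 0.0 for identical profiles and 1.0 when no taxa are shared.
    """
    taxa = set(expected) | set(observed)
    numerator = sum(
        abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa
    )
    denominator = sum(
        expected.get(t, 0.0) + observed.get(t, 0.0) for t in taxa
    )
    return numerator / denominator if denominator else 0.0


# Toy mock community: half of the observed reads were assigned to a
# taxon that is not part of the expected composition.
expected = {"g__Bacillus": 0.5, "g__Listeria": 0.5}
observed = {"g__Bacillus": 0.5, "g__Escherichia": 0.5}
print(bray_curtis(expected, observed))
```

Lower values indicate that a classifier reproduces the expected composition more closely, which is why a lower Bray-Curtis dissimilarity is reported as better performance.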
Mock communities are necessarily simplistic, and cannot assess method performance across a diverse range of taxa. Sequences matching the expected mock community sequences are not removed from the reference database prior to classification, in order to replicate normal operating conditions and assess recovery of expected sequences. However, this approach may implicitly bias toward methods that find an exact match to the query sequences, and does not approximate natural microbial communities in which few or no detected sequences exactly match the reference sequences. Hence, we performed simulated sequence read classifications (described below) to further test classifier performance.

Cross-validated taxonomy classification

Simulated sequence reads, derived from reference databases, allow us to assess method performance across a greater diversity of sequences than a single mock community generally encompasses. We first evaluated classifier performance using stratified k-fold cross-validation of taxonomy classification on simulated reads. The k-fold cross-validation strategy is modified slightly to account for the hierarchical nature of taxonomic classifications, which all of the classifiers in this study (with the exception of legacy BLAST) handle by assigning the lowest (i.e., most specific) taxonomic level where the classification surpasses some user-defined “confidence” or “consensus” threshold (see materials and methods). The modification is to truncate any expected taxonomy in each test set to the maximum level at which an instance of that taxonomy exists in the training set. Simulated reads were generated from Greengenes 99% OTUs 16S rRNA gene or UNITE 99% OTUs ITS reference sequences with species-level annotations. Greengenes 16S rRNA gene simulated reads were generated from full-length 16S rRNA genes (primers 27F/1492R) and V4 (primers 515F/806R) and V1-3 sub-domains (primers 27F/534R). The simulated reads do not incorporate artificial sequencing errors (see materials and methods for more details). In this set of tests and below for novel taxa, the “bespoke” classifier had prior probabilities that were inferred from the training set each time it was trained.
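
The truncation step described above can be sketched as follows: an expected test-set lineage is shortened until its full prefix appears somewhere in the training set. The function name `truncate_expected` and the semicolon-delimited lineage format are assumptions for illustration, not the tax-credit implementation.

```python
def truncate_expected(expected, training_taxonomies):
    """Truncate an expected lineage to the deepest level seen in training.

    expected: semicolon-delimited lineage string for a test sequence.
    training_taxonomies: iterable of lineage strings in the training fold.
    """
    # Collect every lineage prefix that occurs in the training set.
    known_prefixes = set()
    for taxonomy in training_taxonomies:
        ranks = taxonomy.split(";")
        for depth in range(1, len(ranks) + 1):
            known_prefixes.add(tuple(ranks[:depth]))

    ranks = expected.split(";")
    depth = len(ranks)
    # Step back one rank at a time until the prefix is represented.
    while depth > 0 and tuple(ranks[:depth]) not in known_prefixes:
        depth -= 1
    return ";".join(ranks[:depth])


training = [
    "k__Bacteria;f__Lactobacillaceae;g__Lactobacillus;s__casei",
    "k__Bacteria;f__Bacillaceae;g__Bacillus;s__subtilis",
]
test_label = "k__Bacteria;f__Lactobacillaceae;g__Lactobacillus;s__brevis"
print(truncate_expected(test_label, training))
```

In this toy case the species of the test sequence is absent from the training fold, so the expected answer becomes the genus-level lineage, and a classifier is not penalized for failing to name a species it could not have learned.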
Classification of cross-validated reads performed better at coarser levels of classification (Figure 2A), similar to the trend observed in mock community results. For bacterial sequences, average classification accuracy for all methods declined from near-perfect scores at family level (V4 domain median F-measure minimum: BLAST+ F=0.92, maximum: legacy BLAST F=0.99), but still retained accurate scores at species level (median minimum: BLAST+ F=0.76, maximum: SortMeRNA F=0.84), relative to some mock community data sets (Figure 2A). Fungal sequences exhibited similar performance, with the exception that mean BLAST+ and vsearch performance was markedly lower at all taxonomic levels, indicating high sensitivity to parameter configurations, and species-level F-measures were in general much lower (median minimum: BLAST+ F=0.17, maximum: UCLUST F=0.45) than those of bacterial sequence classifications (Figure 2A).
Species-level classifications of 16S rRNA gene simulated sequences were best with optimized UCLUST and SortMeRNA configurations for V4 domain, and Naive Bayes and RDP for V1-3 domain and full-length 16S rRNA gene sequences (Figure 2B). UCLUST achieved the highest F-measure for ITS classification (F = 0.51). However, all optimized classifiers achieved similar F-measure ranges, with the exception of legacy BLAST for ITS sequences (Figure 2B).
Species-level classification performance of 16S rRNA gene simulated reads was significantly correlated between each sub-domain and the full-length gene sequences (Figure 2C). In our tests, full-length sequences exhibited slightly lower accuracy than V1-3 and V4 sub-domains. The relative performance of full-length 16S rRNA genes versus hypervariable sub-domain reads is variable in the literature [12,16–21], and our results add another data point to the ongoing discussion of this topic. Nevertheless, species-level classification performance was strongly correlated across method configurations (Figure 2C) and across optimized methods (Figure 2B), suggesting that primer choice impacts classification accuracy uniformly across all methods. Hence, we focused on V4 sub-domain reads for downstream analyses.

Novel taxon classification evaluation

Novel taxon classification offers a unique perspective on classifier behavior, assessing how classifiers perform when challenged with a “novel” clade that is not represented in the reference database [22–25]. An ideal classifier should identify the nearest taxonomic lineage to which this taxon belongs, but no further. In this evaluation, a reference database is subsampled k times to generate query and reference sequence sets, as for cross-validated classification, but two important distinctions exist: 1) the reference database used for classification excludes any sequence that matches the taxonomic affiliation of the query sequences at taxonomic level L, the taxonomic rank at which classification is being attempted; and 2) this is performed at each taxonomic level, in order to assess classification performance when each method encounters a “novel” species, genus, family, etc.
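
The first distinction, excluding from the reference every sequence that shares the query taxon at level L, can be sketched as a simple filter. The function name `novel_taxon_reference`, the list-based lineage representation, and the zero-based `level` index are assumptions for illustration. The actual tax-credit partitioning code may differ.

```python
def novel_taxon_reference(reference, query_ids, level):
    """Build the reference set for a novel-taxon test at a given rank.

    reference: {sequence_id: [rank labels from highest to lowest rank]}
    query_ids: ids of the sequences used as "novel" queries
    level: zero-based index of rank L in each lineage (hypothetical)
    """
    # Lineages (down to rank L) that must be absent from the reference.
    novel_taxa = {tuple(reference[q][:level + 1]) for q in query_ids}
    return {
        seq_id: lineage
        for seq_id, lineage in reference.items()
        if tuple(lineage[:level + 1]) not in novel_taxa
    }


reference = {
    "seq1": ["f__Bacillaceae", "g__Bacillus", "s__subtilis"],
    "seq2": ["f__Bacillaceae", "g__Bacillus", "s__subtilis"],
    "seq3": ["f__Bacillaceae", "g__Bacillus", "s__cereus"],
    "seq4": ["f__Listeriaceae", "g__Listeria", "s__innocua"],
}
# Treat seq1 as a "novel" species (level index 2 in this toy lineage).
kept = novel_taxon_reference(reference, ["seq1"], level=2)
print(sorted(kept))
```

Here both sequences of the query species are removed, while another species of the same genus stays in the reference, so the best achievable answer for the query is its genus.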
Due to these differences, interpretation of novel taxon classification results is different from that of mock community and cross-validated classifications. For the latter, classification accuracy may be assessed at each taxonomic level for each classification result: mean classification accuracy at family level and species level evaluate the same results but focus on different taxonomic levels of classification. For novel taxa, however, different query and reference sequences are compiled for classification at each taxonomic level and separate classifications are performed for each. Hence, classifications at family and species level are independent events: one assesses how accurately each method performs when it encounters a “novel” family that is not represented in the reference database, the other when a “novel” species is encountered.
Novel taxon evaluations employ a suite of modified metrics, to provide more information on what types of classification errors occur. Precision, recall, and F-measure calculations at each taxonomic level L assess whether an accurate taxonomy classification was made at level L-1: for example, a “novel” species should be assigned a genus, because the correct species class is not represented within the reference database. Any species-level classification in this scenario is an overclassification (affecting both recall and precision) [25]. Overclassification is one of the key metrics for novel taxa evaluation, indicating the degree to which novel sequences will be interpreted as known organisms. This overclassification is often highly undesirable because it leads, for example, to the incorrect classification of unknown but harmless environmental sequences as known pathogens. Novel sequences that are classified within the correct clade, but to a less specific level than L, are underclassified (affecting recall but not precision) [25]. Sequences that are classified into a completely different clade are misclassified (affecting both recall and precision) [25].
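
These error categories can be expressed as a small decision rule that compares the observed lineage with the expected lineage (the true lineage truncated at level L-1). The function name `classify_outcome` and the list-based lineages are assumptions for illustration. Treating an unassigned sequence (an empty lineage) as an underclassification is also an assumption of this sketch.

```python
def classify_outcome(expected, observed):
    """Categorize one novel-taxon classification result.

    expected: list of rank labels down to level L-1 (the correct answer).
    observed: list of rank labels returned by the classifier.
    """
    shared = min(len(expected), len(observed))
    if expected[:shared] != observed[:shared]:
        # The classifier left the correct clade at some rank.
        return "misclassification"
    if len(observed) > len(expected):
        # Correct clade, but assigned more specifically than justified.
        return "overclassification"
    if len(observed) < len(expected):
        # Correct clade, but assigned less specifically than possible.
        return "underclassification"
    return "match"


expected = ["f__Bacillaceae", "g__Bacillus"]  # a "novel" species: genus is correct
print(classify_outcome(expected, ["f__Bacillaceae", "g__Bacillus", "s__cereus"]))
print(classify_outcome(expected, ["f__Bacillaceae"]))
print(classify_outcome(expected, ["f__Listeriaceae", "g__Listeria"]))
print(classify_outcome(expected, ["f__Bacillaceae", "g__Bacillus"]))
```

Counting these outcomes over all query sequences gives the over-, under-, and misclassification rates, with matches contributing to precision and recall.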
Precision, recall, and F-measure all gradually increase from average scores near 0.0 at class level, reaching peak scores at genus level for bacteria and species level for fungi (Figure 3A-C). These trends are paired with gradual decreases in underclassification and misclassification rates for all classification methods, indicating that all classifiers perform poorly when they encounter sequences with no known match at the class, order, or family levels (Figure 3D-F). At species level, UCLUST, BLAST+, and vsearch achieved significantly better F-measures than all other methods for 16S rRNA gene classifications (P<0.05) (Figure 3G). UCLUST achieved significantly better F-measures than all other methods for ITS classifications (Figure 3G). Over-, under-, and misclassification scores are less informative for optimizing classifiers for real use cases, as most methods could be optimized to yield near-zero scores for each of these metrics separately, but only through extreme configurations, leading to F-measures that would be unacceptable under any scenario. Note that all comparisons were made between methods optimized to maximize (or minimize) a single metric, and hence the configurations that maximize precision are frequently different from those that maximize recall or other metrics. This trade-off between different metrics is discussed in more detail below.
The novel taxon evaluation provides an estimate of classifier performance given a specific reference database, but its generalization is limited by the quality of the reference databases available and by the label-based approach used for partitioning and evaluation. Mislabeled and polyphyletic clades in the database, e.g., the Clostridium group, increase the probability of misclassification. A complementary analysis based on sequence similarity between a novel query and top reference hit could mitigate this issue. However, we chose to apply a label-based approach, as it better reflects the biological problem that users can expect to encounter; i.e., using a particular reference sequence database (which will contain some quantity of mislabeled and polyphyletic taxa inherent to currently available resources), how likely is a classifier to misclassify a taxonomic label?

Multi-evaluation method optimization

The mock community and cross-validation classification evaluations yielded similar trends in configuration performance, but optimizing parameter choices for the novel taxa generally led to suboptimal choices for the mock community and cross-validation tests (Figure 4). We sought to determine the relationship between method configuration performance for each evaluation, and use this information to select configurations that perform best across all evaluations. For 16S rRNA gene sequence species-level classification, method configurations that achieve maximum F-measures for mock and cross-validated sequences perform poorly for novel taxon classification (Figure 4B). Optimization is more straightforward for genus-level classification of 16S rRNA gene sequences (Figure 4A) and for fungal sequences (Figure 4C-D), for which configuration performance (measured as mean F-measure) is maximized by similar configurations among all three evaluations.
To identify optimal method configurations, we set accuracy score minimum thresholds for each evaluation by identifying natural breaks in the range of quality scores, selecting methods and parameter ranges that met these criteria. Table 2 lists method configurations that maximize species-level classification accuracy scores for mock community, cross-validated, and novel taxon evaluations under several common operating conditions. “Balanced” configurations are recommended for general use, and are methods that maximize F-measure scores. “Precision” and “Recall” configurations maximize precision and recall scores, respectively, for mock, cross-validated, and novel-taxa classifications (Table 2). “Novel” configurations optimize F-measure scores for novel taxon classification, and secondarily for mock and cross-validated performance (Table 2). These configurations are recommended for use with sample types that are expected to contain large proportions of unidentified species, for which overclassification can be excessive. However, these configurations may not perform optimally for classification of known species (i.e., underclassification rates will be higher). For fungi, the same configurations recommended for “Precision” perform well for novel taxon classification (Table 2). For 16S rRNA gene sequences, BLAST+, UCLUST, and vsearch consensus classifiers perform best for novel taxon classification (Table 2).

Computational runtime

High-throughput sequencing platforms (and experiments) continue to yield increasing sequence counts, which, even after quality filtering and dereplication or operational taxonomic unit clustering steps common to most microbiome analysis pipelines, may exceed thousands of unique sequences that need classification. Increasing numbers of query sequences and reference sequences may lead to unacceptable runtimes, and under some experimental conditions the top-performing method (based on precision, recall, or some other metric) may be unable to handle large numbers of sequences within an acceptable time frame. For example, quick turnarounds may be vital under clinical scenarios as microbiome evaluation becomes common clinical practice, or commercial scenarios, when large sample volumes and client expectations may constrain turnaround times and method selection.
We assessed computational runtime as a linear function of 1) the number of query sequences and 2) the number of reference sequences. Linear dependence is empirically evident in Figure 5. For both of these metrics, the slope is the most important measure of performance. The intercept may include the amount of time taken to train the classifier, preprocess the reference sequences, load preprocessed data, or other “setup” steps that will diminish in significance as sequence counts grow, and hence is treated as negligible.
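
The slope and intercept of such a runtime model can be estimated by ordinary least squares. The sketch below uses made-up timings generated from a known line purely for illustration. The function name `fit_line` and all numbers are hypothetical and are not measurements from this study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical timings: 5 s of fixed setup plus 0.002 s per query sequence.
query_counts = [1000, 2000, 4000, 8000]
runtimes = [5.0 + 0.002 * n for n in query_counts]

slope, intercept = fit_line(query_counts, runtimes)
print(f"slope = {slope:.6f} s/sequence, intercept = {intercept:.2f} s")
```

The slope is the per-sequence cost that dominates at large sequence counts, while the intercept absorbs one-off setup work such as training or loading a classifier.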
UCLUST (0.000028 s/sequence), vsearch (0.000072 s/sequence), BLAST+ (0.000080 s/sequence), and legacy BLAST (0.000100 s/sequence) all exhibit shallow slopes with increasing numbers of reference sequences. Naive Bayes (0.000483 s/sequence) and SortMeRNA (0.000543 s/sequence) yield moderately higher slopes, and RDP (0.001696 s/sequence) demonstrates the steepest slope (Figure 5A). For runtime as a function of query sequence count, UCLUST (0.002248 s/sequence), RDP (0.002920 s/sequence), and SortMeRNA (0.003819 s/sequence) have relatively shallow slopes (Figure 5B). Naive Bayes (0.022984 s/sequence), BLAST+ (0.026222 s/sequence), and vsearch (0.030190 s/sequence) exhibit greater slopes. Legacy BLAST (0.133292 s/sequence) yielded a slope far higher than the other methods (more than four times that of vsearch, the next slowest), rendering this method impractical for large data sets.

Discussion

We have developed and validated several machine-learning and alignment-based classifiers provided in q2-feature-classifier and benchmarked these classifiers, as well as other common classification methods, to evaluate their strengths and weaknesses for marker-gene amplicon sequence classification across a range of parameter settings for each (Table 2).
Each classifier required some degree of optimization to define top-performing parameter configurations, with the sole exception of QIIME 1’s legacy BLAST wrapper, which was unaffected by its only user-defined parameter, e-value, over a range of 10^-10 to 1000. For all other methods, performance varied widely depending on parameter settings, and a single method could achieve among the worst performance with one configuration but among the best performance with another. Configurations greatly affected accuracy with mock community, cross-validated, and novel taxon evaluations, indicating that optimization is necessary under a variety of performance conditions, and optimization for one condition may not necessarily translate to another. Mock community and cross-validated evaluations exhibited similar results, but novel taxon evaluations selected different optimal configurations for most methods (Figure 4), indicating that configurations optimized to one condition, e.g., high-recall classification of known sequences, may be less suited for other conditions, e.g., classification of novel sequences. Table 2 lists the top-performing configuration for each method for several standard performance conditions.
Optimal configurations also varied among different evaluation metrics. Precision and recall, in particular, exhibited some mutual opposition, such that methods increasing precision reduced recall. For this reason, F-measure, the harmonic mean of precision and recall, is a useful metric for choosing configurations that are well balanced for average performance. “Balanced” method configurations, which maximize F-measure scores for mock, cross-validated, and novel taxon evaluations (Table 2), are best suited for a wide range of user conditions. The naive Bayes classifier with k-mer lengths of 6 or 7 and confidence = 0.7 (or confidence ≥ 0.9 if using bespoke class weights), RDP with confidence = 0.6-0.7, and UCLUST (minimum consensus = 0.51, minimum similarity = 0.9, max accepts = 3) perform best under these conditions (Table 2). Performance is dramatically improved using bespoke class weights for 16S rRNA sequences (Figure 4A-B), though this approach is developmental and only applicable when the expected composition of samples is known in advance (a scenario that is becoming increasingly common with the increasing quantity of public microbiome data, and which could be aided by microbiome data sharing resources such as Qiita (http://qiita.microbio.me)). For ITS sequences, the naive Bayes classifier with k-mer lengths of 6 or 7 and confidence ≥ 0.9, or RDP with confidence = 0.7-0.9, perform best, and the effects of bespoke class weights are less pronounced (Figure 4C-D).
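
Because F-measure is the harmonic mean of precision and recall, it penalizes configurations that gain one at a large cost to the other. The sketch below uses the standard definitions; the function name `precision_recall_f` and the counts are hypothetical and chosen only to illustrate the trade-off.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) from classification counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    # Harmonic mean of precision and recall; zero if both are zero.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# A cautious configuration: few false positives, many false negatives.
print(precision_recall_f(true_pos=8, false_pos=2, false_neg=8))
# A permissive configuration: more false positives, fewer false negatives.
print(precision_recall_f(true_pos=14, false_pos=6, false_neg=2))
```

A configuration with high precision but low recall scores a lower F-measure than its precision alone would suggest, which is the behavior that makes F-measure a reasonable criterion for the “balanced” recommendations.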
However, some users may require high-precision classifiers when false positives may be more damaging to the outcome, e.g., for detection of pathogens in a sample. Precision scores are maximized by naive Bayes and RDP classifiers with high confidence settings (Table 2). Optimizing for precision will significantly damage recall by yielding a high number of false negatives.
Other users may require high-recall classifiers when false negatives and underclassification hinder interpretation, but false positives (mostly overclassification to a closely related species) are less damaging. For example, in environments with high numbers of unidentified species, a high-precision classifier may yield large numbers of unclassified sequences; in such cases, a second pass with a high-recall configuration (Table 2) may provide useful inference of what taxa are most similar to these unclassified sequences. When recall is optimized, precision tends to suffer slightly (leading to similar F-measure scores to “balanced” configurations) but novel taxon classification accuracy is minimized, as these configurations tend to overclassify (Table 2). Any user prioritizing recall ought to be aware of and acknowledge these risks, e.g., when sharing or publishing their results, and understand that many of the species-level classifications may be wrong, particularly if the samples are expected to contain many uncharacterized species. For 16S rRNA gene sequences, naive Bayes bespoke classifiers with k-mer lengths between 12-32 and confidence = 0.5 yield maximal recall scores, but RDP (confidence = 0.5) and naive Bayes (uniform class weights, confidence = 0.5, k-mer length = 11, 12, or 18) also perform well (Table 2). Fungal recall scores are maximized by the same configurations recommended for “Balanced” classification, i.e., naive Bayes classifiers with k-mer lengths between 6-7 and confidence between 0.92-0.98, or RDP with confidence between 0.7-0.9 (Table 2).
Runtime requirements may also be the chief concern dictating method selection for some users. QIIME 1’s UCLUST wrapper provides the fastest runtime while still achieving reasonably good performance for most evaluations; Naive Bayes, RDP, and BLAST+ also delivered reasonably low runtime requirements, and outperformed UCLUST on most other evaluation metrics.
This study did not compare methods for classification of shotgun metagenome sequencing data sets, which present a series of unique challenges that do not exist for marker-gene amplicon sequence data. These include much higher unique sequence counts (making runtime a greater priority), the use of fully sequenced genomes as reference sequences, and different analysis and quality control protocols. Metagenome sequences also exhibit heterogeneous coverage and length, unlike marker-gene amplicon sequences, which typically have uniform start sites and read lengths within a single sequencing run. A recent benchmark of metagenome taxonomic profiling methods describes similar results to our benchmark of marker-gene sequence classifiers: most profilers perform well from phylum to family level but performance degrades at genus and species levels; different methods display superior performance according to different performance metrics; and parameter configuration dramatically impacts performance [26]. In the current study we focused on benchmarking and optimizing classifiers for marker-gene amplicon sequence data, in light of the distinct needs of metagenome and marker-gene sequence data sets.

Conclusions

The classification methods provided in q2-feature-classifier will support improved taxonomy classification of marker-gene amplicon sequences, and are released as a free, open-source plugin for use with QIIME 2. We demonstrate that these methods perform as well as or better than other leading taxonomy classification methods on a number of performance metrics. The naive Bayes, vsearch, and BLAST+ consensus classifiers described here are released for the first time in QIIME 2, with optimized “balanced” configurations (Table 2) set as defaults.
Wealsopresenttheresultsofabenchmarkofseveralwidelyusedtaxonomy429
classifiersformarker-geneampliconsequences,andrecommendthetop-performing430
methodsandconfigurationsforthemostcommonuserscenarios.Ourrecommendations431
for“balanced”methods(Table2)willbeappropriateformostuserswhoareclassifying432
16SrRNAgeneorfungalITSsequences,butotherusersmayprioritizehigh-precision(low433
false-positive)orhigh-recall(lowfalse-negative)methods.434
We have also shown that great potential exists for improving the accuracy of
taxonomy classifications by appropriately setting class weights for the machine learning
classifiers. Currently, no tools exist that allow users to generate appropriate values for
these class weights in real applications. Compiling appropriate class weights for different
sample types could be a promising approach to further improve taxonomic classification of
marker-gene sequence reads.

Methods
Mock communities
All mock communities were sourced from mockrobiota [11]. Raw fastq files were
demultiplexed and processed using tools available in QIIME 2 (version 2017.4)
(https://qiime2.org/). Reads were demultiplexed with q2-demux
(https://github.com/qiime2/q2-demux) and quality filtered and dereplicated with
q2-dada2 [4]. Representative sequence sets for each dada2 sequence variant were used for
taxonomy classification with each classification method.
The inclusion of multiple mock community samples is important to avoid overfitting;
optimizing method performance to a small set of data could result in overfitting to the
specific community compositions or conditions under which those data were generated,
which reduces the robustness of the classifier.
Cross-validated simulated reads
The simulated reads used here were derived from the reference databases using the
“Cross-validated classification performance” notebooks in our project repository. The
reference databases were either Greengenes or UNITE (99% OTUs) that were cleaned
according to taxonomic label to remove sequences with ambiguous or null labels.
Reference sequences were trimmed to simulate amplification using standard PCR primers
and to slice out the first 250 bases downstream (3’) of the forward primer. The bacterial
primers used were 27F/1492R [27] to simulate full-length 16S rRNA gene sequences,
515F/806R [28] to simulate 16S rRNA gene V4 domain sequences, and 27F/534R [29] to
simulate 16S rRNA gene V1-3 domain sequences; the fungal primers used were
BITSf/B58S3r [30] to simulate ITS1 internal transcribed spacer DNA sequences. The exact
sequences were used for cross validation, and were not altered to simulate any sequencing
error; thus, our benchmarks simulate denoised sequence data [4] and isolate classifier
performance from the impact of sequencing errors. Each database was stratified by
taxonomy and 10-fold randomised cross-validation datasets were generated using
scikit-learn’s library functions. Where a taxonomic label had fewer than 10 instances,
taxonomies were amalgamated to make sufficiently large strata. If, as a result, a taxonomy
in any test set was not present in the corresponding training set, the expected taxonomy
label was truncated to the nearest common taxonomic rank observed in the training set (e.g.,
Lactobacillus casei would become Lactobacillus). The notebook detailing simulated read
generation (for both cross-validated and novel taxon reads) prior to taxonomy
classification is available at
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/dataset-generation.ipynb.
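The label handling described in this section can be illustrated with a small sketch. This is not the code from the project notebooks: the function names `amalgamate_rare_labels` and `truncate_to_training` are hypothetical, taxonomies are assumed to be semicolon-delimited strings, and merging a rare label into its parent rank is only one plausible reading of "amalgamated".

```python
from collections import Counter


def parent(label):
    """Drop the deepest rank from a semicolon-delimited lineage."""
    ranks = label.split(";")
    return ";".join(ranks[:-1]) if len(ranks) > 1 else label


def amalgamate_rare_labels(labels, min_size=10):
    """Merge labels with fewer than min_size members into their parent rank.

    Repeats until every stratum is large enough or no label can be shortened.
    """
    strata = list(labels)
    while True:
        counts = Counter(strata)
        rare = {lab for lab, n in counts.items()
                if n < min_size and parent(lab) != lab}
        if not rare:
            return strata
        strata = [parent(lab) if lab in rare else lab for lab in strata]


def truncate_to_training(expected, training_labels):
    """Shorten an expected lineage to the deepest prefix seen in training."""
    known = set()
    for lab in training_labels:
        ranks = lab.split(";")
        known.update(";".join(ranks[:i]) for i in range(1, len(ranks) + 1))
    ranks = expected.split(";")
    for depth in range(len(ranks), 0, -1):
        prefix = ";".join(ranks[:depth])
        if prefix in known:
            return prefix
    return ""  # no rank in common with the training set
```

The amalgamated strata could then serve as the stratification labels for a splitter such as scikit-learn's `StratifiedKFold`, while `truncate_to_training` reproduces the Lactobacillus casei example when the training set holds only other Lactobacillus species.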
The evaluation of classification performance was also slightly modified from a standard
machine-learning scenario, as the classifiers in this study are able to refuse classification
if they are not confident above a taxonomic level for a given sample. This also accommodates
the taxonomy truncation that we performed for this test. The methodology was consistent with
that used below for novel taxon evaluations, but we defer this description to the next
section.
“Novel taxon” simulation analysis
“Novel taxon” classification analysis was performed to test the performance of classifiers
when assigning taxonomy to sequences that are not represented in a reference database,
e.g., as a simulation of what occurs when a method encounters an undocumented species
[22–25]. In this analysis, simulated amplicons were filtered from those used for the
cross-validation analysis. For all sequences present in each test set, sequences sharing
taxonomic affiliation at a given taxonomic level L (e.g., to species level) in the corresponding
training set were removed. Taxa are stratified among query (test) and reference (training)
sets such that for each query taxonomy at level L, no reference sequences match that
taxonomy, but at least one reference sequence will match the taxonomic lineage at level L-1
(e.g., same genus but different species). An ideal classifier would assign taxonomy to the
nearest common
taxonomic lineage (e.g., genus), but would not “overclassify” [25] to near neighbors (e.g.,
assign species-level taxonomy when species X is removed from the reference database).
For example, a “novel” sequence representing the species Lactobacillus brevis should be
classified as “Lactobacillus”, without species-level annotation, in order to be considered a
true positive in this analysis. As described above for cross-validated reads, these novel taxa
simulated communities were also tested in both bacterial (B) and fungal (F) databases on
simulated amplicons trimmed to simulate 250-nt sequencing reads.
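The filtering described above can be sketched as follows. This is a minimal illustration rather than the notebook code: the function name `novel_taxon_split` and the representation of each taxonomy as a list of rank labels are assumptions made for the example.

```python
def novel_taxon_split(test, train, level):
    """Build a query set and reference set for novel-taxon evaluation.

    test, train: dicts mapping sequence ID -> list of rank labels
                 (index 0 is the highest rank).
    level: index of rank L at which query taxa must be novel.
    """
    # Lineages (down to level L) represented among the test sequences.
    test_lineages = {tuple(tax[:level + 1]) for tax in test.values()}
    # Remove every reference sequence sharing one of those lineages.
    reference = {sid: tax for sid, tax in train.items()
                 if tuple(tax[:level + 1]) not in test_lineages}
    # Keep only queries that still have a neighbor at level L-1.
    parents = {tuple(tax[:level]) for tax in reference.values()}
    query = {sid: tax for sid, tax in test.items()
             if tuple(tax[:level]) in parents}
    return query, reference
```

With species as level L, a Lactobacillus brevis query is retained only if every L. brevis reference has been removed and at least one other Lactobacillus species remains in the reference set.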
Novel taxon classification performance is evaluated using precision, recall,
F-measure, overclassification rates, underclassification rates, and misclassification rates [25]
for each taxonomic level (phylum to species), computed with the following definitions (see
below, Performance analyses using simulated reads, for full description of precision, recall,
and F-measure calculations):
1) A true positive is a classification to the nearest correct lineage contained in the
reference database. For example, if Lactobacillus brevis is removed from the reference
database and used as a query sequence, the only correct taxonomy classification
would be “Lactobacillus”, without species-level classification.
2) A false positive would be either a classification to a different Lactobacillus species
(Overclassification), or any genus other than Lactobacillus (Misclassification).
3) A false negative occurs if an expected taxonomy classification (e.g., “Lactobacillus”)
is not observed in the results. Note that this will be the modified taxonomy expected
when using a naive reference database, and is not the same as the true taxonomic
affiliation of a query sequence in the novel taxa analysis. A false negative results
from misclassification, overclassification, or when the classification contains the
correct basal lineage, but does not assign a taxonomy label at level L
(Underclassification). E.g., classification as “Lactobacillaceae”, but no genus-level
classification.
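The three definitions above can be expressed as a small decision function. The sketch below is illustrative only: the names `novel_taxon_outcome` and `tally` are hypothetical, and lineages are assumed to be lists of rank labels with the expected lineage already truncated to the nearest rank present in the reference database.

```python
def novel_taxon_outcome(expected, observed):
    """Label one classification against the expected (truncated) lineage.

    expected, observed: lists of rank labels, highest rank first.
    """
    n = len(expected)
    if observed[:n] == expected:
        # Correct lineage; any extra ranks mean the classifier went too deep.
        return "match" if len(observed) == n else "overclassification"
    if observed == expected[:len(observed)]:
        # Correct basal lineage, but stops short of the expected rank.
        return "underclassification"
    return "misclassification"


def tally(outcomes):
    """Convert a list of outcome labels into (TP, FP, FN) counts."""
    tp = outcomes.count("match")
    over = outcomes.count("overclassification")
    under = outcomes.count("underclassification")
    mis = outcomes.count("misclassification")
    # FP: over- or misclassification; FN: any of the three error types.
    return tp, over + mis, over + under + mis
```

For the Lactobacillus brevis example, an observed lineage ending at Lactobacillus is a match, one ending at a Lactobacillus species is an overclassification, and one ending at Lactobacillaceae is an underclassification.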
Taxonomy classification
Representative sequences for all analyses (mock community, cross-validated, and novel
taxa) were classified taxonomically using the following taxonomy classifiers and setting
sweeps:
1. q2-feature-classifier multinomial naive Bayes classifier. Varied k-mer length
in {4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32} and confidence threshold in {0, 0.5, 0.7, 0.9,
0.92, 0.94, 0.96, 0.98, 1}.
2. BLAST+ [9] local sequence alignment, followed by consensus taxonomy
classification implemented in q2-feature-classifier. Varied max accepts from 1 to 100;
percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99. See
description below.
3. vsearch [10] global sequence alignment, followed by consensus taxonomy
classification implemented in q2-feature-classifier. Varied max accepts from 1 to
100; percent identity from 0.80 to 0.99; and minimum consensus from 0.51 to 0.99.
See description below.
4. Ribosomal Database Project (RDP) naïve Bayesian classifier [12] (QIIME 1
wrapper), with confidence thresholds from 0.0 to 1.0 in steps of 0.1.
5. Legacy BLAST [13] (QIIME 1 wrapper) varying e-value thresholds from 1e-9
to 1000.
6. SortMeRNA [15] (QIIME 1 wrapper) varying minimum consensus fraction
from 0.51 to 0.99; similarity from 0.8 to 0.9; max accepts from 1 to 10; and coverage
from 0.8 to 0.9.
7. UCLUST [14] (QIIME 1 wrapper) varying minimum consensus fraction from
0.51 to 0.99; similarity from 0.8 to 0.9; and max accepts from 1 to 10.
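To make the size of such a sweep concrete, the naive Bayes grid from item 1 can be enumerated as below. This is only a sketch of how the combinations might be generated; the dictionary keys `ngram_range` and `confidence` follow the parameter names used in Table 3, and the function name `naive_bayes_sweep` is hypothetical.

```python
from itertools import product

KMER_LENGTHS = [4, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 32]
CONFIDENCES = [0, 0.5, 0.7, 0.9, 0.92, 0.94, 0.96, 0.98, 1]


def naive_bayes_sweep():
    """Yield one parameter dict per (k-mer length, confidence) combination."""
    for k, conf in product(KMER_LENGTHS, CONFIDENCES):
        # A single k-mer length k is written as the range [k, k].
        yield {"ngram_range": [k, k], "confidence": conf}


configs = list(naive_bayes_sweep())
print(len(configs))  # 12 k-mer lengths x 9 thresholds = 108
```

Each configuration is then applied to every dataset, which is why runtime per classification matters when choosing how finely to sample each parameter.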

With the exception of the UCLUST classifier, we have only benchmarked the performance of
open-source, free, marker-gene-agnostic classifiers, i.e., those that can be trained/aligned
on a reference database of any marker gene. Hence, we excluded classifiers that can only
assign taxonomy to a particular marker gene (e.g., only bacterial 16S rRNA genes) and
those that rely on specialized or unavailable reference databases and cannot be trained on
other databases, effectively restricting their use for other marker genes and custom
databases.
Classification of bacterial/archaeal 16S rRNA gene sequences was made using the
Greengenes (13_8 release) [5] reference sequence database preclustered at 99% ID, with
amplicons for the domain of interest extracted using primers 27F/1492R [27], 515F/806R
[28], or 27F/534R [29] with q2-feature-classifier’s extract_reads method. Classification of
fungal ITS sequences was made using the UNITE database (version 7.1 QIIME developer
release) [31] preclustered at 99% ID. For the cross validation and novel taxon
classification tests we prefiltered to remove sequences with incomplete or ambiguous
taxonomies (containing the substrings ‘unknown’, ‘unidentified’, or ‘_sp’ or terminating at
any level with ‘__’).
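A literal reading of this prefilter can be sketched as follows. The function name `is_well_annotated` is hypothetical, and the case-sensitive substring matching is an assumption; the authors' actual filter may differ in detail.

```python
AMBIGUOUS_SUBSTRINGS = ("unknown", "unidentified", "_sp")


def is_well_annotated(taxonomy):
    """Return True if a lineage has no ambiguous or empty rank labels."""
    # Literal, case-sensitive substring match. Note that "_sp" also
    # matches labels such as "s__sphaericus" under this reading.
    if any(s in taxonomy for s in AMBIGUOUS_SUBSTRINGS):
        return False
    # An empty rank looks like "s__": the prefix with nothing after it.
    ranks = [r.strip() for r in taxonomy.split(";")]
    return not any(r.endswith("__") for r in ranks)
```

Applied to a reference taxonomy file, only sequences for which this function returns True would be kept for the cross-validation and novel taxon tests.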

The notebooks detailing taxonomy classification sweeps of mock communities are available
at https://github.com/caporaso-lab/tax-credit-data/tree/0.1.0/ipynb/mock-community.
Cross-validated read classification sweeps are available at
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/cross-validated/taxonomy-assignment.ipynb.
Novel taxon classification sweeps are available at
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/taxonomy-assignment.ipynb.

Runtime analyses
The tax-credit framework employs two different runtime metrics: runtime as a function of
1) the number of query sequences or 2) the number of reference sequences. Taxonomy classifier
runtimes were logged while performing classifications of pseudorandom subsets of 1,
2,000, 4,000, 6,000, 8,000, and 10,000 sequences from the Greengenes 99% OTU database.
Each subset was drawn once then used for all of the tests as appropriate. All runtimes were
computed on the same Linux workstation (Ubuntu 16.04.2 LTS, Intel Xeon CPU E7-4850 v3
@ 2.20GHz, 1TB memory). The exact commands used for runtime analysis are presented in
the “Runtime analyses” notebook in the project repository
(https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/runtime/analysis.ipynb).
Performance analyses using simulated reads
Cross-validated and novel taxa reads are evaluated using the classic precision, recall, and
F-measure metrics [5] (novel taxa use the standard calculations as described below, but
modified definitions for true positive (TP), false positive (FP), and false negative (FN), as
described above for novel taxon classification analysis).
Precision, recall, and F-measure are calculated as follows:
○ Precision = TP/(TP+FP) or the fraction of sequences that were classified correctly at
level L.
○ Recall = TP/(TP+FN) or the fraction of expected taxonomic labels that were
predicted at level L.
○ F-measure = 2 × Precision × Recall / (Precision + Recall), or the harmonic mean of
precision and recall.
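These three formulas translate directly into code. The sketch below uses a hypothetical function name, and returning 0.0 when a denominator is zero is an assumption made for the example rather than something specified in the text.

```python
def precision_recall_fmeasure(tp, fp, fn):
    """Compute precision, recall, and F-measure from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Because the F-measure is a harmonic mean, it is pulled toward the lower of the two scores, so a classifier cannot reach a high F-measure by maximizing precision at the expense of recall or vice versa.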
The Jupyter notebook detailing commands used for evaluation of cross-validated read
classifications is available at
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/cross-validated/evaluate-classification.ipynb.
The notebook for evaluation of novel taxon classifications is available at
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/novel-taxa/evaluate-classification.ipynb.
Performance analyses using mock communities
The Jupyter notebook detailing commands used for evaluation of mock communities,
including the three evaluation types described below, is available at
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/mock-community/evaluate-classification-accuracy.ipynb.
Precision and Recall
Classic precision, recall, and F-measure are used to calculate mock community
classification accuracy, using the definitions given above for simulated reads. These metrics
require knowing the expected classification of each sequence, which we determine by
performing a gapless alignment between each representative sequence in the mock
community and the marker-gene sequences of each microbial strain added to the mock
community. These “expected sequences” are provided for the mock communities in
mockrobiota [11]. Representative sequences are assigned the taxonomy of the best
alignment, and any representative sequence with more than 3 mismatches to the expected
sequences is excluded from precision/recall calculations. If a representative sequence
aligns to more than one expected sequence equally well, all top hits are accepted as the
“correct” classification. This scenario is rare and typically only occurred when different
strains of the same species were added to the same mock community to intentionally
produce this challenge (e.g., for mock-12 as described by [4]). Precision, recall, and
F-measure are then calculated by comparing the “expected” classification for each mock
community sequence to the classifications predicted by each taxonomy classifier using the
full reference databases, as described above.
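The assignment of expected taxonomies can be sketched as below. This is an illustrative stand-in, not the tax-credit implementation: the function names are hypothetical, and sliding the shorter sequence along the longer one without gaps is an assumed interpretation of "gapless alignment".

```python
def min_mismatches(query, expected):
    """Fewest mismatches over all gapless placements of one sequence in the other."""
    if len(query) > len(expected):
        query, expected = expected, query
    best = len(query)
    for offset in range(len(expected) - len(query) + 1):
        window = expected[offset:offset + len(query)]
        mismatches = sum(a != b for a, b in zip(query, window))
        best = min(best, mismatches)
    return best


def expected_taxonomy(query, expected_seqs, max_mismatches=3):
    """Return taxonomies of the best-matching expected sequences.

    expected_seqs: non-empty dict mapping taxonomy -> expected sequence.
    Returns an empty list if the best match exceeds max_mismatches.
    """
    scores = {tax: min_mismatches(query, seq)
              for tax, seq in expected_seqs.items()}
    best = min(scores.values())
    if best > max_mismatches:
        return []  # excluded from precision/recall calculations
    # Ties are all accepted as correct classifications.
    return sorted(tax for tax, s in scores.items() if s == best)
```

Returning every tied taxonomy mirrors the rule that all top hits are accepted when a representative sequence aligns equally well to more than one expected sequence.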
Taxon accuracy rate and taxon detection rate
Taxon accuracy rate (TAR) and taxon detection rate (TDR) are used for qualitative
compositional analyses of mock communities. As the true taxonomy labels for each
sequence in a mock community are not known with absolute certainty, TAR and TDR are
useful alternatives to precision and recall that instead rely on the presence/absence of
expected taxa, or microbiota that are intentionally added to the mock community. In
practice, TAR/TDR are complementary metrics to precision/recall and should provide
similar results if the expected classifications for mock community representative
sequences are accurate.
At a given taxonomic level, a classification is a:
○ true positive (TP), if that taxon is both observed and expected.
○ false positive (FP), if that taxon is observed but not expected.
○ false negative (FN), if a taxon is expected but not observed.
These are used to calculate TAR and TDR as:
○ TAR = TP/(TP+FP) or the fraction of observed taxa that were expected at level L.
○ TDR = TP/(TP+FN) or the fraction of expected taxa that are observed at level L.
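Because TAR and TDR depend only on presence or absence, they reduce to set operations. The following sketch uses a hypothetical function name and, as an assumption, returns 0.0 when the observed or expected set is empty.

```python
def taxon_accuracy_and_detection(observed, expected):
    """Compute TAR and TDR from collections of observed and expected taxa."""
    observed, expected = set(observed), set(expected)
    tp = len(observed & expected)   # observed and expected
    fp = len(observed - expected)   # observed but not expected
    fn = len(expected - observed)   # expected but not observed
    tar = tp / (tp + fp) if observed else 0.0
    tdr = tp / (tp + fn) if expected else 0.0
    return tar, tdr
```

Unlike precision and recall, these scores ignore how many sequences carry each label, so a single spurious sequence lowers TAR as much as an abundant false-positive taxon does.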
Bray-Curtis Dissimilarity
Bray-Curtis dissimilarity [32] is used to measure the degree of dissimilarity between two
samples as a function of the abundance of each species label present in each sample,
treating each species as equally related. This is a useful metric for evaluating classifier
performance by assessing the relative distance between each predicted mock community
composition (abundance of taxa in a sample based on results of a single classifier) and the
expected composition of that sample. For each classifier, Bray-Curtis distances between the
expected and observed taxonomic compositions are calculated for each sample in each
mock community dataset; this yields a single expected-observed distance for each
individual observation. The distance distributions for each method are then compared
statistically using paired or unpaired t-tests to assess whether one method (or
configuration) performs consistently better than another.
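For two compositions given as taxon abundances, Bray-Curtis dissimilarity is the sum of absolute abundance differences divided by the total abundance of both samples. A minimal sketch, assuming compositions are stored as dictionaries of non-negative abundances and using a hypothetical function name:

```python
def bray_curtis(expected, observed):
    """Bray-Curtis dissimilarity between two taxon -> abundance dicts."""
    taxa = set(expected) | set(observed)
    # Taxa missing from one sample count as abundance 0 in that sample.
    diff = sum(abs(expected.get(t, 0) - observed.get(t, 0)) for t in taxa)
    total = sum(expected.values()) + sum(observed.values())
    return diff / total if total else 0.0
```

A value of 0 means the predicted composition matches the expected composition exactly, and a value of 1 means the two share no taxa.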
New taxonomy classifiers
We describe q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a
plugin for QIIME 2 (https://qiime2.org/) that performs multi-class taxonomy classification
of marker-gene sequence reads. In this work we compare the consensus BLAST+ and
vsearch methods and the naive Bayes scikit-learn classifier. The software is free and
open-source.
Machine learning taxonomy classifiers
The q2-feature-classifier plugin allows users to apply any of the suite of machine learning
classifiers available in scikit-learn (http://scikit-learn.org) to the problem of taxonomy
classification of marker-gene sequences. It functions as a lightweight wrapper that
transforms the problem into a standard document classification problem. Advanced users
can input any appropriate scikit-learn classifier pipeline, which can include a range of
feature extraction and transformation steps as well as specifying a machine learning
algorithm.

The plugin provides a default method which is to extract k-mer counts from reference
sequences and train the scikit-learn multinomial naive Bayes classifier, and it is this
method that we test extensively here. Specifically, the pipeline consists of a
sklearn.feature_extraction.text.HashingVectorizer feature extraction step followed by a
sklearn.naive_bayes.MultinomialNB classification step. The use of a hashing feature
extractor allows the use of significantly longer k-mers than the 8-mers that are used by
RDP Classifier, and we tested up to 32-mers. As with most scikit-learn classifiers, class
weights can be set when training the multinomial naive Bayes classifiers. In the naive
Bayes setting, setting class weights means that class priors are not derived from the
training data or set to be uniform, as they are for the RDP Classifier. For more detail on how
class weights enter the calculations please refer to the scikit-learn User Guide
(http://scikit-learn.org).
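A two-step pipeline of this kind can be sketched with scikit-learn as shown below. This is an illustration on invented toy sequences, not the plugin's exact configuration: the values n_features = 8192, alpha = 0.001, and L2 normalisation (the HashingVectorizer default) follow the sweep conclusions reported in this paper, while `analyzer="char"`, `alternate_sign=False`, and `lowercase=False` are assumptions chosen so that the hashed k-mer features are non-negative, as MultinomialNB requires.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy reference set: (sequence, taxonomy) pairs, invented for illustration.
reference = [
    ("ACGTACGTACGTACGTACGT", "g__A"),
    ("ACGTACGAACGTACGTACGA", "g__A"),
    ("TTGGCCAATTGGCCAATTGG", "g__B"),
    ("TTGGCCAATTGGCCTATTGG", "g__B"),
]
sequences, labels = zip(*reference)

k = 7  # k-mer length
pipeline = Pipeline([
    # Character n-grams of length k play the role of "words".
    ("feat_ext", HashingVectorizer(analyzer="char", ngram_range=(k, k),
                                   n_features=8192, alternate_sign=False,
                                   lowercase=False)),
    # fit_prior=False gives uniform class priors.
    ("classify", MultinomialNB(alpha=0.001, fit_prior=False)),
])
pipeline.fit(sequences, labels)

query = "ACGTACGTACGTACGAACGT"
print(pipeline.predict([query])[0])
print(pipeline.predict_proba([query]).max())
```

Hashing maps each k-mer to one of a fixed number of feature columns, so memory use does not grow with k, which is what makes long k-mers such as 32-mers practical.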

In most settings, it is highly unlikely that the assumption of uniform weights is correct. That
assumption is that each of the taxa in the reference database is equally likely to appear in
each sample. Setting class weights to more realistic values can greatly aid the classifier in
making more accurate predictions, as we show in this work. When testing the mock
communities we made use of the fact that the sequence compositions were known a priori
for the bespoke classifier. For the simulated reads studies, we allowed the classifier to set
the class weights from the class frequencies observed in each training set for the bespoke
classifier.
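The difference between the two weighting assumptions can be shown in a few lines. The helper names below are hypothetical and the sketch only computes the priors; it does not reproduce how the plugin passes them to the classifier.

```python
from collections import Counter


def uniform_priors(classes):
    """Every taxon is assumed equally likely to appear."""
    return {c: 1 / len(classes) for c in classes}


def frequency_priors(training_labels):
    """Priors proportional to how often each taxon occurs in the labels."""
    counts = Counter(training_labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}
```

In scikit-learn, an array of such values ordered to match the classifier's classes could be supplied through the `class_prior` parameter of `MultinomialNB`, the parameter listed in Table 3.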

For this study, we performed two parameter sweeps on the mock communities: an initial
broad sweep to optimize feature extraction parameters and then a more focussed sweep to
optimise k-mer length and confidence parameter settings. These sweeps included varying
the assumptions regarding class weights. The focussed sweeps were also performed for the
cross-validated and novel taxa evaluations, but only for the assumption of uniform class
priors. The results for the focussed sweeps across all data sets are those which are
compared against the other classifiers in this work.

The broad sweeps used a modified scikit-learn pipeline which consisted of the
sklearn.feature_extraction.text.HashingVectorizer, followed by the
sklearn.feature_extraction.text.TfidfTransformer, then the
sklearn.naive_bayes.MultinomialNB. We performed a full grid search over the parameters
shown in Table 3. The conclusion from the initial sweep was that the TfidfTransformer step
did not significantly improve classification, that n_features should be set to 8192, feature
vectors should be normalised using L2 normalisation and that the alpha parameter for the
naive Bayes classifier should be set to 0.001. Please see
https://github.com/caporaso-lab/tax-credit-data/blob/0.1.0/ipynb/mock-community/evaluate-classification-accuracy-nb-extra.ipynb
for details.
Consensus taxonomy alignment-based classifiers

Two new classifiers implemented in q2-feature-classifier perform consensus taxonomy
classification based on alignment of a query sequence to reference sequences. The
methods classify_consensus_vsearch and classify_consensus_blast use the global aligner
vsearch [10] or the local aligner BLAST+ [9], respectively, to return up to maxaccepts
reference sequences that align to the query with at least perc_identity similarity. A
consensus taxonomy is then assigned to the query sequence by determining the taxonomic
lineage on which at least min_consensus of the aligned sequences agree. This consensus
taxonomy is truncated at the taxonomic level at which fewer than min_consensus of
taxonomies agree. For example, if a query sequence is classified with maxaccepts=3,
min_consensus=0.51, and the following top hits:

k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae;
g__Lactobacillus; s__brevis
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae;
g__Lactobacillus; s__brevis
k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Lactobacillaceae;
g__Lactobacillus; s__delbrueckii

The taxonomy label assigned will be k__Bacteria; p__Firmicutes; c__Bacilli;
o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus; s__brevis. However, if
min_consensus=0.99, the taxonomy label assigned will be k__Bacteria; p__Firmicutes;
c__Bacilli; o__Lactobacillales; f__Lactobacillaceae; g__Lactobacillus.
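The consensus step in this example can be reproduced with a short sketch. This is not the plugin's source code: the function name `consensus_taxonomy` is hypothetical, the hits are assumed to be a non-empty list of semicolon-delimited taxonomy strings, and the alignment step that produces them is omitted.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus=0.51):
    """Deepest lineage on which at least min_consensus of hits agree.

    hits: non-empty list of semicolon-delimited taxonomy strings.
    """
    lineages = [[r.strip() for r in h.split(";")] for h in hits]
    consensus = []
    for depth in range(max(len(l) for l in lineages)):
        # Count lineage prefixes, so agreement at a rank implies
        # agreement at every rank above it.
        prefixes = Counter(tuple(l[:depth + 1]) for l in lineages
                           if len(l) > depth)
        prefix, count = prefixes.most_common(1)[0]
        if count / len(lineages) < min_consensus:
            break
        consensus = list(prefix)
    return "; ".join(consensus)
```

With the three hits above, two of three (0.67) agree on s__brevis, which satisfies min_consensus=0.51 but not min_consensus=0.99, so the second setting stops at g__Lactobacillus.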

Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Availability of data and materials
Mock community sequence data used in this study are publicly available in mockrobiota
[11] under the study identities listed in Table 1. All other data generated in this study, and
all new software, are available in our GitHub repositories under the BSD license. The
tax-credit repository can be found at: https://github.com/caporaso-lab/tax-credit, and static
versions of all analysis notebooks, which contain all code and analysis results, can be
viewed there. The q2-feature-classifier repository can be accessed at
https://github.com/qiime2/q2-feature-classifier; as a QIIME 2 core plugin, it is
automatically installed any time QIIME 2 (https://qiime2.org/) is installed.

Project name: q2-feature-classifier
Project home page: https://github.com/qiime2/q2-feature-classifier
Operating system(s): macOS, Linux
Programming language: Python
Other requirements: QIIME 2
License: BSD-3-Clause
Any restrictions to use by non-academics: None

Project name: tax-credit
Project home page: https://github.com/caporaso-lab/tax-credit
Operating system(s): macOS, Linux
Programming language: Python
Other requirements: None (QIIME 2 required for some optional functions)
License: BSD-3-Clause
Any restrictions to use by non-academics: None

Funding
This work was funded in part by National Science Foundation award 1565100 to JGC and
RK, awards from the Alfred P. Sloan Foundation to JGC and RK, awards from the
Partnership for Native American Cancer Prevention (NIH/NCI U54CA143924 and
U54CA143925) to JGC, and National Health and Medical Research Council of Australia
award APP1085372 to GAH, JGC and RK. These funding bodies had no role in the design of
the study, the collection, analysis, or interpretation of data, or in writing the manuscript.
Acknowledgments
The authors thank Stephen Gould and Cheng Soon Ong for advice on machine learning
optimisation.
Authors’ Contributions
NAB, RK, and JGC conceived and designed tax-credit. NAB, BDK, JGC, and JRR contributed
to tax-credit. BDK, MD, JGC, and NAB contributed to q2-feature-classifier. BDK, JGC, MD,
JRR, and EB provided QIIME 2 integration with q2-feature-classifier. JGC and GAH provided
materials and support. NAB, BDK, JGC, and GAH wrote the manuscript with input from all
co-authors.
Competing Interests
The authors declare that they have no competing interests.

Tables and Figures
Table 1. Mock communities currently integrated in tax-credit.
Study ID*  Target gene**  Platform      Species  Strains  Citation
mock-1     16S            HiSeq         46       48       [33]
mock-2     16S            MiSeq         46       48       [33]
mock-3     16S            MiSeq         21       21       [33]
mock-4     16S            MiSeq         21       21       [33]
mock-5     16S            MiSeq         21       21       [33]
mock-7     16S            HiSeq         67       67       [34]
mock-8     16S            HiSeq         67       67       [11]
mock-9     ITS            HiSeq         13       16       [11]
mock-10    ITS            HiSeq         13       16       [11]
mock-12    16S            MiSeq         26       27       [4]
mock-16    16S            MiSeq         56       59       [35]
mock-18    16S            MiSeq         15       15       [36]
mock-19    16S            MiSeq         15       27       [36]
mock-20    16S            MiSeq         20       20       [37]
mock-21    16S            MiSeq         20       20       [37]
mock-22    16S            MiSeq         20       20       [37]
mock-23    16S            MiSeq         20       20       [37]
mock-24    ITS            MiSeq         8        8        [38]
mock-26    ITS            FLX Titanium  11       11       [39]
*All studies are available on mockrobiota [11] at
https://github.com/caporaso-lab/mockrobiota/tree/master/data/[studyID]
**Abbreviations: 16S = 16S rRNA gene; HiSeq = Illumina HiSeq; MiSeq = Illumina MiSeq.

Table 2. Optimized methods configurations for standard operating conditions.
                                                    Mock                 Cross-validated      Novel taxa
Target         Condition  Method       Parameters   F      P      R      F      P      R      F      P      R      Threshold
16S rRNA gene  Balanced   NB-bespoke   [6,6]:0.9    0.705  0.98   0.582  0.827  0.931  0.744  0.165  0.243  0.125  F = (0.49, 0.8, 0.1)
                                       [6,6]:0.92   0.705  0.98   0.581  0.825  0.936  0.737  0.165  0.251  0.123  F = (0.7, 0.8, 0.15)
                                       [6,6]:0.94   0.703  0.98   0.579  0.822  0.942  0.729  0.162  0.259  0.118
                                       [7,7]:0.92   0.712  0.978  0.592  0.831  0.931  0.751  0.151  0.221  0.115
                                       [7,7]:0.94   0.708  0.978  0.586  0.829  0.936  0.743  0.157  0.239  0.117
                          naive-bayes  [7,7]:0.7    0.495  0.797  0.38   0.819  0.886  0.761  0.115  0.138  0.099
                          rdp          0.6          0.564  0.798  0.457  0.815  0.868  0.768  0.102  0.128  0.084
                                       0.7          0.55   0.799  0.438  0.812  0.892  0.746  0.124  0.173  0.096
                          uclust       0.51:0.9:3   0.498  0.746  0.392  0.846  0.876  0.817  0.154  0.201  0.126
               Precision  NB-bespoke   [6,6]:0.98   0.676  0.987  0.537  0.803  0.956  0.692  0.163  0.303  0.111  P = (0.94, 0.95, 0.25)
                                       [7,7]:0.98   0.687  0.98   0.551  0.815  0.951  0.713  0.164  0.283  0.115
                          rdp          1            0.239  0.941  0.16   0.632  0.968  0.469  0.12   0.457  0.069
               Recall     NB-bespoke   [12,12]:0.5  0.754  0.8    0.721  0.815  0.83   0.801  0.053  0.058  0.049  R = (0.47, 0.75, 0.04)
                                       [14,14]:0.5  0.758  0.802  0.726  0.811  0.826  0.797  0.052  0.057  0.048  R = (0.7, 0.75, 0.04)
                                       [16,16]:0.5  0.755  0.785  0.732  0.808  0.825  0.792  0.052  0.058  0.047
                                       [18,18]:0.5  0.772  0.803  0.748  0.805  0.823  0.789  0.055  0.061  0.05
                                       [32,32]:0.5  0.937  0.966  0.913  0.788  0.818  0.76   0.054  0.067  0.045
                          naive-bayes  [11,11]:0.5  0.567  0.77   0.479  0.793  0.82   0.768  0.059  0.065  0.055
                                       [12,12]:0.5  0.567  0.769  0.479  0.79   0.816  0.765  0.059  0.064  0.055
                                       [18,18]:0.5  0.564  0.764  0.477  0.779  0.807  0.753  0.057  0.063  0.051
                          rdp          0.5          0.577  0.791  0.48   0.816  0.848  0.787  0.068  0.079  0.06
               Novel      blast+       10:0.51:0.8  0.436  0.723  0.325  0.816  0.896  0.749  0.225  0.332  0.171  F = (0.4, 0.8, 0.2)
                          uclust       0.76:0.9:5   0.467  0.775  0.348  0.84   0.938  0.76   0.219  0.358  0.158
                          vsearch      10:0.51:0.8  0.45   0.74   0.342  0.814  0.891  0.75   0.226  0.333  0.171
                                       10:0.51:0.9  0.45   0.74   0.342  0.82   0.896  0.755  0.219  0.338  0.162
Fungi          Balanced   naive-bayes  [6,6]:0.94   0.874  0.935  0.827  0.481  0.57   0.416  0.374  0.438  0.327  F = (0.85, 0.45, 0.37)
                                       [6,6]:0.96   0.874  0.935  0.827  0.495  0.597  0.423  0.399  0.473  0.344
                                       [6,6]:0.98   0.874  0.935  0.827  0.505  0.629  0.423  0.426  0.52   0.361
                                       [7,7]:0.98   0.874  0.935  0.827  0.485  0.596  0.409  0.388  0.47   0.33
                          NB-bespoke   [6,6]:0.94   0.928  0.968  0.915  0.48   0.567  0.416  0.371  0.433  0.325
                                       [6,6]:0.96   0.928  0.968  0.915  0.491  0.59   0.42   0.393  0.466  0.34
                                       [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
                                       [7,7]:0.98   0.935  0.97   0.921  0.487  0.596  0.412  0.386  0.466  0.329
                          rdp          0.7          0.929  0.939  0.922  0.479  0.572  0.413  0.382  0.451  0.332
                                       0.8          0.924  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
                                       0.9          0.922  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379
               Precision  naive-bayes  [6,6]:0.98   0.874  0.935  0.827  0.505  0.629  0.423  0.426  0.52   0.361  P = (0.92, 0.6, 0.3)
                          NB-bespoke   [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
                          rdp          0.8          0.924  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
                                       0.9          0.922  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379
                                       1            0.821  0.943  0.742  0.461  0.81   0.322  0.459  0.774  0.327
               Recall     NB-bespoke   [6,6]:0.92   0.938  0.971  0.924  0.467  0.544  0.409  0.353  0.407  0.312  R = (0.9, 0.4, 0.3)
                                       [6,6]:0.94   0.928  0.968  0.915  0.48   0.567  0.416  0.371  0.433  0.325
                                       [6,6]:0.96   0.928  0.968  0.915  0.491  0.59   0.42   0.393  0.466  0.34
                                       [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
                                       [7,7]:0.96   0.935  0.969  0.921  0.47   0.56   0.404  0.357  0.422  0.31
                                       [7,7]:0.98   0.935  0.97   0.921  0.487  0.596  0.412  0.386  0.466  0.329
                          rdp          0.7          0.929  0.939  0.922  0.479  0.572  0.413  0.382  0.451  0.332
                                       0.8          0.924  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
                                       0.9          0.922  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379
               Novel      naive-bayes  [6,6]:0.98   0.874  0.935  0.827  0.505  0.629  0.423  0.426  0.52   0.361  F = (0.85, 0.45, 0.4)
                          NB-bespoke   [6,6]:0.98   0.927  0.97   0.913  0.504  0.624  0.422  0.421  0.512  0.358
                          rdp          0.8          0.923  0.939  0.915  0.507  0.633  0.422  0.434  0.534  0.366
                                       0.9          0.921  0.937  0.913  0.517  0.698  0.411  0.47   0.617  0.379

a F = F-measure, P = precision, R = recall
b Naive Bayes parameters: k-mer range, confidence
c RDP parameters: confidence
d BLAST+/vsearch parameters: max accepts, minimum consensus, minimum percent identity
e UCLUST parameters: minimum consensus, similarity, max accepts
f Threshold describes the score cutoffs used to define optimal method ranges, in the format:
[metric = (mock score, cross-validated score, novel-taxa score)]. If two cutoffs are given,
the second indicates a higher cutoff used to select parameters for the developmental
NB-bespoke method, and the configurations listed are the union of the two cutoffs: the second
cutoff for selecting NB-bespoke, the first for selecting all other methods.

Table 3. Naive Bayes broad grid search parameters

Step                                                Parameter     Values
sklearn.feature_extraction.text.HashingVectorizer   n_features    1024, 8192, 65536
                                                    ngram_range   [4,4], [8,8], [16,16], [4,16]
sklearn.feature_extraction.text.TfidfTransformer    norm          l1, l2, None
                                                    use_idf       True, False
sklearn.naive_bayes.MultinomialNB                   alpha         0.001, 0.01, 0.1
                                                    class_prior   None, array of class weights
post processing                                     confidence    0, 0.2, 0.4, 0.6, 0.8
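
The grid in Table 3 can be enumerated and mapped onto a scikit-learn pipeline roughly as follows. This is a minimal sketch, not the q2-feature-classifier implementation: the helper names, the character-level analyzer, `alternate_sign=False`, `norm=None` on the vectorizer, and the toy sequences are all assumptions, and the "array of class weights" option is represented only by a placeholder because the table does not give its values.

```python
from itertools import product

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Parameter values copied from Table 3. "class_weights" is a placeholder for
# the "array of class weights" option, whose contents are not specified.
GRID = {
    "n_features": [1024, 8192, 65536],
    "ngram_range": [(4, 4), (8, 8), (16, 16), (4, 16)],
    "norm": ["l1", "l2", None],
    "use_idf": [True, False],
    "alpha": [0.001, 0.01, 0.1],
    "class_prior": [None, "class_weights"],
    "confidence": [0, 0.2, 0.4, 0.6, 0.8],
}


def configurations(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def build_pipeline(config, class_weights=None):
    """Assemble the three scikit-learn steps for one configuration.

    "confidence" is a post-processing setting and is not consumed here.
    """
    prior = class_weights if config["class_prior"] is not None else None
    return Pipeline([
        ("vectorize", HashingVectorizer(
            analyzer="char",          # assumption: k-mers as character n-grams
            n_features=config["n_features"],
            ngram_range=config["ngram_range"],
            alternate_sign=False,     # keep features non-negative for MultinomialNB
            norm=None,                # assumption: leave normalization to TF-IDF
        )),
        ("tfidf", TfidfTransformer(norm=config["norm"],
                                   use_idf=config["use_idf"])),
        ("classify", MultinomialNB(alpha=config["alpha"], class_prior=prior)),
    ])


all_configs = list(configurations(GRID))
print(len(all_configs))  # 2160

# Toy example: two made-up sequences with made-up labels.
sequences = ["ACGTACGTACGTACGTACGTACGT", "TTGGCCAATTGGCCAATTGGCCAA"]
labels = ["taxon_a", "taxon_b"]
model = build_pipeline(all_configs[0]).fit(sequences, labels)
probabilities = model.predict_proba(sequences)
```

The full Cartesian product of the listed values gives 3 x 4 x 3 x 2 x 3 x 2 x 5 = 2160 configurations.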
Figure 1. Classifier performance on mock community datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). A, Average F-measure for each taxonomy classification method (averaged across all configurations and all mock community datasets) from class to species level. Error bars = 95% confidence intervals. B, Average F-measure for each optimized classifier (averaged across all mock communities) at species level. C, Average taxon accuracy rate for each optimized classifier (averaged across all mock communities) at species level. D, Average Bray-Curtis distance between the expected mock community composition and its composition as predicted by each optimized classifier (averaged across all mock communities) at species level. Violin plots show median (white point), quartiles (black bars), and kernel density estimation (violin) for each score distribution. Violins with different lower-case letters have significantly different means (paired t-test, false discovery rate-corrected P < 0.05).
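
For readers unfamiliar with the metric in panel D, the following sketch computes a Bray-Curtis distance between an expected and an observed composition. It assumes compositions are given as mappings from taxon name to relative abundance; the function name and the example taxa are hypothetical and do not come from the mock communities used here.

```python
def bray_curtis(expected, observed):
    """Bray-Curtis distance between two taxon -> abundance mappings.

    Taxa missing from one mapping are treated as having abundance 0.
    """
    taxa = set(expected) | set(observed)
    difference = sum(
        abs(expected.get(taxon, 0.0) - observed.get(taxon, 0.0)) for taxon in taxa
    )
    total = sum(expected.get(taxon, 0.0) + observed.get(taxon, 0.0) for taxon in taxa)
    if total == 0:
        raise ValueError("both compositions are empty")
    return difference / total


# Hypothetical example: one of two expected taxa is recovered, the other is
# replaced by an unexpected taxon at the same abundance.
expected = {"taxon_a": 0.5, "taxon_b": 0.5}
observed = {"taxon_a": 0.5, "taxon_c": 0.5}
print(bray_curtis(expected, observed))  # 0.5
```

A distance of 0 indicates identical compositions and a distance of 1 indicates no shared taxa.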
Figure 2. Classifier performance on cross-validated sequence datasets. Classification accuracy of 16S rRNA gene V4 sub-domain (first row), V1-3 sub-domain (second row), full-length 16S rRNA gene (third row), and fungal ITS sequences (fourth row). A, Average F-measure for each taxonomy classification method (averaged across all configurations and all cross-validated sequence datasets) from class to species level. Error bars = 95% confidence intervals. B, Average F-measure for each optimized classifier (averaged across all cross-validated sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t-test, false discovery rate-corrected P < 0.05). C, Correlation between F-measure performance for each method/configuration classification of V4 sub-domain (x-axis), V1-3 sub-domain (y-axis), and full-length 16S rRNA gene sequences (z-axis). Inset lists the Pearson R² value for each pairwise correlation; each correlation is significant (P < 0.001).
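
The pairwise Pearson R² values reported in the panel C inset could be computed along the following lines. This is only a sketch: the score lists are invented placeholders, not values from this study, and the function name is hypothetical.

```python
from itertools import combinations


def pearson_r_squared(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance ** 2 / (variance_x * variance_y)


# Placeholder F-measures for four hypothetical method configurations,
# one list per amplicon target.
f_measures = {
    "V4": [0.2, 0.4, 0.6, 0.8],
    "V1-3": [0.25, 0.45, 0.6, 0.85],
    "full-length": [0.3, 0.5, 0.7, 0.9],
}

for (name_a, scores_a), (name_b, scores_b) in combinations(f_measures.items(), 2):
    print(name_a, "vs", name_b, round(pearson_r_squared(scores_a, scores_b), 3))
```

Each configuration contributes one point per pair of targets, so three targets yield three pairwise correlations.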
Figure 3. Classifier performance on novel-taxa simulated sequence datasets for 16S rRNA gene sequences (left column) and fungal ITS sequences (right column). A-F, Average F-measure (A), precision (B), recall (C), overclassification (D), underclassification (E), and misclassification (F) for each taxonomy classification method (averaged across all configurations and all novel taxa sequence datasets) from phylum to species level. Error bars = 95% confidence intervals. B, Average F-measure for each optimized classifier (averaged across all novel taxa sequence datasets) at species level. Violins with different lower-case letters have significantly different means (paired t-test, false discovery rate-corrected P < 0.05).
Figure 4. Classification accuracy comparison between mock community, cross-validated, and novel taxa evaluations. Scatterplots show mean F-measure scores for each method configuration, averaged across all samples, for classification of 16S rRNA genes at genus level (A) and species level (B), and fungal ITS sequences at genus level (C) and species level (D).
Figure 5. Runtime performance comparison of taxonomy classifiers. Runtime (s) for each taxonomy classifier when either varying the number of query sequences while keeping a constant 10,000 reference sequences (A) or varying the number of reference sequences while keeping a single query sequence (B).