Abstract words: 100 (short) / 127 (long)
Main text words: 13,730
References words: 5,264
Entire text words: 19,438
Classical Statistics and Statistical Learning in Imaging Neuroscience

Danilo Bzdok 1,2,3

1 Department of Psychiatry, Psychotherapy and Psychosomatics, Medical Faculty, RWTH Aachen, Germany
2 JARA, Translational Brain Medicine, Aachen, Germany
3 Parietal team, INRIA, Neurospin, bat 145, CEA Saclay, 91191 Gif-sur-Yvette, France

Prof. Dr. Dr. Danilo Bzdok
Department for Psychiatry, Psychotherapy and Psychosomatics
Pauwelsstraße 30
52074 Aachen, Germany
mail: danilo[DOT]bzdok[AT]rwth-aachen[DOT]de

Citable as:
"Bzdok D. Classical Statistics and Statistical Learning in Imaging Neuroscience (2016). arXiv preprint arXiv:1603.01857."
Short Abstract: Neuroimaging research has predominantly drawn conclusions based on classical statistics. Recently, statistical learning methods have enjoyed increasing popularity. These methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum, but are based on different histories, theories, assumptions, and outcome metrics, and thus permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning with regard to neuroimaging research. The conceptual implications are illustrated in three case studies. The paper thus aims to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.
Long Abstract: Neuroimaging research has predominantly drawn conclusions based on classical statistics, including null-hypothesis testing, t-tests, and ANOVA. Throughout recent years, statistical learning methods have enjoyed increasing popularity, including cross-validation, pattern classification, and sparsity-inducing regression. These two methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum. Yet, they originated from different historical contexts, build on different theories, rest on different assumptions, evaluate different outcome metrics, and permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning in their relation to neuroimaging research. The conceptual implications are illustrated in three common analysis scenarios. The paper thus aims to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.
Keywords: brain imaging, epistemology, cross-validation, hypothesis testing, machine learning, pattern recognition, p-value, prediction
Main Text

"The trick to being a scientist is to be open to using a wide variety of tools."
Leo Breiman (2001)

1. Introduction
Among the greatest challenges humans face are cultural misunderstandings between individuals, groups, and institutions (Hall, 1989). The topic of the present paper is the culture clash between statistical inference by null-hypothesis rejection and out-of-sample generalization (L. Breiman, 2001; Friedman, 1998; Shmueli, 2010), which are increasingly combined in the brain-imaging domain (N. Kriegeskorte, Simmons, Bellgowan, & Baker, 2009). The ensuing inter-cultural misunderstandings are unfortunate, because the invention and application of new research methods have always been a driving force in the neurosciences (Deisseroth, 2015; Greenwald, 2012; Yuste, 2015). It is the goal of the present paper to disentangle classical inference and generalization inference by juxtaposing their historical trajectories (section 2), modelling philosophies (section 3), conceptual frameworks (section 4), and performance metrics (section 5).
During the past 15 years, neuroscientists have transitioned from exclusively qualitative reports of few patients with neurological brain lesions to quantitative lesion-symptom mapping at the voxel level in hundreds of patients (Bates et al., 2003). We have gone from manually staining and microscopically inspecting single brain slices to 3D models of neuroanatomy at micrometer scale (Amunts et al., 2013). We have also gone from individual experimental studies to the increasing possibility of automatized knowledge aggregation across thousands of previously isolated neuroimaging findings (T. Yarkoni, Poldrack, Nichols, Van Essen, & Wager, 2011). Rather than laboriously collecting and publishing in-house data in a single paper, investigators are now routinely reanalyzing multi-modal data repositories managed by national, continental, and inter-continental consortia (Kandel, Markram, Matthews, Yuste, & Koch, 2013; Henry Markram, 2012; Poldrack & Gorgolewski, 2014; Van Essen et al., 2012). The granularity of neuroimaging datasets is hence growing in terms of scanning resolution, sample size, and complexity of meta-information (S. Eickhoff, Turner, Nichols, & Van Horn, 2016; Van Horn & Toga, 2014). As an important consequence, the scope of neuroimaging analyses has expanded from the predominance of null-hypothesis testing to statistical-learning methods that are i) more data-driven through flexible models, ii) naturally scalable to high-dimensional data, and iii) more heuristic through increased reliance on numerical optimization (Jordan & Mitchell, 2015; LeCun, Bengio, & Hinton, 2015). Statistical learning (T. Hastie, Tibshirani, & Friedman, 2001) henceforth serves as the umbrella term for "machine learning", "data mining", "pattern recognition", "knowledge discovery", and "high-dimensional statistics".
In fact, the very notion of statistical inference may be subject to expansion. The following definition is drawn from a committee report to the National Academies of the USA (Jordan et al., 2013, p. 8): "Inference is the problem of turning data into knowledge, where knowledge often is expressed in terms of variables [...] that are not present in the data per se, but are present in models that one uses to interpret the data." According to this authoritative definition, statistical inference can be understood as encompassing not only classical null-hypothesis falsification but also out-of-sample generalization (cf. Jacob Cohen, 1990; G. Gigerenzer & Murray, 1987). Classical statistics and statistical learning might give rise to different categories of inference, which remains an inherently difficult concept to define (Chamberlin, 1890; Pearl, 2009; Platt, 1964). Any choice of statistical method for a neurobiological investigation predetermines the spectrum of possible results and permissible conclusions.
Briefly taking an epistemological perspective, a new scientific fact is probably not established in vacuo (Fleck, Schäfer, & Schnelle, 1935; italic terms in this passage taken from source). Rather, the object is recognized and accepted by the subject according to socially conditioned thought styles cultivated among members of thought collectives. A witnessed and measured neurobiological phenomenon tends to only become "true" if not at odds with the constructed thought history and closed opinion system shared by that subject. In the following, two such thought milieus will be revisited and reintegrated in the context of imaging neuroscience: classical statistics (CS) and statistical learning (SL).
2. Different histories

2.1 The origins of classical hypothesis testing and learning algorithms

The largely independent historical trajectories of the two statistical families are evident even from their most basic terminology. Inputs to statistical models are usually called independent variables or predictors in the CS literature, but are called features collected in a feature space in the SL literature. The outputs are called dependent variables or responses in CS and target variables in SL, respectively.
Around 1900 the notions of standard deviation, goodness of fit, and the "p < 0.05" threshold emerged (Cowles & Davis, 1982). This was also the period when William S. Gosset published the t-test under the pseudonym "Student" to quantify production quality in Guinness breweries. Motivated by concrete problems such as the interaction between potato varieties and fertilizers, Ronald A. Fisher invented the analysis of variance (ANOVA) and null-hypothesis testing, promoted p values, and devised principles of proper experimental conduct (R. A. Fisher, 1925; Ronald A. Fisher, 1935; Ronald A. Fisher & Mackenzie, 1923). An alternative framework for hypothesis testing was proposed by Jerzy Neyman and Egon S. Pearson, which introduced the statistical notions of power, false positives, and false negatives, but left out the concept of p values (Neyman & Pearson, 1933). This was a time when controlled experiments were preferentially performed on single individuals, before the gradual transition to participant groups in the 30s and 40s, and before electrical calculators emerged after World War II (Efron & Tibshirani, 1991; Gerd Gigerenzer, 1993). Student's t-test and Fisher's inference framework were institutionalized by American psychology textbooks that were widely read in the 40s and 50s. Neyman and Pearson's approach only became increasingly known in the 50s and 60s. This led authors of social science textbooks to promote a somewhat incoherent mixture of the Fisher and Neyman-Pearson approaches to statistical inference, typically without explicit mention. It is this conglomerate of classical frameworks for performing inference that today's textbooks of applied statistics have inherited (Bortz, 2006; Moore & McCabe, 1989).
It is a topic of current debate1,2,3 whether CS is a discipline that is separate from SL (e.g., L. Breiman, 2001; Chambers, 1993; Friedman, 2001) or whether statistics is a broader class that includes CS and SL as its members (e.g., Cleveland, 2001; Jordan & Mitchell, 2015; Tukey, 1962). SL methods are frequently adopted by computer scientists, physicists, engineers, and others who have no formal statistical background and are typically working in industry rather than academia. In fact, John W. Tukey foresaw many of the developments that led up to what one might today call statistical learning (Tukey, 1962, 1965). He proposed a "peaceful collision of computing and statistics" as well as the distinction between "exploratory" and "confirmatory" data analysis4. This emphasized data-driven analysis techniques as a toolbox useful in a large variety of real-world settings to gain an intuition of the data properties. Kernel methods, neural networks, decision trees, nearest neighbors, and graphical models all actually originated in the CS community, but mostly continued to develop in the SL community (Friedman, 2001). Among the often-cited beginnings of self-learning algorithms, the perceptron was an early brain-inspired computing algorithm (Rosenblatt, 1958), and Arthur Samuel created a checkers-playing program that succeeded in beating its own creator (Samuel, 1959). Such studies on artificial intelligence (AI) led to enthusiastic optimism and subsequent disappointment due to the slow progress of learning algorithms. The consequence was a slow-down of research, funding, and interest during the so-called "AI winters" in the late 70s and around the 90s (D. D. Cox & Dean, 2014; Kurzweil, 2005; Russell & Norvig, 2002), while the increasingly available computers in the 80s encouraged a new wave of statistical algorithms (Efron & Tibshirani, 1991). The difficult-to-train but back then widely used neural network algorithms were superseded by support vector machines with convincing out-of-the-box performance (Cortes & Vapnik, 1995). Later, the use of SL methods increased steadily in many quantitative scientific domains as they underwent an increase in information granularity from classical "long data" (samples n > variables p) to modern "wide data" (n < p) (Tibshirani, 1996). The emerging field of SL was conceptually consolidated in large part by the seminal book "The Elements of Statistical Learning" (T. Hastie et al., 2001). The coincidence of changing data properties, increasing computational power, and cheaper memory resources encouraged a resurgence in SL research and applications from approximately 2000 onwards (House of Commons, 2016; Manyika et al., 2011). Over the last 15 years, sparsity assumptions gained increasing relevance for statistical tractability and domain interpretability when using supervised and unsupervised learning algorithms (i.e., with and without target variables) by imposing a prior distribution on the model parameters (Bach, Jenatton, Mairal, & Obozinski, 2012). According to the "bet on sparsity" (Trevor Hastie, Tibshirani, & Wainwright, 2015), only a subset of the features should be expected to be relevant, because no existing statistical method performs well in the dense high-dimensional scenario that assumes all features to be relevant in the "true" model (Brodersen, Haiss, et al., 2011; T. Hastie et al., 2001). This enabled reproducible and interpretable statistical relationships in the high-dimensional "n << p" regime (Bühlmann & Van De Geer, 2011; Trevor Hastie et al., 2015). More recently, improvements in training very "deep" (i.e., many non-linear hidden layers) neural network architectures (Geoffrey E. Hinton & Salakhutdinov, 2006) have much improved automatized feature selection (Bengio, Courville, & Vincent, 2013) and have exceeded human-level performance in several tasks (LeCun et al., 2015). For instance, one recent deep reinforcement learning algorithm mastered playing 49 different computer games based on simple pixel input alone (Mnih et al., 2015). Today, systematic education in SL is still rare at most universities, in contrast to the omnipresence of CS courses (Burnham & Anderson, 2014; Cleveland, 2001; Donoho, 2015; Vanderplas, 2013).

1 "Data Science and Statistics: different worlds?" (Panel at Royal Statistical Society UK, March 2015) (https://www.youtube.com/watch?v=C1zMUjHOLr4)
2 "50 years of Data Science" (David Donoho, Tukey Centennial workshop, USA, Sept. 2015)
3 "Are ML and Statistics Complementary?" (Max Welling, 6th IMS-ISBA meeting, December 2015)
4 As a very recent reformulation of the same idea: "If the inference/algorithm race is a tortoise-and-hare affair, then modern electronic computation has bred a bionic hare." (Efron & Hastie, 2016)
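The "bet on sparsity" in the wide-data n < p regime can be made concrete in a small simulation. The following sketch is a hypothetical illustration, not code from the paper; the simulated data, the choice of scikit-learn's Lasso, and the alpha penalty value are all assumptions of this example:

```python
# Sketch of the "bet on sparsity": in a wide dataset (n < p), only a few
# of many candidate features carry signal, and an L1 penalty recovers them.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                                # fewer samples than features
X = rng.standard_normal((n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]   # only 5 relevant features
y = X @ true_coef + 0.1 * rng.standard_normal(n)

# The sparsity-inducing L1 penalty shrinks most coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(n_selected)                             # a small subset of the 200 features
```

With a dense estimator such as ordinary least squares, the 200 coefficients of this under-determined system would not be uniquely identifiable; the sparsity penalty makes the problem tractable and the surviving coefficients interpretable.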
2.2 Related spotlights in the history of neuroimaging analysis methods
For more than a century, neuroscientific conclusions were mainly drawn from brain lesion reports (Broca, 1865; Harlow, 1848; Wernicke, 1881), microscopical inspection (Brodmann, 1909; Vogt & Vogt, 1919), brain stimulation during surgery (Penfield & Perot, 1963), and pharmacological intervention (Clark, Del Giudice, & Aghajanian, 1970), often without strong reliance on statistical methodology. The advent of more readily quantifiable neuroimaging methods (Fox & Raichle, 1986) then allowed for in-vivo characterization of the neural correlates underlying sensory, cognitive, or affective tasks. Ever since, topographical localization of neural activity increases and decreases has been dominated by analysis approaches from CS, especially the general linear model (GLM; dating back to Nelder & Wedderburn, 1972). Although the GLM is well known not to correspond to neurobiological reality, it has provided good and interpretable approximations. It was and still is routinely used in a mass-univariate regime that computes simultaneous univariate statistics for each independent voxel observation of brain scans (Friston et al., 1994). This involves fitting beta coefficients corresponding to the columns of a design matrix (i.e., prespecified stimulus/task/behavior indicators, the independent variables) to a single voxel's imaging time series of measured neural activity changes (i.e., the dependent variable) to obtain a beta coefficient for each indicator. It is seldom mentioned that the GLM would not have yielded unique solutions in the high-dimensional regime, because the number of input variables p would have exceeded by far the number of samples n (i.e., an under-determined system of equations), which incapacitates many statistical estimators from CS (cf. Giraud, 2014; Trevor Hastie et al., 2015). Regularization by sparsity-inducing norms, such as in modern regression analysis via Lasso and Elastic Net (cf. Trevor Hastie et al., 2015; Jenatton, Audibert, & Bach, 2011), emerged only later as a principled way to de-escalate the need for dimensionality reduction and to enable the tractability of the high-dimensional "p > n" case (Tibshirani, 1996). Many software packages for neuroimaging analysis consequently implemented discrete voxel-wise analyses with classical inference. The ensuing multiple comparisons problem motivated more than two decades of methodological research (Friston, 2006; Thomas E. Nichols, 2012; Stephen M. Smith, Matthews, & Jezzard, 2001; Worsley, Evans, Marrett, & Neelin, 1992). It was initially addressed by reporting uncorrected (Vul, Harris, Winkielman, & Pashler, 2008) or Bonferroni-corrected findings (Thomas E. Nichols, 2012), then increasingly by false discovery rate (Genovese, Lazar, & Nichols, 2002) and cluster-level thresholding (Stephen M. Smith & Nichols, 2009). Further, it was acknowledged early on that the unit of interest should be spatially neighboring voxel groups and that the null hypothesis needed to account for all voxels exhibiting some signal (Chumbley & Friston, 2009). These concerns were addressed by inference in locally smooth neighborhoods based on random field theory, which models discrete voxel activations as topological units with continuous activation height and extent (Worsley et al., 1992). That is, the spatial dependencies of voxel observations were not incorporated into the GLM estimation step, but instead during the subsequent model inference step to alleviate the multiple comparisons problem.
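The mass-univariate GLM regime can be sketched in a few lines on simulated data. This is a hypothetical toy illustration, not the pipeline of any neuroimaging package; the block design, the data, and the use of plain least squares via NumPy are assumptions of the example:

```python
# Minimal sketch of a mass-univariate GLM: one and the same design matrix
# is regressed against every voxel's time series independently.
import numpy as np

rng = np.random.default_rng(42)
n_scans, n_voxels = 120, 1000
# Design matrix: intercept plus one on/off task indicator (block design)
task = np.tile(np.r_[np.zeros(10), np.ones(10)], 6)
X = np.column_stack([np.ones(n_scans), task])        # shape (120, 2)

Y = rng.standard_normal((n_scans, n_voxels))         # noise time series
Y[:, :50] += 2.0 * task[:, None]                     # 50 task-responsive voxels

# One independent least-squares fit per voxel, computed for all voxels at
# once; betas[1, v] is the estimated task effect in voxel v
betas, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(betas.shape)                                   # (2, 1000)
```

Real statistical parametric mapping adds crucial steps omitted here, notably hemodynamic convolution, noise modelling, and the multiple-comparisons corrections required when 1000 such tests are run simultaneously.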
To abandon the voxel-independence assumption of the mass-univariate GLM, SL models were proposed early on for neuroimaging investigations. For instance, principal component analysis was used to distinguish local and global neural activity changes (Moeller, Strother, Sidtis, & Rottenberg, 1987) as well as to study Alzheimer's disease (Grady et al., 1990), while canonical correlation analysis yielded complex relationships between task-free neural activity and schizophrenia symptoms (Friston, Liddle, Frith, Hirsch, & Frackowiak, 1992). Note that these first approaches to "multivariate" brain-behavior associations did not ignite a major research trend (cf. Friston et al., 2008; Worsley, Poline, Friston, & Evans, 1997). Supervised classification estimators were also used in early structural neuroimaging analyses (Herndon, Lancaster, Toga, & Fox, 1996) and they improved preprocessing performance for volumetric neuroimaging data (Ashburner & Friston, 2005). However, the popularity of SL methods only peaked after being rebranded as "mind-reading", "brain decoding", and "multivariate pattern analysis", which appealed by identifying ongoing thought from neural activity alone (J. D. Haynes & Rees, 2005; Kamitani & Tong, 2005). Up to that point, the term prediction had less often been used in the current sense of out-of-sample generalization of a learning model and more often in the incompatible sense of in-sample linear correlation between (time-free or time-shifted) data (Gabrieli, Ghosh, & Whitfield-Gabrieli, 2015; Shmueli, 2010). The "searchlight" approach for "pattern-information analysis" subsequently enabled whole-brain assessment of local neighborhoods of predictive patterns of neural activity fluctuations (N. Kriegeskorte, Goebel, & Bandettini, 2006). The position of such "decoding" models within the branches of statistics has seldom been made explicit (but see Friston et al., 2008). The growing interest was manifested in the first review and tutorial papers published on applying SL methods to neuroimaging data (J. D. Haynes & Rees, 2006; Mur, Bandettini, & Kriegeskorte, 2009; Pereira, Mitchell, & Botvinick, 2009).
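A minimal decoding analysis in the sense sketched above might look as follows. The simulated "trials × voxels" data, the classifier choice, and all parameter values are illustrative assumptions, not recommendations from the paper:

```python
# Sketch of a decoding analysis: a linear classifier is trained on part of
# the data and its out-of-sample accuracy is estimated by cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(7)
n_trials, n_voxels = 100, 500
y = np.repeat([0, 1], n_trials // 2)           # two stimulus conditions
X = rng.standard_normal((n_trials, n_voxels))  # simulated trial-wise patterns
X[y == 1, :20] += 0.8                          # weak signal in 20 "voxels"

clf = SVC(kernel="linear")                     # linear classifier
scores = cross_val_score(clf, X, y, cv=5)      # accuracy on held-out folds
print(scores.mean())                           # out-of-sample estimate
```

The cross-validated accuracy, not the fit on the training trials, is the quantity reported in decoding studies, since only held-out data speak to out-of-sample generalization.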
The conceptual appeal of this new access to the neural correlates of cognition and behavior was flanked by the availability of the necessary computing power and memory resources. This was also a precondition for regularization by structured sparsity penalties (Wainwright, 2014) that incorporate neurobiological priors such as local spatial dependence (Gramfort, Thirion, & Varoquaux, 2013; Michel, Gramfort, Varoquaux, Eger, & Thirion, 2011) or spatial-temporal dependence (Gramfort, Strohmeier, Haueisen, Hamalainen, & Kowalski, 2011). Although challenging, "deep" neural networks have recently been introduced to neuroimaging (de Brebisson & Montana, 2015; Güçlü & van Gerven, 2015; Plis et al., 2014). The application of these "deep" statistical architectures occurs in an atheoretical, more empirically justified setting, as their mathematical properties are incompletely understood and many formal guarantees are missing (but see Bach, 2014). Nevertheless, they might help in deciphering and approximating the nature of neural processing in the brain (D. D. Cox & Dean, 2014; Yamins & DiCarlo, 2016). This agenda appears closely related to David Marr's distinction of information processing into computational (~what?), algorithmic-representational (~how?), and implementational-physical (~where?) levels (Marr, 1982). Last but not least, there is ever-growing interest in and pressure for data sharing, open access, and building "big-data" repositories in neuroscience (Devor et al., 2013; Gorgolewski et al., 2014; Kandel et al., 2013; Henry Markram, 2012; Poldrack & Gorgolewski, 2014; Van Essen et al., 2012). As the dimensionality and complexity of neuroimaging datasets increase, neuroscientific investigations will probably benefit increasingly from SL methods and their variants adapted to the data-intense regime (e.g., Engemann & Gramfort, 2015; Kleiner, Talwalkar, Sarkar, & Jordan, 2012; Zou, Hastie, & Tibshirani, 2006). While larger data quantities allow detection of more subtle effects, false positive findings are likely to become a major issue as they unavoidably arise from statistical estimation in the high-dimensional data scenario (Jordan et al., 2013; Meehl, 1967).
3. Different philosophies

3.1 Two different modelling goals in statistical methods

One of the possible ways to catalogue statistical methods is by framing them along the lines of classical statistics and statistical learning. Statistical methods can thus be conceptualized as spanning a continuum between the two poles of CS and SL (Jordan et al., 2013; p. 61). While some statistical methods cannot easily be categorized by this distinction, the two families of statistical methods can generally be distinguished by a number of representative properties.
As the precise relationship between CS and SL has seldom been explicitly characterized in mathematical terms, the author must resort to more descriptive explanations (see already Efron, 1978). One of the key differences becomes apparent when thinking of the neurobiological phenomenon under study as a black box (L. Breiman, 2001). CS typically aims at modelling the black box by making a set of accurate assumptions about its content, such as the nature of the signal distribution. Gaussian distributional assumptions have been very useful in many instances to enhance mathematical convenience and, hence, computational tractability. SL typically aims at finding any way to model the output of the black box from its input while making the fewest assumptions possible (Abu-Mostafa, Magdon-Ismail, & Lin, 2012). In CS the stochastic processes that generated the data are therefore treated as partly known, whereas in SL the phenomenon is treated as complex, largely unknown, and partly unknowable. In this sense, CS tends to be more analytical by imposing mathematical rigor on the phenomenon, whereas SL tends to be more heuristic by finding useful approximations to the phenomenon. CS specifies the properties of a given statistical model at the beginning of the investigation, whereas in SL there is a bigger emphasis on models whose parameters and at times even structures (e.g., learning algorithms creating decision trees) are generated during the statistical estimation. The SL-minded investigator may favor simple, tractable models even when making knowingly false assumptions (Domingos, 2012), because sufficiently large data quantities are expected to remedy them (Devroye, Györfi, & Lugosi, 1996; Halevy, Norvig, & Pereira, 2009). A new function with potentially thousands of parameters is created that can predict the output from the input alone, without explicit programming. This requires the input features to represent different variants of all relevant configurations of the examined phenomenon in nature. In CS the mathematical assumptions are typically stated explicitly, while SL models frequently have implicit assumptions that may be less openly discussed. Intuitively, the truth is believed to be in the model (cf. Wigner, 1960) in a CS-constrained statistical regime, while it is believed to be in the data (cf. Halevy et al., 2009) in an SL-constrained statistical regime. Also differing in their output, CS typically yields point and interval estimates (e.g., p values, variances, confidence intervals), whereas SL frequently outputs functions (e.g., the k-means centroids or a trained classifier's decision function can be applied to new data). As another tendency, SL revolves around solving iterative numerical optimization problems and introducing prior knowledge into them, whereas CS methods are more often closed-form one-shot computations without any successive approximation process, although several models are also fitted numerically by maximum likelihood estimation (Boyd & Vandenberghe, 2004; Jordan et al., 2013).
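The contrast between point/interval estimates and fitted functions as typical outputs can be illustrated with a toy example. The data are simulated and the specific test and estimator are illustrative choices of this sketch, not prescriptions from the paper:

```python
# Toy contrast of typical outputs: a classical two-sample t-test yields
# point estimates (t, p), while a trained classifier yields a reusable
# function that can be applied to new observations.
import numpy as np
from scipy import stats
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=100)
group_b = rng.normal(0.8, 1.0, size=100)

# CS-style output: two numbers summarizing evidence against the null
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# SL-style output: a fitted decision function usable on unseen data
X = np.r_[group_a, group_b].reshape(-1, 1)
y = np.r_[np.zeros(100), np.ones(100)]
model = KNeighborsClassifier(n_neighbors=15).fit(X, y)
new_prediction = model.predict([[2.0]])   # label for an unseen data point
```

The t-test ends with its summary numbers, whereas the classifier persists as a function whose worth is judged by applying it to observations it has never seen.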
In more formal terms, CS relates closely to statistics for confirmatory data analysis, whereas SL relates more to statistics for exploratory data analysis (Tukey, 1962). In practice, CS is probably more often applied to experimental data, where a set of target variables is systematically controlled by the investigator and the system under study has been subject to structured perturbation. Instead, SL is perhaps more typically applied to observational data without such structured influence, where the studied system has been left unperturbed (Domingos, 2012). From yet another important angle, one should not conflate so-called explanatory modelling and predictive modelling (Shmueli, 2010). CS mainly performs retrospective explanatory modelling, with emphasis on the operationalization of preselected hypothetical constructs as measurable outcomes using frequently linear, interpretable models. SL mainly performs prospective predictive modelling and quantifies the generalization to future observations, with or without formal incorporation of hypothetical concepts, using models that are more frequently non-linear and challenging to impossible to interpret. There is the often-overlooked misconception that models with high explanatory power necessarily exhibit high predictive power (Lo, Chernoff, Zheng, & Lo, 2015; Wu, Chen, Hastie, Sobel, & Lange, 2009). An important outcome measure in CS is the quantified significance associated with a statistical relationship between few variables given a pre-specified model. The outcome measure for SL is the quantified generalizability or robustness of patterns between many variables or, more generally, the robustness of special structure in the data (T. Hastie et al., 2001). CS tends to test for a particular structure in the data based on analytical guarantees, such as mathematical convergence theorems about approximating the population properties with increasing sample size. Instead, SL tends to explore particular structure in the data and quantify its generalization to new data, certified by empirical guarantees such as the explicit evaluation of the predictiveness of a fitted model on unseen data (Efron & Tibshirani, 1991). CS thus resorts more to what Leo Breiman called data modelling, imposing an a-priori model in a top-down fashion, while SL would be algorithmic modelling, fitting a model as a function of the data at hand in a bottom-up fashion (L. Breiman, 2001).
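The divergence of explanatory and predictive power can be demonstrated in a few lines. The following is a toy simulation under assumed conditions, not an analysis from the paper: a model that explains the training data almost perfectly can still predict held-out data poorly.

```python
# In-sample fit (explanation) and out-of-sample accuracy (prediction) can
# diverge: an overly flexible polynomial fits the training half better but
# predicts the held-out half worse than a simple linear model.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.5, 40)            # truly linear relationship
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def in_and_out_of_sample_error(degree):
    """Fit a polynomial on the training half, score on both halves."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse_in = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_out = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return mse_in, mse_out

lin_in, lin_out = in_and_out_of_sample_error(1)     # simple model
flex_in, flex_out = in_and_out_of_sample_error(15)  # flexible model

# The flexible model "explains" the training data better in-sample ...
assert flex_in < lin_in
# ... yet the simple model predicts the unseen half more accurately.
```

This is why SL practice reports performance on data held out from model fitting, rather than goodness of fit on the data used for estimation.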
Although this polarization of statistical methodology may be oversimplified, it serves as a didactic tool to confront two different perspectives. Taken together, CS was mostly fashioned for problems with small samples that can be grasped by plausible models with a small number of parameters chosen by the investigator in an analytical fashion. SL was mostly fashioned for problems with many variables in potentially large samples with rare knowledge of the data-generating process, which are emulated by a mathematical function created from data by a machine in a heuristic fashion. Tests from CS therefore typically assume that the data behave according to known mechanisms, whereas SL exploits algorithmic techniques to avoid a-priori specifications of the data mechanism. In this way, CS preassumes and tests a model for the data, whereas SL learns a model from the data5.

5 "Indeed, the trend since Pearl's work in the 1980's has been to blend reasoning and learning: put simply, one does not need to learn (from data) what one can infer (from the current model). Moreover, one does not need to infer what one can learn (intractable inferential procedures can be circumvented by collecting data)." (Jordan, 2010)
Obviously, the existing repertoire of statistical methods could have been dissected in different ways. For instance, the Bayesian-frequentist distinction is orthogonal to the CS-SL distinction (Efron, 2005; Freedman, 1995; Z. Ghahramani, 2015). Bayesian statistics can be viewed as subjective and optimistic in considering the data at hand conditioned on preselected distributional assumptions for the probabilistic model to perform direct inference (Wagenmakers, Lee, Lodewyckx, & Iverson, 2008). Frequentist statistics are more objective and pessimistic in considering a distribution of possible data with unknown model parameters to perform indirect inference based on the differences between observed and model-derived data. It is important to appreciate that Bayesian statistics can be adopted in both the CS and SL families in various flavors (Friston et al., 2008; Geoffrey E. Hinton, Osindero, & Teh, 2006; Kingma & Welling, 2013). Furthermore, CS and SL approaches can hardly be clearly categorized as either discriminative or generative (Bishop, 2006; G. E. Hinton, Dayan, Frey, & Neal, 1995; Jordan, 1995; Ng & Jordan, 2002). Discriminative models focus on solving the supervised problem of predicting a class y by directly estimating P(y|X), without capturing special structure in the data X. Typically more demanding in data quantity and computational resources, generative models capture special structure by deriving P(y|X) from P(X|y) and P(y), and can thus produce synthetic examples X~ for each class y. Further, CS and SL cannot be disambiguated into deterministic versus probabilistic models (Shafer, 1992). Statistical models are often not exclusively deterministic because they incorporate a component that accounts for unexpected, noisy variation in the data. Each probabilistic model can also be viewed as a superclass of a deterministic model (Norvig, 2011). Neither can the terms univariate and multivariate be exclusively grouped into either CS or SL. They traditionally denote reliance on one versus several dependent variables (CS) or target variables (SL) in the statistical literatures. In the neuroimaging literature, however, "multivariate" frequently refers to high-dimensional approaches operating on the neural activity from all voxels, in opposition to mass-univariate approaches operating on single-voxel activity (Brodersen, Haiss, et al., 2011; Friston et al., 2008). CS can be divided into uni- and multivariate groups of statistical tests (Ashburner & Kloppel, 2011; Friston et al., 2008), while SL is largely focused on higher-dimensional problems that often naturally translate to the multivariate classification/regression setting. Tapping into yet another terminology, CS methods may be argued to be closer to parametric statistics by instantiating statistical models whose number of parameters is fixed, finite, and not a function of sample size (Bishop, 2006; Z. Ghahramani, 2015). Instead, SL methods are more often (but not exclusively) realized in a non-parametric setting by instantiating flexible models whose number of parameters grows explicitly or implicitly with more input data. Parametric approaches are often more successful if few observations are available. Conversely, non-parametric approaches may be more naturally prepared to capture emergent properties that only arise from larger datasets (Z. Ghahramani, 2015; Halevy et al., 2009; Jordan et al., 2013). For instance, a non-parametric (but not a parametric) classifier can extract ever more complex decision boundaries from increasing training data (e.g., decision trees and nearest neighbors). Most importantly, neither CS nor SL can generally be considered superior. This is captured by the no free lunch theorem6 (Wolpert, 1996), which states that no single statistical strategy can consistently do better in all circumstances (cf. Gerd Gigerenzer, 2004). The investigator has the discretion to choose which statistical approach is best suited to the neurobiological phenomenon under study and the neuroscientific research object at hand.
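The parametric versus non-parametric contrast can be made tangible with a small sketch. This is a hypothetical example; the model choices and the scikit-learn attribute names used for inspection are assumptions of this illustration:

```python
# Parametric vs. non-parametric: a logistic regression has a fixed number
# of parameters regardless of sample size, while a nearest-neighbor model
# effectively keeps the whole training set as its "parameters".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
param_sizes, stored_samples = [], []
for n in (100, 1000):
    X = rng.standard_normal((n, 5))
    y = (X[:, 0] + 0.3 * rng.standard_normal(n) > 0).astype(int)

    parametric = LogisticRegression().fit(X, y)
    nonparametric = KNeighborsClassifier(n_neighbors=5).fit(X, y)

    # 5 weights + 1 intercept, no matter how many samples were seen
    param_sizes.append(parametric.coef_.size + parametric.intercept_.size)
    # the fitted k-NN model is essentially the stored training set itself
    stored_samples.append(int(nonparametric.n_samples_fit_))

print(param_sizes, stored_samples)   # [6, 6] [100, 1000]
```

With ten times more data, the parametric model still has six parameters, while the capacity of the nearest-neighbor model grows with the data, which is what lets it carve out ever more complex decision boundaries.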
3.2 Different modelling goals in neuroimaging
Statisticalanalysisgrounded inCSandSL ismoreclosely related toencodingmodelsanddecoding
models in the neuroimaging domain, respectively (Nikolaus Kriegeskorte, 2011; T. Naselaris, Kay,
Nishimoto,&Gallant,2011;Pedregosa,Eickenberg,Ciuciu,Thirion,&Gramfort,2015;butseeGüçlü
et al, 2015). Encoding models regress the data against a design matrix with potentially many
explanatorycolumnsofstimulus(e.g.,faceversushousepictures),task(e.g.,toevaluteortoattend),
orbehavioral (e.g.,ageorgender) indicatorsby fittinggeneral linearmodels. Incontrast,decoding
modelstypicallypredicttheseindicatorsbytrainingandtestingclassificationalgorithmsondifferent
6Inthesupervisedsetting,thereisnoaprioridistinctionbetweenlearningalgorithmsevaluatedbyout-of-samplepredictionerror.Intheoptimizationsettingoffinitespaces,allalgorithmssearchinganextremumperformidenticalwhenaveragedacrosspossiblecostfunctions.(http://www.no-free-lunch.org/)
splits from the whole dataset. In CS parlance, the encoding model fits the neural activity data (the dependent variables) by beta coefficients, according to the indicators in the design matrix columns (the independent variables). An explanation for decoding models in SL jargon would be that the model weights of a classifier are fitted on the training set of the input data to predict the class labels (the target variables), and are subsequently evaluated on the test set by cross-validation to obtain their out-of-sample generalization performance. Put differently, a GLM fits coefficients of stimulus/task/behavior indicators on neural activity data for each voxel separately given the design matrix (T. Naselaris et al., 2011). Classifiers predict entries of the design matrix for all voxels simultaneously given the neural activity data (Pereira et al., 2009). A key difference between CS-mediated encoding models and SL-mediated decoding models thus pertains to the direction of inference between brain space and indicator space (Friston et al., 2008; Varoquaux & Thirion, 2014). The mapping direction hence pertains to the question whether the indicators in the model act as causes, by representing deterministic experimental variables of an encoding model, or as consequences, by representing probabilistic outputs of a decoding model (Friston et al., 2008). These considerations also reveal the intimate relationship of CS models to the notion of so-called forward inference, while SL methods are probably more often used for formal reverse inference in functional neuroimaging (S. B. Eickhoff et al., 2011; Poldrack, 2006; Varoquaux & Thirion, 2014; T. Yarkoni et al., 2011). On the one hand, forward inference relates to encoding models by testing the probability of observing activity in a brain location given knowledge of a psychological process. Reverse inference, on the other hand, relates to brain "decoding" by testing the probability of a psychological process being present given knowledge of activation in a brain location. Finally, establishing a brain-behavior association has been argued to be more important than the actual direction of the mapping function. This is because "showing that one can decode activity in the visual cortex to classify [...] a subject's percept is exactly the same as demonstrating significant visual cortex responses to perceptual changes" and, conversely, "all demonstrations of functionally specialized responses represent an implicit mind reading" (Friston, 2009).
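Forward and reverse inference are linked by Bayes' rule, which also shows why a strong forward inference does not by itself license a strong reverse inference. A minimal numerical sketch (all probabilities are purely illustrative, not empirical estimates):

```python
# Toy numbers, illustrative only:
# P(activation in region R | process P engaged) -- forward inference
p_act_given_proc = 0.8
# Base rate at which process P is engaged across tasks
p_proc = 0.1
# Rate at which region R activates when P is NOT engaged (low selectivity)
p_act_given_not_proc = 0.3

# Total probability of observing activation in R
p_act = p_act_given_proc * p_proc + p_act_given_not_proc * (1 - p_proc)

# Reverse inference via Bayes' rule: P(process | activation)
p_proc_given_act = p_act_given_proc * p_proc / p_act

print(round(p_proc_given_act, 3))  # -> 0.229, despite P(activation|process) = 0.8
```

With these toy numbers, observing activation raises the probability of the process from 0.1 to only about 0.23 because the region also responds in many other circumstances.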
More specifically, GLM encoding models follow a representational agenda by testing hypotheses on regional effects of functional specialization in the brain (where?). A t-test is used to compare pairs of measures to statistically distinguish one target and one non-target set of mental operations (Friston et al., 1996). Essentially, this formally tests for significant differences between the beta coefficients corresponding to two stimulus/task indicators, based on well-founded arguments from cognitive theory (the SL analog would be a binary classifier distinguishing two class labels). It assumes that cognitive subtraction is possible, that is, that the regional brain responses of interest can be isolated by contrasting two sets of brain scans that differ precisely in the cognitive facet of interest (Friston et al., 1996; Stark & Squire, 2001). For one voxel location at a time, an attempt is made to reject the null hypothesis of no difference between the averaged neural activity of a target brain state and the averaged neural activity of a control brain state. Note that establishing a brain-behavior association via GLM is therefore a judgment about neural activity compressed in the time dimension. The univariate GLM analysis can easily be extended to more than one output (dependent) variable within the CS regime by performing a multivariate analysis of covariance (MANCOVA). This allows for tests of more complex hypotheses but incurs multivariate normality assumptions (Nikolaus Kriegeskorte, 2011). Conceptually, one can switch the direction of inference by reframing the estimated beta coefficients as independent variables and the stimulus/task/behavior indicator as the response variable to perform multiple linear regression (ANCOVA) (Brodersen, Haiss, et al., 2011; T. Naselaris et al., 2011). The beta coefficients would then play an analogous role to the model weights of a trained classification algorithm. Finally, the often-performed small volume correction (an analog in the SL world would be classification after feature selection) corresponds to simultaneously fitting a GLM to each voxel of a restricted region of interest.
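The cognitive subtraction logic at a single voxel can be sketched with a pooled-variance two-sample t statistic, computed here by hand on simulated activity values (the data, effect size, and condition names are purely illustrative):

```python
import math
import random

random.seed(1)

# Simulated single-voxel activity under two conditions (hypothetical units):
target  = [random.gauss(1.0, 1.0) for _ in range(20)]   # e.g., face trials
control = [random.gauss(0.0, 1.0) for _ in range(20)]   # e.g., house trials

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for a two-condition contrast."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

t = two_sample_t(target, control)
print(round(t, 2))   # a large |t| speaks against "no difference" at this voxel
```

In a mass-univariate analysis, this computation is repeated independently at every voxel, which is what creates the multiple comparisons problem discussed below.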
Because hypothesis testing for significant differences between beta coefficients of fitted GLMs relies on averaged neural activity, the test results are not corrupted by the conventionally applied spatial smoothing with a Gaussian filter. On the contrary, smoothing even helps the correction for multiple comparisons based on random field theory, alleviates inter-individual neuroanatomical variability,
and thus increases sensitivity. Spatial smoothing, however, discards fine-grained spatial activity patterns that carry potential information about mental operations (J.-D. Haynes, 2015). Indeed, some authors believe that sensory, cognitive, and motor processes manifest themselves as neuronal population codes (Averbeck, Latham, & Pouget, 2006). The relevance of such population codes in human neuroimaging was for instance suggested by revealing subject-specific fusiform-gyrus responses to facial stimuli (Saygin et al., 2012). In applications of SL models, the spatial smoothing step is therefore often skipped because the "decoding" algorithms precisely exploit this multivariate structure of the salt-and-pepper patterns.
In contrast, decoding models use learning algorithms for an informational agenda by showing generalization of robust patterns to new brain activity acquisitions (de-Wit, Alexander, Ekroll, & Wagemans, 2016; N. Kriegeskorte et al., 2006; Mur et al., 2009). Information that is locally weak but spatially distributed can be effectively harvested in a structure-preserving fashion (J. D. Haynes & Rees, 2006). Some brain-behavior associations might only emerge when simultaneously capturing neural activity in a group of voxels, but disappear in single-voxel approaches, such as GLMs. However, analogous to multivariate variants of the GLM, "decoding" can also be done by means of classical statistical approaches. Essentially, inference on information patterns in the brain reduces to model comparison (Friston, 2009). During training of a classifier to predict indicators correctly, an optimization algorithm (e.g., gradient descent or its variants) searches iteratively through the hypothesis space (= function space) of the chosen learning model. Each such hypothesis corresponds to one specific combination of model weights that equates with one candidate mapping function from the neural activity features to the indicators. In this way, four types of neuroscientific questions have been proposed to become quantifiable (Brodersen, 2009; Pereira et al., 2009): i) Where is an information category neurally processed? This extends the interpretational spectrum from increase and decrease of neural activity to the existence of complex combinations of activity variations distributed across voxels. For instance, linear classifiers decoded object categories from the ventral temporal cortex, even after excluding the fusiform gyrus, which is known to be responsive to object
stimuli (Haxby et al., 2001). ii) Whether a given information category is reflected by neural activity? This extends the interpretational spectrum to topographically similar but neurally distinct processes that potentially underlie different cognitive facets. For instance, linear classifiers successfully decoded whether a subject is attending to the first or second of two simultaneously presented gratings (Kamitani & Tong, 2005). iii) When is an information category generated (i.e., onset), processed (i.e., duration), and bound (i.e., alteration)? When applying classifiers to neural time series, the interpretational spectrum can be extended to the beginning, evolution, and end of distinct cognitive facets. For instance, different classifiers have been demonstrated to map the decodability time structure of mental operation sequences (King & Dehaene, 2014). iv) More controversially, how is an information category neurally processed? The interpretational spectrum is extended to computational properties of the neural processes, including processing in brain regions versus networks, or isolated versus partially shared processing facets. For instance, a classifier trained on evolutionarily conserved eye gaze processes was able to decode evolutionarily more recent mathematical calculation processes as a possible case of neural recycling in the human brain (Knops, Thirion, Hubbard, Michel, & Dehaene, 2009). As an important caveat, the particular properties of a chosen learning algorithm (e.g., linear versus non-linear support vector machines) can probably not serve as a convincing argument for reverse-engineering neural processing mechanisms (Misaki, Kim, Bandettini, & Kriegeskorte, 2010). However, many prediction problems in neuroscience can probably be solved without exhaustive neurobiological micro-, meso-, or macro-level knowledge (H. Markram, 2006; Sandberg & Bostrom, 2008).
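The idea that training a classifier amounts to an iterative search through a hypothesis space can be made concrete with a minimal gradient-descent loop for logistic regression. This is a toy sketch on simulated two-feature "voxel" data; all names, dimensions, and learning parameters are illustrative choices, not a recommended analysis:

```python
import math
import random

random.seed(2)

labels = [0, 1] * 50
# Hypothetical decoding setup: two "voxel" features per sample, binary label.
X = [[random.gauss(lbl, 1.0), random.gauss(-lbl, 1.0)] for lbl in labels]
y = labels

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.1

# Each gradient step selects a new hypothesis, i.e., a new candidate mapping
# (weight combination) from the voxel features to the class labels.
for epoch in range(200):
    gw0 = gw1 = gb = 0.0
    for (x0, x1), yi in zip(X, y):
        err = sigmoid(w[0] * x0 + w[1] * x1 + b) - yi   # logistic-loss gradient
        gw0 += err * x0
        gw1 += err * x1
        gb += err
    m = len(X)
    w = [w[0] - lr * gw0 / m, w[1] - lr * gw1 / m]
    b -= lr * gb / m

acc = sum(int((sigmoid(w[0] * x0 + w[1] * x1 + b) > 0.5) == bool(yi))
          for (x0, x1), yi in zip(X, y)) / len(y)
print(round(acc, 2))   # in-sample accuracy of the finally selected hypothesis
```

The accuracy reported here is in-sample; an honest estimate of generalization would require the cross-validation procedures discussed later.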
4. Different theories
4.1 Diverging ways to formalize statistical inference
Besides diverging historical origins and modelling goals, CS and SL rely on largely distinct theoretical frameworks that revolve around null-hypothesis testing and statistical learning theory. CS probably laid down its most important theoretical framework in the Popperian spirit of critical empiricism (Popper, 1935/2005): scientific progress is to be made by continuous replacement of current hypotheses by ever more pertinent hypotheses using verification and falsification. The rationale behind hypothesis falsification is that one counterexample can reject a theory by deductive reasoning, while no quantity of evidence can confirm a given theory by inductive reasoning (Goodman, 1999). The investigator verbalizes two mutually exclusive hypotheses by domain-informed judgment. The alternative hypothesis should be conceived so as to contradict the state of the art of the research topic. The null hypothesis should follow automatically from the newly articulated alternative hypothesis. The investigator has the agenda to disprove the null hypothesis because this leaves only the preferred alternative hypothesis as the new standard belief. A conventional 5% threshold (i.e., equating with roughly two standard deviations) guards against rejection due to idiosyncrasies of the sample that are not representative of the general population. If the data have a probability of <=5% of occurring given the null hypothesis (P(result|H0)), the result is evaluated to be significant. Such a test for statistical significance indicates a difference between two means with a 5% chance after sampling twice from the same population. There are common misconceptions that it denotes the probability of the null hypothesis (P(H0)), the alternative hypothesis (P(H1)), the result of the test statistic (P(result)), or the null hypothesis given the result (P(H0|result)) (Markus, 2001; Pollard & Richardson, 1987). If the null hypothesis is not rejected, then nothing can be concluded from the significance test according to most statisticians. That is, the test yields no conclusive result, rather than a null result (Schmidt, 1996). In this way, classical hypothesis testing continuously replaces currently embraced hypotheses explaining a phenomenon in nature by better hypotheses with more empirical support, in a Darwinian selection process. Note
however that corroborating substantive hypotheses (e.g., a specific linguistic theory like Chomsky's Universal Grammar) requires more than statistical hypothesis testing (Chow, 1998; Friedman, 1998; Meehl, 1978). A statistical hypothesis can be properly tested even in the absence of a substantive hypothesis of the phenomenon under study (Oakes, 1986). Finally, Fisher, Neyman, and Pearson intended hypothesis testing as a marker for further investigation, rather than an off-the-shelf decision-making instrument (J. Cohen, 1994; Nuzzo, 2014).
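The meaning of P(result|H0) can be made tangible with a label-permutation sketch: shuffling condition labels simulates a world in which the null hypothesis holds, and the resulting p value estimates how often a mean difference at least as extreme as the observed one arises under that null world. The numbers below are made up for illustration:

```python
import random

random.seed(3)

group_a = [2.1, 1.8, 2.5, 2.9, 1.7, 2.3]   # hypothetical scores, condition A
group_b = [1.2, 1.5, 0.9, 1.8, 1.1, 1.4]   # hypothetical scores, condition B

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Under H0 the condition labels are exchangeable: shuffling them simulates
# the sampling distribution of the mean difference given H0.
pooled = group_a + group_b
count = 0
n_perm = 10000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm   # estimates P(result at least this extreme | H0)
print(p_value)
```

Note that the computed quantity is a probability of the data given the null hypothesis; it says nothing directly about P(H0) or P(H0|result), which is exactly the misconception described above.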
A theoretical framework relevant to both CS and SL is the bias-variance tradeoff. It is a stand-alone concept that is helpful to consider in theory and practice by providing a different angle on deriving conclusions from data (e.g., Geman, Bienenstock, & Doursat, 1992). As famously noted7, any statistical model is an over-simplification of the studied phenomenon, a distortion of the truth. Choosing the "right" statistical model hence pertains to striking the right balance between bias and variance. Bias denotes the difference between the target function (i.e., the "true" relationship between the data and a response variable that the investigator is trying to uncover from the data) and the average of the function space (or hypothesis space) instantiated by a model. Intuitively, the bias describes the rigidity of the model-derived functions. High-bias models yield virtually the same best approximating function, even for larger perturbations of the data. Variance denotes the difference between the best approximating function among members of the function space and the average of the function space. Intuitively, variance describes the differences between the functions derived from the same model family. High-variance models yield very different best approximating functions even for small perturbations of the data. In sum, bias tends to decrease and variance tends to increase as model complexity increases. Simple models exhibit high bias and low variance. They lead to a bad approximation as they do not simulate well the target function, a behavior of the studied phenomenon. Yet, they exhibit good generalization in successfully extrapolating from seen data to unseen data. Conversely, complex models exhibit low bias and high variance. They have a better chance of approximating the target function well, at the price of generalizing less well to new data.
7 "All models are wrong; some models are useful." (George Box)
That is because complex models tend to fit the data too well, to the point of fitting noise. This is called overfitting and occurs when model fitting integrates information that is independent of the target function and idiosyncratic to the data sample. The bias-variance decomposition captures this fundamental tradeoff in statistical modelling between approximating the behavior of the studied phenomenon and generalizing to new data describing that behavior. If the target function were known, the bias-variance tradeoff could be computed explicitly, in contrast to the inability of computing the Vapnik-Chervonenkis dimensions (VC dimensions) of non-trivial models (cf. next passage). The bias-variance tradeoff can also practically explain why successful applications of statistical models largely rely on i) the amount of available data, ii) the typically unknown amount of noise in the data, and iii) the unknown complexity of the target function (Abu-Mostafa et al., 2012).
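The tradeoff can be simulated directly when, unlike in real research, the target function is known. In the illustrative sketch below (all choices are arbitrary), a high-bias model that always predicts the training mean is compared with a high-variance model that predicts the value of the single nearest training point, across many resampled datasets:

```python
import random

random.seed(4)

def target(x):                 # the "true" target function, known in simulation
    return x ** 2

def sample_dataset(n=10):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [target(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

x0 = 0.8                       # fixed test point
const_preds, interp_preds = [], []
for _ in range(500):
    xs, ys = sample_dataset()
    # High-bias model: always predicts the training mean (ignores x entirely).
    const_preds.append(sum(ys) / len(ys))
    # High-variance model: predicts the y of the single nearest training x.
    interp_preds.append(min(zip(xs, ys), key=lambda p: abs(p[0] - x0))[1])

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

bias_const  = abs(mean(const_preds)  - target(x0))   # large: rigid model
bias_interp = abs(mean(interp_preds) - target(x0))   # small: flexible model
print(round(bias_const, 2), round(var(const_preds), 3),
      round(bias_interp, 2), round(var(interp_preds), 3))
```

Across the resampled datasets, the constant model shows high bias but low variance, while the nearest-point model shows low bias but high variance, mirroring the verbal description above.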
While the concept of the bias-variance tradeoff is highly relevant in both abstract theory and everyday application of CS and SL models (Bishop, 2006; T. Hastie et al., 2001), the concept of Vapnik-Chervonenkis dimensions plays a crucial role in statistical learning theory. The VC dimensions mathematically formalize the circumstances under which a pattern-learning algorithm can successfully distinguish between points and extrapolate to new examples (Vapnik, 1989, 1996). This comprises any instance of learning from a number of observations to derive general rules that capture properties of phenomena in nature, including human and computer learning (cf. Bengio, 2014; Lake, Salakhutdinov, & Tenenbaum, 2015; Tenenbaum, Kemp, Griffiths, & Goodman, 2011). Note that this inductive logic of learning a general principle from examples contrasts with the deductive logic of hypothesis falsification (cf. above). VC dimensions provide a probabilistic measure of whether a certain model is able to learn a distinction with respect to a given dataset. Formally, the VC dimensions measure the complexity capacity of a function space by counting the number of data points that can be cleanly divided (i.e., "shattered") into distinct groups as a result of the flexibility of the function set. Intuitively, the VC dimensions provide a guideline for the largest set of examples fed into a learning function such that it is possible to guarantee zero classification errors. They can be viewed as the effective number of parameters or the degrees of freedom, a concept shared between
CS and SL. In this way, the VC dimensions formalize the circumstances under which a class of functions is able to learn from a finite amount of data to successfully predict a given phenomenon in previously unseen data. As one of the most important results from statistical learning theory, the number of configurations one can obtain from a classification algorithm grows polynomially, while the error decreases exponentially (Wasserman, 2013). Practically, good models have finite VC dimensions - fitting with a sufficiently large amount of data yields performance that approximates the theoretically expected performance in unseen data. Bad models have infinite VC dimensions - regardless of the available amount of data, it is impossible to make generalization conclusions on unseen data. In contrast to the bias-variance tradeoff, the VC dimensions (like null-hypothesis testing) are unrelated to the target function, that is, the "true" mechanisms underlying the studied phenomenon in nature. Instead, the VC dimensions relate to the models used to approximate that target function. As an interesting consideration, it is possible that generalization from a concrete dataset fails even if the VC dimensions predict learning as very likely. However, such a dataset is theoretically unlikely to occur. Although the VC dimensions are the best formal concept to derive error bounds in statistical learning theory (Abu-Mostafa et al., 2012), they can only be explicitly computed for simple models. Hence, investigators are often restricted to an approximate bound for the VC dimensions, which limits their usefulness for theoretical considerations.
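The notion of shattering can be demonstrated exhaustively for a simple function class. Interval classifiers on the real line can realize every labeling of two points, but not of three (the labeling 1-0-1 has no realizing interval), so their VC dimension is two. A brute-force check of this textbook example:

```python
from itertools import product

def interval_classifier(lo, hi):
    """Labels a point 1 if it falls inside [lo, hi], else 0."""
    return lambda x: int(lo <= x <= hi)

def can_shatter(points):
    """Check whether interval classifiers realize every labeling of the points."""
    # It suffices to test intervals bounded by the points themselves, plus an
    # always-zero classifier for the all-zero labeling.
    candidates = [interval_classifier(a, b) for a in points for b in points]
    candidates.append(lambda x: 0)
    for labeling in product([0, 1], repeat=len(points)):
        if not any(tuple(c(p) for p in points) == labeling for c in candidates):
            return False
    return True

print(can_shatter([1.0, 2.0]))        # True: 2 points, every labeling achievable
print(can_shatter([1.0, 2.0, 3.0]))   # False: labeling (1, 0, 1) is impossible
```

For richer model families this enumeration becomes intractable, which is one concrete way to see why VC dimensions can only be computed explicitly for simple models.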
4.2 The impact of diverging inference categories on neuroimaging analyses
When looking at neuroimaging research through the CS lens, statistical estimation revolves around solving the multiple comparisons problem (Thomas E. Nichols, 2012; T. E. Nichols & Hayasaka, 2003). From the SL stance, however, it is the curse of dimensionality and overfitting that statistical analyses need to tackle (Domingos, 2012; Friston et al., 2008). In typical neuroimaging studies, CS methods typically test one hypothesis many times (i.e., the null hypothesis), whereas SL methods typically search through thousands of different hypotheses in a single process (i.e., walking through the
function space by numerical optimization). The high voxel resolution of common brain scans offers parallel measurements of >100,000 brain locations. In a mass-univariate regime, such as after fitting voxel-wise GLMs, the same statistical test is applied >100,000 times. The more often the investigator tests a hypothesis of relevance for a brain location, the more locations will be falsely detected as relevant (false positives, Type I errors), especially in the noisy neuroimaging data. The issue consists in too many simultaneous statistical inferences. From a general perspective, all dimensions in the data (i.e., voxel variables) are implicitly treated as equally important, and no neighborhoods of most expected variation are statistically exploited (T. Hastie et al., 2001). Hence, the absence of complexity restrictions during the statistical modelling of neuroimaging data takes a heavy toll at the final inference step.
This is contrasted by the high-dimensional SL regime, where the initial model choice by the investigator determines the complexity restrictions on all data dimensions (i.e., not single voxels) that are imposed explicitly or implicitly by the model structure. Model choice predisposes existing but unknown low-dimensional neighborhoods in the full voxel space to achieve the prediction task. Here, the toll is taken at the beginning because there are so many different alternative model choices that would impose a different set of complexity constraints. For instance, signals from "brain regions" are likely to be well approximated by models that impose discrete, locally constant compartments on the data (e.g., k-means or spatially constrained Ward clustering). Tuning model choice to signals from macroscopic "brain networks" should impose overlapping, locally continuous data compartments (e.g., independent component analysis or sparse principal component analysis). Knowledge of such effective dimensions in the neuroimaging data is a rare opportunity to simultaneously reduce the model bias and model variance, despite their typically inverse relationship. Statistical models that overcome the curse of dimensionality typically incorporate an explicit or implicit metric for such anisotropic neighborhoods in the data (Bach, 2014; Bzdok, Eickenberg, Grisel, Thirion, & Varoquaux, 2015; T. Hastie et al., 2001). Viewed from the bias-variance tradeoff, this successfully calibrates the sweet spot between underfitting and overfitting. Viewed from statistical learning theory, the VC
dimensions can be reduced and thus the generalization performance increased. When applying a model without such complexity restrictions to high-dimensional brain data, generalization becomes difficult to impossible because all directions in the data are treated equally, with isotropic structure. At the root of the problem, all data samples look virtually identical in high-dimensional data scenarios ("curse of dimensionality", Bellman, 1961). The learning algorithm will not be able to see through the noise and will thus overfit. In fact, these considerations explain why the multiple comparisons problem is closely linked to encoding studies and overfitting is more closely related to decoding studies (Friston et al., 2008). Moreover, it offers explanations as to why analyzing neural activity in a region of interest, rather than the whole brain, simultaneously alleviates both the multiple comparisons problem (called "small volume correction" in CS studies) and the overfitting problem (called "feature selection" in SL studies).
Further, some common invalidations of the CS and SL statistical frameworks are conceptually related (cf. case study two). An often-raised concern in neuroimaging studies performing classical inference is double dipping or circular analysis (N. Kriegeskorte et al., 2009). This occurs when, for instance, first correlating a behavioral measure with brain activity and then using the identified subset of brain voxels for a second correlation analysis with that same behavioral measurement (Lieberman, Berkman, & Wager, 2009; Vul et al., 2008). In this scenario, voxels are submitted to two statistical tests with the same goal in a nested, non-independent fashion8. This corrupts the validity of the null hypothesis on which the reported test results conditionally depend. Importantly, this case of repeating the same statistical estimation with iteratively pruned data selections (on the training data split) is a valid routine in the SL framework, such as in recursive feature elimination (Isabelle Guyon, Weston, Barnhill, & Vapnik, 2002; Hanson & Halchenko, 2008). However, there is an analog in SL analyses based on out-of-sample generalization to double dipping or circular analysis in CS methods applied to neuroimaging data: data snooping or peeking (Abu-Mostafa et al., 2012; Fithian, Sun, & Taylor, 2014; Pereira et al., 2009). This occurs, for instance, when performing simple (e.g., mean-centering) or more involved (e.g., k-means clustering) target-variable-dependent or -independent preprocessing on the entire dataset, where it should be applied separately to the training sets and test sets. Data snooping can lead to optimistic cross-validation estimates and a trained learning algorithm that fails on fresh data drawn from the same distribution. Rather than a corrupted null hypothesis, it is the error bounds of the VC dimensions that are loosened and, ultimately, invalidated because information from the concealed test set influences model selection on the training set. Conceptually, these two distinct classes of statistical errors arising within the CS and SL frameworks have several common points: i) Both involve a form of information compression imposed by the necessity to draw conclusions on small data parts drawn from originally high-dimensional neuroimaging data. ii) Illegitimate prior knowledge is introduced into the statistical estimation process. iii) Both double dipping (CS) and data snooping (SL) involve a form of bias - biasing the estimates to systematically deviate from the population parameter, or increasing the bias error term in the bias-variance decomposition (Tal Yarkoni & Westfall, 2016). iv) Both types of faux pas in statistical conduct can occur in very subtle and unexpected ways, but can often be avoided with small effort (see case study two). v) Invalidation of the null hypothesis or the VC bounds can yield overly optimistic results that may encourage unjustified confidence in neuroscientific findings and conclusions.
8 "If you torture the data enough, nature will always confess." (Ronald Coase)
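Data snooping can be demonstrated in a few dozen lines: on pure-noise data with randomly assigned labels, selecting the most discriminative features on the full dataset before cross-validation yields accuracies far above chance, whereas performing the selection inside each training fold does not. The simulation below is purely illustrative; the nearest-centroid classifier, sample sizes, and selection rule are arbitrary choices:

```python
import random

random.seed(6)

n, p, k_select = 40, 500, 10
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [0, 1] * (n // 2)                    # labels assigned purely at random

def top_features(rows, labels, k):
    """Rank features by between-class mean separation on the given rows."""
    scores = []
    for j in range(p):
        c0 = [r[j] for r, l in zip(rows, labels) if l == 0]
        c1 = [r[j] for r, l in zip(rows, labels) if l == 1]
        scores.append((abs(sum(c1) / len(c1) - sum(c0) / len(c0)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def nearest_centroid_cv(snoop):
    """Leave-one-out CV; feature selection done on all data or inside folds."""
    correct = 0
    for i in range(n):
        train = [t for t in range(n) if t != i]
        rows = [X[t] for t in train]
        labels = [y[t] for t in train]
        feats = (top_features(X, y, k_select) if snoop          # SNOOPING
                 else top_features(rows, labels, k_select))     # honest
        centroids = {}
        for c in (0, 1):
            rc = [r for r, l in zip(rows, labels) if l == c]
            centroids[c] = [sum(r[j] for r in rc) / len(rc) for j in feats]
        dist = {c: sum((X[i][j] - centroids[c][a]) ** 2
                       for a, j in enumerate(feats)) for c in (0, 1)}
        correct += int(min(dist, key=dist.get) == y[i])
    return correct / n

print(nearest_centroid_cv(True), nearest_centroid_cv(False))
```

Since the labels carry no information whatsoever, any accuracy clearly above 0.5 in the snooped variant is an artifact of the test samples having influenced feature selection.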
Moreover, it is probably optimal to perform cross-validation with k = 5 or 10 data splits in neuroimaging, despite diverging choices in previous studies. This is because empirical simulation studies (Leo Breiman & Spector, 1992; Kohavi, 1995) have indicated that more data splits k (i.e., bigger training sets, smaller test sets) can increase the estimate's variance, while fewer data splits k (i.e., smaller training sets, bigger test sets) can increase the estimate's bias. Concretely, for a particular estimate issued by leave-one-out cross-validation (k data splits = n samples) the neuroimaging practitioner might unluckily obtain a "fluctuation" far from the ground-truth value, and there is often no second dataset to double-check. As a drastic side note, neuroscientists do not even know whether a target function exists in nature when, for instance, automatically
classifying mental operations or healthy from diseased individuals based on brain scans (Nesterov, 2004; Wolpert, 1996). Additionally, an analytical proof (Shao, 1993) showed that leave-one-out cross-validation does not provide a theoretical guarantee for consistent model estimation. That is, this cross-validation scheme does not always select the best model, assuming that it is known. However, the neuroimaging practitioner is typically satisfied with model weights that are reasonably near to an optimal local point and does not focus on reaching the single global optimal point, as is typical in convex optimization (Boyd & Vandenberghe, 2004). As an exception, leave-one-out cross-validation may therefore be more justified in neuroimaging data scenarios with particularly few samples, where a decent model fit is the primary concern. On a more practical note, the computational load for leave-one-out cross-validation with k = n is often much higher than for k-fold cross-validation with k = 5 or 10.
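The bookkeeping behind these schemes is simple. A minimal k-fold splitter (an illustrative sketch, not a library routine) shows how k = 5 and k = n, i.e., leave-one-out, differ only in the number of folds and the train/test split sizes:

```python
def kfold_indices(n, k):
    """Partition sample indices 0..n-1 into k contiguous, near-equal test folds."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n) if j not in set(test)]
        folds.append((train, test))
        start += size
    return folds

n = 20
for k in (5, n):                   # k = n reproduces leave-one-out
    folds = kfold_indices(n, k)
    train, test = folds[0]
    print(k, len(folds), len(train), len(test))   # 5 5 16 4 / 20 20 19 1
```

In practice, samples are usually shuffled or stratified before splitting; the contiguous partition above is kept deliberately simple.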
5. Different currencies
5.1 Diverging performance metrics can quantify the behavior of statistical models
The neuroscientific investigator who adopts a CS culture, typically somebody with a background in psychology, biology, or medicine, is in the habit of diagnosing statistical investigations by means of p values, effect sizes, confidence intervals, and statistical power. The p value denotes the conditional probability of obtaining an equal or more extreme test statistic provided that the null hypothesis H0 is true, at the prespecified significance threshold alpha (Anderson, Burnham, & Thompson, 2000). Under the condition of sufficiently high power (cf. below), it quantifies the strength of evidence against the null hypothesis as a continuous function (Rosnow & Rosenthal, 1989). Counterintuitively, it is not an immediate judgment on the alternative hypothesis H1 preferred by the investigator (Anderson et al., 2000; J. Cohen, 1994). Three main interpretations of the significance level exist (Gerd Gigerenzer, 1993): i) the conventional level of significance specified before the investigation, yielding yes-no information (early Fisher), ii) the exact level of significance obtained after the investigation, yielding continuous information (late Fisher), and iii) the alpha level indicating a Type I error frequency in tests after repeated sampling (Neyman/Pearson). The second interpretation views the obtained p value as a property of the data, whereas the third views alpha as a property of the statistical test (Gerd Gigerenzer, 1993). P values do not quantify the probability of replication. It is another important caveat that p values become better (i.e., lower) with increasing sample sizes (Berkson, 1938).
The essentially binary p value is therefore often complemented by the continuous effect size. The p value is a deductive inferential measure, whereas the effect size is a descriptive measure that follows neither inductive nor deductive reasoning. The effect size can be viewed as the strength of a statistical relationship, how much H0 deviates from H1, or the likely presence of an effect in the general population (Chow, 1998; Ferguson, 2009; Kelley & Preacher, 2012). It is a unit-free, sample-size-independent, often standardized statistical measure for the importance of rejecting H0. It
equates to zero if H0 could not be rejected. As a tendency, the lower the p value, the higher the effect size (Nickerson, 2000). Importantly, the effect size allows the identification of marginal effects that pass the statistical significance threshold but are not practically relevant in the real world. As a property of the actual statistical test, the effect size has different names and takes various forms, such as rho in Pearson correlation, eta2 in explained variance, and Cohen's d in differences between group averages.
Additionally, the certainty of a point estimate (i.e., the outcome is a value) can always be expressed by an interval estimate (i.e., the outcome is a value range) using confidence intervals. They indicate with a chosen probability how often the "true" population effect would be within the investigator-specified interval after many repetitions of the study (Cumming, 2009; Estes, 1997; Nickerson, 2000). Typically, a 95% confidence interval is spanned around the range of values of a sample mean statistic that includes the population mean in 19 out of 20 cases across all samples. The tighter the confidence interval, the smaller the variance of the point estimate of the population parameter in each drawn sample (keeping the sample size constant). The sometimes normalized confidence intervals can be computed in a variety of ways. Their estimation is influenced by sample size and population variability. They can be reported for different statistics, with different percentage borders, and may be asymmetrical. Note that confidence intervals can be used as a viable surrogate for formal tests of statistical significance in many scenarios (Cumming, 2009).
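A basic z-based 95% confidence interval for a sample mean can be computed as follows (the measurements are made-up numbers; for a sample this small, a t quantile would properly be used and would widen the interval somewhat):

```python
from statistics import NormalDist, mean, stdev

sample = [101.2, 98.7, 103.4, 99.1, 100.8, 97.5, 102.2, 100.1]  # hypothetical scores

m = mean(sample)
se = stdev(sample) / len(sample) ** 0.5        # standard error of the mean
z = NormalDist().inv_cdf(0.975)                # ~1.96 for a 95% interval

ci = (m - z * se, m + z * se)
print(round(ci[0], 2), round(ci[1], 2))
```

Repeating the study many times and recomputing the interval each time, roughly 95% of such intervals would contain the population mean, which is the frequentist reading described above.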
Confidence intervals can be computed in various data scenarios and statistical regimes, whereas power is only meaningful within the culture of formal hypothesis falsification (Jacob Cohen, 1977, 1992; Oakes, 1986). The quality of interpretation of a statistically significant result is strongly affected by whether the original hypothesis was tenable. The power measures the probability of a statistical test to find a "true" effect, of rejecting H0 in the long term, or how well a "true" alternative hypothesis is correctly accepted assuming the effect exists in the population, that is, P(H0 detected to be false | H1 true). A high power thus ensures that statistically significant and non-significant tests indeed reflect a property of the population (Chow, 1998). Intuitively, small confidence intervals are
an indicator of high statistical power. Type II errors (i.e., false negatives, beta error) become less likely with higher power (= 1 - beta error). Concretely, an underpowered investigation does not allow choosing between H0 and H1 at the specified significance threshold alpha. Power calculations depend on several factors, including the significance threshold alpha, the effect size in the population, variation in the population, sample size n, and experimental design (Jacob Cohen, 1992). Rather than retrospectively, the necessary sample size n for a desired power can be computed in a prospective fashion after specifying an alpha and a hypothesized effect size.
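Such a prospective calculation can be sketched with the common two-sample, two-sided z approximation, n per group = 2·((z(1 - alpha/2) + z(power)) / d)², for a hypothesized Cohen's d. This is a simplification: exact t-based calculations give slightly larger n.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.8):
    """Two-sided two-sample z approximation: n per group for effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_beta = NormalDist().inv_cdf(power)            # ~0.84
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# A "medium" effect of d = 0.5 at alpha = .05 and 80% power:
print(sample_size_per_group(0.5))   # -> 63 per group
```

Halving the hypothesized effect size roughly quadruples the required sample size, which is why underpowered designs are so common when true effects are small.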
In contrast, diagnosis of the obtained research findings takes a different shape for the SL-indoctrinated neuroscientist⁹, typically somebody with a background in computer science, physics, or engineering. Cross-validation is the de facto standard to obtain an unbiased estimate of a model's capacity to generalize beyond the sample at hand (Bishop, 2006; T. Hastie et al., 2001). Model assessment is done by training on a bigger subset of the available data (i.e., the training set for in-sample performance) and subsequent application of the trained model to the smaller remaining part of the data (i.e., the test set for out-of-sample performance), which is assumed to share the same distribution. Cross-validation thus iterates over the sample in data splits until the class label (i.e., categorical target variable) of each data point has been predicted once. This set of model-predicted labels and the corresponding true data point labels can then be submitted to the quality measures accuracy, precision, recall, and F1 score (Powers, 2011). As the simplest among them, accuracy is a summary statistic that captures the fraction of correct prediction instances among all performed model applications. This and the following measures are often computed separately on the training set and the test set. Additionally, the measures from training and testing can be expressed by their inverse (e.g., the training error as in-sample error and the test error as out-of-sample error) because the positive and negative cases are interchangeable.
⁹ "It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when p > n, it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting." (James, Witten, Hastie, & Tibshirani, 2013, p. 247, authors' emphasis)
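The cross-validation logic described above can be sketched in a few lines. The one-dimensional toy sample and the nearest-centroid rule below are stand-ins for real brain scans and learning algorithms; the essential property is that "training" touches only the training split and each data point is predicted exactly once as a held-out case.

```python
import random

random.seed(42)

# Toy sample: one feature, two classes whose means differ (assumed data).
X = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(2, 1) for _ in range(30)]
y = [0] * 30 + [1] * 30

def cross_val_accuracy(X, y, k=5):
    idx = list(range(len(X)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint data splits
    predictions = {}
    for test_idx in folds:
        train_idx = [i for i in idx if i not in set(test_idx)]
        # "Training": class centroids estimated on the training set only.
        centroids = {c: sum(X[i] for i in train_idx if y[i] == c)
                        / sum(1 for i in train_idx if y[i] == c)
                     for c in set(y)}
        # Out-of-sample application to the held-out test split.
        for i in test_idx:
            predictions[i] = min(centroids, key=lambda c: abs(X[i] - centroids[c]))
    # Every data point has been predicted once; aggregate the accuracy.
    return sum(predictions[i] == y[i] for i in idx) / len(idx)

accuracy = cross_val_accuracy(X, y)
```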
The classification accuracy (= 1 - classification error) can be further decomposed into class-wise metrics based on the so-called confusion matrix, the juxtaposition of the true and predicted class memberships. The precision (= true positive / (true positive + false positive)) measures how many of the predicted labels are correct, that is, how many members predicted to belong to a class really belong to that class. For instance, among the participants predicted to have depression, how many are really affected by that disease? On the other hand, the recall (= true positive / (true positive + false negative)) measures how many labels are correctly predicted, that is, how many members of a class were predicted to really belong to that class. Hence, among the participants affected by depression, how many were actually detected as such? Put differently, precision can be viewed as a measure of "exactness" or "quality" and recall as a measure of "completeness" or "quantity" (Powers, 2011). Neither accuracy, precision, nor recall allows injecting subjective importance into the evaluation process of the prediction model. This disadvantage is alleviated by the Fbeta score, which is a weighted average of the precision and recall prediction scores. Concretely, the F1 score would equally weigh precision and recall of class predictions, while the F0.5 score puts more emphasis on precision and the F2 score more on recall. Moreover, applications of recall, precision, and Fbeta scores have been noted to ignore the true negative cases as well as to be highly susceptible to estimator bias (Powers, 2011). Needless to say, no single measure can be equally optimal in all contexts.
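These definitions can be made concrete in a small sketch; the confusion-matrix counts below are invented for illustration.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Class-wise metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # "exactness": correct among the predicted
    recall = tp / (tp + fn)      # "completeness": detected among the actual
    b2 = beta ** 2               # beta weighs recall relative to precision
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Invented example: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_fbeta(8, 2, 4)
```

With these counts, precision (0.8) exceeds recall (2/3), so the precision-leaning F0.5 score is larger than F1, which in turn is larger than the recall-leaning F2 score.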
Finally, learning curves (Abu-Mostafa et al., 2012; Murphy, 2012) are an important diagnostic tool to evaluate sample complexity, that is, the achieved model fit and prediction as a function of the available sample size n. For increasingly bigger subsets of the training set, a classification algorithm is trained on that current share of the training set and then evaluated for accuracy on the always-same test set. Across subset instances, simple models display relatively high in-sample error because they cannot approximate the target function very well (underfitting) but exhibit good generalization to unseen data with relatively low out-of-sample error. Conversely, complex models display relatively low in-sample error because they adapt too well to the data (overfitting) with difficulty to
extrapolate to newly sampled data, resulting in high out-of-sample error. In both scenarios, model effectiveness degrades as the data points available for model training become scarcer.
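A minimal learning-curve sketch, assuming toy data and a 1-nearest-neighbour rule as the archetypal overfitting (high-variance) model:

```python
import random

random.seed(0)

# Toy two-class sample; 1-nearest-neighbour is a high-variance model.
X = [random.gauss(0, 1) for _ in range(40)] + [random.gauss(1.5, 1) for _ in range(40)]
y = [0] * 40 + [1] * 40
pairs = list(zip(X, y))
random.shuffle(pairs)
train, test = pairs[:60], pairs[60:]  # the always-same test set

def one_nn_error(fit_set, eval_set):
    """Misclassification rate of a 1-nearest-neighbour rule."""
    errors = 0
    for x, label in eval_set:
        nearest = min(fit_set, key=lambda pt: abs(pt[0] - x))
        errors += nearest[1] != label
    return errors / len(eval_set)

curve = []
for n_train in (10, 30, 60):  # increasingly big training subsets
    subset = train[:n_train]
    curve.append((n_train,
                  one_nn_error(subset, subset),  # in-sample error
                  one_nn_error(subset, test)))   # out-of-sample error
```

The in-sample error of 1-NN is zero by construction (every training point is its own nearest neighbour), so the gap to the out-of-sample error directly visualizes the overfitting described above.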
5.2 Outcome metrics of statistical models used in neuroimaging
Reports of statistical outcomes in the neuroimaging literature have previously been recognized to confuse notions from classical statistics and statistical learning (Friston, 2012). On a general basis, CS and SL do not judge findings by the same aspects of evidence (Lo et al., 2015; Shmueli, 2010). In neuroimaging papers based on classical hypothesis-driven inference, p values and confidence intervals are ubiquitously reported. There have however been very few reports of effect size in the neuroimaging literature (N. Kriegeskorte, Lindquist, Nichols, Poldrack, & Vul, 2010). Effect sizes, in turn, are necessary to compute power estimates. This explains the even rarer occurrence of power calculations in neuroimaging research (but see Tal Yarkoni & Braver, 2010). Effect sizes have been argued to allow verification that, in an optimal experimental design, inference is tuned to big effect sizes (Friston, 2012). For instance, to estimate the p value and the effect size for local changes in neural activity during a psychological task, one would actually need two independent samples of these experimental data. One sample would be used to perform statistical inference on the neural activity change and one sample to obtain unbiased effect sizes. It has been previously emphasized (Friston, 2012) that p values and effect sizes reflect in-sample estimates in a retrospective inference regime (CS). These metrics find an analogue in out-of-sample estimates issued from cross-validation in a prospective prediction regime (SL). In-sample effect sizes are typically an optimistic estimate of the "true" effect size (inflated by high significance thresholds), whereas out-of-sample effect sizes are unbiased estimates of the "true" effect size. As an important consequence, neuroimaging investigators should refrain from simultaneously computing and reporting both types of estimates on an identical data sample, as this can lead to double dipping (cf. case study two).
In the high-dimensional scenario, analyzing "wide" neuroimaging data in our case, judging statistical significance by p values is generally considered to be challenging (Bühlmann & Van De Geer, 2011;
see case studies two and three). Instead, classification accuracy on fresh data is probably the most often-reported performance metric in neuroimaging studies using learning algorithms. Basing interpretation on accuracy alone is however influenced by the local characteristics of hemodynamic responses, the efficiency of the experimental design, the data folding into train and test sets, and differences in the feature number p (J.-D. Haynes, 2015). A potentially under-exploited SL tool in this context is bootstrapping. It enables population-level inference of unknown distributions independent of model complexity by repeated random draws from the neuroimaging data sample at hand (Efron, 1979; Efron & Tibshirani, 1994). This opportunity to equip various point estimates with an interval estimate of certainty (e.g., the interval for the "true" accuracy of a classifier) is unfortunately seldom embraced in the contemporary neuroimaging domain (but see Bellec, Rosa-Neto, Lyttelton, Benali, & Evans, 2010; Vogelstein et al., 2014). Besides providing confidence intervals, bootstrapping can also perform non-parametric null hypothesis testing. This may be a rare example of a direct connection between CS and SL methodology. Alternatively, binomial tests can be used to obtain a p-value estimate of statistical significance from accuracies and other performance scores (Brodersen et al., 2013; Hanke, Halchenko, & Oosterhof, 2015; Pereira et al., 2009) in the binary classification setting (for an example see Bludau et al., 2015). It can reject the null hypothesis that two categories occur equally often. A last option that is also applicable to the multi-class setting is label permutation, another non-parametric resampling procedure (Golland & Fischl, 2003; T. E. Nichols & Holmes, 2002). It can serve to reject the null hypothesis that the neuroimaging data do not contain any information about the class labels.
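Both resampling ideas can be sketched on an invented record of classifier correctness (60 hits in 100 trials). The percentile bootstrap interval and the exact one-sided binomial test below follow standard textbook recipes rather than any particular neuroimaging toolbox.

```python
import random
from math import comb

random.seed(1)

# Invented per-trial correctness of a classifier: 60 hits in 100 trials.
outcomes = [1] * 60 + [0] * 40
random.shuffle(outcomes)

# Bootstrap: repeated random draws with replacement from the sample at hand.
boot_accs = []
for _ in range(2000):
    resample = [random.choice(outcomes) for _ in outcomes]
    boot_accs.append(sum(resample) / len(resample))
boot_accs.sort()
ci_low, ci_high = boot_accs[50], boot_accs[1949]  # ~95% percentile interval

# Exact binomial test: probability of >= 60 hits under 50% chance level.
n, hits = len(outcomes), sum(outcomes)
p_value = sum(comb(n, k) for k in range(hits, n + 1)) * 0.5 ** n
```

The interval estimate brackets the observed accuracy of 0.6, and the one-sided p value quantifies how surprising 60/100 correct predictions would be if the two categories occurred equally often.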
Extending from the setting of two hypotheses or yes-no classification to multiple classes injects ambiguity into the interpretation of accuracy scores. Rather than mere better-than-chance findings, it becomes more important to evaluate the F1, precision, and recall scores for each class to be predicted in the brain scans (e.g., Brodersen, Schofield, et al., 2011; Schwartz, Thirion, & Varoquaux, 2013). It is important to appreciate that the sensitivity/specificity metrics, more frequently reported in CS communities, and the precision/recall metrics, more frequently reported in SL communities, tell slightly different stories about identical neuroscientific findings. In fact, sensitivity equates with recall
(= true positive / (true positive + false negative)). Specificity (= true negative / (true negative + false positive)) does however not equate with precision (= true positive / (true positive + false positive)). Further, a CS view on the SL metrics would be that maximum precision corresponds to absent Type I errors (i.e., no false positives), whereas maximum recall corresponds to absent Type II errors (i.e., no false negatives). Again, Type I and II errors are related to the entirety of data points in a CS regime, whereas prediction is only evaluated on a test data split of the sample in a SL regime. In learning curves, a big gap between high in-sample and low out-of-sample performance is typically observed for high-variance models, such as neural network algorithms or random forests. These performance metrics from different data splits often converge for high-bias models, such as linear support vector machines and logistic regression. Moreover, the medical domain and social sciences usually aggregate results in ROC (receiver operating characteristic) curves plotting sensitivity against 1 - specificity, whereas engineering and computer science domains tend to report recall-precision curves instead (Davis & Goadrich, 2006; Demšar, 2006).
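A small numerical example, with invented confusion-matrix counts chosen to mimic a rare positive class, makes the divergence between specificity and precision tangible:

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity/specificity (CS habit) and precision (SL habit)
    from the four cells of a binary confusion matrix."""
    return {
        "recall_sensitivity": tp / (tp + fn),  # identical by definition
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Invented rare-class scenario: 12 patients among 1000 individuals.
m = binary_metrics(tp=8, fp=20, fn=4, tn=968)
```

Here specificity looks excellent (968/988) while precision is poor (8/28): the abundant true negatives flatter specificity, whereas precision exposes that most positive predictions are wrong, which is exactly why the two metric families can tell different stories about the same finding.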
Finally, it is often possible to inspect the fit for purpose of trained black-box models (Brodersen, Haiss, et al., 2011; Kuhn & Johnson, 2013). In SL this can take the form of evaluating some notion of support recovery, that is, the question to what extent the learning algorithms put probability mass on the parts of the feature space that are "truly" underlying the given classes. In neuroimaging, this pertains to the difference between models that capture task-specific aspects of neural activity or arbitrary discriminative aspects, such as structured noise in participant- or scanner-related idiosyncrasies (Brodersen, Haiss, et al., 2011; Varoquaux & Thirion, 2014). Such a face-validity criterion of meaningful model fit is all the more important because of the general trade-off in statistics between choosing models with the best possible performance and those with model parameters that are most interpretable (T. Hastie et al., 2001). Instead of maximizing prediction scores, neuroimaging investigators might want to focus on neurobiologically informed feature spaces and mechanistically interpretable model weights (Brodersen, Schofield, et al., 2011; Bzdok et al., 2015; K. E. Stephan, 2004). In fact, SL neuroimaging studies would perhaps benefit from a metric for neurobiological plausibility as an acid test during model selection. Reverse-engineering fitted models
by reconstruction of stimulus or task aspects (Miyawaki et al., 2008; Thomas Naselaris, Prenger, Kay, Oliver, & Gallant, 2009; Thirion et al., 2006) could become an important evaluation metric for why a learned feature-label mapping exhibits a certain model performance (Gabrieli et al., 2015). This may further disambiguate explanations for (statistically significant) better-than-chance accuracies because i) numerous, largely different members of a given function space may yield essentially similar model performance and ii) each fitted model may only capture a part of the class-relevant structure in the neuroimaging data (cf. Nikolaus Kriegeskorte, 2011). For instance, fit-for-purpose metrics used in neuroimaging could uncover what neurobiological aspects allow classifying an individual as schizophrenic versus normal and which neurobiological endophenotypes underlie the schizophrenia "spectrum" (Bzdok & Eickhoff, 2015; Hyman, 2007; Klaas E. Stephan et al., 2015; K. E. Stephan, Friston, & Frith, 2009).
6. Case study one: Generalization and subsequent classical inference

Vignette: We are interested in potential differences in brain structure that are associated with an individual's age (continuous target variable). A Lasso (belongs to the SL arsenal) is computed on the voxel-based morphometry data (Ashburner & Friston, 2000) from the brain's grey matter of the 500-subject HCP release (Human Connectome Project; Van Essen et al., 2012). This L1-penalized residual-sum-of-squares regression performs variable selection (i.e., effectively eliminates coefficients by setting them to zero) on all grey-matter voxels' volume information in a high-dimensional (not mass-univariate) regime. Assessing the generalization performance of different sparse models using five-fold cross-validation yields the non-zero coefficients for the few brain voxels whose volumetric information is most predictive of an individual's age.
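A minimal sketch of the vignette's estimator follows, with synthetic data standing in for the HCP grey-matter volumes and a hand-picked penalty lambda in place of the cross-validated model selection; the soft-thresholding coordinate descent shown is one standard way to fit the Lasso.

```python
import random

random.seed(0)

# Synthetic stand-in for the vignette: n subjects, p voxel-like features,
# where only the first feature truly carries age-related signal (assumed).
n, p = 40, 6
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
age = [2.0 * row[0] + random.gauss(0, 0.5) for row in X]

def soft_threshold(rho, lam):
    """The L1 penalty's characteristic operator: small effects become exactly zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, n_sweeps=100):
    """L1-penalized least squares fitted by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the partial residual (all other
            # features' current contributions removed).
            rho = sum(X[i][j] * (y_i - sum(X[i][k] * beta[k]
                                           for k in range(p) if k != j))
                      for i, y_i in enumerate(y))
            scale = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / scale
    return beta

beta = lasso(X, age, lam=20.0)
```

The penalty drives the coefficients of the noise features to exactly zero while the genuinely predictive feature survives with a shrunken weight, which is the automatic variable selection the vignette refers to.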
Question: How can we perform classical inference to know which of the grey-matter voxels selected to be predictive for biological age are statistically significant?
This is an important concern because most statistical methods currently applied to large datasets perform some explicit or implicit form of variable selection (Trevor Hastie et al., 2015; Jenatton et al., 2011; Jordan et al., 2013). There are even many different forms of preliminary selection of variables before performing significance tests on them. First, the Lasso is a widely used estimator in engineering, compressive sensing, various "omics" branches, and other sciences, mostly without a significance test. Beyond neuroscience, generalization-approved statistical learning models are routinely solving a diverse set of real-world challenges. This includes but is not limited to algorithmic trading in financial markets, real-time speech translation, SPAM filtering for e-mails, face recognition in digital cameras, and piloting self-driving cars (Jordan & Mitchell, 2015; LeCun et al., 2015). In all these examples statistical learning algorithms successfully generalize to unseen data and thus tackle the problem heuristically without a classical significance test for variables or model performance.

Second, the Lasso has solved the combinatorial problem of what subset of grey-matter voxels best predicts an individual's age by automatic variable selection. Computing voxel-wise p values would
recast this high-dimensional pattern-learning setting into a mass-univariate hypothesis-testing problem where relevance would be computed independently for each voxel and correction for multiple comparisons would become necessary. Yet, recasting into the mass-univariate setting would ignore the sophisticated selection process that led to the predictive model with a reduced number of variables (Wu et al., 2009). Put differently, the variable selection procedure is itself a stochastic process that is however not accounted for by the theoretical guarantees of classical inference for statistical significance (Berk, Brown, Buja, Zhang, & Zhao, 2013). Put in yet another way, data-driven model selection corrupts hypothesis-driven statistical inference because the sampling distribution of the parameter estimates is altered. The important consequence is that naive classical inference expects a non-adaptive model chosen before data acquisition and can therefore not be used alongside the Lasso in particular or arbitrary selection procedures in general¹⁰.
Third, this conflict between data-guided model selection by cross-validation (SL) and confirmatory classical inference (CS) is currently at the frontier of statistical development (Loftus, 2015; J. Taylor & Tibshirani, 2015). New methods for so-called post-selection inference (or selective inference) allow computing p values for a set of features that have previously been chosen to be meaningful predictors by some criterion. According to the theory of CS, the statistical model is to be chosen before visiting the data. Classical statistical tests and confidence intervals therefore become invalid and the p values become downward-biased (Berk et al., 2013). Consequently, the association between a predictor and the target variable must be even stronger to certify the same level of significance. Selective inference for modern adaptive regression thus replaces loose naive p values by more rigorous selection-adjusted p values. As an ordinary null hypothesis can hardly be adopted in this adaptive testing setting, conceptual extension is also prompted on the level of CS theory itself (Trevor Hastie et al., 2015). Closed-form solutions to adjusted inference after variable selection already exist for principal component analysis (Choi, Taylor, & Tibshirani, 2014) and forward stepwise
regression (Jonathan Taylor, Lockhart, Tibshirani, & Tibshirani, 2014). Last but not least, a simple alternative to formally account for preceding model selection is data splitting or sample splitting (D. R. Cox, 1975; Fithian et al., 2014; Wasserman & Roeder, 2009), which is frequent practice in genetics (e.g., Sladek et al., 2007). In this procedure, the selection procedure is computed on one data split and p values are computed on the remaining second data split. However, data splitting is not always possible and will incur power losses and interpretation problems.

¹⁰ "Once applied only to the selected few, the interpretation of the usual measures of uncertainty do not remain intact directly, unless properly adjusted." (Yoav Benjamini)
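The data-splitting recipe can be sketched as follows; toy data replace the brain measurements, selection is reduced to picking the best-correlated feature, and a non-parametric permutation test stands in for a parametric p value on the held-out half.

```python
import random

random.seed(7)

# Toy data: one genuinely predictive feature and one noise feature.
n = 100
signal = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
target = [s + random.gauss(0, 0.7) for s in signal]
features = {"signal": signal, "noise": noise}

def corr(a, b):
    """Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

half = n // 2
# Data split 1: adaptive variable selection (pick the best-correlated feature).
chosen = max(features, key=lambda f: abs(corr(features[f][:half], target[:half])))
# Data split 2: inference on the held-out half only, via permutation test.
x2, y2 = features[chosen][half:], target[half:]
observed = abs(corr(x2, y2))
perms, exceed = 999, 0
for _ in range(perms):
    shuffled = random.sample(y2, len(y2))
    exceed += abs(corr(x2, shuffled)) >= observed
p_value = (exceed + 1) / (perms + 1)
```

Because the p value is computed on data the selection step never saw, it is not corrupted by the adaptive choice of the feature.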
7. Case study two: Classical inference and subsequent generalization
Vignette: We are interested in potential brain structure differences that are associated with an individual's gender (categorical target variable) in the voxel-based morphometry data (Ashburner & Friston, 2000) of the 500-subject HCP release (Human Connectome Project; Van Essen et al., 2012). Initially, the >100,000 voxels per brain scan are reduced to the most important 10,000 voxels to lower the computational cost and facilitate predictive model estimation. To this end, ANOVA (a univariate test for statistical significance belonging to CS) is first used to obtain a ranking of the most relevant 10,000 features from the grey matter of each subject. This selects the 10,000 out of the original >100,000 voxel variables with the highest variance explaining volume differences between males and females (i.e., the male-female class labels are used in the univariate test). Second, support vector machine classification (a "multivariate" pattern-learning algorithm belonging to SL) is performed by training and testing on a feature space with the 10,000 preselected grey-matter measurements to predict the gender from each subject's brain scan.
Question: Is an analysis pipeline with univariate classical inference and subsequent high-dimensional prediction valid if both steps rely on the same target variables?
The implications of feature engineering procedures applied before training a learning algorithm are a frequent concern and can have very subtle answers (I. Guyon & Elisseeff, 2003; Hanke et al., 2015; N. Kriegeskorte et al., 2009; Lemm, Blankertz, Dickhaus, & Muller, 2011). In most applications of predictive models the large majority of brain voxels will be uninformative (Brodersen, Haiss, et al.,
2011). The described scenario of dimensionality reduction by feature selection to focus prediction is clearly allowed under the condition that the ANOVA is not computed on the entire data sample. Rather, the voxels explaining most variance between the male and female individuals should be computed only on the training set in each cross-validation fold. In the training set and test set of each fold, the same identified candidate voxels are then regrouped into a feature space that is fed into the support vector machine algorithm. This ensures an identical feature space for model training and model testing, but its construction only depends on structural brain scans from the training set. Generally, any voxel preprocessing prior to model training is authorized if the feature space construction is not influenced by properties of the concealed test set. In the present scenario, the Vapnik-Chervonenkis bounds of the cross-validation estimator are therefore neither loosened nor invalidated when class labels are exploited for feature selection, regardless of whether the feature selection procedure is univariate or multivariate. Put differently, the cross-validation procedure simply evaluates the entire prediction process, including the automatized and potentially nested dimensionality reduction procedure. In sum, in a SL regime, using class information during feature preprocessing for a cross-validated supervised estimator is not an instance of data-snooping (or peeking) if done exclusively on the training set (Abu-Mostafa et al., 2012).
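The fold-wise construction described above can be sketched in plain Python. The ANOVA ranking is simplified here to a class-mean difference score and the support vector machine to a nearest-centroid rule (both simplifications are assumptions of this sketch), but the decisive property is preserved: label-guided feature selection touches only the training indices of each fold.

```python
import random

random.seed(3)

# Toy sample: p features per "scan"; only feature 0 differs between classes.
n, p = 40, 20
y = [0] * 20 + [1] * 20
X = [[random.gauss(2.0 * label if j == 0 else 0.0, 1) for j in range(p)]
     for label in y]

def class_separation(X, y, idx, j):
    """Univariate class separation of feature j on the given indices only."""
    a = [X[i][j] for i in idx if y[i] == 0]
    b = [X[i][j] for i in idx if y[i] == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

order = list(range(n))
random.shuffle(order)
k, n_keep, correct = 5, 3, 0
for fold in range(k):
    test_idx = order[fold::k]
    train_idx = [i for i in order if i not in set(test_idx)]
    # Feature selection uses the class labels, but ONLY on the training split.
    kept = sorted(range(p),
                  key=lambda j: class_separation(X, y, train_idx, j),
                  reverse=True)[:n_keep]
    # Nearest-centroid "classifier" trained on the selected feature space.
    cents = {c: [sum(X[i][j] for i in train_idx if y[i] == c)
                 / sum(1 for i in train_idx if y[i] == c) for j in kept]
             for c in (0, 1)}
    for i in test_idx:
        d = {c: sum((X[i][j] - cents[c][a]) ** 2 for a, j in enumerate(kept))
             for c in (0, 1)}
        correct += min(d, key=d.get) == y[i]
accuracy = correct / n
```

Had the selection score been computed on all n samples before cross-validation, information from the test splits would leak into the feature space, which is exactly the peeking the paragraph rules out.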
This is an advantage of cross-validation yielding out-of-sample estimates. In stark contrast, remember that null-hypothesis testing yields in-sample estimates. Using the class labels for a variable selection step just before statistical hypothesis testing on the same data sample would invalidate the null hypothesis (N. Kriegeskorte et al., 2010; N. Kriegeskorte et al., 2009) (cf. case study one). Consequently, in a CS regime, using class information to select variables before null-hypothesis testing will incur an instance of double-dipping (or circular analysis).
Regarding interpretation of the results, the classifier will miss some brain voxels that only carry relevant information when considered in voxel ensembles. This is because the ANOVA filter kept voxels that are independently relevant (Brodersen, Haiss, et al., 2011). Univariate feature selection may systematically encourage the selection of models (i.e., each weight combination equates with a model hypothesis from the classifier's function space) that are not neurobiologically meaningful. Concretely, in the discussed scenario the classifier learns complex patterns between voxels that were chosen to be individually important. This may considerably weaken the interpretability of and conclusions on "whole-brain multivariate patterns". Remember also that variables that have a statistically significant association with a target variable do not necessarily have good generalization performance, and vice versa (Lo et al., 2015; Shmueli, 2010). On the upside, it is widely believed that the combination of whole-brain univariate feature selection and linear classification is frequently among the best approaches if the primary goal is optimized prediction performance as opposed to optimized interpretability. Finally, it is interesting to consider that ANOVA-mediated feature selection of p < 500 voxel variables reduces the "wide" neuroimaging data ("n << p" setting) down to "long" neuroimaging data with fewer features than observations ("n > p" setting) given the n = 500 subjects (Wainwright, 2014). This allows recasting the SL regime into a CS regime in order to fit a standard general linear model and perform classical null-hypothesis testing instead of training a predictive classification algorithm (Brodersen, Haiss, et al., 2011).
8. Case study three: Structure discovery by clustering algorithms
Vignette: Each functionally specialized region in the human brain probably has a unique set of long-range connections (Passingham, Stephan, & Kotter, 2002). This notion has prompted connectivity-based parcellation methods in neuroimaging that segregate a region of interest (ROI, which can be locally circumscribed or brain-global; S. B. Eickhoff, Thirion, Varoquaux, & Bzdok, 2015) into distinct cortical modules (Behrens et al., 2003). The whole-brain connectivity for each ROI voxel is computed and the voxel-wise connectional fingerprints are submitted to a clustering algorithm (i.e., the individual brain voxels in the ROI are the elements to group; the connectivity strength values are the features of each element for similarity assessment). In this way, connectivity-based parcellation yields cortical modules in the specified ROI that exhibit similar connectivity patterns and are thus, potentially, functionally distinct. That is, voxels within the same cluster in the ROI will have more similar connectivity properties than voxels from different ROI clusters.
Question: Is it possible to decide whether the obtained brain clusters are statistically significant?
Essentially, the aim of connectivity-guided brain parcellation is to find useful, simplified structure by imposing discrete compartments on brain topography (Frackowiak & Markram, 2015; S. M. Smith et al., 2013; Yeo et al., 2011). This is typically achieved by k-means, hierarchical, Ward, or spectral clustering algorithms (Thirion, Varoquaux, Dohmatob, & Poline, 2014). Putting on the CS hat, a ROI clustering result would be deemed statistically significant if it has a very low probability of being "true" under the null hypothesis that the investigator seeks to reject (Everitt, 1979; Halkidi, Batistakis, & Vazirgiannis, 2001). Choosing a test statistic for clustering solutions to obtain p values is difficult (Vogelstein et al., 2014) because of the need for a meaningful null hypothesis to test against (Jain, Murty, & Flynn, 1999). Put differently, for hypothesis-driven statistical inference one may need to pick an arbitrary hypothesis to falsify. It follows that the CS notions of effect size and power do not seem to apply in the case of brain parcellation. Instead of classical inference to formally test for a particular structure in the clustering results, we actually need to resort to exploratory approaches that discover and assess structure in the neuroimaging data (Efron & Tibshirani, 1991; T. Hastie et al., 2001; Tukey, 1962). Although statistical methods span a continuum between the two poles of CS and SL, finding a clustering model with the highest fit in the sense of explaining the regional connectivity differences at hand is more naturally situated in the SL community.
Putting on the SL hat, we realize that the problem of brain parcellation constitutes an unsupervised learning setting without any target variable y to predict (e.g., cognitive tasks, the age or gender of the participants). The learning task is therefore not to estimate a supervised predictive model y = f(X), but to estimate an unsupervised descriptive model for the connectivity data X themselves. Solving such unsupervised estimation problems is generally recognized to be very hard (Bishop, 2006; Zoubin Ghahramani, 2004; T. Hastie et al., 2001). In clustering problems, there are many possible
transformations, projections, and compressions of X, but there is no criterion of optimality that clearly suggests itself. On the one hand, the "true" shape of clusters is unknown for most real-world clustering problems, including brain parcellation studies. On the other hand, finding an "optimal" number of clusters represents an unresolved issue (the cluster validity problem) in statistics in general and in brain neuroimaging in particular (Handl, Knowles, & Kell, 2005; Jain et al., 1999). Evaluating the model fit of clustering results is conventionally addressed by heuristic cluster validity criteria (S. B. Eickhoff et al., 2015; Thirion et al., 2014). These are necessary because clustering algorithms will always find subregions in the investigator's ROI, that is, relevant structure according to the clustering algorithm's optimization objective, whether these truly exist in nature or not. There is a variety of such criteria based on information theory, topology, and consistency. They commonly encourage cluster solutions with low within-cluster and high between-cluster differences, regardless of the applied clustering algorithm.
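A compact sketch of this within/between-cluster logic, using k-means on toy two-dimensional "fingerprints" and the within-cluster sum of squares as a simple validity criterion; the data, the deterministic initialization, and the choice of criterion are illustrative assumptions.

```python
import random

random.seed(5)

# Toy "connectivity fingerprints": two well-separated 2-D groups of voxels.
points = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)] +
          [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(20)])

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with centroids initialized across the ordered sample."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for pt in points:
            d = [sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centroids]
            clusters[d.index(min(d))].append(pt)
        centroids = [tuple(sum(coords) / len(c) for coords in zip(*c))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    # Within-cluster sum of squares: the validity criterion of this sketch.
    wcss = sum(min(sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centroids)
               for pt in points)
    return centroids, wcss

wcss_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
```

The criterion drops sharply from k = 1 to k = 2 (the planted structure) and only marginally thereafter, illustrating why such heuristics always prefer more clusters to some degree and why the "optimal" k remains a judgment call.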
Evidently, the discovered connectivity clusters are mere hints to candidate brain modules. Their "existence" in neurobiology requires further scrutiny (S. B. Eickhoff et al., 2015; Thirion et al., 2014). Nevertheless, such clustering solutions are an important means to narrow down high-dimensional neuroimaging data. Preliminary clustering results broaden the space of research hypotheses that the investigator can articulate. For instance, the unexpected discovery of a candidate brain region (cf. Mars et al., 2012; zu Eulenburg, Caspers, Roski, & Eickhoff, 2012) can provide an argument for future experimental investigations. Brain parcellation can thus be viewed as an exploratory unsupervised method outlining relevant structure in neuroimaging data that can subsequently be formally tested as a concrete hypothesis in neuroimaging studies whose interpretations are based on classical inference.
9. Conclusion
A novel scientific fact about the brain is only valid in the context of the complexity restrictions that have been imposed on the studied phenomenon during the investigation (Box, 1976). The statistical arsenal of the imaging neuroscientist can be divided into classical inference by hypothesis falsification and increasingly used generalization inference by extrapolating complex patterns. While null-hypothesis testing has been dominating the academic milieu for several decades, statistical learning methods are prevalent in many data-intensive branches of industry (Vanderplas, 2013). This sociological segregation may partly explain existing confusion about their mutual relationship. Despite diverging historical trajectories and theoretical foundations, both statistical cultures aim at extracting new knowledge from data using mathematical models (Friston et al., 2008; Jordan et al., 2013). However, an observed effect with a statistically significant p value does not necessarily generalize to future data samples. Conversely, an effect with successful out-of-sample generalization is not necessarily statistically significant. The distributional properties of an effect important for high statistical significance and for successful generalization are not identical (Lo et al., 2015). Additionally, classical inference is a judgment about an entire data sample, whereas predictive inference can be applied to single data points. The goal and permissible conclusions of a formal inference are therefore conditioned by the adopted statistical framework (Feyerabend, 1975). This is routinely exploited in drug development cycles, with predictive inference for the early discovery phase and classical inference for the clinical trials phase. A similar back and forth between applying inductive learning algorithms and deductive hypothesis testing of the discovered candidate structure could and should also become routine in imaging neuroscience. Awareness of the discussed cultural gap is important to keep pace with the increasing information granularity of acquired neuroimaging repositories. Ultimately, statistical inference is a heterogeneous concept.
Acknowledgments

The present paper did not result from isolated contemplations by a single person. Rather, it emerged from several thought milieus with different thought styles and opinion systems. The author cordially thanks the following people for valuable discussion and precious contributions to the present paper (in alphabetical order): Olavo Amaral, Patrícia Bado, Jérémy Lefort-Besnard, Kay Brodersen, Elvis Dohmatob, Guillaume Dumas, Michael Eickenberg, Simon Eickhoff, Denis Engemann, Can Ergen, Alexandre Gramfort, Olivier Grisel, Carl Hacker, Michael Hanke, Lukas Hensel, Thilo Kellermann, Jean-Rémi King, Robert Langner, Daniel Margulies, Jorge Moll, Zeinab Mousavi, Carolin Mößnang, Rico Pohling, Andrew Reid, João Sato, Bertrand Thirion, Gaël Varoquaux, Marcel van Gerven, Virginie van Wassenhove, Klaus Willmes, Karl Zilles.
References
Abu-Mostafa,Y.S.,Magdon-Ismail,M.,&Lin,H.T.(2012).Learningfromdata.California:AMLBook.
Amunts, K., Lepage,C., Borgeat, L.,Mohlberg,H.,Dickscheid, T., Rousseau,M. E., . . . Evans,A. C.(2013).BigBrain:anultrahigh-resolution3Dhumanbrainmodel.Science,340(6139),1472-1475.doi:10.1126/science.1235381
Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: problems,prevalence,andanalternative.Thejournalofwildlifemanagement,912-923.
Ashburner,J.,&Friston,K.J.(2000).Voxel-basedmorphometry-themethods.Neuroimage,11,805-821.
Ashburner,J.,&Friston,K.J.(2005).Unifiedsegmentation.Neuroimage,26(3),839-851.doi:S1053-8119(05)00110-2[pii]
10.1016/j.neuroimage.2005.02.018
Ashburner, J., & Kloppel, S. (2011). Multivariate models of inter-subject anatomical variability.Neuroimage,56(2),422-439.doi:10.1016/j.neuroimage.2010.03.059
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding andcomputation.NatRevNeurosci,7(5),358-366.doi:10.1038/nrn1888
Bach, F. (2014). Breaking the curse of dimensionality with convex neural networks. arXiv preprintarXiv:1412.8690.
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducingpenalties.FoundationsandTrends®inMachineLearning,4(1),1-106.
Bates,E.,Wilson,S.M.,Saygin,A.P.,Dick,F.,Sereno,M. I.,Knight,R.T.,&Dronkers,N.F. (2003).Voxel-basedlesion-symptommapping.NatNeurosci,6(5),448-450.doi:10.1038/nn1050
Behrens,T.E.,Johansen-Berg,H.,Woolrich,M.W.,Smith,S.M.,Wheeler-Kingshott,C.A.,Boulby,P.A.,...Matthews,P.M.(2003).Non-invasivemappingofconnectionsbetweenhumanthalamusandcortexusingdiffusionimaging.NatNeurosci,6(7),750-757.doi:10.1038/nn1075
nn1075[pii]
Bellec, P., Rosa-Neto, P., Lyttelton, O. C., Benali, H., & Evans, A. C. (2010). Multi-level bootstrapanalysis of stable clusters in resting-state fMRI. Neuroimage, 51(3), 1126-1139. doi:10.1016/j.neuroimage.2010.02.082
Bellman,R.E.(1961).Adaptivecontrolprocesses:aguidedtour(Vol.4):PrincetonUniversityPress.
Bengio, Y. (2014). Evolving culture versus localminimaGrowingAdaptiveMachines (pp. 109-138):Springer.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and newperspectives.PatternAnalysisandMachineIntelligence,IEEETransactionson,35(8),1798-1828.
Berk,R.,Brown,L.,Buja,A.,Zhang,K.,&Zhao,L.(2013).Validpost-selectioninference.TheAnnalsofStatistics,41(2),802-837.
Page 48
48
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526-536.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Heidelberg: Springer.
Bludau, S., Bzdok, D., Gruber, O., Kohn, N., Riedl, V., Sorg, C., ... Amunts, K. (2015). Medial Prefrontal Aberrations in Major Depressive Disorder Revealed by Cytoarchitectonically Informed Voxel-Based Morphometry. American Journal of Psychiatry.
Bortz, J. (2006). Statistik: Für Human- und Sozialwissenschaftler: Springer-Verlag.
Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791-799.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization: Cambridge University Press.
Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16(3), 199-231.
Breiman, L., & Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review / Revue Internationale de Statistique, 291-319.
Broca, P. (1865). Sur la faculté du langage articulé. Bulletins et Mémoires de la Société d'Anthropologie de Paris, 6, 377-393.
Brodersen, K. H. (2009). Decoding mental activity from neuroimaging data: the science behind mind-reading. The New Collection, Oxford, 4, 50-61.
Brodersen, K. H., Daunizeau, J., Mathys, C., Chumbley, J. R., Buhmann, J. M., & Stephan, K. E. (2013). Variational Bayesian mixed-effects inference for classification studies. Neuroimage, 76, 345-361. doi:10.1016/j.neuroimage.2013.03.008
Brodersen, K. H., Haiss, F., Ong, C. S., Jung, F., Tittgemeyer, M., Buhmann, J. M., ... Stephan, K. E. (2011). Model-based feature construction for multivariate decoding. Neuroimage, 56(2), 601-615. doi:10.1016/j.neuroimage.2010.04.036
Brodersen, K. H., Schofield, T. M., Leff, A. P., Ong, C. S., Lomakina, E. I., Buhmann, J. M., & Stephan, K. E. (2011). Generative embedding for model-based classification of fMRI data. PLoS Comput Biol, 7(6), e1002079. doi:10.1371/journal.pcbi.1002079
Brodmann, K. (1909). Vergleichende Lokalisationslehre der Großhirnrinde. Leipzig: Barth.
Bühlmann, P., & Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications: Springer Science & Business Media.
Burnham, K. P., & Anderson, D. R. (2014). P values are only an index to evidence: 20th- vs. 21st-century statistical science. Ecology, 95(3), 627-630.
Bzdok, D., Eickenberg, M., Grisel, O., Thirion, B., & Varoquaux, G. (2015). Semi-Supervised Factored Logistic Regression for High-Dimensional Neuroimaging Data. Paper presented at the Advances in Neural Information Processing Systems.
Bzdok, D., & Eickhoff, S. B. (2015). Statistical Learning of the Neurobiology of Schizophrenia. In T. Abel & T. Nickl-Jockschat (Eds.), The Neurobiology of Schizophrenia: Springer.
Chamberlin, T. C. (1890). The Method of Multiple Working Hypotheses. Science, 15(366), 92-96. doi:10.1126/science.148.3671.754
Chambers, J. M. (1993). Greater or lesser statistics: a choice for future research. Statistics and Computing, 3(4), 182-184.
Choi, Y., Taylor, J., & Tibshirani, R. (2014). Selecting the number of principal components: estimation of the true rank of a noisy matrix. arXiv preprint arXiv:1410.8260.
Chow, S. L. (1998). Precis of statistical significance: rationale, validity, and utility. Behav Brain Sci, 21(2), 169-194; discussion 194-239.
Chumbley, J. R., & Friston, K. J. (2009). False discovery rate revisited: FDR and topological inference using Gaussian random fields. Neuroimage, 44, 62-70.
Clark, W. G., Del Giudice, J., & Aghajanian, G. K. (1970). Principles of psychopharmacology: a textbook for physicians, medical students, and behavioral scientists: Academic Press Inc.
Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21-26.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.): Lawrence Erlbaum Associates, Inc.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155.
Cohen, J. (1994). The Earth Is Round (p < .05). American Psychologist, 49(12), 997-1003.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Cowles, M., & Davis, C. (1982). On the Origins of the .05 Level of Statistical Significance. American Psychologist, 37(5), 553-558.
Cox, D. D., & Dean, T. (2014). Neural networks and neuroscience-inspired computer vision. Curr Biol, 24(18), R921-929. doi:10.1016/j.cub.2014.08.026
Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62(2), 441-444.
Cumming, G. (2009). Inference by eye: reading the overlap of independent confidence intervals. Stat Med, 28(2), 205-220. doi:10.1002/sim.3471
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Paper presented at the Proceedings of the 23rd International Conference on Machine Learning.
de Brebisson, A., & Montana, G. (2015). Deep Neural Networks for Anatomical Brain Segmentation. arXiv preprint arXiv:1502.02445.
de-Wit, L., Alexander, D., Ekroll, V., & Wagemans, J. (2016). Is neuroimaging measuring information in the brain? Psychonomic Bulletin & Review, 1-14.
Deisseroth, K. (2015). Optogenetics: 10 years of microbial opsins in neuroscience. Nat Neurosci, 18(9), 1213-1225. doi:10.1038/nn.4091
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1-30.
Devor, A., Bandettini, P. A., Boas, D. A., Bower, J. M., Buxton, R. B., Cohen, L. B., ... Franceschini, M. A. (2013). The challenge of connecting the dots in the BRAIN. Neuron, 80(2), 270-274.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York, NY: Springer.
Domingos, P. (2012). A Few Useful Things to Know about Machine Learning. Communications of the ACM, 55(10), 78-87.
Donoho, D. (2015). 50 years of Data Science. Tukey Centennial workshop.
Efron, B. (1978). Controversies in the foundations of statistics. American Mathematical Monthly, 231-246.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1-26.
Efron, B. (2005). Modern science and the Bayesian-frequentist controversy: Division of Biostatistics, Stanford University.
Efron, B., & Hastie, T. (2016). Computer-Age Statistical Inference.
Efron, B., & Tibshirani, R. J. (1991). Statistical data analysis in the computer age. Science, 253(5018), 390-395. doi:10.1126/science.253.5018.390
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap: CRC Press.
Eickhoff, S., Turner, J. A., Nichols, T. E., & Van Horn, J. D. (2016). Sharing the wealth: Neuroimaging data repositories. NeuroImage, 124, 1065-1068.
Eickhoff, S. B., Bzdok, D., Laird, A. R., Roski, C., Caspers, S., Zilles, K., & Fox, P. T. (2011). Co-activation patterns distinguish cortical modules, their connectivity and functional differentiation. Neuroimage, 57(3), 938-949. doi:10.1016/j.neuroimage.2011.05.021
Eickhoff, S. B., Thirion, B., Varoquaux, G., & Bzdok, D. (2015). Connectivity-based parcellation: Critique and implications. Hum Brain Mapp. doi:10.1002/hbm.22933
Engemann, D. A., & Gramfort, A. (2015). Automated model selection in covariance estimation and spatial whitening of MEG and EEG signals. NeuroImage, 108, 328-342.
Estes, W. K. (1997). On the communication of information by displays of standard errors and confidence intervals. Psychonomic Bulletin & Review, 4(3), 330-341.
Everitt, B. S. (1979). Unresolved Problems in Cluster Analysis. Biometrics, 35(1), 169-181. doi:10.2307/2529943
Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional Psychology: Research and Practice, 40(5), 532.
Feyerabend, P. (1975). Against Method: Outline of an Anarchist Theory of Knowledge. London: New Left Books.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver and Boyd.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A., & Mackenzie, W. A. (1923). Studies in crop variation. II. The manurial response of different potato varieties. The Journal of Agricultural Science, 13(03), 311-320.
Fithian, W., Sun, D., & Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
Fleck, L., Schäfer, L., & Schnelle, T. (1935). Entstehung und Entwicklung einer wissenschaftlichen Tatsache: Schwabe Basel.
Fox, P. T., & Raichle, M. E. (1986). Focal physiological uncoupling of cerebral blood flow and oxidative metabolism during somatosensory stimulation in human subjects. Proc Natl Acad Sci U S A, 83, 1140-1144.
Frackowiak, R., & Markram, H. (2015). The future of human cerebral cartography: a novel approach. Philos Trans R Soc Lond B Biol Sci, 370(1668). doi:10.1098/rstb.2014.0171
Freedman, D. (1995). Some issues in the foundation of statistics. Foundations of Science, 1(1), 19-39.
Friedman, J. H. (1998). Data Mining and Statistics: What's the connection? Computing Science and Statistics, 29(1), 3-9.
Friedman, J. H. (2001). The role of statistics in the data revolution? International Statistical Review / Revue Internationale de Statistique, 5-10.
Friston, K. J. (2006). Statistical parametric mapping: The analysis of functional brain images. Amsterdam: Academic Press.
Friston, K. J. (2009). Modalities, modes, and models in functional neuroimaging. Science, 326(5951), 399-403. doi:10.1126/science.1174521
Friston, K. J. (2012). Ten ironic rules for non-statistical reviewers. Neuroimage, 61(4), 1300-1310. doi:10.1016/j.neuroimage.2012.04.018
Friston, K. J., Chu, C., Mourao-Miranda, J., Hulme, O., Rees, G., Penny, W., & Ashburner, J. (2008). Bayesian decoding of brain images. Neuroimage, 39(1), 181-205. doi:10.1016/j.neuroimage.2007.08.013
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J. P., Frith, C. D., & Frackowiak, R. S. (1994). Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp, 2(4), 189-210.
Friston, K. J., Liddle, P. F., Frith, C. D., Hirsch, S. R., & Frackowiak, R. S. J. (1992). The left medial temporal region and schizophrenia. Brain, 115, 367-382.
Friston, K. J., Price, C. J., Fletcher, P., Moore, C., Frackowiak, R. S. J., & Dolan, R. J. (1996). The trouble with cognitive subtraction. Neuroimage, 4(2), 97-104.
Gabrieli, J. D., Ghosh, S. S., & Whitfield-Gabrieli, S. (2015). Prediction as a humanitarian and pragmatic contribution from human cognitive neuroscience. Neuron, 85(1), 11-26. doi:10.1016/j.neuron.2014.10.047
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage, 15(4), 870-878.
Ghahramani, Z. (2004). Unsupervised learning. Advanced Lectures on Machine Learning (pp. 72-112): Springer.
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553), 452-459. doi:10.1038/nature14541
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. A handbook for data analysis in the behavioral sciences: Methodological issues, 311-339.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Giraud, C. (2014). Introduction to high-dimensional statistics: CRC Press.
Golland, P., & Fischl, B. (2003). Permutation tests for classification: towards statistical significance in image-based studies. Paper presented at the Information Processing in Medical Imaging.
Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130(12), 995-1004.
Gorgolewski, K. J., Varoquaux, G., Rivera, G., Schwarz, Y., Ghosh, S. S., Maumet, C., ... Margulies, D. S. (2014). NeuroVault.org: A web-based repository for collecting and sharing unthresholded statistical maps of the human brain. In press.
Grady, C. L., Haxby, J. V., Schapiro, M. B., Gonzalez-Aviles, A., Kumar, A., Ball, M. J., ... Rapoport, S. I. (1990). Subgroups in dementia of the Alzheimer type identified using positron emission tomography. J Neuropsychiatry Clin Neurosci, 2(4), 373-384. doi:10.1176/jnp.2.4.373
Gramfort, A., Strohmeier, D., Haueisen, J., Hamalainen, M., & Kowalski, M. (2011). Functional brain imaging with M/EEG using structured sparsity in time-frequency dictionaries. Paper presented at the Information Processing in Medical Imaging.
Gramfort, A., Thirion, B., & Varoquaux, G. (2013). Identifying predictive regions from fMRI with TV-L1 prior. Paper presented at the Pattern Recognition in Neuroimaging (PRNI), 2013 International Workshop on.
Greenwald, A. G. (2012). There is nothing so theoretical as a good method. Perspectives on Psychological Science, 7(2), 99-108.
Güçlü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. The Journal of Neuroscience, 35(27), 10005-10014.
Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
Halevy, A., Norvig, P., & Pereira, F. (2009). The Unreasonable Effectiveness of Data. Intelligent Systems, IEEE, 24(2), 8-12.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2), 107-145.
Hall, E. T. (1989). Beyond culture: Anchor.
Handl, J., Knowles, J., & Kell, D. B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15), 3201-3212. doi:10.1093/bioinformatics/bti517
Hanke, M., Halchenko, Y. O., & Oosterhof, N. N. (2015). PyMVPA Manual.
Hanson, S. J., & Halchenko, Y. O. (2008). Brain Reading Using Full Brain Support Vector Machines for Object Recognition: There Is No "Face" Identification Area. Neural Comput, 20, 486-503.
Harlow, J. M. (1848). Passage of an iron rod through the head. Boston Medical and Surgical Journal, 39(20), 389-393.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Heidelberg, Germany: Springer Series in Statistics.
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations: CRC Press.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539), 2425-2430.
Haynes, J.-D. (2015). A primer on pattern-based approaches to fMRI: Principles, pitfalls, and perspectives. Neuron, 87(2), 257-270.
Haynes, J. D., & Rees, G. (2005). Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nat Neurosci, 8(5), 686-691.
Haynes, J. D., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nat Rev Neurosci, 7(7), 523-534. doi:10.1038/nrn1931
Herndon, R. C., Lancaster, J. L., Toga, A. W., & Fox, P. T. (1996). Quantification of white matter and gray matter volumes from T1 parametric images using fuzzy classifiers. J Magn Reson Imaging, 6(3), 425-435.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "Wake-Sleep" algorithm for unsupervised neural networks. Science, 268, 1158-1161.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
House of Commons Science and Technology Committee. (2016). The big data dilemma. UK.
Hyman, S. E. (2007). Can neuroscience be integrated into the DSM-V? Nat Rev Neurosci, 8(9), 725-732. doi:10.1038/nrn2218
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112): Springer.
Jenatton, R., Audibert, J.-Y., & Bach, F. (2011). Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research, 12, 2777-2824.
Jordan, M. I. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical Report 9503.
Jordan, M. I. (2010). Bayesian nonparametric learning: Expressive priors for intelligent systems. Heuristics, Probability and Causality: A Tribute to Judea Pearl, 11, 167-185.
Jordan, M. I., Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, & National Research Council. (2013). Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nat Neurosci, 8(5), 679-685.
Kandel, E. R., Markram, H., Matthews, P. M., Yuste, R., & Koch, C. (2013). Neuroscience thinks big (and collaboratively). Nature Reviews Neuroscience, 14(9), 659-664.
Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17(2), 137.
King, J. R., & Dehaene, S. (2014). Characterizing the dynamics of mental representations: the temporal generalization method. Trends in Cognitive Sciences, 18(4), 203-210.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (ICML).
Knops, A., Thirion, B., Hubbard, E. M., Michel, V., & Dehaene, S. (2009). Recruitment of an area involved in eye movements during mental arithmetic. Science, 324(5934), 1583-1585. doi:10.1126/science.1171599
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Paper presented at the IJCAI.
Kriegeskorte, N. (2011). Pattern-information analysis: from stimulus decoding to computational-model testing. Neuroimage, 56(2), 411-421.
Kriegeskorte, N., Goebel, R., & Bandettini, P. (2006). Information-based functional brain mapping. Proc Natl Acad Sci U S A, 103(10), 3863-3868. doi:10.1073/pnas.0600244103
Kriegeskorte, N., Lindquist, M. A., Nichols, T. E., Poldrack, R. A., & Vul, E. (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. J Cereb Blood Flow Metab, 30(9), 1551-1557. doi:10.1038/jcbfm.2010.86
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S., & Baker, C. I. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nat Neurosci, 12(5), 535-540. doi:10.1038/nn.2303
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling: Springer.
Kurzweil, R. (2005). The singularity is near: When humans transcend biology: Penguin.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338. doi:10.1126/science.aab3050
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. doi:10.1038/nature14539
Lemm, S., Blankertz, B., Dickhaus, T., & Muller, K. R. (2011). Introduction to machine learning for brain imaging. Neuroimage, 56(2), 387-399. doi:10.1016/j.neuroimage.2010.11.004
Lieberman, M. D., Berkman, E. T., & Wager, T. D. (2009). Correlations in Social Neuroscience Aren't Voodoo: Commentary on Vul et al. Perspectives on Psychological Science, 4(3).
Lo, A., Chernoff, H., Zheng, T., & Lo, S. H. (2015). Why significant variables aren't automatically good predictors. Proc Natl Acad Sci U S A, 112(45), 13892-13897. doi:10.1073/pnas.1518285112
Loftus, J. R. (2015). Selective inference after cross-validation. arXiv preprint arXiv:1511.08866.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute.
Markram, H. (2006). The Blue Brain Project. Nat Rev Neurosci, 7, 153-160.
Markram, H. (2012). The human brain project. Scientific American, 306(6), 50-55.
Markus, K. A. (2001). The Converse Inequality Argument Against Tests of Statistical Significance. Psychol. Methods, 6, 147-160.
Marr, D. (1982). Vision. A computational investigation into the human representation and processing of visual information. San Francisco: W. H. Freeman and Company.
Mars, R. B., Sallet, J., Schuffelgen, U., Jbabdi, S., Toni, I., & Rushworth, M. F. (2012). Connectivity-Based Subdivisions of the Human Right "Temporoparietal Junction Area": Evidence for Different Areas Participating in Different Cortical Networks. Cereb Cortex, 22(8), 1894-1903. doi:10.1093/cercor/bhr268
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Michel, V., Gramfort, A., Varoquaux, G., Eger, E., & Thirion, B. (2011). Total variation regularization for fMRI-based prediction of behavior. Medical Imaging, IEEE Transactions on, 30(7), 1328-1340.
Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. Neuroimage, 53(1), 103-118. doi:10.1016/j.neuroimage.2010.05.051
Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M.-a., Morito, Y., Tanabe, H. C., ... Kamitani, Y. (2008). Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5), 915-929.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533. doi:10.1038/nature14236
Moeller, J. R., Strother, S. C., Sidtis, J. J., & Rottenberg, D. A. (1987). Scaled subprofile model: a statistical approach to the analysis of functional patterns in positron emission tomographic data. J Cereb Blood Flow Metab, 7(5), 649-658. doi:10.1038/jcbfm.1987.118
Moore, D. S., & McCabe, G. P. (1989). Introduction to the Practice of Statistics: W. H. Freeman / Times Books / Henry Holt & Co.
Mur, M., Bandettini, P. A., & Kriegeskorte, N. (2009). Revealing representational content with pattern-information fMRI: an introductory guide. Soc Cogn Affect Neurosci, 4(1), 101-109. doi:10.1093/scan/nsn044
Murphy, K. P. (2012). Machine learning: a probabilistic perspective: MIT Press.
Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. Neuroimage, 56(2), 400-410. doi:10.1016/j.neuroimage.2010.07.073
Naselaris, T., Prenger, R. J., Kay, K. N., Oliver, M., & Gallant, J. L. (2009). Bayesian reconstruction of natural images from human brain activity. Neuron, 63(6), 902-915.
Nelder, J., & Wedderburn, R. (1972). Generalized Linear Models. Journal of the Royal Statistical Society. Series A, 135(3), 370-384.
Nesterov, Y. (2004). Introductory lectures on convex optimization, vol. 87 of Applied Optimization: Kluwer Academic Publishers, Boston, MA.
Neyman, J., & Pearson, E. S. (1933). On the Problem of the most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A, 231, 289-337.
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14, 841.
Nichols, T. E. (2012). Multiple testing corrections, nonparametric methods, and random field theory. Neuroimage, 62(2), 811-815.
Nichols, T. E., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res, 12(5), 419-446.
Nichols, T. E., & Holmes, A. P. (2002). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Hum Brain Mapp, 15(1), 1-25.
Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods, 5(2), 241-301.
Norvig, P. (2011). On Chomsky and the two cultures of statistical learning. Author homepage.
Nuzzo, R. (2014). Scientific method: statistical errors. Nature, 506(7487), 150-152. doi:10.1038/506150a
Oakes, M. (1986). Statistical Inference: A commentary for the social and behavioral sciences. New York: Wiley.
Passingham, R. E., Stephan, K. E., & Kotter, R. (2002). The anatomical basis of functional localization in the cortex. Nat Rev Neurosci, 3(8), 606-616. doi:10.1038/nrn893
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96-146.
Pedregosa, F., Eickenberg, M., Ciuciu, P., Thirion, B., & Gramfort, A. (2015). Data-driven HRF estimation for encoding and decoding models. NeuroImage, 104, 209-220.
Penfield, W., & Perot, P. (1963). The brain's record of auditory and visual experience. Brain, 86(4), 595-696.
Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: a tutorial overview. Neuroimage, 45, 199-209. doi:10.1016/j.neuroimage.2008.11.007
Platt, J. R. (1964). Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science, 146(3642), 347-353. doi:10.1126/science.146.3642.347
Plis, S. M., Hjelm, D. R., Salakhutdinov, R., Allen, E. A., Bockholt, H. J., Long, J. D., ... Calhoun, V. D. (2014). Deep learning for neuroimaging: a validation study. Frontiers in Neuroscience, 8.
Poldrack, R. A. (2006). Can cognitive processes be inferred from neuroimaging data? Trends Cogn Sci, 10(2), 59-63. doi:10.1016/j.tics.2005.12.004
Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: data sharing in neuroimaging. Nat Neurosci, 17(11), 1510-1517. doi:10.1038/nn.3818
Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychol. Bull., 102, 159-163.
Popper, K. (1935/2005). Logik der Forschung (11th ed.). Tübingen: Mohr Siebeck.
Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276.
Russell, S. J., & Norvig, P. (2002). Artificial intelligence: a modern approach (International Edition).
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210-229.
Sandberg, A., & Bostrom, N. (2008). Whole brain emulation.
Saygin, Z. M., Osher, D. E., Koldewyn, K., Reynolds, G., Gabrieli, J. D., & Saxe, R. R. (2012). Anatomical connectivity patterns predict face selectivity in the fusiform gyrus. Nat Neurosci, 15(2), 321-327. doi:10.1038/nn.3001
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115.
Schwartz, Y., Thirion, B., & Varoquaux, G. (2013). Mapping paradigm ontologies to and from the brain. Paper presented at the Advances in Neural Information Processing Systems.
Shafer, G. (1992). What is probability? Perspectives in Contemporary Statistics, 19-39.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486-494.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 289-310.
Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., ... Hadjadj, S. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445(7130), 881-885.
Smith, S. M., Beckmann, C. F., Andersson, J., Auerbach, E. J., Bijsterbosch, J., Douaud, G., ... Consortium, W. U.-M. H. (2013). Resting-state fMRI in the Human Connectome Project. Neuroimage, 80, 144-168. doi:10.1016/j.neuroimage.2013.05.039
Smith, S. M., Matthews, P. M., & Jezzard, P. (2001). Functional MRI: an introduction to methods: Oxford University Press.
Smith, S. M., & Nichols, T. E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44(1), 83-98.
Stark, C. E., & Squire, L. R. (2001). When zero is not zero: the problem of ambiguous baseline conditions in fMRI. Proc Natl Acad Sci U S A, 98, 12760-12766.
Stephan, K. E. (2004). On the role of general system theory for functional neuroimaging. J Anat, 205(6), 443-470. doi:10.1111/j.0021-8782.2004.00359.x
Stephan, K. E., Binder, E. B., Breakspear, M., Dayan, P., Johnstone, E. C., Meyer-Lindenberg, A., ... Fletcher, P. C. (2015). Charting the landscape of priority problems in psychiatry, part 2: pathogenesis and aetiology. The Lancet Psychiatry.
Stephan, K. E., Friston, K. J., & Frith, C. D. (2009). Dysconnection in schizophrenia: from abnormal synaptic plasticity to failures of self-monitoring. Schizophr Bull, 35(3), 509-527. doi:10.1093/schbul/sbn176
Taylor, J., Lockhart, R., Tibshirani, R. J., & Tibshirani, R. (2014). Exact post-selection inference for forward stepwise and least angle regression. arXiv preprint arXiv:1401.3889.
Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and selective inference. Proc Natl Acad Sci U S A, 112(25), 7629-7634. doi:10.1073/pnas.1507583112
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279-1285.
Thirion, B., Duchesnay, E., Hubbard, E., Dubois, J., Poline, J. B., Lebihan, D., & Dehaene, S. (2006). Inverse retinotopy: inferring the visual content of images from brain activation patterns. Neuroimage, 33(4), 1104-1116. doi:10.1016/j.neuroimage.2006.06.062
Thirion, B., Varoquaux, G., Dohmatob, E., & Poline, J. B. (2014). Which fMRI clustering gives good brain parcellations? Front Neurosci, 8, 167. doi:10.3389/fnins.2014.00167
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.
Tukey, J. W. (1962). The future of data analysis. Annals of Mathematical Statistics, 33, 1-67.
Tukey, J. W. (1965). An approach to thinking about a statistical computing system (unpublished notes circulated at Bell Laboratories).
Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E., Bucholz, R., ... Consortium, W. U.-M. H. (2012). The Human Connectome Project: a data acquisition perspective. Neuroimage, 62(4), 2222-2231. doi:10.1016/j.neuroimage.2012.02.018
Van Horn, J. D., & Toga, A. W. (2014). Human neuroimaging as a "Big Data" science. Brain Imaging Behav, 8(2), 323-331. doi:10.1007/s11682-013-9255-y
Vanderplas, J. (2013). The Big Data Brain Drain: Why Science is in Trouble. Blog "Pythonic Perambulations".
Vapnik, V. N. (1989). Statistical Learning Theory. New York: Wiley-Interscience.
Vapnik, V. N. (1996). The nature of statistical learning theory. New York: Springer.
Varoquaux, G., & Thirion, B. (2014). How machine learning is shaping cognitive neuroimaging. GigaScience, 3(1), 28.
Vogelstein, J. T., Park, Y., Ohyama, T., Kerr, R. A., Truman, J. W., Priebe, C. E., & Zlatic, M. (2014). Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science, 344(6182), 386-392. doi:10.1126/science.1250298
Vogt, C., & Vogt, O. (1919). Allgemeinere Ergebnisse unserer Hirnforschung. Journal für Psychologie und Neurologie, 25, 279-461.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2008). Voodoo Correlations in Social Neuroscience. Psychological Science.
Wagenmakers, E.-J., Lee, M., Lodewyckx, T., & Iverson, G. J. (2008). Bayesian versus frequentist inference. Bayesian Evaluation of Informative Hypotheses (pp. 181-207): Springer.
Wainwright, M. J. (2014). Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annu. Rev. Stat. Appl, 1, 233-253.
Wasserman, L. (2013). All of statistics: a concise course in statistical inference: Springer Science & Business Media.
Wasserman, L., & Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37(5A), 2178.
Wernicke, C. (1881). Die akute haemorrhagische Polioencephalitis superior. Lehrbuch der Gehirnkrankheiten für Aerzte und Studirende, Bd II, 2, 229-242.
Wigner, E. P. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13, 1-14.
Wolpert, D. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8, 1341-1390.
Worsley, K. J., Evans, A. C., Marrett, S., & Neelin, P. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow and Metabolism, 12, 900-900.
Worsley, K. J., Poline, J.-B., Friston, K. J., & Evans, A. C. (1997). Characterizing the response of PET and fMRI data using multivariate linear models. NeuroImage, 6(4), 305-319.
Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., & Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6), 714-721.
Yamins, D. L., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci, 19(3), 356-365. doi:10.1038/nn.4244
Yarkoni, T., & Braver, T. S. (2010). Cognitive neuroscience approaches to individual differences in working memory and executive control: conceptual and methodological issues. Handbook of Individual Differences in Cognition (pp. 87-107): Springer.
Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nat Methods, 8(8), 665-670. doi:10.1038/nmeth.1635
Yarkoni, T., & Westfall, J. (2016). Choosing prediction over explanation in psychology: Lessons from machine learning.
Yeo, B. T., Krienen, F. M., Sepulcre, J., Sabuncu, M. R., Lashkari, D., Hollinshead, M., ... Buckner, R. L. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J Neurophysiol, 106(3), 1125-1165. doi:10.1152/jn.00338.2011
Yuste, R. (2015). From the neuron doctrine to neural networks. Nature Reviews Neuroscience, 16(8), 487-497.
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286.
zu Eulenburg, P., Caspers, S., Roski, C., & Eickhoff, S. B. (2012). Meta-analytical definition and functional connectivity of the human vestibular cortex. Neuroimage, 60(1), 162-169. doi:10.1016/j.neuroimage.2011.12.032