Abstract words: 100 (short) / 127 (long)
Main text words: 13,730
References words: 5,264
Entire text words: 19,438
Classical Statistics and Statistical Learning in Imaging Neuroscience

Danilo Bzdok 1,2,3

1 Department of Psychiatry, Psychotherapy and Psychosomatics, Medical Faculty, RWTH Aachen, Germany
2 JARA, Translational Brain Medicine, Aachen, Germany
3 Parietal team, INRIA, Neurospin, bat 145, CEA Saclay, 91191 Gif-sur-Yvette, France

Prof. Dr. Dr. Danilo Bzdok
Department for Psychiatry, Psychotherapy and Psychosomatics
Pauwelsstraße 30
52074 Aachen, Germany
mail: danilo[DOT]bzdok[AT]rwth-aachen[DOT]de

Citable as:
"Bzdok D. Classical Statistics and Statistical Learning in Imaging Neuroscience (2016). arXiv preprint arXiv:1603.01857."
Short Abstract: Neuroimaging research has predominantly drawn conclusions based on classical statistics. Recently, statistical learning methods have enjoyed increasing popularity. These methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum, but are based on different histories, theories, assumptions, and outcome metrics, and thus permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning with regard to neuroimaging research. The conceptual implications are illustrated in three case studies. The paper thus aims to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.
Long Abstract: Neuroimaging research has predominantly drawn conclusions based on classical statistics, including null-hypothesis testing, t-tests, and ANOVA. Throughout recent years, statistical learning methods have enjoyed increasing popularity, including cross-validation, pattern classification, and sparsity-inducing regression. These two methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum. Yet, they originated from different historical contexts, build on different theories, rest on different assumptions, evaluate different outcome metrics, and permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning in their relation to neuroimaging research. The conceptual implications are illustrated in three common analysis scenarios. The paper thus aims to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.
Keywords: brain imaging, epistemology, cross-validation, hypothesis testing, machine learning, pattern recognition, p-value, prediction
Main Text

"The trick to being a scientist is to be open to using a wide variety of tools."
Leo Breiman (2001)

1. Introduction
Among the greatest challenges humans face are cultural misunderstandings between individuals, groups, and institutions (Hall, 1989). The topic of the present paper is the culture clash between statistical inference by null-hypothesis rejection and out-of-sample generalization (L. Breiman, 2001; Friedman, 1998; Shmueli, 2010), which are increasingly combined in the brain-imaging domain (N. Kriegeskorte, Simmons, Bellgowan, & Baker, 2009). The ensuing inter-cultural misunderstandings are unfortunate, because the invention and application of new research methods have always been a driving force in the neurosciences (Deisseroth, 2015; Greenwald, 2012; Yuste, 2015). It is the goal of the present paper to disentangle classical inference and generalization inference by juxtaposing their historical trajectories (section 2), modelling philosophies (section 3), conceptual frameworks (section 4), and performance metrics (section 5).
During the past 15 years, neuroscientists have transitioned from exclusively qualitative reports of few patients with neurological brain lesions to quantitative lesion-symptom mapping at the voxel level in hundreds of patients (Bates et al., 2003). We have gone from manually staining and microscopically inspecting single brain slices to 3D models of neuroanatomy at micrometer scale (Amunts et al., 2013). We have also gone from individual experimental studies to the increasing possibility of automatized knowledge aggregation across thousands of previously isolated neuroimaging findings (T. Yarkoni, Poldrack, Nichols, Van Essen, & Wager, 2011). Rather than laboriously collecting and publishing in-house data in a single paper, investigators are now routinely reanalyzing multi-modal data repositories managed by national, continental, and inter-continental consortia (Kandel, Markram, Matthews, Yuste, & Koch, 2013; Henry Markram, 2012; Poldrack & Gorgolewski, 2014; Van Essen et al., 2012). The granularity of neuroimaging datasets is hence growing in terms of scanning resolution, sample size, and complexity of meta-information (S. Eickhoff, Turner, Nichols, & Van Horn, 2016; Van Horn & Toga, 2014). As an important consequence, the scope of neuroimaging analyses has expanded from the predominance of null-hypothesis testing to statistical-learning methods that are i) more data-driven through flexible models, ii) naturally scalable to high-dimensional data, and iii) more heuristic through increased reliance on numerical optimization (Jordan & Mitchell, 2015; LeCun, Bengio, & Hinton, 2015). Statistical learning (T. Hastie, Tibshirani, & Friedman, 2001) henceforth serves as the umbrella term for "machine learning", "data mining", "pattern recognition", "knowledge discovery", and "high-dimensional statistics".
In fact, the very notion of statistical inference may be subject to expansion. The following definition is drawn from a committee report to the National Academies of the USA (Jordan et al., 2013, p. 8): "Inference is the problem of turning data into knowledge, where knowledge often is expressed in terms of variables [...] that are not present in the data per se, but are present in models that one uses to interpret the data." According to this authoritative definition, statistical inference can be understood as encompassing not only classical null-hypothesis falsification but also out-of-sample generalization (cf. Jacob Cohen, 1990; G. Gigerenzer & Murray, 1987). Classical statistics and statistical learning might give rise to different categories of inference, which remains an inherently difficult concept to define (Chamberlin, 1890; Pearl, 2009; Platt, 1964). Any choice of statistical method for a neurobiological investigation predetermines the spectrum of possible results and permissible conclusions.
Briefly taking an epistemological perspective, a new scientific fact is probably not established in vacuo (Fleck, Schäfer, & Schnelle, 1935; italic terms in this passage taken from source). Rather, the object is recognized and accepted by the subject according to socially conditioned thought styles cultivated among members of thought collectives. A witnessed and measured neurobiological phenomenon tends to only become "true" if not at odds with the constructed thought history and closed opinion system shared by that subject. In the following, two such thought milieus will be revisited and reintegrated in the context of imaging neuroscience: classical statistics (CS) and statistical learning (SL).
2. Different histories

2.1 The origins of classical hypothesis testing and learning algorithms

The largely independent historical trajectories of the two statistical families are evident even from their most basic terminology. Inputs to statistical models are usually called independent variables or predictors in the CS literature, but are called features collected in a feature space in the SL literature. The outputs are called dependent variables or responses in CS and target variables in SL, respectively.
Around 1900 the notions of standard deviation, goodness of fit, and the "p < 0.05" threshold emerged (Cowles & Davis, 1982). This was also the period when William S. Gosset published the t-test under the pseudonym "Student" to quantify production quality in Guinness breweries. Motivated by concrete problems such as the interaction between potato varieties and fertilizers, Ronald A. Fisher invented the analysis of variance (ANOVA) and null-hypothesis testing, promoted p values, and devised principles of proper experimental conduct (R. A. Fisher, 1925; Ronald A. Fisher, 1935; Ronald A. Fisher & Mackenzie, 1923). An alternative framework for hypothesis testing was proposed by Jerzy Neyman and Egon S. Pearson, which introduced the statistical notions of power, false positives, and false negatives, but left out the concept of p values (Neyman & Pearson, 1933). This was a time when controlled experiments were preferentially performed on single individuals, before the gradual transition to participant groups in the 30s and 40s, and before electrical calculators emerged after World War II (Efron & Tibshirani, 1991; Gerd Gigerenzer, 1993). Student's t-test and Fisher's inference framework were institutionalized by American psychology textbooks that were widely read in the 40s and 50s. Neyman and Pearson's approach only became increasingly known in the 50s and 60s. This led authors of social science textbooks to promote a somewhat incoherent mixture of the Fisher and Neyman-Pearson approaches to statistical inference, typically without explicit mention. It is this conglomerate of classical frameworks for performing inference that today's textbooks of applied statistics have inherited (Bortz, 2006; Moore & McCabe, 1989).
It is a topic of current debate1,2,3 whether CS is a discipline that is separate from SL (e.g., L. Breiman, 2001; Chambers, 1993; Friedman, 2001) or whether statistics is a broader class that includes CS and SL as its members (e.g., Cleveland, 2001; Jordan & Mitchell, 2015; Tukey, 1962). SL methods are frequently adopted by computer scientists, physicists, engineers, and others who have no formal statistical background and are typically working in industry rather than academia. In fact, John W. Tukey foresaw many of the developments that led up to what one might today call statistical learning (Tukey, 1962, 1965). He proposed a "peaceful collision of computing and statistics" as well as the distinction between "exploratory" and "confirmatory" data analysis4. This emphasized data-driven analysis techniques as a toolbox useful in a large variety of real-world settings to gain an intuition of the data properties. Kernel methods, neural networks, decision trees, nearest neighbors, and graphical models all actually originated in the CS community, but mostly continued to develop in the SL community (Friedman, 2001). Among the often-cited beginnings of self-learning algorithms, the perceptron was an early brain-inspired computing algorithm (Rosenblatt, 1958), and Arthur Samuel created a checkers-playing program that succeeded in beating its own creator (Samuel, 1959). Such studies on artificial intelligence (AI) led to enthusiastic optimism and subsequent disappointment due to the slow progress of learning algorithms. The consequence was a slow-down of research, funding, and interest during the so-called "AI winters" in the late 70s and around the 90s (D. D. Cox & Dean, 2014; Kurzweil, 2005; Russell & Norvig, 2002), while the increasingly available computers in the 80s encouraged a new wave of statistical algorithms (Efron & Tibshirani, 1991). The difficult-to-train but back then widely used neural network algorithms were superseded by support vector machines with convincing out-of-the-box performance (Cortes & Vapnik, 1995). Later, the use of SL methods increased steadily in many quantitative scientific domains as they underwent an increase in information granularity from classical "long data" (samples n > variables p) to modern "wide data" (n < p) (Tibshirani, 1996). The emerging field of SL was conceptually consolidated in large part by the seminal book "The Elements of Statistical Learning" (T. Hastie et al., 2001). The coincidence of changing data properties, increasing computational power, and cheaper memory resources encouraged a resurgence in SL research and applications from approximately 2000 onwards (House of Commons, 2016; Manyika et al., 2011). Over the last 15 years, sparsity assumptions gained increasing relevance for statistical tractability and domain interpretability when using supervised and unsupervised learning algorithms (i.e., with and without target variables) by imposing a prior distribution on the model parameters (Bach, Jenatton, Mairal, & Obozinski, 2012). According to the "bet on sparsity" (Trevor Hastie, Tibshirani, & Wainwright, 2015), only a subset of the features should be expected to be relevant, because no existing statistical method performs well in the dense high-dimensional scenario that assumes all features to be relevant in the "true" model (Brodersen, Haiss, et al., 2011; T. Hastie et al., 2001). This enabled reproducible and interpretable statistical relationships in the high-dimensional "n << p" regime (Bühlmann & Van De Geer, 2011; Trevor Hastie et al., 2015). More recently, improvements in training very "deep" (i.e., many non-linear hidden layers) neural network architectures (Geoffrey E. Hinton & Salakhutdinov, 2006) have much improved automatized feature selection (Bengio, Courville, & Vincent, 2013) and have exceeded human-level performance in several tasks (LeCun et al., 2015). For instance, one recent deep reinforcement learning algorithm mastered playing 49 different computer games based on simple pixel input alone (Mnih et al., 2015). Today, systematic education in SL is still rare at most universities, in contrast to the omnipresence of CS courses (Burnham & Anderson, 2014; Cleveland, 2001; Donoho, 2015; Vanderplas, 2013).

1 "Data Science and Statistics: different worlds?" (Panel at Royal Statistical Society UK, March 2015) (https://www.youtube.com/watch?v=C1zMUjHOLr4)
2 "50 years of Data Science" (David Donoho, Tukey Centennial workshop, USA, Sept. 2015)
3 "Are ML and Statistics Complementary?" (Max Welling, 6th IMS-ISBA meeting, December 2015)
4 As a very recent reformulation of the same idea: "If the inference/algorithm race is a tortoise-and-hare affair, then modern electronic computation has bred a bionic hare." (Efron & Hastie, 2016)
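The "bet on sparsity" in the wide-data n < p regime can be made concrete in a small simulation. The following sketch is a hypothetical illustration, not code from the paper; the simulated data, the choice of scikit-learn's Lasso, and the alpha penalty value are all assumptions of this example:

```python
# Sketch of the "bet on sparsity": in a wide dataset (n < p), only a few
# of many candidate features carry signal, and an L1 penalty recovers them.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                                # fewer samples than features
X = rng.standard_normal((n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]   # only 5 relevant features
y = X @ true_coef + 0.1 * rng.standard_normal(n)

# The sparsity-inducing L1 penalty shrinks most coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
print(n_selected)                             # a small subset of the 200 features
```

With a dense estimator such as ordinary least squares, the 200 coefficients of this under-determined system would not be uniquely identifiable; the sparsity penalty makes the problem tractable and the surviving coefficients interpretable.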
2.2 Related spotlights in the history of neuroimaging analysis methods
For more than a century, neuroscientific conclusions were mainly drawn from brain lesion reports (Broca, 1865; Harlow, 1848; Wernicke, 1881), microscopical inspection (Brodmann, 1909; Vogt & Vogt, 1919), brain stimulation during surgery (Penfield & Perot, 1963), and pharmacological intervention (Clark, Del Giudice, & Aghajanian, 1970), often without strong reliance on statistical methodology. The advent of more readily quantifiable neuroimaging methods (Fox & Raichle, 1986) then allowed for in-vivo characterization of the neural correlates underlying sensory, cognitive, or affective tasks. Ever since, topographical localization of neural activity increases and decreases has been dominated by analysis approaches from CS, especially the general linear model (GLM; dating back to Nelder & Wedderburn, 1972). Although the GLM is well known not to correspond to neurobiological reality, it has provided good and interpretable approximations. It was and still is routinely used in a mass-univariate regime that computes simultaneous univariate statistics for each independent voxel observation of brain scans (Friston et al., 1994). This involves fitting beta coefficients corresponding to the columns of a design matrix (i.e., prespecified stimulus/task/behavior indicators, the independent variables) to a single voxel's imaging time series of measured neural activity changes (i.e., the dependent variable) to obtain a beta coefficient for each indicator. It is seldom mentioned that the GLM would not have yielded unique solutions in the high-dimensional regime, because the number of input variables p would have exceeded by far the number of samples n (i.e., an under-determined system of equations), which incapacitates many statistical estimators from CS (cf. Giraud, 2014; Trevor Hastie et al., 2015). Regularization by sparsity-inducing norms, such as in modern regression analysis via Lasso and Elastic Net (cf. Trevor Hastie et al., 2015; Jenatton, Audibert, & Bach, 2011), emerged only later as a principled way to de-escalate the need for dimensionality reduction and to enable the tractability of the high-dimensional "p > n" case (Tibshirani, 1996). Many software packages for neuroimaging analysis consequently implemented discrete voxel-wise analyses with classical inference. The ensuing multiple comparisons problem motivated more than two decades of methodological research (Friston, 2006; Thomas E. Nichols, 2012; Stephen M. Smith, Matthews, & Jezzard, 2001; Worsley, Evans, Marrett, & Neelin, 1992). It was initially addressed by reporting uncorrected (Vul, Harris, Winkielman, & Pashler, 2008) or Bonferroni-corrected findings (Thomas E. Nichols, 2012), then increasingly by false discovery rate (Genovese, Lazar, & Nichols, 2002) and cluster-level thresholding (Stephen M. Smith & Nichols, 2009). Further, it was acknowledged early on that the unit of interest should be spatially neighboring voxel groups and that the null hypothesis needed to account for all voxels exhibiting some signal (Chumbley & Friston, 2009). These concerns were addressed by inference in locally smooth neighborhoods based on random field theory, which models discrete voxel activations as topological units with continuous activation height and extent (Worsley et al., 1992). That is, the spatial dependencies of voxel observations were not incorporated into the GLM estimation step, but instead during the subsequent model inference step to alleviate the multiple comparisons problem.
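The mass-univariate GLM regime can be sketched in a few lines on simulated data. This is a hypothetical toy illustration, not the pipeline of any neuroimaging package; the block design, the data, and the use of plain least squares via NumPy are assumptions of the example:

```python
# Minimal sketch of a mass-univariate GLM: one and the same design matrix
# is regressed against every voxel's time series independently.
import numpy as np

rng = np.random.default_rng(42)
n_scans, n_voxels = 120, 1000
# Design matrix: intercept plus one on/off task indicator (block design)
task = np.tile(np.r_[np.zeros(10), np.ones(10)], 6)
X = np.column_stack([np.ones(n_scans), task])        # shape (120, 2)

Y = rng.standard_normal((n_scans, n_voxels))         # noise time series
Y[:, :50] += 2.0 * task[:, None]                     # 50 task-responsive voxels

# One independent least-squares fit per voxel, computed for all voxels at
# once; betas[1, v] is the estimated task effect in voxel v
betas, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(betas.shape)                                   # (2, 1000)
```

Real statistical parametric mapping adds crucial steps omitted here, notably hemodynamic convolution, noise modelling, and the multiple-comparisons corrections required when 1000 such tests are run simultaneously.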
To abandon the voxel-independence assumption of the mass-univariate GLM, SL models were proposed early on for neuroimaging investigations. For instance, principal component analysis was used to distinguish local and global neural activity changes (Moeller, Strother, Sidtis, & Rottenberg, 1987) as well as to study Alzheimer's disease (Grady et al., 1990), while canonical correlation analysis yielded complex relationships between task-free neural activity and schizophrenia symptoms (Friston, Liddle, Frith, Hirsch, & Frackowiak, 1992). Note that these first approaches to "multivariate" brain-behavior associations did not ignite a major research trend (cf. Friston et al., 2008; Worsley, Poline, Friston, & Evans, 1997). Supervised classification estimators were also used in early structural neuroimaging analyses (Herndon, Lancaster, Toga, & Fox, 1996) and they improved preprocessing performance for volumetric neuroimaging data (Ashburner & Friston, 2005). However, the popularity of SL methods only peaked after being rebranded as "mind-reading", "brain decoding", and "multivariate pattern analysis", which appealed by identifying ongoing thought from neural activity alone (J. D. Haynes & Rees, 2005; Kamitani & Tong, 2005). Up to that point, the term prediction had less often been used in the current sense of out-of-sample generalization of a learning model and more often in the incompatible sense of in-sample linear correlation between (time-free or time-shifted) data (Gabrieli, Ghosh, & Whitfield-Gabrieli, 2015; Shmueli, 2010). The "searchlight" approach for "pattern-information analysis" subsequently enabled whole-brain assessment of local neighborhoods of predictive patterns of neural activity fluctuations (N. Kriegeskorte, Goebel, & Bandettini, 2006). The position of such "decoding" models within the branches of statistics has seldom been made explicit (but see Friston et al., 2008). The growing interest was manifested in the first review and tutorial papers published on applying SL methods to neuroimaging data (J. D. Haynes & Rees, 2006; Mur, Bandettini, & Kriegeskorte, 2009; Pereira, Mitchell, & Botvinick, 2009).
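A minimal decoding analysis in the sense sketched above might look as follows. The simulated "trials × voxels" data, the classifier choice, and all parameter values are illustrative assumptions, not recommendations from the paper:

```python
# Sketch of a decoding analysis: a linear classifier is trained on part of
# the data and its out-of-sample accuracy is estimated by cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(7)
n_trials, n_voxels = 100, 500
y = np.repeat([0, 1], n_trials // 2)           # two stimulus conditions
X = rng.standard_normal((n_trials, n_voxels))  # simulated trial-wise patterns
X[y == 1, :20] += 0.8                          # weak signal in 20 "voxels"

clf = SVC(kernel="linear")                     # linear classifier
scores = cross_val_score(clf, X, y, cv=5)      # accuracy on held-out folds
print(scores.mean())                           # out-of-sample estimate
```

The cross-validated accuracy, not the fit on the training trials, is the quantity reported in decoding studies, since only held-out data speak to out-of-sample generalization.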
The conceptual appeal of this new access to the neural correlates of cognition and behavior was flanked by the availability of the necessary computing power and memory resources. This was also a precondition for regularization by structured sparsity penalties (Wainwright, 2014) that incorporate neurobiological priors such as local spatial dependence (Gramfort, Thirion, & Varoquaux, 2013; Michel, Gramfort, Varoquaux, Eger, & Thirion, 2011) or spatial-temporal dependence (Gramfort, Strohmeier, Haueisen, Hamalainen, & Kowalski, 2011). Although challenging, "deep" neural networks have recently been introduced to neuroimaging (de Brebisson & Montana, 2015; Güçlü & van Gerven, 2015; Plis et al., 2014). The application of these "deep" statistical architectures occurs in an atheoretical, more empirically justified setting, as their mathematical properties are incompletely understood and many formal guarantees are missing (but see Bach, 2014). Nevertheless, they might help in deciphering and approximating the nature of neural processing in the brain (D. D. Cox & Dean, 2014; Yamins & DiCarlo, 2016). This agenda appears closely related to David Marr's distinction of information processing into computational (~what?), algorithmic-representational (~how?), and implementational-physical (~where?) levels (Marr, 1982). Last but not least, there is ever-growing interest in and pressure for data sharing, open access, and building "big-data" repositories in neuroscience (Devor et al., 2013; Gorgolewski et al., 2014; Kandel et al., 2013; Henry Markram, 2012; Poldrack & Gorgolewski, 2014; Van Essen et al., 2012). As the dimensionality and complexity of neuroimaging datasets increase, neuroscientific investigations will probably benefit increasingly from SL methods and their variants adapted to the data-intense regime (e.g., Engemann & Gramfort, 2015; Kleiner, Talwalkar, Sarkar, & Jordan, 2012; Zou, Hastie, & Tibshirani, 2006). While larger data quantities allow detection of more subtle effects, false positive findings are likely to become a major issue as they unavoidably arise from statistical estimation in the high-dimensional data scenario (Jordan et al., 2013; Meehl, 1967).
3. Different philosophies

3.1 Two different modelling goals in statistical methods

One of the possible ways to catalogue statistical methods is by framing them along the lines of classical statistics and statistical learning. Statistical methods can thus be conceptualized as spanning a continuum between the two poles of CS and SL (Jordan et al., 2013; p. 61). While some statistical methods cannot easily be categorized by this distinction, the two families of statistical methods can generally be distinguished by a number of representative properties.
As the precise relationship between CS and SL has seldom been explicitly characterized in mathematical terms, the author must resort to more descriptive explanations (see already Efron, 1978). One of the key differences becomes apparent when thinking of the neurobiological phenomenon under study as a black box (L. Breiman, 2001). CS typically aims at modelling the black box by making a set of accurate assumptions about its content, such as the nature of the signal distribution. Gaussian distributional assumptions have been very useful in many instances to enhance mathematical convenience and, hence, computational tractability. SL typically aims at finding any way to model the output of the black box from its input while making the fewest assumptions possible (Abu-Mostafa, Magdon-Ismail, & Lin, 2012). In CS the stochastic processes that generated the data are therefore treated as partly known, whereas in SL the phenomenon is treated as complex, largely unknown, and partly unknowable. In this sense, CS tends to be more analytical by imposing mathematical rigor on the phenomenon, whereas SL tends to be more heuristic by finding useful approximations to the phenomenon. CS specifies the properties of a given statistical model at the beginning of the investigation, whereas in SL there is a bigger emphasis on models whose parameters and at times even structures (e.g., learning algorithms creating decision trees) are generated during the statistical estimation. The SL-minded investigator may favor simple, tractable models even when making knowingly false assumptions (Domingos, 2012), because sufficiently large data quantities are expected to remedy them (Devroye, Györfi, & Lugosi, 1996; Halevy, Norvig, & Pereira, 2009). A new function with potentially thousands of parameters is created that can predict the output from the input alone, without explicit programming. This requires the input features to represent different variants of all relevant configurations of the examined phenomenon in nature. In CS the mathematical assumptions are typically stated explicitly, while SL models frequently have implicit assumptions that may be less openly discussed. Intuitively, the truth is believed to be in the model (cf. Wigner, 1960) in a CS-constrained statistical regime, while it is believed to be in the data (cf. Halevy et al., 2009) in an SL-constrained statistical regime. Also differing in their output, CS typically yields point and interval estimates (e.g., p values, variances, confidence intervals), whereas SL frequently outputs functions (e.g., the k-means centroids or a trained classifier's decision function can be applied to new data). As another tendency, SL revolves around solving iterative numerical optimization problems and introducing prior knowledge into them, whereas CS methods are more often closed-form one-shot computations without any successive approximation process, although several models are also fitted numerically by maximum likelihood estimation (Boyd & Vandenberghe, 2004; Jordan et al., 2013).
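The contrast between point/interval estimates and fitted functions as typical outputs can be illustrated with a toy example. The data are simulated and the specific test and estimator are illustrative choices of this sketch, not prescriptions from the paper:

```python
# Toy contrast of typical outputs: a classical two-sample t-test yields
# point estimates (t, p), while a trained classifier yields a reusable
# function that can be applied to new observations.
import numpy as np
from scipy import stats
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=100)
group_b = rng.normal(0.8, 1.0, size=100)

# CS-style output: two numbers summarizing evidence against the null
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# SL-style output: a fitted decision function usable on unseen data
X = np.r_[group_a, group_b].reshape(-1, 1)
y = np.r_[np.zeros(100), np.ones(100)]
model = KNeighborsClassifier(n_neighbors=15).fit(X, y)
new_prediction = model.predict([[2.0]])   # label for an unseen data point
```

The t-test ends with its summary numbers, whereas the classifier persists as a function whose worth is judged by applying it to observations it has never seen.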
In more formal terms, CS relates closely to statistics for confirmatory data analysis, whereas SL relates more to statistics for exploratory data analysis (Tukey, 1962). In practice, CS is probably more often applied to experimental data, where a set of target variables is systematically controlled by the investigator and the system under study has been subject to structured perturbation. Instead, SL is perhaps more typically applied to observational data without such structured influence, where the studied system has been left unperturbed (Domingos, 2012). From yet another important angle, one should not conflate so-called explanatory modelling and predictive modelling (Shmueli, 2010). CS mainly performs retrospective explanatory modelling, with emphasis on the operationalization of preselected hypothetical constructs as measurable outcomes using frequently linear, interpretable models. SL mainly performs prospective predictive modelling and quantifies the generalization to future observations, with or without formal incorporation of hypothetical concepts, using models that are more frequently non-linear and challenging to impossible to interpret. There is the often-overlooked misconception that models with high explanatory power necessarily exhibit high predictive power (Lo, Chernoff, Zheng, & Lo, 2015; Wu, Chen, Hastie, Sobel, & Lange, 2009). An important outcome measure in CS is the quantified significance associated with a statistical relationship between few variables given a pre-specified model. The outcome measure for SL is the quantified generalizability or robustness of patterns between many variables or, more generally, the robustness of special structure in the data (T. Hastie et al., 2001). CS tends to test for a particular structure in the data based on analytical guarantees, such as mathematical convergence theorems about approximating the population properties with increasing sample size. Instead, SL tends to explore particular structure in the data and quantify its generalization to new data, certified by empirical guarantees such as the explicit evaluation of the predictiveness of a fitted model on unseen data (Efron & Tibshirani, 1991). CS thus resorts more to what Leo Breiman called data modelling, imposing an a-priori model in a top-down fashion, while SL would be algorithmic modelling, fitting a model as a function of the data at hand in a bottom-up fashion (L. Breiman, 2001).
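The divergence of explanatory and predictive power can be demonstrated in a few lines. The following is a toy simulation under assumed conditions, not an analysis from the paper: a model that explains the training data almost perfectly can still predict held-out data poorly.

```python
# In-sample fit (explanation) and out-of-sample accuracy (prediction) can
# diverge: an overly flexible polynomial fits the training half better but
# predicts the held-out half worse than a simple linear model.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.5, 40)            # truly linear relationship
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def in_and_out_of_sample_error(degree):
    """Fit a polynomial on the training half, score on both halves."""
    coefs = np.polyfit(x_train, y_train, degree)
    mse_in = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_out = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return mse_in, mse_out

lin_in, lin_out = in_and_out_of_sample_error(1)     # simple model
flex_in, flex_out = in_and_out_of_sample_error(15)  # flexible model

# The flexible model "explains" the training data better in-sample ...
assert flex_in < lin_in
# ... yet the simple model predicts the unseen half more accurately.
```

This is why SL practice reports performance on data held out from model fitting, rather than goodness of fit on the data used for estimation.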
Although this polarization of statistical methodology may be oversimplified, it serves as a didactic tool to confront two different perspectives. Taken together, CS was mostly fashioned for problems with small samples that can be grasped by plausible models with a small number of parameters chosen by the investigator in an analytical fashion. SL was mostly fashioned for problems with many variables in potentially large samples with rare knowledge of the data-generating process, which are emulated by a mathematical function created from data by a machine in a heuristic fashion. Tests from CS therefore typically assume that the data behave according to known mechanisms, whereas SL exploits algorithmic techniques to avoid a-priori specifications of the data mechanism. In this way, CS preassumes and tests a model for the data, whereas SL learns a model from the data5.

5 "Indeed, the trend since Pearl's work in the 1980's has been to blend reasoning and learning: put simply, one does not need to learn (from data) what one can infer (from the current model). Moreover, one does not need to infer what one can learn (intractable inferential procedures can be circumvented by collecting data)." (Jordan, 2010)
Obviously, the existing repertoire of statistical methods could have been dissected in different ways. For instance, the Bayesian-frequentist distinction is orthogonal to the CS-SL distinction (Efron, 2005; Freedman, 1995; Z. Ghahramani, 2015). Bayesian statistics can be viewed as subjective and optimistic in considering the data at hand conditioned on preselected distributional assumptions for the probabilistic model to perform direct inference (Wagenmakers, Lee, Lodewyckx, & Iverson, 2008). Frequentist statistics are more objective and pessimistic in considering a distribution of possible data with unknown model parameters to perform indirect inference based on the differences between observed and model-derived data. It is important to appreciate that Bayesian statistics can be adopted in both the CS and SL families in various flavors (Friston et al., 2008; Geoffrey E. Hinton, Osindero, & Teh, 2006; Kingma & Welling, 2013). Furthermore, CS and SL approaches can hardly be clearly categorized as either discriminative or generative (Bishop, 2006; G. E. Hinton, Dayan, Frey, & Neal, 1995; Jordan, 1995; Ng & Jordan, 2002). Discriminative models focus on solving the supervised problem of predicting a class y by directly estimating P(y|X), without capturing special structure in the data X. Typically more demanding in data quantity and computational resources, generative models capture special structure by deriving P(y|X) from P(X|y) and P(y), and can thus produce synthetic examples X~ for each class y. Further, CS and SL cannot be disambiguated into deterministic versus probabilistic models (Shafer, 1992). Statistical models are often not exclusively deterministic because they incorporate a component that accounts for unexpected, noisy variation in the data. Each probabilistic model can also be viewed as a superclass of a deterministic model (Norvig, 2011). Neither can the terms univariate and multivariate be exclusively grouped into either CS or SL. They traditionally denote reliance on one versus several dependent variables (CS) or target variables (SL) in the statistical literatures. In the neuroimaging literature, however, "multivariate" frequently refers to high-dimensional approaches operating on the neural activity from all voxels, in opposition to mass-univariate approaches operating on single-voxel activity (Brodersen, Haiss, et al., 2011; Friston et al., 2008). CS can be divided into uni- and multivariate groups of statistical tests (Ashburner & Kloppel, 2011; Friston et al., 2008), while SL is largely focused on higher-dimensional problems that often naturally translate to the multivariate classification/regression setting. Tapping into yet another terminology, CS methods may be argued to be closer to parametric statistics by instantiating statistical models whose number of parameters is fixed, finite, and not a function of sample size (Bishop, 2006; Z. Ghahramani, 2015). Instead, SL methods are more often (but not exclusively) realized in a non-parametric setting by instantiating flexible models whose number of parameters grows explicitly or implicitly with more input data. Parametric approaches are often more successful if few observations are available. Conversely, non-parametric approaches may be more naturally prepared to capture emergent properties that only arise from larger datasets (Z. Ghahramani, 2015; Halevy et al., 2009; Jordan et al., 2013). For instance, a non-parametric (but not a parametric) classifier can extract ever more complex decision boundaries from increasing training data (e.g., decision trees and nearest neighbors). Most importantly, neither CS nor SL can generally be considered superior. This is captured by the no free lunch theorem6 (Wolpert, 1996), which states that no single statistical strategy can consistently do better in all circumstances (cf. Gerd Gigerenzer, 2004). The investigator has the discretion to choose which statistical approach is best suited to the neurobiological phenomenon under study and the neuroscientific research object at hand.
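The parametric versus non-parametric contrast can be made tangible with a small sketch. This is a hypothetical example; the model choices and the scikit-learn attribute names used for inspection are assumptions of this illustration:

```python
# Parametric vs. non-parametric: a logistic regression has a fixed number
# of parameters regardless of sample size, while a nearest-neighbor model
# effectively keeps the whole training set as its "parameters".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
param_sizes, stored_samples = [], []
for n in (100, 1000):
    X = rng.standard_normal((n, 5))
    y = (X[:, 0] + 0.3 * rng.standard_normal(n) > 0).astype(int)

    parametric = LogisticRegression().fit(X, y)
    nonparametric = KNeighborsClassifier(n_neighbors=5).fit(X, y)

    # 5 weights + 1 intercept, no matter how many samples were seen
    param_sizes.append(parametric.coef_.size + parametric.intercept_.size)
    # the fitted k-NN model is essentially the stored training set itself
    stored_samples.append(int(nonparametric.n_samples_fit_))

print(param_sizes, stored_samples)   # [6, 6] [100, 1000]
```

With ten times more data, the parametric model still has six parameters, while the capacity of the nearest-neighbor model grows with the data, which is what lets it carve out ever more complex decision boundaries.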
3.2 Different modelling goals in neuroimaging
Statisticalanalysisgrounded inCSandSL ismoreclosely related toencodingmodelsanddecoding
models in the neuroimaging domain, respectively (Nikolaus Kriegeskorte, 2011; T. Naselaris, Kay,
Nishimoto,&Gallant,2011;Pedregosa,Eickenberg,Ciuciu,Thirion,&Gramfort,2015;butseeGüçlü
et al, 2015). Encoding models regress the data against a design matrix with potentially many
explanatorycolumnsofstimulus(e.g.,faceversushousepictures),task(e.g.,toevaluteortoattend),
orbehavioral (e.g.,ageorgender) indicatorsby fittinggeneral linearmodels. Incontrast,decoding
modelstypicallypredicttheseindicatorsbytrainingandtestingclassificationalgorithmsondifferent
6Inthesupervisedsetting,thereisnoaprioridistinctionbetweenlearningalgorithmsevaluatedbyout-of-samplepredictionerror.Intheoptimizationsettingoffinitespaces,allalgorithmssearchinganextremumperformidenticalwhenaveragedacrosspossiblecostfunctions.(http://www.no-free-lunch.org/)
splits from the whole dataset. In CS parlance, the encoding model fits the neural activity data (the dependent variables) by beta coefficients, according to the indicators in the design matrix columns (the independent variables). An explanation for decoding models in SL jargon would be that the model weights of a classifier are fitted on the training set of the input data to predict the class labels (the target variables), and are subsequently evaluated on the test set by cross-validation to obtain their out-of-sample generalization performance. Put differently, a GLM fits coefficients of stimulus/task/behavior indicators on neural activity data for each voxel separately given the design matrix (T. Naselaris et al., 2011). Classifiers predict entries of the design matrix for all voxels simultaneously given the neural activity data (Pereira et al., 2009). A key difference between CS-mediated encoding models and SL-mediated decoding models thus pertains to the direction of inference between brain space and indicator space (Friston et al., 2008; Varoquaux & Thirion, 2014). The mapping direction hence pertains to the question whether the indicators in the model act as causes, by representing deterministic experimental variables of an encoding model, or as consequences, by representing probabilistic outputs of a decoding model (Friston et al., 2008). These considerations also reveal the intimate relationship of CS models to the notion of so-called forward inference, while SL methods are probably more often used for formal reverse inference in functional neuroimaging (S. B. Eickhoff et al., 2011; Poldrack, 2006; Varoquaux & Thirion, 2014; T. Yarkoni et al., 2011). On the one hand, forward inference relates to encoding models by testing the probability of observing activity in a brain location given knowledge of a psychological process. Reverse inference, on the other hand, relates to brain "decoding" by testing the probability of a psychological process being present given knowledge of activation in a brain location. Finally, establishing a brain-behavior association has been argued to be more important than the actual direction of the mapping function. This is because "showing that one can decode activity in the visual cortex to classify [...] a subject's percept is exactly the same as demonstrating significant visual cortex responses to perceptual changes" and, conversely, "all demonstrations of functionally specialized responses represent an implicit mind reading" (Friston, 2009).
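Forward and reverse inference are linked by Bayes' rule, which also shows why a strong forward inference does not by itself license a strong reverse inference. A minimal numerical sketch (all probabilities are purely illustrative, not empirical estimates):

```python
# Toy numbers, illustrative only:
# P(activation in region R | process P engaged) -- forward inference
p_act_given_proc = 0.8
# Base rate at which process P is engaged across tasks
p_proc = 0.1
# Rate at which region R activates when P is NOT engaged (low selectivity)
p_act_given_not_proc = 0.3

# Total probability of observing activation in R
p_act = p_act_given_proc * p_proc + p_act_given_not_proc * (1 - p_proc)

# Reverse inference via Bayes' rule: P(process | activation)
p_proc_given_act = p_act_given_proc * p_proc / p_act

print(round(p_proc_given_act, 3))  # -> 0.229, despite P(activation|process) = 0.8
```

With these toy numbers, observing activation raises the probability of the process from 0.1 to only about 0.23 because the region also responds in many other circumstances.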
More specifically, GLM encoding models follow a representational agenda by testing hypotheses on regional effects of functional specialization in the brain (where?). A t-test is used to compare pairs of measures to statistically distinguish one target and one non-target set of mental operations (Friston et al., 1996). Essentially, this formally tests for significant differences between the beta coefficients corresponding to two stimulus/task indicators, based on well-founded arguments from cognitive theory (the SL analog would be a binary classifier distinguishing two class labels). It assumes that cognitive subtraction is possible, that is, that the regional brain responses of interest can be isolated by contrasting two sets of brain scans that differ precisely in the cognitive facet of interest (Friston et al., 1996; Stark & Squire, 2001). For one voxel location at a time, an attempt is made to reject the null hypothesis of no difference between the averaged neural activity of a target brain state and the averaged neural activity of a control brain state. Note that establishing a brain-behavior association via GLM is therefore a judgment about neural activity compressed in the time dimension. The univariate GLM analysis can easily be extended to more than one output (dependent) variable within the CS regime by performing a multivariate analysis of covariance (MANCOVA). This allows for tests of more complex hypotheses but incurs multivariate normality assumptions (Nikolaus Kriegeskorte, 2011). Conceptually, one can switch the direction of inference by reframing the estimated beta coefficients as independent variables and the stimulus/task/behavior indicator as the response variable to perform multiple linear regression (ANCOVA) (Brodersen, Haiss, et al., 2011; T. Naselaris et al., 2011). The beta coefficients would then play an analogous role to the model weights of a trained classification algorithm. Finally, the often-performed small volume correction (an analog in the SL world would be classification after feature selection) corresponds to simultaneously fitting a GLM to each voxel of a restricted region of interest.
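The cognitive subtraction logic at a single voxel can be sketched with a pooled-variance two-sample t statistic, computed here by hand on simulated activity values (the data, effect size, and condition names are purely illustrative):

```python
import math
import random

random.seed(1)

# Simulated single-voxel activity under two conditions (hypothetical units):
target  = [random.gauss(1.0, 1.0) for _ in range(20)]   # e.g., face trials
control = [random.gauss(0.0, 1.0) for _ in range(20)]   # e.g., house trials

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic for a two-condition contrast."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

t = two_sample_t(target, control)
print(round(t, 2))   # a large |t| speaks against "no difference" at this voxel
```

In a mass-univariate analysis, this computation is repeated independently at every voxel, which is what creates the multiple comparisons problem discussed below.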
Because hypothesis testing for significant differences between beta coefficients of fitted GLMs relies on averaged neural activity, the test results are not corrupted by the conventionally applied spatial smoothing with a Gaussian filter. On the contrary, smoothing even helps the correction for multiple comparisons based on random field theory, alleviates inter-individual neuroanatomical variability,
and thus increases sensitivity. Spatial smoothing, however, discards fine-grained spatial activity patterns that carry potential information about mental operations (J.-D. Haynes, 2015). Indeed, some authors believe that sensory, cognitive, and motor processes manifest themselves as neuronal population codes (Averbeck, Latham, & Pouget, 2006). The relevance of such population codes in human neuroimaging was for instance suggested by revealing subject-specific fusiform-gyrus responses to facial stimuli (Saygin et al., 2012). In applications of SL models, the spatial smoothing step is therefore often skipped because the "decoding" algorithms precisely exploit this multivariate structure of the salt-and-pepper patterns.
In contrast, decoding models use learning algorithms for an informational agenda by showing generalization of robust patterns to new brain activity acquisitions (de-Wit, Alexander, Ekroll, & Wagemans, 2016; N. Kriegeskorte et al., 2006; Mur et al., 2009). Information that is locally weak but spatially distributed can be effectively harvested in a structure-preserving fashion (J. D. Haynes & Rees, 2006). Some brain-behavior associations might only emerge when simultaneously capturing neural activity in a group of voxels, but disappear in single-voxel approaches, such as GLMs. However, analogous to multivariate variants of the GLM, "decoding" can also be done by means of classical statistical approaches. Essentially, inference on information patterns in the brain reduces to model comparison (Friston, 2009). During training of a classifier to predict indicators correctly, an optimization algorithm (e.g., gradient descent or its variants) searches iteratively through the hypothesis space (= function space) of the chosen learning model. Each such hypothesis corresponds to one specific combination of model weights that equates with one candidate mapping function from the neural activity features to the indicators. In this way, four types of neuroscientific questions have been proposed to become quantifiable (Brodersen, 2009; Pereira et al., 2009): i) Where is an information category neurally processed? This extends the interpretational spectrum from increase and decrease of neural activity to the existence of complex combinations of activity variations distributed across voxels. For instance, linear classifiers decoded object categories from the ventral temporal cortex, even after excluding the fusiform gyrus, which is known to be responsive to object
stimuli (Haxby et al., 2001). ii) Whether a given information category is reflected by neural activity? This extends the interpretational spectrum to topographically similar but neurally distinct processes that potentially underlie different cognitive facets. For instance, linear classifiers successfully decoded whether a subject is attending to the first or second of two simultaneously presented gratings (Kamitani & Tong, 2005). iii) When is an information category generated (i.e., onset), processed (i.e., duration), and bound (i.e., alteration)? When applying classifiers to neural time series, the interpretational spectrum can be extended to the beginning, evolution, and end of distinct cognitive facets. For instance, different classifiers have been demonstrated to map the decodability time structure of mental operation sequences (King & Dehaene, 2014). iv) More controversially, how is an information category neurally processed? The interpretational spectrum is extended to computational properties of the neural processes, including processing in brain regions versus networks, or isolated versus partially shared processing facets. For instance, a classifier trained on evolutionarily conserved eye gaze processes was able to decode evolutionarily more recent mathematical calculation processes as a possible case of neural recycling in the human brain (Knops, Thirion, Hubbard, Michel, & Dehaene, 2009). As an important caveat, the particular properties of a chosen learning algorithm (e.g., linear versus non-linear support vector machines) can probably not serve as a convincing argument for reverse-engineering neural processing mechanisms (Misaki, Kim, Bandettini, & Kriegeskorte, 2010). However, many prediction problems in neuroscience can probably be solved without exhaustive neurobiological micro-, meso-, or macro-level knowledge (H. Markram, 2006; Sandberg & Bostrom, 2008).
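The idea that training a classifier amounts to an iterative search through a hypothesis space can be made concrete with a minimal gradient-descent loop for logistic regression. This is a toy sketch on simulated two-feature "voxel" data; all names, dimensions, and learning parameters are illustrative choices, not a recommended analysis:

```python
import math
import random

random.seed(2)

labels = [0, 1] * 50
# Hypothetical decoding setup: two "voxel" features per sample, binary label.
X = [[random.gauss(lbl, 1.0), random.gauss(-lbl, 1.0)] for lbl in labels]
y = labels

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.1

# Each gradient step selects a new hypothesis, i.e., a new candidate mapping
# (weight combination) from the voxel features to the class labels.
for epoch in range(200):
    gw0 = gw1 = gb = 0.0
    for (x0, x1), yi in zip(X, y):
        err = sigmoid(w[0] * x0 + w[1] * x1 + b) - yi   # logistic-loss gradient
        gw0 += err * x0
        gw1 += err * x1
        gb += err
    m = len(X)
    w = [w[0] - lr * gw0 / m, w[1] - lr * gw1 / m]
    b -= lr * gb / m

acc = sum(int((sigmoid(w[0] * x0 + w[1] * x1 + b) > 0.5) == bool(yi))
          for (x0, x1), yi in zip(X, y)) / len(y)
print(round(acc, 2))   # in-sample accuracy of the finally selected hypothesis
```

The accuracy reported here is in-sample; an honest estimate of generalization would require the cross-validation procedures discussed later.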
4. Different theories
4.1 Diverging ways to formalize statistical inference
Besides diverging historical origins and modelling goals, CS and SL rely on largely distinct theoretical frameworks that revolve around null-hypothesis testing and statistical learning theory. CS probably laid down its most important theoretical framework in the Popperian spirit of critical empiricism (Popper, 1935/2005): scientific progress is to be made by continuous replacement of current hypotheses by ever more pertinent hypotheses using verification and falsification. The rationale behind hypothesis falsification is that one counterexample can reject a theory by deductive reasoning, while no quantity of evidence can confirm a given theory by inductive reasoning (Goodman, 1999). The investigator verbalizes two mutually exclusive hypotheses by domain-informed judgment. The alternative hypothesis should be conceived so as to contradict the state of the art of the research topic. The null hypothesis should follow automatically from the newly articulated alternative hypothesis. The investigator has the agenda to disprove the null hypothesis because this leaves only the preferred alternative hypothesis as the new standard belief. A conventional 5% threshold (i.e., equating with roughly two standard deviations) guards against rejection due to idiosyncrasies of the sample that are not representative of the general population. If the data have a probability of <=5% of occurring given the null hypothesis (P(result|H0)), the result is evaluated to be significant. Such a test for statistical significance indicates a difference between two means with a 5% chance after sampling twice from the same population. There are common misconceptions that it denotes the probability of the null hypothesis (P(H0)), the alternative hypothesis (P(H1)), the result of the test statistic (P(result)), or the null hypothesis given the result (P(H0|result)) (Markus, 2001; Pollard & Richardson, 1987). If the null hypothesis is not rejected, then nothing can be concluded from the significance test according to most statisticians. That is, the test yields no conclusive result, rather than a null result (Schmidt, 1996). In this way, classical hypothesis testing continuously replaces currently embraced hypotheses explaining a phenomenon in nature by better hypotheses with more empirical support, in a Darwinian selection process. Note
however that corroborating substantive hypotheses (e.g., a specific linguistic theory like Chomsky's Universal Grammar) requires more than statistical hypothesis testing (Chow, 1998; Friedman, 1998; Meehl, 1978). A statistical hypothesis can be properly tested even in the absence of a substantive hypothesis of the phenomenon under study (Oakes, 1986). Finally, Fisher, Neyman, and Pearson intended hypothesis testing as a marker for further investigation, rather than an off-the-shelf decision-making instrument (J. Cohen, 1994; Nuzzo, 2014).
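The meaning of P(result|H0) can be made tangible with a label-permutation sketch: shuffling condition labels simulates a world in which the null hypothesis holds, and the resulting p value estimates how often a mean difference at least as extreme as the observed one arises under that null world. The numbers below are made up for illustration:

```python
import random

random.seed(3)

group_a = [2.1, 1.8, 2.5, 2.9, 1.7, 2.3]   # hypothetical scores, condition A
group_b = [1.2, 1.5, 0.9, 1.8, 1.1, 1.4]   # hypothetical scores, condition B

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Under H0 the condition labels are exchangeable: shuffling them simulates
# the sampling distribution of the mean difference given H0.
pooled = group_a + group_b
count = 0
n_perm = 10000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm   # estimates P(result at least this extreme | H0)
print(p_value)
```

Note that the computed quantity is a probability of the data given the null hypothesis; it says nothing directly about P(H0) or P(H0|result), which is exactly the misconception described above.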
A theoretical framework relevant to both CS and SL is the bias-variance tradeoff. It is a stand-alone concept that is helpful to consider in theory and practice by providing a different angle on deriving conclusions from data (e.g., Geman, Bienenstock, & Doursat, 1992). As famously noted7, any statistical model is an over-simplification of the studied phenomenon, a distortion of the truth. Choosing the "right" statistical model hence pertains to striking the right balance between bias and variance. Bias denotes the difference between the target function (i.e., the "true" relationship between the data and a response variable that the investigator is trying to uncover from the data) and the average of the function space (or hypothesis space) instantiated by a model. Intuitively, the bias describes the rigidity of the model-derived functions. High-bias models yield virtually the same best approximating function, even for larger perturbations of the data. Variance denotes the difference between the best approximating function among members of the function space and the average of the function space. Intuitively, variance describes the differences between the functions derived from the same model family. High-variance models yield very different best approximating functions even for small perturbations of the data. In sum, bias tends to decrease and variance tends to increase as model complexity increases. Simple models exhibit high bias and low variance. They lead to a bad approximation as they do not simulate well the target function, a behavior of the studied phenomenon. Yet, they exhibit good generalization in successfully extrapolating from seen data to unseen data. Conversely, complex models exhibit low bias and high variance. They have a better chance of approximating the target function well, at the price of generalizing less well to new data.
7 "All models are wrong; some models are useful." (George Box)
That is because complex models tend to fit the data too well, to the point of fitting noise. This is called overfitting and occurs when model fitting integrates information that is independent of the target function and idiosyncratic to the data sample. The bias-variance decomposition captures this fundamental tradeoff in statistical modelling between approximating the behavior of the studied phenomenon and generalizing to new data describing that behavior. If the target function were known, the bias-variance tradeoff could be computed explicitly, in contrast to the inability of computing the Vapnik-Chervonenkis dimensions (VC dimensions) of non-trivial models (cf. next passage). The bias-variance tradeoff can also practically explain why successful applications of statistical models largely rely on i) the amount of available data, ii) the typically unknown amount of noise in the data, and iii) the unknown complexity of the target function (Abu-Mostafa et al., 2012).
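The tradeoff can be simulated directly when, unlike in real research, the target function is known. In the illustrative sketch below (all choices are arbitrary), a high-bias model that always predicts the training mean is compared with a high-variance model that predicts the value of the single nearest training point, across many resampled datasets:

```python
import random

random.seed(4)

def target(x):                 # the "true" target function, known in simulation
    return x ** 2

def sample_dataset(n=10):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [target(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

x0 = 0.8                       # fixed test point
const_preds, interp_preds = [], []
for _ in range(500):
    xs, ys = sample_dataset()
    # High-bias model: always predicts the training mean (ignores x entirely).
    const_preds.append(sum(ys) / len(ys))
    # High-variance model: predicts the y of the single nearest training x.
    interp_preds.append(min(zip(xs, ys), key=lambda p: abs(p[0] - x0))[1])

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

bias_const  = abs(mean(const_preds)  - target(x0))   # large: rigid model
bias_interp = abs(mean(interp_preds) - target(x0))   # small: flexible model
print(round(bias_const, 2), round(var(const_preds), 3),
      round(bias_interp, 2), round(var(interp_preds), 3))
```

Across the resampled datasets, the constant model shows high bias but low variance, while the nearest-point model shows low bias but high variance, mirroring the verbal description above.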
While the concept of the bias-variance tradeoff is highly relevant in both abstract theory and everyday application of CS and SL models (Bishop, 2006; T. Hastie et al., 2001), the concept of Vapnik-Chervonenkis dimensions plays a crucial role in statistical learning theory. The VC dimensions mathematically formalize the circumstances under which a pattern-learning algorithm can successfully distinguish between points and extrapolate to new examples (Vapnik, 1989, 1996). This comprises any instance of learning from a number of observations to derive general rules that capture properties of phenomena in nature, including human and computer learning (cf. Bengio, 2014; Lake, Salakhutdinov, & Tenenbaum, 2015; Tenenbaum, Kemp, Griffiths, & Goodman, 2011). Note that this inductive logic of learning a general principle from examples contrasts with the deductive logic of hypothesis falsification (cf. above). VC dimensions provide a probabilistic measure of whether a certain model is able to learn a distinction with respect to a given dataset. Formally, the VC dimensions measure the complexity capacity of a function space by counting the number of data points that can be cleanly divided (i.e., "shattered") into distinct groups as a result of the flexibility of the function set. Intuitively, the VC dimensions provide a guideline for the largest set of examples fed into a learning function such that it is possible to guarantee zero classification errors. They can be viewed as the effective number of parameters or the degrees of freedom, a concept shared between
CS and SL. In this way, the VC dimensions formalize the circumstances under which a class of functions is able to learn from a finite amount of data to successfully predict a given phenomenon in previously unseen data. As one of the most important results from statistical learning theory, the number of configurations one can obtain from a classification algorithm grows polynomially, while the error decreases exponentially (Wasserman, 2013). Practically, good models have finite VC dimensions - fitting with a sufficiently large amount of data yields performance that approximates the theoretically expected performance in unseen data. Bad models have infinite VC dimensions - regardless of the available amount of data, it is impossible to make generalization conclusions on unseen data. In contrast to the bias-variance tradeoff, the VC dimensions (like null-hypothesis testing) are unrelated to the target function, that is, the "true" mechanisms underlying the studied phenomenon in nature. Instead, the VC dimensions relate to the models used to approximate that target function. As an interesting consideration, it is possible that generalization from a concrete dataset fails even if the VC dimensions predict learning as very likely. However, such a dataset is theoretically unlikely to occur. Although the VC dimensions are the best formal concept to derive error bounds in statistical learning theory (Abu-Mostafa et al., 2012), they can only be explicitly computed for simple models. Hence, investigators are often restricted to an approximate bound for the VC dimensions, which limits their usefulness for theoretical considerations.
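The notion of shattering can be demonstrated exhaustively for a simple function class. Interval classifiers on the real line can realize every labeling of two points, but not of three (the labeling 1-0-1 has no realizing interval), so their VC dimension is two. A brute-force check of this textbook example:

```python
from itertools import product

def interval_classifier(lo, hi):
    """Labels a point 1 if it falls inside [lo, hi], else 0."""
    return lambda x: int(lo <= x <= hi)

def can_shatter(points):
    """Check whether interval classifiers realize every labeling of the points."""
    # It suffices to test intervals bounded by the points themselves, plus an
    # always-zero classifier for the all-zero labeling.
    candidates = [interval_classifier(a, b) for a in points for b in points]
    candidates.append(lambda x: 0)
    for labeling in product([0, 1], repeat=len(points)):
        if not any(tuple(c(p) for p in points) == labeling for c in candidates):
            return False
    return True

print(can_shatter([1.0, 2.0]))        # True: 2 points, every labeling achievable
print(can_shatter([1.0, 2.0, 3.0]))   # False: labeling (1, 0, 1) is impossible
```

For richer model families this enumeration becomes intractable, which is one concrete way to see why VC dimensions can only be computed explicitly for simple models.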
4.2 The impact of diverging inference categories on neuroimaging analyses
When looking at neuroimaging research through the CS lens, statistical estimation revolves around solving the multiple comparisons problem (Thomas E. Nichols, 2012; T. E. Nichols & Hayasaka, 2003). From the SL stance, however, it is the curse of dimensionality and overfitting that statistical analyses need to tackle (Domingos, 2012; Friston et al., 2008). In typical neuroimaging studies, CS methods typically test one hypothesis many times (i.e., the null hypothesis), whereas SL methods typically search through thousands of different hypotheses in a single process (i.e., walking through the
function space by numerical optimization). The high voxel resolution of common brain scans offers parallel measurements of >100,000 brain locations. In a mass-univariate regime, such as after fitting voxel-wise GLMs, the same statistical test is applied >100,000 times. The more often the investigator tests a hypothesis of relevance for a brain location, the more locations will be falsely detected as relevant (false positives, Type I errors), especially in the noisy neuroimaging data. The issue consists in too many simultaneous statistical inferences. From a general perspective, all dimensions in the data (i.e., voxel variables) are implicitly treated as equally important, and no neighborhoods of most expected variation are statistically exploited (T. Hastie et al., 2001). Hence, the absence of complexity restrictions during the statistical modelling of neuroimaging data takes a heavy toll at the final inference step.
This is contrasted by the high-dimensional SL regime, where the initial model choice by the investigator determines the complexity restrictions on all data dimensions (i.e., not single voxels) that are imposed explicitly or implicitly by the model structure. Model choice predisposes existing but unknown low-dimensional neighborhoods in the full voxel space to achieve the prediction task. Here, the toll is taken at the beginning because there are so many different alternative model choices that would impose a different set of complexity constraints. For instance, signals from "brain regions" are likely to be well approximated by models that impose discrete, locally constant compartments on the data (e.g., k-means or spatially constrained Ward clustering). Tuning model choice to signals from macroscopic "brain networks" should impose overlapping, locally continuous data compartments (e.g., independent component analysis or sparse principal component analysis). Knowledge of such effective dimensions in the neuroimaging data is a rare opportunity to simultaneously reduce the model bias and model variance, despite their typically inverse relationship. Statistical models that overcome the curse of dimensionality typically incorporate an explicit or implicit metric for such anisotropic neighborhoods in the data (Bach, 2014; Bzdok, Eickenberg, Grisel, Thirion, & Varoquaux, 2015; T. Hastie et al., 2001). Viewed from the bias-variance tradeoff, this successfully calibrates the sweet spot between underfitting and overfitting. Viewed from statistical learning theory, the VC
dimensions can be reduced and thus the generalization performance increased. When applying a model without such complexity restrictions to high-dimensional brain data, generalization becomes difficult to impossible because all directions in the data are treated equally, with isotropic structure. At the root of the problem, all data samples look virtually identical in high-dimensional data scenarios ("curse of dimensionality", Bellman, 1961). The learning algorithm will not be able to see through the noise and will thus overfit. In fact, these considerations explain why the multiple comparisons problem is closely linked to encoding studies and overfitting is more closely related to decoding studies (Friston et al., 2008). Moreover, it offers explanations as to why analyzing neural activity in a region of interest, rather than the whole brain, simultaneously alleviates both the multiple comparisons problem (called "small volume correction" in CS studies) and the overfitting problem (called "feature selection" in SL studies).
Further, some common invalidations of the CS and SL statistical frameworks are conceptually related (cf. case study two). An often-raised concern in neuroimaging studies performing classical inference is double dipping or circular analysis (N. Kriegeskorte et al., 2009). This occurs when, for instance, first correlating a behavioral measure with brain activity and then using the identified subset of brain voxels for a second correlation analysis with that same behavioral measurement (Lieberman, Berkman, & Wager, 2009; Vul et al., 2008). In this scenario, voxels are submitted to two statistical tests with the same goal in a nested, non-independent fashion8. This corrupts the validity of the null hypothesis on which the reported test results conditionally depend. Importantly, this case of repeating the same statistical estimation with iteratively pruned data selections (on the training data split) is a valid routine in the SL framework, such as in recursive feature elimination (Isabelle Guyon, Weston, Barnhill, & Vapnik, 2002; Hanson & Halchenko, 2008). However, there is an analog in SL analyses based on out-of-sample generalization to double dipping or circular analysis in CS methods applied to neuroimaging data: data snooping or peeking (Abu-Mostafa et al., 2012; Fithian, Sun, & Taylor, 2014; Pereira et al., 2009). This occurs, for instance, when performing simple (e.g., mean-centering) or more involved (e.g., k-means clustering) target-variable-dependent or -independent preprocessing on the entire dataset, where it should be applied separately to the training sets and test sets. Data snooping can lead to optimistic cross-validation estimates and a trained learning algorithm that fails on fresh data drawn from the same distribution. Rather than a corrupted null hypothesis, it is the error bounds of the VC dimensions that are loosened and, ultimately, invalidated because information from the concealed test set influences model selection on the training set. Conceptually, these two distinct classes of statistical errors arising within the CS and SL frameworks have several common points: i) Both involve a form of information compression imposed by the necessity to draw conclusions on small data parts drawn from originally high-dimensional neuroimaging data. ii) Illegitimate prior knowledge is introduced into the statistical estimation process. iii) Both double dipping (CS) and data snooping (SL) involve a form of bias - biasing the estimates to systematically deviate from the population parameter, or increasing the bias error term in the bias-variance decomposition (Tal Yarkoni & Westfall, 2016). iv) Both types of faux pas in statistical conduct can occur in very subtle and unexpected ways, but can often be avoided with small effort (see case study two). v) Invalidation of the null hypothesis or the VC bounds can yield overly optimistic results that may encourage unjustified confidence in neuroscientific findings and conclusions.
8 "If you torture the data enough, nature will always confess." (Ronald Coase)
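Data snooping can be demonstrated in a few dozen lines: on pure-noise data with randomly assigned labels, selecting the most discriminative features on the full dataset before cross-validation yields accuracies far above chance, whereas performing the selection inside each training fold does not. The simulation below is purely illustrative; the nearest-centroid classifier, sample sizes, and selection rule are arbitrary choices:

```python
import random

random.seed(6)

n, p, k_select = 40, 500, 10
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [0, 1] * (n // 2)                    # labels assigned purely at random

def top_features(rows, labels, k):
    """Rank features by between-class mean separation on the given rows."""
    scores = []
    for j in range(p):
        c0 = [r[j] for r, l in zip(rows, labels) if l == 0]
        c1 = [r[j] for r, l in zip(rows, labels) if l == 1]
        scores.append((abs(sum(c1) / len(c1) - sum(c0) / len(c0)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def nearest_centroid_cv(snoop):
    """Leave-one-out CV; feature selection done on all data or inside folds."""
    correct = 0
    for i in range(n):
        train = [t for t in range(n) if t != i]
        rows = [X[t] for t in train]
        labels = [y[t] for t in train]
        feats = (top_features(X, y, k_select) if snoop          # SNOOPING
                 else top_features(rows, labels, k_select))     # honest
        centroids = {}
        for c in (0, 1):
            rc = [r for r, l in zip(rows, labels) if l == c]
            centroids[c] = [sum(r[j] for r in rc) / len(rc) for j in feats]
        dist = {c: sum((X[i][j] - centroids[c][a]) ** 2
                       for a, j in enumerate(feats)) for c in (0, 1)}
        correct += int(min(dist, key=dist.get) == y[i])
    return correct / n

print(nearest_centroid_cv(True), nearest_centroid_cv(False))
```

Since the labels carry no information whatsoever, any accuracy clearly above 0.5 in the snooped variant is an artifact of the test samples having influenced feature selection.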
Moreover, it is probably optimal to perform cross-validation with k = 5 or 10 data splits in neuroimaging, despite diverging choices in previous studies. This is because empirical simulation studies (Leo Breiman & Spector, 1992; Kohavi, 1995) have indicated that more data splits k (i.e., bigger training sets, smaller test sets) can increase the estimate's variance, while fewer data splits k (i.e., smaller training sets, bigger test sets) can increase the estimate's bias. Concretely, for a particular estimate issued by leave-one-out cross-validation (k data splits = n samples) the neuroimaging practitioner might unluckily obtain a "fluctuation" far from the ground-truth value, and there is often no second dataset to double-check. As a drastic side note, neuroscientists do not even know whether a target function exists in nature when, for instance, automatically
classifying mental operations or healthy from diseased individuals based on brain scans (Nesterov, 2004; Wolpert, 1996). Additionally, an analytical proof (Shao, 1993) showed that leave-one-out cross-validation does not provide a theoretical guarantee for consistent model estimation. That is, this cross-validation scheme does not always select the best model, assuming that it is known. However, the neuroimaging practitioner is typically satisfied with model weights that are reasonably near to an optimal local point and does not focus on reaching the single global optimal point, as is typical in convex optimization (Boyd & Vandenberghe, 2004). As an exception, leave-one-out cross-validation may therefore be more justified in neuroimaging data scenarios with particularly few samples, where a decent model fit is the primary concern. On a more practical note, the computational load for leave-one-out cross-validation with k = n is often much higher than for k-fold cross-validation with k = 5 or 10.
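The bookkeeping behind these schemes is simple. A minimal k-fold splitter (an illustrative sketch, not a library routine) shows how k = 5 and k = n, i.e., leave-one-out, differ only in the number of folds and the train/test split sizes:

```python
def kfold_indices(n, k):
    """Partition sample indices 0..n-1 into k contiguous, near-equal test folds."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n) if j not in set(test)]
        folds.append((train, test))
        start += size
    return folds

n = 20
for k in (5, n):                   # k = n reproduces leave-one-out
    folds = kfold_indices(n, k)
    train, test = folds[0]
    print(k, len(folds), len(train), len(test))   # 5 5 16 4 / 20 20 19 1
```

In practice, samples are usually shuffled or stratified before splitting; the contiguous partition above is kept deliberately simple.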
5. Different currencies
5.1 Diverging performance metrics can quantify the behavior of statistical models
The neuroscientific investigator who adopts a CS culture, typically somebody with a background in psychology, biology, or medicine, is in the habit of diagnosing statistical investigations by means of p values, effect sizes, confidence intervals, and statistical power. The p value denotes the conditional probability of obtaining an equal or more extreme test statistic provided that the null hypothesis H0 is true, at the prespecified significance threshold alpha (Anderson, Burnham, & Thompson, 2000). Under the condition of sufficiently high power (cf. below), it quantifies the strength of evidence against the null hypothesis as a continuous function (Rosnow & Rosenthal, 1989). Counterintuitively, it is not an immediate judgment on the alternative hypothesis H1 preferred by the investigator (Anderson et al., 2000; J. Cohen, 1994). Three main interpretations of the significance level exist (Gerd Gigerenzer, 1993): i) the conventional level of significance specified before the investigation, yielding yes-no information (early Fisher), ii) the exact level of significance obtained after the investigation, yielding continuous information (late Fisher), and iii) the alpha level indicating a Type I error frequency in tests after repeated sampling (Neyman/Pearson). The second interpretation views the obtained p value as a property of the data, whereas the third views alpha as a property of the statistical test (Gerd Gigerenzer, 1993). P values do not quantify the probability of replication. It is another important caveat that p values become better (i.e., lower) with increasing sample sizes (Berkson, 1938).
The essentially binary p value is therefore often complemented by the continuous effect size. The p value is a deductive inferential measure, whereas the effect size is a descriptive measure that follows neither inductive nor deductive reasoning. The effect size can be viewed as the strength of a statistical relationship, how much H0 deviates from H1, or the likely presence of an effect in the general population (Chow, 1998; Ferguson, 2009; Kelley & Preacher, 2012). It is a unit-free, sample-size-independent, often standardized statistical measure for the importance of rejecting H0. It
equates to zero if H0 could not be rejected. As a tendency, the lower the p value, the higher the effect size (Nickerson, 2000). Importantly, the effect size allows the identification of marginal effects that pass the statistical significance threshold but are not practically relevant in the real world. As a property of the actual statistical test, the effect size has different names and takes various forms, such as rho in Pearson correlation, eta2 in explained variance, and Cohen's d in differences between group averages.
Additionally, the certainty of a point estimate (i.e., the outcome is a value) can always be expressed by an interval estimate (i.e., the outcome is a value range) using confidence intervals. They indicate with a chosen probability how often the "true" population effect would be within the investigator-specified interval after many repetitions of the study (Cumming, 2009; Estes, 1997; Nickerson, 2000). Typically, a 95% confidence interval is spanned around the range of values of a sample mean statistic that includes the population mean in 19 out of 20 cases across all samples. The tighter the confidence interval, the smaller the variance of the point estimate of the population parameter in each drawn sample (keeping the sample size constant). The sometimes normalized confidence intervals can be computed in a variety of ways. Their estimation is influenced by sample size and population variability. They can be reported for different statistics, with different percentage borders, and may be asymmetrical. Note that confidence intervals can be used as a viable surrogate for formal tests of statistical significance in many scenarios (Cumming, 2009).
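A basic z-based 95% confidence interval for a sample mean can be computed as follows (the measurements are made-up numbers; for a sample this small, a t quantile would properly be used and would widen the interval somewhat):

```python
from statistics import NormalDist, mean, stdev

sample = [101.2, 98.7, 103.4, 99.1, 100.8, 97.5, 102.2, 100.1]  # hypothetical scores

m = mean(sample)
se = stdev(sample) / len(sample) ** 0.5        # standard error of the mean
z = NormalDist().inv_cdf(0.975)                # ~1.96 for a 95% interval

ci = (m - z * se, m + z * se)
print(round(ci[0], 2), round(ci[1], 2))
```

Repeating the study many times and recomputing the interval each time, roughly 95% of such intervals would contain the population mean, which is the frequentist reading described above.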
Confidence intervals can be computed in various data scenarios and statistical regimes, whereas power is only meaningful within the culture of formal hypothesis falsification (Jacob Cohen, 1977, 1992; Oakes, 1986). The quality of interpretation of a statistically significant result is strongly affected by whether the original hypothesis was tenable. The power measures the probability of a statistical test to find a "true" effect, of rejecting H0 in the long term, or how well a "true" alternative hypothesis is correctly accepted assuming the effect exists in the population, that is, P(H0 detected to be false | H1 true). A high power thus ensures that statistically significant and non-significant tests indeed reflect a property of the population (Chow, 1998). Intuitively, small confidence intervals are
an indicator of high statistical power. Type II errors (i.e., false negatives, beta error) become less likely with higher power (= 1 - beta error). Concretely, an underpowered investigation does not allow choosing between H0 and H1 at the specified significance threshold alpha. Power calculations depend on several factors, including the significance threshold alpha, the effect size in the population, variation in the population, sample size n, and experimental design (Jacob Cohen, 1992). Rather than retrospectively, the necessary sample size n for a desired power can be computed in a prospective fashion after specifying an alpha and a hypothesized effect size.
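Such a prospective calculation can be sketched with the common two-sample, two-sided z approximation, n per group = 2·((z(1 - alpha/2) + z(power)) / d)², for a hypothesized Cohen's d. This is a simplification: exact t-based calculations give slightly larger n.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(d, alpha=0.05, power=0.8):
    """Two-sided two-sample z approximation: n per group for effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_beta = NormalDist().inv_cdf(power)            # ~0.84
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# A "medium" effect of d = 0.5 at alpha = .05 and 80% power:
print(sample_size_per_group(0.5))   # -> 63 per group
```

Halving the hypothesized effect size roughly quadruples the required sample size, which is why underpowered designs are so common when true effects are small.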
In contrast, diagnosis of the obtained research findings takes a different shape for the SL-indoctrinated neuroscientist⁹, typically somebody with a background in computer science, physics, or engineering. Cross-validation is the de facto standard to obtain an unbiased estimate of a model's capacity to generalize beyond the sample at hand (Bishop, 2006; T. Hastie et al., 2001). Model assessment is done by training on a bigger subset of the available data (i.e., the training set for in-sample performance) and subsequent application of the trained model to the smaller remaining part of the data (i.e., the test set for out-of-sample performance), which is assumed to share the same distribution. Cross-validation thus iterates over the sample in data splits until the class label (i.e., categorical target variable) of each data point has been predicted once. This set of model-predicted labels and the corresponding true data point labels can then be submitted to the quality measures accuracy, precision, recall, and F1 score (Powers, 2011). As the simplest among them, accuracy is a summary statistic that captures the fraction of correct prediction instances among all performed model applications. This and the following measures are often computed separately on the training set and the test set. Additionally, the measures from training and testing can be expressed by their inverse (e.g., the training error as in-sample error and the test error as out-of-sample error) because the positive and negative cases are interchangeable.
⁹ "It is also important to be particularly careful in reporting errors and measures of model fit in the high-dimensional setting. We have seen that when p > n, it is easy to obtain a useless model that has zero residuals. Therefore, one should never use sum of squared errors, p-values, R2 statistics, or other traditional measures of model fit on the training data as evidence of a good model fit in the high-dimensional setting." (James, Witten, Hastie, & Tibshirani, 2013, p. 247, authors' emphasis)
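The cross-validation logic described above can be sketched in a few lines. The one-dimensional toy sample and the nearest-centroid rule below are stand-ins for real brain scans and learning algorithms; the essential property is that "training" touches only the training split and each data point is predicted exactly once as a held-out case.

```python
import random

random.seed(42)

# Toy sample: one feature, two classes whose means differ (assumed data).
X = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(2, 1) for _ in range(30)]
y = [0] * 30 + [1] * 30

def cross_val_accuracy(X, y, k=5):
    idx = list(range(len(X)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint data splits
    predictions = {}
    for test_idx in folds:
        train_idx = [i for i in idx if i not in set(test_idx)]
        # "Training": class centroids estimated on the training set only.
        centroids = {c: sum(X[i] for i in train_idx if y[i] == c)
                        / sum(1 for i in train_idx if y[i] == c)
                     for c in set(y)}
        # Out-of-sample application to the held-out test split.
        for i in test_idx:
            predictions[i] = min(centroids, key=lambda c: abs(X[i] - centroids[c]))
    # Every data point has been predicted once; aggregate the accuracy.
    return sum(predictions[i] == y[i] for i in idx) / len(idx)

accuracy = cross_val_accuracy(X, y)
```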
The classification accuracy (= 1 - classification error) can be further decomposed into class-wise metrics based on the so-called confusion matrix, the juxtaposition of the true and predicted class memberships. The precision (= true positive / (true positive + false positive)) measures how many of the predicted labels are correct, that is, how many members predicted to belong to a class really belong to that class. For instance, among the participants predicted to have depression, how many are really affected by that disease? On the other hand, the recall (= true positive / (true positive + false negative)) measures how many labels are correctly predicted, that is, how many members of a class were predicted to really belong to that class. Hence, among the participants affected by depression, how many were actually detected as such? Put differently, precision can be viewed as a measure of "exactness" or "quality" and recall as a measure of "completeness" or "quantity" (Powers, 2011). Neither accuracy, precision, nor recall allows injecting subjective importance into the evaluation process of the prediction model. This disadvantage is alleviated by the Fbeta score, which is a weighted average of the precision and recall prediction scores. Concretely, the F1 score would equally weigh precision and recall of class predictions, while the F0.5 score puts more emphasis on precision and the F2 score more on recall. Moreover, applications of recall, precision, and Fbeta scores have been noted to ignore the true negative cases as well as to be highly susceptible to estimator bias (Powers, 2011). Needless to say, no single measure can be equally optimal in all contexts.
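These definitions can be made concrete in a small sketch; the confusion-matrix counts below are invented for illustration.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Class-wise metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # "exactness": correct among the predicted
    recall = tp / (tp + fn)      # "completeness": detected among the actual
    b2 = beta ** 2               # beta weighs recall relative to precision
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Invented example: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_fbeta(8, 2, 4)
```

With these counts, precision (0.8) exceeds recall (2/3), so the precision-leaning F0.5 score is larger than F1, which in turn is larger than the recall-leaning F2 score.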
Finally, learning curves (Abu-Mostafa et al., 2012; Murphy, 2012) are an important diagnostic tool to evaluate sample complexity, that is, the achieved model fit and prediction as a function of the available sample size n. For increasingly bigger subsets of the training set, a classification algorithm is trained on that current share of the training set and then evaluated for accuracy on the always-same test set. Across subset instances, simple models display relatively high in-sample error because they cannot approximate the target function very well (underfitting) but exhibit good generalization to unseen data with relatively low out-of-sample error. Conversely, complex models display relatively low in-sample error because they adapt too well to the data (overfitting) with difficulty to
extrapolate to newly sampled data, resulting in high out-of-sample error. In both scenarios, model effectiveness degrades as the data points available for model training become scarcer.
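A minimal learning-curve sketch, assuming toy data and a 1-nearest-neighbour rule as the archetypal overfitting (high-variance) model:

```python
import random

random.seed(0)

# Toy two-class sample; 1-nearest-neighbour is a high-variance model.
X = [random.gauss(0, 1) for _ in range(40)] + [random.gauss(1.5, 1) for _ in range(40)]
y = [0] * 40 + [1] * 40
pairs = list(zip(X, y))
random.shuffle(pairs)
train, test = pairs[:60], pairs[60:]  # the always-same test set

def one_nn_error(fit_set, eval_set):
    """Misclassification rate of a 1-nearest-neighbour rule."""
    errors = 0
    for x, label in eval_set:
        nearest = min(fit_set, key=lambda pt: abs(pt[0] - x))
        errors += nearest[1] != label
    return errors / len(eval_set)

curve = []
for n_train in (10, 30, 60):  # increasingly big training subsets
    subset = train[:n_train]
    curve.append((n_train,
                  one_nn_error(subset, subset),  # in-sample error
                  one_nn_error(subset, test)))   # out-of-sample error
```

The in-sample error of 1-NN is zero by construction (every training point is its own nearest neighbour), so the gap to the out-of-sample error directly visualizes the overfitting described above.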
5.2 Outcome metrics of statistical models used in neuroimaging
Reports of statistical outcomes in the neuroimaging literature have previously been recognized to confuse notions from classical statistics and statistical learning (Friston, 2012). On a general basis, CS and SL do not judge findings by the same aspects of evidence (Lo et al., 2015; Shmueli, 2010). In neuroimaging papers based on classical hypothesis-driven inference, p values and confidence intervals are ubiquitously reported. There have however been very few reports of effect size in the neuroimaging literature (N. Kriegeskorte, Lindquist, Nichols, Poldrack, & Vul, 2010). Effect sizes, in turn, are necessary to compute power estimates. This explains the even rarer occurrence of power calculations in neuroimaging research (but see Tal Yarkoni & Braver, 2010). Effect sizes have been argued to allow verification that, in an optimal experimental design, inference is tuned to big effect sizes (Friston, 2012). For instance, to estimate the p value and the effect size for local changes in neural activity during a psychological task, one would actually need two independent samples of these experimental data. One sample would be used to perform statistical inference on the neural activity change and one sample to obtain unbiased effect sizes. It has been previously emphasized (Friston, 2012) that p values and effect sizes reflect in-sample estimates in a retrospective inference regime (CS). These metrics find an analogue in out-of-sample estimates issued from cross-validation in a prospective prediction regime (SL). In-sample effect sizes are typically an optimistic estimate of the "true" effect size (inflated by high significance thresholds), whereas out-of-sample effect sizes are unbiased estimates of the "true" effect size. As an important consequence, neuroimaging investigators should refrain from simultaneously computing and reporting both types of estimates on an identical data sample, as this can lead to double dipping (cf. case study two).
In the high-dimensional scenario, analyzing "wide" neuroimaging data in our case, judging statistical significance by p values is generally considered to be challenging (Bühlmann & Van De Geer, 2011;
see case studies two and three). Instead, classification accuracy on fresh data is probably the most often-reported performance metric in neuroimaging studies using learning algorithms. Basing interpretation on accuracy alone is however influenced by the local characteristics of hemodynamic responses, the efficiency of the experimental design, the data folding into train and test sets, and differences in the feature number p (J.-D. Haynes, 2015). A potentially under-exploited SL tool in this context is bootstrapping. It enables population-level inference of unknown distributions independent of model complexity by repeated random draws from the neuroimaging data sample at hand (Efron, 1979; Efron & Tibshirani, 1994). This opportunity to equip various point estimates with an interval estimate of certainty (e.g., the interval for the "true" accuracy of a classifier) is unfortunately seldom embraced in the contemporary neuroimaging domain (but see Bellec, Rosa-Neto, Lyttelton, Benali, & Evans, 2010; Vogelstein et al., 2014). Besides providing confidence intervals, bootstrapping can also perform non-parametric null hypothesis testing. This may be a rare example of a direct connection between CS and SL methodology. Alternatively, binomial tests can be used to obtain a p-value estimate of statistical significance from accuracies and other performance scores (Brodersen et al., 2013; Hanke, Halchenko, & Oosterhof, 2015; Pereira et al., 2009) in the binary classification setting (for an example see Bludau et al., 2015). It can reject the null hypothesis that two categories occur equally often. A last option that is also applicable to the multi-class setting is label permutation, another non-parametric resampling procedure (Golland & Fischl, 2003; T. E. Nichols & Holmes, 2002). It can serve to reject the null hypothesis that the neuroimaging data do not contain any information about the class labels.
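Both resampling ideas can be sketched on an invented record of classifier correctness (60 hits in 100 trials). The percentile bootstrap interval and the exact one-sided binomial test below follow standard textbook recipes rather than any particular neuroimaging toolbox.

```python
import random
from math import comb

random.seed(1)

# Invented per-trial correctness of a classifier: 60 hits in 100 trials.
outcomes = [1] * 60 + [0] * 40
random.shuffle(outcomes)

# Bootstrap: repeated random draws with replacement from the sample at hand.
boot_accs = []
for _ in range(2000):
    resample = [random.choice(outcomes) for _ in outcomes]
    boot_accs.append(sum(resample) / len(resample))
boot_accs.sort()
ci_low, ci_high = boot_accs[50], boot_accs[1949]  # ~95% percentile interval

# Exact binomial test: probability of >= 60 hits under 50% chance level.
n, hits = len(outcomes), sum(outcomes)
p_value = sum(comb(n, k) for k in range(hits, n + 1)) * 0.5 ** n
```

The interval estimate brackets the observed accuracy of 0.6, and the one-sided p value quantifies how surprising 60/100 correct predictions would be if the two categories occurred equally often.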
Extending from the setting of two hypotheses or yes-no classification to multiple classes injects ambiguity into the interpretation of accuracy scores. Rather than mere better-than-chance findings, it becomes more important to evaluate the F1, precision, and recall scores for each class to be predicted in the brain scans (e.g., Brodersen, Schofield, et al., 2011; Schwartz, Thirion, & Varoquaux, 2013). It is important to appreciate that the sensitivity/specificity metrics, more frequently reported in CS communities, and the precision/recall metrics, more frequently reported in SL communities, tell slightly different stories about identical neuroscientific findings. In fact, sensitivity equates with recall
(= true positive / (true positive + false negative)). Specificity (= true negative / (true negative + false positive)) does however not equate with precision (= true positive / (true positive + false positive)). Further, a CS view on the SL metrics would be that maximum precision corresponds to absent Type I errors (i.e., no false positives), whereas maximum recall corresponds to absent Type II errors (i.e., no false negatives). Again, Type I and II errors are related to the entirety of data points in a CS regime, whereas prediction is only evaluated on a test data split of the sample in a SL regime. In learning curves, a big gap between high in-sample and low out-of-sample performance is typically observed for high-variance models, such as neural network algorithms or random forests. These performance metrics from different data splits often converge for high-bias models, such as linear support vector machines and logistic regression. Moreover, the medical domain and social sciences usually aggregate results in ROC (receiver operating characteristic) curves plotting sensitivity against 1 - specificity, whereas engineering and computer science domains tend to report recall-precision curves instead (Davis & Goadrich, 2006; Demšar, 2006).
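A small numerical example, with invented confusion-matrix counts chosen to mimic a rare positive class, makes the divergence between specificity and precision tangible:

```python
def binary_metrics(tp, fp, fn, tn):
    """Sensitivity/specificity (CS habit) and precision (SL habit)
    from the four cells of a binary confusion matrix."""
    return {
        "recall_sensitivity": tp / (tp + fn),  # identical by definition
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Invented rare-class scenario: 12 patients among 1000 individuals.
m = binary_metrics(tp=8, fp=20, fn=4, tn=968)
```

Here specificity looks excellent (968/988) while precision is poor (8/28): the abundant true negatives flatter specificity, whereas precision exposes that most positive predictions are wrong, which is exactly why the two metric families can tell different stories about the same finding.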
Finally, it is often possible to inspect the fit for purpose of trained black-box models (Brodersen, Haiss, et al., 2011; Kuhn & Johnson, 2013). In SL this can take the form of evaluating some notion of support recovery, that is, the question to what extent the learning algorithms put probability mass on the parts of the feature space that are "truly" underlying the given classes. In neuroimaging, this pertains to the difference between models that capture task-specific aspects of neural activity or arbitrary discriminative aspects, such as structured noise in participant- or scanner-related idiosyncrasies (Brodersen, Haiss, et al., 2011; Varoquaux & Thirion, 2014). Such a face-validity criterion of meaningful model fit is all the more important because of the general trade-off in statistics between choosing models with the best possible performance and those with model parameters that are most interpretable (T. Hastie et al., 2001). Instead of maximizing prediction scores, neuroimaging investigators might want to focus on neurobiologically informed feature spaces and mechanistically interpretable model weights (Brodersen, Schofield, et al., 2011; Bzdok et al., 2015; K. E. Stephan, 2004). In fact, SL neuroimaging studies would perhaps benefit from a metric for neurobiological plausibility as an acid test during model selection. Reverse-engineering fitted models
by reconstruction of stimulus or task aspects (Miyawaki et al., 2008; Thomas Naselaris, Prenger, Kay, Oliver, & Gallant, 2009; Thirion et al., 2006) could become an important evaluation metric for why a learned feature-label mapping exhibits a certain model performance (Gabrieli et al., 2015). This may further disambiguate explanations for (statistically significant) better-than-chance accuracies because i) numerous, largely different members of a given function space may yield essentially similar model performance and ii) each fitted model may only capture a part of the class-relevant structure in the neuroimaging data (cf. Nikolaus Kriegeskorte, 2011). For instance, fit-for-purpose metrics used in neuroimaging could uncover what neurobiological aspects allow classifying an individual as schizophrenic versus normal and which neurobiological endophenotypes underlie the schizophrenia "spectrum" (Bzdok & Eickhoff, 2015; Hyman, 2007; Klaas E. Stephan et al., 2015; K. E. Stephan, Friston, & Frith, 2009).
6. Case study one: Generalization and subsequent classical inference

Vignette: We are interested in potential differences in brain structure that are associated with an individual's age (continuous target variable). A Lasso (belongs to the SL arsenal) is computed on the voxel-based morphometry data (Ashburner & Friston, 2000) from the brain's grey matter of the 500-subject HCP release (Human Connectome Project; Van Essen et al., 2012). This L1-penalized residual-sum-of-squares regression performs variable selection (i.e., effectively eliminates coefficients by setting them to zero) on all grey-matter voxels' volume information in a high-dimensional (not mass-univariate) regime. Assessing the generalization performance of different sparse models using five-fold cross-validation yields the non-zero coefficients for the few brain voxels whose volumetric information is most predictive of an individual's age.
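A minimal sketch of the vignette's estimator follows, with synthetic data standing in for the HCP grey-matter volumes and a hand-picked penalty lambda in place of the cross-validated model selection; the soft-thresholding coordinate descent shown is one standard way to fit the Lasso.

```python
import random

random.seed(0)

# Synthetic stand-in for the vignette: n subjects, p voxel-like features,
# where only the first feature truly carries age-related signal (assumed).
n, p = 40, 6
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
age = [2.0 * row[0] + random.gauss(0, 0.5) for row in X]

def soft_threshold(rho, lam):
    """The L1 penalty's characteristic operator: small effects become exactly zero."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, n_sweeps=100):
    """L1-penalized least squares fitted by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the partial residual (all other
            # features' current contributions removed).
            rho = sum(X[i][j] * (y_i - sum(X[i][k] * beta[k]
                                           for k in range(p) if k != j))
                      for i, y_i in enumerate(y))
            scale = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam) / scale
    return beta

beta = lasso(X, age, lam=20.0)
```

The penalty drives the coefficients of the noise features to exactly zero while the genuinely predictive feature survives with a shrunken weight, which is the automatic variable selection the vignette refers to.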
Question: How can we perform classical inference to know which of the grey-matter voxels selected to be predictive for biological age are statistically significant?
This is an important concern because most statistical methods currently applied to large datasets perform some explicit or implicit form of variable selection (Trevor Hastie et al., 2015; Jenatton et al., 2011; Jordan et al., 2013). There are even many different forms of preliminary selection of variables before performing significance tests on them. First, the Lasso is a widely used estimator in engineering, compressive sensing, various "omics" branches, and other sciences, mostly without a significance test. Beyond neuroscience, generalization-approved statistical learning models are routinely solving a diverse set of real-world challenges. This includes but is not limited to algorithmic trading in financial markets, real-time speech translation, SPAM filtering for e-mails, face recognition in digital cameras, and piloting self-driving cars (Jordan & Mitchell, 2015; LeCun et al., 2015). In all these examples statistical learning algorithms successfully generalize to unseen data and thus tackle the problem heuristically without a classical significance test for variables or model performance.

Second, the Lasso has solved the combinatorial problem of what subset of grey-matter voxels best predicts an individual's age by automatic variable selection. Computing voxel-wise p values would
recast this high-dimensional pattern-learning setting into a mass-univariate hypothesis-testing problem where relevance would be computed independently for each voxel and correction for multiple comparisons would become necessary. Yet, recasting into the mass-univariate setting would ignore the sophisticated selection process that led to the predictive model with a reduced number of variables (Wu et al., 2009). Put differently, the variable selection procedure is itself a stochastic process that is however not accounted for by the theoretical guarantees of classical inference for statistical significance (Berk, Brown, Buja, Zhang, & Zhao, 2013). Put in yet another way, data-driven model selection corrupts hypothesis-driven statistical inference because the sampling distribution of the parameter estimates is altered. The important consequence is that naive classical inference expects a non-adaptive model chosen before data acquisition and can therefore not be used alongside the Lasso in particular or arbitrary selection procedures in general¹⁰.
Third, this conflict between data-guided model selection by cross-validation (SL) and confirmatory classical inference (CS) is currently at the frontier of statistical development (Loftus, 2015; J. Taylor & Tibshirani, 2015). New methods for so-called post-selection inference (or selective inference) allow computing p values for a set of features that have previously been chosen to be meaningful predictors by some criterion. According to the theory of CS, the statistical model is to be chosen before visiting the data. Classical statistical tests and confidence intervals therefore become invalid and the p values become downward-biased (Berk et al., 2013). Consequently, the association between a predictor and the target variable must be even stronger to certify the same level of significance. Selective inference for modern adaptive regression thus replaces loose naive p values by more rigorous selection-adjusted p values. As an ordinary null hypothesis can hardly be adopted in this adaptive testing setting, conceptual extension is also prompted on the level of CS theory itself (Trevor Hastie et al., 2015). Closed-form solutions to adjusted inference after variable selection already exist for principal component analysis (Choi, Taylor, & Tibshirani, 2014) and forward stepwise
regression (Jonathan Taylor, Lockhart, Tibshirani, & Tibshirani, 2014). Last but not least, a simple alternative to formally account for preceding model selection is data splitting or sample splitting (D. R. Cox, 1975; Fithian et al., 2014; Wasserman & Roeder, 2009), which is frequent practice in genetics (e.g., Sladek et al., 2007). In this procedure, the selection procedure is computed on one data split and p values are computed on the remaining second data split. However, data splitting is not always possible and will incur power losses and interpretation problems.

¹⁰ "Once applied only to the selected few, the interpretation of the usual measures of uncertainty do not remain intact directly, unless properly adjusted." (Yoav Benjamini)
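The data-splitting recipe can be sketched as follows; toy data replace the brain measurements, selection is reduced to picking the best-correlated feature, and a non-parametric permutation test stands in for a parametric p value on the held-out half.

```python
import random

random.seed(7)

# Toy data: one genuinely predictive feature and one noise feature.
n = 100
signal = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
target = [s + random.gauss(0, 0.7) for s in signal]
features = {"signal": signal, "noise": noise}

def corr(a, b):
    """Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

half = n // 2
# Data split 1: adaptive variable selection (pick the best-correlated feature).
chosen = max(features, key=lambda f: abs(corr(features[f][:half], target[:half])))
# Data split 2: inference on the held-out half only, via permutation test.
x2, y2 = features[chosen][half:], target[half:]
observed = abs(corr(x2, y2))
perms, exceed = 999, 0
for _ in range(perms):
    shuffled = random.sample(y2, len(y2))
    exceed += abs(corr(x2, shuffled)) >= observed
p_value = (exceed + 1) / (perms + 1)
```

Because the p value is computed on data the selection step never saw, it is not corrupted by the adaptive choice of the feature.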
7. Case study two: Classical inference and subsequent generalization
Vignette: We are interested in potential brain structure differences that are associated with an individual's gender (categorical target variable) in the voxel-based morphometry data (Ashburner & Friston, 2000) of the 500-subject HCP release (Human Connectome Project; Van Essen et al., 2012). Initially, the >100,000 voxels per brain scan are reduced to the most important 10,000 voxels to lower the computational cost and facilitate predictive model estimation. To this end, ANOVA (a univariate test for statistical significance belonging to CS) is first used to obtain a ranking of the most relevant 10,000 features from the grey matter of each subject. This selects the 10,000 out of the original >100,000 voxel variables with the highest variance explaining volume differences between males and females (i.e., the male-female class labels are used in the univariate test). Second, support vector machine classification (a "multivariate" pattern-learning algorithm belonging to SL) is performed by training and testing on a feature space with the 10,000 preselected grey-matter measurements to predict the gender from each subject's brain scan.
Question: Is an analysis pipeline with univariate classical inference and subsequent high-dimensional prediction valid if both steps rely on the same target variables?
The implications of feature engineering procedures applied before training a learning algorithm are a frequent concern and can have very subtle answers (I. Guyon & Elisseeff, 2003; Hanke et al., 2015; N. Kriegeskorte et al., 2009; Lemm, Blankertz, Dickhaus, & Muller, 2011). In most applications of predictive models the large majority of brain voxels will be uninformative (Brodersen, Haiss, et al.,
2011). The described scenario of dimensionality reduction by feature selection to focus prediction is clearly allowed under the condition that the ANOVA is not computed on the entire data sample. Rather, the voxels explaining most variance between the male and female individuals should be computed only on the training set in each cross-validation fold. In the training set and test set of each fold, the same identified candidate voxels are then regrouped into a feature space that is fed into the support vector machine algorithm. This ensures an identical feature space for model training and model testing, but its construction only depends on structural brain scans from the training set. Generally, any voxel preprocessing prior to model training is authorized if the feature space construction is not influenced by properties of the concealed test set. In the present scenario, the Vapnik-Chervonenkis bounds of the cross-validation estimator are therefore neither loosened nor invalidated when class labels are exploited for feature selection, regardless of whether the feature selection procedure is univariate or multivariate. Put differently, the cross-validation procedure simply evaluates the entire prediction process, including the automatized and potentially nested dimensionality reduction procedure. In sum, in a SL regime, using class information during feature preprocessing for a cross-validated supervised estimator is not an instance of data-snooping (or peeking) if done exclusively on the training set (Abu-Mostafa et al., 2012).
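The fold-wise construction described above can be sketched in plain Python. The ANOVA ranking is simplified here to a class-mean difference score and the support vector machine to a nearest-centroid rule (both simplifications are assumptions of this sketch), but the decisive property is preserved: label-guided feature selection touches only the training indices of each fold.

```python
import random

random.seed(3)

# Toy sample: p features per "scan"; only feature 0 differs between classes.
n, p = 40, 20
y = [0] * 20 + [1] * 20
X = [[random.gauss(2.0 * label if j == 0 else 0.0, 1) for j in range(p)]
     for label in y]

def class_separation(X, y, idx, j):
    """Univariate class separation of feature j on the given indices only."""
    a = [X[i][j] for i in idx if y[i] == 0]
    b = [X[i][j] for i in idx if y[i] == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

order = list(range(n))
random.shuffle(order)
k, n_keep, correct = 5, 3, 0
for fold in range(k):
    test_idx = order[fold::k]
    train_idx = [i for i in order if i not in set(test_idx)]
    # Feature selection uses the class labels, but ONLY on the training split.
    kept = sorted(range(p),
                  key=lambda j: class_separation(X, y, train_idx, j),
                  reverse=True)[:n_keep]
    # Nearest-centroid "classifier" trained on the selected feature space.
    cents = {c: [sum(X[i][j] for i in train_idx if y[i] == c)
                 / sum(1 for i in train_idx if y[i] == c) for j in kept]
             for c in (0, 1)}
    for i in test_idx:
        d = {c: sum((X[i][j] - cents[c][a]) ** 2 for a, j in enumerate(kept))
             for c in (0, 1)}
        correct += min(d, key=d.get) == y[i]
accuracy = correct / n
```

Had the selection score been computed on all n samples before cross-validation, information from the test splits would leak into the feature space, which is exactly the peeking the paragraph rules out.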
This is an advantage of cross-validation yielding out-of-sample estimates. In stark contrast, remember that null-hypothesis testing yields in-sample estimates. Using the class labels for a variable selection step just before statistical hypothesis testing on the same data sample would invalidate the null hypothesis (N. Kriegeskorte et al., 2010; N. Kriegeskorte et al., 2009) (cf. case study one). Consequently, in a CS regime, using class information to select variables before null-hypothesis testing will incur an instance of double-dipping (or circular analysis).
Regarding interpretation of the results, the classifier will miss some brain voxels that only carry relevant information when considered in voxel ensembles. This is because the ANOVA filter kept voxels that are independently relevant (Brodersen, Haiss, et al., 2011). Univariate feature selection may systematically encourage the selection of models (i.e., each weight combination equates with a model hypothesis from the classifier's function space) that are not neurobiologically meaningful. Concretely, in the discussed scenario the classifier learns complex patterns between voxels that were chosen to be individually important. This may considerably weaken the interpretability of and conclusions on "whole-brain multivariate patterns". Remember also that variables that have a statistically significant association with a target variable do not necessarily have good generalization performance, and vice versa (Lo et al., 2015; Shmueli, 2010). On the upside, it is widely believed that the combination of whole-brain univariate feature selection and linear classification is frequently among the best approaches if the primary goal is optimized prediction performance as opposed to optimized interpretability. Finally, it is interesting to consider that ANOVA-mediated feature selection of p < 500 voxel variables reduces the "wide" neuroimaging data ("n << p" setting) down to "long" neuroimaging data with fewer features than observations ("n > p" setting) given the n = 500 subjects (Wainwright, 2014). This allows recasting the SL regime into a CS regime in order to fit a standard general linear model and perform classical null-hypothesis testing instead of training a predictive classification algorithm (Brodersen, Haiss, et al., 2011).
8. Case study three: Structure discovery by clustering algorithms
Vignette: Each functionally specialized region in the human brain probably has a unique set of long-range connections (Passingham, Stephan, & Kotter, 2002). This notion has prompted connectivity-based parcellation methods in neuroimaging that segregate a region of interest (ROI, which can be locally circumscribed or brain-global; S. B. Eickhoff, Thirion, Varoquaux, & Bzdok, 2015) into distinct cortical modules (Behrens et al., 2003). The whole-brain connectivity for each ROI voxel is computed and the voxel-wise connectional fingerprints are submitted to a clustering algorithm (i.e., the individual brain voxels in the ROI are the elements to group; the connectivity strength values are the features of each element for similarity assessment). In this way, connectivity-based parcellation yields cortical modules in the specified ROI that exhibit similar connectivity patterns and are thus, potentially, functionally distinct. That is, voxels within the same cluster in the ROI will have more similar connectivity properties than voxels from different ROI clusters.
Question: Is it possible to decide whether the obtained brain clusters are statistically significant?
Essentially, the aim of connectivity-guided brain parcellation is to find useful, simplified structure by imposing discrete compartments on brain topography (Frackowiak & Markram, 2015; S. M. Smith et al., 2013; Yeo et al., 2011). This is typically achieved by k-means, hierarchical, Ward, or spectral clustering algorithms (Thirion, Varoquaux, Dohmatob, & Poline, 2014). Putting on the CS hat, a ROI clustering result would be deemed statistically significant if it has a very low probability of being "true" under the null hypothesis that the investigator seeks to reject (Everitt, 1979; Halkidi, Batistakis, & Vazirgiannis, 2001). Choosing a test statistic for clustering solutions to obtain p values is difficult (Vogelstein et al., 2014) because of the need for a meaningful null hypothesis to test against (Jain, Murty, & Flynn, 1999). Put differently, for hypothesis-driven statistical inference one may need to pick an arbitrary hypothesis to falsify. It follows that the CS notions of effect size and power do not seem to apply in the case of brain parcellation. Instead of classical inference to formally test for a particular structure in the clustering results, we actually need to resort to exploratory approaches that discover and assess structure in the neuroimaging data (Efron & Tibshirani, 1991; T. Hastie et al., 2001; Tukey, 1962). Although statistical methods span a continuum between the two poles of CS and SL, finding a clustering model with the highest fit in the sense of explaining the regional connectivity differences at hand is more naturally situated in the SL community.
Putting on the SL hat, we realize that the problem of brain parcellation constitutes an unsupervised learning setting without any target variable y to predict (e.g., cognitive tasks, the age or gender of the participants). The learning task is therefore not to estimate a supervised predictive model y = f(X), but to estimate an unsupervised descriptive model for the connectivity data X themselves. Solving such unsupervised estimation problems is generally recognized to be very hard (Bishop, 2006; Zoubin Ghahramani, 2004; T. Hastie et al., 2001). In clustering problems, there are many possible
transformations, projections, and compressions of X, but there is no criterion of optimality that clearly suggests itself. On the one hand, the "true" shape of clusters is unknown for most real-world clustering problems, including brain parcellation studies. On the other hand, finding an "optimal" number of clusters represents an unresolved issue (the cluster validity problem) in statistics in general and in brain neuroimaging in particular (Handl, Knowles, & Kell, 2005; Jain et al., 1999). Evaluating the model fit of clustering results is conventionally addressed by heuristic cluster validity criteria (S. B. Eickhoff et al., 2015; Thirion et al., 2014). These are necessary because clustering algorithms will always find subregions in the investigator's ROI, that is, relevant structure according to the clustering algorithm's optimization objective, whether these truly exist in nature or not. There is a variety of such criteria based on information theory, topology, and consistency. They commonly encourage cluster solutions with low within-cluster and high between-cluster differences, regardless of the applied clustering algorithm.
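A compact sketch of this within/between-cluster logic, using k-means on toy two-dimensional "fingerprints" and the within-cluster sum of squares as a simple validity criterion; the data, the deterministic initialization, and the choice of criterion are illustrative assumptions.

```python
import random

random.seed(5)

# Toy "connectivity fingerprints": two well-separated 2-D groups of voxels.
points = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20)] +
          [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(20)])

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with centroids initialized across the ordered sample."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for pt in points:
            d = [sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centroids]
            clusters[d.index(min(d))].append(pt)
        centroids = [tuple(sum(coords) / len(c) for coords in zip(*c))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    # Within-cluster sum of squares: the validity criterion of this sketch.
    wcss = sum(min(sum((a - b) ** 2 for a, b in zip(pt, c)) for c in centroids)
               for pt in points)
    return centroids, wcss

wcss_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
```

The criterion drops sharply from k = 1 to k = 2 (the planted structure) and only marginally thereafter, illustrating why such heuristics always prefer more clusters to some degree and why the "optimal" k remains a judgment call.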
Evidently, the discovered connectivity clusters are mere hints to candidate brain modules. Their "existence" in neurobiology requires further scrutiny (S. B. Eickhoff et al., 2015; Thirion et al., 2014). Nevertheless, such clustering solutions are an important means to narrow down high-dimensional neuroimaging data. Preliminary clustering results broaden the space of research hypotheses that the investigator can articulate. For instance, the unexpected discovery of a candidate brain region (cf. Mars et al., 2012; zu Eulenburg, Caspers, Roski, & Eickhoff, 2012) can provide an argument for future experimental investigations. Brain parcellation can thus be viewed as an exploratory unsupervised method outlining relevant structure in neuroimaging data that can subsequently be formally tested as a concrete hypothesis in neuroimaging studies whose interpretations are based on classical inference.
9. Conclusion
A novel scientific fact about the brain is only valid in the context of the complexity restrictions that have been imposed on the studied phenomenon during the investigation (Box, 1976). The statistical arsenal of the imaging neuroscientist can be divided into classical inference by hypothesis falsification and increasingly used generalization inference by extrapolating complex patterns. While null-hypothesis testing has been dominating the academic milieu for several decades, statistical learning methods are prevalent in many data-intensive branches of industry (Vanderplas, 2013). This sociological segregation may partly explain existing confusion about their mutual relationship. Despite diverging historical trajectories and theoretical foundations, both statistical cultures aim at extracting new knowledge from data using mathematical models (Friston et al., 2008; Jordan et al., 2013). However, an observed effect with a statistically significant p value does not necessarily generalize to future data samples. Conversely, an effect with successful out-of-sample generalization is not necessarily statistically significant. The distributional properties of an effect important for high statistical significance and for successful generalization are not identical (Lo et al., 2015). Additionally, classical inference is a judgment about an entire data sample, whereas predictive inference can be applied to single data points. The goal and permissible conclusions of a formal inference are therefore conditioned by the adopted statistical framework (Feyerabend, 1975). This is routinely exploited in drug development cycles, with predictive inference for the early discovery phase and classical inference for the clinical trials phase. A similar back and forth between applying inductive learning algorithms and deductive hypothesis testing of the discovered candidate structure could and should also become routine in imaging neuroscience. Awareness of the discussed cultural gap is important to keep pace with the increasing information granularity of acquired neuroimaging repositories. Ultimately, statistical inference is a heterogeneous concept.
Acknowledgments

The present paper did not result from isolated contemplations by a single person. Rather, it emerged from several thought milieus with different thought styles and opinion systems. The author cordially thanks the following people for valuable discussion and precious contributions to the present paper (in alphabetical order): Olavo Amaral, Patrícia Bado, Jérémy Lefort-Besnard, Kay Brodersen, Elvis Dohmatob, Guillaume Dumas, Michael Eickenberg, Simon Eickhoff, Denis Engemann, Can Ergen, Alexandre Gramfort, Olivier Grisel, Carl Hacker, Michael Hanke, Lukas Hensel, Thilo Kellermann, Jean-Rémi King, Robert Langner, Daniel Margulies, Jorge Moll, Zeinab Mousavi, Carolin Mößnang, Rico Pohling, Andrew Reid, João Sato, Bertrand Thirion, Gaël Varoquaux, Marcel van Gerven, Virginie van Wassenhove, Klaus Willmes, Karl Zilles.
References
Abu-Mostafa,Y.S.,Magdon-Ismail,M.,&Lin,H.T.(2012).Learningfromdata.California:AMLBook.
Amunts, K., Lepage,C., Borgeat, L.,Mohlberg,H.,Dickscheid, T., Rousseau,M. E., . . . Evans,A. C.(2013).BigBrain:anultrahigh-resolution3Dhumanbrainmodel.Science,340(6139),1472-1475.doi:10.1126/science.1235381
Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: problems,prevalence,andanalternative.Thejournalofwildlifemanagement,912-923.
Ashburner,J.,&Friston,K.J.(2000).Voxel-basedmorphometry-themethods.Neuroimage,11,805-821.
Ashburner,J.,&Friston,K.J.(2005).Unifiedsegmentation.Neuroimage,26(3),839-851.doi:S1053-8119(05)00110-2[pii]
10.1016/j.neuroimage.2005.02.018
Ashburner, J., & Kloppel, S. (2011). Multivariate models of inter-subject anatomical variability.Neuroimage,56(2),422-439.doi:10.1016/j.neuroimage.2010.03.059
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding andcomputation.NatRevNeurosci,7(5),358-366.doi:10.1038/nrn1888
Bach, F. (2014). Breaking the curse of dimensionality with convex neural networks. arXiv preprintarXiv:1412.8690.
Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2012). Optimization with sparsity-inducingpenalties.FoundationsandTrends®inMachineLearning,4(1),1-106.
Bates,E.,Wilson,S.M.,Saygin,A.P.,Dick,F.,Sereno,M. I.,Knight,R.T.,&Dronkers,N.F. (2003).Voxel-basedlesion-symptommapping.NatNeurosci,6(5),448-450.doi:10.1038/nn1050
Behrens,T.E.,Johansen-Berg,H.,Woolrich,M.W.,Smith,S.M.,Wheeler-Kingshott,C.A.,Boulby,P.A.,...Matthews,P.M.(2003).Non-invasivemappingofconnectionsbetweenhumanthalamusandcortexusingdiffusionimaging.NatNeurosci,6(7),750-757.doi:10.1038/nn1075
nn1075[pii]
Bellec, P., Rosa-Neto, P., Lyttelton, O. C., Benali, H., & Evans, A. C. (2010). Multi-level bootstrapanalysis of stable clusters in resting-state fMRI. Neuroimage, 51(3), 1126-1139. doi:10.1016/j.neuroimage.2010.02.082
Bellman,R.E.(1961).Adaptivecontrolprocesses:aguidedtour(Vol.4):PrincetonUniversityPress.
Bengio, Y. (2014). Evolving culture versus localminimaGrowingAdaptiveMachines (pp. 109-138):Springer.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and newperspectives.PatternAnalysisandMachineIntelligence,IEEETransactionson,35(8),1798-1828.
Berk,R.,Brown,L.,Buja,A.,Zhang,K.,&Zhao,L.(2013).Validpost-selectioninference.TheAnnalsofStatistics,41(2),802-837.
Page 48
48
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526-536.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Heidelberg: Springer.
Bludau, S., Bzdok, D., Gruber, O., Kohn, N., Riedl, V., Sorg, C., ... Amunts, K. (2015). Medial Prefrontal Aberrations in Major Depressive Disorder Revealed by Cytoarchitectonically Informed Voxel-Based Morphometry. American Journal of Psychiatry.
Bortz, J. (2006). Statistik: Für Human- und Sozialwissenschaftler: Springer-Verlag.
Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791-799.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization: Cambridge University Press.
Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16(3), 199-231.
Breiman, L., & Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review / Revue Internationale de Statistique, 291-319.
Broca, P. (1865). Sur la faculté du langage articulé. Bulletins et Mémoires de la Société d'Anthropologie de Paris, 6, 377-393.
Brodersen, K. H. (2009). Decoding mental activity from neuroimaging data: the science behind mind-reading. The New Collection, Oxford, 4, 50-61.
Brodersen, K. H., Daunizeau, J., Mathys, C., Chumbley, J. R., Buhmann, J. M., & Stephan, K. E. (2013). Variational Bayesian mixed-effects inference for classification studies. Neuroimage, 76, 345-361. doi:10.1016/j.neuroimage.2013.03.008
Brodersen, K. H., Haiss, F., Ong, C. S., Jung, F., Tittgemeyer, M., Buhmann, J. M., ... Stephan, K. E. (2011). Model-based feature construction for multivariate decoding. Neuroimage, 56(2), 601-615. doi:10.1016/j.neuroimage.2010.04.036
Brodersen, K. H., Schofield, T. M., Leff, A. P., Ong, C. S., Lomakina, E. I., Buhmann, J. M., & Stephan, K. E. (2011). Generative embedding for model-based classification of fMRI data. PLoS Comput Biol, 7(6), e1002079. doi:10.1371/journal.pcbi.1002079
Brodmann, K. (1909). Vergleichende Lokalisationslehre der Großhirnrinde. Leipzig: Barth.
Bühlmann, P., & Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications: Springer Science & Business Media.
Burnham, K. P., & Anderson, D. R. (2014). P values are only an index to evidence: 20th- vs. 21st-century statistical science. Ecology, 95(3), 627-630.
Bzdok, D., Eickenberg, M., Grisel, O., Thirion, B., & Varoquaux, G. (2015). Semi-Supervised Factored Logistic Regression for High-Dimensional Neuroimaging Data. Paper presented at the Advances in Neural Information Processing Systems.
Bzdok, D., & Eickhoff, S. B. (2015). Statistical Learning of the Neurobiology of Schizophrenia. In T. Abel & T. Nickl-Jockschat (Eds.), The Neurobiology of Schizophrenia: Springer.
Chamberlin, T. C. (1890). The Method of Multiple Working Hypotheses. Science, 15(366), 92-96. doi:10.1126/science.148.3671.754
Chambers, J. M. (1993). Greater or lesser statistics: a choice for future research. Statistics and Computing, 3(4), 182-184.
Choi, Y., Taylor, J., & Tibshirani, R. (2014). Selecting the number of principal components: estimation of the true rank of a noisy matrix. arXiv preprint arXiv:1410.8260.
Chow, S. L. (1998). Precis of statistical significance: rationale, validity, and utility. Behav Brain Sci, 21(2), 169-194; discussion 194-239.
Chumbley, J. R., & Friston, K. J. (2009). False discovery rate revisited: FDR and topological inference using Gaussian random fields. Neuroimage, 44, 62-70.
Clark, W. G., Del Giudice, J., & Aghajanian, G. K. (1970). Principles of psychopharmacology: a textbook for physicians, medical students, and behavioral scientists: Academic Press Inc.
Cleveland, W. S. (2001). Data science: an action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21-26.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.): Lawrence Erlbaum Associates, Inc.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155.
Cohen, J. (1994). The Earth Is Round (p < .05). American Psychologist, 49(12), 997-1003.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Cowles, M., & Davis, C. (1982). On the Origins of the .05 Level of Statistical Significance. American Psychologist, 37(5), 553-558.
Cox, D. D., & Dean, T. (2014). Neural networks and neuroscience-inspired computer vision. Curr Biol, 24(18), R921-929. doi:10.1016/j.cub.2014.08.026
Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika, 62(2), 441-444.
Cumming, G. (2009). Inference by eye: reading the overlap of independent confidence intervals. Stat Med, 28(2), 205-220. doi:10.1002/sim.3471
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Paper presented at the Proceedings of the 23rd International Conference on Machine Learning.
de Brebisson, A., & Montana, G. (2015). Deep Neural Networks for Anatomical Brain Segmentation. arXiv preprint arXiv:1502.02445.
de-Wit, L., Alexander, D., Ekroll, V., & Wagemans, J. (2016). Is neuroimaging measuring information in the brain? Psychonomic Bulletin & Review, 1-14.
Deisseroth, K. (2015). Optogenetics: 10 years of microbial opsins in neuroscience. Nat Neurosci, 18(9), 1213-1225. doi:10.1038/nn.4091
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1-30.
Devor, A., Bandettini, P. A., Boas, D. A., Bower, J. M., Buxton, R. B., Cohen, L. B., ... Franceschini, M. A. (2013). The challenge of connecting the dots in the BRAIN. Neuron, 80(2), 270-274.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York, NY: Springer.
Domingos, P. (2012). A Few Useful Things to Know about Machine Learning. Communications of the ACM, 55(10), 78-87.
Donoho, D. (2015). 50 years of Data Science. Tukey Centennial workshop.
Efron, B. (1978). Controversies in the foundations of statistics. American Mathematical Monthly, 231-246.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 1-26.
Efron, B. (2005). Modern science and the Bayesian-frequentist controversy: Division of Biostatistics, Stanford University.
Efron, B., & Hastie, T. (2016). Computer-Age Statistical Inference.
Efron, B., & Tibshirani, R. J. (1991). Statistical data analysis in the computer age. Science, 253(5018), 390-395. doi:10.1126/science.253.5018.390
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap: CRC Press.
Eickhoff, S., Turner, J. A., Nichols, T. E., & Van Horn, J. D. (2016). Sharing the wealth: Neuroimaging data repositories. NeuroImage, 124, 1065-1068.
Eickhoff, S. B., Bzdok, D., Laird, A. R., Roski, C., Caspers, S., Zilles, K., & Fox, P. T. (2011). Co-activation patterns distinguish cortical modules, their connectivity and functional differentiation. Neuroimage, 57(3), 938-949. doi:10.1016/j.neuroimage.2011.05.021
Eickhoff, S. B., Thirion, B., Varoquaux, G., & Bzdok, D. (2015). Connectivity-based parcellation: Critique and implications. Hum Brain Mapp. doi:10.1002/hbm.22933
Engemann, D. A., & Gramfort, A. (2015). Automated model selection in covariance estimation and spatial whitening of MEG and EEG signals. NeuroImage, 108, 328-342.
Estes, W. K. (1997). On the communication of information by displays of standard errors and confidence intervals. Psychonomic Bulletin & Review, 4(3), 330-341.
Everitt, B. S. (1979). Unresolved Problems in Cluster Analysis. Biometrics, 35(1), 169-181. doi:10.2307/2529943
Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional Psychology: Research and Practice, 40(5), 532.
Feyerabend, P. (1975). Against Method: Outline of an Anarchist Theory of Knowledge. London: New Left Books.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver and Boyd.
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A., & Mackenzie, W. A. (1923). Studies in crop variation. II. The manurial response of different potato varieties. The Journal of Agricultural Science, 13(03), 311-320.
Fithian, W., Sun, D., & Taylor, J. (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.
Fleck, L., Schäfer, L., & Schnelle, T. (1935). Entstehung und Entwicklung einer wissenschaftlichen Tatsache: Schwabe Basel.
Fox, P. T., & Raichle, M. E. (1986). Focal physiological uncoupling of cerebral blood flow and oxidative metabolism during somatosensory stimulation in human subjects. Proc Natl Acad Sci U S A, 83, 1140-1144.
Frackowiak, R., & Markram, H. (2015). The future of human cerebral cartography: a novel approach. Philos Trans R Soc Lond B Biol Sci, 370(1668). doi:10.1098/rstb.2014.0171
Freedman, D. (1995). Some issues in the foundation of statistics. Foundations of Science, 1(1), 19-39.
Friedman, J. H. (1998). Data Mining and Statistics: What's the connection? Computing Science and Statistics, 29(1), 3-9.
Friedman, J. H. (2001). The role of statistics in the data revolution? International Statistical Review / Revue Internationale de Statistique, 5-10.
Friston, K. J. (2006). Statistical parametric mapping: The analysis of functional brain images. Amsterdam: Academic Press.
Friston, K. J. (2009). Modalities, modes, and models in functional neuroimaging. Science, 326(5951), 399-403. doi:10.1126/science.1174521
Friston, K. J. (2012). Ten ironic rules for non-statistical reviewers. Neuroimage, 61(4), 1300-1310. doi:10.1016/j.neuroimage.2012.04.018
Friston, K. J., Chu, C., Mourao-Miranda, J., Hulme, O., Rees, G., Penny, W., & Ashburner, J. (2008). Bayesian decoding of brain images. Neuroimage, 39(1), 181-205. doi:10.1016/j.neuroimage.2007.08.013
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J. P., Frith, C. D., & Frackowiak, R. S. (1994). Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp, 2(4), 189-210.
Friston, K. J., Liddle, P. F., Frith, C. D., Hirsch, S. R., & Frackowiak, R. S. J. (1992). The left medial temporal region and schizophrenia. Brain, 115, 367-382.
Friston, K. J., Price, C. J., Fletcher, P., Moore, C., Frackowiak, R. S. J., & Dolan, R. J. (1996). The trouble with cognitive subtraction. Neuroimage, 4(2), 97-104.
Gabrieli, J. D., Ghosh, S. S., & Whitfield-Gabrieli, S. (2015). Prediction as a humanitarian and pragmatic contribution from human cognitive neuroscience. Neuron, 85(1), 11-26. doi:10.1016/j.neuron.2014.10.047
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage, 15(4), 870-878.
Ghahramani, Z. (2004). Unsupervised learning. Advanced Lectures on Machine Learning (pp. 72-112): Springer.
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553), 452-459. doi:10.1038/nature14541
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. A handbook for data analysis in the behavioral sciences: Methodological issues, 311-339.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Giraud, C. (2014). Introduction to high-dimensional statistics: CRC Press.
Golland, P., & Fischl, B. (2003). Permutation tests for classification: towards statistical significance in image-based studies. Paper presented at the Information Processing in Medical Imaging.
Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130(12), 995-1004.
Gorgolewski, K. J., Varoquaux, G., Rivera, G., Schwarz, Y., Ghosh, S. S., Maumet, C., ... Margulies, D. S. (2014). NeuroVault.org: A web-based repository for collecting and sharing unthresholded statistical maps of the human brain. In press.
Grady, C. L., Haxby, J. V., Schapiro, M. B., Gonzalez-Aviles, A., Kumar, A., Ball, M. J., ... Rapoport, S. I. (1990). Subgroups in dementia of the Alzheimer type identified using positron emission tomography. J Neuropsychiatry Clin Neurosci, 2(4), 373-384. doi:10.1176/jnp.2.4.373
Gramfort, A., Strohmeier, D., Haueisen, J., Hamalainen, M., & Kowalski, M. (2011). Functional brain imaging with M/EEG using structured sparsity in time-frequency dictionaries. Paper presented at the Information Processing in Medical Imaging.
Gramfort, A., Thirion, B., & Varoquaux, G. (2013). Identifying predictive regions from fMRI with TV-L1 prior. Paper presented at the Pattern Recognition in Neuroimaging (PRNI), 2013 International Workshop on.
Greenwald, A. G. (2012). There is nothing so theoretical as a good method. Perspectives on Psychological Science, 7(2), 99-108.
Güçlü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. The Journal of Neuroscience, 35(27), 10005-10014.
Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
Halevy, A., Norvig, P., & Pereira, F. (2009). The Unreasonable Effectiveness of Data. Intelligent Systems, IEEE, 24(2), 8-12.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2), 107-145.
Hall, E. T. (1989). Beyond culture: Anchor.
Handl, J., Knowles, J., & Kell, D. B. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15), 3201-3212. doi:10.1093/bioinformatics/bti517
Hanke, M., Halchenko, Y. O., & Oosterhof, N. N. (2015). PyMVPA Manual.
Hanson, S. J., & Halchenko, Y. O. (2008). Brain Reading Using Full Brain Support Vector Machines for Object Recognition: There Is No "Face" Identification Area. Neural Comput, 20, 486-503.
Harlow, J. M. (1848). Passage of an iron rod through the head. Boston Medical and Surgical Journal, 39(20), 389-393.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning. Heidelberg, Germany: Springer Series in Statistics.
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations: CRC Press.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539), 2425-2430.
Haynes, J.-D. (2015). A primer on pattern-based approaches to fMRI: Principles, pitfalls, and perspectives. Neuron, 87(2), 257-270.
Haynes, J. D., & Rees, G. (2005). Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nat Neurosci, 8(5), 686-691.
Haynes, J. D., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nat Rev Neurosci, 7(7), 523-534. doi:10.1038/nrn1931
Herndon, R. C., Lancaster, J. L., Toga, A. W., & Fox, P. T. (1996). Quantification of white matter and gray matter volumes from T1 parametric images using fuzzy classifiers. J Magn Reson Imaging, 6(3), 425-435.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "Wake-Sleep" algorithm for unsupervised neural networks. Science, 268, 1158-1161.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
House of Commons Science and Technology Committee. (2016). The big data dilemma. UK.
Hyman, S. E. (2007). Can neuroscience be integrated into the DSM-V? Nat Rev Neurosci, 8(9), 725-732. doi:10.1038/nrn2218
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112): Springer.
Jenatton, R., Audibert, J.-Y., & Bach, F. (2011). Structured variable selection with sparsity-inducing norms. The Journal of Machine Learning Research, 12, 2777-2824.
Jordan, M. I. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Technical Report 9503.
Jordan, M. I. (2010). Bayesian nonparametric learning: Expressive priors for intelligent systems. Heuristics, Probability and Causality: A Tribute to Judea Pearl, 11, 167-185.
Jordan, M. I., Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, & National Research Council. (2013). Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press.
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nat Neurosci, 8(5), 679-685.
Kandel, E. R., Markram, H., Matthews, P. M., Yuste, R., & Koch, C. (2013). Neuroscience thinks big (and collaboratively). Nature Reviews Neuroscience, 14(9), 659-664.
Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17(2), 137.
King, J. R., & Dehaene, S. (2014). Characterizing the dynamics of mental representations: the temporal generalization method. Trends in Cognitive Sciences, 18(4), 203-210.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. (2012). The big data bootstrap. Proceedings of the 29th International Conference on Machine Learning (ICML).
Knops, A., Thirion, B., Hubbard, E. M., Michel, V., & Dehaene, S. (2009). Recruitment of an area involved in eye movements during mental arithmetic. Science, 324(5934), 1583-1585. doi:10.1126/science.1171599
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Paper presented at the IJCAI.
Kriegeskorte, N. (2011). Pattern-information analysis: from stimulus decoding to computational-model testing. Neuroimage, 56(2), 411-421.
Kriegeskorte, N., Goebel, R., & Bandettini, P. (2006). Information-based functional brain mapping. Proc Natl Acad Sci U S A, 103(10), 3863-3868. doi:10.1073/pnas.0600244103
Kriegeskorte, N., Lindquist, M. A., Nichols, T. E., Poldrack, R. A., & Vul, E. (2010). Everything you never wanted to know about circular analysis, but were afraid to ask. J Cereb Blood Flow Metab, 30(9), 1551-1557. doi:10.1038/jcbfm.2010.86
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S., & Baker, C. I. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nat Neurosci, 12(5), 535-540. doi:10.1038/nn.2303
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling: Springer.
Kurzweil, R. (2005). The singularity is near: When humans transcend biology: Penguin.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338. doi:10.1126/science.aab3050
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. doi:10.1038/nature14539
Lemm, S., Blankertz, B., Dickhaus, T., & Muller, K. R. (2011). Introduction to machine learning for brain imaging. Neuroimage, 56(2), 387-399. doi:10.1016/j.neuroimage.2010.11.004
Lieberman, M. D., Berkman, E. T., & Wager, T. D. (2009). Correlations in Social Neuroscience Aren't Voodoo: Commentary on Vul et al. Perspectives on Psychological Science, 4(3).
Lo, A., Chernoff, H., Zheng, T., & Lo, S. H. (2015). Why significant variables aren't automatically good predictors. Proc Natl Acad Sci U S A, 112(45), 13892-13897. doi:10.1073/pnas.1518285112
Loftus, J. R. (2015). Selective inference after cross-validation. arXiv preprint arXiv:1511.08866.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute.
Markram, H. (2006). The Blue Brain Project. Nat Rev Neurosci, 7, 153-160.
Markram, H. (2012). The human brain project. Scientific American, 306(6), 50-55.
Markus, K. A. (2001). The Converse Inequality Argument Against Tests of Statistical Significance. Psychol. Methods, 6, 147-160.
Marr, D. (1982). Vision. A computational investigation into the human representation and processing of visual information. San Francisco: W. H. Freeman and Company.
Mars, R. B., Sallet, J., Schuffelgen, U., Jbabdi, S., Toni, I., & Rushworth, M. F. (2012). Connectivity-Based Subdivisions of the Human Right "Temporoparietal Junction Area": Evidence for Different Areas Participating in Different Cortical Networks. Cereb Cortex, 22(8), 1894-1903. doi:10.1093/cercor/bhr268
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 103-115.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Michel, V., Gramfort, A., Varoquaux, G., Eger, E., & Thirion, B. (2011). Total variation regularization for fMRI-based prediction of behavior. Medical Imaging, IEEE Transactions on, 30(7), 1328-1340.
Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. Neuroimage, 53(1), 103-118. doi:10.1016/j.neuroimage.2010.05.051
Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M.-a., Morito, Y., Tanabe, H. C., ... Kamitani, Y. (2008). Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5), 915-929.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533. doi:10.1038/nature14236
Moeller, J. R., Strother, S. C., Sidtis, J. J., & Rottenberg, D. A. (1987). Scaled subprofile model: a statistical approach to the analysis of functional patterns in positron emission tomographic data. J Cereb Blood Flow Metab, 7(5), 649-658. doi:10.1038/jcbfm.1987.118
Moore, D. S., & McCabe, G. P. (1989). Introduction to the Practice of Statistics: W. H. Freeman / Times Books / Henry Holt & Co.
Mur, M., Bandettini, P. A., & Kriegeskorte, N. (2009). Revealing representational content with pattern-information fMRI: an introductory guide. Soc Cogn Affect Neurosci, 4(1), 101-109. doi:10.1093/scan/nsn044
Murphy, K. P. (2012). Machine learning: a probabilistic perspective: MIT Press.
Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. Neuroimage, 56(2), 400-410. doi:10.1016/j.neuroimage.2010.07.073
Naselaris, T., Prenger, R. J., Kay, K. N., Oliver, M., & Gallant, J. L. (2009). Bayesian reconstruction of natural images from human brain activity. Neuron, 63(6), 902-915.
Nelder, J., & Wedderburn, R. (1972). Generalized Linear Models. Journal of the Royal Statistical Society. Series A, 135(3), 370-384.
Nesterov, Y. (2004). Introductory lectures on convex optimization, vol. 87 of Applied Optimization: Kluwer Academic Publishers, Boston, MA.
Neyman, J., & Pearson, E. S. (1933). On the Problem of the most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A, 231, 289-337.
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14, 841.
Nichols, T. E. (2012). Multiple testing corrections, nonparametric methods, and random field theory. Neuroimage, 62(2), 811-815.
Nichols, T. E., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res, 12(5), 419-446.
Nichols, T. E., & Holmes, A. P. (2002). Nonparametric permutation tests for functional neuroimaging: a primer with examples. Hum Brain Mapp, 15(1), 1-25.
Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods, 5(2), 241-301.
Norvig, P. (2011). On Chomsky and the two cultures of statistical learning. Author homepage.
Nuzzo, R. (2014). Scientific method: statistical errors. Nature, 506(7487), 150-152. doi:10.1038/506150a
Oakes, M. (1986). Statistical Inference: A commentary for the social and behavioral sciences. New York: Wiley.
Passingham, R. E., Stephan, K. E., & Kotter, R. (2002). The anatomical basis of functional localization in the cortex. Nat Rev Neurosci, 3(8), 606-616. doi:10.1038/nrn893
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96-146.
Pedregosa, F., Eickenberg, M., Ciuciu, P., Thirion, B., & Gramfort, A. (2015). Data-driven HRF estimation for encoding and decoding models. NeuroImage, 104, 209-220.
Penfield, W., & Perot, P. (1963). The brain's record of auditory and visual experience. Brain, 86(4), 595-696.
Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: a tutorial overview. Neuroimage, 45, 199-209. doi:10.1016/j.neuroimage.2008.11.007
Platt, J. R. (1964). Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science, 146(3642), 347-353. doi:10.1126/science.146.3642.347
Plis, S. M., Hjelm, D. R., Salakhutdinov, R., Allen, E. A., Bockholt, H. J., Long, J. D., ... Calhoun, V. D. (2014). Deep learning for neuroimaging: a validation study. Frontiers in Neuroscience, 8.
Poldrack, R. A. (2006). Can cognitive processes be inferred from neuroimaging data? Trends Cogn Sci, 10(2), 59-63. doi:10.1016/j.tics.2005.12.004
Poldrack, R. A., & Gorgolewski, K. J. (2014). Making big data open: data sharing in neuroimaging. Nat Neurosci, 17(11), 1510-1517. doi:10.1038/nn.3818
Pollard, P., & Richardson, J. T. E. (1987). On the probability of making Type I errors. Psychol. Bull., 102, 159-163.
Popper, K. (1935/2005). Logik der Forschung (11th ed.). Tübingen: Mohr Siebeck.
Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276.
Russell, S. J., & Norvig, P. (2002). Artificial intelligence: a modern approach (International Edition).
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210-229.
Sandberg, A., & Bostrom, N. (2008). Whole brain emulation.
Saygin, Z. M., Osher, D. E., Koldewyn, K., Reynolds, G., Gabrieli, J. D., & Saxe, R. R. (2012). Anatomical connectivity patterns predict face selectivity in the fusiform gyrus. Nat Neurosci, 15(2), 321-327. doi:10.1038/nn.3001
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115.
Schwartz, Y., Thirion, B., & Varoquaux, G. (2013). Mapping paradigm ontologies to and from the brain. Paper presented at the Advances in Neural Information Processing Systems.
Shafer, G. (1992). What is probability? Perspectives in Contemporary Statistics, 19-39.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422), 486-494.
Shmueli, G. (2010). To explain or to predict? Statistical Science, 289-310.
Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., ... Hadjadj, S. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445(7130), 881-885.
Smith, S. M., Beckmann, C. F., Andersson, J., Auerbach, E. J., Bijsterbosch, J., Douaud, G., ... Consortium, W. U.-M. H. (2013). Resting-state fMRI in the Human Connectome Project. Neuroimage, 80, 144-168. doi:10.1016/j.neuroimage.2013.05.039
Smith, S. M., Matthews, P. M., & Jezzard, P. (2001). Functional MRI: an introduction to methods: Oxford University Press.
Smith, S. M., & Nichols, T. E. (2009). Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44(1), 83-98.
Stark, C. E., & Squire, L. R. (2001). When zero is not zero: the problem of ambiguous baseline conditions in fMRI. Proc Natl Acad Sci U S A, 98, 12760-12766.
Stephan, K. E. (2004). On the role of general system theory for functional neuroimaging. J Anat, 205(6), 443-470. doi:10.1111/j.0021-8782.2004.00359.x
Stephan, K. E., Binder, E. B., Breakspear, M., Dayan, P., Johnstone, E. C., Meyer-Lindenberg, A., ... Fletcher, P. C. (2015). Charting the landscape of priority problems in psychiatry, part 2: pathogenesis and aetiology. The Lancet Psychiatry.
Stephan, K. E., Friston, K. J., & Frith, C. D. (2009). Dysconnection in schizophrenia: from abnormal synaptic plasticity to failures of self-monitoring. Schizophr Bull, 35(3), 509-527. doi:10.1093/schbul/sbn176
Taylor, J., Lockhart, R., Tibshirani, R. J., & Tibshirani, R. (2014). Exact post-selection inference for forward stepwise and least angle regression. arXiv preprint arXiv:1401.3889.
Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and selective inference. Proc Natl Acad Sci U S A, 112(25), 7629-7634. doi:10.1073/pnas.1507583112
Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279-1285.
Thirion, B., Duchesnay, E., Hubbard, E., Dubois, J., Poline, J. B., Lebihan, D., & Dehaene, S. (2006). Inverse retinotopy: inferring the visual content of images from brain activation patterns. Neuroimage, 33(4), 1104-1116. doi:10.1016/j.neuroimage.2006.06.062
Thirion, B., Varoquaux, G., Dohmatob, E., & Poline, J. B. (2014). Which fMRI clustering gives good brain parcellations? Front Neurosci, 8, 167. doi:10.3389/fnins.2014.00167
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.
Tukey, J. W. (1962). The future of data analysis. Annals of Mathematical Statistics, 33, 1-67.
Tukey, J. W. (1965). An approach to thinking about a statistical computing system (unpublished notes circulated at Bell Laboratories).
Van Essen, D. C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T. E., Bucholz, R., ... Consortium, W. U.-M. H. (2012). The Human Connectome Project: a data acquisition perspective. Neuroimage, 62(4), 2222-2231. doi:10.1016/j.neuroimage.2012.02.018
Van Horn, J. D., & Toga, A. W. (2014). Human neuroimaging as a "Big Data" science. Brain Imaging Behav, 8(2), 323-331. doi:10.1007/s11682-013-9255-y
Vanderplas, J. (2013). The Big Data Brain Drain: Why Science is in Trouble. Blog "Pythonic Perambulations".
Vapnik, V. N. (1989). Statistical Learning Theory. New York: Wiley-Interscience.
Vapnik, V. N. (1996). The nature of statistical learning theory. New York: Springer.
Varoquaux, G., & Thirion, B. (2014). How machine learning is shaping cognitive neuroimaging. GigaScience, 3(1), 28.
Vogelstein, J. T., Park, Y., Ohyama, T., Kerr, R. A., Truman, J. W., Priebe, C. E., & Zlatic, M. (2014). Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science, 344(6182), 386-392. doi:10.1126/science.1250298
Vogt, C., & Vogt, O. (1919). Allgemeinere Ergebnisse unserer Hirnforschung. Journal für Psychologie und Neurologie, 25, 279-461.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2008). Voodoo Correlations in Social Neuroscience. Psychological Science.
Wagenmakers, E.-J., Lee, M., Lodewyckx, T., & Iverson, G. J. (2008). Bayesian versus frequentist inference. Bayesian Evaluation of Informative Hypotheses (pp. 181-207): Springer.
Wainwright, M. J. (2014). Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annu. Rev. Stat. Appl, 1, 233-253.
Wasserman, L. (2013). All of statistics: a concise course in statistical inference: Springer Science & Business Media.
Wasserman, L., & Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37(5A), 2178.
Wernicke, C. (1881). Die akute haemorrhagische Polioencephalitis superior. Lehrbuch der Gehirnkrankheiten für Aerzte und Studirende, Bd II, 2, 229-242.
Wigner, E. P. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13, 1-14.
Wolpert, D. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8, 1341-1390.
Worsley, K. J., Evans, A. C., Marrett, S., & Neelin, P. (1992). A three-dimensional statistical analysis for CBF activation studies in human brain. Journal of Cerebral Blood Flow and Metabolism, 12, 900-900.
Worsley, K. J., Poline, J.-B., Friston, K. J., & Evans, A. C. (1997). Characterizing the response of PET and fMRI data using multivariate linear models. NeuroImage, 6(4), 305-319.
Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., & Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6), 714-721.
Yamins, D. L., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nat Neurosci, 19(3), 356-365. doi:10.1038/nn.4244
Yarkoni, T., & Braver, T. S. (2010). Cognitive neuroscience approaches to individual differences in working memory and executive control: conceptual and methodological issues. Handbook of Individual Differences in Cognition (pp. 87-107): Springer.
Yarkoni, T., Poldrack, R. A., Nichols, T. E., Van Essen, D. C., & Wager, T. D. (2011). Large-scale automated synthesis of human functional neuroimaging data. Nat Methods, 8(8), 665-670. doi:10.1038/nmeth.1635
Yarkoni, T., & Westfall, J. (2016). Choosing prediction over explanation in psychology: Lessons from machine learning.
Yeo, B. T., Krienen, F. M., Sepulcre, J., Sabuncu, M. R., Lashkari, D., Hollinshead, M., ... Buckner, R. L. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J Neurophysiol, 106(3), 1125-1165. doi:10.1152/jn.00338.2011
Yuste, R. (2015). From the neuron doctrine to neural networks. Nature Reviews Neuroscience, 16(8), 487-497.
Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286.
zu Eulenburg, P., Caspers, S., Roski, C., & Eickhoff, S. B. (2012). Meta-analytical definition and functional connectivity of the human vestibular cortex. Neuroimage, 60(1), 162-169. doi:10.1016/j.neuroimage.2011.12.032