Multivariate models for fMRI data Klaas Enno Stephan (with 90% of slides kindly contributed by Kay H. Brodersen) Translational Neuromodeling Unit (TNU) Institute for Biomedical Engineering University of Zurich & ETH Zurich
Feb 24, 2016
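Why multivariate?
Univariate approaches are excellent for localizing activations in individual voxels.
[Figure: activity of voxels v1 and v2 under reward vs. no reward; one comparison is marked significant (*), the other not significant (n.s.).]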
Why multivariate?
Multivariate approaches can be used to examine responses that are jointly encoded in multiple voxels.
[Figure: activity of voxels v1 and v2 for orange juice vs. apple juice, shown per voxel (both comparisons n.s.) and as a plot of v1 against v2.]
Why multivariate?
Multivariate approaches can utilize ‘hidden’ quantities such as coupling strengths.
[Figure: observed BOLD signal over time t, generated by hidden underlying neural activity and coupling strengths, with a driving input and a modulatory input.]
Friston, Harrison & Penny (2003) NeuroImage; Stephan & Friston (2007) Handbook of Brain Connectivity; Stephan et al. (2008) NeuroImage
Overview
1 Modelling principles
2 Classification
3 Multivariate Bayes
4 Generative embedding
1 Modelling principles
Encoding vs. decoding
context (cause or consequence) $X_t$: condition, stimulus, response, prediction error
BOLD signal $Y_t$
encoding model $g: X_t \to Y_t$
decoding model $h: Y_t \to X_t$
Regression vs. classification
Regression model: a function $f$ maps independent variables (regressors) to a continuous dependent variable.
Classification model: a function $f$ maps independent variables (features) to a categorical dependent variable (label).
Univariate vs. multivariate models
A univariate model considers a single voxel at a time. Spatial dependencies between voxels are only introduced afterwards, through random field theory.
A multivariate model considers many voxels at once. Multivariate models enable inferences on distributed responses without requiring focal activations.
[Figure: mapping between context and BOLD signal for each type of model.]
Prediction vs. inference
The goal of prediction is to find a generalisable encoding or decoding function. Key quantity: predictive density.
• predicting a cognitive state using a brain-machine interface
• predicting a subject-specific diagnostic status
The goal of inference is to decide between competing hypotheses. Key quantity: marginal likelihood (model evidence).
• comparing a model that links distributed neuronal activity to a cognitive state with a model that does not
• weighing the evidence for sparse vs. distributed coding
Goodness of fit vs. complexity
Goodness of fit is the degree to which a model explains observed data.
Complexity is the flexibility of a model (including, but not limited to, its number of parameters).
We wish to find the model that optimally trades off goodness of fit and complexity.
[Figure: truth, data, and model fits of $Y$ against $X$ for 1 parameter (underfitting), 4 parameters (optimal), and 9 parameters (overfitting). Bishop (2007) PRML]
2 Classification
Constructing a classifier
A principled way of designing a classifier would be to adopt a probabilistic approach: a function $f$ maps the data $Y_t$ to that label $k$ which maximizes the posterior probability $p(X_t = k \mid Y_t)$.
In practice, classifiers differ in terms of how strictly they implement this principle.
Generative classifiers use Bayes’ rule to estimate $p(X_t \mid Y_t) \propto p(Y_t \mid X_t)\, p(X_t)$:
• Gaussian naïve Bayes
• Linear discriminant analysis
Discriminative classifiers estimate $p(X_t \mid Y_t)$ directly, without Bayes’ theorem:
• Logistic regression
• Relevance vector machine
• Gaussian process classifier
Discriminant classifiers estimate $f(Y_t)$ directly:
• Fisher’s linear discriminant
• Support vector machine
Common types of fMRI classification studies
Searchlight approach: A sphere is passed across the brain. At each location, the classifier is evaluated using only the voxels in the current sphere. The result is a map of t-scores. Nandy & Cordes (2003) MRM; Kriegeskorte et al. (2006) PNAS
Whole-brain approach: A constrained classifier is trained on whole-brain data. Its voxel weights are related to their empirical null distributions using a permutation test. The result is a map of t-scores. Mourao-Miranda et al. (2005) NeuroImage
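As a rough illustration of the searchlight idea, the sketch below lists the voxels that fall inside a sphere centred on a given voxel, clipped to the bounds of the volume. The helper name `sphere_voxels`, the integer voxel grid, and the radius given in voxel units are assumptions made for this example only; an actual analysis would evaluate a cross-validated classifier on each such voxel set.

```python
from itertools import product

def sphere_voxels(center, radius, shape):
    """Return voxel coordinates within `radius` (in voxels) of `center`.

    Hypothetical helper: coordinates are integer (x, y, z) indices into a
    volume of size `shape`; voxels outside the volume are dropped.
    """
    cx, cy, cz = center
    r = int(radius)
    voxels = []
    for dx, dy, dz in product(range(-r, r + 1), repeat=3):
        if dx * dx + dy * dy + dz * dz > radius * radius:
            continue  # outside the sphere
        x, y, z = cx + dx, cy + dy, cz + dz
        if 0 <= x < shape[0] and 0 <= y < shape[1] and 0 <= z < shape[2]:
            voxels.append((x, y, z))
    return voxels

# A radius-1 sphere in the interior holds the centre and its 6 face neighbours.
interior = sphere_voxels((5, 5, 5), 1, (10, 10, 10))
# At a corner of the volume, part of the sphere is clipped away.
corner = sphere_voxels((0, 0, 0), 1, (10, 10, 10))
print(len(interior), len(corner))
```

Passing the sphere across the brain then amounts to calling this helper once per centre voxel and storing one performance value per centre.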
Support vector machine (SVM)
[Figure: decision boundaries of a linear SVM and a nonlinear SVM in the space spanned by voxels v1 and v2.]
Vapnik (1999) Springer; Schölkopf et al. (2002) MIT Press
Stages in a classification analysis
1. Feature extraction
2. Classification using cross-validation
3. Performance evaluation, e.g. by Bayesian mixed-effects inference
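Feature extraction for trial-by-trial classification
We can obtain trial-wise estimates of neural activity by filtering the data with a GLM:
$Y = X\beta + e$, with $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top$
where $Y$ is the data, $X$ the design matrix, $\beta$ the coefficients, and $e$ the error.
Example: the design matrix contains a boxcar regressor for trial 2. The estimate of its coefficient reflects activity on trial 2.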
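The following minimal sketch shows how one coefficient per trial can be read off a voxel time series. It rests on strong simplifying assumptions that are not part of the slides: boxcar regressors are not convolved with a haemodynamic response function, trials do not overlap (so the regressors are orthogonal and each coefficient is a simple projection), and there are no confound regressors. The function names `boxcar` and `trialwise_betas` are illustrative.

```python
def boxcar(n_scans, onset, duration):
    """Boxcar regressor: 1 during the trial, 0 elsewhere (no HRF convolution)."""
    return [1.0 if onset <= t < onset + duration else 0.0 for t in range(n_scans)]

def trialwise_betas(y, regressors):
    """Least-squares coefficients for mutually orthogonal regressors.

    Assumption: trials do not overlap, so the columns of the design matrix
    are orthogonal and each beta is a projection of y onto one column.
    """
    betas = []
    for x in regressors:
        xy = sum(xi * yi for xi, yi in zip(x, y))
        xx = sum(xi * xi for xi in x)
        betas.append(xy / xx)
    return betas

# Toy voxel time series with three trials of two scans each.
y = [0.0, 2.0, 2.0, 0.0, 5.0, 5.0, 0.0, 1.0, 3.0, 0.0]
onsets = [1, 4, 7]
X = [boxcar(len(y), onset, 2) for onset in onsets]
betas = trialwise_betas(y, X)
print(betas)  # one activity estimate per trial
```

With overlapping or convolved regressors the columns are no longer orthogonal, and a general least-squares solver would be needed instead of the per-column projection.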
Cross-validation
The generalization ability of a classifier can be estimated using a resampling procedure known as cross-validation. One example is 2-fold cross-validation:
[Figure: 100 examples (1, 2, 3, ..., 99, 100) are split into training examples and test examples (marked ?) in each of 2 folds; the test predictions from both folds feed into performance evaluation.]
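The fold structure of k-fold cross-validation can be sketched as follows. The function `kfold_indices` is a hypothetical helper that splits example indices into contiguous folds; in practice the split is often shuffled or aligned with scanner runs, which this sketch does not attempt.

```python
def kfold_indices(n_examples, n_folds):
    """Split example indices into contiguous folds.

    Returns a list of (train, test) index lists, one pair per fold.
    """
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, n_folds)
    splits, start = [], 0
    for fold in range(n_folds):
        # Spread any remainder over the first folds.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        splits.append((train, test))
        start = stop
    return splits

# 2-fold cross-validation on 100 examples: each half is tested once.
splits = kfold_indices(100, 2)
for train, test in splits:
    print(len(train), len(test))
```

Every example appears in exactly one test set, so the test predictions from all folds can be pooled for performance evaluation.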
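Cross-validation
A more commonly used variant is leave-one-out cross-validation.
[Figure: 100 examples and 100 folds (1, 2, ..., 98, 99, 100); in each fold a single example is the test example (marked ?) and all remaining examples are training examples; the test predictions from all folds feed into performance evaluation.]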
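A complete leave-one-out loop might look like the sketch below. The nearest-centroid classifier is chosen here only because it fits in a few lines; it stands in for whatever classifier is actually used (for example an SVM), and the toy two-voxel patterns are invented for illustration.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose training mean is closest (squared distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [f for f, l in zip(train_x, train_y) if l == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(features, labels):
    """Each example serves once as the test example; the rest are training data."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        correct += nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]
    return correct / len(features)

# Toy two-voxel patterns for two well-separated classes (hypothetical data).
features = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.1], [2.9, 3.3], [3.2, 2.8]]
labels = [0, 0, 0, 1, 1, 1]
accuracy = leave_one_out_accuracy(features, labels)
print(accuracy)
```

The classifier is refitted in every fold, so the test example never influences the centroids it is compared against.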
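Performance evaluation
Single-subject study with $n$ trials. The most common approach is to assess how likely it is that the obtained number of correctly classified trials could have occurred by chance.
Binomial test: $p = P(X \ge k \mid H_0) = 1 - B(k - 1 \mid n, \pi_0)$
where $k$ is the number of correctly classified trials, $n$ the total number of trials, $\pi_0$ the chance level (typically 0.5), and $B$ the binomial cumulative distribution function.
In MATLAB: p = 1 - binocdf(k-1,n,pi_0)
[Figure: trial-wise classification outcomes of one subject, shown as a sequence of correct/incorrect marks and as a binary vector such as [0 1 1 1 0].]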
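For readers without MATLAB, the binomial test can be sketched with the Python standard library alone. The function name `binomial_p_value` is an assumption of this example. It computes the upper tail including the observed count, $P(X \ge k)$, which corresponds to evaluating the binomial cumulative distribution function at $k - 1$.

```python
from math import comb

def binomial_p_value(k, n, pi_0=0.5):
    """P(X >= k) for X ~ Binomial(n, pi_0): chance of k or more correct trials."""
    return sum(comb(n, i) * pi_0 ** i * (1 - pi_0) ** (n - i) for i in range(k, n + 1))

# Example: 60 of 100 trials classified correctly at a chance level of 0.5.
p = binomial_p_value(60, 100)
print(round(p, 4))
```

Evaluating the cumulative distribution function at $k$ rather than $k - 1$ would exclude the observed count from the tail and give a slightly smaller p-value.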
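Performance evaluation
[Figure: hierarchical structure of group data. A population gives rise to subjects, and each subject contributes a binary vector of trial-wise outcomes: subject 1 [0 1 1 1 0], subject 2 [1 1 0 1 1], subject 3 [0 1 1 0 0], subject 4 [1 1 1 1 1], ..., last subject [0 1 1 1 0].]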
Performance evaluation
Group study with $m$ subjects, $n$ trials each. In a group setting, we must account for both within-subjects (fixed-effects) and between-subjects (random-effects) variance components.
• Binomial test on concatenated data (fixed effects)
• Binomial test on averaged data (fixed effects)
• t-test on summary statistics (random effects): $p = 1 - t_{m-1}\left(\sqrt{m}\,\frac{\bar{\pi} - \pi_0}{\hat{\sigma}}\right)$, where $\bar{\pi}$ is the sample mean of sample accuracies, $\pi_0$ the chance level (typically 0.5), $\hat{\sigma}$ the sample standard deviation, and $t_{m-1}$ the cumulative Student’s t-distribution with $m - 1$ degrees of freedom
• Bayesian mixed-effects inference (mixed effects), available for MATLAB and R
Brodersen, Mathys, Chumbley, Daunizeau, Ong, Buhmann, Stephan (2012) JMLR; Brodersen, Daunizeau, Mathys, Chumbley, Buhmann, Stephan (2013) NeuroImage
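To make the difference between fixed-effects and random-effects inference concrete, the sketch below contrasts a binomial test on concatenated data with a t statistic on subject-wise accuracies. The data are hypothetical and chosen so that one subject dominates the pooled result; the function names are illustrative. The t statistic is compared against 2.132, the one-sided 5% critical value of Student's t-distribution with 4 degrees of freedom.

```python
from math import comb, sqrt
from statistics import mean, stdev

def binomial_p_value(k, n, pi_0=0.5):
    """P(X >= k) for X ~ Binomial(n, pi_0)."""
    return sum(comb(n, i) * pi_0 ** i * (1 - pi_0) ** (n - i) for i in range(k, n + 1))

def t_statistic(accuracies, pi_0=0.5):
    """One-sample t statistic of subject-wise accuracies against chance."""
    m = len(accuracies)
    return sqrt(m) * (mean(accuracies) - pi_0) / stdev(accuracies)

# Hypothetical data: correctly classified trials out of 20 for five subjects.
correct = [19, 11, 10, 12, 9]
n_trials = 20
accuracies = [k / n_trials for k in correct]

# Fixed effects: pool all trials as if they came from a single subject.
p_fixed = binomial_p_value(sum(correct), n_trials * len(correct))
# Random effects: treat each subject's accuracy as one observation.
t = t_statistic(accuracies)
print(p_fixed, t)
```

In this toy example the pooled binomial test falls below 0.05 while the t statistic stays below its critical value, because the group result is driven by a single subject. Neither test is the Bayesian mixed-effects approach listed above; they only illustrate why the choice of variance model matters.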
Research questions for classification
• Overall classification accuracy: how well can a classification task be solved (truth or lie? left or right button? healthy or ill?) [Figure: accuracies of 80% and 55% on an axis from 50% to 100%.]
• Spatial deployment of discriminative regions
• Temporal evolution of discriminability [Figure: accuracy (50% to 100%) over within-trial time, annotated with "accuracy rises above chance" and "participant indicates decision".]
• Model-based classification: { group 1, group 2 }
Pereira et al. (2009) NeuroImage, Brodersen et al. (2009) The New Collection
Potential problems
Multivariate classification studies
• conduct group tests on single-subject summary statistics that discard the sign or direction of underlying effects
• do not necessarily take into account confounding effects (e.g. correlation of task conditions with difficulty)
Therefore, in some analyses confounds rather than distributed representations may have produced positive results.
Todd et al. 2013, NeuroImage
Potential problems
Simulation: Experiment condition (rule A vs. rule B) does not affect voxel activity, but difficulty does. Moreover, experiment condition and difficulty are confounded at the individual-subject level in random directions across 500 subjects.
Todd et al. 2013, NeuroImage
Potential problems
Empirical example: searchlight analysis of an fMRI study on rule representations (flexible stimulus–response mappings).
• Standard MVPA: rule representations in prefrontal regions.
• GLM: no significant results.
• Controlling for a variable that is confounded with rule at the individual-subject level but not the group level (reaction time differences across rules) eliminates the MVPA results.
Todd et al. 2013, NeuroImage
3 Multivariate Bayes
Multivariate Bayes
Mike West
SPM brings multivariate analyses into the conventional inference framework of hierarchical Bayesian models and their inversion.
Multivariate Bayes
Multivariate analyses in SPM rest on the central notion that inferences about how the brain represents things can be reduced to model comparison.
[Figure: decoding model linking brain activity to some cause or consequence; example comparison: sparse coding in orbitofrontal cortex vs. distributed coding in prefrontal cortex.]
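From encoding to decoding
Encoding model (GLM): $A = X\beta$ and $Y = TA + G\gamma + \varepsilon$, so that $Y = TX\beta + G\gamma + \varepsilon$.
Decoding model (MVB): $X = A\beta$ and $Y = TA + G\gamma + \varepsilon$, so that $TX = Y\beta - G\gamma\beta - \varepsilon\beta$.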
Multivariate Bayes
Friston et al. 2008 NeuroImage
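MVB: details
Encoding model: neuronal activity is a linear mixture of causes: $A = X\beta$.
The BOLD data matrix is a temporal convolution of underlying neuronal activity: $Y = TA + G\gamma + \varepsilon$.
Decoding model: the mental state is a linear mixture of voxel-wise activity: $X = A\beta$, thus $A = X\beta^{-1}$.
Substitution into $Y$ gives: $Y = TX\beta^{-1} + G\gamma + \varepsilon \Rightarrow Y\beta = TX + G\gamma\beta + \varepsilon\beta \Rightarrow TX = Y\beta - G\gamma\beta - \varepsilon\beta$.
Importantly, we can remove the influence of confounds by premultiplying with the appropriate residual forming matrix $R$, which satisfies $RG = 0$: $RTX = RY\beta + \varsigma$, with $\varsigma = -R\varepsilon\beta$.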
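The effect of a residual forming matrix can be illustrated with a deliberately small example. The sketch assumes a single confound column $g$ (here a constant term), for which $R = I - g g^\top / (g^\top g)$; with several confounds the same role is played by $R = I - G G^{+}$, which is not implemented here. The helper names are hypothetical.

```python
def residual_forming_matrix(g):
    """R = I - g g^T / (g^T g) for a single confound column g."""
    n = len(g)
    gg = sum(v * v for v in g)
    return [[(1.0 if i == j else 0.0) - g[i] * g[j] / gg for j in range(n)]
            for i in range(n)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Confound: a constant term across four scans.
g = [1.0, 1.0, 1.0, 1.0]
R = residual_forming_matrix(g)

y = [2.0, 4.0, 6.0, 8.0]
Rg = matvec(R, g)  # the confound itself is projected to zero
Ry = matvec(R, y)  # y with its mean removed
print(Rg, Ry)
```

Because $R$ annihilates the confound, any term of the form $G\gamma\beta$ vanishes after premultiplication, which is what makes the confounds drop out of the decoding model.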
MVB: details
To make inversion tractable: impose priors on $\beta$ (i.e., assumptions about how mental states are represented by voxel-wise activity).
Scientific inquiry as model selection: compare different priors to identify which neuronal representation of mental states is most plausible.
Specifying the prior for MVB
To make the ill-posed regression problem tractable, MVB uses a prior on voxel weights. Different priors reflect different anatomical and/or coding hypotheses.
For example:
[Figure: matrix of patterns over voxels.]
• Voxel 2 is allowed to play a role.
• Voxel 3 is allowed to play a role, but only if its neighbours play similar roles.
Friston et al. 2008 NeuroImage
Example: decoding motion from visual cortex
MVB can be illustrated using SPM’s attention-to-motion example dataset.
This dataset is based on a simple block design. There are three experimental factors:
• photic – display shows random dots
• motion – dots are moving
• attention – subjects asked to pay attention
[Figure: design matrix with scans as rows and the regressors photic, motion, attention, and const as columns. During the highlighted scans, for example, subjects were passively viewing moving dots.]
Büchel & Friston 1999 Cerebral Cortex; Friston et al. 2008 NeuroImage
Multivariate Bayes in SPM
Step 1: After having specified and estimated a model, use the Results button.
Step 2: Select the contrast to be decoded.
Step 3: Pick a region of interest.
Step 4: Multivariate Bayes can be invoked from within the Multivariate section.
Step 5: Here, the region of interest is specified as a sphere around the cursor (anatomical hypothesis). The spatial prior implements a sparse coding hypothesis (coding hypothesis).
Step 6: Results can be displayed using the BMS button.
Model evidence and voxel weights
log BF = 3
Summary: research questions for MVB
How does the brain represent things? Evaluating competing coding hypotheses.
Where does the brain represent things? Evaluating competing anatomical hypotheses.
4 Generative embedding
Model-based classification
Step 1 (modelling): measurements from an individual subject are described by a subject-specific generative model.
Step 2 (embedding): the subject is represented in a model-based feature space, e.g. by the connection strengths A → B, A → C, B → B, B → C.
Step 3 (classification): a classification model is applied in this feature space.
Step 4 (evaluation): discriminability of groups? (accuracy)
Step 5 (interpretation): jointly discriminative connection strengths?
Brodersen, Haiss, Ong, Jung, Tittgemeyer, Buhmann, Weber, Stephan (2011) NeuroImage; Brodersen, Schofield, Leff, Ong, Lomakina, Buhmann, Stephan (2011) PLoS Comput Biol
Model-based classification: model specification
[Figure: anatomical regions of interest (slice at y = –26 mm): medial geniculate body, Heschl’s gyrus (A1), and planum temporale in the left (L) and right (R) hemisphere, with stimulus input.]
Model-based classification
[Figure: balanced accuracy (50% to 100%) for classifying patients vs. controls, comparing activation-based, correlation-based, and model-based classification; individual bars are labelled a, c, s, p, m, e, z, o, f, l, r; significance markers: n.s., n.s., *.]
Generative embedding
[Figure: controls and patients plotted in voxel-based activity space (Voxel 1, Voxel 2, Voxel 3; classification accuracy 75%) and, after generative embedding, in model-based parameter space (Parameter 1, Parameter 2, Parameter 3; classification accuracy 98%).]
Model-based classification: interpretation
[Figure: model comprising MGB, HG (A1), and PT in the left (L) and right (R) hemisphere, with stimulus input; connections are marked as highly discriminative, somewhat discriminative, or not discriminative.]
Model-based clustering
fMRI data acquired during a working-memory (WM) task and modelled using a three-region DCM (VC, PC, dLPFC; stimulus input)
• 42 patients diagnosed with schizophrenia
• 41 healthy controls
Deserno, Sterzer, Wüstenberg, Heinz, & Schlagenhauf (2012) J Neurosci
Model-based clustering
[Figure panels: model selection, interpretation, validation.]
Brodersen et al. 2014, NeuroImage: Clinical
Summary
Classification
• to assess whether a cognitive state is linked to patterns of activity
• to visualize the spatial deployment of discriminative activity
Multivariate Bayes
• to evaluate competing anatomical hypotheses
• to evaluate competing coding hypotheses
Generative embedding
• to assess whether groups differ in terms of model parameter estimates (connectivity)
• to generate mechanistic subgroup hypotheses
Thank you
Further reading
Classification
• Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview. NeuroImage, 45(1, Supplement 1), S199-S209.
• O'Toole, A. J., Jiang, F., Abdi, H., Penard, N., Dunlop, J. P., & Parent, M. A. (2007). Theoretical, Statistical, and Practical Perspectives on Pattern-based Classification Approaches to the Analysis of Functional Neuroimaging Data. Journal of Cognitive Neuroscience, 19(11), 1735-1752.
• Haynes, J., & Rees, G. (2006). Decoding mental states from brain activity in humans. Nature Reviews Neuroscience, 7(7), 523-534.
• Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9), 424-30.
• Brodersen, K. H., Haiss, F., Ong, C., Jung, F., Tittgemeyer, M., Buhmann, J., Weber, B., et al. (2010). Model-based feature construction for multivariate decoding. NeuroImage.
• Brodersen, K. H., Ong, C., Stephan, K. E., Buhmann, J. (2010). The balanced accuracy and its posterior distribution. ICPR, 3121-3124.
Multivariate Bayes
• Friston, K., Chu, C., Mourao-Miranda, J., Hulme, O., Rees, G., Penny, W., et al. (2008). Bayesian decoding of brain images. NeuroImage, 39(1), 181-205.
Why multivariate?
Multivariate approaches can exploit a sampling bias in voxelized images to reveal interesting activity on a subvoxel scale.
Boynton (2005) Nature Neuroscience
Why multivariate?
The last 10 years have seen a notable increase in decoding analyses in neuroimaging.
[Figure labels: PET, prediction.]
Haxby et al. (2001) Science; Lautrup et al. (1994) Supercomputing in Brain Research
Spatial deployment of informative regions
Which brain regions are jointly informative of a cognitive state of interest?
Searchlight approach: A sphere is passed across the brain. At each location, the classifier is evaluated using only the voxels in the current sphere. The result is a map of t-scores. Nandy & Cordes (2003) MRM; Kriegeskorte et al. (2006) PNAS
Whole-brain approach: A constrained classifier is trained on whole-brain data. Its voxel weights are related to their empirical null distributions using a permutation test. The result is a map of t-scores. Mourao-Miranda et al. (2005) NeuroImage
Issues to be aware of (as researcher or reviewer)
Classification induces constraints on the experimental design.
• When estimating trial-wise Beta values, we need longer ITIs (typically 8 – 15 s). At the same time, we need many trials (typically 100+).
• Classes should be balanced. If they are imbalanced, we can resample the training set, constrain the classifier, or report the balanced accuracy.
Construction of examples
• Estimation of Beta images is the preferred approach.
• Covariates should be included in the trial-by-trial design matrix.
Temporal autocorrelation
• In trial-by-trial classification, exclude trials around the test trial from the training set.
Avoiding double-dipping
• Any feature selection and tuning of classifier settings should be carried out on the training set only.
Performance evaluation
• Do random-effects or mixed-effects inference.
• Correct for multiple tests.
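The advice on temporal autocorrelation can be sketched as a small helper that builds the training set for one leave-one-out fold. The function name `training_indices` and the `margin` parameter (the number of neighbouring trials dropped on each side of the test trial) are assumptions of this example; a suitable margin depends on the ITI and is not specified in the slides.

```python
def training_indices(n_trials, test_trial, margin):
    """Training trials for one leave-one-out fold.

    Trials within `margin` positions of the test trial are left out, because
    temporally adjacent trials are likely to be correlated with it.
    """
    return [t for t in range(n_trials) if abs(t - test_trial) > margin]

# With 10 trials, test trial 4, and a margin of 1, trials 3, 4, and 5 are excluded.
train = training_indices(n_trials=10, test_trial=4, margin=1)
print(train)
```

Setting `margin=0` recovers ordinary leave-one-out cross-validation, in which only the test trial itself is excluded.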
Performance evaluation
Evaluating the performance of a classification algorithm critically requires a measure of the degree to which unseen examples have been identified with their correct class labels.
The procedure of averaging across accuracies obtained on individual cross-validation folds is flawed in two ways. First, it does not allow for the derivation of a meaningful confidence interval. Second, it leads to an optimistic estimate when a biased classifier is tested on an imbalanced dataset.
Both problems can be overcome by replacing the conventional point estimate of accuracy by an estimate of the posterior distribution of the balanced accuracy.
Brodersen, Ong, Buhmann, Stephan (2010) ICPR
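The contrast between accuracy and balanced accuracy can be made concrete with a classifier that always predicts the majority class. The sketch below also draws Monte Carlo samples from a posterior over the balanced accuracy, assuming independent flat Beta(1, 1) priors on the two class-wise accuracies. It is written in the spirit of the approach cited above rather than as a reimplementation of it, and all function names are illustrative.

```python
import random

def accuracy(tp, fn, tn, fp):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the accuracies obtained on the two classes."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def sample_balanced_accuracy_posterior(tp, fn, tn, fp, n_samples=10000, seed=0):
    """Monte Carlo samples from the posterior of the balanced accuracy.

    Assumes independent flat Beta(1, 1) priors on the two class-wise accuracies.
    """
    rng = random.Random(seed)
    return [0.5 * (rng.betavariate(tp + 1, fn + 1) + rng.betavariate(tn + 1, fp + 1))
            for _ in range(n_samples)]

# A classifier that always predicts the majority class on 90 vs. 10 examples.
tp, fn, tn, fp = 90, 0, 0, 10
acc = accuracy(tp, fn, tn, fp)            # looks good
bal = balanced_accuracy(tp, fn, tn, fp)   # sits at chance level
samples = sorted(sample_balanced_accuracy_posterior(tp, fn, tn, fp))
interval = (samples[250], samples[9749])  # central 95% posterior interval
print(acc, bal, interval)
```

The posterior interval gives the uncertainty statement that a plain average over folds lacks, and it stays well below the misleading raw accuracy of the biased classifier.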
Pattern characterization
Example – decoding the identity of the person speaking to the subject in the scanner
[Figure: fingerprint plot (one plot per class) over voxel 1, voxel 2, ...]
Formisano et al. (2008) Science
Lessons from the Neyman-Pearson lemma
Is there a link between $X$ and $Y$? To test for a statistical dependency between a contextual variable $X$ and the BOLD signal $Y$, we compare
$H_0$: there is no dependency
$H_a$: there is some dependency
Which statistical test?
1. define a test size $\alpha$ (the probability of falsely rejecting $H_0$, i.e., 1 − specificity),
2. choose the test with the highest power (the probability of correctly rejecting $H_0$, i.e., sensitivity).
The Neyman-Pearson lemma
The most powerful test of size $\alpha$ is: to reject $H_0$ when the likelihood ratio $\Lambda$ exceeds a critical value $u$,
$\Lambda(Y) = \frac{p(Y \mid H_a)}{p(Y \mid H_0)} \ge u$,
with $u$ chosen such that $P(\Lambda(Y) \ge u \mid H_0) = \alpha$.
The null distribution of the likelihood ratio can be determined non-parametrically or under parametric assumptions.
This lemma underlies both classical statistics and Bayesian statistics (where $\Lambda$ is known as a Bayes factor).
Neyman & Pearson (1933) Phil Trans Roy Soc London
Lessons from the Neyman-Pearson lemma
In summary
1. Inference about how the brain represents things reduces to model comparison.
2. To establish that a link exists between some context $X$ and activity $Y$, the direction of the mapping is not important.
3. Testing the accuracy of a classifier is not based on the likelihood ratio $\Lambda$ and is therefore suboptimal.
Neyman & Pearson (1933) Phil Trans Roy Soc London; Kass & Raftery (1995) J Am Stat Assoc; Friston et al. (2009) NeuroImage
Temporal evolution of discriminability
Example – decoding which button the subject pressed
[Figure: classification accuracy over time, relative to decision and response, for motor cortex and frontopolar cortex.]
Soon et al. (2008) Nature Neuroscience
Identification / inferring a representational space
Approach
1. estimation of an encoding model
2. nearest-neighbour classification or voting
Mitchell et al. (2008) Science
Reconstruction / optimal decoding
Approach
1. estimation of an encoding model
2. model inversion
Paninski et al. (2007) Progr Brain Res; Pillow et al. (2008) Nature; Miyawaki et al. (2009) Neuron
Recent MVB studies
Specifying the prior for MVB
1st level – spatial coding hypothesis
2nd level – pattern covariance structure
[Figure: matrix $U$ of patterns over voxels, combined with pattern weights $\eta$.]
• Voxel 2 is allowed to play a role.
• Voxel 3 is allowed to play a role, but only if its neighbours play weaker roles.
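Inverting the model
Model inversion involves finding the posterior distribution over voxel weights.
In MVB, this includes a greedy search for the optimal covariance structure that governs the prior over these weights.
[Figure: successive partitions of the patterns into subsets, each subset with its own variance component. Partition #1: $\Sigma = \lambda_1 \times$ (one subset). Partition #2: $\Sigma = \lambda_1 \times \ldots + \lambda_2 \times \ldots$ (two subsets). Partition #3 (optimal): $\Sigma = \lambda_1 \times \ldots + \lambda_2 \times \ldots + \lambda_3 \times \ldots$ (three subsets).]
• Pattern 3 makes an important positive contribution / is activated.
• Pattern 8 is allowed to make some contribution, independently of the contributions of other patterns.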
Observations vs. predictions
[Figure: observed and predicted values of the target variable $RXc$ (motion).]
Using MVB for point classification
MVB may outperform conventional point classifiers when using a more appropriate coding hypothesis.
[Figure: comparison with a Support Vector Machine.]
Model-based analyses by data representation
• Structure-based analyses: Which anatomical structures allow us to separate patients and healthy controls?
• Activation-based analyses: Which functional differences allow us to separate groups?
• Model-based analyses: How do patterns of hidden quantities (e.g., connectivity among brain regions) differ between groups?
Model-based clustering
[Figure: supervised learning (SVM classification): balanced accuracy for regional activity, functional connectivity, and effective connectivity (values shown: 55%, 62%, and 78%, the latter for effective connectivity). Unsupervised learning (GMM clustering using effective connectivity): log model evidence (model selection) and balanced purity (model validation) for 1 to 8 clusters; best model: 71%.]
Brodersen, Deserno, Schlagenhauf, Penny, Lin, Gupta, Buhmann, Stephan (in preparation)
Generative embedding and DCM
Question 1 – What do the data tell us about hidden processes in the brain? Compute the posterior.
Question 2 – Which model is best w.r.t. the observed fMRI data? Compute the model evidence.
Question 3 – Which model is best w.r.t. an external criterion, e.g. { patient, control }? Compute the classification accuracy.
Model-based classification using DCM
[Figure: structure-based, activation-based, and model-based classification of { group 1, group 2 }; the model-based route involves model selection and inference on model parameters.]