Page 1
Shohei Shimizu
Osaka University, Japan
Non-Gaussian structural equation
models for causal discovery
2016 Probabilistic Graphical Model Workshop:
Sparsity, Structure and High-dimensionality
References:
https://sites.google.com/site/sshimizu06/home/lingampapers
Page 2
Abstract
• Estimation of causal direction and
connection strength of two observed
variables in the presence of hidden
common causes
• A key challenge in causal discovery
• Propose a non-Gaussian model
– Does not require us to specify the number of hidden
common causes
Page 3
Illustrative example
Page 4
Significant correlation between chocolate
consumption and number of Nobel laureates (Messerli12NEJM)
[Figure: scatter plot. Horizontal axis: 2002-2011 chocolate consumption (kg/yr/capita). Vertical axis: number of Nobel laureates per 10 million population.]
Corr. 0.791
P-value < 0.001
Page 5
Eating more chocolate increases
the number of Nobel laureates??
• Interpretational Drift (Maurage+13, J. Nutrition)
[Figure: the Nobel vs. Chocolate scatter plot (Corr. 0.791, P-value < 0.001) next to three candidate causal graphs over Chclt, Nobel, and GDP, where GDP acts as a hidden common cause.]
Manage this gap!
Page 6
Under what conditions
can we manage this gap?
• We have shown that it is possible under the
three assumptions (Hoyer+08IJAR; Shimizu+14JMLR)
– Linearity
– Acyclicity
– Non-Gaussianity
• Performing interventions is often very hard
• Theory closely related to independent
component analysis (ICA) (Hyvarinen+01)
Page 7
Many application areas
• Epidemiology (Rosenstrom et al., 2012): do sleep problems cause depression mood, or the reverse? Improving health and QOL
• Economics (Moneta et al., 2012): relations among OpInc.gr, Empl.gr, Sales.gr, and R&D.gr at times t, t+1, t+2; policy evaluation
• Neuroscience (Boukrina & Graves, 2013): causal information flow
• Chemistry (Campomanes et al., 2014): what changes absorption spectra?
Page 8
Brief review of structural
causal models
Page 9
Structural causal models (Pearl, 2000)
• A framework for describing causal relations
(or data generating processes)
• An example of linear cases:
$x_1 := e_1$
$x_2 := b_{21} x_1 + e_2$
[Figure: graph x1 → x2 with error terms e1, e2; e1 and e2 dependent]
• Generally speaking, if the value of $x_1$ has
been changed and then that of $x_2$ changes,
then $x_1$ causes $x_2$
Page 10
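Changing the value of x1
from c to d
• Replace the function determining x1 with
a constant c, denoted by do(x1=c), and
then change the constant to d (Pearl, 2000)
Original model:
$x_1 := e_1$
$x_2 := b_{21} x_1 + e_2$
Intervention: do(x1=c)
$x_1 := c$
$x_2 := b_{21} x_1 + e_2$
[Figure: graph x1 → x2 with errors e1, e2 before the intervention; after the intervention, e1 is replaced by the constant c]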
Page 11
Average causal effect (Rubin, 1974; Pearl, 2000)
• Average causal effect of x1 on x2 when changing x1 from c to d
– Computed based on the models with do(x1=d) and do(x1=c)
$E(x_2 \mid do(x_1 = d)) - E(x_2 \mid do(x_1 = c)) = b_{21}(d - c)$
• If you have changed the value of $x_1$ from $c$ to $d$, then $E(x_2)$ will change by $b_{21}(d - c)$
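As a quick numerical check of the average causal effect, the following minimal sketch simulates the linear model under do(x1=c) and do(x1=d). The coefficient value, the Laplace error distribution, the sample size, and the helper name `simulate_do` are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def simulate_do(x1_value, b21, n, rng):
    """Sample x2 from the intervened model: x1 := x1_value, x2 := b21*x1 + e2."""
    e2 = rng.laplace(size=n)       # assumed zero-mean non-Gaussian error
    x1 = np.full(n, x1_value)      # do(x1 = x1_value) replaces x1's own equation
    return b21 * x1 + e2

rng = np.random.default_rng(0)
b21, c, d, n = 0.8, 1.0, 3.0, 200_000   # illustrative values

# Difference of the two interventional means, estimated by simulation.
ace_estimate = simulate_do(d, b21, n, rng).mean() - simulate_do(c, b21, n, rng).mean()
ace_theory = b21 * (d - c)
print(ace_estimate, ace_theory)
```

The simulated difference in means should lie close to $b_{21}(d - c)$, up to sampling noise.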
Page 12
Formulating the problem
Page 13
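Estimation of causal direction
• Suppose that data X was randomly generated from either of the following two models:
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$
or
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ $(b_{12} \neq 0)$
[Figure: graph of Model 1, x1 → x2 with coefficient b21, and graph of Model 2, x1 ← x2 with coefficient b12, each with errors e1, e2]
• Estimate which model generated the data X based on the data X only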
Page 14
Major difficulty
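• Errors $e_1$ and $e_2$ are often dependent
• Regression coefficient of $x_2$ on $x_1$ is not
equal to $b_{21}$ even if we know the right
causal direction
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$
or
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ $(b_{12} \neq 0)$
[Figure: graph of Model 1, x1 → x2 with coefficient b21, and graph of Model 2, x1 ← x2 with coefficient b12, each with dependent errors e1, e2]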
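A short derivation makes the size of this bias explicit. It is a sketch under the assumption that Model 1 holds and that the regression coefficient is the population least-squares slope of $x_2$ on $x_1$:

```latex
\frac{\operatorname{cov}(x_1, x_2)}{\operatorname{var}(x_1)}
  = \frac{\operatorname{cov}(e_1,\; b_{21} e_1 + e_2)}{\operatorname{var}(e_1)}
  = b_{21} + \frac{\operatorname{cov}(e_1, e_2)}{\operatorname{var}(e_1)}
```

The slope therefore equals $b_{21}$ only when $\operatorname{cov}(e_1, e_2) = 0$.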
Page 15
Hidden common causes
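• Such dependency is typically introduced
by hidden common causes, say $f_1$
Model 1’:
$x_1 = \underbrace{\lambda_{11} f_1 + e'_1}_{e_1}$
$x_2 = b_{21} x_1 + \underbrace{\lambda_{21} f_1 + e'_2}_{e_2}$
or
Model 2’:
$x_1 = b_{12} x_2 + \underbrace{\lambda_{11} f_1 + e'_1}_{e_1}$
$x_2 = \underbrace{\lambda_{21} f_1 + e'_2}_{e_2}$
[Figure: graph of Model 1’, x1 → x2 with coefficient b21, and graph of Model 2’, x1 ← x2 with coefficient b12, each with hidden common cause f1 and errors e’1, e’2]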
Page 16
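A well-known guideline (Pearl2000; Spirtes+1993)
• Observe the hidden common cause $f_1$,
incorporate it in the models,
and carry out three-variable analysis
• Errors $e'_1, e'_2$ independent!
Model 1’:
$x_1 = \lambda_{11} f_1 + e'_1$
$x_2 = b_{21} x_1 + \lambda_{21} f_1 + e'_2$
or
Model 2’:
$x_1 = b_{12} x_2 + \lambda_{11} f_1 + e'_1$
$x_2 = \lambda_{21} f_1 + e'_2$
[Figure: graph of Model 1’, x1 → x2 with coefficient b21, and graph of Model 2’, x1 ← x2 with coefficient b12, each with the common cause f1 and errors e’1, e’2]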
Page 17
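Following the guideline is often
very hard
• A large number of hidden common causes $f_1, f_2, \ldots, f_Q$
may exist (Q unknown)
• Often no idea what they are
Model 1’:
$x_1 = \sum_{q} \lambda_{1q} f_q + e'_1$
$x_2 = b_{21} x_1 + \sum_{q} \lambda_{2q} f_q + e'_2$
or
Model 2’:
$x_1 = b_{12} x_2 + \sum_{q} \lambda_{1q} f_q + e'_1$
$x_2 = \sum_{q} \lambda_{2q} f_q + e'_2$
[Figure: graph of Model 1’, x1 → x2 with coefficient b21, and graph of Model 2’, x1 ← x2 with coefficient b12, each with hidden common causes f1, ..., fQ and errors e’1, e’2]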
Page 18
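Estimation of causal direction
in the presence of
hidden common causes
• Estimate which model generated the data X
Model 1’:
$x_1 = \sum_{q} \lambda_{1q} f_q + e'_1$
$x_2 = b_{21} x_1 + \sum_{q} \lambda_{2q} f_q + e'_2$
or
Model 2’:
$x_1 = b_{12} x_2 + \sum_{q} \lambda_{1q} f_q + e'_1$
$x_2 = \sum_{q} \lambda_{2q} f_q + e'_2$
[Figure: graph of Model 1’, x1 → x2 with coefficient b21, and graph of Model 2’, x1 ← x2 with coefficient b12, each with hidden common causes f1, ..., fQ and errors e’1, e’2]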
Page 19
Note
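• If we intervene on x1 (and x2), we have no
hidden common causes
• But interventions are often difficult to carry out
for ethical or cost reasons
Model 1’:
$x_1 = \sum_{q} \lambda_{1q} f_q + e'_1$
$x_2 = b_{21} x_1 + \sum_{q} \lambda_{2q} f_q + e'_2$
Model 1’’ (after intervening on x1):
$x_1 = c$
$x_2 = b_{21} x_1 + \sum_{q} \lambda_{2q} f_q + e'_2$
[Figure: graph of Model 1’, x1 → x2 with hidden common causes f1, ..., fQ and errors e’1, e’2; in Model 1’’ the inputs to x1 are replaced by the constant c]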
Page 20
Major challenges
1. Estimation of causal direction when temporal information is not available
[Figure: x1 → x2 or x1 ← x2?]
2. Managing hidden common causes
[Figure: x1 → x2 or x1 ← x2, each with hidden common cause f1?]
Page 21
Basic non-Gaussian model
(No hidden common cause)
S. Shimizu, P. O. Hoyer, A. Hyvärinen
and A. Kerminen.
Journal of Machine Learning Research,
2006.
Page 22
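Independent errors
• Implying no hidden common causes
• The two models are distinguishable if the errors
e1 and e2 are non-Gaussian (Dodge+00CSTM; Shimizu+06JMLR)
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$
or
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$
$(b_{12}, b_{21} \neq 0)$
[Figure: graph of Model 1, x1 → x2 with coefficient b21, and graph of Model 2, x1 ← x2 with coefficient b12, each with errors e1, e2]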
Page 23
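Different directions give
different data distributions
Model 1: $x_1 = e_1$, $x_2 = 0.8 x_1 + e_2$
Model 2: $x_1 = 0.8 x_2 + e_1$, $x_2 = e_2$
$E(e_1) = E(e_2) = 0$, $\mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$
[Figure: scatter plots of x1 vs. x2 under Model 1 and Model 2, with panels for Gaussian and for non-Gaussian errors e1, e2]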
Independent Component Analysis
(ICA) (Jutten & Herault, 1991; Comon, 1994)
• Observed random vector x is modeled by
$x = A s$, or $x_i = \sum_{j=1}^{p} a_{ij} s_j$
where
– The mixing matrix $A = [a_{ij}]$
– The hidden variables (independent components) $s_i$ are non-Gaussian and mutually independent
• Then, A is identifiable up to permutation and scaling of the columns
Page 25
Sketch of the identifiability proof
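• Different directions give different zero/non-zero
patterns of the mixing matrices
– No zeros on the diagonal in the causal model
– No permutation indeterminacy
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$, i.e. $x = A s$ with
$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ b_{21} & 1 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$, i.e. $x = A s$ with
$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & b_{12} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$
[Figure: for each model, the graph from e1, e2 to x1, x2; the absent edge corresponds to the 0 entry of A]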
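The zero/non-zero patterns can be reproduced numerically. A minimal sketch, assuming the model is written as x = B x + e so that the mixing matrix is A = (I - B)^{-1}; the coefficient values and the helper name `mixing_matrix` are illustrative.

```python
import numpy as np

def mixing_matrix(B):
    """Return A = (I - B)^{-1}, so that x = B x + e becomes x = A e."""
    return np.linalg.inv(np.eye(B.shape[0]) - B)

b21, b12 = 0.8, 0.8   # illustrative nonzero coefficients

# Model 1: x1 = e1, x2 = b21*x1 + e2
A1 = mixing_matrix(np.array([[0.0, 0.0], [b21, 0.0]]))
# Model 2: x1 = b12*x2 + e1, x2 = e2
A2 = mixing_matrix(np.array([[0.0, b12], [0.0, 0.0]]))

print(A1)   # zero in the upper-right entry
print(A2)   # zero in the lower-left entry
```

Because the diagonal of A is non-zero in both models, the position of the single zero tells the two directions apart once the columns are matched to e1 and e2.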
Page 26
Linear Non-Gaussian Acyclic
Models (LiNGAM) (Shimizu+06JMLR)
$x_i = \mu_i + \sum_{j \neq i} b_{ij} x_j + e_i$
• Acyclicity
• Non-Gaussian errors $e_i$
• Independence of errors $e_i$
(no hidden common causes)
• Identifiable: Directions, coefficients, and intercepts
– Can be uniquely estimated without knowing the causal
structure
[Figure: example graph over x1, x2, x3 with coefficients b21, b23, b13 and errors e1, e2, e3]
Page 27
Extensions
• Cyclic models (Lacerda+08UAI; Hyvarinen+13JMLR)
• Time series (Hyvarinen+10JMLR; Huang+15IJCAI; Gong15ICML)
• Nonlinearity (Zhang+09UAI; Peters+14JMLR; cf. Imoto02PSB)
• Discrete variables (Peters+11TPAMI; Park+15NIPS)
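[Figure: cyclic model over x1 and x2 with errors e1, e2]
Time series: $x(t) = \sum_{\tau=0}^{k} B_{\tau}\, x(t - \tau) + e(t)$
Nonlinearity: $x_i = f_{i,2}^{-1}\left(f_{i,1}(\text{parents of } x_i) + e_i\right)$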
Page 28
LiNGAM with hidden
common causes
P. O. Hoyer, S. Shimizu, A. Kerminen,
and M. Palviainen.
Int. J. Approximate Reasoning
2008
Page 29
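LiNGAM with hidden
common causes (Hoyer+08IJAR)
• Extension to incorporate non-Gaussian hidden
common causes $f_q$
$x_i = \mu_i + \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$
where $f_q$ $(q = 1, \ldots, Q)$ are independent
Two-variable example:
$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1$
$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$
[Figure: graph x1 → x2 with hidden common causes f1, f2 and errors e1, e2]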
Page 30
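Without loss of generality (WLG), hidden common causes $f_q$
are assumed to be independent
$x_i = \mu_i + \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$
Dependent hidden common causes can be written as linear combinations of independent variables:
$\begin{bmatrix} f_1 \\ f_2 \end{bmatrix} = \begin{bmatrix} a_{11} & 0 \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} e_{f_1} \\ e_{f_2} \end{bmatrix}$
[Figure: left panel "Dependent hidden common causes", with f1, f2 driven by e_{f1}, e_{f2}; right panel "Independent hidden common causes", with f1 := e_{f1}, f2 := e_{f2}; both over x1, x2 with errors e1, e2]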
Page 31
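Different causal directions give
different data distributions (Hoyer, Shimizu, Kerminen and Palviainen, 2008, IJAR)
• Faithfulness + number of hidden common causes “known”
x1 → x2:
$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1$
$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$
or
x1 ← x2:
$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + b_{12} x_2 + e_1$
$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + e_2$
[Figure: the two graphs over x1, x2 with hidden common causes f1, …, fQ and errors e1, e2, each shown with a scatter plot of x1 vs. x2]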
Page 32
Previous estimation approaches
• Explicitly model hidden common causes and
compare two models with opposite directions of
causation
– Maximum likelihood principle (Hoyer+08IJAR)
– Bayesian model selection (Henao & Winther, 2011, JMLR)
• Require us to specify the number of hidden
common causes, which is difficult in general
[Figure: the two candidate models, x1 → x2 or x1 ← x2, each with hidden common causes f1, …, fQ and errors e1, e2]
Page 33
Our proposal:
a Bayesian approach
S. Shimizu and K. Bollen.
Journal of Machine Learning Research,
2014
Page 34
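Key idea (1/2)
• Another look at the LiNGAM with hidden common
causes:
m-th obs.: $x_2^{(m)} = \underbrace{\mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}}_{\mu_2^{(m)}} + b_{21} x_1^{(m)} + e_2^{(m)}$
[Figure: the graph x1 → x2 with hidden common causes f1, …, fQ, redrawn as one graph per observation, from $x_1^{(1)} \to x_2^{(1)}$ to $x_1^{(m)} \to x_2^{(m)}$, each with the same coefficient b21, its own errors $e_1^{(m)}, e_2^{(m)}$, and its own intercept]
• Observations are generated from the LiNGAM
model with possibly different intercepts $\mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$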
Page 35
Key idea (2/2)
• Include the sums of hidden common
causes as the observation-specific
intercepts:
m-th obs.: $x_2^{(m)} = \underbrace{\mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}}_{\text{obs.-specific intercept } \mu_2^{(m)}} + b_{21} x_1^{(m)} + e_2^{(m)}$
• Hidden common causes are not modeled
explicitly
– No need to specify the number of
hidden common causes Q or to estimate the
coefficients $\lambda_{2q}$
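The reparameterization can be illustrated with simulated data. In this minimal sketch the number of hidden common causes, their Laplace distribution, and all coefficient values are assumptions made only to generate data; the point is that the same observations can be written with a single intercept per observation, without reference to Q or to the individual coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, Q = 1000, 5                  # Q is used only to simulate; illustrative
mu2, b21 = 0.5, 0.8             # illustrative intercept and coefficient
lam1 = rng.normal(size=Q)       # hypothetical coefficients lambda_{1q}
lam2 = rng.normal(size=Q)       # hypothetical coefficients lambda_{2q}

f = rng.laplace(size=(n, Q))    # hidden common causes f_q^{(m)}
e1 = rng.laplace(size=n)
e2 = rng.laplace(size=n)

# Explicit hidden-common-cause form.
x1 = f @ lam1 + e1
x2_explicit = mu2 + f @ lam2 + b21 * x1 + e2

# Same data written with one intercept per observation.
mu2_obs = mu2 + f @ lam2        # mu_2^{(m)}, one value per observation m
x2_intercept = mu2_obs + b21 * x1 + e2

print(np.allclose(x2_explicit, x2_intercept))
```

The price of dropping Q is one extra parameter per observation, which is why a prior on these intercepts is needed.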
Page 36
Bayesian model selection
• Compare the marginal likelihoods of these two
models with opposite directions
Model 3 (x1 → x2):
$x_1^{(m)} = \mu_1^{(m)} + e_1^{(m)}$
$x_2^{(m)} = \mu_2^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$
Model 4 (x1 ← x2):
$x_1^{(m)} = \mu_1^{(m)} + b_{12} x_2^{(m)} + e_1^{(m)}$
$x_2^{(m)} = \mu_2^{(m)} + e_2^{(m)}$
• Many additional parameters $\mu_i^{(m)}$ $(i = 1, 2;\ m = 1, \ldots, n)$
– Similar to mixed models and multi-level models
– Informative prior for the observation-specific intercepts
Page 37
Prior for the observation-specific
intercepts
• Motivation: Central limit theorem
– Sums of independent variables, here $\sum_{q=1}^{Q} \lambda_{1q} f_q^{(m)}$ and $\sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$, tend to be more Gaussian
• Approximate the density by a bell-shaped curve dist.
– $[\mu_1^{(m)}, \mu_2^{(m)}]^T \sim$ t-distribution with sd $\tau_1, \tau_2$, correlation $\sigma_{12}$, and DOF $v$
• Select the hyper-parameter values that maximize the
marginal likelihood
– $\tau_l \in \{0,\ 0.2\,\mathrm{sd}(x_l),\ \ldots,\ 1.0\,\mathrm{sd}(x_l)\}$, $\sigma_{12} \in \{0, 0.1, \ldots, 0.9\}$
– DOF fixed to be 6 in the experiments below
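To make the shape of such a prior concrete, the sketch below draws pairs of intercepts from a zero-mean bivariate t-distribution with 6 degrees of freedom. Treating "sd" as the marginal standard deviation (rather than the scale parameter) is an assumption of this sketch, as are the function name `sample_t_intercepts` and the numerical values.

```python
import numpy as np

def sample_t_intercepts(n, sd1, sd2, corr, dof, rng):
    """Draw n pairs from a zero-mean bivariate t-distribution whose
    marginal standard deviations are sd1, sd2 and whose correlation is corr."""
    cov = np.array([[sd1**2, corr * sd1 * sd2],
                    [corr * sd1 * sd2, sd2**2]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    w = rng.chisquare(dof, size=n) / dof
    t = z / np.sqrt(w)[:, None]
    # A t vector with scale matrix `cov` has covariance cov * dof / (dof - 2);
    # rescale so that the standard deviations equal sd1 and sd2.
    return t * np.sqrt((dof - 2) / dof)

rng = np.random.default_rng(0)
mu = sample_t_intercepts(200_000, sd1=0.4, sd2=0.6, corr=0.5, dof=6, rng=rng)
print(mu.std(axis=0), np.corrcoef(mu.T)[0, 1])
```

A hyper-parameter search of the kind listed above would repeat the model comparison over a grid of `sd1`, `sd2`, and `corr` values.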
Page 38
The chocolate data revisited
[Figure: scatter plot of Nobel vs. Chocolate]
Corr. 0.791
P-value < 0.001
Gaussianity rejected for both
“Chocolate consumption”
and “Num. Nobel laureates”
Page 39
Model comparison
• No method was previously available to compare these two models
Page 41
Conclusions
• Estimation of causal direction in the presence of
hidden common causes is a major challenge in
causal discovery
• Proposed a linear non-Gaussian SEM with
possibly different intercepts
– Does not require specifying the number of hidden common
causes
• Future work
– Sensitivity to the choice of prior distributions
– Better estimation methods that are computationally and
statistically efficient … and many others
Page 43
High-dimensional cases
• Huge number of candidate networks
• Analyze every pair of variables and integrate the
results to get an entire causal ordering
• Simpler than trying all the combinations of
causal orders
[Figure: pipeline over x1, x2, x3, x4 with hidden common causes f1, f3: pairwise analysis, then integrate the results into a full graph, then prune redundant edges]
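The integration step can be sketched as follows. The pairwise decisions are hypothetical inputs (in practice they would come from the pairwise model comparison), and the function name `causal_order` and the tie-breaking rule are assumptions of this sketch rather than the procedure used in the cited work.

```python
def causal_order(variables, pairwise_directions):
    """Combine pairwise decisions (cause, effect) into one causal ordering.

    Repeatedly picks a variable that no remaining variable is judged to cause.
    Raises ValueError if the pairwise decisions contain a cycle."""
    remaining = list(variables)
    edges = set(pairwise_directions)
    order = []
    while remaining:
        roots = [v for v in remaining
                 if not any((u, v) in edges for u in remaining if u != v)]
        if not roots:
            raise ValueError("pairwise decisions are cyclic")
        order.append(roots[0])
        remaining.remove(roots[0])
    return order

# Hypothetical pairwise results for four variables (one decision per pair).
decisions = [("x1", "x2"), ("x1", "x3"), ("x1", "x4"),
             ("x2", "x3"), ("x2", "x4"), ("x3", "x4")]
order = causal_order(["x3", "x1", "x4", "x2"], decisions)
print(order)
```

Once an ordering is available, only edges pointing forward in the order need to be considered, and redundant ones can be pruned.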
Page 44
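Different zero/non-zero patterns
of the mixing matrices (Hoyer+08IJAR)
• Faithfulness on $x_i$, $f_i$ + Number of $f_i$ given
[Figure: scatter plots of x1 vs. x2, with panels for Gaussian and for non-Gaussian e1, e2, f1]
[Figure: Models 1, 2, and 3, each a graph over x1, x2 and the hidden common cause f1]
Patterns of the mixing matrix A, one per model (* = non-zero entry):
$\begin{bmatrix} * & * & 0 \\ * & 0 & * \end{bmatrix}$, $\begin{bmatrix} * & * & * \\ * & 0 & * \end{bmatrix}$, $\begin{bmatrix} * & * & 0 \\ * & * & * \end{bmatrix}$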