Introduction Single Imputation MI with PCA MI with MCA Conclusion References
Multiple imputation with principal component methods
Vincent Audigier1, François Husson2, Julie Josse2
1. Inserm ECSTRA team, Saint-Louis Hospital, Paris
2. Agrocampus Ouest, Rennes
Midia meeting, March 23, 2016
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
Principal component methods: aims
From multidimensional data, principal component methods:
• summarize
• describe
• visualize
Several objectives:
• identify the similarities between individuals
• highlight the relationships between variables
• describe some groups of individuals by a set of relevant variables
How does it work?
• The I individuals are seen as elements of R^K
• A distance d on R^K to define proximities between individuals
• Vect(v_1, ..., v_S) maximising the projected inertia
⇒ Dimensionality reduction methods
A set of methods
One principal component method per data type, each with its own distance:

• continuous data → PCA (distance d_pca)
• categorical data → MCA (distance d_mca)
• mixed data → FAMD (distance d_famd), with d²_famd = d²_pca + d²_mca
FAMD
       Wind   Rainfall  maxO3    T9    T12   ...
0602   North  Dry         82   17.0   18.4   ...
0603   East   Dry         92   15.3   17.6   ...
0604   North  Dry        114   16.2   19.7   ...
0605   West   Dry         94   17.4   20.5   ...
0606   West   Rain        80   17.7   19.8   ...
...    ...    ...        ...    ...    ...   ...
[Figure: FAMD outputs — Individuals plot (daily observations 0601–0930), Categories plot (Wind: North/South/East/West; Rainfall: Rain/Dry), Variables correlation circle (maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v) and Graph of the variables (adding Wind and Rainfall); Dim 1 (44.89%), Dim 2 (15.88%)]
Why principal component methods for MI?
1. Lack of joint imputation models for non-continuous data
   • mixed data: various relationships between variables
   • categorical data: combinatorial explosion
   ⇒ a large range of joint models

2. Imputation models with too many parameters:
   • overfitting
   • storage issues
   ⇒ a small number of parameters

3. Models based on regressions:
   • number of individuals less than the number of variables
   • collinearity
   ⇒ no inversion issues
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
How to perform FAMD?
FAMD can be seen as the SVD of X with weights for
• the continuous variables and the categories: (D_Σ)^{-1}
• the individuals: (1/I) 1_I

→ SVD(X, (D_Σ)^{-1}, (1/I) 1_I)

with X gathering the continuous variables and the indicator matrix of the categorical variables:

    | 11.04 ... 2.07   1 0 ... 1 0 0 |
    | 10.76 ... 1.86   1 0 ... 1 0 0 |
X = | 11.02 ... 2.04   1 0 ... 1 0 0 |
    | 11.02 ... 1.92   0 1 ... 0 1 0 |
    | 11.06 ... 2.01   0 1 ... 0 0 1 |
    | 10.95 ... 1.67   0 1 ... 0 1 0 |

and D_Σ = diag(σ_{x_1}, ..., σ_{x_k}, I_{k+1}, ..., I_K)
How to perform FAMD?
SVD(X, (D_Σ)^{-1}, (1/I) 1_I) → X_{I×K} = U_{I×K} Λ^{1/2}_{K×K} V^T_{K×K}

with U^T ((1/I) 1_I) U = 1_K and V^T D_Σ^{-1} V = 1_K

• principal components: F_{I×S} = U_{I×S} Λ^{1/2}_{S×S}
• loadings: V_{K×S}
• fitted matrix: X̂_{I×K} = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S}

The criterion

‖X − X̂‖²_{D_Σ^{-1} ⊗ (1/I)1_I} = tr( (X − X̂) D_Σ^{-1} (X − X̂)^T (1/I) 1_I )

is minimized under the constraint of rank S.
Properties of the method
• The distance between individuals i and i′ is:

  d²(i, i′) = Σ_{j=1}^{k} (x_ij − x_i′j)² / σ²_{x_j} + Σ_{j=k+1}^{K} (1/I_j) (x_ij − x_i′j)²

• The principal component F_s maximises:

  Σ_{var ∈ continuous} r²(F_s, var) + Σ_{var ∈ categorical} η²(F_s, var)
FAMD with missing values
⇒ FAMD: least squares

  ‖X_{I×K} − U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S}‖²

⇒ FAMD with missing values: weighted least squares

  ‖W_{I×K} ∗ (X_{I×K} − U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S})‖²

with w_ij = 0 if x_ij is missing, w_ij = 1 otherwise.

Many algorithms have been developed for PCA, such as NIPALS (Christoffersson, 1970) or iterative PCA (Kiers, 1997).
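As an illustrative sketch (not the missMDA implementation), the weighted least squares problem can be solved for plain PCA on continuous data by an iterative SVD loop: impute, fit a rank-S SVD, re-impute only the cells with w_ij = 0. All names below are ours:

```python
import numpy as np

def iterative_pca_impute(X, S=2, tol=1e-8, max_iter=500):
    """Iterative PCA imputation: alternate a rank-S SVD fit with
    re-imputation of the missing cells (w_ij = 0 on missing entries)."""
    W = ~np.isnan(X)                                  # w_ij = 1 if observed
    Xc = np.where(W, X, np.nanmean(X, axis=0))        # start: column means
    prev = np.inf
    for _ in range(max_iter):
        mu = Xc.mean(axis=0)
        U, d, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
        Xhat = mu + (U[:, :S] * d[:S]) @ Vt[:S]       # rank-S fitted matrix
        Xc = np.where(W, X, Xhat)                     # keep observed cells
        crit = np.sum((Xc - Xhat)[W] ** 2)            # observed-cell residual
        if abs(prev - crit) < tol * (1 + crit):
            break
        prev = crit
    return Xc
```

On data close to rank S the missing cells are recovered almost exactly; on noisier data this plain version overfits, which is what the regularized variant later in the deck addresses.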
FAMD with missing values
Iterative FAMD algorithm:

1. initialization: imputation by the mean (continuous variables) / the proportion (categories)
2. iterate until convergence:
   (a) estimation of the parameters of FAMD → SVD of (X, (D_Σ)^{-1}, (1/I) 1_I)
   (b) imputation of the missing values with X̂_{I×K} = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S}
   (c) update of D_Σ

| NA    ... 2.07   A  ... A  |      | NA    ... 2.07   1  0  ... 1  0  0  |
| 10.76 ... 1.86   A  ... A  |      | 10.76 ... 1.86   1  0  ... 1  0  0  |
| 11.02 ... NA     A  ... NA |  →   | 11.02 ... NA     1  0  ... NA NA NA |
| 11.02 ... 1.92   B  ... B  |      | 11.02 ... 1.92   0  1  ... 0  1  0  |
| 11.06 ... 2.01   NA ... C  |      | 11.06 ... 2.01   NA NA ... 0  0  1  |
| NA    ... 1.67   B  ... B  |      | NA    ... 1.67   0  1  ... 0  1  0  |

⇒ the imputed values can be seen as degrees of membership
Single imputation with FAMD (Audigier et al., 2014)

Iterative FAMD algorithm (regularized version):

1. initialization: imputation by the mean (continuous variables) / the proportion (categories)
2. iterate until convergence:
   (a) estimation of the parameters of FAMD → SVD of (X, (D_Σ)^{-1}, (1/I) 1_I)
   (b) imputation of the missing values with X̂_{I×K} = U_{I×S} f(Λ^{1/2}_{S×S}) V^T_{K×S},
       where f(λ_s^{1/2}) = (λ_s − σ²) / λ_s^{1/2}
   (c) update of D_Σ

| 11.04 ... 2.07   A ... A |      | 11.04 ... 2.07   1    0    ... 1    0    0    |
| 10.76 ... 1.86   A ... A |      | 10.76 ... 1.86   1    0    ... 1    0    0    |
| 11.02 ... 2.04   A ... A |  ←   | 11.02 ... 2.04   1    0    ... 0.81 0.05 0.14 |
| 11.02 ... 1.92   B ... B |      | 11.02 ... 1.92   0    1    ... 0    1    0    |
| 11.06 ... 2.01   B ... C |      | 11.06 ... 2.01   0.25 0.75 ... 0    0    1    |
| 10.95 ... 1.67   B ... B |      | 10.95 ... 1.67   0    1    ... 0    1    0    |

⇒ the imputed values can be seen as degrees of membership
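The regularization step can be illustrated numerically: each retained singular value is shrunk by f(λ_s^{1/2}) = (λ_s − σ̂²)/λ_s^{1/2}, so dimensions barely above the noise level are damped much more than strong ones. A small sketch with our own naming; here σ̂² is taken as the mean of the discarded eigenvalues, one common choice standing in for the residual-variance estimate:

```python
import numpy as np

def shrink(d, S):
    """Shrink the S retained singular values:
    f(sqrt(lam_s)) = (lam_s - sigma2) / sqrt(lam_s),
    with sigma2 estimated from the discarded eigenvalues."""
    lam = d ** 2                       # eigenvalues from singular values
    sigma2 = lam[S:].mean()            # noise-variance estimate (assumption)
    return np.maximum((lam[:S] - sigma2) / np.sqrt(lam[:S]), 0.0)

d = np.array([5.0, 3.0, 1.0, 0.9])
print(shrink(d, 2))   # the strong dimension barely moves, the weak one shrinks more
```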
Simulation results
Single imputation with FAMD shows a high quality of prediction compared to random forests (Stekhoven and Bühlmann, 2012):
• on real data
• when the relationships between continuous variables are linear
• for rare categories
• with MAR/MCAR mechanism
Can impute mixed, continuous or categorical data
But a single imputation method only
From single imputation to multiple imputation
P(X^miss | X^obs, ψ_1), ..., P(X^miss | X^obs, ψ_M)

The fitted value (F u′)_ij is replaced by M disturbed predictions (F u′)^1_ij + ε^1_ij, ..., (F u′)^M_ij + ε^M_ij

1. Reflect the variability of the parameters of the imputation model:
   ((U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_1, ..., (U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_M)
   → Bayesian or bootstrap approach

2. Add a disturbance to the prediction X̂_m = U_m Λ^{1/2}_m V^T_m
   → need to distinguish continuous and categorical data
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
PCA model (Caussinus, 1986)
Model:

X_{I×K} = X̃_{I×K} + ε_{I×K} = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S} + ε_{I×K}, with ε ~ N(0, σ² 1_K)

Maximum likelihood:

X̃^S = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S} → σ̂² = ‖X − X̃^S‖² / degrees of freedom

Bayesian formulations:
• Hoff (2007): uniform prior for U and V, Gaussian prior on (λ_s)_{s=1...S}
• Verbanck et al. (2013): prior on X̃
Bayesian PCA (Verbanck et al., 2013)
Model: X_{I×K} = X̃_{I×K} + ε_{I×K}, i.e. x_ik = x̃_ik + ε_ik with ε_ik ~ N(0, σ²)

x̃_ik = Σ_{s=1}^{S} √λ_s u_is v_ks = Σ_{s=1}^{S} x̃^(s)_ik

Prior: x̃^(s)_ik ~ N(0, τ_s²)

Posterior: (x̃^(s)_ik | x^(s)_ik) ~ N(Φ_s x^(s)_ik, Φ_s σ²) with Φ_s = τ_s² / (τ_s² + σ²)

Empirical Bayes for τ_s²: τ̂_s² = λ_s − σ², so that

Φ_s = (λ_s − σ²) / λ_s = signal variance / total variance   (Efron and Morris, 1972)
Multiple imputation with Bayesian PCA (Audigier et al., 2015)

1. Variability of the parameters: M plausible signals (x̃_ij)^1, ..., (x̃_ij)^M
   • posterior distribution from Bayesian PCA: (x̃^(s)_ij | x^(s)_ij) ~ N(Φ_s x^(s)_ij, Φ_s σ²)
   • Data Augmentation (Tanner and Wong, 1987)

2. Imputation according to the PCA model using the set of M parameters: x^miss_ij ← N(x̃_ij, σ²)
Multiple imputation with Bayesian PCA (Audigier et al., 2015)

Data augmentation:
• a Gibbs sampler
• simulates (ψ, X^miss | X^obs) by alternating
  (I) (X^miss | X^obs, ψ): imputation
  (P) (ψ | X^obs, X^miss): draw from the posterior
• convergence checked by graphical investigations

For Bayesian PCA:
• initialisation: ML estimate for X̃
• for ℓ in 1...L:
  (I) given X̃, x^miss_ij ← N(x̃_ij, σ²)
  (P) x̃_ij ← N(Σ_s Φ_s x^(s)_ij, σ² (Σ_s Φ_s) / I)
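A minimal sketch of this (I)/(P) alternation, under simplifying assumptions of ours: the noise variance σ² is estimated from the discarded dimensions, the SVD is taken on the raw matrix (no FAMD-style column weighting), and all names are illustrative:

```python
import numpy as np

def da_bayes_pca(X, S, n_iter=100, seed=0):
    """Data-augmentation sketch for PCA imputation: alternate
    (I) draw missing cells from N(x_tilde, sigma2) and
    (P) draw the signal x_tilde around its shrunk posterior mean."""
    rng = np.random.default_rng(seed)
    I = len(X)
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)      # initialisation
    for _ in range(n_iter):
        U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
        lam = d ** 2 / I                               # eigenvalue scale
        sigma2 = lam[S:].mean()                        # noise variance (assumption)
        phi = np.clip((lam[:S] - sigma2) / lam[:S], 0.0, 1.0)
        # posterior mean of the signal: sum_s phi_s * x^(s)
        mean = (U[:, :S] * (phi * d[:S])) @ Vt[:S]
        # (P) draw the signal; posterior variance sigma2 * sum_s(phi_s) / I
        tilde = mean + rng.normal(scale=np.sqrt(sigma2 * phi.sum() / I), size=X.shape)
        # (I) draw the missing cells from N(x_tilde, sigma2)
        Xc = np.where(miss, tilde + rng.normal(scale=np.sqrt(sigma2), size=X.shape), X)
    return Xc
```

Keeping the draws from several well-separated iterations of this chain yields the M imputed data sets.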
Standard MI methods for continuous data
Generally based on the normal distribution:

• JM (Honaker et al., 2011): x_i. ~ N(μ, Σ)
  1. bootstrap of the rows: X^1, ..., X^M; EM algorithm: (μ̂^1, Σ̂^1), ..., (μ̂^M, Σ̂^M)
  2. imputation: x^m_i. drawn from N(μ̂^m, Σ̂^m)

• FCS (Van Buuren, 2012): N(μ_{X_k | X_{(−k)}}, Σ_{X_k | X_{(−k)}})
  1. Bayesian approach: (β^m, σ^m)
  2. imputation: stochastic regression, x^m_ij drawn from N(X_{(−k)} β^m, σ^m)
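The JM scheme can be sketched as follows — a simplified illustration of ours, with complete-case estimation of (μ, Σ) standing in for the EM step, and each row's missing block drawn from the conditional normal given its observed entries:

```python
import numpy as np

def jm_impute(X, M=5, seed=0):
    """Joint-modelling MI sketch: bootstrap the rows, estimate (mu, Sigma),
    then draw each row's missing block from the conditional normal."""
    rng = np.random.default_rng(seed)
    cc = X[~np.isnan(X).any(axis=1)]           # complete cases (stand-in for EM)
    imputed = []
    for _ in range(M):
        boot = cc[rng.integers(0, len(cc), len(cc))]
        mu, Sig = boot.mean(axis=0), np.cov(boot, rowvar=False)
        Xm = X.copy()
        for i in range(len(Xm)):
            m = np.isnan(Xm[i])
            if not m.any():
                continue
            o = ~m
            # conditional N(mu_m + S_mo S_oo^{-1} (x_o - mu_o),
            #               S_mm - S_mo S_oo^{-1} S_om)
            Soo_inv = np.linalg.inv(Sig[np.ix_(o, o)])
            cmu = mu[m] + Sig[np.ix_(m, o)] @ Soo_inv @ (Xm[i, o] - mu[o])
            cS = Sig[np.ix_(m, m)] - Sig[np.ix_(m, o)] @ Soo_inv @ Sig[np.ix_(o, m)]
            cS = (cS + cS.T) / 2               # symmetrise for the sampler
            Xm[i, m] = rng.multivariate_normal(cmu, cS)
        imputed.append(Xm)
    return imputed
```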
Simulations
• Quantities of interest: θ_1 = E[Y], θ_2 = β_1, θ_3 = ρ
• 1000 simulations
• data sets drawn from N_p(μ, Σ) with a two-block structure, varying I (30 or 200), K (6 or 60) and ρ (0.3 or 0.9)
  [Illustration: Σ with two blocks of correlated variables (0.8 within a block, 0 between blocks)]
• 10% or 30% of missing values using a MCAR mechanism
• multiple imputation using M = 20 imputed arrays
• Criteria:
  • bias
  • CI width, coverage
Results for the expectation
Confidence interval width and coverage for JM, FCS and BayesMIPCA ("–": not available):

      I    K   ρ    %    width JM  width FCS  width Bayes   cov JM  cov FCS  cov Bayes
 1    30   6   0.3  0.1    0.803     0.805      0.781       0.955    0.953     0.950
 2    30   6   0.3  0.3      –       1.010      0.898         –      0.971     0.949
 3    30   6   0.9  0.1    0.763     0.759      0.756       0.952    0.950     0.949
 4    30   6   0.9  0.3      –       0.818      0.783         –      0.965     0.953
 5    30   60  0.3  0.1      –         –        0.775         –        –       0.955
 6    30   60  0.3  0.3      –         –        0.864         –        –       0.952
 7    30   60  0.9  0.1      –         –        0.742         –        –       0.953
 8    30   60  0.9  0.3      –         –        0.759         –        –       0.954
 9    200  6   0.3  0.1    0.291     0.294      0.292       0.947    0.947     0.946
10    200  6   0.3  0.3    0.328     0.334      0.325       0.954    0.959     0.952
11    200  6   0.9  0.1    0.281     0.281      0.281       0.953    0.950     0.952
12    200  6   0.9  0.3    0.288     0.289      0.288       0.948    0.951     0.951
13    200  60  0.3  0.1      –       0.304      0.289         –      0.957     0.945
14    200  60  0.3  0.3      –       0.384      0.313         –      0.981     0.958
15    200  60  0.9  0.1      –       0.282      0.279         –      0.951     0.948
16    200  60  0.9  0.3      –       0.296      0.283         –      0.958     0.952
Properties for BayesMIPCA
A MI method based on a Bayesian treatment of the PCA model.

Advantages:

• captures the structure of the data: good inferences for regression coefficients, correlations, means
• a dimensionality reduction method: handles I < K or I > K, low or high percentages of missing values
• no inversion issue: strong or weak relationships
• a regularization strategy improving stability

Remains competitive if:

• the low-rank assumption is not verified
• the Gaussian assumption is not true
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
Multiple imputation for categorical data using MCA
MI for categorical data is challenging for a moderate number of variables:
• estimation issues
• storage issues

MI with MCA:

1. Variability of the parameters of the imputation model:
   ((U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_1, ..., (U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_M)
   → a non-parametric bootstrap approach

2. Add a disturbance to the MCA prediction X̂_m = U_m Λ^{1/2}_m V^T_m
Multiple imputation with MCA (Audigier et al., 2015)
1. Variability of the parameters of MCA (U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S}) using a non-parametric bootstrap:
   • define M weightings (R_m)_{1≤m≤M} for the individuals
   • estimate the MCA parameters using the SVD of (X, (1/K)(D_Σ)^{-1}, R_m)

2. Imputation: draw categories from the values of the fuzzy indicator matrices (X̂_m)_{1≤m≤M}:

X̂_1:                            X̂_2:                            X̂_M:
| 1    0    ... 1    0    |     | 1    0    ... 1    0    |     | 1    0    ... 1    0    |
| 1    0    ... 1    0    |     | 1    0    ... 1    0    |     | 1    0    ... 1    0    |
| 1    0    ... 0.81 0.19 |     | 1    0    ... 0.60 0.40 | ... | 1    0    ... 0.74 0.16 |
| 0.25 0.75 ... 0    1    |     | 0.26 0.74 ... 0    1    |     | 0.20 0.80 ... 0    1    |
| 0    1    ... 0    1    |     | 0    1    ... 0    1    |     | 0    1    ... 0    1    |

giving the M imputed categorical data sets:

| A ... A |     | A ... A |     | A ... A |
| A ... A |     | A ... A |     | A ... A |
| A ... B |     | A ... A | ... | A ... B |
| B ... C |     | B ... C |     | B ... C |
| B ... B |     | B ... B |     | B ... B |
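Step 2 — turning a fuzzy indicator block into category draws — can be sketched as follows (an illustration of ours; negative fitted values, which the MCA reconstruction can produce, are clipped before renormalising each row into probabilities):

```python
import numpy as np

def draw_categories(Xhat, seed=0):
    """Draw one category per individual from a fuzzy indicator block:
    each row holds the degrees of membership for one variable's categories."""
    rng = np.random.default_rng(seed)
    P = np.clip(Xhat, 0.0, None)               # degrees of membership >= 0
    P = P / P.sum(axis=1, keepdims=True)       # renormalise to probabilities
    return np.array([rng.choice(len(p), p=p) for p in P])

fuzzy = np.array([[1.00, 0.00],
                  [0.81, 0.19],
                  [0.25, 0.75]])
cats = draw_categories(fuzzy)                  # e.g. categories 0, 0, 1
```

Repeating the draw on each of the M fitted matrices produces the M imputed categorical data sets.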
Properties
MCA addresses the categorical data challenge by:

• requiring a small number of parameters
• preserving the essential structure of the data
• using a regularisation strategy

MIMCA can be applied to various data sets:

• small or large numbers of variables/categories
• small or large numbers of individuals
MI methods for categorical data
• Log-linear model (Schafer, 1997)
  • hypothesis on the contingency table X = (x_ijk)_{i,j,k}: X | ψ ~ M(n, ψ), with
    log(ψ_ijk) = λ_0 + λ^A_i + λ^B_j + λ^C_k + λ^AB_ij + λ^AC_ik + λ^BC_jk + λ^ABC_ijk
  1. variability of the parameter ψ: Bayesian formulation
  2. imputation using the set of M parameters

• Latent class model (Si and Reiter, 2013)
  • hypothesis: P(X = (x_1, ..., x_K); ψ) = Σ_{ℓ=1}^{L} ( ψ_ℓ Π_{k=1}^{K} ψ^(ℓ)_{x_k} )
  1. variability of the parameters ψ_L and ψ_X: Bayesian formulation
  2. imputation using the set of M parameters

• FCS: GLM (Van Buuren, 2012) or random forests (Doove et al., 2014; Shah et al., 2014)
Simulations from real data sets
• Quantities of interest: θ = parameters of a logistic model
• Simulation design (repeated 200 times):
  • the real data set is considered as a population
  • draw one sample from the data set
  • generate 20% of missing values
  • multiple imputation using M = 5 imputed arrays
• Criteria:
  • bias
  • CI width, coverage
• Comparison with:
  • JM: log-linear model, latent class model
  • FCS: logistic regression, random forests
Results - Inference
●
MIM
CA
5
Logl
inea
r
Late
nt c
lass
FC
S−
log
FC
S−
rf
0.80
0.85
0.90
0.95
1.00
Titanic
cove
rage ●
●●
●
MIM
CA
2
Logl
inea
r
Late
nt c
lass
FC
S−
log
FC
S−
rf
0.80
0.85
0.90
0.95
1.00
Galetas
cove
rage
●
MIM
CA
5
Late
nt c
lass
FC
S−
log
FC
S−
rf
0.80
0.85
0.90
0.95
1.00
Income
cove
rage
Titanic Galetas Income
Number of variables 4 4 14
Number of categories ≤ 4 ≤ 11 ≤ 9
Results - Time
Table: time consumed (in seconds)

                       Titanic   Galetas     Income
MIMCA                    2.750     8.972     58.729
Loglinear                0.740     4.597         NA
Latent class model      10.854    17.414    143.652
FCS logistic             4.781    38.016    881.188
FCS forests            265.771   112.987   6329.514

                       Titanic   Galetas     Income
Number of individuals     2201      1192       6876
Number of variables          4         4         14
Conclusion
MI methods using dimensionality reduction:

• capture the relationships between variables
• capture the similarities between individuals
• require a small number of parameters

They address some imputation issues:

• can be applied to various data sets
• provide correct inferences for analysis models based on relationships between pairs of variables

Available in the R package missMDA, with a user guide at http://vincentaudigier.weebly.com/links.html
Perspectives
To go further, a modelling effort is required when categorical variables occur:
• for a deeper understanding of the methods
• for an extension of the current methods
• for a MI method based on FAMD

→ some lines of research:
• link between CA and the log-linear model
• link between the log-linear model and the general location model
• uncertainty on the number of dimensions S
References I
V. Audigier, F. Husson, and J. Josse. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. Statistics and Computing, 2016.

V. Audigier, F. Husson, and J. Josse. Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation, 2015.

V. Audigier, F. Husson, and J. Josse. A principal component method to impute missing values for mixed data. Advances in Data Analysis and Classification, pages 1–22, 2014.

D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987.

J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London, 1997.