Introduction Single Imputation MI with PCA MI with MCA Conclusion References
Multiple imputation with principal component methods
Vincent Audigier1, François Husson2, Julie Josse2
1. Inserm ECSTRA team, Saint-Louis Hospital, Paris
2. Agrocampus Ouest, Rennes
Midia meeting, March 23, 2016
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
Principal component methods: aims
From multidimensional data, principal component methods:
• summarize
• describe
• visualize
Several objectives:
• identify the similarities between individuals
• highlight the relationships between variables
• describe some groups of individuals by a set of relevant variables
How does it work?
• The I individuals are seen as elements of R^K
• A distance d on R^K to define proximities between individuals
• Vect(v_1, ..., v_S) maximising the projected inertia
⇒ Dimensionality reduction methods
A set of methods
One principal component method per data type, each with its own distance:

• continuous data → PCA (distance d_pca)
• categorical data → MCA (distance d_mca)
• mixed data → FAMD (distance d_famd), with d²_famd = d²_pca + d²_mca
FAMD
       Wind   Rainfall  maxO3    T9    T12   ...
0602   North  Dry         82   17.0   18.4   ...
0603   East   Dry         92   15.3   17.6   ...
0604   North  Dry        114   16.2   19.7   ...
0605   West   Dry         94   17.4   20.5   ...
0606   West   Rain        80   17.7   19.8   ...
...    ...    ...        ...    ...    ...   ...
[Figure: FAMD outputs — Individuals plot (daily observations 0601–0930), Categories plot (Wind: North/South/East/West; Rainfall: Rain/Dry), Variables correlation circle (maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v) and Graph of the variables (adding Wind and Rainfall); Dim 1 (44.89%), Dim 2 (15.88%)]
Why principal component methods for MI?
1. Lack of joint imputation models for non-continuous data
   • mixed data: various relationships between variables
   • categorical data: combinatorial explosion
   ⇒ a large range of joint models

2. Imputation models with too many parameters:
   • overfitting
   • storage issues
   ⇒ a small number of parameters

3. Models based on regressions:
   • number of individuals less than the number of variables
   • collinearity
   ⇒ no inversion issues
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
How to perform FAMD?
FAMD can be seen as the SVD of X with weights for
• the continuous variables and the categories: (D_Σ)^{-1}
• the individuals: (1/I) 1_I

→ SVD(X, (D_Σ)^{-1}, (1/I) 1_I)

with X gathering the continuous variables and the indicator matrix of the categorical variables:

    | 11.04 ... 2.07   1 0 ... 1 0 0 |
    | 10.76 ... 1.86   1 0 ... 1 0 0 |
X = | 11.02 ... 2.04   1 0 ... 1 0 0 |
    | 11.02 ... 1.92   0 1 ... 0 1 0 |
    | 11.06 ... 2.01   0 1 ... 0 0 1 |
    | 10.95 ... 1.67   0 1 ... 0 1 0 |

and D_Σ = diag(σ_{x_1}, ..., σ_{x_k}, I_{k+1}, ..., I_K)
How to perform FAMD?
SVD(X, (D_Σ)^{-1}, (1/I) 1_I) → X_{I×K} = U_{I×K} Λ^{1/2}_{K×K} V^T_{K×K}

with U^T ((1/I) 1_I) U = 1_K and V^T D_Σ^{-1} V = 1_K

• principal components: F_{I×S} = U_{I×S} Λ^{1/2}_{S×S}
• loadings: V_{K×S}
• fitted matrix: X̂_{I×K} = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S}

The criterion

‖X − X̂‖²_{D_Σ^{-1} ⊗ (1/I)1_I} = tr( (X − X̂) D_Σ^{-1} (X − X̂)^T (1/I) 1_I )

is minimized under the constraint of rank S.
Properties of the method
• The distance between individuals i and i′ is:

  d²(i, i′) = Σ_{j=1}^{k} (x_ij − x_i′j)² / σ²_{x_j} + Σ_{j=k+1}^{K} (1/I_j) (x_ij − x_i′j)²

• The principal component F_s maximises:

  Σ_{var ∈ continuous} r²(F_s, var) + Σ_{var ∈ categorical} η²(F_s, var)
FAMD with missing values
⇒ FAMD: least squares

  ‖X_{I×K} − U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S}‖²

⇒ FAMD with missing values: weighted least squares

  ‖W_{I×K} ∗ (X_{I×K} − U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S})‖²

with w_ij = 0 if x_ij is missing, w_ij = 1 otherwise.

Many algorithms have been developed for PCA, such as NIPALS (Christoffersson, 1970) or iterative PCA (Kiers, 1997).
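As an illustrative sketch (not the missMDA implementation), the weighted least squares problem can be solved for plain PCA on continuous data by an iterative SVD loop: impute, fit a rank-S SVD, re-impute only the cells with w_ij = 0. All names below are ours:

```python
import numpy as np

def iterative_pca_impute(X, S=2, tol=1e-8, max_iter=500):
    """Iterative PCA imputation: alternate a rank-S SVD fit with
    re-imputation of the missing cells (w_ij = 0 on missing entries)."""
    W = ~np.isnan(X)                                  # w_ij = 1 if observed
    Xc = np.where(W, X, np.nanmean(X, axis=0))        # start: column means
    prev = np.inf
    for _ in range(max_iter):
        mu = Xc.mean(axis=0)
        U, d, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
        Xhat = mu + (U[:, :S] * d[:S]) @ Vt[:S]       # rank-S fitted matrix
        Xc = np.where(W, X, Xhat)                     # keep observed cells
        crit = np.sum((Xc - Xhat)[W] ** 2)            # observed-cell residual
        if abs(prev - crit) < tol * (1 + crit):
            break
        prev = crit
    return Xc
```

On data close to rank S the missing cells are recovered almost exactly; on noisier data this plain version overfits, which is what the regularized variant later in the deck addresses.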
FAMD with missing values
Iterative FAMD algorithm:

1. initialization: imputation by the mean (continuous variables) / the proportion (categories)
2. iterate until convergence:
   (a) estimation of the parameters of FAMD → SVD of (X, (D_Σ)^{-1}, (1/I) 1_I)
   (b) imputation of the missing values with X̂_{I×K} = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S}
   (c) update of D_Σ

| NA    ... 2.07   A  ... A  |      | NA    ... 2.07   1  0  ... 1  0  0  |
| 10.76 ... 1.86   A  ... A  |      | 10.76 ... 1.86   1  0  ... 1  0  0  |
| 11.02 ... NA     A  ... NA |  →   | 11.02 ... NA     1  0  ... NA NA NA |
| 11.02 ... 1.92   B  ... B  |      | 11.02 ... 1.92   0  1  ... 0  1  0  |
| 11.06 ... 2.01   NA ... C  |      | 11.06 ... 2.01   NA NA ... 0  0  1  |
| NA    ... 1.67   B  ... B  |      | NA    ... 1.67   0  1  ... 0  1  0  |

⇒ the imputed values can be seen as degrees of membership
Single imputation with FAMD (Audigier et al., 2014)

Iterative FAMD algorithm (regularized version):

1. initialization: imputation by the mean (continuous variables) / the proportion (categories)
2. iterate until convergence:
   (a) estimation of the parameters of FAMD → SVD of (X, (D_Σ)^{-1}, (1/I) 1_I)
   (b) imputation of the missing values with X̂_{I×K} = U_{I×S} f(Λ^{1/2}_{S×S}) V^T_{K×S},
       where f(λ_s^{1/2}) = (λ_s − σ²) / λ_s^{1/2}
   (c) update of D_Σ

| 11.04 ... 2.07   A ... A |      | 11.04 ... 2.07   1    0    ... 1    0    0    |
| 10.76 ... 1.86   A ... A |      | 10.76 ... 1.86   1    0    ... 1    0    0    |
| 11.02 ... 2.04   A ... A |  ←   | 11.02 ... 2.04   1    0    ... 0.81 0.05 0.14 |
| 11.02 ... 1.92   B ... B |      | 11.02 ... 1.92   0    1    ... 0    1    0    |
| 11.06 ... 2.01   B ... C |      | 11.06 ... 2.01   0.25 0.75 ... 0    0    1    |
| 10.95 ... 1.67   B ... B |      | 10.95 ... 1.67   0    1    ... 0    1    0    |

⇒ the imputed values can be seen as degrees of membership
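The regularization step can be illustrated numerically: each retained singular value is shrunk by f(λ_s^{1/2}) = (λ_s − σ̂²)/λ_s^{1/2}, so dimensions barely above the noise level are damped much more than strong ones. A small sketch with our own naming; here σ̂² is taken as the mean of the discarded eigenvalues, one common choice standing in for the residual-variance estimate:

```python
import numpy as np

def shrink(d, S):
    """Shrink the S retained singular values:
    f(sqrt(lam_s)) = (lam_s - sigma2) / sqrt(lam_s),
    with sigma2 estimated from the discarded eigenvalues."""
    lam = d ** 2                       # eigenvalues from singular values
    sigma2 = lam[S:].mean()            # noise-variance estimate (assumption)
    return np.maximum((lam[:S] - sigma2) / np.sqrt(lam[:S]), 0.0)

d = np.array([5.0, 3.0, 1.0, 0.9])
print(shrink(d, 2))   # the strong dimension barely moves, the weak one shrinks more
```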
Simulation results
Single imputation with FAMD shows a high quality of prediction compared to random forests (Stekhoven and Bühlmann, 2012):
• on real data
• when the relationships between continuous variables are linear
• for rare categories
• with MAR/MCAR mechanism
Can impute mixed, continuous or categorical data
But a single imputation method only
From single imputation to multiple imputation
P(X^miss | X^obs, ψ_1), ..., P(X^miss | X^obs, ψ_M)

The fitted value (F u′)_ij is replaced by M disturbed predictions (F u′)^1_ij + ε^1_ij, ..., (F u′)^M_ij + ε^M_ij

1. Reflect the variability of the parameters of the imputation model:
   ((U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_1, ..., (U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_M)
   → Bayesian or bootstrap approach

2. Add a disturbance to the prediction X̂_m = U_m Λ^{1/2}_m V^T_m
   → need to distinguish continuous and categorical data
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
PCA model (Caussinus, 1986)
Model:

X_{I×K} = X̃_{I×K} + ε_{I×K} = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S} + ε_{I×K}, with ε ~ N(0, σ² 1_K)

Maximum likelihood:

X̃^S = U_{I×S} Λ^{1/2}_{S×S} V^T_{K×S} → σ̂² = ‖X − X̃^S‖² / degrees of freedom

Bayesian formulations:
• Hoff (2007): uniform prior for U and V, Gaussian prior on (λ_s)_{s=1...S}
• Verbanck et al. (2013): prior on X̃
Bayesian PCA (Verbanck et al., 2013)
Model: X_{I×K} = X̃_{I×K} + ε_{I×K}, i.e. x_ik = x̃_ik + ε_ik with ε_ik ~ N(0, σ²)

x̃_ik = Σ_{s=1}^{S} √λ_s u_is v_ks = Σ_{s=1}^{S} x̃^(s)_ik

Prior: x̃^(s)_ik ~ N(0, τ_s²)

Posterior: (x̃^(s)_ik | x^(s)_ik) ~ N(Φ_s x^(s)_ik, Φ_s σ²) with Φ_s = τ_s² / (τ_s² + σ²)

Empirical Bayes for τ_s²: τ̂_s² = λ_s − σ², so that

Φ_s = (λ_s − σ²) / λ_s = signal variance / total variance   (Efron and Morris, 1972)
Multiple imputation with Bayesian PCA (Audigier et al., 2015)

1. Variability of the parameters: M plausible signals (x̃_ij)^1, ..., (x̃_ij)^M
   • posterior distribution from Bayesian PCA: (x̃^(s)_ij | x^(s)_ij) ~ N(Φ_s x^(s)_ij, Φ_s σ²)
   • Data Augmentation (Tanner and Wong, 1987)

2. Imputation according to the PCA model using the set of M parameters: x^miss_ij ← N(x̃_ij, σ²)
Multiple imputation with Bayesian PCA (Audigier et al., 2015)

Data augmentation:
• a Gibbs sampler
• simulates (ψ, X^miss | X^obs) by alternating
  (I) (X^miss | X^obs, ψ): imputation
  (P) (ψ | X^obs, X^miss): draw from the posterior
• convergence checked by graphical investigations

For Bayesian PCA:
• initialisation: ML estimate for X̃
• for ℓ in 1...L:
  (I) given X̃, x^miss_ij ← N(x̃_ij, σ²)
  (P) x̃_ij ← N(Σ_s Φ_s x^(s)_ij, σ² (Σ_s Φ_s) / I)
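A minimal sketch of this (I)/(P) alternation, under simplifying assumptions of ours: the noise variance σ² is estimated from the discarded dimensions, the SVD is taken on the raw matrix (no FAMD-style column weighting), and all names are illustrative:

```python
import numpy as np

def da_bayes_pca(X, S, n_iter=100, seed=0):
    """Data-augmentation sketch for PCA imputation: alternate
    (I) draw missing cells from N(x_tilde, sigma2) and
    (P) draw the signal x_tilde around its shrunk posterior mean."""
    rng = np.random.default_rng(seed)
    I = len(X)
    miss = np.isnan(X)
    Xc = np.where(miss, np.nanmean(X, axis=0), X)      # initialisation
    for _ in range(n_iter):
        U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
        lam = d ** 2 / I                               # eigenvalue scale
        sigma2 = lam[S:].mean()                        # noise variance (assumption)
        phi = np.clip((lam[:S] - sigma2) / lam[:S], 0.0, 1.0)
        # posterior mean of the signal: sum_s phi_s * x^(s)
        mean = (U[:, :S] * (phi * d[:S])) @ Vt[:S]
        # (P) draw the signal; posterior variance sigma2 * sum_s(phi_s) / I
        tilde = mean + rng.normal(scale=np.sqrt(sigma2 * phi.sum() / I), size=X.shape)
        # (I) draw the missing cells from N(x_tilde, sigma2)
        Xc = np.where(miss, tilde + rng.normal(scale=np.sqrt(sigma2), size=X.shape), X)
    return Xc
```

Keeping the draws from several well-separated iterations of this chain yields the M imputed data sets.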
Standard MI methods for continuous data
Generally based on the normal distribution:

• JM (Honaker et al., 2011): x_i. ~ N(μ, Σ)
  1. bootstrap of the rows: X^1, ..., X^M; EM algorithm: (μ̂^1, Σ̂^1), ..., (μ̂^M, Σ̂^M)
  2. imputation: x^m_i. drawn from N(μ̂^m, Σ̂^m)

• FCS (Van Buuren, 2012): N(μ_{X_k | X_{(−k)}}, Σ_{X_k | X_{(−k)}})
  1. Bayesian approach: (β^m, σ^m)
  2. imputation: stochastic regression, x^m_ij drawn from N(X_{(−k)} β^m, σ^m)
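The JM scheme can be sketched as follows — a simplified illustration of ours, with complete-case estimation of (μ, Σ) standing in for the EM step, and each row's missing block drawn from the conditional normal given its observed entries:

```python
import numpy as np

def jm_impute(X, M=5, seed=0):
    """Joint-modelling MI sketch: bootstrap the rows, estimate (mu, Sigma),
    then draw each row's missing block from the conditional normal."""
    rng = np.random.default_rng(seed)
    cc = X[~np.isnan(X).any(axis=1)]           # complete cases (stand-in for EM)
    imputed = []
    for _ in range(M):
        boot = cc[rng.integers(0, len(cc), len(cc))]
        mu, Sig = boot.mean(axis=0), np.cov(boot, rowvar=False)
        Xm = X.copy()
        for i in range(len(Xm)):
            m = np.isnan(Xm[i])
            if not m.any():
                continue
            o = ~m
            # conditional N(mu_m + S_mo S_oo^{-1} (x_o - mu_o),
            #               S_mm - S_mo S_oo^{-1} S_om)
            Soo_inv = np.linalg.inv(Sig[np.ix_(o, o)])
            cmu = mu[m] + Sig[np.ix_(m, o)] @ Soo_inv @ (Xm[i, o] - mu[o])
            cS = Sig[np.ix_(m, m)] - Sig[np.ix_(m, o)] @ Soo_inv @ Sig[np.ix_(o, m)]
            cS = (cS + cS.T) / 2               # symmetrise for the sampler
            Xm[i, m] = rng.multivariate_normal(cmu, cS)
        imputed.append(Xm)
    return imputed
```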
Simulations
• Quantities of interest: θ_1 = E[Y], θ_2 = β_1, θ_3 = ρ
• 1000 simulations
• data sets drawn from N_p(μ, Σ) with a two-block structure, varying I (30 or 200), K (6 or 60) and ρ (0.3 or 0.9)
  [Illustration: Σ with two blocks of correlated variables (0.8 within a block, 0 between blocks)]
• 10% or 30% of missing values using a MCAR mechanism
• multiple imputation using M = 20 imputed arrays
• Criteria:
  • bias
  • CI width, coverage
Results for the expectation
Confidence interval width and coverage for JM, FCS and BayesMIPCA ("–": not available):

      I    K   ρ    %    width JM  width FCS  width Bayes   cov JM  cov FCS  cov Bayes
 1    30   6   0.3  0.1    0.803     0.805      0.781       0.955    0.953     0.950
 2    30   6   0.3  0.3      –       1.010      0.898         –      0.971     0.949
 3    30   6   0.9  0.1    0.763     0.759      0.756       0.952    0.950     0.949
 4    30   6   0.9  0.3      –       0.818      0.783         –      0.965     0.953
 5    30   60  0.3  0.1      –         –        0.775         –        –       0.955
 6    30   60  0.3  0.3      –         –        0.864         –        –       0.952
 7    30   60  0.9  0.1      –         –        0.742         –        –       0.953
 8    30   60  0.9  0.3      –         –        0.759         –        –       0.954
 9    200  6   0.3  0.1    0.291     0.294      0.292       0.947    0.947     0.946
10    200  6   0.3  0.3    0.328     0.334      0.325       0.954    0.959     0.952
11    200  6   0.9  0.1    0.281     0.281      0.281       0.953    0.950     0.952
12    200  6   0.9  0.3    0.288     0.289      0.288       0.948    0.951     0.951
13    200  60  0.3  0.1      –       0.304      0.289         –      0.957     0.945
14    200  60  0.3  0.3      –       0.384      0.313         –      0.981     0.958
15    200  60  0.9  0.1      –       0.282      0.279         –      0.951     0.948
16    200  60  0.9  0.3      –       0.296      0.283         –      0.958     0.952
Properties for BayesMIPCA
A MI method based on a Bayesian treatment of the PCA model.

Advantages:

• captures the structure of the data: good inferences for regression coefficients, correlations, means
• a dimensionality reduction method: handles I < K or I > K, low or high percentages of missing values
• no inversion issue: strong or weak relationships
• a regularization strategy improving stability

Remains competitive if:

• the low-rank assumption is not verified
• the Gaussian assumption is not true
1 Introduction
2 Single imputation based on principal component methods
3 Multiple imputation for continuous data with PCA
4 Multiple imputation for categorical data with MCA
5 Conclusion
Multiple imputation for categorical data using MCA
MI for categorical data is challenging for a moderate number of variables:
• estimation issues
• storage issues

MI with MCA:

1. Variability of the parameters of the imputation model:
   ((U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_1, ..., (U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S})_M)
   → a non-parametric bootstrap approach

2. Add a disturbance to the MCA prediction X̂_m = U_m Λ^{1/2}_m V^T_m
Multiple imputation with MCA (Audigier et al., 2015)
1. Variability of the parameters of MCA (U_{I×S}, Λ^{1/2}_{S×S}, V^T_{K×S}) using a non-parametric bootstrap:
   • define M weightings (R_m)_{1≤m≤M} for the individuals
   • estimate the MCA parameters using the SVD of (X, (1/K)(D_Σ)^{-1}, R_m)

2. Imputation: draw categories from the values of the fuzzy indicator matrices (X̂_m)_{1≤m≤M}:

X̂_1:                            X̂_2:                            X̂_M:
| 1    0    ... 1    0    |     | 1    0    ... 1    0    |     | 1    0    ... 1    0    |
| 1    0    ... 1    0    |     | 1    0    ... 1    0    |     | 1    0    ... 1    0    |
| 1    0    ... 0.81 0.19 |     | 1    0    ... 0.60 0.40 | ... | 1    0    ... 0.74 0.16 |
| 0.25 0.75 ... 0    1    |     | 0.26 0.74 ... 0    1    |     | 0.20 0.80 ... 0    1    |
| 0    1    ... 0    1    |     | 0    1    ... 0    1    |     | 0    1    ... 0    1    |

giving the M imputed categorical data sets:

| A ... A |     | A ... A |     | A ... A |
| A ... A |     | A ... A |     | A ... A |
| A ... B |     | A ... A | ... | A ... B |
| B ... C |     | B ... C |     | B ... C |
| B ... B |     | B ... B |     | B ... B |
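Step 2 — turning a fuzzy indicator block into category draws — can be sketched as follows (an illustration of ours; negative fitted values, which the MCA reconstruction can produce, are clipped before renormalising each row into probabilities):

```python
import numpy as np

def draw_categories(Xhat, seed=0):
    """Draw one category per individual from a fuzzy indicator block:
    each row holds the degrees of membership for one variable's categories."""
    rng = np.random.default_rng(seed)
    P = np.clip(Xhat, 0.0, None)               # degrees of membership >= 0
    P = P / P.sum(axis=1, keepdims=True)       # renormalise to probabilities
    return np.array([rng.choice(len(p), p=p) for p in P])

fuzzy = np.array([[1.00, 0.00],
                  [0.81, 0.19],
                  [0.25, 0.75]])
cats = draw_categories(fuzzy)                  # e.g. categories 0, 0, 1
```

Repeating the draw on each of the M fitted matrices produces the M imputed categorical data sets.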
Properties
MCA addresses the categorical data challenge by:

• requiring a small number of parameters
• preserving the essential structure of the data
• using a regularisation strategy

MIMCA can be applied to various data sets:

• small or large numbers of variables/categories
• small or large numbers of individuals
MI methods for categorical data
• Log-linear model (Schafer, 1997)
  • hypothesis on the contingency table X = (x_ijk)_{i,j,k}: X | ψ ~ M(n, ψ), with
    log(ψ_ijk) = λ_0 + λ^A_i + λ^B_j + λ^C_k + λ^AB_ij + λ^AC_ik + λ^BC_jk + λ^ABC_ijk
  1. variability of the parameter ψ: Bayesian formulation
  2. imputation using the set of M parameters

• Latent class model (Si and Reiter, 2013)
  • hypothesis: P(X = (x_1, ..., x_K); ψ) = Σ_{ℓ=1}^{L} ( ψ_ℓ Π_{k=1}^{K} ψ^(ℓ)_{x_k} )
  1. variability of the parameters ψ_L and ψ_X: Bayesian formulation
  2. imputation using the set of M parameters

• FCS: GLM (Van Buuren, 2012) or random forests (Doove et al., 2014; Shah et al., 2014)
Simulations from real data sets
• Quantities of interest: θ = parameters of a logistic model
• Simulation design (repeated 200 times):
  • the real data set is considered as a population
  • draw one sample from the data set
  • generate 20% of missing values
  • multiple imputation using M = 5 imputed arrays
• Criteria:
  • bias
  • CI width, coverage
• Comparison with:
  • JM: log-linear model, latent class model
  • FCS: logistic regression, random forests
Results - Inference
●
MIM
CA
5
Logl
inea
r
Late
nt c
lass
FC
S−
log
FC
S−
rf
0.80
0.85
0.90
0.95
1.00
Titanic
cove
rage ●
●●
●
MIM
CA
2
Logl
inea
r
Late
nt c
lass
FC
S−
log
FC
S−
rf
0.80
0.85
0.90
0.95
1.00
Galetas
cove
rage
●
MIM
CA
5
Late
nt c
lass
FC
S−
log
FC
S−
rf
0.80
0.85
0.90
0.95
1.00
Income
cove
rage
Titanic Galetas Income
Number of variables 4 4 14
Number of categories ≤ 4 ≤ 11 ≤ 9
Results - Time
Table: time consumed (in seconds)

                       Titanic   Galetas     Income
MIMCA                    2.750     8.972     58.729
Loglinear                0.740     4.597         NA
Latent class model      10.854    17.414    143.652
FCS logistic             4.781    38.016    881.188
FCS forests            265.771   112.987   6329.514

                       Titanic   Galetas     Income
Number of individuals     2201      1192       6876
Number of variables          4         4         14
Conclusion
MI methods using dimensionality reduction:

• capture the relationships between variables
• capture the similarities between individuals
• require a small number of parameters

They address some imputation issues:

• can be applied to various data sets
• provide correct inferences for analysis models based on relationships between pairs of variables

Available in the R package missMDA, with a user guide at http://vincentaudigier.weebly.com/links.html
Perspectives
To go further, a modelling effort is required when categorical variables occur:
• for a deeper understanding of the methods
• for an extension of the current methods
• for a MI method based on FAMD

→ some lines of research:
• link between CA and the log-linear model
• link between the log-linear model and the general location model
• uncertainty on the number of dimensions S
References I
V. Audigier, F. Husson, and J. Josse. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. Statistics and Computing, 2016.

V. Audigier, F. Husson, and J. Josse. Multiple imputation for continuous variables using a Bayesian principal component analysis. Journal of Statistical Computation and Simulation, 2015.

V. Audigier, F. Husson, and J. Josse. A principal component method to impute missing values for mixed data. Advances in Data Analysis and Classification, pages 1–22, 2014.

D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, New York, 1987.

J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London, 1997.