Lisa M. Lix, PhD P. Stat. School of Public Health Joint Seminar: Statistics and Collaborative Graduate Program in Biostatistics January 5, 2012.

A Comparison of Variable Importance Measures for Two Independent Groups

Lisa M. Lix, PhD P. Stat.

School of Public Health

Joint Seminar: Statistics and Collaborative Graduate Program in Biostatistics

January 5, 2012

Co-Authors: Tolu Sajobi, Bola Dansu

Funding:
◦ Canadian Institutes of Health Research
◦ Centennial Chair Program, University of Saskatchewan

Acknowledgements

Background

Description of Relative Importance Measures

Numeric Example

Monte Carlo Study: Design and Results

Discussion and Conclusions

Presentation Overview

m ≥ 2 correlated variables for N study participants with n1 participants in group 1 and n2 participants in group 2 (n1 + n2 = N)

In many studies, the variables are assumed to follow a normal distribution, N(μjk, σ²jk), for k = 1, …, m and j = 1, 2

We will focus on the case where there are no missing observations

Study Design

Do different measures of relative importance result in the same rankings of a set of correlated variables for distinguishing between two independent groups?

What factors affect the variable ranking performance of relative importance measures?

Research Questions

For exploratory analysis and model development

Organizational research:
◦ Relative contribution of various applicant characteristics in hire–not hire decisions made by managers

Genetics research:
◦ Relative contribution of individual genes to distinguishing between patients with and without chronic health conditions

Quality of life research:
◦ Relative importance of quality of life domains for distinguishing between patients who do and do not receive healthcare treatments

When and Where are Measures of Relative Importance Used?

Back et al. (2008). Journal of Biopharmaceutical Statistics
◦ Rankings of variable importance were used to identify a set of genes to classify life-threatening diseases according to prognosis or type
◦ Variable importance was assessed using a variety of techniques, including non-parametric recursive partitioning techniques

Applications of Relative Importance Measures

Statistical significance (e.g., t-test)

Practical significance (e.g., effect size)

Descriptive discriminant analysis (DDA): linear combination of variables that maximizes separation of the groups

Stepwise multivariate analysis of variance (MANOVA): F-to-remove statistic measures the decrease in the inter-group Mahalanobis distance caused by removing each of the variables in sequence

Logistic regression analysis (LRA): Contribution of each variable to the total predicted variance in the dichotomous outcome

How can Relative Importance be Measured?

Dominance analysis: Budescu, 1993
◦ General dominance analysis determines relative importance based on the average ΔR² observed by adding a predictor to all possible subsets of the remaining predictors

Relative weights analysis: Johnson, 2000
◦ Creates a new set of variables that are orthogonal representations of the original set of variables

How can Relative Importance be Measured?

Denote Xij as the m x 1 vector of observations for the ith study participant in the jth group (i = 1,…, nj; j = 1, 2)

$\bar{\mathbf{X}}_j$ is the m x 1 vector of means for the jth group

Vector of discriminant function coefficients is estimated by

$$\mathbf{a} = \mathbf{S}^{-1}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$$

where

$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{N - 2}$$

and S1 and S2 are the variance-covariance matrices for groups 1 and 2, respectively

DDA-Based Measures of Relative Importance
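The estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `discriminant_coefficients` and the simulated data (group sizes, means, seed) are assumptions made for the example.

```python
import numpy as np

def discriminant_coefficients(X1, X2):
    """Return (a, S): raw discriminant coefficients and pooled covariance.

    X1, X2 are (n_j x m) arrays of observations for groups 1 and 2.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    S1 = np.cov(X1, rowvar=False)  # group 1 covariance, divisor n1 - 1
    S2 = np.cov(X2, rowvar=False)  # group 2 covariance, divisor n2 - 1
    # Pooled covariance: S = [(n1 - 1) S1 + (n2 - 1) S2] / (N - 2)
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # a = S^{-1} (xbar_1 - xbar_2), solved without forming the inverse
    a = np.linalg.solve(S, diff)
    return a, S

# Hypothetical data: m = 3 variables, n1 = 40, n2 = 60
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 0.5, 0.0], size=(40, 3))
X2 = rng.normal(loc=[0.0, 0.0, 0.0], size=(60, 3))
a, S = discriminant_coefficients(X1, X2)
print(a)
```

Solving the linear system rather than inverting S is a numerical convenience only; it gives the same coefficients.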

The kth standardized discriminant function coefficient is

$$a_k^{*} = a_k s_k$$

where ak and sk are the kth estimated discriminant function coefficient and standard deviation, respectively

By placing a constraint on the discriminant function coefficients such that $\mathbf{a}^{\mathsf{T}}\mathbf{S}\mathbf{a} = 1$, where T is the transpose operator, the coefficients will range in value from -1 to +1

The parallel discriminant ratio coefficient for the kth variable is

$$q_k = a_k^{*} f_k$$

where fk is the kth structure coefficient, the correlation between the kth variable and the discriminant function

Coefficients can take on positive and negative values
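A small sketch of the standardized coefficients and parallel discriminant ratio coefficients follows. It assumes that the standard deviations and structure coefficients are taken from the pooled within-group covariance matrix (the slides do not say whether pooled or total quantities are used), and the function name `sdfc_and_pdrc` and the simulated data are hypothetical.

```python
import numpy as np

def sdfc_and_pdrc(X1, X2):
    """Return (a_star, f, q) for two groups of observations (n_j x m arrays)."""
    n1, n2 = len(X1), len(X2)
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, X1.mean(axis=0) - X2.mean(axis=0))
    a = a / np.sqrt(a @ S @ a)   # impose the constraint a^T S a = 1
    s = np.sqrt(np.diag(S))      # pooled standard deviations (assumption)
    a_star = a * s               # standardized coefficients a*_k = a_k s_k
    # Structure coefficients: pooled within-group correlation between each
    # variable and the discriminant score (the score has unit pooled variance)
    f = (S @ a) / s
    q = a_star * f               # parallel discriminant ratio coefficients
    return a_star, f, q

# Hypothetical data: m = 3 variables
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 0.5, 0.0], size=(40, 3))
X2 = rng.normal(loc=[0.0, 0.0, 0.0], size=(60, 3))
a_star, f, q = sdfc_and_pdrc(X1, X2)
print(q, q.sum())
```

Under these assumptions the q values sum to one, because the sum equals a^T S a, while individual values may still be negative.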

The total discriminant ratio coefficient for the kth variable is

where STkk is the (k,k)th element of ST, ST = T/ (N – 1), T = H + E, and H and E are the hypothesis and error sum of squares and cross-product matrices, respectively

Coefficients have a lower bound of zero but no upper bound


For the kth variable, the F-to-remove statistic is

$$F_{(k)} = \frac{k_2\,\bigl(D^2 - D_{(k)}^2\bigr)}{k_3 + D_{(k)}^2}$$

where k2 = N – m, k3 = N²/(n1n2), and

$$D^2 = (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)^{\mathsf{T}}\,\mathbf{S}^{-1}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$$

is the squared Mahalanobis distance, and $D_{(k)}^2$ is the value of D² when the kth variable is omitted

Statistics take on positive values

Stepwise MANOVA-Based Measures of Relative Importance
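As a rough sketch, the F-to-remove statistic can be computed by recalculating the squared Mahalanobis distance with each variable left out. The code assumes the statistic has the form F(k) = k2 (D² − D²(k)) / (k3 + D²(k)) with k2 = N − m and k3 = N²/(n1n2) as defined on the slide; the function names and simulated data are hypothetical.

```python
import numpy as np

def mahalanobis_d2(X1, X2, cols):
    """Squared Mahalanobis distance between group means using columns `cols`."""
    n1, n2 = len(X1), len(X2)
    A, B = X1[:, cols], X2[:, cols]
    S = ((n1 - 1) * np.cov(A, rowvar=False)
         + (n2 - 1) * np.cov(B, rowvar=False)) / (n1 + n2 - 2)
    S = np.atleast_2d(S)  # keeps the single-variable case two-dimensional
    d = A.mean(axis=0) - B.mean(axis=0)
    return float(d @ np.linalg.solve(S, d))

def f_to_remove(X1, X2):
    """F-to-remove for each variable, under the assumed form stated above."""
    n1, n2 = len(X1), len(X2)
    N, m = n1 + n2, X1.shape[1]
    k2, k3 = N - m, N**2 / (n1 * n2)
    d2_full = mahalanobis_d2(X1, X2, list(range(m)))
    stats = []
    for k in range(m):
        keep = [j for j in range(m) if j != k]  # omit the kth variable
        d2_k = mahalanobis_d2(X1, X2, keep)
        stats.append(k2 * (d2_full - d2_k) / (k3 + d2_k))
    return np.array(stats)

# Hypothetical data: m = 3 variables
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 0.5, 0.0], size=(40, 3))
X2 = rng.normal(loc=[0.0, 0.0, 0.0], size=(60, 3))
F = f_to_remove(X1, X2)
print(F)
```

Because removing a variable cannot increase the sample Mahalanobis distance, each statistic is non-negative, and larger values indicate variables whose removal costs more group separation.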

The model is

$$\ln\!\left(\frac{p_l}{1 - p_l}\right) = \mathbf{A}_l^{\mathsf{T}}\boldsymbol{\beta}$$

where Al is the vector of (m + 1) observations for the lth study participant (l = 1 ,…, N) where the first element is equal to one

pl = Pr(yl = 1| Al) is the probability the lth study participant is a member of group 1 conditional on the explanatory variables

β is the (m + 1) vector of coefficients to be estimated, with the first element equal to the model intercept, β0

LRA-Based Measures of Relative Importance
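To make the model concrete, here is a minimal sketch that fits the logistic regression by Newton-Raphson iterations using only NumPy. It is an illustration under stated assumptions: the function name `fit_logistic`, the fixed iteration count, the simulated data, and the "true" coefficients are all hypothetical, and a production analysis would normally use an established statistics package.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit of ln(p / (1 - p)) = A beta by Newton-Raphson.

    X is (N x m); y holds 1 for group 1 and 0 for group 2.
    Returns (beta, A), where A has a leading column of ones.
    """
    A = np.column_stack([np.ones(len(X)), X])  # first element equal to one
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        W = p * (1.0 - p)
        grad = A.T @ (y - p)              # score vector
        hess = A.T @ (A * W[:, None])     # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta, A

# Hypothetical data: N = 200, m = 2, made-up generating coefficients
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_beta = np.array([-0.5, 1.0, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = rng.binomial(1, p_true)

beta_hat, A = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(A @ beta_hat)))
print(beta_hat)
```

The first element of `beta_hat` is the intercept β0; the logit of `p_hat` is, by construction, the linear predictor `A @ beta_hat`.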

The estimated coefficient for the kth variable, $\hat{\beta}_k$, can be defined in terms of the following quantities:

$r_{\operatorname{logit}(\hat{p})k}$ is the correlation between the kth variable and the logit of the predicted probabilities

$R_{(k)}^2$ is the R² value for a LRA model in which the kth variable is excluded

$R_{k|(k)}^2$ is the R² value for a model in which the kth variable is regressed on the remaining (m – 1) variables

Standardized logistic regression coefficients have also been used to assess relative importance. The kth standardized coefficient is

$$\hat{\beta}_k^{*} = \frac{\hat{\beta}_k\, s_k\, R}{s_{\operatorname{logit}(\hat{p})}}$$

where $\hat{\beta}_k$ is the estimated coefficient and $s_{\operatorname{logit}(\hat{p})}$ is the standard deviation of the logit of the predicted probabilities

Pratt’s (1987) index for relative importance was originally proposed for multiple regression and then extended to LRA. The index value for the kth variable is

$$d_k = \frac{\hat{\beta}_k^{*}\,\hat{\rho}_k}{R^2}$$

where $\hat{\rho}_k$ is the estimated correlation between the kth explanatory variable and the logit of the predicted probabilities
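The standardized coefficient and Pratt's index can be sketched as below, assuming the forms β*k = β̂k sk R / s_logit(p̂) and dk = β*k ρ̂k / R². The slides do not say which logistic R² is used, so R is passed in as an argument; the function name `slrc_and_lpi`, the coefficient values, and the value of R are hypothetical.

```python
import numpy as np

def slrc_and_lpi(X, beta0, beta, R):
    """Standardized logistic coefficients and Pratt's index (assumed forms).

    X is (N x m); beta0 and beta are fitted logistic regression coefficients;
    R is the square root of whichever logistic R^2 the analyst adopts.
    """
    logit = beta0 + X @ beta                  # logit of predicted probabilities
    s_logit = logit.std(ddof=1)
    s_x = X.std(axis=0, ddof=1)
    beta_star = beta * s_x * R / s_logit      # standardized coefficients
    rho = np.array([np.corrcoef(X[:, k], logit)[0, 1]
                    for k in range(X.shape[1])])
    d = beta_star * rho / R**2                # Pratt's index
    return beta_star, rho, d

# Hypothetical inputs: N = 150, m = 3, made-up fitted coefficients and R
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
beta0, beta, R = -0.2, np.array([0.8, -0.4, 0.1]), 0.6
beta_star, rho, d = slrc_and_lpi(X, beta0, beta, R)
print(beta_star, d)
```

Variables can then be ranked by the size of `beta_star` or `d`, which is how the two logistic-based measures enter the comparison.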

Data are from the Manitoba Inflammatory Bowel Disease (IBD) Cohort Study

Started in 2002 and initially enrolled 388 patients who had recently been diagnosed with Crohn’s disease or ulcerative colitis

Health-related quality of life (HRQOL) data collected at regular intervals throughout the study
◦ SF-36: 8 domains
◦ IBD Questionnaire: 4 domains

A central theme of the study is the effect of disease activity on quality of life, stress, well-being, and coping with illness

Numeric Example

Numeric Example: Descriptive Statistics for HRQOL Domains

| Domain | Active Disease (n1 = 244) | Inactive Disease (n2 = 105) |
|---|---|---|
| IBDQ | | |
| Bowel Symptoms | 4.92 (1.03) | 6.08 (0.76) |
| Emotional Health | 4.81 (1.05) | 5.85 (0.89) |
| Social Function | 4.09 (1.18) | 5.19 (1.05) |
| Systemic Symptoms | 5.62 (1.35) | 6.65 (0.64) |
| SF-36 | | |
| Bodily Pain | 60.78 (24.15) | 77.45 (26.11) |
| Role Physical | 63.48 (29.07) | 83.65 (24.08) |
| General Health | 43.40 (19.52) | 59.18 (17.01) |
| Mental Health | 60.33 (14.11) | 66.62 (12.47) |
| Physical Functioning | 77.49 (21.73) | 91.11 (14.41) |
| Role Emotional | 76.06 (23.98) | 85.82 (20.11) |
| Social Functioning | 63.74 (27.20) | 78.85 (27.10) |
| Vitality | 46.13 (16.39) | 57.84 (14.49) |

Significance Test Results and Numeric Values of Relative Importance Measures for HRQOL Domains

| Domain | t-statistic | SLRC | LPI | ALPI | SDFC | PDRC | FTR |
|---|---|---|---|---|---|---|---|
| IBDQ | | | | | | | |
| Bowel Symptoms | 10.430* | 0.463 | 0.471 | 0.376 | 0.587 | 0.542 | 5.034 |
| Emotional Health | 8.840* | 0.309 | 0.28 | 0.223 | 0.428 | 0.347 | 4.033 |
| Social Function | 7.500* | 0.183 | 0.165 | 0.132 | 0.044 | -0.031 | 5.072 |
| Systemic Symptoms | 7.980* | 0.145 | -0.117 | - | 0.083 | -0.062 | 14.334 |
| SF-36 | | | | | | | |
| Bodily Pain | 5.690* | 0.103 | 0.066 | 0.053 | 0.103 | 0.057 | 0.504 |
| Role Physical | 6.220* | 0.015 | -0.010 | 0.000 | 0.037 | -0.022 | 6.099 |
| General Health | 6.930* | 0.135 | 0.095 | 0.076 | 0.226 | 0.149 | 12.334 |
| Mental Health | 3.790* | 0.143 | -0.059 | - | 0.1910 | -0.072 | 0.952 |
| Physical Functioning | 5.890* | 0.169 | 0.113 | 0.090 | 0.185 | 0.106 | 8.329 |
| Role Emotional | 3.640* | 0.171 | -0.066 | - | 0.120 | -0.043 | 0.508 |
| Social Functioning | 4.770* | 0.026 | 0.015 | 0.012 | 0.027 | 0.013 | 0.011 |
| Vitality | 6.080* | 0.074 | 0.049 | 0.039 | 0.029 | 0.017 | 6.911 |

Note: * denotes a test statistic that is statistically significant at α = .05/12 = .004

Rank Order of HRQOL Domains based on Relative Importance Measures

| Domain | SLRC | ALPI | SDFC | PDRC | FTR |
|---|---|---|---|---|---|
| IBDQ | | | | | |
| Bowel Symptoms | 1 | 1 | 1 | 1 | 7 |
| Emotional Health | 2 | 2 | 2 | 2 | 8 |
| Social Function | 3 | 3 | 9 | 9 | 6 |
| Systemic Symptoms | 6 | - | 8 | - | 1 |
| SF-36 | | | | | |
| Bodily Pain | 9 | 6 | 7 | 5 | 11 |
| Role Physical | 12 | 9 | 10 | 9 | 5 |
| General Health | 8 | 5 | 3 | 3 | 2 |
| Mental Health | 7 | - | 4 | - | 9 |
| Physical Functioning | 5 | 4 | 5 | 4 | 3 |
| Role Emotional | 4 | - | 6 | - | 10 |
| Social Functioning | 11 | 8 | 12 | 7 | 12 |
| Vitality | 10 | 7 | 11 | 6 | 4 |

SDFC: standardized discriminant function coefficient
PDRC: parallel discriminant ratio coefficient
TDRC: total discriminant ratio coefficient
FTR: F-to-remove statistic
SLRC: standardized logistic regression coefficient
LPI: logistic Pratt’s index

Monte Carlo Study

Number of variables (m = 4, 6, 8)

Total sample size (N = 60, 80, 140, 200)

Equality/inequality of group sizes

Magnitude and pattern of correlation among the variables

Group covariance homogeneity/heterogeneity

Group means

Shape of the population distribution

Monte Carlo Study

Let ρ denote the average correlation between the variables
◦ ρ = 0, 0.3, 0.6

Pattern of correlation
◦ Compound symmetric
◦ Unstructured
◦ Modified simplex

Magnitude and Pattern of Correlation
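As an illustration of the simplest of these patterns, the sketch below builds a compound symmetric correlation matrix, in which every pair of variables shares the same correlation ρ. The function name `compound_symmetric` is hypothetical; the unstructured and modified simplex matrices are not sketched because the slides do not give their individual entries.

```python
import numpy as np

def compound_symmetric(m, rho):
    """m x m correlation matrix with ones on the diagonal and rho elsewhere."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

corr = compound_symmetric(4, 0.3)
eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
print(corr)
print(eigenvalues)
```

The eigenvalues of such a matrix are 1 − ρ (repeated m − 1 times) and 1 + (m − 1)ρ, so it is positive definite for the values ρ = 0, 0.3, and 0.6 considered here.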

| Mean Pattern | μ1 | D² |
|---|---|---|
| I | (2.5, 2, 1.5, 1) | 13.5 |
| II | (1.5, 1, 0.5, 2) | 7.5 |
| III | (1.0, 0.75, 0.5, 0.25) | 1.9 |
| IV | (0.75, 0.5, 0.25, 1.0) | 1.9 |

Mean Configurations for the Simulation Study: m = 4

Note: μ2 is the null vector
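The D² column for m = 4 can be checked with a short calculation. The sketch assumes an identity common covariance matrix (uncorrelated variables with unit variance), which is an assumption rather than something stated on the slide; under it, the population D² reduces to the sum of squared mean differences. The function name `population_d2` is hypothetical.

```python
import numpy as np

# Mean vectors for group 1 from the m = 4 configurations; mu2 is the null vector
patterns = {
    "I": [2.5, 2, 1.5, 1],
    "II": [1.5, 1, 0.5, 2],
    "III": [1.0, 0.75, 0.5, 0.25],
    "IV": [0.75, 0.5, 0.25, 1.0],
}

def population_d2(mu1, mu2, sigma):
    """Squared Mahalanobis distance between two population mean vectors."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return float(diff @ np.linalg.solve(sigma, diff))

# Assumption: identity covariance matrix
d2 = {name: population_d2(mu1, np.zeros(4), np.eye(4))
      for name, mu1 in patterns.items()}
print(d2)
```

Patterns III and IV contain the same four values in a different order, which is why they share the same D² under this assumption.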

| Mean Pattern | μ1 | D² |
|---|---|---|
| I | (4.5, 4, 3.5, 3, 2.5, 2, 1.5, 1) | 71.0 |
| II | (2.5, 2, 1.5, 1, 0.5, 3, 3.5, 4) | 47.0 |
| III | (2, 1.75, 1.5, 1.25, 1, 0.75, 0.5, 0.25) | 12.8 |
| IV | (1.25, 1, 0.75, 0.5, 0.25, 1.5, 1.75, 2) | 12.8 |

Mean Configurations for the Simulation Study: m = 8

Note: μ2 is the null vector

Normal
◦ γ1 = 0; γ2 = 0

Skewed
◦ γ1 = 1.8; γ2 = 5.9

Heavy-Tailed
◦ γ1 = 0; γ2 = 33

Shape of Population Distribution

All-variable correct ranking percentage: the percent of simulations in which the sample rank was the same as the corresponding population rank for every variable

Average per-variable correct ranking percentage: the percent of simulations in which a variable in the sample had the same rank as the variable in the population, averaged across all variables

Kendall’s concordance statistic (not reported in this presentation)

Measures of Ranking Performance
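The two correct ranking percentages can be sketched as follows, given the ranks observed in each simulated data set and the corresponding population ranks. The function name `ranking_performance` and the toy ranks are hypothetical and are not taken from the study.

```python
import numpy as np

def ranking_performance(sample_ranks, population_ranks):
    """Return (average per-variable %, all-variable %) of correct rankings.

    sample_ranks is (n_simulations x m); population_ranks has length m.
    """
    match = np.asarray(sample_ranks) == np.asarray(population_ranks)
    # Percent of simulations in which each variable has its population rank
    per_variable = 100.0 * match.mean(axis=0)
    average_per_variable = float(per_variable.mean())
    # Percent of simulations in which every variable has its population rank
    all_variable = float(100.0 * match.all(axis=1).mean())
    return average_per_variable, all_variable

# Toy example: 4 simulations, m = 4 variables
pop = [1, 2, 3, 4]
sims = [[1, 2, 3, 4],
        [1, 2, 4, 3],
        [2, 1, 3, 4],
        [1, 2, 3, 4]]
avg_pv, all_v = ranking_performance(sims, pop)
print(avg_pv, all_v)  # 75.0 50.0
```

The all-variable percentage can never exceed the average per-variable percentage, since a simulation counts toward it only when all m ranks agree.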

| Mean Pattern | SDFC | PDRC | TDRC | FTR | SLRC | LPI |
|---|---|---|---|---|---|---|
| I | 49.1 | 59.8 | 59.0 | 38.0 | 41.7 | 61.1 |
| II | 43.7 | 63.1 | 56.2 | 32.1 | 38.0 | 64.3 |
| III | 34.8 | 47.0 | 37.8 | 26.4 | 33.2 | 47.4 |
| IV | 37.0 | 54.3 | 41.1 | 28.3 | 34.8 | 54.7 |
| Average | 41.2 | 56.0 | 48.5 | 31.2 | 36.9 | 56.9 |

Average Per-Variable Correct Ranking (%): Equal Group Covariances and Multivariate Normal Distribution

| Mean Pattern | SDFC | PDRC | TDRC | FTR | SLRC | LPI |
|---|---|---|---|---|---|---|
| I | 17.5 | 28.3 | 27.1 | 9.1 | 13.6 | 29.4 |
| II | 12.2 | 32.1 | 23.6 | 5.7 | 9.8 | 33.6 |
| III | 7.7 | 12.7 | 9.4 | 2.1 | 7.3 | 12.8 |
| IV | 8.1 | 21.1 | 11.0 | 3.8 | 7.6 | 21.4 |
| Average | 11.4 | 23.5 | 17.8 | 5.2 | 9.6 | 24.3 |

All-Variable Correct Ranking (%): Equal Group Covariances and Multivariate Normal Distribution

| Corr. Scenario | SDFC | PDRC | TDRC | FTR | SLRC | LPI |
|---|---|---|---|---|---|---|
| 1 | 60.3 | 63.3 | 63.2 | 40.2 | 55.0 | 66.3 |
| 2 | 45.9 | 63.2 | 51.0 | 32.6 | 42.4 | 63.6 |
| 3 | 32.2 | 65.9 | 42.5 | 25.8 | 25.7 | 65.4 |
| 4 | 39.7 | 52.1 | 45.1 | 29.7 | 36.5 | 53.1 |
| 5 | 25.8 | 34.2 | 38.6 | 27.0 | 24.3 | 33.9 |
| 6 | 43.0 | 57.6 | 50.5 | 31.8 | 37.8 | 58.5 |
| Average | 41.2 | 56.0 | 48.5 | 31.2 | 36.9 | 56.9 |

Average Per-Variable Correct Ranking (%): Equal Group Covariances and Multivariate Normal Distribution

Scenario 1: ρ = 0, where ρ is the average correlation
Scenario 2: compound symmetric matrix with ρ = 0.3
Scenario 3: compound symmetric matrix with ρ = 0.6
Scenario 4: unstructured matrix with ρ = 0.3
Scenario 5: unstructured matrix with ρ = 0.6
Scenario 6: modified simplex matrix with correlations of 0.3 and 0.6 on alternating diagonals

[Figure: Average Per-Variable Correct Ranking (%), Unequal Group Covariances and Multivariate Skewed Distribution]

[Figure: Average Per-Variable Correct Ranking (%), Skewed Distribution]

[Figure: Average Per-Variable Correct Ranking (%), Heavy-Tailed Distribution]

The LPI and PDRC measures tended to result in the highest percentages of correct rankings and values of the concordance statistic

The FTR measure tended to result in the lowest percentages of correct rankings and concordance, followed by the SLRC measure


The LPI and PDRC measures were relatively insensitive to many of the correlation structures

However, their correct ranking percentages dropped substantially when the data exhibited an unstructured correlation pattern with a high average correlation (ρ = 0.6)

Differences in correct ranking percentages across the correlation structures were smaller for the TDRC and SLRC measures than for other measures and were smallest for the FTR measure


Violations of the assumption of covariance homogeneity had a very small effect on the correct ranking rates

The correct ranking percentages for all measures were consistently lower for heavy-tailed than for skewed distributions


The choice of measures of relative importance depends on the perspective the researcher wants to take on the data
◦ Contribution of a variable to the discriminant function score
◦ Contribution of a variable to the grouping variable effect
◦ Contribution of a variable to explaining variation in a regression model


Inference for relative importance measures and ranks

Comparisons with recent developments in relative importance measures that are more computationally intensive (e.g., relative weights)

Extensions to more than two groups

Future Research
