7/31/2019 Case Studies Eric Cormier
1/38
The Canadian Journal of Statistics
Vol. 39, No. 2, 2011, Pages 181–217
La revue canadienne de statistique
Case studies in data analysis
Alison L. GIBBS1*, Kevin J. KEEN2 and Liqun WANG3
1 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
2 Department of Mathematics and Statistics, University of Northern British Columbia, Prince George, BC, Canada V2N 4Z9
3 Department of Statistics, University of Manitoba, Winnipeg, MB, Canada R3T 2N2
The following short papers are summaries of student contributions to the Case Studies in Data Analysis from the Statistical Society of Canada 2009 annual meeting. Case studies have been an
important part of the SSC annual meeting for many years, providing the opportunity for students
to delve into interesting problems and data sets and to present their findings at the meeting. Since
2008, prizes have been awarded for the best poster presentations for each of two case studies. The
case studies at the 2009 annual meeting and the selection of this suite of papers were organized
by Gibbs and Keen.
This section consists of two groups of papers corresponding to the two case studies. Each subsection starts with an introduction given by the data donors, which is followed by the winning paper and the contributed papers. The subsection ends with a discussion and summary by the data donors.

The theme of case study 1 is the identification of relevant factors for the growth of lodgepole
pine trees. First, Dean, Gibbs, and Parish provide an introduction to the data and the problems
of scientific interest. The winning paper's authors, Cormier and Sun, first use a nonparametric smoothing technique to identify a nonlinear relationship between the growth rate and the age of the trees.
They then use a mixed model to explain the growth rate in terms of age and other environmental
factors. In the second paper, Salamh first estimates a similar mixed model and then supplements
the analysis using a dynamic model.
The theme of case study 2 is the classification of disease status through proteomic biomarkers.
Balshaw and Cohen-Freue introduce the data and problems of interest. The winning paper is
authored by Lu, Mann, Saab, and Stone, who first explore various data imputation techniques, including k-nearest neighbours, local least squares, and singular value decomposition. They then apply several variable selection methods such as the LASSO, least angle regression (LARS), and sparse logistic regression. This paper is accompanied by four contributed papers which use
various modern classification techniques. Guo, Chen, and Peng use a score procedure to classify
the disease status. Liu and Malik employ a multiple testing procedure. Meaney, Johnston and
Sykes apply support vector machines (SVM). Wang and Xia use classification trees and logistic
regression techniques. A summary and comparison of these methods and outcomes are given by
Balshaw and Cohen-Freue.
We are grateful to Charmaine Dean of Simon Fraser University, Roberta Parish of the British
Columbia Ministry of Forests and Range, and Rob Balshaw and Gabriela Cohen-Freue of the
NCE CECR PROOF Centre of Excellence for the use of their data and their contributions to this
suite of papers. We also thank the former and current Editors of the CJS, Paul Gustafson and
Jiahua Chen, for agreeing to publish these papers and for their patience and support during the
editorial process.
Received 31 October 2010
Accepted 6 January 2011
Case study 1: The effects of climate on the growth of lodgepole pine
C.B. DEAN1*, Alison L. GIBBS2 and Roberta PARISH3
1 Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
2 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
3 British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8
Key words and phrases: Climate effects; tree growth; biomass; lodgepole pine.
1. BACKGROUND
To compete successfully in the world economy, the commercial forestry industry requires an
understanding of how changes in climate influence the growth of trees. The goal of this case study
was to examine how well-known climate variables, combined with estimated crown biomass, can
predict wood accumulation in lodgepole pine. In order to model the growth and yield of trees over
time, we need to determine how much wood a tree accumulates each year. Each year, a tree lays
down an annual ring of wood in a layer under the bark. Pressler's hypothesis states that the area of wood laid down annually, measured by the cross-sectional area increment, increases linearly from the top of the tree to the base of the crown (the location of the lowest live branches) and is based
on the assumption that area increment in the crown increases with the amount of foliage above
the point of interest. Below the crown, the area increment in any given year remains constant
down the bole until the region of butt swell at the base of most trees.
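Pressler's hypothesis can be written compactly. The following formalization is a sketch; the symbols \(\Delta A\), \(F\), \(h_c\), and \(c\) are ours, not from the source:

```latex
% \Delta A(h): annual cross-sectional area increment at height h
% F(h): foliar mass above height h;  h_c: height of the crown base
\Delta A(h) = c\, F(h) \quad \text{within the crown,}
\qquad
\Delta A(h) = \Delta A(h_c) \quad \text{for } h \le h_c .
```

If foliage is spread roughly evenly over the crown, \(F(h)\) grows linearly with distance from the tree top, which gives the linear increase from top to crown base described above, and the second relation gives the constant increment below the crown.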
The growth of a tree in any given year is strongly influenced by growth in the previous years.
One reason for this is that buds are formed the year before they start to grow and carbohydrates
from good years can be stored to fuel growth in subsequent years. The effects of previous growing
conditions can last from 1 to 3 years, depending on tree species and location. Climate affects
growth and influences both the size of the annual ring of wood and the proportions of early
and late wood. Low density early wood is laid down during the spring when water is plentiful.
Late wood, which is laid down from mid-summer until growth ceases in the fall, has a high
density. Cessation of wood formation is sensitive to weather conditions such as temperature and
drought.
Lodgepole pine (Pinus contorta Doug. ex Loud.) stands dominate much of western Canada and
the United States, covering over 26 million hectares of forest land. It is an important commercial
species in British Columbia; stands consisting of more than 50% lodgepole pine occupy 58% of the
forests in the interior of the province. Lodgepole pine is primarily used for lumber, poles, railroad
ties, posts, furniture, cabinetry, and construction timbers. It is commercially important to be able
to predict how lodgepole pine will grow and accumulate wood over time. Using high resolution
satellite images of lodgepole pine stands to predict wood attributes is under consideration, but
first the relationship of crown properties such as the amount of foliage must be linked to wood
properties and growth.
2. THE DATA
Data on the annual growth and wood density of 60 lodgepole pine trees from four sites in two
geographic areas in central British Columbia were provided for this investigation. Samples were
removed at 10–13 locations along each tree, and two radii (A and B) per sample disc were measured. Measurements of the last year of growth and wood density are often unreliable because
of proximity to the bark and difficulties of sample preparation. However, it is for this ring only
that we have measures of the amount of foliage. Several growth outcomes are available, including the widths of the A and B radii, in millimetres, percentage of late and early wood, and
early and late wood densities, in kg/m3. Foliar biomass measurements are provided for multiple branches; estimates are available for each annual whorl. The data on biomass include the
average relative position of the branch in the crown (1 is the base of the crown and 0 is the
top) and corresponding foliar biomass (the mass, in kg/m2, of needles subtended by the branches
at that position). Other variables such as the total height of the tree, in metres, as well as the
height to the base of the crown, in metres, are also provided. Climate data from Environment
Canada arise from the two nearest stations with long-term records, Kamloops and Quesnel. For
each of these locations, monthly and annual data are provided on: (1) the minimum temperature, in degrees Celsius, (2) the maximum temperature, in degrees Celsius, and (3) the total
precipitation, in millimetres. Additional details on the data and variables are provided at the site
www.ssc.ca/en/education/archived-case-studies/ssc-case-studies-2009.
3. OBJECTIVES
The primary objective of this case study was to determine to what extent climate, position on the
tree bole (trunk), and current foliar biomass explain cross-sectional area increment and proportion
of early and late wood.
Other questions of interest included:
How have temperature and precipitation affected the annual cross-sectional growth and the
proportions of early and late wood in lodgepole pine?
Is annual growth best explained by average annual temperature or do monthly maximum and/or
minimum values provide a better explanation? Do early and late wood need to be considered
separately?
Does the use of climate variables to predict the growth and proportions of early and late wood
provide more reliable estimates than the use of the growth and density measurements from
previous years as measured from the interior rings?
Received 31 October 2010
Accepted 6 January 2011
The determination of the relevant explanatory variables for the growth of lodgepole pine using mixed models
Eric CORMIER* and Zheng SUN
Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 2Y2
Key words and phrases: Climate effects; biomass; mixed models; tree growth
Abstract: In this paper a mixed model with nested random effects was used to model the cross-sectional
area increment of lodgepole pine based on relevant explanatory variables. This model was used to show
that minimum monthly temperature, monthly precipitation, and foliar biomass are positively related to the
cross-sectional area increment, while an ordinal variable approximating lower trunk position and maximum
monthly temperature are negatively related. It was shown that annual growth is better explained by monthly
maximum and minimum temperatures than by average annual temperature and that the use of climate
variables provided more reliable estimates for growth prediction than the use of the growth and density
measurements from previous years. The Canadian Journal of Statistics 39: 185189; 2011 2011 Statistical
Society of Canada
1. INTRODUCTION
In this analysis, we addressed the following questions: (1) To what extent do climate, position on the tree bole, and current foliar biomass explain cross-sectional area increment? (2) How have temperature and precipitation affected the annual cross-sectional growth? (3) Is annual growth best explained by average annual temperature, or do monthly maximum and/or minimum values provide a better explanation? (4) Does the use of climate variables to predict the growth and proportions of early and late wood provide more reliable estimates than the use of the growth and density measurements from previous years as measured from the interior rings? We used mixed models to
model the relationships between the explanatory variables: climate, position on the tree bole, and
current foliar biomass, and the responses: cross-sectional area increment and the proportion of late
wood of lodgepole pine. There were four features of the data that complicated the analyses: (1)
Climate variables for each year were available and annual growth measurements were collected
from tree samples, so we expected the data to exhibit autocorrelation. The correlation structure was
accommodated by the use of random effects. (2) For each disc, there were measurements from two
separate radii. Radius was treated as a nested random effect. It could have been assumed that the measurements along the two radii were two observations of the same variable, so that an average could be taken; but due to the asymmetry of the tree radii, an average is not a good estimate of the
variable. Alternatively, the measurements along the two radii could be used individually but there
would be very high correlations between them. To allow both sets of measurements to be used
and include the correlations between radii, a nested random effect was used (Pinheiro & Bates,
2000, p. 40). (3) The ages of the trees varied resulting in a different number of observations for
each tree. This complication corresponds to drop-out in a longitudinal study. The reason for the resulting missing data was not believed to be informative, because it depended only on the age of the tree and not on a growth factor. Therefore this situation was modelled assuming the data on missing years of a tree's life were missing at random, and bias was not taken into account.
(4) Destructive sampling meant that foliar biomass was collected at only one point in time; an
inverse regression was conducted to determine the foliar biomass in other years.
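As a rough illustration of why a nested tree/disc random-effects structure captures the correlation between the two radii, the following simulation (all variance components invented) generates data in which two radii on the same disc share both a tree effect and a disc-within-tree effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n_trees, n_discs, n_years = 30, 5, 20
sd_tree, sd_disc, sd_noise = 1.0, 0.5, 0.3   # invented variance components

# y[tree, disc, radius, year] = tree effect + disc-within-tree effect + noise;
# the two radii on a disc share both effects, so they are the most correlated.
tree_eff = rng.normal(0.0, sd_tree, size=(n_trees, 1, 1, 1))
disc_eff = rng.normal(0.0, sd_disc, size=(n_trees, n_discs, 1, 1))
noise = rng.normal(0.0, sd_noise, size=(n_trees, n_discs, 2, n_years))
y = tree_eff + disc_eff + noise

# correlation between the two radii of the same disc
r_same_disc = np.corrcoef(y[:, :, 0, :].ravel(), y[:, :, 1, :].ravel())[0, 1]
# correlation between radii taken from different discs of the same tree
r_diff_disc = np.corrcoef(y[:, 0, 0, :].ravel(), y[:, 1, 0, :].ravel())[0, 1]

print(r_same_disc > r_diff_disc > 0)  # same-disc radii are the most correlated
```

Averaging the two radii would discard this structure; the nested random effect keeps both series while accounting for their shared components.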
In addition to the climate variables, the growth and density measurements from previous years
were used to predict the growth of early and late wood. This prediction was done using an ARIMA
model and the reliability of these estimates was examined.
2. METHODOLOGY
Non-parametric regressions of late wood percentage and cross-sectional area increment against
trunk position, foliar biomass, and annual maximum temperature were fit to determine the general
trend in the response. We assumed that the nth measurement from the ground would correspond
across trees regardless of the height of the trees, so the ordinal variable position was used. The plot
of cross-sectional area increment versus trunk position in Figure 1 shows that from trunk positions 1–4, there is a negative relationship between trunk position and cross-sectional increment; but after position 5, there is a positive relationship between trunk position and cross-sectional increment. This could be due to the fact that position 4 or 5 corresponds with the start of the crown. The
plot of cross-sectional area increment versus biomass and the plot of percentage of late wood ver-
sus biomass show that high values of biomass correspond to high cross-sectional area increment and low late wood percentage, respectively. The plot of cross-sectional area increment versus
annual maximum temperature and the plot of percentage of late wood versus annual maximum
temperature show that the relationship between annual maximum temperature and cross-sectional increment, and the relationship between annual maximum temperature and late wood percentage, are different for each location (Kamloops and Quesnel). This suggests including two-factor interactions between location
and annual maximum temperature. Similar interaction plots, which are not included in this paper,
suggested including the interactions between the following pairs of variables: location and annual
minimum temperature, location and precipitation, age of the tree and foliar biomass, and location
and foliar biomass. We also included two-factor interactions among climate variables.
One of the questions of interest was how the position on the tree bole affects the cross-sectional area increment. When examining the effect of position, it was necessary to account for the fact that trees had a large range in their absolute height. Measurements taken at 10 m from the ground on a tree that was 11 m tall have a different relative position on the tree trunk than the same measurement on a tree that was 40 m tall. To account for this, an ordinal variable, position, was
used to represent relative height.
To determine whether monthly maximum and minimum temperature or average annual temperature best explain cross-sectional growth, the mixed model was fitted separately with monthly
measurements and with average annual measurements and model goodness-of-fit criteria were
compared. The variables that resulted in a better fit were used in the analysis. To model the correlation structure, a random intercept and a random slope for each tree were adopted. To model the nested effect presented by having two radii measured on each tree disc, a nested random effect was introduced.
Figure 2: Mean cross-sectional area increment over age of the tree.
Table 1: Climate effects on cross-sectional increment.
Month
J F M A M J J A S O N D
Maximum temperature * * * * *
Minimum temperature + + + + + + + + + + + +
Precipitation + + + + + + + + + + + +
Significant positive effect (+), significant negative effect (−), not significant (*) at the 5% level.
3. RESULTS
To improve model adequacy, a Box–Cox transformation was performed with \(\lambda = 0.25\). The function \(g\) was determined to be cubic through examination of Figure 2, that is, \(g(t_{ihk}) = \beta_{40} t_{ihk} + \beta_{41} t_{ihk}^{2} + \beta_{42} t_{ihk}^{3}\). Trunk position in the crown and the amount of foliar biomass have positive relationships, and age of the tree has a negative relationship, with the cross-sectional area increment. The effects of climate on cross-sectional increment are presented in Table 1.
Results from the fitted models were quite consistent with patterns observed in Figure 1, except that the interaction effect of annual maximum temperature and location was not significant.
The estimated standard errors for the random effects were \(\hat{\sigma}_1 = 0.5394\), \(\hat{\sigma}_2 = 0.0225\), and \(\hat{\sigma}_3 = 0.0284\). The nested effect for radii was not significant, implying that the measurements from the
two different radii did not significantly differ.
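The Box–Cox transformation with exponent 0.25 used above is straightforward to apply; a minimal sketch on simulated right-skewed data (the lognormal sample is invented, standing in for the area increments):

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform; lam = 0 corresponds to the log."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def skewness(x):
    x = x - x.mean()
    return (x ** 3).mean() / (x ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
y = rng.lognormal(mean=2.0, sigma=0.8, size=5000)  # right-skewed, like growth data
z = box_cox(y, 0.25)

# the transform pulls in the right tail, so skewness drops
print(skewness(y) > skewness(z))
```

The quarter-power transform sits between the log (\(\lambda = 0\)) and the square root (\(\lambda = 0.5\)), compressing large values to make the residuals closer to normal.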
It was determined that annual growth was better explained using monthly maximum and minimum temperature values than average annual values, because both the AIC (66,135 vs. 67,750) and the BIC (66,781 vs. 68,050) were smaller when monthly measurements were included in the model. However, although we used monthly climate variables as main effects, interactions were modelled using annual climate variables to reduce the number of interaction terms in the model.
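The AIC/BIC comparison penalizes a model's log-likelihood by its parameter count; a small sketch with hypothetical values (the log-likelihoods and parameter counts below are invented, chosen only to mirror the reported magnitudes):

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k log n - 2 log L (smaller is better)."""
    return k * math.log(n) - 2 * loglik

# Hypothetical values: the monthly-climate model spends more parameters (k)
# but gains enough log-likelihood that both criteria still prefer it.
n = 10_000
monthly = dict(loglik=-33_000.0, k=60)
annual = dict(loglik=-33_860.0, k=12)

print(aic(**monthly) < aic(**annual))            # True
print(bic(n=n, **monthly) < bic(n=n, **annual))  # True
```

BIC's \(k \log n\) penalty grows with the sample size, so a richer model must buy proportionally more likelihood to win under BIC than under AIC.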
Examination of the residual plot showed heavy tails and skewness. This indicates that the
residuals deviate from normality and that modelling them with a skew-elliptical distribution would be more appropriate (Jara, Quintana & San Martín, 2008).
To model the dependence of the growth of early and late wood on measurements from previous years, an ARIMA model was fitted: autoregressive, integrated of order three, with a first-order moving average term. This model was used to predict future growth and to determine prediction intervals (Figure 3).

Figure 3: Five-step-ahead predictions of early and late wood growth.
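The integrated component of an ARIMA model differences the series before the ARMA part is fitted; a minimal numpy sketch (the cubic trend is invented) shows that differencing three times reduces a cubic trend to a constant:

```python
import numpy as np

# An ARIMA(p, 3, q) model differences the series three times before an
# ARMA(p, q) model is fitted to what remains.
t = np.arange(50, dtype=float)
trend = 0.002 * t ** 3 - 0.1 * t ** 2 + t + 5.0  # invented cubic growth trend

d3 = np.diff(trend, n=3)
# the third difference of a*t^3 + ... at unit spacing is the constant 6a
print(np.allclose(d3, 6 * 0.002))
```

This is why heavy differencing widens prediction intervals: forecasts must be cumulated back through three integration steps, compounding the forecast variance at each step.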
4. CONCLUSIONS
Data on the cross-sectional area increment and proportion of late wood for tree rings at multiple
years, heights and geographical regions were analyzed, with the addition of climate data and
measurements of foliar biomass for each year of the trees' lives. The use of mixed models allowed all of these covariates to be included in each model and their relationships to be modelled using all available data. However, the model is not sufficient to conclude cause-and-effect relationships
between the variables and the growth of lodgepole pine. Since the monthly climate factors were
correlated, the problem of multi-collinearity was present. Therefore, the model requires careful
interpretation of the coefficients. In response to questions (1) and (2) in Section 1, the results show
that minimum monthly temperature, monthly precipitation and foliar biomass were positively
related to the cross-sectional area increment while lower trunk position and maximum monthly
temperature were negatively related. Examination of question (3) showed that annual growth
was better explained by monthly maximum and minimum temperatures than by average annual
temperature. Due to the wide prediction intervals from the time series analysis it was believed that
the use of climate variables provided more reliable estimates for growth prediction (question (4)). Possible future work that could be carried out to improve the model includes the use of skew-elliptical distributions to model the residuals, to account for both skewness and heavy tails in the error terms, and the incorporation of splines to accommodate the temporal trend in the observations.
ACKNOWLEDGEMENTS
Many thanks to Farouk Nathoo for the very helpful correspondence.
BIBLIOGRAPHY
A. Jara, F. Quintana & E. San Martín (2008). Linear mixed models with skew-elliptical distributions: a Bayesian approach. Computational Statistics and Data Analysis, 52, 5033–5045.
J. Pinheiro & D. Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer-Verlag, New York.
Determinants of lodgepole pine growth: Static and dynamic panel data models
Mustafa SALAMH*
Department of Statistics, University of Manitoba, Winnipeg, MB, Canada R3T 2M2
Key words and phrases: linear mixed model; nested error components; autoregressive panel models; tree growth; climate effect; Pressler's hypothesis; random coefficients regression; two-stage least squares
1. INTRODUCTION
This study was concerned with modelling the wood properties and growth over time of lodgepole pine in British Columbia. The primary objective was to determine to what extent climate, position on the tree trunk, and current foliar biomass explain cross-sectional area increment
and proportion of early and late wood. The study also addressed other questions, such as whether
growth is best explained by average annual temperature or monthly temperature extremes, and whether the use of climate variables to predict the growth and wood properties provides more reliable estimates than the use of the growth and density measurements from previous years.
2. METHODOLOGY
Pressler's hypothesis states that the annual increment in cross-sectional area of wood increases linearly from the top of the tree to the base of the crown and is proportional to the amount of foliage above the point of the increment. Since tree and crown heights and foliar biomass were only
available for the year in which the tree was cut down, they had to be estimated for other years
of the tree's life. This was done using loess regression based on the tree's current height, crown length, and the heights of the disks. In order to answer the primary question about the effects of
climate, disk position on the tree bole, and foliar biomass on the cross-sectional area increment
and percentage of late wood, a linear mixed effects model was used to account for the two-level
grouping structure of the data. The model was formulated according to Pressler's hypothesis, without ignoring the possible random variability due to disks below the crown. Other nuisance factors such as age from pith, tree height, and geographic location were included in the model to control for their effects. The climate model takes the form
\[
\begin{aligned}
y_{ijt} = {}& f_1(\text{disk age}_{ijt}, \text{tree height}_{it}, \text{site}_i) + f_2(\text{clim}_{t,t-1,t-2}) + \alpha\, D_{ijt} \\
& + \beta\, \text{topdistance}_{ijt} + \gamma\, \text{topmass}_{ijt} + a_i + b_i D_{ijt} + c_i\, \text{topdistance}_{ijt} + d_i\, \text{topmass}_{ijt} \\
& + u_{ij}(1 - D_{ijt}) + \varepsilon_{ijt}, \qquad i = 1, \ldots, I, \; j = 1, \ldots, J_i, \; t = 1, \ldots, T_{ij}, \qquad (1)
\end{aligned}
\]
where the response \(y_{ijt}\) is either the square root of the area increment or the log of the late-to-early wood ratio for disk \(j\) within tree \(i\) at year \(t\); \(f_1\) and \(f_2\) are linear functions, and clim represents a vector of climate variables (temperature and precipitation). The variable \(D\) is an indicator for being above the crown, topdistance is the product of \(D\) and the distance from the tree top to the disk, and topmass
is the product of \(D\) and the gross amount of foliage above the disk. The standard distributional assumptions were made for the random effects and residual error, namely \((a_i, b_i, c_i, d_i, u_{i1}, \ldots, u_{iJ_i})\) iid \(N\big(0, \operatorname{diag}(\sigma_a^2, \sigma_b^2, \sigma_c^2, \sigma_d^2, \sigma_u^2 I_{J_i})\big)\) and \(\varepsilon_{ijt} \sim N(0, \sigma_\varepsilon^2)\), with the \(\varepsilon_{ijt}\) following an ARMA(p, q) process within each disk, independent of the random effects. Two lags of the climate variables were included, since the effects
of previous growing conditions can last from 1 to 3 years. I focused on the spring, summer, and fall climate variables because early wood is laid down during the spring and late wood is
laid down from mid-summer to fall. Several sub-models were fit using the R-package nlme.
Diagnostic graphs were produced to ensure the adequacy of the models and to check the validity
of assumptions. In almost all sub-models the residual error had ARMA(2,1) structure, which is
consistent with Monserud (1986). Likelihood ratio and Wald tests were then performed to check the significance of the effects.
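The loess step described above, used to reconstruct heights and biomass for earlier years, can be sketched as a locally weighted linear fit; this minimal version (data, bandwidth choice, and growth curve all invented) evaluates the smooth at a single point:

```python
import numpy as np

def loess_point(x0, x, y, frac=0.5):
    """Fitted value at x0 from a locally weighted linear fit (minimal loess)."""
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]                           # bandwidth: k-th nearest distance
    w = np.clip(1.0 - (d / h) ** 3, 0.0, 1.0) ** 3  # tricube weights
    sw = np.sqrt(w)
    X = np.column_stack([np.ones(n), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]                                  # intercept = fit at x0

# invented example: smooth height-versus-age curve, interpolated at age 40
age = np.arange(5.0, 85.0, 5.0)
height = 30.0 * (1.0 - np.exp(-0.03 * age))
est = loess_point(40.0, age, height)
print(abs(est - height[age == 40.0][0]) < 1.0)      # close to the true curve
```

A full loess implementation iterates this fit over a grid of target points and optionally reweights for robustness; the single-point version above is enough to show the mechanism.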
To answer whether annual growth is best explained by average annual temperature or monthly
temperature extremes, I used the structure of Equation (1) for each set of temperature variables.
Since neither of the two models is nested in the other, I applied the idea of the J-test (Gujarati,
2003, p. 533) to determine which model is preferred. To consider whether climate variables or
growth and density variables from previous years better predict growth, I used a nested error
component model with autoregressive dynamics and other explanatory variables to predict the
annual growth. The proposed model equation is free of the climate variables. It is given by
\[
y_{ijt} = \phi_1 y_{ij,t-1} + \phi_2 y_{ij,t-2} + \beta_1 x_{ij,t-1} + \beta_2 x_{ij,t-2} + \gamma_1 z_{ij,t-1} + \gamma_2 z_{ij,t-2} + \mu_{ij} + \varepsilon_{ijt}, \qquad (2)
\]
where \(y_{ijt}\) is defined as in Equation (1), and \(x\), \(z\) are the densities of early and late wood, respectively.
The heterogeneity due to the trees and disks within trees is represented by the error component \(\mu_{ij}\). The model is semiparametric: the residual error \(\varepsilon_{ijt}\) satisfies the moment condition \(E(\varepsilon_{ijt} \mid \mu_{ij}, y_{ij,t-1}, x_{ij,t-1}, z_{ij,t-1}, y_{ij,t-2}, x_{ij,t-2}, z_{ij,t-2}, \ldots) = 0\). The model was fit using two-stage least squares on the first differences within disks. It was compared to the model of Equation (1) with regard to out-of-sample prediction power. A test sample of size 2,718, taken across almost all the disks, was set aside, and both models were fit using the remaining
data. The fitted models were compared according to their mean squared error of prediction in the
test sample. The MSE for the climate models was one-quarter that of the dynamic model.
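The held-out comparison can be sketched with a toy simulation (all data invented; a noise covariate stands in for the competing predictor set): fit each candidate model on the training portion and score it by mean squared error on the withheld sample.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3000

clim = rng.normal(size=n)                 # invented "climate" covariate
y = 1.5 * clim + rng.normal(scale=0.5, size=n)

# withhold a test sample, fit on the rest
mask = np.zeros(n, dtype=bool)
mask[rng.choice(n, size=600, replace=False)] = True

def fit_predict(x, y, mask):
    """OLS fit on the training part, predictions on the withheld part."""
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X[~mask], y[~mask], rcond=None)
    return X[mask] @ beta

pred_clim = fit_predict(clim, y, mask)                 # model with the real driver
pred_noise = fit_predict(rng.normal(size=n), y, mask)  # uninformative stand-in

mse_clim = np.mean((y[mask] - pred_clim) ** 2)
mse_noise = np.mean((y[mask] - pred_noise) ** 2)
print(mse_clim < mse_noise)  # the informative model wins out of sample
```

Scoring on withheld data rather than on the training sample guards against a flexible model winning merely by overfitting, which is the point of the comparison reported above.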
3. CONCLUSION
Regarding how the temperature and precipitation affect the annual cross-sectional growth and the
proportions of early and late wood, the climate models showed high contributions of the current
and lagged values of rain and temperature in explaining both of the target dependent variables. For example, annual growth was positively affected by rain (especially in spring) and negatively
affected by extreme levels of temperatures. The proportion of early wood was positively affected
by rain throughout the year and by higher temperatures in spring; however, the proportion of late wood was negatively affected by higher temperatures in mid-summer. It is recommended that
proportions of early and late wood be considered separately to allow a clearer view of how they
were individually affected by the climate.
Regarding the extent to which the position on the tree bole and current foliar biomass explain the annual cross-sectional growth and the proportions of early and late wood, it was found that the higher the disk was in the crown, the smaller the annual wood increment and the higher the ratio of early wood. The foliar biomass had a positive linear effect on the annual wood increment within the crown, consistent with Pressler's hypothesis. However, it should be mentioned that the proportionality parameter is highly variable from one tree to another.
It was also found that annual growth was better explained by the monthly extremes of temperature than by the average annual temperature. The incremental contribution of the annual temperatures over the monthly temperatures was not significant, but monthly temperatures were significant additions to the annual climate model.
Regarding whether the use of climate variables to predict the two dependent variables provides more reliable estimates than the use of the growth and density measurements from previous years, it was clear that the climate models provide more reliable predictions than the autoregressive dynamic model. This emphasizes the importance of climate, disk position, and foliar biomass in prediction as well as in explanation.
ACKNOWLEDGEMENTS
I am grateful to Dr. Liqun Wang for encouraging me to carry out this research project and for
financial support through his research grants from NSERC and the National Institute for Complex
Data Structures.
BIBLIOGRAPHY
D. N. Gujarati (2003). Basic Econometrics, 4th ed., McGraw-Hill, New York.
R. A. Monserud (1986). Time-series analyses of tree-ring chronologies. Forest Science, 32(2), 349–372.
Received 31 October 2010
Accepted 6 January 2011
Discussion of case study 1 analyses
C. B. DEAN1*, Alison L. GIBBS2 and Roberta PARISH3
1 Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
2 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
3 British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8
Key words and phrases: Nonlinear mixed models; hierarchical random effects; transformations; prediction
intervals; autoregressive integrated moving average model; measurement error; model diagnostics
1. INTRODUCTION
The two analyses incorporated similar features, but with different model formulations and variables; even so, they yielded similar major conclusions. Both Salamh and Cormier & Sun considered whether and to what extent climate, position on the tree bole, and foliar biomass explain cross-sectional area increment and proportion of early and late wood. They both also specifically
investigated the effects of temperature and precipitation, and whether average annual temperature or monthly maximum and/or minimum values better explain variability in area increment.
Additionally, Salamh considered whether climate variables explain the variability in growth
and proportions of early and late wood better than measurements from previous years of these
variables.
2. THE MODELS
Both Salamh and Cormier & Sun utilized nonlinear mixed effects models with hierarchical random effects. Transformations of the responses were considered, including the square root of area
increment and the logarithm of the ratio of early to late wood. Lags of climate variables were
included as explanatory variables in Salamh, but not in Cormier & Sun. Salamh utilized a conceptually based approach, modelling the growth as increasing linearly from the top of the tree to the base of the crown, with tree-to-tree variability in this linear functional form. Cormier &
Sun incorporated a variable labelled position (taking values 1, 2, 3, . . . ) which is meant to reflect
the height of measurement of the area increment from the base of the tree. Note that the heights at which measurements are taken are not multiples of a specific value, so when position = 2, the height from the base of the tree is not twice that when position = 1. As well, the heights at which measurements are taken are not the same from tree to tree. So the modelling of the transformed response variable as a linear function of the variable position, as considered in Cormier & Sun, is somewhat of an approximation to including a linear function of the height at which measurements are taken in the model. Future models should incorporate a more accurate measure
of relative position by using tree heights and height to sample. Both Salamh and Cormier & Sun
include (estimates of) biomass in the model, with the transformed response changing linearly with
biomass increases. Salamh incorporated autoregressive errors, while Cormier & Sun incorporated
interaction terms. Site-specific intercepts were included in Salamh, while Cormier & Sun omitted
site effects. Cormier & Sun also handled the modelling of early and late wood separately, looking
at the annual averages of early and late wood over all trees and investigating how these are related
to lagged effects.
Case study 2: Proteomic biomarkers for disease status
Robert F. BALSHAW* and Gabriela V. COHEN FREUE
NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6
Key words and phrases: Variable selection; class prediction; imputation
1. BACKGROUND
Renal transplantation saves the lives of hundreds of Canadians every year. However, every
transplant recipient must be monitored closely for signs of acute rejection, which is the body's
immunologic and inflammatory response to the presence of a foreign organ. If not properly treated,
acute rejection can lead to loss of the transplanted organ, dialysis or even death. Unfortunately,
acute rejection can only be detected by biopsy, a distressing, uncomfortable and expensive surgical
procedure that can be required multiple times during the first year post-transplant.
The Biomarkers in Transplantation project was funded by Genome Canada to identify non-
invasive biomarkers for the detection and prediction of acute rejection based on proteomic analysis
of peripheral blood samples. A clinical test based on such a biomarker could lead to a better
method for monitoring transplant recipients, reducing costs, improving treatment outcomes, and
substantially improving recipients' quality of life.
Measures of protein abundance data were collected from a selection of kidney transplant
recipients who were known at the time of the blood sampling to be either experiencing treatable
acute rejection (AR) or not experiencing acute rejection (NR). Each sample was drawn from
an independent subject within the first 3 months post-transplant. For each AR sample, two NR
samples were selected at approximately the same time point post-transplant.
The goal of this case study was to utilize these proteomic data to create a classifier for acute
rejection, which could then be evaluated on a test set of 15 samples.
2. THE DATA
At the time of the Case Study Competition for the 2009 SSC meeting, potential intellectual property
considerations meant that the true nature of the dataset had to be hidden from the participants. For example, AR and NR status were referred to as the patients being in an active or inactive state of
disease. The data were also supplemented with synthetic sample data, constructed to mimic the
disease. The data were also supplemented with synthetic sample data, constructed to mimic the
observed characteristics of the AR and NR samples, both to enrich the size of the dataset and to
further protect intellectual property.
The dataset included 11 samples from AR patients, 21 samples from NR patients, plus an
additional 15 samples whose classification was hidden. All experimental samples were taken
from independent patients at the time when acute rejection was suspected or at a corresponding
matched time-point for non-rejection samples. Historically, approximately 10% of renal transplant
recipients experience rejection during the first few months post-transplant; however, the study
design was to select approximately 2 NR samples for every AR sample.
A multiplex proteomic technology, called iTRAQ, was used to measure the relative protein
abundances of the experimental samples relative to the quantity of the corresponding protein in a
reference sample. The reference samples were taken from a homogeneous batch of blood pooled
from 16 healthy volunteers.
Plasma was obtained from each whole blood sample through centrifugation. To enhance
detection sensitivity, the plasma samples were first depleted of the 14 most abundant proteins.
Trypsin was used to digest the proteins in the depleted samples, and the resulting peptides were
labelled with one of four distinct iTRAQ reagents (i.e., chemical tags with unique molecular
weights but otherwise identical chemical properties). The labelled samples were then pooled and
processed using a MALDI TOF/TOF technology.
Each iTRAQ run was designed with three experimental samples and one reference sample.
Peptide identification and quantitation was carried out by ProteinPilot Software v2.0 and the data
were assembled into a comprehensive summary of the relative abundance of the proteins in the
experimental samples. As the same reference sample was used in all runs, these relative abundance
measures were comparable across experimental runs.
Each run of the experiment detected and measured several hundred proteins (about 200 per
run), but not every protein was identified in every sample, nor even in every run, leading to a
complex pattern of missing data.
If a protein was not identified in a particular experimental sample, the protein's relative abundance
level was unknown. When this happened for the reference sample in a particular run,
then the relative levels for this protein could not be estimated for any of the three experimental
samples in that run.
Proteins were identified in the data using arbitrary protein identifiers (BPG0001–BPG0460).
Though this prevented the incorporation of biological/subject-matter context in the analysis, it
was necessary due to potential intellectual property concerns.
In addition to acute rejection status and relative abundance measures for 460 proteins, the sex,
race, and age of each subject were provided.
Received 31 October 2010
Accepted 6 January 2011
Disease status determination: Exploring imputation and selection techniques
Linghong LU, Rena K. MANN*, Rabih SAAB and Ryan STONE
Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 3R4
Key words and phrases: BPCA; imputation; k-NN; LARS; LASSO; LLS; selection method; SLR; SVD
Abstract: Analyzing a proteomics dataset that contains a large number of independent variables (biomarkers)
with few response variables and many missing values can be very challenging. The authors tackle the problem
by first exploring different imputation techniques to treat the missing values and then investigating multiple
selection techniques to pick the best set of biomarkers to predict the unknown patients' disease status. They
conclude their analysis by cross-validating the different combinations of imputation and selection techniques
(using the set of patients of known disease status) in order to find the optimal technique for the supplied
dataset. The Canadian Journal of Statistics 39: 197–201; 2011 © 2011 Statistical Society of Canada
Résumé : Analyser les jeux de données provenant de la protéomique représente un grand défi, car ils
contiennent un grand nombre de variables indépendantes (biomarqueurs) avec peu de variables réponses et ils
contiennent aussi des valeurs manquantes. Les auteurs s'attaquent à ce problème en explorant différentes
techniques d'imputation pour traiter les valeurs manquantes. Ils considèrent aussi différentes techniques
de sélection multiple pour choisir le meilleur ensemble de biomarqueurs afin de prédire l'état non connu
de la maladie d'un patient. Ils concluent leurs analyses en croisant les différentes combinaisons de techniques
d'imputation et de sélection (en utilisant un ensemble de patients dont l'état de la maladie est
connu) de façon à trouver la technique optimale pour le jeu de données analysées. La revue canadienne
de statistique 39: 197–201; 2011 © 2011 Société statistique du Canada
1. INTRODUCTION
There is a large amount of missing information in the data; as such, we eliminated proteins with
40 or more missing values, leaving 330 proteins. A log2 transformation of the data was then taken,
as the data are fold changes.
In order to select the set of proteins that accurately predict the status of the unknown list of
patients, combinations of different imputation methods and selection techniques were studied and
then cross validated using the protein expressions for individuals with known disease status.
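The filtering-and-transform step described above can be sketched as follows. The original analysis was done in R; this is a minimal Python illustration, and the function and variable names are our own:

```python
import numpy as np

def preprocess(X, max_missing=40):
    # X: subjects x proteins matrix of fold-change ratios, NaN = missing.
    # Drop any protein (column) with max_missing or more missing values,
    # mirroring the authors' threshold of 40, then log2-transform.
    keep = np.isnan(X).sum(axis=0) < max_missing
    return np.log2(X[:, keep]), keep

# Toy data: 5 subjects x 3 proteins; the middle protein is mostly missing.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(5, 3))
X[0:4, 1] = np.nan
Xt, keep = preprocess(X, max_missing=4)
# Xt has shape (5, 2): the heavily missing protein was dropped.
```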
2. IMPUTATION
Several imputation techniques were explored: k-Nearest Neighbours (k-NN), Local Least Squares
(LLS), Singular Value Decomposition (SVD), and Bayesian Principal Component Analysis
(BPCA).
k-Nearest neighbours is a sensitive and robust method in which missing values are determined
by the proteins whose expressions are most similar to the target protein in the other samples. The
optimal value of k, the number of neighbours used, has been shown in the literature to be in
the range 10–20 (Troyanskaya et al., 2001; Mu, 2008). Thus, the values 10 and 20 were
chosen in this instance. For each protein, the k-NN imputed value was found using Euclidean
distance (Troyanskaya et al., 2001) over the columns where that protein was not missing. If more
7/31/2019 Case Studies Eric Cormier
18/38
198 Vol. 39, No. 2
than 50% of a protein was missing, the missing values were imputed using each patient's average.
Otherwise, the average of the k-nearest neighbours was used to estimate the missing value.
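As a concrete, much simplified sketch of this scheme (in Python rather than the R used by the authors; proteins are rows, samples are columns, and all names are ours):

```python
import numpy as np

def knn_impute(X, k=10):
    # X: proteins x samples matrix, NaN = missing. For each protein with
    # missing entries, find the k proteins closest in Euclidean distance
    # over the commonly observed samples, and fill each missing entry
    # with the mean of those neighbours' values there.
    X = X.astype(float)
    filled = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if both.any():
                dists[j] = np.sqrt(np.mean((X[i, both] - X[j, both]) ** 2))
        nbrs = np.argsort(dists)[:k]
        for s in np.where(miss)[0]:
            vals = X[nbrs, s]
            vals = vals[~np.isnan(vals)]
            filled[i, s] = vals.mean() if vals.size else np.nanmean(X[i])
    return filled

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [10.0, 10.0, 10.0]])
filled = knn_impute(X, k=1)   # protein 0 is the nearest neighbour of protein 1
```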
In local least squares, a missing protein is evaluated as a linear combination of similar proteins
(Kim, Golub & Park, 2005). The method borrows from k-NN and from least squares imputation.
It is most effective when there is strong correlation in the data, as the k proteins were selected
based on those with the highest correlation to the target protein. We used two methods, Spearman
and Kendall, as the two correlation types, along with different values of k for the neighbours.
Singular value decomposition starts by taking the data set and ignoring the missing entries.
Then it calculates the mean for each of the rows of complete data. By initializing the missing
values to be the previously calculated row means, an iterative procedure finds the missing values.
Next, SVD is performed on the newly formed complete data set and the solution that is produced
replaces the row means in the missing values. These steps are repeated until the solution converges,
which usually happens after about five iterations (Hastie et al., 1999).
Bayesian principal component analysis is another imputation method that was used to impute
missing protein expression values. It combines an EM approach for PCA with a Bayesian model
and is based on three processes: principal component regression, Bayesian estimation, and an
expectation-maximization (EM)-like repetitive algorithm (Oba et al., 2003). The algorithm was
developed for imputation and very few components are needed in order to ensure accuracy. It
is an iterative process; it either terminates if the increase in precision becomes lower than the
threshold of 0.01 or if the number of set iterations is reached.
3. SELECTION METHODS
We used several methods to select differentially expressed proteins between the active and inactive
groups of patients. The simplest of these methods is the t-test. We used this basic test to compare
the protein expressions between the two groups. Significant results of the t-test provided a preliminary
analysis and a list of proteins potentially being expressed differently between the active and
inactive groups. We also used the Wilcoxon signed rank test which is conceptually similar to the
t-test but provides a more robust approach.
More advanced selection techniques were then explored to select biomarkers influencing the
disease status such as the Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle
Regression (LARS), and Sparse Logistic Regression (SLR).
3.1. LASSO and LARS
The concept of the LASSO, a technique suggested by Tibshirani (1996), is similar to ordinary
least squares but uses a constraint to shrink the number of independent random variables included
in the model by setting some of the coefficients to zero.
LARS is a model selection method, proposed by Efron et al. (2004), that is computationally
simpler than LASSO. It starts by equating all of the coefficients to zero and then adds the predictor
most correlated with the response. The next predictor added is the predictor most correlated with
the current residuals. The algorithm proceeds in a direction equiangular between the predictors
that are already in the model. Efron et al. (2004) presented modifications to the LARS algorithm
that generate the LASSO estimates and these were used to produce the LARS estimates in the
paper.
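One way to see how the LASSO constraint "sets some of the coefficients to zero" is the orthonormal-design special case, where the lasso solution is a soft-thresholding of the OLS coefficients (Tibshirani, 1996). A small illustration, not the authors' code:

```python
import numpy as np

def soft_threshold(b_ols, lam):
    # Lasso solution for orthonormal predictors: shrink each OLS
    # coefficient toward zero by lam and set the small ones exactly to 0.
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

b = soft_threshold(np.array([3.0, -0.5, 1.2, 0.1]), lam=1.0)
# b is approximately [2.0, 0.0, 0.2, 0.0]: two coefficients dropped
```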
3.2. SLR
Table 1: Number of individuals of known disease status misclassified by the different imputation and
selection methods.
          SVD      SVD     k-NN     k-NN            LLS           LLS          LLS
        (e = 3)  (e = 2) (k = 20) (k = 10)  BPCA  (k = 20 spear) (k = 10 ken) (k = 5 ken)
SLR        21       16       17       13     12         18           13           16
LASSO       0        0       12       13      1         13           16           15
LARS        3        0        3        4      0          5            1            1

The rows and columns correspond to the different selection and imputation methods, respectively. LLS has a
correlation type of either Spearman (spear) or Kendall (ken).
Table 2: Predicted status for unknown subjects using LARS combined with SVD imputation, where I and A
stand for inactive and active respectively.
Subject 2 7 8 11 12 13 17 18 29 31 33 34 35 38 41
Status I I I I I A A A I I I I I A I
SLR, proposed by Shevade & Keerthi (2003), uses maximum likelihood estimation to obtain
estimates of the model coefficients. The method is similar to LASSO in that it uses a constraint
to shrink the logistic regression model.
4. METHOD SELECTION
To select the set of proteins that accurately predict the status of the unknown list of patients,
combinations of imputation methods and selection techniques were studied and then cross
validated using the protein expressions for individuals with known status. Misclassification levels
observed for the eight imputation and three selection methods employed are displayed in Table 1.
The LARS algorithm had relatively few misclassified cases compared to SLR and LASSO. The
SLR method had high misclassification rates and was therefore dismissed for prediction purposes.
SVD and BPCA appear to be the best imputation techniques to use.
5. PROTEIN SELECTION AND PREDICTION
We chose the proteins picked by most methods. A protein was deemed to be selected as differen-
tially expressed by the t- and Wilcoxon signed rank tests if the respective P-values were less than
0.01. Proteins that had nonzero coefficients in the SLR, LASSO and LARS models were consid-
ered differentially expressed between the active and inactive groups of patients. The maximum
frequency of selection for proteins was 13, so 10 was taken as the cut-off value.
We noticed that seven proteins stood out: BPG0036, BPG0105, BPG0235, BPG0262, BPG0333,
BPG0381, and BPG0447. For prediction purposes we used the selected proteins mentioned above
and their corresponding coefficients from the LARS algorithm combined with the SVD imputation
method; the resulting predictions are shown in Table 2.
Table 3: Predicted status for unknown subjects where I and A stand for inactive and active status
respectively.
Subject 2 7 8 11 12 13 17 18 29 31 33 34 35 38 41
Status I A I I I A A I I A I I A A I
6. LOGISTIC REGRESSION
After determining the seven significant proteins, the variables race, gender and age were
analyzed to determine whether they were significant. Chi-square tests of independence showed
that race by status was not significant while gender was significant, and linear regression showed
that age by status was not significant. A logistic regression model (Dobson, 2002) was then fitted with the
variables gender and all seven of the selected proteins to produce a second set of predictions for
unknown disease status.
Three proteins, BPG0036, BPG0105, and BPG0333, were found to be significant by stepwise
algorithms applied to the fitted model. The expression levels of each of these three proteins changed
significantly between the active and inactive groups. Fitting the reduced model on the data of
given status gave the final model:

log(πj/(1 − πj)) = 1,309 − 6,948 BPG0036 + 4,509 BPG0105 − 2,715 BPG0333

for j = 1, . . . , 32. The status of patient j is determined to be active if the corresponding πj is less
than 0.5 and inactive otherwise. The predictions for the unknowns are shown in Table 3. To check
consistency, we predicted the status of the known-status patients (11 active and 21 inactive). The
misclassification was zero, which gave us confidence in the predictions for the 15 unknowns.
7. CONCLUSIONS
The main problem encountered when dealing with these proteomic data was the abundance of
missing values; therefore, the choice of imputation method proved to be vital. Both logistic regression
and LARS were shown to be very good selection methods; the LARS algorithm had low
misclassification rates. The three proteins picked by logistic regression were good predictors of a
patient's status, as there was an obvious separation between the inactive and active patients.
ACKNOWLEDGEMENTS
The support and guidance of Dr. Mary Lesperance is greatly appreciated. The study was supported
by Fellowships from the University of Victoria.
BIBLIOGRAPHY
A. Dobson (2002). An Introduction to Generalized Linear Models, Chapman & Hall/CRC, Washington.
B. Efron, T. Hastie, I. Johnstone & R. Tibshirani (2004). Least angle regression, Annals of Statistics, 32(2),
407–499.
T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown & D. Botstein (1999). Imputing missing data
for gene expression arrays. Technical Report, Department of Statistics, Stanford University, Palo Alto,
California, USA.
R. Mu (2008). Applications of correspondence analysis in microarray data analysis. MSc Thesis, University
of Victoria, Victoria, British Columbia, Canada.
S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara & S. Ishii (2003). A Bayesian missing value
estimation method for gene expression profile data, Bioinformatics, 19(16), 2088–2096.
S. Shevade & S. Keerthi (2003). A simple and efficient algorithm for gene selection using sparse logistic
regression, Bioinformatics, 19(17), 2246–2253.
R. Tibshirani (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical
Society, Series B, 58(1), 267–288.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. B. Altman
(2001). Missing value estimation methods for DNA microarrays, Bioinformatics, 17(6), 520–525.
Received 31 October 2010
Accepted 6 January 2011
Bootstrap multiple imputation: high-dimensional model validation with missing data
Billy CHANG*, Nino DEMETRASHVILI and Matthew KOWGIER
Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada M5T 3M7
Key words and phrases: Bootstrap validation; EM-algorithm; multiple imputation; penalized likelihood
1. METHODOLOGY
We ignored age, sex, and race when building the classifiers. We then removed proteins with more
than 50% missing values (191 proteins remained) and transformed zero relative abundance values
to 0.0001 to avoid taking the logarithm of zero when applying the log-transform.
We compared four classifiers: regularized discriminant analysis (RDA) with penalization
constant 0.99, diagonal linear discriminant analysis (DLDA), penalized logistic regression
(PLR) with penalization constant 0.01, and a linear-kernel support vector machine (SVM) with
penalization constant C = 1 (for the Lagrange multiplier). The classifiers are described in detail
in Hastie, Tibshirani and Friedman (2009).
Multiple imputation (Little & Rubin, 2002) and bootstrap validation (Hastie, Tibshirani &
Friedman, 2009) were used to compare and validate the four classifiers. We employed ROC
curves and AUC (area under the ROC curve) as the classifiers' performance metrics. Due to the
small validation sets created by bootstrapping, we employed the BS.632+ error correction
method (Hastie, Tibshirani & Friedman, 2009) to adjust for the small-sample prediction error bias
when comparing the prediction errors of the four classifiers.
We assume the 191 log-abundance scores for subject i (i = 1, . . . , N = 47) are multivariate
normal: x_i ~ N(μ, Σ). To avoid singular covariance estimates, we penalize the trace of the
precision matrix:

l(μ, Σ | x_1, . . . , x_N) = log det(Σ⁻¹) − trace(S Σ⁻¹) − λ trace(Σ⁻¹).

Here S is the sample covariance matrix. The maximum penalized-likelihood estimates are

μ̂ = (x_1 + · · · + x_N)/N,   Σ̂ = S + λI,
where I is the identity matrix. In the presence of missing data, we use the EM algorithm as
described in Little & Rubin (2002), with a slight modification in the M-step for parameter
estimation:
E-step: compute conditional expectations and covariance of the missing data given the
observed data.
M-step:
(1) Fill in the missing entries with their conditional expectations.
(2) Update the sample mean of the filled-in data.
Figure 1: ROC curves.
(3) Update the covariance estimate as the sum of the filled-in data sample covariance, the
conditional covariance of the missing data given the observed data, and λI (we used λ = 0.02).
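A compact numpy sketch of this penalized EM scheme, writing λ for the penalty constant (the authors used 0.02). This is our own illustration, not the authors' code:

```python
import numpy as np

def em_mvn(X, lam=0.02, n_iter=50):
    # EM for multivariate normal data with NaNs, with the ridge-style
    # penalty Sigma_hat = S + lam*I to keep the covariance nonsingular.
    X = X.astype(float)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0)) + lam * np.eye(p)
    for _ in range(n_iter):
        filled = np.empty_like(X)
        cond_cov = np.zeros((p, p))
        for i in range(n):
            m = np.isnan(X[i])
            if m.any():
                o = ~m
                xi = X[i].copy()
                # E-step: conditional mean/cov of missing given observed
                Soo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
                B = sigma[np.ix_(m, o)] @ Soo_inv
                xi[m] = mu[m] + B @ (X[i, o] - mu[o])
                C = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
                cond_cov[np.ix_(m, m)] += C
                filled[i] = xi
            else:
                filled[i] = X[i]
        # M-step: update mean; covariance = filled-data covariance
        # + averaged conditional covariance + lam*I
        mu = filled.mean(axis=0)
        S = (filled - mu).T @ (filled - mu) / n
        sigma = S + cond_cov / n + lam * np.eye(p)
    return mu, sigma, filled

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
mu, sigma, filled = em_mvn(X)
# the imputed entry tracks the near-perfect linear relation (close to 8)
```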
To validate the classifiers, we employed the following procedure:
(1) Estimate the missing-data distribution using all 47 subjects by the above EM algorithm.
(2) Generate 100 imputed data sets, with missing values imputed by draws from the estimated
missing-data distribution.
(3) From each of the 100 imputed data sets, remove the 15 subjects with unknown status.
(4) For each data set created in steps 2 and 3, create 200 bootstrap resampled data sets. Fit the
classifier on each bootstrap sample and compute averaged out-of-bag (OOB) ROC curves, OOB
AUC scores, and BS.632+ errors.
Note that all the penalization constants were chosen only to eliminate the singularity issues
due to the high dimensionality of the data; no fine-tuning was done to optimize classification
performance.
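The bootstrap out-of-bag part of this validation loop can be sketched as follows; a nearest-centroid rule stands in for the four classifiers compared in the paper, and all names are illustrative:

```python
import numpy as np

def oob_error(X, y, n_boot=200, rng=None):
    # Bootstrap out-of-bag validation: resample subjects with replacement,
    # fit on the bootstrap sample, score on the left-out (OOB) subjects.
    rng = np.random.default_rng(rng)
    n = len(y)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        oob = np.setdiff1d(np.arange(n), idx)
        yb = y[idx]
        if oob.size == 0 or (yb == 0).sum() == 0 or (yb == 1).sum() == 0:
            continue  # skip degenerate resamples
        c0 = X[idx][yb == 0].mean(axis=0)
        c1 = X[idx][yb == 1].mean(axis=0)
        pred = (np.linalg.norm(X[oob] - c1, axis=1)
                < np.linalg.norm(X[oob] - c0, axis=1)).astype(int)
        errs.append(np.mean(pred != y[oob]))
    return float(np.mean(errs))

# Two well-separated classes: the OOB error should be zero.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
y = np.array([0] * 10 + [1] * 10)
err = oob_error(X, y, n_boot=50, rng=1)
```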
2. RESULTS
The averaged ROC curve for PLR lies above the ROC curves of the other three classifiers (Figure 1),
suggesting that PLR achieves better performance on average across all threshold levels. The
distributions of AUC scores (results not shown) also suggest that PLR consistently achieves good
separation ability; however, certain bootstrap samples give very low OOB AUC.
We used the OOB BS.632+ error to check the classifiers' performance at threshold 0.5 for RDA,
DLDA, and PLR (i.e., classifying a subject as Active if P(Active | the subject's protein scores) > 0.5)
and 0 for SVM, and found that PLR's errors are also consistently lower than those of the other three
classifiers (results not shown).
3. CONCLUSIONS
Based on the above observations, PLR is the best classifier among the four classifiers compared.
However, the huge variance in AUC and BS.632+ errors casts doubt on whether PLR can truly
ACKNOWLEDGEMENTS
We thank our team mentor Rafal Kustra for his guidance and support throughout.
BIBLIOGRAPHY
T. Hastie, R. Tibshirani & J. Friedman (2009). The Elements of Statistical Learning, 2nd ed., Springer,New York.
R. J. A. Little & D. B. Rubin (2002). Statistical Analysis with Missing Data, 2nd ed., Wiley, New York.
Received 31 October 2010
Accepted 6 January 2011
Exploring proteomics biomarkers using a score
Qing GUO1*, Yalin CHEN2 and Defen PENG2
1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada
L8N 3Z52Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1
Key words and phrases: AUC; biomarker; cluster analysis; cross-validation; jittered QQ plot; logistic
regression; missing data imputation; proteomics score; sensitivity; specificity
1. METHODOLOGY
Proteins are constantly changing and interacting with each other, which makes it challenging to
identify a single protein. Often a function is implemented by several proteins, among which some
are up-regulated and some down-regulated. In this article, we propose a particular procedure
(PHD-SCORE) to seek a proteomics score that is a combination of relevant functional groups
of proteins. This can be used, in place of a single or a limited number of potential proteins, as
the biomarker to distinguish active disease status from inactive. The procedure involves the
following steps: Process the data; Condense the High dimensional Dataset; Identify Statistically
meaningful clusters and Calculate a proteomic score; Build statistical models to find one more
appropriate for the data; Determine patients' disease status by choosing an Optimal probability cut
point; Test model prediction ability by using cross-validation; Repeat the procedure until a model
is chosen with proper predictions and low Error rate; Apply the chosen model to unknown cases.
1.1. Process the Data
Data checking has been conducted to ensure consistency and integrity. Nothing peculiar was
found. Among 47 subjects, 11 had active disease, 21 were inactive, and 15 had unknown disease
status. Based on our exploration, any protein with more than 38.2% of observations missing in
either group (equivalent to 4 in active and 8 in inactive groups) was excluded from further analysis.
This left us with 160 protein variables in the data set, together with 3 covariates (age, sex, race).
1.2. Condense the High-Dimensional Protein Data
Hierarchical cluster analysis was employed to reduce the dimensionality of the protein data. We
used 1 − r (r is the Pearson correlation coefficient) as the dissimilarity between clusters, with
average linkage. Usually, cutting into too many clusters will result in a small number of variables
within each cluster, and too few clusters will mean that many less correlated variables will be
included. Given these considerations, we decided to choose 13 clusters, with a corresponding cut-off
correlation coefficient of 0.85.
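A sketch of this clustering step (the authors presumably worked in R; scipy's hierarchical clustering plays the same role here, with 1 − r as the dissimilarity, and all names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_clusters(X, r_cut=0.85):
    # Average-linkage hierarchical clustering of the columns of X using
    # 1 - r (Pearson) as the dissimilarity; the tree is cut at 1 - r_cut.
    r = np.corrcoef(X, rowvar=False)
    d = squareform(1.0 - r, checks=False)     # condensed distance vector
    Z = linkage(d, method="average")
    return fcluster(Z, t=1.0 - r_cut, criterion="distance")

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)
X = np.column_stack([a, 2.0 * a, b])          # first two perfectly correlated
labels = correlation_clusters(X)
# the first two columns share a cluster; the third lands elsewhere
```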
1.3. Impute Missing Data
Two imputation methods were used for handling missing data: hot-deck and linear regression.
In the latter, the missing value of a protein is predicted and imputed from a regression model
with all available predictors.
Table 1: Fitted coefficients for the logistic regression model.
Predictor Estimate SE P-value
Constant 4.05 4.41 0.36
Age 0.06 0.07 0.42
Proteomic score 20.75 8.64 0.02
1.4. Identify Informative Clusters
The clusters were selected by first considering 13 plots, one per cluster, of the average value of
the proteins in the cluster for the 32 subjects against disease status, to visually discriminate the
active and inactive groups, followed by stepwise logistic regression. This gave us 2 clusters, with 8
variables in one cluster and 12 variables in the other. The proteomics score was calculated as the
mean difference between the 2 clusters.
1.5. Build Model and Optimal Estimate of Cut-Off Probability
The logistic regression model was built using the covariates and the calculated proteomics
score as predictors. Gender and race were not statistically significant. The final model is
logit(P) = β0 + β1 Age + β2 Score, where P is the probability of having active disease status.
A probability cut-off plot was drawn to detect the patients' disease status. The AUC and a jittered QQ
plot (Zhu & El-Shaarawi, 2009) were effective tools to diagnose the adequacy of the fitted model in
numerical and graphical ways.
1.6. Test Prediction Ability and Apply to the Unknown Cases
Cross-validation within the 32 subjects with known disease status was utilized to provide a nearly
unbiased estimate of the prediction misclassification rate (Farruggia, Macdonald & Viveros-
Aguilera, 1999). We used a random split to partition the observed data into a training set (2/3 of
32 subjects, about 22 subjects) and a validation set (1/3 of 32 subjects, about 10 subjects). We
pseudo-randomly repeated the cross-validation 200 times to assess the misclassification rate.
Steps 1–6 were run repeatedly until a model with a low misclassification rate was chosen. Finally,
we applied the model to the 15 unknown cases to identify their disease status.
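The repeated 2/3–1/3 random-split validation can be sketched generically; `fit` and `predict` are placeholders for any classifier, and the trivial majority-class rule below is only for illustrating the plumbing:

```python
import numpy as np

def repeated_split_cv(X, y, fit, predict, n_rep=200, train_frac=2/3, rng=0):
    # Repeated random-split validation: 2/3 training, 1/3 validation,
    # misclassification rate averaged over n_rep pseudo-random splits.
    rng = np.random.default_rng(rng)
    n = len(y)
    n_train = int(round(train_frac * n))
    rates = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        tr, va = perm[:n_train], perm[n_train:]
        model = fit(X[tr], y[tr])
        rates.append(np.mean(predict(model, X[va]) != y[va]))
    return float(np.mean(rates))

# Trivial majority-class rule, purely for demonstration:
def fit_majority(Xtr, ytr):
    return int(ytr.mean() >= 0.5)

def predict_majority(model, Xva):
    return np.full(len(Xva), model)

X, y = np.zeros((12, 3)), np.zeros(12, dtype=int)
rate = repeated_split_cv(X, y, fit_majority, predict_majority, n_rep=20)
```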
2. RESULTS
The analyses were performed using the statistical package R. The maximum likelihood estimates
of the coefficients for the fitted model and their P-values are shown in Table 1. The
proteomic score is significant at level 0.05 for detecting patients' disease status, while the
insignificant covariate age was retained to adjust the logit(P).
The probability cut-point of 0.26 was chosen based on the trade-off between sensitivity
and specificity. If a patient's disease probability is greater than 0.26, active disease status is
assumed. The misclassification rates for the active, inactive and overall groups are 0.13, 0.11, and
0.12, respectively. When this chosen model was applied to the data on the 15 remaining participants,
patients 7, 12, 13, 17, and 18 were identified with active disease.
3. CONCLUSIONS
There are limitations. Firstly, the chosen percentage of missingness (here 38.2%) for each protein
Finally, some factors, such as the number of clusters and the cut-point for the probability of
disease, and some specific proteins could be better targeted and ascertained with the involvement
of the principal investigator, so as to understand the underlying biological mechanism better.
Steps 1–6 of the PHD-SCORE procedure need to be run repeatedly until a satisfactory model is found.
ACKNOWLEDGEMENTS
We thank Dr. Rong Zhu for the guidance, encouragement and availability to us during the study.
We are also indebted to Drs. Eleanor Pullenayegum and Román Viveros-Aguilera for their financial
support and constructive suggestions. We also would like to acknowledge Drs. Stephen Walter,
Harry Shannon, and Lehana Thabane for their helpful comments.
BIBLIOGRAPHY
J. Farruggia, P. D. Macdonald & R. Viveros-Aguilera (1999). Classification based on logistic regression and
trees. Canadian Journal of Statistics, 28, 197–205.
R. Zhu & A. H. El-Shaarawi (2009). Model clustering and its application to water quality monitoring.
Environmetrics, 20, 190–205.
Received 31 October 2010
Accepted 6 January 2011
A multiple testing procedure to proteomic biomarker for disease status
Zhihui (Amy) LIU* and Rajat MALIK
Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1
Key words and phrases: Multiple testing procedure; imputation; classification tree; heatmap; proteomic
biomarker
1. METHODOLOGY
When hundreds of hypotheses are tested simultaneously, the chance of false positives is greatly
increased. We first removed the proteins with more than 5 missing values in the active group
or more than 11 in the inactive group. To control Type I error rates, resampling-based single-step
and stepwise multiple testing procedures (MTPs) were applied, using the Bioconductor R package
multtest (Pollard et al., 2009). Our null hypothesis was that each protein has equal mean relative
abundance in the active and inactive groups. The non-parametric bootstrap with centring and
scaling was used with 1,000 iterations. Non-standardized Welch t-statistics were implemented,
allowing for unequal variances in the two groups.
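A simplified stand-in for the resampling-based single-step maxT procedure (the authors used Bioconductor's multtest; this numpy sketch uses a centred bootstrap and Welch t-statistics, and all names are our own):

```python
import numpy as np

def welch_t(a, b):
    # Welch t-statistic for each column (protein), unequal variances.
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    va, vb = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (ma - mb) / np.sqrt(va / len(a) + vb / len(b))

def maxT_adjusted_p(a, b, n_boot=1000, rng=0):
    # Single-step maxT adjusted p-values via a centred bootstrap: each
    # group is centred so the null of equal means holds, resampled with
    # replacement, and the maximum |t| across proteins is recorded.
    rng = np.random.default_rng(rng)
    t_obs = np.abs(welch_t(a, b))
    ac, bc = a - a.mean(axis=0), b - b.mean(axis=0)
    max_t = np.empty(n_boot)
    for i in range(n_boot):
        ar = ac[rng.integers(0, len(ac), len(ac))]
        br = bc[rng.integers(0, len(bc), len(bc))]
        max_t[i] = np.abs(welch_t(ar, br)).max()
    # adjusted p-value: fraction of bootstrap maxima at least as large
    return (max_t[None, :] >= t_obs[:, None]).mean(axis=1)

rng = np.random.default_rng(1)
a = rng.normal(size=(10, 2)); a[:, 0] += 5.0   # protein 0 truly shifted
b = rng.normal(size=(10, 2))
p_adj = maxT_adjusted_p(a, b, n_boot=500, rng=2)
```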
We explored four imputation methods in the R package pcaMethods (Stacklies et al., 2007):
the nonlinear iterative partial least squares algorithm, the Bayesian principal component analysis
missing value estimator, the probabilistic principal component analysis missing value estimator,
and the singular value decomposition algorithm. A classification tree implemented in the R
package rpart (Therneau et al., 2009) was fitted using the nine most frequently rejected proteins
from the MTPs. To visualize the results, a heatmap was produced (see Figure 1).
2. RESULTS
With the random seed set to 20, the MTPs result in five rejections at the 0.05 significance
level: BPG0167, BPG0235, BPG0333, BPG0381, and BPG0447. Note that different choices
of seed yield slightly different results. Applying the random seed of 20 and 1,000 bootstrap
iterations to the imputed data, we found that the results from the four imputation methods are
similar and agree with those obtained without imputation. A heatmap was plotted (Figure 1) using the
nine most rejected proteins: BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310,
BPG0333, BPG0381, and BPG0447. Notice that it correctly classifies almost all the samples. The
classification tree predicts that patients 7, 11, 13, 18, 33, and 35 belong to the active group.
3. DISCUSSION
Multiple testing procedures are a concern because an increase in specificity is coupled with a
loss of sensitivity. Furthermore, we suspect that the proteins with the most difference in relative
abundance between the active group and inactive group are not necessarily the key players in
the relevant biological processes. These problems can only be addressed by incorporating prior
biological knowledge into our analysis, which may lead to focusing on a specific set of proteins.
Figure 1: A heatmap of nine proteins comparing the active (+) and inactive group.
4. CONCLUSION
The results from MTPs before imputation agree more or less with those after imputation.
The reason for this is unclear; it could mean either that the imputation works very well or that it
is not very helpful. There is no evidence that race, sex, or age is associated with disease status.
Proteins BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310, BPG0333, BPG0381,
and BPG0447 seem to be important in indicating the disease status. Patients 7, 11, 13, 18, 33, and
35 are predicted to be active among the 15 patients of unknown status.
ACKNOWLEDGEMENTS
We thank Dr. Peter Macdonald for his valuable advice and encouragement.
BIBLIOGRAPHY
K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor & S. Dudoit (2009). multtest: Resampling-based multiple
hypothesis testing. R package version 2.1.1.
W. Stacklies, H. Redestig & K. Wright (2007). pcaMethods: A collection of PCA methods. R package version
1.22.0.
T. M. Therneau, B. Atkinson & B. Ripley (2009). rpart: Recursive Partitioning. R package version 3.1-44.
Received 31 October 2010
Accepted 6 January 2011
The process of selecting a disease status classifier using proteomic biomarkers
Christopher MEANEY1*, Calvin JOHNSTON1 and Jenna SYKES2
1Dalla Lana School of Public Health, University of Toronto
2Department of Statistics and Actuarial Science, University of Waterloo
Key words and phrases: Statistical classifier; biomarker; support vector machine
1. METHODOLOGY
Our first challenge was to narrow down the list of available statistical classifiers. One resource at
this stage was the open-source data mining and statistical learning program Weka developed at the
University of Waikato in New Zealand (Hall et al., 2009). With a few simple clicks on its intuitive
Explorer interface, we quickly sifted through an extensive selection of supervised classification
techniques, including logistic regression, trees and forests, bagging, boosting, neural networks,
Bayes classifiers, and support vector machines. Weka has a convenient cross-validation function
which allowed us to whittle down this long, intimidating list to only the most promising methods
and allowed us to focus our efforts in a fruitful direction. We found that the Multilayer Perceptron
and Support Vector Machines (SVM) had strong empirical classification properties according to
leave-one-out cross validation. We opted to focus our energy on the SVM algorithm as it seemed
more intuitive.
Consider a Euclidean space with dimensionality equal to the number of factors; in this
case the factors are the 400+ protein relative abundance measurements. Each patient maps to a
vector in that space, positioned according to his or her particular measurements on each protein. Along with this positioning, each patient also has a disease status: active or inactive. The SVM
then finds the optimally separating hyperplane which splits the Euclidean space into two sections,
each ideally containing patients of only one disease status. When there are multiple possible
hyperplanes that achieve this completely separating objective, the SVM chooses the plane that
maximizes the distance between the hyperplane and its nearest datapoints.
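To make the geometry concrete, here is a minimal sketch of a linear SVM trained by stochastic sub-gradient descent on the regularized hinge loss (a Pegasos-style method, not the quadratic-programming solver inside e1071's svm()); the two-dimensional toy clusters stand in for the high-dimensional protein-abundance vectors and are invented:

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=200, seed=1):
    """Fit a linear SVM by stochastic sub-gradient descent on the
    regularized hinge loss; labels must be +1/-1."""
    rng = random.Random(seed)
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    idx = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Shrink w (the regularization term of the objective).
            w = [(1 - eta * lam) * wj for wj in w]
            # Points with margin < 1 incur hinge loss and pull the plane.
            if margin < 1:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Minimizing the hinge loss with an L2 penalty on w is equivalent to seeking the maximum-margin separating hyperplane: points already at margin 1 or more contribute nothing to the update, so only the nearest points shape the final plane.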
In many SVM implementations, the use of linear hyperplanes is extended with kernels by
replacing the linear vector dot product with a nonlinear kernel function, such as a polynomial
or radial basis function. A common modification is to relax the requirement that the hyperplane
divide all cases perfectly by allowing for a few penalized exceptions known as slack vectors
(Hastie et al., 2009). Since the provided data set was already of high dimensionality and was fully
separable, neither of these extensions was used.
We used the function svm() in the R (R Development Core Team, 2005) package e1071
(Wu, 2009). This function allows many options to be set including a choice of kernels and the
penalization of slack vectors (Meyer, 2009). Data were scaled within the function to prevent
proteins with large variance from dominating the classification decision.
2. RESULTS
Many protein abundance measurements were absent from the data. The function svm() requires
complete data to operate. Thus, we imputed values for the missing data. Exploratory analysis
indicated that there were a number of proteins for which fewer than 5 out of 47 measurements
were missing and a large number for which more than 30 out of 47 measurements were missing.
Furthermore, several of the proteins with more than 20% missing data had missing values for all
patients in one of the two status groups.
We decided to limit our imputation to only proteins for which fewer than 20 measurements
were missing. As a result, the observational set of interest for a given case consisted of only
146 remaining proteins. After careful and extensive visual exploration of the missing data, we
felt comfortable that the data were mostly missing at random and not systematically linked to
the true underlying values (such as higher probability of missingness for very small true values).
A review of imputation literature revealed many possible strategies for the replacement of the
missing values in the dataset. The imputation segment of the Multivariate Statistics Task View of
R/CRAN revealed eight possible libraries that were devoted to multivariate imputation. However,
the fact that our dataset consisted of a large number of variables, measured on a small number of
cases, restricted our imputation options slightly. We began by using simplistic strategies such as
mean/median imputation; however, we felt that more advanced methods would permit us to build
a more accurate classifier. We settled on using a nonparametric, k-Nearest Neighbours (k-NN)
imputation strategy, as it had been recommended by some authors from the field of proteomics and
it behaved well in leave-one-out cross-validation (Troyanskaya et al., 2001). The k-NN imputation
was performed using the impute.knn() function from the impute library in R. Ranging k from
1 to 10, we found that only one patient was misclassified for all k. The misclassified patient
was in the inactive status group, so our final empirical classification rate was 100% (0/11 cases
misclassified) and 95.2% (1/21 case misclassified) for active and inactive statuses, respectively.
Since all k behaved equally well, we arbitrarily chose k = 8 for our final classifier.
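The k-NN scheme can be sketched in a few lines of pure Python (the impute.knn() function in R is more refined, falling back to column means when a neighbourhood is too sparse; the toy matrix below is invented, with None marking a missing measurement):

```python
import math

def knn_impute(rows, k=8):
    """Impute each missing entry (None) with the mean of that column over
    the k rows nearest in Euclidean distance on jointly observed columns."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        # Normalize by the number of shared columns so sparse rows compare fairly.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            if val is None:
                # Candidate neighbours must themselves observe column j.
                cands = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
                cands.sort(key=lambda r: distance(row, r))
                neighbours = [r[j] for r in cands[:k]]
                if neighbours:
                    completed[i][j] = sum(neighbours) / len(neighbours)
    return completed
```

Because the fill-in value is borrowed from the most similar patients rather than from the overall column mean, k-NN imputation tends to preserve group structure, which is why it pairs well with a downstream classifier.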
Decision processes were heavily reliant on empirical fitting, so there was some risk of overfitting.
However, given the rigidity of the linearity requirement for our SVM implementation and
the care we took in heeding this problem, we felt the amount of overfitting was probably small.
3. CONCLUSIONS
Our final classifier was a linear Support Vector Machine with 8-Nearest-Neighbour missing value
imputation for proteins with fewer than 20% missing data. Allowing for small amounts of overfitting,
we suspect our technique would correctly classify about 90% of all patients in each disease
status category.
BIBLIOGRAPHY
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I. Witten (2009). The WEKA Data Mining
Software: An Update. SIGKDD Explorations, 11, 1–9.
T. Hastie, R. Tibshirani & J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer, New York.
D. Meyer (2009). Support Vector Machines, CRAN R Project, accessed July 15, 2009 at cran.r-
project.org/web/packages/e1071/vignettes/svmdoc.pdf.
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.r-project.org.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. Altman (2001).
Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
T. Wu (2009). Misc Functions of the Department of Statistics (e1071), CRAN R Project, accessed July 15,
2009 at cran.r-project.org/web/packages/e1071/e1071.pdf.
Table 2: Predictions of patients with missing status.
Patient   Tree       LG         Decision
1         Inactive   Inactive   Inactive
2         Inactive   Active     Active
3         Inactive   Inactive   Inactive
4         Inactive   Inactive   Inactive
5         Inactive   Inactive   Inactive
6         Inactive   Active     Active
7         Active     Active     Active
8         Inactive   Active     Active
9         Inactive   Inactive   Inactive
10        Active     Inactive   Active
11        Inactive   Inactive   Inactive
12        Inactive   Inactive   Inactive
13        Active     Active     Active
14        Active     Active     Active
15        Inactive   Inactive   Inactive
For inconsistent results, Active predictions, whether by the classification tree or by the logistic regression,
were regarded as more reliable.
and false active error rates (see Table 1). Predictions from the two models were obtained for the
status of patients with missing status (see Table 2).
3. CONCLUSIONS
Given the low false inactive error rates obtained from both models, we conclude that it is possible
to use classifiers as a pre-screening procedure for identifying active patients, particularly when
a much larger sample is available for model training. However, without knowing the error
rates of the current diagnostic method, it does not seem feasible to conclude whether the
classification methods based on a simple blood sample perform as well as the current diagnostic
method.
ACKNOWLEDGEMENTS
We would like to thank the Department of Statistics at the University of British Columbia for its
warm support of this case study.
BIBLIOGRAPHY
L. Breiman, J. H. Friedman, R. A. Olshen & C. J. Stone (1984). Classification and Regression Trees,
Wadsworth & Brooks, Pacific Grove.
Received 31 October 2010
Accepted 6 January 2011
Discussion of case study 2 analyses
Robert F. BALSHAW* and Gabriela V. COHEN FREUE
NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6
Key words and phrases: Variable selection; class prediction; imputation
1. INTRODUCTION
These analyses have been performed on data provided as a case study for the 2009 SSC case study
session organized by Alison Gibbs. A general overview of the data was provided in the Background
and Description. A comparable analysis is described in Cohen Freue et al. (2010). In many ways, the
data represent what has become a fairly standard statistical challenge in supervised learning: to
develop a classification rule for prediction of unknown class labels for 15 new samples based on
a small training set, possibly utilizing only a subset of the features.
From our experience, the principal challenge was the abundance of missing data, whose
presence may well carry information about class membership. Missing values often occur with
our analytical proteomic platform due to the challenge of protein identification as well as detection
of low abundance proteins.
We would like to thank all six teams for their efforts; all are to be commended for their insightful
analyses. The six teams will be referred to as follows:
WX: Wang and Xia.
LM: Liu and Malik.
CDK: Chang, Demetrashvili, and Kowgier.
MJS: Meaney, Johnston, and Sykes.
GCP: Guo, Chen, and Peng.
LMSS: Lu, Mann, Saab, and Stone.
2. METHODOLOGIES
The six teams used a wide variety of methodologies, which we have attempted to summarize very
briefly in Table 1. In general, the teams all took similar approaches to pre-filtering the features
based on detection rates and explored a variety of imputation methods. All teams eliminated a
subset of the proteins with lower rates of detection, though the detection rule and threshold used
varied between the teams. Essentially, two variant pre-filtering rules were used: (1) select proteins
for which the overall detection rate was above an arbitrary, pre-determined threshold: 0 and 1 by
WX; 0.5 by CDK; 0.8 by MJS; or 0.15 by LMSS; and (2) select proteins for which the within-class
detection rates were above a threshold: approximately 0.5 by LM and 0.4 by GCP. Having chosen
a set of proteins, the teams then utilized a variety of imputation methods (see Table 1) to permit
the use of those classification techniques which do not permit missing values.
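The two pre-filtering rules can be sketched as follows; this is a pure-Python illustration, with the protein names, class labels, and thresholds invented, and None marking a non-detected measurement:

```python
def filter_by_detection(data, labels, overall_min=None, within_min=None):
    """Keep proteins whose detection (non-missing) rate clears a threshold.

    data: dict mapping protein -> list of measurements (None = not detected).
    labels: class label for each sample, parallel to the measurement lists.
    overall_min applies to all samples pooled (rule 1); within_min must
    hold in every class separately (rule 2)."""
    def rate(values):
        return sum(v is not None for v in values) / len(values)

    kept = []
    classes = set(labels)
    for protein, values in data.items():
        if overall_min is not None and rate(values) < overall_min:
            continue  # rule 1: pooled detection rate too low
        if within_min is not None:
            by_class = {c: [v for v, l in zip(values, labels) if l == c]
                        for c in classes}
            if any(rate(vs) < within_min for vs in by_class.values()):
                continue  # rule 2: some class observes the protein too rarely
        kept.append(protein)
    return kept
```

The within-class variant guards against the situation noted by several teams, where a protein is detected at a reasonable overall rate but is missing for essentially all patients in one status group.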
Two teams then conducted univariate or one-protein-at-a-time tests before building a multivariate
classifier (LM; LMSS). LM selected a reduced set of candidate significant proteins
(controlling the Type I error rate), and LMSS selected a reduced set of candidate proteins that
showed differential expression in several methods (including other approaches that test multiple
Table 1: Summary of strategies and accuracy.
Team Missing values Classifier Selected model Accuracy
WX Overall detection rate
threshold: 1 and 0;
Imputation Methods:
mean and k-NN
Tree with splitting rule
(Tree) and
logistic-AIC (LG)
Ensemble of two models:
Tree based on
k-NN-imputed data and
LG based on complete
data
11/15
LM Within group detection
rate threshold: 0.5;
Imputation Methods:
NIPALS, BPCA,
Probabilistic PCA, SVD
Non-standardized Welch
t-test using FDR
followed by a
classification tree
Tree based on top-9 tested
proteins. High
concordance among
imputation methods
12/15
CDK Overall detection rate
threshold: 0.5;
Imputation Method:
EM-algorithm
RDA, DLDA, PLR, and
SVM
PLR 14/15
MJS Overall detection rate
threshold: 0.8;
Imputation Method:
median/mean and k-NN
(several k)
Extensive list of
supervised
classification
techniques available
in Weka
SVM in k-NN-imputed data
(k= 8, similar results for
other values)
13/15
GCP Within group detection
rate threshold: 0.4;Imputation Methods:
hot-deck and linear
regression imputation
Hierarchical cluster
analysis followed bystepwise logistic
regression to select
clusters. A proteomic
score, the mean
difference between
the two identified
clusters, is calculated
Logistic regression with the
calculated proteomicscore and age as
covariates
11/15
LMSS Overall detection rate
threshold: 0.15;
Imputation Methods:k-NN, LLS, SVD,
BPCA
Classical t-test,
Wilcoxon signed rank
test, LASSO, LARS,and SLR
Model 1: LARS algorithm
on the seven most
frequent proteins from allmethods explored on
SVD imputed data.
Model 2: logistic
regression based on three
proteins selected by a
stepwise analysis on the
seven most frequent
proteins and gender
Model 1:
10/15;
Model 2:12/15
features simultaneously). Though with different emphasis and in combination with other methods,
almost all teams utilized a logistic regression approach (WX, CDK, MJS, GCP, LMSS)
variables were related with the disease condition; GCP built a logistic regression classification
model that selected age from all other covariates and the proteomic score; and LMSS found only
gender to be statistically significant using a chi-square test, though it was not selected in their
final logistic regression model.
The predictive performance of each team's final classifier was tested on 15 samples whose
class labels were kept hidden until after each team's results were provided to the organizers. The
accuracy of each team's final model appears in the last column of Table 1. Though this permits
a quantitative comparison between the teams, we will resist the temptation to over-interpret the
relative predictive performance of the teams and their methods. Instead, we would like to once
again thank all the teams for their hard work, insight, and thoughtful questions.
BIBLIOGRAPHY
G. V. Cohen Freue, M. Sasaki, A. Meredith, O. P. Gunther, A. Bergman, M. Takhar, A. Mui, R. F. Balshaw,
R. T. Ng, N. Opushneva, Z. Hollander, G. Li, C. H. Borchers, J. Wilson-McManus, B. M.
McManus, P. A. Keown, W. R. McMaster, and the Genome Canada Biomarkers in Transplantation
Group (2010). Proteomic signatures in plasma during early acute renal allograft rejection. Molecular
& Cellular Proteomics, 9, 1954–1967.
Received 31 October 2010
Accepted 6 January 2011