7/31/2019 Case Studies Eric Cormier
1/38
The Canadian Journal of Statistics
Vol. 39, No. 2, 2011, Pages 181–217
La revue canadienne de statistique
Case studies in data analysis
Alison L. GIBBS1*, Kevin J. KEEN2 and Liqun WANG3
1 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
2 Department of Mathematics and Statistics, University of Northern British Columbia, Prince George, BC, Canada V2N 4Z9
3 Department of Statistics, University of Manitoba, Winnipeg, MB, Canada R3T 2N2
The following short papers are summaries of student contributions to the Case Studies in Data Analysis from the Statistical Society of Canada 2009 annual meeting. Case studies have been an
important part of the SSC annual meeting for many years, providing the opportunity for students
to delve into interesting problems and data sets and to present their findings at the meeting. Since
2008, prizes have been awarded for the best poster presentations for each of two case studies. The
case studies at the 2009 annual meeting and the selection of this suite of papers were organized
by Gibbs and Keen.
This section consists of two groups of papers corresponding to the two case studies. Each subsection starts with an introduction given by the data donors, which is followed by the winning paper and the contributed papers. The subsection ends with a discussion and summary by the data donors.

The theme of case study 1 is the identification of relevant factors for the growth of lodgepole
pine trees. First, Dean, Gibbs, and Parish provide an introduction to the data and the problems
of scientific interest. The winning paper's authors, Cormier and Sun, first use a nonparametric smoothing technique to identify a nonlinear relationship between the growth rate and the age of the trees.
They then use a mixed model to explain the growth rate in terms of age and other environmental
factors. In the second paper, Salamh first estimates a similar mixed model and then supplements
the analysis using a dynamic model.
The theme of case study 2 is the classification of disease status through proteomic biomarkers.
Balshaw and Cohen-Freue introduce the data and problems of interest. The winning paper is
authored by Lu, Mann, Saab, and Stone, who first explore various data imputation techniques, including k-nearest neighbours, local least squares, and singular value decomposition. They then apply several variable selection methods such as the LASSO, least angle regression (LARS), and sparse logistic regression. This paper is accompanied by four contributed papers which use
various modern classification techniques. Guo, Chen, and Peng use a score procedure to classify
the disease status. Liu and Malik employ a multiple testing procedure. Meaney, Johnston and
Sykes apply support vector machines (SVM). Wang and Xia use classification trees and logistic
regression techniques. A summary and comparison of these methods and outcomes are given by
Balshaw and Cohen-Freue.
We are grateful to Charmaine Dean of Simon Fraser University, Roberta Parish of the British
Columbia Ministry of Forests and Range, and Rob Balshaw and Gabriela Cohen-Freue of the
NCE CECR PROOF Centre of Excellence for the use of their data and their contributions to this
suite of papers. We also thank the former and current Editors of the CJS, Paul Gustafson and
Jiahua Chen, for agreeing to publish these papers and for their patience and support during the
editorial process.
Received 31 October 2010
Accepted 6 January 2011
Case study 1: The effects of climate on the growth of lodgepole pine
C.B. DEAN1*, Alison L. GIBBS2 and Roberta PARISH3
1 Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
2 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
3 British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8
Key words and phrases: Climate effects; tree growth; biomass; lodgepole pine.
1. BACKGROUND
To compete successfully in the world economy, the commercial forestry industry requires an
understanding of how changes in climate influence the growth of trees. The goal of this case study
was to examine how well-known climate variables, combined with estimated crown biomass, can
predict wood accumulation in lodgepole pine. In order to model the growth and yield of trees over
time, we need to determine how much wood a tree accumulates each year. Each year, a tree lays
down an annual ring of wood in a layer under the bark. Pressler's hypothesis states that the area of wood laid down annually, measured by the cross-sectional area increment, increases linearly from the top of the tree to the base of the crown (the location of the lowest live branches) and is based
on the assumption that area increment in the crown increases with the amount of foliage above
the point of interest. Below the crown, the area increment in any given year remains constant
down the bole until the region of butt swell at the base of most trees.
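Pressler's hypothesis can be written compactly. The following formalization is a sketch; the symbols \(\Delta A\), \(F\), \(h_c\), and \(c\) are ours, not from the source:

```latex
% \Delta A(h): annual cross-sectional area increment at height h
% F(h): foliar mass above height h;  h_c: height of the crown base
\Delta A(h) = c\, F(h) \quad \text{within the crown,}
\qquad
\Delta A(h) = \Delta A(h_c) \quad \text{for } h \le h_c .
```

If foliage is spread roughly evenly over the crown, \(F(h)\) grows linearly with distance from the tree top, which gives the linear increase from top to crown base described above, and the second relation gives the constant increment below the crown.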
The growth of a tree in any given year is strongly influenced by growth in the previous years.
One reason for this is that buds are formed the year before they start to grow and carbohydrates
from good years can be stored to fuel growth in subsequent years. The effects of previous growing
conditions can last from 1 to 3 years, depending on tree species and location. Climate affects
growth and influences both the size of the annual ring of wood and the proportions of early
and late wood. Low density early wood is laid down during the spring when water is plentiful.
Late wood, which is laid down from mid-summer until growth ceases in the fall, has a high
density. Cessation of wood formation is sensitive to weather conditions such as temperature and
drought.
Lodgepole pine (Pinus contorta Doug. ex Loud.) stands dominate much of western Canada and
the United States, covering over 26 million hectares of forest land. It is an important commercial
species in British Columbia; stands consisting of more than 50% lodgepole pine occupy 58% of the
forests in the interior of the province. Lodgepole pine is primarily used for lumber, poles, railroad
ties, posts, furniture, cabinetry, and construction timbers. It is commercially important to be able
to predict how lodgepole pine will grow and accumulate wood over time. Using high resolution
satellite images of lodgepole pine stands to predict wood attributes is under consideration, but
first the relationship of crown properties such as the amount of foliage must be linked to wood
properties and growth.
2. THE DATA
Data on the annual growth and wood density of 60 lodgepole pine trees from four sites in two
geographic areas in central British Columbia were provided for this investigation. Samples were
removed at 10–13 locations along each tree, and two radii (A and B) per sample disc were measured. Measurements of the last year of growth and wood density are often unreliable because
of proximity to the bark and difficulties of sample preparation. However, it is for this ring only
that we have measures of the amount of foliage. Several growth outcomes are available, including the widths of the A and B radii, in millimetres, percentage of late and early wood, and
early and late wood densities, in kg/m3. Foliar biomass measurements are provided for multiple branches; estimates are available for each annual whorl. The data on biomass include the
average relative position of the branch in the crown (1 is the base of the crown and 0 is the
top) and corresponding foliar biomass (the mass, in kg/m2, of needles subtended by the branches
at that position). Other variables such as the total height of the tree, in metres, as well as the
height to the base of the crown, in metres, are also provided. Climate data from Environment
Canada arise from the two nearest stations with long-term records, Kamloops and Quesnel. For
each of these locations, monthly and annual data are provided on: (1) the minimum temperature, in degrees Celsius, (2) the maximum temperature, in degrees Celsius, and (3) the total
precipitation, in millimetres. Additional details on the data and variables are provided at the site
www.ssc.ca/en/education/archived-case-studies/ssc-case-studies-2009.
3. OBJECTIVES
The primary objective of this case study was to determine to what extent climate, position on the
tree bole (trunk), and current foliar biomass explain cross-sectional area increment and proportion
of early and late wood.
Other questions of interest included:
How have temperature and precipitation affected the annual cross-sectional growth and the
proportions of early and late wood in lodgepole pine?
Is annual growth best explained by average annual temperature or do monthly maximum and/or
minimum values provide a better explanation? Do early and late wood need to be considered
separately?
Does the use of climate variables to predict the growth and proportions of early and late wood
provide more reliable estimates than the use of the growth and density measurements from
previous years as measured from the interior rings?
Received 31 October 2010
Accepted 6 January 2011
The determination of the relevant explanatory variables for the growth of lodgepole pine using mixed models
Eric CORMIER* and Zheng SUN
Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 2Y2
Key words and phrases: Climate effects; biomass; mixed models; tree growth
Abstract: In this paper a mixed model with nested random effects was used to model the cross-sectional
area increment of lodgepole pine based on relevant explanatory variables. This model was used to show
that minimum monthly temperature, monthly precipitation, and foliar biomass are positively related to the
cross-sectional area increment, while an ordinal variable approximating lower trunk position and maximum
monthly temperature are negatively related. It was shown that annual growth is better explained by monthly
maximum and minimum temperatures than by average annual temperature and that the use of climate
variables provided more reliable estimates for growth prediction than the use of the growth and density
measurements from previous years. The Canadian Journal of Statistics 39: 185189; 2011 2011 Statistical
Society of Canada
1. INTRODUCTION
In this analysis, we addressed the following questions: (1) To what extent do climate, position on the tree bole, and current foliar biomass explain cross-sectional area increment? (2) How have temperature and precipitation affected the annual cross-sectional growth? (3) Is annual growth best explained by average annual temperature, or do monthly maximum and/or minimum values provide a better explanation? (4) Does the use of climate variables to predict the growth and proportions of early and late wood provide more reliable estimates than the use of the growth and density measurements from previous years as measured from the interior rings? We used mixed models to
model the relationships between the explanatory variables: climate, position on the tree bole, and
current foliar biomass, and the responses: cross-sectional area increment and the proportion of late
wood of lodgepole pine. There were four features of the data that complicated the analyses: (1)
Climate variables for each year were available and annual growth measurements were collected
from tree samples, so we expected the data to exhibit autocorrelation. The correlation structure was
accommodated by the use of random effects. (2) For each disc, there were measurements from two
separate radii. Radius was treated as a nested random effect. It could have been assumed that the measurements along the two radii were two observations of the same variable, so that an average could be taken; but due to the asymmetry of the tree radii, an average is not a good estimate of the
variable. Alternatively, the measurements along the two radii could be used individually but there
would be very high correlations between them. To allow both sets of measurements to be used
and include the correlations between radii, a nested random effect was used (Pinheiro & Bates,
2000, p. 40). (3) The ages of the trees varied resulting in a different number of observations for
each tree. This complication corresponds to drop-out in a longitudinal study. The reason for the resulting missing data was not believed to be informative, because it depended only on the age of the tree and not on a growth factor. Therefore this situation was modelled assuming the data on missing years of a tree's life were missing at random, and bias was not taken into account.
(4) Destructive sampling meant that foliar biomass was collected at only one point in time; an
inverse regression was conducted to determine the foliar biomass in other years.
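As a rough illustration of why a nested tree/disc random-effects structure captures the correlation between the two radii, the following simulation (all variance components invented) generates data in which two radii on the same disc share both a tree effect and a disc-within-tree effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n_trees, n_discs, n_years = 30, 5, 20
sd_tree, sd_disc, sd_noise = 1.0, 0.5, 0.3   # invented variance components

# y[tree, disc, radius, year] = tree effect + disc-within-tree effect + noise;
# the two radii on a disc share both effects, so they are the most correlated.
tree_eff = rng.normal(0.0, sd_tree, size=(n_trees, 1, 1, 1))
disc_eff = rng.normal(0.0, sd_disc, size=(n_trees, n_discs, 1, 1))
noise = rng.normal(0.0, sd_noise, size=(n_trees, n_discs, 2, n_years))
y = tree_eff + disc_eff + noise

# correlation between the two radii of the same disc
r_same_disc = np.corrcoef(y[:, :, 0, :].ravel(), y[:, :, 1, :].ravel())[0, 1]
# correlation between radii taken from different discs of the same tree
r_diff_disc = np.corrcoef(y[:, 0, 0, :].ravel(), y[:, 1, 0, :].ravel())[0, 1]

print(r_same_disc > r_diff_disc > 0)  # same-disc radii are the most correlated
```

Averaging the two radii would discard this structure; the nested random effect keeps both series while accounting for their shared components.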
In addition to the climate variables, the growth and density measurements from previous years
were used to predict the growth of early and late wood. This prediction was done using an ARIMA
model and the reliability of these estimates was examined.
2. METHODOLOGY
Non-parametric regressions of late wood percentage and cross-sectional area increment against
trunk position, foliar biomass, and annual maximum temperature were fit to determine the general
trend in the response. We assumed that the nth measurement from the ground would correspond
across trees regardless of the height of the trees, so the ordinal variable position was used. The plot
of cross-sectional area increment versus trunk position in Figure 1 shows that from trunk positions 1–4, there is a negative relationship between trunk position and cross-sectional increment; but after position 5, there is a positive relationship between trunk position and cross-sectional increment. This could be due to the fact that position 4 or 5 corresponds with the start of the crown. The
plot of cross-sectional area increment versus biomass and the plot of percentage of late wood ver-
sus biomass show that high values of biomass correspond to high cross-sectional area increment and low late wood percentage, respectively. The plot of cross-sectional area increment versus
annual maximum temperature and the plot of percentage of late wood versus annual maximum
temperature show that the relationship between annual maximum temperature and cross-sectional increment, and the relationship between annual maximum temperature and late wood percentage, are different for each location (Kamloops and Quesnel). This suggests including two-factor interactions between location
and annual maximum temperature. Similar interaction plots, which are not included in this paper,
suggested including the interactions between the following pairs of variables: location and annual
minimum temperature, location and precipitation, age of the tree and foliar biomass, and location
and foliar biomass. We also included two-factor interactions among climate variables.
One of the questions of interest was how the position on the tree bole affects the cross-sectional area increment. When examining the effect of position, it was necessary to account for the fact that trees had a large range in their absolute height. Measurements taken at 10 m from the ground on a tree that was 11 m tall have a different relative position on the tree trunk than the same measurement on a tree that was 40 m tall. To account for this, an ordinal variable, position, was
used to represent relative height.
To determine whether monthly maximum and minimum temperature or average annual temperature best explain cross-sectional growth, the mixed model was fitted separately with monthly
measurements and with average annual measurements and model goodness-of-fit criteria were
compared. The variables that resulted in a better fit were used in the analysis. To model the correlation structure, a random intercept and a random slope for each tree were adopted. To model the nested effect presented by having two radii measured on each tree disc, a nested random effect was introduced.
Figure 2: Mean cross-sectional area increment over age of the tree.
Table 1: Climate effects on cross-sectional increment.
Month
J F M A M J J A S O N D
Maximum temperature * * * * *
Minimum temperature + + + + + + + + + + + +
Precipitation + + + + + + + + + + + +
Significant positive effect (+), significant negative effect (−), not significant (*) at the 5% level.
3. RESULTS
To improve model adequacy, a Box–Cox transformation was performed with \(\lambda = 0.25\). The function \(g\) was determined to be cubic through examination of Figure 2, that is, \(g(t_{ihk}) = \beta_{40} t_{ihk} + \beta_{41} t_{ihk}^{2} + \beta_{42} t_{ihk}^{3}\). Trunk position in the crown and the amount of foliar biomass have positive relationships, and age of the tree has a negative relationship, with the cross-sectional area increment. The effects of climate on cross-sectional increment are presented in Table 1.
Results from the fitted models were quite consistent with patterns observed in Figure 1, except that the interaction effect of annual maximum temperature and location was not significant.
The estimated standard errors for the random effects were \(\hat{\sigma}_1 = 0.5394\), \(\hat{\sigma}_2 = 0.0225\), and \(\hat{\sigma}_3 = 0.0284\). The nested effect for radii was not significant, implying that the measurements from the
two different radii did not significantly differ.
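The Box–Cox transformation with exponent 0.25 used above is straightforward to apply; a minimal sketch on simulated right-skewed data (the lognormal sample is invented, standing in for the area increments):

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform; lam = 0 corresponds to the log."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def skewness(x):
    x = x - x.mean()
    return (x ** 3).mean() / (x ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
y = rng.lognormal(mean=2.0, sigma=0.8, size=5000)  # right-skewed, like growth data
z = box_cox(y, 0.25)

# the transform pulls in the right tail, so skewness drops
print(skewness(y) > skewness(z))
```

The quarter-power transform sits between the log (\(\lambda = 0\)) and the square root (\(\lambda = 0.5\)), compressing large values to make the residuals closer to normal.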
It was determined that annual growth was better explained using monthly maximum and minimum temperature values than average annual values, because both the AIC (66,135 vs. 67,750) and the BIC (66,781 vs. 68,050) were smaller when monthly measurements were included in the model. However, although we used monthly climate variables as main effects, interactions were modelled using annual climate variables to reduce the number of interaction terms in the model.
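The AIC/BIC comparison penalizes a model's log-likelihood by its parameter count; a small sketch with hypothetical values (the log-likelihoods and parameter counts below are invented, chosen only to mirror the reported magnitudes):

```python
import math

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion: k log n - 2 log L (smaller is better)."""
    return k * math.log(n) - 2 * loglik

# Hypothetical values: the monthly-climate model spends more parameters (k)
# but gains enough log-likelihood that both criteria still prefer it.
n = 10_000
monthly = dict(loglik=-33_000.0, k=60)
annual = dict(loglik=-33_860.0, k=12)

print(aic(**monthly) < aic(**annual))            # True
print(bic(n=n, **monthly) < bic(n=n, **annual))  # True
```

BIC's \(k \log n\) penalty grows with the sample size, so a richer model must buy proportionally more likelihood to win under BIC than under AIC.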
Examination of the residual plot showed heavy tails and skewness. This indicates that the
residuals deviate from normality and that modelling them with a skew-elliptical distribution would be more appropriate (Jara, Quintana & San Martín, 2008).
To model the dependence of the growth of early and late wood on measurements from previous years, an ARIMA model was fitted: autoregressive, integrated of order three, with a first-order moving average term. This model was used to predict future growth and to determine prediction intervals (Figure 3).

Figure 3: Five-step-ahead predictions of early and late wood growth.
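The integrated component of an ARIMA model differences the series before the ARMA part is fitted; a minimal numpy sketch (the cubic trend is invented) shows that differencing three times reduces a cubic trend to a constant:

```python
import numpy as np

# An ARIMA(p, 3, q) model differences the series three times before an
# ARMA(p, q) model is fitted to what remains.
t = np.arange(50, dtype=float)
trend = 0.002 * t ** 3 - 0.1 * t ** 2 + t + 5.0  # invented cubic growth trend

d3 = np.diff(trend, n=3)
# the third difference of a*t^3 + ... at unit spacing is the constant 6a
print(np.allclose(d3, 6 * 0.002))
```

This is why heavy differencing widens prediction intervals: forecasts must be cumulated back through three integration steps, compounding the forecast variance at each step.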
4. CONCLUSIONS
Data on the cross-sectional area increment and proportion of late wood for tree rings at multiple
years, heights and geographical regions were analyzed, with the addition of climate data and
measurements of foliar biomass for each year of the trees' lives. The use of mixed models allowed all of these covariates to be included in each model and their relationships to be modelled using all available data. However, the model is not sufficient to conclude cause-and-effect relationships
between the variables and the growth of lodgepole pine. Since the monthly climate factors were
correlated, the problem of multi-collinearity was present. Therefore, the model requires careful
interpretation of the coefficients. In response to questions (1) and (2) in Section 1, the results show
that minimum monthly temperature, monthly precipitation and foliar biomass were positively
related to the cross-sectional area increment while lower trunk position and maximum monthly
temperature were negatively related. Examination of question (3) showed that annual growth
was better explained by monthly maximum and minimum temperatures than by average annual
temperature. Due to the wide prediction intervals from the time series analysis it was believed that
the use of climate variables provided more reliable estimates for growth prediction (question (4)). Possible future work that could be carried out to improve the model includes the use of skew-elliptical distributions to model the residuals, to account for both skewness and heavy tails in the error terms, and the incorporation of splines to accommodate the temporal trend in the observations.
ACKNOWLEDGEMENTS
Many thanks to Farouk Nathoo for the very helpful correspondence.
BIBLIOGRAPHY
A. Jara, F. Quintana & E. San Martín (2008). Linear mixed models with skew-elliptical distributions: a Bayesian approach. Computational Statistics and Data Analysis, 52, 5033–5045.
J. Pinheiro & D. Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer-Verlag, New York.
Determinants of lodgepole pine growth: Static and dynamic panel data models
Mustafa SALAMH*
Department of Statistics, University of Manitoba, Winnipeg, MB, Canada R3T 2M2
Key words and phrases: linear mixed model; nested error components; autoregressive panel models; tree growth; climate effect; Pressler's hypothesis; random coefficients regression; two-stage least squares
1. INTRODUCTION
This study was concerned with modelling the wood properties and growth over time of lodgepole pine in British Columbia. The primary objective was to determine to what extent climate, position on the tree trunk, and current foliar biomass explain cross-sectional area increment
and proportion of early and late wood. The study also addressed other questions, such as whether
growth is best explained by average annual temperature or monthly temperature extremes, and whether the use of climate variables to predict the growth and wood properties provides more reliable estimates than the use of the growth and density measurements from previous years.
2. METHODOLOGY
Pressler's hypothesis states that the annual increment in cross-sectional area of wood increases linearly from the top of the tree to the base of the crown and is proportional to the amount of foliage above the point of the increment. Since tree and crown heights and foliar biomass were only
available for the year in which the tree was cut down, they had to be estimated for other years
of the tree's life. This was done using loess regression based on the tree's current height, crown length, and the heights of the disks. In order to answer the primary question about the effects of
climate, disk position on the tree bole, and foliar biomass on the cross-sectional area increment
and percentage of late wood, a linear mixed effects model was used to account for the two-level
grouping structure of the data. The model was formulated according to Pressler's hypothesis, without ignoring the possible random variability due to disks below the crown. Other nuisance factors such as age from pith, tree height, and geographic location were included in the model to control for their effects. The climate model takes the form
\[
\begin{aligned}
y_{ijt} = {}& f_1(\text{disk age}_{ijt}, \text{tree height}_{it}, \text{site}_i) + f_2(\text{clim}_{t,t-1,t-2}) + \alpha\, D_{ijt} \\
& + \beta\, \text{topdistance}_{ijt} + \gamma\, \text{topmass}_{ijt} + a_i + b_i D_{ijt} + c_i\, \text{topdistance}_{ijt} + d_i\, \text{topmass}_{ijt} \\
& + u_{ij}(1 - D_{ijt}) + \varepsilon_{ijt}, \qquad i = 1, \ldots, I, \; j = 1, \ldots, J_i, \; t = 1, \ldots, T_{ij}, \qquad (1)
\end{aligned}
\]
where the response \(y_{ijt}\) is either the square root of the area increment or the log of the late-to-early wood ratio for disk \(j\) within tree \(i\) at year \(t\); \(f_1\) and \(f_2\) are linear functions, and clim represents a vector of climate variables (temperature and precipitation). The variable \(D\) is an indicator for being above the crown, topdistance is the product of \(D\) and the distance from the tree top to the disk, and topmass
is the product of \(D\) and the gross amount of foliage above the disk. The standard distributional assumptions were made for the random effects and residual error, namely \((a_i, b_i, c_i, d_i, u_{i1}, \ldots, u_{iJ_i})\) iid \(N\big(0, \operatorname{diag}(\sigma_a^2, \sigma_b^2, \sigma_c^2, \sigma_d^2, \sigma_u^2 I_{J_i})\big)\) and \(\varepsilon_{ijt} \sim N(0, \sigma_\varepsilon^2)\), with the \(\varepsilon_{ijt}\) following an ARMA(p, q) process within each disk, independent of the random effects. Two lags of the climate variables were included, since the effects
of previous growing conditions can last from 1 to 3 years. I focused on the spring, summer, and fall climate variables because early wood is laid down during the spring and late wood is
laid down from mid-summer to fall. Several sub-models were fit using the R-package nlme.
Diagnostic graphs were produced to ensure the adequacy of the models and to check the validity
of assumptions. In almost all sub-models the residual error had ARMA(2,1) structure, which is
consistent with Monserud (1986). Likelihood ratio and Wald tests were then performed to check the significance of the effects.
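The loess step described above, used to reconstruct heights and biomass for earlier years, can be sketched as a locally weighted linear fit; this minimal version (data, bandwidth choice, and growth curve all invented) evaluates the smooth at a single point:

```python
import numpy as np

def loess_point(x0, x, y, frac=0.5):
    """Fitted value at x0 from a locally weighted linear fit (minimal loess)."""
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]                           # bandwidth: k-th nearest distance
    w = np.clip(1.0 - (d / h) ** 3, 0.0, 1.0) ** 3  # tricube weights
    sw = np.sqrt(w)
    X = np.column_stack([np.ones(n), x - x0])
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]                                  # intercept = fit at x0

# invented example: smooth height-versus-age curve, interpolated at age 40
age = np.arange(5.0, 85.0, 5.0)
height = 30.0 * (1.0 - np.exp(-0.03 * age))
est = loess_point(40.0, age, height)
print(abs(est - height[age == 40.0][0]) < 1.0)      # close to the true curve
```

A full loess implementation iterates this fit over a grid of target points and optionally reweights for robustness; the single-point version above is enough to show the mechanism.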
To answer whether annual growth is best explained by average annual temperature or monthly
temperature extremes, I used the structure of Equation (1) for each set of temperature variables.
Since neither of the two models is nested in the other, I applied the idea of the J-test (Gujarati,
2003, p. 533) to determine which model is preferred. To consider whether climate variables or
growth and density variables from previous years better predict growth, I used a nested error
component model with autoregressive dynamics and other explanatory variables to predict the
annual growth. The proposed model equation is free of the climate variables. It is given by
\[
y_{ijt} = \phi_1 y_{ij,t-1} + \phi_2 y_{ij,t-2} + \beta_1 x_{ij,t-1} + \beta_2 x_{ij,t-2} + \gamma_1 z_{ij,t-1} + \gamma_2 z_{ij,t-2} + \mu_{ij} + \varepsilon_{ijt}, \qquad (2)
\]
where \(y_{ijt}\) is defined as in Equation (1), and \(x\), \(z\) are the densities of early and late wood, respectively.
The heterogeneity due to the trees and disks within trees is represented by the error component \(\mu_{ij}\). The model is semiparametric: the residual error \(\varepsilon_{ijt}\) satisfies the moment condition \(E(\varepsilon_{ijt} \mid \mu_{ij}, y_{ij,t-1}, x_{ij,t-1}, z_{ij,t-1}, y_{ij,t-2}, x_{ij,t-2}, z_{ij,t-2}, \ldots) = 0\). The model was fit using two-stage least squares on the first differences within disks. It was compared to the model of Equation (1) with regard to out-of-sample prediction power. A test sample of size 2,718, taken across almost all the disks, was set aside, and both models were fit using the remaining
data. The fitted models were compared according to their mean squared error of prediction in the
test sample. The MSE for the climate models was one-quarter that of the dynamic model.
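The held-out comparison can be sketched with a toy simulation (all data invented; a noise covariate stands in for the competing predictor set): fit each candidate model on the training portion and score it by mean squared error on the withheld sample.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3000

clim = rng.normal(size=n)                 # invented "climate" covariate
y = 1.5 * clim + rng.normal(scale=0.5, size=n)

# withhold a test sample, fit on the rest
mask = np.zeros(n, dtype=bool)
mask[rng.choice(n, size=600, replace=False)] = True

def fit_predict(x, y, mask):
    """OLS fit on the training part, predictions on the withheld part."""
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X[~mask], y[~mask], rcond=None)
    return X[mask] @ beta

pred_clim = fit_predict(clim, y, mask)                 # model with the real driver
pred_noise = fit_predict(rng.normal(size=n), y, mask)  # uninformative stand-in

mse_clim = np.mean((y[mask] - pred_clim) ** 2)
mse_noise = np.mean((y[mask] - pred_noise) ** 2)
print(mse_clim < mse_noise)  # the informative model wins out of sample
```

Scoring on withheld data rather than on the training sample guards against a flexible model winning merely by overfitting, which is the point of the comparison reported above.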
3. CONCLUSION
Regarding how the temperature and precipitation affect the annual cross-sectional growth and the
proportions of early and late wood, the climate models showed high contributions of the current
and lagged values of rain and temperature in explaining both of the target dependent variables. For example, annual growth was positively affected by rain (especially in spring) and negatively
affected by extreme levels of temperatures. The proportion of early wood was positively affected
by rain throughout the year and by higher temperatures in spring; however, the proportion of late wood was negatively affected by higher temperatures in mid-summer. It is recommended that
proportions of early and late wood be considered separately to allow a clearer view of how they
were individually affected by the climate.
Regarding the extent to which the position on the tree bole and current foliar biomass explain the annual cross-sectional growth and the proportions of early and late wood, it was found that the higher the disk was in the crown, the smaller the annual wood increment and the higher the ratio of early wood. The foliar biomass had a positive linear effect on the annual wood increment within the crown, consistent with Pressler's hypothesis. However, it should be mentioned that the proportionality parameter is highly variable from one tree to another.
It was also found that annual growth was better explained by the monthly extremes of temperature than by the average annual temperature. The incremental contribution of the annual temperatures over the monthly temperatures was not significant, but monthly temperatures were significant additions to the annual climate model.
Regarding whether the use of climate variables to predict the two dependent variables provides more reliable estimates than the use of the growth and density measurements from previous years, it was clear that the climate models provide more reliable predictions than the autoregressive dynamic model. This emphasizes the importance of climate, disk position, and foliar biomass in prediction as well as in explanation.
ACKNOWLEDGEMENTS
I am grateful to Dr. Liqun Wang for encouraging me to carry out this research project and for
financial support through his research grants from NSERC and the National Institute for Complex
Data Structures.
BIBLIOGRAPHY
D. N. Gujarati (2003). Basic Econometrics, 4th ed., McGraw-Hill, New York.
R. A. Monserud (1986). Time-series analyses of tree-ring chronologies. Forest Science, 32(2), 349–372.
Received 31 October 2010
Accepted 6 January 2011
Discussion of case study 1 analyses
C. B. DEAN1*, Alison L. GIBBS2 and Roberta PARISH3
1 Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
2 Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3
3 British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8
Key words and phrases: Nonlinear mixed models; hierarchical random effects; transformations; prediction
intervals; autoregressive integrated moving average model; measurement error; model diagnostics
1. INTRODUCTION
The two analyses incorporated similar features, but with different model formulations and variables; even so, they yielded similar major conclusions. Both Salamh and Cormier & Sun considered whether and to what extent climate, position on the tree bole, and foliar biomass explain cross-sectional area increment and proportion of early and late wood. They both also specifically
investigated the effects of temperature and precipitation, and whether average annual temperature or monthly maximum and/or minimum values better explain variability in area increment.
Additionally, Salamh considered whether climate variables explain the variability in growth
and proportions of early and late wood better than measurements from previous years of these
variables.
2. THE MODELS
Both Salamh and Cormier & Sun utilized nonlinear mixed effects models with hierarchical random effects. Transformations of the responses were considered, including the square root of area
increment and the logarithm of the ratio of early to late wood. Lags of climate variables were
included as explanatory variables in Salamh, but not in Cormier & Sun. Salamh utilized a conceptually based approach, modelling the growth as increasing linearly from the top of the tree to the base of the crown, with tree-to-tree variability in this linear functional form. Cormier &
Sun incorporated a variable labelled position (taking values 1, 2, 3, . . . ) which is meant to reflect
the height of measurement of the area increment from the base of the tree. Note that the heights at which measurements are taken are not multiples of a specific value, so when position = 2, the height from the base of the tree is not twice that when position = 1. As well, the heights at which measurements are taken are not the same from tree to tree. So the modelling of the transformed response variable as a linear function of the variable position, as considered in Cormier & Sun, is somewhat of an approximation to including a linear function of the height at which measurements are taken in the model. Future models should incorporate a more accurate measure
of relative position by using tree heights and height to sample. Both Salamh and Cormier & Sun
include (estimates of) biomass in the model, with the transformed response changing linearly with
biomass increases. Salamh incorporated autoregressive errors, while Cormier & Sun incorporated
interaction terms. Site-specific intercepts were included in Salamh, while Cormier & Sun omitted
site effects. Cormier & Sun also handled the modelling of early and late wood separately, looking
at the annual averages of early and late wood over all trees and investigating how these are related
to lagged effects.
Case study 2: Proteomic biomarkers for disease status
Robert F. BALSHAW* and Gabriela V. COHEN FREUE
NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6
Key words and phrases: Variable selection; class prediction; imputation
1. BACKGROUND
Renal transplantation saves the lives of hundreds of Canadians every year. However, every
transplant recipient must be monitored closely for signs of acute rejection, which is the body's
immunologic and inflammatory response to the presence of a foreign organ. If not properly treated,
acute rejection can lead to loss of the transplanted organ, dialysis or even death. Unfortunately,
acute rejection can only be detected by biopsy, a distressing, uncomfortable and expensive surgical
procedure that can be required multiple times during the first year post-transplant.
The Biomarkers in Transplantation project was funded by Genome Canada to identify non-
invasive biomarkers for the detection and prediction of acute rejection based on proteomic analysis
of peripheral blood samples. A clinical test based on such a biomarker could lead to a better
method for monitoring transplant recipients, reducing costs, improving treatment outcomes, and
substantially improving recipients' quality of life.
Measures of protein abundance data were collected from a selection of kidney transplant
recipients who were known at the time of the blood sampling to be either experiencing treatable
acute rejection (AR) or not experiencing acute rejection (NR). Each sample was drawn from
an independent subject within the first 3 months post-transplant. For each AR sample, two NR
samples were selected at approximately the same time point post-transplant.
The goal of this case study was to utilize these proteomic data to create a classifier for acute
rejection, which could then be evaluated on a test set of 15 samples.
2. THE DATA
At the time of the Case Study Competition for the 2009 SSC meeting, potential intellectual property
considerations meant that the true nature of the dataset had to be hidden from the participants. For example, AR and NR status were referred to as the patients being in an active or inactive state of
disease. The data were also supplemented with synthetic sample data, constructed to mimic the
disease. The data were also supplemented with synthetic sample data, constructed to mimic the
observed characteristics of the AR and NR samples, both to enrich the size of the dataset and to
further protect intellectual property.
The dataset included 11 samples from AR patients, 21 samples from NR patients, plus an
additional 15 samples whose classification was hidden. All experimental samples were taken
from independent patients at the time when acute rejection was suspected or at a corresponding
matched time-point for non-rejection samples. Historically, approximately 10% of renal transplant
recipients experience rejection during the first few months post-transplant; however, the study
design was to select approximately 2 NR samples for every AR sample.
A multiplex proteomic technology, called iTRAQ, was used to measure the relative protein
abundances of the experimental samples relative to the quantity of the corresponding protein in a
reference sample. The reference samples were taken from a homogeneous batch of blood pooled
from 16 healthy volunteers.
Plasma was obtained from each whole blood sample through centrifugation. To enhance
detection sensitivity, the plasma samples were first depleted of the 14 most abundant proteins.
Trypsin was used to digest the proteins in the depleted samples, and the resulting peptides were
labelled with one of four distinct iTRAQ reagents (i.e., chemical tags with unique molecular
weights but otherwise identical chemical properties). The labelled samples were then pooled and
processed using a MALDI TOF/TOF technology.
Each iTRAQ run was designed with three experimental samples and one reference sample.
Peptide identification and quantitation was carried out by ProteinPilot Software v2.0 and the data
were assembled into a comprehensive summary of the relative abundance of the proteins in the
experimental samples. As the same reference sample was used in all runs, these relative abundance
measures were comparable across experimental runs.
Each run of the experiment detected and measured several hundred proteins (about 200 per
run), but not every protein was identified in every sample, nor even in every run, leading to a
complex pattern of missing data.
If a protein was not identified in a particular experimental sample, the protein's relative abundance
level was unknown. When this happened for the reference sample in a particular run,
then the relative levels for this protein could not be estimated for any of the three experimental
samples in that run.
Proteins were identified in the data using arbitrary protein identifiers (BPG0001–BPG0460).
Though this prevented the incorporation of biological/subject-matter context in the analysis, it
was necessary due to potential intellectual property concerns.
In addition to acute rejection status and relative abundance measures for 460 proteins, the sex,
race, and age of each subject were provided.
Received 31 October 2010
Accepted 6 January 2011
Disease status determination: Exploring imputation and selection techniques
Linghong LU, Rena K. MANN*, Rabih SAAB and Ryan STONE
Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 3R4
Key words and phrases: BPCA; imputation; k-NN; LARS; LASSO; LLS; selection method; SLR; SVD
Abstract: Analyzing a proteomics dataset that contains a large number of independent variables (biomarkers)
with few response variables and many missing values can be very challenging. The authors tackle the problem
by first exploring different imputation techniques to treat the missing values and then investigating multiple
selection techniques to pick the best set of biomarkers to predict the unknown patients' disease status. They
conclude their analysis by cross-validating the different combinations of imputation and selection techniques
(using the set of patients of known disease status) in order to find the optimal technique for the supplied
dataset. The Canadian Journal of Statistics 39: 197–201; 2011 © 2011 Statistical Society of Canada
Résumé : Analyser les jeux de données provenant de la protéomique représente un grand défi, car ils
contiennent un grand nombre de variables indépendantes (biomarqueurs) avec peu de variables réponses et ils
contiennent aussi des valeurs manquantes. Les auteurs s'attaquent à ce problème en explorant différentes
techniques d'imputation pour traiter les valeurs manquantes. Ils considèrent aussi différentes techniques
de sélection multiple pour choisir le meilleur ensemble de biomarqueurs afin de prédire l'état non connu
de la maladie d'un patient. Ils concluent leurs analyses en croisant les différentes combinaisons de techniques
d'imputation et de sélection (en utilisant un ensemble de patients dont l'état de la maladie est
connu) de façon à trouver la technique optimale pour le jeu de données analysées. La revue canadienne
de statistique 39: 197–201; 2011 © 2011 Société statistique du Canada
1. INTRODUCTION
There is a large amount of missing information in the data; as such, we eliminated proteins with
40 or more missing values, leaving 330 proteins. A log2 transformation of the data was then taken,
as the data are fold changes.
In order to select the set of proteins that accurately predict the status of the unknown list of
patients, combinations of different imputation methods and selection techniques were studied and
then cross validated using the protein expressions for individuals with known disease status.
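The filtering-and-transform step described above can be sketched as follows. The original analysis was done in R; this is a minimal Python illustration, and the function and variable names are our own:

```python
import numpy as np

def preprocess(X, max_missing=40):
    # X: subjects x proteins matrix of fold-change ratios, NaN = missing.
    # Drop any protein (column) with max_missing or more missing values,
    # mirroring the authors' threshold of 40, then log2-transform.
    keep = np.isnan(X).sum(axis=0) < max_missing
    return np.log2(X[:, keep]), keep

# Toy data: 5 subjects x 3 proteins; the middle protein is mostly missing.
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(5, 3))
X[0:4, 1] = np.nan
Xt, keep = preprocess(X, max_missing=4)
# Xt has shape (5, 2): the heavily missing protein was dropped.
```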
2. IMPUTATION
Several imputation techniques were explored: k-Nearest Neighbours (k-NN), Local Least Squares
(LLS), Singular Value Decomposition (SVD), and Bayesian Principal Component Analysis
(BPCA).
k-Nearest neighbours is a sensitive and robust method in which missing values are determined
by the proteins whose expressions are most similar to the target protein in the other samples. The
optimal value of k, the number of neighbours used, has been shown in the literature to be in
the range 10–20 (Troyanskaya et al., 2001; Mu, 2008). Thus, the values 10 and 20 were
chosen in this instance. For each protein, the k-NN imputed value was found using Euclidean
distance (Troyanskaya et al., 2001) over the columns where that protein was not missing. If more
7/31/2019 Case Studies Eric Cormier
18/38
198 Vol. 39, No. 2
than 50% of a protein was missing, the missing values were imputed using each patient's average.
Otherwise, the average of the k-nearest neighbours was used to estimate the missing value.
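As a concrete, much simplified sketch of this scheme (in Python rather than the R used by the authors; proteins are rows, samples are columns, and all names are ours):

```python
import numpy as np

def knn_impute(X, k=10):
    # X: proteins x samples matrix, NaN = missing. For each protein with
    # missing entries, find the k proteins closest in Euclidean distance
    # over the commonly observed samples, and fill each missing entry
    # with the mean of those neighbours' values there.
    X = X.astype(float)
    filled = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            both = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if both.any():
                dists[j] = np.sqrt(np.mean((X[i, both] - X[j, both]) ** 2))
        nbrs = np.argsort(dists)[:k]
        for s in np.where(miss)[0]:
            vals = X[nbrs, s]
            vals = vals[~np.isnan(vals)]
            filled[i, s] = vals.mean() if vals.size else np.nanmean(X[i])
    return filled

X = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, np.nan],
              [10.0, 10.0, 10.0]])
filled = knn_impute(X, k=1)   # protein 0 is the nearest neighbour of protein 1
```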
In local least squares, a missing protein is evaluated as a linear combination of similar proteins
(Kim, Golub & Park, 2005). The method borrows from k-NN and from least squares imputation.
It is most effective when there is strong correlation in the data, as the k proteins were selected
based on those with the highest correlation to the target protein. We used two methods, Spearman
and Kendall, as the two correlation types, along with different values of k for the neighbours.
Singular value decomposition starts by taking the data set and ignoring the missing entries.
Then it calculates the mean for each of the rows of complete data. By initializing the missing
values to be the previously calculated row means, an iterative procedure finds the missing values.
Next, SVD is performed on the newly formed complete data set and the solution that is produced
replaces the row means in the missing values. These steps are repeated until the solution converges,
which usually happens after about five iterations (Hastie et al., 1999).
Bayesian principal component analysis is another imputation method that was used to impute
missing protein expression values. It combines an EM approach for PCA with a Bayesian model
and is based on three processes: principal component regression, Bayesian estimation, and an
expectation-maximization (EM)-like repetitive algorithm (Oba et al., 2003). The algorithm was
developed for imputation and very few components are needed in order to ensure accuracy. It
is an iterative process; it either terminates if the increase in precision becomes lower than the
threshold of 0.01 or if the number of set iterations is reached.
3. SELECTION METHODS
We used several methods to select differentially expressed proteins between the active and inactive
groups of patients. The simplest of these methods is the t-test. We used this basic test to compare
the protein expressions between the two groups. Significant results of the t-test provided a preliminary
analysis and a list of proteins potentially being expressed differently between the active and
inactive groups. We also used the Wilcoxon signed rank test which is conceptually similar to the
t-test but provides a more robust approach.
More advanced selection techniques were then explored to select biomarkers influencing the
disease status such as the Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle
Regression (LARS), and Sparse Logistic Regression (SLR).
3.1. LASSO and LARS
The concept of the LASSO, a technique suggested by Tibshirani (1996), is similar to ordinary
least squares but uses a constraint to shrink the number of independent random variables included
in the model by setting some of the coefficients to zero.
LARS is a model selection method, proposed by Efron et al. (2004), that is computationally
simpler than LASSO. It starts by equating all of the coefficients to zero and then adds the predictor
most correlated with the response. The next predictor added is the predictor most correlated with
the current residuals. The algorithm proceeds in a direction equiangular between the predictors
that are already in the model. Efron et al. (2004) presented modifications to the LARS algorithm
that generate the LASSO estimates and these were used to produce the LARS estimates in the
paper.
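One way to see how the LASSO constraint "sets some of the coefficients to zero" is the orthonormal-design special case, where the lasso solution is a soft-thresholding of the OLS coefficients (Tibshirani, 1996). A small illustration, not the authors' code:

```python
import numpy as np

def soft_threshold(b_ols, lam):
    # Lasso solution for orthonormal predictors: shrink each OLS
    # coefficient toward zero by lam and set the small ones exactly to 0.
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

b = soft_threshold(np.array([3.0, -0.5, 1.2, 0.1]), lam=1.0)
# b is approximately [2.0, 0.0, 0.2, 0.0]: two coefficients dropped
```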
3.2. SLR
Table 1: Number of individuals of known disease status misclassified by the different imputation and
selection methods.
          SVD      SVD     k-NN     k-NN            LLS           LLS          LLS
        (e = 3)  (e = 2) (k = 20) (k = 10)  BPCA  (k = 20 spear) (k = 10 ken) (k = 5 ken)
SLR        21       16       17       13     12         18           13           16
LASSO       0        0       12       13      1         13           16           15
LARS        3        0        3        4      0          5            1            1

The rows and columns correspond to the different selection and imputation methods, respectively. LLS has a
correlation type of either Spearman (spear) or Kendall (ken).
Table 2: Predicted status for unknown subjects using LARS combined with SVD imputation, where I and A
stand for inactive and active respectively.
Subject 2 7 8 11 12 13 17 18 29 31 33 34 35 38 41
Status I I I I I A A A I I I I I A I
SLR, proposed by Shevade & Keerthi (2003), uses maximum likelihood estimation to obtain
estimates of the model coefficients. The method is similar to LASSO in that it uses a constraint
to shrink the logistic regression model.
4. METHOD SELECTION
To select the set of proteins that accurately predict the status of the unknown list of patients,
combinations of imputation methods and selection techniques were studied and then cross
validated using the protein expressions for individuals with known status. Misclassification levels
observed for the eight imputation and three selection methods employed are displayed in Table 1.
The LARS algorithm had relatively few misclassified cases compared to SLR and LASSO. The
SLR method had high misclassification rates and was therefore dismissed for prediction purposes.
SVD and BPCA appear to be the best imputation techniques to use.
5. PROTEIN SELECTION AND PREDICTION
We chose the proteins picked by most methods. A protein was deemed to be selected as differen-
tially expressed by the t- and Wilcoxon signed rank tests if the respective P-values were less than
0.01. Proteins that had nonzero coefficients in the SLR, LASSO and LARS models were consid-
ered differentially expressed between the active and inactive groups of patients. The maximum
frequency of selection for proteins was 13, so 10 was taken as the cut-off value.
We noticed that seven proteins stood out: BPG0036, BPG0105, BPG0235, BPG0262, BPG0333,
BPG0381, and BPG0447. For prediction purposes we used the selected proteins mentioned above
and their corresponding coefficients from the LARS algorithm combined with the SVD imputation
method; the resulting predictions are shown in Table 2.
Table 3: Predicted status for unknown subjects where I and A stand for inactive and active status
respectively.
Subject 2 7 8 11 12 13 17 18 29 31 33 34 35 38 41
Status I A I I I A A I I A I I A A I
6. LOGISTIC REGRESSION
After determining the seven significant proteins, the variables race, gender and age were
analyzed to determine whether they were significant. Chi-square tests of independence showed
that race by status was not significant while gender was significant, and linear regression showed
that age by status was not significant. A logistic regression model (Dobson, 2002) was then fitted with the
variables gender and all seven of the selected proteins to produce a second set of predictions for
unknown disease status.
Three proteins, BPG0036, BPG0105, and BPG0333, were found to be significant by stepwise
algorithms applied to the fitted model. The expression levels of each of these three proteins changed
significantly between the active and inactive groups. Fitting the reduced model on the data of
given status gave the final model:

log(πj/(1 − πj)) = 1,309 − 6,948 BPG0036 + 4,509 BPG0105 − 2,715 BPG0333

for j = 1, . . . , 32. The status of patient j is determined to be active if the corresponding πj is less
than 0.5 and inactive otherwise. The predictions for the unknowns are shown in Table 3. To check
consistency, we predicted the status of the known-status patients (11 active and 21 inactive). The
misclassification was zero, which gave us confidence in the predictions for the 15 unknowns.
7. CONCLUSIONS
The main problem encountered when dealing with these proteomic data was the abundance of
missing values; therefore, the choice of imputation method proved to be vital. Both logistic regression
and LARS were shown to be very good selection methods; the LARS algorithm had low
misclassification rates. The three proteins picked by logistic regression were good predictors of a
patient's status, as there was an obvious separation between the inactive and active patients.
ACKNOWLEDGEMENTS
The support and guidance of Dr. Mary Lesperance is greatly appreciated. The study was supported
by Fellowships from the University of Victoria.
BIBLIOGRAPHY
A. Dobson (2002). An Introduction to Generalized Linear Models, Chapman & Hall/CRC, Washington.
B. Efron, T. Hastie, I. Johnstone & R. Tibshirani (2004). Least angle regression, Annals of Statistics, 32(2),
407–499.
T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown & D. Botstein (1999). Imputing missing data
for gene expression arrays. Technical Report, Department of Statistics, Stanford University, Palo Alto,
California, USA.
R. Mu (2008). Applications of correspondence analysis in microarray data analysis. MSc Thesis, University
of Victoria, Victoria, British Columbia, Canada.
S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara & S. Ishii (2003). A Bayesian missing value
estimation method for gene expression profile data, Bioinformatics, 19(16), 2088–2096.
S. Shevade & S. Keerthi (2003). A simple and efficient algorithm for gene selection using sparse logistic
regression, Bioinformatics, 19(17), 2246–2253.
R. Tibshirani (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical
Society, Series B, 58(1), 267–288.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. B. Altman
(2001). Missing value estimation methods for DNA microarrays, Bioinformatics, 17(6), 520–525.
Received 31 October 2010
Accepted 6 January 2011
Bootstrap multiple imputation: high-dimensional model validation with missing data
Billy CHANG*, Nino DEMETRASHVILI and Matthew KOWGIER
Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada M5T 3M7
Key words and phrases: Bootstrap validation; EM-algorithm; multiple imputation; penalized likelihood
1. METHODOLOGY
We ignored age, sex, and race when building the classifiers. We then removed proteins with more
than 50% missing values (191 proteins remained) and transformed zero relative abundance values
to 0.0001 to avoid taking the logarithm of zero when applying the log-transform.
We compared four classifiers: regularized discriminant analysis (RDA) with penalization
constant 0.99, diagonal linear discriminant analysis (DLDA), penalized logistic regression
(PLR) with penalization constant 0.01, and a linear-kernel support vector machine (SVM) with
penalization constant C = 1 (for the Lagrange multiplier). The classifiers are described in detail
in Hastie, Tibshirani and Friedman (2009).
Multiple imputation (Little & Rubin, 2002) and bootstrap validation (Hastie, Tibshirani &
Friedman, 2009) were used to compare and validate the four classifiers. We employed ROC
curves and AUC (area under the ROC curve) as the classifiers' performance metrics. Due to the
small validation sets created by bootstrapping, we employed the BS.632+ error correction
method (Hastie, Tibshirani & Friedman, 2009) to adjust for the small-sample prediction error bias
when comparing the prediction errors of the four classifiers.
We assume the 191 log-abundance scores for subject i (i = 1, . . . , N = 47) are multivariate
normal: x_i ~ N(μ, Σ). To avoid singular covariance estimates, we penalize the trace of the
precision matrix:

l(μ, Σ | x_1, . . . , x_N) = log det(Σ⁻¹) − trace(S Σ⁻¹) − λ trace(Σ⁻¹).

Here S is the sample covariance matrix. The maximum penalized-likelihood estimates are

μ̂ = (x_1 + · · · + x_N)/N,   Σ̂ = S + λI,
where I is the identity matrix. In the presence of missing data, we use the EM algorithm as
described in Little & Rubin (2002), with a slight modification in the M-step for parameter
estimation:
E-step: compute conditional expectations and covariance of the missing data given the
observed data.
M-step:
(1) Fill in the missing entries with their conditional expectations.
(2) Update the sample mean of the filled-in data.
Figure 1: ROC curves.
(3) Update the covariance estimate as the sum of the filled-in data sample covariance, the
conditional covariance of the missing data given the observed data, and λI (we used λ = 0.02).
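A compact numpy sketch of this penalized EM scheme, writing λ for the penalty constant (the authors used 0.02). This is our own illustration, not the authors' code:

```python
import numpy as np

def em_mvn(X, lam=0.02, n_iter=50):
    # EM for multivariate normal data with NaNs, with the ridge-style
    # penalty Sigma_hat = S + lam*I to keep the covariance nonsingular.
    X = X.astype(float)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0)) + lam * np.eye(p)
    for _ in range(n_iter):
        filled = np.empty_like(X)
        cond_cov = np.zeros((p, p))
        for i in range(n):
            m = np.isnan(X[i])
            if m.any():
                o = ~m
                xi = X[i].copy()
                # E-step: conditional mean/cov of missing given observed
                Soo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
                B = sigma[np.ix_(m, o)] @ Soo_inv
                xi[m] = mu[m] + B @ (X[i, o] - mu[o])
                C = sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
                cond_cov[np.ix_(m, m)] += C
                filled[i] = xi
            else:
                filled[i] = X[i]
        # M-step: update mean; covariance = filled-data covariance
        # + averaged conditional covariance + lam*I
        mu = filled.mean(axis=0)
        S = (filled - mu).T @ (filled - mu) / n
        sigma = S + cond_cov / n + lam * np.eye(p)
    return mu, sigma, filled

X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
mu, sigma, filled = em_mvn(X)
# the imputed entry tracks the near-perfect linear relation (close to 8)
```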
To validate the classifiers, we employed the following procedure:
(1) Estimate the missing-data distribution using all 47 subjects by the above EM algorithm.
(2) Generate 100 imputed data sets, with missing values imputed by draws from the estimated
missing-data distribution.
(3) From each of the 100 imputed data sets, remove the 15 subjects with unknown status.
(4) For each data set created in steps 2 and 3, create 200 bootstrap resampled data sets. Fit the
classifier on each bootstrap sample and compute averaged out-of-bag (OOB) ROC curves, OOB
AUC scores, and BS.632+ errors.
Note that all the penalization constants were chosen only to eliminate the singularity issues
due to the high dimensionality of the data; no fine-tuning was done to optimize classification
performance.
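The bootstrap out-of-bag part of this validation loop can be sketched as follows; a nearest-centroid rule stands in for the four classifiers compared in the paper, and all names are illustrative:

```python
import numpy as np

def oob_error(X, y, n_boot=200, rng=None):
    # Bootstrap out-of-bag validation: resample subjects with replacement,
    # fit on the bootstrap sample, score on the left-out (OOB) subjects.
    rng = np.random.default_rng(rng)
    n = len(y)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        oob = np.setdiff1d(np.arange(n), idx)
        yb = y[idx]
        if oob.size == 0 or (yb == 0).sum() == 0 or (yb == 1).sum() == 0:
            continue  # skip degenerate resamples
        c0 = X[idx][yb == 0].mean(axis=0)
        c1 = X[idx][yb == 1].mean(axis=0)
        pred = (np.linalg.norm(X[oob] - c1, axis=1)
                < np.linalg.norm(X[oob] - c0, axis=1)).astype(int)
        errs.append(np.mean(pred != y[oob]))
    return float(np.mean(errs))

# Two well-separated classes: the OOB error should be zero.
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
y = np.array([0] * 10 + [1] * 10)
err = oob_error(X, y, n_boot=50, rng=1)
```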
2. RESULTS
The averaged ROC curve for PLR lies above the ROC curves of the other three classifiers (Figure 1),
suggesting that PLR achieves better performance on average across all threshold levels. The
distributions of AUC scores (results not shown) also suggest that PLR consistently achieves good
separation ability; however, certain bootstrap samples give very low OOB AUC.
We used the OOB BS.632+ error to check the classifiers' performance at threshold 0.5 for RDA,
DLDA, and PLR (i.e., classifying a subject as Active if P(Active | the subject's protein scores) > 0.5)
and 0 for SVM, and found that PLR's errors are also consistently lower than those of the other three
classifiers (results not shown).
3. CONCLUSIONS
Based on the above observations, PLR is the best classifier among the four classifiers compared.
However, the huge variance in AUC and BS.632+ errors casts doubt on whether PLR can truly
ACKNOWLEDGEMENTS
We thank our team mentor Rafal Kustra for his guidance and support throughout.
BIBLIOGRAPHY
T. Hastie, R. Tibshirani & J. Friedman (2009). The Elements of Statistical Learning, 2nd ed., Springer,New York.
R. J. A. Little & D. B. Rubin (2002). Statistical Analysis with Missing Data, 2nd ed., Wiley, New York.
Received 31 October 2010
Accepted 6 January 2011
Exploring proteomics biomarkers using a score
Qing GUO1*, Yalin CHEN2 and Defen PENG2
1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada
L8N 3Z52Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1
Key words and phrases: AUC; biomarker; cluster analysis; cross-validation; jittered QQ plot; logistic
regression; missing data imputation; proteomics score; sensitivity; specificity
1. METHODOLOGY
Proteins are constantly changing and interacting with each other, which makes it challenging to
identify a single protein. Often a function is implemented by several proteins, among which some
are up-regulated and some down-regulated. In this article, we propose a particular procedure
(PHD-SCORE) to seek a proteomics score that is a combination of relevant functional groups
of proteins. This can be used, in place of a single or a limited number of potential proteins, as
the biomarker to distinguish active disease status from inactive. The procedure involves the
following steps: Process the data; Condense the High dimensional Dataset; Identify Statistically
meaningful clusters and Calculate a proteomic score; Build statistical models to find one more
appropriate for the data; Determine patients' disease status by choosing an Optimal probability cut
point; Test model prediction ability by using cross-validation; Repeat the procedure until a model
is chosen with proper predictions and low Error rate; Apply the chosen model to unknown cases.
1.1. Process the Data
Data checking has been conducted to ensure consistency and integrity. Nothing peculiar was
found. Among 47 subjects, 11 had active disease, 21 were inactive, and 15 had unknown disease
status. Based on our exploration, any protein with more than 38.2% of observations missing in
either group (equivalent to 4 in active and 8 in inactive groups) was excluded from further analysis.
This left us with 160 protein variables in the data set, together with 3 covariates (age, sex, race).
1.2. Condense the High-Dimensional Protein Data
Hierarchical cluster analysis was employed to reduce the dimensionality of the protein data. We
used 1 − r (r is the Pearson correlation coefficient) as the dissimilarity between clusters, with
average linkage. Usually, cutting into too many clusters will result in a small number of variables
within each cluster, and too few clusters will mean that many less correlated variables will be
included. Given these considerations, we decided to choose 13 clusters, with a corresponding cut-off
correlation coefficient of 0.85.
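A sketch of this clustering step (the authors presumably worked in R; scipy's hierarchical clustering plays the same role here, with 1 − r as the dissimilarity, and all names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlation_clusters(X, r_cut=0.85):
    # Average-linkage hierarchical clustering of the columns of X using
    # 1 - r (Pearson) as the dissimilarity; the tree is cut at 1 - r_cut.
    r = np.corrcoef(X, rowvar=False)
    d = squareform(1.0 - r, checks=False)     # condensed distance vector
    Z = linkage(d, method="average")
    return fcluster(Z, t=1.0 - r_cut, criterion="distance")

rng = np.random.default_rng(0)
a, b = rng.normal(size=50), rng.normal(size=50)
X = np.column_stack([a, 2.0 * a, b])          # first two perfectly correlated
labels = correlation_clusters(X)
# the first two columns share a cluster; the third lands elsewhere
```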
1.3. Impute Missing Data
Two imputation methods were used for handling missing data: hot-deck and linear regression.
In the latter, the missing value of a protein is predicted and imputed from a regression model
with all available predictors.
Table 1: Fitted coefficients for the logistic regression model.
Predictor Estimate SE P-value
Constant 4.05 4.41 0.36
Age 0.06 0.07 0.42
Proteomic score 20.75 8.64 0.02
1.4. Identify Informative Clusters
The clusters were selected by first considering 13 plots, one per cluster, of the average value of
the proteins in the cluster for the 32 subjects against disease status, to visually discriminate the
active and inactive groups, followed by stepwise logistic regression. This gave us 2 clusters, with 8
variables in one cluster and 12 variables in the other. The proteomics score was calculated as the
mean difference between the 2 clusters.
1.5. Build Model and Optimal Estimate of Cut-Off Probability
The logistic regression model was built using the covariates and the calculated proteomics
score as predictors. Gender and race were not statistically significant. The final model is
logit(P) = β0 + β1 Age + β2 Score, where P is the probability of having active disease status.
A probability cut-off plot was drawn to detect the patients' disease status. The AUC and a jittered QQ
plot (Zhu & El-Shaarawi, 2009) were effective tools to diagnose the adequacy of the fitted model in
numerical and graphical ways.
1.6. Test Prediction Ability and Apply to the Unknown Cases
Cross-validation within the 32 subjects with known disease status was utilized to provide a nearly
unbiased estimate of the prediction misclassification rate (Farruggia, Macdonald & Viveros-
Aguilera, 1999). We used a random split to partition the observed data into a training set (2/3 of
32 subjects, about 22 subjects) and a validation set (1/3 of 32 subjects, about 10 subjects). We
pseudo-randomly repeated the cross-validation 200 times to assess the misclassification rate.
Steps 1–6 were run repeatedly until a model with a low misclassification rate was chosen. Finally,
we applied the model to the 15 unknown cases to identify their disease status.
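The repeated 2/3–1/3 random-split validation can be sketched generically; `fit` and `predict` are placeholders for any classifier, and the trivial majority-class rule below is only for illustrating the plumbing:

```python
import numpy as np

def repeated_split_cv(X, y, fit, predict, n_rep=200, train_frac=2/3, rng=0):
    # Repeated random-split validation: 2/3 training, 1/3 validation,
    # misclassification rate averaged over n_rep pseudo-random splits.
    rng = np.random.default_rng(rng)
    n = len(y)
    n_train = int(round(train_frac * n))
    rates = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        tr, va = perm[:n_train], perm[n_train:]
        model = fit(X[tr], y[tr])
        rates.append(np.mean(predict(model, X[va]) != y[va]))
    return float(np.mean(rates))

# Trivial majority-class rule, purely for demonstration:
def fit_majority(Xtr, ytr):
    return int(ytr.mean() >= 0.5)

def predict_majority(model, Xva):
    return np.full(len(Xva), model)

X, y = np.zeros((12, 3)), np.zeros(12, dtype=int)
rate = repeated_split_cv(X, y, fit_majority, predict_majority, n_rep=20)
```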
2. RESULTS
The analyses were performed using the statistical package R. The maximum likelihood estimates
of the coefficients for the fitted model and their P-values are shown in Table 1. The
proteomic score is significant at level 0.05 for detecting patients' disease status, while the
insignificant covariate age was retained to adjust the logit(P).
The probability cut-point of 0.26 was chosen based on the trade-off between sensitivity
and specificity. If a patient's disease probability is greater than 0.26, active disease status is
assumed. The misclassification rates for the active, inactive and overall groups are 0.13, 0.11, and
0.12, respectively. When this chosen model was applied to the data on the 15 remaining participants,
patients 7, 12, 13, 17, and 18 were identified with active disease.
3. CONCLUSIONS
There are limitations. Firstly, the chosen percentage of missingness (here 38.2%) for each protein
Finally, some factors, such as the number of clusters and the cut-point for the probability of
disease, and some specific proteins could be better targeted and ascertained with the involvement
of the principal investigator, so as to understand the underlying biological mechanism better.
Steps 1–6 of the PHD-SCORE procedure need to be run repeatedly until a satisfactory model is found.
ACKNOWLEDGEMENTS
We thank Dr. Rong Zhu for the guidance, encouragement and availability to us during the study.
We are also indebted to Drs. Eleanor Pullenayegum and Román Viveros-Aguilera for their financial
support and constructive suggestions. We also would like to acknowledge Drs. Stephen Walter,
Harry Shannon, and Lehana Thabane for their helpful comments.
BIBLIOGRAPHY
J. Farruggia, P. D. Macdonald & R. Viveros-Aguilera (1999). Classification based on logistic regression and
trees. Canadian Journal of Statistics, 28, 197–205.
R. Zhu & A. H. El-Shaarawi (2009). Model clustering and its application to water quality monitoring.
Environmetrics, 20, 190–205.
Received 31 October 2010
Accepted 6 January 2011
A multiple testing procedure to proteomic biomarker for disease status
Zhihui (Amy) LIU* and Rajat MALIK
Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1
Key words and phrases: Multiple testing procedure; imputation; classification tree; heatmap; proteomic
biomarker
1. METHODOLOGY
When hundreds of hypotheses are tested simultaneously, the chance of false positives is greatly
increased. We first removed the proteins with more than 5 missing values in the active group
or more than 11 in the inactive group. To control Type I error rates, resampling-based single-step
and stepwise multiple testing procedures (MTPs) were applied, using the Bioconductor R package
multtest (Pollard et al., 2009). Our null hypothesis was that each protein has equal mean relative
abundance in the active and inactive groups. The non-parametric bootstrap with centring and
scaling was used with 1,000 iterations. Non-standardized Welch t-statistics were implemented,
allowing for unequal variances in the two groups.
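A simplified stand-in for the resampling-based single-step maxT procedure (the authors used Bioconductor's multtest; this numpy sketch uses a centred bootstrap and Welch t-statistics, and all names are our own):

```python
import numpy as np

def welch_t(a, b):
    # Welch t-statistic for each column (protein), unequal variances.
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    va, vb = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (ma - mb) / np.sqrt(va / len(a) + vb / len(b))

def maxT_adjusted_p(a, b, n_boot=1000, rng=0):
    # Single-step maxT adjusted p-values via a centred bootstrap: each
    # group is centred so the null of equal means holds, resampled with
    # replacement, and the maximum |t| across proteins is recorded.
    rng = np.random.default_rng(rng)
    t_obs = np.abs(welch_t(a, b))
    ac, bc = a - a.mean(axis=0), b - b.mean(axis=0)
    max_t = np.empty(n_boot)
    for i in range(n_boot):
        ar = ac[rng.integers(0, len(ac), len(ac))]
        br = bc[rng.integers(0, len(bc), len(bc))]
        max_t[i] = np.abs(welch_t(ar, br)).max()
    # adjusted p-value: fraction of bootstrap maxima at least as large
    return (max_t[None, :] >= t_obs[:, None]).mean(axis=1)

rng = np.random.default_rng(1)
a = rng.normal(size=(10, 2)); a[:, 0] += 5.0   # protein 0 truly shifted
b = rng.normal(size=(10, 2))
p_adj = maxT_adjusted_p(a, b, n_boot=500, rng=2)
```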
We explored four imputation methods in the R package pcaMethods (Stacklies et al., 2007):
the nonlinear iterative partial least squares algorithm, the Bayesian principal component analysis
missing value estimator, the probabilistic principal component analysis missing value estimator,
and the singular value decomposition algorithm. A classification tree implemented in the R
package rpart (Therneau et al., 2009) was fitted using the nine most frequently rejected proteins
from the MTPs. To visualize the results, a heatmap was produced (see Figure 1).
2. RESULTS
With the random seed set to 20, the MTPs result in five rejections at the 0.05 significance
level: BPG0167, BPG0235, BPG0333, BPG0381, and BPG0447. Note that different choices
of seed yield slightly different results. Applying the random seed of 20 and 1,000 bootstrap
iterations to the imputed data, we found that the results from the four imputation methods are
similar and agree with those obtained without imputation. A heatmap was plotted (Figure 1) using the
nine most rejected proteins: BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310,
BPG0333, BPG0381, and BPG0447. Notice that it correctly classifies almost all the samples. The
classification tree predicts that patients 7, 11, 13, 18, 33, and 35 belong to the active group.
3. DISCUSSION
Multiple testing procedures are a concern because an increase in specificity is coupled with a
loss of sensitivity. Furthermore, we suspect that the proteins with the most difference in relative
abundance between the active group and inactive group are not necessarily the key players in
the relevant biological processes. These problems can only be addressed by incorporating prior
biological knowledge into our analysis, which may lead to focusing on a specific set of proteins.
Figure 1: A heatmap of nine proteins comparing the active (+) and inactive group.
4. CONCLUSION
The results from MTPs before imputation agree more or less with those after imputation.
The reason for this is unclear; it could mean either that the imputation works very well or that it
is not very helpful. There is no evidence that race, sex, or age is associated with disease status.
Proteins BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310, BPG0333, BPG0381,
and BPG0447 seem to be important in indicating the disease status. Patients 7, 11, 13, 18, 33, and
35 are predicted to be active among the 15 patients of unknown status.
ACKNOWLEDGEMENTS
We thank Dr. Peter Macdonald for his valuable advice and encouragement.
BIBLIOGRAPHY
K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor & S. Dudoit (2009). multtest: Resampling-based multiple
hypothesis testing. R package version 2.1.1.
W. Stacklies, H. Redestig & K. Wright (2007). pcaMethods: A collection of PCA methods. R package version
1.22.0.
T. M. Therneau, B. Atkinson & B. Ripley (2009). rpart: Recursive Partitioning. R package version 3.1-44.
Received 31 October 2010
Accepted 6 January 2011
The process of selecting a disease status classifier using proteomic biomarkers
Christopher MEANEY1*, Calvin JOHNSTON1 and Jenna SYKES2
1Dalla Lana School of Public Health, University of Toronto
2Department of Statistics and Actuarial Science, University of Waterloo
Key words and phrases: Statistical classifier; biomarker; support vector machine
1. METHODOLOGY
Our first challenge was to narrow down the list of available statistical classifiers. One resource at
this stage was the open-source data mining and statistical learning program Weka developed at the
University of Waikato in New Zealand (Hall et al., 2009). With a few simple clicks on its intuitive
Explorer interface, we quickly sifted through an extensive selection of supervised classification
techniques, including logistic regression, trees and forests, bagging, boosting, neural networks,
Bayes classifiers, and support vector machines. Weka has a convenient cross-validation function
which allowed us to whittle down this long, intimidating list to only the most promising methods
and allowed us to focus our efforts in a fruitful direction. We found that the Multilayer Perceptron
and Support Vector Machines (SVM) had strong empirical classification properties according to
leave-one-out cross validation. We opted to focus our energy on the SVM algorithm as it seemed
more intuitive.
Consider a Euclidean space with dimensionality equal to the number of factors; in this
case the factors are the 400+ protein relative abundance measurements. Each patient maps to a
vector in that space, positioned according to his or her particular measurements on each protein. Along with this positioning, each patient also has a disease status: active or inactive. The SVM
then finds the optimally separating hyperplane which splits the Euclidean space into two sections,
each ideally containing patients of only one disease status. When there are multiple possible
hyperplanes that achieve this completely separating objective, the SVM chooses the plane that
maximizes the distance between the hyperplane and its nearest datapoints.
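To make the geometry concrete, here is a minimal sketch of a linear SVM trained by stochastic sub-gradient descent on the regularized hinge loss (a Pegasos-style method, not the quadratic-programming solver inside e1071's svm()); the two-dimensional toy clusters stand in for the high-dimensional protein-abundance vectors and are invented:

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=200, seed=1):
    """Fit a linear SVM by stochastic sub-gradient descent on the
    regularized hinge loss; labels must be +1/-1."""
    rng = random.Random(seed)
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    idx = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Shrink w (the regularization term of the objective).
            w = [(1 - eta * lam) * wj for wj in w]
            # Points with margin < 1 incur hinge loss and pull the plane.
            if margin < 1:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Minimizing the hinge loss with an L2 penalty on w is equivalent to seeking the maximum-margin separating hyperplane: points already at margin 1 or more contribute nothing to the update, so only the nearest points shape the final plane.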
In many SVM implementations, the use of linear hyperplanes is extended with kernels by
replacing the linear vector dot product with a nonlinear kernel function, such as a polynomial
or radial basis function. A common modification is to relax the requirement that the hyperplane
divide all cases perfectly by allowing for a few penalized exceptions known as slack vectors
(Hastie et al., 2009). Since the provided data set was already of high dimensionality and was fully
separable, neither of these extensions was used.
We used the function svm() in the R (R Development Core Team, 2005) package e1071
(Wu, 2009). This function allows many options to be set including a choice of kernels and the
penalization of slack vectors (Meyer, 2009). Data were scaled within the function to prevent
proteins with large variance from dominating the classification decision.
2. RESULTS
Many protein abundance measurements were absent from the data. The function svm() requires
complete data to operate. Thus, we imputed values for the missing data. Exploratory analysis
indicated that there were a number of proteins for which fewer than 5 out of 47 measurements
were missing and a large number for which more than 30 out of 47 measurements were missing.
Furthermore, several of the proteins with more than 20% missing data had missing values for all
patients in one of the two status groups.
We decided to limit our imputation to only proteins for which fewer than 20 measurements
were missing. As a result, the observational set of interest for a given case consisted of only
146 remaining proteins. After careful and extensive visual exploration of the missing data, we
felt comfortable that the data were mostly missing at random and not systematically linked to
the true underlying values (such as higher probability of missingness for very small true values).
A review of imputation literature revealed many possible strategies for the replacement of the
missing values in the dataset. The imputation segment of the Multivariate Statistics Task View of
R/CRAN revealed eight possible libraries that were devoted to multivariate imputation. However,
the fact that our dataset consisted of a large number of variables, measured on a small number of
cases, restricted our imputation options slightly. We began by using simplistic strategies such as
mean/median imputation; however, we felt that more advanced methods would permit us to build
a more accurate classifier. We settled on using a nonparametric, k-Nearest Neighbours (k-NN)
imputation strategy, as it had been recommended by some authors from the field of proteomics and
it behaved well in leave-one-out cross-validation (Troyanskaya et al., 2001). The k-NN imputation
was performed using the impute.knn() function from the impute library in R. Ranging k from
1 to 10, we found that only one patient was misclassified for all k. The misclassified patient
was in the inactive status group, so our final empirical classification rate was 100% (0/11 cases
misclassified) and 95.2% (1/21 case misclassified) for active and inactive statuses, respectively.
Since all k behaved equally well, we arbitrarily chose k = 8 for our final classifier.
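The k-NN scheme can be sketched in a few lines of pure Python (the impute.knn() function in R is more refined, falling back to column means when a neighbourhood is too sparse; the toy matrix below is invented, with None marking a missing measurement):

```python
import math

def knn_impute(rows, k=8):
    """Impute each missing entry (None) with the mean of that column over
    the k rows nearest in Euclidean distance on jointly observed columns."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        # Normalize by the number of shared columns so sparse rows compare fairly.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            if val is None:
                # Candidate neighbours must themselves observe column j.
                cands = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
                cands.sort(key=lambda r: distance(row, r))
                neighbours = [r[j] for r in cands[:k]]
                if neighbours:
                    completed[i][j] = sum(neighbours) / len(neighbours)
    return completed
```

Because the fill-in value is borrowed from the most similar patients rather than from the overall column mean, k-NN imputation tends to preserve group structure, which is why it pairs well with a downstream classifier.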
Decision processes were heavily reliant on empirical fitting, so there was some risk of overfitting.
However, given the rigidity of the linearity requirement for our SVM implementation and
the care we took in heeding this problem, we felt the amount of overfitting was probably small.
3. CONCLUSIONS
Our final classifier was a linear Support Vector Machine with 8-Nearest-Neighbour missing value
imputation for proteins with fewer than 20% missing data. Allowing for small amounts of overfitting,
we suspect our technique would correctly classify about 90% of all patients in each disease
status category.
BIBLIOGRAPHY
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I. Witten (2009). The WEKA Data Mining
Software: An Update. SIGKDD Explorations, 11, 1–9.
T. Hastie, R. Tibshirani & J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer, New York.
D. Meyer (2009). Support Vector Machines, CRAN R Project, accessed July 15, 2009 at cran.r-
project.org/web/packages/e1071/vignettes/svmdoc.pdf.
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.r-project.org.
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. Altman (2001).
Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.
T. Wu (2009). Misc Functions of the Department of Statistics (e1071), CRAN R Project, accessed July 15,
2009 at cran.r-project.org/web/packages/e1071/e1071.pdf.
Table 2: Predictions of patients with missing status.
Patient   Tree       LG         Decision
1         Inactive   Inactive   Inactive
2         Inactive   Active     Active
3         Inactive   Inactive   Inactive
4         Inactive   Inactive   Inactive
5         Inactive   Inactive   Inactive
6         Inactive   Active     Active
7         Active     Active     Active
8         Inactive   Active     Active
9         Inactive   Inactive   Inactive
10        Active     Inactive   Active
11        Inactive   Inactive   Inactive
12        Inactive   Inactive   Inactive
13        Active     Active     Active
14        Active     Active     Active
15        Inactive   Inactive   Inactive
For inconsistent results, Active predictions, whether by the classification tree or by the logistic regression,
were regarded as more reliable.
and false active error rates (see Table 1). Predictions from the two models were obtained for the
status of patients with missing status (see Table 2).
3. CONCLUSIONS
Given the low false inactive error rates obtained from both models, we conclude that it is possible
to use classifiers as a pre-screening procedure for identifying active patients, particularly when
a much larger sample is available for model training. However, without knowing the error
rates of the current diagnostic method, it does not seem feasible to conclude whether the
classification methods based on a simple blood sample perform as well as the current diagnostic
method.
ACKNOWLEDGEMENTS
We would like to thank the Department of Statistics at the University of British Columbia for its
warm support of this case study.
BIBLIOGRAPHY
L. Breiman, J. H. Friedman, R. A. Olshen & C. J. Stone (1984). Classification and Regression Trees,
Wadsworth & Brooks, Pacific Grove.
Received 31 October 2010
Accepted 6 January 2011
Discussion of case study 2 analyses
Robert F. BALSHAW* and Gabriela V. COHEN FREUE
NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6
Key words and phrases: Variable selection; class prediction; imputation
1. INTRODUCTION
These analyses have been performed on data provided as a case study for the 2009 SSC case study
session organized by Alison Gibbs. A general overview of the data was provided in the Background
and Description. A comparable analysis is described in Cohen Freue et al. (2010). In many ways, the
data represent what has become a fairly standard statistical challenge in supervised learning: to
develop a classification rule for prediction of unknown class labels for 15 new samples based on
a small training set, possibly utilizing only a subset of the features.
From our experience, the principal challenge was the abundance of missing data, whose
presence may well carry information about class membership. Missing values often occur with
our analytical proteomic platform due to the challenge of protein identification as well as detection
of low abundance proteins.
We would like to thank all six teams for their efforts; all are to be commended for their insightful
analyses. The six teams will be referred to as follows:
WX: Wang and Xia.
LM: Liu and Malik.
CDK: Chang, Demetrashvili, and Kowgier.
MJS: Meaney, Johnston, and Sykes.
GCP: Guo, Chen, and Peng.
LMSS: Lu, Mann, Saab, and Stone.
2. METHODOLOGIES
The six teams used a wide variety of methodologies, which we have attempted to summarize very
briefly in Table 1. In general, the teams all took similar approaches to pre-filtering the features
based on detection rates and explored a variety of imputation methods. All teams eliminated a
subset of the proteins with lower rates of detection, though the detection rule and threshold used
varied between the teams. Essentially, two variant pre-filtering rules were used: (1) select proteins
for which the overall detection rate was above an arbitrary, pre-determined threshold: 0 and 1 by
WX; 0.5 by CDK; 0.8 by MJS; or 0.15 by LMSS; and (2) select proteins for which the within-class
detection rates were above a threshold: approximately 0.5 by LM and 0.4 by GCP. Having chosen
a set of proteins, the teams then utilized a variety of imputation methods (see Table 1) to permit
the use of those classification techniques which do not permit missing values.
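The two pre-filtering rules can be sketched as follows; this is a pure-Python illustration, with the protein names, class labels, and thresholds invented, and None marking a non-detected measurement:

```python
def filter_by_detection(data, labels, overall_min=None, within_min=None):
    """Keep proteins whose detection (non-missing) rate clears a threshold.

    data: dict mapping protein -> list of measurements (None = not detected).
    labels: class label for each sample, parallel to the measurement lists.
    overall_min applies to all samples pooled (rule 1); within_min must
    hold in every class separately (rule 2)."""
    def rate(values):
        return sum(v is not None for v in values) / len(values)

    kept = []
    classes = set(labels)
    for protein, values in data.items():
        if overall_min is not None and rate(values) < overall_min:
            continue  # rule 1: pooled detection rate too low
        if within_min is not None:
            by_class = {c: [v for v, l in zip(values, labels) if l == c]
                        for c in classes}
            if any(rate(vs) < within_min for vs in by_class.values()):
                continue  # rule 2: some class observes the protein too rarely
        kept.append(protein)
    return kept
```

The within-class variant guards against the situation noted by several teams, where a protein is detected at a reasonable overall rate but is missing for essentially all patients in one status group.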
Two teams then conducted univariate or one-protein-at-a-time tests before building a multivariate
classifier (LM; LMSS). LM selected a reduced set of candidate significant proteins
(controlling the Type I error rate), and LMSS selected a reduced set of candidate proteins that
showed differential expression in several methods (including other approaches that test multiple
Table 1: Summary of strategies and accuracy.
Team Missing values Classifier Selected model Accuracy
WX Overall detection rate
threshold: 1 and 0;
Imputation Methods:
mean and k-NN
Tree with splitting rule
(Tree) and
logistic-AIC (LG)
Ensemble of two models:
Tree based on
k-NN-imputed data and
LG based on complete
data
11/15
LM Within group detection
rate threshold: 0.5;
Imputation Methods:
NIPALS, BPCA,
Probabilistic PCA, SVD
Non-standardized Welch
t-test using FDR
followed by a
classification tree
Tree based on top-9 tested
proteins. High
concordance among
imputation methods
12/15
CDK Overall detection rate
threshold: 0.5;
Imputation Method:
EM-algorithm
RDA, DLDA, PLR, and
SVM
PLR 14/15
MJS Overall detection rate
threshold: 0.8;
Imputation Method:
median/mean and k-NN
(several k)
Extensive list of
supervised
classification
techniques available
in Weka
SVM in k-NN-imputed data
(k= 8, similar results for
other values)
13/15
GCP Within group detection
rate threshold: 0.4;Imputation Methods:
hot-deck and linear
regression imputation
Hierarchical cluster
analysis followed bystepwise logistic
regression to select
clusters. A proteomic
score, the mean
difference between
the two identified
clusters, is calculated
Logistic regression with the
calculated proteomicscore and age as
covariates
11/15
LMSS Overall detection rate
threshold: 0.15;
Imputation Methods:k-NN, LLS, SVD,
BPCA
Classical t-test,
Wilcoxon signed rank
test, LASSO, LARS,and SLR
Model 1: LARS algorithm
on the seven most
frequent proteins from allmethods explored on
SVD imputed data.
Model 2: logistic
regression based on three
proteins selected by a
stepwise analysis on the
seven most frequent
proteins and gender
Model 1:
10/15;
Model 2:12/15
features simultaneously). Though with different emphasis and in combination with other methods,
almost all teams utilized a logistic regression approach (WX, CDK, MJS, GCP, LMSS)
variables were related with the disease condition; GCP built a logistic regression classification
model that selected age from all other covariates and the proteomic score; and LMSS found only
gender to be statistically significant using a chi-square test, though it was not selected in their
final logistic regression model.
The predictive performance of each team's final classifier was tested on 15 samples whose
class labels were kept hidden until after each team's results were provided to the organizers. The
accuracy of each team's final model appears in the last column of Table 1. Though this permits
a quantitative comparison between the teams, we will resist the temptation to over-interpret the
relative predictive performance of the teams and their methods. Instead, we would like to once
again thank all the teams for their hard work, insight, and thoughtful questions.
BIBLIOGRAPHY
G. V. Cohen Freue, M. Sasaki, A. Meredith, O. P. Gunther, A. Bergman, M. Takhar, A. Mui, R. F. Balshaw,
R. T. Ng, N. Opushneva, Z. Hollander, G. Li, C. H. Borchers, J. Wilson-McManus, B. M.
McManus, P. A. Keown, W. R. McMaster, and the Genome Canada Biomarkers in Transplantation
Group (2010). Proteomic signatures in plasma during early acute renal allograft rejection. Molecular
& Cellular Proteomics, 9, 1954–1967.
Received 31 October 2010
Accepted 6 January 2011