    The Canadian Journal of Statistics

    Vol. 39, No. 2, 2011, Pages 181–217

    La revue canadienne de statistique

    Case studies in data analysis

    Alison L. GIBBS1*, Kevin J. KEEN2 and Liqun WANG3

    1Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3

    2Department of Mathematics and Statistics, University of Northern British Columbia, Prince George, BC, Canada V2N 4Z9

    3Department of Statistics, University of Manitoba, Winnipeg, Man., Canada R3T 2N2

    The following short papers are summaries of student contributions to the Case Studies in Data Analysis from the Statistical Society of Canada 2009 annual meeting. Case studies have been an

    important part of the SSC annual meeting for many years, providing the opportunity for students

    to delve into interesting problems and data sets and to present their findings at the meeting. Since

    2008, prizes have been awarded for the best poster presentations for each of two case studies. The

    case studies at the 2009 annual meeting and the selection of this suite of papers were organized

    by Gibbs and Keen.

    This section consists of two groups of papers corresponding to two case studies. Each sub-

    section starts with an introduction given by the data donors, which is followed by the winning

    paper and contributed papers. The subsection ends with discussion and summary by the data

    donors. The theme of case study 1 is the identification of relevant factors for the growth of lodgepole

    pine trees. First, Dean, Gibbs, and Parish provide an introduction to the data and the problems

    of scientific interest. The winning paper's authors, Cormier and Sun, first use nonparametric

    smoothing to identify a nonlinear relationship between the growth rate and the age of the trees.

    They then use a mixed model to explain the growth rate through the age and other environmental

    factors. In the second paper, Salamh first estimates a similar mixed model and then supplements

    the analysis using a dynamic model.

    The theme of case study 2 is the classification of disease status through proteomic biomarkers.

    Balshaw and Cohen-Freue introduce the data and problems of interest. The winning paper is

    authored by Lu, Mann, Saab, and Stone, who first explore various data imputation techniques, including k-nearest neighbours, local least squares and singular value decomposition. They

    then apply various multiple selection methods such as LASSO, least angle regression (LARS)

    and sparse logistic regression. This paper is accompanied by four contributed papers which use

    various modern classification techniques. Guo, Chen, and Peng use a score procedure to classify

    the disease status. Liu and Malik employ a multiple testing procedure. Meaney, Johnston and

    Sykes apply support vector machines (SVM). Wang and Xia use classification tree and logistic

    regression techniques. A summary and comparison of these methods and outcomes are given by

    Balshaw and Cohen-Freue.

    We are grateful to Charmaine Dean of Simon Fraser University, Roberta Parish of the British

    Columbia Ministry of Forests and Range, and Rob Balshaw and Gabriela Cohen-Freue of the

    NCE CECR PROOF Centre of Excellence for the use of their data and their contributions to this

    suite of papers. We also thank the former and current Editors of the CJS, Paul Gustafson and

    Jiahua Chen, for agreeing to publish these papers and for their patience and support during the

    editorial process.

    Received 31 October 2010

    Accepted 6 January 2011

    Case study 1: The effects of climate on the growth of lodgepole pine

    C.B. DEAN1*, Alison L. GIBBS2 and Roberta PARISH3

    1Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6

    2Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3

    3British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8

    Key words and phrases: Climate effects; tree growth; biomass; lodgepole pine.

    1. BACKGROUND

    To compete successfully in the world economy, the commercial forestry industry requires an

    understanding of how changes in climate influence the growth of trees. The goal of this case study

    was to examine how well-known climate variables, combined with estimated crown biomass, can

    predict wood accumulation in lodgepole pine. In order to model the growth and yield of trees over

    time, we need to determine how much wood a tree accumulates each year. Each year, a tree lays

    down an annual ring of wood in a layer under the bark. Pressler's hypothesis states that the area of

    wood laid down annually, measured by the cross-sectional area increment, increases linearly from

    the top of the tree to the base of the crown (the location of the lowest live branches) and is based

    on the assumption that area increment in the crown increases with the amount of foliage above

    the point of interest. Below the crown, the area increment in any given year remains constant

    down the bole until the region of butt swell at the base of most trees.

    The growth of a tree in any given year is strongly influenced by growth in the previous years.

    One reason for this is that buds are formed the year before they start to grow and carbohydrates

    from good years can be stored to fuel growth in subsequent years. The effects of previous growing

    conditions can last from 1 to 3 years, depending on tree species and location. Climate affects

    growth and influences both the size of the annual ring of wood and the proportions of early

    and late wood. Low density early wood is laid down during the spring when water is plentiful.

    Late wood, which is laid down from mid-summer until growth ceases in the fall, has a high

    density. Cessation of wood formation is sensitive to weather conditions such as temperature and

    drought.

    Lodgepole pine (Pinus contorta Doug. ex Loud.) stands dominate much of western Canada and

    the United States, covering over 26 million hectares of forest land. It is an important commercial

    species in British Columbia; stands consisting of more than 50% lodgepole pine occupy 58% of the

    forests in the interior of the province. Lodgepole pine is primarily used for lumber, poles, railroad

    ties, posts, furniture, cabinetry, and construction timbers. It is commercially important to be able

    to predict how lodgepole pine will grow and accumulate wood over time. Using high resolution

    satellite images of lodgepole pine stands to predict wood attributes is under consideration, but

    first the relationship of crown properties such as the amount of foliage must be linked to wood

    properties and growth.

    2. THE DATA

    Data on the annual growth and wood density of 60 lodgepole pine trees from four sites in two

    geographic areas in central British Columbia were provided for this investigation. Samples were

    removed at 10–13 locations along each tree and two radii (A and B) per sample disc were mea-

    sured. Measurements of the last year of growth and wood density are often unreliable because

    of proximity to the bark and difficulties of sample preparation. However, it is for this ring only

    that we have measures of the amount of foliage. Several growth outcomes are available includ-

    ing the widths of the A and B radii, in millimetres, percentage of late and early wood, and

    early and late wood densities, in kg/m3. Foliar biomass measurements are provided for multi-

    ple branches; estimates are available for each annual whorl. The data on biomass include the

    average relative position of the branch in the crown (1 is the base of the crown and 0 is the

    top) and corresponding foliar biomass (the mass, in kg/m2, of needles subtended by the branches

    at that position). Other variables such as the total height of the tree, in metres, as well as the

    height to the base of the crown, in metres, are also provided. Climate data from Environment

    Canada arise from the two nearest stations with long-term records, Kamloops and Quesnel. For

    each of these locations, monthly and annual data are provided on: (1) the minimum temper-

    ature, in degrees Celsius, (2) the maximum temperature, in degrees Celsius, and (3) the total

    precipitation, in millimetres. Additional details on the data and variables are provided at the site

    www.ssc.ca/en/education/archived-case-studies/ssc-case-studies-2009.

    3. OBJECTIVES

    The primary objective of this case study was to determine to what extent climate, position on the

    tree bole (trunk), and current foliar biomass explain cross-sectional area increment and proportion

    of early and late wood.

    Other questions of interest included:

    How have temperature and precipitation affected the annual cross-sectional growth and the

    proportions of early and late wood in lodgepole pine?

    Is annual growth best explained by average annual temperature or do monthly maximum and/or

    minimum values provide a better explanation? Do early and late wood need to be considered

    separately?

    Does the use of climate variables to predict the growth and proportions of early and late wood

    provide more reliable estimates than the use of the growth and density measurements from

    previous years as measured from the interior rings?

    Received 31 October 2010

    Accepted 6 January 2011

    The determination of the relevant explanatory variables for the growth of lodgepole pine using mixed models

    Eric CORMIER* and Zheng SUN

    Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 2Y2

    Key words and phrases: Climate effects; biomass; mixed models; tree growth

    Abstract: In this paper a mixed model with nested random effects was used to model the cross-sectional

    area increment of lodgepole pine based on relevant explanatory variables. This model was used to show

    that minimum monthly temperature, monthly precipitation, and foliar biomass are positively related to the

    cross-sectional area increment, while an ordinal variable approximating lower trunk position and maximum

    monthly temperature are negatively related. It was shown that annual growth is better explained by monthly

    maximum and minimum temperatures than by average annual temperature and that the use of climate

    variables provided more reliable estimates for growth prediction than the use of the growth and density

    measurements from previous years. The Canadian Journal of Statistics 39: 185–189; 2011 © 2011 Statistical

    Society of Canada

    1. INTRODUCTION

    In this analysis, we addressed the following questions: (1) To what extent climate, position on

    the tree bole, and current foliar biomass explain cross-sectional area increment. (2) How have

    temperature and precipitation affected the annual cross-sectional growth? (3) Is annual growth best

    explained by average annual temperature or do monthly maximum and/or minimum values provide

    a better explanation? (4) Does the use of climate variables to predict the growth and proportions

    of early and late wood provide more reliable estimates than the use of the growth and density

    measurements from previous years as measured from the interior rings? We used mixed models to

    model the relationships between the explanatory variables: climate, position on the tree bole, and

    current foliar biomass, and the responses: cross-sectional area increment and the proportion of late

    wood of lodgepole pine. There were four features of the data that complicated the analyses: (1)

    Climate variables for each year were available and annual growth measurements were collected

    from tree samples, so we expected the data to exhibit autocorrelation. The correlation structure was

    accommodated by the use of random effects. (2) For each disc, there were measurements from two

    separate radii. Radius was treated as a nested random effect. It could have been assumed that the

    measurements along the two radii were two observations of the same variable and then an average

    could be taken, but due to the asymmetry of the tree radii, an average is not a good estimate of the

    variable. Alternatively, the measurements along the two radii could be used individually but there

    would be very high correlations between them. To allow both sets of measurements to be used

    and include the correlations between radii, a nested random effect was used (Pinheiro & Bates,

    2000, p. 40). (3) The ages of the trees varied resulting in a different number of observations for

    each tree. This complication corresponds to drop-out in a longitudinal study. It was not believed

    that the reason for the resulting missing data was informative because it depended only on the

    age of the tree and not on a growth factor. Therefore this situation was modelled assuming the

    data on missing years of a tree's life were missing at random and bias was not taken into account.

    (4) Destructive sampling meant that foliar biomass was collected at only one point in time; an

    inverse regression was conducted to determine the foliar biomass in other years.

    In addition to the climate variables, the growth and density measurements from previous years

    were used to predict the growth of early and late wood. This prediction was done using an ARIMA

    model and the reliability of these estimates were examined.

    2. METHODOLOGY

    Non-parametric regressions of late wood percentage and cross-sectional area increment against

    trunk position, foliar biomass, and annual maximum temperature were fit to determine the general

    trend in the response. We assumed that the nth measurement from the ground would correspond

    across trees regardless of the height of the trees so the ordinal variable position was used. The plot

    of cross-sectional area increment versus trunk position in Figure 1 shows that from trunk position

    1–4, there is a negative relationship between trunk position and cross-sectional increment; but

    after position 5, there is a positive relationship between trunk position and cross-sectional

    increment. This could be due to the fact that position 4 or 5 corresponds with the start of the crown. The

    plot of cross-sectional area increment versus biomass and the plot of percentage of late wood

    versus biomass show that high values of biomass correspond to high cross-sectional area increment

    and low late wood percentage, respectively. The plot of cross-sectional area increment versus

    annual maximum temperature and the plot of percentage of late wood versus annual maximum

    temperature show that the relationships of annual maximum temperature with cross-sectional

    increment and with percentage of late wood are different for each loca-

    tion (Kamloops and Quesnel). This suggests including two-factor interactions between location

    and annual maximum temperature. Similar interaction plots, which are not included in this paper,

    suggested including the interactions between the following pairs of variables: location and annual

    minimum temperature, location and precipitation, age of the tree and foliar biomass, and location

    and foliar biomass. We also included two-factor interactions among climate variables.

    One of the questions of interest was how the position on the tree bole affects the cross-sectional

    area increment. When examining the effect of position, it was necessary to account for the fact

    that trees had a large range in their absolute height. Measurements taken at 10 m from the ground

    on a tree that was 11 m tall have a different relative position on the tree trunk than the same

    measurement on a tree that was 40 m tall. To account for this an ordinal variable, position, was

    used to represent relative height.

    To determine whether monthly maximum and minimum temperature or average annual tem-

    perature best explain cross-sectional growth, the mixed model was fitted separately with monthly

    measurements and with average annual measurements and model goodness-of-fit criteria were

    compared. The variables that resulted in a better fit were used in the analysis. To model the corre-

    lation structure, a random intercept and a random slope for each tree were adopted. To model the

    nested effect presented by having two radii measured on each tree disc, a nested random effect

    was introduced.
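    As a concrete illustration, a minimal sketch of such a fit with the nlme package cited above
    (Pinheiro & Bates, 2000) follows; the data frame rings and all variable names are hypothetical
    placeholders for the case-study data, not the authors' actual code.

        library(nlme)

        # Random intercept and slope for each tree, plus a random intercept for
        # radius nested within tree; the fixed effects stand in for the position,
        # biomass, and climate terms described above.
        fit <- lme(growth ~ position + biomass + min_temp + max_temp + precip,
                   random = list(tree = ~ 1 + age, radius = ~ 1),
                   data = rings)
        summary(fit)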

    Figure 2: Mean cross-sectional area increment over age of the tree.

    Table 1: Climate effects on cross-sectional increment.

    Month

    J F M A M J J A S O N D

    Maximum temperature * * * * *

    Minimum temperature + + + + + + + + + + + +

    Precipitation + + + + + + + + + + + +

    Significant positive effect (+), significant negative effect (−), not significant (*) at the 5% level.

    3. RESULTS

    To improve model adequacy a Box-Cox transformation was performed with $\lambda = 0.25$. The

    function $g$ was determined to be cubic through examination of Figure 2, that is, $g(t_{ihk}) =$

    $\beta_{40} t_{ihk} + \beta_{41} t_{ihk}^{2} + \beta_{42} t_{ihk}^{3}$. Trunk position in the crown and the amount of foliar biomass have

    positive relationships, and age of the tree has a negative relationship with the cross-sectional

    area increment. The results of climate on cross-sectional increment are presented in Table 1.

    Results from the fitted models were quite consistent with patterns observed in Figure 1 except the

    interaction effect of annual maximum temperature and location was not significant.
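    The Box-Cox exponent reported above can be chosen by profile likelihood; a hedged sketch with
    MASS::boxcox is given below, where lm_fit is a hypothetical fixed-effects (lm) approximation of
    the mean structure, not the mixed model itself.

        library(MASS)

        # Profile log-likelihood of the Box-Cox parameter over a grid; the value
        # maximizing it plays the role of the lambda = 0.25 used above.
        bc <- boxcox(lm_fit, lambda = seq(-1, 1, by = 0.05))
        bc$x[which.max(bc$y)]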

    The estimated standard errors for the random effects were $\hat{\sigma}_1 = 0.5394$, $\hat{\sigma}_2 = 0.0225$, $\hat{\sigma}_3 =$

    $0.0284$. The nested effect for radii was not significant, implying that the measurements from the

    two different radii did not significantly differ.

    It was determined that annual growth was better explained using monthly maximum and

    minimum temperature values than average annual values because both AIC (66,135 vs. 67,750) and

    BIC (66,781 vs. 68,050) were smaller when monthly measurements were included in the model.

    However, although we used monthly climate variables as main effects, to reduce the number of

    interaction terms in the model, interactions were modelled using annual climate variables.
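    In R this comparison amounts to refitting the mixed model under each temperature specification
    and comparing the criteria directly; fit_monthly and fit_annual below are hypothetical lme fits
    of the two specifications.

        # Smaller AIC and BIC favour the monthly-temperature specification.
        AIC(fit_monthly, fit_annual)
        BIC(fit_monthly, fit_annual)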

    Examination of the residual plot showed heavy tails and skewness. This indicates that the

    residuals deviate from normality and that modelling them with a skew-elliptical distribution

    would be more appropriate (Jara, Quintana & San Martín, 2008).

    To model the dependence of the growth of early and late wood on measurements from previous

    years, the annual averages of early and late wood growth were fitted with an autoregressive,

    third-order integrated and first-order moving average (ARIMA) model. This model was used

    to predict future growth and determine prediction intervals (Figure 3).

    Figure 3: Five-step-ahead predictions on early and late wood growth.
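    A sketch of such a fit is given below; wood is a hypothetical annual series of mean early (or
    late) wood growth, and the first-order autoregressive term is an assumption, the other orders
    being those stated above.

        # ARIMA fit: AR(1) assumed, third-order differencing, MA(1).
        fit <- arima(wood, order = c(1, 3, 1))
        pred <- predict(fit, n.ahead = 5)            # five-step-ahead forecasts
        cbind(lower = pred$pred - 1.96 * pred$se,    # approximate 95% prediction
              upper = pred$pred + 1.96 * pred$se)    # intervals, as in Figure 3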

    4. CONCLUSIONS

    Data on the cross-sectional area increment and proportion of late wood for tree rings at multiple

    years, heights and geographical regions were analyzed, with the addition of climate data and

    measurements of foliar biomass for each year of the trees' lives. The use of mixed models allowed

    all of these covariates to be included in each model and their relationships to be modelled using

    all available data. However, the model is not sufficient to conclude cause-and-effect relationships

    between the variables and the growth of lodgepole pine. Since the monthly climate factors were

    correlated, the problem of multi-collinearity was present. Therefore, the model requires careful

    interpretation of the coefficients. In response to questions (1) and (2) in Section 1, the results show

    that minimum monthly temperature, monthly precipitation and foliar biomass were positively

    related to the cross-sectional area increment while lower trunk position and maximum monthly

    temperature were negatively related. Examination of question (3) showed that annual growth

    was better explained by monthly maximum and minimum temperatures than by average annual

    temperature. Due to the wide prediction intervals from the time series analysis it was believed that

    the use of climate variables provided more reliable estimates for growth prediction (question (4)).

    Possible future work that could be carried out to improve the model is the use of skew-elliptical

    distributions to model the residuals to account for both skewness and heavy tails in the error terms

    and the incorporation of splines to accommodate the temporal trend in the observations.

    ACKNOWLEDGEMENTS

    Many thanks to Farouk Nathoo for the very helpful correspondence.

    BIBLIOGRAPHY

    A. Jara, F. Quintana & E. San Martín (2008). Linear mixed models with skew-elliptical distributions: a

    Bayesian approach, Computational Statistics and Data Analysis, 52, 5033–5045.

    J. Pinheiro & D. Bates (2000). Mixed-Effects Models in S and S-PLUS, Springer-Verlag, New York.

    Determinants of lodgepole pine growth: Static and dynamic panel data models

    Mustafa SALAMH*

    Department of Statistics, University of Manitoba, Winnipeg, Man., Canada R3T 2N2

    Key words and phrases: linear mixed model; nested error components; autoregressive panel models; tree

    growth; climate effect; Pressler's hypothesis; random coefficients regression; two-stage least squares

    1. INTRODUCTION

    This study was concerned with modelling the wood properties and growth over time for the

    lodgepole pine in British Columbia. The primary objective was to determine to what extent cli-

    mate, position on the tree trunk, and current foliar biomass explain cross-sectional area increment

    and proportion of early and late wood. The study also addressed other questions, such as whether

    growth is best explained by average annual temperature or monthly temperature extremes, and

    whether the use of climate variables to predict the growth and wood properties provides more

    reliable estimates than the use of the growth and density measurements from previous years.

    2. METHODOLOGY

    Pressler's hypothesis states that the annual increment in cross-sectional area of wood increases

    linearly from the top of the tree to the base of the crown and is proportional to the amount of foliage

    above the point of the increment. Since tree and crown heights and foliar biomass were only

    available for the year in which the tree was cut down, they had to be estimated for other years

    of the tree's life. This was done using loess regression based on the tree's current height, crown

    length, and the height of disks. In order to answer the primary question about the effects of

    climate, disk position on the tree bole, and foliar biomass on the cross-sectional area increment

    and percentage of late wood, a linear mixed effects model was used to account for the two-level

    grouping structure of the data. The model was formulated according to Pressler's hypothesis

    without ignoring the possible random variability due to disks below the crown. Other nuisance

    factors such as age from pith, tree height, and the geographic location were included in the model

    to control for their effect. The climate model takes the form

    $$y_{ijt} = f_1(\text{disk age}_{ijt}, \text{tree height}_{it}, \text{site}_i) + f_2(\text{clim}_{t,t-1,t-2}) + \theta_1 D_{ijt} + \theta_2\, \text{topdistance}_{ijt} + \theta_3\, \text{topmass}_{ijt}$$

    $$\quad\; + \alpha_i + \beta_i D_{ijt} + \gamma_i\, \text{topdistance}_{ijt} + \delta_i\, \text{topmass}_{ijt} + u_{ij}(1 - D_{ijt}) + \varepsilon_{ijt}, \qquad i = 1, \dots, I, \; j = 1, \dots, J_i, \; t = 1, \dots, T_{ij}, \qquad (1)$$

    where the response $y_{ijt}$ is either the square root of the area increment or the log of the late to early wood

    ratio for disk $j$ within tree $i$ at year $t$; $f_1$ and $f_2$ are linear functions, and clim represents a vector of

    climate variables (temperature and precipitation). The variable $D$ is an indicator for being above

    the crown, topdistance is the product of $D$ and the distance from the tree top to the disk, and topmass

    is the product of $D$ and the gross amount of foliage above the disk. The standard distributional

    settings were assumed for the random effects and residual error, namely

    $$(\alpha_i, \beta_i, \gamma_i, \delta_i, u_{ij}, \, j = 1, \dots, J_i) \overset{iid}{\sim} N\big(0, \, \mathrm{diag}(\sigma_\alpha^2, \sigma_\beta^2, \sigma_\gamma^2, \sigma_\delta^2, \sigma_u^2 I_{J_i})\big), \qquad \varepsilon_{ijt} \sim N(0, \sigma_\varepsilon^2),$$

    with the $\varepsilon_{ijt}$ within each disk following an ARMA($p$, $q$) process, independent

    of the random effects. Two lags of the climate variables were included since the effects

    of previous growing conditions can last from 1 to 3 years. I focused on the spring, summer, and fall climate variables because early wood is laid down during the spring and late wood is

    laid down from mid-summer to fall. Several sub-models were fit using the R-package nlme.

    Diagnostic graphs were produced to ensure the adequacy of the models and to check the validity

    of assumptions. In almost all sub-models the residual error had ARMA(2,1) structure, which is

    consistent with Monserud (1986). Then, likelihood ratio and Wald-tests were performed to check

    the significance.
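    A minimal nlme sketch of one such sub-model with ARMA(2,1) within-disk errors follows; the
    data frame pines and all variable names are hypothetical placeholders, and the fixed-effects
    formula only stands in for the full structure of Equation (1).

        library(nlme)

        # Tree-level random effects for the Pressler terms, a disk-level random
        # intercept, and ARMA(2,1) serial correlation within each disk.
        fit <- lme(y ~ disk_age + tree_height + site + clim + D + topdistance + topmass,
                   random = list(tree = ~ D + topdistance + topmass, disk = ~ 1),
                   correlation = corARMA(p = 2, q = 1, form = ~ year | tree/disk),
                   data = pines)
        anova(fit)   # Wald-type tests of the fixed effects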

    To answer whether annual growth is best explained by average annual temperature or monthly

    temperature extremes, I used the structure of Equation (1) for each set of temperature variables.

    Since neither of the two models is nested in the other, I applied the idea of the J-test (Gujarati,

    2003, p. 533) to determine which model is preferred. To consider whether climate variables or

    growth and density variables from previous years better predict growth, I used a nested error

    component model with autoregressive dynamics and other explanatory variables to predict the

    annual growth. The proposed model equation is free of the climate variables. It is given by

    $$y_{ijt} = \phi_1 y_{ij,t-1} + \phi_2 y_{ij,t-2} + \beta_1 x_{ij,t-1} + \beta_2 x_{ij,t-2} + \gamma_1 z_{ij,t-1} + \gamma_2 z_{ij,t-2} + \mu_{ij} + \varepsilon_{ijt}, \qquad (2)$$

    where $y_{ijt}$ is defined as in Equation (1), and $x$, $z$ are the densities of early and late wood, respectively.

    The heterogeneity due to the trees and disks within trees is represented by the error component

    $\mu_{ij}$. The model is semiparametric, where the residual error, $\varepsilon_{ijt}$, satisfies the moment condition

    $E(\varepsilon_{ijt} \mid \mu_{ij}, y_{ij,t-1}, x_{ij,t-1}, z_{ij,t-1}, y_{ij,t-2}, x_{ij,t-2}, z_{ij,t-2}, \dots) = 0$. The model was fit using two-stage

    least squares on the first difference within disks. It was compared to the models of Equation (1)

    with regard to their out-of-sample prediction power. A test sample of size 2,718 taken across

    almost all the disks was discarded from the data and both models were fit using the remaining

    data. The fitted models were compared according to their mean squared error of prediction in the

    test sample. The MSE for the climate models was one-quarter that of the dynamic model.
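    A hedged sketch of this two-stage least squares step on within-disk first differences is shown
    below with AER::ivreg; the differenced variables (d_*) and the instruments are hypothetical
    names, with deeper lags of the response instrumenting the differenced lagged responses.

        library(AER)

        # Equation (2) in first differences; y_lag3 and y_lag4 serve as
        # instruments for the (endogenous) differenced lags of y.
        fit2sls <- ivreg(d_y ~ d_y_lag1 + d_y_lag2 + d_x_lag1 + d_x_lag2 +
                           d_z_lag1 + d_z_lag2 |
                           y_lag3 + y_lag4 + d_x_lag1 + d_x_lag2 +
                           d_z_lag1 + d_z_lag2,
                         data = diffs)
        summary(fit2sls)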

    3. CONCLUSION

    Regarding how the temperature and precipitation affect the annual cross-sectional growth and the

    proportions of early and late wood, the climate models showed high contributions of the current

    and lagged values of rain and temperature in explaining both of the target dependent variables.

    For example, annual growth was positively affected by rain (especially in spring) and negatively

    affected by extreme levels of temperatures. The proportion of early wood was positively affected

    by rain throughout the year and by higher temperatures in spring; however, the proportion of

    late wood was negatively affected by higher temperatures in mid-summer. It is recommended that

    proportions of early and late wood be considered separately to allow a clearer view of how they

    were individually affected by the climate.

    Regarding the extent to which the position on the tree bole and current foliar biomass explain

    the annual cross-sectional growth and the proportions of early and late wood, it was found that

    the higher the disk was in the crown, the smaller the annual wood increment and

    the higher the ratio of early wood. The foliar biomass had a positive linear effect on the annual

    wood increment within the crown, consistent with Pressler's hypothesis. However, it should be

    mentioned that the proportionality parameter is highly variable from one tree to another.

    Regarding whether annual growth is better explained by the average annual temperature or by

    monthly temperature extremes, it was found that annual growth was better

    explained by the monthly extremes of temperature than the average annual temperature. The incre-

    mental contribution of the annual temperatures over the monthly temperatures was not significant,

    but monthly temperatures were significant additions to the annual climate model.
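    For ordinary regression fits, the non-nested comparison behind this conclusion is available
    directly as the Davidson-MacKinnon J test; a sketch with hypothetical lm-style formulas
    standing in for the two temperature specifications:

        library(lmtest)

        # Each model is augmented with the fitted values of its rival and the
        # added term is tested for significance.
        jtest(y ~ annual_mean_temp + precip,
              y ~ monthly_max_temp + monthly_min_temp + precip,
              data = pines)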

    Regarding whether the use of climate variables to predict the two dependent variables provides

    more reliable estimates than the use of the growth density from previous years, it was clear that

    the climate models provide more reliable prediction than the autoregressive dynamic model. This

    emphasizes the importance of climate, disk position and foliar biomass in prediction as well as

    in explanation.

    ACKNOWLEDGEMENTS

    I am grateful to Dr. Liqun Wang for encouraging me to carry out this research project and for

    financial support through his research grants from NSERC and the National Institute for Complex

    Data Structures.

    BIBLIOGRAPHY

    D. N. Gujarati (2003). Basic Econometrics, 4th ed., McGraw-Hill, New York.

    R. A. Monserud (1986). Time-series analyses of tree-ring chronologies. Forest Science, 32(2), 349–372.

    Received 31 October 2010

    Accepted 6 January 2011

    Discussion of case study 1 analyses

    C. B. DEAN1*, Alison L. GIBBS2 and Roberta PARISH3

    1Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S6

    2Department of Statistics, University of Toronto, Toronto, ON, Canada M5S 3G3

    3British Columbia Ministry of Forests, Victoria, BC, Canada V8N 1R8

    Key words and phrases: Nonlinear mixed models; hierarchical random effects; transformations; prediction

    intervals; autoregressive integrated moving average model; measurement error; model diagnostics

    1. INTRODUCTION

    The two analyses incorporated similar features, but with different model formulations and

    variables; even so, they yielded similar major conclusions. Both Salamh and Cormier & Sun

    considered whether and to what extent climate, position on the tree bole, and foliar biomass explain cross-

    sectional area increment and proportion of early and late wood. They both also specifically

    investigated the effects of temperature and precipitation, and whether average annual temper-

    ature or monthly maximum and/or minimum values better explain variability in area increment.

    Additionally, Salamh considered whether climate variables explain the variability in growth

    and proportions of early and late wood better than measurements from previous years of these

    variables.

    2. THE MODELS

    Both Salamh and Cormier & Sun utilized nonlinear mixed effects models with hierarchical ran-

    dom effects. Transformations of the responses were considered including the square root of area

    increment and the logarithm of the ratio of early to late wood. Lags of climate variables were

    included as explanatory variables in Salamh, but not in Cormier & Sun. Salamh utilized a con-

    ceptually based approach, modelling the growth as increasing linearly from the top of the tree

    to the base of the crown, with tree-to-tree variability in this linear functional form. Cormier &

    Sun incorporated a variable labelled position (taking values 1, 2, 3, . . . ) which is meant to reflect

    the height of measurement of the area increment from the base of the tree. Note that the heights

    at which measurements are taken are not multiples of a specific value, so when position = 2,

    the height from the base of the tree is not twice that when position = 1. As well, the heights at

    which measurements are taken are not the same from tree to tree. So the modelling of the trans-

    formed response variable as a linear function of the variable position, as considered in Cormier

    & Sun, is somewhat of an approximation to including a linear function of the height at which

    measurements are taken in the model. Future models should incorporate a more accurate measure

    of relative position by using tree heights and height to sample. Both Salamh and Cormier & Sun

    include (estimates of) biomass in the model, with the transformed response changing linearly with

    biomass increases. Salamh incorporated autoregressive errors, while Cormier & Sun incorporated

    interaction terms. Site-specific intercepts were included in Salamh, while Cormier & Sun omitted

    site effects. Cormier & Sun also handled the modelling of early and late wood separately, looking

    at the annual averages of early and late wood over all trees and investigating how these are related

    to lagged effects.

    Case study 2: Proteomic biomarkers for disease status

    Robert F. BALSHAW* and Gabriela V. COHEN FREUE

    NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6

    Key words and phrases: Variable selection; class prediction; imputation

    1. BACKGROUND

    Renal transplantation saves the lives of hundreds of Canadians every year. However, every

    transplant recipient must be monitored closely for signs of acute rejection, which is the body's

    immunologic and inflammatory response to the presence of a foreign organ. If not properly treated,

    acute rejection can lead to loss of the transplanted organ, dialysis or even death. Unfortunately,

    acute rejection can only be detected by biopsy, a distressing, uncomfortable and expensive surgical

    procedure that can be required multiple times during the first year post-transplant.

    The Biomarkers in Transplantation project was funded by Genome Canada to identify non-

    invasive biomarkers for the detection and prediction of acute rejection based on proteomic analysis

    of peripheral blood samples. A clinical test based on such a biomarker could lead to a better

    method for monitoring transplant recipients, reducing costs, improving treatment outcomes, and

    substantially improving recipients' quality of life.

    Measures of protein abundance data were collected from a selection of kidney transplant

    recipients who were known at the time of the blood sampling to be either experiencing treatable

    acute rejection (AR) or not experiencing acute rejection (NR). Each sample was drawn from

    an independent subject within the first 3 months post-transplant. For each AR sample, two NR

    samples were selected at approximately the same time point post-transplant.

    The goal of this case study was to utilize these proteomic data to create a classifier for acute

    rejection, which could then be evaluated on a test set of 15 samples.

    2. THE DATA

    At the time of the Case Study Competition for the 2009 SSC meeting, potential intellectual property

    considerations meant that the true nature of the dataset had to be hidden from the participants.

    For example, AR and NR status were referred to as the patients being in an active or inactive state of

    disease. The data were also supplemented with synthetic sample data, constructed to mimic the

    observed characteristics of the AR and NR samples, both to enrich the size of the dataset and to

    further protect intellectual property.

    The dataset included 11 samples from AR patients, 21 samples from NR patients, plus an

    additional 15 samples whose classification was hidden. All experimental samples were taken

    from independent patients at the time when acute rejection was suspected or at a corresponding

    matched time-point for non-rejection samples. Historically, approximately 10% of renal transplant

    recipients experience rejection during the first few months post-transplant; however, the study

    design was to select approximately 2 NR samples for every AR sample.

    A multiplex proteomic technology, called iTRAQ, was used to measure the protein

    abundances of the experimental samples relative to the quantity of the corresponding protein in a

    reference sample. The reference samples were taken from a homogeneous batch of blood pooled

    from 16 healthy volunteers.

    Plasma was obtained from each whole blood sample through centrifugation. To enhance

    detection sensitivity, the plasma samples were first depleted of the 14 most abundant proteins.

    Trypsin was used to digest the proteins in the depleted samples, and the resulting peptides were

    labelled with one of four distinct iTRAQ reagents (i.e., chemical tags with unique molecular

    weights but otherwise identical chemical properties). The labelled samples were then pooled and

    processed using a MALDI TOF/TOF technology.

    Each iTRAQ run was designed with three experimental samples and one reference sample.

    Peptide identification and quantitation was carried out by ProteinPilot Software v2.0 and the data

    were assembled into a comprehensive summary of the relative abundance of the proteins in the

    experimental samples. As the same reference sample was used in all runs, these relative abundance

    measures were comparable across experimental runs.

    Each run of the experiment detected and measured several hundred proteins (about 200 per

    run), but not every protein was identified in every sample, nor even in every run, leading to a

    complex pattern of missing data.

    If a protein was not identified in a particular experimental sample, the protein's relative abun-

    dance level was then unknown. When this happened for the reference sample in a particular run,

    then the relative levels for this protein could not be estimated for any of the three experimental

    samples in that run.

    Proteins were identified in the data using arbitrary protein identifiers (BPG0001–BPG0460).

    Though this prevented the incorporation of biological/subject-matter context in the analysis, it

    was necessary due to potential intellectual property concerns.

    In addition to acute rejection status and relative abundance measures for 460 proteins, the

    sex, race, and age of each subject were provided.

    Received 31 October 2010

    Accepted 6 January 2011

    Disease status determination: Exploring imputation and selection techniques

    Linghong LU, Rena K. MANN*, Rabih SAAB and Ryan STONE

    Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada V8W 3R4

    Key words and phrases: BPCA; imputation; k-NN; LARS; LASSO; LLS; selection method; SLR; SVD

    Abstract: Analyzing a proteomics dataset that contains a large number of independent variables (biomarkers)

    with few response variables and many missing values can be very challenging. The authors tackle the problem

    by first exploring different imputation techniques to treat the missing values and then investigating multiple

    selection techniques to pick the best set of biomarkers to predict the unknown patients' disease status. They

    conclude their analysis by cross-validating the different combinations of imputation and selection techniques

    (using the set of patients of known disease status) in order to find the optimal technique for the supplied

    dataset. The Canadian Journal of Statistics 39: 197–201; 2011 © 2011 Statistical Society of Canada

    1. INTRODUCTION

    There is a large amount of missing information in the data; as such, we eliminated proteins with

    40 or more missing values, leaving 330 proteins. A log2 transformation of the data was then taken

    since the data are fold changes.

    In order to select the set of proteins that accurately predict the status of the unknown list of

    patients, combinations of different imputation methods and selection techniques were studied and

    then cross validated using the protein expressions for individuals with known disease status.

    2. IMPUTATION

    Several imputation techniques were explored: k-Nearest Neighbours (k-NN), Local Least Squares

    (LLS), Singular Value Decomposition (SVD), and Bayesian Principal Component Analysis

    (BPCA).

    k-nearest neighbours is a sensitive and robust method in which a missing value is estimated

    from the proteins whose expressions are most similar to that protein in the other samples. The optimal

    value of k, which is the number of neighbours to be used, has been shown in the literature to be in

    the range of 10–20 (Troyanskaya et al., 2001; Mu, 2008). Thus, the values of 10 and 20 were

    chosen in this instance. For each protein, the k-NN imputed value was found using Euclidean

    distance (Troyanskaya et al., 2001) for the columns where that protein was not missing. If more

    than 50% of a protein was missing, the missing values were imputed using each patient's average.

    Otherwise, the average of the k-nearest neighbours was used to estimate the missing value.
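    A sketch of this step with the Bioconductor impute package, which implements the k-NN scheme
    of Troyanskaya et al. (2001) cited above; expr is a hypothetical proteins-by-samples matrix.

        library(impute)   # Bioconductor

        # k-NN imputation with k = 10; rows (proteins) exceeding the rowmax
        # missingness proportion fall back to a mean-based fill.
        imputed <- impute.knn(as.matrix(expr), k = 10, rowmax = 0.5)$data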

    In local least squares, a missing protein is evaluated as a linear combination of similar proteins

    (Kim, Golub & Park, 2005). The method borrows from k-NN and from least squares imputation.

    It is most effective when there is strong correlation in the data as the k proteins were selected

    based on those which had the highest correlation with the target protein. We used Spearman and

    Kendall as the two correlation types, along with different values of k for the neighbours.

    Singular value decomposition starts by taking the data set and ignoring the missing entries.

    Then it calculates the mean for each of the rows of complete data. By initializing the missing

    values to be the previously calculated row means, an iterative procedure finds the missing values.

    Next, SVD is performed on the newly formed complete data set and the solution that is produced

    replaces the row means in the missing values. These steps are repeated until the solution converges,

    which usually happens after five iterations (Hastie et al., 1999).
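    A minimal sketch of this iterative loop under the stated assumptions (rank-e approximation,
    row-mean initialization, five iterations); X is a hypothetical numeric matrix with NAs.

        svd_impute <- function(X, e = 2, iters = 5) {
          miss <- is.na(X)
          # initialize the missing entries with their row means
          X[miss] <- rowMeans(X, na.rm = TRUE)[row(X)[miss]]
          for (i in seq_len(iters)) {
            s <- svd(X, nu = e, nv = e)
            # replace the current guesses with the rank-e reconstruction
            X[miss] <- (s$u %*% diag(s$d[1:e], e, e) %*% t(s$v))[miss]
          }
          X
        }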

    Bayesian principal component analysis is another imputation method that was used to impute

    missing protein expression values. It combines an EM approach for PCA with a Bayesian model

    and is based on three processes: principal component regression, Bayesian estimation, and an

    expectation-maximization (EM)-like repetitive algorithm (Oba et al., 2003). The algorithm was

    developed for imputation and very few components are needed in order to ensure accuracy. It

    is an iterative process; it terminates either when the increase in precision falls below the

    threshold of 0.01 or when the set number of iterations is reached.
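    A sketch using the Bioconductor pcaMethods package, whose bpca method implements the Oba
    et al. (2003) algorithm; expr is again a hypothetical proteins-by-samples matrix, transposed
    because pcaMethods expects observations in rows.

        library(pcaMethods)   # Bioconductor

        # BPCA imputation with a small number of components
        res <- pca(t(expr), method = "bpca", nPcs = 3)
        completed <- t(completeObs(res))   # back to proteins-by-samples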

    3. SELECTION METHODS

    We used several methods to select differentially expressed proteins between the active and inactive

    groups of patients. The simplest of these methods is the t-test. We used this basic test to compare the

    protein expressions between the two groups. Significant results of the t-test provided a preliminary

    analysis and a list of proteins potentially being expressed differently between the active and

    inactive groups. We also used the Wilcoxon signed rank test, which is conceptually similar to the

    t-test but provides a more robust approach.
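    A sketch of this per-protein screen at the 1% level; expr is a hypothetical proteins-by-samples
    matrix and status the two-group factor, with wilcox.test substitutable for t.test.

        # two-sample test for each protein; keep those with P < 0.01
        pvals <- apply(expr, 1, function(z) t.test(z ~ status)$p.value)
        selected <- rownames(expr)[pvals < 0.01]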

    More advanced selection techniques were then explored to select biomarkers influencing the

    disease status such as the Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle

    Regression (LARS), and Sparse Logistic Regression (SLR).

    3.1. LASSO and LARS

    The concept of the LASSO, a technique suggested by Tibshirani (1996), is similar to ordinary

    least squares but uses a constraint that sets some of the coefficients to zero, shrinking the number

    of independent variables included in the model.

    LARS is a model selection method, proposed by Efron et al. (2004), that is computationally

    simpler than LASSO. It starts by equating all of the coefficients to zero and then adds the predictor

    most correlated with the response. The next predictor added is the predictor most correlated with

    the current residuals. The algorithm proceeds in a direction equiangular between the predictors

    that are already in the model. Efron et al. (2004) presented modifications to the LARS algorithm

    that generate the LASSO estimates and these were used to produce the LARS estimates in the

    paper.
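    A hedged sketch of the LASSO fit computed by the LARS algorithm, using the lars package of
    Efron et al. (2004); X (the imputed protein matrix) and y (the 0/1 disease status) are
    hypothetical names.

        library(lars)

        # type = "lasso" applies the LASSO modification of the LARS path;
        # type = "lar" would give the unmodified LARS estimates.
        fit <- lars(X, y, type = "lasso")
        coef(fit, s = 0.5, mode = "fraction")   # coefficients at one point on the path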

    3.2. SLR

    Table 1: Number of individuals of known disease status misclassified by the different imputation and

    selection methods.

              SVD      SVD      k-NN      k-NN             LLS             LLS            LLS
             (e = 3)  (e = 2)  (k = 20)  (k = 10)  BPCA  (k = 20 spear)  (k = 10 ken)  (k = 5 ken)

    SLR        21       16       17        13       12         18             13            16

    LASSO       0        0       12        13        1         13             16            15

    LARS        3        0        3         4        0          5              1             1

    The rows and columns correspond to the different selection and imputation methods respectively. LLS has a

    correlation type of either Spearman (spear) or Kendall (ken).

    Table 2: Predicted status for unknown subjects using LARS combined with SVD imputation, where I and A

    stand for inactive and active respectively.

    Subject   2   7   8   11   12   13   17   18   29   31   33   34   35   38   41

    Status    I   I   I   I    I    A    A    A    I    I    I    I    I    A    I

    Sparse logistic regression (Shevade & Keerthi, 2003) uses penalized

    maximum likelihood estimation to obtain estimates of the model coefficients. The method is

    similar to LASSO in that it uses a constraint to shrink the logistic regression model.

    4. METHOD SELECTION

    To select the set of proteins that accurately predict the status of the unknown list of patients,

    combinations of imputation methods and selection techniques were studied and then cross

    validated using the protein expressions for individuals with known status. Misclassification levels

    observed for the eight imputation and three selection methods employed are displayed in Table 1.

    The LARS algorithm had relatively few misclassified cases compared to SLR and LASSO. The

    SLR method had high misclassification rates and was therefore dismissed for prediction purposes.

    SVD and BPCA appear to be the best imputation techniques to use.

    5. PROTEIN SELECTION AND PREDICTION

    We chose the proteins picked by most methods. A protein was deemed to be selected as differen-

    tially expressed by the t- and Wilcoxon signed rank tests if the respective P-values were less than

    0.01. Proteins that had nonzero coefficients in the SLR, LASSO and LARS models were consid-

    ered differentially expressed between the active and inactive groups of patients. The maximum

    frequency of selection for proteins was 13, so 10 was taken as the cut-off value.

    We noticed that seven proteins stood out: BPG0036, BPG0105, BPG0235, BPG0262,

    BPG0333, BPG0381, and BPG0447. For prediction purposes we used the selected proteins men-

    tioned above and their corresponding coefficients from the LARS algorithm combined with the SVD

    imputation; the resulting predictions are shown in Table 2.

    Table 3: Predicted status for unknown subjects where I and A stand for inactive and active status

    respectively.

    Subject   2   7   8   11   12   13   17   18   29   31   33   34   35   38   41

    Status    I   A   I   I    I    A    A    I    I    A    I    I    A    A    I

    6. LOGISTIC REGRESSION

    After determining the seven significant proteins, the variables of race, gender and age were

    analyzed to determine whether they were significant. Chi-square tests of independence determined

    that race by status was not significant while gender was significant, and linear regression showed that

    age by status was not significant. A logistic regression model (Dobson, 2002) was then fitted with the

    variables gender and all seven of the selected proteins to produce a second set of predictions for

    unknown disease status.

    Three proteins, BPG0036, BPG0105, and BPG0333, were found to be significant by stepwise

    algorithms on the fitted model. The expression levels of any of these three proteins changed

    significantly between the active and inactive groups. Fitting the reduced model on the data of

    given status gave the final model:

    $$\log \frac{\hat{\pi}_j}{1 - \hat{\pi}_j} = 1.309 - 6.948\, \mathrm{BPG0036} + 4.509\, \mathrm{BPG0105} - 2.715\, \mathrm{BPG0333}$$

    for $j = 1, \dots, 32$. The status of patient $j$ is determined to be active if the corresponding $\hat{\pi}_j$ is less

    than 0.5 and inactive otherwise. The predictions of the unknowns are shown in Table 3. To check

    consistency, we predicted the status of the known-status patients (11 active and 21 inactive). The

    misclassification was zero, which gave us confidence in the prediction of the 15 unknowns.
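    A sketch of this final reduced model and its predictions; known and unknown are hypothetical
    data frames, and which side of the 0.5 threshold maps to active depends on how the status
    factor is coded in glm.

        # logistic regression on the three proteins retained by the stepwise search
        fit <- glm(status ~ BPG0036 + BPG0105 + BPG0333,
                   family = binomial, data = known)
        pi_hat <- predict(fit, newdata = unknown, type = "response")
        ifelse(pi_hat < 0.5, "A", "I")   # threshold direction as stated in the text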

    7. CONCLUSIONS

    The main problem encountered when dealing with microarray data was the abundance of missing

    data; therefore, the choice of imputation method proved to be vital. Both logistic regression

    and LARS were shown to be very good selection methods; the LARS algorithm had low mis-

    classification rates. The three proteins picked by logistic regression were good predictors for a

    patients status as there was an obvious separation between the inactive and active patients.

    ACKNOWLEDGEMENTS

    The support and guidance of Dr. Mary Lesperance is greatly appreciated. The study was supported

    by Fellowships from the University of Victoria.

    BIBLIOGRAPHY

    A. Dobson (2002). An Introduction to Generalized Linear Models, Chapman & Hall/CRC, Washington.

    B. Efron, T. Hastie, I. Johnstone & R. Tibshirani (2004). Least angle regression, Annals of Statistics, 32(2),

    407–499.

    T. Hastie, R. Tibshirani, G. Sherlock, M. Eisen, P. Brown & D. Botstein (1999). Imputing missing data

    for gene expression arrays. Technical Report, Department of Statistics, Stanford University, Palo Alto,

    California, USA.

    R. Mu (2008). Applications of correspondence analysis in microarray data analysis. MSc Thesis, University

    of Victoria, Victoria, British Columbia, Canada.

    S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara & S. Ishii (2003). A Bayesian missing value

    estimation method for gene expression profile data, Bioinformatics, 19(16), 2088–2096.

    S. Shevade & S. Keerthi (2003). A simple and efficient algorithm for gene selection using sparse logistic

    regression, Bioinformatics, 19(17), 2246–2253.

    R. Tibshirani (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical

    Society, Series B, 58(1), 267–288.

    O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. B. Altman

    (2001). Missing value estimation methods for DNA microarrays, Bioinformatics, 17(6), 520–525.

    Received 31 October 2010

    Accepted 6 January 2011

    Bootstrap multiple imputation; high-dimensional model validation with missing data

    Billy CHANG*, Nino DEMETRASHVILI and Matthew KOWGIER

    Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada M5T 3M7

    Key words and phrases: Bootstrap validation; EM-algorithm; multiple imputation; penalized likelihood

    1. METHODOLOGY

    We ignored age, sex, and race when building the classifiers. We then removed proteins with more

    than 50% missing values (191 proteins were left), and transformed zero relative abundance values to

    0.0001 to avoid infinite values when applying the log-transform.

    We compared four classifiers: regularized discriminant analysis (RDA) with penalization

    constant 0.99, diagonal linear discriminant analysis (DLDA), penalized logistic regression

    (PLR) with penalization constant 0.01, and a linear kernel support vector machine (SVM) with

    penalization constant C = 1 (for the Lagrange multiplier). The classifiers are described in detail

    in Hastie, Tibshirani and Friedman (2009).

    Multiple imputation (Little & Rubin, 2002) and bootstrap validation (Hastie, Tibshirani &

    Friedman, 2009) were used to compare and validate the four classifiers. We employed ROC

    curves and AUC (area under the ROC curve) as the classifiers' performance metrics. Due to the

    small-sample validation sets created by bootstrapping, we employed the BS.632+ error correction

    method (Hastie, Tibshirani & Friedman, 2009) to adjust for the small-sample prediction error bias

    when comparing the prediction error for the four classifiers.

    We assume the 191 log-abundance scores for subject $i$ ($i = 1, \dots, N = 47$) are multivari-

    ate normal: $x_i \sim N(\mu, \Sigma)$. To avoid singular covariance estimates, we penalize the trace of the

    precision matrix:

    $$\ell(\mu, \Sigma \mid \{x_i\}_{i=1}^N) = \log\det(\Sigma^{-1}) - \mathrm{trace}(S\, \Sigma^{-1}) - \lambda\, \mathrm{trace}(\Sigma^{-1}).$$

    Here $S$ is the sample covariance matrix. The maximum penalized-likelihood estimates are:

    $$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \hat{\Sigma} = S + \lambda I,$$

    where $I$ is the identity matrix. In the presence of missing data, we use the EM-algorithm as

    described in Little & Rubin (2002) with a slight modification in the M-Step for parameters

    estimation:

    E-step: compute conditional expectations and covariance of the missing data given the

    observed data.

    M-step:

    (1) Fill in the missing entries with their conditional expectations.

    (2) Update the sample mean of the filled-in data.

(3) Update the covariance estimate as the sum of the filled-in-data sample covariance, the conditional covariance of the missing data given the observed data, and λI (we used λ = 0.02).

Figure 1: ROC curves.
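A minimal R sketch of this penalized EM imputation, under the model above (the function and all object names are ours; edge cases such as fully missing rows are ignored):

    # One run of the EM algorithm for the trace-penalized multivariate normal model.
    # X: n x p data matrix containing NAs; lambda: trace penalty (0.02 above).
    em_penalized <- function(X, lambda = 0.02, n_iter = 50) {
      n <- nrow(X); p <- ncol(X)
      miss <- is.na(X)
      mu <- colMeans(X, na.rm = TRUE)
      Xf <- X
      for (j in seq_len(p)) Xf[miss[, j], j] <- mu[j]   # mean fill-in to start
      Sigma <- cov(Xf) * (n - 1) / n + lambda * diag(p)
      for (it in seq_len(n_iter)) {
        Cond <- matrix(0, p, p)   # accumulates conditional covariance of missing entries
        for (i in seq_len(n)) {
          m <- miss[i, ]
          if (!any(m)) next
          o <- !m
          # E-step: conditional mean and covariance of missing given observed
          B <- Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
          Xf[i, m] <- mu[m] + drop(B %*% (X[i, o] - mu[o]))
          Cond[m, m] <- Cond[m, m] + Sigma[m, m, drop = FALSE] -
            B %*% t(Sigma[m, o, drop = FALSE])
        }
        # M-step: update the mean and the penalized covariance S + lambda * I
        mu <- colMeans(Xf)
        Sigma <- cov(Xf) * (n - 1) / n + Cond / n + lambda * diag(p)
      }
      list(mu = mu, Sigma = Sigma, imputed = Xf)
    }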

To validate the classifiers, we employed the following procedure:

(1) Estimate the missing-data distribution using all 47 subjects by the above EM algorithm.

(2) Generate 100 imputed data sets with missing values imputed by values drawn from the estimated missing-data distribution.

(3) From the 100 imputed data sets, remove the 15 subjects with unknown status.

(4) For each data set created in steps 2 and 3, create 200 bootstrap resampled data sets. Fit the classifier on each bootstrap sample and compute averaged out-of-bag (OOB) ROC curves, OOB AUC scores, and the BS.632+ error.

    Note that all the penalization constants were chosen only to eliminate the singularity issues

    due to the high dimensionality of the data; no fine-tuning was done to optimize classification

    performance.
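To make step (4) concrete, here is a sketch of the out-of-bag AUC computation for a single imputed data set, assuming the pROC package and a generic fit/predict pair (all names are ours; the BS.632+ correction is omitted for brevity):

    library(pROC)

    # Average out-of-bag AUC over B bootstrap resamples of one imputed data set.
    # fit_fun(X, y) returns a fitted classifier; pred_fun(model, X) returns scores.
    oob_auc <- function(X, y, fit_fun, pred_fun, B = 200) {
      n <- nrow(X)
      aucs <- rep(NA_real_, B)
      for (b in seq_len(B)) {
        idx <- sample(n, n, replace = TRUE)
        oob <- setdiff(seq_len(n), idx)
        if (length(unique(y[oob])) < 2) next   # need both classes out of bag
        model <- fit_fun(X[idx, , drop = FALSE], y[idx])
        scores <- pred_fun(model, X[oob, , drop = FALSE])
        aucs[b] <- as.numeric(auc(roc(y[oob], scores, quiet = TRUE)))
      }
      mean(aucs, na.rm = TRUE)
    }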

    2. RESULTS

The averaged ROC curve for PLR lies above the ROC curves of the other three classifiers (Figure 1), suggesting that PLR achieves better performance on average across all levels of the threshold. The AUC score distributions (results not shown) also suggest that PLR consistently achieves good separation ability; however, certain bootstrap samples give very low OOB AUC. We used the OOB BS.632+ error to check the classifiers' performance at a threshold of 0.5 for RDA, DLDA, and PLR (i.e., a subject is classified as Active if P(Active | the subject's protein scores) > 0.5) and 0 for SVM, and found that PLR's errors are also consistently lower than those of the other three classifiers (results not shown).

    3. CONCLUSIONS

Based on the above observations, PLR is the best classifier among the four compared. However, the huge variance in the AUC and BS.632+ errors casts doubt on whether PLR can truly outperform the other classifiers on new data.


    ACKNOWLEDGEMENTS

    We thank our team mentor Rafal Kustra for his guidance and support throughout.

    BIBLIOGRAPHY

T. Hastie, R. Tibshirani & J. Friedman (2009). The Elements of Statistical Learning, 2nd ed., Springer, New York.

    R. J. A. Little & D. B. Rubin (2002). Statistical Analysis with Missing Data, 2nd ed., Wiley, New York.

    Received 31 October 2010

    Accepted 6 January 2011


Exploring proteomics biomarkers using a score

    Qing GUO1*, Yalin CHEN2 and Defen PENG2

    1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada

L8N 3Z5
2Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1

    Key words and phrases: AUC; biomarker; cluster analysis; cross-validation; jittered QQ plot; logistic

    regression; missing data imputation; proteomics score; sensitivity; specificity

    1. METHODOLOGY

Proteins are constantly changing and interacting with each other, which makes it challenging to identify a single protein as a biomarker. Often, a function is implemented by several proteins, among which some up-regulate and some down-regulate. In this article, we propose a particular procedure (PHD-SCORE) to seek a proteomics score that is a combination of relevant functional groups of proteins. This score can be used, in place of a single or a limited number of potential proteins, as the biomarker to distinguish active disease status from inactive. The procedure involves the following steps: Process the data; condense the High-dimensional Dataset; identify Statistically meaningful clusters and Calculate a proteomic score; build statistical models to find one more appropriate for the data; determine patients' disease status by choosing an Optimal probability cut point; test model prediction ability by using cross-validation; Repeat the procedure until a model is chosen with proper predictions and a low Error rate; apply the chosen model to unknown cases.

    1.1. Process the Data

    Data checking has been conducted to ensure consistency and integrity. Nothing peculiar was

    found. Among 47 subjects, 11 had active disease, 21 were inactive, and 15 had unknown disease

    status. Based on our exploration, any protein with more than 38.2% of observations missing in

    either group (equivalent to 4 in active and 8 in inactive groups) was excluded from further analysis.

    This left us with 160 protein variables in the data set, together with 3 covariates (age, sex, race).

    1.2. Condense the High-Dimensional Protein Data

    Hierarchical cluster analysis was employed to reduce the dimensionality of the protein data. Weused (1 r) (r is the Pearson correlation coefficient) as the similarity between clusters using

    average linkage. Usually, cutting too many clusters will result in small number of variables

    within each cluster, and too few clusters will mean that many less correlative variables will be

    included. Given these considerations, we decided to choose 13 clusters with corresponding cut-off

    correlation coefficient of 0.85.
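In R, this clustering step can be sketched as follows (prot is our name for the subjects-by-proteins matrix):

    # Dissimilarity 1 - r between proteins, average-linkage hierarchical clustering
    d <- as.dist(1 - cor(prot, use = "pairwise.complete.obs"))
    hc <- hclust(d, method = "average")
    cluster_id <- cutree(hc, k = 13)    # the 13 clusters chosen above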

    1.3. Impute Missing Data

Two imputation methods were used to handle missing data: hot-deck and linear regression. In the latter, the missing value of a protein is predicted and imputed from a regression model using all available predictors.

  • 7/31/2019 Case Studies Eric Cormier

    26/38

    206 Vol. 39, No. 2

    Table 1: Fitted coefficients for the logistic regression model.

    Predictor Estimate SE P-value

    Constant 4.05 4.41 0.36

    Age 0.06 0.07 0.42

    Proteomic score 20.75 8.64 0.02

    1.4. Identify Informative Clusters

The clusters were selected in two steps: first, we considered 13 plots, one per cluster, of the average value of the proteins in the cluster for the 32 subjects with known status against disease status, to visually assess discrimination between the active and inactive groups; second, we applied stepwise logistic regression. This gave us 2 clusters, with 8 variables in one cluster and 12 variables in the other. The proteomics score was calculated as the mean difference between the 2 clusters.

    1.5. Build Model and Optimal Estimate of Cut-Off Probability

The logistic regression model was built using the covariates and the calculated proteomics score as predictors. Gender and race were not statistically significant. The final model is logit(P) = β0 + β1 Age + β2 Score, where P is the probability of having active disease status. A probability cut-off plot was drawn to detect the patients' disease status. The AUC and a jittered QQ plot (Zhu & El-Shaarawi, 2009) were effective tools for diagnosing the adequacy of the fitted model in numerical and graphical ways.
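A sketch of the score construction and model fit in R, with illustrative names (dat holds the 32 labelled subjects, with active coded 0/1; cluster1 and cluster2 index the proteins in the two selected clusters):

    # Proteomic score: mean difference between the two selected clusters
    dat$score <- rowMeans(dat[, cluster1]) - rowMeans(dat[, cluster2])
    fit <- glm(active ~ Age + score, family = binomial, data = dat)
    prob <- predict(fit, type = "response")
    pred <- ifelse(prob > 0.26, "Active", "Inactive")   # cut-point from Section 2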

    1.6. Test Prediction Ability and Apply to the Unknown Cases

Cross-validation within the 32 subjects with known disease status was utilized to provide a nearly unbiased estimate of the prediction misclassification rate (Farruggia, Macdonald & Viveros-Aguilera, 1999). We used a random split to partition the observed data into a training set (2/3 of the 32 subjects, about 22) and a validation set (1/3 of the 32 subjects, about 10), and pseudo-randomly repeated the cross-validation 200 times to assess the misclassification rate. Steps 1–6 were run repeatedly until a model with a low misclassification rate was chosen. Finally, we applied the model to the 15 unknown cases to identify their disease status.
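The repeated random-split validation can be sketched as follows (names as in the previous sketch; the seed is illustrative):

    set.seed(2009)
    err <- replicate(200, {
      tr <- sample(nrow(dat), 22)            # about 2/3 of the 32 labelled subjects
      m <- glm(active ~ Age + score, family = binomial, data = dat[tr, ])
      p <- predict(m, newdata = dat[-tr, ], type = "response")
      mean((p > 0.26) != dat$active[-tr])    # misclassification on the held-out third
    })
    mean(err)                                # estimated misclassification rate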

    2. RESULTS

The analyses were performed using the statistical package R. The maximum likelihood estimates of the coefficients of the fitted model and their P-values are shown in Table 1. The proteomic score is significant at level 0.05 for detecting patients' disease status, while the non-significant covariate age was retained to adjust logit(P).

The probability cut-point of 0.26 was chosen based on the trade-off between sensitivity and specificity. If a patient's disease probability is greater than 0.26, active disease status is assumed. The misclassification rates for the active group, the inactive group, and overall are 0.13, 0.11, and 0.12, respectively. The chosen model was applied to the 15 participants with unknown status; patients 7, 12, 13, 17, and 18 were identified as having active disease.

    3. CONCLUSIONS

There are limitations. Firstly, the chosen percentage of missingness (here 38.2%) for each protein was an arbitrary choice.


Finally, some factors, such as the number of clusters, the cut-point for the probability of disease, and some specific proteins, could be more precisely targeted and ascertained with the involvement of the principal investigator, so as to better understand the underlying biological mechanism. Steps 1–6 of the PHD-SCORE procedure need to be run repeatedly until a satisfactory model is found.

    ACKNOWLEDGEMENTS

    We thank Dr. Rong Zhu for the guidance, encouragement and availability to us during the study.

We are also indebted to Drs. Eleanor Pullenayegum and Román Viveros-Aguilera for their financial

    support and constructive suggestions. We also would like to acknowledge Drs. Stephen Walter,

    Harry Shannon, and Lehana Thabane for their helpful comments.

    BIBLIOGRAPHY

    J. Farruggia, P. D. Macdonald & R. Viveros-Aguilera (1999). Classification based on logistic regression and

trees. Canadian Journal of Statistics, 28, 197–205.

R. Zhu & A. H. El-Shaarawi (2009). Model clustering and its application to water quality monitoring. Environmetrics, 20, 190–205.

    Received 31 October 2010

    Accepted 6 January 2011


A multiple testing procedure for proteomic biomarkers of disease status

    Zhihui (Amy) LIU* and Rajat MALIK

    Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada L8S 4K1

    Key words and phrases: Multiple testing procedure; imputation; classification tree; heatmap; proteomic

    biomarker

    1. METHODOLOGY

    When hundreds of hypotheses are tested simultaneously, the chance of false positives is greatly

increased. We first removed the proteins with more than 5 missing values in the active group and 11 in the inactive group. To control Type I error rates, resampling-based single-step

    and stepwise multiple testing procedures (MTP) were applied, using the Bioconductor R package

    multtest (Pollard et al., 2009). Our null hypothesis was that each protein has equal mean relative

    abundance in the active and inactive group. The non-parametric bootstrap with centring and

    scaling was used with 1,000 iterations. Non-standardized Welch t-statistics were implemented,

    allowing for unequal variances in the two groups.
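As an illustration, the call to multtest might look as follows (the matrix and label names are ours, and the argument settings mirror the description above rather than the authors' exact code):

    library(multtest)

    # prot: proteins-by-subjects matrix; status: 0/1 labels (inactive/active)
    set.seed(20)
    res <- MTP(X = prot, Y = status,
               test = "t.twosamp.unequalvar",   # Welch-type two-sample statistic
               standardize = FALSE,             # non-standardized statistics
               nulldist = "boot.cs",            # bootstrap with centring and scaling
               B = 1000, typeone = "fwer",
               method = "sd.maxT",              # step-down; "ss.maxT" for single-step
               alpha = 0.05)
    rownames(prot)[res@adjp <= 0.05]            # proteins rejected at level 0.05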

We explored four imputation methods in the R package pcaMethods (Stacklies et al., 2007): the nonlinear iterative partial least squares (NIPALS) algorithm, the Bayesian principal component analysis missing value estimator, the probabilistic principal component analysis missing value estimator, and the singular value decomposition algorithm. A classification tree, implemented in the R package rpart (Therneau et al., 2009), was fitted using the nine proteins most frequently rejected by the MTPs. To visualize the results, a heatmap was produced (see Figure 1).
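A sketch of the tree and heatmap steps (top9 is our name for a data frame of the nine selected proteins plus the known status labels; new9 holds the same proteins for the 15 unknown subjects):

    library(rpart)

    tree <- rpart(status ~ ., data = top9, method = "class")   # classification tree
    pred <- predict(tree, newdata = new9, type = "class")      # predicted groups

    # Figure 1-style display: proteins in rows, subjects in columns
    heatmap(t(as.matrix(top9[, names(top9) != "status"])), scale = "row")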

    2. RESULTS

With the random seed set to 20, the MTPs result in five rejections (BPG0167, BPG0235, BPG0333, BPG0381, and BPG0447) at the significance level α = 0.05. Note that different choices of seed yield slightly different results. Applying the random seed = 20 and 1,000 bootstrap

    iterations to the imputed data, we found that the results from the four imputation methods are

similar and they agree with those without imputation. A heatmap was plotted (Figure 1) using the nine most rejected proteins: BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310,

    BPG0333, BPG0381, and BPG0447. Notice that it correctly classifies almost all the samples. The

    classification tree predicts that patients 7, 11, 13, 18, 33, and 35 belong to the active group.

    3. DISCUSSION

    Multiple testing procedures are a concern because an increase in specificity is coupled with a

loss of sensitivity. Furthermore, we suspect that the proteins with the largest differences in relative abundance between the active and inactive groups are not necessarily the key players in

    the relevant biological processes. These problems can only be addressed by incorporating prior

    biological knowledge into our analysis, which may lead to focusing on a specific set of proteins.


    Figure 1: A heatmap of nine proteins comparing the active (+) and inactive group.

    4. CONCLUSION

The results from the MTPs before imputation agree more or less with those after imputation. The reason for this is unclear; it could mean either that the imputation works very well or that it is not very helpful. There is no evidence that race, sex or age is associated with disease status.

    Proteins BPG0098, BPG0100, BPG0105, BPG0167, BPG0235, BPG0310, BPG0333, BPG0381,

    and BPG0447 seem to be important in indicating the disease status. Patients 7, 11, 13, 18, 33, and

    35 are predicted to be active among the 15 patients of unknown status.

    ACKNOWLEDGEMENTS

    We thank Dr. Peter Macdonald for his valuable advice and encouragement.

    BIBLIOGRAPHY

    K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor & S. Dudoit (2009). multtest: Resampling-based multiple

    hypothesis testing. R package version 2.1.1.

    T. M. Therneau, B. Atkinson & B. Ripley (2009). rpart: Recursive Partitioning. R package version 3.1-44.

W. Stacklies, H. Redestig & K. Wright (2007). pcaMethods: A collection of PCA methods. R package version

    1.22.0.

    Received 31 October 2010

    Accepted 6 January 2011


The process of selecting a disease status classifier using proteomic biomarkers

Christopher MEANEY1*, Calvin JOHNSTON1 and Jenna SYKES2

1Dalla Lana School of Public Health, University of Toronto
2Department of Statistics and Actuarial Science, University of Waterloo

    Key words and phrases: Statistical classifier; biomarker; support vector machine

    1. METHODOLOGY

    Our first challenge was to narrow down the list of available statistical classifiers. One resource at

    this stage was the open-source data mining and statistical learning program Weka developed at the

    University of Waikato in New Zealand (Hall et al., 2009). With a few simple clicks on its intuitive

Explorer interface, we quickly sifted through an extensive selection of supervised classification techniques, including logistic regression, trees and forests, bagging, boosting, neural networks,

    Bayes classifiers, and support vector machines. Weka has a convenient cross-validation function

    which allowed us to whittle down this long, intimidating list to only the most promising methods

    and allowed us to focus our efforts in a fruitful direction. We found that the Multilayer Perceptron

    and Support Vector Machines (SVM) had strong empirical classification properties according to

    leave-one-out cross validation. We opted to focus our energy on the SVM algorithm as it seemed

    more intuitive.

    Consider a Euclidean space with dimensionality equal to the number of factors; in this

    case the factors are the 400+ protein relative abundance measurements. Each patient maps to a

vector in that space, positioned according to his or her particular measurements on each protein. Along with this positioning, each patient also has a disease status: active or inactive. The SVM

    then finds the optimally separating hyperplane which splits the Euclidean space into two sections,

    each ideally containing patients of only one disease status. When there are multiple possible

    hyperplanes that achieve this completely separating objective, the SVM chooses the plane that

    maximizes the distance between the hyperplane and its nearest datapoints.

    In many SVM implementations, the use of linear hyperplanes is extended with kernels by

    replacing the linear vector dot product with a nonlinear kernel function, such as a polynomial

    or radial basis function. A common modification is to relax the requirement that the hyperplane

    divide all cases perfectly by allowing for a few penalized exceptions known as slack vectors

(Hastie et al., 2009). Since the provided data set was already of high dimensionality and was fully separable, neither of these extensions was used.

    We used the function svm() in the R (R Development Core Team, 2005) package e1071

    (Wu, 2009). This function allows many options to be set including a choice of kernels and the

    penalization of slack vectors (Meyer, 2009). Data were scaled within the function to prevent

    proteins with large variance from dominating the classification decision.
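For reference, a minimal call matching this description (the object names are ours):

    library(e1071)

    # X_imp: imputed protein matrix for the 32 labelled subjects
    fit <- svm(x = X_imp, y = factor(status),
               kernel = "linear",    # linear hyperplane, no kernel extension
               cost = 1,             # default penalization of slack vectors
               scale = TRUE)         # scale features inside the function
    pred <- predict(fit, X_new)      # classify the 15 subjects of unknown status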

    2. RESULTS

    Many protein abundance measurements were absent from the data. The function svm() requires

complete data to operate. Thus, we imputed values for the missing data. Exploratory analysis indicated that there were a number of proteins for which fewer than 5 out of 47 measurements


    were missing and a large number for which more than 30 out of 47 measurements were missing.

    Furthermore, several of the proteins with more than 20% missing data had missing values for all

    patients in one of the two status groups.

    We decided to limit our imputation to only proteins for which fewer than 20 measurements

    were missing. As a result, the observational set of interest for a given case consisted of only

146 remaining proteins. After careful and extensive visual exploration of the missing data, we felt comfortable that the data were mostly missing at random and not systematically linked to

    the true underlying values (such as higher probability of missingness for very small true values).

    A review of imputation literature revealed many possible strategies for the replacement of the

    missing values in the dataset. The imputation segment of the Multivariate Statistics Task View of

    R/CRAN revealed eight possible libraries that were devoted to multivariate imputation. However,

    the fact that our dataset consisted of a large number of variables, measured on a small number of

    cases, restricted our imputation options slightly. We began by using simplistic strategies such as

    mean/median imputation; however, we felt that more advanced methods would permit us to build

a more accurate classifier. We settled on using a nonparametric k-Nearest Neighbours (k-NN) imputation strategy, as it had been recommended by some authors from the field of proteomics and it behaved well in leave-one-out cross-validation (Troyanskaya et al., 2001). The k-NN imputation was performed using the impute.knn() function from the impute library in R. Ranging k from 1 to 10, we found that only one patient was misclassified for all k. The misclassified patient was in the inactive status group, so our final empirical classification rate was 100% (0/11 cases misclassified) and 95.2% (1/21 cases misclassified) for the active and inactive statuses, respectively. Since all k behaved equally well, we arbitrarily chose k to be 8 for our final classifier.
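A sketch of this imputation call (the impute package is from Bioconductor; the matrix name is ours):

    library(impute)

    # impute.knn() expects features in rows, so transpose a subjects x proteins matrix
    imp <- impute.knn(t(prot146), k = 8)
    prot146_complete <- t(imp$data)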

Decision processes were heavily reliant on empirical fitting, so there was some risk of overfitting. However, given the rigidity of the linearity requirement for our SVM implementation and

    the care we took in heeding this problem, we felt the amount of overfitting was probably small.

    3. CONCLUSIONS

Our final classifier was a linear Support Vector Machine with 8-Nearest-Neighbour missing value imputation for proteins with fewer than 20% missing data. Allowing for small amounts of overfitting, we suspect our technique would correctly classify about 90% of all patients in each disease status category.

    BIBLIOGRAPHY

    T. Hastie, R. Tibshirani & J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference,

    and Prediction, 2nd ed., Springer, New York.

    M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann & I. Witten (2009). The WEKA Data Mining

    Software: An Update. SIGKDD Explorations, 11, 19.

    D. Meyer (2009). Support Vector Machines, CRAN R Project, accessed July 15, 2009 at cran.r-

    project.org/web/packages/e1071/vignettes/svmdoc.pdf.

    R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation

    for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.r-project.org.

    O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein & R. Altman (2001).

Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525.

    T. Wu (2009). Misc Functions of the Department of Statistics (e1071), CRAN R Project, accessed July 15,

    2009 at cran.r-project.org/web/packages/e1071/e1071.pdf.


    Table 2: Predictions of patients with missing status.

Patient Tree LG Decision

    1 Inactive Inactive Inactive

    2 Inactive Active Active

    3 Inactive Inactive Inactive

    4 Inactive Inactive Inactive

    5 Inactive Inactive Inactive

    6 Inactive Active Active

    7 Active Active Active

    8 Inactive Active Active

    9 Inactive Inactive Inactive

10 Active Inactive Active

    11 Inactive Inactive Inactive

    12 Inactive Inactive Inactive

    13 Active Active Active

    14 Active Active Active

    15 Inactive Inactive Inactive

For inconsistent results, an Active prediction by either the classification tree or the logistic regression was regarded as more reliable.

    and false active error rates (see Table 1). Predictions from the two models were obtained for the

    status of patients with missing status (see Table 2).

    3. CONCLUSIONS

Given the low false-inactive error rates we obtained from both models, we conclude that it is possible to use classifiers as a pre-screening procedure for identifying active patients, particularly when a much larger sample is available for model training. However, without knowing the error rates of the current diagnostic method, it seems infeasible to conclude whether classification methods based on a simple blood sample perform as well as the current diagnostic method.

    ACKNOWLEDGEMENTS

We would like to thank the Department of Statistics of the University of British Columbia for its warm support of this case study.

    BIBLIOGRAPHY

    L. Breiman, J. H. Friedman, R. A. Olshen & C. J. Stone (1984). Classification and Regression Trees,

    Wadsworth & Brooks, Pacific Grove.

    Received 31 October 2010

Accepted 6 January 2011


    Discussion of case study 2 analyses

    Robert F. BALSHAW* and Gabriela V. COHEN FREUE

    NCE CECR PROOF Centre of Excellence, Vancouver, BC, Canada V6Z 1Y6

    Key words and phrases: Variable selection; class prediction; imputation

    1. INTRODUCTION

These analyses have been performed on data provided as a case study for the 2009 SSC case study session organized by Alison Gibbs. A general overview of the data was provided in the Background and Description. A comparable analysis is described in Cohen et al. (2010). In many ways, the data represent what has become a fairly standard statistical challenge in supervised learning: to develop a classification rule for prediction of unknown class labels for 15 new samples based on a small training set, possibly utilizing only a subset of the features.

    From our experience, the principal challenge was the abundance of missing data, whose

    presence may well carry information about class membership. Missing values often occur with

    our analytical proteomic platform due to the challenge of protein identification as well as detection

    of low abundance proteins.

We would like to thank all six teams for their efforts; all are to be commended for their insightful analyses. The six teams will be referred to as follows:

WX: Wang and Xia.

LM: Liu and Malik.

CDK: Chang, Demetrashvili, and Kowgier.

    MJS: Meaney, Johnston, and Sykes.

    GCP: Guo, Chen, and Peng.

    LMSS: Lu, Mann, Saab, and Stone.

    2. METHODOLOGIES

    The six teams used a wide variety of methodologies, which we have attempted to summarize very

    briefly in Table 1. In general, the teams all took similar approaches to pre-filtering the features

based on detection rates and explored a variety of imputation methods. All teams eliminated a subset of the proteins with lower rates of detection, though the detection rule and threshold used

    varied between the teams. Essentially, two variant pre-filtering rules were used: (1) select proteins

    for which the overall detection rate was above an arbitrary, pre-determined threshold: 0 and 1 by

    WX; 0.5 by CDK; 0.8 by MJS; or 0.15 by LMSS; and (2) select proteins for which the within-class

    detection rates were above a threshold: approximately 0.5 by LM and 0.4 by GCP. Having chosen

    a set of proteins, the teams then utilized a variety of imputation methods (see Table 1) to permit

    the use of those classification techniques which do not permit missing values.
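Pre-filtering rule (1), for example, amounts to a single line of R (names are ours):

    # Keep proteins whose overall detection (non-missing) rate meets the threshold
    keep <- colMeans(!is.na(prot)) >= 0.5    # e.g., CDK's threshold of 0.5
    prot_kept <- prot[, keep]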

Two teams then conducted univariate or one-protein-at-a-time tests before building a multivariate classifier (LM; LMSS). LM selected a reduced set of candidate significant proteins (controlling the Type I error rate), and LMSS selected a reduced set of candidate proteins that showed differential expression in several methods (including other approaches that test multiple features simultaneously). Though with different emphasis and in combination with other methods, almost all teams utilized a logistic regression approach (WX, CDK, MJS, GCP, LMSS).

Table 1: Summary of strategies and accuracy.

Team | Missing values | Classifier | Selected model | Accuracy
WX | Overall detection rate threshold: 1 and 0; imputation methods: mean and k-NN | Tree with splitting rule (Tree) and logistic-AIC (LG) | Ensemble of two models: Tree based on k-NN-imputed data and LG based on complete data | 11/15
LM | Within-group detection rate threshold: 0.5; imputation methods: NIPALS, BPCA, probabilistic PCA, SVD | Non-standardized Welch t-test using FDR, followed by a classification tree | Tree based on top-9 tested proteins; high concordance among imputation methods | 12/15
CDK | Overall detection rate threshold: 0.5; imputation method: EM algorithm | RDA, DLDA, PLR, and SVM | PLR | 14/15
MJS | Overall detection rate threshold: 0.8; imputation methods: median/mean and k-NN (several k) | Extensive list of supervised classification techniques available in Weka | SVM on k-NN-imputed data (k = 8, similar results for other values) | 13/15
GCP | Within-group detection rate threshold: 0.4; imputation methods: hot-deck and linear regression | Hierarchical cluster analysis followed by stepwise logistic regression to select clusters; a proteomic score, the mean difference between the two identified clusters, is calculated | Logistic regression with the calculated proteomic score and age as covariates | 11/15
LMSS | Overall detection rate threshold: 0.15; imputation methods: k-NN, LLS, SVD, BPCA | Classical t-test, Wilcoxon signed rank test, LASSO, LARS, and SLR | Model 1: LARS algorithm on the seven most frequent proteins from all methods explored, on SVD-imputed data. Model 2: logistic regression based on three proteins selected by a stepwise analysis on the seven most frequent proteins and gender | Model 1: 10/15; Model 2: 12/15


variables were related to the disease condition; GCP built a logistic regression classification model that selected age from all the other covariates and the proteomic score; and LMSS found only gender to be statistically significant using a chi-square test, though it was not selected in their final logistic regression model.

The predictive performance of each team's final classifier was tested on 15 samples whose class labels were kept hidden until after each team's results were provided to the organizers. The accuracy of each team's final model appears in the last column of Table 1. Though this permits a quantitative comparison between the teams, we will resist the temptation to over-interpret the relative predictive performance of the teams and their methods. Instead, we would like to once again thank all the teams for their hard work, insight and thoughtful questions.

    BIBLIOGRAPHY

G. V. Cohen Freue, M. Sasaki, A. Meredith, O. P. Gunther, A. Bergman, M. Takhar, A. Mui, R. F. Balshaw, R. T. Ng, N. Opushneva, Z. Hollander, G. Li, C. H. Borchers, J. Wilson-McManus, B. M. McManus, P. A. Keown, W. R. McMaster, and the Genome Canada Biomarkers in Transplantation Group (2010). Proteomic signatures in plasma during early acute renal allograft rejection. Molecular & Cellular Proteomics, 9, 1954–1967.

    Received 31 October 2010

    Accepted 6 January 2011
