
Proceedings of the

29th International Workshop

on Statistical Modelling

Volume 2

July 14 – 18, 2014

Göttingen, Germany

Thomas Kneib, Fabian Sobotka, Jan Fahrenholz, Henriette Irmer

(editors)


Proceedings of the 29th International Workshop on Statistical Modelling, Volume 2, Göttingen, July 14 – 18, 2014, Thomas Kneib, Fabian Sobotka, Jan Fahrenholz, Henriette Irmer (editors), Göttingen, 2014.

Editors:
Thomas Kneib, [email protected]
Fabian Sobotka, [email protected]
Jan Fahrenholz, [email protected]
Henriette Irmer, [email protected]

Centre for Statistics
Georg-August-University
Platz der Göttinger Sieben 5
37073 Göttingen, Germany

Printed by:
Druckerei Brüggemann GmbH
Violenstraße 23
28195 Bremen, Germany


Scientific Programme Committee

• Thomas Kneib (Chair), Georg-August-University Göttingen, Germany

• Jim Booth, Cornell University, Ithaca, USA

• María Durbán, University Carlos III of Madrid, Spain

• Gillian Heller, Macquarie University, Sydney, Australia

• Arnošt Komárek, Charles University in Prague, Czech Republic

• Stefan Lang, Universität Innsbruck, Austria

• Vito Muggeo, University of Palermo, Italy

• Mikis Stasinopoulos, London Metropolitan University, UK

• Lola Ugarte, Universidad Pública de Navarra, Spain

• Florin Vaida, University of California, San Diego, USA

• Helga Wagner, Johannes Kepler Universität Linz, Austria


Local Organizing Committee

• Thomas Kneib (Chair)

• Heike Bickeböller

• Jan Fahrenholz

• Jan Gertheiss

• Henriette Irmer

• Nadja Klein

• Tatyana Krivobokova

• Juliane Manitz

• Julia Meskauskas

• Patrick Michaelis

• Oleg Nenadic

• Hauke Rennies

• Holger Reulen

• Benjamin Säfken

• Fabian Sobotka

• Alexander Sohn

• Elmar Spiegel

• Elisabeth Waldmann


Contents

Part III - Contributed Papers (Posters)

Diego Ayma, María Durbán, Dae-Jin Lee: Penalized composite link mixed models for spatially aggregated data . . . 3

M. Amo-Salas, E. Delgado-Márquez, J. López-Fidalgo: Optimal Experimental Design in a real case . . . 7

Jacek Białek: Stochastic Index Numbers in Inflation Measurement . . . 11

Daniel Bonetti, Alexandre Delbem, Jochen Einbeck: Bivariate Estimation of Distribution Algorithms for Protein Structure Prediction . . . 15

Mileno Cavalcante, Betsabé Blas: Beta calibration model . . . 19

Hayala Cristina Cavenague de Souza, Gleici da Silva Castro Perdoná, Francisco Louzada Neto, Fernanda Maris Peria: Parameter estimation for the Exponentiated Modified Weibull Model with Long Term Survival: A Simulation Study . . . 23

Francesco Donat, Giampiero Marra: P-IRLS Representation of a Semiparametric Bivariate Triangular Cumulative Link Model . . . 27

Cara Dooley, Joanne Feeney, Rose Anne Kenny: Normative Values of Visual Acuity and Contrast Sensitivity in Older Irish Adults . . . 31

Jochen Einbeck, Daniel Bonetti: A study of online and block-wise updating of the EM algorithm for Gaussian mixtures . . . 35

Amira Elayouty, Marian Scott, Claire Miller, Susan Waldron, Maria Franco Villoria: Visualisation and Modelling of Environmental Sensor Time Series . . . 39

Manuela Ender, Tong Ma: Statistical modeling of extreme precipitation in recent flood regions in China . . . 43


Marco Enea, Mikis Stasinopoulos, Robert Rigby, Antonella Plaia: The pblm package: semiparametric regression for bivariate categorical responses in R . . . 47

M.J. García-Ligero, A. Hermoso-Carazo, J. Linares-Pérez: Least-squares linear distributed estimation for discrete-time systems with packet dropouts . . . 51

Phillip Gichuru, Gillian Lancaster, Andrew Titman, Melissa Gladstone: Developing Robust Scoring Methods for use in Child Assessment Tools . . . 55

Mousa Golalizadeh, Vida Azod Azad: Multilevel Factor Analysis of the PIRLS Test for the Iranian Pupils . . . 61

Andreas Groll, Gerhard Tutz: Variable Selection in Heterogeneous Discrete Survival Models . . . 65

Alexander Hartmann, Stephan Huckemann, Jörn Dannemann, Oskar Laitenberger, Claudia Geisler, Alexander Egner, Axel Munk: Drift Estimation in Sparse Sequential Dynamic Imaging: with Application to Nanoscale Fluorescence Microscopy . . . 69

Radek Hendrych: State space analysis of the Prague Stock Exchange index . . . 71

Freddy Hernández, Olga Usuga, Viviana Giampaoli: A misspecification simulation study in Poisson mixed model . . . 75

Jixia Huang, Jinfeng Wang: Identification of Health Risks of Hand, Foot and Mouth Disease in China Using the Geographical Detector Technique . . . 79

Yufen Huang, Bo-Shiang Ke: Influence Analysis on Crossover Design Experiment in Bioequivalence Studies . . . 85

Germán Ibacache-Pulgar: Symmetric semiparametric additive models . . . 89

Lydia Lera, Cecilia Albala, Barbara Leyton, Hugo Sánchez, Barbara Angel, María J. Hormazábal: Application of classification tree analysis: Algorithm of diagnostic of sarcopenia in Chilean older people . . . 93


Ivana Malá: Modelling of the distribution of the unemployment duration in the Czech Republic . . . 97

Mario Martínez-Araya, Jianxin Pan: Unknown break-point estimation and selection between semi-parametric and segmented models . . . 101

Kenan Matawie, Bahman Javadi: Mixture distribution model for resources availability in volunteer computing systems . . . 105

Jakob W. Messner, Georg J. Mayr, Achim Zeileis: Weather Forecasts and Censored Regression with Conditional Heteroscedasticity . . . 109

J.M. Muñoz-Pichardo, J.L. Moreno-Rebollo, R. Pino-Mejías, M.D. Cubiles de la Vega: Influence measures in beta regression models through Fréchet metric . . . 113

J.M. Muñoz-Pichardo, R. Pino-Mejías, J.L. Moreno-Rebollo, M.D. Cubiles de la Vega: Influence measures in beta regression models through Rao distance . . . 117

Adrian O'Hagan: Estimation of Upper Tail Dependence for Insurance Loss Data Using an Empirical Copula-based Approach . . . 121

Christian Pfeifer, Peter Holler: Effects of precipitation and temperature in alpine areas on backcountry avalanche accidents reported in the western part of Austria within 1987–2009 . . . 125

Eliane C. Pinheiro, Silvia L. P. Ferrari: Small-sample one-sided testing in extreme value regression models . . . 129

Hildete P. Pinheiro, Mariana Rodrigues-Motta, Gabriel Franco: Modelling performance of students with bivariate generalized linear mixed models . . . 133

R. Pino-Mejías, M.D. Cubiles-de-la-Vega, J.M. Muñoz-Pichardo, J.L. Moreno-Rebollo: Identification of Outliers with Boosting Algorithms . . . 137

Iain Proctor, E. Marian Scott, Rognvald I. Smith: Incorporating sub-grid variability of environmental covariates in biodiversity modelling . . . 141


María Xosé Rodríguez-Álvarez, Thomas Kneib, Carmen Cadarso-Suárez: Semiparametric ROC Regression based on Conditional Transformation Models . . . 145

P. Román-Román, J.J. Serrano-Pérez, F. Torres-Ruiz: Approximating unconditioned first-passage-time densities for diffusion processes . . . 149

Mariangela Sciandra, Antonella Plaia: Classification trees for preference data: a distance-based approach . . . 153

Karin A. Tamura, Viviana Giampaoli, Alexandre Noma: The impact of the misspecification of the random effects distribution on the prediction of the mixed logistic model . . . 157

Thomai Tsiftsi, Ian H. Jermyn, Jochen Einbeck: Bayesian shape modelling of cross-sectional geological data . . . 161

Marcio Valk, Aluísio Pinheiro: Tests for contaminated time series . . . 165

Elisabeth Waldmann, Fabian Sobotka, Thomas Kneib: Regularisation in Bayesian Expectile Regression . . . 169

Andrea Wiencierz: Nonparametric regression with interval-valued response . . . 173

Bruce J. Worton, Chris R. McLellan: Comparison of robust and nonparametric measures of fidelity . . . 177

Dilek Yildiz, Peter W.F. Smith, Peter G.M. van der Heijden: Extending capture-recapture models to handle erroneous records in linked administrative data . . . 181

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185


Part III - Contributed Papers (Posters)


Penalized composite link mixed models for spatially aggregated data

Diego Ayma1, María Durbán1, Dae-Jin Lee2

1 Department of Statistics, Universidad Carlos III de Madrid, Spain
2 CSIRO Computational Informatics, Clayton, VIC, Australia

E-mail for correspondence: [email protected], [email protected] and [email protected]

Abstract: We propose an extension of the Poisson penalized composite link models (P-CLMs) to the case of aggregated spatial data. We estimate a trend at a finer scale by fitting a spatial smooth mixed model to the latent expectation of the underlying process, and we apply the proposed methodology to the analysis of female deaths due to cardiovascular diseases in the region of Madrid.

Keywords: Composite link models; P-splines; Mixed models; Aggregated count data.

1 Introduction

Disease mapping techniques are very popular in areas such as Public Health and Epidemiology. Their popularity has led to the development of new methodologies based on spatial techniques and the use of geographical information systems technology. However, they have a main drawback: the data used are, in general, summarized over geographical areas, like districts, city quarters or municipalities. Then only averages or sums over a coarse grid are available, and therefore the use of geographical information is very restrictive. Several methods have been used to analyze this type of data, but most of them assume that the observations are located at the geographical centroids of the areas, and this yields a coarse spatial trend. In this paper, we propose a new methodology for the analysis of this type of data, using a two-dimensional penalized composite link mixed model, in which we generalize the approach given by Eilers (2007) to the case of spatial data. This model estimates the trend at a much finer scale (so it does not need to be constant within a region), and the model is formulated in such a way that the total estimated counts in each area are the sum of the estimated counts at the finer scale.

This paper was published as a part of the proceedings of the 29th International Workshop on Statistical Modelling, Georg-August-Universität Göttingen, 14–18 July 2014. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).


2 The spatial composite link model

In the one-dimensional case, suppose that we observe a vector of aggregated counts y from Poisson distributions with E(y) = µ. In order to model the underlying process behind these data, we can use the Poisson Penalized Composite Link Model (P-CLM, Eilers, 2007) given by

µ = Cγ = C exp(Bθ),

where γ represents the latent expectation of the underlying process, C is the composition matrix that describes how latent expectations are combined to yield µ, B is the B-spline basis, constructed from a covariate at the non-aggregated level, x, and θ is the associated vector of regression parameters. Smoothness is imposed via the penalty matrix P based on a second order difference matrix D, that is, P = λDᵀD, where λ is a parameter that controls the amount of smoothness.
When count data are aggregated by geographical regions, a first attempt to model the data could be the spatial smooth mixed model proposed by Lee and Durbán (2009), in which the spatial variation is modelled by a two-dimensional P-spline at the centroids of the regions. However, this model assumes that mortality rates are constant over each region (which might be unsatisfactory, especially if the regions are large). We propose an improved smooth spatial model by extending the composite link model to the spatial setting.
Let x_1* and x_2* be the geographical coordinates (latitude and longitude) of the centroids of each region in a map; we now impose a fine grid over it with new coordinates x_1 and x_2 of length n. Figure 1 shows the map of the municipalities in the region of Madrid and the 120 × 120 grid chosen.

FIGURE 1. Map of municipalities in the region of Madrid, and the location of centroids (red) and the grid points selected (blue).

Then, in this new context, the regression basis B is defined as the box-product or "row-wise" Kronecker product of the individual bases B_1 and B_2 (of dimensions n × c_1 and n × c_2, respectively):

B = B_2 □ B_1 = (B_2 ⊗ 1ᵀ_{c_1}) ⊙ (1ᵀ_{c_2} ⊗ B_1)


and the penalty is given by:

P = λ_2 P_2 ⊗ I_{c_1} + λ_1 I_{c_2} ⊗ P_1,

where P_i, i = 1, 2, are marginal penalty matrices based on second order differences. We choose to use a mixed model approach to the P-CLM since it will allow the inclusion of area-specific random effects or further correlation structure if necessary. Then, Bθ = Xβ + Zα, with α ∼ N(0, G(λ_1, λ_2)), and

f(y|α) = exp{ yᵀ log(C exp(Xβ + Zα)) − 1ᵀ C exp(Xβ + Zα) − 1ᵀ log Γ(y + 1) }.

Then, β and α are estimated by maximizing the penalized log-likelihood

ℓ_P = log f(y|α) − (1/2) αᵀ G⁻¹ α,

which yields a modified version of the standard mixed model estimators:

β̂ = (X̃ᵀ V⁻¹ X̃)⁻¹ X̃ᵀ V⁻¹ z,        α̂ = G Z̃ᵀ V⁻¹ (z − X̃ β̂),

where X̃ = W⁻¹CΓX, Z̃ = W⁻¹CΓZ and V = W⁻¹ + Z̃GZ̃ᵀ, with W = diag(C exp(Xβ + Zα)), Γ = diag(exp(Xβ + Zα)) and working vector

z = X̃β + Z̃α + W⁻¹ (y − C exp(Xβ + Zα)).

Smoothing parameters are estimated using the approximation of the marginal quasi-likelihood of Breslow and Clayton (1993).
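For concreteness, the sketch below illustrates two of the ingredients above in R: the row-wise Kronecker (box) product of the marginal bases, and a single penalized IWLS update written in the simpler penalized-likelihood form of the one-dimensional P-CLM of Eilers (2007), rather than the mixed-model estimators used here. All inputs (y, C, B, lambda, D) and the function names are our own assumptions.

# Row-wise Kronecker (box) product of bases B1 (n x c1) and B2 (n x c2):
# B = (B2 kron 1'_{c1}) * (1'_{c2} kron B1), elementwise.
box_product <- function(B2, B1) {
  (B2 %x% t(rep(1, ncol(B1)))) * (t(rep(1, ncol(B2))) %x% B1)
}

# One penalized IWLS update for the P-CLM mu = C exp(B theta),
# with penalty P = lambda D'D (one-dimensional case, Eilers, 2007).
pclm_update <- function(theta, y, C, B, lambda, D) {
  gamma <- exp(as.vector(B %*% theta))  # latent expectations on the fine grid
  mu    <- as.vector(C %*% gamma)       # composed (aggregated) means
  Xt    <- (C %*% (gamma * B)) / mu     # "composed" basis, W^{-1} C Gamma B
  W     <- diag(mu)                     # working weights
  z     <- as.vector(Xt %*% theta) + (y - mu) / mu  # working vector
  P     <- lambda * crossprod(D)        # penalty matrix
  solve(crossprod(Xt, W %*% Xt) + P, crossprod(Xt, W %*% z))
}

Iterating pclm_update until θ stabilizes gives the penalized fit; in the mixed-model formulation above the analogous update is driven by β̂ and α̂.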

3 Application to disease mapping

The data that we analyze come from a large European epidemiological study called MEDEA (http://www.proyectomedea.org/). The aim of the project was to study the impact of socioeconomic and environmental inequalities on the mortality rates by different causes. Deaths are not only related to individual factors, but also to contextual factors, most of them related to the area of residence. Therefore, it is of great interest to estimate spatial trends present in the data that can help to identify areas that may need intervention. Our data correspond to deaths by cardiovascular diseases of females in the region of Madrid in 2001. Our aggregated counts correspond to observed deaths and population at risk in each of the 179 municipalities of the region of Madrid. We fitted two models:

• Model 1: The model in Lee and Durbán (2009) based on aggregated data at the centroids of the regions (equivalent to having the composition matrix C = I).

• Model 2: A modified version of the P-CLM introduced in the previous section, in which we included the log of the population at risk as an offset in the linear predictor (since we are interested in mortality rates).


FIGURE 2. Plot of the fitted spatial trend using the centroids of municipalities (left) and a grid of the estimated latent spatial trend over the region of Madrid (right).

Figure 2 shows the estimated spatial trend (on the scale of the linear predictor) for both models. Model 2 estimates the spatial component at a higher resolution, allowing for a smooth trend within municipalities, with similar goodness of fit as Model 1. Mortality rates are higher, in general, at the boundaries of the region (these correspond to areas with difficult access to health facilities, or industrialized areas where environmental conditions are poor).

Acknowledgments: The work of the authors has been funded by the Ministry of Economy and Competitiveness grant MTM2011-28285-C02-02.

References

Breslow, N.E. and Clayton, D.G. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association, 88, 9 – 25.

Eilers, P.H.C. (2007). Ill-posed problems with counts, the composite link model and penalized likelihood. Statistical Modelling, 7(3), 239 – 254.

Lee, D.-J. and Durbán, M. (2009). Smooth-CAR mixed models for spatial count data. Computational Statistics & Data Analysis, 53, 2958 – 2979.


Optimal Experimental Design in a real case

M. Amo-Salas1, E. Delgado-Márquez1, J. López-Fidalgo1

1 University of Castilla-La Mancha, Spain

E-mail for correspondence: [email protected]

Abstract: This work presents research results on analysing the process of jam formation during the discharge by gravity of granular material stored in a 2D silo. The aim is to provide optimal experimental designs considering different optimality criteria: the D-optimality criterion, a combination of the D-optimality criterion with a cost/gain function, a Bayesian approach and sequential design. The results reveal that the efficiency of the design used by the experimenters may be improved dramatically. A sensitivity analysis with respect to the most important parameter is performed as well.

Keywords: D-optimal design; compound criterion; Bayesian design; sequential design; jam formation.

1 Introduction

During the discharge process caused by gravity, the flow of granular material stored in a two-dimensional silo can be interrupted due to the formation of an arch at the outlet. Breaking the arches may be dangerous, expensive or just difficult. An avalanche can be defined as the amount of granular material fallen between two successive jamming events, or as the waiting time between two successive jamming events.
The average size of avalanches, s, obtained for the different sizes of the outlet follows an exponential distribution with parameter (φ_C − φ)^γ / A, that is,

E(s) = A / (φ_C − φ)^γ,        var(s) = [A / (φ_C − φ)^γ]²,        (1)

where φ is the diameter of the outlet, φ_C is the theoretical critical diameter above which jamming is not possible, and ρ is the diameter of the granular material; thus ρ < φ < φ_C. The parameter A is a constant that corresponds to the value of s when φ = φ_C − 1, and γ is the rate of exponential decline. The unknown parameter vector θ = (γ, A, φ_C) has to be estimated adequately.

This paper was published as a part of the proceedings of the 29th International Workshop on Statistical Modelling, Georg-August-Universität Göttingen, 14–18 July 2014. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).


2 D-optimal Design

Theorem: Given the model defined by (1) and the design space X_φ = [φ_L, φ_U], the three-point equally weighted design maximizing the determinant of the FIM is

ξ* = ( φ_L   φ*   φ_U )
     ( 1/3   1/3  1/3 ),        (2)

where

φ* = φ_C − (φ_C − φ_U)(φ_C − φ_L)[log(φ_C − φ_U) − log(φ_C − φ_L)] / (φ_L − φ_U)

and φ_L < φ* < φ_U. Moreover, if ψ(ξ*, φ) ≥ 0 for all φ, then it is actually D-optimal.
To check that ξ is D-optimal, we apply the General Equivalence Theorem: ξ is D-optimal if and only if

ψ(φ, ξ, θ) = −tr{M⁻¹(ξ*, θ) M(ξ_φ, θ)} + k ≥ 0,

where M⁻¹(ξ*, θ) is the inverse of the Fisher Information Matrix (FIM) of the candidate D-optimal design, M(ξ_φ, θ) is the FIM of a one-point design, and k is the number of parameters to be estimated.
Example: Janda et al. (2008) defined X = [1.53, 5.63] as the design space and θ = (12.7, 1.1 × 10¹¹, 8.5); the D-optimal design is then

ξ* = ( 1.53   4.17   5.63 )
     ( 1/3    1/3    1/3 ).        (3)


FIGURE 1. Sensitivity function for the design (3).

Janda et al. (2008) considered an equally weighted design with 8 non-replicated points,

ξ_8 = ( 1.53  2.17  2.51  3.02  3.52  4.12  4.36  4.8 )
      ( 1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8 ).        (4)

The D-efficiency of this design is

Eff_D(ξ_8) = ( |M(ξ_8, θ)| / |M(ξ*, θ)| )^{1/3} = 33.57%.

Therefore, with the D-optimal design, the number of experiments can be reduced by 66.43% to achieve the same accuracy in the estimators of the parameters.
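A minimal R sketch of how such a D-efficiency can be computed: for an exponential response with mean m(φ) = A/(φ_C − φ)^γ, the per-point contribution to the FIM is g(φ)g(φ)ᵀ with g = ∂ log m/∂θ. This parameterization, and the function names, are our assumptions rather than the authors' code.

fim <- function(points, weights, theta) {   # theta = (gamma, A, phi_C)
  g <- function(phi) c(-log(theta[3] - phi),         # d log m / d gamma
                       1 / theta[2],                 # d log m / d A
                       -theta[1] / (theta[3] - phi)) # d log m / d phi_C
  Reduce(`+`, Map(function(p, w) w * tcrossprod(g(p)), points, weights))
}
theta <- c(12.7, 1.1e11, 8.5)
M_opt <- fim(c(1.53, 4.17, 5.63), rep(1/3, 3), theta)             # design (3)
M_8   <- fim(c(1.53, 2.17, 2.51, 3.02, 3.52, 4.12, 4.36, 4.8),
             rep(1/8, 8), theta)                                  # design (4)
(det(M_8) / det(M_opt))^(1/3)                                     # D-efficiency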


A criterion combining D-optimality and a gain function is defined by

Φ_λ(ξ) = (1 − λ) log |M(ξ, θ)| + λ Σ_{i=1}^{k} ξ(φ_i) log(φ_C − φ_i + 1),        (5)

where 0 ≤ λ ≤ 1, |M(ξ, θ)| is the determinant of the FIM for a three-point design, and log(φ_C − φ_i + 1) is the gain (opposite of the cost) function associated with φ_i. The sensitivity function is

ψ_λ(ξ*, ξ_φ) = (1 − λ)[ tr(M⁻¹(ξ*_λ) M(ξ_φ)) − m ] + λ[ log(φ_C − φ + 1) − Σ_{i=1}^{m} ξ(φ_i) log(φ_C − φ_i + 1) ].

For λ = 0.745, the efficiencies with respect to D-optimality and to the gain/cost function are equal, and the compound optimal design is

ξ_CC = ( 1.53   4.01   5.63 )
       ( 0.46   0.31   0.23 ).        (6)


FIGURE 2. Efficiency with respect to the D-optimal design (solid line) and efficiency with respect to the cost of experimentation (dashed line) for X_φ = [1.53, 5.63] and θ = (12.7, 1.1 × 10¹¹, 8.5).

A sensitivity analysis has been performed with respect to the D-optimal design (2) and with respect to the compound optimal design. We consider different values for the critical diameter in a neighbourhood of the nominal value, φ*_C ∈ [φ_C⁽⁰⁾ − ε_1, φ_C⁽⁰⁾ + ε_2], where φ_C⁽⁰⁾ = 8.5, φ_C⁽⁰⁾ − ε_1 > φ_U, and ε_1, ε_2 > 0.
Figure 3 shows the behaviour of the efficiency with respect to the value of φ*_C for both cases. For possible true values below 8.5, the D-efficiency of the D-optimal design decays quite fast, while for possible true values greater than 8.5 the D-efficiency of the D-optimal design stays above 94%. The efficiency of the compound optimal design decays slightly for possible true values below 8.5, but faster than for values greater than 8.5. Therefore, it is better to underestimate the nominal value of the parameter φ_C, leaving fewer chances to the left for the possible true values.
These designs are locally optimal designs because their computation depends on the nominal values of the parameters. From the prior knowledge on


FIGURE 3. Sensitivity analysis for φ*_C ∈ [5.7, 20.5] with respect to (3) and (4).

the possible values of the unknown parameter φ_C, a Bayesian approach has been considered. This knowledge is represented by a prior distribution, π(θ). In particular, two continuous prior distributions are considered for φ_C: a log-normal distribution shifted to the right, starting at the upper limit of the design space, and a normal distribution truncated to the left of the upper limit of the design space. Another way to deal with the problem of locally optimal designs is to consider sequential designs.

Acknowledgments: The authors are supported by the Ministerio de Economía y Competitividad and Fondos FEDER MTM2010-20774-C03-01 and Junta de Comunidades de Castilla-La Mancha PEII10-0291-1850.

References

Atkinson, A.C., Donev, A.N. and Tobias, R.D. (1992). Optimum Experimental Designs, with SAS. New York: Oxford University Press.

Biswas, A. and López-Fidalgo, J. (2012). Compound designs for dose-finding in the presence of non-designable covariates. Pharmaceutical Statistics.

Cook, D. and Fedorov, V. (1995). Constrained Optimization of Experimental Design. National Science Foundation.

Janda, A., Zuriguel, I., Garcimartín, A., Pugnaloni, L.A. and Maza, D. (2008). Jamming and critical outlet size in the discharge of a two-dimensional silo. EPL (Europhysics Letters).


Stochastic Index Numbers in Inflation Measurement

Jacek Białek1

1 University of Lodz, Department of Statistical Methods, Poland

E-mail for correspondence: [email protected]

Abstract: The stochastic approach is a specific way of viewing index numbers, in which uncertainty and statistical properties play a central role. Applied to prices, this approach treats the underlying rate of inflation as an unknown parameter that has to be estimated from the individual prices. Thus, the stochastic approach provides the whole probability distribution of inflation. In this paper we present and discuss several basic stochastic index numbers. We propose a simple stochastic model, which leads to a price index formula that is a mixture of the previously presented specifications. We verify the considered indices on a real data set.

Keywords: Price indices; Stochastic index numbers; Price index theory.

1 Introduction

The weighted price index is a function of a set of prices and quantities of the considered group of N commodities coming from a given moment t and the base moment s. In reality, the price index formula is a quotient of random variables and thus it is really difficult to construct a confidence interval for it. The so-called new stochastic approach (NSA) in price index theory gives a solution to the above-mentioned problem. Within this approach, a price index is a regression coefficient (an unknown parameter θ) in a model explaining price variation. Having estimated the sampling variance of the estimator (σ²_θ̂), we can build the (1 − α) confidence interval as θ̂ ± t_{1−α/2,n−1} σ̂_θ̂, where n is the sample size and t_{1−α/2,n−1} is the 100(1 − α/2) percentile of the t distribution with n − 1 degrees of freedom (see von der Lippe (2007)). The individual prices are observed with error, and the problem is a signal-extraction one of how to combine noisy prices so as to minimize the effects of measurement errors. Under certain assumptions, the stochastic approach leads to known price index formulas (such as

This paper was published as a part of the proceedings of the 29th International Workshop on Statistical Modelling, Georg-August-Universität Göttingen, 14–18 July 2014. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).


Divisia, Laspeyres, etc.), but their foundations differ from the classical deterministic approach (see Diewert (2004)). Within this approach we can also obtain some new price index formulas with desired economic and statistical properties (Clements et al. (2006)).
The recent resurrection of the stochastic approach to index number theory is due to Balk (1980), Clements and Izan (1981, 1987), Bryan and Cecchetti (1993) and Selvanathan and Prasada Rao (1994). In this paper we present and discuss only some basic stochastic index numbers. We propose a simple stochastic model, which leads to a price index formula that is a mixture of the previously presented specifications.

2 Stochastic index numbers in inflation measurement

There are many directions and stochastic models in the field of the stochastic approach (see Clements et al. (2006), Crompton (2000)). Let Dp_{i,t} = ln p_{i,t} − ln p_{i,t−1} be the log-change in the price of commodity i (i = 1, 2, ..., N) from year t − 1 to t. Suppose that each price change is made up of a systematic part that is common to all prices (θ_t) and a random component ε_{i,t},

Dp_{i,t} = θ_t + ε_{i,t},        (1)

where we assume that E(ε_{i,t}) = 0 and thus E(Dp_{i,t}) = θ_t. We can see that the parameter θ_t is interpreted here as the common trend in all prices, or the underlying rate of inflation. Let all ε_{i,t} have variances and covariances of the form σ²_{ij,t}, and let Σ_t = [σ²_{ij,t}] be the corresponding N × N covariance matrix. Under the above notation we can write (1) in vector form as

Dp_t = θ_t u + ε_t,        (2)

where Dp_t = [Dp_{i,t}]ᵀ, u = [1, ..., 1]ᵀ and ε_t = [ε_{i,t}]ᵀ are all N × 1 vectors. Using the generalized least squares method for estimating θ_t, we obtain the BLUE estimator (see Clements et al. (2006))

θ̂_t = (uᵀ Σ_t⁻¹ u)⁻¹ uᵀ Σ_t⁻¹ Dp_t,        (3)

with variance

σ²_{θ̂_t} = (uᵀ Σ_t⁻¹ u)⁻¹.        (4)

The formulas (3) and (4) have a general form; in the remaining part of the paper we consider a special case of this model. Let us assume the following specification of the matrix Σ_t:

Σ_t = D_t (I − λ_t)⁻¹ D_t W_t⁻¹,        (5)

where D_t is a diagonal matrix with the standard deviations of the N relative prices on the main diagonal, λ_t = [λ_{ij,t}] is an N × N symmetric matrix with zero diagonal elements and (i,j)-th off-diagonal element equal to the relevant correlation λ_{ij,t} = σ²_{ij,t}/(σ_{ii,t} σ_{jj,t}), and W_t = diag[w_{1,t}, w_{2,t}, ..., w_{N,t}] is an N × N diagonal matrix, where w_{i,t} = p_{i,t} q_{i,t} / Σ_{i=1}^{N} p_{i,t} q_{i,t}. The following theorem holds:
In the stochastic model described by (1) with the corresponding covariance matrix defined by (5), we obtain the following estimator of the rate of inflation and its variance:

θ̂*_t = Σ_{i=1}^{N} w*_{i,t} Dp_{i,t},        (6)

σ²_{θ̂*_t} = 1 / Σ_{i=1}^{N} w_{i,t} (σ_{ii,t}⁻² − λ*_{·i,t}),        (7)

where

w*_{i,t} = w_{i,t} (σ_{ii,t}⁻² − λ*_{·i,t}) / Σ_{i=1}^{N} w_{i,t} (σ_{ii,t}⁻² − λ*_{·i,t}),        (8)

and λ*_{·i,t} is the sum of the elements in the i-th row of the matrix λ*_t = D_t⁻¹ λ_t D_t⁻¹.
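As an illustration, the general estimator (3)–(4) and the t-based confidence interval from the introduction can be coded directly. A minimal R sketch, assuming the vector of log price changes Dp and the covariance matrix Sigma are given:

gls_inflation <- function(Dp, Sigma, alpha = 0.05) {
  u     <- rep(1, length(Dp))                        # common-trend regressor
  Si    <- solve(Sigma)
  v     <- 1 / as.numeric(crossprod(u, Si %*% u))    # variance (4)
  theta <- v * as.numeric(crossprod(u, Si %*% Dp))   # BLUE estimator (3)
  half  <- qt(1 - alpha / 2, length(Dp) - 1) * sqrt(v)
  c(theta = theta, var = v, lower = theta - half, upper = theta + half)
}

With Σ_t specified as in (5), the same computation returns the special-case estimator (6) and variance (7) given by the theorem.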

3 Empirical study

In our empirical illustration of the presented measure of inflation we use monthly data on price indices of consumer goods and services in Poland for the period I 2010 – XII 2012 (36 observations). The weights w_{i,t} are also taken from data published by the Central Statistical Office. The estimated rates of inflation for the years 2010 – 2012, with the corresponding variances and confidence intervals, are presented in Table 1.

TABLE 1. Values of the considered estimator of the rate of inflation, its variance and the corresponding 95% confidence interval for the years 2010 – 2012 in Poland.

Year (published rate)    θ̂*_t      σ²_{θ̂*_t}    95% CI
2010 (0.031)             0.0325     0.0023        (0.0274; 0.0376)
2011 (0.046)             0.0474     0.0011        (0.0450; 0.0498)
2012 (0.024)             0.0239     0.0009        (0.0219; 0.0259)


4 Conclusions

As we can see, the estimated rate of inflation (6), with the weights described by (8), is a weighted arithmetic mean of the price log-changes, where the weights are proportional to the reciprocals of the variances of the relative prices and to the budget shares, and it also takes into account correlations among prices. The presented approach, used for inflation measurement, leads to the following conclusions: the published rate of inflation in Poland seems to be too small in 2010 (it equals 3.1%, while θ̂*_t = 3.25%) and in 2011 (it equals 4.6%, while θ̂*_t = 4.74%), and minimally overestimated in 2012 (it equals 2.4%, while θ̂*_t = 2.39%). The variance of the discussed estimator is quite small and clearly acceptable. Let us also notice that all confidence intervals for the estimated rate of inflation include the value of the rate published by the Central Statistical Office in the corresponding year.

References

Balk, B.M. (1980). A Method for Constructing Price Indices for Seasonal Commodities. Journal of the Royal Statistical Society A, 143, 68 – 75.

Bryan, M.F. and Cecchetti, S.G. (1993). The Consumer Price Index as a Measure of Inflation. Economic Review, Federal Reserve Bank of Cleveland, 29, 15 – 24.

Clements, K.W. and Izan, H.Y. (1981). A Note on Estimating Divisia Index Numbers. International Economic Review, 22, 745 – 747.

Clements, K.W. and Izan, H.Y. (1987). The Measurement of Inflation: A Stochastic Approach. Journal of Business and Economic Statistics, 5, 339 – 350.

Clements, K.W., Izan, H.Y. and Selvanathan, E.A. (2006). Stochastic Index Numbers: A Review. International Statistical Review, 74(2), 235 – 270.

Crompton, P. (2000). Extending the Stochastic Approach to Index Numbers. Applied Economics Letters, 7, 367 – 371.

Diewert, W.E. (2004). On the Stochastic Approach to Linking the Regions in the ICP. Discussion Paper No. 04-16, Department of Economics, University of British Columbia.

Selvanathan, E.A. and Prasada Rao, D.S. (1994). Index Numbers: A Stochastic Approach. Ann Arbor: The University of Michigan Press.

Von der Lippe, P. (2007). Index Theory and Price Statistics. Frankfurt: Peter Lang.


Bivariate Estimation of Distribution Algorithms for Protein Structure Prediction

Daniel Bonetti1, Alexandre Delbem1, Jochen Einbeck2

1 Universidade de São Paulo, São Carlos, SP, Brazil
2 Department of Mathematical Sciences, Durham University, UK

E-mail for correspondence: [email protected]

Abstract: A real-valued bivariate 'Estimation of Distribution Algorithm' specific to the ab initio and full-atom Protein Structure Prediction problem is proposed. This is known to be a multidimensional and multimodal problem. In order to deal with the multimodality and the correlation of the dihedral angles φ and ψ, we developed approaches based on Kernel Density Estimation and Finite Gaussian Mixtures. Simulation results show that both techniques are promising when applied to this problem.

Keywords: Estimation of Distribution Algorithm; Protein Structure Prediction; Finite Gaussian Mixture; Kernel Density Estimation.

1 Background

Protein Structure Prediction (PSP) is a key problem in biology. It aims to find protein configurations, which can help in the development of new medicines. Computational methods have received attention as a way to bypass the high costs and time required by experimental methods [Bujnicki, 2009]. Although computer simulations do not yet have the same capability to predict proteins as experimental methods, new methods have been introduced over the last two decades.
In this paper we present a novel computational method to predict protein configurations. We use an Estimation of Distribution Algorithm (EDA) [Larrañaga and Lozano, 2002] to guide our search process towards good protein configurations. EDAs belong to the family of Evolutionary Algorithms and are general optimization techniques.
Most Evolutionary Algorithms use two or three solutions to compose new ones. In contrast, EDAs have the capability to extract significant statistical information from a set of promising solutions in order to create the

This paper was published as a part of the proceedings of the 29th International Workshop on Statistical Modelling, Georg-August-Universität Göttingen, 14–18 July 2014. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).


offspring. This is an important step in the optimization process, since it guides the search properly toward good solutions.
There are several types of EDAs. The simplest one uses the mean and variance of the variables to compose the offspring. However, the PSP problem is multivariate and multimodal, so simply using the mean and variance will not describe our set of solutions properly. We designed a new EDA specific to PSP with ab initio and full-atom modelling. Basically, we equipped our EDA with three statistical methods. The first method, which serves as a reference algorithm, treats the variables as independent, that is, the univariate approach. Further, we modeled the correlation between the dihedral angles φ and ψ with two-dimensional Kernel Density Estimation (KDE) [Venables and Ripley, 2002] and Finite Gaussian Mixtures (FGM) [McLachlan and Peel, 2004].
We evaluated our approach on a 25-residue protein. The results showed that, indeed, EDAs with appropriate statistical methods are able to find good solutions for small protein configurations.

2 Estimation of Distribution Algorithms for Protein Structure Prediction

EDAs are a relatively new class of optimization algorithms. They explore the search space by building and sampling probabilistic models from promising solutions. The whole set of solutions is called the population. From a randomly initialized population p, all solutions (also called individuals) are evaluated using a fitness function, a quality measure that describes how good a solution is. Next, solutions are chosen to be part of the set of selected individuals s. This new set usually contains promising solutions, from which the probabilistic model will be created. It is also important to have a diversified set of selected solutions in order to avoid premature convergence. Next, a probabilistic model of the selected individuals is built; as discussed in the previous section, we developed three methods for this. From this model, we generate a new set of individuals, called the offspring o. Finally, we can merge the population and the offspring, order them by fitness value, truncate the result at the population size, and overwrite the population. All these steps are called a generation (or iteration), and they may continue until a convergence criterion is reached, for example a predefined variation of the fitness of the population.
The fitness function of our algorithm has eight different energy types. However, in this paper only the van der Waals energy is used, since it describes well the interaction among atoms and makes the experimental analysis easier to understand.
The population is denoted as p = (p_1¹, . . . , p_n^m), in which n is the population size and m is the length of the vector (an individual). A real-valued vector holds all the dihedral angles of a protein configuration, ranging in (−180.0, 180.0). Each residue in a protein has its own number of dihedral angles. For example, the smallest residue, Glycine, has only two dihedral

tion size and m is the length of the vector (an individual). A real-valuedvector holds all the dihedral angles of a protein configuration, ranging in(−180.0, 180.0). Each residue in a protein has its own number of dihedralangles. For example, the smallest residue Glycine has only two dihedral


angles, φ and ψ, while Arginine has six: φ, ψ, χ_1, χ_2, χ_3 and χ_4. Thus, the number of dihedral angles of a protein configuration depends on the primary sequence of amino acids. In this paper, all experiments were performed using a 25-residue protein called 1A11. This yielded a vector with m = 95 positions.

3 Univariate EDA for PSP

The Univariate (UNI) version is a simple and fast algorithm. First, an individual randomly chosen from s is used to generate an o_1¹ value. Next, we perturb that value by adding a Gaussian random number with mean zero and standard deviation σ. Then, another individual from s is selected to generate the value of o_1², and the perturbation is added. That process is repeated until the vector o_1 is filled. At this point, we are ready to compute the fitness of individual o_1. That entire process is repeated until all individuals of o have been filled, that is, until o_n has been created.

4 Bivariate EDA for PSP

Considering that all variables can interact with each other, we could treat the entire individual as an m-dimensional problem. However, as shown in the previous section, even a small protein with 25 residues would produce a 95-dimensional problem. That is not computationally efficient and may not render good results. For this reason, we decided to split the problem per residue. Thus, for a 25-residue protein configuration we run 25 two-dimensional algorithms instead of one 95-dimensional one. We consider that the dihedral angles φ and ψ within the same residue are strongly correlated, since rotations produced in φ will, in general, cause stereochemical constraints on the angle ψ. Furthermore, a Ramachandran plot correlates precisely the φ and ψ of the same amino acid.
Two methods to handle the bivariate data were implemented. The first uses two-dimensional Kernel Density Estimation (KDE). For each [φ; ψ] pair, it creates a kernel density map from the real values of s. Then the φ values are generated independently, and the ψ values are generated conditionally on the values of φ. To do that, we also need to create a one-dimensional KDE of φ and sample a value from that distribution. Next, we choose the point closest to the value of φ in the just-created two-dimensional density map. At this point, there is a one-dimensional KDE that is conditional on the previous value of φ. Finally, we generate a new ψ value based on that distribution.
The second method uses a two-dimensional Finite Gaussian Mixture (FGM) in order to estimate values of the pair [φ; ψ]. For each [φ; ψ] pair, it runs a whole bivariate Expectation-Maximization (EM) algorithm until convergence, for a given number of components. The value of [φ; ψ] of the offspring is generated at once. First, a component is randomly chosen with probability π. Next, a bivariate Gaussian random number is generated according to the mean and covariance matrix of the chosen component.
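A rough sketch of the conditional KDE sampling just described, using MASS::kde2d for the density map; the gridded approximation, bandwidths and function names are our choices:

library(MASS)
sample_kde <- function(phi_obs, psi_obs, ngrid = 100) {
  dens <- kde2d(phi_obs, psi_obs, n = ngrid,
                lims = c(-180, 180, -180, 180))  # two-dimensional density map
  i <- sample(ngrid, 1, prob = rowSums(dens$z))  # phi from its (gridded) marginal
  j <- sample(ngrid, 1, prob = dens$z[i, ])      # psi conditional on that phi
  c(phi = dens$x[i], psi = dens$y[j])
}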


To keep performance, in both the two-dimensional KDE and FGM methods, we carry out all these steps for all individuals of the offspring before moving to the next pair [φ; ψ].
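The FGM counterpart: sampling one [φ; ψ] pair from a fitted mixture, where the mixing weights pi_k, component means mu and covariance matrices Sigma are assumed to come from a previously run EM fit (MASS::mvrnorm draws the bivariate Gaussian):

library(MASS)
sample_fgm <- function(pi_k, mu, Sigma) {
  k <- sample(length(pi_k), 1, prob = pi_k)    # choose a component
  mvrnorm(1, mu = mu[[k]], Sigma = Sigma[[k]]) # draw from that component
}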

5 Results

All techniques were run while varying the population size, selection pressure and tournament size. In addition, for UNI the value of σ was varied, and for FGM the number of components and λ (used to avoid singular matrices) were varied. This resulted in a total of 1680 combinations and took over 3000 hours of CPU time on the LNCC cluster. Figure 1 shows the results achieved. On the left, a performance comparison is made. As expected, UNI was the fastest. FGM was better than KDE, despite some outliers caused by runs with a high number of mixture components. Figure 1 (middle) shows a scatterplot of energy against RMS (a quality measure in PSP: low RMS means high quality). Points on the Pareto front are highlighted by a dashed line. Finally, Figure 1 (right) shows a predicted protein configuration (blue) aligned to the native one (green).


FIGURE 1. Left: performance comparison; Middle: scatterplot of energy against RMS; Right: a predicted configuration (blue) aligned to the native one (green).

References

Bujnicki, J.M. (2009). Prediction of Protein Structures, Functions, and Interactions. West Sussex: Wiley.

Larrañaga, P. and Lozano, J.A. (2002). Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. New York: Springer.

McLachlan, G. and Peel, D. (2004). Finite Mixture Models. Hoboken, NJ, USA: Wiley.

Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S. Fourth edition. New York: Springer.


Beta calibration model

Mileno Cavalcante1, Betsabé Blas2

1 Petrobras S.A., Recife, PE, Brazil
2 Universidade Federal de Pernambuco, Depto. Estatística, Recife, PE, Brazil

E-mail for correspondence: [email protected], [email protected]

Abstract: This work discusses the calibration problem in the context of nonlinear data. We propose a new calibration model in which, for each calibration stage, the response variables are beta distributed, using a reparameterization of the beta law that is indexed by mean and dispersion parameters (Ferrari and Cribari-Neto, 2004). This new approach is suitable for applications where the response variable is continuous, nonlinear and restricted to the interval (0, 1), and is related to other variable(s) through a regression structure. This structure is linked to the response variable through a logit function, and model estimation is performed by the maximum likelihood method. Fisher's information matrix is also derived. Simulation results for this model suggest it has good properties.

Keywords: Calibration model; beta regression; Monte Carlo simulation.

1 Introduction

Calibration model applications are often encountered in fields like physics, economics, biology and engineering, to name just a few. The simplest case can be thought of as the relationship between two variables, y (the response variable) and x, which is established through a known function h(·). Because in most cases this relationship is not exact, it is usually written as y = h(·) + ε, where ε is an error term.
Calibration models are usually composed of two stages: the first stage, where a sample of data pairs (x_i, y_i) is observed, i ∈ {1, 2, . . . , n}, and the second stage, where k (k > 1) values of y associated with an unknown quantity x_0 are observed (henceforth, y_0). In practice, the main interest is to jointly estimate x_0 and h(·), which can be achieved through the use of statistical techniques.

are considered: the classical and the inverse estimator (Shukla, 1972). Theclassical estimator is obtained through the joint estimation of h(·) and x0

This paper was published as a part of the proceedings of the 29th International Workshop on Statistical Modelling, Georg-August-Universität Göttingen, 14–18 July 2014. The copyright remains with the author(s). Permission to reproduce or extract any parts of this abstract should be requested from the author(s).


by least squares or maximum likelihood methods. The inverse estimator is computed through the regression of x on y (i.e. the estimation of h⁻¹(·)).

2 Beta Calibration Model

The model proposed in this work is suitable for a nonlinear response variable which lies in the open interval (0, 1), as is the case for beta-distributed variables. In this context, the relationship between y and x is also nonlinear, since the proposed model resembles the beta regression model (Ferrari and Cribari-Neto, 2004). Notice that instead of ε, we model the mean of y.
We say that a random variable y is beta distributed with parameters p > 0 and q > 0 (i.e. y ∼ B(p, q)) when its probability density function (pdf) is

f(y | p, q) = [Γ(p + q) / (Γ(p)Γ(q))] y^{p−1} (1 − y)^{q−1},        y ∈ (0, 1),        (1)

where Γ(·) is the gamma function, E(y) = p/(p + q), and V(y) = pq / [(p + q)²(p + q + 1)].

In order to obtain a regression structure for the response variable y, we follow the reparameterization suggested by Ferrari and Cribari-Neto (2004) for (1). We suppose y ∼ B(p, q), with p > 0, q > 0, for the first calibration stage, and y_0 ∼ B(p_0, q_0), with p_0 > 0, q_0 > 0, p ≠ p_0 and/or q ≠ q_0, for the second calibration stage. The reparameterization is as follows: µ = p/(p + q), φ = p + q, µ_0 = p_0/(p_0 + q_0), and φ_0 = p_0 + q_0.

Now, let y_i ∼ B(µ_i, φ), with i = 1, 2, . . . , n, and y_{i0} ∼ B(µ_0, φ_0), with i = n + 1, n + 2, . . . , n + k, be a random sample. For a logit link function g : (0, 1) → R, strictly monotonic and twice differentiable, with g(µ_i) = η_i = α + βx_i and g(µ_0) = η_0 = α + βx_0, we get µ_i = E(y_i) = h(α + βx_i), µ_0 = E(y_{i0}) = h(α + βx_0), Var(y_i) = µ_i(1 − µ_i)/(1 + φ), and Var(y_{i0}) = µ_0(1 − µ_0)/(1 + φ_0). We remark that other link functions can be used.
The joint log-likelihood function is given by

ℓ(y, y_0, µ, µ_0, φ, φ_0) = Σ_{i=1}^{n} ℓ_{i1}(µ_i, φ) + Σ_{i=n+1}^{n+k} ℓ_{i0}(µ_0, φ_0)

= Σ_{i=1}^{n} [ log Γ(φ) − log Γ(φµ_i) − log Γ((1 − µ_i)φ) + (µ_iφ − 1) log y_i + ((1 − µ_i)φ − 1) log(1 − y_i) ]

+ Σ_{i=n+1}^{n+k} [ log Γ(φ_0) − log Γ(φ_0µ_0) − log Γ((1 − µ_0)φ_0) + (µ_0φ_0 − 1) log y_{i0} + ((1 − µ_0)φ_0 − 1) log(1 − y_{i0}) ],        (2)

where y = (y_1, . . . , y_n)ᵀ, y_0 = (y_{(n+1)0}, . . . , y_{(n+k)0})ᵀ, µ = (µ_1, . . . , µ_n)ᵀ, ℓ_{i1}(µ_i, φ) = log f_1(y_i | µ_i, φ), and ℓ_{i0}(µ_0, φ_0) = log f_0(y_{i0} | µ_0, φ_0).

where y = (y1, . . . , yn)T , y0 = (y(n+1)0, . . . , y(n+k)0)T , µ = (µ1, . . . , µn)T ,`i1(µi, φ) = log f1(yi | µi, φ), and `i0(µ0, φ0) = log f0(yi0 | µ0, φ0).The regression parameters α, β, φ, φ0, and x0 are estimated by maximazing

(2), with ψ(x) = d log(Γ(x))dx , x > 0. Since the estimators for these parameters


do not have closed form, it is necessary to employ numerical methods to obtain the MLEs, using a nonlinear optimization algorithm (e.g. L-BFGS-B; Nocedal and Wright, 2006).
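A hedged R sketch of this estimation step: the joint log-likelihood (2) with the logit link, maximized by optim()'s L-BFGS-B. The parameter order, bounds and starting values are our own choices, not the authors':

loglik <- function(par, x, y, y0) {
  alpha <- par[1]; beta <- par[2]; x0 <- par[3]; phi <- par[4]; phi0 <- par[5]
  mu  <- plogis(alpha + beta * x)   # first-stage means, logit link
  mu0 <- plogis(alpha + beta * x0)  # second-stage mean
  sum(dbeta(y,  mu  * phi,  (1 - mu)  * phi,  log = TRUE)) +
    sum(dbeta(y0, mu0 * phi0, (1 - mu0) * phi0, log = TRUE))
}
fit <- function(x, y, y0) {
  optim(c(0, 0, mean(x), 10, 10), loglik, x = x, y = y, y0 = y0,
        method = "L-BFGS-B",
        lower = c(-Inf, -Inf, min(x), 1e-4, 1e-4),
        control = list(fnscale = -1))  # fnscale = -1 makes optim maximize
}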

second derivatives of (2), with ψ′(xz) = ∂2ψ(xz)∂z2 . Under regularity condi-

tions (Cox and Hinkley, 1974) and after some algebra, we get

K =

φZTWZ + φ0ZT0 W0Z0 ZTTC ZT0 T0C0

CTTTZ tr(D) 0CT0 T

T0 Z0 0 tr(D0)

(3)

where Z = [1 x 0], Z0 = [1 x0 ·1 βB ·1], W = diagw1, w2, . . . , wn, W0 =diagw(n+1)0, w(n+2)0, . . . , w(n+k)0, T = diag1/g′(µ1), . . . , 1/g′(µn),T0 = diag1/g′(µ0), . . . , 1/g′(µ0), C = (c1, . . . , cn)T , C0 = (c0, . . . , c0)T ,D = diagd1, . . . , dn, and D0 = diagd0, . . . , d0. Z is n × 3 and Z0 isk × 3.The elements of matrices in (3) are x = (x1, . . . , xn), 1 is a vector of 1s, 0 isa vector of 0s, wi = φ[ψ′(µiφ)+ψ′((1−µi)φ)] 1

[g′(µi)]2, wi0 = φ0[ψ′(µ0φ0)+

ψ′((1 − µ0)φ0)] 1[g′(µ0)]2 , ci = φ[ψ′(µiφ)µi − ψ′((1 − µi)φ)(1 − µi)], c0 =

φ0[ψ′(µ0φ0)µ0 − ψ′((1 − µ0)φ0)(1 − µ0)], di = −ψ′(φ) + µ2iψ′(φµi) + (1 −

µi)2ψ′((1−µi)φ), and d0 = −ψ′(φ0)+µ2

0ψ′(φ0µ0)+(1−µ0)2ψ′((1−µ0)φ0).

3 Numerical results

The Monte Carlo simulations for the beta calibration model were carried out considering the logit link function and the following true values for the parameters: α = 1.3, β = −2.5, x_0 = 1.2, and φ = φ_0 = 144. The samples were generated assuming that y_i ∼ B(µ_i, φ) and y_{i0} ∼ B(µ_0, φ_0), with i = 1, 2, . . . , n for y_i and i = n + 1, n + 2, . . . , n + k for y_{i0}, respectively.
The variable x, which is fixed across all samples, was generated according to the rule x_i = x_{i−1} + 2.5/(n − 1), i = 1, 2, . . . , n, starting at 0 (i.e. x ∈ (0, 2.5]). The sample sizes for the first and second stages were n = 5, 100, 500 and k = 2, 5, 10, 50, respectively. The total number of Monte Carlo replications was set at 5000 for each sample size (i.e. for all pairs (n, k)). All simulations were performed using the object-oriented matrix language Ox (Doornik, 2006), with the ML estimates of the parameters of interest calculated using the L-BFGS-B algorithm. The results are presented in Table 1.
In Table 1, for all parameters except x_0, we can see that the empirical bias, the mean estimated variance (MEV) and the empirical mean square error (MSE) decrease as n and/or k increase. Also, we notice that the MEV approaches the MSE for large sample sizes (of n and k). Regarding x_0, our main parameter of interest, the simulation results show very similar outcomes for small and large samples.
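For illustration, the simulation design just described is easy to reproduce, here in R rather than the authors' Ox code:

set.seed(1)
n <- 100; k <- 10
x   <- cumsum(rep(2.5 / (n - 1), n))        # x_i = x_{i-1} + 2.5/(n - 1)
mu  <- plogis(1.3 - 2.5 * x)                # true alpha = 1.3, beta = -2.5
y   <- rbeta(n, mu * 144, (1 - mu) * 144)   # first stage, phi = 144
mu0 <- plogis(1.3 - 2.5 * 1.2)              # true x0 = 1.2
y0  <- rbeta(k, mu0 * 144, (1 - mu0) * 144) # second stage, phi0 = 144
# fit(x, y, y0)  # using the fit() sketch from the previous section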

TABLE 1. Simulation results for the beta calibration model.

                      n = 5                  n = 100                n = 500
                 k = 2      k = 5       k = 10     k = 50      k = 10     k = 50
α    Mean       1.31260    1.31230     1.30130    1.29980     1.30030    1.29990
     Variance   0.02858    0.02763     0.00176    0.00172     0.00036    0.00037
     Bias       0.01260    0.01230     0.00130   −0.00020     0.00030   −0.00010
     MSE        0.02874    0.02778     0.00176    0.00176     0.00036    0.00037

β    Mean      −2.51820   −2.51530    −2.50090   −2.49990    −2.50050   −2.49990
     Variance   0.03379    0.02655     0.00196    0.00164     0.00041    0.00040
     Bias      −0.01820   −0.01530    −0.00090   −0.00010    −0.00050    0.00010
     MSE        0.03412    0.02678     0.00196    0.00164     0.00041    0.00040

x0   Mean       1.25000    1.25000     1.25000    1.25000     1.25000    1.25000
     Variance   2.13e-32   2.13e-32    2.03e-32   2.01e-32    1.98e-32   1.99e-32
     Bias       0.05000    0.05000     0.05000    0.05000     0.05000    0.05000
     MSE        0.00250    0.00250     0.00250    0.00250     0.00250    0.00250

φ    Mean       583.390    626.370     149.580    149.140     145.190    145.050
     Variance   2.12e+06   8.06e+06    461.680    464.190     81.1480    81.4520
     Bias       439.900    482.370     5.58000    5.14000     1.19000    1.05000
     MSE        2.32e+06   8.29e+06    492.850    490.560     82.5590    82.5610

φ0   Mean       37662.0    346.440     185.490    152.080     180.760    150.870
     Variance   7.19e+10   9.21e+05    12792.0    1001.30     11090.0    965.520
     Bias       37518.0    202.440     41.4900    8.08000     36.7600    6.87000
     MSE        7.33e+10   9.62e+05    14513.0    1066.60     12442.0    1012.70

References

Blas, B.G., Sandoval, M.C., and Yoshida, O. (2007). Homoscedastic Controlled Calibration Model. Journal of Chemometrics, 21, 145 – 155.

Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. London: Chapman & Hall.

Doornik, J.A. (2006). Ox: An Object-Oriented Matrix Language. London: Timberlake Consultants Press.

Ferrari, S.L.P. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799 – 815.

Nocedal, J. and Wright, S.J. (2006). Numerical Optimization, 2nd ed. New York: Springer-Verlag.

Shukla, G.K. (1972). On the problem of calibration. Technometrics, 14, 547 – 553.

Simas, A.B., Barreto-Souza, W., and Rocha, A.V. (2010). Improved estimators for a general class of beta regression models. Computational Statistics & Data Analysis, 54, 348 – 366.


Parameter Estimation for the Exponentiated Modified Weibull Model with Long Term Survival: A Simulation Study

Hayala Cristina Cavenague de Souza1,2, Gleici da Silva Castro Perdona1, Francisco Louzada Neto3, Fernanda Maris Peria1

1 FMRP - USP, Sao Paulo
2 LEE - Instituto Dante Pazzanese, Sao Paulo
3 ICMC - USP, Sao Paulo

E-mail for correspondence: [email protected]

Abstract: In this paper, we present the Exponentiated Modified Weibull Model for Long Term Survivors, which embeds several existing lifetime models in a more general and flexible framework. Point and interval estimation are based on maximum likelihood and bootstrap resampling. A simulation study is performed in order to verify the frequentist properties of the inference procedures. A real data set representing the time from breast cancer diagnosis to death is analyzed for illustration purposes.

Keywords: Exponentiated Modified Weibull Model; Cure Rate Modeling; Breast cancer.

1 Introduction

Mixture models for long-term survivors (Berkson and Gage, 1952) have been widely used for fitting time-to-event data in which some individuals may never suffer the cause of failure under study. Interested readers can refer to Maller and Zhou (1996) and Perdona and Louzada (2011) for more information and a literature review.
In this paper, we present the Exponentiated Modified Weibull Model for Long Term Survivors (EMWLT), which embeds several existing lifetime models in a more general and flexible framework and allows the accommodation of non-monotone hazard function shapes.
Point estimation via maximum likelihood, interval estimation via asymptotic theory and bootstrap resampling, and model selection via AIC and BIC are presented. The properties of the estimators for a particular case of the model and of the selection criteria - the Akaike information


criterion (AIC) and the Bayesian information criterion (BIC) - are evaluated via a simulation study.

2 Model Formulation

Let T be a nonnegative random variable representing the lifetime of an individual in some population. The hazard function at time t is defined as $h(t) = \lim_{\Delta t \to 0} \Pr(t \le T < t + \Delta t \mid T \ge t)/\Delta t = f(t)/S(t)$ (Lawless, 2003). The EMWLT hazard function is given by

h(t; p, \alpha, \beta, \lambda, \theta) = \frac{p\,\theta\,(\alpha t)^{\beta}\left(\beta/t + \lambda\right)\exp\left(-(\alpha t)^{\beta}e^{\lambda t} + \lambda t\right)\left[1 - \exp\left(-(\alpha t)^{\beta}e^{\lambda t}\right)\right]^{\theta-1}}{1 - p\left\{1 - \exp\left[-(\alpha t)^{\beta}e^{\lambda t}\right]\right\}^{\theta}}   (1)
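To make the ingredients of (1) concrete, here is a minimal R transcription of the EMW cdf and the EMWLT hazard; the function names are ours, and the code is a sketch rather than the authors' implementation.

# EMW cdf: F(t) = [1 - exp(-(alpha*t)^beta * exp(lambda*t))]^theta
emw_cdf <- function(t, alpha, beta, lambda, theta)
  (1 - exp(-(alpha * t)^beta * exp(lambda * t)))^theta

# EMWLT (long-term survival) hazard, transcribing equation (1)
emwlt_haz <- function(t, p, alpha, beta, lambda, theta) {
  H <- (alpha * t)^beta * exp(lambda * t)
  num <- p * theta * (alpha * t)^beta * (beta / t + lambda) *
    exp(-H + lambda * t) * (1 - exp(-H))^(theta - 1)
  num / (1 - p * emw_cdf(t, alpha, beta, lambda, theta))
}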

3 Point Estimation

The estimation procedure is maximum-likelihood-based. Let $T_1^0, T_2^0, \ldots, T_n^0$ be the true survival times of a sample of size n. Assume that they are independent, identically distributed random variables with hazard function $h_0(t)$, for $i = 1, \ldots, n$, with observations subject to arbitrary right censoring, so that the period of follow-up for the i-th individual is limited to a value $C_i$. The observed survival time of the i-th individual is then given by $T_i = \min(T_i^0, C_i)$. Let $\delta_i = 1$ if $T_i = T_i^0$ (that is, if $T_i$ is an observed death) and $\delta_i = 0$ if $T_i < T_i^0$ (that is, if $T_i$ is a censored observation).
The maximum likelihood estimates (MLEs) of the parameters of model (1) can be obtained by direct numerical maximization of the log-likelihood function based on $L(t, \gamma) = \prod_{i=1}^{n} f(t_i; \gamma)^{\delta_i} S(t_i; \gamma)^{1-\delta_i}$. The advantage of this procedure is that it runs immediately using existing statistical packages. The maximization can be performed by solving the system of nonlinear equations given by the partial derivatives of $\log L(t, \gamma)$ with respect to the parameters, using iterative methods.
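A minimal R sketch of this direct maximization follows, reusing the emw_cdf and emwlt_haz functions sketched in Section 2 and the long-term-survivor structure $S_{pop}(t) = (1-p) + pS(t)$, $f_{pop}(t) = pf(t)$ of the mixture cure model; the parameter transformations and variable names are our illustrative choices.

# time: observed times; status: 1 = observed death, 0 = censored
loglik <- function(par, time, status) {
  p <- plogis(par[1])                      # keep p in (0, 1)
  a <- exp(par[2]); b <- exp(par[3])       # alpha, beta > 0
  l <- exp(par[4]); th <- exp(par[5])      # lambda, theta > 0
  S <- 1 - emw_cdf(time, a, b, l, th)      # EMW survival
  f <- emwlt_haz(time, 1, a, b, l, th) * S # EMW density, since f = h * S
  sum(status * log(p * f) + (1 - status) * log(1 - p + p * S))
}
opt <- optim(rep(0, 5), function(par) -loglik(par, time, status))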

4 Interval Estimation

In order to obtain more precise intervals, it has been suggested to use a parameter transformation when the parameter space is restricted. Because $\alpha$, $\beta$, $\lambda$ and $\theta$ are positive parameters, a log transformation can be used to obtain approximate confidence intervals (CIs) for these parameters (Ng, 2005). Therefore, a two-sided $100(1-\upsilon)\%$ CI for $\alpha$ is given by

\left[\; \hat\alpha \big/ \exp\left\{z_{1-\upsilon/2}\sqrt{\widehat{Var}(\log\hat\alpha)}\right\} \;;\; \hat\alpha \exp\left\{z_{1-\upsilon/2}\sqrt{\widehat{Var}(\log\hat\alpha)}\right\} \;\right]   (2)


where $\widehat{Var}(\log\hat\alpha)$ can be calculated via the delta method:

\widehat{Var}(\log\hat\alpha) = \frac{\widehat{Var}(\hat\alpha)}{\hat\alpha^{2}}.   (3)

Similarly, using the logit transformation for p, the CI is given by

\left[\left(\frac{1-\hat p}{\hat p}\exp\left\{z_{1-\upsilon/2}\sqrt{\widehat{Var}(\mathrm{logit}(\hat p))}\right\} + 1\right)^{-1} ;\; \left(\frac{1-\hat p}{\hat p}\,\frac{1}{\exp\left\{z_{1-\upsilon/2}\sqrt{\widehat{Var}(\mathrm{logit}(\hat p))}\right\}} + 1\right)^{-1}\right]   (4)

where $\widehat{Var}(\mathrm{logit}(\hat p)) = \widehat{Var}(\hat p)/(\hat p - \hat p^2)^2$.

An alternative way to obtain CIs is to consider the bootstrap resampling methodology, a computer-based simulation method for statistical inference (Efron and Tibshirani, 1986). Assume that we have a data set with n pairs $(t_i, \delta_i)$; the bootstrap resampling process consists of sampling, R times, n pairs from the original sample with replacement. This process results in R samples of size n. For each sample, the parameter estimates are obtained and recorded. The bootstrap CI is given by the estimated quantiles.
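A minimal R sketch of this percentile bootstrap, assuming a user-supplied fitting function fit_fun that returns the vector of parameter estimates for a given sample:

boot_ci <- function(time, status, fit_fun, R = 300, probs = c(0.025, 0.975)) {
  n <- length(time)
  est <- replicate(R, {
    idx <- sample.int(n, replace = TRUE)   # resample the (t_i, delta_i) pairs
    fit_fun(time[idx], status[idx])        # refit and record the estimates
  })
  apply(est, 1, quantile, probs = probs)   # percentile CI for each parameter
}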

5 Model Selection

The AIC and BIC (Burnham and Anderson, 2004) are measures used in model selection. For a model i, they are given by $AIC_i = 2k - 2\log(L_i)$ and $BIC_i = k\log(n) - 2\log(L_i)$, where k is the number of parameters and $L_i$ the maximized likelihood.
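Given the maximized log-likelihoods, both criteria are one-line computations; a small R sketch:

aic <- function(logL, k)    2 * k - 2 * logL
bic <- function(logL, k, n) k * log(n) - 2 * logL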

6 Simulation Study

A Monte Carlo simulation study was conducted to assess the inference methods previously presented. Censored samples from the EMWLT were generated via inversion of the cdf. The generating values were defined according to the hazard function shape, as follows:

• (α = 2, β = 3, λ = 0.5, θ = 1)

• (α = 2, β = 0.8, λ = 0.5, θ = 1)

• (α = 0.5, β = 0.8, λ = 0.5, θ = 1)

• (α = 0.5, β = 0.8, λ = 3, θ = 1)

Different sample sizes (n = 50, 100 and 300) and long-term proportions (p = 10%, 50% and 70%) were considered. To assess the properties of the MLEs and the asymptotic methods, B = 1000 samples were generated. To evaluate the CIs constructed via bootstrap resampling, B = 100 samples were generated with R = 300 resamples each. The biases and mean square errors (MSE) of the MLEs decrease as the sample size increases. When comparing the MWLT with the WLT, both the AIC and the likelihood criterion correctly identified the generating model. The coverage probabilities of both interval methods (asymptotic and bootstrap) reach the desired confidence level (95%) for larger sample sizes. The asymptotic CI was not very efficient for α and λ, improving when the bootstrap CI was considered.


7 Breast Cancer Data

We applied the proposed model to a breast cancer dataset. We considered the time from diagnosis to death for 96 women with advanced-stage disease treated at the Ribeirao Preto School of Medicine Clinic Hospital, Brazil, between 2000 and 2010. In the sample, 32.29% of the patients had their follow-up time censored. The hazard function is bathtub-shaped at early times and decreasing at the end. The EMWLT and MWLT models were fitted to the data; the EMWLT model presents the smaller AIC and fits the data better graphically. The bootstrap CI was more accurate than the asymptotic one, particularly for α and p.

References

Berkson, J. and Gage, R.P. (1952). Survival curve for cancer patients following treatment. Journal of the American Statistical Association, 52, 501 – 515.

Burnham, K.P. and Anderson, D.R. (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2), 261 – 304.

Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1, 54 – 75.

Ghitany, M. and Maller, R. (1992). Asymptotic results for exponential mixture models with long-term survivors. Statistics, 23, 321 – 336.

Lawless, J.F. (2003). Statistical Models and Methods for Lifetime Data. New York: John Wiley.

Maller, R.A. and Zhou, X. (1996). Survival Analysis with Long-Term Survivors. New York: John Wiley.

Ng, H.K.T. (2005). Parameter estimation for a modified Weibull distribution for progressively type-II censored samples. IEEE Transactions on Reliability, 54(3), 374 – 380.

Perdona, G.S.C. and Louzada, F. (2011). A general hazard model for lifetime data in the presence of cure rate. Journal of Applied Statistics, 38, 1395 – 1405.


P-IRLS Representation of a Semiparametric Bivariate Triangular Cumulative Link Model

Francesco Donat1, Giampiero Marra1

1 Department of Statistical Science, University College London, London, UK

E-mail for correspondence: [email protected]

Abstract: We propose a method to estimate the parameters in a regression of an ordered outcome on some covariates where a background variable simultaneously affects the response and an ordered treatment. The functional forms of the dependences between the variables involved are specified very flexibly and represented using smoothing splines. An equivalent P-IRLS representation of the model, used for the estimation of the smoothing parameters, is explicitly derived.

Keywords: Ordered outcomes; Penalised regression splines; Simultaneous equations system; Unmeasured confounding.

1 Introduction

We consider the regression of an ordered variable of interest on some measured covariates where a background variable is assumed to be explanatory both to the response of primary interest and to one of its ordered regressors. We refer to this situation as a direct confounding effect.
Problems arise when the researcher fails to adjust for pertinent confounders, as they might be either unknown or not readily quantifiable. When this happens, the confounding effect induces endogeneity, and the use of standard estimators typically yields inconsistent estimates.
The case of ordered outcomes in such regressions has only few counterparts in the literature, and these are mostly confined to applied research. Because of that, little attention has been given to methodological strategies when unmeasured confounders are present in the study. This issue is addressed here through a triangular simultaneous system of equations.
In this paper, we only examine the bivariate case, which arises where one discrete ordered confounded treatment variable is included in the ordered regression. In doing so, we extend the existing parametric model (e.g. Sajaia, 2008) to a semiparametric framework where the covariate effects are estimated nonparametrically via smoothing splines.


Since the parameter vector and the smoothing parameters cannot be estimated simultaneously, a Generalized Cross Validation approach is planned to be adopted for the latter. The derivation of the needed P-IRLS representation of the model is the primary goal of this paper.

2 Model Description and Estimation

Let us consider the following bivariate extension of a Cumulative Link Model (McCullagh, 1980) defined by:

\Pi_{k_1,k_2} = g(\theta((k_1,k_2)\,|\,x, f);\, \eta_3),   (1)

where $\Pi_{k_1,k_2} := P[Y_1 \le k_1, Y_2 \le k_2]$ is a bivariate cdf with a partially ordered support, $\mathrm{supp}(Y) = K_1 \times K_2$. We further set $g : \mathbb{R}^2 \to [0,1]$ a link function, while $\theta((k_1,k_2)\,|\,x, f)$ is assumed to have a semiparametric form depending on the ordered categories $(k_1,k_2)$, the data $x = (x_1^T, x_2^T)^T$, and some unknown smooth functions, $f = (f_1^T, f_2^T)^T$, of the strictly exogenous continuous covariates $z$.
The triangular specification of the model imposes a peculiar structure on the predictors:

\theta((k_1,k_2)\,|\,x, f) = [\eta_{1,k_1}(x_1, f_1, \vartheta_1),\, \eta_{2,k_2}(x, f, \vartheta);\, \eta_3],

which has to be accounted for in the proposed estimation procedure. In the above, $\vartheta$ denotes the parameter vector, and $\eta_3$ any unidimensional association parameter required by the link function. When $g$ is a bivariate standard Normal distribution, it is well known that (1) can be represented by a bivariate system of simultaneous equations for the continuous random vector $Y^* = (Y_1^*, Y_2^*)^T$ defined on the latent scale implied by $g$. In what follows, we assume additivity in the effects of the covariates and the smooths on the responses, whereas the predictors $\eta_j$, for $j = 1, 2$, need not be linear in $\vartheta$.
The functions $f_{j,l_j}$ in $f$ are described using smoothing splines. They can therefore be approximated by linear combinations of some unknown regression parameters, $\delta_{j,l_j}$, and known basis functions $b_{j,l_j}$. The imposition of constraints on the smooth terms and on the parameter vector is necessary to achieve identification. The latter usually involves the inclusion in the equation for $Y_1^*$ of a covariate not explanatory to $Y_2^*$: this is often regarded as an instrumental variable.
Our estimation is based on the likelihood function under a roughness penalty approach for the parameter vector $\vartheta$, where the resulting penalized likelihood aims to avoid over-fitting in the representation of the smooths:

\ell_p(\vartheta) = \ell(\vartheta) - \frac{1}{2}\vartheta_*^{T} S_{\lambda}\vartheta_*.   (2)

In the above, $S_{\lambda}$ is made up of a symmetric positive semidefinite matrix - appropriately padded with zeros - that depends on the bases and on the smoothing parameters, $\lambda$, associated with each smooth function, while $\vartheta_*$ is a suitable transformation of $\vartheta$. The smoothing parameters need to be estimated within the model; however, this may not occur simultaneously with $\vartheta$, and an iterative procedure is therefore necessary (Wood, 2006). To this purpose, we adopt a Generalized Cross Validation criterion (Craven & Wahba, 1979) based on the P-IRLS representation of the model.

3 A P-IRLS Representation

The maximum penalized likelihood estimator is defined, for any given $\lambda$, as the solution of the following nonsingular p-dimensional system of equations:

\nabla\ell_p(\vartheta) = 0_p,

where the left-hand side can be decomposed as $D^T u - \frac{1}{2}g(\vartheta)$, with $D$ being the $(p \times 5n)$ matrix of the first derivatives of the predictors with respect to the parameter vector, and $g(\vartheta)$ the score of the penalisation term. Similarly to Green (1984), we also write the Hessian matrix as $D^T W D + K - \frac{1}{2}H(\vartheta)$, where $W$ accounts for the second derivatives of the unpenalised likelihood with respect to each predictor.
Letting $\hat\vartheta$ denote the MPLE, the first-order Taylor approximation of the score vector $\nabla\ell_p(\hat\vartheta)$ about $\vartheta_0$ reads

\nabla\ell_p(\hat\vartheta) \approx \nabla\ell_p(\vartheta_0) + H_p(\vartheta_0)(\hat\vartheta - \vartheta_0) = 0_p

or, equivalently, $(D^T W D + K - \frac{1}{2}H(\vartheta))(\vartheta - \vartheta^*) = D^T u - \frac{1}{2}g(\vartheta)$, with $\vartheta^*$ a given step of the iterative procedure employed to maximise (2). It can be proven that $\vartheta^*$ also solves the penalized least squares criterion

\vartheta^* = \arg\min_{t\in\Theta} \|\bar{B}^T(\bar{z} - \bar{D}t)\|_2^2 + t^T\bar{S}t,   (3)

which stems from regressing $\bar{B}^T\bar{z}$ onto the columns of $\bar{B}^T\bar{D}$ with weight matrix $\bar{W} = \bar{B}\bar{B}^T$; $z := D\vartheta - W^{-1}u$ is the pseudo-data vector, and $\bar{S} := Q_1 - Q_2Q_2^T$, with $Q_1(\lambda) := K - \frac{1}{2}H(\vartheta)$ and $-Q_2(\lambda) := Q_1(\lambda)\vartheta + \frac{1}{2}g(\vartheta)$. These quantities are augmented versions of those defining the score and the Hessian matrix of (2), and have been computed to ensure a quadratic penalisation term in the P-IRLS (3). In particular we have:

\bar{D} = \begin{bmatrix} D \\ Q_2^T \end{bmatrix}, \qquad \bar{z} = \begin{bmatrix} z \\ -1 \end{bmatrix}, \qquad \bar{W} = \begin{bmatrix} W & 0_{5n} \\ 0_{5n}^T & 1 \end{bmatrix}.

In the context of Generalized Linear Models, problem (3) is usually introduced for the automatic selection of $\lambda$ (Wood, 2006), for which stable and efficient computational routines are available (Wood, 2004). Although other representations equivalent to (3) can be derived, we have maintained here a quadratic penalisation term to stress the similarities between the CLM considered and the more developed literature on GAMs.
The above representation nests bivariate models with dichotomous outcomes, and/or without any triangular structure. For example, the results in Marra & Radice (2011) are obtained by imposing $K = 0_{p,p}$ and $\vartheta^* = \vartheta$.
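For illustration, a single penalized working-model update of the kind underlying (3) can be written in a few lines of R; this is a sketch of generic P-IRLS using the text's definitions of D, W, u, the penalty S and the pseudo-data z (with sign conventions following those definitions), not the authors' routine.

# One generic P-IRLS step at the current iterate theta
pirls_step <- function(theta, D, W, u, S) {
  z <- D %*% theta - solve(W, u)                 # pseudo-data z = D*theta - W^{-1}u
  solve(t(D) %*% W %*% D + S, t(D) %*% W %*% z)  # penalized weighted LS solution
}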


4 Discussion and Future Work

This paper has introduced a P-IRLS representation of a semiparametric bivariate triangular CLM suitable for encompassing studies affected by direct unmeasured confounding. In the GAM context, this representation is often used to estimate the smoothing parameters automatically.
The applicability of the proposed method will crucially require a fast and stable estimation procedure, as well as a computer routine available to the general audience of practitioners. This means, in particular, the need to develop methods to compute all the quantities involved efficiently.

Acknowledgments: Financial support for the first author's doctoral studies by a UCL Impact Studentship is gratefully acknowledged.

References

Craven, P. and Wahba, G. (1979). Smoothing Noisy Data with Spline Functions. Numerische Mathematik, 31, 377 – 403.

Green, P.J. (1984). Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives. Journal of the Royal Statistical Society, Series B, 46, 149 – 192.

Marra, G. and Radice, R. (2011). Estimation of a Semiparametric Recursive Bivariate Probit Model in the Presence of Endogeneity. The Canadian Journal of Statistics, 39, 259 – 279.

McCullagh, P. (1980). Regression Models for Ordinal Data. Journal of the Royal Statistical Society, Series B, 42, 109 – 142.

Sajaia, Z. (2008). Maximum Likelihood Estimation of a Bivariate Ordered Probit Model: Implementation and Monte Carlo Simulations. Mimeo.

Wood, S.N. (2004). Stable and Efficient Multiple Smoothing Parameter Estimation for Generalized Additive Models. Journal of the American Statistical Association, 99, 673 – 686.

Wood, S.N. (2006). Generalized Additive Models: An Introduction With R. London: Chapman & Hall.


Normative Values of Visual Acuity and Contrast Sensitivity in Older Irish Adults

Cara Dooley1, Joanne Feeney1, Rose Anne Kenny1

1 TILDA, Trinity College, Dublin, Ireland.

E-mail for correspondence: [email protected]

Abstract: Normative values are useful to healthcare professionals. Vision has been shown to be associated with both falls and fear of falling in older adults. The data collected by TILDA place it in a unique position to calculate normative values of contrast sensitivity and visual acuity for the Irish population of over 50s. GAMLSS models are used to estimate the centiles of both the contrast sensitivity and visual acuity distributions.

Keywords: GAMLSS; Longitudinal; Ageing; Normative Values.

1 Introduction

The Irish Longitudinal Study on Ageing (TILDA) is a prospective cohort study of the social, economic and health circumstances of community-dwelling older adults in Ireland. TILDA is currently in its third wave. The analysis here is based on the first wave of data, collected between October 2009 and July 2011. The sampling frame is the Irish Geodirectory, a listing of all residential addresses in the Republic of Ireland. A clustered sample of addresses was chosen, and household residents aged 50 years and older and their spouses/partners (of any age) were eligible to participate. Ethical approval was obtained from the Trinity College Dublin Research Ethics Committee, and participants provided written informed consent.
The study design is described in depth elsewhere (Kenny et al., 2009). Briefly, data collection included: (i) a computer-assisted personal interview with detailed questions on socio-economics, demographics, wealth, health, lifestyle, social support and participation, use of health and social care, and attitudes to ageing; (ii) a self-completion questionnaire; and (iii) a detailed health assessment carried out by research nurses, including cognitive, cardiovascular, mobility, strength, bone and vision tests. In total, 8175 individuals aged over 50 years were interviewed, of whom 5037 attended the health


center assessment (61.3%). The vision measures were carried out as part of the health assessment, so this analysis concentrates on those individuals.

1.1 Vision Measures

Visual acuity (VA) represents a high contrast letter recognition task, and was assessed using an Early Treatment Diabetic Retinopathy Study (ETDRS) logMAR chart (Precision Vision, La Salle, IL, USA) at a viewing distance of 4 m. VA was measured psychophysically for both eyes, using the habitual distance vision correction if required. For statistical purposes, the best acuity value measured from either eye was selected and converted to a Visual Acuity Score (VAS). This score inverts the logMAR scale using the formula VAS = 100 − 50 logMAR, so that a VAS of 100 represents a logMAR score of 0, or 20/20 vision. For each letter that is read correctly using the ETDRS chart, there is a corresponding 1-point increase in the VAS. This allows a more intuitive interpretation of the acuity scores, as higher values indicate better acuity.
Contrast sensitivity (CS) represents the ability to distinguish an object from the background in varying size and contrast conditions. It was measured in the eye with better VA using a Functional Vision Analyser (Stereo Optical, Chicago, IL, USA) under mesopic (3 cd/m²) background illumination conditions. Testing was then repeated for the same background illumination conditions, but in the presence of a radial glare source. During the test, the respondent viewed a Functional Acuity Contrast Test (FACT), which comprised sinusoidal gratings presented as Gabor patches at five spatial frequencies of 1.5, 3, 6, 12 and 18 cycles per degree (cpd). For each spatial frequency, a series of nine patches was presented in order of decreasing contrast (0.15 log units, or 50% loss of contrast, between consecutive patches). Respondents were instructed to indicate whether the gratings tilted to the left (+15°), to the right (−15°) or were upright (0°), moving from patch 1 to 9 for each spatial frequency tested, in order of increasing frequency. The CS score corresponds to the contrast of the last grating that was accurately identified on each row.
In addition, respondents were asked to self-rate their vision, and a history of visual impairments was taken.

2 Statistical Analysis

Figures 1 and 2 show the distributions of the CS and VA measures for the sample. Centiles are calculated using a GAMLSS (Generalised Additive Model for Location, Scale and Shape). A GAMLSS model is more flexible than a general or generalised linear model, allowing the distribution of the response to come from outside the exponential family of distributions. A GAMLSS model also allows for the estimation of additional parameters beyond the mean and variance, for example skewness and kurtosis. The form of the model will be shown, along with the estimated centiles.
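As an indication of how such a fit is set up, here is a minimal sketch with the R gamlss package; the data frame d, the variables vas (visual acuity score) and age, and the choice of the BCT family are our illustrative assumptions, not the exact TILDA model.

library(gamlss)
# Model location, scale and skewness smoothly in age with P-splines
m <- gamlss(vas ~ pb(age), sigma.formula = ~ pb(age),
            nu.formula = ~ pb(age), family = BCT, data = d)
centiles(m, xvar = d$age, cent = c(5, 25, 50, 75, 95))  # estimated centile curves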


FIGURE 1. Mean Contrast Sensitivity for both the Glare and No Glare conditions by age category and sex.

FIGURE 2. Visual acuity by age category and sex.

3 Conclusion

CS and VA have previously been shown to be associated with fear of falling, falls and reduced mobility in older populations. This can cause knock-on effects that are costly both in terms of individuals' health and, from an economic standpoint, in terms of healthcare utilisation. Normative values of CS and VA may therefore be useful to healthcare professionals. The use of the flexible GAMLSS model for estimating these normative values leads to improved inference.

Acknowledgments: Special thanks to the TILDA participants. This work was supported by the Irish Government, Irish Life plc and The Atlantic Philanthropies.

References

Kenny, R.A., Coen, R.F., Frewen, J., Donoghue, O.A., Cronin, H. and Savva, G.M. (2013). Normative Values of Cognitive and Physical Function in Older Adults: Findings from The Irish Longitudinal Study on Ageing. Journal of the American Geriatrics Society, 61(s2), S279 – S290.

Kenny, R.A., Whelan, B.J., Cronin, H. et al. (2009). Design Report of The Irish Longitudinal Study on Ageing. Online: www.tcd.ie/tilda/publications.

Muniz Terrera, G., van den Hout, A., Rigby, R. and Stasinopoulos, D.M. (2012). Analysing cognitive test data: distributions and non-parametric random effects. Statistical Methods in Medical Research, in press.

Stasinopoulos, D.M. and Rigby, R.A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1 – 46.


A study of online and blockwise updating of the EM algorithm for Gaussian mixtures

Jochen Einbeck1, Daniel Bonetti2

1 Department of Mathematical Sciences, Durham University, UK
2 Universidade de Sao Paulo, Sao Carlos, SP, Brazil

E-mail for correspondence: [email protected]

Abstract: A variant of the EM algorithm for the estimation of multivariate Gaussian mixtures, which allows for online as well as blockwise updating of sequentially obtained parameter estimates, is investigated. Several different update schemes are considered and compared, and the benefits of artificially performing EM in batches, even though all data are available, are discussed.

Keywords: Multivariate Gaussian mixtures; Maximum Likelihood; IncrementalEM.

1 Background

Consider multivariate data $y_i = (y_{i1}, \ldots, y_{ip})^T$, $i = 1, \ldots, n$, sampled independently from some population consisting of several latent groups or subpopulations. Data of this type are conveniently modelled through Gaussian mixture models

f(y_i\,|\,\theta) = \sum_{k=1}^{K} \pi_k f(y_i\,|\,\theta_k),

where $f(y_i\,|\,\theta_k) = (2\pi)^{-p/2}|\Sigma_k|^{-1/2}\exp\{-\frac{1}{2}(y_i - \mu_k)^T\Sigma_k^{-1}(y_i - \mu_k)\}$, $\theta_k = \{\pi_k, \mu_k, \Sigma_k\}$, and $\theta = \{\theta_1, \ldots, \theta_K\}$, with the restriction $\pi_K = 1 - \sum_{k=1}^{K-1}\pi_k$.
Assume a given set of starting values, say $\theta^{(0)}$. The EM algorithm iterates between the Expectation (E) step, updating the posterior probabilities $w_{ik} = P(\text{obs. } i \text{ belongs to comp. } k)$ for given $\theta_k$,

w_{ik} \equiv w(y_i, \theta_k) = \frac{\pi_k f(y_i\,|\,\theta_k)}{\sum_{l=1}^{K}\pi_l f(y_i\,|\,\theta_l)},   (1)


and the Maximization (M) step, where the parameter estimates are updated for given $w_{ik}$ as

\hat\pi_k = \frac{1}{n}\sum_{i=1}^{n} w_{ik}; \qquad \hat\mu_k = \frac{\sum_{i=1}^{n} w_{ik}\,y_i}{\sum_{i=1}^{n} w_{ik}}; \qquad \hat\Sigma_k = \frac{\sum_{i=1}^{n} w_{ik}(y_i - \hat\mu_k)(y_i - \hat\mu_k)^T}{\sum_{i=1}^{n} w_{ik}}.

Let $\theta^{(j)}$ denote the value of $\theta$ after the $j$-th iteration and $l_j = \sum_{i=1}^{n}\log f(y_i\,|\,\theta^{(j)})$ the corresponding log-likelihood. Convergence is established at the $j$-th iteration if the difference $l_j - l_{j-1}$ falls below a small threshold (which we take to be 0.001). An important feature of this methodology should be highlighted: after convergence of the EM algorithm we have not only obtained the estimate $\hat\theta$, but are also able to assign each observation $i$ to a class $c_i \in \{1, \ldots, K\}$ via the classification rule

\hat{c}_i = \arg\max_k\, w_{ik}.

For details on how the described routine relates to the original formulation of the EM algorithm, we refer to Aitkin et al. (2009), Section 7.
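For concreteness, a minimal R sketch of the E- and M-steps just described (using the mvtnorm package for the multivariate Gaussian density; all names are ours):

library(mvtnorm)

# E-step: posterior membership probabilities w_ik as in (1)
e_step <- function(y, pi, mu, Sigma) {
  y <- rbind(y)                           # allow a single new observation
  dens <- sapply(seq_along(pi),
                 function(k) pi[k] * dmvnorm(y, mu[[k]], Sigma[[k]]))
  dens <- matrix(dens, nrow = nrow(y))    # n x K matrix of weighted densities
  dens / rowSums(dens)
}

# M-step: weighted updates of pi_k, mu_k and Sigma_k
m_step <- function(y, W) {
  fits <- lapply(seq_len(ncol(W)), function(k) {
    wk <- W[, k]
    mu <- colSums(wk * y) / sum(wk)
    d  <- sweep(y, 2, mu)                 # centred data
    list(pi = mean(wk), mu = mu, Sigma = crossprod(d, wk * d) / sum(wk))
  })
  list(pi    = sapply(fits, `[[`, "pi"),
       mu    = lapply(fits, `[[`, "mu"),
       Sigma = lapply(fits, `[[`, "Sigma"))
}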

2 Updating the EM algorithm

The methodology described in the previous section implies the assumption that the full set of responses $y = (y_1^T, \ldots, y_n^T)^T$ has been observed before the EM algorithm is run. Unfortunately, for new data $y_{n+j}$, $j \ge 1$, the fitted model does not provide information on the class membership $c_{n+j}$, since the $w_{n+j,k}$ are unknown. Traditionally, if $n'$ new data have been observed, the way forward would have been to re-fit the entire model, using data $y_1, \ldots, y_n, y_{n+1}, \ldots, y_{n+n'}$. This is clearly inefficient, and may be computationally expensive, especially if $n$ and/or $p$ are large. However, the EM algorithm can easily be adapted to become an update algorithm, which we outline as follows. Assume that, using a batch of size $n$, the EM algorithm as outlined in Section 1 has been executed, yielding estimates $\hat\theta$ and a weight matrix $W = (w_{ik})_{1\le i\le n,\,1\le k\le K}$. Now, a new batch of size $n'$ has been made available. Via (1), one can compute new membership probabilities

w'_{n+j,k} = w(y_{n+j}, \hat\theta_k)

using the already computed $\hat\theta$, which we again summarize in a matrix $W' = (w'_{n+j,k})_{1\le j\le n',\,1\le k\le K}$. Then we stack $W'$ underneath $W$, which we denote as $W \cup W'$. At this stage we have several options.

(i) We use $W \cup W'$ to run a single M-step, using all $n + n'$ data, leading to an updated $\hat\theta$. We refer to this option as the one-step update. Note that, under this scenario, the matrix $W$ has not been updated, so the posterior probabilities of the original data have not benefited from the new information.


(ii) After running (i), we can run one additional E-step (which will then update $W \cup W'$) and one additional M-step. We call this the two-step update.

(iii) One can now repeat step (ii) a further number of times; if we do this until convergence, we call this a converged update.

Now, any of (i), (ii) or (iii) leads to an update of $\hat\theta$ for the single batch $n'$. After reception of the next batch, say $n''$, again any of (i) to (iii) needs to be executed, and this is repeated until all batches have been received. Unlike other recently proposed approaches to 'incremental EM' (such as Zhang and Scordilis, 2009), where the focus is on approximative updating to gain computational speed, any combination of (i), (ii) and (iii) gives an 'exact' update, hence enabling one to find the MLE of the full data at any stage, provided that (iii) is executed after the last update.
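A sketch of the one-step update (i) in R, reusing the e_step and m_step functions sketched in Section 1; the two-step update (ii) simply follows this with one further E- and M-step on the stacked data.

# One-step update (i): E-step for the new batch only, stack W and W',
# then a single M-step on all n + n' observations.
one_step_update <- function(y, W, y_new, theta) {
  W_new <- e_step(y_new, theta$pi, theta$mu, theta$Sigma)  # W' via (1)
  W_all <- rbind(W, W_new)                                 # W union W'
  list(theta = m_step(rbind(y, y_new), W_all), W = W_all)
}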

3 Online updating

Options (i) and (ii) are especially attractive in an online scenario, where new observations $i$ come in at certain timepoints $t_i$, and one wishes to continuously update the current parameter estimates and weight matrix. Clearly the initial batch needs to contain at least as many observations as parameters in $\theta$, that is, $n \ge K(1 + 3p/2 + p^2/2) - 1$ (but in practice larger values will be required). For all further batches one would have $1 = n' = n'' = \ldots$.
We illustrate the performance of the online update schemes (i) and (ii) through a simple simulation. We generated a random dataset with two Gaussian clusters containing a total of 100 points. The result of applying usual EM to this data set is provided in Figure 1, with the log-likelihood given by the dashed curve in the middle panel. Next, 60 of these 100 points were randomly selected to drop out of the dataset, leaving us with a batch of size n = 40. After running EM until convergence for this reduced batch, which required 20 iterations, the previously removed points were returned one by one to the dataset, each followed by procedures (i) and (ii), respectively. After all data points had been included, we ran step (iii) (this is the part after the vertical line in Figure 1, middle).
Note that the higher likelihoods (as compared to usual EM) obtained for small iteration numbers are an artefact of the smaller sample sizes used for these. The key observations from Figure 1 (middle) are: (a) during the execution of the online updating, scheme (i) leads to worse solutions than scheme (ii), and needs a longer time to recover once one goes into mode (iii); (b) after final convergence, the log-likelihood $\ell(\hat\theta)$ is the same for all approaches.
Conceptually, it is clear that (b) must be true: if we have achieved some estimate $\tilde\theta$ through repeated online updating, and from this moment on we decide to fill in the remaining data, then we effectively carry out usual EM but with starting value $\theta^{(0)} = \tilde\theta$. The MLE of this problem is the same as for the full data set.


4 Blockwise updating

Even though the MLE is in principle the same, there may be a difference in whether or not the maximum is actually found, that is, whether or not the EM algorithm gets trapped in a local maximum on its way to the MLE. In other words, there is the intriguing possibility that, by artificially splitting and blockwise updating data sets which were in principle fully available, one is able to overcome local maxima (through the influx of fresh data) in which traditional EM gets trapped. Indeed, one can find scenarios where this is the case. An exemplary graphical illustration of the log-likelihoods obtained in a simulation involving 1200 samples (split into three batches) and K = 8 mixture components is provided in Figure 1 (right).

References

Aitkin, M., Hinde, J., Francis, B., and Darnell, R. (2009). Statistical Modelling in R. Oxford: Oxford University Press.

Zhang, Y. and Scordilis, M.S. (2007). Effective online unsupervised adaptation of Gaussian mixture models and its application to speech classification. Pattern Recognition Letters, 29, 735 – 774.

FIGURE 1. Left: Result of EM for bivariate data with two clusters; Middle: Comparing traditional EM with one- and two-step updates; Right: Comparing traditional EM with blockwise updating (for a more complex data set involving 8 clusters).


Visualisation and Modelling of Environmental Sensor Time Series

Amira Elayouty1, Marian Scott1, Claire Miller1, Susan Waldron2, Maria Franco Villoria3

1 School of Mathematics and Statistics, University of Glasgow, UK
2 School of Geographical and Earth Sciences, University of Glasgow, UK
3 School of Economics and Statistics, University of Turin, Italy

E-mail for correspondence: [email protected]

Abstract: Advances in sensor technology are improving the ability to detect and understand changes within environmental systems occurring over short time periods that would not have been apparent with monthly, fortnightly or even daily sampling. However, such high temporal resolution data present various challenges for statistical modelling. The aim of this paper is to investigate the complexities of modelling high frequency data which arise from environmental applications. A high resolution sensor-generated 15 minute time series of carbon dioxide partial pressure in a small order river system will be used as an illustrative dataset.

Keywords: High-Frequency Data; Wavelets; Generalized Additive Models; Partial Pressure of Carbon Dioxide.

1 Introduction

Environmental monitoring sensor technologies are continually developing, which presents the opportunity to improve our understanding of environmental systems. In the past, water quality monitoring programmes typically involved monthly, fortnightly, and occasionally daily sampling campaigns, but rarely shorter time intervals. However, many changes in stream water quality happen at sub-daily scales (e.g. Neal et al., 2013), and high temporal resolution monitoring programmes are required to observe these rapid changes. Sensors can provide a vast amount of High-Frequency Data (HFD), and hence statistical developments are needed to analyze these huge data sets, to detect abrupt changes and unusual events.
The aim of this paper is to explore the complexities of modeling HFD which arise from environmental applications. The partial pressure of carbon dioxide (EpCO2) in a small river system is used as an illustrative dataset to


explore such sensor-generated HFD. The excess EpCO2 in surface freshwater is a dynamic representation of the processes that consume and produce carbon dioxide. Specifically, this paper explores the EpCO2 behavior in relation to seasonal/diurnal cycles and hydrological and chemical processes, in order to identify ecological controls. In this regard, the study uses high resolution sensor-generated time series of EpCO2, pH, discharge, temperature and conductivity recorded in the Glen Dye catchment in Aberdeenshire, Scotland, every 15 minutes from October 2003 to August 2007. However, the paper focuses only on the analysis of the hydrological year 2003/2004.

2 Methodology

Wavelet analysis was first performed to visualize the periodic variations of EpCO2 and the other hydrological variables over the different time scales. Wavelet analysis is used to analyze non-stationary time series. The result is a time-scale decomposition of the original signal that helps identify the cyclical components over the different frequencies, as well as the long-term trend. The multi-resolution analysis (MRA) of the discrete wavelet transform decomposes the signal into discrete time scales without losing the original available information (Percival and Walden, 2006). The R "wmtsa" package is used for the maximum overlap discrete wavelet transform with the Daubechies family of wavelets and the least asymmetric filter of width 8.
Generalised Additive Models (GAMs) were then employed to describe and explain the EpCO2 behavior in response to the different hydrological variables and its cyclic variations identified by the MRA, firstly at 15 minute resolution and then for data aggregated at one hour, 8 hour and daily scales. In GAMs, the relationships between a response variable Y and its explanatory variables X are estimated by smooth functions (see, for example, Wood (2006)). In such HFD, long range dependence (LRD) is often apparent, which makes model inference inefficient. A solution is to model this correlation through random effects using a Generalized Additive Mixed Model (GAMM) expressed as follows:

g(\mu_t) = \beta_0 + f_1(X_{1t}) + f_2(X_{2t}) + f_3(X_{3t}, X_{4t}) + \ldots + Z_t b   (1)

Here the observations $Y_t$ are assumed to be Gaussian and conditionally independent with means $\mu_t = E(Y_t\,|\,b)$ and $g(\cdot)$ the identity link; the $f$'s are smooth functions; $Z_t$ is a row of a random effects model matrix and $b \sim N(0, B)$ is a vector of random effects coefficients. The GAMMs are fitted using penalized regression, where the smooth functions are approximated by splines (Wood (2006), Pinheiro and Bates (2000)). The smoothness of each curve $f_j$ is controlled by a smoothing parameter selected automatically using restricted maximum likelihood. The Akaike Information Criterion (AIC) is used to determine the amount of smoothing, and then AIC and approximate F-tests are used for model selection. The R "mgcv" package (Wood, 2006) is used to fit the GAMs and GAMMs.
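A minimal mgcv sketch of such a GAMM with AR(1) errors; the data frame d and the variable names are placeholders for the study's variables, and the basis choices are our assumptions rather than the authors' exact specification.

library(mgcv)   # gamm() uses nlme for the residual correlation structure
# d: data frame with columns EpCO2, Month.dec, Hour, Month, Conductivity
m <- gamm(EpCO2 ~ s(Month.dec) + s(Hour, bs = "cc") +
            ti(Month, Hour) + s(Conductivity),
          data = d, correlation = corAR1(), method = "REML")
summary(m$gam)   # smooth terms; m$lme holds the estimated AR(1) parameter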


3 Results and Discussion

The wavelet analysis emphasized the non-stationarity of the EpCO2 series and the different hydrological factors, and the seasonal and diurnal cycles of EpCO2. The MRA indicated that the 6th wavelet detail, reflecting changes over 8 hours, is the main contributor to the sample variance, which represents the diurnal cycle of EpCO2 (Figure 1, panel (d)). However, this diurnal cycle is not constant throughout the year. The MRA also implied that constant flow variability is associated with high EpCO2 variability, while sudden/unusual flow increases are associated with compressed EpCO2 variability (panels (d) and (e)). It also appears that EpCO2 is more variable during the summer, when a pronounced daily cycle is present (panels (d) and (f)).
Sequentially, GAMs were fitted to the 15 minute data for EpCO2 with smooth functions for the global trend, month within the year, time within the day, and conductivity. Although this model explains a large percentage of the variability in EpCO2, the autocorrelation function of the residuals highlights the presence of long range dependence and a diurnal cycle that is not yet captured by the model. One strategy to account for this dependence structure is to add a smooth term for lagged EpCO2 and fit a first order autoregressive (AR(1)) term to the model residuals, resulting in Model (2):

variability while sudden/unusual flow increases are associated with com-pressed EpCO2 variability (Panels (d)&(e)). It also appears that EpCO2 ismore variable during the summer when a pronounced daily cycle is present(Panels (d)&(f)).Sequentially, GAMs were fitted to the 15 minute data for EpCO2 withsmooth functions for global trend, month within the year and time withinthe day, and conductivity. Although this model explains a large percentageof the variability in EpCO2, the autocorrelation function of the residualshighlights the presence of long range dependence and a diurnal cycle that isnot yet captured by the model. One strategy to account for this dependencestructure is to add a smooth term for lagged EpCO2 and fit a first orderautoregressive (AR(1)) term to the model residuals, resulting in Model (2):

EpCO2_t = \beta_0 + f_1(\mathrm{Month.dec}_t) + f_2(\mathrm{Hour}_t) + f_3(\mathrm{Month}_t * \mathrm{Hour}_t) + f_4(\mathrm{Conductivity}_t) + f_5(EpCO2_{t-8}) + \varepsilon_t, \quad \varepsilon_t \sim AR(1)   (2)

where the "Month.dec", "Month" and "Hour" terms represent the long-term trend, seasonal cycle and diurnal cycle of EpCO2, respectively, and $EpCO2_{t-8}$ denotes the 2 hour lagged EpCO2 (8 observations at 15 minute resolution). The results indicated that EpCO2 is lower in winter and during daylight hours, and that the daily cycle is more apparent during summer.
The EpCO2 data were also aggregated over coarser scales to explore the effect of coarser resolution sampling and to investigate the amount of information available/lost at the different temporal resolutions. The behavior of EpCO2 and its relationships with the hydrological factors did not change over the different levels of aggregation, but they become smoother over coarser time scales. While aggregation decreases the autocorrelation between residuals, it masks the unusual events at the fine time scales.

4 Conclusion and Future Work

Although HFD are useful, describing and analyzing them is complex. The visualisation and modelling presented in this paper enabled the validation of the process-based model for EpCO2, which includes flow and temperature, in addition to exploring its relationships with conductivity and time. Statistical developments are required to account for LRD in HFD in a computationally efficient manner. Current work includes fitting temporally varying coefficient models to the full time series. Further investigations


will consider latent state models, which offer an attractive way forward for dynamic modelling of EpCO2 and its relationships with driving factors, depending on different environmental conditions.

FIGURE 1. EpCO2 (top), flow and temperature series, followed by the 6th wavelet detail (8 hour scale) of the MRA of EpCO2, flow and temperature (bottom), for the hydrological year 2003/2004.

Acknowledgments: AE is supported by a Glasgow University sensor studentship.

References

Neal, C., Reynolds, B., Kirchner, J., Rowland, P., Norris, D., Sleep, D., Lawlor, A., Woods, C., Thacker, S., Guyatt, H., Vincent, C., Letho, K., Grant, S., Williams, J., Neal, M., Wickham, H., Harman, S. and Armstrong, L. (2013). High-frequency precipitation and stream water quality time series from Plynlimon, Wales: an openly accessible data resource spanning the periodic table. Hydrological Processes, 27, 2531 – 2539.

Percival, D. and Walden, A. (2006). Wavelet Methods for Time Series Analysis. Cambridge: Cambridge University Press.

Pinheiro, J. and Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. New York: Springer.

Wood, S.N. (2006). Generalized Additive Models: An Introduction with R. London: Chapman & Hall.


Statistical modeling of extreme precipitation in recent flood regions in China

Manuela Ender1, Tong Ma1

1 Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, China

E-mail for correspondence: [email protected]

Abstract: The objective of this paper is to model extreme precipitation events using 60 years of daily data for two cities in China, Shantou and Qiqihaer. Both cities experienced severe floods in 2013. Using generalized extreme value distributions and generalized Pareto distributions, we study the frequency of flood events from a statistical point of view. With this knowledge, decision makers should be able to make better informed decisions about risk mitigating measures.

Keywords: Extreme value theory; precipitation; return level; China.

1 Introduction

During August 2013, China's southern provinces Guangdong, Guangxi and Fujian and the north-eastern provinces Heilongjiang, Jilin and Liaoning were hit by heavy rain that caused flooding (RCSC, 2013). For the north-eastern region, it was the worst flooding in 50 years. Millions of residents were affected, and many people died or went missing. Direct economic losses solely for Guangdong province are reported to be CNY 13 billion, and for Heilongjiang province around CNY 7 billion (RCSC, 2013).
Consequently, estimates of extreme rainfall events in China play a significant role in an efficient risk appraisal. Knowing the probability with which a certain flood event will occur helps decision makers design efficient mitigating measures. Examples are the planning and construction of water management and sewerage systems, the capacity of channels and river basins, buying suitable insurance, and the implementation of an information system so that citizens are better prepared (Overeem et al., 2008). To contribute to the achievement of these goals, this paper focuses on the estimation of extreme precipitation distributions using 60 years of daily data (1951-2010) for two Chinese cities, Shantou (Guangdong) and Qiqihaer (Heilongjiang). The estimation is based on Extreme Value Theory (EVT), the generalized extreme value distribution (GEV) and the generalized Pareto distribution (GPD).


2 Methodology

The Block Maxima approach is appropriate when the maximum observations of blocks with a predefined and fixed length are assembled from a large number of iid variables. In this case, the asymptotic distribution of the maximum observations is exactly one of three well known distributions: the Gumbel ($\xi = 0$), Frechet ($\xi > 0$) or Weibull distribution ($\xi < 0$) (Fisher and Tippett, 1928). The cumulative distribution function can be summarized by the GEV:

H(x; \xi, \sigma, \mu) = \begin{cases} \exp\left\{-\left(1 + \xi\,\frac{x-\mu}{\sigma}\right)^{-1/\xi}\right\}, & \text{if } \xi \neq 0 \\ \exp\left\{-e^{-\frac{x-\mu}{\sigma}}\right\}, & \text{if } \xi = 0, \end{cases}   (1)

where $x$ are the extreme values from the blocks, and $\xi$, $\sigma$, $\mu$ are the shape, scale and location parameters, respectively.
The Peaks over Threshold (POT) method considers the maximum variables exceeding a predetermined threshold. Balkema and de Haan (1974) and Pickands (1975) showed that the distribution of the exceedances converges to the GPD when the threshold is sufficiently high:

H(x; \xi, \sigma, \mu) = \begin{cases} 1 - \left[1 + \xi\left(\frac{x-\mu}{\sigma}\right)\right]^{-1/\xi}, & \text{if } \xi \neq 0 \\ 1 - e^{-\frac{x-\mu}{\sigma}}, & \text{if } \xi = 0, \end{cases}   (2)

where $x$ are the exceedances, and $\xi$ and $\sigma$ are the shape and scale parameters, respectively. There are three types of GPD: the exponential distribution ($\xi = 0$), the ordinary Pareto distribution ($\xi < 0$), and the Pareto type II distribution ($\xi > 0$). The initial step is to choose a threshold that is sufficiently high; however, if it is too high, the sample is small. In other words, there is a tradeoff between bias and variance. We use three methods of threshold selection: the Hill plot, sample mean excess plots, and the Standardized Precipitation Index (SPI) (McKee et al., 1993).
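A minimal R sketch of both approaches with the evd package; the precipitation vector pr, the half-year block index and the threshold value are illustrative assumptions.

library(evd)
# Block maxima / GEV: half-yearly maxima (block index 'halfyear' assumed given)
bm <- as.numeric(tapply(pr, halfyear, max))
gev_fit <- fgev(bm)                    # ML estimates of loc, scale and shape

# POT / GPD: exceedances over a seasonal threshold, e.g. 150 mm
gpd_fit <- fpot(pr, threshold = 150)   # ML estimates of scale and shape

# GEV return level for a return period of m blocks (xi != 0); with half-yearly
# blocks, T years correspond to m = 2T blocks, e.g. T = 20 gives m = 40
ret_level <- function(m, loc, scale, shape)
  loc + scale / shape * ((-log(1 - 1 / m))^(-shape) - 1)
ret_level(40, gev_fit$estimate["loc"], gev_fit$estimate["scale"],
          gev_fit$estimate["shape"])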

3 Results

Table 1 gives the shape parameter of the GEV, estimated by maximum likelihood (MLE) from half-yearly maxima. For both cities, ξ is positive, which points to fat-tailed Frechet distributions. However, as zero is included in the confidence intervals (CI), a Gumbel distribution cannot be excluded. In the case of the GPD, the whole year's data is divided into two seasons: a dry season from December to March and a rainy season from April to November. The thresholds best indicated by all methods (Hill plot, mean excess plot and SPI) are finally chosen for each season. The shape parameters of the GPD are estimated using MLE and presented in Table 1. For Shantou, ξ is negative for the rainy-season threshold, which indicates an ordinary Pareto distribution, but not for the dry season. In the case of Qiqihaer, a Pareto type II distribution was found in all cases. As all CIs include zero, the exponential distribution cannot be excluded. Table 2 lists the return level estimates for different return periods. For Shantou, the highest observation in the sample, 297.4 mm, is covered by the CI for T = 20 for the GEV and for T = 50 for the GPD. The maximum value for Qiqihaer, 135.5 mm, lies within the CI of T = 50 for the GEV and the rainy-season GPD, and of T = 100 for the dry-season GPD.
According to CMA (2012), 24-hour rainfall exceeding 250 mm is considered extreme rainfall that can cause floods in Southern China. Following this, the event of a flood is included in the CI of T = 10 for the GEV, of T = 20 for the rainy-season GPD and of T = 50 for the dry-season GPD. For Qiqihaer, rainfall of 100 mm to 200 mm in 24 hours is defined as heavy rainfall, and more than 200 mm as extreme rainfall (HAIC, 2008). The first event happens for the GEV every 10 to 20 years, for the rainy-season GPD every 20 years and for the dry-season GPD every 50 years. KPSS tests show that there is insignificant evidence to reject stationarity around a constant or linear trend for Shantou. Qiqihaer accepts stationarity for the GEV and the dry-season GPD, but rejects it for the rainy-season GPD, which reveals limitations of the EVT approach. Further, the null hypothesis of the Mann-Kendall test, that no trend exists in the exceedances, can never be rejected.

TABLE 1. GEV and GPD shape parameter estimates, with seasonal thresholds in mm.

Shantou        GEV               Rainy over 150      Dry over 60
ξ / s.e.       0.035 / 0.073     -0.232 / 0.203      0.238 / 0.189
(95% CI)       (-0.108, 0.178)   (-0.630, 0.166)     (-0.131, 0.608)

Qiqihaer       GEV               Rainy over 45       Dry over 10
ξ / s.e.       0.127 / 0.083     0.032 / 0.139       0.372 / 0.311
(95% CI)       (-0.035, 0.288)   (-0.241, 0.304)     (-0.237, 0.981)

TABLE 2. Return level estimates (mm) for GEV and GPD, with 95% CIs.

T (years)           5          10         20         50         100

Shantou GEV      185.452    217.089    248.900    291.723    324.888
                (168.566,  (194.011,  (217.348,  (245.484,  (264.927,
                 210.172)   256.593)   309.097)   389.603)   460.329)

Shantou          192.070    220.839    245.334    272.222    289.084
Rainy 150 GPD   (177.516,  (200.222,  (221.295,  (244.958,  (260.912,
                 212.280)   247.139)   289.245)   364.340)   437.436)

Shantou           92.388    114.232    139.998    181.290    219.095
Dry 60 GPD       (82.513,   (98.202,  (116.803,  (139.196,  (156.720,
                 106.947)   145.591)   209.595)   361.692)   564.733)

Qiqihaer GEV      67.655     82.383     98.194    121.141    140.281
                 (60.116,   (71.345,   (82.228,   (96.153,  (106.432,
                  79.383)   102.623)   131.153)   179.253)   225.622)

Qiqihaer          69.788     81.472     93.415    109.607    122.169
Rainy 45 GPD     (63.743,   (73.038,   (82.286,   (93.649,  (102.475,
                  77.935)    95.772)   119.709)   163.860)   209.891)

Qiqihaer          14.949     19.382     25.119     35.372     45.815
Dry 10 GPD       (13.003,   (16.032,   (19.605,   (24.564,   (28.154,
                  17.821)    26.898)    47.758)   118.854)   251.963)



4 Conclusion

Return levels show that Shantou and Qiqihaer experience flood events caused by heavy rain every 10 to 20 years during the summer, as seen in August 2013. In the winter, flooding is less likely and should occur only about once every 50 years. As the first results of this paper are promising, future research is needed to overcome limitations such as clustering or seasonality.

References

Balkema, A.A. and de Haan, L. (1974). Residual life time at great age. The Annals of Probability, 2, 792 – 804.

CMA - China Meteorological Administration (2012). www.cma.gov.cn

Fisher, R.A. and Tippett, L.H.C. (1928). Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society, 24, 180 – 190.

HAIC - Heilongjiang Agricultural Information Center (2008). www.hljagri.gov.cn

McKee, T.B., Doesken, N.J. and Kleist, J. (1993). The relationship of drought frequency and duration to time scales. In: Preprints, 8th Conference on Applied Climatology, Anaheim, CA, pp. 179 – 184.

Overeem, A., Buishand, A. and Holleman, I. (2008). Rainfall depth-duration-frequency curves and their uncertainties. Journal of Hydrology, 348, 124 – 134.

Pickands, J. (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3, 119 – 131.

RCSC - International Federation of Red Cross and Red Crescent Societies (2013). Information Bulletin - China: Floods and typhoons.


The pblm package: semiparametric regression for bivariate categorical responses in R

Marco Enea12, Mikis Stasinopoulos2, Robert Rigby2, Antonella Plaia1

1 University of Palermo, Palermo, Italy
2 London Metropolitan University, London, United Kingdom

E-mail for correspondence: [email protected]

Abstract: We present an R package to fit semiparametric regression models for two categorical responses. It works for both nominal and ordered responses, and several types of logits can be specified. Proportional, non-proportional and partial proportional odds models can be fitted, with marginal and association parameters estimated in a parametric or semiparametric way via penalized maximum likelihood estimation. An application showing the potential of the package is carried out on a data set of Italian university students.

Keywords: bivariate additive Dale model; partial proportional-odds models; semiparametrically structured ordered models; students' performance.

1 Introduction

In the last three decades, many papers have focused on regression models for bivariate categorical responses. Applications of these models can be found in the health and social fields. For the ordered case, Dale (1986) was the first important contribution on the topic. She used global logits as appropriate measures of interest for the marginal responses, and global cross-ratios for the association structure. Glonek and McCullagh (1995) used the multivariate logistic transform to link multivariate binary, nominal or ordered responses to predictors. For the univariate ordered case, Tutz (2003) proposed a semiparametric partial proportional-odds model. In that model, besides a parametric part, the predictor included additive effects of unspecified functions of continuous variables, as in generalized additive models (Hastie and Tibshirani, 1990). Effects of covariates can be smoothed in a "horizontal" way, that is, independently of the response categories, but also assuming category-specific effects, by penalizing the variation of the curves over response categories ("vertical" smoothing). In the literature, the bivariate version of Tutz's approach has not been fully considered yet, but only piecewise. Bustami et al. (2001) already developed the generalized additive part for the




horizontal smoothing, whereas Enea and Lovison (2012) vertically smoothed the effects of covariates across response categories using suitable penalty terms. In this work, we present an R package, named pblm, to fit additive partial proportional-odds models for two categorical responses. At the time of writing, the full implementation of the bivariate model following Tutz's approach is still ongoing, but P-splines (Eilers and Marx, 1996) can be used for the horizontal smoothing, also including category-specific curves, whereas vertical smoothing is allowed for the non-additive part. Tuning (or smoothing) parameters can be fixed by the user or selected automatically; the automatic selection criterion implemented is minimization of the Generalized Akaike Information Criterion (GAIC). The package is not on CRAN yet, but it is available by emailing the first author. In the following section we outline the general structure of the model, while in Section 3 we show some results from the application.

2 The model

Consider an $n\times(p+2)$ data set, where $n$ indicates the number of observations and $p$ the number of covariates, and two categorical responses $Y_1$ and $Y_2$ with $D_1$ and $D_2$ categories, respectively. Let $\phi_{1r}$, $\phi_{2c}$ and $\psi_{rc}$, $r = 1,\dots,D_1$, $c = 1,\dots,D_2$, define the marginal and association parameters of interest. Besides the basic type, the marginal parameters can be chosen, for example, among the local, global, continuation and reverse continuation types, and the association parameters are formed, accordingly, by their combinations. Let $x_i$ be the covariate vector for the $i$th observation, $i = 1,\dots,n$. From the cross tabulation of the responses, the new response vector $y_i$ can be obtained by using the multivariate logistic transform of Glonek and McCullagh (1995). Consider the model

$$
\begin{aligned}
\log[\phi_{1r}(x_i)] &= \eta_{1ri} = \beta_{10r} + \beta_{1(r)}^T x_{i1(r)} + \textstyle\sum_{j=1}^{p} f_{j1(r)}(x_{ij1(r)}),\\
\log[\phi_{2c}(x_i)] &= \eta_{2ci} = \beta_{20c} + \beta_{2(c)}^T x_{i2(c)} + \textstyle\sum_{j=1}^{p} f_{j2(c)}(x_{ij2(c)}),\\
\log[\psi_{rc}(x_i)] &= \eta_{3(rc)i} = \beta_{30(rc)} + \beta_{3(rc)}^T x_{i3(rc)} + \textstyle\sum_{j=1}^{p} f_{j3(rc)}(x_{ij3(rc)}),
\end{aligned}
\qquad (1)
$$

where, for brevity of notation, the generic subscript $(a)$ indicates those terms that might be category-specific, while $f_{jk(a)}(x_{ijk})$, $k = 1, 2, 3$, are unspecified functions of continuous variables. Note that the marginal intercepts are always assumed to be category-specific. We refer to model (1) as the Bivariate Additive Partially Proportional Odds Model (BAPPOM). For ordered responses, the proportional-odds version of (1) is called the Bivariate Additive Dale Model (BADM) if global logits and global odds ratios are used, while the association intercepts are parameterized according to an RC-type structure (Bustami et al., 2001). In the proposed package, the fitting of the BAPPOM is performed via maximum likelihood estimation, employing both the iteration scheme proposed by Glonek and McCullagh (1995) and the backfitting algorithm of Hastie and Tibshirani (1990).
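To give an idea of the intended user interface, the sketch below shows a hypothetical pblm() call for a model of this kind. The function name comes from the package itself, but the formula layout, the argument names and the data set below are illustrative assumptions, not the documented interface.

# a hypothetical sketch only: the arguments 'type' and 'smooth' and the
# data set 'students' are assumptions made for illustration
library(pblm)
fit <- pblm(cbind(CREDITS, GRADES) ~ sex + age + hsgrade + hstype,
            data   = students,
            type   = "gg",           # global logits and global odds ratios
            smooth = ~ pb(hsgrade))  # P-spline (horizontal) smoothing
AIC(fit)  # GAIC-type criteria can guide the smoothing parameters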

3 Application to the Italian university students’ data.

The application, which concerns an analysis of university students' performance at the University of Palermo, is similar to that performed in Enea and Attanasio



(2012), but here we used the 2007-2008 cohort of students enrolled in one of the degree courses in economics. Both the university credits accumulated by the students and their mean grade have been observed at the end of the fourth year of permanence in the university system, assuming that taking a degree after four years is a good result. Both variables have been suitably categorized on a five-category ordered scale; we refer to Enea and Attanasio (2012) for details of this categorization. Among the covariates there are sex, age, high school final grade, high school type and the student's family income. Figure 1, on the left, shows the observed association structure between CREDITS and GRADES measured by 16 log-global odds ratios (log-GORs). We fitted several models, including the BAPPOM and the BADM. On the grounds of the AIC and likelihood ratio tests, the final model estimated an association structure which is unconstrained along CREDITS and linear along GRADES. The graph on the right shows the fitted log-GORs at baseline, smoothed by the penalty term used in Enea and Attanasio (2012).


FIGURE 1. Observed (left) and fitted (right) log-GORs for the EcD data set.

The smoothed effect of the high school final grade is reported in Figure 2. This variable has been centered at the minimum of the Italian high-school grading scale (60-100 range). Here the smoothed effect shows a clearly linear trend on the logit scale. However, the smooth term has been used only for the sake of diagnostics, in order to assess possibly missing polynomial terms, as suggested in Bustami et al. (2001); in the final model this variable enters linearly. Due to space limits the model estimates are not reported here. However, from the analysis, the high school final grade proved a good (significant) predictor of students' academic success, along with having attended the "CLASSICO" or "SCIENTIFICO" high school types.

FIGURE 2. Smoothed effect of the Italian student's high-school final grade on the risk (log-global odds) of a negative performance at university, in terms of GRADES, CREDITS and their association (log-GORs).

References

Bustami, R., Lesaffre, E., Molenberghs, G., Loos, R., Danckaerts, M. and Vlietinck, R. (2001). Modelling bivariate ordinal responses smoothly with examples from ophthalmology and genetics. Statistics in Medicine, 20, 1825 – 1842.

Dale, J.R. (1986). Global Cross-Ratio Models for Bivariate, Discrete, Ordered Responses. Biometrics, 42(4), 909 – 917.

Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89 – 121.




Enea, M. and Lovison, G. (2012). A penalized approach to the bivariate logistic model for the association between ordered responses. DSSM Working Papers no. 2012.3, University of Palermo.

Enea, M. and Attanasio, M. (2012). Bivariate logistic model for the analysis of the students' university "success". Proceedings of the 46th Scientific Meeting of the Italian Statistical Society.

Glonek, G.F.V. and McCullagh, P. (1995). Multivariate logistic models. Journal of the Royal Statistical Society, Series B, 57, 533 – 546.

Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. London: Chapman & Hall.

Rigby, R.A. and Stasinopoulos, D.M. (2005). Generalized additive models for location, scale and shape. Applied Statistics, 54(3), 507 – 544.

Tutz, G. (2003). Generalized semiparametrically structured ordinal models. Biometrics, 59(2), 263 – 273.


Least-squares linear distributed estimation for discrete-time systems with packet dropouts

M.J. García-Ligero1, A. Hermoso-Carazo1, J. Linares-Pérez1

1 Dpto. de Estadística e I.O., Universidad de Granada, Avda. Fuentenueva s/n, 18071 Granada, Spain

E-mail for correspondence: [email protected]

Abstract: Least-squares linear estimation of signals observed from multiple sensors with packet dropouts is addressed. The packet dropouts are modeled by sequences of Bernoulli random variables with different characteristics for each sensor. Under the assumption that the equation modeling the signal is unknown and that only information about the covariance functions of the processes involved in the observations is available, distributed fusion filtering and fixed-point smoothing algorithms are obtained.

Keywords: Packet dropouts; Covariance information; Least-squares estimation.

1 Introduction

Modern systems are becoming increasingly complex, and observation mechanisms based on a single sensor are often unable to provide sufficient information for their study. Hence, the study of complex systems has recently been carried out using multiple sensors. In this situation, the signal estimation problem is addressed by considering the information provided by the different sensors and combining it by means of two different techniques: centralized and distributed fusion. In centralized fusion, all measurement data are sent to the fusion center to be processed jointly and to obtain the signal estimator. However, the computational disadvantages of this technique lead us to consider an alternative, the distributed fusion method; with this method, the estimator is derived by combining the estimators obtained from each sensor. In sensor network systems the loss of some measurements (packet dropouts) is usually unavoidable due to the unreliable features of the networks. Packet dropouts are usually modeled by Bernoulli random variables which take values 0 or 1 depending on whether the measurement is lost or received. Using this model, García-Ligero et al. (2011) derived the least-squares (LS) linear filter and




smoothers considering a single sensor and assuming that no signal equation is available and that only the covariance functions of the processes involved in the observation equation are known. In this paper, observations coming from multiple sensors are considered and, motivated by the advantages of the distributed fusion method, the filter and fixed-point smoother are obtained by this method; specifically, the estimators are obtained as a linear combination of the local ones using the mean squared error as optimality criterion.

2 Observation model

Consider an $n$-dimensional signal, $x_k$, observed from $m$ sensors which provide scalar measurements, $z^i_k$, $i = 1,\dots,m$, given by
$$ z^i_k = H^i_k x_k + v^i_k, \quad k \ge 1, \; i = 1,\dots,m, $$
where the $H^i_k$ are known $1\times n$ matrices and $\{v^i_k,\, k \ge 1\}$, $i = 1,\dots,m$, are scalar white noise processes with zero means and known variances $\mathrm{Var}[v^i_k] = R^i_k$, $k \ge 1$. It is assumed that the measurements $z^i_k$ are transmitted to a processing unit through an unreliable network, where some of them may be lost during the transmission; if so, the last available measurement is processed. According to Sahebsara et al. (2007), this situation can be modeled as
$$ y^i_k = \xi^i_k z^i_k + (1-\xi^i_k)\, y^i_{k-1}, \quad k > 1, \; i = 1,\dots,m; \qquad y^i_1 = \xi^i_1 z^i_1, \qquad (1) $$
where, for $i = 1,\dots,m$, $\{\xi^i_k,\, k \ge 1\}$ are sequences of independent Bernoulli random variables with known probabilities $P[\xi^i_k = 1] = \bar{\xi}^i_k$, $\forall k \ge 1$. If $\xi^i_k = 1$, then the measurement at time $k$, $z^i_k$, is received; otherwise the measurement at time $k$ is lost.
We assume that the signal $\{x_k,\, k \ge 1\}$ has zero mean and $E[x_k x_s^T] = A_k B_s^T$, $s \le k$, where $A_k$ and $B_s$ are known $n\times M$ matrix functions. Also, we assume that the signal process $\{x_k,\, k \ge 1\}$ and the noise processes $\{v^i_k,\, k \ge 1\}$ and $\{\xi^i_k,\, k \ge 1\}$, $i = 1,\dots,m$, are mutually independent.

3 Distributed fusion filter and fixed-point smoother

The application of the distributed fusion method is developed in two stages. The first consists of obtaining the local estimators, $\hat{x}^i_{k/L}$, $i = 1,\dots,m$, of the signal from the observations coming from each of the sensors. This first stage was solved in García-Ligero et al. (2011); specifically, local LS linear filtering and fixed-point smoothing algorithms were derived. In the second stage, the distributed estimators are obtained as a linear combination of the local estimators using the mean squared error as optimality criterion. Theorem 1, derived in García-Ligero et al. (2012), provides the expressions of the distributed fusion estimators and the estimation error covariance matrices.

Theorem 1. Let $\hat{x}^i_{k/L}$, $L \ge k$, $i = 1,\dots,m$, be the local estimators of the signal; then the LS linear distributed fusion estimators, $\hat{x}^D_{k/L}$ (for $L = k$ and $L > k$, filter and smoother, respectively), and the estimation error covariance matrices are given by
$$ \hat{x}^D_{k/L} = \Xi_{k/L}(\Sigma_{k/L})^{-1}\hat{x}_{k/L}, \quad L \ge k, $$
$$ P^D_{k/L} = A_k B_k^T - \Xi_{k/L}(\Sigma_{k/L})^{-1}\Xi_{k/L}^T, \quad L \ge k, $$



where $\hat{x}_{k/L} = (\hat{x}^{1T}_{k/L},\dots,\hat{x}^{mT}_{k/L})^T$, $\Sigma_{k/L} = \big(\Sigma^{ij}_{k/L} = E[\hat{x}^i_{k/L}\hat{x}^{jT}_{k/L}]\big)_{i,j=1,\dots,m}$ and $\Xi_{k/L} = (\Sigma^{11}_{k/L}\,|\,\dots\,|\,\Sigma^{mm}_{k/L})$.

Notice that these expressions for the distributed estimator and the estimation error covariance matrix depend on the cross-covariance matrices between any two local estimators, $\Sigma^{ij}_{k/L}$, $i,j = 1,\dots,m$, $L \ge k$; therefore, our aim in this paper is to derive an explicit expression for them in order to apply the distributed filtering and smoothing algorithms.
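Computationally, the fusion step of Theorem 1 amounts to stacking the local estimators and solving one linear system. The R sketch below illustrates this under the assumption that the local estimates and their cross-covariance matrices have already been computed; all object names are illustrative.

# a minimal sketch of the fusion formulas in Theorem 1; 'xhat' is a list
# of m local estimators (each an n x 1 matrix), 'Sigma[[i]][[j]]' the
# n x n cross-covariance matrices, and A_k, B_k the n x M factors of the
# signal covariance E[x_k x_k^T] = A_k B_k^T
distributed_fusion <- function(xhat, Sigma, A_k, B_k) {
  m  <- length(xhat)
  X  <- do.call(rbind, xhat)                      # stacked local estimators
  S  <- do.call(rbind, lapply(seq_len(m), function(i)
          do.call(cbind, Sigma[[i]])))            # full matrix Sigma_{k/L}
  Xi <- do.call(cbind, lapply(seq_len(m), function(i)
          Sigma[[i]][[i]]))                       # Xi_{k/L} = (S^11 | ... | S^mm)
  xD <- Xi %*% solve(S, X)                        # fused estimator
  PD <- A_k %*% t(B_k) - Xi %*% solve(S, t(Xi))   # error covariance
  list(xD = xD, PD = PD)
}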

Theorem 2. Under hypotheses (i)-(iv), the cross-covariance matrices, $\Sigma^{ij}_{k/L}$, for $i,j = 1,\dots,m$, $L \ge k$, can be calculated by
$$
\begin{aligned}
\Sigma^{ij}_{k/L} ={}& \Sigma^{ij}_{k/L-1} + E^{ij}_{k,L}(\Pi^{jj}_{L})^{-1}S^{jT}_{k,L} + S^{i}_{k,L}(\Pi^{ii}_{L})^{-1}(E^{ji}_{k,L})^{T} \\
& + S^{i}_{k,L}(\Pi^{ii}_{L})^{-1}\Pi^{ij}_{L}(\Pi^{jj}_{L})^{-1}S^{jT}_{k,L}, \quad L > k, \\
\Sigma^{ij}_{k/k} ={}& A_k r^{ij}_k A^{T}_{k}, \quad k \ge 1,
\end{aligned}
$$
where
$$
\begin{aligned}
E^{ij}_{k,L} ={}& \bar{\xi}^{j}_{L}\,[e^{ii}_{k,L-1} - e^{ij}_{k,L-1}]A^{T}_{L}H^{jT}_{L}, \quad L > k,\ i \ne j; \qquad E^{ii}_{k,L} = 0, \\
E^{ij}_{k,k} ={}& \bar{\xi}^{j}_{k} A_k [r^{ii}_{k-1} - r^{ij}_{k-1}]A^{T}_{k}H^{jT}_{k}, \\
e^{ij}_{k,L} ={}& e^{ij}_{k,L-1} + \bar{\xi}^{i}_{L} S^{i}_{k,L}(\Pi^{ii}_{L})^{-1}H^{i}_{L}A_{L}[r^{jj}_{L-1} - r^{ij}_{L-1}] \\
& + [E^{ij}_{k,L} + S^{i}_{k,L}(\Pi^{ii}_{L})^{-1}\Pi^{ij}_{L}](\Pi^{jj}_{L})^{-1}J^{jT}_{L}, \quad L > k; \qquad e^{ij}_{k,k} = A_k r^{ij}_k, \\
r^{ij}_{k} ={}& r^{ij}_{k-1} + \bar{\xi}^{j}_{k}[r^{ii}_{k-1} - r^{ji}_{k-1}]^{T}A^{T}_{k}H^{jT}_{k}(\Pi^{jj}_{k})^{-1}J^{jT}_{k} + \bar{\xi}^{i}_{k}J^{i}_{k}(\Pi^{ii}_{k})^{-1}H^{i}_{k}A_{k} \\
& \times [r^{jj}_{k-1} - r^{ij}_{k-1}] + J^{i}_{k}(\Pi^{ii}_{k})^{-1}\Pi^{ij}_{k}(\Pi^{jj}_{k})^{-1}J^{jT}_{k}; \qquad r^{ij}_{0} = 0, \\
\Pi^{ij}_{k} ={}& \bar{\xi}^{i}_{k}\bar{\xi}^{j}_{k} H^{i}_{k}A_{k}B^{T}_{k}H^{jT}_{k} - \bar{\xi}^{i}_{k}\bar{\xi}^{j}_{k} H^{i}_{k}A_{k}[r^{jj}_{k-1} + r^{ii}_{k-1} - r^{ij}_{k-1}]A^{T}_{k}H^{jT}_{k}.
\end{aligned}
$$
Expressions for $S^{i}_{k,L}$, $r^{ii}_{k}$, $e^{ii}_{k,L}$, $J^{i}_{k}$ and $\Pi^{ii}_{k}$ are given in García-Ligero et al. (2011).

4 Simulation example

The proposed algorithms are applied to a simulation example to illustrate their effectiveness. Consider a zero-mean scalar signal, $x_k$, with known covariance function expressed according to hypothesis (i), with $A_k = 1.025641\times 0.95^k$ and $B_s = 0.95^{-s}$. The signal measurements coming from two sensors, $z^i_k = x_k + v^i_k$, $i = 1,2$, are affected by independent additive white noises with zero mean and variances $R^1_k = 0.5$ and $R^2_k = 0.9$ for all $k$. The observations of $z^i_k$, $i = 1,2$, are modeled as in (1), where $\{\xi^i_k,\, k \ge 1\}$ are independent Bernoulli variables with known probabilities $\bar{\xi}^i$ for all $k$. Figure 1 displays the distributed filtering error variances, $P^D_{k/k}$, and the fixed-point smoothing error variances, $P^D_{k/k+L}$ ($L = 1, 3$), for different values of $\bar{\xi}^i$, $i = 1,2$. It shows that the error variances corresponding to the smoothers are smaller than those of the filters and, moreover, that as the probabilities $\bar{\xi}^i$ increase (equivalently, as the dropout probabilities $1-\bar{\xi}^i$ decrease) the error variances become smaller and, hence, the performance of the estimators improves.

FIGURE 1. Distributed filtering error variances $P^D_{k/k}$ and smoothing error variances $P^D_{k/k+1}$ and $P^D_{k/k+3}$ over time $k$, for $(\bar{\xi}^1, \bar{\xi}^2) = (0.5, 0.3)$ and $(0.9, 0.7)$.

Acknowledgments: This work was partially supported by the "Ministerio de Economía y Competitividad" under contract MTM2011-24718.

References

García-Ligero, M.J., Hermoso-Carazo, A. and Linares-Pérez, J. (2011). Estimation for discrete-time systems with multiple packet dropouts using covariance information. Mathematical and Computer Modelling, 54, 2277 – 2286.





García-Ligero, M.J., Hermoso-Carazo, A. and Linares-Pérez, J. (2012). Distributed and centralized fusion estimation from multiple sensors with Markovian delays. Applied Mathematics and Computation, 219, 2932 – 2948.

Sahebsara, M., Chen, T. and Shah, L. (2007). Optimal H2 filtering with random sensor delay, multiple packet dropout and uncertain observations. International Journal of Control, 80, 292 – 301.


Developing Robust Scoring Methods for use in Child Assessment Tools

Phillip Gichuru1, Gillian Lancaster1, Andrew Titman1, Melissa Gladstone2

1 Mathematics and Statistics Department, Lancaster University, Lancaster, UK
2 Institute of Child Health, Royal Liverpool Children's NHS Trust, Liverpool, UK

E-mail for correspondence: [email protected]

Abstract: Earlier and more sensitive diagnosis of disability reduces its detrimental effect on children. We therefore seek to develop robust scoring methods for child assessment tools which will ensure more timely intervention for detected developmental delay, reducing stress on the child and its family. Most current development scores are dependent on age, hence a key objective is to develop methods that correct or account for age. Generally, regardless of implementation medium, two main scoring approaches are used: item-by-item scoring, which creates development score norms for each item, and total scoring, which uses all the responses of the child to give one score across the entire domain. Using data on 1,446 normal children from the recent Malawi Development Assessment Tool (MDAT) study, we review classical total scoring methods, including simple scoring, Log Age Ratio methods and Item Response Models under different assumptions, to derive normative scores in this child development context using binary responses only. While evaluating the pros and cons of each method, we also suggest extensions to current total scoring methods, including smoothing methods and the use of more flexible models within various scoring algorithms. Preliminary results show that weighting simple scores is important, as a lack of response to all items does not necessarily imply a lack of ability. Further, smoothing of score values is beneficial when variability in certain age groups is high due to recruitment problems. The more complex methods, accounting for most study design issues, produce more reliable and more generalizable normative scores. A sensitivity analysis showed that simple methods perform well in ideal situations.

Keywords: binary data; disability; development assessment; scoring; item response theory.

1 Introduction

Children with disability have increased susceptibility to delayed development, various psychiatric disorders and social adjustment problems, which leads to




many stresses on the child and their family. In developing countries these problems are especially exacerbated, and it has been shown that the detrimental effect of developmental delays could be prevented or reduced considerably if early intervention and detection programmes were available for these children (Grantham-McGregor et al., 2007). The aim of this project is to develop a methodological framework of scientifically robust statistical modelling methods for assessing health outcomes in children with disabilities, in particular by implementing and contrasting methods of detecting and scoring the stages of development. Such tools will be of benefit to community health workers in rural African settings as well as to those looking at the developmental outcomes of children with disabilities in the developed world. These methods can be incorporated into national and international maternal and community health programs, and be used to monitor and evaluate child wellbeing, health and nutrition at a population level. They will also allow for evidence-based evaluation of identified factors influencing disability. Once the relevant data, which typically take different types of responses, have been collected using an assessment tool, there follows the intricate process of converting these responses into scores that can be used to classify the developmental status of a child. Current scoring methods can be broadly categorised into two main groups: item-by-item scoring and total scoring. Item-by-item scoring generates norms for each item using a sample of healthy normal children, against which new children can be assessed, while total scoring creates a score using all item responses in the domain.

2 Methods

2.1 Data

The data come from a study of 1,446 normal subjects from Malawi in South-east Africa. A stratified quota sampling method was used to select a sample of children between January and August 2006, after a series of preliminary pilot studies (Gladstone et al., 2008). Each subject was assessed on their ability to perform tasks in the motor function, fine motor function, language and social skills domains. Each domain provided 34 items giving a binary response: 0 if a child failed to complete an item and 1 if the child successfully completed it.

2.2 Scoring methods/algorithms to derive overall score norms of a child using all the domain items

This paper concentrates on total scoring approaches, where all binary responses for a given child are considered simultaneously across the entire developmental domain being tested in order to summarise the performance of the child within that domain. The work of Drachler et al. (2007) and Cheung et al. (2008) explored and compared simple scoring methods that add up passed items, the creation of Z-scores and Log Age Ratio methods, as well as the use of Rasch models (Jacobusse et al., 2006; Jacobusse and Buuren, 2007) to characterise child development. All these procedures deal with each of the four domains of development separately, giving four domain scores for each child.

Page 65: 29th International Workshop on Statistical Modelling · Proceedings of the 29th International Workshop on Statistical Modelling, Volume 2, G ottingen, July 14{18, 2014, Thomas Kneib,

Gichuru et al. 57

2.3 Empirical and smoothed model-based Z-scores

The Z-score is arrived at by standardisation of the simple count, defined by summing the binary responses of a child across all items in a domain, using the mean and standard deviation of scores from pre-determined discrete age categories. Formally, the internal standardisation can be defined as
$$ Z\text{-score for child } i = \frac{SC_i - \mu_i}{\sigma_i} \qquad (1) $$

where $SC_i$ is the sum of the binary responses for the $i$th child, $\mu_i$ is the mean simple count of children in the age group the $i$th child belongs to, and $\sigma_i$ is the standard deviation of the simple count of children in that age group. Unless a very "large" data set is available, the Z-scores obtained from this approach are unlikely to be monotonic over age groups and may exhibit too much variability to allow reliable use in classifying the development status of children. An alternative to standardizing with the empirical mean and standard deviation of each age category is to use smoothed model-based versions of these values. Generalized Additive Models for Location, Scale and Shape (GAMLSS), as described by Rigby and Stasinopoulos (2005), were explored to provide smooth functional estimates of the mean and standard deviation at each age. Alternatively, we also consider Z-scores obtained using the modelled mean and variance from a beta-binomial GAMLSS model. The model-based Z-score method standardizes the SC of each child using the GAMLSS model-based estimates of the mean and standard deviation at the specific age. Formally, the model-based internal standardisation can be defined as

$$ Z\text{-score for child } i = \frac{SC_i - \text{model-based } \mu_i}{\text{model-based } \sigma_i} \qquad (2) $$
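Both standardisations can be sketched in R with the gamlss package; the data frame mdat and its columns SC (simple count out of the 34 domain items) and age are illustrative names, not taken from the MDAT study files.

# a minimal sketch, assuming 'mdat' contains SC (simple count) and age
library(gamlss)

# empirical Z-scores within pre-determined age categories
mdat$agegrp <- cut(mdat$age, breaks = 6)        # example grouping
mu_g <- ave(mdat$SC, mdat$agegrp, FUN = mean)
sd_g <- ave(mdat$SC, mdat$agegrp, FUN = sd)
mdat$z_emp <- (mdat$SC - mu_g) / sd_g

# smoothed model-based Z-scores: mean and scale as smooth functions of
# age via P-spline GAMLSS terms (a beta-binomial family, BB, could be
# substituted for the normal family NO used here)
fit <- gamlss(SC ~ pb(age), sigma.formula = ~ pb(age),
              family = NO, data = mdat)
mdat$z_model <- (mdat$SC - fitted(fit, "mu")) / fitted(fit, "sigma")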

2.4 Log Age Ratio (LAR) and Adapted Log Age Ratio (ALAR)

The LAR method (Drachler et al., 2007; Cheung et al., 2008) contrasts the 'ability age' and the 'actual age' of a particular child. A logistic regression model with the natural logarithm of age at assessment as independent variable is estimated for each binary item. Then, treating the regression slopes and intercepts from these models as fixed, the data from each child are used individually to obtain the maximum likelihood estimate of log(ability age). The LAR is then taken as the difference between the estimated log(ability age) and the log of the child's actual age. The standard LAR method is limited in requiring that the logistic regression model for each item is correctly specified. More robust shape-constrained additive models (Pya, 2012; Pya and Wood, 2014) can be used instead of the logistic model in the first and second LAR algorithm steps to compute more accurate LARs, which we refer to as the Adapted Log Age Ratio (ALAR) method.
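The two-step LAR computation can be sketched in R as follows; the item column names and the ability-age search range are assumptions made for illustration.

# a minimal sketch of the LAR algorithm: step 1 fits a logistic model on
# log(age) per item; step 2 finds, per child, the log ability age that
# maximises the likelihood with the item parameters held fixed
items <- paste0("item", 1:34)                  # assumed column names
fits  <- lapply(items, function(it)
  glm(reformulate("log(age)", response = it),
      family = binomial, data = mdat))
a <- sapply(fits, function(f) coef(f)[1])      # item intercepts
b <- sapply(fits, function(f) coef(f)[2])      # slopes on log(age)

lar_one <- function(y, age) {                  # y: child's 34 responses
  nll <- function(u) {                         # u = log(ability age)
    p <- plogis(a + b * u)
    -sum(y * log(p) + (1 - y) * log(1 - p), na.rm = TRUE)
  }
  u_hat <- optimize(nll, interval = log(c(0.5, 120)))$minimum
  u_hat - log(age)                             # LAR
}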

2.5 Item Response Theory (IRT) Models

A further approach is based on IRT models, which describe the conditional probability that a child with a particular latent trait will successfully complete an item with a specific difficulty. For instance, the 2-parameter logistic (2PL) model specifies that the conditional probability that a child with a

Page 66: 29th International Workshop on Statistical Modelling · Proceedings of the 29th International Workshop on Statistical Modelling, Volume 2, G ottingen, July 14{18, 2014, Thomas Kneib,

58 Developing Robust Scoring Methods

particular latent trait level (assumed to be $N(0,1)$) will correctly answer an item with a specific difficulty is
$$ P(Y_{jn} = 1\,|\,\theta_n, \beta_j) = \frac{\exp\{\alpha_j(\theta_n - \beta_j)\}}{1 + \exp\{\alpha_j(\theta_n - \beta_j)\}} \qquad (3) $$

where $\alpha_j$ refers to the discrimination parameter, or slope, of the $j$th item, $\theta_n$ to the ability of subject $n$, and $\beta_j$ to the difficulty of item $j$. The ability of a child is expected to increase with age; therefore scores are expected to increase with age, and age is strongly positively correlated with these model scores. A two-step process then involves first fitting an IRT model and then using a standardization method to create age-adjusted scores based on the IRT factor scores. For instance, a GAMLSS model may be used to model the mean and standard deviation of the scores as functions of age at pre-determined age groups, which are then used to standardize the IRT factor scores. An alternative IRT-based approach is to model the dependence of ability on age directly within the IRT model. For instance, we can specify a mixed effects logistic regression model in terms of log odds:

$$ \log\left[\frac{P(Y_{jn} = 1\,|\,\theta_n)}{1 - P(Y_{jn} = 1\,|\,\theta_n)}\right] = B(x_j) + \theta_j \qquad (4) $$

where $B(\cdot)$ is a smooth spline function and $x_j$ represents age. Here $\theta_j$ refers to the age-adjusted trait level of the $j$th child, where higher values reflect a higher level of the trait being measured by the items. The trait values are usually assumed to be normally distributed in the population of children with mean zero and variance $\sigma^2$, i.e. $N(0, \sigma^2)$.
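A model of this kind can be approximated with standard mixed-model software; the sketch below uses lme4 with a natural cubic spline for the age effect and, as an extra assumption beyond equation (4), item-specific intercepts. All data names are illustrative.

# a minimal sketch: binary mixed model with a smooth age effect and a
# child-level random intercept as the age-adjusted trait
library(lme4)
library(splines)
mdat$id <- seq_len(nrow(mdat))
long <- reshape(mdat, direction = "long", varying = items,  # item columns as above
                v.names = "y", timevar = "item", idvar = "id")
fit  <- glmer(y ~ ns(age, df = 4) + factor(item) + (1 | id),
              family = binomial, data = long)
theta <- ranef(fit)$id[, 1]    # age-adjusted trait levels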

3 Results

Table 1 is a descriptive summary of the distribution of scores from the 34 items within the gross motor domain of the MDAT data. There were 2, 8 and 163 children, respectively, who had all 34 items missing, had an age of 0, or passed or failed all gross motor items and therefore had no Log Age Ratio. The Spearman correlation coefficient assesses the degree of dependency between the scores and age (or log age).

4 Conclusions

The results show that weighting simple scores is especially beneficial when there are many cases of unadministered tool items. While the Log Age Ratio method is unable to deal with several data issues, by using a more robust modelling approach within the algorithm, the Adapted Log Age Ratio method proved beneficial, especially in dealing with items with low or high pass rates. Further, smoothing of score values is important when variability in certain age groups is high due to recruitment problems. The smoothing technique not only produces 'smooth' scores but also ensures that the monotonicity property is adhered to, and it offers a framework for creating confidence intervals around the estimates. The more complex methods, accounting for most study design issues, facilitate a framework for correcting for the



age of the child, which is known to influence ability, and allow the production of more reliable and more generalizable normative scores. However, the sensitivity analysis showed that simple methods perform well in ideal situations. Therefore, more rigorous testing of the complex methods is needed to motivate their use.

References

Cheung, Y.B., Jeffries, D., Thomson, A. and Milligan, P. (2008). A simple approach to test for interaction between intervention and an individual-level covariate in community randomised trials. Tropical Medicine & International Health, 13, 247 – 255.

Drachler, M.L., Marshall, T. and de Carvalho Leite, J.C. (2007). A continuous-scale measure of child development for population-based epidemiological surveys: a preliminary study using item response theory for the Denver Test. Paediatric and Perinatal Epidemiology, 21, 138 – 153.

Gladstone, M., Lancaster, G.A., Umar, E., Nyirenda, M., Kayira, E. et al. (2010). The Malawi Developmental Assessment Tool (MDAT): the creation, validation, and reliability of a tool to assess child development in rural African settings. PLoS Medicine, 7(5), e1000273.

Gladstone, M., Lancaster, G.A. and Jones, A.P. (2008). Can Western developmental screening tools be modified for use in a rural Malawian setting? Archives of Disease in Childhood, 93, 23 – 29.

Grantham-McGregor, S., Cheung, Y.B., Cueto, S., Glewwe, P. et al. (2007). Child developmental potential in the first 5 years for children in developing countries. Lancet, 369, 60 – 70.

Pya, N. (2012). scam: Shape constrained additive models. R package version 1.1-5, http://cran.r-project.org/web/packages/scam/.

TABLE 1. Distribution of summary statistics and sensitivity analysis for developmental scores in the gross motor domain. Spearman correlation of scores with age (log age for LAR and ALAR) in the normal sample.

Summary statistic      Z-score  Smooth   Beta Binom  LAR     ALAR    Smooth   Spline  Spline
                                Z-score  GAMLSS                      IRT      1PL     2PL
                                                                     Z-score
Mean                   0.00     -0.01    -1.93       0.14    -0.15   0.00     0.00    0.00
(s.e.)                 0.03     0.03     0.28        0.04    0.01    0.03     0.03    0.04
SD                     0.99     1.00     10.53       1.57    0.54    0.99     0.99    1.67
Skewness               -0.16    -0.38    -1.43       6.87    1.66    -0.03    0.00    0.02
Kurtosis               2.63     4.07     14.96       65.06   9.72    4.38     2.70    40.51
Minimum                -5.94    -7.25    -94.88      -1.67   -2.68   -5.91    -5.67   -2.07
Maximum                4.01     4.44     69.00       15.81   3.24    6.76     6.11    1.65
Correlation            0.08     0.04     -0.15       -0.95   0.34    -0.01    0.03    <0.01
(p-value)              0.002    0.040    <0.001      <0.001  <0.001  0.64     0.32    0.92
N                      1444     1444     1444        1273    1436    1444     1444    1442
Observed proportion    3.60%    4.16%    50.27%      0.00%   0.00%   4.85%    4.92%   6.10%
below nominal 5%
level (n)              52       60       726         1       2       70       71      88
Proportion of true     95.00%   95.00%   100%        35.21%  71.25%  70.00%   83.75%  1.25%
positives at 5% level
in disabled sample
(n=80) (n)             76       76       80          25      57      56       67      1



Pya, N. and Wood, S.N. (2014). Shape constrained additive models. Statistics and Computing, to appear. DOI 10.1007/s11222-013-9448-7.

Rigby, R.A. and Stasinopoulos, D.M. (2005). Generalized additive models for location, scale and shape. Applied Statistics, 54, 507 – 544.


Multilevel Factor Analysis of the PIRLS Test for Iranian Pupils

Mousa Golalizadeh1, Vida Azod Azad1

1 Dept. of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran

E-mail for correspondence: [email protected]

Abstract: One of the most popular linear models utilized to analyse data with an intra-class correlation structure is the multilevel model. On the other hand, factor analysis is a modeling theme used to discover latent variables underlying the observed data. When a multivariate data set follows a hierarchical structure and one is looking for common factors representing general effective indexes, multilevel factor analysis arises. Such a modeling strategy for ordinal responses is considered in this paper. In particular, we study the fitting of a multilevel factor model to the PIRLS data related to educational achievement in Iran. To build our model, we use the common connection between ordinal and normal response variables through threshold parameters. Furthermore, in order to derive some effective factors describing the general achievement indexes among pupils in Iran, exploratory and then confirmatory factor analyses are implemented at both levels of the hierarchy.

Keywords: Multilevel models; Multilevel factor models; Threshold parameters; PIRLS data.

1 Introduction

There are various examples in which the crucial assumption of independence among observations does not hold. A typical situation is data nested in a hierarchy. For instance, to study educational achievement among pupils, one would randomly select some pupils from classes within schools. In this case, there would be strong intra-class correlation among pupils from the same class and also among classes from the same school. Multilevel models, as a generalization of regression models, can be used to tackle this feature (see, e.g., Goldstein, 2010) and have attracted particular attention in diverse disciplines including education and sociology. In some studies one cannot directly measure certain factors, known as latent variables, such as intelligence and social class.




Such studies are generally set in the factor analysis framework and are usually performed in an exploratory or confirmatory manner (see, e.g., Everitt and Hothorn, 2011). When the observations are collected through batteries of questionnaires, one usually encounters ordinal responses. In this typical case, the common procedure is to link the ordinal responses with continuous variables through some threshold parameters and follow the standard factor analysis for continuous data (Jöreskog and Moustaki, 2006). Sometimes, multivariate data are not only ordinal but also inherit a hierarchical structure. A typical case then arises when one would like to perform factor analysis and derive some indexes for common latent variables. Grilli and Rampichini (2007) studied this and derived the maximum likelihood estimators via an EM algorithm with adaptive Gaussian numerical quadrature. In this paper, we follow the same procedure to find the level of educational achievement among pupils studying in year four (of the Iranian educational system), based upon data provided by the Progress in International Reading Literacy Study (PIRLS). The data are collected by the International Association for the Evaluation of Educational Achievement (IEA) and are freely available at http://www.iea.nl/. The organization of this paper is as follows. In Section 2 we provide the material for multilevel factor models for ordinal responses. Section 3 describes the data, fits the aforementioned model and interprets the relevant results.

2 Multilevel Factor Model for Ordinal Responses

Let us assume that, for $h = 1,\dots,H$, $i = 1,\dots,n_j$, $j = 1,\dots,J$, $Y_{hij}$ represents the $h$-th ordinal variable (answer to a particular statement) for the $i$-th individual (pupil) from the $j$-th group (school). To describe the model in the standard linear form, one needs a continuous variable (say $Y^*_{hij}$) whose value in certain intervals, established by some threshold parameters (say $\tau$), determines the corresponding value of the ordinal variable $Y_{hij}$. In particular, let us assume the observed response takes its values in the set $\{1, 2,\dots,C_h\}$. The relationship between the ordinal and continuous variables is given by
$$ Y_{hij} = c_h \iff \tau_{c_h-1,h} < Y^*_{hij} < \tau_{c_h,h}, \qquad (1) $$
where $-\infty = \tau_{0,h} < \tau_{1,h} < \dots < \tau_{C_h-1,h} < \tau_{C_h,h} = \infty$.

If one ignores the hierarchical feature and is interested in establishing a factor model (with $M$ factors) at this point, one could use the standard factor model
$$ Y^*_{hij} = \mu_h + \sum_{m=1}^{M} \lambda_{mh} z_{mij} + \varepsilon_{hij}, \qquad (2) $$
where $\mu_h$ is the grand mean for the $h$-th variable, $\lambda_{mh}$ is the loading of the $m$-th factor on the $h$-th variable, $z_{mij}$ is the $m$-th common factor and the $\varepsilon$'s are specific errors, independently distributed as $\varepsilon_{hij} \sim (0, \sigma^2_\varepsilon)$. Then, because we are dealing with two-level



models, following the notation used by Goldstein (2010) for such models, the common factors and specific errors should be separately decomposed into two components as
$$ z_{mij} = z^{(2)}_{mj} + z^{(1)}_{mij}, \quad m = 1,\dots,M, $$
$$ \varepsilon_{hij} = \varepsilon^{(2)}_{hj} + \varepsilon^{(1)}_{hij}, \quad h = 1,\dots,H. \qquad (3) $$

Combining equations (2) and (3), we get
$$ Y^*_{hij} = \mu_h + \Big[\sum_{m=1}^{M} \lambda_{mh} z^{(2)}_{mj} + \varepsilon^{(2)}_{hj}\Big] + \Big[\sum_{m=1}^{M} \lambda_{mh} z^{(1)}_{mij} + \varepsilon^{(1)}_{hij}\Big]. \qquad (4) $$

Accordingly, we can write this relationship in matrix form, along with the necessary modifications (notice the different loadings at each level), as
$$ Y^*_{ij} = \big[\lambda^{(2)} z^{(2)}_{j} + \varepsilon^{(2)}_{j}\big] + \big[\lambda^{(1)} z^{(1)}_{ij} + \varepsilon^{(1)}_{ij}\big]. \qquad (5) $$

Now, following standard factor models, we can decompose the variance as
$$ \mathrm{Var}(Y^*_{ij}) = \big[\lambda^{(2)} \Sigma^{(2)} \lambda^{(2)\prime} + \Psi^{(2)}\big] + \big[\lambda^{(1)} \Sigma^{(1)} \lambda^{(1)\prime} + \Psi^{(1)}\big], \qquad (6) $$
showing how the between and within covariance decompositions can be utilized to obtain a multilevel factor model for ordinal responses.
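Decomposition (6) corresponds to the two-level confirmatory factor models that can be fitted, for example, with the lavaan package in R. The sketch below treats the four-point responses as continuous for simplicity (lavaan's multilevel estimator does not handle the threshold link of equation (1)); the data frame pirls and the cluster variable school are illustrative names.

# a minimal two-level CFA sketch with lavaan, assuming 'pirls' holds the
# item responses Q1-Q6 coded 1-4 and a 'school' identifier
library(lavaan)
model <- '
level: 1
  read_w =~ Q1 + Q2 + Q3 + Q4   # within (pupil) level
  hard_w =~ Q5 + Q6
level: 2
  read_b =~ Q1 + Q2 + Q3 + Q4   # between (school) level
  hard_b =~ Q5 + Q6
'
fit <- cfa(model, data = pirls, cluster = "school")
summary(fit, standardized = TRUE)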

3 Multilevel Factor Analysis for Iranian PupilsAchievements Data

In this section we deal with the multilevel factor analysis of the Iranian pupils' achievement data. The data include the responses given by 5746 pupils from 244 schools to 6 statements with the possible answers "agree a lot", "agree a little", "disagree a little" and "disagree a lot". The statements are:

Q1: I would like to have more time for reading.

Q2: I enjoy reading.

Q3: I need to read well for my future.

Q4: I like it when a book helps me imagine other worlds.

Q5: Reading is harder for me than for many of my classmates.

Q6: Reading is harder for me than any other subject.

To follow the decomposition in (6), we used the simple correlation matrices, at the pupil and school levels, arising from assigning ordinal values to the responses. After performing the relevant tests on each correlation matrix, we found that two factors suffice to explain most of the variability among the variables. This completed the exploratory factor analysis step; now, as recommended by Grilli and Rampichini (2007), we perform the confirmatory analysis. This was undertaken by recalling equations (2) and (6)



and estimating the relevant parameters. Note that, due to the identifiability constraint in the multilevel factor model, the EM algorithm fixes some loadings to one. So, while estimating the loadings, it is expected that some estimated loadings will be larger than one. Hence, we divided all the loadings in each column by the maximum absolute value of the corresponding column to obtain loadings in the interval $[-1, 1]$. Table 1 shows the final results for each level. Note that we did not alter those values already set to one and, instead, highlighted them with an asterisk in the table. Further, the cells indicated by dashes are very close to zero. Based upon the

TABLE 1. Estimated loadings for the multilevel factor analysis of the PIRLS data.

              schools level          pupils level
variables     first     second       first     second
Q1            0.920     -            0.830     -
Q2            1*        -0.257       1*        -
Q3            0.574     -0.610       0.564     -
Q4            0.460     -0.597       0.479     -
Q5            -         1*           -         1*
Q6            -         1            -         0.9

values in this table, we see that the first factor at the school level includes statements 1 and 2. This can be interpreted as an index of the "enthusiasm of pupils for reading among Iranian schools". The second factor might be called the "negative attitude of pupils towards studying among Iranian schools". It is seen that statements 3 and 4 did not, in general, manage to contribute to all factors at both levels. We also obtained the scores for schools and pupils and found that both have relatively average scores, meaning that Iranian educational achievement is at an acceptable level based on the current data. Clearly, a comprehensive study, which takes many educational and economic variables into account, is required to construct feasible educational indexes.

References

Everitt, B. and Hothorn, T. (2011). An Introduction to Applied Multivariate Analysis with R. New York: Springer.

Goldstein, H. (2010). Multilevel Statistical Models (4th Ed.). London: Arnold.

Grilli, L. and Rampichini, C. (2007). Multilevel Factor Models for Ordinal Variables. Structural Equation Modeling, 12, 1 – 25.

Jöreskog, K.G. and Moustaki, I. (2006). Factor Analysis of Ordinal Variables with Full Information Maximum Likelihood. New Jersey: Lawrence Erlbaum Associates.


Variable Selection in Heterogeneous Discrete Survival Models

Andreas Groll1, Gerhard Tutz2

1 Ludwig-Maximilians-Universität München, Institut für Statistik, Akademiestraße 1, 80799 München, Germany
2 Ludwig-Maximilians-Universität München, Mathematisches Institut, Theresienstraße 39, 80333 München, Germany

E-mail for correspondence: [email protected]

Abstract: In survival analysis, time is frequently observed on a discrete scale, for example in days, months or weeks. An appropriate way to model such data is by use of discrete survival models. However, their use is typically restricted to few covariates, because the presence of many predictors yields unstable estimates. We present an approach to variable selection that is based on L1-type penalty terms and utilizes a gradient ascent algorithm. It allows the maximization of the penalized log-likelihood, yielding models with reduced complexity. The approach works for standard discrete time-to-event data as well as for frailty models that account for the heterogeneity among observations.

Keywords: Discrete survival; Heterogeneity; Lasso; Variable selection; Frailty models.

1 Discrete Hazard Models Including Heterogeneity

In the following we introduce the most common basic discrete hazard models including heterogeneity. In addition, we sketch how censoring is incorporated into the estimation.

1.1 Basic Models

Let time take values in $\{1,\dots,k\}$. If time results from grouping into intervals, one has $k$ underlying intervals $[a_0, a_1), [a_1, a_2), \dots, [a_{q-1}, a_q), [a_q, \infty)$, where $q = k-1$. Discrete time $T \in \{1,\dots,k\}$ means that $T = t$ is observed if failure occurs within the interval $[a_{t-1}, a_t)$.
The main tool in modeling survival data is the hazard function. In discrete time it has the form
$$ \lambda(t\,|\,\boldsymbol{x}) = P(T = t\,|\,T \ge t, \boldsymbol{x}), \quad t = 1,\dots,q, $$




which is the conditional probability of failure in the interval $[a_{t-1}, a_t)$ given that the interval is reached. The corresponding survivor function is
$$ S(t\,|\,\boldsymbol{x}) = P(T > t\,|\,\boldsymbol{x}) = \prod_{i=1}^{t} \big(1 - \lambda(i\,|\,\boldsymbol{x})\big). $$

Models for discrete survival given covariates $\boldsymbol{x}$ have the form
$$ \lambda(t\,|\,\boldsymbol{x}) = h(\gamma_{0t} + \boldsymbol{x}^T\boldsymbol{\gamma}), \qquad (1) $$
where $h(\cdot)$ is a fixed response function, which is assumed to be strictly monotonically increasing. The parameters $\gamma_{0t}$ represent the baseline hazard, which is the same for all individuals. The contribution of the predictors is captured by the term $\boldsymbol{x}^T\boldsymbol{\gamma}$, where $\boldsymbol{x}^T = (x_1,\dots,x_p)$ is a vector of predictors and $\boldsymbol{\gamma}^T = (\gamma_1,\dots,\gamma_p)$ are the weights. The most prominent discrete survival model is the continuation ratio model
$$ \lambda(t\,|\,\boldsymbol{x}) = \frac{\exp(\gamma_{0t} + \boldsymbol{x}^T\boldsymbol{\gamma})}{1 + \exp(\gamma_{0t} + \boldsymbol{x}^T\boldsymbol{\gamma})}, $$
which uses the logistic distribution function $h(\eta) = \exp(\eta)/(1+\exp(\eta))$. An alternative widely used model is the grouped proportional odds model
$$ \lambda(t\,|\,\boldsymbol{x}) = 1 - \exp(-\exp(\gamma_{0t} + \boldsymbol{x}^T\boldsymbol{\gamma})), $$
which uses the Gompertz distribution $h(\eta) = 1 - \exp(-\exp(\eta))$ as response function. The name refers to its derivation as a grouped version of Cox's proportional hazards model.
Discrete survival models were already considered in the seminal paper of Cox (1972). Various extensions have been given since then; see, for example, Therneau and Grambsch (2000) or Goeman (2010).
Model (1) includes an unspecified baseline hazard, which is assumed to be the same for all observations. This is frequently too strict an assumption because it ignores heterogeneity among individuals. Therefore we will consider the extended model, which includes potential heterogeneity. The corresponding basic frailty model for the $i$th observation has the form
$$ \lambda(t\,|\,\boldsymbol{x}, b_i) = h(b_i + \gamma_{0t} + \boldsymbol{x}^T\boldsymbol{\gamma}), $$
where $b_i$ is a random effect that is assumed to follow a fixed mixture distribution with density $p(\cdot)$, typically the normal distribution.

1.2 Estimation with Censoring

When modeling survival data, censoring has to be taken into account. In the case of right censoring, which is considered here, it is only known for censored data that $T$ exceeds a certain value, but the exact value is not known. Let $C_i$ denote the censoring time and $T_i$ the exact failure time for observation $i$. Under random censoring it is assumed that $T_i$ and $C_i$ are independent random



variables. The observed time is given by $t_i = \min(T_i, C_i)$, the minimum of survival time $T_i$ and censoring time $C_i$. It is often useful to introduce an indicator variable for censoring given by

$$ \delta_i = \begin{cases} 1 & \text{if } T_i \le C_i, \\ 0 & \text{if } T_i > C_i, \end{cases} $$

where it is implicitly assumed that censoring occurs at the end of the interval. Under random censoring, the probability of observing $(t_i, \delta_i)$ (given the random effect $b_i$) is given by
$$ P(t_i, \delta_i\,|\,\boldsymbol{x}_i, b_i) = P(T_i = t_i)^{\delta_i} P(T_i > t_i)^{1-\delta_i} P(C_i \ge t_i)^{\delta_i} P(C_i = t_i)^{1-\delta_i}. $$
If one assumes that the censoring contributions do not depend on the parameters that determine the survival time, one can separate the factor $c_i = P(C_i \ge t_i)^{\delta_i} P(C_i = t_i)^{1-\delta_i}$ to obtain the simpler form
$$ P(t_i, \delta_i\,|\,\boldsymbol{x}_i, b_i) = c_i\, P(T_i = t_i)^{\delta_i} P(T_i > t_i)^{1-\delta_i}. $$

An important tool in discrete survival is that the probability, and therefore the corresponding likelihood, can be rewritten by using sequences of binary data. By defining for a non-censored observation ($\delta_i = 1$) the sequence $(y_{i1},\dots,y_{it_i}) = (0,\dots,0,1)$ and for a censored one ($\delta_i = 0$) the sequence $(y_{i1},\dots,y_{it_i}) = (0,\dots,0)$, the probability (omitting $c_i$) can be written as
$$ P(t_i, \delta_i\,|\,\boldsymbol{x}_i, b_i) = \prod_{s=1}^{t_i} \lambda(s\,|\,\boldsymbol{x}_i, b_i)^{y_{is}} \big(1 - \lambda(s\,|\,\boldsymbol{x}_i, b_i)\big)^{1-y_{is}}. $$

For the model without heterogeneity, the random effect $b_i$ is omitted and the corresponding log-likelihood is that of a binary response model, where $\lambda(s\,|\,\boldsymbol{x}_i)$ is the probability of the binary response (transition to the next category or not) and the corresponding linear predictor for $\lambda(s\,|\,\boldsymbol{x}_i)$ is given by $\gamma_{0s} + \boldsymbol{x}^T\boldsymbol{\gamma}$. By construction of an appropriate design matrix, estimation methods and software for binary response models can be used (see, e.g., Brown, 1975, or Fahrmeir and Tutz, 2001), as sketched below.
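The R sketch below illustrates this binary expansion and the resulting continuation ratio fit; the input layout (vectors time and status plus a covariate data frame X) is an assumption made for illustration.

# a minimal sketch: expand (time, status) into binary transition
# indicators y_is and fit model (1) as a logistic regression
expand_discrete_surv <- function(time, status, X) {
  do.call(rbind, lapply(seq_along(time), function(i) {
    t_i <- time[i]
    data.frame(id = i, s = seq_len(t_i),
               y = c(rep(0, t_i - 1), as.numeric(status[i] == 1)),
               X[rep(i, t_i), , drop = FALSE])
  }))
}
# long <- expand_discrete_surv(time, status, X)
# fit  <- glm(y ~ factor(s) + x1 + x2, family = binomial, data = long)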

In the frailty model the unconditional probability is given by
$$
\begin{aligned}
P(t_i, \delta_i\,|\,\boldsymbol{x}_i) &= \int P(t_i, \delta_i\,|\,\boldsymbol{x}_i, b_i)\, p(b_i)\, db_i \\
&= \int P(T_i = t_i)^{\delta_i} P(T_i > t_i)^{1-\delta_i}\, p(b_i)\, db_i \\
&= \int \prod_{s=1}^{t_i} \lambda(s\,|\,\boldsymbol{x}_i, b_i)^{y_{is}} \big(1 - \lambda(s\,|\,\boldsymbol{x}_i, b_i)\big)^{1-y_{is}}\, p(b_i)\, db_i.
\end{aligned}
$$
This is the unconditional probability of a random effects model for structured binary data. Estimation can be based on integration techniques, in particular Gauss-Hermite integration (see, e.g., Anderson and Aitkin, 1985). An alternative way to estimate mixed models is penalized quasi-likelihood estimation, which uses a Laplace approximation (Breslow and Clayton, 1993). Simpler estimates can be obtained under specific assumptions.



2 Variable Selection by Regularization

In the following we consider the more general model
$$ \lambda(t\,|\,\boldsymbol{x}_{it}, \boldsymbol{z}_{it}, \boldsymbol{b}_i) = h(\gamma_{0t} + \boldsymbol{x}_{it}^T\boldsymbol{\gamma} + \boldsymbol{z}_{it}^T\boldsymbol{b}_i), $$
with explanatory variables $\boldsymbol{x}_{it}$, $\boldsymbol{z}_{it}$ and random effect vector $\boldsymbol{b}_i$. We will show that the corresponding log-likelihood is the log-likelihood of a specific random effects model for binary responses, where variable selection can be achieved by use of penalized log-likelihood concepts. As shown by Groll and Tutz (2014), for simple binary random effects models a lasso penalty can be included within the framework of penalized quasi-likelihood, yielding selection procedures for mixed models. Groll and Tutz (2014) gave a detailed algorithm which is based on an approximate EM method.
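Since the binary mixed-model representation above is exactly the setting of Groll and Tutz (2014), their glmmLasso package can be applied to the expanded data from Section 1.2. The sketch below assumes the long-format data frame from that expansion and illustrative covariate names; the lambda value is arbitrary and would in practice be tuned, e.g. by an information criterion.

# a sketch of L1-penalized variable selection on the expanded binary
# data using glmmLasso; covariate names x1-x3 are assumptions, and a
# full analysis would also handle the time dummies' penalization
library(glmmLasso)
long$s  <- as.factor(long$s)       # baseline hazard via time dummies
long$id <- as.factor(long$id)
fit <- glmmLasso(y ~ s + x1 + x2 + x3,
                 rnd    = list(id = ~1),   # frailty (random intercept)
                 family = binomial(link = "logit"),
                 lambda = 50, data = long)
summary(fit)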

References

Anderson, D.A. and Aitkin, M. (1985). Variance component models with binary response: interviewer variability. Journal of the Royal Statistical Society, Series B, 47, 203 – 210.

Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9 – 25.

Brown, C. (1975). On the use of indicator variables for studying the time-dependence of parameters in a response-time model. Biometrics, 31, 863 – 872.

Cox, D.R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B, 34, 187 – 220.

Fahrmeir, L. and Tutz, G. (2001). Multivariate Statistical Modelling based on Generalized Linear Models. New York: Springer.

Goeman, J.J. (2010). L1 penalized estimation in the Cox proportional hazards model. Biometrical Journal, 52, 70 – 84.

Groll, A. and Tutz, G. (2014). Variable selection for generalized linear mixed models by L1-penalized estimation. Statistics and Computing, 24(2), 137 – 154.

Therneau, T. and Grambsch, P. (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer-Verlag.


Drift Estimation in Sparse Sequential Dynamic Imaging: with Application to Nanoscale Fluorescence Microscopy

Alexander Hartmann1, Stephan Huckemann1,2, Jorn Dannemann1, Oskar Laitenberger3, Claudia Geisler3, Alexander Egner3, Axel Munk1,2,4

1 Institute for Mathematical Stochastics, Georg-August-University Gottingen, Germany
2 Felix Bernstein Institute for Mathematical Statistics in the Biosciences, Georg-August-University Gottingen, Germany
3 Laser Laboratory, Gottingen, Germany
4 Max-Planck-Institute for Biophysical Chemistry, Gottingen, Germany

E-mail for correspondence: [email protected]

Abstract: A major challenge in many superresolution fluorescence microscopy techniques at the nanoscale, such as stochastic / single marker switching (SMS) microscopy, lies in the correct alignment of long sequences of sparse but spatially and temporally highly resolved images. This is caused by the temporal drift of the protein structure, e.g. due to thermal fluctuations of the object of interest or its supporting area during the observation process. We develop a simple semiparametric model for drift correction in SMS microscopy. Then we propose an M-estimator for the drift and show its asymptotic normality. This is used to correct the final image, and it is shown that this purely statistical method is competitive with state-of-the-art calibration techniques which require incorporating fiducial markers into the specimen. Moreover, a simple bootstrap algorithm allows one to quantify the precision of the drift estimate and its effect on the final image estimation. We argue that purely statistical drift correction is even more robust than fiducial tracking, rendering the latter superfluous in many applications. The practicability of our method is demonstrated by a simulation study and by an SMS application. This serves as a prototype for many other typical imaging techniques where sparse observations with high temporal resolution are blurred by motion of the object to be reconstructed.



State space analysis of the Prague Stock Exchange index

Radek Hendrych1

1 Charles University in Prague, Faculty of Mathematics and Physics, Dept. of Probability and Mathematical Statistics, Prague, Czech Republic

E-mail for correspondence: [email protected]

Abstract: This year the PX index, the key price index of the Prague StockExchange, celebrated twenty years of its existence. The aim of this contributionis to analyse its historical daily closing quotes from an econometric perspective.In greater detail, a particular class of state space models appropriate for this typeof univariate financial time series is introduced. It simply combines a local levelmodel and a linear ARMA process together with conditionally heteroscedasticdisturbances. The finally selected model is further examined and statisticallyverified.

Keywords: ARMA; Local level model; PX index; State space modelling.

1 Prague Stock Exchange index

The PX index (ISIN XC0009698371) is an official market-cap weighted stock index composed of the most liquid shares traded on the Prague Stock Exchange. In particular, it is a price index of blue chip issues (dividends are not considered), calculated in real time and weighted by market capitalization. A new value of the PX index is delivered by a particular formula and reflects each single price change of the index constituents. The maximum weight for a share issue is 20% on a decisive day. The portfolio of basic issues is variable and can be restructured quarterly. The PX index was launched on 5th April 1994 (originally known as PX-50). Its base was composed of the fifty most important share issues operating on the Prague Stock Exchange. The opening base value was fixed at 1,000. The number of basic issues has been variable since December 2001. In March 2006, the (new) PX index was introduced; it took over the whole history of the replaced index PX-50, continuing its development. In January 2014, the PX base contained fourteen issues; the top five stocks had approximately an 85 percent share of market capitalization in this portfolio.




The majority of capitalization was allocated in the banking, energy and insurance sectors. Further details, including historical data, can be found on the official web pages of the Prague Stock Exchange (http://www.pse.cz).

2 State space model of the PX index

The financial time series of the PX index historical quotes is analysed by means of a particular state space modelling class that combines several classical econometric concepts. Concretely, a local level model and a linear ARMA process, jointly with conditionally heteroscedastic innovations, are considered in modelling the PX index prices. In particular, all historical daily closing quotes until 3rd January 2014 (i.e. 4,878 observations) are investigated. The minimal value, 316, occurred on 8th October 1998 after the Russian financial crisis. The maximal observation, 1,936, was achieved on 29th October 2007. Additionally, the crisis year 2008 was truly exceptional in terms of highly volatile prices. The augmented Dickey-Fuller test (Tsay, 2005) for unit roots delivers the statistic −1.143 with p-value 0.701, i.e. the presence of a unit root cannot be rejected at the 5% level. This feature of the time series should be taken into account, e.g. by differencing the original data; see below. The following form of the discrete-time state space model is assumed:

$$
\begin{aligned}
\Delta y_t &= \mu_t + \tilde{y}_t + \sqrt{g_t}\,\varepsilon_t, &(1)\\
\mu_{t+1} &= \mu_t + \sqrt{h_\mu}\,\eta^{\mu}_t, &(2)\\
\tilde{y}_{t+1} &= \theta_{r-1} x_{t-r+1} + \dots + \theta_1 x_{t-1} + \theta_0 x_t, &(3)\\
x_{t+1} &= \phi_{r-1} x_{t-r+1} + \dots + \phi_0 x_t + \sqrt{h^{x}_{t+1}}\,\eta^{x}_{t+1}, \quad t \in \mathbb{N}. &(4)
\end{aligned}
$$

In the signal equation (1), $\Delta y_t$ denotes the first difference of the observed time series $y_t$, i.e. $\Delta y_t = y_t - y_{t-1}$; $\mu_t$ stands for the local level; $\tilde{y}_t$ is the process fully described by formulas (3)-(4); and $\sqrt{g_t}\,\varepsilon_t$ is a GARCH($P_g$, $Q_g$) type process. The first state equation (2) represents the local level with non-negative finite variance $h_\mu$ constant in time. If $h_\mu = 0$, the state process $\mu_t$ is generated by the relation $\mu_{t+1} = \mu_t$. The next state equation (3) contains the parameters $\theta_0, \dots, \theta_{r-1}$. Finally, the last state equation (4) includes the parameters $\phi_0, \dots, \phi_{r-1}$ and $\sqrt{h^{x}_{t+1}}\,\eta^{x}_{t+1}$, a GARCH($P_h$, $Q_h$) type process. All model disturbances, i.e. $\varepsilon_t$, $\eta^{\mu}_t$ and $\eta^{x}_{t+1}$, are mutually and serially independent i.i.d. Gaussian random variables with zero mean and unit variance. They are also uncorrelated with the initial state vector $(\mu_1, \tilde{y}_1, x_{1-r+1}, \dots, x_1)^T$, which has expected value $a_1$ and covariance matrix $P_1$. Frequently, one sets $a_1 = 0$ and $P_1 = \kappa I$, where $\kappa$ is a large positive number. More precisely, the formulated model should be accompanied by the following set of artificial (state) equations: $x_{t-r+i} = x_{t-r+i}$, $i = 1, \dots, r-1$. In summary, the whole considered state space modelling system (1)-(4) has $r+2$ state equations. Further, denote $r = \max(p, q+1)$, where $p, q \in \mathbb{N}_0$, and one puts $\theta_j = 0$ for $j > q$ and $\phi_j = 0$ for $j > p-1$, respectively.



In addition, the model (1)-(4) can be rewritten more concisely as

$$
\begin{aligned}
\Delta y_t &= \mu_t + \tilde{y}_t + \sqrt{g_t}\,\varepsilon_t, &(5)\\
\mu_{t+1} &= \mu_t + \sqrt{h_\mu}\,\eta^{\mu}_t, &(6)\\
\tilde{y}_{t+1} &= \phi_0 \tilde{y}_t + \dots + \phi_{p-1} \tilde{y}_{t-p+1} + \theta_0 \sqrt{h^{x}_{t}}\,\eta^{x}_{t} + \theta_1 \sqrt{h^{x}_{t-1}}\,\eta^{x}_{t-1} + \dots + \theta_q \sqrt{h^{x}_{t-q}}\,\eta^{x}_{t-q}. &(7)
\end{aligned}
$$

According to (5)-(7), one can conclude that the assumed model for $\Delta y_t$ is composed of the local level and an unobserved ARMA process with conditionally heteroscedastic disturbances and an additive error. It contains $p + q + P_g + P_h + Q_g + Q_h + 4$ unknown parameters. Under some modelling circumstances, it can be calibrated by the numerically efficient Kalman recursive formulas associated with linear Gaussian state space models. The unknown parameters can be estimated by the corresponding maximum likelihood procedure. Consult Commandeur and Koopman (2007) or Durbin and Koopman (2001) and the references given therein.
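To illustrate the Kalman-filter maximum likelihood machinery on a simplified special case (a pure local level for the differenced series, i.e. without the ARMA and GARCH components of (1)-(4)), base R already suffices; the vector px of daily closing quotes is an assumed input.

```r
## Hedged illustration: local level model for the differenced PX quotes,
## fitted by maximum likelihood via the Kalman filter (base R).
dy  <- diff(px)                      # px: assumed vector of closing quotes
fit <- StructTS(dy, type = "level")  # local level ("random walk plus noise")
fit$coef                             # estimated level and observation variances
sm  <- tsSmooth(fit)                 # smoothed estimates of the level mu_t
```

The full model (1)-(4), with its ARMA state and GARCH-type disturbances, requires a custom state space formulation; see Tusell (2011) and Van den Bossche (2011) for the software used here.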

3 Model calibration

The suggested modelling class expressed by the equations (1)-(4) was examined under various settings of the parameters. In particular, $p$ was chosen from the set $\{0, 1, 2, 3\}$, $q$ from $\{0, 1, 2\}$, and $P_h$, $Q_h$, $P_g$, $Q_g$ took values in $\{0, 1\}$. All computations have been performed in EViews and R; see Tusell (2011) and Van den Bossche (2011). The calibrated models have been compared using (i) the Akaike information criterion (AIC) and (ii) the root mean squared error of one-step-ahead predictions (RMSE). According to these criteria, the model corresponding to the particular choice of parameters $p = q = P_g = Q_g = P_h = Q_h = 1$ was selected, with AIC = 7.1839 and RMSE = 13.0930. Figure 1 shows the $\Delta y_t$ data together with their smoothed estimates and the $\sqrt{g_t}$ estimates. The graph including the series $y_t$ with their smoothed estimates is not shown, since the original dataset and the delivered estimates are not distinguishable. Moreover, the calibrated model has been statistically verified in a proper way. The prediction residuals play the key role in this context. Particularly, the prediction residuals of the introduced model are given as $v_t = \Delta y_t - \hat{\mu}_{t|t-1} - \hat{\tilde{y}}_{t|t-1}$, where $\hat{\mu}_{t|t-1}$ and $\hat{\tilde{y}}_{t|t-1}$ denote the best linear predictions of the unobservable states $\mu_t$ and $\tilde{y}_t$ at time $t$ based on the information accumulated by the observed time series up to and including time $t-1$. According to the theoretical background of state space methods, the prediction residuals (after standardization) should follow a Gaussian white noise with zero mean and unit variance. Therefore, the adequacy of the finally selected model has been investigated by means of common statistical techniques; see Commandeur and Koopman (2007). In summary, the model fulfilled all assumptions (except for normality).



FIGURE 1. Left: $\Delta y_t$ (grey) with their smoothed estimates (black). Right: $\sqrt{g_t}$ estimates.

Acknowledgments: This research was supported by the grant SVV-2014-260105.

References

Commandeur, J.J.F. and Koopman, S.J. (2007). An Introduction to State Space Time Series Analysis. Oxford: Oxford University Press.

Durbin, J. and Koopman, S.J. (2001). Time Series Analysis by State Space Methods. New York: Springer.

Tsay, R.S. (2005). Analysis of Financial Time Series. Hoboken, NJ: Wiley-Interscience.

Tusell, F. (2011). Kalman filtering in R. Journal of Statistical Software, 39, 1 – 27.

Van den Bossche, F.A.M. (2011). Fitting state space models with EViews. Journal of Statistical Software, 41, 1 – 16.


A misspecification simulation study in Poisson mixed model

Freddy Hernandez1, Olga Usuga2, Viviana Giampaoli3

1 Universidad Pontificia Bolivariana, Colombia
2 Universidad de Antioquia, Colombia
3 Universidade de Sao Paulo, Brazil

E-mail for correspondence: [email protected]

Abstract: We studied the impact of misspecifying the random-effects distribution on the estimates of the parameter vector in a GLMM with Poisson response. We varied the true distribution of the random effects (normal, uniform, exponential and log-normal) while holding the assumed distribution (normal) constant. We found that the relative distance was smaller when the random-effects distribution is symmetric (normal and uniform).

Keywords: Misspecification; Poisson mixed model.

1 Introduction

Generalized linear mixed models (GLMMs) were proposed by Breslow & Clayton (1993) and have been used in many applications related to clustered and longitudinal data. Let $y_{ij}$ denote the $j$th response ($j = 1, 2, \dots, n_i$) within cluster $i$ ($i = 1, 2, \dots, m$). A GLMM with random intercept assumes that, conditionally on the random effects $b_i$, the $y_{ij}$ are independent, as follows:

$$
\begin{aligned}
y_{ij} \mid b_i &\sim \text{independent in } F_y,\\
g(\mu_{ij}) &= X_{ij}\beta + b_i, \qquad (1)\\
b_i &\overset{\text{ind}}{\sim} N(0, \sigma^2),
\end{aligned}
$$

where $F_y$ corresponds to some distribution in the exponential family, $g(\cdot)$ is a known link function, $X_{ij}$ is a vector of covariates for the $j$th observation in cluster $i$, and $\beta$ is the parameter vector for the fixed effects. From (1) we note that GLMMs rely on several assumptions, one of them being the normality of the random-effects distribution. Unfortunately, because the random effects are not observed, it is difficult to check this assumption, and




if the true distribution of the random effects is not normal, this may affect the inference in GLMMs. Recently, authors such as Neuhaus et al. (2011, 2013), McCulloch & Neuhaus (2011), Alonso et al. (2010) and Litiere et al. (2011) have studied the impact of a misspecified random-effects distribution on aspects such as the behaviour of the parameter estimators and the performance of the inferential procedures. Herein, we specifically analyse the impact on the estimates of the parameter vector using the relative distance.

2 Simulation study

The model considered in this study can be summarized as follows:

$$
\begin{aligned}
y_{ij} \mid b_i &\overset{\text{ind}}{\sim} \text{Poisson}(\lambda_{ij}),\\
\log(\lambda_{ij}) &= \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2ij} + b_i, \qquad (2)
\end{aligned}
$$

where the random intercept $b_i$ has zero mean and variance $\sigma^2$; thus the parameter vector for this model is $\theta = (\beta_0, \beta_1, \beta_2, \sigma^2)^T$. In the simulations the fixed effects were held constant at $\beta_0 = -2$, $\beta_1 = 2$ and $\beta_2 = 1$. The between-cluster covariate $x_1$ was generated as N(0,1), whereas $x_2$ represents a within-cluster covariate and was generated as U(0,1). The number of clusters was fixed at $m = 200$, with $n_i = 2, 4, 6, 10, 20$ and $40$. The random-effect variance took the values 1 and 2.

FIGURE 1. Random-effects distributions considered (normal, uniform, exponential, log-normal), each with zero mean and unit variance.

In the statistical literature we find two different approaches to evaluate the impact of misspecification. In the first, one varies the assumed random-effects distribution while holding the true distribution constant; in the other, one varies the true random-effects distribution while holding the assumed distribution fixed. Herein we adopted the second approach, used by



authors such as Verbeke & Lesaffre (1997), Agresti et al. (2004) and Litiere et al. (2007). For the random intercept in model (2) we assumed the normal distribution, while four true distributions were considered: normal, uniform, exponential and log-normal; the probability density function of each distribution is shown in Figure 1. The distributions were transformed such that the zero-mean condition was satisfied and the variance equalled the pre-specified values of $\sigma^2$. As in Verbeke and Lesaffre (1997), we used the relative distance $\|\hat{\theta} - \theta\|/\|\theta\|$ to quantify the impact of misspecification on the estimates; the smaller the value of this distance, the lower the impact. For each setting of $n_i$, $\sigma^2$ and true random-effects distribution, 10,000 samples were generated from the model given by (2). For each sample the estimate $\hat{\theta}$ of the parameter vector $\theta$ was obtained in R (R Core Team, 2013) using the glmer function from the lme4 package; a sketch of one replicate is given below.
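A minimal sketch of one simulation replicate follows, taking the exponential case (centred and scaled to variance $\sigma^2$) as the true random-effects distribution; the settings follow the text, while the object names are illustrative.

```r
## One replicate under model (2): the true random intercepts are exponential
## (centred, variance sigma^2), while glmer assumes normality.
library(lme4)

m <- 200; ni <- 4; sigma2 <- 1
theta <- c(-2, 2, 1, sigma2)               # (beta0, beta1, beta2, sigma^2)

x1 <- rep(rnorm(m), each = ni)             # between-cluster covariate, N(0,1)
x2 <- runif(m * ni)                        # within-cluster covariate, U(0,1)
b  <- (rexp(m) - 1) * sqrt(sigma2)         # centred, variance sigma^2
id <- rep(seq_len(m), each = ni)
y  <- rpois(m * ni, exp(-2 + 2 * x1 + 1 * x2 + b[id]))

fit <- glmer(y ~ x1 + x2 + (1 | id), family = poisson)
theta.hat <- c(fixef(fit), as.numeric(VarCorr(fit)$id))
sqrt(sum((theta.hat - theta)^2) / sum(theta^2))   # relative distance
```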

FIGURE 2. Median relative distance for $\hat{\theta}$ (left panel: $\sigma^2 = 1$; right panel: $\sigma^2 = 2$).

3 Results

The median relative distance for $\hat{\theta}$ in each setting of $n_i$, $\sigma^2$ and true random-effects distribution is shown in Figure 2. From the left panel of this figure we can see how the relative distance decreases as $n_i$ increases, and that this measure is smaller when the random-effects distribution is symmetric (normal and uniform). The difference in relative distance between symmetric and asymmetric random effects was greater for small values of $n_i$ (2, 4 and 6); this difference decreases as $n_i$ increases. The right panel of Figure 2 shows the median relative distance when $\sigma^2 = 2$; we see a pattern similar to that found when $\sigma^2 = 1$, but the relative distances are greater than in the case



of unit variance of the random effects. Thus, this study reveals the importance of a correct specification of the distribution of the random effects in the Poisson mixed model, even when the total sample size is relatively large.

References

Agresti, A., Caffo, B. and Ohman-Strickland, P. (2004). Examples in which misspecification of a random effects distribution reduces efficiency, and possible remedies. Computational Statistics & Data Analysis, 47, 639 – 653.

Alonso, A.A., Litiere, S. and Molenberghs, G. (2010). Testing for misspecification in generalized linear mixed models. Biostatistics, 11(4), 771 – 786.

Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9 – 25.

Litiere, S., Alonso, A. and Molenberghs, G. (2007). Type I and Type II error under random-effects misspecification in generalized linear mixed models. Biometrics, 63, 1038 – 1044.

Litiere, S., Alonso, A.A. and Molenberghs, G. (2011). Rejoinder to "A Note on Type II Error Under Random Effects Misspecification in Generalized Linear Mixed Models". Biometrics, 67(4), 656 – 660.

McCulloch, C.E. and Neuhaus, J.M. (2011). Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics, 67(1), 270 – 279.

Neuhaus, J.M., McCulloch, C.E. and Boylan, R. (2011). A note on Type II error under random effects misspecification in generalized linear mixed models. Biometrics, 67(2), 654 – 660.

Neuhaus, J.M., McCulloch, C.E. and Boylan, R. (2013). Estimation of covariate effects in generalized linear mixed models with a misspecified distribution of random intercepts and slopes. Statistics in Medicine, 32(14), 2419 – 2429.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/

Verbeke, G. and Lesaffre, E. (1997). The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics & Data Analysis, 23, 541 – 556.


Identification of Health Risks of Hand, Foot and Mouth Disease in China Using the Geographical Detector Technique

Jixia Huang1,2, Jinfeng Wang1,2

1 LREIS, Institute of Geographic Science and Natural Resource Research, Chinese Academy of Sciences, Beijing, China
2 Key Laboratory of Surveillance and Early Warning on Infectious Disease, Chinese Center for Disease Control and Prevention, Beijing, China

E-mail for correspondence: [email protected]; [email protected]

Abstract: In this research, we used the geographical detector method to estimate the relationship between hand, foot and mouth disease (HFMD) incidence and risk factors. Our study shows that child density affects the incidence of HFMD more than middle school student density does, and that tertiary industry plays a greater role in the incidence of HFMD than first industry.

Keywords: Hand, foot and mouth disease; geographical detector; risk factors; tertiary industry.

1 Introduction

Hand, foot and mouth disease (HFMD) is a common infectious disease which mainly occurs among children less than 5 years old (China CDC, 2009). The risk factors and transmission patterns of HFMD have been investigated in many previous studies (Wang et al., 2011; Ho et al., 1999; UK Communicable Disease Surveillance Centre, 1980). However, few studies consider the influence of different industrial structures on HFMD incidence. Furthermore, most previous studies analyse the impact of a single environmental factor on HFMD incidence, while studies of the interactive effect of two or more environmental factors on HFMD incidence are lacking. In this research, we used the geographical detector method (Wang et al., 2010) to estimate the relationship between HFMD incidence and environmental risk factors. By means of the concept of the power of determinant (PD), four detectors were applied to estimate the impact of environmental risk factors on HFMD incidence.




FIGURE 1. The spatial distribution of HFMD incidence among children 0–9 years old in China in May 2008, after adjustment by a hierarchical Bayesian model.

2 Data and method

2.1 HFMD data

The HFMD data used in this research were provided by the Chinese Center for Disease Control and Prevention. The data comprise the count of children aged 0–9 years suffering from HFMD in each county during May 2008 in China (not including Taiwan province). The children's incidence is displayed in Figure 1.

2.2 Determinants of HFMD and their proxies

The proxy variables for the potential factors that affect HFMD incidence are displayed in Figure 2. Since environmental factors and economic factors both have an impact on HFMD incidence (Wang et al., 2011; Cao et al., 2010; Hu et al., 2012), we considered both in this research. The meteorological data were collected from the China Meteorological Data Sharing Service System. In total, 727 meteorological stations are distributed over China, providing three factors: monthly mean temperature, monthly precipitation and relative humidity. We chose thin plate smoothing splines to estimate each county's climate factors from the 727 meteorological observation sites (Hu et al., 2012; Hutchinson, 1995). Population and economic data were collected from the China Statistical Yearbook for social and economic data of Chinese counties, the China Statistical Yearbook for regional economy, and the China Statistical Yearbook for cities, including: population of children aged 0–9 years, population of pupils, population of middle school students,



FIGURE 2. Determinants of HFMD and their proxies.

GDP, first industry, second industry and tertiary industry, for each of the 2,307 counties.

2.3 Geographical detector

As a novel spatial analysis method, the geographical detector method does not require any assumptions or restrictions with respect to explanatory and response variables (Hu et al., 2011). By means of the concept of the power of determinant (PD), four geographical detectors were applied to estimate the effect of environmental risk factors on the incidence of HFMD. A risk detector was used to identify the geographical areas under environmental health risk. A factor detector was used to assess which determinants are responsible for the incidence of HFMD. An ecological detector determined whether there is a significant difference between the effects of different environmental factors on HFMD. An interactive detector was used to analyse whether multiple determinants affect the spread of HFMD independently or jointly (Wang et al., 2010). All four detectors can easily be implemented using the software Excel-GeogDetector.
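For concreteness, a small R sketch of the factor detector's power of determinant is given below, following the stratified-variance definition of Wang et al. (2010) as we read it; the data frame and variable names are assumptions.

```r
## Factor detector: power of determinant (PD) of a categorical risk factor,
## PD = 1 - sum_h(N_h * var_h) / (N * var), where h indexes the strata.
## Sample variances are used here; the original definition may use
## population variances (a minor difference).
pd_factor <- function(incidence, stratum) {
  stratum <- as.factor(stratum)
  n_h   <- tapply(incidence, stratum, length)
  var_h <- tapply(incidence, stratum, var)
  1 - sum(n_h * var_h) / (length(incidence) * var(incidence))
}

## Assumed usage with the temperature strata of Table 1:
## pd_factor(hfmd$incidence,
##           cut(hfmd$avg_temp, c(-Inf, 7, 13, 16, 18, 22, 24, Inf)))
```

The interactive detector's PD for a pair of factors can be sketched in the same way by applying pd_factor to their cross-classification, e.g. interaction(f1, f2).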

3 Results

We used the risk detector to analyse the effect of different variables on the incidence of HFMD. Table 1 shows the effect of monthly mean temperature on the incidence of HFMD: when the temperature was high, the incidence of HFMD was also high. This finding indicates that there is a correlation between monthly mean temperature and the incidence of HFMD. Similar analyses were undertaken for the correlation between the other variables and the incidence of HFMD using the risk detector. We used the factor detector to determine the effect of the risk factors on the incidence of HFMD, ranked by PD value as follows: density of children aged 0–9 years (0.25) > tertiary industry (0.23) > GDP (0.20) > middle school student density (0.13) > relative humidity (0.12) > average temperature (0.11) > first industry (0.05).



Using the factor detector and the ecological detector, we found that child density, tertiary industry and GDP had a strong effect on the incidence of HFMD. The interactive detector showed some interesting phenomena (Table 2). The interactive PD value of tertiary industry and child population density was 0.42, that of GDP and tertiary industry was 0.34, that of child population density and GDP was 0.35, and that of average temperature and relative humidity was 0.28. All of these interactive PD values are higher than the PD value of any single risk factor. The combinations of the above-mentioned risk factors can effectively explain the spatial variability of the incidence of HFMD in China.

4 Conclusions

This study analysed the distribution of the incidence of HFMD in mainland China in May 2008. Our data show that HFMD is a serious problem in China. Traditional models all show that population density greatly affects HFMD. Our study shows that child density affects the incidence of HFMD more than middle school student density does. We also found that tertiary industry plays a greater role in the incidence of HFMD than first industry. Our results should be useful for providing instructions and recommendations to the government on epidemic risk responses to HFMD. When HFMD is spreading rapidly, isolation measures need to be adopted to avoid cross-infection of HFMD among children.

TABLE 1. Average incidence of HFMD according to the average temperature stratum (incidence unit: 1/10^5; average temperature in °C).

Stratum      <7    7-13   13-16   16-18   18-22   22-24   >24
Incidence    2.9   18.2   18.6    53.5    51.6    62.8    122.6

TABLE 2. PD values for interactions between pairs of factors on the incidence of HFMD. (AT: average temperature; RH: relative humidity; GDP: gross domestic product; FI: first industry; TI: tertiary industry; PD0-9: population density aged 0–9; MSD: middle school density.)

         AT      RH      GDP     FI      TI      PD0-9
RH       0.28
GDP      0.31    0.28
FI       0.17    0.16    0.24
TI       0.31    0.32    0.34    0.26
PD0-9    0.31    0.33    0.35    0.30    0.42
MSD      0.23    0.21    0.29    0.19    0.26    0.27



References

Cao, Z.D., Zeng, D.J., Wang, Q.Y., Zheng, X.L. and Wang, F.Y. (2010). An epidemiological analysis of the Beijing 2008 Hand-Foot-Mouth epidemic. Chinese Science Bulletin, 55(12), 1142 – 1149.

China CDC (2009). http://www.cdcp.org.cn/editor/uploadfile/20090429174659788.ppt.

Ho, M., Chen, E., Hsu, K., Twu, S., Chen, K., et al. (1999). An epidemic of enterovirus 71 infection in Taiwan. New England Journal of Medicine, 341, 929 – 935.

Hu, Y., Wang, J., et al. (2011). Geographical detector-based risk assessment of the under-five mortality in the 2008 Wenchuan earthquake, China. PLoS ONE, 6, e21427.

Wang, J.F., Li, X.H., Christakos, G., Liao, Y.L., et al. (2010). Geographical detectors-based health risk assessment and its application in the neural tube defects study of the Heshun Region, China. International Journal of Geographical Information Science, 24(1), 107 – 127.

Wang, J., Guo, Y., Christakos, G., Yang, W., Liao, Y., et al. (2011). Hand, foot and mouth disease: spatiotemporal transmission and climate. International Journal of Health Geographics, 10, 25.

UK Communicable Disease Surveillance Centre (1980). Hand, foot and mouth disease. Communicable Disease Report, 34, 3 – 4.


Influence Analysis on Crossover Design Experiment in Bioequivalence Studies

Yufen Huang1, Bo-Shiang Ke1

1 Department of Mathematics, National Chung Cheng University, 160, San-Hsing, Min-Hsiung, Chia-Yi 621, Taiwan

E-mail for correspondence: [email protected]

Abstract: Crossover designs are commonly used in bioequivalence studies. However, the results can be affected by outlying observations, which may lead to a wrong decision on bioequivalence. It is therefore essential to investigate the influence of aberrant observations. Chow and Tse (1990) discussed this issue by considering methods based on the likelihood distance and the estimates distance. Perturbation theory provides a useful tool for sensitivity analysis of statistical models. Hence, in this paper, we develop influence functions via the perturbation scheme proposed by Hampel (1974) as an alternative approach to influence analysis for a crossover design experiment. Moreover, comparisons between the proposed approach and the method of Chow and Tse are investigated. Two real data examples are provided to illustrate the results of these approaches. Our proposed influence functions show excellent performance in the identification of outlying/influential observations and are suitable for use with the small-sample crossover designs commonly used in bioequivalence studies.

Keywords: bioequivalence; influence functions; perturbation.

1 An Example

1.1 Data from Chow and Tse (1990)

In order to illustrate the results of the influence functions obtained in the previous section, we consider the area under the plasma concentration-time curve (AUC) data from three formulations of a drug (one reference formulation and two test formulations) in a bioavailability study published by Chow and Tse (1990, p. 552-553). In this study, a standard 3×3 crossover experiment was conducted with 12 subjects to compare the bioavailability of three formulations of a drug. The test formulations and the reference formulation are denoted by A, B and R, respectively. Table 1 gives the AUCs as well as the individual subject ratios (with respect to the reference) for each formulation. The plot of the two ratios (each test formulation with respect to the reference) shown in Figure 1 indicates that subject 4 is a possible outlier, since this subject's response is far from those of the other subjects.




TABLE 1. AUCs for test formulations A and B with reference formulation R.

Subject      R        A        B      A/R    B/R
1          863.8    944.2    558.0    1.09   0.65
2          672.0    675.8    619.7    1.01   0.92
3          500.0    524.6    507.3    1.05   1.01
4         1662.0   1006.3    442.4    0.61   0.27
5          344.5    254.6    304.2    0.74   0.88
6          435.0    383.9    316.8    0.88   0.73
7          915.5    653.2    690.0    0.71   0.75
8          772.0    612.6    728.2    0.79   0.94
9          763.5    529.2    701.5    0.69   0.92
10         221.9    177.9    216.1    0.80   0.97
11         735.2    794.9    534.5    1.08   0.73
12        1087.0    976.1    813.6    0.90   0.75

FIGURE 1. The plot of the two ratios (each test formulation with respect to the reference).


TABLE 2. Summary of likelihood distance (L.D.) and estimates distance (E.D.).

Deleted subject    L.D.     P       E.D.     P
1                  0.05    0.997    0.66    0.883
2                  0.13    0.988    1.73    0.630
3                  0.14    0.987    1.80    0.615
4                 32.02    0.000   80.64    0.000
5                  0.37    0.946    4.19    0.242
6                  0.20    0.978    2.42    0.490
7                  0.07    0.995    0.93    0.818
8                  0.11    0.991    1.43    0.699
9                  0.08    0.994    1.09    0.779
10                 0.96    0.811    8.85    0.031
11                 0.07    0.995    0.96    0.810
12                 0.28    0.964    3.05    0.384

First, we present the results of the LD and ED tests in Table 2. Both methods indicate that subject 4 is an outlying observation at the significance level α = 0.01. In addition, the results of ED suggest that the 10th subject is also an influential subject at α = 0.05. Next we plot the influence functions of µ, σ²ε and σ²τ versus the subject index as diagnostic plots. Table 3 summarizes the influential cases detected via influence functions with lower-upper inner fences. The results in Table 3 show that the 4th subject is detected as influential



TABLE 3. Influential cases of the single-perturbation influence functions, detected by the lower and upper inner fences (LIF, UIF).

Influence function    Perturbation of µ    Perturbation of σ²ε    Perturbation of σ²τ
EIF, DEIF, SIF        4-R                  1-B, 4-R, 4-B          4-R, 4-B

by all influence functions. This agrees with the results of Chow and Tse (1990), and is also consistent with the conclusion drawn from the plot in Figure 1. In addition, the influence function of σ²ε indicates that the 1st subject is a potential outlier/influential point. This seems more sensible than the 10th subject being detected by the ED test as a potentially outlying observation (see Figure 1).

References

Chow, S.C. and Tse, S.K. (1990). Outlier detection in bioavailability/bioequivalence studies. Statistics in Medicine, 9, 549 – 558.

Clayton, D. and Leslie, A. (1981). The bioavailability of erythromycin stearate versus enteric-coated erythromycin base when taken immediately before and after food. Journal of International Medical Research, 9, 470 – 477.

Cook, R.D. (1986). Assessment of local influence. Journal of the Royal Statistical Society, Series B, 48, 133 – 169.

Hampel, F.R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69, 383 – 393.


Symmetric semiparametric additive models

German Ibacache-Pulgar1

1 Institute of Statistics, University of Valparaiso, Chile

E-mail for correspondence: [email protected]

Abstract: In this work we study semiparametric additive models with symmetric errors in order to permit distributions with heavier or lighter tails than the normal one. This class of models includes all symmetric continuous distributions, such as the Student-t, among others. Estimation is performed by maximum penalized likelihood using natural cubic smoothing splines. Finally, a real data set is analysed under semiparametric additive models with heavy-tailed errors.

Keywords: Back-fitting algorithm; Smoothing splines; Nonparametric models.

1 Motivating example

As a motivating example we consider the house prices data set (see Belsley et al., 1980). The aim of the study is to assess the association of house prices with the air quality of the neighbourhood. The outcome variable LMV (logarithm of the median house price in USD 1000) is related to 14 explanatory variables, 6 of which are defined from census tracts while the remaining variables are defined for clusters. Altogether there are 506 observations. We work with three explanatory variables: LSTAT (% lower status of the population), DIST (weighted distances to five Boston employment centres) and TAX (full-value property-tax rate per USD 10000).

We see in Figures 1b and 1c that the relationships between LMV and the explanatory variables LSTAT and DIST appear nonlinear, whereas the relationship between LMV and TAX seems to be linear (Figure 1a). These tendencies suggest a semiparametric model relating LMV to the explanatory variables.

2 Semiparametric additive model

The semiparametric additive model assumes the following relationship:

$$
y_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + f_1(t_{1i}) + \dots + f_s(t_{si}) + \varepsilon_i \quad (i = 1, \dots, n),
$$




FIGURE 1. Scatter plots: LMV versus TAX (a), LMV versus LSTAT (b) and LMV versus DIST (c).

where $y_i$ denotes the response value, $\boldsymbol{x}_i$ is a $(p \times 1)$ vector of explanatory variable values, $\boldsymbol{\beta}$ is the $(p \times 1)$ parameter vector, $f_k$ is a smooth function and $\varepsilon_i$ is a random error. Alternatively, we can rewrite the model above as

$$
\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{N}_1\boldsymbol{f}_1 + \dots + \boldsymbol{N}_s\boldsymbol{f}_s + \boldsymbol{\varepsilon},
$$

where $\boldsymbol{y}$ is an $(n \times 1)$ vector of observed responses, $\boldsymbol{X}$ is an $(n \times p)$ design matrix with rows $\boldsymbol{x}_i^T$, $\boldsymbol{N}_k$ is an $(n \times r_k)$ incidence matrix with $(i,j)$th element equal to the indicator $I(t_{ki} = t^0_{kj})$, where $t^0_{kj}$ $(j = 1, \dots, r_k)$ denote the distinct and ordered values of $t_k$, $\boldsymbol{f}_k = (\psi_{k1}, \dots, \psi_{kr_k})^T$ is an $(r_k \times 1)$ vector of parameters such that $\psi_{kj} = f_k(t^0_{kj})$, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)^T$.

2.1 Penalized log-likelihood function

We assume that the $\varepsilon_i$ $(i = 1, \dots, n)$ are independent random variables such that $\varepsilon_i$ follows a symmetric distribution, that is, $\varepsilon_i \sim S(0, \phi, g)$. Then $y_i \sim S(\mu_i, \phi, g)$, whose density function is given by

$$
h_y(y_i) = \frac{1}{\sqrt{\phi}}\, g(\delta_i), \quad y_i \in \mathbb{R},
$$

where $\delta_i = \phi^{-1}\varepsilon_i^2$, $\varepsilon_i = y_i - \mu_i$, $\mu_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \sum_{k=1}^{s} \boldsymbol{n}_{ki}^T\boldsymbol{f}_k$, $\boldsymbol{n}_{ki}^T$ denotes the $i$th row of the incidence matrix $\boldsymbol{N}_k$, and $g$ is the density generator function. The log-likelihood function associated with the model is

$$
L(\boldsymbol{\theta}) = -\frac{n}{2}\log\phi + \sum_{i=1}^{n} \log g(\delta_i),
$$

where $\boldsymbol{\theta} = (\boldsymbol{f}_0^T, \boldsymbol{f}_1^T, \dots, \boldsymbol{f}_s^T, \phi)^T \in \Theta \subseteq \mathbb{R}^{p^*}$, with $\boldsymbol{f}_0 = \boldsymbol{\beta}$, $\boldsymbol{N}_0 = \boldsymbol{X}$, $p^* = p + r + 1$ and $r = \sum_{k=1}^{s} r_k$. According to Green and Silverman (1994), the penalized log-likelihood function is given by

$$
L_p(\boldsymbol{\theta}, \boldsymbol{\alpha}) = L(\boldsymbol{\theta}) - \sum_{k=1}^{s} \frac{\alpha_k}{2}\, \boldsymbol{f}_k^T \boldsymbol{K}_k \boldsymbol{f}_k,
$$

where $\boldsymbol{K}_k$ is an $(r_k \times r_k)$ non-negative definite matrix and $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_s)^T$ denotes an $(s \times 1)$ vector of smoothing parameters.



2.2 Estimation of θ

Consider $\boldsymbol{\alpha}$ and $\boldsymbol{D}_v$ fixed, and let $\boldsymbol{\mu} = \boldsymbol{X}\boldsymbol{\beta} + \sum_{k=1}^{s} \boldsymbol{N}_k\boldsymbol{f}_k$. The maximum penalized likelihood estimate (MPLE) of $\boldsymbol{\theta}$ is the solution to the equation system

\begin{align*}
\boldsymbol{N}_0^T\boldsymbol{D}_v\boldsymbol{N}_0\,\boldsymbol{f}_0 &= \boldsymbol{N}_0^T\boldsymbol{D}_v\Big(\boldsymbol{y} - \sum_{k=1}^{s} \boldsymbol{N}_k\boldsymbol{f}_k\Big),\\
\big(\boldsymbol{N}_k^T\boldsymbol{D}_v\boldsymbol{N}_k + \alpha_k\phi\boldsymbol{K}_k\big)\boldsymbol{f}_k &= \boldsymbol{N}_k^T\boldsymbol{D}_v\Big(\boldsymbol{y} - \sum_{l=0,\,l\neq k}^{s} \boldsymbol{N}_l\boldsymbol{f}_l\Big), \quad (k = 1, \dots, s),
\end{align*}

which leads to the following back-fitting (Gauss-Seidel) iterations:

$$
\boldsymbol{f}_\ell = \boldsymbol{S}_\ell\Big(\boldsymbol{y} - \sum_{l=0,\,l\neq\ell}^{s} \boldsymbol{N}_l\boldsymbol{f}_l\Big), \quad (\ell = 0, 1, \dots, s),
$$

where $\boldsymbol{S}_0 = (\boldsymbol{N}_0^T\boldsymbol{D}_v\boldsymbol{N}_0)^{-1}\boldsymbol{N}_0^T\boldsymbol{D}_v$ and $\boldsymbol{S}_k = (\boldsymbol{N}_k^T\boldsymbol{D}_v\boldsymbol{N}_k + \alpha_k\phi\boldsymbol{K}_k)^{-1}\boldsymbol{N}_k^T\boldsymbol{D}_v$. In addition,

$$
\hat{\phi} = \frac{1}{n}\,\big(\boldsymbol{y} - \hat{\boldsymbol{\mu}}\big)^T\boldsymbol{D}_v\big(\boldsymbol{y} - \hat{\boldsymbol{\mu}}\big),
$$

where $\boldsymbol{D}_v = \mathrm{diag}_{1\le i\le n}(v_i)$, $v_i = -2\zeta_i$, $\zeta_i = \frac{d\log g(\delta_i)}{d\delta_i}$ and $\hat{\boldsymbol{\varepsilon}} = \boldsymbol{y} - \hat{\boldsymbol{\mu}}$.
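A minimal R sketch of these back-fitting iterations for the normal special case ($\boldsymbol{D}_v = \boldsymbol{I}$, $s = 2$) is given below, using smooth.spline as the smoother playing the role of $\boldsymbol{S}_k$; the interface and the fixed df value are illustrative assumptions. For a heavy-tailed generator $g$, the weights $\boldsymbol{D}_v$ would additionally be updated inside each sweep.

```r
## Back-fitting (Gauss-Seidel) sketch for y = X beta + f1(t1) + f2(t2) + e,
## normal errors (D_v = I); X is assumed to include an intercept column.
backfit <- function(y, X, t1, t2, df = 8, tol = 1e-8, maxit = 100) {
  f1 <- f2 <- numeric(length(y))
  beta <- numeric(ncol(X))
  for (it in seq_len(maxit)) {
    old <- c(beta, f1, f2)
    beta <- coef(lm.fit(X, y - f1 - f2))       # parametric step, S_0
    eta  <- drop(X %*% beta)
    f1 <- predict(smooth.spline(t1, y - eta - f2, df = df), t1)$y
    f1 <- f1 - mean(f1)                         # centring for identifiability
    f2 <- predict(smooth.spline(t2, y - eta - f1, df = df), t2)$y
    f2 <- f2 - mean(f2)
    if (sum((c(beta, f1, f2) - old)^2) < tol) break
  }
  list(beta = beta, f1 = f1, f2 = f2)
}
```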

3 Application

We assume the following semiparametric additive model:

$$
y_i = \beta_1 + x_i\beta_2 + f_1(t_{1i}) + f_2(t_{2i}) + \varepsilon_i \quad (i = 1, \dots, 506),
$$

where $y_i$ denotes the logarithm of the median house price in USD 1000, $x_i$ denotes the full-value property-tax rate per USD 10000, $t_{1i}$ denotes the value of % lower status of the population and $t_{2i}$ denotes the weighted distance to five Boston employment centres for the $i$th experimental unit. We compare the fits based on normal and Student-t errors; see Table 1. In each case the smoothing parameters were chosen so that the degrees of freedom are approximately eight. Comparing the results in Tables 1-2, we notice a similarity between the regression coefficient estimates under the normal and Student-t models, but the standard error for $\hat{\beta}_1$ is smaller under the Student-t model. Additionally, the AIC value under the Student-t model is smaller than the one under the normal model, indicating the superiority of the heavy-tailed model.



TABLE 1. MPLEs, estimated standard errors and AIC values under the normal and Student-t (ν = 4) models fitted to the house prices data.

            Normal                 Student-t
        Estimate     SE        Estimate     SE
β1       3.2824     0.0341      3.2683     0.0295
β2      -0.0006     0.0001     -0.0006     0.0001
φ        0.0415     0.0026      0.0222     0.0018

AIC     -121.648               -166.414

TABLE 2. Fit summary for the smoothing components under the normal and Student-t (ν = 4) models fitted to the house prices data.

            Normal                         Student-t
        Reference df   Estimated df      Estimated df
f1(t1)     7.956          7.6376            10.938
f2(t2)     8.094          7.6671            11.024

FIGURE 2. The additive fits for LSTAT (left) and DIST (right) under the Student-t model.

Acknowledgments: This work was supported by Fondecyt, Chile, Project 11130704.

References

Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley.

Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman & Hall.

Ibacache-Pulgar, G. and Paula, G.A. (2011). Local influence for Student-t partially linear models. Computational Statistics and Data Analysis, 55, 1462 – 1478.


Application of classification tree analysis: algorithm for the diagnosis of sarcopenia in Chilean older people

Lydia Lera, Cecilia Albala, Barbara Leyton, Hugo Sanchez, Barbara Angel, Maria J. Hormazabal

1 INTA University of Chile, Santiago de Chile

E-mail for correspondence: [email protected]

Abstract: We develop an algorithm for the diagnosis of sarcopenia in Chilean older people by means of classification tree analysis, using values of appendicular skeletal muscle mass predicted by a regression model, with high sensitivity and specificity. The algorithm can be a useful everyday-practice tool for sarcopenia classification.

Keywords: Sarcopenia; ROC curve; sensitivity and specificity; predictive value of tests; classification tree analysis.

1 Introduction

Sarcopenia is a syndrome characterized by a progressive and generalized loss of skeletal muscle mass and strength, with risk of disability, poor quality of life and mortality (Serra Rexah, 2006). Although age-related sarcopenia was described in the 1980s, there is not yet a clinical definition with universal consensus. However, regardless of the algorithm or method used for the diagnosis, definitions of sarcopenia include decreased muscle mass and function, requiring the estimation of lean mass and specifically of appendicular skeletal muscle mass (ASM), which is the sum of the lean mass of the legs and arms. ASM can be determined by dual X-ray absorptiometry (DXA), and most clinical studies of sarcopenia use DXA as the "gold standard" measurement of muscle mass. In population studies and in primary care, the diagnosis of sarcopenia by this method is difficult due to its high cost. The objective of this study was to develop a methodology for the diagnosis of sarcopenia in Chilean older adults using values of SMI predicted by a prediction model (Lera et al., 2014) and classification tree analysis (CTA) (Breiman et al., 1984), with high sensitivity and specificity, based on the European consensus (Cruz-Jentoft et al., 2010).




2 Subjects and Methods

This was a cross-sectional study of 607 community-dwelling subjects aged 60-98 years (67.7% women), beneficiaries of the public health system, living in Chile (Albala et al., 2011; Lera et al., 2014). All subjects underwent anthropometric measurements of weight, height, calf circumference, knee height, and waist and hip circumferences. Handgrip strength and a physical performance test (3-m gait speed) were evaluated. ASM was measured by DXA and estimated by the multiple linear regression model
ASM (kg) = 0.107 (weight in kg) + 0.251 (knee height in cm) + 0.197 (calf circumference in cm) + 0.047 (dynamometry in kg) − 0.034 (hip circumference in cm) + 3.417 (sex: man) − 0.020 (age in years) − 7.646,  (1)
with R² = 0.89 and standard error of the estimate 1.346 kg (Lera et al., 2014). Sarcopenia is characterized by low muscle mass (defined as below the 20th percentile of the skeletal muscle mass index (SMI)), low muscle strength (handgrip < 15 kg for women and < 27 kg for men) and low physical performance (gait speed < 0.8 m/s). The diagnosis of sarcopenia is based on the confirmation of criterion 1 plus criterion 2 or 3 (Cruz-Jentoft et al., 2010). Specific cut-off values of SMI (SMI = ASM/height², in kg/m²) were obtained for the Chilean population, corresponding to 7.19 kg/m² for men and 5.77 kg/m² for women. The study was approved by the INTA ethics committee and all subjects signed informed consent.

2.1 Statistical Methods

Descriptive statistics, receiver operating characteristic (ROC) analysis and classification tree analysis (CTA) were carried out. We calculated cut-off points for the predicted SMI using ROC analysis, because the prediction model slightly overestimates ASM. The area under the ROC curve was considered as a measure of the discriminatory power of the cut-off values in predicting sarcopenia. A classification tree analysis using the Gini index was performed to determine predictors of sarcopenia. CTA identifies splitting variables based on a complete search over all candidate variables; different loss matrices were considered to improve the sensitivity. Since efficient algorithms are used, CTA is able to search all possible variables as binary splits. The recursive binary scheme implies that each group of persons, represented by a node in the decision tree, can only be split into two groups. The specificity, sensitivity, and positive and negative predictive values were calculated using dichotomized scores.

3 Results

We obtained cut-off values of the SMI predicted by model (1) applying ROC analysis (7.45 in men and 5.88 in women), using the DXA measurements as gold standard. The areas under the ROC curves were 0.86 in men and 0.89 in women, which is considered good accuracy (Table 1). Figure 1 shows the classification tree which resulted from applying CTA to the sample. The tree consists of a root node (node 1) containing the whole sample. This node was split on the value of gait speed: subjects with gait speed greater than 0.8 m/s were placed in node 2 and all other subjects in node 3. The subjects in node 2 were further split on handgrip strength. Subjects with gait speed > 0.8 m/s and handgrip > 15 kg (women) or > 27 kg (men) were classified into node 5 and predicted to be normal. The other subjects, with lower handgrip (node 4), were further split on the estimated SMI. Subjects with SMI less than or equal to 5.88 (women) or 7.45 (men) were put in terminal node 7 and assigned as sarcopenic, while those with a greater SMI were placed in terminal node 6 and assigned as normal. The CTA thus identified subgroups of sarcopenic and non-sarcopenic people. The CTA algorithm for the classification of sarcopenia corresponding to the best loss matrix had a sensitivity of 81.6%, a specificity of 87.8%, a negative predictive value of 94.3%, and an overall error rate of 0.26. The most significant variables were SMI (100%) and gait speed (78.6%).
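For illustration, the resulting screening rule can be written directly in R. The prediction equation (1) and the cut-offs quoted above are taken from the text, while the function names and interface are hypothetical.

```r
## Predicted ASM (kg) from the anthropometric regression model (1);
## male = 1 for men, 0 for women.
predict_asm <- function(weight, knee_height, calf_circ, grip, hip_circ,
                        male, age) {
  0.107 * weight + 0.251 * knee_height + 0.197 * calf_circ +
    0.047 * grip - 0.034 * hip_circ + 3.417 * male - 0.020 * age - 7.646
}

## Screening rule implied by the classification tree: sarcopenia is flagged
## when SMI is at or below the gender-specific ROC cut-off AND either gait
## speed or handgrip is low. Argument names are illustrative.
classify_sarcopenia <- function(weight, knee_height, calf_circ, grip,
                                hip_circ, male, age, height_m, gait_speed) {
  smi <- predict_asm(weight, knee_height, calf_circ, grip,
                     hip_circ, male, age) / height_m^2
  low_grip <- ifelse(male == 1, grip < 27, grip < 15)
  low_smi  <- ifelse(male == 1, smi <= 7.45, smi <= 5.88)
  ifelse((gait_speed <= 0.8 | low_grip) & low_smi, "sarcopenia", "normal")
}
```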

TABLE 1. Cut-off values of SMI by gender, with sensitivity and specificity.

                                      Men (n=196)         Women (n=411)
Gold standard: SMI by DXA             7.19 kg/m²          5.77 kg/m²
Cut-off of predicted SMI by ROC       7.45 kg/m²          5.88 kg/m²
Sensitivity %                         80.8                88.0
Specificity %                         79.4                72.1
Correctly classified %                80.4                84.1
Likelihood ratio +                    3.9                 3.2
Likelihood ratio −                    0.24                0.17
ROC area (95% CI)                     0.86 (0.78-0.93)    0.89 (0.85-0.93)

4 Conclusions

We obtained an algorithm for the screening of sarcopenia in older adults based on simple anthropometric variables, using values specific to the Chilean older population. The algorithm is simple to use and inexpensive, so it can be used in primary health care services and in population studies. The CTA results showed the same structure as the algorithm proposed by Cruz-Jentoft et al. (2010). Decision rules are proposed with high sensitivity and a high negative predictive value. The physical performance and muscle strength variables are crucial to discriminate sarcopenia. The results show that the selected set of descriptors can be used to predict sarcopenia in clinical practice, especially in primary health care.

Acknowledgments: Funded by Fonis SA12I2337 and Fondecyt 1080589.


FIGURE 1. Classification tree for sarcopenia in Chilean older people.

References

Albala, C., Sanchez, H., Lera, L., Angel, B. and Cea, X. (2011). Efecto sobre la salud de las desigualdades socioeconomicas en el adulto mayor. Resultados basales del estudio expectativa de vida saludable y discapacidad relacionada con la obesidad (Alexandros). Rev Med Chile, 139, 1276 – 1285.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth, Inc.

Cruz-Jentoft, A.J., Baeyens, J.P., Bauer, J.M., Boirie, Y., et al. (2010). Sarcopenia: European consensus on definition and diagnosis: report of the European Working Group on Sarcopenia in Older People. Age Ageing, 39, 412 – 423.

Lera, L., Albala, C., Angel, B., Sanchez, H., et al. (2014). Prediccion de la masa muscular apendicular esqueletica basada en mediciones antropometricas en Adultos Mayores Chilenos. Nutr. Hosp., 29.

Serra Rexah, J.A. (2006). Consecuencias clinicas de la sarcopenia. Nutr. Hosp., 21, 46 – 50.


Modelling of the distribution of the unemployment duration in the Czech Republic

Ivana Mala1

1 University of Economics in Prague, Czech Republic

E-mail for correspondence: [email protected]

Abstract: Unemployment is a serious economic and social problem in the economies of developed countries. This phenomenon (especially its rate and duration) is of great interest to experts as well as to the general public. In this contribution, three approaches to modelling the distribution of unemployment duration depending on factor explanatory variables (an AFT model, the Turnbull nonparametric estimator of the survival function, and maximum likelihood fits of a selected probability distribution to subsamples given by the factor variables) are used to analyse unemployment in the Czech Republic. Subsets given by gender and education of the unemployed were taken into account; the subsamples of men and of highly educated people show a lower unemployment rate as well as shorter unemployment spells. The presented analysis makes it possible to quantify these differences. Data from the Labour Force Sample Survey in 2010 (five consecutive quarters from I/2010 to I/2011) were used to construct and estimate the models.

1 Methods and data

There is a wide spectrum of contributions on the statistical (or econometric) modelling of unemployment duration; we refer only to Krueger et al. (2011). We denote by T the duration of unemployment in the Czech Republic in 2010 (in months). The AFT (Accelerated Failure Time) model (Lawless (2003)) will be used to model the distribution of the logarithm of T:

ln T = µ+ β′x + σε,

where µ is a location parameter of the baseline distribution, β is a vector of m regression parameters, σ denotes a scale parameter of the baseline distribution and ε is a random error with the standardized baseline distribution. The selection of the distribution of ε is crucial for the quality of the parametric model; a lognormal distribution (positively skewed, hazard function with one extreme (a maximum)) was used as the baseline distribution in this contribution. The vector β reflects the influence of the covariates on the time line. A positive regression coefficient means that the unemployment spell tends to be longer than the baseline spell (time is decelerated); a negative coefficient implies that the duration tends to be shorter (time is accelerated). For the dummy variables used in the models, the change in time is quantified by the coefficient exp(−β).

The Turnbull estimator (Lawless (2003)) provides a nonparametric estimate of the survival function. In the third approach, all data were divided into groups according to the explanatory variables and a lognormal distribution was fitted separately to T in each group. These estimated subset distributions may then be mixed into an overall distribution.

We use data from the Labour Force Sample Survey (LFS), which is performed quarterly by the Czech Statistical Office (CZSO (2014)) and provides thorough information about employment and unemployment in Czech households. The households in the LFS form a rotating panel: one fifth of the sample rotates every quarter and no household is followed for more than one year. For the modelling, data from the five consecutive quarters I/2010 to I/2011 were used. During the interviews, an unemployed person states the duration of unemployment in the intervals (in months) 0–1, 1–3, 3–6, 6–12, 12–18, 18–24, 24–48 and over 48 months; no exact durations are recorded. This means that all observed durations are censored: for the unemployed who found a job during the study, the time of unemployment is interval censored in (l, u]; for those who did not find a new job, we only know that the duration of unemployment is longer than the stated interval, so the datum is right censored at the left limit l of the interval (or interval censored in (l, ∞)). All calculations were done in R with the package survival (Therneau and Grambsch (2000), Therneau (2014)).
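As a minimal sketch (not the author's code; the data frame lfs and its columns are hypothetical stand-ins for the LFS variables), such an interval-censored lognormal AFT model can be fitted with survreg:

    library(survival)

    ## Hypothetical interval-censored durations (months): upper = NA encodes
    ## right censoring at the lower limit, as for spells still in progress.
    lfs <- data.frame(
      lower  = c(1, 3, 12, 6, 24, 1, 3, 12, 6, 18),
      upper  = c(3, 6, NA, 12, NA, 3, 6, 18, 12, 24),
      gender = factor(rep(c("man", "woman"), 5))
    )

    ## Surv(type = "interval2") handles interval and right censoring jointly.
    fit <- survreg(Surv(lower, upper, type = "interval2") ~ gender,
                   data = lfs, dist = "lognormal")
    summary(fit)
    exp(-coef(fit)["genderwoman"])   # time deceleration/acceleration factor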

2 Results

We refer to Figure 1 for an illustration of the distribution of the unemployment duration in the Czech Republic since 1993 (CZSO (2014)); the proportions of the intervals 0–3, 3–6, 6–12, 12–24 and over 24 months are shown. All of the unemployed (4,753 individuals) aged 16 years and over who found a job sooner than after two years, together with those who are still looking for a job but whose duration of unemployment is less than two years, were used in the analysis. Three AFT models were fitted to the data: one with the single explanatory variable gender (I, m = 1, baseline distribution for men), one with the single explanatory variable education (II, m = 2, baseline distribution for secondary education without baccalaureate) and one with both explanatory variables including their interaction (III, m = 5). Turnbull estimates were constructed for the whole population, for the subgroups of men and women, and for the subgroups of the unemployed with education up to secondary, secondary with baccalaureate and tertiary.


FIGURE 1. Duration of unemployment in the Czech Republic, 1993 – 2012; proportions of the duration intervals 0 – 3, 3 – 6, 6 – 12, 12 – 24 and 24+ months.

For all subgroups a lognormal distribution was fitted.

FIGURE 2. Survival functions from AFT models I and II, plotted against unemployment duration in months (subgroups: secondary, complete secondary and tertiary education; men and women).

In Figure 2 the estimated survival functions from the AFT models are shown for all of the above-mentioned subgroups (models I and II). The AFT I model enables us to quantify the difference in unemployment duration between men and women. The estimate 0.116 of the parameter β that distinguishes women from the baseline distribution for men gives the estimated value exp(−0.116) = 0.89 for the deceleration coefficient of time (µ = 2.588 and σ = 0.937). This implies an estimated median unemployment duration of 13.3 months for men and 14.9 months for women. For education we try to show the positive impact of education (model II). For the reference group of people with secondary education without baccalaureate we obtain the baseline µ = 3.279 and σ = 0.919. The estimates −0.535 of the parameter β that distinguishes the unemployed with baccalaureate, and −1.225 for those with tertiary education, from the baseline distribution give estimated values 1.71 and 3.40 for the acceleration coefficient of time. These values imply estimated medians of 26.5, 15.5 and 7.8 months, respectively. If model III is used, only the parameters for secondary education and tertiary education were found to be significant.
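The quoted medians follow directly from the lognormal AFT parameterization (median = exp(µ + β)); they can be verified in a few lines of R:

    exp(2.588)           # men:   13.3 months
    exp(2.588 + 0.116)   # women: 14.9 months
    exp(3.279)           # secondary without baccalaureate: 26.5 months
    exp(3.279 - 0.535)   # with baccalaureate:              15.5 months
    exp(3.279 - 1.225)   # tertiary:                         7.8 months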

3 Conclusions

The presented approaches (and more detailed analyses not shown here) provide detailed information about the distribution of unemployment duration and confirm the intuitive and theoretical dependencies of the analysed duration on gender and education (positive impact of education, longer unemployment spells for women). All results obtained by the different models are comparable (survival functions, hazard functions and the median as a location characteristic were used). The selected lognormal distribution is applicable; however, there might be a more suitable probability distribution (Weibull, loglogistic).

References

CZSO: Czech Statistical Office. www.czso.cz, accessed 28.01.2014.

Krueger, A.B., Mueller, A., Davis, S.J. and Sahin, A. (2011). Job Search, Emotional Well-Being, and Job Finding in a Period of Mass Unemployment: Evidence from High-Frequency Longitudinal Data [with Comments and Discussion]. Brookings Papers on Economic Activity, 1 – 81.

Lawless, J.F. (2003). Statistical Models and Methods for Lifetime Data. John Wiley & Sons.

Therneau, T.M. and Grambsch, P.M. (2000). Modeling Survival Data: Extending the Cox Model. Springer, New York.

Therneau, T.M. (2014). survival: A Package for Survival Analysis in S. R package version 2.37-7.


Unknown break-point estimation and selection between semi-parametric and segmented models

Mario Martínez-Araya1, Jianxin Pan1

1 University of Manchester, UK

E-mail for correspondence: [email protected]

Abstract: Semi-parametric models are a suitable alternative for modelling data when the trend of the observed response is not linear with respect to a covariate of interest. In this work, by defining Z = R∗∇(∇TR∗∇)−1, we have extended the relation between semi-parametric and mixed models to allow any structure in R, not only those satisfying the condition RΩ = Ω, where Ω is the roughness matrix for the spline (Nummi, Pan and Messue, 2013; Nummi et al., 2011; Nummi and Koskela, 2008; Verbyla et al., 1999). With this, we can now assume covariance structures arising from general time series models, such as AR(1), or Toeplitz banded structures, among others. We also provide a method to estimate unknown break-point parameters in a semi-parametric model. Using the relation with mixed models, we obtain a meaningful interpretation of the smooth function in the model: the fixed effects model the linear trend in segments defined by the break-point parameters, while the random effects model the deviation from that linear trend. To choose a parsimonious model (a segmented linear model), we provide a method to test the two models, including unknown break-point parameters, by testing the null hypothesis of a segmented linear model against the alternative of a semi-parametric model using an exact F-test as described in Nummi, Pan, and Messue (2013). Numerical results from real data and a comparison with a standard method are provided.

Keywords: semi-parametric; segmented; break-point estimation; mixed models; covariance structure.

1 Semi-parametric model

Given the semi-parametric model

y = Ub + Wg + ε,   (1)

with y = (y1^T, . . . , ym^T)^T the observations at times t = (t1^T, . . . , tm^T)^T, where yi = (yi1, . . . , yini)^T and ti = (ti1, . . . , tini)^T for i = 1, . . . , m, m being the number of subjects and ni = K. The times tij ∈ T = [l1, l2] and the design points τ = (τ1, . . . , τK)^T satisfy l1 < τ1 < . . . < τK < l2. Further, ε ∼ N_n(0, σ²R), where R is a covariance matrix, U is a matrix of covariates, b is a vector of fixed parametric effects, and W is an n × K incidence matrix with elements wlr = 1 if tl = τr and wlr = 0 otherwise, for l = 1, . . . , n and r = 1, . . . , K; g is a smooth, twice differentiable function. The penalized log-likelihood ℓp can be found in Green and Silverman (1994) and Nummi and Koskela (2008). Ω = ∇∆^{-1}∇^T is the roughness matrix (Nummi et al., 2011). Let X = [1, τ], Z = R∗∇(∇^T R∗∇)^{-1}, and ξ = y − Ub. Then Wg = X∗β + Z∗u is the term that maximizes ℓp, where X∗ = WX, Z∗ = WZ, and

    β̂ = (X∗^T H X∗)^{-1} X∗^T H ξ   and   û = (α∆^{-1} + Z∗^T H Z∗)^{-1} Z∗^T H ξ

are the BLUE and the BLUP for β and u, respectively, in the mixed model ξ = X∗β + Z∗u + ε with ε ∼ N(0, Σ), Σ = σ²R, and u ∼ N(0, D), D = σu²∆, where σu² = 1/λ (α = σ²/σu²). This relation is valid for any covariance structure R, extending the use of spline modelling via mixed models to a wider class of covariance families, not only those that satisfy RΩ = Ω as is usually stated in the literature.

2 Unknown break-point estimation

We can include several break-point parameters ψ = (ψ1, . . . , ψD)^T in (1) by adding the terms (ti − ψj)+ = (ti − ψj) × I(ti > ψj) (Muggeo, 2003). Using a first-order Taylor expansion of (ti − ψj)+ around ψj^{(0)} we obtain

    yi ≈ ci^T a + wi^T g(ti) + Σ_{j=1}^{D} [ δj Uij^{(0)} + γj Vij^{(0)} ] + εi,   i = 1, . . . , n.

Then, over all the data, we can fit this model using (1) by setting U = [C, U1, V1, . . . , UD, VD], where C is an n × p matrix of other covariates and Uj, Vj for j = 1, . . . , D are the column vectors with elements Uij^{(0)} = (ti − ψj^{(0)})+ and Vij^{(0)} = (−1) × I(ti > ψj^{(0)}), respectively. Here b = (a, δ1, γ1, . . . , δD, γD)^T with γj = δj × (ψj − ψj^{(0)}) for j = 1, . . . , D, β = (β1, β2)^T and βs = (b, β)^T. β2 is the slope of the segment before the first break-point, δj is the difference in slopes before and after the j-th break-point, and β2 + Σ_{j=1}^{d} δj is the slope of the segment between ψ_{d−1} and ψ_d. To estimate the parameters, given a starting value ψj^{(0)}, at every step we update ψj by ψj^{(l)} = γj^{(l)}/δj^{(l)} + ψj^{(l−1)}. On convergence, we obtain the estimator of the break-point parameter, ψ̂j. A rough prototype of this update is sketched below.
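The sketch (ordinary least squares in place of the penalized mixed-model fit, a single break-point, and hypothetical inputs x, y and starting value psi0; not the authors' implementation):

    ## One break-point version of the linearization: regress y on x,
    ## U = (x - psi)_+ and V = -I(x > psi), then update psi by gamma/delta.
    estimate_psi <- function(x, y, psi0, tol = 1e-6, maxit = 50) {
      psi <- psi0
      for (l in seq_len(maxit)) {
        U <- pmax(x - psi, 0)
        V <- -as.numeric(x > psi)
        fit <- lm(y ~ x + U + V)
        step <- coef(fit)["V"] / coef(fit)["U"]   # gamma / delta
        psi <- psi + step
        if (abs(step) < tol) break
      }
      unname(psi)
    }

    ## Toy usage with a true break at x = 5:
    set.seed(1)
    x <- seq(0, 10, length.out = 200)
    y <- 1 + 0.5 * x + 2 * pmax(x - 5, 0) + rnorm(200, sd = 0.3)
    estimate_psi(x, y, psi0 = 4)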

3 Testing the Segmented Model versus the Semi-parametric Model

Assume there are D unknown break-point parameters ψ = (ψ1, . . . , ψD)^T. Our interest is to compare the null hypothesis (segmented model)

    H0: µ = U_{p+2D+2} b_{p+2D+2},

where U_{p+2D+2} = [C, U, X∗], with C a matrix of p fixed covariates and U = [U1, V1, . . . , UD, VD] the matrix whose columns Uj, Vj, j = 1, . . . , D, contain the terms involving the break-point parameters as described in the previous section (D is the number of break-points), against the alternative hypothesis (semi-parametric model)

    Ha: µ = U_{p+2D} b_{p+2D} + Wg,

where U_{p+2D} = [C, U] and g is a smooth function. Using the relation between splines and mixed models, the alternative hypothesis can be re-expressed as Ha: µ = U_{p+2D} b_{p+2D} + X∗β + Z∗u, and thus the comparison between the two hypotheses is equivalent to testing H0: σu² = 0 against Ha: σu² > 0. To test this hypothesis we can proceed similarly to Nummi, Pan, and Messue (2013), fixing the dimension of the spline by using ŷ = M∗y, where

    M∗ = [I − Sα] U (U∗^T U)^{-1} U∗^T + P

and U∗ = [I − P]^T H U. Note that P = T∗T∗^T, where T∗ is the matrix of the first c eigenvectors of T, from Sα = TΛT^T. Here M∗ is a projection matrix and thus lays the grounds for the application of the standard theory of linear models (Nummi et al., 2011). We then have σ^{-2} S0 ∼ χ²_{n−p−2D−2} with S0 = y^T(I − M0)y, where M0 = U_{p+2D+2}(U_{p+2D+2}^T U_{p+2D+2})^{-1} U_{p+2D+2}^T; σ^{-2} S∗ ∼ χ²_{n−c−p−2D} with S∗ = y^T(I − M∗)y; and σ^{-2}(S0 − S∗) ∼ χ²_{c−2}. Finally, the test can be based on

    F(y) = (n − c − p − 2D)/(c − 2) × (S0 − S∗)/S∗ ∼ F_{(c−2; n−c−p−2D)}.

FIGURE 1. Fitted models. Left: estimated semi-parametric (dashed) and segmented (solid) models. Centre: AIC (solid) and BIC (dashed) trends versus c, drawn as AIC_c/max(AIC) for c = 2, . . . , n, and likewise for BIC. Right: approximations using c = 8 for both the segmented model (U_c b_c + X∗ β_c) and the semi-parametric model (U_c b_c + W g_c).

4 Numerical Results

The data Cefamandole from the R package nlme consider m = 6 subjects; x is the time of the measurement (minutes post-injection), with K = 14 design points at τ = (10, 15, 20, 30, 45, 60, 75, 90, 120, 150, 180, 240, 300, 360)^T, and y is the plasma concentration of cefamandole (µg/ml). We fitted the semi-parametric model in (1) and a linear segmented model using the R package segmented. For both, R ≡ I_n was assumed, with starting values ψ^{(0)} = (50, 100)^T. AIC and BIC were used to choose c. Figure 1 shows a summary graph with the estimated models (left), the AIC and BIC trends used to choose c (centre) and the approximations for both models (right) using the selected value c = 8, for which AIC = 862.01 and BIC = 881.46. The estimated break-points, slopes, variance and smoothing parameters are presented in Table 1. To test the null hypothesis of the segmented model against the alternative semi-parametric model (both including the two break-points) with c = 8, the corresponding F-test gives F(y) = (72/6) × (5,449.202/26,780.93) = 2.442, while F(α = 0.95; 6, 72) = 2.227. We therefore reject the null segmented model; the semi-parametric model including the two break-point parameters is preferable.
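As a quick arithmetic check, the reported test statistic and critical value can be reproduced directly in R:

    ## Reported quantities: S0 - S* = 5449.202, S* = 26780.93,
    ## with c - 2 = 6 and n - c - p - 2D = 72 degrees of freedom.
    Fstat <- (72 / 6) * (5449.202 / 26780.93)
    Fcrit <- qf(0.95, df1 = 6, df2 = 72)
    c(Fstat = Fstat, Fcrit = Fcrit)   # 2.442 > 2.227: reject the segmented model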

TABLE 1. Estimated parameters for the semi-parametric and segmented model.

Model         ψ1       ψ2       β1        β2       δ1      δ2      σ²        α
Semiparam.    15.623   96.408   175.058   −6.517   6.033   0.266   373.773   0.929
Segmented     23.845   95.986   230.306   −7.221   6.530   0.657   387.232   —

References

Green, P.J., and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.

Muggeo, V.M.R. (2003). Estimating regression models with unknown break-points. Statistics in Medicine, 22, 3055 – 3071.

Nummi, T., and Koskela, L. (2008). Analysis of growth curve data by using cubic smoothing splines. Journal of Applied Statistics, 35, 681 – 691.

Nummi, T., Pan, J., and Messue, N. (2013). Testing Linearity in Semiparametric Regression Models. Statistics and Its Interface, 6(1), 3 – 8.

Nummi, T., Pan, J., Siren, T., and Liu, K. (2011). Testing for Cubic Smoothing Splines under Dependent Data. Biometrics, 67, 871 – 875.

Verbyla, A.P., Cullis, B.R., Kenward, M.G., and Welham, S.J. (1999). The Analysis of Designed Experiments and Longitudinal Data by Using Smoothing Splines. Journal of the Royal Statistical Society: Series C (Applied Statistics), 48, 269 – 311.


Mixture distribution model for resources availability in volunteer computing systems

Kenan Matawie1, Bahman Javadi1

1 School of Computing, Engineering and Mathematics, University of Western Sydney, Australia

E-mail for correspondence: [email protected]

Abstract: Characterizing, analysing and modelling resource availability in volunteer computing systems is becoming extremely important and is essential for efficient application scheduling and system utilization. In this paper we describe, analyse and model the availability characteristics using mixture probability density functions. We apply a gamma mixture model as a predictive method to estimate the availability tail probability, using the real availability traces from the SETI@home project with more than 230,000 hosts.

Keywords: Volunteer computing, availability models, mixture distributions.

1 Introduction

Volunteer computing systems with a large number of Internet-connected hosts can provide PetaFLOPS-scale computing power to many scientific projects in different areas such as mathematics, astronomy, physics and chemistry (Anderson, 2004). These platforms are suited to High-Throughput Computing (HTC) applications due to their unavailability rate and frequent churn. Thus, characterizing, analysing and modelling the availability of such resources in volunteer computing is an essential requirement for scheduling various scientific applications on these platforms, which is the main goal of this paper. We focus on the problem of scheduling applications needing fast response times (instead of high throughput).

There are several related works on availability modelling of volunteer systems, most of which used a limited number of hosts over a short period of time (Douceur, 2003). In previous work, an analysis and methodology were proposed to form subsets of hosts with similar statistical properties that can be modelled with similar distribution functions (Javadi et al., 2011). A recent paper presented different models for the non-random cluster of availability time, including an autocorrelation structure for the subsets with short/long range dependency (Javadi et al., 2013). In this paper, we focus on another alternative for modelling volunteer resources with random availability time. To do this, we extend the existing methodology to further characterize, analyse and use mixture distribution models, particularly the Gamma Shape Mixture (GSM) introduced by Venturini et al. (2008), to model the random availability intervals.

2 Availability Trace

In this paper, we used a real CPU availability trace from 230,000 hosts on the Internet between April 1, 2007 and January 1, 2009 (Javadi et al., 2011). CPU availability is considered as a binary value indicating whether the CPU was free or not. The traces record the start and end times of CPU availability for each host. This trace was collected using the BOINC server from the SETI@home volunteer resources (Anderson, 2004). The traces are independent since they are at the level of the BOINC client. In total, the traces captured 57,800 years of CPU time and 102,416,434 continuous intervals of CPU availability. It should be noted that BOINC is a middleware for volunteer distributed computing and is the underlying software infrastructure for several scientific projects such as SETI@home, Einstein@Home and Docking@Home.

3 Mixture Model

Previous work (Javadi et al., 2011) proposed a modelling workflow to model CPU availability and unavailability in large-scale distributed systems. Time series of the availability and unavailability of each host in the system were used, and A_x and U_y were defined as the random variables of availability and unavailability intervals, respectively. Different behaviours of these intervals in terms of randomness and periodicity were examined. For significant and accurate modelling, we need to capture and distinguish these different behaviours among the available resources in the system. In order to classify hosts whose availability is truly random, Javadi et al. (2011) showed when the variables A_x and U_y can be classified as belonging to iid hosts, meaning that the availability and unavailability intervals have identical and independent distributions; otherwise the hosts are considered non-iid. It was observed that 21% of the total hosts in the given volunteer computing system have random availability. For the iid hosts, a clustering approach based on a distance metric that measures the difference between two distributions was applied; this resulted in six different clusters of hosts. The availability and unavailability intervals of these clusters follow several distinct distribution families, such as Gamma and hyper-exponential distributions.


The distribution of the availability variable is skewed and has a heavy-tailed shape (Javadi et al., 2011). Hence, in this paper we focus on the use of a mixture model, a method that has not been explored in this area of research. Mixture models are flexible enough to represent and capture a large spectrum of differences in such a heterogeneous population of hosts. In particular, the Gamma Shape Mixture model GSM(π, θ | J) given below, defined and presented by Venturini et al. (2008), will be applied and investigated with respect to the current data characteristics.

    f(y | π1, . . . , πJ, θ) = Σ_{j=1}^{J} πj fj(y | θ),   (1)

where y is a positive (nonzero) random variable and

    fj(y | θ) = [θ^j / Γ(j)] y^{j−1} e^{−θy}

is the density function of a gamma G(j, θ) random variable; the number of known/fixed components J and the unknown vector of mixture weights π = (π1, . . . , πJ) are as discussed in detail in Venturini et al. (2008). The GSM model is used to estimate the tail probability for different thresholds k, i.e. the predictive tail probability f(y∗ > k | y); details of estimating this predictive probability are given in Venturini et al. (2008).
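For concreteness, the GSM density and the tail probability P(y > k) of equation (1) are straightforward to evaluate in R; the sketch below uses illustrative weights and rate, not the values fitted to the SETI@home trace:

    ## Gamma shape mixture of eq. (1): components G(j, theta), j = 1, ..., J.
    dgsm <- function(y, w, theta) {
      f <- 0
      for (j in seq_along(w)) f <- f + w[j] * dgamma(y, shape = j, rate = theta)
      f
    }
    gsm_tail <- function(k, w, theta) {   # P(Y > k)
      sum(w * pgamma(k, shape = seq_along(w), rate = theta, lower.tail = FALSE))
    }

    w     <- c(0.5, 0.3, 0.2)   # hypothetical mixture weights
    theta <- 0.1                # hypothetical common rate parameter
    dgsm(50, w, theta)          # mixture density at 50 hours
    gsm_tail(50, w, theta)      # P(availability interval > 50 hours)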

FIGURE 1. Hosts availability histogram with the model density (solid line).

From our host-availability data for the period 2007 – 09, a random sample of 10,000 intervals was selected. The analysis used the GSM model with hyper-parameters for j, α and β as mentioned in the section above, and a chain of 1,200 iterations (200 of which were burn-in). Figure 1 shows the fit of the GSM model to the host availability: a histogram of the positive data together with the posterior mean of the model density.


Figure 2 displays the CDF, the ECDF and the tail probabilities P(y > k) for different threshold values k. The tail probability for our data represents the probability of task completion within a given availability interval. The GSM model demonstrated an excellent fit to the sampled data.

FIGURE 2. CDF, ECDF and the right tail probability. (a) CDF: ECDF and posterior probability versus availability intervals (hours). (b) Tail: true and estimated tail probability versus availability threshold (hours).

References

Anderson, D. (2004). BOINC: A system for public-resource computing and storage. In: Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing, Pittsburgh, USA.

Douceur, J.R. (2003). Is remote host availability governed by a universal law? SIGMETRICS Performance Evaluation Review, 31(3), 25 – 29.

Javadi, B., Kondo, D., Vincent, J. and Anderson, D.P. (2011). Discovering statistical models of availability in large distributed systems: An empirical study of SETI@home. IEEE TPDS, 22(11), 1896 – 1903.

Javadi, B., Matawie, K. and Anderson, D.P. (2013). Modeling and analysis of resources availability in volunteer computing systems. In: IPCCC, pp. 1 – 9.

Venturini, S., Dominici, F. and Parmigiani, G. (2008). Gamma shape mixtures for heavy-tailed distributions. The Annals of Applied Statistics, 2(2), 756 – 776.


Weather Forecasts and Censored Regression with Conditional Heteroscedasticity

Jakob W. Messner1, Georg J. Mayr2, Achim Zeileis1

1 Department of Statistics, University of Innsbruck, Austria
2 Institute of Meteorology and Geophysics, University of Innsbruck, Austria

E-mail for correspondence: [email protected]

Abstract: In this paper we present regression models to correct numerical weather forecasts. Censored models are employed to account for the non-negativity of precipitation. Furthermore, conditional heteroscedasticity effectively utilizes forecast uncertainty information from ensemble forecasts. Ordered models can be used to relax parametric distribution assumptions.

Keywords: weather forecasts; censored regression; conditional heteroscedasticity.

1 Introduction

Weather forecasts are usually based on numerical weather prediction models. These models use the current state of the atmosphere and compute future weather by numerically simulating the most important atmospheric processes. Unfortunately, the current state is usually not perfectly known and not all atmospheric processes can be considered. Thus, numerical weather prediction models always exhibit errors. Fortunately, some of these errors are systematic and can be corrected with statistical models.

To deal with the remaining random errors, it can be advantageous for decision makers to know the expected forecast uncertainty. To obtain uncertainty information, so-called ensemble forecasts are often used. These are several numerical weather forecasts that use slightly different initial conditions and different model formulations. If these forecasts are very different from each other, the forecast uncertainty is assumed to be high.

In this paper we present regression models to correct numerical weather forecasts. By incorporating the ensemble spread as a regressor for the scale, event-dependent uncertainty information can be utilized effectively (Gneiting et al. 2005). To account for the limited (non-negative) range of meteorological variables like precipitation we apply censored regression models.


Furthermore we show that ordered regression models can be used to relaxparametric distribution assumptions.

2 Statistical models

The classical approach for the statistical correction of numerical weather predictions is to use ordinary least squares regression (Glahn and Lowry 1972). However, such an approach ignores the non-negativity of meteorological variables like precipitation. Furthermore, uncertainty information from the ensemble spread cannot be utilized effectively. Thus, we employ a censored regression model with conditional heteroscedasticity. For this model it is assumed that a latent response y∗ (e.g., precipitation) follows a distribution F (e.g., normal or logistic):

y∗ ∼ F(µ, σ²)   (1)

where the mean µ is a linear function of some regressor variables x (e.g., the ensemble mean forecast), and the logarithm of the standard deviation σ is a linear function of regressor variables z (e.g., the ensemble standard deviation):

    µ = x^T β   (2)

    log(σ) = z^T γ   (3)

To account for the non-negativity of forecast variables such as precipitation, the response can be assumed to be censored:

    y = 0    if y∗ ≤ 0,
    y = y∗   if y∗ > 0.   (4)

Additionally, the response is often transformed to meet the distribution assumption of Equation 1; e.g., modelling the square root of precipitation has been shown to be advantageous with a logistic distribution F (e.g., Wilks 2009). Weather forecast users often do not need exact forecasts, and the prediction of categories (e.g., no precipitation, light precipitation, heavy precipitation) or of the probabilities of falling into such categories is often sufficient. In that case the responses are ordered categories separated by several thresholds q_j:

    y = 1   if y∗ ≤ q1,
    y = 2   if q1 < y∗ ≤ q2,
    y = 3   if q2 < y∗ ≤ q3,
    . . .   (5)

Then, the probability of falling below a certain threshold can be modeled with an ordered regression model

    P(y ≤ j | x) = H( (θj − x^T β) / exp(z^T γ) )   (6)


where H is the cumulative distribution function of F. The θj can either be replaced with the qj or estimated in addition to β and γ (Messner et al. 2013). Additional estimation has the advantage of further relaxing the assumptions from Equations 1 – 3.

3 Case study

In this section we present results from a case study of precipitation forecasts for Innsbruck (Austria). We used accumulated precipitation forecasts for 5 to 8 days in advance from the freely available Global Ensemble Reforecast data set (Hamill et al. 2013). Corresponding observations were taken from the automatic weather station at Innsbruck airport. Figure 1 shows these forecasts and observations; it can be seen that precipitation is frequently over-predicted by the numerical model. A censored and an ordered regression model were fitted with the R package crch 0.1-0, and Table 1 shows their estimated regression coefficients. Both models show clearly positive effects of the ensemble standard deviation on the scale (forecast uncertainty). Furthermore, the similar coefficients of both models indicate that a logistic distribution assumption for the square root of precipitation is reasonable for the present data set.
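A hedged sketch of such a fit (a hypothetical data frame d with columns rain, ensmean and enssd; not the authors' exact code) using crch looks as follows:

    ## Censored heteroscedastic logistic regression, Equations 1 - 4:
    ## square-root response, left-censored at zero; the term after "|"
    ## enters the (log-)scale model. 'd' and its columns are hypothetical.
    library(crch)
    fit <- crch(sqrt(rain) ~ sqrt(ensmean) | sqrt(enssd),
                data = d, left = 0, dist = "logistic")
    summary(fit)   # coefficients comparable to Table 1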

References

Glahn, H. and Lowry, D. (1972). The Use of Model Output Statistics (MOS) in Objective Weather Forecasting. Journal of Applied Meteorology, 11, 1203 – 1211.

Gneiting, T., Raftery, A.E., Westveld, A.H. and Goldman, T. (2005). Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation. Monthly Weather Review, 133, 1098 – 1118.

Hamill, T.M., Bates, G.T., Whitaker, J.S., Murray, D.R., Fiorino, M., Galarneau, T.J., Zhu, Y. and Lapenta, W. (2013). NOAA's Second-Generation Global Medium-Range Ensemble Reforecast Data Set. Bulletin of the American Meteorological Society, 94, 1553 – 1565.

Messner, J.W., Mayr, G.J., Wilks, D.S. and Zeileis, A. (2013). Extending Extended Logistic Regression for Ensemble Post-Processing: Extended vs. Separate vs. Ordered vs. Censored. Working Papers, Faculty of Economics and Statistics, University of Innsbruck.

Wilks, D.S. (2009). Extending Logistic Regression to Provide Full-Probability-Distribution MOS Forecasts. Meteorological Applications, 16, 361 – 368.


FIGURE 1. Day 5–8 accumulated precipitation amount forecasts and corresponding observations. The left panel shows a time series of one month, where the red solid line is the mean of the 11 different forecasts in the ensemble; the area between the 0.1 and 0.9 quantiles of these 11 forecasts is shaded. In the right panel, precipitation amount is plotted against forecasts from 13 years (2000–2013); the fit from a censored regression model is shown as a blue solid line.

TABLE 1. Coefficient table for censored and ordered regression models. Standarderrors are shown in brackets. Ordered categories for the ordered model are definedby the thresholds 0, 1, 4, and 16 mm. Precipitation observations and forecastsare square root transformed and a logistic distribution is used for F .

                 censored                 ordered
               location     scale       location     scale
(Intercept)    −0.873***   −0.140***   −0.988***   −0.178***
               (0.070)     (0.042)     (0.072)     (0.048)
sqrtensmean     0.792***                0.816***
               (0.019)                 (0.021)
sqrtenssd                   0.240***                0.287***
                           (0.033)                 (0.038)


Influence measures in beta regression models through Fréchet metric

J.M. Muñoz-Pichardo1, J.L. Moreno-Rebollo1, R. Pino-Mejías1, M.D. Cubiles de la Vega1

1 University of Seville, Spain

E-mail for correspondence: [email protected]

Abstract: In this poster, case-deletion diagnostics are proposed for beta regression models. The influence diagnostics are based on the Fréchet distance between the distributions of the MLEs of the model parameters resulting from the complete data set and after the omission of a sample case. The diagnostics will be related to other influence measures proposed in the literature.

Keywords: Fréchet metric; influence analysis; beta regression.

1 Introduction

The beta regression model is commonly used when the response variable is a ratio or a percentage. Because the beta distribution is very flexible for modelling proportions, it can be applied in a wide range of situations. Due to its relevance, papers focused on inference for the model parameters, adjustment measures and influence diagnostics can be found in the literature. In particular, a class of beta regression models was proposed by Ferrari and Cribari-Neto (2004), and topics of inference and diagnostic techniques are considered in Espinheira et al. (2008a, 2008b), Ferrari and Pinheiro (2012) and Chien (2012).

Standard methods of assessing influence are based on a suitable scheme for perturbing the model and a procedure to measure the effect of the perturbation. The case-deletion approach is the most common perturbation scheme. To assess the effect of the omission on a parameter estimate θ̂, some type of distance between θ̂(i), the estimator obtained after deleting the ith case, and θ̂ is generally used. Alternatively, the effect of the omission can be measured through some type of distance between the distribution functions F_{θ̂(i)} and F_{θ̂}. This approach has been applied by Jiménez-Gamero et al. (2002) in various linear models, resulting in diagnostics that are more complete than those obtained by direct comparison of the estimated values. More complete means that they compare not only θ̂ and θ̂(i) for a specific sample but also their probability distributions. In this poster, we propose case-deletion diagnostics in the beta regression model through the Fréchet distance (Fréchet, 1957). This metric has been used by different authors to study outliers and influential observations; see Hadi and Nyquist (1999) and Muñoz-Pichardo et al. (2004, 2008).

The poster is organized as follows. The beta regression model and some key results are presented in Section 2. Section 3 describes the Fréchet distance, its properties and the results to be used later. In Section 4, two new influence diagnostics based on the Fréchet distance are proposed. The resulting diagnostics will be compared with other diagnostics proposed in the literature. Finally, two illustrative examples are included and conclusions are drawn.

2 The model

The probability density function of the beta distribution, modeled in termsof its mean (µ) and a precision parameter (φ), is given by

    f(y) = [Γ(φ) / (Γ(µφ) Γ[(1 − µ)φ])] y^{µφ−1} (1 − y)^{(1−µ)φ−1},

with 0 < µ < 1 and φ > 0. The mean and the variance are given by E[y] = µ and var[y] = µ(1 − µ)/(1 + φ), respectively.

Given a vector of covariates x = (x1, . . . , xk)^T, the beta regression model (Ferrari and Cribari-Neto, 2004) assumes that the response y follows a beta distribution with

    E[y | x] = µ(x),   var[y | x] = µ(x)(1 − µ(x))/(1 + φ),   and   g(µ(x)) = x^T β,

where β is a vector of unknown parameters and g(·) is a strictly monotonic, twice differentiable link function. Among the various proposals for the link function, we consider the logistic function, i.e., g(µ) = log[µ/(1 − µ)]. The maximum likelihood estimates (MLEs) of β and φ are obtained by numerical approximation; they can be calculated in R (betareg package), see Cribari-Neto and Zeileis (2010). The asymptotic distribution is given by

    (β̂, φ̂)^T ∼ N_{k+1}[ (β, φ)^T, V(β, φ) ],   V(β, φ) = [ V_β(β, φ)   v_{β,φ}^T(β, φ) ; v_{β,φ}(β, φ)   v_φ(β, φ) ],

where V(β, φ) is the inverse of the Fisher information matrix.
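As a brief, self-contained sketch (simulated data, not from the poster), such a model can be fitted in R with betareg:

    ## Beta regression with a logit link; vcov() returns the estimated
    ## inverse Fisher information used for the asymptotic normality above.
    library(betareg)

    set.seed(42)
    d   <- data.frame(x = rnorm(100))
    mu  <- plogis(-0.5 + 0.8 * d$x)             # logit link for the mean
    d$y <- rbeta(100, mu * 30, (1 - mu) * 30)   # precision phi = 30
    fit <- betareg(y ~ x, data = d, link = "logit")
    summary(fit)   # MLEs of beta and phi
    vcov(fit)      # asymptotic covariance of (beta, phi)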

3 Fréchet distance as influence measure

Let P(R^q) be the space of distributions on R^q with finite second-order moment. Given F1, F2 ∈ P(R^q), the Fréchet metric between F1 and F2 is defined as

    d(F1, F2) = ( min_{X1,X2} { E[‖X1 − X2‖²] : L(X1) = F1, L(X2) = F2 } )^{1/2},

where L(X) denotes the distribution of the random variable X. Dowson and Landau (1982) obtained d(F1, F2) when F1 and F2 belong to any subclass of P(R^q) that is closed with respect to linear transformations:

    d(F1, F2) = { ‖µ1 − µ2‖² + tr[ Σ1 + Σ2 − 2(Σ1Σ2)^{1/2} ] }^{1/2},   (1)

where µ1, µ2, Σ1 and Σ2 are the mean vectors and covariance matrices of F1 and F2, respectively. In this case the Fréchet distance combines two metrics: the first term, ‖µ1 − µ2‖, defines a metric on the means, and the second term, tr[Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}], defines a metric on the space of q × q covariance matrices. Muñoz-Pichardo et al. (2004) called them the location component (LC) and the dispersion component (DC).

A sample version of the Fréchet distance provides an influence measure that has been applied in various models. Let θ̂ be an estimator of θ and θ̂(i) the estimator obtained under omission of the ith case. Let F(·; θ̂) and F(i)(·; θ̂(i)) be the distributions of θ̂ and θ̂(i), respectively. The Fréchet measure on the distribution of θ̂ associated with the ith case is

    δi(θ̂) = d[ F(·; θ̂), F(i)(·; θ̂(i)) ].   (2)
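A small sketch of equation (1) in R (not the poster's code): the trace term is computed via tr[(Σ1^{1/2} Σ2 Σ1^{1/2})^{1/2}], which equals tr[(Σ1Σ2)^{1/2}] and is numerically stable:

    ## Frechet distance between two normal laws: location component (LC)
    ## plus dispersion component (DC). msqrt() is a symmetric square root.
    msqrt <- function(M) {
      e <- eigen(M, symmetric = TRUE)
      e$vectors %*% (sqrt(pmax(e$values, 0)) * t(e$vectors))
    }
    frechet_dist <- function(mu1, S1, mu2, S2) {
      S1h <- msqrt(S1)
      lc  <- sum((mu1 - mu2)^2)
      dc  <- sum(diag(S1 + S2 - 2 * msqrt(S1h %*% S2 %*% S1h)))
      sqrt(lc + dc)
    }

    ## Example with two bivariate normal distributions:
    frechet_dist(c(0, 0), diag(2), c(1, 0), matrix(c(2, 0.5, 0.5, 1), 2))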

4 Influence measures on MLEs

The asymptotic distributions of the MLEs of β and φ, for the complete data set and after deleting the ith sample case, are given by

    β̂ ∼ N_k[β, V_β(β, φ)],   φ̂ ∼ N[φ, v_φ(β, φ)],
    β̂(i) ∼ N_k[β, V_{β(i)}(β, φ)],   φ̂(i) ∼ N[φ, v_{φ(i)}(β, φ)].

Following Muñoz-Pichardo et al. (2004), we propose two measures to assess the influence of the ith observation yi on the MLEs: the Fréchet measure on the distribution of β̂ associated with the ith case, δi(β̂), and the Fréchet measure on the distribution of φ̂ associated with the ith case, δi(φ̂). Since the MLEs follow (multivariate) normal distributions, which are closed with respect to linear transformations, from (2)

    δi(β̂) = { ‖β̂ − β̂(i)‖² + tr[ V_β + V_{β(i)} − 2(V_β V_{β(i)})^{1/2} ] }^{1/2},

    δi(φ̂) = { (φ̂ − φ̂(i))² + (v_φ^{1/2} − v_{φ(i)}^{1/2})² }^{1/2},


with V_β = V_β(β̂, φ̂), V_{β(i)} = V_β(β̂(i), φ̂(i)), v_φ = v_φ(β̂, φ̂) and v_{φ(i)} = v_φ(β̂(i), φ̂(i)). Finally, δi(β̂) and δi(φ̂) will be compared with other diagnostics proposed in the literature (residuals, generalized leverage and Cook's distance), and the jackknife-after-bootstrap approach of Martin and Roberts (2010) will be considered to determine cut-offs.

References

Chien, L.C. (2012). Multiple deletion diagnostics in beta regression models. Computational Statistics, 28, 1639 – 1661.

Cribari-Neto, F. and Zeileis, A. (2010). Beta regression in R. Journal of Statistical Software, 34, 1 – 24.

Dowson, D.C. and Landau, B.V. (1982). The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12, 450 – 455.

Espinheira, P.L., Ferrari, S.L.P. and Cribari-Neto, F. (2008a). On beta regression residuals. Journal of Applied Statistics, 35, 407 – 419.

Espinheira, P.L., Ferrari, S.L.P. and Cribari-Neto, F. (2008b). Influence diagnostics in beta regression. Computational Statistics & Data Analysis, 52, 4417 – 4431.

Ferrari, S.L.P. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799 – 815.

Ferrari, S.L.P. and Pinheiro, E.C. (2012). Improved likelihood inference in beta regression. Journal of Statistical Computation and Simulation, 81, 431 – 443.

Fréchet, M. (1957). Sur la distance de deux lois de probabilité. C. R. Acad. Sci. Paris, 244, 689 – 692.

Hadi, A.S. and Nyquist, H. (1999). Fréchet distance as a tool for diagnosing multivariate data. Linear Algebra and its Applications, 289, 183 – 201.

Jiménez-Gamero, M.D., Muñoz-Pichardo, J.M., Muñoz-García, J. and Pascual, A. (2002). Rao distance as a measure of influence in the multivariate linear model. Journal of Applied Statistics, 29(6), 841 – 854.

Martin, M.A. and Roberts, S. (2010). Jackknife-after-bootstrap regression influence diagnostics. Journal of Nonparametric Statistics, 22, 257 – 269.

Muñoz-Pichardo, J.M., Enguix, A., Muñoz-García, J. and Pascual, A. (2004). Fréchet metric as a measure of influence in multivariate linear models with random errors elliptically distributed. Computational Statistics & Data Analysis, 46, 469 – 491.

Muñoz-Pichardo, J.M., Moreno-Rebollo, J.L., Enguix, A. and Pascual, A. (2008). Influence measures on profile analysis with elliptical data through Fréchet's metric. Metrika, 68, 111 – 127.


Influence measures in beta regression models through Rao distance

J.M. Muñoz-Pichardo1, R. Pino-Mejías1, J.L. Moreno-Rebollo1, M.D. Cubiles de la Vega1

1 University of Seville, Spain

E-mail for correspondence: [email protected]

Abstract: In this poster, case-deletion diagnostics are proposed for beta regression models. The influence diagnostics are based on the Rao distance between the distributions of the MLEs of the model parameters resulting from the complete data set and after the omission of a sample case. The diagnostics will be related to other influence measures proposed in the literature.

Keywords: Rao distance; influence analysis; beta regression.

1 Introduction

The beta regression model is commonly used when the response variable is a ratio or a percentage. Because the beta distribution is very flexible for modelling proportions, it can be applied in a wide range of situations. Due to its relevance, papers focused on inference for the model parameters, adjustment measures and influence diagnostics can be found in the literature. In particular, a class of beta regression models was proposed by Ferrari and Cribari-Neto (2004), and topics of inference and diagnostic techniques are considered in Espinheira et al. (2008a, 2008b), Ferrari and Pinheiro (2012) and Chien (2012).

Standard methods of assessing influence are based on a suitable scheme for perturbing the model and a procedure to measure the effect of the perturbation. The case-deletion approach is the most common perturbation scheme. To assess the effect of the omission on a parameter estimate θ̂, some type of distance between θ̂(i), the estimator obtained after deleting the ith case, and θ̂ is generally used. Alternatively, the effect of the omission can be measured through some type of distance between the distribution functions F_{θ̂(i)} and F_{θ̂}. This approach has been applied by Jiménez-Gamero et al. (2002) in various linear models, resulting in diagnostics that are more complete than those obtained by direct comparison of the estimated values. More complete means that they compare not only θ̂ and θ̂(i) for a specific sample but also their probability distributions. In this poster, we propose a case-deletion diagnostic in the beta regression model through the Rao distance (Rao, 1949).

The poster is organized as follows. The beta regression model and some key results are presented in Section 2. Section 3 describes the Rao distance, its properties and the results to be used later. In Section 4, two new influence diagnostics based on the Rao distance are proposed. The resulting diagnostics will be compared with other diagnostics proposed in the literature. Finally, two illustrative examples are included and conclusions are drawn.

2 The model

The probability density function of the beta distribution, modeled in termsof its mean (µ) and a precision parameter (φ), is given by

    f(y) = [Γ(φ) / (Γ(µφ) Γ[(1 − µ)φ])] y^{µφ−1} (1 − y)^{(1−µ)φ−1},

with 0 < µ < 1 and φ > 0. The mean and the variance are given by E[y] = µ and var[y] = µ(1 − µ)/(1 + φ), respectively.

Given a vector of covariates x = (x1, . . . , xk)^T, the beta regression model (Ferrari and Cribari-Neto, 2004) assumes that the response y follows a beta distribution with

    E[y | x] = µ(x),   var[y | x] = µ(x)(1 − µ(x))/(1 + φ),

and

    g(µ(x)) = Σ_{i=1}^{k} xi βi = x^T β,

where β = (β1, . . . , βk)^T is a vector of unknown parameters and g(·) is a strictly monotonic, twice differentiable link function. Among the various proposals for the link function, we consider the logistic function, i.e., g(µ) = log[µ/(1 − µ)]. The maximum likelihood estimates (MLEs) of β and φ are obtained by numerical approximation; they can be calculated in R (betareg package), see Cribari-Neto and Zeileis (2010). The asymptotic distribution is given by

    (β̂, φ̂)^T ∼ N_{k+1}[ (β, φ)^T, V(β, φ) ],   V(β, φ) = [ V_β(β, φ)   v_{β,φ}^T(β, φ) ; v_{β,φ}(β, φ)   v_φ(β, φ) ],

where V(β, φ) is the inverse of the Fisher information matrix (Ferrari and Cribari-Neto, 2004).


3 Rao distance as influence measure

The Rao distance defines a metric on a space of probability distributions. It can be considered a natural generalization of the Mahalanobis distance and is invariant under admissible transformations of the parameters and of the data. Atkinson and Mitchell (1981) obtained the Rao distance for some common families of distributions. In particular, the Rao distance between N_q(µ, Σ1) and N_q(µ, Σ2) is given by

    s(µ, Σ1, Σ2) = { (1/2) Σ_{j=1}^{q} log²[λj(Σ2 Σ1^{-1})] }^{1/2},   (1)

with λj(Σ2 Σ1^{-1}), j = 1, . . . , q, the characteristic roots of Σ2 Σ1^{-1}.

A sample version of the Rao distance provides a measure of influence. Let θ̂ be an estimator of θ and θ̂(i) the estimator obtained under omission of the ith case. Let F(·; θ̂) and F(i)(·; θ̂(i)) be the distributions of θ̂ and θ̂(i), respectively. The Rao measure on the distribution of θ̂ associated with the ith case is

    ρi(θ̂) = s[ F(·; θ̂), F(i)(·; θ̂(i)) ].
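A small sketch of equation (1) in R (not the poster's code), for two covariance matrices under a common mean:

    ## Rao distance between N_q(mu, S1) and N_q(mu, S2): the characteristic
    ## roots of S2 %*% solve(S1) are real and positive for SPD inputs.
    rao_dist <- function(S1, S2) {
      lam <- Re(eigen(S2 %*% solve(S1), only.values = TRUE)$values)
      sqrt(0.5 * sum(log(lam)^2))
    }

    rao_dist(diag(2), diag(c(2, 1)))   # equals sqrt(0.5) * log(2)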

4 Influence measures on MLEs

The asymptotic distributions of the MLEs of β and φ, for the complete data set and after deleting the ith sample case, are given by

    β̂ ∼ N_k[β, V_β(β, φ)],   φ̂ ∼ N[φ, v_φ(β, φ)]

and

    β̂(i) ∼ N_k[β, V_{β(i)}(β, φ)],   φ̂(i) ∼ N[φ, v_{φ(i)}(β, φ)],

respectively. Following Jiménez-Gamero et al. (2002), we propose two measures to assess the influence of the ith observation yi on the MLEs: ρi(β̂) and ρi(φ̂), the sample versions of the Rao distance between the asymptotic distributions of β̂ and β̂(i), and of φ̂ and φ̂(i), respectively. Since the distributions of the MLEs of β and φ have a common mean vector, from (1)

    ρi(β̂) = { (1/2) Σ_{j=1}^{k} log²[ λj( V_{β(i)}(β̂(i), φ̂(i)) V_β^{-1}(β̂, φ̂) ) ] }^{1/2}

and

    ρi(φ̂) = { (1/2) log²[ v_{φ(i)}(β̂(i), φ̂(i)) / v_φ(β̂, φ̂) ] }^{1/2}.

Finally, ρi(β̂) and ρi(φ̂) will be compared with other diagnostics proposed in the literature (residuals, generalized leverage and Cook's distance), and the jackknife-after-bootstrap approach of Martin and Roberts (2010) will be considered to determine cut-offs.

References

Chien, L.C. (2012). Multiple deletion diagnostics in beta regression models. Computational Statistics, 28, 1639 – 1661.

Cribari-Neto, F. and Zeileis, A. (2010). Beta regression in R. Journal of Statistical Software, 34, 1 – 24.

Espinheira, P.L., Ferrari, S.L.P. and Cribari-Neto, F. (2008a). On beta regression residuals. Journal of Applied Statistics, 35, 407 – 419.

Espinheira, P.L., Ferrari, S.L.P. and Cribari-Neto, F. (2008b). Influence diagnostics in beta regression. Computational Statistics & Data Analysis, 52, 4417 – 4431.

Ferrari, S.L.P. and Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31, 799 – 815.

Ferrari, S.L.P. and Pinheiro, E.C. (2012). Improved likelihood inference in beta regression. Journal of Statistical Computation and Simulation, 81, 431 – 443.

Jiménez-Gamero, M.D., Muñoz-Pichardo, J.M., Muñoz-García, J. and Pascual, A. (2002). Rao distance as a measure of influence in the multivariate linear model. Journal of Applied Statistics, 29(6), 841 – 854.

Martin, M.A. and Roberts, S. (2010). Jackknife-after-bootstrap regression influence diagnostics. Journal of Nonparametric Statistics, 22, 257 – 269.

Rao, C.R. (1949). On the distance between two populations. Sankhyā, 9, 246 – 291.


Estimation of Upper Tail Dependence for Insurance Loss Data Using an Empirical Copula-based Approach

Adrian O’Hagan1

1 University College Dublin, Ireland

E-mail for correspondence: [email protected]

Abstract: Considerable focus in the world of insurance risk quantification is placed on modelling loss values from lines of business (LOBs) that possess upper tail dependence. Parametric copulas such as the Joe, Gumbel and Student-t copulas may be used for this purpose. The copula structure imparts a specified degree of tail dependence on the joint distribution of claims from different LOBs. Alternatively, practitioners may possess historical or simulated data that already exhibit upper tail dependence, through the impact of catastrophe events such as hurricanes or earthquakes. In these cases it is of interest to accurately assess the degree of tail dependence already present in the data. The empirical copula and its associated upper tail dependence coefficient are presented in this paper as a robust, efficient means of achieving this goal.

Keywords: Empirical copula; extreme loss events; insurance reserving; upper tail dependence coefficient.

1 Introduction

It is vital that, in demonstrating prudent reserving to industry regulators, insurance companies possess a model that simulates company losses in a manner consistent with the real-world phenomenon of upper tail dependence across multiple lines of business. A robust approach for gauging a model's ability to effectively meet this requirement is to use the empirical copula (Shaw, Smith & Spivak, 2010). This provides the standard benefit of all non-parametric approaches in that it does not make any distributional assumptions as to the form of the copula or the nature of the underlying dependence structure between different lines of business (LOBs). Through manipulation of the link between the joint empirical distribution of the data and the empirical copula, a smoothed estimate of the upper tail dependence coefficient can be extracted.


2 Data

The data comprise 50,000 simulated general insurance loss claims across each of 9 lines of business (LOBs), provided by Kiln Group, insurance underwriters on the Lloyd's market in London. The ultimate goal when producing simulated loss data of this nature is to identify extreme total loss events, which are more likely to occur when multiple LOBs exhibit upper tail dependence. Hence insurance companies increasingly attempt to incorporate this phenomenon in their loss models, through copula-based or other procedures (Embrechts, Lindskog & McNeil, 2003). The empirical copula provides a natural method of verifying whether or not their attempts have proven successful.

3 Methods

3.1 The Empirical Copula

Sklar’s Theorem (Sklar, 1959) describes the dependence between two or more random variables X1, X2, . . . , Xd. It states that the joint cumulative distribution function (CDF) of the random variables, H(x1, x2, . . . , xd), can be expressed as a function C of the marginal CDFs, F1(x1), F2(x2), . . . , Fd(xd):

H(x1, x2, . . . xd) = P (X1 ≤ x1, X2 ≤ x2, . . . Xd ≤ xd) (1)

H(x1, x2, . . . xd) = C(F1(x1), F2(x2), . . . Fd(xd)) (2)

C is known as a copula and is unique if all marginal CDFs, F1, F2, . . . , Fd, are continuous.

3.2 The Upper Tail Dependence Coefficient

For the bivariate case with marginal random variables X and Y, the upper tail dependence coefficient λu is defined as:

λu = lim_{η→1⁻} [ P(Y > F_Y^{-1}(η) | X > F_X^{-1}(η)) ] = lim_{η→1⁻} [ (1 − 2η + C(η, η)) / (1 − η) ]   (3)

In the case of the non-parametric empirical copula, Schmidt (2006) provides a useful approximation to λu for the bivariate case:

λu ≈ 2 − [ log C(1 − t/T, 1 − t/T) / log(1 − t/T) ]   (4)

where T denotes the number of observations and t = T^{1/2}. This paper and its accompanying R package EmpCop (currently in development) automate this process efficiently for actuarial and financial practitioners.

Page 131: 29th International Workshop on Statistical Modelling · Proceedings of the 29th International Workshop on Statistical Modelling, Volume 2, G ottingen, July 14{18, 2014, Thomas Kneib,


3.3 Using the Empirical Copula to Estimate the Upper Tail Dependence Coefficient

1. Compute the marginal empirical CDFs for losses from lines of business denoted X and Y, and scale each CDF by T/(T + 1).

2. Find a smoothed estimate x′ such that P(X < x′) ≈ (1 − t/T), and an equivalent estimate y′.

3. Calculate the smoothed joint CDF for X and Y, H∗(x, y). The non-parametric kernel smoothing method in the np package in R is employed in steps 2 and 3 (Hayfield & Racine, 2008).

4. Use the extrapolated joint CDF formed in step 3 to find the value H∗(x′, y′) ≈ P(X < x′, Y < y′) ≈ C(1 − t/T, 1 − t/T).

5. Calculate the estimated value of the upper tail dependence coefficient λu′ by substituting H∗(x′, y′) for C(1 − t/T, 1 − t/T) in equation (4), as sketched below.
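A minimal base-R sketch of this estimator follows. For brevity, the kernel-smoothed joint CDF from the np package (steps 2–3) is replaced here by the raw empirical copula, and all data names are illustrative rather than taken from the paper:

## Empirical-copula estimate of lambda_u via Schmidt's approximation (4);
## the np-based smoothing of steps 2-3 is replaced by the raw empirical
## joint CDF for simplicity. 'x' and 'y' are loss vectors for two LOBs.
upper_tdc <- function(x, y) {
  T <- length(x)
  u <- 1 - sqrt(T) / T            # evaluation point 1 - t/T, with t = sqrt(T)
  Fx <- rank(x) / (T + 1)         # marginal empirical CDFs scaled by T/(T + 1)
  Fy <- rank(y) / (T + 1)
  C_uu <- mean(Fx <= u & Fy <= u) # empirical copula evaluated at (u, u)
  2 - log(C_uu) / log(u)          # equation (4)
}

## Illustration on simulated tail-dependent losses (common-shock construction):
set.seed(1)
shock <- rexp(2500)
upper_tdc(shock + rexp(2500, 4), shock + rexp(2500, 4))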

4 Results

The estimated results for λu were unanimously consistent with Kiln's expectations, in cases of both strong and weak upper tail dependence. For example, lines of business 5 and 6 are subject to the same catastrophe events, causing large upper tail dependence. The mean estimated value of λu for this LOB pairing was accordingly 0.87 with a standard deviation of 0.05. Figure 1(a) presents a histogram of the values of λu recorded across 20 sub-samples of 2,500 loss claims for LOBs 5 and 6.

The sensitivity of the λu results was checked relative to the input parameter t = T^{1/2}. For each LOB pairing, a fixed sample of 2,500 simulated loss values was used to evaluate λu, with t varied from 45 to 55 in 1 unit increments. The results for λu across LOB pairings displayed minimal volatility for variations in t. Figure 1(b) presents a line plot of λu values for lines of business 5 and 6 across the range of t.

5 Conclusions

The empirical copula provides a robust method of estimating the upper tail dependence coefficient between insurance loss distributions. It exhibits low sensitivity to its input parameter and is stable under repeated resampling of the data. The empirical copula offers users the benefit of requiring no assumptions as to the distributional nature of the underlying dependence structure. This affords insurance companies flexibility in modelling their loss processes. Simulated losses from a fitted model can be tested using the empirical copula approach to verify that upper tail dependence has been incorporated, if desired. This affords the user greater confidence in identifying extreme total loss events and setting aside appropriate reserves.



FIGURE 1. (a) Histogram of upper tail dependence coefficients λu using the empirical copula for 20 samples of 2,500 simulated losses across lines of business 5 and 6. (b) Line plot showing variation in λu for a sample of 2,500 simulated losses in lines of business 5 and 6 for different values of input parameter t.

[Figure 1: panel (a), histogram of the upper tail dependence coefficient λu (frequency on the vertical axis); panel (b), λu plotted against t for t = 45, . . . , 55, with λu ranging from about 0.78 to 0.90.]

Acknowledgments: Mr Brian Heffernan, Chief Actuary RJ Kiln & Co.

References

Embrechts, P., Lindskog, F. and McNeil, A. (2003). Handbook of Heavy Tail Distributions in Finance. Amsterdam: Elsevier.

Hayfield, T. and Racine, J.S. (2008). Nonparametric econometrics: the np package. Journal of Statistical Software, 27.

Shaw, R.A., Smith, A.D. and Spivak, G.S. (2010). Measurement and modelling of dependencies in economic capital. A discussion paper, 10, 54 – 71.

Sklar, A. (1959). Fonctions de repartition a n dimensions et leurs marges. Inst. Statist. Univ. Paris, 8, 229 – 231.


Effects of precipitation and temperature in alpine areas on backcountry avalanche accidents reported in the western part of Austria within 1987–2009

Christian Pfeifer1, Peter Holler2

1 Institut fur Statistik, Universitat Innsbruck, A–6020 Innsbruck
2 Bundesamt und Forschungszentrum fur Wald, Institut fur Naturgefahren, A–6020 Innsbruck

Abstract: In this article we analyze the effects of precipitation (snow, rain) and temperature on observed backcountry avalanche accident counts.

1 Introduction

In recent years backcountry skiing has become very popular. Unfortunately, there are many avalanche accidents, which cause about 25 fatalities in Austria every year. However, efforts have been made to prevent backcountry avalanche accidents (see for example Pfeifer, 2010).

In this paper we consider the following: since 2010 the Tyrolean avalanche service has been publishing special information for backcountry skiers, which they call 'danger patterns' (see Mair and Nairz, 2012), such as:

• new snow

• rain

• long cold periods followed by new snow

• wet snow in spring etc.

However, Mair and Nairz did not give any empirical evidence for the effects of their danger patterns on avalanche danger. In this contribution we analyze such patterns in relation to backcountry avalanche accidents. Avalanche experts expect that at least new snow has a significant effect on backcountry avalanche accidents (see Holler, 2009; Holler, 2012). A short description of the avalanche and weather data is given below:




2 Data

• Number of daily backcountry avalanche accidents in the western part of Austria (federal states Tyrol and Vorarlberg) within the winter periods 1987/88–2008/09, stratified by municipal areas; see Amt der Tiroler Landesregierung, 1994–2010 and Kuratorium fur alpine Sicherheit, 1973–2011. Total number of accidents taken into account: 885.

• Homogenized daily precipitation and temperature data from 18 weather stations in Tyrol and Vorarlberg since 1987; see Auer et al., 2010.

3 Models

Using a spatial model (kriging) we compute mean precipitation and mean temperature every day for about 300 municipal areas separately. We did these calculations using the kriging functions of the R package geoR (see e.g. Diggle and Ribeiro, 2007). With this information we are able to calculate the mean new snow and rain (mm) in alpine regions (sea level ≥ 1500 m) within each municipality, considering the average snow line (in our case defined by the zero-degree line) and the distribution of alpine region levels in the corresponding municipal area.

Finally, we analyze the effects of snow, rain and temperature on the number of observed daily avalanche accidents within each municipal area. For this purpose we employ a hurdle count model, which takes into account that avalanche counts are expected to be rather rare. The observations yi of the hurdle model are assumed to come from a mixture whose first component is zero with probability one and whose second component is a truncated Poisson:

p(yi) = 1 − p,   if yi = 0;
p(yi) = p · exp(−λ) λ^{yi} / [(1 − exp(−λ)) yi!],   if yi > 0.

In order to define the covariate effects on the observations we use the link functions of the logistic and the loglinear model:

log(λ) = Bβ,   logit(p) = Gγ

The fitting of this model is done using the hurdle() function, which is part of the R package pscl (see Zeileis et al., 2008).
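A minimal sketch of such a fit, assuming a data frame acc with daily accident counts per municipal area and illustrative covariate names (not the authors' actual dataset):

## Hurdle count model as described above: truncated Poisson for y > 0,
## logistic hurdle part for y = 0 vs y > 0. All data names are illustrative.
library(pscl)
fit <- hurdle(count ~ snow3 + tempalpin,  # log(lambda) = B beta; logit(p) = G gamma
              dist = "poisson",           # truncated Poisson component
              zero.dist = "binomial", link = "logit",
              data = acc)
summary(fit)  # the 'zero' block of coefficients corresponds to gamma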

4 Results and Discussion

Besides the explanatory variables snow, rain and tempalpin (mean temperature at sea level 1800 m), which are the daily values, we also consider the running averages over the last 3 days, denoted snow3 and rain3.



Model            Intercept   SE      Effect    SE       z-value   Log-lik
snow             −7.56       0.037    0.0091   0.0074     1.23    −7227.55
snow3            −7.73       0.040    0.0820   0.0059    13.89    −6970.26
tempalpin        −7.82       0.046   −0.0875   0.0065   −13.42    −7129.62
tempalpinrestr   −7.05       0.065   −0.0072   0.0094    −0.77    −5726.08
rain             −7.37       0.036   −0.7236   0.1033    −7.01    −7133.90
rain3            −7.29       0.037   −0.7188   0.0748    −9.62    −6911.12
snow3restr       −7.18       0.045    0.0303   0.0159     1.90    −5510.04
temp1800:snow3                       −0.0057   0.0025    −2.31

TABLE 1. Summary of hurdle models for the avalanche accident count data: effects of snow, rain, temperature etc. on avalanche accident counts, reporting intercept, effect, SE, z-value and log-likelihood of the logistic part.

Table 1 shows the results of our modelling approach, assuming that only the logistic part of the models (γ) is of interest for us. As we can see, snow has no significant influence on avalanche accident counts (z-value: 1.23). However, if we consider the mean new snow of the last 3 days instead of the actual daily snow, we observe a positive significant influence; see also the boxplots in Figure 1.

The mean temperature at sea level 1800 m (tempalpin) seems to have a negative effect on avalanche counts. But if we restrict the data to the months between December and March, the effect tempalpinrestr turns out not to be significant. It might be reasonable that in October, November and May (higher temperatures) the number of backcountry skiers going on ski tours is rather low.

Further, we notice that rain (actual value and mean of the last 3 days) has a significant negative effect on avalanche counts. Among other reasons, the impact of a lower number of skiers on the slopes is discussed.

Finally, we consider the model snow3 with the interaction term tempalpin:snow3 on the restricted database, which turns out to be significant. As a result of this (the interaction term is negative), the effect of snow on avalanche counts seems to become less important in case of higher temperatures.

References

Amt der Tiroler Landesregierung (1994–2010). Schnee und Lawinen, Jahresberichte. Innsbruck.

Auer, I., Nemec, J., Gruber, C., Chimani, B. and Turk, K. (2010). HOM–START. Homogenisation of climate series on a daily basis, an application to the StartClim dataset. Klima und Energiefonds Projektbericht. Wien.

Diggle, P.J. and Ribeiro, P.J. (2007). Model-based Geostatistics. Springer. New York.



FIGURE 1. Mean new snow (left) and mean new snow in the last 3 days (right) against number of daily avalanche counts in municipal areas.

Holler, P. (2009). Avalanche cycles in Austria: an analysis of the major events in the last 50 years. Natural Hazards, 48, 399 – 424.

Holler, P. (2012). The cumulation of avalanche accidents in certain periods - an analysis of backcountry events in Austria. In: Proceedings International Snow Science Workshop. Anchorage, Alaska.

Mair, R. and Nairz, P. (2012). Lawine. Die entscheidenden 10 Gefahrenmuster erkennen. Praxis-Handbuch. Tyrolia, Innsbruck.

Kuratorium fur alpine Sicherheit (1973–2011). Sicherheit im Bergland, Jahrbucher des Kuratoriums fur alpine Sicherheit. Innsbruck.

Pfeifer, C. (2010). On probabilities of avalanches triggered by alpine skiers. An empirically driven decision strategy for backcountry skiers based on these probabilities. In: Proceedings International Workshop of Statistical Modelling 2010. Glasgow.

Zeileis, A., Kleiber, C. and Jackman, S. (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8).


Small-sample one-sided testing in extreme value regression models

Eliane C. Pinheiro1, Silvia L. P. Ferrari1

1 Department of Statistics, University of Sao Paulo, Brazil

E-mail for correspondence: [email protected]

Abstract: The majority of papers on extreme value theory for modeling extreme data are supported by moderate or large sample sizes. We derive adjusted signed likelihood ratio statistics in a general class of extreme value regression models. The adjustments reduce the error in the standard normal approximation to the distribution of the signed likelihood ratio statistic. We use Monte Carlo simulations to compare the finite sample performance of the different tests. Our simulations suggest that the signed likelihood ratio test tends to be liberal when the sample size is not large, and that the adjustments are effective in shrinking its size distortion. We illustrate the application of the usual tests and their modified versions with real data.

Keywords: extreme value regression; Gumbel distribution; nonlinear models; signed likelihood ratio test; small-sample adjustments.

1 Introduction

The extreme value distribution is frequently used to model extreme events, such as extreme floods and wind speeds, and in survival or reliability analysis to model the logarithm of lifetime data. It has lately attracted particular interest, as extreme phenomena have become more common and more intense. We deal with a general class of extreme value regression models introduced by Barreto-Souza and Vasconcellos (2011). We derive five adjustments proposed by Barndorff-Nielsen (1986), DiCiccio and Martin (1993), Skovgaard (1996), Severini (1999) and Fraser et al. (1999) in this class of models. The adjusted statistics are approximately distributed as standard normal with a high degree of accuracy. We compare the finite sample performance of the signed likelihood ratio test and the adjusted signed likelihood ratio tests obtained in this work. Further, we illustrate an application of the usual signed likelihood ratio test and its modified versions on a real dataset.




2 Main results

Let y1, . . . , yn be independent random variables, where each yt, t = 1, . . . , n, has an extreme value distribution with parameters µt and σt and density function

f(y; µt, σt) = (1/σt) exp(−(y − µt)/σt) exp(−exp(−(y − µt)/σt)),   y ∈ IR,   (1)

where µ ∈ IR and σ > 0 are the location and dispersion parameters, respectively. The mean and the variance of yt are E(yt) = µt + Eσt and var(yt) = (π²/6)σt², respectively, where E is the Euler constant, E ≈ 0.5772. The extreme value regression model with dispersion covariates is defined by (1) and by two systematic components given by

g(µt) = ηt = η(xt, β), h(σt) = δt = δ(zt, γ), (2)

where β = (β1, . . . , βk)⊤ and γ = (γ1, . . . , γm)⊤ are vectors of unknown regression parameters (β ∈ IR^k and γ ∈ IR^m), and xt and zt are observations on covariates. Here, η(·, ·) and δ(·, ·) are continuously twice differentiable (possibly nonlinear) functions in the second argument. Finally, g(·) and h(·) are known strictly monotonic and twice differentiable link functions that map IR and IR+ onto IR, respectively.

> is the nuisance param-eter; notice that 1 + s = k +m. We consider signed likelihood-based testsof the null hypotheses H0 : ν ≤ ν0 and H0 : ν ≥ ν0, where ν0 is a fixedscalar. The signed likelihood ratio statistic is given by

R = R(ν0) = sgn(ν − ν0)[2(`(θ)− `(θ)]1/2,

where `(θ) is the log-likelihoood function; θ = (ν, ψ) and θ = (ν0, ψ) arethe unrestricted and the restricted maximum likelihoood estimators of θ,respectively.R is asymptotically distributed according to a standard normaldistribution if ν = ν0. The accuracy of the normal approximation can bepoor in small or moderate-sized samples.Barndorff-Nielsen (1986) proposed a modified signed likelihood statisticbased on an ancilary statistic, here refered as R∗. There are several methodsto approximate R∗, when it is not possible to explicit an ancilary statistic.We focus on approximations based on orthogonal parameters (DiCiccioand Martin, 1993), covariances (Skovgaard, 1996), empirical covariances(Severini, 1999), and an approximately ancillary statistic (Fraser et al.,1999). We obtain these aproximations in the extreme value regression model(1)-(2). Since the extreme value regression model (1)-(2) is a location-scalemodel, in the linear homoskedastic particular case, a = (a1, . . . , an)>, with

at = (yt − µ̂t)/σ̂t, is ancillary and (θ̂, a) is sufficient. We therefore also obtain the Barndorff-Nielsen adjustment in this case. The formulas are not shown to save space.
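The unadjusted statistic R itself is easy to compute by fitting the model twice. A minimal base-R sketch for the linear homoskedastic case of model (1), with µt = β0 + β1 xt and constant σ (illustrative only; the adjustments R∗ are not shown):

## Negative log-likelihood from density (1): -log f = log(sigma) + z + exp(-z)
gumbel_negll <- function(par, y, x) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])                  # log-parametrised for positivity
  z <- (y - mu) / sigma
  sum(log(sigma) + z + exp(-z))
}

## Signed likelihood ratio statistic R(nu0) for the interest parameter beta1
signed_lr <- function(y, x, nu0 = 0) {
  full  <- optim(c(mean(y), 0, 0), gumbel_negll, y = y, x = x)
  restr <- optim(c(mean(y), 0),        # restricted fit with beta1 fixed at nu0
                 function(p) gumbel_negll(c(p[1], nu0, p[2]), y, x))
  sign(full$par[2] - nu0) * sqrt(2 * (restr$value - full$value))
}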



3 Simulation results

We present Monte Carlo simulation results on the small-sample behaviour of the signed likelihood ratio test (R) and its adjusted versions: Barndorff-Nielsen (R∗), DiCiccio (R∗0), Skovgaard (R∗), Severini (R∗), and Fraser-Reid-Wu (R∗). We consider model (1) with constant dispersion and systematic component for the location given by

µt = β0 + β1xt1 + β2xt2 + β3xt3 + β4xt4 + β5xt5.

We consider the null hypothesis H0 : β5 ≤ 0 to be tested against the one-sided alternative H1 : β5 ≥ 0. The number of Monte Carlo replications is 10,000. Figure 1 plots the relative p-value discrepancies of the different tests versus the corresponding asymptotic p-values. It is clear that in the tail of the limiting normal distribution (small asymptotic p-values) the relative p-value discrepancy of the signed likelihood ratio test is very large. The tests based on adjusted statistics display better behavior, their p-value discrepancies remaining close to zero even for small asymptotic p-values.

[Figure 1: relative p-value discrepancy plotted against asymptotic p-value for n = 15, n = 20 and n = 30, comparing R with the adjusted statistics.]

FIGURE 1. Relative p-value discrepancy plots.

Our simulation results suggest that the signed likelihood ratio test can be markedly oversized in small and moderate-sized samples. Severini's and Skovgaard's adjusted tests can be conservative, but are less size distorted than the unadjusted test. DiCiccio's, Fraser, Reid & Wu's and Barndorff-Nielsen's adjusted signed likelihood ratio tests obtained in this paper perform better than the others. They are the least size distorted in most cases and are, therefore, recommended for practical applications. We shall emphasize that our simulations were carried out in heteroskedastic and/or nonlinear extreme value regression models. For the linear homoskedastic model, Barndorff-Nielsen's, DiCiccio's and Fraser, Reid & Wu's tests behaved equally well and clearly better than Severini's and Skovgaard's tests. For the nonlinear homoskedastic model, Severini's and Fraser, Reid & Wu's tests presented better performance than the others. Overall, Fraser, Reid & Wu's test is the best performing test.

4 Application

Our application deals with a data set (see Table 1) consisting of the maximum wind speed in January from 2001 to 2010 and the minimum temperature on the day on which the maximum wind speed was reached.



The data were extracted from the alpine tundra climate station (3743 m) of Niwot Ridge/Green Lakes Valley, Colorado, USA.¹

TABLE 1. Minimum temperature (°C) and maximum wind speed (m/s) in January

Year          2001    2002    2003    2004    2005    2006    2007    2008    2009    2010
Temp (°C)    −7.40  −11.95  −17.99  −25.63  −16.61  −10.93   −9.21  −26.13  −20.27  −19.00
Wind (m/s)   33.42   44.04   42.92   42.51   45.75   47.78   43.34   48.69   43.20   43.00

We assume that the maximum wind speed follows a maximum extreme value regression model (1) with constant dispersion and location sub-model given by µt = β0 + β1xt, for t = 1, . . . , 10, where the covariate x is the minimum temperature. The maximum likelihood estimates of the parameters (standard errors between parentheses) are: β̂0 = 34.3412 (3.0910), β̂1 = −0.4409 (0.1740), σ̂ = 3.4211 (0.8435). The values of the test statistics for testing the null hypothesis H0 : β1 ≥ 0 against H1 : β1 ≤ 0, with the corresponding p-values between parentheses, are: R = −2.2912 (0.0110), DiCiccio's R∗0 = −1.8989 (0.0288), Skovgaard's R∗ = −1.6085 (0.0539), Severini's R∗ = −1.7592 (0.0393), and Barndorff-Nielsen's and Fraser, Reid and Wu's R∗ = −1.9043 (0.0284). We notice that the p-value of the unmodified signed likelihood ratio test is 1.1%, while the p-values of the modified tests range from 2.8% (Fraser, Reid and Wu's R∗ and Barndorff-Nielsen's R∗) to 5.4% (Skovgaard's R∗). The unmodified test displays the smallest p-value, in accordance with its liberal behavior observed in our simulation study. The different adjustments weaken the evidence in favor of the alternative hypothesis.
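For the unadjusted statistic, the signed_lr() sketch from Section 2 can be applied directly to the Table 1 data (values transcribed from the table; the result should be close to the reported R = −2.2912):

temp <- c(-7.40, -11.95, -17.99, -25.63, -16.61,
          -10.93, -9.21, -26.13, -20.27, -19.00)
wind <- c(33.42, 44.04, 42.92, 42.51, 45.75,
          47.78, 43.34, 48.69, 43.20, 43.00)
R <- signed_lr(wind, temp)   # signed likelihood ratio statistic
pnorm(R)                     # one-sided p-value, approximately 0.011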

Acknowledgments: We acknowledge grants from CNPq, CAPES and FAPESP, Brazil.

References

Barndorff-Nielsen, O.E. (1986). Inference on full or partial parameters, based on the standardized signed log likelihood ratio. Biometrika, 73, 307 – 322.

Barreto-Souza, W. and Vasconcellos, K.L.P. (2011). Bias and skewness in a general extreme-value regression model. Computational Statistics & Data Analysis, 55, 1379 – 1393.

DiCiccio, T. J. and Martin, M. A. (1993). Simple modifications for signed roots of likelihood ratio statistics. Journal of the Royal Statistical Society, 55, 305 – 316.

Fraser, D. A. S., Reid, N. and Wu, J. (1999). A simple general formula for tail probabilities for frequentist and Bayesian inference. Biometrika, 86, 249 – 264.

Severini, T.A. (1999). An empirical adjustment to the likelihood ratio statistic. Biometrika, 86, 235 – 247.

Skovgaard, I.M. (1996). An explicit large-deviation approximation to one-parameter tests. Bernoulli, 2, 145 – 165.

¹ http://niwot.colorado.edu/exec/.extracttoolA?d-1cr23x.ml


Modelling performance of students with bivariate generalized linear mixed models

Hildete P. Pinheiro1, Mariana Rodrigues-Motta1, Gabriel Franco1

1 Department of Statistics, University of Campinas, Brazil

E-mail for correspondence: [email protected]

Abstract: We propose a bivariate generalized linear mixed model (GLMM) to evaluate the performance of undergraduate students from the State University of Campinas (Unicamp). For each student we have the final GPA score as well as the number of courses he/she failed during his/her Bachelor's degree. The courses are separated into three categories: Required (R), Elective (E) and Extracurricular (Ex) courses. Therefore, for each variable each student may have at most three measures. In this model we need to take into account the within-student correlation between required, elective and extracurricular courses as well as the correlation between the GPA score and the number of courses failed. The main interest of this study is the sector of High School education from which college students come, Private or Public. Because of affirmative action programs being implemented by the Brazilian government to include more students from Public Schools in the universities, there is great interest in studies of the performance of undergraduate students according to the sector of High School from which they come. The data come from the State University of Campinas (Unicamp), a public institution in the State of Sao Paulo, Brazil, and one of the top universities in Brazil.

Keywords: multivariate generalized mixed models; multivariate analysis; zero-inflated models.

1 Introduction

We propose here a multivariate generalized linear mixed model (GLMM) using the GPA scores and the number of courses failed by each student. The class of GLMMs has been studied by Williams (1982), Breslow and Lin (1995), and others in the univariate context. In the multivariate context, Rodrigues-Motta et al. (2013) used multivariate generalized models for correlated count data. Here, we will use a bivariate GLMM, with the response variables being the GPA score and the number of courses a student has failed. Note that in this case we have two different distributions for the response variables: a continuous variable (which may be normally distributed) and a count variable (which may be Poisson or negative binomial distributed, zero-inflated or not).

For each student we have all the grades in the courses taken in the university as well as the number of courses he/she failed during his/her Bachelor's degree. The courses are separated into three categories: Required (R), Elective (E) and Extracurricular (Ex) courses. Then, we can get the GPA score for each type of course (R, E or Ex). We also have their entrance exam grades (e.g., SAT scores) as well as their socioeconomic status, which are going to be considered as covariates in the model.

The main interest of this study is the sector of High School education from which college students come, Private (Pr) or Public (Pu). In Brazil, the great majority of middle class students go to Private High Schools (around 70%). The data come from the State University of Campinas (Unicamp), a public institution located in the State of Sao Paulo and one of the top research universities in Brazil. The socioeconomic data of more than 10,000 students admitted to Unicamp from 2000 through 2005 form the study database.

2 Statistical Methods

Let Yij = (Yij1, Yij2)⊤ be the vector of observations of subject i for courses of type j. Here, j can be up to three, since a student may not have taken extracurricular courses, or may have dropped out and not taken extracurricular or elective courses. In order to better understand each variable, we may first use two independent GLMMs to fit Yij1 and Yij2 separately, using a normal distribution for Yij1 (the GPA score) and a discrete distribution for Yij2 (the number of courses failed). We will consider Poisson, negative binomial and zero-inflated models for Yij2, since we are dealing with count data that can be zero-inflated or overdispersed.

Let ui = (ui1, ui2)⊤ be the vector of random effects, and suppose that Yij1 conditioned on ui1 and Yij2 conditioned on ui2 are independent, respectively. To model correlation among the three types of courses of subject i, let u1 = (u11, u21, . . . , un1)⊤ ∼ Nn(0, D1) and u2 = (u12, u22, . . . , un2)⊤ ∼ Nn(0, D2), where D1 and D2 are covariance matrices of order n × n. Therefore, u = (u1⊤, u2⊤)⊤ ∼ N2n(0, D), where D is block-diagonal:

D = ( D1   0
       0   D2 ).

We assume that, conditioned on a random effect ui1, the conditional expectation of the GPA (Yij1) is modeled as a general linear mixed model

E(Yij1 | ui1) = xij1⊤ βj1 + ui1,   j = 1, 2, 3.   (1)

For Yij2, we model

log(E(Yij2 | ui2)) = xij2⊤ βj2 + ui2,   j = 1, 2, 3,   (2)

Page 143: 29th International Workshop on Statistical Modelling · Proceedings of the 29th International Workshop on Statistical Modelling, Volume 2, G ottingen, July 14{18, 2014, Thomas Kneib,

Pinheiro et al. 135

where xij1 and xij2 are vectors of covariates and βj1, βj2 ∈ IR^{dj} are vectors of parameters.

To model Yij1 and Yij2 jointly we let the random effects u1 and u2 be correlated. In particular, we now assume that u = (u1⊤, u2⊤)⊤ ∼ N2n(0, D), where

D = ( D1   D12
      D12  D2 ).

Here, D12 is a diagonal matrix containing the covariances between ui1 and ui2. Conditioned on the random effects ui, Yij1 and Yij2 are independent; marginally, Yij1 and Yij2 may not be independent. In the bivariate model, conditional expectations are also modeled as in (1) and (2).

For the estimation of the model parameters, let β = (β11⊤, β21⊤, β31⊤, β12⊤, β22⊤, β32⊤)⊤ and ψ = (vect(D)⊤, β⊤)⊤. The likelihood function is the product of the contributions p(Yij; ψ), where p(Yij; ψ) is the joint probability distribution of Yij1 and Yij2. Therefore, the log-likelihood for the parameter vector ψ is given by

l(ψ; y) = Σ_{i=1}^{n} Σ_{j=1}^{ni} log ∫ [ Π_{k=1}^{2} p(Yijk | ui; ψ) ] φ2(ui; D∗) dui   (3)

where p(Yij | ui; ψ) is a bivariate distribution in the class of the GLMM and φ2 denotes the bivariate normal density function with covariance matrix D∗, which is constructed from the appropriate variance components of D. Maximization of (3) with respect to ψ is complicated by the integration with respect to ui. We use m-point Gaussian quadrature to approximate these integrals, and the dual quasi-Newton optimization algorithm (Pinheiro and Bates, 1995) is used to carry out the maximization of (3).
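The two univariate fits that precede the joint model can be reproduced with standard mixed-model software. A minimal sketch using the glmmTMB package (our choice for illustration; the authors use their own quadrature-based implementation), with illustrative data and variable names:

## Separate univariate GLMMs of Section 2; 'students' and its columns
## (gpa, nfail, ctype, sector, age, gender, student) are illustrative.
library(glmmTMB)

# Gaussian model (1) for the GPA score, with a random student intercept
m1 <- glmmTMB(gpa ~ ctype + sector + age + gender + (1 | student),
              family = gaussian(), data = students)

# Negative binomial model (2) for the number of courses failed
m2 <- glmmTMB(nfail ~ ctype + sector + age + gender + (1 | student),
              family = nbinom2(), data = students)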

3 Application

The data set is composed of 666,620 observations with all the grades of all the courses taken by more than 10,000 undergraduate students who entered Unicamp from 2000 to 2005. For each student, the grade obtained in each course taken during the whole period in the University, the type of course (R, E or Ex), the final result (Passed or Failed) and his/her entrance exam score (e.g., SAT score) were recorded. There is also information about the demographic and socioeconomic status of each student. There are 58.8% men and 41.2% women, the great majority (93.5%) between 17 and 23 years of age, divided into four different areas: Medical and Biological Sciences (18.23%), Engineering and Exact Sciences (59.6%), Human Sciences (14.94%) and Arts (7.23%). Also, 70.45% of the students come from Private High Schools (Pr HS) and 28.13% come from Public High Schools (Pu HS). The GPA scores were standardized within each entrance year and each major/course.

When modelling the GPA scores (Y1), in all the Areas the performance of students is better for Elective (E) and Extracurricular (Ex) courses than for Required (R) courses. The younger the student, the better his/her performance. For Arts and Biological Sciences women have better performance than men, while in Human Sciences and in Engineering and Exact Sciences men perform better than women. For all the Areas, the students coming from Pu HS perform better than those coming from Pr HS. On the other hand, students from Pr HS do much better in the entrance exam (e.g. SAT scores) than those from Pu HS. Interaction effects were also tested and were not significant at the 5% level.

For the number of courses failed (Y2), the negative binomial distribution was a better fit to the data, since there was clear overdispersion in the count data (the variance was at least ten times greater than the mean). We found that, for all the Areas, students failed fewer E and Ex courses than R courses. In Arts, students from Pr HS failed fewer courses than those coming from Pu HS. For all the other Areas, the average number of courses failed by students coming from Pu HS is less than for those coming from Pr HS. Also, women and younger students failed fewer courses. All the variance components in the models were positive, and the correlations between types of courses were all positive as well.

It is not surprising that the numbers of E and Ex courses failed are smaller than for R courses, since students take many more R courses than E or Ex courses. The surprising point here is the good performance in the university of the students coming from Pu HS, since they do not do very well in the entrance exam. There are many problems in Brazil with the Public School system, and this is why Unicamp started an affirmative action program in 2005, which gives a bonus on the entrance exam grades to students coming from Pu HS. This is still a debate in progress in Brazil.

References

Breslow, N. E. and Lin, X. (1995). Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika, 82(1), 81 – 91.

Pinheiro, J.C. and Bates, D.M. (1995). Approximations to the Log-likelihood Function in the Nonlinear Mixed-effects Model. Journal of Computational and Graphical Statistics, 4, 12 – 35.

Rodrigues-Motta, M., Pinheiro, H.P., Martins, E.G., Araujo, M.S. and dos Reis, S. (2013). Multivariate models for correlated count data. Journal of Applied Statistics, 1, 1 – 11.

Williams, D. A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics, 31(2), 144 – 148.


Identification of Outliers with Boosting Algorithms

R. Pino-Mejıas1, M.D. Cubiles-de-la-Vega1, J.M. Munoz-Pichardo1, J.L. Moreno-Rebollo1

1 University of Seville, Spain

E-mail for correspondence: [email protected]

Abstract: We propose a method to identify outliers in data sets where a boosting algorithm has previously been run according to the functional gradient descent formulation. The final negative gradient values computed by the boosting procedure are analyzed to detect outlier observations. This technique is illustrated with two algorithms, BinomialBoosting and AdaBoost, on a microarray data set and using the R statistical platform. The performance of the boosting procedure is clearly improved after the removal of the identified outliers, defining a research line about influence analysis in boosting models that we are currently exploring.

Keywords: Boosting; Classification models; R; Outliers; Microarray data

1 Introduction

This paper is concerned with the task of identifying multivariate outliers in data sets where a boosting algorithm has previously been executed. Specifically, this work arose in a quality analysis of microarray data. The dimensionality of these data sets, where thousands of genes are expressed in tens of samples, imposes a strong quality requirement for all the available samples. There is a large body of literature on outlier detection, but the proposal of detection methods for bioinformatic data sets is still a growing topic. In the discrimination setting, the identification of outliers is usually realized through the fit of some model; for example, an approach based on probabilistic discriminant partial least squares regression is adopted in Botella et al. (2010). We have considered a two-sample classification framework, and the information obtained from a boosting algorithm has been exploited to identify outlier samples.




2 A criterion for detecting outliers

Boosting was proposed to combine the outputs of many "weak" classifiers to produce a powerful "committee". Friedman et al. (2000) linked AdaBoost and other boosting algorithms to the framework of statistical estimation and additive basis expansion. We have followed this line through the library mboost in the statistical environment R, for binary classification problems. The boosting details are extensively described in Buhlmann and Hothorn (2007). The mboost package considers the problem of estimating a real-valued function

f∗(·) = argmin_{f(·)} E[ρ(Y, f(X))]   (1)

where ρ is a loss function. The generic functional gradient descent algorithm is an iterative procedure where in each step the negative gradient values are fitted with a base procedure, g(·), and the predictor is updated by:

f^{[m]}(·) = f^{[m−1]}(·) + ν g^{[m]}(·)   (2)

The choice of the step-length factor ν is of minor importance, as long as it is small, such as ν = 0.1. The major tuning parameter of boosting is the number of iterations M, which can be selected through a 10-fold cross-validation search in the range 1 to 3000. The selection of the other elements of the algorithm leads to different boosting procedures. The two main methods for classification appearing in Buhlmann and Hothorn (2007) have been explored in our study, namely BinomialBoosting and AdaBoost. They share the same base procedure: selecting the best variable in a simple linear model in the sense of ordinary least squares fitting. This way, the final model is a linear combination of the input variables, and the importance of each predictor can be assessed. The negative binomial log-likelihood is the loss function in the BinomialBoosting algorithm, whereas AdaBoost considers an exponential loss function. The R object resulting from the output of the glmboost function in the package mboost contains a great number of valuable elements. We have focused on the final values of the negative gradient of the loss function (Ui). A large absolute value of Ui can reveal a possible outlier, therefore we propose the following method, sketched in code below:

1. Extract the final values of Ui, i = 1, 2, . . . , n.

2. Apply the Tukey outlier identification procedure based on the boxplot output of the Ui, separating both classes.

Usually the Ui values are well differentiated between the two classes, masking any possible outlier. However, the separate analysis can reveal discordant values in each class.
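A minimal sketch of this criterion, assuming (as documented for mboost) that resid() on a fitted model returns the current negative gradient values; the gene matrix X and the class factor are illustrative:

## Fit BinomialBoosting (componentwise linear base learners) and flag
## Tukey outliers of the final negative gradients within each class.
library(mboost)

fit <- glmboost(x = X, y = class, family = Binomial(),
                control = boost_control(mstop = 422, nu = 0.1))

U <- resid(fit)   # final negative gradient values U_i

outliers <- lapply(levels(class), function(g) {
  Ug <- U[class == g]
  which(Ug %in% boxplot.stats(Ug)$out)   # Tukey boxplot rule within class g
})
names(outliers) <- levels(class)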

3 An application

The prostate data set contains measurements of the expression of 6033 genes for 102 observations: 52 prostate cancer patients and 50 healthy men. We have used the data frame singh2002, available from the sda library in the statistical environment R. A previous selection of the top 2000 genes according to the t-test comparing the means in both groups (under the assumption of equal variances) was performed, reducing the data set to 102 samples and 2000 genes. BinomialBoosting was applied first. A 10-fold cross-validation search of the number of iterations provided M = 422. Once the boosting model was fitted, we extracted the negative gradient values, and the box-and-whisker plot was built separately in each class, obtaining Figure 1, where the set of identified outlier samples is superimposed on the plot. Each label is formed by a letter corresponding to the class (Tumor or Normal) and a number denoting the order in each class. A similar analysis was carried out with AdaBoost, selecting M = 295 boosting iterations. The only difference between both models is the sample T46, which is detected by BinomialBoosting but not by AdaBoost.

An evaluation of the importance of the identified outlier samples has been performed by comparing the performance of the boosting models with all the samples and after the removal of the detected outlier samples. The success percentage, the sensitivity (percentage of tumor cases correctly classified), the specificity (percentage of normal cases correctly classified) and the area under the receiver operating characteristic curve (AUC) have been estimated with the leave-one-out (jackknife) method. Table 1 shows these measures for this data set. The first column identifies the model: BB denotes BinomialBoosting, and AB the AdaBoost algorithm. The rows with "All" refer to models where all the samples have been used, while the label "Out" marks models where the outlier samples have been previously removed. "Out1" denotes a boosting model where the number of iterations is the same number that was previously selected for the model fitted on the whole data set. The "Out2" model selects a new number of iterations by a new 10-fold cross-validation search. Table 1 reveals the dramatic effect of removing the identified outliers, with a clear improvement in the performance of the boosting algorithm.
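The leave-one-out estimates can be obtained with a loop of refits; a compact sketch (same illustrative X and class objects as above):

## Leave-one-out success proportion for BinomialBoosting; sensitivity
## and specificity follow by restricting to the tumor/normal classes.
pred <- character(nrow(X))
for (i in seq_len(nrow(X))) {
  fit_i <- glmboost(x = X[-i, ], y = class[-i], family = Binomial(),
                    control = boost_control(mstop = 422))
  pred[i] <- as.character(predict(fit_i, newdata = X[i, , drop = FALSE],
                                  type = "class"))
}
mean(pred == as.character(class))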

TABLE 1. Leave-one-out measures. Prostate data set.

Model      Success   Sensitivity   Specificity    AUC
BB, All     75.5        69.2          82.0        0.859
BB, Out1    88.9        86.0          91.8        0.972
BB, Out2    90.9        88.0          93.9        0.978
AB, All     75.5        67.3          84.0        0.853
AB, Out1    81.0        86.3          95.9        0.978
AB, Out2    81.0        86.3          95.9        0.978

4 Conclusions and future work

Our proposed procedure to identify outlier samples in boosting algorithms is an automatic and easy way to improve the quality of a multivariate data set in a binary classification setting.



[Figure 1: box-and-whisker plots of the negative gradients for classes N and T, with the outlier samples N17, T2 and T46 labelled.]

FIGURE 1. Box-and-whisker plot of negative gradients, Binomial Boosting, Prostate dataset.

Future lines may address the multi-class classification scenario and regression problems; other elements resulting from the boosting algorithms can also be explored. This last line could eventually lead to more elaborate residuals, or some leverage measure could also be defined.

References

Botella, C., Ferre, J. and Boque, R. (2010). Outlier identification and ambiguity detection for microarray data in probabilistic discriminant partial least squares regression. Journal of Chemometrics, 24, 434 – 443.

Buhlmann, P. and Hothorn, T. (2007). Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22, 477 – 505.

Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Annals of Statistics, 28, 337 – 407.


Incorporating sub-grid variability of environmental covariates in biodiversity modelling

Iain Proctor1,2, E. Marian Scott1, Rognvald I. Smith2

1 Department of Mathematics and Statistics, University of Glasgow, UK
2 Centre for Ecology and Hydrology, Edinburgh, UK

E-mail for correspondence: [email protected]

Abstract: Downscaling techniques are used widely in climate modelling to re-align environmental covariates at a coarse grid scale to smaller regions at a finer scale. These covariate predictions have uncertainty associated with them, which needs to be taken into account when using the values in a subsequent model. This paper describes the effect of varying uncertainty in downscaled environmental covariates in a generalised additive model of biodiversity data in the United Kingdom.

Keywords: Downscaling; Spatial Modelling; Biodiversity.

1 Introduction

The Countryside Survey (CS) was set up in 1978 with the intention of recording an 'ecological snapshot' of the state of U.K. habitats at a certain point in time. As part of achieving this end, a wide-scale vegetation survey is conducted roughly every decade. It comprises nearly 600 1 km² sites, within which various plots are sampled. The CS used a stratified random sampling technique to ensure a range of different habitats would be surveyed across Great Britain (Bunce et al. (1999)).

Extensive analyses have been conducted on CS data to assess the response of individual species to environmental change (see e.g. Smart et al. (2003)). The response variable in this analysis is a biodiversity measure of flora from across the United Kingdom, using data from the CS sampling plots. The aims of the analysis reported here are twofold: first, to predict the rainfall and nitrogen deposition at the location of the CS plot, rather than using the grid square value; second, to look at the relationship between the uncertainty associated with these predictions and the resultant model fit.




This is necessary in order that correct inferences are made about the relationships between these explanatory covariates and the biodiversity responses.

2 Method

Perfect Prognosis is the use of real-world statistical relationships between observed values of a predictand and selected variables (see e.g. Klein (1971)). Here, a similar regression relationship for the downscaling of rainfall data is outlined: for every 1 km² CS site i, there exists a 5 km² square with a non-downscaled rainfall value, which here is termed Ri1. There are (up to) eight other 5 km² squares adjacent to it. Thus, there are n terrestrial 5 km² squares associated with each survey site, where n ≤ 9.

ti, and SEq(z∗ik)is the standard error, given the prediction q

at altitude zik. The ti values are randomly sampled from the t-distributionwith n− 2 degrees of freedom.A multiplicative term is used to preserve the relative magnitude of theresidual error between sites. The error terms for the downscaled rainfallvalues are multiplied by MQ, a constant term. The randomly sampled qikare inserted into the model in place of the non-downscaled Ri1 value. Thissame procedure is applied to the 1km2 gridded estimates for total annual ni-trogen deposition (denoted Di1) in the same manner, by regressing againsttheir respective mean altitudes and downscaling using the parameter esti-mates, and multiplying the error terms by the constraint, MC .

3 Application

3.1 Data

A univariate index response is calculated for each plot surveyed: the Shannon-Wiener biodiversity index, which takes non-negative real values. Sik is the Shannon-Wiener index at plot k in site i. The covariates of rainfall and nitrogen deposition, after being downscaled, are then inserted into a generalised additive model with a Gaussian error structure, to which the biodiversity response is fitted. The fixed effects included in the model previously fitted are as follows:

Sik ∼ s(Ri : BHik) + s(Di : BHik) + s(Eik, Nik) + BHik,   (1)

The 's(·)' denotes a spline of one or two covariates. Interaction between covariates is denoted by ':'. The explanatory covariates are denoted thus: Easting (E) and Northing (N) define the positioning of each plot, accurate to 100 m. The downscaled covariates are denoted by Q for rainfall and C for total nitrogen deposition, in place of R and D respectively. Each plot is classified as having exactly one Broad Habitat (BH), using the species assemblage to inform which habitat is present there. There are seven distinct habitat types used. Each model is fitted 50 times, and the mean of the median p-values for each downscaled covariate and the median AICc are recorded. The constraining values of MQ and MC are altered in order to look at the effect of the prediction uncertainty on the resultant model fit.
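A model of this form can be fitted with, for example, the mgcv package (our choice for illustration; the paper does not name its GAM software), using factor-by splines for the ':' interactions:

## Sketch of model (1) with the downscaled covariates Q and C in place of
## R and D; the data frame 'cs' and its column names are illustrative.
library(mgcv)
fit <- gam(S ~ s(Q, by = BH) + s(C, by = BH) + s(E, N) + BH,
           family = gaussian(), data = cs, method = "REML")
summary(fit)$s.table   # p-values for the smooth terms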

4 Results

Model   MQ      MC      Rain p-value   Ndep p-value   AICc
1       10⁻³    10⁻⁵        0.39           0.16       1731
2       10⁻²    10⁻⁴        0.42           0.19       1733
3       10⁻¹    10⁻³        0.43           0.29       1764

TABLE 1. Final downscaled models for the differing values of the constraints MQ and MC, along with the median p-values and AICc.

As shown in the table, for less extreme error constraints, greater p-values are recorded for both the rainfall and nitrogen deposition covariates. As the error constraint is relaxed, the error associated with the downscaled values increases and thus the estimated p-value is greater. However, there does not appear to be a great change in the AICc value between models 1 and 2; there is also little change in the mean of the p-values for both downscaled covariates. From this it can be inferred that the model performance is not particularly reduced, as the model structure in both models is the same. Model 3 shows a greater increase in the AICc value and in the nitrogen deposition p-value; the rainfall p-value shows a lesser deviation. The greater the uncertainty associated with the downscaled covariates, as simulated by less stringent multiplicative error constraints, the less significant the relationships between model terms and the response become.

5 Conclusion

Using this approach for downscaling the covariate data resulted in the removal of rainfall and nitrogen deposition from this biodiversity model. Larger ζik error terms, which are added to the predicted covariate values, cause greater random scatter, which perturbs the covariate predictions, creating a worse model fit. This framework is applicable to other modelling scenarios, where explanatory covariates which contain unknown measurement error can be downscaled using fine-scale predictors, allowing the covariate uncertainty to be estimated. As shown in this example, when this local uncertainty estimation is included in the model, covariate significance can change greatly.

Acknowledgments: Special thanks to NERC for funding this research and to CEH for providing the required data.

References

Bunce, R. G. H., Smart, S. M., van de Poll, H., Watkins, J. W., and Scott, W. A. (1999). Measuring change in British vegetation. ECOFACT Volume 2. Institute of Terrestrial Ecology.

Klein, W. H. (1971). Computer Prediction of Precipitation Probability inthe United States. Journal of Applied Meteorology, 10, 903 – 915.

Smart, S. M., Robertson, J. C., Shield, E. J., and van de Poll, H. (2003). Locating eutrophication effects across British vegetation between 1990 and 1998. Global Change Biology, 9, 1763 – 1774.


Semiparametric ROC Regression based on Conditional Transformation Models

Maria Xose Rodriguez-Alvarez1, Thomas Kneib2, Carmen Cadarso-Suarez3

1 Department of Statistics and Operational Research, University of Vigo, Spain.
2 Chair of Statistics, Georg-August-Universitat Gottingen, Germany.
3 Department of Statistics and Operations Research, University of Santiago de Compostela, Spain.

E-mail for correspondence: [email protected]

Abstract: This paper presents two new approaches for the estimation of the covariate-specific ROC curve, based on the conditional transformation models recently proposed by Hothorn et al. (2014). The new proposals allow flexible specifications of the effect of the covariates on the ROC curve to be incorporated, in the spirit of generalised additive models. Moreover, since the estimation procedure relies on a functional gradient descent boosting approach, both approaches can handle correlated data, and model choice and variable selection can be performed automatically.

Keywords: ROC curve; Conditional Transformation Models; Boosting

1 Introduction

Before the routine application of a diagnostic test in clinical practice, any errors of classification must be quantified in order to check the diagnostic test's validity or invalidity. For tests with continuous or ordinal results, the most widely used measure of diagnostic accuracy is the receiver operating characteristic (ROC) curve. In many situations the performance of a diagnostic test, and therefore its discriminatory capacity, can be affected by covariates (see Pepe, 2003, pp 48–49 for examples). In such cases, interest should be focused on assessing the accuracy of the test according to the values of the covariates X = (X1, . . . , Xp). Let Y denote the result of the diagnostic test, and D the dummy variable indicating the true disease status (D = 1 for presence and D = 0 for absence of the disease). The covariate-specific ROC curve, given X = x, is defined as

ROC_{X=x}(t) = S_{D,X=x}(S⁻¹_{D̄,X=x}(t)),   0 ≤ t ≤ 1,   (1)

This paper was published as a part of the proceedings of the 29th Interna-tional Workshop on Statistical Modelling, Georg-August-Universitat Gottingen,14–18 July 2014. The copyright remains with the author(s). Permission to repro-duce or extract any parts of this abstract should be requested from the author(s).

Page 154: 29th International Workshop on Statistical Modelling · Proceedings of the 29th International Workshop on Statistical Modelling, Volume 2, G ottingen, July 14{18, 2014, Thomas Kneib,

146 Semiparametric ROC Regression based on CTM

where

S_{D,X=x}(c) = P[Y ≥ c | D = 1, X = x],
S_{D̄,X=x}(c) = P[Y ≥ c | D = 0, X = x].

In this case, a continuum of different ROC curves is obtained by varying the value x in the range of X.

Several approaches for estimating the covariate-specific ROC curve (1) have been suggested in the statistical literature, most of them within the general regression framework (see e.g., Pepe, 2003). In this paper, we adopt a strategy to estimate the covariate-specific ROC curve based on the semiparametric estimation of conditional distribution functions via conditional transformation models (CTMs), as recently proposed by Hothorn et al. (2014).

2 Covariate-Specific ROC Curve Estimation

In this section we present two different approaches for the estimation of the covariate-specific ROC curve:

• Conditional distribution function approach: This approach is based on estimating S_{D,X}(·) and S_{\bar{D},X}(·), and then computing the covariate-specific ROC curve via

ROC_{X=x}(t) = S_{D,X=x}(S^{-1}_{\bar{D},X=x}(t)),

where S^{-1}_{\bar{D},X=x}(t) = \inf\{y : S_{\bar{D},X=x}(y) \le t\}. Although this approach seems natural for the estimation of the covariate-specific ROC curve, it has the drawback of describing the dependency of the ROC curve on covariates only indirectly. This may make conclusions about effects on the accuracy of the diagnostic test problematic, since even if a diagnostic test Y is affected by covariates, this does not necessarily imply that the ROC curve is also affected by these covariates.

• Placement value approach: This approach is based on the placement value (Pepe and Cai, 2004) for Y_D, defined as

PV_D \equiv S_{\bar{D},X=x}(Y_D).

Given that

P[PV_D \le t \,|\, X = x] = P[S_{\bar{D},X=x}(Y_D) \le t \,|\, X = x]
= P[Y_D \ge S^{-1}_{\bar{D},X=x}(t) \,|\, X = x]
= S_{D,X=x}(S^{-1}_{\bar{D},X=x}(t))
= ROC_{X=x}(t),

the covariate-specific ROC curve can be viewed as the conditional distribution function of the placement values PV_D. In contrast to the first approach, here the covariate-specific ROC curve is estimated directly. This makes it possible to evaluate the effect of the covariates on the diagnostic accuracy directly (a small numerical sketch follows below).
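As a small numerical illustration of the placement value idea (a base R sketch of our own, not the CTM-based estimator developed in this paper; all names are ours), an empirical ROC curve at a fixed covariate value can be computed from samples yD of diseased and yH of non-diseased test results:

# empirical covariate-specific ROC curve via placement values
roc_pv <- function(yD, yH, t = seq(0, 1, by = 0.01)) {
  SH <- function(c) mean(yH >= c)          # empirical survival function, non-diseased group
  pv <- sapply(yD, SH)                     # placement values PV_D = S_Dbar(Y_D)
  sapply(t, function(tt) mean(pv <= tt))   # ROC(t) = empirical cdf of the PV_D at t
}

## e.g. roc_pv(rnorm(200, mean = 1), rnorm(200)) traces the ROC on a grid of t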


As can be observed, in both approaches the estimation of the covariate-specific ROC curve turns into the problem of estimating conditional distribution functions. In this paper, we follow the work by Hothorn et al. (2014), where a new semiparametric estimator of the conditional distribution function is proposed based on conditional transformation models. Specifically, transformation models rely on the specification

h (Y |x) = ε, (2)

where ε is an error term, and h(·|x) : R → R is a monotonically increasing transformation function. On the basis of model (2), the conditional distribution function of Y given the covariates x can be evaluated via

P[Y \le y \,|\, X = x] = P[h(Y|x) \le h(y|x) \,|\, X = x] = g(h(y|x)),

where g is the cumulative distribution function of the error term ε in (2). Hothorn et al. (2014) propose to decompose the transformation function h(·|x) in (2) as

h(y|x) = \beta_0 + h_Y(y) + \sum_{u=1}^{U} f_u(x_u) + \sum_{v=1}^{V} h_v(y, x_v), \qquad (3)

where x_u and x_v denote subsets of the covariate vector x, and h_Y, f_u and h_v define generic representations of different types of covariates and effects (such as usual linear, parametric effects, smooth nonlinear effects of continuous covariates, varying coefficient terms, or bivariate smooth functions). To estimate model (3) the authors propose the use of a component-wise boosting procedure. Among the advantages of this approach are: 1) boosting techniques are less sensitive to the presence of correlated errors; 2) it allows for a great variety of specifications for the partial functions in (3) (e.g. random and spatial effects); and 3) component-wise boosting techniques provide automatic variable selection and model choice procedures. By using CTMs for the estimation of the covariate-specific ROC curve, all the aforementioned advantages are therefore made available in the context of ROC regression.

3 Application to endocrine data

In this section we illustrate our proposal using data from the endocrine field. The aim of the study was to evaluate the discriminatory capacity of the Body Mass Index (BMI) when predicting the presence of cardiovascular disease risk factors, accounting for the possible effect of age and gender on the accuracy of this anthropometric measure. For both approaches, the effect of age was incorporated in a flexible way by using P-splines in combination with B-spline basis functions. Moreover, the interaction between age and gender was also incorporated when estimating the conditional distribution functions involved in both approaches.



FIGURE 1. Estimated conditional AUC for the BMI along age, for men and women. The solid line represents the results based on the 'conditional distribution function approach' and the dotted line those based on the 'placement value approach'.

It is well known that a crucial point when using boosting procedures is the choice of the number of boosting iterations. For this study, 10-fold cross-validation was used to determine the optimal number of iterations and so to avoid overfitting.

Figure 1 shows the estimated conditional area under the ROC curve (AUC) based on the two proposals presented in this work. As can be observed, both methods yielded similar results. It is interesting to note that whereas for men the accuracy of the BMI is more or less constant along age, for women age displays an important effect on the accuracy of this measure.

Acknowledgments: The authors acknowledge the support received in the form of the Spanish Ministry of Science and Innovation grant MTM2011-28285-C02-01, and thank the Galician Endocrinology & Nutrition Foundation (FENGA) for having supplied the database used in this study.

References

Hothorn, T., Kneib, T., and Bühlmann, P. (2014). Conditional Transformation Models. Journal of the Royal Statistical Society: Series B, 76, 3 – 27.

Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, New York.

Pepe, M.S., and Cai, T. (2004). The Analysis of Placement Values for Evaluating Discriminatory Measures. Biometrics, 60, 528 – 535.


Approximating unconditioned first-passage-time densities for diffusion processes

P. Román-Román1, J.J. Serrano-Pérez1, F. Torres-Ruiz1

1 Dpto. de Estadística e I.O., Universidad de Granada, Avda Fuentenueva s/n, 18071 Granada, Spain

E-mail for correspondence: [email protected]

Abstract: The unconditioned first-passage-time problem for diffusion processes is addressed by means of a new version of the R package fptdApprox. The main new functionality of this version is described, and an application is presented.

Keywords: Diffusion processes; unconditioned first-passage times; Volterra integral equations; FPTL function; R.

1 Introduction

Let \{X(t); t_0 \le t \le T\} be a diffusion process defined on a real interval I, with infinitesimal moments A_1(x, t) and A_2(x, t), and let S(t) be a differentiable function. The first-passage time (f.p.t.) of the process through the boundary S(t), provided X(t_0) = x_0, is defined as the random variable

T_{S(t),x_0} =
\begin{cases}
\inf\{t \ge t_0 : X(t) > S(t) \,|\, X(t_0) = x_0\}, & \text{if } x_0 < S(t_0),\\
\inf\{t \ge t_0 : X(t) < S(t) \,|\, X(t_0) = x_0\}, & \text{if } x_0 > S(t_0).
\end{cases}
\qquad (1)

The density function of T_{S(t),x_0}, g(S(t), t \,|\, x_0, t_0), is the solution to the Volterra integral equation of the second kind

g(S(t), t \,|\, x_0, t_0) = \rho \left[ -2\,\Psi(S(t), t \,|\, x_0, t_0) + 2 \int_{t_0}^{t} g(S(\tau), \tau \,|\, x_0, t_0)\, \Psi(S(t), t \,|\, S(\tau), \tau)\, d\tau \right], \qquad (2)


where \rho = \operatorname{sgn}(S(t_0) - x_0),

\Psi(S(t), t \,|\, y, \tau) = \frac{1}{2}\, f(S(t), t \,|\, y, \tau) \left[ S'(t) - A_1(S(t), t) + \frac{3}{4} \left. \frac{\partial A_2(x, t)}{\partial x} \right|_{x=S(t)} \right] + \frac{1}{2}\, A_2(S(t), t) \left. \frac{\partial f(x, t \,|\, y, \tau)}{\partial x} \right|_{x=S(t)},

and f(x, t \,|\, y, s) is the transition probability density function of the process. Version 1.2 of the R package fptdApprox (described in Román-Román et al. (2012)) implemented a general heuristic strategy, based on the information provided by the First-Passage-Time-Location (FPTL) function (defined in Román et al. (2008)), for the efficient application of numerical schemes for solving this Volterra integral equation.

When no conditioning on a fixed initial value is considered, the f.p.t. depends on the random variable X(t_0). In this case, definition (1) can be generalized as (referred to in the following as an unconditioned f.p.t.):

T_{S(t),X(t_0)} =
\begin{cases}
\inf\{t \ge t_0 : X(t) > S(t)\}, & \text{if } X(t_0) < S(t_0),\\
\inf\{t \ge t_0 : X(t) < S(t)\}, & \text{if } X(t_0) > S(t_0).
\end{cases}
\qquad (3)

In this last case, the density of (3) can be obtained from the family of densities \{g(S(t), t \,|\, x_0, t_0),\; x_0 \in J \setminus \{S(t_0)\}\} by means of the expression

g(S(t), t) = \lim_{\varepsilon \to 0^+} \left[ \int_{-\infty}^{S(t_0)-\varepsilon} g(S(t), t \,|\, x_0, t_0)\, f_{X(t_0)}(x_0)\, dx_0 + \int_{S(t_0)+\varepsilon}^{+\infty} g(S(t), t \,|\, x_0, t_0)\, f_{X(t_0)}(x_0)\, dx_0 \right], \qquad (4)

where f_{X(t_0)} is the density of X(t_0) and J its range of variation. Version 1.2 of the R package fptdApprox did not allow solving unconditioned f.p.t. problems. To overcome this limitation, a new version has been developed (available in Román-Román et al. (2013)).

2 Functionality of the new version of the R package fptdApprox

To approximate the density function for a conditioned f.p.t. problem (1) with the previous versions of the package, once the 'diffproc' class object was created with the definition of the generic diffusion process under study, the user had to run the FPTL() and summary.fptl() functions before the f.p.t. density could be found by means of the Approx.fpt.density() function.

The main achievement of the new version is that it allows directly solving both conditioned and unconditioned f.p.t. problems by means of the new function Approx.fpt.density(). The previous version of this function has been renamed Approx.cfpt.density().


For a conditioned f.p.t. problem (1), the new function makes an internal call to each of the FPTL() and summary.fptl() functions, in order to locate the f.p.t. variable, and to the Integration.Steps() function, in order to determine the suitable subintervals and integration steps to be used to efficiently approximate the conditioned f.p.t. density.

In the case of an unconditioned f.p.t. problem (3) the new function proceeds as follows: after selecting m values of the initial distribution (equally spaced in its variation range), it first locates the variation range of each conditioned f.p.t. variable by means of internal calls to the FPTL() and summary.fptl() functions. From this information, and by means of a single call to the Integration.Steps() function, it obtains a common and suitable sequence of instants in the time interval [t_0, T] (according to the selected options) where it can approximate all the f.p.t. densities involved. Then, for each time instant in this sequence, the Approx.fpt.density() function computes the conditioned f.p.t. densities by approximating (2) numerically, and the unconditioned f.p.t. density by using (4).

In this sense, the code of the previous FPTL() and summary.fptl() functions has not changed, whereas the Integration.Steps() function has undergone substantial modification to include the new task of unifying the intervals and integration steps obtained for each particular conditioned problem, and thus to obtain a common network of points where the approximation of the f.p.t. densities can be evaluated.

3 Application

We consider the unconditioned first-passage problem through the boundary S = 2500 for the Gompertz-type diffusion process \{X(t); t_0 \le t \le T\} defined by the infinitesimal moments A_1(x, t) = m e^{-\beta t} x and A_2(x, t) = \sigma^2 x^2, with t_0 = 1, T = 30, m = 0.755152, \beta = 0.183128, \sigma = 0.0708605, and initial distribution \Lambda(5.1785, 0.0973), related to real data on rabbit weights. We use the following R code:

R> x <- c("m*x*exp(-beta*t)", "(sigma^2)*x^2",
+   "dnorm((log(x)-log(y)+(m/beta)*(exp(-beta*t)-exp(-beta*s))+(t-s)*sigma^2/2)/(sigma*sqrt(t-s)),0,1)/(x*sigma*sqrt(t-s))",
+   "plnorm(x,log(y)-(m/beta)*(exp(-beta*t)-exp(-beta*s))-(t-s)*sigma^2/2,sigma*sqrt(t-s))")
R> NewGompertz <- diffproc(x)

to define the Gompertz-type diffusion process and:

R> g <- Approx.fpt.density(dp = NewGompertz, t0 = 1, T = 30,
+   id = list("dlnorm(x, 5.1785, 0.0973)", "Lambda(5.1785, 0.0973)",
+     "\\Lambda(5.1785, 0.0973)", "Lognormal(5.1785, 0.0973)"),
+   S = 2500, list(m = 0.755152, beta = 0.183128, sigma = 0.0708605))

to approximate the f.p.t. density for this unconditioned problem.


The R code plot(g, cex=1.25, lwd=2) results in Figures 1a) and 1b), which show the approximate density function for the unconditioned f.p.t. problem and the approximate f.p.t. densities conditioned on each value x_0 selected from the initial distribution, respectively.

FIGURE 1. Unconditioned (a) and conditioned (b) f.p.t. densities for the diffusion process X(t), t in [1, 30], with X(1) ~ \Lambda(5.1785, 0.0973), A_1(x, t) = 0.755152 \cdot x \cdot \exp(-0.183128 \cdot t), A_2(x, t) = 0.0708605^2 \cdot x^2, through the boundary S(t) = 2500.

Acknowledgments: This work was supported in part by the Ministerio de Economía y Competitividad, Spain, under Grant MTM2011-28962.

References

Román, P., Serrano, J.J., Torres, F. (2008). First-passage-time location function: Application to determine first-passage-time densities in diffusion processes. Computational Statistics and Data Analysis, 52, 4132 – 4146.

Román-Román, P., Serrano-Pérez, J.J., Torres-Ruiz, F. (2012). An R package for an efficient approximation of first-passage-time densities for diffusion processes based on the FPTL function. Applied Mathematics and Computation, 218, 8408 – 8428.

Román-Román, P., Serrano-Pérez, J.J., Torres-Ruiz, F. (2013). fptdApprox: Approximation of first-passage-time densities for diffusion processes, R package version 2.0. URL http://cran.r-project.org/package=fptdApprox.


Classification trees for preference data: a distance-based approach

Mariangela Sciandra1, Antonella Plaia1

1 Università degli Studi di Palermo, Italy

E-mail for correspondence: [email protected]

Abstract: In the framework of preference rankings, when the interest lies in explaining which predictors, and which interactions among predictors, are able to explain the observed preference structures, the possibility of deriving consensus measures using a classification tree represents a novelty and an important tool, given its easy interpretability. In this work we propose the use of a multivariate decision tree where a weighted Kemeny distance is used both to evaluate the distances between rankings and to define an impurity measure to be used in the recursive partitioning. The proposed approach also makes it possible to weight differences occurring in the top alternatives differently from those in the bottom ones.

Keywords: MRT; distance-based methods; preference data; Kemeny distance.

1 Introduction

In everyday life, ranking and classification are basic cognitive skills that people use to grade everything they experience. Moreover, many data collection methods in the social sciences rely on ranking and classification. Grouping and ordering a set of elements is considered easy and communicative, so one often observes rankings of sports teams, universities, countries and so on. A particular case of ranking data is represented by preference data, in which individuals express their preferences over a set of alternatives, called items from now on.

Preference rankings, for example consumers' preferences, are indicators of individual behaviour and, if subject-specific characteristics are available, besides a principal preference structure, interactions among predictors (of preference rankings) can be discerned. This allows profiles of respondents giving the same or similar rankings to be identified.

From a methodological point of view, preference analyses often model the probability of certain preference structures, finally providing the probabilities of choosing one single object. Such a problem has been widely


explored in the literature and many models have been proposed over the years, such as order statistics models (Dwass, 1957), distance-based models (Lee and Yu, 2010) and log-linear versions of the standard Bradley-Terry model (Dittrich et al., 2002). Compared to parametric ranking models, the approach by Lee and Yu (2010), being based on the definition of a decision tree, is characterized by a simpler interpretability. The aim of this work is to extend the idea of Lee and Yu and to build a decision tree model where the response variable is represented by the subject-specific preference rankings. The resulting decision tree will not be a classical Multivariate Regression Tree (MRT), because now each ranking vector should be considered as a unique multidimensional entity. Therefore, in this context, techniques known in the literature to define splits for multivariate response variables are not feasible, because they do not take into account the ordinal structure of a ranking. Building a tree-based structure with rankings as response variable requires the definition of an impurity measure, for example a suitable distance which is sufficiently discriminatory. Starting from the well-known Kemeny distance, in this work we derive a decision tree through a recursive partitioning that uses, as impurity measure, the sum of these distances within each node. In particular, we propose the use of a weighted version of the Kemeny distance (García-Lapresta and Pérez-Román, 2010) to deal also with decision problems where it is important to emphasize differences between rankings that occur in particular items.

The paper is organized as follows. In the next section a brief introduction to MRTs is presented, together with the proposed approach based on the use of the weighted Kemeny distance for building a classification tree for preference data. Finally, an example on a real data set is presented.

2 Multivariate Regression Trees

Most of the literature on classification and regression trees deals with univariate response variables. Some recent attempts to develop regression tree methods handling multivariate responses are due to De'ath (2002), Larsen and Speckman (2004) and Lee and Lee (2005). In this section we assume the ordinary regression tree methodology to be known, and give a brief overview of the class of multivariate regression trees (MRT). An MRT can be seen as the natural extension of the univariate classification and regression tree, where the multivariate response is predicted by some explanatory variables, both numeric and/or categorical. The main difference between the proposed extensions lies in the way in which they extend the definition of the partitioning metric. Several problems can arise when defining a measure of distance between rankings. First of all, the measure should at least ensure that equal preference structures have zero distance; moreover, the distance should increase as the difference between these structures increases. Several measures have been proposed in the literature in order to assess consensus: Bosch (2005) introduced the notion of consensus measures in the context of linear orders, while García-Lapresta and Pérez-Román (2008) extended Bosch's concept to the context of weak orders.


2.1 The proposed extension: a distance-based impurity measure

The weighted Kemeny distance has been proposed in order to deal with decision problems where differences in the top alternatives do not count the same as differences in the bottom ones; the weights make it possible to distinguish where these differences occur. So, let V = \{v_1, \ldots, v_m\} be a set of rankers with m \ge 3 and X = \{x_1, \ldots, x_n\} a set of alternatives with n \ge 3. Let L(X) be the set of linear orders on X and R \in L(X). A profile vector is a vector of linear orders such as

R = (R_1, \ldots, R_m).

Given R \in L(X), it is possible to define o_R as the position of each alternative in R, o_R = (o_R(x_1), o_R(x_2), \ldots, o_R(x_n)). In this way we can identify L(X) with S_n (the set of permutations of the first n integers). Given A \subseteq R^n such that S_n \subseteq A and a distance (metric) d : A \times A \to R, the distance of linear orders is the mapping d : L(X) \times L(X) \to R defined by

d(R_1, R_2) = d\big((o_{R_1}(x_1), \ldots, o_{R_1}(x_n)),\, (o_{R_2}(x_1), \ldots, o_{R_2}(x_n))\big)

\forall R_1, R_2 \in L(X). The Kemeny metric on L(X) is the mapping d_K : L(X) \times L(X) \to R defined as the cardinality of the symmetric difference between the linear orders. So, let R_1 \equiv (a_1, \ldots, a_n) \in L(X) and R_2 \equiv (b_1, \ldots, b_n) \in L(X); then

d_K(R_1, R_2) = d_K((a_1, \ldots, a_n), (b_1, \ldots, b_n)) = \sum_{i,j=1,\, i<j}^{n} |\operatorname{sgn}(a_i - a_j) - \operatorname{sgn}(b_i - b_j)|.

Let w = (w_1, \ldots, w_{n-1}) \in [0, 1]^{n-1} be a weighting vector such that w_1 \ge \cdots \ge w_{n-1} and \sum_{i=1}^{n-1} w_i = 1. The weighted Kemeny distance on L(X) associated with w is the mapping

d_{K,w}(R_1, R_2) = \frac{1}{2} \left[ \sum_{i,j=1,\, i<j}^{n} w_i\, |\operatorname{sgn}(a_i^{\sigma_1} - a_j^{\sigma_1}) - \operatorname{sgn}(b_i^{\sigma_1} - b_j^{\sigma_1})| + \sum_{i,j=1,\, i<j}^{n} w_i\, |\operatorname{sgn}(b_i^{\sigma_2} - b_j^{\sigma_2}) - \operatorname{sgn}(a_i^{\sigma_2} - a_j^{\sigma_2})| \right],

where (a_1, \ldots, a_n) \equiv R_1 \in L(X), (b_1, \ldots, b_n) \equiv R_2 \in L(X) and \sigma_1, \sigma_2 \in S_n are such that R_1^{\sigma_1} = R_2^{\sigma_2} \equiv (1, 2, \ldots, n).

The recursive partitioning process creates a nested sequence of subtrees T_m = \{root node\} \subset \cdots \subset T_0 = full tree, maximizing, at each step, the decrease in the node impurity

i(t) = \sum_{p > q} d_{K,w}(R_p, R_q),


according to all covariates and respective split points. Accordingly, the impurity measure for a generic subtree T having N_t terminal nodes is defined as

I(T) = \sum_{\text{all terminal nodes}} \; \sum_{p > q} d_{K,w}(R_p, R_q).

As an example, a dataset concerning the ranks assigned by n = 91 students to six different platforms for computer games has been considered. For each respondent, the age, the number of hours spent on gaming per week and a dummy own indicating if the platform is currently owned have been used as explanatory variables. Using the proposed MRT, the profiles of respondents giving the same rankings have been identified, even if results cannot be reported due to lack of space.

Acknowledgments: Special thanks to Giovanni Boscaino and Enza Capursi for the interesting discussions on the topic.

References

De'ath, G. (2002). Multivariate regression trees: A new technique for modeling species-environmental relationships. Ecology, 83, 1105 – 1117.

Dittrich, R., Hatzinger, R. and Katzenbeisser, W. (2002). Modelling the effect of subject-specific covariates in paired comparison studies with an application to university rankings. Journal of the Royal Statistical Society, Series C, 47(4), 511 – 525.

Dwass, M. (1957). On the distribution of ranks and of certain rank order statistics. The Annals of Mathematical Statistics, 28, 424 – 431.

García-Lapresta, J.L., and Pérez-Román, D. (2008). Some Measures of Consensus Generated by Distances on Weak Orders. In: Proceedings of the XIV Congreso Español sobre Tecnologías y Lógica Fuzzy, 477 – 483.

García-Lapresta, J.L., and Pérez-Román, D. (2010). Consensus Measures Generated by Weighted Kemeny Distances on Linear Orders. In: Proceedings of the 10th International Conference on Intelligent Systems Design and Applications (ISDA), 463 – 468.

Larsen, D. R., Speckman, P. L. (2004). Multivariate Regression Trees for Analysis of Abundance Data. Biometrics, 60, 543 – 549.

Lee, S.K. and Lee, J.C. (2005). On generalized multivariate decision tree by using GEE. Computational Statistics and Data Analysis, 49, 1105 – 1119.

Lee, P.H. and Yu, P.L.H. (2010). Distance-based tree models for ranking data. Computational Statistics and Data Analysis, 54, 1672 – 1682.


The impact of the misspecification of the random effects distribution on the prediction of the mixed logistic model

Karin A. Tamura1, Viviana Giampaoli1, Alexandre Noma2

1 Departamento de Estatística, Universidade de São Paulo, Brazil
2 Centro de Matemática, Computação e Cognição, UFABC, Brazil

E-mail for correspondence: [email protected]

Abstract: In studies that consider observations belonging to particular groups, the data present a hierarchical structure which can be modelled by mixed models. In addition to the fixed effects, these models can predict random effects for each group. The mixed logistic model assumes that the random effects follow the gaussian distribution. In this context, the objective is to study the problem of a misspecified distribution of the random effects and how it affects the binary classification of new groups. We conducted simulation studies considering two prediction methods for new groups and the following distributions for the random effects: gaussian, t-student, log-normal and exponential.

Keywords: empirical best prediction; nearest neighbors; mixed logistic model; distribution misspecification of the random effects; outcome prediction.

1 Introduction

Recently, the literature has drawn attention to the impact of the specification of the shape of the random effects distribution when fitting generalized linear mixed models (GLMM) to clustered or longitudinal data. In particular, McCulloch and Neuhaus (2011) demonstrated that the standard approach assuming a gaussian distribution of the random effects resulted in a good performance of predicted values across situations with different true random effect distributions. This paper focuses on the investigation of the binary classification performance of a mixed logistic model for new groups under different distributions for the random effects. We conducted an investigation of the random intercept model by simulation studies considering two prediction methods: empirical best prediction (EBP) and the nearest neighbor prediction method (NNPM). The challenge is to identify whether the prediction methods for the mixed logistic model perform well with respect to the assumption of the gaussian distribution of the random effect.

1.1 Model specification: Mixed logistic model

The mixed logistic model considers that, conditional on \alpha_i, the y_{ij} are independent Bernoulli variables, in which i indexes the group, i = 1, \ldots, q, and j indexes the observation within the i-th group, j = 1, \ldots, n_i. This model is given by

logit[P(y_{ij} = 1 \,|\, \alpha_i)] = \log\left[\frac{p_{ij}}{1 - p_{ij}}\right] = x_{ij}^t \beta + \alpha_i, \qquad (1)

in which \beta is an unknown vector of fixed effects (p \times 1) and \alpha_i is an unknown random intercept. Here, we considered the random intercept model. The vector x_{ij}^t of known covariates (1 \times p) is associated with \beta, defined by x_{ij}^t = (1, x_{1ij}, x_{2ij}, \ldots, x_{(p-1)ij}). This model assumes that the \alpha_i are i.i.d. with \alpha_i \sim N(0, \sigma^2), in which \sigma^2 is the unknown variance of the random intercept.

1.2 Prediction methods for new groups

Empirical Best Prediction (EBP)
Tamura and Giampaoli (2010) proposed the EBP to predict the outcome of the i-th new group at the observation level of model (1), based on the conditional expectation of the random effect. This method assumes that the random intercept follows the gaussian distribution. In this case, numerical integration methods must be used to solve the unidimensional integration.

Nearest Neighbor Prediction Method (NNPM)
The NNPM models the dependence of the outcome (the empirical random effect obtained from model (1)) on the covariates by means of the nearest neighbors technique. For a new group, the prediction of the random effect is based on the feature vectors (covariates), using a distance (e.g. Euclidean, Mahalanobis, City Block) and a centrality measure (e.g. mean, median, medoid) of the known random effects of the l nearest neighbors; a sketch is given below. The predicted random effect is then inserted into the linear predictor of the mixed logistic model, providing the outcome probability of a new group. Note that this approach does not require any distributional assumption on the empirical random effect, since the proximity criterion is based on a distance measure. For more details, see Tamura and Giampaoli (2013).
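A minimal sketch of the NNPM idea with the Euclidean distance and the median as centrality measure (our own illustration; all names are hypothetical):

# predict the random intercept of a new group from its l nearest training
# groups in covariate space, then return the outcome probability of model (1)
nnpm_predict <- function(x_new, X_train, alpha_hat, beta, l = 5) {
  d <- sqrt(colSums((t(X_train) - x_new)^2))     # Euclidean distances to all training groups
  alpha_new <- median(alpha_hat[order(d)[1:l]])  # median of the l nearest random effects
  plogis(sum(x_new * beta) + alpha_new)          # inverse logit of the linear predictor
}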

2 Simulation studies

We performed simulation studies to evaluate the performance of the EBP and the NNPM under different distributions of the random intercept. We consider a particular case of model (1), for p = 1, given by

logit[P(y_{ij} = 1 \,|\, \alpha_i)] = x_{ij}^t \beta + \alpha_i, \qquad (2)


with covariate x from longitudinal data on q = 241 subjects observed in n = 7 periods of time, and parameters \beta = 3 and \sigma = 10. For each subject, we generated the true random intercept by considering: (i) the real gaussian distribution with mean zero and standard deviation \sigma; (ii) the distributions t-student with 3 d.f. and standard deviation \sigma, log-normal with standard deviation \log(\sigma), and exponential with standard deviation 1/\sigma (extreme skewness and kurtosis). Thus, if the values of the random effects were generated under the distributions in (ii), the gaussian assumption is incorrect, i.e., there is a misspecification of the random effect. For each distribution shape of the random intercept, we generated 500 data sets. For each data set, the probability of each observation was calculated by using model (2), which was the input to generate the binary response y from a Bernoulli distribution. We randomly split each data set into two equal parts (50% of the groups), the training and the validation data sets. In the training data set, we considered the Laplace approximation to obtain parameter estimates of model (2), with the gaussian assumption for the random intercept. In the validation data set, we evaluated the prediction performance of the EBP and the NNPM by the AUC (area under the curve) measure.
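A compact sketch of this data-generating step for the gaussian case (i) (our own code mirroring the description above; the covariate values are placeholders):

q <- 241; n <- 7; beta <- 3; sigma <- 10
alpha <- rnorm(q, 0, sigma)                 # true random intercepts, case (i)
x <- matrix(rnorm(q * n), q, n)             # placeholder covariate values
p <- plogis(beta * x + alpha)               # model (2); alpha is recycled per row
y <- matrix(rbinom(q * n, 1, p), q, n)      # binary responses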

FIGURE 1. AUC for EBP and NNPM with the following random effect distributions: (a) gaussian, (b) t-student (3 d.f.), (c) log-normal, and (d) exponential.

Figure 1 presents boxplots of the AUC values of EBP and NNPM over 500 replications, for the four different distributions of the random intercept. In Figure 1 (a), EBP and NNPM presented similar prediction performance; however, EBP presented AUC


values with less dispersion. This result was expected since the EBP considers the gaussian distribution of the random effect in the prediction function. When we considered a distribution with symmetric or moderate skewness and extreme kurtosis, as in Figure 1 (b) and (c), the NNPM presented AUC values slightly higher than the EBP. On the other hand, in Figure 1 (d), where there is extreme skewness of the random effect, the NNPM outperformed the EBP. Table 1 exhibits the medians of the AUC values of Figure 1 and the difference of the medians in percentage points (p.p.) from NNPM to EBP.

TABLE 1. Median of the AUC values of EBP and NNPM for 500 replications.

Distribution    EBP    NNPM   Difference (p.p.)

gaussian        0.57   0.58   0.01
t-student       0.82   0.86   0.04
log-normal      0.82   0.89   0.07
exponential     0.69   0.85   0.16

3 Conclusions

We evaluated the impact of the distribution misspecification of the random effect by considering the prediction methods EBP and NNPM. The NNPM outperformed the EBP in the case of extreme skewness of the random effect. If the random effect followed the gaussian distribution (correct specification) or one with moderate skewness, EBP and NNPM presented similar levels of prediction. In general, the NNPM was more robust to random effect shape misspecification in relation to binary classification, since this approach does not require any distributional assumption on the empirical random effect. This is an advantage in practical applications because it makes it possible to make predictions while relaxing the assumption of normality.

Acknowledgments: We received financial support from FAPESP and CNPq.

References

McCulloch, C.E. and Neuhaus, J.M. (2011). Prediction of random effects in linear and generalized linear models under model misspecification. Biometrics, 67(1), 270 – 279.

Tamura, K.A. and Giampaoli, V. (2010). Prediction in Multilevel Logistic Regression. Communications in Statistics - Simulation and Computation, 39(6), 1083 – 1096.

Tamura, K.A. and Giampaoli, V. (2013). Nearest Neighbors Prediction Method for mixed logistic regression. In: 28th International Workshop on Statistical Modelling, Palermo, pp. 799 – 802.


Bayesian shape modelling of cross-sectional geological data

Thomai Tsiftsi1, Ian H. Jermyn1, Jochen Einbeck1

1 University of Durham, Department of Mathematical Sciences, Statistics and Probability Group, UK

E-mail for correspondence: [email protected]

Abstract: Shape information is of great importance in many applications. For example, the oil-bearing capacity of sand bodies, the subterranean remnants of ancient rivers, is related to their cross-sectional shapes. The analysis of these shapes is therefore of some interest, but current classifications are simplistic and ad hoc. In this paper, we describe the first steps towards a coherent statistical analysis of these shapes by deriving the integrated likelihood for data shapes given class parameters. The result is of interest beyond this particular application.

Keywords: shape analysis; classification; estimation; EM algorithm.

1 Introduction

Sand bodies, the sedimentary, subterranean remnants of ancient rivers, are important to both geology and the petroleum industry. In particular, their cross-sectional shapes help determine their oil-bearing capacity. Current classification schemes for sand body shapes are qualitative, simple, and ad hoc, and so there is a need for a quantitative analysis with the help of statistical models. There are several problems of interest: estimation of shape class parameters given labelled data shapes (a 'data shape' is an ordered set of points in R^2); classification of new data shapes; and unsupervised classification. Parameter estimation is described by the probability P(w|y, c), where w denotes the shape class parameters and y the dataset, which consists of several data shapes, together with their class labels c. By Bayes' theorem, this is given by:

P(w|y, c) \propto P(y|w, c)\, P(w). \qquad (1)

In this, as in all of the above problems, the major task is to calculate the likelihood P(y|w, c). This is the problem addressed in this paper. The problem is not unique to the sand body application: it occurs in many applications of shape modelling, and is thus of broad interest.


2 The likelihood

The calculation of the likelihood is complicated by the presence of many nuisance parameters that must be integrated over. The partitioned likelihood is:

P(y|w, c) = \sum_{b \in B} \int ds\, dg\, d\beta\, d\sigma\; P(y|\sigma, b, s, g, \beta)\, P(\sigma)\, P(b)\, P(s|\beta)\, P(g)\, P(\beta|w, c), \qquad (2)

where we have made a number of simplifying independence assumptions. Here

P(y|\sigma, b, s, g, \beta) = \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} |y(i) - g\,\beta(s(b^{-1}(i)))|^2 \right)

models errors in shape point collection as Gaussian white noise.

In the above expression, \beta is the underlying sand body shape modulo similarity transformations, which comes from a class C with parameters w, while g \in G \equiv SO(2) \ltimes R^2 \times R^+ is a similarity transformation generating the full sand body shape g\beta. Data formation is modelled as a sampling s of N points around the sand body shape, and a bijection b \in B, b : [1, \ldots, n] \to [1, \ldots, n], relating each point of the sand body shape to a unique point of the data shape, giving g\beta(s(b^{-1})), plus the above Gaussian noise with variance \sigma^2.

In previous work, e.g. Dryden and Mardia (1998) or Srivastava and Jermyn (2009), an algorithmic approach was taken to the integrals over the group G, using the Procrustes algorithm to compute a zeroth order Laplace approximation. Here we carry out the group integrations, and the integration over \sigma, analytically, resulting in a closed form expression. This is the main contribution of this paper.

First, we have to choose the priors for g and \sigma. The Jeffreys joint prior for g and \sigma was calculated to be a\,\mathrm{Var}(\beta(s))/\sigma^5, but this leads to a non-normalizable likelihood. Instead, a regularized version was employed. A Gaussian prior was used for translations; a uniform prior for the rotation angle; and a Rayleigh prior for scaling. These choices break translation invariance by effectively limiting the size of the two-dimensional domain in R^2 in which the shape points lie, and break scale invariance by effectively limiting the range of scales considered. With these priors, the result of the integrations over translations, rotations, and scalings is:

P(y|b, s, \beta) = \frac{1}{Z}\, \frac{1}{\sigma^{2\tilde{n}}} \exp\left[ -\frac{1}{2\sigma^2} \left( \tilde{n}\, \widetilde{\mathrm{Var}}(y) - \frac{\tilde{n}^2\, \widetilde{\mathrm{Cov}}^2(\beta(s(b^{-1})), y)}{\tilde{n}\, \mathrm{Var}(\beta(s(b^{-1}))) + 1/B^2} \right) \right], \qquad (3)

where:

• \tilde{n} = n + \frac{1}{D^2}

• \widetilde{\mathrm{Var}}(y) = \frac{1}{\tilde{n}} \left[ \sum_i |y_i|^2 - \frac{1}{\tilde{n}} \sum_i \sum_j y_i y_j \right]

• \widetilde{\mathrm{Cov}}(\beta(s(b^{-1})), y) = \frac{1}{\tilde{n}} \left[ \sum_i \beta(s(b^{-1}(i)))\, y_i - \frac{1}{\tilde{n}} \sum_i \sum_j \beta(s(b^{-1}(i)))\, y_j \right]

which are regularized versions of the number of points and the variance. B, D, \alpha, c are appropriate regulators and Z is the normalization constant.


A \Gamma prior was used for \sigma. After integration, this leads to:

P(y|b, s, \beta) = \Gamma\left(\tilde{n} + \alpha - \frac{3}{2}\right) \frac{1}{2Z} \times \left[ \tilde{n}\, \widetilde{\mathrm{Var}}(y) - \frac{\tilde{n}^2\, \widetilde{\mathrm{Cov}}^2(\beta(s(b^{-1})), y)}{\tilde{n}\, \mathrm{Var}(\beta(s(b^{-1}))) + 1/B^2} + c \right]^{\frac{3}{2} - \tilde{n} - \alpha}. \qquad (4)

This expression is the main result of the paper. It applies to any shape modelling application in which white Gaussian noise is added to a discrete set of shape points.

To finish the calculation of the likelihood, we have to perform the s and \beta integrations, and the b summation, in equation (2). This we do using Monte Carlo techniques. We use a uniform distribution on [0, 1]^N for the samplings s, while \beta, which consists of positive quantities such as aspect ratios and lengths, is modelled with \Gamma distributions, whose parameters over all classes constitute w. In previous work, when the group integrations were approximated using the Laplace approximation and the Procrustes algorithm, the sum over bijections could be approximated, again in a zeroth order Laplace estimation, using the Hungarian algorithm, since only a linear assignment problem was involved. In the integrated likelihood derived here, the terms involving \widetilde{\mathrm{Cov}}(\beta(s(b^{-1})), y) complicate the situation, and turn the linear assignment problem into quadratic assignment, which is NP-hard. Instead of using the Laplace approximation, we approximate the full summation using Monte Carlo, with a uniform prior on b.
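A hypothetical sketch of this Monte Carlo step (our own illustration; p_ybsbeta stands in for expression (4) and is passed in as a function):

# Monte Carlo approximation of the likelihood P(y | w, c) in (2)
lik_mc <- function(y, w, p_ybsbeta, M = 1000) {
  N <- nrow(y)                                   # y: N x 2 matrix of data shape points
  mean(replicate(M, {
    s    <- sort(runif(N))                       # uniform sampling positions on [0, 1]
    beta <- rgamma(1, shape = w[1], rate = w[2]) # e.g. an aspect ratio from its Gamma prior
    b    <- sample(N)                            # uniform random bijection
    p_ybsbeta(y, b, s, beta)                     # closed-form expression (4)
  }))
}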

3 Parameter estimation

The above result can be used to estimate the class parameters w given data shapes from each class. Figure 1 shows an example of a likelihood surface, in this case computed for the simplest case of a rectangle modulo similarities. This is a one-dimensional shape space that can be parametrized by the aspect ratio. There is thus one Gamma distribution involved, and w is two-dimensional. The surface is rough, since only a coarse grid in parameter space was used to reduce simulation time.

Maximum likelihood estimation was carried out using a built-in Matlab optimization function. The algorithm correctly navigated towards the maximum of the likelihood surface, but convergence was slow. We hope to improve this in the future by using a different maximization algorithm.

4 Conclusion

The main contribution of this paper is the analytical evaluation of the integrals over the similarity group and the noise variance in a model for shape data. The application considered here is the classification of the cross-sectional shapes of sand bodies, but the same techniques apply to any shape model involving Gaussian noise. This work is still in progress but shows promising signs of improving the current classification and estimation methods employed by geologists. There are technical obstacles in the form of numerical integrations which we hope to overcome in the near future.


FIGURE 1. The likelihood surface.

References

Dryden, I.L. and Mardia, K. (1998). Statistical shape analysis. J. Wiley.

Srivastava, A. and Jermyn, I.H. (2009). Looking for shapes in two-dimensional, cluttered point clouds. IEEE Trans. Patt. Anal. Mach. Intell., 31(9), 1616 – 1629.


Tests for contaminated time series

Márcio Valk1, Aluísio Pinheiro2

1 Federal University of Rio Grande do Sul, Brazil
2 University of Campinas, Brazil

E-mail for correspondence: [email protected]

Abstract: The presence of outliers in time series may cause problems in model specification, parameter estimation and forecasting. We propose a nonparametric algorithm with three main objectives: clustering; testing groupings; and classifying new time series. We employ a robust kernel quasi U-statistic and show that it works well even if some (or all) time series are contaminated by outliers. The set-up is based on models for which the probability of occurrence of outliers may be time-dependent. We motivate the methodology through its theoretical properties. The procedure is then illustrated in a simulation study and by its application to a real data set concerning Heart Rate Variability (HRV).

Keywords: Quasi U-statistics; outliers; dissimilarity measures.

1 Introduction

An Additive Outlier (AO) affects only a specific observation, while the influence of an Innovational Outlier (IO) propagates to subsequent observations (Fox, 1972). An extensive literature on outliers in time series is available (Chang et al., 1988; Ljung, 1993; Burridge et al., 2006; Huang et al., 2013). Ma and Genton (2000) address the problem of the robustness of the sample autocovariance function. Recent discussions on outliers in time series can be found in Fajardo et al. (2009), Hotta and Tsay (2011) and Reisen and Molinares (2012). A commonality in these works is that the probability of occurrence of one or more outliers is constant in time. We consider models with time-dependent probabilities, which can realistically fit phenomena subject to various external factors. As an example we analyze Heart Rate Variability (HRV) (Spangl and Dutter, 2007). We study the performance of clustering methods when data are contaminated, using tests which belong to the class of quasi U-statistics (Pinheiro et al., 2011; Valk and Pinheiro, 2012).


2 The Test Statistic

The model is defined by

Y_t = Z_t + \sum_{j=1}^{m} \omega_j X_{jt}, \qquad (1)

where the X_{jt} take values in \{0, 1, -1\}. We define P(X_{jt} = 1) = P(X_{jt} = -1) = p_{jt}/2 and P(X_{jt} = 0) = 1 - p_{jt}, for all t = 1, \ldots, T and j = 1, \ldots, m. Figure 1 presents six configurations for the vector of probabilities p which are successful in modeling a wide range of data (Hotta and Tsay, 2011).


FIGURE 1. Time-dependent probabilities of outlier occurrence (configurations (a)–(f)).

Fajardo et al. (2009) propose the robust estimator of the periodogram

RSDE(\omega_j) = \frac{1}{2\pi} \sum_{h=-(T-1)}^{T-1} \hat{\gamma}_R(h) \cos(h \omega_j), \qquad (2)

where \hat{\gamma}_R(h) = [Q^2_{n-h}(u + v) - Q^2_{n-h}(u - v)]/4, for vectors u and v of the first n-h and last n-h observations, respectively (Ma and Genton, 2000). Q_n(\cdot) is the \kappa-th order statistic of the \binom{n}{2} distances |Z_i - Z_j|, i < j, i.e., Q_n(Z) = c \times \{|Z_i - Z_j|,\, i < j\}_{(\kappa)}, for Z = (Z_1, Z_2, \ldots, Z_n) and c a constant used to guarantee consistency.

We employ two quasi U-statistics: B_n, based on the usual periodogram, and RB_n, based on the RSDE. We refer the reader to Pinheiro et al. (2011) for the general properties of quasi U-statistics, and to Valk and Pinheiro (2012) for their specific properties in a time series set-up. Basically, under the null hypothesis of homogeneity between all pairs of groups, i.e., that no real groups exist, B_n and RB_n are centered at 0, and n\sqrt{T} B_n and n\sqrt{T} RB_n are asymptotically normal, where n is the sample size and T the length of the series. Under the alternative hypothesis of heterogeneous groups, B_n and RB_n have positive means and \sqrt{nT}(B_n - E[B_n]) and \sqrt{nT}(RB_n - E[RB_n]) are asymptotically normal.

3 Simulation and Application to a Real ECG Data Sample

We use the six configurations in Figure 1. Three different amplitudes of the outliers are considered, w = 0, 3, 10, where w = 0 means no outliers. The length of the time series varies as T = 250, 500, 1000. Two test statistics are shown here:


B_n, based on the usual periodogram, and RB_n, based on the robust spectral density estimator RSDE. The simulations are performed in the software R with 1000 replications. Three underlying error structures are used: M1 is a pure error; M2 is an AR(1) with \phi = 0.5; and M3 is an ARMA(1,1) with \phi = 0.5 and \theta = -0.8. Four series were generated in each group being compared. Table 1 presents the empirical test sizes. An amplitude of w = 0 means that the time series are not contaminated. The significance level of the test is \alpha = 0.05. One should note the good empirical test sizes for RB_n even for w = 10, and its overall superior performance compared to B_n's.

TABLE 1. Empirical Test Sizes for Bn and RBn (cases a–f for the probability of occurrence).

Amp     Model  T      a           b           c           d           e           f
                      RBn  Bn    RBn  Bn    RBn  Bn    RBn  Bn    RBn  Bn    RBn  Bn
0 × 3   M1     250    0.04 0.91  0.05 0.26  0.08 0.75  0.02 0.19  0.04 0.35  0.05 0.04
        M1     500    0.05 1.00  0.06 0.60  0.06 1.00  0.02 0.58  0.05 0.55  0.05 0.14
        M1     1000   0.05 1.00  0.05 0.85  0.06 1.00  0.08 0.92  0.04 0.81  0.04 0.24
        M2     250    0.07 0.30  0.07 0.09  0.07 0.19  0.06 0.07  0.04 0.05  0.04 0.05
        M2     500    0.04 0.63  0.07 0.09  0.05 0.73  0.04 0.15  0.05 0.11  0.05 0.06
        M2     1000   0.06 0.95  0.06 0.21  0.05 1.00  0.05 0.13  0.05 0.16  0.04 0.04
        M3     250    0.07 0.71  0.04 0.19  0.04 0.43  0.04 0.20  0.03 0.20  0.05 0.09
        M3     500    0.04 0.98  0.06 0.41  0.06 0.98  0.02 0.30  0.05 0.23  0.02 0.08
        M3     1000   0.04 1.00  0.06 0.61  0.04 1.00  0.02 0.68  0.05 0.68  0.08 0.21
0 × 10  M1     250    0.07 1.00  0.08 0.97  0.06 1.00  0.05 0.93  0.03 0.97  0.02 0.72
        M1     500    0.05 1.00  0.04 1.00  0.07 1.00  0.05 1.00  0.05 1.00  0.07 0.96
        M1     1000   0.09 1.00  0.06 1.00  0.06 1.00  0.06 1.00  0.05 1.00  0.06 1.00
        M2     250    0.10 1.00  0.07 0.93  0.06 1.00  0.04 0.91  0.10 0.98  0.08 0.63
        M2     500    0.10 1.00  0.04 1.00  0.07 1.00  0.05 1.00  0.06 1.00  0.06 0.85
        M2     1000   0.07 1.00  0.06 1.00  0.06 1.00  0.06 1.00  0.05 1.00  0.05 1.00
        M3     250    0.05 1.00  0.04 0.94  0.06 1.00  0.06 0.95  0.04 1.00  0.03 0.75
        M3     500    0.04 1.00  0.07 1.00  0.05 1.00  0.08 1.00  0.05 1.00  0.04 0.92
        M3     1000   0.07 1.00  0.06 1.00  0.05 1.00  0.07 1.00  0.06 1.00  0.05 1.00

The ECG data set used here is available at the MIT-BIH Arrhythmia Database (http://www.physionet.org/physiobank/database/mitdb/) and is described in Moody and Mark (2001). It consists of ECG recordings of healthy patients and of unhealthy patients with clinically significant arrhythmias. We focus on the Heart Rate Variability (HRV), which is a continuous beat-by-beat measurement of interbeat intervals. The RHRV package of the software R was used to obtain the HRV time series from the ECG records. Outliers in the HRV can arise from several factors such as activity, emotion, sex, and age. However, in this case, the group of healthy patients is medically homogeneous (Spangl and Dutter, 2007). Using the non-robust test B_n one finds two spurious groups of healthy patients. The robust test provides the correct decision of not separating patients within the homogeneous group.

4 Conclusions

We propose homogeneity tests for groups of time series. The importance of a robust kernel is illustrated by simulation and on a real time series data set from the MIT-BIH Arrhythmia Database. In both instances, spurious grouping may result from a lack of robustness of the test statistic. The test behavior is greatly improved by the robust kernel.

References

Burridge, P. and Taylor, A. M. R. (2006). Additive outlier detection via extreme-value theory. J. Time Ser. Anal., 27(5), 685 – 701.


Chang, I., Tiao, G. C., and Chen, C. (1988). Estimation of time series parameters in the presence of outliers. Technometrics, 30(2), 193 – 204.

Fajardo, F., Reisen, V., and Cribari-Neto, F. (2009). Robust estimation in long-memory processes under additive outliers. J. Statist. Plann. Inference, 139(8), 2511 – 2525.

Fox, A. J. (1972). Outliers in time series. J. Roy. Statist. Soc. Ser. B, 34, 350 – 363.

Hotta, L. K. and Tsay, R. S. (2011). Outliers in GARCH processes. In: Economic Time Series: Modeling and Seasonality, pages 337 – 358. Chapman and Hall/CRC.

Huang, H., Mehrotra, K., and Mohan, C. K. (2013). Rank-based outlier detection. J. Stat. Comput. Simul., 83(3), 518 – 531.

Ljung, G. M. (1993). On outlier detection in time series. J. Roy. Statist. Soc. Ser. B, 55(2), 559 – 567.

Ma, Y. and Genton, M. G. (2000). Highly robust estimation of the autocovariance function. J. Time Ser. Anal., 21(6), 663 – 684.

Moody, G. and Mark, R. (2001). The impact of the MIT-BIH Arrhythmia Database. IEEE Eng. Med. Biol., 20(3), 45 – 50.

Pinheiro, A., Sen, P. K., and Pinheiro, H. P. (2011). A class of asymptotically normal degenerate quasi U-statistics. Ann. Instit. Statist. Math., 63(6), 1165 – 1182.

Reisen, V. A. and Fajardo, F. (2012). Robust estimation in time series with long and short memory properties. Ann. Math. Inform., 39, 207 – 224.

Spangl, B. and Dutter, R. (2007). Estimating spectral density functions robustly. REVSTAT, 5(1), 41 – 61.

Valk, M. and Pinheiro, A. (2012). Time-series clustering via quasi U-statistics. J. Time Ser. Anal., 33(4), 608 – 619.


Regularisation in Bayesian Expectile Regression

Elisabeth Waldmann1, Fabian Sobotka2, Thomas Kneib2

1 Institute of Infection and Global Health, University of Liverpool
2 Chair of Statistics, Georg-August-University Göttingen

E-mail for correspondence: [email protected]

Abstract: Recent interest in the development of flexible regression specifications has had a specific focus on describing more complex features of the response distribution than only the mean. The standard instrument in this situation is quantile regression, where conditional quantiles are related to a regression predictor. Computationally this is achieved by minimizing an asymmetrically weighted absolute residuals criterion, which induces additional complexity compared to standard least squares optimization. As a consequence, expectile regression, which relies on asymmetrically weighted squared residuals, has gained considerable interest, since expectile regression estimates can be obtained by iteratively weighted least squares fits. In this abstract, we introduce a Bayesian formulation of expectile regression that relies on the asymmetric normal distribution as auxiliary response distribution. As this distribution makes Bayesian regularisation priors applicable, we introduce spike and slab priors in this setup.

Keywords: asymmetric normal distribution; iteratively weighted least squares proposals; quantile regression; Bayesian regularisation priors

1 Expectile Regression

Suppose that regression data (y_i, z_i), i = 1, \ldots, n, on a continuous response variable y and a covariate vector z are given and shall be analyzed in a regression model of the form y_i = \eta_{i\tau} + \varepsilon_{i\tau}, where \eta_\tau is a predictor formed by the covariates and \varepsilon_\tau is an appropriate error term. Unlike in mean regression, where regression effects on the mean are of interest, we focus on situations where specific outer parts of the response distribution shall be studied. We will denote the extremeness of these outer parts by the asymmetry parameter \tau \in (0, 1), where \tau = 0.5 corresponds to the central part of the distribution while \tau \to 0 and \tau \to 1 yield the lower and upper part of the distribution, respectively. The standard approach for implementing such regression models is quantile regression, where we assume that the \tau-quantile of the error distribution equals zero, i.e. P(\varepsilon_{i\tau} \le 0) = \tau. This


This implies that the predictor $\eta_{i\tau}$ corresponds to the $\tau$-quantile of the response $y_i$, and the regression model can be estimated by minimizing

$$\sum_{i=1}^{n} w_\tau(y_i, \eta_{i\tau})\,|y_i - \eta_{i\tau}|$$

with asymmetric weights

$$w_\tau(y_i, \eta_{i\tau}) = \begin{cases} 1-\tau & y_i \le \eta_{i\tau} \\ \tau & y_i > \eta_{i\tau}. \end{cases}$$

To avoid numerical difficulties associated with the absolute deviations in the quantile regression specification, we will instead focus on the criterion

$$\sum_{i=1}^{n} w_\tau(y_i, \eta_{i\tau})\,(y_i - \eta_{i\tau})^2 \qquad (1)$$

which yields expectile regression estimates. This criterion has the advantage of being differentiable with respect to the regression predictor, so that estimates can be obtained by iteratively weighted least squares estimation. Basically, expectiles are an alternative possibility to characterize the distribution of a continuous random variable, where $\tau$ indexes the "extremeness" of the part of the distribution that shall be studied, see Newey and Powell (1987).
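To make the iteratively weighted least squares idea concrete, the following minimal sketch (our own illustration; the function name and toy data are hypothetical) fits a linear expectile regression by alternating between the asymmetric weights and a weighted least squares solve:

```python
import numpy as np

def expectile_regression(X, y, tau=0.5, max_iter=100, tol=1e-8):
    """Linear expectile regression: minimise criterion (1) by iterating
    between the asymmetric weights and a weighted least squares solve."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # least squares start
    for _ in range(max_iter):
        w = np.where(y <= X @ beta, 1.0 - tau, tau)   # asymmetric weights
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# toy usage: the 90%-expectile line of y given one covariate
rng = np.random.default_rng(1)
x = rng.uniform(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)
print(expectile_regression(np.column_stack([np.ones(200), x]), y, tau=0.9))
```

For $\tau = 0.5$ the weights are constant and the first update already returns the ordinary least squares fit.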

2 Asymmetric Normal Distribution

To make expectile regression accessible in a Bayesian formulation, we require the specification of an auxiliary response distribution. For expectile regression, the analogous distribution to the asymmetric Laplace distribution for quantile regression is an asymmetric normal distribution, denoted by $y_i \sim \mathrm{AN}(\eta_{i\tau}, \sigma^2, \tau)$, with density

$$p(y_i) = \frac{2}{\sqrt{\sigma^2\pi}\left(\sqrt{\frac{1}{1-\tau}} + \sqrt{\frac{1}{\tau}}\right)}\, \exp\!\left(-\frac{1}{2\sigma^2}\, w_\tau(y_i, \eta_{i\tau})\,(y_i - \eta_{i\tau})^2\right).$$

Maximising the likelihood then is equivalent to minimizing (1).
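For concreteness, this density can be transcribed directly; the placement of the bracketed term inside the normalising constant is our reading of the display above. Since the constant does not depend on $\eta_{i\tau}$, maximising the likelihood over the predictor indeed reduces to minimising (1):

```python
import numpy as np

def dan_density(y, eta, sigma2, tau):
    """Density of y ~ AN(eta, sigma2, tau) as displayed above; the constant
    does not involve eta, so it is irrelevant for estimating the predictor."""
    w = np.where(y <= eta, 1.0 - tau, tau)        # asymmetric weights
    const = 2.0 / (np.sqrt(sigma2 * np.pi)
                   * (np.sqrt(1.0 / (1.0 - tau)) + np.sqrt(1.0 / tau)))
    return const * np.exp(-0.5 * w * (y - eta) ** 2 / sigma2)
```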

3 Semiparametric Regression

Instead of only considering linear regression specifications, we are interested in applying expectile regression in the context of general semiparametric regression models with predictor

$$\eta_i = \beta_0 + \sum_{j=1}^{p} f_j(z_i)$$

where we suppressed the index $\tau$ for notational simplicity, $\beta_0$ is an intercept representing the overall level of the predictor, and the functions $f_j(z_i)$ reflect different types of regression effects depending on subsets of the covariate vector $z_i$. For the $f_j$, we make the following assumptions:


• The functions $f_j$ are approximated in terms of basis functions

$$f_j(z) = \sum_{k=1}^{K} \beta_{jk} B_k(z)$$

where the $B_k(z)$ are the basis functions and the $\beta_{jk}$ denote the corresponding basis coefficients.

• The prior for the vector of basis coefficients $\beta_j = (\beta_{j1}, \ldots, \beta_{jK})'$ is a multivariate normal distribution with density

$$p(\beta_j \mid \delta_j^2) \propto \exp\!\left(-\frac{1}{2\delta_j^2}\, \beta_j' K_j \beta_j\right)$$

where the precision matrix $K_j$ represents different types of structural assumptions about the function $f_j$, such as smoothness. The prior may be partially improper if the precision matrix $K_j$ is not of full rank.

This framework covers a large variety of modelling tools, see Fahrmeir, Kneib and Lang (2004).
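One standard instance covered by this framework (not necessarily the implementation used here) is a second-order random-walk smoothness prior, whose rank-deficient precision matrix makes the prior partially improper. A minimal sketch, with names of our own choosing:

```python
import numpy as np

def rw2_precision(K):
    """Precision matrix K_j = D'D of a second-order random walk, where D
    is the (K-2) x K second-difference matrix; K_j has rank K - 2, so the
    resulting Gaussian prior is partially improper."""
    D = np.diff(np.eye(K), n=2, axis=0)
    return D.T @ D

def log_prior_kernel(beta_j, delta2_j, K_j):
    """Log of the (unnormalised) smoothness prior density for beta_j."""
    return -0.5 / delta2_j * beta_j @ K_j @ beta_j

K_j = rw2_precision(10)
print(np.linalg.matrix_rank(K_j))  # prints 8, i.e. K - 2: not of full rank
```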

4 Regularisation

We complete the Bayesian specification by assuming inverse gamma priors for the error variance and the smoothing variances, i.e. $\sigma^2 \sim \mathrm{IG}(a_0, b_0)$ and $\delta_j^2 \sim \mathrm{IG}(a_j, b_j)$. Given the model specification, this implies that the full conditionals are also inverse gamma with updated parameters. Thus there is no difference to the full conditionals of the variances in geoadditive Gaussian regression. The same holds for adding regularisation priors, which are based on priors on the variances. As an example, we describe the implementation of spike and slab priors for linear effects based on the single variance components $\delta_j^2$ here, but the transfer to ridge priors or the Bayesian LASSO is straightforward. Spike and slab priors can be written as a mixture distribution in the following way:

$$\delta_j^2 \mid \nu_j \sim (1-\nu_j)\,\mathrm{IG}(a_{\delta_j^2}, \nu_0 b_{\delta_j^2}) + \nu_j\,\mathrm{IG}(a_{\delta_j^2}, b_{\delta_j^2}),$$

where the parameter $\nu_j$ follows a Bernoulli distribution with an uninformative Beta hyperprior,

$$\nu_j \mid \theta \overset{\mathrm{iid}}{\sim} \mathrm{B}(1, \theta), \qquad \theta \sim \mathrm{Beta}(a_\theta, b_\theta),$$

and thus indicates which component of the prior is active. The $\nu_0$ in the first term is chosen very small, to imitate the Dirac measure at zero that is used in other implementations of this method. The fact that we are dealing with expectiles does not have any influence on this structure, and thus the full conditionals for this add-on are exactly the same as in the Gaussian setup and can be drawn in a simple Gibbs sampler. In contrast, the full conditionals for the regression coefficients $\beta_j$ are not available in closed form. Thus we use proposal densities based on the penalized iteratively weighted least squares updates that would have to be performed to compute penalized expectile regression estimates in a frequentist backfitting procedure.
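Assuming the usual inverse gamma conjugacy for a Gaussian smoothness prior with precision $K_j$, the resulting Gibbs updates for $(\delta_j^2, \nu_j)$ might look as follows (a sketch with our own names; the $\theta$ update, $\mathrm{Beta}(a_\theta + \sum_j \nu_j, b_\theta + p - \sum_j \nu_j)$, is analogous):

```python
import numpy as np
from scipy.stats import invgamma

def gibbs_spike_slab_step(beta_j, K_j, nu_j, a, b, nu0, theta, rng):
    """One Gibbs update for (delta2_j, nu_j) under the spike-and-slab
    mixture prior above; a sketch assuming standard inverse gamma
    conjugacy for the Gaussian smoothness prior with precision K_j."""
    quad = beta_j @ K_j @ beta_j
    rk = np.linalg.matrix_rank(K_j)
    # conditional on nu_j, the prior is a single inverse gamma whose
    # scale is nu0 * b (spike, nu0 very small) or b (slab)
    b_eff = nu0 * b if nu_j == 0 else b
    delta2_j = invgamma.rvs(a + 0.5 * rk, scale=b_eff + 0.5 * quad,
                            random_state=rng)
    # the indicator's full conditional is Bernoulli, with odds given by
    # the two mixture components evaluated at the freshly drawn delta2_j
    p_slab = theta * invgamma.pdf(delta2_j, a, scale=b)
    p_spike = (1.0 - theta) * invgamma.pdf(delta2_j, a, scale=nu0 * b)
    nu_j = rng.binomial(1, p_slab / (p_slab + p_spike))
    return delta2_j, nu_j
```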


5 Discussion

The Bayesian formulation of expectile regression outlined in this abstract provides both the Bayesian counterpart to frequentist expectile regression and the expectile analogue to Bayesian quantile regression. While standard semiparametric regression specifications in expectile regression can be handled in a frequentist setting, the Bayesian formulation opens up the possibility to include more complex regression specifications such as Bayesian regularisation priors using a conditional Gaussian prior structure as sketched above, or Dirichlet process mixture priors for random effects. Moreover, Bayesian expectile regression comprises the determination of the smoothing variances $\delta_j^2$ as an integral part of the inferential procedure and provides measures of uncertainty also for complex functionals of the model parameters. However, the asymmetric normal likelihood will usually induce a model misspecification, and its impact will have to be studied in detailed simulations.

References

Fahrmeir, L., Kneib, T. and Lang, S. (2004): Penalized structured additive regression for space-time data: a Bayesian perspective, Statistica Sinica, 14, 731 – 761.

Newey, W.K. and Powell, J.L. (1987): Asymmetric least squares estimation and testing, Econometrica, 55, 819 – 847.

Scheipl, F., Fahrmeir, L. and Kneib, T. (2012): Spike-and-Slab Priors for Function Selection in Structured Additive Regression Models, Journal of the American Statistical Association, 107, 1518 – 1532.

Sobotka, F. and Kneib, T. (2012): Geoadditive expectile regression, Computational Statistics & Data Analysis, 56, 755 – 767.

Waldmann, E., Sobotka, F. and Kneib, T. (2013): Bayesian Geoadditive Expectile Regression, working paper on arXiv.org.


Nonparametric regression with interval-valued response

Andrea Wiencierz1

1 Department of Statistics, LMU Munich, Germany

E-mail for correspondence: [email protected]

Abstract: This contribution deals with nonparametric regression in the situation in which the response variable is only imprecisely observed. In particular, nonparametric Support Vector Regression methods are generalized to this setting within the general framework of Likelihood-based Imprecise Regression.

Keywords: Nonparametric regression; interval-censored response; Likelihood-based Imprecise Regression; Support Vector Regression.

1 Introduction

The goal of nonparametric regression is to obtain a quantitative description of the relationship between some explanatory variables and a response variable, without imposing a particular parametric shape of the describing function. Here, we focus on the case of continuous random variables. Hence, the interest lies in analyzing the relationship between one or more explanatory variables $X \in \mathcal{X} \subset \mathbb{R}^d$, with $d \in \mathbb{N}$, and a response variable $Y \in \mathcal{Y} \subset \mathbb{R}$. For nonparametric regression in this situation, many different methods have been suggested, including spline-based and kernel-based methods.

Spline-based methods can be treated within the framework of linear (mixed) models, where the function to be estimated models the conditional mean of the random variable $Y$ and the error distribution given the covariates $X$ is often assumed to be normal with expectation zero and fixed variance (see, e.g., Fahrmeir et al., 2013, Chapter 8). Kernel-based methods, in particular Support Vector Regression (SVR) methods, are based on a more general probability model for the joint distribution of the analyzed variables, where the conditional distribution of $Y$ given $X$ is not restricted to a parametric class of distributions. Moreover, different loss functions can be chosen, leading to different results. Hence, in SVR methods, the regression function does not necessarily model the conditional mean of the response (see, e.g., Steinwart and Christmann, 2008).

Usually, regression methods are based on the assumption that the analyzed data are precise and correct observations of the variables of interest. In many practical settings, however, the variables of interest cannot be measured precisely. For example, survey data on income are often rounded or completely missing, or may only be available as income classes. All these cases of uncertainty about the exact income values can be represented by intervals of the observation space. Therefore, considering interval-censored observations of the variables of interest allows accounting for different kinds of data uncertainty within the same framework. Here, we consider the situation in which the response variable is possibly observed as a closed and bounded interval and discuss possible generalizations of nonparametric regression to this situation, thereby focusing on SVR methods.

2 SVR with interval-valued response

To formally describe the statistical problem of regression with imprecise data, we write $(X, Y) = V$ and furthermore assume here that $\mathcal{X} \times \mathcal{Y} = \mathcal{V}$ is a compact subset of $\mathbb{R}^{d+1}$. Instead of $V$, only the random set $V^* \subseteq \mathcal{V}$ can be observed, whose possible realizations are of the form $V^* = X \times [\underline{Y}, \overline{Y}]$, with $X \in \mathcal{X}$ and $\underline{Y}, \overline{Y} \in \mathcal{Y}$ such that $\underline{Y} \le \overline{Y}$. Moreover, we assume that $(V, V^*) \sim P \in \mathcal{P}$, where $\mathcal{P}$ entails all probability measures $P'$ satisfying $P'(V \in V^*) = 1$. This means the imprecise observations are supposed to always contain the precise values of interest, which is a common assumption in approaches to analyzing imprecise data.

Standard SVR (when the variables are precisely observed) can be formulated as a decision problem on $\mathcal{F} \times \mathcal{P}_V$, where $\mathcal{F}$ denotes the space of considered regression functions and $\mathcal{P}_V$ the set of marginal distributions of the precise data. In SVR, $\mathcal{F}$ is assumed to be a Reproducing Kernel Hilbert Space (RKHS) with norm $\|\cdot\|_{\mathcal{F}} = \sqrt{\langle \cdot, \cdot \rangle_{\mathcal{F}}}$ and reproducing kernel function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $(x, x') \mapsto \kappa(x, x')$. The loss function of the corresponding decision problem is the risk functional $E_f(P_V)$, which is the expected value (under $P_V$) of a function of the associated residual $R_f$, defined for each $f \in \mathcal{F}$ by $R_f = |Y - f(X)|$. For example, $E_f(P_V) = E(R_f^2)$ is the Least Squares (LS) loss. The optimal description of the relationship of interest is the regression function minimizing the risk associated with the true distribution $P_V$. Of course, $P_V$ is usually unknown, but it can be estimated on the basis of some independent and identically distributed (i.i.d.) data by their empirical distribution $\hat{P}_V$, and the corresponding empirical risk $E_f(\hat{P}_V)$ can be minimized. Yet, to avoid overfitting when considering large RKHSs, as is done for nonparametric regression, the risk functional is supplemented by a penalization. The regularized risk functional is given by

$$E_{f,\lambda}(P_V) = E(\psi(R_f)) + \lambda \|f\|_{\mathcal{F}}^2,$$

for all $P_V \in \mathcal{P}_V$, each $f \in \mathcal{F}$, some fixed $\lambda > 0$, and $\psi : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ being convex. Hence, given some i.i.d. observations $(x_1, y_1), \ldots, (x_n, y_n)$, the regularized risk with respect to their empirical distribution $\hat{P}_V$ is minimized. It can be shown that there exists a unique optimal function $\hat{f} \in \mathcal{F}$, which can be represented as $\hat{f} = \sum_{j=1}^n \alpha_j \kappa(\cdot, x_j)$, with $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ (see, e.g., Steinwart and Christmann, 2008, Theorem 5.5). This result allows determining $\hat{f}$ in practice by minimizing $E_{f_\alpha,\lambda}(\hat{P}_V)$ over $\alpha \in \mathbb{R}^n$, with

$$E_{f_\alpha,\lambda}(\hat{P}_V) = \frac{1}{n} \sum_{i=1}^{n} \psi\Big(\Big|y_i - \sum_{j=1}^{n} \alpha_j \kappa(x_i, x_j)\Big|\Big) + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \kappa(x_i, x_j).$$
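For the LS loss $\psi(r) = r^2$, this finite-dimensional problem has a closed-form minimiser, $\hat{\alpha} = (K + n\lambda I)^{-1} y$, which the following sketch uses (a minimal illustration assuming a Gaussian kernel; all names are ours):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """RBF kernel kappa(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def svr_ls_fit(X, y, lam=0.1, gamma=1.0):
    """Minimise (1/n)||y - K alpha||^2 + lam * alpha' K alpha; the convex
    objective is minimised by a solution of (K + n*lam*I) alpha = y."""
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def svr_predict(alpha, X_train, X_new, gamma=1.0):
    """Evaluate f_hat = sum_j alpha_j * kappa(., x_j) at new points."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha
```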


Now, let us consider the situation in which the response variable is imprecisely observed. Instead of precise observations, only imprecise i.i.d. data $x_1 \times [\underline{y}_1, \overline{y}_1], \ldots, x_n \times [\underline{y}_n, \overline{y}_n]$ are available, with empirical distribution $\hat{P}_{V^*}$. Therefore, $E_{f,\lambda}(\hat{P}_V)$ cannot be directly determined and minimized. But the empirical distribution of the imprecise data defines a set $[\hat{P}_{V^*}]$ of compatible marginal distributions of the precise data. How to use the information about the precise variables of interest that is contained in $[\hat{P}_{V^*}]$? Utkin and Coolen (2011) suggested choosing either a minimin or a minimax strategy, i.e., minimizing either the lower empirical regularized risk

$$\underline{E}_{f,\lambda}(\hat{P}_{V^*}) = \min_{P'_V \in [\hat{P}_{V^*}]} E_{f,\lambda}(P'_V) = \frac{1}{n} \sum_{i=1}^{n} \min_{y_i \in [\underline{y}_i, \overline{y}_i]} \psi\Big(\Big|y_i - \sum_{j=1}^{n} \alpha_j \kappa(x_i, x_j)\Big|\Big) + \lambda \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \kappa(x_i, x_j)$$

or the upper one, $\overline{E}_{f,\lambda}(\hat{P}_{V^*}) = \max_{P'_V \in [\hat{P}_{V^*}]} E_{f,\lambda}(P'_V)$. In doing so, precise regression estimates are obtained; however, the meaning of the obtained functions is not clear. Moreover, the choice of the lower or the upper empirical risk as the decision criterion is arbitrary, because any risk value between them is equally plausible given the data at hand. A more reliable way to exploit the information provided by the imprecise data in SVR can be derived by considering SVR within the general framework for regression with imprecisely observed variables that was developed in Cattaneo and Wiencierz (2012) and called Likelihood-based Imprecise Regression (LIR).
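For any loss $\psi$ that is increasing in the absolute residual, the inner minimum over $y_i \in [\underline{y}_i, \overline{y}_i]$ is attained at the interval point closest to the prediction, and the inner maximum at the endpoint farthest from it, so both bounds are cheap to evaluate. A sketch (names are ours):

```python
import numpy as np

def interval_risks(alpha, K, y_lo, y_hi, lam, psi=np.square):
    """Lower and upper empirical regularised risk for interval responses:
    the inner min over y_i in [y_lo_i, y_hi_i] is attained at the interval
    point closest to the prediction, the inner max at the farthest endpoint
    (valid for any psi increasing in the absolute residual)."""
    pred = K @ alpha
    penalty = lam * alpha @ K @ alpha
    r_min = np.abs(pred - np.clip(pred, y_lo, y_hi))  # zero if pred inside
    r_max = np.maximum(np.abs(pred - y_lo), np.abs(pred - y_hi))
    return psi(r_min).mean() + penalty, psi(r_max).mean() + penalty
```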

3 A LIR method for SVR

In the LIR framework, the information provided by the imprecise data about the relationship between the precise variables is processed via the likelihood function on $\mathcal{P}$ induced by the imprecise observations. From this likelihood function, profile likelihood functions for characteristics of the distribution of the precise data can be derived in the considered situation, as shown in Cattaneo and Wiencierz (2012, Section 2). This allows determining, for each $f \in \mathcal{F}$, likelihood-based confidence intervals $E_{f,\lambda,>\beta}$ for $E_{f,\lambda}(P_V)$ by cutting the graph of the corresponding (normalized) profile likelihood function $lik_{E_f}$ for the risk functional at level $\beta$, i.e.,

$$E_{f,\lambda,>\beta} = \big\{\, e + \lambda \|f\|_{\mathcal{F}}^2 : e \in \mathbb{R}_{\ge 0} \text{ and } lik_{E_f}(e) > \beta \,\big\},$$

where $\lambda > 0$ and $\beta \in (0, 1)$ are fixed. For each function $f \in \mathcal{F}$, the confidence region $E_{f,\lambda,>\beta}$ contains all regularized risk values associated with $f$ that are plausible to a certain degree, the latter being determined by the choice of $\beta$. If the cutoff point $\beta$ tends to one, in the limit we have that $E_{f,\lambda,>\beta} = [\underline{E}_{f,\lambda}(\hat{P}_{V^*}), \overline{E}_{f,\lambda}(\hat{P}_{V^*})]$, that is, the entire interval is the Maximum Likelihood (ML) estimate of the regularized risk $E_{f,\lambda}(P_V)$ for each function $f$. By choosing a lower cutoff point, statistical uncertainty can directly be accounted for; e.g., $\beta = 0.15$ corresponds to an asymptotic coverage level of the confidence intervals of at least approximately 95%. Following the general LIR methodology, we apply the dominance principle to identify the set of all regression functions that are plausible in the light of the observations. Hence, all functions $f \in \mathcal{F}$ for which

$$\inf E_{f,\lambda,>\beta} \;\le\; \inf_{f' \in \mathcal{F}} \sup E_{f',\lambda,>\beta}$$


is satisfied are considered as the set-valued result of the regression analysis. The extent of the set of all undominated functions reflects the entire uncertainty about the relationship of interest, comprising both the uncertainty related to the imprecision of the data and the statistical uncertainty. For more details on the underlying LIR methodology, see Cattaneo and Wiencierz (2012, in particular Sections 2 and 3).

The implementation of the LIR method for SVR can be based on the minimax method by Utkin and Coolen (2011), when the ML estimate of the risk is considered as decision criterion. The minimax method allows determining the smallest regularized upper risk, which is necessary to identify the functions whose lower risk does not exceed this value. Thus, given the smallest upper bound, a random search over $\alpha \in \mathbb{R}^n$ can be performed to approximately determine the set of all undominated regression functions.
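A naive version of this procedure, using the ML-estimate risk intervals (the case $\beta \to 1$) and reusing interval_risks from the sketch above, could look as follows; the candidate generation and its scale are purely illustrative:

```python
import numpy as np

def undominated_by_random_search(K, y_lo, y_hi, lam, n_draws=2000,
                                 scale=1.0, seed=None):
    """Approximate the set of undominated coefficient vectors: keep every
    candidate whose lower risk does not exceed the smallest upper risk."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    cands = rng.normal(scale=scale, size=(n_draws, n))
    risks = [interval_risks(a, K, y_lo, y_hi, lam) for a in cands]
    best_upper = min(u for _, u in risks)  # smallest regularised upper risk
    return [a for a, (lo, _) in zip(cands, risks) if lo <= best_upper]
```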

References

Cattaneo, M. and Wiencierz, A. (2012). Likelihood-based Imprecise Regression. International Journal of Approximate Reasoning, 53, 1137 – 1154.

Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods and Applications. Berlin, Heidelberg: Springer.

Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Berlin, Heidelberg: Springer.

Utkin, L.V. and Coolen, F.P.A. (2011). Interval-valued Regression and Classification Models in the Framework of Machine Learning. In: ISIPTA '11, Proceedings of the 7th International Symposium on Imprecise Probability: Theories and Applications, Innsbruck, Austria, pp. 371 – 380.


Comparison of robust and nonparametric measures of fidelity

Bruce J. Worton1, Chris R. McLellan1

1 School of Mathematics and Maxwell Institute for Mathematical Sciences, The University of Edinburgh, Edinburgh, UK

E-mail for correspondence: [email protected]

Abstract: In this paper we use robust mixture models as well as nonparametric estimators to estimate fidelity. These estimators are defined using three different types of measure of fidelity. The methods are applied to the analysis of two-dimensional data on locations of mule deer. A comparison of the estimators shows that both robust mixture and nonparametric based estimators lead to similar conclusions, but the various measures of fidelity highlight different features of the data sets being compared.

Keywords: Nonparametric estimation; Robust mixture modelling; Similaritymeasures.

1 Introduction

In this paper we consider methods for assessing fidelity. In particular, we use robust mixture models as well as nonparametric estimator based methods to estimate fidelity. The fidelity is defined as the similarity of two bivariate data sets, such as the locations of mule deer shown in Figure 1. Our aim is to study and quantify any changes in the spatial use by the mule deer from one year to the next using the observed point locations. In Sections 2 and 3 we outline flexible approaches, and in Section 4 we apply them to the mule deer data.

2 Fidelity measures

We consider three measures of fidelity based on similarity measures between densities $f_1$ and $f_2$:

• SP: scaled product measure, a scaled version of $\int f_1(x) f_2(x)\,dx$,

• SRP: square root product measure, $\int \sqrt{f_1(x)} \sqrt{f_2(x)}\,dx$, and

• OVL: overlap measure, $\int \min\{f_1(x), f_2(x)\}\,dx$ (Clemons and Bradley, 2000).

In each case the measure is defined on the interval zero to unity, with unity corresponding to the case $f_1 = f_2$.
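On a rectangular grid, all three measures can be approximated by simple Riemann sums. Since the definition above leaves the scaling of SP unspecified, the sketch below uses one common choice, dividing by $\sqrt{\int f_1^2 \int f_2^2}$ so that the measure equals unity when $f_1 = f_2$; this scaling is our assumption:

```python
import numpy as np

def fidelity_measures(f1, f2, grid_x, grid_y):
    """Approximate SP, SRP and OVL for two bivariate densities f1, f2
    (callables on (m, 2) arrays) by Riemann sums on a rectangular grid."""
    xx, yy = np.meshgrid(grid_x, grid_y, indexing="ij")
    pts = np.column_stack([xx.ravel(), yy.ravel()])
    d1, d2 = f1(pts), f2(pts)
    dA = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])  # grid cell area
    # assumed scaling for SP: divide by sqrt(int f1^2 * int f2^2)
    sp = (d1 * d2).sum() * dA / np.sqrt((d1**2).sum() * (d2**2).sum() * dA**2)
    srp = np.sqrt(d1 * d2).sum() * dA
    ovl = np.minimum(d1, d2).sum() * dA
    return sp, srp, ovl
```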


FIGURE 1. Observed locations of a particular mule deer in two consecutive years, together with fitted robust mixtures.

3 Estimation of fidelity

In this section we consider estimation of fidelity by both nonparametric estimators and robust mixture modelling. These approaches provide sufficient flexibility for modelling data sets such as those illustrated in Figure 1. We assume we have data sets $x_1, \ldots, x_{n_1}$ and $y_1, \ldots, y_{n_2}$, from densities $f_1$ and $f_2$ respectively, available to estimate fidelity.

3.1 Nonparametric estimation

We use a nonparametric kernel estimator of the form

$$\hat{f}_{\mathrm{ker}}(x) = \frac{1}{n h^2} \sum_{i=1}^{n} K\Big(\frac{x - x_i}{h}\Big),$$

where $x_1, \ldots, x_n$ are observations, $K$ is the kernel and $h$ is the bandwidth (Worton, 1989; Wand and Jones, 1995). We can show, for a normal kernel, that there is an explicit expression for the estimated product measure,

$$\frac{1}{2\pi n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} |h_1^2 I|^{-\frac12}\, |h_2^2 I|^{-\frac12}\, \big|(h_1^2 I)^{-1} + (h_2^2 I)^{-1}\big|^{-\frac12} \exp\Big[-\frac12 (x_i - y_j)^{T} (h_1^2 I)^{-1} \big\{(h_1^2 I)^{-1} + (h_2^2 I)^{-1}\big\}^{-1} (h_2^2 I)^{-1} (x_i - y_j)\Big].$$
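Because $(h_1^2 I)^{-1}\{(h_1^2 I)^{-1} + (h_2^2 I)^{-1}\}^{-1}(h_2^2 I)^{-1} = \{(h_1^2 + h_2^2) I\}^{-1}$, the double sum collapses to Gaussian evaluations at the pairwise distances, which the following sketch (names are ours) exploits:

```python
import numpy as np

def product_measure_normal_kernels(x, y, h1, h2):
    """Exact int f1_hat * f2_hat for bivariate normal-kernel estimates with
    bandwidths h1 and h2: the displayed double sum, simplified via
    (h1^2 I) + (h2^2 I) = (h1^2 + h2^2) I."""
    s2 = h1**2 + h2**2
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise sq. dist.
    return np.exp(-0.5 * d2 / s2).sum() / (2 * np.pi * s2 * len(x) * len(y))
```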

3.2 Robust mixtures

We consider a robust mixture of $t$ distributions

$$\hat{f}_{\mathrm{mix}\text{-}t}(x) = \sum_{i=1}^{g} w_i f_t(x; \mu_i, \Sigma_i, \nu_i),$$

where $\mu_i$, $\Sigma_i$ and $\nu_i$ are the mean, scale matrix and degrees of freedom of the $i$th mixture component $t$-density $f_t$, respectively, with mixing weights $w_i$ (Peel and McLachlan, 2000).

FIGURE 2. Difference of density estimates for the robust mixture estimation.

In the case of the product measure we define estimators

$$\hat{I}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} \hat{f}_2(x_i), \qquad \hat{I}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} \hat{f}_1(y_i),$$

and then use the combined estimator $\hat{I} = (n_1 \hat{I}_1 + n_2 \hat{I}_2)/(n_1 + n_2)$. These are similar to the one-dimensional definitions used by Ridout and Linkie (2009), but here we use $t$ mixtures. Calculating the measure using numerical integration produces similar values but is much more time consuming. Similar estimators may be defined for the other measures of similarity.
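Given the fitted densities $\hat{f}_1$ and $\hat{f}_2$ as callables (e.g. evaluated $t$-mixtures), the combined estimator is a two-line computation (a sketch; names are ours):

```python
import numpy as np

def combined_product_estimate(x, y, f1_hat, f2_hat):
    """Combined estimator of int f1 * f2: average each fitted mixture
    density over the sample drawn from the other density."""
    I1 = np.mean(f2_hat(x))   # (1/n1) sum_i f2_hat(x_i)
    I2 = np.mean(f1_hat(y))   # (1/n2) sum_i f1_hat(y_i)
    n1, n2 = len(x), len(y)
    return (n1 * I1 + n2 * I2) / (n1 + n2)
```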

4 Application

Table 1 gives the estimates of the various measures of fidelity based on the nonparametric and robust mixture estimators. We can see that for each of the measures of fidelity the nonparametric and robust parametric approaches lead to similar estimates. The different measures themselves give a similar conclusion, which would be difficult to assess objectively using the raw data or even the density estimate plots. To investigate further where any differences lie, we use the difference in the estimated densities. Figure 2 gives an example in the case of the mixture of $t$-densities, but the version based on the nonparametric estimation is broadly similar.

TABLE 1. Estimates of fidelity using various estimators for the mule deer data.

Density estimator    SP     SRP    OVL
Kernel               0.64   0.70   0.55
Mixture of t         0.61   0.75   0.60


5 Discussion

We have seen that the observed point locations may be compared in different ways using the various measures of fidelity. Reassuringly, both the robust mixture approach and the nonparametric approach lead to similar conclusions. In summary, for our application, we found a reasonably good level of fidelity with each of the various methods used to quantify it, whether based on robust mixture or nonparametric methods. However, we also found that studying plots of the difference between density estimates provides a convenient way of identifying and highlighting differences. The graphical approach is a convenient way to present the results to biologists, leading to greater understanding and highlighting the changes that have taken place between the two time periods when the data were collected.

Acknowledgments: We would particularly like to thank Gary White for making the data sets on mule deer available.

References

Clemons, T.E. and Bradley, E.L. (2000). A nonparametric measure of the overlapping coefficient. Computational Statistics and Data Analysis, 34, 51 – 61.

Peel, D. and McLachlan, G.J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10, 339 – 348.

Ridout, M.S. and Linkie, M. (2009). Estimating overlap of daily activity patterns from camera trap data. Journal of Agricultural, Biological and Environmental Statistics, 14, 322 – 337.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. London: Chapman & Hall.

Worton, B.J. (1989). Kernel methods for estimating the utilization distribution in home-range studies. Ecology, 70, 164 – 168.


Extending capture-recapture models to handle erroneous records in linked administrative data

Dilek Yildiz1, Peter W.F. Smith1, Peter G.M. van der Heijden1,2

1 Southampton Statistical Sciences Research Institute, University of Southampton, United Kingdom

2 Utrecht University, The Netherlands

E-mail for correspondence: [email protected]

Abstract: The Beyond 2011 Programme of the Office for National Statistics is evaluating alternative methods of collecting census data and producing small-area socio-demographic statistics. In the absence of a traditional census, one alternative method to estimate population counts is to use individual linkage between administrative sources. The aim of this paper is to estimate the population of England and Wales by combining information from the linked Patient Register and Customer Information System with the 2011 Census estimates. For this particular research we use one- and two-way marginal information from the Census estimates. However, in the absence of a traditional census we assume that the marginal information would be available from other sources such as coverage surveys, annual surveys or other administrative data sources.

Keywords: Administrative data; Log-linear model; Capture-recapture method;Combining data; England and Wales.

1 Introduction

The Beyond 2011 Programme of the Office for National Statistics (ONS) is evaluating alternative methods of collecting census data and producing small-area socio-demographic statistics. In the absence of a traditional census, one alternative method to estimate population counts is to use individual linkage between administrative sources. However, it is possible that some of the usual residents are not included in any of the linked sources and will be missing in the final population estimate, leading to underestimation of the population size. Moreover, it is also possible that some of the people registered in the administrative data sources may not be eligible to be included in the usual resident population and therefore cause overestimation of the population size.


The Patient Register and the Customer Information System (CIS) are two of the administrative sources in England and Wales which can be used to estimate population counts. However, they are not designed to estimate the usual resident population, and they collect information from different but overlapping population groups. The aim of this research is to estimate the number of people in a particular age group, sex and local authority by using information from the individually linked Patient Register and CIS.

2 Data sources and Methodology

We use summary tables from the linked 2011 Patient Register and CIS, and the 2011 Census estimates table by 5-year age group, sex and 346 local authorities (excluding the City of London and the Isles of Scilly). Detailed information about the data sources is available in the ONS reports (2012a, 2012b, 2013a).

The capture-recapture approach is used to estimate the number of people who are usual residents but neither in the Patient Register nor in the CIS. The resulting estimates are compared with the census estimates, which are assumed to be the 'gold standard'. The capture-recapture approach is used in countries which conduct a traditional census to estimate under-coverage after post-enumeration surveys. It is also used in countries which estimate their population by register-based censuses; in this case, a second register is used as the recapture sample. Both the Patient Register and the CIS exceed the 2011 Census estimates for England and Wales (ONS 2012a, 2013a). Hence, it is inevitable that the classical capture-recapture approach will also overestimate the total population count by additionally estimating people who are registered in neither data set. Therefore, we use log-linear models with an offset to combine the linked Patient Register and CIS information with marginal information from the census estimates, to adjust for both underestimation and overestimation. We start with simple models and develop more complicated log-linear models with an offset to estimate the population counts by age group, sex and local authority.

3 Results

Model 0 assumes that everyone in the usual resident population is registered with at least one of the CIS or the Patient Register, and that only usual residents can be registered with any of these sources. In this case, it is possible to compute the population of England and Wales by adding up the people who are registered with at least one of these two administrative sources: $N = n_{11} + n_{10} + n_{01}$ in Table 1. The model assumes $n_{00} = 0$. Model 0 overestimates the population of England and Wales for almost all of the age groups. In addition, its sex ratio also exceeds the sex ratio of the census estimates for the age groups between 20 and 70 years. We know that both data sources exceed the census estimates, so, as expected, adding up the people registered with these sources overestimates the population.

Model 1 assumes that there are people who are not registered with either the Patient Register or the CIS but are usual residents in England and Wales. So we employ the classical capture-recapture approach to estimate the number of people who are missed by both of the registers. The log-linear model for a particular age group, sex and local authority is $\log(m_{ij}) = \lambda + \lambda_i^I + \lambda_j^J$, where $m_{ij} = E(n_{ij})$ and $m_{00} = \exp(\lambda)$. Model 1 overestimates the population of England and Wales for almost all of the age groups, and its sex ratio exceeds the sex ratio of the census estimates for the age groups between 20 and 70 years.

Due to computational constraints, we fit more complex log-linear models only for the South East region (67 local authorities) to provide a better age group-sex-local authority association structure. The log-linear models fitted to the South East region data also overestimate the population. Therefore, we use log-linear models with offsets to combine the predicted values from the log-linear models with the marginal association from the census estimates. We fit age group, age group and sex, and age group-sex interaction log-linear models with offsets equal to the predicted values from the previously fitted log-linear models, so that the corresponding marginal totals from these models are equal to the ones in the census estimates. The age group-sex interaction log-linear model with an offset is

$$\log q_{ijasl} = \lambda + \lambda_i^I + \lambda_j^J + \lambda_a^A + \lambda_s^S + \lambda_{as}^{AS} + \log(m_{asl}),$$

where $q_{ijasl}$ is the expected value and $m_{asl}$ denotes the predicted values from the log-linear models for age group $a$, sex $s$, and local authority $l$. This model provides closer estimates to the census estimates than the previous models.
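The following sketch illustrates both steps with hypothetical numbers: the classical capture-recapture (Model 1) estimate of the missing cell, and a Poisson GLM with offset $\log(m)$ as one way to fit such an offset model (assuming statsmodels is available; the design matrix, offsets and counts are placeholders for the real summary tables):

```python
import numpy as np
import statsmodels.api as sm

# Model 1 for one age-sex-authority cell: under independence the missing
# cell is m00 = n10 * n01 / n11 (hypothetical counts below).
n11, n10, n01 = 90000, 15000, 12000
m00 = n10 * n01 / n11
N_hat = n11 + n10 + n01 + m00

# Log-linear model with an offset: a Poisson GLM whose offset log(m)
# carries the predicted values from the previously fitted model, while
# the design X encodes the margins to be matched to the census.
rng = np.random.default_rng(0)
m = rng.uniform(50, 150, size=20)                 # offset: predicted values
X = np.column_stack([np.ones(20), rng.integers(0, 2, size=20)])
counts = rng.poisson(1.1 * m)
fit = sm.GLM(counts, X, family=sm.families.Poisson(),
             offset=np.log(m)).fit()
print(round(N_hat), fit.params)
```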

4 Conclusion

As mentioned above, both the Patient Register and the CIS overestimate the usual resident population of England and Wales for 2011. Unsurprisingly, using information only from the linked Patient Register and CIS to estimate the population also results in overestimation. Therefore, we combine them with the gold standard marginal information from the 2011 Census estimates to produce more accurate population estimates.

Different log-linear models are considered. All of these models exceed the census estimates. Therefore, we extended these models so that they combine information from the linked data with the age group, age group and sex, and age group-sex marginal information from the census estimates. Including additional information provided better estimates than the previous models. We compared models according to the percentage errors between the population estimates calculated by the models and the census estimates and found that the discrepancy between the models and the census estimates is lower for the log-linear models with offsets.

In the future this research can be extended to combine linked data tables other than the Patient Register and the CIS with one- and two-way marginal information from coverage surveys, annual surveys or other administrative data sources.

TABLE 1. Linked data sources

                       CIS
Patient Register    Yes    No
Yes                 n11    n10
No                  n01    n00

Acknowledgments: This research is funded by the joint Economic and Social Research Council - Office for National Statistics studentship.

References

Office for National Statistics (2012a). Beyond 2011: Administrative Data SourcesReport: NHS Patient Register.Available at: http://www.ons.gov.uk/ons/about-ons/who-ons-are/programmes-and-projects/beyond-2011

reports-and-publications/sources-reports/

beyond-2011--administrative-data-sources-report--nhs-

patient-register--s1-.pdf. (accessed January 2014).

Office for National Statistics (2012b). Information paper. Quality andmethodology information for 2011 Census - Statistics for England andWales: March 2011.Available at: http://www.ons.gov.uk/ons/guide-method/method-quality/quality/quality-information/population/

population-and-household-estimates.doc.

(accessed February 2014).

Office for National Statistics (2013a). Beyond 2011: S5 Administrative Data SourcesReport: Department for Work and Pensions (DWP) and Her Majestys Rev-enue and Customs (HMRC) Benefit and Revenue Information (CIS) andLifetime Labour Market Database (L2).Available at: http://www.ons.gov.uk/ons/about-ons/who-ons-are/programmes-and-projects/beyond-2011/

reports-and-publications/sources-reports/

beyond-2011-administrative-data-sources-report---dwp-

and-hmrc-cis-and-l2-combined--s5-.pdf (accessed January 2014).


Index

Ángel, Bárbara, 93

Albala, Cecilia, 93

Amo-Salas, M, 7

Ayma, Diego, 3

Azod Azad, Vida, 61

Bialek, Jacek, 11

Blas, Betsabe, 19

Bonetti, Daniel, 15, 35

Cadarso-Suárez, Carmen, 145

Cavalcante, Mileno, 19

Cavenague de Souza, Hayala Cristina,

23

Cubiles-de-la-Vega, M, 113, 117, 137

da Silva Castro Perdona, Gleici, 23

Dannemann, Jörn, 69

Delbem, Alexandre, 15

Delgado-Marquez, E, 7

Donat, Francesco, 27

Dooley, Cara, 31

Durbán, María, 3

Egner, Alexander, 69

Einbeck, Jochen, 15, 35, 161

Elayouty, Amira, 39

Ender, Manuela, 43

Enea, Marco, 47

Feeney, Joanne, 31

Ferrari, Silvia, 129

Franco, Gabriel, 133

García-Ligero, M.J., 51

Geisler, Claudia, 69

Giampaoli, Viviana, 75, 157

Gichuru, Phillip, 55

Gladstone, Melissa, 55

Golalizadeh, Mousa, 61

Groll, Andreas, 65

Höller, Peter, 125

Hartmann, Alexander, 69

Hendrych, Radek, 71

Hermoso-Carazo, A., 51

Hernandez, Freddy, 75

Hormazábal, María, 93

Huang, Jixia, 79

Huang, Yufen, 85

Huckemann, Stephan, 69

Ibacache-Pulgar, German, 89

Javadi, Bahman, 105

Jermyn, Ian, 161

Ke, Bo-Shiang, 85

Kenny, Rose Anne, 31

Kneib, Thomas, 145, 169

Lopez-Fidalgo, J, 7

Laitenberger, Oskar, 69

Lancaster, Gillian, 55

Lee, Dae-Jin, 3

Lera, Lydia, 93

Leyton, Barbara, 93

Linares-Perez, J., 51

Louzada Neto, Francisco, 23

Ma, Tong, 43

Mala, Ivana, 97

Maris Peria, Fernanda, 23

Marra, Giampiero, 27

Martínez-Araya, Mario, 101

Matawie, Kenan, 105

Mayr, Georg, 109

McLellan, Chris, 177

Messner, Jakob, 109

Miller, Claire, 39

Moreno-Rebollo, J, 113, 117, 137

Muñoz-Pichardo, Juan, 113, 117, 137

Munk, Alexander, 69

Noma, Alexandre, 157


O’Hagan, Adrian, 121

Pan, Jianxin, 101

Pfeifer, Christian, 125

Pinheiro, Aluísio, 165

Pinheiro, Eliane, 129

Pinheiro, Hildete, 133

Pino-Mejías, R, 113, 117, 137

Plaia, Antonella, 47, 153

Proctor, Iain, 141

Rigby, Robert, 47

Rodríguez-Álvarez, María Xosé, 145

Rodrigues-Motta, Mariana, 133

Roman-Roman, P., 149

Sanchez, Hugo, 93

Sciandra, Mariangela, 153

Scott, E. Marian, 141

Scott, Marian, 39

Serrano-Perez, J.J., 149

Smith, Peter, 181

Smith, Rognvald I., 141

Sobotka, Fabian, 169

Stasinopoulos, Mikis, 47

Tamura, Karin A., 157

Titman, Andrew, 55

Torres-Ruiz, F., 149

Tsiftsi, Thomai, 161

Tutz, Gerhard, 65

Usuga, Olga, 75

Valk, Marcio, 165

van der Heijden, Peter, 181

Villoria, Maria Franco, 39

Waldmann, Elisabeth, 169

Waldron, Susan, 39

Wang, Jinfeng, 79

Wiencierz, Andrea, 173

Worton, Bruce, 177

Yildiz, Dilek, 181

Zeileis, Achim, 109


29th IWSM 2014 Sponsors

We are very grateful to the following organisations for sponsoring the 29th IWSM 2014.

• Georg-August-University, Göttingen

• Toyota Motor Corporation

• Leonard N. Stern School of Business, New York University

• Springer Publisher

• Statistical Modelling Society

• RTG 1644 “Scaling Problems in Statistics”