
Springer Texts in Statistics

Advisors: George Casella, Stephen Fienberg, Ingram Olkin

Springer Science+Business Media, LLC


Springer Texts in Statistics

Alfred: Elements of Statistics for the Life and Social Sciences
Berger: An Introduction to Probability and Stochastic Processes
Bilodeau and Brenner: Theory of Multivariate Statistics
Blom: Probability and Statistics: Theory and Applications
Brockwell and Davis: Introduction to Time Series and Forecasting, Second Edition
Chow and Teicher: Probability Theory: Independence, Interchangeability, Martingales, Third Edition
Christensen: Advanced Linear Modeling: Multivariate, Time Series, and Spatial Data; Nonparametric Regression and Response Surface Maximization, Second Edition
Christensen: Log-Linear Models and Logistic Regression, Second Edition
Christensen: Plane Answers to Complex Questions: The Theory of Linear Models, Third Edition
Creighton: A First Course in Probability Models and Statistical Inference
Davis: Statistical Methods for the Analysis of Repeated Measurements
Dean and Voss: Design and Analysis of Experiments
du Toit, Steyn, and Stumpf: Graphical Exploratory Data Analysis
Durrett: Essentials of Stochastic Processes
Edwards: Introduction to Graphical Modelling, Second Edition
Finkelstein and Levin: Statistics for Lawyers
Flury: A First Course in Multivariate Statistics
Jobson: Applied Multivariate Data Analysis, Volume I: Regression and Experimental Design
Jobson: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods
Kalbfleisch: Probability and Statistical Inference, Volume I: Probability, Second Edition
Kalbfleisch: Probability and Statistical Inference, Volume II: Statistical Inference, Second Edition
Karr: Probability
Keyfitz: Applied Mathematical Demography, Second Edition
Kiefer: Introduction to Statistical Inference
Kokoska and Nevison: Statistical Tables and Formulae
Kulkarni: Modeling, Analysis, Design, and Control of Stochastic Systems
Lange: Applied Probability
Lehmann: Elements of Large-Sample Theory
Lehmann: Testing Statistical Hypotheses, Second Edition
Lehmann and Casella: Theory of Point Estimation, Second Edition
Lindman: Analysis of Variance in Experimental Design
Lindsey: Applying Generalized Linear Models

(continued after index)


Jeffrey S. Simonoff

Analyzing Categorical Data

With 64 Figures

Springer


Jeffrey S. Simonoff
Leonard N. Stern School of Business
New York University
New York, NY 10012-0258
USA
[email protected]

Editorial Board

George Casella
Department of Statistics
University of Florida
Gainesville, FL 32611-8545
USA

Stephen Fienberg
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213-3890
USA

Ingram Olkin
Department of Statistics
Stanford University
Stanford, CA 94305
USA

Cover illustration: The Poisson regression model (Figure 5.1).

Library of Congress Cataloging-in-Publication Data

Simonoff, Jeffrey S.
Analyzing categorical data / Jeffrey S. Simonoff.
p. cm. - (Springer texts in statistics)
Includes bibliographical references and index.
ISBN 978-1-4419-1837-6 ISBN 978-0-387-21727-7 (eBook)
DOI 10.1007/978-0-387-21727-7
1. Multivariate analysis. I. Title. II. Series.
QA278.S524 2003
519.5'35-dc21 2003044946

Printed on acid-free paper.

© 2003 Springer Science+Business Media New York

Originally published by Springer-Verlag New York, Inc. in 2003.

Softcover reprint of the hardcover 1st edition 2003

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

9 8 7 6 5 4 3 2 1 SPIN 10919460

Typesetting: Pages created by the author in LaTeX 2e using Springer's svsing2e.sty macro.

www.springer-ny.com


To my parents, Pearl and Morris Simonoff


Preface

This book grew out of notes that I prepared for a class in categorical data analysis that I gave at the Stern School of Business of New York University during the Fall 1998 semester. The class drew from a very diverse pool of students, including undergraduate statistics majors, M.B.A. students, M.S. and Ph.D. statistics students, and M.S. and Ph.D. students in other fields, including management, economics, and public administration.

My task was to come up with a way of presenting the material in a way that such a heterogeneous group could grasp. I immediately hit on the idea of using regression ideas to drive everything, since all of the students would have seen regression before, at some level. This is not a new idea; many books have in recent years exploited the generalized linear model when discussing categorical data analysis. I had in mind something a little different, however: a heavily data-analytic approach that covered a broad range of categorical data problems, from the count data models common in econometric modeling, to the loglinear models familiar to statisticians and social scientists, to binomial and multinomial regression models originally used in biological applications, with linear regression at the core.

This origin has several implications for the reader of this book. First, Chapters 2 and 3 contain a more detailed overview of least squares regression modeling than is typical in books of this type, since this material is continually drawn on when describing analogous techniques for categorical data. There is also a good deal of detailed material on univariate discrete random variables (binomial, Poisson, negative binomial, multinomial) in Chapter 4. My hope is that these three chapters will make it possible for the book to stand alone more effectively, and make it useful for readers with a wide range of backgrounds. On the other hand, they make the book longer than it might have been; there is a lot to get through if a reader just sits down and attempts to read straight through.

The Poisson regression model, and its variants and extensions, is the engine for much of the material here. This is not unusual for books on econometric models for count data (for example, Long, 1997, or Cameron and Trivedi, 1998), but is not typical for categorical data analysis books written by statisticians, which tend to highlight the Poisson regression model as the basis of loglinear modeling, but do not focus very much on count data modeling problems directly (for example, Agresti, 1996, or Lloyd, 1999). On the other hand, this book also includes extensive discussion of loglinear models for contingency tables, including tables with special structure, which is common in statistical categorical data analysis books, but not count data modeling books. The close connection between these models and useful models for binomial and multinomial data makes it easy to then include material on logistic regression (and its variants and competitors) as well. The approach is classical; for a Bayesian approach to many of these problems, see Johnson and Albert (1999).

The target audience for this book is similar to the (student) audience for my original class, but extended to include working data analysts, whether they are statisticians, social scientists, engineers, or workers in any other area. My hope is that anyone who might be faced with categorical data to analyze would benefit from reading and using the book. Some exposure to linear regression modeling would be helpful, although the material in Chapters 2-4 is designed to provide enough background for the later material. The book has a strong focus on applying methods to real problems (which accounts for the active, rather than passive, nature of its title). For this reason, there is more detailed discussion of examples than is typical in books of this type, including more background material on the problem, more model checking and selection, and more discussion of implications from a contextual point of view. These discussions are set aside in the text with grey rules and titles, in order to emphasize their importance. Nothing can take the place of reading the original papers (or doing the analysis yourself), but my intention is to give the reader more of a flavor of the full data-analytic experience than is typical in a textbook. I hope that the readers will find the examples interesting on their own merits, not just as examples of categorical data analysis. Many are from recent papers in subject-area scientific journals. A more detailed description of the organization of the book is given in Section 1.2.

Many of the basic techniques for categorical data analysis are available in almost all statistical packages. A great deal (but not all) of categorical data modeling can be done using any statistical package that has a generalized linear model function. All of the statistical modeling and figures in the text are based on S-PLUS (Insightful Corporation, 2001), including functions and libraries written by myself and other people, with the following exceptions:

• ANOAS (Eliason, 1986) was used to fit the Goodman RC association model.

• Egret (Cytel Software Corporation, 2001a) was used for beta-binomial regression.

• LIMDEP (Greene, 2000a) was used for zero inflated count regression and truncated Poisson regression.

• LogXact (Cytel Software Corporation, 1999) was used to conduct conditional analyses of logistic regression models.

• SAS (SAS Institute, 2000) was used to fit some ordinal regression models.

• SPSS (SPSS, Inc., 2001) was used to construct the Hosmer-Lemeshow statistic when fitting a binary logistic regression model.

• StatXact (Cytel Software Corporation, 2001b) was used for various conditional analyses on contingency tables.

I have set up a Web site related to the material in this book at the address http://www.stern.nyu.edu/~jsimonof/AnalCatData (a link also can be found at the Springer-Verlag Web site, http://www.springer-ny.com, under "Author Websites"). The site includes computer code, functions, and macros in S-PLUS (and the free package R, which is virtually identical to S-PLUS; see Ihaka and Gentleman, 1996), and SAS for the material in the book and the data sets used in the text and exercises (these data sets are identified by name in typewriter font in the book). Answers to selected exercises are available to instructors who adopt the book as a course text. For more information, see the book's Web site or the Springer-Verlag Web site.

I would like to thank some of the people who helped me in the preparation of this book. The students in my class on categorical data analysis helped me to focus my ideas on the subject in a systematic way. Many people provided me with interesting data sets; their names are given in the book where the data are introduced. David Firth, Mark Handcock, and Gary Simon read and commented on draft versions of the text. Yufeng Ding and Zheng Sun helped with software issues, and checked many of the computational results given in the text. John Kimmel was his usual patient and supportive self in guiding the book through the publication process. I would also like to thank my family for their support and encouragement during this long process.

Jeffrey S. Simonoff
East Meadow, New York
May 2003


Contents

Preface

1 Introduction
  1.1 The Nature of Categorical Data
  1.2 Organization of This Book

2 Gaussian-Based Data Analysis
  2.1 The Normal (Gaussian) Random Variable
    2.1.1 The Gaussian Density Function
    2.1.2 Large-Sample Inference for the Gaussian Random Variable
    2.1.3 Exact Inference for the Gaussian Random Variable
  2.2 Linear Regression and Least Squares
    2.2.1 The Linear Regression Model
    2.2.2 Least Squares Estimation
    2.2.3 Interpreting Regression Coefficients
    2.2.4 Assessing the Strength of a Regression Relationship
  2.3 Inference for the Least Squares Regression Model
    2.3.1 Hypothesis Tests and Confidence Intervals for β
    2.3.2 Interval Estimation for Predicted and Fitted Values
  2.4 Checking Assumptions
  2.5 An Example
  2.6 Background Material
  2.7 Exercises

3 Gaussian-Based Model Building
  3.1 Linear Contrasts and Hypothesis Tests
  3.2 Categorical Predictors
    3.2.1 One Categorical Predictor with Two Levels
    3.2.2 One Categorical Predictor with More Than Two Levels
    3.2.3 More Than One Categorical Predictor
  3.3 Regression Diagnostics
    3.3.1 Identifying Outliers
    3.3.2 Identifying Leverage Points
    3.3.3 Identifying Influential Points
  3.4 Model Selection
    3.4.1 Choosing a Set of Candidate Models
    3.4.2 Choosing the "Best" Model
    3.4.3 Model Selection Uncertainty
  3.5 Heteroscedasticity and Weighted Least Squares
  3.6 Background Material
  3.7 Exercises

4 Categorical Data and Goodness-of-Fit
  4.1 The Binomial Random Variable
    4.1.1 Large-Sample Inference
    4.1.2 Sample Size and Power Calculations
    4.1.3 Inference from a Sample of Binomial Random Variables
    4.1.4 Exact Inference
  4.2 The Multinomial Random Variable
    4.2.1 Large-Sample Inference
  4.3 The Poisson Random Variable
    4.3.1 Large-Sample Inference
    4.3.2 Exact Inference
    4.3.3 The Connection Between the Poisson and the Multinomial
  4.4 Testing Goodness-of-Fit
    4.4.1 Chi-Squared Goodness-of-Fit Tests
    4.4.2 Partitioning Pearson's X² Statistic
    4.4.3 Exact Inference
  4.5 Overdispersion and Lack of Fit
    4.5.1 The Zero-Inflated Poisson Model
    4.5.2 The Negative Binomial Model
    4.5.3 Overdispersed Binomial Data and the Beta-Binomial Model
  4.6 Underdispersion
  4.7 Robust Estimation Using Hellinger Distance
  4.8 Background Material
  4.9 Exercises

5 Regression Models for Count Data
  5.1 The Generalized Linear Model
    5.1.1 The Form of the Generalized Linear Model
    5.1.2 Estimation in the Generalized Linear Model
    5.1.3 Hypothesis Tests and Confidence Intervals for β
    5.1.4 The Deviance and Lack of Fit
    5.1.5 Model Selection
    5.1.6 Model Checking and Regression Diagnostics
  5.2 Poisson Regression
  5.3 Overdispersion
    5.3.1 The Robust Sandwich Covariance Estimator
    5.3.2 Quasi-Likelihood Estimation
  5.4 Non-Poisson Parametric Regression Models
    5.4.1 Negative Binomial Regression
    5.4.2 Zero-Inflated Count Regression
    5.4.3 Zero-Truncated Poisson Regression
  5.5 Nonparametric Count Regression
    5.5.1 Local Likelihood Estimation Based on One Predictor
    5.5.2 Smoothing Multinomial Data
    5.5.3 Regression Smoothing with More Than One Predictor
  5.6 Background Material
  5.7 Exercises

6 Analyzing Two-Way Tables
  6.1 Two-by-Two Tables
    6.1.1 Two-Sample Tests and Comparisons of Proportions
    6.1.2 Two-by-Two Tables and Tests of Independence
    6.1.3 The Odds Ratio and the Relative Risk
  6.2 Loglinear Models for Two-Way Tables
    6.2.1 2 x 2 Tables
    6.2.2 I x J Tables
  6.3 Conditional Analyses
    6.3.1 Two-by-Two Tables and Fisher's Exact Test
    6.3.2 I x J Tables
  6.4 Structural Zeroes and Quasi-Independence
  6.5 Outlier Identification and Robust Estimation
  6.6 Background Material
  6.7 Exercises

7 Tables with More Structure
  7.1 Models for Tables with Ordered Categories
    7.1.1 Linear-by-Linear (Uniform) Association
    7.1.2 Row, Column, and Row + Column Effects Models
    7.1.3 A Log-Multiplicative Row and Column Effects Model
    7.1.4 Bivariate Discrete Distributions
  7.2 Square Tables
    7.2.1 Quasi-Independence
    7.2.2 Symmetry
    7.2.3 Quasi-Symmetry
    7.2.4 Tables with Ordered Categories
    7.2.5 Matched Pairs Data
    7.2.6 Rater Agreement Tables
  7.3 Conditional Analyses
  7.4 Background Material
  7.5 Exercises

8 Multidimensional Contingency Tables
  8.1 2 x 2 x K Tables
    8.1.1 Simpson's Paradox
    8.1.2 Types of Independence
    8.1.3 Tests of Conditional Association
  8.2 Loglinear Models for Three-Dimensional Tables
    8.2.1 Hierarchical Loglinear Models
    8.2.2 Association Diagrams, Conditional Independence, and Collapsibility
    8.2.3 Loglinear Model Fitting for Three-Dimensional Tables
  8.3 Models for Tables with Ordered Categories
  8.4 Higher-Dimensional Tables
  8.5 Conditional Analyses
  8.6 Background Material
  8.7 Exercises

9 Regression Models for Binary Data
  9.1 The Logistic Regression Model
    9.1.1 Why Logistic Regression?
    9.1.2 Inference for the Logistic Regression Model
    9.1.3 Model Fit and Model Selection
  9.2 Retrospective (Case-Control) Studies
  9.3 Categorical Predictors
  9.4 Other Link Functions
    9.4.1 Dose Response Modeling and Logistic Regression
    9.4.2 Probit Regression
    9.4.3 Complementary Log-Log and Log-Log Regression
  9.5 Overdispersion
    9.5.1 Nonparametric Approaches
    9.5.2 Beta-Binomial Regression
  9.6 Smoothing Binomial Data
  9.7 Conditional Analysis
  9.8 Background Material
  9.9 Exercises

10 Regression Models for Multiple Category Response Data
  10.1 Nominal Response Variable
    10.1.1 Multinomial Logistic Regression
    10.1.2 Independence of Irrelevant Alternatives
  10.2 Ordinal Response Variable
    10.2.1 The Proportional Odds Model
    10.2.2 Other Link Functions
    10.2.3 Adjacent-Categories Logit Model
    10.2.4 Continuation Ratio Models
  10.3 Background Material
  10.4 Exercises

A Some Basics of Matrix Algebra

References

Index