Leveraging Statistical Consulting and Research Projects in Academia

Timothy E. O’Brien
Department of Mathematics and Statistics
Loyola University Chicago

WIILSU Conference, Chicago, Illinois, November 4th 2013

Talk Outline

A. Life as an Academic Statistician
B. Teaching Activities
C. Research Activities
D. Consulting Activities
E. Problem and Solution
F. Some Illustrations: Consulting
G. Some Illustrations: Applied Research
H. Summary

A. Life as an Academic Statistician

Teaching activities: 2/2, 2/3 or 3/3 teaching load including some ‘concepts’ classes; MS/Statistics advising

Loyola is a liberal arts university focusing on ‘arts’ in addition to ‘sciences’

Research and Grants: publishing and grants are essential; at Loyola, the focus is on involving UG and MS quantitative students in our research activities

Consulting: mostly for fun (service) – but now also a part of our MS curriculum

B. Teaching Activities

In addition to service courses (concepts & biostatistics), we teach the following to advanced UGs and MS students:

Introductory probability and math statistics
SAS and R programming and methods
Applied regression (REG, LOGISTIC, NLIN)
Experimental design (GLM, RSREG; JMP)
Categorical data analysis (FREQ, LOGISTIC, CATMOD)
Survival Analysis (LIFETEST, PHREG, LIFEREG)
Quantitative Bioinformatics (MCMC, IML)
Sampling Methods (SURVEYMEANS, SURVEYSELECT, SURVEYLOGISTIC)
Nonparametrics (NPAR1WAY, MULTTEST)
Longitudinal Methods (MIXED, NLMIXED, GENMOD)
Statistical Consulting (newly added capstone)

From Gerry Hahn’s A Career in Statistics:

C. Research Activities

My research (often involves students and) focuses on:

Experimental / optimal design
Applied generalized linear models (bivariate logistic)
Applied nonlinear models
Likelihood methods, confidence intervals, profiling
Impact of curvature in applied statistics
Bioassay, potency, and drug synergy
Violation of “the usual conditions”; robustness

D. Consulting Activities (on and off campus)

These include:

On campus pro bono consulting in Biology, Chemistry, CJ, Nursing, Sociology, Environmental Science
Medical colleagues: Pharmacology, Virology, Neuroscience and Aging
Children’s Triangle Research (pediatrics)
Environmental consulting group (MWRDGC)
INRA and INSERM in France
Pharma (Amylin, Chiron, Glaxo, J&J‐Janssen)
Chiang Mai University colleagues: Family Medicine, Animal Research, Biostatistics, Clinical Epidemiology
Infectious Disease Institute (Uganda)
Loyola Center for the Human Rights of Children

In this era of big data and analytics, where does the statistician fit in? Historically, into the Scientific Method process:

Downloaded on 11/1/13 from: https://www.google.com/search?site=&source=hp&q=scientific+method+diagram&oq=scientific+method+diagram&gs_l=hp.3..0l4j0i22i30l5.1272.12352.0.12593.41.26.7.8.9.0.208.2092.21j4j1.26.0....0...1c.1.30.hp..1.40.2022.oA_qbf6Hoa4

Infectious Disease Institute, Kampala, Uganda:

Chiang Mai University:

This year’s Career Night / Pizza Party:

E. Problem and Solution

Problem: In our studies, taking coursework often encourages “stove‐piping”, in which one focuses only on the course topic, often with contrived (textbook) illustrations.

Solution: Statistical coursework can be integrated / synthesized by working on real‐time, hands‐on statistical consulting and/or research projects. These experiences help graduates to ‘hit the ground running’ once on the job, and help those furthering their studies to understand the open‐ended yet rewarding aspects of applied research.

F. Some Illustrations: Consulting

Current Class Projects:

Assessment of Loyola’s Dissertation Boot Camp (questionnaires, sampling methods, associations)

NYC Fish Project (data dimension reduction, graphics, comparison of two interventions)

Healthy Homes Initiative (GIS, spatial statistics, “big data”, “overlaying” maps)

Loyola Political Science PhD Student (dimension reduction, regression, logistic, survival, length of time in office)

Nan (Thailand) Hospital DM Study (bivariate logistic, determinants of diabetes)

Chicago Crime Study (questionnaires, what murders make the news)

Additional Projects:

Assessment of Loyola’s McNair Scholar Project (DB management, correlations/regression)

Loyola Anthropology colleague: discriminant analysis for archeology research

Loyola Sustainability colleague: factors which motivate cell phone purchases

Loyola Center for the Human Rights of Children: challenge grant application

Infectious Disease Institute: assisting interns/residents with their research projects

LUMC Virology group: sample size, CIs for ED50s

CMU Family Medicine, Animal Research, Faculty of Medicine biostatistics group, Clinical Epidemiology

Last Year’s Class Projects:

Loyola Psychology colleague: hierarchical modelling, study in Chicago schools

Loyola Neurology colleague: recovery from stroke, repeated measures

Tanzania Public/Global Health Research: assessment of interventions, HLM

YMCA Data Analytics: econometric methods

Environmental/Water Reclamation Assessment: regression methods, mixed modelling

Additional Projects with Students:

Student with Provost Fellowship: helping to coordinate campus‐wide statistical consulting and computer software resources

Bioinformatics Student: working on biostatistics research related to relative potency and synergy

Two MS/Applied Statistics Students: working on manuscript related to likelihood methods

Observations To Date:

Working on real projects, with real deadlines, is key to helping students apply their coursework and integrate their knowledge (doing away with stove‐piping)

From start to finish: data‐cleaning (missing values, merging from different sources) and project write‐up, reporting, and presentation are as important as the correct analysis

Gives students a deeper appreciation of their studies/field

Jump‐starts students into their jobs, thereby easing the transition into the “real world”

G. Some Illustrations: Applied Research

Estimating a Binomial Proportion
Hardy‐Weinberg, Genotypes, Phenotypes
Modelling: Logistic Regression
Relative Potency and Synergy
Optimal Design

G1. Bridging via Likelihood: Estimating a Binomial Proportion

When you toss a coin 15 times and obtain just one Head (success), you may estimate the true success probability $\pi$ to be $\hat{\pi} = 1/15 = 6.67\%$. The usual 95% Wald confidence interval for the success probability is

$$\hat{\pi} \pm 1.96\sqrt{\hat{\pi}(1-\hat{\pi})/n};$$

here: −0.0596 to 0.1929. This interval is clearly nonsensical since this success probability must be non‐negative. This Wald interval is based on the quadratic approximation (the dashed curve plotted below) to the log‐likelihood expression (solid curve plotted below),

$$LL(\pi) = y\log(\pi) + (n - y)\log(1 - \pi) = \log(\pi) + 14\log(1 - \pi)$$

We obtain the cut‐lines below by using $\chi^2_1 = 3.84$ [95%] and $\chi^2_1 = 6.63$ [99%] in our calculations. Here, the likelihood‐based confidence interval (LCI), (0.0039, 0.2621), has been obtained using PROC IML. Alternative CIs include the Score, Exact (but note Agresti’s 1998 TAS paper), and Bayesian.
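As a cross‐check on the two intervals quoted above, here is a minimal sketch in Python (the talk itself uses PROC IML). It assumes the 1‐df chi‐square and normal quantiles are hard‐coded, and the helper names `loglik` and `bisect` are illustrative only; the likelihood interval is found by bisection on the log‐likelihood cut line.

```python
import math

y, n = 1, 15                    # one head in 15 tosses, as in the text
p_hat = y / n
z = 1.959964                    # 97.5th percentile of the standard normal
chi2_95 = 3.841459              # 95th percentile of chi-square with 1 df

# Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
se = math.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - z * se, p_hat + z * se)

def loglik(p):
    """Binomial log-likelihood, dropping the constant term."""
    return y * math.log(p) + (n - y) * math.log(1 - p)

# The likelihood interval is where the log-likelihood stays above this cut line.
cut = loglik(p_hat) - chi2_95 / 2

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f changes sign exactly once there."""
    flo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return (lo + hi) / 2

g = lambda p: loglik(p) - cut
lci = (bisect(g, 1e-9, p_hat), bisect(g, p_hat, 1 - 1e-9))

print("Wald:", wald)
print("Likelihood-based:", lci)
```

The Wald lower limit comes out negative, while the likelihood‐based interval stays inside (0, 1) and is visibly asymmetric about the estimate.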

This example clearly shows that the Wald quadratic curve can do a poor job of approximating the actual likelihood.

G2. Bridging via Likelihood: Phenotypes, Genotypes, CIs/Info

Hardy‐Weinberg equilibrium holds that whenever random mixing occurs and a trait or allele occurs with probability $\pi$ in the parent population, then the offspring realize the following:

AA [prob = $\pi^2$]   Aa [prob = $2\pi(1-\pi)$]   aa [prob = $(1-\pi)^2$]

A study of n = 169 crosses resulted in $x_1 = 125$, $x_2 = 34$, $x_3 = 10$.

Genotype: $LL(\pi) = (2x_1 + x_2)\log(\pi) + (x_2 + 2x_3)\log(1 - \pi)$
Information (2nd deriv.) $= 8n^3/[(2x_1 + x_2)(x_2 + 2x_3)] = 2517.9$
SE $= (2517.9)^{-1/2} = 0.0199$; $p_G = 0.8402$

Phenotype: $LL(\pi) = (x_1 + x_2)\log(\pi) + (x_1 + x_2)\log(2 - \pi) + 2x_3\log(1 - \pi)$
Info (2nd deriv.) = 718.5 and SE $= (718.5)^{-1/2} = 0.0373$; $p_P = 0.7567$

Note the drop in information in going from genotype to phenotype, and the increase (almost doubling) in the SE.
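The information and SE values above can be reproduced from the two log‐likelihoods. The following Python sketch assumes the counts given in the text and evaluates minus the second derivative at each MLE; the variable names are illustrative only.

```python
import math

x1, x2, x3 = 125, 34, 10        # AA, Aa, aa counts from the text
n = x1 + x2 + x3

# Genotype data: all three classes are distinguished.
a_count = 2 * x1 + x2           # coefficient of log(pi)
b_count = x2 + 2 * x3           # coefficient of log(1 - pi)
p_g = a_count / (2 * n)         # MLE of pi
# Minus the second derivative of the log-likelihood at the MLE;
# algebraically equal to 8 n^3 / (a_count * b_count).
info_g = a_count / p_g**2 + b_count / (1 - p_g)**2
se_g = info_g ** -0.5

# Phenotype data: AA and Aa are indistinguishable (dominant trait).
dom = x1 + x2
p_p = 1 - math.sqrt(x3 / n)     # MLE solves (1 - pi)^2 = x3 / n
info_p = dom / p_p**2 + dom / (2 - p_p)**2 + 2 * x3 / (1 - p_p)**2
se_p = info_p ** -0.5

print(p_g, info_g, se_g)
print(p_p, info_p, se_p)
```

The ratio `se_p / se_g` is a little under 2, which is the "almost doubling" of the SE noted above.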

[Figure: log‐likelihood curves for the genotype and phenotype data; x‐axis: PI, probability of dominant trait (0.65 to 0.90); y‐axis: Log‐Likelihood (−154 to −149).]

G3. Modelling: Logistic Regression

The Binary (or Binomial) Logistic Model is an illustration of a generalized linear model; it assumes the Binomial distribution and (usually) the logit link, viz.,

$$\log[\pi/(1 - \pi)] = \beta_0 + \beta_1 x$$

As we see from this expression, the model function (RHS) is linear in the parameters. The variance function here, $n\pi(1 - \pi)$, comes from the assumed Binomial distribution, so no new parameter (such as $\sigma^2$) is introduced here; this may indeed be a problem where ‘over‐dispersion’ exists.

Note that if we write instead $\log[\pi/(1 - \pi)] = \beta_1(x - \gamma)$, so that $\gamma$ = LD50 is a model parameter, then we have a generalized nonlinear model (GdNLM). Another GdNLM example is $\log[\pi/(1 - \pi)] = \beta_0 + \log(\theta)x$, where $\theta$ is the OR (odds ratio).

In addition to PROC LOGISTIC, the NLMIXED procedure can be used here. Here is an illustration with a 2×2 table:

            x (dummy)   Cases   Non‐cases   Total   Odds
Exposed         1         3         2          5    $e^{\beta_0 + \beta_1}$
Unexposed       0         4        22         26    $e^{\beta_0}$
Total                     7        24         31

data one;
  do x=1,0;
    input y n @@;
    output;
  end;
datalines;
3 5 4 26
;
proc logistic;
  model y/n=x / clodds=both;
run;

The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
                           Standard       Wald
Parameter   DF   Estimate     Error   Chi-Square   Pr > ChiSq
Intercept    1    -1.7047    0.5436       9.8363       0.0017
x            1     2.1100    1.0624       3.9443       0.0470

Odds Ratio Estimates
            Point        95% Wald
Effect   Estimate   Confidence Limits
x           8.248    1.028     66.177

Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect     Unit   Estimate   95% Confidence Limits
x        1.0000      8.248    1.062     81.025

Students are often curious (1) how the parameter estimates and SEs are obtained, and (2) how the 95% Wald CIs (WCIs) and profile‐likelihood CIs (PLCIs) are found. A discussion of MLE, information, and SEs follows, but profiling can take several examples to ensure understanding.
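One way to show students where the two odds‐ratio intervals come from is to compute them by hand. The sketch below is in Python rather than SAS and is not the talk's own program: it assumes the 2×2 counts above, uses the closed‐form MLEs available for a saturated 2×2 model, and finds the profile‐likelihood limits by bisection. The helper names `profile_loglik` and `bisect` are illustrative only.

```python
import math

y1, n1 = 3, 5                   # exposed: cases, total
y0, n0 = 4, 26                  # unexposed: cases, total
z = 1.959964                    # 97.5th percentile of the standard normal
chi2_95 = 3.841459              # 95th percentile of chi-square with 1 df

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def loglik(b0, b1):
    """Log-likelihood for logit(pi) = b0 + b1*x, dropping constants."""
    return ((y1 + y0) * b0 + y1 * b1
            - n1 * math.log1p(math.exp(b0 + b1))
            - n0 * math.log1p(math.exp(b0)))

def bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f changes sign exactly once there."""
    flo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return (lo + hi) / 2

def profile_loglik(b1):
    """Maximize over b0 for fixed b1 by solving the score equation in b0."""
    score = lambda b0: (y1 + y0) - n1 * expit(b0 + b1) - n0 * expit(b0)
    return loglik(bisect(score, -30.0, 30.0), b1)

# Closed-form MLEs and Wald interval for the odds ratio
b0_hat = math.log(y0 / (n0 - y0))
b1_hat = math.log(y1 / (n1 - y1)) - b0_hat
se_b1 = math.sqrt(1 / y1 + 1 / (n1 - y1) + 1 / y0 + 1 / (n0 - y0))
wald = (math.exp(b1_hat - z * se_b1), math.exp(b1_hat + z * se_b1))

# Profile-likelihood interval: profile log-likelihood within chi2/2 of its max
cut = loglik(b0_hat, b1_hat) - chi2_95 / 2
g = lambda b1: profile_loglik(b1) - cut
plci = (math.exp(bisect(g, -5.0, b1_hat)), math.exp(bisect(g, b1_hat, 10.0)))

print("OR:", math.exp(b1_hat), "Wald:", wald, "Profile:", plci)
```

The Wald limits are symmetric on the log‐odds scale, whereas the profile limits follow the actual shape of the likelihood, which is why the upper limits differ so much with these small counts.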

[Figure: 95% LRCR for 2x2 Logistic Example, two panels with the contour labeled −16.44746: slope vs. intercept, and OR (odds ratio) vs. intercept.]

SAS Program to obtain MLEs and create PLL Curve

proc iml;
  /* Negative log-likelihood in (b0, theta), where theta is the odds ratio */
  start LL(beta);
    b0=beta[1]; theta=beta[2]; b1=log(theta);
    tomax=7*b0+3*b1-5*log(1+exp(b0+b1))-26*log(1+exp(b0));
    return(-tomax);
  finish LL;
  con={-20 -20,20 20}; beta0={-1 10};

  /* Negative log-likelihood in b0 alone, with theta held fixed (global) */
  start PNLL(b0) global(theta);
    b1=log(theta);
    tomax=7*b0+3*b1-5*log(1+exp(b0+b1))-26*log(1+exp(b0));
    return(-tomax);
  finish PNLL;
  beta0p={-1}; conp={-99,99}; opt={0,0};

  /* Full MLEs of b0 and theta */
  call nlpcg(rc,betahat,"LL",beta0,opt,con);
  b0=betahat[1]; theta=betahat[2]; b1=log(theta);

  /* Profile log-likelihood (PLL) over a grid of theta values */
  ans={0 0 0};
  do theta=0.5 to 210.5 by 0.1;
    call nlpcg(rc,beta0hat,"PNLL",beta0p,opt,conp);
    PNLLmin=PNLL(beta0hat); PLL=-PNLLmin;
    temp=beta0hat||theta||PLL;
    ans=ans//temp;
  end;
  len=nrow(ans); ans=ans[2:len,];   /* drop the placeholder first row */
  print ans;
quit;

[Figure: Profile Log‐Likelihood Plot for Theta (Odds Ratio), with the 95% and 99% cut lines for the CI; x‐axis: Odds Ratio (0 to 200); y‐axis: Profile Log‐Likelihood (−18.0 to −14.5).]

G4. Bioassay Modelling: Relative Potency and Synergy

Relative potency is often assessed using the ratio of two means (as in the above Fieller‐Creasy example) or the ratio of two LD50’s; drug synergy can be assessed using the Finney model. The “treatment” here is the combination of two drugs (A & B)

[Figure: six panels plotting Amount of Compound B against Amount of Compound A, both axes running from 0 to 30.]

In the Finney4 model, similar compounds A and B, in respective amounts $x_1$ and $x_2$, are related to the binomial response Y by first calculating the effective dose.

Here, $\theta_4$ is the relative potency parameter and $\theta_5$ is the coefficient of synergy. If $\theta_5 < 0$, compounds A and B exhibit antagonism; if $\theta_5 > 0$, synergy is indicated; and if $\theta_5 = 0$, then compounds A and B behave independently. The binomial response variable and effective dose can be related using a dose‐response model function such as the LL2 (two‐parameter log‐logistic) function.
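For reference, Finney's synergy model is commonly written with an effective dose of the following form, where $\theta_4$ denotes the relative potency parameter and $\theta_5$ the coefficient of synergy. This is a sketch in assumed notation rather than the slide's own display: the symbol $z$, the parameter names $\theta_2$ and $\theta_3$, and the particular log‐logistic parameterization are all assumptions.

```latex
% Effective dose (assumed notation): reduces to x_1 + \theta_4 x_2 when \theta_5 = 0
z = x_1 + \theta_4 x_2 + \theta_5 \sqrt{\theta_4 x_1 x_2}

% One common two-parameter log-logistic response curve (assumed parameterization),
% with \theta_2 the dose giving a 50% response and \theta_3 a slope parameter
\pi(z) = \frac{1}{1 + (z/\theta_2)^{-\theta_3}}
```

Under this form, a positive $\theta_5$ raises the effective dose of a mixture above the additive amount $x_1 + \theta_4 x_2$, and a negative $\theta_5$ lowers it.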

This generalized nonlinear Finney4 model is easily fit to the data given in Giltinan et al (1988) using the NLMIXED procedure; the relevant output is given in the proceedings paper. For these two compounds, antagonism is indicated since the estimate of the coefficient of synergy, $\hat{\theta}_5 = -1.0349$, is negative. To test whether independent action is observed here ($\theta_5 = 0$), instead of using the Wald results given in the NLMIXED output, we again use the likelihood‐based test. Thus, we fit the reduced model with the condition $\theta_5 = 0$ imposed; this results in the value −2LL = 110.9, and the test statistic $\chi^2_1 = 110.9 - 80.6 = 30.3$ (p < 0.0001). Clearly, these compounds appear to interact antagonistically.
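The likelihood‐ratio arithmetic above is easy to check. This short Python sketch assumes only the two −2LL values quoted in the text and uses the identity that, for 1 df, the chi‐square upper tail probability equals erfc(sqrt(x/2)).

```python
import math

neg2ll_reduced = 110.9          # -2LL with the synergy coefficient fixed at 0
neg2ll_full = 80.6              # -2LL for the full Finney4 model

# Likelihood-ratio statistic: difference in -2LL between nested models
lr_stat = neg2ll_reduced - neg2ll_full

# Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(lr_stat, p_value)
```

The p‐value is far below 0.0001, in line with the conclusion stated above.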

G5. Optimal Experimental Design

An n‐point design (measure) is written

$$\xi = \left\{ \begin{array}{cccc} x_1 & x_2 & \cdots & x_n \\ \omega_1 & \omega_2 & \cdots & \omega_n \end{array} \right\}$$

Here the $\omega_k$ are non‐negative ‘design weights’ which sum to one; the $x_k$, which may indeed be vectors, belong to the design space and are not necessarily distinct. For the model function $\eta(x,\theta)$, the n×p Jacobian matrix is $V = \partial\eta/\partial\theta$ and the p×p information matrix is $M(\xi,\theta) = V^T \Omega V$, where $\Omega = \mathrm{diag}\{\omega_1, \omega_2, \ldots, \omega_n\}$. The first‐order/asymptotic variance of the LS estimator of $\theta$ is proportional to $M^{-1}$, so designs are often chosen to minimize some convex function of $M^{-1}$. Designs which minimize its determinant are called D‐optimal. The variance function is the (first‐order) variance of the predicted response at X = x and is given by the expression $d(x,\xi,\theta) = [\partial\eta(x)/\partial\theta]^T M^{-1} [\partial\eta(x)/\partial\theta]$. Designs that minimize (over $\xi$) the maximum (over x) of $d(x,\xi,\theta)$ are called G‐optimal.

The General Equivalence Theorem (GET) of Kiefer & Wolfowitz (1960) proves that D‐ and G‐optimal designs are equivalent, and that the variance function evaluated using the D‐/G‐optimal design does not exceed the line y = p (number of model parameters), whereas it will exceed this line for all other designs. A corollary establishes that the maximum of the variance function is achieved for the D‐/G‐optimal design at the support points of this design.

This semester, one student is working under Loyola’s Research Experience for Master’s Programs Fellowships on robust optimal design methods for Multi‐Category Logit (MCL) Models, and we are planning to submit our results for publication later this year. This class of MCL Models includes:

Continuation Ratio A (CRA) Logit
Un‐Proportional Odds (UPO) Logit
Adjacent Category (AC) Logit
Continuation Ratio B (CRB) Logit

Our approach is to find designs which are near‐optimal for all members of the class and to highlight which is best.
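To make the General Equivalence Theorem concrete, here is a minimal Python sketch for a deliberately simple case that is not from the talk: the straight‐line model with regressors (1, x) on the assumed design space [−1, 1], so p = 2. The function names are illustrative only. It compares the variance function of the two‐point design with equal weight at ±1 against an equally weighted three‐point design.

```python
def info_matrix(points, weights):
    """Entries of M = sum_k w_k f(x_k) f(x_k)^T for f(x) = (1, x)."""
    m00 = sum(weights)
    m01 = sum(w * x for x, w in zip(points, weights))
    m11 = sum(w * x * x for x, w in zip(points, weights))
    return m00, m01, m11

def det_info(points, weights):
    m00, m01, m11 = info_matrix(points, weights)
    return m00 * m11 - m01 * m01

def variance_function(x, points, weights):
    """d(x) = f(x)^T M^{-1} f(x), with the 2x2 inverse written out."""
    m00, m01, m11 = info_matrix(points, weights)
    det = m00 * m11 - m01 * m01
    return (m11 - 2 * m01 * x + m00 * x * x) / det

grid = [i / 100 - 1 for i in range(201)]      # -1.00, -0.99, ..., 1.00
p = 2                                         # number of model parameters

two_point = ([-1.0, 1.0], [0.5, 0.5])         # candidate D-optimal design
three_point = ([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])

d_opt_max = max(variance_function(x, *two_point) for x in grid)
other_max = max(variance_function(x, *three_point) for x in grid)

print(det_info(*two_point), d_opt_max)        # larger determinant, max d equals p
print(det_info(*three_point), other_max)      # smaller determinant, max d exceeds p
```

For the two‐point design the variance function is 1 + x², which touches the line y = p only at the support points ±1; for the three‐point design it is 1 + 1.5x², which rises above that line near the ends of the interval.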

H. Summary

Hands‐on, real consulting projects, as well as cutting‐edge applied research, help students to draw from all of their coursework and skills, much in the same way as is done on the job. Powerful tools, such as those available using SAS® software, enable researchers, educators and decision‐makers to directly answer their queries and to further the learning and integration process.

Thank You
