Leveraging Statistical Consulting and Research Projects in Academia
Timothy E. O’Brien
Department of Mathematics and Statistics
Loyola University Chicago
WIILSU Conference, Chicago, Illinois
November 4th, 2013
Talk Outline
A. Life as an Academic Statistician
B. Teaching Activities
C. Research Activities
D. Consulting Activities
E. Problem and Solution
F. Some Illustrations: Consulting
G. Some Illustrations: Applied Research
H. Summary
A. Life as an Academic Statistician
Teaching activities: 2/2, 2/3 or 3/3 teaching load including some ‘concepts’ classes; MS/Statistics advising
Loyola is a liberal arts university focusing on ‘arts’ in addition to ‘sciences’
Research and Grants: publishing and grants are essential; at Loyola, the focus is on involving UG and MS quantitative students in our research activities
Consulting: mostly for fun (service) – but now also a part of our MS curriculum
B. Teaching Activities
In addition to service courses (concepts & biostatistics), we teach the following to advanced UGs and MS students:
Introductory probability and math statistics
SAS and R programming and methods
Applied regression (REG, LOGISTIC, NLIN)
Experimental design (GLM, RSREG; JMP)
Categorical data analysis (FREQ, LOGISTIC, CATMOD)
Survival Analysis (LIFETEST, PHREG, LIFEREG)
Quantitative Bioinformatics (MCMC, IML)
Sampling Methods (SURVEYMEANS, SURVEYSELECT, SURVEYLOGISTIC)
Nonparametrics (NPAR1WAY, MULTTEST)
Longitudinal Methods (MIXED, NLMIXED, GENMOD)
Statistical Consulting (newly added capstone)
From Gerry Hahn’s A Career in Statistics:
C. Research Activities
My research (often involves students and) focuses on:
Experimental / optimal design
Applied generalized linear models (bivariate logistic)
Applied nonlinear models
Likelihood methods, confidence intervals, profiling
Impact of curvature in applied statistics
Bioassay, potency, and drug synergy
Violation of “the usual conditions”; robustness
D. Consulting Activities (on and off campus)
These include:
On campus pro bono consulting in Biology, Chemistry, CJ, Nursing, Sociology, Environmental Science
Medical colleagues: Pharmacology, Virology, Neuroscience and Aging
Children’s Triangle Research (pediatrics)
Environmental consulting group (MWRDGC)
INRA and INSERM in France
Pharma (Amylin, Chiron, Glaxo, J&J‐Janssen)
Chiang Mai University colleagues: Family Medicine, Animal Research, Biostatistics, Clinical Epidemiology
Infectious Disease Institute (Uganda)
Loyola Center for the Human Rights of Children
In this era of big data and analytics, where does the statistician fit in? Historically, into the Scientific Method process:
Downloaded on 11/1/13 from: https://www.google.com/search?site=&source=hp&q=scientific+method+diagram&oq=scientific+method+diagram&gs_l=hp.3..0l4j0i22i30l5.1272.12352.0.12593.41.26.7.8.9.0.208.2092.21j4j1.26.0....0...1c.1.30.hp..1.40.2022.oA_qbf6Hoa4
Infectious Disease Institute, Kampala, Uganda:
Chiang Mai University:
This year’s Career Night / Pizza Party:
E. Problem and Solution
Problem: Coursework often encourages “stove‐piping”, in which students focus only on the topic of the current course, often with contrived (textbook) illustrations.
Solution: Statistical coursework can be integrated and synthesized by working on real‐time, hands‐on statistical consulting and/or research projects. These experiences help graduates to ‘hit the ground running’ once on the job, and help those furthering their studies to understand the open‐ended yet rewarding aspects of applied research.
F. Some Illustrations: Consulting
Current Class Projects:
Assessment of Loyola’s Dissertation Boot Camp (questionnaires, sampling methods, associations)
NYC Fish Project (data dimension reduction, graphics, comparison of two interventions)
Healthy Homes Initiative (GIS, spatial statistics, “big data”, “overlaying” maps)
Loyola Political Science PhD Student (dimension reduction, regression, logistic, survival, length of time in office)
Nan (Thailand) Hospital DM Study (bivariate logistic, determinants of diabetes)
Chicago Crime Study (questionnaires, what murders make the news)
Additional Projects:
Assessment of Loyola’s McNair Scholar Project (DB management, correlations/regression)
Loyola Anthropology colleague: discriminant analysis for archeology research
Loyola Sustainability colleague: factors that motivate cell phone purchases
Loyola Center for the Human Rights of Children: grant application (challenge grant)
Infectious Disease Institute: assisting interns/residents with their research projects
LUMC Virology group: sample size, CIs for ED50s
CMU Family Medicine, Animal Research, Faculty of Medicine biostatistics group, Clinical Epidemiology
Last Year’s Class Projects:
Loyola Psychology colleague: hierarchical modelling, study in Chicago schools
Loyola Neurology colleague: recovery from stroke, repeated measures
Tanzania Public/Global Health Research: assessment of interventions, HLM
YMCA Data Analytics: econometric methods
Environmental/Water Reclamation Assessment: regression methods, mixed modelling
Additional Projects with Students:
Student with Provost Fellowship: helping to coordinate campus‐wide statistical consulting and computer software resources
Bioinformatics Student: working on biostatistics research related to relative potency and synergy
Two MS/Applied Statistics Students: working on manuscript related to likelihood methods
Observations To Date:
Working on real projects, with real deadlines, is key to helping students apply their coursework and integrate their knowledge (doing away with stove‐piping)
From start to finish: data cleaning (missing values, merging from different sources) and project write‐up, reporting, and presentation are as important as the correct analysis
Gives students a deeper appreciation of their studies/field
Jump‐starts students into their jobs, thereby easing the transition into the “real world”
G. Some Illustrations: Applied Research
Estimating a Binomial Proportion
Hardy‐Weinberg, Genotypes, Phenotypes
Modelling: Logistic Regression
Relative Potency and Synergy
Optimal Design
G1. Bridging via Likelihood: Estimating a Binomial Proportion
When you toss a coin 15 times and obtain just one Head (success), you may estimate the true success probability $\pi$ to be $\hat{\pi} = 1/15 = 6.67\%$. The usual 95% Wald confidence interval for the success probability is
$\hat{\pi} \pm 1.96 \sqrt{\hat{\pi}(1 - \hat{\pi})/n}$; here: -0.0596 to 0.1929.
This interval is clearly nonsensical since the success probability must be non‐negative. The Wald interval is based on the quadratic approximation (the dashed curve plotted below) to the log‐likelihood expression (solid curve plotted below),
$LL(\pi) = y \log(\pi) + (n - y) \log(1 - \pi) = \log(\pi) + 14 \log(1 - \pi)$
We obtain the cut‐lines below by using $\chi^2_1 = 3.84$ [95%] and $\chi^2_1 = 6.63$ [99%] in our calculations. Here, the likelihood‐based CI (LCI), (0.0039, 0.2621), has been obtained using PROC IML. Alternative CIs include the Score, Exact (but note Agresti’s 1998 TAS paper), and Bayesian intervals.
This example clearly shows that the Wald quadratic curve can do a poor job of approximating the actual likelihood.
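The two intervals quoted for the coin example (y = 1, n = 15) can be reproduced with a few lines of code. The sketch below is in Python rather than the PROC IML used for the slides, and the helper names (`log_lik`, `wald_ci`, `likelihood_ci`, `bisect`) are illustrative, not from the talk. It assumes 0 < y < n so that the log-likelihood is finite at the estimate.

```python
import math

def log_lik(p, y, n):
    """Binomial log-likelihood in p, with the constant term dropped."""
    return y * math.log(p) + (n - y) * math.log(1.0 - p)

def wald_ci(y, n, z=1.959964):
    """Wald interval: estimate plus/minus z times the estimated SE."""
    p_hat = y / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def likelihood_ci(y, n, chi2_cut=3.841459):
    """Likelihood-based interval: all p whose log-likelihood lies within
    chi2_cut / 2 of the maximum (assumes 0 < y < n)."""
    p_hat = y / n
    cut = log_lik(p_hat, y, n) - chi2_cut / 2.0
    gap = lambda p: log_lik(p, y, n) - cut
    lower = bisect(gap, 1e-12, p_hat)
    upper = bisect(gap, p_hat, 1.0 - 1e-12)
    return lower, upper

wald = wald_ci(1, 15)
lci = likelihood_ci(1, 15)
print("Wald 95%% CI:       (%.4f, %.4f)" % wald)
print("Likelihood 95%% CI: (%.4f, %.4f)" % lci)
```

Passing `chi2_cut=6.634897` to `likelihood_ci` gives the corresponding 99% interval.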
G2. Bridging via Likelihood: Phenotypes, Genotypes, CIs/Info
Hardy‐Weinberg equilibrium holds that whenever random mixing occurs and a trait or allele occurs with probability $\pi$ in the parent population, then the offspring genotypes occur with the following probabilities:
AA [prob = $\pi^2$]
Aa [prob = $2\pi(1-\pi)$]
aa [prob = $(1-\pi)^2$]
A study of n = 169 crosses resulted in $x_1 = 125$, $x_2 = 34$, $x_3 = 10$.
Genotype: $LL(\pi) = (2x_1+x_2) \log(\pi) + (x_2+2x_3) \log(1 - \pi)$
Information (2nd deriv.) = $8n^3/[(2x_1+x_2)(x_2+2x_3)] = 2517.9$
SE = $(2517.9)^{-1/2} = 0.0199$; $p_G = 0.8402$
Phenotype: $LL(\pi) = (x_1+x_2)\log(\pi) + (x_1+x_2)\log(2-\pi) + 2x_3\log(1-\pi)$
Info (2nd deriv.) = 718.5 and SE = $(718.5)^{-1/2} = 0.0373$; $p_P = 0.7570$
Note the drop in information in going from genotype to phenotype, and the increase (almost doubling) in the SE.
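As a rough check on the Hardy‐Weinberg figures above, the following Python sketch evaluates the estimate, the information (the negative second derivative of the log-likelihood at the estimate), and the SE for both data types. The closed-form estimates used here (allele counting for genotypes, and one minus the square root of the recessive proportion for phenotypes) follow from setting the first derivative of each log-likelihood to zero. The function names are illustrative.

```python
import math

def genotype_fit(x1, x2, x3):
    """Genotype data: all three classes AA, Aa, aa are observed."""
    a = 2 * x1 + x2            # number of A alleles
    b = x2 + 2 * x3            # number of a alleles
    p_hat = a / (a + b)
    # negative second derivative of a*log(p) + b*log(1-p) at p_hat
    info = a / p_hat ** 2 + b / (1.0 - p_hat) ** 2
    return p_hat, info, 1.0 / math.sqrt(info)

def phenotype_fit(x1, x2, x3):
    """Phenotype data: AA and Aa are indistinguishable (dominant trait)."""
    n = x1 + x2 + x3
    m = x1 + x2                # number showing the dominant phenotype
    p_hat = 1.0 - math.sqrt(x3 / n)
    # negative second derivative of
    # m*log(p) + m*log(2-p) + 2*x3*log(1-p) at p_hat
    info = (m / p_hat ** 2 + m / (2.0 - p_hat) ** 2
            + 2 * x3 / (1.0 - p_hat) ** 2)
    return p_hat, info, 1.0 / math.sqrt(info)

p_g, info_g, se_g = genotype_fit(125, 34, 10)
p_p, info_p, se_p = phenotype_fit(125, 34, 10)
print("Genotype:  p = %.4f, info = %.1f, SE = %.4f" % (p_g, info_g, se_g))
print("Phenotype: p = %.4f, info = %.1f, SE = %.4f" % (p_p, info_p, se_p))
print("SE ratio (phenotype / genotype): %.2f" % (se_p / se_g))
```

The closed-form phenotype estimate agrees with the slide's 0.7570 to three decimal places.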
[Figure: log‐likelihood (vertical axis) versus PI, the probability of the dominant trait (horizontal axis, 0.65 to 0.90), for the Genotype and Phenotype data.]
G3. Modelling: Logistic Regression
The Binary (or Binomial) Logistic Model is an illustration of a generalized linear model; it assumes the Binomial distribution and (usually) the logit link, viz,
$\log[\pi/(1 - \pi)] = \beta_0 + \beta_1 x$
As we see from this expression, the model function (RHS) is linear. The variance function here, $n\pi(1 - \pi)$, comes from the assumed Binomial distribution, so no new parameter (such as $\sigma^2$) is introduced here; this may indeed be a problem where ‘over‐dispersion’ exists.
Note that if we write instead $\log[\pi/(1 - \pi)] = \beta_1(x - \gamma)$, so that $\gamma$ = LD50 is a model parameter, then we have a generalized nonlinear model. Another GNLM example is $\log[\pi/(1 - \pi)] = \alpha + \log(\theta)x$, where $\theta$ is the OR (odds ratio).
In addition to PROC LOGISTIC, the NLMIXED procedure can be used here. Here is an illustration with a 2×2 table:
             x (dummy)   Cases   Non‐cases   Total   Odds
Exposed          1          3         2         5     $e^{\beta_0+\beta_1}$
Unexposed        0          4        22        26     $e^{\beta_0}$
Total                       7        24        31
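/* y = cases out of n subjects at each level of x (1 = exposed, 0 = unexposed) */
data one;
  do x=1,0;
    input y n @@;
    output;
  end;
datalines;
3 5 4 26
;
/* events/trials syntax; clodds=both requests Wald and profile-likelihood CIs */
proc logistic;
  model y/n=x / clodds=both;
run;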
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
                             Standard         Wald
Parameter   DF   Estimate      Error    Chi-Square   Pr > ChiSq
Intercept    1    -1.7047     0.5436        9.8363       0.0017
x            1     2.1100     1.0624        3.9443       0.0470

Odds Ratio Estimates
            Point         95% Wald
Effect   Estimate    Confidence Limits
x           8.248     1.028     66.177

Profile Likelihood Confidence Interval for Adjusted Odds Ratios
Effect     Unit   Estimate   95% Confidence Limits
x        1.0000      8.248     1.062     81.025
Students are often curious (1) how the parameter estimates and SEs are obtained, and (2) how the 95% WCIs and PLCIs are found. A discussion of MLE, information, SEs follows, but profiling can take several examples to ensure understanding.
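For the first question, the 2×2 case is convenient because the MLEs and Wald SEs have closed forms: the fitted odds equal the observed odds in each row, and the variance of a log odds (or log odds ratio) is the sum of the reciprocal cell counts involved. A minimal Python sketch follows (not the SAS code used in the talk; the function name `logit_2x2` is illustrative).

```python
import math

def logit_2x2(a, b, c, d, z=1.959964):
    """Closed-form logistic fit for a 2x2 table.
    a, b = cases, non-cases among the exposed   (x = 1)
    c, d = cases, non-cases among the unexposed (x = 0)
    """
    b0 = math.log(c / d)                    # log odds when x = 0
    b1 = math.log((a * d) / (b * c))        # log odds ratio
    se_b0 = math.sqrt(1 / c + 1 / d)
    se_b1 = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Wald interval on the log scale, then exponentiated
    or_ci = (math.exp(b1 - z * se_b1), math.exp(b1 + z * se_b1))
    return {"b0": b0, "se_b0": se_b0, "b1": b1, "se_b1": se_b1,
            "or": math.exp(b1), "or_wald_ci": or_ci}

fit = logit_2x2(3, 2, 4, 22)
for name in ("b0", "se_b0", "b1", "se_b1", "or"):
    print("%-6s %.4f" % (name, fit[name]))
print("Wald 95%% CI for OR: (%.3f, %.3f)" % fit["or_wald_ci"])
```

The exact odds ratio is 66/8 = 8.25; the value 8.248 in the PROC LOGISTIC output differs only in the last digit, which is consistent with iterative fitting.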
[Figure: two panels, each titled “95% LRCR for 2x2 Logistic Example”, showing the contour at log‐likelihood -16.44746. Left panel: slope versus intercept. Right panel: OR (odds ratio) versus intercept.]
SAS Program to obtain MLEs and create PLL Curve
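proc iml;
  /* negative log-likelihood in (b0, theta), where theta is the odds ratio */
  start LL(beta);
    b0=beta[1]; theta=beta[2]; b1=log(theta);
    tomax=7*b0+3*b1-5*log(1+exp(b0+b1))-26*log(1+exp(b0));
    return(-tomax);
  finish LL;
  /* row 1 = lower bounds, row 2 = upper bounds;
     the lower bound on theta is kept positive so that log(theta) is defined */
  con={-20 0.001,20 20};
  beta0={-1 10};
  /* negative log-likelihood in b0 alone, with theta held fixed (global) */
  start PNLL(b0) global(theta);
    b1=log(theta);
    tomax=7*b0+3*b1-5*log(1+exp(b0+b1))-26*log(1+exp(b0));
    return(-tomax);
  finish PNLL;
  beta0p={-1}; conp={-99,99};
  opt={0,0};
  /* full MLEs */
  call nlpcg(rc,betahat,"LL",beta0,opt,con);
  b0=betahat[1]; theta=betahat[2]; b1=log(theta);
  /* profile: for each theta on a grid, minimize PNLL over b0 */
  ans={0 0 0};
  do theta=0.5 to 210.5 by 0.1;
    call nlpcg(rc,beta0hat,"PNLL",beta0p,opt,conp);
    PNLLmin=PNLL(beta0hat);
    PLL=-PNLLmin;
    temp=beta0hat||theta||PLL;
    ans=ans//temp;
  end;
  len=nrow(ans); ans=ans[2:len,];   /* drop the initial row of zeros */
  print ans;
quit;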
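The same profiling idea can be sketched in Python to show where the profile-likelihood limits for the odds ratio come from. This is an illustration under stated assumptions, not a port of the IML program: instead of a grid over theta with a conjugate-gradient optimizer, it solves the score equation in b0 by bisection for each theta, then finds the two values of theta at which the profile log-likelihood crosses the 95% cut line (maximum minus 3.841459/2). The names `profile_log_lik`, `gap`, and `bisect` are hypothetical.

```python
import math

# 2x2 table from the slide: cases and totals for exposed / unexposed
Y1, N1 = 3, 5
Y0, N0 = 4, 26

def log_lik(b0, theta):
    """Binomial log-likelihood with logit(pi) = b0 + log(theta) * x."""
    b1 = math.log(theta)
    return ((Y1 + Y0) * b0 + Y1 * b1
            - N1 * math.log(1.0 + math.exp(b0 + b1))
            - N0 * math.log(1.0 + math.exp(b0)))

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def profile_log_lik(theta):
    """Maximize log_lik over b0 for fixed theta via the score equation."""
    def score(b0):
        e = math.exp(b0)
        return ((Y1 + Y0) - N1 * theta * e / (1.0 + theta * e)
                - N0 * e / (1.0 + e))
    b0_hat = bisect(score, -30.0, 30.0)   # score is decreasing in b0
    return log_lik(b0_hat, theta)

theta_hat = (Y1 * (N0 - Y0)) / ((N1 - Y1) * Y0)   # sample odds ratio
ll_max = profile_log_lik(theta_hat)
cut = ll_max - 3.841459 / 2.0                     # 95% cut line

def gap(log_theta):
    return profile_log_lik(math.exp(log_theta)) - cut

# search on the log scale, below and above the estimate
lower = math.exp(bisect(gap, math.log(0.01), math.log(theta_hat)))
upper = math.exp(bisect(gap, math.log(theta_hat), math.log(1e4)))
print("max log-likelihood: %.4f, cut line: %.4f" % (ll_max, cut))
print("95%% profile-likelihood CI for OR: (%.3f, %.3f)" % (lower, upper))
```

Replacing 3.841459 with 6.634897 moves the cut line down and gives the 99% limits.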
[Figure: “Profile Log-Likelihood Plot for Theta (Odds Ratio)”. Profile log‐likelihood (vertical axis, -18.0 to -14.5) versus odds ratio (horizontal axis, 0 to 200), shown with the 95% cut line for the CI and the 99% cut line for the CI.]
G4. Bioassay Modelling: Relative Potency and Synergy
Relative potency is often assessed using the ratio of two means (as in the Fieller‐Creasy problem) or the ratio of two LD50’s; drug synergy can be assessed using the Finney model. The “treatment” here is the combination of two drugs (A & B).
[Figure: six panels plotting Amount of Compound B (vertical axis, 0 to 30) against Amount of Compound A (horizontal axis, 0 to 30).]
In the Finney4 model, similar compounds A and B in respective amounts $x_1$ and $x_2$ are related to the binomial response Y by first calculating the effective dose,
$z = x_1 + \theta_4 x_2 + \theta_5 \sqrt{\theta_4 x_1 x_2}$
Here, $\theta_4$ is the relative potency parameter and $\theta_5$ is the coefficient of synergy. If $\theta_5 < 0$, compounds A and B exhibit antagonism; if $\theta_5 > 0$, synergy is indicated; and if $\theta_5 = 0$, then compounds A and B behave independently. The binomial response variable and effective dose can be related using a dose‐response model function such as the two‐parameter log‐logistic (LL2) function.
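To make the roles of the two parameters concrete, here is a small Python sketch of the effective dose. It assumes the commonly cited form of Finney's effective dose, z = x1 + theta4*x2 + theta5*sqrt(theta4*x1*x2). The function name and the parameter values are hypothetical, chosen only to show the sign convention, and are not estimates from the talk.

```python
import math

def effective_dose(x1, x2, theta4, theta5):
    """Effective dose under the assumed Finney form.
    theta4: relative potency of compound B with respect to compound A
    theta5: coefficient of synergy (0 = independent action)
    """
    return x1 + theta4 * x2 + theta5 * math.sqrt(theta4 * x1 * x2)

# hypothetical amounts and relative potency, for illustration only
x1, x2, theta4 = 10.0, 20.0, 0.5

z_independent = effective_dose(x1, x2, theta4, 0.0)
z_synergy = effective_dose(x1, x2, theta4, 1.0)
z_antagonism = effective_dose(x1, x2, theta4, -1.0)
print("independent action: z = %.1f" % z_independent)
print("synergy:            z = %.1f" % z_synergy)
print("antagonism:         z = %.1f" % z_antagonism)
```

With a positive coefficient the mixture acts like a larger dose than the potency-weighted sum of its parts, and with a negative coefficient like a smaller one.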
This generalized nonlinear Finney4 model is easily fit to the data given in Giltinan et al (1988) using the NLMIXED procedure; the relevant output is given in the proceedings paper. For these two compounds, antagonism is indicated since the estimate of the coefficient of synergy, $\hat{\theta}_5 = -1.0349$, is negative. To test whether independent action is observed here ($\theta_5 = 0$), instead of using the Wald results given in the NLMIXED output, we again use the likelihood‐based test. Thus, we fit the reduced model with the condition $\theta_5 = 0$ imposed; this results in the value -2LL = 110.9, and the test statistic $\chi^2_1 = 110.9 - 80.6 = 30.3$ (p < 0.0001). Clearly, these compounds appear to interact antagonistically.
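The p-value for this likelihood-ratio test can be checked without statistical software. For one degree of freedom, a chi-square variable is the square of a standard normal, so its upper tail probability is erfc(sqrt(x/2)). The Python sketch below uses the two -2LL values quoted above; the function name is illustrative.

```python
import math

def chi2_1df_upper_tail(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.
    Since X = Z**2 with Z standard normal, this equals erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

neg2ll_reduced = 110.9   # model with theta5 = 0 imposed
neg2ll_full = 80.6       # full Finney4 model
lr_stat = neg2ll_reduced - neg2ll_full
p_value = chi2_1df_upper_tail(lr_stat)
print("LR statistic = %.1f, p-value = %.2e" % (lr_stat, p_value))
```

As a sanity check on the helper, `chi2_1df_upper_tail(3.841459)` returns approximately 0.05.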
G5. Optimal Experimental Design
An n‐point design (measure) is written
$\xi = \begin{Bmatrix} x_1 & x_2 & \cdots & x_n \\ \omega_1 & \omega_2 & \cdots & \omega_n \end{Bmatrix}$
Here the $\omega_k$ are non‐negative ‘design weights’ which sum to one; the $x_k$, which may indeed be vectors, belong to the design space and are not necessarily distinct. For the model function $\eta(x,\theta)$, the n×p Jacobian matrix is $V = \partial\eta/\partial\theta$ and the p×p information matrix is $M(\xi,\theta) = V^T \Omega V$, where $\Omega = \mathrm{diag}\{\omega_1, \omega_2, \ldots, \omega_n\}$. The first‐order/asymptotic variance of the LS estimator of $\theta$ is proportional to $M^{-1}$, so designs are often chosen to minimize some convex function of $M^{-1}$. Designs which minimize its determinant are called D‐optimal. The variance function is the (first‐order) variance of the predicted response at X = x and is given by the expression $d(x,\xi,\theta) = [\partial\eta(x)/\partial\theta]^T M^{-1} [\partial\eta(x)/\partial\theta]$. Designs that minimize (over $\xi$) the maximum (over x) of $d(x,\xi,\theta)$ are called G‐optimal.
The General Equivalence Theorem (GET) of Kiefer & Wolfowitz (1960) proves that D‐ and G‐optimal designs are equivalent, and that the variance function evaluated using the D‐/G‐optimal design does not exceed the line y = p (the number of model parameters), whereas it will exceed this line for all other designs. A corollary establishes that the maximum of the variance function is achieved for the D‐/G‐optimal design at the support points of this design.
This semester, one student is working under Loyola’s Research Experience for Master’s Programs Fellowships on robust optimal design methods for Multi‐Category Logit (MCL) Models, and we are planning to submit our results for publication later this year. This class of MCL Models includes:
Continuation Ratio A (CRA) Logit
Un‐Proportional Odds (UPO) Logit
Adjacent Category (AC) Logit
Continuation Ratio B (CRB) Logit
Our approach is to find designs which are near‐optimal for all members of the class and to highlight which is best.
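The General Equivalence Theorem described earlier in this section can be checked numerically in a simple case. The Python sketch below assumes a straight-line model, eta(x) = theta1 + theta2*x on the design space [-1, 1], so that p = 2 and the gradient is f(x) = (1, x). This toy model is chosen for illustration and is not one of the MCL models from the talk. The design with weight 1/2 at each endpoint is the known D-optimal design for this model; the three-point design is an arbitrary comparison.

```python
def info_matrix(points, weights):
    """Entries of M = sum_k w_k f(x_k) f(x_k)^T for f(x) = (1, x)."""
    m00 = sum(weights)
    m01 = sum(w * x for x, w in zip(points, weights))
    m11 = sum(w * x * x for x, w in zip(points, weights))
    return m00, m01, m11

def variance_function(x, points, weights):
    """d(x, xi) = f(x)^T M^{-1} f(x), using the explicit 2x2 inverse."""
    m00, m01, m11 = info_matrix(points, weights)
    det = m00 * m11 - m01 * m01
    return (m11 - 2.0 * m01 * x + m00 * x * x) / det

grid = [i / 100.0 for i in range(-100, 101)]   # design space [-1, 1]

d_opt = ([-1.0, 1.0], [0.5, 0.5])              # D-optimal for this model
three_pt = ([-1.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3])

max_opt = max(variance_function(x, *d_opt) for x in grid)
max_three = max(variance_function(x, *three_pt) for x in grid)
print("max variance function, D-optimal design:   %.3f" % max_opt)
print("max variance function, three-point design: %.3f" % max_three)
```

For the endpoint design the variance function is 1 + x², which touches the line y = p = 2 exactly at the support points x = -1 and x = 1. The three-point design exceeds that line near the ends of the interval.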
H. Summary
Hands‐on, real consulting projects as well as cutting‐edge applied research help students to draw from all of their coursework and skills, much in the same way as is done on the job. Powerful tools, such as those available using SAS® software, enable researchers, educators and decision‐makers to directly answer their queries and to further the learning and integration process.
Thank You