Page 1: Current Developments in Quantitative Research Methods

Current Developments in

Quantitative Research Methods

LOT Winter School
January 2014
Luke Plonsky

Page 2: Current Developments in Quantitative Research Methods

Welcome & Introductions

Page 3: Current Developments in Quantitative Research Methods

Course Introduction
Methodological reform (revolution?) taking place

Goal: more accurately inform theory, practice, and future research

Content objectives: conceptual and practical (but mostly conceptual)
Inform participants’ current and future research efforts
Motivate future inquiry with a methodological focus

Not stats-heavy or overly technical; assumes basic knowledge of descriptive and inferential statistics (e.g., M, SD, t tests, ANOVA)

Examples mostly from second language (L2) research
Format: lecture, all-group discussion, and small-group discussion
Ask Qs at any time!

[Diagram: Theory ↔ Research ↔ Practice]

Page 4: Current Developments in Quantitative Research Methods

Course Overview
Monday/today: Statistical power, effect sizes, and fallacies of statistical significance
Tuesday: Meta-analysis and the synthetic approach
Wednesday: Assessing methodological quality
Thursday: Replication research
Friday: Data transparency, reporting practices, and visualization techniques

Page 5: Current Developments in Quantitative Research Methods

Statistical power, effect sizes, and fallacies of statistical significance

Luke Plonsky
Current Developments in Quantitative Research Methods, Day 1

Page 6: Current Developments in Quantitative Research Methods

Review of Common Stats: Comparing Means

t test

ANOVA

[Figure from Bialystok & Miller (1999): mean scores (DV) by group (IV)]

Page 7: Current Developments in Quantitative Research Methods

Review of Common Stats: Correlations

[Scatterplot from DeKeyser (2000)]

Question: What is the relationship between two (continuous) variables?
Positive, negative, or curvilinear?
Strong, weak, moderate, or none?
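A minimal Python sketch of this review point, assuming SciPy; the data and variable names are made up for illustration (loosely echoing an age-of-arrival design, not DeKeyser's actual data):

from scipy import stats

aoa = [3, 7, 12, 15, 19, 24]    # hypothetical ages of arrival
gjt = [92, 88, 80, 71, 65, 60]  # hypothetical grammaticality judgment scores

r, p = stats.pearsonr(aoa, gjt)
print(f"r = {r:.2f}, p = {p:.4f}")  # a strong negative (linear) relationship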

Page 8: Current Developments in Quantitative Research Methods

A Model of Research

Conduct a study (e.g., the effects of A on B)
→ p < .05 → important finding / get published! → modify relevant theory, research, practice
→ p > .05 → trash

What’s wrong with this picture?

Page 9: Current Developments in Quantitative Research Methods

p Values

Page 10: Current Developments in Quantitative Research Methods

(Another quick review)
Q: Wait, real quick: what’s a p value?

A1: The probability of results like those observed, or more extreme (e.g., differences between groups; a relationship between variables), given NO difference between groups / no relationship between variables

A2: NOT an indication of the magnitude, importance, direction, or replicability of an effect/relationship, i.e., WHAT WE REALLY WANT TO KNOW!

Also: Observed p values vary as a function of sample size (N), effect size (e.g., Cohen’s d), and variance.

Page 11: Current Developments in Quantitative Research Methods

OK, on to the Controversy (Anderson et al., 2000)

60+ years and 400+ articles (e.g., Schmidt, 1996; Thompson, 2001)

“The almost universal reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories … is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology” (Meehl, 1967, p. 72).

APA Task Force on Statistical Inference (Wilkinson & TFSI, 1999)

AL (applied linguistics): strict (dogmatic?) adherence to NHST; very little discussion until recently (Crookes, 1991; Ellis, 2006; Larson-Hall, 2010; Lazaraton, 1991; Nassaji, 2012; Norris, 2013; Norris & Ortega, 2000, 2006; Oswald & Plonsky, 2010; Plonsky, 2011, 2012, 2013; Plonsky & Gass, 2011) http://oak.ucc.nau.edu/ldp3/bib_nhst.html

Page 12: Current Developments in Quantitative Research Methods

Wilkinson & TFSI (1999)
Purpose: “to initiate discussion in the field about changes in current practices of data analysis and reporting”

General recommendations: be transparent; calculate power a priori; inspect data descriptively and visually; simpler analyses are best

Specifics: report exact p values; report (contextualized) ESs for all tests; report CIs

Page 13: Current Developments in Quantitative Research Methods

Main arguments against NHST?

Page 14: Current Developments in Quantitative Research Methods

NHST is Unreliable
“The effects of A and B are always different—in some decimal place—for any A and B. Thus asking ‘are the effects different?’ is foolish” (Tukey, 1991, p. 100).

Study   N1   N2   M1 (SD1)   M2 (SD2)   p       d
1        5    5   15 (3)     18 (4)     .2265   0.85
2       15   15   15 (3)     18 (4)     .0276   0.85
3       45   45   15 (3)     18 (4)     .0001   0.85

(As N grows, p shrinks, but d is identical across all three studies.)

The (nil) hypothesis that d = 0 is (almost) always false! (Cohen, 1994)
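A minimal Python sketch reproducing the pattern in the table above, assuming SciPy: the summary statistics are identical in all three “studies,” so d is fixed at 0.85 while p swings from non-significant to highly significant purely as a function of N.

from scipy.stats import ttest_ind_from_stats

for n in (5, 15, 45):
    t, p = ttest_ind_from_stats(mean1=15, std1=3, nobs1=n,
                                mean2=18, std2=4, nobs2=n)
    d = (18 - 15) / (((3**2 + 4**2) / 2) ** 0.5)  # pooled-SD Cohen's d, ~0.85
    print(f"n per group = {n:2d}: p = {p:.4f}, d = {d:.2f}")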

Page 15: Current Developments in Quantitative Research Methods

NHST is Unreliable (Cont’d)
Same goes for p values based on correlations
Remember: the same r (e.g., r = .30) represents the same strength of relationship whether it lands above or below p = .05; only N moves the p value.

Page 16: Current Developments in Quantitative Research Methods

NHST is Unreliable (Cont’d)
“[with NHST] … tired researchers, having collected data on hundreds of subjects, then conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired.” (Thompson, 1992, p. 436)

Page 17: Current Developments in Quantitative Research Methods

NHST is Crude and Uninformative
Continuous data → yes/no dichotomy
p values say nothing about:
Replicability
Theoretical or practical importance
Magnitude of effects

p > .05 ≠ zero effect size: The absence of evidence for differences is not evidence for equivalence (Kline, 2004, p. 67)

Large p values can correspond to large effects and vice versa
Other explanations for p > .05? Small sample/low power/high sampling error; small (i.e., hard-to-detect) effect size; unreliable instruments; weak treatment; other hidden variables; …

Appropriate for a limited period of exploratory research
(Should be an) inverse relationship between theoretical maturity and reliance on p

Page 18: Current Developments in Quantitative Research Methods

NHST is Crude and Uninformative

[Results table from Papi & Abdollahzadeh (2012)]

What could these t tests and resulting p values possibly contribute here?

Page 19: Current Developments in Quantitative Research Methods

Do you see any similar patterns here? (Hint: look at the p values and ESs)

[Results table from Taylor et al. (2006), with callouts:
p > .05 but sizeable d
p < .05 but not large d
p > .05 with negative d]

Page 20: Current Developments in Quantitative Research Methods

NHST is Arbitrary
“…surely, God loves the .06 nearly as much as the .05” (Rosnow & Rosenthal, 1989, p. 1277)

How much more (or less) would we know if the conventional alpha level were .03 (or .15)?

What if tests of statistical significance never existed? (Harlow et al., 1997)

Page 21: Current Developments in Quantitative Research Methods

NHST is Counter-productive
Adherence to NHST (and p values) constrains the progress of theory → inefficient research efforts

NHST & publication bias (Rothstein et al., 2005)

Scenario: 100 intervention studies; H0 is true (i.e., no difference between treatments A and B), alpha = .05
(At least) 5 studies will find p < .05
95 studies will sit unpublished, or be re-run until p < .05 (“jelly beans cause acne”)
Type I error (false positive) rate in published studies = 100%
Treatment effects (which are nil) become grossly overestimated
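The scenario above can be simulated. A minimal Python sketch (not from the slides; the numbers mirror the scenario, not any real study): draw both “treatment” groups from the same population, so the true d is 0, then look at the effect sizes among the studies that happen to clear p < .05.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
published_d = []
for _ in range(100):                  # 100 studies; H0 is true
    a = rng.normal(0, 1, 20)
    b = rng.normal(0, 1, 20)          # same population: true d = 0
    t, p = ttest_ind(a, b)
    if p < .05:                       # only 'significant' results get published
        sd_pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published_d.append(abs(a.mean() - b.mean()) / sd_pooled)

if published_d:
    print(f"{len(published_d)} of 100 published; mean |d| among them: "
          f"{np.mean(published_d):.2f}")   # far from the true d of 0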

[Flowchart repeated from Page 8: conduct a study (e.g., the effects of A on B) → p < .05 → important finding / get published! → modify relevant theory, research, practice; p > .05 → trash]

Page 22: Current Developments in Quantitative Research Methods

Summary
(Quantitative) linguistics research relies heavily on NHST, which is…
highly controversial at best, and possibly dangerous and to-be-avoided;
unreliable;
crude and uninformative;
arbitrary; and
counter-productive

OK, but what can we do to improve?

Page 23: Current Developments in Quantitative Research Methods

Power (Or: a possible solution to our obsession with p values?)

Page 24: Current Developments in Quantitative Research Methods

Statistical Power
What is it?
Why does it matter?
How many participants do I need? (A very practical and common question)

Page 25: Current Developments in Quantitative Research Methods

What kind of power is needed vs. typical?

[Table 2 in Cohen (1992): sample sizes required for adequate power at small (d = 0.2), medium (d = 0.5), and large (d = 0.8) effects]

Are these Ns typical in linguistics research?

Page 26: Current Developments in Quantitative Research Methods

What kind of power is needed vs. typical?

Plonsky & Gass (2011): 2% conducted a power analysis; median d = 0.65 + median n = 22 → overall post hoc power = .56

Plonsky (2013): 1% (6/606 studies) conducted a power analysis; median d = .71 (inflated?) + median n = 19 → overall post hoc power = .57

What does this mean for: internal validity (and, hence, external validity/generalizability)? Past research? Theory-building? Practical implications? Availability bias in meta-analyses?

Page 27: Current Developments in Quantitative Research Methods

The “Power Problem” in L2 Research (Plonsky, 2013, in press)

Rarely analyze power
Small samples (median = 18)
Heavy reliance on NHST
Effects not generally very large
Omission of non-statistical results
Rarely check assumptions
Rarely use multivariate statistics

Page 28: Current Developments in Quantitative Research Methods

Tools for Power Analysis
Cohen’s (1988, 1992) power tables

A priori
Conceptually?
Practically: http://danielsoper.com/statcalc3/calc.aspx?id=47

Post hoc
Conceptually?
Practically: http://danielsoper.com/statcalc3/calc.aspx?id=49
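For those who prefer code to the calculators above, a hedged sketch of both calculations using Python’s statsmodels library (the specific d, n, and power values are examples, echoing the figures cited earlier):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# A priori: n per group needed to detect d = 0.5 with power = .80 at alpha = .05
n = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"required n per group: {n:.0f}")

# Post hoc: power achieved with d = 0.71 and n = 19 per group (cf. Plonsky, 2013)
power = analysis.solve_power(effect_size=0.71, nobs1=19, alpha=0.05)
print(f"achieved power: {power:.2f}")  # in the neighborhood of the .57 reported above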

Page 29: Current Developments in Quantitative Research Methods

Quick Review

Page 30: Current Developments in Quantitative Research Methods

What if you can’t get enough power?

This may be the case when, for example…
You’re studying a very small or hard-to-find population (L3 learners of Swahili with L1 Korean)
You have limited funding for running participants
Your phenomenon/relationship/effect of interest is small (i.e., hard to detect)
Your advisor says you can’t use the PSY participant pool

What to do:
Avoid or limit inferential stats
Form fewer (sub)groups → fewer contrasts
Focus on descriptives (including effect sizes and CIs)
‘Bootstrap’ the data?

Page 31: Current Developments in Quantitative Research Methods

Bootstrapping
Random re-sampling from observed data to produce a simulated but more stable outcome (see Larson-Hall & Herrington, 2010)
(More) robust to outliers and non-normal data, both common

Larson-Hall & Herrington (2010):
ANOVA: p > .05 between NSs (n = 15) and 3 learner groups (n = 14, 15, 15)
Tukey post hocs: p < .05 ONLY between NSs and Group A (p = .002); p_B = .407; p_C = .834
Bootstrapped post hoc tests: p < .05 for all three groups
The original p values were non-statistical due to a lack of power, i.e., Type II error

Plonsky et al. (in press):
Re-analyzed raw data from 26 primary L2 studies
4 (of 16) Type I ‘misfits’ (i.e., a 25% Type I ‘misfit’ rate)
0 Type II ‘misfits’
Too much power (via large N) inflated findings?
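A minimal sketch of the core idea in Python, assuming NumPy (illustrative data; this is a generic percentile bootstrap for a mean difference, not Larson-Hall & Herrington’s actual procedure):

import numpy as np

rng = np.random.default_rng(42)
group_a = np.array([15., 18, 12, 17, 14, 16, 13, 19, 15, 14])
group_b = np.array([18., 21, 17, 20, 19, 22, 18, 23, 20, 19])

boot_diffs = []
for _ in range(10_000):   # resample each group, with replacement, many times
    a = rng.choice(group_a, size=group_a.size, replace=True)
    b = rng.choice(group_b, size=group_b.size, replace=True)
    boot_diffs.append(b.mean() - a.mean())

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the mean difference: [{lo:.2f}, {hi:.2f}]")
# An interval that excludes 0 suggests the difference is not mere sampling error.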

Page 32: Current Developments in Quantitative Research Methods

BUT EVEN WITH GREATER POWER VIA BOOTSTRAPPING, OUR RESULTS ARE STILL BASED ON THE FLAWED NOTION OF STATISTICAL SIGNIFICANCE

Page 33: Current Developments in Quantitative Research Methods

EFFECT SIZES! (Or: a MUCH BETTER solution to our obsession with p values)

Page 34: Current Developments in Quantitative Research Methods

Effect Sizes
Questions we’ll address:
What are they? How do we calculate them?
What advantages do ESs provide over p values?
How can we interpret ESs?

Page 35: Current Developments in Quantitative Research Methods

What is an effect size?
A quantitative indication of the strength of a relationship or an effect

Common effect sizes:
Standardized mean differences (Cohen’s d): (M1 − M2) / SD_pooled (see Excel macro for calculating d, and the sketch below)
Correlation coefficients (e.g., r)
Shared variance (R², eta²)
Odds ratios (likelihood of A given B)
Percentages
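A minimal Python version of the pooled-SD formula above (the slides point to an Excel macro; this sketch is an illustration, not that macro):

import statistics

def cohens_d(x, y):
    """Standardized mean difference: (M1 - M2) / SD_pooled."""
    n1, n2 = len(x), len(y)
    s1, s2 = statistics.stdev(x), statistics.stdev(y)
    sd_pooled = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / sd_pooled

print(cohens_d([15, 17, 13, 16, 14], [18, 21, 17, 20, 19]))  # ~ -2.53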

Page 36: Current Developments in Quantitative Research Methods

Why Effect Sizes? (An alternative to NHST and p)

Null Hypothesis Significance Testing (p) vs. Effect Sizes (d):

NHST is unreliable: the result depends on sample size (e.g., Kline, 2009)
→ ESs: not dependent on N

NHST is crude and uninformative: (a) it forces continuous data into a yes/no dichotomy; (b) it tells us nothing about practical significance or magnitude (e.g., Cohen, 1994)
→ ESs: express the magnitude/size of a relationship (i.e., WHAT WE REALLY WANT TO KNOW)

NHST is arbitrary: “…surely, God loves the .06 nearly as much as the .05” (Rosnow & Rosenthal, 1989, p. 1277)
→ ESs: continuous, and can be compared/combined across studies

Page 37: Current Developments in Quantitative Research Methods

Research Questions and Their Answers Using NHST vs. ESs

Think of a study you read recently or one that you’re working on.
What were the RQs?
Were they phrased dichotomously (Do …? Is there a difference …?)?
If so, what kind of answer can come from such an RQ?
How might the findings differ with an emphasis on magnitude rather than presence/absence of a relationship or effect?

Page 38: Current Developments in Quantitative Research Methods

Why Effect Sizes? (Journal Requirements)

APA Publication Manual, 6th Edition

Three major L2 journals: Language Learning, TESOL Quarterly, Modern Language Journal

• Plonsky & Gass (2011): 0% (1980s) → 0% (1990s) → 27% (2000s)

• Plonsky (2013): 3% (1990s) → 42% (2000s)

So now effect sizes get reported more often…?

Page 39: Current Developments in Quantitative Research Methods

…but very rarely do we interpret them

[Graphic: a scale running from SMALL to BIG]

What do they mean anyway?
What implications do these effects have for future research, theory, and practice?
What does d = 0.50 (or 0.10, or 1.00…) mean?
How big is ‘big’? And how small is ‘small’?

Page 40: Current Developments in Quantitative Research Methods

ESs: Summary
ESs are best understood in relation to other, field-specific effects

Empirically based, field-specific scale for d values in L2 research:
d ≈ 0.40 (small), d ≈ 0.70 (medium), d ≈ 1.00 (large)

“…if people interpreted effect sizes [using fixed benchmarks] with the same rigidity that .05 has been used in statistical testing, we would merely be being stupid in another metric” (Thompson, 2001, pp. 82–83).

Additional considerations:
Theoretical and methodological maturity (over time)
SD units
Research setting (lab vs. classroom; SL vs. FL)
Length/intensity of treatment
Manipulation of IVs
Publication bias
Sample size / sampling error
Instrument reliability

Page 41: Current Developments in Quantitative Research Methods

A Revised Model of Research

Conduct a study (e.g., the effects of A on B)
→ p < .05, d = ? or p > .05, d = ?
→ more precise and reliable estimate of effects
→ accumulation of results (via meta-analysis)
→ modify relevant theory, research, practice
(The ‘Trash’ branch drops out of the model.)

Page 42: Current Developments in Quantitative Research Methods

Based on our discussion today, what changes would you suggest to the field?

Page 43: Current Developments in Quantitative Research Methods

10 Suggestions for Reform
1. A diminished reliance on NHST / p values
2. Drop the “significant” from “statistically significant”
3. Focus on the practical and theoretical importance of results
4. Better educate ourselves and future generations of researchers: emphasize ESs, alternatives to NHST, and synthetic-mindedness in primary research; de-emphasize NHST
5. ESs (for all findings, not only when p < .05)
6. CIs (for all findings, not only when p < .05): “a quiet but insistent reminder that no knowledge is complete or perfect” (Sagan, 1996) (see the sketch after this list)
7. Replication (to mitigate effects of low power)
8. Examine data visually
9. Meta-analysis / a synthetic approach
10. Initiative from the top down
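For suggestion 6, a minimal Python sketch (not from the slides; the data are made up) of a 95% CI around a mean difference, reported alongside the effect size rather than a bare p value:

import numpy as np
from scipy import stats

a = np.array([15., 17, 13, 16, 14, 18, 12, 15])
b = np.array([18., 21, 17, 20, 19, 22, 16, 19])

diff = b.mean() - a.mean()
n1, n2 = a.size, b.size
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))           # pooled standard error
margin = stats.t.ppf(0.975, n1 + n2 - 2) * se   # t critical value x SE
print(f"difference = {diff:.2f}, 95% CI [{diff - margin:.2f}, {diff + margin:.2f}]")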

Page 44: Current Developments in Quantitative Research Methods

Further Reading
Beyond significance testing (Kline, 2013)
The cult of statistical significance (Ziliak & McCloskey, 2008)
Understanding the new statistics (Cumming, 2012)
Effect sizes for research (Grissom & Kim, 2012, 2nd ed.)
Statistical power analysis for the behavioral sciences (Cohen, 1988, 2nd ed.)

Page 45: Current Developments in Quantitative Research Methods

Connections to Other Topics to be Discussed this Week
Meta-analysis (relies on ESs rather than p values) (TUESDAY)
Replication (THURSDAY)
Reporting practices (full descriptives including ESs, always; data transparency, etc.) (FRIDAY)

Page 46: Current Developments in Quantitative Research Methods

Tomorrow: Meta-analysis
Motivation for and benefits of meta-analysis (conceptual understanding)
Procedures/techniques (practical understanding)