That vexed problem of choice: reflections on experimental design and statistics with corpora
ICAME 33, Leuven, 30 May-3 June 2012
Sean Wallis, Jill Bowie and Bas Aarts
Survey of English Usage, University College London
{s.wallis, j.bowie, b.aarts}@ucl.ac.uk

Mar 28, 2015

Outline

• Introduction

• Definitions

• Refining baselines and the ratio principle

• Surveying ‘absolute’ and ‘relative’ variation

• Potential sources of interaction

• Employing alternation analysis

• Objections

• Conclusions

Introduction

• Research questions are really about choice
  – If speakers had no choice about the words or constructions they used, language would be invariant!
• Lab experiments
  – Press button A or button B
• Corpus
  – Speakers may choose construction A or B
    • But they can only actually choose one, A, at each point
    • We have to infer the other type, B, counterfactually
• Identifying alternates is often non-trivial

Mutual substitution

• Mutual substitution A ↔ B
  – Given a corpus, identify all events of Type A that alternate with events of Type B, such that A is mutually replaceable by B, without altering the meaning of the text.
• Replacement
  – B replaces A if B increases, and vice-versa
    • p(A) + p(B) + ... = 1
• Freedom to vary
    • p(X) ∈ [0, 1]
  – Ideal: eliminate invariant Type C terms

Mutual substitution

• Mutual substitution A ↔ B
  – Pronoun who/whom
    • A = whom
    • B = who (objective)
  – But whom is limited to objective case
    • C = who (subjective)
    • We therefore limit alternation to Objects
  – If whom is used ‘incorrectly’ as a Subject, it has an additional constraint (social disfavour)

True rate of alternation

• True rate of alternation
  – If A ↔ B:

    $$p(A \mid \{A, B\}) = \frac{F(A)}{F(A) + F(B)}$$

    • Proportion (fraction) of all cases that are Type A
  – we use p(A) as a shorthand for p(A | {A, B}) if the baseline {A, B} is stated
• Contingency tables

  IV \ DV      | A      | B      | Total          | probability
  condition 1  | f1(A)  | f1(B)  | f1(A) + f1(B)  | p1(A)
  condition 2  | f2(A)  | f2(B)  | f2(A) + f2(B)  | p2(A)
  Total        | F(A)   | F(B)   | F(A) + F(B)    | p(A)
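
The contingency table can be sketched in a few lines of Python. This is only an illustration: the counts are invented and do not come from any corpus, and the function name `true_rate` is hypothetical.

```python
def true_rate(f_a: int, f_b: int) -> float:
    """Return p(A | {A, B}) = F(A) / (F(A) + F(B))."""
    total = f_a + f_b
    if total == 0:
        raise ValueError("no alternating cases: the baseline {A, B} is empty")
    return f_a / total


# Hypothetical frequencies (illustrative only): condition -> (f(A), f(B)).
table = {"condition 1": (30, 70), "condition 2": (45, 55)}

# Per-condition rates p1(A) and p2(A).
rates = {cond: true_rate(a, b) for cond, (a, b) in table.items()}

# Column totals F(A) and F(B), and the pooled rate p(A).
F_A = sum(a for a, _ in table.values())
F_B = sum(b for _, b in table.values())
pooled = true_rate(F_A, F_B)

for cond, p in rates.items():
    print(f"{cond}: p(A) = {p:.3f}")
print(f"Total: p(A) = {pooled:.3f}")
```

Because the baseline is the alternation set {A, B}, p(B) is simply 1 - p(A) in every row.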

True rate of alternation

• Shall/will alternation over time in DCPSE

[Figure: p (y-axis, 0 to 1) plotted against year (x-axis, 1955 to 1995); baseline = {shall, will}]

(Aarts et al., forthcoming)

True rate of alternation

• Shall/(will+’ll) alternation over time in DCPSE

[Figure: p (y-axis, 0 to 1) plotted against year (x-axis, 1955 to 1995); baseline = {shall, will, ’ll}]

(Aarts et al., forthcoming)

True rate of alternation

• Logistic ‘S’ curve assumes freedom to vary
  – p(X) ∈ [0, 1]
  – as do Wilson confidence intervals

[Figure: logistic curves of p (0 to 1) against time t; series: shall/(will+’ll) and shall/’ll]
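
The slide mentions Wilson confidence intervals without giving the formula. A minimal sketch of the standard Wilson score interval follows. The function name `wilson_interval`, the default z value for a 95% interval and the example counts are assumptions for illustration, not taken from the talk.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for an observed proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Illustrative counts only (not corpus data).
lo, hi = wilson_interval(50, 100)
print(f"50/100: ({lo:.4f}, {hi:.4f})")
lo0, hi0 = wilson_interval(0, 10)
print(f"0/10:   ({lo0:.4f}, {hi0:.4f})")
```

The interval stays inside [0, 1] even when the observed proportion is 0 or 1, which is why it only makes sense when p is free to range over that whole interval.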

Refining baselines

• Over-general baselines
  – conflate opportunity and use
  – ‘normalisation’ per million words
    • implies that every word other than A is Type B!
    • is this plausible?
• ‘Art’ of experimental design
  – refine baseline by narrowing dataset
    • reduce and eliminate non-alternating Type C cases
    • optionally: subdivide where different constraints apply
  – different baselines test different hypotheses
    • cf. shall / will / ’ll

[Diagram labelled A and B]

Refining baselines

• Tensed VPs per million words, DCPSE

[Figure: tensed VPs per million words (y-axis, 0 to 140,000) by text category, for LLC and ICE-GB. Categories: formal f-to-f, informal f-to-f, telephone, b discussions, b interviews, commentary, parliament, legal x-exam, assort spont, prepared sp, Total]

  – Total: constant over time
  – Diachronic variation: within text categories
  – Synchronic variation: between text categories

(Bowie et al., forthcoming)

The ratio principle

• Simple algebra
  – any sequence of ratios can be reduced to the ratio of the first and last term:

    $$\frac{F(\text{modal})}{F(\text{word})} = \frac{F(\text{modal})}{F(\text{tVP})} \times \frac{F(\text{tVP})}{F(\text{word})}$$

  – we saw that the ratio tVP:word varies synchronically and diachronically in DCPSE
    • we can eliminate this variation by simply focusing on modal:tVP
    • use tensed VPs as baseline for modals
  – this baseline is not a strict alternation set
    • we have not eliminated all Type C terms
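
A small numeric sketch may make the ratio principle concrete. The two subcorpora "X" and "Y" and all of their frequencies are invented for illustration. They are chosen so that the density of tensed VPs differs while the modal:tVP ratio does not.

```python
from fractions import Fraction

# Hypothetical frequencies for two subcorpora (illustrative only).
subcorpora = {
    "X": {"word": 400_000, "tVP": 50_000, "modal": 6_000},
    "Y": {"word": 400_000, "tVP": 40_000, "modal": 4_800},
}

results = {}
for name, f in subcorpora.items():
    modal_per_word = Fraction(f["modal"], f["word"])
    modal_per_tvp = Fraction(f["modal"], f["tVP"])
    tvp_per_word = Fraction(f["tVP"], f["word"])
    # Ratio principle: the intermediate term F(tVP) cancels exactly.
    assert modal_per_word == modal_per_tvp * tvp_per_word
    results[name] = (modal_per_word, modal_per_tvp, tvp_per_word)
    print(
        f"{name}: modal/word = {float(modal_per_word):.4f}, "
        f"modal/tVP = {float(modal_per_tvp):.4f}, "
        f"tVP/word = {float(tvp_per_word):.4f}"
    )
```

In this invented case the per-word rate of modals differs between X and Y only because tVP:word differs. Measured against tensed VPs, the two subcorpora are identical.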

‘Absolute’ and ‘relative’ variation

• Changes in core modals over time in DCPSE

[Figure: change over time for can, could, may, might, must, shall, should, will, would. Left axis (0.00 to 0.05): p (modal | tVP), absolute change as a proportion of tensed VPs. Right axis (0.00 to 0.30): p (modal | modal tVP), relative change as a proportion of set of modals]

(Bowie et al., forthcoming)
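
The two axes correspond to two baselines for the same counts. A rough sketch with invented frequencies (not DCPSE data) shows how each proportion would be computed. The variable names are hypothetical.

```python
# Hypothetical frequencies (illustrative only, not DCPSE data).
F_tVP = 50_000
modal_counts = {
    "can": 1_400, "could": 700, "may": 300, "might": 250, "must": 350,
    "shall": 100, "should": 500, "will": 1_600, "would": 1_300,
}
F_modals = sum(modal_counts.values())

absolute = {}
relative = {}
for modal, f in modal_counts.items():
    absolute[modal] = f / F_tVP      # share of all tensed VPs
    relative[modal] = f / F_modals   # share of the set of core modals
    print(f"{modal:<7} absolute = {absolute[modal]:.4f}  relative = {relative[modal]:.4f}")
```

The relative proportions sum to 1 across the set of modals, whereas the absolute ones do not, because most tensed VPs contain no core modal.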

Employing alternation analysis

• Simple grammatical interaction
  – Independent and dependent variables are grammatical
    • mutual substitution concerns the dependent variable
  – Numerous examples in Nelson et al. 2002
    • e.g. clause table: mood × transitivity
    • not alternation, but survey: could be refined

  IV \ DV        | montr              | ditr              | Total
  exclamative    | CL(montr, exclam)  | CL(ditr, exclam)  | CL(exclam)
  interrogative  | CL(montr, inter)   | CL(ditr, inter)   | CL(inter)
  …              | …                  | …                 |
  Total          | CL(montr)          | CL(ditr)          | CL

Employing alternation analysis

• Repeating choices: to add or not to add
  – e.g. repeated decisions to add an attributive AJP to specify a NP head: the tall white ship
    • A = add AJP
    • B = don’t add AJP (and stop)
  – Sequential analysis: examine p(A | {A, B}) at each step

[Figure: p (y-axis, 0.00 to 0.25) plotted against step (x-axis, 0 to 4)]

  – Conclusion: decision to add an AJP becomes successively more difficult

(Wallis, forthcoming)
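
One way to read the sequential analysis is sketched below. The assumption, which the slide does not spell out, is that the baseline at each step is the set of NPs that already have one fewer AJP, since each of those had the opportunity to add another. The counts and the name `step_probabilities` are invented for illustration.

```python
def step_probabilities(at_least: list[int]) -> list[float]:
    """p(add AJP) at each step.

    at_least[k] is the number of NPs with at least k attributive AJPs,
    so at_least[k - 1] counts the opportunities to add the k-th AJP.
    """
    probs = []
    for k in range(1, len(at_least)):
        opportunities = at_least[k - 1]
        if opportunities == 0:
            break  # nobody reached this step, so there is no choice to measure
        probs.append(at_least[k] / opportunities)
    return probs


# Hypothetical counts (illustrative only): NPs with >= 0, 1, 2, 3 AJPs.
counts = [10_000, 2_000, 200, 10]
probs = step_probabilities(counts)
for step, p in enumerate(probs, start=1):
    print(f"step {step}: p(add) = {p:.3f}")
```

With these invented counts the probability of adding a further AJP falls at every step, which is the shape of result the slide reports.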

Employing alternation analysis

• Grammatically diverse alternates
  – Biber and Gray (forthcoming) investigate evidence for increasing nominalisation
    • A = nouns that have been derived from verb forms
      – This paper reports an analysis of Tucker’s central prediction system model and an empirical comparison of it with two competing models. [1965, Acad-NS]
    • B = verbs that could be nominalised
  – Could just use clauses as baseline
    • But this is little better than words
  – Better option is to enumerate types
    • analysis / analyse
    • prediction / predict
    • comparison / compare
  – Examine cases: is alternation possible?

Objections

• If this is such a good idea, why isn’t everybody doing it?
• Three main objections are made:
  – alternates are not reliably identifiable
  – baselines are arbitrarily chosen by the researcher
  – different constraints apply to different terms (no such thing as free variation)

Alternates are not reliably identifiable?

• Identifying alternates can be difficult
  – phrasal vs. Latinate verbs
• Strategies:
  – enumerate cases from the bottom up
    • find Type B cases for each Type A

  put up:
  tolerate          | 4 | put up with it [S1A-037 #1]
  ?position         | 3 | put your feet up [S1A-032 #21]
  build, make       | 3 | shacks put up without any planning [S2B-022 #118]
  display, project  | 2 | put up two… trees [on the screen] [S1B-002 #157]
  sell              | 2 | put the plant up for sale [W2C-015 #8]
  propose           | 2 | put [a motion] up [S1B-077 #127]
  increase          | 1 | put up the poll tax [W2C-009 #3]
  accommodate       | 1 | we could put up the children [S1A-073 #197]
  finance           | 1 | put up the money [W2F-007 #36]

Alternates are not reliably identifiable?

• Strategies:
  – enumerate cases from the bottom up
    • find Type B cases for each Type A
  – refine baseline from the top down
    • start with verbs, eliminate non-alternating Type Cs
      – Copular verbs
      – Clitics
      – Stative verbs
    • are dynamic verbs the upper bound for alternation with phrasal verbs?
  – combine strategies:
    • identify stative verbs lexically
    • a few verbs are stative and dynamic
      – check in situ

Baselines are arbitrary?

• Is there such a thing as an ‘objective’ baseline?
  – No, but optimum baselines identify where speakers have a real choice: Type A vs. Type B
• Baselines are a control
  – Experimental hypothesis:
    • the ratio of Type A to the baseline is constant over values of the independent variable
  – Baseline cited as part of experimental reporting
• Indeed we can experiment with baselines
  – e.g. does the present perfect correlate more with past-referring or present-referring VPs?

Comparing baselines

• Does the present perfect correlate more with past-referring or present-referring VPs?

  present  | present perf | present non-perf | Total
  LLC      | 2,696        | 33,131           | 35,827
  ICE-GB   | 2,488        | 32,114           | 34,602
  Total    | 5,184        | 65,245           | 70,429

  d% = -4.45 ± 5.13%, φ′ = 0.0227, χ² = 2.68 (ns)

  past     | present perf | other TPM VPs    | Total
  LLC      | 2,696        | 18,201           | 20,897
  ICE-GB   | 2,488        | 14,293           | 16,781
  Total    | 5,184        | 32,494           | 37,678

  d% = +14.92 ± 5.47%, φ′ = 0.0694, χ² = 25.06 (s)

  – Present perfect correlates more with present-referring VPs

(Bowie et al., forthcoming)
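
The d% and χ² figures can be checked from the frequencies in the two tables. The sketch below rests on an assumption that is inferred from the numbers rather than stated in the source: the χ² values appear to come from a goodness-of-fit test of the present perfect counts against the distribution of the baseline totals across the two corpora. The function names are hypothetical, and the ± intervals and φ′ scores are not reproduced here.

```python
def percentage_difference(p1: float, p2: float) -> float:
    """Percentage change from p1 to p2, relative to p1."""
    return 100 * (p2 - p1) / p1


def gof_chi_square(observed: list[int], baseline: list[int]) -> float:
    """Goodness-of-fit chi-square of observed against the baseline distribution."""
    total_obs = sum(observed)
    total_base = sum(baseline)
    chi2 = 0.0
    for o, b in zip(observed, baseline):
        expected = total_obs * b / total_base
        chi2 += (o - expected) ** 2 / expected
    return chi2


# Frequencies from the tables, in the order LLC, ICE-GB.
present_perf = [2696, 2488]
baselines = {
    "present": [35827, 34602],  # present perf + present non-perf
    "past": [20897, 16781],     # present perf + other TPM VPs
}

CRITICAL_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05

stats = {}
for name, totals in baselines.items():
    p_llc = present_perf[0] / totals[0]
    p_ice = present_perf[1] / totals[1]
    d = percentage_difference(p_llc, p_ice)
    chi2 = gof_chi_square(present_perf, totals)
    stats[name] = (d, chi2)
    verdict = "s" if chi2 > CRITICAL_05 else "ns"
    print(f"{name}: d% = {d:+.2f}, chi-square = {chi2:.2f} ({verdict})")
```

Against the present-referring baseline the share of present perfects does not change significantly between LLC and ICE-GB. Against the past-referring baseline it does, which is the basis for the conclusion on the slide.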

Different constraints apply in each case?

• Speakers’ choices are influenced by multiple pressures
  – to talk about a single ‘choice’ is misleading
  – there is no such thing as free variation
• We are not attempting to infer “the reason” for a particular speaker decision
  – we are attempting to identify statistically sound
    • patterns
    • correlations
    • trends
  – across many speakers

Different constraints apply in each case?

• Does one or more of these multiple constraints represent a systematic bias on the true rate?
  – Yes = try to identify it experimentally
  – No = ‘noise’
• Can focus on subset of cases to restrict different influences
  – e.g. limit shall / will by modal semantics
• This objection is misplaced:
  – freedom to vary
    = grammatical and semantic possibility (potential)
    = not that choices are free from influence

A competitive ecology?

• Not everything is a binary choice
  – but the same principles apply

[Figure, two panels, each plotting p (0% to 100%) for the 1920s, 1960s and 2000s. Panel ‘Meanings of THINK’: ‘cogitate’, ‘intend’, quotative, interpretive. Panel ‘Complementation patterns of HOPE’: hoping to, hoping that / Ø, hoping for]

(Levin, forthcoming)

Conclusions

• Researchers need to pay attention to questions of choice and baselines
  – This does not mean that an observed change is due to a single source
• Minimum condition: baseline is a control
  – statistics evaluate difference from this control
    • is it a good control?
• Alternation studies: baseline is opportunity for making choice under investigation
• Word-based baselines should only really be used for comparison with other studies
  – we should not make statements about choice unless we investigate that question

Conclusions

• ‘Alternation’ can be interpreted
  – strictly
    • all Type As and Type Bs identified and cases checked
  – generously
    • small number of Type Cs permitted
  – Alternation is semantically bounded but grammatical analysis helps identify cases!
• We may try different experimental designs, modifying baselines and subsets
  – many more novel experiments are possible
    • experimental assumptions should always be clearly reported

References

ACLW: Aarts, B., J. Close, G. Leech and S.A. Wallis (eds.). forthcoming. The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP. Preview at www.ucl.ac.uk/english-usage/projects/verb-phrase/book.

• Aarts, B., J. Close and S.A. Wallis. forthcoming. Choices over time: methodological issues in investigating current change. ACLW Chapter 2.
• Biber, D. and B. Gray. forthcoming. Nominalizing the verb phrase in academic science writing. ACLW Chapter 5.
• Bowie, J., S.A. Wallis and B. Aarts. forthcoming. The perfect in spoken English. ACLW Chapter 13.
• Levin, M. forthcoming. The progressive verb in modern American English. ACLW Chapter 8.
• Nelson, G., S.A. Wallis and B. Aarts. 2002. Exploring Natural Language. Amsterdam: John Benjamins.
• Wallis, S.A. forthcoming. Capturing linguistic interaction in a grammar: a method for empirically evaluating the grammar of a parsed corpus.

Statistical postscript

• Type Cs make statistical tests less sensitive
  – What happens to confidence intervals as we add to F(A) + F(B) = 100 alternating cases?

[Figure: y-axis 0 to 0.25; x-axis N at 100, 1,000 and 10,000; curves labelled F(A) = 5, 20, 40, 60, 80, 95. Other labels in the original: e, N/100]

  – Tests assume freedom to vary (F(A) + F(B) = N)
  – Including Type Cs makes statistical tests conservative
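
The loss of sensitivity can be illustrated with a 2 × 2 χ² test rather than the confidence intervals plotted on the slide. This is a swapped-in demonstration, not the authors' own calculation. The counts are invented: 100 alternating cases per condition, to which non-alternating Type C cases are then added on the non-A side of the baseline.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05

# Hypothetical alternating cases (illustrative only): (F(A), F(B)) per condition.
f1_a, f1_b = 20, 80
f2_a, f2_b = 33, 67

scores = {}
for extra_c in (0, 900, 9_900):
    # Type C cases cannot be Type A, so they only inflate the non-A column.
    chi2 = chi_square_2x2(f1_a, f1_b + extra_c, f2_a, f2_b + extra_c)
    scores[extra_c] = chi2
    verdict = "significant" if chi2 > CRITICAL_05 else "not significant"
    print(f"N per condition = {100 + extra_c:>6}: chi-square = {chi2:.3f} ({verdict})")
```

With these invented counts, a difference that is significant over the 100 alternating cases stops being significant once Type C cases swell the baseline, although nothing about the choice between A and B has changed.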