The ‘London Corpora’ projects
- the benefits of hindsight -
some lessons for diachronic corpus design

Sean Wallis
Survey of English Usage
University College London
[email protected]

Dec 24, 2015




Motivating questions

• What is meant by the phrase ‘a balanced corpus’?
  – How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data?
• Reviewing ICE-GB and DCPSE:
  – Should the data have been more sociolinguistically representative, by social class and region?
  – Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?


ICE-GB

• British Component of ICE
• Corpus of speech and writing (1990-1992)
  – 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed
• Sampling principles
  – International sampling scheme, including broad range of spoken and written categories
  – But:
    • Adults who had completed secondary education
    • ‘British corpus’ geographically limited
      – speakers mostly from London / SE UK (or sampled there)


DCPSE

• Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s)
  – 800,000 words (nominal)
  – London-Lund component annotated as ICE-GB
    • orthographically transcribed and fully parsed
• Created from subsamples of LLC and ICE-GB
  – Matching numbers of texts in text categories
  – Not sampled over equal duration
    • LLC (1958-1977)
    • ICE-GB (1990-1992)
  – Text passages in LLC larger than ICE-GB
    • LLC (5,000 words)
    • ICE-GB (2,000 words)
    • But text passages may include subtexts
      – telephone calls and newspaper articles are frequently short


DCPSE

• Representative?
  – Text categories of unequal size
  – Broad range of text types sampled
  – Not balanced by speaker demography

text category              LLC (1960s)       ICE-GB (1990s)    TOTAL
formal face-to-face         46,291  (51)      39,201  (58)      85,492   (109)
informal face-to-face      207,852 (146)     176,244 (398)     384,096   (544)
telephone conversations     25,645 (110)      19,455  (30)      45,100   (140)
broadcast discussions       43,620  (47)      42,002 (101)      85,622   (148)
broadcast interviews        20,359  (12)      21,385  (26)      41,744    (38)
spontaneous commentary      45,765  (50)      48,539  (60)      94,304   (110)
parliamentary language      10,081  (14)      10,226  (58)      20,307    (72)
legal cross-examination      5,089   (4)       4,249   (5)       9,338     (9)
assorted spontaneous        10,111   (8)      10,767   (5)      20,878    (13)
prepared speech             30,564  (14)      32,180  (71)      62,744    (85)
TOTAL                      445,377 (450)     404,248 (818)     849,625 (1,268)
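
One way to read the table is to convert each word count into a share of its subcorpus, which shows how closely the LLC and ICE-GB subsamples match by text category. The following is a minimal sketch, not part of the original slides: the word counts are copied from the table, while the names `WORDS` and `category_shares` are illustrative assumptions.

```python
# Word counts per text category, copied from the table above as (LLC, ICE-GB).
WORDS = {
    "formal face-to-face": (46_291, 39_201),
    "informal face-to-face": (207_852, 176_244),
    "telephone conversations": (25_645, 19_455),
    "broadcast discussions": (43_620, 42_002),
    "broadcast interviews": (20_359, 21_385),
    "spontaneous commentary": (45_765, 48_539),
    "parliamentary language": (10_081, 10_226),
    "legal cross-examination": (5_089, 4_249),
    "assorted spontaneous": (10_111, 10_767),
    "prepared speech": (30_564, 32_180),
}


def category_shares(words):
    """Return {category: (LLC share, ICE-GB share)} as proportions of each subcorpus."""
    llc_total = sum(llc for llc, _ in words.values())
    ice_total = sum(ice for _, ice in words.values())
    return {
        category: (llc / llc_total, ice / ice_total)
        for category, (llc, ice) in words.items()
    }


if __name__ == "__main__":
    # Print the two shares side by side for each text category.
    for category, (p_llc, p_ice) in category_shares(WORDS).items():
        print(f"{category:<26} LLC {p_llc:6.1%}   ICE-GB {p_ice:6.1%}")
```

Matching shares across the two subcorpora only shows balance by text category. It says nothing about balance by speaker demography, which is the point raised on the slide.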


A balanced corpus?

• Corpora are reusable experimental datasets
  – Data collection (sampling) should avoid limiting future research goals
  – Samples should be representative
    • What are they representative of?
• Quantity vs. quality
  – Large/lighter annotation vs. small/richer
  – Are larger corpora more (easily) representative?
• Problems for historical corpora
  – Can we add samples to make the corpus more representative?


“Representativeness”

• Do we mean representative...
  – of the language? (“random sample”)
    • A sample in the corpus is a genuine random sample of the type of text in the language
  – of text types? (“broad”)
    • Effort made to include examples of all types of language “text types” (including speech contexts)
  – of speaker types? (“stratified”)
    • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category
    • Should subdivide data independently (stratification)


Stratified sampling

• Ideal
  – Corpus independently subdivided by each variable
  – Equal subdivisions?
    • Not required
    • Independent variables = constant probability in each subset
      – e.g. proportion of words spoken by women not affected by text genre
      – e.g. same ratio of women:men in age groups, etc.

• What is the reality?
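
The independence condition above (constant probability in each subset) can be checked on a table of counts. The sketch below is illustrative only: both tables are hypothetical counts of speakers per genre, not figures from ICE-GB or DCPSE, and the function names are assumptions. Speakers are counted rather than words because words produced by the same speaker are not independent observations. The Pearson chi-square statistic is used here as a descriptive measure of departure from independence.

```python
def female_proportions(table):
    """Proportion of female speakers in each row of [female, male] counts."""
    return [female / (female + male) for female, male in table]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if gender were independent of genre.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical speaker counts per genre as [female, male].
STRATIFIED = [[20, 40], [10, 20], [30, 60]]    # subsets of unequal size, same ratio
UNSTRATIFIED = [[30, 30], [0, 30], [30, 60]]   # ratio changes with genre

if __name__ == "__main__":
    for name, table in (("stratified", STRATIFIED), ("unstratified", UNSTRATIFIED)):
        print(name, female_proportions(table), chi_square(table))
```

The first table shows that subsets need not be equal in size: the genres hold 60, 30 and 90 speakers, yet the female proportion is one third in each, so the statistic is zero. In the second table one genre has no female speakers and the statistic is well above zero.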


ICE-GB: gender / written-spoken

• Proportion of words in each category spoken by women and men
  – The authors of some texts are unspecified
  – Some written material may be jointly authored
  – female/male ratio varies slightly (=0.02)

[Figure: proportion p (0 to 1) of female and male words for TOTAL, spoken and written]


ICE-GB: gender / spoken genres

• Gender variation in spoken subcategories

[Figure: proportion p (0 to 1) of female and male words for TOTAL spoken and each spoken category: dialogue (private: direct conversations, telephone calls; public: broadcast discussions, broadcast interviews, business transactions, classroom lessons, legal cross-examinations, parliamentary debates), mixed (broadcast news), monologue (scripted: broadcast talks, non-broadcast speeches; unscripted: demonstrations, legal presentations, spontaneous commentaries, unscripted speeches)]


ICE-GB: gender / written genres

• Gender variation in written genres

[Figure: proportion p (0 to 1) of female, male and <author unknown/joint> words for TOTAL written and each written category: non-printed (correspondence: business letters, social letters; non-professional writing: student examination scripts, untimed student essays), printed (academic writing: humanities, natural sciences, social sciences, technology; creative writing: novels/stories; instructional writing: administrative/regulatory, skills/hobbies; non-academic writing: humanities, natural sciences, social sciences, technology; persuasive writing: press editorials; reportage: press news reports)]


ICE-GB

• Sampling was not stratified across variables
  – Women contribute 1/3 of corpus words
  – Some genres are all male (where specified)
    • speech: spontaneous commentary, legal presentation
    • academic writing: technology, natural sciences
    • non-academic writing: technology, social science
  – Is this representative?
  – When we compare
    • technology writing with creative writing
    • academic writing with student essays
  – are we also finding gender effects?
  – Difficult to compensate for absent data in analysis!
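
The problem of absent data can be made concrete with a small check: a genre comparison can only be separated from a gender effect for genders that have data in both genres. The sketch below is an assumption-laden illustration, not an analysis of ICE-GB. The word counts in `COUNTS` are hypothetical and the function names are invented for this example.

```python
# Hypothetical word counts by author gender; these are not ICE-GB figures.
COUNTS = {
    "technology writing": {"female": 0, "male": 20_000},
    "creative writing": {"female": 9_000, "male": 11_000},
    "student essays": {"female": 12_000, "male": 8_000},
}


def genders_present(counts, category):
    """Genders that contribute at least one word to the category."""
    return {gender for gender, n in counts[category].items() if n > 0}


def comparable_genders(counts, cat_a, cat_b):
    """Genders with data in both categories, so genre can be compared with gender held fixed."""
    return genders_present(counts, cat_a) & genders_present(counts, cat_b)


def single_gender_categories(counts):
    """Categories whose attributed words all come from one gender."""
    return sorted(c for c in counts if len(genders_present(counts, c)) == 1)


if __name__ == "__main__":
    print(single_gender_categories(COUNTS))
    print(comparable_genders(COUNTS, "technology writing", "creative writing"))
```

With these counts, technology writing can be compared with creative writing for male authors only. No reweighting at the analysis stage can supply the missing female technology writing.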


DCPSE: gender / genre

• DCPSE has a simpler genre categorisation
  – also divided by time

[Figure: proportion p (0 to 1) of female and male words for TOTAL and each DCPSE text category: face-to-face conversations (formal, informal), telephone conversations, broadcast discussions, broadcast interviews, spontaneous commentary, parliamentary language, legal cross-examination, assorted spontaneous, prepared speech]


DCPSE: gender / time

• DCPSE has a simpler genre categorisation
  – also divided by time
    • note the gap

[Figure: proportion p (0 to 1) plotted against time, by year from 1958 to 1992]


DCPSE: genre / time

• Proportion in each spoken genre, over time
  – sampled by matching LLC and ICE-GB overall
    • this is a ‘stratified sample’ (but only LLC:ICE-GB)
    • uneven sampling over 5-year periods (within LLC)

[Figure: proportion p (0 to 0.6) by 5-year period from 1960 to 1990 for formal face-to-face, informal face-to-face, spontaneous commentary, telephone conversations and prepared speech; annotations: ‘ICE-GB’, ‘target for LLC’]
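
The kind of breakdown plotted here, the proportion of words in each genre per 5-year period, could be computed along the following lines. This is a rough sketch under stated assumptions: the records in `TEXTS` are hypothetical (year, genre, words) tuples, not DCPSE data, and the function name and the choice of 1960 as the first period boundary are illustrative.

```python
from collections import defaultdict


def genre_proportions_by_period(texts, start=1960, width=5):
    """Group (year, genre, words) records into fixed-width periods.

    Returns {period_start: {genre: proportion of that period's words}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for year, genre, words in texts:
        # Floor division assigns e.g. 1961 and 1963 to the period starting 1960.
        period = start + ((year - start) // width) * width
        totals[period][genre] += words
    result = {}
    for period, by_genre in sorted(totals.items()):
        period_total = sum(by_genre.values())
        result[period] = {g: n / period_total for g, n in by_genre.items()}
    return result


# Hypothetical records for illustration only.
TEXTS = [
    (1961, "informal face-to-face", 5_000),
    (1963, "prepared speech", 5_000),
    (1966, "informal face-to-face", 15_000),
    (1967, "telephone conversations", 5_000),
    (1991, "informal face-to-face", 2_000),
    (1991, "prepared speech", 2_000),
]

if __name__ == "__main__":
    for period, shares in genre_proportions_by_period(TEXTS).items():
        print(period, shares)
```

Periods with no records are simply absent from the result, which makes a gap in coverage visible rather than filling it with zeros.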


DCPSE

• LLC sampling not stratified
  – Issue not considered, data collected over extended period
  – Some data was surreptitiously recorded
• DCPSE matched samples by ‘genre’
  – Same text category sizes in ICE-GB and LLC
  – But problems in LLC (and ICE) percolate
• No stratification by speaker
  – Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category


Conclusions

• Ideal would be that:
  – the corpus was “representative” in all 3 ways:
    • a genuine random sample
    • a broad range of text types
    • a stratified sampling of speakers
  – But these principles are unlikely to be compatible
    • e.g. speaker age and utterance context
• Some compensatory approaches may be employed at research (data analysis) stage
  – what about absent or atypical data?
  – what if we have few speakers/writers?
• So...


Conclusions

• …pay attention to stratification in deciding which texts to include in subcategories
  – consider replacing texts in outlying categories
• …justify and document non-inclusion of stratum by evidence
  – e.g. “there are no published articles attributable to authors of this age in this time period”