Introduction: corpora, corpus use, and the British National Corpus
Dr. Ylva Berglund Prytz
http://www.natcorp.ox.ac.uk/
Dec 19, 2015
Outline
• Presentation: Corpora, corpus use, and the BNC
• Demonstration: How to use BNC with Xaira
• Hands-on: BNC with Xaira
• Presentation: Using the BNC for teaching and research
• More hands-on: exploring more
• Questions and answers
At the end of today you should
• have a basic working knowledge about
  corpora and corpus use
  the BNC
  Xaira
• feel confident using Xaira
• be able to explore the area on your own
• know where to turn for help and advice
Approaches to linguistic study
Intuition
• “Feel” what is right/wrong/possible
• One person’s language
• Subjective
Study of usage
• Examine what is actually said/written
• Several people
• Objective
How do you study usage?
• Examine naturally occurring language
• Draw conclusions
Need a sample of language, produced by different people in various contexts
Find a corpus!
What is a corpus?
A collection of naturally occurring language data compiled to mirror a language/language variety
(Usually) computer-readable
(Usually) contains more than text (annotation, meta-data)
What is a corpus? – some definitions
A corpus can be defined as a collection of texts
assumed to be representative of a given
language. (Tognini-Bonelli 2001: 2)
A corpus is a collection of naturally-occurring
language text, chosen to characterise a state or
variety of language. (Sinclair 1991: 171)
All the material included in a corpus, whether
spoken, written […] is assumed to be taken from
genuine communications of people going about
their normal business. (ibid: 55)
How can a corpus help?
Look for patterns to see regularities
Quantify
See several examples
Real language – language in use
Based on a variety of sources
Types of corpora
• Balanced corpora (= Reference or general corpora)
• Specialised corpora
  Genre-specific, LSP (e.g. English for Academic Purposes) …
  Varieties (dialectal, social, historical)
  Learner language, English as a Lingua Franca
• Multilingual corpora
  Parallel corpora (translations; alignable)
  Comparable corpora (similar texts)
• Fixed size / monitor corpora
• Mode and medium
  Written, spoken and transcribed, spoken with audio, video
Famous corpora
• Brown family (Brown, LOB, FLOB)
  1 million words, different text categories
• Bank of English
  Monitor corpus, grows with time
• International Corpus of English (ICE)
  Different national varieties of English. 1 million words written and spoken
• British National Corpus
  Reference corpus, fixed, 100 million words, written and spoken
British National Corpus (BNC)
What is the BNC?
A snapshot of British English, taken at the end of the 20th century
100 million words in approx 4,000 different text samples, both spoken (10%) and written (90%)
Synchronic (1960-93), sampled, general purpose corpus
Available under licence; latest edition is BNC XML edition (March 2007)
More than text
• Metadata
  About text, author/speaker, audience
• Structural & typographical information
  Paragraph, sentence, heading, list, bolds
• Extra-linguistic information
  Voice quality, noise, pauses, overlap
• Linguistic information
  Part-of-speech
Who produced the BNC and why?
• a consortium of dictionary publishers and academic researchers
  OUP, Longman, Chambers
  OUCS, UCREL, BL R&D
• with funding from DTI/SERC under JFIT 1990-1994
• Lexicographers, NLP researchers. But not language teachers!
Stated Project Goals
A synchronic (1990-4) corpus of samples
• both spoken and written
• from the full range of British English language production
• of non-opportunistic design, for generic applicability
• with word class annotation and contextual information
Actual (?) project goals
• Better ELT dictionaries
  authoritative
  both speech and writing
• A model for European corpus work
  design and encoding
• Industrial-academic co-operation
• A REALLY BIG corpus
Production of the BNC
• took three years (at least)
• cost GBP 1.6 million (at least)
• came about through an unusual coincidence of interests amongst:
  Lexicographical publishers
  Government (DTI)
  Science and Engineering Research Council
Project consequences
• industrial-scale text production system
• necessary compromises?
• technically over-ambitious?
• IPR and profitability
The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy
How was the corpus created?
1. Corpus design
2. Text selection
3. Clearance
4. Capture
5. Add additional information
6. Merge
7. (documentation)
8. Distribution
The BNC “sausage machine”
[Diagram: production pipeline]
Boxes:
• OUP
• Written (OUP/Chambers)
• Spoken (Longman)
• Initial CDIF Conversion and Validation (OUCS)
• Word Class Annotation (UCREL)
• Header generation and final validation (OUCS)
Phases:
• Selection, clearance, and capture
• Enrichment and encoding
• Documentation, distribution, maintenance
Text selection
1. Design criteria
   Types of texts
   Sources
   Number of samples
   Size of samples
2. Descriptive criteria
   Additional information where available
Selection criteria: written texts
• Domain: imaginative (c 25%), informative
• Medium: book, periodicals, misc. published, unpublished, written to be spoken
• Time: 1985-1993 (1960-75, 1975-84)
“Descriptive” criteria: written texts
• Sample size (number of words) and extent (start and end points)
• Topic or subject of the text
• Author's name, age, gender, region of origin, and domicile
• Target age group and gender
• "Level" of writing (reading difficulty): the more literary or technical a text, the "higher" its level
Selection criteria: spoken texts
• demographic (spoken conversation)
  transcriptions of spontaneous natural conversations made by recruited volunteers
  original recordings are available from the British Library
• context-governed (other spoken material)
  transcriptions of recordings made at specific types of meeting and event
Spoken texts: context-governed
Four broad categories of social context:
• Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials
• Business events such as sales demonstrations, trades union meetings, consultations, interviews
• Institutional and public events, such as sermons, political speeches, council meetings
• Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins
Descriptive criteria: spoken texts
• Features relating to the speaker (age, sex, social class, dialect)
• Context of recording (place, time)
• Features of the recording (non-verbal events, paralinguistic phenomena, unclear instances)
Included when known
Sometimes provided by respondent
What is the result?
What is the BNC?
• 4,000+ texts
• Ca. 100,000,000 words
• 10% spoken
• Information about
  the texts
  the speakers/writers
  the words
Delivered with a search tool: XAIRA
What's in the BNC?
[Chart: number of words per part of the corpus]
• Spoken Demographic: 4,233,955
• Spoken Context Governed: 6,175,896
• Books and Periodicals: 79,238,146
• Other written: 8,715,786
What topics?
[Chart: number of words per domain, written texts]
• Imaginative: 16,496,420
• Scientific: 3,821,902
• Social Science: 14,025,537
• Applied Science: 7,174,152
• World Affairs: 17,244,534
• Commerce: 7,341,163
• Arts: 6,574,857
• Belief: 3,037,533
• Leisure: 12,237,834
Post-hoc text-type classification
[Two charts: share of sentences and share of words per text type]
Text types: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, Other spoken
Format
Corpus header (1)
Corpus texts (4,000+)
  Text header
  Text

<corpus>
  <corpusHeader></corpusHeader>
  <corpusText>
    <textHeader></textHeader>
    <text></text>
  </corpusText>
  <corpusText>
    <textHeader></textHeader>
    <text></text>
  </corpusText>
  …
</corpus>
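As a rough illustration of why this nesting is convenient, the sketch below walks a toy document that follows the schematic layout on the slide. The element names are taken from the slide, which is a simplification; the header and text contents are invented, and `list_texts` is a hypothetical helper, not part of any BNC tool.

```python
import xml.etree.ElementTree as ET

# Toy document following the schematic layout on the slide.
# Element names come from the slide; all contents are invented.
TOY_CORPUS = """
<corpus>
  <corpusHeader>toy corpus</corpusHeader>
  <corpusText>
    <textHeader>text A (written)</textHeader>
    <text>First sample text.</text>
  </corpusText>
  <corpusText>
    <textHeader>text B (spoken)</textHeader>
    <text>Second sample text.</text>
  </corpusText>
</corpus>
"""

def list_texts(xml_string):
    """Return (header, body) pairs, one per <corpusText> element."""
    root = ET.fromstring(xml_string)
    pairs = []
    for corpus_text in root.findall("corpusText"):
        header = corpus_text.findtext("textHeader", default="")
        body = corpus_text.findtext("text", default="")
        pairs.append((header, body))
    return pairs

texts = list_texts(TOY_CORPUS)
for header, body in texts:
    print(header, "->", body)
```

Because every text carries its own header, information about a text can be looked up without leaving the element that contains it.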
Annotation, encoding, markup
A means of making explicit, and thus processable:
• structure
  texts, sections, paragraphs, turns, sentences, words...
• metadata
  text-type, situational parameters, context
• analysis
  morphology, syntactic function, translation
Word class annotation
• CLAWS (Leech, Garside et al) approach
• What counts as a word?
  This isn't prima facie obvious, in spite of spelling conventions.
• In BNC-XML, each word is explicitly marked and annotated with
  a root form or lemma
  an automatically assigned C5 word class code
  a simplified POS code
Example: word class annotation
<s n="11">
  <w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>
  <w c5="VBZ" hw="be" pos="VERB">is </w>
  <w c5="VBG" hw="be" pos="VERB">being </w>
  <w c5="VVN" hw="express" pos="VERB">expressed </w>
  <w c5="PRP" hw="with" pos="PREP">with </w>
  <w c5="AT0" hw="the" pos="ART">the </w>
  <w c5="NN1" hw="method" pos="SUBST">method </w>
  <w c5="TO0" hw="to" pos="PREP">to </w>
  <w c5="VBI" hw="be" pos="VERB">be </w>
  <w c5="VVN" hw="use" pos="VERB">used </w>
  <w c5="TO0" hw="to" pos="PREP">to </w>
  <w c5="VVI" hw="launch" pos="VERB">launch </w>
  <w c5="AT0" hw="the" pos="ART">the </w>
  <w c5="NN1" hw="scheme" pos="SUBST">scheme</w>
  <c c5="PUN">.</c>
</s>
c5 = detailed part-of-speech
hw = head word (new)
pos = simple part-of-speech (new)
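To show how these attributes can be read by a program, here is a minimal sketch using Python's standard XML parser on a shortened excerpt of the example sentence (the first four words plus the full stop). The helper name `read_tokens` and the dictionary keys are assumptions made for illustration; this is not how Xaira itself works.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the example sentence: first four words plus the full stop.
SENTENCE = (
    '<s n="11">'
    '<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="VBG" hw="be" pos="VERB">being </w>'
    '<w c5="VVN" hw="express" pos="VERB">expressed </w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def read_tokens(sentence_xml):
    """Return one dict per <w> or <c> element, in document order."""
    sentence = ET.fromstring(sentence_xml)
    tokens = []
    for element in sentence:
        tokens.append({
            "kind": element.tag,                    # "w" = word, "c" = punctuation
            "form": (element.text or "").strip(),   # the word as it appears in the text
            "c5": element.get("c5"),                # detailed word class code
            "hw": element.get("hw"),                # head word; None for punctuation
            "pos": element.get("pos"),              # simple POS; None for punctuation
        })
    return tokens

tokens = read_tokens(SENTENCE)
lemmas = [t["hw"] for t in tokens if t["kind"] == "w"]
print(lemmas)
```

Note that "is" and "being" share the head word "be", so a search on the head word finds both forms.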
Some BNC-XML elements
• <wtext> or <stext>
• <div> = section
• <p> = paragraph or <u> = utterance
• <s> = “sentence”
• <w> = word and <c> = punctuation
• <mw> = multiword unit
What is the markup for?
It makes it possible for you to
• distinguish aids=SUBST from aids=VERB
• distinguish occurrences in writing from ones in speech
• distinguish occurrences in headings from ones in paragraphs
• identify contextual units like sentences and paragraphs
FACTSHEET WHAT IS AIDS?
AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
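The distinctions listed above can be sketched in a few lines of Python. The fragment below is a toy example: the sentences, their annotation values, and the use of a `<head>` element for the heading are assumptions made for illustration, and `find_word` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Toy fragment; sentences and annotation values are invented for illustration.
TOY_TEXT = """
<wtext>
  <div>
    <head><s n="1"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="NN1" hw="factsheet" pos="SUBST">factsheet</w></s></head>
    <p><s n="2"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VVZ" hw="aid" pos="VERB">aids </w><w c5="NN1" hw="recovery" pos="SUBST">recovery</w><c c5="PUN">.</c></s></p>
  </div>
</wtext>
"""

def find_word(xml_string, form):
    """Return (container, pos) for each occurrence of `form`, ignoring case.

    `container` is "head" for headings and "p" for paragraphs.
    """
    root = ET.fromstring(xml_string)
    hits = []
    for container in root.iter():
        if container.tag not in ("head", "p"):
            continue
        for word in container.iter("w"):
            if (word.text or "").strip().lower() == form:
                hits.append((container.tag, word.get("pos")))
    return hits

hits = find_word(TOY_TEXT, "aids")
print(hits)
```

A plain text search for "aids" would return both occurrences without telling them apart; the markup is what separates the noun in the heading from the verb in the paragraph.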
Who uses the BNC (and how?)
• Linguists
  Research on (English) language
• Teachers
  Reference, generate teaching materials, in classroom
• Publishers
  Dictionaries, EFL text books
• Language engineers
  Language + computer tools, AI, NLP
• Students/language learners
• Computer scientists
  Information retrieval
• Psychologists/neurologists
  General ‘norm’ or reference
• Lexicographers
• NLP researchers
What makes the BNC so special?
• Size
• Design
• General availability
• Standardized markup system
  Structural annotation
  Word class annotation
  Contextual information
• Model for other projects
...in these respects, the BNC remains distinctive, twenty years on!
How to use the BNC (with Xaira)
The BNC can be used in different ways and with different tools
User needs to know
• What information is available
• Where/how is information coded
XAIRA can help
Search for
• Words or phrases
• Word class information
• Annotation/mark-up
or a combination of them
Display
• Search term with context, with or without mark-up
• Information about text
• Collocations (co-occurring words)
• Distribution across parts of the corpus
and much more
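Two of the displays listed above, the search term in context and its collocations, can be illustrated with a small Python sketch. This is not Xaira's implementation: the token list is invented, and `kwic`, `collocates`, and the window size `width` are names chosen for this example.

```python
from collections import Counter

# Invented token list standing in for a corpus text.
TOKENS = "the cat sat on the mat and the dog sat on the rug".split()

def kwic(tokens, node, width=2):
    """Key word in context: (left context, node, right context) per hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def collocates(tokens, node, width=2):
    """Count the words found within `width` tokens of each hit for `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            counts.update(window)
    return counts

for left, node, right in kwic(TOKENS, "sat"):
    print(f"{left:>10} [{node}] {right}")
print(collocates(TOKENS, "sat").most_common(3))
```

Raw co-occurrence counts favour frequent words such as "the"; corpus tools usually add a statistical measure to rank collocates, which this sketch leaves out.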
XAIRA – XML-aware retrieval application
• Searches an index of the corpus
• Uses information in the headers and the texts
• Often more than one way to make a search
• Can be used with other corpora (if they are indexed first)