Lou Burnard, Humanities Computing Unit, Oxford University Computing Services. The British National Corpus: where did we go wrong?
Post on 20-Dec-2015
Lou Burnard
HUMANITIES COMPUTING UNIT
Oxford University Computing Services
http://info.ox.ac.uk/bnc/
The British National Corpus: where did we go wrong?
What is the BNC?
100 million words of modern British English
produced by a consortium of dictionary publishers and academic researchers
• OUP, Longman, Chambers
• Oxford, Lancaster, British Library
funded as a pre-competitive resource by DTI/SERC under JFIT, 1990-1994
Where did we go wrong?
(if we did)
or, The Benefit of Hindsight
or, If I'd known then what I know now...
or, Wisdom After the Event
And, Where Do We Go From Here?
Production of the BNC
took three years (at least)
cost GBP 1.6 million (at least)
came about through an unusual coincidence of interests amongst:
• Lexicographical publishers
• Government (DTI)
• Science and Engineering Research Council (SERC)
The Neotenous Nineties
WinWord or WP5? the choice is yours
On your desk … a 386 with 50 Mb diskspace (just about enough to run Windows 3)
In your lab ... a VAX or a Sparc for serious work
On the WWW (maybe) ... Mosaic for X
Intellectual currents
corpus linguistics
• the LOB school
• the Birmingham school
• the LDC view
text encoding theory
language engineering
the JFIT mentality, or Reconciling Town and Gown
Stated Project Goals
A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production
of non-opportunistic design, for generic applicability
with word class annotation and contextual information
Actual (?) project goals
Better ELT dictionaries: authoritative, both speech and writing
A model for European corpus work: design, and encoding
Industrial-academic co-operation
A REALLY BIG corpus
Consequences
industrial scale text production system
compromises in design and execution
IPR and profitability
The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy
The BNC “sausage machine”
[Flow diagram]
OUP
Written (OUP/Chambers) and Spoken (Longman)
→ Initial CDIF Conversion and Validation (OUCS)
→ Word Class Annotation (UCREL)
→ Header generation and final validation (OUCS)
Stages:
• Selection, clearance, and capture
• Enrichment and encoding
• Documentation, distribution, maintenance
Task groups
permissions
selection, design criteria
encoding and markup
enrichment and annotation
retrieval software
Tensions
desire to test annotation scheme
requirement to meet deliverables
slipping goal posts
quantity above quality
… an interesting learning experience for both sides!
BNC Selection Criteria
Written selection criteria
predefined proportions of
• different media (books, newspapers, unpublished…)
• different domains (informative, entertaining…)
maximum sample size 45000 words
all texts incomplete
Spoken selection criteria
context-governed
demographically-sampled
Word tagging
<s n=00011> <w AT0>The <w NP0>Queen<w POS>‘s <w AJ0>real <w NN1>annus horribilis <w VVD>began <w PRP> <w NN0>Sunday<c PUN>.</s>
word-pos pair
white space problems
validation problems
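The word-pos pairing on this slide can be illustrated with a minimal Python sketch. It is not the project's own validation software: the regular expression, the function name `parse_tagged` and the "suspect" rule are assumptions made for illustration, and the curly quote in the slide's sample is written as a plain apostrophe.

```python
import re

# One unit is "<w TAG>text" or "<c TAG>text"; the text runs up to the next tag.
# (Hypothetical pattern, written for this illustration only.)
TOKEN_RE = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")

def parse_tagged(line):
    """Return (kind, tag, text) triples; empty texts are kept so gaps stay visible."""
    return [(kind, tag, text.strip()) for kind, tag, text in TOKEN_RE.findall(line)]

sample = ("<s n=00011> <w AT0>The <w NP0>Queen<w POS>'s <w AJ0>real "
          "<w NN1>annus horribilis <w VVD>began <w PRP> <w NN0>Sunday<c PUN>.</s>")

tokens = parse_tagged(sample)

# Flag units with no text at all, or with white space inside a single "word".
suspect = [t for t in tokens if t[2] == "" or " " in t[2]]

for kind, tag, text in suspect:
    print(f"check <{kind} {tag}>: {text!r}")
```

On the slide's sample this flags the multi-word unit tagged NN1 and the PRP tag with no word attached, which is the kind of white space and validation problem the slide lists.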
Sample written text
<text complete=Y decls='CN000 HN001 QN000 SN000'>
<div1 complete=Y org=SEQ>
<head type=MAIN> <s n=001><w NP0>CAMRA <w NN1>FACT <w NN1>SHEET <w AT0>No <w CRD>1 </head>
<head r=it type=SUB> <s n=002><w AVQ>How <w NN1>beer <w VBZ>is <w AJ0-VVN>brewed </head>
<p><s n=003><w NN1>Beer <w VVZ>seems <w DT0>such <w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that <w PNP>we <w VVB>tend <w TO0>to <w VVI>take <w PNP>it <w CJS-PRP>for <w VVD-VVN>granted<c PUN>.
Transcription practice
Regionalised typists
Markup makes explicit
• changes of speaker and overlap
• words as perceived by transcriber
• plus indications of false starts, truncation, uncertainty
• some performance features e.g. pausing, stage directions etc.
• speaker details where available (always for respondents, sometimes for others)
Sample spoken text
<u who=PS04Y><s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7><w PNP>I <w VVD>told <w NP0>Paul <pause> <w CJT>that <w PNP>he <w VM0>can <w VVI>bring <w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at <w NN1>Christmas-time<c PUN>.</u>
<u who=PS04U><s n=01297><w VBZ>Is <w PNP>he <w XX0>not <w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>
<u who=PS04Y><s n=01298><w ITJ>No <pause dur=8> <w CJC>and <w UNC>erm <pause dur=7> <w PNP>I<w VBB>'m <w VVG>leaving <w AT0>a <w NN1>turkey <w PRP>in <w AT0>the <w NN1>freezer<c PUN>
<s n=01299><w NP0>Paul <w VBZ>is <w AV0>quite <w AJ0>good <w PRP>at <w NN1-VVG>cooking <pause> <w AJ0>standard <w NN1>cooking<c PUN>.</u>
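As a rough illustration of what the speaker and pause markup makes recoverable, the sketch below counts tagged words and pauses per speaker in the first two utterances of the sample. The patterns and the function name `speaker_stats` are assumptions for this example, not part of SARA or any other BNC tool, and they rely on every `<u>` being explicitly closed, as it is in the sample.

```python
import re
from collections import Counter

# Hypothetical patterns for the slide's markup.
U_RE = re.compile(r"<u who=(\w+)>(.*?)</u>", re.S)   # one utterance and its speaker code
W_RE = re.compile(r"<w [A-Z0-9-]+>")                  # start of one tagged word
PAUSE_RE = re.compile(r"<pause(?: dur=(\d+))?>")      # pause, with optional duration

def speaker_stats(markup):
    """Count tagged words and pauses for each speaker code."""
    words, pauses = Counter(), Counter()
    for who, body in U_RE.findall(markup):
        words[who] += len(W_RE.findall(body))
        pauses[who] += len(PAUSE_RE.findall(body))
    return words, pauses

excerpt = (
    "<u who=PS04Y><s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7><w PNP>I "
    "<w VVD>told <w NP0>Paul <pause> <w CJT>that <w PNP>he <w VM0>can <w VVI>bring "
    "<w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at <w NN1>Christmas-time<c PUN>.</u>"
    "<u who=PS04U><s n=01297><w VBZ>Is <w PNP>he <w XX0>not <w VVG>going "
    "<w AV0>home <w AV0>then<c PUN>?</u>"
)

words, pauses = speaker_stats(excerpt)
for who in words:
    print(who, words[who], "words,", pauses[who], "pauses")
```

Punctuation (`<c ...>`) is deliberately left out of the word count.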
Metadata
each text has a TEI header
• identification and classification
• specific details (e.g. speakers)
• housekeeping information
all common data in the corpus header
classification(s) in header pointed to by individual texts
Text classifications
spoken texts
• age, sex, class (of respondent)
• domain, region, type
written texts
• author age, sex, type
• audience, circulation, status
• medium, domain
Intention was to improve coverage, not accessibility
In retrospect…
Some classifications were poorly defined and only partially populated
Domain or text-type?
Dating
• date of copy? first publication?
Author age
• when?
Author ethnic origin, domicile
That famous BNC balance (BNC-1)
[Pie chart of word counts]
Books and Periodicals 81089443
Other written 8712764
Spoken Context Governed 6143048
Spoken Demographic 4214819
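The balance figures for BNC-1 are easier to compare as percentages. The short Python sketch below does the arithmetic; pairing each figure with its legend label is an assumption recovered from the flattened chart (the largest value taken as Books and Periodicals, and the two spoken values matched against the spoken domain counts given later in the talk).

```python
# BNC-1 word counts from the slide; the label pairing is an assumption (see above).
bnc1 = {
    "Books and Periodicals": 81_089_443,
    "Other written": 8_712_764,
    "Spoken Context Governed": 6_143_048,
    "Spoken Demographic": 4_214_819,
}

total = sum(bnc1.values())

# Percentage share of each category, to two decimal places.
shares = {label: round(100 * count / total, 2) for label, count in bnc1.items()}

spoken = bnc1["Spoken Context Governed"] + bnc1["Spoken Demographic"]
spoken_share = 100 * spoken / total

print(f"total words: {total}")
for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"spoken overall: {spoken_share:.2f}%")
```

On these figures, spoken material comes to roughly a tenth of the corpus.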
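That famous BNC balance (BNC-W)
[Pie chart of word counts; values listed in the order extracted, which may not match the legend order]
78731276, 5997489, 8021274, 8743604
Legend: Spoken Demographic, Spoken Context Governed, Books and Periodicals, Other written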
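Written Domains (BNC-1)
[Chart of word counts; values listed in the order extracted, which may not match the legend order]
16479306, 7106818, 7259346, 3064222, 11340657, 19695650, 3754756, 13707349, 7394103
Legend: Imaginative, Scientific, Social Science, Applied Science, World Affairs, Commerce, Arts, Belief, Leisure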
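Written Domains (BNC-2)
[Chart of word counts; values listed in the order extracted, which may not match the legend order]
16781393, 7327671, 7242024, 3093407, 11630083, 16612770, 3798318, 13496137, 7493077
Legend: Imaginative, Scientific, Social Science, Applied Science, World Affairs, Commerce, Arts, Belief, Leisure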
Written Domains
[Bar chart, series label WB; vertical axis in thousands of words, 0 to 25000; values listed in the order extracted]
7493.077, 13496.137, 3798.318, 16612.77, 11630.083, 3093.407, 7242.024, 7327.671, 16781.393
Categories: Imaginative, Scientific, Social Science, Applied Science, World Affairs, Commerce, Arts, Belief, Leisure
Spoken domains
[Pie chart of word counts]
Educational 1639159
Business 1285938
Institutional 1652246
Leisure 1565705
Demographic 4214819
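A quick consistency check ties this slide to the balance figures: the four context-governed domains should add up to the Spoken Context Governed total of 6143048 given earlier. The Python sketch below assumes the four smaller values belong to Educational, Business, Institutional and Leisure in legend order; the sum itself does not depend on that pairing.

```python
# Context-governed spoken domains (BNC-1); the label pairing is an assumption.
context_governed = {
    "Educational": 1_639_159,
    "Business": 1_285_938,
    "Institutional": 1_652_246,
    "Leisure": 1_565_705,
}
demographic = 4_214_819

cg_total = sum(context_governed.values())
spoken_total = cg_total + demographic

# Compare with the Spoken Context Governed figure on the balance slide.
matches_balance_slide = cg_total == 6_143_048

print(f"context-governed total: {cg_total}")
print(f"all spoken: {spoken_total}")
print(f"matches balance slide: {matches_balance_slide}")
```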
Availability
BNC end-user licence
• commercial exploitation of the corpus is forbidden
• commercial exploitation of derived works is permitted
OUCS is sole agent for licensing, reporting to Consortium
Original restriction to EU has been lifted
Distribution methods
100 million words is (still) a lot of data
IPR agreements imply not-for-profit distribution (which has its downsides too)
The options are...
• install it yourself
• online access
• the sampler
Install it yourself (version 1)
You need...
• £220 for a licence and 3 CDs
• £2000 for a Unix box with min 6 Gb disk
• some Unix expertise
You get...
• access to the whole corpus
• using the tools of your choice
• configurable for a local network
Version 2 will be delivered to run “standalone” on a suitably configured PC
BNC Online service
You need...
• access to the Internet
You get...
• free (but limited) access using any web browser
• free (temporary) access using SARA (PC only)
• for an annual fee, SARA plus documentation
http://sara.natcorp.ox.ac.uk
The BNC Sampler
You need...
• $50 for a CD
• a PC with a CD drive and (preferably) 90 Mb disk space
You get...
• 2% sample, half written, half spoken
• four different search engines
• documentation
Available at this conference, at a special price !!
The BNC World Edition (aka BNC2)
has IPR clearance for world usage (we lose about 50 texts)
extensive set of revisions and corrections
catching up with the standards
accompanied by a new, enhanced version of SARA
… and it’s nearly ready (honest)
Error correction issues
Nothing can be added
Catching up with the standards
• CDIF … TEI … EAGLES … CES …
• headers are now in TEI-conformant XML
Indeterminacy of any transcription
• On the scale of the BNC, especially
If seven maids with seven mops…
Error Corrections in BNC2
POS correction: systematic
• uses improved rules derived from BNC Sampler
• significantly reduced error rate and indeterminacy
Major production errors fixed: semi-systematic
• duplicate texts
• wrongly labelled texts
• participant details
• classification errors and lacunae
Typos remain... and will do so!
The BNC as an Open Corpus
We chose SGML to encourage development of other tools
This is coming more slowly than we expected, e.g. the Sampler
But people still think the BNC and SARA are the same thing
New features in SARA
POS code searches
Collocation searches
Subcorpora
Lemmatization rules
Usable with any TEI conformant corpus
Know your audience
Everyone knows you should research the market first...
• small, specialist research community, lexicographers
The actual market is immense:
• language learners
• applied linguists
• cultural historians
and technically unsophisticated, hence often misled or disappointed
Technological blind spots
we didn't expect the XML revolution!
• so we wasted time in format conversion and compromises
we didn't foresee PCs with 8 Gb disks and sound cards!
• so we didn't try to get rights to the audio
• and we focussed efforts on developing a client/server application
Missed opportunities: the R-word
Original design talks of Representativeness
This shifted to the idea of the BNC as a "fonds": a source of specialist corpora
This implies
• a clearer and agreed taxonomy of text types
• better access facilities for subcorpora