Using Corpora and how to build them
Adam Kilgarriff
Lexical Computing Ltd
Madrid 2010 Kilgarriff: Web Corpora
• Corpora show us the facts of the language
What is a corpus?
• A corpus is a collection of texts
  – when viewed as an object of language research
Which texts?
• Written
• Spoken
Written
• Books
  – Fiction
  – Non-fiction
  – Textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …
Spoken
Must be transcribed, for text corpora
• Conversation
  – Who? Region, class, age-group, situation…
• Lectures
• TV and Radio
• Film transcripts
• Meetings, seminars
• …
Which texts?
• Different purposes, different text types
• Making dictionaries:
  – Cover the whole language
  – Some of everything
How much?
• Most words are rare
• Zipf’s Law
• To get enough data for most words, we need very big corpora
Zipf’s Law
Word (pos)       r         f          r × f
the (det)        1         6187267    6187267
to (prep)        10        917579     9175790
as (adv)         100       91583      9158300
playing (vb)     1000      9738       9738000
paint (vb)       2000      4539       9078000
amateur (adj)    10,000    741        7410000
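The point of the table is that rank times frequency stays in roughly the same range while rank itself grows by four orders of magnitude. The following is a minimal sketch that recomputes the last column from the r and f values in the table; the function name `rank_times_freq` is made up for this illustration.

```python
# Rank (r) and frequency (f) pairs copied from the table above.
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]


def rank_times_freq(table):
    """Return (word, r * f) for each row of a rank/frequency table."""
    return [(word, r * f) for word, r, f in table]


products = rank_times_freq(rows)
for word, product in products:
    print(f"{word:10s} {product:>12,d}")

# Ratio between the largest and the smallest product in the table.
values = [product for _, product in products]
spread = max(values) / min(values)
print(f"largest / smallest product: {spread:.2f}")
```

For these six rows the largest product is only about 1.6 times the smallest, although the ranks run from 1 to 10,000.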
Zipf’s Law
• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
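Coverage figures like these can be computed from any tokenised corpus by asking what share of all tokens the n most frequent word types account for. Below is a minimal sketch on a toy sentence; the function name `coverage` and the example text are assumptions for illustration, and the percentages on the slide come from a real corpus, not from this code.

```python
from collections import Counter


def coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / len(tokens)


# Toy example: 13 tokens, 7 distinct word types.
tokens = "the cat sat on the mat and the dog sat on the cat".split()

for n in (1, 4, 7):
    print(f"top {n} types cover {coverage(tokens, n):.0%} of the tokens")
```

On a large corpus the same function, called with n = 1, 100 and 7500, would give the kind of figures listed on the slide.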
Zipf’s Law
[Chart: % of all texts (0 to 100) accounted for by 'the', the 100 most frequent, the 3500 most frequent and the 7500 most frequent words]
Leading English Corpora: Size
[Chart: size of corpora (in words), on a scale from 10^6 to 10^9, by decade (1960s to 2000s), for Brown/LOB, COBUILD, BNC and OEC]
Good news
• The web
You can’t help noticing
• Replaceable or replacable?
  – http://googlefight.com
• Very very large
  – 2006 estimates for duplicate-free, linguistic, Google-indexed web
    • German: 44 billion words
    • Italian: 25 billion words
    • English: 1,000 billion to 10,000 billion words
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access
What is a corpus?
• A corpus is a collection of texts
  – when viewed as an object of language research
Is the web a corpus?
Yes
but it’s not representative
Sublanguage
• Language = core + sublanguages
• Options for corpus construction
  – none
  – some
  – all
• None
  – impoverished view of language
• Some: BNC
  – cake recipes and gastro-uterine disease
  – not car repair manuals or astronomy or …
• All: until recently, not viable
Representativeness
• The web is not representative
• but nor is anything else
• Text type variation
  – under-researched, lacking in theory
    • Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988; Kilgarriff 2001
• Text type is an issue across NLP
  – Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
What is out there?
• What text types are there on the web?
  – some are new: chatroom
  – proportions
    • is it overwhelmed by porn? How much?
• Hard question
Comparing frequency lists
• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • that’s 1,000,000,000,000
• Compare with BNC
  – 100 words with highest Web1T:BNC ratio
  – 100 words with lowest ratio
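A comparison of this kind can be sketched as follows: normalise each word's count to a per-million frequency in each corpus, take the ratio, and sort. The slide does not say how zero counts were handled, so the add-one smoothing here is an assumption, as are the function name `ratio_ranking` and the toy counts, which are invented and are not Web1T or BNC figures.

```python
def ratio_ranking(freq_a, size_a, freq_b, size_b, smoothing=1.0):
    """Rank words by (per-million freq in A + k) / (per-million freq in B + k).

    The smoothing constant k avoids division by zero for words that are
    missing from corpus B; it is an assumption of this sketch.
    """
    scored = []
    for word in set(freq_a) | set(freq_b):
        per_million_a = freq_a.get(word, 0) * 1_000_000 / size_a
        per_million_b = freq_b.get(word, 0) * 1_000_000 / size_b
        ratio = (per_million_a + smoothing) / (per_million_b + smoothing)
        scored.append((word, ratio))
    # Highest ratio first: words typical of corpus A relative to corpus B.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy counts, each for a corpus of one million words.
web_counts = {"browser": 900, "she": 100, "the": 60000}
bnc_counts = {"browser": 1, "she": 4000, "the": 60000}

ranked = ratio_ranking(web_counts, 1_000_000, bnc_counts, 1_000_000)
for word, ratio in ranked:
    print(f"{word:10s} {ratio:8.3f}")
```

The top of the sorted list corresponds to the "web-high" words and the bottom to the "web-low" words.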
Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself
• The web
  – a social, cultural, political phenomenon
  – new, little understood
  – a legitimate object of science
  – mostly language
• we are well placed
  – a lot of people will be interested
• Let’s
  – study the web
  – source of language data
  – apply our tools for web use (dictionaries, MT)
  – use the web as infrastructure
Using Search Engines
No setup costs
Start querying today
Methods
• Hit counts
• ‘snippets’
  – Metasearch engines, WebCorp
• Find pages and download
Googleology
• Google hit counts for language modelling
  – Example: (Keller & Lapata 2003)
  – 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
• Very interesting work
• Great interest in query syntax
The Trouble with Google
• not enough instances
  – max 1000
• not enough queries
  – max 1000 per day with API
• not enough context
  – 10-word snippet around search term
• sort order
  – search term in titles and headings
• untrustworthy hit counts
• limited search options
• linguistically dumb, eg not lemmatised
  – aime/aimer/aimes/aimons/aimez/aiment …
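To make the lemmatisation point concrete: a search engine treats each inflected form as a separate string, whereas a lemmatised corpus lets all forms of a verb be counted together. The sketch below uses a hand-made form-to-lemma table covering only the forms listed on the slide; a real system would use a lemmatiser, and the names `LEMMA` and `lemma_counts` are hypothetical.

```python
from collections import Counter

# Hand-made table for the forms listed above; an assumption for this sketch,
# not a real lemmatiser.
FORMS = ["aime", "aimer", "aimes", "aimons", "aimez", "aiment"]
LEMMA = {form: "aimer" for form in FORMS}


def lemma_counts(tokens):
    """Count tokens by lemma, falling back to the lowercased form itself."""
    return Counter(LEMMA.get(token.lower(), token.lower()) for token in tokens)


tokens = "Nous aimons Paris et ils aiment Paris".split()
counts = lemma_counts(tokens)
print(counts["aimer"])   # both inflected forms are counted under one lemma
```

Without the table, "aimons" and "aiment" would be counted as two unrelated words.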
• Appeal
  – Zero-cost entry, just start googling
• Reality
  – High-quality work: high-cost methodology
Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
• Googleology is bad science
• So…
Basic steps
• Gather pages
  – Google hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
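The steps above can be sketched as a small pipeline in which each stage is a plain function. This is only an outline under stated assumptions: pages are taken as already downloaded strings, the filter is a crude length check, de-duplication catches only exact duplicates after case and whitespace normalisation, and "linguistic processing" is reduced to tokenisation. All function names are made up for this sketch.

```python
import hashlib
import re


def is_texty(text, min_words=5):
    """Crude filter stage: keep documents with at least min_words tokens."""
    return len(text.split()) >= min_words


def dedup(texts):
    """Drop exact duplicates, ignoring case and whitespace differences."""
    seen = set()
    unique = []
    for text in texts:
        normalised = " ".join(text.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def tokenise(text):
    """Stand-in for linguistic processing: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_corpus(pages):
    """Filter, de-duplicate and tokenise a list of page texts."""
    kept = [page for page in pages if is_texty(page)]
    kept = dedup(kept)
    return [tokenise(page) for page in kept]


pages = [
    "Corpora show us the facts of the language.",
    "corpora  show us the facts of the language.",   # duplicate of the first
    "Home | About",                                   # too short to keep
    "Most words are rare, so we need very big corpora.",
]
corpus = build_corpus(pages)
print(len(corpus), "documents kept")
```

Near-duplicates (pages that differ in a few words) are not caught by hashing whole documents; they need a separate technique.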
Oxford English Corpus
• Whole domains chosen and harvested
  – control over text type
• 2 billion words (Mar 08)
DeWaC, ItWaC, UKWaC
• 1.5 B words each
• Marco Baroni, Adriano Ferraresi
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
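Picking mid-frequency seed words from a frequency list can be sketched as taking a band of ranks, skipping both the very frequent function words and the very rare words. The slide does not give the band that was used, so the rank limits below are placeholders, and the function name and the toy frequency list are assumptions for illustration.

```python
def mid_frequency_words(freq_list, lo_rank, hi_rank):
    """Return words whose frequency rank falls in [lo_rank, hi_rank] (1-based)."""
    # Sort by descending frequency; break ties alphabetically for stable output.
    ranked = sorted(freq_list.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[lo_rank - 1:hi_rank]]


# Invented toy frequency list.
freq = {"the": 1000, "of": 800, "house": 50, "water": 45, "garden": 40, "zygote": 1}

seeds = mid_frequency_words(freq, lo_rank=3, hi_rank=5)
print(seeds)
```

With a real frequency list the band would be much wider, and the resulting words would then be sent to the search engine to find starting pages for the crawl.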
Filtering
• Non-text (sound, image etc) files
• Boilerplate (within file)
  – Copyright notices, navigation bars
  – “high markup” heuristic
• Not “text in sentences”
  – Look for function words
  – Lists?? Sports results?? Crossword puzzles??
• Spam, pornography
  – Tough
• De-duplication (also tough)
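Two of the heuristics named above lend themselves to a short sketch: the share of function words in a text (running prose has many, lists and results tables have few) and the share of characters that sit inside markup tags (navigation bars are mostly tags). The function-word list, the thresholds and the function names below are assumptions chosen for illustration, not values from the slides.

```python
import re

# A tiny English function-word list; a real filter would use a longer one.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}


def function_word_ratio(text):
    """Share of word tokens that are function words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in FUNCTION_WORDS for word in words) / len(words)


def looks_like_running_text(text, min_ratio=0.2):
    """Heuristic for 'text in sentences': enough function words."""
    return function_word_ratio(text) >= min_ratio


def markup_density(html):
    """Share of characters that sit inside <...> tags."""
    if not html:
        return 0.0
    tag_chars = sum(len(tag) for tag in re.findall(r"<[^>]*>", html))
    return tag_chars / len(html)


prose = "It is clear that the web is a source of data for the study of language."
results = "Arsenal 2 Chelsea 1 Liverpool 0 Everton 0"
nav = '<a href="/">Home</a><a href="/about">About</a>'

print(looks_like_running_text(prose), looks_like_running_text(results))
print(f"markup density of nav bar: {markup_density(nav):.2f}")
```

A block with high markup density would be treated as boilerplate and dropped; a document with too few function words would be rejected as not being text in sentences.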
Small, specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date
BootCaT (Bootstrapping Corpora and Terms)
• Put in seed terms
• Google/Yahoo search
• Retrieve Google/Yahoo hits
  – Remove duplicates, boilerplate
• Small instant corpora
• Baroni and Bernardini, LREC 2004
• Web version
  – WebBootCaT
  – At Sketch Engine site
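The slide does not say how the seed terms are turned into search queries. One plausible approach, sketched here, is to combine the seeds into small tuples and send each tuple as one query, so that the pages found contain several domain terms at once. The tuple size, the number of queries, the quoting of multi-word terms and the function names are all assumptions of this sketch, and no search engine is actually called.

```python
import itertools
import random


def quote(term):
    """Put multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def seed_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed terms into tuples and return them as query strings."""
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)   # fixed seed so the sketch is repeatable
    rng.shuffle(combos)
    return [" ".join(quote(term) for term in combo) for combo in combos[:max_queries]]


seeds = ["corpus", "lemma", "concordance", "collocation", "part of speech"]
queries = seed_queries(seeds)
for query in queries:
    print(query)
```

Five seed terms give ten distinct three-term queries; the pages returned for each query would then be downloaded, cleaned and de-duplicated.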
Task
• Choose area of specialist interest
  – English or Spanish
• Select at least 5 seed terms
  – Specialist: good
• Build corpus
  – At least 100,000 words
  – Iterate if necessary
• Find at least six words/phrases/meanings you did not know before
• Write up