School of Computing, FACULTY OF ENGINEERING: What’s in a Corpus? Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

Mar 28, 2015

Page 1

School of Computing, FACULTY OF ENGINEERING

What’s in a Corpus?

Eric Atwell, Language Research Group

(with thanks to Katja Markert, Marti Hearst, and other contributors)

Page 2

Reminder

Why NLP is difficult: language is a complex system

How to solve it? Corpus-based machine-learning approaches

Motivation: applications of “The Language Machine”

BACKGROUND READING: (Atwell 99) The Language Machine

Intro to NLTK

Visit the website: http://www.nltk.org

Page 3

Today

The main areas of linguistics

Rationalism: language models based on expert introspection

Empiricism: models via machine-learning from a corpus

Corpus: text selected by language, genre, domain, …

Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …

Corpus Annotation: text headers, PoS, parses, …

Corpus size is no. of words – depends on tokenisation

We can count word tokens, word types, type-token distribution

Lexeme/lemma is “root form”, v inflections (be v am/is/was…)

Page 4

The main sub-areas of linguistics

◮ Phonetics and Phonology: The study of linguistic sounds or speech.

◮ Morphology: The study of the meaningful components of words.

◮ Syntax (grammar): The study of the order and links between words.

◮ Semantics: The study of meanings of words, phrases, sentences.

◮ Discourse: The study of linguistic units larger than a single utterance.

◮ Pragmatics: The study of how language is used to accomplish goals.

Page 5

Why is NLP hard?

Main reason: Ambiguity in all areas and on all levels, e.g.:

◮ Phonetic Ambiguity: 1 expression being pronounced in several ways

◮ POS Ambiguity: 1 word having several different Parts of Speech (adjective/noun...)

◮ Lexical Ambiguity: 1 word having several different meanings

◮ Syntactic/Structural Ambiguity: 1 phrase or sentence having several different possible structures

◮ Pragmatic Ambiguity: 1 sentence communicating several different intentions

◮ Referential Ambiguity: 1 expression having several different possible references

Key Task in NLP: Disambiguation in context!

Page 6

Rationalism v Empiricism

Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)

Noam Chomsky, 1957 Syntactic Structures

-Argued that we should build models through introspection:

-A language model is a set of rules thought up by an expert

Like “Expert Systems”…

Chomsky thought data was full of errors, better to rely on linguists’ intuitions…

Page 7

Empiricism v Rationalism

Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)

The field was stuck for quite some time: rationalist linguistic models for a specific example did not generalise.

A new approach started around 1990: Corpus Linguistics

• Well, not really new, but from the 1950s to the 1980s researchers didn’t have the text, disk space, or GHz

Main idea: machine learning from CORPUS data

How to do corpus linguistics:

• Get large text collection (a corpus; plural: several corpora)

• Compute statistical models over the words/PoS/parses/… in the corpus

Surprisingly effective
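
As a rough illustration of "computing a statistical model over the words", the sketch below estimates bigram probabilities by relative frequency. The toy corpus, the function name `train_bigram_model`, and the choice of a bigram model are assumptions made for this example; the slides do not prescribe a particular model.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Estimate P(next word | previous word) by relative frequency."""
    pair_counts = defaultdict(Counter)
    # Count every adjacent word pair in the corpus.
    for prev, nxt in zip(tokens, tokens[1:]):
        pair_counts[prev][nxt] += 1
    model = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        # Normalise the counts for each context word into probabilities.
        model[prev] = {word: count / total for word, count in followers.items()}
    return model

# Hypothetical toy corpus; a real corpus would be far larger.
toy_corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(toy_corpus)
print(model["the"])
```

With a corpus this small the estimates are unreliable; the point is only that the "model" is nothing more than counts taken from text.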

Page 8

What is a corpus?

A corpus is a finite machine-readable body of naturally occurring text, selected according to specified criteria, e.g.:

◮ Language and type: English/German/Arabic/…, dialects v. “standard”, edited text v. spontaneous speech, …

◮ Genre and Domain: 18th century novels, newspaper text, software manuals, train enquiry dialogue...

◮ Web as Corpus: URL “domain” = country: .uk .ar

◮ Media: “Written” Text, Audio, Transcriptions, Video.

◮ Size: 1000 words, 50K words, 1M words, 100M words, ???

Page 9

Brown and LOB

◮ Brown: Famous first corpus! (well, first widely-used corpus)

◮ by Nelson Francis and Henry Kucera, Brown University USA

◮ A balanced corpus: representative of a whole language

◮ Brown: balanced corpus of written, published American English from 1960s (newspapers, books, … NOT handwritten)

◮ 1 million words, Part-of-Speech tagged.

◮ LOB: Lancaster-Oslo/Bergen corpus: British English version

◮ published British English text from equivalent 1960s sources

◮ FROWN, FLOB: US, UK text from equivalent 1990s sources

Page 10

Some recent corpora

Corpus features: Size, Domain, Language

British National Corpus: 100M words, balanced British English

Newswire Corpus: 600M words, newswire, American English

UN or EU proceedings: 20M+ words, legal, 10 language pairs

Penn Treebank: 2M words, newswire American English

MapTask: 128 dialogues, British English

Corpus of Contemporary Arabic: 1M words, balanced Arabic

Web: 8 billion(?) words, many domains and languages

Web-as-Corpus: harvest your own corpus from WWW, via “seed terms” → Google API → web-pages → Corpus!

Marco Baroni: BootCat, Adam Kilgarriff: SketchEngine, …

Page 11

Corpus Annotation

Annotation is a process in which linguistics experts add (linguistic) information to the corpus that is not explicitly there (increases utility of a corpus), e.g.:

◮ Text Headers: meta-data for each text: author, date, type, …

◮ Part-of-speech tag for each word (very common!).

◮ Syntactic structure: parse-tree for each sentence

◮ Word Sense label for each word

◮ Prosodic information: pauses, rise and fall in pitch, etc.

Page 12

Annotation example: POS tagging

◮ Some texts are annotated with Part-of-speech (POS) tags.

◮ POS tags encode simple grammatical functions.

<s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w>

<w pos=NN> sentence </w>.</s>

◮ Several tag sets:

◮ Brown tag set (87 tags) in Brown corpus

◮ CLAWS / LOB tag set (132 tags) in LOB corpus

◮ Penn tag set (45 tags) in Penn Treebank

◮ CLAWS c5 tag set (62 tags) in BNC (British National Corpus)

◮ Tagging is usually done automatically (then proofread and corrected)
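
The tagged sentence above can be read back into (word, tag) pairs with a few lines of code. This is a minimal sketch, assuming the markup always looks exactly like the example (unquoted `pos=` attributes, one `<w>` element per word); the names `W_ELEMENT` and `read_tagged_words` are made up for illustration. A regular expression is used because the unquoted attributes are not well-formed XML.

```python
import re

# The example sentence from the slide, joined onto one line.
TAGGED = ('<s><w pos=RN> Here </w> <w pos=BEZ> is </w> <w pos=AT> a </w> '
          '<w pos=NN> sentence </w>.</s>')

# Captures the tag, then the word with surrounding spaces trimmed.
W_ELEMENT = re.compile(r'<w pos=(\w+)>\s*(.*?)\s*</w>')

def read_tagged_words(markup):
    """Return a list of (word, tag) pairs found in the markup."""
    return [(word, tag) for tag, word in W_ELEMENT.findall(markup)]

for word, tag in read_tagged_words(TAGGED):
    print(word, tag)
```

Punctuation outside a `<w>` element (the final full stop here) is ignored by this sketch.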

Page 13

http://www.comp.leeds.ac.uk/eric/atwell00icamej.pdf

Page 14

http://www.comp.leeds.ac.uk/eric/atwell08clih.pdf

Page 15

What’s a word?

How many words do you find in the following short text?

What is the biggest/smallest plausible answer to this question?

What problems do you encounter?

It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much.

Time: 1 minute

Page 16

Counting words: tokenisation

Tokenisation is a processing step in which the input text is automatically divided into units called tokens, each of which is a word, a number, or a punctuation mark…

So, word count can ignore numbers, punctuation marks (?)

Word: Continuous alphanumeric characters delineated by whitespace.

Whitespace: space, tab, newline.

BUT dividing at spaces is too simple: It’s, data base
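
To see how much the answer depends on the tokeniser, the sketch below counts tokens in the short text from the "What’s a word?" slide in two ways: splitting at whitespace, and using a simple regular expression. The pattern is only one plausible set of choices (keep hyphenated and apostrophe forms together, keep prices together, split off other punctuation) and is an assumption of this example, not a standard.

```python
import re

TEXT = ("It’s a shame that our data-base is not up-to-date. "
        "It is a shame that um, data base A costs $2300.50 and that "
        "database B costs $5000. All databases cost far too much.")

# Strategy 1: divide at whitespace only; punctuation stays attached to words.
whitespace_tokens = TEXT.split()

# Strategy 2: a hand-written pattern (an assumption, not a standard):
#   - prices and numbers such as $2300.50 stay in one piece
#   - words may contain internal hyphens or apostrophes (data-base, It’s)
#   - any other non-space character is a token of its own
TOKEN = re.compile(r"\$?\d+(?:\.\d+)?|\w+(?:[-’']\w+)*|[^\w\s]")
regex_tokens = TOKEN.findall(TEXT)

print(len(whitespace_tokens), len(regex_tokens))
print(regex_tokens)
```

Neither strategy decides whether "data base" is one word or two, which is why a single agreed word count for this text is hard to give.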

Page 17

Counting words: types v tokens

◮ Word token: individual occurrence of words.

◮ Q: How big is the corpus (N)?

= how many word tokens are there? (LOB: 1M; BNC: 100M)

◮ Word type: the “word itself” regardless of context

◮ Q: How many “different words” (word types) are there?

= Size of corpus vocabulary (LOB: 50K, BNC: 650K)

◮ Q: What is the frequency of each word type?

= type-token distribution

A few word types (the, of, a, …) are very frequent, but most are rare, and about half of all word types occur only once: Zipf’s Law
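
The three questions on this slide (number of tokens, number of types, frequency of each type) can be answered with a frequency table. The sketch below is a toy, assuming a deliberately crude tokeniser (lower-cased runs of the letters a to z) and a made-up sentence; the function name `type_token_stats` is hypothetical.

```python
import re
from collections import Counter

def type_token_stats(text):
    """Return (token count, type count, hapax count, frequency table)."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude word tokens
    freq = Counter(tokens)                        # word type -> no. of tokens
    hapaxes = [word for word, count in freq.items() if count == 1]
    return len(tokens), len(freq), len(hapaxes), freq

toy_text = "the cat sat on the mat the dog sat on the log"
n_tokens, n_types, n_hapaxes, freq = type_token_stats(toy_text)
print("tokens:", n_tokens, "types:", n_types, "occur once:", n_hapaxes)

# Rank-frequency listing: the type-token distribution, most frequent first.
for rank, (word, count) in enumerate(freq.most_common(), start=1):
    print(rank, word, count)
```

A text this short cannot show Zipf’s Law convincingly; running the same counts over a full corpus such as LOB is what reveals the long tail of rare types.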

Page 18

Other sorts of “words”

◮ Lemma/Lexeme: dictionary form of a word. cost and costs are derived from the same lexeme “cost”.

data-base, data base, database, databases – same lexeme

Can include spaces: data base, New York

Ambiguous tokenization: as well (= also), as well as (= and)

Inflection: grammatical variant, e.g. cost v costs

◮ Morpheme: basic “atomic” indivisible unit of meaning or grammar, e.g. data, base, s

◮ For languages other than English, morphological analysis can be hard: root/stem, affixes (prefix, postfix, infix)

morph ologi cal or morpho logic al ?
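
The ambiguity in the last line can be made concrete by listing every way a word can be cut into pieces from a given inventory. This is a sketch under a strong assumption: the inventory `TOY_MORPHS` is invented purely to reproduce the two splits shown above and is not a linguistically motivated list of English morphemes.

```python
def segmentations(word, morphs):
    """Return every way to split `word` into a sequence of items from `morphs`."""
    if not word:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for morph in morphs:
        if word.startswith(morph):
            # Segment the remainder and prepend the piece just matched.
            for rest in segmentations(word[len(morph):], morphs):
                results.append([morph] + rest)
    return results

# Hypothetical inventory, chosen only to mirror the slide's example.
TOY_MORPHS = ["morph", "morpho", "ologi", "logic", "cal", "al"]

for pieces in segmentations("morphological", TOY_MORPHS):
    print(" ".join(pieces))
```

A string-matching procedure like this finds both splits and cannot choose between them; picking the right one needs linguistic knowledge or corpus evidence.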

Page 19

Arabic Morphology

Templatic Morphology

[Figure: the Root k-t-b (ك ت ب) combines with the Patterns ma__ū_ and _ā_i_ to give the Lexemes maktūb (مكتوب) “written” and kātib (كاتب) “writer”]

Lexeme.Meaning = (Root.Meaning + Pattern.Meaning) * Idiosyncrasy.Random

Page 20

Arabic Morphology

Root Meaning + Pattern meaning + ??

ك ت ب (KTB) = notion of “writing”

كتب /katab/ write

كاتب /kātib/ writer

مكتوب /maktūb/ letter

كتاب /kitāb/ book

مكتبة /maktaba/ library

مكتب /maktab/ office

مكتوب /maktūb/ written
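
The root-plus-pattern idea can be sketched in a few lines, working on the transliterations rather than Arabic script. The pattern notation used here (capital `C` marking a slot for a root consonant, e.g. `maCCūC`) and the function name `apply_pattern` are assumptions of this example; the slides give only the resulting word forms.

```python
def apply_pattern(root, pattern, slot="C"):
    """Fill each slot in `pattern` with the next consonant of `root`."""
    if pattern.count(slot) != len(root):
        raise ValueError("pattern must have one slot per root consonant")
    consonants = iter(root)
    # Copy the pattern, replacing each slot marker with a root consonant.
    return "".join(next(consonants) if ch == slot else ch for ch in pattern)

# Patterns written in the assumed C-slot notation, with the slide's glosses.
PATTERNS = {
    "CaCaC": "write",
    "CāCiC": "writer",
    "maCCūC": "written / letter",
    "CiCāC": "book",
    "maCCaCa": "library",
    "maCCaC": "office",
}

for pattern, gloss in PATTERNS.items():
    print(apply_pattern("ktb", pattern), "=", gloss)
```

The form is predictable from root and pattern, but the meaning is not fully so (maktūb as both “written” and “letter”), which is the “+ ??” on this slide.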

Page 21

Reminder

Rationalism: language models based on expert introspection

Empiricism: models via machine-learning from a corpus

Corpus: text selected by language, genre, domain, …

Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …

Corpus Annotation: text headers, PoS, parses, …

Corpus size is no. of words – depends on tokenisation

We can count word tokens, word types, type-token distribution

Morpheme: basic lexical unit, “root form”, plus affixes

Lexeme: dictionary entry, can be multi-word: New York