Page 1: Using Corpora - I

Albert Gatt, 31st October 2014

Page 2: Goals of this seminar

Practical skills:

1. Searching for words in corpora and quantifying results: basics of frequency distributions, measures of collocational strength, keyword analysis
2. Pattern-matching: regular expressions, corpus query language
3. Analysing results: sampling from result sets, categorising outcomes

Page 3: Part 1: Some basic concepts

Page 4: Text

Page 5: Text (vertical format)

Example (Maltese), shown one token per line in the original slide:

Didelphoidea hija superfamilja ta' mammiferi marsupjali, eżattament l-opossumi tal-kontinenti Amerikani
("Didelphoidea is a superfamily of marsupial mammals, specifically the opossums of the American continents.")

Paragraph splitting, sentence splitting and tokenisation are marked in the vertical file with <text>, <p> and <s> elements.
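A minimal sketch of how raw text ends up in this vertical format, assuming naive regex-based splitting rather than the actual MLRS pipeline:

```python
# Minimal sketch (not the MLRS pipeline): convert raw text into a vertical,
# one-token-per-line format with <text>, <p> and <s> markers, as used by
# tools such as the IMS Corpus Workbench.
import re

def verticalise(paragraphs):
    lines = ["<text>"]
    for para in paragraphs:
        lines.append("<p>")
        # naive sentence splitting on ., ! or ? followed by whitespace
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            lines.append("<s>")
            # naive tokenisation: runs of word characters, or single punctuation marks
            lines.extend(re.findall(r"\w+|[^\w\s]", sent))
            lines.append("</s>")
        lines.append("</p>")
    lines.append("</text>")
    return "\n".join(lines)

print(verticalise(["Didelphoidea hija superfamilja ta' mammiferi marsupjali."]))
```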

Page 6: Metadata

Text-level

Information about the text, its origins, etc., e.g. text genre. This can be very detailed (e.g. it may include the gender of the author); it depends on the information available.

Page 7: Metadata

Structural

Information about the principal divisions: section, heading, paragraph...

Page 8: Metadata

Token-level

Information about individual words: part of speech, lemma, orthographic information (e.g. error coding), sentiment, word sense (pretty much anything that might be relevant and is feasible).

Page 9: Underlying representation: MLRS

Page 10: Underlying representation: CLEM

Page 11: Underlying representation: BNC

<u who=D00011><s n=00011><event desc="radio on">

<w PNP><pause dur=34>You<w VVD>got<w TO0>ta <unclear><w NN1>Radio<w CRD>Two <w PRP>with <w DT0>that <c PUN>.

</u>

The example shows an utterance tag with a speaker ID attribute, a sentence tag within the utterance, a non-verbal action during speech, a pause marked with its duration, and unclear, non-transcribed speech. Many other tags mark non-linguistic phenomena.

Page 12: Levels of linguistic annotation

part-of-speech (word-level)

lemmatisation (word-level)

parsing (phrase & sentence-level -- treebanks)

semantics (multi-level): semantic relations between words and phrases, semantic features of words

discourse features (supra-sentence level)

phonetic transcription

prosody

Page 13: Searching

It is important to know what metadata is available in a corpus.

Corpus                  | Text-level                                       | Structural                 | Token-level
MLRS v1.0               | Text type                                        | Paragraph, sentence, token | None
MLRS v2.0               | Text type                                        | Paragraph, sentence, token | Part of speech
MLRS v3.0 (forthcoming) | Text type                                        | Paragraph, sentence, token | Part of speech, lemma, root, (phonetic transcription)
CLEM v1.0               | Exam level                                       | Paragraph, sentence, token | Part of speech, lemma
CLEM v2.0 (forthcoming) | Exam level, gender, mark/grade, locality, school | Paragraph, sentence, token | Part of speech, lemma, orthographic errors

Page 14: How it's used

May be online or local

Page 15: Tools

We will be using online interfaces to corpora:

MLRS (Maltese Language Resource Server): uses the Corpus Workbench and CQP; different corpora available in English and Maltese.

Other online interfaces:

SketchEngine (http://www.sketchengine.co.uk): corpora in several languages; a similar interface; requires a licence.

Corpora @ BYU (http://corpus.byu.edu): different corpora (mostly English); a somewhat different search interface; free.

You also have access to a large corpus called the Web.

Page 16: Part 2: Part-of-speech tagging

Page 17: Part-of-speech tagging

Purpose: label every token with information about its part of speech.

Requirements: a tagset which lists all the relevant labels.

Page 18: Part-of-speech tagsets

Tagging schemes can be very granular. Maltese examples:

VV1SR (verb, main, 1st pers, sing, perf): imxejt "I walked"
VA1SP (verb, aux, 1st pers, sing, past): kont miexi "I was walking"
NNSM-PS1S (noun, common, sing, masc + poss. pronoun, sing, 1st pers): missier-i "my father"

Page 19: How POS taggers work

1. Start with a manually annotated portion of text (usually several thousand words): the/DET man/NN1 walked/VV
2. Extract a lexicon and some probabilities, e.g. the probability that a word is NN given that the previous word is DET (see the sketch after these steps).
3. Run the tagger on new data.
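A minimal sketch of steps 1-3, assuming a toy annotated sample and a greedy decoder (a real tagger would use far more training data and Viterbi-style decoding):

```python
# Minimal sketch, not a production tagger: estimate P(tag | previous tag)
# and P(word | tag) from a tiny hand-annotated sample, then tag new text
# greedily, one word at a time.
from collections import Counter, defaultdict

annotated = [("the", "DET"), ("man", "NN1"), ("walked", "VV"),
             ("the", "DET"), ("dog", "NN1"), ("barked", "VV")]

emit = defaultdict(Counter)    # tag -> word counts (the "lexicon")
trans = defaultdict(Counter)   # previous tag -> tag counts
prev = "<s>"
for word, tag in annotated:
    emit[tag][word] += 1
    trans[prev][tag] += 1
    prev = tag

def prob(counter, key):
    total = sum(counter.values())
    return counter[key] / total if total else 0.0

def tag_sentence(words, tagset=("DET", "NN1", "VV")):
    prev, output = "<s>", []
    for w in words:
        # greedy choice: maximise P(tag | previous tag) * P(word | tag);
        # unseen words simply fall back to the first tag in the tagset
        best = max(tagset, key=lambda t: prob(trans[prev], t) * prob(emit[t], w))
        output.append((w, best))
        prev = best
    return output

print(tag_sentence(["the", "dog", "walked"]))
# [('the', 'DET'), ('dog', 'NN1'), ('walked', 'VV')]
```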

Page 20: Challenges in POS tagging

Recall that the process is usually semi-automatic.

Granularity vs. correctness: the finer the distinctions, the greater the likelihood of error, and manual correction is extremely time-consuming.

Page 21: Try it out

Maltese (MLRS POS Tagger): http://metanet4u.research.um.edu.mt/tools.jsp

English (example from LingPipe): http://alias-i.com/lingpipe/web/demo-pos.html

Page 22: Part 3: Words I (BNC and SketchEngine)

Page 23: Get online!

We’ll work with the British National Corpus first.

SketchEngine: http://www.sketchengine.co.uk

Username: lin3098
Password: pZxMmUaVTd

Page 24: Use case 1: word frequencies

Construct a word list for the entire BNC

Rank-frequency distribution

Zipf’s law
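A minimal sketch of the same exercise outside the web interface, assuming any plain-text file (the filename corpus.txt is a stand-in):

```python
# Minimal sketch: build a word list from a plain-text file, rank the words
# by frequency, and inspect the rank-frequency relationship that Zipf's law
# describes (frequency roughly proportional to 1/rank).
import re
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"\w+", f.read().lower())

freqs = Counter(tokens)
for rank, (word, freq) in enumerate(freqs.most_common(20), start=1):
    # under Zipf's law, rank * frequency stays roughly constant
    print(f"{rank:>4}  {word:<15} {freq:>8}  rank*freq = {rank * freq}")
```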

Page 25: Use case 2: KWIC concordance

Case study: quiver (transitive or intransitive?)

Basic search: use the simple search interface to find the word in context, and view frequency by text type.

Analyse results: take a random sample (n = 100) and view the concordance (a sketch of a simple KWIC view follows).
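For comparison, a minimal keyword-in-context sketch over a plain token list (the example sentence is made up; this is not the SketchEngine concordance itself):

```python
# Minimal KWIC sketch: print each hit of the node word with a fixed window
# of context on either side.
def kwic(tokens, node, window=5):
    node = node.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40}  [{tok}]  {right}")

tokens = "I felt her lip quiver against my cheek as she spoke".split()
kwic(tokens, "quiver")
```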

Page 26: KWIC/sentence views

Page 27: KWIC/sentence views

Page 28: Frequency representation

Simple frequency: just the raw frequency of the word/phrase.

Multilevel frequency distribution: cross-classification, e.g. the frequency of a word/phrase by document type (a sketch follows).
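A minimal sketch of such a cross-classification, assuming hypothetical (word, text type) data rather than a real corpus:

```python
# Minimal sketch: frequency of a word cross-classified by document type.
from collections import Counter

# (token, text_type) pairs; in practice the text type comes from the
# corpus's text-level metadata
corpus = [("quiver", "fiction"), ("quiver", "fiction"),
          ("quiver", "newspaper"), ("arrow", "fiction")]

by_type = Counter(corpus)
print(by_type[("quiver", "fiction")])    # 2
print(by_type[("quiver", "newspaper")])  # 1
```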

Page 29: Relative frequency

Corpus: 1000 words, divided into subcorpus A (500 words) and subcorpus B (500 words).
Quiver: 100 occurrences in total; 50 in A and 50 in B.

Expectation: since A is 50% and B is 50% of the total, quiver would be expected to occur 50% of the time in A and 50% of the time in B.

Here the distribution of quiver over the two subcorpora matches the distribution of the two subcorpora within the whole (50%): relative frequency in A = 100%, relative frequency in B = 100%.

Page 30: Frequency by document type

(Figure: thickness = raw frequency; length = text-type frequency.)

Page 31: Relative frequency

Corpus: 1000 words, divided into subcorpus A (500 words) and subcorpus B (500 words).
Quiver: 100 occurrences in total; 75 in A and 25 in B.

Expectation: since A is 50% and B is 50% of the total, quiver would be expected to occur 50% of the time in A and 50% of the time in B.

Here the distribution of quiver over the two subcorpora does not match the distribution of the two subcorpora within the whole (50%): relative frequency in A > 100%, relative frequency in B < 100% (see the sketch below).
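One way to make the two scenarios above concrete, as a minimal sketch (relative frequency read as observed share divided by expected share):

```python
# Minimal sketch: compare the observed share of hits in a subcorpus with
# the share expected from the subcorpus size alone.
def relative_frequency(hits_in_sub, total_hits, sub_size, corpus_size):
    observed = hits_in_sub / total_hits
    expected = sub_size / corpus_size
    return observed / expected   # 1.0 (i.e. 100%) means "exactly as expected"

# Page 29 scenario: 50 of 100 hits in each 500-word half of a 1000-word corpus
print(relative_frequency(50, 100, 500, 1000))   # 1.0 -> 100%
# Page 31 scenario: 75 hits in subcorpus A, 25 in subcorpus B
print(relative_frequency(75, 100, 500, 1000))   # 1.5 -> above 100%
print(relative_frequency(25, 100, 500, 1000))   # 0.5 -> below 100%
```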

Page 32: A better concordance

A slightly more informed search: search by lemma; exploit POS information (quiver only as a verb); look at frequencies of the node plus the word to its right.

Page 33: Use case 3: big, large, great

A traditional dictionary (OED online):

large adj. of considerable or relatively great size, extent, or capacity
big adj. of considerable size, physical power, or extent
great adj. of an extent, amount, or intensity considerably above average

Can collocational analysis give a better sense of the differences?

Page 34: A motivating example

Consider phrases such as:

strong tea vs. powerful tea
strong support vs. powerful support
powerful drug vs. strong drug

Traditional semantic theories have difficulty accounting for these patterns: strong and powerful seem to be near-synonyms, so do we claim they have different senses? What is the crucial difference?

Page 35: The empiricist view of meaning

Firth’s view (1957): “You shall know a word by the company it keeps.”

This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953).

In the Firthian tradition, attention is paid to patterns that crop up with regularity in language. Contrast symbolic/rationalist approaches, which emphasise polysemy, componential analysis, etc.

Statistical work on collocations tends to follow this tradition.

Page 36: Defining collocations

“Collocations … are statements of the habitual or customary places of [a] word.” (Firth 1957)

Characteristics/expectations: regular/frequently attested; occur within a narrow window (a span of a few words); not fully compositional; non-substitutable; non-modifiable; display category restrictions.

Page 37: Collocation analysis

The term collocation typically refers to some semantically interesting relationship between two (or more) words.

But the techniques we will look at are in fact generalisable: they can be used to quantify the “closeness” between any two words.

Page 38: Get some data!

Run a concordance for big/large/great. You can control how wide your window is, using the context option in the left menu. We can restrict our search to the immediate right collocate which is a noun.

Page 39: Get some data!

Make a note of the frequency of each adjective.

For each adjective, generate the list of collocates by choosing the collocations option from the left menu. Sort the collocates by frequency and take note of the top 10 most frequent NOUN collocates (a sketch of the underlying computation follows).
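A minimal sketch of what the collocation view computes, assuming a small hypothetical POS-tagged token list rather than the BNC itself:

```python
# Minimal sketch: count the nouns occurring immediately to the right of
# each adjective of interest and report the most frequent ones.
from collections import Counter, defaultdict

# (word, pos) pairs; AJ0 = adjective, NN* = noun in BNC-style tags
tagged = [("a", "AT0"), ("large", "AJ0"), ("number", "NN1"),
          ("of", "PRF"), ("big", "AJ0"), ("problems", "NN2"),
          ("and", "CJC"), ("great", "AJ0"), ("interest", "NN1")]

collocates = defaultdict(Counter)
for (w1, p1), (w2, p2) in zip(tagged, tagged[1:]):
    if p1 == "AJ0" and p2.startswith("NN"):
        collocates[w1.lower()][w2.lower()] += 1

for adj in ("big", "large", "great"):
    print(adj, collocates[adj].most_common(10))
```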

Page 40: Measures of collocational strength

Statistical measures of collocational strength are based on the following notion: if x and y are truly collocated, then the likelihood of x and y cropping up together should be greater than the likelihood of x and y cropping up independently.

Case 1: x and y are independent. If this is true, then P(x, y) should be no larger than P(x)P(y), i.e. P(y|x) should be no larger than P(y).

Case 2: x and y are collocated. If this is true, then P(x, y) should be (significantly) larger than P(x)P(y), i.e. P(y|x) should be (significantly) larger than P(y).

Page 41: Common measures: mutual information

A ratio that seeks to answer the question: how much do I get to know about y if I also know about x? (I.e. how much information about y is contained in x?)

The relevant sense of information here is occurrence: does an occurrence of x also guarantee that y will occur?

MI(x, y) = log [ P(x, y) / (P(x) P(y)) ]

Page 42: Common measures: t-test

A ratio that seeks to answer the question: if I make the assumption that x and y are related, does this differ statistically from the assumption that x and y are not related?

Example: is large number a collocation? My corpus (ukWaC) contains 239,074,304 two-word sequences. I can answer the question above by counting how many of these sequences are the one I am interested in.

Page 43: Common measures: t-test (continued)

                  <large x>   <x number(s)>
C(w)              555,510     1,303,561
P(w)              0.0023      0.0054

P(large)P(number) = 0.0000126

                  large number   any bigram
C(w1, w2)         50,833         239,074,304

P(large number) = 0.00021

Hypothesis 1: large and number are independent.
Hypothesis 2: large and number are not independent (i.e. they are collocated).

t = (P(large number) - P(large)P(number)) / sqrt(P(large number) / N)
  = (0.00021 - 0.0000126) / sqrt(0.00021 / 239,074,304)
  ≈ 210.62
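A minimal sketch that reproduces the numbers above from the raw counts (the log base for mutual information is an assumption; base 2 is used here):

```python
# Minimal sketch: mutual information and t-score for "large number" in
# ukWaC, using the counts from the slide.
import math

N = 239_074_304                  # two-word sequences in the corpus
c_large, c_number = 555_510, 1_303_561
c_large_number = 50_833

p_large = c_large / N            # ~ 0.0023
p_number = c_number / N          # ~ 0.0054
p_bigram = c_large_number / N    # ~ 0.00021

# mutual information: log P(x, y) / (P(x) P(y))
mi = math.log2(p_bigram / (p_large * p_number))

# t-score: (observed - expected) / sqrt(observed / N)
t = (p_bigram - p_large * p_number) / math.sqrt(p_bigram / N)

print(f"MI = {mi:.2f}")   # ~ 4.1
print(f"t  = {t:.2f}")    # ~ 212 (the slide's 210.62 comes from rounded probabilities)
```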

Page 44: Common measures: chi-square

A ratio that seeks to answer the question: if I make the assumption that x and y are related, does this differ statistically from the assumption that x and y are not related? I.e. just like the t-test.

Main difference: the t-test works with probabilities, whereas chi-square is designed to work directly with frequencies.
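A minimal sketch of the standard 2x2 chi-square computation, reusing the ukWaC counts from the previous slide (the contingency-table layout is an assumption, not shown on the slide):

```python
# Minimal sketch: chi-square for "large number" computed directly from
# observed frequencies in a 2x2 contingency table.
N = 239_074_304
o11 = 50_833                # large + number
o12 = 555_510 - o11         # large + some other second word
o21 = 1_303_561 - o11       # some other first word + number
o22 = N - o11 - o12 - o21   # neither

chi2 = (N * (o11 * o22 - o12 * o21) ** 2 /
        ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))
print(f"chi-square = {chi2:,.0f}")   # roughly 7.6e5, far beyond any significance cut-off
```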

Page 45: Common measures: log likelihood

A ratio that seeks to answer the question: what evidence do I have for the hypothesis that x and y are related, compared to the hypothesis that they are not? (I won’t go into the maths.)

Log likelihood is used more often than chi-square (and can be interpreted in much the same way).
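The slide leaves out the maths; a minimal sketch of the widely used Dunning-style log-likelihood ratio, computed from the same 2x2 table of observed frequencies (this derivation is not from the slides):

```python
# Minimal sketch: Dunning-style log-likelihood ratio, G2 = 2 * sum(O * ln(O / E)),
# with expected counts E taken from the row and column totals.
import math

N = 239_074_304
o11, o12, o21 = 50_833, 555_510 - 50_833, 1_303_561 - 50_833
o22 = N - o11 - o12 - o21

def g2(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    observed = (o11, o12, o21, o22)
    expected = (row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

print(f"log-likelihood ratio = {g2(o11, o12, o21, o22):,.0f}")
```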