Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex

Jan 12, 2016

Franklin Hodge

Why We Need Corpora and the Sketch Engine

Adam Kilgarriff
Lexical Computing Ltd, UK
Universities of Leeds and Sussex

Madrid April 2010 Kilgarriff: Why corpora and how 2

Corpora show us the facts of the language

Exercise

Think about the word: planet
What could you say about it if you were writing a dictionary entry?
Write down three (or more) things

The Sketch Engine: demo

http://www.sketchengine.co.uk

Dictionaries

How to decide what to say about the word?
• What the native speaker knows (introspection)
• What other dictionaries say
• Corpus

Four ages of corpus lexicography

Age 1: Pre-computer

Oxford English Dictionary:
• 20 million index cards

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography
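
A KWIC (key word in context) concordance lists every occurrence of a word with its left and right context, aligned on the keyword. The following is a minimal sketch only, assuming whitespace tokenisation and a toy text; the function name `kwic`, the window size and the sample sentence are illustrative and are not taken from the slides or from any particular tool.

```python
def kwic(text, keyword, window=4):
    """Return one aligned concordance line per occurrence of keyword."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if tok.strip(".,;:!?").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the keywords line up.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Toy text (an assumption for illustration, not corpus data).
sample = ("The planet Mars is the fourth planet from the Sun. "
          "Saving the planet is a common phrase.")
for line in kwic(sample, "planet"):
    print(line)
```

With a real corpus the same idea applies, but the number of lines grows with corpus size, which is the limitation the next slide describes.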

Age 2: limitations

As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but slow (takes a long time)
• 5000 lines: no

Age 3: Collocation statistics

Problem: too much data - how to summarise?

Solution: list of words occurring in the neighbourhood of the headword, with frequencies

Sorted by salience
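
The idea of a collocation list sorted by salience can be sketched as follows. The slide does not say which salience statistic is meant; this sketch assumes logDice, defined as 14 + log2(2·f_xy / (f_x + f_y)), as one possible choice. The function name `collocates`, the window size and the toy sentence are all illustrative assumptions.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """Rank words found up to `window` tokens right of `node` by logDice."""
    freq = Counter(tokens)
    # Collect each position in a right-hand window once, even if windows overlap.
    near = set()
    for i, tok in enumerate(tokens):
        if tok == node:
            near.update(range(i + 1, min(len(tokens), i + 1 + window)))
    pair = Counter(tokens[j] for j in near)
    rows = []
    for word, f_xy in pair.items():
        if f_xy < min_hits:
            continue
        # Assumed salience score: logDice = 14 + log2(2 * f_xy / (f_x + f_y)).
        score = 14 + math.log2(2 * f_xy / (freq[node] + freq[word]))
        rows.append((word, f_xy, round(score, 2)))
    # Highest salience first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data (an assumption for illustration, not corpus data).
tokens = ("we save money and save lives but the money is "
          "the problem and the lives matter").split()
for word, hits, score in collocates(tokens, "save", window=2):
    print(word, hits, score)
```

On a toy text like this, function words appear in the list alongside the interesting collocates, which illustrates the "junk" problem raised on the limitations slide.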

Collocation listing

For collocates of save (>5 hits), to the right of the node word

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your

Age-3 collocation statistics: limitations

• Lists contain junk
• Unsorted for type: mixes together adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation

Age 4: The word sketch

• Large well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
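
The step from a single mixed collocate list to one list per grammatical relation can be sketched as below. This is only an illustration of the grouping and sorting, assuming the parsing has already produced (relation, headword, collocate, score) tuples; the function name `word_sketch`, the relation labels and every score in the toy data are made up for the example and are not Sketch Engine output.

```python
from collections import defaultdict


def word_sketch(triples, headword):
    """Build one salience-sorted collocate list per grammatical relation."""
    sketch = defaultdict(list)
    for relation, head, collocate, score in triples:
        if head == headword:
            sketch[relation].append((collocate, score))
    # Sort each relation's list separately, highest score first.
    return {
        relation: sorted(items, key=lambda item: item[1], reverse=True)
        for relation, items in sketch.items()
    }


# Hypothetical parsed data with invented scores (assumption for illustration).
triples = [
    ("object", "save", "money", 9.1),
    ("object", "save", "life", 10.2),
    ("subject", "save", "goalkeeper", 8.4),
    ("modifier", "save", "annually", 6.0),
    ("object", "spend", "money", 9.5),
]
sketch = word_sketch(triples, "save")
for relation, items in sketch.items():
    print(relation, items)
```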

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002, 2007

Demo part 2

Fruit task

• Choose fruit
• Concordance: lemma, noun, lower case
• Frequency: node forms
• Write down: plural freq (pl), singular freq (sing)
• Compute proportion: pl/(pl+sing)
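
The last step of the task is simple arithmetic and can be checked with a few lines of Python. The counts below are hypothetical placeholders, not figures from any corpus; replace them with the plural and singular frequencies written down from the concordance.

```python
def plural_proportion(pl, sing):
    """Return pl / (pl + sing), the share of plural forms for a lemma."""
    total = pl + sing
    if total == 0:
        raise ValueError("no occurrences: proportion is undefined")
    return pl / total


# Hypothetical counts for illustration only.
pl, sing = 620, 380
print(f"plural proportion: {plural_proportion(pl, sing):.2f}")
```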

What is a corpus?

A collection of texts (as used for linguistic study)

Which texts? How many?

Which texts?

• Written
• Spoken

Written

• Books: fiction, non-fiction, textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …

Spoken

• Must be transcribed, for text corpora
• Conversation: who? Region, class, age-group, situation…
• Lectures
• TV and radio
• Film transcripts
• Meetings, seminars
• …

Which texts?

Different purposes, different text types

Making dictionaries:
• Cover the whole language
• Some of everything

How much?

• Most words are rare
• Zipf’s Law
• To get enough data for most words, we need very big corpora

Zipf’s Law

Word (pos)       r (rank)   f (frequency)   r x f
the (det)        1          6187267         6187267
to (prep)        10         917579          9175790
as (adv)         100        91583           9158300
playing (vb)     1000       9738            9738000
paint (vb)       2000       4539            9078000
amateur (adj)    10,000     741             7410000
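
Zipf's Law says that a word's frequency is roughly inversely proportional to its rank, so the product r x f should stay of broadly the same order across the table. The short sketch below recomputes that column from the rank and frequency values shown on the slide; the variable names are illustrative.

```python
# (word, rank r, frequency f) as given in the table above.
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]

# Rank times frequency for each word.
products = {word: r * f for word, r, f in rows}
for word, value in products.items():
    print(f"{word:<10} {value:>10}")
```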

Zipf’s Law

• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
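
Coverage figures like these can be computed for any tokenised text by counting how many tokens belong to the n most frequent word types. This is a minimal sketch on a toy sentence; the function name `coverage` and the sample text are assumptions for illustration, and the percentages on the slide come from a large corpus, not from this code.

```python
from collections import Counter


def coverage(tokens, n):
    """Return the fraction of all tokens accounted for by the n most frequent types."""
    if not tokens:
        raise ValueError("empty token list")
    counts = Counter(tokens)
    top = sum(freq for _, freq in counts.most_common(n))
    return top / len(tokens)


# Toy text (an assumption for illustration, not corpus data).
toy = "the cat sat on the mat and the dog sat too".split()
for n in (1, 2, 5):
    print(n, round(coverage(toy, n), 3))
```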

Zipf’s Law

[Figure: bar chart of % of all texts (axis 0 to 100) accounted for by 'the', the 100 most frequent, the 3500 most frequent and the 7500 most frequent words]

Leading English Corpora: Size

[Figure: size of corpora (in words), axis from 10^6 to 10^9, by decade (1960s, 1970s, 1980s, 1990s, 2000s); corpora shown: Brown/LOB, COBUILD, BNC, OEC]

Good news

The web

Thank you

http://www.sketchengine.co.uk