1 Corpus-Based Work Chapter 4 Foundations of statistical natural language processing

Dec 30, 2015


Shawn Austin

Corpus-Based Work

Chapter 4

Foundations of statistical natural language processing


Introduction

• Requirements of NLP work
  – Computers
  – Corpora
  – Application/Software

• This section covers some issues concerning the formats and problems encountered in dealing with raw data

• Low-level processing before actual work
  – Word/Sentence extraction


Getting Set Up

• Computers
  – Memory requirements for large corpora
  – Statistical NLP methods involve counts that must be accessed quickly

• Corpora
  – “A corpus is a special collection of textual material collected according to a certain set of criteria”
  – Licensing
  – Most freely available sources are not linguistically marked up
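
The point about fast access to counts can be illustrated with a hash-based counter. This is only a minimal sketch: the function name `bigram_counts` and the toy token list are assumptions made for illustration, and a real corpus would need far more memory than this example suggests.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs; lookups in the resulting Counter are hash-based."""
    return Counter(zip(tokens, tokens[1:]))

# Toy input (illustrative only).
tokens = "the cat sat on the cat".split()
counts = bigram_counts(tokens)
seen = counts[("the", "cat")]      # pair that occurs in the toy input
unseen = counts[("cat", "the")]    # a Counter returns 0 for unseen pairs
```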


• Corpora
  – Representative sample
    • What we find for the sample also holds for the general population
  – Balanced corpus
    • Each subtype of text is represented according to a predetermined criterion of importance

• Importance in statistical NLP
  – Use a representative corpus
  – Reported results should state the type/domain of the corpus


• Software
  – Text editors
    • TextPad, Emacs, BBEdit
  – Regular expressions
    • Patterns describe a regular language
  – Programming languages
    • C/C++ widely used (efficient)
    • Perl for text preparation and formatting
    • Built-in database and easy handling of complicated structures make Prolog important
    • Java, as a pure object-oriented language, gives automatic memory management


Looking at Text

• Text comes either in raw format or marked up
  – ‘Markup’ means putting codes into the data file that give some information about the text

• Issues in automatic processing
  – Junk formatting/content (corpus cleaning)
  – Case sensitivity (e.g. converting everything to capitals)
    1. Proper nouns?
    2. Stress through capitalization
    • Loss of contextual information
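
As a rough sketch of the case-sensitivity trade-off, the snippet below folds every token to lower case and shows a proper noun and a common word collapsing into one entry. The function name `fold_case` and the sample tokens are illustrative assumptions.

```python
from collections import Counter

def fold_case(tokens):
    """Naive case folding: lower-case every token."""
    return [t.lower() for t in tokens]

# "Brown" (a name) and "brown" (a colour) are distinct before folding.
tokens = ["Brown", "wrote", "about", "a", "brown", "bear"]
raw_counts = Counter(tokens)
folded_counts = Counter(fold_case(tokens))
```

After folding, the counts can no longer tell the name from the colour, which is the loss of contextual information noted above.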


• Tokenization
  – Text is divided into units called ‘tokens’
  – Treatment of punctuation marks?

• What is a word?
  – Graphic word (Kucera and Francis 1967)
    • A string of contiguous alphanumeric characters with white space on either side
    • This is not a practical definition, even for text in the Latin alphabet
    • News corpora especially can contain odd entries, e.g. Micro$oft, C|net
    • Apart from these oddities there are some other issues
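
A minimal sketch of the graphic-word definition, applied literally: split on white space and accept only chunks made up entirely of alphanumeric characters. The function name `split_graphic_words` and the sample sentence are assumptions for illustration.

```python
import re

# Strictly alphanumeric, as in the definition quoted above.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+")

def split_graphic_words(text):
    """Return (accepted, rejected) white-space-delimited chunks."""
    chunks = text.split()
    words = [c for c in chunks if GRAPHIC_WORD.fullmatch(c)]
    rejects = [c for c in chunks if not GRAPHIC_WORD.fullmatch(c)]
    return words, rejects

words, rejects = split_graphic_words("Micro$oft bought C|net shares, analysts said.")
```

Under this literal reading, the odd entries are rejected, and so are ordinary words that happen to carry a trailing comma or period.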


• Periods
  – Words are not always bounded by white space (commas, semicolons and periods can follow them)
  – Periods occur at the end of sentences and also at the end of abbreviations
  – In abbreviations they should stay attached to the word (Wash. vs. wash)
  – When an abbreviation occurs at the end of a sentence, only one period is present, performing both functions
    • Within morphology, this phenomenon is referred to as ‘haplology’


• Single apostrophes
  – Difficulties in dealing with constructions such as I’ll or isn’t
  – The count of graphic words is 1 according to the basic definition, but they should be counted as 2 words
    1. S → NP VP (the subject and the verb should be separate tokens)
    2. If we split, some funny words may occur in the collection
  – End-of-quotation marks
  – Possessive forms of words ending with ‘s’ or ‘z’
    • Charles’ Law, Muaz’ book
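
One possible way to split contractions is sketched below. The short clitic list and the name `split_clitic` are assumptions made for illustration, and splitting off n't in this way is just one convention among several.

```python
import re

# Clitic endings to split off; this list is an illustrative assumption.
CLITIC = re.compile(r"^(.+?)(n't|'ll|'re|'ve|'s|'d|'m)$", re.IGNORECASE)

def split_clitic(token):
    """Split a contraction into [stem, clitic]; return [token] if none matches."""
    token = token.replace("’", "'")  # treat curly and straight apostrophes alike
    m = CLITIC.match(token)
    return [m.group(1), m.group(2)] if m else [token]

examples = {t: split_clitic(t) for t in ["I'll", "isn't", "can't", "book"]}
```

Splitting "can't" this way yields the stem "ca", an instance of the funny words that splitting can introduce.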


• Hyphenation
  – Does a sequence of letters with a hyphen in between count as one word or two?
  – Line-ending hyphens
    • Remove the hyphen at the end of the line and join both parts together
    • But what if some other type of hyphen falls at the end of the line (haplology), as in text-based?
  – Electronic text mostly has no line-breaking hyphens, but there are some other issues…
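
The line-ending rule can be sketched in a few lines. This naive version, with the hypothetical name `join_line_end_hyphens`, assumes every line-final hyphen is a line-breaking hyphen, which is exactly the assumption that fails for words like text-based.

```python
import re

def join_line_end_hyphens(text):
    """Remove a hyphen at the end of a line and join the two word parts."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

good = join_line_end_hyphens("an implemen-\ntation detail")
bad = join_line_end_hyphens("a text-\nbased medium")  # lexical hyphen is lost
```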


• Some things with hyphens are clearly treated as one word
  – E-mail, A-1-Plus and co-operate

• Other cases are arguable
  – Non-lawyer, pro-Arabs and so-called
  – The hyphens here are called lexical hyphens
  – Inserted before or after small word formatives, in some cases to split a vowel sequence

• A third class of hyphens is inserted to indicate correct grouping
  – A text-based medium
  – A final take-it-or-leave-it offer


• Inconsistencies in hyphenation
  – cooperate vs. co-operate
  – So we can have multiple forms treated as either one word or two

• Lexemes
  – Single dictionary entry with a single meaning

• Homographs
  – Two lexemes have overlapping forms
    • e.g. saw (the tool) vs. saw (past tense of see)


• Word segmentation in other languages

• Opposite issue
  – White space but no word boundary
  – “the New York-New Haven railroad”
  – “I couldn’t work the answer out”
    • In spite of, in order to, because of

• Variant coding of information of a certain semantic type
  – Phone numbers, e.g. 42-111-128-128
    • Problem in information extraction
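
As a small illustration of variant coding, the sketch below maps several written forms of one phone number to a single digit string. The alternative formats and the name `normalize_phone` are assumptions; real data would need locale-specific rules.

```python
import re

def normalize_phone(s):
    """Keep only the digits so that differently formatted numbers compare equal."""
    return re.sub(r"\D", "", s)

# Hypothetical variants of the same number.
variants = ["42-111-128-128", "42 111 128 128", "(42) 111.128.128"]
normalized = {normalize_phone(v) for v in variants}
```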


• Speech corpora issues
  – More contractions
  – Various phonetic representations
  – Pronunciation variants
  – Sentence fragments
  – Filler words

• Morphology
  – Keep the various forms separate or collapse them? e.g. sit, sits, sat
  – Grouping them together and working with lexemes (initially looks easier)


• Stemming
  – Strips off affixes

• Lemmatization
  – Extracts the lemma or lexeme from an inflected form

• Empirical research within IR shows that stemming does not help performance
  1. Information loss (operating → operate)
  2. It is often more useful to keep closely related tokens grouped in chunks
  3. Not good for morphologically rich languages
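
A deliberately crude suffix stripper shows both what stemming does and what it loses. This is not any published stemming algorithm; the suffix list, the minimum stem length and the name `naive_stem` are assumptions chosen for illustration.

```python
# Suffixes tried in order; an illustrative list, not a real stemmer's rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

stems = {w: naive_stem(w) for w in ["operating", "operates", "sits", "sat"]}
```

The forms "operating" and "operates" collapse to the same stem, so the distinction between them is lost, while the irregular form "sat" is not related to "sit" at all, which is where lemmatization would be needed.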


• Sentences
  – What is a sentence?
  – In English, something ending with ‘.’, ‘?’ or ‘!’
  – Abbreviation issues

• Other issues
  – “You reminded me,” she remarked, “of your mother.”
  – Nested things are classified as ‘clauses’
  – Quotation marks after punctuation
    • ‘.’ is not the sentence boundary in this case


• Sentence boundary (SB) detection
  – Place a tentative SB after all occurrences of . ? !
  – Move the boundary after a following quotation mark (if any)
  – Disqualify a period boundary if it is
    • Preceded by an abbreviation that does not normally end a sentence and is usually followed by a capitalized name, e.g. Prof., Dr.
    • Preceded by an abbreviation such as etc. or Jr. and not followed by a capitalized word
  – Disqualify a boundary with ? or !
    • If followed by a lower-case letter
  – Regard all others as correct SBs
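
The heuristic above can be sketched roughly as follows. The two abbreviation sets are tiny illustrative assumptions, the name `split_sentences` is hypothetical, and cases such as decimal numbers or multi-period abbreviations are not handled.

```python
import re

# Tiny illustrative abbreviation lists (assumptions).
TITLE_ABBREVS = {"prof", "dr", "mr", "mrs"}  # rarely sentence-final
OTHER_ABBREVS = {"etc", "jr"}                # may or may not end a sentence

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]", text):    # tentative boundary after . ? !
        end = m.end()
        # Move the boundary past any closing quotation marks.
        while end < len(text) and text[end] in "\"'”’":
            end += 1
        rest = text[end:].lstrip()
        if m.group() == ".":
            words = text[start:m.start()].split()
            prev = words[-1].lower() if words else ""
            if prev in TITLE_ABBREVS:
                continue                     # e.g. "Dr." before a name
            if prev in OTHER_ABBREVS and not rest[:1].isupper():
                continue                     # e.g. "etc." mid-sentence
        elif rest[:1].islower():
            continue                         # "?" or "!" followed by lower case
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():                 # keep any trailing material
        sentences.append(text[start:].strip())
    return sentences

sample = 'Dr. Smith arrived. "Is it late?" she asked. They sell pens, paper, etc. in bulk.'
result = split_sentences(sample)
```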


• Riley (1989) used classification trees for SB detection
  – Features of the trees included the case and length of the words preceding and following a period, and the probabilities of words occurring before and after a sentence boundary
  – It required a large quantity of labeled data

• Palmer and Hearst used the POS of such words and implemented the approach with neural networks (98-99% accurate)

• In other languages?


• Marked-up data
  – Some sort of code is used to provide information (mostly SGML, XML)
  – It can be done automatically, manually or by a mixture of both (semi-automatic)
  – Some texts mark up just sentence and paragraph boundaries
  – Others mark up more than this basic information
    • e.g. Penn Treebank (full syntactic structure)
  – A common markup is POS tagging
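
To make the idea of markup concrete, here is a minimal sketch that reads sentence boundaries and POS tags from a small XML fragment. The element names `s` and `w`, the `pos` attribute and the sample text are invented for illustration and do not follow any particular corpus format.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <s> marks a sentence, <w pos="..."> a tagged word.
SAMPLE = """<text>
  <s><w pos="DT">The</w> <w pos="NN">dog</w> <w pos="VBZ">barks</w></s>
  <s><w pos="PRP">It</w> <w pos="VBZ">sleeps</w></s>
</text>"""

def read_tagged_sentences(xml_string):
    """Return one list of (word, tag) pairs per <s> element."""
    root = ET.fromstring(xml_string)
    return [[(w.text, w.get("pos")) for w in s.iter("w")] for s in root.iter("s")]

tagged = read_tagged_sentences(SAMPLE)
```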


• Grammatical tagging
  – Generally done with conventional POS tags such as noun, verb, etc.
  – Also some information about the nature of the words, such as plurality of nouns or superlative forms of adjectives

• Tag set
  – The most influential tag sets have been those used to tag the American Brown Corpus and the Lancaster-Oslo-Bergen corpus


• Size of tag sets
  – Brown: 87 (179 total tags)
  – Penn: 45
  – CLAWS1: 132

• The Penn tag set is widely used in computational work

• Tags differ between tag sets
  – Larger tag sets make finer-grained distinctions
  – The level of detail depends on the domain of the corpus


• The design of a tag set
  – Grammatical class of the word
  – Features that tell the behavior of the word

• Part of speech
  – Semantic grounds
  – Syntactic distributional grounds
  – Morphological grounds

• Splitting tags into further categories gives more information but makes classification harder

• There is no simple relationship between tag set size and tagger performance