Page 1

CPE 641 Natural Language Processing

Lecture 2: Levels of Linguistic Analysis, Tokenization & Part-of-speech Tagging

Asst. Prof. Dr. Nuttanart Facundes

Page 2

Levels of Linguistic Analysis

• Word level
– Part-of-speech: DOG, EAT, RED
– Sub-word level: phonetics, phonology, morphology

• Phrase level (syntax): the red dog, cutting edge, by 3 o’clock

• Semantics: lexical semantics: DOG/ANIMAL, compositional semantics

• Discourse: I like my car and John likes his.

Page 3

Words: parts-of-speech (POS)

• Words belong to different classes (categories)
• Basic parts of speech

– NOUN: dog, man, car
– ADJECTIVE: red, fat, brave
– VERB: run, barked

• Best-known set of POS tags: Brown tags
– NN for nouns
– VB for verb base forms
– JJ for adjectives in the positive form

• Many words belong in more than one class
• Open vs. closed classes

Page 4

Nouns & Pronouns

• Nouns: cats, dogs, house, notebook

• Plurals: regular – dog/dogs, irregular – fish/fish, goose/geese

• Pronouns: I, you, he, she, it, they
– Reflexives: herself, myself

Page 5

Words that go with nouns: determiners, adjectives

• Determiners
– Articles: a tree, the tree
– Demonstratives: this tree
– Quantifiers: most trees, all trees

• Adjectives
– a red rose

Page 6

Verbs

• Used to describe
– Action – she threw the stone
– State – I have 50 baht

• Morphological forms
– Base form – walk
– 3rd person, singular, present tense – walks
– Gerund – walking

• Auxiliaries
– John has been to Boston
– You should spend more time with your family

Page 7

Other parts-of-speech

• Adverb – she often travels to Las Vegas

• Preposition – in the glass

• Particles – He turned the light off

• Conjunction – but, or, and

Page 8

Sub-word level: Morphology

• Inflections – dog/dogs, walk/walked

• Derivation – changes the category, for example turning an adjective into an adverb by adding -ly: happy → happily

• Compound – can opener

Page 9

Tokenization

• The process of segmenting a string of characters into words is known as tokenization.

• Tokenization is important in text processing because it tells our processing software what our basic units are.

• Type vs. token (contrasted in the sketch below)

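A minimal Python sketch of the type/token distinction, using whitespace splitting as the simplest possible tokenizer (the sample sentence is invented for illustration):

```python
# Whitespace tokenization and the type/token distinction.
text = "the dog saw the cat and the cat saw the dog"

tokens = text.split()    # every running word is a token
types = set(tokens)      # each distinct word form is a type

print(len(tokens))       # 11 tokens
print(len(types))        # 5 types: the, dog, saw, cat, and
```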
Page 10

Tokenization with Regular Expressions

• We also want to tokenize strings that are not ordinary words, like USA and $22.50

• This can be done using regular expression patterns, as in the sketch below

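A minimal sketch with Python's re module; the pattern and sample text are illustrative, not the lecture's exact rule. It keeps currency amounts, abbreviations, and ordinary (possibly hyphenated) words as single tokens:

```python
import re

# Illustrative pattern: currency amounts, dotted abbreviations,
# words with optional internal hyphens, and other single characters.
pattern = r"""\$\d+(?:\.\d+)?      # currency amounts: $22.50
            | [A-Z]\.(?:[A-Z]\.)+  # abbreviations: U.S.A.
            | \w+(?:-\w+)*         # words, optionally hyphenated
            | [^\w\s]              # any other single non-space character
            """

text = "The U.S.A. price is $22.50 for a state-of-the-art can opener."
tokens = re.findall(pattern, text, re.VERBOSE)
print(tokens)
```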
Page 11

Lemmatization and Normalization

• I saw the saw

• So far, we have considered the two tokens of ‘saw’ to be of the same type.

• But actually, one is a verb (past tense of ‘see’) and the other one is a noun.

• As a first approximation to discovering the distribution of a word, we can look at all the bigrams it occurs in.

Page 12

• A bigram is simply a pair of words. For example, in the sentence

She sells sea shells by the sea shore,

the bigrams are She sells, sells sea, sea shells, shells by, by the, the sea, sea shore.

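A minimal sketch of extracting those bigrams by pairing each token with its successor (plain Python; nltk.bigrams would produce the same pairs):

```python
tokens = "She sells sea shells by the sea shore".split()

# Pair each token with the token that follows it.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('She', 'sells'), ('sells', 'sea'), ('sea', 'shells'), ('shells', 'by'),
#  ('by', 'the'), ('the', 'sea'), ('sea', 'shore')]
```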
Page 13

Different forms of the same word

• Two forms such as appear and appeared belong to a more abstract notion of a word called a lexeme; by contrast, appeared and called belong to different lexemes.

• You can think of a lexeme as corresponding to an entry in a dictionary, and a lemma as the headword for that entry. By convention, small capitals are used when referring to a lexeme or lemma: APPEAR.

Page 14

Suffix – Inflected form

• Although appeared and called belong to different lexemes, they do have something in common: they are both past tense forms.

• This is signaled by the segment -ed, which we call a morphological suffix. We also say that such morphologically complex forms are inflected.

• If we strip off the suffix, we get something called the stem, namely appear and call respectively.

• While appeared, appears and appearing are all morphologically inflected, appear lacks any morphological inflection and is therefore termed the base form. In English, the base form is conventionally used as the lemma for a word.

Page 15

Lemmatization

• Lemmatization — the process of mapping words to their lemmas.

• Lemmatization is a rather sophisticated process that uses rules for the regular word patterns, and table look-up for the irregular patterns. Within NLTK, we can use off-the-shelf stemmers, such as the Porter Stemmer, the Lancaster Stemmer, and the stemmer that comes with WordNet.

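A minimal NLTK sketch contrasting the Porter stemmer with the WordNet lemmatizer; it assumes the WordNet data has already been fetched with nltk.download('wordnet'):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
wnl = WordNetLemmatizer()

words = ["appeared", "appears", "appearing", "geese", "walking"]

# The stemmer applies suffix-stripping rules; the lemmatizer uses
# WordNet's tables, so it handles irregular forms such as geese -> goose.
print([porter.stem(w) for w in words])
print([wnl.lemmatize(w, pos="n") for w in words])
print([wnl.lemmatize(w, pos="v") for w in words])
```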
Page 16

• Lemmatization (or stemming) is a special case of normalization: it identifies a canonical representative for a set of related word forms. Normalization collapses distinctions.

• Exactly how we normalize words depends on the application.

• Often, we convert everything into lower case so that we can ignore the written distinction between sentence-initial words and the rest of the words in the sentence.

Page 17

Counting Words

• Counting words can bring us to several interesting applications

• Frequency distributions
• Stylistics
• Lexical dispersion
• Comparing word length in different languages
• Generating random text
• Collocations

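A minimal sketch of a frequency distribution over tokens, using nltk.FreqDist (collections.Counter would behave much the same way); the sample text is invented:

```python
from nltk import FreqDist

tokens = "she sells sea shells by the sea shore and she sells sea stars".split()

fdist = FreqDist(tokens)        # counts occurrences of each word type
print(fdist.most_common(3))     # e.g. [('sea', 3), ('she', 2), ('sells', 2)]
print(fdist['shells'])          # 1
```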
Page 18

Tokenization so far

• We saw that we can do a variety of interesting language processing tasks that focus solely on words. Tokenization turns out to be far more difficult than expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain. We also looked at normalization (including lemmatization) and saw how it collapses distinctions between tokens.

Page 19

Categorizing and Tagging Words

• Distributional similarity: words with the same part of speech (POS) tend to have similar distributions.

• One of the notable features of the Brown corpus is that all the words have been tagged for their part-of-speech.

• The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. The collection of tags used for a particular task is known as a tag set.

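A minimal NLTK sketch of tagging a tokenized sentence. It assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have been run; nltk.pos_tag uses the Penn Treebank tag set by default rather than the Brown tags mentioned above:

```python
import nltk

tokens = nltk.word_tokenize("The wind blew and they refused to wind the clock")
tagged = nltk.pos_tag(tokens)
print(tagged)
# The first 'wind' should come back as a noun (NN) and the second as a verb (VB),
# because the tagger takes the surrounding context into account.
```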
Page 20

Automatic Tagging

• Automatic tagging has several applications.

• We get a clear understanding of the distribution of ‘often’ by looking at the tags of adjacent words.

• Automatic tagging also helps predict the behavior of previously unseen words e.g. blogging

Page 21

POS - applications

• Parts of speech are also used in speech synthesis and recognition.

• For example, wind/NN, as in the wind blew, is pronounced with a short vowel,

• whereas wind/VB, as in to wind the clock, is pronounced with a long vowel.

• Other examples can be found where the stress pattern differs depending on whether the word is a noun or a verb, e.g. contest, insult, present, protest, rebel, suspect. Without knowing the part of speech we cannot be sure of pronouncing the word correctly.

Page 22

English Morphology

• Nouns, verbs, adjectives, adverbs

• Open-class, closed-class

• Regular forms, irregular forms

Page 23

Unigram Tagging

• Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

• For example, it will assign the tag JJ (adjective) to any occurrence of the word frequent, since frequent is used as an adjective (e.g. a frequent visit) more often than it is used as a verb.

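A minimal NLTK sketch of training a unigram tagger on the tagged Brown corpus; it assumes nltk.download('brown') has been run, and the news category is just an illustrative choice:

```python
import nltk
from nltk.corpus import brown

# Each word is assigned the tag it received most often in the training data.
train_sents = brown.tagged_sents(categories='news')
unigram_tagger = nltk.UnigramTagger(train_sents)

print(unigram_tagger.tag("a frequent visit".split()))
print(unigram_tagger.evaluate(train_sents))  # accuracy on the training data itself (optimistic)
```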
Page 24

N-gram Taggers

• Using a unigram tagger means we would tag a word such as wind with the same tag, regardless of whether it appears in the context the wind or to wind.

• An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

Page 25

N-gram Tagging

[Figure not reproduced: the tag to be chosen, tn, is circled, and the context is shaded in grey.]

In the example shown in the figure, n = 3; that is, we consider the tags of the two preceding words in addition to the current word. An n-gram tagger picks the tag that is most likely in the given context.

Page 26

• A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.

• As with the other taggers, n-gram taggers assign the tag None to any token whose context was not seen during training.

• As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. Thus, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval).

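A minimal NLTK sketch of a bigram tagger that backs off to a unigram tagger, and then to a default NN tagger, whenever a context was not seen in training; the corpus category and train/test split are illustrative:

```python
import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
train, test = tagged_sents[:4000], tagged_sents[4000:]

# Backoff chain: bigram contexts never seen in training fall back to
# progressively simpler taggers instead of producing None.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train, backoff=t0)
t2 = nltk.BigramTagger(train, backoff=t1)

print(t2.evaluate(test))  # accuracy on held-out sentences
```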
Page 27

Conclusion

• Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts of speech. Parts of speech are assigned short labels, or tags, such as NN and VB.
• The process of automatically assigning parts of speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
• Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
• A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger and n-gram taggers. These can be combined using a technique known as backoff.
• Taggers can be trained and evaluated using tagged corpora.
• Part-of-speech tagging is an important, early example of a sequence classification task in NLP.