NLTK & BASIC TEXT STATS DAY 19 - 10/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

NLTK & basic text stats

Day 19 - 10/08/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

08-Oct-2014 NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Chapter numbering

3.7. How to deal with non-English characters

4.5. How to create a pattern with Unicode characters

6. Control

The quiz as a function in a script

Review of scripts & functions


Open Spyder



Could you download the archive?

NLTK





Loading the book's texts

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***

Loading text1, ..., text9 and sent1, ..., sent9

Type the name of the text or sentence to view it.

Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

>>>



Searching text

Show every token of a word in context, called concordance view:

>>> text1.concordance('monstrous')

Show the words that appear in a similar range of contexts:

>>> text1.similar('monstrous')

Show the contexts that two words share:

>>> text1.common_contexts(['whale','man'])
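
To make the idea of a concordance view concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `concordance`, its `width` parameter, and the toy token list are all assumptions made for illustration, and no corpus download is needed to run it.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(' '.join(left + [token] + right))
    return lines

# Toy token list standing in for text1 (an assumption, not the real corpus).
tokens = 'the monstrous whale and the monstrous size of the sea'.split()
for line in concordance(tokens, 'monstrous', width=2):
    print(line)
```

Each printed line is one token of the word with its neighbors, which is the information text1.concordance() lays out in aligned columns.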



Searching text, cont.

Plot how far each token of a word is from the beginning of a text:

>>> text1.dispersion_plot(['monstrous'])

Generate random text:

>>> text1.generate()
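
A dispersion plot is drawn from the positions of a word's tokens in the text. The sketch below computes those positions in plain Python; the helper name `offsets` and the toy token list are assumptions for illustration, not part of NLTK.

```python
def offsets(tokens, word):
    """Return the index (distance from the start) of every token equal to `word`."""
    return [i for i, token in enumerate(tokens) if token == word]

# Toy token list standing in for text1 (an assumption, not the real corpus).
tokens = 'call me ishmael . the whale , the white whale !'.split()
print(offsets(tokens, 'whale'))
```

Plotting one mark per offset along a horizontal axis gives the kind of picture that text1.dispersion_plot() produces.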



Counting vocabulary

Count the word and punctuation tokens in a text:

>>> len(text1)

List the unique words, i.e. the word types, in a text:

>>> set(text1)

Count how many types there are in a text:

>>> len(set(text1))

Count the tokens of a word type:

>>> text1.count('smote')
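
The same four operations can be tried on a small list of tokens, which makes the difference between tokens and types easy to check by hand. The token list and variable names below are made up for illustration; a list supports len(), set() and count() just as the slides use them on text1.

```python
# A toy "text": a list of word and punctuation tokens (an assumption, not a real corpus).
tokens = ['the', 'whale', 'smote', 'the', 'ship', ',', 'and', 'the', 'ship', 'sank', '.']

n_tokens = len(tokens)        # every token counts, repeats included
types = set(tokens)           # unique tokens, i.e. the word types
n_types = len(types)          # how many distinct types there are
n_the = tokens.count('the')   # tokens of one particular type

print(sorted(types))
print(n_tokens, n_types, n_the)
```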



Lexical richness or diversity

The lexical richness or diversity of a text can be estimated as the average number of tokens per type:

>>> len(text1) / len(set(text1))

The relative frequency of a word type, as a percentage, is its token count divided by the total number of tokens. In Python 2, '/' performs integer division on two integers, so import true division first:

>>> from __future__ import division

>>> 100 * text1.count('a') / len(text1)
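
Both measures can be wrapped in small functions. This is a sketch under assumptions: the names `lexical_diversity` and `percentage` and the toy token list are chosen here for illustration, and float arithmetic is used instead of the __future__ import so that the result is the same in Python 2 and Python 3.

```python
def lexical_diversity(tokens):
    """Average number of tokens per type."""
    return float(len(tokens)) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100.0 * count / total

# Toy token list (an assumption, not a real corpus): 11 tokens, 8 types.
tokens = ['the', 'whale', 'smote', 'the', 'ship', ',', 'and', 'the', 'ship', 'sank', '.']

print(lexical_diversity(tokens))
print(percentage(tokens.count('the'), len(tokens)))
```

With a text loaded from nltk.book, the same functions could be called as lexical_diversity(text1) and percentage(text1.count('a'), len(text1)).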

Next time

There is no quiz for Monday.

We will learn how to get our own text into Python & NLTK.

