NLTK & basic text stats Day 19 - 10/08/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University
Dec 28, 2015
Course organization
08-Oct-2014 NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Review of scripts & functions
The quiz as a function in a script
Open Spyder
NLTK
Could you download the archive?
Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Searching text
Show every token of a word in context, called concordance view:
>>> text1.concordance('monstrous')
Show the words that appear in a similar range of contexts:
>>> text1.similar('monstrous')
Show the contexts that two words share:
>>> text1.common_contexts(['whale','man'])
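To make the idea of a concordance view concrete, here is a minimal sketch in plain Python. It runs on a made-up token list rather than text1, and the function name simple_concordance and its width parameter are hypothetical. It is not how NLTK implements concordance(), only an illustration of the kind of output it shows.

```python
def simple_concordance(tokens, word, width=2):
    """Return each occurrence of `word` with up to `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(' '.join(left + [token] + right))
    return lines

# Toy token list (an assumption for illustration, not taken from text1)
toy_tokens = ['a', 'monstrous', 'whale', 'and', 'a', 'monstrous', 'wave']

for line in simple_concordance(toy_tokens, 'monstrous'):
    print(line)
```

Each printed line is one token of the word together with its neighbors, which is what lets you compare the contexts a word occurs in.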
Searching text, cont.
Plot how far each token of a word is from the beginning of a text:
>>> text1.dispersion_plot(['monstrous'])
Generate random text:
>>> text1.generate()
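A dispersion plot marks where each token of a word falls, counted from the start of the text. The sketch below only computes those positions for a toy token list; the helper name word_offsets is hypothetical and no plotting is done, so it should be read as an illustration of the underlying data, not of NLTK's own code.

```python
def word_offsets(tokens, word):
    """Return the index of every token equal to `word`, counted from the start."""
    return [i for i, token in enumerate(tokens) if token == word]

# Toy token list (an assumption for illustration, not taken from text1)
toy_tokens = ['a', 'monstrous', 'whale', 'and', 'a', 'monstrous', 'wave']

offsets = word_offsets(toy_tokens, 'monstrous')
print(offsets)
```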
Counting vocabulary
Count the word and punctuation tokens in a text:
>>> len(text1)
List the unique words, i.e. the word types, in a text:
>>> set(text1)
Count how many types there are in a text:
>>> len(set(text1))
Count the tokens of a word type:
>>> text1.count('smote')
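The same four operations can be tried on a small hand-made list of tokens, which makes the difference between tokens and types easy to check by eye. The token list below is an assumption for illustration; an ordinary Python list supports len(), set() and count() in the same way as the calls above.

```python
# Toy token list (made up for illustration, not taken from text1)
toy_tokens = ['the', 'whale', 'smote', 'the', 'ship', ',',
              'and', 'the', 'ship', 'sank', '.']

num_tokens = len(toy_tokens)              # word and punctuation tokens
types = set(toy_tokens)                   # unique tokens, i.e. the types
num_types = len(types)                    # how many types there are
smote_count = toy_tokens.count('smote')   # tokens of one type

print(num_tokens, num_types, smote_count)
print(sorted(types))
```

Here 'the' and 'ship' are repeated, so there are fewer types than tokens.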
Lexical richness or diversity
The lexical richness or diversity of a text can be estimated as tokens per type:
>>> len(text1) / len(set(text1))
The relative frequency of a type can be estimated as its tokens per all tokens, here as a percentage. In Python 2, '/' does integer division on integers, so import true division first:
>>> from __future__ import division
>>> 100 * text1.count('a') / len(text1)
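Both calculations can be wrapped in small functions so they are easy to reuse on any text. This is a sketch under stated assumptions: the names lexical_diversity and percentage are hypothetical helpers, the token list is made up, and the code assumes Python 3, where '/' is already true division (in Python 2, add the __future__ import as the first line of the script).

```python
def lexical_diversity(tokens):
    """Tokens per type, following the definition used above."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100 * count / total

# Toy token list (made up for illustration): 8 tokens, 4 types
toy_tokens = ['the', 'whale', 'and', 'the', 'ship', 'and', 'whale', 'ship']

diversity = lexical_diversity(toy_tokens)
the_pct = percentage(toy_tokens.count('the'), len(toy_tokens))
print(diversity, the_pct)
```

With text1 loaded, the same functions could be called as lexical_diversity(text1) and percentage(text1.count('a'), len(text1)).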
Next time
There is no quiz for Monday.
We will learn how to get our own text into Python & NLTK.