The Natural Language Toolkit (NLTK)
Markus Dickinson
Dept. of Linguistics, Indiana University
Catapult Workshop Series; March 8, 2013
Outline: Python basics, NLTK, Texts, Lists, Distributions, Control structures, Nested Blocks, New data, POS Tagging, Basic tagging, Tagged corpora, Automatic tagging
NLTK
Natural Language Toolkit (NLTK) is:
Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
http://www.nltk.org/
Today, we'll look at:
- Some basic functionality for working with text files
  - http://nltk.org/book/ch01.html
  - http://nltk.org/book/ch03.html
- One example of an NLP process, POS tagging
  - http://nltk.org/book/ch05.html
To start, type python in a terminal or command prompt
- Better yet might be to use the Integrated DeveLopment Environment (IDLE)
> python
Python 2.7.2 (default, Jun 20 2012, 16:23:33)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Numbers & Strings
Some uses of numbers:
>>> 2+2
4
>>> 3/2.
1.5
Some uses of strings:
- single quotes: 'string'
- double quotes: "string"
- There are string characters with special meaning: e.g., \n (newline) and \t (tab)
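A short sketch of these string basics. It is written for Python 3 (the slides use Python 2.7), and the variable names are illustrative assumptions, not part of the original slides:

```python
# Single and double quotes create the same kind of string.
single = 'string'
double = "string"
same = (single == double)

# \n and \t are each a single character: a newline and a tab.
line = "word\tcount\nrose\t3"

# Split into rows at the newline, then into columns at the tab.
rows = [row.split("\t") for row in line.split("\n")]

print(same)
print(line)
print(rows)
```

Printing line shows two rows with tab-separated columns, which is how many plain-text word lists are laid out.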
String indices & slices
You can use slices to get a part of a string
>>> s = "happy"
>>> len(s) # use the len function
5
>>> s[3] # indexed from 0, so 4th character
'p'
>>> s[1:3] # characters 1 and 2
'ap'
>>> s[:3] # first 3 characters
'hap'
>>> s[3:] # everything except first 3 characters
'py'
>>> s[-4] # 4th character from the back
'a'
Variables
Definition
A variable is a name that refers to some value (could be a number, a string, a list etc.)
1. Store the value 42 in a variable named foo
   foo = 42
2. Store the value of foo+10 in a variable named bar
   bar = foo + 10
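The two steps can be tried directly. This minimal sketch (Python 3 syntax for print) adds one extra step that is not on the slide, rebinding foo, to show that bar keeps the value computed earlier:

```python
foo = 42          # store the value 42 under the name foo
bar = foo + 10    # evaluate foo + 10 now and store the result under the name bar

foo = 0           # rebinding foo later does not change bar
print(foo, bar)
```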
Installing NLTK
Installing NLTK is pretty straightforward:
- http://nltk.org/install.html
- I recommend installing Numpy, but that can sometimes
- In case the program needs to do something when the test is false, use the else: statement
- E.g. if a user is not known, add him/her to the list
Example
    known_users = ['Sandra', 'Markus']
    name = raw_input('type your name: ')
    if name in known_users:
        print 'Hello ' + name + '.'
        print 'It is nice to have you back.'
    else:
        known_users.append(name)
        print 'You have been added to the list.'
Elif
- If you want to check the next condition in the else case, there is a shortcut for else if called elif
Example
    known_users = ['Sandra', 'Markus']
    name = raw_input('type your name: ')
    if name in known_users:
        print 'Hello ' + name + '.'
        print 'It is nice to have you back.'
    elif len(name) > 20:
        print 'Your name is too long!'
    else:
        known_users.append(name)
        print 'You have been added to the list.'
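The slide examples use Python 2 (print statement, raw_input). Below is a sketch of the same if / elif / else logic for Python 3. The function name greet and the test names are assumptions made for illustration; wrapping the logic in a function simply lets it run without keyboard input:

```python
def greet(name, known_users):
    """Return the message for one name, updating known_users in place."""
    if name in known_users:
        return 'Hello ' + name + '. It is nice to have you back.'
    elif len(name) > 20:
        # Only checked when the first test was false.
        return 'Your name is too long!'
    else:
        # Reached only when both tests above were false.
        known_users.append(name)
        return 'You have been added to the list.'

known_users = ['Sandra', 'Markus']
messages = [greet(n, known_users) for n in ['Markus', 'x' * 21, 'Kim']]
for m in messages:
    print(m)
print(known_users)
```

Only one branch runs per call: a known name never reaches the length check, and an overlong name is never appended.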
Tests
x == y        x equals y
x < y         x is less than y
x > y         x is greater than y
x >= y        x is greater than or equal to y
x <= y        x is less than or equal to y
x != y        x is not equal to y
x is y        x is the same object as y
x is not y    x is not the same object as y
x in y        x is a member of y
x not in y    x is not a member of y
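The difference between == (equal values) and is (same object) is the easiest one to mix up. A small sketch in Python 3, with list names chosen only for illustration:

```python
a = ['rose']
b = ['rose']   # equal contents, but a separate list object
c = a          # a second name for the same object as a

results = {
    'a == b': a == b,
    'a is b': a is b,
    'a is c': a is c,
    "'rose' in a": 'rose' in a,
    "'lily' not in a": 'lily' not in a,
    '3 <= 3': 3 <= 3,
    '3 != 4': 3 != 4,
}
for test, value in results.items():
    print(test, value)
```

Here a == b is True while a is b is False, because the two lists hold the same words but are stored separately.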
Word comparison tests
s.startswith(t)    test if s starts with t
s.endswith(t)      test if s ends with t
t in s             test if t is contained inside s
s.islower()        test if all cased characters in s are lowercase
s.isupper()        test if all cased characters in s are uppercase
s.isalpha()        test if all characters in s are alphabetic
s.isalnum()        test if all characters in s are alphanumeric
s.isdigit()        test if all characters in s are digits
s.istitle()        test if s is titlecased (all words in s have initial capitals)
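These tests are typically used to filter a list of tokens. A sketch in Python 3, using a small made-up token list (an assumption; the slides work with full NLTK texts):

```python
words = ['Moby', 'Dick', 'by', 'HERMAN', 'Melville', '1851', 'whale2', '.']

# Keep only the tokens that pass each test.
capitalized = [w for w in words if w.istitle()]
all_upper = [w for w in words if w.isupper()]
numbers = [w for w in words if w.isdigit()]
alphabetic = [w for w in words if w.isalpha()]
alphanumeric = [w for w in words if w.isalnum()]
ends_in_le = [w for w in words if w.endswith('le')]

print(capitalized)
print(all_upper)
print(numbers)
print(alphabetic)
print(alphanumeric)
print(ends_in_le)
```

Note that 'HERMAN' passes isupper() but not istitle(), and '1851' passes isdigit() and isalnum() but not isalpha().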
For Loops
Iteration
for loops allow us to iterate over each element of a set or sequence
Syntax:
    for <var> in <set>:
        do ...
        do ...
Example
    words = ['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
    for w in words:
        print w
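A loop body can do more than print. As an added illustration (not from the slides), this Python 3 sketch uses the same word list to tally how often each word occurs; the dictionary name counts is an assumption:

```python
words = ['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose']

counts = {}
for w in words:
    # Start at 0 for an unseen word, then add one for this occurrence.
    counts[w] = counts.get(w, 0) + 1

# Print the tallies in alphabetical order of the words.
for w in sorted(counts):
    print(w, counts[w])
```

In Python 3, print is a function, so the loop on the slide would read print(w).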
List comprehensions
Python has a cool shorthand called list comprehensions for creating new lists from old ones:
- a = [1,2,3,4,5]
  b = [x**2 for x in a]
  b is set to [1, 4, 9, 16, 25]
So: [len(w) for w in text1] gives a list of word lengths
- What does this do?
  sorted([w for w in set(text1)
          if w.endswith('ableness')])
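To try the question without loading a corpus, here is a sketch in Python 3 with a small made-up list standing in for text1 (the list sample is an assumption; text1 on the slides is Moby Dick loaded through NLTK):

```python
# Hypothetical stand-in for text1.
sample = ['reasonableness', 'whale', 'comfortableness', 'sea',
          'reasonableness', 'honourableness']

# One length per token, in the original order.
lengths = [len(w) for w in sample]

# set() removes duplicates, the if clause filters, sorted() orders the result.
ableness = sorted([w for w in set(sample) if w.endswith('ableness')])

print(lengths)
print(ableness)
```

So the expression returns an alphabetical list of the distinct words ending in "ableness"; the repeated 'reasonableness' appears only once.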
Functions
Returning to NLTK functions ...
- Get bigrams from a text (or list):
>>> bigrams(text1[:10])
[('[', 'Moby'), ('Moby', 'Dick'), ('Dick', 'by'),
('by', 'Herman'), ('Herman', 'Melville'),
('Melville', '1851'), ('1851', ']'),
(']', 'ETYMOLOGY'), ('ETYMOLOGY', '.')]
- Get the most frequent collocations:
>>> text1.collocations()
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man;
Captain Ahab; sperm whale; Right Whale;
Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; ...
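What bigrams computes can be sketched in plain Python 3 by pairing each token with its successor. The helper simple_bigrams below is a hypothetical stand-in written for illustration, not NLTK's own implementation:

```python
def simple_bigrams(tokens):
    """Pair each token with the one that follows it."""
    # zip stops at the shorter sequence, so the last token starts no pair.
    return list(zip(tokens, tokens[1:]))

tokens = ['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']
pairs = simple_bigrams(tokens)
print(pairs[:3])
print(len(tokens), len(pairs))
```

A sequence of n tokens yields n - 1 bigrams. In NLTK 3, nltk.bigrams returns a generator rather than a list, so wrapping the call in list() may be needed to see output like that on the slide.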
Using your own data
Using .read(), you can read a text file as a string in Python
- With a string representation, you can use NLTK's utilities
raw is Crime and Punishment, from Project Gutenberg
>>> import nltk
>>> raw = open('crime.txt').read()
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of',
'Crime', 'and', 'Punishment', ',', 'by']
- open() opens a file & read() returns its contents as a string
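A self-contained sketch of the open / read step in Python 3. It writes a one-line stand-in file first (the file name sample.txt and its contents are assumptions, so that crime.txt need not exist), and it uses str.split() as a rough substitute for nltk.word_tokenize:

```python
import os
import tempfile

# Write a tiny stand-in file so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\n')

# "with" closes the file automatically once the block ends.
with open(path, encoding='utf-8') as f:
    raw = f.read()

# str.split() only splits on whitespace; unlike nltk.word_tokenize,
# it leaves punctuation attached to words ("Punishment,").
tokens = raw.split()

print(type(raw), len(raw))
print(tokens[:10])
```

The attached comma in 'Punishment,' is the reason a real tokenizer is worth using: on the slide, nltk.word_tokenize returns 'Punishment' and ',' as separate tokens.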
Creating an NLTK text
nltk.Text() creates an NLTK text, with all its internal methods available:
>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> text[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of',
'Crime', 'and', 'Punishment', ',', 'by']
>>> text.collocations()
Building collocations list
Katerina Ivanovna; Pyotr Petrovitch;
Pulcheria Alexandrovna; Avdotya Romanovna;
Marfa Petrovna; Rodion Romanovitch;
Sofya Semyonovna; old woman; Project Gutenberg-tm;
Porfiry Petrovitch; Amalia Ivanovna; great deal; ...
Managing corpora in NLTK
There is much more you can do to use your own corpus data in NLTK
- Some of this involves using Corpus Readers
- See: http: