Analysis of unstructured data
Lecture 8 - natural language processing (in NLTK)
Janusz Szwabiński
Outline:
- NLP - what does it mean?
- First steps with NLTK
- Tokenizing text into sentences
- Tokenizing text into words
- Part-of-speech tagging
- Stemming and lemmatization
- An introduction to text classification
References:
- Dive into NLTK, http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk
- Natural Language Processing with Python, http://www.nltk.org/book/
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
NLP - what does it mean?

- natural language processing, NLP
- interdisciplinary domain, combining artificial intelligence and machine learning with linguistics
- challenges in natural language processing frequently involve speech recognition, natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof
- natural language generation converts information from computer databases or semantic intents into readable human language
- natural language understanding converts chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate
- natural language understanding involves the identification of the intended semantics from the multiple possible semantics which can be derived from a natural language expression
Is it difficult?
- text tokenization
  - there are no clear word or sentence boundaries in written text in some languages (e.g. Chinese, Japanese, Thai)
- no clear grammar (exceptions, exceptions to the exceptions):
  potato --> potatoes, tomato --> tomatoes, hero --> heroes, photo --> ???
- homonyms, synonyms
  - fluke --> a fish, fluke --> fins on a whale's tail, fluke --> end parts of an anchor, fluke --> a stroke of luck
  - a river bank, a savings bank, a bank of switches
  - ranny --> wounded, ranny --> in the morning (Polish; context is important)
  - ranny ptaszek ("early bird")
  - to book a flight, to borrow a book
  - buy - purchase
  - samochód - gablota (Polish: "car" and a colloquial synonym)
- inflexion
  - write - written
  - popiół – o popiele (Polish: "ash" - "about the ash")
- grammar is often ambiguous
  - a sentence can have more than one parse tree
  - Widziałem chłopca jedzącego zupę i bociana. (Polish: "I saw a boy eating soup and a stork.")
  - Jest szybka w łóżku. (Polish: "She is quick in bed.")
  - Every man saw the boy with his binoculars
FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS.
THE SILLIEST MISTAKE IN IN THE WORLD
Two different approaches to NLP
- grammatical
  - natural language can be described with the help of logical forms
  - comparative linguistics - Jakob Grimm, Rasmus Rask
  - I-language and E-language - Noam Chomsky
- statistical
  - analysis of real texts may help you to discover the structure of a natural language, in particular typical word usage patterns
  - it is good to look at a large set of texts
  - it is better to look at a huge set of texts
  - it is even better to... --> statistics
  - first attempts - Markov chains (http://www.cs.princeton.edu/courses/archive/spr05/cos126/assignments/markov.html), Shannon game
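The Markov-chain idea behind those first attempts can be sketched with the standard library alone (the one-sentence training "corpus" below is invented for illustration):

```python
import random
from collections import defaultdict

# build a first-order word-level Markov model: word -> list of observed successors
text = "the cat sat on the mat and the cat slept"
words = text.split()
successors = defaultdict(list)
for w1, w2 in zip(words, words[1:]):
    successors[w1].append(w2)

# generate a short random walk through the model
random.seed(0)
state = "the"
generated = [state]
for _ in range(5):
    state = random.choice(successors[state])
    generated.append(state)
    if state not in successors:  # dead end: the corpus's last word
        break
print(" ".join(generated))
```

With a large enough corpus, such walks start to reproduce the typical word-usage patterns of the language, which is exactly the statistical observation the lecture points at.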
How does the statistical method work?
- They put the money in the bank
- How should we interpret the word bank? River bank? Savings bank?
- We take all available texts and calculate the probability of the words' co-occurrence:
we choose the meaning with higher probability
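This co-occurrence counting can be sketched in plain Python (the miniature three-sentence "corpus" and the cue words are made up for illustration; a real system would count over millions of sentences):

```python
from collections import Counter

# tiny stand-in corpus
corpus = [
    "they deposited the money in the bank",
    "the bank approved the loan",
    "we walked along the river bank",
]

# count how often 'bank' co-occurs with cue words for each sense
cooc = Counter()
for sentence in corpus:
    words = sentence.split()
    if "bank" in words:
        for cue in ("money", "loan", "river"):
            if cue in words:
                cooc[cue] += 1

# choose the sense whose cues co-occur more often with 'bank'
savings_score = cooc["money"] + cooc["loan"]
river_score = cooc["river"]
sense = "savings bank" if savings_score > river_score else "river bank"
print(sense)
```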
Text corpora
- text corpus - a large and structured set of texts (nowadays usually electronically stored and processed), which is usually used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory
- essential for linguistic research
- often used as the training and test data for machine learning algorithms
- applications:
  - dictionaries
  - foreign language handbooks
  - search engines optimized for specific languages
  - translators
- worth visiting:
  - Narodowy Korpus Języka Polskiego, http://nkjp.pl/
  - British National Corpus, http://www.natcorp.ox.ac.uk/
  - Das Deutsche Referenzkorpus, http://www1.ids-mannheim.de/kl/projekte/korpora/
  - Český národní korpus, http://ucnk.ff.cuni.cz/
  - Национальный корпус русского языка, http://www.ruscorpora.ru/
Getting started with NLTK

After installing NLTK, you need to install NLTK Data, which includes a lot of corpora, grammars, models, etc. Without NLTK Data, NLTK is nothing special. You can find the complete NLTK data list here: http://nltk.org/nltk_data/
The simplest way to install NLTK Data is to run the Python interpreter and to type the following commands:
In [2]:
import nltk
In [3]:
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
The meaning of the tags is explained for instance at https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used
The most important tags are:
- NN - singular or mass noun
- NNS - plural noun
- NNP - proper noun, singular
- NNPS - proper noun, plural
- VB - verb, base form
- VBD - verb, past tense
- VBP - verb, non-3rd person singular present
- VBN - verb, past participle
- VBZ - verb, 3rd person singular present
- JJ - adjective
- JJR - comparative adjective
- JJS - superlative adjective (chief, top)
- RB - adverb
- RBR - comparative adverb
- RBS - superlative adverb
- CD - cardinal numeral
- MD - modal auxiliary (can, should, will)
- FW - foreign word
- PRP - personal pronoun
- IN - preposition
- CC - coordinating conjunction
It is also possible to use a simplified set of tags:
In this case the following universal tags are used:
Tag Meaning Examples
ADJ adjectives new, good, high, special, big, local
ADP adpositions on, of, at, with, by, into, under
ADV adverbs really, already, still, early, now
CONJ conjunctions and, or, but, if, while, although
DET determiners the, a, some, most, every, no, which
NOUN nouns year, home, costs, time, Africa
NUM cardinal numbers twenty-four, fourth, 1991, 14:24
PRT particles, other function words at, on, out, over, per, that, up, with
PRON pronouns he, their, her, its, my, I, us
. punctuation . , ; !
X other (foreign words, typos, etc) ersatz, esprit, dunno, gr8, univeristy
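NLTK realizes the simplification as a lookup from the detailed tagset to the universal one; a hand-rolled sketch of the same idea follows (the mapping below is a small illustrative excerpt, not the full official table):

```python
# a small excerpt of a Penn-Treebank-to-universal tag mapping
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV",
    "IN": "ADP",
    "CC": "CONJ",
    "CD": "NUM",
    "PRP": "PRON",
}

def to_universal(tagged):
    """Map (word, detailed tag) pairs to (word, universal tag); 'X' for unknown tags."""
    return [(w, PENN_TO_UNIVERSAL.get(t, "X")) for w, t in tagged]

print(to_universal([("year", "NN"), ("good", "JJ"), ("gr8", "UH")]))
```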
Tokenizing text into sentences

- sentence boundary disambiguation (SBD)
- also known as sentence breaking
- a problem of deciding where sentences begin and end
- often required by natural language processing tools for a number of reasons
- challenging because punctuation marks are often ambiguous
  - a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence
  - about 47% of the periods in the Wall Street Journal corpus denote abbreviations
  - question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang
  - languages like Japanese and Chinese have ambiguous sentence-ending markers
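A naive splitter that treats every period as a sentence boundary shows the problem directly (standard-library sketch):

```python
import re

text = "Ask Dr. Smith about it. He knows."

# naive rule: any '.', '!' or '?' followed by whitespace ends a sentence
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)  # wrongly splits after the abbreviation 'Dr.'
```

The abbreviation "Dr." produces a spurious boundary, which is exactly what trained tools like the Punkt tokenizer below try to avoid.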
In [9]:
text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."
from nltk.tokenize import sent_tokenize
sent_tokenize_list = sent_tokenize(text)
print(len(sent_tokenize_list))
print(sent_tokenize_list)
The function sent_tokenize uses an instance of the class PunktSentenceTokenizer from the module nltk.tokenize.punkt. The class was trained for many languages:
5 ["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]
# %load /home/szwabin/nltk_data/tokenizers/punkt/README
Pretrained Punkt Models -- Jan Strunk
(New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please confer the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation

There are pretrained tokenizers for the following languages (file, language: source, contents, size of training corpus in tokens, model contributed by):

- czech.pickle, Czech: Multilingual Corpus 1 (ECI), Lidove Noviny / Literarni Noviny, ~345,000, Jan Strunk / Tibor Kiss
- danish.pickle, Danish: Avisdata CD-Rom Ver. 1.1. 1995 (Berlingske Avisdata, Copenhagen), Berlingske Tidende / Weekend Avisen, ~550,000, Jan Strunk / Tibor Kiss
- dutch.pickle, Dutch: Multilingual Corpus 1 (ECI), De Limburger, ~340,000, Jan Strunk / Tibor Kiss
- english.pickle, English (American): Penn Treebank (LDC), Wall Street Journal, ~469,000, Jan Strunk / Tibor Kiss
- estonian.pickle, Estonian: University of Tartu, Estonia, Eesti Ekspress, ~359,000, Jan Strunk / Tibor Kiss
- finnish.pickle, Finnish: Finnish Parole Corpus, Finnish Text Bank (Suomen Kielen Tekstipankki), Finnish Center for IT Science (CSC), books and major national newspapers, ~364,000, Jan Strunk / Tibor Kiss
- french.pickle, French (European): Multilingual Corpus 1 (ECI), Le Monde, ~370,000, Jan Strunk / Tibor Kiss
- german.pickle, German (Switzerland): Neue Zürcher Zeitung AG CD-ROM (uses "ss" instead of "ß"), Neue Zürcher Zeitung, ~847,000, Jan Strunk / Tibor Kiss
- greek.pickle, Greek: Efstathios Stamatatos, To Vima (TO BHMA), ~227,000, Jan Strunk / Tibor Kiss
- italian.pickle, Italian: Multilingual Corpus 1 (ECI), La Stampa / Il Mattino, ~312,000, Jan Strunk / Tibor Kiss
- norwegian.pickle, Norwegian (Bokmål and Nynorsk): Centre for Humanities Information Technologies, Bergen, Bergens Tidende, ~479,000, Jan Strunk / Tibor Kiss
- polish.pickle, Polish: Polish National Corpus (http://www.nkjp.pl/), literature, newspapers, etc., ~1,000,000, Krzysztof Langner
- portuguese.pickle, Portuguese (Brazilian): CETENFolha Corpus (Linguateca), Folha de São Paulo, ~321,000, Jan Strunk / Tibor Kiss
- slovene.pickle, Slovene: TRACTOR, Slovene Academy for Arts and Sciences, Delo, ~354,000, Jan Strunk / Tibor Kiss
- spanish.pickle, Spanish (European): Multilingual Corpus 1 (ECI), Sur, ~353,000, Jan Strunk / Tibor Kiss
- swedish.pickle, Swedish: Multilingual Corpus 1 (ECI), Dagens Nyheter (and some other texts), ~339,000, Jan Strunk / Tibor Kiss
- turkish.pickle, Turkish: METU Turkish Corpus (Türkçe Derlem Projesi), University of Ankara, Milliyet, ~333,000, Jan Strunk / Tibor Kiss

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

---- Training Code ----

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain", "Ur", "iso-8859-2").read()

# Train tokenizer
tokenizer.train(text)

# Dump pickled tokenizer
import pickle
out = open("slovene.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()
---------
text = "Czy to miało sens? Nie byłem pewien."
tokenizer = nltk.data.load('tokenizers/punkt/polish.pickle')
res = tokenizer.tokenize(text)
for sent in res:
    print(sent)
However, it does not work all the time:
In [12]:
text = "Zapytaj o to dr. Kowalskiego."
res = tokenizer.tokenize(text)
for sent in res:
    print(sent)
In [13]:
text = u"Nie widzę gdzie leży por. Magda chyba go wyrzuciła."
tokenizer = nltk.data.load('tokenizers/punkt/polish.pickle')
res = tokenizer.tokenize(text)
for i in res:
    print(i)
Tokenizing text into words
Out[10]:
["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]
Czy to miało sens? Nie byłem pewien.
Zapytaj o to dr. Kowalskiego.
Nie widzę gdzie leży por. Magda chyba go wyrzuciła.
from nltk.tokenize import word_tokenize
print(word_tokenize('Hello World.'))
print(word_tokenize("this's a test"))
The word_tokenize function is a wrapper around the TreebankWordTokenizer. However, other tokenizers are also available in NLTK:
In [15]:
text = "At eight o'clock on Thursday morning Arthur didn't feel very good."
In [16]:
print(word_tokenize(text))
In [17]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print(word_punct_tokenizer.tokenize(text))
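WordPunctTokenizer is itself a regular-expression tokenizer; its documented pattern is `\w+|[^\w\s]+`, so its behaviour can be sketched with the standard library alone:

```python
import re

text = "At eight o'clock on Thursday morning Arthur didn't feel very good."

# \w+ matches runs of word characters, [^\w\s]+ matches runs of punctuation
tokens = re.findall(r"\w+|[^\w\s]+", text)
print(tokens)
```

Note how the apostrophes are split off ("o", "'", "clock" and "didn", "'", "t"), unlike the Treebank tokenizer's "did" / "n't" treatment above.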
Part-of-speech tagging

From Wikipedia:
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms.
In NLTK, the text must be tokenized into words before tagging:
import nltk
text = "Part-of-speech tagging is harder than just having a list of words and their parts of speech"
text = nltk.word_tokenize(text)
nltk.pos_tag(text)
It is possible to check the meaning of a tag:
In [19]:
nltk.help.upenn_tagset('JJ')
In [20]:
nltk.help.upenn_tagset('NN')
Languages other than English
The default tagger in NLTK was trained on the Penn Treebank corpus of English (https://www.cis.upenn.edu/~treebank/). However, NLTK includes corpora of other languages that may be used to train taggers for languages other than English. Let us have a look at the Polish corpus:
# Find the directory where the corpus lives
import nltk
pl196x_dir = nltk.data.find('corpora/pl196x')
In [22]:
print(pl196x_dir)
In [23]:
# Create a new corpus reader object
from nltk.corpus.reader import pl196x
pl = pl196x.Pl196xCorpusReader(pl196x_dir, r'.*\.xml',
                               textids='textids.txt', cat_file="cats.txt")
In [24]:
# Use the new corpus object
print(pl.fileids())
In [25]:
# Look at tagged words
twords = pl.tagged_words(fileids=pl.fileids(), categories='cats.txt')
for w in twords[:10]:
    print(w)
Important note

If you want to use the pl196x corpus with Python 3.x, then you have to edit the reader file (/usr/local/lib/python3.5/dist-packages/nltk/corpus/reader/pl196x.py on Ubuntu) and replace the following line in the _resolve method
if len(filter(lambda accessor: accessor is None, (fileids, categories, textids))) != 1:
by
if len(list(filter(lambda accessor: accessor is None, (fileids, categories, textids)))) != 1:
In the pl196x corpus every tag consists of 14 characters. Their meaning may be checked at: http://clip.ipipan.waw.pl/PL196x?action=AttachFile&do=view&target=taksonomia.pdf
Now, we can use the corpus to train the UnigramTagger:
- accuracy is 65%
- it may be improved by taking a larger training set or by taking only a set of the most frequent tags in the corpus
Other taggers
The UnigramTagger is not the only tagger contained in NLTK. Among other taggers we have:
- DefaultTagger - the simplest possible tagger; it assigns the same tag to each token. It establishes an important baseline for tagger performance (it allows tagging each word with the most likely tag). Useful as a backoff in case a more advanced tagger fails to tag a word.
- RegexpTagger - assigns tags to tokens on the basis of matching patterns.
- NgramTagger - a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the preceding tokens
- BigramTagger - a special case of the NgramTagger for n=2
- TrigramTagger - a special case of the NgramTagger for n=3
- AffixTagger - a tagger that chooses a token's tag based on a leading or trailing substring of its word string
- BrillTagger - it uses an initial tagger (such as DefaultTagger) to assign an initial tag sequence to a text and then applies an ordered list of transformational rules to correct the tags of individual tokens
tekst = "To jest przykładowe zdanie w języku polskim"
tagger.tag(tekst.split())
The above example illustrates the possibility of linking taggers with each other. A more complex configuration could be:
BigramTagger --> UnigramTagger --> RegexpTagger
There are some benefits of such an approach:
- improved accuracy of the resulting tagger
- data reduction - for instance, we do not have to tag nouns in the corpus. Instead, we simply assign the 'NN' tag to all unknown words with the DefaultTagger
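A minimal backoff chain can be assembled directly from NLTK's tagger classes (the two-sentence training set and its tags are invented for illustration):

```python
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger

# tiny hand-tagged training set
train_sents = [
    [("the", "DET"), ("cat", "NOUN"), ("purred", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
]

# backoff chain: UnigramTagger --> RegexpTagger --> DefaultTagger
default = DefaultTagger("NOUN")                        # last resort
regexp = RegexpTagger([(r".*ed$", "VERB")], backoff=default)
unigram = UnigramTagger(train_sents, backoff=regexp)

tagged_result = unigram.tag(["the", "dog", "jumped", "today"])
print(tagged_result)
```

Known words get their training-set tags, the unseen "jumped" falls through to the regexp rule, and "today" ends up with the default tag.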
TreeTagger
- homepage: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
- Python module available: treetaggerwrapper
- support for German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Galician, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Romanian
- adding new languages is possible, given a dictionary and a manually tagged corpus
!echo "To jest przykładowe zdanie w języku polskim." | /home/szwabin/Tools/TreeTagger/cmd/tree-tagger-polish
Stemming and lemmatization
Stemming
From Wikipedia:
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
NLTK provides interfaces to several well-known stemmers, such as the Porter stemmer, the Lancaster stemmer, the Snowball stemmer, etc. Using those stemmers is very simple.
Porter stemmer
A very good explanation of the Porter algorithm may be found at: http://snowballstem.org/algorithms/porter/stemmer.html
In [33]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
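A few calls show the typical behaviour: related forms collapse to a common stem, which need not itself be a dictionary word (the word list is arbitrary):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ponies", "maximum"]:
    # the Porter algorithm strips suffixes rule by rule
    print(word, "->", stemmer.stem(word))
```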
	reading parameters ...
	tagging ...
To	subst:sg:nom:n	to
jest	fin:sg:ter:imperf	być
przykładowe	adj:pl:acc:n:pos	przykładowy
zdanie	subst:sg:acc:n	zdanie
w	prep:loc:nwok	w
języku	subst:sg:loc:m3	język
polskim	adj:sg:loc:m3:pos	polski
.	SENT	.
	finished.
The pl_stemmer.py program is a very simple Python stemmer for the Polish language based on Porter's algorithm (https://github.com/Tutanchamon/pl_stemmer/blob/master/pl_stemmer.py).
In [42]:
! cat email.txt
arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish
Kariera na językach to wydarzenie zorganizowane z myślą o studentach i absolwentach znających języki obce na poziomie co najmniej dobrym Będą oni mieli okazję zastanowić się nad kierunkami rozwoju własnej kariery zawodowej w oparciu o informacje na temat możliwości wykorzystania swoich umiejętności lingwistycznych na współczesnym rynku pracy
Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
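The difference is visible with the Porter stemmer alone; the comments state what a lemmatizer with POS information would return instead:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# a lemmatizer would map 'was' -> 'be' and keep the noun 'meeting' intact;
# the context-free stemmer can do neither
print(stemmer.stem("was"))      # not a dictionary word at all
print(stemmer.stem("meeting"))  # stemmed even when 'meeting' is a noun
```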
The NLTK Lemmatization method is based on WordNet’s built-in morphy function.
From the WordNet official website (https://wordnet.princeton.edu/):
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions. First, WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated. Second, WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity.
In [44]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('better', pos='a'))
print(wordnet_lemmatizer.lemmatize('meeting', pos='v'))
import nltk
text = "Part-of-speech tagging is harder than just having a list of words and their parts of speech"
text = nltk.word_tokenize(text)
for w in text:
    print(w, " | ", wordnet_lemmatizer.lemmatize(w))
In the default setting, the lemmatizer assumes that every word is a noun, i.e. for each word it searches for the closest noun. We can change this with the pos flag:
In [46]:
for w in text:
    print(w, " | ", wordnet_lemmatizer.lemmatize(w, pos='v'))
Part-of-speech | Part-of-speech
tagging | tagging
is | is
harder | harder
than | than
just | just
having | having
a | a
list | list
of | of
words | word
and | and
their | their
parts | part
of | of
speech | speech
Part-of-speech | Part-of-speech
tagging | tag
is | be
harder | harder
than | than
just | just
having | have
a | a
list | list
of | of
words | word
and | and
their | their
parts | part
of | of
speech | speech
Thus, we have to use POS tagging before word lemmatization. Moreover, since the POS tagger uses the Treebank tag set (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), we have to translate its tags into WordNet-compatible ones:
import nltk
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

# translate a Penn Treebank tag into a WordNet POS constant
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return None

tokens = ['better', 'meeting', 'churches', 'abaci', 'are', 'is']
tagged = nltk.pos_tag(tokens)
print(tagged)
results = []
lemmatizer = WordNetLemmatizer()
for word, tag in tagged:
    wntag = get_wordnet_pos(tag)
    if wntag is None:  # do not supply a tag in case of None
        lemma = lemmatizer.lemmatize(word)
    else:
        lemma = lemmatizer.lemmatize(word, pos=wntag)
    results.append(lemma)
print(results)
- Morfeusz, http://sgjp.pl/morfeusz/morfeusz.html (Python API)
- Lametyzator, http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml (currently a part of the Morfologik project, https://github.com/morfologik/)
In [51]:
%%python2
# coding: utf-8
import morfeusz2
lem = morfeusz2.Morfeusz()
print lem
text = u'Mam próbkę analizy morfologicznej'
for i in lem.analyse(text):
    print len(i), i
An introduction to text classification

From Wikipedia:
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.
Text classification is a very important technique of text analysis. It has many potential applications:
Classification in NLTK

- requirements - labeled category data, which can be used as a training set
- example: the NLTK Names Corpus
- goal: a Gender Identification classifier
In [52]:
from nltk.corpus import names
import random
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
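The cells between loading the names and training the classifiers are not reproduced above; following the NLTK book's recipe (and consistent with the last_letter feature visible in the output below), they would look roughly like this. The stand-in name list and the split size are assumptions, shown only so that train_set is defined:

```python
import random

# last-letter feature, as in the NLTK book's gender example
def gender_features(word):
    return {'last_letter': word[-1]}

# `names` is the (name, gender) list built above; a tiny stand-in here
names = [('John', 'male'), ('Mary', 'female'), ('Anna', 'female'), ('Mark', 'male')]
random.shuffle(names)

featuresets = [(gender_features(n), g) for (n, g) in names]
train_set, test_set = featuresets[2:], featuresets[:2]
print(gender_features('Shrek'))  # {'last_letter': 'k'}
```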
First, we use the Naive Bayes classifier (see https://en.wikipedia.org/wiki/Naive_Bayes_classifier for an explanation):
In [62]:
from nltk import NaiveBayesClassifier
nb_classifier = NaiveBayesClassifier.train(train_set)
Here is how to train a Maximum Entropy classifier (https://en.wikipedia.org/wiki/Multinomial_logistic_regression) for Gender Identification:
Most Informative Features
             last_letter = 'a'            female : male   =     33.0 : 1.0
             last_letter = 'k'              male : female =     31.9 : 1.0
             last_letter = 'f'              male : female =     16.7 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
             last_letter = 'v'              male : female =     10.6 : 1.0
It seems that the Naive Bayes and Maxent models give the same result on this gender task. However, let us look at what happens if we define a more complex feature extractor function and train the models again:
In [70]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
   6.644 last_letter==' ' and label is 'female'
   6.644 last_letter=='c' and label is 'male'
  -4.864 last_letter=='a' and label is 'male'
  -3.503 last_letter=='k' and label is 'female'
  -2.700 last_letter=='f' and label is 'female'