Tutorial on MALLET Shatakirti MT2011096


MALLET

Contents

1 Introduction to MALLET
2 Where do we use MALLET?
3 Getting Started
3.1 Installing MALLET
3.2 Using the Script
4 Importing Data files
5 Natural Language Processing
6 Document classification
7 Sequence Tagging
8 Topic Models
References

List of Figures

1 Natural Language Processing using MALLET
2 Document classification
3 Sequence Tagging


1 Introduction to MALLET

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

2 Where do we use MALLET?

1. Historical Topics and Trends
Our aim here is to automatically discover the general topics that appear in a large newspaper corpus. MALLET is run over a period of interest to find the top general topic groups. For example, if we wish to know the top ten topic groups between the years 1965-1901, MALLET is run over the data from that period. In addition, we can find the topics most strongly associated with a particular word, say "iron": we extract 5 lines on each side of every line containing "iron" and run MALLET again on that extract to find the top general topic groups.

2. Detect spam mails
We can use the document classification capabilities of MALLET to detect spam. A simple example is the spam classifier found in an email inbox. Since we know what good mail looks like and what spam typically looks like, we can train a Naive Bayes classifier to estimate statistically whether or not a new message is spam.

3. Extract important information
We can use the sequence tagging functionality that MALLET provides to extract important information from data. By employing named-entity recognition techniques, we can figure out what a document is talking about without having to read through the entire text ourselves. Imagine someone hands you a book and asks you for all the characters and locations featured throughout the text. Using named-entity recognition, a computer can accomplish that task in mere seconds, compared to the hours it would take a human.


3 Getting Started

3.1 Installing MALLET

1. Download the latest version of MALLET from

http://mallet.cs.umass.edu/download.php

2. To build MALLET 2.0, you must have Apache Ant. You can download it from http://ant.apache.org/

3. Set the environment variables that point to Java Home, Ant Home and Mallet Home (the MALLET directory).

4. Change to the MALLET directory and type:

ant

Example: C:\Users\VAIO\WebIR\mallet-2.0.7>ant

If ant finishes with "BUILD SUCCESSFUL", MALLET is ready to use.

3.2 Using the Script

If you installed MALLET in the directory \WebIR\mallet-2.0.7, the mallet script will be present in \WebIR\mallet-2.0.7\bin. If the current working directory is the MALLET directory, you can use this script in this pattern:

bin\mallet [command] --option value --option value ...

Type bin\mallet to get a list of commands. Use the --help option with any command to get a description of its valid options.

4 Importing Data files

To import a data file, use the command:

bin\mallet import-file --input [filename] --output [output filename] [options]


Similarly, to import an entire directory use:

bin\mallet import-dir --input [dir path] --output [output filename] [options]

For example:

bin\mallet import-file --input sample-data\web\en\hill.txt --output output.mallet

In the above example, the input data is hill.txt and the imported data is written to the output.mallet file. Stopwords are removed only if the --remove-stopwords option is given.

bin\mallet import-dir --input sample-data\web\* --output output.mallet

In the above example, the input data is the set of folders inside the web folder and the imported data is written to the output.mallet file. Stopwords are removed only if the --remove-stopwords option is given.

For more options use the help by typing in:

bin\mallet import-file --help

or

bin\mallet import-dir --help

5 Natural Language Processing

MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system called "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors. MALLET reads Unicode files, so we can use files in various languages and give MALLET rules for processing the data. We can use regular expressions to tokenize word segments in any language. For example, if we type in

bin\mallet import-file --input sample-data\web\en\hill.txt --output output.mallet --print-output --remove-stopwords

MALLET removes the stopwords, prints the output, and also writes it to the output.mallet file. A sample output with and without removing stopwords is shown below:

(a) without removing stopwords (b) Removing stopwords

Figure 1: Natural Language Processing using MALLET

The above figure shows MALLET's support for the English language. In the snapshot, a simple text file "hill.txt" written in English is imported. The words are numbered and the number of occurrences of each word is also shown. MALLET recognizes the stopwords, which can be kept in or left out of the output file as the user requires. Currently, the only languages MALLET does not support are Chinese and Japanese.
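The pipe idea described in this section can be illustrated outside MALLET. The following is a minimal Python sketch and not MALLET's own implementation: the function names, the tiny stopword list and the token regular expression are all assumptions made for illustration. It chains the three steps named above (tokenize, remove stopwords, convert to counts) in order.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list; MALLET ships its own, much longer one.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "on"}

# Assumed token pattern: runs of Unicode letters (no digits or underscores).
# A different regular expression can be substituted for other languages.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split a string into lowercase word tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def to_counts(tokens):
    """Convert a token sequence into a word -> count mapping."""
    return Counter(tokens)


def run_pipes(data, pipes):
    """Feed the output of each pipe into the next one."""
    for pipe in pipes:
        data = pipe(data)
    return data


counts = run_pipes(
    "The hill is on the edge of the old hill town.",
    [tokenize, remove_stopwords, to_counts],
)
print(counts)
```

Leaving `remove_stopwords` out of the list gives the "without removing stopwords" variant, which mirrors the effect of omitting --remove-stopwords on the command line.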

6 Document classification

A classifier is an algorithm that distinguishes between a fixed set of classes, such as "spam" vs. "non-spam", based on previous training (note that MALLET is also a machine learning tool). MALLET includes implementations of several classification algorithms, among them Naive Bayes, Maximum Entropy, and Decision Trees.

To get started with the document classifier, first load the data into MALLET format. Then follow these steps:


1. Train the classifier:
Suppose you have a MALLET data file called train.mallet. Use the command:

bin\mallet train-classifier --input train.mallet --output-classifier my.classifier

2. Choose the algorithm:
The default classification algorithm is Naive Bayes. To select a different algorithm, use the --trainer option. For example, to use the MaxEnt algorithm, use the following command:

bin\mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt

You can also try NaiveBayes, C45 and DecisionTree. To compare multiple training algorithms, use the following command:

bin\mallet train-classifier --input labeled.mallet --training-portion 0.9 --trainer MaxEnt --trainer NaiveBayes

This command will compare the MaxEnt and the NaiveBayes algorithms.

3. Evaluation:
If we wish to know whether the classifier produces good results on data not used in training, we can split a single set of instances into training and testing lists. For this purpose, you can use a command like:

bin\mallet train-classifier --input labeled.mallet --training-portion 0.9

This command will randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances. MALLET will use the classifier to predict the class labels of the testing instances, compare those to the true labels, and report the results. You can also try the various training options listed in the MALLET help.


For example, you can try the following command:

bin\mallet train-classifier --input web.mallet --trainer MaxEnt --trainer NaiveBayes --training-portion 0.9 --num-trials 10

This command will run 10 trials, in each of which the input data is randomly split into 90% training instances and 10% testing instances. For each trial, MALLET trains a MaxEnt classifier and a Naive Bayes classifier on the training instances, then prints accuracy results and a matrix of correct and predicted labels for each classifier. An illustration is shown in Figure 2.


Figure 2: Document classification, panels (a) and (b)
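The evaluation procedure described in step 3 (random split, train, predict, compare with the true labels, repeat over several trials) can be sketched in a few lines of Python. This is only an illustration of the idea, not what MALLET runs internally: the function names are hypothetical, the data is a toy set, and a trivial majority-label predictor stands in for a real trainer such as MaxEnt or Naive Bayes.

```python
import random
from collections import Counter


def split_instances(instances, training_portion, rng):
    """Randomly split instances into (training, testing) lists."""
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]


def train_majority(training):
    """Stand-in trainer (an assumption of this sketch): the returned
    classifier always predicts the most frequent training label."""
    majority_label = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority_label


def accuracy(classifier, testing):
    """Fraction of testing instances whose predicted label is the true label."""
    correct = sum(1 for features, label in testing if classifier(features) == label)
    return correct / len(testing)


def run_trials(instances, training_portion=0.9, num_trials=10, seed=0):
    """Repeat the split/train/test cycle and collect one accuracy per trial."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_trials):
        training, testing = split_instances(instances, training_portion, rng)
        classifier = train_majority(training)
        results.append(accuracy(classifier, testing))
    return results


# Toy data: (features, label) pairs with 70 "ham" and 30 "spam" instances.
data = [({"id": i}, "ham") for i in range(70)]
data += [({"id": i}, "spam") for i in range(70, 100)]

accuracies = run_trials(data)
print(sum(accuracies) / len(accuracies))
```

Because each trial uses a different random split, the accuracies vary from trial to trial, which is why averaging over several trials gives a steadier estimate than a single split.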


7 Sequence Tagging

Sometimes we have a very large database of distinct values, for example a large gene database. MALLET includes implementations of widely used sequence algorithms, including hidden Markov models (HMMs) and linear chain conditional random fields (CRFs). These algorithms support applications such as gene finding and named-entity recognition.

Simple Tagger

Simple Tagger is a command-line interface to the MALLET CRF class. To use it, each line in the input file should represent one token. The required format is:

feature1 feature2 ... featuren label

For example, write the following in a file named "sample" and put it in the MALLET directory.

Kirti CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
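To make the input format concrete, here is a minimal Python sketch that reads such a file into (features, label) pairs. It is not part of MALLET: the function names are hypothetical, and treating a blank line as a boundary between sequences is an assumption of this sketch.

```python
def parse_tagger_line(line, labeled=True):
    """Parse one token line: 'feature1 feature2 ... featuren label'.

    Returns (features, label), or None for a blank line. With labeled=False
    every field is treated as a feature and the label is None.
    """
    fields = line.split()
    if not fields:
        return None
    if labeled:
        return fields[:-1], fields[-1]
    return fields, None


def parse_sequences(text, labeled=True):
    """Group token lines into sequences, starting a new one at blank lines."""
    sequences, current = [], []
    for line in text.splitlines():
        parsed = parse_tagger_line(line, labeled)
        if parsed is None:
            if current:
                sequences.append(current)
                current = []
        else:
            current.append(parsed)
    if current:
        sequences.append(current)
    return sequences


sample = """Kirti CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
"""

sequences = parse_sequences(sample)
for features, label in sequences[0]:
    print(label, features)
```

Note that the number of features may differ from line to line; only the last field of a training line is read as the label.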

To train the CRF, use the following command while in the MALLET directory:

java -cp class;lib\mallet-deps.jar cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample

This command will train the CRF. The --train true option specifies that this is a training run. Here the CRF file is created in the MALLET directory itself; we can, however, specify other locations as convenient.


Figure 3: Sequence Tagging


Now that we have trained MALLET, we can put it to the test by creating a new file called "test". Inside this file, we write:

CAPITAL Al
slept
here

Now we want the file to be labelled, so we apply the CRF stored in nouncrf by typing:

java -cp class;lib\mallet-deps.jar cc.mallet.fst.SimpleTagger --model-file nouncrf test

which produces the following output:

Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here

8 Topic Models

Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

The first step in building a topic model is to import a set of documents. Suppose we want to import the files in the folder "en". Type the command:

bin\mallet import-dir --input sample-data\web\en --output output.mallet --keep-sequence --remove-stopwords

This command will remove all the stopwords, keep each document as a sequence of words (--keep-sequence) and write the output to an "output.mallet" file in the MALLET directory.


Now, type in the command:

bin\mallet train-topics --input output.mallet --num-topics 100 --output-state topic-state.gz

Here --num-topics [NUMBER] sets the number of topics to use; the larger the number, the more fine-grained the resulting topics. The --output-state option writes a compressed text file containing the words in the corpus with their topic assignments. This file format can easily be parsed and used by non-Java-based software. Note that the state file will be gzipped, so it is helpful to provide a filename that ends in .gz.
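As an illustration of parsing the state file from non-Java software, the following Python sketch counts how often each word is assigned to each topic. It rests on an assumption that should be checked against the header of your own file: that header lines start with "#" and that every data line ends with the word followed by its topic number (a layout of the form "doc source pos typeindex type topic"). The function names and the sample lines are hypothetical.

```python
import gzip
from collections import Counter, defaultdict


def count_topic_words(lines):
    """Count, per topic, how many times each word was assigned to it."""
    topic_words = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        # Assumed layout: the last two fields are the word and its topic.
        word, topic = fields[-2], int(fields[-1])
        topic_words[topic][word] += 1
    return topic_words


def read_state(path):
    """Read a gzipped state file such as topic-state.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_topic_words(handle)


# Hypothetical lines in the assumed layout, used here instead of a real file.
sample_lines = [
    "#doc source pos typeindex type topic",
    "0 hill.txt 0 0 hill 3",
    "0 hill.txt 1 1 town 3",
    "0 hill.txt 2 0 hill 7",
]

topic_words = count_topic_words(sample_lines)
for topic, words in sorted(topic_words.items()):
    print(topic, words.most_common(5))
```

With a real run, calling `read_state("topic-state.gz")` would return the same kind of mapping, from which the most frequent words per topic can be listed.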

References

[1] http://mallet.cs.umass.edu

[2] http://www.fieldstone-software.com/mallet/
