MALLET MAchine Learning for LanguagE Toolkit
MALLET MAchine Learning for LanguagE Toolkit. Outline About MALLET Representing Data Command Line Processing Simple Evaluation Conclusion.

Jan 15, 2016

Kerry Fletcher
MALLET: MAchine Learning for LanguagE Toolkit

Outline

• About MALLET

• Representing Data

• Command Line Processing

• Simple Evaluation

• Conclusion

About MALLET

• "MALLET: A Machine Learning for Language Toolkit."
  • Written by Andrew McCallum
  • http://mallet.cs.umass.edu. 2002.
  • Implemented in Java, currently version 2.0.6

• Motivation:
  • Text classification and information extraction
  • Commercial machine learning
  • Analysis and indexing of academic publications

About MALLET

• Main idea
  • Text focus: data is discrete rather than continuous, even when values could be continuous

• How to
  • Command line scripts:
    • bin/mallet [command] --[option] [value] …
    • Text User Interface ("tui") classes
  • Direct Java API
    • http://mallet.cs.umass.edu/api

Outline

• About MALLET

• Representing Data

• Command Line Processing

• Simple Evaluation

• Conclusion

Representations

• Transform text documents to vectors x1, x2, …

• Elements of a vector are called feature values
  • Example: "Feature at row 345 is the number of times "dog" appears in the document"

• Retain meaning of vector indices
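
The idea of turning documents into count vectors while keeping the meaning of each index can be sketched in a few lines. This is a plain-Python illustration, not MALLET's own API; the function names `build_alphabet` and `to_vector`, the whitespace tokenization, and the two sample documents are all assumptions made for the example.

```python
from collections import Counter

def build_alphabet(documents):
    """Map each distinct token to a fixed integer index, in first-seen order."""
    alphabet = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in alphabet:
                alphabet[token] = len(alphabet)
    return alphabet

def to_vector(document, alphabet):
    """Return a sparse count vector {feature index: count}; unknown tokens are skipped."""
    counts = Counter(document.lower().split())
    return {alphabet[t]: c for t, c in counts.items() if t in alphabet}

# Hypothetical sample documents, for illustration only.
docs = ["the dog chased the cat", "the cat slept"]
alphabet = build_alphabet(docs)
vectors = [to_vector(d, alphabet) for d in docs]

# The reverse mapping is what "retains the meaning" of each vector index.
index_to_token = {i: t for t, i in alphabet.items()}
```

Keeping `index_to_token` alongside the vectors lets any feature value be read back as "the number of times this word appears in this document".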

Documents to Vectors


Instances

Outline

• About MALLET

• Representing Data

• Command Line Processing

• Developing with MALLET

• Conclusion

Command Line

• Importing Data

• Classification

• Sequence Tagging

• Topic Modeling

Importing Data

• One instance per file
  • Files in the folder: sample-data/web/en or sample-data/web/de
  • Command line: bin/mallet import-dir --input sample-data/web/* --output web.mallet

• One file, one instance per line
  • File format: [URL] [language] [text of the page...]
  • Command line: bin/mallet import-file --input /data/web/data.txt --output web.mallet
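
To make the one-instance-per-line format concrete, here is a minimal Python sketch that splits such a line into its three fields. It assumes the fields are separated by whitespace and that everything after the second field is the page text; the function name `parse_instance_line` and the sample line (including its URL) are hypothetical, and this is not how MALLET itself is implemented.

```python
def parse_instance_line(line):
    """Split '[URL] [language] [text...]' on the first two runs of whitespace."""
    name, label, text = line.strip().split(None, 2)
    return {"name": name, "label": label, "data": text}

# Hypothetical input line, for illustration only.
sample = "http://example.org/page1 en This is the text of the page"
instance = parse_instance_line(sample)
```

In this reading the URL serves as the instance name, the language as its label, and the rest of the line as the data to be vectorized.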

Classification

• Training a classifier

bin/mallet train-classifier --input training.mallet --output-classifier my.classifier

• Choosing an algorithm
  • MaxEnt, NaiveBayes, C45, DecisionTree and many others.

bin/mallet train-classifier --input training.mallet --output-classifier my.classifier --trainer MaxEnt

• Evaluation
  • Randomly split the data into 90% training instances, which are used to train the classifier, and 10% testing instances.

bin/mallet train-classifier --input labeled.mallet --training-portion 0.9
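
What a random 90/10 split amounts to can be sketched as follows. This is an illustrative Python version under stated assumptions, not MALLET's implementation: the function name `split_instances`, the fixed `seed`, and the use of integers as stand-in instances are all choices made for the example.

```python
import random

def split_instances(instances, training_portion=0.9, seed=0):
    """Shuffle a copy of the instances and cut it into training and testing parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * training_portion))
    return shuffled[:cut], shuffled[cut:]

# 100 stand-in instances; real instances would be feature vectors with labels.
train, test = split_instances(range(100), training_portion=0.9)
```

Every instance lands in exactly one of the two parts, so accuracy measured on `test` reflects data the classifier never saw during training.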

Sequence Tagging

• Sequence algorithms
  • Hidden Markov models (HMMs)
  • Linear-chain conditional random fields (CRFs)

• SimpleTagger
  • A command line interface to the MALLET Conditional Random Field (CRF) class

SimpleTagger

• Input file: [feature1 feature2 ... featuren label]

Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun

• Train a CRF
  • An input file "sample"
  • A trained CRF in the file "nouncrf"

java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample
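
The input format can be read mechanically: each line is one token, the last field is the label, and everything before it is a feature. A minimal Python sketch of such a reader follows; the function name `parse_tagger_input` is hypothetical, and treating a blank line as the end of a sequence is an assumption of this sketch rather than something stated on the slide.

```python
def parse_tagger_input(text, labeled=True):
    """Parse lines of 'feature1 ... featuren label' into sequences of (features, label)."""
    sequences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # Assumed convention: a blank line closes the current sequence.
            if current:
                sequences.append(current)
                current = []
            continue
        if labeled:
            current.append((fields[:-1], fields[-1]))
        else:
            current.append((fields, None))  # unlabeled input: every field is a feature
    if current:
        sequences.append(current)
    return sequences

sample = """Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
"""
sequences = parse_tagger_input(sample)
```

With `labeled=False` the same reader handles files that still need to be labeled, where no label column is present.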

SimpleTagger

• A file "stest" that needs to be labeled

CAPITAL Al
slept
here

• Label the input

java -cp "$HOME/mallet/class:$HOME/mallet/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest

• Output

Number of predicates: 5
noun CAPITAL Al
non-noun slept
non-noun here

Topic Modeling

• Building Topic Models

bin/mallet train-topics --input topic-input.mallet --num-topics 100 --output-state topic-state.gz

--input [FILE] 

--num-topics [NUMBER] The number of topics to use. The best number depends on what you are looking for in the model.

--num-iterations [NUMBER] The number of sampling iterations is a trade-off between the time taken to complete sampling and the quality of the topic model.

--output-state [FILENAME] This option outputs a compressed text file containing the words in the corpus with their topic assignments. 

Demo

Outline

• About MALLET

• Representing Data

• Command Line Processing

• Simple Evaluation

• Conclusion

Methodology

• Focus on the sequence tagging module in MALLET
  • CRF-based implementation
  • Some scripts written for importing data and evaluating results

• Small corpora collected from the web
  • Divided into two parts: 80% for training, 20% for testing

• Evaluate both POS tagging and Named Entity Recognition (NER)
  • The performance of training
  • Accuracy (POS tagging); precision, recall and FB1 (NER)

• All scripts, corpora and results can be found here
  • http://mallet-eval.googlecode.com

A Survey of Named Entity Corpora

• Well-known named entity corpora
  • Language-Independent Named Entity Recognition at CoNLL-2003
    • A manual annotation of a subset of RCV1 (Reuters Corpus Volume 1)
    • Free and public, but needs the RCV1 raw texts as input
  • Message Understanding Conference (MUC) 6 / 7
    • Not free
  • Automatic Content Extraction (ACE) Training Corpus
    • Not free

• Other special-purpose corpora
  • Enron Email Dataset
    • Email messages in this corpus are tagged with person names, dates and times.
  • A variety of biomedical corpora
    • Some corpora in this collection are tagged with entities in the biomedical domain, such as gene names

Small Corpora

• Two small corpora collected from the web
  • Penn Treebank Sample
    • English POS tagging corpus, ~5% fragment of the Penn Treebank, (C) LDC 1995.
    • Raw, tagged, parsed and combined data from the Wall Street Journal
    • 148120 tokens, 36 standard Treebank POS tags
    • http://web.mit.edu/course/6/6.863/OldFiles/share/data/corpora/treebank/
  • HIT CIR LTP Corpora Sample
    • Chinese NER corpora, integrated
    • 10% of the whole corpora (open to the public)
    • 23751 tokens, 7 kinds of named entities
    • http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Environment

• Hardware
  • CPU: Q8300 Quad Core 2.50 GHz
  • Memory: 3GB

• Software
  • Fedora 13 x86_64
  • Java 1.6.0_18
  • MALLET 2.0.6

Data Format and Labels

• Data Format
  • One token per row, one feature per column

Bill noun
slept non-noun
Here non-noun

• Labels
  • Standard Treebank POS tags
    • CC Coordinating conjunction | CD Cardinal number | DT Determiner | EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction | JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative | LS List item marker | MD Modal | NN Noun, singular or mass | NNS Noun, plural … … (36 tags in all)
  • HIT Named Entity
    • O not a named entity (NE) | S- a single-token NE | B- beginning of an NE | I- middle of an NE | E- end of an NE
    • Nm numeral | Ni organization name | Ns place name | Nh person name | Nt time | Nr date | Nz proper noun
    • Example: 美国 B-Ni 洛杉矶 I-Ni 警察局 E-Ni (United States / Los Angeles / Police Department, tagged together as one organization name)
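
The position prefixes (O, S-, B-, I-, E-) and type suffixes (Ni, Nh, ...) together mark entity spans, and recovering the spans from a tagged sequence is a small state machine. The Python sketch below is one way to do it under stated assumptions: the function name `decode_entities` is hypothetical, tokens are joined without spaces because the text is Chinese, malformed spans are silently dropped, and the last two tokens of the sample ("的" and the person name "张三") are invented to show the O and S- cases.

```python
def decode_entities(tokens, tags):
    """Collect (entity_text, entity_type) pairs from O / S- / B- / I- / E- tags."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "S":
            # A single token forms a complete entity on its own.
            entities.append((token, entity_type))
            current, current_type = [], None
        elif prefix == "B":
            current, current_type = [token], entity_type
        elif prefix in ("I", "E") and current and entity_type == current_type:
            current.append(token)
            if prefix == "E":
                entities.append(("".join(current), current_type))
                current, current_type = [], None
        else:
            # Malformed sequence (e.g. I- without a preceding B-): drop the partial span.
            current, current_type = [], None
    return entities

# First three tokens are the slide's example; the last two are hypothetical additions.
tokens = ["美国", "洛杉矶", "警察局", "的", "张三"]
tags = ["B-Ni", "I-Ni", "E-Ni", "O", "S-Nh"]
entities = decode_entities(tokens, tags)
```

Decoding spans this way is also the usual first step before computing entity-level precision and recall, since those are counted per entity rather than per token.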

Evaluation

Stage / Task           pos        chunking   ner
Training  Instance #   3982       8936       1286
Training  Tokens #     95767      211727     20913
Training  Time         308m 23s   190m 50s   17m 13s
Test      Tokens #     46452      47377      2829
Test      Accuracy     85.67%     93.97%     98.55%
Test      Precision    -          90.54%     86.89%
Test      Recall       -          89.89%     86.89%
Test      FB1          -          90.21      86.89
Test      Time         15.80s     4.43s      0.8s
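
The FB1 row can be checked against the precision and recall rows. The sketch below assumes FB1 is the balanced F-measure, the harmonic mean of precision and recall, which is the usual meaning of the name in CoNLL-style evaluation; the function names and the count-based helper are illustrative and are not taken from the evaluation scripts used for the table.

```python
def f_beta_1(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Derive all three scores from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall, f_beta_1(precision, recall)

# Precision and recall values (in percent) taken from the table above.
chunking_fb1 = f_beta_1(90.54, 89.89)
ner_fb1 = f_beta_1(86.89, 86.89)
```

Rounded to two decimals, both values agree with the FB1 row of the table, which supports the harmonic-mean reading.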

DEMO

Q&A