EVALITA 2007
Frascati, September 10th 2007
TAGPRO
A system for ITALIAN POS TAGGING based on SVM
Emanuele Pianta and Roberto Zanoli
FBK-irst, Trento
TextPro
A suite of modular NLP tools developed at FBK-irst
• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting
Architecture designed to be efficient, scalable and robust.
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface
TagPro’s architecture
To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
YamCha:
• Can redefine the context (window-size), the parsing direction (forward/backward) and the algorithm for the multi-class problem (pairwise / one vs rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as a C/C++ library
Support Vector Machines
Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)
• SVMs map input vectors to a higher-dimensional space where a maximum-margin separating hyperplane is constructed.
• Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data.
• The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
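The geometric description above can be written compactly for the linearly separable case. This is the standard textbook formulation rather than anything stated on the slides; the symbols w (weight vector), b (bias), x_i (training vectors) and y_i (labels in {-1, +1}) are notation assumed here for illustration.

```latex
% The two parallel hyperplanes and the separating hyperplane between them
\mathbf{w}\cdot\mathbf{x} + b = +1, \qquad
\mathbf{w}\cdot\mathbf{x} + b = -1, \qquad
\mathbf{w}\cdot\mathbf{x} + b = 0

% Distance between the two parallel hyperplanes (the margin)
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}

% Maximizing the margin is equivalent to
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \quad \text{for all } i
```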
The Evalita development set was randomly split into two parts:
• Training: 89,170 tokens
• Tuning: 44,586 tokens
FEATURES
For each running word a rich set of features is extracted:
• WORD: the word itself (both unchanged and lower-cased)
  e.g. Autore autore
• MORPHO: the morphological analysis (produced by MorphoPro)
  e.g. Autore   autore+n+m+sing
       Calcio   calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing
• AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)
  e.g. libro {li,lib,libr,libro,ro,bro,ibro,libro}
• ORTHO: orthographic information (e.g. capitalization, hyphenation)
  e.g. Oggi C (capitalized), oggi L (lowercased)
• GAZETT: gazetteers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
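A minimal sketch of how the WORD, AFFIX and ORTHO features listed above could be computed. It is not TagPro's code: the function names are hypothetical, the `__nil__` padding for words that are too short is inferred from the feature-extraction example in these slides, and ORTHO is reduced to the capitalized/lower-cased distinction only. MORPHO and GAZETT are left out because they depend on MorphoPro and on the gazetteer lists.

```python
NIL = "__nil__"  # placeholder seen in the slides' example when an affix does not exist


def word_features(word):
    """WORD: the token unchanged and lower-cased."""
    return [word, word.lower()]


def affix_features(word, sizes=(2, 3, 4, 5)):
    """AFFIX: prefixes, then suffixes, of 2 to 5 characters of the lower-cased word."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in sizes]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in sizes]
    return prefixes + suffixes


def ortho_feature(word):
    """ORTHO (simplified assumption): C if the first character is upper-case, else L."""
    return "C" if word[:1].isupper() else "L"


if __name__ == "__main__":
    for token in ["libro", "ex", "Craxi"]:
        print(word_features(token) + affix_features(token) + [ortho_feature(token)])
```

For `libro` the affix list reproduces the set given above, and for a short token such as `ex` the 3, 4 and 5 character slots are filled with `__nil__`.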
Static vs Dynamic Features
• STATIC FEATURES: extracted for the current, previous and following word (WORD, MORPHO, AFFIXes, ORTHO, GAZET)
• DYNAMIC FEATURES: decided dynamically during tagging (the tags of the two tokens preceding the current token)
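The distinction can be illustrated with a small sketch that assembles the features for one token: static features come from a window of precomputed rows, dynamic features come from tags already predicted earlier in the sentence. The function name, the feature-string format and the padding value are assumptions made for this example, and forward (left-to-right) tagging is assumed.

```python
PAD = "__nil__"  # hypothetical padding for positions outside the sentence


def window_features(static_rows, predicted_tags, i,
                    static_window=(-1, 1), dynamic_window=(-2, -1)):
    """Collect the features used to classify token i.

    static_rows:    one list of static feature values per token (WORD, AFFIX, ...).
    predicted_tags: tags already assigned to tokens 0..i-1 during tagging.
    """
    width = len(static_rows[0])
    feats = []
    # Static features: previous, current and following word.
    for offset in range(static_window[0], static_window[1] + 1):
        j = i + offset
        row = static_rows[j] if 0 <= j < len(static_rows) else [PAD] * width
        feats.extend(f"s[{offset}]:{k}={v}" for k, v in enumerate(row))
    # Dynamic features: tags of the two preceding tokens, known only at tagging time.
    for offset in range(dynamic_window[0], dynamic_window[1] + 1):
        j = i + offset
        tag = predicted_tags[j] if 0 <= j < len(predicted_tags) else PAD
        feats.append(f"t[{offset}]={tag}")
    return feats


if __name__ == "__main__":
    rows = [["l'", "l'"], ["ex", "ex"], ["leader", "leader"]]
    print(window_features(rows, ["ART", "ADJ"], 2))
```

Changing `static_window` or `dynamic_window` corresponds to the window-size settings explored later in the slides.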
An Example of Feature Extraction
l'          ART
ex          ADJ
leader      NN
socialista  ADJ
Bettino     NN_P
Craxi       NN_P

l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P
Finding the best features
EAGLES TagSet
Features                          Accuracy   UT Accuracy
baseline                          86.70      59.95
+AFFIX +ORTHO                     +8.56      +25.56
+AFFIX +ORTHO +MORPHO             +10.69     +33.18
+AFFIX +ORTHO +MORPHO +GAZETT     +10.72     +33.13
Baseline: WORD (both unchanged and lower-cased), window-size: +1,-1
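The signed values in the table read as gains over the baseline. Under that assumption, a short calculation recovers the absolute scores; the full feature set then lands on 97.42, the figure the window-size experiments start from. The dictionary keys and the function name are illustrative only.

```python
BASELINE = {"accuracy": 86.70, "ut_accuracy": 59.95}

# (gain in Accuracy, gain in UT Accuracy), assumed to be relative to the baseline
GAINS = {
    "+AFFIX +ORTHO": (8.56, 25.56),
    "+AFFIX +ORTHO +MORPHO": (10.69, 33.18),
    "+AFFIX +ORTHO +MORPHO +GAZETT": (10.72, 33.13),
}


def absolute_scores(baseline, gains):
    """Add each gain to the baseline and round to two decimals."""
    return {
        name: (round(baseline["accuracy"] + acc, 2),
               round(baseline["ut_accuracy"] + ut, 2))
        for name, (acc, ut) in gains.items()
    }


if __name__ == "__main__":
    for name, (acc, ut) in absolute_scores(BASELINE, GAINS).items():
        print(f"{name:32s} {acc:6.2f} {ut:6.2f}")
```

Read this way, adding the gazetteers raises overall accuracy by 0.03 points over the MORPHO configuration while UT accuracy drops by 0.05.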
Finding the best window-size
EAGLES TagSet
STAT     DYN    Accuracy
+1,-1    -1     97.42
+2,-2    -2     -0.34
+1,-1    -2     +0.23
+1,-1    -3     +0.22
Given the best set of features (accuracy = 97.42), we tried to improve accuracy by changing the window-size.
multi-class problem: pair-wise / one vs rest
• one vs rest:
  fewer, bigger classifiers
• pairwise:
  a classifier for each possible pair of classes
  choose the classifier with best confidence
  many relatively small classifiers
  faster, less memory
EAGLES TagSet
method         Accuracy
pairwise       97.65
one vs rest    97.78
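To make the trade-off concrete, the sketch below counts the binary classifiers each strategy needs for k classes and shows one common way of combining their outputs. It is a generic illustration, not YamCha's implementation: majority voting for pairwise and highest score for one vs rest are assumed combination rules, and `decide` and `score` are hypothetical callables standing in for trained classifiers.

```python
from itertools import combinations


def one_vs_rest_count(k):
    """One binary classifier per class."""
    return k


def pairwise_count(k):
    """One binary classifier per unordered pair of classes."""
    return k * (k - 1) // 2


def pairwise_predict(classes, decide):
    """decide(a, b) returns the winning class for the pair (a, b); majority vote wins."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[decide(a, b)] += 1
    return max(classes, key=lambda c: votes[c])


def one_vs_rest_predict(classes, score):
    """score(c) returns the confidence of the 'c vs rest' classifier; highest wins."""
    return max(classes, key=score)


if __name__ == "__main__":
    for k in (3, 10, 30):
        print(k, one_vs_rest_count(k), pairwise_count(k))
```

The pairwise count grows quadratically with the number of tags, but each pairwise classifier is trained only on the examples of its two classes, which is why the individual classifiers are small.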
Evaluating the best algorithm: PKI vs. PKE
EAGLES TagSet
method    Accuracy
PKI       97.78
PKE       97.64
YamCha uses two implementations of SVMs: PKI and PKE.
• Both are faster than the original SVMs.
• PKI (3-12 x faster) produces the same accuracy as the original SVMs.
• PKE (10-300 x faster) approximates the original SVM: slightly less accurate but much faster.
Results on the development set
                  EAGLES   DISTRIB
Accuracy          97.78    97.52

Known Words: 40,320 (MorphoPro coverage: 96.20%)
Accuracy          98.29    97.95