DutchSemCor
Building a semantically annotated corpus for Dutch

Piek Vossen, Attila Görög, VU University Amsterdam
Fons Laan, ISLA, University of Amsterdam
Rubén Izquierdo, Tilburg University
Antal van den Bosch, Maarten van Gompel, Radboud University Nijmegen

CLIN 22, Tilburg University, 20/01/2012
Overview
- Project goals and planning
- Current progress
- Word-sense-disambiguation results
- Active learning phase
Goals and planning
- Funded by NWO, 2009-2012
- Create a large semantically tagged corpus for Dutch:
  - Sense tags from the Cornetto database (includes the Dutch wordnet)
  - Domain labels from WordNet Domains
  - Named entities mapped to Wikipedia
Global procedure
Phase-1:
- 25 examples per meaning for the 3,000 most polysemous and frequent nouns, verbs and adjectives (average number of meanings = 3)
- Annotated by two student assistants
- Minimal inter-annotator agreement (IAA): 80%
Phase-2:
- Word-sense-disambiguation (WSD) systems trained on the data of Phase-1
- Active learning: add examples for low-performing words and meanings until we reach an accuracy of 80% or make no further progress
Phase-3:
- Apply WSD to the rest of the full corpus
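The 80% IAA threshold of Phase-1 can be illustrated with plain observed agreement between the two annotators. This is only a minimal sketch: the slides do not say which agreement measure was used, so percentage agreement is an assumption, and the function name and sense tags below are hypothetical.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no annotations to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical sense tags for five tokens of one word (not project data)
annotator_1 = ["bank_1", "bank_2", "bank_1", "bank_1", "bank_3"]
annotator_2 = ["bank_1", "bank_2", "bank_1", "bank_2", "bank_3"]

iaa = observed_agreement(annotator_1, annotator_2)
meets_threshold = iaa >= 0.80  # Phase-1 minimum stated on the slide
```

Under this assumption, a word whose annotations fall below the threshold would need to be revisited before its examples are used for training.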
Corpora
- SoNaR: 500M tokens of written Dutch
- CGN: 1M tokens of spoken Dutch
- Web snippets mediated through WebCorp.co.uk (http://www.webcorp.org.uk/)
  - Used in case no or insufficient examples are found for particular senses in SoNaR and CGN
  - Students select snippets (target word and context), which are added to the corpus in the SoNaR annotation format
For comparison
- SemEval-2010 task on WSD in a specific domain, all-words task:
  - UKB3: 52.6 precision
  - English UKB: 48.1 precision
- UKB5 & UKB4 gained 9 points on UKB3 due to co-occurrence relations
Tilburg WSD System
- Based on TiMBL, a k-nearest-neighbour classifier (Daelemans et al., 2007)
- Features:
  - Local context (words in a window around the target)
  - Global context (binary bag of words)
  - SoNaR category (domain label)
- Parameter search:
  - Using TiMBL's leave-one-out feature
- Evaluation:
  - 10 examples per sense for testing
  - >= 15 examples per sense for training
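The three feature groups listed above could be assembled into one instance per target token roughly as follows. This is a sketch under stated assumptions, not the project's actual feature extractor: the function name, the padding symbol, the vector layout, the toy vocabulary and the domain label "news" are all hypothetical.

```python
def extract_features(tokens, target_index, vocabulary, domain, window=1):
    """Build one feature vector for the target token.

    Assumed layout: [left context words..., right context words...,
                     binary bag-of-words over `vocabulary`..., domain label]
    """
    pad = "_"  # placeholder for positions outside the sentence
    left = [tokens[i] if i >= 0 else pad
            for i in range(target_index - window, target_index)]
    right = [tokens[i] if i < len(tokens) else pad
             for i in range(target_index + 1, target_index + 1 + window)]
    # Global context: every word in the sentence except the target itself
    context = set(tokens[:target_index] + tokens[target_index + 1:])
    bag = [1 if word in context else 0 for word in vocabulary]
    return left + right + bag + [domain]


# Toy example: target word "bank" at position 1
sentence = ["de", "bank", "verhoogt", "de", "rente"]
vocab = ["rente", "geld", "rivier"]  # hypothetical bag-of-words vocabulary
features = extract_features(sentence, 1, vocab, domain="news")
```

A memory-based learner such as TiMBL takes instances of this kind, followed by the sense label, as training input.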
Tilburg WSD System. First results

Feature set                      Token accuracy
Words1                           0.6462
Words1 + Bag-of-words            0.7259
Words1 + PoS1 + Bag-of-words     0.7226
Words1 + Bag-of-words + PS       0.7931

- Bag-of-words: improvement of 8% (0.6462 to 0.7259)
- Parameter search (PS): improvement of another 7% (0.7259 to 0.7931)
- Previous experiments suggest that the best size for the context window is 1
Tilburg WSD System. TiMBL Confidence
- Filtering at TiMBL confidence 0.55:
  - Precision 0.84 (+0.44 compared to no filtering)
  - F-score 0.78 (only 0.03 lower than with no filtering)
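The trade-off reported here, higher precision at a small cost in F-score, can be sketched as follows. The slides do not define the metrics, so this sketch assumes precision is measured over the predictions that pass the confidence filter, recall over all tokens, and F-score as their harmonic mean; the function name and the toy predictions are hypothetical.

```python
def filter_and_score(predictions, threshold):
    """Score predictions after dropping those below a confidence threshold.

    predictions: list of (predicted_sense, gold_sense, confidence) tuples.
    Returns (precision, recall, f_score) under the assumptions stated above.
    """
    kept = [(pred, gold) for pred, gold, conf in predictions if conf >= threshold]
    correct = sum(pred == gold for pred, gold in kept)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy predictions (not project data)
preds = [
    ("s1", "s1", 0.9),
    ("s2", "s2", 0.7),
    ("s1", "s2", 0.4),
    ("s2", "s2", 0.6),
    ("s1", "s2", 0.8),
]
unfiltered = filter_and_score(preds, 0.0)
filtered = filter_and_score(preds, 0.55)
```

In this toy data the filter removes one wrong low-confidence prediction, so precision rises while recall stays the same; on real data the filter would usually also discard some correct predictions, which lowers recall.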
Active Learning
1. Obtain annotated data
2. Train and evaluate the system
3. Select words with accuracy < 80%
4. Apply WSD to all tokens of the selected words that are not yet annotated
5. Select tokens of meanings with F-score < 80%
Active Learning
6. For each word meaning, rank all the tokens according to the combination (F-score) of:
   1) TiMBL confidence
   2) Distance to the nearest neighbor
7. Select the 50 top-ranked tokens per meaning to be manually reviewed in 2 weeks
8. Go to 1
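The ranking and selection steps of the loop above can be sketched as follows. The slides do not spell out how confidence and distance are combined or in which direction tokens are ranked, so this sketch makes several assumptions: distances are normalised to [0, 1], the two values are combined as the harmonic mean of confidence and closeness (1 - distance), and higher scores rank first. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def combined_score(confidence, distance):
    """Harmonic mean of confidence and closeness (1 - distance); an assumption."""
    closeness = 1.0 - distance
    if confidence + closeness == 0:
        return 0.0
    return 2 * confidence * closeness / (confidence + closeness)


def select_for_review(tagged_tokens, per_meaning=50):
    """Pick the top-ranked tokens per meaning for manual review.

    tagged_tokens: iterable of (token_id, meaning, confidence, distance).
    Returns a dict mapping each meaning to its selected token ids.
    """
    by_meaning = defaultdict(list)
    for token_id, meaning, confidence, distance in tagged_tokens:
        score = combined_score(confidence, distance)
        by_meaning[meaning].append((score, token_id))
    selection = {}
    for meaning, scored in by_meaning.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selection[meaning] = [token_id for _, token_id in scored[:per_meaning]]
    return selection


# Toy WSD output (not project data); per_meaning=2 keeps the example small
tagged = [
    ("t1", "bank_1", 0.9, 0.1),
    ("t2", "bank_1", 0.5, 0.5),
    ("t3", "bank_1", 0.8, 0.6),
    ("t4", "bank_2", 0.7, 0.2),
]
to_review = select_for_review(tagged, per_meaning=2)
```

With `per_meaning=50`, as on the slide, the selected tokens would go to the annotators and, once reviewed, back into step 1 as new training data.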
Future Work
- Fine-tune the active learning
- Optimize the WSD systems
- Combine different WSD systems
- Test on independent texts in an all-words task
- Apply optimal system to full corpora (over 500K