technische universität dortmund
Fakultät für Informatik, LS 8
An Application-Oriented View of Automatic Tagging and Information Extraction
Katharina Morik
Overview
Handling texts – overview
Mark-up languages: services based on annotated texts
Automatic tagging: from lay-out information to tags
Named entity recognition: data-intensive approach – counting in a very large unlabeled corpus, turning frequencies into features, compiling sequences into features
Handling Texts
Granularity: hypertext structure, text, paragraph, word, letters
Learning mode: batch, incremental
Learning goal: adapted organization, classification or clustering, syntactic or semantic structures
Application tasks: personalization, optimization of information access, integration in business processes, reporting
Handling Texts
[Figure: matrix of text granularity (hypertext, text, paragraph, word) against learning task (adaptation, extraction, clustering, classification), with the contributing researchers in the cells – Alesker, Joachims, Neifach; Veltmann, Hüppe, Mintert, Thomas; Helbig; Rössler; Schewe, Wurst; Joachims, Klinkenberg. "This talk" marks the focus of the presentation.]
Intelligent Publishing Using Mark-Ups
Search qualified by semantic category
Self-contained parts of text (atoms) as search result
Composition of one's own text
Presentation according to semantic category
IP4W3 System by Stefan Mintert 1999
[Figure: IP4W3 workflow – the user sends a query (category + word) to the web server; search and selection over the annotated text return a result list of atoms, which the user composes and presents as a new text.]
Qualified search
Presentation of Results
Text Composition
Selected results from two queries combined
Applications: e-Learning, e-Publishing
Intelligent publication on the web: users customize the material to their own needs. "IP4W3", Stefan Mintert 1999, Dortmund
Course material for different groups: from a central repository of presentations or texts, courses are designed for special interests. "Slicing Books", Ingo Dahm 2001, Koblenz-Landau
Additional sequence information allows courses to be tailored to learning types, e.g. top-down from definition to application, or bottom-up from application to definition. Moritz Thomas 1999, Dortmund
Behind the Curtain
Mark-up editor
Editor for defining qualified search

<!-- search pattern -->
<element type="block">
  <any>
    <target-element type="definition">
    </target-element>
  </any>
</element>

fits

<block>
  <em>
    <definition>Characters</definition> are the atomic unit of texts
    according to ISO/IEC 10646. …
  </em>
</block>
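To make the idea concrete, here is a minimal sketch of such a qualified search over an annotated text, assuming the text is available as plain XML; the element names and the helper qualified_search are illustrative, not part of IP4W3.

import xml.etree.ElementTree as ET

# Toy annotated text; element names mirror the example above, not the IP4W3 schema.
DOCUMENT = """
<text>
  <block><em><definition>Characters</definition> are the atomic
  unit of texts according to ISO/IEC 10646.</em></block>
  <block><em>This block mentions characters but defines nothing.</em></block>
</text>
"""

def qualified_search(xml_text, category, word=None):
    """Return every <block> atom that contains an element of the given category,
    optionally restricted to atoms whose text contains the query word."""
    atoms = []
    for block in ET.fromstring(xml_text).iter("block"):
        if block.find(f".//{category}") is None:
            continue                      # block does not contain the category
        text = " ".join("".join(block.itertext()).split())
        if word is None or word.lower() in text.lower():
            atoms.append(text)
    return atoms

print(qualified_search(DOCUMENT, "definition", word="characters"))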
[Figure: behind the curtain, the author provides the annotated text while the administrator provides the DTD/schema, search patterns and style sheets to the web server. The manual annotation is the bottleneck!]
Automatic tagging
WISDOM++ (Univ. Bari): from scanned texts to blocks to XML tags – classification of blocks by C4.5. Altamura, Esposito, Malerba 2000
ADT (Univ. Dortmund): from RTF annotation to XML tags – classification by C4.5. Christian Hüppe 2003
ADT – Input Document
ADT – Manual Annotation of Examples
ADT – Attributes of Examples
RTF control words: presence of a control word in the current and in the preceding paragraph
  ff: neither in this nor in the preceding paragraph
  ft: not in this, but in the preceding paragraph
  tf: in this, but not in the preceding paragraph
  tt: in this as well as in the preceding paragraph
Value of the indentation in the current and in the preceding paragraph
First and second word of the paragraph
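A small sketch of how such a control-word feature could be computed; the control words and paragraphs are invented for illustration.

# Encode the presence of one RTF control word in the current (first letter) and
# the preceding (second letter) paragraph as ff / ft / tf / tt.
def control_word_feature(paragraph_codes, control_word):
    features, previous = [], set()
    for codes in paragraph_codes:
        current = set(codes)
        this_par = "t" if control_word in current else "f"
        prev_par = "t" if control_word in previous else "f"
        features.append(this_par + prev_par)
        previous = current
    return features

# Invented example: control words found in three consecutive paragraphs.
paragraphs = [["\\b", "\\fs32"], ["\\b"], ["\\i"]]
print(control_word_feature(paragraphs, "\\b"))    # ['tf', 'tt', 'ft']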
ADT – Learning
ADT – Classification of Paragraphs
No. of examples per class   F-measure
1                           41 %
2                           94 %
3                           98.3 %
4                           99.68 %

9 classes (tags), 159 paragraphs
Application Options
The combination of ADT and IP4W3 offers qualified search, tailored courseware, and enhanced e-learning without tedious annotation on behalf of the author or administrator.
However, semantic information within paragraphs cannot be captured.
Named entity recognition is necessary!
Named Entity Recognition
Classification of single words into given semantic categories (e.g., person, location, date).
A phrase of the category is a sequence of words with the same label.
Features of a word: linguistic features (e.g., part of speech), letters (e.g., beginning with an upper-case letter), word length, n-grams
Knowledge-intensive vs. data-intensive approaches: linguistic rules, examples, unlabeled text (corpus)
Training time and classification time – sizes of the training and test sets
The Task
Biomedical task, 22 000 word forms (JNLPBA):
  472 000 labeled occurrences for training
  54 173 occurrences for testing
  100 million word forms from Medline as background corpus
German corpus, 33 000 word forms (CoNLL):
  220 189 labeled occurrences for training
  54 173 occurrences for testing
  40 million word forms from the Frankfurter Rundschau as background corpus
Fast learning and classification necessary!
Data-intensive Approach – Marc Rössler
Knowledge-poor:
  no linguistic knowledge
  no given word lists
  no hand-written rules
Use of very large given corpora:
  distribution of word occurrences in the corpus
  frequencies of words
  frequencies of word sequences
Bootstrapping of features:
  1. Learn classifiers from examples
  2. Apply the classifiers to the unlabeled corpus
  3. Extract features from the now-labeled corpus and enhance the examples
  4. Learn classifiers from the enhanced examples
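A high-level sketch of this bootstrapping loop; train, tag, extract_features and enrich are trivial stand-ins that only show the control flow, not Rössler's actual components.

from collections import Counter

def train(examples):                       # stand-in: the "model" is just the example list
    return list(examples)

def tag(model, corpus):                    # stand-in: pretend every known token gets a label
    known = {tok: lab for tok, lab, *_ in model}
    return [(tok, known.get(tok, "O")) for tok in corpus]

def extract_features(tagged_corpus):       # stand-in: membership counts per token
    return Counter(tok for tok, lab in tagged_corpus if lab != "O")

def enrich(examples, corpus_features):     # stand-in: append the corpus-based feature
    return [(tok, lab, corpus_features.get(tok, 0)) for tok, lab, *_ in examples]

def bootstrap(examples, unlabeled_corpus, rounds=2):
    model = train(examples)                          # 1. learn classifiers from examples
    for _ in range(rounds):
        tagged = tag(model, unlabeled_corpus)        # 2. apply them to the unlabeled corpus
        features = extract_features(tagged)          # 3. extract features, enhance examples
        model = train(enrich(examples, features))    # 4. relearn from the enhanced examples
    return model

print(bootstrap([("Peter", "PERSON")], ["Ebenso", "schnell", "hat", "Peter"]))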
The Base Classifier – Input
Features:
  one out of 30 word-surface features (e.g., 4-digit number, uppercase only, starting with a capital letter)
  word length
  positional substrings (at most 8), e.g. for "Konkurrenz":
    last character: z
    before-last and last characters: nz
    last three characters: enz
    first trigram: Kon
    second trigram: onk
    …
    fifth trigram: urr
  window of 3 preceding and 2 succeeding words:
    Ebenso schnell hat Peter Müllers Konkurrenz
Vector of 60 features for each occurrence
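A sketch of this per-token feature extraction; only a small illustrative subset of the 30 surface features is listed, and the feature names are made up.

import re

# Small illustrative subset of the 30 word-surface features.
SURFACE_FEATURES = [
    ("four_digit_number", lambda w: bool(re.fullmatch(r"\d{4}", w))),
    ("uppercase_only",    lambda w: w.isupper()),
    ("capitalized",       lambda w: w[:1].isupper()),
    ("other",             lambda w: True),
]

def token_features(word):
    surface = next(name for name, test in SURFACE_FEATURES if test(word))
    features = {"surface": surface, "length": len(word),
                "last_1": word[-1:], "last_2": word[-2:], "last_3": word[-3:]}
    for i in range(min(5, max(len(word) - 2, 0))):          # first five trigrams
        features[f"trigram_{i + 1}"] = word[i:i + 3]
    return features

print(token_features("Konkurrenz"))
# surface 'capitalized', length 10, last_1 'z', last_2 'nz', last_3 'enz',
# trigram_1 'Kon', trigram_2 'onk', ..., trigram_5 'urr'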
The Base Classifier – Output
A classifier f_c is trained for each category against all others, and a classifier f_NE for "is a NE" vs. "is no NE":
  $f_c(x) = \langle w, x \rangle + b$
  $f_{NE}(x) = \mathrm{sign}(\langle w, x \rangle + b)$
The focus of the sliding window is tagged according to
  $\hat{c} = \arg\max_j f_{c_j}(x)$, provided $f_{NE}(x) > 0$.
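A sketch of this tagging rule with toy linear decision functions; the weight vectors below are random placeholders, not trained SVMs.

import numpy as np

rng = np.random.default_rng(0)
DIM, CATEGORIES = 6, ["LOC", "PER", "ORG"]
w = {c: rng.normal(size=DIM) for c in CATEGORIES}     # toy per-category weights
b = {c: 0.0 for c in CATEGORIES}
w_ne, b_ne = rng.normal(size=DIM), 0.0                # toy NE-vs-no-NE weights

def tag(x):
    if np.dot(w_ne, x) + b_ne <= 0:                   # f_NE(x): is this a named entity at all?
        return "O"
    scores = {c: np.dot(w[c], x) + b[c] for c in CATEGORIES}
    return max(scores, key=scores.get)                # category with the maximal f_c(x)

print(tag(rng.normal(size=DIM)))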
Corpus-based Features – Internal Evidence
Applying the base classifier to the corpus results in new features.
Membership frequencies f_c: how often a word v was seen as a member of the category c, where v is the token described by x.
All f_c > 0 become a feature with the ratio as value.
[Example: membership frequencies f_PERSON of the token "Peter", given as ratios over its 10 corpus occurrences.]
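A sketch of how these membership frequencies could be computed from the automatically tagged corpus; the tiny tagged corpus is invented.

from collections import Counter, defaultdict

# Invented output of the base classifier on the unlabeled corpus: (token, label).
tagged_corpus = [("Peter", "PERSON"), ("Peter", "O"), ("Peter", "PERSON"),
                 ("Dortmund", "LOC"), ("Dortmund", "O")]

occurrences = Counter(tok for tok, _ in tagged_corpus)
member = defaultdict(Counter)
for tok, label in tagged_corpus:
    if label != "O":
        member[tok][label] += 1

# f_c(v) = (#times v was tagged c) / (#occurrences of v); every f_c > 0 becomes a feature.
membership = {tok: {c: n / occurrences[tok] for c, n in cats.items()}
              for tok, cats in member.items()}
print(membership)        # {'Peter': {'PERSON': 0.666...}, 'Dortmund': {'LOC': 0.5}}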
Sequences – Windows
The sequence of words with the same label is considered one token within the sliding window.
Example: in "Ebenso schnell hat Peter Müllers Konkurrenz", both "Peter" and "Müllers" are labeled P(ERSON), so the window becomes
  Ebenso schnell hat [Peter Müllers] Konkurrenz die
with window positions 3 2 1 [focus] 1 2.
  $seq_{c_j} := \mathrm{sign}(f_{c_j}(x_s)) = \mathrm{sign}(f_{c_j}(x_{s+1})) = \ldots = \mathrm{sign}(f_{c_j}(x_e))$
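A sketch of collapsing same-label neighbours into one window token; the labels are invented base-classifier predictions.

from itertools import groupby

tokens = ["Ebenso", "schnell", "hat", "Peter", "Müllers", "Konkurrenz", "die"]
labels = ["O", "O", "O", "PERSON", "PERSON", "O", "O"]     # invented predictions

def collapse(tokens, labels):
    """Merge adjacent tokens that share a (non-O) label into one window token."""
    merged = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        words = [tok for tok, _ in group]
        if label == "O":
            merged.extend((word, label) for word in words)
        else:
            merged.append((" ".join(words), label))
    return merged

print(collapse(tokens, labels))
# [('Ebenso','O'), ('schnell','O'), ('hat','O'), ('Peter Müllers','PERSON'), ('Konkurrenz','O'), ('die','O')]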
Compiling Sequences into Features
Membership frequencies: how often a word v was seen as the first (last) token in a sequence labeled c – internal evidence.
These membership frequencies become new features.
[Example: "Peter" as the first and "Müller" as the last token of a seq_PERSON, given as ratios over 10 corpus occurrences.]
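A sketch of these first/last-in-sequence frequencies; sequences and corpus counts are invented.

from collections import Counter, defaultdict

# Invented tagged sequences and corpus counts.
sequences = [(["Peter", "Müllers"], "PERSON"), (["Peter", "Schmidt"], "PERSON")]
occurrences = Counter({"Peter": 10, "Müllers": 4, "Schmidt": 2})

first, last = defaultdict(Counter), defaultdict(Counter)
for seq, category in sequences:
    first[seq[0]][category] += 1          # token opened a sequence of this category
    last[seq[-1]][category] += 1          # token closed a sequence of this category

first_ratio = {tok: {c: n / occurrences[tok] for c, n in cats.items()}
               for tok, cats in first.items()}
print(first_ratio)                        # e.g. {'Peter': {'PERSON': 0.2}}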
Corpus-based Features – External Evidence
Context frequencies: how often a sequence seq_c was preceded or succeeded by certain words.
A sequence s preceding seq_c is written seq_preC.
Contexts with a relative frequency > 0.01 become features of the preceding words in the sliding window.
[Example: relative context frequencies of "hat" as the first and "Ebenso" as the third word preceding a seq_PERSON.]
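A sketch of these context frequencies; the preceding-word contexts and corpus counts are invented.

from collections import Counter, defaultdict

# Invented contexts: the three words preceding a tagged sequence, plus its category.
contexts = [(("Ebenso", "schnell", "hat"), "PERSON"),
            (("gestern", "abend", "hat"), "PERSON")]
word_counts = Counter({"Ebenso": 34, "schnell": 12, "hat": 23,
                       "gestern": 8, "abend": 5})

pre = defaultdict(Counter)                    # (word, position before the sequence) -> counts
for preceding, category in contexts:
    for position, word in enumerate(reversed(preceding), start=1):
        pre[(word, position)][category] += 1  # position 1 = directly before the sequence

relative = {key: {c: n / word_counts[key[0]] for c, n in cats.items()}
            for key, cats in pre.items()}
print(relative[("hat", 1)])                   # becomes a feature if the ratio exceeds 0.01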
Enhanced Features
Based on the tagging of the unlabeled corpus by the base classifiers, features are extracted:
Internal evidence:
  f_c values, discretized into intervals
  first/last token in seq_c
External evidence:
  first, second, third word in seq_preC
  first/second word in seq_sucC
Training is again performed using the enriched feature set.
Tagging is enhanced by max(length(seq_i)) (read again):
  $\hat{c} = \arg\max_j f_{c_j}(x)$, provided $f_{NE}(x) > 0$, preferring the maximal $\mathrm{length}(seq_c)$.
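One possible reading of that enhanced tagging step as a sketch: among the firing categories, take the highest score and break ties in favour of the longest candidate sequence; scores and lengths are invented.

def enhanced_tag(scores, seq_lengths, ne_score):
    """scores: category -> f_c(x); seq_lengths: category -> length of the candidate seq_c."""
    if ne_score <= 0:                      # f_NE(x) must fire, otherwise no named entity
        return "O"
    return max(scores, key=lambda c: (scores[c], seq_lengths.get(c, 1)))

# Invented values: equal scores, so the longer PERSON sequence wins.
print(enhanced_tag({"PER": 0.7, "ORG": 0.7}, {"PER": 2, "ORG": 1}, ne_score=0.4))   # 'PER'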
Experiments
Does the sequence in window focus enhance the learning result?
Does the use of unlabeled background corpus enhance learning results?
How is the enhancement per round? How many rounds are necessary?
Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
Would a Hidden Markov Model be better?
Does the sequence in window focus enhance the learning result?
                  Instances           F-measure  F-measure  F-measure  Overall    Overall  Overall
                  (training/test)     LOC        PER        ORG        precision  recall   F-measure
Regular n-grams   101 810 / 25 909    50.6       42.67      45.38      69.82      34.18    45.9
Sequences         113 245 / 30 792    52.68      44.19      49.1       89.72      33.13    48.39

Yes.
Does the use of unlabeled background corpus enhance learning results?
                                             F-measure  F-measure  F-measure  Overall    Overall  Overall
                                             LOC        PER        ORG        precision  recall   F-measure
No use of corpus, sequences                  52.68      44.19      49.10      89.72      33.13    48.39
Corpus for internal and external evidence    75.04      91.09      65.36      83.69      73.82    78.44

Yes.
How is the enhancement per round? How many rounds are necessary?
[Figure: overall precision, recall and F-measure (0–100) over bootstrapping rounds 0–7.]
Number of Support Vectors
[Figure: number of support vectors (#SV, 0–12 000) per class (LOC, PER, ORG) over bootstrapping rounds 0–4.]
Is the knowledge-poor approach compatible with approaches using linguistic knowledge?
Author                             F-measure LOC   F-measure PER   F-measure ORG
Volk, Clematide 2001               85.7            88.9            78.4
Neumann, Piskorski 2002            81.1            88.0            79.4
Florian et al. 2003 (best CoNLL)   77.71           83.57           71.08
Rössler                            75.94           91.09           65.36

Hmm…
Would a Hidden Markov Model be better?
[Figure: F-measures per class (Protein, DNA, RNA, Cell Type, Cell Line) and overall recall, precision and F-measure (0–80) for the base classifier, an HMM, and the combination Base+HMM.]
No, but turning its classification into a feature helps SVM!
Summary of Rössler's Approach
The system consists of three components:
  SVM training: fast in a large, sparse vector space
  Feature extraction from large corpora: fast automatic adaptation to a new domain
  The outer loop:
    splitting the instances of an m-class learning problem into m-1 binary problems
    tagging using a voting mechanism
    enhancing the examples by extracted features
The feature approach easily integrates linguistic knowledge or predictions of other learners, if given.
The data-driven approach is language independent. Results are compatible with knowledge-based approaches.
Conclusion
Tagged data allow for enhanced services.
Automatic tagging of paragraphs or tables can easily be done using very few examples in an interactive, incremental way.
Named entity recognition for automatic tagging remains a challenge.