GATE: A Unicode-based Infrastructure Supporting Multilingual Information Extraction
Kalina Bontcheva, Diana Maynard, Valentin Tablan, Hamish Cunningham
Department of Computer Science, University of Sheffield
http://gate.ac.uk/

Structure of the talk:
• A brief introduction to GATE
• Multilingual infrastructure in GATE
• Simple multilingual IE components
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI
Component-based development
CREOLE – Collection of REusable Objects for Language Engineering:
• Java Beans: an OO way of chunking software
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
• Three types: Language Resources, Processing Resources, Visual Resources
Why bother?
• Allows the system to load arbitrary language processing components
Language Resources (LRs)
• LRs are documents, ontologies, corpora, lexicons, ...
• LRs can be associated with DataStores (Oracle,
• Support for defining Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. JEdit, EUDICO)
• Can use a virtual keyboard or standard layouts over QWERTY
• IMs defined in plain text files
• GUK comes with a standalone Unicode editor
Editing Multilingual Data
Processing Multilingual Data
All processing, visualisation and editing tools use GUK

Multilingual IE Components
The ANNIE system – a reusable and easily extendable set of components
The Unicode Tokeniser
A very portable component for multiple languages:
• splits text into typed tokens based on FSM
• dynamically constructed from rules based on character categories defined by Unicode, e.g.:
  UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
• output generally localised by a later module (e.g. “don’t” … “do” “n’t”)
• 23 rules seem able to handle Indo-European languages without changes
• the English tokeniser: Unicode tokeniser + pattern grammar FST
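
The example rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not GATE's tokeniser: it hard-codes that single rule instead of compiling an FSM from a rule file, and the function names and the dictionary-style token are assumptions made for the example. Character classes come from the Unicode general category reported by the standard `unicodedata` module.

```python
import unicodedata

# Map Unicode general-category codes to the category names used in the rule.
CATEGORY_NAMES = {
    "Lu": "UPPERCASE_LETTER",
    "Ll": "LOWERCASE_LETTER",
    "Pd": "DASH_PUNCTUATION",
}


def char_class(ch):
    """Return the rule-level category name for a character, or None."""
    return CATEGORY_NAMES.get(unicodedata.category(ch))


def match_upper_initial(text, start):
    """Apply UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* at `start`.

    Returns the end offset of the match, or None if the rule does not apply.
    """
    if start >= len(text) or char_class(text[start]) != "UPPERCASE_LETTER":
        return None
    end = start + 1
    while end < len(text) and char_class(text[end]) in (
        "LOWERCASE_LETTER",
        "DASH_PUNCTUATION",
    ):
        end += 1
    return end


def tokenise(text):
    """Return upper-initial word tokens; text the rule does not cover is skipped.

    A real tokeniser would have further rules (lowercase words, numbers,
    punctuation, whitespace) so that every character ends up in some token.
    """
    tokens = []
    i = 0
    while i < len(text):
        end = match_upper_initial(text, i)
        if end is None:
            i += 1  # no rule applies here: move on one character
            continue
        tokens.append(
            {"string": text[i:end], "start": i, "end": end,
             "orth": "upperInitial", "kind": "word"}
        )
        i = end
    return tokens
```

Because the rule refers only to Unicode categories, the same code matches Latin, Cyrillic and Greek capitalised words without any script-specific changes, which is the portability point made on the slide.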
POS tagging in new languages
• TIDES Surprise Language: Hepple tagger but substituted Cebuano/Hindi lexicon for English
• Used empty ruleset since no training data available
• Used default heuristics (e.g. return NNP for capitalised words)
• Very experimental, but reasonable results
• 67% correctness for Hindi and 75% for Cebuano
• Adaptation time per language: 2 days
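
A rough sketch of this setup, assuming a lexicon that maps each word to its candidate tags with the most likely tag first. The function name, the toy lexicon entries and the `NN` fallback for unknown lowercase words are assumptions made for illustration; only the `NNP` heuristic for capitalised words is taken from the slide.

```python
def tag(tokens, lexicon, rules=()):
    """Lexicon lookup plus default heuristics, followed by optional rules.

    With an empty `rules` sequence (the setting described above) the output
    is determined entirely by the lexicon and the fallback heuristics.
    """
    tags = []
    for tok in tokens:
        entry = lexicon.get(tok) or lexicon.get(tok.lower())
        if entry:
            tags.append(entry[0])   # first listed tag (assumed most likely)
        elif tok[:1].isupper():
            tags.append("NNP")      # default heuristic: capitalised word
        else:
            tags.append("NN")       # assumed fallback for other unknown words
    # Contextual rules would rewrite the initial tags; none are supplied here.
    for rule in rules:
        tags = rule(tokens, tags)
    return tags


# Toy lexicon, illustrative only: not taken from the TIDES resources.
toy_lexicon = {"ang": ["DT"], "sa": ["IN"]}
print(tag(["ang", "Sugbo", "xyz"], toy_lexicon))
```

Swapping the language then amounts to swapping the lexicon, which fits the short adaptation time reported above.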
Porting NE grammars
• Most English JAPE rules based on POS tags and gazetteer lookup
• Grammars can be reused for languages with similar word order, orthography etc.
• No time to make detailed study of Cebuano, but very similar in structure to English
• Most of the rules were left as for English, with some adjustments, especially to handle dates
• Used both English and Cebuano grammars and gazetteers, because NEs appear in both languages
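
The last point, using both gazetteers together, can be sketched as a merged longest-match lookup over tokens. This is not JAPE or the GATE gazetteer; the function names and the toy entries are assumptions made for illustration only.

```python
def build_gazetteer(*lists):
    """Merge several {phrase: entity_type} lists, keyed by lowercased token tuples."""
    merged = {}
    for entries in lists:
        for phrase, entity_type in entries.items():
            merged[tuple(phrase.lower().split())] = entity_type
    return merged


def lookup(tokens, gazetteer):
    """Longest-match lookup; returns (start, end, type) spans over token indices."""
    max_len = max((len(key) for key in gazetteer), default=0)
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in gazetteer:
                spans.append((i, i + n, gazetteer[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Toy entries, illustrative only: not the lists used in the evaluation.
english = {"Cebu City": "Location", "January": "Date"}
cebuano = {"Sugbo": "Location", "Enero": "Date"}

gazetteer = build_gazetteer(english, cebuano)
print(lookup("Enero sa Cebu City".split(), gazetteer))
```

With the merged list, a sentence that mixes a Cebuano month name with an English place name yields both lookups in a single pass.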
TIDES Evaluation Results

            Cebuano               English Baseline
Entity      P      R      F       P      R      F
Person      71     65     68      36     36     36
Org         75     71     73      31     47     38
Location    73     78     76      65     7      12
Date        83     100    92      42     58     49
Total       76     79     77.5    45     41.7   43
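
For reading the table, the sketch below assumes that F is the balanced F-measure, i.e. the harmonic mean of precision (P) and recall (R); the slides do not state the exact scoring formula. Recomputing F from the rounded P and R values reproduces the Total rows, while a few per-entity rows come out about a point away from the reported figure.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Total rows from the table, with P and R given as percentages.
print(round(f_measure(76, 79), 1))    # Cebuano
print(round(f_measure(45, 41.7), 1))  # English baseline
```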
Conclusion
• GATE – a Unicode-based NLP infrastructure, particularly suitable for multilingual adaptation of IE systems
• Requires little involvement of native speakers and very little annotated data for a basic job