©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials, http://gate.ac.uk/wiki/training-materials-2011.html CSC 9010: Text Mining Applications Fall, 2012 ANNIE Introduction Dr. Paula Matuszek [email protected]

Dec 26, 2015




Recap

The goal of information extraction is to pull some well-defined information out of a large corpus of unstructured documents and put it in a more structured form, for easier access and understandability.

Typically you have
– a domain model
– a knowledge model
– an extraction engine


Examples

Some examples of information extraction in use:
– Watson: http://www.youtube.com/watch?v=WFR3lOm_xhE&feature=player_detailpage
– I2EOnDemand
– ANNIE demo: http://services.gate.ac.uk/annie/
– The MUC and TREC conferences sponsored by the Information Technology Laboratory of the National Institute of Standards and Technology (NIST)
– The TIPSTER program of DARPA


Watson Information Sources

Software is built on top of UIMA (Unstructured Information Management Architecture), a framework built by IBM and now open source.

The information corpus was downloaded and indexed offline; no web access during the game.

For the game, Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia. (http://en.wikipedia.org/wiki/Watson_(computer))

Corpus was developed from a large variety of text sources:
– baseline from Wikipedia, Project Gutenberg, newspaper articles, thesauri, etc.
– extend with web retrieval, extract potentially relevant text “nuggets”, score them for informativeness, merge the best into the corpus

Primary corpus started as unstructured text, not semantically tagged.

About 2% of Jeopardy! answers can be looked up directly.

Also leverages semistructured and structured sources such as WordNet and YAGO.


MUC tracks

MUC-1 (1987), MUC-2 (1989): Naval operations messages.

MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.

MUC-5 (1993): Joint ventures and microelectronics domain.

MUC-6 (1995): News articles on management changes.

MUC-7 (1998): Satellite launch reports.


TREC Tracks

Focus is information retrieval, not just information extraction

Ongoing: find them at http://trec.nist.gov/tracks.html

Some of the 2012 tracks:
– Crowdsourcing Track
– Knowledge Base Acceleration Track
– Legal Track
– Medical Records Track
– Microblog Track

Some earlier tracks:
– Entity Track
– Chemical IR Track
– Genomics Track


Knowledge-Based Approaches

Systems like GATE and I2E use an approach based in natural language processing and modeling knowledge
– domain model
– knowledge model

There are also machine-learning approaches using classifiers and sequence models

And, of course, hybrid approaches. GATE includes some machine-learning based tools (not in ANNIE)


Domain Model

Terms: enumerated strings which are members of some class

Classes: Categories of terms, such as “locations”, “proteins”, “diseases”

Both are often organized into an ontology

Extraction rules


Terms, Classes, Ontologies

I2EOnDemand has a rich ontology in the medical domain; we used it for searching

In ANNIE these are represented by gazetteer lists and by an index file which describes all the gazetteers, including their major type and possibly minor type and language.

GATE also has a number of ontology plugins and tools


Domain Rules

LHS matches some pattern in the domain, RHS does something

In ANNIE, RHS typically adds an annotation

In I2EOnDemand
– tag or annotate members of classes
– identify and extract various kinds of relations
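
As a rough illustration of the LHS/RHS idea, the Python sketch below pairs a pattern (LHS) with an action that adds an annotation (RHS). The `Rule` class, the `apply_rules` helper, and the title-plus-name pattern are hypothetical and are not part of GATE or I2E.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    """A toy rule: the LHS is a regex, the RHS names the annotation to add."""
    lhs: str              # pattern to match in the text
    annotation_type: str  # RHS: type of annotation to create

def apply_rules(text, rules):
    """Return sorted (start, end, type) annotations for every LHS match."""
    annotations = []
    for rule in rules:
        for m in re.finditer(rule.lhs, text):
            annotations.append((m.start(), m.end(), rule.annotation_type))
    return sorted(annotations)

# Hypothetical rule: a title followed by a capitalized word suggests a person.
rules = [Rule(lhs=r"\b(?:Dr|Mr|Ms)\. [A-Z][a-z]+", annotation_type="Person")]
annotations = apply_rules("Ask Dr. Smith about GATE.", rules)
print(annotations)
```

The annotations are character offsets plus a type, which is roughly how stand-off annotation works: the text itself is never modified.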


Knowledge Model

What you’re trying to extract

Can be pretty simple and generic:
– tuples of the form entity-relation-entity
– entity:class relations

Can be more detailed and specific:

Org Name          Villanova University
Org Alias         Nova, Villanova
Org Description   “The local university”
Org Type          Educational Institution
Org Locale        Villanova, PA
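
To make the two styles concrete, here is a small Python sketch of both: generic entity-relation-entity tuples and a more specific template with named slots. The class name, field names, and relation names (`has_type`, `located_in`, `has_alias`) are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Simple, generic form: entity-relation-entity tuples.
Triple = Tuple[str, str, str]

@dataclass
class OrganizationTemplate:
    """Detailed, specific form: one slot per row of the example."""
    name: str
    aliases: List[str] = field(default_factory=list)
    description: str = ""
    org_type: str = ""
    locale: str = ""

    def as_triples(self) -> List[Triple]:
        """Flatten the filled template into generic triples."""
        triples = [(self.name, "has_type", self.org_type),
                   (self.name, "located_in", self.locale)]
        triples += [(self.name, "has_alias", a) for a in self.aliases]
        return triples

villanova = OrganizationTemplate(
    name="Villanova University",
    aliases=["Nova", "Villanova"],
    description="The local university",
    org_type="Educational Institution",
    locale="Villanova, PA",
)
triples = villanova.as_triples()
```

The template is easier to read and validate; the triples are easier to merge across documents.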


Knowledge Model in ANNIE

The knowledge model isn’t specified explicitly; it can basically be considered as all of the annotations which the various processing resources can produce. Each processing resource has an implicit knowledge model.

Additional models are defined in JAPE, which can create additional annotations.


Extraction Engine

Tool which applies rules to text and extracts matches
– Tokenizer
– Part of Speech (POS) Tagger
– Term and class tagger
– Rule engine: match LHS, execute RHS

Rule engine is iterative

May include an interactive component which is essentially a query engine against already extracted information
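
The stages listed above can be sketched as a small pipeline in which each stage adds information to a shared document. This is a toy under stated assumptions: the function names, the dictionary layout, and the one-entry term list are invented here, and the POS tagging stage is left out for brevity.

```python
import re

def tokenize(doc):
    """Add one token per word, number, or punctuation mark."""
    doc["tokens"] = [{"text": m.group(), "start": m.start(), "end": m.end()}
                     for m in re.finditer(r"\w+|[^\w\s]", doc["text"])]

def tag_terms(doc, terms):
    """Term and class tagger: label tokens that appear in a term list."""
    for tok in doc["tokens"]:
        if tok["text"] in terms:
            tok["class"] = terms[tok["text"]]

def run_rules(doc):
    """Rule engine: LHS = a location token preceded by 'in'; RHS = record it."""
    doc["matches"] = []
    toks = doc["tokens"]
    for prev, tok in zip(toks, toks[1:]):
        if prev["text"] == "in" and tok.get("class") == "location":
            doc["matches"].append(("located_in", tok["text"]))

doc = {"text": "The conference was held in Sheffield."}
tokenize(doc)
tag_terms(doc, {"Sheffield": "location"})   # hypothetical one-entry term list
run_rules(doc)
print(doc["matches"])
```

Each stage only reads what earlier stages produced, which is why the order of stages matters.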


ANNIE/GATE Extraction Engine

The entire GATE system can be considered as an extraction engine (although it does a lot more)

Most of the processing resources in ANNIE and many others available in CREOLE are specific extraction engines

Some systems (e.g., AeroText, OpenCalais) make an explicit distinction among the domain model, the knowledge model, and the extraction engine.

The distinction in ANNIE is less clear.


On to ANNIE:

A family of Processing Resources for language analysis included with GATE

Stands for A Nearly-New Information Extraction system.

Uses finite state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.

(LaSIE is the forerunner of ANNIE, focused specifically on information extraction for the MUC conferences)


ANNIE IE Modules

http://gate.ac.uk/sale/tao/splitch6.html#chap:annie


ANNIE Standard Components

These are loaded when you load ANNIE and run the default application:
– Document Reset
– Tokenizer
– Gazetteer: lists of entities
– Sentence Splitter/Regex sentence splitter
– Part of Speech Tagger
– Named Entity Transducer
– Orthomatcher


Document Reset

Task: reset a document to its original state. Releases old annotations for garbage collection.

Parameters:
– annotationTypes: specify annotations to remove. Default is all.
– setsToKeep:
– setsToRemove:
– keepOriginalMarkupsAS: Boolean.


ANNIE Component: Tokenizer

Task: split text into simple tokens

In ANNIE, used as one piece of the English Tokenizer

Five types of tokens, some of which have attributes or subtypes

Uses tokenizer rules
– left hand side (LHS): pattern to be matched
– right hand side (RHS): action to be taken


Default Token Types

Token Types. All have attributes string and length

Tokens
– word. Any contiguous set of upper or lowercase letters, including a hyphen.
  – orthography attribute (orth): upperInitial, allCaps, lowerCase, mixedCaps
– number. Any combination of successive digits.
– symbol. Currency symbols, ^, +, =, etc.
  – symbolkind attribute: currency
– punctuation.
  – position attribute (position): startpunct, endpunct
  – subkind attribute (subkind): dashpunct

SpaceTokens
– space. any white space character that is not a control character
– control. any control character
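
The token types above can be approximated in a few lines of Python. This is only a sketch: it uses ASCII-oriented regular expressions and small hard-coded symbol sets (both assumptions made here), skips the punctuation attributes, and treats a single space as `space` and any other white space as `control`. It does not reproduce the actual ANNIE tokenizer rules.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+|\s|.", re.DOTALL)

def word_orth(text):
    """Pick the orth value for a word, ignoring any hyphens."""
    letters = text.replace("-", "")
    if letters.islower():
        return "lowerCase"
    if letters[0].isupper() and (len(letters) == 1 or letters[1:].islower()):
        return "upperInitial"
    if letters.isupper():
        return "allCaps"
    return "mixedCaps"

def tokenize(text):
    """Return one feature dictionary per Token or SpaceToken."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        s = m.group()
        feats = {"string": s, "length": len(s)}
        if s[0].isalpha():
            feats.update(type="Token", kind="word", orth=word_orth(s))
        elif s.isdigit():
            feats.update(type="Token", kind="number")
        elif s.isspace():
            kind = "space" if s == " " else "control"
            feats.update(type="SpaceToken", kind=kind)
        elif s in "$£€":          # assumed sample of currency symbols
            feats.update(type="Token", kind="symbol", symbolkind="currency")
        elif s in "^+=&":         # assumed sample of other symbols
            feats.update(type="Token", kind="symbol")
        else:
            feats.update(type="Token", kind="punctuation")
        tokens.append(feats)
    return tokens

tokens = tokenize("GATE costs $5")
```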


Tokenizer Rule

Operations used on the LHS:
– | (or)
– * (0 or more occurrences)
– ? (0 or 1 occurrences)
– + (1 or more occurrences)

The RHS uses ’;’ as a separator, and has the following format:

{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}


Example Tokenizer Rule

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
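
For readers more used to regular expressions, the rule above can be mimicked in Python. The regex `[A-Z][a-z]*` is an assumed ASCII stand-in for the LHS (the real rule works on Unicode character categories), and `parse_rhs` and `apply_rule` are hypothetical helpers written for this sketch.

```python
import re

def parse_rhs(rhs):
    """Split 'Token;orth=upperInitial;kind=word;' into a type and features."""
    parts = [p for p in rhs.split(";") if p]
    annotation_type, features = parts[0], {}
    for part in parts[1:]:
        key, value = part.split("=", 1)
        features[key] = value
    return annotation_type, features

# Assumed regex equivalent of the LHS: one uppercase letter, then 0+ lowercase.
LHS = re.compile(r"[A-Z][a-z]*")
RHS = "Token;orth=upperInitial;kind=word;"

def apply_rule(text):
    """Annotate text if the whole string matches the LHS; otherwise None."""
    if LHS.fullmatch(text):
        annotation_type, features = parse_rhs(RHS)
        return {"type": annotation_type, "string": text, **features}
    return None

print(apply_rule("Villanova"))
```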


English Tokenizer

Task: adapt generic tokenizer for POS tagger by dealing with constructs involving apostrophes
– don’t --> do + n’t
– you’ve --> you + ’ve

Uses a JAPE transducer: Java Annotation Patterns Engine.
– LHS: annotation pattern description
– RHS: annotation manipulation statement

If you’re going to use the POS tagger, always use this tokenizer
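
The apostrophe handling can be illustrated with a short Python function. This is a sketch of the effect only: ANNIE does this with JAPE rules over token annotations, not with a regular expression, and the clitic list below is an assumed, incomplete sample.

```python
import re

# Assumed, incomplete list of clitics; both straight and curly apostrophes.
CLITIC_RE = re.compile(
    r"^(?P<stem>\w+?)(?P<clitic>n['’]t|['’](?:ve|re|ll|s|d|m))$",
    re.IGNORECASE,
)

def split_contraction(word):
    """Split a contraction into stem and clitic; return [word] otherwise."""
    m = CLITIC_RE.match(word)
    return [m.group("stem"), m.group("clitic")] if m else [word]

print(split_contraction("don't"))
print(split_contraction("you've"))
```

Splitting this way gives the POS tagger separate tokens to tag, for example a verb and a negation.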


ANNIE Component: Gazetteer

Task: identify entity names based on lists
– Located in <GATEhome>/plugins/ANNIE/resources/gazetteer
– defined in lists.def, which gives name, major type, minor type, language

Example lists.def entries:
– airports.lst:location:airport
– city.lst:location:city
– festival.lst:date:festival
– day.lst:date:day
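
The colon-separated index format is easy to read programmatically. The following Python sketch parses entries like the ones above into named fields; `ListDef` and `parse_lists_def` are names invented here, and the last sample entry (`example.lst:example`) is hypothetical, included only to show a list with no minor type.

```python
from typing import NamedTuple, Optional

class ListDef(NamedTuple):
    filename: str
    major_type: Optional[str] = None
    minor_type: Optional[str] = None
    language: Optional[str] = None

def parse_lists_def(lines):
    """Parse 'file:major:minor:language' lines; trailing fields may be absent."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(":")
        fields += [None] * (4 - len(fields))   # pad missing optional fields
        entries.append(ListDef(*fields[:4]))
    return entries

entries = parse_lists_def([
    "airports.lst:location:airport",
    "city.lst:location:city",
    "festival.lst:date:festival",
    "example.lst:example",          # hypothetical entry without a minor type
])
```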


Gazetteer lists

The gazetteer lists used are plain text files, with one entry per line.

Typical lists include
– named entities, such as locations, organizations, dates, names
– grammatical entities such as determiners
– components of entities, such as location prefixes
– anything else that can usefully be enumerated

Lists don’t always have a minor type or language. Many are one-offs. (spur_ident)


Gazetteer Parameters

Init-time parameters
– listsURL: index file. Default is lists.def
– encoding: default is UTF-8
– gazetteerFeatureSeparator. Not required.
– caseSensitive (Boolean). Default is true.

Run-time parameters
– document to process
– annotationSetName: what to call the Lookup annotation set. Not required.
– wholeWordsOnly: Default is true
– longestMatchOnly: Default is true. EG: Amazon UK
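
The “Amazon UK” example is about what happens when one gazetteer entry contains another. The Python sketch below shows one simplified reading of longestMatchOnly: when it is on, only the longest entry starting at a position is reported and the entries nested inside it are skipped. The `lookup` function and the three-entry gazetteer are assumptions for illustration; the real gazetteer is a finite state machine over characters, not a token loop.

```python
def lookup(tokens, gazetteer, longest_match_only=True):
    """Return (start, end, major_type) token spans found in the gazetteer."""
    max_len = max(len(entry.split()) for entry in gazetteer)
    found, i = [], 0
    while i < len(tokens):
        # Candidate spans starting at token i, longest first.
        spans = [(i, j) for j in range(min(len(tokens), i + max_len), i, -1)
                 if " ".join(tokens[i:j]) in gazetteer]
        if longest_match_only:
            spans = spans[:1]
        found += [(s, e, gazetteer[" ".join(tokens[s:e])]) for s, e in spans]
        # Skip past the longest match so nested entries are not reported.
        i = spans[0][1] if (spans and longest_match_only) else i + 1
    return found

# Hypothetical gazetteer entries mapped to their major types.
gazetteer = {"Amazon": "organization", "Amazon UK": "organization",
             "UK": "location"}
tokens = "Shop at Amazon UK today".split()
print(lookup(tokens, gazetteer))
print(lookup(tokens, gazetteer, longest_match_only=False))
```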


Gazetteer Editor

You can add or modify
– the name of a list
– the major and minor categories in the list
– an entry in the list

You can delete a list by right-clicking the name

You can also add a new list

Or edit everything outside of GATE with a text editor.


ANNIE Component: Sentence Splitter

Task: just what it says: segments the text into sentences.

The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

This module is required for the POS tagger.
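
The role of the abbreviation list can be seen in a minimal Python sketch: a word ending in a full stop closes a sentence unless it is a known abbreviation. The `ABBREVIATIONS` set is a small assumed sample, and the white-space-based approach is much cruder than the actual splitter.

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "e.g.", "etc."}   # assumed sample list

def split_sentences(text):
    """Split on ., ! or ? at the end of a word, except after abbreviations."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?" and word not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:                      # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

sentences = split_sentences(
    "Dr. Matuszek teaches text mining. ANNIE is part of GATE.")
print(sentences)
```

Without the abbreviation check, “Dr.” would be treated as a one-word sentence.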


POS Tagger

Modified version (Hepple) of the Brill tagger, probably the best-known of the POS taggers
– produces a POS tag as an annotation on each word or symbol
– Uses a default lexicon and rule set
– alternate lexicons for all upper-case and all lower-case corpora.

Must run English Tokenizer and Sentence Splitter first


Partial List of Hepple (Penn treebank) Tags

CC - coordinating conjunction: ‘and’, ‘but’, ‘nor’, ‘or’, ‘yet’, plus, minus, less.
CD - cardinal number
IN - preposition or subordinating conjunction
JJ - adjective: Hyphenated compounds that are used as modifiers; happy-go-lucky.
JJR - adjective - comparative: Adjectives with the comparative ending ‘-er’.
JJS - adjective - superlative: Adjectives with the superlative ending ‘-est’ (and ‘worst’).
NN - noun - singular or mass
NNS - noun - plural
NP - proper noun - singular
NPS - proper noun - plural
POS - possessive ending: Nouns ending in ‘’s’ or ‘’’.
PP - personal pronoun
RB - adverb: most words ending in ‘-ly’. Also ‘quite’, ‘too’, ‘very’, others
RBR - adverb - comparative: adverbs ending with ‘-er’ with a comparative meaning.
RBS - adverb - superlative
SYM - symbol: technical symbols or expressions that aren’t English words.
UH - interjection: Such as ‘my’, ‘oh’, ‘please’, ‘uh’, ‘well’, ‘yes’.
VBD - verb - past tense
VBG - verb - gerund or present participle
VBN - verb - past participle
VBP - verb - non-3rd person singular present
VB - verb - base form: subsumes imperatives, infinitives and subjunctives.
VBZ - verb - 3rd person singular present


Example Lexicon Entries

'30s CD NNS
& CC SYM
1995-1999 CD
Announcement NN
Figuring VBG
Frequently RB
...

A total of over 17,000 entries


Brill Tagger

A Brill tagger starts by assigning tags to all the known words from the lexicon

It next assigns NNP to capitalized unknown words and NN to everything else unknown.

Finally, a bunch of replacement rules change the assignment based on various features and context.


Example Brill Tagger Rules

replace(pos 'NN' 'JJ') # [suffix#less#[0]]
– "replace tag NN with JJ if the word in question ends in "less"".

replace(pos 'VB' 'NN') # [canHave#'NN'#[0] pos#'DT'#[~1]]
– "replace tag VB with NN if the word in question can have tag NN (according to the lexicon) and if the previous word is tagged DT"

From http://www.ling.gu.se/~lager/mogul/brill-tagger/index.html
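
The three Brill tagging steps and the two example rules can be put together in a short Python sketch. The three-word lexicon is hypothetical, lookup is lower-cased for simplicity (the lexicon entries shown earlier are case-sensitive), and the rules are hard-coded rather than parsed from the rule syntax.

```python
LEXICON = {                      # hypothetical mini-lexicon: word -> possible tags
    "the": ["DT"],
    "help": ["VB", "NN"],        # first tag is treated as the most likely one
    "was": ["VBD"],
}

def initial_tags(words):
    """Steps 1 and 2: lexicon lookup, then NNP/NN defaults for unknown words."""
    tags = []
    for w in words:
        if w.lower() in LEXICON:
            tags.append(LEXICON[w.lower()][0])
        elif w[0].isupper():
            tags.append("NNP")
        else:
            tags.append("NN")
    return tags

def apply_rules(words, tags):
    """Step 3: the two example replacement rules, applied in order."""
    tags = list(tags)
    for i, w in enumerate(words):
        # replace(pos 'NN' 'JJ') # [suffix#less#[0]]
        if tags[i] == "NN" and w.endswith("less"):
            tags[i] = "JJ"
    for i, w in enumerate(words):
        # replace(pos 'VB' 'NN') # [canHave#'NN'#[0] pos#'DT'#[~1]]
        can_be_noun = "NN" in LEXICON.get(w.lower(), [])
        if tags[i] == "VB" and can_be_noun and i > 0 and tags[i - 1] == "DT":
            tags[i] = "NN"
    return tags

words = "The help was useless".split()
tags = apply_rules(words, initial_tags(words))
print(list(zip(words, tags)))
```

Here “help” starts as VB from the lexicon and is corrected to NN because it follows a determiner, while the unknown word “useless” starts as NN and becomes JJ through the suffix rule.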


LOTS of Parameters

Init-Time: all optional
– encoding - encoding to be used for reading rules and lexicons
– lexiconURL - The URL for the lexicon file
– rulesURL - The URL for the ruleset file

Run-time:
– document - The document to be processed. Required
– inputASName - Input annotation set. Optional
– outputASName - Output annotation set. Optional
– baseTokenAnnotationType - name of annotation type for tokens (default = Token). Required.
– baseSentenceAnnotationType - name of annotation type for sentences (default = Sentence). Required.
– outputAnnotationType - default = Token. Required.
– failOnMissingInputAnnotations - Default is true.


Named Entity Transducer

Task: find terms that suggest entities

Hand-crafted grammars define patterns over the annotations

Written in JAPE

Doesn’t require that you run anything else, but doesn’t do much if you don’t already have annotations.

Adds new annotation types


Required Parameters

Initialization
– encoding. default is UTF-8
– grammar. default is <GATEhome>/plugins/ANNIE/resources/NE/main.jape (which loads a bunch more)

Runtime
– Corpus name (will default if there’s only one)


Example JAPE rules

Phase: Unknown
Input: Location Person Date Organization Lookup

Rule: Known
Priority: 100
( {Location}| {Person}| {Date}| {JobTitle}):known
-->
{}

Rule:Unknown
Priority: 50
( {Token.category == NNP}) :unknown
-->
:unknown.Unknown = {kind = "PN", rule = Unknown}

(This is only a part of the actual rules)
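
One way to read the two rules together: the higher-priority Known rule matches text that already carries an entity annotation and does nothing (its RHS is empty), so the Unknown rule only ends up labelling proper nouns that nothing else has claimed. The Python sketch below is a loose approximation of that effect, not of how JAPE actually executes; the function name and the dictionary layout are assumptions made here.

```python
KNOWN_TYPES = {"Location", "Person", "Date", "JobTitle"}

def find_unknowns(tokens, annotations):
    """tokens: dicts with start, end, category.
    annotations: dicts with start, end, type."""
    known_spans = [(a["start"], a["end"]) for a in annotations
                   if a["type"] in KNOWN_TYPES]
    unknowns = []
    for tok in tokens:
        covered = any(s <= tok["start"] and tok["end"] <= e
                      for s, e in known_spans)
        if covered:
            continue   # rule Known (priority 100) wins; its RHS {} adds nothing
        if tok["category"] == "NNP":
            # rule Unknown (priority 50): add an Unknown annotation
            unknowns.append({"start": tok["start"], "end": tok["end"],
                             "type": "Unknown", "kind": "PN",
                             "rule": "Unknown"})
    return unknowns

# "Paula visited Zorbia": Paula is already a Person, Zorbia is unrecognized.
tokens = [{"start": 0, "end": 5, "category": "NNP"},
          {"start": 6, "end": 13, "category": "VBD"},
          {"start": 14, "end": 20, "category": "NNP"}]
annotations = [{"start": 0, "end": 5, "type": "Person"}]
unknowns = find_unknowns(tokens, annotations)
```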


Orthomatcher

Task: add co-reference information to Named Entities based on orthographic information.

In a typical text, a NE may be referred to multiple times, with abbreviations and acronyms
– Creates match lists for tagged NEs
– May assign an entity type to previously unclassified entities
– Won’t reassign already classified entities

Without NE Transducer annotations, it doesn’t actually do anything.
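
To give a feel for orthographic matching, the Python sketch below groups mention strings into chains when they are equal, when one is a whole word of the other, or when one is an acronym of the other. These three criteria and all function names are simplifying assumptions; the actual Orthomatcher uses a larger rule set plus the lists in its definition file.

```python
def is_acronym_of(short, full):
    """True if `short` equals the initials of the capitalized words in `full`."""
    initials = "".join(w[0] for w in full.split() if w[0].isupper())
    return len(short) > 1 and short == initials

def same_entity(a, b):
    """Assumed, simplified matching criteria for two mention strings."""
    short, full = sorted((a, b), key=len)
    return (short == full
            or short in full.split()
            or is_acronym_of(short, full))

def orthomatch(mentions):
    """Group mentions into chains of strings that look like the same entity."""
    chains = []
    for mention in mentions:
        for chain in chains:
            if any(same_entity(mention, other) for other in chain):
                chain.append(mention)
                break
        else:                         # no existing chain matched
            chains.append([mention])
    return chains

chains = orthomatch(["International Business Machines", "IBM",
                     "Paula Matuszek", "Matuszek"])
print(chains)
```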


Required Parameters

Required Init-time
– definitionFileURL. Default is <GATEhome>/plugins/ANNIE/resources/othomatcher/listsNM.def.
– encoding. Default is UTF-8.
– minimumNicknameLikelihood: default is 0.5.

Run-time
– corpus or document
– Can also have: AnnotationTypes: list of types to be processed. Default: Organization, Person, Location, Date. Not required.


Co-reference Editor

Can modify the reference chains within a document
– Delete an existing member of the chain
– Add a new item to a chain

This will change all relevant chains: adding an item to a chain will change the chain for all the items in it

This will not change NE types or other annotations; it doesn’t propagate.


Some Additional ANNIE Modules

These modules are not included in the default ANNIE load, but they are available in processing resources
– Morphological analyzer (stemmer). Adds “root” annotation
– Pronominal Coreference. Adds “coreference” annotation

May need to load tools from CREOLE manager first


Summary

The default ANNIE application uses a sequence of processing resources to create annotations over unstructured text

The tokenizer, sentence splitter, POS tagger and orthomatcher are (mostly) domain-independent

The gazetteer lists and Named Entity Transducer are domain-dependent; ANNIE comes with a reasonably good default set.

The domain information is largely captured in gazetteer lists and JAPE rules.