Software Architecture for Language Engineering (SALE) – where next? http://gate.ac.uk/ http://nlp.shef.ac.uk/ Hamish Cunningham IBM TJ Watson, 1 st August/2003

Dec 27, 2015

Software Architecture for Language Engineering (SALE)

– where next?

http://gate.ac.uk/ http://nlp.shef.ac.uk/

Hamish Cunningham

IBM TJ Watson, 1st August 2003

Structure of the Talk

1. SALE and its context
• Definitions
• The Knowledge Economy and HLT
• Software Lifecycle

2. GATE, a General Architecture for Text Engineering
• History
• Summary of Features and Principles
• Component-based development
• Unicode support
• Measurement
• CREOLE: some components
• Users and Projects

3. Where Next (give up and go home)?
• Future context
• Desirables
• Conclusion

SALE: definitions

• Computational Linguistics: science of language that uses computation as an investigative tool.

• Natural Language Processing: science of computation whose subject matter is data structures and algorithms for human language processing.

• Language Engineering: building systems whose cost and outputs are measurable and predictable.

• Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to mean infrastructure.

• SALE: software infrastructure, architecture and development tools for applied NLP and LE.

The Knowledge Economy and Human Language

Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications

• through 2012 more than 95% of human-to-computer information input will involve textual language

A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language

The challenge: to reconcile these two opposing tendencies

IE and Knowledge: Closing the Language Loop

[Diagram: Human Language and Formal Knowledge (ontologies and instance bases) joined in a loop. Labels on the diagram: (A)IE, CLIE, OIE, (M)NLG, Controlled Language; Semantic Web, Semantic Grid, Semantic Web Services.]

KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE

Software lifecycle in collaborative research

Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.

Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.

Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.

Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").

Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).

Where did GATE come from?

Early to mid-1990s (e.g. in TIPSTER):
• Increasing trend towards multi-site collaborative projects
• Role of engineering in scalable, reusable, and portable HLT
• Support for large data, in multiple media, languages, formats, and locations
• Lower cost of creation of language processing components
• Promote quantitative evaluation metrics via tools and a level playing field

GATE history:
• 1996 – 2002: GATE version 1, proof of concept
• March 2002: version 2, rewritten in Java, component based, LGPL, more users
• Fall 2003: new development cycle

GATE is...
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., a graphical development environment.

GATE comes with...
• Some free components... ...and wrappers for other people's components
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/

Architectural principles

• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI

Component-based development

CREOLE: a Collection of REusable Objects for Language Engineering:
• Java Beans: an OO way of chunking software
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL

Why bother?
• Allows the system to load arbitrary language processing components

CREOLE lifecycle
• Bootstrap: stub Java class, Makefile, config
• Registration: URL / JAR / creole.xml
• Instantiation: class loading, parameterisation, bean object creation
  – load-time parameters, e.g. a document’s charset
  – run-time parameters, e.g. a parser’s lexicon

Three types of beans (not a new religion!):
• Language Resources, e.g. doc, corpus, lexicon
• Processing Resources, e.g. tagger, stat modeller
• Visual Resources, e.g. doc editor, syntax editor

Language Resources (LRs)
• GATE LRs are documents, ontologies, corpora, lexicons, …
• LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
• Documents / corpora:
  – Diverse document formats: text, HTML, XML, email, RTF, SGML
  – Optional format-preserving markup analyse / save
• Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
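
The standoff model named in the last bullet can be illustrated with a minimal sketch. This is not the GATE API: the class names StandoffSketch, Document and Annotation, the sample sentence, and the choice of an exclusive end offset are all assumptions made for illustration. The point it shows is that the text is never modified; each annotation only records (start, end, type, features) beside it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of standoff annotation; not the GATE API.
class StandoffSketch {

    /** One annotation: character offsets into the text, a type, and free-form features. */
    static final class Annotation {
        final int start;   // inclusive character offset (assumed convention)
        final int end;     // exclusive character offset (assumed convention)
        final String type;
        final Map<String, Object> features = new LinkedHashMap<>();

        Annotation(int start, int end, String type) {
            if (start < 0 || end < start) {
                throw new IllegalArgumentException("bad span: " + start + ".." + end);
            }
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** The document keeps its text untouched; annotations are stored beside it. */
    static final class Document {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }

        /** Adds an annotation over text[start, end) and returns it so features can be set. */
        Annotation annotate(int start, int end, String type) {
            if (end > text.length()) {
                throw new IllegalArgumentException("span runs past end of text");
            }
            Annotation a = new Annotation(start, end, type);
            annotations.add(a);
            return a;
        }

        /** Recovers the text covered by an annotation from its offsets. */
        String covered(Annotation a) {
            return text.substring(a.start, a.end);
        }

        /** Returns all annotations of the given type, in insertion order. */
        List<Annotation> ofType(String type) {
            List<Annotation> result = new ArrayList<>();
            for (Annotation a : annotations) {
                if (a.type.equals(type)) {
                    result.add(a);
                }
            }
            return result;
        }
    }

    /** Builds a small annotated document; overlapping annotations are allowed. */
    static Document example() {
        Document doc = new Document("Hamish Cunningham spoke at IBM.");
        doc.annotate(0, 6, "Token").features.put("orthography", "upperInitial");
        doc.annotate(0, 17, "Person");
        doc.annotate(27, 30, "Organization");
        return doc;
    }
}
```

Because annotations refer to the text only by offset, several layers (tokens, entities, syntax) can overlap freely, which inline markup cannot express without nesting constraints.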

Processing Resources (PRs)
• Algorithmic components known as PRs – beans with execute methods.
• Controllers: execute a set of PRs
  – SerialController: sequential run of arbitrary PR set
  – SerialAnalyserController: analyser PRs over corpus
  – Conditional controllers: execution depends on features
  – Parallel controller?
• PRs + Controller = Applications
• Application parameterisation state can be saved and restored, and used for embedding / batching
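
The "PRs + Controller = Applications" idea can be sketched in a few lines. The names below echo the slide, but the code is a hypothetical stand-in and not GATE's classes: Document here is a toy type holding text and a list of string annotations, and the two PRs (a whitespace tokeniser and a token counter) are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of PRs and controllers; not the GATE API.
class ControllerSketch {

    /** Toy document: immutable text plus annotations recorded as strings. */
    static final class Document {
        final String text;
        final List<String> annotations = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }
    }

    /** A processing resource is a bean with an execute method. */
    interface ProcessingResource {
        void execute(Document doc);
    }

    /** Runs an arbitrary set of PRs over one document, in the order they were added. */
    static class SerialController {
        protected final List<ProcessingResource> prs = new ArrayList<>();

        void add(ProcessingResource pr) {
            prs.add(pr);
        }

        void execute(Document doc) {
            for (ProcessingResource pr : prs) {
                pr.execute(doc);
            }
        }
    }

    /** Runs the same serial pipeline over every document of a corpus. */
    static final class SerialAnalyserController extends SerialController {
        void execute(List<Document> corpus) {
            for (Document doc : corpus) {
                execute(doc);
            }
        }
    }

    /** Builds a two-PR application and runs it over a two-document corpus. */
    static List<Document> example() {
        SerialAnalyserController app = new SerialAnalyserController();
        // PR 1: whitespace tokeniser (invented for this sketch).
        app.add(doc -> {
            for (String word : doc.text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    doc.annotations.add("Token:" + word);
                }
            }
        });
        // PR 2: depends on the output of PR 1, so order matters.
        app.add(doc -> {
            long n = doc.annotations.stream().filter(a -> a.startsWith("Token:")).count();
            doc.annotations.add("TokenCount:" + n);
        });
        List<Document> corpus = Arrays.asList(
                new Document("GATE loads components"),
                new Document("Hello"));
        app.execute(corpus);
        return corpus;
    }
}
```

The second PR reads what the first one wrote, which is why a serial controller fixes the execution order; a conditional controller would additionally test a document feature before running each PR.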

Visual Resources (VRs)

VRs (2): Coreference

VRs (3): Syntax

Editing Multilingual Data

GATE Unicode Kit (GUK): complements Java’s facilities
• Support for defining Input Methods (IMs)
• Currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. JEdit)

Processing Multilingual Data

All processing, visualisation and editing tools use GUK

 Performance Evaluation

• At document level – annotation diff
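
A document-level annotation diff boils down to comparing a response annotation set against a key (gold-standard) set. The sketch below is an assumption-laden illustration, not the GATE tool: annotations are encoded as "type:start:end" strings, only exact matches count as correct, and the class and method names are hypothetical. It computes the usual precision, recall and F1 figures.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of strict annotation comparison; not the GATE annotation diff tool.
class AnnotationDiffSketch {

    /** Result of one comparison. */
    static final class Scores {
        final int correct;
        final double precision;
        final double recall;
        final double f1;

        Scores(int correct, double precision, double recall, double f1) {
            this.correct = correct;
            this.precision = precision;
            this.recall = recall;
            this.f1 = f1;
        }
    }

    /**
     * Compares response annotations against key annotations.
     * Each annotation is a "type:start:end" string (an encoding assumed for this sketch);
     * an annotation is correct only if type and both offsets match exactly.
     */
    static Scores compare(Set<String> key, Set<String> response) {
        Set<String> matched = new HashSet<>(key);
        matched.retainAll(response);
        int correct = matched.size();
        double precision = response.isEmpty() ? 0.0 : (double) correct / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct / key.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new Scores(correct, precision, recall, f1);
    }
}
```

Running the same comparison over every document of a fixed corpus, and storing the scores per software version, gives the kind of regression tracking described for the corpus-level tool.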

Regression Test

At corpus level – corpus benchmark tool – tracking system’s performance over time

More CREOLE

1. JAPE, FSTs over annotations

2. ANNIE, A Nearly-New IE system

3. DAML+OIL, Protégé, Ontology-Aware IE

4. Information Retrieval, Lucene

5. WordNet

6. Machine Learning support

FSTs over annotations

JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )+
      {Lookup.kind == companyDesignator}
    ):match
    -->
    :match.NamedEntity = { kind=company, rule="Company1" }

Info Extraction Components

The ANNIE system – a reusable and easily extendable set of components

Populating Ontologies with IE

Protégé and Ontology Management

Information Retrieval

Currently based on the Lucene IR engine

WordNet support

Machine Learning support

• Uses classification.
  [Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations. (Documents can be classified as well using a simple trick.)
• Annotations of a particular type are selected as instances.
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance annotation they refer to.

Attributes

Attributes can be:

– Boolean
  The [lack of] presence of an annotation of a particular type [partially] overlapping the referred instance annotation.

– Nominal
  The value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a priori.

– Numeric
  The numeric value (converted from String) of a particular feature of the referred instance annotation.

Implementation

Machine Learning PR in GATE.

Has two functioning modes:
– training
– application

Uses an XML file for configuration:

    <?xml version="1.0" encoding="windows-1252"?>
    <ML-CONFIG>
      <DATASET> … </DATASET>
      <ENGINE> … </ENGINE>
    </ML-CONFIG>

<DATASET>

    <DATASET>
      <INSTANCE-TYPE>Token</INSTANCE-TYPE>
      <ATTRIBUTE>
        <NAME>POS_category(0)</NAME>
        <TYPE>Token</TYPE>
        <FEATURE>category</FEATURE>
        <POSITION>0</POSITION>
        <VALUES>
          <VALUE>NN</VALUE>
          <VALUE>NNP</VALUE>
          …
        </VALUES>
        [<CLASS/>]
      </ATTRIBUTE>
      …
    </DATASET>
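
To make the role of POSITION concrete, here is a minimal sketch of how a row of nominal attribute values might be assembled for one instance. It is not the GATE implementation: the class AttributeRowSketch, the representation of Token annotations as a plain list of category strings, the use of null for positions that fall outside the sequence, and the example tag set are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of positional nominal attributes; not the GATE ML code.
class AttributeRowSketch {

    /** A nominal attribute: a name, an offset relative to the instance, and its declared values. */
    static final class Attribute {
        final String name;
        final int position;          // 0 = the instance itself, -1 = previous, +1 = next
        final List<String> values;   // the complete set of acceptable values, declared up front

        Attribute(String name, int position, List<String> values) {
            this.name = name;
            this.position = position;
            this.values = values;
        }
    }

    /**
     * Builds the attribute row for instance i.
     * categories holds the "category" feature of each Token, in document order.
     * A position outside the sequence yields null (an assumed convention for missing values).
     */
    static List<String> row(List<String> categories, int i, List<Attribute> attributes) {
        List<String> row = new ArrayList<>();
        for (Attribute attribute : attributes) {
            int j = i + attribute.position;
            if (j < 0 || j >= categories.size()) {
                row.add(null);
                continue;
            }
            String value = categories.get(j);
            if (!attribute.values.contains(value)) {
                throw new IllegalArgumentException(
                        "undeclared value " + value + " for " + attribute.name);
            }
            row.add(value);
        }
        return row;
    }

    /** Three attributes reading the same feature at positions -1, 0 and +1. */
    static List<Attribute> exampleAttributes() {
        List<String> tags = Arrays.asList("DT", "NN", "NNP", "VBZ");
        return Arrays.asList(
                new Attribute("POS_category(-1)", -1, tags),
                new Attribute("POS_category(0)", 0, tags),
                new Attribute("POS_category(1)", 1, tags));
    }
}
```

With Token categories DT, NN, VBZ, the row for the middle token contains all three tags, while the row for the first token has no value at position -1. Rejecting undeclared values mirrors the requirement that a nominal attribute list its acceptable values in advance.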

<ENGINE>

    <ENGINE>
      <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
      <OPTIONS>
        <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
        <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
        <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
      </OPTIONS>
    </ENGINE>

Now: WEKA
Soon: Torch? YASMET? TIMBL?

Attributes Position

Instances type: Token

Standard Use Scenario

Training
• Prepare training annotations.
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
• Update the configuration file accordingly.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]

Application
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]

Using Other ML Libraries

The MLEngine Interface
• void addTrainingInstance(List attributes)
  Adds a new training instance to the dataset.
• Object classifyInstance(List attributes)
  Classifies a new instance.
• void init()
  This method will be called after an engine is created and has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition)
  Sets the definition for the dataset used.
• void setOptions(org.jdom.Element options)
  Sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr)
  Registers the PR using the engine with the engine.
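
To give a feel for what plugging in another learner involves, here is a cut-down sketch of an engine. It is deliberately not the real interface: SimpleMLEngine is a hypothetical subset that keeps only init, addTrainingInstance and classifyInstance, and drops setDatasetDefinition, setOptions and setOwnerPR so that the example compiles without GATE or JDOM. The convention that the class value is the last element of a training instance is also an assumption made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, reduced version of an ML engine; not the GATE MLEngine interface.
class MlEngineSketch {

    /** Subset of the methods listed on the slide, kept free of GATE and JDOM types. */
    interface SimpleMLEngine {
        void init();

        void addTrainingInstance(List<Object> attributes);

        Object classifyInstance(List<Object> attributes);
    }

    /**
     * Baseline learner: ignores the attributes and always predicts the most
     * frequent class seen in training. Assumes the class value is the last
     * element of each training instance.
     */
    static final class MajorityClassEngine implements SimpleMLEngine {
        // Insertion-ordered so that ties resolve to the class seen first.
        private final Map<Object, Integer> counts = new LinkedHashMap<>();

        @Override
        public void init() {
            counts.clear();
        }

        @Override
        public void addTrainingInstance(List<Object> attributes) {
            if (attributes.isEmpty()) {
                throw new IllegalArgumentException("instance has no class value");
            }
            Object label = attributes.get(attributes.size() - 1);
            counts.merge(label, 1, Integer::sum);
        }

        @Override
        public Object classifyInstance(List<Object> attributes) {
            Object best = null;
            int bestCount = 0;
            for (Map.Entry<Object, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    best = entry.getValue() > 0 ? entry.getKey() : best;
                    bestCount = entry.getValue();
                }
            }
            return best; // null if nothing has been learnt yet
        }
    }
}
```

A wrapper for a real library would follow the same shape: accumulate instances during training mode, build the model when needed, and delegate classifyInstance to the library's own prediction call.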

A bit of a nuisance (GATE users)

GATE team projects.

Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science

Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• EMILLE: S. Asian languages corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging

Future:
• Five new projects (below)

Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...

Where Next (1)?
• Can Universities cope with the long term?
• User survey
• Future context:
  – SEKT: Knowledge Management
  – KnowledgeWeb: OntoWeb II
  – PrestoSpace: audiovisual preservation (FSTs for users?)
  – hTechSight: knowledge portal for petrochemicals
  – ETCSL: Electronic Text Corpus of Sumerian Language
  – DERI: Digital Enterprise Research Institute
  – PhDs: INK, PIE

Where Next (2)?

• Some desirables:
  – Corpus tools (ANNIC in progress)
  – Audiovisual documents
  – WS-based backend server, for ML, active learning etc.
  – Better dialogue support (cf. AMITIES, Galaxy)
  – Better MT support
  – PDF documents
  – JAPE debugger, editor, 101 language extensions (e.g. quantified ops, deletion ontology callouts)
  – Cleverer treatment of large documents in the GUI
  – PR reloading

Conclusion

GATE is:
• Addressing the need for scalable, reusable, and portable HLT solutions
• Supporting large data, in multiple media, languages, formats, and locations
• Lowering the cost of creation of new language processing components
• Promoting quantitative evaluation metrics via tools and a level playing field
• Promoting experimental repeatability by developing and supporting free software

http://gate.ac.uk/