Issues in Learning an Ontology from Text

Post on 21-May-2015

Talk at bio-ontologies SIG at ISMB Toronto, 2008

Issues in Learning an Ontology from Text

Christopher Brewster, Simon Jupp, Joanne Luciano, David Shotton, Robert Stevens, and Ziqi Zhang

The Use Case: Animal Behaviour

• Animal behaviour community recognises the need for an ontology, e.g. for video annotation/retrieval

• The community created an “Animal Behaviour Ontology” - 339 terms

• Can we (semi-) automatically build from text?

Some Questions

• Do we get a “good ontology”?

• If not, is it useful?

• Is it low-effort?

• Should the result be “tidied up” or used as a donor?

Methodology: Dataset

• Journal “Animal Behaviour” from Elsevier

• 623 articles from Vol 71 (2006) - Vol 74 (2007)

• 2.2 million words

• Various formats - most usefully xml

We Want an Ontology of Green

• An ontology of “animal behaviours”

• Not an ontology of the corpus

We want the green terms in the ontology

Processing Steps (1)

1. Text extracted from XML, excluding affiliations, acknowledgements and bibliography (except for titles), etc.

2. Noise removed - person names, animal names, place names

3. Lemmatiser used to reduce data sparsity

4. Term extraction applied
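The first step can be sketched as follows. This is a minimal illustration only: the talk does not give the XML schema, so the element names `affiliation`, `acknowledgement` and `bibliography` and the sample document are assumptions, and the handling of bibliography titles is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real journal schema is not given in the talk.
SKIPPED_TAGS = {"affiliation", "acknowledgement", "bibliography"}

def extract_text(element):
    """Collect text fragments from an element tree, skipping unwanted subtrees."""
    if element.tag in SKIPPED_TAGS:
        return []
    parts = []
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        parts.extend(extract_text(child))
        # Tail text belongs to the parent, so keep it even if the child was skipped.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts

# Invented sample article, for illustration only.
SAMPLE = """<article>
  <title>Mate selection in reed buntings</title>
  <affiliation>Some University</affiliation>
  <body><para>Grooming is a behaviour.</para></body>
  <acknowledgement>We thank the reviewers.</acknowledgement>
</article>"""

text = " ".join(extract_text(ET.fromstring(SAMPLE)))
print(text)
```

The later steps (noise removal, lemmatisation, term extraction) would then run over the returned text.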

Processing Steps (2)

5. Term selection

Regular expression used to select terms ending in behaviour, display, construction, inspection plus generic -ing, -ism, etc.

Build hierarchies using String Inclusion

6. Top level terms filtered using “Hearst Patterns” to test if X ISA behaviour/activity/etc.

Example terms: Walking, Running, Jumping, Hunting, Pecking, Reed Bunting, Corn Bunting, Herring, Courtship, Studentship, Cannibalism, Dimorphism
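A suffix-based selection of this kind might look like the sketch below. The exact expression used in the study is not given, so the pattern is an assumption: it covers the endings named above, plus "-ship", which is only suggested by example terms such as "Courtship".

```python
import re

# Endings named on the slide; "ship" is an assumption based on the example terms.
TERM_PATTERN = re.compile(
    r"(?:behaviour|display|construction|inspection|ing|ism|ship)$",
    re.IGNORECASE,
)

def select_terms(terms):
    """Keep only the terms whose final characters match one of the endings."""
    return [t for t in terms if TERM_PATTERN.search(t)]

candidates = [
    "Walking", "Reed Bunting", "Herring", "Cannibalism",
    "Nest construction", "Territory", "Song",
]
selected = select_terms(candidates)
print(selected)
```

A purely string-based filter also keeps terms such as "Reed Bunting" and "Herring", which name animals rather than behaviours, so a later filtering step is still needed.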

Applying String Inclusion /Rules to Terms

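• C is the parent of BC and AC; BC is the parent of ABC

• Selection is the parent of Mate Selection and Natural Selection; Mate Selection is the parent of Female Mate Selection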
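One way to sketch string inclusion is to treat a term as the child of the longest other term that forms its trailing words. The suffix-only rule and the lower-casing are assumptions made for this illustration; the talk does not spell out the exact rules.

```python
def build_hierarchy(terms):
    """Map each term to its parent: the longest other term made of its
    trailing words, or None if the term is top level."""
    normalised = {t.lower() for t in terms}
    parents = {}
    for term in normalised:
        words = term.split()
        parent = None
        # Drop leading words one at a time; the first hit is the longest suffix.
        for start in range(1, len(words)):
            candidate = " ".join(words[start:])
            if candidate in normalised:
                parent = candidate
                break
        parents[term] = parent
    return parents

hierarchy = build_hierarchy(
    ["Selection", "Mate Selection", "Natural Selection", "Female Mate Selection"]
)
top_level = sorted(t for t, p in hierarchy.items() if p is None)
print(hierarchy["female mate selection"], top_level)
```

Matching on whole words rather than raw characters keeps a term like "preselection" from being filed under "selection".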

Lexico-Syntactic Patterns

• X such as P, Q, R; X is a Y

• Grooming is a behaviour

• Copulation is an activity

• Dimorphism is a behaviour

• Calls such as trills, whistles, grunts
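The two patterns above can be approximated with regular expressions, as in this sketch. It assumes single-word terms and plain sentences; a real implementation would more likely match over noun phrases, and the function name `extract_isa` is hypothetical.

```python
import re

# "X is a Y" / "X is an Y", single-word X and Y only (a simplifying assumption).
IS_A = re.compile(r"\b(\w+) is an? (\w+)", re.IGNORECASE)
# "X such as P, Q, R" with a comma-separated list of single words.
SUCH_AS = re.compile(r"\b(\w+) such as ((?:\w+, )*\w+)", re.IGNORECASE)

def extract_isa(sentence):
    """Return (hyponym, hypernym) pairs found in one sentence."""
    pairs = [
        (m.group(1).lower(), m.group(2).lower())
        for m in IS_A.finditer(sentence)
    ]
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group(1).lower()
        pairs.extend((h.lower(), hypernym) for h in m.group(2).split(", "))
    return pairs

print(extract_isa("Copulation is an activity"))
print(extract_isa("Calls such as trills, whistles, grunts"))
```

A top-level term could then be kept only if the corpus yields at least one pair linking it to a hypernym such as "behaviour" or "activity".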

Results

• 64,000 terms extracted

• The regexp selected 10,335 terms

• Step 6a resulted in an ontology with 17,776 classes and 1,295 top level classes

• Step 6b resulted in an ontology with 13,058 classes and 912 top level classes

Results (2) - Copulation Sub-tree

Results (3)

• Evaluation of terms excluded by regexp:

• 56,000 terms excluded

• Random sample of 3,140 terms evaluated by hand

• 7 verbs and 42 nouns should not have been excluded

• E.g., “interaction”

• A recall of 0.905

Discussion: The problem of focus

Other Issues

• More a vocabulary than an ontology

• SKOS-like rather than OWL-like

• Can deal with “selection”, “mate selection” and “natural selection”

• Highly compositional terms “Adult male grooming behaviour”

• Cleanish list of top level terms: cannibalism, copulation, eating, foraging, fighting, grooming

Discussion: Is it useful?

• Answers to the four questions: Good ontology? No. Useful? Yes. Low-effort? Yes. Tidy up or donor? Donor

• Useful ontological fragments

• Bringing ontology to ontology learning is the research challenge

• Limitations: noise; the problem of focus; only taxonomic relations

• Advantages: speed; ease; a step towards formal ontologies
