Mary McGee Wood, Shenghui Wang Dept of Computer Science, U. of Manchester Valentin Tablan, Diana Maynard, Hamish Cunningham Dept of Computer Science, U. of Sheffield Populating a Database from Parallel Texts using “Ontology-based” Information Extraction
Mary McGee Wood, Shenghui Wang
Dept of Computer Science, U. of Manchester
Valentin Tablan, Diana Maynard, Hamish Cunningham
Dept of Computer Science, U. of Sheffield
Susannah Lydon
Earth Science Education Unit, U. of Keele
Populating a Database from Parallel Texts using “Ontology-based” Information Extraction
The hypothesis
Overview
Parallel texts
Legacy data in the natural sciences
“Ontology-based” Information Extraction
NLDB’04 - a few running threads
Multiple / semi-overlapping text sources
Sophisticated vs shallow or statistical text processing
“Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)
Autonomous agents vs HCC (Human-Computer Collaborative) approaches
We are doing…
Highly homogeneous data sources
Shallow text processing
“Ontologies” only as a last resort
HCC approach
We are not doing…
Heterogeneous data sources
Sophisticated language processing
Improvement of single-source IE or question-answering
Autonomous agents
Parallel texts
Text descriptions in the traditional descriptive sciences.
Descriptions of protein sequences and functions in molecular biology.
Press coverage of news stories.
Police witness-of-crime reports.
(Semi-) automatic marking of free text answers in examinations.
Legacy data in the natural sciences
Text descriptions in the traditional descriptive sciences:
Species descriptions in botany and zoology
Descriptions of diseases in medicine.
Data sources
Five species of Ranunculus (buttercups)
Six botanists’ text descriptions (Floras)
Typical data
R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.
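A description like the R. acris entry is punctuated very regularly: semicolons separate the plant parts, commas separate the features of one part. The sketch below shows how far that punctuation alone gets a shallow processor. It is an illustration only, not the project's parser; the function name `split_clauses` is hypothetical.

```python
from __future__ import annotations

# The R. acris description text, with the name and common name left off.
DESCRIPTION = (
    "Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; "
    "flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, "
    "smooth, with short hooked beak; 2n=14."
)


def split_clauses(description: str) -> list[list[str]]:
    """Split on semicolons into clauses, then on commas into phrases.

    Only the final full stop is stripped, so decimal points such as
    the one in '3.5mm' are left alone.
    """
    clauses = []
    for clause in description.strip().rstrip(".").split(";"):
        phrases = [p.strip() for p in clause.split(",") if p.strip()]
        if phrases:
            clauses.append(phrases)
    return clauses


clauses = split_clauses(DESCRIPTION)
for phrases in clauses:
    print(phrases)
```

Each inner list holds one clause: its first phrase usually names the plant part, and the remaining phrases are further features of that part.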
Hand Parsing & Correlation
Hand-parsed species descriptions for Ranunculus bulbosus
Sources: CTM, FE, FNA, GLEASON, GRAY, STACE
head: Petals; Petals; petals; Petals; petals
number: 5; 5
length: usually 10-15 mm; 9-13 mm; 8-14 mm. long; 0.8-1.4 cm long
width: 8-11 mm; nearly as broad
shape: broadly obovate with cuneate base; broadly obovate; rounded-obovate
colour: bright glossy yellow, rarely paler or white; yellow
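Correlating the sources means comparing values that are written differently: the petal lengths above are given in both mm and cm. A minimal sketch of that normalisation step follows, assuming every measurement is a simple "low-high unit" range. The regular expression and the names `RANGE`, `TO_MM` and `to_mm_range` are illustrative assumptions, not part of the original hand analysis.

```python
import re

# Petal-length strings as they appear in the hand-parsed table.
PETAL_LENGTHS = [
    "usually 10-15 mm",
    "9-13 mm",
    "8-14 mm. long",
    "0.8-1.4 cm long",
]

# low-high followed by a unit; 'mm' is listed first so it wins over 'm'.
RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(mm|cm|m)\b")
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def to_mm_range(text):
    """Return (low, high) in millimetres, or None if no range is found."""
    match = RANGE.search(text)
    if match is None:
        return None
    low, high, unit = float(match.group(1)), float(match.group(2)), match.group(3)
    factor = TO_MM[unit]
    # Rounding hides floating-point noise from the unit conversion.
    return (round(low * factor, 3), round(high * factor, 3))


ranges = [to_mm_range(text) for text in PETAL_LENGTHS]
# The envelope is the widest span that any source reports.
envelope = (min(r[0] for r in ranges), max(r[1] for r in ranges))
print(ranges)
print(envelope)
```

Once the values share a unit, "8-14 mm. long" and "0.8-1.4 cm long" turn out to be the same range, which is exactly the kind of agreement the correlation step has to detect.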
Results of hand-analysis of Ranunculus descriptions from six sources
- Most data from one source only
- Individual texts contain on average 39% of the total information for each species
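One way to arrive at a figure like the per-text coverage above is sketched here. It assumes that "total information" means the union of all feature slots filled by any source, which the slides do not spell out. The data is a made-up toy example with placeholder source names, NOT the project's figures, and the function name `coverage` is hypothetical.

```python
def coverage(records):
    """Map each source to the fraction of all known slots that it fills.

    `records` maps a source name to the set of feature slots it fills.
    The denominator is the union of slots over all sources (an assumption).
    """
    all_slots = set().union(*records.values())
    if not all_slots:
        return {source: 0.0 for source in records}
    return {source: len(slots) / len(all_slots) for source, slots in records.items()}


# Toy data, invented for illustration only.
toy_records = {
    "source_a": {"petal number", "petal length", "petal colour"},
    "source_b": {"petal length", "petal width"},
    "source_c": {"petal shape"},
}

per_source = coverage(toy_records)
mean_coverage = sum(per_source.values()) / len(per_source)
print(per_source)
print(round(mean_coverage, 2))
```

A low mean means that no single text comes close to the combined record, which is the argument for merging parallel texts.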
MultiFlora I
Automatic compilation of accurate taxonomic databases from multiple non-computerised sources
Department of Computer Science, University of Manchester
Mary McGee Wood, David Rydeheard, Susannah Lydon
Department of Botany, Natural History Museum, London
Rob Huxley, David Sutton
Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072
Template output (1)
Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;
HEAD KIND FEATURE TYPE KIND
Erect
Perennial
to 1m measure unknown
basal position pubescent
leaves Prefix
deeply
palmately
lobed
Template output (2)
flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;
HEAD KIND FEATURE TYPE KIND NEGATION
flowers 15-25mm measure width
across
sepals reflexed true
achenes short
hooked
smooth
glabrous
2-3.5mm measure unknown
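The template rows above suggest two simple rules: a measurement is typed "width" when it is followed by "across" and "unknown" otherwise, and "not" before a feature sets the negation flag. The sketch below reproduces those two rules for phrases whose first word is the head. The field names, the regular expression and the function `fill_row` are assumptions made for illustration; the real template filler is not shown in the slides.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A single value or a range, followed by a unit, e.g. '15-25mm' or '2-3.5mm'.
MEASURE = re.compile(r"\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?\s*(?:mm|cm|m)\b")


@dataclass
class TemplateRow:
    head: str
    feature: str
    feature_type: Optional[str] = None  # e.g. 'measure'
    kind: Optional[str] = None          # e.g. 'width' or 'unknown'
    negation: bool = False


def fill_row(phrase: str) -> Optional[TemplateRow]:
    """Fill one row from a phrase of the form 'HEAD ...'.

    Handles only measurements and 'HEAD not FEATURE'; returns None otherwise.
    """
    words = phrase.split()
    if len(words) < 2:
        return None
    head, rest = words[0], " ".join(words[1:])
    match = MEASURE.search(rest)
    if match:
        kind = "width" if "across" in rest.split() else "unknown"
        return TemplateRow(head, match.group(0), "measure", kind)
    if len(words) == 3 and words[1] == "not":
        return TemplateRow(head, words[2], negation=True)
    return None


for phrase in ["flowers 15-25mm across", "sepals not reflexed", "achenes 2-3.5mm"]:
    print(fill_row(phrase))
```

Phrases that do not start with their head, such as "Erect perennial to 1m", are outside what this sketch handles.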
MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
Department of Computer Science, University of Manchester
Mary McGee Wood, Susannah Lydon, Alan Rector
Department of Botany, Natural History Museum, London
Rob Huxley
Natural Language Processing Group, University of Sheffield
Hamish Cunningham, Valentin Tablan, Diana Maynard
Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049
GATE II
“Ontology-based” Information Extraction
“Ontology” – classes of heads, properties, and features
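As a rough picture of what "classes of heads, properties, and features" could look like as a data structure, the sketch below maps head terms to a head class and feature terms to a property and a feature class, then emits a row with the fields Head Class, Head, Property, FeatClass, Feature. Every class name and every dictionary entry is invented for illustration; this is not the project's ontology and it does not use the GATE API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class MiniOntology:
    # head term -> head class
    head_classes: Dict[str, str]
    # feature term -> (property, feature class)
    feature_classes: Dict[str, Tuple[str, str]]

    def annotate(self, head: str, feature: str) -> Optional[Tuple[str, str, str, str, str]]:
        """Return (head class, head, property, feature class, feature).

        Returns None when either term is missing from the ontology.
        """
        head_class = self.head_classes.get(head)
        entry = self.feature_classes.get(feature)
        if head_class is None or entry is None:
            return None
        prop, feat_class = entry
        return (head_class, head, prop, feat_class, feature)


# Hypothetical entries, for illustration only.
onto = MiniOntology(
    head_classes={"achenes": "Fruit", "sepals": "FlowerPart", "leaves": "Leaf"},
    feature_classes={
        "glabrous": ("surface", "Hairiness"),
        "pubescent": ("surface", "Hairiness"),
        "reflexed": ("orientation", "Posture"),
    },
)

print(onto.annotate("achenes", "glabrous"))
print(onto.annotate("leaves", "pubescent"))
```

Unlike a flat gazetteer, the lookup returns the class and the property along with the term, so "glabrous" and "pubescent" from different sources land in the same database slot.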
More typical data
Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.
System output
Head Class Head Property FeatClass Feature