On the Semantic Representation and Extraction of Complex Category Descriptors
André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João C. Pereira da Silva
Insight Centre for Data Analytics
NLDB 2014, Montpellier, France
Aug 23, 2014
Outline
- Motivation
- Extracting Natural Language Category Descriptors (NLCDs)
- Evaluation
- Summary
Motivation
Big Data Vision: More complete data-based picture of the world
for systems and users.
“Schema” Growth & Complexity
- Fundamental shift in the database landscape: from 10s-100s of attributes to 1,000s-1,000,000s of attributes
- How to build large ‘schemas’?
Target Motivational Scenario: Wikipedia
- Decentralized content generation
- 300,000 editors have edited Wikipedia more than 10 times
- > 280,000 distinct Natural Language Category Descriptors (NLCDs)
Natural Language Category Descriptors (NLCDs)
NLCDs
Natural Language Category Descriptors (NLCDs) are natural language descriptors for sets.
Simple NLCDs:
- ‘People’
- ‘Countries’
- ‘Films’
Complex NLCDs:
- ‘French Senators Of The Second Empire’
- ‘United Kingdom Parliamentary Constituencies Represented By A Sitting Prime Minister’
Goal:
- Parse NLCDs into an integrated structured graph
Assumptions
- NLCDs as a more syntactically tractable subset of natural language
- NLCDs as a low-effort interface for structuring a domain of discourse
Formality vs. Usability Spectrum
NLCDs, NLCD graphs
Information Extraction
NLCD graphs
Applications
- Database Creation
- Semantic Annotation
- Entity/Semantic Search
Other Examples
IFRS and US GAAP
- ‘Partially owned properties’
- ‘Residential portfolio segment’
- ‘Assets arising from exploration for and evaluation of mineral resources’
- ‘Key management personnel compensation’
- ‘Other long-term employee benefits’
Extracting Natural Language Category Descriptors (NLCDs)
Natural Language Category Descriptors
What is Big Data?
Core Features
Manual analysis of 10,000 NLCDs.
Features/Core Lexical Categories Distribution
Number of distinct POS Tag patterns
Graph Representation Model
Focus of the Representation
Taxonomic Structure
Context Representation (Open Relation Extraction)
- Reification-based
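As a rough illustration of what a reification-based context representation could look like, the sketch below encodes the complex NLCD ‘French Senators Of The Second Empire’ as subject-predicate-object triples. The category links taxonomically to its core, and the contextual relation becomes a node of its own instead of a single edge. The `ex:` prefix and all property names are hypothetical and are not taken from the authors' model.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def node(label: str) -> str:
    """Build a node identifier in a hypothetical 'ex:' namespace."""
    return "ex:" + label.replace(" ", "_")


def reify_nlcd(category: str, core: str, relation: str, context: str) -> List[Triple]:
    """Return triples for one NLCD with a single reified context relation.

    All property names below are illustrative assumptions.
    """
    cat = node(category)
    rel = cat + "/rel1"  # the relation is a node, so it can carry its own statements
    return [
        (cat, "ex:subClassOf", node(core)),            # taxonomic structure
        (rel, "ex:type", "ex:ContextRelation"),        # reified relation node
        (rel, "ex:relationSubject", cat),
        (rel, "ex:relationPredicate", node(relation)),
        (rel, "ex:relationObject", node(context)),
    ]


triples = reify_nlcd(
    "French Senators Of The Second Empire", "Senators", "of", "The Second Empire"
)
for triple in triples:
    print(triple)
```

Treating the relation as a node keeps the taxonomic edge separate from the context, which is one common reason to choose reification over a direct edge.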
Examples
NLCD Extractor
NLCD Extractor: POS Tagging
NLCD Extractor: Segmentation
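The slides do not spell out the segmentation algorithm, so the following is only a minimal sketch of the idea. It assumes that relation words such as prepositions delimit the segments of an NLCD. The `RELATION_WORDS` set is a hand-written stand-in for the POS tags that the previous step would supply.

```python
from typing import List

# Assumed, hand-written relation markers; a POS tagger would supply these in practice.
RELATION_WORDS = {"of", "in", "by", "from", "for", "at", "on", "with"}


def segment(nlcd: str) -> List[str]:
    """Split an NLCD into alternating term segments and relation words."""
    segments: List[str] = []
    current: List[str] = []
    for token in nlcd.split():
        if token.lower() in RELATION_WORDS:
            if current:  # close the segment collected so far
                segments.append(" ".join(current))
                current = []
            segments.append(token)
        else:
            current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments


segments = segment("French Senators Of The Second Empire")
print(segments)  # ['French Senators', 'Of', 'The Second Empire']
```

On ‘United Kingdom Parliamentary Constituencies Represented By A Sitting Prime Minister’ this word-list heuristic leaves ‘Represented’ attached to the first segment, which shows why a real extractor needs POS information and not a fixed list.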
NLCD Extractor: Named Entity Recognition
NLCD Extractor: Core Detection
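One simple way to picture core detection is a head-noun heuristic: take the last token of the first segment as the category core and treat the preceding tokens as specializations. This heuristic is an assumption made for illustration; the slides do not state how the extractor identifies the core.

```python
from typing import List, Tuple


def detect_core(first_segment: str) -> Tuple[str, List[str]]:
    """Return (core, specializations) for the first segment of an NLCD.

    Assumes the head noun is the last token, which holds for many
    English noun phrases but not for all of them.
    """
    tokens = first_segment.split()
    if not tokens:
        raise ValueError("empty segment")
    return tokens[-1], tokens[:-1]


core, specializations = detect_core("French Senators")
print(core, specializations)  # Senators ['French']
```

A token-level split like this breaks multi-word entities such as ‘United Kingdom’ apart, so it would have to run after named entity recognition in any realistic pipeline.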
NLCD Extractor: WSD
NLCD Extractor: Entity Linking
DBpedia
NLCD Extractor: RDF Representation
DBpedia
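To make the RDF output step concrete, the sketch below serializes two statements about an NLCD as N-Triples, one of them pointing into the DBpedia resource namespace. The `http://example.org/nlcd/` namespace, the `linkedEntity` property, and the choice of ‘Second French Empire’ as the linked resource are illustrative assumptions, not output of the authors' extractor.

```python
from urllib.parse import quote

EX = "http://example.org/nlcd/"           # hypothetical namespace for extracted nodes
DBPEDIA = "http://dbpedia.org/resource/"  # DBpedia resource namespace
RDFS_SUBCLASS = "<http://www.w3.org/2000/01/rdf-schema#subClassOf>"


def uri(namespace: str, label: str) -> str:
    """Build an N-Triples URI term from a label (spaces become underscores)."""
    return "<" + namespace + quote(label.replace(" ", "_")) + ">"


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) URI terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


category = uri(EX, "French Senators Of The Second Empire")
triples = [
    (category, RDFS_SUBCLASS, uri(EX, "Senators")),
    # Assumed entity-linking result, shown only to illustrate the shape of a link.
    (category, uri(EX, "linkedEntity"), uri(DBPEDIA, "Second French Empire")),
]
ntriples = to_ntriples(triples)
print(ntriples)
```

Building the DBpedia URI from a label only works once entity linking has chosen the right resource; the underscore convention alone does not disambiguate anything.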
Evaluation
Evaluation Setup
- Total of 287,957 English Wikipedia categories (open-domain scenario)
- Selected random sample of 2,696 categories
- Manual evaluation of the core extraction features:
  - Entity segmentation
  - Relation identification
  - Unary operators
  - Specialization relations
  - Category core identification
  - Entity core identification
  - Word Sense Disambiguation (WordNet)
  - Entity linking (DBpedia)
Results
Performance:
- (i) graph extraction time: 9.8 ms per graph
- (ii) word sense disambiguation: 121.0 ms per word
- (iii) entity linking: 530.0 ms per link
* Measured on an Intel i5-3317U (1.70 GHz) machine with 4 GB RAM (2 cores, 2 threads per core).
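To put the per-graph timing in perspective, the short calculation below scales the reported graph extraction time to the full set of 287,957 categories. It assumes sequential processing and that the 9.8 ms average holds for every category. Word sense disambiguation and entity linking are left out because the slides do not report how many words or links occur per category.

```python
TOTAL_CATEGORIES = 287_957   # English Wikipedia categories in the evaluation setup
GRAPH_EXTRACTION_MS = 9.8    # reported graph extraction time per graph

# Sequential estimate: total milliseconds, then converted to minutes.
total_ms = TOTAL_CATEGORIES * GRAPH_EXTRACTION_MS
total_minutes = total_ms / 1000 / 60
print(f"{total_minutes:.1f} minutes")  # 47.0 minutes
```

Under these assumptions, graph extraction alone would take roughly three quarters of an hour for the whole category set.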
Summary
- NLCDs can provide a more tractable (from the IE perspective) natural language interface for structuring large KBs
- We developed an approach for the representation, extraction and integration of NLCDs
  - ~75% extraction accuracy
- Limitations:
  - Need for a more principled and formal definition of an NLCD
  - Need for a better entity recognition and linking approach
- Future work: evaluation under a domain-specific scenario