On the Semantic Representation and Extraction of Complex Category Descriptors André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João C. Pereira da Silva Insight Centre for Data Analytics NLDB 2014 Montpellier, France

On the Semantic Representation and Extraction of Complex Category Descriptors

Aug 23, 2014



André Freitas

Natural language descriptors used for categorizations are present from folksonomies to ontologies. While some descriptors are composed of simple expressions, other descriptors have complex compositional patterns (e.g. ‘French Senators Of The Second Empire’, ‘Churches Destroyed In The Great Fire Of London And Not Rebuilt’). As conceptual models get more complex and decentralized, more content is transferred to unstructured natural language descriptors, increasing the terminological variation, reducing the conceptual integration and the structure level of the model. This work describes a formal representation for complex natural language category descriptors (NLCDs). In the representation, complex categories are decomposed into a graph of primitive concepts, supporting their interlinking and semantic interpretation. A category extractor is built and the quality of its extraction under the proposed representation model is evaluated.
Transcript
Page 1: On the Semantic Representation and Extraction of Complex Category Descriptors

On the Semantic Representation and Extraction of Complex Category Descriptors

André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João C. Pereira da Silva

Insight Centre for Data Analytics

NLDB 2014

Montpellier, France

Page 2: On the Semantic Representation and Extraction of Complex Category Descriptors

Outline

- Motivation
- Extracting Natural Language Category Descriptors (NLCDs)
- Evaluation
- Summary

Page 3: On the Semantic Representation and Extraction of Complex Category Descriptors

Motivation

Page 4: On the Semantic Representation and Extraction of Complex Category Descriptors

Big Data

Vision: More complete data-based picture of the world for systems and users.

Page 5: On the Semantic Representation and Extraction of Complex Category Descriptors

“Schema” Growth & Complexity

- Fundamental shift in the database landscape
- How to build large ‘schemas’?

From 10s-100s of attributes to 1,000s-1,000,000s of attributes

Page 6: On the Semantic Representation and Extraction of Complex Category Descriptors

Target Motivational Scenario: Wikipedia

- Decentralized content generation
- 300,000 editors have edited Wikipedia more than 10 times
- > 280,000 distinct Natural Language Category Descriptors (NLCDs)

Page 7: On the Semantic Representation and Extraction of Complex Category Descriptors

Natural Language Category Descriptors (NLCDs)


Page 8: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCDs

Natural Language Category Descriptors (NLCDs) are natural language descriptors for sets.

Simple NLCDs:
- ‘People’
- ‘Countries’
- ‘Films’

Complex NLCDs:
- ‘French Senators Of The Second Empire’
- ‘United Kingdom Parliamentary Constituencies Represented By A Sitting Prime Minister’

Goal:
- Parse NLCDs into an integrated structured graph

Page 9: On the Semantic Representation and Extraction of Complex Category Descriptors

Assumptions

NLCD

NLCDs as a more syntactically tractable subset of natural language

NLCDs as a low effort interface for structuring a domain of discourse

IE


Page 10: On the Semantic Representation and Extraction of Complex Category Descriptors

Formality vs. Usability Spectrum

[Figure: NLCDs, Information Extraction, NLCD graphs]

Page 11: On the Semantic Representation and Extraction of Complex Category Descriptors

Applications

- Database Creation
- Semantic Annotation
- Entity/Semantic Search

Page 12: On the Semantic Representation and Extraction of Complex Category Descriptors

Other Examples

IFRS and US GAAP:
- ‘Partially owned properties’
- ‘Residential portfolio segment’
- ‘Assets arising from exploration for and evaluation of mineral resources’
- ‘Key management personnel compensation’
- ‘Other long-term employee benefits’

Page 13: On the Semantic Representation and Extraction of Complex Category Descriptors

Extracting Natural Language Category Descriptors (NLCDs)


Page 14: On the Semantic Representation and Extraction of Complex Category Descriptors

Natural Language Category Descriptors

What is Big Data?


Page 15: On the Semantic Representation and Extraction of Complex Category Descriptors

Core Features

Manual analysis of 10,000 NLCDs.


Page 16: On the Semantic Representation and Extraction of Complex Category Descriptors

Features/Core Lexical Categories Distribution


Page 17: On the Semantic Representation and Extraction of Complex Category Descriptors

Number of distinct POS Tag patterns


Page 18: On the Semantic Representation and Extraction of Complex Category Descriptors

Graph Representation Model


Page 19: On the Semantic Representation and Extraction of Complex Category Descriptors

Focus of the Representation

Taxonomic Structure

Context Representation (Open Relation Extraction)

- Reification-based
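As a rough illustration of the reification-based context representation named on this slide, the sketch below models a descriptor as a core concept, its specializations (the taxonomic part), and relations that are objects in their own right (the context part). The class names `Concept`, `RelationNode` and `NLCDGraph`, the `specializes` label, and the hand-built decomposition of the example are assumptions made for illustration, not the authors' data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A primitive concept, i.e. a node in the graph."""
    label: str


@dataclass
class RelationNode:
    """A reified relation: the relation is itself an object, so it can be
    referenced or annotated instead of being a bare edge label."""
    label: str
    target: Concept


@dataclass
class NLCDGraph:
    """Hypothetical container for one decomposed descriptor."""
    core: Concept
    specializations: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def triples(self):
        """Flatten the graph into (subject, predicate, object) label triples."""
        subject = " ".join([c.label for c in self.specializations] + [self.core.label])
        out = []
        if self.specializations:
            out.append((subject, "specializes", self.core.label))
        for rel in self.relations:
            out.append((subject, rel.label, rel.target.label))
        return out


# Hand-built decomposition of 'French Senators Of The Second Empire'.
graph = NLCDGraph(
    core=Concept("Senators"),
    specializations=[Concept("French")],
    relations=[RelationNode("of", Concept("Second Empire"))],
)
for triple in graph.triples():
    print(triple)
```

Keeping the relation as a separate object is what would let further context be attached to it later; a plain edge label could not carry that.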

Page 20: On the Semantic Representation and Extraction of Complex Category Descriptors

Examples


Page 21: On the Semantic Representation and Extraction of Complex Category Descriptors

Examples


Page 22: On the Semantic Representation and Extraction of Complex Category Descriptors

Examples


Page 23: On the Semantic Representation and Extraction of Complex Category Descriptors

Examples


Page 24: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor


Page 25: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: POS Tagging


Page 26: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: Segmentation
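The slide content for this step did not survive in the transcript, so the following is only a minimal sketch of what segmenting a descriptor could look like. It assumes that segments alternate between entity chunks and relation words, and it uses a tiny hand-written preposition list in place of the POS tags produced by the previous step. The names `PREPOSITIONS`, `DETERMINERS` and `segment` are hypothetical.

```python
# Toy closed-class word lists standing in for real POS tags (assumption).
PREPOSITIONS = {"of", "in", "by", "from", "for"}
DETERMINERS = {"the", "a", "an"}


def segment(descriptor):
    """Split a descriptor into alternating (kind, text) segments,
    where kind is either "entity" or "relation"."""
    segments = []
    current, kind = [], None
    for token in descriptor.split():
        lowered = token.lower()
        if lowered in DETERMINERS:
            continue  # determiners are dropped in this sketch
        token_kind = "relation" if lowered in PREPOSITIONS else "entity"
        if token_kind != kind and current:
            # The kind changed, so close the segment collected so far.
            segments.append((kind, " ".join(current)))
            current = []
        kind = token_kind
        current.append(token)
    if current:
        segments.append((kind, " ".join(current)))
    return segments


print(segment("French Senators Of The Second Empire"))
```

A word list like this cannot handle verbs or coordination (as in ‘Churches Destroyed In The Great Fire Of London And Not Rebuilt’); that is where real POS patterns would be needed.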


Page 27: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: Named Entity Recognition


Page 28: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: Core Detection
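The transcript does not say how the core is detected. One plausible reading, taken here purely as an assumption, is that the category core is the head noun of the phrase that precedes the first preposition, with the words before it acting as specializations. The function `detect_core` and the preposition list are hypothetical stand-ins, not the extractor's actual rules.

```python
PREPOSITIONS = {"of", "in", "by", "from", "for"}  # toy list (assumption)


def detect_core(descriptor):
    """Head-noun heuristic: the core is the last token before the first
    preposition; the tokens before it are returned as modifiers."""
    head_phrase = []
    for token in descriptor.split():
        if token.lower() in PREPOSITIONS:
            break
        head_phrase.append(token)
    if not head_phrase:
        raise ValueError("no head phrase found in descriptor")
    return head_phrase[-1], head_phrase[:-1]


core, modifiers = detect_core("French Senators Of The Second Empire")
print(core, modifiers)
```

For a simple descriptor such as ‘People’ the heuristic returns the single word as the core with no modifiers.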


Page 29: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: WSD


Page 30: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: Entity Linking

DBpedia

Page 31: On the Semantic Representation and Extraction of Complex Category Descriptors

NLCD Extractor: RDF Representation

DBpedia

Page 32: On the Semantic Representation and Extraction of Complex Category Descriptors

RDF Representation
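The RDF shown on this slide is not preserved in the transcript. As a hedged sketch of what a reified RDF serialization of one category might look like, the code below writes N-Triples by plain string formatting. The `rdf:` and `rdfs:` terms are the standard W3C vocabulary; the `http://example.org/nlcd/` namespace, the `rel1` statement name, the use of `rdfs:subClassOf` for the core, and the functions `iri` and `to_ntriples` are assumptions, and no DBpedia links are produced.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
EX = "http://example.org/nlcd/"  # hypothetical namespace


def iri(namespace, local_name):
    """Build an N-Triples IRI term, replacing spaces in the local name."""
    return "<" + namespace + local_name.replace(" ", "_") + ">"


def to_ntriples(category, core, relation, target):
    """Serialize one category as N-Triples, reifying its single relation."""
    cat = iri(EX, category)
    stmt = iri(EX, category + "/rel1")  # node standing for the relation itself
    triples = [
        (cat, iri(RDFS, "subClassOf"), iri(EX, core)),
        (stmt, iri(RDF, "type"), iri(RDF, "Statement")),
        (stmt, iri(RDF, "subject"), cat),
        (stmt, iri(RDF, "predicate"), iri(EX, relation)),
        (stmt, iri(RDF, "object"), iri(EX, target)),
    ]
    return [" ".join(t) + " ." for t in triples]


lines = to_ntriples(
    "French Senators Of The Second Empire", "Senators", "of", "Second Empire"
)
print("\n".join(lines))
```

In a fuller pipeline the `example.org` IRIs for entities would presumably be replaced by the DBpedia resources chosen during entity linking.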


Page 33: On the Semantic Representation and Extraction of Complex Category Descriptors

Evaluation

Page 34: On the Semantic Representation and Extraction of Complex Category Descriptors

Evaluation Setup

Total of 287,957 English Wikipedia categories (Open Domain scenario)

Selected random sample of 2,696 categories

Manual evaluation of the core extraction features:
- Entity segmentation
- Relation identification
- Unary operators
- Specialization relations
- Category core identification
- Entity core identification
- Word Sense Disambiguation (WordNet)
- Entity linking (DBpedia)
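The sampling step described in the setup can be sketched as a seeded simple random sample without replacement. The transcript does not state how the sample was actually drawn, so the function `sample_categories`, the seed, and the placeholder category names are assumptions; only the two counts come from the slide.

```python
import random


def sample_categories(categories, sample_size, seed=0):
    """Draw a reproducible simple random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(categories, sample_size)


# Placeholder names; the real input would be the 287,957 category titles.
population = ["Category_%d" % i for i in range(287_957)]
sample = sample_categories(population, 2_696)

share = 100 * len(sample) / len(population)
print(len(sample), "categories sampled,", round(share, 2), "percent of the total")
```

The sample covers just under 1% of the categories, which is what makes manual evaluation of every listed feature feasible.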

Page 35: On the Semantic Representation and Extraction of Complex Category Descriptors

Results

Performance:
- (i) graph extraction time: 9.8 ms per graph
- (ii) word sense disambiguation: 121.0 ms per word
- (iii) entity linking: 530.0 ms per link

* i5-3317U (1.70GHz) CPU computer with 4GB RAM (2 cores, 2 threads per core).
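To put the per-item timings in perspective, the back-of-the-envelope sketch below combines them into a per-category estimate and scales graph extraction to the 287,957 categories of the evaluation setup. It assumes the three costs simply add up sequentially; the word and link counts passed to `per_category_ms` are made-up inputs, not measured values.

```python
GRAPH_MS = 9.8    # graph extraction, per graph (from the slide)
WSD_MS = 121.0    # word sense disambiguation, per word (from the slide)
LINK_MS = 530.0   # entity linking, per link (from the slide)


def per_category_ms(words, links):
    """Estimated sequential cost of one category, in milliseconds."""
    return GRAPH_MS + words * WSD_MS + links * LINK_MS


total_categories = 287_957
graph_only_minutes = total_categories * GRAPH_MS / 1000 / 60
print("graph extraction for all categories:", round(graph_only_minutes, 1), "minutes")

# Hypothetical category with two disambiguated words and one linked entity.
print("one category:", round(per_category_ms(words=2, links=1), 1), "ms")
```

Graph extraction alone over the full category set comes to roughly 47 minutes on these figures, while entity linking dominates the cost of any single category.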


Page 36: On the Semantic Representation and Extraction of Complex Category Descriptors

Summary

NLCDs can provide a more tractable (from the IE perspective) natural language interface for structuring large KBs

We developed an approach for the representation, extraction and integration of NLCDs
- ~75% extraction accuracy

Limitations:
- Need for a more principled and formal definition of an NLCD
- Need for a better entity recognition and linking approach

Future Work: evaluation under a domain-specific scenario