On the Semantic Representation and Extraction of Complex Category Descriptors

André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João C. Pereira da Silva

Insight Centre for Data Analytics

NLDB 2014, Montpellier, France

Outline

- Motivation
- Extracting Natural Language Category Descriptors (NLCDs)
- Evaluation
- Summary

Motivation

Big Data Vision: More complete data-based picture of the world for systems and users.

“Schema” Growth & Complexity

- Fundamental shift in the database landscape
- How to build large ‘schemas’?
- From 10s-100s of attributes to 1,000s-1,000,000s of attributes

Target Motivational Scenario: Wikipedia

- Decentralized content generation
- 300,000 editors have edited Wikipedia more than 10 times
- > 280,000 distinct Natural Language Category Descriptors (NLCDs)

Natural Language Category Descriptors (NLCDs)


NLCDs

Natural Language Category Descriptors (NLCDs) are natural language descriptors for sets.

Simple NLCDs:
- ‘People’
- ‘Countries’
- ‘Films’

Complex NLCDs:
- ‘French Senators Of The Second Empire’
- ‘United Kingdom Parliamentary Constituencies Represented By A Sitting Prime Minister’

Goal:
- Parse NLCDs into an integrated structured graph

Assumptions

NLCD

NLCDs as a more syntactically tractable subset of natural language

NLCDs as a low effort interface for structuring a domain of discourse

IE


Formality vs. Usability Spectrum

NLCDs, NLCD graphs

Information Extraction

NLCD graphs

Applications:
- Database Creation
- Semantic Annotation
- Entity/Semantic Search

Other Examples

IFRS and US GAAP:
- ‘Partially owned properties’
- ‘Residential portfolio segment’
- ‘Assets arising from exploration for and evaluation of mineral resources’
- ‘Key management personnel compensation’
- ‘Other long-term employee benefits’

Extracting Natural Language Category Descriptors (NLCDs)


Natural Language Category Descriptors

What is Big Data?


Core Features

Manual analysis of 10,000 NLCDs.


Features/Core Lexical Categories Distribution


Number of distinct POS Tag patterns
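As a rough illustration of what counting distinct POS tag patterns involves, the sketch below reduces each NLCD to its sequence of tags and counts how often each sequence occurs. The NLCDs come from the earlier slides, but the tags are hand-assigned Penn Treebank-style labels chosen for illustration, and the helper name `pos_pattern` is hypothetical; this is not the tagger or the data behind the slide.

```python
from collections import Counter

# Hand-tagged toy NLCDs. The tags are illustrative assumptions, not tagger output.
tagged_nlcds = [
    [("French", "JJ"), ("Senators", "NNS"), ("Of", "IN"),
     ("The", "DT"), ("Second", "NNP"), ("Empire", "NNP")],
    [("People", "NNS")],
    [("Countries", "NNS")],
    [("Films", "NNS")],
]


def pos_pattern(tagged_tokens):
    """Reduce a tagged NLCD to its POS tag pattern, e.g. 'JJ NNS IN DT NNP NNP'."""
    return " ".join(tag for _, tag in tagged_tokens)


# Map each distinct pattern to the number of NLCDs that share it.
pattern_counts = Counter(pos_pattern(tokens) for tokens in tagged_nlcds)

for pattern, count in pattern_counts.most_common():
    print(f"{count:3d}  {pattern}")
```

The three simple NLCDs collapse into a single pattern, while the complex one contributes a pattern of its own.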


Graph Representation Model


Focus of the Representation

Taxonomic Structure

Context Representation (Open Relation Extraction)

- Reification-based
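The two concerns above can be pictured with a small sketch: taxonomic structure as subclass links, and the context relation as a reified node that carries the relation label and its arguments. The function name `build_nlcd_graph`, the predicate names (`subClassOf`, `type`, `subject`, `relation`, `object`), and the blank-node label are all hypothetical choices made for this example; the vocabulary actually used in the work is not given on the slides.

```python
def build_nlcd_graph(nlcd, core, specialized_core, relation, entity):
    """Return (subject, predicate, object) triples for one NLCD.

    All predicate names are illustrative assumptions, not a published vocabulary.
    """
    triples = [
        # Taxonomic structure: each more specific set is a subclass of a more general one.
        (specialized_core, "subClassOf", core),
        (nlcd, "subClassOf", specialized_core),
    ]
    # Reification: the context relation becomes a node of its own, so it can
    # carry the relation label together with both of its arguments.
    context = "_:context1"
    triples += [
        (context, "type", "ContextRelation"),
        (context, "subject", nlcd),
        (context, "relation", relation),
        (context, "object", entity),
    ]
    return triples


graph = build_nlcd_graph(
    nlcd="French Senators Of The Second Empire",
    core="Senators",
    specialized_core="French Senators",
    relation="of",
    entity="Second Empire",
)
for triple in graph:
    print(triple)
```

Reifying the relation keeps the taxonomy as plain subclass links while still recording that the set is qualified by "of Second Empire".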

Examples


NLCD Extractor


NLCD Extractor: POS Tagging


NLCD Extractor: Segmentation
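The slide does not spell out the segmentation rules, so the following is only a minimal sketch of one plausible heuristic: split the descriptor at prepositions and keep each preposition as the relation joining neighbouring segments. The `PREPOSITIONS` set and the function name `segment` are assumptions made for this example.

```python
# Small illustrative preposition list; a real system would rely on POS tags instead.
PREPOSITIONS = {"of", "by", "in", "from", "for"}


def segment(nlcd):
    """Split an NLCD at prepositions.

    Returns (segments, relations), where relations[i] joins segments[i] and segments[i + 1].
    """
    segments, relations, current = [], [], []
    for token in nlcd.split():
        if token.lower() in PREPOSITIONS and current:
            # Close the running segment and record the preposition as a relation.
            segments.append(" ".join(current))
            relations.append(token.lower())
            current = []
        else:
            current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments, relations


segments, relations = segment("French Senators Of The Second Empire")
print(segments, relations)
```

A word-list heuristic like this leaves verbs such as "Represented" attached to the preceding segment, which is one reason a POS-based approach is preferable for complex NLCDs.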


NLCD Extractor: Named Entity Recognition


NLCD Extractor: Core Detection
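As a hedged illustration of core detection, the sketch below assumes the category core is the head noun of the phrase that precedes the first preposition, which for English noun phrases is usually its last word. This rule, the `PREPOSITIONS` set, and the function name `detect_core` are assumptions for the example, not the detection procedure used in the work.

```python
# Small illustrative preposition list; an assumption for this sketch.
PREPOSITIONS = {"of", "by", "in", "from", "for"}


def detect_core(nlcd):
    """Return the last word before the first preposition, or None for an empty NLCD."""
    head_phrase = []
    for token in nlcd.split():
        if token.lower() in PREPOSITIONS and head_phrase:
            break  # everything after the first preposition is context, not core
        head_phrase.append(token)
    return head_phrase[-1] if head_phrase else None


print(detect_core("French Senators Of The Second Empire"))
print(detect_core("Films"))
```

The heuristic is deliberately naive: for ‘United Kingdom Parliamentary Constituencies Represented By A Sitting Prime Minister’ it would return "Represented" rather than "Constituencies", so a practical detector would need POS information to skip verbs.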


NLCD Extractor: Word Sense Disambiguation (WSD)

NLCD Extractor: Entity Linking

DBpedia

NLCD Extractor: RDF Representation

DBpedia

RDF Representation


Evaluation

Evaluation Setup

- Total of 287,957 English Wikipedia categories (Open Domain scenario)
- Selected random sample of 2,696 categories
- Manual evaluation of the core extraction features:
  - Entity segmentation
  - Relation identification
  - Unary operators
  - Specialization relations
  - Category core identification
  - Entity core identification
  - Word Sense Disambiguation (WordNet)
  - Entity linking (DBpedia)

Results

Performance:
- (i) graph extraction time: 9.8 ms per graph
- (ii) word sense disambiguation: 121.0 ms per word
- (iii) entity linking: 530.0 ms per link

* Measured on a computer with an i5-3317U (1.70 GHz) CPU and 4 GB RAM (4 cores, 2 threads per core).
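To put the per-item timings in perspective, the sketch below combines them into a rough cost estimate. It assumes sequential processing and that the three costs simply add up, which the slides do not state; the function name `nlcd_cost_ms` and the choice of one disambiguated word and one linked entity are hypothetical.

```python
# Timings reported on the Results slide (milliseconds).
GRAPH_MS = 9.8    # per extracted graph
WSD_MS = 121.0    # per disambiguated word
LINK_MS = 530.0   # per linked entity

# Corpus size reported in the evaluation setup.
TOTAL_CATEGORIES = 287_957


def nlcd_cost_ms(words_to_disambiguate, entities_to_link):
    """Estimated processing time for one NLCD, assuming the costs add linearly."""
    return GRAPH_MS + WSD_MS * words_to_disambiguate + LINK_MS * entities_to_link


# Hypothetical NLCD with one word to disambiguate and one entity to link.
single_nlcd_ms = nlcd_cost_ms(words_to_disambiguate=1, entities_to_link=1)

# Graph extraction alone over all categories, converted from ms to minutes.
total_graph_minutes = TOTAL_CATEGORIES * GRAPH_MS / 1000 / 60

print(f"one NLCD: {single_nlcd_ms:.1f} ms")
print(f"graph extraction for all categories: {total_graph_minutes:.1f} min")
```

Under these assumptions entity linking dominates the per-NLCD cost, while graph extraction alone over the full category set takes on the order of 47 minutes.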


Summary

- NLCDs can provide a more tractable (from the IE perspective) natural language interface for structuring large KBs
- We developed an approach for the representation, extraction and integration of NLCDs
  - ~75% extraction accuracy
- Limitations:
  - Need for a more principled and formal definition of an NLCD
  - Need for a better entity recognition and linking approach
- Future Work: evaluation under a domain-specific scenario
