Representing Texts as Contextualized Entity-Centric Linked Data Graphs. André Freitas¹, João C. P. da Silva², Danilo S. Carvalho², Seán O’Riain¹, Edward Curry¹. ¹DERI - Ireland; ²Universidade Federal Rio de Janeiro - Brazil. August 27, 2013.
Representing Texts as Contextualized Entity-Centric Linked Data Graphs
Integrating even a small fraction of the information present in the Web of Documents into the Linked Data Web can significantly increase the amount of information available to data consumers. However, information extracted from text does not easily fit into the usually highly normalized structure of ontology-based datasets. While the representation of structured data assumes a high level of regularity and relatively simple, consistent conceptual models, the representation of information extracted from text needs to take into account large terminological variation, complex contextual/dependency patterns, and fuzzy or conflicting semantics. This work focuses on bridging the gap between structured and unstructured data by proposing the representation of text as structured discourse graphs (SDGs), targeting an RDF representation of unstructured data. The representation focuses on a semantic best-effort information extraction scenario, where information from text is extracted under a pay-as-you-go data quality perspective, trading terminological normalization for domain independence, context capture, wider representation scope and maximization of textual information capture.
Transcript
Representing Texts as Contextualized
Entity-Centric Linked Data Graphs
André Freitas¹, João C. P. da Silva², Danilo S. Carvalho², Seán O’Riain¹, Edward Curry¹
¹DERI - Ireland
²Universidade Federal Rio de Janeiro - Brazil
August 27, 2013
Freitas, Silva, Carvalho, O’Riain, Curry 1/ 54
Outline
Motivation & Objective
Existing Approaches
Structured Discourse Graphs (SDGs)
Representation Requirements
Semantic Model Elements & Graph Patterns
Semantic Model - Formalization
Two types of co-reference: pronominal and non-pronominal
Co-references can be intra-sentence or inter-sentence
Substituting the co-referent term with the named entity can corrupt the semantics of the representation
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Context Elements
A semantic interpretation may depend on different contexts (temporal context)
The main contextual information is intra-sentence
Intra-sentence context ⇒ reification
SDGs - Semantic Model Elements
In late 1988, Obama entered Harvard Law School.
Quantifiers & Generic Operators
Quantifier: one, two, (cardinal numbers), many (much), some, all, thousands of, one of, several, only, most of
Negation: not
Modal: could, may, shall, need to, have to, must, maybe, always, possibly
Comparative: largest, smallest, most, the same, is equal, like, similar to, more than, less than
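A minimal sketch of how these operator classes could be looked up, assuming single-word tokens only; the `OPERATORS` table holds a subset of the words listed above, and the `classify_operator` helper is a hypothetical name, not part of the original model. Multi-word operators such as "most of" or "need to" would need phrase matching and are left out.

```python
# Illustrative subset of the operator vocabulary listed on the slide.
OPERATORS = {
    "quantifier": {"one", "two", "many", "much", "some", "all", "several", "only"},
    "negation": {"not"},
    "modal": {"could", "may", "shall", "must", "maybe", "always", "possibly"},
    "comparative": {"largest", "smallest", "most"},
}


def classify_operator(token):
    """Return the operator class of a single-word token, or None if it is not an operator."""
    word = token.lower()
    for kind, words in OPERATORS.items():
        if word in words:
            return kind
    return None


print(classify_operator("several"))
print(classify_operator("Obama"))
```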
SDGs - Semantic Model Elements & Graph Patterns
In late 1988, Obama entered Harvard Law School.
Text segmentation: (Obama,entered,Harvard Law School)
Named entities: Obama, Harvard Law School
Resolve co-references: Barack Obama
Context representation: time
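The four steps can be traced by hand for the example sentence. The sketch below assumes triples are plain Python tuples and that the temporal context is attached by reification, i.e. the whole triple becomes the subject of a second statement. The `ALIASES` table, the `resolve` helper and the `"time"` link label are illustrative assumptions, not the output of any real tool.

```python
# Step 1: text segmentation, written by hand for this sentence.
segment = ("Obama", "entered", "Harvard Law School")

# Step 2: named entities found in the segment.
named_entities = {"Obama", "Harvard Law School"}

# Step 3: co-reference resolution via a hypothetical alias table.
ALIASES = {"Obama": "Barack Obama"}


def resolve(term):
    """Map a mention to its resolved entity, leaving unknown terms unchanged."""
    return ALIASES.get(term, term)


resolved = (resolve(segment[0]), segment[1], resolve(segment[2]))

# Step 4: context representation. The triple itself is the subject of the
# context statement (reification).
graph = [resolved, (resolved, "time", "late 1988")]

for statement in graph:
    print(statement)
```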
SDGs - Semantic Model Elements & Graph Patterns
In late 1988, Obama entered Harvard Law School.
Resolved & Normalized Entities
Resolved entities: a node substitution in the graph was made from a co-reference to a named entity
Normalized entities: entities are transformed to a normalized form (September 1st of 2010 to 01/09/2010)
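As an illustration of the date normalization mentioned above, here is a small sketch using only the Python standard library. The `normalize_date` function is a hypothetical helper; it assumes English month names (the default locale) and the "Month Nth of YYYY" pattern from the slide, and would need extending for other date formats.

```python
import re
from datetime import datetime


def normalize_date(text):
    """Normalize dates such as 'September 1st of 2010' to DD/MM/YYYY."""
    # Drop ordinal suffixes that directly follow a number (1st, 2nd, 3rd, 4th).
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)
    # Drop the connective 'of' and collapse repeated whitespace.
    cleaned = re.sub(r"\bof\b", "", cleaned)
    cleaned = " ".join(cleaned.split())
    return datetime.strptime(cleaned, "%B %d %Y").strftime("%d/%m/%Y")


print(normalize_date("September 1st of 2010"))  # 01/09/2010
```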
SDGs - Semantic Model Elements & Graph Patterns
Later in 2007, Obama sponsored an amendment to the Defense Authorization Act to add safeguards for personality-disorder military discharges.
SDGs - Semantic Model Elements & Graph Patterns
He served three terms representing the 13th District in the Illinois Senate from 1997 to 2004.
Non-Named (Generic) Entities
Non-named entities map to non-rigid designators
Are more subject to vocabulary variation
Have more complex compositional patterns
SDGs - Semantic Model Elements & Graph Patterns
Following high school, Obama moved to Los Angeles in 1979 to attend Occidental College.
Triple Trees
Not all facts extracted can be represented in one triple.
Transformation from the syntactic tree to a set of triples
The sentence subject defines the root node
Interpretation: DFS traversal of the tree
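The triple-tree idea can be sketched as follows. The tree shape for the example sentence is an assumption made for illustration (the slide's actual graph is not reproduced in this transcript), and `dfs_triples` is a hypothetical helper name. The sentence subject is the root, and a depth-first traversal yields one triple per edge.

```python
# Each node maps to its outgoing (predicate, child) edges. This shape is an
# assumed reading of the example sentence, not the authors' extracted graph.
TREE = {
    "Obama": [("moved to", "Los Angeles")],
    "Los Angeles": [("in", "1979"), ("to attend", "Occidental College")],
}


def dfs_triples(tree, node):
    """Yield (subject, predicate, object) triples in depth-first order."""
    for predicate, child in tree.get(node, []):
        yield (node, predicate, child)
        yield from dfs_triples(tree, child)


triples = list(dfs_triples(TREE, "Obama"))
for triple in triples:
    print(triple)
```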
SDGs - Semantic Model Elements & Graph Patterns
He won election to the U.S. Senate in Illinois in November 2004.
Pronominal Co-Reference
SDGs - Semantic Model Elements & Graph Patterns
On August 23, Obama announced his election of Delaware Senator Joe Biden as his vice presidential running mate.
Pronominal Co-Reference
SDGs - Semantic Model Elements & Graph Patterns
As a member of the Senate Foreign Relations Committee, Obama made official trips to Eastern Europe, the Middle East, Central Asia and Africa.
Conjunctive Co-Reference
Structured Discourse Graphs - SDGs
Semantic Model - Formalization
Semantic Model - Formalization
Graph pattern: atomic graph structure which maps to a discourse structure
Named and Non-named Entities: $[\![ne]\!], [\![\sim ne]\!] \in U$, where $U$ is a set of IRIs
Basic Triple: $tr = (e_s, p, e_o)$, where $e_s$ and $e_o$ represent named/non-named entities associated with the subject ($s$) and object ($o$), and $p$ represents a relation between $e_s$ and $e_o$, interpreted by $[\![tr]\!] = ([\![e_s]\!], [\![p]\!], [\![e_o]\!]) \in U \times U \times U$
$[\![ccr]\!] = \{([\![e]\!], [\![conjlink_i]\!], [\![ne_i]\!]) \in U \times U \times U \text{ such that } [\![e]\!] = [\![proj_3(tr)]\!] \text{ and } (\bigwedge_{i=0}^{n} [\![ne_i]\!] \; sameas \; [\![e]\!])\}$
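Read literally, this interpretation links the object node of a basic triple to each conjunct through an indexed conjunction link. A rough sketch under that reading, using the earlier sentence about Obama's official trips: the placeholder node `_:places`, the `conjlink_i` string labels and the `expand_conjunction` helper are illustrative assumptions.

```python
def expand_conjunction(triple, members):
    """Link the object node of `triple` (proj3) to each conjunct via conjlink_i."""
    node = triple[2]
    return [(node, f"conjlink_{i}", member) for i, member in enumerate(members)]


# Basic triple whose object is a placeholder node standing for the whole conjunction.
tr = ("Obama", "made official trips to", "_:places")
ccr = expand_conjunction(
    tr, ["Eastern Europe", "the Middle East", "Central Asia", "Africa"]
)

for link in ccr:
    print(link)
```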
Semantic Model - Formalization
Possessive/Reflexive/Demonstrative Co-Reference:
$pcr = \{(\sim ne_i, coreflink, pr), (pr, coreflink, e_j)\}$
Interpretation:
$[\![pcr]\!] = \{([\![proj_1(tr)]\!], [\![coreflink]\!], [\![pr]\!]), ([\![pr]\!], [\![coreflink]\!], [\![proj_3(tr)]\!]) \in U \times U \times U \text{ such that } tr = (\sim ne_i, p, e_j)\}$
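A sketch of this pattern as code, following the formula literally: the subject of the triple (proj1) is linked to the pronoun, and the pronoun is linked to the object (proj3). The example triple is a hypothetical reading of "his vice presidential running mate" and the `possessive_coref` helper is an illustrative name.

```python
def possessive_coref(triple, pronoun):
    """Build the two coreflink triples that route a co-reference through a pronoun."""
    subject, _predicate, obj = triple
    return [(subject, "coreflink", pronoun), (pronoun, "coreflink", obj)]


# Hypothetical triple (~ne_i, p, e_j): a non-named entity related to an entity.
tr = ("vice presidential running mate", "of", "Obama")
pcr = possessive_coref(tr, "his")

for link in pcr:
    print(link)
```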
Semantic Model - Formalization
Extracted Graph: set of basic and reified triples and generic, quantifier and co-reference operators.
Paths
Basic: sequence of basic triples
Reified: basic paths with some reified triples associated
Operational: basic paths with some operators associated
Complex: contains both reified and operational paths
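The four path types can be told apart by checking what is attached to the basic triples along a path. A minimal sketch, assuming each step of a path carries its own lists of reified triples and operators; the `Step` class and `classify_path` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One basic triple in a path, plus anything attached to it."""
    triple: tuple
    reified: list = field(default_factory=list)    # reified triples on this step
    operators: list = field(default_factory=list)  # quantifier/generic/co-reference operators


def classify_path(path):
    """Return 'basic', 'reified', 'operational' or 'complex' for a list of steps."""
    has_reified = any(step.reified for step in path)
    has_operators = any(step.operators for step in path)
    if has_reified and has_operators:
        return "complex"
    if has_reified:
        return "reified"
    if has_operators:
        return "operational"
    return "basic"


served = Step(("He", "served", "terms"), operators=["three"])
print(classify_path([served]))
```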
Semantic Model - Formalization
Context Triples: $(tr, contextlink, ct)$, which indicates that a basic triple $tr$ can be associated with a specific context $ct$
$[\![context]\!] = ([\![tr]\!], [\![contextlink]\!], [\![ct]\!]) \in U^3 \times U \times U$
Multi-Context Graphs: an extracted graph with more than one context associated with its triples
if all basic triples in a path belong to a unique (same) context, the path is a unique-context (basic, reified, operational or complex) path
otherwise, we call this path a multi-context path
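The distinction between unique-context and multi-context paths can be checked mechanically. The sketch below assumes context triples are tuples of the form (triple, link, context) and takes one possible reading of the definition: every basic triple in the path has exactly one context, and it is the same one. The function names and the context labels `c1` and `c2` are illustrative.

```python
def contexts_of(triple, context_triples):
    """Collect the contexts attached to one basic triple."""
    return frozenset(ct for tr, _link, ct in context_triples if tr == triple)


def is_unique_context_path(path, context_triples):
    """True if all basic triples in the path share one and the same context."""
    per_triple = {contexts_of(tr, context_triples) for tr in path}
    return len(per_triple) == 1 and all(len(c) == 1 for c in per_triple)


t1 = ("Obama", "moved to", "Los Angeles")
t2 = ("Los Angeles", "to attend", "Occidental College")
path = [t1, t2]

same = [(t1, "contextlink", "c1"), (t2, "contextlink", "c1")]
mixed = [(t1, "contextlink", "c1"), (t2, "contextlink", "c2")]

print(is_unique_context_path(path, same))
print(is_unique_context_path(path, mixed))
```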
Extraction
Graphia
Graphia
http://graphia.dcc.ufrj.br
Graphia is an information extraction pipeline
Takes factual text as input and produces SDGs as output
Graphia’s modules combine state-of-the-art NLP tools with an efficient set of heuristics
Can build graphs for sentences and entire documents.