A Common Concept Description of Natural Language Texts as a Foundation of Semantic Computing on the Web
Mitsuru Ishizuka
Dept. of Creative Informatics & Dept. of Info. and Communication Eng.
School of Information Science and Technology
The aims of CDL are 1) to realize machine understandability of Web text contents, and 2) to overcome the language barrier on the Web.
The goal is to lay a foundation that allows computers to understand the semantic meaning of Web contents so that they can perform semantic computing on the Web.
Major Differences from Semantic Web
Semantic Web
Target of representation: Meta-data extracted from Web contents.
Domain-dependent ontologies (which make wide use across domain boundaries difficult)
RDF / OWL (description logic is hard for ordinary people to understand)
Semantic Computing Initiative
Target of representation: Semantic concepts expressed in texts.
Universal vocabulary (+ additional specific vocabulary in a domain if necessary), and pre-defined relation set.
CDL.nl (richer than RDF)
Tim Berners-Lee said (2007) that “the Data Web” is a more adequate name than “the Semantic Web”.
Main body: Institute of Semantic Computing (ISeC) in Japan
Int’l Standardization Activity: W3C Common Web Language (CWL)-XG
Incubator Group Activity at W3C from Oct. 2006 to March 2008
2nd Incubator Group at W3C from May 2008
CDLs and Semantic Web
Tim Berners-Lee (2007): The Semantic Web → The Data Web (more adequate)
Another Broader View of CDL Development
In 1960s – 1970s
The foundation of the common representation and manipulation (retrieval) of Data. Database
In 2000s – 2010s
The foundation of the common representation and manipulation of Semantic Information. Common Concept Base
It is preferable that this be language independent; in other words, a Computer Esperanto Language that is understandable by computers.
Functions and Supporting Standards
Minimal sufficient relations have been chosen to represent the surface-level concept meaning of texts.
UNL (Universal Networking Language)
The development started in 1997 at the United Nations Univ. (Tokyo). The chief scientist has been Dr. Hiroshi Uchida. It is now continuously developed under the UNDL foundation.
The purpose is to let people around the world exchange and share textual info. on the Web beyond the language barrier.
The design is based on the results of Machine Translation (especially, the Pivot method) and Electronic Dictionaries.
There have been activities with respect to English, Japanese, Chinese, Spanish, French, Arabic, etc.
The defining method of one unique sense of a word in UW (Patent of UN Univ.)
Defining category
swallow(icl>bird): the bird. “One swallow does not make a summer”
swallow(icl>action): the action of swallowing. “at one swallow”
swallow(icl>quantity): the quantity. “take a swallow of water”
Defining possible case relations
spring(agt>thing,obj>wood): bending or dividing something
spring(agt>thing,obj>mine): blasting something
spring(agt>thing,obj>person,src>prison): escaping (from) prison
spring(agt>thing,gol>place): jumping up. “to spring up”
spring(agt>thing,gol>thing): jumping on. “to spring on”
spring(obj>liquid): gushing out. “to spring out”
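The UW notation above (a headword followed by a parenthesized list of `relation>concept` constraints) can be read mechanically. The following is a minimal sketch, assuming only the flat form shown in the swallow and spring examples; the function name `parse_uw` is hypothetical, and nested or braced forms are deliberately not handled.

```python
import re

def parse_uw(uw: str):
    """Split a flat UW string into (headword, [(relation, concept), ...]).

    Assumption: the UW has the simple form head(rel>concept,rel>concept,...),
    as in the swallow/spring examples. Nested or braced forms are not handled.
    """
    m = re.fullmatch(r"\s*([^()]+?)\s*(?:\((.*)\))?\s*", uw)
    if m is None:
        raise ValueError(f"not a flat UW: {uw!r}")
    head = m.group(1)
    constraints = []
    if m.group(2):
        for part in m.group(2).split(","):
            # "icl>bird" -> ("icl", "bird")
            relation, _, concept = part.strip().partition(">")
            constraints.append((relation, concept))
    return head, constraints

bird = parse_uw("swallow(icl>bird)")
escape = parse_uw("spring(agt>thing,obj>person,src>prison)")
print(bird)
print(escape)
```

Two UWs sharing the headword but differing in their constraint lists then denote different word senses, which is the point of the defining method.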
UW (Universal Words) in UNL
Universal Word
uw{(equ>Universal Word)}
adjective concept{(icl>uw)}
uw(aoj>thing{,and>uw,ben>thing,cao>thing,cnt>uw,cob>thing,con>uw,coo>uw,dur>period,man>how,obj>thing,or>uw(aoj>thing),plc>thing,plf>thing,plt>thing,rsn>uw(aoj>thing),rsn>do,icl>adjective concept})
This set is not sufficient for representing every concept expressed in natural language texts. It cannot be used for every language due to its language (English) dependency.
Rough Correspondence between Semantic Relations of PropBank and CDL.nl (1), (2)
PropBank role: CDL.nl relations
Arg0 (prototypical agent): agt (agent), cag (co-agent), aoj (thing with attribute), cao (co-thing with attribute)
Arg1 (prototypical patient): obj (affected thing), cob (affected co-thing)
Arg2 (indirect object/benefactive/instrument/attribute/end state): ---, ben (beneficiary), ins (instrument), mat (material), met (method or means), sta (state), gol (goal, final state)
Arg3 (start point/benefactive/instrument/attribute): plf (initial place), ben (beneficiary), ins (instrument), mat (material), met (method or means), sta (state)
Arg4 (end point): plt (final place), to (destination)
TMP (time): tim (time), tmf (initial time), tmt (final time), dur (duration)
LOC (location): plc (place)
DIR (direction): to (destination)
MNR (manner): mal (qualitative manner), mat (quantitative manner)
PRP (purpose): pur (purpose or objective)
CAU (cause): rsn (reason)
MOD (modal verb): an attribute in CDL.nl
NEG (negative marker): an attribute in CDL.nl
ADV (general-purpose modifier): mod (modification), qua (quantity), pos (possessor), cnt (content), nam (name), per (proportion, rate or distribution), fmt (range/from to), frm (origin)
DIS (discourse particle and clause): [inter-sentence relation]
PRD (secondary predication): [unique in English]
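Because the correspondence is only rough, one PropBank label usually maps to several candidate CDL.nl relations, and one CDL.nl relation may appear under several PropBank labels. A minimal sketch of this as a lookup table, using only a subset of the rows listed above; the names `PROPBANK_TO_CDL`, `candidates` and `roles_for` are hypothetical.

```python
# Subset of the rough correspondence listed above (PropBank label -> CDL.nl relations).
PROPBANK_TO_CDL = {
    "Arg0": ["agt", "cag", "aoj", "cao"],
    "Arg1": ["obj", "cob"],
    "Arg4": ["plt", "to"],
    "LOC": ["plc"],
    "DIR": ["to"],
    "PRP": ["pur"],
    "CAU": ["rsn"],
}

def candidates(propbank_label):
    """CDL.nl relations that a PropBank label may correspond to (empty if unlisted)."""
    return PROPBANK_TO_CDL.get(propbank_label, [])

def roles_for(cdl_relation):
    """PropBank labels under which a CDL.nl relation is listed."""
    return [label for label, rels in PROPBANK_TO_CDL.items() if cdl_relation in rels]

print(candidates("Arg0"))   # several candidates: the choice needs further analysis
print(roles_for("to"))      # the same CDL.nl relation appears under two labels
```

The table narrows the choice but does not decide it; selecting among the candidates still requires semantic analysis of the sentence.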
Other CDL.nl Relations
[AgentRelation]
ptn (partner): an indispensable non-focused initiator of an action
Ex) He competes with John. Mary collaborates with him.
[PatientRelation]
opl (affected place): a place in focus affected by an event.
Ex) He cut the paper in the middle in the room.
[PlaceRelation]
vip (intermediate place, via place)
[StateRelation]
src (source, initial state)
vis (intermediate place or state)
[SceneRelation]
scn (scene): a scene where an event occurs, or a state is true, or a thing exists.
A scene is different from plc in that plc is the real place where something happens, whereas scn is an abstract or metaphorical world.
Ex) He won a prize in a contest. He played in the movie.
vic (via scene)
CDL.nl’s Relations other than Predicate-Argument Relations
[OrderRelation] coo (co-occurrence), seq (sequence)
[LogicalRelation] and (conjunction), or (disjunction, alternative), not (complement)
[ConceptRelation] equ (equivalent), icl (included/a kind of), tof (type of), pof (part of)
[ConnectingRelation] cau (causal), adv (adversative), adt (additive), cot (contrastive), par (parallel), att (attached)
[ReferringRelation] rfi (referred by identically), rfp (referred by partially), rfw (referred by wholly)
[AttentionRelation] ent (entry; main (main element) in Connexor), foc (focus), qfo (question focus), tpc (topic), com (comment)
Rich Attributes in UNL and CDL.nl
Time with respect to speaker: @past @present @future
Writer’s view on aspect of event: @begin @complete @continue @custom @end @experience @progress @repeat @state
Writer’s view of reference: @generic @def @indef @not @ordinal
Writer’s view of emphasis, focus and topic
There are several choices for the deep semantic-level description depending on applications. On the other hand, a certain consensus has been reached with respect to “Concept Description”, which is slightly below the surface level, through decades of research on NLP, machine translation and electronic dictionaries.
Although a complete consensus has not yet been achieved regarding the Concept Description level and its description scheme, it is meaningful to set up a common concept description format as an international standard today.
[Figure labels: Surface Level; Concept Description; Deep Semantic Level]
Hierarchical Construction of Concept Representation in CDL.nl
elementary thing/entity, corresponding to disambiguated word sense
composite entity
single event (single sentence), consisting of proposition and modality components
composite concept/event (complex sentence)
situation (discourse)
predicate, case components, predicate-modification components, etc.
temporal and causal relations, etc., and coreference
agent-patient relation, phrasal relation, etc.
Current Major Issues in CDL.nl
Semi-automatic Conversion from Text. (Text generation from CDL.nl is not so difficult.)
Semantic Retrieval of CDL Data (The design of a CDL Query Language (CDQL) and its processing mechanism)
Killer Application(s)
-- information exchange and sharing beyond the language barrier.
-- semantic patent document retrieval.
Approaches for Generating CDL Data
Manual Coding & Editing
Even in this case, a graphical input editor is necessary.
The brave soldiers fought with their enemies for their country in the War
Assign predicate-argument roles to sentence elements. (Who did What to Whom, When, Where, Why, How, etc.)
[ARG0 The brave soldiers] [rel fought] with [ARG1-with their enemies] for [ARG2-for their country] in [ARGM-loc the War] (in PropBank)
Corpora: PropBank, FrameNet, …
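The bracketed role annotation shown for the example sentence can be turned into (label, span) pairs with a short script. This is only an illustrative sketch: it assumes each bracket holds a label, a space, and then the text span (so the ARG0 bracket is written with a space after the label), and the function name `parse_roles` is hypothetical.

```python
import re

# The annotated example sentence, with a space after every role label.
ANNOTATED = ("[ARG0 The brave soldiers] [rel fought] with [ARG1-with their enemies] "
             "for [ARG2-for their country] in [ARGM-loc the War]")

def parse_roles(text):
    """Return (label, span) pairs for every [LABEL span] bracket in the text."""
    return re.findall(r"\[(\S+)\s+([^\]]+)\]", text)

roles = parse_roles(ANNOTATED)
for label, span in roles:
    print(f"{label:10s} {span}")
```

Words outside the brackets (here the prepositions) are skipped, since they carry no role of their own in this annotation style.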
Dependency Parser as a lower basis of our Semantic Analysis
Dependency Functions close to the semantic role
main (main element)
agt (agent): The agent by-phrase in passive sentences. Ex) The dog was chased by the boy.
ins (instrument)
tmp (time)
dur (duration) Ex) ...experience in the past 10 years.
man (manner)
loc (location)
sou (source) Ex) ... move away from the street.
goa (goal) Ex) ... shift to a full power.
pth (path) Ex) ... travel from Tokyo to Beijing.
cnt (contingency (purpose or reason)) Ex) ... unable to say why he was too ....
cod (condition)
qn (quantifier)
Syntactic Functions
pcomp (prepositional complement) Ex) They are in that red car.
phr (verb particle) Ex) She looked up the word in the dictionary.
subj (subject)
obj (object)
comp (subject complement) Ex) John remains a boy.
dat (indirect object) Ex) John gave her an apple.
oc (object complement) Ex) John called him a fool.
copred (copredicative) Ex) John regards him as foolish.
com (comitative) Ex) Drinking with you is nice.
voc (vocative) Ex) John, come here!
frq (frequency)
qua (quantity)
meta (clause adverbial) Ex) So far, he has been ….
cla (clause initial adverbial) Ex) Under his guidance, they can ....
ha (heuristic prepositional phrase attachment) Ex) escape through ..., fight for ...
det (determiner)
neg (negator) Ex) not
attr (attributive nominal) Ex) industrial editor
mod (other postmodifier) Ex) … of …
ad (attributive adverbial) Ex) So much for modern technology,
cc (coordination) Ex) and
Connexor Machinese Text Analyser
Named Entity Recognition
In Connexor Machinese Text Analyser
+ org (organization, company)
+ loc (location)
+ ind (individual)
+ name (name)
+ role (occupation, title)
This info. is useful as a lexical feature for the semantic parsing.
Conversion of text into CDL.nl through Shallow Semantic Parsing
Original text
The records retrieved in answer to queries become information that can be used to make decisions . #
Separate each word with an ID
The \ w37 records \ w38 retrieved \ w39 in \ w40 answer \ w41 to \ w42 queries \ w43 become \ w44 information \ w45 that \ w46 can \ w47 be \ w48 used \ w49 to \ w50 make \ w51 decisions \ w52 . \ w53 #
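The word-ID step above is simple to reproduce. A minimal sketch, assuming whitespace tokenization and a running counter that continues from earlier sentences (here it starts at 37, as in the example); the function name `tag_words` is hypothetical and the trailing `#` sentence marker is left out.

```python
SENTENCE = ("The records retrieved in answer to queries become information "
            "that can be used to make decisions .")

def tag_words(sentence, start_id):
    """Append a ' \\ wN' identifier to every whitespace-separated token."""
    tokens = sentence.split()
    return " ".join(f"{tok} \\ w{start_id + i}" for i, tok in enumerate(tokens))

tagged = tag_words(SENTENCE, 37)
print(tagged)
```

With a start value of 37 the final period receives the identifier w53, matching the example; later relation annotations can then refer to words by these IDs rather than by surface form.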
Relation/Role Set Comparison
PropBank
describes how a verb relates to its arguments.
FrameNet
describes words together with their arguments in a related common scenario.
Common disadvantages of FrameNet & PropBank role sets:
The sets cover only predicate-argument roles; they don’t consider any other types of relationships between entities.
CDL.nl
The CDL relation set describes how words are correlated and what the meanings of their relationships are.
Advantages of the CDL.nl relation set
Each relation in the set is pre-defined along with information that distinguishes it from other similar relations.
It describes not only predicate-argument relations, but also relations between any pair of entities between which a meaningful relationship exists. Thus it has better coverage.
The set has been chosen so that every concept expressed in texts can be sufficiently encoded.
It is universal, i.e., language independent, and can be applied to any language.
CDL Relation Set
Used to describe the shallow semantic structure of text.
Relations have been chosen to be able to sufficiently represent the semantic concepts of texts, and are predefined.
The set of relations contains all relation types, which are organized roughly into three groups:
intra-event relations (22)
agt (agent), aoj (thing with attribute), cag (co-agent), cao (co-thing with attribute), ptn (partner), …..
Data sparseness: Total number of relations: 13487. Relation types: 44. Average number per relation type: 306.5.
Feature spaces
Combine information from different language processes:
syntactic analysis
tells the details of the word forms used in the text and the syntactic structures among words.
dependency analysis
A dependency relation specifies an asymmetric relationship between words, where one word is a dependent of the other word, which is called its governor.
lexical construction
Lexical meaning contains two parts of information: word sense and semantic behavior, which is all the semantic relationships the word may contain.
Syntax and Dependency Features
Connexor Machinese Text Analyser, based on a functional dependency grammar
Syntax Features: Morphology features, Syntactic features
Dependency Features: Relation type, Dependency Path
Identification of Entity Pair with a Semantic Relation
Testing all possible pairs is not efficient.
Step 1: For each input sentence, generate a dependency tree that specifies the syntactic head in the sentence.
Step 2: Find a headNode set from the dependency tree. Each can be a headword of a head entity to govern a relation. We select nodes that have a subtree, and omit those that cannot be headNodes by creating a head stoplist.
Step 3: For each headNode, check its subtrees to find those that can be tail entities to the headNode. We create a tail stoplist containing those that cannot be root nodes of subtrees of tail entities. Repeat this process.
Step 4: A simple post-processing is applied to correct the boundaries within which the dependency tree does not show the correct relationship.
Entity pairs
[fought, (the brave soldiers)]
[fought, (their enemies)]
[fought, (their country)]
[fought, (the War)]
[soldiers, brave]
[enemies, their]
[country, their]
……..
[Figure: A dependency tree generated from Connexor Machinese Analyser. Nodes: root, fought, soldiers, The, brave, with, enemies, their, for, country, their, in, War, the. Edge labels: main, subj, det, attr, phr, ha, loc, pcomp.]
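Steps 2 and 3 of the entity-pair identification can be sketched on a hand-built tree for the example sentence. Everything here is an assumption made for illustration: the tree shape and edge labels are a plausible reading of the figure, the two stoplists are tiny word lists invented for this sentence, Step 4 (boundary post-processing) is omitted, and the names `Node`, `HEAD_STOP`, `TAIL_STOP` and `entity_pairs` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    index: int                 # position of the word in the sentence
    word: str
    label: str                 # dependency label to the governor (assumed)
    children: list = field(default_factory=list)

def n(index, word, label, *kids):
    return Node(index, word, label, list(kids))

# Assumed dependency tree for the example sentence (Step 1 output).
TREE = n(3, "fought", "main",
         n(2, "soldiers", "subj", n(0, "The", "det"), n(1, "brave", "attr")),
         n(4, "with", "phr", n(6, "enemies", "pcomp", n(5, "their", "attr"))),
         n(7, "for", "ha", n(9, "country", "pcomp", n(8, "their", "attr"))),
         n(10, "in", "loc", n(12, "War", "pcomp", n(11, "the", "det"))))

HEAD_STOP = {"with", "for", "in"}          # words that cannot head an entity pair
TAIL_STOP = {"the", "with", "for", "in"}   # words that cannot root a tail entity

def nodes_in(node):
    """All nodes of a subtree, root first."""
    yield node
    for child in node.children:
        yield from nodes_in(child)

def phrase(node):
    """Words of a subtree in sentence order."""
    return " ".join(x.word for x in sorted(nodes_in(node), key=lambda x: x.index))

def tails(head):
    """Step 3: roots of tail entities; look below stoplisted children."""
    for child in head.children:
        if child.word.lower() in TAIL_STOP:
            yield from tails(child)
        else:
            yield child

def entity_pairs(root):
    """Step 2 + Step 3: pair every admissible headNode with its tail entities."""
    pairs = []
    for node in nodes_in(root):
        if node.children and node.word.lower() not in HEAD_STOP:
            pairs.extend((node.word, phrase(t)) for t in tails(node))
    return pairs

PAIRS = entity_pairs(TREE)
for pair in PAIRS:
    print(pair)
```

Only seven candidate pairs come out, instead of every pair of words in the sentence, which is the efficiency gain the stoplists are meant to provide.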
Features for CDL Relation Recognition
Syntactic and Dependency-path features
Lexical features from WordNet, VerbNet and UNLKB.
Some labels of Connexor Machinese Analyser: ha (prepositional phrase attachment), phr (verb particle), pcomp (prepositional complement)
Supporting Input Editor
Selection among possible candidates proposed by a computer analysis. (Like Japanese Input Front-end Processor.)
Graphical verification and editing.
Semantic Search with CDL.nl beyond Keyword-based Search
Baseline: Combination of keywords
(The use of bi-gram, tri-gram,… does not lead to the improvement of performance.)
A pair of dependent words leads to a slight improvement.
Search using natural language queries is preferable in a sense.
However, when the search result is unsatisfactory, it is not obvious for a user how to modify the query sentence.
The CDL.nl-based Search allows a more specific search based on a set of words with named dependency relations, rather than with the simple (non-named) dependency.
It also allows a search using more specific word concepts, such as ones modified by attributes, and/or concept units larger than a single word.
It allows a search that takes account of semantic relevancy, such as similarity between two words, a relation between words in a sentence, etc.
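The difference between keyword search and search over named relations can be shown on a toy collection. This is a sketch under stated assumptions, not the CDL.nl search engine: each document is reduced to a hand-written set of (head, relation, tail) triples using the agt and obj relations, and the document names and helper `words_of` are hypothetical.

```python
# Two toy documents that use the same words with opposite roles.
DOCS = {
    "d1": {("chase", "agt", "dog"), ("chase", "obj", "boy")},   # the dog chased the boy
    "d2": {("chase", "agt", "boy"), ("chase", "obj", "dog")},   # the boy chased the dog
}

def words_of(triples):
    """Bag-of-words view of a triple set: relation names are thrown away."""
    return {w for head, _, tail in triples for w in (head, tail)}

# Query: "documents where a boy is the agent of chasing".
QUERY = {("chase", "agt", "boy")}

# Keyword baseline: every query word must occur somewhere in the document.
keyword_hits = sorted(d for d, t in DOCS.items() if words_of(QUERY) <= words_of(t))

# Relation-based search: every query triple must occur with the same named relation.
relation_hits = sorted(d for d, t in DOCS.items() if QUERY <= t)

print(keyword_hits)
print(relation_hits)
```

The keyword baseline cannot tell the two documents apart, while the named relation separates them. Exact triple matching is the simplest case; similarity between words, mentioned above, would relax the equality test.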
Interface for CDL.nl Data Retrieval (Query)
Query by Natural Language
SQL-like Query Language: CDQL
Graphical Query Interface for CDL.nl data
Approach to Implementing CDQL
1st Step
It is not easy to implement it from scratch. Thus we utilize SPARQL, which is the query language for RDF data. SPARQL is backed by Jena RDB.
Next Step
Possibly an original implementation of the CDQL processing.
CDL.nl Data Retrieval System
CDL to RDF
CDL.nl Data converted into RDF Graphical Form
CDL Data Retrieval via SPARQL :: a simple case
Query (the RealizationLabel of) a person to whom John reported.
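The idea behind this query can be sketched without an RDF store. The following is a stand-in for SPARQL-style graph pattern matching in plain Python, not the actual CDQL or Jena implementation: the node identifiers, the use of gol for "to whom", and the data (John reporting to Mary) are all invented for illustration; only the property name RealizationLabel comes from the query description above.

```python
# CDL.nl data flattened into RDF-like (subject, predicate, object) triples (made-up example).
TRIPLES = [
    ("e1", "RealizationLabel", "reported"),
    ("e1", "agt", "x1"),
    ("e1", "gol", "x2"),
    ("x1", "RealizationLabel", "John"),
    ("x2", "RealizationLabel", "Mary"),
]

def match(patterns, triples, binding=None):
    """Yield variable bindings satisfying all patterns; terms starting with '?' are variables."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)

# "The RealizationLabel of a person to whom John reported."
QUERY = [
    ("?event", "RealizationLabel", "reported"),
    ("?event", "agt", "?agent"),
    ("?agent", "RealizationLabel", "John"),
    ("?event", "gol", "?person"),
    ("?person", "RealizationLabel", "?label"),
]

labels = [b["?label"] for b in match(QUERY, TRIPLES)]
print(labels)
```

Each pattern corresponds to one triple pattern of a SPARQL WHERE clause, and shared variables such as `?event` play the role of joins.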
CDL.nl Data Retrieval via CDQL (an Extended SPARQL)
Toward the Foundation of Next-generation Web
Immediate Applications of Relation Extraction from Texts
Jie Yang, Dat Nguyen and Mitsuru Ishizuka
Dept. of Creative Informatics & Dept. of Info. and Communication Eng.
School of Information Science and Technology
Relation Extraction from Wikipedia
William Henry Gates III (born October 28, 1955) is the co-founder, chairman, former chief software architect, and former CEO of Microsoft Corporation. He is also the founder of Corbis, a digital image archiving company…
Microsoft Corporation,…Headquartered in Redmond, Washington, USA, its best selling products are the Microsoft Windows operating system and the Microsoft Office suite of productivity software.
(Microsoft, location, Redmond)
(Microsoft, product, MS Office)
(Microsoft, product, MS Windows)
…
(Microsoft, founder, Bill Gates)
(Microsoft, CEO, Bill Gates)
(Microsoft, chairman, Bill Gates)