Page 1
Semantic Interoperability of Relational (& Object) Databases
Lucian Russell, [email protected]
Expert Reasoning & Decisions LLC
SICoP – February 5th, 2008
Page 2
A Possible Objection to Semantic Interoperability
Page 3
Semantic Interoperability of Relational Databases
• The three new lower level services needed for Semantic Interoperability are:
– The Query Relaxation Service (for Cooperative Answering)
– The Transformation Reasoner Service applied to database syntax (just shown)
– The Descriptor Processor Service that loads the Knowledge Base Extensions
• The new artifact needed in Data Description (Ch 3) is the Data Descriptor
– Prior Art in Information Systems describes a process of building a Data Model
• Verbal descriptions of a system and its environment were deleted
– The Business Context was beyond the Automation Boundary (Gane & Sarson 1977)
– The requirements specs were generated for what was within the Automation Boundary
– Nouns and related adjectives were to be kept for use in the data model (verbs for “relationships”)
• Verbal descriptions in the Data Model were omitted or truncated
– The Data Dictionary was supposed to describe the automated processes’ relationships to the data, but was often omitted
– Data Base Entity Names were truncated so the programmer would have less to type
– Data Base Attribute Names were also truncated so the programmer would have less to type
• Relationships had very reduced semantics
– The Entity Relationship diagram maps only the Entities into the SQL Schema
– Actual Relationships are realized by Foreign Keys (object ids) among relational instances
– Relationship names are systems documentation only: they are not differentially operationalized
• The Data Descriptor describes in exact Natural Language the Real World and Business Activity for which the Database Captures Data
Page 4
Block Diagram for Current vs. SI Services
[Block diagram contrasting NOW with Semantic Interoperability. NOW: an SQL Front End queries the Database via its Schema and DB Index. Semantic Interoperability adds a Cooperative Answering Interface whose queries go to the SI Query Processor; Query Relaxation guidance and results come from the Common Knowledge Base and its DB Extensions; the Descriptor Processor ingests the Data Descriptor and, via Transformation, builds the Knowledge Base DB Extensions and the DB Index.]
Page 5
Data Descriptor – Rebuilding the DB Environment
• The Data Descriptor is an exact English Language Artifact that describes the relationship between the real world, the social processes modeled and the artificial constructs of the database
• The Entities/Objects in the Schema are a related set of measured/sampled or recorded facts about the start, end, or interim state of a concept
• The concept in the Entity/Object is a combination of measured or recorded facts in four categories of data:
– The Real World
– The Social Organization World
– Individuals
– Mathematical Structures
• The Data Descriptor accurately describes the processes that generate the data
[Diagram: the Data Model draws on Four Data Categories: the Real World, the Social World, Individuals, and Mathematical Structures.]
Page 6
Example: USCIS “Standard Biographic Questionnaire” 7/31/05
• This form was created by a study of the databases used to track the data captured in the many Entities of the USCIS Database. It is based on an analysis and synthesis of the many public forms obtainable from the USCIS website. We will call it “I-1”
• It is not an OMB or USCIS official form at this time
• It is 7 pages long and has 45 blocks of information
Page 7
What is its real structure?
[Diagram: the structure of I-1 is a set of 10 different types of Objects (Entity Clusters), each one of which is defined by a process relating the 4 categories of data:
1: Data Elements: Name & Country of Citizenship (with Photograph, Signature, Fingerprints)
2: Data Elements: Identification Numbers
3: Data Elements: Residence History
4: Data Elements: Education History
5: Data Elements: Employment History
6: Data Elements: Arrivals & Departures
7: Data Elements: Arrests & Citations
8: Data Elements: Marital Information
9: Data Elements: Children’s Names
10: Data Elements: Parents’ Country of Citizenship]
Page 8
Processes: An example citing the relevant verbs
[Diagram: Process 1 (Verb 1) has Sub-Processes 1–4 (Verbs 2–5); Sub-Process 2 (Verb 3) contains Sub-Sub Processes 1–4 (Verbs 6–9). The timeline shows a start time and end time, alternatives, parallel & repeating instances, and alternative sub-processes.]
Page 9
These Verbs are in a Part-of, not a Class, Relationship
• A process is a Meronymic Construction using verbs
• Main Process Verb 1 has 4 sub-processes with their own verbs
– Sub-Process 1 is Verb 2, which has a definite start time but many parallel instances which are interruptible; it has an indefinite end time but will end prior to the process being complete
– Sub-Process 2 has Verb 3, which has a definite start and stop time, but has two Sub-sub-processes, which are Verbs 6 and 7. Verb 7 is an abstract Verb which can be realized by Sub-sub-processes described by Verb 8 or Verb 9, but these are alternatives.
– Sub-Process 3 has Verb 4, with a definite start time and end time
– Sub-Process 4 has Verb 5, with a minimum start time but an indefinite actual start. It ends Process 1.
• As we will illustrate in the next slides, it is the processes which interact with real-world entities/objects
• The processes are the means of real-world entities’/objects’ interaction; these identify Entities, describe the Attributes and change their values
• Current Data Modeling, however, groups all attributes together and destroys the traceability of which process is associated with which attributes
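The part-of structure and time semantics just described can be sketched as a small data structure. This is a hypothetical illustration (the class and the verb placeholders are invented for this sketch, not part of the SICoP design):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Process:
    """A process named by a verb; sub-processes are part-of (meronyms), not subclasses."""
    verb: str
    start: Optional[float] = None        # definite start time, or None if indefinite
    end: Optional[float] = None          # definite end time, or None if indefinite
    parts: List["Process"] = field(default_factory=list)         # meronyms (sub-processes)
    alternatives: List["Process"] = field(default_factory=list)  # realizations of an abstract verb

    def all_verbs(self) -> List[str]:
        """Collect the verbs of this process and all of its parts, depth-first."""
        verbs = [self.verb]
        for p in self.parts + self.alternatives:
            verbs.extend(p.all_verbs())
        return verbs

# The slide's example: Verb 1 with four sub-processes; Sub-Process 2 (Verb 3)
# has sub-sub-processes, and abstract Verb 7 is realized by Verb 8 or Verb 9.
v7 = Process("Verb 7", alternatives=[Process("Verb 8"), Process("Verb 9")])
v3 = Process("Verb 3", start=1.0, end=5.0, parts=[Process("Verb 6"), v7])
p1 = Process("Verb 1", start=0.0, end=10.0,
             parts=[Process("Verb 2", start=0.0), v3,
                    Process("Verb 4", start=2.0, end=4.0),
                    Process("Verb 5", end=10.0)])
```

Note that the tree is traversed by part-of links only; no class hierarchy is involved, which is the point of the slide.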
Page 10
A Process interacts with Objects via attributes’ values
[Diagram: process V1 runs from a start time to an end time. Entities E1 and E2 carry Initial Condition Data Values, undergo State Changes or Samples of Data Values, and end with Final Condition Data Values; the samples involve the Social World and the Individual.]
Page 11
Many Processes Affecting Different Attributes
[Diagram: processes V2, V4 and V5 each affect different attributes of Entities E1, E2 and E3, each identified by its Entity Key.]
Why are there Different Attribute Names?
1. Because of independent database development
2. Because of the USED-AS Telic Relationship
Page 12
To Support Interoperability use Process Descriptions
• At SICoP Feb 6th 2007 meeting a conceptual Data Resource Awareness agent using Language and Logic was posited as the basis for DRM 3.0
• Here is how it is done: by describing the processes that create the data
• The Data Descriptor is what describes the Processes and their Sub-processes
• The processes over time generate the data that is stored in the database using a data model and a schema
• The Data Descriptor is a Natural Language Artifact: it is in English
• Because of WordNet’s disambiguation of English this can be done: it is possible to
– choose an exact meaning of each English Word
– make precise extensions from these words for specialized concepts
• The Data Descriptor explains the Database Schema in words in such a way that it can be converted to a logic representation with many standard terms
• The Data Descriptor’s logic representation is an input into the Descriptor Processor Service that creates a Common Knowledge Extension for the Query Relaxation Reasoner Service
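As a hedged sketch of this pipeline: a Data Descriptor entry could fix one WordNet sense per word and be compiled into simple logic facts for the Descriptor Processor to load. All names here (CUST, schemaName, the sense numbering) are invented for illustration, not the actual artifact format:

```python
# Hypothetical illustration of one Data Descriptor entry compiled into logic facts.
descriptor = {
    "entity": "CUST",                    # truncated schema name from the database
    "english_name": "customer",          # full, precise English name
    "wordnet_sense": ("customer", 1),    # (lemma, sense number) chosen from WordNet
    "generating_process": "apply",       # verb of the process that creates the data
}

def to_logic_facts(d):
    """Compile a descriptor entry into first-order-style facts (as strings)."""
    return [
        f'schemaName("{d["entity"]}", "{d["english_name"]}")',
        f'sense("{d["english_name"]}", "{d["wordnet_sense"][0]}#{d["wordnet_sense"][1]}")',
        f'createdBy("{d["english_name"]}", "{d["generating_process"]}")',
    ]

facts = to_logic_facts(descriptor)
```

The point of the sketch is the direction of flow: English artifact in, standard-term logic representation out, ready to become a Common Knowledge Base extension.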
Page 13
To Build the Data Descriptor
• Start out with the assumptions
– The basic processes assumed to be in play (e.g. people are born)
• Look for the processes that access Entities in the database
– Give them all a name
– Create a process model with sub-processes that create and change data
– Use the right meaning (WordNet) of the right verbs
• Determine upon which Attributes they have an impact
– Determine the life cycle of this process in terms of Entity deletion
– Determine how changes in the underlying assumptions affect the data
– Name all Entities and Attributes fully and precisely in English
– Map current schema names to these English names
• Make all time assumptions about processes and their interactions specific
• Clarify update semantics (e.g. a person moved vs. a street renamed)
• Tag all names with English standard words or precisely define the extension
• Determine if different terminologies are Telic (“Chicken is a food”)
• Let it be trial-processed as a KB extension and change as needed
Page 14
There’s a Combination Process using Categories
• To see the interaction it is better to think of the database as being built up from a number of objects, abstractions of reality
– Each object has a life cycle
– Objects from different categories are combined into a composite object
– The category of the composite object usually depends on what changes fastest
• The process is:
– a Real-World area is named a Country by a Social Organization
– Within that area a sub-area is named a village
– The name of a village in a country has a start time and end time when it is valid
– The difference, the duration, is sometimes less than the lifetime of a person
[Diagram: a PLACE object, Category: Real World (Sub SO): a Country Name and a City/Town/Village name, each with a Start Date, End Date and Duration.
Semantic Model Choice: RW
– The USCIS is interested in the community at its Geo-coordinates regardless of its politics
– This is best thought of as a Real World Object, with subordinate sub-objects that are Social Organization artifacts]
Page 15
Example: Data Block 1&2: Name & Country Of Citizenship; the Output is “I-1 Customer” and it is in Category Social Organizations
[Diagram: the “I-1 Customer” composite object.
– NAME (Category: Individuals): Last/Family Name, First Name, Middle Name, plus Alternate Names; legal name changes are social artifacts (“Your Name”, Data 1 & 2)
– PERSON: Date of Birth, Place of Birth (Geo/Place), Country of Citizenship (“Your Birth Info”, Data 3, 4, 5); the Person is a “Customer” to USCIS
– PLACE (Category: Real World (Sub SO)): Country Name and City/Town/Village, each with Start Date, End Date and Duration; Birth Information links the Person to the Place]
Page 16
[Flow diagram: People → CIS Detect Customer → Vet Non-citizen, through steps 1a: Vet: Name & Country of Citizenship; 1b: Obtain Biometrics; 2: Vet: Id Numbers; 3: Vet: Residence History; 4: Vet: Education History; 5: Vet: Employment History; 6: Vet: Arrivals & Departures; 7: Vet: Arrests & Citations; 8: Vet: Marital Information; 9: Vet: Children’s Names; 10: Vet: Parents’ Country of Citizenship. Each step returns Y/N; Accept/Reject Customer if all YESs (else Stopped In Motion).
How does the data evolve in time? Long Transactions, and within them many Short Transactions – ALL PROCESSES!
Person: changes slowly; USCIS: changes faster]
Page 17
What is the Value? High to the Person Involved!
• In the USCIS database non-citizen person “John John” had an address of
– 2000 28th Street, – Apartment 2A, – Arlington VA 22206
• A letter is returned because the address is changed!
– 2000 Campbell Street, – Apartment 2A, – Arlington VA 22206
• Has “John John” moved without informing USCIS?
– If YES contact ICE and have him thrown in jail
– If NO then send the form to the new address
• At issue is the fact that Arlington did a change of a Street Name
• Do we care about the geo-location of the person or the social context for the address? It sets an SI context for data
[Diagram: two readings of the same Name / Geo-Loc / Address record.
– If what is important is the name of the street on the form: throw him in jail!
– If what is important is whether he actually moved or not: leave him be!]
Page 18
The Ontological Implications: Common Logic
• Quick Look ahead: It is not done in an OWL Ontology
• At 2/6/2007 SICoP meeting, WordNet PI Prof. Christiane Fellbaum re-iterated that nouns are not verbs
– Nouns have Class hierarchies, one type of subclass. You have both hypernyms and hyponyms
– Verbs do not have the same hierarchy: there are four subclass types depending on the time relationships; you have hypernyms but not hyponyms
• Processes are described by verbs
– Verbs describe states, state changes and motions, all of which occur in time
– Verbs, like nouns, can have meronyms and holonyms, “part-of” relationships
– A process has sub-processes that have complex start/stop time descriptions
– Processes and sub-processes, described by verbs, are meronyms and holonyms and are hence NOT related by a class structure
• To model events in time one requires First Order Logic (ISO Common Logic), for example NIST’s Process Specification Language or Cycorp’s CycL
• Data in the database that is captured at the start, middle or end of a process can be described in DL, but the process itself cannot
Page 19
Satisfying a Relaxed Query Semantically
• Raymond Reiter proved in 1982 that, under the closed world assumption (and no, the Web is not well modeled as being “open world”), an SQL query is a logic proof and vice versa
• An SQL query states that “there exists in the database data satisfying the following logical conditions …..” and proves it by finding and returning all instances of the data that meet the criteria
• What a Relaxed Query Service does is look for a chain of reasoning
• The Relaxed Query Reasoner interacts with the user to see if the most general form of the query has been stated
• It then backward chains to see what data could imply data that satisfies the query
The Query Q: Find x, y, z subject to c1, c2, c3
The Query QR (after relaxation):
1. QR: There exists data X, Y, Z satisfying these conditions C1, C2, C3
2. Find data A, B, C such that either A=X or A→X, B=Y … and the resultant instances satisfy the conditions C1, C2, C3.
Data A, B, C is interoperable with X, Y, Z with respect to the query Q because, due to its semantics, it implies X, Y and Z respectively
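The relaxation step can be illustrated with a toy backward chainer: a fact satisfies a relaxed query term if it equals the term or implies it through a semantic rule. The rules below are invented examples for this sketch, not the actual Reasoner:

```python
# Toy query relaxation: a fact kind satisfies a query kind if it equals it or
# implies it via a chain of known semantic rules (rules invented for illustration).
RULES = {
    "street_address": "mailing_address",  # a street address implies a mailing address
    "village": "place",                   # a village implies (is a kind of) a place
}

def satisfies(fact_kind, query_kind, rules=RULES):
    """True if fact_kind equals query_kind or implies it through a rule chain."""
    while fact_kind is not None:
        if fact_kind == query_kind:
            return True
        fact_kind = rules.get(fact_kind)  # backward chain one implication step
    return False
```

The direction matters, as on the slide: data A is interoperable with X because A implies X, not the reverse, so `satisfies("mailing_address", "street_address")` fails.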
Page 20
Theresa Gaasterland’s Cooperative Answering
• Like most really good ideas somebody else has done it first!
– Theresa Gaasterland’s dissertation (“Generating Cooperative Answers in Deductive Databases”, Univ MD PhD thesis,1992) described a process that allowed a user to create a query and then look for alternative answers should the results be inadequate
– The work used deductive databases (e.g. Prolog, Datalog)
– It assumed that “a database designer augments the database with a graph of taxonomic relationships between database predicates (relation) and constants (attribute values)” (IEEE EXPERT, September/October 1997)
– It provides a means of incorporating users’ semantic constraints
– It provides relaxation search mechanisms for exploring the taxonomic and schema descriptions.
[Diagram of Gaasterland’s architecture: an Initial Query, plus User-constraints, Taxonomy Clauses and DB Integrity Constraints, feed a User-constraint Module and a Relaxation module (guided by Integrity Constraints, presuppositions, and user/heuristic query selection); the Query Answering Module processes the Current Query, the Relaxed User-constrained Query, and the Relaxed semantically optimized Query.]
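Gaasterland’s taxonomy-driven relaxation can be sketched as replacing a query constant with its taxonomic parent when the query as stated returns nothing. The facts and taxonomy below are made up for illustration, with the deductive database reduced to a set of ground tuples:

```python
# Minimal taxonomy-based query relaxation (hypothetical facts and taxonomy).
TAXONOMY = {"sedan": "car", "car": "vehicle"}          # child -> parent concept
FACTS = {("owns", "ann", "car"), ("owns", "bob", "vehicle")}

def query(pred, subj, obj):
    """Answer the query exactly as stated against the ground facts."""
    return [(pred, subj, obj)] if (pred, subj, obj) in FACTS else []

def relaxed_query(pred, subj, obj):
    """Try the query as stated; on failure, generalize obj up the taxonomy."""
    while obj is not None:
        answers = query(pred, subj, obj)
        if answers:
            return answers
        obj = TAXONOMY.get(obj)            # relax: replace constant by its parent
    return []
```

For example, asking whether ann owns a sedan fails as stated, relaxes to "car" and succeeds; a real system would also relax predicates and apply the user's semantic constraints.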
Page 21
So Semantic Interoperability can be Achieved
• A two-pronged approach for the underlying Data Layer:
– Build up a general business logic of the government in a Common Knowledge Base, using a general purpose Ontology like CYC
– Build up Data Descriptors that can be ingested to create Extensions to the Common Knowledge Base
• A New Service at the Presentation Layer to Interact with the Knowledge Base to allow Cooperative Answering through query relaxation. This can include reasoning about Data Context artifacts that relate the data to business conditions
• New Services will be installed in a SOA’s Data Layer to use Data Descriptors to build logic representations that then generate Extensions and input to the Transformation and Indexing services
• Each Community of Interest can build out its own knowledge base, and build bridging semantics using the same approach as Data Descriptors (an extension of its metadata activities)
Page 22
Economic Viability? Yes
• Data Descriptors are built by people who can read and write English precisely
• If the Data Descriptor is wrong its Ontology can be withdrawn, the Data Descriptor can be re-written and a new Ontology Built – nothing is thrown away
• As people get practice they will get better at it
• The labor pool is far vaster for English Majors than for DBAs for major products like Oracle, DB2, Sybase, SQL Server etc.
• Many English Majors can be hired for the cost of one DBA!
[Image: 8 Recently Graduated English Majors at $40/hour vs. One Database Administrator at $320/hour]
Page 23
Semantic Interoperability Requires a New Service
[Diagram (Figure 3-1 DRM Standardization Areas): a Data Resource Awareness Agent, using Language and Logic, mediates between the static and dynamic views of the Data & Information & Knowledge Repository.]
A job to be done, YES, but NOT a MIRACLE