Page 1
Semantic Interoperability of Relational (& Object) Databases
Lucian Russell, [email protected]
Expert Reasoning & Decisions LLC
SICoP – February 5th, 2008
Page 2
A Possible Objection to Semantic Interoperability
Page 3
Semantic Interoperability of Relational Databases
• The three new lower level services needed for Semantic Interoperability are:
– The Query Relaxation Service (for Cooperative Answering)
– The Transformation Reasoner Service applied to database syntax (just shown)
– The Descriptor Processor Service that loads the Knowledge Base Extensions
• The new artifact needed in Data Description (Ch 3) is the Data Descriptor
– Prior Art in Information Systems describes a process of building a Data Model
• Verbal descriptions of a system and its environment were deleted
– The Business Context was beyond the Automation Boundary (Gane & Sarson 1977)
– The requirements specs were generated for what was within the Automation Boundary
– Nouns and related adjectives were to be kept for use in the data model (verbs for “relationships”)
• Verbal descriptions in the Data Model were omitted or truncated
– The Data Dictionary was supposed to describe the automated processes’ relationships to the data, but was often omitted
– Data Base Entity Names were truncated so the programmer would have less to type
– Data Base Attribute Names were also truncated so the programmer would have less to type
• Relationships had very reduced semantics
– The Entity Relationship diagram maps only the Entities into the SQL Schema
– Actual Relationships are realized by Foreign Keys (object ids) among relational instances
– Relationship names are systems documentation only: they are not differentially operationalized
• The Data Descriptor describes in exact Natural Language the Real World and Business Activity for which the Database Captures Data
Page 4
Block Diagram for Current vs. SI Services
[Block diagram contrasting NOW with Semantic Interoperability. NOW: an SQL Front End queries the Database via its Schema and DB Index. Semantic Interoperability adds a Cooperative Answering Interface whose queries go to the SI Query Processor; Query Relaxation guidance and results come from the Common Knowledge Base and its DB Extensions; the Descriptor Processor ingests the Data Descriptor and, via Transformation, builds the Knowledge Base DB Extensions and the DB Index.]
Page 5
Data Descriptor – Rebuilding the DB Environment
• The Data Descriptor is an exact English Language Artifact that describes the relationship between the real world, the social processes modeled and the artificial constructs of the database
• The Entities/Objects in the Schema are a related set of measured/sampled or recorded facts about the start, end, or interim state of a concept
• The concept in the Entity/Object is a combination of measured or recorded facts in four categories of data:
– The Real World
– The Social Organization World
– Individuals
– Mathematical Structures
• The Data Descriptor accurately describes the processes that generate the data
[Diagram: the Data Model draws on Four Data Categories: the Real World, the Social World, Individuals, and Mathematical Structures.]
Page 6
Example: USCIS “Standard Biographic Questionnaire” 7/31/05
• This form was created by a study of the databases used to track the data captured in the many Entities of the USCIS Database. It is based on an analysis and synthesis of the many public forms obtainable from the USCIS website. We will call it “I-1”
• It is not an OMB or USCIS official form at this time
• It is 7 pages long and has 45 blocks of information
Page 7
What is its real structure?
[Diagram: the structure of I-1 is a set of 10 different types of Objects (Entity Clusters), each one of which is defined by a process relating the 4 categories of data:
1: Data Elements: Name & Country of Citizenship (with Photograph, Signature, Fingerprints)
2: Data Elements: Identification Numbers
3: Data Elements: Residence History
4: Data Elements: Education History
5: Data Elements: Employment History
6: Data Elements: Arrivals & Departures
7: Data Elements: Arrests & Citations
8: Data Elements: Marital Information
9: Data Elements: Children’s Names
10: Data Elements: Parents’ Country of Citizenship]
Page 8
Processes: An example citing the relevant verbs
[Diagram: Process 1 (Verb 1) has Sub-Processes 1–4 (Verbs 2–5); Sub-Process 2 (Verb 3) contains Sub-Sub Processes 1–4 (Verbs 6–9). The timeline shows a start time and end time, alternatives, parallel & repeating instances, and alternative sub-processes.]
Page 9
These Verbs are in a Part-of, not a Class, Relationship
• A process is a Meronymic Construction using verbs
• Main Process Verb 1 has 4 sub-processes with their own verbs
– Sub-Process 1 is Verb 2, which has a definite start time but many parallel instances which are interruptible; it has an indefinite end time but will end prior to the process being complete
– Sub-Process 2 has Verb 3, which has a definite start and stop time, but has two Sub-sub-processes, which are Verbs 6 and 7. Verb 7 is an abstract Verb which can be realized by Sub-sub-processes described by Verb 8 or Verb 9, but these are alternatives.
– Sub-Process 3 has Verb 4, with a definite start time and end time
– Sub-Process 4 has Verb 5, with a minimum start time but an indefinite actual start. It ends Process 1.
• As we will illustrate in the next slides, it is the processes which interact with real-world entities/objects
• The processes are the means of real-world entities’/objects’ interaction; these identify Entities, describe the Attributes and change their values
• Current Data Modeling, however, groups all attributes together and destroys the traceability of which process is associated with which attributes
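The part-of structure and time semantics just described can be sketched as a small data structure. This is a hypothetical illustration (the class and the verb placeholders are invented for this sketch, not part of the SICoP design):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Process:
    """A process named by a verb; sub-processes are part-of (meronyms), not subclasses."""
    verb: str
    start: Optional[float] = None        # definite start time, or None if indefinite
    end: Optional[float] = None          # definite end time, or None if indefinite
    parts: List["Process"] = field(default_factory=list)         # meronyms (sub-processes)
    alternatives: List["Process"] = field(default_factory=list)  # realizations of an abstract verb

    def all_verbs(self) -> List[str]:
        """Collect the verbs of this process and all of its parts, depth-first."""
        verbs = [self.verb]
        for p in self.parts + self.alternatives:
            verbs.extend(p.all_verbs())
        return verbs

# The slide's example: Verb 1 with four sub-processes; Sub-Process 2 (Verb 3)
# has sub-sub-processes, and abstract Verb 7 is realized by Verb 8 or Verb 9.
v7 = Process("Verb 7", alternatives=[Process("Verb 8"), Process("Verb 9")])
v3 = Process("Verb 3", start=1.0, end=5.0, parts=[Process("Verb 6"), v7])
p1 = Process("Verb 1", start=0.0, end=10.0,
             parts=[Process("Verb 2", start=0.0), v3,
                    Process("Verb 4", start=2.0, end=4.0),
                    Process("Verb 5", end=10.0)])
```

Note that the tree is traversed by part-of links only; no class hierarchy is involved, which is the point of the slide.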
Page 10
A Process interacts with Objects via attributes’ values
[Diagram: process V1 runs from a start time to an end time. Entities E1 and E2 carry Initial Condition Data Values, undergo State Changes or Samples of Data Values, and end with Final Condition Data Values; the samples involve the Social World and the Individual.]
Page 11
Many Processes Affecting Different Attributes
[Diagram: processes V2, V4 and V5 each affect different attributes of Entities E1, E2 and E3, each identified by its Entity Key.]
Why are there Different Attribute Names?
1. Because of independent database development
2. Because of the USED-AS Telic Relationship
Page 12
To Support Interoperability use Process Descriptions
• At SICoP Feb 6th 2007 meeting a conceptual Data Resource Awareness agent using Language and Logic was posited as the basis for DRM 3.0
• Here is how it is done: by describing the processes that create the data
• The Data Descriptor is what describes the Processes and their Sub-processes
• The processes over time generate the data that is stored in the database using a data model and a schema
• The Data Descriptor is a Natural Language Artifact: it is in English
• Because of WordNet’s disambiguation of English this can be done: it is possible to
– choose an exact meaning of each English Word
– make precise extensions from these words for specialized concepts
• The Data Descriptor explains the Database Schema in words in such a way that it can be converted to a logic representation with many standard terms
• The Data Descriptor’s logic representation is an input into the Descriptor Processor Service that creates a Common Knowledge Extension for the Query Relaxation Reasoner Service
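As a hedged sketch of this pipeline: a Data Descriptor entry could fix one WordNet sense per word and be compiled into simple logic facts for the Descriptor Processor to load. All names here (CUST, schemaName, the sense numbering) are invented for illustration, not the actual artifact format:

```python
# Hypothetical illustration of one Data Descriptor entry compiled into logic facts.
descriptor = {
    "entity": "CUST",                    # truncated schema name from the database
    "english_name": "customer",          # full, precise English name
    "wordnet_sense": ("customer", 1),    # (lemma, sense number) chosen from WordNet
    "generating_process": "apply",       # verb of the process that creates the data
}

def to_logic_facts(d):
    """Compile a descriptor entry into first-order-style facts (as strings)."""
    return [
        f'schemaName("{d["entity"]}", "{d["english_name"]}")',
        f'sense("{d["english_name"]}", "{d["wordnet_sense"][0]}#{d["wordnet_sense"][1]}")',
        f'createdBy("{d["english_name"]}", "{d["generating_process"]}")',
    ]

facts = to_logic_facts(descriptor)
```

The point of the sketch is the direction of flow: English artifact in, standard-term logic representation out, ready to become a Common Knowledge Base extension.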
Page 13
To Build the Data Descriptor
• Start out with the assumptions
– The basic processes assumed to be in play (e.g. people are born)
• Look for the processes that access Entities in the database
– Give them all a name
– Create a process model with sub-processes that create and change data
– Use the right meaning (WordNet) of the right verbs
• Determine upon which Attributes they have an impact
– Determine the life cycle of this process in terms of Entity deletion
– Determine how changes in the underlying assumptions affect the data
– Name all Entities and Attributes fully and precisely in English
– Map current schema names to these English names
• Make all time assumptions about processes and their interactions specific
• Clarify update semantics (e.g. a person moved vs. a street renamed)
• Tag all names with English standard words or precisely define the extension
• Determine if different terminologies are Telic (“Chicken is a food”)
• Let it be trial-processed as a KB extension and change as needed
Page 14
There’s a Combination Process using Categories
• To see the interaction it is better to think of the database as being built up from a number of objects, abstractions of reality
– Each object has a life cycle
– Objects from different categories are combined into a composite object
– The category of the composite object usually depends on what changes fastest
• The process is:
– a Real-World area is named a Country by a Social Organization
– Within that area a sub-area is named a village
– The name of a village in a country has a start time and end time when it is valid
– The difference, the duration, is sometimes less than the lifetime of a person
[Diagram: a PLACE object, Category: Real World (Sub SO): a Country Name and a City/Town/Village name, each with a Start Date, End Date and Duration.
Semantic Model Choice: RW
– The USCIS is interested in the community at its Geo-coordinates regardless of its politics
– This is best thought of as a Real World Object, with subordinate sub-objects that are Social Organization artifacts]
Page 15
Example: Data Block 1&2: Name & Country Of Citizenship; the Output is “I-1 Customer” and it is in Category Social Organizations
[Diagram: the “I-1 Customer” composite object.
– NAME (Category: Individuals): Last/Family Name, First Name, Middle Name, plus Alternate Names; legal name changes are social artifacts (“Your Name”, Data 1 & 2)
– PERSON: Date of Birth, Place of Birth (Geo/Place), Country of Citizenship (“Your Birth Info”, Data 3, 4, 5); the Person is a “Customer” to USCIS
– PLACE (Category: Real World (Sub SO)): Country Name and City/Town/Village, each with Start Date, End Date and Duration; Birth Information links the Person to the Place]
Page 16
[Flow diagram: People → CIS Detect Customer → Vet Non-citizen, through steps 1a: Vet: Name & Country of Citizenship; 1b: Obtain Biometrics; 2: Vet: Id Numbers; 3: Vet: Residence History; 4: Vet: Education History; 5: Vet: Employment History; 6: Vet: Arrivals & Departures; 7: Vet: Arrests & Citations; 8: Vet: Marital Information; 9: Vet: Children’s Names; 10: Vet: Parents’ Country of Citizenship. Each step returns Y/N; Accept/Reject Customer if all YESs (else Stopped In Motion).
How does the data evolve in time? Long Transactions, and within them many Short Transactions – ALL PROCESSES!
Person: changes slowly; USCIS: changes faster]
Page 17
What is the Value? High to the Person Involved!
• In the USCIS database non-citizen person “John John” had an address of
– 2000 28th Street, – Apartment 2A, – Arlington VA 22206
• A letter is returned because the address is changed!
– 2000 Campbell Street, – Apartment 2A, – Arlington VA 22206
• Has “John John” moved without informing USCIS?
– If YES contact ICE and have him thrown in jail
– If NO then send the form to the new address
• At issue is the fact that Arlington did a change of a Street Name
• Do we care about the geo-location of the person or the social context for the address? It sets an SI context for data
[Diagram: two readings of the same Name / Geo-Loc / Address record.
– If what is important is the name of the street on the form: throw him in jail!
– If what is important is whether he actually moved or not: leave him be!]
Page 18
The Ontological Implications: Common Logic
• Quick Look ahead: It is not done in an OWL Ontology
• At 2/6/2007 SICoP meeting, WordNet PI Prof. Christiane Fellbaum re-iterated that nouns are not verbs
– Nouns have Class hierarchies, one type of subclass. You have both hypernyms and hyponyms
– Verbs do not have the same hierarchy: there are four subclass types depending on the time relationships; you have hypernyms but not hyponyms
• Processes are described by verbs
– Verbs describe states, state changes and motions, all of which occur in time
– Verbs, like nouns, can have meronyms and holonyms, “part-of” relationships
– A process has sub-processes that have complex start/stop time descriptions
– Processes and sub-processes, described by verbs, are meronyms and holonyms and are hence NOT related by a class structure
• To model events in time one requires First Order Logic (ISO Common Logic), for example NIST’s Process Specification Language or Cycorp’s CycL
• Data in the database that is captured at the start, middle or end of a process can be described in DL, but the process itself cannot
Page 19
Satisfying a Relaxed Query Semantically
• Raymond Reiter proved in 1982 that, under the closed world assumption (and no, the Web is not well modeled as being “open world”), an SQL query is a logic proof and vice versa
• An SQL query states that “there exists in the database data satisfying the following logical conditions …..” and proves it by finding and returning all instances of the data that meet the criteria
• What a Relaxed Query Service does is look for a chain of reasoning
• The Relaxed Query Reasoner interacts with the user to see if the most general form of the query has been stated
• It then backward chains to see what data could imply data that satisfies the query
The Query Q: Find x, y, z subject to c1, c2, c3
The Query QR (after relaxation):
1. QR: There exists data X, Y, Z satisfying these conditions C1, C2, C3
2. Find data A, B, C such that either A=X or A→X, B=Y … and the resultant instances satisfy the conditions C1, C2, C3.
Data A, B, C is interoperable with X, Y, Z with respect to the query Q because, due to its semantics, it implies X, Y and Z respectively
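The relaxation step can be illustrated with a toy backward chainer: a fact satisfies a relaxed query term if it equals the term or implies it through a semantic rule. The rules below are invented examples for this sketch, not the actual Reasoner:

```python
# Toy query relaxation: a fact kind satisfies a query kind if it equals it or
# implies it via a chain of known semantic rules (rules invented for illustration).
RULES = {
    "street_address": "mailing_address",  # a street address implies a mailing address
    "village": "place",                   # a village implies (is a kind of) a place
}

def satisfies(fact_kind, query_kind, rules=RULES):
    """True if fact_kind equals query_kind or implies it through a rule chain."""
    while fact_kind is not None:
        if fact_kind == query_kind:
            return True
        fact_kind = rules.get(fact_kind)  # backward chain one implication step
    return False
```

The direction matters, as on the slide: data A is interoperable with X because A implies X, not the reverse, so `satisfies("mailing_address", "street_address")` fails.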
Page 20
Theresa Gaasterland’s Cooperative Answering
• Like most really good ideas somebody else has done it first!
– Theresa Gaasterland’s dissertation (“Generating Cooperative Answers in Deductive Databases”, Univ MD PhD thesis,1992) described a process that allowed a user to create a query and then look for alternative answers should the results be inadequate
– The work used deductive databases (e.g. Prolog, Datalog)
– It assumed that “a database designer augments the database with a graph of taxonomic relationships between database predicates (relation) and constants (attribute values)” (IEEE EXPERT, September/October 1997)
– It provides a means of incorporating users’ semantic constraints
– It provides relaxation search mechanisms for exploring the taxonomic and schema descriptions.
[Diagram of Gaasterland’s architecture: an Initial Query, plus User-constraints, Taxonomy Clauses and DB Integrity Constraints, feed a User-constraint Module and a Relaxation module (guided by Integrity Constraints, presuppositions, and user/heuristic query selection); the Query Answering Module processes the Current Query, the Relaxed User-constrained Query, and the Relaxed semantically optimized Query.]
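Gaasterland’s taxonomy-driven relaxation can be sketched as replacing a query constant with its taxonomic parent when the query as stated returns nothing. The facts and taxonomy below are made up for illustration, with the deductive database reduced to a set of ground tuples:

```python
# Minimal taxonomy-based query relaxation (hypothetical facts and taxonomy).
TAXONOMY = {"sedan": "car", "car": "vehicle"}          # child -> parent concept
FACTS = {("owns", "ann", "car"), ("owns", "bob", "vehicle")}

def query(pred, subj, obj):
    """Answer the query exactly as stated against the ground facts."""
    return [(pred, subj, obj)] if (pred, subj, obj) in FACTS else []

def relaxed_query(pred, subj, obj):
    """Try the query as stated; on failure, generalize obj up the taxonomy."""
    while obj is not None:
        answers = query(pred, subj, obj)
        if answers:
            return answers
        obj = TAXONOMY.get(obj)            # relax: replace constant by its parent
    return []
```

For example, asking whether ann owns a sedan fails as stated, relaxes to "car" and succeeds; a real system would also relax predicates and apply the user's semantic constraints.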
Page 21
So Semantic Interoperability can be Achieved
• A two-pronged approach for the underlying Data Layer:
– Build up a general business logic of the government in a Common Knowledge Base, using a general purpose Ontology like CYC
– Build up Data Descriptors that can be ingested to create Extensions to the Common Knowledge Base
• A New Service at the Presentation Layer to Interact with the Knowledge Base to allow Cooperative Answering through query relaxation. This can include reasoning about Data Context artifacts that relate the data to business conditions
• New Services will be installed in a SOA’s Data Layer to use Data Descriptors to build logic representations that then generate Extensions and input to the Transformation and Indexing services
• Each Community of Interest can build out its own knowledge base, and build bridging semantics using the same approach as Data Descriptors (an extension of its metadata activities)
Page 22
Economic Viability? Yes
• Data Descriptors are built by people who can read and write English precisely
• If the Data Descriptor is wrong its Ontology can be withdrawn, the Data Descriptor can be re-written and a new Ontology Built – nothing is thrown away
• As people get practice they will get better at it
• The labor pool is far vaster for English Majors than for DBAs for major products like Oracle, DB2, Sybase, SQL Server etc.
• Many English Majors can be hired for the cost of one DBA!
[Image: 8 Recently Graduated English Majors at $40/hour vs. One Database Administrator at $320/hour]
Page 23
Semantic Interoperability Requires a New Service
[Diagram (Figure 3-1 DRM Standardization Areas): a Data Resource Awareness Agent, using Language and Logic, mediates between the static and dynamic views of the Data & Information & Knowledge Repository.]
A job to be done, YES, but NOT a MIRACLE