Slide 1 (www.geongrid.org, CYBERINFRASTRUCTURE FOR THE GEOSCIENCES)
Towards a Generic Framework for Semantic Data Registration and Integration in Geosciences
Kai Lin, Chaitan Baru
San Diego Supercomputer Center
University of California, San Diego
Slide 2
Data Integration Goal
• Query heterogeneous data sources as a single resource
  – Query: no need to write a program (“ad hoc, non-procedural query languages”)
  – Heterogeneous: the local resource controls the definition of the data
  – Single resource: remove the burden of individually accessing each data source
Slide 3
Data Integration Challenges: Heterogeneities
• Syntactical Heterogeneity: heterogeneous data formats, e.g. 02-04-2004 vs. 02/04/04
• Structural Heterogeneity: heterogeneous data models and schemas, e.g. 02-04-2004 is saved as three columns or as one column
• Semantic Heterogeneity: fuzzy metadata, terminology, “hidden” semantics, implicit assumptions
GEON Solution:
• Data should be semantically registered to GEON first
• Heterogeneities are resolved by registration
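The date example for syntactical and structural heterogeneity can be made concrete with a minimal Python sketch. It assumes a month-first reading of both formats, which the slides do not state, and the function names are hypothetical.

```python
from datetime import date, datetime

# Assumption: both string formats are month-first; the slides do not say.
SINGLE_COLUMN_FORMATS = ("%m-%d-%Y", "%m/%d/%y")

def parse_single_column(text: str) -> date:
    """Syntactic case: one column, but several possible string formats."""
    for fmt in SINGLE_COLUMN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {text!r}")

def parse_three_columns(month: int, day: int, year: int) -> date:
    """Structural case: the same date stored as three separate columns."""
    return date(year, month, day)

a = parse_single_column("02-04-2004")
b = parse_single_column("02/04/04")
c = parse_three_columns(2, 4, 2004)
```

Once all three representations are normalized to the same value, they can be compared or joined directly.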
• Requires ontology annotations for backend databases
• Use a simple ontology query language to query the integrated database
• End users do not need to know the backend schemas and local semantics
[Figure: diagram with the labels “Ontology Based Query”, “GEON Mediator”, and “backend” (twice), plus two graphs with nodes labeled A through G]
Slide 14
GEON Ontology Based Data Integration
Challenges for Computer Scientists and Domain Scientists
– Computer Scientists: build an integration system based on the ontological registration of datasets
– Domain Scientists: create domain ontologies
– Data Providers: register datasets to ontologies
[Figure: Ontology1, Ontology2, and Ontology3 shown with dataset1, dataset2, dataset3, and dataset4]
Slide 15
Ontological Data Registration for Data Integration
• Registering a dataset to an ontology for data integration is a procedure that generates a partial model of the ontology from the dataset itself
[Figure: labels “dataset”, “From registration”, “individuals”, “ontology”, and “p”. Annotation: not all the constraints in the ontology are satisfied by the generated individuals.]
Slide 16
Registering Relational Tables to Ontology Classes
• Associate one or more columns, under an optional SQL condition, with a selected class in the ontology
• Provide a mapping method that generates the names of individuals when the data has no explicit names for them
  – (23.5, 47.9) is the name of an individual of the class Location
  – The same name indicates the same location
[Figure: a table with columns RockSample, GeologicAge, … and GeologicAge values Jurassic/Triassic, Precambrian, …, next to the ontology class GeologicalAge with Precambrian, Cenozoic, and Paleozoic]
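As a rough illustration of class registration, the sketch below associates two columns of a table with the class Location under an optional SQL condition and uses a mapping method to generate individual names. It uses an in-memory SQLite table; the table layout, the `lat` and `lon` columns, and the function `register_to_class` are assumptions made for the example, not part of GEON.

```python
import sqlite3

def register_to_class(conn, table, columns, name_of, condition=None):
    """Return the names of the individuals generated from the selected columns.

    name_of is the mapping method: it turns one row of column values
    into the name of an individual. Equal names mean the same individual.
    """
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if condition:
        sql += f" WHERE {condition}"  # optional SQL condition
    return {name_of(*row) for row in conn.execute(sql)}

# Hypothetical backend table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ssID TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("s1", 23.5, 47.9), ("s2", 23.5, 47.9), ("s3", 10.0, 20.0), ("s4", None, None)],
)

# Register (lat, lon) to the class Location; rows without coordinates are skipped.
locations = register_to_class(
    conn,
    "samples",
    ["lat", "lon"],
    name_of=lambda lat, lon: f"({lat}, {lon})",
    condition="lat IS NOT NULL AND lon IS NOT NULL",
)
```

Samples s1 and s2 yield the same name, so they contribute a single Location individual.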
Slide 17
Registering Relational Tables to Ontology Object Properties
• Associate two entities that are already registered to the domain class and the range class of a selected object property in the ontology
• The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry, and Images represent instances of RockSample
• Create a partial model of ontologies from databases
• Independent of end interface
• Independent of specific database implementations
• The ODAL mapping is itself a “first-class” object
Slide 20
ODAL: Import Ontologies
The ontologies used for annotating a database can be imported as follows:
Slide 26
Conditions for Joining Individuals from Different Resources
• To join data across independent resources, we need to know the correspondence between entities.
• For example, does “10001” represent the same rock in the two resources? By default, we assume it does not.
• A set of datatype properties can be declared as a key for a class in the ontology. Joins across multiple resources are based on keys.
  – e.g. {hasLatitude, hasLongitude} can be declared as a key of Location
  – Two locations from different resources are the same if they have the same latitude and longitude
[Figure: the class Rock with two tables, one with column RockSampleID and one with column RockID, each containing the value 10001]
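The key-based join condition can be sketched as follows. The key {hasLatitude, hasLongitude} comes from the slide; the resource names, the `id` field, and the function `join_on_key` are hypothetical, and exact float equality is a simplification that a real system might replace with a tolerance.

```python
from collections import defaultdict

# Key declared for the class Location, as in the slide's example.
LOCATION_KEY = ("hasLatitude", "hasLongitude")

def join_on_key(resources, key):
    """Group individuals from different resources that agree on every key property."""
    groups = defaultdict(list)
    for resource_name, individuals in resources.items():
        for individual in individuals:
            key_value = tuple(individual[prop] for prop in key)
            groups[key_value].append((resource_name, individual["id"]))
    # Keep only key values that occur in more than one resource.
    return {
        k: members
        for k, members in groups.items()
        if len({resource for resource, _ in members}) > 1
    }

# Hypothetical data: local ids are not trusted across resources.
resources = {
    "resource_a": [
        {"id": "10001", "hasLatitude": 23.5, "hasLongitude": 47.9},
        {"id": "10002", "hasLatitude": 10.0, "hasLongitude": 20.0},
    ],
    "resource_b": [
        {"id": "10001", "hasLatitude": 60.0, "hasLongitude": 100.0},
        {"id": "L7", "hasLatitude": 23.5, "hasLongitude": 47.9},
    ],
}

matches = join_on_key(resources, LOCATION_KEY)
```

Here the two individuals that share the local id 10001 are not joined, while 10001 from the first resource and L7 from the second are, because they agree on the declared key.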
Slide 27
SOQL (Simple Ontology Query Language)
Query single or integrated resources
• via ontologies (i.e., high-level logical views)
• independent of schema-level representation
[Figure: ontology fragment. RockSample has the properties location (class Location, with float-valued lat and long) and hasSiO2 (class ValueWithUnit, with a float value and a string unit)]
SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60
  AND X.location.long > 100
  AND X.hasSiO2.value < 30
  AND X.hasSiO2.unit = 'weightPercetage'
[Figure: the GUI generates SOQL, which goes to the processor]
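To show what the path expressions in such a query mean, here is a small Python sketch that evaluates the same conditions over in-memory individuals represented as nested dictionaries. This is only an illustration of the query's semantics, not how the GEON mediator executes SOQL; the sample data and the helper `follow` are made up, and the unit string is copied as spelled on the slide.

```python
from functools import reduce

def follow(individual, path):
    """Resolve a dotted property path such as 'location.lat' on a nested dict."""
    return reduce(lambda node, prop: node[prop], path.split("."), individual)

# Hypothetical RockSample individuals.
rock_samples = [
    {"location": {"lat": 65.0, "long": 120.0},
     "hasSiO2": {"value": 25.0, "unit": "weightPercetage"}},
    {"location": {"lat": 65.0, "long": 120.0},
     "hasSiO2": {"value": 45.0, "unit": "weightPercetage"}},
    {"location": {"lat": 30.0, "long": 120.0},
     "hasSiO2": {"value": 25.0, "unit": "weightPercetage"}},
]

def run_query(samples):
    """SELECT X.location.* with the four WHERE conditions of the slide's query."""
    return [
        follow(x, "location")
        for x in samples
        if follow(x, "location.lat") > 60
        and follow(x, "location.long") > 100
        and follow(x, "hasSiO2.value") < 30
        and follow(x, "hasSiO2.unit") == "weightPercetage"
    ]

result = run_query(rock_samples)
```

Only the first sample satisfies all four conditions, so only its location is returned.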
Slide 28
The Architecture of GEON Semantic Mediator
[Figure: architecture diagram with the components Portal or Application; GUI; Mediator JDBC Driver; SOQL; SOQL Parser; Semantic Query Rewriter; Ontology Reasoner; SOQL Processor; OWL; ODAL; ODAL Processor; Spatial SQL against federal schemas; SQL Parser; Query Planning; Query Optimization; Query Execution; Internal Database; and Oracle, DB2, MySQL, SQLServer, PostgreSQL, PostGIS]
Slide 29
SELECT X.code, X.location.*
FROM SeismicStation X, Railroad Y
WHERE distance(X.location, Y.geometry) < 1

SELECT X2.stationcode, X2.lat, X2.lon
FROM railroads_of_the_united_states X1, stationdatatable X2
WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
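The pair of queries above suggests a rewriting from ontology-level path expressions to table and column expressions. The following Python sketch imitates that step with plain textual substitution. The two mapping tables are read off the example queries and are assumptions, the function `rewrite` is hypothetical, and a real rewriter would work from a parsed query and the registered mappings rather than regular expressions.

```python
import re

# Assumed registration info, inferred from the example pair of queries.
CLASS_TO_TABLE = {
    "SeismicStation": "stationdatatable",
    "Railroad": "railroads_of_the_united_states",
}
PROPERTY_TO_SQL = {
    ("SeismicStation", "code"): "{a}.stationcode",
    ("SeismicStation", "location"): "MakePoint({a}.lat, {a}.lon)",
    ("SeismicStation", "location.*"): "{a}.lat, {a}.lon",
    ("Railroad", "geometry"): "{a}.the_geom",
}

# Matches path expressions such as X.code or X.location.*
PATH = re.compile(r"\b(\w+)\.((?:\w+\.)*(?:\w+|\*))")

def rewrite(soql: str) -> str:
    """Rewrite a SELECT/FROM/WHERE SOQL string into SQL by substitution."""
    m = re.fullmatch(
        r"\s*SELECT\s+(.*?)\s+FROM\s+(.*?)\s+WHERE\s+(.*?)\s*", soql, re.S | re.I
    )
    if m is None:
        raise ValueError("expected SELECT ... FROM ... WHERE ...")
    select, from_, where = m.groups()

    var_class, tables = {}, []
    for item in from_.split(","):
        cls, var = item.split()
        var_class[var] = cls
        tables.append(f"{CLASS_TO_TABLE[cls]} {var}")

    def expand(match):
        var, path = match.group(1), match.group(2)
        if var not in var_class:
            return match.group(0)  # not a query variable; leave untouched
        return PROPERTY_TO_SQL[(var_class[var], path)].format(a=var)

    return (
        f"SELECT {PATH.sub(expand, select)} "
        f"FROM {', '.join(tables)} "
        f"WHERE {PATH.sub(expand, where)}"
    )

sql = rewrite(
    "SELECT X.code, X.location.* FROM SeismicStation X, Railroad Y "
    "WHERE distance(X.location, Y.geometry) < 1"
)
```

The sketch keeps the SOQL variable names X and Y as table aliases and preserves the argument order of distance, so its output differs from the slide's SQL in alias names and argument order while expressing the same condition.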