From Databases to Dataspaces*
Wearing the Linked Data goggles
* M. Franklin, A. Halevy, D. Maier in ACM SIGMOD Record, Dec. 2005
DERI reading group presentation 23.02.2011 PhD J. Umbrich
Background of the paper
• Motivation of the paper in 2005
– Development of relational database management systems showed spectacular results
– BUT: “data everywhere” and use cases relying on large amounts of diverse, interrelated data sources pose new challenges for data management
• The authors
– M. Franklin: UC Berkeley, large-scale data management
– A. Halevy: Google Inc., usage of structured data in web search
– D. Maier: Portland State University, coined Datalog, data stream processing
Topic of the paper
Dataspaces and their support systems as a new agenda for data management
The Problem: Data Management
• Loosely connected data sources
• Information is available in various formats
• Not always control over the data
• Low-level data management challenges across heterogeneous collections
– Search & querying
– Integrity constraints
– Naming conventions
– Tracking lineage
– Availability & recovery
– Access control
– (Meta)data evolution
– Enforcing rules
The Solution
• Define a space of data
– Identifiable scope and control across the data and underlying systems
• DataSpace Support Platforms (DSSPs)
– Offer a suite of interrelated services and guarantees over self-managed data sources (no complete control over the data)
• Pay-as-you-go
– Keyword search is the bare minimum
– More functionality and increased consistency as you add work
DataSpaces: System
DataSpaces: Logical Components
• Data co-existence approach (not data integration)
• Contains all information relevant to a particular organisation, regardless of format and location
• Models a rich collection of relationships between data repositories
Participants
• Individual data sources
• RDBs, XML, text, services
• Stored or streamed
• Different query support
• Updatable or read-only
Relations
• Any kind of relationship
– A is a replica of B
– C is a mapping for A and B
• Broader set of relations
– E and F were created independently but cover the same physical system
DataSpaces: Services
• Content heterogeneity requires multiple styles of data access
• Cataloging data resources (source, name, size, creation date, location)
• Search as a primary mechanism to deal with large collections and unfamiliar data (similarity search, ranking)
– Search is applicable to all content of the dataspace regardless of data format (including metadata)
• Updates (a major research challenge)
• Monitoring, event detection, support for complex workflows
DataSpaces: System
Source: Franklin et al.: From Databases to Dataspaces, SIGMOD Rec. 2005
DSSP: Catalog
• Contains information about all the participants
– E.g. rate of change, query answering, statistics, ownership, access and privacy policies, relationships
• Basic inventory
– Identifier, type, creation date
• Answering presence or absence of a data element
• Model management environment on top of the catalog
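The catalog idea above can be sketched in a few lines. This is a minimal illustration only, not the design from the paper: the class names `Participant` and `Catalog`, the `replica_of` relation label, and all sample values are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Participant:
    """Basic inventory entry for one data source in the dataspace."""
    identifier: str
    kind: str                                    # e.g. "rdb", "xml", "text", "service"
    created: date
    owner: str = "unknown"
    elements: set = field(default_factory=set)   # names of data elements the source exposes


class Catalog:
    """Hypothetical in-memory catalog: participants plus typed relationships."""

    def __init__(self):
        self.participants = {}
        self.relationships = []                  # (subject id, relation name, object id)

    def register(self, participant):
        self.participants[participant.identifier] = participant

    def relate(self, subject, relation, obj):
        # Only allow relationships between registered participants.
        if subject not in self.participants or obj not in self.participants:
            raise KeyError("both participants must be registered first")
        self.relationships.append((subject, relation, obj))

    def sources_with(self, element):
        """Answer presence/absence: which participants hold this data element?"""
        return sorted(p.identifier for p in self.participants.values()
                      if element in p.elements)


# Illustrative sample data, not taken from the paper.
catalog = Catalog()
catalog.register(Participant("A", "rdb", date(2004, 5, 1), elements={"customer", "order"}))
catalog.register(Participant("B", "xml", date(2005, 1, 15), elements={"customer"}))
catalog.relate("B", "replica_of", "A")

print(catalog.sources_with("customer"))   # ['A', 'B']
print(catalog.sources_with("invoice"))    # []
```

An empty result is itself an answer: the catalog can state that no participant holds the element without contacting any source.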
DSSP: Search & Query
• Query everything
– Query data items regardless of format
– Keyword search
• Structured query
– Common interfaces (mediated schema)
– Over a specific source
– Peer data management systems
– Various query formats with mappings
• Metadata queries
– Result sources, timestamps, uncertainty
– Source location and similarity queries
• Monitoring
– Stateless or stateful
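A rough sketch of "query everything" with keyword search: the same query runs over relational rows, plain text, and XML, and each hit reports its source so that a simple metadata question ("where did this come from?") can be answered. The function names, source identifiers, and sample content are all hypothetical, and the ranking is just a count of matched keywords.

```python
import xml.etree.ElementTree as ET


def tokens_from(content):
    """Flatten a participant's content into lowercase word tokens."""
    if isinstance(content, ET.Element):        # XML document
        text = " ".join(content.itertext())
    elif isinstance(content, list):            # relational rows as dicts
        text = " ".join(str(v) for row in content for v in row.values())
    else:                                      # plain text
        text = str(content)
    return set(text.lower().split())


def keyword_search(participants, query):
    """Return (source id, matched keyword count) pairs, best match first."""
    wanted = set(query.lower().split())
    hits = []
    for source_id, content in participants.items():
        matched = len(wanted & tokens_from(content))
        if matched:
            hits.append((source_id, matched))
    # Rank by number of matched keywords, then by source id for stable output.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Illustrative participants in three different formats.
participants = {
    "crm_db": [{"name": "Alice", "city": "Galway"}],
    "notes.txt": "Met Alice in Dublin",
    "orders.xml": ET.fromstring(
        "<orders><order><buyer>Bob</buyer><city>Galway</city></order></orders>"
    ),
}

print(keyword_search(participants, "alice galway"))
# [('crm_db', 2), ('notes.txt', 1), ('orders.xml', 1)]
```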
DSSP: Local Store and Index
• Create efficiently queryable associations between participants
• Improve access to data sources with limited access patterns
• Data replication
– Support of high availability and recovery
• Highly adaptive to heterogeneous data
• Identifies information across participants
• Robust to multiple ways of referring to the same real-world object
DSSP: Discovery
• Locating participants
• Creation of relationships
– Semi-automatically
• Monitoring/Learning
DSSP: Enhancement
• Imbue participants with additional capabilities
– Schema
– Keyword search
– Update monitoring
Research Challenges
• Data models and querying
• Dataspace discovery
• Reusing human attention
• Dataspace storage and indexing
• Correctness guarantees
• Theoretical foundations
Data Models and Querying
• Heterogeneous data models and query languages
• Query reformulation (complex -> simple and vice versa)
• Hierarchy of query languages (pay-as-you-go)
– File system-like queries
– Keyword queries (bag-of-words)
– Path/containment queries (semi-structured)
– Structured queries (XML, RDF, OWL)
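One way to picture reformulation from a complex to a simple query: a structured query is answered directly where the source supports it, and is otherwise reduced to a bag of words for keyword-only sources. This is a sketch under assumptions; the source descriptions (`supports`, `rows`, `docs`) and both function names are invented for the example.

```python
def to_keyword_query(structured_query):
    """Reformulate {attribute: value} constraints into a bag of words.

    Attribute names are dropped, so the result is less precise but can be
    sent to sources that only offer keyword search.
    """
    words = set()
    for value in structured_query.values():
        words.update(str(value).lower().split())
    return words


def answer(source, structured_query):
    """Use the richest query style the source supports."""
    if source["supports"] == "structured":
        return [row for row in source["rows"]
                if all(row.get(k) == v for k, v in structured_query.items())]
    # Fallback: every keyword must occur in the document.
    words = to_keyword_query(structured_query)
    return [doc for doc in source["docs"]
            if words <= set(doc.lower().split())]


# Illustrative sources with different query capabilities.
database = {"supports": "structured",
            "rows": [{"title": "report", "year": 2010},
                     {"title": "memo", "year": 2010}]}
file_share = {"supports": "keyword",
              "docs": ["Annual report 2010", "Meeting memo 2009"]}

query = {"title": "report", "year": 2010}
print(answer(database, query))     # [{'title': 'report', 'year': 2010}]
print(answer(file_share, query))   # ['Annual report 2010']
```

The keyword fallback loses the attribute names, so it may return documents where the words appear in an unrelated role. That loss of precision is the price of the lower rung in the hierarchy.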
DataSpace Discovery
• Locate participants
• Semi-automatic tool for clustering and finding relationships between data sources
• Creation of more precise relationships
Reusing Human Attention
• Semantic integration evolves over time
• Human attention is the scarcest resource
• Machine learning
Storage & Indexing
• Heterogeneity of the index (different data formats)
• Ideally, uniform indexing of all data items
• Dealing with multiple identifiers for the same real-world thing
• Updates
• Automated tuning: which data items to cache, which indexes to build?
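Handling multiple identifiers for the same real-world thing can be sketched with a union-find structure that groups identifiers once they are asserted to be equivalent. This is one possible technique, not something the paper prescribes; the class name and the sample identifiers are hypothetical.

```python
class IdentifierIndex:
    """Union-find over identifiers asserted to denote the same thing."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        """Return the representative identifier of the group."""
        self.parent.setdefault(identifier, identifier)
        # Walk to the representative, halving the path on the way.
        while self.parent[identifier] != identifier:
            self.parent[identifier] = self.parent[self.parent[identifier]]
            identifier = self.parent[identifier]
        return identifier

    def same_as(self, a, b):
        """Record that identifiers a and b refer to the same real-world thing."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller root as representative.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)

    def aliases(self, identifier):
        """All known identifiers in the same group, sorted."""
        root = self.find(identifier)
        return sorted(i for i in list(self.parent) if self.find(i) == root)


index = IdentifierIndex()
index.same_as("crm:42", "hr:alice")
index.same_as("hr:alice", "mail:alice@example.org")

print(index.aliases("mail:alice@example.org"))
# ['crm:42', 'hr:alice', 'mail:alice@example.org']
print(index.aliases("crm:7"))
# ['crm:7']
```

An index built on the representative rather than the raw identifier returns the same entry no matter which alias a query uses.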
Correctness guarantees
• Quality of answers when accessing disparate data sources
– Involving updates
• Define levels of service guarantees
• Rethinking of fundamental data management principles
• Inherent tradeoffs in terms of quality, performance and control
Theoretical Foundations
• Formal understanding of different data models
• What queries are expressible over a dataspace?
• Detection of semantically equivalent but syntactically different query languages?
Linked Data …
Use and reuse of HTTP URIs for real-world things
Provide useful (self-descriptive) content in RDF
… as a major step towards a concrete implementation of a dataspace support platform?
Data Models and Querying
• Unified data model (RDF)
• URIs as identifiers for real-world things
• Linkage as relationships between sources and entities
• Data co-exists (everyone can say everything about everybody)
Keyword query (bag-of-words)
SPARQL
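To make the unified data model concrete, the sketch below represents RDF statements as plain (subject, predicate, object) tuples and matches single triple patterns against them. It is a toy stand-in for a SPARQL engine, written without any RDF library. The `example.org` URIs and the two sources are invented; only the FOAF namespace is a real vocabulary.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
ALICE = "http://example.org/id/alice"   # hypothetical URIs for illustration
BOB = "http://example.org/id/bob"

# Triples from two independently published sources; the shared URI links them.
SOURCE_A = {(ALICE, FOAF + "name", "Alice")}
SOURCE_B = {(ALICE, FOAF + "knows", BOB),
            (BOB, FOAF + "name", "Bob")}


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in graph
                  if (subject is None or t[0] == subject)
                  and (predicate is None or t[1] == predicate)
                  and (obj is None or t[2] == obj))


# Co-existence: the sources are simply merged, no schema mapping needed.
graph = SOURCE_A | SOURCE_B

# Everything any source says about Alice.
for triple in match(graph, subject=ALICE):
    print(triple)

# All names, regardless of which source stated them.
print([t[2] for t in match(graph, predicate=FOAF + "name")])   # ['Alice', 'Bob']
```

Because both sources reuse the same URI for Alice, their statements combine without any explicit integration step, which is the co-existence property the slide refers to.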
Remaining Challenges
• Querying
– Metadata queries
• Discovery
– Link traversal, link creation
– Reasoning, graph mining
• Storage & indexing
– Consolidation
• Correctness guarantees
• Reuse of human attention
• Updates / monitoring
• Data access / privacy
LinkedData: DataSpaces
Source: Franklin et al.: From Databases to Dataspaces, SIGMOD Rec. 2005
DSSPs Examples
• Search engines (SWSE, Sindice, FalconS, …)
– Keyword search, ranking
– SPARQL
• Data access
– RDB2RDF, RDFizers
• Discovery
– SILK
• All-in-one
– Structured Dynamics LLC
Questions? Opinions?