Information Integration Across Heterogeneous Sources: Where Do We Stand and How to Proceed? Aditya Telang, Sharma Chakravarthy, Yan Huang

Jan 12, 2016


Information Integration Across Heterogeneous Sources:

Where Do We Stand and How to Proceed?

Aditya Telang, Sharma Chakravarthy, Yan Huang


Motivation

• “Retrieve castles near London that are reachable by train in less than 2 hours”

• “Find 3-bedroom houses in Houston within 2 miles of a school and within 5 miles of a highway and priced under $250,000”

• “Retrieve French restaurants within 1 mile of IMAX Theater in Dallas, Texas”

• …


Motivation

• Search engines
• Meta-search engines
• Faceted search engines
• Domain-specific portals


Current Scenario

• “Retrieve castles near London that are reachable by train in less than 2 hours”

[Figure panels: London Train schedules; Trains from London; Castles Near London]

- Decision-making process: manually combine the results to arrive at a decision


Ideal Scenario

[Figure labels: Information Integration System; Intent: Retrieve castles near London that are reachable by train in less than 2 hours; Actual Results for the intent]


Focus of the Paper

• Identify the salient challenges that must be overcome to address this problem

• Survey existing work to identify the challenges for which acceptable solutions are available

• Propose a framework that could provide potential solutions towards the problem


Broader Challenges

• Intent specification and formulation
• Query processing and optimization
• Discovery of sources, their schemas and characteristics
• Data Extraction, Integration and Ranking
• Result Visualization
• Issues with inconsistency, security, privacy, …


Intent Specification

• “Retrieve castles near London that are reachable by train in less than 2 hours”
  – Keyword-based (e.g., search engine query)?
  – Structured (e.g., SQL)?
  – Unstructured (e.g., natural language)?
  – Template/Form/Menu-based (deep Web query)?


Query Processing

• The number of sources to be integrated is much larger than in a normal database environment.

• Heterogeneous sources (RDBMS, websites, web services, etc.) do not provide the same processing capabilities found in a typical database system (such as the ability to perform joins).

• Unlike relational databases, there might be restrictions on how a source can be accessed.


Query Processing

• In contrast to query optimization in DBMS, the query optimizer in information integration has little information about the data since it resides in remote autonomous sources

• Web data sources are not necessarily database systems and may have different processing capabilities.

• Hence, the query optimizer must consider the possibility of exploiting a data source’s query-processing capabilities.
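
One way to picture capability-aware processing is a mediator that pushes to each source only the predicates that source can evaluate and applies the rest itself. The sketch below is illustrative only: the `Source` class, its `supported_ops` field, and `split_predicates` are hypothetical names, not part of any system named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical description of a remote source and the operators it can evaluate."""
    name: str
    supported_ops: set = field(default_factory=set)  # e.g. {"eq", "lt"}


def split_predicates(source, predicates):
    """Split (attribute, op, value) predicates into those pushed to the source
    and those the mediator must apply after retrieval."""
    pushed, residual = [], []
    for pred in predicates:
        _, op, _ = pred
        (pushed if op in source.supported_ops else residual).append(pred)
    return pushed, residual


# Assumed example: a web form that only supports equality lookups.
form = Source("castle_directory", {"eq"})
preds = [("city", "eq", "London"), ("travel_minutes", "lt", 120)]
pushed, residual = split_predicates(form, preds)
```

Here the equality on `city` would be sent to the source, while the range condition on `travel_minutes` stays with the mediator. A real optimizer would also weigh access restrictions and the cost of each option.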


Discovery

• Source discovery
  – Given the domain of travel, determine all possible sources providing airfare information
  – Not a simple crawling process, since categorization is necessary after crawling [Gal:VLDB’06]
  – Use of
    • search engines?
    • web directories?


Discovery

• Discovery of source schema and characteristics
  – Understanding source schema
  – Understanding query mechanism (for deep Web sources)
  – Understanding characteristics of sources

Coverage: probability that a random answer tuple for query Q belongs to source S. Denoted P(S|Q).

Overlap: degree to which sources contain the same answer tuples for query Q. Denoted P(S1 ∧ S2 ∧ … ∧ Sk | Q).

[Figure labels: DBLP, CSB, ACMDL]
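
The coverage and overlap statistics defined on this slide can be estimated from a sample of answer tuples. The following is a minimal sketch under that assumption; the function names and the sample identifiers are invented for illustration, and a real system would have to estimate these values without knowing the full answer set.

```python
def coverage(source_tuples, answer_tuples):
    """Estimate P(S|Q): fraction of the answer tuples for Q that source S contains."""
    answers = set(answer_tuples)
    if not answers:
        return 0.0
    return len(answers & set(source_tuples)) / len(answers)


def overlap(sources, answer_tuples):
    """Estimate P(S1 ∧ ... ∧ Sk | Q): fraction of answer tuples present in every source."""
    answers = set(answer_tuples)
    if not answers or not sources:
        return 0.0
    common = answers.intersection(*(set(s) for s in sources))
    return len(common) / len(answers)


# Invented sample: identifiers of tuples answering some query Q.
answers = {"p1", "p2", "p3", "p4"}
source_a = {"p1", "p2", "p3"}
source_b = {"p2", "p3", "p4"}

cov_a = coverage(source_a, answers)
ovl_ab = overlap([source_a, source_b], answers)
```

With this sample, `source_a` holds three of the four answers and the two sources share two of them, so a planner could skip the second source when the first already covers most of what it would return.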


Data Extraction

• How to extract data for individual sub-queries?
  – APIs, Web services for the deep Web?
  – Data extractors (e.g., Lixto, Florid) for the surface Web?

• Temporary storage of extracted data (becomes a critical issue when data can be large in size such as spatial data)


Data Integration

• Schema integration is a complex challenge across domains [Gal:VLDB’06]

• Additional challenges while integrating data
  – Inefficient execution of recursive integration plans
  – No support for dynamic service composition
  – Lack of operators to support geospatial data types
  – No support for record linkage and object consolidation in the mediator

can incorporate the source into a new or existing workflow


Ranking

• In the context of integration, ranking has not been addressed as a significant challenge [Telang:ICDE’07]

• When to rank?
  – Before integrating sub-query results?
  – After integrating sub-query results?

• Is source-independent ranking possible?


Other Challenges

• Visualization of results
• Handling inconsistencies
• Ensuring no breach of privacy and security
• …


The Current Big Players

• Industry-level
  – Google (Google Base) [Madhavan:CIDR’07]
  – IBM (WebSphere)
  – Yahoo (Trip Planner)

• Academia-level
  – Havasu [Kambhampati:ICDE’05]
  – MetaQuerier [Chang et al.: VLDB’05, CIDR’07]
  – Ariadne [Knoblock: VLDB’02,03]
  – …


The InfoMosaic Approach


Knowledge-Base

• Identify different types of information needed for the domains and sources to answer a query.

• Domain Knowledge
  – Necessary information/knowledge required for elaborating and refining the query based on the domains and keywords provided by the user

• Source Semantics
  – Information store for modeling and maintaining all the necessary information for each source within a given domain

[Figure: Knowledge Base, comprising Domain Knowledge and Source Semantics; labels: Metadata, Ontology, Vocabulary, Operators, Attributes, Statistics, Schemas]


User Intent Specification

• Specify intent that is more precise than a “search” but less rigid than a “SQL-query”

• Ability to resolve concepts and their attributes elegantly with minimal user interaction

• Effectiveness depends on user feedback and past query statistics [Telang:COMAD’08]

[Figure: Feedback-centric Query Specification; labels: User Intent, Feedback, Knowledge-Base, Refined Query]


Multi-level Query Planning

• Evaluation is made at each stage to prune plans using relevant cost metrics.

• Some of the additional cost metrics:
  – volume of data retrieved from each source
  – number of calls made to and amount of data sent by each source
  – quantity of data processed
  – the number of integration queries executed

[Figure: Query Planner & Optimizer; labels: Refined Query, Domain-Level Plan, Source-Level Plans SP-1, SP-2, …, Knowledge-Base]
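
The cost metrics listed on this slide could be combined into a single score that is used to prune candidate plans at each planning stage. The sketch below assumes a simple weighted sum; the `SourcePlan` class, the weights, and the numbers are placeholders chosen for illustration, not values from the InfoMosaic work.

```python
from dataclasses import dataclass


@dataclass
class SourcePlan:
    """Hypothetical source-level plan annotated with some of the slide's cost metrics."""
    name: str
    data_volume_kb: float      # volume of data retrieved from the sources
    source_calls: int          # number of calls made to the sources
    integration_queries: int   # number of integration queries executed


def plan_cost(plan, w_volume=1.0, w_calls=50.0, w_queries=20.0):
    """Weighted sum of the metrics; the weights are arbitrary placeholders."""
    return (w_volume * plan.data_volume_kb
            + w_calls * plan.source_calls
            + w_queries * plan.integration_queries)


def prune(plans, keep=1):
    """Keep only the `keep` cheapest plans at this planning stage."""
    return sorted(plans, key=plan_cost)[:keep]


candidates = [
    SourcePlan("SP-1", data_volume_kb=400.0, source_calls=3, integration_queries=2),
    SourcePlan("SP-2", data_volume_kb=150.0, source_calls=10, integration_queries=2),
]
best = prune(candidates)
```

With these placeholder weights, SP-1 wins despite transferring more data because it makes far fewer remote calls. Changing the weights changes the winner, which is why the choice of metrics matters.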


Query Execution & Data Extraction

• Checking availability of sources, identifying attributes to be extracted (using the source semantics) and extracting data

• Determining the output in XML and spatial data formats for storage and further querying

• Reuse of previously retrieved results is an integral part of this task

[Figure: Query Executor & Data Extractor; labels: Query Plan, Knowledge-Base, Internet, Query, Results, Extracted Results, Data Store (XML Data Repository, Spatial Data Repository)]
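
Reuse of previously retrieved results, mentioned on this slide, can be pictured as a cache keyed by source and sub-query. This is a rough sketch only: `ResultCache` and `fake_fetch` are hypothetical, and the stand-in fetch function returns canned data instead of contacting a real source.

```python
class ResultCache:
    """Hypothetical cache keyed by (source, sub_query) so repeated sub-queries skip the network."""

    def __init__(self, fetch):
        self._fetch = fetch       # callable(source, sub_query) -> list of results
        self._store = {}
        self.remote_calls = 0     # counts how often the source was actually contacted

    def get(self, source, sub_query):
        key = (source, sub_query)
        if key not in self._store:
            self.remote_calls += 1
            self._store[key] = self._fetch(source, sub_query)
        return self._store[key]


def fake_fetch(source, sub_query):
    """Stand-in for real extraction; returns canned data."""
    return [f"{source}:{sub_query}:row"]


cache = ResultCache(fake_fetch)
first = cache.get("trains", "from=London")
second = cache.get("trains", "from=London")
```

The second lookup is served from the cache. The sketch has no expiry policy, so a practical version would need to decide how long results from volatile sources such as train schedules remain valid.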


Integration of Results

• Generation of XQueries for combining extracted data
• Develop external functions for XQueries to access spatial data
• The result of the query will be transformed into a homogeneous schema for understanding and analyzing the results.

[Figure: Integrator; labels: Data Store (XML Data Repository, Spatial Data Repository), Query, Results, Domain-Level Plan, Knowledge-Base, Result Set]
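
The slide describes integration through generated XQueries with external spatial functions. As a stand-in for that idea, written in Python instead of XQuery, the sketch below joins two extracted result sets using a distance function, in the spirit of the "restaurants within 1 mile of a theater" example. The record layout, names, and coordinates are invented for illustration.

```python
import math


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def spatial_join(left, right, max_miles):
    """Pair each left record with every right record within max_miles of it."""
    return [
        (a["name"], b["name"])
        for a in left
        for b in right
        if haversine_miles(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_miles
    ]


# Invented coordinates, for illustration only.
restaurants = [
    {"name": "Bistro A", "lat": 32.800, "lon": -96.800},
    {"name": "Bistro B", "lat": 32.900, "lon": -96.800},
]
theaters = [{"name": "Theater X", "lat": 32.805, "lon": -96.800}]
pairs = spatial_join(restaurants, theaters, max_miles=1.0)
```

The nested loop is quadratic in the number of records, which is acceptable for a sketch but is one reason the slides treat spatial operators and temporary storage of large spatial data as open problems.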


Ranking

• Two approaches to ranking [Telang:DBRank’07]
  – Rank Before Integration: applicable when user-specified metrics can be decomposed and applied to individual sub-queries
  – Rank After Integration: applicable when user-specified metrics CANNOT be decomposed and applied to individual sub-queries

[Figure labels: Ranking, Query Executor & Data Extractor, Integrator]
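
The difference between the two ranking approaches can be illustrated with the castle query. In the sketch below, ordering castles by rating and trains by travel time are per-sub-query metrics that can be applied before integration, whereas "rating per minute of travel" needs attributes from both sides and can only be computed after the join. The metric, field names, and data are assumptions made for illustration.

```python
def rank_before_integration(castles, trains):
    """Decomposable metrics: each sub-query result is ordered on its own attribute,
    so ranked lists exist before any join takes place."""
    ranked_castles = sorted(castles, key=lambda c: c["rating"], reverse=True)
    ranked_trains = sorted(trains, key=lambda t: t["minutes"])
    return ranked_castles, ranked_trains


def rank_after_integration(castles, trains):
    """Non-decomposable metric: rating per minute of travel needs both sides,
    so tuples are joined on town first and ranked afterwards."""
    joined = [
        (c["name"], c["rating"] / t["minutes"])
        for c in castles
        for t in trains
        if c["town"] == t["to"]
    ]
    return sorted(joined, key=lambda pair: pair[1], reverse=True)


# Invented data, for illustration only.
castles = [
    {"name": "Castle A", "town": "T1", "rating": 4.0},
    {"name": "Castle B", "town": "T2", "rating": 5.0},
]
trains = [{"to": "T1", "minutes": 40}, {"to": "T2", "minutes": 100}]

ranked_castles, ranked_trains = rank_before_integration(castles, trains)
after = rank_after_integration(castles, trains)
```

Castle B leads the per-source ranking on rating alone, but Castle A comes first once travel time is factored in after integration, which shows why the point at which ranking is applied can change the answer.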


To Conclude

• Ideally, an information integration system should allow users to specify what information is needed without having to provide detailed instructions on how or from where to obtain the information.

• A number of challenges need to be addressed by different research communities (AI, DB, IR, NLP, Semantic Web, …)

• Existing work suggests we are on the right track
• Our proposed framework (InfoMosaic) could be a further step in this direction


Thank You!