CSE 636 Data Integration Introduction
Instructor: Dr. Michalis Petropoulos. Email: mpetropo@cse.buffalo.edu. Location: 210 Bell Hall.

Dec 21, 2015


CSE 636 Data Integration

Introduction

Staff

• Instructor: Dr. Michalis Petropoulos
  – Email: mpetropo@cse.buffalo.edu
  – Location: 210 Bell Hall
  – Office Hours: Wednesday & Friday 1:00-2:00pm & By Appointment

• Web Page: http://www.cse.buffalo.edu/~mpetropo/CSE636-FA08/

• Newsgroup: sunyab.cse.636

Course Goals

• Data integration applications and architectures
• Issues in building such applications
  – Really big and currently active research area
• Solutions to several of them
• Provide foundation for
  – understanding current research problems
  – criticizing proposed solutions
  – proposing your own solution!
• Acquire valuable experience by implementing the project

Prerequisites

• An introductory database course
  – CSE 520, CSE 562 or equivalent
• Data structures and algorithms
• Knowledge Representation
• Distributed systems
• Complexity theory
• Mathematical Logic
• Curiosity!
  – You should ask a lot of questions
• Have a lot of fun!

Relevant Material

Textbooks
• Database Systems: The Complete Book
  – by Garcia-Molina, Ullman and Widom
• Database Management Systems
  – by Ramakrishnan
• Fundamentals of Database Systems
  – by Elmasri and Navathe
• Foundations of Databases
  – by Abiteboul, Hull and Vianu
• Data on the Web
  – by Abiteboul, Buneman and Suciu

Course Format

• Assignments: 15%
  – Three assignments will be given, 5% each
• Final: 20% (take home)
• Projects: 60%
  – Detailed specs will be given
  – Can be used to satisfy the M.S. project requirement
• Participation: 5%

What is Data Integration?

The problem of providing
• uniform (sources transparent to users)
• access to (query)
• multiple (even 2 is a problem)
• autonomous (not affect the behavior of sources)
• heterogeneous (different data models, schemas)
• structured (at least semistructured)
• data sources (not only databases)

The Data Integration Problem

[Figure: the MyBookstore.com mediated schema (Books, Inventory, Orders, Shipping, Reviews) sits on top of data sources labeled DB, Site and WS, reached over the intranet and the Internet. Source labels in the figure: Morgan Kaufmann, Addison Wesley, Prentice Hall, East, West, Orders, FedEx, UPS, Customer Reviews, NY Times.]

Uniform query capability across autonomous, heterogeneous data sources on the Internet
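
The idea of a mediated schema can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Book` relation, the two toy sources, their field names and the wrapper functions are all hypothetical and do not come from the course material. Each wrapper translates one source's native format into the mediated relation, and the query function is written against the mediated relation only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Book:
    """One tuple of a hypothetical mediated relation Books."""
    isbn: str
    title: str
    publisher: str


# Two toy sources with different native formats (made-up data).
SOURCE_A = [{"ISBN": "111", "Title": "Query Processing", "Pub": "Publisher A"}]
SOURCE_B = [("222", "Publisher B", "XML Basics")]  # (isbn, publisher, title)


def wrap_a() -> Iterable[Book]:
    """Wrapper for source A: dictionaries with source-specific key names."""
    for row in SOURCE_A:
        yield Book(row["ISBN"], row["Title"], row["Pub"])


def wrap_b() -> Iterable[Book]:
    """Wrapper for source B: positional tuples in a different column order."""
    for isbn, publisher, title in SOURCE_B:
        yield Book(isbn, title, publisher)


WRAPPERS = [wrap_a, wrap_b]


def query_books(predicate: Callable[[Book], bool]) -> List[Book]:
    """Answer a query over the mediated relation; sources stay hidden."""
    return [book for wrap in WRAPPERS for book in wrap() if predicate(book)]


xml_books = query_books(lambda b: "XML" in b.title)
print(xml_books)
```

The caller never sees which source a book came from, which is the "uniform" part of the definition. The sources are read but never modified, which corresponds to "autonomous".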

Motivation

• Enterprise data integration
  – Web site construction
• WWW
  – Comparison shopping
  – Portals integrating data from multiple sources
  – B2B, electronic marketplaces
• Sciences
  – Geology: integrate geological data across the US continent (text as well as spatial data)
  – Biology: integrating genomic data

Current Solutions

• Mostly ad-hoc programming
  – Create a special solution for every case
  – Pay consultants a lot of money
• Data Warehousing (Data Exchange)
  – Load all the data periodically into a warehouse
  – Separates operational DBMS from decision support DBMS (not only a solution to data integration)
  – Performance is good
  – Data may not be fresh
  – Need to clean data
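
The warehousing trade-off (good performance, possibly stale data, a cleaning step) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the source names `east` and `west` are borrowed from the figure labels, while the data, the `clean` rule and the `load` function are hypothetical.

```python
from typing import Dict, Tuple


def clean(title: str) -> str:
    """Toy cleaning step: trim and collapse whitespace."""
    return " ".join(title.split())


def load(sources: Dict[str, Dict[str, str]],
         warehouse: Dict[Tuple[str, str], str]) -> None:
    """Periodic load: copy every source row into the warehouse, cleaned."""
    warehouse.clear()
    for name, rows in sources.items():
        for isbn, title in rows.items():
            warehouse[(name, isbn)] = clean(title)


# Made-up operational sources, keyed by ISBN.
sources = {
    "east": {"111": "  Data   Integration "},
    "west": {"222": "XML Basics"},
}
warehouse: Dict[Tuple[str, str], str] = {}

load(sources, warehouse)

# A source changes after the load; the warehouse copy does not see it.
sources["west"]["333"] = "Datalog"
stale = ("west", "333") not in warehouse

# Only the next periodic load makes the new row visible.
load(sources, warehouse)
fresh = ("west", "333") in warehouse

print(stale, fresh)
```

Queries against `warehouse` are local lookups, which is why performance is good. The price is the window between loads during which the copy lags behind the sources.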

Course Outline (Tentative)

• Data Integration Scenarios & Architectures
  – Find out what the problems are
• Data Models & Type Systems
  – XML/Semistructured Data, DTDs, XML Schema
• Query & Transformation Languages
  – Datalog, XPath, XQuery, XSLT
• Data Integration Approaches
  – Different approaches depending on application characteristics
• Schema Integration
  – Schema Mapping/Matching
  – Semi-automate the discovery of schema mappings
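
To give a feel for what "semi-automate the discovery of schema mappings" can mean, here is a minimal sketch that proposes attribute correspondences by name similarity and leaves confirmation to a person. It is not a method taken from the course: the attribute names, the threshold and the helper functions are hypothetical, and string similarity from Python's standard `difflib` module stands in for the far richer matchers studied in the literature.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_matches(source_attrs: List[str],
                    target_attrs: List[str],
                    threshold: float = 0.6) -> List[Tuple[str, str, float]]:
    """Propose (source, target, score) pairs for a human to confirm."""
    suggestions = []
    for s in source_attrs:
        best = max(target_attrs, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:
            suggestions.append((s, best, round(score, 2)))
    return suggestions


# Hypothetical source and mediated-schema attribute names.
source = ["book_title", "isbn_no", "cost"]
target = ["title", "isbn", "price"]

suggestions = suggest_matches(source, target)
for s, t, score in suggestions:
    print(f"{s} -> {t} (score {score})")
```

With these names the sketch proposes `book_title -> title` and `isbn_no -> isbn` but finds nothing for `cost`, since `cost` and `price` share almost no characters. That gap is why the process is only semi-automatic: a person, or a matcher that looks at data values or semantics, has to supply the missing mapping.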

Course Outline (cont)

• Distributed Query Processing Algorithms
• Query Rewriting Algorithms
• Limited Query Capabilities
  – We don’t have full access to any database
• Consistent Query Answers
• Web Services
  – What can they do for data integration?
• Semantic Web
  – RDF & SPARQL
• Workflow Languages
  – How is this related to data integration?
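
One common way to picture the "Limited Query Capabilities" item above is that a source may only answer a query when certain attributes are supplied, for example a Web form that requires an ISBN. The sketch below assumes that view. The source names and their required attributes are hypothetical, and it only checks which sources can be called directly; it does not plan multi-step access.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical sources: each lists the attributes a caller MUST supply.
REQUIRED_BINDINGS: Dict[str, FrozenSet[str]] = {
    "publisher_site": frozenset({"isbn"}),           # lookup form needs an ISBN
    "reviews_db": frozenset(),                       # no restriction
    "shipping_ws": frozenset({"order_id", "zip"}),   # needs both values
}


def answerable_sources(bound: Set[str]) -> List[str]:
    """Return sources whose required attributes are all bound by the query."""
    return sorted(
        name for name, required in REQUIRED_BINDINGS.items()
        if required <= bound
    )


print(answerable_sources({"isbn"}))
# → ['publisher_site', 'reviews_db']
```

A query that binds only `isbn` cannot be sent to `shipping_ws`, so the integration system would first have to obtain `order_id` and `zip` from another source or give up on that source.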

References

• Data Integration: a Status Report
  – Alon Halevy
  – German Database Conference (BTW), 2003
  – Invited Talk
• Lecture Slides
  – Alon Halevy
  – http://www.cs.washington.edu/education/courses/cse544/00sp/lectures/ps/l12.ps