Information Integration Instructor: Pankaj Mehra Teaching Assistant: Raghav Gautam Lec. 9 May 13, 2010 ISM 158

Dec 14, 2015


Fatima Allenson
Page 1

Information Integration

Instructor: Pankaj Mehra
Teaching Assistant: Raghav Gautam

Lec. 9
May 13, 2010

ISM 158

Page 2

Enterprise Information

[Figure: architecture diagram. Labels recoverable from the slide:]

• Sources: Web service, application, inter-application messages, e-mail, file system, instant messages, Web content server, database (SQL), CQL
• Source-side labels: tag, 1st-level index
• Integration Hub (drawn twice): central archives, 2nd-level metadata, distributed query optimizer, enterprise information schema, 2nd-level index, 2nd-level cache
• Co- or sub-repository with separate data, metadata & index

Page 3

Centralized versus Distributed?

• Distributed systems occur naturally
• State of the art does not allow complex queries or deep analysis against distributed information
• Centralization may also be favored due to lower costs of infrastructure, license and labor, as well as due to its ability to better enforce tighter integrity constraints and other information management policies
• Ultimately, the decision needs to take into account issues of ownership and control
– Technology considerations often are secondary; even so, rational rules for resolving these considerations exist, as described in the Distributed Computing Economics paper
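
One way to picture such a rule is a cost comparison between computing where the data lives and shipping the data to cheaper compute elsewhere. The sketch below is only an illustration under stated assumptions: the function name, its parameters, and every price are hypothetical placeholders, not figures taken from the slides or from the cited paper.

```python
def cheaper_to_move_data(bytes_to_move: float,
                         cpu_ops: float,
                         usd_per_byte_moved: float,
                         usd_per_op_local: float,
                         usd_per_op_remote: float) -> bool:
    """Return True if shipping the data to remote compute costs less
    than running the same work in place (all prices are assumptions)."""
    cost_in_place = cpu_ops * usd_per_op_local
    cost_moved = bytes_to_move * usd_per_byte_moved + cpu_ops * usd_per_op_remote
    return cost_moved < cost_in_place


# Placeholder prices, chosen only to make the trade-off visible.
PRICES = dict(usd_per_byte_moved=1e-9,   # hypothetical network price
              usd_per_op_local=2e-13,    # hypothetical local compute price
              usd_per_op_remote=1e-13)   # hypothetical remote compute price

# Data-heavy job: lots of bytes, little computation per byte.
data_heavy = cheaper_to_move_data(bytes_to_move=1e9, cpu_ops=1e12, **PRICES)

# Compute-heavy job: few bytes, a great deal of computation per byte.
compute_heavy = cheaper_to_move_data(bytes_to_move=1e6, cpu_ops=1e14, **PRICES)
```

With these placeholder prices the data-heavy job is cheaper to run in place, while the compute-heavy job is cheaper to move. The point of the sketch is the shape of the comparison, not the numbers.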

Page 4

Contrasting Business & Technical Information

[Figure: Business domain versus Technical domain. Labels shown on the slide:]

• Metadata scaling; Data bandwidth scaling
• SQL schema & query; XML or WS schema & query; File schema & query
• Centralized metadata; Real-time information; Ad hoc query; Inconsistent information
• Pivoting (twice); Data mining; Search federation
• Structured sources; Unstructured sources
• Distributed archives; Distributed complex controls; Central control; Central archive
• Stable schemata; Schema evolution
• Heavy data processing, simple metadata fusion; Complex metadata, simpler data fusion
• ETL (twice); Streaming A/V; Visualization; Dashboards; Steering; Deep linguistics

Page 5

The Guiding Principles

• It is a bad idea to address the following as afterthoughts
– Scale
– Availability
– Integrity
– Privacy and security
– Compliance / auditability
– Retention requirements
– Business value
– Information quality
• The ability to embed function close to data is fundamental to scalable information processing
• In order to deliver the best performance/$, systems tend to scale out from technology sweet spot of the day
• Redundancy configured in from the start, as well as mechanisms for early detection and isolation of faults
• Optimize availability by optimizing recovery
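
The principle of embedding function close to data can be illustrated with a minimal sketch. It assumes a hypothetical set of in-memory shards standing in for storage nodes; `Shard`, `pull_then_filter`, and `push_filter` are illustrative names, not anything from the lecture. The sketch counts how many records cross the "network" when a filter runs at the hub versus at each shard.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Shard:
    """Hypothetical storage node holding a slice of the records."""
    records: List[dict] = field(default_factory=list)

    def scan(self) -> List[dict]:
        # Ship every record to the caller (function far from data).
        return list(self.records)

    def scan_where(self, predicate: Callable[[dict], bool]) -> List[dict]:
        # Run the predicate locally and ship only the matches
        # (function close to data).
        return [r for r in self.records if predicate(r)]


def pull_then_filter(shards, predicate) -> Tuple[List[dict], int]:
    """Filter at the hub; returns (matches, records moved to the hub)."""
    moved = [r for s in shards for r in s.scan()]
    return [r for r in moved if predicate(r)], len(moved)


def push_filter(shards, predicate) -> Tuple[List[dict], int]:
    """Filter at each shard; returns (matches, records moved to the hub)."""
    moved = [r for s in shards for r in s.scan_where(predicate)]
    return moved, len(moved)


shards = [
    Shard([{"id": i, "size_kb": i * 10} for i in range(0, 50)]),
    Shard([{"id": i, "size_kb": i * 10} for i in range(50, 100)]),
]


def is_large(record: dict) -> bool:
    return record["size_kb"] >= 900


hits_hub, moved_hub = pull_then_filter(shards, is_large)
hits_local, moved_local = push_filter(shards, is_large)
```

Both approaches return the same ten matching records, but filtering at the hub moves all 100 records while filtering at the shards moves only the 10 matches.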

Page 6

Scalable Content Processing

• Enterprise information is complex
• Diversity of information sources and formats
– Entail complex integration and processing flows
– Metadata generation and indexing
– Content indexing
• Protection and security

[Figure labels: storage, data, content, connectors (twice), scalable repository, scalable processing, e.g. JCR API]

Page 7

Smart Cells

• Scalable distributed system of self-contained, all-inclusive data repositories
• Principles
– Scale-out
– Federation
– Intelligence close to data
– Pluggable platforms supporting proprietary and 3rd-party storage services
• Example
– Platforms for Information Lifecycle Management services
– Scale-out architecture used under cloud information services

[Figure: 25 SmartCell boxes and a Smart Query Fabric. Other labels: Storage: Block, File, Object & Fragment; Content indexing; Attribute indexing; Supported protocols and APIs]
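
The slide does not describe how a Smart Cell is built, so the following is only a minimal sketch of the general idea: self-contained cells that each hold their own data, content index, and attribute index, plus a fabric that fans a query out and merges the answers. `Cell` and `QueryFabric` are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Cell:
    """Hypothetical self-contained repository: data plus its own indexes."""

    def __init__(self, name: str):
        self.name = name
        self.docs: Dict[str, dict] = {}
        self.content_index: Dict[str, Set[str]] = defaultdict(set)
        self.attr_index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def put(self, doc_id: str, text: str, **attrs: str) -> None:
        self.docs[doc_id] = {"text": text, **attrs}
        for word in text.lower().split():
            self.content_index[word].add(doc_id)      # content indexing
        for key, value in attrs.items():
            self.attr_index[(key, value)].add(doc_id)  # attribute indexing

    def query(self, word: str, **attrs: str) -> Set[str]:
        # Resolve the query entirely inside the cell using local indexes.
        hits = set(self.content_index.get(word.lower(), set()))
        for key, value in attrs.items():
            hits &= self.attr_index.get((key, value), set())
        return hits


class QueryFabric:
    """Hypothetical fabric: sends a query to every cell, merges results."""

    def __init__(self, cells: List[Cell]):
        self.cells = cells

    def query(self, word: str, **attrs: str) -> Dict[str, Set[str]]:
        results: Dict[str, Set[str]] = {}
        for cell in self.cells:
            hits = cell.query(word, **attrs)
            if hits:
                results[cell.name] = hits
        return results


cell_a, cell_b = Cell("cell-a"), Cell("cell-b")
cell_a.put("d1", "quarterly sales report", owner="finance")
cell_a.put("d2", "sales call notes", owner="support")
cell_b.put("d3", "annual sales forecast", owner="finance")

fabric = QueryFabric([cell_a, cell_b])
finance_sales = fabric.query("sales", owner="finance")
```

Scaling out in this sketch means adding another `Cell` to the list; no cell needs to know about the others, and only matching document ids travel back through the fabric.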

Page 8

Considerations in Distributed Information Management

• Information is distributed across heterogeneous sources and has varied provenance
– Integration
• Information management requires information about information
– Metadata
• Useful information is timely and findable
– Real-time integration and caching
– Indexing
– Semantic analysis
– Context

Page 9

Dimensions of Integration

Information Integration Methodologies

• Access mechanism
– Scheduled crawl
– Triggered crawl
– Tap update operations
– Tap change log
– Tap message flow
– Tap streaming data
– Subscribe to data
– Subscribe to metadata
• Query language
– XQuery
– SQL DML
– SPARQL
– Proprietary API
– Proprietary protocol
– Search terms
• Query processing technique
– Centralized
– Distributed, 1-pass, forwarded
– Distributed, 1-pass, flooded
– Distributed, two-pass
– Optional DQO (chaining, referral, recruiting, virtual stored procedures)
– Optional results caching for multi-step queries
• Indexing technique
– Centralized
– Distributed, one-level
– Distributed, two-level
• Statefulness
– Stateful: local queries on cached data
– Stateful: distributed query; DQO & intermediate result caching
– Stateless
• Schema definition language
– SQL DDL
– XML Schema
– WSDL
– GGF DAIS
– Navigable filesystem metadata
– Navigable repository metadata
• Metadata architecture
– Centralized, one-level
– Distributed, one-level
– Distributed, two-level
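
The slide lists these options without defining them. As one possible reading, the sketch below contrasts a flooded one-pass query (every source is asked) with a two-pass query over a two-level index (a hub-side index first selects the sources, which then consult their own indexes). `Source`, `Hub`, and the method names are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


class Source:
    """Hypothetical source with its own 1st-level index (term -> doc ids)."""

    def __init__(self, name: str, docs: Dict[str, str]):
        self.name = name
        self.index: Dict[str, Set[str]] = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term].add(doc_id)

    def terms(self) -> Set[str]:
        return set(self.index)

    def lookup(self, term: str) -> Set[str]:
        return set(self.index.get(term.lower(), set()))


class Hub:
    """Hypothetical hub holding a 2nd-level index (term -> source names)."""

    def __init__(self, sources: List[Source]):
        self.sources = {s.name: s for s in sources}
        self.index: Dict[str, Set[str]] = defaultdict(set)
        for source in sources:
            for term in source.terms():
                self.index[term].add(source.name)
        self.contacted: List[str] = []  # sources touched by the last query

    def flooded(self, term: str) -> Dict[str, Set[str]]:
        # One pass: send the query to every source.
        self.contacted = sorted(self.sources)
        results: Dict[str, Set[str]] = {}
        for name in self.contacted:
            hits = self.sources[name].lookup(term)
            if hits:
                results[name] = hits
        return results

    def two_pass(self, term: str) -> Dict[str, Set[str]]:
        # Pass 1: the 2nd-level index picks the candidate sources.
        self.contacted = sorted(self.index.get(term.lower(), set()))
        # Pass 2: only those sources consult their 1st-level index.
        return {name: self.sources[name].lookup(term) for name in self.contacted}


hub = Hub([
    Source("mail", {"m1": "invoice overdue", "m2": "meeting notes"}),
    Source("files", {"f1": "invoice template"}),
    Source("wiki", {"w1": "onboarding guide"}),
])

flood_result = hub.flooded("invoice")
flood_contacted = list(hub.contacted)

routed_result = hub.two_pass("invoice")
routed_contacted = list(hub.contacted)
```

Both calls return the same documents, but the flooded query contacts all three sources while the two-pass query contacts only the two that the hub-side index names. The price of the second approach is keeping the hub-side index up to date as sources change.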

Page 10

Ecosystem of integration products

• Metadata
– Determines information richness
• Service orientation
– Determines protocol richness
• Future
– Integration as syndication
– Integration aaS

[Figure: product categories plotted by Metadata against Service-orientedness]

• SQL-based EII: SAP, Oracle, Composite
• XML-based EII: BEA LiquidData, Mark Logic
• JSR 170 ECI: Day
• WS-based SOA: Microsoft, IBM
• RSS-based: NewsGator
• Pure EAI: Tibco, SAG
• Uniform access: MOSS, Attivio

Page 11

Points for Discussion in class

• Consider a healthcare patient information scenario.
– Is it mainly transactional or mainly analytic?
– Would you lean toward a distributed (EAI) approach or a centralized one (warehouse)?
• Consider a scenario in which a company wants to drill down into the root causes of customer complaints.
– Again, centralized or distributed?
  • Identifying the root cause
  • Tracking the problem
– Would real-time integration become a requirement?

Page 12

Points to ponder at home

• Pros of integration
– Connecting the dots
– Single view of …
– Quality control over
  • Inconsistency
  • Staleness
  • Gaps
• Cons of integration
– Loss of context
– Often, read only
– Cost
– Duplication
– Scale
– Losing battle?
– Risk

Page 13

Where to learn more

• Data Integration: The Relational Logic Approach by Michael Genesereth, Morgan & Claypool Publishers, 2010

Page 14

Upcoming guest lectures in May

• Dr. V. Galotra, Oracle
– SOA Deep Dive
• Rahul Nim, Efficient Frontier
– Online marketing

Page 15

Questions?

Page 16

NEWS PRESENTATION