Top Five Data Challenges for the Next Decade ICDE Keynote April 2005 © 2005 IBM Corporation Top Five Data Challenges for the Next Decade Dr. Pat Selinger IBM Fellow and VP, Area Strategist
Page 1: PPT

Top Five Data Challenges for the Next Decade

ICDE Keynote, April 2005. © 2005 IBM Corporation

Dr. Pat Selinger, IBM Fellow and VP, Area Strategist

Page 2: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 3: PPT


Research Challenges - Examples

SPAM

Keyword-based Search Engines

..xyz..

Page 4: PPT


Research Challenges – Examples

Page 5: PPT


Research Challenge – Examples

Q: Can you spell your name, please?

A: P.A.T.

Q: One more time please…

A: P..… A..… T..…

Q: Sorry… connecting you to a live operator… one moment, please.

Page 6: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 7: PPT


Issue: HW and SW systems have changed since RDB was invented. Information mgmt architecture hasn’t kept pace

1975

– 1 MIPS processor

– Mainframe uniprocessor

– 14 inch disks

– 24 bit addresses

– 256K real memory

– Channel to channel connections

– Strings and numbers

Today

– 2+ GHz processors

– 32 and 64-way SMPs

– RAID disks, logical volume managers

– 64 bit addresses

– 100+ GB real memory

– Gigabit Ethernet, Infiniband supporting clusters of systems

– Rich data (audio, documents, XML, …)

Page 8: PPT


Issue: Data Volumes Exploding

Common Database Sizes

Workload        2005                 2010       Growth
Transactions    100-500 GB           1s TB      10X
Warehouses      100s GB – 10s TB     100s TB    100X
Marts           1-50 GB              1s TB      100X
Mobile          100s MB              10s GB     1,000X
Pervasive       100s KB              1s GB      10,000X

The world produces 250MB of information every year for every man, woman and child on earth.

85% of the data is unstructured.
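
The 2005-to-2010 growth factors on this slide can be turned into yearly rates. The sketch below assumes smooth compound growth over the five years, which the slide does not state (it gives only the endpoints); the function name and the dictionary are illustrative.

```python
def annual_growth_rate(factor: float, years: int = 5) -> float:
    """Yearly compound growth rate implied by an overall growth factor."""
    return factor ** (1.0 / years) - 1.0


# Growth factors as listed on the slide (2005 -> 2010).
growth_factors = {
    "Transactions": 10,
    "Warehouses": 100,
    "Marts": 100,
    "Mobile": 1_000,
    "Pervasive": 10_000,
}

for workload, factor in growth_factors.items():
    rate = annual_growth_rate(factor)
    print(f"{workload}: {factor}X over 5 years is roughly {rate:.0%} per year")
```

Under that assumption, even the slowest-growing workload (10X) implies growth of well over 50% per year.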

Page 9: PPT


Storage areal density CGR (compound growth rate) continues at 100% per year to >100 Gbit/in2. Storage is now significantly cheaper than paper.

[Figure: Storage Density. Areal density (Gb/in2) of products and lab demos, 1980-2010, log scale from 0.001 to 10000, with growth phases of 25% CGR, 60% CGR and 100% CGR. Lab demos (year 2001) at 106 Gbit/in2. Progress from breakthroughs, including MR, GMR heads, AFC media. Progress could slow down due to technological challenges.]

[Figure: Storage Price. Price/MByte (dollars), 1980-2010, log scale from 0.0001 to 1000, for HDD (3.5", 2.5", 1" Microdrive), DRAM, Flash and the range of Paper/Film. Since 1997 raw storage prices have been declining at 50%-60% per year.]

Storage Trends Aid this Data Explosion

Page 10: PPT


Issue: CPU performance is growing by 100% every year, I/O performance by 5%

[Figure: CPU versus disk performance over time]
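
To see how quickly these two rates diverge, the following sketch compounds them year over year. It assumes the 100% and 5% figures hold steady, which is a simplification made here for illustration; the function name is hypothetical.

```python
def cpu_io_gap(years: int, cpu_growth: float = 1.00, io_growth: float = 0.05) -> float:
    """Ratio of CPU speedup to I/O speedup after `years` of compound growth."""
    return ((1.0 + cpu_growth) / (1.0 + io_growth)) ** years


for n in (1, 5, 10):
    print(f"after {n} years the CPU/I-O gap has widened by a factor of {cpu_io_gap(n):.1f}")
```

With these inputs the gap nearly doubles every year, so after five years the processor has pulled ahead of the disk by roughly a factor of 25.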

Page 11: PPT


Solution: Overlapping, deferring or avoiding I/O. Examples:

– Multi-dimensional Clustering
– Multiple bufferpools
– Prefetching into Bufferpools
– Page Cleaners use Async I/O
– Indexes with added columns or tables in indexes
– Index ANDing and ORing
– Pushdown predicates
– Function-shipping on clusters
– Materialized Query Tables
– Compression
– And much more….
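
As one illustration of avoiding I/O, index ANDing and ORing combine the row-id lists from several indexes before any data page is read, so only qualifying rows are fetched. The sketch below is a toy in-memory model, not DB2 code; the table contents and all names are made up for the example.

```python
from collections import defaultdict

# Toy table; "rid" stands in for a row identifier.
rows = [
    {"rid": 0, "year": 1997, "nation": "Canada"},
    {"rid": 1, "year": 1997, "nation": "Mexico"},
    {"rid": 2, "year": 1998, "nation": "Canada"},
    {"rid": 3, "year": 1997, "nation": "Canada"},
]


def build_index(table, column):
    """Map each value of `column` to the set of row ids holding it."""
    index = defaultdict(set)
    for row in table:
        index[row[column]].add(row["rid"])
    return index


def index_and(*rid_sets):
    """Index ANDing: row ids that satisfy every predicate."""
    return set.intersection(*rid_sets)


def index_or(*rid_sets):
    """Index ORing: row ids that satisfy at least one predicate."""
    return set.union(*rid_sets)


year_index = build_index(rows, "year")
nation_index = build_index(rows, "nation")

# WHERE year = 1997 AND nation = 'Canada': intersect first, fetch rows afterwards.
matching = index_and(year_index[1997], nation_index["Canada"])
print(sorted(matching))  # [0, 3]
```

Only the two rows in the intersection would need to be fetched from disk, instead of every row matching either predicate alone.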

Page 12: PPT


Multi-Dimensional Clustering via Cells and blocks:

Dimensions: year, nation, color

Each cell contains one or more blocks

[Figure: a cube of blocks along the year, nation and color dimensions, with blocks labeled (1997, Canada, blue), (1997, Mexico, yellow), (1997, Mexico, blue), (1997, Canada, yellow), (1998, Canada, yellow) and (1998, Mexico, yellow). The cell for (1997, Canada, yellow) is highlighted.]
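
The cell-and-block idea can be sketched in a few lines: every distinct combination of dimension values is a cell, each cell owns one or more fixed-size blocks, and a query on the dimensions touches only the blocks of matching cells. This is a simplified model written for this transcript, not DB2's implementation; `MdcTable`, `BLOCK_SIZE` and the sample rows are hypothetical.

```python
from collections import defaultdict

BLOCK_SIZE = 2  # rows per block; deliberately tiny for the example


class MdcTable:
    """Toy multi-dimensional clustering: cell key -> list of blocks."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = defaultdict(list)

    def insert(self, row):
        key = tuple(row[d] for d in self.dimensions)
        blocks = self.cells[key]
        # Open a new block when the cell is empty or its last block is full.
        if not blocks or len(blocks[-1]) >= BLOCK_SIZE:
            blocks.append([])
        blocks[-1].append(row)

    def scan(self, **predicates):
        """Yield rows, visiting only the blocks of cells that match."""
        for key, blocks in self.cells.items():
            cell = dict(zip(self.dimensions, key))
            if all(cell[d] == v for d, v in predicates.items()):
                for block in blocks:
                    yield from block


table = MdcTable(("year", "nation", "color"))
for i, (year, nation, color) in enumerate([
    (1997, "Canada", "yellow"),
    (1997, "Canada", "yellow"),
    (1997, "Canada", "yellow"),
    (1997, "Mexico", "blue"),
    (1998, "Canada", "yellow"),
]):
    table.insert({"id": i, "year": year, "nation": nation, "color": color})

print(len(table.cells[(1997, "Canada", "yellow")]))  # 2 blocks for 3 rows
print([r["id"] for r in table.scan(year=1997, color="yellow")])
```

Because rows with the same dimension values are stored together, a predicate such as `year=1997` skips whole cells rather than filtering row by row.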

Page 13: PPT


Research Challenge #1

Scalability: Massive Growth in Multiple Dimensions

Scaling directions:

– Petabytes of storage

– “Fire hose” of data continuously loading

– Millions of users

– Millions of processors

– Larger and more complex data objects

– Systems being only partly online

– Partial answers, relevancy ranking

[Figure labels: 1000 processors, 10**6 processors, Unlimited CPUs]

Page 14: PPT


Research Challenge #1

Design our DBMSs to keep pace with HW, SW, data changes

Scale without sacrificing user-visible availability or performance.

While always inventing new techniques to “cover up” the ever-increasing gap between processor speeds and disk speeds, e.g. exploit large memories

Page 15: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 16: PPT


Cost of Labor Increasing While Demands Rising

Labor-intensive management effort

Scarcity of skilled DBAs

Rising Costs

Changing Ecosystem

– Lower Cost Storage
– Tighter Integration
– More Dynamic Workloads
– Low Cost Clusters
– Structured and Unstructured Data
– Higher Availability

Page 17: PPT


Autonomic Computing: Deliver significantly lower total cost of ownership

1. Cost of application development, time to solution delivery

2. Labor cost and skills availability for database administration and management

Page 18: PPT


Autonomic Capabilities Available Today in DB2 for Linux, Unix, Windows

Available in v8.1.x

Up and running
– Configuration advisor
  – Sets dozens of the most critical parameters in seconds. Heaps, process model, optimizer, and more.

Automated physical database design
– Design Advisor
  – Automated index selection.

Runtime
– Industry leading query optimizer
  – Automatic high quality plan selection.
– Query Patroller workload manager.
  – Policy controlled management of SQL/ODBC.
  – Query throughput control with QP query classes
  – Usage trending reports with QP Historical Analysis
  – Real-Time monitoring and control of current running queries
– Self tuning LOAD
– Adaptive utility throttling for Backup
  – Allows maintenance to consume as much resource as possible without impacting the user workload throughput beyond Policy specified constraint.
– Control Center scheduler
  – Task Center (within CC) can schedule/automate execution of OS or DB2 scripts.

Self healing, availability and diagnostics
– Health Monitor
  – Ensures proper database operation by constantly monitoring key indicators.
  – Notification of alerts by e-mail, page, CLP, GUI, SQL.
  – Health Center tooling provides graphical tools to drill down on details.
– Fault monitor
  – Automatically restarts DB2
– Automatic Index Reorganization
  – Automatically defragment leaf pages.
– Automatic continual I/O consistency checking

New in DB2 UDB v8.2

Automated physical database design
– Design Advisor extensions
  – Combined (or individual) recommendations for indexes, MQTs, MDC, and DPF partitioning.
  – Automatic workload compression.
  – 4 workload capture techniques. (package cache, Query Patroller, event monitor, text file)
  – Exploits sampling and multi-query optimization

Runtime
– Automated database maintenance
  – Automation of Backup, Runstats, Reorg
  – Statistics collection is online, throttled, with new locking protocols for non-intrusive collection.
  – Policy expression lets users select subset of schema, and available times of day.
  – Advanced algorithms detect “when” maintenance is really needed.
– Automatic statistics profiling
  – Determines what statistics should be collected.
  – Automatic detection of column groups allows query optimizer to model correlation.
  – First “industrial” version of “LEO” technology.
– eWLM integration
  – Performance analysis for the IBM stack.
– Utility throttling for Backup, Runstats, Rebalance.
  – The v8.1.2 BACKUP throttling technology is extended to a broader set of administrative utilities.
– Self tuning BACKUP
  – Up to 4x faster than v8.1.x defaults
– Simplified memory management
  – Heaps automatically grow when constrained

Self healing, availability and diagnostics
– Common Logging across IBM software products.
– HADR with automatic client reroute
– Extensions to Health Monitor
  – Increase recommendations for user response to alerts.

Self protecting
– Data Encryption
– Common Criteria Certification
– Enhanced Security for Windows users

Page 19: PPT


Example: DB2 Design Advisor

Makes recommendations for:

– Indexes on the base tables

– Materialized Query Tables

– Indexes on the Materialized Query Tables

– Converting non Multi-Dimensional Clustering tables to Multi-Dimensional Clustering tables

– Partitioning existing tables
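
The slide does not say how the Design Advisor arrives at its recommendations. Purely to illustrate the kind of trade-off such a tool has to make, the sketch below picks candidate indexes greedily by estimated benefit per megabyte under a space budget. The candidate names, benefit numbers and the greedy rule itself are assumptions made for this example, not a description of DB2's algorithm.

```python
def recommend_indexes(candidates, space_budget_mb):
    """Greedy pick: highest estimated benefit per MB first, within the budget."""
    ranked = sorted(
        candidates,
        key=lambda c: c["benefit"] / c["size_mb"],
        reverse=True,
    )
    chosen, used = [], 0.0
    for cand in ranked:
        if used + cand["size_mb"] <= space_budget_mb:
            chosen.append(cand["name"])
            used += cand["size_mb"]
    return chosen


# Hypothetical candidates; "benefit" is an estimated workload cost saving.
candidates = [
    {"name": "idx_orders_date", "size_mb": 40, "benefit": 400},
    {"name": "idx_orders_customer", "size_mb": 60, "benefit": 300},
    {"name": "idx_items_sku", "size_mb": 30, "benefit": 90},
]

print(recommend_indexes(candidates, space_budget_mb=80))
```

A real advisor would obtain the benefit estimates from the query optimizer for a captured workload and would weigh MQTs, MDC and partitioning in the same pass.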

Page 20: PPT


Research Challenge #2

Examine radically simpler architectures and address total cost of ownership

[Figure: Application Characteristics (Simple, Understood to Complex, Unknown) plotted against Business Segment (Small businesses to Enterprise). Labels: Small DB engines, Open Source; High End DBMS; Current Product Autonomic Efforts; Research Challenge: Zero Admin. For Complex Apps, Enterprise Class Scale and Performance.]

Page 21: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 22: PPT


Nature of “Interesting” Data is Changing

How do we process these in an integrated way?

– Classic Information Management -- relational databases
– Unstructured Information Management
– Information from Multi-Modal Interactions, e.g. speech
– Autonomous ?

Employee

Name        Dept Number   Employee ID   Manager
Jane Doe    2             1000          -
John K      3             1001          1000

Department

Dept Name       Dept ID
Corporate       1
Manufacturing   2

Product Inventory, Sales Data, Bank Accounts, Warehouses, ...

85% unstructured and not in DBMS

Page 23: PPT


Addressing the Changing Characteristics of Data

Actionability

Heterogeneity

Scale

Query

CCGAGTACCCAC

Satellite & Surveillance Images and Video

Gene Sequences

Transactions

Text and Web

Increasing need to manage and analyze new data types

Protein Folding

Page 24: PPT


Changing Characteristics of Data

Volume growth versus semantics per unit of data

Transactions and structured data

Seat on an airplane: easy to find, structured data

Actionability: High

Scale: Low

Heterogeneity: Low

Page 25: PPT


Changing Characteristics of Data

Text and other human data

Hard work to extract the pearl, but you know where to look

Actionability: Medium

Scale: Medium

Heterogeneity: Medium

Page 26: PPT


Changing Characteristics of Data

Machine-generated data

There is gold somewhere in the pile, and you need to keep sifting

Actionability: Low

Scale: High

Heterogeneity: High

Page 27: PPT


Extending “Mission-Critical” to Unstructured Data

XML View Of Relational Data

– SQL data viewed and updated as XML

– Done via document shredding and composition

– DTD and Schema Validation

XML Documents As Monolithic Entities

– Atomic Storage And Retrieval

– Search Capabilities

Next: XML As A Rich Datatype

– Full storage and indexing

– Powerful querying capabilities

XML has become the “data interchange” format.
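
"Shredding" splits an XML document into rows in relational tables, and "composition" rebuilds XML from those rows. The sketch below shows the round trip with Python's standard library and an in-memory SQLite database standing in for the relational engine; the `orders`/`items` schema and the sample document are invented for the example and have nothing to do with DB2's actual mapping.

```python
import sqlite3
import xml.etree.ElementTree as ET

doc = "<order id='7'><item sku='A1' qty='2'/><item sku='B4' qty='1'/></order>"


def shred(conn, xml_text):
    """Shredding: decompose the document into relational rows."""
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    conn.execute("INSERT INTO orders(id) VALUES (?)", (order_id,))
    conn.executemany(
        "INSERT INTO items(order_id, sku, qty) VALUES (?, ?, ?)",
        [(order_id, item.get("sku"), int(item.get("qty")))
         for item in root.findall("item")],
    )
    return order_id


def compose(conn, order_id):
    """Composition: rebuild an XML view from the relational rows."""
    root = ET.Element("order", id=str(order_id))
    query = "SELECT sku, qty FROM items WHERE order_id = ? ORDER BY rowid"
    for sku, qty in conn.execute(query, (order_id,)):
        ET.SubElement(root, "item", sku=sku, qty=str(qty))
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders(id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE items(order_id INTEGER, sku TEXT, qty INTEGER)")

oid = shred(conn, doc)
print(compose(conn, oid))
```

Once shredded, the data can be queried with plain SQL, but document order and any content outside the mapped schema must be preserved explicitly, which is one motivation for storing XML as a rich datatype instead.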

Page 28: PPT


Example: XML Strategy for DB2 UDB

Native XML capabilities inside the engine

[Figure: CLIENT side: data management client and customer client application, using SQL(X) and XQuery. SERVER side: DB2 Server with a Relational Interface and an XML Interface over Relational Storage and XML Storage.]

Page 29: PPT


Content Management Solutions - Capability

IBM Content Management Portfolio

– Content Solutions
– Information Integration
– Workflow/Business Process Management/Collaboration
– Archiving
– Document Management
– Web Content Management
– Output/Report Management
– Multimedia Management
– Imaging
– Digital Asset Management
– Content Integration
– Digital Rights Management
– Regulatory Compliance/Records Management

Page 30: PPT


Enterprise Content Management

Transforming Processes with Digital Content

Cross Industry
– Customer Service
– Human Resources
– Accounts Payable
– Records Management
– Marketing Communications
– Online Report Viewing
– E-mail Archival
– Business Continuity

Financial
– Loan Origination, Signature Verification
– Credit Card Dispute Handling
– Retirement Account Management
– Mutual Fund Processing
– Leasing and Contract Management

Insurance
– Claims, Underwriting, Policy Service
– Agent Management

Government
– Law Enforcement and Land Records
– Permits, Licensing, Vital Records
– Constituent Correspondence & Services
– Tax Form Capture

Manufacturing
– Engineering Documentation, Change Management and ISO 9000 Cert.
– Product Management
– Customer and Channel Service
– SAP Data Archiving and Document Management

Retail/Distribution
– Vendor Management
– Claims and Loyalty Management Programs
– Web Site Content Mgmt.
– Digital Content Commerce

Transportation
– Proof of Deliveries, Service
– Driver Management

… Content-enabled Business Processes … Electronic Statements … e-Mail Management … e-Records Management …

Page 31: PPT


Classic Data and Content Management Converging

Content Manager provides more “Data Management” services

– Transactional and referential integrity
– Optimized query
– Scalable storage

RDB users want more “Content Management” services

– Check-in, check-out and versioning
– Integrated hierarchical storage management
– Non-normal (i.e. hierarchical) metamodel

XML is accelerating this convergence

– Sometimes it’s data – other times it’s content

Page 32: PPT


So, are we done?

No!

Page 33: PPT


Research Challenge #3

Every one of us should know Content APIs as well as we do SQL

Content Management has VERY different requirements than

– Short atomic transactions with two phase commit

– Two phase locking

– B-tree indexing

– Cursors

– ….

Page 34: PPT


Research Challenge #3

We need to learn what managing content is all about, what is needed and forge new models:

– Query and client interaction
– Versioning
– Foldering
– Sub-document authorization
– Sub-document checkin/out
– Text search and analytics

Page 35: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 36: PPT


Research Challenge #4

Data Interaction Paradigms – What’s Next?

[Figure: Ease of Data Access plotted against Richness of Data (Strings and Numbers, Text, Audio, Video, Sensor). Plotted items: Programs, Relational DB, Spreadsheets, Web, Search Engines, Speech enhanced with semantics ?]

Page 37: PPT


Embracing richer data types and functionality in information management middleware

Speech Technology will Enable New and Easier Applications

– Integrated Interaction Channels
– Analytics Across Data Types
– Shared Infrastructure and Business Logic

[Figure labels: Customers; Contact Points (IM, Web, Kiosks, Email, SMS, Mail, Fax, etc, Voice, IVR, Call Center, Branch office, Face to face); Scheduling and Coordination; Business Processes; Workforce; Analytics and Business Intelligence over Web logs, Speech transcriptions, Call logs, ….]

Page 38: PPT


Speech Recognition Technology Evolution

Towards SuperHuman Speech Recognition

Goal: Surpass human ability to accurately transcribe speech across multiple domains and environments.

IBM Value: This level of performance is required to achieve truly pervasive conversational technologies.

Basic Principles and New Techniques
– Data driven with careful statistical modeling
– Wide variety of test data
– Regular benchmarks of human performance

1997-2001
– Cooperative User
– Immediate Feedback
– High Bandwidth Microphone

2007-2010
– Transparent to user
– No feedback
– Across channel, domain, environment

Graded Challenges
– Multiple Channels
– Variable Noise
– Overlapping Talkers
– Accented Speech
– Multiple Domains

[Figure labels: System 1, System 2, System 3, Fusion, Recognizer, Adaptation, Discrimination]

Page 39: PPT


Text To Speech Generation Technology

Impressive quality

Can you guess what is TTS and what is recorded speech?

Page 40: PPT


UIMA - The Big Picture

Analytics bridge the Unstructured & Structured worlds

Unstructured Information (Text, Chat, Email, Audio, Video)
– High-Value
– Most Current Content
– Fastest Growing
BUT ...
– Buried in Huge Volumes – Lots of Noise
– Implicit Semantics
– Inefficient Search

UIMA: Identify Semantic Entities, Induce Structure
– Chats, Phone Calls, Transfers
– People, Places, Org, Events
– Times, Topics, Opinions, Relationships
– Threats, Plots, etc.

Structured Information (Indices, DBs, KBs)
– Explicit Structure
– Explicit Semantics
– Efficient Search
– Focused Content

Page 41: PPT


Unstructured Information Management Architecture

Common Research infrastructure for advancing Text Analysis and NLP capability
– Promotes re-use of best-of-breed components
– Promotes combination hypothesis through ease of integration

UIM Solutions
– Question Answering
– e-Commerce
– National & Intelligence Business
– Bioinformatics
– Technical Support

Specialized Application Libraries

UIMA Standard Application Libraries
– Provide basic functions common to a broad class of application libraries & applications (e.g. Glossary Extraction, Taxonomy Generation, Classification, Translation, etc.)

Semantic Search Engine
– Token and Concept Indexing
– Query: key words, concepts, spans, ranges -> Ranked Hit List

Document & Meta Data Store
– Documents with meta data based on key-value pairs
– Enables view & collection management

(Text) Analysis Engines (TAEs)
– Combination of analysis engines employing a variety of analytical techniques and strategies

Structured Knowledge Access
– Knowledge Source Adapters (KSAs) deliver content from many structured knowledge sources according to central ontologies

Collection Processing Manager

KSA Directory Service
– Dynamic query & delivery of KSAs

TAE Directory Service
– Dynamic query & delivery of TAEs

Inputs: Unstructured Information, Relevant Application Knowledge, Structured Data

Page 42: PPT


UIMA Component Architecture from “Source to Sink”

[Figure: Collection Processing Engine. Text, Chat, Email, Audio, Video -> Collection Reader -> CAS Initializer -> CAS -> Aggregate Analysis Engine (Analysis Engines, each containing an Annotator, passing the CAS along) -> CAS Consumers -> Ontologies, Indices, DBs, Knowledge Bases.]

Page 43: PPT


Language, Speaker Identifiers

Part of Speech Detectors

Document Structure Detectors

Tokenizers, Parsers, Translators

Named-Entity Detectors

Sentiment Detectors

Face Recognizers

Relationship Detectors

Classifiers

What can analytics do?

Page 44: PPT


Basic Building Blocks: Annotators

Iterate over a document to discover new annotations based on existing ones and update the Common Analysis Structure (CAS).

Words in the example: Governor, visits, embassy, Jones, in, Japan

– Parser: NP, VP, PP
– Named Entity Annotator: Gov Title, Person, Country, Gov Official
– Relationship Annotator: Located In (Arg1: Entity, Arg2: Location)
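
The layering of annotators over a shared analysis structure can be sketched as follows. This is a toy model in Python, not the UIMA API: the `Cas` and `Annotation` classes, the lookup table and the example sentence (whose word order is assumed here) are all invented for illustration. The point it shows is that the second annotator works only from annotations the first one left in the shared structure.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a common analysis structure: text plus annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


# Hypothetical lookup table standing in for a real named-entity model.
GAZETTEER = {"Governor": "GovTitle", "Jones": "Person", "Japan": "Country"}


def named_entity_annotator(cas):
    """Adds entity annotations by looking each word up in the gazetteer."""
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        pos = begin + len(word)
        if word in GAZETTEER:
            cas.annotations.append(Annotation(GAZETTEER[word], begin, pos))


def relationship_annotator(cas):
    """Builds on existing annotations: links each Person to each Country."""
    for person in cas.select("Person"):
        for country in cas.select("Country"):
            cas.annotations.append(
                Annotation("LocatedIn", person.begin, country.end,
                           {"arg1": person, "arg2": country})
            )


cas = Cas("Governor Jones visits embassy in Japan")
for annotator in (named_entity_annotator, relationship_annotator):
    annotator(cas)

for rel in cas.select("LocatedIn"):
    print(cas.covered_text(rel.features["arg1"]), "located in",
          cas.covered_text(rel.features["arg2"]))
```

Swapping in a better entity annotator would improve the relationship results without touching the relationship code, which is the kind of reuse the shared structure is meant to enable.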

Page 45: PPT


Research: The Combination Hypothesis

If intimately integrated, various KM technologies will provide higher quality results (accuracy, recall, etc.)

Technologies brought together by the Unstructured Information Management Architecture:
– Data Mining
– Information Retrieval
– String & Graph Algorithms
– UI / Human Factors
– Privacy & Security
– Machine Learning
– Text analytics & NLP

Independent Analyzers: Source -> Indexer, Entity extractor, Classifier (each run separately) -> Result 1, Result 2, Result 3 -> Application

Combined Analyzers via Common Annotation Structure (UIMA): Source -> Ent. extractor, Classifier, Indexer -> Result -> Application

Page 46: PPT


Research Challenge #4

Include speech and text data and derived text analytics and context in our scope of data research work. How does that change:

– Access techniques,

– Search and optimization algorithms

– Result sets and interaction mechanisms

– Storage and indexing

– Models of data

– Framework for derived information, ways to query and search it

– System architecture

Page 47: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 48: PPT


Data Heterogeneity in Enterprises

[Figure: enterprise data scattered across sources and formats. Labels: Proposals, Contracts, Offerings, Historical, Engagement, Ledger, Lessons Learned, Intellectual Capital, Marketing, Delivery, Notes, Claim, CCMT, Sage, Source, Acct. Teams, SD, Engagement Workbook; Competitors, Pricing, Demand, Offerings; Proposals, Contracts, Negotiations; file types .DOC, .LWP, .XLS, .123. CLIENT DATA: Demographics, configurations, current costs, financial, legal, existing contracts, RFI, etc.]

Today data is in disparate locations; it is not easily accessible nor harnessed for key information

What data do we have?

Where is it?

How can I find it?

What format is it in?

Is it searchable?

Does “customer” mean the same in each system?

How do I reconcile differences?

What applications feed data to other applications?

If I change something, what breaks?

Page 49: PPT


Metadata: Today and Tomorrow

Identifying (Current Focus)
– Store
– Search

Integrating (Current Challenge)
– Discover
– Linkages within domains
– Linkages across domains

Understanding (Opportunity)
– Definitions
– Taxonomies
– Complex relationships
– Sophisticated semantics

Page 50: PPT


Metadata: Spectrum

Metadata describes and adds meaning to data and business process:
– Information Structures
– About Applications, Processes, Resources
– Vocabularies & Concepts

Spectrum from Structured Information to Unstructured Information:

Relational data
– Column attributes
– Table values (domains)

XML Documents
– Tags, XML Schema

Software Assets
– Date, version, …
– Interface definitions

Web Services
– Name, attributes, …
– Interface definitions

People Proxies
– Name, location, serial no. …

System Information
– System resource
– Operating environment

Text/Documents
– Names, locations, phone numbers, language, …

Images
– Name, date, time
– Characteristics

Rich, Streaming Media
– Location, timing, scene identification, participants, actions, ...
– Formats

Page 51: PPT


Metadata Associated With Digital Camera Pictures

Camera Generated Metadata (~ 100 Tags)

File Name: 102-0299_IMG.JPG
Camera Model Name: Canon PowerShot S400
Shooting Date/Time: 1/12/2004 12:12:07 PM
Shooting Mode: Auto
Tv (Shutter Speed): 1/200
Av (Aperture Value): 2.8
Metering Mode: Evaluative
Focal Length: 7.4mm
Flash: Off
White Balance: Auto
AF Mode: Single AF
Drive Mode: Single-frame shooting

Future Camera Generated Metadata (~ 30 Tags)

• Latitude
• Longitude
• Altitude
• GPS time (atomic clock)
• GPS satellites used for measurement
• Measurement precision
• Speed of GPS receiver
• Direction of movement
• Direction of image
• Name of GPS processing method
• GPS differential correction

Additional tags (User input)

• Title, Comments, Favorite Picture, Keywords, PrintMe, Categories, PrintOrder, etc

Associated audio file

EXIF 2.2 Standard -- Exchangeable Image File for Digital Cameras

Page 52: PPT


[Figure: Telecom OSS/BSS landscape. Enterprise Data Store between the Customer Space and the Network Space (Network, Internet). Functions: Sales, Ordering, Billing, Accounting, Provisioning, Extract Billable Events, Control Elements, Collect Events, Manage Traffic, Finances & Business. Data flows: commercial order, service order, revenue, usage, events, net event, configuration, element configuration. Systems are labeled network specific, product specific or business systems, and range from customized solution to standard implementation.]

1000s of systems in silos; only one third interconnected.

Those “connected systems” create over 2000 interfaces.

Significant maintenance or enhancement efforts.

Significant replication required thru “monolith” apps to accomplish any sharing.

No archiving or sharing strategy for decommissioned systems … data and “lessons lost”.

No common convention for enterprise elements … we are speaking different “languages”.

No traceability for security for data access. We have a huge vulnerability.

Access, update, backup and recovery, synchronization are provided through a complex tapestry of processes and technologies.

Inconsistent development, deployment, monitoring, tooling.

Cost prohibitive, redundant and/or specialized skills.

Escalating and redundant storage costs for HW/SW/People.

Metadata Explosion: Case Study at Customer

[Figure: Product A "Silo" Monolith and Product B "Silo" Monolith, each with its own DB and its own copy of Sales, Ordering, Billing, Accounting, Provisioning, Manage Traffic, Control Elements and Collect Events (with the same commercial order, service order, revenue, usage, events, net event and configuration flows), sharing one Network and connected only by "Swivel Chair" integration.]

If we do nothing … it will only become worse!

Lag times in order to cash and order taking

Inability to handle any call from any center

Incomplete or errant customer information at point of contact

Unable to perform real time integration of data at point of use regardless of data form

Unable to provide complete / correct information to drive decisions; even in batch

No single view of our customers

Degraded (or errant) business decision making due to data corruption, data access (depth and breadth) and poor synchronization (event driven)

The problem is so big that it has to be approached incrementally!

Page 53: PPT


Evolution of Metadata

Increased Business Value of Metadata (timeline 1970, 1980, 1990, 2000, 2010):

– Hierarchical Data Model: Rigid Metadata, Single Application
– Relational Data Model: Rigid Metadata, Integration Within Enterprise
– Extensible Data Model (XML): Flexible Metadata, Integration Within Industry
– Domain Specific Ontologies: Flexible Metadata, Cross Industry Integration

Syntactic annotations of data: what this data represents

Semantic annotations of data: what this data means

Page 54: PPT


Metadata from UNSTRUCTURED data is growing exponentially

Increased Business Value of Metadata (timeline 1970, 1980, 1990, 2000, 2010):

– Hierarchical Data Model: Rigid Metadata, Single Application
– Relational Data Model: Rigid Metadata, Integration Within Enterprise
– Extensible Data Model (XML): Flexible Metadata, Integration Within Industry
– Domain Specific Ontologies: Flexible Metadata, Cross Industry Integration

Syntactic annotations of data: what this data represents

Semantic annotations of data: what this data means

How to make metadata from unstructured data ACTIONABLE?

Page 55: PPT


Research Challenge #5

Treat metadata as a first class research area

– Access

– Search

– Sharing

– Distribution

– Consolidating

– Aggregating

– Deriving and Discovering new Metadata

– Querying

– ……

Don’t ignore existing in-place data sources and metadata

Page 56: PPT


The World of Data is Changing

Hardware gives us more choices than ever before

Cost of labor is rising

Data isn’t all (or even mostly) in the database

Data access paradigms evolving

Customers want integration and FAST access to the data they want

Page 57: PPT


Five Data Research Challenges for the Next Decade:

1. Reexamine DBMS architecture and invent ways to scale more, without sacrificing user-visible availability or performance

2. Address Total Cost of Ownership

3. Learn what managing content is all about, what is needed and forge new models

4. Include speech and text data and derived text analytics and context in our scope of data research work

5. Treat metadata as a first class research area