Top Five Data Challenges for the Next Decade
ICDE Keynote April 2005
Dr. Pat Selinger, IBM Fellow and VP, Area Strategist
© 2005 IBM Corporation
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Research Challenges - Examples
SPAM
Keyword-based Search Engines
Research Challenges – Examples
Research Challenge – Examples
Q: Can you spell your name, please?
A: P.A.T.
Q: One more time please…
A: P..… A..… T..…
Q: Sorry… connecting you to a live operator… one moment, please.
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Issue: HW and SW systems have changed since RDB was invented. Information mgmt architecture hasn’t kept pace
1975
– 1 MIPS processor
– Mainframe uniprocessor
– 14 inch disks
– 24 bit addresses
– 256K real memory
– Channel to channel connections
– Strings and numbers
Today
– 2+ GigaHertz processors
– 32 and 64-way SMPs
– RAID disks, logical volume managers
– 64 bit addresses
– 100+ GB real memory
– Gigabit Ethernet, Infiniband supporting clusters of systems
– Rich data (audio, documents, XML, …)
Issue: Data Volumes Exploding
The world produces 250MB of information every year for every man, woman and child on earth.
85% of the data is unstructured.
Common Database Sizes
Workload        2005                  2010       Growth
Transactions    100-500 GB            1s TB      10X
Warehouses      100s GB – 10s TB      100s TB    100X
Marts           1 - 50 GBs            1s TB      100X
Mobile          100s MB               10s GB     1,000X
Pervasive       100s KB               1s GB      10,000X
Storage Trends Aid this Data Explosion
Storage areal density CGR (compound growth rate) continues at 100% per year to >100 Gbit/in2. The price of storage is now significantly cheaper than paper.
[Figure: Storage Density. Areal density (Gb/in2) from 1980 to 2010 for products and lab demos, with growth segments of 25% CGR, 60% CGR and 100% CGR. Lab demos (year 2001) at 106 Gbit/in2. Progress from breakthroughs, including MR, GMR heads, AFC media. Progress could slow down due to technological challenges.]
[Figure: Storage Price. Price per MByte (dollars) from 1980 to 2010 for HDD (3.5", 2.5", 1" Microdrive), DRAM, Flash and the range of paper/film. Since 1997 raw storage prices have been declining at 50%-60% per year.]
Issue: CPU performance is growing by 100% every year, I/O performance by 5%
[Figure: CPU performance versus disk performance over time]
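To see how quickly these two growth rates diverge, the gap can be compounded year over year. The following is a minimal sketch that takes the slide's 100% and 5% figures at face value; the function name and defaults are illustrative only.

```python
def performance_gap(years, cpu_growth=1.00, io_growth=0.05):
    """Return the ratio of CPU speed-up to I/O speed-up after `years`.

    The default rates are the slide's figures: CPU +100% per year,
    I/O +5% per year. Both are treated as constant, which is an assumption.
    """
    cpu = (1 + cpu_growth) ** years
    io = (1 + io_growth) ** years
    return cpu / io


if __name__ == "__main__":
    for years in (1, 5, 10):
        gap = performance_gap(years)
        print(f"after {years:2d} years the CPU-to-I/O gap is about {gap:,.1f}x")
```

Under these assumptions the gap nearly doubles every year, which is why the following slides concentrate on overlapping, deferring or avoiding I/O.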
Solution: Overlapping, deferring or avoiding I/O. Examples:
– Multi-dimensional Clustering
– Multiple bufferpools
– Prefetching into Bufferpools
– Page Cleaners use Async I/O
– Indexes with added columns or tables in indexes
– Index anding and oring
– Pushdown predicates
– Function-shipping on clusters
– Materialized Query Tables
– Compression
– And much more….
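As one illustration of avoiding I/O, index ANDing and ORing combine the row-ID lists of several indexes before any data page is read. The sketch below models that idea with in-memory Python sets; the table, column names and helper functions are hypothetical and do not reflect how DB2 implements the technique.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the set of row IDs that hold it."""
    index = defaultdict(set)
    for rid, row in enumerate(rows):
        index[row[column]].add(rid)
    return index


def index_and(*rid_sets):
    """Row IDs that satisfy every predicate (set intersection)."""
    return set.intersection(*rid_sets)


def index_or(*rid_sets):
    """Row IDs that satisfy at least one predicate (set union)."""
    return set.union(*rid_sets)


# Hypothetical sample table.
rows = [
    {"year": 1997, "nation": "Canada", "color": "yellow"},
    {"year": 1997, "nation": "Mexico", "color": "blue"},
    {"year": 1998, "nation": "Canada", "color": "yellow"},
    {"year": 1998, "nation": "Mexico", "color": "yellow"},
]
by_year = build_index(rows, "year")
by_nation = build_index(rows, "nation")

# WHERE year = 1997 AND nation = 'Canada'
and_rids = index_and(by_year[1997], by_nation["Canada"])
# WHERE year = 1997 OR nation = 'Canada'
or_rids = index_or(by_year[1997], by_nation["Canada"])
```

Only the rows whose IDs survive the set operation need to be fetched, so the base table is touched once per qualifying row rather than once per index.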
Multi-Dimensional Clustering via Cells and Blocks
[Figure: a cube with a year dimension, a color dimension and a nation dimension. One cell is highlighted as the cell for (1997, Canada, yellow); other cells shown include (1997, Canada, blue), (1997, Mexico, yellow), (1997, Mexico, blue), (1998, Canada, yellow) and (1998, Mexico, yellow).]
Each cell contains one or more blocks
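The cell-and-block layout can be sketched as a mapping from a dimension-value tuple to a list of fixed-size blocks. This is a toy model under stated assumptions: the class name, the block size of two rows and the linear walk over cells are illustrative, whereas a real system would store pages in blocks and locate cells through block indexes.

```python
from collections import defaultdict

BLOCK_SIZE = 2  # rows per block; illustrative only


class MdcTable:
    """Toy multi-dimensional clustering: rows with equal dimension values share a cell."""

    def __init__(self, dimensions, block_size=BLOCK_SIZE):
        self.dimensions = dimensions
        self.block_size = block_size
        self.cells = defaultdict(list)  # cell key -> list of blocks (each a list of rows)

    def insert(self, row):
        key = tuple(row[d] for d in self.dimensions)
        blocks = self.cells[key]
        # Start a new block when the cell is empty or its last block is full.
        if not blocks or len(blocks[-1]) >= self.block_size:
            blocks.append([])
        blocks[-1].append(row)

    def scan(self, **predicates):
        """Yield rows from cells whose dimension values match the equality predicates."""
        for key, blocks in self.cells.items():
            cell = dict(zip(self.dimensions, key))
            if all(cell[d] == v for d, v in predicates.items()):
                for block in blocks:
                    yield from block


table = MdcTable(("year", "nation", "color"))
for i in range(3):
    table.insert({"id": i, "year": 1997, "nation": "Canada", "color": "yellow"})
table.insert({"id": 3, "year": 1998, "nation": "Mexico", "color": "yellow"})
table.insert({"id": 4, "year": 1997, "nation": "Mexico", "color": "blue"})

yellow_1997 = [row["id"] for row in table.scan(year=1997, color="yellow")]
```

Because all rows of a cell sit together, a query on any subset of the dimensions reads whole blocks and skips cells that cannot match.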
Research Challenge #1
Scalability: Massive Growth in Multiple Dimensions
Scaling directions:
– Petabytes of storage
– “Fire hose” of data continuously loading
– Millions of users
– Millions of processors
– Larger and more complex data objects
– Systems being only partly online
– Partial answers, relevancy ranking
[Figure: number of CPUs, from 1000 processors to 10**6 processors to unlimited CPUs]
Research Challenge #1
Design our DBMSs to keep pace with HW, SW, data changes
Scale without sacrificing user-visible availability or performance.
While always inventing new techniques to “cover up” the ever-increasing gap between processor speeds and disk speeds, e.g. exploit large memories
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Cost of Labor Increasing While Demands Rising
Labor-intensive management effort
Scarcity of skilled DBAs
Rising Costs
Changing Ecosystem
– Lower Cost Storage
– Tighter Integration
– More Dynamic Workloads
– Low Cost Clusters
– Structured and Unstructured Data
– Higher Availability
Autonomic Computing: Deliver significantly lower total cost of ownership
1. Cost of application development, time to solution delivery
2. Labor cost and skills availability for database administration and management
Autonomic Capabilities Available Today in DB2 for Linux, Unix, Windows
Available in v8.1.x
Up and running
– Configuration advisor
  – Sets dozens of the most critical parameters in seconds. Heaps, process model, optimizer, and more.
Automated physical database design
– Design Advisor
  – Automated index selection.
Runtime
– Industry leading query optimizer
  – Automatic high quality plan selection.
– Query Patroller workload manager.
  – Policy controlled management of SQL/ODBC.
  – Query throughput control with QP query classes
  – Usage trending reports with QP Historical Analysis
  – Real-Time monitoring and control of current running queries
– Self tuning LOAD
– Adaptive utility throttling for Backup
  – Allows maintenance to consume as much resource as possible without impacting the user workload throughput beyond Policy specified constraint.
– Control Center scheduler
  – Task Center (within CC) can schedule/automate execution of OS or DB2 scripts.
Self healing, availability and diagnostics
– Health Monitor
  – Ensures proper database operation by constantly monitoring key indicators.
  – Notification of alerts by e-mail, page, CLP, GUI, SQL.
  – Health Center tooling provides graphical tools to drill down on details.
– Fault monitor
  – Automatically restarts DB2
– Automatic Index Reorganization
  – Automatically defragment leaf pages.
– Automatic continual I/O consistency checking

New in DB2 UDB v8.2
Automated physical database design
– Design Advisor extensions
  – Combined (or individual) recommendations for indexes, MQTs, MDC, and DPF partitioning.
  – Automatic workload compression.
  – 4 workload capture techniques (package cache, Query Patroller, event monitor, text file).
  – Exploits sampling and multi-query optimization
Runtime
– Automated database maintenance
  – Automation of Backup, Runstats, Reorg
  – Statistics collection is online, throttled, with new locking protocols for non-intrusive collection.
  – Policy expression lets users select subset of schema, and available times of day.
  – Advanced algorithms detect “when” maintenance is really needed.
– Automatic statistics profiling
  – Determines what statistics should be collected.
  – Automatic detection of column groups allows query optimizer to model correlation.
  – First “industrial” version of “LEO” technology.
– eWLM integration
  – Performance analysis for the IBM stack.
– Utility throttling for Backup, Runstats, Rebalance.
  – The v8.1.2 BACKUP throttling technology is extended to a broader set of administrative utilities.
– Self tuning BACKUP
  – Up to 4x faster than v8.1.x defaults
– Simplified memory management
  – Heaps automatically grow when constrained
Self healing, availability and diagnostics
– Common Logging across IBM software products.
– HADR with automatic client reroute
– Extensions to Health Monitor
  – Increase recommendations for user response to alerts.
Self protecting
– Data Encryption
– Common Criteria Certification
– Enhanced Security for Windows users
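The utility throttling item describes a feedback loop: let maintenance use as much resource as possible until the user workload degrades past a policy limit. A minimal sketch of such a loop follows. It is not DB2's algorithm; the function name, the throughput inputs and the fixed step size are all assumptions made for illustration.

```python
def next_throttle(current_pct, baseline_tps, measured_tps, max_impact_pct, step=10):
    """Return the utility's next resource share, as a percentage from 0 to 100.

    baseline_tps   -- user workload throughput with the utility idle (assumed known)
    measured_tps   -- user workload throughput observed with the utility running
    max_impact_pct -- policy limit on how much throughput the utility may cost
    """
    impact_pct = 100.0 * (baseline_tps - measured_tps) / baseline_tps
    if impact_pct > max_impact_pct:
        # Policy violated: back the utility off.
        return max(0, current_pct - step)
    # Within policy: let the utility take more resource.
    return min(100, current_pct + step)
```

Calling this once per monitoring interval gives a simple additive controller; a production controller would likely smooth the measurements and adapt the step size.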
Example: DB2 Design Advisor
Makes recommendations for:
– Indexes on the base tables
– Materialized Query Tables
– Indexes on the Materialized Query Tables
– Converting non Multi-Dimensional Clustering tables to Multi-Dimensional Clustering tables
– Partitioning existing tables
Research Challenge #2
Examine radically simpler architectures and address total cost of ownership
[Figure: Application Characteristics (from Simple, Understood to Complex, Unknown) plotted against Business Segment (from Small businesses to Enterprise). Labels: Small DB engines, Open Source; Current Product Autonomic Efforts; High End DBMS; Research Challenge: Zero Admin. For Complex Apps, Enterprise Class Scale and Performance.]
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Nature of “Interesting” Data is Changing
How do we process these in an integrated way?
– Classic Information Management -- relational databases (Product Inventory, Sales Data, Bank Accounts, Warehouses, ...)
– Unstructured Information Management (85% unstructured and not in DBMS)
– Information from Multi-Modal Interactions, e.g. speech
– Autonomous ?

Employee
Name        Dept Number   Employee ID   Manager
Jane Doe    2             1000          -
John K      3             1001          1000

Department
Dept Name       Dept ID
Corporate       1
Manufacturing   2
Addressing the Changing Characteristics of Data
Increasing need to manage and analyze new data types
[Figure: data types positioned by Actionability, Heterogeneity and Scale. Labels: Transactions; Query; Text and Web; Gene Sequences (CCGAGTACCCAC); Protein Folding; Satellite & Surveillance Images and Video.]
Changing Characteristics of Data
Volume growth versus semantics per unit of data
Transactions and structured data
Seat on an airplane: easy to find, structured data
Actionability: High
Scale: Low
Heterogeneity: Low
Changing Characteristics of Data
Text and other human data
Hard work to extract the pearl, but you know where to look
Actionability: Medium
Scale: Medium
Heterogeneity: Medium
Changing Characteristics of Data
Machine-generated data
There is gold somewhere in the pile, and you need to keep sifting
Actionability: Low
Scale: High
Heterogeneity: High
Extending “Mission-Critical” to Unstructured Data
XML View Of Relational Data
– SQL data viewed and updated as XML
– Done via document shredding and composition
– DTD and Schema Validation
XML Documents As Monolithic Entities
– Atomic Storage And Retrieval
– Search Capabilities
Next: XML As A Rich Datatype
– Full storage and indexing
– Powerful querying capabilities
XML has become the “data interchange” format.
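Document shredding and composition, the first approach listed, can be illustrated in a few lines: an XML document is broken into relational rows, and a later query recomposes XML from those rows. The sketch below uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules as stand-ins; the `order`/`item` document and the table layout are hypothetical, not DB2's mapping.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document.
doc = "<order id='7'><item sku='A1' qty='2'/><item sku='B4' qty='1'/></order>"


def shred(xml_text, conn):
    """Shred one <order> document into rows of the item table."""
    order = ET.fromstring(xml_text)
    order_id = int(order.get("id"))
    conn.executemany(
        "INSERT INTO item(order_id, sku, qty) VALUES (?, ?, ?)",
        [(order_id, i.get("sku"), int(i.get("qty"))) for i in order.findall("item")],
    )


def compose(order_id, conn):
    """Compose an <order> document back from the relational rows."""
    order = ET.Element("order", id=str(order_id))
    cursor = conn.execute(
        "SELECT sku, qty FROM item WHERE order_id = ? ORDER BY rowid", (order_id,)
    )
    for sku, qty in cursor:
        ET.SubElement(order, "item", sku=sku, qty=str(qty))
    return ET.tostring(order, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item(order_id INTEGER, sku TEXT, qty INTEGER)")
shred(doc, conn)
recomposed = compose(7, conn)
```

The round trip keeps the data but not the original text, which is one reason the slide lists monolithic storage and a native XML datatype as separate options.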
Example: XML Strategy for DB2 UDB
Native XML capabilities inside the engine
[Figure: on the CLIENT side, a data management client and a customer client application issue SQL(X) and XQuery. On the SERVER side, the DB2 Server exposes a Relational Interface and an XML Interface over Relational Storage and XML Storage.]
Content Management Solutions - Capability
IBM Content Management Portfolio
– Content Solutions
– Information Integration
– Workflow/Business Process Management/Collaboration
– Archiving
– Document Management
– Web Content Management
– Output/Report Management
– Multimedia Management
– Imaging
– Digital Asset Management
– Content Integration
– Digital Rights Management
– Regulatory Compliance/Records Management
Enterprise Content Management
Transforming Processes with Digital Content
Cross Industry
– Customer Service
– Human Resources
– Accounts Payable
– Records Management
– Marketing Communications
– Online Report Viewing
– E-mail Archival
– Business Continuity
Financial
– Loan Origination, Signature Verification
– Credit Card Dispute Handling
– Retirement Account Management
– Mutual Fund Processing
– Leasing and Contract Management
Insurance
– Claims, Underwriting, Policy Service
– Agent Management
Government
– Law Enforcement and Land Records
– Permits, Licensing, Vital Records
– Constituent Correspondence & Services
– Tax Form Capture
Manufacturing
– Engineering Documentation, Change Management and ISO 9000 Cert.
– Product Management
– Customer and Channel Service
– SAP Data Archiving and Document Management
Retail/Distribution
– Vendor Management
– Claims and Loyalty Management Programs
– Web Site Content Mgmt.
– Digital Content Commerce
Transportation
– Proof of Deliveries, Service
– Driver Management
…Content-enabled Business Processes…Electronic Statements…e-Mail Management…e-Records Management…
Classic Data and Content Management Converging
Content Manager provides more “Data Management” services
– Transactional and referential integrity
– Optimized query
– Scalable storage
RDB users want more “Content Management” services
– Check-in, check-out and versioning
– Integrated hierarchical storage management
– Non-normal (i.e. hierarchical) metamodel
XML is accelerating this convergence
– Sometimes it’s data – other times it’s content
So, are we done?
No!
Research Challenge #3
Every one of us should know Content APIs as well as we know SQL
Content Management has VERY different requirements than:
– Short atomic transactions with two phase commit
– Two phase locking
– B-tree indexing
– Cursors
– ….
Research Challenge #3
We need to learn what managing content is all about and what is needed, and forge new models:
– Query and client interaction
– Versioning
– Foldering
– Sub-document authorization
– Sub-document checkin/out
– Text search and analytics
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Research Challenge #4
Data Interaction Paradigms – What’s Next?
[Figure: Ease of Data Access plotted against Richness of Data (Strings and Numbers; Text; Audio, Video, Sensor). Plotted paradigms: Programs; Spreadsheets; Relational DB; Web; Search Engines; Speech enhanced with semantics ?]
Speech Technology will Enable New and Easier Applications
Embracing richer data types and functionality in information management middleware
– Integrated Interaction Channels
– Analytics Across Data Types
– Shared Infrastructure and Business Logic
[Figure: Customers reach Business Processes through Contact Points (IM; Web; Kiosks; Email, SMS; Mail, Fax, etc; Voice; IVR; Call Center; Branch office; Face to face), supported by Scheduling and Coordination, Workforce, and Analytics / Business Intelligence over Web logs, Speech transcriptions, Call logs, ….]
Speech Recognition Technology Evolution
Towards SuperHuman Speech Recognition
Goal: Surpass human ability to accurately transcribe speech across multiple domains and environments.
IBM Value: This level of performance is required to achieve truly pervasive conversational technologies.
Basic Principles and New Techniques
– Data driven with careful statistical modeling
– Wide variety of test data
– Regular benchmarks of human performance
1997-2001
– Cooperative User
– Immediate Feedback
– High Bandwidth Microphone
2007-2010
– Transparent to user
– No feedback
– Across channel, domain, environment
Graded Challenges
– Multiple Channels
– Variable Noise
– Overlapping Talkers
– Accented Speech
– Multiple Domains
[Figure labels: Discrimination; System 1, System 2, System 3; Fusion; Recognizer; Adaptation]
Text To Speech Generation Technology
Impressive quality
Can you guess what is TTS and what is recorded speech?
UIMA - The Big Picture
Analytics bridge the Unstructured & Structured worlds
Unstructured Information (Text, Chat, Email, Audio, Video)
– High-Value
– Most Current Content
– Fastest Growing
BUT ...
– Buried in Huge Volumes – Lots of Noise
– Implicit Semantics
– Inefficient Search
UIMA: Identify Semantic Entities, Induce Structure
– Chats, Phone Calls, Transfers
– People, Places, Org, Events
– Times, Topics, Opinions, Relationships
– Threats, Plots, etc.
Structured Information (Indices, DBs, KBs)
– Explicit Structure
– Explicit Semantics
– Efficient Search
– Focused Content
Unstructured Information Management Architecture
Common Research infrastructure for advancing Text Analysis and NLP capability
– Promotes re-use of best-of-breed components
– Promotes combination hypothesis through ease of integration
Inputs: Unstructured Information; Structured Data; Relevant Application Knowledge
UIM Solutions
– Question Answering
– e-Commerce
– National & Intelligence Business
– Bioinformatics
– Technical Support
Architecture components
– Specialized Application Libraries
– UIMA Standard Application Libraries: Provide basic functions common to a broad class of application libraries & applications (e.g. Glossary Extraction, Taxonomy Generation, Classification, Translation, etc.)
– Semantic Search Engine: Token and Concept Indexing. Query: key words, concepts, spans, ranges -> Ranked Hit List
– Document & Meta Data Store: Documents with meta data based on key-value pairs. Enables view & collection management
– (Text) Analysis Engines (TAEs): Combination of analysis engines employing a variety of analytical techniques and strategies
– Structured Knowledge Access: Knowledge Source Adapters (KSAs) deliver content from many structured knowledge sources according to central ontologies
– Collection Processing Manager
– KSA Directory Service: Dynamic query & delivery of KSAs
– TAE Directory Service: Dynamic query & delivery of TAEs
UIMA Component Architecture from “Source to Sink”
[Figure: a Collection Processing Engine. A Collection Reader takes Text, Chat, Email, Audio, Video; a CAS Initializer produces a CAS; an Aggregate Analysis Engine made of Analysis Engines, each wrapping an Annotator, passes the CAS along; CAS Consumers write to Ontologies, Indices, DBs and Knowledge Bases.]
What can analytics do?
– Language, Speaker Identifiers
– Part of Speech Detectors
– Document Structure Detectors
– Tokenizers, Parsers, Translators
– Named-Entity Detectors
– Sentiment Detectors
– Face Recognizers
– Relationship Detectors
– Classifiers
Basic Building Blocks: Annotators
Iterate over a document to discover new annotations based on existing ones and update the Common Analysis Structure (CAS).
[Figure: a sentence built from the tokens Governor, Jones, visits, embassy, in, Japan, annotated in layers. Parser: NP, VP, PP. Named Entity Annotator: Gov Title, Person, Country, Gov Official. Relationship Annotator: Located In, with Arg1:Entity and Arg2:Location.]
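The layering of annotators over a shared analysis structure can be sketched as follows. This is not the UIMA API: the `CAS` and `Annotation` classes, the lookup sets and the example sentence (assembled from the tokens on the slide, with the word order assumed) are all hypothetical stand-ins meant only to show later annotators building on earlier annotations.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class CAS:
    """Toy common analysis structure: the text plus every annotation found so far."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def token_annotator(cas):
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        pos = begin + len(word)
        cas.annotations.append(Annotation("Token", begin, pos))


TITLES = {"Governor"}   # hypothetical dictionaries
COUNTRIES = {"Japan"}


def named_entity_annotator(cas):
    tokens = cas.select("Token")  # builds on the tokenizer's output
    for i, tok in enumerate(tokens):
        word = cas.covered_text(tok)
        if word in COUNTRIES:
            cas.annotations.append(Annotation("Country", tok.begin, tok.end))
        if word in TITLES and i + 1 < len(tokens):
            person = tokens[i + 1]
            cas.annotations.append(Annotation("GovTitle", tok.begin, tok.end))
            cas.annotations.append(Annotation("Person", person.begin, person.end))
            cas.annotations.append(Annotation("GovOfficial", tok.begin, person.end))


def relationship_annotator(cas):
    # Builds on the named entities: an official mentioned before a country.
    for official in cas.select("GovOfficial"):
        for country in cas.select("Country"):
            if country.begin > official.end:
                cas.annotations.append(Annotation(
                    "LocatedIn", official.begin, country.end,
                    {"arg1": official, "arg2": country}))


cas = CAS("Governor Jones visits embassy in Japan")
for annotator in (token_annotator, named_entity_annotator, relationship_annotator):
    annotator(cas)
```

Each annotator reads only the CAS and appends to it, so annotators can be developed independently and chained in any order that satisfies their inputs.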
Research: The Combination Hypothesis
If intimately integrated, various KM technologies will provide higher quality results (accuracy, recall, etc.)
Technologies brought together by the Unstructured Information Management Architecture:
– Data Mining
– Information Retrieval
– String & Graph Algorithms
– UI / Human Factors
– Privacy & Security
– Machine Learning
– Text analytics & NLP
Independent Analyzers: Source -> Indexer, Entity extractor, Classifier (each run separately) -> Result 1, Result 2, Result 3 -> Application
Combined Analyzers via Common Annotation Structure (UIMA): Source -> Ent. extractor, Classifier, Indexer -> Result -> Application
Research Challenge #4
Include speech and text data and derived text analytics and context in our scope of data research work. How does that change:
– Access techniques
– Search and optimization algorithms
– Result sets and interaction mechanisms
– Storage and indexing
– Models of data
– Framework for derived information, ways to query and search it
– System architecture
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Data Heterogeneity in Enterprises
[Figure: enterprise data scattered across systems and file formats. Labels: Proposals; Contracts; Offerings; Historical; Engagement; Ledger; Lessons Learned; Intellectual Capital; Marketing; Delivery; Notes databases; Claim; CCMT; Sage; Acct. Teams; SD; Engagement Workbook; Proposals, Contracts, Negotiations; files in .DOC, .LWP, .XLS and .123 formats. Source data: Competitors, Pricing, Demand, Offerings. CLIENT DATA: Demographics, configurations, current costs, financial, legal, existing contracts, RFI, etc.]
Today data is in disparate locations; it is neither easily accessible nor harnessed for key information.
What data do we have?
Where is it?
How can I find it?
What format is it in?
Is it searchable?
Does “customer” mean the same in each system?
How do I reconcile differences?
What applications feed data to other applications?
If I change something, what breaks?
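The last two questions are lineage questions, and they become answerable once feed relationships are recorded as metadata. A minimal sketch, assuming a hand-written map of which system feeds which (the system names and edges are hypothetical):

```python
from collections import deque

# Hypothetical lineage metadata: producer -> systems that consume its data.
FEEDS = {
    "sales": ["ordering"],
    "ordering": ["billing", "provisioning"],
    "billing": ["accounting"],
    "provisioning": [],
    "accounting": [],
}


def downstream(system, feeds):
    """Return every system that directly or transitively consumes data from `system`."""
    seen, queue = set(), deque([system])
    while queue:
        for consumer in feeds.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


impacted = downstream("ordering", FEEDS)
```

A breadth-first walk like this answers "what breaks?" only as well as the feed map is maintained, which is the harder part in practice.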
Metadata: Today and Tomorrow
– Identifying (Current Focus): Store, Search
– Integrating (Current Challenge): Discover, Linkages within domains, Linkages across domains
– Understanding (Opportunity): Definitions, Taxonomies, Complex relationships, Sophisticated semantics
Metadata: Spectrum
Metadata describes and adds meaning to data and business process
Spectrum from Structured Information to Unstructured Information:
Relational data
– Column attributes
– Table values (domains)
XML Documents
– Tags, XML Schema
Software Assets
– Date, version, …
– Interface definitions
Web Services
– Name, attributes, …
– Interface definitions
People Proxies
– Name, location, serial no. …
System Information
– System resource
– Operating environment
Text/Documents
– Names, locations, phone numbers, language, …
Images
– Name, date, time
– Characteristics
Rich, Streaming Media
– Location, timing, scene identification, participants, actions, ...
– Formats
Metadata categories: Information Structures; About Applications, Processes, Resources; Vocabularies & Concepts
Metadata Associated With Digital Camera Pictures
EXIF 2.2 Standard -- Exchangeable Image File for Digital Cameras
Camera Generated Metadata (~ 100 Tags)
File Name: 102-0299_IMG.JPG
Camera Model Name: Canon PowerShot S400
Shooting Date/Time: 1/12/2004 12:12:07 PM
Shooting Mode: Auto
Tv (Shutter Speed): 1/200
Av (Aperture Value): 2.8
Metering Mode: Evaluative
Focal Length: 7.4mm
Flash: Off
White Balance: Auto
AF Mode: Single AF
Drive Mode: Single-frame shooting
Future Camera Generated Metadata (~ 30 Tags)
– Latitude
– Longitude
– Altitude
– GPS time (atomic clock)
– GPS satellites used for measurement
– Measurement precision
– Speed of GPS receiver
– Direction of movement
– Direction of image
– Name of GPS processing method
– GPS differential correction
Additional tags (User input)
– Title, Comments, Favorite Picture, Keywords, PrintMe, Categories, PrintOrder, etc
Associated audio file
Metadata Explosion: Case Study at Customer
[Figure: Telecom OSS/BSS architecture. Labels: Customer Space; Network Space; Network; Internet; Enterprise Data Store; DB; Product A "Silo" Monolith; Product B "Silo" Monolith; Swivel Chair. Functions: Sales, Ordering, Billing, Accounting, Provisioning, Manage Traffic, Control Elements, Collect Events, Extract Billable Events, Finances & Business. Flows: commercial order, service order, revenue, usage, events, element configuration, net event, configuration. Layers: Network specific, Product specific, Business systems; customized solution versus standard implementation.]
1000’s of systems in silos; only one third interconnected.
Those “connected systems” create over 2000 interfaces.
Significant maintenance or enhancement efforts.
Significant replication required thru “monolith” apps to accomplish any sharing.
No archiving or sharing strategy for decommissioned systems … data and “lessons lost”.
No common convention for enterprise elements … we are speaking different “languages”.
No traceability for security for data access. We have a huge vulnerability.
Access, update, backup and recovery, synchronization are provided through a complex tapestry of processes and technologies.
Inconsistent development, deployment, monitoring, tooling.
Cost prohibitive, redundant and/or specialized skills.
Escalating and redundant storage costs for HW/SW/People.
If we do nothing … it will only become worse!
Lag times in order to cash and order taking
Inability to handle any call from any center
Incomplete or errant customer information at point of contact
Unable to perform real time integration of data at point of use regardless of data form
Unable to provide complete / correct information to drive decisions; even in batch
No single view of our customers
Degraded (or errant) business decision making due to data corruption, data access (depth and breadth) and poor synchronization (event driven)
The problem is so big that it has to be approached incrementally!
Evolution of Metadata
Increased Business Value of Metadata (timeline 1970, 1980, 1990, 2000, 2010):
– Hierarchical Data Model: Rigid Metadata, Single Application
– Relational Data Model: Rigid Metadata, Integration Within Enterprise
– Extensible Data Model (XML): Flexible Metadata, Integration Within Industry
– Domain Specific Ontologies: Flexible Metadata, Cross Industry Integration
Syntactic annotations of data: what this data represents
Semantic annotations of data: what this data means
Metadata from UNSTRUCTURED data is growing exponentially
[Figure: the metadata evolution timeline from 1970 to 2010 (Hierarchical Data Model, Relational Data Model, Extensible Data Model (XML), Domain Specific Ontologies), moving from syntactic annotations of data (what this data represents) to semantic annotations of data (what this data means) with increased business value of metadata.]
How to make metadata from unstructured data ACTIONABLE?
Research Challenge #5
Treat metadata as a first class research area
– Access
– Search
– Sharing
– Distribution
– Consolidating
– Aggregating
– Deriving and Discovering new Metadata
– Querying
– ……
Don’t ignore existing in-place data sources and metadata
The World of Data is Changing
Hardware gives us more choices than ever before
Cost of labor is rising
Data isn’t all (or even mostly) in the database
Data access paradigms evolving
Customers want integration and FAST access to the data they want
Five Data Research Challenges for the Next Decade:
1. Reexamine DBMS architecture and invent ways to scale more, without sacrificing user-visible availability or performance
2. Address Total Cost of Ownership
3. Learn what managing content is all about, what is needed and forge new models
4. Include speech and text data and derived text analytics and context in our scope of data research work
5. Treat metadata as a first class research area