Database Activities and Trends Jim Gray Microsoft Research 2 June 2006, Microsoft, TechNet, London Outline • What I have been doing 15 minutes • BIG Changes in DBs 15 minutes • Q&A 20 minutes
Jan 11, 2016
Database Activities and TrendsJim Gray
Microsoft Research2 June 2006, Microsoft, TechNet, London
Outline
• What I have been doing 15 minutes
• BIG Changes in DBs 15 minutes
• Q&A 20 minutes
Scalability Projects • TerraServer: Geospatial data online
– Now part of Virtual Earth http://local.live.com/
• SATA disk evaluation– Copy 1.5 Petabytes (count types of errors)
MSR-TR-2005-166
• Disk and Network performancemove 1GB/s from CERN to Pasadena MSR-TR-2004-62
• Bricks – BI-Bricks: cheap boxes/disks for BI – Server Bricks: TerraServer Bricks: MSR-TR-2004-107
DB Projects• Spatial data access inside SQL
– Gives a good example of using CLR to extend SQL– Sample is part of SQL 2005 programming samples.– Many papers …, MSR-TR-2005-122, MSR-TR-2006-52
• To Blob or NOT to Blob?– Explored what is the break-even point of Blobs vs Files.
Guess what! Almost all files should be blobs. MSR-TR-2006-45
• GPU TeraSort: – You have been hearing about Many-Core from Intel– Nvidia & ATI give you 100 cores today (2x next year)
10x the operations per second than the CPU 10x the memory bandwidth of the CPU
– How to program them?– Sort represents IO, memory, processing. – GPU TerraSort demos this MSR-TR-2005-183 – Accelerator: C# extension is a GPU compiler. MSR-TR-2005-184
Not me, but very cool!
eScience Projects • SkyServer: Astronomy data online http://skyserver.sdss.org/
– A real Data Grid app – Web services are popular– SkyQuery and CasJobs use web services.
http://casjobs.sdss.org/CasJobs/– Spatial access built as SQL 2005 C# extensions.
• Doing Finite Element Analysis with a DB and Vis toolsSupporting Finite Element Analysis with a Relational Database Backend; Part I: There is Life beyond Files MSR-TR-2005-49
• Ecological sensors (soil, water, ocean,…)– Only public thing so far: http://lifeunderyourfeet.org/
– Many papers coming
• Starting BioInfo efforts (Portable PubMed Central, ….)
Portable PubMedCentral
• “Information at your fingertips”• Helping build PortablePubMedCentral• Deployed US, China, England, Italy, South
Africa, (Japan soon).• Each site can accept documents • Archives replicated • Federate thru web services • Working to integrate Word/Excel/…
with PubmedCentral – e.g. WordML, XSD,• To be clear: NCBI is doing 99% of the work,
but it is very cool and very significant.
Outline
• What I have been doing 15 minutes
• BIG Changes in DBs 15 minutes
• Q&A 20 minutes
DB System Architecture • The classic DBMS model
os
records
sets
utilities
Added:+Text, Time, Space+ Cubes, Data mining+ XML, XQuery+ Programming Languages+ Triggers and queues+ Replication, Pub/sub+ Extract-Transform-Load+ Many more extensions coming
…
Replicatio
n
ET
LT
extC
ubesD
ata Mine
Tim
eS
paceN
otification
Procedure
s
QueuesX
ML
os
records
sets
utilities
A Mess?
Worked, but applications wanted to query other data types
DB Systems evolved to be containers for information services
develop, deploy, and execution environment• Classic ++
– + Programming Languages– + Triggers and queues– + Replication, Pub/sub– + Extract-Transform-Load– + Text, Time, Space– + Cubes, Data mining– + XML, XQuery– + Many more extensions coming
• DBMS is an ecosystemOO is the key structuring strategy:– Everything is a class
– Database is a complex object
– Core object is DataSet
– Classes publish/consume them
– Depends on strong Object Model
os
records
sets
utilities
DataSet
The Object-Relational Worldmarry programming languages and DBMSs
• Stored procedures evolve to “real” languagesJava, C#,.. With real object models.
• Data encapsulated: a class with methods• Classes may be persistent• Tables are enumerable & index-able
record sets with foreign keys• Records are vectors of objects• Opaque or transparent types• Set operators on transparent classes• Transactions:
– Preserve invariants – A composition strategy– An exception strategy
• Ends Inside-DB Outside-DB dichotomy
Business Business ObjectsObjects
Ask not “How to add objects to databases?”,
Ask “What kind of object is a database?”
Q: Given an object model, what is a DB?A: DataSet class and methods
(nested relation with metadata)This is the basis for the ecosystem
Distributed DBExtensible DBInteroperable DB….
This was implicit in ODBCbut is now explicit within the DBMS ecosystem
Input: Command (any language) Output: Dataset
Tablesor Textor cubeOr…..
Question
Dataset
Entity Set in ADO.NET 3.0
Queues & WorkflowsSODA (Service Oriented Data Architecture)
Service Oriented Database Architecture: App Server-Lite? MSR TR 2005 129
• Apps are loosely connected via Queued messages
• Queues are databases.
• Basis for workflow
• Queues: the first class to add to an OR DBMS
• Queues fire triggers.Active databases
• Synergy with DBMSsecurity, naming, persistence, types, query,…
Workflow:Script Execute Administer &
Expedite all built on queues
Text, Temporal, and Spatial Data Access
• Q: What comes after queues?A: Basic types: text, time, space,…
• Great application of OR technology• Key idea:
table valued functions == indicesAn index is a table, organized differentlyQuery executor uses index to map: Key → set (aka sequence of rows)
• Table valued function can do this mapOptimizer can use it.
• +extras: cost function, cardinality,…
• BIG DEAL: Approximate answers: Rank and Support
select Title, Abstract, Rank from Books join FreeTextTable(Title, Abstract, ‘XML semistructured') Ton BookID = T.Key
select store, holiday, sum(sales) from Sales join HolidayDates(2004) Ton Sales.day = T.daygroup by store, holiday
select galaxy, distance from GetNearbyObjEQ(22,37)
What’s new here?
• DBMS have tight-integration withlanguage classes (Java, C#, VB,.. )
• The DB is a class• You can add classes to DB. • Adding indices is “easy”
If you have a new idea.• Now have solid Queue systems
Adding workflow is “easy”If you have a new idea.
• This is a vehicle for publishing data on the Web.
Tablesor Textor cubeOr…..
Question
Dataset
Cubes
• Data cubes now standard• MDX is very powerful
(Multi-Dimensional eXpressions) • Dimension, Measure, Operator
concepts highly evolved beyond snowflake schema
• Cube stores cohabit with row storesROLAP + MOLAP + (x xOLAP) (relational +multidimensional online analytic processing)
• Very sophisticated algorithms
• A big part of the ecosystem
CHEVY
CHEVY
FORDFORD 19901990
1991199119921992
19931993
REDREDWHITEWHITE
BLUEBLUE
SELECT <axis_spec> FROM <cube_spec>WHERE <slicer_spec>
Semi-Structured Data• “Everyone starts with the same schema:
<stuff/>.” Then they refine it.” J. Widom
• We are a “strong schema” community• That has pros-and-cons.• Files <stuff/> and XML <<foo/> <bar/>>
are here to stay. Get over it! • File directories are becoming databases;
– Pivot on any attribute– Folders are standing queries.– Freetext+schema search (better precision/recall)
• XSD (xml schema) and xQuery are transitional;But we have to do them to get to the real answer.
• Cohabit with row-stores.• Challenge: figure out what comes after XSD+xQuery
Data Mining and Machine Learning
• Tasks: classification, association, prediction• Tools: Decision trees, Bayes, A Priori,
clustering, regression, Neural net,… • now unified with DBs
– Create table T (x,y,z,u,v,w)Learn “x,y,z” from “u,v,w” using <algorithm>
– Train T with data.– Then can ask:
• Probability x,y,z,u,v,w• What are the u,v,w probabilities given x,y,z
– Example: Learn height from age.
• Anyone with a data mining algorithm hasfull access to the DBMS infrastructure.
• Challenge: Better learning algorithms.
Create the model:CREATE MINING MODEL HeightFromAgeSex
( ID long key, Gender text discrete, Age long continuous, Height long continuous PREDICT) USING Decision_Trees
Train a data mining model:INSERT INTO Height
SELECT ID, Gender, Age, Height FROM People
Predict height from model:SELECT height,
PredictProbability(height) FROM Height PREDICTION JOIN New
ON New.Gender = Height.Gender AND New.Age = Height.Age
DM – DB Synergy
Probabilistic Reasoning
DB verbs to drive Modeler
learn height from Gender + Age
Notification,Stream Processing, and
Sensor Processing• Traditionally:
Query billions of facts• Streams:
millions of queries one new fact – New protein compare to all DNA– Change in price or time
• Implications– New aggregation operators (extension)– New programming style– Streams in products:
• Queries represented as records• New query optimizations.
• Sensor networks – push queries out to sensors.– Simpler programming model– Optimizes power & bandwidth
facts
Q?
A!
Qfact, fact, fact…
Notification
os
records
sets
utilities
Restatement: DB Systems evolved to be containers for information services
develop, deploy, and execution environment• DBMS is an ecosystem
Key structuring strategy:– Everything is a class– Database is a complex object– Core object is DataSet
• The architecture lets you add your new ideas.
os
records
sets
utilities
DataSet
Language + DB Integration(the Microsoft contribution)
• LINQ are a BIG deal (SQL and XML) http://msdn.microsoft.com/data
• Entity Sets are next step in Data SetsADO.NET V3 automates entities
Data access todayvoid EmpsByDate(DateTime date) {using( SqlConnection con = new SqlConnection(
Settings.Default.AdventureWorksSQL)) {con.Open();
SqlCommand cmd = con.CreateCommand();cmd.CommandText = @"SELECT SalesPersonID, FirstName, HireDateFROM SalesPerson spINNER JOIN Employee e ON
sp.SalesPersonID = e.EmployeeIDINNER JOIN Contact c ON
e.EmployeeID = c.ContactIDWHERE e.HireDate < @date";cmd.Parameters.AddWithValue("@date", date);
DbDataReader r = cmd.ExecuteReader();while(r.Read()) {
Console.WriteLine("{0:d}:\t{1}", r["HireDate"], r["FirstName"]);
}}
Relational EngineCustomer SalesPerson
Explicit DB connections
Opaque command
text
Untyped resultsets
Connection
DataReader
Command
Rows
Entities ≠ Rows
Data access tomorrowpublic partial class AdventureWorksDB :
System.Data.Objects.ObjectContext {public System.Data.Objects.
Query<SalesOrder> SalesOrders{ … }public System.Data.Objects.
Query<SalesPerson> SalesPeople{ … }
}
void EmpsByDate(DateTime date) {
using (AdventureWorksDB aw = new AdventureWorksDB()) {
var people = from p in aw.SalesPeoplewhere p.HireDate < dateselect p;
foreach (SalesPerson p in people) { Console.WriteLine("{0:d}\t{1}",
p.HireDate, p.FirstName ); }}
Relational EngineCustomer SalesPerson
No explicit connections
Strongly typed
commands
Strongly typed results
Connection
DataReader
Command
Rows
MapConnection
MapCommand
MapDataReader
Entities
ObjectContext
Query<T>
Objects
SalesData
Order
Domain
Objects
Auto-Gen classes
Outline
• What I have been doing 15 minutes
• BIG Changes in DBs 15 minutes
• Q&A 20 minutes