
How to Select an Analytic DBMS

Overview, checklists, and tips

by Curt A. Monash, Ph.D.

President, Monash Research
Editor, DBMS2

contact@monash.com
http://www.monash.com
http://www.DBMS2.com

Curt Monash

Analyst since 1981, own firm since 1987
Covered DBMS since the pre-relational days
Also analytics, search, etc.

Publicly available research
  Blogs, including DBMS2 (www.DBMS2.com -- the source for most of this talk)
  Feed at www.monash.com/blogs.html
  White papers and more at www.monash.com

User and vendor consulting

Our agenda

Why are there such things as specialized analytic DBMS?

What are the major analytic DBMS product alternatives?

What are the most relevant differentiations among analytic DBMS users?

What’s the best process for selecting an analytic DBMS?

Why are there specialized analytic DBMS?

General-purpose database managers are optimized for updating short rows …
… not for analytic query performance
10-100X price/performance differences are not uncommon

At issue is the interplay between storage, processors, and RAM

Moore’s Law, Kryder’s Law, and a huge exception

Growth factors:
  Transistors/chip: >100,000 since 1971
  Disk density: >100,000,000 since 1956
  Disk speed: 12.5 since 1956

The disk speed barrier dominates everything!

[Bar chart: compound annual growth rates, on a 0%-45% scale, for transistors/chip since 1971, disk density since 1956, and disk speed since 1956.]
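For a sense of scale, those growth factors can be restated as compound annual growth rates. Taking roughly 2008, about when this deck appears to have been prepared, as the endpoint is an assumption, so the percentages are approximate:

```latex
% CAGR = (growth factor)^(1/years) - 1, with ~2008 assumed as the endpoint
\[
\mathrm{CAGR}_{\text{transistors/chip}} \approx 100{,}000^{1/37} - 1 \approx 36\%, \quad
\mathrm{CAGR}_{\text{disk density}} \approx 100{,}000{,}000^{1/52} - 1 \approx 42\%, \quad
\mathrm{CAGR}_{\text{disk speed}} \approx 12.5^{1/52} - 1 \approx 5\%
\]
```

Roughly 5% a year for disk speed versus 35-45% for everything else is the "huge exception" the slide title refers to.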


The “1,000,000:1” disk-speed barrier

RAM access times ~5-7.5 nanoseconds
CPU clock speed <1 nanosecond
Interprocessor communication can be ~1,000X slower than on-chip
Disk seek times ~2.5-3 milliseconds
  Rotational limit = ½ rotation, i.e., 1/30,000 minutes, i.e., 1/500 seconds = 2 ms

Tiering brings it closer to ~1,000:1 in practice, but even so the difference is VERY BIG
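Worked out explicitly (the implied 15,000 RPM figure comes from the 1/30,000-minute half rotation above):

```latex
\[
\tfrac{1}{2}\ \text{rotation at } 15{,}000\ \text{RPM}
  = \tfrac{1}{30{,}000}\ \text{min} = \tfrac{1}{500}\ \text{s} = 2\ \text{ms}
\]
\[
\frac{2\ \text{ms}}{\sim 5\ \text{ns (RAM)}} \approx 4\times 10^{5},
\qquad
\frac{2\ \text{ms}}{<1\ \text{ns (CPU clock)}} > 2\times 10^{6}
\]
```

Hence the "1,000,000:1" framing, and why even the ~1,000:1 that tiering yields in practice remains a very big gap.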

Software strategies to optimize analytic I/O

Minimize data returned
  Classic query optimization
Minimize index accesses
  Page size
Precalculate results (see the sketch after this list)
  Materialized views
  OLAP cubes
Return data sequentially
Store data in columns
Stash data in RAM
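A minimal Python sketch of the "precalculate results" strategy, in the spirit of a materialized view: the aggregate is computed once at load time, so repeated queries read a handful of precomputed values instead of rescanning the detail rows. The table shape and column names are invented for the example.

```python
from collections import defaultdict

# Hypothetical detail data: (store_id, sale_date, amount) rows.
sales = [
    (1, "2009-01-05", 120.0),
    (1, "2009-01-06", 80.0),
    (2, "2009-01-05", 200.0),
]

# Materialized-view-style precalculation: total revenue per store,
# computed once when the data is loaded.
revenue_by_store = defaultdict(float)
for store_id, _sale_date, amount in sales:
    revenue_by_store[store_id] += amount

# Later analytic queries read the small precomputed result
# instead of rescanning every detail row.
print(revenue_by_store[1])  # 200.0
```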

Hardware strategies to optimize analytic I/O

Lots of RAM
Parallel disk access!!!
Lots of networking

Tuned MPP (Massively Parallel Processing) is the key

Specialty hardware strategies

Custom or unusual chips (rare)
Custom or unusual interconnects
Fixed configurations of common parts

Appliances or recommended configurations

And there’s also SaaS

18 contenders (and there are more)

Aster Data
Dataupia
Exasol
Greenplum
HP Neoview
IBM DB2 BCUs
Infobright/MySQL
Kickfire/MySQL
Kognitio
Microsoft Madison
Netezza
Oracle Exadata
Oracle w/o Exadata
ParAccel
SQL Server w/o Madison
Sybase IQ
Teradata
Vertica

General areas of feature differentiation

Query performance
Update/load performance
Compatibilities
Advanced analytics
Alternate datatypes
Manageability and availability
Encryption and security

Major analytic DBMS product groupings

Architecture is a hot subject

Traditional OLTP
Row-based MPP
Columnar
MOLAP/array-based (not covered tonight)

Traditional OLTP examples

Oracle (especially pre-Exadata)
IBM DB2 (especially mainframe)
Microsoft SQL Server (pre-Madison)

Analytic optimizations for OLTP DBMS

Two major kinds of precalculation
  Star indexes
  Materialized views
Other specialized indexes
Query optimization tools
OLAP extensions (SQL 2003; see the sketch after this list)
Other embedded analytics
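For the "OLAP extensions (SQL 2003)" bullet, a minimal sketch of the kind of feature meant is a window function such as a per-group running total. The example below uses Python's sqlite3 purely for convenience (window functions require SQLite 3.25 or later); the sales table is hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical sales fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "2009-01-01", 100.0),
     ("East", "2009-01-02", 50.0),
     ("West", "2009-01-01", 75.0)],
)

# SQL:2003-style window function: running total of sales per region.
rows = conn.execute("""
    SELECT region, sale_date, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY sale_date)
               AS running_total
    FROM sales
""").fetchall()

for row in rows:
    print(row)
```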

Drawbacks

Complexity and people cost
Hardware cost
Software cost
Absolute performance

Legitimate use scenarios

When TCO isn’t an issue
Undemanding performance (and therefore administration too)
When specialized features matter
  OLTP-like
  Integrated MOLAP
  Edge-case analytics
Rigid enterprise standards
Small enterprise/true single-instance

Row-based MPP examples

Teradata
DB2 (open systems version)
Netezza
Oracle Exadata (sort of)
DATAllegro/Microsoft Madison
Greenplum
Aster Data
Kognitio
HP Neoview

Typical design choices in row-based MPP

“Random” (hashed or round-robin) data distribution among nodes (see the sketch after this list)
Large block sizes
  Suitable for scans rather than random accesses
Limited indexing alternatives
  Or little optimization for using the full boat
Carefully balanced hardware
High-end networking
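A minimal Python sketch of the hashed distribution mentioned in the first bullet: each row is routed to a node by hashing its distribution key, so a large scan can run on every node in parallel. The four-node cluster and the use of customer_id as the key are illustrative assumptions.

```python
import zlib

NUM_NODES = 4  # illustrative cluster size


def node_for(key: str) -> int:
    """Map a distribution key to a node by hashing it deterministically."""
    # zlib.crc32 is used only so the mapping is stable across runs.
    return zlib.crc32(key.encode()) % NUM_NODES


# Rows keyed on customer_id spread across the nodes, so a full-table
# scan can proceed on all nodes in parallel.
for customer_id in ["C001", "C002", "C003", "C004"]:
    print(customer_id, "-> node", node_for(customer_id))
```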

Tradeoffs among row MPP alternatives

Enterprise standards
Vendor size
Hardware lock-in
Total system price
Features

Columnar DBMS examples

Sybase IQ
SAND
Vertica
ParAccel
InfoBright
Kickfire
Exasol
MonetDB
SAP BI Accelerator (sort of)

Columnar pros and cons

Bulk retrieval is faster
Pinpoint I/O is slower
Compression is easier (see the sketch after this list)
Memory-centric processing is easier
MPP is not quite as crucial
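A minimal Python sketch of why compression is easier in a column store: one column's values sit contiguously, so simple run-length encoding collapses low-cardinality data sharply, and bulk retrieval of that column reads the short encoded list sequentially. The sample column is invented.

```python
from itertools import groupby

# One low-cardinality column stored contiguously, as a column store would.
status_column = ["shipped", "shipped", "shipped", "open", "open", "shipped"]

# Run-length encode it into (value, run length) pairs.
rle = [(value, sum(1 for _ in run)) for value, run in groupby(status_column)]
print(rle)  # [('shipped', 3), ('open', 2), ('shipped', 1)]

# Bulk retrieval of just this column touches the short encoded list
# sequentially, instead of reading every full-width row.
```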

Segmentation – a first cut

One database to rule them all
One analytic database to rule them all
Frontline analytic database
Very, very big analytic database
Big analytic database handled very cost-effectively

Basics of systematic segmentation

Use cases
Metrics
Platform preferences

Use cases – a first cut

Light reporting
Diverse EDW
Big Data
Operational analytics

Metrics – a first cut

Total raw/user data
  Below 1-2 TB, references abound
  10 TB is another major breakpoint
Total concurrent users
  5, 15, 50, or 500?
Data freshness
  Hours
  Minutes
  Seconds

Basic platform issues

Enterprise standards
Appliance-friendliness
Need for MPP?
Cloud/SaaS

The selection process in a nutshell

Figure out what you’re trying to buy
Make a shortlist
Do free POCs*
Evaluate and decide

*The only part that’s even slightly specific to the analytic DBMS category

Figure out what you’re trying to buy

Inventory your use cases
  Current
  Known future
  Wish-list/dream-list future
Set constraints
  People and platforms
  Money
Establish target SLAs
  Must-haves
  Nice-to-haves

Use-case checklist -- generalities

Database growth
  As time goes by …
  More detail
  New data sources
Users (human)
Users/usage (automated)
Freshness (data and query results)

Use-case checklist – traditional BI

Reports
  Today
  Future
Dashboards and alerts
  Today
  Future
  Latency
Ad-hoc
  Users
  Now that we have great response time …

Use-case checklist – data mining

How much do you think it would improve results to
  Run more models?
  Model on more data?
  Add more variables?
  Increase model complexity?
Which of those can the DBMS help with anyway?
What about scoring?
  Real-time
  Other latency issues

SLA realism

What kind of turnaround truly matters?
  Customer or customer-facing users
  Executive users
  Analyst users
How bad is downtime?
  Customer or customer-facing users
  Executive users
  Analyst users

Short list constraints

Cash cost
  But purchases are heavily negotiated
Deployment effort
  Appliances can be good
Platform politics
  Appliances can be bad
  You might as well consider incumbent(s)

Filling out the shortlist

Who matches your requirements in theory?

What kinds of evidence do you require?
  References?
    How many?
    How relevant?
  A careful POC?
  Analyst recommendations?
  General “buzz”?

A checklist for shortlists

What’s your tolerance for specialized hardware?
What’s your tolerance for set-up effort?
What’s your tolerance for ongoing administration?
What are your insert and update requirements?
At what volumes will you run fairly simple queries?
What are your complex queries like?
For which third-party tools do you need support?

and, most important,

Are you madly in love with your current DBMS?

Proof-of-Concept basics

The better you match your use cases, the more reliable the POC is

Most of the effort is in the set-up
You might as well do POCs for several vendors – at (almost) the same time!
Where is the POC being held?

The three big POC challenges

Getting data
  Real? (Politics, privacy)
  Synthetic?
  Hybrid?
Picking queries
  And more?
Realistic simulation(s)
  Workload
  Platform
  Talent

POC tips

Don’t underestimate requirements
Don’t overestimate requirements
Get SOME data ASAP
Don’t leave the vendor in control
Test what you’ll be buying
Use the baseball bat

Evaluate and decide

It all comes down to

Cost
Speed
Risk

and in some cases

Time to value
Upside

Further information

Curt A. Monash, Ph.D.
President, Monash Research

Editor, DBMS2

contact@monash.com
http://www.monash.com
http://www.DBMS2.com
