Daniel Abadi -- Yale University

Column-Stores vs. Row-Stores: How Different Are They Really?

Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting)

June 12th, 2008

Row vs. Column-Stores

[Figure: the same table (Last Name, First Name, E-mail, Phone #, Street Address) laid out as a row-store and as a column-store.]

Row-Store

+ Easy to add a new record

− Might read in unnecessary data

Column-Store

+ Only need to read in relevant data

− Tuple writes might require multiple seeks
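
The trade-off above can be illustrated with a minimal Python sketch. The sample records and the helper names (`fields_read_row_store`, `fields_read_column_store`, `pieces_written_per_insert`) are hypothetical, and the code counts logical fields rather than disk pages, so it is only an illustration and not a model of a real storage engine.

```python
# Hypothetical sample data: three records with the five attributes from the slide.
records = [
    ("Smith", "Ann", "ann@example.com", "555-0100", "1 Elm St"),
    ("Jones", "Bob", "bob@example.com", "555-0101", "2 Oak St"),
    ("Lee", "Cy", "cy@example.com", "555-0102", "3 Fir St"),
]
COLUMNS = ("last_name", "first_name", "email", "phone", "street_address")

# Column layout: one list per attribute, aligned by position.
columns = {name: [rec[i] for rec in records] for i, name in enumerate(COLUMNS)}


def fields_read_row_store(rows, wanted):
    """A row layout fetches whole records, so `wanted` does not reduce the work."""
    return sum(len(row) for row in rows)


def fields_read_column_store(cols, wanted):
    """A column layout reads only the attributes the query asks for."""
    return sum(len(cols[name]) for name in wanted)


def pieces_written_per_insert(layout):
    """One contiguous record for a row layout, one piece per column otherwise."""
    return 1 if layout == "row" else len(COLUMNS)
```

Reading only `email` touches every field in the row layout but a single list in the column layout, while inserting one record touches a single place in the row layout and one place per column in the column layout.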

Column-Stores

• Really good for read-mostly data warehouses
  - Lots of column scans and aggregations
  - Writes tend to be in batch
  - [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this
  - Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores
  - Factor of 5 faster on performance
  - Factor of 2 superior on price/performance

Data Warehouse DBMS Software

• $4.5 billion industry (out of total $16 billion DBMS software industry)

• Growing 10% annually

Momentum

• Right solution for growing market → $$$$
• Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol are new entrants
• Sybase IQ’s profits rapidly increasing
• Yahoo’s data warehouse, the world’s largest (multi-petabyte), is a column-store (from the Mahat Technologies acquisition)

Paper Looks At Key Question

• How much of the buzz around column-stores is just marketing hype?
  - Do you really need to buy Sybase IQ or Vertica?
  - How far will your current row-store take you?
  - Can you get column-store performance from a row-store?
  - Can you simulate a column-store in a row-store?

Paper Methodology

• Comparing row-store vs. column-store is dangerous/borderline meaningless

• Instead, compare row-store vs. row-store and column-store vs. column-store
  - Simulate a column-store inside of a row-store
  - Remove column-oriented features from column-store until it behaves like a row-store

Simulate Column-Store Inside Row-Store

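[Figure: a row-store table (Last Name, First Name, E-mail, Phone #, Street Address) simulated as a column-store in two ways.]

Option A: Vertical Partitioning (one table per column, e.g. Last Name, First Name, E-mail, each pairing the row numbers 1, 2, 3 with that column's values)

Option B: Index Every Column (e.g. a Last Name Index and a First Name Index)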
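
A minimal sketch of Option A, using Python's built-in `sqlite3` module as a stand-in row-store. The table names (`vp_last_name`, `vp_first_name`, `vp_email`), the `tid` column, and the sample rows are hypothetical; the slides do not name the row-store product used in the experiments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One two-column table per attribute: (tuple id, value).
for table in ("vp_last_name", "vp_first_name", "vp_email"):
    conn.execute(f"CREATE TABLE {table} (tid INTEGER PRIMARY KEY, value TEXT)")

# Hypothetical sample rows.
people = [
    (1, "Smith", "Ann", "ann@example.com"),
    (2, "Jones", "Bob", "bob@example.com"),
    (3, "Lee", "Cy", "cy@example.com"),
]
for tid, last, first, email in people:
    conn.execute("INSERT INTO vp_last_name VALUES (?, ?)", (tid, last))
    conn.execute("INSERT INTO vp_first_name VALUES (?, ?)", (tid, first))
    conn.execute("INSERT INTO vp_email VALUES (?, ?)", (tid, email))

# A query that needs two attributes joins two partitions on tid;
# the e-mail partition is never touched.
rows = conn.execute(
    """
    SELECT l.value, f.value
    FROM vp_last_name AS l
    JOIN vp_first_name AS f ON f.tid = l.tid
    WHERE l.value = 'Jones'
    """
).fetchall()
```

Every additional attribute a query needs adds another join on `tid`, which is the cost this simulation pays for reading fewer columns.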

Experiments

• Star Schema Benchmark (SSBM)
  - Fact table contains 17 columns and 60,000,000 rows
  - 4 dimension tables, biggest one has 80,000 rows
  - Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table
  - [OOC06]
• Implemented by professional DBA
  - Original row-store plus 2 column-store simulations on same row-store product

SSBM Averages

[Figure: bar chart of average SSBM query time in seconds.]

Normal Row-Store: 25.7
Vertically Partitioned Row-Store: 79.9
Row-Store With All Indexes: 221.2

What’s Going On?

• Vertically Partitioned Case
  - Tuple Sizes
  - Horizontal Partitioning
• All Indexes Case
  - Tuple Reconstruction

Tuple Size

[Figure: three vertically partitioned tables; each row carries a Tuple Header and a TID (1, 2, 3) in addition to its Column Data.]

• Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
• Complete fact table takes up ~4 GB (compressed)
• Vertically partitioned tables take up 0.7-1.1 GB (compressed)
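
A rough back-of-envelope calculation from the figures above. It assumes (this is not stated on the slide) that a query scans every vertical partition it touches in full, and that the alternative is one full scan of the complete fact table. The name `scan_range_gb` is made up for illustration.

```python
FACT_TABLE_GB = 4.0               # complete fact table, compressed (from the slide)
PARTITION_GB = (0.7, 1.1)         # size range of one vertical partition (from the slide)
COLUMNS_TOUCHED = (3 + 1, 4 + 2)  # 3-4 foreign keys plus 1-2 numeric columns


def scan_range_gb(columns_touched, partition_gb):
    """Smallest and largest total volume read across the touched partitions."""
    fewest, most = columns_touched
    smallest, largest = partition_gb
    return fewest * smallest, most * largest


low_gb, high_gb = scan_range_gb(COLUMNS_TOUCHED, PARTITION_GB)
print(f"vertical partitions: {low_gb:.1f}-{high_gb:.1f} GB vs. full table: {FACT_TABLE_GB:.1f} GB")
```

Under that assumption, the partitions a query reads add up to roughly 2.8-6.6 GB, which is comparable to or larger than the 4 GB fact table, so the per-row overhead can cancel most of the benefit of reading fewer columns.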

Horizontal Partitioning

• Fact table horizontally partitioned on year
  - Year is an element of the ‘Date’ dimension table
  - Most queries in SSBM have a predicate on year
  - Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them

What’s Going On?

• Vertically Partitioned Case
  - Tuple Sizes
  - Horizontal Partitioning
• All Indexes Case
  - Tuple Construction

Tuple Construction

• Common type of query:

    SELECT store_name, SUM(revenue)
    FROM Facts, Stores
    WHERE Facts.store_id = Stores.store_id
      AND Stores.country = 'Canada'
    GROUP BY store_name

Tuple Construction

• Result of lower part of query plan is a set of TIDs that passed all predicates
• Need to extract SELECT attributes at these TIDs
  - BUT: index maps value to TID
  - You really want to map TID to value (i.e., a vertical partition)

=> Tuple construction is SLOW
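
The mismatch between the two mappings can be sketched in Python. The column contents, the list of passing TIDs, and the function names are hypothetical, and a dictionary stands in for a real index structure, so this shows only the direction-of-lookup problem, not actual index performance.

```python
from collections import defaultdict

# Hypothetical column, positioned by TID (TID 0 is the first row).
store_name = ["Maple", "Liberty", "Tundra", "Maple", "Harbor"]


def build_index(column):
    """Index: value -> list of TIDs holding that value."""
    index = defaultdict(list)
    for tid, value in enumerate(column):
        index[value].append(tid)
    return index


def values_via_index(index, tids):
    """Recover values for given TIDs from a value -> TID index.

    The index is keyed the wrong way round, so every entry is visited.
    """
    wanted = set(tids)
    found = {}
    for value, entry_tids in index.items():
        for tid in entry_tids:
            if tid in wanted:
                found[tid] = value
    return [found[tid] for tid in tids]


def values_via_partition(column, tids):
    """Recover values from a TID -> value layout: one direct lookup per TID."""
    return [column[tid] for tid in tids]


passing_tids = [0, 3, 4]  # TIDs assumed to have passed all predicates
```

Both functions return the same values, but the index version walks the whole index to answer a question the positional layout answers with one lookup per TID.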

So...

• All indexes approach is a poor way to simulate a column-store
• Problems with vertical partitioning are NOT fundamental
  - Store tuple header in a separate partition
  - Allow virtual TIDs
  - Allow horizontal partitioning (HP) using a foreign key on a different vertical partition (VP)
• So can row-stores simulate column-stores?

Row-Store vs. Column-Store

[Figure: bar chart of average SSBM query time in seconds.]

Row-Store: 25.7
Row-Store (MV): 11.7
C-Store: 4.4

Column-Store Experiments

• Start with column-store (C-Store)
• Remove column-store-specific performance optimizations
• End with column-store with a row-oriented query executor

Compression

• Higher data value locality in column-stores
  - Better compression ratio → reduced I/O
• Can use schemes like run-length encoding
  - Easy to operate on directly for improved performance ([AMF06])

[Figure: a Quarter column of repeated values (Q1, Q1, ..., Q2, Q2, ...) run-length encoded as (value, start position, run length) triples: (Q1, 1, 300), (Q2, 301, 350), (Q3, 651, 500), (Q4, 1151, 600).]
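
A minimal run-length encoding sketch in Python that reproduces the triples in the figure, reading them as (value, start position, run length) with 1-based positions. The function names `rle_encode` and `count_where` are hypothetical, and the column contents are an assumption chosen to match the figure's run lengths.

```python
from itertools import groupby


def rle_encode(column):
    """Encode a column as (value, start_position, run_length) triples, 1-based."""
    triples = []
    position = 1
    for value, run in groupby(column):
        length = sum(1 for _ in run)
        triples.append((value, position, length))
        position += length
    return triples


def count_where(triples, wanted):
    """Count rows equal to `wanted` without decompressing: sum matching run lengths."""
    return sum(length for value, _, length in triples if value == wanted)


# Assumed column contents: 300 x Q1, 350 x Q2, 500 x Q3, 600 x Q4.
quarter = ["Q1"] * 300 + ["Q2"] * 350 + ["Q3"] * 500 + ["Q4"] * 600
encoded = rle_encode(quarter)
```

`count_where` is one example of operating directly on compressed data: it looks at four triples instead of 1,750 individual values.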

Early vs. Late Materialization

• Early Materialization: create rows first. But:
  - Poor memory bandwidth utilization
  - Lose opportunity for vectorized operation

QUERY:

    SELECT custID, SUM(price)
    FROM table
    WHERE (prodID = 4) AND (storeID = 1)
    GROUP BY custID

[Figure: the columns prodID (4, 4, 4, 4), storeID (2, 1, 3, 1), custID (2, 3, 3, 3), and price (7, 13, 42, 80) pass through a Construct step that builds rows, which then feed Select + Aggregate.]
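
The two strategies can be contrasted on the slide's example query with a short Python sketch. The column values are taken from the figure as best it can be read; the function names and the use of plain lists and sets for columns and position lists are assumptions made for illustration, not how C-Store implements either strategy.

```python
from collections import defaultdict

# Column values as read from the slide's figure.
prod_id = [4, 4, 4, 4]
store_id = [2, 1, 3, 1]
cust_id = [2, 3, 3, 3]
price = [7, 13, 42, 80]


def early_materialization():
    """Stitch all four columns into rows first, then filter and aggregate row by row."""
    rows = list(zip(prod_id, store_id, cust_id, price))
    totals = defaultdict(int)
    for p, s, c, amount in rows:
        if p == 4 and s == 1:
            totals[c] += amount
    return dict(totals)


def late_materialization():
    """Filter each column separately, intersect positions, then fetch only what is needed."""
    prod_hits = {i for i, p in enumerate(prod_id) if p == 4}
    store_hits = {i for i, s in enumerate(store_id) if s == 1}
    totals = defaultdict(int)
    for i in sorted(prod_hits & store_hits):
        totals[cust_id[i]] += price[i]
    return dict(totals)
```

Both functions give the same answer. The late version never builds a full row, and it reads `cust_id` and `price` only at the positions that survived both predicates.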

Other Column-Store Optimizations

• Invisible join
  - Column-store specific join
  - Optimizations for star schemas
  - Similar to a semi-join
• Block Processing
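
The slide gives only an outline of the invisible join. The sketch below shows one way a semi-join-style star join over columns could work; it illustrates the general idea under assumed data and is not the C-Store implementation. All table contents and names (`stores`, `fact_store_id`, `matching_keys`, `semi_join_positions`) are hypothetical.

```python
# Hypothetical dimension table, stored as aligned columns.
stores = {
    "store_id": [1, 2, 3],
    "country": ["Canada", "USA", "Canada"],
    "store_name": ["Maple", "Liberty", "Tundra"],
}
# Hypothetical fact table columns.
fact_store_id = [1, 2, 3, 1, 2]
fact_revenue = [10, 20, 30, 40, 50]


def matching_keys(dim, attr, wanted, key):
    """Step 1: apply the dimension predicate and keep only the qualifying keys."""
    return {k for k, v in zip(dim[key], dim[attr]) if v == wanted}


def semi_join_positions(fk_column, keys):
    """Step 2: scan one foreign-key column and keep positions whose key qualifies."""
    return [pos for pos, fk in enumerate(fk_column) if fk in keys]


def revenue_by_store_name():
    """Step 3: fetch the other columns only at the surviving positions."""
    keys = matching_keys(stores, "country", "Canada", "store_id")
    name_of = dict(zip(stores["store_id"], stores["store_name"]))
    totals = {}
    for pos in semi_join_positions(fact_store_id, keys):
        name = name_of[fact_store_id[pos]]
        totals[name] = totals.get(name, 0) + fact_revenue[pos]
    return totals
```

With several dimension predicates, step 2 would run once per foreign-key column and the resulting position lists would be intersected before step 3.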

Simplified Version of Results

[Figure: bar chart of average SSBM query time in seconds.]

Original C-Store: 4.4
C-Store, No Compression: 14.9
C-Store, Early Materialization: 40.7

Conclusion

• Might be possible to simulate a column-store in a row-store, BUT:
  - Need better support for vertical partitioning at the storage layer
  - Need support for column-specific optimizations at the executor level

• Working with HP Labs to find out

Come Join the Yale DB Group!