Transcript
1. ICDE 2015 Research 3: Distributed Storage and Processing (NTT) 2015.5.16
2. Papers:
1. PABIRS: A Data Access Middleware for Distributed File Systems. S. Wu (Zhejiang Univ.), G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett (UIUC)
2. Scalable Distributed Transactions across Heterogeneous Stores. A. Dey, A. Fekete, and U. Röhm (Univ. of Sydney)
3. Motivation of PABIRS: A Data Access Middleware for Distributed File Systems
[Fig. 1. Distribution of Call Frequency: call frequency per caller ID (1,000 IDs). Fig. 2. Number of Blocks per Key: number of data blocks per caller ID.]
Goal (from the paper): support efficient data retrieval for various query workloads. In the motivating telecom log, call frequency over caller IDs follows a power-law distribution (Fig. 1), with roughly 1% of the IDs being hot, while each ID's records are scattered across many blocks of the DFS (Fig. 2).
4. Approach: PABIRS = Bitmap signature + LSM index, built as a DFS wrapper. It serves GET on the DFS, provides a MapReduce InputFormat for map input, and acts as a KVS-style secondary index; the bitmap index covers all data blocks, while the LSM index covers hot values.
[Fig. 3. Architecture of PABIRS: the wrapper sits on the DFS and exposes InputFormat, Insert(key, value), and Lookup(key).]
[Fig. 4. Bitmap Example: each data block carries a block signature; a set bit for a UID means the block may contain records of that UID.]
From the paper: III. HYBRID
INDEXING SCHEME The general idea behind our hybrid indexing scheme
is to build bitmap signatures for all data blocks and select
certain hot keys for LSM index. Bitmap signature is created for
multiple attributes without re-ordering the records. To facilitate
efficient parallel search, we design a hierarchical model based on a
virtual Directed Acyclic Graph (DAG) structure, in which each
intermediate vertex is a summary of the signatures accessible on
its descendants. We present an example DAG structure in Figure 5 as
a virtual index structure.
From the LSM-Tree paper [O'Neil+, 96], on the rolling merge: the merge proceeds with
entries taken from the leaf level of the C0 tree, thus decreasing
the size of C0, and creates a newly merged leaf node of the C1
tree. The buffered multi-page block containing old C1 tree nodes
prior to merge is called the emptying block, and new leaf nodes
are written to a different buffered multi-page block called the
filling block. When this filling block has been packed full with
newly merged leaf nodes of C1, the block is written to a new free
area on disk. The new multi-page block containing merged results is
pictured in Figure 2.2 as lying on the right of the former nodes.
Subsequent merge steps bring together increasing index value
segments of the C0 and C1 components until the maximum values are
reached and the rolling merge starts again from the smallest
values.
[Figure 2.2. Conceptual picture of rolling merge steps, with result written back to disk: the C0 tree resides in memory, the C1 tree on disk.]
Newly
merged blocks are written to new disk positions, so that the old
blocks will not be over- written and will be available for recovery
in case of a crash. The parent directory nodes in C1, also buffered
in memory, are updated to reflect this new leaf structure, but
usually remain in buffer for longer periods to minimize I/O; the
old leaf nodes from the C1 component are invalidated after the
merge step is complete and are then deleted from the C1 directory.
In general, there will be leftover leaf-level entries for the
merged C1 component following each merge step, since a merge step
is unlikely to result in a new node just as the old leaf node
empties. The same consideration holds for multi-page blocks, since
in general when the filling block has filled with newly merged
nodes, there will be numerous nodes containing entries still in the emptying block.
Summary: the LSM-Tree [O'Neil+, 96] keeps a small in-memory component C0 (an AVL-tree) and a large on-disk component C1 (a B-tree); entries migrate from C0 into C1 through the rolling merge (figure taken from the LSM-Tree paper).
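The rolling merge above can be condensed into a toy model. This is our own sketch (names like LsmSketch and rollingMerge, and the thresholds, are invented here), not the paper's or O'Neil et al.'s implementation:

```java
import java.util.*;

// Toy sketch of an LSM-tree rolling merge: C0 is an in-memory sorted
// component, C1 a sorted "on-disk" run. A merge step drains the smallest
// C0 entries and rewrites only the affected low-key segment of C1 into
// new ("filling") leaves; leaves beyond the merge point stay untouched,
// which is what keeps the old state recoverable after a crash.
public class LsmSketch {
    final TreeMap<String, String> c0 = new TreeMap<>();                   // memory
    List<AbstractMap.SimpleEntry<String, String>> c1 = new ArrayList<>(); // "disk"

    public void insert(String key, String value) {
        c0.put(key, value);                  // inserts are memory-speed
        if (c0.size() > 4) rollingMerge(2);  // arbitrary toy thresholds
    }

    // Merge the `batch` smallest C0 entries with the matching C1 key range.
    void rollingMerge(int batch) {
        TreeMap<String, String> merged = new TreeMap<>();
        String hi = null;
        for (int n = 0; n < batch && !c0.isEmpty(); n++) {
            Map.Entry<String, String> e = c0.pollFirstEntry();
            hi = e.getKey();                 // keys come out in sorted order
            merged.put(e.getKey(), e.getValue());
        }
        int i = 0;
        List<AbstractMap.SimpleEntry<String, String>> filling = new ArrayList<>();
        while (i < c1.size() && c1.get(i).getKey().compareTo(hi) <= 0) {
            // old C1 leaves in the merged range; newer C0 versions win
            merged.putIfAbsent(c1.get(i).getKey(), c1.get(i).getValue());
            i++;
        }
        for (Map.Entry<String, String> e : merged.entrySet())
            filling.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        filling.addAll(c1.subList(i, c1.size())); // leaves beyond the merge point
        c1 = filling;                             // "write" the new block
    }

    public String lookup(String key) {
        if (c0.containsKey(key)) return c0.get(key); // newest version first
        for (AbstractMap.SimpleEntry<String, String> e : c1)
            if (e.getKey().equals(key)) return e.getValue();
        return null;
    }
}
```

Inserts stay memory-speed until C0 exceeds its threshold, mirroring why PABIRS uses an LSM structure for its fast-insertion hot-key index.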
5. Key techniques:
1. Bitmap signature: low-level vertices are grouped under a high-level vertex according to the fanout F, and the signature DAG is searched in parallel in the BSP style of Pregel [Malewicz+, 10].
2. LSM index: hot keys are additionally maintained in an LSM index.
[Fig. 7. Search Cost of Bitmap and LSM: processing time (msec) vs. selectivity of call numbers (0.02%-0.2%), for bitmap and lsm.]
From the paper: "... where E is the fanout of the B-tree. We try to insert the key into LSM index, only when the estimated cost is no larger than the cost of bitmap index. Based on the inequality above, we are able to calculate the minimal selectivity, which makes LSM a more attractive selection than the bitmap. In Figure 7, we apply the theoretical ..."
[Fig. 8. Index Update: the Index Manager keeps data statistics; the incoming data stream is appended to the DFS, new data is inserted into the LSM index, and bitmap signatures are built by a MapReduce algorithm.]
From the paper: "C. Update on the Indices. PABIRS is specifically designed for the applications that require fast data insertion. In PABIRS, bitmap index is a lightweight index which can be built in a batch, while LSM index is intentionally designed to support the fast insertion."
Reading Fig. 7: the two cost curves cross around 0.1% selectivity; below it the LSM index is cheaper, above it the bitmap index wins.
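The cost-based rule quoted above can be illustrated with a toy model. The formulas below are illustrative stand-ins of our own, not the paper's exact cost expressions: the bitmap path pays a per-block signature check plus a read for each candidate block, while the LSM path pays a B-tree descent plus a read per matching tuple.

```java
// Toy cost model for the bitmap-vs-LSM decision in a PABIRS-like hybrid
// index. All constants and formulas are illustrative only.
public class IndexChoice {
    // Scan path: check every block signature, then read candidate blocks.
    static double bitmapCost(long blocks, double blockHitRate,
                             double readCost, double checkCost) {
        return blocks * checkCost + blocks * blockHitRate * readCost;
    }

    // LSM path: descend a B-tree of the given fanout, then read one block
    // per matching tuple (a pessimistic upper bound).
    static double lsmCost(long tuples, double selectivity, int fanout,
                          double readCost) {
        double height = Math.log(tuples) / Math.log(fanout);
        return height + selectivity * tuples * readCost;
    }

    // The paper's rule: put a key into the LSM index only when its
    // estimated lookup cost is no larger than the bitmap-guided scan.
    static boolean preferLsm(long tuples, long blocks, double selectivity,
                             double blockHitRate, int fanout) {
        double read = 1.0, check = 0.01;  // relative unit costs (assumed)
        return lsmCost(tuples, selectivity, fanout, read)
            <= bitmapCost(blocks, blockHitRate, read, check);
    }
}
```

With these toy constants the crossover behaves like Fig. 7: very selective (hot-key) lookups favor the LSM index, broader ones the bitmap-guided scan.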
From the paper (Section III, continued): Bitmap signature is created for multiple attributes without
re-ordering the records. To facilitate efficient parallel search, we
design a hierarchical model based on a virtual Directed Acyclic
Graph (DAG) structure, in which each intermediate vertex is a
summary of the signatures accessible on its descendants. We present
an example DAG structure in Figure 5 as a virtual index structure
on two different attributes, using 8 bits and 5 bits for these two
attributes respectively. Generally speaking, the DAG structure
consists of three layers. The retrieval layer contains individual
signatures cor- responding to the data blocks, while each
intermediate vertex in the index layer is associated with a summary
signature by merging signatures of its children vertices. Data
layer refers to the physical data blocks stored in the DFS.
Signatures and their corresponding graph vertices are randomly
distributed to multiple DFS nodes. In the rest of the paper, we
refer to the vertices in the retrieval layer as data vertices and the
vertices in the index layer as directory vertices. On the other
hand, LSM index replicates the records with hot keys and sorts them
in its B-trees. For each indexing attribute, we independently
create an LSM index to maintain its sorted replicas of hot data. In
the rest of the section, we first introduce the tuning approaches
used on the bitmap-based index, followed by the selection strategy
between these two indices. For better readability, the notations
used in the section are summarized in Table I.
A. Optimizations on Bitmap Signature
Suppose the signature of each data block follows the same distribution {p1, p2, ..., pk}, in which each pj indicates the probability of having 1 on the j-th bit. Because of the exclusiveness between the values, the signature is a sparse vector, i.e. Σj pj ≪ k. Given two signatures s1 and s2, the expected number of common 1s on the j-th bits of both s1 and s2 is Σj (pj)^2. It is much smaller than the expected number of 1s in either s1 or s2, i.e. Σj pj, unless there exists a pj dominating the distribution. When records are randomly assigned to the data blocks, each probability pj is supposed to be a small positive number. This leads to the phenomenon of Weak Locality in PABIRS.
[Table I. Notation: N: total number of records/tuples; Bp: size of a data block; Bt: size of a tuple; query selectivity; k: number of distinct values of the attributes; m: number of values mapped to the same bit; F: fanout of the directory vertex; W: number of virtual machines (workers); rl: computation cost of a directory vertex; rn: network delay between any pair of vertices; rd: the overhead of reading a data block; selectivity of a particular queried key; fmin: minimum frequency of any value in a domain; p(): pdf of distribution on selectivity of queries.]
[Fig. 5. Demonstration of Signature Graph]
It is thus not helpful to group similar signatures when building a high-level directory vertex in the
index layer, because such merging only generates a new signature
with a union of 1s from the signatures of its children vertices.
Although it is unlikely to optimize by better grouping, the fanout
of the abstract tree structure, i.e. the number of children
vertices for every directory vertex, remains tunable and turns out
to be crucial to the searching efficiency.
1) Cost Model and Fanout Optimization: Instead of picking up similar signatures during bitmap construction, PABIRS simply groups the low-level vertices to generate a high-level vertex, based on a pre-specified fanout parameter F.
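The block-signature-plus-directory idea can be sketched as follows. This is our simplification: one hash bit per value stands in for the paper's m-values-per-bit mapping, and the DAG is reduced to a single directory level of fanout F:

```java
import java.util.*;

// Sketch of PABIRS-style bitmap signatures with one level of OR-summary
// directory vertices (our simplification of the paper's DAG).
public class SignatureSketch {
    static final int BITS = 32;  // signature width (assumed)

    static int bitFor(String key) { return Math.floorMod(key.hashCode(), BITS); }

    // One signature per data block: bit j is set if some record's value in
    // the block hashes to j. Several values may share a bit, so a set bit
    // only means "maybe present".
    static long signature(List<String> block) {
        long sig = 0L;
        for (String key : block) sig |= 1L << bitFor(key);
        return sig;
    }

    // Directory vertex: the OR of its children. If the summary bit is
    // unset, none of the children can contain the key.
    static long summarize(long[] children) {
        long s = 0L;
        for (long c : children) s |= c;
        return s;
    }

    // Return indices of candidate blocks, pruning whole groups of `fanout`
    // blocks at once via their summary signature.
    static List<Integer> candidates(List<List<String>> blocks, int fanout, String key) {
        long probe = 1L << bitFor(key);
        List<Integer> out = new ArrayList<>();
        for (int g = 0; g < blocks.size(); g += fanout) {
            int end = Math.min(g + fanout, blocks.size());
            long[] sigs = new long[end - g];
            for (int i = g; i < end; i++) sigs[i - g] = signature(blocks.get(i));
            if ((summarize(sigs) & probe) == 0) continue; // prune the group
            for (int i = g; i < end; i++)
                if ((sigs[i - g] & probe) != 0) out.add(i);
        }
        return out;
    }
}
```

A hit is only a "maybe", so candidate blocks must still be read; but there are no false negatives, and a zero summary bit safely skips all F children at once, which is why the fanout F matters for search cost.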
6. Experiments: Hadoop 1.0.4 + GPS [Salihoglu+, 12] (an open-source Pregel implementation); a 32-node Hadoop cluster with 4 CPUs and 8 GB RAM per node.
A. Three select queries, against HBase Phoenix, Impala, and BIDS [Lu+, 13].
B. TPC-H with skewed data [Bruno+, 05] (queries Q3, Q5, Q6), against Hive.
[Fig. 13. Response time (seconds) of queries Q1-Q3 for HBase, PABIRS, BIDS, Impala-8G and Impala-4G. Fig. 14. Effect of Data Size: average response time of Q1-Q3 over 80-400 GB. Fig. 19. Throughput of Concurrent Queries (Q1) and Fig. 20. Response Time of Concurrent Queries (Q1), vs. query batch size 0-32, for single and quad processor threads. Fig. 21. Performance of TPC-H Query (Skew): response time vs. skew factor 1-4 for Q3/Q5/Q6 on Hive (h) and PABIRS (p). Fig. 22. Performance of TPC-H Query (Selectivity): response time vs. query selectivity from 3.5% (3 months) to 14.2% (12 months).]
From the paper: Impala buffers intermediate join results in memory, which leads to the Memory Limit Exceeded exception. For Q3 and Q6, we build an index for the column shipdate; when the selectivity increases to a larger value (e.g., 365), PABIRS finds that index-based access is even worse than scan-based access and will automatically switch to the disk scan, which generates better performance. To avoid queries with an empty result, we intentionally select a number with at least one record under the base station.
handle queries with a high selectiv- ity but still involving
numerous tuples. As shown in Table III, in our 160G dataset, we
have 40960 blocks in total. Although the selectivities of the
queries are as low as 0.00001%, the records related to Q1, Q2 and
Q3 cover 477, 28863 and 343 data blocks respectively. The involved
data blocks, especially for Q1 and Q3, are nowhere close to the total
number of data blocks, while the overhead of loading hundreds of
data blocks from the disks remains high. In experiments, PABIRS,
Phoenix and BIDS are allowed to use 4 GB main memory on each node
of the cluster, while Impala is tested under two settings with 4
GB and 8 GB main memory respectively. The results in Figure 13
show that Impala-4G is unable to finish the queries in a reasonable
time (i.e., within 1,000 seconds), as it incurs high I/O cost on
memory-disk data swaps. This reveals the limitation of Impala in memory usage
efficiency. Moreover, Impala and BIDS show similar performance for
all queries, because both approaches adopt the scan-based
techniques (memory scan and disk scan). In the rest of the
experiments, we only report the results of Impala-8G, denoted as
Impala for short. The results also imply that PABIRS
significantly outperforms the other systems on all queries. When the
selectivity of the query is high, such as Q1 and Q3, HBase Phoenix
is the only alternative with close performance to PABIRS, because
of its adoption of a secondary index. But for a query involving a
large portion of data like Q2, HBase Phoenix is slow as it incurs
many random I/Os to retrieve all results.
TABLE III. PROCESSING TIME OF PABIRS
QID  selectivity  index time  disk time  total time
Q1   1.2%         1.03s       1.47s      2.50s
Q2   70%          2.11s       137.63s    139.74s
Q3   0.8%         1.04s       1.28s      2.32s
To gain better insight into the scalability of PABIRS, we observe that the performances of PABIRS and HBase Phoenix degrade slightly
when more insertions are conducted, because they need to build and
query indexes for the new tuples. Finally, we implement a simple
transaction module as discussed in Section 2. Our test transaction
retrieves all records of a specific phone number (normally hundreds
to thousands of records) and updates the values of NeID in those
records to a new value. We vary the number of concurrent
transactions and, as shown in Figure 18, PABIRS provides good
throughput for this test transaction. In PABIRS, queries can be
grouped into batches and share the index searching process. In Figure
19 and Figure 20, we show the throughput and response time for
varied batch size. As each node in the cluster is equipped with a
4-core CPU, we start four concurrent I/O threads at the same time.
For comparison purposes, we also show the result when a single I/O
thread runs. The throughput of four I/O threads is almost three
times higher than the single thread case. The throughput improves
dramatically for a larger query batch, since we can share more
signature and data scans among the queries. However, the results
imply that the throughput gain shrinks with the increase of the
query batch size. It is thus important to choose an appropriate
batch size in real applications. The response time is also affected by
the batch size. Figure 20 illustrates that the response time is
generally proportional to the batch size. If a strict real-time
requirement is needed, it is important for the system to carefully
choose the batch size, in order to strike a balance between the throughput
and response time. C. Analytic Query Performance In this group of
experiments, we evaluate the performance of PABIRS on data and
queries generated by the TPC-H benchmark. Specifically, we generate 320
GB data with different skew factors using the TPC-H Skew Generator.
We deploy Hive on top of PABIRS and compare the performances of
PABIRS against the original Hive on query Q3, Q5 and Q6 in TPC-H.
We also include Impala in the experiment. However, Impala requires
buffering all intermediate join results in memory, which leads to the Memory Limit Exceeded exception (cf. Fig. 13).
Notes: Fig. 21 compares Hive (h) and PABIRS (p) under varying skew, and Fig. 22 under varying selectivity; Q5 involves the orders table.
7. Scalable Distributed Transactions across Heterogeneous Stores
Cherry Garcia (CG): a client-coordinated transaction library spanning Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a high-throughput KVS); evaluated with YCSB+T [Dey+, 14] (a web-transaction workload). Example transaction:
BEGIN TRANSACTION
  SET item1 of Store1
  SET item2 of Store2
COMMIT TRANSACTION
8. Listing 1: example use of the CG API. The example also uses a third store (explained later) that acts as the Coordinating Data Store (CDS).

public void UserTransaction() {
  Datastore cds = Datastore.create("credentials.xml");
  Datastore gds = Datastore.create("goog_creds.xml");
  Datastore wds = Datastore.create("msft_creds.xml");
  Transaction tx = new Transaction(cds);
  try {
    tx.start();
    Record saving = tx.read(gds, "saving");
    Record checking = tx.read(wds, "checking");
    int s = saving.get("amount");
    int c = checking.get("amount");
    saving.set("amount", s - 5);
    checking.set("amount", c + 5);
    tx.write(gds, "saving", saving);
    tx.write(wds, "checking", checking);
    tx.commit();
  } catch (Exception e) {
    tx.abort();
  }
}
Listing 1. Example code that uses the API to access two data stores.

Walkthrough: the record "saving" lives in a Google Cloud Storage Datastore (gds) and "checking" in a Windows Azure Storage Datastore (wds); the Coordinating Data Store (CDS) is itself just another Datastore, passed to the Transaction constructor.
9. Cherry Garcia (CG): assumes each store provides strongly consistent single-item reads and an atomic Test-and-Set write.
II. SYSTEM DESIGN
In this section, we describe the design of our client-coordinated transaction processing protocol that enables transactions involving multiple data items that span multiple heterogeneous data store instances. The protocol is implemented as a library whose API abstracts data store instances in a class called Datastore, and these are accessed via a transaction coordinator abstraction, a class called Transaction. Each record is addressable using a string key and its contents are accessed using an object of a class called Record. Listing 1 is an example of an application that uses the API to access two records, one (saving) residing in an instance of Google Cloud Storage, abstracted by the Datastore gds, while the other (checking) is stored in Windows Azure Storage represented as the Datastore wds. The example also uses a third store (explained later) that acts as the Coordinating Data Store (CDS).
[Fig. 1. Library architecture: each application's Transaction sits on Cherry Garcia's Datastore abstractions for WAS, GCS and Tora, each speaking its datastore-specific REST API; a coordinating storage holds the transaction status record (TSR).]
2) Overview: In essence, the protocol calls for each data item to maintain the last committed and perhaps also the currently active version, for the data and relevant meta-data. Each version is tagged with meta-data pertaining to the transaction that created it. This includes the transaction commit time and transaction identifier that created it, pointing to a globally visible transaction status record (TSR) using a Universal Resource Identifier (URI). The TSR is used by the client to determine which version of the data item to use when reading it, and so that transaction commit can happen just by updating (in one step) the TSR.
Commit is thus a single TSR write in the coordinating store, in contrast to a classic 2PC coordinator.
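The read-side rule in the overview (accept a PREPARED version only if its transaction's TSR shows COMMITTED) can be sketched as a toy in-memory model; the class names and fields below are our guess at a minimal shape, not Cherry Garcia's actual classes:

```java
import java.util.*;

// Toy model of version selection via the transaction status record (TSR).
public class VersionRead {
    static class Version {
        final String value; final String txId;
        Version(String value, String txId) { this.value = value; this.txId = txId; }
    }

    static class Record {
        Version committed;  // last committed version
        Version prepared;   // currently active (PREPARED) version, if any
    }

    // Globally visible transaction statuses, keyed by transaction id.
    // In Cherry Garcia this lives in the coordinating data store, not memory.
    static final Map<String, String> tsr = new HashMap<>();

    // A reader consults the TSR: a PREPARED version whose transaction is
    // already COMMITTED is the real latest state; otherwise fall back to
    // the last committed version.
    static String read(Record r) {
        if (r.prepared != null && "COMMITTED".equals(tsr.get(r.prepared.txId)))
            return r.prepared.value;
        return r.committed == null ? null : r.committed.value;
    }
}
```

This is what lets commit be a single TSR update: readers everywhere flip to the new versions the moment the TSR appears, without touching the items again.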
10. The CG commit protocol (a 2PC variant): each data item keeps its current state and previous state plus a key hash; items are first written in the PREPARED state, and the transaction's COMMITTED status is recorded in a Transaction Status Record (TSR) in the Coordinating Data Store (CDS).
[Fig. 2. The timeline describing 3 transactions running on 2 client hosts to access records in 2 data stores using a third data store as a CDS: client C1 (transaction t1) reads v1 from WAS and GCS into its transaction cache, writes v2 locally, then on commit() PREPAREs v2 at both stores, COMMITs the TSR at the CDS, marks the versions COMMITTED, and DELETEs the TSR; C2 (t2) and a third transaction t3 proceed similarly.]
In the rest of this section we go deeper in detail on the components of the library and the algorithms.
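The timeline in Fig. 2 condenses into the following commit sketch, a toy in-memory model of our own (a single map stands in for the real stores, and names like CommitSketch are invented): install writes as PREPARED, make the transaction durable with one COMMITTED entry in the CDS, then lazily flip items and delete the TSR.

```java
import java.util.*;

// Toy model of the Cherry Garcia commit sequence (not the library's code).
public class CommitSketch {
    enum State { PREPARED, COMMITTED }

    static class Item { String value; State state; String txId; }

    final Map<String, Item> store = new HashMap<>(); // stands in for WAS/GCS/Tora
    final Map<String, String> cds = new HashMap<>(); // the TSR lives here

    // Phase 1: install each write as a PREPARED version at its store.
    void prepare(String txId, Map<String, String> writes) {
        for (Map.Entry<String, String> w : writes.entrySet()) {
            Item it = new Item();
            it.value = w.getValue();
            it.state = State.PREPARED;
            it.txId = txId;
            store.put(w.getKey(), it);
        }
    }

    // Commit point: a single write of the TSR into the coordinating store.
    void commit(String txId) { cds.put(txId, "COMMITTED"); }

    // Phase 2, which can run lazily or after a crash: any PREPARED item
    // whose TSR says COMMITTED is flipped, after which the TSR is deletable.
    void cleanup(String txId) {
        if (!"COMMITTED".equals(cds.get(txId))) return; // not committed: leave as-is
        for (Item it : store.values())
            if (it.state == State.PREPARED && txId.equals(it.txId))
                it.state = State.COMMITTED;
        cds.remove(txId);
    }
}
```

Because readers already treat "PREPARED + COMMITTED TSR" as committed, the phase-2 flipping is pure cleanup; crashing between commit and cleanup loses nothing.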
11. Implementation and evaluation: Cherry Garcia is written in Java (JDK 1.6), with Datastore abstractions for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a KVS built on WiredTiger).
[Fig. 6. Aborts measured varying theta with 1 YCSB+T client against a 1-node Tora cluster: aborts per million transactions of 1885.4, 1888.6, 1862.2, 1911.6, 5898.4 and 33810 for theta = 0.1, 0.3, 0.5, 0.7, 0.9 and 0.99.]
[Fig. 7. Throughput of 4 YCSB+T client hosts each with 1 through 64 threads against a 4-node Tora cluster.]
[Fig. 8. Throughput of YCSB+T with 16 through 128 threads on 8 client hosts against a 4-node Tora cluster.]
[Fig. 9 data: throughput vs. number of client threads for 1- and 2-record transactional and non-transactional runs, and for 3-record runs with serial vs. parallel phase 2.]
Throughput increased linearly until 16 threads and the
average latency for each request stayed within the 500 µs mark. As
the number of threads was increased beyond 16, the latency
increased until it reached 4.5ms at 64 threads. This increased
latency suggests that there is a performance bottleneck somewhere
in the system. We ran a further test with 4 client hosts and a
cluster of 4 Tora servers and repeated the previous test and varied
the number of threads from 1 through to 64 threads across all 4
client hosts and measured the throughput. The graph in Figure 7
shows that the performance on each host scales linearly until 16
threads (an aggregate of 64 threads across 4 client hosts) and then
flattens out. We observed that the socket send buffers on the servers
were full suggesting a network bottleneck at the client. G.
Experiment 4: abort rates vary with contention We setup one EC2
m3.2xlarge server each as a YCSB+T client and Tora server in AWS
and ran the client with 16 threads with a read to read-write ration
of 50:50 over 1 million transactions. We used the Zipan access key
pattern, and varied the theta value over 0.1, 0.3, 0.5, 0.7,0.9 and
0.99. Figure Fig 6 shows that the aborts increase as the contention
increases, though aborts are infrequent even with extreme
contention. H. Experiment 5: Scale-out test We ran YCSB+T with a
mix of 90:10 read to read-modify-write operations in a Zipfian data
access pattern with theta set to 0.99 across 1 to 8 client hosts
each with 16 threads, running against a 4-node Tora cluster. We
collected the throughput (Fig. 8).
[Fig. 9. Overhead of transactions and the effect of 1-phase optimization: throughput (transactions/second) vs. number of client threads (1 through 16), for 1-record and 2-record operations with (tx) and without (non-tx) transactions, and for 3-record transactions with serial vs. parallel phase 2.]
Note on Fig. 9: the 1-phase optimization commits with a single PREPARE round; 23,288 trans/sec was observed with it.