Transcript
1. ICDE 2015 Research 3: Distributed Storage and Processing (NTT) 2015.5.16
2. Papers:
1. PABIRS: A Data Access Middleware for Distributed File Systems. S. Wu (Zhejiang Univ.), G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett (UIUC)
2. Scalable Distributed Transactions across Heterogeneous Stores. A. Dey, A. Fekete, and U. Röhm (Univ. of Sydney)
3. Motivation of PABIRS: A Data Access Middleware for Distributed File Systems
[Fig. 1. Distribution of Call Frequency: call frequency per caller ID (1,000 IDs). Fig. 2. Number of Blocks per Key: number of data blocks per caller ID.]
Goal (from the paper): support efficient data retrieval for various query workloads. In the motivating telecom log, call frequency over caller IDs follows a power-law distribution (Fig. 1), with roughly 1% of the IDs being hot, while each ID's records are scattered across many blocks of the DFS (Fig. 2).
4. Approach: PABIRS = Bitmap signature + LSM index, built as a DFS wrapper. It serves GET on the DFS, provides a MapReduce InputFormat for map input, and acts as a KVS-style secondary index; the bitmap index covers all data blocks, while the LSM index covers hot values.
[Fig. 3. Architecture of PABIRS: the wrapper sits on the DFS and exposes InputFormat, Insert(key, value), and Lookup(key).]
[Fig. 4. Bitmap Example: each data block carries a block signature; a set bit for a UID means the block may contain records of that UID.]
From the paper: III. HYBRID
INDEXING SCHEME The general idea behind our hybrid indexing scheme
is to build bitmap signatures for all data blocks and select
certain hot keys for LSM index. Bitmap signature is created for
multiple attributes without re-ordering the records. To facilitate
efficient parallel search, we design a hierarchical model based on a
virtual Directed Acyclic Graph (DAG) structure, in which each
intermediate vertex is a summary of the signatures accessible on
its descendants. We present an example DAG structure in Figure 5 as
a virtual index structure.
From the LSM-Tree paper [O'Neil+, 96], on the rolling merge: the merge proceeds with
entries taken from the leaf level of the C0 tree, thus decreasing
the size of C0, and creates a newly merged leaf node of the C1
tree. The buffered multi-page block containing old C1 tree nodes
prior to merge is called the emptying block, and new leaf nodes
are written to a different buffered multi-page block called the
filling block. When this filling block has been packed full with
newly merged leaf nodes of C1, the block is written to a new free
area on disk. The new multi-page block containing merged results is
pictured in Figure 2.2 as lying on the right of the former nodes.
Subsequent merge steps bring together increasing index value
segments of the C0 and C1 components until the maximum values are
reached and the rolling merge starts again from the smallest
values.
[Figure 2.2. Conceptual picture of rolling merge steps, with result written back to disk: the C0 tree resides in memory, the C1 tree on disk.]
Newly
merged blocks are written to new disk positions, so that the old
blocks will not be over- written and will be available for recovery
in case of a crash. The parent directory nodes in C1, also buffered
in memory, are updated to reflect this new leaf structure, but
usually remain in buffer for longer periods to minimize I/O; the
old leaf nodes from the C1 component are invalidated after the
merge step is complete and are then deleted from the C1 directory.
In general, there will be leftover leaf-level entries for the
merged C1 component following each merge step, since a merge step
is unlikely to result in a new node just as the old leaf node
empties. The same consideration holds for multi-page blocks, since
in general when the filling block has filled with newly merged
nodes, there will be numerous nodes containing entries still in the emptying block.
Summary: the LSM-Tree [O'Neil+, 96] keeps a small in-memory component C0 (an AVL-tree) and a large on-disk component C1 (a B-tree); entries migrate from C0 into C1 through the rolling merge (figure taken from the LSM-Tree paper).
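The rolling merge above can be condensed into a toy model. This is our own sketch (names like LsmSketch and rollingMerge, and the thresholds, are invented here), not the paper's or O'Neil et al.'s implementation:

```java
import java.util.*;

// Toy sketch of an LSM-tree rolling merge: C0 is an in-memory sorted
// component, C1 a sorted "on-disk" run. A merge step drains the smallest
// C0 entries and rewrites only the affected low-key segment of C1 into
// new ("filling") leaves; leaves beyond the merge point stay untouched,
// which is what keeps the old state recoverable after a crash.
public class LsmSketch {
    final TreeMap<String, String> c0 = new TreeMap<>();                   // memory
    List<AbstractMap.SimpleEntry<String, String>> c1 = new ArrayList<>(); // "disk"

    public void insert(String key, String value) {
        c0.put(key, value);                  // inserts are memory-speed
        if (c0.size() > 4) rollingMerge(2);  // arbitrary toy thresholds
    }

    // Merge the `batch` smallest C0 entries with the matching C1 key range.
    void rollingMerge(int batch) {
        TreeMap<String, String> merged = new TreeMap<>();
        String hi = null;
        for (int n = 0; n < batch && !c0.isEmpty(); n++) {
            Map.Entry<String, String> e = c0.pollFirstEntry();
            hi = e.getKey();                 // keys come out in sorted order
            merged.put(e.getKey(), e.getValue());
        }
        int i = 0;
        List<AbstractMap.SimpleEntry<String, String>> filling = new ArrayList<>();
        while (i < c1.size() && c1.get(i).getKey().compareTo(hi) <= 0) {
            // old C1 leaves in the merged range; newer C0 versions win
            merged.putIfAbsent(c1.get(i).getKey(), c1.get(i).getValue());
            i++;
        }
        for (Map.Entry<String, String> e : merged.entrySet())
            filling.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
        filling.addAll(c1.subList(i, c1.size())); // leaves beyond the merge point
        c1 = filling;                             // "write" the new block
    }

    public String lookup(String key) {
        if (c0.containsKey(key)) return c0.get(key); // newest version first
        for (AbstractMap.SimpleEntry<String, String> e : c1)
            if (e.getKey().equals(key)) return e.getValue();
        return null;
    }
}
```

Inserts stay memory-speed until C0 exceeds its threshold, mirroring why PABIRS uses an LSM structure for its fast-insertion hot-key index.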
5. Key techniques:
1. Bitmap signature: low-level vertices are grouped under a high-level vertex according to the fanout F, and the signature DAG is searched in parallel in the BSP style of Pregel [Malewicz+, 10].
2. LSM index: hot keys are additionally maintained in an LSM index.
[Fig. 7. Search Cost of Bitmap and LSM: processing time (msec) vs. selectivity of call numbers (0.02%-0.2%), for bitmap and lsm.]
From the paper: "... where E is the fanout of the B-tree. We try to insert the key into LSM index, only when the estimated cost is no larger than the cost of bitmap index. Based on the inequality above, we are able to calculate the minimal selectivity, which makes LSM a more attractive selection than the bitmap. In Figure 7, we apply the theoretical ..."
[Fig. 8. Index Update: the Index Manager keeps data statistics; the incoming data stream is appended to the DFS, new data is inserted into the LSM index, and bitmap signatures are built by a MapReduce algorithm.]
From the paper: "C. Update on the Indices. PABIRS is specifically designed for the applications that require fast data insertion. In PABIRS, bitmap index is a lightweight index which can be built in a batch, while LSM index is intentionally designed to support the fast insertion."
Reading Fig. 7: the two cost curves cross around 0.1% selectivity; below it the LSM index is cheaper, above it the bitmap index wins.
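The cost-based rule quoted above can be illustrated with a toy model. The formulas below are illustrative stand-ins of our own, not the paper's exact cost expressions: the bitmap path pays a per-block signature check plus a read for each candidate block, while the LSM path pays a B-tree descent plus a read per matching tuple.

```java
// Toy cost model for the bitmap-vs-LSM decision in a PABIRS-like hybrid
// index. All constants and formulas are illustrative only.
public class IndexChoice {
    // Scan path: check every block signature, then read candidate blocks.
    static double bitmapCost(long blocks, double blockHitRate,
                             double readCost, double checkCost) {
        return blocks * checkCost + blocks * blockHitRate * readCost;
    }

    // LSM path: descend a B-tree of the given fanout, then read one block
    // per matching tuple (a pessimistic upper bound).
    static double lsmCost(long tuples, double selectivity, int fanout,
                          double readCost) {
        double height = Math.log(tuples) / Math.log(fanout);
        return height + selectivity * tuples * readCost;
    }

    // The paper's rule: put a key into the LSM index only when its
    // estimated lookup cost is no larger than the bitmap-guided scan.
    static boolean preferLsm(long tuples, long blocks, double selectivity,
                             double blockHitRate, int fanout) {
        double read = 1.0, check = 0.01;  // relative unit costs (assumed)
        return lsmCost(tuples, selectivity, fanout, read)
            <= bitmapCost(blocks, blockHitRate, read, check);
    }
}
```

With these toy constants the crossover behaves like Fig. 7: very selective (hot-key) lookups favor the LSM index, broader ones the bitmap-guided scan.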
From the paper (Section III, continued): Bitmap signature is created for multiple attributes without
re-ordering the records. To facilitate efficient parallel search, we
design a hierarchical model based on a virtual Directed Acyclic
Graph (DAG) structure, in which each intermediate vertex is a
summary of the signatures accessible on its descendants. We present
an example DAG structure in Figure 5 as a virtual index structure
on two different attributes, using 8 bits and 5 bits for these two
attributes respectively. Generally speaking, the DAG structure
consists of three layers. The retrieval layer contains individual
signatures cor- responding to the data blocks, while each
intermediate vertex in the index layer is associated with a summary
signature by merging signatures of its children vertices. Data
layer refers to the physical data blocks stored in the DFS.
Signatures and their corresponding graph vertices are randomly
distributed to multiple DFS nodes. In the rest of the paper, we
refer to the vertices in the retrieval layer as data vertices and the
vertices in the index layer as directory vertices. On the other
hand, LSM index replicates the records with hot keys and sorts them
in its B-trees. For each indexing attribute, we independently
create an LSM index to maintain its sorted replicas of hot data. In
the rest of the section, we first introduce the tuning approaches
used on the bitmap-based index, followed by the selection strategy
between these two indices. For better readability, the notations
used in the section are summarized in Table I.
A. Optimizations on Bitmap Signature
Suppose the signature of each data block follows the same distribution {p1, p2, ..., pk}, in which each pj indicates the probability of having 1 on the j-th bit. Because of the exclusiveness between the values, the signature is a sparse vector, i.e. Σj pj ≪ k. Given two signatures s1 and s2, the expected number of common 1s on the j-th bits of both s1 and s2 is Σj (pj)^2. It is much smaller than the expected number of 1s in either s1 or s2, i.e. Σj pj, unless there exists a pj dominating the distribution. When records are randomly assigned to the data blocks, each probability pj is supposed to be a small positive number. This leads to the phenomenon of Weak Locality in PABIRS.
[Table I. Notation: N: total number of records/tuples; Bp: size of a data block; Bt: size of a tuple; query selectivity; k: number of distinct values of the attributes; m: number of values mapped to the same bit; F: fanout of the directory vertex; W: number of virtual machines (workers); rl: computation cost of a directory vertex; rn: network delay between any pair of vertices; rd: the overhead of reading a data block; selectivity of a particular queried key; fmin: minimum frequency of any value in a domain; p(): pdf of distribution on selectivity of queries.]
[Fig. 5. Demonstration of Signature Graph]
It is thus not helpful to group similar signatures when building a high-level directory vertex in the
index layer, because such merging only generates a new signature
with a union of 1s from the signatures of its children vertices.
Although it is unlikely to optimize by better grouping, the fanout
of the abstract tree structure, i.e. the number of children
vertices for every directory vertex, remains tunable and turns out
to be crucial to the searching efficiency.
1) Cost Model and Fanout Optimization: Instead of picking up similar signatures during bitmap construction, PABIRS simply groups the low-level vertices to generate a high-level vertex, based on a pre-specified fanout parameter F.
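The block-signature-plus-directory idea can be sketched as follows. This is our simplification: one hash bit per value stands in for the paper's m-values-per-bit mapping, and the DAG is reduced to a single directory level of fanout F:

```java
import java.util.*;

// Sketch of PABIRS-style bitmap signatures with one level of OR-summary
// directory vertices (our simplification of the paper's DAG).
public class SignatureSketch {
    static final int BITS = 32;  // signature width (assumed)

    static int bitFor(String key) { return Math.floorMod(key.hashCode(), BITS); }

    // One signature per data block: bit j is set if some record's value in
    // the block hashes to j. Several values may share a bit, so a set bit
    // only means "maybe present".
    static long signature(List<String> block) {
        long sig = 0L;
        for (String key : block) sig |= 1L << bitFor(key);
        return sig;
    }

    // Directory vertex: the OR of its children. If the summary bit is
    // unset, none of the children can contain the key.
    static long summarize(long[] children) {
        long s = 0L;
        for (long c : children) s |= c;
        return s;
    }

    // Return indices of candidate blocks, pruning whole groups of `fanout`
    // blocks at once via their summary signature.
    static List<Integer> candidates(List<List<String>> blocks, int fanout, String key) {
        long probe = 1L << bitFor(key);
        List<Integer> out = new ArrayList<>();
        for (int g = 0; g < blocks.size(); g += fanout) {
            int end = Math.min(g + fanout, blocks.size());
            long[] sigs = new long[end - g];
            for (int i = g; i < end; i++) sigs[i - g] = signature(blocks.get(i));
            if ((summarize(sigs) & probe) == 0) continue; // prune the group
            for (int i = g; i < end; i++)
                if ((sigs[i - g] & probe) != 0) out.add(i);
        }
        return out;
    }
}
```

A hit is only a "maybe", so candidate blocks must still be read; but there are no false negatives, and a zero summary bit safely skips all F children at once, which is why the fanout F matters for search cost.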
6. Experiments: Hadoop 1.0.4 + GPS [Salihoglu+, 12] (an open-source Pregel implementation); a 32-node Hadoop cluster with 4 CPUs and 8 GB RAM per node.
A. Three select queries, against HBase Phoenix, Impala, and BIDS [Lu+, 13].
B. TPC-H with skewed data [Bruno+, 05] (queries Q3, Q5, Q6), against Hive.
[Fig. 13. Response time (seconds) of queries Q1-Q3 for HBase, PABIRS, BIDS, Impala-8G and Impala-4G. Fig. 14. Effect of Data Size: average response time of Q1-Q3 over 80-400 GB. Fig. 19. Throughput of Concurrent Queries (Q1) and Fig. 20. Response Time of Concurrent Queries (Q1), vs. query batch size 0-32, for single and quad processor threads. Fig. 21. Performance of TPC-H Query (Skew): response time vs. skew factor 1-4 for Q3/Q5/Q6 on Hive (h) and PABIRS (p). Fig. 22. Performance of TPC-H Query (Selectivity): response time vs. query selectivity from 3.5% (3 months) to 14.2% (12 months).]
From the paper: Impala buffers intermediate join results in memory, which leads to the Memory Limit Exceeded exception. For Q3 and Q6, we build an index for the column shipdate; when the selectivity increases to a larger value (e.g., 365), PABIRS finds that index-based access is even worse than scan-based access and will automatically switch to the disk scan, which generates better performance. To avoid queries with an empty result, we intentionally select a number with at least one record under the base station.
handle queries with a high selectiv- ity but still involving
numerous tuples. As shown in Table III, in our 160G dataset, we
have 40960 blocks in total. Although the selectivities of the
queries are as low as 0.00001%, the records related to Q1, Q2 and
Q3 cover 477, 28863 and 343 data blocks respectively. The involved
data blocks, especially for Q1 and Q3, are nowhere close to the total
number of data blocks, while the overhead of loading hundreds of
data blocks from the disks remains high. In experiments, PABIRS,
Phoenix and BIDS are allowed to use 4 GB main memory on each node
of the cluster, while Impala is tested under two settings with 4
GB and 8 GB main memory respectively. The results in Figure 13
show that Impala-4G is unable to finish the queries in a reasonable
time (i.e., within 1,000 seconds), as it incurs high I/O cost on
memory-disk data swaps. This reveals the limitation of Impala in memory usage
efficiency. Moreover, Impala and BIDS show similar performance for
all queries, because both approaches adopt the scan-based
techniques (memory scan and disk scan). In the rest of the
experiments, we only report the results of Impala-8G, denoted as
Impala for short. The results also imply that PABIRS
significantly outperforms the other systems on all queries. When the
selectivity of the query is high, such as Q1 and Q3, HBase Phoenix
is the only alternative with close performance to PABIRS, because
of its adoption of a secondary index. But for a query involving a
large portion of data like Q2, HBase Phoenix is slow as it incurs
many random I/Os to retrieve all results.
TABLE III. PROCESSING TIME OF PABIRS
QID  selectivity  index time  disk time  total time
Q1   1.2%         1.03s       1.47s      2.50s
Q2   70%          2.11s       137.63s    139.74s
Q3   0.8%         1.04s       1.28s      2.32s
To gain better insight into the scalability of PABIRS, we observe that the performances of PABIRS and HBase Phoenix degrade slightly
when more insertions are conducted, because they need to build and
query indexes for the new tuples. Finally, we implement a simple
transaction module as discussed in Section 2. Our test transaction
retrieves all records of a specific phone number (normally hundreds
to thousands of records) and updates the values of NeID in those
records to a new value. We vary the number of concurrent
transactions and, as shown in Figure 18, PABIRS provides good
throughput for this test transaction. In PABIRS, queries can be
grouped into batches and share the index searching process. In Figure
19 and Figure 20, we show the throughput and response time for
varied batch size. As each node in the cluster is equipped with a
4-core CPU, we start four concurrent I/O threads at the same time.
For comparison purposes, we also show the result when a single I/O
thread runs. The throughput of four I/O threads is almost three
times higher than the single thread case. The throughput improves
dramatically for a larger query batch, since we can share more
signature and data scans among the queries. However, the results
imply that the throughput gain shrinks with the increase of the
query batch size. It is thus important to choose an appropriate
batch size in real applications. The response time is also affected by
the batch size. Figure 20 illustrates that the response time is
generally proportional to the batch size. If a strict real-time
requirement is needed, it is important for the system to carefully
choose the batch size, in order to strike a balance between the throughput
and response time. C. Analytic Query Performance In this group of
experiments, we evaluate the performance of PABIRS on data and
queries generated by the TPC-H benchmark. Specifically, we generate 320
GB data with different skew factors using the TPC-H Skew Generator.
We deploy Hive on top of PABIRS and compare the performances of
PABIRS against the original Hive on query Q3, Q5 and Q6 in TPC-H.
We also include Impala in the experiment. However, Impala requires
buffering all intermediate join results in memory, which leads to the Memory Limit Exceeded exception (cf. Fig. 13).
Notes: Fig. 21 compares Hive (h) and PABIRS (p) under varying skew, and Fig. 22 under varying selectivity; Q5 involves the orders table.
7. Scalable Distributed Transactions across Heterogeneous Stores
Cherry Garcia (CG): a client-coordinated transaction library spanning Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a high-throughput KVS); evaluated with YCSB+T [Dey+, 14] (a web-transaction workload). Example transaction:
BEGIN TRANSACTION
  SET item1 of Store1
  SET item2 of Store2
COMMIT TRANSACTION
8. Listing 1: example use of the CG API. The example also uses a third store (explained later) that acts as the Coordinating Data Store (CDS).

public void UserTransaction() {
  Datastore cds = Datastore.create("credentials.xml");
  Datastore gds = Datastore.create("goog_creds.xml");
  Datastore wds = Datastore.create("msft_creds.xml");
  Transaction tx = new Transaction(cds);
  try {
    tx.start();
    Record saving = tx.read(gds, "saving");
    Record checking = tx.read(wds, "checking");
    int s = saving.get("amount");
    int c = checking.get("amount");
    saving.set("amount", s - 5);
    checking.set("amount", c + 5);
    tx.write(gds, "saving", saving);
    tx.write(wds, "checking", checking);
    tx.commit();
  } catch (Exception e) {
    tx.abort();
  }
}
Listing 1. Example code that uses the API to access two data stores.

Walkthrough: the record "saving" lives in a Google Cloud Storage Datastore (gds) and "checking" in a Windows Azure Storage Datastore (wds); the Coordinating Data Store (CDS) is itself just another Datastore, passed to the Transaction constructor.
9. Cherry Garcia (CG): assumes each store provides strongly consistent single-item reads and an atomic Test-and-Set write.
II. SYSTEM DESIGN
In this section, we describe the design of our client-coordinated transaction processing protocol that enables transactions involving multiple data items that span multiple heterogeneous data store instances. The protocol is implemented as a library whose API abstracts data store instances in a class called Datastore, and these are accessed via a transaction coordinator abstraction, a class called Transaction. Each record is addressable using a string key and its contents are accessed using an object of a class called Record. Listing 1 is an example of an application that uses the API to access two records, one (saving) residing in an instance of Google Cloud Storage, abstracted by the Datastore gds, while the other (checking) is stored in Windows Azure Storage represented as the Datastore wds. The example also uses a third store (explained later) that acts as the Coordinating Data Store (CDS).
[Fig. 1. Library architecture: each application's Transaction sits on Cherry Garcia's Datastore abstractions for WAS, GCS and Tora, each speaking its datastore-specific REST API; a coordinating storage holds the transaction status record (TSR).]
2) Overview: In essence, the protocol calls for each data item to maintain the last committed and perhaps also the currently active version, for the data and relevant meta-data. Each version is tagged with meta-data pertaining to the transaction that created it. This includes the transaction commit time and transaction identifier that created it, pointing to a globally visible transaction status record (TSR) using a Universal Resource Identifier (URI). The TSR is used by the client to determine which version of the data item to use when reading it, and so that transaction commit can happen just by updating (in one step) the TSR.
Commit is thus a single TSR write in the coordinating store, in contrast to a classic 2PC coordinator.
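The read-side rule in the overview (accept a PREPARED version only if its transaction's TSR shows COMMITTED) can be sketched as a toy in-memory model; the class names and fields below are our guess at a minimal shape, not Cherry Garcia's actual classes:

```java
import java.util.*;

// Toy model of version selection via the transaction status record (TSR).
public class VersionRead {
    static class Version {
        final String value; final String txId;
        Version(String value, String txId) { this.value = value; this.txId = txId; }
    }

    static class Record {
        Version committed;  // last committed version
        Version prepared;   // currently active (PREPARED) version, if any
    }

    // Globally visible transaction statuses, keyed by transaction id.
    // In Cherry Garcia this lives in the coordinating data store, not memory.
    static final Map<String, String> tsr = new HashMap<>();

    // A reader consults the TSR: a PREPARED version whose transaction is
    // already COMMITTED is the real latest state; otherwise fall back to
    // the last committed version.
    static String read(Record r) {
        if (r.prepared != null && "COMMITTED".equals(tsr.get(r.prepared.txId)))
            return r.prepared.value;
        return r.committed == null ? null : r.committed.value;
    }
}
```

This is what lets commit be a single TSR update: readers everywhere flip to the new versions the moment the TSR appears, without touching the items again.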
10. The CG commit protocol (a 2PC variant): each data item keeps its current state and previous state plus a key hash; items are first written in the PREPARED state, and the transaction's COMMITTED status is recorded in a Transaction Status Record (TSR) in the Coordinating Data Store (CDS).
[Fig. 2. The timeline describing 3 transactions running on 2 client hosts to access records in 2 data stores using a third data store as a CDS: client C1 (transaction t1) reads v1 from WAS and GCS into its transaction cache, writes v2 locally, then on commit() PREPAREs v2 at both stores, COMMITs the TSR at the CDS, marks the versions COMMITTED, and DELETEs the TSR; C2 (t2) and a third transaction t3 proceed similarly.]
In the rest of this section we go deeper in detail on the components of the library and the algorithms.
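The timeline in Fig. 2 condenses into the following commit sketch, a toy in-memory model of our own (a single map stands in for the real stores, and names like CommitSketch are invented): install writes as PREPARED, make the transaction durable with one COMMITTED entry in the CDS, then lazily flip items and delete the TSR.

```java
import java.util.*;

// Toy model of the Cherry Garcia commit sequence (not the library's code).
public class CommitSketch {
    enum State { PREPARED, COMMITTED }

    static class Item { String value; State state; String txId; }

    final Map<String, Item> store = new HashMap<>(); // stands in for WAS/GCS/Tora
    final Map<String, String> cds = new HashMap<>(); // the TSR lives here

    // Phase 1: install each write as a PREPARED version at its store.
    void prepare(String txId, Map<String, String> writes) {
        for (Map.Entry<String, String> w : writes.entrySet()) {
            Item it = new Item();
            it.value = w.getValue();
            it.state = State.PREPARED;
            it.txId = txId;
            store.put(w.getKey(), it);
        }
    }

    // Commit point: a single write of the TSR into the coordinating store.
    void commit(String txId) { cds.put(txId, "COMMITTED"); }

    // Phase 2, which can run lazily or after a crash: any PREPARED item
    // whose TSR says COMMITTED is flipped, after which the TSR is deletable.
    void cleanup(String txId) {
        if (!"COMMITTED".equals(cds.get(txId))) return; // not committed: leave as-is
        for (Item it : store.values())
            if (it.state == State.PREPARED && txId.equals(it.txId))
                it.state = State.COMMITTED;
        cds.remove(txId);
    }
}
```

Because readers already treat "PREPARED + COMMITTED TSR" as committed, the phase-2 flipping is pure cleanup; crashing between commit and cleanup loses nothing.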
11. Implementation and evaluation: Cherry Garcia is written in Java (JDK 1.6), with Datastore abstractions for Windows Azure Storage (WAS), Google Cloud Storage (GCS), and Tora (a KVS built on WiredTiger).
[Fig. 6. Aborts measured varying theta with 1 YCSB+T client against a 1-node Tora cluster: aborts per million transactions of 1885.4, 1888.6, 1862.2, 1911.6, 5898.4 and 33810 for theta = 0.1, 0.3, 0.5, 0.7, 0.9 and 0.99.]
[Fig. 7. Throughput of 4 YCSB+T client hosts each with 1 through 64 threads against a 4-node Tora cluster.]
[Fig. 8. Throughput of YCSB+T with 16 through 128 threads on 8 client hosts against a 4-node Tora cluster.]
[Fig. 9 data: throughput vs. number of client threads for 1- and 2-record transactional and non-transactional runs, and for 3-record runs with serial vs. parallel phase 2.]
Throughput increased linearly until 16 threads and the
average latency for each request stayed within the 500 µs mark. As
the number of threads was increased beyond 16, the latency
increased until it reached 4.5ms at 64 threads. This increased
latency suggests that there is a performance bottleneck somewhere
in the system. We ran a further test with 4 client hosts and a
cluster of 4 Tora servers and repeated the previous test and varied
the number of threads from 1 through to 64 threads across all 4
client hosts and measured the throughput. The graph in Figure 7
shows that the performance on each host scales linearly until 16
threads (an aggregate of 64 threads across 4 client hosts) and then
flattens out. We observed that the socket send buffers on the servers
were full suggesting a network bottleneck at the client. G.
Experiment 4: abort rates vary with contention We setup one EC2
m3.2xlarge server each as a YCSB+T client and Tora server in AWS
and ran the client with 16 threads with a read to read-write ration
of 50:50 over 1 million transactions. We used the Zipan access key
pattern, and varied the theta value over 0.1, 0.3, 0.5, 0.7,0.9 and
0.99. Figure Fig 6 shows that the aborts increase as the contention
increases, though aborts are infrequent even with extreme
contention. H. Experiment 5: Scale-out test We ran YCSB+T with a
mix of 90:10 read to read-modify-write operations in a Zipfian data
access pattern with theta set to 0.99 across 1 to 8 client hosts
each with 16 threads, running against a 4-node Tora cluster. We
collected the throughput (Fig. 8).
[Fig. 9. Overhead of transactions and the effect of 1-phase optimization: throughput (transactions/second) vs. number of client threads (1 through 16), for 1-record and 2-record operations with (tx) and without (non-tx) transactions, and for 3-record transactions with serial vs. parallel phase 2.]
Note on Fig. 9: the 1-phase optimization commits with a single PREPARE round; 23,288 trans/sec was observed with it.