
Efficient Iceberg Query Evaluation using Compressed Bitmap Index

    Bin He, Hui-I Hsiao, Ziyang Liu, Yu Huang and Yi Chen

Abstract: Decision support and knowledge discovery systems often compute aggregate values of interesting attributes by processing a huge amount of data in very large databases and/or warehouses. In particular, the iceberg query is a special type of aggregation query that computes aggregate values above a user-provided threshold. Usually only a small number of results will satisfy the threshold constraint, yet the results often carry very important and valuable business insights. Because of the small result set, iceberg queries offer many opportunities for deep query optimization. However, most existing iceberg query processing algorithms do not take advantage of the small-result-set property and rely heavily on the tuple-scan based approach. This incurs intensive disk accesses and computation, resulting in long processing time, especially when the data size is large. The bitmap index, which builds one bitmap vector for each attribute value, has been gaining popularity in both column-oriented and row-oriented databases in recent years. It occupies less space than the raw data and gives opportunities for more efficient query processing. In this paper, we exploit this property of the bitmap index and develop a very effective bitmap pruning strategy for processing iceberg queries. Our index-pruning based approach eliminates the need to scan and process the entire data set (table) and thus speeds up iceberg query processing significantly. Experiments show that our approach is much more efficient than existing algorithms commonly used in row-oriented and column-oriented databases.

Index Terms: iceberg query, bitmap index, column-oriented database

    1 INTRODUCTION

Business insight and knowledge discovery from operational data have been powerful weapons for gaining competitive advantages in the modern business world. To discover business insights, analysts often compute aggregate values over one or more attributes in large databases (warehouses). The iceberg query [9] is a special class of aggregation query, which computes aggregate values above a given threshold. It is of special interest to users, as high-frequency events or high aggregate values often carry more important information.

For example, for warehouse space planning and product promotion purposes, market analysts at Walmart may want to analyze the relationship between Product and State in its Sales database. In particular, the analysts may be interested in products that are very popular, say with more than 100K units sold in a state. This is a typical iceberg query: SELECT Product, State, COUNT(*) FROM Sales GROUP BY Product, State HAVING COUNT(*) >= 100000. That is, an aggregation is done on states and products with a COUNT function. Only (product, state) groups whose counts reach the 100K threshold are included in the result set.

B. He and H. Hsiao are with the IBM Almaden Research Center, San Jose, CA 95120. Email: [email protected], [email protected]

Z. Liu and Y. Chen are with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85287. Email: {ziyang.liu,yi}@asu.edu

The name iceberg query was coined by Fang et al. in [9]. The general form of an iceberg query on a relation R(C1, C2, ..., Cn) is:

SELECT Ci, Cj, ..., Cm, AGG(*) FROM R
GROUP BY Ci, Cj, ..., Cm
HAVING AGG(*) >= T

Ci, Cj, ..., Cm represent a subset of attributes in R and are referred to as aggregate attributes or grouping attributes; >= (greater than or equal to) is the comparison predicate. AGG represents an aggregation function. Besides the SUM function, iceberg queries can have other aggregation functions such as COUNT, MAX, MIN and AVERAGE. In this paper, we focus on iceberg queries with aggregation functions having the anti-monotone property [2], such as COUNT, SUM, MIN and MAX. We plan to study iceberg queries without the anti-monotone property in the future.
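For concreteness, the COUNT form of this query can be evaluated by the tuple-scan baseline that this paper argues against: aggregate every tuple, then apply the HAVING filter. The following Python sketch illustrates this; the toy relation and its names are ours, not from the paper.

```python
from collections import Counter

# Toy relation R(product, state); contents and names are illustrative only.
R = [
    ("p1", "CA"), ("p1", "CA"), ("p1", "CA"),
    ("p2", "CA"), ("p1", "NY"), ("p2", "NY"),
]

def iceberg_count(rows, threshold):
    """Tuple-scan baseline: aggregate every tuple, then apply HAVING."""
    counts = Counter(rows)                      # one full pass over all tuples
    return {g: c for g, c in counts.items() if c >= threshold}

print(iceberg_count(R, 3))   # {('p1', 'CA'): 3}
```

Note that the full scan and the complete aggregate are paid for even when, as here, only one group survives the threshold.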

With the threshold constraint, an iceberg query usually returns only a very small percentage of distinct groups as its output, which resembles the tip of an iceberg. Because of the small result set, iceberg queries can potentially be answered quickly even over a very large data set. However, current database systems and approaches do not fully take advantage of this feature of iceberg queries. Today's relational database systems (e.g., DB2, Oracle, SQL Server, Sybase, MySQL, PostgreSQL, and the column-oriented databases Vertica, MonetDB, LucidDB) all use general aggregation algorithms (according to [11], [25], [15] and our communications with people who developed these databases) to answer iceberg queries: they first aggregate all tuples and then evaluate the HAVING clause to select the iceberg result. For large data sets, multi-pass aggregation algorithms are used when the full aggregate result cannot fit in memory (even when the final iceberg result is small). Most existing query optimization techniques for processing iceberg queries [9], [4] can be categorized as the tuple-scan based approach, which requires at least one table scan to read data from disk. They focus on reducing the number of passes when the data size is large. None has effectively leveraged the property of iceberg queries for efficient processing. Such a tuple-scan based scheme often takes a long time to answer iceberg queries, especially when the table is very large. Besides these tuple-scan based approaches, [10] designed a two-level bitmap index which can be leveraged for processing iceberg queries. However, it suffers from the massive empty bitwise-AND results problem, which will be illustrated in Section 5.

Digital Object Identifier 10.1109/TKDE.2011.73 1041-4347/11/$26.00 2011 IEEE

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

In this paper we aim at answering iceberg queries

efficiently using bitmap indices. Specifically, we developed an index-pruning based approach to compute iceberg queries using bitmap indices. Bitmap indices [17] provide a vertical organization of a column using bitmap vectors. Each vector represents the occurrences of a unique value in the column across all rows in the table. The state-of-the-art developments in bitmap compression methods [24], [3], [8] and encoding strategies [19] have further broadened the applicability of bitmap indices. Today's bitmap indices can be applied to all types of attributes (e.g., high-cardinality categorical attributes [23], numeric attributes [23], [19] and text attributes [20]). Studies have shown that compressed bitmap indices occupy less space than the raw data [24], [3] and provide better query performance for equality queries [24], range queries [23], and keyword queries [20]. Nowadays, the bitmap index is supported in many commercial database systems (e.g., Oracle, Sybase, Informix), and is often the default (or only) index option in column-oriented database systems (e.g., Vertica, C-Store [21], LucidDB).

While widely used in warehouse applications, bitmap indices have not been effectively leveraged in answering iceberg queries. Three characteristics of bitmap indices catch our attention for using them to answer iceberg queries. First, bitmap indices can avoid massive disk access on tuples: using bitmap indices, we only need to access the bitmap indices of the aggregate attributes (i.e., the attributes in the GROUP BY clause). Second, bitmap indices operate on bits rather than real tuple values; bitwise operations are very fast to execute and can often be accelerated by hardware [24]. Third, bitmap indices have the advantage of leveraging the anti-monotone property of iceberg queries to enable aggressive index pruning strategies. Iceberg queries have an intriguing anti-monotone property for many aggregation functions and predicates. For example, if the count of a group is below T, the count of any super-group of it must be below T. Each iceberg result can be produced by doing a bitwise-AND between the bitmap vectors representing each value in a group and counting the number of 1 bits in the resulting bitmap vector.
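As a minimal sketch of that last point, with bitmap vectors held as Python ints (most significant bit = row 1), a group's COUNT is just an AND followed by a population count. The two 12-row vectors below are hypothetical.

```python
# Hypothetical 12-row bitmap vectors for one value of each grouping attribute.
x = int("010110001000", 2)        # rows containing value X of attribute A
y = int("010010101000", 2)        # rows containing value Y of attribute B

z = x & y                         # bitwise-AND: rows containing both values
count = bin(z).count("1")         # number of 1 bits = COUNT(*) of group (X, Y)
print(count)                      # 3
```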

Observing these attractive characteristics, we see the opportunity of computing iceberg queries efficiently using a compressed bitmap index. A naive way of computing an iceberg query using bitmap indices is to do pair-wise bitwise-AND operations between the bitmap vectors of all aggregate attributes. This is very inefficient because the product of the numbers of bitmap vectors in all aggregate attributes is large, and a large portion of these operations is not necessary. Leveraging the anti-monotone property of iceberg queries, we developed the dynamic pruning algorithm. However, we also notice another challenge in the dynamic index-pruning based approach: the problem of massive empty bitwise-AND results. When the number of unique values in an attribute is large, a large number of bitwise-AND operations produce empty results and their computation time dominates the query processing time. Example 5.1 in Section 5 shows the severity of this problem. The approach in [10] also suffers from this problem.

To overcome this challenge, we developed an efficient vector alignment algorithm. The major challenge in developing such an algorithm is to effectively detect whether a bitwise-AND will generate an empty result before doing the AND operation. This sounds like a dilemma at first, but after careful research, we found that such a solution is indeed possible. The vector alignment algorithm guarantees that no bitwise-AND operation will generate an empty result. Section 5 discusses the vector alignment algorithm in detail.

Our contributions in this paper are summarized as follows:

1) We developed the dynamic pruning and vector alignment algorithms to efficiently compute iceberg queries using compressed bitmap indices. Our approach can be applied to both row-oriented and column-oriented databases, as long as bitmap indices for the aggregate attributes are available.

2) We performed comprehensive experiments to evaluate our approach by comparing it with the state-of-the-art iceberg query processing algorithms and a tuple-scan based algorithm. Experiments show that our algorithm achieves remarkable performance improvement for iceberg query computation.

The remaining sections of this paper are structured as follows. We discuss related work in Section 2. The necessary background on the bitmap index and its compression is introduced in Section 3. Section 4 describes the dynamic index-pruning algorithm, and Section 5 discusses the vector alignment algorithm. Sections 4 and 5 only discuss how to handle two aggregate attributes; we generalize these algorithms in Section 6. Section 7 analyzes the experimental results. Section 8 concludes the paper.

2 RELATED WORK

Iceberg Query and Iceberg Cube. Processing of iceberg queries was first defined and studied by Fang et al. in 1998 [9], which proposed the Hybrid and Multi-Buckets algorithms by extending the probabilistic techniques


proposed in [22]. A sampling/bucketing method is used to predict valid groups, with possible false positives and false negatives. Efficient strategies are then designed to correct the false positives and false negatives and retrieve the exact result. [4] designed a partitioning algorithm for computing a specific type of iceberg query: computing the average of aggregate values. All these techniques are tuple-scan based, which require at least one scan of each tuple in the relation. None of them leverages bitmap indices for query optimization, which is the focus of this paper.

In data warehouses, [6], [1], [12], [10] conducted studies on computing the iceberg cube, which computes and materializes the cells of a data cube satisfying a specified condition. These works focus on selecting a proper order of computing aggregations over all combinations of aggregate attributes, to maximize sharing of the computation.

Answering iceberg queries and computing iceberg cubes have different optimization goals. The focus of answering iceberg queries is to speed up the processing time of a single iceberg query. The focus of computing iceberg cubes, such as that of [10], is to maximize the shared computation to shorten the cube generation time. Developing an efficient iceberg query answering algorithm is necessary, and such algorithms can be leveraged to generate iceberg cubes more efficiently.

In [16], a comparison of the algorithms in [9] and other tuple-scan based algorithms was conducted. As indicated in [16], the algorithm proposed in [9] performs better, especially when the data is highly skewed.

Bitmap Indices. Bitmap indices are known to be efficient, especially for read-mostly or append-only data, and are commonly used in data warehousing applications and column stores. Model 204 [17] was the first commercial product making extensive use of the bitmap index. Early bitmap indices were used to implement inverted files [14]. In data warehouse applications, bitmap indices are shown to perform better than tree-based index schemes, such as the variants of the B-tree or R-tree [13], [17], [18]. Compressed bitmap indices are widely used in column-oriented databases, such as C-Store [21], and contribute to the performance gain of column databases over row-oriented databases.

Various compression schemes for the bitmap index have been developed. Word-Aligned Hybrid (WAH) [24] and Byte-aligned Bitmap Code (BBC) [3] are two important compression schemes that can be applied to any column and used in query processing without decompression. The development of bitmap compression methods [24], [3] and encoding strategies [19] further broadens the applicability of the bitmap index. Nowadays it can be applied to all types of attributes (e.g., high-cardinality categorical attributes [23], numeric attributes [23], [19], and text attributes [20]), and it is very efficient for OLAP and warehouse query processing [24], [3]. However, the bitmap index has not been effectively leveraged in existing works to process iceberg queries. In this paper, we develop novel iceberg query processing algorithms using bitmap indices, which are shown to be highly effective.

Fig. 1: An Example of Bitmap Index

3 PRELIMINARIES: BITMAP INDEX AND ITS COMPRESSION

Bitmap indices are commonly used in databases nowadays, especially for data warehousing applications and in column stores. A bitmap for an attribute (column) of a table can be viewed as a v × r matrix, where v is the number of distinct values of the column and r is the number of tuples (rows) in the table. Each value in the column corresponds to a bitmap vector of length r, in which the kth position of the vector is 1 if this value appears in the kth row, and 0 otherwise.

In this paper we use equality-encoded bitmaps [7]. An example of a bitmap index is shown in Figure 1. The left part of Figure 1 shows an example relation with a set of attributes. The right part of Figure 1 shows the corresponding bitmap indices on attributes A and B of the table. For each distinct value of A and B, there is a corresponding bitmap vector. For instance, A1's bitmap vector is 010010001000, because A1 occurs in the 2nd, 5th, and 9th rows of the table.
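The equality encoding described above can be sketched in a few lines of Python. The column below is reconstructed to be consistent with the A vectors of Figure 1 (A1 in rows 2, 5 and 9).

```python
def equality_bitmaps(column):
    """Build one bit-string vector per distinct value (equality encoding)."""
    r = len(column)
    vectors = {}
    for k, value in enumerate(column):          # k is the 0-based row index
        vec = vectors.setdefault(value, ["0"] * r)
        vec[k] = "1"
    return {v: "".join(bits) for v, bits in vectors.items()}

# A column consistent with Figure 1: A1 occurs in rows 2, 5 and 9.
col_A = ["A2", "A1", "A2", "A2", "A1", "A2",
         "A2", "A2", "A1", "A2", "A3", "A3"]
print(equality_bitmaps(col_A)["A1"])   # 010010001000
```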

An uncompressed bitmap can be much larger than the original data, so compression is typically utilized to reduce the storage size and improve performance. As reported in [24], with proper compression, bitmaps perform well for a column with cardinality up to 55% of the number of rows, that is, up to 55% of rows having distinct values in the column. We adopt Word-Aligned Hybrid (WAH) [24] to compress the bitmaps in our implementation.
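To give a feel for the WAH idea (without reproducing its exact word format), here is a simplified run-length sketch: 31-bit groups that are entirely 0 or entirely 1 collapse into fill words, and everything else stays a literal word. The word layout here is an illustrative assumption, not the actual WAH encoding from [24].

```python
def wah_encode(bits):
    """Simplified WAH-style encoding (a sketch, not the on-disk format):
    split the bit string into 31-bit groups; runs of all-0 or all-1 full
    groups become ('fill', bit, run) words, the rest stay ('lit', chunk)."""
    groups = [bits[i:i + 31] for i in range(0, len(bits), 31)]
    words = []
    for g in groups:
        if len(g) == 31 and len(set(g)) == 1:       # homogeneous full group
            if words and words[-1][0] == "fill" and words[-1][1] == g[0]:
                words[-1] = ("fill", g[0], words[-1][2] + 1)
            else:
                words.append(("fill", g[0], 1))
        else:
            words.append(("lit", g))
    return words

def wah_decode(words):
    """Expand fill and literal words back into the original bit string."""
    return "".join(w[1] * (31 * w[2]) if w[0] == "fill" else w[1]
                   for w in words)
```

For a sparse vector of 62 zeros, one 1 bit, and 30 zeros, the first two groups collapse into a single fill word and only the last group is stored literally.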

    4 DYNAMIC PRUNING

We will illustrate our dynamic pruning algorithm using an iceberg query having two aggregate attributes with the COUNT function as the running example. We will show how the algorithm can be adjusted to support other aggregation functions and an arbitrary number of aggregate attributes in Section 6.

Suppose the iceberg query that we need to answer is the one in Figure 2. The data table and bitmap indices are those in Figure 1. The naive way to process this iceberg query on the two attributes A and B using bitmap indices is to conduct pair-wise bitwise-AND operations between each vector of A and each vector of B. For example, if A and B have I and J distinct values respectively, I × J bitwise-AND operations will be needed to produce the iceberg results.

SELECT A, B, COUNT(*) FROM R GROUP BY A, B HAVING COUNT(*) > 2

    Fig. 2: An Iceberg Query with COUNT Function

Example 4.1: In table R, column A has 3 distinct values A1, A2, A3, and column B has 3 distinct values B1, B2, B3. The bitmap indices are those on the right of Figure 1. To process the iceberg query in Figure 2, the naive approach will conduct bitwise-AND operations between 9 pairs: (A1,B1), (A1,B2), (A1,B3), (A2,B1), (A2,B2), (A2,B3), (A3,B1), (A3,B2) and (A3,B3). After each bitwise-AND operation, the number of 1 bits in the resulting bitmap vector is counted. If the number of 1 bits is larger than the threshold (2 in this example), the group is added into the iceberg result set.

Obviously, this is not very efficient, because the product of the numbers of unique values in all aggregate attributes can be large, especially when the number of distinct values in each column is large. The situation gets worse when multiple aggregate attributes are specified. As described in Section 1, iceberg queries have an intriguing anti-monotone property [2], which can be leveraged to reduce the number of bitwise-AND operations. With bitmap indices, it is easy to calculate the total occurrences of a single value (using its bitmap vector) without accessing other data. As a result, we can leverage the anti-monotone property to quickly prune bitmap vectors that will not produce valid iceberg results.

First, we introduce a new bitwise-AND operation, which carries out the following three actions in one bitwise-AND operation between vectors X and Y: Z = X AND Y, X = X XOR Z, and Y = Y XOR Z. That is, besides generating the resulting vector Z of the bitwise-AND operation, the operation also sets each 1 bit in the original vectors to 0 if the corresponding bit in the resulting vector is 1. In our algorithm, when we mention the bitwise-AND operation, we refer to this new semantics.

Example 4.2: Consider the bitmap vectors A2 = 101101110100 and B1 = 001001010011 of our running example in Figure 1. When a bitwise-AND is conducted between them, the resulting vector is 001001010000. Also, A2 becomes 100100100100 and B1 becomes 000000000011.
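Example 4.2 can be checked directly with the XOR-update semantics defined above, using Python ints for the vectors:

```python
def and_update(x, y):
    """The paper's destructive bitwise-AND:
    z = x AND y, then x = x XOR z and y = y XOR z."""
    z = x & y
    return z, x ^ z, y ^ z

a2 = int("101101110100", 2)
b1 = int("001001010011", 2)
z, a2, b1 = and_update(a2, b1)
print(format(z, "012b"))    # 001001010000
print(format(a2, "012b"))   # 100100100100
print(format(b1, "012b"))   # 000000000011
```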

After each bitwise-AND operation, the dynamic pruning strategy adds an extra pruning step of monitoring the number of remaining 1s in both bitmap vectors involved. If the number of 1 bits of a modified vector becomes smaller than the iceberg threshold, this vector can be pruned; that is, no further AND operation is necessary for this vector. With dynamic pruning, the number of AND operations can be reduced effectively, since the iceberg threshold is usually large. Although each AND operation is now more expensive, the reduction in the number of AND operations usually far outweighs the increased cost of the AND operations.

Example 4.3: Continuing our running example, suppose bitwise-AND operations are first conducted between A1 and all values in B. (A1,B1) and (A1,B2) produce no result. After the bitwise-AND operation between A1 and B3 is done, the number of 1s left in B3 falls below the threshold in the query. Thus, B3 can be pruned. Then, when we process A2, we only need to conduct bitwise-AND operations on (A2,B1) and (A2,B2) (the pair (A2,B3) is pruned). Further, A3 can be directly pruned because it only contains two 1 bits; no bitwise-AND operations are needed between A3 and B. The number of operations is reduced from 9 to 5.

A prior approach for computing iceberg cubes [10] also has a vector pruning strategy. However, the focus of [10] is to compute iceberg cubes, which, as we point out in Section 2, have different optimization goals from our dynamic pruning strategy. Thus, the pruning method used in [10], designed for cube computation, is not useful for answering iceberg queries. Specifically, [10] uses a second-level bitmap to record the co-occurrence information of values in every pair of columns: for two attributes A, B, for each value ai of A and each value bj of B, if they co-occur in the same tuple, then cooccur(ai, bj) = true. When computing the bitmap cube, after each AND operation, if group (a, b) does not satisfy the iceberg threshold, [10] will update the second-level bitmap by setting cooccur(a, b) = false. This is not useful for answering iceberg queries because group (a, b) has already been examined.

    5 VECTOR ALIGNMENT

    5.1 Algorithm Description

The dynamic pruning strategy works fine for attributes with a relatively small number of unique values. However, its performance degrades severely due to the empty bitwise-AND results problem. With the dynamic index pruning strategy alone, many of the bitwise-AND operations produce empty results; that is, the resulting bitmap vector contains no bits with value 1. Such bitwise-AND operations are fruitless in two respects: first, they do not produce a valid iceberg result; second, they do not reduce the number of 1 bits in the original vectors for index pruning purposes. The number of fruitless bitwise-AND operations can be so large as to dominate the query processing time. The following example shows the severity of this problem, and we have also demonstrated its impact in our experiments.

Example 5.1: Suppose a table has 1,000,000 tuples, its attribute A has 10,000 unique values, and its attribute B has 10,000 unique values. A and B will have 10,000 bitmap vectors each. In the worst case, the total number of pair-wise bitwise-AND operations is 10,000 × 10,000 = 100,000,000, which is 100 times larger than the number of tuples. Since the number of distinct groups is bounded by the number of tuples n in the relation, we need at most n bitwise-AND operations to answer an iceberg query. In this example, more than 99% of the bitwise-AND operations are useless.

To overcome the empty bitwise-AND results problem, we developed the vector alignment algorithm. For the dynamic pruning algorithm, the worst-case bound on the number of bitwise-AND operations equals the product of the numbers of distinct values of all aggregate attributes, which can be much larger than the number of tuples. To ensure the index-pruning based approach achieves good performance in practice, it is thus critical to develop a smart strategy that avoids unnecessary AND operations and has a much better worst-case bound. As a novel attempt, we proposed the vector alignment algorithm, which guarantees a worst-case bound equal to the number of distinct groups in the query, which is usually much smaller than the number of tuples in the relation. The key idea is to use two priority queues to sort bitmap vectors by the position of their first 1 bit (defined below), through which we can effectively find two bitmap vectors to conduct a bitwise-AND operation and ensure the resulting bitmap vector will not be empty. Before illustrating our algorithm in detail, we first introduce a few definitions.

Definition 5.1: First 1-bit position: the position of the first 1 bit in a bitmap vector.

Since our AND operation updates the original vectors, the first 1-bit position of a vector may change after an AND operation.

Definition 5.2: Vector alignment: two bitmap vectors are aligned if their first 1-bit positions are the same.

Our observation is that if two vectors are aligned, their bitwise-AND result will not be empty, because they have at least one overlapping position. Now we illustrate how to avoid the massive useless AND operations with vector alignment through a concrete example. Algorithm details will be illustrated following the example. To begin with, for each aggregate attribute, we build a priority queue of its bitmap vectors prioritized by their first 1-bit positions.
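With vectors as Python ints (most significant bit = row 1), the first 1-bit position used as the priority key can be computed from int.bit_length(); the helper name is ours.

```python
def first_one_position(vec: int, nbits: int):
    """1-based position, counted from the leftmost bit, of the first 1 bit."""
    return None if vec == 0 else nbits - vec.bit_length() + 1

a2 = int("101101110100", 2)
a1 = int("010010001000", 2)
print(first_one_position(a2, 12), first_one_position(a1, 12))   # 1 2
```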

Example 5.2: Consider our running example in Figure 1 and the query in Figure 2. Figure 3 shows the priority queues for attributes A and B. It is not necessary to put the vector A3 in A's priority queue, because A3 only contains two 1 bits and can be pruned directly.

PriorityQueue1            PriorityQueue2
A2  101101110100          B2  100100100100
A1  010010001000          B3  010010001000
                          B1  001001010011
(A3 000000000011 is excluded: the number of 1s in A3 is not larger than 2)

Fig. 3: Bitmap vectors in priority queues

Then, we choose the top bitmap vector from each priority queue and check whether they can be aligned. If they are, the resulting bitmap vector of the bitwise-AND operation between these two vectors will not be empty. We thus conduct a bitwise-AND operation and apply the dynamic pruning strategy.

Example 5.3: Continuing the example, the top vector in each queue is selected (A2 and B2). Since they can be aligned (the first 1-bit position of both vectors is 1), we conduct a bitwise-AND operation between them. The resulting bitmap vector satisfies the iceberg threshold, so the group (A2, B2) is added into the iceberg result. The vectors of A2 and B2 are also updated. A vector whose remaining number of 1s is less than the threshold is pruned; thus, B2 is pruned after the bitwise-AND. Otherwise, we find the first 1-bit position of the updated vector and reinsert it into the priority queue. In this example, A2 is reinserted with its first 1-bit position updated to 3. Figure 4 shows the updated priority queues.

PriorityQueue1            PriorityQueue2
A1  010010001000          B3  010010001000
A2  001001010000          B1  001001010011
(B2 is removed)

Fig. 4: Bitmap vectors after first vector alignment

Next, we repeat the above process until at least one queue is empty.

Example 5.4: Continuing our example, we apply the same process on the vector pairs (A1, B3) and (A2, B1). After that, both queues are empty and we can stop. With vector alignment, we reduce the number of AND operations to only 3.
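Putting the pieces together, a heap-based sketch of the alignment loop, consistent with Examples 5.2 to 5.4, might look like this; the names and structure are ours, not the paper's exact pseudocode. The realignment branch for non-aligned tops (where earlier 1 bits are cleared) is included for completeness, although it never fires on this small data set.

```python
import heapq

def iceberg_pq(A, B, T, nbits):
    """Vector-alignment sketch for COUNT(*) >= T; A and B map value
    names to int bitmap vectors. Returns (results, AND operations)."""
    popcount = lambda v: bin(v).count("1")
    first1 = lambda v: nbits - v.bit_length() + 1   # 1-based, from the MSB

    def build(vectors):
        pq = []
        for name, vec in vectors.items():
            if popcount(vec) >= T:                  # anti-monotone pre-pruning
                heapq.heappush(pq, (first1(vec), name, vec))
        return pq

    def realign(pq, name, vec, other_pos):
        # Clear 1 bits before the other queue's first 1-bit position;
        # they cannot align with any remaining vector over there.
        vec &= (1 << (nbits - other_pos + 1)) - 1
        if vec and popcount(vec) >= T:
            heapq.heappush(pq, (first1(vec), name, vec))

    pqa, pqb = build(A), build(B)
    results, ops = {}, 0
    while pqa and pqb:
        pa, na, va = pqa[0]
        pb, nb, vb = pqb[0]
        if pa != pb:                                # not aligned: skip bits
            if pa < pb:
                heapq.heappop(pqa)
                realign(pqa, na, va, pb)
            else:
                heapq.heappop(pqb)
                realign(pqb, nb, vb, pa)
            continue
        heapq.heappop(pqa)
        heapq.heappop(pqb)
        ops += 1
        z = va & vb                                 # guaranteed non-empty
        va ^= z
        vb ^= z
        if popcount(z) >= T:
            results[(na, nb)] = popcount(z)
        if popcount(va) >= T:
            heapq.heappush(pqa, (first1(va), na, va))
        if popcount(vb) >= T:
            heapq.heappush(pqb, (first1(vb), nb, vb))
    return results, ops

A = {"A1": int("010010001000", 2), "A2": int("101101110100", 2),
     "A3": int("000000000011", 2)}
B = {"B1": int("001001010011", 2), "B2": int("100100100100", 2),
     "B3": int("010010001000", 2)}
res, ops = iceberg_pq(A, B, 3, 12)
print(ops)   # 3 AND operations, as in Example 5.4
```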

There are cases when the two top bitmap vectors are not aligned, because one of the two bitmap vectors might have been pruned already. In those cases, we select the vector with the smaller first 1-bit position and reset all 1 bits at positions smaller than the first 1-bit position of the other bitmap vector. We can safely remove (reset) these bits and recompute the first 1-bit position of the selected vector, because they will not have corresponding matching bits in the remaining vectors of the other queue.

Example 5.5: For example, suppose that the two vectors V1 = 101001000000 from C1 and V2 = 000100101010 from C2 are the top vectors in the queues, and the threshold is 2. The first 1-bit position of V1 is 1. The first 1-bit position of V2 is 4. Thus, V1 and V2 are not aligned. We can safely reset bits 1 and 3 of V1 and recompute V1's first 1-bit position as 6. Notice that the 1 bits at positions 1 and 3 of V1 can be reset because they are at positions smaller than 4 and thus will not have an aligned vector in C2, since the queues are prioritized by the first 1-bit position.

By skipping the removable bits, we can be more efficient in the queue management and explore more pruning opportunities. For this example, after we know the first 1-bit position of V1 is 6, we realize the count of 1s in V1 is less than the threshold and thus it can be pruned.

Algorithm 1 Iceberg Processing with Vector Alignment and Dynamic Pruning

icebergPQ (attribute A, attribute B, threshold T)
Output: iceberg results
 1: PQA.clear, PQB.clear
 2: for each vector a of attribute A do
 3:   a.count = BIT1_COUNT(a)
 4:   if a.count >= T then
 5:     a.next1 = first1BitPosition(a, 0)
 6:     PQA.push(a)
 7: for each vector b of attribute B do
 8:   b.count = BIT1_COUNT(b)
 9:   if b.count >= T then
10:     b.next1 = first1BitPosition(b, 0)
11:     PQB.push(b)
12: R = ∅
13: a, b = nextAlignedVectors(PQA, PQB, T)
14: while a != null and b != null do
15:   PQA.pop
16:   PQB.pop
17:   r = BITWISE_AND(a, b)
18:   if r.count >= T then
19:     add iceberg result (a.value, b.value, r.count) into R
20:   a.count = a.count - r.count
21:   b.count = b.count - r.count
22:   if a.count >= T then
23:     a.next1 = first1BitPosition(a, a.next1 + 1)
24:     if a.next1 != null then
25:       PQA.push(a)
26:   if b.count >= T then
27:     b.next1 = first1BitPosition(b, b.next1 + 1)
28:     if b.next1 != null then
29:       PQB.push(b)
30:   a, b = nextAlignedVectors(PQA, PQB, T)
31: return R

Through the examples, we can see that vector alignment completely removes the need to conduct fruitless bitwise-AND operations. When the number of potential empty bitwise-AND operations is large, the performance gain is significant.

Algorithm icebergPQ (Algorithm 1) summarizes the procedure discussed in the above examples. It has two phases. In the first phase, we prioritize the bitmap vectors of each attribute by their first 1-bit positions (lines 1-11). We will explain the subroutines used here in more detail later. The function first1BitPosition finds the position of the first 1 bit, starting from a given position in the bitmap vector; the initial starting position is 0. Algorithm 2 shows the details of the first1BitPosition function. The function BIT1_COUNT is used to count the number of 1s in a vector efficiently [5].

The second phase of algorithm icebergPQ (lines 12-30) combines the vector alignment and dynamic pruning strategies. The nextAlignedVectors function is used to find two aligned vectors in the priority queues. It retrieves the top bitmap vectors from both priority queues, and finds the vectors which can be aligned by the position of the first 1 bit.

Algorithm 2 First 1-bit position

first1BitPosition (bitmap vector vec, start position pos)
Output: the position of the first 1 bit in vec, starting from position pos
 1: len = 0
 2: for each word w in vector vec do
 3:   if w is a literal word then
 4:     if len <= pos then
 5:       for p = pos to len + 30 do
 6:         if position p is 1 then
 7:           return p
 8:     else if len > pos then
 9:       for p = len to len + 30 do
10:         if position p is 1 then
11:           return p
12:     len += 31
13:   else if w is a 0-fill word then
14:     fillLength = length of this fill word
15:     len += fillLength * 31
16:   else
17:     fillLength = length of this fill word
18:     len += fillLength * 31
19:     if len > pos then
20:       return pos
21: return null

Now we describe the details of the first1BitPosition function. Given a starting position, it efficiently finds the first 1-bit in the bitmap vector whose position is at least the given position value. This function is critical for updating the priority of a bitmap vector. It also helps to calculate the number of 1-bits removable from the bitmap vector. The pseudo code works as follows. Given a starting position pos, the length of the bits covered by previously seen words is denoted as len (line 1). For each word in the bitmap vector, the function checks the type of the word (lines 3, 13 and 16). If it is a literal word, and pos is in the range [len, len + 30] (line 4), then the first 1-bit position might be in the range [pos, len + 30] (lines 5-7). Otherwise, if len > pos, the first 1-bit is searched in the range [len, len + 30] (lines 8-11). If the word is a 0-fill word, no 1-bit can be found in this word, so we just increase len by the number of bits covered by this 0-fill word (lines 13-15). Otherwise, the word is a 1-fill word covering fillLen bits; if pos falls in the range [len, len + fillLen), pos itself must be the first 1-bit position.

Algorithm 3 shows the details of the nextAlignedVectors function. Each time, the function pops out a vector from each of the priority queues (lines 2-3). If these two vectors can be aligned, they are returned for conducting a bitwise-AND operation (lines 4-5). Otherwise, we select the vector which has the smaller first 1-bit position (lines 6 and 12), and use the first1BitPositionWithSkip function to find the first 1-bit position, as well as the number of removable 1-bits (lines 7-8, 13-14). The removable 1-bits are those described in Example 5.5. The function first1BitPositionWithSkip is implemented by slightly revising the function first1BitPosition in Algorithm 2. It returns how many 1-bits have been skipped during the

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


    Algorithm 3 Find Next Aligned Vectors

nextAlignedVectors (priority queue PQA, priority queue PQB, threshold T)
Output: two aligned vectors a ∈ PQA, b ∈ PQB

 1: while PQA is not empty and PQB is not empty do
 2:   a = PQA.top
 3:   b = PQB.top
 4:   if a.next1 = b.next1 then
 5:     return a, b
 6:   if a.next1 > b.next1 then
 7:     PQB.pop
 8:     b.next1, skip = first1BitPositionWithSkip(b, a.next1)
 9:     b.count = b.count - skip
10:     if b.next1 ≠ null AND b.count >= T then
11:       PQB.push(b)
12:   else
13:     PQA.pop
14:     a.next1, skip = first1BitPositionWithSkip(a, b.next1)
15:     a.count = a.count - skip
16:     if a.next1 ≠ null AND a.count >= T then
17:       PQA.push(a)
18: return null, null

process of finding the first 1-bit, so that the number of 1s that need to be processed in the vector can be reduced for index pruning. Since the change is minor, we do not discuss it in further detail. We then update the number of unprocessed 1s in the vector (lines 9 and 15), and do pruning accordingly (lines 10-11, 16-17).

The vector alignment algorithm is a general framework for processing iceberg queries, which can be applied to various bitmap compression schemes and can support different aggregation functions. The first1BitPosition function in vector alignment is specific to the compression scheme of the bitmap vectors. In our implementation, we use bitmap vectors compressed by WAH [24].

Using vector alignment, iceberg query processing is much more efficient. Suppose that the number of distinct groups in a relation is G, the number of distinct values in column Ci is di, and we consider a bitwise-AND as a basic operation. The complexity of vector alignment is O(G), while the complexity of dynamic pruning is O(∏_{i=1..n} di). Though vector alignment needs to maintain two priority queues, the queue maintenance cost is small. The complexity of maintaining the two priority queues will be analyzed later.

    5.2 Optimization

The algorithm icebergPQ is effective in dealing with attributes that have a large number of unique values. To further improve the performance, we developed two additional optimization strategies: 1) using tracking pointers to accelerate vector-related operations, and 2) using a global filter vector to reduce futile queue pushing.

    Optimization 1: Shortening vector operations withtracking pointers

With the WAH compression scheme, a bitmap vector's first 1-bit will appear in its first two words in most cases, unless the table has more than 31 × 2^30 rows (assuming the word length is 32), which is rarely seen. However, to get better efficiency, we do not strictly follow the WAH scheme in an AND operation. When an AND operation creates a literal word in which all bits are 0, we do not convert it to a fill word or merge it with the previous 0-fill word. For example, suppose a word has 16 bits, and vector A = 8001 (0*15) 7000 (1*3, 0*12) 0FF0 (0*3, 1*8, 0*4), and vector B = 8001 (0*15) 0FFF (0*3, 1*12) 7F0F (1*7, 0*4, 1*4). After C = A AND B, A = A XOR C, vector A will become 8001 0000 00F0, rather than 8002 00F0. Therefore, the first 1-bit does not necessarily appear in the first two words of a bitmap vector; instead, it may appear at an arbitrary word of the bitmap vector.

The bitwise-AND operation discussed so far in our algorithm always starts from the first bits of the two vectors. When the first 1-bit of a vector can be at any word location, this could lead to a potentially inefficient AND operation. With our priority queue based algorithm, this inefficiency can be avoided with a further optimization. We observed that for two vectors selected by the vector alignment algorithm, the AND operation can start from the aligned position. Therefore, we designed a tracking pointer for each vector to point to the memory location containing the first 1-bit (i.e., it is a memory pointer pointing into the middle of a bitmap vector), and passed the pointers to the bitwise-AND function. A bitwise-AND operation can now start from the location that the tracking pointer points to, and thus shorten the execution time. The same tracking pointers can also be used to shorten the execution time of the first1BitPosition function. As the bitwise-AND function and the first1BitPosition function are both frequently executed in algorithm icebergPQ, the tracking pointer optimization can save significant execution time.

    Optimization 2: Reducing futile queue pushing witha global filter vector

We observe that icebergPQ may suffer from a futile queue pushing problem. Futile queue pushing occurs when we find the first 1-bit position in an attribute's vector, but the corresponding aligned vector of the other attribute has already been pruned. When many vectors are pruned, futile queue pushing occurs quite frequently and degrades the performance.

To reduce futile queue pushing as much as possible, we leverage a global filter vector. It is a bitmap vector which records the 1-bit positions that have been pruned across all the vectors. Now, when the first1BitPosition function looks for the new position of the first 1-bit, it also checks that the corresponding position in the global filter vector is not marked as pruned. Although it carries some maintenance cost, the global filter vector helps us avoid many unnecessary push and pop operations, and thus can potentially achieve significant performance gain.

Example 5.6: Consider a list of bitmap vectors A1, A2, ... for attribute A, and a list of bitmap vectors B1, B2, ...

    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

  • 8/3/2019 Iceberg Query Algorithm

    8/15

    8

for attribute B. Suppose A1 = 1010000011, B1 = 1001000001, and suppose the threshold is 2. After A1 AND B1, B1 becomes 0001000000, and thus it is pruned. Since the 4th bit of B1 is 1, we annotate in the global filter vector that the 4th bit has been pruned. Suppose A2 = 0101001100 and B2 = 0100001000. After A2 AND B2, A2 becomes 0001000100. Without the global filter vector, we would push A2 into attribute A's priority queue based on its 4th position being 1. However, this push is futile because B's vector whose 4th position is 1 (i.e., B1) has already been pruned. As a result, A2 won't find a matching 1 at the 4th position in B's vectors, and will have to be popped out of the priority queue and pushed back again at some point. On the other hand, with the global filter vector, after A2 AND B2, we can directly push the new A2 into the proper position in the queue based on its 8th position being 1, thereby saving a pop and a push operation.

Since the changes that need to be made to the algorithm are minor, we do not describe the incorporation of these optimizations in more detail.

    5.3 Performance Analysis of Vector Alignment

Now we analytically demonstrate that, compared to the dynamic pruning algorithm, icebergPQ is much more efficient. Consider a table R(A, B) with n tuples. Suppose A has s unique values, B has t unique values, and a group-by operation on (A, B) forms g groups. Note that here g represents the number of valid groups that appear at least once in the relation. It is clear that we have s


all attributes are processed. We name this algorithm icebergPQMulti. One advantage of icebergPQMulti is that, at any time, we only need to keep the vectors of two attributes in memory, regardless of the total number of aggregate attributes. Another advantage of icebergPQMulti is that after two aggregate attributes are processed, the number of resulting bitmap vectors (the number of iceberg results on these two attributes) will potentially be very small (at least much smaller than the number of unique values in any aggregate attribute). Therefore, this algorithm is very memory efficient.

One issue in icebergPQMulti is that the order in which attributes are processed may affect the performance of the algorithm. Intuitively, we prefer to first process the attributes that generate the smallest number of iceberg results, because if the intermediate iceberg is small, the subsequent iceberg processing should be more efficient. However, it is difficult to know the exact iceberg size before doing the processing.

As an attempt to address this problem, we developed a simple greedy strategy. We observed that if an attribute has more unique values, each unique value will have a lower count on average and is more likely to be pruned. Thus, we sort attributes according to their numbers of unique values in descending order. We choose the first two attributes to process in the first iteration. The intermediate result is further processed together with the third attribute in the sorted list, and so on. Our experiment in Section 7 shows that icebergPQMulti is very efficient. We plan to study other possible ordering strategies in the future.

    6.2 Handling Other Aggregation Functions

The icebergPQ algorithm can be easily generalized to support other aggregation functions that have the anti-monotone property. For example, to support the SUM function, rather than computing the count of 1-bits for each vector (the BIT1_COUNT function in Algorithm 1), we instead compute the sum of the values corresponding to the 1-bits in the resulting bitmap vector. When we conduct the index pruning, we prune bitmap vectors by the sum of all values corresponding to the 1-bits left in the vector, rather than by the number of 1-bits. The other parts of the icebergPQ algorithm are kept the same. Because the anti-monotone property of iceberg queries is still valid for SUM, our algorithm remains correct. Besides SUM, for the MIN (MAX) function the modification is similar, since MIN (MAX) also operates on numeric values, like the SUM function. The minor difference is that after each bitwise-AND operation, rather than computing the sum value, we compute the min (max) value. We then use the min (max) value for index pruning. We plan to implement the algorithms supporting the SUM, MIN, and MAX functions, and to evaluate their performance as part of our future work. Next, we illustrate how the SUM function can be handled through a concrete example.

Example 6.1: Consider the same example relation of our running example in Figure 1. Suppose the iceberg query is SELECT A, B, SUM(C) FROM R GROUP BY A, B HAVING SUM(C) > 15; the priority queues are the same as in Figure 3. At the beginning, the sum value for each distinct value in attributes A and B is computed. For example, the total sum value for A2 is 1.23 + 5.56 + 8.36 + 9.45 + 6.23 + 1.98 + 0.11 = 32.92, and 15.93 for B2. Now we pop out bitmap vectors A2 and B2 from the two priority queues and conduct a bitwise-AND between A2 and B2. The resulting bitmap vector r is 100100100100. We define a function BIT1_SUM, which uses a given bitmap vector to compute the sum of values of the column specified in the SUM clause. Instead of computing how many 1-bits exist in r, BIT1_SUM uses r to access the numerical values in attribute C and computes the sum value. The sum of the values in the corresponding positions in C is 15.93, which is larger than the threshold. Therefore, we add the group (A2, B2) into the iceberg result. The total sum value of A2 is reduced to 16.99, which is still larger than the threshold. Thus we update the first 1-bit position in A2 and insert it back into the priority queue, as in Figure 4. The total sum value of B2 (0.00) is now smaller than the threshold, and thus B2 is removed. We do the same for the next aligned vectors until one of the priority queues is empty.

Through the above example, we can see that our vector alignment algorithm can be easily generalized to handle other anti-monotone aggregation functions without much modification. Notice that the implementation of BIT1_SUM has an impact on the overall performance of the algorithm. If the available memory is large enough, we can simply keep the values of the attributes on which the SUM function is applied in memory for efficient access. If memory is limited, we can also leverage a bit-sliced index, which provides efficient compression of and access to the numeric values of the compressed attributes.

    7 EXPERIMENTAL EVALUATION

In this section, we report our extensive experiments evaluating the performance of the icebergPQ and icebergPQMulti algorithms, with the optimization techniques incorporated. We implemented and compared with an optimal tuple-scan based baseline algorithm and the algorithms in [9]. We also compared our technique with the method in [10]. We tested the algorithms by varying various factors over both large real and synthetic data sets.

The experiments show that icebergPQ outperforms the baseline algorithm, the multiBuckets algorithm, and the dynamic index pruning algorithm in [10] on data sets with zipfian distribution, a representative distribution for real data [14]. icebergPQ is also shown to deliver better performance for the normal distribution. Though the


performance is worse on the uniform distribution, such a distribution is rarely seen in real data [14]. Our algorithm scales well with respect to data size, iceberg threshold, and the number of distinct values. It is also not sensitive to the length of attribute values. Furthermore, as expected, the number of attributes in the table has no impact on icebergPQ.

    7.1 Experimental Setup

The experiments were conducted on a machine with a 3.6GHz Pentium 4 single-core processor, 2.0GB main memory and a 7200rpm IDE hard drive, running Ubuntu 9.10 with the ext4 file system. All algorithms are implemented in C++.

Our experiments were carried out with both a synthetic data set and a real patent data set. In our experiment figures, we denote the algorithm in [9] as multiBuckets, the dynamic index pruning algorithm (in [10] and in our paper) as icebergDP, the icebergPQ and icebergPQMulti algorithms as icebergPQ and icebergPQMulti, and the baseline algorithm as baseline.

We generate large synthetic data according to five parameters: data size (i.e., number of tuples), attribute value distribution, number of distinct values, number of attributes in the table, and attribute length. In our experiments, for the zipfian distribution, the probability of the i-th value Ai in column A is (1/i) / Σ_{j=1..d} (1/j), where d is the number of distinct values in the column. In each experiment, we usually focus on the impact of one parameter with respect to data size and thus fix the other parameters.

Our synthetic data sizes vary from 10 million to 40 million tuples. We also use a relation in the patent database, which can be purchased from the United States Patent and Trademark Office, as the real test data set. The patent relation contains 25 million tuples, with a total size of 29GB.

We implemented an optimal tuple-scan based baseline algorithm, which is a one-pass, in-memory, hash-based aggregation algorithm with iceberg selection. It reads one tuple from a file at a time, and uses an in-memory hash map to record the count of each aggregation group. The hash map implementation used in our experiments is the one in the C++ STL. One pass over the disk is the minimal disk access cost of the tuple-scan based approach [11]. Hashing is the most efficient aggregation strategy when there is enough memory [11]. Basically, the baseline algorithm assumes that there is infinite memory to hold all values.

Though this assumption is impractical, the baseline algorithm provides a good approximation of the best performance the tuple-scan based approach can reach. Thus, it provides a good comparison point for evaluating our proposed algorithms.

    In our experiment we assume that bitmap indexes ofthe aggregation attributes have already been built offline.

    Fig. 5: Performance of icebergDP and icebergPQ

[Fig. 6: Zipfian — running time (s) vs. number of tuples (in millions: 10, 20, 30, 40) for icebergPQ, multiBuckets, and baseline]

This is a reasonable assumption, since other than iceberg queries, bitmap indexes are useful for many other tasks, especially in column-oriented databases. Besides, building a bitmap index is generally efficient. For a table of 10 million tuples, the time for building a bitmap is less than one minute.

7.2 Performance of dynamic index pruning and vector alignment

In this suite of experiments, we tested icebergDP and icebergPQ on data sets with zipfian distribution. We varied the data size from 1 million to 8 million tuples. We did not test icebergDP with larger data sets because its performance is already very slow when the data size is 8 million.

As shown in Figure 5, icebergPQ is orders of magnitude faster than icebergDP. This demonstrates the severe performance issue triggered by the empty bitwise-AND results problem discussed before. With 1 million tuples, icebergPQ needs only 0.404 seconds to finish processing, while icebergDP needs 10.688 seconds. icebergPQ also scales well when the data size increases. It takes only 11.36 seconds with 8 million tuples, while icebergDP takes more than 18 minutes. The performance of icebergDP is unacceptable for practical data sizes.

7.3 Performance of iceberg query processing algorithms on real and synthetic data sets

As demonstrated in the previous subsection, the performance of icebergDP is very slow, so we did not compare with it further. In this suite of experiments, we tested the performance of icebergPQ, multiBuckets and


[Fig. 7: Normal — running time (s) vs. number of tuples (in millions: 10, 20, 30, 40) for icebergPQ, multiBuckets, and baseline]

[Fig. 8: Uniform — running time (s) vs. number of tuples (in millions: 10, 20, 30, 40) for icebergPQ, multiBuckets, and baseline]

baseline for iceberg query processing on real and synthetic data sets. For the real (patent) data set, since we cannot control the parameters of the data set (e.g., number of attributes, value distributions), we tested these algorithms with different sizes of the patent data set. For the synthetic data set, we tested the algorithms with respect to attribute value distributions. Performance tests on other parameters are discussed in Section 7.4.

Synthetic Data Set with Different Value Distributions. Generally speaking, aggregation algorithms should not be affected by value distributions [11], [15], [25]. But optimizations specific to iceberg queries [4], [9] are usually sensitive to value distributions, and may only perform well for certain types of distribution. Thus, we first want to understand the performance of our algorithm and optimizations under various value distributions.

We generated data sets for three representative but very different distributions (zipfian, normal and uniform). For each distribution, we varied the number of tuples from 10 million to 40 million. The generated table R contains only two attributes A and B, where both have the CHAR(10) data type and follow the same value distribution. In all data sizes and value distributions, attribute A has 500,000 distinct values, and B has 300,000 distinct values. We use 20000, 150 and 10 as the iceberg thresholds for the zipfian, normal and uniform distributions, respectively. Figure 6 shows the performance of the tested algorithms for the zipfian distribution, Figure 7 for the normal distribution, and Figure 8 for the uniform distribution.

The experiment results show that algorithm icebergPQ achieves the best performance and shows great scalability for the zipfian distribution. In almost all data sizes, icebergPQ is 3 to 8 times faster than the optimal baseline algorithm and 3 to 6 times faster than the multiBuckets algorithm. For instance, for 10 million tuples, icebergPQ uses 5.546 seconds, while multiBuckets takes 28.241 seconds and baseline takes 39.598 seconds.

[Fig. 9: Patent Data — running time (s) vs. number of tuples (in millions: 5, 10, 15, 20, 25) for icebergPQ, multiBuckets, and baseline]

Fig. 10: Different Threshold

Besides providing superior performance, icebergPQ also scales well for the zipfian distribution. Its running time increases from 5.546 to 15.819 seconds when the number of tuples increases from 10 million to 20 million, while the processing time of multiBuckets increases from 28.421 seconds to 64.089 seconds. As the zipfian distribution is often considered the representative distribution for real data, this experiment shows great promise for the icebergPQ algorithm in terms of both efficiency and scalability.

For data sets under the other value distributions, icebergPQ also demonstrates reasonable performance and linear scalability. For the normal distribution, as Figure 7 shows, icebergPQ is still better than multiBuckets and baseline. icebergPQ is slower than the other two algorithms for the uniform distribution. For the uniform distribution, it is easy to see that the pruning power of icebergPQ is not very effective, as the data is evenly distributed. As the uniform distribution is rarely seen in real data, we believe that the performance for the zipfian and normal distributions is a better indicator of algorithm effectiveness.

    Real Patent Data Set. We tested the performance of allalgorithms on the real patent data set. The patent data setcontains one relation, which has 9 attributes, includingfirstname, lastname, states, city, etc. We vary the data set


    Fig. 11: Number of Distinct Groups

[Fig. 12: Distinct Values — running time (s) vs. number of distinct values (in thousands) for icebergPQ, multiBuckets, and baseline]

sizes from 5 to 25 million tuples. The experiment results on the patent data set are shown in Figure 9. As shown in the figure, the icebergPQ algorithm has the best performance among all algorithms tested. It is much faster than the other two algorithms. Besides, icebergPQ scales well with respect to the increase of the data size in the patent data set. The difference in running time between icebergPQ and the other algorithms increases as the data size increases.

    7.4 Performance of iceberg query processing ondifferent parameters

Our experiments so far have demonstrated the effectiveness of the icebergPQ algorithm, especially for data sets with zipfian distribution. Next, we vary one parameter at a time to study the performance of icebergPQ and compare it with the others under different parameter settings. The data set used in this set of experiments is a relation with 10 million tuples and zipfian value distribution.

Iceberg Threshold. We gradually lower the threshold T of the iceberg query from 20000 to 625. Figure 10 presents the results of our algorithm and the others. Table 1 shows the sizes of the result sets for the different thresholds. The total number of distinct groups in our experiment is about 1 million. The set of numbers below the threshold values in Figure 10 represents the percentage of the number of results divided by the total number of groups. As we can see from the figure, even if we lower the threshold to 1250, icebergPQ still shows the

[Fig. 13: Attribute Length — running time (s) vs. attribute length (in chars: 10, 20, 40, 80) for icebergPQ, multiBuckets, and baseline]

[Fig. 14: Number of Aggregation Attributes — running time (s) vs. number of aggregate attributes (1 to 7) for icebergPQMulti, multiBuckets, and baseline]

best performance among all. For instance, it takes only 32.885 seconds when the threshold is 1250. The size of the iceberg result is 165 with threshold 1250. These are already big enough iceberg tips from the perspective of user analysis. The performance of icebergPQ is comparable to the other two methods even with threshold 625, where the result set size is already 385. The baseline algorithm is insensitive to the threshold, and thus keeps the same performance in all cases. Figure 10 shows that, when the threshold is greater than 1250, icebergPQ always performs better than multiBuckets, and both are faster than baseline. This shows that icebergPQ is better than multiBuckets when the threshold is reasonably large.

The percentage of results is relatively low in this set of experiments, as shown in Figure 10. This is the case because the number of distinct groups in this set of experiments is large. As will be seen in the next experiment, our algorithm also performs well when the iceberg results cover a higher percentage of the groups.

Number of Distinct Groups in Relation. In this experiment, we gradually lower the number of distinct groups in the relation while keeping the threshold fixed at 20000. Figure 11 reports the results of our experiment. The numbers of iceberg results for different numbers of distinct groups are shown in Table 2.

As we can see, the processing time of the baseline approach decreases with respect to the number of groups, because when the number of groups decreases, the hash table used in the baseline approach has better locality. For multiBuckets, the processing time is almost constant because (1) the size of the hash table (number of buckets)


Threshold   20000  10000  5000  2500  1250  625  312
Result #        3     10    27    68   165  385  890

TABLE 1: Result set sizes for different thresholds

Distinct Groups  5000k  1000k  500k  100k  50k  10k  5k
Result #            16     45    58    85   96  140  160

TABLE 2: Result set sizes for different numbers of distinct groups

[Fig. 15: Number of Attributes in a Relation — running time (s) vs. number of attributes (2 to 10) for icebergPQ, multiBuckets, and baseline]

remains constant, and (2) by running the procedure multiple times using different hash functions, the false positives can be significantly reduced, so the time taken by the last counting step does not change much.

As shown in Figure 11, our algorithm performs better than the baseline algorithm and the multiBuckets algorithm in all cases in our test. Our algorithm shows better performance even when the percentage of results is up to 2.7% of the total groups. The percentage is shown under the x axis in Figure 11.

Number of Distinct Values in Attributes. Analytically, more distinct values indicate more vectors to deal with. But on the other hand, each distinct value has fewer occurrences on average, which means more vectors can be pruned by icebergPQ. We thus wonder which effect is more dominant. We vary the number of distinct values in an attribute from 100,000 to 800,000, and as Figure 12 shows, both icebergPQ and multiBuckets perform better when there are fewer distinct values. Also, both scale well when the number of distinct values increases. The performance of baseline is insensitive to the number of distinct values.

Attribute Length. The length of an attribute is the size of the value type defined for that attribute. The larger an attribute is, the larger the table is. Therefore, attribute length affects relation size and thus disk access time. As tuple-scan based algorithms operate on tuple values to compute the aggregation, attribute length also has an impact on the computation cost. On the contrary, a bitmap index operates on bits and thus is not affected by attribute length. Figure 13 shows the experiment results of varying the attribute length from CHAR(10) to CHAR(80). As indicated by Figure 13, the processing times of multiBuckets and baseline increase when attribute lengths increase, while that of icebergPQ stays the same.

    Number of Aggregate Attributes. We discussed how

    to extend icebergPQ to handle multiple aggregate at-tributes in Section 6. To verify the effectiveness of thealgorithm, we used seven queries, whose number of ag-gregate attributes varies from 1 to 7. When we increasedthe number of aggregation attributes, we also loweredthe iceberg threshold to maintain a similar number oficeberg results. The experiment result is shown in Fig-ure 14.

As indicated by the result, the icebergPQ algorithm generally has better performance and scales well compared with the other two. However, the processing time of icebergPQ increases faster than that of multiBuckets when the number of attributes increases. This is because multiBuckets performs data scans and hashing; when the number of aggregation attributes increases, the increase of its processing time is mainly due to (1) the need to access more data during the scan, and (2) the hash function being executed on more attributes. Both of these factors are linear with respect to the number of aggregation attributes and largely independent of the iceberg threshold. On the other hand, for icebergPQ, when the threshold decreases it becomes increasingly difficult to prune the vectors, so the processing time increases more than linearly with respect to the number of aggregation attributes. Nevertheless, in practice, aggregation queries usually do not involve many attributes, so icebergPQ is usually the most efficient approach.

Number of Attributes in the Relation. For tuple-scan based algorithms, a complete tuple needs to be read from disk even if only a small part of the tuple is used in the computation. Therefore, the performance of these algorithms degrades as the number of attributes in a relation increases. The experimental results for varying the number of attributes in a relation are shown in Figure 15. As we can see from the figure, multiBuckets and baseline show a linear increase in processing time with respect to the number of attributes. In contrast, the performance of icebergPQ stays the same because it only accesses the bitmap indices of the attributes involved in the iceberg query. For example, in our experiment, the processing time of the multiBuckets algorithm increases from 31.206 seconds to 59.772 seconds when the number of attributes changes from 2 to 10, and the processing time of the baseline algorithm increases from 22.141 seconds to 82.383 seconds under the same setting. The performance of icebergPQ stays at around 5.5 seconds, which is much faster than multiBuckets and baseline. In our experiment, we only varied the number of attributes from 2 to 10; a real data warehouse may have 100 or more attributes, in which case the performance gain from using bitmap indices is even more significant.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

    8 CONCLUSION

This paper presents an efficient algorithm for iceberg query processing using compressed bitmap indices. Our algorithm demonstrates superior performance over existing schemes and does not depend on any particular compression method. We observed that bitmap index has three attractive advantages: 1) it saves disk access by avoiding tuple scans on tables with many attributes; 2) it saves computation time by using bitwise operations; 3) it leverages the anti-monotone property of iceberg queries to enable aggressive pruning strategies. To solve the problem of massive empty AND results, we proposed an efficient vector alignment algorithm using priority queues. We also developed optimization techniques to further improve the performance. Both analysis and experiments verify the effectiveness of our approach and show that our algorithm can outperform the state-of-the-art algorithms for iceberg query processing.
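The priority-queue alignment idea can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the paper's icebergPQ; all function and variable names are invented here. Each vector enters a min-heap keyed by the position of its next set bit, so a bitwise AND is attempted only when two vectors agree on their next position, avoiding ANDs that would come out empty:

```python
import heapq

def next_set_bit(bits, start):
    """Position of the first set bit at or after `start`, or None."""
    bits >>= start
    pos = start
    while bits:
        if bits & 1:
            return pos
        bits >>= 1
        pos += 1
    return None

def aligned_pairs(vectors):
    """Yield (name_a, name_b, position) whenever the two vectors at
    the top of the heap have their next set bit at the same position.
    `vectors` maps a vector name to an int bitmap. This sketch reports
    pairs as they surface at the heap top; a full implementation would
    also enumerate all pairs when more than two vectors align."""
    heap = []
    for name, bits in vectors.items():
        p = next_set_bit(bits, 0)
        if p is not None:
            heapq.heappush(heap, (p, name, bits))
    while len(heap) >= 2:
        p1, n1, b1 = heapq.heappop(heap)
        p2, n2, b2 = heap[0]
        if p1 == p2:
            yield (n1, n2, p1)  # both vectors cover this tuple position
        # Advance the popped vector past position p1 and re-insert it.
        nxt = next_set_bit(b1, p1 + 1)
        if nxt is not None:
            heapq.heappush(heap, (nxt, n1, b1))
```

Because vectors whose next set bits disagree are never AND-ed, the many empty intersections of a naive all-pairs approach are skipped entirely.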

Our algorithm is not sensitive to the number of distinct values, the number of attributes in the relation, or the length of individual attributes. It works well on data sets with a Zipfian distribution. The performance of our algorithm is better when the query is more iceberg-like, that is, when the threshold of the iceberg query is relatively large (which means the percentage of iceberg results is relatively small). It also works better when the number of aggregation attributes is relatively small.

There are several issues that we consider as future work. First, we would like to investigate the processing of iceberg queries without the anti-monotone property, e.g., queries with AVERAGE functions. For this type of query, even if a pair of values (a, b) does not satisfy the predicate, its superset (a, b, c) may still satisfy the predicate, which makes pruning much harder. Second, we will study the optimal order in which to process attributes (in case we have three or more aggregation attributes) to gain better efficiency. This requires us to efficiently and reliably predict the number of iceberg tuples given any two relations, which is a very challenging problem. Third, we will also investigate solutions for when the data is of such enormous size that the bitmap of a single column does not fit in main memory.
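As a worked illustration of the first point (the numbers here are invented for this example), a group that fails an AVG threshold can contain a finer subgroup that passes it, so the pruning that is safe for anti-monotone aggregates such as COUNT is unsafe for AVG:

```python
# Invented data: tuples of (a, b, measure).
# Iceberg predicate: AVG(measure) >= 50.
rows = [("a1", "b1", 80), ("a1", "b2", 10), ("a1", "b2", 10)]

def avg(values):
    return sum(values) / len(values)

# Grouping on `a` alone fails the predicate ...
group_a = [m for (a, b, m) in rows if a == "a1"]
assert avg(group_a) < 50      # (80 + 10 + 10) / 3, about 33.3

# ... yet the finer group on (a, b) still passes it, so the failed
# coarse group cannot be pruned without losing an iceberg result.
group_ab = [m for (a, b, m) in rows if (a, b) == ("a1", "b1")]
assert avg(group_ab) >= 50    # 80
```

With COUNT, by contrast, every finer group has a count no larger than its coarse group, which is exactly what makes the pruning valid.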

9 ACKNOWLEDGEMENT

This material is based on work partially supported by IBM Faculty Awards, NSF CAREER Award IIS-0915438, and NSF IIS-0915438.


Bin He is a Research Scientist at IBM Almaden Research Center. Bin got his Ph.D. in the Department of Computer Science at the University of Illinois at Urbana-Champaign in 2006. He also received an M.S. degree in Computer Science from the University of Illinois at Urbana-Champaign in 2002, and M.S. and B.S. degrees in Mathematics from Peking University, China, in 2000 and 1998 respectively. Bin's research mainly focuses on large scale databases, data warehousing, and data integration.


Hui-I Hsiao received a bachelor's degree from National Taiwan University, and the M.S. and Ph.D. degrees in computer science from the University of Wisconsin at Madison.

He is a Program Director at IBM Almaden Research Center, where he is responsible for technology innovation for emerging markets. Prior to this, he was the Chief Scientist of Information Management and Deputy Director of the IBM China Research Lab in Beijing from 2006 to 2008. Dr. Hsiao joined the IBM T.J. Watson Research Center in 1990 and was appointed manager of the Parallel Databases department in 1995. At IBM Watson Research, he led the design and development of DB2 Parallel Edition, a highly scalable parallel database system on an open system platform. He moved to the IBM Almaden Research Center in 1997 and managed the Content Management System department from 2000 to 2006. Prior to joining IBM, he worked as a software engineer at Nicolet Instrument Corporation in Madison, Wisconsin, from 1984 to 1990, where he was named a Nicolet Associate Fellow.

Dr. Hsiao is a recipient of the 2008 ACM Software System Award, which recognizes individuals for developing a software system that has had a lasting influence, reflected in contributions to concepts and in commercial acceptance. He received an Outstanding Innovation Award and an Outstanding Technical Achievement Award from IBM for contributions to IBM DB2 and Content Manager technologies. He was invited to IBM corporate technical recognition events (CTRE) in 2005 and 2010, which recognize top technical contributors in IBM. Dr. Hsiao is a member of IEEE, ACM, and ACM SIGMOD. He has published more than 30 refereed research papers and been awarded 29 patents. He was the program committee chair for the 2006 AP SSME Symposium and served on program committees for many international conferences.

Ziyang Liu is a Ph.D. candidate at Arizona State University and a recipient of the Science Foundation Arizona (SFAz) Fellowship (2008-2010). He joined Arizona State University in August 2006 and received an M.S. degree in Computer Science in May 2008. His current research focuses on keyword search on structured and semi-structured data and workflow management.

Yu Huang is a Ph.D. student of computer science at Arizona State University. He joined ASU in 2007 and is currently working in the software research group (SRLAB) at ASU. His research areas include database systems, software as a service, and cloud computing.

Yi Chen received her M.S. and Ph.D. degrees in computer science from the University of Pennsylvania, and is currently an assistant professor at Arizona State University. Her current research focuses on supporting keyword search on structured and semi-structured data, workflow management, social networks, information integration, and information extraction. She is a recipient of an NSF CAREER Award (2009) and an IBM Faculty Award (2010).
