SQL for Advanced Data Aggregation SQL - A Flexible and Comprehensive Framework for In-Database Analytics ORACLE WHITEPAPER

SQL for Advanced Data Aggregation · Advanced Data aggregation with Oracle Database 12c Release 2 Oracle has extended the processing capabilities of the GROUP BY clause to provide

Sep 19, 2020

dariahiddleston

SQL for Advanced Data Aggregation

SQL - A Flexible and Comprehensive Framework for In-Database Analytics

ORACLE WHITE PAPER

NOVEMBER 2016

SQL FOR ADVANCED DATA AGGREGATION

Contents

Data Analysis with SQL
SQL – A Flexible and Comprehensive Analytical Framework
Advanced Data aggregation with Oracle Database 12c Release 2
Rollups and Cubes
Grouping Sets
Concatenated Groupings
Composite columns
Understanding Levels Within Hierarchical Totals
Approximate Query Processing
Approximate Queries for Data Discovery
Aggregating Approximate Results For Faster Analysis
Conclusion
Further Reading

Disclaimer

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Data Analysis with SQL

Today, information management systems and operational applications need to support a wide variety of business requirements that typically involve some degree of analytical processing. These requirements range from data enrichment and transformation during ETL workflows, through time-based calculations such as moving averages and moving totals for sales reports and real-time pattern searches within log files, to building what-if data models during budgeting and planning exercises. Developers, business users and project teams can choose from a wide range of languages to create solutions to meet these requirements.

Over time, many companies have found that the use of so many different programming languages to drive their data systems creates five key problems:

1. Decreases the ability to rapidly innovate
2. Creates data silos
3. Results in application-level performance bottlenecks that are hard to trace and rectify
4. Drives up costs by complicating the deployment and management processes
5. Increases the level of investment in training

Development teams need to quickly deliver new and innovative applications that provide significant competitive advantage and drive additional revenue streams. Anything that stifles innovation needs to be urgently reviewed and resolved. The challenge facing many organizations is to find the right platform and language to securely and efficiently manage the data and analytical requirements while at the same time supporting the broadest range of tools and applications to maximize the investment in existing skills.

IT and project managers need an agile platform to underpin their projects and applications so that developers can quickly and effectively respond to ever-changing business requirements without incurring the issues listed above.

SQL – A Flexible and Comprehensive Analytical Framework

The process of analyzing data has seen many changes and significant technological advances over the last forty years. However, there has been one language, one capability that has endured and evolved: the Structured Query Language or SQL. Many other languages and technologies have come and gone but SQL has been a constant. In fact, SQL has not only been a constant, but it has also improved significantly over time.

SQL is now the default language for data analytics because it provides a mature and comprehensive framework for data access and it supports a broad range of sophisticated analytical features. The key benefits for IT and business teams provided by Oracle’s in-database analytical SQL features and functions are:

Enhanced developer productivity

Using the latest built-in analytical SQL capabilities, developers can simplify their application code by replacing complex analytical processing, written using many different languages, with purpose-built analytical SQL that is much clearer and more concise. Tasks that in the past required the use of procedural languages or multiple SQL statements can now be expressed using single, comprehensive SQL statements. This simplified SQL (analytic SQL) is quicker to formulate, maintain and deploy compared to older approaches, resulting in greater developer productivity.

Improved Manageability

When computations are centralized close to the data, the inconsistency, lack of timeliness and poor security of calculations scattered across multiple specialized processing platforms disappear. The ability to access a consolidated view of all your data is simplified when applications share a common relational environment rather than a mix of calculation engines with incompatible data structures and languages.

Oracle’s in-database approach to analytics allows developers to efficiently layer their analysis using SQL because it can support a very broad range of business requirements.

Minimized Learning Effort

The amount of effort required to understand analytic SQL is minimized through the use of careful syntax design. The syntax typically leverages existing SQL constructs, such as the aggregate functions SUM and AVG, and extends them using well-understood keywords such as OVER, PARTITION BY, ORDER BY, RANGE INTERVAL etc.

Most developers and business users with a reasonable level of proficiency with SQL can quickly adopt and integrate sophisticated analytical features, such as pareto-distributions, pattern matching, cube and rollup aggregations, into their applications and reports.

The amount of time required for enhancements, maintenance and upgrades is minimized: more people will be able to review and enhance the existing SQL code rather than having to rely on a few key people with specialized programming skills.

ANSI SQL compliance

Most of Oracle’s analytical SQL is part of the ANSI SQL standard or is in the process of being adopted in newer versions of it. This ensures broad support for these features and rapid adoption of newly introduced functionality across applications and tools, both from Oracle’s partner network and from other independent software vendors.

Oracle is continuously working with its many partners to assist them in exploiting the expanding library of analytic functions. Already many independent software vendors have integrated support for the new Database 12c Release 2 [1] in-database analytic functions into their products.

Improved performance

Oracle’s in-database analytical functions and features enable significantly better query performance. Not only do they remove the need for specialized data-processing silos, but the internal processing of these purpose-built functions is also fully optimized. Using SQL unlocks the full potential of the Oracle database, such as parallel execution, to provide enterprise-level scalability unmatched by external specialized processing engines.

Summary

This section has outlined how Oracle’s in-database analytic SQL features provide IT, application development teams and business users with a robust and agile analytical language that enhances both query performance and productivity while providing investment protection by building on existing standards-based skills. For a more detailed analysis of the benefits of SQL as an analysis language please refer to the following whitepaper: SQL – the natural language for analysis.

The rest of this paper will outline the key SQL-based features for data aggregation and approximate query processing within Oracle Database 12c Release 2.

Advanced Data aggregation with Oracle Database 12c Release 2

Oracle has extended the processing capabilities of the GROUP BY clause to provide fine-grained control over the creation of totals derived from the initial result set. This includes the following features:

Rollup - calculates multiple levels of subtotals across a specified group of dimensions
Cube - calculates subtotals for all possible combinations of a group of dimensions, plus a grand total
Grouping - helps identify which rows in a result set have been generated by a rollup or cube operation
Grouping sets - a set of user-defined groupings that are generated as part of the result set

The following sections will look at these features in more detail.

Rollups and Cubes

ROLLUP creates subtotals that "roll up" from the most detailed level to a grand total, following a grouping list specified in the ROLLUP clause. ROLLUP takes as its argument an ordered list of grouping columns.

ROLLUP is very helpful for subtotaling along a hierarchical dimension such as time or geography and it simplifies and speeds the population and maintenance of summary tables. This is especially useful for ETL developers and DBAs.
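As a minimal sketch of the subtotals ROLLUP generates, the query below uses a small inline data set rather than real tables. The sales_demo name, its columns and its values are hypothetical and exist only for illustration.

```sql
-- Hypothetical inline data: sales_demo and its values are illustrative only.
WITH sales_demo (cal_year, cal_quarter, amount) AS (
  SELECT 2000, 'Q1', 100 FROM dual UNION ALL
  SELECT 2000, 'Q2', 150 FROM dual UNION ALL
  SELECT 2001, 'Q1', 200 FROM dual
)
SELECT cal_year,
       cal_quarter,
       SUM(amount) AS total_amount
FROM sales_demo
-- ROLLUP over two columns yields three grouping levels:
-- (cal_year, cal_quarter), (cal_year) and the grand total ().
GROUP BY ROLLUP (cal_year, cal_quarter)
ORDER BY cal_year, cal_quarter;
```

In general, ROLLUP over n columns produces n+1 levels of aggregation, so the order of the columns in the list determines which subtotals appear.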

[1] Oracle Database 12c Release 2 (12.2), the latest generation of the world’s most popular database, is now available in the Oracle Cloud

FIGURE 1: GROUP BY ROLLUP RESULTS IS EQUIVALENT TO A CROSSTAB

CUBE can calculate a cross-tabular report with a single SELECT statement. Like ROLLUP, CUBE is a simple extension to the GROUP BY clause, and its syntax is also easy to learn. CUBE takes a specified set of grouping columns and creates the required subtotals for all possible combinations. This feature is very useful in situations where summary tables need to be created. CUBE adds most value to query processing where the query is based on columns from multiple dimensions rather than columns representing different levels of a single dimension.
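The following sketch shows CUBE applied to two independent dimensions. It runs against hypothetical inline data (sales_demo, its columns and its values are assumptions made for this illustration).

```sql
-- Hypothetical inline data: two independent dimensions, channel and country.
WITH sales_demo (channel, country, amount) AS (
  SELECT 'Internet', 'GB', 100 FROM dual UNION ALL
  SELECT 'Internet', 'US', 150 FROM dual UNION ALL
  SELECT 'Direct',   'US', 200 FROM dual
)
SELECT channel,
       country,
       SUM(amount) AS total_amount
FROM sales_demo
-- CUBE over two columns yields four groupings:
-- (channel, country), (channel), (country) and the grand total ().
GROUP BY CUBE (channel, country)
ORDER BY channel, country;
```

CUBE over n columns produces 2^n groupings, which is why it suits a small number of cross-cutting dimensions better than a long list of hierarchy levels.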

FIGURE 2 - GROUP BY CUBE AGGREGATES RESULTS ACROSS ALL DIMENSIONS/LEVELS

While ROLLUP and CUBE are very powerful features they can seem a little inflexible. Developers often need to determine which result set rows are subtotals and the exact level of aggregation for a given subtotal. This allows them to use subtotals in calculations such as percent-of-totals.

To help resolve data quality issues it is often important to differentiate between stored NULL values and "NULL" values created by a ROLLUP or CUBE. The GROUPING function resolves these problems. Using a single column as its argument, GROUPING returns 1 when it encounters a NULL value created by a ROLLUP or CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, returns a 0. Even though this information is only of value to the developer it is a very powerful feature. It is not only useful for identifying NULLs, it also enables sorting subtotal rows and filtering results.
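A short sketch of how GROUPING separates a stored NULL from a subtotal NULL is shown here. The inline sales_demo data, including the row with a deliberately missing country, is hypothetical, as is the 'All Countries' label.

```sql
-- Hypothetical inline data; the second row carries a stored NULL country.
WITH sales_demo (channel, country, amount) AS (
  SELECT 'Internet', 'GB', 100 FROM dual UNION ALL
  SELECT 'Internet', CAST(NULL AS VARCHAR2(2)), 50 FROM dual UNION ALL
  SELECT 'Direct',   'US', 200 FROM dual
)
SELECT channel,
       -- Label only the NULLs that ROLLUP generated; a stored NULL stays NULL.
       CASE GROUPING(country)
         WHEN 1 THEN 'All Countries'
         ELSE country
       END               AS country_label,
       GROUPING(channel) AS g_channel,   -- 1 only on the grand total row
       GROUPING(country) AS g_country,   -- 1 on every subtotal row
       SUM(amount)       AS total_amount
FROM sales_demo
GROUP BY ROLLUP (channel, country)
ORDER BY GROUPING(channel), channel, GROUPING(country), country;
```

Adding a clause such as HAVING GROUPING(country) = 1 would keep only the subtotal rows, which is one way to use the function for filtering.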

Grouping Sets

Grouping sets allow the developer and business user to precisely define the groupings of key dimensions. A GROUPING SETS clause produces a single result set that is equivalent to a UNION ALL of differently grouped rows. This allows efficient analysis across multiple dimensions without computing the whole CUBE. Since computing all the possible combinations for a full CUBE creates a heavy processing load, the precise control enabled by grouping sets can translate into significant performance gains.

For example, consider the following statement:

SELECT
  channel_desc,
  calendar_month_desc,
  country_iso_code,
  TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id
  AND sales.cust_id = customers.cust_id
  AND sales.channel_id = channels.channel_id
  -- join customers to countries so each sale maps to exactly one country
  AND customers.country_id = countries.country_id
  AND channels.channel_desc IN ('Direct Sales', 'Internet')
  AND times.calendar_month_desc IN ('2000-09', '2000-10')
  AND country_iso_code IN ('GB', 'US')
GROUP BY GROUPING SETS(
  (channel_desc, calendar_month_desc, country_iso_code),
  (country_iso_code),
  (channel_desc));

The above statement calculates aggregates over the following three groupings:

1. Totals for each combination of channel_desc, calendar_month_desc and country_iso_code
2. Grand totals for each country ISO code
3. Grand totals for each channel

FIGURE 3 – GROUPING SETS PROVIDE FINE-GRAINED CONTROL OF THE AGGREGATION PROCESS

Compare the above results to statements that use other aggregation operators: CUBE, which computes all possible groupings across all dimensions, and ROLLUP, which computes every level of the specified hierarchy. The key point is that when using CUBE and ROLLUP it is likely that many of the calculated groupings will not be required.

Concatenated Groupings

Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified with concatenated groupings yield the cross product of groupings from each grouping set. Developers can use this feature to specify a small number of concatenated groupings, which in turn generates a large number of final groups. This helps to both simplify and reduce the length of the SQL statement, making it easier to understand and maintain. Concatenated groupings are specified by listing multiple grouping sets, cubes, and rollups, and separating them with commas. The example below contains concatenated grouping sets:

GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)

which defines the following groupings:

(a, c), (a, d), (b, c), (b, d)

Concatenation of grouping sets is very helpful for a number of reasons. Firstly, it reduces the complexity of query development because there is no need to enumerate all groupings within the SQL statement. Secondly, it allows application developers to push more processing back inside the Oracle Database. The SQL typically generated by OLAP-type applications often involves the concatenation of grouping sets, with each grouping set defining groupings needed for a dimension.
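To make the cross product concrete, here is a minimal sketch in which one grouping set covers a sales dimension and the other covers a time dimension. The inline sales_demo data and its column names are hypothetical.

```sql
-- Hypothetical inline data with two sales columns and two time columns.
WITH sales_demo (channel, country, cal_year, cal_quarter, amount) AS (
  SELECT 'Internet', 'GB', 2000, 'Q1', 100 FROM dual UNION ALL
  SELECT 'Direct',   'US', 2000, 'Q2', 150 FROM dual UNION ALL
  SELECT 'Internet', 'US', 2001, 'Q1', 200 FROM dual
)
SELECT channel,
       country,
       cal_year,
       cal_quarter,
       SUM(amount) AS total_amount
FROM sales_demo
-- The two grouping sets are crossed, giving four groupings:
-- (channel, cal_year), (channel, cal_quarter),
-- (country, cal_year), (country, cal_quarter).
GROUP BY GROUPING SETS (channel, country),
         GROUPING SETS (cal_year, cal_quarter);
```

Written without concatenation, the same result would need all four groupings listed explicitly inside a single GROUPING SETS clause.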

Composite columns

A composite column is a collection of columns that are treated as a unit during the computation of groupings. In general, composite columns are useful in ROLLUP, CUBE, GROUPING SETS, and concatenated groupings. For example, in CUBE or ROLLUP, composite columns would mean skipping aggregation across certain levels.

You specify the columns in parentheses as in the following statement:

ROLLUP (year, (quarter, month), day)

In this statement, quarter and month are treated as a single unit, so the data is never rolled up to the (year, quarter) level. What is actually produced is equivalent to a UNION ALL of the following groupings:

(year, quarter, month, day),
(year, quarter, month),
(year),
()
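The effect of a composite column can be checked with a small query such as the sketch below. The sales_demo data set and its column names are hypothetical stand-ins for year, quarter, month and day.

```sql
-- Hypothetical inline data covering two quarters of one year.
WITH sales_demo (cal_year, cal_quarter, cal_month, cal_day, amount) AS (
  SELECT 2000, 'Q1', '2000-01', DATE '2000-01-15', 100 FROM dual UNION ALL
  SELECT 2000, 'Q1', '2000-02', DATE '2000-02-10', 150 FROM dual UNION ALL
  SELECT 2000, 'Q2', '2000-04', DATE '2000-04-05', 200 FROM dual
)
SELECT cal_year,
       cal_quarter,
       cal_month,
       cal_day,
       SUM(amount) AS total_amount
FROM sales_demo
-- (cal_quarter, cal_month) is a composite column: both columns are
-- aggregated away together, so no (cal_year, cal_quarter) subtotal appears.
GROUP BY ROLLUP (cal_year, (cal_quarter, cal_month), cal_day)
ORDER BY cal_year, cal_quarter, cal_month, cal_day;
```

Removing the inner parentheses turns this back into a plain four-column ROLLUP, which adds the (cal_year, cal_quarter) level to the result.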

There is more information about advanced SQL aggregations in the Oracle Data Warehouse Guide [2].

Understanding Levels Within Hierarchical Totals

While the above extensions to the GROUP BY clause offer a lot of power and flexibility, they also allow developers and report writers to create complex result sets that include duplicate groupings. As a result two key challenges arise:

1. How can you programmatically determine which result set rows are subtotals?
2. How do you find the exact level of aggregation for a given subtotal?

[2] HTTP://DOCS.ORACLE.COM/DATABASE/122/DWHSG/SQL-AGGREGATION-DATA-WAREHOUSES.HTM - DWHSG-GUID-E051A04E-0C53-491D-9B16-B71BA00B80C2

Within result sets there is often a need to identify subtotals within non-additive calculations such as percent-of-totals. Therefore, developers need an easy way to determine which rows are the subtotals. An additional complication arises when a query’s results contain both stored NULL values and "NULL" values created by the GROUP BY operation. Oracle provides tools to resolve both these challenges.

Identifying NULLs within dimensions using the GROUPING Function

The GROUPING function returns 1 when it encounters a NULL value that has been created by a GROUP BY operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, returns a 0. Using this information it is possible to auto-fill descriptor columns with more useful descriptive values, such as “All Products” or “All Years”, as shown below:

FIGURE 4 – MAKING REPORTS MORE READABLE BY USING GROUPING FUNCTION

Identify the GROUP BY level

Using lots of GROUPING functions within a query to identify dimensional aggregates can end up creating a very wide report that is also difficult to interpret both visually and programmatically.

The GROUPING_ID function returns a single number that enables you to determine the exact GROUP BY level for each row within your report. For each row, GROUPING_ID takes the set of 1's and 0's that would be generated by the appropriate GROUPING functions and concatenates them to form a bit vector. The bit vector is treated as a binary number, and the number's base-10 value is returned by the GROUPING_ID function.
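The bit-vector calculation described above can be sketched as follows, with the individual GROUPING values shown next to the combined GROUPING_ID. The inline sales_demo data is hypothetical.

```sql
-- Hypothetical inline data, as in a two-dimension cross-tab.
WITH sales_demo (channel, country, amount) AS (
  SELECT 'Internet', 'GB', 100 FROM dual UNION ALL
  SELECT 'Internet', 'US', 150 FROM dual UNION ALL
  SELECT 'Direct',   'US', 200 FROM dual
)
SELECT channel,
       country,
       GROUPING(channel)             AS g_channel,
       GROUPING(country)             AS g_country,
       -- The bits are read left to right: gid = 2 * g_channel + g_country.
       GROUPING_ID(channel, country) AS gid,
       SUM(amount)                   AS total_amount
FROM sales_demo
GROUP BY CUBE (channel, country)
ORDER BY gid, channel, country;
```

With two columns the possible values are 0 (detail rows), 1 (subtotal per channel), 2 (subtotal per country) and 3 (grand total), so a single predicate on gid is enough to select one aggregation level.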

FIGURE 5 – MAKING REPORTS MORE READABLE BY USING GROUPING ID FUNCTION

Approximate Query Processing

Approximate Queries for Data Discovery

In some cases, 100% accuracy within an analytical query is not actually needed, i.e. good enough is, in fact, good enough for an answer. An approximate answer that is, for example, within 1% of the actual value can be sufficient, especially if the result is returned extremely quickly.

Oracle Database 12c Release 2 [3] has expanded its support for aggregation and data discovery based on approximate results by extending its library of approximate functions. This now includes:

APPROX_COUNT_DISTINCT
APPROX_PERCENTILE
APPROX_MEDIAN

Speeding up count distinct operations

Oracle Database uses the HyperLogLog algorithm for 'approximate count distinct' operations. Processing of large volumes of data is significantly faster using this algorithm compared with the exact aggregation, especially for data sets with a large number of distinct values. The following statement shows how to return the approximate number of distinct customers for each product:

SELECT
  p.prod_name,
  APPROX_COUNT_DISTINCT(s.cust_id) AS "Unique of Customers"
FROM sales s, products p
WHERE p.prod_id = s.prod_id
GROUP BY p.prod_name
ORDER BY p.prod_name;

It produces the following output:

[3] Oracle Database 12c Release 2 (12.2), the latest generation of the world’s most popular database, is now available in the Oracle Cloud

FIGURE 6 – AN EXAMPLE OF USING APPROXIMATE COUNT FEATURE TO FIND NUMBER OF UNIQUE CUSTOMERS BUYING EACH PRODUCT

Approximate count distinct does not use sampling. When computing an approximation of the number of distinct values within a data set the database processes every value for the specified column. Despite processing every value, approximate processing is significantly faster compared to the precise COUNT(DISTINCT …) function. There are a number of reasons for this but the main one relates to the removal of the sort operation. Because the approximate count distinct function uses a hashing process to manage the counting, there is no need to maintain a sorted list of members. This means that CPU consumption is reduced and both temp usage for sorting and I/O related to sort operations are eliminated. Whilst APPROX_COUNT_DISTINCT is significantly faster, there is actually negligible deviation from the exact result. There is more information about this new feature in the Oracle SQL Language Reference documentation [4].
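One simple way to judge the deviation on your own data is to run the exact and approximate functions side by side, as in this sketch. It assumes the same sales table used in the examples in this paper is available.

```sql
-- Compare the exact and approximate distinct counts in one pass.
-- Assumes the sales table from the other examples in this paper.
SELECT COUNT(DISTINCT cust_id)        AS exact_count,
       APPROX_COUNT_DISTINCT(cust_id) AS approx_count
FROM sales;
```

The size of the difference depends on the data, so it is worth checking against a representative table before relying on the approximate figure in reports.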

Faster way to approximately identify outliers

Using percentiles is perfect for locating outliers in a data set. In the vast majority of cases the aim is to start with the assumption that a data set exhibits a normal distribution. Percentiles are perfect for quickly analyzing the distribution of a data set to check for skew or bimodalities. Probably the most common use case is monitoring service levels, where anomalies are the values of most interest. Taking the data around the 0.13th and 99.87th percentiles (i.e. outside 3 standard deviations from the mean) will pull out the most important anomalies.

To help speed up the process of finding outliers, Oracle Database 12c Release 2 introduces two new approximate functions:

APPROX_PERCENTILE
APPROX_MEDIAN

The percentile function takes a number of input arguments. The first argument is a numeric value between 0 and 1 that identifies the required percentile (0% to 100%). The second argument is optional: if the DETERMINISTIC keyword is provided, it means the user requires deterministic results. This would typically be used where results are shared with other users. Non-deterministic results are only really useful for data scientists who are exploring a data set and need one-off answers for specific queries.

The next argument is also optional and provides more information about the accuracy of the result set: it is either 'ERROR_RATE' or 'CONFIDENCE'. The input expression for the function is derived from the expr in the ORDER BY clause.

[4] HTTP://DOCS.ORACLE.COM/DATABASE/122/SQLRF/APPROX_COUNT_DISTINCT.HTM - SQLRF56900

APPROX_MEDIAN is a convenience function on top of APPROX_PERCENTILE. The APPROX_MEDIAN function takes three input arguments. The first argument is a numeric expression such as a column or a calculation. The second and third arguments are optional and work in the same way as with APPROX_PERCENTILE.

An example using both functions is shown below:

SELECT
  calendar_year,
  APPROX_PERCENTILE(0.25 deterministic)
    WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.25",
  APPROX_PERCENTILE(0.25 deterministic, 'ERROR_RATE')
    WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.25-er",
  APPROX_PERCENTILE(0.25 deterministic, 'CONFIDENCE')
    WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.25-ci",
  APPROX_MEDIAN(amount_sold deterministic) AS "p-0.50",
  APPROX_MEDIAN(amount_sold deterministic, 'ERROR_RATE') AS "p-0.50-er",
  APPROX_MEDIAN(amount_sold deterministic, 'CONFIDENCE') AS "p-0.50-ci",
  APPROX_PERCENTILE(0.75 deterministic)
    WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.75",
  APPROX_PERCENTILE(0.75 deterministic, 'ERROR_RATE')
    WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.75-er",
  APPROX_PERCENTILE(0.75 deterministic, 'CONFIDENCE')
    WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.75-ci"
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY calendar_year
ORDER BY calendar_year;

The results from the above query are shown below and highlight the use of confidence intervals and error rates within result sets:

FIGURE 7 – AN EXAMPLE OF USING APPROXIMATE PERCENTILE AND MEDIAN FUNCTIONS

Understanding error rates and confidence levels

These two additional elements, error rate and confidence level, are a necessary part of the approximate processing model. They provide guidance on the actual accuracy of the result set compared to using the non-approximate, i.e. standard, statistical functions. For example, if an approximate analysis of response times for a specific web page indicates that 98% of users had a response time of 1 second, then in addition to this information we need to understand the margin of error and confidence level to fully understand the meaning of this result. Assuming a margin of error of 2% at a 95 percent level of confidence, it is possible to infer that if the web page was accessed 100 times then the response time would be within 1 second plus or minus 20 milliseconds most (i.e. 95%) of the time.

Using approximate query processing with zero code changes

The new approximate functions offer significant resource and performance benefits. It is possible to force existing COUNT(DISTINCT) and PERCENTILE/MEDIAN queries to use the new approximate processing by using the following init.ora parameters:

approx_for_count_distinct = TRUE

converts existing COUNT(DISTINCT …) functions to use approximate processing.

approx_for_percentile = TRUE

converts existing PERCENTILE/MEDIAN functions to use approximate processing. There is an additional parameter to control the use of deterministic and non-deterministic results:

approx_percentile_deterministic = TRUE/FALSE

These parameters can be set at both the session and database levels. Therefore, these new 12c Release 2 functions can be used with zero change to existing application code.
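As a hedged sketch of the session-level approach, the statements below switch approximate processing on for COUNT(DISTINCT) only, run an unchanged query, and switch it off again. The sales table is the one used in the other examples in this paper. The accepted values for the percentile-related parameters may differ between releases, so they should be checked in the Database Reference for the release in use before being set in the same way.

```sql
-- Enable approximate processing for COUNT(DISTINCT ...) in this session only.
ALTER SESSION SET approx_for_count_distinct = TRUE;

-- The application query itself is unchanged.
SELECT COUNT(DISTINCT cust_id) AS unique_customers
FROM sales;

-- Return the session to exact processing.
ALTER SESSION SET approx_for_count_distinct = FALSE;
```

Setting the parameter per session is a low-risk way to trial the feature, because other sessions continue to receive exact results.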

Aggregating Approximate Results For Faster Analysis

In the past, creating a reusable aggregated result set from a query that included approximate functions, such as APPROX_COUNT_DISTINCT, was not possible because the base fact data was always needed to re-compute each combination of dimension levels included in the GROUP BY clause.

With Database 12c Release 2, Oracle has introduced three new functions to specifically manage the process of creating reusable approximate aggregations:

APPROX_xxxxxx_DETAIL
APPROX_xxxxxx_AGG
TO_APPROX_xxxxxx

These functions avoid the need to rescan the original source data to compute further approximate results for different combinations of dimensions and levels. The key benefit is increased performance and reduced resource requirements.

Building a reusable approximate resultset

The APPROX_xxx_DETAIL function builds a summary result set, which can be persisted as a table or materialized view, for all the dimensional levels in a GROUP BY clause. The data type returned by this function is a BLOB object. For example:

SELECT
  t.calendar_year AS cal_year,
  t.calendar_quarter_desc AS cal_quarter,
  t.calendar_month_desc AS cal_month,
  t.calendar_week_number AS cal_week,
  APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) AS cust_acd
FROM sales s, times t
WHERE t.calendar_year = '2001'
  AND s.time_id = t.time_id
GROUP BY t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc,
  t.calendar_week_number
ORDER BY t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc,
  t.calendar_week_number;

The output from the DETAIL column is not in a user-readable format, as shown below. However, it is easily converted into a readable result set using the TO_APPROX_xxx function, discussed below.

FIGURE 8 – AN EXAMPLE OF USING APPROX_XXX_DETAIL FUNCTION TO CREATE REUSABLE AGGREGATED RESULTSET

Interrogating a reusable approximate resultset

The TO_APPROX_xxx function simply converts the results stored in the BLOB object into a readable, i.e. numeric, format. (Note: to simplify the code, a view named cd_agg is used in the FROM clause. It contains the SQL from the previous statement, with the DETAIL column exposed as cust_acd.)

SELECT
  calendar_year AS cal_year,
  calendar_quarter_desc AS cal_quarter,
  calendar_month_desc AS cal_month,
  calendar_week_number AS cal_week,
  TO_APPROX_COUNT_DISTINCT(cust_acd)
FROM cd_agg
ORDER BY calendar_year, calendar_quarter_desc, calendar_month_desc,
  calendar_week_number;

FIGURE 9 – AN EXAMPLE OF USING TO_APPROX_XXX FUNCTION TO VIEW RESULTS FROM AGGREGATED RESULTSET

Aggregating a reusable approximate resultset to an even higher level

The _AGG function builds a higher-level summary result set (and/or table/materialized view) based on results derived from the _DETAIL function. This avoids having to re-query the base fact table to create a higher level of dimension groupings. The function derives new aggregates from the _DETAIL table and, as with the _DETAIL function, the data is returned as a BLOB object, see below:

SELECT
  calendar_year AS cal_year,
  calendar_quarter_desc AS cal_quarter,
  APPROX_COUNT_DISTINCT_AGG(cust_acd)
FROM cd_agg
GROUP BY calendar_year, calendar_quarter_desc
ORDER BY calendar_year, calendar_quarter_desc;

which returns the following:

FIGURE 10 – AN EXAMPLE OF USING APPROX_XXX_AGG FUNCTION TO CREATE HIGHER LEVEL RESULT SET

As before, this new aggregate result set needs to be queried using the TO_APPROX_ function to convert the data into a user-readable format.
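The two steps can also be combined in one statement by wrapping the _AGG call in the TO_APPROX_ function, as in this sketch. It assumes the cd_agg view and its cust_acd column from the earlier statements; the unique_customers alias is illustrative.

```sql
-- Roll the week-level detail up to quarter level and convert it to a
-- readable number in one statement. Assumes the cd_agg view and its
-- cust_acd column described earlier.
SELECT calendar_year         AS cal_year,
       calendar_quarter_desc AS cal_quarter,
       TO_APPROX_COUNT_DISTINCT(
         APPROX_COUNT_DISTINCT_AGG(cust_acd)
       )                     AS unique_customers
FROM cd_agg
GROUP BY calendar_year, calendar_quarter_desc
ORDER BY calendar_year, calendar_quarter_desc;
```

Because the aggregation works on the stored detail rather than on the sales table, the same view can serve year, quarter, month and week level reports.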

FIGURE 11 – AN EXAMPLE OF USING TO_APPROX_XXX FUNCTION TO EXTRACT RESULTS FROM HIGHER LEVEL RESULT SET

Using Approximate Materialized Views to Support Wide Range of Queries

The previous functions (_DETAIL and _AGG) can be used to create materialized views that support query rewrite for approximate queries, as shown below, assuming that a materialized view has been created based on the query supporting the output shown in Figure 12:

SELECT
  t.calendar_year AS calendar_year,
  t.calendar_quarter_desc AS calendar_quarter_desc,
  t.calendar_month_desc AS calendar_month_desc,
  APPROX_COUNT_DISTINCT(s.cust_id) AS cust_acd
FROM sales s, times t
WHERE t.calendar_year = '2001'
  AND s.time_id = t.time_id
GROUP BY t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc
ORDER BY t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc;

The explain plan for the above query shows that the query has been rewritten to use the materialized view, which is derived from a query returning a BLOB-based result set. This is completely transparent to the calling application and/or user.

FIGURE 12 – AN EXAMPLE OF QUERY REWRITE BASED ON APPROX FUNCTIONS
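
A minimal sketch of how such a materialized view might be defined and how the rewrite could be checked follows. The view name cust_acd_mv is hypothetical, and the tables are assumed to be the sales and times tables used in the query above. Whether the optimizer actually chooses the rewrite depends on settings such as QUERY_REWRITE_ENABLED and on cost estimates.

```sql
-- Hypothetical materialized view holding month-level _DETAIL BLOBs.
-- The name cust_acd_mv is an assumption made for this sketch.
CREATE MATERIALIZED VIEW cust_acd_mv
ENABLE QUERY REWRITE AS
SELECT
  t.calendar_year,
  t.calendar_quarter_desc,
  t.calendar_month_desc,
  APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) AS cust_acd
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY t.calendar_year, t.calendar_quarter_desc, t.calendar_month_desc;

-- Generate a plan for a quarter-level approximate query ...
EXPLAIN PLAN FOR
SELECT
  t.calendar_year,
  t.calendar_quarter_desc,
  APPROX_COUNT_DISTINCT(s.cust_id) AS cust_acd
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY t.calendar_year, t.calendar_quarter_desc;

-- ... and display it to see whether cust_acd_mv is referenced.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Because the view stores mergeable detail rather than final counts, a single month-level view can in principle serve quarter-level and year-level queries as well.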

Using approximate query rewrite with zero code changes

As with approximate queries, existing COUNT(DISTINCT), PERCENTILE and MEDIAN based queries can be rewritten to use approximate materialized views. For more information see the section headed “Using Approximate Query Processing with Zero Code Changes”.
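
As an illustration only, the session-level switch below shows how an unchanged COUNT(DISTINCT) query could be evaluated approximately. The query is a made-up example against the sales and times tables used earlier; it is not taken from the referenced section.

```sql
-- Ask the database to treat COUNT(DISTINCT ...) as
-- APPROX_COUNT_DISTINCT for the current session.
ALTER SESSION SET approx_for_count_distinct = TRUE;

-- The application query itself is left untouched.
SELECT
  t.calendar_year,
  COUNT(DISTINCT s.cust_id) AS cust_count
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY t.calendar_year
ORDER BY t.calendar_year;

-- Switch back to exact processing when precise counts are required.
ALTER SESSION SET approx_for_count_distinct = FALSE;
```

Oracle Database 12c Release 2 also provides the approx_for_aggregation and approx_for_percentile parameters, which extend the same idea to the other aggregate functions named above.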

Conclusion

Oracle’s data aggregation and approximate query processing features provide business users and SQL developers with a simplified way to support the most important operational and business intelligence reporting requirements. By moving processing inside the database, developers can benefit from increased productivity and business users can benefit from improved query performance across a broad range of business calculations.

These key features deliver the following benefits to IT teams and business users:

Increased developer productivity

Minimizes learning effort

Improves manageability

Provides investment protection (adheres to industry standards based syntax)

Delivers increased query speed

The flexibility and power of Oracle’s aggregation features, combined with their adherence to international SQL standards, make them an important tool for all SQL users: DBAs, application developers, data warehouse developers and business users. In addition, many business intelligence tool vendors have recognized the importance of these features and functions by incorporating support for them directly into their products.

Overall, these features make Oracle Database 12c Release 2 the most effective platform for delivering analytical

results directly into operational, data warehousing and business intelligence projects.

Further Reading

See the following links for more information about the in-database analytic features that are part of Oracle Database:

1. Database SQL Language Reference - Oracle and Standard SQL

2. Oracle Analytical SQL Features and Functions - a compelling array of analytical features and

functions accessible through SQL. Available via the Analytic SQL home page on OTN.

3. SQL - the natural language for analysis – a review of the reasons why SQL is the best language for data

analysis. Available via the Analytic SQL home page on OTN.

4. Oracle Statistical Functions - eliminate movement and staging to external systems to perform statistical

analysis. For more information see the SQL Statistical Functions home page on OTN.

5. Oracle Database 12c Query Optimization - providing innovation in plan execution and stability.

The following Oracle whitepapers, articles, presentations and data sheets are essential reading and available via the

Analytic SQL home page on OTN:

a. SQL for Data Validation and Data Wrangling

b. SQL for Analysis, Reporting and Modeling

c. SQL for Advanced Data Aggregation

d. SQL for Approximate Query Processing

e. SQL for Pattern Matching

2. Oracle Magazine SQL 101 Columns

3. Oracle Database SQL Language Reference—T-test Statistical Functions

4. Oracle Statistical Functions Overview

5. SQL Analytics Data Sheet

You will find links to the above papers, and more, on the “Oracle Analytical SQL” web page hosted on the Oracle

Technology Network:

http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html

Oracle Corporation, World Headquarters

500 Oracle Parkway

Redwood Shores, CA 94065, USA

Worldwide Inquiries

Phone: +1.650.506.7000

Fax: +1.650.506.7200

Copyright © 2015, Oracle and/or its affiliates. All rights reserved. This document is provided for information purposes only, and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 1116

CONNECT WITH US

blogs.oracle.com/datawarehousing

facebook/BigRedDW

twitter/BigRedDW

oracle.com/sql

github/oracle/analytical-sql-examples