www.SageLogix.com NoCOUG Summer Conference 2004 Scaling the DW to Infinity 19-August 2004 Tim Gorman SageLogix, Inc.
Dec 28, 2015
www.SageLogix.com
NoCOUG Summer Conference 2004
NoCOUG Summer Conference 2004
Scaling the DW to Infinity
19-August 2004
Tim GormanSageLogix, Inc.
www.SageLogix.com
AgendaAgenda
It’s all in the designDimensional data models
Understanding time
DW “utopia” on OracleRiding the “virtuous cycle”
Sliding down the “death spiral”
www.SageLogix.com
DW Data ModelsDW Data Models
First determine WHAT is going to be presented to the users…
…before discussing or considering HOW to do itToo many people get stuck in the pro’s and con’s of the various solutions without clearly understanding WHAT is going to be presented
Users want to see a simplified data model that is optimized for data retrieval
Dimensional data modelNormalized “star” schema involving FACTs and DIMENSIONs
Understanding TIME in analytic processingStatic current views versus non-volatile time-variant views
Users know what they wantIT must observe and learn from patterns of usage
www.SageLogix.com
Simplistic dimensional data model is favored by most ad-hoc and analytic tools for data warehousing
Ralph Kimball discusses extensively in The Data Warehouse Toolkit
John Wiley & Sons, publisher (ISBN #0471153370)
FACT tables closely resemble the output of a crosstab reportMeasure columns which provide data for the cells of the report
Foreign-key columns to join to dimension tables
DIMENSION tables represent a single level of data normalizationAttribute columns which comprise the row-headers and column-headers of the report
Primary-key columns for joins from fact tables
Optimized for retrieval of dataNot intended for “representing the full richness of the data”
DW Data ModelsDW Data Models
www.SageLogix.com
Customers
Orders
Order Lines
Products
Suppliers
Order Facts
Products DimSuppliers Dim
Customers Dim Time Dim
TransactionalOperational
Entity-RelationalModeling
DimensionalModeling
DW Data ModelsDW Data Models
www.SageLogix.com
Understanding TIMEUnderstanding TIME
Programming and reporting in operational systems usually focus on the present point-in-time
What is the current state of XYZ’s account?
How many widgets were shipped for each month this year-to-date, as of today?
How much inventory is on-hand?
Analytical reporting needs to be able to recreate data as it appeared at any point-in-time
Non-volatile and time-variant dataChanges are inserted, not updated
Joining the current information about an account to a transaction that occurred four years ago
Does not present an accurate picture of the situation four years ago
www.SageLogix.com
Understanding TIMEUnderstanding TIME
Static “type-1” dimensionsAll new data is “merged” (inserted else updated)Appropriate where values are not supposed to change over time
Time (?!?)Geography
Also appropriate where changes are expected to reflect automatically across all related facts
“current-image” views“executive information systems” or “dashboards”
Can be implemented as a “snapshot” or “materialized view” of a time-variant dimension?Key is usually a numeric “surrogate” or “synthetic” key
www.SageLogix.com
Understanding TIMEUnderstanding TIME
Time-variant “type-2” dimensionsAll new data is inserted as a new rowAppropriate for:
Slowly-changing dimensions which might include:ProductsStores
Quickly-changing dimensions which might include:AccountsPeopleCurrency
Appropriate for most reporting and analysis over timePresents a “chronologically consistent” view of data
Key can be a “surrogate key” with a separate date component
www.SageLogix.com
Understanding TIMEUnderstanding TIME
Time-variant “type-3” dimensionsAll new data is inserted as a new row
previous column values retained alongside new column values
Often useful when a change is regarded as “soft”Facts may show impact of previous attribute values
For example, changing sales territories for salespeopleCustomers continue to contact previous sales person for a time after the change
Impact of previous sales person persists for a whilePrevious sales person may have alienated customers
Previous sales person may have been beloved of customers
So, while reporting on new/current sales personmight be useful to also know previous sales person
www.SageLogix.com
The Virtuous CycleThe Virtuous Cycle
Time-variant dimensional data model implies:Insert, index, and analyze each row of data only once
Insert-only processing implies:Time-variant tables/indexes and tablespacesPartitioning and ETL using EXCHANGE PARTITION operations
Using EXCHANGE PARTITION enables:Direct-path loads of data with no interruption of end-user queriesPerformance scalability from partition pruningTime-variant tablespaces set to READ ONLY as data agesBitmap indexes and bitmap join indexes become feasible!
Using direct-path loads of data enables:Table compressionNOLOGGING and PARALLEL DML operations
www.SageLogix.com
The Death SpiralThe Death Spiral
Not using EXCHANGE PARTITION implies:ETL using “conventional-path” INSERT, UPDATE, and DELETE operations
Conventional-path operations are trouble with:Bitmap indexes and bitmap-join indexes
Forcing frequent complete rebuilds until they get too big
Contention in Shared Pool, Buffer Cache, global structures
Mixing of queries and loads simultaneously on table and indexes
Periodic rebuilds/reorgs of tables if deletions occur
Full redo logging and undo transaction tracking
www.SageLogix.com
Just say no… :-)Just say no… :-)
Don’t fall into the trap of misunderstanding timeAll data (dimensions and facts) should be stored primarily in a time-variant fashion
Inserts only
No updates
Current-image tables have their purposeBut they should be derived from time-variant tables
Understand dimensional data modelingRead Kimball’s book!
It is true normalization, but only for the purposes of analysis and reporting
Not for portraying the full richness of the attributes of the data
www.SageLogix.com
UtopiaUtopia
What are some goals for a DW based on Oracle?Dimensional data model for fast query
Star transformations using bitmap and bitmap-join indexes
Partition pruning during queries
Non-intrusive ETL processingDirect-path “bulk” loading without interrupting queries
Newly-loaded data “published” simultaneously to users
InfrastructureTable compression
READ ONLY tablespacesBackups that only backup what needs to be backed up
Less-frequently accessed data on less-expensive storage media
www.SageLogix.com
Star TransformationsStar Transformations
Why are “star” join-transformations desirable?Typical Oracle “nested loops”, “sort/merge”, or “hash” join methods tend to start the query from a dimension table and join into the fact table using only one index
Further “filtering” is performed by joining to other dimensions
Very “sequential” and not optimal
Star transformations do the following:Using database statistics, identifies the pattern of one large table at the center of a query involving two or more smaller tables
Resolves a result set from each of the dimension tables
Merges all of the results sets from all of the dimensions
Uses powerful BITMAP MERGE operation on fact table
www.SageLogix.com
Star TransformationsStar Transformations
1. Drive query from one of the dimensions
2. Join to the fact from that dimension
3. Filter on the fact by joining to other dimensions
www.SageLogix.com
Star TransformationsStar Transformations
1. Find result set in each dimension
2. Merge results from all dimensions
3. Join to the fact from merged result set, using BITMAP MERGE index scan
www.SageLogix.com
Star TransformationsStar Transformations
Enabling “star” join transformations in OracleParameter settings:
COMPATIBLE = 8.1.0 or higherOPTIMIZER_FEATURES_ENABLE = 8.1.0 or higherSTAR_TRANSFORMATION_ENABLED = TEMP_DISABLE
The optimizer will consider performing a cost-based query transformation on the star query but will not use temporary tables in the star transformation.Lots of bugs associated with setting to TRUE
BITMAP_MERGE_AREA_SIZE = <huge!>Default is 1M. Set to 16M? 32M? 128M? 512M?
HASH_JOIN_ENABLED = TRUE
Bitmap indexesAll fact table foreign-key columns must have bitmap indices(Optional) all dimension table non-key attribute columns should have bitmap indicesALTER TABLE … MINIMIZE RECORDS_PER_BLOCK
www.SageLogix.com
Star TransformationsStar Transformations
STAR_TRANSFORMATION_ENABLE = TEMP_DISABLESetting to TRUE causes star transformation to create then drop a global temporary table to store intermediate results
Buggy, nasty plan…
ALTER TABLE … MINIMIZE RECORDS_PER_BLOCKMetalink note #103490.1 provides explanation
ALTER TABLE command scans rows in the table and calculates a value for “maximum number of rows in a block”, which is used by CREATE BITMAP INDEX operations
If MINIMIZE RECORDS_PER_BLOCK not performed, a default max number of rows value is used instead
MINIMIZE allows smaller bitmap indexes to be created
www.SageLogix.com
Star TransformationsStar Transformations
A “star transformation” is not named as such in the EXPLAIN PLAN display
Instead, it is indicated by the following operations:
TABLE ACCESS BY ROWID OF (<fact-table-name>)
BITMAP AND
BITMAP MERGE
(data retrieved from a dimension)
BITMAP MERGE
(data retrieved from a dimension)
BITMAP MERGE
(data retrieved from a dimension)
www.SageLogix.com
Star TransformationsStar Transformations
Bitmap-join index (BJI) is an optimization of the initial phase of a star transformation
Index itself is comprised of saved data from the initial merge structure
BJI is an index on a table using data from one or more table joins
Almost like a materialized view of one of the steps of a BITMAP MERGE operation during a star query
SQL> CREATE BITMAP INDEX c_s_p_bjix1 2 ON SALES (c.region, p.category) 3 FROM SALES s, CUSTOMERS c, PRODUCTS p 4 WHERE c.cust_id = s.cust_id 5 AND p.prod_id = s.prod_id; Index created.
www.SageLogix.com
Partition pruningPartition pruning
Oracle offers a total of five ways to partition tables and indexes
RANGE of data values
LIST of specified data values
HASH (pseudo-random distribution of data values)
Composite RANGE-HASH
Composite RANGE-LIST
Oracle cost-based optimizer can prune partitions that will not be utilized from the query
Explicit pruning (partition/subpartition name specified in query)
Implicit pruning (partition-key columns are referenced in query)
Don’t scan what you don’t need
www.SageLogix.com
Partition pruningPartition pruning
The decision of what columns to partition upon must be carefully considered
Ultimately, it comes down to the question: “Which queries do we want to optimize?”Think of partitioning as a gun with either one or two bullets
Use them wisely…
Also, bear in mind that RANGE partitioning is very important to ETL processing also…
So, you might be able to expend one of your bullets on both goals
Optimizing queries using partition pruningOptimizing ETL using EXCHANGE PARTITION
www.SageLogix.com
Partition PruningPartition Pruning
Explicit partition pruning
SELECT …
FROM sales_fact PARTITION (sales_200403)
WHERE …
SELECT …
FROM sales_fact SUBPARTITION (sales_200403_sp12)
WHERE …
www.SageLogix.com
Implicit partition pruning
select …
from txn_fact
where posting_date between ‘03-May-2004’
and ‘05-May-2004’
and …
select …
from txn_fact
where state = ‘NJ’
and …
Partition PruningPartition Pruning
www.SageLogix.com
Partition PruningPartition Pruning
Implicit partition pruning can be disabled if there is an expression involving the partition-key column
Partition pruning is disabled in this situation:select …from sales_factwhere trunc(posting_date) = ‘15-MAY-2004’and …
Partitiong pruning is enabled in this situation:select …from sales_factwhere posting_date >= ‘15-MAY-2004’and posting_date < ‘16-MAY-2004’and …
www.SageLogix.com
Partition PruningPartition Pruning
EXPLAIN PLAN displays are a little confusing:PARTITION RANGE (ALL)PARTITION LIST (ALL)PARTITION HASH (ALL)
No partition pruning occurring
PARTITION RANGE (ITERATOR)PARTITION LIST (ITERATOR)PARTITION HASH (ITERATOR)
Pruning involving two or more partitions
No mention of partitions in the EXPLAIN PLAN generally indicates that pruning has occurred to include one and
only one partition
www.SageLogix.com
Non-intrusive ETL processing
Non-intrusive ETL processing
Extraction, transformation, and load (ETL) is the data load process for a data warehouse
Traditionally, this is an off-hours activity, performed when end-users are not using the database
However, offline “load windows” are a luxury few can afford anymore…
End-user activities often demand round-the-clock accessMultiple time-zones, extremely long-running queries
As data volumes grow, ETL processing extends to exceed any “ load window”
www.SageLogix.com
Non-intrusive ETL processing
Non-intrusive ETL processing
Goal should be to avoid transactional MERGE logic (i.e. update-else-insert) wherever possible
Bulk (direct-path or APPEND) INSERT
Index each row only once
Analyze each row only once
Bulk (a.k.a direct-path or APPEND) loadsSQL*Loader
Parallel CREATE TABLE AS SELECT and INSERT … SELECT
EXCHANGE PARTITIONStandard load technique for large tables
www.SageLogix.com
Direct-path loadsDirect-path loads
Bulk loading feature first introduced in Oracle v6 in FASTLOAD utility on MVS to compete with DB2
Incorporated into SQL*Loader DIRECT=TRUE in v7.0Extended to CREATE TABLE AS SELECT in v7.2Extended to INSERT /*+ APPEND */ in v8.0Enhanced in v8.1 to leave behind a direct-path log for use by MV “fast” refresh
Loads data outside of “managed space”Load above the “high-water mark” in target table in SQL*Loader DIRECT=TRUE PARALLEL=FALSE
After successful completion, high-water mark is raised to include newly-loaded rows in the table
Load into TEMPORARY segments in all other load mechanismsAfter successful completion, TEMPORARY segments are merged into the table segment
www.SageLogix.com
Direct-path loadsDirect-path loads
Fast bulk load mechanism bypasses:Buffer Cache
Though which “conventional-path” INSERTs, and all UPDATE and DELETE operations pass
Log Buffer and entire redo log generation processIf keyword NOLOGGING is utilized
Processes format blocks with newly-inserted rows in private process memory
Writes them directly to datafiles
www.SageLogix.com
Direct-path loadsDirect-path loads
NOLOGGING options exist for all direct-path or APPEND operations
Available only for INSERT operations, never UPDATE or DELETE
A potential performance enhancementIf the redo logging stream is truly causing performance problems
Don’t assume that this is so, please verify!
Flip side: NOLOGGING means no recoverabilityRMAN incremental backup capability can help here…
www.SageLogix.com
Exchange PartitionExchange Partition
EXCHANGE PARTITION is crucial to non-intrusive ETLData is transformed, cleansed, loaded, indexed, analyzed offline from “live” tables and indexes
Direct-path load operations are especially tough on “live” indexes
“In-flight” queries continue to process during and after EXCHANGE PARTITION operations
Oracle’s read-consistency mechanisms cause existing operations to use data that was exchanged away from the table, and new operations to use data exchanged into the table
Local-partitioned indexes and statistics are exchanged as well
Global-partitioned indexes are maintained during exchange operation
www.SageLogix.com
Exchange PartitionExchange Partition
EXCHANGE PARTITION can be transparent to in-flight queries
DML locks prevent exchange on objects where INSERT, UPDATE, DELETE, and SELECT … FOR UPDATE in progressWhat happens to in-flight queries if standalone table “TT” is truncated or dropped immediately after the exchange completes?
P11 P12 P13 P14 P15 P15 P17 P18
TT
Queries started after EXCHANGEutilize the segment that is partitionP18 after the exchange
Queries that were in-flight beforeEXCHANGE continue to utilizethe segment that was partitionP18 before the exchange
www.SageLogix.com
Exchange PartitionExchange Partition
The basic technique of bulk-loading new data into a temporary “load table”, which is then indexed, analyzed, and then “published” all at once to end-users using the EXCHANGE PARTITION operation, should be the default load technique for all large tables in a data warehouse
fact tables
slowly-changing or quickly-changing dimensions
Assumptions for this example:Composite partitioned fact table named TXN
Range partitioned on DATE column TXN_DATE
Hash partitioned on NUMBER column ACCT_KEY
Data to be loaded into partition P20040225 on TXN
www.SageLogix.com
Exchange PartitionExchange Partition
1. Create temporary table TXN_TEMP as a hash-partitioned table
2. Perform parallel, direct-path load of new data into TXN_TEMP
3. Gather CBO statistics on table TXN_TEMP4. Create indexes on the temporary hash-partitioned
table TXN_TEMP corresponding to the local indexes on TXN
• using PARALLEL, NOLOGGING, and COMPUTE STATISTICS options
5. alter table TXNexchange partition P20040225 with table TXN_TEMPincluding indexes without validation update global indexes;
6. Table TXN_TEMP is left ready for next load cycle
www.SageLogix.com
22-Feb2004
23-Feb2004
24-Feb2004
(empty) 25-Feb2004
Composite-partitionedtable TXN
Hash-partitionedtable TXN_TEMP
2. BulkLoads
5. EXCHANGE PARTITION
4. IndexCreates
3. Analyze
Exchange PartitionExchange Partition
www.SageLogix.com
Exchange PartitionExchange Partition
It is a good idea to encapsulate this logic inside PL/SQL packaged- or stored-procedures:
SQL> execute exchpart.prepare(‘TXN_FACT’,’TMP_’, - 2 ’25-FEB-2004’,’27-FEB-2004’);SQL> alter session enable parallel dml;SQL> insert /*+ append nologging parallel(n,4) */ 2 into tmp_txn_fact n 3 select /*+ full(x) parallel(x,4) */ * 4 from stage_txn_fact x 5 where load_date >= ‘25-FEB-2004’ 6 and load_date < ‘28-FEB-2004’;SQL> commit;SQL> execute exchpart.finish(‘TXN_FACT’,’TMP_’);
DDL for “exchpart.sql” posted at http://www.EvDBT.com/tools.htm
www.SageLogix.com
Exchange PartitionExchange Partition
It is wise to encapsulate this partition-exchange functionality in a PL/SQL package- or stored-procedure
Along with the related functionality to:Gather CBO statistics on the table
Build indexes (in the proper related tablespaces with the proper parameters)
Also, the use of stored procedures to encapsulate this logic is crucial for security
You do NOT want to grant anybody the ability to ALTER TABLE or CREATE TABLE
You do NOT want anybody connecting as the table owner schema!
Stored procedures, once created, can be granted
www.SageLogix.com
Publishing Loaded Data
Publishing Loaded Data
Coordinating the final EXCHANGE PARTITION operation permits all of the newly-loaded data to appear to the end-users simultaneously
Publishing data
If newly-loaded data is becoming visible to users gradually
Then a “load window” when new queries cannot be started becomes necessary
Exchange Partition load techniques make “load windows” of restricted activity unnecessary
www.SageLogix.com
Table CompressionTable Compression
Available in Oracle9i Release 2 (v9.2.0)Physical storage attribute for tables and materialized views
[ CREATE | ALTER ] TABLE …[ COMPRESS | NOCOMPRESS ] …
Restrictions:Can be used for RANGE or LIST partitions
But cannot be used with HASH partitionsBut cannot be used for HASH or LIST sub-partitions
Can be specified for NESTED tablesBut cannot be used with any LOB construct
Such as CLOB, BLOB, BFILE, and VARRAY
Not valid for index-organized or external tables
www.SageLogix.com
Table CompressionTable Compression
Storing repeated data values once in each blockA symbol table of distinct data values created in each block
The symbol table is stored as another table in the block
Each column in a row in a block references back to an entry in the symbol table in the block
Header & Tailer
Table & Column Map
ITL
Symbol table
Free
Row data
www.SageLogix.com
Table CompressionTable Compression
Only bulk-loading INSERT operations do compression
CREATE TABLE … AS SELECT …INSERT /*+ APPEND */ (single-threaded and parallel)ALTER TABLE … MOVE …ALTER TABLE … MOVE PARTITION …ALTER TABLE … MERGE PARTITION …ALTER TABLE … SPLIT PARTITION …SQL*Loader DIRECT=TRUE
Conventional INSERT operations unaffectedSELECT, UPDATE, and DELETE behavior also unaffected
www.SageLogix.com
Table CompressionTable Compression
SELECTImpressive performance improvements!!!
Less I/O due to fewer blocksCompression ratio is linear with performance improvements
Better impact on FULL table scansIndexed scans still exhibit less-impressive improvements
Conventional and direct-path INSERTNoticeable performance slowdown (2-3x)
UPDATEVery negative performance impact observed (4-8x)
DELETESome performance improvements observed
www.SageLogix.com
Table CompressionTable Compression
Columns cannot be added, renamed, modified, or dropped on compressed tables or partitioned tables with compressed partitions
Might be fixed in 10g?
Local partitioned indexes are marked UNUSABLE during compression
Includes indexes on non-partitioned tables
Must be rebuilt
Global partitioned indexes can be maintained using UPDATE GLOBAL INDEXES
Includes non-partitioned indexes on partitioned tablesA rare situation when GLOBAL indexes can be more highly available than LOCAL indexes!
www.SageLogix.com
READ ONLY tablespacesREAD ONLY tablespaces
Partitioning by a datetime value allows the time-variant nature of data to be exploited
Within the same table, different partitions can exist in different tablespaces
Tablespaces containing older data can be set READ ONLY, while tablespaces containing active and newer data can stay READ WRITE
Over time, the majority of data in any DW can be READ ONLY
READ ONLY tablespaces can be:Backed up less frequently (i.e. quarterly, annually, etc) than “active” READ WRITE tablespaces, with no compromise on recovery strategy
Moved from faster, more-expensive storage to cheaper, less-expensive storage, without interrupting operations
www.SageLogix.com
Usage TrackingUsage Tracking
Usage tracking and data-usage analysisCan’t improve that which is not measured
Understanding what is going on should be built in to the DW infrastructure
Should be objective, factual, and measured
Options for usage-tracking:Oracle STATSPACK (included with Oracle RDBMS)
Ambeo Usage Tracker (http://www.ambeo.com)
Teleran iSight (http://www.teleran.com)
www.SageLogix.com
Q&AQ&A
Questions?
Email: [email protected]
Personal website: http://www.EvDBT.com/
Corporate website: http://www.SageLogix.com/