IBM® DB2® for Linux®, UNIX®, and Windows®
Best Practices
Data Life Cycle Management

Christopher Tsounis, Executive IT Specialist, Information Management Technical Sales
Enzo Cialini, Senior Technical Staff Member, DB2 Data Server Development

Last updated: 2009-10-23
• Operations that require exclusive access to the table (for example, DROP TABLE) cannot proceed until the LOAD operation is terminated.
• If you are appending data to a partition, specify LOAD INSERT. A LOAD REPLACE of a partition replaces the entire table (all partitions), not just that partition; see the sketch after this list.
• Avoid attaching a partition with the same name as a detached partition. This
results in a duplicate name until asynchronous index cleanup (AIC) completes.
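For illustration, a minimal sketch of an append-only load, assuming a hypothetical delimited input file and a range-partitioned table named sales:

LOAD FROM sales_new.del OF DEL
INSERT INTO sales

-- By contrast, LOAD ... REPLACE INTO sales would empty and reload
-- every partition of the table, not just the range the new data
-- falls into.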
Rolling in data: Which solution to use?

There are several factors that affect how you choose the best roll-in solution for your installation:
• Minimizing the time it takes to bring new data into the system and make it
available
• Minimizing the amount of logging activity that occurs as part of the SET INTEGRITY operation during roll-in
• Whether you require continuous updates rather than a daily batch process
• Maximizing compression for new ranges to effectively manage data skew
The following are the two techniques for rolling in data with table partitions:
1. ALTER/ATTACH
With the ALTER/ATTACH method, you first populate the table offline, and then attach it as a partition. You must run SET INTEGRITY (a potentially long-running operation for large data volumes). The impact of running SET INTEGRITY can be reduced by using partitioned indexes in DB2 Version 9.7.
Advantages:
• Concurrent access
• All previous partitions are available for updates
• No partial data view (new data cannot be seen until SET INTEGRITY completes)
Disadvantages:
• Additional log space is required
• Long elapsed times
• Draining of queries is required
2. ALTER/Add
With the ALTER/Add method, you add an empty table partition, and then populate it by using the LOAD utility or INSERT statements. You do not need to run SET INTEGRITY.
Advantages:
• Faster elapsed times
• SET INTEGRITY is not required
• Less log space for global index maintenance
Disadvantages:
• A partial data view occurs when you use INSERT statements (but not with the LOAD utility)
• The LOAD utility allows only read-only access to older partitions while it runs
Recommendation:
For larger data volumes, use the ALTER/Add method for roll-in of a table partition, or use MDC for roll-in if many nonpartitioned indexes are deployed.
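The following is a minimal sketch contrasting the two techniques, assuming a range-partitioned table named sales; the partition names, boundaries, and staging table are hypothetical:

-- ALTER/ATTACH: populate sales_2009q4_stage offline first, then:
ALTER TABLE sales
ATTACH PARTITION p2009q4
STARTING FROM ('2009-10-01') ENDING AT ('2009-12-31')
FROM TABLE sales_2009q4_stage

SET INTEGRITY FOR sales ALLOW WRITE ACCESS IMMEDIATE CHECKED

-- ALTER/Add: add an empty range first, then populate it with
-- the LOAD utility or INSERT statements:
ALTER TABLE sales
ADD PARTITION p2010q1
STARTING FROM ('2010-01-01') ENDING AT ('2010-03-31')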
Best practices for roll-in of compressed table partitions

These best practices build on the ALTER/ATTACH and ALTER/Add methods described in the preceding section.
For Version 9.1, you can rapidly attach a table partition with compressed data (large data volumes) by using the following technique:
1. Load a subset of data (a true random sample) into a separate DB2 table.
2. Alter the standalone table to enable compression.
3. Reorganize the subset of data to build a compression dictionary.
4. Empty the table or retain minimal data (so that the dictionary is retained).
5. ALTER/ATTACH the table as a new table partition (the dictionary is retained).
6. Execute SET INTEGRITY (this is rapid, due to minimal data).
7. Populate data by using the LOAD utility or INSERT statements (compression
will occur). For applications with continuous updates, load data into a staging
table using the LOAD utility. Then, use an insert with a sub-select from the
staging table or run an ETL (extract, transform, and load) job to update the
primary tables (compression will occur). The roll-in of data can be improved
further if you exploit the benefits of MDC within the table partition.
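A minimal sketch of these steps, assuming a range-partitioned table named sales and hypothetical staging-table and file names:

-- Steps 1-3: load a random sample into a standalone table, enable
-- compression, and build the dictionary with an offline reorganization
CREATE TABLE sales_2009q4_stage LIKE sales
LOAD FROM sample.del OF DEL INSERT INTO sales_2009q4_stage
ALTER TABLE sales_2009q4_stage COMPRESS YES
REORG TABLE sales_2009q4_stage

-- Step 4: empty the table; the compression dictionary is retained
DELETE FROM sales_2009q4_stage

-- Steps 5-6: attach the table as a new partition and validate
-- (SET INTEGRITY is fast because the table is nearly empty)
ALTER TABLE sales
ATTACH PARTITION p2009q4
STARTING FROM ('2009-10-01') ENDING AT ('2009-12-31')
FROM TABLE sales_2009q4_stage

SET INTEGRITY FOR sales IMMEDIATE CHECKED

-- Step 7: populate; rows are compressed using the retained dictionary
LOAD FROM sales_2009q4.del OF DEL INSERT INTO sales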
For Version 9.5, the technique to rapidly roll in a compressed table partition is simplified by automatic dictionary creation:
1. Add an empty table partition (ALTER/Add).
2. Populate the partition with data by using the LOAD utility or an INSERT/SELECT statement (data is compressed with automatic dictionary creation).
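A minimal sketch of the Version 9.5 flow, with the same hypothetical names; it assumes the sales table was created with the COMPRESS YES attribute so that automatic dictionary creation applies:

ALTER TABLE sales
ADD PARTITION p2010q1
STARTING FROM ('2010-01-01') ENDING AT ('2010-03-31')

-- automatic dictionary creation builds the dictionary once enough
-- data has arrived; subsequent rows are compressed as they are loaded
LOAD FROM sales_2010q1.del OF DEL INSERT INTO sales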
Note that a full offline reorganization of a fully loaded partition is likely to achieve better compression than this method achieves. DB2 Version 9.7 Fix Pack 1 supports rapid reorganization by partition when you use partitioned indexes, which improves compression results.
Best practice for roll-in and roll-out with continuous updates

This database design combines various features of the DB2 database system to facilitate roll-in and roll-out of data with continuous update requirements.
This design is for applications with the following characteristics:
• Continuous updates occur all day long (which prevents using the ALTER/Add method to attach a partition).
• Data is added daily.
• Queries frequently access a certain day.
• Table partitioning on day results in too many partitions (for example, 365 days
times 3 years).
• Roll-out occurs weekly or monthly (typically on a reporting boundary).
Recommended database design:
To facilitate the roll-in of data, specify a single-dimension MDC on day (see the section
“Features of MDC that benefit roll-in and roll-out of data”).
To facilitate the roll-out of data, specify a table partition range per week or month. This
provides the same time dimension as MDC but at a coarser scale.
Applications with long-running reports might not be able to drain queries for the execution of the DB2 LOAD utility. The best practice in this case is to use the LOAD utility to rapidly load data into staging tables, and then populate the primary tables by using an insert with a sub-select.
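A minimal sketch of this design, assuming a hypothetical sales table: a generated day_key column provides the single MDC dimension on day, and the table is range partitioned by month:

CREATE TABLE sales (
  sales_date DATE NOT NULL,
  store_id INTEGER,
  amount DECIMAL(12,2),
  day_key INTEGER NOT NULL
    GENERATED ALWAYS AS (INTEGER(sales_date))
)
PARTITION BY RANGE (sales_date)
  (STARTING FROM ('2009-01-01') ENDING AT ('2009-12-31') EVERY 1 MONTH)
ORGANIZE BY DIMENSIONS (day_key)

-- continuous roll-in: LOAD into a hypothetical staging table with the
-- same user columns, then populate the primary table with an insert
LOAD FROM today.del OF DEL REPLACE INTO sales_stage
INSERT INTO sales (sales_date, store_id, amount)
SELECT sales_date, store_id, amount FROM sales_stage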
After roll-out: How to manage data growth and retention?

To satisfy corporate policy, government regulations, or audit requirements, you might need to retain your data and keep it accessible for long periods of time. For example,
the Health Insurance Portability and Accountability Act (HIPAA) contains medical
record retention requirements for health-care organizations. The Sarbanes-Oxley Act sets
out certain record retention requirements for corporate accountants. Additionally, some
enterprises are also finding value in performing analytics on historical data and are
therefore retaining data for longer durations.
Therefore, in addition to implementing a suitable roll-in and roll-out strategy and an
appropriate database design, you need to consider the complete lifespan of your data
and include a policy for data retention and retrieval. You could do nothing and continually add hardware capacity and resources to accommodate the additional data growth for retention purposes; however, there are better practices for data retention, as described in this paper.
Using UNION ALL views

One practice is to keep all the data in the database but roll out certain ranges for retention and create UNION ALL views over the ranges that require easy accessibility.
The following example demonstrates how to create a UNION ALL view:
CREATE VIEW all_sales AS
(
SELECT * FROM sales_0105
WHERE sales_date BETWEEN '01-01-2005' AND '01-31-2005'
UNION ALL
SELECT * FROM sales_0205
WHERE sales_date BETWEEN '02-01-2005' AND '02-28-2005'
UNION ALL
...
UNION ALL
SELECT * FROM sales_1207
WHERE sales_date BETWEEN '12-01-2007' AND '12-31-2007'
);
Using UNION ALL views addresses data retention and real-time accessibility while keeping all the data maintained online in the database on primary storage. A drawback of this method is that you might be unnecessarily maintaining this data in associated backup images. Also, historical data typically does not require high performance, so it does not need the indexing or other high-cost treatment applied to your primary data.
There are a variety of ways you could use UNION ALL views:
• Access active data using UNION ALL views and keep your historical data
compressed in a range-partitioned table.
• Keep active data in a range-partitioned table and use a UNION ALL view to access historical data in another range-partitioned table.
Using UNION ALL views has some limitations: some complex predicates and joins are not pushed down into UNION ALL views. When you have a large number of ranges, use range-partitioned tables instead.
However, in some situations UNION ALL views are advantageous. For example, a
UNION ALL view may work in a federated environment, whereas a range-partitioned
table does not.
Although UNION ALL views may be useful in some environments, DB2 version 9.7
users should strongly consider migrating to table partitioning.
Using IBM Optim Data Growth Solution

Depending on your service level agreement (SLA) objectives for your historical data, usually the best practice to address both data growth and retention is to implement data archiving with IBM Optim™ Data Growth Solution.
IBM Optim Data Growth Solution is a leading solution for addressing growth,
compliance and management of data. It preserves application integrity by archiving
complete business objects, rather than single tables. For example, it retains foreign keys
and preserves metadata within the archive. These features enable you to have:
• Flexible access to data.
• The ability to selectively or fully restore archived data into the original database table, into a new table, or even into an alternate database.
The following steps guide you through the process of determining how best to
implement your archiving strategy.
STEP 1: Classify your applications
First, you need to classify your applications according to their archival requirements.
By understanding which transactions you need to retain from your application data,
you can group applications with similar data requirements for archive accessibility
and performance. Some applications require only current transactions be retained;
some require access to only historical transactions; and others require access to a mix
of current and historical transactions (with a varying current-to-historical ratio).
Also, consider the service level agreement (SLA) objectives for your archived data.
An SLA is a formal agreement between groups that defines the expectations between
them and includes objectives for items such as services, priorities, and
responsibilities. SLA objectives are often formulated using response time goals. For
example, a specific human resources report might need to run, on average, within 5
minutes.
STEP 2: Assess the temperature of your data
Data derives its “temperature” from the following criteria:
• How frequently the data is accessed
• How long it takes to access the data
• How rapidly the data changes (volatility)
• User and application requirements
The temperature varies from enterprise to enterprise, but typically the data
temperatures fall into common classifications across industries. The following table
provides guidelines for data temperatures.
Factoid | Data temperature
Tactical Data: The bulk of the queries are for current data, accessed frequently and heavily, requiring a quick response time turnaround. | Hot
Traditional Decision Support Data: Queries access this data less frequently, and data retrieval does not require the urgency of a quick turnaround in response time. | Warm
Deep Historical Data: Queries rarely access this data, but it must be available for periodic access. | Cold
Regulatory: Data that needs to be available on an exception basis. | Dormant
There are various means of assessing the temperature of data. Consider business and
application definitions and requirements, roll-out criteria, and workload and query
tracking statistics as potential methods for determining how to classify your data
according to temperature. Gather the following potential workload and query
information to assess the data temperature:
• Which objects are (and are not) being accessed
• The frequency with which each object is accessed
• The common time intervals at which objects are accessed (for example, THIS_WEEK, LAST_WEEK, THIS_QUARTER, LAST_QUARTER)
• Which data within an object is being accessed
You can use DB2 Version 9.5 workload management (WLM) to assist in discovering
data temperatures. The WLM historical analysis tool provides statistics on which
tables, indexes and columns have, or have not, been accessed, along with the
associated frequency.
The WLM historical analysis tool consists of two scripts:
• wlmhist.pl: generates historical data
• wlmhistrep.pl: produces reports from the historical data
To discover which data within an object is being accessed, analyze the SQL statement
using an ACTIVITIES event monitor to collect data on workload activities, including
the SQL statement text. You might want to collect information about workload
management objects such as workloads, service classes, and work classes (through
work actions). Enable activity collection using the COLLECT ACTIVITY DATA …
WITH DETAILS clause of the CREATE or ALTER statements for the workload
management objects for which you want to collect information, as shown in the
following example:
ALTER SERVICE CLASS sysdefaultsubclass
UNDER sysdefaultuserclass
COLLECT ACTIVITY DATA ON ALL WITH DETAILS
The WITH DETAILS clause enables collection of the statement text for both static and
dynamic SQL.
If applications make use of parameter markers within the statement text, you should
also include the AND VALUES clause, (so that you have COLLECT ACTIVITY
DATA … WITH DETAILS AND VALUES). The AND VALUES clause collects the
data values associated with the parameter markers in addition to the detailed
statement information.
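For example, a minimal sketch of creating and enabling an ACTIVITIES event monitor; the monitor name db2activities is hypothetical, and WRITE TO TABLE creates target tables (such as ACTIVITYSTMT_DB2ACTIVITIES) that hold the captured statement text:

CREATE EVENT MONITOR db2activities FOR ACTIVITIES WRITE TO TABLE
SET EVENT MONITOR db2activities STATE 1

-- after the workload has run, inspect the captured statements
SELECT appl_id, uow_id, activity_id, stmt_text
FROM activitystmt_db2activities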
STEP 3: Discover and classify your business objects
Business objects, such as insurance claims, invoices, or purchase orders, represent
business transactions. By classifying your business objects, you can begin to define
rules and associated business drivers for managing these objects at different stages in
the data life cycle.
From a database perspective, a business object represents a group of related rows
from related tables.
Simplified example of a business object:
Given the following three tables:
PROJECT
PROJNO | PROJNAME | DEPTNO | PRJENDATE
OP1010 | OPERATION | E11 | 2/1/2006
OP1000 | OPERATION SUPPORT | E01 | 2/1/2006
MA2113 | W L PROD CONT PROGS | D11 | 12/1/2002
MA2112 | W L ROBOT DESIGN | D11 | 12/1/2002
MA2111 | W L PROGRAM DESIGN | D11 | 12/1/2002
MA2110 | W L PROGRAMMING | D11 | 2/1/2006
MA2100 | WELD LINE AUTOMATION | D01 | 2/1/2006
IF2000 | USER EDUCATION | C01 | 2/1/2006

EMPLOYEE
EMPNO | LASTNAME | WORKDEPT
310 | O'CONNELL | E11
170 | CIALINI | D11
140 | TYRRELL | C01
160 | CASSELLS | D11
150 | GOODMAN | D11
130 | VINCENT | C01
60 | TSOUNIS | D11

DEPARTMENT
DEPTNO | DEPTNAME
E11 | OPERATIONS
D21 | ADMINISTRATION SYSTEMS
D11 | MANUFACTURING SYSTEMS
C01 | INFORMATION CENTER
The business object is the combination of related rows from the PROJECT, DEPARTMENT, and EMPLOYEE tables.
For data retention and archiving purposes, you want the complete business object to
be represented such that you have a historical “point-in-time” snapshot of a business
transaction. Creating a historical snapshot requires both transactional detail and
related master information, which involves multiple tables in the database.
Archiving complete business objects allows the archives to be intact and accurate and
to provide a standalone repository of transaction history. To respond to inquiries or
discovery requests, you can query this repository without the need to access “hot”
data.
In this example, to ensure the complete object is available, the archived business
object must consist of associated data from the DEPARTMENT and EMPLOYEE
tables. After archiving, you would want to delete only the data in the production PROJECT table, not the associated EMPLOYEE and DEPARTMENT data.
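For example, a hypothetical query that materializes this business object for projects that ended more than five years ago, joining the transactional detail to its related master data:

SELECT p.projno, p.projname, p.prjendate,
       d.deptno, d.deptname, e.empno, e.lastname
FROM project p
JOIN department d ON d.deptno = p.deptno
JOIN employee e ON e.workdept = p.deptno
WHERE p.prjendate < (CURRENT_DATE - 5 YEARS)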
You can discover business objects based on data relationships within the schema, as
demonstrated in this example. However, you might also want to include other
related tables that do not have any schema relationship, but, for example, might be
related through use of an application. In addition, you might elect to remove certain
discovered relationships from the business object.
STEP 4: Produce your comprehensive data classification
After you have classified your applications and business objects and determined
their associated data temperatures, you can produce a data classification table to
summarize this information. This table articulates the aging of the data.
The following table provides a sample data classification:
Application | Business Object | Production | Online Archive | Offline Archive | Delete
AppA | Claims | 0-2 yrs | 3-5 yrs | 6-10 yrs | >10 yrs
STEP 5: Determine the post-archive storage type
To determine what storage type is most appropriate for your aged data, consider the
following questions:
• Who needs to access the archive data, and for what purpose?
• What are the response time expectations?
• How will the archive data age?
• How many storage tiers and what types of storage should be deployed (for example, SAN, WORM, or tape)?
For example, for an online archive you could use ATA disks or large-capacity, slower drives. For an offline archive, you could use tape or WORM devices (IBM DR550, EMC Centera).
[Figure: Tiered archive architecture. Current data (0-2 years) resides in the production database; the online archive (3-5 years) resides on a non-DBMS retention platform such as an ATA file server, IBM DR550, or EMC Centera; the offline archive (6+ years) resides on CD, tape, or optical media. Data can be restored from either archive tier, and archived data is accessible through IBM federation or directly through report writers, XML, ODBC/JDBC, or native applications, providing universal, application-independent access to application data.]
STEP 6: Access to archived data
The Optim Data Growth Solution access layer uses SQL92 capability and various
protocols (as shown in the above figure) to provide access to the archived data. This
accessibility is out-of-line from the production database, and so does not use any
resources from the production database system.
Alternatively, you can use a federated system (using IBM DB2 Federated Server) to
provide transparent access to the archive from the production database.
Both methods allow for direct access to archived data, without the need to retrieve or
restore the archived data.
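For example, a minimal sketch of exposing an archived table through a federated nickname; the server name arch_server and the remote schema and table names are hypothetical, and the sketch assumes the wrapper and server definitions already exist:

CREATE NICKNAME project_arch
FOR arch_server."OPTIM"."PROJECT_ARCH"

After the nickname exists, local SQL (such as the UNION ALL views shown below) can reference project_arch as if it were a local table.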
The following example demonstrates how to use a UNION ALL view to access both
active and archived data. The example renames the database table called project to a
different name, and then creates a UNION ALL view that is also named project.
RENAME TABLE project TO project_active

CREATE VIEW project AS
SELECT * FROM project_active
WHERE prjendate >= (CURRENT_DATE - 5 YEARS)
UNION ALL
SELECT * FROM project_arch
WHERE prjendate < (CURRENT_DATE - 5 YEARS)
As an alternative, the following example avoids the need to rename the table in the database. Instead, the example creates a UNION ALL view called project_all that the application can query to get the complete project data set:
CREATE VIEW project_all AS
SELECT * FROM project
WHERE prjendate >= (CURRENT_DATE - 5 YEARS)
UNION ALL
SELECT * FROM project_arch
WHERE prjendate < (CURRENT_DATE - 5 YEARS)
Best Practices
• For database partitioning, use a partitioning key column that has high cardinality and is frequently used by a join predicate.
• Use database partitioning to improve scalability for large scale
data warehouses.
• Use table partitioning for very large tables, tables with queries
that access range-subsets of data, and for roll-out requirements.
• For MDC, specify low-cardinality columns or use generated
columns to reduce cardinality.
• Use a single-column MDC design to facilitate roll-in and roll-out while minimizing increased disk space usage.
• For large scale applications, implement database partitioning,
table partitioning, and MDC simultaneously.
• Use large table spaces for tables with deep compression if you believe you will have very small row sizes. For table partitioning, place each global index on the partitioned table in a separate table space (this might avoid the need for large table spaces) or use partitioned local indexes.
• For larger data volumes, use the ALTER/Add method to roll in a table partition, or use MDC.
• For Version 9.1, to attach a table partition with compressed data, build a dictionary with minimal data prior to ALTER/ATTACH to avoid table reorganization.
• For Version 9.5, to attach a table partition, use the ALTER/Add
method.
• For continuous updates, facilitate roll-in of data by specifying a single-dimension MDC on day.
• Use federation to facilitate access to archived data from
production databases.
• Use UNION ALL views for transparent access to archived data.
• IBM Optim Data Growth Solution is the recommended tool for
data retention and retrieval.
Conclusion
Careful selection of the most appropriate partitioning method for your DB2 database, together with use of the most efficient roll-in and roll-out technique for your system, can maximize your system's overall performance and efficiency.
Devote sufficient time to analyzing and understanding your data so that you can make
the best use of the guidelines in this paper and take advantage of the features the DB2
database system provides to help make your system as efficient as possible.
You can use database partitioning to provide scalability and to help ensure even
distribution of data across partitions. Follow the guidelines in the section “Designing and
implementing your table partitioning strategy” to devise the most effective table
partitioning strategy. Use MDC to help improve the performance of queries and to
facilitate the roll-in of data.
If you need to roll in large volumes of data into compressed table partitions, upgrade to Version 9.5 of the DB2 database system and use the ALTER/Add method to add a table partition.
If you need to accommodate continuous updates, your best strategy is to use MDC to
facilitate the roll-in process.
To determine how to handle the needs of your historical data, follow the guidelines in
the section “After roll-out: How to manage data growth and retention?”.
Before you are ready to roll out your data and archive it, you need to determine a policy for data retention and retrieval of data from the archive that suits your organization.
You can better understand your organization’s technical requirements for retention and
retrieval by analyzing the following factors:
• The kind of transactions you need to retain
• The “temperature” of your data
• How your business objects are composed
Your policy should include what kind of post-archive storage is most appropriate, and
how best to access the archived data. The guidelines in the section “After roll-out: How
to manage data growth and retention?” can assist you in producing your policy.
Further reading

• DB2 Best Practices - http://www.ibm.com/developerworks/db2/bestpractices/
• Leveraging DB2 Data Warehouse Edition for Business Intelligence -