Target User
This paper is intended for users of and support specialists for both IBM® SPSS® Statistics Desktop and
IBM® SPSS® Statistics Server. You will find information about optimizing performance and
troubleshooting performance-related issues.
Introduction
SPSS Statistics is comprehensive software for data and statistical analysis. It enables users to quickly look
at their data and includes a wide range of procedures and tests to help users solve complex business and
research challenges. This article provides SPSS Statistics users and support specialists with best practices
for configuration, data preparation, data analysis, and other tasks. These best practices can improve the
efficiency and performance of SPSS Statistics.
This article contains the following information:
Methods for diagnosing problems
Best practices for data preparation, primarily with Automated Data Preparation (ADP)
Best practices for data transformations, including compiled transformations and how to group the transformations for best performance
Best practices for data analysis, including multithreading and cache compression
Best practices for extracting useful information from large output efficiently
Best practices for working with syntax
Best practices for SPSS Statistics Server
For each of the best practices, this article provides detailed background, sample code, and instructions
for running the sample code.
Methods of Problem Diagnosis
To use SPSS Statistics efficiently, you must first identify where the problems lie, especially for
performance issues. The methods described in this section help you identify which areas may be
problematic.
Performance Logging for Statistics Server
If you need to check the performance of SPSS Statistics Server, the IBM® SPSS® Statistics
Administration Console allows you to configure the analytic server software to write performance
information to a log file. The log file provides detailed information about current users, CPU usage, and
RAM usage. For more information about logging, refer to Chapter 4 in the IBM SPSS Statistics Server
Administrator’s Guide.
Timing for Backend Procedures
This method is designed for backend procedures. It uses the SHOW $VARS command to obtain time information. By issuing the command at the beginning and end of a job, you can measure how long the job takes and narrow down the problematic area.
Example
GET FILE = dataset.
SHOW $VARS.
FREQUENCIES VARIABLES= var1 var2.
SHOW $VARS.
FREQUENCIES VARIABLES=var3 var4.
SHOW $VARS.
The first SHOW $VARS command records the start time of the first FREQUENCIES command.
The second SHOW $VARS command records the end time of the first FREQUENCIES
command and the start time of the second FREQUENCIES command.
The last SHOW $VARS command records the end time of the second FREQUENCIES
command.
You can then calculate the costs for each FREQUENCIES command with subtraction.
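The subtraction described above can be sketched in a few lines of Python. This is only a minimal sketch: it assumes you have copied one time reading, in seconds, from each SHOW $VARS output. The function name step_costs and the sample readings are hypothetical and are not part of SPSS Statistics.

```python
def step_costs(readings):
    """Return the elapsed seconds between consecutive time readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# Hypothetical time readings (in seconds) noted from three SHOW $VARS outputs.
readings = [1000.0, 1012.5, 1030.0]

# One cost per command that ran between two consecutive readings.
for number, cost in enumerate(step_costs(readings), start=1):
    print(f"FREQUENCIES command {number}: {cost} seconds")
```

With three readings the function returns two costs, one for each FREQUENCIES command in the example.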
Benchmarking with a Python Module
The benchmark Python module helps you to identify inefficient work. It provides classes that measure
various aspects of the SPSS Statistics syntax that is executed on the Microsoft Windows platform. To run
this module, you must do the following.
Install Python. Note that the Python version is specific for the SPSS Statistics version and the
operating system.
Download and install the win32com utility from http://sourceforge.net/projects/pywin32.
Download and install IBM SPSS Statistics – Integration Plug-In for Python, which is installed with
IBM SPSS Statistics – Essentials for Python. For more information, refer to the document IBM
SPSS Statistics - Essentials for Python: Installation Instructions for Windows.
Download the benchmark module, which can be found in the SPSS community’s Utilities
collection at http://www.ibm.com/developerworks/spssdevcentral. To install this module,
please read the article “How to Use Downloaded Python Modules,” which is also available in the
SPSS community.
After finishing the installation process, open benchmark.py in a text editor or Python development
environment and follow the instructions to execute the benchmarking work.
Best Practices for Data Preparation
This section provides best practices for data preparation. The IBM SPSS Statistics Data Preparation option allows you to identify unusual and invalid cases, variables, and data values in your active dataset. It also allows you to prepare data for modeling.
Preparing Data Automatically with ADP
Preparing data for analysis is one of the most important steps in any project and, traditionally, one of the most time consuming. Automated Data Preparation (ADP) handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.
Benefits
Using ADP enables you to make your data ready for model building quickly and easily, without needing prior knowledge of the statistical concepts involved. Models will tend to build and score more quickly; in addition, using ADP improves the robustness of automated modeling processes.
Obtaining ADP
To run ADP automatically, from the menus choose:
Transform > Prepare Data for Modeling > Automatic...
Click Run.
Optionally, you can:
Specify an objective on the Objective tab.
Specify field assignments on the Fields tab.
Specify expert settings on the Settings tab.
Note
This article provides only general instructions for using ADP. For more details, read the document IBM
SPSS Statistics Data Preparation released with the product. In particular, refer to the following:
Chapter 4 provides detailed instructions for running ADP, including background information,
user interface operations, and explanations of the settings.
Chapter 8 provides ADP sample code and examples, including the full process of running
ADP. The examples also build models using the data “before” and “after” preparation so that you can
compare the results.
SQL Pushback
SPSS Statistics Server supports the pushback of sorting and aggregation to a SQL database. This ability to perform sorting and aggregation operations in the SQL database is called SQL Pushback. When large datasets are sourced from a SQL database, SQL Pushback ensures that operations that can be performed more efficiently in the database are performed there.
Preconditions
The following preconditions are required for SQL Pushback functionality.
SPSS Statistics Server
SPSS Statistics Client used to connect to an SPSS Statistics Server
SQL database, such as IBM DB2®, Microsoft SQL Server, or Oracle Database
Obtaining SQL Pushback
SQL Pushback is available only through the graphical user interface. Therefore, you first need to use the SPSS
Statistics client to connect to the SPSS Statistics Server. Then complete the following steps.
From the menus choose File > Open Database > New Query...
Select the data source.
If necessary (depending on the data source), select the database file and/or enter a login name,
password, and other information.
Select the table(s) and fields. For OLE DB data sources (available only on Windows operating
systems), you can select only one table.
Specify any relationships between your tables, such as selection criteria.
If needed, aggregate the data by selecting one or more break variables, aggregated variables
and an aggregate function for each aggregate variable. Otherwise, skip this step.
Edit variable names and properties.
If needed, sort the data. Otherwise, press Next to skip this step.
Run the query or save it.
Example
This example compares the performance of SQL Pushback versus using the SORT procedure with SPSS
Statistics client.
Data File and Configurations
Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
CPU: 1 CPU, Intel T 9400, 2.53 GHz, dual-core processor
RAM: 3 GB
Operating System: Windows XP, 32-bit
IBM SPSS Statistics: Statistics Server 20, Statistics Client 20
Test Results
Sort with SQL Pushback: 77 seconds
Sort with Statistics Client: 289 seconds
Time Saved: 212 seconds (73.35%)
Note: The above result is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we can’t guarantee that
organizations performing similar tests will see identical results. These data are presented for general
guidance.
Summary
In this example, executing the sort with SQL Pushback reduced the run time by 73.35%.
The improvement may vary depending on configurations, data size, and syntax.
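For readers who want to repeat the comparison on their own system, the time-saved figure can be reproduced with a short Python sketch. The function name time_saved is hypothetical; the two timings are the ones reported in the test results above, and the percentage is taken relative to the slower (baseline) run.

```python
def time_saved(baseline_seconds, improved_seconds):
    """Return (seconds saved, percentage saved relative to the baseline)."""
    saved = baseline_seconds - improved_seconds
    return saved, 100.0 * saved / baseline_seconds


# Timings from the test results: client-side sort versus SQL Pushback.
saved, percent = time_saved(289, 77)
print(f"{saved} seconds saved ({percent:.2f}%)")
```

The result agrees with the reported figure up to rounding in the last digit.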
Note
If you are familiar with the SQL language, you can write the SQL query so that sorting and
aggregation are executed in the database, which can yield the same performance improvement as SQL Pushback.
Best Practices for Data Transformations
In most situations, the raw data aren’t perfectly suitable for the type of analysis you want to perform.
Preliminary analysis may reveal inconvenient coding schemes or coding errors, and then data
transformations may be required in order to expose the true relationship between variables. You can
perform data transformations ranging from simple tasks, such as collapsing categories for analysis, to
more advanced tasks, such as creating new variables.
This section introduces several best practices for data transformations, which help you use SPSS Statistics
data transformations more efficiently.
Grouping the Transformations
Data transformations are usually necessary for data analysis. A typical job defines the data and then
alternates between transforming and analyzing.
When transformation commands are interspersed with analytic procedures, efficiency suffers
because the transformations are executed repeatedly. In this situation, you need to group
the transformations.
Benefits
By grouping the transformation commands, you can execute all the transformation work at one time, which saves extra interpretation cost for the transformations. In addition, it makes syntax arrangement clearer and more ordered.
Example
The example executes the sample syntax before and after grouping the transformation work, so that
you can see the difference from the results.
Ungrouped Syntax
Get file="dataset".
COMPUTE testvar1=var1-var2.
IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
The syntax creates three test variables (testvar1, testvar2, and testvar3) based on the original variables (var1, var2, var3, and var4), and then recodes them for the next analysis step. We use the simple FREQUENCIES command for demonstration.
Data File and Configurations
Dataset: Size 0.9 GB, 3 million cases, 132 variables
CPU: 1 CPU, Intel T 9400, 2.53 GHz, dual-core processor
RAM: 3 GB
Operating System: Windows XP, 32-bit
IBM SPSS Statistics: Statistics Client 20
Test Results
Ungrouped syntax: 77 seconds
Grouped syntax: 43 seconds
Time saved: 26 seconds (33%)
Note: The above result is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we can’t guarantee that
organizations performing similar tests will see identical results. These data are presented for general
guidance.
Summary
In this example, grouping the transformations reduced the run time by 33%.
The improvement may vary depending on configurations, data size, and syntax, but
grouping your transformation work is a good practice.
Compiled Transformations
The compiled transformations feature is designed to improve the performance of complex
transformations. When you use compiled transformations, transformation commands (such as
COMPUTE and RECODE) are compiled to machine code at run time for better performance. This feature
works only with SPSS Statistics Server running on Windows Server.
Preconditions
The following preconditions are required for the compiled transformations feature.
SPSS Statistics Server running on Windows.
The SPSS Statistics Administration Console for configuring SPSS Statistics Server.
GNU G++ compiler.
Because there is an overhead involved in compiling the transformations, you should use
compiled transformations only when there are a large number of cases and multiple
transformations commands.
Obtaining Compiled Transformations
To run compiled transformations, complete the following steps:
Have an administrator use the SPSS Statistics Administration Console to turn on the feature and
set the correct compiler path. Chart 1 highlights these settings.
Chart 1: Settings for compiled transformations
Set CMPTRANS to YES in the syntax file.
Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch
Facility.
Note: For compiled transformations to be available the administrator must turn on compiled
transformations with the SPSS Statistics Server setting and CMPTRANS must be set to YES. If the
administrator does not turn on compiled transformations, a warning message is displayed and the
command is ignored.
Example
This example runs compiled transformations with different data sizes and complexity levels. It also provides the test results without compiled transformations for comparison.
Sample Syntax
INPUT PROGRAM.
LOOP icase = 1 to 1000000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
SET CMPTRANS=YES.
VECTOR x(10).
LOOP jvar = 1 to 10.
COMPUTE x(jvar)=rnd(uniform(10)).
END LOOP.
EXECUTE.
The above syntax generates a dataset and initializes the variables with the COMPUTE command.
The first LOOP command defines the number of cases, and the second
LOOP defines the number of variables.
Configurations
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20
Test Results
The following table summarizes the test results.
Cases      Loops      No Compiled Transformations  Compiled Transformations  Time Saved
1,000,000  10 loops   9 seconds                    9 seconds                 0
1,000,000  100 loops  70 seconds                   51 seconds                27%
5,000,000  10 loops   45 seconds                   36 seconds                20%
5,000,000  100 loops  349 seconds                  255 seconds               28%
Table 1: Test results of compiled transformations
Note: The above result is based on testing done in IBM SPSS laboratories. Although our test
environments simulate typical production environments in the field, we can’t guarantee that
organizations performing similar tests will see identical results. These data are presented for general
guidance.
Summary
Compiled transformations may improve performance when there are a large number of cases and
complex transformation commands.
Best Practices for Data Analysis
This section provides best practices for data analysis. IBM SPSS Statistics is a comprehensive system for
analyzing data. It makes statistical analysis more accessible for the beginner and more convenient for
the experienced user. The best practices introduced here help you analyze large datasets more
efficiently and improve parallelization for CPU-intensive procedures.
Cache Compression for Large Datasets
When running many procedures on a large dataset, the cost of reading the data increases because the
application must read the original dataset for each procedure. For data tables read from a database
source, this means that the SQL query must be re-executed for any command or procedure that needs
to read the data. Cache compression allows you to avoid this overhead.
Benefits
Creating a data cache eliminates multiple data readings. The CACHE command copies all of the data to a
temporary disk file for subsequent uses of the data. To decrease I/O costs, you can also compress the
temporary data file. Combining CACHE with compression improves efficiency when dealing with large
datasets.
Obtaining Cache Compression
Cache compression works only if you are connected to SPSS Statistics Server. Once connected, complete the following
steps.
Have an administrator use the SPSS Statistics Administration Console to turn on the feature.
Chart 2 highlights these settings.
Chart 2: Settings for Cache Compression
Issue an explicit CACHE command before the analytical procedures.
Set ZCOMPRESSION to YES in the syntax file.
Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch
Facility.
Example
This example runs several procedures with cache compression on a large dataset and then summarizes
the test results.
Configurations
Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20
Results
The following table summarizes the test results.
Procedure     No Cache Compression (Seconds)  Cache Compression (Seconds)  Time Saved
CODEBOOK      41.79                           11.77                        71.83%
CORRELATIONS  21.18                           8.42                         60.25%
COXREG        28.09                           16.28                        42.04%
CROSSTABS     21.85                           8.84                         59.54%
CTABLES       25.06                           11.10                        55.71%
EXAMINE       1343.87                         1157.34                      13.88%
GLM           20.72                           8.68                         58.11%
LOGISTIC      37.09                           25.55                        31.11%
NOMREG        29.26                           16.35                        44.12%
OLAP CUBES    30.83                           16.96                        44.98%
T-TEST        20.22                           8.24                         59.25%
TREE          192.09                          164.17                       14.54%
Table 2: Test results for cache compression
As shown in Table 2, run times for the CODEBOOK, CORRELATIONS, CROSSTABS, CTABLES, GLM, and T-TEST
procedures improved by more than 50%.
Note: The data shown are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).
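The statement about which procedures improve by more than 50% can be checked directly from the timings in Table 2. The Python sketch below is illustrative only: the names TABLE_2, percent_saved, and procedures_above are hypothetical, and the timings are copied from the table.

```python
# (procedure, seconds without cache compression, seconds with it), from Table 2.
TABLE_2 = [
    ("CODEBOOK", 41.79, 11.77),
    ("CORRELATIONS", 21.18, 8.42),
    ("COXREG", 28.09, 16.28),
    ("CROSSTABS", 21.85, 8.84),
    ("CTABLES", 25.06, 11.10),
    ("EXAMINE", 1343.87, 1157.34),
    ("GLM", 20.72, 8.68),
    ("LOGISTIC", 37.09, 25.55),
    ("NOMREG", 29.26, 16.35),
    ("OLAP CUBES", 30.83, 16.96),
    ("T-TEST", 20.22, 8.24),
    ("TREE", 192.09, 164.17),
]


def percent_saved(before, after):
    """Percentage of run time saved relative to the uncompressed run."""
    return 100.0 * (before - after) / before


def procedures_above(rows, threshold):
    """Names of procedures whose time saved exceeds the threshold (in percent)."""
    return [name for name, before, after in rows
            if percent_saved(before, after) > threshold]


print(procedures_above(TABLE_2, 50.0))
```

The same helper can be pointed at your own timings to see which procedures benefit most in your environment.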
Multithreading
Multithreading is a technique that breaks a task into multiple subtasks that can be executed in
parallel. Not all analytical procedures can take advantage of multithreading. Procedures that can be
easily parallelized and scheduled to run simultaneously on different CPUs/cores benefit the most. The
procedures that are multithreaded in SPSS Statistics are listed in the following table.
Procedure family     Procedure name
Correlations         Bivariate
                     Partial
Regression           Linear
                     Ordinal
                     Multinomial
                     Logistic
Data Reduction       Factor Analysis
Survival Analysis    Cox Regression
                     Logistic Regression
Multiple Imputation  Impute missing values
Table 3: Multithreaded analytical procedures
Preconditions
To benefit from multithreading, the following preconditions are required.
The computer on which the procedure is run has multiple processors or each processor has
multiple cores.
The procedure that is executed is listed in Table 3.
Note: In SPSS Statistics client, the maximum thread number is 4. In SPSS Statistics Server, there is no limit to the number of threads.
Setting
By default, SPSS Statistics uses an internal algorithm to determine the number of threads for a particular
computer. You can change this setting, but the default will often provide the best performance. You can
override the default setting by issuing the command SET THREADS=n, where n indicates the number
of threads, often corresponding to the number of CPUs or cores. Using SET THREADS to
override the default setting is appropriate in the following scenarios.
The default thread number is usually equal to the number of processing units. The threads
consume CPU resources, which may reduce the processing cycles needed for other CPU-
intensive applications. In this situation, you can use SET THREADS to limit the thread number.
For multithreaded procedures, performance may not improve when the thread number
increases because the overhead of separating the data, managing the threads, and merging the
results also increases. (For specific results, you can refer to Table 4.) Therefore, you should find
the optimal thread number and set it by using the command SET THREADS.
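One way to find a suitable thread number is to time the same procedure at several thread counts and keep the smallest count that is close to the fastest run. The Python sketch below illustrates that idea under stated assumptions: the function name best_thread_count, the 2% tolerance, and the timings are all hypothetical and are not measured data.

```python
def best_thread_count(timings, tolerance=0.02):
    """Pick the smallest thread count whose run time is within
    `tolerance` (as a fraction) of the fastest measured run."""
    fastest = min(timings.values())
    limit = fastest * (1 + tolerance)
    return min(n for n, seconds in timings.items() if seconds <= limit)


# Hypothetical run times in seconds, keyed by thread count (not measured data).
timings = {2: 120.0, 4: 70.0, 8: 52.0, 16: 51.5}

n = best_thread_count(timings)
print(f"SET THREADS={n}.")
```

Preferring the smallest count near the optimum leaves processing cycles free for other CPU-intensive applications, which matches the first scenario above.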
Example
This example provides detailed performance information for multithreaded procedures using different
data sizes and different thread numbers.
Configurations
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20
Csselect 33.2 1,069,000 6 47.97 48.27 48.10 48.00 0.00%
Table 4: Benchmarking results with different thread numbers
Based on the above results, as the number of threads increases from 2 to 16:
The Cscoxreg procedure improves by 59.84%.
The Linear regression procedure improves by 76.4%.
The Factor procedure improves by 71.44%.
The Partial correlations procedure improves by 43.89%.
The Nomreg procedure improves by 50.87%.
Note: The data shown are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).
Working with Output
SPSS Statistics provides rich methods to display the statistical results, including tables, charts, and text.
By default, the results are displayed in an SPSS Statistics Viewer window. You can manipulate the output
and create an output document that contains precisely the output you want, arranged and formatted
appropriately. The best practice introduced in this section helps you to achieve this goal.
Extract What You Need from Large Output
When running multiple procedures, SPSS Statistics often generates a large volume of results consisting of tables, charts, logs, text, and so on. Reviewing so much information to find what you want is tedious. Fortunately, SPSS Statistics provides the Output Management System (OMS) and OUTPUT commands
(OUTPUT NEW, OUTPUT NAME, OUTPUT ACTIVATE, OUTPUT OPEN, OUTPUT SAVE, OUTPUT CLOSE) to help you refine and route the output.
Benefits
With OMS and OUTPUT commands, you can gain the following benefits.
Partition the large output into separate output documents.
Select and route required information from the output.
Work with multiple open output documents in a given session.
Use output as input with OMS.
Obtaining OMS and OUTPUT Commands
There are two ways to run OMS: from the OMS Control Panel and through command syntax.
Use the OMS Control Panel. From the menus choose Utilities > OMS Control Panel. With the
control panel, you can start and stop the routing of the output to various destinations. Note that
OUTPUT commands can be used only through command syntax.
Use OMS and OUTPUT commands. The following examples illustrate how to insert these
commands into your existing syntax.
Examples
Example 1: Partitioning the Output with OUTPUT Commands
This example demonstrates how to partition the statistical results according to gender. Results for males
will appear in one output document, and results for females will appear in another.
GET FILE='SurveyData.sav'.
TEMPORARY.
SELECT IF (Sex='Male').
FREQUENCIES VARIABLES=ALL.
OUTPUT NAME males.
TEMPORARY.
SELECT IF (Sex='Female').
OUTPUT NEW NAME=females.
FREQUENCIES VARIABLES=ALL.
OUTPUT SAVE NAME=males OUTFILE='Males.spv'.
OUTPUT SAVE NAME=females OUTFILE='Females.spv'.
OUTPUT CLOSE *.
The GET command loads survey data for male and female respondents.
The FREQUENCIES output for male respondents is written to the designated output document.
The OUTPUT NAME command is used to assign the name males to the designated output
document.
The FREQUENCIES output for female respondents is written to a new output document
named females.
The OUTPUT SAVE commands are used to save the output in two separate files.
The OUTPUT CLOSE command closes all open output documents.
Example 2: Formatting and Routing the Output with OMS
This example demonstrates how to route the output in different formats. The following is the sample
code.
OMS
/SELECT TABLES
/IF COMMANDS = ['Regression']
/DESTINATION FORMAT = DOC
OUTFILE = 'tables.doc'.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT income
/METHOD=ENTER age address EDU employ.
OMS
/SELECT WARNINGS
/DESTINATION FORMAT=HTML
OUTFILE='warnings.htm'.
FREQUENCIES VARIABLES=age EDU.
OMSEND.
The first OMS command selects tables from REGRESSION results and saves them to tables.doc.
The REGRESSION command generates the output used by OMS.
The second OMS command selects warnings from FREQUENCIES results and saves them to
warnings.htm.
The FREQUENCIES command generates the results used by the second OMS command.
The OMSEND command ends all active OMS requests.
Example 3: Converting Output into Input with OMS
Using the OMS command, you can save the output to an SPSS Statistics data file and then use that
output as input in subsequent commands or sessions.