IBM SPSS Statistics

©Copyright IBM Corporation 1989, 2012.

IBM SPSS Statistics Performance Best Practices

Contents

Overview
    Target User
    Introduction
    Methods of Problem Diagnosis
        Performance Logging for Statistics Server
        Timing for Backend Procedures
        Benchmarking with a Python Module
Best Practices for Data Preparation
    Preparing data automatically with ADP
        Benefits
        Obtaining ADP
        Note
    SQL Pushback
        Preconditions
        Obtaining SQL Pushback
        Example
        Summary
        Note
Best Practices for Data Transformations
    Grouping the Transformations
        Benefits
        Example
        Summary
    Compiled Transformations
        Preconditions
        Obtaining Compiled Transformations
        Example
Best Practices for Data Analysis
    Cache Compression for Large Datasets
        Benefits
        Obtaining Cache Compression
        Example
    Multithreading
        Preconditions
        Setting
        Example
Working with Output
    Extract What You Need from Large Output
        Benefits
        Obtaining OMS and OUTPUT Commands
        Examples
        Summary
Working with Command Syntax
    Removing Unnecessary EXECUTE Commands
        Benefits
        Examples
Working with SPSS Statistics Server
    Decreasing Data Passing Costs with SPSS Statistics Server
        Benefits
        Testing and Results
        Guidelines for purchasing Statistics Server
    64-bit Computing with Statistics Server
        Benchmarking Test
    Using Multiple Locations for Temporary Files
        Benefits
        How to Set Multiple Temporary File Locations
Conclusion
Trademarks

Overview

Target User

This paper is intended for users of, and support specialists for, both IBM® SPSS® Statistics Desktop and IBM® SPSS® Statistics Server. You will find information about optimizing performance and troubleshooting performance-related issues.

Introduction

SPSS Statistics is comprehensive software for data and statistical analysis. It enables users to quickly look at their data and includes a wide range of procedures and tests to help users solve complex business and research challenges. This article provides SPSS Statistics users and support specialists with best practices for configuration, data preparation, data analysis, and other tasks. These best practices can improve the efficiency and performance of SPSS Statistics.

This article contains the following information:

- Methods for diagnosing problems
- Best practices for data preparation, primarily with Automated Data Preparation (ADP)
- Best practices for data transformations, including compiled transformations and how to group the transformations for best performance
- Best practices for data analysis, including multithreading and cache compression
- Best practices for efficiently extracting useful information from large output
- Best practices for working with syntax
- Best practices for SPSS Statistics Server

For each of the best practices, this article provides detailed background, sample code, and instructions for running the sample code.

Methods of Problem Diagnosis

To use SPSS Statistics efficiently, you must first identify where the problems are, especially for performance issues. The methods described in this section help you identify which areas may be problematic.

Performance Logging for Statistics Server

If you need to check the performance of SPSS Statistics Server, the IBM® SPSS® Statistics Administration Console allows you to configure the analytic server software to write performance information to a log file. The log file provides detailed information about current users, CPU usage, and RAM usage. For more information about logging, refer to Chapter 4 in the IBM SPSS Statistics Server Administrator's Guide.

Timing for Backend Procedures

This method is designed for backend procedures. It uses the SHOW $VARS command to obtain time information. By issuing the command before and after each step of a job, you can measure how long each step takes and identify the problematic area.

Example

GET FILE = dataset.

SHOW $VARS.

FREQUENCIES VARIABLES= var1 var2.

SHOW $VARS.

FREQUENCIES VARIABLES=var3 var4.

SHOW $VARS.

- The first SHOW $VARS command records the start time of the first FREQUENCIES command.
- The second SHOW $VARS command records the end time of the first FREQUENCIES command and the start time of the second FREQUENCIES command.
- The last SHOW $VARS command records the end time of the second FREQUENCIES command.

You can then calculate the cost of each FREQUENCIES command by subtraction.
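The subtraction can be scripted once the clock times have been copied out of the SHOW $VARS output. The following is a minimal Python sketch under stated assumptions: the timestamps were transcribed by hand as HH:MM:SS strings, the three values are hypothetical, the job does not run past midnight, and the helper name step_costs is illustrative rather than part of SPSS Statistics.

```python
from datetime import datetime


def step_costs(timestamps, fmt="%H:%M:%S"):
    """Return the elapsed seconds between consecutive timestamps."""
    parsed = [datetime.strptime(stamp, fmt) for stamp in timestamps]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(parsed, parsed[1:])
    ]


# Hypothetical clock times noted at the three SHOW $VARS commands.
marks = ["10:15:02", "10:15:47", "10:17:10"]
costs = step_costs(marks)

# One entry per FREQUENCIES command, in the order the commands were run.
for number, seconds in enumerate(costs, start=1):
    print(f"FREQUENCIES command {number}: {seconds:.0f} seconds")
```

Three timestamps produce two costs, one for each command that ran between a pair of SHOW $VARS commands.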

Benchmarking with a Python Module

The benchmark Python module helps you to identify inefficient work. It provides classes that measure various aspects of the SPSS Statistics syntax that is executed on the Microsoft Windows platform. To run this module, you must do the following:

- Install Python. Note that the required Python version depends on the SPSS Statistics version and the operating system.
- Download and install the win32com utility from http://sourceforge.net/projects/pywin32.
- Download and install IBM SPSS Statistics – Integration Plug-In for Python, which is installed with IBM SPSS Statistics – Essentials for Python. For more information, refer to the document IBM SPSS Statistics - Essentials for Python: Installation Instructions for Windows.
- Download the benchmark module, which can be found in the SPSS community's Utilities collection at http://www.ibm.com/developerworks/spssdevcentral. To install this module, read the article "How to Use Downloaded Python Modules," which is also available in the SPSS community.

After finishing the installation process, open benchmark.py in a text editor or Python development environment and follow the instructions to run the benchmark.

Best Practices for Data Preparation

This section provides best practices for data preparation. The IBM SPSS Statistics Data Preparation option allows you to identify unusual and invalid cases, variables, and data values in your active dataset. It also allows you to prepare data for modeling.

Preparing data automatically with ADP

Preparing data for analysis is one of the most important steps in any project and, traditionally, one of the most time consuming. Automated Data Preparation (ADP) handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques.

Benefits

Using ADP enables you to make your data ready for model building quickly and easily, without needing prior knowledge of the statistical concepts involved. Models will tend to build and score more quickly; in addition, using ADP improves the robustness of automated modeling processes.

Obtaining ADP

To run ADP automatically, from the menus choose:

Transform > Prepare Data for Modeling > Automatic...

Click Run.

Optionally, you can:

- Specify an objective on the Objective tab.
- Specify field assignments on the Fields tab.
- Specify expert settings on the Settings tab.

Note

This article provides only general instructions for using ADP. For more details, read the document IBM SPSS Statistics Data Preparation released with the product. In particular, refer to the following:

- Chapter 4 provides detailed instructions for running ADP, including background information, user interface operations, and explanations of the settings.
- Chapter 8 provides ADP sample code and examples, including the full process of running ADP. It also builds models using the data "before" and "after" preparation so that you can compare the results.

SQL Pushback

SPSS Statistics Server supports the pushback of sorting and aggregation to a SQL database. This ability to perform sorting and aggregation operations in the SQL database is called SQL Pushback. When large datasets are sourced from a SQL database, SQL Pushback ensures that operations that can be performed more efficiently in the database are performed there.

Preconditions

The following preconditions are required for SQL Pushback functionality:

- SPSS Statistics Server
- SPSS Statistics Client used to connect to an SPSS Statistics Server
- SQL database, such as IBM DB2®, Microsoft SQL Server, or Oracle Database

Obtaining SQL Pushback

SQL Pushback is available only through the graphical user interface. Therefore, you first need to use the SPSS Statistics client to connect to the SPSS Statistics Server. Then complete the following steps:

1. From the menus choose File > Open Database > New Query...
2. Select the data source.
3. If necessary (depending on the data source), select the database file and/or enter a login name, password, and other information.
4. Select the table(s) and fields. For OLE DB data sources (available only on Windows operating systems), you can select only one table.
5. Specify any relationships between your tables, such as selection criteria.
6. If needed, aggregate the data by selecting one or more break variables, aggregated variables, and an aggregate function for each aggregate variable. Otherwise, skip this step.
7. Edit variable names and properties.
8. If needed, sort the data. Otherwise, press Next to skip this step.
9. Run the query or save it.

Example

This example compares the performance of sorting with SQL Pushback against sorting with the SORT procedure in the SPSS Statistics client.

Data File and Configurations

Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
CPU: 1 CPU, Intel T 9400, 2.53 GHz, dual-core processor
RAM: 3 GB
Operating System: Windows XP, 32-bit
IBM SPSS Statistics: Statistics Server 20, Statistics Client 20

Test Results

Sort with SQL Pushback: 77 seconds
Sort with Statistics Client: 289 seconds
Time Saved: 212 seconds (73.35%)

Note: The above result is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we can't guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance.

Summary

In this example, sorting with SQL Pushback reduced the run time by 73.35%. The improvement may vary depending on configurations, data size, and syntax.
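The time-saved figure is the difference between the two run times divided by the slower (baseline) run time. The short Python sketch below works through that arithmetic with the two measurements reported in the test results; the function name percent_saved is illustrative and not part of any SPSS tool.

```python
def percent_saved(baseline_seconds, improved_seconds):
    """Share of the baseline run time that was saved, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (baseline_seconds - improved_seconds) / baseline_seconds * 100


client_sort = 289    # seconds, sort with Statistics Client
pushback_sort = 77   # seconds, sort with SQL Pushback

seconds_saved = client_sort - pushback_sort
print(f"Saved {seconds_saved} seconds, "
      f"about {percent_saved(client_sort, pushback_sort):.0f}% of the baseline")
```

The same helper can be applied to your own before and after timings to check whether pushback pays off in your environment.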

Note

If you are familiar with the SQL language, you can write the SQL query so that the sorting and aggregation are executed in the database, which can achieve the same performance improvement as SQL Pushback.
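As an illustration of letting the database do the sorting and aggregation, the following Python sketch runs a query with GROUP BY and ORDER BY clauses. It is a sketch under assumptions: an in-memory SQLite database stands in for the production database, and the sales table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the production database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("West", 120.0), ("East", 80.0), ("West", 30.0), ("East", 45.5)],
)

# The database performs the aggregation (GROUP BY) and the sorting
# (ORDER BY), so the client receives one already-sorted row per region
# instead of every raw case.
query = """
    SELECT region, SUM(amount) AS total_amount, COUNT(*) AS n_cases
    FROM sales
    GROUP BY region
    ORDER BY region
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, total_amount, n_cases in rows:
    print(region, total_amount, n_cases)
```

The client then reads only the aggregated, sorted result, which is the effect SQL Pushback aims for.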

Best Practices for Data Transformations

In most situations, the raw data aren't perfectly suitable for the type of analysis you want to perform. Preliminary analysis may reveal inconvenient coding schemes or coding errors, and then data transformations may be required in order to expose the true relationship between variables. You can perform data transformations ranging from simple tasks, such as collapsing categories for analysis, to more advanced tasks, such as creating new variables.

This section introduces several best practices for data transformations, which help you use SPSS Statistics data transformations more efficiently.

Grouping the Transformations

Data transformations are usually necessary for data analysis. The typical user job consists of defining data, transforming, analyzing, transforming, analyzing, and so on.

When transformation commands are interspersed with analytic procedures in this way, efficiency suffers because the data transformations are executed repeatedly. In this situation, you need to group the transformations.

Benefits

By grouping the transformation commands, you can execute all the transformation work at one time, which saves extra interpretation cost for the transformations. In addition, it makes syntax arrangement clearer and more ordered.

Example

The example executes the sample syntax before and after grouping the transformation work, so that

you can see the difference from the results.

Ungrouped Syntax

GET FILE="dataset".
COMPUTE testvar1=var1-var2.
IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
FREQUENCIES testvar1.
COMPUTE testvar2=var3.
RECODE testvar2 (1 thru 10=1) (11 thru 30=2) (31 thru 50=3) (51 thru Highest=4).
FREQUENCIES testvar2.
COMPUTE testvar3=var4.
RECODE testvar3 (SYSMIS=SYSMIS) (Lowest thru 20=1) (21 thru 50=2) (100 thru Highest=4) (51 thru 100=3).
FREQUENCIES testvar3.

Grouped Syntax

GET FILE="dataset".
COMPUTE testvar1=var1-var2.
IF (testvar1 LT 10 OR testvar1 GT 50) testvar1=20.
COMPUTE testvar2=var3.
RECODE testvar2 (1 thru 10=1) (11 thru 30=2) (31 thru 50=3) (51 thru Highest=4).
COMPUTE testvar3=var4.
RECODE testvar3 (SYSMIS=SYSMIS) (Lowest thru 20=1) (21 thru 50=2) (100 thru Highest=4) (51 thru 100=3).
FREQUENCIES testvar1.
FREQUENCIES testvar2.
FREQUENCIES testvar3.

The syntax creates three test variables (testvar1, testvar2, and testvar3) based on the original variables (var1, var2, var3, and var4), and then recodes them for the next analysis step. The simple FREQUENCIES command is used for demonstration.

Data File and Configurations

Dataset: Size 0.9 GB, 3 million cases, 132 variables
CPU: 1 CPU, Intel T 9400, 2.53 GHz, dual-core processor
RAM: 3 GB
Operating System: Windows XP, 32-bit
IBM SPSS Statistics: Statistics Client 20

Test Results

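Ungrouped syntax: 77 seconds
Grouped syntax: 51 seconds
Time saved: 26 seconds (33%)

Note: The above result is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we can't guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance.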

Summary

In this example, grouping the transformations reduced the run time by 33%. The improvement may vary depending on configurations, data size, and syntax, but it is usually noticeable. Grouping your transformation work is a good practice.
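The effect of grouping can be pictured with a toy model in Python. The sketch assumes only that pending transformations are carried out during the data pass of the next procedure; it is not a description of SPSS Statistics internals, and the class ToyActiveDataset, its methods, and the sample rows are all hypothetical. It counts how many data passes also have to carry transformation work.

```python
class ToyActiveDataset:
    """Toy model: transformations wait until a procedure reads the data."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]
        self.pending = []            # transformations not yet carried out
        self.total_passes = 0        # data passes made by procedures
        self.transform_passes = 0    # passes that also applied transformations

    def transform(self, target, func):
        """Queue a transformation (stands in for COMPUTE, IF, RECODE)."""
        self.pending.append((target, func))

    def procedure(self):
        """Read the data once (stands in for FREQUENCIES)."""
        self.total_passes += 1
        if self.pending:
            self.transform_passes += 1
            for row in self.rows:
                for target, func in self.pending:
                    row[target] = func(row)
            self.pending.clear()


source = [
    {"var1": 60, "var2": 5, "var3": 12, "var4": 70},
    {"var1": 30, "var2": 8, "var3": 40, "var4": 15},
]
steps = [
    ("testvar1", lambda row: row["var1"] - row["var2"]),
    ("testvar2", lambda row: row["var3"]),
    ("testvar3", lambda row: row["var4"]),
]

# Ungrouped: each transformation is followed directly by a procedure.
ungrouped = ToyActiveDataset(source)
for target, func in steps:
    ungrouped.transform(target, func)
    ungrouped.procedure()

# Grouped: all transformations first, then the procedures.
grouped = ToyActiveDataset(source)
for target, func in steps:
    grouped.transform(target, func)
for _ in steps:
    grouped.procedure()

print("Passes with transformation work:",
      ungrouped.transform_passes, "ungrouped versus",
      grouped.transform_passes, "grouped")
```

Both variants end with identical data and the same number of procedure passes; only the number of passes that carry transformation work differs.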

Compiled Transformations

The compiled transformations feature is designed to improve the performance of complex transformations. When you use compiled transformations, transformation commands (such as COMPUTE and RECODE) are compiled to machine code at run time for better performance. This feature works only with SPSS Statistics Server running on Windows Server.

Preconditions

The following preconditions are required for the compiled transformations feature.

- SPSS Statistics Server running on Windows.
- The SPSS Statistics Administration Console for configuring SPSS Statistics Server.
- GNU G++ compiler.

Because compiling the transformations involves overhead, you should use compiled transformations only when there are a large number of cases and multiple transformation commands.

Obtaining Compiled Transformations

To run compiled transformations, complete the following steps:

1. Have an administrator use the SPSS Statistics Administration Console to turn on the feature and set the correct compiler path. Chart 1 highlights these settings.

   Chart 1: Settings for compiled transformations

2. Set CMPTRANS to YES in the syntax file.
3. Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch Facility.

Note: For compiled transformations to be available, the administrator must turn on compiled transformations with the SPSS Statistics Server setting, and CMPTRANS must be set to YES. If the administrator does not turn on compiled transformations, a warning message is displayed and the command is ignored.

Example

This example runs compiled transformations with different data sizes and complexity levels. It also provides the test results without compiled transformations for a contrast.

Sample Syntax

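* Generate a dataset with 1,000,000 cases.
INPUT PROGRAM.
LOOP icase = 1 to 1000000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
* Turn on compiled transformations for the commands that follow.
SET CMPTRANS=YES.
* Create 10 variables and fill them with rounded random values.
VECTOR x(10).
LOOP jvar = 1 to 10.
COMPUTE x(jvar)=rnd(uniform(10)).
END LOOP.
EXECUTE.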

The above syntax generates a dataset and initializes the variables with the COMPUTE command. The first LOOP command defines the number of cases, and the second LOOP command defines the number of variables.

Configurations

CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20

Test Results

The following table summarizes the test results.

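| Cases | Loops | No compiled transformations | Compiled transformations | Time saved |
|---|---|---|---|---|
| 1,000,000 | 10 | 9 seconds | 9 seconds | 0 |
| 1,000,000 | 100 | 70 seconds | 51 seconds | 27% |
| 5,000,000 | 10 | 45 seconds | 36 seconds | 20% |
| 5,000,000 | 100 | 349 seconds | 255 seconds | 28% |

Table 1: Test results of compiled transformations

Note: The above result is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we can't guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance.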

Summary

Compiled transformations may improve performance when there are a large number of cases and

complex transformation commands.

Best Practices for Data Analysis

This section provides best practices for data analysis. IBM SPSS Statistics is a comprehensive system for analyzing data. It makes statistical analysis more accessible for the beginner and more convenient for the experienced user. The best practices introduced here are helpful for analyzing large datasets more efficiently and improving the parallelization of CPU-intensive procedures.

Cache Compression for Large Datasets

When you run many procedures on a large dataset, the cost of reading the data increases, because the application must read the original dataset for each procedure. For data tables read from a database source, this means that the SQL query must be re-executed for any command or procedure that needs to read the data. Cache compression allows you to avoid this overhead.

Benefits

Creating a data cache eliminates multiple data readings. The CACHE command copies all of the data to a temporary disk file for subsequent uses of the data. To decrease I/O costs, you can also compress the temporary data file. Combining CACHE with compression improves efficiency when dealing with large datasets.

Obtaining Cache Compression

Cache compression works only if you are connected to Statistics Server. To use it, complete the following steps:

1. Have an administrator use the SPSS Statistics Administration Console to turn on the feature. Chart 2 highlights these settings.

   Chart 2: Settings for Cache Compression

2. Issue an explicit CACHE command before the analytical procedures.
3. Set ZCOMPRESSION to YES in the syntax file.
4. Execute the syntax while connected to the SPSS Statistics Server or with the SPSS Statistics Batch Facility.

Example

This example runs several procedures with cache compression on a large dataset and then summarizes the test results.

Configurations

Dataset: Size 1.25 GB, 7.71 million cases, 27 variables
CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20

Results


| Procedure | No cache compression (seconds) | Cache compression (seconds) | Time saved |
|---|---|---|---|
| CODEBOOK | 41.79 | 11.77 | 71.83% |
| CORRELATIONS | 21.18 | 8.42 | 60.25% |
| COXREG | 28.09 | 16.28 | 42.04% |
| CROSSTABS | 21.85 | 8.84 | 59.54% |
| CTABLES | 25.06 | 11.10 | 55.71% |
| EXAMINE | 1343.87 | 1157.34 | 13.88% |
| GLM | 20.72 | 8.68 | 58.11% |
| LOGISTIC | 37.09 | 25.55 | 31.11% |
| NOMREG | 29.26 | 16.35 | 44.12% |
| OLAP CUBES | 30.83 | 16.96 | 44.98% |
| T-TEST | 20.22 | 8.24 | 59.25% |
| TREE | 192.09 | 164.17 | 14.54% |

Table 2: Test results for cache compression

As shown in Table 2, the procedures CODEBOOK, CORRELATIONS, CROSSTABS, CTABLES, GLM, and T-TEST improve by more than 50%.

Note: The data shown are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

Multithreading

Multithreading is the technique of breaking a task into multiple subtasks that can be executed in parallel. Not all analytical procedures can take advantage of multithreading. Procedures that can be easily parallelized and scheduled to run simultaneously on different CPUs/cores benefit the most. The procedures that are multithreaded in SPSS Statistics are listed in the following table.

| Procedure family | Procedure name |
|---|---|
| Correlations | Bivariate, Partial |
| Regression | Linear, Ordinal, Multinomial, Logistic |
| Data Reduction | Factor Analysis |
| Survival Analysis | Cox Regression, Logistic Regression |
| Multiple Imputation | Impute missing values |

Table 3: Multithreaded analytical procedures

Preconditions

To benefit from multithreading, the following preconditions are required.

- The computer on which the procedure is run has multiple processors, or its processor has multiple cores.
- The procedure that is executed is listed in Table 3.

Note: In the SPSS Statistics client, the maximum number of threads is 4. In SPSS Statistics Server, there is no limit to the number of threads.

Setting

By default, SPSS Statistics uses an internal algorithm to determine the number of threads for a particular computer. The default often provides the best performance, but you can override it by issuing the command SET THREADS=n, where n is the number of threads, which often corresponds to the number of CPUs or cores. Overriding the default with SET THREADS is appropriate in the following scenarios:

- The default thread number is usually equal to the number of processing units. The threads consume CPU resources, which may reduce the processing cycles available to other CPU-intensive applications. In this situation, you can use SET THREADS to limit the thread number.
- For multithreaded procedures, performance may not improve as the thread number increases, because the overhead of separating the data, managing the threads, and merging the results also increases (for specific results, refer to Table 4). Therefore, you should find the optimal thread number and set it by using the SET THREADS command.

Example

This example provides detailed performance information for multithreaded procedures using different data sizes and different thread numbers.

Configurations

CPU: 4 CPUs, Intel Xeon, 3.00 GHz, dual-core hyper-threaded processor
RAM: 8 GB
Operating System: Windows 2003 Server, 64-bit
IBM SPSS Statistics: Statistics Server 20

Results


| Multi-threaded procedure | File size (MB) | Case number | Variable number | Time (sec), 2 threads | Time (sec), 4 threads | Time (sec), 8 threads | Time (sec), 16 threads | Saved time |
|---|---|---|---|---|---|---|---|---|
| Discriminant | 688 | 400,000 | 200 | 7.56 | 6.67 | 6.03 | 6.34 | 20.23% |
| Cscoxreg | 2.38 | 50,000 | 50 | 32.12 | 20.36 | 15.25 | 12.90 | 59.84% |
| SORT | 2610 | 2,000,000 | 457 | 392.68 | 261.13 | 241.29 | 249.42 | 38.55% |
| Csordinal | 47.6 | 1,000,000 | 50 | 40.11 | 37.05 | 36.87 | 36.94 | 8.07% |
| Cslogistic | 47.6 | 1,000,000 | 50 | 63.03 | 55.74 | 58.49 | 54.72 | 13.18% |
| Linear regression | 703 | 200,000 | 400 | 50.13 | 30.93 | 14.74 | 11.83 | 76.4% |
| Factor | 686 | 200,000 | 400 | 97.83 | 49.24 | 27.94 | 28.18 | 71.44% |
| Correlation | 343 | 200,000 | 200 | 29.67 | 19.55 | 16.81 | 12.97 | 56.28% |
| Partially correlated | 343 | 200,000 | 200 | 21.94 | 12.41 | 12.56 | 12.31 | 43.89% |
| Nomreg | 3.76 | 50,000 | 15 | 16.61 | 11.89 | 9.41 | 8.16 | 50.87% |
| Csselect | 33.2 | 1,069,000 | 6 | 47.97 | 48.27 | 48.10 | 48.00 | 0.00% |

Table 4: Benchmarking results with different thread numbers

Based on the above results, as the number of threads increases from 2 to 16:

- The Cscoxreg procedure improves by 59.84%.
- The Linear regression procedure improves by 76.4%.
- The Factor procedure improves by 71.44%.
- The Partially correlated procedure improves by 43.89%.
- The Nomreg procedure improves by 50.87%.

Note: The data shown are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).
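Finding the optimal thread number from benchmark timings can be sketched in Python as follows. The timings for three procedures are copied from Table 4; the selection rule (the smallest thread number whose run time is within 2% of the fastest run) and the function name best_thread_count are assumptions made for illustration, not an SPSS Statistics algorithm.

```python
THREAD_COUNTS = (2, 4, 8, 16)

# Elapsed seconds at 2, 4, 8, and 16 threads, copied from Table 4.
TIMINGS = {
    "Factor": (97.83, 49.24, 27.94, 28.18),
    "Linear regression": (50.13, 30.93, 14.74, 11.83),
    "Csselect": (47.97, 48.27, 48.10, 48.00),
}


def best_thread_count(counts, seconds, tolerance=0.02):
    """Smallest thread count whose time is within `tolerance` of the fastest.

    Preferring the smallest count leaves CPU capacity for other
    applications when extra threads bring no meaningful gain.
    """
    fastest = min(seconds)
    for count, elapsed in zip(counts, seconds):
        if elapsed <= fastest * (1 + tolerance):
            return count


recommended = {
    name: best_thread_count(THREAD_COUNTS, seconds)
    for name, seconds in TIMINGS.items()
}
for name, count in recommended.items():
    print(f"{name}: SET THREADS={count}.")
```

For Factor the rule stops at 8 threads because 16 threads is no faster, and for Csselect it stays at 2 threads because additional threads bring no gain.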

Working with Output

SPSS Statistics provides rich methods to display the statistical results, including tables, charts, and text. By default, the results are displayed in an SPSS Statistics Viewer window. You can manipulate the output and create an output document that contains precisely the output you want, arranged and formatted appropriately. The best practice introduced in this section helps you to achieve this goal.

Extract What You Need from Large Output

When running multiple procedures, SPSS Statistics often generates a large volume of results consisting of tables, charts, logs, text, and so on. Reviewing so much information to find what you want is tedious. Fortunately, SPSS Statistics provides the Output Management System (OMS) and OUTPUT commands (OUTPUT NEW, OUTPUT NAME, OUTPUT ACTIVATE, OUTPUT OPEN, OUTPUT SAVE, OUTPUT CLOSE) to help you refine and route the output.

Benefits

With OMS and OUTPUT commands, you can gain the following benefits.

Partition the large output into separate output documents.

Select and route required information from the output.

Work with multiple open output documents in a given session.

Use output as input with OMS.

Obtaining OMS and OUTPUT Commands

There are two ways to run OMS: from the OMS Control Panel or with command syntax.

Use the OMS Control Panel. From the menus choose Utilities > OMS Control Panel. With the control panel, you can start and stop the routing of output to various destinations. Note that the OUTPUT commands are available only through command syntax.

Use OMS and OUTPUT commands. The following examples illustrate how to insert these commands into your existing syntax.

Examples

Example 1: Partitioning the Output with OUTPUT Commands

This example demonstrates how to partition the statistical results according to gender. Results for males appear in one output document, and results for females appear in another.

GET FILE='SurveyData.sav'.

TEMPORARY.

SELECT IF (Sex='Male').

FREQUENCIES VARIABLES=ALL.

OUTPUT NAME males.

TEMPORARY.

SELECT IF (Sex='Female').

OUTPUT NEW NAME=females.

FREQUENCIES VARIABLES=ALL.

OUTPUT SAVE NAME=males OUTFILE='Males.spv'.

OUTPUT SAVE NAME=females OUTFILE='Females.spv'.

OUTPUT CLOSE *.

The GET command loads survey data for male and female respondents.

The FREQUENCIES output for male respondents is written to the designated output document.

The OUTPUT NAME command is used to assign the name males to the designated output

document.

The FREQUENCIES output for female respondents is written to a new output document

named females.

The OUTPUT SAVE commands are used to save the output in two separate files.

The OUTPUT CLOSE command closes all open output documents.

Example 2: Formatting and Routing the Output with OMS

This example demonstrates how to route the output in different formats. The following is the sample code.

OMS

/SELECT TABLES

/IF COMMANDS = ['Regression']

/DESTINATION FORMAT = DOC

OUTFILE = 'tables.doc'.

REGRESSION

/STATISTICS COEFF OUTS R ANOVA

/DEPENDENT income

/METHOD=ENTER age address EDU employ.

OMS

/SELECT WARNINGS

/DESTINATION FORMAT=HTML

OUTFILE='warnings.htm'.

FREQUENCIES age EDU.

OMSEND.

The first OMS command selects tables from REGRESSION results and saves them to tables.doc.

The REGRESSION command generates the output used by OMS.

The second OMS command selects warnings from FREQUENCIES results and saves them to

warnings.htm.

The FREQUENCIES command generates the results used by the second OMS command.

The OMSEND command ends all active OMS requests.

Example 3: Converting Output into Input with OMS

Using the OMS command, you can save the output to an SPSS Statistics data file and then use that

output as input in subsequent commands or sessions.

OMS

/SELECT TABLES

/IF COMMANDS=['Descriptives'] SUBTYPES=['Descriptive Statistics']

/DESTINATION FORMAT=SAV OUTFILE='des_table.sav'

/COLUMNS DIMNAMES=['Variables'].

DESCRIPTIVES VARIABLES=salary salbegin.

OMSEND.

The OMS command selects the “Descriptive Statistics” table from DESCRIPTIVES results and saves it as the SPSS Statistics data file des_table.sav. The COLUMNS subcommand places the Variables dimension in the columns, so the analyzed variables become the variables of the output data file.

The DESCRIPTIVES command generates the table used by OMS.

The OMSEND command ends OMS commands.

Summary

The OMS and OUTPUT commands provide the ability to manage one or more output documents

programmatically. This ability helps you deal with the output more easily. For more information, please

refer to the IBM SPSS Statistics Command Syntax Reference, which is released with the product.

Working with Command Syntax

The powerful command syntax allows you to save and automate many common tasks. It also provides some functionality not found in the menus and dialog boxes. You can save your jobs in a syntax file so that you can repeat your analysis at a later date. This section provides best practices for working with command syntax.

Removing Unnecessary EXECUTE Commands

The EXECUTE command is designed for use with transformation commands and facilities such as ADD FILES, MATCH FILES, UPDATE, PRINT, and WRITE, which do not read data and are not executed unless followed by a data-reading procedure. Because the EXECUTE command forces the data to be read, unnecessary EXECUTE commands can result in extra data passes and wasted time.

Benefits

By identifying and removing unnecessary EXECUTE commands, you can optimize syntax arrangement

and reduce the time needed for reading data. This optimization is especially effective for I/O intensive

procedures.

Examples

The following examples demonstrate the improper usage of EXECUTE commands and how to correct the

improper usage.

Example 1: Using EXECUTE Between Independent Transformations

COMPUTE var1=var1*2.

EXECUTE.

COMPUTE var2=var2*2.

The two COMPUTE commands operate on different variables, so they are independent. In this scenario, inserting the EXECUTE command causes an unnecessary data pass and lowers the execution efficiency of the transformations.

Ensuring that the transformations are truly independent is critical. If the transformations are in fact dependent, you may need to put the EXECUTE command between them to get the right results. For example:

Syntax 1:

COMPUTE lagvar=LAG(var1).

COMPUTE var1=var1*2.

Syntax 2:

COMPUTE lagvar=LAG(var1).

EXECUTE.

COMPUTE var1=var1*2.

Compared with Syntax 1, the only difference in Syntax 2 is the EXECUTE command between the two COMPUTE commands. However, the value of lagvar differs between the two. Syntax 1 uses the transformed value of var1 to calculate lagvar, while Syntax 2 uses the original value.
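The difference between Syntax 1 and Syntax 2 can be illustrated with a simplified Python model of how pending transformations are applied case by case during a data pass. This is a sketch for intuition only, not SPSS code: the function names single_pass and two_passes are hypothetical, and None stands in for the system-missing value that LAG returns for the first case.

```python
def single_pass(var1):
    """Model of Syntax 1: both COMPUTE commands run in the same data pass.

    By the time LAG(var1) is evaluated for a case, var1 of the previous
    case has already been doubled.
    """
    lagvar, doubled = [], []
    previous = None  # LAG is missing for the first case
    for value in var1:
        lagvar.append(previous)   # COMPUTE lagvar=LAG(var1).
        new_value = value * 2     # COMPUTE var1=var1*2.
        doubled.append(new_value)
        previous = new_value      # the next case sees the transformed value
    return lagvar, doubled


def two_passes(var1):
    """Model of Syntax 2: EXECUTE forces a data pass between the COMPUTEs."""
    lagvar = [None] + list(var1[:-1])    # first pass: LAG of original values
    doubled = [value * 2 for value in var1]  # second pass: doubling
    return lagvar, doubled


print(single_pass([1, 2, 3]))
print(two_passes([1, 2, 3]))
```

In this model both variants end with the same doubled var1, but lagvar holds transformed values in the first case and original values in the second, which is why the EXECUTE command is required when the second transformation depends on the first.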

Example 2: Inserting EXECUTE Between a Transformation and Statistical Procedure


COMPUTE var1=var1*2.

EXECUTE.

FREQUENCIES VARIABLES=var1.

Sometimes it’s necessary to execute the transformations with the EXECUTE command.

However, when the transformations are followed by one or more statistical procedures that

need to read the data, the EXECUTE command becomes redundant. In this example, you

should remove the EXECUTE command.

Working with SPSS Statistics Server

SPSS Statistics Server is robust, powerful analytical software that seamlessly scales from handling the analytical needs of a single department to hundreds of users across the enterprise. It provides all of the features of SPSS Statistics, plus capabilities that deliver faster performance, more efficient processing of large datasets, and enhanced security in enterprise deployments. This section provides best practices for working with SPSS Statistics Server.

Decreasing Data Passing Costs with SPSS Statistics Server

For an organization with distributed offices, accessing large data files across offices takes a significant amount of time. Passing large data files over the network can saturate the available bandwidth, which disrupts the normal use of other applications. In this situation, SPSS Statistics Server is a good choice.

Benefits

With SPSS Statistics Server, data is read on the server machine, which avoids transferring large datasets to end users’ desktops. The amount of data transferred over the network is minimized. This prevents bandwidth saturation and improves the performance of SPSS Statistics as well as other mission-critical applications, including e-mail, enterprise resource planning (ERP), and customer relationship management (CRM).

Testing and Results

The following table compares the time needed to access data in these situations:

SPSS Statistics client is running in local mode and accesses files in the data center directly over

the wide area network (WAN).

SPSS Statistics client is running in distributed mode and is connected to an SPSS Statistics Server

installed at the data center.

File Size | SPSS Statistics client connecting directly to the data over a WAN (T1 3.0 Mbps) | SPSS Statistics client connecting to the SPSS Statistics Server at the data center over a WAN (T1 3.0 Mbps) | Time saved with SPSS Statistics Server

50 MB | 2 minutes, 10 seconds | 4 seconds | 2 minutes, 6 seconds

250 MB | 10 minutes, 50 seconds | 40 seconds | 10 minutes, 10 seconds

1 GB | 43 minutes, 17 seconds | 80 seconds | 41 minutes, 57 seconds

Table 5: Time needed to access a data file

As shown in the table above, SPSS Statistics Server achieves significant time savings over a client running in local mode when accessing files across distributed offices. For example, about 2 minutes were saved for a 50 MB file, 10 minutes for a 250 MB file, and 42 minutes for a 1 GB file.

Note: The results are based on the assumption that the available bandwidth is 3.0 Mbps. In reality, the

time saved will be greater as bandwidth is taken up by other applications such as e-mail, network

backups, and other network resources. The data presented here are for illustrative purposes only. Actual

results will vary depending on the configuration, bandwidth, and latency of the WAN. Therefore,

organizations performing similar tests may not see identical results.
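The time-saved column of Table 5 is the difference between the two access times. The short Python sketch below reproduces it from the published values; the helper names to_seconds and format_duration are hypothetical and are used only for this illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


def format_duration(total_seconds):
    """Format total seconds the way Table 5 does."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} minutes, {seconds} seconds"


# (direct access over the WAN, access through SPSS Statistics Server), in seconds
TABLE_5 = {
    "50 MB": (to_seconds(2, 10), 4),
    "250 MB": (to_seconds(10, 50), 40),
    "1 GB": (to_seconds(43, 17), 80),
}

time_saved = {
    size: format_duration(direct - server)
    for size, (direct, server) in TABLE_5.items()
}

for size, saved in time_saved.items():
    print(f"{size}: {saved}")
```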

Guidelines for purchasing Statistics Server

The SPSS Statistics Server is especially designed for the following scenarios:

Organizations with distributed offices looking to centralize their data and IT infrastructure in one

or more data centers.

Organizations with distributed offices that need to analyze and share files greater than 25 MB

across offices.

Organizations that need to perform analysis on large datasets (greater than 100 MB) sourced

from a SQL server or a data warehouse.

64-bit Computing with Statistics Server

The amount of physical RAM is critical for performance because accessing data from RAM is much faster

than accessing data from a disk. For faster performance, it’s best to have the entire dataset in RAM.

However, the total amount of RAM supported depends on the processor. Theoretically, 32-bit processors are limited to accessing 4 GB of RAM. Moving to a 64-bit machine allows you to install many times more RAM than a 32-bit machine can address. Analytical procedures on larger datasets execute much faster on a 64-bit machine.
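The 4 GB figure follows from the number of distinct byte addresses that a 32-bit pointer can represent. The Python sketch below shows the arithmetic; the function name addressable_gib is hypothetical, and the 64-bit result is a theoretical ceiling only, since real processors and operating systems support far less physical RAM than that.

```python
BYTES_PER_GIB = 2 ** 30


def addressable_gib(address_bits):
    """Theoretical byte-addressable memory, in GiB, for a given pointer width."""
    return 2 ** address_bits / BYTES_PER_GIB


print(addressable_gib(32))  # 32-bit address space
print(addressable_gib(64))  # 64-bit address space (theoretical ceiling)
```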

SPSS Statistics Server has strong support for 64-bit computing on multiple server operating systems, including Windows Server, IBM® AIX®, Sun Solaris, HP-UX, Red Hat Enterprise Linux, and SUSE Linux Enterprise Server. Most analytical procedures run much faster on 64-bit SPSS Statistics Server than on 32-bit SPSS Statistics client.

Benchmarking Test

We compare the processing times for statistical procedures run on 64-bit SPSS Statistics Server and 32-bit SPSS Statistics client.

Configuration of Statistics Server

CPU: 4 CPUs, Intel Xeon 3 GHz, dual-core hyper-threaded processor

RAM: 8 GB

Operating system: Windows 2003 Server, 64-bit

Configuration of Statistics Client

CPU: 1 CPU, Intel T 7500, 2.19 GHz, dual-core processor

RAM: 3 GB

Operating system: Windows XP, 32-bit

Datasets

Two datasets were used:

Dataset 1: Size 2.1 GB, 5 million cases, 127 variables

Dataset 2: Size 3 GB, 10 million cases, 127 variables (for testing multithreaded procedures)

Result

The test results are summarized in the following table. The chosen procedures represent the typical types of analysis or data processing that an SPSS Statistics user might execute in daily work.

Procedures | 64-bit Server (seconds) | 32-bit Client (seconds) | Time Saved | Average Speedup Factor

ADD FILES | 18.45 | 169.34 | 89.10% | 9.18

AGGREGATE | 33.19 | 94.95 | 65.04% | 2.86

MATCH FILES | 22.00 | 224.17 | 90.19% | 10.19

SORT | 146.90 | 578.73 | 74.62% | 3.94

CORRELATION | 230.78 | 800.83 | 71.18% | 3.47

FACTOR | 140.95 | 219.22 | 35.70% | 1.56

GLM | 70.09 | 350.91 | 80.03% | 5.01

MIXED | 116.23 | 174.13 | 33.25% | 1.50

TREES | 615.00 | 885.49 | 43.98% | 1.44

BETA | 40.12 | 106.20 | 62.22% | 2.65

Table 6: Benchmarking results for jobs run on 64-bit SPSS Statistics Server and 32-bit SPSS Statistics clients

Note: The results shown in Table 6 are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. These data are presented for general guidance. Actual results will vary depending on the configuration of the SPSS Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).
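The two derived columns of Table 6 can be recomputed from the timings: time saved is the relative reduction in elapsed time, and the speedup factor is the ratio of client time to server time. The Python sketch below is illustrative only, and its names (TABLE_6, time_saved_pct, speedup, mismatches) are hypothetical. Recomputing the values this way reproduces every published figure to within rounding except the time saved for TREES, where the timings give about 30.55% rather than the published 43.98%; the TREES speedup factor of 1.44 is consistent with the timings.

```python
# (server seconds, client seconds, published time saved %, published speedup)
TABLE_6 = {
    "ADD FILES": (18.45, 169.34, 89.10, 9.18),
    "AGGREGATE": (33.19, 94.95, 65.04, 2.86),
    "MATCH FILES": (22.00, 224.17, 90.19, 10.19),
    "SORT": (146.90, 578.73, 74.62, 3.94),
    "CORRELATION": (230.78, 800.83, 71.18, 3.47),
    "FACTOR": (140.95, 219.22, 35.70, 1.56),
    "GLM": (70.09, 350.91, 80.03, 5.01),
    "MIXED": (116.23, 174.13, 33.25, 1.50),
    "TREES": (615.00, 885.49, 43.98, 1.44),
    "BETA": (40.12, 106.20, 62.22, 2.65),
}


def time_saved_pct(server, client):
    """Percentage of the client time that the server run saves."""
    return (client - server) / client * 100


def speedup(server, client):
    """How many times faster the server run is than the client run."""
    return client / server


# Rows whose published values differ from the recomputed ones beyond rounding.
mismatches = [
    name
    for name, (server, client, pub_saved, pub_speedup) in TABLE_6.items()
    if abs(time_saved_pct(server, client) - pub_saved) > 0.05
    or abs(speedup(server, client) - pub_speedup) > 0.01
]

print(mismatches)
```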

Summary

The benchmarking results show impressive speedup for most procedures with 64-bit SPSS Statistics

Server. For best performance, use 64-bit SPSS Statistics Server.

Using Multiple Locations for Temporary Files

When SPSS Statistics Server processes data, it often keeps a temporary copy of that data on disk. In addition, some procedures (CACHE, SORT, AGGREGATE, transformations, etc.) can create temporary files during execution. The size of the temporary files ranges from the size of the data file to three times that size. Because the temporary files are writable and can get quite large, it’s hard to manage I/O operations, especially when there are several concurrent I/O-intensive users. In this situation, setting multiple temporary file locations is necessary.
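The one-to-three-times rule of thumb above can be turned into a rough sizing estimate for temporary storage. The Python sketch below is an illustration under a simplifying assumption that is not stated in this paper: each concurrent user works on a data file of the same size and needs a separate set of temporary files. The function name temp_space_range_gb is hypothetical.

```python
def temp_space_range_gb(data_file_gb, concurrent_users=1):
    """Return (low, high) temporary disk space estimates in GB.

    Uses the rule of thumb that temporary files take between one and
    three times the size of the data file, per concurrent user.
    """
    low = data_file_gb * concurrent_users
    high = 3 * data_file_gb * concurrent_users
    return low, high


# Four concurrent users, each working on a 2 GB data file.
print(temp_space_range_gb(2, concurrent_users=4))
```

An estimate like this can help decide how many temporary file locations to configure and how large each one should be.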

Benefits

Using multiple temporary file locations, you can:

Restrict users to the directories to which they have access.

Control the temporary file space allocated to each user by specifying a partitioned drive.

Improve performance when the locations are on different spindles. This option requires your server machine to have multiple physical disks.

How to Set Multiple Temporary File Locations

There are several ways to set multiple temporary file locations using the SPSS Statistics Administration

Console. Note that this optimization is available only when using SPSS Statistics Server.

Set Global Temporary File Locations

Chart 3 shows a screen capture from the SPSS Statistics Administration Console and highlights the

setting for temporary file locations.

Chart 3: Setting for temporary file location

As shown in Chart 3, the administrator set three locations: c:\temp, d:\temp, and e:\temp. This setting is

global for all users but can be overridden by the user profile or group setting.

Set Temporary File Location with Group Setting

The group setting applies to all users in a group, but it can be overridden with the setting in specific user

profiles. To display the group settings, double-click the User Profiles and Groups node beneath the

desired SPSS Statistics Server in the Server Administration pane.

The Manage Users and Groups pane displays the currently defined user profiles and groups in the User

Profiles and Groups grid. To create a new user group, complete the following steps.

In the Manage Users and Groups pane, click New Group.

In the Create New Group dialog box, enter a name for the group.

Define any of the available settings, including temporary file locations.

Set Temporary File Location for Each User

To create a new user profile, open the Manage Users and Groups pane in the Server Administration

pane and complete the following steps.

In the Manage Users and Groups pane, click New User Profile.

In the Create New User Profile dialog box, enter the name of the user for whom you are creating

the profile.

If necessary, define any of the available settings. You can define the temporary file location for

this user. If you are creating a user profile to assign to a group, you don’t have to define any

settings. The group settings will be applied to the user.
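The three levels of settings described above follow a simple precedence: a user profile setting overrides a group setting, which in turn overrides the global setting. The Python sketch below models only that precedence. It is not an SPSS Statistics Administration Console API, and the function name effective_temp_locations and the example paths other than those shown in Chart 3 are hypothetical.

```python
from typing import List, Optional, Sequence


def effective_temp_locations(
    global_setting: Sequence[str],
    group_setting: Optional[Sequence[str]] = None,
    user_setting: Optional[Sequence[str]] = None,
) -> List[str]:
    """Resolve the temporary file locations that apply to one user.

    Precedence, from highest to lowest: user profile, group, global.
    """
    if user_setting:
        return list(user_setting)
    if group_setting:
        return list(group_setting)
    return list(global_setting)


GLOBAL = ["c:\\temp", "d:\\temp", "e:\\temp"]  # the locations shown in Chart 3

# A user with no profile or group setting falls back to the global setting.
print(effective_temp_locations(GLOBAL))

# A group setting overrides the global one (hypothetical path).
print(effective_temp_locations(GLOBAL, group_setting=["f:\\analysts"]))

# A user profile setting overrides both (hypothetical path).
print(effective_temp_locations(GLOBAL, ["f:\\analysts"], ["g:\\user1"]))
```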

For more information about creating and editing SPSS Statistics Server user profiles and groups, please refer to Chapter 4 of the IBM SPSS Statistics Server Administrator's Guide.

Conclusion

This paper provides best practices for improving the efficiency and performance of IBM SPSS Statistics. These best practices cover data preparation, data transformations, data analysis, output, command syntax, and Statistics Server. By applying them, SPSS Statistics users can optimize their work and improve overall performance.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in

many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other

companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark

information" at http://www.ibm.com/legal/copytrade.shtml.
