IBM SPSS Modeler and R
R model building is where you calculate whatever you want to store in modelerModel so that it can be used in your nugget calculations to score your data. This can be any R object. Like any SPSS model builder node, this is a terminal node, meaning that no data goes back from it to SPSS Modeler, apart from outputs and whatever is stored in the modelerModel object.
R model scoring is the syntax that defines how you use the modelerModel object, containing the content you stored in it in the R model building syntax, to derive the new data. Apart from the use of modelerModel, this is very similar to the R transform node.
Let us start with a simple example where we would like to create a basic linear model for the variable tenure. The formula of this model should be saved in modelerModel, after which it can be used in the scoring.
    # Create the model and save it in modelerModel
    modelerModel <- lm(tenure ~ age + region + ed + income, data = modelerData)

    # Add some summary of the model in the nugget
    summary(modelerModel)

    # together with a histogram
    hist(modelerModel$residuals, main = "residual histogram")

    # and the residual vs actuals scatterplot
    plot(modelerData$tenure, modelerModel$fitted.values, xlab = "actual", ylab = "predicted")

    # All of these outputs will be stored in the modeler nugget tabs
    # Use the model to make a prediction and add it to the existing data
    pred <- predict(modelerModel, modelerData)
    modelerData <- cbind(modelerData, pred)

    # Take care of the metadata
    newVar <- c(fieldName = "$L-tenure", fieldLabel = "", fieldStorage = "real", fieldMeasure = "",
                fieldFormat = "", fieldRole = "")
    modelerDataModel <- cbind(modelerDataModel, newVar)
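To try the two scripts outside Modeler, one option is to mock the objects that Modeler normally provides. The sketch below is a minimal dry run under assumptions: the synthetic modelerData, its column values, and the hand-built modelerDataModel are invented for illustration, and the layout of modelerDataModel (one column per field, one row per metadata attribute) is inferred from the way the scoring script extends it with cbind.

```r
set.seed(42)
n <- 200

# Hypothetical stand-in for the data Modeler would pass in
modelerData <- data.frame(
  age    = round(runif(n, 18, 75)),
  region = sample(1:3, n, replace = TRUE),
  ed     = sample(1:5, n, replace = TRUE),
  income = round(runif(n, 10, 200))
)
modelerData$tenure <- 0.5 * modelerData$age + 0.05 * modelerData$income + rnorm(n, sd = 5)

# Hypothetical stand-in for the metadata: one column per field
modelerDataModel <- as.data.frame(t(data.frame(
  fieldName = colnames(modelerData), fieldLabel = "", fieldStorage = "real",
  fieldMeasure = "", fieldFormat = "", fieldRole = ""
)))

# Building script
modelerModel <- lm(tenure ~ age + region + ed + income, data = modelerData)

# Scoring script
pred <- predict(modelerModel, modelerData)
modelerData <- cbind(modelerData, pred)
newVar <- c(fieldName = "$L-tenure", fieldLabel = "", fieldStorage = "real",
            fieldMeasure = "", fieldFormat = "", fieldRole = "")
modelerDataModel <- cbind(modelerDataModel, newVar)

# Data and metadata should describe the same number of fields
stopifnot(ncol(modelerData) == ncol(modelerDataModel))
```

The final check is a cheap way to catch a scoring script that adds a column to the data but forgets the matching metadata entry.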
It is important to note that modelerModel can be filled with any type of object; it will very often be of a model class, but it does not have to be. In the previous example the object stored in it was clearly a (statistical) model. In the next example we will just save 2 numbers in the modelerModel object. Imagine we want to calculate the z-values of a certain variable. To create the z-values we need the mean and the standard deviation of the column. We will store both of these in modelerModel, after which we will use them in the scoring syntax [1]. This example shows that you do not need to store a statistical model in your modelerModel object; it really can be any R object.
    # calculate mean and standard deviation
    M <- mean(modelerData$tenure)
    SD <- sd(modelerData$tenure)

    # and save it in a list called modelerModel
    modelerModel <- list(avg = M, sDev = SD)
    # calculate z scores using the elements in modelerModel
    zTenure <- (modelerData$tenure - modelerModel$avg) / modelerModel$sDev
    modelerData <- cbind(modelerData, zTenure)

    # define new metadata column and add it
    newVar <- c(fieldName = "zTenure", fieldLabel = "", fieldStorage = "real", fieldMeasure = "",
                fieldFormat = "", fieldRole = "")
    modelerDataModel <- cbind(modelerDataModel, newVar)
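Since modelerModel accepts any R object, the same idea extends to several columns at once. The following sketch is a variation under assumptions, not part of the original streams: the column names, the synthetic data, and the nested-list layout are all invented for illustration.

```r
set.seed(1)
# Hypothetical stand-in for the data Modeler would pass in
modelerData <- data.frame(tenure = rnorm(50, 35, 20), income = rnorm(50, 70, 30))
cols <- c("tenure", "income")  # hypothetical choice of columns

# Building script: store mean and standard deviation per column in a named list
modelerModel <- lapply(modelerData[cols], function(x) list(avg = mean(x), sDev = sd(x)))

# Scoring script: derive one z-score column per stored entry
for (col in names(modelerModel)) {
  stats <- modelerModel[[col]]
  modelerData[[paste0("z_", col)]] <- (modelerData[[col]] - stats$avg) / stats$sDev
}
```

In Modeler, each new column would also need its own entry in modelerDataModel, as in the scoring script above.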
[1] Note that there is a very good reason this is not combined into a single R Transform node, explained in section 3.4.
You can find all streams and R scripts explaining modelerModel here.
3.3 Some general remarks
• Although it might seem otherwise, you are not required to build modelerData from the existing data within that frame. modelerData will be filled with the dataset you have in Modeler; however, nothing stops you from throwing that data away in R and defining new data coming from another data source. As an example, imagine this link from the Weather Company website, which gives the weather history for Brussels, Belgium, in November 2015. We can use R code as a source node by simply overwriting modelerData and redefining modelerDataModel.
    # Define the link
    linkPath <- "http://www.wunderground.com/history/airport/EBBR/2015/11/01/MonthlyHistory.html?req_city=Brussels&format=1"
    # Read the data as csv
    modelerData <- read.csv(linkPath)
    # Convert the first column to dates ([[1]] extracts the column itself rather than a data frame)
    modelerData[[1]] <- as.Date(modelerData[[1]])

    # Redefining modelerDataModel: all are real numbers except the first column, which is the date
    modelerDataModel <- as.data.frame(t(data.frame(fieldName = colnames(modelerData),
        fieldLabel = "", fieldStorage = c("date", rep("real", ncol(modelerData) - 1)),
        fieldMeasure = "", fieldFormat = "", fieldRole = "")))
As you can see, this code does not use the old definition of the predefined R objects but completely redefines them. Placing this in an R transform node will give this new dataset back to Modeler. In this way you can use this approach to create an R input node.
You can find an example here.
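When the column types are not known in advance, the storage types could be derived from the R column classes instead of being hard-coded. The sketch below uses a small local data frame in place of the downloaded file. The helper storageOf and its class-to-storage mapping are hypothetical, so check the storage names your Modeler version accepts.

```r
# Hypothetical local data standing in for the downloaded weather history
modelerData <- data.frame(
  CET = as.Date(c("2015-11-01", "2015-11-02")),
  MeanTemp = c(11.5, 12.0),
  Events = c("Fog", "Rain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map an R column to a storage name
storageOf <- function(x) {
  if (inherits(x, "Date")) {
    "date"
  } else if (is.integer(x)) {
    "integer"
  } else if (is.numeric(x)) {
    "real"
  } else {
    "string"
  }
}

# Same construction as above, with fieldStorage computed per column
modelerDataModel <- as.data.frame(t(data.frame(
  fieldName = colnames(modelerData),
  fieldLabel = "",
  fieldStorage = vapply(modelerData, storageOf, character(1)),
  fieldMeasure = "", fieldFormat = "", fieldRole = ""
)))
```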
• Within an R model node there is place for 2 scripts. The scoring script is the script that will be populated within the R nugget; it is not run when you run the model node. As a result, these 2 scripts are independent. The only thing they share is the value of the object modelerModel, which is saved within the nugget when running the building syntax and picked up within the R scoring syntax.
This also means that any R libraries that are required should be loaded in both scripts. Take for example a random forest model:
    # Load the library
    library(randomForest)
    # Create the model and save it in modelerModel
    modelerModel <- randomForest(tenure ~ age + region + ed + income, data = modelerData, ntree = 50)
    # Load the library
    library(randomForest)

    # Use the model to make a prediction and add it to the existing data
    pred <- predict(modelerModel, modelerData)
    modelerData <- cbind(modelerData, pred)

    # Take care of the metadata
    newVar <- c(fieldName = "$RF-tenure", fieldLabel = "", fieldStorage = "real", fieldMeasure = "",
                fieldFormat = "", fieldRole = "")
    modelerDataModel <- cbind(modelerDataModel, newVar)
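The independence of the two scripts can be mimicked in plain R by running each one in its own environment and passing only a saved object between them. This is an analogy under assumptions: saveRDS/readRDS and the temporary file are stand-ins chosen for illustration, not a description of how Modeler stores the nugget, and the built-in mtcars data with lm replaces the telco data and random forest so that no extra package is needed.

```r
modelFile <- tempfile(fileext = ".rds")  # hypothetical stand-in for the nugget

# "Building script": everything except the saved object stays local
local({
  trainingData <- mtcars
  modelerModel <- lm(mpg ~ wt + hp, data = trainingData)
  saveRDS(modelerModel, modelFile)
})

# "Scoring script": starts fresh and only picks up the saved object
scored <- local({
  modelerModel <- readRDS(modelFile)
  modelerData <- mtcars
  cbind(modelerData, pred = predict(modelerModel, modelerData))
})

# Helper objects from the building step are not visible here
exists("trainingData")
```

Anything the scoring step needs besides modelerModel, including library() calls, has to be set up again in the scoring script itself.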
• Talking about libraries and packages: a package is a collection of R objects defined for a certain purpose. These are often specific statistical functionalities, like randomForest in the example above. A basic R installation comes with the standard packages; however, many more packages are made available by the R community on CRAN.
Packages need to be installed and made locally available in libraries. Once the package is installed on the system as a library, you can load it in any R session with the code library(<name>).
To install a package you have several options. The easiest is probably to run code like install.packages("randomForest") within R. You will have to select a CRAN mirror from which the package will be downloaded, and the download then proceeds automatically. Normally you only have to do this once.
Although possible, it is not recommended to run this package installation command from within SPSS. The reason is that these libraries will then be saved in a temporary folder and deleted afterwards. If you still want to do this through SPSS, you will have to hard-code the installation path.
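A common pattern is to check whether a package is available before loading it, and to install into an explicit library folder when it is missing. The sketch below is one way to do this under assumptions: the helper name ensurePackage, the default library path, and the CRAN mirror URL are illustrative choices, not values prescribed by Modeler.

```r
# Hypothetical helper: install a package only when it is not yet available
ensurePackage <- function(pkg, lib = .libPaths()[1],
                          repos = "https://cloud.r-project.org") {
  searchPath <- c(lib, .libPaths())
  if (!requireNamespace(pkg, lib.loc = searchPath, quietly = TRUE)) {
    # An explicit 'lib' avoids installing into a temporary folder
    install.packages(pkg, lib = lib, repos = repos)
  }
  # Report whether the package can be loaded now
  requireNamespace(pkg, lib.loc = searchPath, quietly = TRUE)
}

# Usage: 'stats' ships with R, so no download is triggered here
ok <- ensurePackage("stats")
```

Passing the mirror through repos avoids the interactive mirror prompt, which matters when the code runs unattended.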
3.4 Read data options
Something we have ignored until now is the settings within the node under Read data options. The basics of the R integration with Modeler can be handled without knowing these, as they require somewhat more advanced R knowledge. The user guide has a good explanation of these items.
However, there is one more thing that might be important. For Modeler version 17 and lower, the R integration of non-terminal nodes (i.e. transform nodes and nuggets) is by default done in batches of 1000 rows. The reason for this was to allow these R nodes to work on Hadoop and other clustered environments. As a result, it is very important to realize that any R code that spans multiple rows of data can lead to false results. For a workaround, we refer to section 5.1.1.
Take as an example the z-scores above. If we calculated the mean and the standard deviation of the variable in a non-terminal node, the code would first run for the first 1000 rows of data. That leads to a specific mean, standard deviation and corresponding z-scores. However, for the next 1000 rows a new mean and standard deviation would be calculated, and the z-scores of those rows would be based on these.
As a solution, the means and standard deviations are calculated in the R model (i.e. a terminal node) over all the data and used in the R nugget to calculate the z-scores.
Note that this approach may lead to a very slow integration between SPSS and R in the case of streaming R nodes in a local, non-clustered environment. However, as of IBM SPSS Modeler version 17.1 there is a default option not to use this approach of batch processing, or to increase the batch size. For lower versions, a workaround is possible if you still want to increase this batch size or turn it off (see later).
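The effect of batch processing on the z-scores can be reproduced in plain R. The sketch below assumes synthetic data of 3000 rows whose level differs per block and mimics the batches by splitting the rows into groups of 1000; it does not call Modeler itself.

```r
set.seed(7)
batchSize <- 1000
# Synthetic 'tenure' whose level differs between the three batches
tenure <- c(rnorm(1000, mean = 20, sd = 5),
            rnorm(1000, mean = 40, sd = 5),
            rnorm(1000, mean = 60, sd = 5))

# Terminal-node approach: statistics over all rows, then applied to every row
modelerModel <- list(avg = mean(tenure), sDev = sd(tenure))
zGlobal <- (tenure - modelerModel$avg) / modelerModel$sDev

# Batched approach: statistics recomputed for each block of 1000 rows
batchId <- ceiling(seq_along(tenure) / batchSize)
zBatched <- ave(tenure, batchId, FUN = function(x) (x - mean(x)) / sd(x))

# Largest disagreement between the two sets of z-scores
max(abs(zGlobal - zBatched))
```

With batched statistics every block is centered on its own mean, so the difference in level between the blocks disappears from the z-scores.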
4 Custom Dialog Builder
The Custom Dialog Builder allows you to create and manage R nodes with prefilled R code to use inside IBM SPSS Modeler streams. In this way users can create their own nodes. You can start the Custom Dialog Builder from the Tools menu under "Custom Dialog Builder".
When opening the Custom Dialog Builder you will see two windows. One of them is the custom dialog itself; the other is the toolset used to populate the dialog.
4.1 Tools
The tools window is a list of items you can place within your dialog. These include, among others, the field chooser, check and combo boxes, text and number controls, and tabs. You can select any of them and drag them onto the dialog itself.
Once you have an item in the dialog, you can select it and you will see the item properties. These are the properties of this specific item and depend on the type of item. The most important are the identifier (the way the item is referenced within the script) and the title (the text that is visible in the dialog).
4.2 Custom dialog
The big gray window is the dialog itself. For the moment it is empty, as it still has to be populated with items from the Tools list.
Clicking on this gray dialog will show you the dialog properties below. The main items include the name and title of the dialog, the script itself, and the type and position of the created node.
With regard to the script to be written, the global rule is that you reference the items within the dialog using their identifier between double percent signs ("%%<identifier>%%").
Once you have finished creating the custom node, you can install it by pressing the green arrow in the toolbar. You can also save intermediate versions to disk.
4.3 Simple example
Let us create a custom dialog for the randomForest model created earlier in section 3.3. Below you will find a step-by-step approach.
The most important question is which parts of this code we want to make flexible for the user. In the case of this model there are 3 things we might want to be flexible: the input variables, the target, and the number of trees in our forest.
First fill in the Custom dialog properties as indicated.
In our fixed example the target is tenure, but a user might choose any other field. As a result, we will place a field chooser on the dialog. Change the properties as shown. The variable filter property allows you to select only categorical variables.
In our fixed example the inputs include age and income, but a user might choose any other fields. As a result, we will place a second field chooser on the dialog. The biggest difference is that now we can select several variables, as there might be several inputs. To make it easier, we will separate these values by a +. Therefore change the properties as shown.
As a third custom choice, we would like to add the number of trees in our forest. In the original script it was 50, so we will choose this as the default. However, users may choose any integer value between 1 and 1000. Add a number control to the dialog and change the properties.
So now the dialog is ready and we need to add the script to it. Go to the Edit options and choose "Script Template". This will bring you to an empty window for the script. In this case (as we selected that we wanted 2 scripts) there is a tab for the building code and one for the scoring script. If the scoring script is greyed out, check that the dialog property "Score from the Model" is set to True.
Let us start with the scoring script, as this is easier. The only thing that needs to be adapted for custom input is the variable name that will be sent back to SPSS. So copy the scoring code and change the name tenure to %%TARGET%% (TARGET is the identifier of the target field chooser).
Fill in the code for the building script and change, in a similar way as above, the values for the target and input variables together with the number of trees. Afterwards press OK to close the script window.
Back in the Custom Dialog Builder, save the dialog in any appropriate location. Also deploy the dialog by clicking on the green deploy arrow in the toolbar. Close the Dialog Builder.
Back in the stream, you will now see the new node in the model palette. You can use this node within your stream.
You can find the resulting .cfe file here (place it in the correct location, see section 5.2.1) and a stream where it is deployed.
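To make the substitution rule concrete, the sketch below imitates in plain R what happens to the script templates once a user fills in the dialog. It is only an illustration under assumptions: Modeler performs the substitution itself, the helper fillTemplate is hypothetical, and INPUTS and NTREE are invented identifiers for the input field chooser and the number control (only TARGET is named in the text).

```r
# Script templates with %%identifier%% placeholders
buildTemplate <- c(
  "library(randomForest)",
  "modelerModel <- randomForest(%%TARGET%% ~ %%INPUTS%%, data = modelerData, ntree = %%NTREE%%)"
)
scoreTemplate <- paste0(
  'newVar <- c(fieldName = "$RF-%%TARGET%%", fieldLabel = "", fieldStorage = "real", ',
  'fieldMeasure = "", fieldFormat = "", fieldRole = "")'
)

# Hypothetical helper: replace every %%name%% by the matching dialog value
fillTemplate <- function(template, values) {
  for (id in names(values)) {
    template <- gsub(paste0("%%", id, "%%"), values[[id]], template, fixed = TRUE)
  }
  template
}

# Values a user might pick in the dialog (inputs joined by "+")
dialogValues <- list(TARGET = "tenure", INPUTS = "age+region+ed+income", NTREE = "50")

buildScript <- fillTemplate(buildTemplate, dialogValues)
scoreScript <- fillTemplate(scoreTemplate, dialogValues)
cat(buildScript, scoreScript, sep = "\n")
```

Joining the selected inputs with + is what lets the field chooser value drop straight into the right-hand side of the model formula.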
5 Tips & tricks: some more details
5.1 R code
5.1.1 ibmspsscf70 library
Let us now take a more detailed look at what actually happens with the code. First of all, it is worth checking what happens when you do the R installation correctly. This installs the R package ibmspsscf70, delivered by IBM, in the library folder of your R installation. This library contains several functions that handle the data traffic between SPSS and R.
Running any R node in SPSS will not only run the code you write; it will also run some extra code behind the scenes. You can see this code in the "Console output" window of the R node. Looking, for example, at this tab for an R nugget, you will see that your code is wrapped in something like:
    modelerModel <- ibmspsscfoutput.GetModel()
    while (ibmspsscfdata.HasMoreData()) {
        modelerDataModel <- ibmspsscfdatamodel.GetDataModel()
        modelerData <- ibmspsscfdata.GetData(rowCount = 1000, missing = NA, rDate = "None",
            logicalFields = FALSE)