Training in Analytics
Website and Community
Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML).
It is written in Python and is freely available.
http://augustus.googlecode.com
Getting Augustus
● Releases can be downloaded from the website under the Download tab.
● The current release is also listed in the Featured sidebar on the main page.
● Augustus can be directly checked out from source control. We use Subversion.
● Project members can be granted commit access.
Source
● All of the source files are viewable online with markup and revision history.
● The raw version of each file is also available.
http://augustus.googlecode.com/source/browse
Documentation and Community
WIKI
▼ The wiki is intended for people who want to install Augustus for use and possibly develop new features.
FORUM
▼ The forum is open for any general discussion regarding Augustus.
Using Augustus
● Model Development
● Use Cycle
● Work Flow
Development and Use Cycle
The typical model development and use cycle with Augustus is as follows:
1. Identify suitable data with which to construct a new model.
2. Provide a model schema which prescribes the requirements for the model.
3. Run the Augustus producer to obtain a new model.
4. Run the Augustus consumer on new data to perform scoring.
Development and Use Cycle
Running Augustus:
1. Data inputs
2. Model schema
3. Obtain new model with Producer
4. Score with Consumer
Work Flows
● Augustus is typically used to construct models and score data with models.
● Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.
Components
● Pre-processing
● Producers
● Consumers
● Post-Processing
Producers and Consumers
● The Producers and Consumers require configuration with XML-formatted files.
● Supplying the schema, configuration, and training data to the Producer yields a completely specified model.
● The Consumers provide for some configurability of the output, but post-processing can be used to render the output according to the user's needs.
Post Processing
● Augustus can accommodate a post-processing step. While not necessary, this is often useful to:
▼ Re-normalize the scoring results or perform an additional transformation.
▼ Supplement the results with global metadata such as timestamps.
▼ Format the results.
▼ Select certain interesting values from the results.
▼ Restructure the data for use with other applications.
Segments
Segments are covered elsewhere, but Augustus supports segments, and this can be described at the Producer level.
● Augustus was originally written to an Open Data draft RFC for segmented models. Augustus 0.3.x conforms to the RFC.
● PMML 4 formalized the specification for segments, and it deviates somewhat from the RFC. Augustus 0.4.x conforms to this standard.
● Augustus 0.3.x and 0.4.x both support segments; they differ in how they handle them.
Result of Scoring
Case Study: Auto
● Auto is an example distributed with Augustus, found in the examples directory.
▼ It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records.
▼ The examples use two types of data files.
▼ The data consists of records with three entries: Date, Color, and Automaker.
▼ The Weighted examples have an additional 'weight' column, named Count. The Count field records the number of occurrences of identical tuples in the non-weighted data and collapses them into one record.
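The weighting described above can be sketched in a few lines of standard Python. This is an illustration of the collapsing step, not code from the Auto example; the sample rows are made up:

```python
from collections import Counter

# Collapse identical (Date, Color, Automaker) tuples into weighted
# records, producing the Count column of the Weighted data files.
rows = [
    ("2007-01-01", "red",  "Ford"),
    ("2007-01-01", "red",  "Ford"),
    ("2007-01-01", "blue", "Honda"),
]

weighted = Counter(rows)
for (date, color, automaker), count in sorted(weighted.items()):
    print(date, color, automaker, count)
```

Each distinct tuple appears once in the output, with its occurrence count as the weight.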
Work Flow Overview
Auto: Weighted Batch
Using the Baseline for Training:
$ cd WeightedBatch
.
`-- scripts
    |-- consume.py
    |-- postprocess.py
    `-- produce.py
http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch
Input for the Producer
The Producer takes the training data set. In the code, we have declared how we want to test the data:

import augustus.modellib.baseline.producer.Producer as Producer

def makeConfigs(inFile, outFile, inPMML, outPMML):
    # open data file
    inf = uni.UniTable().fromfile(inFile)
    # start the configuration file
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")
Input for the Producer (continued)

    # use a discrete distribution model for test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # create the segmentation declarations for the two fields at this level
    '''
    Taken out for the example; other Use Cases will focus on Segments
    segmentation = ET.SubElement(test, "segmentation")
    makeSegment(inf, segmentation, "Color")
    '''
    # output the configuration file
    tree = ET.ElementTree(root)
    tree.write(outFile)
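The excerpt above builds the configuration with Python's standard ElementTree API. A minimal, self-contained sketch of the same pattern follows; the element and attribute names echo the slides, but this is not the complete Augustus configuration:

```python
import xml.etree.ElementTree as ET

# Build a small configuration tree the same way the Producer code
# does: create a root, attach children, set attributes, serialize.
root = ET.Element("model")
test = ET.SubElement(root, "test")
test.set("field", "Automaker")
test.set("threshold", "0.475")
baseline = ET.SubElement(test, "baseline")
baseline.set("dist", "discrete")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Writing the tree to disk (`ET.ElementTree(root).write(path)`) produces the XML configuration file the Producer consumes.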
Running the Producer (Training)
$ cd scripts
$ python2.5 produce.py -f wtraining.nab -t20
(0.000 secs) Beginning timing
(0.000 secs) Creating configuration file
(0.001 secs) Creating input PMML file
(0.001 secs) Starting producer
(0.000 secs) Inputting configurations
(0.001 secs) Inputting model
(0.008 secs) Collecting stats for baseline distribution
(0.011 secs) Events 20.067% processed
(0.009 secs) Events 40.134% processed
(0.009 secs) Events 60.201% processed
(0.009 secs) Events 80.268% processed
(0.009 secs) Events 100.000% processed
(0.000 secs) Making test distributions from statistics
(0.002 secs) Outputting PMML
(0.062 secs) Lifetime of timer
Model generated by the Producer
<PMML version="3.1">
  <Header copyright=" " />
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
  <BaselineModel functionName="baseline">
    <MiningSchema>
      <MiningField name="Automaker" />
      <MiningField name="Color" />
      <MiningField name="Count" />
    </MiningSchema>
  </BaselineModel>
</PMML>
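Because PMML is plain XML, the generated model can be inspected with the standard library alone. A small sketch, using the DataDictionary shown above:

```python
import xml.etree.ElementTree as ET

# The DataDictionary portion of the model produced above.
pmml = """<PMML version="3.1">
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
</PMML>"""

root = ET.fromstring(pmml)
# Collect each declared field and its type.
fields = [(f.get("name"), f.get("dataType"))
          for f in root.iter("DataField")]
print(fields)  # [('Automaker', 'string'), ('Color', 'string'), ('Count', 'float')]
```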
Model generated by the Producer (Cont)
The structure is determined by code in Producer.py:

def makePMML(outFile):
    # create the pmml
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")

It then goes on for each Data and Mining Field:

    dataField = ET.SubElement(dataDict, "DataField")
    dataField.set("name", "Automaker")
    dataField.set("optype", "categorical")
    dataField.set("dataType", "string")
    . . .
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    miningField = ET.SubElement(miningSchema, "MiningField")
    miningField.set("name", "Automaker")
Producer Output
The training step used the code in producer.py to generate a model and get expected results.
Training generated the following files:
.
|-- consumer
|   `-- wtraining.nab.pmml   MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
`-- producer
    |-- wtraining.nab.pmml   BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
    `-- wtraining.nab.xml    MODEL FILE USED FOR TRAINING
Training XML
This provides:
● Model with expected values from Training that is used when we score
● Test Distribution
● Baseline data and how it is to be handled
$ cat producer/wtraining.nab.xml
<model input="../producer/wtraining.nab.pmml"
       output="../consumer/wtraining.nab.pmml">
  <test field="Automaker" testStatistic="dDist" testType="threshold"
        threshold="0.475" weightField="Count">
    <baseline dist="discrete" file="../data/wtraining.nab"
              type="UniTable" />
  </test>
</model>
Unitable
● Unitable is used to hold the data that is read in.
● It allows us to encapsulate the data in a way that lets us manipulate it efficiently.
● It can be thought of, in part, as a data structure holding a spreadsheet of data with columns, types, etc., together with the relevant operations that can be performed on the data and the data structure.
● More to follow.
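As a rough mental model of the spreadsheet analogy above, a table can be held column-wise in plain Python. This toy sketch is for intuition only; it is not Unitable's actual implementation, which is numpy-backed:

```python
# A toy column-oriented table: each column is a named list of values,
# so whole-column (vector) operations map naturally over the data.
table = {
    "Date":      ["2007-01-01", "2007-01-01", "2007-01-02"],
    "Color":     ["red", "blue", "red"],
    "Automaker": ["Ford", "Honda", "Ford"],
}

# Column access is a single lookup; an element-wise test over one
# column touches only that column's storage.
reds = [c == "red" for c in table["Color"]]
print(reds)  # [True, False, True]
```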
Running the Consumer
$ cd scripts
$ python2.5 consume.py -b wtraining.nab -f wscoring.nab
Ready to score
.
|-- consumer
|   |-- wscoring.nab.wtraining.nab.xml
|   `-- wtraining.nab.pmml
|-- postprocess
|   `-- wscoring.nab.wtraining.nab.xml
`-- producer
    |-- wtraining.nab.pmml
    `-- wtraining.nab.xml
This example generates a report in the postprocess directory.
Consumer (Scoring) Output
$ cat consumer/wscoring.nab.wtraining.nab.xml
<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>
Scoring Report
$ cat postprocess/wscoring.nab.wtraining.nab.xml
<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>
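Post-processing typically starts by reading this report back in. A sketch using the standard library on the exact report shown above:

```python
import xml.etree.ElementTree as ET

# The scoring report emitted by the Consumer, as shown above.
report = """<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>"""

root = ET.fromstring(report)
for event in root.iter("event"):
    score = float(event.findtext("score"))
    alert = event.findtext("alert") == "True"
    print(score, alert)
```

From here a post-processing script can re-normalize scores, add timestamps, or reformat the results, as described earlier.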
Unitable
● The Unitable is one of the main components of the Augustus system.
▼ Data read into Augustus is stored in a Unitable.
▼ This results in a very fast, efficient object for data shaping, model building, and scoring, in both batch and real-time contexts.
● Designed to hold data in a way which allows it to be acted upon by numpy.
▼ Takes advantage of new features and improvements that the scientific Python community puts into numpy.
● Unitable can be used outside of the Augustus scoring flow.
▼ A standalone example can be found on the wiki.
Key Features of Unitable
● File format that matches the native machine memory storage of the data, allowing for memory-mapped access to the data.
▼ No parsing or sequential reading.
● Fast vector operations using any number of data columns.
● Support for demand-driven, rule-based calculations.
▼ Derived columns are defined in terms of operations on other columns, including other derived columns, and made available when referenced.
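The demand-driven derived-column idea can be sketched in a few lines. This is hypothetical illustration code, not Unitable's API: a rule maps a new column name to a function of existing columns, and the rule is evaluated only when the column is first referenced.

```python
# Toy sketch of demand-driven derived columns (NOT Unitable's API).
class RuleTable:
    def __init__(self, columns):
        self.columns = dict(columns)   # name -> list of values
        self.rules = {}                # name -> function of this table

    def derive(self, name, func):
        # register a rule; nothing is computed yet
        self.rules[name] = func

    def __getitem__(self, name):
        if name not in self.columns and name in self.rules:
            # evaluate on demand and cache the result
            self.columns[name] = self.rules[name](self)
        return self.columns[name]

t = RuleTable({"Count": [3, 1, 2]})
t.derive("Double", lambda t: [2 * c for c in t["Count"]])
# a rule may reference another derived column
t.derive("Quad", lambda t: [2 * d for d in t["Double"]])
print(t["Quad"])  # [12, 4, 8]
```

Asking for "Quad" transparently triggers evaluation of "Double" first, mirroring the chained derived columns described above.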
Key Features of Unitable (cont)
● Can handle huge real-time data rates by automatically switching to vector mode when behind, and scalar mode when keeping up with individual input events.
● Ability to invoke calculations in scalar or vector mode transparently.
▼ One set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
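The scalar/vector transparency can be illustrated with a single rule that accepts either a lone value or a whole column. Again, a toy illustration rather than Unitable's mechanism:

```python
# One rule definition applied transparently in scalar (real-time)
# or vector (batch) mode -- a toy sketch, not Unitable's API.
def rule(x):
    if isinstance(x, list):
        # vector mode: apply the rule element-wise to a whole column
        return [v * 2 + 1 for v in x]
    # scalar mode: apply the rule to a single incoming event
    return x * 2 + 1

print(rule(5))          # 11
print(rule([1, 2, 3]))  # [3, 5, 7]
```

The caller does not change the rule to switch modes; the same definition serves batch scoring and row-at-a-time real-time scoring.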
For more information
Open Data Group
400 Lathrop Avenue
River Forest, IL 60305
http://code.google.com/p/augustus/