This is an introduction to Augustus, an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). Augustus can produce and consume models with tens of thousands of segments. It is developed by Open Data Group, written in Python, PMML 4.0 compliant, and freely available.
Training in Analytics
Website and Community
Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML).
It is written in Python and is freely available.
http://augustus.googlecode.com
Getting Augustus
● Releases can be downloaded from the website under the Download tab.
● The current release is also listed in the main page's Featured sidebar.
● Augustus can be directly checked out from source control. We use Subversion.
● Project members can be granted commit access.
Source
● All of the source files are viewable online with markup and revision history.
WIKI
▼ The wiki is intended for people who want to install Augustus for use and possibly develop new features.
FORUM
▼ The forum is open for any general discussion regarding Augustus.
Using Augustus
● Model Development
● Use Cycle
● Work Flow
Development and Use Cycle
The typical model development and use cycle with Augustus is as follows:
1. Identify suitable data with which to construct a new model.
2. Provide a model schema which prescribes the requirements for the model.
3. Run the Augustus producer to obtain a new model.
4. Run the Augustus consumer on new data to effect scoring.
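The four steps above can be sketched in plain Python. This is only an illustration of the cycle, not the Augustus API: the functions produce_model and consume are hypothetical stand-ins for the producer and consumer, the "schema" is reduced to the name of one field, and the sample records are made up.

```python
from collections import Counter


def produce_model(training_records, field):
    """Hypothetical stand-in for the producer: learn a baseline distribution."""
    counts = Counter(record[field] for record in training_records)
    total = sum(counts.values())
    baseline = {value: n / total for value, n in counts.items()}
    return {"field": field, "baseline": baseline}


def consume(model, record):
    """Hypothetical stand-in for the consumer: score one new record."""
    value = record[model["field"]]
    # The score is the baseline probability of the observed value (0.0 if unseen).
    return model["baseline"].get(value, 0.0)


# 1. Identify training data (made-up sample records).
training = [{"Color": "red"}, {"Color": "red"}, {"Color": "blue"}, {"Color": "green"}]
# 2. The "schema" here is simply which field the model describes.
schema_field = "Color"
# 3. Produce the model.
model = produce_model(training, schema_field)
# 4. Score new data with the model.
scores = [consume(model, record) for record in [{"Color": "red"}, {"Color": "purple"}]]
```

In Augustus itself the model produced in step 3 is a PMML file, so the producer and consumer can run at different times and on different machines.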
Development and Use Cycle
1. Data inputs
2. Model schema
Running Augustus
3. Obtain new model with Producer
4. Score with Consumer
Work Flows
● Augustus is typically used to construct models and to score data with those models.
● Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.
● The Producers and Consumers are configured with XML-formatted files.
● Supplying the schema, configuration and training data to the Producer yields a completely specified model.
● The Consumers allow some configuration of the output, but post-processing can be used to render the output according to the user's needs.
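To give a feel for the XML files involved, here is a minimal sketch that builds a PMML 4.0 skeleton (root element, Header and DataDictionary) with Python's standard library. It is a fragment of the PMML format only; it is not a complete, scoreable model and it is not the Augustus producer or consumer configuration format. The helper build_pmml_skeleton is hypothetical, and the field names are borrowed from the Auto case study in this deck.

```python
import xml.etree.ElementTree as ET


def build_pmml_skeleton(fields):
    """Build a PMML 4.0 root with a Header and a DataDictionary (no model)."""
    root = ET.Element(
        "PMML", {"version": "4.0", "xmlns": "http://www.dmg.org/PMML-4_0"}
    )
    ET.SubElement(root, "Header", {"description": "illustrative skeleton"})
    dictionary = ET.SubElement(
        root, "DataDictionary", {"numberOfFields": str(len(fields))}
    )
    for name, optype, data_type in fields:
        ET.SubElement(
            dictionary,
            "DataField",
            {"name": name, "optype": optype, "dataType": data_type},
        )
    return root


skeleton = build_pmml_skeleton(
    [("Color", "categorical", "string"), ("Automaker", "categorical", "string")]
)
pmml_text = ET.tostring(skeleton, encoding="unicode")
```

A real model file would add a model element (for example a baseline or tree model) after the DataDictionary.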
Post Processing
● Augustus can accommodate a post-processing step. While not necessary, this is often useful to:
▼ Re-normalize the scoring results or perform an additional transformation.
▼ Supplement the results with global meta-data such as timestamps.
▼ Format the results.
▼ Select certain interesting values from the results.
▼ Restructure the data for use with other applications.
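A post-processing step of this kind might look like the sketch below. It assumes, purely for illustration, that the consumer output has already been parsed into a list of dictionaries with a numeric "score" key; the function post_process, the key names and the sample rows are all hypothetical.

```python
import json
from datetime import datetime, timezone


def post_process(results, keep, scored_at):
    """Re-normalize scores, add a timestamp, select fields, format as JSON lines."""
    total = sum(row["score"] for row in results)
    lines = []
    for row in results:
        out = {key: row[key] for key in keep}                 # select values
        out["score"] = row["score"] / total if total else 0.0  # re-normalize
        out["scored_at"] = scored_at.isoformat()               # global meta-data
        lines.append(json.dumps(out, sort_keys=True))          # format
    return lines


raw = [
    {"Color": "red", "score": 3.0, "debug": "x"},
    {"Color": "blue", "score": 1.0, "debug": "y"},
]
stamp = datetime(2010, 1, 1, tzinfo=timezone.utc)
lines = post_process(raw, keep=["Color"], scored_at=stamp)
```

Each output line is a self-contained JSON object, which makes the results easy to hand to another application.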
Segments
Segments are covered elsewhere, but Augustus supports them, and segmentation can be described at the Producer level.
● Augustus was originally written to an Open Data draft RFC for segmented models. Augustus 0.3.x conforms to the RFC.
● PMML 4 formalized the specification for segments, and it deviates somewhat from the RFC. Augustus 0.4.x conforms to this standard.
● Augustus 0.3.x and 0.4.x both support segments, but they differ in how they handle them.
Result of Scoring
Case Study: Auto
● Auto is an example distributed with Augustus, found in the examples directory.
▼ It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records.
▼ The examples use two types of data files.
▼ The data consists of records with three entries: Date, Color, and Automaker.
▼ The Weighted examples have an additional 'weight' column, named Count. Identical tuples in the non-weighted data are collapsed into one record, and the Count field records how many times that tuple occurred.
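The relationship between the two kinds of data file can be sketched as follows: identical (Date, Color, Automaker) tuples are collapsed into one record with a Count. The helper collapse_to_weighted and the sample values are made up for illustration and are not taken from the Auto example files.

```python
from collections import Counter


def collapse_to_weighted(records):
    """Collapse identical (Date, Color, Automaker) tuples into one row with a Count."""
    counts = Counter(records)
    return [
        (date, color, automaker, count)
        for (date, color, automaker), count in counts.items()
    ]


# Made-up non-weighted records: one tuple per observation.
unweighted = [
    ("2009-01-01", "red", "Ford"),
    ("2009-01-01", "red", "Ford"),
    ("2009-01-01", "blue", "Honda"),
]
weighted = collapse_to_weighted(unweighted)
```

The Count values always sum to the number of records in the non-weighted data, so no observations are lost by the collapse.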
Work Flow Overview
Auto: Weighted Batch
Using the Baseline for Training:
● Unitable is used to hold the data that is read in.
● It allows us to encapsulate the data in a way that lets us manipulate it efficiently.
● It can be thought of, in part, as a data structure holding a spreadsheet of data with columns, types, etc., together with the relevant operations that can be performed on the data and the data structure.
Unitable
● The Unitable is one of the main components of the Augustus system.
▼ Data read into Augustus is stored in a Unitable.
▼ Results in a very fast, efficient object for data shaping, model building, and scoring, both in a batch and real-time context.
● Designed to hold data in a way which allows it to be acted upon by numpy.
▼ Takes advantage of new features and improvements which are put into numpy by the scientific Python community.
● Unitable can be used outside of the Augustus scoring flow.
▼ Find a standalone example on the wiki.
Key Features of Unitable
● File format that matches the native machine memory storage of the data, allowing for memory-mapped access to the data.
▼ No parsing or sequential reading
● Fast vector operations using any number of data columns.
● Support for demand driven, rule based calculations.
▼ Derived columns defined in terms of operations on other columns, including other derived columns, and made available when referenced.
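The idea of demand-driven derived columns can be sketched in a few lines of plain Python. This is not the Unitable API: the class LazyTable, its methods and the column names are hypothetical, and plain lists stand in for the numpy arrays a Unitable would hold.

```python
class LazyTable:
    """Toy table whose derived columns are computed only when first referenced."""

    def __init__(self, columns):
        self._columns = dict(columns)  # name -> list of values
        self._rules = {}               # name -> (function, input column names)
        self.evaluations = []          # order in which derived columns were computed

    def derive(self, name, func, inputs):
        """Register a rule; nothing is computed yet."""
        self._rules[name] = (func, inputs)

    def __getitem__(self, name):
        if name not in self._columns:
            func, inputs = self._rules[name]
            # Looking up inputs may recursively trigger other derived columns.
            args = [self[input_name] for input_name in inputs]
            self._columns[name] = [func(*row) for row in zip(*args)]
            self.evaluations.append(name)
        return self._columns[name]


table = LazyTable({"price": [10.0, 20.0], "qty": [2, 3]})
table.derive("total", lambda price, qty: price * qty, ["price", "qty"])
table.derive("taxed", lambda total: total * 1.5, ["total"])  # depends on a derived column
computed_before = list(table.evaluations)  # still empty: nothing referenced yet
taxed = table["taxed"]                     # triggers "total", then "taxed"
```

Registering a rule is cheap; the work happens only when a column is referenced, and the result is cached for later references.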
Key Features of Unitable (cont)
● Can handle huge real-time data rates by automatically switching to vector mode when behind, and scalar mode when keeping up with individual input events.
● Ability to invoke calculations in scalar or vector mode transparently.
▼ One set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
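A rough sketch of the switching behaviour described above: one rule definition is applied either to a single event (scalar mode) or to a whole backlog at once (vector mode), depending on how far behind the processor is. The function names, the backlog threshold and the rule itself are assumptions made for illustration; in a Unitable the vector path would be a numpy operation over whole columns rather than a Python loop.

```python
from collections import deque


def rule(value):
    """The single rule definition shared by both modes."""
    return value * 2


def apply_scalar(event):
    return rule(event)


def apply_vector(events):
    # Stand-in for a vectorized operation over a batch of events.
    return [rule(event) for event in events]


def drain(queue, batch_threshold):
    """Process all queued events; return (results, list of modes used)."""
    results, modes = [], []
    while queue:
        if len(queue) >= batch_threshold:
            # Behind: take the whole backlog in one vector call.
            batch = list(queue)
            queue.clear()
            results.extend(apply_vector(batch))
            modes.append("vector")
        else:
            # Keeping up: handle events one at a time.
            results.append(apply_scalar(queue.popleft()))
            modes.append("scalar")
    return results, modes


behind_results, behind_modes = drain(deque([1, 2, 3, 4]), batch_threshold=3)
caught_up_results, caught_up_modes = drain(deque([1, 2]), batch_threshold=3)
```

Both paths give the same results for the same inputs; only the way the work is batched differs.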
For more information
Open Data Group
400 Lathrop Avenue
River Forest, IL 60305