• Definition: This function calculates seasonal factors by using Fourier coefficients. It combines sine and cosine waves to help you determine seasonality or other cyclical business factors.
• Parameters

Parameter   Input/Output   Type         Description
Amplitude   Input          Field Item   Amplitude of the sine/cosine wave.
Length      Input          Field Item   Length (in years) over which the cycle repeats itself.
Startdate   Input          Field Item   Time in years at which the cycle starts.
Function    Input          Field Item   0 for a sine wave, and 1 for a cosine wave.
Time        Input          Field Item   Time periods.
Result      Output         Field Item   Result table that contains the expected result.
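The source does not give the exact formula behind this function. The sketch below assumes the usual form of a single Fourier term, amplitude × sin or cos of 2π(t − start)/length. The function name `seasonal_factor` and its argument names are illustrative, not the PAL signature.

```python
import math

def seasonal_factor(t, amplitude, length, start, function=0):
    """One sine (function=0) or cosine (function=1) term; times are in years."""
    phase = 2.0 * math.pi * (t - start) / length  # assumed phase definition
    wave = math.sin(phase) if function == 0 else math.cos(phase)
    return amplitude * wave

# Quarterly time points for a one-year cycle that starts at t = 0.
times = [0.0, 0.25, 0.5, 0.75]
sine_factors = [seasonal_factor(t, 2.0, 1.0, 0.0, function=0) for t in times]
cosine_factors = [seasonal_factor(t, 2.0, 1.0, 0.0, function=1) for t in times]
```

The sine wave peaks a quarter of a cycle after the start date, while the cosine wave peaks at the start date itself.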
Definition: find the most frequent associations in a dataset.
Applications
Clearly - shopping carts and supermarket shoppers
Analysis of any product purchases… not just in shops
Analysis of telecom service purchases
Analysis of telephone calling patterns
The ‘basket’ can be a household
…
Identification of fraudulent medical insurance claims - consider cases where common rules are broken.
Differential analysis - compare results between different stores, between customers in different demographic groups, between different days of the week, different seasons of the year, etc.
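To make "most frequent associations" concrete, here is a minimal sketch that counts item pairs across baskets and reports the support and confidence of each one-to-one rule. The function `pair_rules`, the threshold and the sample baskets are made up for illustration; this is not the PAL implementation.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets containing both
        if support < min_support:
            continue
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))

baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
rules = pair_rules(baskets, min_support=0.5)
```

The same counting works when the "basket" is a household, a phone account or an insurance claim; only the definition of an item changes.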
The K-Means algorithm is used to partition a data set into K clusters. It is a very popular clustering algorithm.
Kohonen Self Organizing Maps are a type of neural network that performs clustering. When the network is fully trained, records that are similar should appear close together on the output map, while records that are different will appear far apart. This may give you a sense of the appropriate number of clusters.
Cluster Analysis
[Figures: K-Means on the Iris data set; Kohonen Self Organizing Map]
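The K-Means partitioning described above can be sketched as an alternation between assigning each point to its nearest centre and moving each centre to the mean of its members. This is a minimal sketch: taking the first K points as initial centres is an assumption made for reproducibility, and the sample points are invented.

```python
def kmeans(points, k, iterations=100):
    """Plain K-Means on tuples of floats; returns (centers, labels)."""
    centers = [tuple(p) for p in points[:k]]   # assumed initialisation
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centre (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each centre becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's centre
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centers, labels = kmeans(points, k=2)
```

Different initial centres can lead to different partitions, so real implementations usually offer several initialisation options.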
Definition: A classification is a model to define the relationships between inputs and an output. The output, in statistics referred to as the dependent variable, is a function of one or more inputs, the independent variables. We use known inputs and outputs to define a model, and then use the model to predict or ‘score’ unknown values. This is sometimes referred to as supervised learning or directed data mining.
Classification algorithms can be sub-divided into:
Decision Tree algorithms: CNR Tree is one of the best known. CHAID analysis and C5.0 are also popular.
Regression algorithms: Multiple Linear Regression is the best known.
Neural Network algorithms: These are defined in terms of their ‘topology’, e.g. MLP, RBF…
Other: Support Vector Machines, K Nearest Neighbour…
A set of rules and a graphical, tree-shaped representation of the relationships between a dependent variable and a set of independent variables. The tree may be binary or multi-branching, depending upon the algorithm used to segment the data. Each node represents a test of a decision.
There are many use cases for decision tree analysis:
Determining the best targets for a mail shot campaign
Churn analysis
Profiling high income earners from census data
Identifying spam
Loan applicant creditworthiness
Classification Analysis – Decision Tree Algorithms
In statistics, regression analysis is a collective name for techniques for the modelling and analysis of numerical data consisting of values of a dependent variable (also called response or target) and of one or more independent variables (also known as explanatory variables or predictors).
The dependent variable in the regression equation is modelled as a function of the independent variables, corresponding parameters ("constants"), and an error term.
The error term is treated as a random variable. It represents unexplained variation in the dependent variable.
The parameters are estimated so as to give a "best fit" of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria can also be used.
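As a small illustration of the least squares "best fit", the sketch below fits a line with one independent variable using the closed-form estimates. The single-predictor case, the function name `fit_line` and the sample data are chosen only to keep the example short. The residuals play the role of the error term.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals: the variation in y that the fitted line leaves unexplained.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return intercept, slope, residuals

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
intercept, slope, residuals = fit_line(xs, ys)
```

With an intercept in the model, the least squares residuals always sum to zero, which is a quick sanity check on any implementation.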
Exponential smoothing is a method of forecasting that uses weighted values of previous series observations to predict future values. The principle is that the older the data points, the less importance they should be given.
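Single or Simple Exponential Smoothing – a weighted average of past observations, with weights that decay geometrically:
$F_{t+1} = \alpha X_t + \alpha(1-\alpha) X_{t-1} + \alpha(1-\alpha)^2 X_{t-2} + \dots$
Example: if $\alpha$ is 0.1, then the weights are 0.1, 0.09, 0.081, 0.0729… If $\alpha$ is 0.5, then the weights are 0.5, 0.25, 0.125, 0.0625… If $\alpha$ is 0.9, then the weights are 0.9, 0.09, 0.009, 0.0009…
The above equation can be shown to be equal to $F_{t+1} = \alpha X_t + (1-\alpha) F_t$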
So the computation becomes very easy, but we have to start the process with the first forecast, and that is where different starting methods can lead to different fits and forecasts.
Three basic patterns: stationary, trends, seasonality. These equate to single, double and triple exponential smoothing.
Double Exponential Smoothing applies two smoothing constants, one for the stationary element and the other for the trend.
Holt’s Two-Parameter Model
$S_t = \alpha X_t + (1-\alpha)(S_{t-1} + b_{t-1})$ … the stationary element
$b_t = \mu (S_t - S_{t-1}) + (1-\mu)\, b_{t-1}$ … the trend element
$F_{t+m} = S_t + b_t m$
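A minimal sketch of Holt's recursion, with `alpha` as the smoothing constant for the stationary element and `mu` as the one for the trend. The initial level and trend (first observation and first difference) are an assumption, since the source does not specify a starting method.

```python
def holt(series, alpha, mu, horizon=1):
    """Double exponential smoothing; returns forecasts for 1..horizon steps ahead."""
    level = series[0]                 # assumed initial stationary element
    trend = series[1] - series[0]     # assumed initial trend
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        trend = mu * (level - prev_level) + (1 - mu) * trend
    return [level + trend * m for m in range(1, horizon + 1)]

holt_forecasts = holt([2.0, 4.0, 6.0, 8.0], alpha=0.5, mu=0.3, horizon=2)
```

On a perfectly linear series the level tracks the data and the trend stays equal to the slope, so the forecasts simply continue the line.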
Triple Exponential Smoothing – for stationary and trend and seasonality
Winters’ Three-Parameter Model
$S_t = \alpha X_t / I_{t-L} + (1-\alpha)(S_{t-1} + b_{t-1})$ … the stationary element
$b_t = \mu (S_t - S_{t-1}) + (1-\mu)\, b_{t-1}$ … the trend
$I_t = \beta X_t / S_t + (1-\beta)\, I_{t-L}$ … the seasonality
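A minimal sketch of Winters' recursion with season length L. Several things here are assumptions not stated in the source: the name `beta` for the seasonal smoothing constant, the initial values (taken from the first two seasons), and the forecast equation $F_{t+m} = (S_t + b_t m)\, I_{t-L+m}$, which is the usual multiplicative form.

```python
def winters(series, season_length, alpha, mu, beta, horizon):
    """Triple exponential smoothing (multiplicative seasonality)."""
    L = season_length
    first_mean = sum(series[:L]) / L
    second_mean = sum(series[L:2 * L]) / L
    level = first_mean                                # assumed initial level
    trend = (second_mean - first_mean) / L            # assumed initial trend
    seasonal = [x / first_mean for x in series[:L]]   # assumed initial indices
    for t in range(L, len(series)):
        prev_level = level
        level = alpha * series[t] / seasonal[t - L] + (1 - alpha) * (prev_level + trend)
        trend = mu * (level - prev_level) + (1 - mu) * trend
        seasonal.append(beta * series[t] / level + (1 - beta) * seasonal[t - L])
    n = len(series)
    # Reuse the most recent full season of indices for the forecast horizon.
    return [
        (level + trend * m) * seasonal[n - L + (m - 1) % L]
        for m in range(1, horizon + 1)
    ]

winters_forecasts = winters([10.0, 20.0, 10.0, 20.0], 2, alpha=0.4, mu=0.3, beta=0.2, horizon=2)
```

The series needs at least two full seasons for this starting method to work.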
Definition: An outlier is an observation that lies an ‘abnormal’ distance from other values in a random sample from a population.
Outliers can occur because of measurement errors and might be removed from the data set or corrected.
They can occur naturally and therefore must be treated carefully.
Some statistics / algorithms can be heavily biased by outliers, for example the simple mean, correlation and linear regression. In contrast, the trimmed mean and median are not so affected.
Outliers can be detected visually, for example with scatter plots and box plots.
Outlier Algorithms – Inter Quartile Range Test (IQR)
Outliers can be detected using various algorithms. The best known is the Inter Quartile Range Test, or Tukey Test, named after its author. It is the calculation behind the construction of box plots.
Given a time series X1 to XN, calculate the lower and upper quartiles (25th and 75th percentiles), denoted as LQ and UQ. Calculate the mid spread as MID = UQ - LQ. An outlier is then defined to be any observation where
Xi < LQ - n * MID or Xi > UQ + n * MID
The value of n is usually set to 1.5; however, for large time series, say more than 36 points, it is recommended to use a value of 2. The concept of very significant and significant outliers could be introduced by using values of n = 3 and n = 2 respectively.
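The test can be sketched in a few lines. Quartiles can be computed in several slightly different ways; the sketch below assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the method PAL uses. The function name and the sample data are illustrative.

```python
import statistics

def iqr_outliers(values, n=1.5):
    """Return the values below LQ - n * MID or above UQ + n * MID."""
    # quantiles(..., n=4) returns the three quartile cut points.
    lq, _, uq = statistics.quantiles(values, n=4, method="inclusive")
    mid = uq - lq
    lower, upper = lq - n * mid, uq + n * mid
    return [x for x in values if x < lower or x > upper]

data = [10, 12, 11, 13, 12, 11, 14, 12, 40]
outliers = iqr_outliers(data)            # default multiplier n = 1.5
strict_outliers = iqr_outliers(data, 3)  # 'very significant' outliers only
```

For this sample LQ = 11 and UQ = 13, so MID = 2 and the default limits are 8 and 16.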
The PAL supports:
Inter-Quartile Range Test (Tukey’s Test)
Variance Test – the simple identification of values more than x standard deviations from the mean
Anomaly Detection – this is conceptually the ‘reverse’ of cluster analysis. We look for values furthest away from their nearest cluster centre, measure the absolute and percentage distance, and rank the largest ‘outliers’.
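The variance test from this list can be sketched as follows. Using the sample standard deviation (rather than the population one) is an assumption, as are the function name, the default x = 2 and the sample data.

```python
import statistics

def variance_test(values, x=2.0):
    """Return the values lying more than x standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (assumed)
    return [v for v in values if abs(v - mean) > x * sd]

data = [10, 12, 11, 13, 12, 11, 14, 12, 40]
flagged = variance_test(data, x=2.0)
```

Because the mean and standard deviation are themselves pulled towards extreme values, this test is less robust than the inter-quartile range test on small samples.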
<area_name>: 'AFLPAL'. This is used for all PAL functions and cannot be changed by users.
<function_name>: PAL built-in function name.
<signature_tab>: user-defined table variable. The table contains records to describe input table type, parameter table type, and result table type. A typical table variable references a table with the following definition:
PAL: Functions Available for GA use by Customers and Partners in SP5
K Means – A method of cluster analysis whereby the algorithm partitions N observations or records into K clusters in which each observation belongs to the cluster with the nearest center.
K Nearest Neighbor - The K-Nearest Neighbor (KNN) algorithm is a method for classifying objects based on the closest K objects and their average classification / value.
Multiple Linear Regression (MLR) - An approach to modeling the linear relationship between a variable Y, usually referred to as the dependent variable, and one or more other variables, usually referred to as independent variables, denoted X1, X2, X3...
C4.5 Decision Tree – A classification algorithm, C4.5 builds decision trees from a set of training data, using the concept of information entropy. The training data is a set of already classified samples. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits it into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then proceeds recursively until it meets some stopping criterion, such as a minimum number of cases in a leaf node.
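The splitting criterion can be illustrated with a small sketch that computes the normalized information gain (gain ratio) of each candidate attribute for categorical data. This covers a single split only, not the recursive tree construction, and the function names and tiny data set are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on `attribute`, divided by the split entropy."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
best = max(["outlook", "windy"], key=lambda a: gain_ratio(rows, labels, a))
```

Here "outlook" separates the classes perfectly (gain ratio 1) while "windy" carries no information (gain ratio 0), so the first split would be on "outlook".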
CHAID Analysis - This model is similar to the C4.5 decision tree. CHAID stands for CHi-squared Automatic Interaction Detection, and is a classification method for building decision trees by using chi-square statistics to identify optimal splits. CHAID examines the cross tabulations between each of the input fields and the outcome, and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID will select the input field that is the most significant (smallest p value). CHAID can generate non-binary trees.
Apriori & Apriori Lite - Popular association discovery algorithm commonly associated with market basket analysis. The algorithm looks for rules to describe frequent associations between products and other items. Apriori Lite is a subset of Apriori for the case where only a single antecedent and a single consequent are required, and is therefore faster.
ABC Classification – A dataset is divided into 3 groups – A, B, C so X% of a variable are in A, Y% in B and 100% - X - Y in C. It can be used for analyzing customer behavior and defining market segments.
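One common reading of ABC classification is sketched below: items are sorted by value, and the cumulative share of the total decides the class. The boundary handling (an item that lands exactly on a threshold stays in the higher class), the default percentages and the sample data are assumptions for illustration only.

```python
def abc_classify(values, a_pct=70.0, b_pct=20.0):
    """Assign 'A', 'B' or 'C' by cumulative share of the total, largest first."""
    total = sum(values.values())
    classes = {}
    cumulative = 0.0
    for name, value in sorted(values.items(), key=lambda kv: -kv[1]):
        cumulative += value
        share = 100.0 * cumulative / total
        if share <= a_pct:
            classes[name] = "A"
        elif share <= a_pct + b_pct:
            classes[name] = "B"
        else:
            classes[name] = "C"
    return classes

revenue = {"c1": 50, "c2": 20, "c3": 15, "c4": 10, "c5": 5}
classes = abc_classify(revenue, a_pct=70.0, b_pct=20.0)
```

With these numbers, customers c1 and c2 account for the first 70% of revenue (class A), c3 falls in the next 20% (class B), and the rest are class C.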
Weighted Score Tables – Each column / variable in a table is allocated a score, which may vary across its range of values, and then a weight. Each record is scored and the scores are multiplied by the weights and summed. The summed scores can then be ranked to identify the highest.
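The scoring steps above can be sketched as follows. The column names, score functions, weights and records are all hypothetical.

```python
def weighted_scores(records, score_maps, weights):
    """Score each column, multiply by its weight, sum per record, rank descending."""
    totals = {}
    for name, record in records.items():
        totals[name] = sum(
            weights[col] * score_maps[col](record[col]) for col in weights
        )
    return sorted(totals.items(), key=lambda kv: -kv[1])

records = {
    "applicant_1": {"income": 80, "age": 30},
    "applicant_2": {"income": 40, "age": 50},
}
# The score of a column varies across its range of values.
score_maps = {
    "income": lambda v: 10 if v >= 60 else 5,
    "age": lambda v: 8 if v < 40 else 4,
}
weights = {"income": 2, "age": 1}
ranking = weighted_scores(records, score_maps, weights)
```

The first applicant scores 2 × 10 + 1 × 8 = 28 and the second 2 × 5 + 1 × 4 = 14, so the first ranks highest.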
Exponential Regression - An approach to model the relationship between a variable Y and one or more variables denoted X1, X2, X3... In exponential regression, data are modeled using an exponential function and unknown model parameters are estimated from the data using the criterion of least squares.
Logistic Regression - Predicts the outcome of a categorical variable (a variable that can take on a limited number of categories) based on one or more predictor variables. The probabilities describing the possible outcomes are modeled as a function of the explanatory variables, using a logistic function. It is analogous to linear regression but takes a categorical target field instead of a numeric one.
Inter-Quartile Range Test - Given a series of numeric data, the Inter-Quartile Range (IQR) is the difference between the 3rd quartile (Q3) and the 1st quartile (Q1) of that data series. Values that lie more than a chosen multiple of the IQR below Q1 or above Q3 are identified as outliers.
Bi-Variate Geometric Regression - An approach to model the relationship between a dependent numeric variable Y and an independent numeric variable X. In geometric regression, data are modeled using a geometric function, and unknown model parameters are estimated from the data using least squares regression.
Bi-Variate Natural Logarithmic Regression – An approach to model the relationship between a dependent numeric variable Y and an independent numeric variable X. In natural logarithmic regression, data are modeled using a natural logarithmic function, and unknown model parameters are estimated from the data using least squares regression.
Single, Double, Triple Exponential Smoothing - Techniques that can be applied to time series data, either to produce smoothed data for presentation, or to make forecasts. Single smoothing is used when the time series is stationary, double when there is a trend and triple when there is seasonality. Older values in the time series are given less importance, with the weights forming an exponential decay.
Polynomial Regression - An approach to model the relationship between a numeric variable Y and a numeric variable X raised to the power of 2, 3, 4 etc., denoted $X^2$, $X^3$, $X^4$… In polynomial regression, data are modeled using polynomial functions, and unknown model parameters are estimated from the data using the criterion of least squares.
Variance Test - Given a series of numeric data, the Variance Test calculates the mean and the standard deviation. Values that lie more than a chosen number of standard deviations from the mean are identified as outliers.
Anomaly Detection - this is conceptually the ‘reverse’ of cluster analysis. We look for values furthest away from their nearest cluster centre, measure the absolute and percentage distance and rank the largest ‘anomalies’ or outliers.
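A minimal sketch of the ranking step, assuming the cluster centres are already known (for example from a K-Means run). It uses the absolute Euclidean distance only; how the percentage distance is defined is not stated in the source, so it is left out. The function name and sample data are illustrative.

```python
import math

def rank_anomalies(points, centers, top=1):
    """Return the `top` points furthest from their nearest centre, as (distance, point)."""
    scored = []
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)  # distance to nearest centre
        scored.append((nearest, p))
    scored.sort(reverse=True)
    return scored[:top]

points = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (5.0, 5.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
anomalies = rank_anomalies(points, centers, top=1)
```

The point (5, 5) sits between the two centres and belongs clearly to neither, so it ranks as the largest anomaly.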
Sampling – An aspect of statistics concerned with the selection of an unbiased or random subset of individual observations within a population of individuals, intended to yield some knowledge about the population of concern, especially for the purposes of making predictions based on statistical inference.
Binning – A common requirement prior to running certain predictive algorithms. It generally reduces the complexity of the model; for example, the model in a decision tree can become very complex if every value of a numeric variable becomes a branch in the tree. Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it. The sorted values are distributed into a number of “buckets” or bins.
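One way to smooth values by their neighborhood is sketched below: the sorted values are split into bins of equal size and each value is replaced by the mean of its bin. Equal-size bins and smoothing by means are assumptions; other binning and smoothing choices exist. The function name and sample data are illustrative.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of `bin_size`, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bucket = ordered[start:start + bin_size]
        mean = sum(bucket) / len(bucket)
        smoothed.extend([mean] * len(bucket))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
```

Nine distinct-looking prices collapse to three levels (9, 22 and 29), which is the kind of reduction that keeps a decision tree from growing one branch per value.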
Scaling - This function is used where the data is to be scaled to fall within a specified range, such as -1.0 to 1.0, or 0.0 to 1.0. You can normalize an attribute by scaling its values to make them fall within a specified range. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. This PAL algorithm includes three data normalization methods: min-max, z-score, and decimal scaling.
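The three normalization methods can be sketched as below. These follow the usual textbook definitions and are not taken from the PAL implementation. Using the population standard deviation for the z-score is an assumption, and a constant column (zero range or zero deviation) is not handled.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map the observed range onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

values = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled_01 = min_max(values)
scaled_z = z_score(values)
scaled_dec = decimal_scaling(values)
```

Min-max is sensitive to extreme values because they define the range, while the z-score keeps outliers visible as large absolute scores.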
Kohonen Self Organizing Maps - A type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.
The information contained in this publication is the property of SAP. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express written permission of SAP AG.