A SAS/IML PROGRAM FOR GENERALISED PROCRUSTES ANALYSIS
Pascal SCHLICH
MINISTRY DE L'AGRICULTURE
ABSTRACT
GPA is a fairly well-known multivariate statistical method. GENSTAT software includes a GPA macro, but SAS does not. The aim of this communication is to announce a GPA program written in SAS/IML.
GPA is a three-way data analysis: variables have been recorded on the same n samples in k different situations. Each situation defines a configuration of n points in a multidimensional space. The purpose of GPA is to match the k configurations to a common consensus configuration by translation, scale change and iterative rotation/reflection. When the transformed configurations are as close as possible, their mean defines the consensus.
The output of the program allows interpretation in several ways. Procrustes statistics quantify the agreement of each configuration or sample with the consensus. Graphical representations of the consensus are obtained by principal component analysis. The separation of samples on the principal plot is explained by correlations between the principal components and all of the variables. Relations between configurations are studied by principal co-ordinate analyses.
An application of GPA to Free-choice Profiling (FCP) of 6 strawberry jam samples by 15 assessors (configurations) is detailed. FCP is a method of sensory analysis in which each assessor chooses and scores his own attributes to describe the samples. Taking these vocabulary differences into account is precisely the goal of Procrustes rotations.
GPA can be applied generally whenever individuals are described by several sets of variables.
INTRODUCTION
In Greek mythology, Procrustes was a highwayman supposed to have made all his victims fit his bed, cruelly stretching those who were too small and cutting down to size those who were too tall. HURLEY and CATTELL (1962) gave the code name Procrustes to their program to refer to its ability to fit almost any data to any other "for better or worse". As described by HARMAN (1976), Procrustes is the common term used to refer to any forced transformation in factor analysis.
In this paper, Procrustes analysis means the matching of two configurations of n points in a p-dimensional space by translation, scale change and rotation/reflection. Generalised Procrustes Analysis (GPA), introduced by KRISTOF and WINGERSKY (1971) and popularized by GOWER (1975), iteratively matches, in a Procrustes sense, k configurations of n points to their mean, which defines a consensus configuration.
GPA is a fairly well-known multivariate statistical method. For instance, the GENSTAT statistical software includes a GPA macro, but SAS does not. The aim of this communication is to announce a GPA program written in SAS/IML.
The first part of this communication gives the broad outlines of GPA. The second part, an application of GPA to a sensory analysis, shows and interprets the computer output obtained from these real data.
GPA METHOD
Assume n samples have been described by k sets of p_i variables (i = 1, ..., k). Each set of variables is arranged into an n x p_i data matrix called X_i. The rows of X_i denote the samples and the columns denote the variables. The samples are the same in each X_i, but the variables may differ. The data can be seen geometrically as k configurations of n points, each in a p_i-dimensional space. If p = max_i(p_i), then p - p_i zero columns are appended to X_i, so that the k configurations can be viewed in the same p-dimensional space. In this common representation, each configuration has its own axis meanings, unless the variables are identical in each X_i.
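The zero-padding step can be illustrated with a minimal Python/NumPy sketch. It is not the SAS/IML program announced here; the function name `pad_to_common_dimension` is hypothetical and chosen only for illustration.

```python
import numpy as np

def pad_to_common_dimension(configs):
    """Append zero columns so that every n x p_i matrix becomes n x p,
    where p is the largest p_i (illustrative helper, not from the paper)."""
    p = max(np.asarray(X).shape[1] for X in configs)
    padded = []
    for X in configs:
        X = np.asarray(X, dtype=float)
        n, p_i = X.shape
        # p - p_i zero columns are appended on the right of X_i
        padded.append(np.hstack([X, np.zeros((n, p - p_i))]))
    return padded

# Two assessors scoring the same 6 samples on 2 and 3 attributes
a = np.ones((6, 2))
b = np.ones((6, 3))
padded = pad_to_common_dimension([a, b])
print([X.shape for X in padded])
```

After padding, all configurations share the same shape, which the later rotation steps require.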
A usual way to describe a single X_i data matrix is to perform a principal component analysis (PCA) (MORRISON, 1976) using the PRINCOMP or FACTOR procedures of SAS/STAT. With several X_i data matrices, one PCA per matrix does not easily yield a consensus about sample differences. If and only if the variables are identical in each X_i, it is possible to compute the mean data matrix and then perform a single PCA. But if large differences are found between the means or scales of the configurations, the average configuration may not be a good consensus.
GPA first translates the configurations to a common origin, and applies contractions or dilations to give a common dispersion to each configuration. These first two stages are in the same spirit as centering and autoscaling the columns of a data matrix in PCA in order to describe correlations rather than covariances. The third stage, the most characteristic of Procrustes analysis, consists in rotating the configurations to fit a target configuration defined as the mean configuration. Figure 1 illustrates what the GPA transformations do in the case of two configurations. With more than two configurations, the algorithm (GOWER, 1975) needs to be iterative, as described in figure 2. The sum of squared distances between corresponding samples in transformed configurations, called the Procrustes statistic and denoted by s, s* and s** in figure 2, is minimized by this algorithm. When iteration is complete, the mean of the transformed configurations defines the consensus, denoted by C in figure 2.
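The three stages (translation, scaling, rotation/reflection) can be sketched for the simplest case of matching one configuration to a fixed target. This Python/NumPy sketch is an illustration under stated assumptions, not the author's SAS/IML code: the function name `procrustes_match` is hypothetical, both configurations are scaled to unit total sum of squares, and the rotation is obtained from a singular value decomposition, which is one standard way to solve the orthogonal Procrustes problem.

```python
import numpy as np

def procrustes_match(X, target):
    """Translate, scale and rotate/reflect X onto target (both n x p)."""
    # Stage 1: translation to a common origin (column centering)
    Xc = X - X.mean(axis=0)
    Tc = target - target.mean(axis=0)
    # Stage 2: common dispersion (unit total sum of squares, an assumption)
    Xc = Xc / np.linalg.norm(Xc)
    Tc = Tc / np.linalg.norm(Tc)
    # Stage 3: orthogonal matrix H minimising ||Xc H - Tc||, via SVD of Xc'Tc
    U, _, Vt = np.linalg.svd(Xc.T @ Tc)
    H = U @ Vt
    fitted = Xc @ H
    residual = float(np.sum((fitted - Tc) ** 2))
    return fitted, Tc, residual

# A triangle, and a rotated, dilated and shifted copy of it
target = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = 3.0 * target @ R + np.array([5.0, -2.0])

fitted, centred_target, residual = procrustes_match(X, target)
print(residual)  # numerically zero: the two shapes are identical
```

Because H is only constrained to be orthogonal, reflections are allowed as well as rotations, in line with the rotation/reflection wording used in the text.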
The sum of squared distances between corresponding samples in the initial configurations is called the total sum of squares. It can be seen as the sum of squares for translation, scaling and rotation/reflection, plus a residual which is the Procrustes statistic defined above. LANGRON and COLLINS (1985) derived an asymptotic Procrustes
Figure 1. Simulated example for k = 2 configurations of n = 3 samples, symbolised by geometrical shapes, in a p_1 = p_2 = 2-dimensional space (panel: INITIAL configurations).
A GPA Algorithm (J.C. GOWER, Psychometrika, 1975)
1. Centre each column of each X_i. Scale each X_i using lambda = k / sum_i tr(X_i X_i').
2. Set C = X_1. For i = 2 to k do: rotate X_i to C and take the mean of X_1, ..., X_i as new C. Set s = k.(1 - tr(CC')). For i = 1 to k do: r_i = 1.
3. For i = 1 to k do: rotate r_i X_i to C, giving X*_i = r_i X_i H_i. Let C* be the mean of X*_1, ..., X*_k. Set s* = s - k.tr(C*C*' - CC').
4. If scaling is not required, set s** = s*, set C** = C* and go to step 6.
5. For i = 1 to k do: r*_i / r_i = tr(X*_i C*') / (tr(X*_i X*_i').tr(C*C*')). Set X**_i = (r*_i / r_i) X*_i. Set r_i = r*_i. Let C** be the mean of X**_1, ..., X**_k. Set s** = s - k.tr(C**C**' - CC').
6. If s - s** > 0.0001 then set s = s**, set C = C** and go to step 3, else go to next step.
7. Iteration is complete. Calculate and print Procrustes analysis of variance.
8. Perform a principal coordinate analysis (PCO) of matrices, considered as k samples, before each stage of GPA. Dissimilarity matrices for these analyses are given by the covariance matrices of the transformed configurations considered as n.p vectors.
9. Perform a principal component analysis (PCA) of the consensus C. Refer each configuration to these principal axes. Calculate correlations between variables and principal components. Calculate rotation matrices of variables on principal components.
10. Computation is complete. To return to the original units before the scaling in step 1, the configurations obtained must be divided by sqrt(lambda) and the sums of squares by lambda.
Figure 2
analysis of variance (PANOVA) from this decomposition. PANOVA shows which transformations are the most important.
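The iterative part of the algorithm in figure 2 can be sketched in Python/NumPy for the case where scaling is not required (steps 1 to 4 and 6, skipping step 5). This is a hedged illustration, not the 331-instruction IML program: the names `rotate_to` and `gpa_without_scaling` are hypothetical, the configurations are assumed to be already padded to a common dimension, step 1 is assumed to multiply each configuration by sqrt(lambda) (consistent with step 10), and the `max_iter` safeguard is an addition that does not appear in figure 2.

```python
import numpy as np

def rotate_to(X, C):
    """Orthogonal rotation/reflection of X that best fits C (SVD solution)."""
    U, _, Vt = np.linalg.svd(X.T @ C)
    return X @ (U @ Vt)

def gpa_without_scaling(configs, tol=1e-4, max_iter=100):
    k = len(configs)
    # Step 1: centre columns, then scale so that sum_i tr(X_i X_i') = k
    X = [np.asarray(A, dtype=float) for A in configs]
    X = [A - A.mean(axis=0) for A in X]
    lam = k / sum(np.trace(A @ A.T) for A in X)
    X = [np.sqrt(lam) * A for A in X]
    # Step 2: build an initial consensus by successive rotations
    C = X[0]
    for i in range(1, k):
        X[i] = rotate_to(X[i], C)
        C = np.mean(X[: i + 1], axis=0)
    s = k * (1.0 - np.trace(C @ C.T))
    # Steps 3, 4 and 6: rotate to the consensus until s stops decreasing
    for _ in range(max_iter):
        X = [rotate_to(A, C) for A in X]
        C_new = np.mean(X, axis=0)
        s_new = s - k * np.trace(C_new @ C_new.T - C @ C.T)
        converged = (s - s_new) <= tol
        s, C = s_new, C_new
        if converged:
            break
    return X, C, s, lam

# Three rotated/reflected copies of the same 6 x 3 configuration
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 3))
copies = []
for _ in range(3):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
    copies.append(base @ Q)

transformed, consensus, s, lam = gpa_without_scaling(copies)
print(s)  # numerically zero: the copies differ only by rotation/reflection
```

In this noise-free example the Procrustes statistic s vanishes; with real data it stays positive and is the residual term of the decomposition discussed above.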
To detect possible outlier configurations, groups of configurations, or configurations badly represented by the consensus, a principal co-ordinate analysis (PCO) (GOWER, 1966) of the configurations is performed. The dissimilarity matrix analyzed by PCO is the covariance matrix of the k configurations considered as n.p vectors (LANGRON, 1981). A first PCO is computed with the initial configurations, then one PCO is computed after each stage of GPA. It is convenient to show the first and the last PCO on the same plot (ARNOLD and WILLIAMS, 1986), in order to appreciate the cohesion between configurations gained by GPA.
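One possible reading of this PCO step is sketched below in Python/NumPy. The details are assumptions rather than the paper's implementation: each configuration is flattened into a vector of length n.p, the k x k covariance matrix of these vectors is treated as a similarity matrix, it is double-centred, and the coordinates are eigenvectors scaled by the square roots of the eigenvalues. The function name `pco_of_configurations` is hypothetical.

```python
import numpy as np

def pco_of_configurations(configs, n_axes=2):
    """Principal co-ordinates of k configurations seen as n.p vectors."""
    k = len(configs)
    # One row per configuration, of length n * p
    V = np.array([np.asarray(X, dtype=float).ravel() for X in configs])
    S = np.cov(V)                            # k x k covariance between configurations
    J = np.eye(k) - np.ones((k, k)) / k      # centring matrix
    B = J @ S @ J                            # double-centred similarity matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_axes]  # largest eigenvalues first
    kept = np.clip(vals[order], 0.0, None)   # guard against tiny negative values
    return vecs[:, order] * np.sqrt(kept)

rng = np.random.default_rng(1)
configs = [rng.normal(size=(6, 3)) for _ in range(4)]
coords = pco_of_configurations(configs, n_axes=2)
print(coords.shape)  # one row of plotting coordinates per configuration
```

Running this once on the initial configurations and once after the last stage would give the two sets of points that the text suggests showing on the same plot.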
Finally, a PCA of the consensus is performed. In addition to the n consensus samples, the n.k transformed samples are located on the principal axes. Each principal component is explained using either its correlations with the sum_i p_i initial variables, or the rotation coefficients of the variables on it.
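The final PCA stage can be sketched as follows in Python/NumPy. This is an illustrative sketch under assumptions, not the program's code: the names `consensus_pca` and `correlations_with_components` are hypothetical, the principal axes are taken from a singular value decomposition of the centred consensus, and the correlations are plain Pearson correlations (variables with zero variance, such as padded zero columns, would need to be excluded first).

```python
import numpy as np

def consensus_pca(C, transformed):
    """PCA of the consensus C; transformed configurations are projected
    onto the same principal axes."""
    Cc = np.asarray(C, dtype=float)
    Cc = Cc - Cc.mean(axis=0)
    _, sing, Vt = np.linalg.svd(Cc, full_matrices=False)
    axes = Vt.T                                      # principal axes as columns
    percent = 100.0 * sing**2 / np.sum(sing**2)      # variation explained (%)
    consensus_scores = Cc @ axes                     # n consensus samples
    projected = [(X - X.mean(axis=0)) @ axes for X in transformed]  # n.k samples
    return consensus_scores, projected, percent

def correlations_with_components(variables, scores):
    """Correlation of each original variable with each principal component."""
    q = variables.shape[1]
    full = np.corrcoef(variables, scores, rowvar=False)
    return full[:q, q:]

rng = np.random.default_rng(2)
transformed = [rng.normal(size=(6, 3)) for _ in range(4)]
C = np.mean(transformed, axis=0)

scores, projected, percent = consensus_pca(C, transformed)
corr = correlations_with_components(transformed[0], scores)
print(percent)     # percentages in decreasing order, summing to 100
print(corr.shape)  # (variables of the first configuration) x (components)
```

The `percent` vector plays the role of the consensus row in a table of variation explained, and `corr` corresponds to the correlations used to interpret each principal component in terms of one assessor's variables.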
The program presented here is composed of 331 IML instructions, but will be optimized soon. It was developed in IML Version 6 for personal computers, and recently adapted to IML Version 5 for mainframes.
APPLICATION OF GPA IN SENSORY ANALYSIS
Food scientists often use the profile method, which consists in asking assessors to score samples for attributes or descriptive terms. But a few problems appear when humans are used as measuring instruments. For instance, the understanding of attributes may differ from one assessor to another, or the given attributes may not be the most appropriate to describe the differences between samples detected by some assessors.
WILLIAMS and LANGRON (1984) described Free-choice Profiling (FCP) in order to avoid these problems. Each assessor is asked to establish his own list of attributes, which must mean something to him, but not necessarily to anyone else.
The foundation of FCP is the assumption that assessors perceive the same differences between samples, but describe them with different vocabularies. Thus it is admissible to translate, dilate and rotate the configurations defined by the assessors, as these transformations preserve the ratios of distances.
The Procrustes rotations establish a link between the different vocabularies, and finally each principal component of the consensus can be referred to each vocabulary through correlations.
The real data submitted to the program consist of the evaluation of 6 strawberry jams (n = 6 samples) by 15 assessors (k = 15 configurations) for their self-chosen attributes (p_i variables, 1 <= p_i <= 8). The jams were made with different sugars, but it is beyond the scope of this paper to detail the material and the results of this sensory study. The data are only used here as an example to show the input and output of the GPA program.
GENERALISED PROCRUSTES ANALYSIS - Free choice profiling
Input data set observations must be the samples
Variables must be sensory attributes arranged assessor by assessor
A character variable must give sample identifiers
DATA = strawber                         Input data set
SAMP = jam                              Name of sample identifier
m    = 15                               Number of assessors
JUDG = A B C D E F G H I J K L M N O    List of assessor identifiers
ATTR = 8 2 2 3 2 1 7 5 1 3 8 3 4 7 1    List of numbers of attributes for every assessor
OPTIONAL OUTPUT DATA SETS
OUT1 = jam_gpc                          Sample coordinates (GPC)
OUT2 = ass_pco                          Assessors coordinates after each stage (PCO)
OUT3 = att_cor                          Correlation between original attributes and GPC
OUT4                                    Rotation of original attributes on GPC
Rotation is not very important compared to translation. See S.P. Langron and A.J. Collins, J.R. Statist. Soc. B, 1985 for details on PANOVA theory. Anyway, total SS is composed of 65% from translation, 18% from rotation, 2% from scaling and 15% from residual.
Tables are cut to show only results of assessors A, B, N and O. First letter of assessor identifiers denotes the stage of the analysis, following letters denote the assessors. This table is stored in 'ass_pco' data set.
PCA of consensus configuration:
Variation explained by dimensions of individual and consensus PCA (%):
     GPC1        GPC2        GPC3        GPC4        GPC5
A    66.679404   29.90766    2.7336591   0.5272911   0.1519861
B    77.864503   22.135497   9.579E-15   9.413E-16   -1.73E-17
C    99.263743   0.7362566   6.588E-15   4.794E-15   9.031E-16
D    75.337828   23.370458   1.2917138   8.931E-16   -6.41E-16
E    67.842288   32.157712   3.658E-15   2.434E-15   4.612E-16
F    100         1.171E-15   4.608E-15   3.831E-15   -9.65E-16
G    77.811774   14.982047   5.4996584   1.5481041   0.1584165
H    77.915897   18.489711   3.4279184   0.1612059   0.0052674
I    100         1.482E-14   1.485E-15   6.471E-16   -1.26E-15
J    96.720179   2.8624107   0.4174106   3.602E-15   1.779E-15
K    49.532966   18.083898   16.89005    11.297887   4.1951995
L    83.914547   9.8759313   6.209522    1.085E-14   1.171E-15
M    77.842552   21.222954   0.5296829   0.4048104   1.803E-15
N    65.587271   21.21533    12.550197   0.5110385   0.1361642
O    100         4.214E-15   9.451E-16   1.159E-17   -2.54E-11
*    76.609031   11.242371   7.5959023   2.9919233   1.5607728
* denotes the consensus. The first two GPC are sufficient to describe the consensus.
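The statement about the first two GPC can be checked with simple arithmetic on the consensus row of the table. The short Python sketch below only accumulates the percentages printed in that row; the variable names are illustrative.

```python
# Percent variation explained by the consensus PCA (last row of the table)
percent = [76.609031, 11.242371, 7.5959023, 2.9919233, 1.5607728]

cumulative = []
running = 0.0
for value in percent:
    running += value
    cumulative.append(running)

print(round(cumulative[1], 2))  # 87.85
```

The first two dimensions therefore carry about 88% of the consensus variation, which supports restricting the interpretation to the first principal plane.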