
CHAPTER 9

Data Transformations

Most data sets benefit by one or more data transformations. The reasons for transforming data can be grouped into statistical and ecological reasons:

Statistical

• improve assumptions of normality, linearity, homogeneity of variance, etc.

• make units of attributes comparable when measured on different scales (for example, if you have elevation ranging from 100 to 2000 meters and slope from 0 to 30 degrees)

Ecological

• make distance measures work better

• reduce the effect of total quantity (sample unit totals) to put the focus on relative quantities

• equalize (or otherwise alter) the relative importance of common and rare species

• emphasize informative species at the expense of uninformative species.

Monotonic transformations are applied to each element of the data matrix, independent of the other elements. They are "monotonic" because they change the values of the data points without changing their rank. Relativizations adjust matrix elements by a row or column standard (e.g., maximum, sum, mean, etc.). One transformation described below, Beals smoothing, is unique in being a probabilistic transformation based on both row and column relationships. In this chapter, we also describe other adjustments to the data matrix, including deleting rare species, combining entities, and calculating first differences for time series data.

It is difficult to overemphasize the potential importance of transformations. They can make the difference between illusion and insight, fog and clarity. To use transformations effectively requires a good understanding of their effects, and a clear vision of your goals.

Notation.— In all of the transformations described below,

x_ij = the original value in row i and column j of the data matrix

b_ij = the adjusted value that replaces x_ij.

Domains and ranges

Bear in mind that some transformations are unreasonable or even impossible for certain types of data. Table 9.1 lists the kinds of data that are potentially usable for each transformation.

Monotonic transformations

Power transformation

b_ij = x_ij^p

Different parameters (exponents) for the transformation change the effect of the transformation: p = 0 gives presence/absence, p = 0.5 gives square root, etc. The smaller the parameter, the more compression applied to high values (Fig. 9.1).

The square root transformation is similar in effect to, but less drastic than, the log transform. Unlike the log transform, special treatment of zeros is not needed. The square root transformation is commonly used. Less frequent is a higher root, such as a cube root or fourth root (Fig. 9.1). For example, Smith et al. (2001)

Figure 9.1. Effect of square root and higher root transformations, b = f(x), shown for powers 1/2, 1/3, 1/4, and 1/10. Note that roots higher than three are essentially presence-absence transformations, yielding values close to 1 for all nonzero values.



Table 9.1. Domain of input and range of output from transformations.

Transformation                       Reasonable and acceptable domain of x   Range of f(x)

MONOTONIC TRANSFORMATIONS
x^0 (power)                          all                                     0 or 1 only
x^p (power)                          nonnegative                             nonnegative
log(x)                               positive                                all
(2/π)·arcsin(x)                      0 ≤ x ≤ 1                               0 to 1 inclusive
(2/π)·arcsin(√x)                     0 ≤ x ≤ 1                               0 to 1 inclusive

SMOOTHING
Beals smoothing                      0 or 1 only                             0 to 1 inclusive

ROW/COLUMN RELATIVIZATIONS
general                              nonnegative                             0 to 1 inclusive
by maximum                           nonnegative                             0 to 1 inclusive
by mean                              all                                     all
by standard deviates                 all                                     generally between -10 and 10
binary by mean                       all                                     0 or 1 only
rank                                 all                                     positive integers
binary by median                     all                                     0 or 1 only
ubiquity                             nonnegative                             nonnegative
information function of ubiquity     nonnegative                             nonnegative

applied a cube root to count data, a choice supported by an optimization procedure. Roots at a power higher than three transform the data nearly to presence-absence: nonzero values become close to one, while zeros remain at zero.
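The effect of the exponent is easy to see numerically. The following Python sketch is ours, not part of the original chapter; the function name power_transform is arbitrary. It applies b_ij = x_ij^p to a small matrix, treating p = 0 as the presence/absence case described above:

```python
def power_transform(X, p):
    """Power transformation b_ij = x_ij ** p applied element by element.

    p = 1 leaves the data unchanged, p = 0.5 is the square root, and
    p = 0 reduces the data to presence/absence (0 stays 0, any nonzero
    value becomes 1; handled as a special case because 0**0 == 1 in Python).
    """
    def f(x):
        if p == 0:
            return 0.0 if x == 0 else 1.0  # presence/absence
        return x ** p
    return [[f(x) for x in row] for row in X]

X = [[0, 4, 100],
     [9, 0, 25]]
print(power_transform(X, 0.5))  # [[0.0, 2.0, 10.0], [3.0, 0.0, 5.0]]
print(power_transform(X, 0))    # [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
```

Note how the square root compresses the spread between 4 and 100 far less drastically than p = 0 does.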

Logarithmic transformation

b_ij = log(x_ij)

Log transformation compresses high values and spreads low values by expressing the values as orders of magnitude. Log transformation is often useful when there is a high degree of variation within variables or when there is a high degree of variation among attributes within a sample. These are commonly true with count data and biomass data.

Log transformations are extremely useful for many kinds of environmental and habitat variables, the lognormal distribution being one of the most common in nature. See Limpert et al. (2001) for a general introduction to lognormal distributions and applications in various sciences. They claim that the abundance of species follows a truncated lognormal distribution, citing Sugihara (1980) and Magurran (1988). While the nonzero values of community data sets often resemble a lognormal distribution, excluding zeros often amounts to ignoring half of a data set. The lognormal distribution is fundamentally flawed when applied to community data because a zero value is, more often than not, the most frequent abundance value for a species. Nevertheless, the log transformation is extremely useful in community analysis, provided that one carefully handles the problem of log(0) being undefined.

To log-transform data containing zeros, a small number must be added to all data points. If the lowest nonzero value in the data is one (as in count data), then it is best to add one before applying the transformation:

b_ij = log(x_ij + 1)


If, however, the lowest nonzero value of x differs from one by more than an order of magnitude, then adding one will distort the relationship between zeros and other values in the data set. For example, biomass data often contain many small decimal fractions (values such as 0.00345 and 0.00332) ranging up to fairly large values (in the hundreds). Adding a one to the whole data set will tend to compress the resulting distribution at the low end of the scale. The order-of-magnitude difference between 0.003 and 0.03 is lost if you add a one to both values before log transformation: log(1.003) is about the same as log(1.03).

The following transformation is a generalized procedure that (a) tends to preserve the original order of magnitudes in the data and (b) results in values of zero when the initial value was zero. Given:

Min(x) is the smallest nonzero value in the data

Int(x) is a function that truncates x to an integer by dropping digits after the decimal point

c = order-of-magnitude constant = Int(log(Min(x)))

d = decimal constant = log⁻¹(c)

then the transformation is

b_ij = log(x_ij + d) - c

Subtracting the constant c from each element of the data set after the log transformation shifts the values such that the lowest value in the data set will be a zero.

For example, if the smallest nonzero value in the data set is 0.00345, then

log(min(x)) = -2.46

c = Int(log(min(x))) = -2

d = log⁻¹(c) = 0.01.

Applying the transformation to some example values:

If x = 0, then b = log(0 + 0.01) - (-2), therefore b = 0.

If x = 0.00345, then b = log(0.00345 + 0.01) - (-2), therefore b = 0.128.
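The generalized procedure above can be sketched in a few lines of Python. This is our illustration, not from the original text; the name generalized_log is hypothetical, and base-10 logarithms are assumed, as in the chapter's example:

```python
import math

def generalized_log(X):
    """Zero-safe log transform: b_ij = log10(x_ij + d) - c, where
    c = Int(log10(smallest nonzero x)) (truncated toward zero) and
    d = 10**c. Zeros map to zero, and the order-of-magnitude spacing
    of small values is roughly preserved."""
    min_nonzero = min(x for row in X for x in row if x > 0)
    c = math.trunc(math.log10(min_nonzero))  # order-of-magnitude constant
    d = 10.0 ** c                            # decimal constant
    return [[math.log10(x + d) - c for x in row] for row in X]

X = [[0.0, 0.00345, 0.03, 150.0]]
B = generalized_log(X)
print([round(b, 3) for b in B[0]])  # [0.0, 0.129, 0.602, 4.176]
```

The exact value for x = 0.00345 is 0.1287 (reported as 0.128 in the worked example above). For count data whose smallest nonzero value is 1, c = 0 and d = 1, so this reduces to the simple log(x + 1) rule.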

Arcsine transformation

b_ij = (2/π) · arcsin(x_ij)

The constant 2/π scales the result of arcsin(x) [in radians] to range from 0 to 1, assuming that 0 ≤ x ≤ 1. The function arcsin is the same as sin⁻¹, the inverse sine. Data must range between zero and one, inclusive. If they do not, you should relativize before selecting this transformation.

Unlike the arcsine-squareroot transformation, an arcsine transformation is usually counterproductive in community ecology, because it tends to spread the high values and compress the low values (Fig. 9.2). This might be useful for distributions with negative skew, but community data almost always have positive skew.

Arcsine squareroot transformation

b_ij = (2/π) · arcsin(√x_ij)

The arcsine-squareroot transformation spreads the ends of the scale for proportion data, while compressing the middle (Fig. 9.2). This transformation is recommended by many statisticians for proportion data, often improving normality (Sokal and Rohlf 1995). The data must range between zero and one, inclusive. The arcsine-squareroot is multiplied by 2/π to rescale the result so that it ranges from 0 to 1.
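As a minimal sketch (ours, not the chapter's; arcsine_sqrt is an arbitrary name), the transformation is:

```python
import math

def arcsine_sqrt(X):
    """b_ij = (2/pi) * arcsin(sqrt(x_ij)) for proportion data in [0, 1].
    Spreads the ends of the scale and compresses the middle; the 2/pi
    factor rescales the output to the range 0..1."""
    return [[(2.0 / math.pi) * math.asin(math.sqrt(x)) for x in row]
            for row in X]

X = [[0.0, 0.25, 0.5, 1.0]]
print([round(b, 3) for b in arcsine_sqrt(X)[0]])  # [0.0, 0.333, 0.5, 1.0]
```

Note that 0.25 moves up to 1/3 while the endpoints stay fixed, illustrating the spreading at the ends of the scale.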

The logit transformation, b = ln(x/(1-x)), is also sometimes used for proportion data (Sokal and Rohlf 1995). However, if x = 0 or x = 1, then the logit is undefined. Often a small constant is added to prevent

Figure 9.2. Effect of several transformations on proportion data, including arcsin(√x) and arcsin(x).


ln(0) and division by zero. Alternatively, empirical logits may be used (see Sokal and Rohlf 1995, p. 762). Because zeros are so common in community data, it seems reasonable to use the arcsine-squareroot or squareroot transformations to avoid this problem.

Beals smoothing

Beals smoothing replaces each cell in the community matrix with a probability of the target species occurring in that particular sample unit, based on the joint occurrences of the target species with the species that are actually in the sample unit. The purpose of this transformation (also known as the sociological favorability index, Beals 1984) is to relieve the "zero-truncation problem" (Beals 1984). This problem is nearly universal in community data sets and most severe in heterogeneous community data sets that contain a large number of zeros (i.e., most samples contain a fairly small proportion of the species). Beals smoothing replaces presence/absence or other binary data with quantitative values that represent the "favorability" of each sample for each species, regardless of whether the species was present in the sample. The index evaluates the favorability of a given sample for species j based on the whole data set, using the proportions of joint occurrences between the species that do occur in the sample and species j.

b_ij = (1/S_i) · Σ_k (M_jk / N_k)   for all species k with x_ik ≠ 0,

where S_i is the number of species in sample unit i, M_jk is the number of sample units with both species j and k, and N_k is the number of sample units with species k. This transformation is illustrated in Box 9.1.

This transformation is essentially a smoothing operation designed for community data (McCune 1994). As with any numerical smoothing, it tends to reduce the noise in the data by enhancing the strongest patterns. In this case the signal that is smoothed is the pattern of joint occurrences in the data. This is an extremely powerful transformation that is particularly effective on heterogeneous or noisy data. Caution is warranted, however, because, as for any smoothing function, this transformation can produce the appearance of reliable, consistent trends even from a series of random numbers.

This transformation should not be used on data sets with few zeros. It also should not be used if the data are quantitative and you do not want to lose this information.

Beals smoothing can be slow to compute. If you have a large data set and a slow computer, be sure to allocate plenty of time. This transformation is available in PC-ORD but apparently not in other packages for statistical analysis.

Relativizations

"To relativize or not to relativize, that focuses the question." (Shakespeare, ????)

Relativizations rescale individual rows (or columns) in relationship to some criterion based on the other rows (or columns). Any relativization can be applied to either rows or columns.

Relativization is an extremely important tool that all users of multivariate statistics in community ecology MUST understand. There is no right or wrong answer to the question of whether or not to relativize UNTIL one specifies the question and examines the properties of the data.

If the row totals are approximately equal, then relativization by rows will have little effect. Consistency of row totals can be evaluated by the coefficient of variation (CV) of the row totals (Table 9.2). The CV% is calculated as 100 × (standard deviation / mean). In this case, it is the standard deviation of the row totals divided by the mean of the row totals.

Table 9.2. Evaluation of degree of variability in row or column totals as measured with the coefficient of variation of row or column totals.

CV, %      Variability among rows (or columns)

< 50       Small. Relativization usually has a small effect on the qualitative outcome of the analysis.
50-100     Moderate (with a correspondingly moderate effect on the outcome of further analysis).
100-300    Large. Large effect on results.
> 300      Very large.
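The CV% calculation can be sketched directly (our illustration; cv_percent is an arbitrary name, and the sample standard deviation, with n - 1 in the denominator, is assumed):

```python
import math

def cv_percent(values):
    """Coefficient of variation, CV% = 100 * (standard deviation / mean).
    Uses the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 100.0 * sd / mean

# CV of row (sample unit) totals for a small community matrix
X = [[5, 3, 2],      # row total 10
     [1, 0, 1],      # row total 2
     [10, 20, 10]]   # row total 40
row_totals = [sum(row) for row in X]
print(round(cv_percent(row_totals), 1))  # 115.6
```

By Table 9.2, a CV of 115.6% is "large," so relativizing these rows would strongly affect the results.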


Box 9.1. Example of Beals smoothing

Data matrix X before transformation (3 sample units × 5 species):

       sp1  sp2  sp3  sp4  sp5   S_i
SU1     1    0    1    1    1     4
SU2     0    0    0    1    0     1
SU3     1    1    0    0    0     2
N_j     2    1    1    2    1

S_i = number of species in sample unit i.
N_j = number of sample units with species j.

Construct matrix M, where M_jk = number of sample units with both species j and k.
(Note that where j = k, M_jk = N_j.)

              Species k
           1    2    3    4    5
Species j
   1       2
   2       1    1
   3       1    0    1
   4       1    0    1    2
   5       1    0    1    1    1

Construct new matrix B containing values transformed with the Beals smoothing function:

b_ij = (1/S_i) · Σ_k (M_jk / N_k)   for all k with x_ik ≠ 0

Data after transformation (B):

       sp1   sp2   sp3   sp4   sp5
SU1   0.88  0.13  0.75  0.88  0.75
SU2   0.50  0.00  0.50  1.00  0.50
SU3   1.00  0.75  0.25  0.25  0.25

Example for sample unit 1 and species 2:

b_12 = 1/4 · (1/2 + 0/1 + 0/2 + 0/1)
b_12 = 0.25 · (0.5)
b_12 = 0.125 (rounded to 0.13 in matrix above)

Example for sample unit 3 and species 2:

b_32 = 1/2 · (1/2 + 1/1)
b_32 = 0.5 · (1.5)
b_32 = 0.75
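The calculations in Box 9.1 can be reproduced with a short Python sketch (ours, not from the original text; beals_smoothing is a hypothetical name). It prints exact values, so 0.875 and 0.125 correspond to the rounded 0.88 and 0.13 in the box:

```python
def beals_smoothing(X):
    """Beals smoothing for a binary (presence/absence) matrix X
    (rows = sample units, columns = species):
        b_ij = (1/S_i) * sum over species k present in SU i of (M_jk / N_k)
    where S_i = species richness of SU i, N_k = number of SUs containing
    species k, and M_jk = number of SUs containing both species j and k."""
    n, p = len(X), len(X[0])
    N = [sum(X[i][k] for i in range(n)) for k in range(p)]
    M = [[sum(1 for i in range(n) if X[i][j] and X[i][k]) for k in range(p)]
         for j in range(p)]
    B = []
    for i in range(n):
        present = [k for k in range(p) if X[i][k]]
        S_i = len(present)
        B.append([sum(M[j][k] / N[k] for k in present) / S_i
                  for j in range(p)])
    return B

# The worked example from Box 9.1
X = [[1, 0, 1, 1, 1],
     [0, 0, 0, 1, 0],
     [1, 1, 0, 0, 0]]
for row in beals_smoothing(X):
    print([round(b, 3) for b in row])
# [0.875, 0.125, 0.75, 0.875, 0.75]
# [0.5, 0.0, 0.5, 1.0, 0.5]
# [1.0, 0.75, 0.25, 0.25, 0.25]
```

This naive version is O(n·p²); for large matrices a vectorized implementation would be preferable, consistent with the note above that Beals smoothing can be slow to compute.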

Figure 9.3. Effect of various transformations on relative weighting of species (panels: raw data, presence-absence, Beals smoothing, log(x), sqrt(x), relativization by species totals, relativization by SU totals, relativization by species maximum, and log(x) followed by relativization by species maximum; y-axes show sum of cover, frequency, or sum of p(occurrence); x-axes show species rank). Species abundance was measured on a continuous, quantitative scale. "Rank" is the order of species ranked by their abundance.


If the row or column totals are unequal, one must decide whether to retain this information as part of the analysis or whether to remove it by relativizing. One must justify this decision on biological grounds, not on its effect on the CV of row or column totals. For example, consider two quadrats with identical proportions of three species, but one quadrat has a total cover of 1% and the other has a total cover of 95%. If the data are relativized, then the quadrats appear similar or identical. If they are not relativized, then distance measures will consider them to be very different. Which choice is correct? The answer depends on the question. Does the question refer to proportions of different species or is the total amount also important? If the latter is true, the data should not be relativized.

An example demonstrates how relativization can change the focus of the analysis. Menges et al. (1993) reported rates of vegetation change based on both relativized and nonrelativized tree species data, beginning with a matrix of basal area of each species in remeasured permanent plots. They used absolute rates to emphasize structural changes (e.g., increase in basal area of existing species) and relative rates to emphasize shifts in species composition (changes in the relative proportions of species).

Relativization is often used to put variables that were measured in different units on an equal footing. For example, a data set may contain counts for some species and cover for other species. In forest ecology, one may wish to combine basal area data for trees with cover data for herbs. If the species measured in different units are to be analyzed together, then one must relativize the data such that the quantity for each species is expressed as a proportion of some total or maximum abundance.

Relativizations can have a huge effect on the relative weighting of rare and abundant species. Raw quantitative data on a continuous scale tend to have a few abundant species and many rare species (Fig. 9.3). A multivariate analysis of these raw data might emphasize only a few species, ignoring most of the species. Log or square-root transformation of the data usually moderates the imbalance, while relativization by species totals can eliminate it completely (Fig. 9.3). This is, however, a drastic transformation. Rare species often occur haphazardly, so that giving them a lot of weight greatly increases the noise in the analysis.

General relativization

By rows:     b_ij = x_ij / ( Σ_{j=1}^{q} x_ij^p )^{1/p}

By columns:  b_ij = x_ij / ( Σ_{i=1}^{n} x_ij^p )^{1/p}

for a matrix of n rows and q columns.

The parameter p can be set to achieve different objectives. If p = 1, relativization is by row or column totals. This is appropriate when using analytical tools based on city-block distance measures, such as Bray-Curtis or Sorensen distance. If p = 2, you are "standardizing by the norm" (Greig-Smith 1983, p. 248). Using p = 2 is the Euclidean equivalent of relativization by row or column totals. It is appropriate when the analysis is based on a Euclidean distance measure. The same effect can be achieved by using "relative Euclidean distance" (see Chapter 6).
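The two common settings of p can be sketched as follows (our illustration, not the chapter's; relativize_rows is an arbitrary name, and rows with all zeros are assumed absent, since they would cause division by zero):

```python
def relativize_rows(X, p=1):
    """General relativization by rows: b_ij = x_ij / (sum_j x_ij**p)**(1/p).
    p = 1 relativizes by row totals (suited to city-block measures such
    as Sorensen distance); p = 2 standardizes by the row norm (suited to
    Euclidean distance). Assumes no row is entirely zero."""
    B = []
    for row in X:
        denom = sum(x ** p for x in row) ** (1.0 / p)
        B.append([x / denom for x in row])
    return B

X = [[1, 3],
     [30, 10]]
print(relativize_rows(X, p=1))  # [[0.25, 0.75], [0.75, 0.25]]
print([[round(b, 3) for b in row] for row in relativize_rows(X, p=2)])
# [[0.316, 0.949], [0.949, 0.316]]
```

Note that both sample units, despite totals of 4 and 40, end up with identical profiles under p = 1, illustrating how relativization removes the total-quantity signal.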

Relativization by maximum

b_ij = x_ij / xmax_j

where rows (i) are samples and columns (j) are species, and xmax_j is the largest value in the matrix for species j. As for relativization by species totals, this adjustment tends to equalize common and uncommon species. Relativization by species maxima equalizes the heights of peaks along environmental gradients, while relativization by species totals equalizes the areas under the curves of species responses.

Many people have found this to be an effective transformation for community data. A couple of cautions should be heeded, however: (1) very rare species can cause considerable noise in subsequent analyses if not omitted; (2) this and any other statistic based on extreme values can accentuate sampling error.
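A minimal sketch (ours; the function name is arbitrary) shows each species rescaled so its maximum becomes 1:

```python
def relativize_by_species_max(X):
    """b_ij = x_ij / xmax_j, where xmax_j is the largest value of
    species (column) j in the matrix; rows are sample units."""
    n, p = len(X), len(X[0])
    xmax = [max(X[i][j] for i in range(n)) for j in range(p)]
    return [[X[i][j] / xmax[j] for j in range(p)] for i in range(n)]

# Two species measured on very different scales
X = [[2, 50],
     [4, 25],
     [1, 100]]
print(relativize_by_species_max(X))
# [[0.5, 0.5], [1.0, 0.25], [0.25, 1.0]]
```

After the transformation, each species peaks at 1.0, regardless of whether its raw maximum was 4 or 100.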

Adjustment to mean

b_ij = x_ij − x̄

The row or column mean (x̄) is subtracted from each value, producing positive and negative numbers. If relativized by rows, the means are row means; if by columns, the means are column means. The negative numbers obviate proportion-based distance measures, such as Sorensen and Jaccard. This unstandardized



centering procedure can have detrimental effects on the analysis of community data: it tends to emphasize values of zero more than does the raw data. Also, more variable species are reduced in importance relative to more constant species.

Adjustment to standard deviate

b_ij = (x_ij − x̄_j) / s_j

where s_j is the standard deviation within column j. Each transformed value represents the number of standard deviations by which it differs from the mean; these values are often known as "z scores." As for all of the relativizations, this transformation can be applied to either rows or columns. It is, however, usually applied to variables (columns). This transformation results in all variables having mean = 0 and variance = 1.

Because this transformation produces both positive and negative numbers, it is NOT compatible with proportion-based distance measures, such as Sorensen's. While this transformation is of limited utility for species data, it can be a very useful relativization for environmental variables, placing them on equal footing for a variety of purposes.
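A short sketch (ours; zscore_columns is an arbitrary name, and the sample standard deviation is assumed) shows two environmental variables on very different scales, in the spirit of the elevation-versus-slope example at the start of the chapter, receiving identical standardized values:

```python
import math

def zscore_columns(X):
    """Adjustment to standard deviate: b_ij = (x_ij - mean_j) / s_j,
    giving every column mean 0 and variance 1 (sample SD, n - 1)."""
    n, p = len(X), len(X[0])
    B = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        s = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        for i in range(n):
            B[i][j] = (X[i][j] - mean) / s
    return B

# Column 1 on a scale of hundreds, column 2 on a scale of ones
X = [[100, 1],
     [200, 2],
     [300, 3]]
print([[round(b, 2) for b in row] for row in zscore_columns(X)])
# [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```

Both columns end up with the same z scores, which is exactly the "equal footing" property described above.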

Binary with respect to mean

b_ij = 1 if x_ij > x̄;  b_ij = 0 if x_ij ≤ x̄

An element is assigned a zero if its value is less than or equal to the row or column mean, x̄. The element is assigned a one if its value is above the mean. Applied to species (columns), this transformation can be used to contrast above-average conditions with below-average conditions. The transformation therefore emphasizes the optimal parts of a species distribution. It also tends to equalize the influence of common and rare species. Applied to sample units, it emphasizes dominant species and is likely to eliminate many species, particularly those that rarely, if ever, occur in high abundances.

Rank adjustment

Matrix elements are assigned ranks within rows or columns such that the row or column totals are constant. Ties are assigned the average rank of the tied elements. For example, the values 1, 3, 3, 9, 10 would receive ranks 1, 2.5, 2.5, 4, 5.

This transformation should be applied with caution. For example, most community data have many zeros, and these zeros are counted as ties. Because the number of zeros in each row or column will vary, zeros will be transformed to different values, depending on the number of zeros in each row or column. For example, the values 0, 0, 0, 0, 6, 9 would receive the ranks 2.5, 2.5, 2.5, 2.5, 5, 6, while the values 0, 0, 6, 9 would receive the ranks 1.5, 1.5, 3, 4.
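Both tie examples can be reproduced with a small sketch (ours; rank_with_ties is an arbitrary name):

```python
def rank_with_ties(values):
    """Assign ranks, giving tied values the average of the ranks they
    would jointly occupy. E.g. [1, 3, 3, 9, 10] -> [1, 2.5, 2.5, 4, 5]."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2.0  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

print(rank_with_ties([1, 3, 3, 9, 10]))    # [1.0, 2.5, 2.5, 4.0, 5.0]
print(rank_with_ties([0, 0, 0, 0, 6, 9]))  # [2.5, 2.5, 2.5, 2.5, 5.0, 6.0]
```

Running it on [0, 0, 6, 9] gives [1.5, 1.5, 3.0, 4.0], confirming the caution above: the same zero receives a different rank depending on how many zeros accompany it.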

Binary with respect to median

b_ij = 1 if x_ij > median;  b_ij = 0 if x_ij ≤ median

The transformed values are zeros or ones. An element is assigned a zero if its value is less than or equal to the row or column median. The element is assigned a one if its value is greater than the row or column median. This transformation can be used to emphasize the optimal parts of a species range, at the same time equalizing to some extent the weight given to rare and dominant species. The rank adjustment caution also applies to this relativization because it too is based on ranks.

Weighting by ubiquity

b_ij = U_j · x_ij, where U_j = N_j / N

If rows are samples, columns are species, and relativization is by columns, more ubiquitous species are given more weight. Under these conditions, N_j is the number of samples in which species j occurs and N is the total number of samples.

Information function of ubiquity

b_ij = I_j · x_ij

where

I_j = −p_j log(p_j) − (1 − p_j) log(1 − p_j)

and p_j = N_j / N, with N_j and N as defined above.

To illustrate the effect of this relativization, assume that rows are samples, columns are species, and relativization is by columns. Maximum weight is applied to species occurring in half of the samples, because those species have the maximum information content, according to information theory. Very common and rare species receive little weight. Note that if there are empty columns, the transformation will fail because the log of zero is undefined.
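The weighting can be sketched as follows (our illustration; info_ubiquity_weights is an arbitrary name, base-10 logarithms are assumed since the chapter does not specify the base, and species present in every sample are assigned I_j = 0 rather than failing on log(0)):

```python
import math

def info_ubiquity_weights(X):
    """b_ij = I_j * x_ij with I_j = -p_j*log(p_j) - (1-p_j)*log(1-p_j)
    and p_j = N_j / N (N_j = samples containing species j, N = number
    of samples). Weight peaks for species present in half the samples;
    I_j is set to 0 when p_j is 0 or 1 to avoid log(0)."""
    n, p = len(X), len(X[0])
    w = []
    for j in range(p):
        pj = sum(1 for i in range(n) if X[i][j] > 0) / n
        Ij = (-pj * math.log10(pj) - (1 - pj) * math.log10(1 - pj)
              if 0 < pj < 1 else 0.0)
        w.append(Ij)
    return [[w[j] * X[i][j] for j in range(p)] for i in range(n)]

# sp1 in 2 of 4 samples, sp2 in 1 of 4, sp3 in all 4
X = [[1, 1, 1],
     [1, 0, 1],
     [0, 0, 1],
     [0, 0, 1]]
print([round(b, 3) for b in info_ubiquity_weights(X)[0]])
# [0.301, 0.244, 0.0]
```

The species occurring in half the samples gets the largest weight (log10(2) ≈ 0.301), while the species present everywhere is weighted to zero.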

Double relativizations

The relativizations described above can be applied in various combinations to rows then columns, or vice versa. When applied in series, the last relativization necessarily mutes the effect of the preceding relativization.


The most common double relativization was first used by Bray and Curtis (1957). They first relativized by species maximum, equalizing the rare and abundant species, then they relativized by SU total. This and other double relativizations tend to equalize emphasis among SUs and among species. This comes at a cost of diminishing the intuitive meaning of individual data values.
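The Bray-Curtis double relativization amounts to composing the two earlier operations, which can be sketched as (our illustration; the function name is arbitrary):

```python
def bray_curtis_double(X):
    """Double relativization in the manner of Bray and Curtis (1957):
    first relativize each species (column) by its maximum, then each
    sample unit (row) by its total."""
    n, p = len(X), len(X[0])
    xmax = [max(X[i][j] for i in range(n)) for j in range(p)]
    by_max = [[X[i][j] / xmax[j] for j in range(p)] for i in range(n)]
    return [[b / sum(row) for b in row] for row in by_max]

X = [[2, 100],
     [4, 50]]
print([[round(b, 3) for b in row] for row in bray_curtis_double(X)])
# [[0.333, 0.667], [0.667, 0.333]]
```

Each row now sums to 1 and neither species dominates by virtue of its measurement scale, but an individual value such as 0.667 no longer has a direct interpretation in the original units.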

Austin and Greig-Smith (1968) proposed a "contingency deviate" relativization. This measures the deviation from an expected abundance. The expected abundance is based on the assumption of independence of the species and the samples. Expected abundance is calculated from the marginal totals of the n × p data set, just as if it were a large contingency table:

b_ij = x_ij − [ (Σ_{j=1}^{p} x_ij) · (Σ_{i=1}^{n} x_ij) ] / (Σ_{i=1}^{n} Σ_{j=1}^{p} x_ij)

that is, the observed value minus the expected value, (row total × column total) / grand total.

The resulting values include both negative and positive values and are centered on zero. The row and column totals become zero. Because this transformation produces negative numbers, it is incompatible with proportion-based distance measures.

One curious feature of this transformation is that zeros take on various values, depending on the marginal totals. The meaning of a zero is taken differently depending on whether the other elements of that row and column create large or small marginal totals. With sample unit × species data, a zero for an otherwise common species will be given more weight (i.e., a more negative value). This may be ecologically meaningful, but applied to rows the logic seems counterintuitive: a species that is absent from an otherwise densely packed sample unit will also be given high weight.
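A minimal sketch (ours; contingency_deviate is an arbitrary name) shows both the zeroed marginals and the way a zero abundance becomes a negative deviate:

```python
def contingency_deviate(X):
    """b_ij = x_ij - (row total_i * column total_j) / grand total:
    the deviation of each element from its expected value under
    independence of rows and columns. Row and column totals of the
    result are zero."""
    row_tot = [sum(row) for row in X]
    col_tot = [sum(col) for col in zip(*X)]
    grand = sum(row_tot)
    return [[X[i][j] - row_tot[i] * col_tot[j] / grand
             for j in range(len(X[0]))] for i in range(len(X))]

X = [[10, 0],
     [10, 20]]
print(contingency_deviate(X))  # [[5.0, -5.0], [-5.0, 5.0]]
```

The zero in the first sample unit becomes -5.0, illustrating how an absence is weighted according to the marginal totals, and each row and column of the result sums to zero.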

Deleting rare species

Deleting rare species is a useful way of reducing the bulk and noise in your data set without losing much information. In fact, it often enhances the detection of relationships between community composition and environmental factors. In PC-ORD, you select deletion of columns "with fewer than N nonzero numbers." For example, if N = 3, then all species with fewer than 3 occurrences are deleted. If N = 1, all empty species (columns) are deleted.

Deleting rare species is clearly inappropriate if you wish to examine patterns in species diversity. Cao et al. (1999) correctly pointed this out but confused the issue by citing proponents of deletion of rare species who were concerned with extracting patterns with multivariate analysis, not with comparison of species diversity. None of the authors they criticized suggested deleting rare species prior to analysis of species richness.

For multivariate analysis of correlation structure (in the broad sense), it is often helpful to delete rare species. As an approximate rule of thumb, consider deleting species that occur in fewer than 5% of the sample units. Depending on your purpose, however, you may wish to retain all species or eliminate an even higher percentage.
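The deletion criterion can be sketched as follows (our illustration, not PC-ORD's implementation; the function name is arbitrary). The 5% rule of thumb corresponds to min_occurrences = ceil(0.05 × number of sample units):

```python
def delete_rare_species(X, species_names, min_occurrences):
    """Drop species (columns) with fewer than min_occurrences nonzero
    values, in the spirit of deleting columns 'with fewer than N
    nonzero numbers.' Returns the reduced matrix and the kept names."""
    n = len(X)
    keep = [j for j in range(len(X[0]))
            if sum(1 for i in range(n) if X[i][j] != 0) >= min_occurrences]
    return ([[X[i][j] for j in keep] for i in range(n)],
            [species_names[j] for j in keep])

X = [[1, 0, 5],
     [2, 0, 3],
     [0, 1, 4]]
X2, kept = delete_rare_species(X, ["spA", "spB", "spC"], min_occurrences=2)
print(kept)  # ['spA', 'spC']
print(X2)    # [[1, 5], [2, 3], [0, 4]]
```

Here spB, present in only one sample unit, is removed, while the two more frequent species are retained.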

Some analysts object to removal of rare species on the grounds that we are discarding good information. Empirically this can be shown true or false by using an external criterion of what is "good" information. You can try this yourself. Use a familiar data set that has at least a moderately strong relationship between communities and a measured environmental factor. Ordinate (Part 4) the full data set, rotate the solution to align it with that environmental variable (Ch. 15), and record the correlation coefficient between the environmental variable and the axis scores. Now delete all species occurring in just one sample unit. Repeat the ordination → rotation → correlation procedure. Progressively delete more species (those in only two sample units, etc.), until only the few most common species remain. Now plot the correlation coefficients against the number of species retained (Fig. 9.4).

In our experience the correlation coefficient usually peaks at some intermediate level of retention of species (Fig. 9.4). When including all species, the noise from the rare ones weakens the structure slightly. On the other hand, when including only a few dominant species, too little redundancy remains in the data for the environmental gradient to be clearly expressed.

A second example compared the effect of stand structures on small mammals using a blocked design (D. Waldien 2002, unpublished). Fourteen species were enumerated in 24 stands, based on trapping data, then relativized by species maxima. The treatment effect size was measured with blocked MRPP (Ch. 24), using the A statistic (chance-corrected within-group agreement). Rare species were successively deleted, beginning with the rarest one, until only half of the species remained. In this case, removal of the four rarest species slightly increased the apparent effect size


[Figure 9.4 appears here: correlation plotted against the criterion for species removal (occurrence in % of SUs), with separate lines for depth to water table, distance from stream, and elevation above stream.]

Figure 9.4. Correlation between ordination axis scores and environmental variables can often be improved by removal of rare species. In this case, the strength of relationship between hydrologic variables and vegetation, as measured by r², is maximized with removal of species occurring in fewer than 5-15% of the sample units, depending on the hydrologic variable. The original data set contained 88 species; 59, 35, 16, and 9 species remained after removal of species occurring in fewer than 5, 15, 40, and 45% of the sample units, respectively. Data are courtesy of Nick Otting (1996, unpublished).

(Fig. 9.5). The fifth and sixth rarest species, however, were distinctly patterned with respect to the treatment, so their removal sharply diminished the apparent effect size.

Another objection to removal of rare species is that you cannot test hypotheses about whole-community structure if you exclude rare species. Certainly this is true for hypotheses about diversity. But it also applies to other measures of community structure. Statistical hypothesis tests are always, in some form or another, based on evaluating the relative strength of signal and noise. Because removal of rare species tends to reduce noise, the signal is more likely to be detected. This can be taken as an argument against removal of rare species, because it introduces a bias toward rejecting a null hypothesis. Alternatively, one can define beforehand the community of interest as excluding the rare species and proceed without bias.

Mark Fulton (1998, unpublished) summarized the noise vs. signal problem well:

Noise and information can only be defined in the context of a question of interest. An analogy: we are sitting in a noisy restaurant trying to have a conversation. From the point of view of our attempting to communicate, the ambient sound around us is "noise." Yet that noise carries all kinds of information: that clatter over to the left is the bus person clearing dishes at the next table; the laughter across the room is in response to the

[Figure 9.5 appears here: the A statistic plotted against the number of species removed.]

Figure 9.5. Response of the A statistic (blocked MRPP) to removal of rare species from small mammal trapping data. A measures the effect size of the treatments, in this case different stand structures.

punchline of a fairly good joke; with a little attention you can hear what the two men in business suits two tables over are arguing about; and that rumble you just heard is a truck full of furniture turning the corner outside the restaurant. But none of this information is relevant to the conversation, and so we filter it out without thinking about the process much.

Vegetation analysis is a process of noise filtering right from the very start. Data collection


itself is a tremendous filtering process. We decide what NOT to measure. Any transformation we do on the data, whether weighting, rescaling, or deletion of rare species, is also a filtering process. Ordination itself is a further filter. The patterns in the whole n-dimensional mess are of less interest than a carefully selected reduction of those patterns. The point is, as scientists, we need to do this process of information selection and noise reduction carefully and with full knowledge of what we are doing. There is no single procedure that will always bring out the information of interest. Data selection, transformation, and analysis can only be judged on how well they work in relation to the questions at hand.

Combining entities

Aggregate sample units (SUs) can be created by

averaging existing SUs. Each new entity is the "centroid" of the entities that you average. See Greig-Smith (1983, p. 286) for comments on ordinating groups of SUs. In general, community SUs should not be averaged unless they are very similar. If SUs are heterogeneous, then the average species composition tends to fall outside the variation of the SUs, the averages being unnaturally species-rich.

If your rows are SUs, and you also have an environmental matrix, you should also calculate centroids for the environmental data. Be careful if you have categorical environmental variables. Depending on how the categories are structured, averaging the categories can be meaningless.
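Computing centroids is a grouped row average. A minimal sketch, assuming rows are SUs and you supply one group label per row (the function name and data are ours); the same function can be applied to a matrix of quantitative environmental variables, but not to categorical ones:

```python
import numpy as np

def centroid_su(x, groups):
    """Average SUs (rows) within each group, returning one centroid row per group."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    return labels, np.vstack([x[groups == g].mean(axis=0) for g in labels])

# Three SUs, two species; the first two SUs belong to group A.
x = np.array([[4.0, 0.0],
              [2.0, 2.0],
              [0.0, 6.0]])
labels, cent = centroid_su(x, ["A", "A", "B"])
# centroid of group A = [3, 1]; centroid of group B = [0, 6]
```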

Difference between two dates

Before-and-after data on species abundance

obtained by revisiting the same SUs can be analyzed as differences, rather than as the original quantities. If a_ij,1 and a_ij,2 are the abundances of species j in sample unit i at times 1 and 2, then the difference between dates is

b_ij = a_ij,2 - a_ij,1

The transformed data represent changes through

time. Even with species abundance data, this transformation yields variables that are more or less normally distributed, with means near zero and with both positive and negative numbers. After this transformation, be sure not to use methods that demand nonnegative numbers: proportion coefficients (such as Sorensen) as distance measures, and techniques based on correspondence analysis (CA, RA, CCA, DCA, TWINSPAN). On the other hand, PCA and other techniques calling for multivariate normal data and linear relationships among variables will work far better on such a matrix than they would on either matrix alone.

First difference of time series

If your data form a time series (sample units are

repeatedly evaluated at fixed locations), you may want to ordinate the differences in abundance between successive dates rather than the raw abundances:

b_ij = a_i,t+1,j - a_i,t,j

for a community sampled at times t and t+1. This is simply the extension through time of the idea described in the preceding section. This transformation can be called a "first difference" (Allen et al. 1977) because it is analogous to the first derivative of a time series curve. With community data, a matrix of first differences represents changes in species composition. If we visualize changes in species composition as vectors in species space, the matrix of differences represents the lengths and directions of those vectors. A matrix of "second" differences would represent the rates of acceleration (or deceleration) of sample units moving through species space.

The matrix of first differences takes into account the direction of compositional change. For example, assume that the plankton in a lake go through two particular compositional states in the fall, then go through the same compositional states in the spring, but in the opposite direction. The difference between the two fall samples is not, therefore, the same as the difference between the spring samples, even though the absolute values of the differences are equal. Analyzing the signed difference is logical, but other possibilities exist. Allen et al. (1977) analyzed the absolute differences, creating a matrix of species' contributions to community change, without regard to the direction of the change:

b_ij = |a_i,t+1,j - a_i,t,j|
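Both the signed first differences and the absolute version of Allen et al. (1977) are one-line operations in numpy. This is an illustrative sketch with made-up numbers, not code from the book:

```python
import numpy as np

# Abundances of two species in one SU at four successive dates (rows = dates).
a = np.array([[5, 1],
              [3, 4],
              [6, 2],
              [6, 2]])

signed = np.diff(a, axis=0)     # b_t = a_(t+1) - a_t : keeps direction of change
absolute = np.abs(signed)       # Allen et al. (1977): magnitude of change only
```

Note that `signed` contains negative values, so the warnings above about proportion coefficients and correspondence-analysis methods apply to it.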

If environmental variables are recorded at each date, you might analyze species change from time t to t+1 in relation to the state of the environment at time t. Alternatively, you could apply the first-difference transformation to the environmental variables as well, to analyze the question of how community change is related to environmental change. On the other hand, variables that are constant through time for a given sample unit (e.g., location or treatment variables) could be retained without transformation.

Note that the statistical properties of these differences are radically different from those of the original data. For more information, see the preceding section on differences between two dates.


A general procedure for data adjustments

Species data

While one can easily grasp the logic of a particular

data adjustment, the number of combinations and sequences can be bewildering. Although it is impossible to write a step-by-step cookbook that covers all possible data sets and goals, we suggest a general

procedure for data adjustments that will be applicable to many community data sets (Table 9.3). For more details on steps 2, 3, and 4, consult the preceding pages. For more detail on step 5, consult the section on outliers in Chapter 7.

The sequence of actions is important. For example, we check for outliers last, because many apparent outliers will disappear, depending on the monotonic transformations or relativizations that are used.

Table 9.3. Suggested procedure for data adjustments of species data matrices. Each numbered action is followed by the criteria for deciding whether to apply it.

1. Calculate descriptive statistics. Repeat this after each step below. (In PC-ORD, run Row & column summary.)
   Criteria: Always. Examine beta diversity (community data sets), average skewness of columns, coefficient of variation (CV, %), CV of row totals, and CV of column totals.

2. Delete rare species (< 5% of sample units).
   Criteria: Usually applied to community data sets, unless contrary to study goals.

3. Monotonic transformation (if applied to species, then usually applied uniformly to all of them, so that all are scaled the same).
   Criteria: A. Average skewness of columns (species). B. Over how many orders of magnitude does the data range? (Count and biomass data often are extreme.) C. Beta diversity. (Consider a presence/absence transformation for community data when beta diversity is high.)

4. Row or column relativizations.
   Criteria: What is the question? Are units for all variables the same? Is relativization built into the subsequent analysis? Examine the CV of row totals and the CV of column totals. What distance measure do you intend to use? Note: regardless of your decision to relativize or not, you should state your decision and justify it briefly on biological grounds.

5. Check for outliers based on the average distance of each point from all other points. Calculate the standard deviation of these average distances. Describe outliers and take steps to reduce their influence, if necessary.
   Criteria (degree of deviation, in standard deviations):
      < 2        no problem
      2 - 2.3    weak outlier
      2.3 - 3    moderate outlier
      > 3        strong outlier
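Step 5 of Table 9.3 is easy to compute directly. A minimal sketch, assuming Euclidean distance and a small hypothetical data set (function name and cutoffs as in the table; the data are ours):

```python
import numpy as np

def outlier_sd(x):
    """For each SU (row), the number of standard deviations by which its average
    Euclidean distance to all other SUs departs from the grand mean of those
    average distances (step 5 of Table 9.3)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)  # pairwise distances
    avg = d.sum(axis=1) / (len(x) - 1)     # mean distance to the *other* SUs
    return (avg - avg.mean()) / avg.std()

# Nine tightly clustered SUs plus one SU far from the cluster (hypothetical).
x = np.vstack([np.arange(9)[:, None] * np.array([0.1, 0.1]),
               [[50.0, 50.0]]])
z = outlier_sd(x)   # last value exceeds 2, flagging that SU as an outlier
```

Values of `z` can then be read against the table's cutoffs (weak outlier above 2, strong above 3).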


Environmental data

Adjustments of environmental data depend greatly on their intended use, as indicated in Table 9.4.

Categorical and binary variables in general need no adjustment, but one should always examine quantitative environmental variables.

Table 9.4. Suggested procedure for data adjustments of quantitative variables in environmental data matrices. Each numbered action is followed by the criteria for deciding whether to apply it.

1. Calculate descriptive statistics for quantitative variables. Repeat this after each step below. (In PC-ORD, run Row & column summary.)
   Criteria: Always. Examine skewness and range for each variable (column).

2. Monotonic transformation (applied to individual variables, depending on need).
   Criteria: Consider a log or square root transformation for variables with skewness > 1 or ranging over several orders of magnitude. Consider an arcsine square-root transformation for proportion data.

3. Column relativizations.
   Criteria: Consider column relativization (by norm or by standard deviates) if environmental variables are to be used in a distance-based analysis that does not automatically relativize the variables (for example, using MRPP to answer the question: do groups of sample units defined by species differ in environmental space?). Column relativization is not necessary for analyses that use the variables one at a time (e.g., ordination overlays) or for analyses with built-in standardization (e.g., PCA of a correlation matrix).

4. Check for univariate outliers and take corrective steps if necessary.
   Criteria: Examine scatterplots or frequency distributions, or relativize by standard deviates ("z-scores") and check for high absolute values.
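The two transformations suggested for environmental matrices in Table 9.4 can be sketched as follows. The 2/pi rescaling of the arcsine square root (mapping proportions back onto a 0-1 range) is one common convention, an assumption on our part rather than the table's wording; the function names are ours.

```python
import numpy as np

def arcsine_sqrt(p):
    """Arcsine square-root transformation for proportion data (0 <= p <= 1),
    rescaled by 2/pi so that p = 0 maps to 0 and p = 1 maps to 1 (our convention)."""
    return (2.0 / np.pi) * np.arcsin(np.sqrt(p))

def z_scores(x):
    """Column relativization by standard deviates (sample SD, ddof=1)."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

p = np.array([0.0, 0.25, 0.5, 1.0])      # proportion data
tp = arcsine_sqrt(p)                      # -> [0, 1/3, 0.5, 1]

env = np.array([[1.0, 10.0],              # two quantitative variables on
                [2.0, 20.0],              # very different scales
                [3.0, 30.0]])
z = z_scores(env)                         # each column now has mean 0, SD 1
```

After z-scoring, large absolute values in `z` also serve the univariate outlier check of step 4.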