
Journal of Cerebral Blood Flow and Metabolism 16:1271-1279 © 1996 The International Society of Cerebral Blood Flow and Metabolism. Published by Lippincott-Raven Publishers, Philadelphia

Tests for Comparing Images Based on Randomization and Permutation Methods

*†Stephan Arndt, *Ted Cizadlo, *Nancy C. Andreasen, *Dan Heckel, *Sherri Gold, and *Daniel S. O'Leary

*Mental Health Clinical Research Center, Department of Psychiatry, and †Department of Preventive Medicine and Environmental Health, The University of Iowa Hospitals and Clinics, Iowa City, Iowa

Summary: Tests comparing image sets can play a critical role in PET research, providing a yes-no answer to the question "Are two image sets different?" The statistical goal is to determine how often observed differences would occur by chance alone. We examined randomization methods to provide several omnibus tests for PET images and compared these tests with two currently used methods. In the first series of analyses, normally distributed image data were simulated fulfilling the requirements of standard statistical tests. These analyses generated power estimates and compared the various test statistics under optimal conditions. Varying whether the standard deviations were local or pooled estimates provided an assessment of a distinguishing feature between the SPM and Montreal methods. In a second series of analyses, we more closely simulated current PET acquisition and analysis techniques. Finally, PET images from normal subjects were used as an example of randomization. Randomization proved to be a highly flexible and powerful statistical procedure. Furthermore, the randomization test does not require the extensive and unrealistic statistical assumptions made by standard procedures currently in use. Key Words: Statistical tests-Methodology-Randomization.

Tests comparing image sets can play a critical role in PET research. These tests provide a simple yes or no answer to the question "Are two image sets different?" before more qualitative or quantitative questions are asked about where the differences lie within the images. Sometimes an omnibus test provides a trivial answer to an obvious question: for instance, judging whether there is any difference in regional cerebral blood flow (rCBF) between two different cognitive task (A and B) conditions. In this case the omnibus test might reject the null or no-difference hypothesis, which absurdly conjectures that brain metabolism does not depend on cognition, and the test may be only ceremonial. On the other hand, an omnibus test sometimes provides important information. For example, a straightforward contrast might be germane when comparing images from two different language reading conditions in bilingual subjects.

Received July 13, 1995; final revision April 12, 1996; accepted April 16, 1996.

Address correspondence and reprint requests to Stephan Arndt at 2911 JPP MH-CRC, The University of Iowa Hospitals and Clinics, 200 Hawkins Drive, Iowa City, IA 52242, U.S.A.

Abbreviations used: ES, effect size; MR, magnetic resonance; PET, positron emission tomography; rCBF, regional CBF; ROI, region of interest; SD, standard deviation.

An omnibus test to compare positron emission tomography (PET) images is based on the notion of a significance test. The significance test has a specific and fundamental meaning in statistics: measuring the probability that an observed result could have occurred by chance. For instance, a frequently asked question is whether the differences between two sets of images suggest a significant change. Task B might be a memory task using a 5-s retention interval while task A uses a 60-s retention interval involving recognizing a word from a previously memorized list. Suppose that the area with the largest difference is 10% higher during task B in the left frontal region. The question arises as to whether this observed difference could have occurred by chance. Conceivably, there could be no real activation effect at all, yet random fluctuations could still produce 10% differences in various brain regions. If this chance probability is relatively high, say 5 in 10, then the 10% difference does not provide convincing evidence for an activation effect. On the other hand, if the chance probability is relatively low (e.g., less than 1 in 20), then the difference is typically interpreted as statistically significant using the conventional 0.05 type I error rate (alpha).

The statistical approach to this process is to determine how often such differences would occur by chance alone in repeated trials of the same experiment. An impracticable solution is to replicate the experiment many times using different subjects and randomly flipping whether a subject's A image was subtracted from B or vice versa. After a large number of replications, the number of comparisons that yielded differences as large as the original baseline activation difference would indicate how unusual it would be for the observed 10% change in rCBF (from the previous example) to occur when chance alone is operating. The number of trials in which a difference of 10% or larger is noted, divided by the number of repeated trials, provides the probability estimate.

Another replication strategy, randomization or permutation, uses only the data from the original experiment. The notion that chance alone produced the observed difference implies that the task A and B images are interchangeable and that the observed difference is the result of a not-so-unlikely arbitrary arrangement. Instead of replicating the experiment de novo, a random switching of the condition labels would serve the same purpose. For example, Table 1 shows the possible arrangements of a study with three subjects. While subjects always retain their own data sets, the two voxel arrays would be arbitrarily reassigned to A and B, producing a new pattern of A-B arrangements for each replication. The number of times that the difference from the original experiment occurs by chance can be readily ascertained by reference to the distribution generated from the replications.

For small samples, all possible (2^n) arrangements provide the exact frequencies by enumerating all possible experimental results. For larger samples, a very large subsample of the permuted arrangements (e.g., 5,000) provides an accurate approximation. These tests are variously and sometimes interchangeably referred to as randomization, permutation, or Monte Carlo tests. A number of authors have expressed a preference for these types of tests since they are applicable to any population and do not explicitly assume random subject selection (Fisher, 1936, 1990; Pitman, 1937a,b, 1938; Edgington, 1966, 1995), although generalizations from any theoretical or randomization test on a particular sample may be constrained by the sampling procedure.

TABLE 1. Eight A-B/B-A permuted samples generated by three subjects with paired image data sets

                                    Sample
Subject      1        2        3        4        5        6        7        8
   1       A1-B1    B1-A1    A1-B1    B1-A1    A1-B1    B1-A1    A1-B1    B1-A1
   2       A2-B2    A2-B2    B2-A2    B2-A2    A2-B2    A2-B2    B2-A2    B2-A2
   3       A3-B3    A3-B3    A3-B3    A3-B3    B3-A3    B3-A3    B3-A3    B3-A3

There are several advantages to using randomization to obtain probabilities. These estimates are based on an empirical definition of probability. Randomization methods produce results that make no assumptions about the nature of the data distribution since no reference is made to a theoretical curve or function. The process is also extremely flexible. Tests of mean differences, correlations, and differences in correlations can be produced for the whole brain (an omnibus test), sets of regions of interest (ROIs), or individual voxel areas. The only requirements for a test are that a single-valued test statistic (e.g., a single summary value characterizing the effect of interest) can be found and that the sample is large enough to randomize. This sample size requirement is actually very liberal, since as few as 10 subjects in an A-B condition experiment will produce 2^10 = 1,024 permutations.
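As a concrete illustration, the sign-flipping scheme of Table 1 can be sketched in a few lines. This is a minimal sketch, not the authors' original MATLAB or C code; the function name and the pooled-SD convention shown here are assumptions for illustration.

```python
import numpy as np
from itertools import product

def sign_flip_test(diffs, n_perm=None, seed=0):
    """Randomization test for paired images.

    diffs: (n_subjects, n_voxels) array of A-B difference images.
    Enumerates all 2^n sign flips when n_perm is None; otherwise
    draws n_perm random A-B/B-A reassignments.
    Returns p values for the pooled-SD t_max and sum-of-t^2 statistics.
    """
    n = diffs.shape[0]

    def stats(d):
        mean = d.mean(axis=0)
        pooled_sd = np.sqrt(d.var(axis=0, ddof=1).mean())  # one SD pooled over voxels
        t = mean / (pooled_sd / np.sqrt(n))
        return np.abs(t).max(), (t ** 2).sum()

    obs_tmax, obs_sumt2 = stats(diffs)

    if n_perm is None:  # full enumeration: 2^n arrangements, as in Table 1
        signs = np.array(list(product([1.0, -1.0], repeat=n)))
    else:               # Monte Carlo subsample of the permutations
        signs = np.random.default_rng(seed).choice([1.0, -1.0], size=(n_perm, n))

    count_tmax = count_sumt2 = 0
    for s in signs:
        tmax, sumt2 = stats(diffs * s[:, None])
        count_tmax += tmax >= obs_tmax
        count_sumt2 += sumt2 >= obs_sumt2
    return count_tmax / len(signs), count_sumt2 / len(signs)
```

With n = 10 subjects the full enumeration produces 1,024 arrangements; since the observed arrangement is always among them, the smallest attainable p value is 1/1,024.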

PET and other imaging research created new statistical problems to which standard tests often do not adequately apply. Important pioneering work has been done by Worsley et al. (1992) and Friston et al. (1991) to modify and extend simple statistics into an omnibus test for PET applications. Both groups have derived theoretical probability distributions for a test of PET images.

The techniques of Worsley et al. (the Montreal method) and Friston et al. (SPM94) have been compared by Arndt et al. (1995). Both begin with an omnibus test for the sets of images being compared, and both calculate a t test for each voxel location in the data set. The highest t value (tmax) is used as the criterion for determining whether the image sets differ in this initial omnibus test. Both can represent the voxel comparisons with a t image. A distinction between these two methods is how they estimate the standard deviation (SD) used for calculating the t values. The Worsley method uses a single SD found by pooling the SDs from all voxels. SPM uses the local voxel SD. Both of these methods attempt to account for the number of tests performed by reference to the theory of Gaussian random fields (Adler and Hasofer, 1976; Hasofer, 1978). Randomization tests of tmax have also been suggested in the statistical literature (Chung and Fraser, 1958; Boyett and Shuster, 1977) but only recently applied to imaging data (Holmes et al., 1996).

Test statistics other than tmax may be more powerful or more interesting to the PET researcher. In questions about whether two sets of images differ, a general test that makes use of more information than tmax has been advanced by a number of authors (Chung and Fraser, 1958; Blair and Karniski, 1994; Good, 1994; Edgington, 1995; Worsley et al., 1995). This test is based on the sum of the squared image t values (Σt²), which may be more sensitive to moderate changes or to differences that involve several peaks or a large flat region of activation. Unlike the tmax statistic, Σt² is a composite of all image t values.

Whatever the potential advantages of Σt² or other multivariate tests, their application to imaging data has been hampered since the standard reference distribution (Hotelling's T² or the multivariate F) is not applicable. Imaging data often do not meet the necessary assumptions, and multivariate analyses typically require more subjects than voxels. Worsley et al. (1995) have recently outlined a solution to the distribution of Σt² using the theory of Gaussian random fields. However, randomization methods are appropriate. Manly (1991) also describes several other statistics that may be applicable.

The present paper examines the use of randomization methods to provide omnibus tests for PET images. In the first series of analyses, normally distributed image data are simulated that fulfill the requirements of standard statistical tests and present a simplest-case scenario. These analyses were used to generate detection rates and to compare the tmax and Σt² statistics under optimal conditions. Varying whether the standard deviations were local or pooled estimates provides an assessment of a distinguishing feature between the SPM and Worsley methods. In a second series of analyses, we more closely simulate current PET acquisition and analysis techniques by examining data that have been smoothed with a filter. Finally, PET images from normal subjects are used as an example of this technique.

MATERIALS AND METHODS

Analysis 1: Simulation using white noise data with peaks inserted

The first set of analyses was conducted on white noise images generated using normally distributed independent voxel elements. All calculations were performed on a Macintosh Power PC 7100 in MATLAB (Mathworks, 1995). The random numbers were produced by the algorithms given in Forsythe et al. (1977). A test of this generator with 5,000 random numbers showed no significant spatial dependence (interpixel r = 0.003; 95% confidence interval: -0.0247 ≤ ρ ≤ 0.0308).

For the number of randomizations, we chose 5,000 replications when the full set of permutations became impractical. This was a compromise between practicality and the attempt to obtain accurate estimates of p values near the critical p values used in hypothesis testing. Using a variety of methods, Manly (1991) calculates the 99% confidence interval around the p < 0.05 cutoff obtained with 5,000 replications to be 0.042-0.058. Using central distribution-free intervals (Stuart and Ord, 1987, formula 20.131), we found the same values for the 99% interval and between 0.044 and 0.056 for the 95% confidence interval. Around the 0.01 probability estimate, 5,000 replications provided a confidence interval of approximately ±0.003.
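The width of these intervals follows directly from the binomial standard error of a Monte Carlo proportion. A quick normal-approximation check (an illustrative sketch, not the Stuart and Ord distribution-free calculation) reproduces the quoted bounds:

```python
import math

def mc_ci(p, n_reps, z):
    """Normal-approximation confidence interval for a Monte Carlo
    p value estimated as a proportion of n_reps replications."""
    se = math.sqrt(p * (1.0 - p) / n_reps)
    return p - z * se, p + z * se

print([round(x, 3) for x in mc_ci(0.05, 5000, 2.576)])  # 99% CI: [0.042, 0.058]
print([round(x, 3) for x in mc_ci(0.05, 5000, 1.960)])  # 95% CI: [0.044, 0.056]
print([round(x, 3) for x in mc_ci(0.01, 5000, 1.960)])  # 0.01 case: half-width near 0.003
```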

For each analysis, n (= 10, 15, 20, or 30) A-B image pairs were constructed with 512 elements. To find the critical values, the image pairs were either fully permuted (n = 10) or randomly reassigned 5,000 times to different A-B combinations. Thus, for each subject's pair of images, the arrangement A minus B or B minus A was randomized for the replication. After each rearrangement, the image sets were averaged and an SD was found for each pixel location (local SD). The pooled SD was also calculated. These values were used to generate two images made from the local or pooled pixel-based t tests. Thus, each t image was based on 512 t tests. The largest absolute t value from the local and pooled t images (local tmax and pooled tmax), as well as the sums of the squared local and pooled t values (i.e., Σt²), were stored after each resampling. After all replications, the resampled statistics were sorted, and the values cutting off the most extreme 5, 2.5, and 1% of the distribution were used as the alpha level p < 0.05, 0.025, and 0.01 critical values. Although this procedure was instructive, other algorithms, such as those described in analysis 2 below, are more efficient for real applications.

Detection rates were calculated in a similar fashion. Before rerandomizing the image pairs, an effect (δ) was added to an image location. The effect sizes (ES) were equal to δ divided by the local SD; thus, an ES = 1 added 1 SD to form the peak. The effect was added to the same randomly located voxel in all n images. Effect sizes of 0, 0.5, 1, and 2 were added. The effect size denoted as [1 1] placed two peaks in the image. Other effect size conditions involving multiple peaks were also investigated. The test's detection rate was defined using the number of replications that produced statistics exceeding the critical values chosen for this simulation experiment (i.e., the frequency with which the image would have been declared significant). The number of significant replications (i.e., observed tmax > critical value) was divided by the total number of replications to form approximate detection rates.
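In code, the detection-rate loop amounts to repeatedly simulating an experiment with an inserted peak and counting how often the statistic crosses a fixed critical value. The following is a hedged sketch of the pooled-SD tmax case only; the array sizes, seed, and replication count are arbitrary choices, not the paper's:

```python
import numpy as np

def detection_rate(n_subj, n_vox, es, crit_tmax, n_experiments=200, seed=0):
    """Approximate power: fraction of simulated experiments whose
    pooled-SD t_max exceeds a previously established critical value."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_experiments):
        d = rng.normal(size=(n_subj, n_vox))   # null A-B differences, SD = 1
        d[:, rng.integers(n_vox)] += es        # one hot voxel; ES = delta / SD
        pooled_sd = np.sqrt(d.var(axis=0, ddof=1).mean())
        t = d.mean(axis=0) / (pooled_sd / np.sqrt(n_subj))
        hits += np.abs(t).max() > crit_tmax
    return hits / n_experiments
```

Using the paper's pooled-SD critical value for n = 20 (3.846, Table 2), detection_rate(20, 512, 1.0, 3.846) should land in the neighborhood of the 0.709 reported in Table 3, up to Monte Carlo error.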

Critical values and power estimates were also found for the methods of Worsley et al. (1992) and the SPM94 program, using pooled and local SD tmax values, respectively. A three-dimensional random field made from 512 independent normally distributed numbers [i.e., N(0,1)] contains approximately 314 resels (resolution elements) in the 512 images based on formulas in Worsley et al. (1992). Formula 1 in Worsley et al. (1992) provided critical values corresponding to standard p values (i.e., 0.05, 0.025, and 0.01). The critical values for SPM were found using the SPM program (subroutine SPM_PZ.M in SPM94) with formulae discussed by Friston et al. (1991, 1995). To find the smoothness parameters (Friston et al., 1991) for our random images, we passed SPM94 10 pairs of images containing 46,256 independent random numbers. This produced the three smoothness parameters 1, 1, and 0.5. These smoothness values, the image, and the sample size were used to generate the critical values for our images. Once a critical value was established, we noted the number of times each method signaled a significant value; the number of times the observed tmax exceeded the critical value, divided by the number of replications, was the detection rate.
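For orientation, Formula 1 of Worsley et al. (1992) rests on the expected Euler characteristic of the thresholded field. For a 3D Gaussian (Z) field, which approximates the pooled-SD t field here because the pooled estimate has very large degrees of freedom, the tail probability is approximately R(4 ln 2)^(3/2)(2π)^(-2)(t² - 1)e^(-t²/2) for R resels. A sketch that inverts this expression by bisection reproduces the 4.36 threshold reported later in the Results for 314 resels; the function names are ours, and this is an approximation for the Gaussian case only:

```python
import math

def ec_pvalue(t, resels):
    """Approximate P(max Z > t) for a 3D Gaussian random field:
    expected Euler characteristic of the excursion set (Worsley et al., 1992)."""
    return (resels * (4.0 * math.log(2.0)) ** 1.5 / (2.0 * math.pi) ** 2
            * (t * t - 1.0) * math.exp(-t * t / 2.0))

def ec_threshold(alpha, resels, lo=2.0, hi=10.0):
    """Bisect for the t where the approximation equals alpha
    (the expression is monotone decreasing on this range)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ec_pvalue(mid, resels) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(ec_threshold(0.05, 314), 2))  # 4.36
```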

Analysis 2: Using a lifelike phantom

In the second set of analyses, a more lifelike phantom simulation was used with n = 20. Each "subject" consisted of a 3D array of approximately 445,700 voxels valued using normally distributed random numbers with constant mean and SD. After their initial generation, the 3D arrays were smoothed using an 18.33-mm Hanning filter, resulting in approximately the same number of resels that we find in our center's PET experiments. Calculations were performed on a Silicon Graphics workstation (running IRIX version 5.3 and using a 150-MHz IP22 processor), and the randomization program was written in the C language. The random number generators were based on standard published algorithms (Griffiths and Hill, 1985; Press et al., 1992). Computational efficiency was increased by updating summary statistics rather than recalculating the sample statistics, and by sequencing the randomization samples to minimize the updating workload using published algorithms (Bitner et al., 1976; Lam and Sotchen, 1982). As in the prior analyses, the program computed both local and pooled forms of tmax and Σt². Probability estimates were provided using randomization, the formulations of Worsley, and the subroutine in SPM94. Visualization of the image was accomplished using BRAINS (Andreasen et al., 1993).

Analysis 3: Human PET data

Subjects. Subjects were 20 healthy normal volunteers recruited from the community by newspaper advertising. This sample is a random subset of data that has been previously published (Andreasen et al., 1995). Their demographic features have been previously described. All gave written informed consent to a protocol approved by the University of Iowa Human Subjects Institutional Review Board.

Memory tasks. The complete study in which these subjects participated was designed to explore aspects of recognition memory using variable retention intervals. A substudy on verbal memory included two retention intervals (short and long) plus a comparison condition (reading words); this study has been described in detail in a prior publication (Andreasen et al., 1995). For this analysis, we focus on the positive activations from one subtraction from this study: long-term memory for words minus reading words. For the experimental condition referred to as "long-term memory," subjects were trained until they had perfect recognition memory of a group of 18 words during the week prior to the PET study; the group of words was again reviewed on the day before the PET experiment to ensure that it was still well learned. The comparison condition consisted of reading common English words. All words were one- or two-syllable concrete nouns presented visually on a video monitor.

We used the same algorithms and C language program as in analysis 2 with the lifelike phantom to perform randomization tests on images from 20 subjects. The methods for PET and magnetic resonance (MR) data acquisition have been previously described (Haining, 1990; Hurtig et al., 1994; Andreasen et al., 1995). The PET data were acquired with a bolus of 75 mCi of [15O]H2O in 5-7 ml saline using a GE PC4096-plus 15-slice whole-body scanner. MR scans were obtained for each subject with a standard T1-weighted three-dimensional SPGR sequence on a 1.5-T GE Signa scanner (TE = 5, TR = 24, flip angle = 40, NEX = 2, FOV = 26, matrix = 256 × 192, slice thickness 1.5 mm). The anterior commissure-posterior commissure (AC-PC) line was identified and used to realign the MR image of all subjects to a standard position. The PET image for each individual was fit to that individual's MR scan using a surface fit algorithm (Levin et al., 1988; Cizadlo et al., 1994). The MR images from all subjects were averaged using a bounding box technique, so that the functional activity visualized by the PET studies could be localized on coregistered MR and PET images where the MR image represented the "average brain" of the subjects in this study (Andreasen et al., 1994, 1995). An 18.33-mm Hanning filter smoothed the PET images for each condition to eliminate residual anatomical variability. MR images were then aligned on the AC-PC axis and the interhemispheric fissure. Images were resampled to 128 × 128 × 80 voxels using the Talairach atlas image space (Talairach and Tournoux, 1988). After thresholding the image to 150% of the mean voxel value to isolate brain, there were 522,141 voxels in the images.

RESULTS

Analysis 1

From a statistical perspective, maintaining a consistent known probability of mistakenly finding significance (the alpha level or type I error rate) is the primary constraint on a statistical test. Standard testing procedures use tabled functions to provide a critical value that, when exceeded, signals an unlikely and significant (p < alpha) event. For the following analyses, we chose the conventional alpha level of 0.05. However, the trends should be generalizable to any usual nominal alpha level.

Critical values cutting off the most extreme test statistic corresponding to a p value < 0.05 were found empirically by randomizing simulated images. Results for both tmax and Σt² statistics, calculated with pooled and local SDs, are shown in Table 2 for samples of 10, 15, 20, and 30 subjects. These values were obtained by generating five different experiments at each sample size, randomizing each, and then averaging. Table 2 also shows the corresponding values based on Gaussian random field theory using Worsley's method for the pooled SD tmax and SPM94 for the local SD tmax. The data were normally distributed and all generated with a common SD. Randomization of the larger sample sizes included 5,000 replications, while the complete series of 2^10 (= 1,024) permutations was generated for the sample size of 10.

The randomization tmax and Σt² critical values based on the pooled SD were consistently smaller than those based on the local SD. The locally based values decrease as the sample size increases but are still considerably larger than the corresponding pooled values at n = 30. The increased values result from the added instability of variable local SD estimates, which produces a wider distribution. The instability in the SDs is particularly noticeable for small samples.

The tmax for a 0.05 significance threshold produced by SPM94 and the Worsley et al. (1992) formulations are also shown in Table 2. The tmax values for the 0.05 significance levels produced by SPM94 were consistently larger than the corresponding local tmax critical values found using randomization. At the largest sample size (n = 30), this difference became less pronounced, with SPM94 providing a critical value of 4.69 compared to the randomization value of 4.51. Similarly, the formula given in Worsley et al. (1992) produced values consistently larger than the corresponding critical values from randomization using the pooled SD. Since the pooled SD is based on a very large number of voxel locations and thus has an extremely large number of degrees of freedom, Worsley's formula does not depend on n and produces a value of 4.36 for all sample sizes. For these randomly drawn images, the SPM94 and Worsley methods were always conservative when no peak was present in the image.

TABLE 2. Critical values (p ≤ 0.05) produced by randomization^a of normally distributed data for tmax and Σt² statistics in an image containing 512 independent elements

                           tmax
          Randomization           Gaussian random field      Σt² randomization
  n     Local SD   Pooled SD     SPM94      Montreal        Local SD   Pooled SD
 10      6.632      3.830        7.575       4.356          741.952    564.970
 15      5.378      3.901        5.773       4.356          671.138    569.683
 20      4.896      3.846        5.169       4.356          635.683    565.323
 30      4.511      3.877        4.691       4.356          611.616    567.360

^a Critical values were found using five sets of randomizations with 5,000 replications for each set. For the smallest sample size (n = 10), five sets of complete permutations (2^10 replications) were used.

We evaluated how often these values erroneously signal a significant difference in a repeated series of random images. Ideally, the α = 0.05 critical values should be exceeded less than 5% of the time in repeated samples or replications. The proportion of times that the repeated sample produced a "significant" value when there was no difference between the image sets (effect size = 0) is shown in Table 3. The critical values from randomization produced error rates that were all very close to the expected 0.05 level.

Detection rates for the tests in several different situations are also shown in Table 3. Peaks were manufactured by inserting a hot pixel in a second set of randomizations. The magnitude of these hot spots was indexed by the effect size (i.e., mean difference divided by the SD) (Cohen, 1988). The frequency with which the modified images crossed the critical value was tabulated, and the proportion of times the test suggested a significant difference gives a rough indication of the test's power. Ideally, several thousand datasets, each randomized several thousand times, would be needed to provide power estimates. Since such an effort would be impractical, we chose to provide this more limited assessment of power, referring to correct rejection of the null hypothesis as the detection rate.

In the presence of real peaks, the test statistics based on pooled SD estimates tended to find the significant results more often than the statistics using local SD estimates. For instance, when n = 20 and there was a significant peak with an effect size of 1, the pooled tmax found a significant (p < 0.05) result more than 70% of the time (detection rate = 0.709). However, using the local tmax statistic, significance was seen less than 40% of the time (detection rate = 0.376). Thus, when the assumption of a constant SD holds, the pooled estimate is a better choice. For large samples (n = 30) and large effect sizes (≥ 1.0), both local and pooled tmax statistics produced excellent detection.

Interestingly, SPM94 and the Worsley method, despite their initial tendency to conservatism when no peaks were present, provide good power at the largest sample sizes for effect sizes ≥ 1. These methods were, however, always somewhat less powerful than the randomization tests.

The Σt² statistic had a lower detection rate than tmax for all of the effect sizes shown in Table 3. This test statistic seems to perform better when there are many peaks spread over a large area. For instance, when 2% of the image was covered (10 peaks in 512 elements) with a 0.5 effect size (n = 15), the local and pooled tmax

TABLE 3. Detection rates for two randomization^a tests (tmax and Σt²) using local and pooled standard deviations (SD) in a 512 independent element image of varying sample sizes (n) and peak effect sizes (ES)

                  tmax randomization     Gaussian random field    Σt² randomization
 n   Peak ES      Local SD  Pooled SD      SPM      Montreal      Local SD  Pooled SD
10      0           0.049     0.049       0.025      0.004          0.049     0.049
15      0           0.046     0.053       0.018      0.003          0.056     0.054
20      0           0.053     0.051       0.026      0.005          0.049     0.046
30      0           0.049     0.045       0.031      0.007          0.060     0.059

10      0.5         0.050     0.058       0.016      0.007          0.055     0.064
15      0.5         0.051     0.081       0.021      0.014          0.060     0.066
20      0.5         0.065     0.112       0.030      0.028          0.073     0.073
30      0.5         0.090     0.178       0.058      0.067          0.079     0.088

10      1           0.052     0.363       0.025      0.186          0.066     0.098
15      1           0.122     0.496       0.063      0.340          0.096     0.105
20      1           0.376     0.709       0.267      0.523          0.142     0.172
30      1           0.841     0.909       0.800      0.811          0.200     0.226

10      2           0.309     0.896       0.128      0.842          0.217     0.381
15      2           0.988     0.986       0.967      0.975          0.385     0.461
20      2           1.000     0.997       0.999      0.995          0.600     0.674
30      2           1.000     1.000       1.000      1.000          0.932     0.929

10    [1 1]^b       0.059     0.480       0.014      0.286          0.121     0.167
15    [1 1]         0.177     0.732       0.104      0.549          0.186     0.275
20    [1 1]         0.581     0.905       0.447      0.772          0.330     0.403
30    [1 1]         0.975     0.991       0.950      0.967          0.473     0.526

a All of the tests were based on approximate randomizations with 5,000 replications, except for n = 10, which were complete permutation tests (i.e., with 2^10 replications).

b This condition represented two randomly located peaks, each with an effect size of 1.

J Cereb Blood Flow Metab, Vol. 16, No. 6, 1996


statistics had detection rates of 0.07 and 0.31, respectively. The rate of the pooled Σt² was higher, at 0.34. Thus, for multiple areas or a single diffuse area of low-level activation, the Σt² statistic may be preferable to tmax.
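The Σt² statistic itself is simple to state: square the per-voxel t values and sum them over the whole image, so that many modest peaks can accumulate evidence that no single voxel provides. A minimal NumPy sketch under our own naming, not the authors' code:

```python
import numpy as np

def sum_t2(diff):
    """Sum of squared per-voxel t statistics (the sigma-t-squared
    omnibus statistic) for a stack of paired difference images
    `diff` (subjects x voxels). Sensitive to diffuse activation
    spread over many voxels rather than one sharp peak."""
    n = diff.shape[0]
    t = diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))
    return (t ** 2).sum()

# hypothetical comparison: pure noise vs. 10 low-level peaks (ES = 0.5)
rng = np.random.default_rng(1)
null = rng.normal(size=(15, 512))
diffuse = null.copy()
diffuse[:, :10] += 0.5
print(sum_t2(null), sum_t2(diffuse))
```

Like tmax, this statistic has no convenient reference distribution, but its randomization distribution is obtained the same way: recompute it over random rearrangements of the images.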

Analysis 2

In this series of analyses, two different sets of images were produced: a null condition set with no differences between the A and B contrast and one with a fairly robust peak (effect size = 2) added. Each analysis used 20 image pairs smoothed with an 18.33-mm Hanning filter. As with the previous analyses, the data were normally distributed, had a constant SD throughout the images, and fulfilled the assumptions of a Gaussian random field.

As expected for the null condition comparison, none of the statistical tests suggested a significant difference in the A - B subtraction. The observed local SD tmax was 5.499, which was less than the p < 0.05 critical value of 6.38 based on randomization. The likelihood of finding a (local) tmax this large during the 5,000 replications was p < 0.216. Similarly, using the pooled SD, the 0.05 critical value of tmax was 5.28 and the observed value was 4.20. This was not an unusually large tmax, since nearly two thirds (63.38%, or p < 0.634) of the random replications produced a value this large or larger. The Worsley et al. (1992) formulas and the pooled tmax gave a p value of 0.162, which was considerably smaller than the randomization p value, but larger than 0.05. SPM94 produced an omnibus p value of 0.216. Also, the Σt² statistic yielded nonsignificant results, with p values of 0.201 and 0.159 for the local and pooled versions, respectively. Thus, none of the tests provided a false positive result.

In contrast, all of the omnibus statistical tests signaled a significant difference when the single peak was added. The local tmax was 9.46, located at the peak's position. A value of 6.42 or larger would have been significant at the α = 0.05 level. The actual p value from the randomization of local tmax was 0.004. The pooled SD version of this test gave similar results: the 0.05 critical value was 5.43, while the observed tmax was much larger, 8.07. The approximate p value was p < 0.0002. Both the local and pooled Σt² statistics produced randomization p values of <0.0004. The probabilities based on the randomization technique were very similar to those based on the Worsley (p < 0.0001) and SPM94 (p < 0.0003) methods.
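The randomization scheme behind these p values can be sketched for the paired A - B design: under the null hypothesis, each subject's difference image is equally likely to carry either sign, so the statistic is recomputed under random re-signings of the subjects. This is our own illustrative Python code, not the software used in the study:

```python
import numpy as np

def tmax(diff):
    """Local-SD maximum |t| over the image (subjects x voxels)."""
    n = diff.shape[0]
    t = diff.mean(axis=0) / (diff.std(axis=0, ddof=1) / np.sqrt(n))
    return np.abs(t).max()

def randomization_p(diff, stat, n_reps=5000, seed=0):
    """Approximate randomization p value for an omnibus statistic.

    Each replication flips the sign of every subject's difference
    image at random and recomputes the statistic. (For small n,
    all 2**n sign patterns can be enumerated instead, giving a
    complete permutation test.)
    """
    rng = np.random.default_rng(seed)
    observed = stat(diff)
    count = 0
    for _ in range(n_reps):
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        if stat(signs * diff) >= observed:
            count += 1
    # count the observed arrangement itself, so p is never exactly 0
    return (count + 1) / (n_reps + 1)

# hypothetical null data: 10 subjects, 64 voxels, no real difference
rng = np.random.default_rng(2)
null_diff = rng.normal(size=(10, 64))
print(randomization_p(null_diff, tmax, n_reps=1000))
```

The same `randomization_p` wrapper works unchanged for any other statistic (e.g., the Σt² sum), which is the flexibility the Discussion emphasizes.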

Analysis 3

Comparing a real set of 20 PET image A - B pairs

from normal subjects, all of the statistical procedures again agreed in assessing the probabilities. The pooled tmax was 7.44 for this image subtraction. Based on the randomization method and 5,000 replications, this tmax had a probability of 0.0024, while the method of Worsley provided a p value less than 0.0001. The local tmax was 8.56. According to the randomization distribution, this


value had a probability of 0.0018, which was somewhat larger than the value offered by SPM, p < 0.0001. While the two theoretically based probabilities were considerably more liberal, all of the methods of testing tmax suggested a significant difference between the A - B image sets. Likewise, the two Σt² statistical tests suggested highly significant differences. None of the 5,000 randomized images provided local or pooled Σt² statistics that were larger than the observed value. Hence both p values were less than 1/5,000 (= 0.0002).

DISCUSSION

Randomization tests offer advantages over standard tests that use approximate reference distributions. An important statistical advantage is consistent protection at the nominal type I error rate, that is, the frequency of mistaking a chance difference for a significant one. The randomization test always provides a close approximation to a 0.05 cutoff when the 0.05 critical value is chosen. The exact confidence intervals for this approximation can also be calculated (Stuart and Ord, 1987; Noreen, 1989), but typically these intervals are small enough to be inconsequential when a reasonable number of replications is used. In contrast, procedures that rely on distributional assumptions can be affected when their assumptions do not hold. Both SPM94 and the method detailed by Worsley et al. (1992) make extensive distributional and large sample assumptions about the data throughout their calculations. Some of these assumptions are outlined in Friston et al. (1991), Worsley et al. (1992), and Arndt et al. (1995) and include the independence of mean and SD, an independent and identical distribution of the errors, and an isotropic spatial correlation (Upton and Fingleton, 1990; Cressie, 1993) over the entire image defined by first-order (i.e., next-pixel-over) correlations. While all of the testing procedures agreed somewhat on our lifelike phantom and actual PET images, they need not do so. In one analysis from an early small (n = 10) PET pilot project, the Worsley method of calculating an omnibus test provided a p < 0.0033. However, a complete permutation revealed that this tmax value occurred in almost 8% of the random arrangements. Either the application of Gaussian random field theory to this particular set of images was inappropriate or the number of resels was incorrectly estimated.

Both SPM and the Montreal method require an explicit estimate of the spatial dependency within the image. In SPM, this estimate is referred to as the image smoothness, and in the Worsley method the dependency underlies the notion of a resel. These estimates of spatial dependence are based on first-order serial (i.e., next-pixel-over) correlations or are assumed to be a function of the spatial filter used during image processing. None of these estimation procedures considers that the brain is


a highly integrated organ and naturally exhibits an orchestrated pattern of activation. To assume that the only correlations among brain voxels are caused by image filtering and can be estimated by next-pixel-over correlations ignores the biology and is likely an oversimplification. Randomization tests do not require these assumptions.
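For concreteness, a next-pixel-over correlation estimate of the kind these smoothness and resel calculations rest on can be sketched as follows. This is our own minimal illustration; the circular boxcar smoothing is a stand-in for an actual spatial filter such as the Hanning window:

```python
import numpy as np

def first_order_corr(img):
    """Lag-1 ("next-pixel-over") correlation along one axis of a
    2-D image: the kind of purely local dependence estimate that
    smoothness/resel calculations are built on."""
    a = img[:, :-1].ravel()
    b = img[:, 1:].ravel()
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(3)
raw = rng.normal(size=(64, 64))
# crude circular smoothing: average each pixel with its two neighbors
smooth = (raw + np.roll(raw, 1, axis=1) + np.roll(raw, -1, axis=1)) / 3
print(first_order_corr(raw), first_order_corr(smooth))
```

An estimate like this captures only the short-range dependence induced by filtering; any longer-range, biologically driven covariation among voxels is invisible to it, which is the oversimplification noted above.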

In addition to the flexibility afforded by few statistical assumptions, randomization tests have been suggested for different testing situations (e.g., tests of multiple groups) and different statistical measures (e.g., Σt², Pearson rmax, Σr², Mantel's spatial correlation), and they may be extended to other imaging modalities (e.g., single photon emission computed tomography, fMRI). For instance, analysis of fMRI often correlates an input (stimulus) waveform with voxel values over time, resulting in a (Pearson or Spearman) correlation and a phase offset for each voxel (Bandettini et al., 1993). Either the image of correlations (r) or phase offsets (φ) may be of interest. Group comparisons of either r or φ can be easily tested with randomization. However, novel statistics should be adequately investigated before randomization, since new indices of an effect may react in unexpected ways.
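The voxelwise correlation image mentioned above can be sketched as follows; the on/off block design, array shapes, and responsive voxel here are hypothetical, and the sketch covers only the Pearson r image (not the phase offsets):

```python
import numpy as np

def voxel_correlations(timecourses, stimulus):
    """Pearson r between an input (stimulus) waveform and each
    voxel's time course; `timecourses` is time x voxels."""
    tc = timecourses - timecourses.mean(axis=0)
    s = stimulus - stimulus.mean()
    num = tc.T @ s
    den = np.sqrt((tc ** 2).sum(axis=0) * (s ** 2).sum())
    return num / den

t = np.arange(60)
stim = (t % 20 < 10).astype(float)   # hypothetical on/off block design
rng = np.random.default_rng(4)
data = rng.normal(size=(60, 32))     # 60 time points, 32 voxels
data[:, 5] += 2 * stim               # one stimulus-locked voxel
r = voxel_correlations(data, stim)
print(r[5])
```

The resulting r image (one value per voxel) is then itself the object of a group comparison, to which the same randomization machinery applies.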

The performance evaluation of the Σt² statistic appears mixed. In the simulations, we created isolated peaks in the data set. The Σt² maintained the nominal alpha level of the test (0.05), but was clearly less powerful than the tmax statistic. This was perhaps an unfair comparison of the two alternatives, since the peaks were unrealistically focal. Most cognitive activations produce multiple areas of activation corresponding to the various structures called upon in the task. In the actual PET data (analysis 3), the Σt² statistic clearly provided the strongest case for rejecting the notion that chance alone was responsible for the differences. None of the 5,000 random arrangements produced a Σt² as large as the value observed when contrasting the two cognitive conditions, placing the p value at something less than 1/5,000. Thus, as noted by Worsley et al. (1995), Σt² may be an extremely powerful alternative omnibus test statistic.

The power of randomization tests applied to real data is difficult to determine theoretically. However, asymptotically, a randomization procedure will have power rivaling the most powerful possible test (Manly, 1991; Good, 1994). For smaller samples, the randomization test may have less power than one based on Gaussian random field theory, but this advantage usually exists only if the assumptions hold true and the estimates of smoothness or resel size using next-pixel-over correlations are correct. Based on experience in other situations (e.g., the Mann-Whitney and Wilcoxon tests), when the assumptions are violated, the randomization test is often more powerful, sometimes strikingly so. Violating the Gaussian or other assumptions of a theory-based test sometimes increases

the alpha level above the nominal level. Thus, the tabled values of a standard test's 0.05 threshold become too liberal. Post hoc corrections of the theoretical significance threshold often make the theory-based test less powerful than the randomization alternative. On the other hand, randomization tests always hold the alpha level close to the nominal level (e.g., 0.05). Moreover, the most powerful relevant test for an experiment will be one that best characterizes the effect of interest. As we have noted, randomization tests are applicable to a variety of statistics.

After a significant omnibus test statistic, regional analyses are necessary to find out how and where the image sets differ. Unfortunately, there are no good exact solutions for this problem. Several papers (McCrory and Ford, 1991; Bandettini et al., 1993; Blair and Karniski, 1994; Holmes et al., 1996) have discussed a number of problems and possible solutions. Worsley et al. (1992) suggested using the critical value of tmax as a cutoff or threshold for the entire image of t values. The logic is that any excursion area that protrudes above the critical value threshold would have signaled a significant difference between the image sets. Thus, by implication, any such excursion set of voxels can be said to be significant. Since this is similar to a Bonferroni correction for the number of resels, it is likely a conservative estimate. Other, more powerful post hoc thresholds have been developed theoretically (Hochberg and Tamhane, 1987) and suggested by Blair and Karniski (1994). These step-down procedures check the largest t value in the image, then the second largest, and so on. With randomization, these techniques require regenerating a new randomized distribution at each step. Holmes et al. (1996) provide examples, discussion, and simple algorithms for these tests applied to PET data.
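The single-step version of this idea, adjusting each voxel's p value against the permutation distribution of the image-wide maximum |t|, can be sketched as follows; the step-down refinement would recompute the maximum over only the not-yet-rejected voxels at each step. This is our own illustrative code, not the algorithm of Holmes et al. (1996):

```python
import numpy as np

def single_step_maxT(diff, n_reps=2000, seed=0):
    """Adjusted per-voxel p values from the sign-flip permutation
    distribution of the image-wide max |t| (single-step maxT).
    `diff` is subjects x voxels of paired A - B differences."""
    rng = np.random.default_rng(seed)
    n = diff.shape[0]

    def tstat(d):
        return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n))

    t_obs = np.abs(tstat(diff))
    maxes = np.empty(n_reps)
    for i in range(n_reps):
        signs = rng.choice([-1.0, 1.0], size=(n, 1))
        maxes[i] = np.abs(tstat(signs * diff)).max()
    # adjusted p: how often the image-wide max reaches each voxel's t
    return (1 + (maxes[:, None] >= t_obs[None, :]).sum(axis=0)) / (n_reps + 1)

# hypothetical data: 12 subjects, 128 voxels, one strong peak at voxel 7
rng = np.random.default_rng(5)
diff = rng.normal(size=(12, 128))
diff[:, 7] += 3.0
p_adj = single_step_maxT(diff, n_reps=500)
print(p_adj[7], p_adj.min())
```

Because every voxel is compared against the same max-|t| distribution, a voxel is declared significant only when its t would have driven a significant omnibus tmax, which is what makes the excursion-set logic coherent.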

As we mentioned in the introduction, omnibus tests are sometimes very important. However, often an overly general test is of less interest than tests of local areas or structures. In some instances, the question "Is there a difference somewhere?" is immaterial or even counterproductive. For instance, large interindividual differences coupled with the basic between-group similarity of cerebral blood flow may make it difficult to obtain a significant omnibus test when an actual group difference is restricted to a single small area. Specific regional (ROI) contrasts do not require an omnibus test if these contrasts were hypothesized a priori. However, when a number of a priori contrasts are made, appropriate adjustments are necessary. Standard procedures are available to adjust the alpha level of the test, or the test can be randomized using either an ROI-based tmax or Σt². Exploratory analyses represent another situation where omnibus testing may be inappropriate. In this case, it is arguable whether significance testing procedures should be used at all during the exploratory phase, since other, more appropriate procedures may be available (e.g., effect size estimation) (Cohen, 1988; Andreasen et al., 1994).

One issue with any randomization method is that it is computation intensive. For many situations, efficient algorithms and fast computers make randomization an attractive solution. There may, however, be questions about how practical it might be for a given experimental design or sample size. We have found that simple two-condition contrasts, even with large samples (n > 30), are acceptable, particularly if the number of replications (N) is reduced to 3,000. While with the more efficient algorithms the number of replications does not linearly affect the amount of time required for a randomization, fewer replications do produce quicker solutions. However, reducing N affects the precision of the probability estimates. Fortunately, the precision of an estimated cutoff (e.g., 0.05) can be easily predetermined. This procedure and the formulae required are discussed in Manly (1991). Noreen (1989) provides tables and formulae for one-sided confidence intervals, and Stuart and Ord (1987) provide the formulae for two-sided intervals. In general, if all that is required is a simple yes-no answer using a predetermined p value cutoff, fewer replications are necessary than if the actual goal is to provide an accurate estimate of the randomization probability.
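As a rough guide to the precision issue, the estimated randomization p is a binomial proportion, so its standard error at a cutoff is sqrt(p(1 - p)/N). The sketch below uses this generic binomial result with a normal approximation; Manly (1991) and Noreen (1989) give the exact formulae:

```python
import math

def p_precision(p, n_reps):
    """Standard error and approximate 95% interval for a
    randomization p value estimated from n_reps replications,
    treated as a plain binomial proportion."""
    se = math.sqrt(p * (1 - p) / n_reps)
    lo = max(0.0, p - 1.96 * se)
    hi = min(1.0, p + 1.96 * se)
    return se, (lo, hi)

# near a 0.05 cutoff, 5,000 replications pin p down to roughly +/- 0.006
for n in (1000, 3000, 5000):
    se, ci = p_precision(0.05, n)
    print(n, round(se, 4), ci)
```

This is why a yes-no decision at a fixed cutoff tolerates fewer replications than an accurate estimate of a very small p, whose relative error at the same N is much larger.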

The logic of randomization tests has been accepted in principle for over half a century. Fisher (1936) noted that reference to standard mathematical distributions for tests has no justification beyond its agreement with a simple randomization procedure. Many basic statistical tests are based squarely on permutation logic, for instance, Fisher's exact test (Fisher, 1990), tests of Spearman and Kendall correlations (Kendall and Gibbons, 1990), and the Mann-Whitney, Wilcoxon, Kruskal-Wallis, Friedman, and Pitman tests (Pitman, 1937a,b, 1938; Conover, 1980). Fisher (1936) also noted the main drawback to widespread use of randomization procedures: the required computations. The availability of fast computer chips has, to a great extent, removed this obstacle. Given their ease of interpretation, consistent protection against type I errors, potential for increased power, flexible choice of the tested statistic (e.g., tmax, Σr², Σt²), and applicability to any imaging modality, randomization tests have clear advantages over other testing procedures.

Acknowledgment: This research was supported in part by NIMH grants MH31593, MH40856, MHCRC 43271, MH00625, and an Established Investigator Award from NARSAD.

REFERENCES

Adler RJ, Hasofer AM (1976) Level crossings for random fields. Ann Probab 4:1-12

Andreasen NC, Cizadlo T, Harris G, Swayze V II, O'Leary DS, Cohen G, Ehrhardt J, Yuh WTC (1993) Voxel processing techniques for


the antemortem study of neuroanatomy and neuropathology using magnetic resonance imaging. J Neuropsychiatry Clin Neurosci 5:121-130

Andreasen NC, Arndt S, Swayze VW, Cizadlo T, Flaum M, O'Leary D, Ehrhardt J, Yuh WTC (1994) Thalamic abnormalities in schizophrenia visualized through magnetic resonance image averaging. Science 266:294-298

Andreasen NC, O'Leary DS, Arndt S, Cizadlo T, Hurtig R, Rezai K, Watkins GL, Boles-Ponto LL, Hichwa RD (1995) Short-term and long-term verbal memory: A positron emission tomography study. Proc Natl Acad Sci USA 92:5111-5115

Arndt S, Cizadlo T, Andreasen NC, Zeien G, Harris G, O'Leary DS, Watkins GL, Boles-Ponto LL, Hichwa RD (1995) A comparison of approaches to the statistical analysis of [15O]H2O PET cognitive activation studies. J Neuropsychiatry Clin Neurosci 7:155-168

Bandettini PA, Jesmanowicz A, Wong EC, Hyde JS (1993) Processing strategies for time-course data sets in functional MRI of the human brain. Magn Reson Med 30:161-173

Bitner JR, Ehrlich G, Reingold EM (1976) Efficient generation of the binary reflected Gray code and its applications. Commun ACM 19:517-521

Blair RC, Karniski W (1994) Distribution-free statistical analyses of surface and volumetric maps. In: Functional Neuroimaging: Technical Foundations (Thatcher RW, Hallett M, Zeffiro T, John ER, Huerta M, eds), San Diego, Academic Press, pp 19-28

Boyett JM, Shuster JJ (1977) Nonparametric one-sided tests in multivariate analysis with medical applications. J Am Stat Assoc 72:665-668

Chung JH, Fraser DAS (1958) Randomization tests for a multivariate two-sample problem. J Am Stat Assoc 53:729-735

Cizadlo T, Andreasen NC, Zeien G, Rajarethinam R, Harris G, O'Leary D, Swayze V, Arndt S, Hichwa R, Ehrhardt J, Yuh WTC (1994) Image Registration Issues in the Analysis of Multiple-Injection 15O H2O PET Studies: BRAINFIT. Newport Beach, California, SPIE-The International Society for Optical Engineering

Cohen J (1988) Statistical Power Analysis for the Behavioral Sciences, Hillsdale, New Jersey, Lawrence Erlbaum Associates

Conover WJ (1980) Practical Nonparametric Statistics, New York, John Wiley & Sons

Cressie NAC (1993) Statistics for Spatial Data (revised edition), New York, John Wiley & Sons

Edgington ES (1966) Statistical inference and nonrandom samples. Psychol Bull 66:485-487

Edgington ES (1995) Randomization Tests, New York, Marcel Dekker

Fisher RA (1936) The coefficient of racial likeness and the future of craniometry. J R Anthropol Inst GB Irel 66:57-63

Fisher RA (1990) Statistical Methods, Experimental Design, and Statistical Inference, Oxford, Oxford University Press

Forsythe GE, Malcolm MA, Moler CB (1977) Computer Methods for Mathematical Computations, New York, Prentice-Hall

Friston KJ, Frith CD, Liddle PF, Frackowiak RSJ (1991) Comparing functional (PET) images: The assessment of significant change. J Cereb Blood Flow Metab 11:690-699

Friston KJ, Holmes AP, Worsley KJ, Poline J-P, Frith CD, Frackowiak RSJ (1995) Statistical parametric maps in functional imaging: A general linear approach. Hum Brain Mapp 2:189-210

Good P (1994) Permutation Tests, New York, Springer-Verlag

Griffiths P, Hill ID (1985) Applied Statistics Algorithms, Chichester, Ellis Horwood

Haining R (1990) Spatial Data Analysis in the Social and Environmental Sciences, New York, Cambridge University Press

Hasofer AM (1978) Upcrossing of random fields. Suppl Adv Appl Probab 10:14-21

Hochberg Y, Tamhane AC (1987) Multiple Comparison Procedures, New York, John Wiley & Sons

Holmes AP, Blair RC, Watson JDG, Ford I (1996) Non-parametric analysis of statistic images from functional mapping experiments. J Cereb Blood Flow Metab 16:7-22

Hurtig RR, Hichwa RD, O'Leary DS, Boles Ponto LL, Narayana S, Watkins L, Andreasen NC (1994) The effects of timing and duration of cognitive activation in 15O water PET studies. J Cereb Blood Flow Metab 14:423-430


Kendall M, Gibbons JD (1990) Rank Correlation Methods, New York, Oxford University Press

Lam CWH, Soicher LH (1982) Three new combination algorithms with the minimal-change property. Commun ACM 25:555-559

Levin DN, Pelizzari CA, Chen GTY, Chen C-T, Cooper MD (1988) Retrospective geometric correlation of MR, CT, and PET images. Radiology 169:817-823

Manly BFJ (1991) Randomization and Monte Carlo Methods in Biol­ogy. New York, Chapman & Hall

Mathworks (1995) MATLAB, Natick, Massachusetts

McCrory SJ, Ford I (1991) Multivariate analysis of SPECT images with illustrations in Alzheimer's disease. Stat Med 10:1711-1718

Noreen EW (1989) Computer Intensive Methods for Testing Hypotheses, New York, John Wiley & Sons

Pitman EJG (1937a) Significance tests which may be applied to samples from any population. R Stat Soc Suppl 4:119-130

Pitman EJG (1937b) Significance tests which may be applied to samples from any population. Part II: The correlation coefficient test. R Stat Soc Suppl 4:225-232

Pitman EJG (1938) Significance tests which may be applied to samples from any population. Part III: The analysis of variance test. Biometrika 29:322-335

Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical Recipes, New York, Cambridge University Press

Stuart A, Ord JK (1987) Kendall's Advanced Theory of Statistics, New York, Oxford University Press

Talairach J, Tournoux P (1988) Co-Planar Stereotaxic Atlas of the Human Brain, New York, Thieme Medical

Upton G, Fingleton B (1990) Spatial Data Analysis by Example: Volume 1, Point Pattern and Quantitative Data, New York, John Wiley & Sons

Worsley KJ, Evans AC, Marrett S, Neelin P (1992) A three-dimensional statistical analysis for CBF activation studies in human brain. J Cereb Blood Flow Metab 12:900-918

Worsley KJ, Poline J-B, Vandal AC, Friston KJ (1995) Tests for distributed, nonfocal brain activations. NeuroImage 2:183-194
