1
Ph.D. COURSE IN BIOSTATISTICS DAY 3

Example: Serum triglyceride measurements in cord blood from 282 babies (Data file: triglyceride.dta)

[Figure: histogram of triglyceride (x-axis 0 to 2, frequency on the y-axis) with markers at m-2s, m-s, m, m+s, m+2s, where m = sample mean and s = sample standard deviation]
2
[Figure: Q-Q plot of triglyceride against the inverse normal]
The distribution of serum triglyceride does not look like a normal distribution. The distribution is skew to the right or positively skew.
A Q-Q plot clearly shows the lack of fit.
How should we summarize these data?
How should we compare two groups of measurements for which the variation is positively skew?
3
Note:
• If a distribution is symmetric then mean = median.
• If a distribution is skew to the right then mean > median.
• If a distribution is skew to the left then mean < median.
Solution 1: A “non-parametric” approach

The mean and standard deviation are mainly useful for symmetric distributions. For a skew distribution the “upward” spread differs from the “downward” spread, and a single-number summary of the spread is therefore not sufficient.
Summary statistics useful for skew distributions:
median and 25th and 75th percentiles, or e.g. median and 5th and 95th percentiles
The skewness can be identified from these statistics.
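As an illustration (a Python sketch, not part of the course material; the data are a constructed, deterministic skew sample), the percentile summary makes the asymmetry visible: the upward spread p75 − median exceeds the downward spread median − p25 for a right-skew sample.

```python
# Illustrative sketch: for a right-skew sample the upward spread exceeds
# the downward spread, so report median and percentiles, not a single
# spread number.
import math
import statistics

# deterministic example: values symmetric on a log scale, skew on the raw scale
zs = [i / 50 for i in range(-100, 101)]        # -2.00, -1.98, ..., 2.00
x = sorted(math.exp(z) for z in zs)            # positively skew sample

def percentile(sorted_vals, p):
    # nearest-rank percentile; adequate for illustration
    k = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[k]

med = statistics.median(x)                     # exp(0) = 1
p25, p75 = percentile(x, 25), percentile(x, 75)
upward, downward = p75 - med, med - p25        # upward spread is the larger
```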
A transformation of the data may lead to data that conform with a normal distribution. Statistical methods for data from a normal distribution may then be applied to the transformed data.
In principle many different transformations could be considered, but interpretation of results based on transformed data may be complicated, so in practice only a few simple transformations, e.g. the power transformations and in particular the logarithm, are used. Here we look at logarithm transformations:
Natural logarithms are usually preferred for calculations. Figures with a logarithmic scale usually apply a logarithm to base 10.
5
[Figure: histogram of triglyceride on a logarithmic scale (x-axis .5 to 2, frequency on the y-axis)]
On a log-scale pairs of points with the same ratio have the same distance, e.g. log(10) - log(5) = log(100) – log(50). Log-transformations therefore remove (or reduce) positive skewness.
On a log-scale these data look like a sample from a normal distribution, but can we interpret results derived from log-transformed data?
7
ANALYSIS OF LOG-TRANSFORMED DATA. INTERPRETATION OF RESULTS

Consider a random variable X and introduce Z = log(X). The logarithm is here the natural logarithm, sometimes also called ln.

Percentiles of the distribution of X are transformed to the analogous percentiles of the distribution of Z. In particular

median(Z) = median(log(X)) = log(median(X))

The exponential function exp is the inverse of the natural logarithm, so percentiles are easily “back-transformed”, in particular

exp(median(Z)) = median(X)
On a log-scale the triglyceride data can be described by a normal distribution, so on a log-scale the data are usefully summarized by a mean and a standard deviation,
but what do these numbers mean?
8
Similar results are not true for the mean and standard deviation:

exp(mean(Z)) ≠ mean(X)
exp(sd(Z)) ≠ sd(X)

If the transformed variable Z follows a symmetric distribution we have

mean(Z) = median(Z)

so, when ”back-transformed”, the sample mean z̄ of the log-transformed observations becomes an estimate of the median on the original scale, i.e. exp(z̄) is an estimate of median(X).
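A small numeric sketch (Python, hypothetical data chosen to be symmetric on the log scale): exp of the mean of the logs is the geometric mean, which estimates median(X).

```python
# Back-transforming the mean of the logs gives the geometric mean; for data
# symmetric on the log scale it equals the sample median.
import math
import statistics

x = [0.5, 0.8, 1.0, 1.25, 2.0]             # hypothetical positive data
z = [math.log(v) for v in x]               # log-transformed sample
geo_mean = math.exp(statistics.mean(z))    # estimate of median(X)
```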
If the transformed variable follows a normal distribution further results are available:

mean(X) = exp(μ_Z + σ_Z²/2)

sd(X) = mean(X) · √(exp(σ_Z²) - 1)
9
If the transformed variable Z follows a normal distribution, the standard deviation of Z can be used to estimate c.v.(X), the coefficient of variation of X. The relations above show that

c.v.(X) = sd(X)/mean(X) = √(exp(σ_Z²) - 1) ≈ σ_Z

The approximation is accurate if the variation is small, e.g. σ_Z ≤ 0.3.

The result above shows that, unless the variation is large, the standard deviation of the log-transformed data, s_Z, can be used as an estimate of the coefficient of variation of the observations on the original scale.

The coefficient of variation is often used to describe the variation of positively skew distributions.

If the variation is large the coefficient of variation of X is estimated by

√(exp(s_Z²) - 1)
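The exact expression and the small-variation approximation can be compared numerically (Python sketch; s_z below is close to the standard deviation of the log-triglyceride data used later in the lecture):

```python
# Exact c.v. from the lognormal relation versus the approximation c.v. ~ s_z.
import math

s_z = 0.3877                                   # sd on the log scale (assumed value)
cv_exact = math.sqrt(math.exp(s_z ** 2) - 1)   # ~0.403
rel_err = (cv_exact - s_z) / cv_exact          # ~4% relative error of the approximation
```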
10
Confidence intervals

Assume that the log-transformed data follow a normal distribution.

Let θ̂_L and θ̂_U denote the lower and upper 95% confidence limits for θ = mean(Z), then

θ̂_L = z̄ - t_0.975 · s/√n
θ̂_U = z̄ + t_0.975 · s/√n

and exp(θ̂_L) and exp(θ̂_U) are 95% confidence limits of median(X). Confidence intervals for mean(Z) are thus ”back-transformed” to confidence intervals for median(X).

Similarly, by inserting the confidence limits for σ_Z in the relation

c.v.(X) = √(exp(σ_Z²) - 1)

confidence limits for the coefficient of variation of X are obtained.
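The back-transformation of the confidence limits can be sketched in Python (hypothetical triglyceride-like values; the t quantile is the 7-df table value):

```python
# t-based 95% confidence limits for mean(Z) on the log scale,
# back-transformed to limits for median(X).
import math
import statistics

x = [0.45, 0.52, 0.38, 0.61, 0.49, 0.55, 0.42, 0.58]   # hypothetical data
z = [math.log(v) for v in x]
n = len(z)
zbar = statistics.mean(z)
se = statistics.stdev(z) / math.sqrt(n)
t975 = 2.365                                # t quantile, n - 1 = 7 df
lo, hi = zbar - t975 * se, zbar + t975 * se
ci_median = (math.exp(lo), math.exp(hi))    # 95% limits for median(X)
```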
11
Two independent samples
Consider two independent samples x_1, …, x_n and y_1, …, y_m.

Assume that the sample distributions are positively skew, but that the log-transformed samples v_1, …, v_n and w_1, …, w_m can be described by normal distributions.

The hypothesis of identical means on a log scale, μ_V = μ_W, corresponds to the hypothesis median(X) = median(Y).

The ”back-transformed” difference exp(v̄ - w̄) is an estimate of the ratio

median(X) / median(Y)

Similarly, the confidence interval for μ_V - μ_W transforms back to a confidence interval for the ratio of the medians.

The F-test of the hypothesis σ_V = σ_W is equivalent to testing the hypothesis c.v.(X) = c.v.(Y).
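The two-sample back-transformation can be sketched as follows (Python, hypothetical samples):

```python
# Comparing two positively skew groups via the difference of means on the
# log scale, back-transformed to a ratio of medians.
import math
import statistics

x = [0.8, 1.2, 0.9, 1.5, 1.1, 0.7]     # hypothetical group 1
y = [0.5, 0.8, 0.6, 0.9, 0.7, 0.55]    # hypothetical group 2
v = [math.log(t) for t in x]
w = [math.log(t) for t in y]
# estimate of median(X) / median(Y)
ratio_est = math.exp(statistics.mean(v) - statistics.mean(w))
```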
12
Example – triglyceride data: summary statistics for skew data
We compute a confidence interval for mean(Z) and transform the limitsback to the original scale.
14
Estimation of c.v.(X):
The standard deviation is here a rather crude estimate of c.v.(X); instead we use

√(exp(0.1502871) - 1) = 0.403

The estimate is very similar to the estimate computed from the sample mean and the sample standard deviation of the original data.

θ̂_L = -0.757 - 1.968 · 0.0231 = -0.8026
θ̂_U = -0.757 + 1.968 · 0.0231 = -0.7115

95% confidence limits for median(X) therefore become exp(-0.8026) = 0.45 and exp(-0.7115) = 0.49.

We can also compute a 90% prediction interval based on a normal distribution for Z and transform back to the original scale:

lower limit: exp(-0.757 - 1.650 · 0.388) = exp(-1.397) = 0.25
upper limit: exp(-0.757 + 1.650 · 0.388) = exp(-0.117) = 0.89

As expected these values are very similar to the 5 and 95 percentiles.
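The arithmetic can be re-checked with a short Python sketch (the summary numbers z̄ = -0.757, s = 0.388, n = 282 are taken from the slides):

```python
# Re-computing the back-transformed confidence and prediction limits
# for the triglyceride example.
import math

zbar, s, n = -0.757, 0.388, 282
t975 = 1.968                        # t quantile, 281 df
se = s / math.sqrt(n)               # ~0.0231
ci_median = (math.exp(zbar - t975 * se), math.exp(zbar + t975 * se))
# 90% prediction interval on the original scale (t ~ 1.650)
pred = (math.exp(zbar - 1.650 * s), math.exp(zbar + 1.650 * s))
```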
15
Why use transformations of data?
For simple problems a simple solution is OK: Use sample percentiles as descriptive statistics and non-parametric tests for statistical inference.
Why ”change the data to fit the method”? – instead of ”change the method to fit the data”?
Transformation of data

For more complicated problems the methods available assume that data are from a normal population. The choice is then between

• A simple, but unsatisfactory analysis
• A satisfactory analysis based on transformed data

The methods based on a normal distribution are appealing because they are more powerful and cover a wider range of problems.

Disadvantages: Interpretation and presentation of results from analyses of transformed data are not always straightforward.
16
THE TWO-SAMPLE PROBLEM WITH PAIRED DATA
Example: Measurement of body temperature

The Stata file temp.dta contains measurements of body temperature in 96 patients using a rectal Hg thermometer, an oral Hg thermometer, and an electronic device (Craft thermometer).

The data were collected as part of a study to assess the validity of the electronic thermometer. Here we consider the two series of Hg thermometer readings.
17
[Figure: Oral Hg versus Rectal Hg – scatter plot of hgoral against hgrectal (both 35 to 41) with the identity line shown]

A plot of 94 pairs of readings of oral and rectal Hg temperature (two patients had missing rectal readings).
18
Questions:
• What is the difference between oral and rectal body temperature?
• How much do the differences vary between patients?
The situation resembles the two-sample problem discussed in lecture 2, but here the same patients are measured with both methods. We no longer have two independent samples, but two paired samples.
The “obvious” analysis
1. For each patient compute the difference between the two readings
2. Provided the assumptions are satisfied, the differences are modeled as a sample from a normal distribution: compute confidence intervals for the expected difference and the standard deviation, and perhaps a one-sample t-test of the hypothesis that the expected difference is 0.
If the normal distribution does not apply, use a non-parametric approach instead (Lecture 5).
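The “obvious” analysis can be sketched in Python with hypothetical readings (the course itself uses Stata's ttest for this step):

```python
# Paired analysis via per-patient differences: estimate, sd, t-test, CI.
import math
import statistics

rectal = [37.5, 38.1, 37.9, 38.6, 37.2, 38.0, 37.7, 38.3]  # hypothetical readings
oral   = [37.0, 37.6, 37.5, 38.0, 36.8, 37.4, 37.3, 37.8]
d = [r - o for r, o in zip(rectal, oral)]      # one difference per patient

n = len(d)
dbar = statistics.mean(d)                      # estimate of the expected difference
sd = statistics.stdev(d)
se = sd / math.sqrt(n)
t = dbar / se                                  # one-sample t statistic for H: mean = 0
t975 = 2.365                                   # t quantile, 7 df
ci = (dbar - t975 * se, dbar + t975 * se)      # 95% CI for the expected difference
```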
19
Note:
• The obvious approach was used for the analysis of the data on change in diastolic blood pressure (Lecture 2).
• The data may be used to identify systematic differences between the methods (accuracy), but replicate measurements are needed to describe the random measurement error (precision).
What kind of model considerations lead to this analysis?
Consider a given patient and introduce the following notation
X = a reading of the rectal Hg temperature
Y = a reading of the oral Hg temperature

and let

D = X – Y = the difference between the two readings.

The temperature readings X and Y can be considered as a sum of two components:

X = A + E
Y = B + F
20
Here
A = the “true” rectal temperature, i.e. the hypothetical value we would obtain as the average reading if the “experiment” was repeated a large number of times
B = the “true” oral temperature
E = the random measurement error associated with a rectal temperature reading using a Hg thermometer
F = the random measurement error associated with an oral temperature reading using a Hg thermometer
The difference D is therefore
D = (A – B) + (E – F)
The first term, A – B, represents the difference between true rectal and oral temperature for the given patient on a given day. The second term, E – F, is the combined effect of measurement errors.
21
In a population of patients the difference of true temperatures, A – B, may vary between individuals and also between days for a given individual. Let δ denote the population mean, then

A – B = δ + C,

where C represents intra-individual variation in the difference in true temperature.

Returning to the difference between the readings we therefore have the following decomposition

D = X – Y = δ + (C + E – F) = δ + U,

where δ is the population mean and U describes random variation, which has three components: inter-individual variation in true differences, measurement error of the rectal reading, and measurement error of the oral reading.

The statistical model underlying the ”obvious” approach describes the data as a random sample from a normal distribution with mean δ and variance σ².
22
Note: The model describes the differences and does not specify anything about the variation of temperature (rectal or oral) in the population of patients (A and B are not assumed to vary according to a normal distribution).
Checking the validity of the assumptions:
• Independence: Reasonable here, since each patient contributes with one difference only.
• Same distribution:
  • The expected difference between the two readings does not depend on the overall level of the temperature. Is this a reasonable assumption?
  • The random variation in the difference has the same size for all temperatures. Is this a reasonable assumption?
• Normal distribution: Is this a reasonable assumption?
23
[Figure: four diagnostic plots – a. oral Hg versus rectal Hg (hgoral against hgrectal, 35 to 41); b. difference versus average (difro against hgmean); c. histogram of the differences, titled "Rectal Hg - Oral Hg"; d. Q-Q plot of the differences (difro against the inverse normal)]
Important plots that should always be made:
a. A plot of Y against X
b. Difference against average (or against sum)
c. Histogram of differences
d. Q-Q plot of differences
The plots were made using a Stata do-file with the following commands
25
Plot a: Oral reading versus rectal reading

[Figure: scatter plot of hgoral against hgrectal (35 to 41) with the identity line (red) and a parallel line with slope 1 (black)]

Comments

If no systematic differences were present the points would scatter around the identity line (red line).

The plot suggests that rectal temperature is systematically higher, but the size of the difference does not seem to vary systematically over the range of temperatures: the points scatter around a line with slope 1 (black line), and the spread is fairly constant over the range of temperatures.
26
Plot b: Difference versus average. Often called a Bland-Altman plot.

[Figure: scatter plot of difro against hgmean (37 to 41)]

Comments

If no systematic differences are present the points scatter around the horizontal line at 0 (red line).

The plot suggests that rectal temperature is systematically higher, but the size of the difference does not seem to vary systematically over the range of temperatures. The points scatter around a horizontal line at approx. 0.5 (black line) and the variation is fairly constant over the range of temperatures.
27
Plots c and d: The histogram and the Q-Q plot of the differences.

[Figure: c. histogram of the differences, titled "Rectal Hg - Oral Hg"; d. Q-Q plot of the differences against the inverse normal]

Comments: These plots are mainly used for assessing the validity of a normal distribution. The distribution looks fairly symmetric with a modest departure from the normal curve. The slightly S-shaped pattern in the Q-Q plot indicates that both tails are too long. This could be the consequence of a few gross errors.
28
Analysis using Stata:
A paired two-sample problem can be analyzed with the command ttest directly, i.e. without first generating the difference
[Stata ttest output not shown; callouts mark the average difference, the 95% confidence limits for the expected difference, and the p-value of the hypothesis that the expected difference is 0]
29
Analysis, continued
Estimates obtained from the output:

δ̂ = 0.49 (95% confidence limits: 0.43 to 0.56)
σ̂ = 0.32 (σ̂² = 0.102)

95% confidence limits for σ become (see Lecture 2, page 31):

σ̂_L = 0.28, σ̂_U = 0.37
Conclusion: Rectal Hg temperature is on the average 0.5 degree higher than oral Hg temperature. The random variation (measurement errors and inter-and intra-individual variation) has a standard deviation of 0.32 degree.
The confidence interval of the expected difference does not describe the agreement between the two methods. Limits of agreement are defined as δ̂ ± 2σ̂ (i.e. mean diff. ± 2·s.d.). This is an approx. 95% prediction interval. The limits are rather wide: -0.15 and 1.13, reflecting the relatively large variation in the differences.
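A quick check of the limits (Python; the estimates 0.49 and 0.32 are the slide's values):

```python
# Limits of agreement: mean difference +/- 2 sd.
mean_diff, sd_diff = 0.49, 0.32
lower = mean_diff - 2 * sd_diff    # -0.15
upper = mean_diff + 2 * sd_diff    #  1.13
```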
30
Example: Urinary albumin excretion rate

The file albu.dta contains data on urinary albumin excretion rate (µg/min) measured by two different methods (“one-hour” and “night”) in 15 patients.
[Figure: diagnostic plots a–d – a. onehour versus night; b. difference versus average; c. histogram of the differences; d. Q-Q plot of the differences against the inverse normal]
31
Example: T4 and T8 cell concentration

The file tcounts.dta contains data on the number of T4 and T8 cells/mm³ in blood samples from 20 patients in remission from Hodgkin’s disease.

[Figure: diagnostic plots a–d – a. t8 versus t4; b. difference versus average; c. histogram of the differences; d. Q-Q plot of the differences against the inverse normal]
32
In both of these examples the difference between the two values increases with the size. Also the variation in the difference seems to increase with the size of the measurements.

The ”obvious” analysis may not be appropriate, but do we have any alternatives?
[Figure: the difference-versus-average plots for the two examples, shown side by side]
33
PAIRED DATA: ABSOLUTE OR RELATIVE CHANGE/DIFFERENCE?
Data: a sample of size n of pairs of observations (x,y). Such data arise in many situations, e.g.
• Comparison of measurement methods
• Comparison of pre-treatment and post-treatment measurements
• Studies of inter-observer or intra-observer variation
Problem: Do the x’s differ from the y’s? If yes, how?
First reaction: Look at differences (the obvious analysis above)
BUT should we consider absolute or relative difference/change?
Absolute change: y - x

Relative change: (y - x)/x = y/x - 1

Note: the relative change is just the ratio ”adjusted” such that 0 becomes the ”no change” value.
34
Absolute or relative change?
The relationship between the two series of measurements may identify the most appropriate choice. Two simple relations between y and x:

1. y = x + a
2. y = x · b

The choice should reflect the structure of the systematic variation in the data.

In situation 1 differences will be relatively stable. If the variation in the data conforms to this relation, absolute change is the appropriate choice (from a statistical perspective).

In situation 2 ratios will be relatively stable. If the variation in the data conforms to this relation, relative change is the appropriate choice (from a statistical perspective).
35
The paired t-test and the non-parametric analog are primarily intended for situation 1, since these tests are based on the differences.
Situation 2 is often best handled by taking logarithms:

y = x · b implies log(y) = log(x) + log(b)

i.e. the relationship between log(x) and log(y) is additive (situation 1) with a = log(b).

In short: If the variation in a series of pairs (x, y) is multiplicative (situation 2) the best approach is usually to compute the differences

log(y) - log(x) = log(y/x)

and base the statistical analyses on these differences. This corresponds to working with ratios on a log scale.
36
Paired data: Difference of logs versus relative change
Why not use relative change in situation 2?
Example
Patient    x     y    (y-x)/x    y/x    log(y)-log(x)
1          100   200   100%      2.00    0.693
2          200   100   -50%      0.50   -0.693
Average    150   150    25%      1.25    0
The second person is just a ”reversed version” of the first person, so ”nothing happens”.
• The first average (x) is identical to the second average (y).
• The average relative change is 25%, indicating a slight increase.
• The average ratio is larger than 1, indicating a slight increase.
• The average of the differences of logs – the ratio measured on a log scale – is 0, indicating the sensible result: no change.
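The two-patient table can be reproduced directly (Python sketch):

```python
# The average relative change and average ratio both suggest an increase,
# while the average difference of logs gives the sensible "no change" answer.
import math
import statistics

pairs = [(100, 200), (200, 100)]     # patient 2 reverses patient 1
rel_change = [(y - x) / x for x, y in pairs]              # 100%, -50%
ratios = [y / x for x, y in pairs]                        # 2.00, 0.50
log_diffs = [math.log(y) - math.log(x) for x, y in pairs]

avg_rel = statistics.mean(rel_change)    # 0.25: spurious "increase"
avg_ratio = statistics.mean(ratios)      # 1.25: spurious "increase"
avg_log = statistics.mean(log_diffs)     # 0.0: no change
```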
37
Paired data: Difference of logs versus relative change
Working with ratios on a log scale is convenient, since e.g.

log(y/x) = -log(x/y)

The relative change does not have this property:

(y - x)/x ≠ -(x - y)/y

When the ratio y/x is close to 1 then

log(y) - log(x) = log(y/x) = log(1 + (y - x)/x) ≈ (y - x)/x

i.e. when the ratio is close to 1 (e.g. within ±20%) the difference between the natural logarithms is approximately equal to the relative change.
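A numerical check of the approximation (Python sketch, a few ratios within ±20% of 1):

```python
# Gap between the difference of natural logs and the relative change.
import math

gaps = {r: abs(math.log(r) - (r - 1)) for r in (0.85, 0.95, 1.05, 1.20)}
```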
Paired data with a multiplicative structure (situation 2) usually also have increased variability for large observations. Working with data on a log scale stabilizes the variation and may therefore lead to a simpler description of the random variation (a coefficient of variation).
38
Example: T4 and T8 cell concentration

The diagnostic plots (page 31) indicated that an additive relationship was inappropriate. Data are now log-transformed to evaluate the fit of a multiplicative structure.

[Figure: diagnostic plots based on log-transformed data – a. log(t8) versus log(t4); b. difference of logs versus average of logs; c. histogram of the differences of logs; d. Q-Q plot of the differences of logs against the inverse normal]
39
Example: T4 and T8 cell concentration – continued
Stata analysis of log-transformed data:

gene logt4=log(t4)
gene logt8=log(t8)
ttest logt4=logt8
With log-transformed data both the systematic and the random variation agree better with the assumptions.
40
ANALYSIS OF LOG-TRANSFORMED, PAIRED SAMPLESINTERPRETATION OF RESULTS
Summary statistics in the output describe

log(T4) - log(T8) = log(T4/T8)

i.e. the ratio of the counts on a log scale.
The log-transformed ratios are a sample from a normal distribution. The relationships on pages 7–10 show how to translate the results to results about the ratios. We have
Ratio: log scale                         Ratio: original scale
mean = 0.250                             median = exp(0.250) = 1.28
conf. limits of mean: -0.016, 0.516      conf. limits of median: 0.98, 1.67
standard deviation = 0.569               coefficient of variation = 62%
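The back-transformed column can be checked numerically (Python; the log-scale numbers 0.250, (-0.016, 0.516) and 0.569 are taken from the output above):

```python
# Back-transforming the log-scale summaries of the T4/T8 ratio.
import math

median_ratio = math.exp(0.250)                      # ~1.28
ci_median = (math.exp(-0.016), math.exp(0.516))     # ~(0.98, 1.67)
cv = math.sqrt(math.exp(0.569 ** 2) - 1)            # ~0.62, i.e. 62%
```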
The estimates above may be compared with estimates based on the sample of ratios.
gene ratio=t4/t8
tabstat ratio, stat(med cv min max)

Note: the sample size is 20, so the ”non-parametric” 95% prediction limits are min and max.

Output:

variable |       p50        cv       min       max
---------+----------------------------------------
   ratio |  1.193993  .6085281  .4736842  3.818452
The two sets of estimates agree reasonably well.
Note: Using log-transformation in the analysis of two paired samples leadsto an estimate of the median ratio.
Using log-transformation in the analysis of two independent samples leads to an estimate of the ratio of the medians.
42
WHY USE TRANSFORMATION OF DATA?
Overall purpose

Log transformation – and other transformations – of the data is often used because a simpler description of the transformed data is available.
Two aspects

A simpler description of the random variation:
• A positively skew distribution was replaced by a symmetric distribution
• Bland-Altman plot: on a log scale the size of the variation was independent of the level of the outcome

A simpler, and more valid, description of the systematic variation:
• For paired data the multiplicative structure was replaced by an additive structure
More advanced statistical methodology for continuous variables, e.g. regression models and analysis of variance models, assumes constant standard deviation and additivity of effects. Transformation of data may therefore be necessary to apply these methods.
43
Note: Whether or not to use the t-test is often believed to be the most important question in the analysis of two-sample problems.
BUT the choice of appropriate measure of change is usually much more important than the choice of test statistic.
Always look at diagnostic plots! However, sometimes there is no clear answer, because
1. Neither the original scale nor a log scale is appropriate.

2. If the variation between pairs is modest it can be very difficult to distinguish between an additive structure and a multiplicative structure in the data. In such situations the results are usually very similar for transformed and untransformed data.

3. The transformation that achieves additivity is not the transformation that achieves constant variation.
4. It is just a mess
44
STATA – some interactive commands
Both ttest and sdtest have interactive versions, called ttesti and sdtesti, respectively. These commands do not require the full data set; the relevant summary statistics are instead entered directly after the command.
Some examples
One-sample problem: n = 213, mean = 1.901, sd = 7.529, H: mean = 0

ttesti 213 1.901 7.529 0
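What ttesti computes from these summary statistics can be sketched in Python (a minimal reconstruction of the one-sample t statistic, not Stata's full output):

```python
# One-sample t statistic from summary statistics alone.
import math

n, mean, sd, mu0 = 213, 1.901, 7.529, 0
se = sd / math.sqrt(n)         # standard error of the mean
t = (mean - mu0) / se          # ~3.68, compared with t with n - 1 = 212 df
```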