1
Ph.D. COURSE IN BIOSTATISTICS DAY 3

Example: Serum triglyceride measurements in cord blood from 282 babies (Data file: triglyceride.dta)

[Figure: histogram of triglyceride (x-axis 0 to 2, frequency on the y-axis) with markers at m-2s, m-s, m, m+s, m+2s, where m = sample mean and s = sample standard deviation]
2
[Figure: Q-Q plot of triglyceride against the inverse normal]
The distribution of serum triglyceride does not look like a normal distribution. The distribution is skew to the right or positively skew.
A Q-Q plot clearly shows the lack of fit.
How should we summarize these data?
How should we compare two groups of measurements for which the variation is positively skew?
3
Note:
• If a distribution is symmetric then mean = median.
• If a distribution is skew to the right then mean > median.
• If a distribution is skew to the left then mean < median.
Solution 1: A “non-parametric” approach

The mean and standard deviation are mainly useful for symmetric distributions. For a skew distribution the “upward” spread differs from the “downward” spread, and a single-number summary of the spread is therefore not sufficient.
Summary statistics useful for skew distributions:
median and 25th and 75th percentiles, or e.g. median and 5th and 95th percentiles
The skewness can be identified from these statistics.
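As an illustration (a Python sketch, not part of the course material; the data are a constructed, deterministic skew sample), the percentile summary makes the asymmetry visible: the upward spread p75 − median exceeds the downward spread median − p25 for a right-skew sample.

```python
# Illustrative sketch: for a right-skew sample the upward spread exceeds
# the downward spread, so report median and percentiles, not a single
# spread number.
import math
import statistics

# deterministic example: values symmetric on a log scale, skew on the raw scale
zs = [i / 50 for i in range(-100, 101)]        # -2.00, -1.98, ..., 2.00
x = sorted(math.exp(z) for z in zs)            # positively skew sample

def percentile(sorted_vals, p):
    # nearest-rank percentile; adequate for illustration
    k = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[k]

med = statistics.median(x)                     # exp(0) = 1
p25, p75 = percentile(x, 25), percentile(x, 75)
upward, downward = p75 - med, med - p25        # upward spread is the larger
```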
A transformation of the data may lead to data that conform with a normal distribution. Statistical methods for data from a normal distribution may then be applied to the transformed data.
In principle many different transformations could be considered, but interpretation of results based on transformed data may be complicated, so in practice only a few simple transformations, e.g. the power transformations and in particular the logarithm, are used. Here we look at logarithm transformations:
Natural logarithms are usually preferred for calculations. Figures with a logarithmic scale usually apply a logarithm to base 10.
5
[Figure: histogram of triglyceride on a logarithmic scale (x-axis .5 to 2, frequency on the y-axis)]
On a log-scale pairs of points with the same ratio have the same distance, e.g. log(10) - log(5) = log(100) – log(50). Log-transformations therefore remove (or reduce) positive skewness.
On a log-scale these data look like a sample from a normal distribution, but can we interpret results derived from log-transformed data?
7
ANALYSIS OF LOG-TRANSFORMED DATA. INTERPRETATION OF RESULTS

Consider a random variable X and introduce Z = log(X). The logarithm is here the natural logarithm, sometimes also called ln.

Percentiles of the distribution of X are transformed to the analogous percentiles of the distribution of Z. In particular

median(Z) = median(log(X)) = log(median(X))

The exponential function exp is the inverse of the natural logarithm, so percentiles are easily “back-transformed”, in particular

exp(median(Z)) = median(X)
On a log-scale the triglyceride data can be described by a normal distribution, so on a log-scale the data are usefully summarized by a mean and a standard deviation,
but what do these numbers mean?
8
Similar results are not true for the mean and standard deviation:

exp(mean(Z)) ≠ mean(X)
exp(sd(Z)) ≠ sd(X)

If the transformed variable Z follows a symmetric distribution we have

mean(Z) = median(Z)

so, when ”back-transformed”, the sample mean z̄ of the log-transformed observations becomes an estimate of the median on the original scale, i.e. exp(z̄) is an estimate of median(X).
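A small numeric sketch (Python, hypothetical data chosen to be symmetric on the log scale): exp of the mean of the logs is the geometric mean, which estimates median(X).

```python
# Back-transforming the mean of the logs gives the geometric mean; for data
# symmetric on the log scale it equals the sample median.
import math
import statistics

x = [0.5, 0.8, 1.0, 1.25, 2.0]             # hypothetical positive data
z = [math.log(v) for v in x]               # log-transformed sample
geo_mean = math.exp(statistics.mean(z))    # estimate of median(X)
```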
If the transformed variable follows a normal distribution further results are available:

mean(X) = exp(μ_Z + σ_Z²/2)

sd(X) = mean(X) · √(exp(σ_Z²) - 1)
9
If the transformed variable Z follows a normal distribution, the standard deviation of Z can be used to estimate c.v.(X), the coefficient of variation of X. The relations above show that

c.v.(X) = sd(X)/mean(X) = √(exp(σ_Z²) - 1) ≈ σ_Z

The approximation is accurate if the variation is small, e.g. σ_Z ≤ 0.3.

The result above shows that, unless the variation is large, the standard deviation of the log-transformed data, s_Z, can be used as an estimate of the coefficient of variation of the observations on the original scale.

The coefficient of variation is often used to describe the variation of positively skew distributions.

If the variation is large the coefficient of variation of X is estimated by

√(exp(s_Z²) - 1)
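The exact expression and the small-variation approximation can be compared numerically (Python sketch; s_z below is close to the standard deviation of the log-triglyceride data used later in the lecture):

```python
# Exact c.v. from the lognormal relation versus the approximation c.v. ~ s_z.
import math

s_z = 0.3877                                   # sd on the log scale (assumed value)
cv_exact = math.sqrt(math.exp(s_z ** 2) - 1)   # ~0.403
rel_err = (cv_exact - s_z) / cv_exact          # ~4% relative error of the approximation
```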
10
Confidence intervals

Assume that the log-transformed data follow a normal distribution.

Let θ̂_L and θ̂_U denote the lower and upper 95% confidence limits for θ = mean(Z), then

θ̂_L = z̄ - t_0.975 · s/√n
θ̂_U = z̄ + t_0.975 · s/√n

and exp(θ̂_L) and exp(θ̂_U) are 95% confidence limits of median(X). Confidence intervals for mean(Z) are thus ”back-transformed” to confidence intervals for median(X).

Similarly, by inserting the confidence limits for σ_Z in the relation

c.v.(X) = √(exp(σ_Z²) - 1)

confidence limits for the coefficient of variation of X are obtained.
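The back-transformation of the confidence limits can be sketched in Python (hypothetical triglyceride-like values; the t quantile is the 7-df table value):

```python
# t-based 95% confidence limits for mean(Z) on the log scale,
# back-transformed to limits for median(X).
import math
import statistics

x = [0.45, 0.52, 0.38, 0.61, 0.49, 0.55, 0.42, 0.58]   # hypothetical data
z = [math.log(v) for v in x]
n = len(z)
zbar = statistics.mean(z)
se = statistics.stdev(z) / math.sqrt(n)
t975 = 2.365                                # t quantile, n - 1 = 7 df
lo, hi = zbar - t975 * se, zbar + t975 * se
ci_median = (math.exp(lo), math.exp(hi))    # 95% limits for median(X)
```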
11
Two independent samples
Consider two independent samples x_1, …, x_n and y_1, …, y_m.

Assume that the sample distributions are positively skew, but that the log-transformed samples v_1, …, v_n and w_1, …, w_m can be described by normal distributions.

The hypothesis of identical means on a log scale, μ_V = μ_W, corresponds to the hypothesis median(X) = median(Y).

The ”back-transformed” difference exp(v̄ - w̄) is an estimate of the ratio

median(X) / median(Y)

Similarly, the confidence interval for μ_V - μ_W transforms back to a confidence interval for the ratio of the medians.

The F-test of the hypothesis σ_V = σ_W is equivalent to testing the hypothesis c.v.(X) = c.v.(Y).
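The two-sample back-transformation can be sketched as follows (Python, hypothetical samples):

```python
# Comparing two positively skew groups via the difference of means on the
# log scale, back-transformed to a ratio of medians.
import math
import statistics

x = [0.8, 1.2, 0.9, 1.5, 1.1, 0.7]     # hypothetical group 1
y = [0.5, 0.8, 0.6, 0.9, 0.7, 0.55]    # hypothetical group 2
v = [math.log(t) for t in x]
w = [math.log(t) for t in y]
# estimate of median(X) / median(Y)
ratio_est = math.exp(statistics.mean(v) - statistics.mean(w))
```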
12
Example – triglyceride data: summary statistics for skew data
We compute a confidence interval for mean(Z) and transform the limitsback to the original scale.
14
Estimation of c.v.(X):
The standard deviation is here a rather crude estimate of c.v.(X); instead we use

√(exp(0.1502871) - 1) = 0.403

The estimate is very similar to the estimate computed from the sample mean and the sample standard deviation of the original data.

θ̂_L = -0.757 - 1.968 · 0.0231 = -0.8026
θ̂_U = -0.757 + 1.968 · 0.0231 = -0.7115

95% confidence limits for median(X) therefore become exp(-0.8026) = 0.45 and exp(-0.7115) = 0.49.

We can also compute a 90% prediction interval based on a normal distribution for Z and transform back to the original scale:

lower limit: exp(-0.757 - 1.650 · 0.388) = exp(-1.397) = 0.25
upper limit: exp(-0.757 + 1.650 · 0.388) = exp(-0.117) = 0.89

As expected these values are very similar to the 5 and 95 percentiles.
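The arithmetic can be re-checked with a short Python sketch (the summary numbers z̄ = -0.757, s = 0.388, n = 282 are taken from the slides):

```python
# Re-computing the back-transformed confidence and prediction limits
# for the triglyceride example.
import math

zbar, s, n = -0.757, 0.388, 282
t975 = 1.968                        # t quantile, 281 df
se = s / math.sqrt(n)               # ~0.0231
ci_median = (math.exp(zbar - t975 * se), math.exp(zbar + t975 * se))
# 90% prediction interval on the original scale (t ~ 1.650)
pred = (math.exp(zbar - 1.650 * s), math.exp(zbar + 1.650 * s))
```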
15
Why use transformations of data?
For simple problems a simple solution is OK: Use sample percentiles as descriptive statistics and non-parametric tests for statistical inference.
Why ”change the data to fit the method”? – instead of ”change the method to fit the data”?
Transformation of data

For more complicated problems the methods available assume that data are from a normal population. The choice is then between

• A simple, but unsatisfactory analysis
• A satisfactory analysis based on transformed data

The methods based on a normal distribution are appealing because they are more powerful and cover a wider range of problems.

Disadvantages: Interpretation and presentation of results from analyses of transformed data are not always straightforward.
16
THE TWO-SAMPLE PROBLEM WITH PAIRED DATA
Example: Measurement of body temperature

The Stata file temp.dta contains measurements of body temperature in 96 patients using a rectal Hg thermometer, an oral Hg thermometer, and an electronic device (Craft thermometer).

The data were collected as part of a study to assess the validity of the electronic thermometer. Here we consider the two series of Hg thermometer readings.
17
[Figure: Oral Hg versus Rectal Hg – scatter plot of hgoral against hgrectal (both 35 to 41) with the identity line shown]

A plot of 94 pairs of readings of oral and rectal Hg temperature (two patients had missing rectal readings).
18
Questions:
• What is the difference between oral and rectal body temperature?
• How much do the differences vary between patients?
The situation resembles the two-sample problem discussed in lecture 2, but here the same patients are measured with both methods. We no longer have two independent samples, but two paired samples.
The “obvious” analysis
1. For each patient compute the difference between the two readings
2. Provided the assumptions are satisfied, the differences are modeled as a sample from a normal distribution: compute confidence intervals for the expected difference and the standard deviation, and perhaps a one-sample t-test of the hypothesis that the expected difference is 0.
If the normal distribution does not apply, use a non-parametric approach instead (Lecture 5).
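The “obvious” analysis can be sketched in Python with hypothetical readings (the course itself uses Stata's ttest for this step):

```python
# Paired analysis via per-patient differences: estimate, sd, t-test, CI.
import math
import statistics

rectal = [37.5, 38.1, 37.9, 38.6, 37.2, 38.0, 37.7, 38.3]  # hypothetical readings
oral   = [37.0, 37.6, 37.5, 38.0, 36.8, 37.4, 37.3, 37.8]
d = [r - o for r, o in zip(rectal, oral)]      # one difference per patient

n = len(d)
dbar = statistics.mean(d)                      # estimate of the expected difference
sd = statistics.stdev(d)
se = sd / math.sqrt(n)
t = dbar / se                                  # one-sample t statistic for H: mean = 0
t975 = 2.365                                   # t quantile, 7 df
ci = (dbar - t975 * se, dbar + t975 * se)      # 95% CI for the expected difference
```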
19
Note:
• The obvious approach was used for the analysis of the data on change in diastolic blood pressure (Lecture 2).
• The data may be used to identify systematic differences between the methods (accuracy), but replicate measurements are needed to describe the random measurement error (precision).
What kind of model considerations lead to this analysis?
Consider a given patient and introduce the following notation
X = a reading of the rectal Hg temperature
Y = a reading of the oral Hg temperature

and let

D = X – Y = the difference between the two readings.

The temperature readings X and Y can be considered as a sum of two components:

X = A + E
Y = B + F
20
Here
A = the “true” rectal temperature, i.e. the hypothetical value we would obtain as the average reading if the “experiment” was repeated a large number of times
B = the “true” oral temperature
E = the random measurement error associated with a rectal temperature reading using a Hg thermometer
F = the random measurement error associated with an oral temperature reading using a Hg thermometer
The difference D is therefore
D = (A – B) + (E – F)
The first term, A – B, represents the difference between true rectal and oral temperature for the given patient on a given day. The second term, E – F, is the combined effect of measurement errors.
21
In a population of patients the difference of true temperatures, A – B, may vary between individuals and also between days for a given individual. Let δ denote the population mean, then

A – B = δ + C,

where C represents intra-individual variation in the difference in true temperature.

Returning to the difference between the readings we therefore have the following decomposition

D = X – Y = δ + (C + E – F) = δ + U,

where δ is the population mean and U describes random variation, which has three components: inter-individual variation in true differences, measurement error of the rectal reading, and measurement error of the oral reading.

The statistical model underlying the ”obvious” approach describes the data as a random sample from a normal distribution with mean δ and variance σ².
22
Note: The model describes the differences and does not specify anything about the variation of temperature (rectal or oral) in the population of patients (A and B are not assumed to vary according to a normal distribution).
Checking the validity of the assumptions:
• Independence: Reasonable here, since each patient contributes with one difference only.
• Same distribution:
  • The expected difference between the two readings does not depend on the overall level of the temperature. Is this a reasonable assumption?
  • The random variation in the difference has the same size for all temperatures. Is this a reasonable assumption?
• Normal distribution: Is this a reasonable assumption?
23
[Figure: four diagnostic plots – a. oral Hg versus rectal Hg (hgoral against hgrectal, 35 to 41); b. difference versus average (difro against hgmean); c. histogram of the differences, titled "Rectal Hg - Oral Hg"; d. Q-Q plot of the differences (difro against the inverse normal)]
Important plots that should always be made:
a. A plot of Y against X
b. Difference against average (or against sum)
c. Histogram of differences
d. Q-Q plot of differences
The plots were made using a Stata do-file with the following commands
25
Plot a: Oral reading versus rectal reading

[Figure: scatter plot of hgoral against hgrectal (35 to 41) with the identity line (red) and a parallel line with slope 1 (black)]

Comments

If no systematic differences were present the points would scatter around the identity line (red line).

The plot suggests that rectal temperature is systematically higher, but the size of the difference does not seem to vary systematically over the range of temperatures: the points scatter around a line with slope 1 (black line), and the spread is fairly constant over the range of temperatures.
26
Plot b: Difference versus average. Often called a Bland-Altman plot.

[Figure: scatter plot of difro against hgmean (37 to 41)]

Comments

If no systematic differences are present the points scatter around the horizontal line at 0 (red line).

The plot suggests that rectal temperature is systematically higher, but the size of the difference does not seem to vary systematically over the range of temperatures. The points scatter around a horizontal line at approx. 0.5 (black line) and the variation is fairly constant over the range of temperatures.
27
Plots c and d: The histogram and the Q-Q plot of the differences.

[Figure: c. histogram of the differences, titled "Rectal Hg - Oral Hg"; d. Q-Q plot of the differences against the inverse normal]

Comments: These plots are mainly used for assessing the validity of a normal distribution. The distribution looks fairly symmetric with a modest departure from the normal curve. The slightly S-shaped pattern in the Q-Q plot indicates that both tails are too long. This could be the consequence of a few gross errors.
28
Analysis using Stata:
A paired two-sample problem can be analyzed with the command ttest directly, i.e. without first generating the difference
[Stata ttest output not shown; callouts mark the average difference, the 95% confidence limits for the expected difference, and the p-value of the hypothesis that the expected difference is 0]
29
Analysis, continued
Estimates obtained from the output:

δ̂ = 0.49 (95% confidence limits: 0.43 to 0.56)
σ̂ = 0.32 (σ̂² = 0.102)

95% confidence limits for σ become (see Lecture 2, page 31):

σ̂_L = 0.28, σ̂_U = 0.37
Conclusion: Rectal Hg temperature is on the average 0.5 degree higher than oral Hg temperature. The random variation (measurement errors and inter-and intra-individual variation) has a standard deviation of 0.32 degree.
The confidence interval of the expected difference does not describe the agreement between the two methods. Limits of agreement are defined as δ̂ ± 2σ̂ (i.e. mean diff. ± 2·s.d.). This is an approx. 95% prediction interval. The limits are rather wide: -0.15 and 1.13, reflecting the relatively large variation in the differences.
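A quick check of the limits (Python; the estimates 0.49 and 0.32 are the slide's values):

```python
# Limits of agreement: mean difference +/- 2 sd.
mean_diff, sd_diff = 0.49, 0.32
lower = mean_diff - 2 * sd_diff    # -0.15
upper = mean_diff + 2 * sd_diff    #  1.13
```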
30
Example: Urinary albumin excretion rate

The file albu.dta contains data on urinary albumin excretion rate (µg/min) measured by two different methods (“one-hour” and “night”) in 15 patients.
[Figure: diagnostic plots a–d – a. onehour versus night; b. difference versus average; c. histogram of the differences; d. Q-Q plot of the differences against the inverse normal]
31
Example: T4 and T8 cell concentration

The file tcounts.dta contains data on the number of T4 and T8 cells/mm³ in blood samples from 20 patients in remission from Hodgkin’s disease.

[Figure: diagnostic plots a–d – a. t8 versus t4; b. difference versus average; c. histogram of the differences; d. Q-Q plot of the differences against the inverse normal]
32
In both of these examples the difference between the two values increases with the size. Also the variation in the difference seems to increase with the size of the measurements.

The ”obvious” analysis may not be appropriate, but do we have any alternatives?
[Figure: the difference-versus-average plots for the two examples, shown side by side]
33
PAIRED DATA: ABSOLUTE OR RELATIVE CHANGE/DIFFERENCE?
Data: a sample of size n of pairs of observations (x,y). Such data arise in many situations, e.g.
• Comparison of measurement methods
• Comparison of pre-treatment and post-treatment measurements
• Studies of inter-observer or intra-observer variation
Problem: Do the x’s differ from the y’s? If yes, how?
First reaction: Look at differences (the obvious analysis above)
BUT should we consider absolute or relative difference/change?
Absolute change: y - x

Relative change: (y - x)/x = y/x - 1

Note: the relative change is just the ratio ”adjusted” such that 0 becomes the ”no change” value.
34
Absolute or relative change?
The relationship between the two series of measurements may identify the most appropriate choice. Two simple relations between y and x:

1. y = x + a
2. y = x · b

The choice should reflect the structure of the systematic variation in the data.

In situation 1 differences will be relatively stable. If the variation in the data conforms to this relation, absolute change is the appropriate choice (from a statistical perspective).

In situation 2 ratios will be relatively stable. If the variation in the data conforms to this relation, relative change is the appropriate choice (from a statistical perspective).
35
The paired t-test and the non-parametric analog are primarily intended for situation 1, since these tests are based on the differences.
Situation 2 is often best handled by taking logarithms:

y = x · b implies log(y) = log(x) + log(b)

i.e. the relationship between log(x) and log(y) is additive (situation 1) with a = log(b).

In short: If the variation in a series of pairs (x, y) is multiplicative (situation 2) the best approach is usually to compute the differences

log(y) - log(x) = log(y/x)

and base the statistical analyses on these differences. This corresponds to working with ratios on a log scale.
36
Paired data: Difference of logs versus relative change
Why not use relative change in situation 2?
Example
Patient    x     y    (y-x)/x    y/x    log(y)-log(x)
1          100   200   100%      2.00    0.693
2          200   100   -50%      0.50   -0.693
Average    150   150    25%      1.25    0
The second person is just a ”reversed version” of the first person, so ”nothing happens”.
• The first average (x) is identical to the second average (y).
• The average relative change is 25%, indicating a slight increase.
• The average ratio is larger than 1, indicating a slight increase.
• The average of the differences of logs – the ratio measured on a log scale – is 0, indicating the sensible result: no change.
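The two-patient table can be reproduced directly (Python sketch):

```python
# The average relative change and average ratio both suggest an increase,
# while the average difference of logs gives the sensible "no change" answer.
import math
import statistics

pairs = [(100, 200), (200, 100)]     # patient 2 reverses patient 1
rel_change = [(y - x) / x for x, y in pairs]              # 100%, -50%
ratios = [y / x for x, y in pairs]                        # 2.00, 0.50
log_diffs = [math.log(y) - math.log(x) for x, y in pairs]

avg_rel = statistics.mean(rel_change)    # 0.25: spurious "increase"
avg_ratio = statistics.mean(ratios)      # 1.25: spurious "increase"
avg_log = statistics.mean(log_diffs)     # 0.0: no change
```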
37
Paired data: Difference of logs versus relative change
Working with ratios on a log scale is convenient, since e.g.

log(y/x) = -log(x/y)

The relative change does not have this property:

(y - x)/x ≠ -(x - y)/y

When the ratio y/x is close to 1 then

log(y) - log(x) = log(y/x) = log(1 + (y - x)/x) ≈ (y - x)/x

i.e. when the ratio is close to 1 (e.g. within ±20%) the difference between the natural logarithms is approximately equal to the relative change.
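A numerical check of the approximation (Python sketch, a few ratios within ±20% of 1):

```python
# Gap between the difference of natural logs and the relative change.
import math

gaps = {r: abs(math.log(r) - (r - 1)) for r in (0.85, 0.95, 1.05, 1.20)}
```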
Paired data with a multiplicative structure (situation 2) usually also have increased variability for large observations. Working with data on a log scale stabilizes the variation and may therefore lead to a simpler description of the random variation (a coefficient of variation).
38
Example: T4 and T8 cell concentration

The diagnostic plots (page 31) indicated that an additive relationship was inappropriate. Data are now log-transformed to evaluate the fit of a multiplicative structure.

[Figure: diagnostic plots based on log-transformed data – a. log(t8) versus log(t4); b. difference of logs versus average of logs; c. histogram of the differences of logs; d. Q-Q plot of the differences of logs against the inverse normal]
39
Example: T4 and T8 cell concentration – continued
Stata analysis of log-transformed data:

gene logt4=log(t4)
gene logt8=log(t8)
ttest logt4=logt8
With log-transformed data both the systematic and the random variation agree better with the assumptions.
40
ANALYSIS OF LOG-TRANSFORMED, PAIRED SAMPLESINTERPRETATION OF RESULTS
Summary statistics in the output describe

log(T4) - log(T8) = log(T4/T8)

i.e. the ratio of the counts on a log scale.
The log-transformed ratios are a sample from a normal distribution. The relationships on pages 7–10 show how to translate the results to results about the ratios. We have
Ratio: log scale                         Ratio: original scale
mean = 0.250                             median = exp(0.250) = 1.28
conf. limits of mean: -0.016, 0.516      conf. limits of median: 0.98, 1.67
standard deviation = 0.569               coefficient of variation = 62%
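The back-transformed column can be checked numerically (Python; the log-scale numbers 0.250, (-0.016, 0.516) and 0.569 are taken from the output above):

```python
# Back-transforming the log-scale summaries of the T4/T8 ratio.
import math

median_ratio = math.exp(0.250)                      # ~1.28
ci_median = (math.exp(-0.016), math.exp(0.516))     # ~(0.98, 1.67)
cv = math.sqrt(math.exp(0.569 ** 2) - 1)            # ~0.62, i.e. 62%
```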
The estimates above may be compared with estimates based on the sample of ratios.
gene ratio=t4/t8
tabstat ratio, stat(med cv min max)

Note: the sample size is 20, so the ”non-parametric” 95% prediction limits are min and max.

Output:

variable |       p50        cv       min       max
---------+----------------------------------------
   ratio |  1.193993  .6085281  .4736842  3.818452
The two sets of estimates agree reasonably well.
Note: Using log-transformation in the analysis of two paired samples leadsto an estimate of the median ratio.
Using log-transformation in the analysis of two independent samples leads to an estimate of the ratio of the medians.
42
WHY USE TRANSFORMATION OF DATA?
Overall purpose

Log transformation – and other transformations – of the data is often used because a simpler description of the transformed data is available.
Two aspects

A simpler description of the random variation:
• A positively skew distribution was replaced by a symmetric distribution
• Bland-Altman plot: on a log scale the size of the variation was independent of the level of the outcome

A simpler, and more valid, description of the systematic variation:
• For paired data the multiplicative structure was replaced by an additive structure
More advanced statistical methodology for continuous variables, e.g. regression models and analysis of variance models, assumes constant standard deviation and additivity of effects. Transformation of data may therefore be necessary to apply these methods.
43
Note: Whether or not to use the t-test is often believed to be the most important question in the analysis of two-sample problems.
BUT the choice of appropriate measure of change is usually much more important than the choice of test statistic.
Always look at diagnostic plots! However, sometimes there is no clear answer, because
1. Neither the original scale nor a log scale is appropriate.

2. If the variation between pairs is modest it can be very difficult to distinguish between an additive structure and a multiplicative structure in the data. In such situations the results are usually very similar for transformed and untransformed data.

3. The transformation that achieves additivity is not the transformation that achieves constant variation.
4. It is just a mess
44
STATA – some interactive commands
Both ttest and sdtest have interactive versions, called ttesti and sdtesti, respectively. These commands do not require the full data set; the relevant summary statistics are instead entered directly after the command.
Some examples
One-sample problem: n = 213, mean = 1.901, sd = 7.529, H: mean = 0

ttesti 213 1.901 7.529 0
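What ttesti computes from these summary statistics can be sketched in Python (a minimal reconstruction of the one-sample t statistic, not Stata's full output):

```python
# One-sample t statistic from summary statistics alone.
import math

n, mean, sd, mu0 = 213, 1.901, 7.529, 0
se = sd / math.sqrt(n)         # standard error of the mean
t = (mean - mu0) / se          # ~3.68, compared with t with n - 1 = 212 df
```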