Introduction to Clinical Biostatistics for Medical Students Atif Zafar, MD Department of Medicine

Jan 26, 2015

Transcript
Page 1: Introduction to Clinical Biostatistics for Medical Students

Introduction to Clinical Biostatistics for Medical Students

Atif Zafar, MD

Department of Medicine

Page 2: Introduction to Clinical Biostatistics for Medical Students

Overview of Presentation

• Introductory Concepts (Review)

• Hypothesis Testing

• Linear Regression and Correlation

• Analysis of Variance (ANOVA)

• Nonparametric Statistics

• Survival Analysis

Page 3: Introduction to Clinical Biostatistics for Medical Students

Introductory Concepts

Page 4: Introduction to Clinical Biostatistics for Medical Students

Introductory Concepts

• Types of Data

• Presenting Data

• Descriptive Measures

• Probability and Distributions

• Estimation Techniques

Page 5: Introduction to Clinical Biostatistics for Medical Students

Types of Data

• Data are usually Discrete or Continuous
  – Discrete Variables take on a finite set of values that can be counted
    • Race, Gender, Year in School etc.
  – Continuous Variables take on an infinite set of values
    • Age, Height/Weight, Blood Pressure

Page 6: Introduction to Clinical Biostatistics for Medical Students

Types of Data

• A special type of Discrete Variable is the Binary Variable, which takes on exactly 2 possible values
  – Gender (M/F)
  – Pregnant? (Y/N)
  – Hypertensive? (Y/N)

Page 7: Introduction to Clinical Biostatistics for Medical Students

Types of Data

• Sometimes, discrete variables have a “natural ordering” to them and are called Ordinal Variables
  – For example, names of consecutive days in a week (M, Tu, Wed, Thurs, Fri, Sat, Sun)
• Other types of discrete variables do not have a natural order and are called Nominal Variables
  – Race (African American, Caucasian, Asian, Hispanic etc.)

Page 8: Introduction to Clinical Biostatistics for Medical Students

Types of Data

• If in an experiment you measure a single variable, it is called a Univariate experiment

• If you measure 2 variables, it is called a Bivariate experiment

• And if you measure multiple variables, it is called a Multivariate experiment

Page 9: Introduction to Clinical Biostatistics for Medical Students

Types of Data

• A Random variable is one whose value is determined by chance or random event

• Typically, a variable X is random if it is the outcome of an experiment where results can occur by chance or are not completely predictable

Page 10: Introduction to Clinical Biostatistics for Medical Students

Types of Data

• Nonparametric Variables
  – Many times in clinical studies, we seek opinion data (e.g. patient satisfaction scores, relative value scales etc.)
  – The data can be ranked but have no absolute scale that is comparable
  – This type of data is called nonparametric data

Page 11: Introduction to Clinical Biostatistics for Medical Students

Presenting Data

• There are many ways to present data:
  – Frequency Tables
  – Pie Charts
  – Bar Graphs (Histograms)
  – Line Graphs
  – Scatter Plots (Scattergrams)
  – Stem and Leaf Displays
  – Box Plots

Page 12: Introduction to Clinical Biostatistics for Medical Students

Presenting Data

• Scatter Plots (Plot of a Bivariate experiment)

                            

              

Study Hours    Regents Score
3              80
5              90
2              75
6              80
7              90
1              50
2              65
7              85
1              40
7              100

Page 13: Introduction to Clinical Biostatistics for Medical Students

Presenting Data

• Stem and Leaf Displays
  – Present a histogram-like picture of the data, while retaining the original data values
  – Dataset: 8520; 9274; 8142; 11298; 10624; 7987; 11172; 12899; 10737; 9198; 13625; 9462; 11847; 10178; 12240; 11690; 10069; 11240; 12745; 12995

Stem    Leaf                       Count (Frequency)
7       987                        1
8       520, 142                   2
9       274, 198, 462              3
10      624, 737, 178, 069         4
11      298, 172, 847, 690, 240    5
12      899, 240, 745, 995         4
13      625                        1

Total number: 20

Page 14: Introduction to Clinical Biostatistics for Medical Students

Presenting Data

• Boxplots
  – Complex visual data structures that combine various measures:
    • Maximum and Minimum Data Points
    • 1st and 3rd Quartile Points
      – Sort the data points from lowest to highest
      – Divide the data points into 2 halves
      – Take the Median value of each half; those are the 1st and 3rd quartiles (Q1, Q3)
    • Compute the Inter Quartile Range (IQR)
      – IQR = Q3 - Q1
    • Compute 1.5 x IQR, then compute Q3 + 1.5*IQR and Q1 - 1.5*IQR
      – Data points lying outside this range are called Outliers
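The quartile and outlier steps above can be sketched in a few lines of Python. This is only an illustration of the "median of each half" method described on the slide; the function names are made up for this example, and dropping the middle point from both halves when the number of points is odd is an assumption (other conventions exist). The scores are the ones used in the quartile example later in the deck.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 by the median-of-halves method (hypothetical helper)."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two data points")
    half = len(data) // 2
    lower = data[:half]      # lower half of the sorted data
    upper = data[-half:]     # upper half; middle point is skipped when n is odd (assumption)
    return median(lower), median(upper)

def outlier_fences(values):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

scores = [20, 30, 50, 60, 67, 67, 70, 80, 90, 95]
q1, q3 = quartiles(scores)
low, high = outlier_fences(scores)
outliers = [x for x in scores if x < low or x > high]
print(q1, q3, low, high, outliers)
```

For these scores the fences are wide enough that no point is flagged as an outlier.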

Page 15: Introduction to Clinical Biostatistics for Medical Students

Presenting Data

• Boxplots

Page 16: Introduction to Clinical Biostatistics for Medical Students

Descriptive Measures

• Now that we have displayed our data, we want to be able to characterize it quantitatively
  – Measures of Central Tendency
    • Mean, Median, Mode
  – Measures of Variability
    • Range, Variance, Standard Deviation
  – Measures of Relative Standing
    • Z-Scores, Percentiles, Quartiles

Page 17: Introduction to Clinical Biostatistics for Medical Students

Measures of Central Tendency

• Mean
  – Arithmetic Average of a sample of data
• Median
  – If you order the data from smallest to largest, the median is the middle value, assuming an odd number of data elements
  – If you have an even number of elements, it is the average of the 2 middle numbers.
• Mode
  – The most common value in a set of values

Page 18: Introduction to Clinical Biostatistics for Medical Students

Measures of Variability

• Once we have located the center of a set of data points, we want to know how “dispersed” they are

Page 19: Introduction to Clinical Biostatistics for Medical Students

Measures of Variability

• Range
  – This is the difference between the highest and lowest value
• Variance
  – Defined to be the average of the square of the deviations of the individual data points about their mean
• Standard Deviation
  – This is defined as the square root of the variance
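As a rough illustration of the measures of central tendency and variability defined above, the sketch below uses Python's standard `statistics` module on the exam scores that appear later in the deck. It assumes the "average of squared deviations" definition, which corresponds to `pvariance` and `pstdev` (dividing by n); many textbooks divide by n - 1 for a sample, which is what `variance` and `stdev` would give.

```python
import statistics

scores = [20, 30, 50, 60, 67, 67, 70, 80, 90, 95]

mean = statistics.mean(scores)          # arithmetic average
median = statistics.median(scores)      # average of the 2 middle values (even n)
mode = statistics.mode(scores)          # most common value
value_range = max(scores) - min(scores) # highest minus lowest

# Average squared deviation about the mean, and its square root
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

print(mean, median, mode, value_range, variance, std_dev)
```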

Page 20: Introduction to Clinical Biostatistics for Medical Students

Measures of Relative Standing

• Sometimes we want to know the position of a particular observation relative to others in a data set– Ex: How you performed with respect to your classmates on an exam

• The Z-Score measures this as the number of standard deviations an observation x lies from the mean:

Z = (x – Mean) / Standard Deviation

Page 21: Introduction to Clinical Biostatistics for Medical Students

Measures of Relative Standing

• Percentiles and Quartiles also indicate relative standing but in terms of the categories of scores from lowest to highest
  – Given a set of n measurements x1, …, xn, the pth percentile is defined to be the value of x that exceeds p% of the measurements and is less than (100-p)% of the values.
  – Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
    • The score 50 is at the 30th percentile, meaning that 30% of the scores are at or below it and 70% are higher.
• Quartiles similarly reflect in which quarter of the set of values a particular observation lies:
  – Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
  – 1st Quartile = 50, 3rd Quartile = 80

Page 22: Introduction to Clinical Biostatistics for Medical Students

Probability

• Suppose you do an experiment with a finite number of possible outcomes (ex: coin toss)
• The Probability of an event E (H/T) is the chance that the event will turn out in a given way in the next repetition of the experiment
• Probability values are always between 0 and 1
• The notation for probabilities is as follows:
  – Given our coin toss experiment,
    • P(H) = Probability that a Head will be tossed in the next round
    • P(T) = Probability that a Tail will be tossed in the next round
• One can estimate probabilities by repeating the experiment many times and observing the outcomes

Page 23: Introduction to Clinical Biostatistics for Medical Students

Probabilities: Some Simple Rules

• Arithmetically, one can combine probabilities of simple and sequential events:
  – Given a complex event composed of N mutually exclusive simple events, the probability of the complex event is equal to the sum of the probabilities of each of the simple events
  – Ex: Coin toss 1 and Coin toss 2

Event   First Coin   Second Coin   P(Ei)
E1      Heads        Heads         ¼
E2      Heads        Tails         ¼
E3      Tails        Heads         ¼
E4      Tails        Tails         ¼

Let A = {E2, E3} (exactly one Head). Then P(A) = P(E2) + P(E3) = ½
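The two-coin table can be reproduced by enumerating the outcomes. The short sketch below assumes a fair coin, so that all four simple events are equally likely, and adds up the probabilities of the simple events that make up A (exactly one Head).

```python
from fractions import Fraction
from itertools import product

# All simple events for two tosses: E1..E4 in the table
outcomes = list(product(["Heads", "Tails"], repeat=2))
p_each = Fraction(1, len(outcomes))   # fair coin assumed: each event has probability 1/4

# Event A: exactly one Head (E2 and E3)
event_a = [o for o in outcomes if o.count("Heads") == 1]

# Simple events are mutually exclusive, so their probabilities add
p_a = p_each * len(event_a)
print(p_a)
```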

Page 24: Introduction to Clinical Biostatistics for Medical Students

Probability Distributions

• Given a random variable X (either discrete or continuous), the Probability Distribution gives a table or formula or graph of the probabilities of each potential value of X
• For a Probability Distribution P(x) the following must hold:
  – 0 <= P(x) <= 1
  – Sum (all P(x) over all x) = 1

Page 25: Introduction to Clinical Biostatistics for Medical Students

Probability Distributions

• There are many kinds of probability distributions:
  – Binomial Distribution
    • Applies to binary variable experiments where only 2 outcomes are possible
  – Poisson Distribution
    • Applies to variables that represent the number of occurrences of a specified event in a given unit of time or space
  – Hypergeometric Distribution
    • Applies to experiments that sample without replacement from a population that is small in comparison to the sample size, so that the success of a trial depends on the outcomes of preceding trials

Page 26: Introduction to Clinical Biostatistics for Medical Students

Probability Distributions

• Normal Distribution (N)
  – Applies to continuous random variables
• Standard Normal Distribution (Z)
  – A Normal Distribution with:
    • Mean of 0
    • Standard Deviation of 1

Page 27: Introduction to Clinical Biostatistics for Medical Students

Estimation Techniques

• So now that we know that certain “experiments” can have results distributed in certain ways, how can we “predict” the result of this experiment?

• This process is called Statistical Inference, where we can estimate the quality of a larger population by analyzing a small sample

Page 28: Introduction to Clinical Biostatistics for Medical Students

Estimation Techniques

• Populations and Samples
  – A Population is the larger set of objects we wish to study
    • Ex: The number of democrats in the country
  – A Sample is a set of “representative” objects we choose in order to estimate the characteristics of the larger set of objects
    • Ex: Take 100 people from each state and determine whether they are democrats

Page 29: Introduction to Clinical Biostatistics for Medical Students

Estimation Techniques

• Parameters and Statistics
  – A Parameter is the “quality” of the population we are trying to estimate
  – In order to estimate the parameter we measure the quality in a sample. This sample quality is called its statistic

Page 30: Introduction to Clinical Biostatistics for Medical Students

Estimation Techniques

• Many types of samples can be taken:
  – Completely Random Sample
  – Stratified Random Sample
    • Divide the population into strata (groups)
    • Take a sample from each group
    • Ex: Party loyalties of teenagers, adults and elderly
  – Cluster Sample
    • Take a simple random sample of clusters from the available clusters in a population
    • Ex: Urban vs. Rural sampling

Page 31: Introduction to Clinical Biostatistics for Medical Students

Hypothesis Testing

Large Sample Estimation Techniques

Page 32: Introduction to Clinical Biostatistics for Medical Students

Introduction to Estimating Techniques

• Before we begin, let's review some common terms:
  – Point Estimate: When we run an experiment on one sample and compute a single value (a mean, etc.) to estimate a population parameter, that value is called a point estimate. Since each experiment has some error, there is a margin of error for every point estimate
  – Interval Estimate: If we repeat the experiment many times over, we get a sense of how far our point estimates are from the true value. A range of values that expresses this “confidence” in our estimate is called an interval estimate or a confidence interval.

Page 33: Introduction to Clinical Biostatistics for Medical Students

Confidence Intervals

• Typically, the confidence interval is defined as follows:

CI = Mean +/- 1.96 x Standard Deviation / sqrt(N)

It tells us that if we repeat the experiment many times over, about 95% of the intervals computed this way will contain the true population Mean
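A small numeric sketch may make the interval concrete. It uses the usual large-sample form, with the sample standard deviation (the square root of the variance) divided by sqrt(N) as the margin of error. The sample values (mean 120, standard deviation 15, N = 100) are invented for illustration and are not from the slides.

```python
from math import sqrt

def confidence_interval_95(mean, sd, n):
    """Large-sample 95% confidence interval for a mean."""
    margin = 1.96 * sd / sqrt(n)   # 1.96 standard errors on each side
    return mean - margin, mean + margin

# Hypothetical sample summary (illustrative numbers only)
low, high = confidence_interval_95(mean=120.0, sd=15.0, n=100)
print(round(low, 2), round(high, 2))
```

With these numbers the standard error is 1.5, so the interval extends 2.94 on either side of the sample mean.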

Page 34: Introduction to Clinical Biostatistics for Medical Students

Significance Value (α)

• Statisticians arbitrarily choose a value of 5% to represent events that can occur by chance alone
• So if a result would occur by chance alone less than 5% of the time, it is considered statistically significant
• The 5% value is called a significance value, or α

Page 35: Introduction to Clinical Biostatistics for Medical Students

P-Values

• A P-value is a useful way to represent the probability of a certain event and is seen extensively in the medical literature
• Definition:
  – The P-Value is the probability that a result at least as extreme as the one observed would occur by chance alone (that is, if there were no real effect)
  – Given our significance level of 5% for chance, we want P-values to be less than 5% or .05 to be considered statistically significant

Page 36: Introduction to Clinical Biostatistics for Medical Students

Comparing Means

• Many times we wish to compare the means of two subsets of a population
  – Ex: MCAT scores for Biology vs. Chemistry majors
  – To do this we would sample MCAT scores from random samples of biology and chemistry majors across the country
  – We would compute the mean of each sample
  – We would compare the means to determine if they are significantly different.
• This kind of analysis is exactly what is done by Hypothesis Testing (we hypothesize there is no difference and then try to refute this hypothesis)

Page 37: Introduction to Clinical Biostatistics for Medical Students

Hypothesis Testing

• A statistical test of hypothesis consists of 4 parts
  – A NULL Hypothesis, termed Ho
  – An Alternate Hypothesis, termed Hα
  – A test statistic
  – A rejection region
• The NULL hypothesis is what we want to refute
• The Alternate hypothesis is what we want to support
• The test statistic is what we will use to decide between the NULL and the Alternate Hypotheses
• The Rejection Region is the set of values of the test statistic for which Ho will be rejected

Page 38: Introduction to Clinical Biostatistics for Medical Students

Hypothesis Testing

• So what does this all mean IN LAYMAN'S TERMS?
• Basically we are asking: given a test statistic we specify, what is the probability that a result as extreme as ours would arise by chance alone if the NULL hypothesis (Ho) were true?
• We convert the test statistic into a probability value by looking it up in a table that specifies the probabilities associated with that particular statistic value

Page 39: Introduction to Clinical Biostatistics for Medical Students

Constructing a Hypothesis

• Consider the following question:
  – We wish to show that the hourly wages of construction workers in California are larger than the national average of $14
  – The hypothesis will be written down as:

Hα: U > $14
Ho: U = $14
Test statistic = Z-value = (X – Uo) / (SD / sqrt(N))
Significance level α = 0.05; Rejection region: reject Ho if Z > 1.645

Page 40: Introduction to Clinical Biostatistics for Medical Students

Testing a Hypothesis

• The average weekly earnings for men in managerial and professional positions is $725. Do women in the same position have average weekly earnings that are less than those for men?
• A random sample of N=40 women in managerial positions showed X=$670 and SD = $102. Test the appropriate hypothesis using α = 0.01

Solution: Ho: U = 725   Hα: U < 725

Z = (X – U) / (SD / sqrt(N))
Z = (670 – 725) / (102 / sqrt(40)) = -3.41

The rejection region for a one-tailed test at α = 0.01 is Z < -2.33. Since -3.41 < -2.33 we reject Ho and conclude that the average weekly salary for women is significantly less than for men. The probability that we have rejected a true Ho is at most 0.01
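The arithmetic of this example can be checked with a short script. This is a sketch only: it treats the reported 102 as the sample standard deviation, and it uses the standard library's `statistics.NormalDist` in place of a printed Z-table to get the one-tailed critical value and P-value.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sd, n = 670.0, 725.0, 102.0, 40
alpha = 0.01

# Test statistic: how many standard errors the sample mean lies from the null value
z = (x_bar - mu0) / (sd / sqrt(n))

standard_normal = NormalDist()          # mean 0, standard deviation 1
z_critical = standard_normal.inv_cdf(alpha)   # lower-tail cutoff, about -2.33
p_value = standard_normal.cdf(z)              # one-tailed (lower) P-value

reject_null = z < z_critical
print(round(z, 2), round(z_critical, 2), reject_null)
```

The decision compares Z with the critical value (or, equivalently, the P-value with α), not Z with α itself.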

Page 41: Introduction to Clinical Biostatistics for Medical Students

Confidence in our Test Result

• So what is our “confidence” in our result?
• Well, we can have 2 types of errors:
  – Type I error = Rejecting Ho when Ho is true = α
  – Type II error = Failing to reject Ho when Ho is false = β
• To compute a confidence value, we calculate the Power of the Test, which is the probability of correctly rejecting the NULL hypothesis
  – Power = (1-β)

Page 42: Introduction to Clinical Biostatistics for Medical Students

Types of Tests

• Given the kinds of data we have and the types of information we seek there are different types of tests available to us:
  – Student's T-Test
    • Used to compare MEANS of two populations
    • Works for small samples (N<30)
  – Chi-Square Test
    • Used to test a hypothesis about a population’s VARIANCE
  – F-Test
    • Used to compare the VARIANCES of 2 populations

Page 43: Introduction to Clinical Biostatistics for Medical Students

Types of Tests

• We can do these tests in different ways:
• We can have one-tailed and two-tailed tests
  – A One-tailed test is used when the alternate hypothesis places the mean on one specific side (either less or greater) of the null hypothesis value
  – A Two-tailed test is used when the alternate hypothesis allows the mean to be on either side of the null value
• We can also do Paired Tests, where the two sets of measurements are matched (for example, the same subjects measured before and after a treatment) and we test the differences within each pair

Page 44: Introduction to Clinical Biostatistics for Medical Students

T-tests: Small Sample Testing

• Up to now we have assumed the sample size to be large (N>30) so that the normal distribution can be used. But what happens when the sample size is small (N<30)?
• Well, in this case the test statistic no longer follows the normal distribution. Its distribution looks somewhat different – it is shorter and wider – and is called the T-Distribution
• Every T-distribution has an associated Degrees of Freedom (df), which is equal to N-1
• A T-Table is consulted to get the appropriate values of the T-statistic when doing a T-test. You need the df and the significance level to look up the T-values.

Page 45: Introduction to Clinical Biostatistics for Medical Students

Chi-Square Distribution

• Remember that the T-test compares population Means. What if we want to test a population variance?
• In this case, we would use a Chi-Square distribution and our test statistic will be a chi-square value
  – $\chi^2 = (n-1)s^2 / \sigma_0^2$
    • where n = sample size
    • $s^2$ = sample variance
    • $\sigma_0^2$ = value of the Population Variance that we are testing
• A variant of the Chi-Square test is called the Mantel-Haenszel Test
  – It is a test of association between 2 ordinal variables (frequency data)

Page 46: Introduction to Clinical Biostatistics for Medical Students

F-Distribution

• What if we want to compare the population variances of two different populations?

• In this case we use an F-Distribution and an F-statistic

  – $F = s_1^2 / s_2^2$, where $s_1^2$ and $s_2^2$ are the variances of Samples 1 and 2
  – Typically we will have 2 degrees of freedom (v1 and v2) with F-tests

Page 47: Introduction to Clinical Biostatistics for Medical Students

Linear Regression and Correlation

Page 48: Introduction to Clinical Biostatistics for Medical Students

Linear Regression and Correlation

• In many situations in clinical studies we wish to answer the question: How is the random variable X related to the random variable Y?
  – Ex: How is smoking related to lung cancer?
  – Ex: How is age related to development of Alzheimer’s Disease?
  – Ex: How is hypertriglyceridemia related to metabolic syndrome?
• Such questions are answered statistically using Regression Analysis, which looks for relationships (either negative or positive) among different variables, and Correlation, which measures the strength of those relationships
• Relationships may have many forms:
  – Related linearly
  – Related curvilinearly
  – Related colinearly
  – Associations but not Correlations

Page 49: Introduction to Clinical Biostatistics for Medical Students

Linear Regression

• The Linear Regression model postulates that two random variables X and Y are related by a straight line as follows:

Y = a + bX + e

Where
Y is the dependent variable
X is the independent variable
a is the intercept
b is the slope
e is the residual value

Page 50: Introduction to Clinical Biostatistics for Medical Students

Linear Regression

• Residual Value (e)
  – Given that the regression analysis procedure is itself a statistical approach, it is expected to have some degree of error associated with it
  – Thus we add a value called the residual value (e) to any regression equation to account for random errors in the process
• Scatterplots
  – In order to perform regression analysis visually, it helps to “graph” the points on a scatterplot
  – A visual relationship can often be observed when looking at these plots

Page 51: Introduction to Clinical Biostatistics for Medical Students

Method of Least Squares

• So, assuming that 2 variables are linearly related, how do we “best fit” a line through a series of points on a scatterplot – the “regression line”?
• One way is to use a “goodness of fit” estimator called the Sum of Squares for Error (S), which we want to minimize:

$S = \sum_i (y_i - f(x_i))^2$

where yi is the observed value and f(xi) is the value predicted by the line at xi
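To show what minimizing the Sum of Squares for Error gives in practice, the sketch below fits a line to the Study Hours / Regents Score data from the scatter plot slide. It uses the standard closed-form least-squares solution (slope = Sxy / Sxx, intercept from the means), which the slides do not spell out; the function names are made up for this example.

```python
hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

def least_squares(x, y):
    """Return (intercept a, slope b) of the line minimizing the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

def sum_squares_error(x, y, a, b):
    """S: sum of squared vertical distances between the points and the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares(hours, scores)
s_error = sum_squares_error(hours, scores, a, b)
print(round(a, 2), round(b, 2), round(s_error, 1))
```

The fitted slope is roughly 6.27 score points per additional study hour, with an intercept near 49.8. Any other line through these points gives a larger S.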

Page 52: Introduction to Clinical Biostatistics for Medical Students

Inferences Concerning Slopes

• The initial question once we have a regression line is whether the data present sufficient evidence to indicate that Y increases or decreases linearly as X increases over the observed region?

• So we use the variability of the points about the line to estimate this:

Variance = $s^2 = S / (n - 2)$

S = Sum of squares for error

n = Sample size

Page 53: Introduction to Clinical Biostatistics for Medical Students

Inferences Concerning Slopes

• Given that we can use S to estimate the variance about the line, we can formulate our hypothesis about the slope using a T-test as follows:

Null Hypothesis: Ho: b = bo
Alternate Hypothesis: Hα: b < bo or b > bo
Test Statistic = t-value = $(b - b_o) / (s / \sqrt{S_{xx}})$

b = regression line slope
bo = slope to test against
s = standard deviation about the line (the square root of the variance $s^2$)
Sxx = sum of squares for the Xi’s = $\sum_i (X_i - X_{mean})^2$

Page 54: Introduction to Clinical Biostatistics for Medical Students

Inferences Concerning Slopes

• So how do we do the T-test and reach a conclusion or calculate a P-value?

• Well, the T-table has several features:
  – Df = Degrees of Freedom (n – 2 when testing a regression slope, n – 1 when testing a single mean)
  – T-values listed for various significance levels
• The procedure for using a T-Table is as follows:
  – Compute the T-value using the statistic in your test
  – Find the row of the table for your degrees of freedom
  – Move across that row to the largest tabulated T-value that your computed T-value exceeds; the P-value will be less than the significance level heading that column
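As a worked illustration of the slope test, the sketch below tests Ho: b = 0 for the Study Hours / Regents Score data from the scatter plot slide. It assumes n – 2 degrees of freedom (two parameters, intercept and slope, are estimated), takes s as the square root of S / (n – 2), and hard-codes the two-tailed 5% critical value for 8 degrees of freedom (2.306) as it would be read from a printed T-table.

```python
from math import sqrt

hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

n = len(hours)
x_mean = sum(hours) / n
y_mean = sum(scores) / n
sxx = sum((x - x_mean) ** 2 for x in hours)
syy = sum((y - y_mean) ** 2 for y in scores)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(hours, scores))

b = sxy / sxx                          # least-squares slope
sse = syy - sxy ** 2 / sxx             # sum of squares for error about the fitted line
s = sqrt(sse / (n - 2))                # standard deviation about the line

b0 = 0.0                               # null hypothesis: no linear relationship
t_value = (b - b0) / (s / sqrt(sxx))
df = n - 2

T_CRITICAL_DF8_TWO_TAILED_05 = 2.306   # from a T-table, df = 8, alpha = 0.05 (two-tailed)
significant = abs(t_value) > T_CRITICAL_DF8_TWO_TAILED_05
print(round(t_value, 2), df, significant)
```

Here the computed T-value (about 4.66) exceeds the tabulated value, so P < 0.05 and the slope differs significantly from zero.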

Page 55: Introduction to Clinical Biostatistics for Medical Students

Linear Regression

• So, graphically what does it look like?

Page 56: Introduction to Clinical Biostatistics for Medical Students

Other Regressions

• Given the types of data you have, there are other methods for fitting the data to a geometric shape:
  – For example, there is Curvilinear Regression
    • Cubic Spline Interpolation
    • Quadratic Interpolation
    • Higher Order Interpolation
  – Logistic Regression
    • This is useful when you have categorical data (non-numeric)
    • For example, when you have a categorical random variable such as HTN (Y/N), Gender (M/F) or Race

Page 57: Introduction to Clinical Biostatistics for Medical Students

Correlation

• As opposed to finding the “best fit” line through a set of data points, Correlation seeks to understand the strength of the relationship.

[Figure: three scatter plots illustrating R = 0.17, R = 0.85 and R = -0.94]

Page 58: Introduction to Clinical Biostatistics for Medical Students

Correlation

• We compute the Pearson Product Moment Coefficient of Correlation (R) as follows:

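$R = S_{xy} / \sqrt{S_{xx} \times S_{yy}}$

where

$S_{xy} = \sum_i (X_i - X_{mean})(Y_i - Y_{mean})$
$S_{xx} = \sum_i (X_i - X_{mean})^2$
$S_{yy} = \sum_i (Y_i - Y_{mean})^2$

-1 <= R <= 1; the larger the absolute value of R, the stronger the correlation. The sign of R gives the direction of the relationship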
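A quick calculation on the Study Hours / Regents Score data from the scatter plot slide shows the coefficient in use. The sketch takes Sxy as the sum of cross-products of the deviations of X and Y from their means, which is the standard definition of the Pearson coefficient; the function name is made up for this example.

```python
from math import sqrt

hours = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]

def pearson_r(x, y):
    """Pearson product moment coefficient of correlation."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(hours, scores)
r_reversed = pearson_r(hours, [-s for s in scores])   # flipping Y flips the sign of R
print(round(r, 3), round(r_reversed, 3))
```

R comes out at about 0.85 for these data, a fairly strong positive relationship; negating the scores gives the same strength with the opposite sign.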

Page 59: Introduction to Clinical Biostatistics for Medical Students

Multiple Linear Regression

• So far we discussed how one variable is related to another in a study.

• But in real life, a study typically has many variables that it is trying to compare as they relate to an outcome
  – Ex: CAD as f(HTN, DM, Smoking, Hyperchol., Obesity, Age)

• In order to do this type of analysis, we can extend the general notion of linear regression to multiple variables.

• We have an intercept as usual but partial slopes (or partial regression coefficients), each one representing a different variable

Page 60: Introduction to Clinical Biostatistics for Medical Students

Multiple Linear Regression

• The General Linear Model (GLM) is then stated as follows:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n + \varepsilon$

With the following assumptions:

1. Y is the response variable you wish to predict
2. $\beta_0, \beta_1, \dots, \beta_n$ are unknown constants
3. $x_1, x_2, \dots, x_n$ are independent predictor variables that are measured without error
4. $\varepsilon$ is a random error that for any set of predictors is normally distributed
5. The random errors associated with any pair of Y values are independent

Page 61: Introduction to Clinical Biostatistics for Medical Students

Multiple Linear Regression

• Note that you can use qualitative (categorical) and quantitative variables in a GLM.

• Categorical Variables look like:
  – X1 = 1 if Group A, 0 if not Group A
• Typically computing p-values and regression equations in a GLM is hard to do by hand, so most people will do it using computer software:
  – SAS™ has a procedure called Proc GLM
  – SPSS/PC™
  – StatSoft™
  – HyperStat™

Page 62: Introduction to Clinical Biostatistics for Medical Students

Multiple Linear Regression

• Problems that can occur when using GLM:
  – Multicollinearity:
    • This happens when 2 of the independent variables xi, xj are themselves related; including both in a model distorts the estimated effect of each
    • Closely related to the problem of Covariates or Confounding Factors
  – Interaction Terms:
    • When the effect of one variable on the outcome depends on the value of another variable, we must add an interaction term to the model
    • For example, suppose you want to study the salary of a professor with respect to # of years of service. Well, this may differ slightly depending on whether you are a male or female.
    • Thus, the salary slope for males may be slightly higher than the salary slope for females despite the same number of years of service.
    • This type of relationship is called an Interaction (between gender and years of service, because the slope varies depending on whether a male or female is selected) and we must add a term of the type b3x1x2:

Y = b0 + b1x1 + b2x2 + b3x1x2
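The effect of the interaction term can be seen by plugging numbers into the model. In the sketch below x1 is years of service and x2 is a 0/1 indicator for male; the coefficient values are invented purely for illustration and do not come from any real salary data.

```python
def predicted_salary(years, is_male, b0, b1, b2, b3):
    """Y = b0 + b1*x1 + b2*x2 + b3*x1*x2, with x1 = years and x2 = is_male (0 or 1)."""
    return b0 + b1 * years + b2 * is_male + b3 * years * is_male

# Hypothetical coefficients (illustrative only)
coeffs = dict(b0=50000.0, b1=1000.0, b2=2000.0, b3=200.0)

# Salary gained per extra year of service, by group
slope_female = predicted_salary(1, 0, **coeffs) - predicted_salary(0, 0, **coeffs)
slope_male = predicted_salary(1, 1, **coeffs) - predicted_salary(0, 1, **coeffs)

print(slope_female, slope_male)
```

With x2 = 0 the slope is b1; with x2 = 1 it is b1 + b3. Without the b3 term both groups would be forced to share the same slope and could differ only in intercept.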

Page 63: Introduction to Clinical Biostatistics for Medical Students

Logistic Regression

• What happens when you have data in the form of proportions (or frequency data) of categorical variables?

• The tool for analysis of this type of data is called a Logistic Regression

• It is based on the Chi-Square Distribution and the model is described as follows:

ln[p/(1-p)] = a + BX + e   or   p/(1-p) = exp(a) x exp(BX) x exp(e)

where:
ln is the natural logarithm (the logarithm to the base 2.71828…)
exp(v) is 2.71828… raised to the power v
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the "odds"
ln[p/(1-p)] is the log odds, or "logit"
all other components of the model are the same.
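The link between the probability p and the linear part a + BX can be illustrated numerically. The sketch below ignores the residual term and uses made-up coefficients (a = -2, B = 0.5); it only shows the logit transform and its inverse, not how the coefficients are estimated.

```python
from math import exp, log

def logit(p):
    """Log odds: ln[p / (1 - p)]."""
    return log(p / (1 - p))

def inverse_logit(v):
    """Convert a log-odds value back to a probability."""
    return 1 / (1 + exp(-v))

def predicted_probability(a, b, x):
    """p(Y=1) for a fitted model ln[p/(1-p)] = a + b*x (residual ignored)."""
    return inverse_logit(a + b * x)

# Hypothetical coefficients (illustrative only)
a, b = -2.0, 0.5
p_at_4 = predicted_probability(a, b, 4)   # a + b*x = 0, so the odds are 1 and p = 0.5
p_at_8 = predicted_probability(a, b, 8)   # higher x gives a higher probability here

print(p_at_4, round(p_at_8, 3))
```

Whatever the value of a + BX, the inverse logit always returns a value between 0 and 1, which is why this form suits binary outcomes.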

Page 64: Introduction to Clinical Biostatistics for Medical Students

The ANalysis Of VAriance

Also known as ANOVA

Page 65: Introduction to Clinical Biostatistics for Medical Students

ANOVA

• Suppose you want to compare the mean reimbursement rates from 5 different health plans

• You could do t-tests among all combinations of the 5 plans, or 10 t-tests

• Suppose all the means are equal. When this procedure is repeated 10 times, the probability of incorrectly concluding that at least one pair of means differ is quite high and you reach an erroneous decision

• Thus we want one test which could compare means for all 5 groups at the same time

• This is exactly what ANOVA provides
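The size of the problem with repeated t-tests can be estimated with a short calculation. The sketch assumes each test is run at α = 0.05 and, as a simplification, that the 10 tests are independent (pairwise tests on the same 5 groups are not strictly independent, so the figure is approximate).

```python
from math import comb

groups = 5
alpha = 0.05

# Number of pairwise comparisons among 5 groups
n_tests = comb(groups, 2)

# Chance of at least one false positive when every null hypothesis is true,
# treating the tests as independent (simplifying assumption)
familywise_error = 1 - (1 - alpha) ** n_tests

print(n_tests, round(familywise_error, 3))
```

Under these assumptions the chance of at least one spurious "significant" difference is about 40%, far above the nominal 5%.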

Page 66: Introduction to Clinical Biostatistics for Medical Students

ANOVA

• ANOVA is a powerful procedure which allows you to do 2 things:
  – Compare the variance between the means of 2 or more groups
  – Compare the variance in data values within each group

Page 67: Introduction to Clinical Biostatistics for Medical Students

ANOVA

• ANOVA procedures can be done with different study designs:
  – Completely Randomized Design
    • Random samples are independently selected from each of k populations.
    • Assumes that the data is homogeneously distributed with a fixed variation
  – Randomized Block Design
    • Assumes that subsets of the population have different variances
    • Within each subset, however, the variability is the same
    • Each subset is called a block.
    • Random samples are then taken from each block

Page 68: Introduction to Clinical Biostatistics for Medical Students

ANOVA for Completely Randomized Designs

• Suppose we want to compare k population means u1..uk based on random independent samples of n1..nk observations selected from populations 1..k respectively
  – Ex: Suppose we have 10 observations of reimbursement figures from each of 5 health plans; then we will have 50 total values
• Then let:
  – xij represent the jth measurement in the ith group
• We define an entity called the Total Sum of Squares (SS) as follows:

$\text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$

where $\bar{x}$ is the overall mean of all the measurements

Page 69: Introduction to Clinical Biostatistics for Medical Students

ANOVA for Completely Randomized Designs

• It can be shown that the sum of squares of deviations of all values about the overall mean – the Total SS (of all 50 values) – can be partitioned into 2 components:
  – SST = Sum of Squares for Treatments (measures variation between sample means)
  – SSE = Sum of Squares for Error (measures variation within samples)
• We have:
  – Total SS = SST + SSE

Page 70: Introduction to Clinical Biostatistics for Medical Students

ANOVA for Completely Randomized Designs

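• Now, we can also compute SSE readily and it is:

$SSE = \sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2 + \dots + \sum_{j=1}^{n_k} (x_{kj} - \bar{x}_k)^2$

where $\bar{x}_i$ is the mean of the ith sample

Knowing SSE and Total SS, we can compute SST

We then compute the Mean Squares of these as follows:

MST = SST / (k-1)
MSE = SSE / (n-k)

where n = n1 + n2 + … + nk is the total number of observations

The final step is to compute an F-statistic as follows:

F = MST / MSE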
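The partition of the Total SS and the resulting F-statistic can be traced on a tiny data set. The three groups below are invented for illustration (they are not reimbursement figures from the slides), and `one_way_anova` is a hypothetical helper name; n is taken as the total number of observations across all groups.

```python
def one_way_anova(groups):
    """Return (SST, SSE, F) for a completely randomized design."""
    all_values = [x for group in groups for x in group]
    n = len(all_values)           # total number of observations
    k = len(groups)               # number of groups (treatments)
    grand_mean = sum(all_values) / n

    # Total SS: squared deviations of every value about the overall mean
    total_ss = sum((x - grand_mean) ** 2 for x in all_values)

    # SSE: squared deviations of each value about its own group mean
    sse = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        sse += sum((x - group_mean) ** 2 for x in group)

    sst = total_ss - sse          # Total SS = SST + SSE
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst / mse

# Hypothetical measurements for 3 groups (illustrative only)
sst, sse, f_stat = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(sst, 6), round(sse, 6), round(f_stat, 6))
```

For these numbers SST = 26 and SSE = 6, so MST = 13, MSE = 1 and F = 13 with v1 = 2 and v2 = 6 degrees of freedom.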

Page 71: Introduction to Clinical Biostatistics for Medical Students

ANOVA for Completely Randomized Designs

• Now, F-tests have 2 degrees of freedom v1 and v2
• In the case of ANOVA,
  – v1 = k – 1
  – v2 = n – k
• We can then do our usual hypothesis testing using this F-statistic as our test

Ho: u1 = u2 = u3 = … = uk
Hα: One or more pairs of population means differ
F-Statistic = MST/MSE with df v1=(k-1), v2=(n-k)
Rejection Region: Reject Ho if F > Fα (found from the table using v1, v2 and α)

Page 72: Introduction to Clinical Biostatistics for Medical Students

ANOVA for Randomized Block Designs

• The computational steps are very similar to those of a completely randomized design except that we add a third term, the sum of squares for BLOCKS (with b blocks)

Total SS = SST + SSE + SSB

We then perform 2 different hypothesis tests:

(1) For comparing Treatment Means:

F = MST/MSE, v1=k-1, v2=n-b-k+1

(2) For comparing BLOCK Means:

F = MSB/MSE, v1=b-1, v2=n-b-k+1

Page 73: Introduction to Clinical Biostatistics for Medical Students

Nonparametric Statistics

Analysis of Ranked Data

Page 74: Introduction to Clinical Biostatistics for Medical Students

Nonparametric Statistics

• What do we do when we have “opinion data”?
• For example, suppose a judge is employed to evaluate and rank the sales abilities of 4 salesmen, the edibility of 5 brands of Corn Flakes or the relative appeal of 5 brands of automobiles
• Clearly it is impossible to give an exact measure of sales competence, the palatability of food or design appeal
• But it is possible to rank the salespeople, food or design choices based on our own opinions.
• Many, many types of studies in medicine use this kind of data gathering (patient satisfaction is one example)

Page 75: Introduction to Clinical Biostatistics for Medical Students

Nonparametric Statistics

• There are many tests available for studying this kind of data:
  – The Sign Test
  – The Mann-Whitney U Test
  – The Wilcoxon Signed-Rank Test for a Paired Experiment
  – The Kruskal-Wallis H Test for Completely Randomized Designs
  – The Friedman Fr Test for Randomized Block Designs
  – Spearman’s Rank Correlation Test

Page 76: Introduction to Clinical Biostatistics for Medical Students

The Sign Test

• Compares 2 populations with respect to how they differ in their responses to qualitative questions:
– Compute the number of responses that were the same
– Then compute the number of responses that differed
– Finally compute X, the number of times the response from population A was greater than the response from population B
– This gives you the number of times (A-B) is positive (i.e. has a positive sign, hence the name)
– This is your test statistic
– You then use a Binomial Probability Distribution (with n = the number of pairs that differed and p = 0.5 under Ho) to do a hypothesis test
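The steps of the sign test can be sketched in a few lines of Python. This is only an illustration: the function name and the paired ratings are hypothetical, tied pairs are dropped, and a two-sided p-value is computed from the Binomial(n, 0.5) distribution.

```python
from math import comb


def sign_test(a, b):
    """Return (x, n, two-sided p-value) for paired responses a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop tied pairs
    n = len(diffs)                                    # pairs that differed
    x = sum(1 for d in diffs if d > 0)                # times (A-B) is positive

    # Under Ho, X ~ Binomial(n, 0.5); double the smaller tail probability
    tail = min(x, n - x)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return x, n, min(1.0, p)


# Hypothetical satisfaction scores from the same 10 raters for A and B
a = [4, 5, 3, 5, 4, 5, 4, 3, 5, 4]
b = [3, 3, 3, 4, 2, 4, 3, 4, 4, 2]
x, n, p = sign_test(a, b)
print(x, n, p)
```

Here one pair is tied, so n = 9, and A is rated higher in 8 of the 9 remaining pairs.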

Page 77: Introduction to Clinical Biostatistics for Medical Students

Mann-Whitney U Test

• Analogous to the T-test for nonparametric data

• Suppose you have 2 populations from which 2 samples of sizes n1 and n2 are obtained

• Rank all (n1+n2) observations together in ascending order, assigning rank values 1, 2, 3… to the observations

• Tied observations are handled by averaging the ranks they would have received and assigning that average to each of the tied observations

• Then calculate the sums of the ranks, T1 and T2, for the two samples

Page 78: Introduction to Clinical Biostatistics for Medical Students

Mann-Whitney U Test

• Now compute the U statistic as follows:

U1 = n1n2 + (n1(n1+1)/2) – T1

U2 = n1n2 + (n2(n2+1)/2) – T2

Look up the appropriate value in the table for your n2

The table will give you a value for Uo on the left-hand side corresponding to your n1

Your computed U (the smaller of U1 and U2) must be less than the Uo stated in the table in order to reject the Null hypothesis (that the population relative frequency distributions are identical)
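The ranking and the two U formulas above can be sketched as follows. The helper `average_ranks`, the function name and the sample values are hypothetical. The helper is deliberately simple (quadratic time), which is fine for small samples.

```python
def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the average rank."""
    s = sorted(values)
    # s.index(v) is the 0-based position of the first copy of v
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2, U), where U is the smaller of U1 and U2."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    t1 = sum(ranks[:n1])        # rank sum for sample 1
    t2 = sum(ranks[n1:])        # rank sum for sample 2
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return u1, u2, min(u1, u2)


# Hypothetical measurements; the value 3.1 is tied across the samples
sample1 = [1.2, 2.5, 3.1, 4.0]
sample2 = [3.1, 5.2, 6.0, 7.3, 8.1]
u1, u2, u = mann_whitney_u(sample1, sample2)
print(u1, u2, u)
```

A useful check is that U1 + U2 always equals n1*n2.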

Page 79: Introduction to Clinical Biostatistics for Medical Students

Wilcoxon Signed Rank Test

• Similar to the Mann-Whitney U Test

• Allows you to compare paired differences

• Given n pairs of observations from populations A and B, compute the paired differences (xA-xB) for each pair of values

• Rank the absolute values of the differences all together (smallest first), then separate the ranks belonging to the positive differences from those belonging to the negative differences

• Compute the sums T+ and T- of these rankings

• For a one tailed test, use T- (when testing whether A is shifted above B; use T+ for the opposite direction) and for a two tailed test, use the smaller of T+ or T-

• Reject Ho if T < To (critical value) obtained from the Wilcoxon Table, given n and α
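A minimal Python sketch of the rank sums T+ and T- follows. It uses the standard form of the test, in which the absolute differences are ranked together and zero differences are dropped. The function names and the data are hypothetical.

```python
def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the average rank."""
    s = sorted(values)
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]


def wilcoxon_signed_rank(a, b):
    """Return (T_plus, T_minus, n) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]    # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])     # rank |xA - xB| jointly
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus, len(diffs)


# Hypothetical paired measurements from populations A and B
a = [10, 12, 9, 15, 11, 14]
b = [8, 13, 9, 10, 8, 10]
t_plus, t_minus, n = wilcoxon_signed_rank(a, b)
print(t_plus, t_minus, n)
```

T+ and T- always add up to n(n+1)/2, which is a quick way to check the ranking.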

Page 80: Introduction to Clinical Biostatistics for Medical Students

Other Nonparametric Tests

• Kruskal-Wallis H Test
– Just as the Mann-Whitney U Test is the nonparametric alternative to the Student’s T-Test for comparing population means, the Kruskal-Wallis H Test is the nonparametric alternative to ANOVA for a completely randomized design and is used to detect differences in location among more than 2 population distributions based on independent random sampling

• Friedman Fr Test for Randomized Block Designs
– Is a nonparametric test for comparing the distributions of measurements for k treatments laid out in b blocks using a randomized block design

Page 81: Introduction to Clinical Biostatistics for Medical Students

Test of Association

• Spearman’s Rank Correlation Test
– Tests whether there is an association between 2 populations
– Assume n pairs (xi, yi) of observations from 2 populations X, Y
– Rank the xi and the yi separately, each in ascending order
– Compute, using the ranks in place of the raw values:
• Rs = Sxy / sqrt (Sxx Syy)
– Then given n and α, look up Ro in the Spearman Table
– Reject Ho (no association) if Rs >= Ro or Rs <= -Ro
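The Rs formula above can be sketched in Python as shown below. The sketch assumes Sxy, Sxx and Syy are the usual sums of cross-products and squares of deviations, computed on the ranks. The function names and the data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ascending ranks starting at 1; tied values share the average rank."""
    s = sorted(values)
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]


def spearman_rs(x, y):
    """Rs = Sxy / sqrt(Sxx * Syy), computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
rs = spearman_rs(x, y)
print(rs)
```

The result is then compared with the tabulated Ro for the given n and significance level.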

Page 82: Introduction to Clinical Biostatistics for Medical Students

Survival Analysis

Page 83: Introduction to Clinical Biostatistics for Medical Students

Introduction

• There are many clinical studies that address the question of time to an event

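• For example, we often want to know, given a patient’s risk factors, what that patient’s chance of an MI is (i.e. time to MI)

• This type of data is called time-to-event data; it usually includes censored observations, i.e. subjects for whom the event had not yet occurred when observation ended

• Survival Analysis seeks to study this type of question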

Page 84: Introduction to Clinical Biostatistics for Medical Students

Life Tables

• The most straightforward way to describe survival in a sample is to compute a data structure known as a Life Table

• The entire observation period of the study is divided into intervals of specified length

• For each interval, the number of subjects who survived or died within that interval is determined and plotted

• Based on these numbers, we can compute several types of statistics:
– Number of cases at risk
– Proportion Failing or Proportion Surviving
– Probability Density or Hazard Rate
– Median Survival Time
– Required Sample Sizes
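As a rough sketch of how the first few life-table quantities are obtained, the Python example below uses the common actuarial convention in which subjects censored during an interval count as half at risk. That convention, the function name and the numbers are assumptions and are not taken from the slides.

```python
def life_table(n_start, intervals):
    """Build a simple life table.

    n_start   -- number of subjects entering the first interval
    intervals -- list of (deaths, censored) counts, one pair per interval
    Returns one row per interval:
    (entering, at_risk, proportion_failing, cumulative_surviving_at_end)
    """
    rows = []
    entering = n_start
    cum_surv = 1.0
    for deaths, censored in intervals:
        at_risk = entering - censored / 2    # actuarial adjustment (assumed)
        p_fail = deaths / at_risk            # proportion failing
        cum_surv *= 1 - p_fail               # proportion surviving so far
        rows.append((entering, at_risk, p_fail, cum_surv))
        entering -= deaths + censored        # subjects entering next interval
    return rows


# Hypothetical study: 100 subjects followed over three intervals
rows = life_table(100, [(10, 0), (9, 0), (9, 18)])
for row in rows:
    print(row)
```

The median survival time can then be read off as the point where the cumulative proportion surviving drops to 0.5.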

Page 85: Introduction to Clinical Biostatistics for Medical Students

Survival Analysis

• Although life tables give us a good estimate of the risk of adverse events, it is desirable to understand the underlying survival function algorithmically for prediction purposes

• The three “distributions” proposed for this are the:
– Exponential (linear exponential) distribution
– Weibull Distribution
– Gompertz Distribution

• The parameter estimation procedure is then a modified version of the least-squares model

• And the statistic used to study it is an incremental Chi-Square Statistic

Page 86: Introduction to Clinical Biostatistics for Medical Students

Kaplan-Meier Product Limit Estimator

• Rather than classify the survival into a life-table, the KM estimator computes a survival function directly from continuous survival or failure times

• Imagine creating a life table with exactly one observation for each interval

• Then we avoid the effect of “grouping” observations together into interval categories

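• Then S(t) = Π_{j=1..t} [(n-j)/(n-j+1)]^δ(j), where the product runs over the observations, ordered by time, up to time t

n = total # observations

δ(j) = 1 if the j-th observation is an observed event (uncensored), 0 if it is censored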
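A minimal Python sketch of the product-limit estimator follows. It assumes the standard convention that the exponent is 1 for an observed event and 0 for a censored observation, and that events are placed before censored cases at tied times. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, observed):
    """Return [(time, S(time)), ...], one entry per observation.

    times    -- survival or censoring time of each subject
    observed -- True if the event was observed, False if censored
    """
    n = len(times)
    # Order cases by time; at tied times, events come before censored cases
    cases = sorted(zip(times, observed), key=lambda c: (c[0], not c[1]))
    surv = 1.0
    curve = []
    for j, (t, event) in enumerate(cases, start=1):
        if event:                              # exponent delta(j) = 1
            surv *= (n - j) / (n - j + 1)
        curve.append((t, surv))                # censored cases leave S unchanged
    return curve


# Hypothetical follow-up times (months); False marks a censored subject
times = [2, 3, 5, 7, 8]
observed = [True, False, True, True, False]
curve = kaplan_meier(times, observed)
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects do not lower the curve, but they do reduce the number still at risk for later events.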

Page 87: Introduction to Clinical Biostatistics for Medical Students

Comparing Survival Times

• Often we wish to compare survival times in 2 or more populations

• There are several tests available for this purpose:
– Gehan’s Generalized Wilcoxon Test
– Cox-Mantel Test
– Cox’s F-Test
– Log-Rank Test
– Peto and Peto’s Wilcoxon Test

• These are mostly nonparametric tests that generate Z-values for comparing the survival distributions

Page 88: Introduction to Clinical Biostatistics for Medical Students

Regression Models

• We also want to be able to predict survival time given some independent risk factors

• This is very common in the medical literature

• The regression model of choice is the Cox Proportional Hazards Model

• The model is written as:

h{(t), (z1, z2, ..., zm)} = h0(t)*exp(b1*z1 + ... + bm*zm)

where h(t,...) denotes the resultant hazard, given the values of the m covariates for the respective case (z1, z2, ..., zm) and the respective survival time (t). The term h0(t) is called the baseline hazard; it is the hazard for the respective individual when all independent variable values are equal to zero. We can linearize this model by dividing both sides of the equation by h0(t) and then taking the natural logarithm of both sides:

log[h{(t), (z...)}/h0(t)] = b1*z1 + ... + bm*zm

We now have a fairly "simple" linear model that can be readily estimated.
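To make the linearized form concrete, the short Python sketch below evaluates h(t, z)/h0(t) = exp(b1*z1 + ... + bm*zm) for two patients and takes their ratio. The coefficients and covariates are hypothetical values chosen for illustration and are not estimates from any study.

```python
from math import exp, log


def relative_hazard(coefs, covariates):
    """h(t, z) / h0(t) = exp(b1*z1 + ... + bm*zm)."""
    return exp(sum(b * z for b, z in zip(coefs, covariates)))


# Hypothetical coefficients for (smoker: 0 or 1, age in years)
coefs = [0.7, 0.03]
smoker = [1, 60]
non_smoker = [0, 60]

rh_smoker = relative_hazard(coefs, smoker)
rh_non_smoker = relative_hazard(coefs, non_smoker)

# The baseline hazard h0(t) cancels in the ratio, so the hazard ratio
# between the two patients does not depend on t
hazard_ratio = rh_smoker / rh_non_smoker
print(log(rh_smoker), hazard_ratio)
```

Because the two patients differ only in smoking status, the hazard ratio reduces to exp(0.7), which is why exp(b) is read as the hazard ratio for a one-unit change in a covariate.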

Page 89: Introduction to Clinical Biostatistics for Medical Students

Useful Links

• http://hesweb1.med.virginia.edu/biostat/teaching/handouts.html
• http://stat.tamu.edu/stat30x/notes/trydouble2.html
• http://www.statsoft.com/textbook/stathome.html
• http://davidmlane.com/hyperstat/index.html
• http://members.aol.com/johnp71/javastat.html
• http://www.helsinki.fi/~jpuranen/links.html
• http://ubmail.ubalt.edu/~harsham/statistics/REFSTAT.HTM#rgenRes
• http://trochim.human.cornell.edu/kb/index.htm

Page 90: Introduction to Clinical Biostatistics for Medical Students

Questions

Thank You