EIPB 698C Lecture 7 Raul Cruz-Cano Summer 2012

Feb 23, 2016

Page 1: EIPB  698C  Lecture  7

EIPB 698C Lecture 7

Raul Cruz-Cano, Summer 2012

Page 2: EIPB  698C  Lecture  7

Statistical analysis procedures

• Proc univariate
• Proc ttest
• Proc corr

Page 3: EIPB  698C  Lecture  7

Proc Univariate

• The UNIVARIATE procedure provides data summarization on the distribution of numeric variables.

PROC UNIVARIATE <option(s)>;
   VAR variable-1 ... variable-n;
RUN;

Options:
PLOTS: create low-resolution stem-and-leaf, box, and normal probability plots
NORMAL: request tests for normality

Page 4: EIPB  698C  Lecture  7

* Read the blood data from a text file;
data blood;
   INFILE 'C:\teaching\SAS09\lecture9\blood.txt';
   INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

* Summarize the distribution of cholesterol;
proc univariate data=blood;
   var cholesterol;
run;
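The PLOTS and NORMAL options listed on the previous slide can be added to the same call. A minimal sketch, assuming the blood data set has been created by the DATA step above:

```sas
/* Same analysis, additionally requesting normality tests
   and the low-resolution stem-and-leaf, box, and normal probability plots */
proc univariate data=blood normal plots;
   var cholesterol;
run;
```

With NORMAL, the listing should gain a Tests for Normality table in addition to the output sections discussed on the following slides.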

Page 5: EIPB  698C  Lecture  7

OUTPUT (1)

The UNIVARIATE Procedure
Variable: cholesterol

Moments

N                        795    Sum Weights                795
Mean               201.43522    Sum Observations        160141
Std Deviation     49.8867157    Variance             2488.6844
Skewness          -0.0014449    Kurtosis            -0.0706044
Uncorrected SS      34234053    Corrected SS        1976015.41
Coeff Variation   24.7656371    Std Error Mean      1.76929947

Page 6: EIPB  698C  Lecture  7


Moments - Moments are statistical summaries of a distribution.

N - This is the number of valid observations for the variable. The total number of observations is the sum of N and the number of missing values.

Sum Weights - A numeric variable can be specified as a weight variable to weight the values of the analysis variable. By default, every observation has a weight of 1. This field is the sum of the values of the weight variable, so without a weight variable it equals N.

Page 7: EIPB  698C  Lecture  7


Sum Observations - This is the sum of the observation values. If a weight variable is specified, this field is the weighted sum. The mean of the variable is the sum of observations divided by the sum of weights.

Page 8: EIPB  698C  Lecture  7


Std Deviation - Standard deviation is the square root of the variance. It measures the spread of a set of observations. The larger the standard deviation is, the more spread out the observations are.

Page 9: EIPB  698C  Lecture  7


Variance - The variance is a measure of variability. It is the sum of the squared distances of the data values from the mean, divided by N-1. We don't generally use variance as an index of spread because it is in squared units. Instead, we use the standard deviation.

Page 10: EIPB  698C  Lecture  7


Skewness - Skewness measures the degree and direction of asymmetry. A symmetric distribution such as a normal distribution has a skewness of 0. A distribution that is skewed to the left (a long left tail, typically with the mean less than the median) has negative skewness; one that is skewed to the right has positive skewness.

Page 11: EIPB  698C  Lecture  7


(1) Kurtosis - Kurtosis is a measure of the heaviness of the tails of a distribution. In SAS, a normal distribution has kurtosis 0.
(2) Extremely nonnormal distributions may have high positive or negative kurtosis values, while nearly normal distributions will have kurtosis values close to 0.
(3) Kurtosis is positive if the tails are "heavier" than for a normal distribution and negative if the tails are "lighter" than for a normal distribution.
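For reference, the sample skewness and kurtosis reported in the Moments table are usually given by the following bias-adjusted estimators. This is a sketch that assumes the default divisor (VARDEF=DF), with $s$ the sample standard deviation and $n$ the number of valid observations; the symbols $g_1$ and $g_2$ are introduced here only for illustration.

```latex
\[
g_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}
\]
\[
g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{4}
      - \frac{3(n-1)^{2}}{(n-2)(n-3)}
\]
```

The subtracted term in the second formula is what makes the kurtosis of a normal distribution come out near 0 rather than near 3.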

Page 12: EIPB  698C  Lecture  7


Uncorrected SS - This is the sum of squared data values.

Corrected SS - This is the sum of the squared distances of the data values from the mean. This number divided by the number of observations minus one gives the variance.

Page 13: EIPB  698C  Lecture  7


(1) Coeff Variation - The coefficient of variation is another way of measuring variability.
(2) It is a unitless measure.
(3) It is defined as the ratio of the standard deviation to the mean and is generally expressed as a percentage.
(4) It is useful for comparing variation between different variables.

Page 14: EIPB  698C  Lecture  7


(1) Std Error Mean - This is the estimated standard deviation of the sample mean.
(2) It is estimated as the standard deviation of the sample divided by the square root of the sample size.
(3) This provides a measure of the variability of the sample mean.
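The definitions above can be checked against the Moments table. The following DATA step is a minimal sketch that recomputes the derived entries from three printed values (N, Sum Observations, and Corrected SS); the variable names are chosen here for illustration only.

```sas
/* Recompute derived Moments entries from values printed in the output */
data _null_;
   n        = 795;                   /* N                */
   sum_obs  = 160141;                /* Sum Observations */
   css      = 1976015.41;            /* Corrected SS     */

   mean     = sum_obs / n;           /* sum of observations / sum of weights */
   variance = css / (n - 1);         /* corrected SS / (N - 1)               */
   std_dev  = sqrt(variance);        /* square root of the variance          */
   cv       = 100 * std_dev / mean;  /* coefficient of variation, in percent */
   std_err  = std_dev / sqrt(n);     /* standard error of the mean           */

   put mean= variance= std_dev= cv= std_err=;   /* written to the SAS log */
run;
```

The values written to the log should agree with the Mean, Variance, Std Deviation, Coeff Variation, and Std Error Mean entries, up to rounding of the inputs.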

Page 15: EIPB  698C  Lecture  7

OUTPUT (2)

Basic Statistical Measures

      Location                     Variability

Mean      201.4352     Std Deviation            49.88672
Median    202.0000     Variance                     2489
Mode      208.0000     Range                   314.00000
                       Interquartile Range      71.00000

NOTE: The mode displayed is the smallest of 2 modes with a count of 12.

Page 16: EIPB  698C  Lecture  7


Median - The median is a measure of central tendency. It is the middle number when the values are arranged in ascending (or descending) order. It is less sensitive than the mean to extreme observations.

Mode - The mode is another measure of central tendency. It is the value that occurs most frequently in the variable.

Page 17: EIPB  698C  Lecture  7


Range - The range is a measure of the spread of a variable. It is equal to the difference between the largest and the smallest observations. It is easy to compute and easy to understand.

Interquartile Range - The interquartile range is the difference between the upper (75% Q) and the lower quartiles (25% Q). It measures the spread of a data set. It is robust to extreme observations.
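Both values can be checked against the Quantiles table shown in OUTPUT (4), which lists a maximum of 331, a minimum of 17, Q3 = 236, and Q1 = 165:

```latex
\[
\text{Range} = 331 - 17 = 314, \qquad
\text{IQR} = Q_3 - Q_1 = 236 - 165 = 71
\]
```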

Page 18: EIPB  698C  Lecture  7

OUTPUT (3)

Tests for Location: Mu0=0

Test            -Statistic-     -----p Value------

Student's t     t   113.8503    Pr > |t|     <.0001
Sign            M      397.5    Pr >= |M|    <.0001
Signed Rank     S     158205    Pr >= |S|    <.0001

Page 19: EIPB  698C  Lecture  7


(1) Student's t - The Student t-test is used to test the null hypothesis that the population mean equals Mu0. The default value in SAS for Mu0 is 0.
(2) The t-statistic is defined to be the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean.
(3) The p-value is the two-tailed probability computed using a t distribution. If the p-value associated with the t-test is small (usually set at p < 0.05), there is evidence to reject the null hypothesis in favor of the alternative. In other words, the mean is statistically significantly different from the hypothesized value.
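With the default Mu0=0, the statistic in the table is just the mean divided by its standard error, 201.43522 / 1.76929947 ≈ 113.85, which is why the test is so strongly significant here. Testing against a more meaningful value is usually of greater interest. A minimal sketch using the MU0= option of PROC UNIVARIATE, assuming the blood data set from the earlier slide and a hypothesized mean of 200 chosen for illustration:

```sas
/* Test H0: mean cholesterol = 200 instead of the default Mu0=0 */
proc univariate data=blood mu0=200;
   var cholesterol;
run;
```

The Tests for Location table should then be headed Mu0=200, and the sign and signed rank tests are carried out against the same value.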

Page 20: EIPB  698C  Lecture  7


(1) Sign - The sign test is a simple nonparametric procedure to test the null hypothesis regarding the population median.
(2) It is used when we have a small sample from a nonnormal distribution.
(3) The statistic M is defined to be M = (N+ - N-)/2, where N+ is the number of values that are greater than Mu0 and N- is the number of values that are less than Mu0. Values equal to Mu0 are discarded.
(4) Under the hypothesis that the population median is equal to Mu0, the sign test calculates the p-value for M using a binomial distribution.
(5) The interpretation of the p-value is the same as for the t-test. In our example the M-statistic is 397.5 and the p-value is less than 0.0001. We conclude that the median of the variable is significantly different from zero.
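The value of M in the output follows directly from this definition. The smallest cholesterol value in the data is 17, so with Mu0 = 0 all 795 valid observations lie above Mu0 and none lie below it:

```latex
\[
M = \frac{N^{+} - N^{-}}{2} = \frac{795 - 0}{2} = 397.5
\]
```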

Page 21: EIPB  698C  Lecture  7


(1) Signed Rank - The signed rank test is also known as the Wilcoxon test. It is used to test the null hypothesis that the population median equals Mu0.
(2) It assumes that the distribution of the population is symmetric.
(3) The Wilcoxon signed rank test statistic is computed from the ranks of the absolute deviations from Mu0, using the sum of the ranks that belong to observations above Mu0.
(4) The interpretation of the p-value is the same as for the t-test. In our example, the S-statistic is 158205 and the p-value is less than 0.0001. We therefore conclude that the median of the variable is significantly different from zero.
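The printed S can be reproduced by hand, assuming the centered form of the statistic, in which $r_i^{+}$ is the rank of $|x_i - \mu_0|$ among the $n_t$ values not equal to $\mu_0$ and the sum runs over observations above $\mu_0$. Because every cholesterol value is positive, all 795 ranks enter the sum:

```latex
\[
S = \sum_{x_i > \mu_0} r_i^{+} - \frac{n_t (n_t + 1)}{4}
  = \frac{795 \cdot 796}{2} - \frac{795 \cdot 796}{4}
  = 316410 - 158205 = 158205
\]
```

This is the largest value S can take for 795 observations, which matches the very small p-value.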

Page 22: EIPB  698C  Lecture  7

OUTPUT (4)

Quantiles (Definition 5)

Quantile        Estimate

100% Max             331
99%                  318
95%                  282
90%                  267
75% Q3               236
50% Median           202
25% Q1               165
10%                  138
5%                   123
1%                    94
0% Min                17

Page 23: EIPB  698C  Lecture  7


95% - Ninety-five percent of all values of the variable are equal to or less than this value.

Page 24: EIPB  698C  Lecture  7

OUTPUT (5)

Extreme Observations

----Lowest----        ----Highest---
Value      Obs        Value      Obs

   17      829          323      828
   36      492          328      203
   56      133          328      375
   65      841          328      541
   69       79          331      191

Missing Values

                        -----Percent Of-----
Missing                               Missing
  Value       Count      All Obs          Obs

      .         205        20.50       100.00

Extreme Observations - This is a list of the five lowest and five highest values of the variable.

Page 25: EIPB  698C  Lecture  7


Student's t-test

• Independent One-Sample t-test
• This equation is used to compare one sample mean to a specific value μ0.

\[ t = \frac{\bar{X} - \mu_0}{s / \sqrt{N}} \]

• Where s is the standard deviation of the sample and N is the sample size. The degrees of freedom used in this test are N-1.

Page 26: EIPB  698C  Lecture  7


Student's t-test

• Dependent t-test is used when the samples are dependent; that is, when there is only one sample that has been tested twice (repeated measures) or when there are two samples that have been matched or "paired".

\[ t = \frac{\bar{X}_D - \mu_0}{s_D / \sqrt{N}} \]

• For this equation, the differences between all pairs must be calculated. The pairs are either one person's pretest and posttest scores or one person in a group matched to another person in another group. The average (X̄_D) and standard deviation (s_D) of those differences are used in the equation. The constant μ0 is non-zero if you want to test whether the average of the differences is significantly different from μ0. The degrees of freedom used are N-1, where N is the number of pairs.

Page 27: EIPB  698C  Lecture  7

PROC TTEST

The following statements are available in PROC TTEST.

PROC TTEST <options>;
   CLASS variable;
   PAIRED variables;
   BY variables;
   VAR variables;

CLASS: A CLASS statement giving the name of the classification (or grouping) variable must accompany the PROC TTEST statement in the two-independent-sample case (two-sample t test). The class variable must have two, and only two, levels.
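A two-sample call might look like the following. This is a sketch that assumes the gender variable in the blood data set takes exactly two non-missing values; the slides do not show its levels.

```sas
/* Two-sample t test: compare mean cholesterol between the two gender groups */
proc ttest data=blood;
   class gender;        /* grouping variable, must have exactly two levels */
   var cholesterol;     /* analysis variable */
run;
```

If the CLASS variable turns out to have more or fewer than two levels, PROC TTEST cannot form the comparison and reports an error instead.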

Page 28: EIPB  698C  Lecture  7

Paired Statements

• PAIRED: the PAIRED statement identifies the variables to be compared in a paired t test.
1. You can use one or more variables in the PairLists.
2. Variables or lists of variables are separated by an asterisk (*) or a colon (:).
3. The asterisk (*) requests comparisons between each variable on the left with each variable on the right.
4. Use the PAIRED statement only for paired comparisons.
5. The CLASS and VAR statements cannot be used with the PAIRED statement.
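The difference between the two separators is easiest to see with variable lists. The sketch below uses a small made-up data set (the data set name scores, the variables pre1, pre2, post1, post2, and all values are hypothetical and serve only to make the example runnable). It relies on the colon matching the lists position by position, which the slide does not spell out.

```sas
/* Hypothetical data: two pre-treatment and two post-treatment scores */
data scores;
   input pre1 pre2 post1 post2;
   datalines;
10 12 11 14
 9 11 12 12
13 10 12 13
11 13 15 12
12 12 13 15
;
run;

/* Asterisk: every variable on the left with every variable on the right,
   i.e. pre1-post1, pre1-post2, pre2-post1, pre2-post2 */
proc ttest data=scores;
   paired (pre1 pre2)*(post1 post2);
run;

/* Colon: variables matched by position, i.e. pre1-post1 and pre2-post2 */
proc ttest data=scores;
   paired (pre1 pre2):(post1 post2);
run;
```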

Page 29: EIPB  698C  Lecture  7

PROC TTEST

OPTIONS:

ALPHA=p
   specifies that confidence intervals are to be 100(1-p)% confidence intervals, where 0<p<1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or 1 or more, an error message is printed.

H0=m
   requests tests against m instead of 0 in all three situations (one-sample, two-sample, and paired observation t tests). By default, PROC TTEST uses H0=0.

DATA=SAS-data-set
   names the SAS data set for the procedure to use.

Page 30: EIPB  698C  Lecture  7

* One-sample t test;
proc ttest data=blood H0=200;
   var cholesterol;
run;
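The ALPHA= option from the previous slide can be combined with H0= in the same statement. A minimal sketch, with the 0.01 level chosen only for illustration:

```sas
/* One-sample t test of H0: mean = 200, reporting 99% confidence limits */
proc ttest data=blood h0=200 alpha=0.01;
   var cholesterol;
run;
```

The t value and p-value do not depend on ALPHA=; only the confidence limits change, and they should be labeled 99% CL rather than 95% CL.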

Page 31: EIPB  698C  Lecture  7

One sample t test Output

The TTEST Procedure

Variable: cholesterol

  N       Mean    Std Dev    Std Err    Minimum    Maximum
795      201.4    49.8867     1.7693    17.0000      331.0

 Mean       95% CL Mean       Std Dev      95% CL Std Dev
201.4     198.0     204.9     49.8867     47.5493    52.4676

 DF    t Value    Pr > |t|
794       0.81      0.4175

Page 32: EIPB  698C  Lecture  7


95%CL Mean is 95% confidence interval for the mean.

95%CL Std Dev is 95% confidence interval for the standard deviation.

Page 33: EIPB  698C  Lecture  7


DF - The degrees of freedom for the t-test is simply the number of valid observations minus 1. We lose one degree of freedom because we have estimated the mean from the sample. We have used some of the information from the data to estimate the mean; therefore, it is not available to use for the test, and the degrees of freedom account for this.

t Value - This is the t-statistic. It is the ratio of the difference between the sample mean and the given number (H0) to the standard error of the mean.

Pr > |t| - This is the p-value: the probability of observing a greater absolute value of t under the null hypothesis.
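These three quantities, and the 95% confidence limits for the mean, can be recomputed from the summary statistics. The DATA step below is a sketch that takes the mean and standard error from the PROC UNIVARIATE output shown earlier (which prints more digits) and uses the PROBT and TINV functions; the variable names are illustrative.

```sas
/* Recompute DF, t, the two-sided p-value, and the 95% CL for the mean */
data _null_;
   n       = 795;
   mean    = 201.43522;
   std_err = 1.76929947;
   h0      = 200;

   df    = n - 1;
   t     = (mean - h0) / std_err;            /* t statistic                 */
   p     = 2 * (1 - probt(abs(t), df));      /* two-sided p-value           */
   tcrit = tinv(0.975, df);                  /* critical value for 95% CL   */
   lower = mean - tcrit * std_err;
   upper = mean + tcrit * std_err;

   put df= t= p= lower= upper=;              /* written to the SAS log */
run;
```

The log values should match the DF, t Value, Pr > |t|, and 95% CL Mean entries in the output after rounding.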

Page 34: EIPB  698C  Lecture  7

title 'Paired Comparison';

* Systolic blood pressure before and after, two values per subject;
data pressure;
   input SBPbefore SBPafter @@;
   diff_BP = SBPafter - SBPbefore;
   datalines;
120 128 124 131 130 131 118 127 140 132 128 125
140 141 135 137 126 118 130 132 126 129 127 135
;
run;

proc ttest data=pressure;
   paired SBPbefore*SBPafter;
run;
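The DATA step also stores diff_BP, which the PAIRED call does not use. Since a paired t test is a one-sample t test on the differences, the same analysis can be sketched with a VAR statement, assuming the pressure data set above:

```sas
/* One-sample t test on the stored difference (after - before) */
proc ttest data=pressure h0=0;
   var diff_BP;
run;
```

Note the direction: diff_BP is SBPafter - SBPbefore, while PAIRED SBPbefore*SBPafter analyzes SBPbefore - SBPafter. The mean and t value should therefore have the opposite sign to the paired output, with the same standard error, degrees of freedom, and p-value.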

Page 35: EIPB  698C  Lecture  7

Paired t test Output

The TTEST Procedure

Difference: SBPbefore - SBPafter

 N       Mean    Std Dev    Std Err    Minimum    Maximum
12    -1.8333     5.8284     1.6825    -9.0000     8.0000

   Mean        95% CL Mean        Std Dev      95% CL Std Dev
-1.8333    -5.5365     1.8698      5.8284      4.1288    9.8958

DF    t Value    Pr > |t|
11      -1.09      0.2992

Page 36: EIPB  698C  Lecture  7


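Mean (-1.8333) - This is the mean of the differences.

t Value (-1.09) - This is the t statistic for testing whether the mean of the differences is 0.

Pr > |t| (0.2992) - With p ≈ 0.30, we cannot reject the hypothesis that the mean of the differences is 0.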

Page 37: EIPB  698C  Lecture  7

Proc corr

The CORR procedure is a statistical procedure for numeric random variables that computes correlation statistics. The default correlation analysis includes descriptive statistics, Pearson correlation statistics, and probabilities for each analysis variable.

PROC CORR options;
   VAR variables;
   WITH variables;
   BY variables;

proc corr data=blood;
   var RBC WBC cholesterol;
run;
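The WITH statement in the syntax above is not used in the example. A minimal sketch of how it might be applied to the same data, when only the correlations of cholesterol with the two blood counts are of interest:

```sas
/* Correlate cholesterol with each blood count instead of
   producing the full 3 x 3 matrix */
proc corr data=blood;
   var RBC WBC;          /* columns of the correlation table */
   with cholesterol;     /* rows of the correlation table    */
run;
```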

Page 38: EIPB  698C  Lecture  7

Proc Corr Output

Simple Statistics

Variable         N         Mean     Std Dev        Sum      Minimum      Maximum

RBC            908         7043        1003    6395020         4070        10550
WBC            916      5.48353     0.98412       5023      1.71000      8.75000
cholesterol    795    201.43522    49.88672     160141     17.00000    331.00000

N - This is the number of valid (i.e., non-missing) cases used in the correlation. By default, proc corr uses pairwise deletion for missing observations, meaning that a pair of observations (one from each variable in the pair being correlated) is included if both values are non-missing. If you use the nomiss option on the proc corr statement, proc corr uses listwise deletion and omits all observations with missing data on any of the named variables.
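The listwise alternative described above only requires adding the NOMISS option to the earlier call:

```sas
/* Listwise deletion: drop any observation with a missing value
   in RBC, WBC, or cholesterol before computing correlations */
proc corr data=blood nomiss;
   var RBC WBC cholesterol;
run;
```

With NOMISS, every correlation is based on the same set of observations, so the N values no longer differ from pair to pair.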

Page 39: EIPB  698C  Lecture  7

Proc Corr

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

                    RBC          WBC    cholesterol

RBC             1.00000     -0.00203        0.06583
                              0.9534         0.0765
                    908          833            725

WBC            -0.00203      1.00000        0.02496
                 0.9534                      0.5014
                    833          916            728

cholesterol     0.06583      0.02496        1.00000
                 0.0765       0.5014
                    725          728            795

Pearson Correlation Coefficients - measure the strength and direction of the linear relationship between the two variables.

The correlation coefficient can range from -1 to +1, with -1 indicating a perfect negative correlation, +1 indicating a perfect positive correlation, and 0 indicating no correlation at all.

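P-value - The second entry in each cell (Prob > |r|) is the p-value for testing the null hypothesis H0: Rho=0.

Number of observations - The third entry in each cell is the number of observations with both values non-missing, which is the N used for that correlation.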
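The p-value for a Pearson correlation is commonly obtained from a t statistic with N-2 degrees of freedom, t = r·sqrt(N-2)/sqrt(1-r²). The sketch below applies that standard formula to the RBC and cholesterol entry (r = 0.06583, N = 725); it is offered as a check on the table, not as a description of the internal computation.

```sas
/* Two-sided p-value for H0: Rho=0 from r and the number of pairs */
data _null_;
   r  = 0.06583;
   n  = 725;
   df = n - 2;
   t  = r * sqrt(df) / sqrt(1 - r**2);   /* t statistic for the correlation */
   p  = 2 * (1 - probt(abs(t), df));     /* two-sided p-value               */
   put t= p=;                            /* written to the SAS log */
run;
```

The resulting p should be close to the 0.0765 printed in the table, with small differences due to rounding of r.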