STATISTICAL ANALYSIS. Your introduction to statistics should not be like drinking water from a fire hose!!


What do you mean by data?

Nature of the Data

• Two main types: categorical or continuous

1. Categorical:

• Nominal (unordered, unequal categories)

• E.g.: Female=1 and male=2

• Ordinal (ordered, unequal or ranked categories)

• E.g.: 1=SD 2=D 3=N 4=A 5=SA

2. Continuous:

• Interval (ordered, equal intervals, no absolute zero)

• E.g.: 5-point Likert scale with equal intervals or IQ score

• Ratio (ordered, equal intervals with absolute zero)

• E.g.: raw scores; class attendance (in days); age (in years)

Descriptive statistics: Procedures used for summarizing the data in both numerical and graphic form. These include frequencies, distributions, percentages, cumulative percentages, pie charts, bar graphs, histograms, and scatter plots.

(Cross-tabulations summarize the relationship between two variables, like a scatter plot but in table form.)

Measures of central tendency:

• Mean: arithmetic average (interval & ratio data only)

• Mode: most frequent; can be bimodal or multimodal (all types)

• Median: midpoint, with half of the values above and half below (ordinal, interval, and ratio data)
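The three measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the attendance figures are hypothetical values made up for illustration, not data from these slides.

```python
import statistics

# Hypothetical ratio-scale data: class attendance (in days) for nine students.
attendance_days = [12, 15, 15, 17, 18, 18, 20, 21, 26]

mean_days = statistics.mean(attendance_days)      # arithmetic average
median_days = statistics.median(attendance_days)  # middle value of the sorted data
modes = statistics.multimode(attendance_days)     # every most-frequent value

print("mean:", mean_days)
print("median:", median_days)
print("mode(s):", modes)  # two values tie for most frequent, so the data are bimodal
```

`multimode` is used rather than `mode` so that bimodal or multimodal data report all of their modes.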

Statistics 101!!

• Statistics

– Measures of location: mean vs. median and why

– Measures of scale: range, interquartile range, standard deviation (and variance)

– Measures of position: percentiles, deciles, quartiles, median

Note. For categorical variables, we use proportions as the descriptive statistics.
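The measures of scale and position listed above can be sketched with the standard `statistics` module. The scores are hypothetical, and the choice of `method="inclusive"` for the quartiles is an assumption; other quartile conventions give slightly different cut points.

```python
import statistics

# Hypothetical raw scores (ratio data), already sorted for readability.
scores = [2, 4, 4, 5, 7, 9, 10, 12, 13, 14]

# Measures of scale
data_range = max(scores) - min(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")  # quartiles
iqr = q3 - q1                        # interquartile range
sd = statistics.stdev(scores)        # sample standard deviation
var = statistics.variance(scores)    # sample variance (sd squared)

# Measures of position: the second quartile is the median.
median = statistics.median(scores)

print("range:", data_range)
print("quartiles:", q1, q2, q3, "IQR:", iqr)
print("standard deviation:", round(sd, 2), "variance:", round(var, 2))
```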

Why does lack of normality cause problems?

When we calculate the p-value for an inference test, we find the probability that the sample was different due to sampling variability. Basically, we are trying to see if a recorded value occurred by chance and chance alone. When we look for a p-value, we are assuming that the means of all samples of the given sample size are normally distributed around the population mean. This is why the test statistic, which is the number of standard deviations (of that sampling distribution) the sample mean lies away from the population mean, can be used. Therefore, without normality, these tests cannot give a valid p-value.

There are non-parametric tests which are similar to the parametric tests. The following table shows how some of the tests match up.

Parametric Test | Goal for Parametric Test | Non-Parametric Test | Goal for Non-Parametric Test
Two-Sample T-Test | To see if two samples have identical population means | Wilcoxon Rank-Sum Test | To see if two samples have identical population medians
One-Sample T-Test | To test a hypothesis about the mean of the population a sample was taken from | Wilcoxon Signed Ranks Test | To test a hypothesis about the median of the population a sample was taken from
Chi-Squared Test for Goodness of Fit | To see if a sample fits a theoretical distribution, such as the normal curve | Kolmogorov-Smirnov Test | To see if a sample could have come from a certain distribution
ANOVA | To see if two or more sample means are significantly different | Kruskal-Wallis Test | To test if two or more sample medians are significantly different

What is different about Non-Parametric Statistics?

Sometimes statisticians use what is called “ordinal” data. This data is obtained by taking the raw data and giving each observation a rank. These ranks are then used to create test statistics.
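Turning raw data into ranks can be sketched as follows. The helper name `average_ranks` and the raw values are hypothetical. Giving tied values the average of the positions they occupy is one common convention, assumed here; the slides do not say how ties are handled.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

raw = [3.1, 0.4, 7.8, 3.1, 5.0]   # hypothetical raw measurements
ranks = average_ranks(raw)
print(ranks)  # the two 3.1 values share rank 2.5
```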

In non-parametric statistics, one deals with the median rather than the mean. Since a mean can be easily influenced by outliers or skewness, and we are not assuming normality, a mean no longer makes sense. The median is another measure of location, which makes more sense in a non-parametric test. The median is considered the center of a distribution.
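How strongly a single outlier pulls the mean, while barely moving the median, can be seen in a small sketch. The income figures are hypothetical and chosen only to make the effect visible.

```python
import statistics

incomes = [30, 32, 35, 36, 40]      # hypothetical incomes, in thousands
with_outlier = incomes + [400]      # the same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(incomes)
median_shift = statistics.median(with_outlier) - statistics.median(incomes)

print("mean moves by:", round(mean_shift, 1))
print("median moves by:", round(median_shift, 1))
```

The mean shifts by tens of thousands, while the median shifts by less than one thousand.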

Drawing a histogram: the good, the bad, and the downright ugly!!

Many modern introductory texts confuse frequency graphs, relative frequency graphs, and histograms.

[Figure panels: Good, Bad]

What's the difference between a bar chart & a Histogram??

Critical Values

For a given number of degrees of freedom, by the property of the t-distribution, we know how large the t-statistic must be in order to reject the null.

We call that number the “critical value” of the t-statistic; it is typically looked up in a table of the t-distribution.

If the value of the t-statistic calculated from the data is greater than this critical value, then we “reject the null hypothesis.” This is because, for t-statistics greater than this critical value, our probability of falsely rejecting the null hypothesis is very small.

Example

Suppose our null hypothesis is that X is less than 0. The sample mean is 3; the sample standard deviation is 2; there are 121 observations.

Step 1. We need to establish our “critical value.” We wish to reject the null hypothesis if we are 95% certain that it is false. For 121 observations and a “one-tailed test,” the critical value is 1.66 (which we look up in the table; this corresponds to a significance level of .05 with 120 degrees of freedom).

Step 2. Compute the t-statistic:

$$t = \frac{3 - 0}{2 / \sqrt{121}} = \frac{3}{0.1818} = 16.5$$

Step 3. Compare the t-statistic with the critical value. If the t-statistic is greater than the critical value, then you can reject the null hypothesis. In this case, 16.5 is greater than 1.66, so we can reject the null hypothesis that X is less than zero.
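The three steps of this example can be reproduced in a few lines. This is a sketch; the function name `one_sample_t` is hypothetical, and the critical value 1.66 is taken from the example rather than computed. Carrying the standard error at full precision (2/11 ≈ 0.1818) gives a t-statistic of 16.5, whereas rounding the standard error to 0.18 first gives about 16.7. The conclusion is the same either way.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t = (sample mean - hypothesized mean) / (sample SD / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Values from the example: mean 3, SD 2, 121 observations, null value 0.
t_stat = one_sample_t(sample_mean=3, hypothesized_mean=0, sample_sd=2, n=121)

critical_value = 1.66  # one-tailed, .05 level, 120 degrees of freedom (from the table)
reject_null = t_stat > critical_value

print("t =", round(t_stat, 1), "reject null:", reject_null)
```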

Example

The table below is a sample “cross-tab.”

Your research hypothesis is that dog ownership and gender are related.

How do you test this hypothesis?

       | Dog-Owners | No Pets | Totals
Men    | 100        | 400     | 500
Women  | 50         | 450     | 500
Totals | 150        | 850     | 1,000

Hypothesis Tests about tables

Step 1. Define null and research hypotheses. The null hypothesis will usually be that there is no relationship between the rows and the columns.

Step 2. Determine your tolerance for falsely rejecting the null hypothesis of no relationship.

Step 3. Empirically analyse the data to determine if there is a relationship.

Example

To test for independence:

1) Identify the number of respondents in each internal cell of the table

2) Calculate the number of respondents who would be in each cell if the rows and columns were independent (shown in parentheses under each observed count)

e.g. cell(1,1) = .5 * .15 * 1000 = 75

cell(1,2) = .5 * .85 * 1000 = 425

3) Compute the chi-squared test statistic

Observed counts, with the counts expected under independence in parentheses:

       | Dog-Owners | No Pets   | Totals
Men    | 100 (75)   | 400 (425) | 500
Women  | 50 (75)    | 450 (425) | 500
Totals | 150        | 850       | 1,000
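The expected counts under independence can be computed directly from the row and column totals. This is a minimal sketch using the dog-ownership table; the function name `expected_counts` and the dictionary layout are assumptions made for illustration.

```python
# Observed counts from the cross-tab, keyed by (row, column).
observed = {
    ("Men", "Dog-Owners"): 100, ("Men", "No Pets"): 400,
    ("Women", "Dog-Owners"): 50, ("Women", "No Pets"): 450,
}

def expected_counts(observed):
    """Expected count per cell = row total * column total / grand total."""
    row_totals, col_totals = {}, {}
    for (row, col), count in observed.items():
        row_totals[row] = row_totals.get(row, 0) + count
        col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(observed.values())
    return {
        (row, col): row_totals[row] * col_totals[col] / grand_total
        for (row, col) in observed
    }

expected = expected_counts(observed)
for cell, value in expected.items():
    print(cell, value)
```

Multiplying the totals and dividing by the grand total is the same calculation as .5 * .15 * 1000, written with counts instead of proportions.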

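The Chi-Square Test Statistic

To test for independence:

3) Compute the chi-squared test statistic

The chi-squared test statistic is simply:

$$\chi^2 = \sum_{\text{rows}} \sum_{\text{columns}} \frac{(\text{Observed}_{\text{row},\text{column}} - \text{Expected}_{\text{row},\text{column}})^2}{\text{Expected}_{\text{row},\text{column}}}$$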

The chi-squared statistic follows a chi-squared distribution with degrees of freedom = (rows – 1) (columns – 1).
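The statistic and its degrees of freedom can be sketched for a table of any size. The function name `chi_squared` and the list-of-rows layout are assumptions; the sketch also assumes every row and column total is nonzero, so that no expected count is zero.

```python
def chi_squared(observed):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total  # expected under independence
            statistic += (obs - exp) ** 2 / exp

    dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows - 1)(columns - 1)
    return statistic, dof

# The dog-ownership table: rows are Men, Women; columns are Dog-Owners, No Pets.
stat, dof = chi_squared([[100, 400], [50, 450]])
print("chi-squared =", round(stat, 1), "with", dof, "degree(s) of freedom")
```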

Example

If we look at our table of the $\chi^2$ distribution with 1 degree of freedom, the critical value for our test statistic is 3.84.

$$\chi^2 = \frac{(100-75)^2}{75} + \frac{(400-425)^2}{425} + \frac{(50-75)^2}{75} + \frac{(450-425)^2}{425} = 19.6$$

In this case, we reject the null hypothesis that the two populations are statistically independent because our test-statistic is greater than our critical value.
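The comparison with the critical value can also be expressed as a p-value. For 1 degree of freedom only, a chi-squared variable is the square of a standard normal variable, so its upper-tail probability can be written with the complementary error function. This is a sketch under that assumption; the function name `chi2_upper_tail_1df` is hypothetical, and the formula does not apply to other degrees of freedom.

```python
import math

def chi2_upper_tail_1df(x):
    """P(chi-squared with 1 df > x), using chi-squared(1) = Z**2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))

p_at_critical = chi2_upper_tail_1df(3.84)  # the critical value used in the example
p_observed = chi2_upper_tail_1df(19.6)     # the test statistic from the example

print("p at critical value:", round(p_at_critical, 3))
print("p for observed statistic:", p_observed)
```

The critical value 3.84 corresponds to a tail probability of about .05, and the observed statistic of 19.6 lies far beyond it.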

