Good Morning
Jan 18, 2017
NORMAL CURVE
Seminar-1
By Dr. M.S. Bala Vidyadhar
Contents:
Introduction
Basis for statistical analysis
Probability Distributions
Normal Distribution/Curve
History
Description
Standard Normal Variate
Variations
Normal curve interpretation
Comparisons
Normality tests
Conclusion
Previous Year Questions
References
The word statistics comes from the Italian word 'statista',
meaning statesman, or the German word 'statistik', which
means political state.
The science of statistics has existed since the time of early Egypt
through the Roman Empire, when it was used to count families.
John Graunt (1620-1674) is known as the Father of Health
Statistics.
Introduction
Statistics is the science of compiling, classifying and tabulating numerical data and expressing the results in a mathematical or graphical form.
Biostatistics is the branch of statistics concerned with mathematical facts and data related to biological events.
Introduction:
Statistical analyses are based on three primary entities:
1. (U) Population of interest,
2. (V) set of characteristics or variables of the units of
this population,
3. (P) Probability Distribution of the variables in the
given population.
Basis for Statistical Analysis
It is the most crucial link between the population and
its variables, which allows us to draw inferences on the
population based on the sample observations.
It is a way to enumerate the different values the
variable can have, and how frequently each value
appears in the population.
Probability Distributions
The three probability distributions useful in
medicine/health care are:
1. Normal distribution
2. Binomial distribution
3. Poisson distribution.
Probability distributions
Binomial distribution: useful where an event or variable has only binary
outcomes (e.g. yes/no; positive/negative).
Poisson distribution: useful where the outcome is the number of times an
event occurs in the population, hence very helpful in determining the probability of rare events/diseases.
Both these distributions apply to discrete data only.
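For illustration, both probability mass functions can be computed with the Python standard library; the clinical numbers below are hypothetical examples, not from the text:

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent binary yes/no trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate lam."""
    return exp(-lam) * lam ** k / factorial(k)

# Hypothetical example: a finding present in 10% of patients.
# Probability that exactly 2 of 10 examined patients show it:
p_binom = binomial_pmf(2, 10, 0.10)

# Hypothetical rare disease averaging 3 cases/year; P(no cases in a year):
p_pois = poisson_pmf(0, 3)

print(round(p_binom, 4))  # 0.1937
print(round(p_pois, 4))   # 0.0498
```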
When data is collected from a very large population and a frequency distribution is made with narrow class intervals, the resulting curve is smooth, symmetrical and is called a normal distribution curve.
Also called the Gaussian Distribution.
The normal distribution is continuous, so it can take on
any value.
Normal Distribution
Was first discovered by Abraham de Moivre and published in 1733.
History
Two mathematician-astronomers, Pierre-Simon Laplace (France) and Karl Friedrich Gauss (Germany), established the scientific principles of the normal distribution.
But Gauss' name was given to the distribution because he applied it to the theory of motions of heavenly bodies.
The normal distribution curve is a smooth, bell-shaped curve and is symmetric about the mean of the distribution, symbolized by the letter μ (mu).
The standard deviation is denoted by the Greek letter sigma (σ).
Sigma is the horizontal distance between the mean and the point of inflection on the curve.
Description
Normal distribution curve
The mean and the standard deviation are the two parameters that completely determine the location on the number line and the shape of the normal curve.
Thus many normal curves are possible, one for every pair of mean and standard deviation values, but the area under every probability distribution curve equals 1.
Description contd.
The curve will be approximately normal for values like height, weight, hemoglobin, PCV, BP, etc.
For all normal curves the mean, median and mode are equal and coincide on the graph.
The horizontal distance between the central point and 1 SD to both left and right of the mean marks one confidence limit.
Normal Curve
In the case of the normal curve, individual subjects are distributed symmetrically about the mean in terms of SD.
Between -1 SD and +1 SD lie 68.27% of the observations.
Between -2 SD and +2 SD lie 95.45% of the observations.
Between -3 SD and +3 SD lie 99.73% of the observations of the population.
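These areas can be verified numerically; a minimal sketch using the standard-library error function, which gives the normal cumulative probability:

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_within(k):
    """Fraction of a normal population lying within +/- k SD of the mean."""
    return std_normal_cdf(k) - std_normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```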
Confidence Limits
These limits are called the confidence limits, and the range between
the two is called the confidence interval.
Observations lying within -2 SD to +2 SD are regarded as within the
conventional (5%) level of significance.
Data lying outside this area are said to be significantly different
from the population mean value.
Such extreme values will occur only about 5 times in 100 observations.
As the normal curve is symmetrical, the coefficient of skewness is
equal to 0.
The central limit theorem states that under certain (fairly common) conditions, the sum of many random variables will have an approximately normal distribution.
More specifically, where X1, ..., Xn are independent and identically distributed random variables with the same arbitrary distribution, zero mean, and variance σ², and Z is their mean scaled by √n, i.e. Z = √n · (X1 + ... + Xn)/n, then the distribution of Z approaches the normal distribution N(0, σ²) as n grows.
Central Limit Theorem
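The theorem can be seen in a small simulation; a sketch using only the standard library, with an exponential parent distribution chosen for illustration:

```python
import random
import statistics

random.seed(0)

def sample_means(draw, n, reps=2000):
    """Means of n i.i.d. draws from `draw`, repeated `reps` times."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(reps)]

# Start from a clearly non-normal parent: exponential with mean 1 (SD 1).
means = sample_means(lambda: random.expovariate(1.0), n=50)

# By the CLT the sample means cluster around 1 with SD about
# 1/sqrt(50), roughly 0.14, in an approximately normal shape.
print(round(statistics.fmean(means), 2))
print(round(statistics.stdev(means), 2))
```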
A normal distribution with parameters μ and σ has the following properties.
a. The curve is bell-shaped.
b. It is symmetrical (non-skew).
c. The mean, median and mode are equal.
d. The curve is asymptotic to the X-axis; that is, the curve touches the X-axis only at -∞ and +∞.
e. The curve has points of inflexion at μ - σ and μ + σ.
Properties of the Normal Curve
For the distribution:
a. Standard deviation = σ
b. Quartile deviation = (2/3)σ (approximately)
c. Mean deviation = (4/5)σ (approximately)
For the distribution:
a. The odd-order central moments are equal to zero.
b. The even-order central moments give μ₂ = σ² and μ₄ = 3σ⁴.
The distribution is mesokurtic; that is, β₂ = 3.
Total area under the curve is unity.
P[a < X ≤ b] = area bounded by the curve and the ordinates at a and b.
a. P[μ - σ < X ≤ μ + σ] = 0.6826 = 68.26%
b. P[μ - 2σ < X ≤ μ + 2σ] = 0.9544 = 95.44%
c. P[μ - 3σ < X ≤ μ + 3σ] = 0.9974 = 99.74%
Deviation of an individual observation from the mean in a normal distribution or curve is called standard normal variate and is given the symbol Z.
It is measured in terms of standard deviations (SDs) and indicates how much an observation is bigger or smaller than the mean, in units of standard deviation.
Standard Normal Variate
So Z is a ratio, calculated as:

Z = (X - μ) / σ

where X stands for the individual observation, and μ and σ stand for the mean and SD as usual.
Z is also called the standard normal deviate or relative normal deviate.
The normal curve is completely determined by two
parameters mean(µ) and SD(σ).
So, a different normal distribution is specified for each
different value of µ and σ.
Variations of mean and SD values affect the normal
curve in different ways.
Comparisons
[Figure: The effects of µ and σ. Curves with σ = 2, 3 and 4 show how the standard deviation affects the shape of f(x); curves with µ = 10, 11 and 12 show how the mean affects the location of f(x).]
Different values of μ shift the graph of the distribution along the x-axis.
Different values of σ (SD) determine the degree of flatness or peakedness of the graph of the distribution.
The area under the cumulative normal distribution can also be plotted;
it shows the cumulative probability by levels of mean ± standard error.
The two variations of the normal curve are due to two
characteristics of the curve:
1. Skewness
2. Kurtosis
The normal curve is symmetric; frequently, however,
our data distributions, especially with small sample sizes, will show some degree of asymmetry, or departure from symmetry.
Variations of the Normal Curve
Skewness is a statistic to measure the degree of asymmetry.
If the distribution has a longer "tail" to the right of the peak than to the left, the distribution is skewed to the right or has positive skewness.
If the reverse is true the distribution is said to be skewed to the left or to have a negative skewness.
Skewness
The value of skewness can be computed as the third central moment divided by the cube of the standard deviation:

skewness = Σ(X - X̄)³ / (n·s³)

where X is each individual score, X̄ the mean, s the standard deviation, and n the number of observations. The value of skewness is zero when the distribution is a
completely symmetric bell-shaped curve.
A positive value indicates that the distribution is skewed to the right (i.e.,positive skewness) and a negative value indicates that the distribution is skewed to the left (i.e.negative skewness).
While skewness describes the degree of symmetry of a distribution, kurtosis measures the height of a distribution curve.
To compute kurtosis (as excess kurtosis, so that the normal curve scores zero), we use the formula

kurtosis = Σ(X - X̄)⁴ / (n·s⁴) - 3
Kurtosis
A positive kurtosis indicates that the distribution has a relatively high peak; this is called leptokurtic.
A negative kurtosis indicates that the distribution is relatively flat-topped; this is called platykurtic.
A normal distribution has a kurtosis of zero; this is called mesokurtic.
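The two moment formulas above can be sketched directly in Python; the tiny data sets are hypothetical and chosen only to show the sign conventions:

```python
import statistics

def skewness(xs):
    """Third central moment over s^3: 0 symmetric, >0 right-skewed."""
    m, s, n = statistics.fmean(xs), statistics.pstdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def excess_kurtosis(xs):
    """Fourth central moment over s^4, minus 3: 0 for a normal curve."""
    m, s, n = statistics.fmean(xs), statistics.pstdev(xs), len(xs)
    return sum((x - m) ** 4 for x in xs) / (n * s ** 4) - 3

symmetric = [1, 2, 3, 4, 5]        # symmetric around its mean of 3
right_skewed = [1, 1, 1, 2, 10]    # long tail to the right

print(skewness(symmetric))                    # 0.0
print(round(skewness(right_skewed), 2))       # positive: right skew
print(round(excess_kurtosis(symmetric), 2))   # negative: platykurtic
```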
Skewness and Kurtosis provide distributional
information about the data.
In statistical tests that assume a normal distribution of the
data, skewness and kurtosis can be used to examine this
assumption, called normality.
With measurements whose distributions are not normal, a simple
transformation of the scale of the measurement may induce
approximate normality.
The square root √x, and the logarithm, log x, are often used as
transformations in this way.
These transformations are found useful for the flexible use of some
tests of significance, like Student's t test.
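As an illustration of how such a transformation can tame skewness, a sketch with hypothetical right-skewed measurements (the skewness function is the moment coefficient described earlier):

```python
import math
import statistics

def skewness(xs):
    """Moment coefficient of skewness."""
    m, s, n = statistics.fmean(xs), statistics.pstdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

# Hypothetical right-skewed measurements (a few very large values).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(x) for x in raw]    # log transformation
rooted = [math.sqrt(x) for x in raw]   # square-root transformation

print(round(skewness(raw), 2))     # strongly positive
print(round(skewness(rooted), 2))  # reduced
print(round(skewness(logged), 2))  # much closer to 0
```

Here the log transform brings the distribution closest to symmetry, which is typical for long right tails.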
Non Normal Distributions
Even if the distribution in the original population is far from normal, the distribution of sample averages tends to become normal, under a wide variety of conditions, as the size of the sample increases.
This is the single most important reason for the use of the normal distribution.
Also, many results that are useful in statistical work, although strictly true only when the population is normal, hold well enough for rough and ready use when samples come from non-normal populations.
When presenting such results, we can indicate how well they stand up under non-normality.
Normal Curve INTERPRETATION

Mathematical representation: a random variable X with mean µ and standard deviation σ is normally distributed if its probability density function is given by

f(x) = (1 / (σ√(2π))) · e^(-(1/2)((x - µ)/σ)²),  -∞ < x < ∞

where π ≈ 3.14159... and e ≈ 2.71828...
The Shape of Normal Distributions

Normal distributions are bell-shaped and symmetrical around µ.

Why symmetrical? Let µ = 100 and σ = 10. Suppose x = 110:

f(110) = (1/(10√(2π))) · e^(-(1/2)((110 - 100)/10)²) = (1/(10√(2π))) · e^(-1/2)

Now suppose x = 90:

f(90) = (1/(10√(2π))) · e^(-(1/2)((90 - 100)/10)²) = (1/(10√(2π))) · e^(-1/2)

So f(110) = f(90): values equidistant from the mean have equal density.
The expected value (also called the mean) E(X) (or µ) can be any number
The standard deviation can be any nonnegative number
The total area under every normal curve is 1.
There are infinitely many normal distributions.
Normal Probability Distributions
Total area =1; symmetric around µ
[Figure: The effects of μ and σ. Curves with σ = 2, 3 and 4 show how the standard deviation affects the shape of f(x); curves with μ = 10, 11 and 12 show how the expected value affects the location of f(x).]

A family of bell-shaped curves that differ only in their means and standard deviations: µ = the mean of the distribution, σ = the standard deviation.

[Figure: Normal curves plotted over x = 0 to 12 with µ = 3, σ = 1; µ = 6, σ = 1; and µ = 6, σ = 2.]
Probability = area under the density curve.
P(6 < X < 8) = area under the density curve between a = 6 and b = 8, for µ = 6 and σ = 2.

[Figure: Normal curve with µ = 6, σ = 2 over x = 0 to 12, with the area between 6 and 8 shaded to show P(6 < X < 8).]
Probabilities = areas under the graph of f(x):

P(a < X < b) = area under the density curve between a and b.
Since P(X = a) = 0, P(a ≤ X ≤ b) = P(a < X < b).

P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
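The area interpretation can be checked numerically; a minimal sketch using the standard library's NormalDist, with the µ = 6, σ = 2 example from the figure:

```python
from statistics import NormalDist

X = NormalDist(mu=6, sigma=2)

# P(6 < X < 8) = area under the density curve between a = 6 and b = 8:
p = X.cdf(8) - X.cdf(6)
print(round(p, 4))  # 0.3413, the area between the mean and +1 SD
```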
Suppose X ~ N(µ, σ). Form a new random variable by subtracting the mean μ
from X and dividing by the standard deviation σ: (X - µ)/σ.
This process is called standardizing the random variable X.

Standardizing

(X - µ)/σ is also a normal random variable; we will denote it by Z:

Z = (X - µ)/σ

Z has mean 0 and standard deviation 1: E(Z) = 0; SD(Z) = 1.
The probability distribution of Z is called the standard normal distribution.
Standardizing (cont.)
If X has mean µ and standard deviation σ, standardizing a particular value x tells how many standard deviations x is above or below the mean µ.
Exam 1: µ = 80, σ = 10; exam 1 score: 92. Exam 2: µ = 80, σ = 8; exam 2 score: 90. Which score is better?

Standardizing (cont.)

z₁ = (92 - 80)/10 = 12/10 = 1.2

z₂ = (90 - 80)/8 = 10/8 = 1.25

Since 1.25 > 1.2, 90 on exam 2 is better than 92 on exam 1.
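The exam comparison above reduces to two z-score calculations; a minimal sketch:

```python
def z_score(x, mu, sigma):
    """Number of SDs the value x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

z1 = z_score(92, mu=80, sigma=10)  # exam 1 score
z2 = z_score(90, mu=80, sigma=8)   # exam 2 score

print(z1, z2)   # 1.2 1.25
print(z2 > z1)  # True: 90 on exam 2 is the better relative score
```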
[Figure: X ~ N(6, 2) plotted over x = 0 to 12 is transformed by Z = (X - 6)/2 into the standard normal (µ = 0, σ = 1) over z = -3 to 3, with area .5 on each side of the mean.]
Pdf of a standard normal curve

A normal random variable X has the pdf

f(x) = (1 / (σ√(2π))) · e^(-(1/2)((x - µ)/σ)²)

Substituting µ = 0 and σ = 1 for Z ~ N(0, 1), the pdf for the standard normal random variable becomes

f(z) = (1 / √(2π)) · e^(-z²/2)

Z = standard normal random variable, with µ = 0 and σ = 1.
Standard Normal Distribution
[Figure: Standard normal curve over z = -3 to 3, symmetric about 0 with area .5 on each side.]
Table Z is the standard Normal table. We have to convert our data to z-scores before using the table.
The figure shows us how to find the area to the left when we have a z-score of 1.80:
Finding Normal Percentiles by Hand (cont.)
Areas Under the Z Curve: Using the Table

P(0 < Z < 1) = .8413 - .5 = .3413

[Figure: Standard normal curve with area .3413 shaded between 0 and 1, leaving .1587 in the right tail.]
Standard normal probabilities have been calculated and are provided in table Z.
The tabulated probabilities correspond to the area between Z = -∞ and some z₀, i.e. P(-∞ < Z < z₀).
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
… … … …
1 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
… … … …
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
… … … …
Example (continued): X ~ N(60, 8). Find P(X < 70).

P(X < 70) = P((X - 60)/8 < (70 - 60)/8) = P(Z < 1.25)

In this example z₀ = 1.25; from the table row z = 1.2, column 0.05:

P(Z < 1.25) = 0.8944
Examples

P(0 ≤ Z ≤ 1.27) = .8980 - .5 = .3980

[Figure: area .3980 shaded between z = 0 and z = 1.27.]

P(Z ≥ .55) = A1 = 1 - A2 = 1 - .7088 = .2912

[Figure: A2 = area to the left of z = .55; A1 is the remaining right tail.]
P(-2.24 ≤ Z ≤ 0) = .5 - .0125 = .4875

[Figure: area .4875 shaded between z = -2.24 and 0, with .0125 in the left tail.]

P(Z ≤ -1.85) = .0322
Examples (cont.)

P(-1.18 ≤ Z ≤ 2.73) = A - A1 = .9968 - .1190 = .8778

[Figure: A = area left of z = 2.73 (.9968); A1 = area left of z = -1.18 (.1190).]

P(-1 ≤ Z ≤ 1) = .8413 - .1587 = .6826
Find k such that P(Z < k) = .2514.
Is k positive or negative? The direction of the inequality and the magnitude of the probability decide: since .2514 < .5, k must be negative.
Look up .2514 in the body of the table; the corresponding entry is -.67, so k = -.67.
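Both the forward table lookup and this inverse lookup can be sketched with the standard library's NormalDist:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

# Forward lookup, as in the printed table: P(Z < -0.67)
print(round(Z.cdf(-0.67), 4))  # 0.2514

# Inverse lookup: find k with P(Z < k) = .2514
k = Z.inv_cdf(0.2514)
print(round(k, 2))             # -0.67, negative since .2514 < .5
```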
Examples (cont.)

For X ~ N(275, 43):

P(X > 250) = P((X - 275)/43 > (250 - 275)/43) = P(Z > -.58)
= 1 - .2810 = .7190
Examples (cont.)

P(225 < X < 375) = P((225 - 275)/43 < (X - 275)/43 < (375 - 275)/43)
= P(-1.16 < Z < 2.33) = .9901 - .1230 = .8671
X ~ N(275, 43): find k so that P(X < k) = .9846.

.9846 = P(X < k) = P((X - 275)/43 < (k - 275)/43) = P(Z < (k - 275)/43)

From the standard normal table, P(Z < 2.16) = .9846, so

(k - 275)/43 = 2.16, giving k = 2.16(43) + 275 = 367.88.

[Figure: standard normal curve with area .9846 to the left of z = 2.16.]
A machine regulates blue dye for mixing paint; it can be set to discharge an average of μ ml of dye per can of paint.
Amount discharged: X ~ N(µ, .4 ml). If more than 6 ml is discharged into a paint can, the shade of blue is unacceptable.
Determine the setting μ so that only 1% of the cans of paint will be unacceptable.
Example
Solution

X = amount of dye discharged into a can, X ~ N(µ, .4); determine µ so that P(X > 6) = .01.

.01 = P(X > 6) = P((X - µ)/.4 > (6 - µ)/.4) = P(Z > (6 - µ)/.4)

From the standard normal table, P(Z > 2.33) = .01, so

(6 - µ)/.4 = 2.33, giving µ = 6 - 2.33(.4) = 5.068.
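The same answer falls out of an inverse-CDF call; a sketch (note the exact 99th-percentile z is 2.326..., which the table rounds to 2.33):

```python
from statistics import NormalDist

# P(X > 6) = .01 means the 99th percentile of N(mu, 0.4) must sit at 6:
z99 = NormalDist().inv_cdf(0.99)  # ~2.326; the table rounds this to 2.33
mu = 6 - z99 * 0.4
print(round(mu, 2))  # 5.07 (the hand calculation with z = 2.33 gives 5.068)
```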
In statistics, normality tests are used to determine whether a data set is well modelled by a normal distribution, and to compute how likely it is that the random variable underlying the data set is normally distributed.
More precisely, the tests are a form of model selection, and can be interpreted several ways, depending on one's interpretations of probability.
Tests for Normality
Graphical methods
An informal approach to testing normality is to compare a histogram of
the sample data to a normal probability curve.
The empirical distribution of the data (the histogram) should be
bell-shaped and resemble the normal distribution. This might be difficult to
see if the sample is small.
In this case one might proceed by regressing the data against the
quantiles of a normal distribution with the same mean and variance as the
sample.
Lack of fit to the regression line suggests a departure from normality.
A graphical tool for assessing normality is the normal probability plot, a quantile-quantile plot (QQ plot) of the standardized data against the standard normal distribution.
Here the correlation between the sample data and normal quantiles (a measure of the goodness of fit) measures how well the data is modeled by a normal distribution.
For normal data the points plotted in the QQ plot should fall approximately on a straight line, indicating high positive correlation.
These plots are easy to interpret and also have the benefit that outliers are easily identified.
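This QQ-plot correlation can be sketched without any plotting library by correlating the sorted data with standard normal quantiles; the simulated data sets are illustrations only:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def qq_correlation(xs):
    """Correlation between sorted data and standard normal quantiles.

    Values near 1 indicate a straight-line QQ plot, i.e. near-normality.
    """
    n = len(xs)
    sample = sorted(xs)
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, my = fmean(sample), fmean(theory)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sample, theory))
    sxx = sum((x - mx) ** 2 for x in sample)
    syy = sum((y - my) ** 2 for y in theory)
    return sxy / sqrt(sxx * syy)

random.seed(1)
normal_data = [random.gauss(0, 1) for _ in range(200)]
skewed_data = [random.expovariate(1.0) for _ in range(200)]

print(round(qq_correlation(normal_data), 3))  # close to 1
print(round(qq_correlation(skewed_data), 3))  # noticeably lower
```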
A simple back-of-the-envelope test takes the sample maximum and minimum and computes their z-score, or more properly t-statistic (the number of sample standard deviations by which a sample lies above or below the sample mean), and compares it to the 68-95-99.7 rule.
This test is useful in cases where one faces kurtosis risk (where large deviations matter) and has the benefit that it is
very easy to compute and to communicate: non-statisticians can easily grasp that "6σ events don't happen in normal distributions".
Back-of-the-envelope test
Tests of univariate normality include:
D'Agostino's K-squared test
Jarque–Bera test
Anderson–Darling test
Cramér–von Mises criterion
Lilliefors test for normality (itself an adaptation of the Kolmogorov–Smirnov test)
Shapiro–Wilk test
Pearson's chi-squared test
Shapiro–Francia test
Frequentist tests
A 2011 paper from The Journal of Statistical Modeling and Analytics concludes that Shapiro-Wilk has the best power for a given significance, followed closely by Anderson- Darling when comparing the Shapiro-Wilk, Kolmogorov- Smirnov, Lilliefors, and Anderson-Darling tests.
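Shapiro–Wilk is available in scientific libraries (e.g. scipy.stats.shapiro). As a dependency-free illustration, here is a sketch of another test from the list, the Jarque–Bera test, which rejects normality when sample skewness and excess kurtosis are jointly too large; the simulated data sets are hypothetical:

```python
import random
import statistics

def jarque_bera(xs):
    """Jarque-Bera statistic n/6*(S^2 + K^2/4); ~chi-squared(2) if normal."""
    n = len(xs)
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    skew = sum((x - m) ** 3 for x in xs) / (n * s ** 3)
    ex_kurt = sum((x - m) ** 4 for x in xs) / (n * s ** 4) - 3
    return n / 6 * (skew ** 2 + ex_kurt ** 2 / 4)

random.seed(2)
normal_data = [random.gauss(50, 10) for _ in range(500)]
skewed_data = [random.expovariate(0.1) for _ in range(500)]

# The 5% critical value of chi-squared with 2 df is about 5.99:
# normal data typically stays below it, skewed data far exceeds it.
print(round(jarque_bera(normal_data), 2))
print(round(jarque_bera(skewed_data), 2))  # far above 5.99: reject normality
```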
Ralph B. D'Agostino (1986). “Tests for the Normal Distribution”. In D'Agostino, R.B. and Stephens, M.A. Goodness-of-Fit Techniques. New York: Marcel Dekker. ISBN 0-8247-7487-6.
More recent tests of normality include the energy test (Székely and
Rizzo) and the tests based on the empirical characteristic function
(ecf), e.g. Epps and Pulley, Henze–Zirkler, and BHEP (the Baringhaus–
Henze–Epps–Pulley multivariate normality test).
The energy and the ecf tests are powerful tests that apply for testing
univariate or multivariate normality and are statistically consistent
against general alternatives.
Kullback–Leibler divergences between the whole posterior distributions
of the slope and variance do not indicate non-normality.
However, the ratio of expectations of these posteriors and the
expectation of the ratios give similar results to the Shapiro–Wilk statistic
except for very small samples, when non-informative priors are used.
Spiegelhalter suggests using a Bayes factor to compare normality with a
different class of distributional alternatives. This approach has been
extended by Farrell and Rogers-Stewart.
Bayesian tests
One application of normality tests is to the residuals from a linear regression model. If they are not normally distributed, the residuals should not be used in Z tests or in any other tests derived from the normal distribution, such as t tests, F tests and chi-squared tests.
If the residuals are not normally distributed, then the dependent variable or at least one explanatory variable may have the wrong functional form, or important variables may be missing, etc.
Correcting one or more of these systematic errors may produce residuals that are normally distributed.
Results of Normality tests
Most of the statistical analyses presented are based on the bell-shaped or normal distribution.
The major importance of the normal distribution is the statistical inference of how often an observation can occur normally in a population.
The normal distribution is the most important and most widely used distribution in statistics.
The normal distribution is very useful in practice and makes statistical analysis easy.
Conclusion
1. Essentials of Community Dentistry by Soben Peter.
2. Basic and Clinical Biostatistics by Dawson and Trapp.
3. Biostatistics by Dr. Vishweswara Rao.
4. Health Research Methodology by Okolo.
5. Biostatistics by Sarmakaddam.
6. Biostatistics by Kim and Dialey.
7. http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_PDF
8. Introduction to Normal Distributions by David M. Lane.
9. Introduction to Statistics Online Edition by David M. Lane. Other authors: David Scott, Mikki Hebl, Rudy Guerra, Dan Osherson, and Heidi Ziemer.
Spiegelhalter, D.J. (1980). An omnibus test for normality for small samples. Biometrika, 67, 493–496. doi:10.1093/biomet/67.2.493
Farrell, P.J., Rogers-Stewart, K. (2006) “Comprehensive study of tests for normality and symmetry: extending the Spiegelhalter test”. Journal of Statistical Computation and Simulation, 76(9), 803 – 816. doi:10.1080/10629360500109023
References
RGUHS - April 2000, Sept. 2007: normal curve (10 marks).
Sumandeep University - April 2012: normal curve (10 marks).
Manipal University - April 2007: normal curve (10 marks).
Mangalore University - July 1993, December 1997 (10 marks).
Previous year questions
THANK YOU