On the Reliability of Confidence Interval Estimates in both Normal and Non-normal Data: A Simulation Study


On the Reliability of Confidence Interval

Estimates in both Normal and Non-normal Data: A Simulation Study

A Thesis
Presented to the
Faculty of the Department of Mathematics
College of Arts and Sciences
Caraga State University
Butuan City

In Partial Fulfillment
of the Requirements for the Degree
Bachelor of Science in Mathematics
(Applied Statistics)

Marchan A. Saga
February 2016



ABSTRACT

As in any estimation problem, researchers are interested in the reliability of their estimates. In theory, confidence interval estimates are expressed as functions of tabular values of either the t-distribution or the Z-distribution. These distributions in turn rest on the assumption that the data are normal, so the reliability of the estimates depends on that requirement. A common fault in many analyses is to ignore this assumption. When this happens, the uncertainty of inferences about the population parameter becomes higher than the stated level of significance, and decision making is misled.

This study investigates the reliability of confidence interval estimates by means of a computer program (written in the C programming language) for normal and non-normal data taken from normal and non-normal populations, respectively, under Simple Random Sampling Without Replacement (SRSWOR). Confidence interval reliability is also observed as the sample size and the level of significance vary.

Results showed that when the normality assumption is ignored, confidence interval estimates become less reliable. This worsens as the specified level of significance gets smaller. However, increasing the sample size yields a gain in reliability.



TABLE OF CONTENTS

Title Page
Approval Sheet
Abstract
Acknowledgement
Table of Contents

I. Introduction
1.1 Background of the Study
1.2 Objectives of the Study
1.3 Significance of the Study

II. Basic Concepts And Preliminaries
2.1 Basic Concepts
2.2 Preliminaries

III. Methodology
3.1 Algorithm
3.2 Flow Chart

IV. Results And Discussions
4.1 Results

V. Summary, Conclusions And Recommendations
5.1 Summary And Conclusion
5.2 Recommendations

Bibliography
Annexes



CHAPTER 1

Introduction

"When statistics are not based on strictly accurate and precise calculations, they mislead instead of guide." - Alexis de Tocqueville, Democracy in America

1.1 Background of the Study

Across disciplines, researchers must weigh the cost of data collection, the time needed to complete a project, and the human resources available. This problem has been addressed through the development of sampling and survey theory, in which estimation is an inevitable element. Estimation has a major role in statistics: one of its main applications is estimating population parameters from sample statistics. In conducting research, nothing is wrong with covering every respondent, but, as mentioned, doing so is costly and time consuming. Estimation resolves these difficulties. As an important part of statistics, estimation theory seeks an accurate and efficient estimate of a given parameter of interest. Such estimates can take a point or an interval form. A point estimate is a single number computed from a random sample that represents a plausible value of the parameter. It pinpoints a location in the distribution of possible values of the random variable [8]. An interval estimate, on the other hand, is a range of values computed from a random sample that represents an interval of plausible values for the unknown population parameter. When a measure of certainty or confidence is attached to the interval estimate, the interval is called a confidence interval estimate. Confidence interval estimation is a fundamental and widely used method of statistical inference [13]. The interpretation of a confidence interval derives from the sampling process that generates the sample from which the interval is calculated. It provides a measure of how "confident" the researcher is in stating that the interval estimate obtained from the random sample contains the true value of the parameter. Equivalently, if a 95% confidence interval is constructed, then, in the long run, 95% of the intervals constructed in a similar manner will contain the true value of the parameter [8].

As in any estimation problem, researchers are interested in the reliability of their estimates. In theory, confidence interval estimates are expressed as functions of tabular values of either the t-distribution or the Z-distribution. These distributions in turn rest on the assumption that the data are normal, so the reliability of the estimates depends on the necessary requirements or assumptions. A common fault in many analyses is to ignore these assumptions. When this happens, the uncertainty of inferences about the population parameter becomes higher than the stated level of significance, and decision making is misled.

1.2 Objectives of the Study

This study investigates the reliability of confidence interval estimates by means of a computer program (written in the C programming language) for normal and non-normal data taken from normal and non-normal populations, respectively, under Simple Random Sampling Without Replacement (SRSWOR). Confidence interval reliability is also observed as the sample size and the level of significance vary.

1.3 Significance of the Study

Research is not new in our society and in fact plays a big role in it. However, several researchers neglect some of the statistical assumptions. This study shows the significance of satisfying the necessary statistical assumptions.



CHAPTER 2

Basic Concepts And Preliminaries

This section presents the basic concepts and terminologies that will be used

in the next chapter. Definitions and theorems are taken from [12], [10] and

[8].

2.1 Basic Concepts

Definition 2.1.1 A variable is a characteristic or property of an individual

population unit.

Definition 2.1.2 Universe is a collection or set of all individuals or entities

whose characteristics are to be studied.

Definition 2.1.3 A population is the set of all possible values of the variable.

Consider the illustration below.

Illustration: [figure not recovered]

That is, the population is expressed as the collection of values of a variable Y taken from a universe.

Furthermore, there are two types of population: finite and infinite. A population is finite when its elements can be counted for a given time period, and infinite when the number of its elements is unlimited.

Definition 2.1.4 Any numerical value describing a characteristic of a popu-

lation is called a parameter.

Definition 2.1.5 Let $X_1, X_2, \ldots, X_N$ be a set of data representing a finite population of size N. Then the population mean is

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}.$$

The population mean describes a characteristic of the population; thus, µ is a parameter.

Example 2.1.6 The number of employees at 5 different drugstores are 3, 5, 6,

4 and 6. Treating the data as a population, find the mean number of employees

for 5 stores.

Solution. Since the data are considered to be a finite population,

$$\mu = \frac{3 + 5 + 6 + 4 + 6}{5} = \frac{24}{5} = 4.8.$$

Definition 2.1.7 Let $X_1, X_2, \ldots, X_N$ be a finite population. The population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}.$$

Definition 2.1.8 Sample is a subset of the units of a population.

Definition 2.1.9 Any numerical value describing a characteristic of a sample

is called a statistic.

Definition 2.1.10 Let $X_1, X_2, \ldots, X_n$ be a set of data representing a finite sample of size n. Then the sample mean is

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.$$

The sample mean $\bar{X}$ describes a characteristic of a sample; thus, $\bar{X}$ is a statistic.

Example 2.1.11 A food inspector examined a random sample of 7 cans of

a certain brand of tuna to determine the percent of foreign impurities. The

following data were recorded: 1.8, 2.1, 1.7, 1.6, 0.9, 2.7 and 1.8. Compute the

sample mean.

Solution.

$$\bar{X} = \frac{1.8 + 2.1 + 1.7 + 1.6 + 0.9 + 2.7 + 1.8}{7} = \frac{12.6}{7} = 1.8.$$

Definition 2.1.12 Let $X_1, X_2, \ldots, X_n$ be a random sample. The sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$

Example 2.1.13 Suppose set A is a random sample taken from population B, where A consists of −5, −4, −3, −2, 0, 1, 2, 4, 7. Compute the sample variance.

Solution. First, compute the sample mean,

$$\bar{X} = \frac{-5 - 4 - 3 - 2 + 0 + 1 + 2 + 4 + 7}{9} = 0.$$

Then the sample variance is

$$s^2 = \frac{\sum_{i=1}^{9} (X_i - 0)^2}{9 - 1} = \frac{(-5)^2 + (-4)^2 + \cdots + (4)^2 + (7)^2}{8} = \frac{124}{8} = 15.5.$$
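The arithmetic in Example 2.1.13 can be checked with a few lines of Python. This is only a verification sketch using the standard `statistics` module; it is not part of the thesis program, and the variable names are chosen here for illustration.

```python
import statistics

# Sample A from Example 2.1.13
sample_a = [-5, -4, -3, -2, 0, 1, 2, 4, 7]

sample_mean = statistics.mean(sample_a)          # arithmetic mean of the sample
sample_variance = statistics.variance(sample_a)  # sample variance, divides by n - 1

print(sample_mean, sample_variance)
```

Note that `statistics.variance` uses the divisor n − 1, matching Definition 2.1.12, whereas `statistics.pvariance` uses the divisor N of Definition 2.1.7.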

Definition 2.1.14 Sampling is the process of selecting a sample from a uni-

verse or a population.




Definition 2.1.15 Probability Sampling is sampling where samples are obtained using some objective chance mechanism, thus involving randomization. It requires a complete listing of the elements of the universe, called the sampling frame. Simple Random Sampling is one example of probability sampling. There are two types of simple random sampling. Simple random sampling without replacement (SRSWOR) does not allow repetitions of selected units in the sample. On the other hand, simple random sampling with replacement (SRSWR) allows repetitions of selected units in the sample.
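The difference between the two schemes can be illustrated with Python's standard `random` module. The universe of 20 labeled units and the seed below are hypothetical, chosen only for illustration; they are not the thesis data.

```python
import random

units = list(range(1, 21))   # hypothetical universe of 20 labeled units
rng = random.Random(2016)    # fixed seed so the run is repeatable

srswor = rng.sample(units, k=5)    # without replacement: no unit can repeat
srswr = rng.choices(units, k=5)    # with replacement: a unit may repeat

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
```

Under SRSWOR the five selected units are always distinct, while under SRSWR the same unit may appear more than once.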

Definition 2.1.16 Non-Probability Sampling is sampling where samples are obtained haphazardly, selected purposively, or taken as volunteers. The probabilities of selection are unknown.

Definition 2.1.17 Estimation is concerned with finding a value or range of values for an unknown parameter.

Definition 2.1.18 A Point Estimator of a population parameter is a rule or formula that tells us how to use sample data to calculate a single number that can be used as an estimate of the population parameter.

Definition 2.1.19 An Interval Estimator is a formula that tells us how to use sample data to calculate an interval that estimates a population parameter.



Definition 2.1.20 The rejection region is the set of numerical values of the test statistic for which the null hypothesis will be rejected. It is chosen so that the probability of rejecting a true null hypothesis is α. The value of α is usually chosen to be small (e.g., 0.01, 0.05, 0.10) and is referred to as the level of significance of the test.

Definition 2.1.21 A (1 − α) × 100% Confidence Interval is a range of numbers believed to include an unknown population parameter. Associated with the interval is a measure of the confidence we have that the interval does indeed contain the parameter of interest.

Example 2.1.22 The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and 9.6 liters. Find the 95% confidence interval for the mean content of all such containers, assuming an approximately normal distribution for container contents.

Solution. The sample mean and standard deviation for the given data are $\bar{X} = 10.0$ and $s = 0.283$. Using the t table, we find $t_{0.025} = 2.447$ for $v = 6$ degrees of freedom. Hence the 95% confidence interval for µ is

$$10.0 - (2.447)(0.283/\sqrt{7}) < \mu < 10.0 + (2.447)(0.283/\sqrt{7}),$$

which reduces to

$$9.74 < \mu < 10.26.$$
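The interval in Example 2.1.22 can be recomputed as a check. The sketch below assumes seven readings, with 10.2 as the sixth value; that value is an assumption chosen so that the data reproduce the stated mean of 10.0 and standard deviation of 0.283 for n = 7. The critical value is taken from the t table rather than computed. With these inputs the limits come out at about 9.74 and 10.26.

```python
import math
import statistics

# Seven readings; the 10.2 in sixth position is an assumption (see the text above).
contents = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]
t_crit = 2.447  # t_{0.025} with 6 degrees of freedom, from the t table

n = len(contents)
mean = statistics.mean(contents)
s = statistics.stdev(contents)        # sample standard deviation (divisor n - 1)
margin = t_crit * s / math.sqrt(n)    # half-width of the interval

lower, upper = mean - margin, mean + margin
print(f"{lower:.2f} < mu < {upper:.2f}")
```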




2.2 Preliminaries

The next theorem is stated in [12].

Theorem 2.2.1 If $\bar{X}$ and $s^2$ are the mean and variance, respectively, of a random sample of size n taken from a population that is normally distributed with mean µ and variance σ², then

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

is a value of a random variable T having the t distribution with v = n − 1 degrees of freedom.

Theorem 2.2.2 If all possible random samples of size n are drawn without replacement from a finite population of size N with mean µ and standard deviation σ, then the sampling distribution of the sample mean $\bar{X}$ will be approximately normally distributed with mean and standard deviation given by

$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}.$$
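The mean and standard deviation stated in Theorem 2.2.2 can be checked by brute force on a small population. The sketch below reuses the drugstore data of Example 2.1.6 as the population and enumerates every sample of size 2. The choice of n = 2 is an assumption made only to keep the enumeration short.

```python
import math
import statistics
from itertools import combinations

population = [3, 5, 6, 4, 6]   # data of Example 2.1.6, treated as a population
N, n = len(population), 2

mu = statistics.mean(population)
sigma = statistics.pstdev(population)          # population SD (divisor N)

# Means of all C(N, n) samples drawn without replacement
sample_means = [statistics.mean(c) for c in combinations(population, n)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.pstdev(sample_means)  # SD of the sampling distribution
formula_sd = sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))

print(mu, mean_of_means)
print(sd_of_means, formula_sd)
```

Both printed pairs agree up to floating-point rounding, which is what the theorem states for the mean and standard deviation. The approximate normality of the sampling distribution is a separate claim that this check does not address.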

Theorem 2.2.3 Central Limit Theorem. If random samples of size n are drawn from a large or infinite population with mean µ and variance σ², then the sampling distribution of the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$. Hence,

$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

is a value of a standard normal variable Z.

Theorem 2.2.4 Sample Size for Estimating µ. If $\bar{X}$ is used as an estimate of µ, we can be (1 − α)100% confident that the error will not exceed a specified amount e when the sample size is

$$n = \left( \frac{z_{\alpha/2}\, \sigma}{e} \right)^{2}.$$
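As a small illustration of Theorem 2.2.4, the sketch below evaluates the formula and rounds up to the next whole number. The function name `sample_size` and the inputs σ = 0.3, e = 0.05 and α = 0.05 are hypothetical and chosen only for illustration. The critical value comes from the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

def sample_size(sigma, e, alpha):
    """Evaluate (z_{alpha/2} * sigma / e)^2 and round up to a whole number."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of Z
    return math.ceil((z * sigma / e) ** 2)

# Hypothetical inputs, chosen only for illustration
print(sample_size(sigma=0.3, e=0.05, alpha=0.05))
```

Rounding up is a conservative choice, since any fractional sample size must be replaced by the next integer to keep the error within e.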

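Theorem 2.2.5 If $\bar{X}$ and $s^2$ are the mean and variance of a random sample of size n from a normal population, then

$$P\left[ \bar{X} - t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \right] \ge 1 - \alpha,$$

where α is the level of significance.

Theorem 2.2.5 is the core of this study. The researcher will check whether the theorem really holds given the assumptions above. Moreover, $\bar{X}$ and $s^2$ from non-normal data will also be part of the validation.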
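One way to see where the interval of Theorem 2.2.5 comes from is to invert the t statistic of Theorem 2.2.1. The sketch below assumes exact normality and writes $t_{\alpha/2,\,n-1}$ for the upper α/2 point of the t distribution with n − 1 degrees of freedom. Under those assumptions the probability equals 1 − α, so the bound in the theorem is met with equality.

```latex
\begin{align*}
1 - \alpha
  &= P\left[ -t_{\alpha/2,\,n-1} < \frac{\bar{X} - \mu}{s/\sqrt{n}} < t_{\alpha/2,\,n-1} \right] \\
  &= P\left[ \bar{X} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \right].
\end{align*}
```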



CHAPTER 3

Methodology

This chapter presents how the researcher obtained the results. The researcher generated normal data in the R programming language (free software) and obtained non-normal data of size 20 from the examination results of students. The researcher used the C programming language to exhaust all combinations of samples of size 5, 10 and 15 from the population of size 20.
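To give a sense of the size of this enumeration, the number of distinct samples for each sample size can be counted with `math.comb`. This is a side calculation in Python, not part of the thesis program.

```python
import math

N = 20                      # population size used in the study
for n in (5, 10, 15):       # sample sizes used in the study
    print(n, math.comb(N, n))
```

Sample sizes 5 and 15 give the same count because choosing 15 units from 20 is equivalent to choosing the 5 units left out.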

3.1 Algorithm

Step 1: Set the level of significance α, the sample size n, the population mean µ, the number of possible samples ${}_{N}C_{n}$, and an integer counter x = 0.

Step 2: Take a random sample of size n from the population of size N.

Step 3: Compute the (1 − α) × 100% confidence interval for the population mean µ,

$$\left( \bar{X} - t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}},\ \ \bar{X} + t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \right).$$

Step 4: Add 1 to x if µ is in the interval and 0 otherwise.

Step 5: Repeat Steps 2 to 4 until all ${}_{N}C_{n}$ samples of size n are exhausted.

The following proportion then gives the percent reliability:

$$\frac{x}{{}_{N}C_{n}}.$$
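The steps above can be sketched in Python as follows. This is a minimal illustration, not the C program used in the thesis. The function name `coverage`, the population values, and the use of a tabulated critical value are assumptions made here. The population is a hypothetical, clearly skewed set of 20 numbers, since the thesis data are not available.

```python
import math
from itertools import combinations

def coverage(population, n, t_crit):
    """Share of all SRSWOR samples of size n whose interval contains mu."""
    N = len(population)
    mu = sum(population) / N
    fpc = math.sqrt((N - n) / N)          # finite population correction of Step 3
    x = 0                                 # counter of intervals that contain mu
    for sample in combinations(population, n):
        xbar = sum(sample) / n
        s = math.sqrt(sum((v - xbar) ** 2 for v in sample) / (n - 1))
        half_width = t_crit * s / math.sqrt(n) * fpc
        if xbar - half_width <= mu <= xbar + half_width:
            x += 1                        # Step 4: count a covering interval
    return x / math.comb(N, n)            # proportion over all samples

# Hypothetical skewed population of size 20 (not the thesis data)
population = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 9, 11, 14, 18, 23, 30, 40, 55]
t_table = 2.776   # t_{0.025} with 4 degrees of freedom, from a t table

print(f"coverage at n = 5: {coverage(population, 5, t_table):.4f}")
```

Enumerating with `itertools.combinations` visits every sample exactly once, so the result is an exact proportion rather than a Monte Carlo estimate.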

3.2 Flow Chart



CHAPTER 4

Results And Discussions

4.1 Results

Figure 4.1: Histogram of a Non-normal Data.

The figure above shows the histogram of the non-normal data taken from the examination results of the students. Observe that the histogram is skewed to the left; it is visually evident that the data set is not normally distributed.

Figure 4.2: Histogram of a Normal Data.

The figure above shows the histogram of the normal data generated in the R programming language (free software). Observe that the histogram is roughly bell-shaped; that is, it is visually evident that the data are normally distributed.



Table 1: Simulation results on the proportion of (1 − α) × 100% confidence intervals containing the population parameter across normal and non-normal data, varying levels of significance and increasing sample sizes.

Table 1 above shows the main result of the study. For α = 0.10, theory leads us to expect that at least 90% of the constructed intervals will contain the population mean. This is indeed true for data taken from a normal population: the proportions are 90.33%, 90.47% and 90.50% for sample sizes 5, 10 and 15, respectively. Note the relatively small increase in reliability as the sample size increases. However, the expected coverage is not observed when data are taken from a non-normal population. As shown, the observed proportions are 83.29% for sample size 5, 87.19% for sample size 10 and 89.99% for sample size 15. This result reveals that when the normality assumption is ignored, the error in estimating a desired parameter becomes higher than what the researcher pre-defined. Notably, increasing the sample size still acts as a remedy when the normality assumption is not met, which is explained by the well-known central limit theorem.

At α = 0.05, a similar trend is observed. As usual, it is expected that in the long run 95% of the intervals under a given sampling design will contain the parameter of interest, the population mean. Under the normality assumption, the proportions are 95.37%, 95.35% and 95.68% for sample sizes 5, 10 and 15, respectively. However, samples from the non-normal population again generated fewer intervals that contained the declared parameter. The observed proportions are 89.86%, 91.36% and 93.64% for sample sizes 5, 10 and 15, respectively, all smaller than the theoretically expected value (95%). Increasing the sample size again raised the reliability even though the normality assumption was ignored.

Lastly, at the stricter level of significance α = 0.01, the same pattern was observed. When the normality assumption is violated, the estimation process becomes even worse (less accurate and less precise). Under this unsatisfied assumption, the proportions of intervals that contain the population parameter are 95.98%, 95.94% and 97.24% for sample sizes 5, 10 and 15, respectively. These values are again lower than what is expected (99%). The idea of increasing the sample size as a remedy for the reliability of estimates from non-normal data still holds overall, although the proportion dips slightly between sample sizes 5 and 10.
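One way to compare the three levels of significance is to divide the observed non-coverage rate by the nominal α. The sketch below does this for the non-normal proportions reported above. The function name `error_ratio` and the ratio itself are introduced here for illustration and are not a measure used in the thesis.

```python
# Observed coverage (%) for the non-normal data, as reported in the text
observed = {
    0.10: {5: 83.29, 10: 87.19, 15: 89.99},
    0.05: {5: 89.86, 10: 91.36, 15: 93.64},
    0.01: {5: 95.98, 10: 95.94, 15: 97.24},
}

def error_ratio(alpha, coverage_pct):
    """Observed non-coverage rate divided by the nominal rate alpha."""
    return (100.0 - coverage_pct) / (100.0 * alpha)

for alpha, by_n in observed.items():
    ratios = {n: round(error_ratio(alpha, c), 2) for n, c in by_n.items()}
    print(alpha, ratios)
```

At sample size 5 the observed error rate is roughly 1.7, 2.0 and 4.0 times the nominal α for α = 0.10, 0.05 and 0.01, respectively. This is one reading of the statement that reliability worsens as the level of significance gets smaller.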



CHAPTER 5

Summary, Conclusion and Recommendation

5.1 Summary And Conclusion

Under the assumption of normality, the (1 − α) × 100% confidence interval indeed attained its predefined degree of statistical reliability. This result is consistent across a computer simulation anchored on the Simple Random Sampling Without Replacement scheme, run across levels of significance (0.10, 0.05 and 0.01) and increasing sample sizes. On the other hand, for non-normal data, the simulation showed that the intervals did not reach the expected degree of reliability. For instance, at α = 0.05, in all of the sample sizes considered, the proportions of interval estimates that contain the population mean are lower than the expected 95%. Furthermore, increasing the sample size was found to be a good remedy when the normality assumption is not met. Based on the results presented, it is concluded that to attain a higher degree of statistical reliability, researchers should first satisfy the necessary statistical assumptions, which in this study is the normality of the data.

5.2 Recommendations

The study focuses only on interval estimates of the population mean. For the interest of future researchers, the following recommendations are offered:

• Estimating the interval estimates for the population variance in both normal and non-normal data.

• Estimating the interval estimates for the population proportion in both normal and non-normal data.

• Comparing the power of the test in hypothesizing a parameter value using the one-sample t-test in both normal and non-normal data.

• Comparing the power of the test in testing mean differences using the independent-samples t-test in both normal and non-normal data.




Bibliography

[1] K. Kenly, 2005. The Effects of Non-Normal Distribution on Confidence Intervals Around the Standardized Mean Difference: Bootstrap and Parametric Confidence Intervals, Sage Publications.

[2] S. Gali, 2015. On Importance of Normality Assumption in Using a T-Test: One Sample and Two Sample Cases, Chennai, India.

[3] R. Bender, G. Berg, and H. Zeeb, 2005. Using Confidence Curves in Medical Research, Biometrical Journal 47, pp. 237-247.

[4] A. Attia, 2005. Why should researchers report the confidence interval in modern research?, Middle East Fertility Society Journal, Vol. 10, No. 1.

[5] Statistics Solutions, 2016. Normality, http://www.statisticssolutions.com/academic-solutions/resources/directory-of-statistical-analyses/normality.

[6] J. Sim and N. Reid, 1999. Statistical inference by confidence intervals: issues of interpretation and utilization, Physical Therapy, Vol. 79, No. 2, pp. 186-195.

[7] A.M. Mood, 1974. Introduction to the Theory of Statistics, Third Ed. McGraw-Hill, Inc.

[8] Institute of Statistics, 2014. WorkBook in Statistics 1, Tenth Ed. University of the Philippines, Los Baños.

[9] J. Sauro and J.R. Lewis, 2005. Estimating Completion Rates From Small Samples Using Binomial Confidence Intervals: Comparisons And Recommendations, Proceedings of the Human Factors and Ergonomics Society 49th Annual Meeting.

[10] J.T. McClave, F.H. Dietrich and T. Sincich, 1997. Statistics, Seventh Ed. Prentice Hall.

[11] A.D. Aczel, 1995. Statistics: Concepts and Applications, Richard D. Irwin, Inc.

[12] R.E. Walpole, 1982. Introduction to Statistics, Macmillan Publishing Company, New York.

[13] D. Gilliland and V. Melfi, 2010. A Note on Confidence Interval Estimation and Margin of Error, Journal of Statistics Education, Volume 18, No. 1.

[14] J. Orloff and J. Bloom, 2014. Confidence Intervals for the Mean of Non-normal Data, Class 23, 18.05, Spring.




Annexes

CODE: [program listing not recovered]

Actual Results: [program output not recovered; the surviving labels cover α = 0.1, 0.05 and 0.01, each with 20 taken 5, 20 taken 10 and 20 taken 15, first for one data set and then under the heading "Actual Results for Non-Normal Data"]
