Page 1: DBA1602 STATISTICS.pdf

NOTES

ANNA UNIVERSITY CHENNAI

STATISTICS FOR MANAGEMENT

UNIT I

PROBABILITY THEORY AND DISTRIBUTIONS

1.1 INTRODUCTION TO PROBABILITY

Managers need to cope with uncertainty in many decision situations. For example, as a manager you may want to know precisely what volume of sales you will have next year. You may know approximately what next year's sales will be, but you cannot give the exact number; there is some uncertainty. The concepts of probability help you to measure that uncertainty and perform the associated analyses. This unit provides the conceptual framework of probability and the various probability rules that are essential in business decisions.

1.2 LEARNING OBJECTIVES

After reading this unit, you will be able to:

Appreciate the use of probability in decision making.

Explain the types of probability.

Define and use the various rules of probability depending on the problem situation.

Make use of expected values for decision-making.

1.3 BASIC TERMINOLOGY

We have all used the word “probably” more frequently than we may care to remember or recount. But have we pondered the basis on which we have used the word “probably”? When we say that an event has a probability of occurring, are we not making an educated guess based on past occurrences?

So, probability is the chance that something will happen. Probabilities are expressed as fractions or as real values between 0 and 1. A probability of 0 means that an event will never happen, while a probability of 1 means that an event will definitely happen. In real life, however, most events have a probability strictly between 0 and 1.

DBA 1602

The importance of learning probability lies in predicting future events based on past occurrences of similar events. We shall now familiarize ourselves with some terms associated with probability theory.

This section lists the terms that are used in defining probability under the different approaches.

Probability

Probability is the chance that something will happen. Probabilities are expressed as fractions or as decimals between 0 and 1. A probability of 0 means that something can never happen; a probability of 1 means that something will always happen.

Event

An event is one or more of the possible outcomes of doing something. The activity that produces such an event is referred to in probability theory as an experiment.

Random Experiment:

If, in each trial of an experiment conducted under identical conditions, the outcome is not unique but may be any one of the possible outcomes, then such an experiment is called a random experiment.

For example, in tossing a coin, throwing a die, selecting a card from a pack of playing cards or selecting a family out of a given group of families, there are a number of possible results which can occur, but there is uncertainty as to which one of them will actually occur.

Outcome:

The result of a random experiment will be called an outcome.

Trial and Event:

Any particular performance of a random experiment is called a trial, and an outcome or a combination of outcomes is termed an event.

Exhaustive Events:

The total number of possible outcomes of a random experiment is known as the exhaustive events (or exhaustive cases).

Example: In tossing a coin, there are two exhaustive cases, viz., head and tail. In throwing a die, there are 6 exhaustive cases, since any one of the 6 faces may come uppermost.

Favourable Events:

The number of cases favourable to an event in a trial is the number of outcomes which entail the happening of the event.

Example: In drawing a card from a pack of cards, the number of cases favourable to drawing a king is 4, to drawing a diamond is 13 and to drawing a black card is 26.

Mutually Exclusive Events:

Events are said to be mutually exclusive or incompatible if the happening of any one of them rules out the happening of all the others in the same trial.

Examples: In throwing a die, all the 6 faces numbered 1 to 6 are mutually exclusive, since if any one of these faces comes up, the possibility of the others, in the same trial, is ruled out. Similarly, in tossing a coin the events head and tail are mutually exclusive.

Equally likely events:

Outcomes of a trial are said to be equally likely if, taking into consideration all the relevant evidence, there is no reason to expect any one of them in preference to the others.

Example: In a random toss of an unbiased or uniform coin, head and tail are equally likely events. Similarly, in throwing an unbiased die, all six faces are equally likely to come up.

Independent Events:

Several events are said to be independent if the happening (or non-happening) of an event is not affected by supplementary knowledge concerning the occurrence of any number of the remaining events.

Example: In tossing an unbiased coin, the event of getting a head in the first toss is independent of getting a head in the second, third and subsequent tosses. Similarly, when a die is thrown twice, the result of the first throw does not affect the result of the second throw.

Sample space:

The set of all possible outcomes of a given random experiment is called the sample space associated with that experiment.

Example: In a random toss of two coins (or two tosses of a single coin), the sample space S is given by: S = { HH, HT, TH, TT }
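The sample space of a repeated experiment can also be listed mechanically. The following is a minimal Python sketch, not part of the original notes; the function name `sample_space` is an illustrative assumption.

```python
from itertools import product

def sample_space(outcomes, trials):
    """Return every ordered sequence of `trials` outcomes as a string."""
    return ["".join(seq) for seq in product(outcomes, repeat=trials)]

two_coins = sample_space("HT", 2)
print(two_coins)   # ['HH', 'HT', 'TH', 'TT']
```

The same helper lists the 36 outcomes of two throws of a die when called as `sample_space("123456", 2)`.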

1.4 TYPES OF PROBABILITY

There are three types of probability. They are:

1. Classical or a priori probability

2. Statistical or relative frequency probability

3. Subjective probability

1.4.1 Classical probability

If a random experiment or a trial results in ‘n’ exhaustive, mutually exclusive and equally likely outcomes, out of which ‘m’ are favourable to the occurrence of an event E, then the probability ‘p’ of occurrence of E, usually denoted by P(E), is given by:

$$p = P(E) = \frac{\text{Number of favourable cases}}{\text{Total number of exhaustive events}} = \frac{m}{n}$$

Note: Since $m \ge 0$ and $n > 0$, we get $P(E) \ge 0$; and since $m \le n$, we get $P(E) \le 1$, which implies that $0 \le P(E) \le 1$.

Classical probability is often called a priori probability because, if we keep using orderly examples such as fair coins, unbiased dice and standard decks of cards, we can state the answer in advance (a priori) without tossing a coin, rolling a die, or drawing a card.

Complementary event:

The non-happening of the event E is called the complementary event of E and is denoted by $\bar{E}$. The number of cases favourable to $\bar{E}$, i.e., to the non-happening of E, is (n - m). Then the probability of $\bar{E}$, denoted by q, is

$$q = P(\bar{E}) = \frac{n - m}{n} = 1 - \frac{m}{n} = 1 - P(E) = 1 - p$$

which implies p + q = 1

1.4.2 Statistical probability:

If an experiment is performed repeatedly under essentially homogeneous and identical conditions, then the limiting value of the ratio of the number of times the event occurs to the number of trials, as the number of trials becomes indefinitely large, is called the probability of happening of the event, it being assumed that the limit is finite and unique.

Symbolically, if in N trials an event E happened M times, then the probability of the happening of E, denoted by P(E), is given by:

$$P(E) = \lim_{N \to \infty} \frac{M}{N}$$
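The relative frequency definition can be tried out by simulation. The sketch below is not from the original notes: it assumes a fair coin simulated with Python's pseudo-random generator, and the helper names `relative_frequency` and `toss_coin` are illustrative. It prints the ratio M/N for increasing N, which should settle near the classical value 1/2.

```python
import random

def relative_frequency(event, experiment, trials, seed=0):
    """Estimate P(event) as M/N over `trials` repetitions of `experiment`."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(trials) if event(experiment(rng)))
    return hits / trials

def toss_coin(rng):
    """One trial of the random experiment: a single toss of a fair coin."""
    return rng.choice("HT")

for n in (100, 10_000, 100_000):
    estimate = relative_frequency(lambda outcome: outcome == "H", toss_coin, n)
    print(n, estimate)
```

A finite simulation only approximates the limit; the estimate fluctuates around 1/2 and the fluctuations shrink as N grows.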

1.4.3 Subjective probability:

The probabilities (chances) of occurrence of events are assigned by individuals and are based on their personal judgement, wisdom, intuition and expertise. These probabilities are called subjective probabilities and represent the degree of belief and confidence one has in the occurrence of the respective event. Such probabilities can be assigned in all cases, though the probabilities assigned to the same event may vary from person to person.

Examples:

1. A cricket expert says that the chance of Australia’s winning the World Cup is 80 percent.

2. In the view of a stocks and shares expert, the chance that the price of the shares of a particular company will double in the next month is 75 percent.

3. An old man says that the chance of rain on a particular day is 60 percent.

Here is one more illustration of this kind of probability assignment. A judge is deciding whether to allow the construction of a nuclear power plant on a site where there is some evidence of a geological fault. He must ask himself, “What is the probability of a major nuclear accident at this location?” The fact that there is no relative frequency evidence of previous accidents at this location does not excuse him from making a decision. He must use his best judgment in trying to determine the subjective probability of a nuclear accident.

Because most high-level social and managerial decisions are concerned with specific, unique situations, rather than with a long series of identical situations, decision makers at this level make considerable use of subjective probabilities.

1.5 PROBABILITY RULES

1.5.1 Addition rule for mutually exclusive events:

If A and B are mutually exclusive events, then

$$P(A \cup B) = P(A) + P(B)$$

Here, the symbol $A \cup B$ is called A union B, meaning that A occurs, or B occurs, or both A and B occur simultaneously. When A and B are mutually exclusive, A and B cannot occur simultaneously. In a Venn diagram the two events appear as separate regions: there are no common elements for A and B when they are mutually exclusive events.

[Venn diagram: two non-overlapping circles, A and B]

Page 6: DBA1602 STATISTICS.pdf

DBA 1602

NOTES

6 ANNA UNIVERSITY CHENNAI

Illustration 1.1

From a pack of well-shuffled cards, a card is picked at random. What is the probability that the selected card is a King or a Queen?

Solution:

Let A = getting a King

Let B = getting a Queen

There are 4 Kings and 4 Queens. The events are clearly mutually exclusive.

Applying the formula, P(A ∪ B) = P(A) + P(B) = 4/52 + 4/52 = 8/52 = 2/13.

Hence the answer is 2/13.

Illustration 1.2

Determine the probability of the following events in drawing a card from a standard pack of 52 cards.

i. A card with number 7.

ii. A black card.

iii. An ace or a king.

iv. A black 2 or a black 3.

v. A red face card.

Solution:

i. A card with number 7.

Total number of cards with number “7” = 4.

Therefore, P (E) = 4/52 = 1/13.

ii. A black card.

Total number of black cards = 26.

Therefore, P (E) = 26/52 = 1/2.

iii. An ace or a king.

Total number of aces = 4.

Total number of kings = 4.

Therefore, P (E) = 4/52 + 4/52 = 8/52 = 2/13.

iv. A black 2 or a black 3.

Total number of black “2”s = 2.

Total number of black “3”s = 2.

Therefore, P (E) = 2/52 + 2/52 = 4/52 = 1/13.

v. A red face card.

Total number of red face cards = 6.

Therefore, P (E) = 6/52 = 3/26.
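The card counts above can be checked by enumerating the deck. This is a minimal Python sketch, not part of the original notes; the deck representation and the helper names `prob`, `is_black` and `is_face` are assumptions made for illustration. Exact fractions are used so that the results match the hand calculation.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "clubs", "hearts", "diamonds"]
DECK = list(product(RANKS, SUITS))          # 52 (rank, suit) pairs

def prob(event):
    """Classical probability m/n: favourable cards over all cards."""
    favourable = sum(1 for card in DECK if event(card))
    return Fraction(favourable, len(DECK))

def is_black(card):
    return card[1] in ("spades", "clubs")

def is_face(card):
    return card[0] in ("J", "Q", "K")

p_seven = prob(lambda c: c[0] == "7")
p_black = prob(is_black)
# Aces and kings are mutually exclusive, so the probabilities simply add.
p_ace_or_king = prob(lambda c: c[0] == "A") + prob(lambda c: c[0] == "K")
p_black_2_or_3 = prob(lambda c: is_black(c) and c[0] in ("2", "3"))
p_red_face = prob(lambda c: is_face(c) and not is_black(c))

print(p_seven, p_black, p_ace_or_king, p_black_2_or_3, p_red_face)
# 1/13 1/2 2/13 1/13 3/26
```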

Illustration 1.3

A labour officer has tabulated the outcomes of a poll that he conducted regarding the new wage revision. The poll was conducted among machinists and supervisors. The data are as below:

OPINIONS            MACHINISTS   SUPERVISORS
Strongly support         9            10
Mildly support          11             3
Undecided                2             2
Mildly oppose            4             8
Strongly oppose          4             4
Total                   30            30

i. What is the probability that a machinist randomly selected from the group mildly supports the package?

Solution:

Total number of machinists = 30.

Total number of machinists supporting mildly = 11.

Therefore, P (E) = 11/30.

ii. What is the probability that a supervisor is undecided?

Solution:

Total number of supervisors = 30.

Total number of supervisors undecided = 2.

Therefore, P (E) = 2/30 = 1/15.

iii. What is the probability that a worker strongly or mildly supports the package?

Solution:

Total number of workers = 60.

Total number of workers strongly or mildly supporting = 9+10+11+3 = 33.

Therefore, P (E) = 33/60 = 11/20.

Illustration 1.4

A study of the annual sales commissions of 300 media salespeople yielded the following frequency distribution:

ANNUAL COMMISSION     FREQUENCY
0 – 4,999                 15
5,000 – 9,999             25
10,000 – 14,999           35
15,000 – 19,999          125
20,000 – 24,999           70
25,000 +                  30

What is the probability of a salesperson receiving a commission:

i. Between 5,000 and less than 10,000.

ii. Less than 15,000.

iii. Greater than or equal to 20,000.

iv. More than 15,000 but less than 20,000.

Solution:

To solve, we should first find the probability of each group.

ANNUAL COMMISSION     FREQUENCY   PROBABILITY
0 – 4,999                 15      15/300 = 1/20
5,000 – 9,999             25      25/300 = 1/12
10,000 – 14,999           35      35/300 = 7/60
15,000 – 19,999          125      125/300 = 5/12
20,000 – 24,999           70      70/300 = 7/30
25,000 +                  30      30/300 = 1/10
Total                    300

i. Probability of a person receiving commission between 5,000 and less than 10,000 = 1/12.

ii. Probability of a person receiving commission less than 15,000 = 1/20 + 1/12 + 7/60 = 1/4.

iii. Probability of a person receiving commission greater than or equal to 20,000 = 7/30 + 1/10 = 1/3.

iv. Probability of a person receiving commission more than 15,000 but less than 20,000 = 5/12.
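Turning a frequency table into probabilities is a purely mechanical step, so it can be sketched in a few lines of Python. The code below is illustrative only and not part of the original notes; the tuple layout of `CLASSES` and the helper `prob` are assumptions. Each class probability is its frequency divided by the total of 300, and the classes are mutually exclusive, so their probabilities add.

```python
from fractions import Fraction

# (lower bound, upper bound or None for the open-ended class, frequency)
CLASSES = [
    (0, 4_999, 15),
    (5_000, 9_999, 25),
    (10_000, 14_999, 35),
    (15_000, 19_999, 125),
    (20_000, 24_999, 70),
    (25_000, None, 30),
]
TOTAL = sum(freq for _, _, freq in CLASSES)     # 300

def prob(selects):
    """Add the relative frequencies of the classes picked by `selects`."""
    return sum(Fraction(freq, TOTAL) for lo, hi, freq in CLASSES if selects(lo, hi))

p_5k_to_10k = prob(lambda lo, hi: lo == 5_000)
p_below_15k = prob(lambda lo, hi: hi is not None and hi < 15_000)
p_at_least_20k = prob(lambda lo, hi: lo >= 20_000)
p_15k_to_20k = prob(lambda lo, hi: lo == 15_000)

print(p_5k_to_10k, p_below_15k, p_at_least_20k, p_15k_to_20k)
# 1/12 1/4 1/3 5/12
```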

Illustration 1.5

The probability distribution of the daily demand for newspapers in a newspaper shop is given below:

NO. OF NEWSPAPERS SOLD    20     21     22     23     24     >24
PROBABILITY               0.1    0.15   0.2    0.3    0.15   0.1

What is the probability of:

i. Selling 23 newspapers.

ii. Selling less than 22 newspapers.

iii. Selling maximum of 24 newspapers.

iv. Selling not more than 21 newspapers.

v. Selling more than 22 newspapers.

Solution:

i. P( 23 newspapers ) = 0.3 = 30%.

ii. P( <22 newspapers ) = 0.1 + 0.15 = 0.25 = 25%.

iii. P( max 24 newspapers ) = 0.1 + 0.15 + 0.2 + 0.3 + 0.15 = 0.9 = 90%.

iv. P( not more than 21 newspapers ) = 0.1 + 0.15 = 0.25 = 25%.

v. P( more than 22 newspapers ) = 0.3 + 0.15 + 0.1 = 0.55 = 55%.

1.5.2 Addition rule for not mutually exclusive events:

If A and B are not mutually exclusive events, then

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Here, the symbol $A \cap B$ is called A intersection B, meaning that both A and B occur simultaneously. In a Venn diagram the two events overlap, and the common elements of A and B form the region $A \cap B$. The term $P(A \cap B)$ is subtracted because of double counting: this region is included in event A as well as in event B.

[Venn diagram: two overlapping circles, A and B, with the overlap labelled A ∩ B]

Illustration 1.6:

From a pack of well-shuffled cards, a card is picked at random. What is the probability that the selected card is a King or a Diamond?

Solution:

There are 52 cards in a pack, of which 4 are Kings and 13 are Diamonds. Let A = getting a King and B = getting a Diamond.

The two events here are not mutually exclusive, because one card, the King of Diamonds, is both a King and a Diamond. Applying the formula for the non-mutually exclusive case, namely,

$$P(A \cup B) = P(A) + P(B) - P(A \cap B),$$

we have $P(A) = \frac{4}{52}$, $P(B) = \frac{13}{52}$ and $P(A \cap B) = \frac{1}{52}$.

Substituting and simplifying, we have:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{4}{52} + \frac{13}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$$

Illustration 1.7

What is the probability of drawing an ace or a heart from a standard deck of cards?

Solution:

P( Ace or Heart ) = P(Ace) + P(Heart) – P(Ace of Hearts)

= 4/52 + 13/52 - 1/52

= 16/52

= 4/13
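The addition rule for events that are not mutually exclusive can be checked against a direct count. The following Python sketch is an illustration rather than part of the original notes; events are modelled as sets of cards, and the names `prob` and `union_prob` are assumptions.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "clubs", "hearts", "diamonds"]
DECK = set(product(RANKS, SUITS))           # 52 (rank, suit) pairs

def prob(cards):
    """Classical probability of an event given as a set of cards."""
    return Fraction(len(cards), len(DECK))

def union_prob(a, b):
    """General addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return prob(a) + prob(b) - prob(a & b)

kings = {c for c in DECK if c[0] == "K"}
diamonds = {c for c in DECK if c[1] == "diamonds"}
aces = {c for c in DECK if c[0] == "A"}
hearts = {c for c in DECK if c[1] == "hearts"}

print(union_prob(kings, diamonds))                              # 4/13
print(union_prob(aces, hearts))                                 # 4/13
print(union_prob(kings, diamonds) == prob(kings | diamonds))    # True
```

The last line compares the rule with a direct count of the 16 cards in the union; without the subtraction, the King of Diamonds would be counted twice.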

Illustration 1.8

The profile of the nominated members of an executive committee is given below:

GENDER   M    M    L    L    M
AGE      30   32   45   20   40

If one of these members is to be randomly selected to head the committee, what is the probability that the committee head is a lady or a person older than 35 years?

Solution:

P( head to be a lady ) = 2/5

P( head to be more than 35 years old ) = 2/5

P( head to be a lady of age greater than 35 years ) = 1/5

Therefore, P( head to be a lady or a person >35 years old ) = 2/5 + 2/5 - 1/5 = 3/5

1.5.3 Statistically independent events

Events are statistically independent when the occurrence of one event has no effect on the probability of the occurrence of any other event.

Marginal probabilities (unconditional)

A marginal (unconditional) probability is the simple probability of the occurrence of a single event.

Multiplication rule for independent events:

If two events A and B are independent, the probability of the simultaneous occurrence of A and B (also known as the probability of the intersection of A and B) equals the product of the probability of A and the probability of B:

$$P(A \cap B) = P(A)\,P(B)$$

Of course, this rule can be extended to more than two events.

Joint probability under statistical independence

The probability of two or more independent events occurring together or in succession is the product of their marginal probabilities. For two independent events,

P(AB) = P(A) * P(B), where

P(AB) = Probability of events A and B occurring together or in succession.

P(A) = Marginal Probability of A occurring.

P(B) = Marginal Probability of B occurring.
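The multiplication rule for independent events can be verified by enumeration. The sketch below is not from the original notes; it assumes two throws of a fair die, and the two events chosen (a six on the first throw, an even number on the second) are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two throws of a fair die: 36 equally likely pairs.
THROWS = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favourable pairs over all pairs."""
    return Fraction(sum(1 for t in THROWS if event(t)), len(THROWS))

def first_is_six(t):
    return t[0] == 6

def second_is_even(t):
    return t[1] % 2 == 0

p_a = prob(first_is_six)                                        # marginal P(A)
p_b = prob(second_is_even)                                      # marginal P(B)
p_ab = prob(lambda t: first_is_six(t) and second_is_even(t))    # joint P(AB)

print(p_a, p_b, p_ab)          # 1/6 1/2 1/12
print(p_ab == p_a * p_b)       # True
```

The equality in the last line holds because the first throw does not affect the second; for dependent events the joint probability generally differs from the product of the marginals.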

Illustration 1.9

The probability that you will get an A grade in Quantitative Methods is 0.7. The probability that you will get an A grade in Marketing is 0.5. Assuming that these two courses are independent, compute the probability that you will get an A grade in both subjects.

Solution:

Let A = getting an A grade in Quantitative Methods

Let B = getting an A grade in Marketing

It is given that A and B are independent.

Applying the formula, we get

$$P(A \cap B) = P(A) \cdot P(B) = 0.7 \times 0.5 = 0.35$$

1.5.4 Conditional probability:

Conditional probability is the probability that a second event (B) will happen if a first event (A) has already happened:

$$P(B/A) = \frac{P(A \cap B)}{P(A)}, \quad \text{provided } P(A) \neq 0$$

Here the symbol B/A is read as “B given A”: the event B happens given that event A has already happened.

Illustration 1.10

When a fair die is tossed, the conditional probability of getting ‘1’, given that an odd number has been obtained, is equal to 1/3, as explained below:

S = { 1, 2, 3, 4, 5, 6 }; A = { 1, 3, 5 }; B = { 1 };

$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{1/6}{1/2} = \frac{1}{3}$$

Illustration 1.11

A box contains 75 marbles. 35 of them are blue, and 25 of these blue marbles have designs on them. The rest are red, and 30 of the red ones have designs on them. The marbles that do not have designs are plain. What is the probability of:

i. Drawing a blue marble.

ii. Drawing a plain marble.

iii. Drawing a blue designed marble.

iv. Drawing a red plain marble.

v. Drawing a designed marble.

Solution:

i. P( drawing a blue marble ) = 35/75 = 7/15.

ii. P( drawing a plain marble ) = P( plain blue ) + P( plain red )

= 10/75 + 10/75

= 20/75

= 4/15.

iii. P( drawing a blue designed marble ) = No. of blue designed marbles / Total no. of marbles

= 25/75

= 1/3.

iv. P( drawing a red plain marble ) = No. of red plain marbles / Total no. of marbles

= 10/75

= 2/15.

v. P( drawing a designed marble ) = No. of designed marbles / Total no. of marbles

= 55/75

= 11/15.

(OR)

P( drawing a designed marble ) = P( red designed ) + P( blue designed )

= 30/75 + 25/75

= 55/75

= 11/15.
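The marble counts can be organised as a small two-way table, from which every answer above follows by summing the matching cells. This Python sketch is illustrative and not part of the original notes; the dictionary layout and the helper `prob` are assumptions, and the plain counts are derived from the problem statement (35 - 25 = 10 plain blue; 75 - 35 = 40 red, of which 40 - 30 = 10 are plain).

```python
from fractions import Fraction

# Counts by (colour, finish), derived from the problem statement.
MARBLES = {
    ("blue", "designed"): 25,
    ("blue", "plain"): 10,
    ("red", "designed"): 30,
    ("red", "plain"): 10,
}
TOTAL = sum(MARBLES.values())       # 75

def prob(colour=None, finish=None):
    """P(marble matches the given colour and/or finish); None matches any."""
    count = sum(
        n for (c, f), n in MARBLES.items()
        if colour in (None, c) and finish in (None, f)
    )
    return Fraction(count, TOTAL)

print(prob(colour="blue"))          # 7/15
print(prob(finish="plain"))         # 4/15
print(prob("blue", "designed"))     # 1/3
print(prob("red", "plain"))         # 2/15
print(prob(finish="designed"))      # 11/15
```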

Conditional probability under statistical independence

If events A and B are statistically independent, the probability of event B given that event A has occurred is simply the probability of B:

P(B/A) = P(B)

Probabilities under conditions of statistical dependence

Statistical dependence exists when the probability of some event is dependent on, or is affected by, the occurrence of some other event.

Conditional probability

P(B/A) = P(AB) / P(A)

Illustration 1.12

There is a box with 10 balls, of which

3 are yellow and dotted

1 is yellow and striped

2 are grey and dotted

4 are grey and striped.

If a ball is drawn from the box, what is the probability that

i. It is dotted.

ii. It is striped, given that it is yellow.

iii. It is dotted, given that it is grey.

iv. It is striped, given that it is grey.

v. It is grey, given that it is dotted.

vi. It is yellow, given that it is dotted.

vii. It is yellow, given that it is striped.

viii. It is grey, given that it is striped.

ix. It is yellow.

x. It is grey.

Solution:

EVENT   PROBABILITY   DESCRIPTION
1       0.1           Yellow Dotted (Y + D)
2       0.1
3       0.1
4       0.1           Yellow Striped (Y + S)
5       0.1           Grey Dotted (G + D)
6       0.1
7       0.1           Grey Striped (G + S)
8       0.1
9       0.1
10      0.1

i. P(dotted) = 5/10 = 1/2

ii. P(S/Y) = 0.1/0.4 = 1/4

iii. P(D/G) = 0.2/0.6 = 1/3

iv. P(S/G) = 0.4/0.6 = 2/3

v. P(G/D) = P(GD)/P(D) = 0.2/0.5 = 2/5

vi. P(Y/D) = P(YD)/P(D) = 0.3/0.5 = 3/5

vii. P(Y/S) = P(YS)/P(S) = 0.1/0.5 = 1/5

viii. P(G/S) = P(GS)/P(S) = 0.4/0.5 = 4/5

ix. P(getting a yellow ball) = P(YD) + P(YS)

= 3/10 + 1/10

= 4/10

= 2/5.

x. P(getting a grey ball) = P(GD) + P(GS)

= 2/10 + 4/10

= 6/10

= 3/5.
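The conditional probabilities in this illustration all follow the same pattern, a joint probability divided by a marginal one, so they can be computed from the ball counts directly. The Python sketch below is illustrative and not part of the original notes; the dictionary layout and the function names are assumptions.

```python
from fractions import Fraction

# Ball counts by (colour, pattern), as given in the illustration.
BALLS = {
    ("yellow", "dotted"): 3,
    ("yellow", "striped"): 1,
    ("grey", "dotted"): 2,
    ("grey", "striped"): 4,
}
TOTAL = sum(BALLS.values())     # 10

def prob(colour=None, pattern=None):
    """Marginal or joint probability from the counts; None matches any."""
    count = sum(
        n for (c, p), n in BALLS.items()
        if colour in (None, c) and pattern in (None, p)
    )
    return Fraction(count, TOTAL)

def pattern_given_colour(pattern, colour):
    """P(pattern / colour) = P(colour and pattern) / P(colour)."""
    return prob(colour, pattern) / prob(colour=colour)

def colour_given_pattern(colour, pattern):
    """P(colour / pattern) = P(colour and pattern) / P(pattern)."""
    return prob(colour, pattern) / prob(pattern=pattern)

print(pattern_given_colour("striped", "yellow"))   # 1/4
print(pattern_given_colour("dotted", "grey"))      # 1/3
print(colour_given_pattern("grey", "dotted"))      # 2/5
print(colour_given_pattern("grey", "striped"))     # 4/5
```

Note that P(D/G) = 1/3 and P(G/D) = 2/5 differ: conditioning on grey divides by P(grey) = 6/10, while conditioning on dotted divides by P(dotted) = 5/10.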

Joint probability equation

P(BA) = P(B/A) * P(A)

Consider Illustration 1.12. The probability that a ball is grey, given that it is striped, is 80%, and 5 of the 10 balls are striped. What is the probability of getting a grey striped ball?

Solution:

P(GS) = P(G/S) * P(S)

= 4/5 * 0.5

= 0.4.

Multiplication rule for dependent events:

The probability of the simultaneous occurrence of two events A and B is equal to the product of the probability of one of the events and the conditional probability of the other, given that the first one has already occurred:

$$P(A \cap B) = P(A)\,P(B/A) \quad \text{or} \quad P(A \cap B) = P(B)\,P(A/B)$$

Illustration 1.13:

From a pack of cards, 2 cards are drawn in succession, one after the other. After each draw, the selected card is not replaced. What is the probability that you will get Spades in both draws?

Solution:

Let A = getting a Spade in the first draw.

Let B = getting a Spade in the second draw.

The cards are not replaced, so this situation requires the use of conditional probability.

P (A) = 13/52 (There are 13 Spades and 52 cards in a pack).

P (B/A) = 12/51 (There are 12 Spades and 51 cards left, because the card selected is not replaced after the first draw).

Applying the formula $P(A \cap B) = P(A)\,P(B/A)$, we have

$$P(A \cap B) = P(A)\,P(B/A) = \frac{13}{52} \cdot \frac{12}{51} = \frac{156}{2652} = \frac{1}{17}$$

1.6 REVISING PRIOR ESTIMATES OF PROBABILITY: BAYES’ THEOREM

Revised probabilities = Posterior probabilities.

If correctly used, Bayes’ theorem makes it unnecessary to collect masses of data over long periods of time in order to make good decisions based on probabilities.

Bayes’ theorem (single event):

$$P(B/A) = \frac{P(A \cap B)}{P(A)}$$

If $E_1, E_2, E_3, \ldots, E_n$ are mutually disjoint events with $P(E_i) \neq 0$ $(i = 1, 2, \ldots, n)$, then for any arbitrary event A which is a subset of $\bigcup_{i=1}^{n} E_i$ such that $P(A) > 0$, we have

$$P(E_i/A) = \frac{P(E_i)\,P(A/E_i)}{\sum_{i=1}^{n} P(E_i)\,P(A/E_i)} = \frac{P(E_i)\,P(A/E_i)}{P(A)}; \quad i = 1, 2, \ldots, n$$

Proof: Since $A \subset \bigcup_{i=1}^{n} E_i$, we have

$$A = A \cap \left( \bigcup_{i=1}^{n} E_i \right) = \bigcup_{i=1}^{n} (A \cap E_i)$$

Since $(A \cap E_i) \subset E_i$ $(i = 1, 2, \ldots, n)$ are mutually disjoint events, we have by the addition theorem of probability:

$$P(A) = P\left( \bigcup_{i=1}^{n} (A \cap E_i) \right) = \sum_{i=1}^{n} P(A \cap E_i) = \sum_{i=1}^{n} P(E_i)\,P(A/E_i)$$

by the multiplication theorem of probability. Also, we have $P(A \cap E_i) = P(A)\,P(E_i/A)$, so that

$$P(E_i/A) = \frac{P(A \cap E_i)}{P(A)} = \frac{P(E_i)\,P(A/E_i)}{\sum_{i=1}^{n} P(E_i)\,P(A/E_i)}; \quad i = 1, 2, \ldots, n$$

Illustration 1.14

A manufacturing firm is engaged in the production of steel pipes in its three plants, with a daily production of 1000, 1500 and 2000 units respectively. According to past experience, it is known that the fractions of defective pipes produced by the three plants are 0.04, 0.09 and 0.07 respectively. If a pipe is selected from a day’s total production and found to be defective, find out which plant is most likely to have produced it. Also find the probability that it has come from the second plant.

Solution:

             TOTAL PRODUCTION   DEFECTIVE FRACTION
PLANT A           1000                0.04
PLANT B           1500                0.09
PLANT C           2000                0.07

Prior probabilities are the probabilities of selecting a pipe from plant A, B or C. The total daily production is 1000 + 1500 + 2000 = 4500 units, so

P(A) = 1000/4500 = 2/9

P(B) = 1500/4500 = 1/3

P(C) = 2000/4500 = 4/9

Joint Probabilities

Plant A: 2/9 * 0.04 = 0.0089

Plant B: 1/3 * 0.09 = 0.0300

Plant C: 4/9 * 0.07 = 0.0311

Hence, Plant C has the greatest probability of having produced the defective pipe, followed closely by Plant B.

P(chosen defective pipe from second plant)

Writing E1, E2, E3 for the three plants and A for the event that the pipe is defective,

P(Ei/A) = P(Ei) * P(A/Ei) / Σ P(Ei) * P(A/Ei), where

Σ P(Ei) P(A/Ei) = P(E1) P(A/E1) + P(E2) P(A/E2) + P(E3) P(A/E3)

P(E2/A) = 0.0300 / (0.0089 + 0.0300 + 0.0311)

= 0.0300 / 0.0700

= 0.4286

Hence, the probability that the chosen defective pipe is from the second plant is 0.4286.

Illustration 1.15:

A manufacturing firm is engaged in the production of steel pipes in its three plants, with a daily production of 1,000, 1,500 and 2,500 units respectively. According to past experience, it is known that the fractions of defective pipes produced by the three plants are respectively 0.04, 0.09 and 0.07. If a pipe is selected from a day’s total production and found to be defective, find out (a) from which plant the defective pipe has most probably come, and (b) the probability that it has come from the second plant.

Solution:

Let the probabilities of the possible events be

Probability that a pipe is manufactured in plant A = P(E1) = 1,000/(1,000+1,500+2,500) = 0.2

Probability that a pipe is manufactured in plant B = P(E2) = 1,500/(1,000+1,500+2,500) = 0.3

Probability that a pipe is manufactured in plant C = P(E3) = 2,500/(1,000+1,500+2,500) = 0.5

Let P(D) be the probability that a defective pipe is drawn. Given that the proportions of defective pipes coming from the three plants are 0.04, 0.09 and 0.07 respectively, these are, in fact, the conditional probabilities: P(D/E1) = 0.04, P(D/E2) = 0.09 and P(D/E3) = 0.07.

Now we can multiply the prior probabilities and the conditional probabilities in order to obtain the joint probabilities.

Joint probabilities are

Plant A 0.04 x 0.2 = 0.008

Plant B 0.09 x 0.3 = 0.027

Plant C 0.07 x 0.5 = 0.035

Now we can obtain the posterior probabilities by the following calculations:

$$P(E_i/D) = \frac{P(D/E_i)\,P(E_i)}{\sum_{i=1}^{3} P(D/E_i)\,P(E_i)}$$

Plant A

$$P(E_1/D) = \frac{0.008}{0.008 + 0.027 + 0.035} = 0.114$$

Plant B

$$P(E_2/D) = \frac{0.027}{0.008 + 0.027 + 0.035} = 0.386$$

Plant C

$$P(E_3/D) = \frac{0.035}{0.008 + 0.027 + 0.035} = 0.500$$
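The posterior calculation for Illustration 1.15 can be sketched as a short Python function. This is an illustration, not part of the original notes; the function name `posteriors` and its argument names are assumptions. It multiplies each prior by its conditional probability to get the joint probabilities, then divides each joint probability by their sum, P(D).

```python
def posteriors(priors, likelihoods):
    """Return P(E_i / D) for each i, given P(E_i) and P(D / E_i)."""
    joints = [p * l for p, l in zip(priors, likelihoods)]   # P(E_i) * P(D / E_i)
    total = sum(joints)                                     # P(D)
    return [j / total for j in joints]

production = [1_000, 1_500, 2_500]                          # plants A, B, C
priors = [units / sum(production) for units in production]  # 0.2, 0.3, 0.5
defect_rates = [0.04, 0.09, 0.07]

for plant, post in zip("ABC", posteriors(priors, defect_rates)):
    print(plant, round(post, 3))
# A 0.114
# B 0.386
# C 0.5
```

The posteriors always sum to 1, because each joint probability is divided by the same total.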

The information leading to the posterior probabilities is summarized in the table below.

Computation of Posterior Probabilities

Event   Prior    Conditional   Joint                  Posterior
        P(Ei)    P(D/Ei)       P(Ei ∩ D)              P(Ei/D)
(1)     (2)      (3)           (4)                    (5) = (4)/P(D)
E1      0.2      0.04          0.04 x 0.2 = 0.008     0.008/0.07 = 0.11
E2      0.3      0.09          0.09 x 0.3 = 0.027     0.027/0.07 = 0.39
E3      0.5      0.07          0.07 x 0.5 = 0.035     0.035/0.07 = 0.50
Total   1.0                    P(D) = 0.07            1.00

On the basis of these calculations, we can say that (a) the defective pipe has most probably come from plant C, and (b) the probability that the defective pipe has come from the second plant is 0.39.

1.6.1 Prior Probability VS Posterior Probability

We have seen in the previous table that, as additional information becomes available, it can be used to revise the prior probability. The revised probability is called the posterior probability. Management should know how to use additional information to revise its prior probabilities. However, before collecting any additional information, it should also assess the utility or worth of that information. It may, at times, find that the cost of obtaining the additional information is more than its actual worth. In such cases, it is obviously not advisable to go in for any additional information, and management should be satisfied with the prior probabilities.

1.7 PROBABILITY DISTRIBUTIONS

In the previous section we encountered some experiments where the outcomes were categorical. We found that an experiment may result in a number of possible outcomes, and discussed how the probability of the occurrence of an outcome can be determined. In this section, we shall extend our discussion of probability theory. Here our focus will be on the probability distribution, which describes how probability is spread over the possible numerical values associated with the outcomes.

After reading this section you will be able to:

Define random variables.

Appreciate what is probability distribution.

Explain and use the Binomial distribution.

Explain and use the Poisson distribution.

Explain and use the Uniform distribution.

Explain and use the Normal distribution.

1.7.1 Random variable

A variable is said to be random if it takes on different values as a result of the outcomesof a random experiment.

For instance, in an experiment of a single throw of a die, if the possible outcomes {1,2, 3, 4, 5, 6} is denoted by X, then X is a random variable, and values of this randomvariable would be 1, 2, 3, 4, 5, 6. Similarly, in four tosses of a fair coin, the randomvariable, say Y, the possible number of heads can have any value out of 0, 1, 2, 3 and 4heads. Thus, a random variable is a function that assigns a real number to each element ina sample space.

A random variable is said to be a discrete random variable if its possible values arecountable (finite or count-ably infinite). In other words, a discrete random variable impliesthat the random variable can assume only a restricted number of distinct values that arewhole numbers. From the above instances X and Y are discrete random variables. Thenumber of units demanded per day of a product is an example of discrete random variable.The number of cars pass in through a street per hour during peak traffic is another exampleof a discrete random variable.

In contrast, if a random variable represents a measurement on a continuous scale sothat all values in an interval are possible, it is called a continuous random variable. Inother words, a continuous random variable is a random variable, which can take anyvalue within some interval of real number. Examples of a continuous random variable areprice of a car and daily consumption of milk. Measurement of the height and weight of therespondents is an example of a continuous random variable. Similarly, voltage, pressure,and temperature, are example of continues random variable.

1.7.2 Probability distribution

A probability distribution is a complete listing of the various values the random variable can take, along with the corresponding probability for each value. A real-life example would be the pattern of machine breakdowns in a manufacturing unit. The random variable in this example would be the number of machine breakdowns. The probability corresponding to each value is the relative frequency of occurrence of that number of breakdowns. The probability distribution for this example is constructed from the actual breakdown pattern observed over a period of time.

Illustration 1.16

A multinational bank is concerned about the waiting time of its customers for using its ATMs. A study of a random sample of 500 customers reveals the following probability distribution:

X (waiting time per customer in minutes)    P(X)
0                                           0.20
1                                           0.18
2                                           0.16
3                                           0.12
4                                           0.10
5                                           0.09
6                                           0.08
7                                           0.04
8                                           0.03
Total                                       1.00

(a) What is the probability that a customer will have to wait more than 5 minutes?
(b) What is the probability that a customer need not wait?
(c) What is the probability that a customer will have to wait less than 4 minutes?

Solution:

(a) P(X>5) = P(6)+P(7)+P(8) = 0.08+0.04+0.03 = 0.15

(b) P(X=0) = 0.20

(c) P(X<4) = P(0)+P(1)+P(2)+P(3) = 0.20+0.18+0.16+0.12 = 0.66

Types of probability distributions

There are two types of probability distributions. They are:

1. Discrete probability distribution

2. Continuous probability distribution

Discrete probability distribution

The examples we have taken above to explain the concept of a probability distribution are discrete probability distributions. A probability distribution that uses a discrete random variable is called a discrete probability distribution. The corresponding probability function is known as a probability mass function. In this unit, we will deal with two discrete probability distributions, namely 1) the Binomial distribution and 2) the Poisson distribution.
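The waiting-time distribution of Illustration 1.16 is a probability mass function, so the probability of any event is the sum of P(X) over the values of X that make up the event. A minimal Python sketch of that idea follows; the dictionary `pmf` and the helper `prob` are illustrative names, not part of the original notes.

```python
# Waiting-time distribution from Illustration 1.16: minutes -> probability.
pmf = {0: 0.20, 1: 0.18, 2: 0.16, 3: 0.12, 4: 0.10,
       5: 0.09, 6: 0.08, 7: 0.04, 8: 0.03}

def prob(event):
    """Sum P(X = x) over every value x for which event(x) is true."""
    return sum(p for x, p in pmf.items() if event(x))

total = prob(lambda x: True)            # a valid distribution sums to 1
p_more_than_5 = prob(lambda x: x > 5)   # part (a)
p_no_wait = prob(lambda x: x == 0)      # part (b)
p_less_than_4 = prob(lambda x: x < 4)   # part (c)

print(round(p_more_than_5, 2), round(p_no_wait, 2), round(p_less_than_4, 2))
```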

Continuous probability distribution

A probability distribution that uses a continuous random variable is called a continuous probability distribution. The corresponding probability function is known as a probability density function. In this unit, we will deal with two continuous distributions, namely 1) the Uniform distribution and 2) the Normal distribution.

Binomial distribution

If we assume that a coin is tossed once, there are two possibilities: head (success) or tail (failure). The sum of the probabilities is p + q = 1, where p is the probability of success and q the probability of failure. Instead of success and failure we may also say 1 and 0.

Probability mass function:

The probability of getting x successes out of n trials is given by the following expression:

$$P(x) = {}^{n}C_{x}\, p^{x} (1-p)^{n-x} = {}^{n}C_{x}\, p^{x} q^{n-x}; \quad x = 0, 1, \ldots, n$$

where q = 1 - p, so that p + q = 1, and

mean = np, variance = npq, S.D. = $\sqrt{npq}$

Conditions necessary for Binomial distribution

At this stage, we should know that there are certain conditions that must be fulfilled by a distribution if it is to be termed a binomial distribution. These conditions are:

1. Each observation must be classified into one of two categories, such as success and failure. For example, if raw material is obtained by a firm from its suppliers, it may be classified as defective or non-defective on the basis of its normal quality. Similarly, if a die is thrown, we may call getting 4, 5 or 6 a success and getting 1, 2 or 3 a failure.

2. The probability of success (or failure) must remain the same for each observation in each trial. Thus the probability of getting a head (or tail) must remain the same in each toss of the experiment. In other words, if the probability of success (or failure) changes from trial to trial, or if the results of each trial are classified in more than two categories, then it is not possible to use the binomial distribution.

3. The trials or individual observations must be independent of each other. In other words, no trial should influence the outcome of another trial.

In general terms, the binomial distribution is given by the expansion of $(p+q)^{n}$, whose general term is ${}^{n}C_{r}\, p^{r} q^{n-r}$. Here ${}^{n}C_{r} = n! / \{r!(n-r)!\}$ is the number of ways in which we can get r successes and n-r failures out of n trials.

Illustration 1.17

In a shooting competition, the probability of a man hitting a target is 2/5. If he tries 5 times, what is the probability of hitting the target

(i) at least twice

(ii) at most twice.

Solution

n = 5, p = 2/5, q = 1 - (2/5) = 3/5

Formula

P(X >= x) = 1 - P(X < x)
P(X > x) = 1 - P(X <= x)

(a) At least twice [x >= 2]

P(x>=2) = 1 - P(x<2) = 1 - [P(X=0) + P(X=1)]
P(X=0) = 5C0 (2/5)^0 (3/5)^5 = 1 × 1 × 0.0778 = 0.0778
P(X=1) = 5C1 (2/5)^1 (3/5)^4 = 5 × 0.4 × 0.1296 = 0.2592
P(x>=2) = 1 - [0.0778 + 0.2592]

= 1 - 0.3370 = 0.6630

(b) At most twice [x <= 2]

P(x<=2) = P(X=0) + P(X=1) + P(X=2)
P(X=0) = 5C0 (2/5)^0 (3/5)^5 = 1 × 1 × 0.0778 = 0.0778
P(X=1) = 5C1 (2/5)^1 (3/5)^4 = 5 × 0.4 × 0.1296 = 0.2592
P(X=2) = 5C2 (2/5)^2 (3/5)^3 = [(5 × 4)/2] × 0.16 × 0.216 = 0.3456
P(x<=2) = 0.0778 + 0.2592 + 0.3456

= 0.6826
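Hand calculations of binomial probabilities are easy to get wrong, so it helps to check them by machine. The sketch below evaluates the probability mass function for Illustration 1.17 directly; the function name `binomial_pmf` is an illustrative choice, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 2 / 5

# (a) at least twice: complement of "zero or one hit".
p_at_least_2 = 1 - (binomial_pmf(0, n, p) + binomial_pmf(1, n, p))

# (b) at most twice: zero, one or two hits.
p_at_most_2 = sum(binomial_pmf(x, n, p) for x in range(3))

# The mean and variance computed from the pmf should equal np and npq.
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(round(p_at_least_2, 4), round(p_at_most_2, 4))
```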

Illustration 1.18

If 4 out of 12 motorists do not carry a driving license, what is the probability that a traffic inspector who randomly selects 4 motorists will catch

(i) exactly 1 for not carrying a driving license

(ii) at least 2 for not carrying a driving license

Solution

n= 4

p= 4/12 =1/3

q=1-(1/3) =2/3

(a) Probability of catching exactly one for not carrying a driving license.

P(X=1) = 4C1 (1/3)^1 (2/3)^3

= 4 × (1/3) × 0.2963

= 0.3951

(b) Probability of catching at least two for not carrying a driving license.

P(X>=2) = 1 - [P(X=0) + P(X=1)]

P(X=0) = 4C0 (1/3)^0 (2/3)^4 = 0.1975

P(X=1) = 4C1 (1/3)^1 (2/3)^3 = 0.3951

P(X>=2) = 1 - [0.1975 + 0.3951]

= 1 - 0.5926

= 0.4074

Illustration 1.19

In a large consignment of electric bulbs, 10 per cent are defective. A random sample of 20 is taken for inspection. Find the probability that

(i) all bulbs are good,
(ii) there are at most 3 defective bulbs, and
(iii) there are exactly 3 defective bulbs.

Solution:

Here n = 20; p = 10/100 = 0.1; q = 0.9.

By the Binomial distribution, the probability of getting x defective bulbs is

$$P(x) = {}^{20}C_{x}\,(0.1)^{x}(0.9)^{20-x}$$

(i) Probability of getting all good bulbs = probability of getting zero defective bulbs = P(X = 0)

$$= {}^{20}C_{0}\,(0.1)^{0}(0.9)^{20} = (0.9)^{20} = 0.1216$$

(ii) $$P(X \le 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3)$$

$$= {}^{20}C_{0}(0.1)^{0}(0.9)^{20} + {}^{20}C_{1}(0.1)^{1}(0.9)^{19} + {}^{20}C_{2}(0.1)^{2}(0.9)^{18} + {}^{20}C_{3}(0.1)^{3}(0.9)^{17}$$

= 0.1216 + 0.2702 + 0.2852 + 0.1901 = 0.8671

(iii) $$P(X=3) = {}^{20}C_{3}\,(0.1)^{3}(0.9)^{17} = 0.1901$$

Illustration 1.20

Fit the Binomial Distribution to the following observations:

x:   0    1    2    3    4    5
f:   27   14   6    3    0    0

Solution

The mean is given by Σfx / Σf.

fx is : 0 14 12 9 0 0

Σfx is : 35

Σf is : 50

Mean = np = 35/50 = 7/10

Here n = 5 [There are 5 outcomes excluding 0]

Hence, 5p = 7/10

p = 7/50

p = 0.14

q = 1 - p = 0.86

P(X=0) = 5C0 (0.14)^0 (0.86)^5 = 0.470

P(X=1) = 5C1 (0.14)^1 (0.86)^4 = 0.383

P(X=2) = 5C2 (0.14)^2 (0.86)^3 = 0.125

P(X=3) = 5C3 (0.14)^3 (0.86)^2 = 0.0203

P(X=4) = 5C4 (0.14)^4 (0.86)^1 = 0.00165

P(X=5) = 5C5 (0.14)^5 (0.86)^0 = 0.000054

The expected frequency for each x is obtained by multiplying P(x) by Σf = 50 and rounding:

x                     0       1       2       3        4         5
f (observed)          27      14      6       3        0         0
P(x)                  0.470   0.383   0.125   0.0203   0.00165   0.000054
Expected frequency    24      19      6       1        0         0
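The fitting procedure of Illustration 1.20 (estimate p from the sample mean, compute each P(x), then scale by the total frequency) can be sketched in Python as follows. The variable names are illustrative, and rounding to the nearest whole number is an assumption about how the expected frequencies were obtained.

```python
from math import comb

x_values = [0, 1, 2, 3, 4, 5]
observed = [27, 14, 6, 3, 0, 0]

n = max(x_values)                       # 5 trials
total = sum(observed)                   # sum of f = 50
mean = sum(x * f for x, f in zip(x_values, observed)) / total
p = mean / n                            # from mean = np

# Binomial probabilities with the fitted p, then expected frequencies.
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in x_values]
expected = [round(total * px) for px in pmf]

for x, f, px, e in zip(x_values, observed, pmf, expected):
    print(x, f, round(px, 6), e)
```

Comparing the observed and expected columns side by side shows how well the binomial model describes the data.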

Poisson distribution

Having discussed the binomial distribution in the preceding section, we now turn to the Poisson distribution, which is also a discrete probability distribution. It was developed by the French mathematician S.D. Poisson (1781-1840) and hence named after him. Along with the binomial distribution, the Poisson distribution is one of the most widely used distributions. It is used in quality statistics to count the number of defective items, in insurance problems to count the number of casualties, and in waiting-time problems to count the number of incoming telephone calls, incoming customers or patients arriving to consult a doctor in a given time period, and so forth. All these examples have a common feature: they can be described by a discrete random variable, which takes on integer values (0, 1, 2, 3, and so on). The characteristics of the Poisson distribution are:

1. The events occur independently. This means that the occurrence of a subsequent event is not at all influenced by the occurrence of an earlier event.

2. Theoretically, there is no upper limit on the number of occurrences of an event during a specified time period.

3. The probability of a single occurrence of an event within a specified time period is proportional to the length of the time period.

4. In an extremely small portion of the time period, the probability of two or more occurrences of an event is negligible.

Poisson distribution as an Approximation of Binomial Distribution:

As the binomial distribution involves tedious calculations, the Poisson distribution can be used in its place. However, this holds well only under certain conditions, namely when the number of trials is large while the binomial probability of success is small. When n is large and p < 0.05, the calculation of binomial probabilities becomes much simpler by using the Poisson distribution.

The formula for calculating Poisson probabilities is

$$P(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}; \quad x = 0, 1, 2, \ldots$$

mean = λ, variance = λ, S.D. = $\sqrt{\lambda}$, λ = np

If we replace λ by np in the above formula, then the Poisson distribution becomes an approximation of the Binomial:

$$P(x) = \frac{(np)^{x} e^{-np}}{x!}; \quad x = 0, 1, 2, \ldots$$

Illustration 1.21

Past experience shows that 2 per cent defective is the process capability of producing fuses in a factory. Obtain the probability that there will be more than 5 defective fuses in a box of 200 fuses. Arrive at the probability using the Binomial and Poisson distributions and comment on the results.

Solution:

Here p = 0.02; n = 200; np = 200 × 0.02 = 4 = mean

Using the Poisson distribution, the probability of getting x defective fuses is

$$P(x) = \frac{e^{-4}\, 4^{x}}{x!}; \quad P(x > 5) = 1 - P(x \le 5)$$

$$= 1 - e^{-4}\left[1 + \frac{4}{1!} + \frac{4^{2}}{2!} + \frac{4^{3}}{3!} + \frac{4^{4}}{4!} + \frac{4^{5}}{5!}\right]$$

= 1 - 0.785 = 0.215.

Using the Binomial distribution,

$$P(x) = {}^{200}C_{x}\,(0.02)^{x}(0.98)^{200-x}$$

$$P(x > 5) = 1 - P(x \le 5)$$

= 1 - [P(0)+P(1)+P(2)+P(3)+P(4)+P(5)]

= 1 - 0.787

= 0.213.

Here, we see that the probabilities calculated using both distributions are more or less equal. Hence, the Poisson approximation to the Binomial is very good here.
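The comparison in Illustration 1.21 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `poisson_pmf` and `binomial_pmf` are illustrative.

```python
from math import comb, exp, factorial

n, p = 200, 0.02
lam = n * p                      # lambda = np = 4

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial distribution with n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# P(X > 5) = 1 - P(X <= 5) under each model.
p_poisson = 1 - sum(poisson_pmf(x, lam) for x in range(6))
p_binomial = 1 - sum(binomial_pmf(x, n, p) for x in range(6))

print(round(p_poisson, 3), round(p_binomial, 3))
```

Repeating the calculation with a smaller n or a larger p shows the two answers drifting apart, which is why the approximation is recommended only when n is large and p is small.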

Uniform distribution

The preceding two distributions in this unit were discrete probability distributions. We shall now take up another distribution in which the random variable can take on any value within a given range.

Definition:

A continuous random variable X is said to be uniform with parameters a and b if it has the probability density function

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a < x < b \\ 0, & \text{otherwise} \end{cases}$$

Illustration 1.22

Trains arrive at a station at 15-minute intervals starting at 4 A.M. A passenger arrives at the station at a time that is uniformly distributed between 9.00 and 9.30. Find the probability that he has to wait for the train for

a) Less than 6 minutes

b) More than 10 minutes

Solution:

Let X be the random variable representing the number of minutes past 9 that the passenger arrives at the station. Trains arrive at 9.00, 9.15 and 9.30.

Formula: $P(c < X < d) = \int_{c}^{d} f(x)\,dx$

So, $f(x) = \dfrac{1}{30}, \; 0 < x < 30$

a) He has to wait for less than 6 minutes if he arrives between 9.09 and 9.15 or between 9.24 and 9.30.

So, required probability = P(9 < X < 15) + P(24 < X < 30)

$$= \int_{9}^{15} \frac{1}{30}\,dx + \int_{24}^{30} \frac{1}{30}\,dx = \frac{6}{30} + \frac{6}{30} = \frac{2}{5}$$

b) He has to wait for more than 10 minutes if he arrives between 9.00 and 9.05 or between 9.15 and 9.20.

Required probability = P(0 < X < 5) + P(15 < X < 20) = 5/30 + 5/30 = 1/3
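For a uniform distribution the integral of the density over an interval is simply the length of the interval divided by (b - a). A small Python sketch of Illustration 1.22 under that observation is given below; the function `uniform_prob` is an illustrative helper, not part of the notes.

```python
def uniform_prob(lo, hi, a=0.0, b=30.0):
    """P(lo < X < hi) for X uniform on (a, b): overlap length / (b - a)."""
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (b - a)

# a) waits less than 6 minutes: arrives in (9, 15) or (24, 30) minutes past 9.
p_wait_under_6 = uniform_prob(9, 15) + uniform_prob(24, 30)

# b) waits more than 10 minutes: arrives in (0, 5) or (15, 20) minutes past 9.
p_wait_over_10 = uniform_prob(0, 5) + uniform_prob(15, 20)

print(p_wait_under_6, p_wait_over_10)
```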

Illustration 1.23

Let the probability that an individual suffers a bad reaction from an injection be 0.001. What is the probability that out of 3000 individuals

(a) exactly 3 and

(b) more than 2

individuals will suffer a bad reaction?

Solution

n= 3000

p= 0.001

λ= np = 3000 * 0.001 = 3

(a) Exactly 3 will suffer a bad reaction.

P(X=3) = (e^-3 × 3^3)/3! = 0.224

(b) More than 2 will suffer a bad reaction.

P(X>2) = 1 - P(X<=2) = 1 - [P(X=0) + P(X=1) + P(X=2)]

P(X>2) = 1 - {[(e^-3 × 3^0)/0!] + [(e^-3 × 3^1)/1!] + [(e^-3 × 3^2)/2!]}

= 1 - [0.0498 + 0.1494 + 0.2240]

= 1 - 0.4232

= 0.5768
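The two Poisson probabilities of Illustration 1.23 can be verified with a short Python sketch. It assumes, as the solution does, that the Poisson distribution with λ = np is an adequate model; the function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

lam = 3000 * 0.001               # lambda = np = 3

# (a) exactly 3 bad reactions.
p_exactly_3 = poisson_pmf(3, lam)

# (b) more than 2 bad reactions = 1 - P(0) - P(1) - P(2).
p_more_than_2 = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(round(p_exactly_3, 4), round(p_more_than_2, 4))
```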

Illustration 1.24

A controlled manufacturing process is 0.2% defective. What is the probability of getting 2 or more defectives in a lot of 100 pieces?

(a) By Binomial Distribution

(b) By Poisson Distribution

Solution

n= 100

p= (0.2/100)=0.002

q = 1-p = 1-.002 = 0.998

λ= np = 100 * 0.002 = 0.2

(a) Binomial Distribution.

P(X>=2) = 1 - P(X<2)

= 1 - [P(X=0) + P(X=1)]

= 1 - [100C0 (0.002)^0 (0.998)^100 + 100C1 (0.002)^1 (0.998)^99]

= 1 - [0.81857 + 0.16404]

= 1 - 0.98261

= 0.0174

(b) Poisson Distribution.

P(X>=2) = 1 - P(X<2) = 1 - [P(X=0) + P(X=1)]

= 1 - {[(e^-0.2 × 0.2^0)/0!] + [(e^-0.2 × 0.2^1)/1!]}

= 1 - [0.81873 + 0.16375]

= 1 - 0.98248

= 0.0175

Thus, the probability of getting two or more defective units is 0.0174 by the binomial distribution and 0.0175 by the Poisson distribution.

Normal distribution

Normal distribution is an important continuous probability distribution. It is also known as the Gaussian distribution, after the mathematician-astronomer Karl Gauss (1777-1855), who contributed to its development. It has wide application in Statistics. In business, there arise a number of situations where management has to make inferences by drawing samples. The normal distribution has certain characteristics which make it applicable to such situations.

Normal probability density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}; \quad -\infty < x < \infty,\; -\infty < \mu < \infty,\; \sigma > 0$$

or, equivalently,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

Characteristics of Normal probability distribution:

Figure 1.1 shows the normal probability distribution

Figure 1.1 The normal probability Distribution

Let us see what Figure 1.1 indicates in terms of the characteristics of the normal distribution. It indicates the following characteristics:

1. The curve is bell shaped, that is, it has the same shape on either side of the vertical line through the mean.

2. It has a single peak. As such it is unimodal.

3. The mean is located at the centre of the distribution.

4. The distribution is symmetrical.

5. The two tails of the distribution extend indefinitely but never touch the horizontal axis.

6. Since the normal curve is symmetrical, the lower and upper quartiles are equidistant from the median, that is, Q3 - Median = Median - Q1.

7. The mean, median and mode have the same value, that is, mean = median = mode.

8. The percentage distribution of area under the normal curve is broadly as follows: μ ± 1σ covers 68.27%; μ ± 2σ covers 95.45%; and μ ± 3σ covers 99.73%. This is also shown in Figures 1.2 to 1.5.

Standardizing a Normal Random variable: $z = \dfrac{x-\mu}{\sigma}$

The units for the standard normal distribution curve are denoted by Z and called the Z values or Z scores. They are also called standard units or standard scores. The Z score is known as a 'standardized' variable because it has a zero mean and a standard deviation of one.

As can be seen from Figure 1.2, the horizontal axis is labeled Z. The Z values on the right side of the mean are positive while those on its left side are negative. The Z for a point on the horizontal axis gives the distance between the mean and that point in terms of the standard deviation. For example, a point with a value of Z = 1 is one standard deviation to the right of the mean. Likewise, a point with a value of Z = -1 is one standard deviation to the left of the mean. It can be seen that the mean is at the centre and its value has been shown as zero. The area on either side of the mean is 0.5. Thus, the total area under the curve is 1.

Figure 1.2 Standard Normal distribution curve

Figure 1.3

Figure 1.4

Figure 1.5

The Z value is given by

$$z = \frac{x-\mu}{\sigma}$$

It is seen from Figures 1.2, 1.3, 1.4 and 1.5 that

• Approximately 68% of all the values in a normally distributed population will lie within ± 1 SD from the mean

• Approximately 95.5% of all the values in a normally distributed population will lie within ± 2 SD from the mean and

• Approximately 99.7% of all the values in a normally distributed population will lie within ± 3 SD from the mean
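These three percentages can be recovered without a normal table, since the standard normal cumulative distribution can be written in terms of the error function. The sketch below uses `math.erf` from the Python standard library; the helper names `std_normal_cdf` and `area_within` are illustrative.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_within(k):
    """Area between -k and +k standard deviations from the mean."""
    return std_normal_cdf(k) - std_normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 2))
```

Because any normal variable can be standardized, the same areas apply whatever the mean and standard deviation are.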

Illustration 1.25

A sample of 100 dry battery cells showed the average length of life of the batteries to be 12 hrs with a standard deviation of 3 hrs. Assuming the data to be normally distributed, what percentage of battery cells are expected to have life

(i) More than 15 hrs.

(ii) Less than 6 hrs.

(iii) Between 6 hrs and 15 hrs.

Solution:

Here μ = 12 hrs and σ = 3 hrs.

(i) More than 15 hrs:

If x = 15 hrs, $z = \dfrac{15 - 12}{3} = 1$

$$P(X > 15) = P(z > 1) = 0.5 - P(0 < z < 1) = 0.5 - 0.3413 = 0.1587$$

Hence, the percentage of batteries having life more than 15 hours is 15.87%, or about 16%.

(ii) Less than 6 hrs:

If x = 6 hrs, $z = \dfrac{6 - 12}{3} = -2$

$$P(X < 6) = P(z < -2) = 0.5 - P(-2 < z < 0) = 0.5 - P(0 < z < 2) = 0.5 - 0.4772 = 0.0228$$

The percentage of batteries having life less than 6 hrs is 2.28%.

(iii) Having life between 6 hrs and 15 hrs:

$$P(6 < X < 15) = P(-2 < z < 1) = P(-2 < z < 0) + P(0 < z < 1) = 0.4772 + 0.3413 = 0.8185$$

So the percentage of batteries having life between 6 hrs and 15 hrs is 81.85%.
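Table look-ups for Illustration 1.25 can be cross-checked with `statistics.NormalDist`, which is part of the Python standard library from version 3.8. This is a minimal sketch; the variable names are illustrative, and small differences from table-based answers are due to rounding in the tables.

```python
from statistics import NormalDist

# Battery life assumed normal with mean 12 hrs and SD 3 hrs.
life = NormalDist(mu=12, sigma=3)

p_over_15 = 1 - life.cdf(15)            # (i)   P(X > 15)
p_under_6 = life.cdf(6)                 # (ii)  P(X < 6)
p_between = life.cdf(15) - life.cdf(6)  # (iii) P(6 < X < 15)

for label, value in [("> 15 hrs", p_over_15),
                     ("< 6 hrs", p_under_6),
                     ("6 to 15 hrs", p_between)]:
    print(label, round(100 * value, 2), "%")
```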

Illustration 1.26

In a large institute 2.28% of employees receive income below 4500 and 15.87% of employees receive income above 7500. Assuming normal distribution (for income) find the mean and Standard Deviation (SD).

Solution

Let Z1 and Z2 be the standard normal values corresponding to the incomes x1 = 4500 and x2 = 7500. Since the area beyond Z1 is 0.0228, the area between µ and Z1 = 0.5 - 0.0228 = 0.4772

From the normal table, Z1 = -2.0.

Similarly, the area beyond Z2 is 0.1587.

The area between µ and Z2 = 0.5-0.1587 = 0.3413

From the normal table, Z2 = 1.0.

Now, Z = (x- µ)/σ

Z1 = (x1- µ)/σ ; Z2 = (x2- µ)/σ

          -2 = (4500- µ)/σ; 1 = (7500- µ)/σ

-2 σ = (4500- µ); σ = (7500- µ)

   -2 (7500- µ) = (4500- µ)

   -15000 + 2 µ = 4500- µ

    3 µ = 19500

      µ = 6500

Substituting in above equation, σ = 1000
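The same two equations can be set up and solved by machine. The sketch below uses the inverse of the standard normal distribution (`NormalDist.inv_cdf`, Python 3.8 or later) in place of the table look-up, so its Z values are very close to, but not exactly, -2 and 1; the variable names are illustrative.

```python
from statistics import NormalDist

std = NormalDist()               # standard normal: mean 0, SD 1

# Z values with 2.28% of the area below Z1 and 15.87% above Z2.
z1 = std.inv_cdf(0.0228)         # about -2.0
z2 = std.inv_cdf(1 - 0.1587)     # about +1.0

x1, x2 = 4500, 7500

# From x1 = mu + z1*sigma and x2 = mu + z2*sigma:
sigma = (x2 - x1) / (z2 - z1)
mu = x1 - z1 * sigma

print(round(mu), round(sigma))
```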

Illustration 1.27

Samples of 40 are taken from a lot, which is on the average 20% defective.
(a) What is the probability that a sample of 40 will contain exactly 11 defectives and
(b) what is the probability that it will contain 11 or more defectives?

Solution

n = 40

p = (20/100) = 0.2

q = 1 - p = 1 - 0.2 = 0.8

Mean = np = 40 × 0.2 = 8

SD = √(npq) = √(40 × 0.2 × 0.8) = √6.4 = 2.53

Since n is large, the binomial distribution is approximated here by a normal distribution with mean 8 and SD 2.53. Because the number of defectives is a discrete variable, "exactly 11" is represented on the continuous scale by the interval from 10.5 to 11.5 (continuity correction).

(a) Probability that exactly 11 are defective

= P(10.5 < X < 11.5)

For x = 11.5, z = (11.5 - 8)/2.53 = 1.38.

From the normal table, the area between 0 and 1.38 is 0.4162.

For x = 10.5, z = (10.5 - 8)/2.53 = 0.99.

From the normal table, the area between 0 and 0.99 is 0.3389.

Therefore, P(exactly 11 defectives) = 0.4162 - 0.3389 = 0.0773 or 7.73%

(b) Probability that 11 or more are defective

= P(X > 10.5) = 0.5 - P(8 < X < 10.5)

For x = 10.5, z = (10.5 - 8)/2.53 = 0.99.

From the normal table, the area between 0 and 0.99 is 0.3389.

Therefore, P(11 or more defectives) = 0.5 - 0.3389 = 0.1611 or 16.11%
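A normal approximation to the binomial can always be checked against the exact binomial probabilities. The following sketch does so for Illustration 1.27, assuming the usual continuity correction of 0.5 on either side of 11; the variable and function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 40, 0.2
q = 1 - p

def binomial_pmf(x):
    """Exact P(X = x) for n = 40 trials with p = 0.2."""
    return comb(n, x) * p**x * q**(n - x)

# Normal curve with the same mean (np) and SD (sqrt(npq)).
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * q))

# (a) exactly 11 defectives: area between 10.5 and 11.5.
exact_11 = binomial_pmf(11)
approx_11 = approx.cdf(11.5) - approx.cdf(10.5)

# (b) 11 or more defectives: area to the right of 10.5.
exact_11_or_more = sum(binomial_pmf(x) for x in range(11, n + 1))
approx_11_or_more = 1 - approx.cdf(10.5)

print(round(exact_11, 4), round(approx_11, 4))
print(round(exact_11_or_more, 4), round(approx_11_or_more, 4))
```

The exact and approximate values agree to about two decimal places, which is typical when np and nq are both reasonably large.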

1.8 SUMMARY

This unit is extremely important from the point of view of many fascinating aspects of statistical inference that would follow in the subsequent units. Certainly, it is expected from you that you master the nitty-gritty of this unit. This unit specifically focused on:

• The definition, meaning and concepts of a probability distribution.
• The related terms: discrete random variable and continuous random variable.
• Discrete probability distribution and continuous probability distribution.
• The Binomial distribution and its role in business problems.
• The Poisson distribution and its uses.
• The Normal distribution and its role in statistical inference.
• The concept of the Standard Normal distribution and its role.

Exercises on probability theory

1. A local bank has 5276 accounts cross-classified by type of account and average account balance. The summarized results are (in rupees):

Account Balance    Checking A/C    Savings A/C    New A/C    Money Market    Total
< 5000             1020            803            21         90              1934
5000-10,000        640             774            452        112             1978
> 10,000           51              659            538        116             1364

a) What is the probability that an account does not have over Rs 10,000 in it and that it is not a money market account?

b) What is the probability that a new account's balance is between Rs 5,000 and Rs 10,000?

c) What is the probability that an account has less than Rs 5,000 in it or that the account is a savings account?

d) Given that an account is not a savings account, what is the probability that the account has Rs 10,000 or less in it?

2. If there are 20 sophomores, 10 juniors and 5 seniors in a classroom, what is the probability of choosing a junior at random? Is this a relative frequency approach to estimating a probability?

3. Materials for a food processing plant are supplied by four companies. The following table lists the percentage of materials supplied by each company to the food processing plant and the percentage of defective items from that company.

              % of materials supplied    % of defective materials
Supplier 1    40                         2
Supplier 2    5                          10
Supplier 3    20                         8
Supplier 4    35                         3

4. For the data in the table above:

(a) Determine the percentage of all materials that are defective.

(b) Given that a material supplied to the plant is defective, what is the probability that it came from supplier 3?

5. A software company develops banking software whose performance variable depends on the number of customer accounts. The list given below provides the statistics of 150 of the clients using the software along with the performance variable group.

Total customer accounts    Performance variable group    No. of clients using
0-50                       A                             7
25-100                     B                             14
75-200                     C                             28
200-300                    D                             60
300-400                    E                             25
400-500                    F                             16

a. Find the probability that a client has fewer than 200 user accounts.

b. Find the probability that performance variable group B is used.

c. Find the probability that a client chosen will fit in both variable groups A and B, if the number of clients who can use A is 9 and the number who can use B is 19.

6. In a software test environment holding software developed on the J2EE specification, a downtime analysis was done. Based on the 100 earlier records it was found that there is about 5% downtime per day. A study of the components involved in the environment shows that a problem in WebSphere causes errors, of which 25% lead to a downtime. If there are issues in the operating system, 40% of the issues lead to a downtime, and 20% of the problems in the network lead to a downtime. Given that there is a downtime, find the probability that each of the above reasons could have contributed to the downtime (considering just these 3 reasons).

7. The probability that a contractor will get a plumbing contract is 2/3 and the probability that he will not get an electric contract is 5/9. If the probability of getting at least one contract is 4/5, what is the probability that he will get both the contracts?

8. There are two men aged 30 and 36 years. The probability of living 35 years more is 0.67 for the 30-year-old and 0.60 for the 36-year-old. Find the probability that at least one of these persons will be alive 35 years hence.

9. The personnel department of a company has records which show the following analysis of its 200 engineers:

Age (years)    Bachelor's Degree    Master's Degree    Total
Under 30       90                   10                 100
30 to 40       20                   30                 50
Over 40        40                   10                 50
Total          150                  50                 200

If an engineer is selected at random from the company, find

a) The probability that he has only a bachelor's degree.

b) The probability that he has a master's degree, given that he is over 40.

c) The probability that he is under 30, given that he has only a bachelor's degree.

10. Suppose that there is a chance for a newly constructed building to collapse whether the design is faulty or not. The chance that the design is faulty is 10%. The chance that the building collapses is 95% if the design is faulty and 45% otherwise. It is seen that the building collapsed. What is the probability that it is due to faulty design?

11. The probabilities of three events A, B and C occurring are P(A) = 0.35, P(B) = 0.45 and P(C) = 0.2. Assume that one of A, B or C has occurred. The probabilities of another event X occurring are P(X/A) = 0.8, P(X/B) = 0.65 and P(X/C) = 0.3. Find P(A/X), P(B/X), P(C/X).

12. In a bolt factory, machines A1, A2, A3 manufacture respectively 25%, 35%, 40% of the total output. Of these, 5, 4, 2 percent respectively are defective bolts. A bolt is drawn at random from the product and is found to be defective. What is the probability that it was manufactured by machine A2?

13. A group was planning to make a court challenge to two-wheeler insurance rates in one of three cities in Tamilnadu: (A), Madurai (B) or Trichy (C). The probability that it would choose 'A' was 0.4, 'B' 0.35 and 'C' 0.25. The group also knew that it had a 60% chance of a favorable ruling if it chose 'B', 45% if it chose 'A' and 35% if it chose 'C'. If the group did receive a favorable ruling, which city did it most likely choose?

14. Two persons X and Y appear for an interview for two vacancies in the same post. The probability of X's selection is 1/5 and that of Y's selection is 1/3. What is the probability that: (i) Both X and Y will be selected? (ii) Only one of them will be selected? And (iii) None of them will be selected?

15. There are three departments in the company.

Department    A    B    C
Women         5    4    3
Men           5    8    6

Somehow the men in department C came to know that a man has been selected for promotion from the departments. They want to know the probability that he is from their department.

16. A company is to appoint a person as its managing director who must be M.Tech and M.B.A and C.A. The probabilities of these are one in twenty five, one in forty and one in fifty respectively. Find the probability of getting such a person to be appointed by the company.

Exercises on probability distributions

1. A production process manufactures computer chips that are, on the average, 2% non-conforming. Every day a random sample of size 50 is taken from the process. If the sample contains more than two non-conforming chips, the process will be stopped. Determine the probability that the process is stopped by the sampling scheme.

2. The Database and Storage Area Network (SAN) team of a software firm decided to increase their storage capacity. Storage LUNs were planned to be created considering future requirements on LUN size. The details of 25 database requests made earlier are as follows.

LUN size in GB    Usage
100-200           10%
200-300           40%
300-400           30%
400-500           20%

a) Find the probability that the requirement for LUNs of size 400-500 is greater than or equal to 4.

b) What is the expected disk size that can be allotted to each LUN size category?

3. It is observed that 80% of television viewers watch the 'Thanga Vettai' programme. What is the probability that at least 80% of the viewers in a random sample of five watch this programme?

4. A quality head is conducting his monthly spot check of automatic transmissions. In this procedure, 10 transmissions are removed from the pool of components and are checked for manufacturing defects. Historically only 2 percent of the transmissions have such flaws. (Assume that flaws occur independently in different transmissions.)

a) What is the probability that the sample contains more than two transmissions with manufacturing flaws?

b) What is the probability that none of the selected transmissions has any manufacturing flaws?

5. On an average, if one vessel in every ten is wrecked, find the probability that out of five vessels expected to arrive at least four will arrive safely.

6. What is the probability of guessing correctly at least six of the ten answers in a True-False objective test?

7. The incidence of occupational disease in an industry is such that the workmen have a 20% chance of suffering from it. What is the probability that out of six workmen, 4 or more will contract the disease?

8. A computer terminal repair person is 'beeped' each time there is a call for service. The number of beeps per hour is known to occur in accordance with a Poisson distribution with a mean of λ = 2 per hour. Determine the probability of two or more beeps in a 1-hour period.

9. The Technical team says that on an average 3 hits of 10 million hits made by the software fail. The Marketing department requires a Service Level Agreement on the Q.S. (Quality of Service) stating that the probability of 4 request hits failing amidst 10 million requests is less than 0.15.

a) Can the agreement be signed?

b) A technical upgradation at a higher cost can bring down the hit failure rate from a mean of 3 per 10 million to 1 per 10 million. Is it required?

10. In a certain factory, blades are manufactured in packets of 10. There is a 0.2% probability for any blade to be defective. Calculate approximately the number of packets containing two defective blades in a consignment of 20,000 packets.

11. A book contains 100 misprints distributed at random throughout its 100 pages. What is the probability that a page observed at random contains at least two misprints? (Assume Poisson distribution.)

12. The Orange County dispute settlement centre (DSC) handles various kinds of disputes, but most are marital disputes. In fact, 96 percent of the disputes handled by the DSC are of a marital nature.

a) What is the probability that out of 80 disputes handled by the DSC, exactly seven are non-marital?

b) What is the probability that none are non-marital?

13. The number of accidents in a year attributed to taxi drivers in a city follows a Poisson distribution with mean 3. Out of 1000 taxi drivers, find the approximate number of drivers with (i) no accident in a year, (ii) more than 3 accidents in a year.

14. Assuming that on an average 2% of the output in a factory manufacturing certain bolts is defective and that 200 units are in a package, what is the probability that (i) none is defective, (ii) at most 3 defective bolts may be found in that package?

15. One-fifth percent of the blades produced by a blade manufacturing factory turn out to be defective. The blades are supplied in packets of 10. Use the Poisson distribution to calculate the approximate number of packets containing no defective, one defective and two defective blades respectively in a consignment of 1,00,000 packets.

16. A bus arrives every 20 minutes at a specified stop beginning at 6:40 am and continuing until 8:40 am. A certain passenger does not know the schedule, but arrives randomly (uniformly distributed) between 7:00 am and 7:30 am every morning. What is the probability that the passenger waits for more than 5 minutes for a bus?

17. A server crash brought a downtime of 19 minutes in a particular environment. A user does an operation in it that gives a request once in 3 minutes. Find the probability that the number of requests that fail is greater than 3, assuming that the problem is uniformly distributed.

18. Mr. X has to submit the statistics assignment on 05/10/2006. On 03/10/2006, he kept a reminder in his mobile to alert him about this on 04/10/2006. The reminder will alert him every two hours. If he sets the first reminder to alert him at 06:00 AM on 04/10/2006, what is the probability that he will listen to the reminder which alerts him between 2:00 PM and 08:00 PM?

19. A mobile company updates its subscribers with the latest information on cricket at intervals of 15 minutes. The match starts at 8 am and is expected to continue till 2 pm. If a wicket is taken at a time that is uniformly distributed between 1 pm and 2 pm, find the probability that the customer will receive the information in less than 5 minutes.

20. Vijay must visit either the library or the canteen; the time taken to reach either is 20 to 30 minutes. He has to ride by cycle from his hostel. If he leaves at 10:30 am, what is the probability that he reaches the canteen and library within 25 minutes?

21. Lead-time demand X for an item is approximated by a normal distribution with a mean of 25 and a variance of 9. It is desired to determine a value of lead-time demand that will be exceeded only 5% of the time.

22. The response time for an application to send a request to another application and get back a response in an Enterprise Application Interconnection was monitored by a tool for 3 months continuously. The mean response time was found to be 600 milliseconds with a standard deviation of 200 milliseconds for the normally distributed random variable. a) A response time of > 1.0 seconds is flagged as a severity. Find the probability of occurrence of a severity. b) Find the probability of a response time < 800 ms when the mean is $\mu$ = 500 ms.

23. A set of examination marks is approximately normally distributed with a mean of 75 and a standard deviation of 5. If the top 5% of students get Grade A and the bottom 25% get Grade F, what mark is the lowest A and what mark is the highest F?

24. The customer accounts of a certain departmental store have an average balance of Rs. 120 and a standard deviation of Rs. 40. Assuming that the account balances are normally distributed, find

i) what proportion of accounts is over Rs 150?

ii) what proportion of accounts is between Rs 100 and Rs 150.

iii) what proportion of accounts is between Rs 60 and Rs 90.

25. The intelligence quotient (I.Q.) of a group of 1000 school children has mean 96 and standard deviation 12. Assuming that the distribution of I.Q. among school children is normal, find approximately the number of school children having an I.Q.

(i) Less than 72.

(ii) between 80 and 120.


UNIT II

SAMPLING DISTRIBUTIONS AND ESTIMATION

2.1 INTRODUCTION TO SAMPLING

In order to predict or forecast on the basis of historical data, statistics uses a collection of methods for planning experiments, collecting and analyzing data, and interpreting and drawing conclusions from the data.

Sometimes it is possible and practical to examine every element or item in the population, in which case the exercise is called a census or complete enumeration. But many a time it is impossible to perform a complete enumeration (just imagine having to conduct an exit poll among the entire electorate). Sampling is useful when it is not possible to count or measure every element or item in the population.

For many statistical investigations, a complete enumeration of the population is next to impossible. For example, if we want to have an idea of the average per capita (monthly) income of the people in Chennai, we will have to enumerate all the earning individuals in Chennai, which is rather a difficult task. If the population is infinite, a complete enumeration is not possible. Also, if the units are destroyed in the course of inspection (destructive testing of, say, crackers or explosive material), 100% inspection is not done.

Sampling is quite often used in our day-to-day practical life. For example, in a shop we assess the quality of rice, wheat or any other commodity by taking a handful of it from the bag and then decide whether to purchase it or not. A housewife normally tests the cooked food by just taking a pinch of it to find if it is properly cooked or contains the proper quantity of salt.

2.2 LEARNING OBJECTIVES

After reading this unit, you will be able to:
Explain the need for sampling
Explain the types of sampling
Define and explain the concept of sampling distribution
Define and explain the concept of standard error
Explain and use the sampling distribution of mean
Define and compute point estimation
Define and compute interval estimation
Determine sample size based on confidence interval

2.3 SOME DEFINITIONS

The entire group of individuals under study is called the population or universe. Thus in statistics, a population is an aggregate of objects (animate or inanimate) chosen for study. The population may be finite or infinite.

A finite subset of statistical individuals in a population is called a sample. The number of individuals in a sample is called the sample size. For the purpose of determining population characteristics, instead of enumerating the entire population, the individuals in the sample are observed. The sample characteristics are utilized to approximately determine or estimate the population characteristics.

For example, on examining a handful of rice, we arrive at a decision to purchase or not to purchase the rice. It could have been that the upper layer was good but the layers of rice beneath were not so good. We would have decided to purchase, and become wiser only after examining the entire lot on reaching home. The error involved in such approximations is known as sampling error and is inherent and unavoidable in all sampling schemes. But sampling results in considerable gains, especially in time and cost, not only in respect of making observations of characteristics, but also in the subsequent handling of data.

Parameters and Statistics:

Statistical constants of the population, namely the mean $\mu$, variance $\sigma^2$, etc., are referred to as parameters. Statistical measures computed from sample observations, e.g. the mean ($\bar{x}$), variance ($s^2$), etc., are referred to as statistics.

In practice, parameter values are not known and their estimation is based on sample values. Thus a statistic, which is related to a sample, is used to draw a variety of inferences. Since a statistic is based on the sample values and there are multiple choices of samples that can be drawn from a population, a sample statistic may vary from sample to sample.

Sampling Distribution:

If we draw a sample of size n from a given population of size N, then the total number of possible samples is ${}^{N}C_{n}$:

$${}^{N}C_{n} = \frac{N!}{n!\,(N-n)!} = k$$


For each of these k samples, we can compute some statistic, say $t = t(x_1, x_2, \ldots, x_n)$, in particular the mean ($\bar{x}$), variance ($s^2$), etc., as given below:

Sample No.    Statistic
              $\bar{x}$ (mean)    $s^2$ (variance)
1             $\bar{x}_1$         $s_1^2$
2             $\bar{x}_2$         $s_2^2$
3             $\bar{x}_3$         $s_3^2$
…             …                   …
k             $\bar{x}_k$         $s_k^2$

The set of the values of the statistic so obtained, one for each sample, constitutes the sampling distribution of the statistic.

Standard Error

The standard deviation of the sampling distribution of a statistic is known as its standard error and is denoted by S.E.
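The two definitions above can be illustrated with a minimal Python sketch. The population values and the sample size below are hypothetical and chosen only to keep the enumeration small; the sketch lists every possible sample drawn without replacement, builds the sampling distribution of the mean, and takes its standard deviation as the standard error.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean, pstdev

# Hypothetical finite population of N = 5 values (illustrative data only).
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# Enumerate every possible sample of size n drawn without replacement.
samples = list(combinations(population, n))
k = len(samples)  # number of possible samples, equal to comb(N, n)

# One sample mean per sample: together they form the sampling
# distribution of the mean.
sample_means = [mean(s) for s in samples]

# Standard error = standard deviation of the sampling distribution.
standard_error = pstdev(sample_means)

print("number of samples:", k, "check:", comb(len(population), n))
print("mean of sample means:", mean(sample_means))
print("population mean:", mean(population))
print("standard error:", round(standard_error, 4))
```

For this small population there are 10 possible samples, the mean of the sample means equals the population mean of 6, and the standard error works out to the square root of 3, about 1.7321.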

Tests of Significance

A very important aspect of sampling theory is the study of tests of significance, which enable us to decide on the basis of the sample results if

i. The deviation between the observed sample statistic and the hypothetical parameter value is significant.

ii. The deviation between two sample statistics is significant.

We will look at certain dimensions of hypothesis testing before looking into estimation. We will have a closer look at hypothesis testing in Unit 3 later. However, it would help you to go through the following sections.

A hypothesis is an assumption made about a population and tested on a sample (because it is feasible). What holds good for the sample will hold good for the population also, provided the sample is truly representative of the population.

When we test a hypothesis, we first assume that there is no significant difference between two parameters.

For example, if the average marks of the students of two classes are known and we want to check whether the average marks of the students of the two classes differ significantly, then we will frame a null hypothesis like this: “There is no significant difference in the average marks of the students of the two classes”. The alternative hypothesis will be complementary to the null hypothesis.


Null Hypothesis

For applying the test of significance, we first set up a hypothesis, a definite statement about the population parameter. Such a hypothesis is usually a hypothesis of no difference and is denoted by $H_0$. For example, in the case of a single statistic, $H_0$ will be that “the sample statistic does not differ significantly from the hypothetical parameter value”, and in the case of two statistics, $H_0$ will be that “the sample statistics do not differ significantly”.

Alternative Hypothesis

Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis, usually denoted by $H_1$. For example, if we want to test the null hypothesis that the population has a specified mean $\mu_0$ (say), i.e., $H_0: \mu = \mu_0$, then the alternative hypothesis could be

(i) $H_1: \mu \neq \mu_0$ (i.e., $\mu > \mu_0$ or $\mu < \mu_0$)

(ii) $H_1: \mu > \mu_0$

(iii) $H_1: \mu < \mu_0$

The alternative hypothesis (i) is known as two tailed, since there are two alternatives; (ii) is known as right tailed; and (iii) is known as left tailed.

The setting of the alternative hypothesis is very important in deciding whether we have to use a single-tailed (right or left) or a two tailed test.

Do not worry yourself about the left and right tailed tests here. You will understand them better when you go through Illustration 3.1 in the next unit.

Two types of errors in Sampling

The main objective in sampling theory is to draw valid inferences about the population parameters on the basis of the sample results. In practice we decide to accept or to reject the lot after examining a sample from it. As such we have two types of errors.

(i) Type I error: Reject the null hypothesis ($H_0$) when it is true.

(ii) Type II error: Accept the null hypothesis ($H_0$) when it is false.

If we write

P(Reject $H_0$ when it is true) = P(Type I error) = $\alpha$

and P(Accept $H_0$ when it is false) = P(Type II error) = $\beta$,

then $\alpha$ and $\beta$ are called the sizes of the Type I and Type II errors.


i.e., $\alpha$ = P(rejecting a good lot)

$\beta$ = P(accepting a bad lot)

The sizes of the Type I and Type II errors are also known as the producer’s risk and the consumer’s risk respectively.

Procedure for testing of hypothesis

We now summarize below the various steps in testing a statistical hypothesis in a systematic manner.

(i) Set up the null hypothesis and the alternative hypothesis.

(ii) Choose the appropriate level of significance (either the 5% or the 1% level). This is to be decided before the sample is drawn.

(iii) Compute the test statistic $z = \frac{t - E(t)}{S.E.(t)}$ under the null hypothesis.

(iv) Compare the computed value of z in step (iii) with the significant value at the given level of significance.

If we say 5% LOS (level of significance) for a two tailed test, we mean that there is a rejection area corresponding to 2.5% on either side of the normal curve.

Figure 2.1 – Normal Curve

This means that there is an acceptance area corresponding to 47.5% on either side of the mean of the normal curve. 47.5% means a value of 0.475. If you look into the Normal table (Appendix 1) you will find that the Z intercept corresponding to 0.475 (area under the curve) is 1.96. Similarly, a 1% LOS means you have a rejection region of 0.5% and an acceptance region of 49.5% on either side of the normal curve. Therefore the Z intercept for an area of 0.495 is 2.58. The following points all relate to a two tailed test.


If $|z| < 1.96$, $H_0$ may be accepted at the 5% level of significance.

If $|z| > 1.96$, $H_0$ may be rejected at the 5% level of significance.

If $|z| < 2.58$, $H_0$ may be accepted at the 1% level of significance.

If $|z| > 2.58$, $H_0$ may be rejected at the 1% level of significance.

For a single tailed test, suppose we have a rejection region of 5% only on the RHS (right tailed test); then the LHS of the normal curve has a 50% area of acceptance, while the RHS has 45% of acceptance area and 5% of rejection area. Therefore the Z intercept for an area of 0.45 is 1.645; this applies to both left-tailed and right-tailed tests at the 5% level. At the 1% level (area 0.49) the Z intercept is 2.33.

Therefore, for a single tail test (right tail or left tail) we compare the computed value of z with 1.645 (at the 5% level) and 2.33 (at the 1% level) and accept or reject $H_0$ accordingly.
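The cut-off values read from the Normal table can also be obtained numerically. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the helper names `critical_value` and `two_tailed_decision` are hypothetical and introduced here only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def critical_value(level_of_significance, two_tailed=True):
    """Return the positive z cut-off for the given level of significance."""
    # A two tailed test splits the rejection area equally between both tails.
    if two_tailed:
        tail_area = level_of_significance / 2
    else:
        tail_area = level_of_significance
    return std_normal.inv_cdf(1 - tail_area)


def two_tailed_decision(z, level_of_significance):
    """Reject H0 when |z| exceeds the two tailed cut-off, else accept it."""
    cut_off = critical_value(level_of_significance, two_tailed=True)
    return "reject H0" if abs(z) > cut_off else "accept H0"


for los in (0.05, 0.01):
    print(los,
          round(critical_value(los, two_tailed=True), 3),
          round(critical_value(los, two_tailed=False), 3))

# A computed z of 2.1 (a made-up value) at the 5% and 1% levels.
print(two_tailed_decision(2.1, 0.05), "|", two_tailed_decision(2.1, 0.01))
```

The two tailed cut-offs come out at about 1.96 and 2.58, and the one tailed cut-offs at about 1.645 and 2.33, in line with the table values. A made-up z of 2.1 is therefore rejected at the 5% level but accepted at the 1% level in a two tailed test.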

Critical Region

A region corresponding to a statistic in the sample space S which leads to the rejection of $H_0$ is called the Critical Region or Rejection Region. Those values which lead to the acceptance of $H_0$ give us a region called the Acceptance Region.

Level of significance

The probability that a random value of the statistic ‘t’ belongs to the critical region is known as the level of significance. In other words, the level of significance is the size of the Type I error. The levels of significance usually employed in testing of hypotheses are 5% and 1%.

One Tailed and Two Tailed Tests

A test of any statistical hypothesis where the alternative hypothesis is one tailed (right tailed or left tailed) is called a one tailed test.

For example, a test for the mean of a population in which we test the null hypothesis $H_0: \mu = \mu_0$ against the alternative hypothesis

$H_1: \mu > \mu_0$ (right tailed) or

$H_1: \mu < \mu_0$ (left tailed)

is called a one tailed test.

A test of a statistical hypothesis in which the alternative hypothesis is two tailed, i.e., in which we test the null hypothesis

$H_0: \mu = \mu_0$

against the alternative hypothesis

$H_1: \mu \neq \mu_0$ [$\mu > \mu_0$ or $\mu < \mu_0$]

is called a two tailed test.

Whether to apply a one tailed or a two tailed test to a particular problem depends entirely on the nature of the alternative hypothesis: if the alternative hypothesis is two-tailed we apply a two-tailed test, and if the alternative hypothesis is one-tailed we apply a one-tailed test.

For a better understanding of two tailed and single tailed tests, consider two brands of bulbs, one manufactured by a routine process (mean $\mu_1$) and the other manufactured by a new technique (mean $\mu_2$). If we want to test whether the bulbs differ significantly, then the null hypothesis is $H_0: \mu_1 = \mu_2$ and the alternative hypothesis will be $H_1: \mu_1 \neq \mu_2$. This gives us a two-tailed test. Suppose we want to test whether the bulbs produced by the new process ($\mu_2$) have a higher average life than those produced by the standard process ($\mu_1$); then we have

$H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 < \mu_2$

This gives us a left tail test.

If we want to test whether the product of the new process ($\mu_2$) is inferior to that of the standard process ($\mu_1$), then we have

$H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 > \mu_2$

This gives a right-tail test.

Hence the decision about applying a two tail test or a single-tail (right or left) test will depend on the problem under study.

2.4 TYPES OF SAMPLING

There are two methods of selecting samples from populations: nonrandom or judgment sampling, and random or probability sampling.

In nonrandom or judgment sampling, personal knowledge and opinion are used to identify the items from the population that are to be included in the sample. A sample selected by judgment sampling is based on someone’s expertise about the population. Sometimes a judgment sample is used as a pilot or trial sample to decide how to take a random sample later.


In random sampling, all the items in the population have a chance of being chosen in the sample. Please note that every potential sample need not have the same chance of selection. Practitioners have been using various forms of random selection, the most popular being a random number table. Today, computers have replaced the random number table, and software generates the random numbers very quickly.

Some key Terms in Sampling:

N= Number of units in the Population

n= Number of units in the Sample (Sample Size)

${}^{N}C_{n} = \binom{N}{n}$ = Number of possible selections of n units from N units

$= \frac{N(N-1)(N-2)\cdots(N-n+1)}{1 \cdot 2 \cdot 3 \cdots n}$ (without replacement scheme)

f = n/N = Sampling fraction

Four methods of random sampling are as follows:

1. Simple random sampling

2. Systematic sampling

3. Stratified sampling

4. Cluster sampling

2.4.1 Simple random sampling

Simple Random Sampling is the foundation of Probability Sampling. It is a special case of probability sampling in which every unit in the population has the same chance of being selected. If you have to select n units out of N units, every possible selection of n units must have the same probability. Can you say how many ways are possible to pick n units out of N units? Of course, you can. Is it equal to $\binom{N}{n}$? Simple random sampling guarantees that every possible selection of n units from N units has the same probability, $1/\binom{N}{n}$.

We are assuming here that the units are selected without replacement.

Illustration 2.1

A bank wants to do a study on the customers’ perception of its service quality in the last 12 months with regard to the savings bank account holders.


Method

First, you have to prepare the sampling frame for this study. You can go through the bank’s records and get a complete list of savings bank account holders. This is your sampling frame. Suppose your sampling frame contains 500 account holders and you have to select 50 account holders out of the 500.

One way to do this is to prepare 500 small paper slips, each giving the account holder’s name and account number. Put these slips in a container, shuffle the container thoroughly and then select 50 slips one after the other from the container.
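The paper-slip procedure can be mimicked on a computer. The sketch below is only an illustration: the account identifiers are invented, and a fixed seed is used purely so that the run is repeatable. Python's `random.sample` draws without replacement, so no account holder can be picked twice.

```python
import random

# Hypothetical sampling frame: 500 made-up account identifiers.
sampling_frame = [f"ACC{number:03d}" for number in range(1, 501)]

# Fixed seed only so that the sketch gives the same sample on every run.
rng = random.Random(2024)

# Draw 50 account holders without replacement, like pulling 50 slips
# out of a well-shuffled container.
selected = rng.sample(sampling_frame, 50)

print(len(selected), "account holders selected")
print(sorted(selected)[:5])
```

In a real study the seed would normally be left unset, or recorded so that the selection can be audited later.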

2.4.2 Systematic sampling

In practice, the method followed in systematic sampling is simpler than that explained earlier. First, a sampling interval k is calculated. Suppose we have to select a sample of 50 out of 500 units; then the sampling interval is k = N/n, where N is the total number of units in the population and n is the size of the sample. In our example, k = 500/50 = 10. Second, a number between 1 and 10 is chosen at random. Suppose the number thus selected happens to be 9; then the sample will comprise the units numbered 9, 19, 29, 39, 49, …, 489 and 499.

It will be seen that it is extremely convenient to select a sample in this way. The main point to note is that once the first unit in the sample is selected, the selection of subsequent units in the sample becomes obvious. In view of this, it has been questioned whether the process of selection of subsequent units is random. Here, the selection of a unit is dependent on the selection of a preceding unit, in contrast to simple random sampling where the selections of units are independent of each other. In view of this, systematic sampling is sometimes called quasi-random sampling.
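The selection rule described above can be written as a short Python sketch. The function name `systematic_sample` is hypothetical, and the sketch assumes that N is an exact multiple of n, as in the example.

```python
def systematic_sample(N, n, start):
    """Return the unit numbers picked by systematic sampling.

    Assumes N is an exact multiple of n, and that `start` is the random
    start chosen between 1 and the sampling interval k = N / n.
    """
    k = N // n  # sampling interval
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and the sampling interval")
    # Every k-th unit after the random start enters the sample.
    return list(range(start, N + 1, k))


sample = systematic_sample(N=500, n=50, start=9)
print(sample[:5], "...", sample[-2:])
print(len(sample))
```

With N = 500, n = 50 and a random start of 9, the sketch returns the 50 unit numbers 9, 19, 29, …, 489, 499. In practice the start would itself be drawn at random, for example with `random.randint(1, k)`.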

2.4.3 Stratified sampling

Imagine that you are working as a Marketing Manager in a consumer product company. Suppose you are studying customer attitudes towards your product in order to improve your sales. Suppose there are three typical cities that will influence your sales. Suppose the customers within each city are similar, while customers in different cities are vastly different. The customers for the study have to be selected as a random sample from each city so that meaningful and reliable inferences can be drawn, which in turn will enable the marketing manager to develop suitable strategies. This is an example of stratified random sampling.

Stratified Random Sampling involves dividing the population into a number of groups called strata, in such a manner that the units within a stratum are homogeneous and the units between the strata are heterogeneous. Having divided the population into a number of strata, now select a simple random sample of appropriate size from each stratum. The sample size in each stratum is equal to the overall sample size multiplied by the ratio of the number of units in that stratum to the total number of units in the population. This is called proportionate stratified random sampling. Suppose the overall sample size is 200. Suppose stratum 1 has a total of 2,000 units and the population has 20,000 units.

The sample size for stratum 1 is 200 × (2,000/20,000) = 20.
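Proportionate allocation can be sketched in Python as follows. This is an illustration under stated assumptions: an overall sample of 200, a stratum 1 of 2,000 units, and a total population of 20,000 units split into three strata. The sizes of strata 2 and 3 and the function name `proportionate_allocation` are hypothetical.

```python
def proportionate_allocation(overall_sample_size, stratum_sizes):
    """Allocate the overall sample to strata in proportion to their sizes."""
    population_size = sum(stratum_sizes.values())
    return {
        name: overall_sample_size * size / population_size
        for name, size in stratum_sizes.items()
    }


# Stratum 1 has 2,000 units; strata 2 and 3 are made-up sizes chosen so
# that the population totals 20,000 units.
stratum_sizes = {"stratum 1": 2_000, "stratum 2": 8_000, "stratum 3": 10_000}

allocation = proportionate_allocation(200, stratum_sizes)
print(allocation)
```

Under these assumptions the three strata receive 20, 80 and 100 units respectively. When an allocation is not a whole number, it has to be rounded so that the stratum sample sizes still add up to the overall sample size.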

You can also have disproportionate stratified random sampling, which requires some idea of the standard deviation of the distribution of the parameter of interest within the strata. As this information is not easy to obtain, managers may have to rely on intuition and logic to determine the sample size for each stratum. For example, bigger retail outlets may have greater variation in sales of certain products compared to small retail outlets. Therefore, it is appropriate to have a disproportionately large sample size for bigger retail outlets. One common strategy managers follow is first to select an equal sample size from each stratum and then to give weights according to each stratum’s proportion of the total population.

Stratified random sampling is appropriate when the population is heterogeneous and you are keen to get a realistic picture of the overall population. An ordinary simple random sample drawn from an entire population that is heterogeneous can be highly misleading. You would prefer stratified sampling to simple random sampling in the case of a heterogeneous population because it assures that you will be able to represent not only the overall population, but also the key strata of the population.

2.4.4 Cluster sampling (Area Random Sampling)

One of the problems encountered with probability sampling methods is that you may have to apply the sampling procedure to a population that is scattered across a number of wide geographic regions. In these cases, you will have to cover a lot of distance in order to have access to the units you propose to sample.

Suppose you want to do a simple random sample survey of all the residents in India who belong to the highest income category. Your interviewers will have to do a tremendous amount of traveling. It is for this reason that the cluster sampling method is followed. The steps involved in cluster sampling are:
1. Divide the population into a number of clusters based on geographic boundaries.
2. Select a random sample of clusters from this population of clusters.
3. Either measure all units within the randomly chosen clusters or do further random sampling in each cluster.

Strictly speaking, when you measure all the units in the selected clusters, the procedure is called cluster sampling. If you do further sampling within each cluster by adopting simple random sampling or stratified random sampling, the procedure becomes multi-stage sampling.
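The steps of cluster sampling can be sketched in Python. The region names, the household labels and the function name `cluster_sample` are all hypothetical, and the sketch assumes the population has already been divided into clusters.

```python
import random


def cluster_sample(clusters, number_of_clusters, units_per_cluster=None, seed=None):
    """Pick clusters at random, then take all or some units in each one.

    With units_per_cluster=None every unit of a chosen cluster is taken
    (cluster sampling); otherwise a simple random sample is drawn inside
    each chosen cluster (multi-stage sampling).
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), number_of_clusters)
    result = {}
    for name in chosen:
        units = clusters[name]
        if units_per_cluster is None:
            result[name] = list(units)
        else:
            result[name] = rng.sample(units, min(units_per_cluster, len(units)))
    return result


# Made-up population: four geographic clusters of ten households each.
clusters = {
    region: [f"{region}-household-{i}" for i in range(1, 11)]
    for region in ("North", "South", "East", "West")
}

one_stage = cluster_sample(clusters, number_of_clusters=2, seed=1)
multi_stage = cluster_sample(clusters, number_of_clusters=2, units_per_cluster=3, seed=1)
print({name: len(units) for name, units in one_stage.items()})
print({name: len(units) for name, units in multi_stage.items()})
```

The first call measures all ten households in each of two randomly chosen regions. The second call keeps only three households per chosen region, which corresponds to the multi-stage variant.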


Sampling distributions

Various types of sampling distributions are:
Sampling distribution of single mean
Sampling distribution of difference of means
Sampling distribution of single proportion
Sampling distribution of difference of proportions

2.5 SAMPLING FROM NORMAL POPULATIONS

The sampling distribution of the mean of a sample taken from a normal population shows two important properties. First, the sampling distribution has a mean that is equal to the population mean. In symbols, $\mu_{\bar{x}} = \mu$. Second, the sampling distribution has a standard deviation (a standard error) that is equal to the population standard deviation divided by the square root of the sample size. This can be written in symbolic form as

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

where $\sigma_{\bar{x}}$ = standard error of the mean
$\sigma$ = population standard deviation
n = sample size

Let us take an example to explain further these properties of the sampling distribution of a mean.

Illustration 2.2

Suppose in a normally distributed population, the average income per household is Rs. 10,000 per month with a standard deviation of Rs. 800. A survey based on a random sample of 100 households is undertaken. What is the probability that the sample mean will be between Rs. 9,800 and Rs. 10,100?

Solution:

As this is a question relating to the sampling distribution of the mean, we have to first calculate the standard error of the mean. In order to calculate it, we use the formula given earlier:

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$


Substituting the two values in the above formula, we get

$\sigma_{\bar{x}} = \frac{800}{\sqrt{100}}$ = Rs. 80 (standard error of the mean)

Having obtained the standard error of the mean, we now use the following formula:

$Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$

This formula transforms any normal random variable to a standard normal random variable. Since we have been given the range within which the sample mean will lie, we have to use the formula for the two values.

For $\bar{x}$ = Rs. 9,800: $Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{9{,}800 - 10{,}000}{80} = -2.5$

For $\bar{x}$ = Rs. 10,100: $Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{10{,}100 - 10{,}000}{80} = 1.25$

Now, we have to use the table on the standard normal probability distribution. We find that the table gives an area of 0.0062 corresponding to a Z value of -2.5, and an area of 0.1056 corresponding to a Z value of 1.25. It may be noted that the former figure relates to the left tail of the normal curve while the latter figure relates to its right tail. As such, each figure is to be subtracted from 0.5, which is the total area of one-half of the curve. Thus, we get 0.5 - 0.0062 = 0.4938 and 0.5 - 0.1056 = 0.3944. Adding these two values, we get

0.4938 + 0.3944 = 0.8882

This is the probability that the sample mean will lie between Rs. 9,800 and Rs. 10,100. This is shown below.

(Figure: area under the normal curve between Z = -2.5 and Z = 1.25.)
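The arithmetic of Illustration 2.2 can be cross-checked without the Normal table. The sketch below uses Python's standard `statistics.NormalDist` in place of the table; the variable names are chosen here only for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu = 10_000   # population mean income (Rs.)
sigma = 800   # population standard deviation (Rs.)
n = 100       # sample size

# Standard error of the mean: sigma / sqrt(n).
standard_error = sigma / sqrt(n)

# Z values for the two limits of the sample mean.
z_low = (9_800 - mu) / standard_error
z_high = (10_100 - mu) / standard_error

# Sampling distribution of the mean: normal with mean mu and
# standard deviation equal to the standard error.
sampling_distribution = NormalDist(mu, standard_error)
probability = sampling_distribution.cdf(10_100) - sampling_distribution.cdf(9_800)

print(standard_error, z_low, z_high)
print(round(probability, 4))
```

The sketch reproduces the standard error of Rs. 80 and the Z values of -2.5 and 1.25. The probability comes out at about 0.8881, which differs from the table-based 0.8882 only because the table areas are rounded to four decimal places.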


2.6 SAMPLING FROM NON-NORMAL POPULATIONS

The foregoing illustration related to a normal population. It showed that in the case of a normal population, the sampling distribution of the mean is also normal. However, in most cases the population from which a sample is taken is not normally distributed. In such cases, we use an important theorem to infer the shape of the sampling distribution of the mean.

2.6.1 The central Limit Theorem

The central limit theorem states that as the sample size gets large enough, the sampling distribution of the mean can be approximated by the normal distribution. This is true regardless of the distribution of the population (with finite variance) from which the random sample is drawn. Symbolically, the mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$ and the standard deviation is

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

On the basis of this theorem we can make probability statements about the possible range of values the sample mean may take. The question here is: which sample size is considered large enough? As we do not know the shape of the population distribution, it is necessary for us to apply some general rule which may indicate when a sample is large enough for the central limit theorem to be applied. In general, when a sample comprises 30 or more elements, it is considered large enough for the application of the central limit theorem.

A point worth noting is that when the population distribution is very different from a normal distribution, it is necessary to have a large minimum sample size for a good normal approximation. By the same token, when the population distribution is close to a normal distribution, even a smaller minimum sample size is considered sufficient. In short, the larger the sample size, the better is the approximation to the normal distribution, and vice versa.
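A small simulation can make the central limit theorem concrete. The sketch below is illustrative only: the choice of an exponential population (mean 1, standard deviation 1, strongly positively skewed), the number of replications and the seed are all assumptions made here, not part of the text.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable

n = 30               # sample size (the usual "large sample" threshold)
replications = 5000  # number of samples drawn (an arbitrary choice)

# Draw many samples of size n from a skewed, non-normal population
# (exponential with mean 1) and record the mean of each sample.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(replications)
]

mean_of_means = mean(sample_means)  # expected to be close to mu = 1
sd_of_means = pstdev(sample_means)  # expected to be close to sigma / sqrt(n)
theoretical_se = 1 / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(theoretical_se, 3))
```

The mean of the simulated sample means should lie close to the population mean of 1, and their standard deviation close to 1/√30 (about 0.183), even though the population itself is far from normal. Plotting a histogram of `sample_means` would show the familiar bell shape.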


According to the central limit theorem, for a large sample size, the sampling distribution of the sample mean $\bar{x}$ is approximately normal, regardless of the shape of the population distribution. Symbolically, the mean of the sampling distribution of $\bar{x}$ is $\mu_{\bar{x}} = \mu$ and the standard deviation is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. By ‘large sample size’, we mean that n is greater than or equal to 30.

Population Distribution and Sampling Distributions of mean ($\bar{x}$):

Illustration 2.3: In a certain locality, the average rent paid by all tenants amounts to Rs. 1,500 per month with a standard deviation of Rs. 450. However, the population distribution of rents pertaining to all tenants in that locality is positively skewed. Find the mean and standard deviation of $\bar{x}$ when the sample size is (a) 30 and (b) 100. Also describe the shape of its sampling distribution in both cases.

Solution: Given that the population distribution is not normal, but the sample size in both cases is large ($n \geq 30$), the central limit theorem can be applied to infer the shape of the sampling distribution of $\bar{x}$.

(a) Let $\bar{x}$ be the average rent paid by a sample of 30 tenants. Then, the sampling distribution of $\bar{x}$ is approximately normal with the values of the mean and standard deviation as $\mu_{\bar{x}} = \mu$ = Rs. 1,500 and

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{450}{\sqrt{30}} = 82.16$

(b) Let $\bar{x}$ be the average rent paid by a sample of 100 tenants. Then, the sampling distribution of $\bar{x}$ is approximately normal with the values of the mean and standard deviation as $\mu_{\bar{x}} = \mu$ = Rs. 1,500 and

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{450}{\sqrt{100}} = 45$

In respect of the central limit theorem, the following three aspects need to be noted:

1. When the sample size is large enough, the sampling distribution of $\bar{x}$ is approximately normal.

2. The expected value of $\bar{x}$ is $\mu$.

3. The standard deviation of $\bar{x}$ is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.

The last statement is very important, as it shows that when the sample size increases, the variation between $\bar{x}$ and its mean decreases. This is evident from the above example: when the sample size is 30, $\sigma_{\bar{x}}$ is 82.16. In contrast, when the sample size is increased to 100, $\sigma_{\bar{x}}$ is much lower, being only 45.
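The effect of sample size on the standard error in Illustration 2.3 can be tabulated with a few lines of Python. The function name `standard_error` is a hypothetical helper, and the sample size of 400 is an extra value added here only to extend the comparison.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / sqrt(n)


sigma = 450  # population standard deviation of rents (Rs.)

# 30 and 100 are the sample sizes of Illustration 2.3; 400 is an extra,
# hypothetical size showing the standard error continuing to fall.
for n in (30, 100, 400):
    print(n, round(standard_error(sigma, n), 2))
```

The standard error falls from about 82.16 at n = 30 to 45 at n = 100 and 22.5 at n = 400. Because of the square root, quadrupling the sample size halves the standard error.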

2.7 ESTIMATION

2.7.1 Introduction

A Marketing Manager in an organization needs to estimate the likely market share his company can achieve in the market place. A Quality Assurance Manager may be interested in estimating the proportion defective of the finished product before shipment to the customer. The manager of the credit department needs to estimate the average collection period for collecting dues from the customers. How confident are they in their estimates? This unit provides some insights into point estimation and interval estimation that are essential in business planning. Please remember that the three components (point estimation, interval estimation, and hypothesis testing) together constitute inferential statistics.

An estimator is a sample statistic used to estimate a population parameter. For example, the sample mean ($\bar{x}$) is an estimator of the population mean (µ), and the sample proportion (p) is an estimator of the population proportion (P). An estimate is a specific observed value of a statistic. A point estimate is a single number that is used to estimate an unknown population parameter. An interval estimate is a range of values used to estimate a population parameter.

Criteria for selecting an Estimator

Unbiasedness: An estimator is unbiased if its expected value equals the parameter for all sample sizes.

Relative Efficiency: An unbiased estimator is relatively efficient when its S.E. is smaller than that of another unbiased estimator of the same parameter (for example, median versus mean).

Consistency: An estimator is consistent if the probability that its value is very near the parameter’s value increasingly approaches 1 as the sample size increases.

Sufficiency: An estimator is sufficient if it makes so much use of the information in the sample that no other estimator could extract from the sample additional information about the population parameter being estimated.

2.7.2 Types of Estimation

There are two types of estimation. They are

(1) Point estimation.

(2) Interval estimation.

A point estimate is a single-valued estimate. For example, the population mean may be estimated to be 410, the value of the sample mean.

An interval estimate is an estimate that is a range of values. For example, the population mean may be estimated to lie between 400 and 420.


2.7.3 Point Estimation

Point Estimation – Population Mean

The sample mean $\bar{X}$ is an unbiased estimator of the population mean $\mu$. An unbiased estimator is one whose expected value is equal to the population parameter. That is, $E(\bar{X})$ is equal to $\mu$. Of course, the samples drawn must be independent random samples from the population of interest.

Point Estimation-Population Proportion

First, what do we mean by the word proportion? A proportion is a special type of arithmetic mean, special in the sense that the individual elements can take only the value 0 or 1. In a survey, if you select a random sample of respondents and ask them whether they have a deluxe car, each reply is either yes or no. You assign "1" to each person who says yes and "0" to each person who says no. The values are summed and the sum is divided by the total number of respondents. The interpretation of a proportion then becomes similar to the interpretation of an arithmetic mean. In other words, a proportion is a particular case of the arithmetic mean.

Let us designate the population proportion as P and the sample proportion as p. The sample proportion is an unbiased estimator of the population proportion. That is, p is an unbiased estimator of P, so E(p) = P.

A particular value of p based on a sample survey becomes a point estimate.

In symbols,

$$p = \frac{\sum_{i=1}^{n} X_i}{n}$$

where $X_i = 1$ if the $i$th unit in the sample possesses an attribute of interest, such as having a deluxe car, and $X_i = 0$ if the $i$th unit in the sample does not possess the attribute.

Applying the central limit theorem, if the sample size is large, then regardless of the shape of the population distribution, the distribution of the sample proportion follows a normal distribution with mean $P$ and standard deviation $\sqrt{P(1-P)/n}$. That is, the standard error of the proportion is

$$\sigma_p = \sqrt{\frac{P(1-P)}{n}}$$

We avoid the derivations here because they are not necessary.
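To make the "proportion is a mean of 0s and 1s" idea concrete, here is a minimal sketch. The survey responses are hypothetical, the function names are illustrative, and the sample p is used in place of the unknown P when computing the standard error. The sample is kept tiny for readability; the normal approximation itself needs a large sample.

```python
import math

def sample_proportion(responses):
    """Proportion of 1s in a list of 0/1 responses (the mean of the X_i)."""
    return sum(responses) / len(responses)

def standard_error_of_proportion(p, n):
    """sqrt(p(1 - p)/n), with the sample p standing in for the unknown P."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical survey: 1 = owns a deluxe car, 0 = does not.
responses = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
p = sample_proportion(responses)
se = standard_error_of_proportion(p, len(responses))
print(f"p = {p:.2f}, standard error = {se:.4f}")
```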

2.7.4 Confidence interval for population mean for large samples

Confidence Interval for Population Mean ($\sigma$ Known)


The $(1 - \alpha)$ confidence interval for the population mean $\mu$ is given by:

$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$

where $\bar{X}$ is the sample mean based on a random sample of size $n$, and would vary for repeated random samples of the same size $n$.

$\frac{\sigma}{\sqrt{n}}$ is the standard error of the distribution of the sample mean $\bar{X}$.

$\mu$ is the population mean.

$Z$ is the value corresponding to an area of $\frac{1-\alpha}{2}$ from the mean of the standard normal distribution.

$\alpha$ is the proportion in the tails of the standard normal distribution that is outside the range of the confidence interval.

The meaning of the confidence interval given in this section is described below.

If you take repeated independent random samples of size $n$ from a population with an unknown mean but known standard deviation, the probability that the interval $\bar{X} - Z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z\frac{\sigma}{\sqrt{n}}$ will contain the true population mean $\mu$ is $1 - \alpha$. $\alpha$ is a measure of risk, indicating the percentage of times the true value of the population mean will fall outside this interval. Please note that the original population need not be normal. When the sample size is sufficiently large, the sampling distribution of the mean follows a normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ (central limit theorem).

Note: The formula for the confidence interval is derived from

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

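Confidence interval for a mean when the population S.D. is known:

$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$$

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{(infinite population)}$$

or

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \quad \text{(finite population)}$$

where $N$ is the population size.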
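The two standard-error formulas above can be combined in one small helper. This is a sketch rather than a definitive implementation: the function name `z_interval` and its parameters are illustrative, the example figures are hypothetical, and the Z value is taken from Python's standard-library `statistics.NormalDist` instead of a printed table.

```python
import math
from statistics import NormalDist

def z_interval(mean, sigma, n, confidence=0.95, population_size=None):
    """Confidence interval for a mean when the population S.D. is known.
    If population_size is given, the finite population correction is applied."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sigma / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return mean - z * se, mean + z * se

# Hypothetical figures: sample mean 50, sigma 8, n = 64.
lo, hi = z_interval(50, 8, 64)                          # infinite population
lo_f, hi_f = z_interval(50, 8, 64, population_size=500)  # finite population
print(f"infinite population: {lo:.2f} to {hi:.2f}")
print(f"finite population:   {lo_f:.2f} to {hi_f:.2f}")
```

The finite population correction is always less than 1, so the second interval is narrower than the first.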


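Some formulae useful in estimation

Point estimate of the population variance:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Standard error of mean for an infinite population:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation of the population.

Standard error of mean for a finite population:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$

Interval estimate for large samples:

$$\bar{x} - Z_{\alpha/2}\,\sigma_{\bar{x}} < \mu < \bar{x} + Z_{\alpha/2}\,\sigma_{\bar{x}}$$

Interval estimate for small samples:

$$\bar{x} - t_{n-1}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{n-1}\frac{s}{\sqrt{n}}$$

Interval estimate for proportions:

$$p - Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < P < p + Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$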
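As a quick check on the first of these formulae, the sketch below computes the point estimate of the variance by hand, with the divisor n - 1, and compares it with `statistics.variance` from the Python standard library. The data values are hypothetical.

```python
import math
import statistics

data = [12, 15, 11, 14, 18]  # hypothetical sample values
n = len(data)
x_bar = sum(data) / n

# Point estimate of the population variance: divide by n - 1, not n.
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s_squared)

# Estimated standard error of the mean, using s in place of sigma.
standard_error = s / math.sqrt(n)

print(f"s^2 = {s_squared}, library value = {statistics.variance(data)}")
print(f"standard error of the mean = {standard_error:.4f}")
```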

Illustration 2.4:

As the owner of Custom Travel, you want to estimate the mean time that it takes a travel agent to make the initial arrangements for a vacation package. You have asked your office manager to take a random sample of 40 vacation requests and to observe how long it takes to complete the initial arrangements. The office manager reported a mean time of 23.4 minutes. You want to estimate the true mean time using a 95% confidence level. Previous time studies indicate that the S.D. of times is a relatively constant 9.8 minutes.

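Solution:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{9.8}{\sqrt{40}} \approx 1.55$$

$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = 23.4 \pm 1.96(1.55) = 23.4 \pm 3.04$$

That is, the true mean time is estimated to lie between 20.36 and 26.44 minutes.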
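The limits in this illustration are sensitive to how early the standard error is rounded. The following sketch, whose variable names are illustrative, recomputes the interval with the standard error rounded to 1.5, rounded to 1.55, and left unrounded, so you can see the effect for yourself. Carried at full precision, the interval is roughly 20.36 to 26.44 minutes.

```python
import math

sigma, n, x_bar, z = 9.8, 40, 23.4, 1.96

exact_se = sigma / math.sqrt(n)

# Compare the interval for different roundings of the standard error.
for label, se in [("SE rounded to 1.5", 1.5),
                  ("SE rounded to 1.55", 1.55),
                  ("SE unrounded", exact_se)]:
    margin = z * se
    print(f"{label:>20}: {x_bar - margin:.2f} to {x_bar + margin:.2f}")

lower = x_bar - z * exact_se
upper = x_bar + z * exact_se
```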


Illustration 2.5:

A machine produces components which have a standard deviation of 1.6 cm in length. A random sample of 64 parts is selected from the output and this sample has a mean length of 90 cm. The customer will reject a part if it is either less than 88 cm or more than 92 cm. Does the 95% confidence interval for the true mean length of all the components produced ensure acceptance by the customer?

Solution:

To answer the question of acceptance by the customer, you should first work out the 95% confidence interval for the population mean $\mu$ (here $\mu$ is the mean length of the components in the population). The formula for the confidence interval is

$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$

When you want a 95% confidence level, the Z value is 1.96 from the standard normal distribution.

[Figure: Standard Normal Distribution, with $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$]

Hence the 95% confidence interval for the population mean is given by:

$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$


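Now $\bar{X} = 90$ and $\frac{\sigma}{\sqrt{n}} = \frac{1.6}{\sqrt{64}} = 0.2$.

Substituting these values in the interval $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$, we have

$$90 - 1.96(0.2) \le \mu \le 90 + 1.96(0.2), \quad \text{that is,} \quad 89.61 \le \mu \le 90.39.$$

This implies that the probability that the true value of the population mean length of the components will fall in this interval of $89.61 \le \mu \le 90.39$ is 95%. Because the whole interval lies within the customer's limits of 88 cm to 92 cm, we infer that the 95% confidence interval ensures acceptance by the customer.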
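The comparison between the interval and the customer's limits can be written out as a short check. This is a sketch with illustrative variable names; the Z value comes from `statistics.NormalDist` instead of a table. Note that it compares the interval for the mean length with the limits, which is the question posed here; it does not describe the lengths of individual components.

```python
import math
from statistics import NormalDist

sigma, n, x_bar = 1.6, 64, 90.0
lower_spec, upper_spec = 88.0, 92.0   # customer's acceptance limits in cm

z = NormalDist().inv_cdf(0.975)       # about 1.96 for 95% confidence
se = sigma / math.sqrt(n)
lower, upper = x_bar - z * se, x_bar + z * se

# The interval "ensures acceptance" only if it lies wholly inside the limits.
interval_within_spec = lower_spec <= lower and upper <= upper_spec
print(f"95% interval: {lower:.2f} to {upper:.2f}; inside limits: {interval_within_spec}")
```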

Confidence interval for the population proportion for large samples

Please go through the discussion in this unit on the concept of proportion before learning how to set up the confidence interval for the population proportion. Proportion is frequently used in survey research, particularly in marketing research.

The $(1 - \alpha)$ confidence interval for the population proportion is given below:

$$p - Z\sqrt{\frac{P(1-P)}{n}} \le P \le p + Z\sqrt{\frac{P(1-P)}{n}}$$

where $P$ is the population proportion, $p$ is the sample proportion, $Z$ is the value corresponding to an area of $\frac{1-\alpha}{2}$ from the mean of the standard normal distribution, and $\alpha$ is the proportion in the tails of the standard normal distribution that is outside the range of the confidence interval.

Unfortunately, this interval contains the term $P$, which is the population proportion that we are trying to estimate. It is a practice among statisticians to use the sample proportion $p$ in the place of $P$. Please note that $p$ is an unbiased estimator of $P$. Hence, the $(1 - \alpha)$ confidence interval for the population proportion $P$ becomes:

$$p - Z\sqrt{\frac{p(1-p)}{n}} \le P \le p + Z\sqrt{\frac{p(1-p)}{n}}$$

This confidence interval, in particular for the 95% and 99% confidence levels, is given below:

Confidence Level: Confidence Interval for the Population Proportion P

95%: $p - 1.96\sqrt{\frac{p(1-p)}{n}} \le P \le p + 1.96\sqrt{\frac{p(1-p)}{n}}$

99%: $p - 2.58\sqrt{\frac{p(1-p)}{n}} \le P \le p + 2.58\sqrt{\frac{p(1-p)}{n}}$

Illustration 2.6: In a health survey involving a random sample of 75 patients who developed a particular illness, 70% of them are cured of this illness by a new drug. Establish the 95% confidence interval for the population proportion of all the patients who will be cured by the new drug. This would help a pharmaceutical company assess the market potential for this new drug.

Solution: The 95% confidence interval for the population proportion is given by:

$$p - 1.96\sqrt{\frac{p(1-p)}{n}} \le P \le p + 1.96\sqrt{\frac{p(1-p)}{n}}$$

$$0.70 - 1.96\sqrt{\frac{0.70(1-0.70)}{75}} \le P \le 0.70 + 1.96\sqrt{\frac{0.70(1-0.70)}{75}}$$

Upon simplification, this interval becomes $0.5963 \le P \le 0.8037$. That is, the probability that the population proportion will fall in this interval of $0.5963 \le P \le 0.8037$ is 95%.
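The same calculation can be wrapped in a small reusable function. This is a sketch: `proportion_interval` is an illustrative name, not something defined in these notes, and the Z value is obtained from `statistics.NormalDist` instead of being read from a table.

```python
import math
from statistics import NormalDist

def proportion_interval(p, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion,
    using the sample proportion p in place of the unknown P."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

lower, upper = proportion_interval(0.70, 75)
print(f"95% interval for P: {lower:.4f} to {upper:.4f}")

# A 99% interval from the same sample is wider.
lower_99, upper_99 = proportion_interval(0.70, 75, confidence=0.99)
```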

2.8 t-DISTRIBUTION

How will you establish a 95% confidence interval for the population mean when the standard deviation is not known? How will you establish a 95% confidence interval for the population mean when the sample size is small? Both these questions can be answered by the t-distribution. William Gosset, writing under the pseudonym "Student", discovered the t-distribution. Hence, it is called "Student's t-distribution". For better clarity, see the diagram.


Characteristics of the t-distribution:

As you can see from the above diagram, the t distribution is symmetrical, like the normal distribution. However, the t distribution is flatter than the normal distribution. The t distribution depends upon one more factor, called the degrees of freedom. You can define the degrees of freedom as the number of unrestricted (independent) values you can have out of the sample size n; in other words, how many of the sample values are free to vary? In the case of the t distribution, there are n-1 degrees of freedom. Intuitively, the degrees of freedom suggest that if we know the sample mean and the values of n-1 terms, the nth term can be calculated. According to Ronald Fisher, the father of Statistics, degrees of freedom is a term borrowed from mathematics. When n is large (greater than or equal to 30), the t distribution loses its flatness and is closely approximated by the normal distribution. Hence, we can use the normal approximation to t when n is greater than or equal to 30. For estimating a mean from a small sample, the t distribution is the best choice.

Confidence interval for population mean for small samples using t-distribution

The $(1 - \alpha)$ confidence interval for the population mean $\mu$ is given by:

$$\bar{X} - t_{n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1}\frac{S}{\sqrt{n}}$$

where $\bar{X}$ is the sample mean based on independent random samples from a normal population and the sample size is small.

$S$ is the sample standard deviation $= \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$


$t_{n-1}$ is the value of the t distribution with $n-1$ degrees of freedom for an area of $\frac{\alpha}{2}$ in each tail of the distribution.

Please note that the sample statistic $t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ follows a t distribution with $n-1$ degrees of freedom (d.f.).

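Illustration 2.7: The average travel time to reach the office, based on a random sample of 10 people working in a company, is 40 minutes with a standard deviation of 10 minutes. Establish the 95% confidence interval for the mean travel time of everyone in the company, to help redesign the working hours.

Solution: The $(1 - \alpha)$ confidence interval for the population mean is given by

$$\bar{X} - t_{n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{n-1}\frac{S}{\sqrt{n}}$$

Here $\bar{X} = 40$ and $\frac{S}{\sqrt{n}} = \frac{10}{\sqrt{10}} = 3.16$.

For a 95% confidence level, $\alpha/2 = 0.05/2 = 0.025$. Each tail will have a probability of 0.025. With $n - 1 = 9$ degrees of freedom, the table value is $t_{n-1} = 2.262$, so the interval is $40 \pm 2.262(3.16)$, that is, $32.85 \le \mu \le 47.15$.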
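The small-sample interval for this illustration can be sketched as follows. Python's standard library has no t distribution, so the table value for 9 degrees of freedom with 0.025 in each tail (2.262) is passed in by hand. The function name `t_interval` is an illustrative assumption.

```python
import math

def t_interval(mean, s, n, t_value):
    """Small-sample confidence interval for a mean.
    t_value is the t-table value for n - 1 degrees of freedom."""
    se = s / math.sqrt(n)
    return mean - t_value * se, mean + t_value * se

T_9_DF_95 = 2.262  # t-table value: 9 d.f., area 0.025 in each tail

lower, upper = t_interval(mean=40, s=10, n=10, t_value=T_9_DF_95)
print(f"95% interval for mean travel time: {lower:.2f} to {upper:.2f} minutes")
```

The resulting interval is wider than the one the Z value of 1.96 would give, reflecting the extra uncertainty of a small sample with an estimated standard deviation.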

2.9 DETERMINING THE SAMPLE SIZE USING CONFIDENCE INTERVAL

If you specify the sampling error (precision), the confidence level desired, and the standard deviation of the original population, you can compute the optimal sample size. You can determine the sample size for estimating the population mean as well as the population proportion. Please note that the sampling error is the difference between the estimate and the actual parameter.

Sample Size Determination – Population Mean

Sample size may be obtained using the formula:

$$n = \frac{Z^2\sigma^2}{E^2}$$

where the sampling error $E = \bar{X} - \mu$, and $Z$ is the table value of the standard normal distribution corresponding to the given level of significance.

Illustration 2.8: A marketing manager of a fast food restaurant in a city wishes to estimate the average yearly amount that families spend on fast food restaurants. He wants the estimate to be within Rs. 100 with a confidence level of 99%. It is known from an earlier pilot study that the standard deviation of the family expenditure on fast food restaurants is Rs. 500. How many families must be chosen for this problem?


Solution:

Applying the formula $n = \frac{Z^2\sigma^2}{E^2}$, we have

$$n = \frac{2.58^2\,(500^2)}{100^2} = 166.41,$$

or 166 rounded to the nearest integer.
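The sample-size formula is easy to check numerically. In the sketch below the function name is illustrative. It also shows the effect of the rounding choice: rounding to the nearest integer gives 166, while rounding up gives 167, which is the choice that keeps the error within the stated Rs. 100.

```python
import math

def sample_size_for_mean(z, sigma, error):
    """n = Z^2 * sigma^2 / E^2, before rounding."""
    return (z * sigma / error) ** 2

n_exact = sample_size_for_mean(z=2.58, sigma=500, error=100)
n_nearest = round(n_exact)          # rounded to the nearest integer
n_rounded_up = math.ceil(n_exact)   # conservative choice

print(f"exact: {n_exact:.2f}, nearest: {n_nearest}, rounded up: {n_rounded_up}")
```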

Sample Size Determination – Population Proportion

$$n = \frac{Z^2\,p(1-p)}{E^2}$$

where the sampling error $E = p - P$, and $Z$ is the table value of the standard normal distribution corresponding to the given level of significance.

Illustration 2.9: A company manufacturing sports goods wants to estimate the proportion of cricket players among high school students in India. The company wants the estimate to be within 0.03 with a confidence level of 99%. A pilot study done earlier reveals that out of 80 high school students, 36 students play cricket. What should be the sample size for this study?

Solution: $p = 36/80 = 0.45$

Applying the formula for calculating the sample size, $n = \frac{Z^2\,p(1-p)}{E^2}$, we have

$$n = \frac{2.58^2\,(0.45)(1-0.45)}{0.03^2}$$

Simplifying, you get n = 1831.
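The proportion version of the sample-size formula can be sketched the same way, with an illustrative function name. The second call uses p = 0.5, the value that maximises p(1 - p); it is included only as an illustration of what the formula gives when no pilot estimate of p is available, and is not part of the original illustration.

```python
import math

def sample_size_for_proportion(z, p, error):
    """n = Z^2 * p * (1 - p) / E^2, before rounding."""
    return z ** 2 * p * (1 - p) / error ** 2

n_pilot = sample_size_for_proportion(z=2.58, p=36 / 80, error=0.03)
n_worst_case = sample_size_for_proportion(z=2.58, p=0.5, error=0.03)

print(f"with pilot p = 0.45: {n_pilot:.2f}, rounded up to {math.ceil(n_pilot)}")
print(f"with p = 0.5:        {n_worst_case:.2f}")
```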

Illustration 2.10: A machine produces components that have a standard deviation of 1.6 cm in length. A random sample of 64 parts is selected from the output and this sample has a mean length of 90 cm. Your customer will reject a part if it is less than 88 cm or more than 92 cm. Estimate the true mean length of all the products at the 95% confidence level.

Solution:

Large sample

$n = 64$, $\bar{x} = 90$, $\sigma = 1.6$, $Z_{\alpha/2} = 1.96$ (95%)


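Std. error $= \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1.6}{\sqrt{64}} = 0.2$

Therefore lower limit $= \bar{x} - Z_{\alpha/2}\,\sigma_{\bar{x}} = 90 - (1.96)(0.2) = 89.61$

Upper limit $= \bar{x} + Z_{\alpha/2}\,\sigma_{\bar{x}} = 90 + (1.96)(0.2) = 90.39$

Thus the interval is

$$89.61 < \mu < 90.39$$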

Illustration 2.11: The average travel time of 10 people (based on a random sample) working in a company to reach the office is 40 minutes with a standard deviation of 10 minutes. Establish the 95% confidence interval for the mean travel time of everyone in the company, to help redesign the working hours.

Solution:

Small sample

$n = 10$, $s = 10$, $\bar{x} = 40$

Degrees of freedom $= n - 1 = 9$

$t_{n-1} = 2.262$ (95%)

Std. error $= \frac{s}{\sqrt{n}} = \frac{10}{\sqrt{10}} = 3.1622$

Lower limit $= \bar{x} - t_{n-1}\frac{s}{\sqrt{n}} = 40 - (2.262)(3.1622) = 32.85$

Upper limit $= \bar{x} + t_{n-1}\frac{s}{\sqrt{n}} = 40 + (2.262)(3.1622) = 47.15$

$$32.85 < \mu < 47.15$$

Illustration 2.12: A health survey on a random sample of 75 patients with a particular illness was done. 70% of them were cured by a new drug. Establish the 95% confidence interval for the population proportion of all patients who will be cured by the new drug. This would help to assess the market potential of the new drug.


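Solution:

Proportions

$n = 75$, $p = \frac{70}{100} = 0.7$, $Z_{\alpha/2} = 1.96$ (95%)

Lower limit $= p - Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.7 - 1.96\sqrt{\frac{0.7 \times 0.3}{75}} = 0.596$

Upper limit $= p + Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.7 + 1.96\sqrt{\frac{0.7 \times 0.3}{75}} = 0.804$

$$0.596 < P < 0.804$$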

Illustration 2.13: A random sample of 40 vacation requests showed that the mean time to complete request formalities was 23.4 minutes. Estimate the true mean time using 95% confidence limits. Previous studies indicate the SD to be relatively constant at 9.8 minutes.

Solution:

$n = 40$, $\bar{x} = 23.4$, $\sigma = 9.8$, $Z_{\alpha/2} = 1.96$ (95%)

Std. error $= \frac{\sigma}{\sqrt{n}} = \frac{9.8}{\sqrt{40}} = 1.55$

Lower limit $= \bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 23.4 - (1.96)(1.55) = 20.362$

Upper limit $= \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 23.4 + (1.96)(1.55) = 26.438$

$$20.362 < \mu < 26.438$$


2.10 SUMMARY

This chapter has exposed you to the basics of sampling techniques and the concepts of sampling distribution. Remember that the entire inferential statistics is built on the foundation of the sampling distribution. In particular, this chapter focused on:

• The meaning and the need for sampling.
• The types of sampling, categorized into probability sampling and non-probability sampling.
• The most widely used sampling methods, which include simple random sampling, stratified random sampling, systematic sampling, and cluster sampling under probability sampling, and convenience sampling, expert opinion sampling, and quota sampling under non-probability sampling.
• The conceptual framework of the sampling distribution.
• The meaning and role of standard error.
• The sampling distribution of the mean from a normal population.
• The sampling distribution of the mean from a non-normal population, interwoven with the central limit theorem.
• The definition and meaning of point estimation for the population mean and population proportion.
• The role of the sample mean and sample proportion in estimating the population mean and population proportion, with their property of unbiasedness.
• The conceptual framework of interval estimation with its key elements.
• The methodology for establishing the confidence interval for the population mean and the population proportion based on the sample mean and the sample proportion.
• Illustrations giving the 95% and 99% confidence intervals for the population mean and the population proportion for large samples.
• Establishing confidence intervals for small samples using the t distribution, after explaining the role of degrees of freedom in computing the value of t.
• Determining the optimal sample size based on precision, confidence level, and knowledge of the population standard deviation.

Exercises on sampling distributions:

1. L&T has determined that the assembly time for a particular electrical component is normally distributed with a mean of 20 minutes and a standard deviation of 3 minutes.
a) What is the probability that an employee in the assembly division takes longer than 22 minutes to assemble one of these components?
b) What is the probability that the average assembly time for 15 employees is between 19 and 21 minutes?


2. The JVM heap size required by a population of application servers has a mean of 500 MB with a standard deviation of 120 MB. A new server with 16 GB of memory is received. What is the probability that the server can hold 30 application servers, given that the operating system requires 1 GB for itself? The population has a negatively skewed distribution.

3. A ship carries 25 passengers. The weight of each passenger has a normal distribution with mean 168 pounds and variance 361 square pounds. Safety regulations state that, for this particular ship, the total weight of passengers on the boat should not exceed 4,250 pounds more than 5% of the time. As a service to the ship owners, find the probability that the total weight of passengers on the ship will exceed 4,250 pounds.

4. Assume that the monthly savings of 1000 employees working in a factory are normally distributed with mean Rs. 2000 and standard deviation Rs. 50. If 25 samples consisting of 4 employees each are obtained, what would be the mean and standard deviation of the resulting sampling distribution of means if sampling were done (i) with replacement (ii) without replacement?

5. An auditor for a large credit card company knows that, on average, the monthly balance of any given customer is $112, and the standard deviation is $56. If he audits 50 randomly selected accounts, what is the probability that the sample average monthly balance is (a) below $100 (b) between $100 and $130?

Exercises on estimation

1. A manufacturer of ten-speed racing bicycles believes that the weight of the bicycle is normally distributed with a mean of 22 kg and a standard deviation of 1.5 kg. A random sample of 30 bicycles is selected. If the mean from this sample is 22.8 kg, what is a 96% confidence interval for the mean weight of the bicycle?

2. General Cinema sampled 55 people who viewed Ghost Hunter 8 and asked them whether they planned to see it again. Only 10 of them believed the film was worthy of a second look.
a) Estimate the standard error of the proportion of moviegoers who will view the film a second time.
b) Construct a 90% confidence interval for this proportion.


UNIT III

TESTING OF HYPOTHESIS

3.1 INTRODUCTION

In the real world we are often required to make educated guesses. Proven hypotheses facilitate educated guessing. A hypothesis is an assumption that is made and then statistically tested using appropriate tools. Exit polls are a classic example. A representative sample of the electorate (population) is surveyed on their choice of candidate, and a hypothesis is tested regarding the prospects of the parties contesting. Generally, a hypothesis that holds good for the sample holds good for the population too. However, errors are possible, and we will discuss these shortly as Type I and Type II errors. We may adopt parametric or non-parametric tests for our hypothesis testing. In this unit we will discuss parametric tests (the chi-square tests are an exception because they are non-parametric but are part of the third unit). The non-parametric tests will be discussed in the fourth unit.

In parametric tests, the estimation of the population parameters and the testing of hypotheses concerning those parameters are similar techniques, but at the same time there are major differences in the interpretation of results arising from each method. When we are concerned with measurement, say, of expenditure on entertainment, the appropriate method would be the process of estimation, because a range would be an ideal answer. When we are involved in decision making, such as whether we should raise the price of our product by 5 percent or not, it is hypothesis testing that would enable us to take a proper decision. In addition, hypothesis testing is very helpful in examining the validity or otherwise of theories such as "wage increases lead to rising prices".

3.2 LEARNING OBJECTIVES

After reading this unit, you will be able to:

• define and explain hypothesis
• formulate appropriate hypotheses in a given situation
• test hypotheses for large samples and small samples using parametric tests


3.3 PROCEDURE IN HYPOTHESIS TESTING

There are five steps involved in testing a hypothesis. These are briefly discussed below:

1. Formulate a Hypothesis: In order to make decisions about the statistical population based on sample observations, an assumption or statement is made about the population. This is known as a statistical hypothesis.

Hypotheses are of two types:

Null hypothesis (H0)

Alternate hypothesis (H1)

The null hypothesis H0 and the alternate hypothesis H1 are framed in such a way that if one hypothesis is true, then the other is false. That is, if the null hypothesis is false and is rejected, then the alternate hypothesis is true and is accepted. A Type I error occurs when the null hypothesis is rejected when it is actually true. A Type II error occurs when the null hypothesis is accepted when it is actually false.

2. Set up a Suitable Significance Level: Having formulated the hypothesis, the next step is to test its validity at a certain level of significance (LOS). The confidence with which a null hypothesis is rejected or accepted depends upon the significance level used for the purpose. A significance level of, say, 5 percent means that in the long run the risk of rejecting a null hypothesis that is actually true is about 5 percent; in other words, one is likely to reject a true hypothesis on 5 out of 100 occasions. A significance level of, say, 1 percent implies that there is a risk of wrongly rejecting a true hypothesis on 1 out of every 100 occasions. So a null hypothesis that is rejected at 5% LOS may be accepted at 1% LOS, because the area of acceptance at the 99% confidence level is larger. Your choice of significance level will be based on the criticality of the decision.

The critical region is decided based on the LOS. The LOS is the probability that a random value of the statistic belongs to the critical region. In other words, it is the size of the critical region. A region corresponding to a statistic which amounts to rejection of H0 (the null hypothesis) is termed the critical region or region of rejection. The value which separates the critical region from the acceptance region is called the critical value (the tabulated value against which the computed test statistic is compared). This value is set based on the alternate hypothesis.

3. Select Test Criterion: The next step in hypothesis testing is the selection of an appropriate statistical technique as a test criterion. There are many techniques from which one is to be chosen. For example, when the hypothesis pertains to a large sample of more than 30, the Z-test, implying the normal distribution, is used. When a sample is small (less than 30), the t-test is more appropriate.


4. Compute: After having selected the statistical technique to test the hypothesis, the next step involves the various computations necessary for the application of that particular test. These computations include the test statistic as well as its standard error.

5. Make Decisions: The final step in hypothesis testing is to draw a statistical decision, involving the acceptance or rejection of the null hypothesis. This will depend on whether the computed value of the test criterion falls in the region of acceptance or in the region of rejection at a given level of significance. It may be noted that the statement rejecting the hypothesis is much stronger than the statement accepting it. If the absolute calculated value is less than or equal to the tabulated (critical) value, then the null hypothesis is accepted; otherwise it is rejected.
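The link between the level of significance, the critical region, and the final decision can be sketched in a few lines. The function names `critical_region` and `decide` are illustrative, the computed statistic of 2.0 is a hypothetical value, and critical values are taken from `statistics.NormalDist` instead of a table. The example shows a null hypothesis that is rejected at the 5% level but accepted at the 1% level.

```python
from statistics import NormalDist

def critical_region(alpha, tail):
    """Return (lower, upper) critical Z values; None means no limit on that side."""
    std_normal = NormalDist()
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)
        return -z, z
    if tail == "left":
        return std_normal.inv_cdf(alpha), None
    if tail == "right":
        return None, std_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'two', 'left' or 'right'")

def decide(z_calculated, alpha, tail):
    """Return 'reject H0' if z falls in the critical region, else 'accept H0'."""
    lower, upper = critical_region(alpha, tail)
    in_critical_region = (
        (lower is not None and z_calculated < lower)
        or (upper is not None and z_calculated > upper)
    )
    return "reject H0" if in_critical_region else "accept H0"

z_calculated = 2.0  # hypothetical computed test statistic
decision_at_5 = decide(z_calculated, 0.05, "two")
decision_at_1 = decide(z_calculated, 0.01, "two")
print(f"5% LOS: {decision_at_5}; 1% LOS: {decision_at_1}")
```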

A summary of the steps in hypotheses testing is given below.

1. Set up the null hypothesis
2. Set up the alternate hypothesis
3. Set up the test statistic
4. Decide upon the level of significance and critical value using statistical tables.
5. Compute the value of the statistic using the sample drawn from the population
6. Compare the computed statistic with the pre-determined critical value and come to a conclusion.

The trickiest aspect of problems relating to hypothesis testing is identifying the type of problem and selecting the correct formula to be used for the problem. I have listed the formulae that we will use to solve problems related to parametric hypothesis testing.

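Formulae for Parametric tests:

Test for large samples:

1. Testing of single mean:

$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$

2. Testing of difference between two means:

$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - k}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

where $k$ is the hypothesized difference between the two population means.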


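Test for Small Samples

1. Testing of a single mean

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

2. Testing of difference between two means

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - k}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$s_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

Paired t-test:

$$t = \frac{\bar{d}}{s/\sqrt{n-1}}, \qquad \bar{d} = \frac{\sum d_i}{n}, \qquad s = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n}}$$

Testing for Proportions:

1. Single Proportion

$$Z = \frac{X - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}}$$

$X$ – Actual proportion

$P_0$ – Theoretical proportion

$n$ – number of trials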


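2. Difference in two Proportions

$$Z = \frac{p_1 - p_2}{\sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

$$p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}, \qquad q = 1 - p$$

Testing of Variances:

1. Single Variance

$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2}$$

2. Equality of two variances

$$F = \frac{S_1^2}{S_2^2}, \quad \text{if } S_1^2 > S_2^2$$

$$F = \frac{S_2^2}{S_1^2}, \quad \text{if } S_2^2 > S_1^2$$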

I would strongly suggest that you write these formulae in your statistics notebook (if you have one) as many times as possible (or at least once) to help you recollect them.
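One formula in the list that can look unfamiliar is the paired t statistic, which uses a standard deviation with divisor n together with the square root of n - 1. The sketch below, using hypothetical before and after scores, suggests that this is numerically the same as the more common form that uses divisor n - 1 in the standard deviation and the square root of n in the standard error. The variable names are illustrative.

```python
import math
import statistics

before = [72, 75, 70, 78, 74]  # hypothetical scores before a programme
after = [74, 79, 71, 81, 79]   # hypothetical scores after it

d = [a - b for a, b in zip(after, before)]  # paired differences
n = len(d)
d_bar = statistics.mean(d)

# Form used in the formula list: s with divisor n, then sqrt(n - 1).
s_divisor_n = statistics.pstdev(d)
t_notes = d_bar / (s_divisor_n / math.sqrt(n - 1))

# Common textbook form: s with divisor n - 1, then sqrt(n).
s_divisor_n_minus_1 = statistics.stdev(d)
t_usual = d_bar / (s_divisor_n_minus_1 / math.sqrt(n))

print(f"t (formula list form) = {t_notes:.4f}")
print(f"t (usual form)        = {t_usual:.4f}")
```

Either way, the statistic is compared with the t-table value for n - 1 degrees of freedom.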

We will now start solving the problems.

Illustration 3.1:

The mean lifetime of 100 picture tubes produced by a manufacturing company is estimated to be 5795 hours, with a standard deviation of 150 hours. Test the hypothesis that the mean lifetime of picture tubes is 6000 hours at the 5% level of significance.

Solution:

As I had hinted earlier, the tricky part is identifying the type of problem. We shall try to answer a few simple questions, and that will help you in solving the problem.


Q1. Is it a problem related to a large sample or a small sample?
Ans: Large sample. Since the data says 100 picture tubes, n > 30; hence it is a problem related to a large sample.

Q2. Is it a problem related to means, proportions or variances?
Ans: It is a problem related to means, because you are required to test the hypothesis relating to the mean life of the picture tube.

Q3. If it is a problem relating to means, is it relating to a single mean or two means?
Ans: It is a problem relating to a single mean, since only one sample mean and one standard deviation value are given.

Q4. Is it a problem relating to a single-tailed or a two-tailed test?
Ans: Since you have been asked to test whether the mean life is equal to 6000 hours or not equal to 6000 hours, it is a problem relating to a two-tailed test. If you were asked to check whether the mean life is greater than 6000 hours, then it would be a single-tailed test (right-tailed test); if you were asked to check whether the mean life of the picture tube is less than 6000 hours, then again you could conclude that it is a single-tailed test (left-tailed test). Hence all problems relating to just checking 'equal to' or 'not equal to' are two-tailed tests.

Now that we have answers to the questions, let us summarise them and frame the hypotheses. You can adopt the same procedure for all problems.

Without answers to the above questions, it will not be possible for us to frame the hypotheses, decide on the critical value, and select the appropriate formula to solve the problem.

The above Q & A format has been used to help you understand the problem. When you actually solve the problem, you need not elaborate in the above format.

Summary of answers to solve

Large Sample

TTT (Two tailed test)

Single mean testing

The data from the problem is as below.

The sample size = $n$ = 100

The sample mean = $\bar{x}$ = 5795

The assumed mean = $\mu_0$ = 6000

The sample standard deviation = $\sigma$ = 150

We now frame the hypotheses as given below.

H0: There is no significant difference between the sample mean and the assumed mean.

H1: There is a significant difference between the sample mean and the assumed mean.


We need to fix the standard test value based on the LOS. You have been asked to test the hypotheses at 5% LOS. Hence the CL is 95%. This means we have an area of 0.475 on either side of the mean of the normal curve (because it is a two-tailed test), and the corresponding Z intercept ranges from -1.96 to +1.96. If our calculated value of Z falls within this range then we shall accept the null hypothesis; otherwise we shall reject it.

Let us now substitute the values in the Z equation and determine the calculated value of Z.

$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{5795 - 6000}{150/\sqrt{100}} = -13.667$$

Since the calculated value of Z is less than the critical value on the negative side (that is, it falls outside the range), we reject the null hypothesis and accept the alternate hypothesis; i.e., there is a statistically significant difference between the sample mean and the assumed mean.
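The arithmetic of this illustration can be verified with a short sketch. It uses the sample mean of 5795 hours stated in the problem and the two-tailed critical value of 1.96; the variable names are illustrative.

```python
import math

n = 100          # sample size
x_bar = 5795     # sample mean lifetime in hours
mu_0 = 6000      # assumed (hypothesized) mean in hours
s = 150          # standard deviation in hours

z = (x_bar - mu_0) / (s / math.sqrt(n))
z_critical = 1.96  # two-tailed critical value at 5% LOS

reject_h0 = abs(z) > z_critical
print(f"Z = {z:.3f}; reject H0: {reject_h0}")
```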

You may try illustration 3.2 using a similar procedure.

Illustration 3.2: Hinton Press hypothesizes that the average life of its largest web press is 14,500 hours. They know that the standard deviation of press life is 2,100 hours. From a sample of 35 presses, the company finds a sample mean of 13,000 hours. At a 0.01 significance level, should the company conclude that the average life of the presses is less than the hypothesized 14,500 hours?

Solution:

Summary of answers to solve

Large Sample

LTT (Left tailed test) (Single tail)

(Because you are to test for ‘less than’ the hypothesized value)

Single mean testing

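[Figure for Illustration 3.1: normal curve centred at $\mu_0$ = 6000 with the acceptance region between Z = -1.96 and Z = +1.96; the sample mean $\bar{x}$ = 5796 lies far to the left of -1.96, in the rejection region.]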


The data from the problem is as below.

The sample size = n = 35

The sample mean = $\bar{x}$ = 13000 hours

The assumed mean = $\mu_0$ = 14500 hours

The standard deviation = $\sigma$ = 2100 hours

We now frame the hypotheses as given below.

H0: There is no significant difference between the sample mean and the assumed mean.

H1: The sample mean is significantly less than the assumed mean.

(Please note that the alternate hypothesis here is different from that of the previous illustration. This is because this is a single tailed test, more specifically an LTT.)

We need to fix the standard test value based on the LOS. You have been asked to test the hypotheses at 0.01 LOS (which means 1% LOS). Hence the CL is 99%. But you have to be careful here as it is a left tailed test. This means we have an area of 0.5 on the right side and an area of 0.49 on the left of the normal curve, and the corresponding Z intercept has only a lower limit that corresponds to an area of 0.49. This Z intercept value is -2.33. If our calculated value of Z is more than -2.33 then we shall accept the null hypothesis, else we shall reject it.

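Let us now substitute the values in the Z equation and determine the calculated value of Z.

$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{13000 - 14500}{2100/\sqrt{35}} = \frac{-1500}{354.96} \approx -4.23$$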

Since the calculated value of Z is less than the critical value on the negative side (that is, it falls outside the acceptance region), we reject the null hypothesis and accept the alternate hypothesis, (i.e.,) the sample mean is significantly less than the assumed mean.
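For a single tailed test the critical value need not be read from a table. The sketch below, which assumes Python's standard `statistics.NormalDist` class as a stand-in for the normal table, recomputes Illustration 3.2; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Data of Illustration 3.2
n, x_bar, mu_0, sigma = 35, 13000, 14500, 2100
alpha = 0.01  # level of significance

z_cal = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Left tailed test: the whole rejection area alpha lies in the left tail
z_tab = NormalDist().inv_cdf(alpha)

print(f"z_cal = {z_cal:.3f}, z_tab = {z_tab:.3f}")
print("reject H0" if z_cal < z_tab else "accept H0")
```

For a right tailed test the critical value would be `NormalDist().inv_cdf(1 - alpha)`, and for a two tailed test `NormalDist().inv_cdf(1 - alpha / 2)`.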

[Figure for Illustration 3.2: normal curve for the left tailed test, with an area of 0.49 between the critical value Z = -2.33 and the mean, an area of 0.5 to the right of the mean, and the acceptance region to the right of -2.33.]

Illustration 3.3: A random sample of 100 villages was taken from district A and the average height of the population was found to be 170 cm with an SD of 10 cm. Another random sample of 120 villages from district B was taken and the average height of the population was found to be 176 cm with an SD of 12 cm. Is the difference between the average heights of the population in the two districts statistically significant?

Solution:

Going by our previous illustrations we find that this problem is related to

large sample

two mean testing

Two tailed test,

The data from the problem is listed below.

$n_1$ = 100

$n_2$ = 120

$\bar{x}_1$ = 170

$\bar{x}_2$ = 176

$\sigma_1$ = 10

$\sigma_2$ = 12

H0: There is no significant difference between the average heights of the population in the two districts.

H1: There is a significant difference between the average heights of the population in the two districts.

In this problem the LOS has not been specified. In cases such as these, you may assume a suitable LOS and state the same explicitly. A popular value for the assumed LOS is 5% and the corresponding Z intercept values will be -1.96 to +1.96 (for a two tailed test).

The value of 'k' is the pre-specified difference in the problem. For example, if the problem read as 'test whether the average heights differed by 10 cm', then the value of k would be taken as 10. Since no such pre-specified difference is mentioned, we assume the value of k in this problem to be 0.

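$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - k}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(170 - 176) - 0}{\sqrt{\frac{100}{100} + \frac{144}{120}}} = \frac{-6}{\sqrt{2.2}} \approx -4.045$$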

Since the calculated value of Z is less than the tabulated value (-1.96) in the negative region, the null hypothesis is rejected and the alternate hypothesis is accepted, (i.e.,) there is a significant difference between the average heights of the people in the two districts.
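The two mean calculation can be sketched as a small Python function. The name `z_two_means` and its parameters are assumptions made for illustration; `k` is the pre-specified difference discussed above and defaults to 0.

```python
import math

def z_two_means(x1, x2, sd1, sd2, n1, n2, k=0.0):
    """Z statistic for the difference of two means (large samples)."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return ((x1 - x2) - k) / standard_error

# District A: n = 100, mean 170, SD 10; district B: n = 120, mean 176, SD 12
z_cal = z_two_means(170, 176, 10, 12, 100, 120)
z_tab = 1.96  # two tailed critical value at the assumed 5% LOS

print(round(z_cal, 3), "reject H0" if abs(z_cal) > z_tab else "accept H0")
```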


Please go through illustration 3.4. This illustration talks about the probabilities associated with the mean life of the bulbs. You have not been asked to test the hypotheses.

Illustration 3.4: The electric light bulbs of manufacturer A have a mean lifetime of 1400 hours (h) with a S.D. of 200 h, while those of manufacturer B have a mean lifetime of 1200 h with a S.D. of 100 h. If random samples of 125 bulbs from each manufacturer are selected, then what is the probability that the brand A bulbs will have a mean lifetime that is at least (a) 160 h and (b) 250 h more than the brand B bulbs?

Solution:

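The data are $n_A = n_B = 125$, $\mu_A$ = 1400 h, $\mu_B$ = 1200 h, $\sigma_A$ = 200 h and $\sigma_B$ = 100 h.

$$\mu_{\bar{x}_A - \bar{x}_B} = \mu_A - \mu_B = 1400 - 1200 = 200 \text{ h}$$

This is the difference in the means for the two bulbs, which is subtracted in the numerator of the formula. Similarly the denominator is computed below. Then the Z intercept is computed and finally the area under the curve for the given range will indicate the probability for the range.

$$\sigma_{\bar{x}_A - \bar{x}_B} = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}} = \sqrt{\frac{200^2}{125} + \frac{100^2}{125}} = 20 \text{ h}$$

$$z = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sigma_{\bar{x}_A - \bar{x}_B}} = \frac{(\bar{x}_A - \bar{x}_B) - 200}{20}$$

(a) If $\bar{x}_A - \bar{x}_B$ = 160 h, then $z = \frac{160 - 200}{20} = -2$, and

$$P(\bar{x}_A - \bar{x}_B \ge 160) = P(z \ge -2) = 0.5 + 0.4772 = 0.9772$$

(b) If $\bar{x}_A - \bar{x}_B$ = 250 h, then $z = \frac{250 - 200}{20} = 2.5$, and

$$P(\bar{x}_A - \bar{x}_B \ge 250) = P(z \ge 2.5) = 0.5 - 0.4938 = 0.0062$$

The probability of the difference in the mean life of the bulbs being at least 160 hours is 0.9772, i.e., 97.72%, and the probability of the difference in the mean life of the bulbs being at least 250 hours is 0.0062, i.e., 0.62%.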
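The two probabilities of Illustration 3.4 can be reproduced without a normal table. This is a minimal sketch that assumes Python's standard `statistics.NormalDist` class to model the sampling distribution of the difference in means; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu_a, mu_b = 1400, 1200   # population mean lifetimes (h)
sd_a, sd_b = 200, 100     # population standard deviations (h)
n_a = n_b = 125           # sample sizes

# Sampling distribution of the difference between the two sample means
mean_diff = mu_a - mu_b
se_diff = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
diff_dist = NormalDist(mean_diff, se_diff)

p_at_least_160 = 1 - diff_dist.cdf(160)
p_at_least_250 = 1 - diff_dist.cdf(250)

print(round(se_diff, 2), round(p_at_least_160, 4), round(p_at_least_250, 4))
```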


Illustration 3.5: A die was thrown 500 times and a 6 resulted 100 times. Does the data justify the hypothesis that the die is unbiased?

Solution: This is a problem concerned with a large number of trials (500 times) and with proportions to be tested. It is a single proportion case and a two tailed test. The data are as given below.

The number of trials = n = 500

The actual proportion got from the trials is X = 100/500 = 1/5

The theoretical proportion that must have been is P0 (theoretical proportion) = 1/6

Since in this case also the LOS has not been specified, we can assume 5% LOS and the Z intercept range would be -1.96 to +1.96.

Hence we frame the hypotheses for a two tailed test as below.

H0: There is no significant difference between the actual and the theoretical proportion, (i.e.,) the die is not biased.

H1: There is a significant difference between the actual and the theoretical proportion, (i.e.,) the die is biased.

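$$Z = \frac{X - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}} = \frac{\frac{1}{5} - \frac{1}{6}}{\sqrt{\frac{\frac{1}{6}\left(1 - \frac{1}{6}\right)}{500}}} = 2$$

At 5% LOS, as $Z_{cal}$ (2) is greater than $Z_{tab}$ (1.96), the null hypothesis is rejected. We conclude that the die is biased.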

An interesting point that is worth discussing here is: if we had decided on 1% LOS, then the Z intercept range would have been -2.58 to +2.58. The null hypothesis would have been accepted. Hence the null hypothesis rejected at 5% LOS is accepted at 1% LOS. This will give you a clue as to how to select your LOS.
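The effect of the chosen LOS on the decision can be seen by running the same statistic against both critical values. The following Python sketch uses an illustrative helper, `z_single_proportion`, which is an assumed name rather than anything defined in these notes.

```python
import math

def z_single_proportion(successes, n, p0):
    """Z statistic for a large-sample test of a single proportion."""
    observed = successes / n
    return (observed - p0) / math.sqrt(p0 * (1 - p0) / n)

# Die thrown 500 times, a six on 100 throws, theoretical proportion 1/6
z_cal = z_single_proportion(100, 500, 1 / 6)

# Two tailed critical values quoted in the text for 5% and 1% LOS
decisions = {}
for los, z_tab in ((0.05, 1.96), (0.01, 2.58)):
    decisions[los] = "reject H0" if abs(z_cal) > z_tab else "accept H0"
    print(f"LOS {los:.0%}: Z_cal = {z_cal:.2f}, Z_tab = {z_tab}, {decisions[los]}")
```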

Illustration 3.6: A manufacturer claims that at least 95% of the components of an electric circuit board which he supplies conform to specification. A random sample of 220 components showed that only 185 were up to the standard. Test his claim at 1% LOS.

Solution:

X (actual proportion) = 185/220 ≈ 0.841

P0 (theoretical proportion) = 95/100 = 0.95

H0: There is no significant difference between the actual and accepted proportion, (i.e.,) the manufacturer's claim is true.

H1: The actual proportion is significantly less than the accepted proportion, (i.e.,) the manufacturer's claim is not true. (Left tailed test)

$$Z = \frac{X - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}} = \frac{\frac{185}{220} - \frac{95}{100}}{\sqrt{\frac{\frac{95}{100} \times \frac{5}{100}}{220}}} \approx -7.42$$

Thus $Z_{cal}$ is less than $Z_{tab}$ = -2.33 (LTT) on the negative side. So the null hypothesis is rejected. We conclude that the manufacturer's claim is not true.

Illustration 3.7: A machine produced 20 defective articles in a batch of 400. After overhauling, it produced 10 defectives in a batch of 300. Has the machine improved?

Solution:

testing of two proportions

Here, if we need to conclude that the machine has improved, then the proportion of defectives after overhauling must be significantly lower than the proportion of defectives before overhauling. That is, if p1 is the proportion of defectives before overhauling and p2 is the proportion of defectives after overhauling, then the positive difference (p1 - p2) must be significant. Hence it is a single tailed test and more specifically a right tailed test.

RTT

The data are as given below.

$p_1 = \frac{20}{400}$, $p_2 = \frac{10}{300}$

$n_1$ = 400, $n_2$ = 300

H0: There is no significant difference between the two proportions, (i.e.,) the machine has not improved.

H1: The proportion of defectives after overhauling is significantly lower than the proportion of defectives before overhauling, (i.e.,) the machine has improved.


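$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{400\left(\frac{20}{400}\right) + 300\left(\frac{10}{300}\right)}{400 + 300} = \frac{30}{700} \approx 0.043$$

p = 0.043, Therefore q = 1 - 0.043 = 0.957

$$Z = \frac{p_1 - p_2}{\sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{\frac{20}{400} - \frac{10}{300}}{\sqrt{(0.043)(0.957)\left(\frac{1}{400} + \frac{1}{300}\right)}} \approx \frac{0.0167}{0.0155} \approx 1.08$$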

Since $Z_{cal}$ is less than $Z_{tab}$ = 1.65 at 5% LOS (RTT), we accept the null hypothesis and conclude that there is no significant difference between the two proportions. The machine has not improved.
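The pooled proportion test can be sketched in Python as follows. The function name `z_two_proportions` is an illustrative assumption, and the code carries full precision instead of the rounded values p = 0.043 and q = 0.957, so the statistic may differ slightly in the second decimal place from a hand calculation.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Z statistic for the difference of two proportions using the pooled p."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion
    q = 1 - p
    return (p1 - p2) / math.sqrt(p * q * (1 / n1 + 1 / n2))

# Before overhauling: 20 defectives in 400; after: 10 defectives in 300
z_cal = z_two_proportions(20, 400, 10, 300)
z_tab = 1.65  # right tailed critical value at 5% LOS, as used in the text

print(round(z_cal, 3), "reject H0" if z_cal > z_tab else "accept H0")
```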

You may try the following illustration 3.7a, which is related to the determination of a probability.

Illustration 3.7a: The manager at a departmental store wants you to estimate the proportion of credit accounts that are paid up. You plan to estimate that proportion by taking a random sample of 200 credit accounts from the store's 16,032 credit accounts. If the actual proportion of paid-up credit accounts is 0.32, what is the probability that a sample proportion will be within 0.02 of the true proportion? In other words, what is the probability that $0.30 \le \bar{p} \le 0.34$?

Solution: np = 200 × 0.32 = 64 > 5 and n(1 - p) = 200(1 - 0.32) = 136 > 5, so the normal approximation may be used.

S.E. of the proportion:

$$\sigma_{\bar{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.32(1 - 0.32)}{200}} \approx 0.033$$

$$z = \frac{0.30 - 0.32}{0.033} \approx -0.61 \quad \text{and} \quad z = \frac{0.34 - 0.32}{0.033} \approx 0.61$$

Area under the curve at z = 0.61 is 0.2291.

$$P(0.30 \le \bar{p} \le 0.34) = 2 \times 0.2291 = 0.4582$$


There is a probability of 0.46 that your sample proportion will be within 0.02 of the population proportion.

Illustration 3.8: A tomato ketchup company's marketing research department used a national telephone survey of 6,000 households and found that the new extra-spicy ketchup would be purchased by 335 of the households. A much more extensive study made 2 years ago showed that 5 percent of the households would purchase the extra-spicy ketchup at that point of time. At a 2 percent significance level, should the company conclude that there is an increased interest in the extra-spicy flavor now?

Solution:

n = 6000

X (actual proportion) = 335 / 6000 = 0.0558

P0 (theoretical proportion) = 5 / 100 = 0.05

H0: There is no significant difference between the actual and theoretical proportion, (i.e.,) there is no increase in interest in the extra spicy flavour.

H1: The actual proportion is significantly more than the theoretical proportion, (i.e.,) there is an increase in interest in the extra spicy flavour. (Right tailed test)

We now fix our Z tab value as +2.06 (at 2% LOS in the right tail from the normal table).

$$Z = \frac{X - P_0}{\sqrt{\frac{P_0(1 - P_0)}{n}}} = \frac{\frac{335}{6000} - 0.05}{\sqrt{\frac{0.05(0.95)}{6000}}} = \frac{0.00583}{0.00281} \approx 2.07$$

Thus $Z_{cal}$ (2.07) is greater than $Z_{tab}$ = +2.06 (RTT) on the positive side. So the null hypothesis is rejected. We conclude that there is an increase in interest in the extra spicy flavour, although the calculated value exceeds the critical value only narrowly.

Illustration 3.9: The mean breaking strength of a metallic rope is 160 kg. If 6 ropes (randomly selected from different rolls) have a mean breaking strength of 154.3 kg with an SD of 6.4 kg, test if the mean breaking strength is less than 160 kg at 1% LOS.

Solution:

We will proceed as we have done in the previous problems

small sample (n = 6)

LTT (because you are going to test for less than the assumed mean)


Assumed mean = $\mu_0$ = 160

Sample size = n = 6

$\bar{x}$ = Sample mean = 154.3

s = Sample standard deviation = 6.4

H0: There is no significant difference between the sample mean and the assumed mean

H1: The sample mean is significantly lesser than the assumed mean

The critical value or the $t_{tab}$ value is taken from Appendix Table 2. Unlike in large samples, where you took the Z intercept value from the normal tables based on the area determined through the LOS, the t test values are taken from the Student's t-distribution table using the degrees of freedom and the LOS.

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{154.3 - 160}{6.4/\sqrt{6}} = \frac{-5.7}{2.6128} \approx -2.182$$

$t_{tab}$ = -3.365

As $t_{cal}$ > $t_{tab}$ (that is, -2.182 lies to the right of -3.365), we accept the null hypothesis, i.e., there is no significant difference between the sample mean and the assumed mean.

Degrees of freedom for t test = n - 1. If the test is single tailed, then multiply the LOS by 2 and check the value against the degrees of freedom. In the above problem, as the degrees of freedom are 5, the $t_{tab}$ value read under 0.02 is taken as -3.365.

Illustration 3.10: On measuring specimens of nylon yarn taken from two machines it was found that 8 from the first machine had a mean thickness of 9.67 and SD of 1.81, while 10 from the second machine had a mean thickness of 7.43 and SD of 1.48. Assuming the population to be normal, check if the mean difference between the first and the second sample is more than 1.5.

Solution:

If we follow our usual strategy, we conclude that it is a problem relating to two means.

small sample

two mean testing

A RTT with a value of k = 1.5.


The data are as given below.

$n_1$ = 8, $n_2$ = 10

$\bar{x}_1$ = 9.67, $\bar{x}_2$ = 7.43

$s_1$ = 1.81, $s_2$ = 1.48

Since LOS has not been specified, you can assume 1% LOS.

H0: There is no significant difference between the sample means of the two samples.

H1: The difference in the sample means is significantly greater than 1.5 (RTT)

We will use the formula as given below.

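$$t = \frac{(\bar{x}_1 - \bar{x}_2) - k}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{7(1.81)^2 + 9(1.48)^2}{8 + 10 - 2} = \frac{42.65}{16} \approx 2.666$$

$$s_p = \sqrt{2.666} \approx 1.633$$

Therefore

$$t = \frac{(9.67 - 7.43) - 1.5}{1.633\sqrt{\frac{1}{8} + \frac{1}{10}}} = \frac{2.24 - 1.5}{0.775} \approx 0.955$$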

In this problem, the value of k = 1.5.

Degrees of freedom = $n_1 + n_2 - 2$ = 16

At 1% LOS, the value at 16 degrees of freedom corresponding to 0.02 is 2.583.

As $t_{cal}$ is less than $t_{tab}$, the null hypothesis is accepted.
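The pooled standard deviation and the t statistic for two small samples can be sketched in Python as below. The helper name `pooled_t` is an assumption for illustration, and the tabulated value is simply the one quoted in the text, since the standard library has no t table.

```python
import math

def pooled_t(x1, x2, s1, s2, n1, n2, k=0.0):
    """t statistic for two independent small samples with a pooled SD."""
    dof = n1 + n2 - 2
    sp_squared = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / dof
    standard_error = math.sqrt(sp_squared) * math.sqrt(1 / n1 + 1 / n2)
    return ((x1 - x2) - k) / standard_error, dof

# Illustration 3.10: two machines, pre-specified difference k = 1.5
t_cal, dof = pooled_t(9.67, 7.43, 1.81, 1.48, 8, 10, k=1.5)
t_tab = 2.583  # tabulated value quoted in the text for 16 d.f.

print(round(t_cal, 3), dof, "reject H0" if t_cal > t_tab else "accept H0")
```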

Illustration 3.11: The following are the number of sales made by a sample of 6 sales people of gas lighters in city A and a sample of 8 sales people of gas lighters in city B over a certain period of time.


City A – 63, 48, 54, 44, 59, 52

City B – 41, 52, 38, 50, 66, 54, 44, 61

Assuming that the populations can be approximated closely with a normal distribution having the same variance, test if the average sales of the two cities are the same.

Solution:

Small sample

Two means and a two tailed test.

In this problem we will have to compute the means and the standard deviations of the two samples before we can proceed to use the formula for the t-test.

$$s_1^2 = \frac{1}{n_1}\sum (x_{1i} - \bar{x}_1)^2, \qquad s_2^2 = \frac{1}{n_2}\sum (x_{2i} - \bar{x}_2)^2$$

x1      x2      x1 - mean1   (x1 - mean1)2   x2 - mean2   (x2 - mean2)2
63      41       9.67         93.51           -9.75         95.06
48      52      -5.33         28.41            1.25          1.56
54      38       0.67          0.45          -12.75        162.56
44      50      -9.33         87.05           -0.75          0.56
59      66       5.67         32.15           15.25        232.56
52      54      -1.33          1.77            3.25         10.56
        44                                    -6.75         45.56
        61                                    10.25        105.06
Total
320     406                  243.3                         653.48

$\bar{x}_1 = \frac{320}{6} = 53.33$

$\bar{x}_2 = \frac{406}{8} = 50.75$

$s_1^2 = \frac{243.3}{6} = 40.55$, so $s_1 \approx 6.37$

$s_2^2 = \frac{653.48}{8} \approx 81.69$, so $s_2 \approx 9.04$


H0: There is no significant difference between the average sales of two cities.

H1: There is a significant difference between the average sales of two cities.

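$$t = \frac{(\bar{x}_1 - \bar{x}_2) - k}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{5(40.55) + 7(81.69)}{12} \approx 64.55$$

$$s_p \approx 8.03$$

$$t = \frac{53.33 - 50.75}{8.03\sqrt{\frac{1}{6} + \frac{1}{8}}} \approx 0.595$$

Since the test is two tailed, the value at 1% LOS and 12 degrees of freedom is 3.055. As $t_{cal} < t_{tab}$, the null hypothesis is accepted.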

Illustration 3.12: An IQ test was administered to 5 persons before and after training. The scores are as follows

Before Training 110 120 123 132 125

After Training 120 118 125 136 121

(a) At 1% LOS test whether the training program was useful?

(b) Was there any significant change in the test scores after training?

Solution:

Small sample, paired t-test, two-tailed.


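$$\bar{d} = \frac{\sum d_i}{n} = \frac{-10}{5} = -2$$

$$s^2 = \frac{\sum (d_i - \bar{d})^2}{n} = \frac{(-10 + 2)^2 + (2 + 2)^2 + (-2 + 2)^2 + (-4 + 2)^2 + (4 + 2)^2}{5} = \frac{8^2 + 4^2 + 0^2 + 2^2 + 6^2}{5} = \frac{120}{5} = 24$$

$$s = \sqrt{24} \approx 4.899$$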

H0: There is no significant difference between the two values, i.e., the training was not useful.

H1: There is a significant difference between the two values, i.e., the training was useful.

$$t = \frac{\bar{d}}{s/\sqrt{n - 1}} = \frac{-2}{4.899/\sqrt{4}} \approx -0.816$$

$t_{tab}$ at 1% LOS and 4 degrees of freedom is 4.60.

As $|t_{cal}| < t_{tab}$, we accept the null hypothesis. There is no significant change in the test scores and hence we can conclude that the training was not useful.

x (before)   y (after)   d = x - y
110          120         -10
120          118           2
123          125          -2
132          136          -4
125          121           4
Total
610          620         -10
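The paired t-test of Illustration 3.12 can be sketched in Python as follows. The code follows the convention used in the text (standard deviation of the differences with divisor n, then division by the square root of n - 1); the variable names are illustrative assumptions.

```python
import math

before = [110, 120, 123, 132, 125]
after = [120, 118, 125, 136, 121]

# Differences d = x - y for each person
d = [x - y for x, y in zip(before, after)]
n = len(d)
d_bar = sum(d) / n

# Standard deviation of the differences with divisor n, as in the text
s = math.sqrt(sum((di - d_bar) ** 2 for di in d) / n)

t_cal = d_bar / (s / math.sqrt(n - 1))
t_tab = 4.60  # tabulated value quoted in the text (1% LOS, 4 d.f.)

print(d, round(t_cal, 3), "reject H0" if abs(t_cal) > t_tab else "accept H0")
```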


Illustration 3.13: Using 5% LOS, test if the diameter of a certain wire rope has a standard deviation of 0.022 in, given that a random sample of size 18 resulted in a sample variance of 0.000324.

Solution:

Test of single variance, TTT

LOS = 5% = 0.05, so LOS/2 = 0.025 in each tail.

H0: There is no significant difference between the sample standard deviation and the assumed standard deviation.

H1: There is a significant difference between the sample standard deviation and the assumed standard deviation.

In symbols, $H_0: \sigma = 0.022$ and $H_1: \sigma \neq 0.022$, with n = 18 and $s^2$ = 0.000324.

$$\chi^2_{cal} = \frac{(n - 1)s^2}{\sigma^2} = \frac{17 \times 0.000324}{(0.022)^2} = 11.38$$

$\chi^2_{tab}$ upper bound = $\chi^2_{0.025}$ = 30.191

$\chi^2_{tab}$ lower bound = $\chi^2_{0.975}$ = 7.564

Since $\chi^2_{cal}$ lies between the two $\chi^2_{tab}$ bounds, the null hypothesis is accepted.
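The single variance test reduces to one ratio and a comparison with two tabulated bounds. A minimal Python sketch is given below; the bounds are the table values quoted in the text for 17 degrees of freedom, and the variable names are assumptions.

```python
n = 18
s_squared = 0.000324   # sample variance
sigma_0 = 0.022        # hypothesised standard deviation

chi_sq_cal = (n - 1) * s_squared / sigma_0**2

# Tabulated chi-square values for 17 d.f. at 0.975 and 0.025 (from the text)
lower_bound, upper_bound = 7.564, 30.191
accept_h0 = lower_bound < chi_sq_cal < upper_bound

print(round(chi_sq_cal, 2), "accept H0" if accept_h0 else "reject H0")
```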

Illustration 3.14: It is desired to determine whether there is less variability in the intensity of light from two bulbs made by company A and B respectively in a certain location. If independent random samples of size 16 of the two bulbs yield values of $s_1$ = 1.5 foot candles and $s_2$ = 1.75 foot candles, test if the variability in illumination of the second bulb is greater than that of the first bulb.


Solution:

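Testing two variances, small sample, LTT

$s_1$ = 1.5

$s_2$ = 1.75

$n_1$ = 16

$n_2$ = 16

LOS = 1%

H0: $\sigma_1^2 = \sigma_2^2$

H1: $\sigma_1^2 < \sigma_2^2$

$$F = \frac{s_2^2}{s_1^2} = \frac{(1.75)^2}{(1.5)^2} = \frac{3.0625}{2.25} = 1.361$$

$F_{tab}$ = 3.52

As $F_{cal} < F_{tab}$, the null hypothesis is accepted.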

Illustration 3.15: Given a sample mean of 83, a sample standard deviation of 12.5, and a sample size of 22, test the hypothesis that the value of the population mean is 70 against the alternative that it is more than 70. Use the 0.025 significance level.

Solution:

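s = 12.5, n = 22, $\bar{x}$ = 83

$H_0: \mu = 70$, $H_1: \mu > 70$, $\alpha$ = 0.025

The upper limit of the acceptance region is t = 2.080, or

$$\bar{x} = \mu_0 + t\,\frac{s}{\sqrt{n}} = 70 + \frac{2.080(12.5)}{\sqrt{22}} = 75.54$$

Because the observed t value

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{83 - 70}{12.5/\sqrt{22}} = 4.878 > 2.080$$

(or $\bar{x}$ > 75.54), we should reject $H_0$.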


3.4 CHI SQUARE TESTS

3.4.1 Chi Square Test Of Goodness Of Fit

The Chi-square goodness of fit test is done to check the differences between a theoretical distribution and an actual distribution. For example, we were able to fit a binomial and a Poisson distribution using a data set in unit 1. We can establish the validity of our theoretical distribution values if the chi-square goodness of fit establishes the hypothesis that there is no significant difference between the theoretical distribution and the actual distribution. Please go through the following illustration.

Procedure for conducting chi-square test

1. State the null hypothesis, which is usually as follows: the sample distribution agrees with the hypothetical or theoretical distribution.

2. Calculate the number in each category on the assumption that the null hypothesis is correct. Thus, for each observation, we shall have an observed frequency and an expected frequency.

3. Determine the level of significance.

4. Calculate the chi-square by using the following formula

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $\chi^2$ = chi-square

$O_i$ = observed frequency in the ith category

$E_i$ = expected frequency in the ith category

k = number of categories

5. Determine the number of degrees of freedom. For the specified level of significance and the degrees of freedom, find the critical or theoretical value of $\chi^2$.

6. Compare the calculated value of $\chi^2$ with the theoretical value and determine the region of rejection. In case the calculated value of $\chi^2$ is less than the theoretical (or critical) value, the null hypothesis is accepted. If, on the other hand, the calculated value of $\chi^2$ is greater than the theoretical value, the null hypothesis is rejected.

Let us take an example involving the use of the $\chi^2$ test.


Illustration 3.16: Test the goodness of fit of the observed and expected frequencies given below using chi square.

O.F: 27 14 6 3 0 0

E .F: 24 19 6 1 0 0

Solution:

H0 : There is no significant difference between the observed and expected frequencies

H1 : There is a significant difference between the observed and expected frequencies.

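$$\chi^2_{cal} = \sum \frac{(O - E)^2}{E} = \frac{9}{24} + \frac{25}{19} + \frac{0}{6} + \frac{4}{1} \approx 5.69$$

(The two categories in which both the observed and the expected frequencies are zero contribute nothing to the sum.)

d.f. = 6 - 1 = 5

$\chi^2_{tab}$ at 5% = 11.07 (From Appendix table 5)

As $\chi^2_{cal} < \chi^2_{tab}$, we accept the null hypothesis. That is, there is no significant difference between the observed frequency and the theoretical frequency.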
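The goodness of fit statistic defined in step 4 of the procedure can be sketched as a short Python function. The name `chi_square_gof` is illustrative, and skipping categories whose expected frequency is zero is an assumption made here so that the division is defined; it is not a rule stated in these notes.

```python
def chi_square_gof(observed, expected):
    """Sum of (O - E)^2 / E over categories, skipping categories with E = 0."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Frequencies of Illustration 3.16
observed = [27, 14, 6, 3, 0, 0]
expected = [24, 19, 6, 1, 0, 0]

chi_sq_cal = chi_square_gof(observed, expected)
chi_sq_tab = 11.07  # tabulated value quoted in the text (5% LOS, 5 d.f.)

print(round(chi_sq_cal, 2), "reject H0" if chi_sq_cal > chi_sq_tab else "accept H0")
```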

3.4.2 Testing Attributes Using Chi Square

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O – An observed frequency

E – An expected frequency

$$E = \frac{RT \times CT}{n}$$

E – Expected frequency in a given cell

Calculation table for Illustration 3.16

O      E      O - E   (O-E)2   (O-E)2/E
27     24      3       9        0.375
14     19     -5      25        1.316
6      6       0       0        0
3      1       2       4        4.000
0      0       0       0        -
0      0       0       0        -
Total
50     50              38       5.69


RT – Row total for the row containing that cell

CT – Column total for the column containing that cell

n – Total number of observations

At this stage, we may introduce a simple approach for calculating $\chi^2$ in a 2 × 2 table where the cell frequencies and the marginal totals are as shown in table 6.2.

Observed frequency table

a          b          (a + b)
c          d          (c + d)
(a + c)    (b + d)    N = a + b + c + d

Expected frequency table

(a + b)(a + c)/N    (a + b)(b + d)/N
(c + d)(a + c)/N    (c + d)(b + d)/N

Chi-square can be calculated by a simple formula, which involves only the observed frequencies. The formula is

$$\chi^2 = \frac{N(ad - bc)^2}{(a + c)(b + d)(c + d)(a + b)}$$

In case there is a 2 × 3 table, the alternative formula for calculating $\chi^2$ will be different. The cell frequencies and marginal totals are as given below:

a          b          c          (a + b + c)
d          e          f          (d + e + f)
(a + d)    (b + e)    (c + f)    N

The formula for using the values on the basis of the above table is as follows:

$$\chi^2 = \frac{N}{a + b + c}\left[\frac{a^2}{a + d} + \frac{b^2}{b + e} + \frac{c^2}{c + f}\right] + \frac{N}{d + e + f}\left[\frac{d^2}{a + d} + \frac{e^2}{b + e} + \frac{f^2}{c + f}\right] - N$$

where the general result $\chi^2 = \sum \frac{O^2}{E} - N$ is used.

As mentioned earlier, Chi-square tests are non-parametric tests which do not use parametric values such as means or variances. They use count data instead. By count data we mean the number of people or units in a population or sample that fall under a distinct category. Please go through the following illustration for a better understanding.

Illustration 3.17: 1600 families were selected at random in a city to test the belief that high income families sent their children to public schools and low income families sent their children to government schools. A survey was conducted and the following results were obtained.

Test whether there is a significant relationship between the income levels of the families and the choice of schools.

Solution:

H0- There is no significant difference in choice of the schools due to income levels

[There is no relationship between income levels and choice of the schools]

H1- There is a significant difference in choice of the schools due to income levels

[There is a relationship between two]

Survey results (observed frequencies)

Income   Public   Government   Total
High     494      506          1000
Low      162      438           600
Total    656      944          1600

Calculation of chi-square

Observed frequencies   Expected frequencies   (O-E)2   (O-E)2/E
494                    410                    7056     17.20
506                    590                    7056     11.96
162                    246                    7056     28.68
438                    354                    7056     19.93
Total                                                  77.77

$\chi^2$ calculated = 77.77

LOS – 1% [assumed level]


Degree of freedom = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1

$\chi^2$ tabulated = 6.635

As $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 and accept H1.

Thus there is a significant relationship between the income levels and the choice of the schools.
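The expected frequencies E = (RT × CT)/n and the resulting chi-square for a contingency table can be sketched in Python as below, together with a check of the 2 × 2 shortcut formula. The function name `chi_square_independence` is an illustrative assumption; the code keeps full precision, so the result may differ from the hand-rounded 77.77 in the second decimal place.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # E = RT x CT / n
            chi_sq += (obs - exp) ** 2 / exp
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_sq, dof

# Illustration 3.17: rows are High and Low income, columns Public and Government
table = [[494, 506], [162, 438]]
chi_sq_cal, dof = chi_square_independence(table)

# Shortcut formula for a 2 x 2 table, using only the observed frequencies
(a, b), (c, d) = table
n = a + b + c + d
shortcut = n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (c + d) * (a + b))

print(round(chi_sq_cal, 2), round(shortcut, 2), dof)
```

Both routes give the same value, which is why the shortcut is convenient when only the four cell counts are at hand.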

Illustration 3.18: The following information relates to the color of hair and eyes of 100 babies. Test if there is an association between hair color and eye color.

Solution:

Hypothesis:

H0- There is no significant association between the hair and eye color.

H1- There is a significant association between the hair and eye color.

Observed frequencies

Hair Color   Blue eyes   Grey eyes   Total
Black        30          10           40
Brown        20          10           30
Golden       10          20           30
Total        60          40          100

Calculation of chi-square

Observed frequencies   Expected frequencies   (O-E)2   (O-E)2/E
30                     24                     36       1.50
10                     16                     36       2.25
20                     18                      4       0.22
10                     12                      4       0.33
10                     18                     64       3.56
20                     12                     64       5.33
Total                                                  13.19

$\chi^2$ calculated = 13.19

Assuming 1% LOS

Degree of freedom = (r – 1)(c – 1) = 2 × 1 = 2


$\chi^2$ tabulated = 9.210

As $\chi^2_{cal} > \chi^2_{tab}$, the null hypothesis is rejected.

Illustration 3.19: A movie producer wants to determine the appeal of his movie to a particular age group. He takes a random sample from the people that attended the preview. The following table shows the results.

Solution:

Age           < 20    20 – 39   40 – 59   > 60    Total
Liked         320     80        110       200      710
Disliked       50     15         70        60      195
Indifferent    50      5         20        40      115
Total         420    100        200       300     1020

Calculation of chi-square

Observed frequencies   Expected frequencies   (O-E)2     (O-E)2/E
320                    292.35                  764.52     2.62
80                      69.60                  108.16     1.55
110                    139.22                  853.80     6.13
200                    208.82                   77.79     0.37
50                      80.29                  917.48    11.42
15                      19.11                   16.89     0.88
70                      38.23                 1009.33    26.40
60                      57.35                    7.02     0.12
50                      47.35                    7.02     0.15
5                       11.27                   39.31     3.49
20                      22.55                    6.50     0.29
40                      33.82                   38.19     1.13
Total                                                    54.55


Hypothesis:

H0- There is no significant difference in the appeal of the movie among the considered age groups.

H1- There is a significant difference in the appeal of the movie among the considered age groups.

$\chi^2$ calculated = 54.55

Assuming 1% LOS

Degree of freedom = (r – 1)(c – 1) = 2 × 3 = 6

$\chi^2$ tabulated = 16.812

As $\chi^2_{cal} > \chi^2_{tab}$, the null hypothesis is rejected.

Illustration 3. 20

A brand manager is concerned that her brand's share may be unevenly distributed throughout the country. In a survey, in which the country was divided into four geographical regions, a random sample of 100 consumers in each region was surveyed, with the following results:

REGION

                     NE     NW     SE     SW     Total
Purchase the brand   40     55     45     50     190
Do not purchase      60     45     55     50     210
Total                100    100    100    100    400

1. Develop a table of observed and expected frequencies for this problem.

2. Calculate the sample chi-square value.

3. State the null and alternative hypotheses.

4. At $\alpha$ = 0.05, test whether brand share is the same across the four regions.

Solution:

The two hypotheses are as follows:

H0: The brand share is evenly distributed.

H1: The brand share is not evenly distributed.


In order to calculate the expected frequencies for the corresponding observed frequencies, we have to apply the formula:

(Row total × Column total)/Grand total

For example, the observed frequency in row 1 and column 1 is 40. Its expected frequency will be E = (190 × 100)/400 = 47.5. In this manner, expected frequencies are calculated and shown in table 6.1.

Calculation of sample chi-square

Row     Region   O     E      O-E     (O-E)2   (O-E)2/E
Row 1   NE       40    47.5   -7.5    56.25    1.18
Row 1   NW       55    47.5    7.5    56.25    1.18
Row 1   SE       45    47.5   -2.5     6.25    0.13
Row 1   SW       50    47.5    2.5     6.25    0.13
Row 2   NE       60    52.5    7.5    56.25    1.07
Row 2   NW       45    52.5   -7.5    56.25    1.07
Row 2   SE       55    52.5    2.5     6.25    0.12
Row 2   SW       50    52.5   -2.5     6.25    0.12
Total                                          $\chi^2$ = 5.00

The degrees of freedom = (r-1)(c-1) = (2-1)(4-1) = 3

The critical value of $\chi^2$ at $\alpha$ = 0.05 level for 3 degrees of freedom from the table is 7.815. Since the calculated value of $\chi^2$ (5.00) is less than the critical value of 7.815, the null hypothesis is accepted. In other words, the brand share is evenly distributed in all the four regions of the country.

Illustration 3.21

Given a sample variance of 127 from a set of nine observations, construct a 95 percent confidence interval for the population variance.

Solution:

$$\sigma_L^2 = \frac{(n - 1)s^2}{\chi_U^2} = \frac{8(127)}{17.535} = 57.941$$

$$\sigma_U^2 = \frac{(n - 1)s^2}{\chi_L^2} = \frac{8(127)}{2.180} = 466.055$$

Thus, the confidence interval is (57.941, 466.055).
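The confidence interval for a population variance in Illustration 3.21 follows directly from two tabulated chi-square values. Below is a minimal Python sketch; the two table values (17.535 and 2.180 for 8 degrees of freedom) are those used in the illustration, and the variable names are assumptions.

```python
n = 9            # number of observations
s_squared = 127  # sample variance

# Tabulated chi-square values for 8 d.f. (upper and lower 2.5% points)
chi_sq_upper, chi_sq_lower = 17.535, 2.180

# The larger chi-square value gives the lower limit, and vice versa
lower_limit = (n - 1) * s_squared / chi_sq_upper
upper_limit = (n - 1) * s_squared / chi_sq_lower

print(f"95% confidence interval: ({lower_limit:.3f}, {upper_limit:.3f})")
```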


Illustration 3.22

A production manager feels that the output rate of experienced employees is surely greater than that of new employees. The average output of new employees at this particular type of work is 20 units per hour with a variance of 56 units squared. For a group of 20 employees with 5 years' experience, the average output for this same type of work is 30 units per hour, with a sample variance of 28 units squared. Does the variability in output appear to differ at the two experience levels? Test the hypotheses at the 0.05 significance level.

Solution:

n = 20

For testing $H_0: \sigma^2 = 56$ versus $H_1: \sigma^2 \neq 56$ at $\alpha$ = 0.05, the limits of the acceptance region are $\chi^2$ = 8.907 and $\chi^2$ = 32.852. The observed value is

$$\chi^2 = \frac{(n - 1)s^2}{\sigma^2} = \frac{19(28)}{56} = 9.5$$

So we don't reject $H_0$; the variability is not significantly different.

Inference about Two population variances

Illustration 3.23

A prominent sociologist at a large Midwestern university believes that incomes earned by college graduates show much greater variability than the earnings of those who did not attend college. In order to test this theory, she dispatches two research assistants to Chicago to look at the earnings of these two populations. The first assistant takes a random sample of 21 college graduates and finds that their earnings have a sample standard deviation of s1 = $17,000. The second assistant samples 25 nongraduates and obtains a standard deviation in earnings of s2 = $7,500. Check whether the variances are different.

H0: There is no difference in variances.

H1: There is a difference in variances.

Solution:

The data of our problem can be summarized as follows:

$s_1$ = 17,000 (standard deviation of first sample)

$n_1$ = 21 (size of first sample)

$s_2$ = 7,500 (standard deviation of second sample)

$n_2$ = 25 (size of second sample)


Because the sociologist theorizes that the earnings of college graduates are more variable than those of people not attending college, a one-tailed test is appropriate. She wishes to verify her theory at the 0.01 level of significance.

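F Ratio for Inferences about Two Variances

$$F = \frac{s_1^2}{s_2^2} = \frac{(17000)^2}{(7500)^2} = 5.14$$

$F_{tab}$ = 2.74 (at the 0.01 level with 20 and 24 degrees of freedom). As $F_{cal} > F_{tab}$, we reject H0.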
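The F ratio used for comparing two variances can be sketched in Python as follows. The helper name `f_ratio` is an illustrative assumption; the code only computes the statistic and its degrees of freedom, and the tabulated F value still has to be read from the F table.

```python
def f_ratio(s_num, n_num, s_den, n_den):
    """F statistic (ratio of sample variances) with its degrees of freedom."""
    f_cal = s_num**2 / s_den**2
    return f_cal, (n_num - 1, n_den - 1)

# Illustration 3.23: graduates (s = 17000, n = 21), nongraduates (s = 7500, n = 25)
f_cal, dof = f_ratio(17000, 21, 7500, 25)

print(round(f_cal, 2), dof)
```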

Illustration 3.24: A quality control supervisor for an automobile manufacturer is concerned with uniformity in the number of defects in cars coming off the assembly line. If one assembly line has significantly more variability in the number of defects, then changes have to be made. The supervisor has collected the following data:

Number of Defects

Assembly Line A Assembly Line B

Mean 10 11

Variance 9 25

Sample Size 20 16

Does assembly line B have significantly more variability in the number of defects? Test at the 0.05 significance level.

Solution:

$H_0: \sigma_B^2 = \sigma_A^2$

$H_1: \sigma_B^2 > \sigma_A^2$

$\alpha$ = 0.05

$$\text{Observed } F = \frac{s_B^2}{s_A^2} = \frac{25}{9} = 2.778$$

$F_{CRIT} = F_{0.05}(15, 19) = 2.23$

Thus, we reject $H_0$: assembly line B does have significantly more variability in the number of defects, so some changes have to be made.

(Note: we are checking for uniformity here; the cars could be uniformly bad.)

3.5 SUMMARY

This unit has introduced you to the nitty-gritty of hypotheses testing with conceptual rigor and decision orientation. In particular, this unit focused on:


The definition and meaning of statistical hypothesis.

Conceptual foundation of the null and alternative hypothesis.

Type I and Type II error and their relationship with confidence level.

How to formulate the null and alternative hypothesis for the one-tailed and two-tailed test.

How the hypothesis works in practice as a decision making tool.

One sample tests for the population mean using Z-test for the large sample and t-test for the small sample.

One sample test for the population proportion using Z-test for the large sample.

Two sample tests for the difference in population means using the t-test for small sample when the samples are independent.

Paired t test for the dependent sample.

Exercises:

1) An automatic machine fills tea in sealed tins with a mean weight of 1 kg and a standard deviation of 1 g. A random sample of 50 tins was examined, and their mean weight was found to be 999.50 g. State whether the machine is working properly or not.

2) The weights of fish in a certain pond that is regularly stocked are considered to be normally distributed with a mean of 3.1 kg and a standard deviation of 1.1 kg. A random sample of size 30 is selected from the pond and the sample mean is found to be 2.4 kg. Is there sufficient evidence to indicate that the mean weight of the fish differs from 3.1 kg? Use a 10% significance level.

3) A stenographer claims that she can type at a rate of 120 words per minute. Can we reject her claim on the basis of 100 trials in which she demonstrates a mean of 116 words with a standard deviation of 15 words? Use the 5% level of significance.

4) In a certain factory there are two different processes manufacturing the same item. The average weight in a sample of 250 items produced from one process is found to be 120 g with an s.d. of 12 g; the corresponding figures in a sample of 400 items from the other process are 124 and 14. Is there any significant difference between the averages of the two samples at the 1% level of significance?

5) The average annual pay in 1989 was Rs 21,128 in the state of Tamil Nadu and Rs 25,233 in the state of Maharashtra. There is a difference of Rs 4,105. Suppose that a statistician believes that the difference is much less for employees in the manufacturing industry and takes an independent random sample of employees in the manufacturing industry in each state. The results are as follows:

State          x̄         s        n
Tamil Nadu     21,900    3,700    150
Maharashtra    24,800    3,100    190


At the 0.05 significance level, do the data support the statistician’s belief that, for employees in the manufacturing industry, the mean annual salary in Tamil Nadu differs from the mean annual salary in Maharashtra by less than Rs 4,105?

6) A college conducts both day and night classes intended to be identical. A sample of 100 day students’ field examination results is as below:

$\bar{x}_1 = 72.4$ and $\sigma_1 = 14.8$

A sample of 200 night students’ field examination results is as below:

$\bar{x}_2 = 73.9$ and $\sigma_2 = 17.9$

Are the two means statistically equal at the 10% level of significance?

7) For the last decade, a medical representative’s record shows a proportion of favorable calls of 2/5. He tries a new technique and achieves 23 favorable calls out of 40 calls. Does this signify that the new technique is effective at the 5% level of significance?

8) In a survey of 600 adults who earn over Rs 100,000 a month, 36 of them said that they feel that it is a necessity to fly first class when they travel. Do the data support the belief that more than 5 percent of those adults with a monthly income exceeding Rs 100,000 find that it is a necessity to fly first class when traveling? Use a 10% significance level for the test.

9) A manufacturer claimed that at least 95% of the equipment which he supplied to a factory conformed to specifications. An examination of a sample of 200 pieces of equipment revealed that 18 were faulty. Test his claim at a significance level of 0.05 or 0.01.

10) Before an increase in excise duty on tea, 400 people out of 500 persons were found to be tea drinkers. After an increase in duty, 400 people were tea drinkers in a sample of 600 people. State whether there is a significant decrease in the consumption of tea.

11) A manufacturer of storm windows sampled 250 new homes and found that 142 of them had storm windows. Another sample of 320 older homes was taken; 150 of them had storm windows. The manufacturer believes that the proportion of new homes that have storm windows is larger than the proportion of older homes that have storm windows. Do the sample statistics support the manufacturer’s claim at the 0.05 significance level?

12) On a certain day, 74 trains arrived on time at Delhi and 83 were late. At New Delhi, 65 were on time and 107 were late. Is there any difference in the proportions arriving on time at the two stations?

13) A soap manufacturing company was distributing a particular brand of soap through a large number of retail shops. Before a heavy advertisement campaign, the mean sales per week per shop were 140 dozens. After the campaign, a sample of 26 shops was taken and the mean sales were found to be 147 dozens with s.d. = 16. Can you consider the advertisement effective?


14) A delivery service is considering delivering Arun’s ice-cream if the average order in a suburban area is greater than 1.5 liters of ice-cream. A random sample of 23 household orders yields a mean of 1.7 liters with a standard deviation of 0.5 liters. Test at the 0.05 significance level that the mean household order is greater than 1.5 liters.

15) A random sample of size 16 has a mean of 53. The sum of the squares of the deviations taken from the mean is 150. Can this sample be regarded as taken from a population having 56 as its mean? Obtain 95% and 99% confidence limits of the population mean.

16) A chemical products distributor is faced with the continuing problem of broken glassware and flasks. He has determined some additional shipping precautions and asked the purchasing director to inform the suppliers of these precautions. Data for 8 suppliers are given below in terms of the average number of broken items per shipment. Do the data indicate, at α = 0.05, that the new measures have lowered the average number of broken items?

Supplier    1    2    3    4    5    6    7
Before     16   12   18    7   14   19    6
After      14   13   12    6    9   15    8

17) The average numbers of articles produced per day by two machines are 200 and 250, with standard deviations of 20 and 25 respectively, on the basis of records of 25 days’ production. Can you regard the machines as equally efficient at the 1% level of significance?

18) A controversial ordinance banning billboards will come before the New Paltz city council next week. A census of the voters would have found that 68% favor a ban on billboards within the city limits. You work in the city manager’s office and have been assigned to conduct a random poll of 400 voters. What is the probability that the sample proportion of voters favoring such a ban will be

(a) Within 0.01 of the true proportion?
(b) Within 0.02 of the true proportion?

19) Suppose that in the previous problem (exercise 18) the city manager’s office had conducted a random poll of 500 voters. What is the probability that the sample proportion of voters favoring the billboard ban will be

(a) Within 0.01 of the true proportion?
(b) Within 0.02 of the true proportion?

20) Records at Western Power show that 13% of customers pay their monthly bills with a postal money order. If you take a sample of 300 accounts, what is the probability that the sample proportion of those paying with a postal money order will be within 0.03 of the true proportion?


21) Records at Western Power show that 57% of customers consume more than 275 kW hours per month. If you take a sample of 300 accounts, what is the probability that the sample proportion of those consuming more than 275 kW hours per month will be within 0.03 of the true proportion?

22) Last week, the 12,983 employees at Quantity Two Corporation worked a mean of 38.2 hours, with an S.D. of 5.8 hours. As head of personnel, you take a random sample of 50 time cards. What is the probability that the sample mean will be within 1 hour of the true mean?

23) The home mortgages at First Home Savings and Loan have a mean balance due of $37,594 and a of 100 mortgage accounts. What is the probability that the sample mean will be

a. Within $1000 of the true mean?
b. Within $800 of the true mean?

24) A psychologist is aware of studies showing that the variability of the attention span of 5-year-olds can be summarized by σ² = 64 minutes squared. She wonders whether the attention span of 6-year-olds is different. A sample of 20 6-year-olds gives S² = 28 minutes squared. Is there any significant difference in the variability of the attention span of 5-year-old and 6-year-old children at α = 5%?

25) The production manager of Onida picture tubes believes that the life of the company’s picture tubes is 25,000 hours. However, to maintain the company’s reputation for quality, the manager would like to keep the standard deviation of the life span of picture tubes below 1000 hours. A sample of 24 picture tubes was randomly selected and the sample standard deviation was found to be 928 hours. Do the data indicate that the population standard deviation is less than 1000 hours? Use a 10% significance level.

26) A sample of 30 light tubes yielded a standard deviation of 90 hours running time, whereas long experience with the particular brand showed a standard deviation of 105 hours. Using α = 0.05, test whether there is any difference in the standard deviation.

27) An insurance company is interested in the length of hospital stays for various illnesses. The company has randomly selected 20 patients from Hospital A and 25 from Hospital B who were treated for the same ailment. The amount of time spent in Hospital A had an average of 2.4 days with an s.d. of 0.6 day. The treatment time in Hospital B averaged 2.3 days with an s.d. of 0.9 day. Do patients at Hospital A have significantly less variability in their recovery time at α = 1%?

28) In evaluating capital-investment projects, the variability of the cash flows of returns is carefully assessed. The higher the variability, the higher the risk associated with that project. ABC Enterprises is currently evaluating two projects. The mean expected net cash flow for the next 11 years for project 1 is Rs 1,370,000 compared with Rs 1,660,000 for project 2 for the next 8 years. The standard deviations of the net cash flows are Rs 280,000 and Rs 370,000 for project 1 and project 2, respectively. Assume that the cash flows are distributed normally. Do these sample standard deviations present sufficient evidence to indicate that project 1 and project 2 are not equally risky? Use a 5% significance level.


29) Two independent samples of sizes 9 and 8 gave the sums of squares of deviations from their respective means as 160 and 91. Can the samples be regarded as drawn from normal populations with equal variance?

$F_{0.05}(8, 7) = 3.73$, $F_{0.05}(7, 8) = 3.50$

30) A salesman has five accounts to visit per day. It is suggested that the number of sales made by him may be described by the binomial distribution, with the probability of selling to each account being 0.4. Given the following frequency distribution of the number of sales per day, can we conclude that the data do in fact follow the suggested distribution? Use the 5% level of significance.

No. of sales per day    Frequency of the no. of sales
0                       10
1                       41
2                       60
3                       20
4                        6
5                        3

31) Dennis Barry, a hospital administrator, has examined past records from 210 randomly selected 8-hour shifts to determine the frequency with which the hospital treats fractures. The numbers of days in which 0, 1, 2, 3, 4, or 5 or more patients with broken bones were treated were 25, 55, 65, 35, 20, and 10, respectively. At the 0.05 level of significance, can we reasonably believe that the incidence of broken-bone cases follows a Poisson distribution with λ = 2?

32) A stockbroker believes that when too many of the stock market newsletters are bullish on the market, the stock market will most likely fall. Thirty-two randomly selected stock market newsletters were placed in one of three categories:

Bearish on stock market    Neutral on stock market    Bullish on stock market
          9                          10                         13

Test the null hypothesis that the newsletters are equally divided among the three categories. Use a 0.05 significance level.

33) The following data give the number of aircraft accidents that occurred during the various days of the week. Find whether the accidents are uniformly distributed over the week.

Days               SUN   MON   TUE   WED   THU   FRI   SAT
No. of accidents    14    16     8    20    11     9    14

34) A financial consultant is interested in the differences in capital structure within different firm sizes in a certain industry. He surveys a group of firms with assets of different amounts and divides the firms into three groups. Each firm is classified according to whether its total debt is less than or greater than its equity. The results of the survey are:

Firm assets      <500   500-2000   2000+   Total
Debt < equity      7       10        8      25
Debt > equity     10       18        9      37
Total             17       28       17      62

Do these firms have the same capital structure? Use a 10% significance level.

35) A lawn-equipment shop is considering adding a brand of lawn mowers to its merchandise. The manager of the shop believes that the highest quality lawn mowers are Trooper, Lawn Eater and Nipper, and he needs to decide whether it makes a difference which of these three the shop adds to its existing merchandise. Twenty owners of each of these three types of lawn mowers are randomly sampled and asked how satisfied they are with their lawn mowers.

Lawn mower    Very satisfied   Satisfied   Not satisfied   Total
Trooper             11             6             3           20
Lawn Eater          13             4             3           20
Nipper              13             6             1           20

Are the owners of the lawn mowers homogeneous in their response to the survey? Use a 5% significance level.

36) In an industry, 200 workers employed for a specific job were classified according to their performance and whether or not they had received training, to test the independence of a specific training and performance. The data are given below. Use the 5% level of significance.

Performance
             Good   Not Good   Total
Trained       100      50       150
Untrained      20      30        50
Total         120      80       200


37) To see whether silicon chip sales are independent of where the U.S. economy is in the business cycle, data have been collected on the weekly sales of Zippy Chippy, a Silicon Valley firm, and on whether the U.S. economy was rising to a cycle peak, falling to a cycle trough, or at a cycle trough. The results are:

Weekly chip sales
Economy     High   Medium   Low   Total
At Peak       8       3      7     18
Rising        4       8      5     17
Falling       8       4      3     15
Total        20      15     15     50


UNIT IV

NON-PARAMETRIC METHODS IN TESTING OF HYPOTHESIS

4.1 INTRODUCTION

We have seen in Unit III that hypothesis tests can be divided into two categories: parametric tests and nonparametric tests. Parametric tests were applied when parameters such as means, variances, standard deviations and proportions existed and were used in testing the hypotheses. These tests assumed that the form of the population distribution is known and that a test concerning a parameter of a distribution is to be made. However, there are many situations where one or more of the assumptions made in the case of parametric tests cannot be met. For such cases, statisticians have developed other techniques that are based on less stringent assumptions. These include nonparametric methods as well as distribution-free methods. When we are not concerned with the parameters of a given population, nonparametric methods are applied. As regards distribution-free methods, we do not make any assumptions about the population from which we are sampling. As the distinction between nonparametric methods and distribution-free methods is rather fine, in practice both are referred to as nonparametric methods.

4.2 LEARNING OBJECTIVES

After reading this unit, you will be able to:

Understand the significance of non-parametric methods.
Explain the sign test for paired data.
Test two or more than two sample means using rank sum tests.
Test the randomness of the data using the run test.
Find the correlation coefficient between two sets of ranks.


4.3 ADVANTAGES AND DISADVANTAGES OF NON-PARAMETRIC TESTS

4.3.1 Main Advantages Of Nonparametric Tests

Nonparametric tests have certain advantages. As a result, they have become more important and are being increasingly used in recent years. The main advantages of these tests are:

1. No assumptions, or less stringent assumptions, are required in nonparametric tests as compared to parametric tests.

2. Nonparametric tests do not take much time as they involve very simple calculations.

4.3.2 Disadvantages Of Nonparametric Tests

1. They are based on a limited amount of information and do not make use of all the available information. For example, a number of figures are just replaced by ranks 1, 2, 3 and so on. Obviously, a good deal of information is lost in this way.

2. Another disadvantage of such tests is that they are less powerful than the parametric tests. This means that there is a greater risk of accepting a false hypothesis. In other words, the chances of committing a Type II error are considerable.

3. Another disadvantage of such tests is that the null hypothesis is somewhat loosely formulated. In view of this, when the null hypothesis is rejected, the conclusions arising therefrom are less precise as compared to the parametric tests.

We now discuss the following non-parametric tests, as they are listed in the syllabus content as being relevant to the fourth unit.

1. Sign Test for Paired Data
2. Mann-Whitney U Test
3. Kruskal-Wallis Test or H-test
4. One sample run test
5. Rank correlation

4.4 SIGN TEST FOR PAIRED DATA

Illustration 4.1:

Consider the results of a test panel of 40 college students evaluating the effectiveness of two types of classes: major lectures by full professors and minor lectures by graduate assistants. Table 4.1 lists the responses to the following request: “Indicate how you rate the effectiveness in transmitting knowledge of these two types of lectures by giving them a number from 4 to 1. A rating of 4 is excellent and 1 is poor.” Determine whether the students perceive a difference between the two types of lectures.

Table 4.1 – Data Table

Panel-member number:            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Score for major lectures (1):   2  1  4  4  3  3  4  2  4  1  3  3  4  4  4  1  1  2  2  4
Score for minor lectures (2):   3  2  2  3  4  2  2  1  3  1  2  3  4  4  3  2  3  2  3  3

Panel-member number:           21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Score for major lectures (1):   4  4  4  3  3  2  3  4  3  4  3  1  4  3  2  2  2  1  3  3
Score for minor lectures (2):   1  4  3  3  2  2  1  1  1  3  2  2  4  4  3  3  1  1  4  2

Solution:

Any problem relating to the testing of hypotheses begins with framing the hypotheses. Accordingly, we have the null and alternative hypotheses framed as below.

$H_0: p = 0.5$ Null hypothesis: There is no difference between the two types of lectures

$H_1: p \neq 0.5$ Alternative hypothesis: There is a difference between the two types of lectures

For the stated hypotheses and the given data, the appropriate test is the sign test. How do we come to this conclusion? The logic works like this. We want to check whether the students find the major and minor lectures to be significantly different, using the ratings given by them. If 20 students feel major lectures are better (20 ‘+’ signs) and another 20 feel minor lectures are better (20 ‘-’ signs), then we will conclude that there is no significant difference. If 18 students feel major lectures are better (18 ‘+’ signs) and 22 students feel that minor lectures are better (22 ‘-’ signs), then we have to check whether there is a significant difference in the opinions expressed through the ratings. Based on this logic, we adopt the sign test to test these hypotheses.

The procedure is quite simple. If the score given by a student for a major lecture is higher than the score given for a minor lecture, the pair is given a plus sign. If the score given for a major lecture is lower than the score given for a minor lecture, the pair is given a minus sign. If there is no difference in the ratings, the result is recorded as 0. This is shown in Table 4.1 below.
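The sign assignment just described can be sketched in a few lines. The two lists are the 40 paired ratings from Table 4.1; the function and variable names are illustrative only.

```python
# Ratings from Table 4.1, panel members 1 to 40.
major = [2, 1, 4, 4, 3, 3, 4, 2, 4, 1, 3, 3, 4, 4, 4, 1, 1, 2, 2, 4,
         4, 4, 4, 3, 3, 2, 3, 4, 3, 4, 3, 1, 4, 3, 2, 2, 2, 1, 3, 3]
minor = [3, 2, 2, 3, 4, 2, 2, 1, 3, 1, 2, 3, 4, 4, 3, 2, 3, 2, 3, 3,
         1, 4, 3, 3, 2, 2, 1, 1, 1, 3, 2, 2, 4, 4, 3, 3, 1, 1, 4, 2]

def sign_of_difference(score1, score2):
    """Return '+', '-' or '0' for the sign of score1 minus score2."""
    if score1 > score2:
        return "+"
    if score1 < score2:
        return "-"
    return "0"

signs = [sign_of_difference(a, b) for a, b in zip(major, minor)]
plus = signs.count("+")
minus = signs.count("-")
ties = signs.count("0")
print(plus, minus, ties)
```

Only the counts of plus and minus signs are carried forward; the size of each difference is deliberately ignored, which is what makes the test so simple.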


Table 4.1 Evaluation by 40 students of two types of classes

Panel-member number:               1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Score for major lectures (1):      2  1  4  4  3  3  4  2  4  1  3  3  4  4  4  1  1  2  2  4
Score for minor lectures (2):      3  2  2  3  4  2  2  1  3  1  2  3  4  4  3  2  3  2  3  3
Sign of score 1 minus score 2:     -  -  +  +  -  +  +  +  +  0  +  0  0  0  +  -  -  0  -  +

Panel-member number:              21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Score for major lectures (1):      4  4  4  3  3  2  3  4  3  4  3  1  4  3  2  2  2  1  3  3
Score for minor lectures (2):      1  4  3  3  2  2  1  1  1  3  2  2  4  4  3  3  1  1  4  2
Sign of score 1 minus score 2:     +  0  +  0  +  0  +  +  +  +  +  -  0  -  -  -  +  0  -  +

If we count the number of + and - signs from Table 4.1, we get the following results:

Number of + signs     19
Number of - signs     11
Number of 0s          10
                     ____
Total sample size     40

We are using the sign test to determine whether there is a significant difference in the opinions. Because we are testing perceived differences, we shall exclude tied evaluations (0s). We have 19 plus signs and 11 minus signs, for a total of 30 usable responses. If there is no difference between the two types of classes, p (the probability that the first score exceeds the second score) would be 0.5 and we would expect to get about 15 plus signs and 15 minus signs. The hypotheses have been framed according to the following logic.

If we tossed an unbiased coin 30 times, p would be 0.5, and we would expect about 15 heads and 15 tails. In that case, we would use the binomial distribution as the appropriate sampling distribution. You may also remember that when np and nq are each at least 5, we can use the normal distribution to approximate the binomial. This is just the case with the results from our student ratings. Thus, we can apply the normal distribution to our test of the two teaching methods.


$p_{H_0} = 0.5$ Hypothesized proportion of the population that prefers major lectures

$q_{H_0} = 0.5$ Hypothesized proportion of the population that prefers minor lectures

$n = 30$ Sample size

$\bar{p} = 0.633$ Proportion of successes in the sample (from 19/30)

$\bar{q} = 0.367$ Proportion of failures in the sample (from 11/30)

We want to test the hypothesis that there is no difference between student perceptions of the two types of lectures at the 0.05 level of significance. We shall conduct this test using the following method. The first step is to calculate the standard error of the proportion:

$$\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.5)(0.5)}{30}} = \sqrt{0.00833} = 0.091$$ Standard error of the proportion.

Figure 4.1 Area under the normal curve for 5% LOS (two tailed test)

Because we want to know whether the true proportion is significantly larger or smaller than the hypothesized proportion, this is a two-tailed test. Figure 4.1 above illustrates this hypothesis test graphically. The two colored regions together represent the 0.05 level of significance.

Next we standardize the sample proportion, $\bar{p}$, by subtracting $p_{H_0}$, the hypothesized proportion, and dividing by $\sigma_{\bar{p}}$, the standard error of the proportion.

$$z = \frac{\bar{p} - p_{H_0}}{\sigma_{\bar{p}}} = \frac{0.633 - 0.5}{0.091} = 1.462$$
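The standardization above can be sketched as a small function, taking the 19 plus signs and 11 minus signs as input. The function name is illustrative; the critical value 1.96 is the two-tailed value at the 5% level used in this unit. Because the sketch does not round intermediate values, it gives about 1.461 rather than the 1.462 obtained from the rounded figures 0.633 and 0.091.

```python
import math

def sign_test_z(plus, minus, p0=0.5):
    """Normal-approximation z for the sign test; ties are already excluded."""
    n = plus + minus
    p_bar = plus / n                              # sample proportion of '+' signs
    std_error = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
    return (p_bar - p0) / std_error

z = sign_test_z(19, 11)
reject_h0 = abs(z) > 1.96     # two-tailed test at the 5% level
print(round(z, 3), reject_h0)
```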


This standardized value of 1.462, which is less than +1.96 (the critical value on the positive side) on the z scale, shows that the sample proportion falls well within the acceptance region shown in Figure 4.1. Therefore, we accept the null hypothesis that students perceive no difference between the two types of lectures.

A sign test such as this is quite simple to do and applies to both one-tailed and two-tailed tests. It is usually based on the binomial distribution. Remember, however, that we were able to use the normal approximation to the binomial as our sampling distribution because np and nq were both greater than 5. When these conditions are not met, we must use the binomial distribution.
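When the normal approximation is not appropriate, the binomial probabilities can be summed directly. The following is a minimal sketch of an exact two-tailed sign test under p = 0.5; doubling the tail probability is one common convention for the two-tailed p-value and is an assumption here, as is the function name.

```python
from math import comb

def sign_test_exact_p(plus, minus):
    """Exact two-tailed p-value for the sign test under H0: p = 0.5."""
    n = plus + minus
    k = max(plus, minus)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for the two-tailed test.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

# Applied to the lecture data (19 plus signs, 11 minus signs).
p_value = sign_test_exact_p(19, 11)
print(round(p_value, 4), p_value < 0.05)

# A small-sample case where the normal approximation would not be used.
p_small = sign_test_exact_p(7, 1)
```

For the lecture data the exact p-value is well above 0.05, so the exact test agrees with the decision reached through the normal approximation.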

4.5 MANN-WHITNEY U TEST

Although there are a number of rank sum tests, we shall confine ourselves to just two such tests: the Mann-Whitney U test and the Kruskal-Wallis test. These tests enable us to determine whether independent samples have been drawn from the same population or from different populations having the same distribution. In these types of problems we assume in our null hypothesis that there is no significant difference in the sum of the ranks of the two groups. The two groups may be from the same population or from different populations having the same distribution.

When only two populations are involved we shall use the Mann-Whitney U test. When more than two populations are involved, we shall use the Kruskal-Wallis test. It may be pointed out that since these tests use ranking data rather than ‘+’ and ‘-’ signs, they will yield a better interpretation than the sign test.

One of the most common and best known distribution-free tests is the Mann-Whitney U test for two independent samples. The logical basis of this test is easy to understand. Suppose we have two independent groups, with n1 observations (scores) from group 1 and n2 observations (scores) from group 2. We arrange all the scores in ascending order and allocate ranks based on that order, so that the lowest score receives rank 1. By a higher rank we mean that rank 1 is higher than rank 2, which is higher than rank 3, and so on. Now we add together the ranks assigned to each group, that is, the sum of the ranks of the group 1 scores and the sum of the ranks of the group 2 scores. If the scores were distributed equally in both groups, there would be no significant difference in the sum of the ranks of the two groups (the null hypothesis is accepted). On the contrary, if the sums of the ranks differ significantly, it means that the two groups have different distributions of scores. To test whether the sums of the ranks differ significantly we frame the hypotheses and compute the following statistics.


Formulae used in the Mann-Whitney U test

1. The U statistic

$U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1$ (4.1)

where n1 and n2 are the sample sizes of group 1 and group 2, respectively, and R1 is the sum of the ranks assigned to the values of the first sample.

2. The mean of the sampling distribution of U

$\mu_u = \frac{n_1 n_2}{2}$ (4.2)

3. Standard error of the U statistic

$\sigma_u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$ (4.3)

4. The Z intercept

$Z = \frac{U - \mu_u}{\sigma_u}$ (4.4)

We will eventually compute the Z intercept value to check whether it falls in the acceptance region or the rejection region. If you have gone through Unit 3, you will be familiar with the critical value and the acceptance/rejection criteria. If you refer to Figure 4.1 in this unit, you will see that for a two-tailed test (most non-parametric tests are) at the 5% LOS, the acceptance region falls between -1.96 and +1.96. In other words, if our computed value of Z falls within this range we accept the null hypothesis; otherwise we accept the alternative hypothesis. The Mann-Whitney test is based on the logic just described, using the sum of the ranks in one of the groups as the test statistic. The common practice is to take the sum of the ranks assigned to the smaller group or, if n1 = n2, the smaller of the two sums as the test statistic. This value is then compared with the critical value that can be obtained from the table to test the null hypothesis.

Note: If n1 and n2 are less than 8, the normal approximation is not reliable, and we compare U with the tabulated critical values of the U statistic instead of using Z.

Let us take a look at Illustration 4.2 to understand the application of the Mann-Whitney U test.

Illustration 4.2

The following scores obtained by students from two classes have been tabulated. Is there a significant difference in the performance of the two classes?

Class A scores: 1000 1100 800 750 1300 950 1050 1250 1400 850 1150 1200 1500 600 775

Class B scores: 920 1120 830 1360 650 725 890 1600 900 1140 1550 550 1240 925 500


Solution:

H0: There is no significant difference in the scores (performance) of the two classes.

H1: There is a significant difference in the scores (performance) of the two classes.

We will now rank the scores in ascending order and also indicate the class from which each score has been obtained.

Score   Class   Rank      Score   Class   Rank
 500      B       1       1000      A      16
 550      B       2       1050      A      17
 600      A       3       1100      A      18
 650      B       4       1120      B      19
 725      B       5       1140      B      20
 750      A       6       1150      A      21
 775      A       7       1200      A      22
 800      A       8       1240      B      23
 830      B       9       1250      A      24
 850      A      10       1300      A      25
 890      B      11       1360      B      26
 900      B      12       1400      A      27
 920      B      13       1500      A      28
 925      B      14       1550      B      29
 950      A      15       1600      B      30

We know that n1 = 15 and n2 = 15.

If we sum up the ranks of Class A and Class B we have the following results.

Sum of the ranks of Class A

= (3+6+7+8+10+15+16+17+18+21+22+24+25+27+28) = 247 = R1

Sum of the ranks of Class B

= (1+2+4+5+9+11+12+13+14+19+20+23+26+29+30) = 218 = R2

The U statistic

$U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1 = (15)(15) + \frac{15(16)}{2} - 247 = 98$


where n1 and n2 are the sample sizes of group 1 and group 2, respectively, and R1 is the sum of the ranks assigned to the values of the first sample.

The mean of the sampling distribution of U

$\mu_u = \frac{n_1 n_2}{2} = \frac{(15)(15)}{2} = 112.5$

Standard error of the U statistic

$\sigma_u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{(15)(15)(31)}{12}} = 24.1$

The Z intercept

$Z = \frac{U - \mu_u}{\sigma_u} = \frac{98 - 112.5}{24.1} = -0.602$

Since the calculated value of Z lies between -1.96 and +1.96, we accept the null hypothesis. There is no significant difference between the scores (performance) of the two classes.
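The steps of Illustration 4.2 can be sketched as follows. This is a minimal version that assumes no two scores are tied (true for this data set), so each score's rank is simply its position in the pooled ascending order; the function name is illustrative.

```python
import math

class_a = [1000, 1100, 800, 750, 1300, 950, 1050, 1250,
           1400, 850, 1150, 1200, 1500, 600, 775]
class_b = [920, 1120, 830, 1360, 650, 725, 890, 1600,
           900, 1140, 1550, 550, 1240, 925, 500]

def mann_whitney_z(sample1, sample2):
    """Return (R1, U, Z) using formulae 4.1 to 4.4; assumes no tied scores."""
    pooled = sorted(sample1 + sample2)
    rank_of = {score: position + 1 for position, score in enumerate(pooled)}
    n1, n2 = len(sample1), len(sample2)
    r1 = sum(rank_of[score] for score in sample1)     # rank sum of sample 1
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1              # formula 4.1
    mean_u = n1 * n2 / 2                              # formula 4.2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # formula 4.3
    return r1, u, (u - mean_u) / std_u                # formula 4.4

r1, u, z = mann_whitney_z(class_a, class_b)
accept_h0 = -1.96 <= z <= 1.96
print(r1, u, round(z, 3), accept_h0)
```

The unrounded Z is about -0.601; the -0.602 in the text comes from using the rounded standard error 24.1.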

Illustration 4.3:

The following data indicate the lifetime (in hours) of samples of two kinds of light bulbs in continuous use:

Brand A: 603 625 641 622 585 593 660 600 633 580 615 648

Brand B: 620 640 646 620 652 639 590 646 631 669 610 619

We are required to use the Mann-Whitney test to compare the lifetimes of brand A and brand B light bulbs.

Solution:

The first step in performing the Mann-Whitney test is to rank the given data jointly (as if they were one sample) in increasing order of magnitude. For our data, we thus obtain the following array, where we use the letters A and B to denote whether the light bulb was from brand A or brand B.


Ranking of light bulbs of brand A or brand B

Sample score   Group   Rank      Sample score   Group   Rank
    580          A       1           625          A      13
    585          A       2           631          B      14
    590          B       3           633          A      15
    593          A       4           639          B      16
    600          A       5           640          B      17
    603          A       6           641          A      18
    610          B       7           646          B      19.5
    615          A       8           646          B      19.5
    619          B       9           648          A      21
    620          B      10.5         652          B      22
    620          B      10.5         660          A      23
    622          A      12           669          B      24

If both samples come from identical populations, it is reasonable to expect that the means of the ranks assigned to the values of the two samples are more or less the same. As such, our hypotheses are:

H0: Means of ranks assigned to the values in the two groups are the same.

H1: Means are not the same.

However, instead of using the means of the ranks, we shall use rank sums, for which the following formula will be used.

$U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1$

where n1 and n2 are the sample sizes of group 1 and group 2, respectively, and R1 is the sum of the ranks assigned to the values of the first sample.

In our example, we have n1 = 12, n2 = 12 and

R1 = 1+2+4+5+6+8+12+13+15+18+21+23 = 128.

Substituting these values in the above formula,

U = (12)(12) + [12(12+1)]/2 - 128

= 144 + 78 - 128

= 94


Since both n1 and n2 are more than 8, the sampling distribution of the U statistic can be approximated by the normal distribution. As our problem meets this requirement, we can apply the normal approximation and use the Z statistic.

1. Mean:

$$\mu_u = \frac{n_1 n_2}{2} = \frac{12 \times 12}{2} = 72$$

2. Standard error:

$$\sigma_u = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\frac{12 \times 12 \times (12 + 12 + 1)}{12}} = \sqrt{300} = 17.3$$

3. Z = (Statistic - Mean) / Standard error:

$$Z = \frac{U - \mu_u}{\sigma_u} = \frac{94 - 72}{17.3} = 1.27$$

The critical value of Z at the 0.05 level of significance is 1.96. Since the calculated value of Z = 1.27 is smaller than 1.96, the null hypothesis is accepted. This shows that there is no significant difference in the average lifetimes of brand A and brand B bulbs. The Z test is more dependable. It may be noted that the Mann-Whitney test requires fewer assumptions than the corresponding standard test. In fact, the only assumption required is that the populations from which the samples have been drawn are continuous.
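The steps of Illustration 4.3 can be retraced with a short script. This is only a minimal sketch: the helper name `average_ranks` and the variable names are chosen here for illustration and are not part of the text. It ranks the pooled data (tied values share the average rank), then applies the U, mean and standard error formulas exactly as given above.

```python
from math import sqrt

def average_ranks(values):
    """Rank values in increasing order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

brand_a = [603, 625, 641, 622, 585, 593, 660, 600, 633, 580, 615, 648]
brand_b = [620, 640, 646, 620, 652, 639, 590, 646, 631, 669, 610, 619]

n1, n2 = len(brand_a), len(brand_b)
ranks = average_ranks(brand_a + brand_b)       # joint ranking of both samples
r1 = sum(ranks[:n1])                           # rank sum of brand A
u = n1 * n2 + n1 * (n1 + 1) / 2 - r1           # U statistic
mu_u = n1 * n2 / 2                             # mean of U
sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # standard error of U
z = (u - mu_u) / sigma_u

print(f"R1 = {r1}, U = {u}, mean = {mu_u}, SE = {sigma_u:.2f}, Z = {z:.2f}")
```

Because |Z| is compared with 1.96, the same script can be reused for other two-sample problems by replacing the two lists.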

4.6 KRUSKAL-WALLIS TEST OR H-TEST

This test is a direct generalization of the Mann-Whitney test to the case in which we have three or more independent groups. It tests the null hypothesis that all samples came from identical populations. As against this, the alternative hypothesis is that the means of the populations are not all equal.

To perform the Kruskal-Wallis test, we have to rank all scores without regard to the groups to which they belong and then compute the sum of the ranks for each group. The sums are denoted by R1, R2, ..., Rk; if the null hypothesis is true, the average ranks of the groups are expected to be more or less equal.

The formula used in this test is

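$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) \qquad (4.5)$$

where H is the test statistic, k is the number of samples, n_i is the number of observations in sample i, n = n1 + n2 + ... + nk is the total number of observations in all samples, and R_i is the sum of the ranks of all items in sample i.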


Illustration 4.4:

The written examination scores for 20 students trained by three different methods are recorded as given below. Check whether the training methods contributed to any significant differences in scores.

Video (V) 74 88 82 93 55 70

Audio (A) 78 80 65 57 89

Classroom (C) 68 83 50 91 84 77 94 81 92

Solution:

H0: There is no significant difference in the scores due to the training methods. They have the same means.

H1: There is a significant difference in the scores due to the training methods. They have different means.

Just as we had done in illustration 4.2, we will list the scores in ascending order, rank them and also indicate the training method. Have a look at the following table.

Score Rank Training Method Score Rank Training Method
50 1 C 81 11 C
55 2 V 82 12 V
57 3 A 83 13 C
65 4 A 84 14 C
68 5 C 88 15 V
70 6 V 89 16 A
74 7 V 91 17 C
77 8 C 92 18 C
78 9 A 93 19 V
80 10 A 94 20 C

Now we select the ranks belonging to each training method and sum them. This gives the rank sums R1 = 61 (video), R2 = 42 (audio) and R3 = 107 (classroom).


We now use equation 4.5 to compute the value of H.

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1)$$

$$H = \frac{12}{20(20+1)} \left[ \frac{61^2}{6} + \frac{42^2}{5} + \frac{107^2}{9} \right] - 3(21) = 64.145 - 63 = 1.145$$

The sampling distribution of the H-statistic can be approximated by a χ2 distribution when all sample sizes are >= 5. The degrees of freedom (d.f.) = (m - 1) = 3 - 1 = 2, where ‘m’ is the number of training methods. The tabulated χ2 value is 4.605 (at 10% LOS and 2 d.f.). Since the calculated value is less than the tabulated value, we accept the null hypothesis.
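The computation of H in Illustration 4.4 can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `kruskal_wallis_h` is made up for this example, and it applies equation 4.5 as written in the text, with average ranks for ties but no further tie correction. The p-value line relies on the fact that, for 2 degrees of freedom only, the upper-tail area of the χ2 distribution is exp(-x/2).

```python
from math import exp

def kruskal_wallis_h(groups):
    """Return (rank sums, H) for a list of samples, using average ranks for ties."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    # Average 1-based rank of each distinct value in the pooled sample.
    rank_of = {}
    for v in set(pooled):
        first = pooled.index(v) + 1
        count = pooled.count(v)
        rank_of[v] = first + (count - 1) / 2
    rank_sums = [sum(rank_of[x] for x in g) for g in groups]
    h = 12 / (n * (n + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, groups)) - 3 * (n + 1)
    return rank_sums, h

video = [74, 88, 82, 93, 55, 70]
audio = [78, 80, 65, 57, 89]
classroom = [68, 83, 50, 91, 84, 77, 94, 81, 92]

rank_sums, h = kruskal_wallis_h([video, audio, classroom])
p_value = exp(-h / 2)  # upper-tail chi-square area; this shortcut holds for 2 d.f. only

print(rank_sums, round(h, 3), round(p_value, 3))
```

The script should reproduce the rank sums 61, 42 and 107 and a value of H slightly above 1.14, far below any of the usual critical values for 2 d.f.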

Ranks by training method for Illustration 4.4:

Video Audio Classroom
2 3 1
6 4 5
7 9 8
12 10 11
15 16 13
19 - 14
- - 17
- - 18
- - 20
R1 = 61 R2 = 42 R3 = 107

Illustration 4.5:

A petrol station has the following sample data on sales volume in litres of petrol filled, separated according to the mode of payment. Test the hypothesis that there is no difference in the quantity of petrol bought and paid for by cash, cheque or credit card. Use the KW test at the 5% level of significance.

Credit card 78 64 75 45 82 69 60

Cheque 110 70 53 51 61 68

Cash 90 68 70 54 74 65 59


Solution:

H0: There is no significant difference in the purchase volume due to the payment mode

H1: There is a significant difference in the purchase volume due to the payment mode

Table 7.3 Associated ranks for the given problem

CC CHEQUE CASH
17 20 19
8 13.5 10.5
16 3 13.5
1 2 4
18 7 15
12 10.5 9
6 - 5
R1=78 R2=56 R3=76

$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1)$$

$$H = \frac{12}{20(21)} \left[ \frac{(78)^2}{7} + \frac{(56)^2}{6} + \frac{(76)^2}{7} \right] - 3(21) = \frac{12}{420} \left[ 869.14 + 522.67 + 825.14 \right] - 63 = 0.3414$$

The tabulated χ2 value is 5.991 (at 5% LOS and 2 d.f.). Since the calculated value is less than the tabulated value, we accept the null hypothesis. There is no significant difference in the purchase volume due to the payment mode.

4.7 ONE SAMPLE RUN TEST

So far, we have assumed that the samples in our problems were randomly selected, that is, chosen without preference or bias. What if you were to notice recurrent patterns in a sample chosen by someone else? Suppose that applicants for advanced job training were to be selected without regard to gender from a large population. Using the notation

W = woman and M = man, you find that the first group enters in this order:

W,W,W,W,M,M,M,M,W,W,W,W,M,M,M,M


By inspection, you would conclude that although the total number of applicants is equally divided between the sexes, the order is not random. A random process would rarely list two items in alternating groups of four. Suppose now that the applicants begin to arrive in this order:

W,M,W,M,W,M,W,M,W,M,W,M,W,M,W,M

It is just as unreasonable to think that a random selection process would produce such an orderly pattern of men and women. In this case, too, the proportion of women to men is right, but you would be suspicious about the order in which they are arriving.

To allow us to test samples for the randomness of their order, statisticians have developed the theory of runs. A run is a sequence of identical occurrences preceded and followed by different occurrences or by none at all. If men and women enter as follows, the sequence will contain three runs:

W, M,M,M,M, W

1st 2nd 3rd

And this sequence contains six runs:

W,W,W, M,M, W, M, M, M, M, W, W, W, W, M

1st 2nd 3rd 4th 5th 6th

A test of runs would use the following symbols if it contained just two kinds of occurrences:

n1 = number of occurrences of type 1

n2 = number of occurrences of type 2

r = number of runs

Let’s apply these symbols to a different pattern for the arrival of applicants:

M,W,W,M,M,M,M,W,W,W,M,M,W,M,W,W,M

In this case, the values of n1, n2 and r would be

n1 = 8 Number of women

n2 = 9 Number of men

r = 9 Number of runs
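Counting runs by eye becomes error-prone for long sequences. The following sketch counts them programmatically for the arrival pattern above; the function name `count_runs` is an illustrative choice, and `itertools.groupby` is used because it groups consecutive identical items, which is exactly the definition of a run.

```python
from itertools import groupby

def count_runs(sequence):
    """Return (number of runs, occurrences of each symbol) for a list of symbols."""
    runs = [symbol for symbol, _ in groupby(sequence)]  # one entry per run
    counts = {s: sequence.count(s) for s in set(sequence)}
    return len(runs), counts

arrivals = "M,W,W,M,M,M,M,W,W,W,M,M,W,M,W,W,M".split(",")
r, counts = count_runs(arrivals)

print(f"n1 (women) = {counts['W']}, n2 (men) = {counts['M']}, r = {r}")
```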

The Sampling Distribution of the ‘r’ Statistic:


The number of runs, or r, is a statistic with its own special sampling distribution and its own test. Obviously, runs may be of differing lengths, and various numbers of runs can occur in one sample. Statisticians can prove that too many or too few runs in a sample indicate that something other than chance was at work when the items were selected.

A one-sample runs test, then, is based on the idea that too few or too many runs show that the items were not chosen randomly.

To derive the mean of the sampling distribution of the r statistic, use the following formula:

Mean of the sampling distribution of the r statistic:

$$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

The standard error of the r statistic can be calculated with the formula:

$$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$$

And then we use the standard Z equation to compute our test statistic.

Illustration 4.6:

A manufacturer of breakfast cereal uses a machine to randomly insert one of two types of toys (Type A or B) in each box. The company wants randomness so that every child in the neighborhood does not get the same toy. Testers choose samples of 60 successive boxes to see whether the machine is properly mixing the two types of toys. Using the symbols A and B to represent the two types of toys, a tester reported that one such batch looked like this:

B,A,B,B,B,A,A,A,B,B,A,B,B,B,B,A,A,A,A,B

A,B,A,A,B,B,B,A,A,B,A,A,A,A,B,B,A,B,B,A

A,A,A,B,B,A,B,B,B,B,A,A,B,B,A,B,A,A,B,B

Test whether the toy types have been inserted randomly at the 20% LOS.

Solution:

Null Hypothesis H0 : The toys are randomly mixed

Alternative Hypothesis H1 : The toys are not randomly mixed.


The values in our test will be

n1 = 29 Number of toys of type A

n2 = 31 Number of toys of type B

r = 29 Number of runs

α = 0.20 Level of significance

Mean of the sampling distribution of the r statistic:

$$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1$$

Applying this to the cereal company, the mean of the r statistic would be

$$\mu_r = \frac{(2)(29)(31)}{29 + 31} + 1 = \frac{1{,}798}{60} + 1 = 29.97 + 1 = 30.97$$

The standard error of the r statistic can be calculated with the formula:

$$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$$

For our problem, the standard error of the r statistic becomes

$$\sigma_r = \sqrt{\frac{(2)(29)(31)(2 \times 29 \times 31 - 29 - 31)}{(29 + 31)^2 (29 + 31 - 1)}} = \sqrt{\frac{(1{,}798)(1{,}738)}{(60)^2 (59)}} = \sqrt{14.71} = 3.84$$

In the one-sample runs test, the sampling distribution of r can be closely approximated by the normal distribution if either n1 or n2 is larger than 20. Our cereal company has a sample of 60 boxes, so we can use the normal approximation. Management is interested in testing at the 0.20 level the hypothesis that the toys are randomly mixed.

Because too many or too few runs would indicate that the process by which the toys are inserted into the boxes is not random, a two-tailed test is appropriate.


Next use the following equation to standardize the sample r statistic:

$$z = \frac{r - \mu_r}{\sigma_r} = \frac{29 - 30.97}{3.84} = -0.513$$

Zcal = -0.513 lies between the critical values Z = ±1.28 of a two-tailed test at the 0.20 level. Therefore, management should accept the null hypothesis and conclude from this test that toys are being inserted in boxes in random order.
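The whole of Illustration 4.6 can be verified with a short script. The sketch below is only an illustration: `runs_test` is a name invented here, and the 60-box batch is typed in as one string (commas removed) so that the counts and the number of runs are derived from the data rather than assumed.

```python
from itertools import groupby
from math import sqrt

def runs_test(sequence):
    """Return n1, n2, r, mean, standard error and z for a two-symbol sequence."""
    first, second = sorted(set(sequence))
    n1, n2 = sequence.count(first), sequence.count(second)
    r = sum(1 for _ in groupby(sequence))  # each group of repeats is one run
    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    se_r = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return n1, n2, r, mean_r, se_r, (r - mean_r) / se_r

# The three rows of 20 boxes reported by the tester, in order.
boxes = (
    "BABBBAAABBABBBBAAAAB"
    "ABAABBBAABAAAABBABBA"
    "AAABBABBBBAABBABAABB"
)

n1, n2, r, mean_r, se_r, z = runs_test(boxes)
print(f"n1 = {n1}, n2 = {n2}, r = {r}")
print(f"mean = {mean_r:.2f}, SE = {se_r:.2f}, z = {z:.3f}")
print("random order not rejected" if abs(z) < 1.28 else "random order rejected")
```

The cut-off 1.28 in the last line is the two-tailed critical value at the 0.20 level used in the text; it would need to be changed for other significance levels.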

4.8 RANK CORRELATION

4.8.1 Meaning Of Spearman’s Rank Correlation

Spearman’s rank correlation uses ranks rather than actual observations and makes no assumptions about the population from which the actual observations are drawn. The correlation coefficient between two series of ranks is called the ‘Rank Correlation Coefficient’. It is given by the formula:

$$R = 1 - \frac{6 \sum D^2}{N^3 - N}$$

where, R = rank correlation coefficient

D = d = Difference of the ranks between paired items in two series

N = Number of pairs of ranks

In case of tied ranks, that is, when there is more than one item with the same value in a series, the average rank is usually allotted to each of these items and the factor (m^3 - m)/12 is added to ΣD^2 for each such group of tied items, where m is the number of items sharing the tied rank. Thus, in case of tied ranks, the modified formula for the rank correlation coefficient becomes:

$$R = 1 - \frac{6 \left[ \sum D^2 + \frac{m^3 - m}{12} + \cdots \right]}{N^3 - N}$$


4.8.2 Features of Spearman’s Correlation Coefficient

1) Spearman’s correlation coefficient is based on ranks rather than on actual observations.

2) Spearman’s correlation coefficient is distribution-free or non-parametric because no strict assumptions are made about the form of the population from which sample observations are drawn.

3) The sum of the differences of ranks between two variables shall be zero (i.e. ΣD = 0).

4) It can be interpreted like Karl Pearson’s Correlation Coefficient.

5) It lies between -1 and +1 (-1 ≤ R ≤ +1).

4.8.3 Merits of Spearman’s Rank Correlation Coefficient

1) Simple to understand and easy to apply- Rank method is simple to understand and easy to apply as compared to Karl Pearson’s method.

2) Suitable for qualitative data- Rank method can be conveniently used as a measure of the degree of association between variables which cannot be quantified but can only be ranked in some order. For example, it may be possible for two judges to rank by preference 10 girls in terms of beauty, whereas it may be difficult to give them a numerical grade in terms of beauty.

3) Suitable for non-normal data- Rank method can conveniently be used when data are not normally distributed, because the rank correlation coefficient, unlike Karl Pearson’s, is not based on the assumption of normality of data.

4) Only method for ranks- Rank method is the only method that can be used where only ranks are given and not the actual data.

5) Application even for actual data- Rank method can be applied even where actual data are given.

4.8.4 Limitations Of Spearman’s Rank Correlation Coefficient

1) Unsuitable for grouped data- Rank method cannot be applied in case of a grouped frequency distribution.

2) Tedious calculations- Calculations become quite tedious where N exceeds 30.

3) Approximation- Rank method’s result is only an approximation since actual data are not taken into account.

4.8.5 When To Use Rank Method

Rank method is used when:

1) Number of pairs of observations is fairly small (say, not more than 30).

2) The original data are in the form of ranks.
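The rank correlation formula of section 4.8.1 is easy to sketch in Python. The example below uses the age and sales figures of Illustration 4.7; the function names are illustrative, and the ranking helper deliberately assumes there are no tied values (ties would need the average-rank adjustment described in section 4.8.1).

```python
def ranks_no_ties(values):
    """Rank distinct values from 1 (smallest) to N (largest)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Spearman's R = 1 - 6*sum(D^2) / (N^3 - N), for data without tied values."""
    rx, ry = ranks_no_ties(x), ranks_no_ties(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return rx, ry, sum_d2, 1 - 6 * sum_d2 / (n ** 3 - n)

age = [30, 18, 35, 28, 42, 25, 19, 29]      # X
sales = [35, 31, 43, 46, 50, 32, 33, 42]    # Y

rank_x, rank_y, sum_d2, r_s = spearman_r(age, sales)
print(rank_x, rank_y, sum_d2, round(r_s, 4))
```

Note that N^3 - N equals N(N^2 - 1), so this is the same expression as the one used in the worked illustration.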


Illustration 4.7:

The following are ratings of age (X) and the volume of sales (Y) for eight salespeople. Is there a significant rank correlation between age and sales volume? Use the 0.10 significance level.

X 30 18 35 28 42 25 19 29

Y 35 31 43 46 50 32 33 42

Solution:

H0: No significant correlation between age and sales volume (ρs = 0)

H1: The correlation is significant between age and sales volume (ρs ≠ 0)

We assign ranks for X and for Y. In X (age), 18 is the lowest age and hence is assigned rank 1, next is 19, which is assigned rank 2, and a similar pattern is followed for Y (sales) also.

X(ranks) 6 1 7 4 8 3 2 5

Y(ranks) 4 1 6 7 8 2 3 5

D = d 2 0 1 -3 0 1 -1 0

D2 = d2 4 0 1 9 0 1 1 0

With Σd2 = 16, n = 8 and α = 0.10:

$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(16)}{8(63)} = 0.8095$$

From the Spearman’s rank correlation table in Appendix 7, the critical value for rs is 0.6190 (at 0.1 LOS and n = 8). Since the calculated value of rs, 0.8095, is greater than the tabulated value, we reject H0. The correlation between age and sales volume is significant.

4.9 SUMMARY

This unit has introduced you to the essentials of non-parametric methods in various business applications. Specifically, this unit focused on:

Need and role of non-parametric methods.

Understanding the sign test to test the significance of the difference between the samples before and after a scenario.


Importance of rank sum tests.

Run test to test the randomness of the data.

Understanding rank correlation through Spearman’s correlation coefficient.

How to compute the rank correlation coefficient.

Exercises:

1. After collecting data on the amount of air pollution in a metro city in India, the pollution control board (PCB) decided to issue strict new rules to govern the amount of hydrocarbons in the air. For the next year, the data were collected and compared with the previous year. Based on the following data, does the PCB have enough evidence to conclude with 90% confidence that the new rules were effective in lowering the amount of air pollution?

Jan Feb Mar Apr May June Jul Aug Sep Oct Nov Dec

Last year 7.0 6.0 5.4 5.9 3.9 5.7 6.9 7.6 6.3 5.8 5.1 5.0

This year 5.3 6.1 5.6 5.7 3.7 4.7 6.1 7.2 6.4 5.7 4.9 5.8

2. The year 1996 was particularly bad for injuries to professional cricketers. From the following data, does a sign test for paired data indicate that the IPL players suffered significantly more injuries than their national team (NT) counterparts? Use a 0.05 level of significance.

Injury location IPL NT Injury location IPL NT
Shoulder(1) 46 22 Back(9) 10 7
Neck(2) 3 0 Wrist(10) 10 2
Rib(3) 7 5 Hip(11) 1 1
Elbow(4) 21 19 Hand(12) 6 4
Finger(5) 7 5 Ankle(13) 6 4
Thigh(6) 17 14 Foot(14) 1 4
Groin(7) 7 3 Toe(15) 0 1
Knee(8) 16 18 Other(16) 10 4

3. Use the sign test to see if there is a difference between the number of days required to collect an account receivable before and after a new collection policy. Use the 0.05 significance level.

Before 33 36 41 32 39 41 34 29 32 34 40 42 33 36 27

After 25 29 38 34 37 47 36 32 30 34 41 38 37 35 28


4. Use the sign test to see whether there is a difference between the number of days required to collect an account receivable before and after a new collection policy. Use the 0.05 significance level.

Before 33 36 41 32 39 47 34 29 32 34 40 42 33 36 29

After 35 29 38 34 37 47 36 32 30 34 41 38 37 35 28

5. From a sample of days that were considered heavy shopping days, the following data give one store’s sales on days when cheese sampling was done and on days when it was not done.

Sales (in hundreds)

Promotion days 18 21 23 15 19 26 17 18 22 20 18 21 27

Regular days 22 17 15 23 25 20 26 24 16 17 23 21

Use the U-test and decide whether the storefront sampling produced greater sales at α = 5%.

6. A nursery is experimenting with two blends of fertilizers for fertilizing lawns in a certain area. Twenty-four randomly selected patches of grass are chosen for experimenting with the fertilizer. Twelve patches are randomly assigned to fertilizer A and another 12 patches are assigned to fertilizer B. The increase in the height of the grass after two weeks is given in the following table. Using a 10% level of significance, test the hypothesis that fertilizer B is more effective than fertilizer A.

Fertilizer A Fertilizer B Fertilizer A Fertilizer B
1.2 1.0 0.8 1.2
0.9 1.1 1.4 1.3
1.3 1.0 0.7 0.8
0.5 0.9 0.9 1.0
0.3 0.7 0.8 0.8
0.9 0.8 0.7 1.1


7. The following are the weight gains of 2 random samples of young Indians fed on two different diets but otherwise under identical conditions.

Diet 1 16.3 10.1 10.7 13.5 14.9 11.8 14.3 10.2 12.0 14.7 23.6 15.1 14.5 18.4 13.2 14.0

Diet 2 21.3 23.8 18.4 19.6 12.0 13.9 18.8 19.2 15.3 20.1 14.8 18.9 20.7 21.1 15.8 16.2

Use the U test at the 0.01 level of significance to test the null hypothesis that the two population samples are identical against the alternative hypothesis that on average the second diet produces a greater gain in weight.

8. Test the hypothesis of no difference between the ages of male and female employees of a certain company using the Mann-Whitney U test for the sample data. Use the 0.10 level of significance.

Marks Females 31 44 25 30 38 34 33 47 42 35 40 32 44 35 26 47 43 48 35 34

9. The vice-president of quality assurance of an airline company is interested in whether three quality engineers are usually in agreement on the ratings they give to different airline seating designs. Seven different designs are chosen at random and the quality engineers are asked to rate the comfort to the passengers on a scale from one to ten, with ten representing the highest level of comfort possible. Do the given data indicate that there is a difference in the ratings of the three quality engineers? Use a 0.10 significance level.

Design Quality Engineer1 Quality Engineer2 Quality Engineer3
1 5 7 8
2 4 3 5
3 6 5 4
4 9 7 8
5 5 7 4
6 8 7 6
7 9 6 8


10. The following are the final examination marks of three groups of students who were taught computer by three different methods:

1st Method 94 88 91 74 87 97

2nd Method 85 82 79 84 61 72 80

3rd Method 89 67 72 76 69

Test whether the 3 methods are equally effective.

11. A newspaper circulation office reports a record of the gender of people who called the company to complain about delivery problems with the Sunday paper. For a recent Sunday, these data were as follows:

M,F,F,M,M,F,M,F,F,F,M,M,M,F,M,F,M,F,F,F,F,M,M,M,M,M.

Using α = 0.05, test this sequence for randomness.

12. SBI is curious about the level of students who use their ATM in the campus. Juniors (1 & 2 years) are classified as type A and seniors (3 & 4 years) as type B. Data are presented below for 45 students who used the ATM during one Friday afternoon. Test this sequence for randomness at the 0.05 significance level.

BBBAA ABAAA AAABB BBABA AAABB AABBB BABBB BAAAA AABBB

13. Thirty five true or false questions were given in a statistics test. The following sequence contains the answers to the questions in the order in which they appeared. At the 5% level, is there evidence that the true and false answers are not randomly assigned?

FFFFT TFTTF FFFTF TFFTT TTFFT TFFTF TTTF

14. The following is an arrangement of men (M) and women (W) lined up to purchase tickets for a music concert:

MWMWM MWMWM MMWWM MMMWW MWMMM

WMMMW WWMWM MMWMW M MMMW WMMMM

Test for randomness of the arrangement at the 0.05 level of significance.

15. A plant supervisor ranked a sample of eight workers on the number of hours of overtime worked and length of employment. Is the rank correlation between the two measures significant at the 0.01 level?

Amount of overtime (X) 5 8 2 4 3 7 1 6

Years employed (Y) 1 6 4.5 2 7 8 4.5 3

16. The management of a firm wishes to know whether there is a positive relationship between the length of time a certain product has been on the market and the percent of the market that the product has captured. Do the data indicate that a positive relationship exists? Use a 5% significance level.

Time on market (years) % of market Time on market (years) % of market
1 1.2 6 3.9
2 2.6 7 3.2
3 1.8 8 4.1
4 2.7 9 3.8
5 2.9 10 4.6


UNIT V

CORRELATION, REGRESSION AND TIME SERIES ANALYSIS

5.1 INTRODUCTION

In this unit we shall discuss the application of Karl Pearson’s correlation coefficient and then try to frame a simple linear regression equation for the given data. The value of the correlation coefficient will tell you the strength of the relationship between two variables, and the regression equation will help you predict one variable from the other, provided there exists a correlation.

For example, we all spend more when we earn more, and the converse is also true. Suppose you have a record of your income and expenditure over the last ten years; then you can use Karl Pearson’s correlation coefficient to measure the strength of the relationship between your annual income and expenditure. It is likely to be a positive value. If there does exist a relationship between your annual income and expenditure, then you can frame a regression equation with your recorded data and use it to predict your future spending pattern based on expected income. From the organizational point of view, if there exists a correlation between advertising expenditure and sales revenue, then the marketing department can predict its expected sales revenue based on its intended advertising expenditure.

I guess you now have a fair idea of how the correlation and regression equations are useful for establishing relationships and predicting values for the future.

5.2 LEARNING OBJECTIVES

After reading this unit, you will be able to:

Define correlation coefficient with its properties.

Calculate correlation coefficient and interpret.

Look forward to applying the regression equation.

Formulate the regression equation and use it for estimation and prediction.


5.3 CORRELATION ANALYSIS

5.3.1 Meaning of Correlation

Correlation is the relation that exists between two or more variables. If two variables are related to each other in such a way that a change in one creates a corresponding change in the other, then the variables are said to be correlated. Some common relationships have been established between:

Heights and weights

The quantum of rainfall and the yield of crops

The age of individuals and their blood pressure

The advertising expenditure and sales

Correlation is sometimes termed “Covariation”. The measure of correlation is called the coefficient of correlation. In other words, correlation analysis is a statistical technique used to measure the degree and direction of the relationship between the variables.

5.3.2 Applications of Karl Pearson’s Correlation Coefficient

Some applications of Karl Pearson’s correlation are:

Economic theory and business studies have shown that relationships exist between variables like price and quantity demanded, advertising expenditure and sales promotion measures, etc.

Correlation analysis also helps in deriving precisely the degree and the direction of such relationships.

The effect of correlation is to reduce the range of uncertainty of our prediction. The prediction based on correlation analysis will be more reliable and nearer to reality.

Correlation analysis contributes to the understanding of macro and micro economic trends and aids in identifying the significant variables on which other variables depend.

5.3.3 Karl Pearson’s Coefficient of Correlation

Definition- Given a set of N pairs of observations (X1, Y1), (X2, Y2), ..., (XN, YN) relating to two variables X and Y, the coefficient of correlation between X and Y, denoted by the symbol ‘r’, is defined as

$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$$

Where, Cov(X, Y) = Covariance of X and Y

σx = Standard Deviation of variable X

σy = Standard Deviation of variable Y


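This expression is known as Pearson’s product-moment formula and is used as a measure of linear correlation between X and Y.

Expanded forms of the above formula:

1. Expanding the formula of Cov(X, Y)

(i).

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N \sigma_x \sigma_y} \qquad \left[ \text{Note: } \mathrm{Cov}(X, Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} \right]$$

(ii).

$$r = \frac{\sum xy}{N \sigma_x \sigma_y} \qquad \left[ \text{Note: } \mathrm{Cov}(X, Y) = \frac{\sum xy}{N} \right]$$

2. Expanding the formula of Standard Deviation

$$r = \frac{\sum xy}{N \sqrt{\frac{\sum x^2}{N}} \sqrt{\frac{\sum y^2}{N}}} \qquad \left[ \text{Note: } \sigma_x = \sqrt{\frac{\sum x^2}{N}},\ \sigma_y = \sqrt{\frac{\sum y^2}{N}} \right]$$

or

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \qquad (1)$$

where x = X - X̄ and y = Y - Ȳ are the deviations of X and Y from their means.

It should be good enough if you could just memorize equation 1. We will use this equation for computing ‘r’. The value of r will necessarily lie between -1 and +1. If you get an answer for ‘r’ either greater than +1 or less than -1, then you can be sure that you have committed an error in your calculations.

The significance of the values of ‘r’ is listed below.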


Table 5.1 Significance of the values of ‘r’

Value of r: Significance

(a) If r = +1: There exists perfect positive correlation between the variables.

(b) If r = -1: There exists perfect negative correlation between the variables.

(c) If r = 0: There exists no relationship between the variables.

(d) If +0.75 <= r < +1: There exists high positive correlation between the variables.

(e) If -0.75 >= r > -1: There exists high negative correlation between the variables.

(f) If +0.50 <= r < +0.75: There exists moderate positive correlation between the variables.

(g) If -0.50 >= r > -0.75: There exists moderate negative correlation between the variables.

(h) If +0.25 < r <= +0.50: There exists low positive correlation between the variables.

(i) If -0.25 > r > -0.50: There exists low negative correlation between the variables.

5.3.4 Properties Of Coefficient Of Correlation

1. Independent of choice of origin – The coefficient of correlation (r) is independent of the choice of origin. In other words, the value of ‘r’ is not affected even if each of the individual values of X and Y is increased or decreased by some constant.

2. Independent of choice of scale – The coefficient of correlation (r) is independent of the choice of scale of observations. In other words, the value of ‘r’ is not affected even if each of the individual values of X or Y is multiplied or divided by some constant.

3. Independent of Units of Measurement – The correlation coefficient r is a pure number and is independent of the units of measurement. This means that if, for example, X represents height in inches and Y weight in lbs., then the correlation coefficient between X and Y will neither be in inches nor in lbs. or any other unit, but only a number.

4. The correlation coefficient r lies between -1 and +1.

5. The coefficient of correlation is the geometric mean of the two regression coefficients:

$$r = \pm \sqrt{b_{xy} \times b_{yx}}$$
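The first two properties (independence of origin and of scale) can be checked numerically. The sketch below is only an illustration: the function name `pearson_r` is chosen here, the data are the promotional expenses and sales figures used in Illustration 5.2 of this unit, and the shift and scale constants are arbitrary values picked for the demonstration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sum_x2 = sum((a - mean_x) ** 2 for a in xs)
    sum_y2 = sum((b - mean_y) ** 2 for b in ys)
    return sum_xy / sqrt(sum_x2 * sum_y2)

x = [7, 10, 9, 4, 11, 5, 3]
y = [12, 14, 13, 5, 15, 7, 4]

r_original = pearson_r(x, y)
# Change of origin: add or subtract arbitrary constants.
r_shifted = pearson_r([v + 100 for v in x], [v - 3 for v in y])
# Change of scale: multiply or divide by arbitrary positive constants.
r_scaled = pearson_r([v * 2.54 for v in x], [v / 10 for v in y])

print(round(r_original, 4), round(r_shifted, 4), round(r_scaled, 4))
```

All three values agree up to floating-point rounding. One caveat worth keeping in mind: multiplying only one of the variables by a negative constant leaves the magnitude of r unchanged but reverses its sign.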


5.3.5 Assumptions of Coefficient of Correlation

1. Linear relationship between variables: There is a linear relationship between the variables. It means that if paired items of both the variables are plotted on a scatter diagram, the plotted points will form a straight line.

2. Cause & Effect Relationship: There is a cause and effect relationship between the forces affecting the distribution of the items of both the series. Correlation is meaningless if there is no such relationship. For example, there is no relationship between no. of crimes and no. of political leaders.

3. Normality: The two variables are affected to a great extent by a large number of independent causes, so that a normal distribution is formed.

5.3.6 Merits of Coefficient of Correlation

1. Correlation coefficient gives the direction as well as the degree of relationship between the variables.

2. Correlation coefficient, along with other information, helps in estimating the value of the dependent variable from the known value of the independent variables.

5.3.7 Limitations of Coefficient of Correlation

1. Assumption of linear relationship- The assumption of a linear relationship between the variables may or may not always hold true.

2. Time consuming- Its computation is time consuming as compared to other methods.

3. Affected by extreme values- It is affected by the value of extreme items.

4. Requires careful interpretation- It is to be interpreted after taking into consideration other factors as well. The investigator should reach a conclusion based on logical reasoning and intelligent investigation of significantly related matters.

ILLUSTRATION 5.1: Check whether there is any relationship between the marks scored in class tests and the marks scored in the University examinations. The following data show the marks scored by 9 students in the class tests and the University examinations.

Class Test Scores (%) 51 63 73 46 50 60 47 36 60

University Examination Scores (%) 49 72 74 44 58 66 50 30 55

SOLUTION: So here is a typical situation where you can use Karl Pearson’s correlation coefficient to find whether there exists any relationship between the test scores and the University scores. We’ll use equation 1 to compute the value of ‘r’.


$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$$

If you look at the above equation you will understand that we need to calculate Σxy, Σx2 and Σy2, where x = X – X’ and y = Y – Y’. The class test scores are treated as ‘X’ and the University examination scores are treated as ‘Y’. X’ and Y’ are the means, computed as follows.

X’ = ΣX / n and Y’ = ΣY / n, where ‘n’ is the number of data.

X’ = ΣX / n = 486 / 9 = 54

Y’ = ΣY / n = 498 / 9 = 55.3

Now let us prepare the following table 5.2

Table 5.2

X Y x = X – X’ y = Y – Y’ xy x2 y2
51 49 -3 -6.3 18.9 9 39.69
63 72 9 16.7 150.3 81 278.89
73 74 19 18.7 355.3 361 349.69
46 44 -8 -11.3 90.4 64 127.69
50 58 -4 2.7 -10.8 16 7.29
60 66 6 10.7 64.2 36 114.49
47 50 -7 -5.3 37.1 49 28.09
36 30 -18 -25.3 455.4 324 640.09
60 55 6 -0.3 -1.8 36 0.09
ΣX = 486 ΣY = 498 Σxy = 1159 Σx2 = 976 Σy2 = 1586

Now we have all the parameters to be fitted into equation 1 to compute the value of r.

r = 1159 / Sqrt (976 x 1586) = 0.93


Since ‘r’ has a high positive value, we can conclude that there is a high degree of correlation between class test scores and University examination scores. That is, students who have fared well in the class tests tend to do well in the University examinations as well.
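The calculation in Illustration 5.1 can be reproduced in a few lines of Python. This is only a minimal sketch of the deviation formula used above; the function name pearson_r and the variable names are illustrative choices, not part of the original text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's r computed from deviations about the means."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    dx = [x - x_mean for x in xs]  # x = X - X'
    dy = [y - y_mean for y in ys]  # y = Y - Y'
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Data from Illustration 5.1
class_test = [51, 63, 73, 46, 50, 60, 47, 36, 60]
university = [49, 72, 74, 44, 58, 66, 50, 30, 55]

r = pearson_r(class_test, university)
print(f"r = {r:.2f}")
```

Because the means are computed inside the function, the same sketch can be reused for any pair of equal-length series.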

ILLUSTRATION 5.2:

The following data refer to two variables, promotional expenses (in lakhs of Rs) and sales (100 units), collected in the context of a promotional study. Calculate the correlation coefficient and comment.

Promotional Expenses 7 10 9 4 11 5 3

Sales 12 14 13 5 15 7 4

Solution:

You can try out this problem using the same formula as in the previous illustration or using the other equation for the correlation coefficient. (X - X’) in the previous illustration is the same as (X - X̄) in this illustration. Similarly, (Y - Y’) in the previous illustration is the same as (Y - Ȳ) in this illustration. The differences in notation are highlighted because you may come across both notations in different contexts. This problem is solved using the other relation for ‘r’. Both procedures are equivalent, and you may adopt either of them. The calculations are shown in table 5.3 below:


Table 5.3

| Promotional Expenses (X) | Sales (Y) | (X - X̄) | (Y - Ȳ) | (X - X̄)(Y - Ȳ) | (X - X̄)² | (Y - Ȳ)² |
|---|---|---|---|---|---|---|
| 7 | 12 | 0 | 2 | 0 | 0 | 4 |
| 10 | 14 | 3 | 4 | 12 | 9 | 16 |
| 9 | 13 | 2 | 3 | 6 | 4 | 9 |
| 4 | 5 | -3 | -5 | 15 | 9 | 25 |
| 11 | 15 | 4 | 5 | 20 | 16 | 25 |
| 5 | 7 | -2 | -3 | 6 | 4 | 9 |
| 3 | 4 | -4 | -6 | 24 | 16 | 36 |
| Mean = 7 | Mean = 10 | | | 83 | 58 | 124 |

In the calculations shown above, the numbers 7 and 10 in the bottom row of the first two columns are the means of X and Y. That is, X̄ = 7 and Ȳ = 10. Likewise,

Σ(X - X̄)(Y - Ȳ) = 83, Σ(X - X̄)² = 58 and Σ(Y - Ȳ)² = 124

r = Σ(X - X̄)(Y - Ȳ) / √[Σ(X - X̄)² × Σ(Y - Ȳ)²] = 83 / √(58 × 124) = 0.9787

The inference that can be made here is that promotional expenditure is strongly associated with sales, since the correlation is very close to 1.

a. Regression Analysis

The real meaning of correlation is that a relationship exists between the variables, so it should be possible to predict one from the other. For example, in illustration 5.1 we found a high positive correlation between class test scores and University scores, so it would be possible to predict the prospective University score from the class test score. Similarly, in illustration 5.2 it would be possible to predict sales from promotional expenditure, since the correlation shows a high degree of relationship between them. The regression equation does just that. A linear regression equation helps predict one variable from another. The predicted variable is known as the dependent variable and the variable used for prediction is known as the independent variable. For example, if we predict sales from promotional expenditure, then ‘expenditure’ is the independent variable and ‘sales’ is the dependent variable.


Hence regression is the measure of the average relationship between two or more variables in terms of the original units of the data. For example, after having established that two variables (sales and promotional expenditure) are correlated, one may find the average relationship between the two in order to estimate the unknown values of the dependent variable (sales) from the known values of the independent variable (promotional expenditure).

5.4.1 Meaning Of Regression

Regression analysis is a statistical tool to study the nature and extent of the functional relationship between two or more variables and to estimate (or predict) the unknown values of dependent variables from the known values of independent variables.

Dependent Variable (or Explained Variable): The variable which is predicted on the basis of another variable is called the dependent variable or explained variable. The dependent variable is usually denoted by Y.

Independent Variable (or Explanatory Variable): The variable which is used to predict another variable is called the independent variable or explanatory variable. The independent variable is usually denoted by X.

Note: The terms ‘dependence’ and ‘independence’ do not mean that there is necessarily any cause and effect relationship between the variables.

5.4.2 Use Of Regression Analysis

Regression analysis, through a regression line, makes it possible to predict the values of a dependent variable from the given values of an independent variable.

Regression analysis, through the standard error, provides a measure of the error involved in using the regression line as a basis for estimation.

Regression analysis, through the regression coefficients (b_xy and b_yx), makes it possible to calculate the coefficient of determination (r²) and the coefficient of correlation (r).

Regression analysis is a highly valuable tool in economics and business research, since most problems of economic analysis are based on cause and effect relationships.


5.4.3 Distinction between Correlation and Regression

Correlation differs from regression in the following respects:

Table 5.4 Distinction between correlation and regression

| Basis of Distinction | Correlation | Regression |
|---|---|---|
| 1. What does it measure? | Correlation measures the degree and direction of the relationship between the variables. | Regression measures the average relationship between two or more variables in terms of the original units of the data, and is used for prediction. |
| 2. Whether relative or absolute measure. | It is a relative measure showing association between variables. | It is an absolute measure of relationship. |
| 3. Whether independent of choice of both origin and scale. | The correlation coefficient is independent of the choice of both origin and scale. | The regression coefficient is independent of the choice of origin but not of scale. |
| 4. Whether independent of units of measurement. | The correlation coefficient is independent of units of measurement. | The regression coefficient is not independent of units of measurement. |
| 5. Expression of relationship. | The relationship between the variables is expressed as a value ranging from -1 to +1. | The relationship between the variables may be expressed in forms such as Y = a + bX or Y = a + bX + cX². |
| 6. Whether a forecasting device? | It is not a forecasting device. | It is a forecasting device which can be used to predict the value of the dependent variable from the given value of the independent variable. |
| 7. Non-sensical | There may be correlations, such as between UG admissions in India and PG admissions in China, which do not really make sense except for the compatibility of the data. | There is nothing like nonsense regression. Only meaningful regression equations can be built. |

5.4.4 Regression Lines

In the case of a simple linear regression model (i.e. when there is only one independent variable and there is a linear relationship between the dependent and independent variables), there are two regression lines, as follows:

1. Regression Line of X on Y

X = a + bY

where,

X = Dependent variable,


Y = Independent variable,

a = X intercept (i.e. the value of the dependent variable when the value of the independent variable is zero)

b = slope of the linear regression line (i.e. the amount of change in the value of the dependent variable per unit change in the independent variable).

The values of the two constants ‘a’ and ‘b’ can be calculated for the given data of the X and Y variables by solving the following two normal equations:

ΣX = Na + bΣY

ΣXY = aΣY + bΣY²

where, N = Number of pairs of X and Y values

ΣX = Sum of values of variable X

ΣY = Sum of values of variable Y

ΣY² = Sum of squares of values of variable Y

ΣXY = Sum of products of values of X and Y

2. Regression Line of Y on X

Similarly, the regression of Y on X is written as

Y = a + bX

where Y is the dependent variable and X is the independent variable. Y = a when X = 0, and ‘b’ is the slope. The values of a and b are computed from the two simultaneous equations

ΣY = Na + bΣX

ΣXY = aΣX + bΣX²

Thus, be it regression of X on Y or of Y on X, we have a set of two equations to solve for ‘a’ and ‘b’. Which set to adopt depends on the problem. You will understand when to choose between the two as you go through illustration 5.3.

Use of Regression Line of X on Y: This line gives the probable value of X for any given value of Y.

Another way of expressing the Regression Line of X on Y: This line can also be expressed as follows:


(X - X̄) = b_xy (Y - Ȳ)

Or

(X - X̄) = r (σ_x / σ_y) (Y - Ȳ)   [since b_xy = r (σ_x / σ_y)]

where, X̄ = Arithmetic mean of X series

Ȳ = Arithmetic mean of Y series

σ_x = Standard deviation of X series

σ_y = Standard deviation of Y series

r = Coefficient of correlation between the two variables X and Y.

5.4.5 Properties of Linear Regression

1. The Regression Equation: There are two linear regression equations.

(i) Regression equation of Y on X:

(Y - Ȳ) = b_yx (X - X̄)

(ii) Regression equation of X on Y:

(X - X̄) = b_xy (Y - Ȳ)

where b_yx and b_xy are respectively the regression coefficient (or slope) of Y on X and the regression coefficient (or slope) of X on Y:

b_yx = Cov(X, Y) / σ_x² = r (σ_y / σ_x)

b_xy = Cov(X, Y) / σ_y² = r (σ_x / σ_y)

2. Product of regression coefficients: The product of the two regression coefficients is equal to the square of the correlation coefficient:

b_yx × b_xy = r²


3. Signs of regression coefficients and correlation coefficient: r, b_yx and b_xy all have the same sign. If the correlation coefficient ‘r’ is zero, the regression coefficients b_yx and b_xy are also zero.

4. Intersection at means: The two regression lines always intersect at the point of means (X̄, Ȳ).

5. Slopes: The slopes of the regression line of Y on X and the regression line of X on Y are respectively b_yx and b_xy.

6. Angle between regression lines: The angle between the two regression lines depends on the correlation coefficient (r).

Table 5.5 - Value of ‘r’ and angle between regression lines.

| Value of r | Angle between regression lines |
|---|---|
| (a) If r = 0 | Regression lines are perpendicular to each other |
| (b) If r = +1 or -1 | Regression lines coincide (i.e. become identical) |

As the value of r increases numerically from 0 to 1, the angle between the regression lines decreases from 90° to 0°. In other words, the farther the two regression lines are from each other, the lower the degree of correlation, and the nearer the two regression lines are to each other, the higher the degree of correlation.

7. Estimation of value: The value of X or Y can be estimated from the linear equation if r ≠ 0.

5.4.6 Properties of Regression Coefficients

The following are the important properties of regression coefficients:

1. Same sign: Both regression coefficients have the same sign, i.e. both are either positive or negative.

2. Both cannot be greater than one: If one of the regression coefficients is greater than unity, the other must be less than unity, because the product of the two regression coefficients equals r², which cannot exceed unity. In other words, both regression coefficients cannot be greater than one.

3. Independence of origin: Regression coefficients are independent of the origin but not of scale.

4. A.M. ≥ r: The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient.

5. r is G.M.: The correlation coefficient is the geometric mean of the regression coefficients.


6. r, b_xy and b_yx have the same sign: The coefficient of correlation has the same sign as the regression coefficients, i.e. if the regression coefficients are positive, r is also positive, and if the regression coefficients are negative, r is also negative.
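The properties listed above can be checked numerically. The sketch below does so for the promotional expenses and sales data of Illustration 5.2; the variable names and the dictionary of checks are illustrative assumptions, and the tolerances only allow for floating-point rounding.

```python
from math import sqrt

expenses = [7, 10, 9, 4, 11, 5, 3]   # X, from Illustration 5.2
sales = [12, 14, 13, 5, 15, 7, 4]    # Y, from Illustration 5.2

n = len(expenses)
x_mean = sum(expenses) / n
y_mean = sum(sales) / n

s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(expenses, sales))
s_xx = sum((x - x_mean) ** 2 for x in expenses)
s_yy = sum((y - y_mean) ** 2 for y in sales)

b_yx = s_xy / s_xx               # slope of the regression of Y on X
b_xy = s_xy / s_yy               # slope of the regression of X on Y
r = s_xy / sqrt(s_xx * s_yy)     # correlation coefficient

checks = {
    "product equals r squared": abs(b_yx * b_xy - r ** 2) < 1e-12,
    "same sign": (b_yx > 0) == (b_xy > 0) == (r > 0),
    "not both greater than one": not (abs(b_yx) > 1 and abs(b_xy) > 1),
    "arithmetic mean >= |r|": abs(b_yx + b_xy) / 2 >= abs(r),
    "|r| is the geometric mean": abs(sqrt(b_yx * b_xy) - abs(r)) < 1e-12,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

A single data set cannot prove the properties, but every check should hold for any data in which X and Y both vary.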

ILLUSTRATION 5.3:

Refer back to Illustration 5.1, which checked for a relationship between class test scores and University examination scores. We’ll use the same data to frame the regression equation to predict the University examination scores.

Data:

Class test scores X 51 63 73 46 50 60 47 36 60

University Exam scores Y 49 72 74 44 58 66 50 30 55

SOLUTION:

Since we want to predict the University scores from the class test scores, we will use the regression equation of ‘Y’ on ‘X’. Y is the dependent variable and X is the independent variable. The regression equation is

Y = a + bX

and to compute the values of ‘a’ and ‘b’, we have the equations

ΣY = Na + bΣX     (1)

ΣXY = aΣX + bΣX²     (2)

We need to compute the values of ΣY, ΣX, ΣXY and ΣX². Refer to table 5.6 below for the calculations.

Table 5.6

| X | Y | XY | X² |
|---|---|---|---|
| 51 | 49 | 2499 | 2601 |
| 63 | 72 | 4536 | 3969 |
| 73 | 74 | 5402 | 5329 |
| 46 | 44 | 2024 | 2116 |
| 50 | 58 | 2900 | 2500 |
| 60 | 66 | 3960 | 3600 |
| 47 | 50 | 2350 | 2209 |
| 36 | 30 | 1080 | 1296 |
| 60 | 55 | 3300 | 3600 |
| ΣX = 486 | ΣY = 498 | ΣXY = 28051 | ΣX² = 27220 |


Now we can solve for ‘a’ and ‘b’ as a pair of simultaneous equations. Substituting the values in equations 1 and 2, we have

9a + 486b = 498     (3)

486a + 27220b = 28051     (4)

We multiply equation 3 by 54 so that the coefficient of ‘a’ in equation 3 equals the coefficient of ‘a’ in equation 4:

(Equation 3) × 54:  486a + 26244b = 26892

(Equation 4) × 1:  486a + 27220b = 28051

We now multiply equation 4 by -1 and add the two equations. This subtracts equation 4 from equation 3 and eliminates ‘a’, so that we can compute the value of ‘b’.

(Equation 3) × 54:  486a + 26244b = 26892

(Equation 4) × -1:  -486a - 27220b = -28051

Therefore -976b = -1159

Hence b = 1.1875.

Now we substitute the value of b in either equation 3 or 4 to get the value of ‘a’. From equation 3, a = (498 - 486 × 1.1875) / 9 = -8.79. Therefore the regression equation is

Y = -8.79 + 1.1875X

We can substitute the value of X (class test scores) to get the predicted value of Y (University exam scores).
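The elimination carried out by hand above can also be written as a short routine. The following is a minimal sketch, assuming the normal equations for the regression of Y on X; the function name fit_y_on_x and the class test score of 55 used for the sample prediction are hypothetical, chosen only for illustration.

```python
def fit_y_on_x(xs, ys):
    """Solve the normal equations  ΣY = Na + bΣX  and  ΣXY = aΣX + bΣX²."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Eliminate 'a' between the two equations to obtain 'b',
    # then back-substitute into the first equation to obtain 'a'.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Data from Illustration 5.3
class_test = [51, 63, 73, 46, 50, 60, 47, 36, 60]
university = [49, 72, 74, 44, 58, 66, 50, 30, 55]

a, b = fit_y_on_x(class_test, university)
print(f"Y = {a:.2f} + {b:.4f} X")
print(f"Predicted University score for a class test score of 55: {a + b * 55:.1f}")
```

Swapping the two arguments gives the regression of X on Y, which is the line to use when the class test score is the quantity being predicted.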

You can now go through the next Illustration which is also another familiar problem.

Important Note: If you were asked to predict the class test scores based on earlier University scores, then you would have to use the regression of X on Y.

ILLUSTRATION 5.4:

The following data refer to two variables, promotional expenses (in lakhs of Rs) and sales (100 units), collected in the context of a promotional study. Set up the simple linear regression model and predict sales when the promotional expense is Rs. 13 lakhs.


Promotional Expenses 7 10 9 4 11 5 3

Sales 12 14 13 5 15 7 4

SOLUTION: The basic calculations are shown in table 5.7 below:

Table 5.7

| Promotional Expenses (X) | Sales (Y) | (X - X̄) | (Y - Ȳ) | (X - X̄)(Y - Ȳ) | (X - X̄)² | (Y - Ȳ)² |
|---|---|---|---|---|---|---|
| 7 | 12 | 0 | 2 | 0 | 0 | 4 |
| 10 | 14 | 3 | 4 | 12 | 9 | 16 |
| 9 | 13 | 2 | 3 | 6 | 4 | 9 |
| 4 | 5 | -3 | -5 | 15 | 9 | 25 |
| 11 | 15 | 4 | 5 | 20 | 16 | 25 |
| 5 | 7 | -2 | -3 | 6 | 4 | 9 |
| 3 | 4 | -4 | -6 | 24 | 16 | 36 |
| Mean = 7 | Mean = 10 | | | 83 | 58 | 124 |

In the calculations shown above, the numbers 7 and 10 in the bottom row of the first two columns are the means of X and Y. That is, X̄ = 7 and Ȳ = 10. In the 5th, 6th and 7th columns, you find

Σ(X - X̄)(Y - Ȳ) = 83, Σ(X - X̄)² = 58 and Σ(Y - Ȳ)² = 124

You postulate the model in the standard form as follows:

Y = a + bX

where

Y is the dependent variable

X is the independent variable

a and b are constants.

As already worked out by solving the two normal equations,


b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)² = 83 / 58 = 1.4310

a = Ȳ - bX̄ = 10 - 1.4310 × 7 = -0.017

So the fitted equation is:

Y = -0.017 + 1.4310X. This is the line of best fit.

To predict the sales when promotional expenses = 13, put X = 13 in the fitted equation. You will get -0.017 + 1.4310 (13) = 18.59. Since sales are recorded in hundreds of units, the estimated sales when the promotional expense is Rs. 13 lakhs is 18.59 × 100 = 1,859 units.
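The deviation form of the slope used in this illustration can be sketched in code as follows. The function name fit_from_deviations is an illustrative choice; the prediction is returned in the same units as the sales data.

```python
def fit_from_deviations(xs, ys):
    """Fit Y = a + bX using b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)² and a = Ȳ - bX̄."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_mean - b * x_mean   # the fitted line passes through (X̄, Ȳ)
    return a, b


# Data from Illustration 5.4
expenses = [7, 10, 9, 4, 11, 5, 3]
sales = [12, 14, 13, 5, 15, 7, 4]

a, b = fit_from_deviations(expenses, sales)
predicted = a + b * 13        # promotional expense of Rs. 13 lakhs
print(f"Y = {a:.3f} + {b:.4f} X")
print(f"Predicted sales at X = 13: {predicted:.2f}")
```

This form gives the same line as solving the two normal equations, but needs only the deviation totals already tabulated for the correlation coefficient.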

5.5 SUMMARY

This section has introduced you to the essentials of correlation and regression with their important features. Specifically, this unit focused on:

Meaning and role of correlation.

Understanding correlation through scatter diagram.

Definition and properties of Pearson’s correlation coefficient.

How to compute correlation coefficient.

Basics of regression analysis.

Linear regression model and statistical validation.

5.6 EXERCISES ON CORRELATION AND REGRESSION

1. The following table gives the rainfall in inches and the production in ’00 kgs of Rabi and Khariff crops for a number of years. Find the coefficient of correlation between rainfall and total production:

Rainfall (cms) 20 22 24 26 28 30 32

Rabi production 15 18 20 32 40 39 40

Khariff production 15 17 20 18 20 21 15

2. Calculate the coefficient of correlation from the following data:

X 100 200 300 400 500 600 700

Y 0.3 0.5 0.6 0.8 1.0 1.1 1.3


3. The percentages of marks of 10 students in the M.Com and B.Com (Hons) examinations are as follows:

M.Com. 65 58 40 67 72 48 54 76 54 66

B.Com 70 75 62 45 78 60 40 64 45 61

Under similar conditions, how many marks may a student securing 76 marks in B.Com. (Hons.) expect in M.Com? Do you consider both the regression equations important?

4. Find the equation of the line of regression of x on y for the following data:

X 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Y 5.3 5.7 6.3 7.2 8.2 8.7 8.4

5. Determine the coefficient of correlation between the marks scored in English and the marks scored in Tamil, out of 25.

Marks in English 10 25 13 25 22 11 12 25 21 20

Marks in Tamil 12 22 16 15 18 18 17 23 24 17

6. Eight debaters were awarded marks by two judges. The marks, out of 50, are shown below.

Debater 1 2 3 4 5 6 7

Judge 1 40 34 28 30 44 38 31

Judge 2 32 39 26 30 38 34 28

When the 8th debater was completing his speech, only judge 2 was present, and he awarded him 36 marks. How many marks would judge 1 have awarded the 8th debater?

(Hint: Frame a regression equation for judge 1 on judge 2 using the marks of the seven debaters and then predict the marks by judge 1 for the 8th debater.)

7. The maintenance costs of cars are recorded against their age. Based on the given data, can you forecast the maintenance cost for a ten-year-old car?

Age of car (Years) 2 4 6 8

Maintenance cost (In thousands of rupees) 10 20 25 30


5.7 TIME SERIES ANALYSIS

5.7.1 Introduction

Forecasting, or predicting, is an essential tool in any decision-making process. In the previous section we tried to forecast or predict one variable based on another using a linear regression equation. But what happens if we have recorded our sales over the last ten years but have no records of the advertising expenditure? Using the sales pattern over the last ten years, we can predict future sales based on a line of best fit. Such applications are known as time series analysis. Its uses vary from determining inventory requirements for a local shoe store to estimating the annual sales of video games. Time-series analysis is one quantitative method we use to determine patterns in data collected over time.

5.7.2 Learning Objectives

After reading this unit, you will be able to:

Appreciate the role of Time series.

List and describe the qualitative and quantitative methods of forecasting.

Forecast using time series models.

Measure the forecast error to assess the accuracy of the models.

5.7.3 Variation In Time Series

We use the term time series to refer to any group of statistical information accumulated at regular intervals over a period of time. There are four kinds of changes, variations or components involved in time-series analysis. They are

1. Secular trend

2. Cyclical fluctuation

3. Seasonal variation

4. Irregular variation

With the first type of change, secular trend, the value of the variable tends to increase or decrease over a long period of time. The steady increase in the cost of living recorded by the Consumer Price Index is an example of secular trend. From one year to the next, the cost of living varies a great deal, but if we examine a long-term period, we see that the trend is toward a steady increase.

The second type of variation seen in a time series is cyclical fluctuation. The most common example of cyclical fluctuation is the business cycle. Over time, there are years when the business cycle hits a peak above the trend line. At other times, business activity is likely to slump, hitting a low point below the trend line. The time between hitting peaks or falling to low points is at least 1 year, and it can be as many as 15 or 20 years. Figure 5.1 shows the secular trends and cyclical variations.


Figure 5.1 Secular and Cyclical Variations (x-axis: Year, 1 to 12; y-axis: Units Sold, in lakhs; series: Secular Trend and Cyclical Variation)

The third kind of change in time-series data is seasonal variation. As we might expect from the name, seasonal variation involves patterns of change within a year that tend to be repeated from year to year. For example, a physician can expect a substantial increase in the number of flu cases every winter and of poison ivy every summer. Because these are regular patterns, they are useful in forecasting the future.

Irregular variation is the fourth type of change in time-series analysis. In many situations, the value of a variable may be completely unpredictable, changing in a random manner. Irregular variations describe such movements. The effects on gasoline prices in the United States of the Middle East conflict in 1973, the Iranian situation in 1979-1981, the collapse of OPEC in 1986 and the Iraqi situation in 1990, as well as the tsunami disaster in 2004 and the earthquake in Gujarat, are occasions that resulted in irregular variations in trade. Figure 5.2 shows the seasonal and irregular variations.

Figure 5.2 Seasonal and Irregular Variations (x-axis: Months, 1 to 23; y-axis: Cases reported; series: Seasonal Variations - Seasonal Flu, and Irregular Variations - New infections)


Thus far, we have referred to a time series as exhibiting one or another of these four types of variation. In most instances, however, a time series will contain a mixture of these components. Thus, we can describe the overall variation in a single time series in terms of these four different kinds of variation. In the following sections, we will examine the components and the ways in which we measure each.

5.7.4 Trend Analysis

Among the four components of a time series, secular trend represents the long-term direction of the series. One way to describe the trend component is to fit a line visually to a set of points on a graph. Any given graph, however, is subject to slightly different interpretations by different individuals. We can also fit a trend by the method of least squares, which we examined earlier. In our discussion, we will concentrate on the method of least squares because visually fitting a line to a time series is not a completely dependable process.

5.7.4.1 Reasons For Studying Trends

There are three reasons why it is useful to study secular trends:

1. The study of secular trends allows us to describe a historical pattern. There are many instances when we can use a past trend to evaluate the success of a previous policy. For example, a university may evaluate the effectiveness of a recruiting program by examining its past enrollment trends.

2. Studying secular trends permits us to project past patterns, or trends, into the future. Knowledge of the past can tell us a great deal about the future. Examining the growth rate of the world’s population, for example, can help us estimate the population for some future time.

3. In many situations, studying the secular trend of a time series allows us to eliminate the trend component from the series. This makes it easier for us to study the other three components of the time series. If we want to determine the seasonal variation in ski sales, for example, eliminating the trend component gives us a more accurate idea of the seasonal component.

Illustration 5.5:

A start-up company making tractors has recorded the number of tractors it has sold since 1997. The data are shown below.

(a) Find the linear equation that describes the trend in the number of tractors sold

(b) Estimate their sales of tractors for the year 2008.

Year 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006

Tractors Sold 42 50 61 75 92 111 120 127 140 138


Solution:

Table 5.8

| Year (X) | Coded Time x | 2x | Tractors Sold Y | (2x)Y | (2x)² |
|---|---|---|---|---|---|
| 1997 | -4.5 | -9 | 42 | -378 | 81 |
| 1998 | -3.5 | -7 | 50 | -350 | 49 |
| 1999 | -2.5 | -5 | 61 | -305 | 25 |
| 2000 | -1.5 | -3 | 75 | -225 | 9 |
| 2001 | -0.5 | -1 | 92 | -92 | 1 |
| 2002 | 0.5 | 1 | 111 | 111 | 1 |
| 2003 | 1.5 | 3 | 120 | 360 | 9 |
| 2004 | 2.5 | 5 | 127 | 635 | 25 |
| 2005 | 3.5 | 7 | 140 | 980 | 49 |
| 2006 | 4.5 | 9 | 138 | 1242 | 81 |
| Σ Year = 20015, Mean Year = 2001.5 | | | ΣY = 956, Ȳ = 95.6 | 1,978 | 330 |

We shall use the same regression equation that we became familiar with in the previous section.

Y = a + bx     (7.7.1)

We will compute a = Mean of Y = Ȳ and b = ΣxY / Σx².

If you look carefully, you will notice that in the previous section we used the equation Y = a + bX as our linear regression equation. Is there a difference?

The ‘x’ we use in equation 7.7.1 is coded time. The first column in the solution table is the year. If you sum up the years and divide by the number of years, you get the mean year, 2001.5. Now ‘x’ is computed by subtracting the mean year (2001.5) from each year (X). Because the coded times then sum to zero (Σx = 0), the two normal equations reduce to a = Ȳ and b = ΣxY / Σx². Since we have an even number of years (10 years), the coded times are decimals. To eliminate the decimals we can multiply ‘x’ by 2 and get a new column of coded time, ‘2x’.

Note: If you have an odd number of data points, such as 7 years or 9 years, then you do not have to multiply ‘x’ by 2.


From this point on in this problem, ‘x’ means ‘2x’.

Therefore a = Mean of Y = Ȳ = 95.6 and

b = ΣxY / Σx² = 1978 / 330 = 5.994

Therefore our line of best fit is Y = 95.6 + 5.994x

If we want to forecast the sales for 2008, we need to code the year as 2008 - 2001.5 = 6.5 (since 2001.5 is the base year). We then multiply 6.5 by 2 (since we use 2x for our calculations) to get x = 13. Substituting x = 13 in our line of best fit, we can compute the sales for the year 2008 as

Y(2008) = 95.6 + 5.994 × 13 = 173.5, i.e. about 174 tractors.
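The coded-time procedure of this illustration can be sketched as a small routine. The helper names coded_times and fit_trend are hypothetical; the doubling of the coded time for an even number of periods follows the convention described in the solution.

```python
def coded_times(years):
    """Deviations from the mean year, doubled when the number of years is even."""
    mean_year = sum(years) / len(years)
    scale = 2 if len(years) % 2 == 0 else 1
    return [scale * (year - mean_year) for year in years], mean_year, scale


def fit_trend(years, values):
    """Least-squares trend line Y = a + bx on coded time x."""
    xs, mean_year, scale = coded_times(years)
    a = sum(values) / len(values)  # a is the mean of Y because the coded times sum to zero
    b = sum(x * y for x, y in zip(xs, values)) / sum(x * x for x in xs)

    def forecast(year):
        # Code the target year exactly as the historical years were coded.
        return a + b * scale * (year - mean_year)

    return a, b, forecast


# Data from Illustration 5.5
years = list(range(1997, 2007))
tractors = [42, 50, 61, 75, 92, 111, 120, 127, 140, 138]

a, b, forecast = fit_trend(years, tractors)
print(f"Y = {a:.1f} + {b:.3f} x")
print(f"Forecast for 2008: {forecast(2008):.1f}")
```

Keeping the coding inside forecast avoids the common slip of substituting an uncoded or undoubled year into the trend equation.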

5.7.5 Cyclical Fluctuations

Illustration 5.6:

Using a line of best fit, determine a profit trend line for a company that has recorded its profit in lakhs of rupees over the last twelve years.

Year 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005

Profit 5 5 6 6 7 9 10 11 13 15 17 19

Also compute the percentage trend and the relative cyclical residual.

Solution:

We will first try to get the line of best fit as we had done in the previous illustration.

Try to compute the following table on your own (following the previous illustration) and then verify the answers.

Table 5.9

| Year (X) | Coded Time x | 2x | Profit Y | (2x)Y | (2x)² |
|---|---|---|---|---|---|
| 1994 | -5.5 | -11 | 5 | -55 | 121 |
| 1995 | -4.5 | -9 | 5 | -45 | 81 |
| 1996 | -3.5 | -7 | 6 | -42 | 49 |
| 1997 | -2.5 | -5 | 6 | -30 | 25 |
| 1998 | -1.5 | -3 | 7 | -21 | 9 |
| 1999 | -0.5 | -1 | 9 | -9 | 1 |
| 2000 | 0.5 | 1 | 10 | 10 | 1 |
| 2001 | 1.5 | 3 | 11 | 33 | 9 |
| 2002 | 2.5 | 5 | 13 | 65 | 25 |
| 2003 | 3.5 | 7 | 15 | 105 | 49 |
| 2004 | 4.5 | 9 | 17 | 153 | 81 |
| 2005 | 5.5 | 11 | 19 | 209 | 121 |


We know that a = Mean of Y = Ȳ and b = ΣxY / Σx², and remember that since we have an even number of data points, ‘x’ means ‘2x’.

By our usual calculation we get a = 10.25 and b = 0.652.

Check your calculations against this answer.

Thus the equation is Y = 10.25 + 0.652x.

Using this equation we will compute the values of the forecasted profit (ŷ) by substituting the values of coded time ‘x’.

Please refer to the table below, which gives the actual profit and the forecasted profit using the above equation.

Table 5.10

| Year (X) | Profit Y | Forecasted Profit (ŷ) | Percentage Trend = (Y / ŷ) × 100 | Relative cyclical residual = {(Y - ŷ) / Y} × 100 |
|---|---|---|---|---|
| 1994 | 5 | 3 | 166.7 | 40 |
| 1995 | 5 | 4 | 125 | 20 |
| 1996 | 6 | 6 | 100 | 0 |
| 1997 | 6 | 7 | 86 | -17 |
| 1998 | 7 | 8 | 88 | -14 |
| 1999 | 9 | 10 | 90 | -11 |
| 2000 | 10 | 11 | 91 | -10 |
| 2001 | 11 | 12 | 92 | -9 |
| 2002 | 13 | 14 | 93 | -8 |
| 2003 | 15 | 15 | 100 | 0 |
| 2004 | 17 | 16 | 106 | 6 |
| 2005 | 19 | 17 | 112 | 11 |

Now we will fit a trend line with the forecasted values. A close look at figure 5.3 will tell us that the trend line fitted using the equation is a straight line increasing steadily. The same is evident if you observe the ŷ values in the table above. Although the profit values (Y) are also increasing, they fluctuate above and below the trend line. The profit line, which is initially above the trend line, cuts the trend line (predicted profit line) in the year ’96 and moves below it. It cuts the trend line again in the year 2003 and moves above it. If we extended the lines for a further period, a similar pattern would be exhibited and a cyclical fluctuation pattern would be observed.


Figure 5.3 Trend Line (x-axis: Year, 94 to 05; y-axis: Profit; series: Actual Profit and Predicted Profit)

Figure 5.4 Percentage trend and relative cyclical residual (chart title: Cyclical Trend; x-axis: Year, 94 to 05; y-axis: Percentage; series: Percentage Trend and Relative cyclical residual)

Figure 5.4 has been plotted using the percentage trend and the relative cyclical residual values computed in table 5.10.

Illustration 5.7:

The Western Natural Gas Company has supplied 18, 20, 21, 25, and 26 billion cubic feet of gas, respectively, for the years 1991 to 1995.

(a) Find the linear estimating equation that best describes these data.

(b) Calculate the percent of trend for these data.

(c) Calculate the relative cyclical residual for these data.

(d) In which year does the largest fluctuation from the trend occur?


Solution:

Table 5.11

| Year | x | Y | xY | x² | ŷ | (Y / ŷ) × 100 | {(Y - ŷ) / Y} × 100 |
|---|---|---|---|---|---|---|---|
| 1991 | -2 | 18 | -36 | 4 | 17.8 | 101.12 | 1.11 |
| 1992 | -1 | 20 | -20 | 1 | 19.9 | 100.50 | 0.50 |
| 1993 | 0 | 21 | 0 | 0 | 22.0 | 95.45 | -4.76 |
| 1994 | 1 | 25 | 25 | 1 | 24.1 | 103.73 | 3.60 |
| 1995 | 2 | 26 | 52 | 4 | 26.2 | 99.24 | -0.77 |
| Total | 0 | 110 | 21 | 10 | | | |

(a) a = Ȳ = 110 / 5 = 22 and b = ΣxY / Σx² = 21 / 10 = 2.1

We use the equation Y = a + bx, and hence obtain

Y = 22 + 2.1x, as we had done in the previous illustration.

We compute ŷ from the above equation. These are the predicted values of supply based on the forecasting equation. For example, to compute the forecasted value (ŷ) for the year 1991, take x = -2 and substitute it in the equation Y = 22 + 2.1x. You will get a value of 17.8. Do the same for all values of x to fill in the ŷ column. Y is the actual supply and ŷ is the predicted supply.

(b) The percentage trend is the ratio of the actual supply (Y) to the predicted supply (ŷ), multiplied by 100. This is shown in the seventh column.

(c) The last column shows the relative cyclical residual, which is computed as the difference between the actual supply and the predicted supply, divided by the actual supply and multiplied by 100.

(d) The largest fluctuation, -4.76, is seen in the year 1993.
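The steps of this illustration can be sketched in code as below. The variable names are illustrative. The residual follows the definition stated in the text, dividing by the actual value Y; many textbooks divide by the trend value ŷ instead, which changes the figures slightly but not the year of the largest fluctuation here.

```python
# Data from Illustration 5.7 (billion cubic feet of gas supplied)
years = [1991, 1992, 1993, 1994, 1995]
supply = [18, 20, 21, 25, 26]

n = len(years)
mean_year = sum(years) / n
xs = [year - mean_year for year in years]   # odd number of years, so no doubling

a = sum(supply) / n                          # a = mean of Y since the coded times sum to zero
b = sum(x * y for x, y in zip(xs, supply)) / sum(x * x for x in xs)
trend = [a + b * x for x in xs]              # predicted supply for each year

percent_of_trend = [100 * y / t for y, t in zip(supply, trend)]
# Assumption: residual expressed relative to the actual value Y, as defined in the text.
relative_residual = [100 * (y - t) / y for y, t in zip(supply, trend)]

largest = max(zip(years, relative_residual), key=lambda pair: abs(pair[1]))

for year, pct, res in zip(years, percent_of_trend, relative_residual):
    print(f"{year}: percent of trend = {pct:.2f}, relative cyclical residual = {res:.2f}")
print("Largest fluctuation from trend occurs in", largest[0])
```

To use the other convention, replace the divisor y with t in the relative_residual line.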

5.7.6 Seasonal Variation

Besides secular trend and cyclical variation, a time series also includes seasonal variation. A seasonal variation is defined as the repetitive and predictable movement around the trend line in one year or less. In order to detect seasonal variations, the time interval must be measured in small units, such as days, weeks, months, or quarters.


Seasonal variations are computed for the following purposes.

1. We can establish the pattern of past change. This gives us a way to compare two time intervals that would otherwise be too dissimilar. If a flight training school wants to know if a slump in business during December is normal, it can examine the seasonal pattern in previous years and find the information it needs.

2. It is useful to project past patterns into the future. In the case of long-range decisions, secular-trend analysis may be adequate. But for short-run decisions, the ability to predict seasonal fluctuations is often essential. Consider a wholesale food chain that wants to maintain a minimum adequate stock of all items. The ability to predict short-range patterns, such as the demand for turkeys at Thanksgiving, candy at Christmas, or peaches in the summer, is useful to the management of the chain.

3. Once we have established the seasonal pattern that exists, we can eliminate its effects from the time series. This adjustment allows us to calculate the cyclical variation that takes place each year. When we eliminate the effect of seasonal variation from a time series, we have deseasonalized the time series.

5.7.6.1 Ratio-To-Moving-Average Method

In order to measure seasonal variation, we typically use the ratio-to-moving-average method. This technique provides an index that describes the degree of seasonal variation. The index is based on a mean of 100, with the degree of seasonality measured by variations away from the base. Let us consider the following example.

Illustration 5.8:

A guest house on ECR has recorded the following quarterly occupancy data over a 5-year period:

Year    First quarter    Second quarter    Third quarter    Fourth quarter
1991    1861             2203              2415             1908
1992    1921             2343              2514             1986
1993    1834             2154              2098             1799
1994    1837             2025              2304             1965
1995    2073             2414              2339             1967

To optimize the occupancy rate, the management wants to understand the seasonal pattern of demand for rooms. We shall now help the guest house look for a seasonal pattern, if it exists, and use it to forecast demand for rooms.


The guest house wanted to establish the seasonal pattern of room demand by its clientele. The management wants to improve customer service and is considering several plans to employ personnel during peak periods to achieve this goal. The given problem contains the quarterly occupancy, that is, the average number of guests during each quarter of the last 5 years.

We will refer to the above problem to demonstrate the six steps required to compute a seasonal index:

1. The first step in computing a seasonal index is to calculate the 4-quarter moving total for the series.

2. In the second step, we compute the 4-quarter moving average by dividing each of the 4-quarter totals by 4.

3. In the third step, we center the 4-quarter moving average.

4. Next, we calculate the percentage of the actual value to the moving-average value for each quarter in the time series having a 4-quarter centered moving-average entry.

5. In the fifth step, we collect all the percentage of actual to moving-average values in column 7 of Table 5.12 and arrange them by quarter. For each quarter we then compute the modified mean, that is, the mean of the values that remain after the highest and the lowest are discarded. This is shown in Table 5.13.

6. The final step, demonstrated in Table 5.14, adjusts the modified means slightly so that the four seasonal indices total 400, that is, average 100.

Table 5.12 Calculating the 4-Quarter centered Moving Average

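Columns:
(1) Year
(2) Quarter
(3) Occupancy
(4) Step 1: 4-Quarter Moving Total
(5) Step 2: 4-Quarter Moving Average, (5) = (4) / 4
(6) Step 3: 4-Quarter Centered Moving Average
(7) Step 4: Percentage of Actual to Moving Average Values, (7) = (3) / (6) x 100

(1)     (2)    (3)      (4)      (5)         (6)          (7)
1991    I      1,861    ---      ---         ---          ---
1991    II     2,203    ---      ---         ---          ---
1991    III    2,415    8,387    2,096.75    2,104.250    114.8
1991    IV     1,908    8,447    2,111.75    2,129.250    89.6
1992    I      1,921    8,587    2,146.75    2,159.125    89.0
1992    II     2,343    8,686    2,171.50    2,181.250    107.4
1992    III    2,514    8,764    2,191.00    2,180.125    115.3
1992    IV     1,986    8,677    2,169.25    2,145.625    92.6
1993    I      1,834    8,488    2,122.00    2,070.000    88.6
1993    II     2,154    8,072    2,018.00    1,994.625    108.0
1993    III    2,098    7,885    1,971.25    1,971.625    106.4
1993    IV     1,799    7,888    1,972.00    1,955.875    92.0
1994    I      1,837    7,759    1,939.75    1,965.500    93.5
1994    II     2,025    7,965    1,991.25    2,012.000    100.6
1994    III    2,304    8,131    2,032.75    2,062.250    111.7
1994    IV     1,965    8,367    2,091.75    2,140.375    91.8
1995    I      2,073    8,756    2,189.00    2,193.375    94.5
1995    II     2,414    8,791    2,197.75    2,198.000    109.8
1995    III    2,339    8,793    2,198.25    ---          ---
1995    IV     1,967    ---      ---         ---          ---

Note: each moving total in column (4) covers four consecutive quarters and falls midway between the second and third of them; it is listed here on the row of the third quarter. Each centered moving average in column (6) is the mean of the moving average on its own row and the one on the next row.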
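The first four steps can be sketched in Python as follows. This is only an illustration, assuming the twenty quarterly occupancy figures are held in a plain list in time order; the variable names are ours. The 1992 fourth-quarter figure is taken as 1,986, the value used in Table 5.12.

```python
# Illustrative sketch of Steps 1 to 4 (variable names are assumptions).
occupancy = [
    1861, 2203, 2415, 1908,   # 1991, quarters I-IV
    1921, 2343, 2514, 1986,   # 1992
    1834, 2154, 2098, 1799,   # 1993
    1837, 2025, 2304, 1965,   # 1994
    2073, 2414, 2339, 1967,   # 1995
]
labels = [f"{year} {q}" for year in range(1991, 1996) for q in ("I", "II", "III", "IV")]

# Step 1: 4-quarter moving totals, one per window of four consecutive quarters.
moving_total = [sum(occupancy[i:i + 4]) for i in range(len(occupancy) - 3)]

# Step 2: 4-quarter moving averages.
moving_average = [total / 4 for total in moving_total]

# Step 3: center by averaging each pair of adjacent moving averages.
# The first centered value lines up with the third quarter of the series.
centered = [(m1 + m2) / 2 for m1, m2 in zip(moving_average, moving_average[1:])]

# Step 4: actual value as a percentage of the centered moving average.
percent_of_ma = [100 * occupancy[i + 2] / c for i, c in enumerate(centered)]

for i, (c, p) in enumerate(zip(centered, percent_of_ma)):
    print(f"{labels[i + 2]:9s} centered MA = {c:9.3f}   percent = {p:5.1f}")
```

The sixteen printed rows run from the third quarter of 1991 to the second quarter of 1995, matching columns (6) and (7) of Table 5.12 after rounding.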


Table 5.13 Demonstration of Step 5 in Computing a Seasonal Index

Year      Quarter I    Quarter II    Quarter III    Quarter IV
1991      ---          ---           114.8          89.6
1992      89.0         107.4         115.3          92.6
1993      88.6         108.0         106.4          92.0
1994      93.5         100.6         111.7          91.8
1995      94.5         109.8         ---            ---
Total*    182.5        215.4         226.5          183.8

* Total of the two values that remain after the highest and the lowest value for the quarter are discarded.

Modified mean:
Quarter I: 182.5 / 2 = 91.25
Quarter II: 215.4 / 2 = 107.70
Quarter III: 226.5 / 2 = 113.25
Quarter IV: 183.8 / 2 = 91.90
Total of indices = 404.1

Table 5.14 Demonstration of Step 6

Adjusting constant = 400 / 404.1 = 0.9899

Quarter    Unadjusted indices  x  Adjusting constant  =  Seasonal index
I          91.25               x  0.9899              =  90.3
II         107.70              x  0.9899              =  106.6
III        113.25              x  0.9899              =  112.1
IV         91.90               x  0.9899              =  91.0

Total of seasonal indices = 400.0
Mean of seasonal indices = 400 / 4 = 100.0

5.7.6.2 Uses of the Seasonal Index

The ratio-to-moving-average method just explained allows us to identify seasonal variation in a time series. The seasonal indices are used to remove the effects of seasonality from a time series. This is called deseasonalizing a time series. Before we can identify either the trend or the cyclical component of a time series, we divide each of the actual values in the series by the appropriate seasonal index (expressed as a fraction of 100). To demonstrate, we shall deseasonalize the values of the first four quarters in the given problem. In Table 5.15, we show the deseasonalizing process using the values for the seasonal indices from Table 5.14. Once the seasonal effect has been eliminated, the deseasonalized values that remain reflect only the trend, cyclical, and irregular components of the time series.
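Steps 5 and 6 and the deseasonalizing just described can be sketched together in Python. This is a minimal illustration, assuming the percentages of actual to moving average have already been grouped by quarter; the function name modified_mean and the dictionary names are ours. The sketch uses the unrounded adjusting constant 400 / 404.1, which gives the same indices as 0.9899 after rounding to one decimal place.

```python
# Illustrative sketch of Steps 5 and 6 and of deseasonalizing (names are assumptions).
ratios = {
    "I":   [89.0, 88.6, 93.5, 94.5],
    "II":  [107.4, 108.0, 100.6, 109.8],
    "III": [114.8, 115.3, 106.4, 111.7],
    "IV":  [89.6, 92.6, 92.0, 91.8],
}

def modified_mean(values):
    """Mean of the values left after dropping one highest and one lowest."""
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)

# Step 5: one modified mean (unadjusted index) per quarter.
unadjusted = {q: modified_mean(v) for q, v in ratios.items()}

# Step 6: scale the unadjusted indices so that they total 400 (mean 100).
adjusting_constant = 400 / sum(unadjusted.values())
seasonal_index = {q: round(m * adjusting_constant, 1) for q, m in unadjusted.items()}

# Deseasonalizing: divide each actual value by (seasonal index / 100).
actual_1991 = {"I": 1861, "II": 2203, "III": 2415, "IV": 1908}
deseasonalized = {q: actual_1991[q] / (seasonal_index[q] / 100) for q in actual_1991}

for q in actual_1991:
    print(f"Quarter {q:3s} index = {seasonal_index[q]:5.1f}   "
          f"deseasonalized occupancy = {deseasonalized[q]:.0f}")
```

Multiplying a deseasonalized figure by the same seasonal index divided by 100 reverses the adjustment and puts the seasonal effect back.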


Table 5.15 Demonstration of Deseasonalizing Data

Year (1)    Quarter (2)    Actual Occupancy (3)    Seasonal Index / 100 (4)    Deseasonalized Occupancy (5) = (3) ÷ (4)
1991        I              1,861                   90.3 / 100                  2,061
1991        II             2,203                   106.6 / 100                 2,067
1991        III            2,415                   112.1 / 100                 2,154
1991        IV             1,908                   91.0 / 100                  2,097

Once we have removed the seasonal variation, we can compute a deseasonalized trend line, which we can then project into the future. Suppose the guest house management in our example estimates from a deseasonalized trend line that the deseasonalized average occupancy for the fourth quarter of the next year will be 2,121. When this prediction has been obtained, management must then take the seasonality into account. To do this, it multiplies the deseasonalized predicted average occupancy of 2,121 by the fourth-quarter seasonal index (expressed as a fraction of 100), that is, 2,121 x 0.910, to obtain a seasonalized estimate of 1,930 rooms for the fourth-quarter average occupancy.

5.8 SUMMARY

This unit has provided a conceptual framework on various forecasting techniques with their strengths and limitations. Specifically, this unit focused on:

The need for forecasting.
Quantitative forecasting split into time series analysis.
Detailed coverage on time series analysis involving regression.
Trend projection using the least squares line.
Determining the cyclical variation.
Seasonal effects through seasonal indices using the ratio-to-moving-average method.


Exercises:

1) The number of faculty-owned personal computers at the University of Ohio increased dramatically between 1990 and 1995:

Year             1990    1991    1992    1993     1994     1995
Number of PCs    50      110     350     1,020    1,950    3,710

(a) Develop a linear estimating equation that best describes these data.

(b) Develop a second-degree estimating equation that best describes these data.

(c) Estimate the number of PCs that will be in use at the university in 1999, using both equations.

(d) If there are 8,000 faculty members at the university, which equation is the better predictor? Why?

2) Below are the figures of production (thousands) of a sugar factory:

Year                      1999    2001    2002    2003    2005
Production (thousands)    700     600     400     900     900

Required:

(a) Fit a straight line trend by the method of least squares and tabulate the values.

(b) Eliminate the trend using the additive model. What components of the time series are thus left over?

(c) Estimate the likely production for the year 2007.

3) Calculate seasonal indices by the ratio-to-moving-average method from the following data:

Year    Quarterly data
        I     II     III    IV
2000    8     16     24     32
2001    48    36     24     12
2002    48    16     32     64
2003    72    108    144    36
2004    56    28     84     112


