-
Introduction to Probability and Statistics
Nagarajan Krishnamurthy
Introduction to Business Statistics for EPGP 2015-16 batch
Indian Institute of Management Indore
Thanks to Prof. Arun Kumar and Prof. Ravindra Gokhale,
co-instructors of QT1, AY 2012-13
-
Part 1: Summarizing and Visualizing a Data Set
-
Types of Data
Quantitative Data: Data for which arithmetic operations make
sense. E.g.: Age, Salary, Length.
Categorical Data: Data obtained by placing individuals in
different categories. E.g.: Gender, States of a country.
-
Visualization
Quantitative Data: Histogram, Stem-Leaf plot, Box plot
Categorical Data: Pie Chart, Bar chart
*Discuss Cafe data (using Excel)
-
Interpreting a Histogram
Shape: symmetric, skewed; unimodal, bimodal, ...; leptokurtic,
platykurtic, mesokurtic
Center: mean, median
Spread: range, standard deviation, inter-quartile range
-
Measure of the Central Tendency of a Data Set
Mean: If we have a data set $x_1, \ldots, x_n$, then the mean of the
data set is $\frac{x_1 + \cdots + x_n}{n}$.
Notation: $\bar{x}$
-
Mean: Example
The mean of 0,5,1,1,3 is 2.
-
Measure of the Central Tendency of a Data Set
Median: Middle number in a sorted data set. When the number of
observations (sample size) is even, there are two middle numbers.
In that case, we take the average of the two middle numbers to
obtain the median.
Notation: $\tilde{x}$
-
Median: Example 1
For example, the median of 0,5,1,1,3 is 1 because 1 is the middle
number of the sorted data, i.e. 0,1,1,3,5.
-
Median: Example 2
The median of 3,2,5,6,4,5,3,5 is 4.5 because 4.5 is the average
of the two middle numbers (4 and 5) of the sorted data,
i.e. 2,3,3,4,5,5,5,6.
-
Measure of the Central Tendency of a Data Set
Mode: Observation in the data set with the largest frequency.
Note that a data set can have more than one mode.
-
Mode: Example
For example, the mode of 0,5,1,1,3 is 1, since 1 occurs twice and
every other value occurs once.
-
Effect of an Outlier
Calculate mean, median, and mode of 0,5,1,1,3,100.
mean=18.33, median=2, mode=1.
-
Effect of an Outlier
An outlier pulls the mean towards it but may not affect the median
and the mode.
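The effect can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the helper name `summarize` is only illustrative and is not part of the original slides.

```python
from statistics import mean, median, multimode

def summarize(data):
    """Return (mean rounded to 2 decimals, median, list of modes)."""
    return round(mean(data), 2), median(data), multimode(data)

without_outlier = [0, 5, 1, 1, 3]
with_outlier = without_outlier + [100]  # same data plus the outlier 100

# The mean jumps from 2 to about 18.33, while the median only moves
# from 1 to 2 and the mode stays at 1.
print(summarize(without_outlier))
print(summarize(with_outlier))
```

`multimode` returns a list because a data set can have more than one mode.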
-
Identifying Relation Between Mean and Median
from Histogram
Symmetric: mean ≈ median
Left skewed: Mean < Median < Mode (in general)
Right skewed: Mean > Median > Mode (in general)
-
Kurtosis
Leptokurtic
Platykurtic
Mesokurtic
-
Measure of the Spread of a Data Set
Range: max-min
Ex: 0,5,1,1,3; what is the range?
Range = 5 - 0 = 5.
-
Measure of the Spread of a Data Set
Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
-
Variance and Standard Deviation: Example
What are the variance and the standard deviation of 3,3,3,3,3?
Ans. variance = 0, standard deviation = 0
What are the variance and the standard deviation of 1,2,3,4,5?
Ans. variance = 2.5, standard deviation = 1.58
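These answers follow directly from the definitions with the n - 1 denominator. A small sketch that spells out the computation (the function names are illustrative; `statistics.variance` and `statistics.stdev` in the standard library use the same n - 1 convention):

```python
from math import sqrt

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

def sample_sd(data):
    """Square root of the sample variance."""
    return sqrt(sample_variance(data))

for data in ([3, 3, 3, 3, 3], [1, 2, 3, 4, 5]):
    print(data, sample_variance(data), round(sample_sd(data), 2))
```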
-
Standard Deviation
Standard deviation is always greater than or equal to zero.
-
Does Standard Deviation Get Affected by Outliers?
What is the standard deviation for the data 3,3,3,3,100?
Ans. 43.38
-
Is Standard Deviation Always a Good Measure of
the Spread of a Data Set?
Not a good measure when data is skewed or has outliers.
-
Quartiles
First quartile: 25th percentile
Notation: Q1
-
Quartiles
Third quartile: 75th percentile
Notation: Q3
-
Exercise
Find the first and third quartiles of
8,7,1,4,6,6,4,5,7,6,3,0.
Ans. The sorted data is 0,1,3,4,4,5,6,6,6,7,7,8. The median of
the lower half of the data (0,1,3,4,4,5) is 3.5 (Q1) and the
median of the upper half (6,6,6,7,7,8) is 6.5 (Q3).
-
Quartiles
Median is the second quartile (Q2).
-
Measure of the Spread of a Data Set
Inter-Quartile Range (IQR): Q3 - Q1
*IQR is a robust measure of spread. It is not affected much by
skewness or outliers.
-
Exercise
Find IQR of 8,7,1,4,6,6,4,5,7,6,3,0.
IQR = Q3 - Q1 = 6.5 - 3.5 = 3.
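The quartiles used here are the medians of the lower and upper halves of the sorted data. A minimal sketch of that method follows; the function name `quartiles` is illustrative, and dropping the middle value when the sample size is odd is an assumption, since the slides only show an even-sized example. Library routines (for example `statistics.quantiles`) interpolate differently and can return slightly different values.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    s = sorted(data)
    half = len(s) // 2
    if half == 0:
        raise ValueError("need at least two observations")
    # Assumption: for an odd sample size the middle value belongs to neither half.
    lower, upper = s[:half], s[len(s) - half:]
    return median(lower), median(upper)

data = [8, 7, 1, 4, 6, 6, 4, 5, 7, 6, 3, 0]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # Q1, Q3 and the IQR
```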
-
Five Number Summary
Minimum
First quartile
Median
Third quartile
Maximum
-
Boxplot
*We will create a box plot for the Cafe data set.
-
Interpreting a Box Plot
Shape:
Outliers: Any observation outside the range
[Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] is considered an outlier
(informal rule).
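As an illustration of this informal rule, the sketch below applies it to the earlier data set 0,5,1,1,3,100. It assumes quartiles are computed as medians of the lower and upper halves of the sorted data; the function names are hypothetical, and other quartile conventions can shift the fences slightly.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    s = sorted(data)
    half = len(s) // 2
    if half == 0:
        raise ValueError("need at least two observations")
    return median(s[:half]), median(s[len(s) - half:])

def iqr_outliers(data):
    """Return the fences and the observations lying outside them."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (low, high), [x for x in data if x < low or x > high]

fences, outliers = iqr_outliers([0, 5, 1, 1, 3, 100])
print(fences, outliers)  # only 100 falls outside the fences
```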
-
Why Do We Need Box Plot?
To compare two or more data sets.
Visualization of summary statistics.
-
Categorical Data Visualization
*Bar Chart
*Pie Chart
Show billionaires data.
-
Part 2: Introduction to Probability
-
Describing Shape of a Bar Graph
Proportion of observations in a particular category.
-
Describing Shape of a Histogram
Proportion of observations in a particular class interval.
-
Probability
Proportion → sample
Probability → population
-
Example
Workforce distribution in the United States.
Industry                              Probability
Agriculture                           0.130
Construction                          0.147
Finance, Insurance, Real Estate       0.059
Manufacturing                         0.042
Mining                                0.002
Services                              0.419
Trade                                 0.159
Transportation, Public Utilities      0.042
-
Sample Space
Def: Set of all possible outcomes.
E.g.: Ω = {Agriculture, Construction, ..., Services, Trade,
Transportation and Public Utilities}
-
Simple Events
Simple event: An event in the finest partition of the sample
space.
Example: ω1 = Agriculture, ω2 = Construction.
-
Event
Def: Any subset of the sample space
E.g.: {Agriculture, Construction}
-
Exercise
A bowl contains three red and two yellow balls. Two balls are
randomly selected and their colors recorded. Use a tree diagram
to list the 20 simple events in the experiment, keeping in mind
the order in which the balls are drawn.
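The count of 20 can be checked by enumeration. This sketch assumes the five balls are distinguishable and drawn without replacement; the labels R1, R2, R3, Y1, Y2 are hypothetical names introduced only for the illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical labels: three red balls and two yellow balls.
balls = ["R1", "R2", "R3", "Y1", "Y2"]

# Ordered pairs of distinct balls: 5 choices for the first draw, 4 for the second.
simple_events = list(permutations(balls, 2))

# Group the simple events by the recorded colors (first letter of each label).
color_counts = Counter((first[0], second[0]) for first, second in simple_events)

print(len(simple_events))
print(dict(color_counts))
```

Each branch of the tree diagram corresponds to one ordered pair in `simple_events`.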
-
Other Approaches for Calculating Probabilities
Classical Approach: Assuming all outcomes to be equally likely,
the probability of an event is the number of favourable outcomes
divided by the total number of outcomes. E.g.: Rolling a die.
Subjective Approach: Assigning a probability to an event based on
one's experience.
-
Probability
P(Agriculture) = 0.13
P(Either Agriculture or Construction or both)
= P(Agriculture ∪ Construction) = 0.13 + 0.147 = 0.277.
P(Agriculture and Construction)
= P(Agriculture ∩ Construction) = 0.
P(Not in Agriculture) = P(Agriculture^c) = 1 - 0.13 = 0.87.
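These calculations can be reproduced by treating each industry as a simple event and an event as a set of industries. A minimal sketch using the workforce probabilities from the example; the names `workforce` and `prob` are illustrative.

```python
from math import isclose

# Workforce distribution from the example (one simple event per industry).
workforce = {
    "Agriculture": 0.130,
    "Construction": 0.147,
    "Finance, Insurance, Real Estate": 0.059,
    "Manufacturing": 0.042,
    "Mining": 0.002,
    "Services": 0.419,
    "Trade": 0.159,
    "Transportation, Public Utilities": 0.042,
}

def prob(event):
    """Probability of an event, given as a set of industries."""
    return sum(workforce[industry] for industry in event)

agriculture = {"Agriculture"}
construction = {"Construction"}

p_union = prob(agriculture | construction)         # Agriculture or Construction
p_intersection = prob(agriculture & construction)  # empty set, so 0
p_complement = prob(set(workforce) - agriculture)  # not in Agriculture

print(round(p_union, 3), p_intersection, round(p_complement, 3))
print(isclose(prob(set(workforce)), 1.0))  # the probabilities sum to 1
```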
-
Compound Events
If A and B are two events then
Union event is A ∪ B
Intersection event is A ∩ B
Complement event is A^c
-
Venn Diagram Representation
[Figure: Venn diagrams within the sample space S showing disjoint
events A and B; the intersection A ∩ B; the union A ∪ B; and
mutually exclusive and exhaustive events A, B, C, and D.]
-
Probability Rules
1. P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
2. P(A^c) = 1 - P(A)
-
Mutually Exclusive
Def: Two events are mutually exclusive if they do not have any
common outcome.
E.g.: Agriculture and Construction are mutually exclusive
events.
-
Mutually Exclusive
A and B are mutually exclusive if P(A ∩ B) = 0.
This implies that for mutually exclusive events A and B,
P(A ∪ B) = P(A) + P(B).
-
Pizza Venn Diagram
-
What is the sample space?
Sample space = {Tomato only, Fish only, Mushroom-Tomato,
Mushroom-Tomato-Fish, Mushroom-Fish, No toppings}.
-
Probability of the events in the sample space
P(Tomato only) = 2/8 = 1/4; P(Fish only) = 1/8.
P(Mushroom-Tomato) = 2/8 = 1/4; P(Mushroom-Tomato-Fish) = 1/8.
P(Mushroom-Fish) = 1/8; P(No toppings) = 1/8.
-
Union Rule
What is the probability that your slice will have tomato or
mushroom?
Ans. 6/8=3/4
-
Intersection Rule
What is the probability that your slice will have tomato and
mushroom?
Ans. 3/8
-
Complement Rule
What is the probability that your slice will not have
tomato?
Ans. 3/8
-
Conditional Probability
You have pulled out a slice of pizza that has tomato on it. What
is the probability that your slice will have mushrooms?
Ans. 3/5.
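The pizza answers can be recovered by counting slices. The sketch below assumes a pizza of eight equally likely slices whose toppings match the probabilities listed earlier (two tomato only, one fish only, two mushroom-tomato, and so on); the helper `prob` is illustrative.

```python
from fractions import Fraction

# One set of toppings per slice; counts follow the listed probabilities (in eighths).
slices = (
    [{"tomato"}] * 2
    + [{"fish"}]
    + [{"mushroom", "tomato"}] * 2
    + [{"mushroom", "tomato", "fish"}]
    + [{"mushroom", "fish"}]
    + [set()]
)

def prob(condition, given=lambda s: True):
    """P(condition | given), assuming every slice is equally likely."""
    pool = [s for s in slices if given(s)]
    return Fraction(sum(1 for s in pool if condition(s)), len(pool))

p_union = prob(lambda s: "tomato" in s or "mushroom" in s)
p_intersection = prob(lambda s: "tomato" in s and "mushroom" in s)
p_no_tomato = prob(lambda s: "tomato" not in s)
p_mushroom_given_tomato = prob(lambda s: "mushroom" in s,
                               given=lambda s: "tomato" in s)

print(p_union, p_intersection, p_no_tomato, p_mushroom_given_tomato)
```

Conditioning simply restricts the pool of slices to those with tomato before counting.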
-
Conditional Probability
Def: Probability of event A within event B. That is, the
probability that event A occurs given that B occurs.
Notation: P(A|B)
-
Multiplication Rule
P(A ∩ B) = P(A) P(B|A)
P(A ∩ B) = P(B) P(A|B)
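As a check against the pizza example, write T for "slice has tomato" and M for "slice has mushroom" (labels introduced here only for illustration). Since P(no tomato) = 3/8, P(T) = 5/8, and the conditional probability found earlier is P(M|T) = 3/5, so the rule reproduces the intersection answer:

```latex
P(T \cap M) = P(T)\,P(M \mid T) = \frac{5}{8} \cdot \frac{3}{5} = \frac{3}{8}
```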
-
Statistical Independence
Two events are said to be independent if the occurrence of one
has no effect on the chance of occurrence of the other.
-
Statistical Independence
Two events A and B are considered independent when
P(A|B) = P(A).
-
Exercise 1
Is gender related to whether someone voted in the last mayoral
election? Answer the question using the joint probabilities
given in the table below.
                                        Gender
Voted in the last mayoral election      Female    Male
Yes                                     0.25      0.18
No                                      0.33      0.24
-
Statistical Independence
If two events A and B are independent, then
P(A ∩ B) = P(A) P(B).
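One way to apply this product rule to the voting table of Exercise 1 is to compare a joint probability with the product of its marginals. This is a sketch only; the dictionary layout and the helper `marginal` are illustrative choices.

```python
# Joint probabilities from the Exercise 1 table, keyed by (gender, voted).
joint = {
    ("Female", "Yes"): 0.25,
    ("Male", "Yes"): 0.18,
    ("Female", "No"): 0.33,
    ("Male", "No"): 0.24,
}

def marginal(index, value):
    """Sum the joint probabilities whose key matches `value` at position `index`."""
    return sum(p for key, p in joint.items() if key[index] == value)

p_female = marginal(0, "Female")
p_voted = marginal(1, "Yes")
product = p_female * p_voted

# If gender and voting were exactly independent, the product of the marginals
# would equal the joint probability P(Female and Yes).
print(round(p_female, 2), round(p_voted, 2), round(product, 4),
      joint[("Female", "Yes")])
```

The product of the marginals comes out very close to the tabulated joint probability, which suggests the two are approximately independent, up to the rounding in the table.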
-
Law of Total Probability
Given a set of events $S_1, S_2, \ldots, S_k$ that are mutually
exclusive and exhaustive, and an event $A$, the probability of the
event $A$ can be expressed as
$P(A) = P(S_1)\,P(A \mid S_1) + P(S_2)\,P(A \mid S_2) + \cdots + P(S_k)\,P(A \mid S_k)$
-
Exercise 2
A business group owns three five-star hotels (say, A, B, and C)
in India. By studying the past behavior of the revenue obtained
from the three hotels month by month, it has been observed that
the probability of an increase in revenue of either B or C or
both of them is 0.5. If A's revenue increases in a given month,
the probability of an increase in B's revenue is 0.7, the
probability of an increase in C's revenue is 0.6, and the
probability of an increase in both B's and C's revenue is 0.5.
However, if A's revenue does not increase in a given month, the
probability of an increase in B's revenue is 0.2, the
probability of an increase in C's revenue is 0.3, and the
probability of an increase in both B's and C's revenue is 0.1.
What is the probability that the revenue of all three hotels,
A, B, and C, increases in a given month?
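One possible line of attack, offered as a sketch rather than the official solution: read the 0.5 as the unconditional probability P(B ∪ C), use the union rule inside each condition, solve the law of total probability for P(A), and finish with the multiplication rule. The variable names are illustrative.

```python
# Conditional probabilities stated in the exercise.
p_b_given_a, p_c_given_a, p_bc_given_a = 0.7, 0.6, 0.5        # A's revenue increases
p_b_given_not_a, p_c_given_not_a, p_bc_given_not_a = 0.2, 0.3, 0.1  # A's does not

# Assumption: 0.5 is the unconditional probability that B or C (or both) increases.
p_b_or_c = 0.5

# Union rule applied within each condition.
p_union_given_a = p_b_given_a + p_c_given_a - p_bc_given_a
p_union_given_not_a = p_b_given_not_a + p_c_given_not_a - p_bc_given_not_a

# Law of total probability:
#   p_b_or_c = p_a * p_union_given_a + (1 - p_a) * p_union_given_not_a
# solved for p_a.
p_a = (p_b_or_c - p_union_given_not_a) / (p_union_given_a - p_union_given_not_a)

# Multiplication rule: P(A and B and C) = P(A) * P(B and C | A).
p_all_three = p_a * p_bc_given_a

print(round(p_a, 3), round(p_all_three, 3))
```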
-
Exercise 3
You are a physician. You think it is quite likely that one of
your patients has strep throat, but you are not sure. You take
some swabs from the throat and send them to a lab for testing.
The test is (like nearly all lab tests) not perfect. If the
patient has strep throat, then 70% of the time the lab says YES
but 30% of the time it says NO. If the patient does not have
strep throat, then 90% of the time the lab says NO but 10% of
the time it says YES. You send five successive swabs to the lab,
from the same patient. You get back these results, in order:
YNYNY. What do you conclude?
These results are worthless.
It is likely that the patient does not have strep throat.
It is slightly more likely than not that the patient does have
strep throat.
It is very much more likely than not that the patient does have
strep throat.
-
Bayes' Rule
Let $S_1, S_2, \ldots, S_k$ represent k mutually exclusive and
exhaustive sub-populations with prior probabilities
$P(S_1), P(S_2), \ldots, P(S_k)$. If an event $A$ occurs, the
posterior probability of $S_i$ given $A$ is the conditional
probability
$P(S_i \mid A) = \frac{P(S_i)\,P(A \mid S_i)}{\sum_{j=1}^{k} P(S_j)\,P(A \mid S_j)}$
-
Exercise
Strep Throat Exercise
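Bayes' rule can be applied to the strep throat results as sketched below. Two assumptions are not in the exercise and are labeled as such: the five swab results are treated as independent given the patient's true condition, and the prior probabilities (0.5 and 0.9) are hypothetical stand-ins for "quite likely", since no number is given. The function name is illustrative.

```python
def posterior_strep(results, prior, p_yes_if_strep=0.7, p_yes_if_healthy=0.1):
    """Posterior P(strep | results) for a string of 'Y'/'N' lab results.

    Assumes the results are independent given the patient's true condition.
    """
    like_strep = 1.0
    like_healthy = 1.0
    for r in results:
        like_strep *= p_yes_if_strep if r == "Y" else 1 - p_yes_if_strep
        like_healthy *= p_yes_if_healthy if r == "Y" else 1 - p_yes_if_healthy
    numerator = prior * like_strep
    return numerator / (numerator + (1 - prior) * like_healthy)

# Hypothetical priors: the exercise does not state one.
for prior in (0.5, 0.9):
    print(prior, round(posterior_strep("YNYNY", prior), 3))
```

Under these assumptions the posterior is well above one half for either prior, because three YES results are far more probable for a patient with strep throat than for one without.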
-
Bibliography
An Introduction to Probability and Inductive Logic, by Ian
Hacking
Introduction to Probability and Statistics, by William
Mendenhall, Robert J. Beaver, and Barbara M. Beaver
Practice of Business Statistics, by David S. Moore, George P.
McCabe, William M. Duckworth, and Stanley L. Sclove
Bradley A. Warner, David Pendergrift, and Timothy Webb, "That
Was Venn, This Is Now", Journal of Statistics Education,
Volume 6, Number 1, 1998