3/21/2018
1
1
Stat Camp for the MBA Program
Daniel Solow
Lecture 1
Exploratory Data Analysis
2
What is Statistics?
Statistics is the art and science of collecting, analyzing, presenting and interpreting data, which are information you have or can obtain.
Business Statistics helps managers make more informed decisions.
• Descriptive Statistics: Describes properties of large data sets with a few summary numbers or graphs.
• Inferential Statistics: Helps you make decisions when you can obtain only a portion of the desired data.
3
Where Is Statistics Needed?
• Market survey/research
– A market survey says your market share is 19% with a margin of error of 3%. What does this mean?
• Manpower planning
– A bank wants to know how many tellers it should have during the busiest time on a given day.
• Quality control
– A machine is set to produce parts with a length of 2 inches. A part just produced has a length of 2.1 inches. Should you stop production and reset the machine?
4
Where Is Statistics Needed?
• Forecasting
– How much in sales can I expect next quarter?
• Premiums and Warranties
– What should the insurance premium be for a particular class of customers?
– You have just introduced a new automobile tire in the market. How many miles of warranty should you offer on this product?
• Fun and Games
– I bet that "this class has at least two persons with the same birthday (day and month)". Should you take this bet?
5
Inferential Statistics
Example 1: Suppose you want to know the average length of iron bars produced by your machine.
In such situations, there are a large number of items you are interested in, which is called the population. Every item in the population has a number of interest. You want to know the value of one number associated with the whole population, called the parameter.
– Population: All iron bars produced on that machine.
– Number of interest for each item: Length of the bar.
– Parameter: µ = average length of all iron bars.
6
Inferential Statistics
• Example 2: You want to know your "market share" (the fraction of customers that purchase your product).
– Population: All people that buy this product.
– Number Associated with Each Item in the Population: 1, if that person buys your product; 0, if that person does not buy your product.
– Parameter: the fraction of the population that buys your product.
7
Inferential Statistics
• In general, you can never know the value of the parameter of a population (why?).
– Because there are too many items in the population.
• In such cases, you should compute your best estimate (statistic) from a "manageable" subset of data (sample) collected randomly from the population.
[Diagram: a random sample is drawn from the population, whose parameter is unknown; the statistic computed from the sample is the best estimate of the parameter.]
8
Inferential Statistics
Example 1 (Iron Bars):
– Collect a sample of n iron bars (iron bar i has a length x_i).
– Compute the following statistic (sample mean): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example 2 (Market Share):
– Collect a sample of n people from the population of people that buy the product (each person i has a value x_i of 1 or 0).
– Compute the following statistic (sample proportion): y / n, where y = x_1 + … + x_n = number in the sample who buy your product.
9
Data
• Data are information collected, summarized and analyzed for presentation and interpretation.
• Cross-Sectional: Data collected at the same point in time.
• Time Series: Data collected over several time periods.
• Example: The Data Files web site on the first page of these notes has the file shadow02.xls with data on certain stocks.
10
[Screenshot of shadow02.xls, with columns marked as Qualitative or Quantitative]
Exchange Classes: OTC, AMEX, NYSE
Mkt Cap Classes: 0-50, 50-100, 100-150, 150-200, 200-250
11
Data Sets
As shown on the previous slide,
• Elements: Entities on which data are collected (the 25 different companies in the shadow-stocks example).
• Variable: A characteristic of the elements you are interested in and whose value varies (Exchange, Ticker Symbol, and so on).
• Class: A group consisting of one or more values for a variable.
12
Types of Statistical Data
• Qualitative (non-numeric)
– Nominal: values cannot be compared in terms of order (color, stock exchange, and so on)
– Ordinal: values can be compared in terms of order (rank, quality level, satisfaction)
• Quantitative (numeric)
– Interval: difference between values is meaningful (birth year, customer arrival time)
– Ratio: ratio of two values is meaningful (income, age, height, inventory level)
13
Example: MBA SURVEY
Identify the Data Type
• What is your height in inches? RATIO
• What is your gender? NOMINAL
• Attitude toward this Course on 1 to 6 scale: 1 = seriously worried (strongly dreading this), 6 = enthused & confident (eager to start). ORDINAL
• Do you smoke? NOMINAL
• WWW purchases (in $) over past year. RATIO
14
Descriptive Statistics
• Descriptive statistics is the art of summarizing a data set using either:
– Graphical Methods (Charts)
– Numerical Methods
• All done with computer software packages.
• Used all the time in annual reports, news articles, research studies.
• Different for qualitative and quantitative data.
Frequency Distribution: A table listing the number of elements in each class.
16
Using SPSS for Frequency Table
(See the files UsingSPSS_Intro.ppt and UsingSPSS_DescriptiveStats.ppt)
To Open an EXCEL file:
• Click on file/open/data.
• Under Files of Type use .xls files.
17
SPSS Output
The relative frequency table shows the proportion (or fraction) of elements in each class. You can display both the frequency and relative frequency tables in graphical form for easy visualization.
18
Using SPSS for a Bar Graph
Bar Graph: A graph with the classes on the x-axis and the frequencies (or percentages) on the y-axis.
• Click on Graphs/Legacy Dialogs/Bar.
• Click on Simple then Define.
• Drag the var. to the Category axis and click either N of Cases or % of Cases.
19
SPSS Output
20
Using SPSS for a Pie Chart
Pie Chart: A circle having one "slice" for each class, with the size of each slice proportional to the relative frequency of that class.
• Click on Graphs/Legacy Dialogs/Pie.
• Click Define.
• Move the var. into the Slice By box and click % of Cases.
• Click OK.
21
SPSS Output
22
Summarizing Quantitative Data
• With quantitative data, the classes have to be determined by the statistician. Given the minimum and maximum data values:
– Determine the number of non-overlapping classes (usually 5 – 20).
• Too few classes: variation does not show.
• Too many classes: too much detail.
– The class widths and class limits are then determined from the number of classes.
[Diagram: the range from min to max divided into adjacent classes, each having a width, a lower limit and an upper limit.]
23
Graphical Methods for Summarizing Quantitative Data
• Tabular Summaries
– Frequency Distributions
• Number of items in each class
• Relative Frequency (percentage of items in each class)
• Cumulative (everything up to a certain value)
• Graphical Summaries
– Histograms (like a bar chart)
An invention on average takes 22.167 years to develop.
32
MEDIAN (splits data in half)
• MEDIAN = middle value when data values are sorted from low to high.
– At least 50% of values are at or below the median and at least 50% are at or above the median.
– If sample size (n) is even, the median is the mean of the two middle values.
• What is the median development time?
33
Example: Invention Development Time
Median = (16+18)/2 = 17
In Excel: MEDIAN(range)
34
Mean vs. Median
• The mean is the most commonly used measure of location.
• However the mean is affected by extremely large or small values.
• In those cases the median may be a more reliable measure of location.
35
Example: Salaries
• Mean = 65,400
• Median = 32,000
Employee Salary
John 30,000
Doe 32,000
Smith 32,000
Perry 33,000
Sweeney 200,000
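The salary example can be checked with a few lines of Python. This is only an illustrative sketch using the standard `statistics` module; the slides themselves use Excel and SPSS, and the "trimmed" comparison (dropping the one extreme salary) is an added assumption, not part of the original example.

```python
from statistics import mean, median

# Salaries from the table above.
salaries = {
    "John": 30_000,
    "Doe": 32_000,
    "Smith": 32_000,
    "Perry": 33_000,
    "Sweeney": 200_000,
}

values = list(salaries.values())
salary_mean = mean(values)      # pulled upward by the single 200,000 salary
salary_median = median(values)  # middle of the five sorted values

# Illustrative assumption: drop the extreme salary to see how each measure reacts.
trimmed = [v for v in values if v < 100_000]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)

print("mean:", salary_mean, "median:", salary_median)
print("without the extreme value -> mean:", trimmed_mean, "median:", trimmed_median)
```

The mean moves a great deal when the extreme value is removed, while the median does not move at all, which is the point of the Mean vs. Median slide.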
36
Example: Invention Development Time
Mean = 22.167
Median = 17
37
SYMMETRIC DATA
Mean = Median
[Figure: symmetric distribution with the mean and median at the center; 50% of the data on each side.]
38
RIGHT SKEWED DATA
Long Right Hand Tail
Mean > Median
[Figure: right-skewed distribution with the median to the left of the mean.]
39
LEFT SKEWED DATA
Long Left Hand Tail
Mean < Median
[Figure: left-skewed distribution with the mean to the left of the median.]
40
Percentiles
Think about your numerical data values lying on a line.
The pth percentile is a number such that:
• About p% of your data values are ≤ that number (at least p% are ≤), and
• About (100 − p)% of your data values are ≥ that number (at least (100 − p)% are ≥).
Example: The 90th percentile on the GMAT is a score such that about 90% of people's GMAT scores are ≤ that number and about 10% are ≥ that number.
41
Quartiles
• Q1 = First quartile = 25th percentile = a value so that about 25% of the elements are ≤ that value and about 75% are ≥ that value.
• Q2 = Second quartile = 50th percentile = a value so that about 50% of the elements are ≤ that value and about 50% are ≥ that value = the median.
• Q3 = Third quartile = 75th percentile = a value so that about 75% of the elements are ≤ that value and about 25% are ≥ that value.
42
Percentiles in EXCEL: (file salary.xls)
43
Percentiles in SPSS (File salary.xls)
Analyze; Descriptive Statistics; Frequencies; then move the desired variable to the Variable(s) box; then click on Statistics; then click Percentile(s) and type your desired percentiles and Add; then click Continue and OK.
44
MODE
• The mode of a variable is the value or category that occurs most often in the batch of data.
• A data set can have more than one mode (bimodal, trimodal).
45
Example: Invention Development Time
Modes: 7, 15, 16, 18
In Excel: MODE(range), which returns only one of these values.
46
Do It Yourself Example: Blood Problem
Suppose that the number of pints per day of whole blood used in transfusions at a hospital over the previous 11 days is: 25, 18, 61, 12, 18, 15, 20, 25, 17, 19, 28.
Use the file blood.xls and Excel to:
• Find and interpret the mean, median and mode(s).
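For readers without the file blood.xls, the same three measures can be sketched in Python. This is an illustrative alternative to the Excel functions, not the method the slides prescribe; it assumes Python 3.8 or later for `statistics.multimode`.

```python
from statistics import mean, median, multimode

# Pints of whole blood used per day over the previous 11 days (from the slide).
pints = [25, 18, 61, 12, 18, 15, 20, 25, 17, 19, 28]

blood_mean = mean(pints)                 # sum of the values divided by 11
blood_median = median(pints)             # 6th value once the 11 values are sorted
blood_modes = sorted(multimode(pints))   # every value tied for most frequent

print(f"mean = {blood_mean:.2f}")
print(f"median = {blood_median}")
print(f"mode(s) = {blood_modes}")
```

Unlike Excel's MODE, `multimode` returns every most-frequent value rather than only one of them.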
47
Is the Mean Enough?
• In the Blood Problem, an average of 23.45 pints of blood are used per day.
• Question: Does this mean you should have exactly 23.45 pints of blood available?
• Answer: No. Why not? Because the amount of blood you need varies, that is, there is variation in the blood data.
• Question: How much variation is there?
• Answer: What is needed is a numerical value to represent how much variation there is in the data.
• Example: Range = Largest Value – Smallest Value
48
Variance
• Variance is a number ≥ 0 that measures how close the data values are to the mean µ.
• Variance is generally a relative measure.
• More reliable measure of variation than the range.
• Uses all the data.
• There are two different formulas, depending on whether you are computing the population variance or sample variance (see the handout formulas.pdf).
• Consider the following example for managing the amount of blood at a hospital (file blood.xls).
[Figure: data clustered tightly around µ (Var. is small) versus data spread widely around µ (Var. is larger).]
49
Example: Blood Problem (blood.xls)
50
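Population Variance
• µ = population mean
• x_i = value of the ith item
• (x_i – µ) = deviation of ith item from µ
• (x_i – µ)^2 = square deviation of ith item
• Variance = average of the square deviations: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where N is the number of items in the population.
• In Excel: VAR.P(range)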
51
Sample Variance
• $\bar{x}$ = sample mean
• x_i = value of the ith item
• (x_i – $\bar{x}$) = deviation of ith item from $\bar{x}$
• (x_i – $\bar{x}$)^2 = square deviation
• Sample Variance = $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• In Excel: VAR.S(range)
52
Standard Deviation
• Square root of the variance.
• Expressed in the same units as the data.
• More intuitive measure of variability.
• Blood Problem
– Sample Variance = S^2 = 177.07
– Sample Standard Deviation = S = √177.07 = 13.31
• Under circumstances you will learn soon, the std. dev. has a useful interpretation.
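The Blood Problem figures above can be reproduced directly from the definitions. The sketch below is illustrative only: it computes the squared deviations by hand and then cross-checks against Python's `statistics.variance` and `statistics.pvariance`, which play the role that VAR.S and VAR.P play in Excel.

```python
from math import sqrt
from statistics import mean, pvariance, variance

# Blood Problem data: pints used per day over 11 days.
pints = [25, 18, 61, 12, 18, 15, 20, 25, 17, 19, 28]

n = len(pints)
x_bar = mean(pints)

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - x_bar) ** 2 for x in pints)

sample_var = sum_sq_dev / (n - 1)   # divide by n - 1 for a sample
pop_var = sum_sq_dev / n            # divide by n if the 11 days were the whole population
sample_sd = sqrt(sample_var)        # same units as the data (pints)

print(f"sample variance = {sample_var:.2f}")
print(f"population variance = {pop_var:.2f}")
print(f"sample standard deviation = {sample_sd:.2f}")

# Cross-check against the library implementations.
assert abs(sample_var - variance(pints)) < 1e-9
assert abs(pop_var - pvariance(pints)) < 1e-9
```

Note how much of the variance comes from the single day with 61 pints; like the mean, the variance is sensitive to extreme values.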
53
Using EXCEL and SPSS to Compute Descriptive Statistics
• Both EXCEL and SPSS can automatically compute all of the descriptive statistics.
• In EXCEL:
– Tools/Data Analysis/Descriptive Statistics
• In SPSS:
– Analyze/Descriptive Statistics/Frequencies
– Click on the "Statistics" box and select all of the descriptive statistics you want (including the percentiles).
• EXCEL and SPSS are now illustrated on the data in the file salary.xls.
54
Descriptive Statistics in Excel
To compute descriptive statistics in EXCEL, in the Data tab, use the Data-Analysis add-in and choose Descriptive Statistics:
55
EXCEL Salary Example
56
Descriptive Statistics in SPSS
To compute descriptive statistics in SPSS, use Analyze/Descriptive Statistics/Frequencies and then, at the bottom of the screen, click on Statistics and choose the statistics you want reported:
57
SPSS Salary Example
58
Relationship Between Two Variables
• So far you have seen ways to analyze information about a single variable.
• One is often interested in the relationship between two or more variables.
• Examples of relationships
– Advertising expenditures and sales.
– Company profits and stock price.
– Home size and sales price.
59
Example: Stereo Store
• File stereo.xls
• Is there any relationship between the number of commercials and the sales levels?
60
Scatter Diagrams in Excel
• In Excel, select the two columns of data; click on the Insert tab; then on the Scatter icon; then on the top left diagram.
• Number of commercials on the x-axis.
• Sales levels on the y-axis.
61
Scatter Diagrams in SPSS
• Plot of two variables on the same graph.
• In SPSS, choose Graphs/Legacy Dialogs/Scatter then choose Simple and click on Define.
• Number of commercials on the x-axis.
• Sales levels on the y-axis.
62
Covariance and Correlation
• The sample and population covariance of two variables X and Y are numbers whose signs have the following meaning:
– COV(X,Y) > 0 means that the two variables tend to move in the same direction: if one increases (decreases), then the other increases (decreases).
– COV(X,Y) < 0 means that the two variables tend to move in opposite directions: if one increases (decreases), then the other decreases (increases).
• The value of the covariance is hard to interpret, so the covariance is converted to a number between −1 and +1 called the correlation of X and Y that indicates how strongly X and Y are correlated.
63
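Covariance and Correlation
• For two variables X and Y for which you have n pairs of data in the form (x1, y1), …, (xn, yn), the covariance and correlation are computed by:
COV(X, Y):
– Population: $\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_X)(y_i - \mu_Y)$
– Sample: $s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
COR(X, Y):
– Population: $\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$
– Sample: $r_{XY} = \frac{s_{XY}}{s_X s_Y}$
Here σ_X, σ_Y are the population standard deviations and s_X, s_Y the sample standard deviations of X and Y.
Note: COVARIANCE.P and COVARIANCE.S in Excel compute the population and sample covariance. CORREL computes the correlation, which is the same number whether the data are treated as a sample or as a population.
64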
Cov. and Correlation in EXCEL
65
Cov and Correlation in SPSS
• In SPSS, choose Analyze/Correlate/Bivariate.
• On the next menu, click on Options.
• Select Cross-Product Deviations and Covariances.
• Click Continue and, on the previous menu, OK.
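As a rough cross-check of what Excel and SPSS report, covariance and correlation can also be computed from their definitions. The sketch below is illustrative: the five data pairs are made up for the example (they are not the contents of stereo.xls), and the helper names `covariance` and `correlation` are hypothetical.

```python
from math import sqrt

# Hypothetical data: number of commercials (x) and sales level (y).
# These values are invented for illustration; they are NOT from stereo.xls.
x = [1, 2, 3, 4, 5]
y = [3, 5, 4, 8, 10]

def covariance(xs, ys, sample=True):
    """Average product of deviations; divide by n - 1 (sample) or n (population)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / (n - 1 if sample else n)

def correlation(xs, ys, sample=True):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = sqrt(covariance(xs, xs, sample))  # the covariance of X with itself is its variance
    sd_y = sqrt(covariance(ys, ys, sample))
    return covariance(xs, ys, sample) / (sd_x * sd_y)

cov_s = covariance(x, y, sample=True)
cov_p = covariance(x, y, sample=False)
r_s = correlation(x, y, sample=True)
r_p = correlation(x, y, sample=False)

print(f"sample covariance = {cov_s:.2f}, population covariance = {cov_p:.2f}")
print(f"correlation (sample) = {r_s:.4f}, correlation (population) = {r_p:.4f}")
```

The two covariances differ, but the two correlations agree because the n or n − 1 divisor cancels in the ratio, which is why a single CORREL function suffices in Excel.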
66
Stat Camp for the MBA Program
Daniel Solow
Lecture 2
Probability
67
Motivation
• You often need to make decisions under uncertainty, that is, facing an unknown future.
• Examples:
– How many computers should I produce this month?
– What premium should I charge a class of customers for a particular type of insurance policy?
• The answers to such questions require knowledge of probability, that is, the study of the likelihood of certain events occurring.
68
Probability
• Probability is a number that measures the likelihood that an event will happen.
• Useful as an indicator of the uncertainty associated with an event.
• Scale from 0 to 1.
– Probability = 0: the event will certainly not happen.
– Probability = 1: the event will certainly happen.
– Probability = 0.5: the event is equally likely to happen or not happen.
69
Experiments and Outcomes
• Experiment: A situation in which an action could be repeated many times, each resulting in one of many possible outcomes or sample points. Exactly one of these outcomes will occur, but it is not known which. For example:
Experiment            Outcomes
Toss a coin           Head, Tail
Roll a die            1, 2, 3, 4, 5, 6
Sales Call            Sale, No sale
Dow Jones tomorrow    All positive numbers
70
Assigning Probabilities
• Assume an experiment has n possible outcomes E1, E2, …, En.
• The probability assigned to each outcome must be a number between 0 and 1:
0 ≤ P(Ei) ≤ 1
• The sum of the probabilities of all outcomes must equal 1:
P(E1) + … + P(En) = 1
71
Assigning Probabilities
• Depending on the situation, you can obtain the probabilities of the outcomes of an experiment from:
– Assumption that all outcomes are equally likely (classical method)
– Experience from past data (relative frequency method)
– Experience, intuition or personal judgment (subjective method)
72
Classical Method
• In many situations it is reasonable to assume that all n outcomes of an experiment are equally likely to occur.
• Then each outcome has probability equal to 1/n (why?).
• Examples:
– Toss a coin: P(H) = P(T) = 1/2.
– Roll a die: P(1) = P(2) = … = P(6) = 1/6.
73
Relative Frequency Method
• In some experiments, there are past data available, from which you can estimate the proportion of time each outcome has occurred if the experiment is repeated a large number of times.
• This proportion is used as an estimate of the probability of the outcome.
74
Example
• When asking about a person's attitude on a new law, the outcome could be: disagree (D), neutral (N), agree (A), or uninformed (U).
• One way to assign probabilities to these four outcomes is to use the results of a survey, such as the following:
– B = contest in an English-speaking city = {N, L, S, C}
– C = contest not in Europe = {N, T, B, S, C}
P(A) = P(T)+P(B) = 0.1 + 0.05 = 0.15.
P(B) = 0.2 + 0.1 + 0.15 +0.25 = 0.70.
P(C) = 0.2 + 0.1 + 0.05 + 0.15 + 0.25 = 0.75.
85
Example: Playing Cards
• P(any outcome) = 1/52.
– A = draw a king = {KS, KC, KD, KH}
– B = draw a red two = {2H, 2D}
– C = draw a club face = {JC, QC, KC}
P(A) = 4/52 = 1/13 = 0.077.
P(B) = 2/52 = 1/26 = 0.038.
P(C) = 3/52 = 0.058.
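The three card probabilities above follow from counting outcomes under the classical method. A minimal sketch, assuming a standard 52-card deck represented as (rank, suit) pairs; the helper name `prob` is hypothetical:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["S", "H", "D", "C"]  # spades, hearts, diamonds, clubs

# The experiment: draw one card; all 52 outcomes are equally likely.
deck = list(product(ranks, suits))

def prob(event):
    """P(event) = (number of outcomes in the event) / (number of outcomes)."""
    favorable = [card for card in deck if event(card)]
    return Fraction(len(favorable), len(deck))

p_king = prob(lambda card: card[0] == "K")
p_red_two = prob(lambda card: card[0] == "2" and card[1] in ("H", "D"))
p_club_face = prob(lambda card: card[0] in ("J", "Q", "K") and card[1] == "C")

for name, p in [("king", p_king), ("red two", p_red_two), ("club face", p_club_face)]:
    print(f"P({name}) = {p} = {float(p):.3f}")
```

Using `Fraction` keeps the answers exact, so they can be compared directly with 4/52, 2/52 and 3/52.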
86
Complement of an Event
• The complement of an event A is the event that A does not happen and thus contains all outcomes that are not contained in A.
• The complement of A is written as Ac.
• If A happens, Ac does not happen and vice versa.
• Complement Law: P(A) = 1 – P(Ac).
• Note: If the event A you are interested in has many outcomes but Ac does not, then, to compute P(A), it is easier to find P(Ac) and then use P(A) = 1 – P(Ac).
87
Examples
• Example 1: Suppose A is the event that the weekly sales exceed $2,000 and P(A) = 0.75.
• Then Ac is the event that the weekly sales do not exceed $2,000 and P(Ac) = 0.25.
• Example 2: When you pick a card at random, what is the probability that you do not pick an ace?
• Answer: Let A = not pick an ace, so Ac = pick an ace = {AC, AD, AH, AS} and so P(A) = 1 – P(Ac) = 1 – (4/52) = 48/52.
88
Intersection of Two Events
• If A and B are two events, you are often interested in the probability that both A and B occur simultaneously.
• The event A and B is called the intersection of A and B, written A ∩ B, and consists of outcomes that are in both A and B simultaneously.
• When you see the word AND think of ∩.
[Venn diagram: A ∩ B is the overlap of A and B.]
89
Intersection of Two Events
• Example: What is the probability of drawing a red king?
– A = draw a king = {KS, KC, KH, KD}
– B = draw a red card = {all 26 hearts and diamonds}
P(drawing a red king) = P(drawing a red card and a king) = P(A ∩ B) = P({KD, KH}) = 2/52 = 1/26.
90
Mutually Exclusive Events
• If A ∩ B contains no outcomes, then the events A and B are called mutually exclusive.
• If A and B are mutually exclusive it is not possible that both A and B will happen.
• If A and B are mutually exclusive, then P(A ∩ B) = 0.
91
Example
• A = draw a king = {KS, KC, KH, KD}
• B = draw a queen = {QS, QC, QH, QD}
• A ∩ B contains no outcomes.
• If you draw a single card, it is not possible that the card will be both a king and a queen.
92
Union of Two Events
• If A and B are two events, you are often interested in the probability that either A or B (or both) occur.
• In math terms, this is called the union of A and B and is denoted by A ∪ B.
• Example: Roll a die.
– A = the outcome is less than 3 = {1, 2}.
– B = the outcome is even = {2, 4, 6}.
– A ∪ B = the outcome is less than 3 or even (or both) = {1, 2, 4, 6}.
93
The Addition Law
• The event A ∪ B consists of all the outcomes that belong to A or to B (or to both).
• When you see the word OR think of ∪.
[Venn diagram: A ∪ B is everything inside A or B.]
• The probability of A ∪ B can be computed by adding the probabilities of the individual outcomes in A ∪ B, OR with the following addition law (whichever is easier):
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
94
Union of Mutually Exclusive Events
• If A and B are mutually exclusive, you know that P(A ∩ B) = 0.
• Therefore, for mutually exclusive events,
P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = P(A) + P(B)
95
Example
• Roll a die
– A = the outcome is less than 3 = {1, 2}
– B = the outcome is even = {2, 4, 6}
– A ∩ B = {2}
– A ∪ B = {1, 2, 4, 6}
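The die example lends itself to a quick check of the addition law using Python sets, where `&` is intersection and `|` is union. This is a small illustrative sketch; the helper name `P` is hypothetical and assumes all six faces are equally likely.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # roll a fair die once
A = {1, 2}                      # the outcome is less than 3
B = {2, 4, 6}                   # the outcome is even

def P(event):
    """Classical method: equally likely outcomes."""
    return Fraction(len(event), len(outcomes))

p_union_direct = P(A | B)                   # count the outcomes in the union
p_union_law = P(A) + P(B) - P(A & B)        # addition law

print("A and B:", sorted(A & B), " A or B:", sorted(A | B))
print("P(A or B) by counting:", p_union_direct)
print("P(A or B) by addition law:", p_union_law)
```

Subtracting P(A ∩ B) corrects for the outcome 2, which would otherwise be counted twice.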
Conditional Probability
• Sometimes the probability of an event changes when you get information about another related event.
• Example: You roll a die and don't see the outcome.
– What is the probability of the outcome being a 2? 1/6
– If I tell you that the outcome was odd, what is the probability of a 2 now? 0
– If I tell you that the outcome was even, what is the probability of a 2 now? 1/3
• This is the conditional probability of an event given that another event happened.
99
Example: Police Force
• A police force consists of 960 men and 240 women officers. Last year 288 men and 36 women were promoted. Women officers complained of discrimination. The administration said that this was due to the low number of women officers in the force.
• Do you think the discrimination complaint is justified?
• Approach: Compare P(promotion given a man) and P(promotion given a woman).
Probability Model
Experiment: Select an officer at random.
Outcomes: Any one of 1200 officers, all equally likely.
100
Events and Their Probabilities
• Events:
– M = man officer, P(M) = 960/1200 = 0.8
– W = woman officer, P(W) = 240/1200 = 0.2
– A = promoted officer, P(A) = 324/1200 = 0.27
– Ac = non-promoted officer, P(Ac) = 0.73
– M ∩ A = male promoted officer, P(M ∩ A) = 288/1200 = 0.24

            Men   Women   Totals
Promoted    288      36      324
Not         672     204      876
Totals      960     240     1200
101
Conditional Probability
• Select an officer at random.
• P(selected officer is promoted) = P(A) = 324/1200 = 0.27

            Men   Women   Totals
Promoted    288      36      324
Not         672     204      876
Totals      960     240     1200

• Suppose the selected officer is a man. P(A | M) = 288/960 = 0.3
• Suppose the selected officer is a woman. P(A | W) = 36/240 = 0.15
102
Conclusion
• We have found that
– P(A|M) = 0.3: 30% of men officers were promoted.
– P(A|W) = 0.15: 15% of women officers were promoted.
• Is the complaint justified in your opinion? YES!
103
Computing Conditional Prob.
• The general formula for the conditional probability of an event A given that event B has occurred is:
P(A | B) = P(A ∩ B) / P(B)
• Police Example:
P(A | M) = P(A ∩ M) / P(M) = 0.24 / 0.8 = 0.3
This is consistent with our original calculation.
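Both routes to the conditional probability in the police example, counting within the group and applying the formula, can be sketched in a few lines. The dictionary layout and the names `counts`, `p_given_by_count` and `p_given_by_formula` are illustrative choices, not part of the slides; the counts themselves come from the promotion table.

```python
from fractions import Fraction

# Counts from the promotion table: (sex, promoted?) -> number of officers.
counts = {
    ("M", True): 288, ("M", False): 672,
    ("W", True): 36,  ("W", False): 204,
}
total = sum(counts.values())  # 1200 officers

def p_given_by_count(sex):
    """P(A | sex): restrict attention to that group and count promotions."""
    group_size = counts[(sex, True)] + counts[(sex, False)]
    return Fraction(counts[(sex, True)], group_size)

def p_given_by_formula(sex):
    """P(A | sex) = P(A and sex) / P(sex)."""
    p_joint = Fraction(counts[(sex, True)], total)
    p_sex = Fraction(counts[(sex, True)] + counts[(sex, False)], total)
    return p_joint / p_sex

p_A = Fraction(counts[("M", True)] + counts[("W", True)], total)

for sex in ("M", "W"):
    print(f"P(A | {sex}) = {float(p_given_by_count(sex)):.2f}")
print(f"P(A) = {float(p_A):.2f}")
```

Both functions return the same value because the 1/1200 factors cancel in the ratio P(A ∩ M) / P(M).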
104
Independent Events
• Two events A and B are independent if the probability of A does not change with information about B, that is, if P(A|B) = P(A) and vice versa P(B|A) = P(B).
• If this relation is not true, then the two events are dependent.
105
Example: Police Force
• You have found that
– P(A|M) = 0.3
– P(A) = 0.27
• Therefore, P(A|M) ≠ P(A), and the events A and M are dependent.
• This means that the probability of an officer being promoted is influenced by whether this officer is a man.
• This justifies the discrimination claim.
106
Multiplication Law
• Assume A and B are independent. Then P(A|B) = P(A).
• However, you also know that P(A|B) = P(A ∩ B) / P(B).
• Then you find that P(A ∩ B) / P(B) = P(A|B) = P(A), or
P(A ∩ B) = P(A) P(B).
107
Independence
• It is often reasonable to assume that two events are independent, because of their nature.
• For example, if you roll two dice:
– Does knowing what happened on the first roll tell you anything about what will happen on the second roll? No!
• Because there is no relationship between the first and the second roll, you may assume that A (an outcome of the first roll) and B (an outcome of the second roll) are independent.
• Then the multiplication law implies
P(A ∩ B) = P(A) P(B) = (1/6) (1/6) = 1/36.
108
Summary Example
• Bob and Jon live together and each has a car that works, respectively, 60% and 90% of the time.
• A potential employer has said she will hire them if they have one working car at least 95% of the time. Will they get the job?
• State what you are looking for as a prob. question.
Is P(at least one car is working) ≥ 0.95?
• Use probability theory to answer the question.
Events:
A = Bob's car works, P(A) = 0.6
B = Jon's car works, P(B) = 0.9
P(at least one car is working) = P(A OR B) = P(A ∪ B)
= P(A) + P(B) – P(A ∩ B) (Addition Law)
= 0.6 + 0.9 – 0.6 (0.9) (A & B are ind.)
= 0.96.
They get the job.
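The car calculation can be verified numerically, and the complement law from earlier in the lecture gives an independent check. The sketch below is illustrative; it assumes, as the slide does, that the two cars fail independently, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

p_bob = Fraction(6, 10)   # P(A): Bob's car works
p_jon = Fraction(9, 10)   # P(B): Jon's car works
required = Fraction(95, 100)

# Multiplication law (assumes A and B are independent, as on the slide).
p_both = p_bob * p_jon

# Addition law: P(A or B) = P(A) + P(B) - P(A and B).
p_at_least_one = p_bob + p_jon - p_both

# Cross-check with the complement law: "at least one works" is the
# complement of "both cars fail".
p_via_complement = 1 - (1 - p_bob) * (1 - p_jon)

hired = p_at_least_one >= required
print(f"P(at least one car works) = {float(p_at_least_one):.2f}")
print("They get the job." if hired else "They do not get the job.")
```

The complement route is often the easier one when "at least one" appears in the question, because the complementary event ("none") is a single intersection.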
109
Random Variables
• A random variable (rv) is a quantity of interest:
– Whose value is uncertain (by "uncertain" is meant that there are many (at least two) possible values and you do not know which value will occur).
– Whose value you cannot control.
• Example 1: Y = the outcome of flipping a coin.
• Random variables are used to help make decisions in a problem involving uncertainty.
• Example 2: Roll a die once. If the outcome is 1 or 2 you lose $5, if 3 you lose $1, if 4 you win $2, if 5 or 6 you win $4. Do you want to play this game?
• To decide, you need to identify appropriate rvs.
110
Identifying Random Variables and Their Possible Values
• The first two steps involved in working with a random variable are:
• Step 1: Identify the random variable.
– Use a symbol and write the meaning of the variable, including units.
Example: Let X = $ earned in the dice game.
• Step 2: List all possible values the r.v. can have.
Example: X = −5, −1, 2, or 4.
111
Types of Random Variables
• Discrete Random Variable: A r.v. whose possible values you can "count" (either finite or an infinite set of countable numbers, for example, 0,1,2,…).
Example Roll-and-Earn: X = −5, −1, 2, 4. Can you count these possible values? Yes!
• Continuous Random Variable: A r.v. whose value is any number (including decimals and fractions) in an interval or a collection of intervals (infinite uncountable number of values).
Example: X = liters of water I drink today. Poss. Val.: [0, 5]. Can you count these possible values? No!
112
Examples of Random Variables
• Number of heads in 50 tosses of a coin.
– All possible values: 0, 1, …, 50 (discrete)
• Number of customers who enter a store in a day.
– All possible values: 0, 1, 2, … (discrete)
• Number of cm of rain next month.
– All possible values: [0, 30]. (continuous)
• Time in minutes between two customers arriving at a bank.
– All possible values: [0, ∞). (continuous)
113
Examples of Random Variables
• Number of defective products in a shipment of 100.
– All possible values: 0, 1, 2, …, 100. (discrete)
• Quantity x of liquid inside a 12 oz can.
– All possible values: [0, 12]. (continuous)
• Percentage x of a project completed by the deadline.
– All possible values: [0, 100]. (continuous)
• $ sales in a retail store tomorrow.
– All possible values: $0.00 - $10000.00 (discrete)
114
Which of the Following are RVs?
• The temperature at noon yesterday. (No)
• The temperature at noon tomorrow. (Yes)
• The age of a person chosen at random in this class. (It depends on timing.)
– If you have already selected the person: (No: there is only one value for that person's age.)
– If you have not yet selected the person: (Yes) All possible values: The (finite) list of ages of everyone in this class. (discrete)
Time is critical!
115
Which of the Following are RVs?
• The average of a population. (No, µ only has one val.)
• The average of a sample of size 2. (Depends on time.)
– If you have already selected the sample: (No, there is only one value for that average.)
– If you have not yet selected the sample: (Yes) All possible values of the sample average: The (finite) list of averages of every group of size 2 in the population. (discrete, but …)
[Diagram: groups of size 2 labeled G1, G2, G3, …, with the average for each group labeled A1, A2, A3, …]
Warning: When a discrete rv has too many possible values, it might not be practical to work with that rv.
116
RVs, Populations, and Sampling
For any population, you can create the following two discrete random variables:
• Y = the value of an item that will be chosen randomly from the population.
• $\bar{X}$ = the average of a sample of size n before taking the sample.
Note: The average of the pop. is not a RV.
For any RV, you can create a sample of size n by observing and recording the value of the RV n separate times.
Question: What can you do when a quantity of interest, such as the average of a population, is unknown?
117
When Something is Unknown
• Ideal: Determine the value, however… if doing so requires too much time, effort, money, then…
• Next Best: Estimate the value, for example, by… building a model! For example:
– Model 1: Take a sample of size n (the model) and use the average from the sample as your best estimate of µ.
– Model 2: Think of µ as a discrete random variable (the model) with possible values: 20, 21, …, 30
– Model 3: Think of µ as a continuous random variable (the model) with possible values: [20, 30]
118
Identifying Random Variables
• INSURANCE PREMIUMS
What should the insurance premium be for a particular class of customers?
• Question: Is the annual premium a r.v.?
Answer: No, because you can control its value.
• Let C = the $/year to be claimed by this type of customer, with possible values: 0 – 100,000 (discrete)
119
Examples of Random Variables
• WARRANTIES
GoodTire has just introduced a new tire in the car market. How many miles of warranty should the company offer?
• Qn: Is the warranty mileage a r.v.?
• Ans: No, because you can control its value.
• M = the number of miles such a tire is expected to last, with possible values: [0, ∞). (continuous)
120
Examples of Random Variables
• PERSONNEL PLANNING
How many bank tellers should be working during the busiest time of the day?
• A = the total number of customers that arrive during that period, with possible values: 0, 1, 2, … (discrete)
• W = the number of minutes it takes a teller to serve a customer, with possible values: [0, 30]. (continuous)
121
Probability Distribution
• To "work" with a random variable, you must know that variable's probability distribution, which describes the probabilities of all the possible values of the random variable occurring.
• Note: Probability distributions are different for discrete RVs and continuous RVs.
122
Discrete Distributions
• For a discrete random variable X, the probability distribution is described by a probability density function that consists of:
– The list of the possible values and, for each one, the probability of that value occurring.
• Notationally, if t is a possible value for the rv X then the density function is written as follows:
f(t) = P(X = t) = the probability that the random variable takes the value t.
123
Example 1
• Toss a coin once.
• X = number of heads.
• Possible values of X:
t                 0     1
P(X = t) = f(t)   0.5   0.5
• A valid probability density function for a discrete random variable must satisfy the following two properties:
– 0 ≤ f(t) ≤ 1, for each value of t.
– Σ f(t) = 1, where the sum is over all possible values t.
124
Example 2
Toss a coin twice, and let X = number of heads.
• The probability density function can be shown in a table or graph:
t      0      1     2
f(t)   0.25   0.5   0.25
126
Example 3: Roll and Earn
• Find the probability density function of X, the amount earned in the following game: Roll a die once. If the outcome is 1 or 2 you lose $5, if 3 you lose $1, if 4 you win $2, if 5 or 6 you win $4.
Die outcome:             1 or 2   3     4     5 or 6
Possible values for X:   −5       −1    2     4
Probabilities:           2/6      1/6   1/6   2/6
127
Expected Value
• In addition to the density function, you also want to find the expected value (mean) of the discrete random variable, which is a measure of the central location of the value of the random variable.
• Computed as the sum of the products of the possible values t and the corresponding probabilities:
μ_X = E[X] = Σ t f(t)
• Note: The mean of a r.v. is different from the mean of a population and the mean of a sample.
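The expected-value formula can be checked on the Roll and Earn game from Example 3. The sketch below is only an illustration in Python (the slides themselves use Excel); the names `roll_and_earn` and `expected_value` are made up for this example.

```python
from fractions import Fraction

# Roll and Earn (Example 3): amount earned -> probability of earning it.
roll_and_earn = {
    -5: Fraction(2, 6),  # roll 1 or 2: lose $5
    -1: Fraction(1, 6),  # roll 3: lose $1
    2: Fraction(1, 6),   # roll 4: win $2
    4: Fraction(2, 6),   # roll 5 or 6: win $4
}

def expected_value(density):
    """Return E[X] = sum of t * f(t) over all possible values t."""
    # A valid density must have probabilities that add up to 1.
    assert sum(density.values()) == 1
    return sum(t * p for t, p in density.items())

mean_earned = expected_value(roll_and_earn)
print(mean_earned, float(mean_earned))
```

Exact fractions are used so that rounding does not hide the sign of the result: the expected amount earned is negative, meaning the game loses money on average.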
128
Why is the Expected Value Useful?
• Law of Large Numbers: If you observe the value of the random variable X a large number of times, the average of the observed values will be very close to the expected value of the random variable X.
• Let the rv X = be your random variable.
129
Example
• Toss a coin twice and let X = number of heads.
• The probability density function is: f(0) = 0.25, f(1) = 0.5, f(2) = 0.25.
• The expected value is: E(X) = (0)(0.25) + (1)(0.5) + (2)(0.25) = 1
• If the two tosses are repeated many times and the number of heads is recorded each time, the average number of heads per pair of tosses will be close to 1.
130
Computing Exp. Value in Excel
• Roll and Earn
This means that, if you play the game many times and record the results, on average, you will lose about $0.17 each time.
131
Example: Car Dealership
• Now that the new car models are available, a dealership has lowered the prices on last year’s models in order to clear its inventory. With prices slashed, a salesperson estimates the following probability distribution of X, the number of cars that person will sell next week.
• Find the expected value of X. What does it mean?
Probability Distribution
• Recall that a random variable is a quantity of interest whose value is uncertain and that you cannot control. To use a r.v. to help solve a decision problem:
• Step 1: Identify the random variable, say X.
• Step 2: List all possible values for X.
• Step 3: Determine if X is discrete or continuous.
• Step 4: Identify the density function of X.
• Step 5: Find E[X] = the expected value of X.
• Step 6: Find VAR[X] = the variance of X (or the STDEV[X]).
138
Example: Debbon Air Seat-Release Problem
• Debbon Air needs to make a decision about Flight 206 to Miami, which is fully booked except that…
• 3 seats are reserved for last-minute customers (who pay $475 per seat), but the airline does not know if anyone will buy those seats.
• If they release them now, they know they will be able to sell them all for $250 each.
• Debbon Air counts a $150 loss of goodwill for every last-minute customer turned away.
139
Debbon Air “Seat Release”
• Question: How many seats, if any, should Debbon Air release?
• Question: On what basis (that is, on what criterion) are you going to make the final decision?
• Answer: Based on profits.
• Approach: Find the expected profit when you release 0 seats, 1 seat, 2 seats, and 3 seats, and then choose the alternative that has the maximum expected profit.
140
Identifying Random Variables
• Question: Can you identify any r.v.s that can help you make this decision?
• Let the r.v. X = # of arriving last-minute customers
• Possible values for X:
• Probability distribution for X:
141
Identifying Random Variables
• Another random variable of interest is:
R = net revenue (revenue minus loss of goodwill)
• However, this revenue depends on the number of seats released, so, define
Ri = net revenue when i seats are released (i = 0, 1, 2, 3).
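The definition of Ri can be sketched as a small function of the number of seats released and the number of last-minute customers who arrive. This is a hedged illustration in Python, not part of the original slides: the function name `net_revenue` and the constant names are invented here, and the sketch assumes that a seat held back earns nothing if no last-minute customer buys it.

```python
SEATS_HELD = 3          # seats currently reserved for last-minute customers
LAST_MINUTE_FARE = 475  # $ paid by each last-minute customer who gets a seat
RELEASE_FARE = 250      # $ received for each seat released now
GOODWILL_LOSS = 150     # $ counted for each last-minute customer turned away

def net_revenue(released, arrivals):
    """Net revenue R_i when `released` seats are released and `arrivals` customers show up."""
    kept = SEATS_HELD - released
    served = min(arrivals, kept)       # customers who get one of the kept seats
    turned_away = arrivals - served    # customers for whom no seat is left
    return (released * RELEASE_FARE
            + served * LAST_MINUTE_FARE
            - turned_away * GOODWILL_LOSS)

# One row per number of seats released, one column per X = 0, 1, 2, 3 arrivals.
for i in range(4):
    print(i, [net_revenue(i, x) for x in range(4)])
```

Multiplying each row by the probability distribution of X and summing gives the expected net revenue for that alternative, which is the quantity the decision is based on.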
142
“Debbon Air” Seat Release
• What are the possible values for R3, that is, what are the possible revenues when all 3 seats are released?
• The answer depends on how many last-minute customers (X) arrive, so:
If X = 0: R3 =
If X = 1: R3 =
If X = 2: R3 =
If X = 3: R3 =
• How many seats should be released to maximize expected net revenue?
Two seats should be released.
145
A Binomial Experiment
• In many applications, you will perform a binomial experiment a number of times for which, each time, there are two possible outcomes:
– A “success” or
– A “failure”.
• Example: At a tire factory, you will examine a number of tires and, for each one, determine if
– There is a defect (“success”) or
– There is no defect (“failure”).
Note: A “success” does not have to be a “good” outcome.
What you are then interested in is the probability of having k successes out of, say, n trials.
Qn: What is the prob. that there are 0 defects (“successes”) in, say, a sample of n = 20 tires?
146
Binomial Experiment
In a binomial experiment, you must identify:
• What constitutes a trial, a “success” and a “failure.”
• p = P(success) = the probability of a success occurring in each trial (so, P(failure) = 1 – p).
• n = the number of independent trials (repetitions).
Then define the following r.v.:
X = number of successes out of n trials (discrete).
Possible Values: 0, 1, 2, …, n.
Density Function = Probabilities: get from EXCEL
147
Example 1
• Toss a fair coin 100 times.
• A “trial” is a flip and a “success” = a heads.
• p = probability of success = P(head) = 0.5.
• n = 100 independent coin tosses.
• X = number of heads follows the binomial distribution with n = 100 and p = 0.5.
Then, E(X) = np = (100)(0.5) = 50
SD(X) = √(np(1 – p)) = √((100)(0.5)(0.5)) = √25 = 5.
148
Example 2
• A student takes a multiple choice test with 25 questions. Each question has 4 choices. Assume the student does not know the answer to any question and just guesses.
• A “trial” is a question and a “success” is a correct answer.
• p = prob. of success = P(correct answer) = 0.25.
• n = 25 independent questions.
• X = number of correct answers follows the Binomial distribution with n = 25 and p = 0.25.
• Then, E(X) = (25)(0.25) = 6.25.
SD(X) = √((25)(0.25)(0.75)) = √4.6875 = 2.17.
149
Binomial Random Variables
• If n = 1, then the possible values for X are:
X = 1 if the outcome is a success
X = 0 if the outcome is a failure
Here, X is called a binomial random variable.
• The density function is: f(0) = 1 – p, f(1) = p.
• The expected value of X is E(X) = np = p.
• The standard deviation of X is SD(X) = √(p(1 – p)).
150
Binomial Probabilities in Excel
• The Excel BINOMDIST function provides two kinds of binomial probabilities.
• Suppose that the random variable X is Binomial with parameters (n, p).
• For k successes, where k is between 0 and n:
P(X = k) = BINOMDIST(k, n, p, FALSE)
P(X ≤ k) = BINOMDIST(k, n, p, TRUE)
151
Example 1: Bad Seafood
• Consumer Reports (Feb. 1992) found widespread contamination of seafood in supermarkets in NYC and Chicago.
• 40% of the swordfish pieces for sale had a level of mercury above the maximum allowed by the Food & Drug Administration (FDA).
• Suppose a random sample of 12 swordfish pieces is selected.
• What is the probability that exactly five of the pieces have mercury levels above the FDA maximum?
• What is the probability that at least 10 pieces are contaminated?
152
Answer
• Find P(exactly five of the pieces have mercury levels above the FDA maximum).
• “Trial” is choosing a piece of fish and a “success” = the piece is contaminated.
• p = prob. of success = P(contamination) = 0.4.
• n = 12 independent (why?) pieces.
• X = number of contaminated pieces follows the Binomial distribution with n = 12 and p = 0.4.
• From Excel:
P(X = 5) = BINOMDIST(5, 12, 0.4, FALSE) = 0.227
153
Answer (continued)
• Find P(at least ten pieces are contaminated)
• P(X ≥ 10) = P(X = 10 or 11 or 12) = P(X = 10) + P(X = 11) + P(X = 12)
• From Excel, you have each of these three probabilities from BINOMDIST(k, 12, 0.4, FALSE) with k = 10, 11, 12.
• From the Complement Law, you know that P(X ≥ 10) = 1 – P(X < 10) = 1 – P(X ≤ 9)
• From Excel, you have that P(X ≤ 9) = BINOMDIST(9, 12, 0.4, TRUE) = 0.9972.
• Then P(X ≥ 10) = 1 – 0.9972 = 0.0028.
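For readers without Excel, the two BINOMDIST calls in the seafood example can be reproduced from the binomial formula. This is a minimal sketch in Python; `binom_pmf` and `binom_cdf` are illustrative helper names, not Excel or library functions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); plays the role of BINOMDIST(k, n, p, FALSE)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); plays the role of BINOMDIST(k, n, p, TRUE)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

# Bad Seafood: n = 12 pieces, p = 0.4 chance that a piece is contaminated.
p_exactly_5 = binom_pmf(5, 12, 0.4)
p_at_least_10 = 1 - binom_cdf(9, 12, 0.4)  # Complement Law
print(round(p_exactly_5, 3), round(p_at_least_10, 4))
```

The complement form is used for "at least 10" because the cumulative function only gives areas "from 0 up to k".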
155
Example 2: Murder Trial
• As the lawyer for a client accused of murder, you are looking for ways to establish “reasonable doubt”. The prosecutor's case is based on the forensic evidence that a blood sample from the crime scene matches the DNA of your client. It is known that 2% of the time DNA tests are in error.
• Suppose your client is guilty. If six laboratories in the country are asked to perform a DNA test, what is the probability that at least one of them will make a mistake and conclude that your client is innocent?
156
Answer
• A “trial” is sending the DNA to a lab and a success = lab makes error (finds no match)
• p = prob. of success = P(error) = 0.02.
• n = 6 independent (why?) lab tests.
• X = number of labs that make an error follows the Binomial with n = 6 and p = 0.02.
• P(X ≥ 1) = 1 – P(X < 1) = 1 – P(X = 0)
• From Excel, P(X = 0) = 0.8858.
• P(X ≥ 1) = 1 – 0.8858 = 0.1142
• So there is an 11.42% probability that at least one lab will find no DNA match.
• Question: How many labs would you need to raise this probability to 25%?
157
Example 3: Multiple-Choice Quiz
• A multiple-choice quiz has 15 questions. Each question has five possible answers, of which only one is correct.
• What is the expected number of correct answers by sheer guesswork?
• What is the standard deviation of the correct answers by sheer guesswork?
• What is the probability that sheer guesswork will yield at least seven correct answers?
158
Answer
• A “trial” is answering a question and a success = a correct answer.
• p = Prob. of success = P(correct answer) = 1/5 = 0.2.
• n = 15 independent (why?) answers.
• X = number of correct answers follows a Binomial distribution with n = 15 and p = 0.2.
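The three quiz questions can be answered with the binomial formulas E(X) = np and SD(X) = √(np(1 – p)) plus a cumulative probability. The Python sketch below is an illustration only; the variable names are invented here and the slides would use BINOMDIST instead.

```python
from math import comb, sqrt

n, p = 15, 0.2  # 15 questions, 1-in-5 chance of guessing each one correctly

expected_correct = n * p             # E(X) = np
sd_correct = sqrt(n * p * (1 - p))   # SD(X) = sqrt(np(1 - p))

# P(X <= 6), then the Complement Law gives P(X >= 7).
p_at_most_6 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(7))
p_at_least_7 = 1 - p_at_most_6

print(expected_correct, round(sd_correct, 2), round(p_at_least_7, 4))
```

Sheer guesswork is expected to yield about 3 correct answers, and getting 7 or more right by guessing alone is unlikely.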
Covariance
• When working with two random variables, say X and Y, you are sometimes interested in the degree to which the values of X and Y are correlated, that is, as X increases, to what degree is it likely that Y increases (or decreases)?
• When X and Y are discrete RVs with n possible pairs of values, say (x1, y1), …, (xn, yn), and corresponding probabilities p1, …, pn, then the covariance of X and Y, written COV(X, Y) or σ_XY, is given by the following formula:
COV(X, Y) = Σ p_i (x_i – E[X]) (y_i – E[Y]), summed over i = 1, …, n.
161
Covariance and Correlation
• COV(X,Y) > 0 means that the two variables tend to move in the same direction: if one increases (decreases), then the other increases (decreases).
• COV(X,Y) < 0 means that the two variables tend to move in opposite directions: if one increases (decreases), then the other decreases (increases).
• The value of the covariance is hard to interpret, so the covariance is converted to the following number between −1 and +1 called the correlation of X and Y, written COR(X, Y) or ρ_XY, that indicates how strongly X and Y are correlated:
COR(X, Y) = COV(X, Y) / (σ_X σ_Y)
Note: Cov and correlation of RVs are different from cov. and correlation of samples and populations.
162
Example: Stocks and Bonds
• Example: Suppose you are considering investing in both a stock and a bond fund, and define the following RVs:
S = annual rate of return on the stock fund
B = annual rate of return on the bond fund
Possible Values (depend on the state of the economy):
Economy     Stock Fund   Bond Fund   Prob.
Recession   −7%          17%         1/3
Normal      12%          7%          1/3
Boom        28%          −3%         1/3
E[S] = 1/3(−0.07) + 1/3(0.12) + 1/3(0.28) = 0.11
E[B] = 1/3(0.17) + 1/3(0.07) + 1/3(−0.03) = 0.07
163
See File Cov_and_Cor.xls
Using E[S] = 0.11 and E[B] = 0.07, you can now compute COV(S, B) using the formula as follows:
COV(S, B) = 1/3(−0.07 − 0.11)(0.17 − 0.07) + 1/3(0.12 − 0.11)(0.07 − 0.07) + 1/3(0.28 − 0.11)(−0.03 − 0.07) = −0.0117
Using σ[S] = 0.143 and σ[B] = 0.082, you can now compute COR(S, B) using the formula as follows:
COR(S, B) = COV(S, B) / (σ[S] σ[B]) = −0.0117 / ((0.143)(0.082)) ≈ −1
So S and B are almost perfectly negatively correlated: when S returns are up, B returns are down and vice versa.
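The stock and bond numbers can be reproduced directly from the three economic scenarios. This Python sketch is a hedged illustration, not the contents of Cov_and_Cor.xls; it assumes the standard deviation of a discrete rv is computed as √(Σ p_i (x_i – E[X])²), and all variable names are invented for the example.

```python
from math import sqrt

# (probability, stock return, bond return) for recession, normal, boom.
scenarios = [
    (1 / 3, -0.07, 0.17),
    (1 / 3, 0.12, 0.07),
    (1 / 3, 0.28, -0.03),
]

mean_s = sum(p * s for p, s, _ in scenarios)
mean_b = sum(p * b for p, _, b in scenarios)

# COV(S, B) = sum of p_i * (s_i - E[S]) * (b_i - E[B])
cov_sb = sum(p * (s - mean_s) * (b - mean_b) for p, s, b in scenarios)

# Standard deviations of S and B as random variables.
sd_s = sqrt(sum(p * (s - mean_s) ** 2 for p, s, _ in scenarios))
sd_b = sqrt(sum(p * (b - mean_b) ** 2 for p, _, b in scenarios))

# COR(S, B) = COV(S, B) / (sd_S * sd_B)
cor_sb = cov_sb / (sd_s * sd_b)
print(round(cov_sb, 4), round(sd_s, 3), round(sd_b, 3), round(cor_sb, 3))
```

The correlation comes out very close to, but not exactly, −1, because the stock returns are not an exact straight-line function of the bond returns.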
164
Continuous Random Variables
• A continuous random variable assumes any value, including decimals and fractions, in intervals on the real line.
• Example: X = the number of minutes a customer waits in line. Possible values: [0, ∞).
– Question: If all values are equally likely, what is P(X = 5.3789)?
– Answer: Prob(X = 5.3789) = 1/∞ = 0 because there are an infinite number of possible values for X.
• Conclusion: For a continuous rv, it is not meaningful to specify the likelihood that the variable is equal to one specific value.
165
Continuous vs. Discrete RVs
• Question: Is there any difference between P(X < a) and P(X ≤ a)?
• Answer: For a continuous rv, the answer is “no” because:
P(X ≤ a) = P(X < a or X = a)
= P(X < a) + P(X = a)
= P(X < a) + 0
= P(X < a)
• Note: The foregoing step that P(X = a) = 0 is not true for a finite discrete rv, and this is one major difference between a finite discrete r.v. and a continuous rv.
166
Using Density Functions
• Solution: Use a probability density function to describe the likelihood that X is in a given interval.
• To find probabilities, we find areas under the density function: P(a < X < b) is the area under the graph of f(x) between a and b, and the total area under the graph is 1.
• Possible values where the density function is higher are more likely to occur than where the density function is lower.
[Figure: graph of a probability density function f(x) over all possible values of X, with the area between a and b shaded as P(a<X<b) and regions labeled “more likely” and “less likely”.]
167
Example: Finding Probabilities
Suppose that the graph of some probability density function f(x) is symmetric around 5.
What is P(X < 5)? 0.5
What is P(X > 5)? 0.5
If P(X < 4) = 0.3, find:
P(X > 6) = 0.3
P(X < 4 or X > 6) = 0.3 + 0.3 = 0.6
P(4 < X < 6) = 1 − 0.6 = 0.4
P(4 < X < 5) = 0.4/2 = 0.2
[Figure: density f(x) symmetric around x = 5, with areas 0.3 left of 4, 0.2 between 4 and 5, 0.2 between 5 and 6, and 0.3 right of 6.]
168
Example: Uniform Distribution
Fact: In the real world, you can never find the density function of a continuous rv, so what can you do?
Ans. Use a density function that mathematicians have created.
Example: Consider a rv X with possible values between 0 and 1. All values are equally likely.
Uniform Distribution: X ~ U[0,1], with density f(x) = 1 for 0 ≤ x ≤ 1.
Find P(0.2 < X < 0.5): 0.3 · 1 = 0.3
Find P(X > 0.6): 0.4 · 1 = 0.4
E[X] = 0.5
[Figure: density of U[0,1] with height 1 on [0, 1]; a second graph asks for the height of a uniform density on [0, 2], which is 1/2.]
169
The Normal Distribution
• Fact: In the real world, you can never find the density function of a continuous rv, so what can you do?
• Answer: Borrow an existing density function that mathematicians have created (like the uniform dist.).
• The Normal distribution is one such density function with many desirable properties.
• The Normal distribution applies to a continuous rv whose possible values can be any real number from –∞ to +∞.
• To write the density function, you must know the:
· Mean μ
· Standard Deviation σ
Usually estimated by computing the average and standard dev. from a sample of values for the rv.
170
The Normal Distribution
The density function f(x) = 1/(σ√(2π)) · e^(–(x – μ)² / (2σ²))
This density function (“The Bell Curve”) is:
• Centered at the mean μ.
• Symmetric about μ.
• “Bell shaped.”
• The std. dev. σ controls the “thickness”: smaller values of σ make the bell part thinner and taller.
Use the Normal when most values of your rv are close to the mean and then become less likely farther from the mean.
171
The Normal Distribution: Effect of the SD
172
The Normal Distribution
Area left of μ = 0.5
The area under f(x) between a and b = P(a ≤ Y ≤ b)
Note: Excel is used to find areas under the Normal density function.
173
Excel Function NORMDIST
• The Excel function NORMDIST is used to find the area under the normal curve to the left of a given value z, that is, if X ~ N(μ, σ), then
P(X ≤ z) = NORMDIST(z, μ, σ, TRUE).
174
Practice with Excel
Example: If X ~ N(20, 2), find P(X ≤ 23).
P(X ≤ 23) = NORMDIST(23, 20, 2, TRUE) = 0.933
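The same left-tail area can be computed outside Excel. The sketch below uses Python's standard-library `statistics.NormalDist`; the wrapper name `normdist_cumulative` is made up here to mirror the Excel call and is not a real Excel or Python function.

```python
from statistics import NormalDist

def normdist_cumulative(x, mean, sd):
    """Area under the N(mean, sd) density to the left of x, like NORMDIST(x, mean, sd, TRUE)."""
    return NormalDist(mu=mean, sigma=sd).cdf(x)

p_le_23 = normdist_cumulative(23, 20, 2)
print(round(p_le_23, 3))
```

Note that, as in the slides, the second parameter is the standard deviation, not the variance.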
175
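Practice with Excel
Question: What do you do if the area you are interested in is not “all the way to the left”?
Answer: Subtract two left areas. For example, for one standard deviation on either side of the mean: P(μ − σ ≤ X ≤ μ + σ) = 0.841 − 0.159 = 0.682
Try 2 and 3 std. deviations from the mean and you will discover the empirical rule.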
178
The Normal Distribution and the Empirical Rule
179
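Interpreting the Std. Dev.
When your data are bell-shaped (according to a histogram), you can interpret the pop. / sample standard deviation as follows: σ is a number so that about 68% of your data are within one standard deviation of the mean μ.
The Empirical Rule
Approx. 68% of the values are in μ ± 1σ
Approx. 95% of the values are in μ ± 2σ
Approx. 100% of the values are in μ ± 3σ
[Figure: bell-shaped curve with nested brackets marking the 68%, 95%, and 100% intervals; the example labels read 72 and 2.]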
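The percentages in the empirical rule can be recovered as areas under the standard normal density. This is a small Python sketch for illustration; the dictionary name `within` is invented here.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} std. dev.: {area:.4f}")
```

The area within 3 standard deviations is slightly less than 1, which is why the rule says "approximately" 100%.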
180
A Historical Note
Question: Before Excel and NORMDIST, how did one find areas under the Normal density function?
Answer: Using a table in which you could look up the area, but,…
Question: It is impossible to create a separate table for every combination of μ and σ, so what can you do?
Answer: Create a single table for a rv Z ~ N(0, 1), which is called a standard normal rv.
Fact: Any probability question about a rv X ~ N(μ, σ) can be stated as an equivalent question about a standard normal rv, as you will now see.
181
The Standard Normal
Example: If X ~ N(μ, σ) then P(X ≤ s) = NORMDIST(s, μ, σ, TRUE).
But P(X ≤ s) = P((X – μ)/σ ≤ (s – μ)/σ), where Z = (X – μ)/σ ~ N(0, 1) and t = (s – μ)/σ.
Thus, P(X ≤ s) = P(Z ≤ t) = NORMDIST(t, 0, 1, TRUE).
182
The Standard Normal Distribution
[Figure: a Normal density with mean μ and SD σ alongside the standard Normal density with mean 0 and SD 1, with the horizontal axis marked −2, −1, 0, +1, +2.]
183
The Standard Normal
Example: If X ~ N(20, 2), find P(X ≤ 23).
Answer 1: P(X ≤ 23) = NORMDIST(23, 20, 2, TRUE) = 0.933
Answer 2: P(X ≤ 23) = P((X – 20)/2 ≤ (23 – 20)/2) = P(Z ≤ 1.5) = NORMDIST(1.5, 0, 1, TRUE) = 0.933
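The equivalence of the two answers can be checked numerically: standardize the cutoff, then look up the area under the standard normal. A minimal Python sketch, with variable names chosen only for this illustration:

```python
from statistics import NormalDist

mu, sigma, s = 20, 2, 23

t = (s - mu) / sigma                   # standardized cutoff (the "z-score")
direct = NormalDist(mu, sigma).cdf(s)  # Answer 1: work with X ~ N(20, 2)
via_z = NormalDist(0, 1).cdf(t)        # Answer 2: work with Z ~ N(0, 1)

print(t, round(direct, 3), round(via_z, 3))
```

Both routes give the same area, which is why a single table for Z was enough before spreadsheets.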
184
Excel Function NORMINV
• For solving some problems, you know the probability, p, and want to find the value of z so that the area to the left of z is p, that is, P(X ≤ z) = p, where X ~ N(μ, σ).
Answer: z = NORMINV(p, μ, σ)
185
Practice with NORMINV
[Figure: Normal density centered at 0 with area 0.95 to the left of k.]
186
Example
If X ~ N(50, 8), find j so that P(50 – j ≤ X ≤ 50 + j) = 0.95
[Figure: Normal density centered at 50 with area 0.95 between 50 – j and 50 + j and area 0.025 in each tail.]
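One way to work this example: because the density is symmetric about 50, each tail holds 0.025, so the area to the left of 50 + j is 0.975. The Python sketch below uses `NormalDist.inv_cdf` as a stand-in for NORMINV; the variable names are invented for the illustration.

```python
from statistics import NormalDist

x = NormalDist(mu=50, sigma=8)

# Area left of 50 + j must be 0.95 + 0.025 = 0.975, like NORMINV(0.975, 50, 8).
upper = x.inv_cdf(0.975)
j = upper - 50

# Check: the area between 50 - j and 50 + j should be 0.95.
coverage = x.cdf(50 + j) - x.cdf(50 - j)
print(round(j, 2), round(coverage, 4))
```

The resulting j is a little under 2 standard deviations, in line with the empirical rule.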
187
Stat Camp for the MBA Program
Daniel Solow
Lecture 4
The Normal Distribution and the Central Limit Theorem
188
Example 1: Dear Abby
You wrote that a woman is pregnant for 266 days. Who said so? I carried my baby for ten months and five days, and there is no doubt about it because I know the exact date my baby was conceived. My husband is in the Navy and it couldn’t possibly have been any other time because I saw him only once for an hour, and I didn’t see him again until the day before the baby was born.
I don’t drink or run around, and there is no way this baby isn’t his, so please print a retraction about the 266-day carrying time because otherwise I am in a lot of trouble.
San Diego Reader
189
Dear Abby
Step 1: Identify an appropriate random variable.
Y = number of days of pregnancy
What are the possible values for Y? About 230 – 290?
What is the density function for Y? ???
Idea: Approximate the density of Y with a normal!
[Figure: probability density over days, with the axis marked … 255, 260, 265, 270, 275 ….]
(cont.)
190
Dear Abby
• Question: If you are going to use a normal approximation, what information do you need?
• Answer: The mean and standard deviation.
• Fact: According to the data from generations of births, pregnancies have a (sample) mean of 266 and (sample) standard deviation of 16 days, so Y ~ N(μ = 266, σ = 16).
• Question: What are the possible values for Y? –∞ to +∞
• Question: How can the number of days of pregnancy be < 230?
• Answer: Using the normal distribution, you have that P(Y < 230) = NORMDIST(230, 266, 16, TRUE) ≈ 0.01.
• Thus, when using the normal approximation, there is only about a 1% chance that a pregnancy lasts less than 230 days.
Models are NOT the real world but hopefully good approximations!
191
Dear Abby
• Step 2: State what you are looking for as a probability question in terms of the rv.
You want to find P(Y ≥ 10 mo. and 5 days) = P(Y ≥ 310).
• Step 3: Use the probability distribution of the rv to answer the probability question.
P(Y ≥ 310) = 1 – P(Y < 310) = 1 – NORMDIST(310, 266, 16, TRUE) = 0.00298
Was she telling the truth? Possibly, but highly unlikely.
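The two Dear Abby probabilities can be reproduced under the same normal model, Y ~ N(266, 16). This Python sketch is an illustration only; the variable names are made up here.

```python
from statistics import NormalDist

pregnancy = NormalDist(mu=266, sigma=16)  # model for days of pregnancy

# Left-tail area: chance the model assigns to fewer than 230 days.
p_under_230 = pregnancy.cdf(230)

# Right-tail area via the Complement Law: 310 days or more.
p_310_or_more = 1 - pregnancy.cdf(310)

print(round(p_under_230, 4), round(p_310_or_more, 5))
```

A pregnancy of 310 days is 2.75 standard deviations above the mean, which is why the right-tail probability is so small.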
192
Example 2: Problem of GoodTire
GoodTire has a new tire for which, in order to be competitive, they want to offer a warranty of 30,000 miles. Before doing so, the company wants to know what fraction of tires they can expect to be returned under the warranty.
193
The Problem of GoodTire
Step 1: Identify an appropriate random variable.
• For GoodTire, let X = number of miles such a tire will last.
What are the possible values for X? 0 – 90000?
What is the density function for X? It is unknown, so estimate it using a model, as follows:
From statistical analysis of a random sample, GoodTire believes the mileage follows approximately a normal distribution with a mean of 40,000 miles and a standard deviation of 10,000 miles, so assume that
X ~ N(μ = 40000, σ = 10000) with possible values: –∞ to +∞
(cont.)
194
The Problem of GoodTire
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• GoodTire wants to know the
Fraction of tires returned = Likelihood a tire fails = P{X ≤ 30000} = ?
195
The Problem of GoodTire
Step 3: Use the probability distribution of the random variable to answer the probability question.
• For GoodTire, you have X ~ N(40000, 10000), so
P{X ≤ 30000} = NORMDIST(30000, 40000, 10000, TRUE) = 0.1587
196
The Problem of GoodTire
Question: The CEO finds that a 16% return rate is too high. What warranty mileage s should they offer to get a 5% return rate?
Step 2: Probability Question: What should s be so that P{X ≤ s} = 0.05?
Step 3: s = NORMINV(0.05, 40000, 10000) = 23551.47
Fact: While you cannot control the value of a rv, you can control the likelihood of certain events occurring with that RV.
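Both GoodTire questions, the return rate for a given warranty and the warranty for a given return rate, can be sketched with one normal model. The Python below is an illustration under the slide's assumption X ~ N(40000, 10000); the variable names are invented here.

```python
from statistics import NormalDist

mileage = NormalDist(mu=40000, sigma=10000)  # model for miles a tire lasts

# Forward question: fraction of tires returned under a 30,000-mile warranty.
return_rate_30k = mileage.cdf(30000)

# Inverse question: warranty mileage with a 5% return rate, like NORMINV(0.05, 40000, 10000).
warranty_for_5pct = mileage.inv_cdf(0.05)

print(round(return_rate_30k, 4), round(warranty_for_5pct))
```

The forward question uses the cumulative function and the inverse question uses its inverse, which is the same pairing as NORMDIST and NORMINV.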
197
Example 3: Marketing Projections
• From historical data over a number of years, a firm knows that its annual sales average $25 million. For planning purposes, the CEO wants to know the likelihood that sales next year will:
– Exceed $30 million.
– Be within $1.5 million of the average.
The CEO is willing to issue bonuses if sales are “sufficiently” high. What level should be set so that bonuses are given at most 20% of the time?
198
Marketing Projections
Step 1: Identify an appropriate random variable.
• Let Y = next year’s sales in $ millions.
What are the possible values for Y? 0 – 50?
What is the density function for Y? ???
From statistical analysis over a number of years, they believe that annual sales follow approximately a normal distribution with a mean of $25 mil. and a standard deviation of $3 mil., so assume that
Y ~ N(μ = 25, σ = 3)
199
Marketing Projections
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• You want to know:
• P(sales exceed $30 mil.) = P(Y ≥ 30).
• P(sales are within $1.5 mil. of $25 mil.) = P(23.5 ≤ Y ≤ 26.5).
• What should be the value of sales (s) so that P(giving a bonus) = 0.20, that is, P(Y ≥ s) = 0.20?
200
Marketing Projections
Step 3: Use the probability distribution of the random variable to answer the probability question.
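The three probability questions from Step 2 can be worked out under the model Y ~ N(25, 3). The Python sketch below is one possible way to carry out Step 3, not the slide's own Excel solution; the variable names are invented for this illustration.

```python
from statistics import NormalDist

sales = NormalDist(mu=25, sigma=3)  # next year's sales in $ millions

# P(Y >= 30): right-tail area via the Complement Law.
p_exceed_30 = 1 - sales.cdf(30)

# P(23.5 <= Y <= 26.5): difference of two left-tail areas.
p_within_1_5 = sales.cdf(26.5) - sales.cdf(23.5)

# Bonus level s with P(Y >= s) = 0.20, i.e. area 0.80 to the left of s.
bonus_level = sales.inv_cdf(0.80)

print(round(p_exceed_30, 4), round(p_within_1_5, 4), round(bonus_level, 2))
```

Note that the bonus question is stated as a right-tail probability, so it has to be converted to a left-tail area (1 – 0.20 = 0.80) before inverting.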
Example 4: DUI Test
• In many states, a driver is legally drunk if the blood alcohol concentration, as determined by a breath analyzer, is 0.10% or higher.
• Suppose that a driver has a true blood alcohol concentration of 0.095%. With the breath analyzer test, what is the probability that the person will be (incorrectly) booked on a DUI charge?
Step 1: Identify an appropriate random variable.
Let Y = the measurement of the analyzer as a %.
Question: What are the possible values for Y? 0 – 0.3?
(cont.)
202
DUI Test
Step 1 (continued).
Question: What is the density function for Y?
Answer: We do not know, but data indicate that Y follows approximately a normal distribution with mean equal to the person’s true alcohol level and standard deviation equal to 0.004%, so…
Y ~ N(μ, σ = 0.004), where μ = the person’s true blood alcohol level (%)
203
DUI Test
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• You want to know the probability that a person with μ = 0.095 will be (incorrectly) booked on a DUI charge:
P(being booked on a DUI) = P(Y ≥ 0.10)
204
DUI Test
Step 3: Use the probability distribution of the random variable to answer the probability question.
P(Y ≥ 0.10) = 1 – NORMDIST(0.10, 0.095, 0.004, TRUE) ≈ 0.106
• There is about an 11% chance that such a person will be incorrectly charged with a DUI.
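The DUI probability can be written as a small function of the driver's true alcohol level, which makes it easy to see how quickly the risk changes near the legal limit. This Python sketch is illustrative; `prob_booked` and the constant names are invented here.

```python
from statistics import NormalDist

LEGAL_LIMIT = 0.10   # % blood alcohol at which a driver is booked
ANALYZER_SD = 0.004  # standard deviation of the analyzer's reading (%)

def prob_booked(true_level):
    """P(reading >= legal limit) when the reading is N(true_level, ANALYZER_SD)."""
    reading = NormalDist(mu=true_level, sigma=ANALYZER_SD)
    return 1 - reading.cdf(LEGAL_LIMIT)

p_booked = prob_booked(0.095)
print(round(p_booked, 3))
```

For a driver exactly at the limit the function returns 0.5, since half of the readings fall on each side of the true level.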
205
An Insurance Problem
GoodHands is considering insuring employees of GoodTire. What annual premium should the company charge to be sure that there is a likelihood of no more than 1% of losing money on each customer?
This is an example of decision making under uncertainty: you have to make a decision today (how much should the annual premium be) facing an uncertain future.
Question: Why is the future uncertain?
206
Solving the Insurance Problem
Step 1: Identify an appropriate random variable.
• Let X = the $ claimed by a customer in one year.
• What are the possible values for X? [0, 100000 (?)]
• Is X continuous or discrete? discrete
• What is the density function for X? It is unknown, so borrow one.
From statistical analysis of data, the annual claim for these people follows approximately a normal distribution with a mean of $2500 and a standard deviation of $1000, so:
X ~ N(μ = 2500, σ = 1000)
• Note: It can be OK to approximate a discrete RV with a continuous distribution.
207
An Insurance Problem
Step 2: State what you are looking for in terms of a probability question pertaining to the RV.
• For GoodHands, what should the premium s be so that the likelihood of losing money is no more than 1%?
Question: When do you lose money on a customer? When X > s.
Probability Question: What should the premium s be so that P(X > s) = 0.01?
[Figure: density of X ~ N(2500, 1000) with the premium s marked to the right of 2500.]
An Insurance Problem
Step 3: Use the probability distribution of the random variable to answer the probability question.
X ~ N(2500, 1000)
s = NORMINV(0.99, 2500, 1000)
= $4826.3478 (solution to the model!)
= $4826.35 (solution for the real world!)
Fact: While you cannot control the value of a rv (such as the claim of a person), you can control the likelihood of certain events occurring with that RV (such as the likelihood of such a claim exceeding the premium).
209
The Insurance Problem (cont.)
Question: GoodHands will insure all 100 employees of GoodTire. What premium should GoodHands charge per employee so that the likelihood of losing money on the average of all these claims is 1%?
Step 1: Identify appropriate random variables.
• For this problem, you now have the following rvs:
Xi = the $ / annual claim of customer i ~ N(μ = 2500, σ = 1000) (i = 1, …, 100)
X̄ = (X1 + … + X100) / 100 = the average claim (a new random var.)
Prob. Question: What should be the premium, s, so that P(X̄ > s) = 0.01?
Fact: To answer this prob. question about X̄ you need to know the density function of X̄. ???
Idea: When the rv you are interested in is the AVERAGE of other rvs, try…
210
The Central Limit Theorem
The Central Limit Theorem provides an approximate density function when the r.v. you are interested in is the average X̄ of n other rvs, say, X1, X2, …, Xn, that are:
(1) Independent (knowing the value of one rv tells you nothing about the values of the other rvs).
(2) Identically distributed (have the same density function with mean μ and standard deviation σ),
then, for “large” n,
X̄ ~ N(μ, σ/√n) (approx.)
211
The Insurance Problem (cont.)
For the insurance problem, you have
Xi = annual $ claimed by person i (i = 1, …, 100)
(1) Are X1, X2, …, X100 independent random variables?
Yes, because the amount claimed by one person has no effect on the amount claimed by another person.
(2) Are X1, X2, …, X100 identically distributed?
Yes, because each Xi ~ N(μ = 2500, σ = 1000).
Therefore, by the CLT, X̄ is approximately Normal with mean μ = 2500 and standard deviation σ/√n = 1000/√100 = 100.
212
An Insurance Problem
Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.
• For GoodHands, what should the premium s be so that the probability that the average of the 100 claims exceeds s is 0.01?
Probability Question: What should s be so that P(X̄ > s) = 0.01, where X̄ ~ N(2500, 100)?
213
An Insurance Problem (cont.)
Probability Question: What should the premium s be so that P(X̄ > s) = 0.01?
Step 3: Use the probability distribution of the random variable to answer the probability question.
s = NORMINV(0.99, 2500, 100) = $2732.64
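The effect of averaging can be seen by computing the premium twice: once for a single customer and once for the average of 100 customers, where the CLT shrinks the standard deviation to σ/√n. The Python sketch below is only an illustration of that comparison; the variable names are invented here.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2500, 1000, 100  # mean claim, std. dev. of one claim, number of employees

# One customer: premium with a 1% chance that the claim exceeds it.
premium_single = NormalDist(mu, sigma).inv_cdf(0.99)

# Average of n customers: by the CLT, the std. dev. of the average is sigma / sqrt(n).
sd_of_average = sigma / sqrt(n)
premium_average = NormalDist(mu, sd_of_average).inv_cdf(0.99)

print(round(premium_single, 2), sd_of_average, round(premium_average, 2))
```

Pooling 100 customers cuts the safety margin above the mean claim by a factor of 10, which is the statistical reason insurers prefer large groups. In practice the premium would be rounded up to the next cent so that the 1% target is not exceeded.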
214
Another Example of the CLT
• In modeling the performance of a team with 5 people, consider the following five rvs:
Pi = performance contribution of person i (for i = 1, …, 5) ~ U[0,1]
Possible values: [0, 1] (continuous)
Density function: f(x) = 1 for 0 ≤ x ≤ 1
E[Pi] = μ = 0.5
STDEV[Pi] = σ = 0.29.
However, what is of interest is the team performance, so let…
215
Another Example of the CLT
T = performance of the whole team = the average of P1, …, P5
Possible values: [0, 1] (continuous)
Density function: ???
You cannot find the true density function, so borrow one.
Because the rv T is the average of other RVs, think of using the Central Limit Theorem to approximate the density function of T.
216
The Team Problem
For the team problem, you have
Pi = performance of person i (i = 1, 2, 3, 4, 5)
(1) Are P1, P2, P3, P4, P5 independent random variables?
Yes, assuming that the performance of a person says nothing about the performance of another person.
(2) Are P1, P2, P3, P4, P5 identically distributed?
Yes, because each Pi ~ U[0, 1] with mean = 0.5 and std. dev. = 0.29.
Therefore, by the CLT, T is approximately Normal with mean 0.5 and standard deviation 0.29/√5 ≈ 0.13.
217
The Team Problem
Question: What is the probability that the team performance is at least 0.75?
T ~ N(0.5, 0.13)
P(T ≥ 0.75) = 1 – NORMDIST(0.75, 0.5, 0.13, TRUE) = 0.027
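The rounded figures 0.29 and 0.13 come from the standard deviation of a U[0,1] variable, 1/√12, and the CLT scaling by √n. The Python sketch below recomputes the team probability both with the slide's rounded standard deviation and without rounding; the variable names are invented for this illustration.

```python
from math import sqrt
from statistics import NormalDist

n = 5                 # team members
mean_one = 0.5        # mean of U[0, 1]
sd_one = sqrt(1 / 12) # std. dev. of U[0, 1], about 0.29

# By the CLT, the team average T has std. dev. sd_one / sqrt(n), about 0.13.
sd_team = sd_one / sqrt(n)

p_slide = 1 - NormalDist(0.5, 0.13).cdf(0.75)               # with the rounded 0.13
p_unrounded = 1 - NormalDist(mean_one, sd_team).cdf(0.75)   # without rounding

print(round(sd_one, 2), round(sd_team, 2), round(p_slide, 3), round(p_unrounded, 3))
```

The two answers differ only in the third decimal place. With n = 5 the normal curve is itself only a rough approximation, so that difference is not meaningful.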
218
Working with a Population
Consider a population of N items in which item i has a number, Xi, associated with it and let
X = the value of an item to be selected randomly from the population.
Possible values of X: X1, X2, …, XN (discrete)
Density function: 1/N, 1/N, …, 1/N
E[X] = (X1 + X2 + … + XN)/N = the population average µ.
STDEV[X] = the population standard deviation σ.
Note: From here on, a sample of size n from a population should always be thought of as a random variable.
219
The Average of a Sample
Now suppose you are going to record the numbers X1, X2, …, Xn taken from a sample of size n from a population and then compute:
X̄ = (X1 + X2 + … + Xn)/n
All possible values: The (finite) list of averages of every group of size n in the population (discrete).
Groups of size n: G1, G2, G3, …
X̄ for the group: A1, A2, A3, …
Density function: All equally likely.
E[X̄] = the population average µ.
STDEV[X̄] = σ/√n, where σ is the population standard deviation.
Fact: We cannot use the density function because we cannot list all of the possible values, so…
220
The Average of a Sample
Solution: Because X̄ is the average of rvs, think of using the CLT which, if applicable, results in the following density function for X̄:
The rvs X1, X2, …, Xn are iid from the same population with mean = μ and std. dev. = σ, so X̄ ~ N(μ, σ/√n) (approx.)
Possible Values: (–∞, +∞)
Now you can use the Normal Distribution to answer your probability question about X̄.
How Large is Large Enough?
• For symmetric but outlier-prone data, n = 15 samples should be enough to use the normal approximation.
• For mild skewness, n = 30 should generally be sufficient to make the normal approximation appropriate.
• For severe skewness, n should be at least 100 to use the normal approximation.
• Generally speaking, the larger n is, the better the normal approximation is.
222
A Final Example of the CLT
• Historical data collected at a paper mill show that 40% of sheet breaks are due to water drops, resulting from the condensation of steam.
• Suppose that the causes of the next 100 sheet breaks are monitored and that the sheet breaks are independent of one another.
• Find the expected value and the standard deviation of the number of sheet breaks that will be caused by water drops.
• What is the probability that at least 35 of the breaks will be due to water drops?
223
Exact Answer
• Success = break due to water drops
• P(success) = p = 0.4
• X = number of breaks due to water drops
• X is Binomial with n = 100 and p = 0.4
• E(X) = np = (100)(0.4) = 40
• SD(X) = √(np(1 – p)) = √((100)(0.4)(0.6)) = √24 = 4.9
• From Excel, P(X ≥ 35) = 1 – P(X < 35) = 1 – P(X ≤ 34) = 1 – BINOMDIST(34, 100, 0.4, TRUE) = 0.8697
224
Normal Approx. to Binomial
For this problem, let p = P(success) = 0.4, and
Xi = 1 if break i is due to water drops (a success) and 0 otherwise (i = 1, …, 100).
In this problem, you are interested in the rv
X = number of successes in 100 trials = X1 + X2 + … + X100
To find P(X ≥ 35) = P(X / 100 ≥ 35 / 100), you need to know the probability distribution of X̄ = X / 100 = (X1 + … + X100)/100, which, by the CLT, is approximately normal, so…
225
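Normal Approx. to Binomial
Each Xi ~ Binomial(1, p = 0.4), so
E[Xi] = μ = p = 0.4 and STDEV[Xi] = σ = √(p(1 – p)) = √0.24 ≈ 0.49.
Assuming that
• The Xi are pairwise independent and
• n = 100 is large enough (np > 5 and n(1 – p) > 5),
then by the CLT, the random variable X̄ = X / 100 is approximately N(μ = 0.4, σ/√n = 0.049).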
226
Normal Approx. to Binomial
Then, for X = X1 + … + X100,
P(X ≥ 35) = P(X / 100 ≥ 35 / 100) = 1 – NORMDIST(0.35, 0.4, 0.049, TRUE) = 0.85.
(The exact answer was 0.87.)
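The exact binomial answer and the CLT approximation for the paper mill can be computed side by side. The Python sketch below is an illustration only, and the variable names are invented here. Like the slides, it applies no continuity correction to the normal approximation.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.4  # 100 sheet breaks, 40% due to water drops

# Exact: P(X >= 35) = 1 - P(X <= 34) for X ~ Binomial(100, 0.4).
p_at_most_34 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(35))
exact = 1 - p_at_most_34

# CLT: the proportion X/100 is approximately N(p, sqrt(p(1 - p)/n)).
sd_proportion = sqrt(p * (1 - p) / n)
approx = 1 - NormalDist(p, sd_proportion).cdf(0.35)

print(round(exact, 4), round(sd_proportion, 3), round(approx, 4))
```

The approximation lands within a few hundredths of the exact value, which is typical when np and n(1 – p) are both well above 5.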
227
A function y = f(x) describes a relationship between the two quantitative variables x and y.
Logarithms
• The log base b of x [written logb(x)] is the power to which you must raise b to get x.
• Examples: log10(100) = 2, log2(32) = 5
• Logs are only defined for positive numbers.
• If the base is omitted, the default is 10.
• The base e = 2.718… is used in some financial applications (such as continuous compounding), in which case, loge(x) is written as ln(x) (the “natural log” of x).
240
Laws of Logarithms• Logs convert products to sums, that is,