Chapter 1 Describing Data: Graphical and Numerical PROBABILITY (6MTCOAE205)
Feb 15, 2016
Dealing with Uncertainty
Everyday decisions are based on incomplete information
Consider:
Will the job market be strong when I graduate?
Will the price of Yahoo stock be higher in six months than it is now?
Will interest rates remain low for the rest of the year if the federal budget deficit is as high as predicted?
Assist. Prof. Dr. İmran Göker Ch. 1-2
Dealing with Uncertainty
Numbers and data are used to assist decision making
Statistics is a tool to help process, summarize, analyze, and interpret data
Key Definitions
A population is the collection of all items of interest or under investigation
N represents the population size
A sample is an observed subset of the population
n represents the sample size
A parameter is a specific characteristic of a population
A statistic is a specific characteristic of a sample
Population vs. Sample
[Figure: the population is the full set of items a through z; the sample is the observed subset b, c, g, i, n, o, r, u, y.]
Values calculated using population data are called parameters
Values computed from sample data are called statistics
Examples of Populations
Names of all registered voters in the Turkish Republic
Incomes of all families living in Ankara
Osteoporosis incidence in Turkish women older than 45 years
Grade point averages of all the students in our university
Random Sampling
Simple random sampling is a procedure in which
each member of the population is chosen strictly by chance,
each member of the population is equally likely to be chosen, and
every possible sample of n objects is equally likely to be chosen
The resulting sample is called a random sample
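A minimal Python sketch of simple random sampling, assuming the 26-letter population from the Population vs. Sample slide. The sample size of 9 and the seed value are illustrative choices, not part of the slides.

```python
import random
import string

# Population: the 26 items a..z (N = 26).
population = list(string.ascii_lowercase)

# Fixed seed (an arbitrary, illustrative value) so the draw is reproducible.
rng = random.Random(2016)

# random.sample draws without replacement; every subset of size k
# is equally likely to be chosen.
sample = rng.sample(population, k=9)

N = len(population)
n = len(sample)
print(f"N = {N}, n = {n}, sample = {sorted(sample)}")
```

Each run with a different seed gives a different random sample of the same size.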
Descriptive and Inferential Statistics
Two branches of statistics:
Descriptive statistics
Graphical and numerical procedures to summarize and process data
Inferential statistics
Using data to make predictions, forecasts, and estimates to assist decision making
Descriptive Statistics
Collect data, e.g., survey
Present data, e.g., tables and graphs
Summarize data, e.g., sample mean = $\frac{\sum X_i}{n}$
Inferential Statistics
Estimation, e.g., estimate the population mean weight using the sample mean weight
Hypothesis testing, e.g., test the claim that the population mean weight is 140 pounds
Inference is the process of drawing conclusions or making decisions about a population based on sample results
Types of Data
Categorical data (defined categories or groups)
Examples: marital status; Are you registered to vote?; eye color
Numerical data, discrete (counted items)
Examples: number of children; defects per hour
Numerical data, continuous (measured characteristics)
Examples: weight; voltage
Measurement Levels
Quantitative data
Ratio data: differences between measurements, true zero exists
Interval data: differences between measurements but no true zero
Qualitative data
Ordinal data: ordered categories (rankings, order, or scaling)
Nominal data: categories (no ordering or direction)
Graphical Presentation of Data
Data in raw form are usually not easy to use for decision making
Some type of organization is needed: a table or a graph
The type of graph to use depends on the variable being summarized
Graphical Presentation of Data (continued)
Techniques reviewed in this chapter:
Categorical variables
• Frequency distribution
• Bar chart
• Pie chart
• Pareto diagram
Numerical variables
• Line chart
• Frequency distribution
• Histogram and ogive
• Stem-and-leaf display
• Scatter plot
Tables and Graphs for Categorical Variables
Categorical data
Tabulating data: frequency distribution table
Graphing data: bar chart, pie chart, Pareto diagram
The Frequency Distribution Table
Summarize data by category
Example: Hospital Patients by Unit
Hospital Unit      Number of Patients
Cardiac Care       1,052
Emergency          2,245
Intensive Care     340
Maternity          552
Surgery            4,630
(Variables are categorical)
Bar and Pie Charts
Bar charts and Pie charts are often used for qualitative (category) data
Height of bar or size of pie slice shows the frequency or percentage for each category
Bar Chart Example
[Figure: bar chart "Hospital Patients by Unit"; horizontal axis: hospital unit; vertical axis: number of patients per year, 0 to 5000.]
Hospital Unit      Number of Patients
Cardiac Care       1,052
Emergency          2,245
Intensive Care     340
Maternity          552
Surgery            4,630
Pie Chart Example
[Figure: pie chart "Hospital Patients by Unit": Surgery 53%, Emergency 25%, Cardiac Care 12%, Maternity 6%, Intensive Care 4%.]
(Percentages are rounded to the nearest percent)
Hospital Unit      Number of Patients   % of Total
Cardiac Care       1,052                11.93
Emergency          2,245                25.46
Intensive Care     340                  3.86
Maternity          552                  6.26
Surgery            4,630                52.50
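The percentage column of the hospital table can be reproduced with a few lines of Python. This is only a sketch; the dictionary name `patients` is an illustrative choice.

```python
# Frequency distribution: number of patients per hospital unit.
patients = {
    "Cardiac Care": 1052,
    "Emergency": 2245,
    "Intensive Care": 340,
    "Maternity": 552,
    "Surgery": 4630,
}

total = sum(patients.values())

# Each category's share of the total, as a percentage with two decimals.
percent = {unit: round(100 * count / total, 2) for unit, count in patients.items()}

for unit, count in patients.items():
    print(f"{unit:<15}{count:>6}{percent[unit]:>8.2f}")
```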
Pareto Diagram
Used to portray categorical data
A bar chart, where categories are shown in descending order of frequency
A cumulative polygon is often shown in the same graph
Used to separate the “vital few” from the “trivial many”
Pareto Diagram Example
Example: 400 defective items are examined for cause of defect:
Source of Manufacturing Error   Number of Defects
Bad Weld                        34
Poor Alignment                  223
Missing Part                    25
Paint Flaw                      78
Electrical Short                19
Cracked Case                    21
Total                           400
Pareto Diagram Example (continued)
Step 1: Sort the defect causes in descending order of frequency
Step 2: Determine % in each category
Source of Manufacturing Error   Number of Defects   % of Total Defects
Poor Alignment                  223                 55.75
Paint Flaw                      78                  19.50
Bad Weld                        34                  8.50
Missing Part                    25                  6.25
Cracked Case                    21                  5.25
Electrical Short                19                  4.75
Total                           400                 100%
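Steps 1 and 2, together with the cumulative percentages needed for the line in a Pareto diagram, can be sketched in Python as follows. Variable names are illustrative.

```python
from itertools import accumulate

# Defect counts by cause, in the order they were recorded.
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked Case": 21,
}
total = sum(defects.values())

# Step 1: sort causes by frequency, largest first.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Step 2: percentage of total defects in each category.
percents = [100 * count / total for _, count in ordered]

# Running total of the percentages, used for the cumulative line.
cumulative = list(accumulate(percents))

for (cause, count), pct, cum in zip(ordered, percents, cumulative):
    print(f"{cause:<17}{count:>4}{pct:>8.2f}{cum:>8.2f}")
```

The first two causes already account for about three quarters of all defects, which is the "vital few" idea behind the diagram.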
Pareto Diagram Example (continued)
Step 3: Show results graphically
[Figure: "Pareto Diagram: Cause of Manufacturing Defect". Bars show the % of defects in each category (left axis, 0% to 60%) for Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked Case, and Electrical Short; a line graph shows the cumulative % (right axis, 0% to 100%).]
Graphs for Time-Series Data
A line chart (time-series plot) is used to show the values of a variable over time
Time is measured on the horizontal axis
The variable of interest is measured on the vertical axis
Line Chart Example
[Figure: line chart "Magazine Subscriptions by Year"; horizontal axis: year, 1990 to 2006; vertical axis: thousands of subscribers, 0 to 350.]
Graphs to Describe Numerical Variables
Numerical data
Frequency distributions and cumulative distributions
Histogram
Ogive
Stem-and-leaf display
Frequency Distributions
What is a Frequency Distribution?
A frequency distribution is a list or a table
containing class groupings (categories or ranges within which the data fall)
and the corresponding frequencies with which data fall within each class or category
Why Use Frequency Distributions?
A frequency distribution is a way to summarize data
The distribution condenses the raw data into a more useful form...
and allows for a quick visual interpretation of the data
Class Intervals and Class Boundaries
Each class grouping has the same width
Determine the width of each interval by
$$ w = \text{interval width} = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}} $$
Use at least 5 but no more than 15-20 intervals
Intervals never overlap
Round up the interval width to get desirable interval endpoints
Frequency Distribution Example
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Frequency Distribution Example (continued)
Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute interval width: 10 (46/5 = 9.2, then round up)
Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
Count observations & assign to classes
Frequency Distribution Example (continued)
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Interval               Frequency   Relative Frequency   Percentage
10 but less than 20    3           .15                  15
20 but less than 30    6           .30                  30
30 but less than 40    5           .25                  25
40 but less than 50    4           .20                  20
50 but less than 60    2           .10                  10
Total                  20          1.00                 100
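The construction of this table can be sketched in Python. Rounding the raw width up to the next multiple of 10 and starting the first class at a multiple of the width are assumptions made here to obtain the "desirable" endpoints used in the example; other data may call for different choices.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temps) - min(temps)                    # 58 - 12 = 46

# Raw width is 46 / 5 = 9.2; round up to the next multiple of 10 (assumption).
width = math.ceil(data_range / num_classes / 10) * 10

# First lower boundary: the largest multiple of the width not above the minimum.
start = (min(temps) // width) * width

# Class i covers the half-open interval [start + i*width, start + (i+1)*width).
freq = [0] * num_classes
for t in temps:
    freq[(t - start) // width] += 1

n = len(temps)
for i, f in enumerate(freq):
    low = start + i * width
    print(f"{low} but less than {low + width}: {f:>2}  {f / n:.2f}  {100 * f / n:.0f}%")
```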
Histogram
A graph of the data in a frequency distribution is called a histogram
The interval endpoints are shown on the horizontal axis
The vertical axis is either frequency, relative frequency, or percentage
Bars of the appropriate heights are used to represent the number of observations within each class
Histogram Example
[Figure: "Histogram: Daily High Temperature"; horizontal axis: temperature in degrees, 0 to 70; vertical axis: frequency, 0 to 7. (No gaps between bars)]
Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2
Histograms in Excel
1. Select the Data tab
2. Click on Data Analysis
3. Choose Histogram
4. Input data range and bin range (bin range is a cell range containing the upper interval endpoints for each class grouping), then select Chart Output and click “OK”
Questions for Grouping Data into Intervals
1. How wide should each interval be? (How many classes should be used?)
2. How should the endpoints of the intervals be determined?
Often answered by trial and error, subject to user judgment
The goal is to create a distribution that is neither too "jagged" nor too "blocky"
The goal is to appropriately show the pattern of variation in the data
How Many Class Intervals?
Many (narrow class intervals)
may yield a very jagged distribution with gaps from empty classes
can give a poor indication of how frequency varies across classes
Few (wide class intervals)
may compress variation too much and yield a blocky distribution
can obscure important patterns of variation
[Figures: two histograms of the temperature data (frequency versus temperature), one with narrow classes of width 4 and one with wide classes of width 30. (X axis labels are upper class endpoints)]
The Cumulative Frequency Distribution
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20    3           15           3                      15
20 but less than 30    6           30           9                      45
30 but less than 40    5           25           14                     70
40 but less than 50    4           20           18                     90
50 but less than 60    2           10           20                     100
Total                  20          100
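The cumulative columns are running totals of the class frequencies. A short Python sketch, starting from the frequencies in the table:

```python
from itertools import accumulate

# Class frequencies for the intervals 10-20, 20-30, 30-40, 40-50, 50-60.
freq = [3, 6, 5, 4, 2]
n = sum(freq)

# Running totals of the frequencies.
cum_freq = list(accumulate(freq))

# Cumulative frequencies expressed as a percentage of all observations.
cum_pct = [100 * c / n for c in cum_freq]

print(cum_freq)
print(cum_pct)
```

These cumulative percentages, plotted against the upper interval endpoints, are the points of the ogive.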
The Ogive: Graphing Cumulative Frequencies
[Figure: "Ogive: Daily High Temperature"; horizontal axis: interval endpoints, 10 to 60; vertical axis: cumulative percentage, 0 to 100.]
Interval               Upper interval endpoint   Cumulative Percentage
Less than 10           10                        0
10 but less than 20    20                        15
20 but less than 30    30                        45
30 but less than 40    40                        70
40 but less than 50    50                        90
50 but less than 60    60                        100
Stem-and-Leaf Diagram
A simple way to see distribution details in a data set
METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
Example
Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Here, use the 10’s digit for the stem unit:
21 is shown as stem 2, leaf 1
38 is shown as stem 3, leaf 8
Example (continued)
Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Completed stem-and-leaf diagram:
Stem   Leaves
2      1 4 4 6 7 7
3      0 2 8
4      1
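The same diagram can be built programmatically. The sketch below assumes two-digit data with the 10's digit as the stem, as in this example.

```python
from collections import defaultdict

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

# Split each value into its 10's digit (stem) and 1's digit (leaf).
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```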
Using other stem units
Using the 100’s digit as the stem:
Round off the 10’s digit to form the leaves
Value   Stem   Leaf
613     6      1
776     7      8
. . .
1224    12     2
Using other stem units (continued)
Using the 100’s digit as the stem:
Data: 613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891, 894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224
The completed stem-and-leaf display:
Stem   Leaves
6      1 3 6
7      2 2 5 8
8      3 4 6 6 9 9
9      1 3 3 6 8
10     3 5 6
11     4 7
12     2
Relationships Between Variables
Graphs illustrated so far have involved only a single variable
When two variables exist, other techniques are used:
Categorical (qualitative) variables: cross tables
Numerical (quantitative) variables: scatter plots
Scatter Diagrams
Scatter diagrams are used for paired observations taken from two numerical variables
The scatter diagram: one variable is measured on the vertical axis and the other variable is measured on the horizontal axis
Scatter Diagram Example
[Figure: scatter diagram "Cost per Day vs. Production Volume"; horizontal axis: volume per day, 0 to 70; vertical axis: cost per day, 0 to 250.]
Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200
Scatter Diagrams in Excel
1. Select the Insert tab
2. Select Scatter type from the Charts section
3. When prompted, enter the data range, desired legend, and desired destination to complete the scatter diagram
Cross Tables
Cross Tables (or contingency tables) list the number of observations for every combination of values for two categorical or ordinal variables
If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table
Cross Table Example
4 x 3 Cross Table for Investment Choices by Investor (values in $1000’s)
Investment Category   Investor A   Investor B   Investor C   Total
Stocks                46.5         55           27.5         129
Bonds                 32.0         44           19.0         95
CD                    15.5         20           13.5         49
Savings               16.0         28           7.0          51
Total                 110.0        147          67.0         324
Graphing Multivariate Categorical Data (continued)
Side-by-side bar charts
[Figure: side-by-side bar chart "Comparing Investors"; categories: Stocks, Bonds, CD, Savings; one bar each for Investor A, Investor B, and Investor C; value axis 0 to 60.]
Side-by-Side Chart Example
Sales by quarter for three sales territories:
         1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East     20.4      27.4      59        20.4
West     30.6      38.6      34.6      31.6
North    45.9      46.9      45        43.9
[Figure: side-by-side bar chart of sales by quarter for East, West, and North; vertical axis 0 to 60.]
Data Presentation Errors
Goals for effective data presentation:
Present data to display essential information
Communicate complex ideas clearly and accurately
Avoid distortion that might convey the wrong message
Data Presentation Errors (continued)
Unequal histogram interval widths
Compressing or distorting the vertical axis
Providing no zero point on the vertical axis
Failing to provide a relative basis in comparing data between groups
Describing Data Numerically
Central tendency: arithmetic mean, median, mode
Variation: range, interquartile range, variance, standard deviation, coefficient of variation
Measures of Central Tendency
Overview
Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value
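Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency
For a population of N values (N is the population size, the $x_i$ are the population values):
$$ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} $$
For a sample of size n (n is the sample size, the $x_i$ are the observed values):
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} $$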
Arithmetic Mean (continued)
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
[Figure: two dot plots on a 0 to 10 axis.]
Values 1, 2, 3, 4, 5: $\frac{1+2+3+4+5}{5} = \frac{15}{5} = 3$, so Mean = 3
Values 1, 2, 3, 4, 10: $\frac{1+2+3+4+10}{5} = \frac{20}{5} = 4$, so Mean = 4
Median
In an ordered list, the median is the “middle” number (50% above, 50% below)
Not affected by extreme values
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3.]
Finding the Median
The location of the median:
$$ \text{Median position} = \frac{n+1}{2} \text{ position in the ordered data} $$
If the number of values is odd, the median is the middle number
If the number of values is even, the median is the average of the two middle numbers
Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
[Figure: two dot plots; on a 0 to 14 axis, Mode = 9; on a 0 to 6 axis, No Mode.]
Review Example
Five houses on a hill by the beach
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
Review Example: Summary Statistics
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (sum $3,000,000)
Mean: ($3,000,000/5) = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
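These three summary statistics can be checked with Python's standard `statistics` module; the list name `prices` is illustrative.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean = statistics.mean(prices)      # sum of values / number of values
median = statistics.median(prices)  # middle value of the ranked data
mode = statistics.mode(prices)      # most frequently observed value

print(f"Mean   = ${mean:,.0f}")
print(f"Median = ${median:,.0f}")
print(f"Mode   = ${mode:,.0f}")
```

The single $2,000,000 house pulls the mean well above the median, which illustrates why the mean is sensitive to outliers.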
Which measure of location is the “best”?
Mean is generally used, unless extreme values (outliers) exist . . .
Then median is often used, since the median is not sensitive to extreme values.
Example: median home prices may be reported for a region, since they are less sensitive to outliers
Shape of a Distribution
Describes how data are distributed
Measures of shape: symmetric or skewed
Left-skewed: Mean < Median
Symmetric: Mean = Median
Right-skewed: Median < Mean
Geometric Mean
Geometric mean
Used to measure the rate of change of a variable over time
$$ \bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} $$
Geometric mean rate of return
Measures the status of an investment over time
$$ \bar{r}_g = (x_1 \times x_2 \times \cdots \times x_n)^{1/n} - 1 $$
Where $x_i$ is the growth factor (1 + rate of return) in time period i
Example
An investment of $100,000 rose to $150,000 at the end of year one and increased to $180,000 at end of year two:
$$ X_1 = \$100{,}000 \qquad X_2 = \$150{,}000 \qquad X_3 = \$180{,}000 $$
50% increase in year one, 20% increase in year two
What is the mean percentage return over time?
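Example (continued)
Use the 1-year returns to compute the arithmetic mean and the geometric mean:
Arithmetic mean rate of return:
$$ \bar{X} = \frac{(50\%) + (20\%)}{2} = 35\% $$
Misleading result
Geometric mean rate of return, with each return expressed as a growth factor (1.50 and 1.20):
$$ \bar{r}_g = (x_1 \times x_2)^{1/n} - 1 = [(1.50) \times (1.20)]^{1/2} - 1 = (1.80)^{1/2} - 1 = 1.3416 - 1 = 34.16\% $$
More accurate result: $100,000 × (1.3416)² ≈ $180,000, the actual ending value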
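A short Python check of the two averages. It works with growth factors (ending value divided by starting value, so 1.5 and 1.2), because the geometric mean has to be taken over growth factors rather than raw percentages for the result to reproduce the ending value. The variable names are illustrative.

```python
import math

# Investment value at the start, after year one, and after year two.
values = [100_000, 150_000, 180_000]

# Growth factor for each year: later value divided by earlier value.
growth = [later / earlier for earlier, later in zip(values, values[1:])]

# Yearly rates of return (0.5 and 0.2, up to floating-point rounding).
returns = [g - 1 for g in growth]

# Arithmetic mean of the yearly returns.
arithmetic = sum(returns) / len(returns)

# Geometric mean rate of return: n-th root of the product of growth factors, minus 1.
geometric = math.prod(growth) ** (1 / len(growth)) - 1

print(f"Arithmetic mean return: {arithmetic:.2%}")
print(f"Geometric mean return:  {geometric:.2%}")

# Compounding at each average rate for two years:
print(f"Ending value at arithmetic rate: {values[0] * (1 + arithmetic) ** 2:,.0f}")
print(f"Ending value at geometric rate:  {values[0] * (1 + geometric) ** 2:,.0f}")
```

Compounding at the arithmetic mean overshoots the actual ending value of $180,000, while compounding at the geometric mean reproduces it.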
Measures of Variability
Measures of variation give information on the spread or variability of the data values.
Variation: range, interquartile range, variance, standard deviation, coefficient of variation
[Figure: two distributions with the same center and different variation.]
Range
Simplest measure of variation
Difference between the largest and the smallest observations:
Range = Xlargest – Xsmallest
Example:
[Figure: dot plot on a 0 to 14 axis.]
Range = 14 - 1 = 13
Disadvantages of the Range
Ignores the way in which data are distributed
[Figure: two dot plots on a 7 to 12 axis with different distributions, each with Range = 12 - 7 = 5.]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate high- and low-valued observations and calculate the range of the middle 50% of the data
Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Interquartile Range
Example:
[Figure: box-and-whisker plot with X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70; each of the four segments holds 25% of the data.]
Interquartile range = 57 – 30 = 27
Box-and-Whisker Plots
Quartiles
Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% each)
The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile, Q3
Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position: Q1 = 0.25(n+1)
Second quartile position: Q2 = 0.50(n+1) (the median position)
Third quartile position: Q3 = 0.75(n+1)
where n is the number of observed values
Quartiles
Example: Find the first quartile
Sample Ranked Data: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
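The position rule can be sketched as a small Python function. The helper name `quartile` and the linear interpolation between neighbouring ranked values are illustrative choices; for this data set the result agrees with `statistics.quantiles`, whose default "exclusive" method is also based on n + 1 positions.

```python
import statistics

def quartile(ranked, p):
    """Value at position p*(n+1) (1-based) in ranked data, interpolating linearly."""
    n = len(ranked)
    position = p * (n + 1)
    lower = int(position)            # 1-based rank at or just below the position
    fraction = position - lower      # how far to move toward the next ranked value
    if lower < 1:
        return ranked[0]
    if lower >= n:
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

ranked = [11, 12, 13, 16, 16, 17, 18, 21, 22]

q1, q2, q3 = (quartile(ranked, p) for p in (0.25, 0.50, 0.75))
iqr = q3 - q1

# Cross-check with the standard library (default method is "exclusive").
cross_check = statistics.quantiles(ranked, n=4)

print(q1, q2, q3, iqr)
print(cross_check)
```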
Population Variance
Average of squared deviations of values from the mean
Population variance:
$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} $$
Where μ = population mean
N = population size
xi = ith value of the variable x
Sample Variance
Average (approximately) of squared deviations of values from the mean
Sample variance:
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} $$
Where $\bar{x}$ = arithmetic mean
n = sample size
xi = ith value of the variable x
Population Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Population standard deviation:
$$ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} $$
Sample Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Sample standard deviation:
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} $$
Calculation Example: Sample Standard Deviation
Sample Data (xi): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{x}$ = 16
$$ s = \sqrt{\frac{(10-\bar{x})^2 + (12-\bar{x})^2 + (14-\bar{x})^2 + \cdots + (24-\bar{x})^2}{n-1}} $$
$$ = \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + \cdots + (24-16)^2}{8-1}} = \sqrt{\frac{130}{7}} = 4.3095 $$
A measure of the “average” scatter around the mean
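The calculation can be verified step by step in Python. The eight squared deviations are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130. Variable names are illustrative.

```python
import math
import statistics

sample = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(sample)

mean = sum(sample) / n

# Sum of squared deviations from the sample mean.
squared_deviations = sum((x - mean) ** 2 for x in sample)

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = squared_deviations / (n - 1)
s = math.sqrt(variance)

print(f"mean = {mean}, sum of squares = {squared_deviations}, s = {s:.4f}")

# The standard library gives the same sample standard deviation.
print(f"statistics.stdev = {statistics.stdev(sample):.4f}")
```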
Measuring variation
[Figure: two distributions, one with a small standard deviation and one with a large standard deviation.]
Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 axis.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
Advantages of Variance and Standard Deviation
Each value in the data set is used in the calculation
Values far from the mean are given extra weight (because deviations from the mean are squared)
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Can be used to compare two or more sets of data measured in different units
$$ CV = \left( \frac{s}{\bar{x}} \right) \cdot 100\% $$
Comparing Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
$$ CV_A = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% $$
Stock B:
Average price last year = $100
Standard deviation = $5
$$ CV_B = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% $$
Both stocks have the same standard deviation, but stock B is less variable relative to its price
Using Microsoft Excel
Descriptive Statistics can be obtained from Microsoft® Excel
Select: Data / Data Analysis / Descriptive Statistics
Enter details in dialog box
Using Excel
Select Data / Data Analysis / Descriptive Statistics
Enter input range details
Check box for summary statistics
Click OK
Excel output
Microsoft Excel descriptive statistics output, using the house price data:
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
Chebychev’s Theorem
For any population with mean μ and standard deviation σ, and k > 1, the percentage of observations that fall within the interval
[μ ± kσ]
is at least
$$ 100\left[1 - \frac{1}{k^2}\right]\% $$
Chebychev’s Theorem (continued)
Regardless of how the data are distributed, at least (1 - 1/k²) of the values will fall within k standard deviations of the mean (for k > 1)
Examples:
At least                 within
(1 - 1/1.5²) = 55.6%     k = 1.5 (μ ± 1.5σ)
(1 - 1/2²) = 75%         k = 2 (μ ± 2σ)
(1 - 1/3²) = 89%         k = 3 (μ ± 3σ)
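A small Python sketch of the bound, with a check against the 20 daily high temperatures used earlier in the chapter. Treating those 20 values as the whole population, so that the population standard deviation applies, is an assumption made only for illustration; the function name `chebyshev_bound` is likewise illustrative.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_bound(k):.1f}%")

# Daily high temperatures, treated here as a complete population (assumption).
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
mu = statistics.mean(temps)
sigma = statistics.pstdev(temps)

# Observed fraction of values inside the interval mu ± k*sigma for k = 2.
k = 2
within = sum(1 for t in temps if abs(t - mu) <= k * sigma) / len(temps)
print(f"observed within {k} standard deviations: {within:.0%}")
```

The observed fraction is typically well above the bound, since the theorem is a worst-case guarantee that holds for any distribution.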
The Empirical Rule
If the data distribution is bell-shaped, then the interval:
μ ± 1σ contains about 68% of the values in the population or the sample
[Figure: bell-shaped curve with the central 68% lying between μ - 1σ and μ + 1σ.]
The Empirical Rule (continued)
μ ± 2σ contains about 95% of the values in the population or the sample
μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample
[Figure: two bell-shaped curves, one with 95% of the area between μ - 2σ and μ + 2σ, the other with 99.7% between μ - 3σ and μ + 3σ.]
Weighted Mean
The weighted mean of a set of data is
$$ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{\sum w_i} $$
Where wi is the weight of the ith observation and $\sum w_i$ is the sum of the weights
Use when data is already grouped into n classes, with wi values in the ith class
Approximations for Grouped Data
Suppose data are grouped into K classes, with frequencies f1, f2, . . ., fK, and the midpoints of the classes are m1, m2, . . ., mK
For a sample of n observations, the mean is
$$ \bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n} \qquad \text{where } n = \sum_{i=1}^{K} f_i $$
Approximations for Grouped Data (continued)
Suppose data are grouped into K classes, with frequencies f1, f2, . . ., fK, and the midpoints of the classes are m1, m2, . . ., mK
For a sample of n observations, the variance is
$$ s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n-1} $$
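As an illustration, the grouped-data approximations can be applied to the temperature frequency distribution from earlier in the chapter (classes 10-20 through 50-60). Using the class midpoints 15, 25, ..., 55 is the assumption behind the approximation.

```python
# Class frequencies and midpoints for the temperature frequency distribution.
freq = [3, 6, 5, 4, 2]
midpoints = [15, 25, 35, 45, 55]

n = sum(freq)

# Approximate mean: frequency-weighted average of the class midpoints.
mean = sum(f * m for f, m in zip(freq, midpoints)) / n

# Approximate sample variance: weighted squared deviations divided by n - 1.
variance = sum(f * (m - mean) ** 2 for f, m in zip(freq, midpoints)) / (n - 1)

print(f"n = {n}, grouped mean = {mean}, grouped variance = {variance:.3f}")
```

The grouped mean of 33 is close to, but not the same as, the mean of the 20 raw values (32.4), because every observation is replaced by its class midpoint.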
The Sample Covariance
The covariance measures the strength of the linear relationship between two variables
The population covariance:
$$ Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} $$
The sample covariance:
$$ Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} $$
Only concerned with the strength of the relationship
No causal effect is implied
Interpreting Covariance
Covariance between two variables:
Cov(x,y) > 0: x and y tend to move in the same direction
Cov(x,y) < 0: x and y tend to move in opposite directions
Cov(x,y) = 0: x and y have no linear relationship
Coefficient of Correlation
Measures the relative strength of the linear relationship between two variables
Population correlation coefficient:
$$ \rho = \frac{Cov(x, y)}{\sigma_X \sigma_Y} $$
Sample correlation coefficient:
$$ r = \frac{Cov(x, y)}{s_X s_Y} $$
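A sketch of the sample covariance and correlation formulas in Python, applied to the cost-versus-volume data from the scatter diagram example. The choice of data set and the variable names are illustrative.

```python
import math

# Paired observations from the scatter diagram example.
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]          # volume per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]   # cost per day

n = len(volume)
x_bar = sum(volume) / n
y_bar = sum(cost) / n

# Sample covariance: sum of cross-deviations divided by n - 1.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(volume, cost)) / (n - 1)

# Sample standard deviations of each variable.
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in volume) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in cost) / (n - 1))

# Sample correlation coefficient: covariance scaled by both standard deviations.
r = s_xy / (s_x * s_y)

print(f"s_xy = {s_xy:.2f}, r = {r:.3f}")
```

The covariance is positive and r is close to 1, consistent with the nearly straight upward pattern in the scatter diagram.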
Features of Correlation Coefficient, r
Unit free
Ranges between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X with r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0.]
Using Excel to Find the Correlation Coefficient
Select Data / Data Analysis
Choose Correlation from the selection menu
Click OK . . .
Input data range and select appropriate options
Click OK to get output
Interpreting the Result
r = .733
There is a relatively strong positive linear relationship between test score #1 and test score #2
Students who scored high on the first test tended to score high on the second test
[Figure: "Scatter Plot of Test Scores"; horizontal axis: Test #1 Score, 70 to 100; vertical axis: Test #2 Score, 70 to 100.]