APPLIED STATISTICS: EXAMPLES IN EXCEL AND SPSS
CONTENTS
I. Descriptive statistics
  What is Statistics?
  Scales of measurement
  Discrete and continuous variables
  Data collecting
    Census
    Sampling
  Types of sample
    Simple random sample
    Stratified sample
    Cluster sampling
    Quota sampling
    Systematic sampling
  Calculating a Sample Size
  Frequency distribution
    Class intervals
    Outliers
  Data presentation: tables, diagrams and graphs
  Descriptive statistics
    Measures of central tendency
    Measures of dispersion
    Shape of distribution
      Symmetry or skewness
      Kurtosis
      Modality
    Measure of concentration
II. Empirical versus appropriate theoretical distributions (approximations with binomial, Poisson, hypergeometric or normal distribution)
  BINOMIAL DISTRIBUTION
    Probability distribution of a binomial random variable
    Characteristics of the Binomial distribution
  POISSON DISTRIBUTION
    Probability distribution of Poisson random variable
    Characteristics of the Poisson distribution
  HYPERGEOMETRIC DISTRIBUTION
  NORMAL DISTRIBUTION
    Roles for standardized normal distribution
    Characteristic intervals for normal distribution
  STUDENT t-DISTRIBUTION
  CHI-SQUARE (χ²) DISTRIBUTION
  F DISTRIBUTION
  LOGNORMAL DISTRIBUTION
  EXPONENTIAL DISTRIBUTION
  GAMMA DISTRIBUTION
  APPROXIMATIONS FOR BINOMIAL, POISSON AND HYPERGEOMETRIC DISTRIBUTION WITH NORMAL DISTRIBUTION
III. Inferential statistics: Estimation theory and hypothesis testing
  INFERENCE
  THE DISTRIBUTION OF THE SAMPLE MEANS
  CONFIDENCE INTERVAL FOR THE POPULATION MEAN
    Standard deviation from population is known
    Standard deviation from population isn't known
  CONFIDENCE INTERVAL FOR THE POPULATION PROPORTIONS
  CONFIDENCE INTERVAL FOR VARIANCE IN POPULATION
  HOW TO DETERMINE SAMPLE SIZE ACCORDING TO SAMPLE ERROR?
    Determining sample size for estimating population mean
    Determining sample size for estimating population proportion
  HYPOTHESIS TESTING
    Regions of rejection and non-rejection
    Risks in decision making process
    Procedure for hypothesis testing
    Hypothesis for the mean
      σ known
      σ unknown, small sample
      σ unknown, large sample
    A two sample test for mean
    A two sample test for variances
    Testing differences between arithmetic means of more than two populations on the basis of their samples - analysis of variance (ANOVA)
    Chi-square (χ²) test
    Test for differences between proportion for populations
    Test adequacy of approximations (goodness of fit)
    Kolmogorov-Smirnov test
IV. REGRESSION AND CORRELATION ANALYSIS
  Aim
  Basic aspects
  Scatter plot
  Line of Best Fit (Regression Line)
  The Correlation Coefficient
  The Coefficient of Determination
  Interpretation of the size of a correlation
  The standard error of estimate and the correlation coefficient
  Calculating the Equation of the Regression Line for two variables
  Prediction or forecasting
  Spearman's rank correlation coefficient
  Statistical testing (t test, ANOVA)
  Overview example for simple regression model with SPSS
  MULTIPLE REGRESSION MODEL
    The general multiple regression model
    Measures for quality of multiple regression model
    Statistical test (t test, ANOVA)
    Indicator dummy variables
    Simple model with dummy variable
    Example: indicator variables as the regression variables in the simple model with a "dummy" variable
    Example of multiple regression models with an indicator variable as one explanatory variable and a continuous variable as another explanatory variable
  CONDITIONS FOR ECONOMETRIC MODELS
    Assumptions regression models through SPSS
      MULTICOLLINEARITY
      OUTLIERS
      NORMALITY
      AUTOCORRELATION
      HETEROSKEDASTICITY
    ECONOMETRIC CONDITIONS FOR REGRESSION MODELS WITH SPSS EXAMPLES
References
DESCRIPTIVE STATISTICS: EXAMPLES IN EXCEL

I. Descriptive statistics

What is Statistics?

Statistics, in short, is the study of data. It includes:
• Descriptive statistics (the study of methods and tools for collecting data, and mathematical models to describe and interpret data), and
• Inferential statistics (the systems and techniques for making probability-based decisions and accurate predictions based on incomplete (sample) data).
The three main aspects of statistical work with data are:
1. The collection of qualitative or numerical data,
2. The presentation of qualitative or numerical data, and
3. The analysis of numerical data with appropriate statistical methods and models.

Scales of measurement

Each scale of measurement corresponds to an appropriate data type.
1. Nominal scale
A nominal scale classifies data into distinct categories in which no ordering is implied. Nominal variables might be used to identify different attributes. For example, a nominal scale is appropriate for:
• Gender
• Citizenship
• The Internet provider that you prefer
• The license plate number of a car
The only comparisons that can be made between variable values are equality and inequality. There are no "less than" or "greater than" relations among them, and no operations such as addition or subtraction.
2. Ordinal scale
An ordinal scale classifies data into distinct categories in which an ordering is implied. An ordinal scale is directly connected with ranking. An example is product satisfaction, because you can be very satisfied, satisfied, neutral, unsatisfied or very unsatisfied.
Comparisons of better and worse can be made, in addition to equality and inequality. However, operations such as conventional addition and subtraction are still without meaning. While the scale can be ranked from high to low, the difference between points cannot be quantified. We cannot say that a person who thinks facilities are good regards them as twice as good as a person who thinks they are below average.
3. Ratio scale
A ratio scale is an ordered scale in which the measurement involves a true zero point (height, consumption, profit, etc.). All mathematical operations are possible with this type of data and lead to meaningful results. There are numerous methods for analyzing this type of data.
4. Interval scale
The most important characteristic of an interval scale is that the measurement does not involve a true zero point. The numbers have all the features of ordinal measurement and are also separated by the same interval. The zero value is arbitrary, not real (temperature, etc.).
In this case, differences between arbitrary pairs of numbers can be meaningfully compared. Operations such as addition and subtraction are therefore meaningful. However, because the zero point on the scale is arbitrary, ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out. On the other hand, negative values on the scale can be used.
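The claim about interval scales can be checked with a small numerical sketch. It assumes temperature measured in degrees Celsius and Fahrenheit as the interval-scale example; the function name and the chosen temperatures are illustrative only.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Two temperatures on the Celsius scale and the same temperatures in Fahrenheit.
a_c, b_c = 10.0, 20.0
a_f, b_f = celsius_to_fahrenheit(a_c), celsius_to_fahrenheit(b_c)

# A ratio of values depends on where the arbitrary zero sits,
# so it changes when the unit changes: 20/10 versus 68/50.
ratio_c = b_c / a_c
ratio_f = b_f / a_f

# A comparison of differences does not depend on the zero point:
# the gap 30..50 is twice the gap 10..20 in both units.
c_c, d_c = 30.0, 50.0
c_f, d_f = celsius_to_fahrenheit(c_c), celsius_to_fahrenheit(d_c)
diff_ratio_c = (d_c - c_c) / (b_c - a_c)
diff_ratio_f = (d_f - c_f) / (b_f - a_f)

print(ratio_c, ratio_f)            # the two ratios disagree
print(diff_ratio_c, diff_ratio_f)  # the two difference comparisons agree
```

The same 20 degrees Celsius is "twice" 10 degrees Celsius only in one unit, which is why statements such as "twice as warm" have no meaning on an interval scale.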
Categorical variables (attributes) are measured on a nominal or ordinal scale, while numerical variables are measured on a ratio or interval scale.

Discrete and continuous variables

A numerical variable can be discrete or continuous:
• Discrete variables produce numerical responses that arise from a counting process. An example of a discrete numerical variable is the number of magazines subscribed to. Another example would be the score given by a judge to a gymnast in competition: the range is 0 to 10 and the score is always given to one decimal (e.g., a score of 8.5). The response is one of a finite number of possible values (here 0.0, 0.1, ..., 10.0); in general, a discrete variable takes values from a finite or countable set.
• Continuous variables produce numerical responses that arise from a measuring process. The response takes on any value within a continuum or interval, depending on the precision of the measuring instrument. Examples of continuous variables are distance, age, height, consumption, revenue, loan amount, export/import, etc.

Data collecting

Depending on the scope of research, data can be collected from a whole population or from a part of the population (a sample).

Census
A survey of a whole population is called a census. A census refers to data collection about every unit in a group or population. If you collected data about the height of everyone in your class, that would be regarded as a class census. A characteristic of a population (such as the population mean) is referred to as a parameter.
There are various reasons why a census may or may not be chosen as the method of data collection:
Census data
Advantages (+)
Sampling variance is zero: There is no sampling variability attributed to the statistic because it is calculated using data from the entire population.
Detail: Detailed information about small sub-groups of the population can be made available.
Disadvantages (-)
Cost: In terms of money, conducting a census for a large population can be very expensive.
Time: A census generally takes longer to conduct than a sample survey.
Control: A census of a large population is such a huge undertaking that it is difficult to keep every single operation under the same level of scrutiny and control.

Sampling
A sampling frame is a complete or partial listing of items comprising the population. The frame can be a data source such as population lists, directories or maps. Samples are drawn from this frame. If the frame is inadequate because certain groups of individuals or items in the population were not properly included, then the samples will be inaccurate and biased.
The sampling process comprises several stages:
• Defining the population of concern,
• Specifying a sampling frame, a set of items or events possible to measure,
• Specifying a sampling method for selecting items or events from the frame,
• Determining the sample size,
• Implementing the sampling plan,
• Sampling and data collecting,
• Reviewing the sampling process.
Examples of sample surveys:
• Phoning the fifth person on every page of the local phonebook and asking them how long they have lived in the area.
• Selecting several cities in a country, several neighbourhoods in those cities and several streets in those neighbourhoods to recruit participants for a survey.
A characteristic of a sample (such as the sample standard deviation) is referred to as a statistic.
Reasons one may or may not choose to use a sample survey include:
Sample survey
Advantages (+)
Cost: A sample survey costs less than a census because data are collected from only part of a group.
Time: Results are obtained far more quickly for a sample survey than for a census. Fewer units are contacted and less data needs to be processed.
Control: The smaller scale of this operation allows for better monitoring and quality control.
Disadvantages (-)
Sampling variance is non-zero: The data may not be as precise because the data came from a sample of a population, instead of the total population.
Detail: The sample may not be large enough to produce information about small population sub-groups or small geographical areas.

Types of sample

Simple random sample
A simple random sample is selected so that every possible sample of a given size has an equal chance of being selected from the population. Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process.
In small populations such sampling is typically done without replacement. This means that a person or item, once selected, is not returned to the frame and therefore cannot be selected again. An unbiased random selection of individuals is important so that, in the long run, the sample represents the population. However, this does not guarantee that a particular sample is a perfect representation of the population.
Although simple random sampling can be conducted with replacement instead, this is less common and would normally be described more fully as simple random sampling with replacement. This means that a person or item, once selected, is returned to the frame and therefore can be selected again with the same probability 1/N.
The advantages are that a random sample is free of classification error and requires minimum advance knowledge of the population. Random sampling best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items.
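The difference between sampling without and with replacement can be sketched in a few lines of Python. The frame of 20 numbered units and the fixed seed are assumptions made only so the sketch is reproducible; they are not part of the text above.

```python
import random

rng = random.Random(42)        # fixed seed, only to make the sketch reproducible
frame = list(range(1, 21))     # hypothetical sampling frame: units numbered 1..20

# Without replacement: a selected unit is not returned to the frame,
# so no unit can appear twice in the sample.
without_replacement = rng.sample(frame, k=5)

# With replacement: every draw is made from the full frame,
# so each unit has probability 1/N on each draw and may repeat.
with_replacement = rng.choices(frame, k=5)

print(without_replacement)
print(with_replacement)
```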

Stratified sample
When sub-populations vary considerably, it is advantageous to sample each subpopulation (stratum) independently. Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling.
The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then random or systematic sampling is applied within each stratum. This often improves the representativeness of the sample by reducing sampling error.
In general, the size of the sample in each stratum is taken in proportion to the size of the stratum. This is called proportionate allocation. If the population consists of 60% in the male stratum and 40% in the female stratum, then the relative size of the two samples (for example, three males and two females) should reflect this proportion.
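Proportionate allocation can be illustrated with a short sketch that reproduces the 60%/40% example (three males, two females in a sample of five). The function names and the generated unit labels are hypothetical, and simple rounding is an assumption: for awkward proportions the rounded parts may not add up exactly to n.

```python
import random

def proportionate_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    # Assumption: plain rounding; the parts can differ from n by one in edge cases.
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}

def stratified_sample(strata, n, rng):
    """Draw a simple random sample (without replacement) inside each stratum."""
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportionate_allocation(sizes, n)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}

# Hypothetical population: 60 units in the male stratum, 40 in the female stratum.
strata = {
    "male": [f"M{i}" for i in range(60)],
    "female": [f"F{i}" for i in range(40)],
}
allocation = proportionate_allocation({"male": 60, "female": 40}, 5)
sample = stratified_sample(strata, 5, random.Random(0))
print(allocation)
```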

Cluster sampling
The problem with random sampling methods, when we have to sample a population that is dispersed across a wide geographic region, is that we have to cover a lot of ground geographically in order to reach each of the sampled units. It is for precisely this problem that cluster or area random sampling was invented.
In cluster sampling, we follow these steps:
• divide the population into clusters (usually along geographic boundaries),
• randomly sample clusters,
• measure all units within the sampled clusters.
Cluster samples are generally used if:
• No list of the population exists.
• Well-defined clusters, which will often be geographic areas, exist.
Often the total sample size must be fairly large to enable cluster sampling to be used effectively.
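The three steps of cluster sampling can be sketched as follows. The cluster names, the units inside them and the function name are hypothetical; the sketch only shows the mechanics of sampling whole clusters instead of individual units.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick n_clusters clusters and return every unit inside them."""
    # Step 2: randomly sample clusters (not individual units).
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Step 3: measure all units within the sampled clusters.
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Step 1: a hypothetical population divided into geographic clusters.
clusters = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1", "w2"],
}
chosen, units = cluster_sample(clusters, 2, random.Random(1))
print(chosen, units)
```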

Quota sampling
Quota sampling is the non-probability equivalent of stratified sampling. As in stratified sampling, the researcher first identifies the strata and their proportions as they are represented in the population. Then convenience or judgment sampling is used to select the required number of subjects from each stratum. This differs from stratified sampling, where the strata are filled by random sampling.
There are two types of quota sampling: proportional and non-proportional. In proportional quota sampling you want to represent the major characteristics of the population by sampling a proportional amount of each. For instance, if you know the population has 40% women and 60% men, and that you want a total sample size of 100, you will continue sampling until you get those percentages (40 women and 60 men) and then you will stop.
Non-proportional quota sampling is a bit less restrictive. In this method, you specify the minimum number of sampled units you want in each category. Here, you're not concerned with having numbers that match the proportions in the population. Instead, you simply want to have enough to assure that you will be able to talk about even small groups in the population.

Systematic sampling
Systematic sampling is a statistical method involving the selection of every k-th element from a sampling frame, where k, the sampling interval, is calculated as:
k = population size (N) / sample size (n)
Using this procedure, each element in the population has a known and equal probability of selection. This makes systematic sampling functionally similar to simple random sampling. It is, however, much more efficient and much less expensive to carry out. The researcher must ensure that the chosen sampling interval does not hide a pattern. Any pattern would threaten randomness. A random starting point must also be selected.
Systematic sampling should be applied only if the given population is logically homogeneous, because systematic sample units are uniformly distributed over the population.
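A minimal sketch of systematic selection with a random starting point is given below. The frame of 100 numbered units and the function name are hypothetical, and taking the integer part of N / n as the interval is an assumption for the case where N is not an exact multiple of n.

```python
import random

def systematic_sample(frame, n, rng):
    """Select every k-th element of the frame after a random start, with k = N // n."""
    N = len(frame)
    k = N // n                  # sampling interval (integer part of N / n)
    start = rng.randrange(k)    # random starting point within the first interval
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))     # hypothetical frame with N = 100 units
sample = systematic_sample(frame, 10, random.Random(7))
print(sample)                   # ten units, each 10 positions apart
```

If the frame were ordered so that every tenth unit shared some feature, this sample would contain only such units, which is the hidden pattern the text warns about.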

Calculating a Sample Size

The three most important factors that determine sample size are:
• How accurate do you wish to be?
• How confident do you need to be in the results?
• What budget do you have available?
The temptation is to say all should be as high as possible. The problem is that an increase in either accuracy or confidence (or both) will always require a larger sample and a higher budget. Therefore, a compromise must be reached.
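The trade-off between accuracy, confidence and sample size can be made concrete with a hedged sketch. It assumes the standard formula for estimating a population mean with known standard deviation, n = (z·σ / E)², which is not stated in this section; the values of σ, the margin of error E and the confidence levels are illustrative only.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin_of_error, confidence):
    """Sample size from n = (z * sigma / E)^2, rounded up (assumed formula)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

n_base = sample_size_for_mean(sigma=10, margin_of_error=2, confidence=0.95)
# Higher accuracy (smaller margin of error) requires a larger sample.
n_more_accurate = sample_size_for_mean(sigma=10, margin_of_error=1, confidence=0.95)
# Higher confidence also requires a larger sample.
n_more_confident = sample_size_for_mean(sigma=10, margin_of_error=2, confidence=0.99)

print(n_base, n_more_accurate, n_more_confident)
```

Under this formula, halving the margin of error roughly quadruples the required sample, which is why the budget usually forces a compromise.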

Frequency distribution

The first result that we get after research is a series of raw data. It is a database in which we entered data for each item or object without any order (piled data). In order to get an arranged statistical series (ordered array), we need to sort the data by order of magnitude (from the smallest observation to the largest). The easiest method of organizing data is a frequency distribution, which converts raw data into a meaningful pattern for statistical analysis.
The final form of data grouping is the statistical distribution of frequencies, in which each modality or interval of the variable (there are n modalities or intervals) is associated with a corresponding absolute frequency $f_i$ (the number of times each modality or class appears): $(x_i, f_i)$ or $(L_{1,i}, L_{2,i}, f_i)$, $i = 1, \dots, n$, where $L_{1,i}$ and $L_{2,i}$ are the lower and upper endpoints of the i-th class.
The number of class groupings used depends on the number of observations in the data (N). In general, the frequency distribution should have at least 5 class groupings but no more than 15.
When a variable can take continuous values instead of discrete values, or when the number of possible values is too large, the table construction is cumbersome, if not impossible. A slightly different tabulation scheme based on ranges of values (classes or intervals) is used in such cases: $(L_{1,i}, L_{2,i}, f_i)$.
Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables should only be used with class intervals.
The relative frequency is the proportion of units of a statistical set with the same modality or interval. The relative frequency of a particular modality or class interval is found by dividing the absolute frequency by the number of observations:
$$p_i = \frac{f_i}{N}, \qquad \sum_{i=1}^{n} p_i = 1.$$
The percentage frequency is found by multiplying each relative frequency value by 100. The percentage frequency is expressed in percent and has the same meaning as the relative frequency:
$$P_i = \frac{f_i}{N} \cdot 100 = 100\, p_i, \qquad \sum_{i=1}^{n} P_i = 100.$$
Cumulative frequency (CF) is used to determine the number of observations that lie at or below a particular value in a data set (how many observations have a value equal to or lower than the value of the present modality); the number lying above that value follows by subtracting from N. The cumulative frequency is calculated from a frequency distribution table by adding each frequency to the sum of its predecessors:
$$S_i = \sum_{j=1}^{i} f_j.$$
The last value will always be equal to the total number of observations, since all frequencies will already have been added to the previous total.
Cumulative percentage (CF%) is used to determine the percentage or part of observations that lie at or below a particular value in a data set (which part or % of the data have a value equal to or lower than the value of the present modality). It is calculated by adding each percentage frequency from a frequency distribution table to the sum of its predecessors:
$$F_i = \sum_{j=1}^{i} P_j.$$
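The four quantities defined above (absolute, relative, percentage and cumulative frequencies) can be computed together in a short sketch. The data values and the function name are hypothetical; the variable names follow the symbols f, p, P, S and F used in the formulas.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(data):
    """Return rows (x_i, f_i, p_i, P_i, S_i, F_i), sorted by modality x_i."""
    counts = Counter(data)
    N = len(data)
    values = sorted(counts)
    f = [counts[x] for x in values]     # absolute frequencies f_i
    p = [fi / N for fi in f]            # relative frequencies p_i = f_i / N
    P = [100 * pi for pi in p]          # percentage frequencies P_i = 100 * p_i
    S = list(accumulate(f))             # cumulative frequencies S_i
    F = list(accumulate(P))             # cumulative percentages F_i
    return list(zip(values, f, p, P, S, F))

data = [3, 5, 4, 5, 3, 5, 6, 4, 5, 6]  # hypothetical raw (unordered) data
table = frequency_table(data)
for row in table:
    print(row)
```

The last row's cumulative frequency equals N and its cumulative percentage equals 100, which gives a quick check that no observation was lost.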
Excel solution for creating a frequency distribution:
1. For qualitative data:
o Create a column with the modalities.
o In the next column, in the first cell beside the first modality, choose the Excel function Statistical - COUNTIF:
  Range: row, column or array with the original data (fix that range with $),
  Criteria: description of the modality.
o For the other modalities, do this with the Copy option.
2. For numerical data:
o Create new columns, one with the lower and one with the upper endpoints of the classes.
o Select free cells beside those columns.
o Choose the Excel function Statistical - FREQUENCY:
  Data array: row, column or array with the original data,
  Bins array: new column with the upper endpoints of the classes,
  then press CTRL+SHIFT+ENTER.
o That will produce absolute frequencies for all classes.

Example 1.
According to the database for HBS 2004, we have information about several variables for 7,413 households:
• Entity
• Canton
• Gender
• Marital status
• Education level
• Employment status
These are qualitative variables with a small number of modalities, so we will use non-interval grouping, that is, we will find the absolute frequency for each modality.
First, we type the modalities for the given variable in an empty column of the Excel sheet. We take the variable marital status, whose modalities are: unmarried, married, informal marriage, divorced and widower/widow.
To construct the frequency distribution we use the Excel function COUNTIF.
Now we give the arguments to the chosen COUNTIF function:
• Range: row or column with the original data (we fix that data range with $: $D$2:$D$7414),
• Criteria: cell with the given modality (H10).
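For readers who want to check the Excel result outside the spreadsheet, the COUNTIF step can be mirrored in Python. This is only a sketch: the short list of responses is invented, not the HBS 2004 data, and the helper does an exact, case-sensitive match, whereas Excel's COUNTIF ignores case and supports wildcards.

```python
def countif(values, criterion):
    """Count the entries equal to the criterion (exact-match analogue of COUNTIF)."""
    return sum(1 for v in values if v == criterion)

# Hypothetical responses for the variable "marital status".
marital_status = [
    "married", "unmarried", "married",
    "widower/widow", "divorced", "married",
]
modalities = ["unmarried", "married", "informal marriage", "divorced", "widower/widow"]

# One COUNTIF per modality, as when the formula is copied down the column.
absolute = {m: countif(marital_status, m) for m in modalities}
print(absolute)
```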
With the Copy-Paste option we complete the other cells for absolute frequency.
In the same way we can complete the frequency distribution for the other variables.
The next step is to calculate relative and percentage frequencies from the absolute frequencies:
1. We get the relative frequency when we divide the absolute frequency by the sample or population size (N), computed as the sum of the absolute frequencies (when we enter the sum we always fix the series with $). We get the other relative frequencies with the Copy-Paste option, and the sum of the relative frequencies has to equal 1.
2. We get the percentage frequency when we multiply the relative frequency by 100%, which transforms the proportion into percentage form. We get the other percentage frequencies with the Copy-Paste option, and the sum of the percentage frequencies has to equal 100.
Interpretation: The largest share of households (71.24%) has a head in a formal marriage, while the smallest share (0.27%) has a head in an informal marriage.
When the variable is qualitative and its modalities have no natural order, as here, it makes no sense to calculate cumulative frequencies, because they have no logical interpretation.

Example 2.
We have a database about import and export in the year 2007 for 181 countries (Doing Business 2007, trading across boundaries). The variable number of documents for export is an example of a discrete quantitative variable. To construct the frequency distribution we use the FREQUENCY function.
First we find the minimum and maximum values of the variable with the statistical functions MIN and MAX.
The minimum value is 3 and the maximum value is 14, so we enter the modalities 3 to 14 in a new column (I8:I19) for the frequency distribution.
Then we select all cells where we need absolute frequencies (J8:J19) and choose in Functions: Statistical functions - FREQUENCY, with:
1. Data array: the original data (B2:B182),
2. Bins array: the modalities (I8:I19).
Then we press CTRL+SHIFT+ENTER simultaneously and get the frequency distribution.
From the sum of the absolute frequencies (175) we can see that data about this variable are missing for 6 countries.
Next step is to calculate relative, percentage and cumulative
frequencies according toabsolute frequencies:
-
DESCRIPTIVE STATISTICSEXAMPLES IN EXCEL
19
1. we will get relative frequency when we divide absolute
frequency with sample orpopulation size (N) like sum for absolute
frequencies (when we give sum wealways fix series with $):
Other relative frequencies we will get with Copy-Paste option
and sum of relativefrequencies has to be equal 1:
3. Percentage frequency we will get when multiply relative
frequency with 100%:
The other percentage frequencies we get with the Copy-Paste option, and the sum of percentage frequencies has to equal 100:
Interpretation: The largest share of countries (19,43%) requires 6 documents for export, while the smallest share of countries (1,14%) requires 13 or 14 documents for export.
3. Increasing cumulative frequency
The first increasing cumulative frequency is always the same as the first absolute frequency, and then we add the next absolute frequency to the current cumulative value:
The other cumulative frequencies we get with the Copy-Paste option, and the last cumulative frequency has to equal N:
Interpretation: 149 countries require 9 or fewer documents for export.
4. Increasing cumulative percentage frequency
The first increasing cumulative percentage frequency is always the same as the first percentage frequency, and then we add the next percentage frequency to the current cumulative value:
The other cumulative percentage frequencies we get with the Copy-Paste option, and the last cumulative percentage frequency has to equal 100:
Interpretation: 61,14% of countries require 7 or fewer documents for export.
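The steps of Example 2 (absolute, relative, percentage and cumulative frequencies) can also be sketched outside Excel. The Python sketch below is only an illustration under our own assumptions: the function name frequency_table and the ten sample values are hypothetical and are not the Doing Business data.

```python
from collections import Counter

def frequency_table(values):
    """Return one row per modality:
    (modality, absolute, relative, percentage, cumulative, cumulative %)."""
    counts = Counter(values)
    n = sum(counts.values())          # N = sum of absolute frequencies
    rows = []
    cumulative = 0
    for modality in sorted(counts):
        absolute = counts[modality]
        cumulative += absolute        # add next absolute frequency to the cumulant
        rows.append((modality, absolute, absolute / n, 100 * absolute / n,
                     cumulative, 100 * cumulative / n))
    return rows

# Hypothetical sample of "number of documents" (assumption, for illustration only)
documents = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6]
table = frequency_table(documents)
for row in table:
    print(row)
```

As in the Excel procedure, the relative frequencies should sum to 1 and the last cumulative frequency should equal N.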
Class intervals
Class interval width is the difference between the upper and lower endpoint of an interval ($l_i = L_{2,i} - L_{1,i}$).
In summary, follow these basic rules when constructing a frequency distribution table for a data set that contains a large number of observations:
- find the lowest and highest values of the variable,
- decide on the width of the class intervals and form class intervals that are mutually exclusive,
- include all possible values of the variable.
In an interval-grouped series, in order to allow further calculation, we need to approximate the intervals by the corresponding class middles (class mark, midpoint, centre of interval):
$$c_i = \frac{L_{1,i} + L_{2,i}}{2}.$$
Example 3. We have a database on imports and exports in 2007 for 181 countries (Doing Business 2007, Trading Across Borders). The variable cost to import is an example of a continuous quantitative variable. To construct the frequency distribution we will use the FREQUENCY option.
First we find the minimal and maximal value of the modalities with the statistical functions MIN and MAX:
The minimal value is 367 and the maximal value is 5520, and according to that we determine the intervals for the frequency distribution. We take intervals of width 500, and in the next cells we type the boundaries of those intervals (trying to keep them visually symmetric):
Once the interval boundaries are set, we can go to the Frequency function. We select all cells where we want the absolute frequencies (K8:K19) and choose Functions: Statistical functions - Frequency, where:
1. Data array is the original data (G2:G182)
2. Bins array is the upper boundaries of the intervals (each included in its own interval) (J8:J19)
We press CTRL+SHIFT+ENTER at the same time and we get the frequency distribution:
According to the sum of absolute frequencies (175), we can see that data on this variable are missing for 6 countries.
The frequency distribution looks like this:
The next step is to calculate the relative, percentage and cumulative frequencies from the absolute frequencies:
1. We get the relative frequency when we divide the absolute frequency by the sample or population size (N), taken as the sum of absolute frequencies (when we refer to the sum we always fix the cell with $):
The Copy-Paste option is used to get the other relative frequencies:
The sum of relative frequencies is 1.
2. We get the percentage frequency when we multiply the relative frequency by 100%:
The other percentage frequencies we get with the Copy-Paste option, and the sum of percentage frequencies has to equal 100:
Interpretation: The largest share of countries (32,57%) has a cost of import per container in the interval 1000-1500 US$, while the smallest share (0,57%) has a cost of import per container in the intervals 3000-3500 or 5500-6000. Because of that we can conclude that the data in the interval 5500-6000 is an outlier.
3. Increasing cumulative frequency
The first is equal to the first absolute frequency, and then we cumulate:
Then we use the Copy-Paste option:
For example, one of the conclusions can be that 170 countries have a cost of import lower than 4000 US$.
4. Increasing cumulative percentage frequency
The procedure is the same as in the previous step, but with percentage frequencies:
Then we use the Copy-Paste option:
For example, one of the conclusions can be that 90,29% of countries have a cost of import lower than 2500 US$.
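The way the Frequency function treats the bins in Example 3 (each upper boundary belongs to its own interval, and values above the last boundary fall into one extra cell) can be sketched as follows. This is a minimal Python illustration under our own assumptions: the function name excel_like_frequency and the seven cost values are hypothetical, not the original data.

```python
from bisect import bisect_left

def excel_like_frequency(data, upper_bounds):
    """Count values per interval; each interval includes its upper boundary.
    The last slot collects values above the highest boundary."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in data:
        # index of the first boundary that is >= value
        counts[bisect_left(upper_bounds, value)] += 1
    return counts

width = 500
upper_bounds = list(range(width, 6001, width))             # 500, 1000, ..., 6000
midpoints = [upper - width / 2 for upper in upper_bounds]  # centres of intervals

# Hypothetical import costs in US$ (assumption, for illustration only)
costs = [367, 500, 501, 1200, 1450, 2600, 5520]
counts = excel_like_frequency(costs, upper_bounds)
print(counts)
print(midpoints)
```

Note that the value 500 is counted in the first interval and 501 in the second, which is the same boundary rule that the text describes for the Bins array.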
Outliers
An outlier is an extreme value in the data: an observation that is significantly different from the rest of the data. There may be more than one outlier in a set of data.
Sometimes outliers are significant pieces of information and should not be ignored. Other times they occur because of an error or misinformation and should be ignored.
Data presentation: tables, diagrams and graphs
The two most important ways of presenting data are the previously presented tables with frequency distributions, and graphs.
Why use graphs when presenting data? Because graphs:
- are quick and direct
- highlight the most important facts
- facilitate understanding of the data
- can convince readers
- can be easily remembered.
Knowing what type of graph to use with what type of information is crucial. Depending on the nature of the data and the variable type, some graphs might be more appropriate than others. You can also experiment with different types of graphs and select the most appropriate one. Here are several suggestions for an appropriate selection, according to the effect you want to achieve with the graph:
- pie chart (description of components)
- horizontal bar graph (comparison of items and relationships, time series)
- vertical bar graph (discrete variable, comparison of items and relationships, time series, frequency distribution)
- line graph (time series and frequency distribution)
- scatter plot (analysis of relationships)
- histogram (continuous variable).
In Excel, in the segment Tools - Customize - Insert - Chart, we can find the Chart function and choose among different types of graphs:
Example 1. We will again work with the variable marital status. What types of graphs can we use? According to the variable type (qualitative variable), we can construct a structural pie chart or vertical bars.
For this example we will construct a structural pie chart:
We choose the option Next:
We choose the option Next again:
1. Titles - we give a title to the graph
2. Legend - we choose how the legend is shown
3. Data labels - we choose what to show on the pie: variable name, modality name, absolute frequency, %. We choose to show % because we already have the modality names in the legend.
We choose the option Next and determine the place where the graph will be saved:
We choose the option Finish:
Example 2. The variable number of documents for export is a discrete variable. Because of that we can choose a structural pie chart, vertical bars or a frequency polygon to represent it.
We will construct a graph with vertical bars:
We choose the option Next:
In the Series option we set the values of the modalities (I8:I19):
We choose the option Next:
a) Titles - we determine the titles of the graph and axes
b) Axes - we set up the axes
c) Gridlines - we set up the gridlines
d) Legend - we choose whether to include a legend and how. If we have only one variable, the legend is not important. But if we have more variables, we use the legend to distinguish them.
e) Data labels - we choose what to show on the graph: variable name, modality name, absolute frequency, %. We choose to show the absolute frequencies:
f) Data table - if we include this option we get a table below the graph, but it holds the same information as the graph.
We choose the option Next and determine the place where the graph will be saved:
We choose the option Finish:
Example 3. This is the continuous variable cost of import. Because of that we prefer to use a histogram, a frequency polygon or a polygon of cumulative frequencies.
A. First we construct a histogram. The procedure is the same as for vertical bars. At the end, when we get the graph with vertical bars, we format the gap between the bars on the graph to be equal to 0:
We click on the bars in the Excel graph and choose Format data series, and then Options, where we set Gap width to 0:
We click OK and we get the histogram (a graph with contiguous bars):
B. Now we construct the polygon of absolute frequencies. For that we need the centres of the intervals, and so we need columns with the lower and upper boundaries of the intervals. The centre of an interval is the sum of its lower and upper boundary divided by 2:
The other centres of intervals we get with the Copy-Paste function:
Now we can construct the frequency polygon:
We choose Next and in Data range select the cells with the absolute frequencies:
For Series we select the centres of the intervals as modalities:
Again we use the option Next:
a) Axes - we set up the axes
b) Gridlines - we set up the gridlines
c) Legend - we choose whether to include a legend and how. If we have only one variable, the legend is not important. But if we have more variables, we use the legend to distinguish them.
d) Data labels - we choose what to show on the graph: variable name, modality name, absolute frequency, %. We choose to show the absolute frequencies.
e) Data table - if we include this option we get a table below the graph, but it holds the same information as the graph:
We choose the option Next and determine the place where the graph will be saved:
In the same way we can create a polygon of cumulative frequencies, but in that case we would select the cells with the cumulative frequencies in Data range at the beginning.
[Figure: Polygon of cumulative percentage frequencies. X axis: centre of interval (1 to 12); Y axis: CF% (0 to 120). Data labels: 3,4286; 34,8571; 67,4286; 82,8571; 90,2857; 95,4286; 96,0000; 97,1429; 98,2857; 99,4286; 99,4286; 100,0000]
Descriptive statistics
Descriptive statistics are used to describe the basic features of the data in a study. Together with simple graphical analysis, they form the basis of virtually every quantitative and qualitative analysis of data.
There may be several objectives for formulating a summary statistic or parameter:
- To choose a statistic that shows how different units seem similar. Statistical textbooks call one solution to this objective a measure of central tendency.
- To choose another statistic that shows how they differ. This kind of statistic is often called a measure of statistical variability.
- To analyze the shape of the frequency distribution.
Measures of central tendency
Measures of central tendency summarize a list of numbers by a "typical" value called a measure of location. The three most common measures of location are the mean, the median, and the mode.
The mean (average) is the sum of the values divided by the number of values. It has the smallest possible sum of squared differences from members of the list.
$$\bar{X} = \frac{\sum_{i=1}^{N} x_i}{N}$$
The median is the middle value in the sorted list. It has the smallest possible sum of absolute differences from members of the list. For a grouped series it is found from the cumulative frequencies: the first modality or interval for which $\frac{N}{2} \le CF_{M_e}$ is the median, or the interval in which the median is contained. If it is an interval, then the median is determined using the following formula:
$$M_e = L_{1,M_e} + l_{M_e} \cdot \frac{\frac{N}{2} - CF_{M_e - 1}}{f_{M_e}}$$
The mode is the most frequent value in the list (or one of the most frequent values, if there is more than one). The mode is only calculated for a statistical distribution (grouped series). It can be determined graphically in a histogram. For a non-interval grouped distribution, the mode is read off on the basis of the highest frequency ($f_{\max} = f_{M_o}$). For an interval-grouped distribution, the modal interval is the one with the highest frequency, and the mode is determined on the basis of the following formula:
$$M_o = L_{1,M_o} + l_{M_o} \cdot \frac{f_{M_o} - f_{M_o - 1}}{\left(f_{M_o} - f_{M_o - 1}\right) + \left(f_{M_o} - f_{M_o + 1}\right)}$$
Sometimes we choose specific values from the cumulative distribution function called quartiles. The procedure is the same as for the median:
- 25% of the data have a value less than or equal to the first quartile and 75% of the data have a value higher than the first quartile (theoretical position $\frac{N}{4} \le CF_{Q_1}$)
- 75% of the data have a value less than or equal to the third quartile and 25% of the data have a value higher than the third quartile (theoretical position $\frac{3N}{4} \le CF_{Q_3}$).
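The interpolation rules for the median, quartiles and mode of an interval-grouped series can be sketched as follows. This is a minimal Python illustration under our own assumptions: the function names and the four-interval example are hypothetical, and equal interval widths are assumed.

```python
def grouped_quantile(lower_bounds, width, freqs, share):
    """Interpolated quantile: share=0.5 gives the median, 0.25 and 0.75 the quartiles."""
    n = sum(freqs)
    target = share * n                      # theoretical position, e.g. N/2
    cumulative = 0                          # cumulative frequency before the interval
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + width * (target - cumulative) / f
        cumulative += f
    raise ValueError("share must be between 0 and 1")

def grouped_mode(lower_bounds, width, freqs):
    """Interpolated mode inside the interval with the highest frequency."""
    k = freqs.index(max(freqs))
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1, d2 = freqs[k] - before, freqs[k] - after
    if d1 + d2 == 0:                        # flat neighbourhood: take the centre
        return lower_bounds[k] + width / 2
    return lower_bounds[k] + width * d1 / (d1 + d2)

# Hypothetical grouped series: intervals 0-10, 10-20, 20-30, 30-40
lower_bounds = [0, 10, 20, 30]
freqs = [5, 15, 20, 10]
median = grouped_quantile(lower_bounds, 10, freqs, 0.5)
q1 = grouped_quantile(lower_bounds, 10, freqs, 0.25)
q3 = grouped_quantile(lower_bounds, 10, freqs, 0.75)
mode = grouped_mode(lower_bounds, 10, freqs)
print(q1, median, q3, mode)
```

For this hypothetical series N = 50, so the median position is 25, which falls in the interval 20-30.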
Measures of dispersion
Dispersion refers to the spread of the values around the central tendency. There are three common absolute measures of dispersion:
The range
The range is simply the highest value minus the lowest value: $RV = x_{\max} - x_{\min}$.
The quartile range
The quartile range ($I_Q = Q_3 - Q_1$) is the range from the 25th to the 75th percentile of a distribution. It represents the "middle half" of the data and is a measure of variability or spread that is robust to outliers.
The standard deviation
The standard deviation is the square root of the sum of the squared deviations from the mean divided by the number of scores (or the number of scores minus one, if we work with a sample).
For population: $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^2, \qquad \sigma = \sqrt{\sigma^2}$$
For sample: $$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^2, \qquad \sigma = \sqrt{\sigma^2}$$
The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached (the three-sigma rule):
- approximately 68% of the scores in the sample fall within one standard deviation of the mean
- approximately 95% of the scores in the sample fall within two standard deviations of the mean
- approximately 99.7% of the scores in the sample fall within three standard deviations of the mean.
The problem with the standard deviation, as an absolute measure of dispersion, is that we cannot use it to compare series with different units of measure or with different averages.
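The difference between dividing by N (population) and by N - 1 (sample) can be checked with a short sketch. The Python example below uses the standard library functions statistics.pstdev and statistics.stdev; the eight scores are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import statistics

# Hypothetical scores (assumption, for illustration only); their mean is 5
scores = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(scores) - min(scores)       # highest minus lowest value
population_sd = statistics.pstdev(scores)     # divides the squared deviations by N
sample_sd = statistics.stdev(scores)          # divides the squared deviations by N - 1

print(value_range, population_sd, round(sample_sd, 4))
```

The sum of squared deviations here is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, which is why the sample standard deviation is slightly larger.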
Therefore we define relative measures of dispersion, such as:
Coefficient of variation
The coefficient of variation is a relative measure of variability which can be used for comparing series with different units of measure, because it is a unitless number.
$$V = \frac{\sigma}{\bar{X}} \cdot 100 \ (\%)$$
It can also be used for comparing series with different arithmetic means.
z value
Z values determine the relative position of a modality of the variable in the series:
$$z_i = \frac{x_i - \bar{X}}{\sigma}, \quad i = 1, 2, \ldots, N$$
They are appropriate for comparing positions of data in different series. Z values are specific in that we can calculate a z value for each modality, not only for the series of data as a whole.
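That the coefficient of variation and the z values do not depend on the unit of measure can be illustrated with a small sketch. In the Python example below the function names and both series are our own hypothetical choices; the second series is the first one multiplied by 100, as if it were expressed in a different unit.

```python
import statistics

def coefficient_of_variation(values):
    """V = standard deviation / mean * 100 (population standard deviation)."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)

def z_values(values):
    """z_i = (x_i - mean) / standard deviation for every value in the series."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical series (assumption): series_b is series_a in another unit
series_a = [2, 4, 4, 4, 5, 5, 7, 9]
series_b = [200, 400, 400, 400, 500, 500, 700, 900]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)
z_a = z_values(series_a)
z_b = z_values(series_b)
print(cv_a, cv_b)
print(z_a)
```

Both series have very different standard deviations (2 and 200), but the same coefficient of variation and the same z values, which is what makes these measures suitable for comparison.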
The quartile deviation coefficient
The quartile deviation coefficient is a relative indicator of dispersion and shows variability around the median:
$$V_Q = \frac{I_Q}{Q_1 + Q_3} \cdot 100\% = \frac{Q_3 - Q_1}{Q_1 + Q_3} \cdot 100\%$$
A higher value of the quartile deviation coefficient indicates greater dispersion, and vice versa.
Shape of distribution
Symmetry or skewness
A frequency distribution may be symmetrical or asymmetrical. Imagine constructing a histogram centred on a piece of paper and folding the paper in half the long way. If the distribution is symmetrical, the part of the histogram on the left side of the fold would be the mirror image of the part on the right side of the fold. If the distribution is asymmetrical, the two sides will not be mirror images of each other. True symmetric distributions include what we will later call the normal distribution. Asymmetric distributions are more commonly found.
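Measure of skewness:
$$\alpha_3 = \frac{\mu_3}{\sigma^3}, \qquad \mu_3 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^3$$
$\alpha_3 = 0$: symmetry; $\alpha_3 > 0$: positively skewed; $\alpha_3 < 0$: negatively skewed
[Figure: frequency curves (f against X) for a symmetric, a left asymmetric and a right asymmetric distribution]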
If a distribution is asymmetric it is either positively skewed or negatively skewed. A distribution is said to be positively skewed if the scores tend to cluster toward the lower end of the scale (that is, the smaller numbers) with increasingly fewer scores at the upper end of the scale (that is, the larger numbers). A negatively skewed distribution is exactly the opposite. With a negatively skewed distribution, most of the scores tend to occur toward the upper end of the scale while increasingly fewer scores occur toward the lower end.
Kurtosis
Another descriptive statistic that can be derived to describe a distribution is called kurtosis. It refers to the relative concentration of data in the centre, the upper and lower ends (tails), and the shoulders of a distribution. A distribution is platykurtic if it is flatter than the corresponding normal curve and leptokurtic if it is more peaked than the normal curve.
Modality
A distribution is called unimodal if there is only one major "peak" in the distribution of scores when represented as a histogram. A distribution is bimodal if there are two major peaks. If there are more than two major peaks, we call the distribution multimodal.
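Measure of kurtosis:
$$\alpha_4 = \frac{\mu_4}{\sigma^4}, \qquad \mu_4 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^4$$
$\alpha_4 = 3$: normal; $\alpha_4 > 3$: leptokurtic; $\alpha_4 < 3$: platykurtic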
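The moment-based measures of skewness and kurtosis can be sketched directly from their definitions. The Python example below is only an illustration: the function names and the two small series are our own hypothetical choices, and population moments (division by N) are used.

```python
def central_moment(values, order):
    """Central moment of the given order, dividing by N (population moment)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def alpha3(values):
    """Skewness: third central moment divided by sigma cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def alpha4(values):
    """Kurtosis: fourth central moment divided by sigma to the fourth power."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical series (assumption, for illustration only)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(alpha3(symmetric), alpha4(symmetric))
print(alpha3(right_skewed), alpha4(right_skewed))
```

The symmetric series gives a skewness of 0 and a kurtosis below 3 (platykurtic), while the series with one large value gives a positive skewness. These population formulas compare kurtosis with 3, so a value reported relative to the normal curve (as an excess over 3) has to be increased by 3 before comparison.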
Measure of concentration
The Lorenz curve is a graphical representation of the cumulative distribution function of a probability distribution; it is a graph showing the proportion of the distribution assumed by the bottom y% of the values. It is often used to represent income distribution, where it shows, for the bottom x% of households, what percentage y% of the total income they have.
Every point on the Lorenz curve represents a statement like "the bottom 20% of all households have 10% of the total income". A perfectly equal income distribution would be one in which every person has the same income. In this case, the bottom N% of society would always have N% of the income. This can be depicted by the straight line y = x, called the line of perfect equality.
By contrast, a perfectly unequal distribution would be one in which one person has all the income and everyone else has none. In that case, the curve would be at y = 0 for all x < 100%, and y = 100% when x = 100%. This curve is called the line of perfect inequality.
The Gini coefficient is the area between the line of perfect equality and the observed Lorenz curve, as a percentage of the area between the line of perfect equality and the line of perfect inequality. This equals two times the area between the line of perfect equality and the observed Lorenz curve.
$$G = \frac{\text{concentration area}}{0{,}5} = 2 \cdot \text{concentration area} = 2S$$
$$G = 1 - \sum_{j} p_j \left(Q_j + Q_{j-1}\right)$$
$$0 \le G \le 1$$
The higher the Gini coefficient, the more unequal the distribution is.
Excel and SPSS do not offer an option to calculate measures of concentration directly, so we develop the procedure in Excel on the basis of the formula.
Example 4. We have a database of variables that describe the procedure of paying taxes for 181 countries (source: http://www.doingbusiness.org/CustomQuery/, data for 2008). The data are given in an Excel sheet (A1-G363). The variables are:
- Payments (number) (B2-B363)
- Time (hours) (C2-C363)
- Total tax rate (% profit) (D2-D363).
These are quantitative variables, so we can apply the methodology of descriptive statistics to the series of 181 data for each variable, to get several parameters that describe the given series.
The simplest and fastest way to get several parameters that describe the given series (x min, x max, average, deviation, mode, median, kurtosis and skewness) is to use the Excel function Tools - Data Analysis. If that option is not available, we have to enable it:
1. Tools - Add-ins:
2. We have to check (enable) Analysis ToolPak and Analysis ToolPak VBA:
3. We click OK and the option appears in Tools:
Now we can use the Data Analysis option:
We get a list of the analyses that we can run. Currently we are interested in the option Descriptive statistics, so we choose it and click OK. In Input range we can select all columns with several variables at once and set the grouping by columns ($B$1:$D$182). When we select the data we also include the first cell with the variable name and check the option Labels in first row. Then we choose an empty cell or a new sheet where we want to save the results of the analysis, and we select which parameters we want:
- Summary statistics - x min, x max, average, deviation, mode, median, kurtosis and skewness, range, count...
- Confidence level for mean - this is the boundary of the confidence interval for the average at the given confidence level (for example 95%)
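- Kth largest and Kth smallest - if we want to calculate quantiles we choose this option. K is a position in the sorted series, so for the first and third quartile we take in both cases the position that corresponds to 25% of the data (for 181 data approximately K = 45), and for the first and ninth decile in both cases the position that corresponds to 10% of the data (approximately K = 18)
We click OK and the result is: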
Using one of these variables, time (hours), as an example, we give interpretations of the results:
- The average is 317.63 hours for the sample of 181 countries (count), so on average 317.63 hours are needed for the procedure of paying taxes.
- The standard error of the estimate of the average, based on the sample size and the standard deviation in the sample ($\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$), is 23.61 hours.
- The median is 256, so 50% of countries need 256 hours or less for the procedure of paying taxes, while 50% of countries need more than 256 hours.
- The mode is 270, so the most frequently occurring time for the procedure of paying taxes is 270 hours.
- The standard deviation, as the average deviation from the average, is 317.66 hours, so we can calculate the coefficient of variation:
$$V = \frac{\sigma}{\bar{X}} \cdot 100 = \frac{317.66}{317.63} \cdot 100 = 100\%$$
Relative variability around the average is 100%. This information only makes sense in comparison with another series.
- The variance, as the average squared deviation from the average, is 100906.1, but we interpret it through the standard deviation.
- Kurtosis is (19.96+3)=22.96 (Excel reports kurtosis as the excess over the normal value of 3), which is more than 3, so we can conclude that this distribution is significantly more peaked than the normal curve.
- Skewness is 3.77, which is more than 0, so we can conclude that this distribution is significantly right asymmetric in comparison with the normal curve.
- The range, as the difference between the highest and lowest value, is 2600 h.
- The minimal time for the procedure of paying taxes is 0 h.
- The maximal time for the procedure of paying taxes is 2600 h.
- The sum of the data in the series is 57491, but there is no logical interpretation for this information.
- The third quartile is 453, so 75% of countries need 453 hours or less for the procedure of paying taxes, while 25% of countries need more than 453 hours.
- The first quartile is 105, so 25% of countries need 105 hours or less for the procedure of paying taxes, while 75% of countries need more than 105 hours.
- The boundary of the confidence interval for the average at the 95% confidence level is 46.59. The confidence interval for the average with a 95% confidence level is [317.63 ± 46.59] = [271.04, 364.22]. So, with a type I error of 5%, we can conclude that the average time for the procedure of paying taxes lies in the interval [271.04, 364.22] hours.
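The standard error and the confidence interval reported for time (hours) can be checked with simple arithmetic. The Python sketch below uses only the figures quoted in the text (n = 181, average 317.63, standard deviation 317.66, boundary 46.59); the variable names are our own, and the multiplier is derived from those figures rather than taken from a table.

```python
import math

# Figures quoted in the interpretation of Example 4
n = 181
average = 317.63
standard_deviation = 317.66
margin = 46.59          # boundary reported for the 95% confidence level

standard_error = standard_deviation / math.sqrt(n)   # sigma / sqrt(n)
lower = average - margin
upper = average + margin
implied_multiplier = margin / standard_error         # critical value implied by the output

print(round(standard_error, 2))
print(round(lower, 2), round(upper, 2))
print(round(implied_multiplier, 3))
```

The implied multiplier is slightly above 1.96, which is what one would expect if the boundary is based on a t value for a sample of this size rather than on the normal value.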
To see these parameters visually we construct a histogram. We have that option in Data Analysis:
Before we construct the histogram we have to define the intervals according to the minimal and maximal value and the number of intervals we want to create. The maximal value is 2600 and the minimal value is 0, so we set intervals of width 100: 0-100, 100-200, ..., 400-500, 500-600, ..., 2500-2600. The upper limits that are included in those intervals are: 99, 199, ..., 499, 599, ..., 2600. We type these limits in one Excel column (I22:I47).
For Input range we select the column with the original data (C2:C182) and for Bin Range we select the cells where we typed the upper limits of the intervals (I22:I47). We choose the place to save the result and check the option Chart output:
The graph that we get is a graph with vertical bars, but we click on the graph and open Chart options - Options. There we set the gap between bars to 0:
Finally the histogram looks like this:
[Figure: Histogram. X axis: Bin (99, 299, 499, ..., 2499, More); Y axis: Frequency (0 to 60)]
Our interpretation of the parameters of the distribution shape is fully confirmed. It is a strongly positive (right) asymmetric and peaked distribution. This distribution is significantly different from the normal curve.
Example 5. With the aim of analysing the concentration of consumption in the HBS 2008 database, we have taken data on consumption per capita for 23374 individuals from 7071 households:
These are original raw data, so we first construct an appropriate frequency distribution. We need to find the minimal and maximal value of the consumption level in our sample:
DESCRIPTIVE STATISTICSEXAMPLES IN EXCEL
55
According to that we make decision to set up intervals with
width 5000, so we haveupper limits that are included in intervals
(bins): 4999,99, 9999,99, 14999,99, ,54999,99. That limits we will
type in empty column in sheet where are original data:
We select the empty cells in the next column (E6:E16). In function (fx) we choose Frequency:
With CTRL+SHIFT+ENTER we get the frequency distribution:
Now we can start with the construction of the Lorenz curve and the calculation of the Gini coefficient. We need the centres of the intervals and the relative frequencies, but before that we have to form columns with the lower and upper limits of the intervals:
First we calculate the centres of the intervals:
With the Copy-Paste option we get the column with the centres of the intervals:
Then we calculate the relative frequencies:
With the Copy-Paste option we get the column with the relative frequencies:
Then we calculate the relative cumulative frequencies. The first is the same as the first relative frequency, and then we cumulate:
With the Copy-Paste option we get the column with the relative cumulative frequencies:
Then we need the cumulant of the relative aggregate. First we calculate the aggregate (cp) as the product of the centre of the interval and the absolute frequency of the given interval:
With the Copy-Paste option we get the column for the aggregate:
We calculate the relative aggregate as $q_i = \frac{c_i p_i}{\sum_i c_i p_i}$:
With the Copy-Paste option we get the column for the relative aggregate:
At the end we find the relative cumulative aggregate (Q):
With the Copy-Paste option we get the column for the cumulant of the relative aggregate:
To graph the Lorenz curve we take the relative cumulative frequencies for the x axis and the cumulant of the relative aggregate for the y axis. Before that we insert one point with value 0 for both cumulants:
Now we can graph the Lorenz curve:
For the line of perfect equality we take the same data, the relative cumulative frequencies, for both axes.
For the Lorenz curve we take:
Now with Add we insert a new series for the line of perfect equality:
We choose Next and then we get the option to give titles:
Finally the graph looks like this:
The white area is the area of concentration.
We calculate the Gini coefficient, as a quantification of the measure of concentration, according to the relation $G = 1 - \sum_j p_j \left(Q_j + Q_{j-1}\right)$:
With the Copy-Paste option we complete this column:
When we calculate (1 - this sum) we get the Gini coefficient:
And we get the result:
The Gini coefficient is 0.3378, so the distribution of consumption is not perfectly equal, but the level of concentration is not very high.
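The whole procedure of Example 5 (relative frequencies, aggregate, relative aggregate, both cumulants with the starting point 0, and the Gini coefficient as 1 minus the sum) can be sketched in a few lines. The Python example below is only an illustration under our own assumptions: the function name lorenz_gini and the three-interval data are hypothetical and are not the HBS 2008 data.

```python
def lorenz_gini(midpoints, freqs):
    """Return Lorenz curve points (P, Q) and the Gini coefficient for grouped data."""
    n = sum(freqs)
    p = [f / n for f in freqs]                            # relative frequencies
    aggregate = [c * f for c, f in zip(midpoints, freqs)] # centre times frequency
    total = sum(aggregate)
    q = [a / total for a in aggregate]                    # relative aggregate
    P, Q = [0.0], [0.0]                                   # inserted starting point 0
    for p_j, q_j in zip(p, q):
        P.append(P[-1] + p_j)                             # cumulative relative frequency
        Q.append(Q[-1] + q_j)                             # cumulant of relative aggregate
    gini = 1 - sum(p[j] * (Q[j + 1] + Q[j]) for j in range(len(p)))
    return P, Q, gini

# Hypothetical grouped consumption data (assumption, for illustration only)
midpoints = [2500, 7500, 12500]
freqs = [50, 30, 20]
P, Q, gini = lorenz_gini(midpoints, freqs)
print(P)
print(Q)
print(round(gini, 4))

# Every unit has the same value: the Lorenz curve lies on the line of perfect equality
_, _, gini_equal = lorenz_gini([100, 100], [5, 5])
print(gini_equal)
```

The lists P and Q are the x and y coordinates of the Lorenz curve; plotting P against itself gives the line of perfect equality.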
-
EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONSEXAMPLES
IN EXCEL
67
II. Empirical versus appropriate theoretical distributions (approximations with binomial, Poisson, hypergeometric or normal distribution)
PROBABILITY DISTRIBUTIONS
A frequency distribution formed by grouping population units according to the same characteristics is an empirical distribution. A distribution formed on the basis of theoretical assumptions is a theoretical distribution. The main characteristics of theoretical distributions are:
- We assume them in some statistical model, or we set them up as a hypothesis that we have to test.
- Theoretical distributions are given as an analytic model with known parameters: expectation, mode, median, standard deviation, skewness and kurtosis.
- Theoretical distributions are given as probability distributions.
A probability for which we know the number of possible outcomes of an event and the number of successful realizations is an a priori probability. In statistical research, however, we most often do not know the probability a priori, so we try to obtain it through experiment, a posteriori. The a posteriori probability is the empirical or statistical probability.
The empirical (a posteriori) probability is the limiting value of the relative frequency of the number of successes of event A when the number of trials is large, that is, when it tends to infinity:
$$p(A) = \lim_{n \to \infty} \frac{m}{n}$$
where m is the number of successful realizations and n is the number of trials.
The cumulative function of a discrete variable X, F(x), gives the probability that X takes values lower than or equal to some real number $x_i$:
$$F(x_i) = P(X \le x_i) = \sum_{x_j \le x_i} p(x_j).$$
The cumulative function of a continuous variable X, F(x), has the general form
$$F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx,$$
and it is determined by parameters such as expectation and variance.
If a discrete variable X can take the values $x_1, x_2, \ldots, x_k$ with probabilities $p(x_1), p(x_2), \ldots, p(x_k)$, where the sum of the probabilities has to be 1, the expectation of X is:
$$E(X) = p(x_1)x_1 + p(x_2)x_2 + \ldots + p(x_k)x_k = \sum_{i=1}^{k} p(x_i)\,x_i.$$
For a continuous variable the expectation is:
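$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \quad -\infty < x < \infty.$$
Variance for a discrete variable is:
$$\sigma^2 = \sum_{i=1}^{k} \left(x_i - E(X)\right)^2 p(x_i),$$
that is,
$$\sigma^2 = \sum_{i=1}^{k} x_i^2\, p(x_i) - \left(\sum_{i=1}^{k} x_i\, p(x_i)\right)^2.$$
Variance for a continuous variable is:
$$\sigma^2 = \int_{-\infty}^{\infty} \left(x - E(X)\right)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left(\int_{-\infty}^{\infty} x f(x)\,dx\right)^2.$$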
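For a discrete variable, the expectation and both forms of the variance can be checked numerically. The Python sketch below is only an illustration: the function names and the three-point distribution are our own hypothetical choices.

```python
def expectation(values, probabilities):
    """E(X) = sum of p(x_i) * x_i."""
    return sum(x * p for x, p in zip(values, probabilities))

def variance(values, probabilities):
    """Variance as the expected squared deviation from E(X)."""
    mu = expectation(values, probabilities)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probabilities))

def variance_shortcut(values, probabilities):
    """Variance as sum of x_i^2 * p(x_i) minus the squared expectation."""
    second_moment = sum(x ** 2 * p for x, p in zip(values, probabilities))
    return second_moment - expectation(values, probabilities) ** 2

# Hypothetical discrete distribution (assumption); probabilities sum to 1
values = [0, 1, 2]
probabilities = [0.25, 0.5, 0.25]

print(expectation(values, probabilities))
print(variance(values, probabilities), variance_shortcut(values, probabilities))
```

Both forms of the variance give the same result, which is a useful check when the calculation is done by hand or in a spreadsheet.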
Theoretical probability distributions can be split into 2 groups:
- discrete probability distributions, which deal with discrete events:
  o binomial distribution
  o Poisson distribution
  o hypergeometric distribution.
- continuous probability distributions, which deal with continuous events:
  o normal distribution
  o Student (t) distribution
  o χ² (chi-square) distribution
  o F distribution.
The probability distribution of a random variable describes the probability of all possible outcomes. The sum (integral) of these probabilities will equal 1.
BINOMIAL DISTRIBUTION
The binomial distribution is used when the discrete random variable of interest is the number of successes obtained in a sample of n observations. It is used to model situations that have the following properties:
- The sample consists of a fixed number of observations n.
- Each observation is classified into one of two mutually exclusive categories, usually called success and failure.
- The probability of an observation being classified as success, noted as p, is constant from observation to observation. Thus, the probability of an observation being classified as failure, noted as (1-p)=q, is constant over all observations.
- The outcome (success or failure) of any observation is independent of the outcome of any other observation.
The binomial distribution has two parameters:
- n - the number of observations, trials or experiment repetitions.
-
EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONSEXAMPLES
IN EXCEL
69
p the probability of success (occurrences of a given event) on a
singleobservation, trial or experiment.
Probability distribution of a binomial random variable
The probability distribution of a binomial random variable is:
$$p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,$$
where x is the exact number of successes of interest and p(x) is the probability that exactly x successes are realized among n trials (the given event occurs exactly x times).
Binomial probability function 1
Example 1.
An insurance broker believes that, for a particular contact, the probability of making a sale is 0.4. Suppose now that he has five contacts. What is the probability that he will make exactly three sales among these five contacts?
Solution:
The discrete random variable X takes the value 1 if a sale is made and 0 if it is not, so it is a dichotomous variable that can be treated with the binomial distribution. The sales experiment is repeated 5 times, so n = 5. Because the variable is dichotomous, we apply the binomial distribution:
1 From Wikipedia, the free encyclopedia
$$p(1) = p = 0.4, \quad p(0) = q = 1 - 0.4 = 0.6, \quad n = 5, \quad x = 3$$
$$p(x) = \binom{n}{x} p^x (1-p)^{n-x} \;\Rightarrow\; p(3) = \binom{5}{3} \cdot 0.4^3 \cdot 0.6^2 \approx 0.23$$
The probability that he makes exactly three sales among these five contacts is about 23%.
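The calculation in Example 1 can be checked with a short Python sketch. The function name `binomial_pmf` is an assumption of this sketch, not something defined in the text; it simply evaluates the binomial formula given above.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Example 1: n = 5 contacts, p = 0.4, exactly x = 3 sales.
p_three_sales = binomial_pmf(3, 5, 0.4)
print(round(p_three_sales, 4))  # 0.2304
```

The unrounded value is 10 · 0.064 · 0.36 = 0.2304, which rounds to the 23% quoted in the example.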
Characteristics of the binomial distribution
Shape: the binomial distribution can be symmetrical (if p = 0.5) or skewed (if p ≠ 0.5).
Mean: $E(X) = n \cdot p$
Variance: $\sigma^2 = E\left[(X - E(X))^2\right] = n \cdot p \cdot (1 - p)$
We distinguish four types of binomial distribution:
• symmetric, if p = q = 0.5
• asymmetric, if p ≠ q
• a priori, if the probabilities p and q are known
• a posteriori, if p and q have to be found by an empirical method
The conditions for approximating an empirical distribution with the binomial distribution are:
$$0 < \frac{\bar{X}}{n} < 1$$
$$\sigma^2 \approx \bar{X}\left(1 - \frac{\bar{X}}{n}\right)$$
The error of approximation is a measure of the quality of the approximation. The error of approximation for each modality is $d_k = f_k - f_k^b$, where $f_k$ is the empirical frequency and $f_k^b$ is the theoretical frequency, so the overall error of approximation is:
$$\sigma_b^2 = \frac{1}{n+1} \sum_k d_k^2$$
Example 2.
The accounting office of a company has information that 40% of customers do not settle their obligations on time because of inflation. If we randomly select 6 customers, what is the probability:
1. that all customers settled their obligations on time;
2. that more than 3/4 of the customers settled their obligations on time;
3. that 50% or more of the customers did not settle their obligations on time?
Solution:
p = 60% = 0.6 (obligation settled on time)
q = 40% = 0.4 (obligation not settled on time)
n = 6
$$p(x) = \binom{n}{x} p^x q^{n-x}$$
1. The probability that all customers settled their obligations on time is, according to the table, p(6) = 4.67%.
2. The probability that more than 3/4 of the customers settled their obligations on time: 3/4 of 6 is 4.5, so we take the probabilities for x = 5 and x = 6. According to the table p(5) = 18.66% and p(6) = 4.67%, so the final result (by the addition theorem) is 23.33%.
3. The probability that 50% or more of the customers did not settle their obligations on time: 50% of 6 is 3, so at least 3 customers are late, which means that at most 3 customers settle on time, x = 0, 1, 2, 3. Using the table, this is 1 - (0.311040 + 0.186624 + 0.046656) = 1 - 0.54432 = 0.45568, about 45.57%. (The sum over x = 3, 4, 5, 6, which equals 0.8208 or 82.08%, is the probability that at least 3 customers settle on time.)
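The probabilities used in Example 2 can be tabulated with a small Python sketch. The helper `binomial_pmf` and the variable names are assumptions of this sketch. Note that x counts the customers who settle on time (p = 0.6), so "at least 3 customers are late" corresponds to x ≤ 3, while x ≥ 3 means "at least 3 customers settle on time".

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 6, 0.6  # x counts customers who settle on time
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

all_on_time = pmf[6]                        # x = 6
more_than_three_quarters = pmf[5] + pmf[6]  # x = 5 or x = 6
at_least_three_on_time = sum(pmf[3:])       # x >= 3
at_most_three_on_time = sum(pmf[:4])        # x <= 3, i.e. at least 3 late

# Print the distribution table: x and p(x).
for x, prob in enumerate(pmf):
    print(x, round(prob, 6))
```

The two complementary-looking events differ because x = 3 belongs to both: their probabilities are 0.8208 and 0.45568 respectively.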
Example 3.
Among 1000 products, 28 are defective. If we randomly select 14 products for a sample, what is the probability that:
a) the sample contains exactly 4 defective products;
b) the sample contains at most 2 defective products;
c) the sample contains at least 4 defective products?
Solution (by Excel):
This is a dichotomous variable, so we apply the binomial distribution with modalities x: 0, 1, 2, 3, 4, ..., 14.
$$p = \frac{28}{1000} = 0.028, \quad q = 0.972, \quad n = 14$$
$$X:\; x_k = k,\; k = 0, \ldots, 14: \quad p_k^b = P(x = k) = \binom{14}{k} \cdot 0.028^k \cdot 0.972^{14-k}$$
We will use the Excel function:
Table referred to in Example 2 (binomial distribution, n = 6, p = 0.6):

x_i   p(x)       F(x)
0     0.004096   0.004096
1     0.036864   0.040960
2     0.138240   0.179200
3     0.276480   0.455680
4     0.311040   0.766720
5     0.186624   0.953344
6     0.046656   1.000000
a) The sample contains exactly 4 defective products.
We ask for the probability at a point, not for the cumulative function, so we set the option Cumulative to FALSE.
=BINOMDIST(4;14;0.028;FALSE) = 0.000463, i.e. 0.0463%
b) The sample contains at most 2 defective products (0, 1 or 2 defective products). This is a cumulative probability, so we set the option Cumulative to TRUE.
=BINOMDIST(2;14;0.028;TRUE) = 0.993662, i.e. 99.3662%
c) The sample contains at least 4 defective products (4, 5 or more). This is the opposite of the cumulative event "at most 3 defective products" (0, 1, 2 or 3 defective products). The probabilities of an event and its opposite sum to 1, so we use Excel to get the probability of the opposite event and then subtract it from 1:
=1-BINOMDIST(3;14;0.028;TRUE) = 1 - 0.999509 = 0.000491
that is, 0.0491%.

Example 4.
To monitor the work of an automatic machine, an inspector takes samples of 10 products. From 50 samples we obtained the following information about the number of defective products:

Number of products with defect   Number of samples
0                                6
1                                11
2                                15
3                                10
4                                7
5                                1
Total                            50

We have to find an appropriate theoretical approximation for this empirical distribution.
Solution:
This is a discrete random variable. Each trial has two modalities: a product can be correct or defective. This indicates that the appropriate theoretical distribution is the binomial distribution. From the empirical frequency distribution we calculate the average and the standard deviation. We cannot use an Excel function directly, because this is a grouped distribution, so we set up formulas for calculating the average and the standard deviation:
N = 50, n = 10
Result is: [Excel screenshot]
Alternatively, we create a new column $x_k f_k$ and divide the sum of that column by the sum of the absolute frequencies:

x_k     f_k    x_k·f_k
0       6      0
1       11     11
2       15     30
3       10     30
4       7      28
5       1      5
Total   50     104

$$\bar{X} = \frac{\sum x_k f_k}{N} = \frac{104}{50} = 2.08$$
Then we calculate the standard deviation:
Result is: [Excel screenshot]
Alternatively, we create a new column $x_k^2 f_k$ and use the general formula
$$\sigma^2 = \frac{\sum x_k^2 f_k}{N} - \bar{X}^2:$$

x_k     f_k    x_k·f_k   x_k²·f_k
0       6      0         0
1       11     11        11
2       15     30        60
3       10     30        90
4       7      28        112
5       1      5         25
Total   50     104       298

$$\sigma^2 = \frac{\sum x_k^2 f_k}{N} - \bar{X}^2 = \frac{298}{50} - 2.08^2 = 1.63 \;\Rightarrow\; \sigma = 1.278$$
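Now we test whether the conditions for the binomial approximation are satisfied:
$$\bar{X}\left(1 - \frac{\bar{X}}{n}\right) = 2.08 \left(1 - \frac{2.08}{10}\right) = 1.65 \approx \sigma^2 = 1.63$$
$$0 < \frac{\bar{X}}{n} = 0.208 < 1$$
The conditions are satisfied, so we can apply the approximation. Then
$$p = \frac{\bar{X}}{n} = 0.208$$
and q = 0.792.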
$$p_x^b = \binom{10}{x} \cdot 0.208^x \cdot 0.792^{10-x}, \quad x = 0, \ldots, 5; \qquad f_x^b = p_x^b \cdot N$$
In Excel we create a formula for calculating the probabilities $p_x^b$, and then from these theoretical probabilities we compute the theoretical frequencies $f_x^b = p_x^b \cdot N$:
[Excel screenshot]
With the Paste option we can complete the other cells in the column of theoretical probabilities. Result is: [Excel screenshot]
Now we compute the theoretical frequencies: [Excel screenshot]
With the Paste option we can complete the other cells in the column of theoretical frequencies. Result is: [Excel screenshot]
That was the procedure for approximation with the binomial distribution. We now have a distribution for this variable and can make predictions. The quality of the approximation is measured by the error of approximation.
The error of approximation for each modality is: $d_k = f_k - f_k^b$
Because the errors have different signs, we square them: [Excel screenshot]
We sum the squared errors: [Excel screenshot]
$$\sigma_b^2 = \frac{1}{n+1} \sum_k d_k^2 = \frac{1}{11} \cdot 9.589 = 0.872$$
The error of approximation is 0.872.
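The whole procedure of Example 4 (grouped mean and variance, estimate of p, theoretical frequencies and error of approximation) can be sketched in Python as an alternative to the spreadsheet columns. The variable names are assumptions of this sketch; the data and the divisor n + 1 are taken from the example.

```python
from math import comb

# Empirical distribution from Example 4.
defects = [0, 1, 2, 3, 4, 5]    # x_k: defective products per sample
samples = [6, 11, 15, 10, 7, 1]  # f_k: number of samples
n = 10                           # products per sample
N = sum(samples)                 # number of samples (50)

# Grouped mean and variance.
mean = sum(x * f for x, f in zip(defects, samples)) / N
var = sum(x * x * f for x, f in zip(defects, samples)) / N - mean ** 2

# Estimated probability of a defect on a single product.
p = mean / n

# Theoretical (binomial) frequencies f_k^b = p_k^b * N.
theoretical = [N * comb(n, x) * p ** x * (1 - p) ** (n - x) for x in defects]

# Error of approximation: sum of squared differences divided by n + 1.
sum_sq = sum((f - fb) ** 2 for f, fb in zip(samples, theoretical))
error = sum_sq / (n + 1)

print(round(mean, 2), round(var, 2), round(p, 3))
print([round(fb, 2) for fb in theoretical])
print(round(sum_sq, 3), round(error, 3))
```

Under these assumptions the sketch gives a mean of 2.08, a variance of about 1.63, a sum of squared errors of about 9.589 and an error of approximation of about 0.872, in line with the values in the example.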
POISSON DISTRIBUTION
The Poisson distribution is a useful discrete probability distribution when you are interested in the number of times a certain event will occur in a given unit of area or time. This type of situation occurs frequently in business. The probability that two or more events occur in an area of opportunity approaches zero as the area of opportunity becomes smaller. The Poisson distribution has one parameter, λ > 0, which is the average or expected number of events per unit.
Probability distribution of a Poisson random variable
The probability distribution of a Poisson random variable is:
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$$
where:
• x is the number of events per unit (number of successes per unit);
• p(x) is the probability of x successes, given knowledge of λ;
• λ is the average number of events per unit (average number of successes per unit);
• e = 2.71828 (constant).
Poisson probability function 2
The horizontal axis is the index k. The function is only defined at integer values of k (empty lozenges). The connecting lines are only guides for the eye.
Example 5.
If the probability that an individual is late for work on a Friday is 0.001, determine the probability that, out of 2000 individuals,
a) exactly 3
b) more than 2
individuals will be late for work on Friday.
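Solution:
p = 0.001 is the probability that an individual is late for work on Friday (rare event, so we use the Poisson distribution).
$$\lambda = N \cdot p = 2000 \cdot 0.001 = 2, \qquad p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-2}\, 2^x}{x!}$$
a)
$$p(3) = \frac{e^{-2}\, 2^3}{3!} = 0.18$$
There is an 18% chance that exactly 3 out of 2000 individuals will be late for work on Friday.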
2 From Wikipedia, the free encyclopedia
b)
$$p(x > 2) = p(3) + p(4) + \ldots = 1 - \left[p(0) + p(1) + p(2)\right] = 1 - \left(\frac{e^{-2}\, 2^0}{0!} + \frac{e^{-2}\, 2^1}{1!} + \frac{e^{-2}\, 2^2}{2!}\right) = 0.323$$
There is a 32.3% chance that more than 2 out of 2000 individuals will be late for work on Friday.
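Both parts of Example 5 can be verified with a brief Python sketch. The function name `poisson_pmf` is an assumption of this sketch; it evaluates the Poisson formula given earlier, and part b) uses the complement of x = 0, 1, 2.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x events when the average per unit is lam."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 2000 * 0.001  # lambda = N * p = 2

p_exactly_3 = poisson_pmf(3, lam)
p_more_than_2 = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(round(p_exactly_3, 3), round(p_more_than_2, 3))
```

Working with the complement avoids summing the infinite tail p(3) + p(4) + ... directly.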
Example 6.
Suppose that, on average, three customers arrive per minute at the bank during the noon to 1 p.m. hour. What is the probability that exactly two customers will arrive in a given minute?
Solution:
We are interested in the number of times a certain event will occur in a given unit of time, so we use the Poisson distribution.
$$\lambda = 3, \qquad p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-3}\, 3^x}{x!}$$
$$p(2) = \frac{e^{-3}\, 3^2}{2!} = 0.224$$
There is a 22.4% probability that exactly two customers will arrive in a given minute.
Example 7.
If the probability that a randomly selected person is colour-blind is 0.3%, what is the probability that among 2800 persons we will find:
a) 4 colour-blind persons;
b) more than 3 colour-blind persons;
c) not more than 2 colour-blind persons?
Solution (by Excel):
p = 0.003 = 0.3%, a rare event, so we use the Poisson distribution.
$$\lambda = n \cdot p = 2800 \cdot 0.003 = 8.4, \qquad p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-8.4}\, 8.4^x}{x!}$$
We will use the Excel function:
a) Exactly 4 colour-blind persons.
We ask for the probability at a point, not for the cumulative function, so we set the option Cumulative to FALSE.
P(X = 4) = POISSON(4;8.4;FALSE) = 0.046648, i.e. 4.6648%
b) More than 3 colour-blind persons. This is the opposite of a cumulative event, so we set the option Cumulative to TRUE and at the end take the probability of the opposite event:
1 - P(X ≤ 3) = 1 - POISSON(3;8.4;TRUE) = 1 - 0.03226 = 0.96774, i.e. 96.774%
c) Not more than 2 colour-blind persons. This is a cumulative probability, so we set the option Cumulative to TRUE.
P(X ≤ 2) = POISSON(2;8.4;TRUE) = 0.010047, i.e. 1.0047%
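Characteristics of the Poisson distribution
Shape: the Poisson distribution is always positively (right) skewed.
Mean: $E(X) = \lambda$
Variance: $\sigma^2 = E\left[(X - \lambda)^2\right] = \lambda$
Skewness and kurtosis: $\alpha_3 = \frac{1}{\sqrt{\lambda}}$, $\alpha_4 = 3 + \frac{1}{\lambda}$.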
The Poisson distribution can be derived as a limiting case of the binomial distribution as the number of trials goes to infinity and the expected number of successes remains fixed. Therefore it can be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is at least 20 and p is smaller than or equal to 0.05. According to this rule the approximation is excellent if n ≥ 100 and np ≤ 10.
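How close the two distributions are can be illustrated numerically. The sketch below is only an illustration: it borrows n = 2000 and p = 0.001 from Example 5 (where n ≥ 100 and np = 2 ≤ 10), and the function names are assumptions of this sketch.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Exact binomial probability of x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Poisson probability of x events with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)


n, p = 2000, 0.001  # values borrowed from Example 5
lam = n * p         # matching Poisson parameter

# Largest pointwise difference between the two distributions for x = 0..10.
max_gap = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11)
)

for x in range(5):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
print(max_gap)
```

For these values the two columns agree to about three decimal places, which is why the Poisson formula was acceptable in Example 5.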
Example 8.
In an office there is a copy machine. We want to determine the average number of incorrect copies. We take samples of 1000 copies; the number of trials was 250 and the results are:

Number of incorrect copies   Number of samples
0                            10
1                            20
2                            40
3                            55
4                            50
5                            40
6                            15
7                            10
8                            5
9                            3
10                           2
Total                        250

We have to find an appropriate theoretical approximation for this empirical distribution.
Solution:
This is a discrete random variable. Each trial has two modalities: a copy can be correct or incorrect. This indicates that the appropriate theoretical distribution is the binomial or the Poisson distribution. From the empirical frequency distribution we calculate the average and the variance. We cannot use an Excel function directly, because this is a grouped distribution, so we set up formulas for calculating the average and the variance:
n = 100, N = 250
Result for the average is: [Excel screenshot]
We will find the variance: [Excel screenshot]
Result for the variance is: [Excel screenshot]
Since $\sigma^2 \approx \bar{X}$, the Poisson distribution is appropriate, with $\lambda = \bar{X} = 3.65$:
$$p_x^p = \frac{3.65^x}{x!}\, e^{-3.65}$$
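The grouped average and variance for Example 8 can be recomputed with a short Python sketch, which may help where the Excel results are not visible. The variable names are assumptions of this sketch; the frequencies are those of the table in the example.

```python
# Empirical distribution from Example 8.
incorrect = list(range(11))                            # x_k: 0..10
samples = [10, 20, 40, 55, 50, 40, 15, 10, 5, 3, 2]    # f_k

N = sum(samples)  # total number of samples (250)

# Grouped mean: sum of x_k * f_k divided by N.
mean = sum(x * f for x, f in zip(incorrect, samples)) / N

# Grouped variance: sum of x_k^2 * f_k divided by N, minus the squared mean.
var = sum(x * x * f for x, f in zip(incorrect, samples)) / N - mean ** 2

print(N, round(mean, 2), round(var, 2))
```

The mean and the variance come out close to each other (about 3.65 and 3.76), which is the property of the Poisson distribution that the example relies on.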
In Excel we create a formula for calculating the probabilities
$$p_x^p = \frac{3.65^x}{x!}\, e^{-3.65}, \quad x \ge 0,$$
and then from these theoretical probabilities we compute the theoretical frequencies $f_x^b = p_x^p \cdot N$: [Excel screenshot]
With the Paste option we can complete the other cells in the column of theoretical probabilities. Result is: [Excel screenshot]
Now we calculate the theoretical frequencies:
[Excel screenshot]
With the Paste option we can complete the other cells in the column of theoretical frequencies. Result is: [Excel screenshot]
That was the procedure for approximation with the Poisson distribution. We now have a distribution for this variable and can make predictions. The quality of the approximation is measured by the error of approximation.
The error of approximation for each modality is: $d_k = f_k - f_k^b$ [Excel screenshot]
Because the errors have different signs, we square them: [Excel screenshot]
We sum the squared errors: [Excel screenshot]
$$\sigma_b^2 = \frac{1}{n+1} \sum_k d_k^2 = \frac{1}{251} \cdot 1941.47 = 7.76$$
The approximation error is 7.76.
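HYPERGEOMETRIC DISTRIBUTION
The hypergeometric distribution H(N, n, p) is the distribution for n dependent Bernoulli random variables. It describes sampling without replacement. The symbols are:
• N: number of elements in the population
• M: number of elements in the population with characteristic A
• n: number of elements in the sample
• k: number of elements in the sample with characteristic A, with $n \le N$ and $k \le M \le N$
$p_k^h$ is the probability that a sample from that population contains k elements with characteristic A:
$$p_k^h = P(X = k) = \frac{\binom{N_1}{k} \binom{N_2}{n-k}}{\binom{N}{n}} = \frac{C_{N_1}^{k}\, C_{N_2}^{n-k}}{C_{N}^{n}}$$
Here $N_1$ and $N_2$ are read as the numbers of population elements with and without characteristic A, so that $N_1 = M$ and $N_2 = N - M$.
The expectation and variance are:
$$E(X) = n \frac{N_1}{N}; \qquad \sigma^2 = n\, \frac{N_1}{N} \cdot \frac{N_2}{N} \cdot \frac{N - n}{N - 1}$$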
This distribution has application in sampling procedure. When is
(n/N
Probability that we select 0 incorrect products:
=HYPGEOMDIST(0;4;9;30) = 0.218391, i.e. 21.84%
Probability that we select 1 incorrect product:
=HYPGEOMDIST(1;4;9;30) = 0.436782, i.e. 43.68%
Probability that we select 2 incorrect products:
=HYPGEOMDIST(2;4;9;30) = 0.275862, i.e. 27.59%
Finally, the probability that we have not more than 2 incorrect products is the sum of the probabilities found above (the "or" probability for mutually exclusive events): 0.931034, i.e. 93.1%
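The three Excel results can be reproduced with a small Python sketch of the hypergeometric formula. The setting is an assumption read off the arguments of HYPGEOMDIST (successes in sample; sample size; successes in population; population size): a population of 30 products, 9 of them incorrect, and a sample of 4. The function name `hypergeom_pmf` is also an assumption of this sketch.

```python
from math import comb


def hypergeom_pmf(k, n, M, N):
    """P(X = k): k marked elements in a sample of n drawn without
    replacement from a population of N that contains M marked elements."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


# Assumed setting, taken from HYPGEOMDIST(k;4;9;30).
sample_size, incorrect_in_population, population = 4, 9, 30

probs = [
    hypergeom_pmf(k, sample_size, incorrect_in_population, population)
    for k in range(3)
]
at_most_two = sum(probs)

print([round(p, 6) for p in probs])
print(round(at_most_two, 6))
```

Summing the three probabilities is valid because the events "0 incorrect", "1 incorrect" and "2 incorrect" are mutually exclusive.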
NORMAL DISTRIBUTION
The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale: the mean ("average", μ) and variance (standard deviation squared, σ²) respectively.
The continuous probability density function of the normal distribution is the Gaussian function:
$$f(x_i) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x_i - E(X)}{\sigma} \right)^2}, \quad -\infty < x_i < \infty$$
where σ > 0 is the standard deviation and the real parameter μ = E(X) is the expected value. To indicate that a real-valued random variable X is normally distributed with mean μ and variance σ² ≥ 0, we write
$$X \sim N(\mu; \sigma^2)$$
Normal probability density function 3
The red line is the standard normal distribution
The standard normal distribution is the normal distribution with a mean of zero and a variance of one (the red curve in the plot). According to the transformation formula, this gives:
$$z_i = \frac{x_i - E(X)}{\sigma}, \quad \varphi(z_i) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z_i^2}{2}}, \quad Z \sim N(0, 1),$$
$$E(Z) = 0, \quad \sigma_Z^2 = 1, \quad \alpha_3 = 0, \quad \alpha_4 = 3$$
The probability density function has notable properties, including:
• symmetry about its mean μ;
• the mode and median both equal the mean μ;
• the inflection points of the curve occur one standard deviation away from the mean, i.e. at μ - σ and μ + σ.
The cumulative distribution function of a probability distribution, evaluated at a number (lower-case) x, is the probability of the event that a random variable X with that distribution is less than or equal to x. The cumulative distribution function of the normal distribution is expressed in terms of the density function as follows:
$$F(x_i) = p(X \le x_i) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x_i} e^{-\frac{1}{2} \left( \frac{x - E(X)}{\sigma} \right)^2} dx$$
3 From Wikipedia, the free encyclopedia
The cumulative distribution function of a probability distribution, evaluated at a number (lower-case) z, is the probability of the event that a random variable Z with that distribution is less than or equal to z. The cumulative distribution function of the standardized normal distribution (red line) is expressed in terms of the density function as follows:
$$F(z_i) = p(Z \le z_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_i} e^{-\frac{z^2}{2}}\, dz$$
There are tables with values of the cumulative distribution function of the standardized normal distribution.
Rules for the standardized normal distribution
The rules for determining probabilities in different kinds of cases with the standardized normal distribution are:
1. $p(Z > z_i) = 1 - F(z_i)$
2. for $z_i < z_j$: $p(z_i < Z < z_j) = F(z_j) - F(z_i)$
5. $p(Z < -z_i) = 1 - F(z_i)$
6. $p(-z_i < Z < z_i) = F(z_i) - F(-z_i) = 2F(z_i) - 1$
The next two graphs illustrate how the area under the curve of the standardized normal distribution (the probability) is determined:
1. $p(z \le 1.25) = F(1.25)$
[Figure: area under the standardized normal curve to the left of z = 1.25]
2. $p(z \le -1.25) = F(-1.25)$; $\; p(z \le -1.25) = p(z \ge 1.25) = 1 - p(z \le 1.25)$, so $F(-1.25) = 1 - F(1.25)$
[Figure: area under the standardized normal curve to the left of z = -1.25]
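Instead of reading F(z) from a table, the values in the two illustrations can be computed with a short Python sketch. It relies on the standard identity F(z) = ½·(1 + erf(z/√2)); the function name `standard_normal_cdf` is an assumption of this sketch.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """F(z) = P(Z <= z) for the standardized normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))


f_plus = standard_normal_cdf(1.25)    # area to the left of z = 1.25
f_minus = standard_normal_cdf(-1.25)  # area to the left of z = -1.25

print(round(f_plus, 4), round(f_minus, 4))
# Symmetry: F(-1.25) equals 1 - F(1.25) up to rounding.
print(abs(f_minus - (1 - f_plus)) < 1e-12)
```

The first value, about 0.8944, is what a table of the standardized normal distribution lists for z = 1.25.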
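Characteristic intervals for normal distribution
If $X \sim N(\mu; \sigma^2)$, then we have characteristic intervals for distances of one, two and three standard deviations from the mean:
$$p(\mu - \sigma < X < \mu + \sigma) = 68.3\%$$
$$p(\mu - 2\sigma < X < \mu + 2\sigma) = 95.4\%$$
$$p(\mu - 3\sigma < X < \mu + 3\sigma) = 99.7\%$$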
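These three percentages follow from the standardized distribution, since the probability of lying within k standard deviations of the mean is 2F(k) - 1. The Python sketch below checks them; the function names are assumptions of this sketch.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """F(z) = P(Z <= z) for the standardized normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def within_k_sigma(k):
    """P(mu - k*sigma < X < mu + k*sigma) = 2 * F(k) - 1."""
    return 2 * standard_normal_cdf(k) - 1


for k in (1, 2, 3):
    print(k, round(100 * within_k_sigma(k), 1))
```

The result does not depend on μ or σ, because standardizing X maps each interval to (-k, k).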
Example 5.
The tread life of a certain brand of tire has a normal distribution with mean 35000 miles and standard deviation 4000 miles. For a randomly selected tire, what is the probability that its life is:
a) less than 37200 miles
b) more than 38000 miles
c) between 30000 and 36000 miles
d) less than 34000 miles
e) more than 33000 miles.
Solution:
$X \sim N(35000; 4000^2)$
First we have to standardize, i.e. transform x into z. We use