3/24/2014
Introduction to Statistics
Why Statistics?
• To develop an appreciation for variability and how it affects products and processes.
• To study methods that can be used to help solve problems, build knowledge, and continuously improve products and processes.
• To build an appreciation for the advantages and limitations of informed observation and experimentation.
• To determine how to analyze data from designed experiments in order to build knowledge and continuously improve.
• To develop an understanding of some basic ideas of statistical reliability and the analysis of data.
Data and Statistics
Data consists of information coming from observations, counts, measurements, or responses.
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.
A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
A sample is a subset of a population.
Populations & Samples
Example: In a recent survey, 250 university students at CBU were asked if they smoked cigarettes regularly. 35 of the students said yes. Identify the population and the sample.
Population: the responses of all students at CBU.
Sample: the responses of the 250 students in the survey.
Parameters & Statistics
A parameter is a numerical description of a population characteristic.
A statistic is a numerical description of a sample characteristic.
Parameter: describes a Population
Statistic: describes a Sample
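As a small illustration of the distinction, the smoking survey quoted earlier yields a statistic, because the figures come from a sample of 250 students. The sketch below only uses the two numbers given in that example; the corresponding parameter (the proportion among all CBU students) is unknown and is not computed here.

```python
# Figures quoted in the CBU smoking survey example.
sample_size = 250        # students surveyed (the sample)
smokers_in_sample = 35   # students who answered "yes"

# A statistic describes the sample: here, the sample proportion of smokers.
sample_proportion = smokers_in_sample / sample_size

print(f"Sample proportion of smokers: {sample_proportion:.2f}")
```

The value describes only the 250 respondents; using it to say something about all students is a matter of inference.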
Parameters & Statistics
Example: Decide whether the numerical value describes a population parameter or a sample statistic.
a.) A recent survey of a sample of 450 university students reported that the average weekly income for students is K600,000.
Because the average of K600,000 is based on a sample, this is a sample statistic.
b.) The average weekly income for all students is K500,000.
Because the average of K500,000 is based on a population, this is a population parameter.
Types of sampling techniques
• Random Sampling
  • Sampling in which the data is collected using chance methods or random numbers.
• Systematic Sampling
  • Sampling in which data is obtained by selecting every kth object.
• Convenience Sampling
  • Sampling in which data which is readily available is used.
• Stratified Sampling
  • Sampling in which the population is divided into groups (called strata) according to some characteristic.
  • Each of these strata is then sampled using one of the other sampling techniques.
• Cluster Sampling
  • Sampling in which the population is divided into groups (usually geographically).
  • Some of these groups are randomly selected, and then all of the elements in those groups are selected.
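The techniques listed above can be sketched in a few lines of Python using only the standard `random` module. This is a minimal illustration, not a survey-grade implementation: the population of ID numbers, the four groups, and all function names are hypothetical, and convenience sampling is left out because it involves no selection rule.

```python
import random

def simple_random_sample(population, n, rng):
    """Random sampling: every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Systematic sampling: random start, then every k-th object."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified sampling: draw a random sample from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: pick whole groups at random, keep all their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical ID numbers 1..100

# Hypothetical grouping: four groups of 25 consecutive IDs.
groups = {f"group_{i}": population[i * 25:(i + 1) * 25] for i in range(4)}

random_ids = simple_random_sample(population, 10, rng)
systematic_ids = systematic_sample(population, 10, rng)
stratified_ids = stratified_sample(groups, 3, rng)
cluster_ids = cluster_sample(groups, 2, rng)

print(random_ids)
print(systematic_ids)
print(stratified_ids)
print(len(cluster_ids))
```

Note the difference between the last two: stratified sampling takes a few members from every group, while cluster sampling takes every member from a few groups.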
Branches of Statistics
The study of statistics has two major branches: descriptive statistics and inferential statistics.
• Descriptive statistics: Involves the organization, summarization, and display of data.
• Inferential statistics: Involves using a sample to draw conclusions about a population.
Descriptive and Inferential Statistics
Example: In a recent study, volunteers who had less than 6 hours of sleep were four times more likely to answer incorrectly on a science test than were participants who had at least 8 hours of sleep. Decide which part is the descriptive statistic and what conclusion might be drawn using inferential statistics.
The statement “four times more likely to answer incorrectly” is a descriptive statistic. An inference drawn from the sample is that all individuals sleeping less than 6 hours are more likely to answer science questions incorrectly than individuals who sleep at least 8 hours.
Note: The development of inferential statistics has occurred only since the early 1900s.
Examples:
1. The medical team that develops a new vaccine for a disease is interested in what would happen if the vaccine were administered to all people in the population.
2. A marketing expert may test a product in a few “representative” areas. From the resulting information, he or she will draw conclusions about what would happen if the product were made available to all potential customers.
The Essential Elements of a Statistical Problem
The objective of statistics is to make inferences (predictions and/or decisions) about a population based upon the information contained in a sample. A statistical problem involves the following:
1. A clear definition of the objectives of the experiment and the pertinent population. For example, clear specification of the questions to be answered.
2. The design of the experiment or sampling procedure. This element is important because data cost money and time.
3. The collection and analysis of data.
4. The procedure for making inferences about the population based upon the sample information.
5. The provision of a measure of goodness (reliability) of the inference. This is the most important step, because without a measure of reliability the inference has no meaning and is useless.
Note: The above steps for solving any statistical problem are sequential.
Data Classification
Types of Data
Data sets can consist of two types of data: qualitative (attribute) data and quantitative (numerical) data.
• Qualitative data: Consists of attributes, labels, or nonnumerical entries.
• Quantitative data: Consists of numerical measurements or counts.
DATA: Consists of information coming from observations, counts, measurements, or responses.
1. Attribute (Qualitative) Data: Consists of qualities such as religion, sex, color, etc. This type of data cannot be ranked.
2. Numerical (Quantitative) Data: Consists of numbers representing counts or measurements. This type of data can be ranked. There are two types of numerical data.
a. Discrete Data: Can take on a finite number of values or a countable infinity of values (as many values as there are whole numbers, such as 0, 1, 2, ...). Examples:
  1. Number of kids in the family.
  2. Number of students in the class.
  3. Number of calls received by the switchboard each day.
  4. Number of flaws in a yard of material.
b. Continuous Data: Can assume all possible values within a range of values without gaps, interruptions, or jumps. Examples include all kinds of measurements, such as time, weight, and distance:
  1. Length of a piece of material.
  2. Height and weight of students in a class.
  3. Duration of a call to a switchboard.
  4. Body temperature.
Qualitative and Quantitative Data
Example: The grade point averages of five students are listed in the table. Which data are qualitative data and which are quantitative data?
Student   GPA
Sally     3.22
Bob       3.98
Cindy     2.75
Mark      2.24
Kathy     3.84
The student names are qualitative data; the GPAs are quantitative data.
Levels of Measurement
The level of measurement determines which statistical calculations are meaningful. The four levels of measurement are: nominal, ordinal, interval, and ratio.
Levels of measurement, from lowest to highest: Nominal, Ordinal, Interval, Ratio.
Nominal Level of Measurement
Data at the nominal level of measurement are qualitative only.
Nominal: Categorized using names, labels, or qualities. No mathematical computations can be made at this level.
Examples:
• Colors in the US flag
• Names of students in your class
• Textbooks you are using this semester
Nominal Scale
• Numbers are used simply to label groups or classes.
• For example, gender:
  • 1 = male, 2 = female.
• Color of eyes of a person:
  • 1 = blue, 2 = green, 3 = brown.
Ordinal Level of Measurement
Data at the ordinal level of measurement are qualitative or quantitative.
Ordinal: Arranged in order, but differences between data entries are not meaningful.
Examples:
• Class standings: freshman, sophomore, junior, senior
• Numbers on the back of each player’s shirt
• Top 50 songs played on the radio
Ordinal Scale
• In addition to classification, members can be ordered according to relative size or quality.
• For example, products ranked by consumers: 1 = best, 2 = second best, etc.
Interval Level of Measurement
Data at the interval level of measurement are quantitative. A zero entry simply represents a position on a scale; the entry is not an inherent zero.
Interval: Arranged in order, and the differences between data entries can be calculated.
Examples:
• Temperatures
• Years on a timeline
• Atlanta Braves World Series victories
Ratio Level of Measurement
Data at the ratio level of measurement are similar to the interval level, but a zero entry is meaningful.
Ratio: A ratio of two data values can be formed, so one data value can be expressed as a multiple of another.
Examples:
• Ages
• Grade point averages
• Weights
Summary of Levels of Measurement
Level of measurement   Put data in categories   Arrange data in order   Subtract data values   Determine if one data value is a multiple of another
Nominal                Yes                      No                      No                     No
Ordinal                Yes                      Yes                     No                     No
Interval               Yes                      Yes                     Yes                    No
Ratio                  Yes                      Yes                     Yes                    Yes
Displaying data
FREQUENCY DISTRIBUTIONS
• Frequency (f) is used to describe the number of times a value or a range of values occurs in a data set.
• Cumulative frequencies are used to describe the number of observations less than, or greater than, a specific value.
Frequency Measures
• Absolute frequency is the number of times a value or range of values occurs in a data set.
• The relative frequency is found by dividing the absolute frequency by the total number of observations (n).
• The cumulative frequency is the running sum of the absolute frequencies.
• The cumulative relative frequency is the cumulative frequency divided by the total number of observations (equivalently, the running sum of the relative frequencies).
• Frequency distributions are portrayed as frequency tables, histograms, or polygons.
Frequency distribution table
• The first step in drawing a frequency distribution is to construct a frequency table.
• A frequency table is a way of organizing the data by listing every possible score as a column of numbers and the frequency of occurrence of each score as another.
• Computing the frequency of a score is simply a matter of counting the number of times that score appears in the set of data.
• It is necessary to include scores with zero frequency in order to draw the frequency polygons correctly.
• For example, consider the following set of 15 scores, which were obtained by asking a class of students their shoe size, shoe width, and sex (male or female).
Shoe Size Shoe Width Gender
10.5 B M
6.0 B F
9.5 D M
8.5 A F
7.0 B F
10.5 C M
7.0 C F
8.5 D M
6.5 B F
9.5 C M
7.0 B F
7.5 B F
9.0 D M
6.5 A F
7.5 B F
• To construct a frequency table, start with the smallest shoe size and list all shoe sizes as a column of numbers.
• The frequency of occurrence of each shoe size is written to the right.
Shoe Size   Absolute Frequency   Cumulative Frequency   Relative Frequency
6.0         1                    1                      0.07
6.5         2                    3                      0.13
7.0         3                    6                      0.20
7.5         2                    8                      0.13
8.0         0                    8                      0.00
8.5         2                    10                     0.13
9.0         1                    11                     0.07
9.5         2                    13                     0.13
10.0        0                    13                     0.00
10.5        2                    15                     0.13
Total       15
• Note that the sum of the column of frequencies is equal to the number of scores, or the size of the sample (N = 15).
• This is a necessary, but not sufficient, property to ensure that the frequency table has been correctly calculated.
• It is not sufficient because two errors could have been made, canceling each other out.
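The shoe-size frequency table can be reproduced with a short script. The sketch below uses the 15 shoe sizes listed in the data table and Python's standard `collections.Counter`; the variable names are illustrative, and the half-size step of 0.5 is an assumption read off the table.

```python
from collections import Counter
from itertools import accumulate

# The 15 shoe sizes from the class data, in the order listed.
shoe_sizes = [10.5, 6.0, 9.5, 8.5, 7.0, 10.5, 7.0, 8.5,
              6.5, 9.5, 7.0, 7.5, 9.0, 6.5, 7.5]

counts = Counter(shoe_sizes)
n = len(shoe_sizes)

# List every half size from smallest to largest, so that sizes with
# zero frequency (8.0 and 10.0) still get a row.
low, high = min(shoe_sizes), max(shoe_sizes)
sizes = [low + 0.5 * i for i in range(int((high - low) / 0.5) + 1)]

absolute = [counts[s] for s in sizes]           # missing sizes count as 0
cumulative = list(accumulate(absolute))         # running sum of frequencies
relative = [f / n for f in absolute]
cumulative_relative = [c / n for c in cumulative]

print("Size  Abs  Cum  Rel   CumRel")
for s, f, c, r, cr in zip(sizes, absolute, cumulative, relative, cumulative_relative):
    print(f"{s:4.1f}  {f:3d}  {c:3d}  {r:.2f}  {cr:.2f}")

# Necessary (but not sufficient) check: frequencies add up to the sample size.
print("Total:", sum(absolute))
```

The final cumulative frequency equals the sample size and the final cumulative relative frequency equals 1, which is the same necessary-but-not-sufficient check described in the notes.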
Grouped Frequency Distributions
• These distributions are used for data sets that contain a large number of observations.
• The data is grouped into a number of classes.
Guidelines for classes
• There should be between 5 and 20 classes.
• The class width should be an odd number.
  • This will guarantee that the class midpoints are integers instead of decimals (when the class limits are integers).
• The classes must be mutually exclusive.
  • This means that no data value can fall into two different classes.
• The classes must be all inclusive or exhaustive.
  • This means that all data values must be included.
• The classes must be continuous.
  • There are no gaps in a frequency distribution.
  • Classes that have no values in them must be included (unless it is the first or last class, which is dropped).
• The classes must be equal in width.
  • The exception here is the first or last class.
  • It is possible to have a "below ..." or "... and above" class.
  • This is often used with ages.
Creating a Grouped Frequency Distribution
• Find the largest and smallest values.
• Compute the Range = Maximum - Minimum.
• Select the number of classes desired.
  • This is usually between 5 and 20.
• Find the class width by dividing the range by the number of classes and rounding up.
  • There are two things to be careful of here. You must round up, not off.
  • If the range divided by the number of classes gives an integer value (no remainder), then you can either add one to the number of classes or add one to the class width.
• Pick a suitable starting point less than or equal to the minimum value.
  • Your starting point is the lower limit of the first class.
  • Continue to add the class width to this lower limit to get the rest of the lower limits.
• To find the upper limit of the first class, subtract one from the lower limit of the second class.
  • Then continue to add the class width to this upper limit to find the rest of the upper limits.
• Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5 units to the upper limits.
  • The boundaries are also half-way between the upper limit of one class and the lower limit of the next class.
• Tally the data.
• Find the frequencies.
• Find the cumulative frequencies.
• If necessary, find the relative frequencies and/or relative cumulative frequencies.
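The steps above can be sketched as a small Python function. This is one possible reading of the procedure, under several assumptions: the data are integers (so that "subtract one" and "0.5 units" make sense), the starting point is taken to be the minimum value, and when the range divides evenly the class width (rather than the number of classes) is increased by one. The function name and the 20 data values are made up for illustration.

```python
import math

def grouped_frequency_table(data, num_classes):
    """Build classes for integer data following the steps listed above."""
    smallest, largest = min(data), max(data)
    data_range = largest - smallest

    # Round up, never off; if the division is exact, widen by one so the
    # largest value still fits inside the last class.
    width = math.ceil(data_range / num_classes)
    if data_range % num_classes == 0:
        width += 1

    rows = []
    cumulative = 0
    for i in range(num_classes):
        lower = smallest + i * width   # lower class limit
        upper = lower + width - 1      # one less than the next lower limit
        frequency = sum(lower <= x <= upper for x in data)  # tally
        cumulative += frequency
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": frequency,
            "cumulative": cumulative,
            "relative": frequency / len(data),
        })
    return rows

# Hypothetical data set of 20 integer observations (not from the slides).
values = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31,
          33, 34, 34, 36, 39, 41, 42, 45, 47, 50]

table = grouped_frequency_table(values, 6)
for row in table:
    print(row)
```

With these values the range is 38, so six classes give a width of 7 (38 / 6 rounded up), which also satisfies the guideline that the class width be odd.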
Example
• Thirty AA batteries were tested to determine how long they would last. The results to the nearest minute were recorded as follows:
• The three most commonly used graphs in research are:
  • The histogram.
  • The frequency polygon.
  • The cumulative frequency graph, or ogive (pronounced o-jive).
Histogram
• The histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies.
• A histogram is drawn by plotting the scores (midpoints) on the X-axis and the frequencies on the Y-axis.
• A bar is drawn for each score value, the width of the bar corresponding to the real limits of the interval and the height corresponding to the frequency of occurrence of the score value.
• An example histogram is presented below.
[Figure: Example of a Histogram. X-axis: Number of Cigarettes Smoked per Day (class midpoints 5, 8, 11, 14, 17, 20). Y-axis: Frequency (0 to 6).]
Frequency polygon
• A frequency polygon is a graph that displays the data by using lines that connect points plotted for frequencies at the midpoints of the classes.
• The frequencies are represented by the heights of the points.
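As a rough sketch of how the points of a frequency polygon can be computed, the function below takes class limits and frequencies and returns (midpoint, frequency) pairs. It also adds one empty class at each end so that the polygon returns to the horizontal axis, which is a common convention and is assumed here. The class limits and frequencies are hypothetical.

```python
def polygon_points(class_limits, frequencies):
    """Return (midpoint, frequency) pairs, anchored to zero at both ends."""
    # Class width: distance between consecutive lower limits.
    width = class_limits[1][0] - class_limits[0][0]
    midpoints = [(lower + upper) / 2 for lower, upper in class_limits]
    points = list(zip(midpoints, frequencies))
    # Add an empty class before the first and after the last class so the
    # polygon closes on the horizontal axis.
    return [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]

# Hypothetical classes of width 3 with made-up frequencies.
limits = [(4, 6), (7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [2, 4, 5, 6, 3, 1]

points = polygon_points(limits, freqs)
for x, y in points:
    print(x, y)
```

Connecting these points in order with straight line segments gives the polygon.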
[Figure: Example of a Frequency Polygon. X-axis: Number of Cigarettes Smoked per Day (2, 5, 8, 11, 14, 17, 20, 23, 26). Y-axis: Frequency (0 to 6).]
Cumulative frequency graph
• A cumulative frequency graph or ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution.
[Figure: Example of an Ogive. X-axis: Number of Cigarettes Smoked per Day (2, 5, 8, 11, 14, 17, 20, 23, 26). Y-axis: Cumulative Frequency (0 to 20).]
Other Types of Graphs
• A bar chart or bar graph is a chart with rectangular bars with lengths proportional to the values that they represent.
• The bars can be plotted vertically or horizontally.
• Bar graphs display frequency distributions of discrete variables, often nominal or ordinal data.
• Bars represent separate groups, so they should be separated.
Bar graphs
[Figure: Bar graph of the number of students in a statistics class from each of four majors (POL, BAD, PSY, SOC), fall 2005. Y-axis: 0 to 16.]
• Pie graph: A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
A pie chart
• A pie chart (or a circle graph) is a circular chart divided into sectors, illustrating proportion.
• In a pie chart, the arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents.
• When angles are measured with one turn as the unit, a percentage corresponds to the same number of centiturns.
• Together, the sectors create a full disk.
• It is named for its resemblance to a pie which has been sliced.
[Figure: Pie chart of the number of crimes investigated by law enforcement officers in U.S. national parks during 1995.]
Category   Number   Percent
Assaults   164      68.3%
Rape       34       14.2%
Robbery    29       12.1%
Homicide   13       5.4%
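Because each sector's central angle is proportional to the quantity it represents, the angles for the crime data in the pie chart can be computed directly from the counts. The sketch below uses the four counts shown in the figure; the variable names are illustrative.

```python
# Counts from the 1995 national parks pie chart.
crimes = {"Assaults": 164, "Rape": 34, "Robbery": 29, "Homicide": 13}
total = sum(crimes.values())

# Percentage of the whole, rounded to one decimal as in the figure.
percentages = {name: round(100 * count / total, 1)
               for name, count in crimes.items()}

# Central angle in degrees: the same share of a full 360-degree turn.
angles = {name: count * 360 / total for name, count in crimes.items()}

for name in crimes:
    print(f"{name:9s} {crimes[name]:4d} {percentages[name]:5.1f}% {angles[name]:6.1f} deg")
```

The angles add up to 360 degrees, which corresponds to the sectors forming a full disk.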
Measures of Central Tendency
• There are three main measures of central tendency: the mean, median, and mode.
• The purpose of measures of central tendency is to identify the location of the center of various distributions.
• For example, let’s consider the data below.
• This data represents the number of miles per gallon that 30 selected four-wheel drive sports utility vehicles obtained in city driving.
• In its current form, it is difficult to determine where the center of the data set lies.
• One way to get a better idea of where the center of a distribution is located is to graph the data.
• Because the data is numerical, the most appropriate method of graphing the data is to create a histogram.
• The histogram for the gas mileage data is given below.
• If we rely on sight alone, it seems that the middle of the distribution lies at around 15 to 16 miles per gallon; however, because our senses can sometimes deceive us, we want to be a little more scientific in our methodology.
Mode
• The mode is the observation that occurs most frequently.
• Thus, to find the mode for the above data set, we simply locate the observation that occurs most frequently.
• In this case, the number 16 occurs 9 times, which is more than any other observation.
• Therefore, the mode of the data is 16.
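Finding the mode amounts to counting how often each observation occurs and picking the most frequent one. The sketch below shows this with Python's standard `collections.Counter`. The mileage values are made up for illustration and are not the 30 observations from the slides; the hypothetical helper `modes` returns a list because a data set can have more than one mode.

```python
from collections import Counter

def modes(observations):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(observations)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Made-up miles-per-gallon values, for illustration only.
mileage = [14, 15, 16, 16, 17, 16, 15, 18, 16, 14]

print(modes(mileage))          # single mode
print(modes([1, 1, 2, 2, 3]))  # two values tie, so both are returned
```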
The Median
• The median is the middle observation in the data.
• This means that 50% of the data is below the median and 50% of the data is above the median.
• To find the median, we must first organize the data in order from the smallest to the largest observation.
• For example, the above gas mileage data would take on the following form: