Exploring Data 1.1 Displaying Distributions with Graphs YMS3e
Jan 28, 2016
1.1 Objectives
Describe what is meant by exploratory data analysis.
Explain what is meant by the distribution of a variable.
Differentiate between categorical variables and quantitative variables.
Construct bar graphs and pie charts for a set of categorical data.
Construct a stemplot for a set of quantitative data.
Construct a back-to-back stemplot to compare two related distributions.
Construct a stemplot using split stems.
Construct a histogram for a set of quantitative data, and discuss how changing the class width can change the impression of the data given by the histogram.
1.1 Objectives
Describe the overall pattern of a distribution by its shape, center and spread.
Explain what is meant by the mode of a distribution.
Recognize and identify symmetric and skewed distributions.
Explain what is meant by outlier in a stemplot or histogram.
Construct and interpret an ogive (relative cumulative frequency graph) from a relative frequency table.
Construct a time plot for a set of data collected over time.
Case Study
Nielsen Ratings
Read the study on page 37.
What do you observe? Does one network appear to “win” the ratings race?
How can we get a better sense of which network has the best ratings?
How can Statistics help us understand this data?
Exploratory Data Analysis
Exploratory Data Analysis: Statistical practice of analyzing distributions of data through graphical displays and numerical summaries.
Distribution: Description of the values a variable takes on and how often the variable takes on those values.
An EDA allows us to identify patterns and departures from patterns in distributions.
EDA
EDA is the part of statistical practice concerned with reviewing, communicating, and using data where there is a low level of knowledge about its cause system.
EDA Objectives
Suggest hypotheses about the causes of observed phenomena.
Assess assumptions on which statistical inference will be based.
Support the selection of appropriate statistical tools and techniques.
Provide a basis for further data collection through surveys or experiments.
Categorical Data
Categorical Variable: Values are labels or categories.
Distributions list the categories and either the count or percent of individuals in each.
Displays: Bar Graphs and Pie Charts
SOCS
When describing a distribution remember your SOCS!
Shape
Outliers
Center
Spread
Look Carefully
Look carefully at data, searching for patterns and for situations that seem to differ from the population.
Clusters
Outliers
Gaps
Quantitative Data
Quantitative Variable: Values are numeric - arithmetic computation makes sense (average, etc.)
Distributions list the values and number of times the variable takes on that value.
Displays: Dotplots, Stemplots, Histograms, Boxplots
Only organized Data can Illuminate!
Your goal is to make neat, organized, labeled graphs that display the distribution of data effectively and provide an insight into patterns and departures from patterns.
Dotplots
Small datasets with a small range (max-min) can be easily displayed using a dotplot.
Draw and label a number line from min to max.
Place one dot per observation above its value.
Stack multiple observations evenly.
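The dotplot steps above can be sketched in a few lines of Python. This is only an illustration, not part of the textbook: the function name `text_dotplot` and the sample values are made up, and the sketch assumes integer observations so that each value gets its own column on the number line.

```python
from collections import Counter

def text_dotplot(values):
    """Return a text dotplot: a number line from min to max with stacked dots.

    Assumes integer observations (an assumption of this sketch).
    """
    counts = Counter(values)
    lo, hi = min(values), max(values)
    axis = list(range(lo, hi + 1))                 # number line from min to max
    width = max(len(str(lo)), len(str(hi))) + 1    # column width so labels do not touch
    rows = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[v] >= level else " ").rjust(width) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)

# Hypothetical small data set, chosen only to show stacking and a gap.
print(text_dotplot([2, 3, 3, 4, 4, 4, 5, 7]))
```

The empty column above 6 in the printed plot is how a gap shows up in a dotplot.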
Stemplots
A stemplot gives a quick picture of the shape of a distribution while including the numerical values.
Separate each observation into a stem and a leaf, e.g. 14g -> 1|4, 256 -> 25|6, 32.9oz -> 32|9.
Write stems in a vertical column and draw a vertical line to the right of the column.
Write each leaf to the right of its stem.
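As a rough sketch of the stem-and-leaf procedure, the Python below splits each observation into a stem (all digits but the last) and a leaf (the last digit). The function name `stemplot` is hypothetical, the sketch assumes non-negative integers, and the data are a small illustrative set of test scores.

```python
from collections import defaultdict

def stemplot(values):
    """Text stemplot for non-negative integers: the last digit is the leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Include every stem from smallest to largest so gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {digits}".rstrip())
    return "\n".join(lines)

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]  # illustrative test scores
print(stemplot(scores))
```

Keeping the empty stem 5 in the output matters: dropping it would hide the gap between 45 and the rest of the scores.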
Stemplots
Example 1.4, pages 42-43
Literacy Rates in Islamic Nations
Stemplots
Note: Stemplots do not work well for large data sets.
Back-to-Back Stemplots: Compare datasets
Splitting Stems: Double the number of stems, writing leaves 0-4 after the first and 5-9 after the second.
Or split each stem into five (0-1, 2-3, 4-5, 6-7, 8-9).
Stemplots
Example 1.5, pages 42-43
Virginia College Tuition
Example
Page 47 #1.3: Cheese and Chemistry
As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the Latrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition.
The final concentrations of lactic acid in the 30 samples, as multiples of their initial concentrations, are given in the table.
Example Continued
A dotplot and a stemplot from the Minitab statistical software package.
Example Continued
Which plot does a better job of summarizing the data? Explain why.
What do the numbers in the left column in the stemplot tell us? How does Minitab identify the row that contains the center of the distribution?
The final concentration of lactic acid in one of the samples stayed the same (as its initial concentration). Identify the sample in both plots.
Histograms
Histograms break the range of data values into classes and display the count/% of observations that fall into each class.
Divide the range of data into equal-width classes.
Count the observations in each class - “frequency”.
Draw bars to represent classes - height = frequency.
Bars should touch (unlike bar graphs).
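The counting step of a histogram can be sketched as follows. The function name `class_counts` and the scores are illustrative only. The sketch assumes every observation is at least `start`, and each class includes its left endpoint but not its right endpoint.

```python
def class_counts(values, start, width):
    """Count observations in equal-width classes [lo, lo + width).

    Assumes every value is >= start (an assumption of this sketch).
    Returns a list of (lo, hi, count) tuples.
    """
    n_classes = int((max(values) - start) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1   # index of the class containing v
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]  # illustrative test scores
for lo, hi, count in class_counts(scores, start=40, width=10):
    # One '#' per observation; bar length plays the role of bar height.
    print(f"{lo:>3} to <{hi:<4}{'#' * count}")
```

Rerunning with `width=20` or `width=5` is a quick way to see how the class width changes the impression the histogram gives.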
Histograms
Example 1.6, page 49
IQ Scores for 5th Graders
Describe the SOCS.
What do these data suggest?
Example
Page 57 #1.11: Presidential ages at inauguration
The table gives the ages of all U.S. presidents when they took office.
Make a histogram of the ages of the presidents at inauguration. Use class intervals of 40 to 44, 45 to 49, and so on. Each interval should contain the left-hand endpoint but not the right-hand endpoint.
Describe the shape, center and spread of the distribution.
Who was the youngest president? Who was the oldest?
Was Bill Clinton, at age 46, unusually young?
Example Continued
AP Tip
Be sure to label carefully any required graphs. This means your axes should be labeled and your scales should be made clear.
“Describe” means to discuss shape, center and spread!
EDA Summary
The purpose of an Exploratory Data Analysis is to organize data and identify patterns/departures.
PLOT YOUR DATA - Choose an appropriate graph
Look for overall pattern and departures from pattern:
Shape {mound, bimodal, skewed, uniform}
Outliers {points clearly away from body of data}
Center {What number “typifies” the data?}
Spread {How “variable” are the data values?}
Outliers
Outliers need to be looked at carefully.
Is it “bad data” that can be thrown out?
Is there a reason for that particular value to occur?
Shape
Modes: Peaks in the graph. A distribution can be unimodal (1 peak), bimodal (2 peaks), etc.
Symmetric: The values above and below the midpoint are mirror images of each other.
Skewed: Skewed right means the tail is pulled to the right, skewed left means the tail is pulled to the left.
Frequency
Relative frequency refers to the proportion of values that fall into a certain class.
Cumulative frequency refers to the number of values that fall in a class and in all classes below it.
Relative Cumulative Frequency refers to the proportion of values that fall into a class and into all classes before it. These are graphed with an Ogive.
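The three kinds of frequency can be computed from a list of class counts. The sketch below is illustrative: the function names and the counts are hypothetical, not taken from the textbook tables. The last list it prints holds the heights you would plot for an ogive.

```python
from itertools import accumulate

def relative_frequencies(counts):
    """Proportion of all observations that fall in each class."""
    total = sum(counts)
    return [c / total for c in counts]

def cumulative_frequencies(counts):
    """Number of observations in each class and all classes below it."""
    return list(accumulate(counts))

def relative_cumulative_frequencies(counts):
    """Proportion of observations in each class and all classes below it."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

# Hypothetical class counts for six equal-width classes (10 observations).
counts = [1, 0, 1, 3, 2, 3]
print(relative_frequencies(counts))
print(cumulative_frequencies(counts))
print(relative_cumulative_frequencies(counts))
```

The final relative cumulative frequency is always 1, which is a handy check when building the table by hand.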
Ogives
Example 1.9, pages 60-61
Presidents
Example
Page 64 #1.14: Glucose Levels
People with diabetes must monitor and control their blood glucose level. The goal is to maintain “fasting plasma glucose” between about 90 and 130 milligrams per deciliter (mg/dl) of blood.
Here are the fasting plasma glucose levels for 18 diabetics enrolled in a diabetes control class, five months after the end of the class.
Example continued
Make a stemplot of these data and describe the main features of the distribution. (You will want to round and split stems.) Are there outliers? How well does this group do as a whole in achieving the goal for controlling glucose levels?
Construct a relative cumulative frequency graph (ogive) for these data.
Use your graph to answer the following questions.
What percent of blood glucose levels were between 90 and 130?
What is the center of the distribution?
What relative cumulative frequency is associated with a blood glucose level of 130?
Timeplots
A timeplot of a variable plots each observation against the time at which it was measured.
Time is on the horizontal scale.
The variable you are measuring goes on the vertical scale.
Connecting the points emphasizes change over time.
Exploring Data
1.2 Describing Distributions with Numbers
YMS3e
1.2 Objectives
Given a data set, compute the mean and median as measures of the center.
Explain what is meant by resistant measure.
Identify situations in which the mean is the most appropriate measure of center and situations in which the median is the most appropriate measure.
Given a data set, find the quartiles.
Given a data set, find the five-number summary.
Use the five-number summary of a data set to construct a boxplot for the data.
Compute the interquartile range (IQR) of a data set.
Given a data set, use the 1.5xIQR rule to identify outliers.
Given a data set, compute the standard deviation and variance as measures of spread.
1.2 Objectives
Give two reasons why we use squared deviations rather than just average deviations from the mean.
Explain what is meant by degrees of freedom.
Identify situations in which the standard deviation is the most appropriate measure of spread and situations in which the interquartile range is the most appropriate measure.
Explain the effect of a linear transformation of a data set on the mean, median and standard deviation of the set.
Use numerical and graphical techniques to compare two or more data sets.
Sample Data
Consider the following test scores for a small class:
75 76 82 93 45 68 74 82 91 98
Plot the data and describe the SOCS:
What number best describes the “center”?
What number best describes the “spread”?
[Figure: dot plot of the test scores ("Collection 1"); horizontal axis: scores, 40 to 100]
Shape? Outliers? Center? Spread?
Measures of Center
Numerical descriptions of distributions begin with a measure of its “center”.
If you could summarize the data with one number, what would it be?
Mean: The “average” value of a dataset.
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$
Median: Q2 or M. The “middle” value of a dataset.
Arrange observations in order min to max.
Locate the middle observation, average if needed.
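As a quick check of both definitions, the sketch below computes the mean and median of the ten test scores from the Sample Data slide. The function names are illustrative; Python's standard `statistics` module offers equivalent `mean` and `median` functions.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

def median(values):
    """Middle observation of the ordered data; average the two middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]
print(mean(scores))    # 784 / 10
print(median(scores))  # average of the 5th and 6th ordered scores, 76 and 82
```

The low score of 45 pulls the mean (78.4) below the median (79), a small preview of why the mean is not resistant.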
Mean vs. Median
The mean and the median are the most common measures of center.
If a distribution is perfectly symmetric, the mean and the median are the same.
The mean is not resistant to outliers.
You must decide which number is the most appropriate description of the center...
MeanMedian Applet
Measures of Spread
Variability is the key to Statistics. Without variability, there would be no need for the subject.
When describing data, never rely on center alone.
Measures of Spread:
Range - {rarely used...why?}
Quartiles - InterQuartile Range {IQR=Q3-Q1}
Variance and Standard Deviation {var and sx}
Like Measures of Center, you must choose the most appropriate measure of spread.
Quartiles
Quartiles Q1 and Q3 represent the 25th and 75th percentiles.
To find them, order data from min to max.
Determine the median - average if necessary.
The first quartile is the middle of the ‘bottom half’.
The third quartile is the middle of the ‘top half’.
19 22 23 23 23 26 26 27 28 29 30 31 32
Q1=23 med=26 Q3=29.5
45 68 74 75 76 82 82 91 93 98
Q1=74 med=79 Q3=91
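The quartile procedure above can be sketched in Python. The function names are illustrative. When n is odd, this sketch leaves the median out of both halves, which is the convention that matches the values on the slide; statistical software often uses other quartile conventions and may give slightly different answers.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the middle-of-each-half method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # bottom half
    upper = ordered[half + n % 2:]    # top half; skips the median itself when n is odd
    return median(lower), median(ordered), median(upper)

print(quartiles([19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 30, 31, 32]))
print(quartiles([45, 68, 74, 75, 76, 82, 82, 91, 93, 98]))
```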
5-Number Summary, Boxplots
The 5 Number Summary provides a reasonably complete description of the center and spread of a distribution.
We can visualize the 5 Number Summary with a boxplot.
MIN Q1 MED Q3 MAX
min=45 Q1=74 med=79 Q3=91 max=98
[Figure: boxplot of Quiz Scores; horizontal axis 45 to 100]
Outlier?
Determining Outliers
InterQuartile Range “IQR”: Distance between Q1 and Q3. Resistant measure of spread...only measures middle 50% of data.
IQR = Q3 - Q1 {width of the “box” in a boxplot}
1.5 IQR Rule: If an observation falls more than 1.5 IQRs above Q3 or below Q1, it is an outlier.
“1.5 • IQR Rule”
Why 1.5? According to John Tukey, 1 IQR seemed like too little and 2 IQRs seemed like too much...
1.5 • IQR Rule
To determine outliers:
Find the 5 Number Summary.
Determine the IQR.
Multiply 1.5xIQR.
Set up “fences” Q1-(1.5IQR) and Q3+(1.5IQR).
Observations “outside” the fences are outliers.
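The fence calculation can be sketched as below. The function names are illustrative, and the quartiles are passed in rather than computed so that the sketch does not commit to one quartile convention. The example uses the quiz scores with Q1=74 and Q3=91 from the 5-Number Summary slide.

```python
def iqr_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Observations that fall outside the fences."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]

scores = [45, 68, 74, 75, 76, 82, 82, 91, 93, 98]
print(iqr_fences(74, 91))             # IQR = 17, so the fences sit 25.5 beyond each quartile
print(find_outliers(scores, 74, 91))  # any score below 48.5 or above 116.5
```

For these scores the lower fence is 48.5, so the score of 45 is flagged as an outlier.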
Outlier Example
[Figure: boxplot of Spending ($); horizontal axis 0 to 100, with fences and outliers marked]
IQR = 45.72 - 19.06 = 26.66
1.5 IQR = 1.5(26.66) = 39.99
Upper fence: 45.72 + 39.99 = 85.71
Lower fence: 19.06 - 39.99 = -20.93
All data on p. 48.
Standard Deviation
Another common measure of spread is the Standard Deviation: a measure of the “average” deviation of all observations from the mean.
To calculate Standard Deviation:
Calculate the mean.
Determine each observation’s deviation (x - xbar).
“Average” the squared deviations by dividing the total squared deviation by (n-1). This quantity is the Variance.
Square root the result to determine the Standard Deviation.
Standard Deviation
Variance:
$$\text{var} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
Standard Deviation:
$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Example 1.16 (p.85): Metabolic Rates
1792 1666 1362 1614 1460 1867 1439
Standard Deviation
1792 1666 1362 1614 1460 1867 1439
Metabolic Rates: mean=1600
x        (x - xbar)   (x - xbar)^2
1792        192          36864
1666         66           4356
1362       -238          56644
1614         14            196
1460       -140          19600
1867        267          71289
1439       -161          25921
Totals:       0         214870
Total Squared Deviation = 214870
Variance: var = 214870/6 = 35811.67
Standard Deviation: s = √35811.67 = 189.24 cal
What does this value, s, mean?
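The hand calculation for the metabolic rates can be reproduced step by step. This is a sketch with illustrative function names; the standard library's `statistics.variance` and `statistics.stdev` use the same (n-1) divisor and should agree.

```python
import math

def sample_variance(values):
    """Total squared deviation from the mean, divided by (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    total_squared_deviation = sum((x - xbar) ** 2 for x in values)
    return total_squared_deviation / (n - 1)

def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # metabolic rates (cal)
print(sum(rates) / len(rates))       # mean
print(round(sample_variance(rates), 2))
print(round(sample_sd(rates), 2))
```

One reading of s: a typical metabolic rate in this group sits roughly 189 calories away from the mean of 1600.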
Linear Transformations
Variables can be measured in different units (feet vs meters, pounds vs kilograms, etc).
When converting units, the measures of center and spread will change.
Linear Transformations (xnew=a+bx) do not change the shape of a distribution.
Multiplying each observation by b multiplies measures of center (mean, median) by b and measures of spread (standard deviation, IQR) by |b|.
Adding a to each observation adds a to the measure of center, but does not affect spread.
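These rules are easy to check numerically. The sketch below applies xnew = a + bx to the sample test scores and compares summaries before and after; the helper names and the choices a=5, b=2 are illustrative only. It relies on `statistics.mean`, `statistics.median`, and `statistics.stdev` from the Python standard library.

```python
import statistics

def summarize(values):
    """Return (mean, median, sample standard deviation)."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

def linear_transform(values, a, b):
    """Apply xnew = a + b*x to every observation."""
    return [a + b * x for x in values]

scores = [75, 76, 82, 93, 45, 68, 74, 82, 91, 98]
a, b = 5, 2  # illustrative constants

old_mean, old_median, old_sd = summarize(scores)
new_mean, new_median, new_sd = summarize(linear_transform(scores, a, b))

print(new_mean, a + b * old_mean)      # center: multiply by b, then add a
print(new_median, a + b * old_median)
print(new_sd, abs(b) * old_sd)         # spread: multiply by |b|, a has no effect
```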
Data Analysis Toolbox
To answer a statistical question of interest:
Data: Organize and Examine
Who are the individuals described? What are the variables? Why were the data gathered? When, Where, How, By Whom were data gathered?
Graph: Construct an appropriate graphical display. Describe SOCS.
Numerical Summary: Calculate appropriate center and spread (mean and s or 5 number summary)
Interpretation: Answer question in context!
Chapter 1 Summary
Data Analysis is the art of describing data in context using graphs and numerical summaries. The purpose is to describe the most important features of a dataset.