7/31/2019 Overheads Visualizing Data 2012
Statistics for Engineering
Section 1: Visualizing data
Kevin Dunn
Copyright, and all rights reserved, Kevin Dunn, 2012
http://stats4eng.connectmv.com
2012
Plot your data
Usage examples
Co-worker: Here are the yields from a batch system for the last 3 years (1256 data points). Can you help me:
  understand more about the time-trends in the past 3 years?
  efficiently summarize the yield from all batches run in 2010?
Manager: effectively summarize the (a) number and (b) types of defects on 17 aluminum grades for the past 12 months
Yourself: 24 different measurements vs time (5 readings per minute, over 300 minutes) for each batch we produce; how can we visualize these 36,000 data points?
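The 36,000 figure follows directly from the numbers quoted in the last example. A minimal Python sketch of that arithmetic (the variable names are illustrative, not part of the course material):

```python
# Data volume per batch, using the figures quoted in the example above.
measurements = 24            # different measurements recorded per batch
readings_per_minute = 5
batch_minutes = 300

# Length of one measurement's trajectory, then the total over all measurements.
readings_per_measurement = readings_per_minute * batch_minutes
points_per_batch = measurements * readings_per_measurement

print(readings_per_measurement)   # 1500 readings in each trajectory
print(points_per_batch)           # 36000 data points per batch
```

Seen this way, the task is to display 24 trajectories of 1500 points each, rather than one cloud of 36,000 points.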
References
1. Edward Tufte, Envisioning Information, Graphics Press, 1990 (10th printing in 2005).
2. Edward Tufte, The Visual Display of Quantitative Information, Graphics Press, 2001.
3. Edward Tufte, Visual Explanations: Images and Quantities, Evidence and Narrative, 2nd edition, Graphics Press, 1997.
4. William Cleveland, Visualizing Data, and The Elements of Graphing Data, Hobart Press; 2nd edition, 1994.
5. Stephen Few, Show Me the Numbers, and Now You See It, Analytics Press.
6. Su, "It's easy to produce chartjunk using Microsoft Excel 2007 but hard to make good graphs", Computational Statistics and Data Analysis, 52 (10), 4594-4601, 2008, http://dx.doi.org/10.1016/j.csda.2008.03.007
Background
This class might seem too easy, too obvious. It is!
The human eye and brain are excellent at pattern recognition, sorting through signal and noise.
We can easily cope with bad plots; but good plots save time and show a clearer, more honest picture.
Clichés: "Let the data speak for themselves", "Plot the data"
We will look at: how
Time-series plots
It is a 2-dimensional plot:
  (usually) horizontal x-axis: time or sequence order
  other axis: the data values
Univariate plot
Our eyes can deal with high data density:
  sinusoids
  spikes
  outliers
  separate noise from signal
Time-series plots
Good, automated labelling is important.
Here's an example of bad labelling (and bad axis scaling and colour choices)
Time-series plots
Multiple lines (trajectories): should not cross and jumble
Colours and markers help only slightly
Time-series plots
Rather use separate, parallel axes, and minimal ink
These non-default settings can take a long time to set (10 minutes for this example)
Time-series plots
Sparklines
Read the website link (in the notes)
Used for financial trends (example)
Built into Excel 2010
Good for iPods, cell phones, tablet computers: high density, small size.
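To make the "high density, small size" idea concrete, here is a minimal Python sketch of a text sparkline that draws one Unicode block character per data point. The function name `sparkline` and the price series are assumptions made for illustration; this is not how Excel or any particular tool implements sparklines.

```python
# Text sparkline: one block character per data point, scaled to the data range.
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Map each value onto one of eight block heights."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Flat series: draw a constant mid-height line.
        return BLOCKS[3] * len(values)
    scale = (len(BLOCKS) - 1) / (hi - lo)
    return "".join(BLOCKS[round((v - lo) * scale)] for v in values)

# Hypothetical daily closing prices, for illustration only.
prices = [10, 12, 11, 15, 17, 13, 18]
print(sparkline(prices))
```

The whole trend fits in the width of a short word, so it can sit inline with text or inside a table cell.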
Time-series plots
Example of sparklines in everyday use:
Figure from Wikipedia
Time-series plots
Further tips
Keep the x-axis spacing constant: helps interpretation
  don't reposition the time-axis labels
  don't use the "magnifying glass" concept
Adjust for inflation when plotting money values against time
  sales of polymer to DuPont over the past 10 years
  example of car sales: http://www.duke.edu/~rnau/411infla.htm
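The inflation adjustment can be sketched in a few lines of Python. The helper `to_real`, the sales figures, and the price index below are all hypothetical, chosen only to show the mechanics; real work would use a published price index.

```python
# Convert nominal (as-recorded) money values into constant, base-year money.
def to_real(nominal, price_index, base_year):
    """Scale each year's value by (base-year index / that year's index)."""
    base = price_index[base_year]
    return {year: value * base / price_index[year]
            for year, value in nominal.items()}

# Hypothetical numbers for illustration only: NOT real sales or index data.
sales = {2008: 100.0, 2009: 105.0, 2010: 110.0}   # nominal sales
index = {2008: 100.0, 2009: 105.0, 2010: 110.0}   # price index

real_sales = to_real(sales, index, base_year=2010)
for year in sorted(real_sales):
    print(year, round(real_sales[year], 1))
```

In this made-up case the nominal sales rise by 10%, but the inflation-adjusted series is flat: the apparent growth was entirely due to rising prices.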
Time-series plots
Show a reasonable amount of data for context
Bar plots
A univariate plot on a two-dimensional axis.
Has a category axis and a value axis
Use a bar plot when:
  there are many categories
  interpretation does not change if the category axis is reordered
Bar plots
Rather use a time-series plot if the data have a sequence:
You can see the trends more clearly.
Bar plots
Bar plots can be wasteful as each data point is repeated several times:
1. left edge (line) of each bar
2. right edge (line) of each bar
3. the height of the colour in the bar
4. the number's position (up and down along the y-axis)
5. the top edge of each bar, just below the number
6. the number itself
Bar plots
Maximize the data-ink ratio, within reason
Data-ink ratio = (total ink for data) / (total ink for graphics)
               = 1 - proportion of ink that can be erased without loss of data information
Rather use a table for a handful of data points:
Bar plots
Don't use cross-hatching, textures, or unusual shading in the plots: it creates visual vibrations
Bar plots
Use horizontal bars if:
  there is some ordering to the categories
  the labels do not fit side-by-side
You can place the labels inside the bars
You should usually start the non-category axis at zero
Box plots
A graphical display of the 5-number summary for 1 variable
minimum sample value
25th percentile (1st quartile)
50th percentile (median)
75th percentile (3rd quartile)
maximum sample value
Notes:
1. 25th percentile is the value below which 25 percent of the observations in the sample are found
2. distance from 3rd to 1st quartile = interquartile range (IQR)
Box plots are effective for comparing similar variables (same units of measurement)
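As a rough illustration of the 5-number summary, the Python sketch below uses only the standard library. The helper name `five_number_summary` is an assumption, and the ten values are just the Pos1 readings from the ten rows of the boards data that are listed in this section, so the results will not match a summary of all 100 rows. Percentile conventions differ between packages; `method="inclusive"` is one common choice, not the only one.

```python
import statistics

def five_number_summary(values):
    """Return (min, 1st quartile, median, 3rd quartile, max) for one variable."""
    # "inclusive" interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Ten Pos1 readings (rows 1 to 5 and 96 to 100 only): an illustrative subset.
pos1 = [1761, 1801, 1697, 1679, 1699, 1717, 1661, 1706, 1689, 1751]

lowest, q1, median, q3, highest = five_number_summary(pos1)
iqr = q3 - q1   # interquartile range: distance from 3rd to 1st quartile

print(lowest, q1, median, q3, highest)
print("IQR =", iqr)
```

These five numbers are exactly what a basic box plot draws: the box spans the quartiles, the line inside marks the median, and the whiskers reach the extremes.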
Box plots
      Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
  1   1761  1739  1758  1677  1684  1692
  2   1801  1688  1753  1741  1692  1675
  3   1697  1682  1663  1671  1685  1651
  4   1679  1712  1672  1703  1683  1674
  5   1699  1688  1699  1678  1688  1705
  . . . .
 96   1717  1708  1645  1690  1568  1688
 97   1661  1660  1668  1691  1678  1692
 98   1706  1665  1696  1671  1631  1640
 99   1689  1678  1677  1788  1720  1735
100   1751  1736  1752  1692  1670  1671
Video of data source
Box plots
> summary(boards[1:100, 2:7])
           Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
Min.    :  1524  1603  1594  1452  1568  1503
1st Qu. :  1671  1657  1654  1667  1662  1652
Median  :  1680  1674  1672  1678  1673  1671
Mean    :  1687  1677  1677  1679  1674  1672
3rd Qu. :  1705  1688  1696  1693  1685  1695
Max.    :  1822  1762  1763  1788  1741  1765
Box plots
Some variations:
use the mean instead of the median
outliers shown as dots, where an outlier is most commonly defined as any point more than 1.5 IQR distance units above the 3rd quartile or below the 1st quartile.
use the 2nd percentile (instead of 1st quartile - 1.5 IQR)
use the 98th percentile (instead of 3rd quartile + 1.5 IQR)
add the density histogram onto the box plot: violin plot
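The outlier variation can be sketched in Python as follows. This sketch uses the widely used Tukey convention, in which the 1.5 IQR distance is measured from the 1st and 3rd quartiles (the edges of the box). The function name `tukey_outliers` is an assumption, and the data are only the ten Pos5 readings from the rows of the boards data listed in this section, not the full 100 rows.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the points lying more than k*IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return sorted(v for v in values if v < lower_fence or v > upper_fence)

# Ten Pos5 readings (rows 1 to 5 and 96 to 100 only): an illustrative subset.
pos5 = [1684, 1692, 1685, 1683, 1688, 1568, 1678, 1631, 1720, 1670]

print(tukey_outliers(pos5))
```

Points returned by the function would be drawn as individual dots, and the whiskers would stop at the most extreme points inside the fences.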
Box plot variation: violin plot
Scatter plots
Used to help understand the relationship between two variables: a bivariate plot
Collection of points in the 2 axes
Each point is the intersection of the values on each axis
Intention of a scatter plot
Asks the viewer to draw a causal relationship between the two variables
Scatter plots
However, not all scatter plots show causal phenomena.
Scatter plots
Strive for graphical excellence by:
  making each axis as tight as possible
  avoiding heavy grid lines
  using the least amount of ink
  not distorting the axes
Scatter plots
There is an unfounded fear that others won't understand your 2D scatter plot.
Tufte study (VDQI): no scatter plots in a sample (1974 to 1980) of Western dailies
12 year olds can interpret such plots.
Japanese newspapers frequently use scatter plots
Plant control room: seldom see scatter plots.
Key point
The producers of charts must assume their audience is capable of interpreting them: if you can understand the plot, so will your audience.
Scatter plots
Add box plots or histograms to aid interpretation:
Scatter plots
Add a 3rd variable: different marker sizes
Add a 4th variable: use colour or grayscale shading
The GapMinder website allows you to play the graph over time (the 5th variable)
Scatter plots
Web-based demo from http://gapminder.org
Demo by Hans Rosling (requires internet access)
Tables
Tables are for comparative data analysis on categorical objects.
Note the rows are in default alphabetical order.
We can make the table tell a story if we reorder the rows by some other variable, e.g. monthly insurance payment
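Reordering the rows is a one-line operation in most tools. The Python sketch below uses a made-up table: the category names and payment values are hypothetical and only illustrate sorting by the variable of interest rather than alphabetically.

```python
# Each row: (category, monthly insurance payment). Hypothetical values only.
rows = [
    ("Model A", 120),
    ("Model B", 85),
    ("Model C", 150),
    ("Model D", 95),
]

# Alphabetical order carries no message; sort by the variable of interest
# instead, largest first.
by_payment = sorted(rows, key=lambda row: row[1], reverse=True)

for name, payment in by_payment:
    print(f"{name:<10}{payment:>6}")
```

After sorting, the reader can see the ranking at a glance, and neighbouring rows become directly comparable.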
Tables
Compare defect types (number of defects) for different product grades (categories):
Which defects cost us the most money?
Tables
Defect frequency
  If 1850 lots of grade A4636 (first row): defect A rate = 1/50
  If 250 lots of grade A2610 (last row): defect A rate = 1/50
  Redraw table on production rate basis
If comparing defects over different grades: go down the table (show fraction within the column)
If comparing defects within grade: go across table (show fraction within the row)
Could weight each column by cost of defect
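The three ways of rescaling the defect table can be sketched in Python. Only the lot counts (1850 and 250) and the 1/50 rate come from the slide; every defect count below is hypothetical, chosen so that defect A works out to 1/50 for both grades, and the variable names are illustrative.

```python
# Defect counts per grade (rows) and defect type (columns).
# Counts are HYPOTHETICAL; lot counts are the ones quoted on the slide.
counts = {
    "A4636": {"A": 37, "B": 10},
    "A2610": {"A": 5, "B": 20},
}
lots = {"A4636": 1850, "A2610": 250}

# Production-rate basis: defects per lot produced.
rate = {g: {d: n / lots[g] for d, n in row.items()}
        for g, row in counts.items()}

# Within a grade (across a row): fraction of that grade's defects.
row_frac = {g: {d: n / sum(row.values()) for d, n in row.items()}
            for g, row in counts.items()}

# Over different grades (down a column): fraction of each defect's total.
col_total = {}
for row in counts.values():
    for d, n in row.items():
        col_total[d] = col_total.get(d, 0) + n
col_frac = {g: {d: n / col_total[d] for d, n in row.items()}
            for g, row in counts.items()}

print(rate["A4636"]["A"], rate["A2610"]["A"])   # the same rate for both grades
```

Raw counts make grade A4636 look far worse for defect A, yet on a production-rate basis the two grades are identical, which is the point of redrawing the table.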
Tables
Three common pitfalls:
1. using pie charts when tables will do
Tables
2. arbitrary ordering of the rows
Tables
3. using excessive grid lines
Tables
Interesting example: comparing two treatments
47/52
Data frames
Frames are the basic containers that surround the data and give context to our numbers. Here are some tips:
1. Use round numbers
2. Tighten the axes as much as possible, except ...
3. when showing comparison plots: all axes must have the same minima and maxima
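One possible way to combine the three tips is to compute a single pair of axis limits for all comparison plots and round them outward. The Python sketch below is an assumption about how this could be done: the helper `shared_round_limits`, the step size, and the sample values are illustrative.

```python
import math

def shared_round_limits(series_list, step):
    """Common axis limits for comparison plots, rounded outward to `step`."""
    lo = min(min(series) for series in series_list)
    hi = max(max(series) for series in series_list)
    # Round the minimum down and the maximum up to the nearest multiple of step.
    return math.floor(lo / step) * step, math.ceil(hi / step) * step

# Illustrative values for two variables that will be plotted side by side.
first = [1761, 1801, 1697, 1679, 1699]
second = [1739, 1688, 1682, 1712, 1688]

limits = shared_round_limits([first, second], step=50)
print(limits)   # apply these same limits to every panel
```

Choosing a smaller `step` tightens the frame; a larger one gives rounder tick labels. Either way, every panel must receive the same pair of limits so that the plots remain comparable.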
Aesthetics and style
I highly recommend reading Tufte's 4 books: they contain remarkable examples of how to bring data to life.
Colour
Colour is effective, but:
  readers could be colour-blind,
  the document could be read from a gray-scale print out
There is no standard colour progression (blues, greens, yellows, orange, red).
Safest colour progression is the gray-scale axis, from black to white:
  satisfies colour-blind readers
  looks good in printed form
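A gray-scale progression is simple to generate. The Python sketch below maps a value onto a gray level between black and white and returns it as a hex colour string, a format most plotting tools accept. The function name `gray_hex` and the 0 to 100 range are assumptions for illustration.

```python
def gray_hex(value, lo, hi):
    """Map value in [lo, hi] to a gray level: lo -> black, hi -> white."""
    frac = (value - lo) / (hi - lo)
    frac = min(1.0, max(0.0, frac))      # clamp out-of-range values
    level = round(frac * 255)            # 0 (black) to 255 (white)
    return f"#{level:02x}{level:02x}{level:02x}"

print(gray_hex(0, 0, 100))     # black end of the axis
print(gray_hex(40, 0, 100))    # a dark gray
print(gray_hex(100, 0, 100))   # white end of the axis
```

On a white page the lightest values disappear, so in practice you may prefer to stop short of pure white or to reverse the direction.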
General summary
No general advice that applies in every instance. Useful tips nevertheless:
To understand causality, you must show causality: use bivariate scatter plots (sometimes line plots also work well)
Plots and text go together: a plot = paragraph of text
  add labels to plots for outliers and interesting points
  add equations
  add small summary tables
Avoid codes: A = grade TK133, B = grade RT231
General summary
Avoid unnecessary extras to enliven the plot
If the statistics are boring, then you've got the wrong numbers.
General summary
Adjust for inflation if plot involves money and time
Maximize the data-ink ratio = (ink for data) / (total ink for graphics).
1. eliminate non-data ink
2. erase redundant data-ink.
Maximize data density: 250 data points per linear inch, and 625 data points per square inch.
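The data-ink ratio defined above can be written as a small Python function. The ink amounts below are hypothetical (for instance, counts of dark pixels) and the function name is illustrative; the sketch only shows how erasing non-data ink changes the ratio.

```python
def data_ink_ratio(data_ink, total_ink):
    """Fraction of a graphic's ink that presents data."""
    if total_ink <= 0 or not 0 <= data_ink <= total_ink:
        raise ValueError("need 0 <= data_ink <= total_ink and total_ink > 0")
    return data_ink / total_ink

# Hypothetical ink measurements, for illustration only.
before = data_ink_ratio(data_ink=1200, total_ink=4800)
after = data_ink_ratio(data_ink=1200, total_ink=1600)   # grid and frame erased

print(before, 1 - before)   # ratio, and the proportion that could be erased
print(after)
```

The data ink is unchanged between the two versions; only the non-data ink was removed, which lifts the ratio from 0.25 to 0.75.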