Exploratory Data Analysis 1D

Exploratory Data Analysis 1D - webspace.ship.edu/lebryant/mat219/pdf/slides-12-1D-EDA.pdf · Copy code from D2L to download the Dating Profiles Dataset & the Top Movies

Aug 21, 2018


vucong

Exploratory Data Analysis 1D


In your mat219_class project:
1. Copy code from D2L to download the Dating Profiles Dataset & the Top Movies Dataset, and run both in the Console.
2. Create a new R script or R notebook called eda_1d
3. Include this code in your script or notebook:

library(tidyverse)

diamonds <- ggplot2::diamonds
dating_profiles <- read_csv("dating_profiles.csv")
top_movies <- read_csv("top_movies.csv")


philosophical view


What is EDA?

Paraphrasing John Tukey: It is
•  a mindset (a willingness to look for what can be seen, whether or not it is anticipated),
•  a flexibility (let the data speak for themselves, explore lots of avenues)
•  a way to make pictures (the picture-examining eye is the best finder we have of the wholly unanticipated)


more concrete view


What is EDA?

•  Verify expected relationships actually exist in the data
•  Find unexpected structure in the data that must be accounted for
•  Ensure the right questions are being asked
•  Generate additional questions to be considered
•  Provide a basis for further data collection


There is a large emphasis on graphical exploration because it is the best way to reveal unanticipated structure. We want to
•  plot the raw data
•  plot simple statistics
•  position objects and plots to maximize pattern recognition


one categorical variable


We examine a categorical variable by looking at counts

diamonds


diamonds %>% count(cut)

table(diamonds$cut)

table() is a vector function

count() is a data manipulation verb
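
A minimal sketch of that difference, using the built-in diamonds data. The object names cut_counts and cut_table are illustrative and do not come from the slides.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# count() is a dplyr verb: data frame in, data frame (tibble) out
cut_counts <- diamonds %>% count(cut)

# table() works on a vector and returns a "table" object
cut_table <- table(diamonds$cut)

print(cut_counts)
print(cut_table)
```

Both hold the same counts; the tibble from count() can be piped straight into further verbs such as mutate().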


dating_profiles %>%
  count(education, sort = TRUE) %>%
  mutate(pct = 100 * n / sum(n))

Things to consider:
1. Group sizes
2. Vital few vs. trivial many
3. Missing data
4. Is there a natural ordering?
5. Recoding
6. Collapsing
7. Data type: chr vs. fct vs. ord
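
The dating profiles file is not bundled with R, so here is a sketch of the same pipeline on the built-in diamonds data, with a few of the checklist items checked in code. The choice of the clarity variable and the name clarity_summary are assumptions made for illustration.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

clarity_summary <- diamonds %>%
  count(clarity, sort = TRUE) %>%   # group sizes, largest first
  mutate(pct = 100 * n / sum(n))    # share of each group in percent

print(clarity_summary)

# Data type: is the variable chr, fct, or ord?
class(diamonds$clarity)

# Missing data: number of NA values in the variable
sum(is.na(diamonds$clarity))
```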


dating_profiles <- dating_profiles %>%
  mutate(education = fct_explicit_na(education))
levels(fct_infreq(dating_profiles$education))

Things to consider:
1. Group sizes
2. Vital few vs. trivial many
3. Missing data
4. Is there a natural ordering?
5. Recoding
6. Collapsing
7. Data type: chr vs. fct vs. ord
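
A small sketch of what the two forcats calls do, on a made-up vector that stands in for the education column (the values are hypothetical). Recent forcats releases deprecate fct_explicit_na() in favor of fct_na_value_to_level(); the sketch keeps the function used on the slide, which may emit a deprecation warning.

```r
library(forcats)

# Hypothetical toy data standing in for dating_profiles$education
education <- c("college", "masters", NA, "college", "high school", NA, "college")

# Turn NA into an explicit "(Missing)" level so it is counted like any group
education_fct <- fct_explicit_na(factor(education))

# Order the levels from most to least frequent
ordered_levels <- levels(fct_infreq(education_fct))
print(ordered_levels)
```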


dating_profiles %>%
  ggplot(aes(fct_rev(fct_infreq(education)))) +
  geom_bar() +
  coord_flip()

Use a bar plot to examine the distribution of a categorical variable

by default geom_bar() places the missing-data group last
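
A runnable sketch of the same plot on made-up data, to show how the ordering helpers interact with a missing value. The data frame pets_toy and its values are hypothetical.

```r
library(ggplot2)
library(forcats)

# Hypothetical toy data with one missing value
pets_toy <- data.frame(
  pets = c("dogs", "cats", "dogs", NA, "dogs", "cats", "birds")
)

# fct_infreq() sorts the levels by frequency; fct_rev() reverses them so that,
# after coord_flip(), more frequent named groups are drawn above less frequent ones.
# The NA value is not a level, so it is left as its own bar after the named groups.
p <- ggplot(pets_toy, aes(fct_rev(fct_infreq(pets)))) +
  geom_bar() +
  coord_flip() +
  labs(x = "pets")

print(p)
```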


Your Turn 1
Consider the pets variable in the dating_profiles dataset. Make a summary of the group counts and make a bar plot. Then, go through the checklist below and think about what you would consider doing.
1. Do group sizes vary a lot?
2. Are there a vital few and/or trivial many?
3. Any missing data?
4. Is there a natural ordering to the groups?
5. Should we consider recoding/collapsing?


dating_profiles %>%
  mutate(pets = fct_rev(fct_infreq(fct_explicit_na(pets)))) %>%
  ggplot(aes(pets)) +
  geom_bar() +
  coord_flip()

dating_profiles %>%
  count(pets, sort = TRUE) %>%
  mutate(pct = 100 * n / sum(n))


what about pie charts?


Pie charts

Try placing the wedges in order from largest to smallest


Pie charts

What is revealed about the data from this pie chart?


Pie charts

Pie charts are awful because they encode information in angles and areas that are very difficult for humans to judge

Graphics that make comparisons via position on a common scale are best


one numerical variable


We examine a numerical variable by looking at statistical summaries and binned counts (histograms)

diamonds


diamonds %>%
  summarise(min = min(price),
            q1 = quantile(price, 0.25),
            median = median(price),
            mean = mean(price),
            q3 = quantile(price, 0.75),
            max = max(price))

summary(diamonds$price)

summary() is a generic function

summarise() is a data manipulation verb
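
A short sketch of the contrast, again on diamonds. The extra statistics (IQR, sd) and the name price_stats are illustrative choices, not part of the slides.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# summarise() returns a one-row tibble, so you choose the statistics
# and the result can feed further verbs
price_stats <- diamonds %>%
  summarise(
    median = median(price),
    mean   = mean(price),
    iqr    = IQR(price),
    sd     = sd(price)
  )
print(price_stats)

# summary() is generic: what it reports depends on the class of its argument
summary(diamonds$price)  # numeric vector: min, quartiles, median, mean, max
summary(diamonds$cut)    # factor: counts per level
```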


Things to consider:
1. What values are most common?
2. Which values are rare/extreme?
3. Can you see unusual patterns?
4. Measures of center (mean, median, etc.)
5. Measures of spread (std. dev., IQR, etc.)
6. Skewness
7. Missing data


diamonds %>%
  ggplot(aes(price)) +
  geom_histogram(binwidth = 100)

Use a histogram to examine the distribution of a numerical variable


diamonds %>%
  ggplot(aes(price)) +
  geom_histogram(binwidth = 100) +
  geom_vline(xintercept = mean(diamonds$price), color = "red") +
  geom_vline(xintercept = median(diamonds$price), color = "blue")

statistics and other values can be overlaid
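
One thing such an overlay can make visible is skewness: in a right-skewed distribution the mean typically sits to the right of the median. A sketch that stores the two statistics first (the names price_mean and price_median are illustrative) so they can be both plotted and compared:

```r
library(ggplot2)

diamonds <- ggplot2::diamonds

price_mean   <- mean(diamonds$price)
price_median <- median(diamonds$price)

p <- ggplot(diamonds, aes(price)) +
  geom_histogram(binwidth = 100) +
  geom_vline(xintercept = price_mean, color = "red") +     # mean
  geom_vline(xintercept = price_median, color = "blue")    # median

print(p)

# A positive gap means the mean lies to the right of the median
price_mean - price_median
```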


histogram examples


diamonds %>%
  filter(carat < 3) %>%
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.01)


faithful


summary(faithful$eruptions)


faithful %>%
  ggplot(aes(eruptions)) +
  geom_histogram(binwidth = 0.25)


histograms on the density scale


diamonds %>%
  ggplot(aes(fct_rev(cut), y = ..prop.., group = 1)) +
  geom_bar()

bar plots can be on a proportion scale

the height of each bar is the proportion of values in that bar


diamonds %>%
  ggplot(aes(price, y = ..density..)) +
  geom_histogram(binwidth = 3000, color = "white")

histograms can be on a density scale

the area of each bin is the proportion of values in that bin
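
The area claim can be checked numerically. This sketch pulls the computed bins out of the plot with layer_data() and multiplies height by width. It writes the density mapping as after_stat(density), which assumes ggplot2 3.3.0 or later; the names bins and bin_area are illustrative.

```r
library(ggplot2)

diamonds <- ggplot2::diamonds

p <- ggplot(diamonds, aes(price, y = after_stat(density))) +
  geom_histogram(binwidth = 3000, color = "white")

# layer_data() exposes the values computed for each bin
bins <- layer_data(p, 1)

# area = height (density) x width
bin_area <- bins$density * (bins$xmax - bins$xmin)

# Each area is the proportion of diamonds in that bin, so the areas sum to 1
print(bin_area)
sum(bin_area)
```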


top_movies %>%
  ggplot(aes(gross_adj, y = ..density..)) +
  geom_histogram(binwidth = 100, boundary = 300, color = "white")


top_movies %>%
  ggplot(aes(gross_adj, y = ..density..)) +
  geom_histogram(breaks = c(300, 400, 600, 1500), color = "white")


top_movies %>%
  ggplot(aes(gross_adj, y = ..density..)) +
  geom_histogram(breaks = c(300, 400, 1500), color = "white")

notice that the density scale reasonably displays the distribution


top_movies %>%
  ggplot(aes(gross_adj)) +
  geom_histogram(binwidth = 100, boundary = 300, color = "white")


top_movies %>%
  ggplot(aes(gross_adj)) +
  geom_histogram(breaks = c(300, 400, 600, 1500), color = "white")


top_movies %>%
  ggplot(aes(gross_adj)) +
  geom_histogram(breaks = c(300, 400, 1500), color = "white")

a count (or proportion) scale loses the shape of the distribution

this is not a histogram
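
To see why counts mislead when bins have unequal widths, this sketch tabulates count and density side by side. It uses base R's hist() on diamonds prices because the top movies file is not bundled with R; the break points are arbitrary illustrative choices.

```r
diamonds <- ggplot2::diamonds

# Unequal-width bins covering the full price range (break points are illustrative)
breaks <- c(0, 1000, 5000, 20000)
h <- hist(diamonds$price, breaks = breaks, plot = FALSE)

comparison <- data.frame(
  bin     = c("[0, 1000]", "(1000, 5000]", "(5000, 20000]"),
  width   = diff(h$breaks),
  count   = h$counts,
  density = h$density   # proportion per unit of price
)
print(comparison)
```

Comparing the count and density columns shows how much of a bin's count is due simply to its width.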


top_movies %>%
  ggplot(aes(gross_adj, y = ..density..)) +
  geom_histogram(breaks = c(300, 400, 600, 1500), color = "white")

On the density scale, the height of a bin is the density: the proportion per unit on the horizontal axis


top_movies %>%
  ggplot(aes(gross_adj, y = ..density..)) +
  geom_histogram(breaks = c(seq(300, 400, by = 10), 600, 1500), color = "white")

On the density scale, the height of a bin is the density: the proportion per unit on the horizontal axis


top_movies %>%
  ggplot(aes(gross_adj, y = ..density..)) +
  geom_histogram(breaks = c(300, 350, 400, 450, 1500), color = "white")

Recap: there are 25 movies represented in the [400, 450) bin, and 92 in the [450, 1500) bin. But the last bin is much wider, so it is less crowded (less dense)
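
The recap can be worked through as arithmetic. The total number of movies is not stated on the slide, so this sketch compares movies per unit of gross_adj, which is proportional to density.

```r
# Counts quoted in the recap; bin widths follow from the breaks 400, 450, 1500
counts <- c("[400, 450)" = 25, "[450, 1500)" = 92)
widths <- c(450 - 400, 1500 - 450)

# Movies per unit of gross_adj; dividing by the total number of movies
# (not stated in the slides) would turn these into densities
per_unit <- counts / widths
print(per_unit)

# How many times more crowded the narrow bin is than the wide one
per_unit[[1]] / per_unit[[2]]
```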


unusual values


ggplot(diamonds) +
  geom_histogram(aes(y), binwidth = 0.5)


ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))


diamonds %>%
  filter(y < 3 | y > 25) %>%
  select(price, x, y, z) %>%
  arrange(y)


diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

the unusual data can be replaced with NA
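
A sketch that runs the replacement and checks its effect. The name unusual and the follow-up checks are illustrative additions; the thresholds are the ones from the slide.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# Rows flagged as unusual by the thresholds used above
unusual <- diamonds %>% filter(y < 3 | y > 20)

# Keep every row, but blank out the implausible y values
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

nrow(diamonds2) == nrow(diamonds)   # no rows were dropped
sum(is.na(diamonds2$y))             # number of values replaced with NA

# Later summaries need na.rm = TRUE to skip the new NA values
mean(diamonds2$y, na.rm = TRUE)
```

Replacing with NA rather than filtering keeps the other measurements in those rows available for analysis.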


Your Turn 2
Consider the height variable in the dating_profiles dataset. Compute some summary statistics and make a histogram. Do you see any unusual values? Which ones, if any, would you replace with NA?


dating_profiles %>%
  ggplot(aes(height)) +
  geom_histogram(binwidth = 1) +
  coord_cartesian(ylim = c(0, 20))


missing values


summary(flights$dep_time)

First step: look for patterns in the missing data. What values in the other variables tend to occur with the missing values? Are there possible explanations for the missing data?

Approaches to missing data:
1. complete-case analysis
2. available-case analysis
3. impute with the mean value
4. impute using other variables
5. convert to a categorical variable and keep the missing data as an explicit group
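
A minimal sketch of approaches 1, 2, 3 and 5 on a made-up data frame. The tibble toy and all object names are hypothetical; approach 4 needs a model of the other variables and is left out.

```r
library(dplyr)

# Hypothetical toy data with missing values in both columns
toy <- tibble(
  group = c("a", "b", NA, "a"),
  value = c(1, NA, 3, 5)
)

# 1. Complete-case analysis: keep only rows with no missing values
complete_cases <- toy %>% filter(!is.na(group), !is.na(value))

# 2. Available-case analysis: use every observed value of the variable at hand
available_mean <- mean(toy$value, na.rm = TRUE)

# 3. Impute with the mean value
mean_imputed <- toy %>%
  mutate(value = if_else(is.na(value), mean(value, na.rm = TRUE), value))

# 5. Keep the missing data as an explicit group
explicit_group <- toy %>%
  mutate(group = if_else(is.na(group), "(Missing)", group))

print(complete_cases)
print(mean_imputed)
print(explicit_group)
```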