Summarizing Data
Summarizing Data. Statistics statistics probability probability vs. statistics sampling inference.

Jan 04, 2016

Summarizing Data

Statistics

statistics

probability

probability vs. statistics

sampling

inference

Distribution ?

Distribution:

A mathematical way to represent the diversity of characteristics of a group.

The group may be a sample or a population.

• population distribution
• distribution of a sample

distribution of a sample      population distribution
realistic                     imaginary
data                          theory (model)

statistics

Statistics starts from data.

Data are clues to the truth; they tell us something about it.

Data are not just sets of numbers.

The first principle of statistics:

The sample is not the same as the population, but it represents the population sufficiently well.
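
The principle can be illustrated with a small simulation. This is only a sketch and not part of the original slides: the population (10,000 log-normal values), the sample size of 100, and the seed are all assumptions chosen for illustration.

```r
# Hypothetical population: 10,000 skewed "charge" amounts (an assumption).
set.seed(1)
population <- rlnorm(10000, meanlog = 6.5, sdlog = 1.5)

# Draw a random sample of 100 cases without replacement.
smp <- sample(population, size = 100)

# Quartiles of the population and of the sample, on the log scale.
pop.q <- quantile(log(population), c(0.25, 0.5, 0.75))
smp.q <- quantile(log(smp), c(0.25, 0.5, 0.75))

# The two rows are not identical, but they are usually close.
print(round(rbind(population = pop.q, sample = smp.q), 2))
```

Rerunning with a different seed gives a different sample, which is the point: each sample is only an approximation to the population.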

Datawork

Woodwork & Datawork

Woodwork                    Datawork
• From forest               • From real world
• Making timber             • Data collecting
• Inspecting wood grain     • Exploring data
• Cutting                   • Reducing data
• Structuring               • Modeling
• Finishing                 • Evaluating

Craft & Endeavor

Tools & Skills

Statistical tools

• Paper, pencil & calculator
• Spreadsheet software (Excel)
• Minitab, SPSS, SAS, R
• DBMS (Access, Oracle, …)
• C/C++, Java, Python, …

You need skill to use these.

You also need craft and experience.

However, the more important point in datawork is to gain perspective on the data at hand.

There is no standard recipe for good datawork.

Think, think, and think!

That's the only way.

Datawork is not magic. It's hard work.

Salagadoola mechicka boola, bibbidi-bobbidi-boo --

Grain of data ?

Seeing the grain of data ≈ Exploratory Data Analysis

Exploratory Data Analysis (EDA)

The step of checking the basic properties of the data, using basic statistical methods.

With EDA, we aim to develop insight into the data, as a first step toward more specific analysis.

Basic Statistical Methods

Qualitative variable

• frequency table
• cross-tabulation (contingency table)
• bar chart, pie chart, …

Basic Statistical Methods

Quantitative scale

• (cumulative) frequency distribution
• histogram
• dot plot
• stem & leaf diagram
• scatter plot
• box plot, …

Example Data

Credit_Card_Bank: p. 22 of SVV

• 12 variables and 100 observations
• Many types of 'offer' made to cardholders
• Goal: find the type of 'offer' that increases cardholders' usage the most.

[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative)
[5] "Marketing.Segment" (Categorical)
[6] "Industry.Segment" (Categorical)
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)

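Variables used below: oct08, mseg, iseg, and loct08 = log(oct08)

# List the files in the data directory and build their full paths.
data.svv <- dir("c:/temp/text")
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")
# Read the first file as a tab-separated table with a header row.
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t")
names(dsv)
# Column 4 is Charges.Oct.2008; take logs and keep the cases with positive charges.
oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]
# Columns 5 and 6 are the marketing and industry segments.
mseg <- dsv[,5]; iseg <- dsv[,6]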

[1] -Inf 6.21 3.96 3.84 6.96 6.95 7.89 7.35 3.97 5.97

[11] 5.50 8.00 6.30 3.13 4.58 8.89 3.81 7.00 8.37 7.85

[21] 7.42 6.86 8.45 6.12 5.62 8.21 6.91 6.87 7.15 5.46

[31] 6.71 6.12 -Inf 7.68 9.08 5.91 3.42 6.12 8.05 7.03

[41] 6.02 2.51 7.20 3.29 7.44 5.88 6.33 6.24 4.33 5.93

[51] 5.25 7.85 8.76 7.15 7.95 7.13 -Inf 7.13 8.11 8.05

[61] 9.11 5.56 8.24 -Inf 7.47 6.70 7.52 6.53 8.33 4.63

[71] 6.80 5.72 7.54 3.48 7.57 8.42 8.16 4.67 7.16 5.61

[81] -Inf 10.42 8.73 4.85 -Inf 6.63 5.48 4.89 8.35 4.65

[91] 5.56 7.39 3.11 3.90 5.72 7.10 -Inf 7.58 8.15 6.30

log(oct08), rounded to 2 decimal places (log(0) = -Inf):

round(loct08,2)
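
Why the zeros have to be set aside before summarizing can be seen on a toy vector. This is a minimal sketch; the values in `charges` are made up for illustration and are not taken from the data set.

```r
# Toy charges (assumed values); two accounts had no charges.
charges <- c(0, 500, 52.5, 0, 1050)

lcharges <- log(charges)
print(lcharges)                 # the zero charges become -Inf

# Keep only the finite logs; here this matches lcharges[charges > 0].
xcharges <- lcharges[is.finite(lcharges)]
print(round(xcharges, 2))

# Count how many cases were dropped.
n.dropped <- sum(!is.finite(lcharges))
print(n.dropped)
```

Leaving the -Inf values in would make the minimum and the mean of the logs -Inf, so they are removed before any summary is computed.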

[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96

[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46

[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91

[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30

[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95

[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20

[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68

[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16

[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89

[91] 9.08 9.11 10.42

Sorted values of log(oct08), after deleting the 7 cases of -Inf:

round(sort(xoct08),2)

[1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T

Levels: A B R T

iseg

The meanings of the levels are not known.

[1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H

mseg

L: low, B: below medium, M: medium, A: above medium, H: high

levels(mseg) <- c("M","H","L","A","B")
mseg <- factor(mseg, levels=c("L","B","M","A","H"))
mseg
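
The two-step recoding (rename the levels, then reorder them) can be tried on a toy factor. This is a sketch under assumptions: the full segment names are guessed from the labels in the frequency table later in the deck, the variable name `seg` is hypothetical, and `ordered = TRUE` is added here only because it is what makes R print the levels with `<` signs.

```r
# Toy segment labels (assumed full names; 'seg' is a hypothetical variable).
seg <- factor(c("Low Spender", "High Spender", "Average Spender",
                "Med Low Spender", "Med High Spender", "Low Spender"))

# By default the levels are sorted alphabetically.
print(levels(seg))

# Step 1: rename the levels, in their current (alphabetical) order.
levels(seg) <- c("M", "H", "L", "A", "B")

# Step 2: reorder the levels from low to high. ordered = TRUE makes
# R print them as L < B < M < A < H.
seg <- factor(seg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)
print(seg)
print(table(seg))
```

Step 1 relies on knowing the alphabetical order of the original labels, so it is worth printing `levels()` before overwriting them.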

[Figure: Histogram of loct08; x-axis: loct08 (2 to 10), y-axis: Frequency (0 to 20)]

hist(xoct08,col="grey")

Stem and leaf display (leaf unit = 0.1):

 2 | 5
 3 | 11345889
 4 | 003667789
 5 | 3555666677999
 6 | 0011122333567789999
 7 | 000111122244445556678999
 8 | 0001222234444789
 9 | 11
10 | 4

Reading the display: in the first row, 2 is a stem and 5 is a leaf, so "2 | 5" stands for 2.5.

stem(xoct08)

leaf unit = 1

 2 | 5
 3 | 11345889
 4 | 003667789
 5 | 3555666677999
 6 | 0011122333567789999
 7 | 000111122244445556678999
 8 | 0001222234444789
 9 | 11
10 | 4

With leaf unit = 1, "2 | 5" stands for 25.

stem(10*xoct08)

5-number summary of log(oct08):

Min.     Q1       Median   Q3       Max.
2.509    5.563    6.864    7.682    10.420

IQR = 2.119

summary(xoct08)
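
The relation between `summary()`, the five numbers, and the IQR can be checked on a small vector. This is a sketch; the values in `x` are assumed for illustration only. Note that `summary()` also prints the mean, which the slide leaves out.

```r
# Small assumed data vector, for illustration only.
x <- c(2.5, 3.1, 4.6, 5.5, 6.1, 6.9, 7.2, 7.7, 8.4, 10.4)

# summary() reports Min, Q1, Median, Mean, Q3 and Max.
print(summary(x))

# quantile() with these probabilities gives just the five numbers.
five <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
print(five)

# IQR() is Q3 - Q1 under the same quantile definition.
iqr <- IQR(x)
print(iqr)
```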

Quartiles: Q1, Q2, Q3

Q1: the value ranked at 25% from the lowest

Q2: the value ranked at 50% from the lowest

Q3: the value ranked at 75% from the lowest

IQR (Inter-Quartile Range) = Q3 - Q1

Median = Q2

How to find Q1, Q2, Q3

Let n be the number of observations and x[c] the c-th ranked value (from the lowest), where

Q1: c = 0.25*(n+1)

Q2: c = 0.5*(n+1)

Q3: c = 0.75*(n+1)

If c is an integer, take the c-th ranked value x[c].

If c is not an integer, take (x[c-] + x[c+])/2, where

c-: the largest integer smaller than c

c+: the smallest integer larger than c

[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96

[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46

[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91

[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30

[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95

[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20

[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68

[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16

[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89

[91] 9.08 9.11 10.42

Sorted values of log(oct08), after deleting the 7 cases of -Inf:

n = 93, so 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5
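
The rank rule can be applied directly to the 93 sorted values printed above. This is a sketch: the helper name `slide.quartile` is hypothetical, and the input is the rounded listing from the slide rather than the raw data. R's own `quantile()` uses a different default rule, which is why its Q3 (reported as 7.682 in the 5-number summary) differs from the value the rank rule gives.

```r
# Sorted log(oct08) values as printed on the slide (rounded to 2 decimals).
x <- c(2.51, 3.11, 3.13, 3.29, 3.42, 3.48, 3.81, 3.84, 3.90, 3.96,
       3.97, 4.33, 4.58, 4.63, 4.65, 4.67, 4.85, 4.89, 5.25, 5.46,
       5.48, 5.50, 5.56, 5.56, 5.61, 5.62, 5.72, 5.72, 5.88, 5.91,
       5.93, 5.97, 6.02, 6.12, 6.12, 6.12, 6.21, 6.24, 6.30, 6.30,
       6.33, 6.53, 6.63, 6.70, 6.71, 6.80, 6.86, 6.87, 6.91, 6.95,
       6.96, 7.00, 7.03, 7.10, 7.13, 7.13, 7.15, 7.15, 7.16, 7.20,
       7.35, 7.39, 7.42, 7.44, 7.47, 7.52, 7.54, 7.57, 7.58, 7.68,
       7.85, 7.85, 7.89, 7.95, 8.00, 8.05, 8.05, 8.11, 8.15, 8.16,
       8.21, 8.24, 8.33, 8.35, 8.37, 8.42, 8.45, 8.73, 8.76, 8.89,
       9.08, 9.11, 10.42)

# Rank rule from the slides: c = p * (n + 1). If c is an integer take x[c],
# otherwise average the two neighbouring ranked values.
# (Hypothetical helper; assumes 1 <= c <= n.)
slide.quartile <- function(x, p) {
  x <- sort(x)
  cpos <- p * (length(x) + 1)
  lo <- floor(cpos)
  hi <- ceiling(cpos)
  (x[lo] + x[hi]) / 2          # lo equals hi when cpos is an integer
}

q <- sapply(c(Q1 = 0.25, Q2 = 0.5, Q3 = 0.75), slide.quartile, x = x)
print(q)    # Q1 = (x[23] + x[24])/2, Q2 = x[47], Q3 = (x[70] + x[71])/2

# R's default rule, for comparison.
r7 <- quantile(x, c(0.25, 0.5, 0.75))
print(r7)
```

With these values the rank rule gives Q1 = 5.56, Q2 = 6.86 and Q3 = 7.765, while the default `quantile()` gives 7.68 for Q3. Several quartile conventions are in common use, and small differences like this one are expected.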

[Figure: Dot plot of loct08; x-axis from 2 to 12]

[Figure: Box plot of oct08; axis from 0 to 30000]

boxplot(oct08)

[Figure: Box plot of log(oct08); axis from 4 to 10]

boxplot(xoct08)

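[Figure: Anatomy of a box plot. The box runs from Q1 to Q3 with a line at Q2, so its length is the IQR. The whiskers end at min(non-outlier) and max(non-outlier), within 1.5 IQR of the box. Points beyond the whiskers are marked (*) as mild outliers or extreme outliers.]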
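
The 1.5 IQR rule from the diagram can be worked through by hand. This is a sketch on assumed toy data. It uses `quantile()` for Q1 and Q3, whereas R's `boxplot()` uses hinges, so the two can differ slightly. The diagram does not state the cut-off that separates mild from extreme outliers, so that distinction is not coded here.

```r
# Assumed toy data with one unusually large value.
x <- c(4.6, 5.5, 5.6, 6.1, 6.3, 6.9, 7.1, 7.4, 7.9, 12.5)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: 1.5 IQR below Q1 and above Q3.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences.
inside <- x[x >= lower & x <= upper]
whiskers <- range(inside)

# Everything outside the fences is flagged as an outlier.
outliers <- x[x < lower | x > upper]

print(c(lower = lower, upper = upper))
print(whiskers)
print(outliers)
```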

Frequency table

                    freq   rel. freq   cum. freq   cum. rel. freq
Low Spender          26      0.26         26          0.26
Med Low Spender      20      0.20         46          0.46
Average Spender      11      0.11         57          0.57
Med High Spender     25      0.25         82          0.82
High Spender         18      0.18        100          1.00
------------------------------------------------------------
Total               100      1.00

table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
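
The four columns can also be assembled into a single table. This is a sketch that starts from the counts shown in the frequency table rather than from the raw data; the column names are assumptions.

```r
# Counts taken from the frequency table.
freq <- c("Low Spender" = 26, "Med Low Spender" = 20, "Average Spender" = 11,
          "Med High Spender" = 25, "High Spender" = 18)

# One row per category: counts, proportions, and their running totals.
tab <- data.frame(
  freq         = freq,
  rel.freq     = freq / sum(freq),
  cum.freq     = cumsum(freq),
  cum.rel.freq = cumsum(freq) / sum(freq)
)
print(tab)
```

Cumulative columns only make sense because the categories are ordered from low to high; for an unordered variable such as iseg they would not be meaningful.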

[Figure: Bar chart of log(oct08) grouped into the intervals (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; y-axis from 0 to 20]

Histogram & Bar chart

Histogram: for quantitative variables; the bars are connected.

Bar chart: for categorical variables; the bars are disconnected.
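
The link between the two displays is that a quantitative variable can be cut into interval categories and then counted like any categorical variable. The following is a sketch on assumed toy values; the break points 2 to 11 mirror the interval labels on the bar chart of log(oct08).

```r
# Assumed toy values on the log scale.
x <- c(2.5, 3.1, 3.9, 4.6, 5.5, 5.6, 6.1, 6.3, 6.9, 7.1, 7.4, 7.9, 8.4, 10.4)

# Histogram: counts in adjacent intervals on a numeric axis.
h <- hist(x, breaks = 2:11, plot = FALSE)
print(h$counts)

# Bar chart: turn the values into categories first, then count them.
bins <- cut(x, breaks = 2:11)
print(table(bins))
```

Dropping `plot = FALSE` draws the histogram, and `barplot(table(bins))` draws the bar chart. The counts are the same; only the treatment of the axis differs.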

Contingency table of mseg and iseg (rows: mseg, columns: iseg)

         A    B    R    T   Total
L        5   13    0    8     26
B       11    8    0    1     20
M        2    4    2    3     11
A        8    7    2    8     25
H        5    0    6    7     18
Total   31   32   10   27    100

table(mseg,iseg)
apply(table(mseg,iseg),1,sum)
apply(table(mseg,iseg),2,sum)
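
The margins can also be attached to the table in one step. This is a sketch that rebuilds the table from the printed cell counts instead of the raw data; `addmargins()` and `prop.table()` are standard R functions, while the object name `tab` is an assumption.

```r
# Cell counts copied from the contingency table (rows: mseg, columns: iseg).
tab <- matrix(c( 5, 13, 0, 8,
                11,  8, 0, 1,
                 2,  4, 2, 3,
                 8,  7, 2, 8,
                 5,  0, 6, 7),
              nrow = 5, byrow = TRUE,
              dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                              iseg = c("A", "B", "R", "T")))
tab <- as.table(tab)

# Add the row and column totals.
print(addmargins(tab))

# Row proportions: the distribution of iseg within each mseg level.
print(round(prop.table(tab, 1), 2))
```

Row proportions make the segments comparable even though the rows have different totals.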

[Figure: Pie chart of iseg; slices A (31), B (32), R (10), T (27)]

pie(table(iseg),col=c("red","light green","green","blue"))

[Figure: Segmented bar chart of (mseg, iseg), serial (stacked bars); x-axis: A, B, R, T; y-axis from 0 to 30]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))

[Figure: Segmented bar chart of (mseg, iseg), parallel (bars side by side); x-axis: A, B, R, T; y-axis from 0 to 12]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)

[Figure: Mosaic plot; iseg (A, B, R, T) on one axis, mseg (L, B, M, A, H) on the other]

mosaicplot(~iseg+mseg,col=rainbow(5))

[Figure: Box plot of log(oct08) by mseg; x-axis: L, B, M, A, H; y-axis from 4 to 10]

boxplot(loct08[oct08>0]~mseg[oct08>0])

A B C D E F

10 11 0 3 3 11

7 17 1 5 5 9

20 21 7 12 3 15

14 11 2 6 5 22

14 16 3 4 3 15

12 14 1 3 6 16

10 17 2 5 1 13

23 17 1 5 1 10

17 19 3 5 3 26

20 21 0 5 2 26

14 7 1 2 6 24

13 13 4 4 4 13

[Figure: InsectSprays data; x-axis: Type of spray (A to F); y-axis: Insect count (0 to 25)]
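
The InsectSprays data ship with R, so the table and a grouped summary can be reproduced directly. This is a sketch: the type of plot on the slide is not recoverable from the transcript, and a box plot by spray is assumed here by analogy with the box plot of log(oct08) by mseg.

```r
# InsectSprays is part of the 'datasets' package that comes with R.
data(InsectSprays)

# Wide layout: one column per spray, 12 counts each.
wide <- unstack(InsectSprays, count ~ spray)
print(wide)

# Per-spray box plot statistics. plot = FALSE returns the numbers
# that a grouped box plot would draw, without opening a graphics device.
bp <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)
print(bp$stats)
```

Calling `boxplot(count ~ spray, data = InsectSprays, xlab = "Type of spray", ylab = "Insect count")` without `plot = FALSE` draws the figure.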

Thank you!!