Summarizing Data: statistics, probability, probability vs. statistics, sampling, inference

Post on 04-Jan-2016


Summarizing Data

Statistics

[Diagram relating probability and statistics; labels: statistics, probability, probability vs. statistics, sampling, inference]

Distribution?

Distribution: a mathematical way to represent the diversity of characteristics of a group. The group may be a sample or a population.

• population distribution
• distribution of a sample

Distribution of a sample    Population distribution
real                        imaginary
data                        theory (model)

statistics

Statistics starts from data.

Data are clues to the truth; they tell us something about it.

Data are not just sets of numbers.

The 1st principle of statistics:

The sample is not the same as the population, but it represents the population sufficiently well.
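The principle above can be illustrated with a small simulation. This is only a sketch: the population is hypothetical (simulated normal values), and the names `population` and `smp` are made up for the example.

```r
# Hypothetical population: 100,000 values from a normal distribution.
set.seed(1)
population <- rnorm(100000, mean = 50, sd = 10)

# Draw a simple random sample of 100 units without replacement.
smp <- sample(population, size = 100)

# The sample is not the population, but its summaries are close.
pop_mean <- mean(population)
smp_mean <- mean(smp)
c(population = pop_mean, sample = smp_mean)
```

The two means differ, yet the sample mean is a usable stand-in for the population mean.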

Woodwork & Datawork

Datawork
• From real world
• Data collecting
• Exploring data
• Reducing data
• Modeling
• Evaluating

Woodwork
• From forest
• Making timber
• Inspecting wood grain
• Cutting
• Structuring
• Finishing

Craft & Endeavor

Tools & Skills

• Paper, pencil & calculator

• Spreadsheet SW (Excel)

• Minitab, SPSS, SAS, R

• DBMS (Access, Oracle, …)

• C/C++, Java, Python, …

Statistical tools

You need skill to use these tools.

You also need craft and experience.

However, the more important point in datawork is to gain perspective on the data at hand.

There is no standard recipe for good datawork.

Think, think, and think!

That’s the only way.

Datawork is not magic. It's hard work.

Salagadoola mechicka boola, bibbidi-bobbidi-boo --

Grain of data?

Exploratory Data Analysis (EDA)

Seeing the grain of data ≈ Exploratory Data Analysis

EDA is the step of checking the basic properties of the data, using basic statistical methods.

Through EDA, we aim to develop insight into the data, as a first step toward more specific analysis.

Basic Statistical Methods

Qualitative variable
• frequency table
• crosstabulation (contingency table)
• bar chart, pie chart, …

Quantitative scale
• (cumulative) frequency distribution
• histogram
• dot-plot
• stem & leaf diagram
• scatter plot
• box plot, …

Example Data

Credit_Card_Bank: p. 22 of SVV

• 12 variables and 100 observations
• Many types of ‘offer’ to cardholders
• Goal: to find the type of ‘offer’ that increases cardholders’ usage the most

[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative)
[5] "Marketing.Segment" (Categorical)
[6] "Industry.Segment" (Categorical)
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)

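Variables used: oct08, mseg, iseg, and loct08 = log(oct08)

# list the data files in the folder and build their full paths
data.svv <- dir("c:/temp/text")
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")
# read the first file (tab-separated, with a header row)
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t")
names(dsv)
# October charges, their logs, and the logs of the positive charges only
oct08 <- dsv[,4]; loct08 <- log(oct08); xoct08 <- loct08[oct08>0]
# marketing segment and industry segment
mseg <- dsv[,5]; iseg <- dsv[,6]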

[1] -Inf 6.21 3.96 3.84 6.96 6.95 7.89 7.35 3.97 5.97

[11] 5.50 8.00 6.30 3.13 4.58 8.89 3.81 7.00 8.37 7.85

[21] 7.42 6.86 8.45 6.12 5.62 8.21 6.91 6.87 7.15 5.46

[31] 6.71 6.12 -Inf 7.68 9.08 5.91 3.42 6.12 8.05 7.03

[41] 6.02 2.51 7.20 3.29 7.44 5.88 6.33 6.24 4.33 5.93

[51] 5.25 7.85 8.76 7.15 7.95 7.13 -Inf 7.13 8.11 8.05

[61] 9.11 5.56 8.24 -Inf 7.47 6.70 7.52 6.53 8.33 4.63

[71] 6.80 5.72 7.54 3.48 7.57 8.42 8.16 4.67 7.16 5.61

[81] -Inf 10.42 8.73 4.85 -Inf 6.63 5.48 4.89 8.35 4.65

[91] 5.56 7.39 3.11 3.90 5.72 7.10 -Inf 7.58 8.15 6.30

Values above: log(oct08), rounded to the 2nd decimal place. Note that log(0) = -Inf.

round(loct08,2)

[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96

[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46

[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91

[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30

[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95

[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20

[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68

[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16

[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89

[91] 9.08 9.11 10.42

Values above: sorted values of log(oct08), after deleting the 7 cases of -Inf.

round(sort(xoct08),2)
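A minimal sketch of why the -Inf cases appear and how they can be dropped. The vector `charges` is a hypothetical stand-in for oct08, not the real data.

```r
# Hypothetical charges; two cardholders spent nothing.
charges <- c(120, 0, 35.5, 980, 0, 52)

log_charges <- log(charges)              # log(0) evaluates to -Inf
n_inf <- sum(is.infinite(log_charges))   # number of -Inf cases

# Keep only the logs of positive charges, as in loct08[oct08 > 0].
finite_logs <- log_charges[charges > 0]
round(sort(finite_logs), 2)
```

Filtering on `charges > 0` before summarizing keeps -Inf out of the mean, quartiles, and plots.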

iseg

 [1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T

The meaning of the levels is not known.

mseg

 [1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H

L: low, B: below medium, M: medium, A: above medium, H: high

# rename the levels (originally in alphabetical order) to one-letter codes;
# this assumes mseg was read as a factor (the read.table default before R 4.0)
levels(mseg)<-c("M","H","L","A","B")
# reorder the levels from low to high
mseg<-factor(mseg, levels=c("L","B","M","A","H"))
mseg
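The renaming and reordering steps can be tried on a toy vector. This is a sketch: the six labels are hypothetical entries in the style of the frequency-table labels, and `ordered = TRUE` is an addition not in the slide code, used here so that the levels print with `<` and can be compared.

```r
# Hypothetical raw labels; the real column has 100 entries.
raw <- c("Low Spender", "High Spender", "Average Spender",
         "Med Low Spender", "Med High Spender", "Low Spender")
seg <- factor(raw)
levels(seg)   # alphabetical: Average, High, Low, Med High, Med Low

# Rename levels in that alphabetical order, then reorder from low to high.
levels(seg) <- c("M", "H", "L", "A", "B")
seg <- factor(seg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)
seg
```

Renaming relies on the alphabetical order of the original levels, so it is worth printing `levels(seg)` before assigning the new codes.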

Histogram of loct08

[Figure: histogram; x-axis: loct08 (2 to 10), y-axis: Frequency (0 to 20)]

hist(xoct08,col="grey")

Stem and leaf display:

leaf unit = 0.1 (stem 2 with leaf 5 represents 2.5)

   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4

stem(xoct08)

leaf unit = 1 (stem 2 with leaf 5 represents 25)

   2 | 5
   3 | 11345889
   4 | 003667789
   5 | 3555666677999
   6 | 0011122333567789999
   7 | 000111122244445556678999
   8 | 0001222234444789
   9 | 11
  10 | 4

stem(10*xoct08)

5 number summary of log(oct08):

 Min.     Q1  Median     Q3    Max.
2.509  5.563   6.864  7.682  10.420

IQR = 2.119

summary(xoct08)

Quartiles: Q1, Q2, Q3

Q1: the value ranked at 25% from the lowest

Q2: the value ranked at 50% from the lowest

Q3: the value ranked at 75% from the lowest

IQR (Inter-Quartile Range) = Q3 - Q1

Median = Q2

How to take Q1, Q2, Q3

Q1: c = 0.25*(n+1)

Q2: c = 0.5*(n+1)

Q3: c = 0.75*(n+1)

If c is an integer, take the c-th ranked value x[c].

If c is not an integer, take (x[c-] + x[c+])/2.

c-: the largest integer below c

c+: the smallest integer above c

[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96

[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46

[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91

[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30

[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95

[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20

[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68

[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16

[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89

[91] 9.08 9.11 10.42

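Values above: sorted values of log(oct08), after deleting the 7 cases of -Inf.

n = 93, so 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5

With the rounded values: Q1 = (x[23] + x[24])/2 = 5.56, Q2 = x[47] = 6.86, Q3 = (x[70] + x[71])/2 = 7.765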
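The quartile rule can be written as a short function and checked against the sorted values. This is a sketch: `slide_quantile` is a hypothetical helper name, it works on the rounded values copied from the slides, and it does not handle positions below 1 or above n. R's own `quantile()` uses a different interpolation rule by default (type 7), so its Q3 need not agree with this rule.

```r
# Sorted log(oct08) values copied from the slides (n = 93).
x <- c(2.51, 3.11, 3.13, 3.29, 3.42, 3.48, 3.81, 3.84, 3.90, 3.96,
       3.97, 4.33, 4.58, 4.63, 4.65, 4.67, 4.85, 4.89, 5.25, 5.46,
       5.48, 5.50, 5.56, 5.56, 5.61, 5.62, 5.72, 5.72, 5.88, 5.91,
       5.93, 5.97, 6.02, 6.12, 6.12, 6.12, 6.21, 6.24, 6.30, 6.30,
       6.33, 6.53, 6.63, 6.70, 6.71, 6.80, 6.86, 6.87, 6.91, 6.95,
       6.96, 7.00, 7.03, 7.10, 7.13, 7.13, 7.15, 7.15, 7.16, 7.20,
       7.35, 7.39, 7.42, 7.44, 7.47, 7.52, 7.54, 7.57, 7.58, 7.68,
       7.85, 7.85, 7.89, 7.95, 8.00, 8.05, 8.05, 8.11, 8.15, 8.16,
       8.21, 8.24, 8.33, 8.35, 8.37, 8.42, 8.45, 8.73, 8.76, 8.89,
       9.08, 9.11, 10.42)

# Rule from the slides: c = p * (n + 1); if c is not an integer,
# average the two neighbouring ranked values.
slide_quantile <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)
  lo <- floor(pos)
  hi <- ceiling(pos)
  (x[lo] + x[hi]) / 2   # equals x[pos] when pos is an integer
}

q_slide <- sapply(c(0.25, 0.5, 0.75), slide_quantile, x = x)
q_r <- unname(quantile(x, c(0.25, 0.5, 0.75)))   # R default (type 7)
rbind(slide_rule = q_slide, r_default = q_r)
```

On these values the two rules agree for Q1 and Q2 but differ for Q3 (7.765 versus 7.68), which is why a hand calculation may not match `summary()` exactly.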

Dot plot

[Figure: dot plot; x-axis: loct08 (2 to 12)]

Box plot of oct08

[Figure: box plot; axis: oct08 (0 to 30000)]

boxplot(oct08)

Box plot of log(oct08)

[Figure: box plot; axis: log(oct08) (4 to 10)]

boxplot(xoct08)

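[Diagram: anatomy of a box plot. Labels: Q1, Q2, Q3, IQR; whiskers reaching min(non-outlier) and max(non-outlier) within 1.5 IQR of the box; mild outliers (*) and extreme outliers (*) beyond]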
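The 1.5 IQR rule can be sketched numerically with the quartiles reported for log(oct08). The outer fences at 3 IQR are an assumed convention for separating mild from extreme outliers, and `classify` is a hypothetical helper; note also that R's `boxplot()` computes its box from hinges, which can differ slightly from these quartiles.

```r
# Quartiles of log(oct08) as reported in the 5 number summary.
q1 <- 5.563
q3 <- 7.682
iqr <- q3 - q1                     # 2.119

# Inner fences at 1.5 * IQR; outer fences at 3 * IQR (assumed convention).
inner <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
outer <- c(lower = q1 - 3 * iqr,   upper = q3 + 3 * iqr)

# Classify a single value relative to the fences.
classify <- function(v) {
  if (v < outer["lower"] || v > outer["upper"]) {
    "extreme outlier"
  } else if (v < inner["lower"] || v > inner["upper"]) {
    "mild outlier"
  } else {
    "non-outlier"
  }
}

sapply(c(min = 2.509, max = 10.420), classify)
```

Under these fences both the smallest and the largest reported values of log(oct08) fall inside the inner fences.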

Frequency table

                   freq  %freq  cum. freq  %cum. freq
Low Spender          26   0.26         26        0.26
Med Low Spender      20   0.20         46        0.46
Average Spender      11   0.11         57        0.57
Med High Spender     25   0.25         82        0.82
High Spender         18   0.18        100        1.00
------------------------------------------------------------
Total               100   1.00

table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
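The four columns of the frequency table can be assembled in one data frame. This is a sketch that rebuilds a segment vector from the counts in the table rather than from the original file; the names `counts`, `seg`, and `freq_table` are made up for the example.

```r
# Rebuild a segment vector from the counts in the frequency table.
counts <- c(L = 26, B = 20, M = 11, A = 25, H = 18)
seg <- factor(rep(names(counts), times = counts), levels = names(counts))

freq     <- table(seg)
rel_freq <- prop.table(freq)          # same as freq / length(seg)

freq_table <- data.frame(
  freq         = as.vector(freq),
  rel_freq     = as.vector(rel_freq),
  cum_freq     = cumsum(as.vector(freq)),
  cum_rel_freq = cumsum(as.vector(rel_freq)),
  row.names    = names(freq)
)
freq_table
```

Fixing the level order in `factor()` first matters here, because cumulative frequencies only make sense when the categories are listed from low to high.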

Bar chart of log(oct08)

[Figure: bar chart; categories: (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; y-axis: 0 to 20]

Histogram & Bar chart

Histogram: for quantitative variables; connected bars

Bar chart: for categorical variables; disconnected bars
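The difference can be seen by drawing the same numbers both ways. This is a sketch with a hypothetical vector `x`; the bar-chart version first turns the values into categories with `cut()`.

```r
# Hypothetical quantitative values (stand-ins for log charges).
x <- c(2.5, 3.1, 3.9, 4.6, 5.5, 5.6, 6.2, 6.9, 7.1, 7.4, 7.9, 8.3)

# Histogram route: hist() bins the numeric axis itself.
h <- hist(x, breaks = 2:9, plot = FALSE)

# Bar chart route: bin first, then treat the bins as categories.
bins <- cut(x, breaks = 2:9)
bin_counts <- table(bins)

hist(x, breaks = 2:9, col = "grey")   # adjacent bars on a numeric axis
barplot(bin_counts)                   # separated bars, one per category
```

The counts are identical in both plots; only the treatment of the horizontal axis (numeric scale versus category labels) changes.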

Contingency table of mseg and iseg (rows: mseg, columns: iseg)

         A    B    R    T  Total
L        5   13    0    8     26
B       11    8    0    1     20
M        2    4    2    3     11
A        8    7    2    8     25
H        5    0    6    7     18
Total   31   32   10   27    100

table(mseg,iseg)
apply(table(mseg,iseg),1,sum)
apply(table(mseg,iseg),2,sum)
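The margins of the contingency table can be reproduced from its cell counts. This is a sketch that types the counts in as a matrix instead of reading the data file; `margin.table()`, `addmargins()`, and `prop.table()` are base R functions used here as alternatives to the `apply()` calls.

```r
# Counts copied from the contingency table (rows: mseg, columns: iseg).
tab <- matrix(c( 5, 13, 0, 8,
                11,  8, 0, 1,
                 2,  4, 2, 3,
                 8,  7, 2, 8,
                 5,  0, 6, 7),
              nrow = 5, byrow = TRUE,
              dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                              iseg = c("A", "B", "R", "T")))
tab <- as.table(tab)

row_tot <- margin.table(tab, 1)   # same as apply(tab, 1, sum)
col_tot <- margin.table(tab, 2)   # same as apply(tab, 2, sum)
addmargins(tab)                   # table with a Sum row and column

# Row proportions: distribution of iseg within each mseg level.
round(prop.table(tab, 1), 2)
```

Row proportions make the segments comparable even though the row totals differ.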

Pie chart of iseg

[Figure: pie chart; slices: A = 31, B = 32, R = 10, T = 27]

pie(table(iseg),col=c("red","light green","green","blue"))

Segmented bar chart of (mseg, iseg) - serial

[Figure: stacked bar chart; x-axis: iseg (A, B, R, T), y-axis: 0 to 30]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))

Segmented bar chart of (mseg, iseg) - parallel

[Figure: side-by-side bar chart; x-axis: iseg (A, B, R, T), y-axis: 0 to 12]

barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)

Mosaic Plot

[Figure: mosaic plot; iseg (A, B, R, T) against mseg (L, B, M, A, H)]

mosaicplot(~iseg+mseg,col=rainbow(5))

Box plot of log(oct08) by mseg

[Figure: box plots; x-axis: mseg (L, B, M, A, H), y-axis: log(oct08) (4 to 10)]

boxplot(loct08[oct08>0]~mseg[oct08>0])

InsectSprays data

 A   B   C   D   E   F
10  11   0   3   3  11
 7  17   1   5   5   9
20  21   7  12   3  15
14  11   2   6   5  22
14  16   3   4   3  15
12  14   1   3   6  16
10  17   2   5   1  13
23  17   1   5   1  10
17  19   3   5   3  26
20  21   0   5   2  26
14   7   1   2   6  24
13  13   4   4   4  13

[Figure: InsectSprays data by spray type; x-axis: Type of spray (A to F), y-axis: Insect count (0 to 25)]
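The table above matches the `InsectSprays` data set that ships with R, so a grouped summary can be sketched directly. The choice of the median as the summary and of box plots as the display is an assumption for illustration, not necessarily what the original figure showed.

```r
# InsectSprays ships with R's datasets package: 72 counts, 6 sprays.
data(InsectSprays)

# Median insect count per spray type.
med <- tapply(InsectSprays$count, InsectSprays$spray, median)
med

# Side-by-side box plots, one per spray.
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Type of spray", ylab = "Insect count")
```

The formula interface `count ~ spray` is the same pattern as `loct08[oct08>0] ~ mseg[oct08>0]` used for the credit card data.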

Thank you !!
