Chapter 4. Elements of Statistics

# brief introduction to some concepts of statistics

# descriptive statistics / inductive statistics (statistical inference)

Post on 26-Mar-2015



# Classification of the field of statistics
i) Sampling theory
ii) Estimation theory
iii) Hypothesis testing
iv) Curve fitting or regression
v) Analysis of variance

4.2 Sampling Theory – The Sample Mean

How many samples are required for a given degree of confidence in the result?

# Terminology

- population: size N, very large or ∞

- (random) sample: size n

# one of the most important quantities is the sample mean

How close might the sample mean be to the average value of the population?

Let the sample have the numerical values x1, x2, …, xn.

Then, the sample mean is given by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Note that we are interested in the statistical properties of arbitrary random samples rather than any particular sample.

That is, the sample mean becomes a random variable.

Therefore, it is appropriate to denote the sample mean as

$$\hat{\bar{X}} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

We want the mean value of the sample mean to be close to the true mean value of the population, written here as $\bar{X}$:

$$E[\hat{\bar{X}}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\, n \bar{X} = \bar{X}$$

the mean value of the sample mean

= the true mean value of the population

The sample mean is an unbiased estimate of the true mean.

But this alone is not sufficient to indicate whether the sample mean is a good estimator of the true population mean.

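What is the variance of the sample mean?

Assume N ≫ n (the characteristics of the population do not change during sampling).

Var = mean square − square of the mean:

$$\mathrm{Var}(\hat{\bar{X}}) = E\left[\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} X_i X_j\right] - (\bar{X})^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E[X_i X_j] - (\bar{X})^2$$

Assumption: X_i and X_j are statistically independent for i ≠ j, so that

$$E[X_i X_j] = \begin{cases} \overline{X^2}, & i = j \\ (\bar{X})^2, & i \ne j \end{cases}$$

Therefore

$$\mathrm{Var}(\hat{\bar{X}}) = \frac{1}{n^2}\left[n\,\overline{X^2} + (n^2 - n)(\bar{X})^2\right] - (\bar{X})^2 = \frac{\overline{X^2} - (\bar{X})^2}{n} = \frac{\sigma^2}{n}$$

where σ² is the true variance of the population. As n → ∞, the variance → 0,

which means that large sample sizes lead to a better estimate.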
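The result Var(sample mean) = σ²/n can be checked with a small simulation. The sketch below is only illustrative: the Gaussian population, its mean and σ, the sample size, the number of trials and the function name `variance_of_sample_mean` are all assumptions made for this example, not values from the slides.

```python
import random
import statistics

def variance_of_sample_mean(n, trials, mu=5.0, sigma=2.0, seed=0):
    """Estimate Var(sample mean) from many independent samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
    # Spread of the sample means across the repeated experiments
    return statistics.pvariance(means)

n = 25          # assumed sample size
sigma = 2.0     # assumed population standard deviation
estimated = variance_of_sample_mean(n=n, trials=20000, sigma=sigma)
predicted = sigma ** 2 / n  # sigma^2 / n
print(f"estimated {estimated:.4f}, predicted {predicted:.4f}")
```

With these assumed settings the two numbers should agree to within a few percent; increasing `n` shrinks both.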

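* Note:
1) When N is not large, the same effect as a large N can be obtained by "sampling with replacement".

2) When N is small and replacement is not possible,

$$\mathrm{Var}(\hat{\bar{X}}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$$

As N → ∞ this converges to the previous formula; when N = n it is 0 (as expected!).

Two examples: see textbook pp. 163~165.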
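The finite-population factor (N − n)/(N − 1) is easy to explore numerically. The helper below is a hypothetical sketch (the name `var_sample_mean` and the numbers are assumptions); it returns σ²/n for an effectively infinite population and applies the correction otherwise.

```python
def var_sample_mean(sigma2, n, N=None):
    """Variance of the sample mean.

    N=None stands for a very large population (or sampling with
    replacement); otherwise the factor (N - n)/(N - 1) is applied.
    """
    if N is None:
        return sigma2 / n
    if N < 2 or not 1 <= n <= N:
        raise ValueError("need N >= 2 and 1 <= n <= N")
    return sigma2 / n * (N - n) / (N - 1)

sigma2 = 4.0  # assumed population variance
print(var_sample_mean(sigma2, 25))          # large-N formula
print(var_sample_mean(sigma2, 25, N=100))   # smaller, because of the correction
print(var_sample_mean(sigma2, 25, N=25))    # whole population sampled
```

Sampling the whole population (N = n) leaves no uncertainty in the mean, which is why the last call returns zero.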

4.3 Sampling Theory – The Sample Variance

The population variance is needed for determining the sample size required to achieve a desired variance of the sample mean (see eq. 4-4).

Definition (Sample Variance):

$$S^2 = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \hat{\bar{X}}\right)^2$$

The expected value of the sample variance can be derived easily using

$$\hat{\bar{X}} = \frac{1}{n}\sum_{j=1}^{n} X_j$$

and is

$$E[S^2] = \frac{n-1}{n}\,\sigma^2$$

which is not the true variance σ²; that is, S² is a biased estimate rather than an unbiased one.

Now, we redefine the sample variance so as to have an unbiased estimate of the population variance:

$$\tilde{S}^2 = \frac{n}{n-1}\,S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \hat{\bar{X}}\right)^2$$

Note that these hold for very large N, that is, N = ∞.

How about when the population size is not large?

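# When N is not large, the expected value of S² is given by

$$E[S^2] = \frac{N}{N-1}\cdot\frac{n-1}{n}\,\sigma^2$$

To obtain an unbiased estimate, we redefine

$$\tilde{S}^2 = \frac{N-1}{N}\cdot\frac{n}{n-1}\,S^2$$

# The variance of the estimates of the variance:

the variance of S²:

$$\mathrm{Var}(S^2) = \frac{\mu_4 - \sigma^4}{n}$$

the variance of S̃²:

$$\mathrm{Var}(\tilde{S}^2) = \frac{n\,(\mu_4 - \sigma^4)}{(n-1)^2}$$

where μ₄ is the 4th central moment of the population:

$$\mu_4 = E\left[(X - \bar{X})^4\right]$$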
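The two sample-variance definitions of this section (dividing by n for S², by n − 1 for the unbiased S̃²) correspond to `statistics.pvariance` and `statistics.variance` in the Python standard library. The data values and the helper name `unbiased_finite` below are made up for illustration.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(data)

s2_biased = statistics.pvariance(data)   # divides by n      -> S^2
s2_unbiased = statistics.variance(data)  # divides by n - 1  -> S-tilde^2

def unbiased_finite(s2, n, N):
    """Rescale S^2 by (N-1)/N * n/(n-1) for a population of size N."""
    return (N - 1) / N * n / (n - 1) * s2

print(s2_biased, s2_unbiased)
print(unbiased_finite(s2_biased, n, N=20))  # assumed small population
```

For a very large N the finite-population version approaches the usual n/(n − 1) rescaling.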

4.4 Sampling Distributions & Confidence Intervals

What is the probability that the estimates are within specified bounds?

The p.d.f. must be known. Two kinds are considered, and only for the sample mean!

Normalized sample mean, when the X_i are Gaussian and independent:

$$Z = \frac{\hat{\bar{X}} - \bar{X}}{\sigma/\sqrt{n}}$$

=> Gaussian (0,1)

Even if the X_i are not Gaussian, as n → ∞ Z is asymptotically Gaussian by the central limit theorem (as a rule of thumb, n should usually be at least 30).

H.W.) Solve the following problems in chap. 4: 4-2.1, 4-2.5, 4-3.1, 4-4.1, 4-5.1, 4-6.1

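When σ is unknown, S̃ is used in place of σ. However, the normalized variable

$$T = \frac{\hat{\bar{X}} - \bar{X}}{\tilde{S}/\sqrt{n}} = \frac{\hat{\bar{X}} - \bar{X}}{S/\sqrt{n-1}}$$

is no longer Gaussian => "Student's t distribution" with n−1 degrees of freedom.

See Figure 4-2 on p. 170.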

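pdf of Student's t distribution:

$$f_T(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi\nu}\;\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad \nu = n - 1$$

where Γ(·) is the gamma function:

$$\Gamma(k+1) = k\,\Gamma(k) \ \text{for any } k, \qquad \Gamma(k+1) = k! \ \text{for integer } k$$

$$\Gamma(1) = \Gamma(2) = 1, \qquad \Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$$

The t distribution has heavier tails than the Gaussian; for n ≥ 30 it is similar to the Gaussian.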
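The Student's t density, f(t) = Γ((ν+1)/2) / (√(πν) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2), can be evaluated directly with `math.gamma`. The sketch below integrates it numerically for ν = 8 between ±2.306 (the tabulated 95% two-sided critical value for 8 degrees of freedom) as a sanity check. The function names and the choice of Simpson's rule are assumptions of this example.

```python
import math

def student_t_pdf(t, nu):
    """Student's t density with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(math.pi * nu) * math.gamma(nu / 2))
    return coeff * (1 + t * t / nu) ** (-(nu + 1) / 2)

def simpson(f, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

nu = 8
central = simpson(lambda t: student_t_pdf(t, nu), -2.306, 2.306)
print(f"P(-2.306 <= T <= 2.306) for nu = {nu}: {central:.4f}")  # close to 0.95
```

The same routine with a larger ν gives a central probability closer to the Gaussian value for the same limits, which is the "heavier tails" statement in numbers.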

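What, then, is a confidence interval?

Interval estimate: asks with what probability the value lies within an interval. A q-percent confidence interval holds with probability q/100 (the confidence level).

$$\bar{X} - \frac{k\sigma}{\sqrt{n}} \le \hat{\bar{X}} \le \bar{X} + \frac{k\sigma}{\sqrt{n}}$$

• Here k is a constant that depends on q and on the pdf of the sample mean.

• For specific values of k, see Table 4-1 on p. 172.

• (The larger q is, the larger k becomes.)

$$q = 100\int_{\bar{X} - k\sigma/\sqrt{n}}^{\bar{X} + k\sigma/\sqrt{n}} f_{\hat{\bar{X}}}(x)\,dx$$

• Example) q = 95% -> 9.804 ≤ x̂ ≤ 10.196
• The probability that the sample mean lies in this interval is 0.95.
• The narrower the interval, the smaller the probability.
• (For q = 99% the interval becomes wider, but the information is less useful for estimation!)

• Note: q from the probability distribution function:

$$q = 100\left[F_{\hat{\bar{X}}}\!\left(\bar{X} + \frac{k\sigma}{\sqrt{n}}\right) - F_{\hat{\bar{X}}}\!\left(\bar{X} - \frac{k\sigma}{\sqrt{n}}\right)\right]$$

• Here F is the probability distribution function for Student's t.

• (See Appendix F or Table 4-2, page 172, for ν = 8.)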
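For the Gaussian case, the constant k of a q-percent confidence interval can be obtained from the inverse normal distribution function instead of a table. The sketch below uses `statistics.NormalDist`; the inputs (center 10, σ = 1, n = 100, so σ/√n = 0.1) are assumed values chosen because they reproduce the 9.804 to 10.196 interval quoted for q = 95%, and the function name is hypothetical.

```python
from statistics import NormalDist

def confidence_interval(center, sigma, n, q_percent):
    """Return (low, high, k) for center ± k*sigma/sqrt(n), Gaussian case."""
    tail = (1 - q_percent / 100) / 2          # probability left in each tail
    k = NormalDist().inv_cdf(1 - tail)        # e.g. about 1.96 for q = 95
    half_width = k * sigma / n ** 0.5
    return center - half_width, center + half_width, k

low95, high95, k95 = confidence_interval(10.0, 1.0, 100, 95)
low99, high99, k99 = confidence_interval(10.0, 1.0, 100, 99)
print(f"95%: k = {k95:.3f}, interval = ({low95:.3f}, {high95:.3f})")
print(f"99%: k = {k99:.3f}, interval = ({low99:.3f}, {high99:.3f})")
```

The 99% interval comes out wider than the 95% one, matching the remark that a larger q means a larger k. For small samples with unknown σ the t distribution table would replace `inv_cdf`.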

4.5 Hypothesis Testing

• The question arises: how does one decide to accept or reject a given hypothesis when the sample size and the confidence level are specified?

• Two steps: i) make some hypothesis about the population;

• ii) determine whether the observed sample confirms or rejects this hypothesis.

• Two tests: one-sided or two-sided.

One-sided example: the average lifetime of a light bulb is >= 1000 hours.

Two-sided example: 100-ohm resistors that are too high or too low.

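One-sided test

Example) A capacitor manufacturer claims that the mean value of the breakdown voltage is >= 300 V.

• A sample of 100 capacitors gives a sample mean x̂ = 290 V and s̃ = 40 V.

• A 99% confidence level is used.
• Q) Is the manufacturer's claim valid?
• A) We would reject the hypothesis!

Normalized r.v. Z, with X̄ = 300 V and s̃ = 40 V:

$$z = \frac{\hat{\bar{x}} - \bar{X}}{\tilde{s}/\sqrt{n}} = \frac{290 - 300}{40/\sqrt{100}} = -2.5$$

The 99% confidence level, however, requires

$$\int_{z_c}^{\infty} f_Z(z)\,dz = 1 - F_Z(z_c) = 0.99 \quad\Rightarrow\quad z_c = -2.33 > -2.5$$

[Figure: Gaussian pdf with the critical value z_c = −2.33 and the observed z = −2.5 marked]

• If the confidence level were 99.5%: z_c = −2.575 < −2.5, so we accept the hypothesis.

• The lower the confidence level, the narrower the interval and the less likely the hypothesis is to be accepted.

• That is, a lower confidence level imposes a more severe requirement.
• This feels contradictory in meaning.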
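The capacitor example (claimed mean >= 300 V, 100 samples, sample mean 290 V, s̃ = 40 V) can be reproduced with the standard library. This is a sketch for the large-sample Gaussian case only; the function name `lower_tail_z_test` is an assumption, and σ is approximated by s̃ as in the example.

```python
import math
from statistics import NormalDist

def lower_tail_z_test(sample_mean, claimed_mean, s, n, confidence):
    """One-sided test of 'true mean >= claimed_mean'.

    Returns (z, z_c, accepted); the claim is accepted when z >= z_c,
    where z_c satisfies 1 - F(z_c) = confidence.
    """
    z = (sample_mean - claimed_mean) / (s / math.sqrt(n))
    z_c = NormalDist().inv_cdf(1 - confidence)
    return z, z_c, z >= z_c

z99, zc99, ok99 = lower_tail_z_test(290.0, 300.0, 40.0, 100, 0.99)
z995, zc995, ok995 = lower_tail_z_test(290.0, 300.0, 40.0, 100, 0.995)
print(f"99%  : z = {z99:.2f}, z_c = {zc99:.3f}, accepted = {ok99}")
print(f"99.5%: z = {z995:.2f}, z_c = {zc995:.3f}, accepted = {ok995}")
```

The claim is rejected at the 99% level and accepted at 99.5%, which is the apparent paradox that motivates restating the test in terms of a level of significance.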

• So let us restate this in terms of the level of significance,

• that is, (100% − confidence level).
• The larger the level of significance, the more severe the test!

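• Example, continued) sample size = 9, (x̂, s̃²) = (290 V, 40² V²)
• No longer Gaussian -> Student's t distribution

• ν = n − 1 = 8 degrees of freedom
• Confidence level 99%:

$$t = \frac{\hat{\bar{x}} - \bar{X}}{\tilde{s}/\sqrt{n}} = \frac{290 - 300}{40/\sqrt{9}} = -0.75$$

$$t_c = -2.896 < -0.75$$

– accept the hypothesis

• A small sample size increases t (here from −2.5 to −0.75), and

• the t distribution, with its heavier tails, decreases the critical value (here from −2.33 to −2.896),

so t is more likely to exceed the critical value and the hypothesis is more likely to be accepted: small-sample tests are less reliable (less severe) than large-sample tests.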

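Two-sided test

• Example) A manufacturer of Zener diodes claims that the true mean breakdown voltage = 10 V.

• Q) Hypothesis: the true mean is 10 V. Is it accepted or rejected?

• 100 samples -> (x̂, s̃²) = (10.3 V, 1.2² V²)
• 95% confidence level

$$z = \frac{\hat{\bar{x}} - \bar{X}}{\tilde{s}/\sqrt{n}} = \frac{10.3 - 10}{1.2/\sqrt{100}} = 2.5$$

• The acceptance interval is −1.96 ≤ z ≤ 1.96, and z is outside the interval.

• A) Rejected!

• Q, continued) 9 samples, (x̂, s̃²) = (10.3 V, 1.2² V²)

$$t = \frac{\hat{\bar{x}} - \bar{X}}{\tilde{s}/\sqrt{n}} = \frac{10.3 - 10}{1.2/\sqrt{9}} = 0.75$$

• The acceptance interval is −2.306 ≤ t ≤ 2.306, and t is inside the interval.

• A) Accepted! Less severe than a large-sample test.

[Figure: t distribution with the central 95% region between ±t_c = ±2.306 and 2.5% in each tail]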

4.6 Curve Fitting and Linear Regression

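• An analysis method that uses data to find, statistically, the functional relationship between variables (independent and dependent); that is, it looks for a suitable regression equation describing how x and y are related.

• Usually a first-order (linear) or second-order equation.
• In contrast, the correlation analysis of the next section examines the relationship between x and y by computing a correlation coefficient.

• Terminology
- Scatter diagram: a plot of the data

- n samples: $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$

- Curve fitting: finding a mathematical relationship; the resulting curve is the regression curve (equation)

- What is the "best" fit? The best fit in a least-squares sense.

- Let $\varepsilon_i$ be the errors between the regression curve and the scatter diagram, with sum of squares

$$\varepsilon_1^2 + \varepsilon_2^2 + \cdots + \varepsilon_n^2$$

- The problem is to choose the unknown coefficients that minimize this sum.

- First decide the type of equation to be fitted to the data, for example

$$y = a + bx + cx^2$$

Keeping the number of unknown coefficients much smaller than n gives a smoothing effect.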

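• Linear regression

$$y = a + bx$$

• Which a and b minimize

$$J = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \ ?$$

• Solution)

$$\frac{\partial J}{\partial a} = 0 \ \Rightarrow\ \sum_{i=1}^{n} y_i = a n + b \sum_{i=1}^{n} x_i$$

$$\frac{\partial J}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2$$

• Solving the simultaneous equations gives

$$b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$

$$a = \frac{\sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i}{n}$$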

MATLAB built-in function: p = polyfit(x, y, n)
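The closed-form least-squares solution for y = a + bx, with b = (n Σxᵢyᵢ − Σxᵢ Σyᵢ) / (n Σxᵢ² − (Σxᵢ)²) and a = (Σyᵢ − b Σxᵢ)/n, takes only a few lines of code. The sketch below is a plain-Python illustration with made-up data; the function name `linear_fit` is an assumption and is not related to MATLAB's polyfit.

```python
def linear_fit(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are equal; slope is undefined")
    b = (n * sxy - sx * sy) / denom
    a = (sy - b * sx) / n
    return a, b

def squared_error(xs, ys, a, b):
    """J = sum of squared differences between data and the fitted line."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # made-up data lying exactly on y = 1 + 2x
a, b = linear_fit(xs, ys)
print(a, b, squared_error(xs, ys, a, b))
```

Because the made-up points lie exactly on a line, the fit recovers a = 1 and b = 2 with zero squared error; noisy data would leave a positive J.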

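• A second-order regression (textbook p. 180, Table 4-3, Figure 4-6): v_B is fitted as a second-order polynomial in T, with coefficient magnitudes 0.0334 (T² term), 0.6540 (T term) and 426.0500 (constant term); the signs of the coefficients are not legible here.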

4.7 Correlation between Two Sets of Data

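• Are two data sets correlated or not?

$$x_1, x_2, \ldots, x_n, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$y_1, y_2, \ldots, y_n, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

• Linear correlation coefficient, "Pearson's r":

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \qquad |r| \le 1$$

r is also random; for large n (n > 500) it is approximately Gaussian.

Usage: useful in determining the sources of errors.

Example) A point-to-point digital communication link
• The quality of the link is judged by its BER (bit error rate).
• The BER may fluctuate randomly due to wind.

Q) Is wind the error source? A correlation test between 20 wind-speed measurements and the resulting BER gives r = 0.891, which is large enough, so yes!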
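Pearson's r, the sum of products of deviations divided by the square root of the product of the two sums of squared deviations, can be computed directly from its definition. The sketch below uses small made-up data sets (not the wind-speed and BER measurements, which are not given here); the function name `pearson_r` is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equally long data sets."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one data set is constant")
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_perfect = pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0])   # exact linear relation
r_partial = pearson_r(xs, [2.0, 1.0, 4.0, 3.0, 5.0])    # noisy upward trend
r_inverse = pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0])   # exact inverse relation
print(r_perfect, r_partial, r_inverse)
```

Values near +1 or −1 indicate a strong linear relationship, while values near 0 indicate little linear relationship; how large r must be to count as significant depends on n.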
