Notes and figures are based on or taken from materials in the course textbook: Charles Boncelet,
Probability, Statistics, and Random Signals, Oxford University Press, February 2016.
B.J. Bazuin, Spring 2022 1 of 34 ECE 3800
Charles Boncelet, "Probability, Statistics, and Random Signals," Oxford University Press, 2016. ISBN: 978-0-19-020051-0
Chapter 5: MULTIPLE DISCRETE RANDOM VARIABLES
Sections
5.1 Multiple Random Variables and PMFs
5.2 Independence
5.3 Moments and Expected Values
  5.3.1 Expected Values for Two Random Variables
  5.3.2 Moments for Two Random Variables
5.4 Example: Two Discrete Random Variables
  5.4.1 Marginal PMFs and Expected Values
  5.4.2 Independence
  5.4.3 Joint CDF
  5.4.4 Transformations With One Output
  5.4.5 Transformations With Several Outputs
  5.4.6 Discussion
5.5 Sums of Independent Random Variables
5.6 Sample Probabilities, Mean, and Variance
5.7 Histograms
5.8 Entropy and Data Compression
  5.8.1 Entropy and Information Theory
  5.8.2 Variable Length Coding
  5.8.3 Encoding Binary Sequences
  5.8.4 Maximum Entropy
Summary
Problems
The marginal PMF of X is found by summing the joint PMF over y:

p_X(x) = Σ_y p_XY(x, y)

p_X(0) = 3/12, p_X(1) = 3/12, p_X(2) = 3/12, p_X(3) = 2/12, p_X(4) = 1/12

E[X] = Σ_x x·p_X(x)

E[X] = 0·(3/12) + 1·(3/12) + 2·(3/12) + 3·(2/12) + 4·(1/12) = (0 + 3 + 6 + 6 + 4)/12 = 19/12

E[X²] = Σ_x x²·p_X(x)

E[X²] = 0·(3/12) + 1·(3/12) + 4·(3/12) + 9·(2/12) + 16·(1/12) = (0 + 3 + 12 + 18 + 16)/12 = 49/12

Var[X] = E[X²] − (E[X])²

Var[X] = 49/12 − (19/12)² = 49/12 − 361/144 = (588 − 361)/144 = 227/144
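These hand computations are easy to check; a minimal sketch using exact fractions (the PMF values are the ones from the example above):

```python
from fractions import Fraction as F

# Marginal PMF of X from the example: p(0)=p(1)=p(2)=3/12, p(3)=2/12, p(4)=1/12
pmf = {0: F(3, 12), 1: F(3, 12), 2: F(3, 12), 3: F(2, 12), 4: F(1, 12)}

mean = sum(x * p for x, p in pmf.items())       # E[X]    = 19/12
second = sum(x**2 * p for x, p in pmf.items())  # E[X^2]  = 49/12
var = second - mean**2                          # Var[X]  = 227/144
```

Working in Fraction rather than float keeps the 144ths exact instead of rounding them.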
Computing a CDF

Determine the bounds of interest:

F_XY(2.5, 1.3) = Σ_{x ≤ 2.5} Σ_{y ≤ 1.3} p_XY(x, y) = 6/12
HW Problem 5.5: Continue the example in Section 5.4 and consider the joint transformation. Two-dimensional probability example in (X, Y):

[Figure: grid of points with X ∈ {0, 1, 2, 3, 4} and Y ∈ {0, 1, 2}; 12 equally likely points in X and Y]

Letting U = min(X, Y) and W = max(X, Y):
a.) What are the level curves (draw picture)?
b.) What are the individual PMFs of U and W?
c.) What is the joint PMF of U and W?
For the sum of independent random variables, the MGF of the sum is the product of the individual MGFs!
General comment on Laplace Transforms
a convolution in one domain is multiplication in the other:
  Convolve in time/sample ↔ multiply in Laplace
  Multiply in time/sample ↔ convolve in Laplace
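The same fact in the sample domain: the PMF of a sum of independent discrete random variables is the convolution of their PMFs. A minimal sketch for two fair dice (numpy is assumed here; it is not part of the notes):

```python
import numpy as np

# PMF of one fair die on {1, ..., 6}
die = np.full(6, 1 / 6)

# PMF of the sum of two independent dice: convolve the individual PMFs.
# The result has support {2, ..., 12}; index k corresponds to a sum of k + 2.
pmf_sum = np.convolve(die, die)

p7 = pmf_sum[7 - 2]  # P(sum = 7) = 6/36
```

Multiplying the two MGFs and reading off coefficients would give the same 11 numbers.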
5.6 Sample Probabilities, Mean, and Variance (The beginning of the relationship between statistics and probability!)
Statistics Definition: The science of assembling, classifying, tabulating, and analyzing data or facts:
Descriptive statistics – the collecting, grouping and presenting of data in a way that can be easily understood or assimilated.
Inductive statistics or statistical inference – use data to draw conclusions about, or estimate parameters of, the environment from which the data came.
Theoretical Areas:
Sampling Theory – selecting samples from a collection of data that is too large to be examined completely.
Estimation Theory – concerned with making estimates or predictions based on the data that are available.
Hypothesis Testing – attempts to decide which of two or more hypotheses about the data are true.
Curve fitting and regression – attempt to find mathematical expressions that best represent the data. (Shown in Chap. 4)
Analysis of Variance – attempt to assess the significance of variations in the data and the relation of these variances to the physical situations from which the data arose. (Modern term: ANOVA)
We will focus on parameter estimation of the mean and variance to begin!
Sampling Theory – The Sample Mean
How many samples are required to find a representative sample set that provides confidence in the results?
Defect testing, opinion polls, infection rates, etc.
Definitions
Population: the collection of data being studied. N is the size of the population.
Sample: a random sample is the part of the population selected; all members of the population must be equally likely to be selected! n is the size of the sample.
Sample Mean: the average of the numerical values that make up the sample.
To generalize, describe the statistical properties of arbitrary random samples rather than those of any particular sample.
Sample Mean:  X̄ = (1/n) Σ_{i=1}^{n} X_i,

where the X_i are random variables with a pdf.

Notice that for a pdf the true mean, μ_X, can be computed, while for a sample data set the above sample mean, X̄, is computed.
As may be noted, the sample mean is a combination of random variables and, therefore, can also be considered a random variable. As a result, the hoped-for result can be derived as:
E[X̄] = E[(1/n) Σ_{i=1}^{n} X_i] = (1/n) Σ_{i=1}^{n} E[X_i] = (1/n) Σ_{i=1}^{n} μ_X = (n/n)·μ_X = μ_X
If and when this is true, the estimate is said to be an unbiased estimate.
Though the sample mean may be unbiased, the sample mean may still not provide a good estimate.
What is the "variance" of the computation of the sample mean?
You would expect the sample mean to have some variance about the "probabilistic" or actual mean; therefore, it is also desirable to know something about the fluctuations around the mean. As a result, computation of the variance of the sample mean is desired.
For N >> n or N → ∞ (or even a known pdf), using the collected samples and the prior definition of variance, form a statistical estimate based on the 2nd moment and the square of the mean.
Var[X̄] = E[X̄²] − (E[X̄])²

X̄² = ((1/n) Σ_{i=1}^{n} X_i)·((1/n) Σ_{j=1}^{n} X_j) = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} X_i·X_j

Var[X̄] = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} E[X_i·X_j] − μ_X²

For X_i independent (measurements should be independent of each other):

E[X_i·X_j] = E[X_i]·E[X_j] = μ_X², for i ≠ j
E[X_i·X_j] = E[X_i²], for i = j
As a result, we can split the double sum into the terms where i = j and those where i ≠ j:

Var[X̄] = (1/n²)·( Σ_{i=1}^{n} E[X_i²] + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E[X_i·X_j] ) − μ_X²

Var[X̄] = (1/n²)·( n·E[X²] + n·(n−1)·μ_X² ) − μ_X²

Var[X̄] = (1/n)·E[X²] + ((n−1)/n)·μ_X² − μ_X² = (1/n)·E[X²] − (1/n)·μ_X²

Var[X̄] = ( E[X²] − μ_X² )/n = σ_X²/n
where σ_X² is the true variance (probabilistic) of the random variable, X.
Therefore, as n approaches infinity, this variance in the sample mean estimate goes to zero!
It is referred to as a βconsistentβ estimate. Thus a larger sample size leads to a better estimate of the population mean.
Note: this variance is developed based on "sampling with replacement".
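The σ²/n result can be verified exactly for a small case. The sketch below enumerates every ordered pair of two independent die rolls (n = 2, sampling with replacement) and computes the variance of the sample mean with exact fractions:

```python
from fractions import Fraction as F
from itertools import product

faces = range(1, 7)
mu = F(sum(faces), 6)                           # true mean = 7/2
var = sum((x - mu) ** 2 for x in faces) / 6     # true variance = 35/12

# Enumerate all 36 equally likely ordered samples of size n = 2
n = 2
means = [F(a + b, n) for a, b in product(faces, repeat=n)]
e_mean = sum(means) / len(means)                # equals mu (unbiased)
var_mean = sum((m - e_mean) ** 2 for m in means) / len(means)
# var_mean equals var / n = 35/24
```

Enumeration replaces simulation here: with only 36 equally likely samples, the expectation is computed exactly rather than approximated.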
Example: How many samples of an infinitely long time waveform would be required to ensure the mean is within 1% of the true (probabilistic) mean value? For this relationship, we would require that

σ_X̄ = sqrt(Var[X̄]) ≤ 0.01·μ_X,  or  Var[X̄] ≤ (0.01·μ_X)²

Infinite set, therefore assume that you use the "with replacement" equation:

Var[X̄] = σ_X²/n

Assume that the true mean is 10 and that the true variance is 9, so that the mean +/- a standard deviation would be 10 ± 3. Then,

Var[X̄] = 9/n ≤ (0.01·10)² = 0.01

n ≥ 9/0.01 = 900

A very large sample set size to "estimate" the mean within the 1% desired bound!
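The sample-size arithmetic can be scripted; a small sketch using exact fractions, with μ = 10 and σ² = 9 as in the example:

```python
from fractions import Fraction as F
import math

mu = 10
var = F(9)
rel_tol = F(1, 100)  # the mean estimate must be within 1% of the true mean

# Require Var[X-bar] = sigma^2 / n <= (0.01 * mu)^2, i.e. n >= sigma^2 / (0.01 * mu)^2
n_required = math.ceil(var / (rel_tol * mu) ** 2)
```

Fractions avoid a floating-point ceiling landing one sample high or low right at the boundary.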
Sampling Theory – The Sample Variance
When dealing with probability, both the mean and the variance provide valuable information: the mean describes the "DC" operating condition (about what value is expected), and the variance (in terms of power or squared value) describes the fluctuations about the operating point.
Therefore, we are also interested in the sample variance as compared to the true data variance.
The sample variance of the population (stdevp) is defined as:
S² = (1/n) Σ_{i=1}^{n} (X_i − X̄)²
and continuing until (shown in the coming pages)

E[S²] = ((n−1)/n)·σ_X²

where σ_X² is the true probabilistic variance of the random variable.
Note: the sample variance is not equal to the true variance; it is a biased estimate!
To create an unbiased estimator, scale by the biasing factor to compute (stdev):
S̃² = (n/(n−1))·S² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)²

E[S̃²] = (n/(n−1))·E[S²] = (n/(n−1))·((n−1)/n)·σ_X² = σ_X²
This is equation 5.12 in the textbook!
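A quick numeric check of the biasing factor; numpy is assumed here (the notes themselves use MATLAB and Excel):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

s2_biased = np.var(x)            # divides by n      -> 4.0
s2_unbiased = np.var(x, ddof=1)  # divides by n - 1  -> 32/7
# The two estimates differ by exactly the factor n/(n-1)
```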
Additional notes: MATLAB and MS Excel
Simulation and statistical software packages allow for either biased or unbiased computations.
In MS Excel there are two distinct functions stdev and stdevp.
In MATLAB, there is an additional flag associated with the std function.
std(X) = sqrt( (1/(n−1)) Σ_{j=1}^{n} (X_j − X̄)² ) = sqrt(var(X)),  flag implied as 0

std(X, 1) = sqrt( (1/n) Σ_{j=1}^{n} (X_j − X̄)² ) = sqrt(var(X, 1)),  flag specified as 1
>> help std
 std  Standard deviation.
   For vectors, Y = std(X) returns the standard deviation.
   For matrices, Y is a row vector containing the standard deviation of each column.
   For N-D arrays, std operates along the first non-singleton dimension of X.
   std normalizes Y by (N-1), where N is the sample size. This is the sqrt of an unbiased estimator of the variance of the population from which X is drawn, as long as X consists of independent, identically distributed samples.
   Y = std(X,1) normalizes by N and produces the square root of the second moment of the sample about its mean. std(X,0) is the same as std(X).
The tools you use compute the unbiased variance and standard deviation! Did you know this before?!
Sampling Theory – The Sample Variance – Proof
The sample variance of the population is defined as

S² = (1/n) Σ_{i=1}^{n} (X_i − X̄)² = (1/n) Σ_{i=1}^{n} ( X_i − (1/n) Σ_{j=1}^{n} X_j )²

Determining the expected value:

E[S²] = (1/n) Σ_{i=1}^{n} E[ X_i² − (2/n)·X_i·Σ_{j=1}^{n} X_j + (1/n²)·(Σ_{j=1}^{n} X_j)·(Σ_{k=1}^{n} X_k) ]

E[S²] = (1/n) Σ_{i=1}^{n} ( E[X_i²] − (2/n) Σ_{j=1}^{n} E[X_i·X_j] + (1/n²) Σ_{j=1}^{n} Σ_{k=1}^{n} E[X_j·X_k] )

Using independence, E[X_i·X_j] = E[X²] for i = j and E[X_i·X_j] = μ_X² for i ≠ j, so that

Σ_{j=1}^{n} E[X_i·X_j] = E[X²] + (n−1)·μ_X²

Σ_{j=1}^{n} Σ_{k=1}^{n} E[X_j·X_k] = n·E[X²] + n·(n−1)·μ_X²

E[S²] = E[X²] − (2/n)·( E[X²] + (n−1)·μ_X² ) + (1/n)·E[X²] + ((n−1)/n)·μ_X²
E[S²] = ( 1 − 2/n + 1/n )·E[X²] − ( 2·(n−1)/n − (n−1)/n )·μ_X²

E[S²] = ((n−1)/n)·E[X²] − ((n−1)/n)·μ_X² = ((n−1)/n)·( E[X²] − μ_X² )

Therefore,

E[S²] = ((n−1)/n)·σ_X²

To create an unbiased estimator, scale by an (un-)biasing factor to compute:

E[S̃²] = (n/(n−1))·E[S²] = σ_X²
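The bias result E[S²] = ((n−1)/n)·σ² can also be confirmed exactly for a tiny case, by enumerating every ordered sample of size n = 2 drawn with replacement from the population {1, 2, 3}:

```python
from fractions import Fraction as F
from itertools import product

population = [1, 2, 3]
mu = sum(population) / F(len(population))               # true mean = 2
sigma2 = sum((x - mu) ** 2 for x in population) / F(3)  # true variance = 2/3

# Average the biased sample variance S^2 over all 9 equally likely samples
n = 2
biased_vars = []
for sample in product(population, repeat=n):
    xbar = sum(sample) / F(n)
    biased_vars.append(sum((x - xbar) ** 2 for x in sample) / F(n))

e_s2 = sum(biased_vars) / F(len(biased_vars))
# e_s2 == ((n - 1)/n) * sigma2 == 1/3, matching the proof
```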
Statistical Mean and Variance Summary

For taking samples and estimating the mean and variance …

Mean:
  The estimate:          X̄ = (1/n) Σ_{i=1}^{n} X_i
  An unbiased estimate:  E[X̄] = E[X] = μ_X
  Variance of estimate:  Var[X̄] = σ_X²/n

Variance (biased):
  The estimate:          S² = (1/n) Σ_{i=1}^{n} (X_i − X̄)²
  A biased estimate:     E[S²] = ((n−1)/n)·σ_X²
  Variance of estimate:  Var[S̃²] ≈ ( μ₄ − σ_X⁴ )/n,  where μ₄ = E[(X − μ_X)⁴]

Variance (unbiased):
  The estimate:          S̃² = (n/(n−1))·S² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)²
  An unbiased estimate:  E[S̃²] = E[X²] − μ_X² = σ_X²
  Variance of estimate:  Var[S̃²] ≈ ( μ₄ − σ_X⁴ )/n,  where μ₄ = E[(X − μ_X)⁴]
5.7 Histograms
Histogramming can be used to estimate the values of a PMF! However, a significant number of trials may have to be run before the correct PMF can be observed.
Remember the MATLAB simulation of the marble selection in homework #1?!
Sec1_Marble1.m
Sec1_Marble2.m
Sec1_Marble3.m
See Uniform_hist.m
See Binomial_hist.m
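In the same spirit as the MATLAB scripts above, a sketch in Python (an assumption; the course files are MATLAB) that estimates a binomial PMF from a histogram of simulated trials:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(seed=1)
n_trials = 100_000

# Simulate many binomial(4, 0.5) outcomes and histogram them
samples = rng.binomial(4, 0.5, size=n_trials)
counts = np.bincount(samples, minlength=5)
est_pmf = counts / n_trials

# Theoretical PMF for comparison
true_pmf = np.array([comb(4, k) * 0.5**4 for k in range(5)])
```

With enough trials the histogram frequencies settle onto the PMF values; with only a handful they can look quite different, which is the caution stated above.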
Concepts to validate probability: ground truth, traffic studies, trend analysis.
5.8 Entropy and Data Compression
See https://en.wikipedia.org/wiki/Information_theory
The basis of information theory, and of particular benefit to data compression, is the concept of entropy.

When evaluating information, a measure of the information content (randomness) involves the probability of occurrence of the various "letters" in the alphabet and the number of bits actually needed to represent the alphabet.

For the English alphabet, there are m = 26 letters. For normal language, each letter has a probability of occurrence.

The measure of the entropy of each potential symbol is

H = − Σ_{i=1}^{m} p_i·log2(p_i)  bits per symbol
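A minimal sketch of the entropy computation H = −Σ p·log2(p) (Python assumed; the alphabets below are illustrative, not letter frequencies from the text):

```python
import math

def entropy(pmf):
    """Entropy in bits per symbol: H = -sum(p * log2(p)), skipping p = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# A uniform 26-letter alphabet needs log2(26) ~ 4.70 bits per symbol
H_uniform = entropy([1 / 26] * 26)

# A skewed 4-symbol alphabet carries less information per symbol
H_skewed = entropy([0.5, 0.25, 0.125, 0.125])  # 1.75 bits < log2(4) = 2
```

The skewed case is where variable-length coding pays off: frequent symbols can get short codewords.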
Overall, this is a specific application and discussion related to encoding that is quite involved and very important, but somewhat unique to an area of interest. Therefore, read it at your leisure.
Shannon's papers on "A Mathematical Theory of Communication"
Homework Problem 5.5:
Continue the example in Section 5.4 and consider the joint transformation, U = min(X ,Y) (e.g., min(3,2) = 2), and W = max(X ,Y ). For each transformation,
a) What are the level curves (draw pictures)?
b) What are the individual PMFs of U and W?
c) What is the joint PMF of U and W?
Below are the level curves and PMFs for W = max(X ,Y ) and U = min(X ,Y ):
Homework Problem 5.30:
Prove the Cauchy-Schwarz inequality:
( Σ_{i=1}^{n} x_i·y_i )² ≤ ( Σ_{i=1}^{n} x_i² )·( Σ_{i=1}^{n} y_i² )
where the xβs and yβs are arbitrary numbers.
Hint: Start with the following inequality (why is this true?):
0 ≤ Σ_{i=1}^{n} ( x_i − a·y_i )²
Find the value of a that minimizes the right hand side above, substitute that value into the same inequality, and rearrange the terms into the Cauchy-Schwarz inequality at the top.
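A sketch of the steps the hint suggests, written out in LaTeX:

```latex
0 \le \sum_{i=1}^{n} (x_i - a\, y_i)^2
  = \sum_i x_i^2 - 2a \sum_i x_i y_i + a^2 \sum_i y_i^2 .

% Minimize the right-hand side over a (set the derivative to zero):
a^{*} = \frac{\sum_i x_i y_i}{\sum_i y_i^2} .

% Substitute a^{*} back in and rearrange:
0 \le \sum_i x_i^2 - \frac{\left(\sum_i x_i y_i\right)^2}{\sum_i y_i^2}
\quad\Longrightarrow\quad
\left(\sum_i x_i y_i\right)^2 \le \left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right).
```

The initial inequality holds because a sum of squares of real numbers is never negative.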
( Σ_{i=1}^{n} x_i·y_i )² ≤ ( Σ_{i=1}^{n} x_i² )·( Σ_{i=1}^{n} y_i² )

or

0 ≤ ( Σ_{i=1}^{n} x_i² )·( Σ_{i=1}^{n} y_i² ) − ( Σ_{i=1}^{n} x_i·y_i )²
You may have heard the phrase, "The square of the sum of the products is less than or equal to the product of the sums of the squares!"