Page 1: Essentials for Bioinformatics

Essentials for Bioinformatics

Lecture: From zero to (at least) p-values

Jason Mezey ([email protected])

Feb. 21, 2013 (Th) 2-5PM

Page 2: Essentials for Bioinformatics

Goals for today

• One lecture is not even close to enough time to introduce the basics of statistics (even if you have had previous exposure)

• Given this fact, we will go over some basics, where the goal will be to get you to at least one major (and difficult!) concept: p-values

• This is a tool that you will undoubtedly use in some form, but my observation is that very few people who report p-values really understand what they are reporting

• We will also touch on additional foundational and advanced concepts...

Page 3: Essentials for Bioinformatics

• Assume that you want to know if an underlying probability model is not a correct description of your system (a hypothesis! that we will call H0)

• Say you measure a value “x” generated by your system - we can assess our hypothesis H0 by considering the probability of observing “x” conditional on H0 being correct (= true) - note that this distribution need not be normal!!

Intuition (!!): p-values

Pr(x | H0)

Page 4: Essentials for Bioinformatics

• p-value - the probability of obtaining a value of a statistic T(x), or a value more extreme, conditional on H0 being true

• In our case, our statistic is “x” and if we assume a “one-tailed test” (we will get to this in a moment) our p-value could be:

• To really understand this, we need probability and statistics...

Pr(x | H0)

Intuition (!!): p-values
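To make the definition concrete, here is a minimal Python sketch (not from the lecture; it assumes scipy and made-up data of 62 heads in 100 flips) of a one-tailed p-value under a fair-coin H0:

```python
from scipy.stats import binom

# H0: the coin is fair, so the number of heads X in n flips is Binomial(n, 0.5).
n, x_obs = 100, 62  # hypothetical data: 62 heads observed in 100 flips

# One-tailed p-value: Pr(X >= x_obs | H0), the probability of a statistic
# as extreme or more extreme than the one observed, given H0 is true.
p_value = binom.sf(x_obs - 1, n, 0.5)  # sf(k) = Pr(X > k), so pass x_obs - 1
print(f"one-tailed p-value = {p_value:.4f}")  # ~0.0105
```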

Page 5: Essentials for Bioinformatics

Summary of today’s lecture

• A broad introduction to probability and statistics as a modeling framework and decision making framework

• Definition of sample spaces, probability functions (models), samples, statistics, parameterized probability models, hypothesis testing (including p-values)

• Advanced topics: multiple testing, perhaps regression? perhaps others?

Page 6: Essentials for Bioinformatics

Definitions: Probability / Statistics

• Probability (non-technical def) - a mathematical framework for modeling under uncertainty

• Statistics (non-technical def) - a system for interpreting data for the purposes of prediction and decision making given uncertainty

These frameworks are particularly appropriate for modeling biological systems, since we are missing information concerning the complete set of components and relationships among components that determine relationships in our systems

Page 7: Essentials for Bioinformatics

Starting point: a system

• System - a process, an object, etc. which we would like to know something about

• Example: Genetic contribution to height

[Diagram: Genome → Height; a SNP with alleles A and T, where A carriers are taller (on average) and T carriers are shorter (on average)?]

Page 8: Essentials for Bioinformatics

Starting point: a system

• System - a process, an object, etc. which we would like to know something about

• Examples: (1) coin, (2) heights in a population

Coin - same amount of metal on both sides?

Heights - what is the average height in the US?

Page 9: Essentials for Bioinformatics

Experiments (general)

• To learn about a system, we generally pose a specific question that suggests an experiment, where we can extrapolate a property of the system from the results of the experiment

• Examples of “ideal” experiments (System / Experiment):

• SNP contribution to height / directly manipulate A -> T keeping all other genetic, environmental, etc. components the same and observe the result on height

• Coin / cut coin in half, melt and measure the volume of each half

• Height / measure the height of every person in the US

Page 10: Essentials for Bioinformatics

Experiments (general)

• To learn about a system, we generally pose a specific question that suggests an experiment, where we can extrapolate a property of the system from the results of the experiment

• Examples of “non-ideal” experiments (System / Experiment):

• SNP contribution to height / measure heights of individuals that have an A and individuals that have a T

• Coin / flip the coin and observe “Heads” and “Tails”

• Height / measure some people in the US

Page 11: Essentials for Bioinformatics

Experiments and samples

• Experiment - a manipulation or measurement of a system that produces an outcome we can observe

• Experimental trial - one instance of an experiment

• Sample outcome - a possible outcome of the experiment

• Sample - the results of one or more experimental trials

• Example (Experiment / Sample outcomes):

• Coin flip / “Heads” or “Tails”

• Two coin flips / HH, HT, TH, TT

• Measure heights in this class / 5’, 5’3’’, 5’3.5’’, ...
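Simulating trials is a quick way to see these definitions in action; here is a minimal Python sketch (mine, assuming a fair coin) of ten trials of the “two coin flips” experiment:

```python
import random

random.seed(42)  # make the simulated sample reproducible

def two_flip_trial():
    # One experimental trial: flip a fair coin twice, e.g. "HT".
    return "".join(random.choice("HT") for _ in range(2))

# A sample is the result of one or more experimental trials.
sample = [two_flip_trial() for _ in range(10)]
print(sample)
```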

Page 12: Essentials for Bioinformatics

Sets / Sample Spaces / Events

• Set - any collection, group, or conglomerate

• Element - a member of a set

• Sample Space (S) - a set comprising all possible outcomes associated with an experiment

• Examples (Experiment / Sample Space):

• “Single coin flip” / S = {H, T}

• “Two coin flips” / S = {HH, HT, TH, TT}

• “Measure Heights” / S = {5’, 5’3’’, 5’3.5’’, ...}

• Event - a subset of the sample space
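As a sketch (using only the Python standard library), the sample space for the “two coin flips” experiment can be enumerated directly, and events are just subsets of it:

```python
from itertools import product

# Sample space S: the set of all possible outcomes of two coin flips.
S = {"".join(p) for p in product("HT", repeat=2)}
print(S)  # {'HH', 'HT', 'TH', 'TT'}

# An event is a subset of S, e.g. "at least one Tail".
event = {s for s in S if "T" in s}
print(event)  # {'HT', 'TH', 'TT'}
```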

Page 13: Essentials for Bioinformatics

Functions (and numbers)

• Now that we have formalized the concept of a sample space, we need to define what “probability” means

• To do this, we need the concept of a mathematical function

• Function (formally) - a binary relation that maps every element of a domain set to exactly one element of the range set

• Function (informally) - ?

Page 14: Essentials for Bioinformatics

Functions (and numbers)

[Plot from class: the function Y = X², with X (the domain) on the X-axis and Y (the range) on the Y-axis]
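In code, the same idea looks like this (a trivial sketch of mine): a function assigns each element of its domain exactly one element of the range:

```python
def f(x: float) -> float:
    # The function Y = X^2: each input x maps to exactly one output.
    return x ** 2

print([f(x) for x in (-2, -1, 0, 1, 2)])  # [4, 1, 0, 1, 4]
```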

Page 15: Essentials for Bioinformatics

Probability functions

• Probability Function - maps a sample space to the reals

• Not all such functions are probability functions, only those that satisfy the following Axioms of Probability (where an axiom is a property assumed to be true):

This concept is often introduced to us as Y = f(X) where f() is the function that maps the values taken by X to Y. For example, we can have the function Y = X² (see figure from class).

We are going to define a probability function which maps sample spaces to the real line (to numbers):

Pr(S) : S → R (3)

where Pr(S) is a function, which we could have written f(S).

To be useful, we need some rules for how probability functions are defined (that is, not all functions on sample spaces are probability functions). These rules are called the axioms of probability (note that an axiom is a rule that we assume). There is some variation in how these are presented, but we will present them as three axioms:

Axioms of Probability

1. For A ⊂ S, Pr(A) ≥ 0.

2. Pr(S) = 1.

3. For A1, A2, ... ∈ S, if Ai ∩ Aj = ∅ (disjoint) for each i ≠ j: Pr(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ Pr(Ai).

These axioms are necessary for many of the logically consistent results built upon probability. Intuitively, we can think of these axioms as matching how we tend to think about probability. That is, each of us uses probability in an abstract way all the time (e.g. I think it will rain today based on the following observations...) and to make the formal definition of probability operate like our intuitive probability, we need these axioms. The first two are pretty intuitive. We in general would like to scale our probability between ‘zero’, where there is no chance of occurrence, and ‘one’, where we are absolutely sure an event will happen. We therefore need all events to have probability greater than or equal to zero (i.e. we cannot have ‘negative’ probabilities, which we cannot interpret) and we need the probability of the entire set of possible outcomes to be exactly one.

The third axiom is a little less intuitive at first glance. To see how this is also necessary to match our intuition about how we think about probability, let’s consider a counterexample. Let’s define a function where the first two axioms hold, but the third does not. Let’s use our abstract sample space of a ‘pair of coin flips’ which has four outcomes S = {HH, HT, TH, TT}, where all are disjoint (i.e. if HH happens at a particular instance, then the other pairs do not happen), and define: P(HH) = P(HT) = P(TH) = P(TT) = 0.25. TO PURPOSELY VIOLATE THE THIRD AXIOM, let’s also define P(HH ∪ TT) = 0.8, and P(HT ∪ TH) = 0.2 (where we are using P instead of Pr to make it clear that this is not a probability function). Note that with this definition, we do not violate the first two axioms.
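A minimal Python sketch (my own, assuming the fair-coin model above) that makes the axiom-3 violation explicit:

```python
# Fair-coin probability function on the "pair of coin flips" sample space.
Pr = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

# Axiom 3: for the disjoint events {HH} and {TT}, the probability of the
# union must equal the sum Pr(HH) + Pr(TT) = 0.5.
required = Pr["HH"] + Pr["TT"]

# The purposely broken function P in the text assigns 0.8 to this union.
P_union = 0.8
print(required)             # 0.5, what axiom 3 requires
print(P_union == required)  # False: P violates the third axiom
```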


Page 16: Essentials for Bioinformatics

Probability functions on the reals

• In any realistic case, our true sample outcomes will be discrete

• However we often approximate cases using the real numbers as a sample space, which is continuous, to take advantage of mathematical tools available in continuous spaces

• Example: we often use the real numbers as the sample space for human heights

• What approximations are we making? Why are these approximations reasonable?

• Note that to use the reals as a sample space, we cannot use the entire real line, but rather the subsets:

An important question to consider is what approximations we are making when approximating a discrete sample space such as human height. We actually make two. The first is that the sample space does indeed contain a continuous set of values between the heights we could observe (say 4’ to 7’). The second is that we assume heights could actually take any continuous value between −∞ and ∞. At first glance, this latter assumption may seem to be a poor one. However, the way we make this approximation work is by defining a probability function (model) that places a very small probability on heights outside of the range we can observe.

Now, before we take advantage of the mathematical tricks at our disposal when we use a continuous sample space (approximation), we have to deal with some additional issues. It turns out that segments of the real line are ‘compact’ and this introduces a number of problems for defining probability functions (you can get an intuitive idea of what compact means by considering that there are an infinite number of points between any two points you could define on the real line). How we deal with these issues is actually the province of the field of real analysis or, more specifically, measure theory. We are therefore not going to consider them in detail in this course and, beyond the discussion here, we will not discuss measure theory again and it will not impact the concepts that we discuss.

The ‘problem’ with the real line for our purpose is that there are ‘too many’ subsets. This can lead to strange intuitive (but not mathematical) contradictions, e.g. a small continuous three dimensional space R³ can be taken apart and put back together again into a much larger space (without any spaces in between the pieces). Interestingly, this means that the set of all subsets of the real line is therefore not the best approximation of how we think about how we model reality with probability. To produce a set of subsets of R that we can use to approximate how we think about real systems, we will define a set called a sigma field. For our case, the sigma field contains the following subsets of the real line:

[a, b], (a, b], [a, b), (a, b) (4)

where a and b are any two constants. Note that a square bracket means the interval contains the value and a curved bracket means that the interval does not contain the value but rather values that are infinitely close to the bracketed value. Now, it is hard to imagine that this set of subsets of the real line would not include all subsets, but it does not. Imagining what these un-included subsets ‘look like’ is however not intuitive and we in general do not specifically describe them but prove that they exist. With a sigma field in hand, we can define a ‘measure space’ which includes a sample space S (also represented as Ω), a sigma field F, and a probability measure Pr, which satisfies the axioms of probability.

One last historical side note. The rigorous conceptualization of probability is not actually that old, arguably beginning with Kolmogorov in the 1930’s (who defined the axioms of probability). This means that some of the architects of probability theory are still alive, and one of them is here at Cornell: Eugene Dynkin (who is in his 90’s). Dynkin (among other accomplishments) proved a number of theorems and developed a number of important methods (e.g. π-λ systems) which are used to prove a number of important results in basic probability. He is a great teacher and if you ever get the chance to take a course from him, it’s worth it (and you get a living connection to the beginning of probability as we know it!).

S = (−∞,∞) (5)

7 Conditional Probability

A critical concept in probability is the concept of conditional probability. Intuitively, we can define the conditional probability as ‘the probability of an event, given that another event has taken place’. That is, this concept makes formal the case where an event that has taken place provides us information that changes the probability of a future or focal event. The formal definition of the conditional probability of Ai given Aj is:

Pr(Ai | Aj) = Pr(Ai ∩ Aj) / Pr(Aj) (6)

At first glance, this relationship does not seem very intuitive. Let’s consider a quick example that will make it clear why we define conditional probability this way. Let’s use our ‘paired coin flip’ example where Pr(HH) = Pr(HT) = Pr(TH) = Pr(TT) = 0.25. In this case, we have the following:

H2nd T2nd

H1st HH HT

T1st TH TT

where we have the following probabilities:

H2nd T2nd

H1st Pr(H1st ∩ H2nd) Pr(H1st ∩ T2nd) Pr(H1st)

T1st Pr(T1st ∩ H2nd) Pr(T1st ∩ T2nd) Pr(T1st)

Pr(H2nd) Pr(T2nd)

where each entry of the last column is a sum over its row and each entry of the bottom row is a sum over its column. Note that we also have the following relationships: Pr(H1st) = Pr(HH ∪ HT), Pr(H2nd) = Pr(HH ∪ TH), Pr(T1st) = Pr(TH ∪ TT), and Pr(T2nd) = Pr(HT ∪ TT) (work this out for yourself!). Let’s now define the following probability model:
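A minimal Python sketch (mine, assuming the fair-coin model Pr(HH) = Pr(HT) = Pr(TH) = Pr(TT) = 0.25) of the definition in equation (6):

```python
# Fair-coin model on the paired-flip sample space.
Pr = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def prob(event):
    # Probability of an event, i.e. a subset of the sample space.
    return sum(Pr[s] for s in event)

H1st = {"HH", "HT"}  # event: first flip is Heads
H2nd = {"HH", "TH"}  # event: second flip is Heads

# Pr(H2nd | H1st) = Pr(H2nd ∩ H1st) / Pr(H1st)
print(prob(H2nd & H1st) / prob(H1st))  # 0.5: the flips are independent here
```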


Page 17: Essentials for Bioinformatics

Random variables I

• We have defined a probability function (also called a probability measure) to be a function on a sample space to the reals that satisfies the axioms of probability:

• We are often in situations where it is easier to work with and/or we are interested in outcomes that are a function of the original sample space

• For example, “Heads” and “Tails” accurately represent the outcomes of a coin flip example but they are not numbers (and therefore have no intrinsic ordering, etc.)

• As another example, we might be interested in the “number of Tails” resulting from a coin flip

• To handle these cases, we will define a random variable

BTRY 4830/6830: Quantitative Genomics and Genetics, Spring 2010

Lecture 3: Random Variables, Random Vectors, and Probability Distribution Functions

Lecture: January 31; Version 1 Posted: February 2

1 Introduction

Last lecture, we introduced critical concepts for probabilistic modeling: sample spaces and probability functions. Today, we will introduce additional critical concepts: random variables and their associated probability distribution functions. We will do this for random variables that are discrete or continuous. We will then generalize the concept of a random variable to a random vector.

2 Random variables

As we discussed last lecture, a probability function or measure is a function that takes a sample space to the reals:

Pr(S) : S → R (1)

and that abides by certain rules (the axioms of probability). For example, we can define a probability function on the sample space for ‘a pair of coin flips’ as S = {HH, HT, TH, TT} using Pr(HH) = Pr(HT) = Pr(TH) = Pr(TT) = 0.25, i.e. a fair coin example. As we make use of sample spaces and the probability functions that we define, we are often in a position where we want to quantify specific types of outcomes, e.g. the number of ‘Tails’ in our two flips. To do this, we define a random variable, which is a real valued function on the sample space f(S), where we generally substitute X for f:

X(S) : S → R (2)

A random variable differs from a probability function in that it is not constrained to follow the axioms of probability (although it must adhere to rules such that it is still considered a mathematical function!). For example, it is not constrained to be greater than zero, it need not take the entire probability space to 1, and it need not enforce additivity on disjoint sets (the third axiom of probability). While these functions are unconstrained, we in general define them in such a way that they capture useful information about sample outcomes.


Page 18: Essentials for Bioinformatics

Random variables II

• Random variable - a real valued function on the sample space:

• Note that these functions are not constrained by the axioms of probability (e.g. not constrained to be between zero and one, additivity of the function over disjoint events is not required)

• We generally define them in a manner that captures information that is of interest

• As an example, let’s define a random variable for the sample space of the “two coin flip” experiment that maps each sample outcome to the “number of Tails” of the outcome:

3 Discrete random variables

To make the concept of a random variable more clear, let’s begin by considering discrete random variables, where just as with discrete sample spaces, we assume that we can enumerate the values that the random variable can take, i.e. they take specific values we can count such as 0, 1, 2, etc. and cannot take any value within an interval (although note they can potentially take an infinite number of discrete states!). For example, for our sample space of two coin flips S = {HH, HT, TH, TT}, we can define a random variable X representing ‘number of Tails’:

X(HH) = 0, X(HT) = 1, X(TH) = 1, X(TT) = 2 (3)

This is something useful we might want to know about our sample outcomes and now we can work with numbers as opposed to concepts like ‘HT’.

Since we have defined a probability function and a random variable on the same sample space S, we can think of the probability function as inducing a probability distribution on the random variable. We will often represent probability distributions using PX(x) or Pr(X = x), where the lower case ‘x’ indicates the specific value taken by the random variable X. For example, if we define a ‘fair coin’ probability model for our two flip sample space:

Pr(HH) = Pr(HT) = Pr(TH) = Pr(TT) = 0.25 (4)

given this probability model and the random variable defined in equation (3), we now have the following probability distribution for X:

PX(x) = Pr(X = x) = { Pr(X = 0) = 0.25, Pr(X = 1) = 0.5, Pr(X = 2) = 0.25 } (5)

where, again, we use lower case x to indicate a specific realization of the random variable X. Note that it is implicit that a probability of zero is assigned to every other value of X. Here, we have also introduced the notation PX(x) to indicate that this probability distribution is a probability mass function or ‘pmf’, i.e. a probability distribution for a discrete random variable. This is to distinguish it from a probability distribution defined on a continuous random variable, which we will see is slightly different conceptually. Intuitively, the ‘mass’ part of this description can be seen when plotting this probability distribution with the value taken by X on the X-axis and the probability on the Y-axis (see plot from class). In this case the ‘mass’ of the probability is located at three points: 0, 1, and 2.
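A minimal Python sketch (mine, assuming the fair-coin model) of how the probability function on S induces this pmf on X:

```python
from collections import defaultdict

Pr = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}  # fair-coin model on S
X = {s: s.count("T") for s in Pr}  # random variable: number of Tails

# Sum the probability of every sample outcome that X maps to the value x.
pmf = defaultdict(float)
for s, p in Pr.items():
    pmf[X[s]] += p
print(dict(pmf))  # {0: 0.25, 1: 0.5, 2: 0.25}
```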

Now that we have introduced a pmf, let’s consider a related concept: the cumulative mass function or ‘cmf’. When first introduced, it is not clear why we need to define cmf’s. However, it turns out that cmf’s play an important role in probability theory and statistics,


Page 19: Essentials for Bioinformatics

Random variables III

• A critical point to note: because we have defined a probability function on the sample space S, this induces a probability function on the random variable X:

• Notation: we often use an “upper” case letter to represent the random variable (which we will abbreviate r.v.) and a “lower” case letter to represent a specific value that the r.v. takes in an instance

• We will divide our discussion of random variables and the induced probability distributions into cases that are discrete (taking individual point values) or continuous (taking on values within an interval of the reals), since these have slightly different properties (but the same foundation is used to define both!!)


Page 20: Essentials for Bioinformatics

Discrete random variables / probability mass functions (pmf)

• If we define a random variable on a discrete sample space, we produce a discrete random variable. For example, our two coin flip / number of Tails example:

• The probability function in this case will induce a probability distribution that we call a probability mass function which we will abbreviate as pmf

• For our example, if we consider a fair coin probability model (assumption!) for our two coin flip experiment and define a “number of Tails” r.v., we induce the following pmf:

PX(x) = Pr(X = x) = { Pr(X = 0) = 0.25, Pr(X = 1) = 0.5, Pr(X = 2) = 0.25 } (5)

Page 21: Essentials for Bioinformatics

Discrete random variables / cumulative mass functions (cmf)

• An alternative (and important - stay tuned!) representation of a discrete probability model is a cumulative mass function, which we will abbreviate as cmf:

• This definition is not particularly intuitive, so it is often helpful to consider a graph illustration. For example, for our two coin flip / fair coin / number of Tails example:

since they provide an alternative representation of a probability model (versus a pmf) that has better properties in some cases (we will see this below when discussing the uniqueness of the analogous concept for continuous distributions) and they have strong connections to critical concepts in statistics, e.g. a p-value. For the moment, you should take my word for it that cumulative functions are worth knowing about.

We define a cmf as follows:

FX(x) = Pr(X ≤ x) (6)

where we define this function for X from −∞ to +∞. Equation (6) is actually enough to define the cmf completely. However, it is often more intuitive to see how this is calculated using the following formalism:

FX(x) = Σ_{i ≤ x} Pr(X = i) (7)

where the sum is over the discrete set of values on the real line, up to and including x, that we wish to consider (again, note that only values defined in our probability model are assigned non-zero probability). For example, for the probability model in equation (5) we can use equation (7) to calculate the value of the cmf at particular values:

FX(−1) = 0, FX(0) = 0.25, FX(0.5) = 0.25, FX(1) = 0.75, FX(1.2) = 0.75, FX(2) = 1.0, FX(12) = 1.0 (8)

When graphing a cmf from −∞ to ∞ with X on the X-axis and FX(x) on the Y-axis, this produces a ‘step function’. For example, from (−∞, 0) (the interval that gets infinitely close to zero but does not include zero) the function takes the value zero. It then makes a ‘step’ or ‘jump’ up to 0.25 for the interval [0, 1), etc. (see graph from class).
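A minimal Python sketch (assuming the pmf from equation (5)) that evaluates this step function at the points in equation (8):

```python
pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # pmf from equation (5)

def cmf(x):
    # F_X(x) = Pr(X <= x): sum the pmf over every value i <= x.
    return sum(p for i, p in pmf.items() if i <= x)

for x in (-1, 0, 0.5, 1, 1.2, 2, 12):
    print(x, cmf(x))  # reproduces the step function values in equation (8)
```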

4 Continuous random variables

We define random variables that can take any value on the real line or an interval of the real line R to be continuous random variables. It turns out that considering intervals of (or the entire) real line adds considerable complexity for defining the analogous concepts we have considered with discrete random variables (although not if we define a discrete random variable on a continuous probability space - see your first Homework!). To motivate the reason for using continuous random variables, let’s consider our example of a sample space of ‘human heights’. As we discussed last lecture, human heights cannot take any possible value on the real line, but we assume heights could actually take any continuous value between −∞ and ∞ for mathematical convenience (and because we can define probability functions in such a way that this assumption provides a reasonable approximation of reality).

Page 22: Essentials for Bioinformatics

Continuous random variables / probability density functions (pdf)

• For a continuous sample space, we can define a discrete random variable or a continuous random variable (or a mixture!)

• For continuous random variables, we will define analogous “probability” and “cumulative” functions, although these will have different properties

• For this class, we are considering only one continuous sample space: the reals

• Recall that we will use the reals as a convenient approximation to the true sample space

• For the reals, we define a probability density function (pdf):

• A pdf defines the probability of an interval of a random variable, i.e. the probability that the random variable will take a value in that interval

For our continuous probability space, defining a probability function and random variable results in a probability density function (pdf) fX(x) which we can use to define the probability of an interval of the random variable:

Pr(a ≤ X ≤ b) = ∫_a^b fX(x) dx (9)

where the integral of fX(x) from −∞ to ∞ equals 1 (second axiom of probability). We can also define a cumulative density function (cdf):

FX(x) = ∫_−∞^x fX(x) dx (10)

where intuitively, the cdf evaluated at a value x is the area under the curve of the pdf, starting from −∞ to x, e.g. for a symmetric distribution, the value of x right under the ‘peak’ of the pdf produces FX(x) = 0.5 (and note this relationship holds for all continuous distributions if we consider x = median(X), where we will define median in our next lecture).

As an example, assuming our height case, where you will recall from last lecture we defined the sample space of heights to be all open, closed, and combination open/closed intervals on the real line, we will define a random variable X, which takes each of these intervals as an input and returns the same interval as an output (note that, as we will discuss in our lecture on samples, while we generally will only consider specific point outcomes of the random variable in our sample, e.g. X = 5 for an individual person’s height, we will use the random variable and its associated probability distribution to consider the probability that a specific sample outcome occurs in an interval - see below). In this particular height example, our random variable X is the identity function, the function which takes an input and returns the same value as an output, i.e. the function has the general form f(x) = x.

Since we are allowing heights to (in theory) take any value on the real line, we define a probability function that induces the normal distribution on X, a reasonable model for heights. The pdf of a normal distribution has the following form:

fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)) (11)

where we often use the following shorthand to represent this pdf: fX(x) ∼ N(µ, σ²), and where µ and σ² are constants that we call parameters (see your notes from class for a picture of this pdf). The cdf of the normal, FX(x) = Φ(x), is easy to draw (see your class notes for a picture) and while it cannot be written in a ‘closed form’, the function can be calculated to very high precision (we define an equation that has a closed form expression as one that we can write as a single expression that includes only simple functions).
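A minimal Python sketch (mine; it assumes scipy and made-up parameter values µ = 5.6 and σ = 0.3, heights in feet) of using the normal pdf and cdf:

```python
from scipy.stats import norm

mu, sigma = 5.6, 0.3  # hypothetical parameters for heights in feet

# Pr(5 <= X <= 6) = F_X(6) - F_X(5): the area under the pdf over [5, 6].
print(norm.cdf(6, mu, sigma) - norm.cdf(5, mu, sigma))  # ~0.886

# For a symmetric distribution, the cdf evaluated at the peak is 0.5.
print(norm.cdf(mu, mu, sigma))  # 0.5
```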


Page 23: Essentials for Bioinformatics

Probability density functions (pdf): normal example

• To illustrate the concept of a pdf, let’s consider the reals as the (approximate!) sample space of human heights, the normal (also called Gaussian) probability function as a probability model for human heights, and the random variable X that takes the value “height” (what kind of function is this!?)

• In this case, the pdf of X has the following form:

fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)) (11)

where we often use the shorthand fX(x) ∼ N(µ, σ²), and where µ and σ² are constants that we call parameters.

Page 24: Essentials for Bioinformatics

Continuous random variables / cumulative density functions (cdf)

• For continuous random variables, we also have an analog to the cmf, which is the cumulative density function abbreviated as cdf:

• Again, a graph illustration is instructive

• Note the cdf runs from zero to one (why is this?)

FX(x) = ∫_−∞^x fX(x) dx (10)

where intuitively, the cdf evaluated at a value x is the area under the curve of the pdf from −∞ to x.

Page 25: Essentials for Bioinformatics

Random vectors

• We are often in situations where we are interested in defining more than one r.v. on the same sample space

• When we do this, we define a random vector

• Note that a vector, in its simplest form, may be considered a set of numbers (e.g. [1.2, 2.0, 3.3] is a vector with three elements)

• Also note that vectors (when a vector space is defined) have similar properties to numbers in the sense that we can define operations for them (e.g. addition, multiplication), which we will use later in this course

• Beyond keeping track of multiple r.v.’s, a random vector works just like a r.v., i.e. a probability function induces a probability function on the random vector and we may consider discrete or continuous (or mixed!) random vectors

• Finally, note that while we can define several r.v.’s on the same sample space, we can only define one probability function (why!?)

Page 26: Essentials for Bioinformatics

Example of a discrete random vector

• Consider the two coin flip experiment and assume a probability function for a fair coin:

• Let’s define two random variables: “number of Tails” and “first flip is Heads”

• The probability function induces the following pmf for the random vector X = [X1, X2], where we use bold X to indicate a vector (or matrix):

This occurs as a consequence of ‘unmeasurable’ sets, which we attempt to deal with by defining a sigma field.

6. A notation inconsistency is as follows: we abbreviate probability mass function (pmf), cumulative mass function (cmf), probability density function (pdf), and cumulative density function (cdf). However, we also use probability density function, which we abbreviate pdf, to refer to either a pmf or a pdf, and a cumulative probability distribution, which we abbreviate cdf, to refer to either a cmf or a cdf.

5 Random vectors

We are often in situations where we define more than a single random variable for a sample space S. For example, in our ‘two coin flip’ sample space where we define our ‘fair coin’ probability function Pr(S), such that Pr(HH) = Pr(HT) = Pr(TH) = Pr(TT) = 0.25, we could define two random variables, where the first is ‘number of tails’:

X1(HH) = 0, X1(HT) = X1(TH) = 1, X1(TT) = 2 (13)

and the second is an indicator function that the ‘first flip is a head’:

X2(TH) = X2(TT) = 0, X2(HH) = X2(HT) = 1 (14)

In this case, we have defined a random vector for this sample space, i.e. a vector that has two elements: X = [X1, X2]. Note that if we define a vector space, we can start treating vectors much as we do numbers and start defining operations such as vector addition or vector multiplication. We will do this in our next computer lab and our following lectures. For the moment, we will simply consider vectors as a notation system to keep track of multiple random variables.

Just as a probability function Pr(S) induced a pdf (pmf) on a single random variable

X, in our example of two random variables, X = [X1, X2], the probability function now

induces a joint probability function (a joint pdf), which we symbolize as follows:

Pr(X) = Pr(X1 = x1, X2 = x2) = PX(x) = PX1,X2(x1, x2) (15)

For the specific probability function and random variables we have defined, this produces the following PX1,X2(x1, x2):

Pr(X1 = 0, X2 = 0) = 0.0,  Pr(X1 = 0, X2 = 1) = 0.25
Pr(X1 = 1, X2 = 0) = 0.25, Pr(X1 = 1, X2 = 1) = 0.25
Pr(X1 = 2, X2 = 0) = 0.25, Pr(X1 = 2, X2 = 1) = 0.0   (16)




where Pr(X1 = x1, X2 = x2) = Pr(X1 ∩ X2), etc. We can also write this using our table notation:

              X2 = 0    X2 = 1
    X1 = 0    0.0       0.25     0.25
    X1 = 1    0.25      0.25     0.5
    X1 = 2    0.25      0.0      0.25
              0.5       0.5

Note that with this table we have also written out the marginal pdf’s of X1 and X2, which are just the pdf’s of X1 and X2 considered individually: PX1(x1): Pr(X1 = 0) = 0.25, Pr(X1 = 1) = 0.5, Pr(X1 = 2) = 0.25 and PX2(x2): Pr(X2 = 0) = 0.5, Pr(X2 = 1) = 0.5.

Just as we defined conditional probabilities for subsets of a sample space S for which we have defined a probability function Pr(S), we can similarly define the conditional probabilities of random variables:

Pr(X1|X2) = Pr(X1 ∩ X2) / Pr(X2)   (19)

such that we have for example:

Pr(X1 = 0|X2 = 1) = Pr(X1 = 0 ∩ X2 = 1) / Pr(X2 = 1) = 0.25 / 0.5 = 0.5   (20)

Note that we can in fact use random variables as a means to define sample space subsets, so the concepts of conditional probability defined for sample spaces and for joint random variables are interchangeable.

We can similarly define an (interchangeable) concept of independent random variables. Note that our current X1 and X2 are not independent, since:

Pr(X1 = 0 ∩ X2 = 1) = 0.25 ≠ Pr(X1 = 0)Pr(X2 = 1) = 0.25 ∗ 0.5 = 0.125   (21)

and for random variables to be independent, all possible combinations of outcomes must adhere to the definition of independence.
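The joint pmf, marginals, conditional probability, and the independence check above can all be reproduced mechanically from the sample space. A minimal sketch in Python (nothing is assumed here beyond the ‘fair coin’ probability function defined above):

    # Sketch: build X = [X1, X2] directly from the two coin flip sample space.
    from collections import defaultdict

    pr = {'HH': 0.25, 'HT': 0.25, 'TH': 0.25, 'TT': 0.25}  # fair coin Pr(S)
    X1 = lambda s: s.count('T')                 # number of tails
    X2 = lambda s: 1 if s[0] == 'H' else 0      # indicator: first flip is heads

    joint = defaultdict(float)                  # joint pmf P_{X1,X2}(x1, x2)
    for s, p in pr.items():
        joint[(X1(s), X2(s))] += p

    # marginal pmfs: sum the joint pmf over the other random variable
    pX1, pX2 = defaultdict(float), defaultdict(float)
    for (x1, x2), p in joint.items():
        pX1[x1] += p
        pX2[x2] += p

    print(joint[(0, 1)] / pX2[1])               # Pr(X1 = 0 | X2 = 1) = 0.5
    print(joint[(0, 1)] == pX1[0] * pX2[1])     # False: X1, X2 not independent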



Example of a continuous random vector

• Consider an experiment where we define a two-dimensional Reals sample space for “height” and “IQ” for every individual in the US (as a reasonable approximation)

• Let’s define a bivariate normal probability function for this sample space and random variables X1 and X2 that are identity functions for each of the two dimensions

• In this case, the pdf of X=[X1, X2] is a bivariate normal (we will not write out the formula for this distribution - yet):


Pr(X) = Pr(X1 = x1, X2 = x2) = fX(x) = fX1,X2(x1, x2)


Again, note that we cannot use this probability function to define the probabilities of points (or lines!) but we can use it to define the probabilities that values of the random vector fall within (square) intervals [a, b], [c, d] of the two random variables:

FX(x) = FX1,X2(x1, x2) = Σ_{i=−∞}^{x1} Σ_{j=−∞}^{x2} fX1,X2(i, j)   (26)

Pr(a ≤ X1 ≤ b, c ≤ X2 ≤ d) = ∫_a^b ∫_c^d fX1,X2(x1, x2) dx2 dx1   (27)

for discrete and continuous random vectors respectively, and similarly for vectors with more than two elements.
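For the continuous case, equation (27) can be evaluated numerically. A hedged sketch in Python (assuming SciPy is available; the mean vector and covariance matrix below are invented placeholders for the ‘height and IQ’ example, not estimates):

    # Sketch: Pr(a <= X1 <= b, c <= X2 <= d) for a bivariate normal random
    # vector, by integrating the joint pdf over the rectangle as in (27).
    from scipy.stats import multivariate_normal
    from scipy.integrate import dblquad

    mvn = multivariate_normal(mean=[170.0, 100.0],     # hypothetical means
                              cov=[[64.0, 12.0],
                                   [12.0, 225.0]])     # hypothetical covariance

    a, b = 165.0, 180.0   # interval for X1 (height)
    c, d = 90.0, 110.0    # interval for X2 (IQ)

    # dblquad integrates f(x2, x1), with x1 over [a, b] and x2 over [c, d]
    p, _err = dblquad(lambda x2, x1: mvn.pdf([x1, x2]),
                      a, b, lambda x1: c, lambda x1: d)
    print(p)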

Before we leave the concept of joint pdf’s, let’s consider a conceptual extension. Note that it is possible to define more than one sample space on the same experimental outcome. For example, if we were interested in the genetics of human height, our experiment might be to measure both the genotype (e.g. ‘A’ or ‘T’ at a particular SNP) and the phenotype (‘height’) of each individual. This produces ‘two’ sample spaces S1 and S2, which have a relationship, i.e. values of genotype occur with values of phenotype. We can therefore also define a single sample space S = S1 ∩ S2, which indicates this relationship. For both of these sample spaces S1 and S2, we could define distinct probability functions, although again, these probability functions would be related and define a single pdf for S = S1 ∩ S2. We could also define random vectors for each sample space S1 and S2 but only where each element of these random vectors is associated with only one probability function, i.e. while a single probability function can be associated with multiple random variables, a single random variable cannot be associated with more than one probability function. If we consider the sample space S = S1 ∩ S2, we would now have a single random vector that combined the random vectors defined on S1 and S2, e.g. we could have random vectors that include both discrete and continuous random variables.

Finally, let’s introduce one last more formal concept. A vector-valued function Y = f(X) is a function which takes an input X and returns a vector Y = [Y1, Y2, ..., Yn] (note that the input X could also be a vector). A random vector is therefore a vector-valued function on a sample space.




Probability models I

• We have defined Pr(X), a probability model on a random variable (which technically we produce by defining Pr(S) and X(S)...)

• So far, we have considered such probability models without defining them explicitly (except for a few illustrative examples)

• Remember that there is a “true” probability model, that is a consequence of the experiment that produces sample outcomes, which we do not know (!!)

• In general, the starting point of a statistical investigation is to make assumptions about the form of this probability model

• A convenient assumption is that our true probability model is a specific model in a family of distributions that can be described with a compact equation

• This is often done by defining equations indexed by parameters


Probability models II

• Parameter - a constant(s) which indexes a probability model belonging to a family of models Θ such that θ ∈ Θ

• Each value of the parameter (or combination of values if there is more than one parameter) defines a different probability model Pr(X) — see the sketch below

• We assume one such parameter value(s) is the true model

• The advantage of this approach is this has reduced the problem of using the sample to answer a broad question to the problem of using a sample to make an educated guess at the value of the parameter(s)

• Remember that the foundation of such an approach is still an assumption about the properties of the sample outcomes, the experiment, and the system of interest (!!!)
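To see what ‘indexing a family of models’ means in practice, here is a minimal sketch (plain Python; the Bernoulli family used here is developed formally in the notes that follow):

    # Sketch: one compact equation, many probability models. Each value of
    # the parameter p picks out a different Bernoulli pmf from the family.
    def bern_pmf(x, p):
        # Pr(X = x | p) = p^x * (1 - p)^(1 - x), for x in {0, 1}
        return p**x * (1 - p)**(1 - x)

    for p in (0.1, 0.5, 0.9):   # three different models in the same family
        print(p, bern_pmf(0, p), bern_pmf(1, p))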

BTRY 4830/6830: Quantitative Genomics and Genetics, Spring 2011

Lecture 5: Probability Models, Inference, Samples, Statistics, and Estimators

Lecture: February 14; Version 1: February 20; Version 2, March 15

1 Introduction

Last lecture, we discussed some fundamental functions of random variables/vectors and their probability distributions, the interpretation of which does not depend on the specific probability model under consideration: expectations (means), variances, covariances (correlations). Today we will discuss some specific probability models that will be particularly useful to us in our study of quantitative genomics. After introducing these models, we will introduce inference and discuss the first critical concepts for making inferences: samples, statistics, and their sampling distributions. We will then begin our discussion of estimation (a particular class of inference), where we will make use of samples to determine the value of the parameter of the underlying probability model that is responsible for our sample, which we will use to (indirectly) make statements about the system we are studying.

2 Probability models

We have now discussed that by defining a probability function Pr(S) and a random variable X(S) on a sample space S, we define a probability distribution for the random variable Pr(X), and we can use expectations, variances, and covariances to characterize aspects of the probability distribution regardless of the specific form of the distribution. While choosing a specific probability model (a specific probability distribution) is in theory only restricted according to the axioms of probability, we in general make use of probability models that are both intuitive and allow for mathematical conveniences. One such convenience is the ability to (mathematically) simply define a large number of possible probability models with a compact equation. For the models we will consider, the way this is done is by making our probability distributions functions of parameters:

Parameter ≡ a constant which indexes a probability model belonging to a family of models Θ such that θ ∈ Θ.




Discrete parameterized examples

• Consider the probability model for the one coin flip experiment / number of tails.

• This is the Bernoulli distribution with parameter p (what does p represent!?)

• We can write this X ∼ Bern(p) and this family of probability models has the following form: Pr(X = x|p) = p^x (1 − p)^(1−x)

• For the experiment of n coin flips / number of tails, we can assume the Binomial distribution X ∼ Bin(n, p): Pr(X = x|n, p) = (n choose x) p^x (1 − p)^(n−x)

• There are many other discrete examples: hypergeometric, Poisson, etc.

The differences among different models in a particular family therefore simply depend on the specific values of the parameters.

To make this concept more concrete, let’s first consider the probability model for a discrete random variable that can take only one of two values 0 or 1 (which could represent ‘Heads’ or ‘Tails’ for a coin sample space of ‘one flip’). In this case, our specific probability model is the Bernoulli distribution, which is a function of a single parameter p:

Pr(X = x|p) = PX(x|p) = p^x (1 − p)^(1−x)   (1)

Note that we use a conditional notation, since the specific probability model depends on the value of the constant, e.g. a ‘fair coin’ probability model is a case where p = 0.5. The parameter p can take values from [0, 1], so in our parameter notation, we have θ = p and Θ = [0, 1]. We will often use the following shorthand X ∼ Bern(p) to indicate a random variable that has a Bernoulli distribution.

Let’s now introduce a second probability model that we could use to model our random variable describing the ‘number of Tails’ for our sample space of ‘two coin flips’ S = {HH, HT, TH, TT}. Recall that this random variable had the following structure: X(HH) = 0, X(HT) = 1, X(TH) = 1, X(TT) = 2. We can simply represent this random variable as a function of two random variables X1 ∼ Bern(p) and X2 ∼ Bern(p) if we set X = X1 + X2. More generally, we could do this for a sample space for n flips of a coin if we set X = Σ_{i=1}^n Xi. In this case, the probability model for X is a binomial distribution:

Pr(X = x|n, p) = PX(x|n, p) = (n choose x) p^x (1 − p)^(n−x)   (2)

which technically has two parameters (n, p) but we often consider sets of probability models indexed by p for a specific n, i.e. we only consider the parameter p. For example, in our two flip case, we have n = 2 and for these two flips, we can define a number of models including the ‘fair coin’ model p = 0.5. Note that if you are unfamiliar with ‘choose’ notation, it is defined as follows:

(n choose x) = n! / (x!(n − x)!)   (3)

n! = n ∗ (n − 1) ∗ (n − 2) ∗ ... ∗ 1   (4)

which intuitively accounts for the different orderings that lead to the same number of ‘Tails’, e.g. in the two flip case, the orderings HT and TH produce the same number of Tails. We use the following shorthand for the Binomial distribution: X ∼ Bin(n, p).
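As a quick check of equations (2)-(4), a minimal sketch in Python (math.comb requires Python 3.8+, and the comparison assumes SciPy is installed):

    # Sketch: the Bin(n, p) pmf of equation (2), written out and checked.
    from math import comb                 # comb(n, x) is 'n choose x'
    from scipy.stats import binom

    def binom_pmf(x, n, p):
        # (n choose x) * p^x * (1 - p)^(n - x)
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # two flips of a 'fair coin': HT and TH both give one Tail
    print(binom_pmf(1, 2, 0.5))           # 0.5
    print(binom.pmf(1, 2, 0.5))           # same value from SciPy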

Other important discrete distributions include the Hypergeometric, Geometric, and Poisson. We will discuss the former when we consider Fisher’s exact test. While we will not consider the latter two extensively in this course, they are critical to the foundation of ‘population genetics’, the subject that considers the statistical and probabilistic modeling of how genes evolve in populations. Population genetics is a very relevant course for quantitative genomics (and other genomic disciplines), so I encourage you to take a theoretical course on the subject.




Continuous parameterized examples

• Consider the measure heights experiment (reals as approximation to the sample space) / identity random variable

• For this example we can use the family of normal distributions that are parameterized by µ and σ² (what do these parameters represent!?) with the following possible values: µ ∈ (−∞, ∞), σ² ∈ [0, ∞)

• We often write this as X ∼ N(µ, σ²) and the equation has the following form: fX(x|µ, σ²) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

• There are many other continuous examples: uniform, exponential, etc.


Let’s now consider some probability models for continuous random variables. The model we will make the most direct use of in this course is one that we have introduced previously, the normal distribution (also called the Gaussian):

Pr(X = x|µ, σ²) = fX(x|µ, σ²) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))   (5)

The joint pdf of a bivariate normal random vector, which we invoked above for the ‘height and IQ’ example, has an analogous (if bulkier) form:

fX(x|µ1, µ2, σ1², σ2², ρ) = (1/(2πσ1σ2√(1−ρ²))) exp( −(1/(2(1−ρ²))) [ (x1−µ1)²/σ1² − 2ρ(x1−µ1)(x2−µ2)/(σ1σ2) + (x2−µ2)²/σ2² ] )   (6)

This model therefore has two parameters (µ, σ²) such that θ is actually a parameter vector θ = (µ, σ²). The parameter µ intuitively sits in the ‘middle’ or at the ‘center of gravity’ of this distribution (see class notes for a picture) and has the following possible values: Θµ = (−∞, ∞). The σ² parameter intuitively captures the ‘spread’ of the distribution, i.e. the larger the value the greater the spread, and takes the following possible values Θσ² = [0, ∞). As we have seen previously, our shorthand for a normal distribution is X ∼ N(µ, σ²).
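Since the bivariate form in equation (6) is easy to mistype, here is a small sanity-check sketch (Python, assuming SciPy is available; all parameter values are arbitrary):

    # Sketch: hand-coded bivariate normal pdf versus SciPy's implementation.
    from math import exp, pi, sqrt
    from scipy.stats import multivariate_normal

    mu1, mu2 = 0.0, 1.0
    s1, s2, rho = 2.0, 3.0, 0.5           # sigma_1, sigma_2, correlation

    def f(x1, x2):
        # the bracketed quadratic form from equation (6)
        q = ((x1 - mu1)**2 / s1**2
             - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2)
             + (x2 - mu2)**2 / s2**2)
        return exp(-q / (2 * (1 - rho**2))) / (2 * pi * s1 * s2 * sqrt(1 - rho**2))

    cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
    print(f(0.5, 2.0))
    print(multivariate_normal([mu1, mu2], cov).pdf([0.5, 2.0]))  # should agree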

Other continuous distributions that we will run into during this course are the Uniform, chi squared, t-type, F-type, Gamma, and Beta. The former we will discuss in the context of the distribution of p-values, the middle three will come up in our discussion of sampling distributions of statistics, and we will discuss the last two during our lectures on Bayesian statistics.

One final point to note. While we have considered parameterized statistical models for individual ‘univariate’ random variables, there are analogous forms of all of these distributions for random vectors with multiple elements, which are ‘multivariate’ random variables (although the multivariate forms have additional parameters). We will consider some multivariate forms of these distributions in this class, e.g. the multivariate Normal distribution.

3 Introduction to inference

A major goal of the field of statistics is inference:

Inference ≡ the process of reaching conclusions concerning an assumed probability distribution (specifically the parameter(s) θ) on the basis of a sample.

There are two major ‘types’ of inference: estimation and hypothesis testing.
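As a small preview of estimation, a hedged sketch in plain Python (the ‘true’ parameter value is invented here and would of course be unknown in practice):

    # Sketch: draw a sample from Bern(p), then guess p from the sample mean.
    import random

    random.seed(1)
    true_p = 0.7                                   # unknown to the analyst
    sample = [1 if random.random() < true_p else 0 for _ in range(1000)]

    p_hat = sum(sample) / len(sample)              # estimate of the parameter
    print(p_hat)                                   # should be close to 0.7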



3 Introduction to inference

A major goal of the field of statistics is inference:

Inference ≡ the process of reaching conclusions concerning an assumed probability dis-

tribution (specifically the parameter(s) θ) on the basis of a sample.

3

consider the latter two extensively in this course, they are critical to the foundation of

‘population genetics’, the subject that considers the statistical and probabilistic modeling

of how genes evolve in populations. Population genetics is a very relevant course for quan-

titative genomics (and other genomic disciplines), so I encourage you to take a theoretical

course on the subject.

Let’s now consider some probability models for continuous random variables. The model

we will make the most direct use of in this course is one that we have introduced previously,

the normal distribution (also called the Gaussian):

Pr(X = x|µ,σ2) = fX(x|µ,σ2

) =1√2πσ2

e−(x−µ)2

2σ2 (5)

fX(x|µ1, µ2,σ21,σ

22, ρ) =

1

2πσ1σ2√1− ρ

exp

− 1

2(1− ρ2)

(x1 − µ1)

2

2σ21

− 2ρ(x1 − µ1)(x2 − µ2)

σ1σ2+

(x2 − µ1)2

2σ22

(6)

This model therefore has two parameters (µ,σ2) such that θ is actually a parameter vector

θ = θµ,σ2 =µ,σ2

. The parameter µ intuitively sits in the ‘middle’ or at the ‘center of

gravity’ of this distribution (see class notes for a picture) and has the following possible

values: Θµ = (−∞,∞). The σ2 parameter intuitively captures the ‘spread’ of the distri-

bution, i.e. the larger the value the greater the spread, and takes the following possible

values Θσ2 = [0,∞). As we have seen previously, our shorthand for a normal distribution

is X ∼ N(µ,σ2).

Other continuous distributions that we will run into during this course are the Uniform,

chi squared, t-type, F-type, Gamma, and Beta. The former we will discuss in the context

of the distribution of p-values, the middle three will come up in our discussion of sampling

distributions of statistics, and we will discuss the last two during our lectures on Bayesian

statistics.

One final point to note. While we have considered parameterized statistical models for

individual ‘univariate’ random variables, there are analogous forms of all of these distribu-

tions for random vectors with multiple elements, which are ‘multivariate’ random variables

(although the multivariate forms have additional parameters). We will consider some mul-

tivariate forms of these distributions in this class, e.g. the multivariate Normal distribution.

3 Introduction to inference

A major goal of the field of statistics is inference:

Inference ≡ the process of reaching conclusions concerning an assumed probability dis-

tribution (specifically the parameter(s) θ) on the basis of a sample.

3

consider the latter two extensively in this course, they are critical to the foundation of

‘population genetics’, the subject that considers the statistical and probabilistic modeling

of how genes evolve in populations. Population genetics is a very relevant course for quan-

titative genomics (and other genomic disciplines), so I encourage you to take a theoretical

course on the subject.

Let’s now consider some probability models for continuous random variables. The model

we will make the most direct use of in this course is one that we have introduced previously,

the normal distribution (also called the Gaussian):

Pr(X = x|µ,σ2) = fX(x|µ,σ2

) =1√2πσ2

e−(x−µ)2

2σ2 (5)

fX(x|µ1, µ2,σ21,σ

22, ρ) =

1

2πσ1σ2√1− ρ

exp

− 1

2(1− ρ2)

(x1 − µ1)

2

2σ21

− 2ρ(x1 − µ1)(x2 − µ2)

σ1σ2+

(x2 − µ1)2

2σ22

(6)

This model therefore has two parameters (µ,σ2) such that θ is actually a parameter vector

θ = θµ,σ2 =µ,σ2

. The parameter µ intuitively sits in the ‘middle’ or at the ‘center of

gravity’ of this distribution (see class notes for a picture) and has the following possible

values: Θµ = (−∞,∞). The σ2 parameter intuitively captures the ‘spread’ of the distri-

bution, i.e. the larger the value the greater the spread, and takes the following possible

values Θσ2 = [0,∞). As we have seen previously, our shorthand for a normal distribution

is X ∼ N(µ,σ2).

Other continuous distributions that we will run into during this course are the Uniform,

chi squared, t-type, F-type, Gamma, and Beta. The former we will discuss in the context

of the distribution of p-values, the middle three will come up in our discussion of sampling

distributions of statistics, and we will discuss the last two during our lectures on Bayesian

statistics.

One final point to note. While we have considered parameterized statistical models for

individual ‘univariate’ random variables, there are analogous forms of all of these distribu-

tions for random vectors with multiple elements, which are ‘multivariate’ random variables

(although the multivariate forms have additional parameters). We will consider some mul-

tivariate forms of these distributions in this class, e.g. the multivariate Normal distribution.

3 Introduction to inference

A major goal of the field of statistics is inference:

Inference ≡ the process of reaching conclusions concerning an assumed probability dis-

tribution (specifically the parameter(s) θ) on the basis of a sample.

3

Page 35: Essentials for Bioinformatics - Cornell University · 2013. 3. 4. · definition of probability operate like our intuitive probability, we need these axioms. The first two are pretty

Example for random vectors

• Since random vectors are the generalization of r.v.’s, we similarly can define parameterized probability models for random vectors

• As an example, if we consider an experiment where we measure “height” and “weight” and we take the 2-D reals as the approximate sample space (vector identity function), we could assume the bivariate normal family of probability models:

fX(x1, x2 | µ1, µ2, σ1², σ2², ρ) = (1 / (2πσ1σ2√(1 − ρ²))) exp( −(1 / (2(1 − ρ²))) [ (x1 − µ1)²/σ1² − 2ρ(x1 − µ1)(x2 − µ2)/(σ1σ2) + (x2 − µ2)²/σ2² ] )   (6)
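As a quick illustration (ours, not from the notes), the following Python sketch evaluates the bivariate normal density of equation (6) both directly and via scipy.stats.multivariate_normal, where the covariance matrix is built from (σ1², σ2², ρ); the 'height'/'weight' parameter values are made up purely for illustration:

    import numpy as np
    from scipy.stats import multivariate_normal

    def bivariate_normal_pdf(x1, x2, mu1, mu2, s1sq, s2sq, rho):
        # direct evaluation of the bivariate normal density of equation (6)
        s1, s2 = np.sqrt(s1sq), np.sqrt(s2sq)
        q = ((x1 - mu1) ** 2 / s1sq
             - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2)
             + (x2 - mu2) ** 2 / s2sq)
        return np.exp(-q / (2 * (1 - rho ** 2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho ** 2))

    # made-up 'height'/'weight' parameters, purely for illustration
    mu1, mu2, s1sq, s2sq, rho = 170.0, 65.0, 25.0, 100.0, 0.5
    cov = [[s1sq, rho * np.sqrt(s1sq * s2sq)],
           [rho * np.sqrt(s1sq * s2sq), s2sq]]

    print(bivariate_normal_pdf(172.0, 70.0, mu1, mu2, s1sq, s2sq, rho))
    print(multivariate_normal.pdf([172.0, 70.0], mean=[mu1, mu2], cov=cov))  # same value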


Introduction to inference I

• Recall that our eventual goal is to use a sample (generated by an experiment) to provide an answer to a question (about a system)

• So far, we have set up the mathematical foundation that we need to accomplish this goal in a probability / statistics setting (although note we have not yet defined a sample!!)

• Specifically, we have defined formal components of our framework and made assumptions that have reduced the scope of the problem

• With these components and assumptions in place, we are almost ready to perform inference, which will accomplish our goal


Introduction to inference II

• Inference - the process of reaching a conclusion about the true probability distribution (from an assumed family of probability distributions, indexed by the value of parameter(s) θ) on the basis of a sample

• There are two major types of inference we will consider in this course: estimation and hypothesis testing

• Before we get to these specific forms of inference, we need to formally define: samples, sample probability distributions (or sampling distributions), statistics, statistic probability distributions (or statistic sampling distributions)


Samples I

• Recall that we have defined experiments (= experimental trials) in a probability / statistics setting where these involve observing individuals from a population or the results of a manipulation

• We have defined the possible outcomes of an experimental trial, i.e. the sample space S

• We have also defined a random variable X(S), where the random variable maps sample outcomes to numbers, the quantities in which we are ultimately interested

• Since we have also defined a probability model Pr(X), we have shifted our focus from the sample space to the random variable


Samples II

• Sample - repeated observations of a random variable X, generated by experimental trials

• We will consider samples that result from n experimental trials (what would be the ideal n = ideal experiment!?)

• We already have the formalism to represent a sample of size n; specifically, this is a random vector: [X = x] = [X1 = x1, ..., Xn = xn]

• As an example, for our two coin flip experiment / number of tails r.v., we could perform n=3 experimental trials, which would produce a sample = random vector with three elements

There are two major 'types' of inference: estimation and hypothesis testing. Both are essential in quantitative genomics (the latter will often be our goal but the former is required for the latter). We will discuss these in general terms in the next two lectures and in specific terms throughout the semester. Also, note that one of the nice aspects of assuming that the probability model of our random variable is from a family indexed by a parameter set Θ is that the problem of inference is reduced to the problem of learning something about the specific parameter value of our model θ. However, before we get to concepts of inference concerning θ, we need to define several fundamental concepts: samples, statistics, and their sampling distributions.

4 Samples and i.i.d.

Recall that the starting point of our discussion is a system we want to know something about, and an experiment that produces a sample space S. We then define a probability function and a random variable on S, which define a specific probability distribution Pr(X = x), where, by making assumptions, we have defined a specific probability model indexed by θ. In general, we would like to know something about the parameter of our probability model θ, which is defined by the system and experiment (and, by extrapolation from our many assumptions, can be used to learn about the system), but is unknown to us. Inference is the process of determining something about the true parameter value, and for this we need a sample.

Sample ≡ repeated observations of a random variable X, generated by experiments.

The ideal set of experiments would have an infinite number of observations, but since such cases are not possible, we will consider a sample of size n. Now, we have already seen how to represent a sample; it is simply a random vector:

[X = x] = [X1 = x1, ..., Xn = xn]   (7)

where, unlike the random vectors we have considered before, each of the n random variables has the same structure; they simply indicate different observations of the random variable in our sample. For example, for n = 2 in our coin flip example(s), we do not define X1 = '# of Tails' and X2 = '# of Heads', but rather X1 = '# of Tails' in the first flip (or pair of flips) in an experiment and X2 = '# of Tails' in the second flip (or pair of flips) in the same experiment. Now, as we have discussed, defining a probability function on the sample space Pr(S) induces a probability distribution of a random variable defined on the same sample space Pr(X), and since our random vector is considering multiple realizations of this random variable, Pr(X) induces a probability distribution on our sample vector.
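For intuition, here is a small simulation sketch (ours, not the notes') of this setup: the random variable is the '# of Tails' in a pair of fair-coin flips, and one run of n experimental trials yields one realization of the sample random vector:

    import numpy as np

    rng = np.random.default_rng(0)

    # X = '# of Tails' in a pair of flips of a fair coin, i.e. X ~ Binomial(2, 0.5)
    n = 3  # the number of experimental trials, i.e. the sample size
    sample = rng.binomial(n=2, p=0.5, size=n)
    print(sample)  # one realization [x1, x2, x3] of the sample random vector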



Samples III

• Note that since we have defined (or more accurately induced!) a probability distribution Pr(X) on our random variable, this means we have induced a probability distribution on the sample:

• This is the sample probability distribution or sampling distribution (often called the joint sampling distribution)

• While samples could take a variety of forms, we generally assume that each sample has the same form, such that they are identically distributed:

• We also generally assume that each sample is independent of all other samples:

• If both of these assumptions hold, then the sample is independent and identically distributed, which we abbreviate as i.i.d.

That is, a sample random vector X has a (joint) probability distribution:

Pr(X = x) = PX(x) or fX(x) = Pr(X1 = x1, X2 = x2, ..., Xn = xn)   (8)

where each of the Xi has the same distribution as we have defined for X. Since we know they all have the same distribution, we know that:

Pr(X1 = x1) = Pr(X2 = x2) = ... = Pr(Xn = xn)   (9)

and we therefore say that the sample is identically distributed. Ideally, it is also the case that each of these Xi is independent of the rest. Independence makes much of the mathematical framework we use to do inference easier, so we often try to construct experiments that produce such independence. When this is the case, we have:

Pr(X = x) = Pr(X1 = x1)Pr(X2 = x2)...Pr(Xn = xn)   (10)

which follows from the definition of independence. Ideally, therefore, our sample is independent and identically distributed, which we abbreviate as i.i.d. (or iid). We will largely consider i.i.d. samples for this entire course.
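A quick numerical check of the factorization in equation (10), using the Bernoulli sample that appears below in equation (11) (a sketch of ours, assuming a fair coin, p = 0.5):

    import numpy as np

    def bernoulli_pr(x, p):
        # Pr(X = x) for a Bernoulli random variable, x = 0 or 1
        return p ** x * (1 - p) ** (1 - x)

    p = 0.5
    x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]  # the sample shown in equation (11) below

    # equation (10): under independence, the joint probability of the
    # realized sample is the product of the marginal probabilities
    joint = np.prod([bernoulli_pr(xi, p) for xi in x])
    print(joint)      # 0.0009765625
    print(0.5 ** 10)  # = 1/1024, the same value for a fair coin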

Again, note that just as a probability function Pr(S) induces a probability distribution on a random variable X, this same probability distribution will induce a joint probability distribution on the random vector Pr(X = x). This is effectively the probability distribution describing all possible sample outcomes that could occur for a sample of size n, i.e. a random vector where the marginal probability distributions have the same distribution as X and there is no covariance among the Xi (note that by assuming i.i.d., we are providing additional limits on the possible probability distributions that could describe our possible samples).

To perform inference in the real world, we generally only have a single set of experiments and therefore a single sample (or at least a limited number of samples). We are therefore going to consider inference for a specific realization of a sample of size n. For example, for a set of n = 10 Bernoulli samples this could be something like:

x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]   (11)

and for a normally distributed random variable this could be:

x = [−2.3, 0.5, 3.7, 1.2, −2.1, 1.5, −0.2, −0.8, −1.3, −0.1]   (12)

where for the latter, keep in mind the values are constrained by our precision of measurement, and we will approximate them by a continuous random variable and associated sample that we assume are normally distributed, which defines the probability that observations of this random variable fall in a particular interval (see lecture 3).


Samples IV

• It is important to keep in mind that while we have made assumptions such that we can define the joint probability distribution of (all) possible samples that could be generated from n experimental trials, in practice we only observe one set of trials, i.e. one sample

• For example, for our one coin flip experiment / number of tails r.v., we could produce a sample of n = 10 experimental trials, which might look like:

• As another example, for our measure heights / identity r.v., we could produce a sample of n = 10 experimental trials, which might look like:

• In each of these cases, we would like to use these samples to perform inference (i.e. say something about the parameter of the assumed probability model)

• Using the entire sample is unwieldy, so we do this by defining a statistic


Statistics

• Statistic - a function on a sample

• Note that a statistic T is a function that takes a vector (a sample) as an input and returns a value (or vector):

• For example, one possible statistic is the mean of a sample:

• It is critical to realize that, just as a probability model on X induces a probability distribution on a sample, since a statistic is a function on the sample, this induces a probability model on the statistic (the statistic probability distribution or the sampling distribution of the statistic):

To actually perform inference, it is not particularly easy to use the entire sample as is, i.e. in the form of a vector. We therefore usually define a statistic:

Statistic ≡ a function on a sample.

If we define this statistic as T, it has the following structure:

T(x) = T(x1, x2, ..., xn) = t   (13)

where t can be a single number or a vector. For example, let's define a statistic that takes a sample and returns the mean of the sample:

T(x) = (1/n) ∑_{i=1}^{n} x_i   (14)

So for the sample in equation (11) this statistic would be T(x) = 0.5, and for equation (12) it would be T(x) = 0.01. A statistic on a specific realization of a sample is what we use for inference, as we will see in the next two lectures.
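These two values are easy to verify numerically (a quick check of ours):

    import numpy as np

    x_bern = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0])  # equation (11)
    x_norm = np.array([-2.3, 0.5, 3.7, 1.2, -2.1, 1.5,
                       -0.2, -0.8, -1.3, -0.1])        # equation (12)

    print(x_bern.mean())  # 0.5
    print(x_norm.mean())  # 0.01 (up to floating-point rounding)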

Let's consider one last important concept. It is also critical to realize that, just as the probability function on the sample space Pr(S) induces a probability distribution on the random variable defined on the sample space Pr(X), which in turn induces a probability distribution of the i.i.d. sample vector Pr(X = x), since a statistic is a function on the sample, the probability distribution of the sample induces a probability distribution on the possible values the statistic could take, Pr(T(X)), i.e. the probability distribution of the statistic when considering all possible samples. We call this the sampling distribution of the statistic and, as we will see, it also plays an important role in inference.
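Since Pr(T(X)) rarely has an obvious closed form at first sight, it can help to approximate a sampling distribution by simulation. Here is a sketch (ours; the parameter choices are arbitrary) that draws many i.i.d. samples of size n from an assumed N(µ, σ²) and applies the sample-mean statistic T to each:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma2, n = 0.0, 1.0, 10

    # many size-n i.i.d. samples, one per row; T is the sample mean
    samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))
    t = samples.mean(axis=1)

    # the simulated sampling distribution of T centers on mu with
    # variance sigma2 / n (here approximately 0.0 and 0.1)
    print(t.mean(), t.var())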

5 Estimators

Recall that we are interested in knowing about a system, and to do this we conduct an experiment, which we use to define a sample space. We define a probability function and a random variable X on this sample space, where we assume a specific form for the probability function, which defines a probability distribution on our random variable. We write this Pr(X) or Pr(X = x), where the large 'X' indicates a random variable that can take different values, and the little 'x' represents a specific value that the random variable takes (which at the moment we have not assigned). We assume that the probability distribution of the random variable X has a specific form and is in a 'family' of probability distributions that are indexed by parameter(s) θ, e.g. X ∼ N(µ, σ²), which we write Pr(X|θ) or Pr(X = x|θ). While we have assumed the specific form of the distribution (e.g. a 'normal'), we do not know the specific values of the parameters. Our goal is to perform inference to learn something about these true but unknown parameter values.


Statistics II

• As an example, consider our height experiment (reals as approximate sample space) / normal probability model (with true but unknown parameters) / identity random variable

• If we calculate the statistic T(x) = (1/n) ∑_{i=1}^{n} x_i, what is Pr(T(X))?

• Are the distributions of Xi and T(X) always the same?


Statistics and estimators

• Recall that for the purposes of inference, we would like to use a sample to say something about the specific parameter value (of the assumed family of probability models) that could describe our sample space

• Said another way, we are interested in using the sample to determine the “true” parameter value that describes the outcomes of our experiment

• An approach for accomplishing this goal is to define our statistic in a way that will allow us to say something about the true parameter value

• In such a case, our statistic is an estimator of the parameter:

• There are many ways to define estimators

• Each estimator has different properties and there is no perfect estimator

• Note that without an infinite sample, we will never know if we are correct with certainty (!!)


Statistics and estimators II

• Since our underlying probability model induces a probability distribution on a statistic, and an estimator is just a statistic, there is an underlying probability distribution on an estimator:

• Our estimator takes in a vector as input (the sample) and may be defined to output a single value or a vector of estimates:

• We cannot define a statistic that always outputs the true value of the parameter for every possible sample (hence no perfect estimator!)

• There are different ways to define “good” estimators and lots of ways to define “bad” estimators (examples?)

Recall that our sample consists of n repeated observations of our random variable X. Since these are repeated observations, our sample is actually a vector, which we write X = [X1, ..., Xn] or [X = x] = [X1 = x1, ..., Xn = xn] to indicate all the possible values our sample (the random vector) could take. Since we have defined a probability distribution on our random variable X, this also induces a (joint) probability distribution over all the possible samples that we could produce, which we write as Pr(X) = Pr(X1, ..., Xn) or Pr(X = x) = Pr(X1 = x1, ..., Xn = xn). We will generally assume that our sample contains independent, repeated (identical) observations of our random variable, such that our sample is i.i.d. In such a case, each of the individual observations in our sample has a probability distribution that is the same as our random variable, Pr(Xi = xi|θ).

Let's assume that we'd like to perform a particular 'type' of inference, specifically that we would like to infer the actual, unknown value of our parameters θ. This type of inference is called estimation. The process of performing inference requires that we define a statistic, which is a function on our sample, T(X) or T(X = x). Intuitively, the reason for doing this is that each of the observations in our sample of size n contains information about the true parameter value, but each individual observation can take many possible values. By combining these observations in a reasonable way, we can get more information about what the true parameter value is and make a better 'guess' or estimate concerning the true parameter value. This is the goal of defining a statistic. Note that unless we have an infinite sample, we cannot know the true value of the parameter with certainty (hence estimation).

Our goal therefore is to define our statistic such that it is an estimate of the parameter θ. We write a parameter estimate as \hat{θ}, and since our statistic T is an estimator, we write T(X) = \hat{θ} or T(X = x) = \hat{θ}. Note that since our sample has a probability distribution (a sampling distribution), which reflects the possible values our sample could take, our statistic, and hence our estimator, has a probability distribution Pr(T(X = x)) = Pr(\hat{θ}), which need not be the same probability distribution as our original random variable X (because it is a function of multiple observations of our random variable). Thus, our estimator also has a sampling distribution of possible values. However, our goal is to make this probability distribution such that we have a reasonable probability of getting the right parameter value, or 'close to' the right parameter value, for most samples, a concept we will make more rigorous below.

In practice, we do not see all the possible values our sample, and therefore our estimator, can take. We only have a single sample, which we represent as lower case x. For example, for a set of n = 10 Bernoulli samples this could be something like:

x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]    (14)

which we could also write:

[X = x] = [X1 = 1, X2 = 1, X3 = 0, X4 = 1, X5 = 0, X6 = 0, X7 = 0, X8 = 1, X9 = 1, X10 = 0]    (15)

and for a normally distributed random variable this could be:

x = [−2.3, 0.5, 3.7, 1.2, −2.1, 1.5, −0.2, −0.8, −1.3, −0.1]    (16)

In either of these examples, our statistic takes a specific value 't', which is our actual estimate of the parameter value, which we can write T(x) = \hat{θ}.

Before we get to specific examples of estimators, a few comments:

1. Our parameter may be a single value or a vector of values θ = [θ1, θ2, ...], e.g. θ = [µ, σ²], and we can define an estimator that is a vector-valued function on our sample, which estimates these multiple parameters: T(X = x) = \hat{θ} = [\hat{θ}_1, \hat{θ}_2, ...].

2. We cannot define a statistic that always takes the true value of θ for every possible sample (hence estimate), i.e. there is no perfect estimator.

3. There are different ways to define 'good' estimators, each of which may have different properties. We will consider some of these below.

4. It is easy to define 'bad' estimators. For example, consider an estimator that takes every sample to the same value. It is a good estimator if the true parameter value happens to be this value; otherwise, it is a bad estimator.

6 Method of moments estimator

To make the concept of estimators clear, let's consider a specific example of an estimator. Let's first assume that we have a coin system, where experiments are coin flips, and our random variable X has a Bernoulli distribution Pr(X = x|p), such that our goal is to estimate the parameter p, where for this example, let's say p = 0.5. Our random variable can therefore take values 0 or 1 (with equal probability), such that we could obtain a sample of the type in equation (14). In this case, a perfectly reasonable estimator would be the mean (also called the expectation) of the sample:

T(X = x) = E(X = x) = \hat{θ} = \hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i    (17)

As we mentioned above, this statistic has a sampling distribution that describes the possible values of this statistic. In this particular case, it happens to be a binomial distribution with parameters n and p, although since we 're-scale' the 'number of Tails' to be between zero and one by dividing by n, the sampling distribution of this statistic is that of T(X = x) = \bar{X} with \bar{X} = Z/n where Z ∼ Bin(n, p). However, in a 'realistic' sample, we do not have multiple realizations of this statistic but rather a single value corresponding to a single sample.
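As a quick illustration, the following sketch (assuming numpy; the repetition count and seed are illustrative choices, not from the notes) draws many Bernoulli samples of size n = 10 and shows that the sample mean varies from sample to sample, with a Bin(n, p)/n sampling distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, n_reps = 10, 0.5, 100000

    # Draw many independent samples of size n and apply the estimator
    # T(x) = mean(x) to each one; the resulting values are draws from
    # the sampling distribution of the statistic, i.e. Bin(n, p)/n.
    samples = rng.binomial(1, p, size=(n_reps, n))
    estimates = samples.mean(axis=1)

    # The estimates concentrate around the true p = 0.5, but individual
    # samples can (and do) give values like 0.3 or 0.7.
    print(estimates.mean())          # close to 0.5
    print(np.unique(estimates)[:5])  # multiples of 1/n: 0.0, 0.1, 0.2, ...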


Method of moments estimator

• As an example of how to construct estimators, let's construct a method of moments estimator

• Consider the single coin flip experiment / number of tails random variable / Bernoulli probability model family (parameter p) / fair coin model (assumed and unknown to us!!!) / sample of size n=10

• What is the sampling distribution (of the sample) in this case?

• We want to estimate p, where a perfectly reasonable estimator is:

• What is the probability distribution of this statistic in this case?

• e.g. this statistic (= mean of the sample) would equal 0.5 for the following particular sample (will it always?)


i.e. a sample random vector X has a (joint) probability distribution:

Pr(X = x) = P_X(x) or f_X(x) = Pr(X1 = x1, X2 = x2, ..., Xn = xn)    (8)

where each of the Xi has the same distribution as we have defined for X. Since we know they all have the same distribution, we know that:

Pr(X1 = x1) = Pr(X2 = x2) = ... = Pr(Xn = xn)    (9)

and we therefore say that the sample is identically distributed. Ideally, it is also the case that each of these Xi is independent of the rest. When this is the case, much of the mathematical framework we use to do inference becomes easier, so we often try to construct experiments that produce such independence. When this is the case, we have:

Pr(X = x) = Pr(X1 = x1)Pr(X2 = x2)...Pr(Xn = xn)    (10)

which follows from the definition of independence. Ideally, therefore, our sample is independent and identically distributed, which we abbreviate as i.i.d. (or iid). We will largely consider iid samples for this entire course.

Again, note that just as a probability function Pr(S) induces a probability distribution on a random variable X, this same probability distribution will induce a joint probability distribution on the random vector, Pr(X = x). This is effectively the probability distribution describing all possible sample outcomes that could occur for a sample of size n, i.e. a random vector where the marginal probability distributions have the same distribution as X and there is no covariance among the Xi (note that by assuming iid, we are placing additional limits on the possible probability distributions that could describe our possible samples).

To perform inference in the real world, we generally only have a single set of experiments and therefore a single sample (or at least a limited number of samples). We are therefore going to consider inference for a specific realization of a sample of size n. For example, for a set of n = 10 Bernoulli samples this could be something like:

x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]    (11)

and for a normally distributed random variable this could be:

x = [−2.3, 0.5, 3.7, 1.2, −2.1, 1.5, −0.2, −0.8, −1.3, −0.1]    (12)

where for the latter, keep in mind the values are constrained by our precision of measurement and we will approximate them by a continuous random variable and associated sample that we assume are normally distributed, which defines the probability that observations of this random variable fall in a particular interval (see lecture 3).
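To see equation (10) in action, here is a minimal sketch (assuming numpy; the Bernoulli parameter value is an assumption made purely for illustration) computing the joint probability of the i.i.d. Bernoulli sample above as a product of per-observation probabilities:

    import numpy as np

    p = 0.5  # assumed Bernoulli parameter, for illustration
    x = np.array([1, 1, 0, 1, 0, 0, 0, 1, 1, 0])

    # Per-observation probabilities Pr(Xi = xi | p) for a Bernoulli variable.
    per_obs = np.where(x == 1, p, 1 - p)

    # Under i.i.d. sampling, the joint probability of the whole sample is
    # the product of the individual probabilities (equation (10)).
    joint = per_obs.prod()
    print(joint)  # 0.5**10, approximately 0.000977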


Introduction to MLE’s

• A maximum likelihood estimator (MLE) has the following definition:

• Recall that this statistic still takes in a sample and outputs a value that is our estimate (!!) Note that likelihoods are NOT probability functions, i.e. they need not conform to the axioms of probability (!!)

• Sometimes these estimators have nice forms (equations) that we can write out

• For example the maximum likelihood estimator of our single coin example:

• And for our heights example:

Recall also that likelihoods have an appealing property described by the 'Likelihood principle', which basically states that any evidence present in a sample about θ depends only on the likelihood function. This is a deeper theoretical concept than sufficiency (although they are related).

Now that we have defined a likelihood, we are ready to define a Maximum Likelihood Estimator:

MLE(θ) = \hat{θ} = argmax_{θ ∈ Θ} L(θ|x)    (4)

where 'argmax' simply means the argument or value of θ within the set Θ that maximizes the function. That is, the actual value that we get as an estimate, after plugging the sample x into this equation, is the value of θ where this function has a maximum. We can illustrate this concept visually by plotting this function with possible parameter values on the X-axis and the likelihood function on the Y-axis (see class notes for a diagram).

To determine the MLE means finding the maximum of a function. There are broadly two ways to do this: a. derive a specific (useful) formula for the MLE, or b. in more complex cases, use an algorithm to determine the MLE. While the former is a reasonable strategy in some cases (as we will discuss today), as we will see later in the class, sometimes the latter strategy is the only way to determine the MLE. To derive a specific formula for an MLE, as you'll recall from calculus, a way to solve the problem of finding a maximum of a function is to find where the first derivative of the function takes a value of zero, and then check that the second derivative at this point is negative, to determine whether this point is a maximum, i.e. instead of a minimum (or saddle point). When using this approach to find the maximum, it is often easier to deal with the natural log of the likelihood:

l(θ|x) = ln[L(θ|x)]    (5)

Since logarithms are 'monotonic', they change the shape of the likelihood function but do not change the location of the maximum, i.e. maximizing the log-likelihood produces the same result as maximizing the likelihood. Part of the reason log-likelihoods are easier to deal with is they take advantage of the property ln(ab) = ln(a) + ln(b), such that the likelihood of an i.i.d. sample:

L(θ|x_1, x_2, ..., x_n) = L(θ|x_1)L(θ|x_2)...L(θ|x_n)    (6)

when expressed as a log-likelihood is:

l(θ|x_1, x_2, ..., x_n) = l(θ|x_1) + l(θ|x_2) + ... + l(θ|x_n)    (7)

As an example, let's derive the MLE of the parameter µ of a normally distributed random variable X ∼ N(µ, σ²) for an i.i.d. sample of size n, i.e. our sample is a random vector X = x of n elements, where each Xi = xi is normally distributed.

In this case, the likelihood equation is:

L(µ, σ²|X = x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}    (8)

where \prod_{i=1}^{n} x_i = x_1 * x_2 * ... * x_n denotes a product. If we now consider the log-likelihood, from equation (5) and the following properties of natural log (ln) and exponential (e) functions:

1. ln(1/a) = −ln(a)

2. ln(a²) = 2 ln(a)

3. ln(ab) = ln(a) + ln(b)

4. ln(e^a) = a

5. e^a e^b = e^{a+b}

we have the following:

l(µ, σ²|X = x) = −n ln(σ) − \frac{n}{2} ln(2π) − \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i − µ)²    (9)

To find the maximum of this function, we take a derivative with respect to µ and set this equal to zero:

\frac{\partial l(\theta|X = x)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i − µ) = 0    (10)

where we use ∂ to indicate cases where we are taking the derivative of a function of several variables with respect to one (or a subset) of the variables (i.e. a partial derivative) and we use d to take the derivative with respect to all variables at the same time. To find the MLE, we now solve equation (10) with respect to µ:

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i    (11)

Note that this is the mean of the sample, so we have:

MLE(\hat{\mu}) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}    (12)

Now, to be assured that this is actually the MLE, we need to check that the second derivative of the log-likelihood function is negative at this point, which it is in this case (this is easy to check and I'll leave it as an exercise).

Note that in this case, there was a closed-form equation for the MLE that does not involve the parameter we are trying to estimate, but this is not always the case, particularly when we consider more complicated likelihood functions (where we will need an algorithm). It is also interesting to note that, in this case, the MLE(µ) is the same as the method of moments estimator (again, this is not always the case). This is similarly the case for the MLE of σ²:

MLE(\hat{\sigma}^2) = \frac{1}{n} \sum_{i=1}^{n} (x_i − \bar{x})²    (13)

which can be derived using the same approach.
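As a quick numerical sanity check, here is a minimal sketch (assuming numpy and scipy are available; the data are simulated purely for illustration) that maximizes the normal log-likelihood over µ and confirms the maximizer matches the sample mean:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=1.0, size=50)  # simulated i.i.d. sample

    # Negative log-likelihood in mu (sigma fixed at 1 for simplicity).
    def neg_log_lik(mu):
        return -norm.logpdf(x, loc=mu, scale=1.0).sum()

    result = minimize_scalar(neg_log_lik)
    print(result.x)  # numerical argmax of the likelihood
    print(x.mean())  # closed-form MLE(mu): the two should agree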

As another quick example, let's derive the MLE(p) for X ∼ Bin(n, p) for a sample of size n. In this case the likelihood is:

L(p|X = x) = \binom{n}{x} p^x (1 − p)^{n−x}    (14)

and the log-likelihood is:

l(p|X = x) = ln\binom{n}{x} + x ln(p) + (n − x) ln(1 − p)    (15)

such that the first derivative is:

\frac{\partial l(p|X = x)}{\partial p} = \frac{x}{p} − \frac{n − x}{1 − p}    (16)

and by setting this equal to zero and solving for p we obtain:

MLE(\hat{p}) = \frac{x}{n}    (17)

which we can check by considering the second derivative:

\frac{\partial^2 l(p|X = x)}{\partial p^2} = −\frac{x}{p^2} + \frac{x − n}{(1 − p)^2}    (18)

which is always negative. Note that the MLE and the method of moments estimator are also the same in this case.
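The closed form is easy to verify numerically; here is a minimal sketch (assuming numpy and scipy; the values of n and x are made up for illustration) comparing MLE(p) = x/n against a grid search over the binomial log-likelihood:

    import numpy as np
    from scipy.stats import binom

    n, x = 10, 7  # made-up data: 7 'successes' out of 10 trials

    # Evaluate the binomial log-likelihood l(p | x) over a fine grid of p.
    grid = np.linspace(0.001, 0.999, 999)
    log_lik = binom.logpmf(x, n, grid)

    print(grid[np.argmax(log_lik)])  # approximately 0.7, the grid argmax
    print(x / n)                     # 0.7, the closed-form MLE(p)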

More generally, if we are interested in deriving the MLE(\hat{θ}) for a vector of parameters θ = [θ1, θ2, ...], we can take the derivative of the log-likelihood function with respect to all of the parameters:

\frac{d\, l(\theta|X = x)}{d\theta} = \left[ \frac{\partial l(\theta|X = x)}{\partial \theta_1}, \frac{\partial l(\theta|X = x)}{\partial \theta_2}, ... \right]

and set each component equal to zero.


Review of estimation and hypothesis testing

• Recall that inference is the process of reaching a conclusion about a parameter using a sample

• Thus far we have been considering a “type” of inference, estimation, where we are interested in determining the actual value of a parameter

• We could ask another question, and consider whether the parameter is NOT a particular value

• This is another “type” of inference called hypothesis testing


Hypothesis testing I

• To build this framework, we need to start with a definition of hypothesis

• Hypothesis - an assumption about a parameter

• More specifically, we are going to start our discussion with a null hypothesis, which states that a parameter takes a specific value, i.e. a constant

• For example, for our height experiment / identity random variable, we have X ∼ N(µ, σ²) and we could consider the following null hypothesis: H0 : µ = 5.5

Recall that in estimation we define our statistic such that it is an estimate of the parameter θ. We write a parameter estimate as \hat{θ}, and since our statistic T is an estimator, we write T(x) = \hat{θ} or T(X = x) = \hat{θ}. Note that since our sample has a probability distribution (a sampling distribution), our statistic (= estimator) has a probability distribution Pr(T(X = x)) = Pr(\hat{θ}). Our goal when defining our estimator is to make this probability distribution such that the estimate has a reasonable probability of being the right parameter value, or 'close to' the right parameter value, for most samples.

Today, we are going to consider situations where, instead of wanting to know the actual value of a parameter, we want to be able to answer a 'yes' or 'no' question about the parameter. For example, we may be interested in whether a drug administered to a child has an effect on adult height. In such a case, we are less interested in the exact effect of the drug (which we might summarize with the parameter µ) but rather whether we can say with confidence that the hypothesis that the drug has no effect on height is wrong. We could use the answer to the question (is there no effect of the drug?) to make decisions about how the drug will be administered or regulated. This is what we want to accomplish in the other major 'type' of inference, which is hypothesis testing. Note that hypothesis testing is a fair bit more complicated (and arguably less intuitive) than estimation. Even if you have been exposed to the hypothesis testing framework before, it often takes several exposures to develop a deep understanding.

We will begin our discussion of hypothesis testing by defining some of the critical concepts and providing some simple examples that should help with intuition. As per the name, we first need to define a hypothesis:

Hypothesis ≡ an assumption about a parameter.

More specifically, we will assume that we have defined Pr(X|θ) for our system and we will now define a null hypothesis, which states that our parameter θ takes a specific value (a constant) or is in an interval of values (for the moment, we will consider θ to take a single value). We use the following formalism to write our null hypothesis (H0):

H0 : θ = c    (1)

where c is a constant. For example, let's assume Pr(X|θ) ∼ N(µ, σ²), where in this case we happen to know σ² = 1. We can define the null hypothesis H0 : µ = 0 in this case, and we are interested in whether we can say with confidence that our null hypothesis is 'incorrect' or 'false'.


Hypothesis testing II

• Our goal in hypothesis testing is to use a sample to reach a conclusion about the null hypothesis

• To do this, just as in estimation, we will make use of a statistic (a function on the sample), where recall we know the sampling distribution (the probability distribution) of this statistic

• More specifically, we will consider the probability distribution of this statistic, assuming that the null hypothesis is true:

• Note that this means we have a probability distribution of the statistic given the null hypothesis!!

• We will use this distribution to construct a p-value

Just as with estimation, we will assess this null hypothesis using a sample Pr(X = x) = Pr(X1 = x1, ..., Xn = xn), which we will assume is i.i.d. Again, just as with estimation, we assume that multiple observations of the sample have more information about µ than each individually, and to use this information effectively we define a statistic T(X = x). Now, since we have defined (assumed) the family of probability distributions that our random variable follows, we know the sampling distribution of our statistic assuming our null hypothesis is correct, Pr(T(X = x)|θ = c). We are going to use this information to assess the results that we get for an actual value of our statistic (from an actual sample), T(x) = T(x1, x2, ..., xn), to determine whether we think H0 is wrong.

Note that just as we choose statistics (functions on our sample) that will have good properties for estimation, we also choose statistics that have good properties for hypothesis testing. A reasonable statistic that we could use in this case is the mean of the sample, T(x) = \frac{1}{n}\sum_{i=1}^{n} x_i. To introduce the major concepts of hypothesis testing, let's consider an example that we would generally never deal with in a real statistical application: a case where our sample size is n = 1. In this case, our sample is X1 = x1, our statistic is T(x) = \frac{1}{n}\sum_{i=1}^{n} x_i = x1 (i.e. the value of our one observation), and the sampling distribution is x1 ∼ N(µ, 1) (i.e. the same probability distribution as our random variable - see class for a diagram). If our H0 is correct, there would be a greater probability of our single sample observation being in an interval around zero. What if our sample is quite far from zero, say x1 = 2.5? We could take this as evidence that H0 is incorrect. Note that we can never be sure that H0 is incorrect, no matter how far from zero our observation is, because there is always the possibility that such an outcome could have occurred by chance.

To make the concept of 'evidence against H0' more rigorous, we will need the concept of a p-value:

p-value ≡ the probability of obtaining a value of T(x), or more extreme, conditional on H0 being true.

The 'more extreme' part of this definition is a bit confusing at first glance, so let's consider our example to make this more clear. For our example, let's assume that we are interested in whether the values of T(x1) are more extreme in the positive direction (see class for a diagram). In this case, our p-value has the following definition: pval = Pr(T(X1) ≥ x1 | H0 : µ = 0 true), where x1 reflects the various values our sample could take (i.e. −∞ < x1 < ∞). (For a two-sided test we would instead consider pval = Pr(|T(X)| ≥ t | H0 : θ = c).) Note that for our example, fX(x) ∼ N(0, 1), where for this particular case:

pval(T(x)) = \int_{x_1}^{\infty} f_X(x)dx    (3)

and note that a p-value is a function on a statistic that maps the value of the statistic to the interval [0, 1]:

pval(T(x)) : T(x) → [0, 1]    (4)
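Numerically, this one-sided p-value is just an upper-tail area under the null density. A minimal sketch (assuming scipy is available; the observed values are those used in the worked example below) computing it for a few observed statistics:

    from scipy.stats import norm

    # One-sided p-value for T(x1) = x1 under H0: x1 ~ N(0, 1).
    # norm.sf is the survival function, 1 - CDF, i.e. the upper-tail area.
    def pval(x1):
        return norm.sf(x1, loc=0, scale=1)

    for x1 in [0.0, 1.0, 1.65, 2.5]:
        print(x1, pval(x1))
    # approximately 0.5, 0.159, 0.05, 0.0062, matching the example below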


• As an example, consider our height experiment (reals as sample space) / identity random variable X / normal probability model / sample of size n=1 (one height measurement) / identity statistic T(X) = X (takes the measured height)

• Let's assume that σ² is known and say we are interested in testing the following null hypothesis, such that we have the following probability distribution of the statistic under the null hypothesis:

While we will not consider the latter two extensively in this course, they are critical to the foundation of 'population genetics', the subject that considers the statistical and probabilistic modeling of how genes evolve in populations. Population genetics is a very relevant course for quantitative genomics (and other genomic disciplines), so I encourage you to take a theoretical course on the subject.

Let's now consider some probability models for continuous random variables. The model we will make the most direct use of in this course is one that we have introduced previously, the normal distribution (also called the Gaussian):

Pr(X = x|µ, σ²) = f_X(x|µ, σ²) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (5)

This model therefore has two parameters (µ, σ²), such that θ is actually a parameter vector θ = [µ, σ²]. The parameter µ intuitively sits in the 'middle' or at the 'center of gravity' of this distribution (see class notes for a picture) and has the following possible values: Θ = (−∞, ∞). The σ² parameter intuitively captures the 'spread' of the distribution, i.e. the larger the value the greater the spread, and takes the following possible values: Θ = [0, ∞). As we have seen previously, our shorthand for a normal distribution is X ∼ N(µ, σ²).
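As a quick check of equation (5), here is a minimal sketch (assuming numpy and scipy are available) that evaluates the normal density by the formula and compares it to scipy's built-in implementation:

    import numpy as np
    from scipy.stats import norm

    def normal_pdf(x, mu, sigma2):
        # Direct implementation of f_X(x | mu, sigma^2) from equation (5).
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    x = np.array([-1.0, 0.0, 2.5])
    print(normal_pdf(x, 0.0, 1.0))          # by hand
    print(norm.pdf(x, loc=0.0, scale=1.0))  # scipy agrees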

Other continuous distributions that we will run into during this course are the Uniform, chi squared, t-type, F-type, Gamma, and Beta. The first we will discuss in the context of the distribution of p-values, the middle three will come up in our discussion of sampling distributions of statistics, and we will discuss the last two during our lectures on Bayesian statistics.

One final point to note. While we have considered parameterized statistical models for individual 'univariate' random variables, there are analogous forms of all of these distributions for random vectors with multiple elements, which are 'multivariate' random variables (although the multivariate forms have additional parameters). We will consider some multivariate forms of these distributions in this class, e.g. the multivariate Normal distribution.

3 Introduction to inference

A major goal of the field of statistics is inference:

Inference ≡ the process of reaching conclusions concerning an assumed probability distribution (specifically the parameter(s) θ) on the basis of a sample.

There are two major 'types' of inference: estimation and hypothesis testing.



Hypothesis testing III

[Figure: the sampling distribution of the statistic under the null hypothesis, Pr(T(x) | H0)]


p-value 1

• We quantify our intuition as to whether we would have observed the value of our statistic, given the null is true, with a p-value

• p-value - the probability of obtaining a value of a statistic T(x), or more extreme, conditional on H0 being true

• Formally, we can express this as follows:

• Note that a p-value is a function on a statistic (!!) that takes the value of a statistic as input and produces a p-value as output in the range [0, 1]:


Review of definitions

• 1. null hypothesis H0 - an assumption about a parameter (value) that we will assess:

• 2. sampling distribution of the test statistic:

• 3. p-value - the probability of obtaining a value of a statistic T(x), or more extreme, conditional on H0 being true:

• 4. one- / two-sided test - an assumption (that we make!!) that determines how we calculate the p-value (greater than OR less than OR both)

• 5. type I error (α) - a value such that if a p-value is below this value, we reject (!) the null hypothesis, and we do not reject if the p-value is above this value

• 6. critical value - the value of the test statistic T(x) corresponding to a type I error

• 7. reject / do not reject - the two possible outcomes of our hypothesis test (which reflect whether we think the null hypothesis is incorrect / false, or whether we cannot say)
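Putting these definitions together, here is a minimal end-to-end sketch (assuming numpy and scipy are available; the observed value and α = 0.05 are illustrative choices) of a one-sided test of H0 : µ = 0 with known σ² = 1 and n = 1:

    from scipy.stats import norm

    alpha = 0.05  # pre-chosen type I error level (illustrative)
    x1 = 2.5      # our single observed value, i.e. T(x) = x1

    # p-value: upper-tail probability of the statistic under H0: x1 ~ N(0, 1).
    pval = norm.sf(x1)

    # Critical value: the statistic value whose upper-tail area equals alpha.
    c_alpha = norm.ppf(1 - alpha)  # approximately 1.645

    print(pval)     # approximately 0.0062
    print(c_alpha)
    print("reject H0" if pval <= alpha else "do not reject H0")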

our statistic such that it is an estimate of the parameter θ. We write a parameter estimate

as θ, and since our statistic T is an estimator, we write T (x) = θ or T (X = x) = θ. Note

that since our sample has a probability distribution (a sampling distribution), our statistic

= estimator has a probability distribution Pr(T (X = x)) = Pr(θ). Our goal when defining

our estimator is to make this probability distribution such that estimate has a reasonable

probability of getting the right parameter value or ‘close to’ the right parameter value for

most samples.

Today, we are going to consider situations where, instead of wanting to know the ac-

tual value of a parameter, we want to be able to answer a ‘yes’ or ‘no’ question about the

parameter. For example, we may be interested in whether a drug administered to a child

has an effect on adult height. In such a case, we are less interested on the exact effect ofthe drug (which we might summarize with the parameter µ) but rather whether we can

say with confidence that the hypothesis that the drug has no effect on height is wrong. We

could use the answer to the question (is there no effect of the drug?) to make decisions

about how the drug will be administered or regulated. This is what we want to accomplish

in the other major ‘type’ of inference, which is hypothesis testing. Note that hypothesis

testing is a fair bit more complicated (and arguably less intuitive) than estimation. Even

if you have been exposed to the hypothesis testing framework before, it often takes several

exposures to develop a deep understanding.

We will begin our discussion of hypothesis testing by defining some of the critical con-

cepts and providing some simple examples that should help with intuition. As per the

name, we first need to define a hypothesis:

Hypothesis ≡ an assumption about a parameter.

More specifically, we will assume that we have defined Pr(X|θ) for our system and we

will now define a null hypothesis, which states that are parameter θ takes a specific value

(a constant) or is an interval of values (for the moment, we will consider θ to take a single

value). We use the following formalism to write our null hypothesis (H0):

H0 : θ = c (1)

where c is a constant. For example, lets assume Pr(X|θ) ∼ N(µ,σ2), where in this case we

happen to know σ2 = 1. We can define the null hypothesis H0 : µ = 0 in this case and we

are interested in whether we can say with confidence that our null hypothesis is ‘incorrect’

or ‘false’.

Just as with estimation, we will assess this null hypothesis using a sample Pr(X = x) =

Pr(X1 = x1, ..., Xn = xn) , which we will assume is i.i.d. Again, just as with estimation,

we assume that multiple observations of the sample have more information about µ than

2

Pr(X1 = x1, ..., Xn = xn) , which we will assume is i.i.d. Again, just as with estimation,we assume that multiple observations of the sample have more information about µ thaneach individually, and to use this information effectively we define a statistic T (X = x).Now, since we have defined (assumed) the family of probability distributions that are ran-dom variable follows, we know the sampling distribution of our statistic assuming our nullhypothesis is correct Pr(T (X = x|θ = c)). We are going to use this information to as-sess the results that we get for an actual value of our statistic (from an actual sample)T (x) = T (x1, x2, ..., xn) to determine whether we think H0 is wrong.

Note that just as we choose statistics (functions on our sample) that will have good prop-erties for estimation, we also choose statistics which have good properties for hypothesistesting. A reasonable statistic that we could use in this case is the mean of the sampleT (x) =

ni=1 xi. To introduce the major concepts of hypothesis testing, let’s consider an

example that we would generally never deal with in a real statistical application: a casewhere our sample size is n = 1. In this case, our sample is X1 = x1, and our statisticis T (x) =

ni xi = x1 (i.e. the value of our one sample), and the sampling distribution

is x1 ∼ N(µ, 1) (i.e. the same probability distribution as our random variable - see classfor a diagram). If our H0 is correct, there would be a greater probability of our singlesample observation being in an interval around zero. What if our sample is quite far fromzero, say x1 = 2.5? We could take this as evidence that H0 is incorrect. Note that wecan never be sure that H0 is incorrect, no matter how far from zero our observation is,because there is always the possibility that such an outcome could have occurred by chance.

To make the concept of ‘evidence against H0’ more rigorous, we will need the conceptof a p-value:

p-value ≡ the probability of obtaining a value of T (x), or more extreme, conditionalon H0 being true.

The ‘more extreme’ part of this definition is a bit confusing at first glance, so let’sconsider our example to make this more clear. For our example, let’s assume that weare interested in whether the value of T (x1) are more extreme in the positive direc-tion (see class for a diagram). In this case, our p-value has the following definition:pval = Pr(T (X1) x1|H0 : µ = 0, true), where x1 reflects the various values our samplecould take (i.e. −∞ < x1 < ∞). Note that for our example, fX(x) ∼ N(0, 1) where forthis particular case:

pval(T (x)) =

x1

fX(x)dx (3)

pval(T (x)) : T (x) → [0, 1] (4)

3

(see diagram on board for an example). Also, note in this particular case:

pval(T(x)) = ∫_{x1}^{∞} f_X(x) dx = 1 − F_X(x1) = 1 − ∫_{−∞}^{x1} f_X(x) dx    (5)

where F_X(x) is the cumulative distribution function of X (in general, cdf's of statistics have a close relationship with p-values).
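To make the relationship in equation (5) concrete, here is a minimal sketch (our own code, using scipy's standard normal) checking numerically that the upper-tail integral equals 1 − F_X(x1):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

for x1 in (0.0, 1.0, 1.65, 2.5):
    # Numerically integrate the N(0,1) density from x1 to infinity...
    tail, _ = quad(norm.pdf, x1, np.inf)
    # ...and compare with 1 - F_X(x1), via the cdf and the survival function.
    print(x1, round(tail, 4), round(1 - norm.cdf(x1), 4), round(norm.sf(x1), 4))
```

All three columns agree for each x1 (0.5, 0.159, 0.05, 0.0062, up to rounding).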


So, how do we make use of a p-value? Let's go back to our example case where we assume X ∼ N(µ, σ²), where we assume that we know σ² = 1 and where we are going to do a one-sided test of H0 : µ = 0 using a sample of size n = 1 and statistic T(x1) = x1. As an example, say our sample was x1 = 0. In this case, our p-value would be pval = ∫_{0}^{∞} f_X(x) dx = 0.5. Similarly, we have pval(x1 = 1) = 0.159, pval(x1 = 1.65) = 0.05, and pval(x1 = 2.5) = 0.0062. So, for our case where x1 = 2.5, the probability of getting this value or more extreme, conditional on H0 : µ = 0 being true, is quite small. Can we interpret this as evidence against H0? Yes we can, and intuitively, this is how we assess our null hypothesis. However, this still does not provide us a guideline for saying 'yes' or 'no' when considering the question: is H0 false? To make this decision, we generally decide on some probability α where, if pval ≤ α, we reject H0, i.e. we decide that H0 is not correct. Where we set α is quite arbitrary (and as we shall see, depends on what trade-offs we want to make in our hypothesis testing framework), but it is often the case that we set this value reasonably low, to values such as α = 0.05 or α = 0.01. Note that in our example, a given value of α corresponds to a specific value of X, which we will designate cα, the critical value:

α = ∫_{cα}^{∞} f_X(x) dx    (6)

where for α = 0.05, we have cα = 1.65 in our example (see class for a diagram). To use α (and cα), we pre-define this value (i.e. α = 0.05) and we reject H0 if our p-value is below this value (or, equivalently in this case, if x1 ≥ cα), and we cannot reject H0 if our p-value is not below this value. Note that we can interpret α = 0.05 as a probability of 0.05 that we would have obtained the value of our statistic or more extreme by chance, so even if our p-value is less than α, we cannot reject H0 with absolute certainty.
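As a sketch of the decision rule in code (our own; α and the observed x1 are illustrative choices), note that cα is just the (1 − α) quantile of the null distribution, so 'pval ≤ α' and 'x1 ≥ cα' are the same decision:

```python
from scipy.stats import norm

alpha = 0.05
c_alpha = norm.ppf(1 - alpha)   # critical value, ~1.645 (the notes round to 1.65)

x1 = 2.5                        # observed value of our statistic T(x1) = x1
pval = norm.sf(x1)              # one-sided p-value, ~0.0062

# The two rejection criteria agree by construction.
print(pval <= alpha, x1 >= c_alpha)  # True True
```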

Now, in our example, we have considered a case where we reject the null hypothesis if our statistic is equal to or greater than a particular value. This is an example of a one-sided test. We might want to construct a one-sided test if we think from our previous experience that, if H0 : µ = 0 is wrong, the true value of µ is going to be positive (or we only care about cases where µ is positive). In general, for one-sided tests, if we have a continuous statistic T(X = x) that could take values in (−∞, +∞), we can define a p-value using just one tail of the null sampling distribution, as above. For a two-sided test, where a departure from H0 in either direction counts as evidence, the corresponding definition is pval = Pr(|T(x)| ≥ |t| | H0 : θ = c), where t is the observed value of our statistic.
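Since the two definitions differ only in whether one or both tails of the null distribution are counted, they are easy to compare directly. A minimal sketch (our own code, not from the notes), assuming a N(0, 1) null and using the statistic values that appear in the slide annotations below:

```python
from scipy.stats import norm

for t in (2.8, -0.755):
    one_sided = norm.sf(t)            # Pr(T >= t | H0), upper tail only
    two_sided = 2 * norm.sf(abs(t))   # Pr(|T| >= |t| | H0), both tails
    print(t, round(one_sided, 4), round(two_sided, 4))

# t =  2.8    -> one-sided ~0.0026, two-sided ~0.0051
# t = -0.755  -> one-sided ~0.7749, two-sided ~0.4503
# (the slides round these to 0.0025/0.005 and 0.77/0.45)
```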


Page 54:

Assume H0 is correct (!): [slide diagram: the null sampling distribution Pr(T(x) | H0) plotted against T(x), marking two example samples, Sample I and Sample II]



[slide annotations: for α = 0.05 the one-sided critical value is cα = 1.64, and the two-sided critical value is 1.96; e.g. T(x) = 2.8 gives p = 0.0025 one-sided and p = 0.005 two-sided, while T(x) = −0.755 gives p = 0.77 one-sided and p = 0.45 two-sided]

Page 55:

Results of hypothesis decisions I: when H0 is correct (!!)

• There are only two possible decisions we can make as a result of our hypothesis test: reject or cannot reject

BTRY 4830/6830: Quantitative Genomics and Genetics, Spring 2011

Lecture 8: Hypothesis Testing II

Lecture: February 23; Version 1: February 20; Version 2: March 15

1 Introduction

Note that sections 2-3 (power, alternative hypotheses) were added from the previous lecture notes.

Last lecture, we began our discussion of hypothesis tests. Today, we are going to complete our general discussion with the introduction of likelihood ratio tests. We will end today's lecture with a brief discussion of confidence intervals. This lecture will complete our general review of probability and statistics. Next lecture, we will begin our discussion of the application of probability and statistics in quantitative genomics.

2 Factors that affect power

As a review, recall that we have a system, we conduct an experiment, which defines a sample space S. We define a probability function Pr and a random variable X on S in such a way that Pr(X = x) is in a 'family' of probability distributions that are indexed by parameter(s) θ, where we do not know the specific values of the parameters. We are interested in testing the null hypothesis H0, using a statistic T(X = x) on i.i.d. observations of our random variable, e.g. for X ∼ N(µ, σ²). To test this hypothesis, we define an H0, which we use to define a p-value, which is a function of our statistic. If the p-value for the actual value of our statistic (for our specific sample, e.g. T(x) = t) is below some pre-defined value α (which determines the critical value cα), we reject H0. If the p-value is above this value, we do not reject H0. The various critical concepts in hypothesis testing can be organized as follows:

                     H0 is true            H0 is false
cannot reject H0     1 − α (correct)       β (type II error)
reject H0            α (type I error)      1 − β, power (correct)

[slide diagram: the null sampling distribution Pr(T(x) | H0) plotted against T(x)]
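The entries of this table can be checked by simulation. Below is a minimal sketch (our own; the sample size n = 10 and the alternative µ = 1 are arbitrary illustrative choices) that estimates the type I error rate and the power of the one-sided z-test by repeating the whole experiment many times:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)
alpha, n, reps = 0.05, 10, 100_000
c_alpha = norm.ppf(1 - alpha)

def reject_rate(mu_true):
    # Draw `reps` i.i.d. samples of size n from N(mu_true, 1), compute the
    # z-statistic for H0: mu = 0, and record how often we reject.
    x = rng.normal(mu_true, 1.0, size=(reps, n))
    z = x.mean(axis=1) * np.sqrt(n)   # sqrt(n) * (xbar - 0) / sigma, sigma = 1
    return np.mean(z >= c_alpha)

print(reject_rate(0.0))  # type I error rate: ~alpha = 0.05 (H0 true)
print(reject_rate(1.0))  # power = 1 - beta: high (~0.94 here) when mu = 1
```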


Page 58:

Assume H0 is wrong (!): [slide diagram: Pr(T(x) | H0) plotted against T(x), marking two example samples, Sample I and Sample II]


BTRY 4830/6830: Quantitative Genomics and Genetics, Spring 2011

Lecture 7: Hypothesis Testing I

Lecture: February 21; Version 1: February 19; Version 2: March 15

1 Introduction

Last lecture, we discussed estimation (also called 'point' estimation), where the goal was to make a reasonable guess (= estimate) concerning the true (unknown) value of a parameter from a sample. Today, we are going to begin discussion of the other major 'type' of inference, which is hypothesis testing. Our goal here is not to say what the actual value of the parameter is, but rather, to say with some confidence what this parameter value is not. As we will see, hypothesis testing has a natural fit with the goals of quantitative genomics.

2 Hypothesis Testing

As a review, recall our broader set-up, where we are interested in knowing about a system. To do this, we conduct an experiment, which produces a sample, where we define a sample space S, the elements of which include all possible sample outcomes. We assume a specific probability model, by defining a probability function Pr(S) and a random variable X(S) on this sample space, where defining the probability function Pr(S) induces a probability distribution on our random variable, Pr(X) or Pr(X = x). We assume that our true probability distribution is in a 'family' of probability distributions that are indexed by parameter(s) θ, e.g. X ∼ N(µ, σ²), which we write Pr(X|θ) or Pr(X = x|θ), where we do not know the specific values of the parameters. Previously, our goal was to estimate the value of this unknown parameter using a sample, which is i.i.d. observations of our random variable X, written X = [X1, ..., Xn] or (X = x) = [X1 = x1, ..., Xn = xn]. Our assumed probability distribution on our random variable X induces a (joint) probability distribution over all the possible samples that we could produce: Pr(X) = Pr(X1, ..., Xn) or Pr(X = x) = Pr(X1 = x1, ..., Xn = xn), and when our sample is i.i.d., each of the individual observations in our sample has a probability distribution that is the same as our random variable, Pr(Xi = xi|θ). The process of estimation requires that we define a statistic such that it is an estimate of the parameter θ.
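As a sketch of this review in code (our own; the 'true' parameter value µ = 3 and sample size n = 20 are illustrative), an i.i.d. sample induces a sampling distribution on an estimator such as the sample mean, which we can see by repeating the experiment many times:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Pretend the true (unknown to the analyst) parameter is mu = 3, sigma^2 = 1.
mu_true, n, reps = 3.0, 20, 10_000

# Each row is one i.i.d. sample X = [X1, ..., Xn]; the estimator
# T(x) = mean(x) then has its own (sampling) distribution.
samples = rng.normal(mu_true, 1.0, size=(reps, n))
estimates = samples.mean(axis=1)

print(estimates.mean())  # centered near the true mu = 3
print(estimates.std())   # ~ 1/sqrt(n) = 0.224: most estimates are 'close'
```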

Page 59:

Results of hypothesis decisions II: when H0 is wrong (!!)

• There are only two possible decisions we can make as a result of our hypothesis test: reject or cannot reject


[slide diagram: the null sampling distribution Pr(T(x) | H0)]


Results of hypothesis decisions II: when H0 is wrong (!!)

• There are only two possible decisions we can make as a result of our hypothesis test: reject or cannot reject


• We never know for sure whether we are right (!!)

• If we cannot reject, this does not mean H0 is true (why?)

• Note that we can control the level of type I error because we decide on the value of α

Important concepts I

Now, while the sampling distribution of a statistic will be specific to a given example with associated assumptions, the probability distribution of a p-value (when H0 is true) is always the same. Specifically, a p-value has a uniform distribution over the interval [0, 1], which we may write Pr(pval) ∼ U[0, 1]. While we haven't seen the uniform distribution yet, this particular distribution is pretty intuitive, i.e. each interval of the same size over zero to one has the same probability. Why would we define p-values in such a way? This actually makes sense since, intuitively, regardless of the test we perform, we would like the same formal way of assessing the results of the test. A p-value allows us to do this, e.g. rejecting H0 when pval < α means the same thing regardless of the specific test we perform. We will use the fact that p-values have a uniform distribution later in the course when we discuss solutions to the multiple testing issue.
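As a quick check of this claim, here is a minimal sketch (an illustration, not from the notes) that draws many samples under H0 : µ = 0, computes the one-sided p-values, and verifies that they land roughly uniformly in [0, 1]:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples drawn under H0
    pvals = stats.norm.sf(x)                          # one-sided p-values
    hist, _ = np.histogram(pvals, bins=10, range=(0.0, 1.0))
    print(hist / len(pvals))   # each of the 10 bins holds ~0.1 of the p-values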

A few additional important concepts:

1. There are two possible outcomes of a hypothesis test: we reject H0 or we cannot reject H0.

2. If we cannot reject H0, this does not mean that H0 is true. This is because a p-value above α can occur by chance even when H0 is false, just as a low p-value can occur by chance when H0 is true. While people often use 'accept' H0 for the case where we cannot reject H0, we will not use this phrase in this class because of the confusion it can cause, i.e. 'accept' seems to imply that H0 is true.

3. α is called the type I error, which is the probability of incorrectly rejecting H0 by chance when H0 is true.

4. 1 − α is the probability of making the correct decision not to reject H0.

5. Note that we can control the level of α, and hence the type I error, by setting our critical value to a particular value. This is because we know what the sampling distribution of our statistic will be when assuming a specific value of our parameter.

So far, we have considered the case where H0 is true. How about the case where the true value is different from our H0? To make the consequences of this clear, let's consider our example above of a normally distributed random variable with σ² = 1, a single observation n = 1, and a one-sided hypothesis test H0 : µ = 0. However, in this case, let's say that (unknown to us) the true value is µ = 1. In this case, getting an observation such as x1 = 2.5, where we reject H0, is not all that unlikely. In fact, if we consider α = 0.05 (which means cα = 1.65) we can calculate the probability 1 − β of rejecting H0:

1 − β = ∫_{cα}^{+∞} f_X(x | µ = 1, σ² = 1) dx   (10)

(see class for a diagram). We can also calculate the probability β that we will incorrectly not reject H0:

β = ∫_{−∞}^{cα} f_X(x | µ = 1, σ² = 1) dx   (11)

We can similarly construct these for a two-tailed test for a case where we knew the true value of µ (which we will never know in practice). We call 1 − β the power of the test, i.e. the probability of making the correct decision given that H0 is false.


• Unlike type I error, which we can set, we cannot control power directly (since it depends on the actual parameter value)

• However, since power depends on how far the true value of the parameter is from the H0, we can make decisions to increase power depending on how we set up our experiment and test:

• Greater sample size = greater power

• Greater the value of α that we set = greater power (trade-off!)

• One-sided or two-sided test (which is more powerful?)

• How we define our statistic (a more technical concept...)

(see diagram on board for an example). Also, note in this particular case:

pval(T(x)) = ∫_{x1}^{+∞} f_X(x) dx = 1 − F_X(x1) = 1 − ∫_{−∞}^{x1} f_X(x) dx   (5)

where F_X(x) is the cumulative distribution function of X (in general, cdfs of statistics have a close relationship with p-values).


So, how do we make use of a p-value? Let's go back to our example case where we assume X ∼ N(µ, σ²), where we assume that we know σ² = 1, and where we are going to do a one-sided test of H0 : µ = 0 using a sample of size n = 1 and statistic T(x1) = x1. As an example, say our sample was x1 = 0. In this case, our p-value would be:

pval = ∫_{0}^{+∞} f_X(x) dx = 0.5

Similarly, we have pval(x1 = 1) = 0.159, pval(x1 = 1.65) = 0.05, and pval(x1 = 2.5) = 0.0062. So, for our case where x1 = 2.5, the probability of getting this value or more extreme, conditional on H0 : µ = 0 being true, is quite small. Can we interpret this as evidence against H0? Yes we can, and intuitively, this is how we assess our null hypothesis. However, this still does not provide us a guideline for saying 'yes' or 'no' when considering the question: is H0 false? To make this decision, we generally decide on some probability α where, if pval < α, we reject H0, i.e. we decide that H0 is not correct. Where we set α is quite arbitrary (and, as we shall see, depends on what trade-offs we want to make in our hypothesis testing framework) but it is often the case that we set this value reasonably low, to values such as α = 0.05 or α = 0.01. Note that in our example, a given value of α corresponds to a specific value of X, which we will designate cα, the critical value:

α = ∫_{cα}^{+∞} f_X(x) dx   (6)

where for α = 0.05 we have cα = 1.65 in our example (see class for a diagram). To use α (and cα), we pre-define this value (i.e. α = 0.05), we reject H0 if our p-value is below this value (or, equivalently in this case, if x1 ≥ cα), and we cannot reject H0 if our p-value is not below this value. Note that we can interpret α = 0.05 as a probability of 0.05 that we would have obtained the value of our statistic or more extreme by chance when H0 is true, so even if our p-value is less than α, we cannot reject H0 with absolute certainty.
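The p-values quoted above can be reproduced numerically; here is a minimal sketch (an illustration in Python/SciPy, not part of the original notes) using the standard normal survival function, and its inverse for the critical value:

    from scipy import stats

    # One-sided p-values pval(x1) = Pr(X >= x1 | mu = 0, sigma^2 = 1):
    for x1 in [0.0, 1.0, 1.65, 2.5]:
        print(x1, round(stats.norm.sf(x1), 4))
    # prints 0.5, 0.1587, 0.0495, 0.0062 - matching the values in the text

    # The critical value c_alpha is the inverse of the survival function:
    print(stats.norm.isf(0.05))   # ~1.645, i.e. the c_alpha = 1.65 above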

Now, in our example, we have considered a case where we reject the null hypothesis if our statistic is equal to or greater than a particular value. This is an example of a one-sided test. We might want to construct a one-sided test if we think, from our previous experience, that if H0 : µ = 0 is wrong the true value of µ is going to be positive (or if we only care about cases where µ is positive). In general, for one-sided tests, if we have a continuous statistic T(X = x) that could take values in (−∞, +∞), we can define a critical value in either tail.

In general, for a test statistic with a continuous distribution, a one-sided test, and (unbeknownst to us) a true parameter value θ = w, writing c for the value of θ under H0, we have:

1 − α = ∫_{−∞}^{cα} Pr(T(x) | θ = c) dT(x)   (14)

α = ∫_{cα}^{+∞} Pr(T(x) | θ = c) dT(x)   (15)

β = ∫_{−∞}^{cα} Pr(T(x) | θ = w) dT(x)   (16)

1 − β = ∫_{cα}^{+∞} Pr(T(x) | θ = w) dT(x)   (17)

and for a two-sided test:

1 − β = ∫_{−∞}^{−cα} Pr(T(x) | θ = w) dT(x) + ∫_{cα}^{+∞} Pr(T(x) | θ = w) dT(x)   (18)

A few comments:

1. β is the type II error of the test, i.e. the probability of making the incorrect decision 'do not reject H0' given that H0 is false.

2. Unlike the case of α, the type I error (and 1 − α), which we know exactly (and set), we will never know the true value of 1 − β, the power (or β, the type II error), since these depend on the true values of the parameters, which are unknown to us.

3. However, we can use strategies to set up our hypothesis tests in ways that increase power and decrease type II error compared to alternative ways of setting up the tests (as we will see below and discuss next lecture).
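For the running example (H0 : µ = 0, σ² = 1, n = 1, α = 0.05, true µ = 1), equations (16) and (17) can be evaluated directly; a minimal sketch, again assuming SciPy:

    from scipy import stats

    alpha = 0.05
    c_alpha = stats.norm.isf(alpha)         # ~1.645 under H0: mu = 0, sigma = 1
    w = 1.0                                 # the (unknown to us) true value of mu
    beta = stats.norm.cdf(c_alpha, loc=w)   # eq (16): Pr(T(x) < c_alpha | mu = w)
    power = stats.norm.sf(c_alpha, loc=w)   # eq (17): Pr(T(x) >= c_alpha | mu = w)
    print(beta, power)                      # ~0.74 and ~0.26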

With these concepts in hand, we can write out the following cases, which depend on the two possible outcomes of a hypothesis test (we reject H0 or do not reject H0) and on the two possible states: H0 is true or H0 is false:

                       H0 is true               H0 is false
  cannot reject H0     1 − α (correct)          β, type II error
  reject H0            α, type I error          1 − β, power (correct)


Important concepts II


• We need one more concept to complete our formal introduction to hypothesis testing: the alternative hypothesis (HA)

• This defines the set (interval) of values where we suspect our true parameter value will fall if our H0 is incorrect, i.e. for our example above: HA : µ > 0 (one-sided) or HA : µ ≠ 0 (two-sided)

• A complete hypothesis testing setup includes both H0 and HA

• HA makes the concept of one- and two-tailed explicit

• REMINDER (!!): If you reject H0 you cannot say HA is true (!!)

Final general concept I

Now, power is something that we in general discuss quite a bit, because we often set up H0 as a 'straw man' and we are interested in the probability that we will get the right answer if H0 is indeed false. A few comments on the dependencies of power:

1. Power depends on the true value of the parameter, i.e. how far it is from our H0 (see class for diagram).

2. We can control power by setting α to different values, e.g. the larger we set α, the greater the power of the test. However, the trade-off with setting α higher is that we are giving ourselves a greater chance of making a type I error.

3. Power depends on whether we use a one-sided or two-sided test. For example, a one-sided test may be more powerful than a two-sided test if we are sure that, if H0 : µ = 0 is wrong, the true value of µ is (much) greater than zero. However, a two-sided test may be more powerful if it is unclear what the true value of µ will be if H0 : µ = 0 is false.

4. When we set up hypothesis tests optimally, power increases with increasing sample size n.

As an example of this last point, let's contrast our example where n = 1 with a case where n is much larger. Let's take the case where H0 : µ = 0, HA : µ > 0 and let's say that the true value of µ = 1. When n = 1, for our statistic T(x) = (1/n) Σ_{i=1}^{n} x_i we have Pr(T(x)) ∼ N(µ, σ² = 1). More generally for this statistic (for n unrestricted), the sampling distribution is Pr(T(x)) ∼ N(µ, σ²/n = 1/n). Thus, the variance of the sampling distribution of our statistic gets smaller as our sample size increases. In the case where the true value of µ = 1, this means that more of the true sampling distribution of our statistic will be concentrated close to one. This in turn means more of this distribution will be to the right of our critical value cα as the sample size gets larger and, as a consequence, the power increases as the sample size increases (see class for diagram).
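A minimal numerical sketch of this point (an illustration assuming SciPy, not from the notes): with T(x) the sample mean, the critical value shrinks with n and the power against true µ = 1 grows quickly:

    import numpy as np
    from scipy import stats

    alpha, mu_true = 0.05, 1.0
    for n in [1, 4, 9, 25]:
        se = 1.0 / np.sqrt(n)   # sd of the sample mean when sigma^2 = 1
        c_alpha = stats.norm.isf(alpha, loc=0.0, scale=se)  # critical value under H0
        power = stats.norm.sf(c_alpha, loc=mu_true, scale=se)
        print(n, round(power, 3))
    # prints ~0.26 (n = 1), ~0.64 (n = 4), ~0.91 (n = 9), ~1.0 (n = 25)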

3 Alternative Hypotheses

We now just need one more concept to complete our formal introduction to hypothesis testing. The alternative hypothesis HA defines the set (interval) of values that we are concerned with, in which we suspect our parameter will fall if H0 is not correct (= false). For example, for our case where we assume X ∼ N(µ, σ²) and where we are going to do a one-sided test of H0 : µ = 0, we set up HA as follows:

HA : µ > 0   (1)

and we can similarly define HA for a two-sided test:

HA : µ ≠ 0   (2)
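The choice of HA changes how the p-value is computed; a minimal sketch (illustrative, assuming SciPy) contrasting the one-sided and two-sided p-values for the observation x1 = 2.5 from earlier:

    from scipy import stats

    x1 = 2.5
    one_sided = stats.norm.sf(x1)            # HA: mu > 0, upper tail only
    two_sided = 2 * stats.norm.sf(abs(x1))   # HA: mu != 0, both tails
    print(one_sided, two_sided)              # ~0.0062 vs ~0.0124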



• Note that since we have induced a probability model on our r.v. -> sample -> statistic, and a p-value is a function on a statistic, we also have a probability distribution on our p-values

• This is the distribution of possible p-values we could obtain over an infinite number of different samples (sets of experimental trials)!

• When H0 is true, this distribution is always (!!) the uniform distribution on [0,1] (regardless of the statistic or hypothesis test)


Final general concept II


Statistical Issues I: Type 1 error

• Recall that Type 1 error is the probability of incorrectly rejecting the null hypothesis when it is correct

• We can control Type 1 error by setting it to a specified level, but recall there is a trade-off: if we set it too low, we will not make a Type 1 error but we will also never reject the null hypothesis, even when it is wrong

• In general we like to set a conservative Type 1 error for a case where we perform many tests (why is this!?)

• To do this, we have to deal with the multiple testing problem


Statistical Issues II: Multiple Testing

• Say that we perform N hypothesis tests

• Recall that if we set a Type 1 error to a level (say 0.05) this is the probability of incorrectly rejecting the null hypothesis

• If we performed N tests that were independent, we would therefore expect to incorrectly reject the null N*0.05 times, and if N is large, we would therefore make LOTS of errors (!!)

• This is the multiple testing problem = the more tests we perform the greater the probability of making a Type 1 error
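A minimal simulation sketch of this point (illustrative, assuming NumPy/SciPy, not from the slides): with every null hypothesis true, the number of 'significant' results still scales as N*α:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    N, alpha = 10_000, 0.05
    pvals = stats.norm.sf(rng.normal(size=N))  # N independent tests, all nulls true
    print((pvals < alpha).sum())               # ~N * alpha = 500 false positives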


Correcting for multiple tests I

• Since we can control the Type I error, we can correct for the probability of making a Type 1 error due to multiple tests

• There are two general approaches for doing this: those that involve a Bonferroni correction and those that involve a correction based on an estimate of the False Discovery Rate (FDR)

• Both are different techniques for controlling Type 1 error but in practice, both set the Type I error to a specified level (!!)

Correcting for multiple tests II

• A Bonferroni correction sets the Type I error for the entire set of N tests using the following approach: for a desired Type 1 error α, set the Bonferroni Type 1 error to αB = α/N

• We therefore use the Bonferroni Type I error to assess EACH of our N tests

• For example, if we have N=100 and we want an overall Type 1 error of 0.05, we require a test to have a p-value less than 0.0005 to be considered significant

In a GWAS, a type I error occurs when we incorrectly reject the null hypothesis for a marker when there is no causal polymorphism in LD, i.e. we do not consider a case where we reject the null for a marker in LD with a causal polymorphism a type I error.

A potential source of Type I error in GWAS arises from the multiple testing problem. Recall that the sampling distribution of p-values, when the null hypothesis is true, is uniform, i.e. Pr(pval) ∼ unif[0, 1]. This means that if we were to repeat a large number of independent hypothesis tests, we would expect the p-values to follow a uniform distribution, such that some of the p-values would be very low just by chance. More precisely, if we set a Type I error at, say, α = 0.05, we would incorrectly reject the null hypothesis in approximately 5% of the cases. If we did a large number of tests, say a million, this would produce a very large number of false positives. This is effectively the case we have in a GWAS (with the one difference that many of the tests of markers are correlated, an issue we will discuss next lecture). If we have N markers, we will perform N hypothesis tests, which means that if our α is set relatively high, we can expect to get a large number of false positives by chance.

Now, the nice property of Type I error is that we can control this error rate by setting α lower. However, there is a trade-off: the lower we set the Type I error, the lower the power of our hypothesis tests (see lecture 7). We therefore would like to adjust the Type I error, but how should this be done? It turns out there is no perfect way to set the Type I error, and all proposed methods have drawbacks. Here we will consider three common approaches for controlling Type I error in GWAS: 1. a Bonferroni correction, 2. a Benjamini-Hochberg correction (which is one way of implementing a False Discovery Rate (FDR) correction), and 3. a permutation approach.

A Bonferroni correction is applied as follows. Say we are interested in controlling the probability of making one Type I error at α among N tests. A strategy for doing this is to set a Bonferroni adjusted Type I error αB, which uses the formula:

αB = α / N   (13)

Note that a Bonferroni correction controls the Type I error of the entire experiment (i.e. the probability of making one or more Type I errors) to 0.05 or, more precisely, to a level that is less than 0.05 (i.e. a standard Bonferroni correction bounds the Type I error below 0.05 because of how this correction is derived). For example, if we were interested in controlling the probability of a single Type I error among N = 1,000,000 tests to α = 0.05, we would set the Type I error to αB = 0.05/1,000,000. It turns out that a Bonferroni correction makes some assumptions (which we will not describe here), which in fact set the probability of making a single Type I error to a level even lower than α. A drawback of these assumptions is that a Bonferroni correction is very 'conservative', i.e. there is a low probability of making a Type I error but also low power.
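The adjustment itself is a one-liner; a small illustrative sketch (not from the notes) for the two examples used here, the slide's N = 100 and the GWAS-scale N = 1,000,000:

    alpha = 0.05
    for N in [100, 1_000_000]:
        alpha_B = alpha / N   # Bonferroni-adjusted per-test threshold
        print(N, alpha_B)     # 100 -> 5e-4; 1,000,000 -> 5e-8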



Correcting for multiple tests III

• A False Discovery Rate (FDR) based approach (there are many variants!) uses the expected number of false positives to set (= control) the Type 1 error

• For N tests and a specified Type 1 error α, the FDR is defined in terms of the number of cases R where the null hypothesis is rejected: FDR = N*α / R

• Intuitively, the FDR is the proportion of cases where we reject the null hypothesis that are false positives

• We can estimate the FDR, e.g. say for N=100,000 tests and a Type I error of 0.05, we reject the null hypothesis 10,000 times, the FDR = 0.5

• FDR methods for controlling for multiple tests (e.g. Benjamini-Hochberg) set the Type 1 error to control the FDR to a specific level, say FDR=0.01 (what is the intuition at this FDR level?)
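The arithmetic of the slide's example, as a tiny illustrative sketch:

    N, alpha, R = 100_000, 0.05, 10_000
    fdr = (N * alpha) / R   # expected false positives / observed rejections
    print(fdr)              # 0.5: about half of the 10,000 rejections are false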


Last lecture, we defined a Bonferroni adjusted Type I error αB strategy, which uses the formula:

αB = α / N   (1)

Recall that a drawback of a Bonferroni correction is that the approach is very 'conservative', i.e. there is a low probability of making a Type I error but also low power. This is particularly problematic in cases where the number of markers N (i.e. tests) is large and the sample size n is not particularly large, since larger sample sizes are required to produce lower p-values. This means that a Bonferroni correction using a large N can make the power of GWAS tests so low that true positives do not produce significant results, i.e. the power is extremely low. However, a Bonferroni correction is a good strategy for keeping Type I error extremely low, and it is often applied in GWAS studies, where significant tests at a Bonferroni-corrected Type I error are considered to have a very low probability of being false positives.

A class of less conservative approaches for correcting Type I error makes use of the concept of a False Discovery Rate (FDR). These approaches include many variants. To get some intuition concerning an FDR, consider the following example. If we have N = 1,000,000 independent tests, we would expect 50,000 of them to incorrectly reject the null hypothesis by chance with a Type I error rate set to α = 0.05. What if we were to perform the GWAS study and we reject the null in 100,000 cases? This is a far greater number than we would expect to reject by chance if the null hypothesis were true for every test. This may therefore indicate that some of the tests that we rejected were actually true positives. A way of calculating the proportion of these tests that are false positives is to calculate a False Discovery Rate (FDR):

FDR = N * α / R   (2)

where R is the number of tests where H0 was rejected. For our simplistic example, the FDR would be 0.5, indicating that half of the cases where we rejected H0 are expected to be false positives (and, equivalently, that half reflect true positives). Now, unfortunately, we have no way of knowing which of the tests where we rejected the null are the false positives, only the proportion, i.e. we can't say for any one test whether it is a false positive or not. However, a way to deal with this problem is to set the FDR to some specified low level, e.g. FDR = 0.1. In such cases, the FDR is low enough that the probability of any given rejected test being a false positive is low, so we can be relatively confident that it represents a true positive. We can identify such a level by setting the FDR in equation (2) to a desired level (say FDR = 0.1) and considering α in equation (2) to be the p-value threshold below which we consider tests significant.
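The notes name the Benjamini-Hochberg correction without spelling out the procedure; here is a hedged sketch of the standard step-up rule (my illustration, assuming NumPy; not code from the course):

    import numpy as np

    def benjamini_hochberg(pvals, fdr=0.1):
        """Standard BH step-up rule: with sorted p-values p_(1) <= ... <= p_(N),
        find the largest k with p_(k) <= (k / N) * fdr and reject the nulls
        for the k smallest p-values."""
        p = np.asarray(pvals, dtype=float)
        N = len(p)
        order = np.argsort(p)
        below = p[order] <= (np.arange(1, N + 1) / N) * fdr
        k = (np.nonzero(below)[0].max() + 1) if below.any() else 0
        reject = np.zeros(N, dtype=bool)
        reject[order[:k]] = True
        return reject

    # Example: mostly-null p-values plus a few very small ones.
    rng = np.random.default_rng(3)
    p = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
    print(benjamini_hochberg(p, fdr=0.1).sum())   # rejects ~5 tests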



Wrap-Up Conceptual Overview

[Flow diagram: System → Question → Experiment → Sample → Probability Model → Sample and Statistic Distributions → Estimator / p-values → Inference]


What next (time permitting)?

• Many topics we can cover - let’s come up with some of interest