
PHIL 6334 - Probability/Statistics Lecture Notes 7:

An Introduction to Bayesian Inference

Aris Spanos [Spring 2014]

1 Introduction to Bayesian Inference

These lecture notes provide an elementary introduction to Bayesian inference, focusing almost exclusively on two simple models, the simple Bernoulli and Normal models, to keep the technical details to a minimum. Textbooks often motivate the various techniques associated with Bayesian statistics by contrasting them with those of frequentist (classical) inference. Indeed, adherents of the Bayesian approach often begin with a litany of charges leveled against the frequentist approach, which are usually due to an inadequate understanding of the latter; see section 5. The following quotation from O'Hagan (1994), p. 16, is typical of such charges:

"Broadly speaking, some of the arguments in favour of the Bayesian approach are that it is fundamentally sound, very flexible, produces clear and direct inferences and makes use of all the available information. In contrast, the classical approach suffers from some philosophical flaws, has restrictive range of inferences with rather indirect meaning and ignores prior information."

To address the above claims by O'Hagan one needs to compare the two approaches in depth, focusing on the underlying reasoning and their primary objectives.

Section 2: Probability and its various interpretations
Section 3: Frequentist Inference
Section 4: Bayesian Inference
Section 5: Charges against the frequentist approach to inference.


2 Probability and its various interpretations

For mathematical purposes Kolmogorov's axiomatization of probability is perfectly adequate, but when it comes to statistical inference an interpretation is absolutely necessary. What does mathematical probability correspond to in the real world? This correspondence will determine the kind of inductive procedures one should follow in 'learning from data'. That is, the interpretation of probability will determine the nature of the inductive (statistical) inference called for.

From the viewpoint of statistical inference let us consider the following interpretations of probability:

(i) the classical (equally-likely) interpretation,
(ii) the degrees of belief interpretation,
(iii) the relative frequency interpretation, Pr(A).

For a better understanding of the various interpretations of probability one should consider them in the context in which they were first developed. As early as the 18th century all three basic interpretations of probability were being used in different contexts without much thought of choosing one interpretation for all purposes.

The classical interpretation was used in the context of games of chance and was viewed as stemming from equal-probability outcomes based on some sort of physical symmetry.

The degrees of belief interpretation originated from attempts to quantify the relationship between the evidence presented in courts and the degree of conviction in the mind of the judge.

The relative frequency interpretation originated from mortality and socio-economic data gathered over long periods of time from the 16th century onwards.


The question to be considered is whether any of these interpretations can provide a pertinent link between Kolmogorov's mathematical theory of probability and empirical modeling and inference.

2.1 The classical interpretation of probability

It is generally accepted that, historically, the theory of probability was developed in the context of games of chance such as casting dice or tossing coins. It was only natural then that the first interpretation of probability was inextricably bound up with the chance mechanism of such games. The first explicit statement of the classical definition of probability was given by Laplace at the beginning of the 19th century.

The classical definition. Consider a random experiment E which has N equally likely outcomes, and suppose event A occurs when m of them occur; then, according to the classical definition of probability:

P(A) = m/N.

Example. Let X be the random variable denoting the number of dots on the sides of a die. When the die is symmetric and homogeneous:

Prob. distribution of X
x      1    2    3    4    5    6
f(x)  1/6  1/6  1/6  1/6  1/6  1/6

The first important feature of this definition is its reliance on the nature of an explicit chance mechanism such as casting dice or tossing coins. Its second crucial feature is that it utilizes the apparent physical symmetry of the device underlying the chance mechanism to define probability by evaluating it as "the ratio of the number of outcomes favorable to the event to the total number of possible outcomes, each assumed to be equally likely" (see Laplace, 1814).

For the purposes of providing the missing link between the mathematical concept of a statistical model and the notion of chance regularity, this definition of probability is inadequate for a number of reasons, including:

(i) it is based on an explicit chance mechanism,
(ii) the chance mechanism has a built-in physical symmetry that leads to equally likely outcomes,
(iii) it assumes that one can partition the set of outcomes into a finite number of equally likely events.

This definition has been severely criticized in the literature, but the critics tend to concentrate their arrows on the equally-likely clause.

♦ What do we mean by equally likely and how do we recognize equally likely outcomes?

Laplace proposed a principle for justifying equally likely cases, known as:

the principle of insufficient reason,

or the principle of indifference. This is based on the idea that if one has no reason to favor one outcome over another, they are considered equally likely. This principle has given rise to several paradoxes and has been called into question repeatedly (Hacking, 1975).

In addition to the objection to the equally-likely clause, there is one crucial objection to the classical definition: it assumes that one can partition the set of outcomes into a finite number of equally likely events.

♦ What happens when the random experiment does not enjoy this symmetry, such as the case of a biased coin?

♦ What about axiom A1 of the mathematical definition? In the case of an infinite sample space the denominator of the above definition will get us into trouble.

Empirical modeling. The classical interpretation of probability is too restrictive for empirical modeling purposes.

2.2 The ‘degrees of belief’ interpretation of probability

Our interest in the degrees of belief interpretation of probability stems from the fact that it leads to an approach to statistical inference known as the Bayesian approach. The degrees of belief interpretation of probability comes in two flavors: the subjective and the rational.

A. Degrees of subjective belief

The subjective interpretation considers the probability of an event A as based on the personal judgment of whoever is assigning the probability; the personal judgement being based on the individual's experience and background. In this sense the probability of event A is based on the person's beliefs and information relating to the experiment giving rise to event A.

Example. In the case of tossing a fair coin a person is likely to assign the subjective probability P(H) = 1/2, because a symmetric coin would render H and T a priori equally likely. In the case where the person in question has additional information relating to the mechanism, such as that the coin is bent, the subjective probability is likely to change.

De Finetti introduced a convenient way to think of subjective probabilities in terms of betting odds. Let us consider the case of betting on the occurrence of an event A, where somebody offers odds 2 to 1, or in ratio form 1/2. If the person whose degrees of subjective belief we are trying to assess thinks that these are fair odds, then we can proceed to evaluate her subjective probability via:

(1/2)/(1 + 1/2) = 1/3,  i.e.  P(A) = 1/3.

More generally, if the subjective probability for the occurrence of the event A is p (i.e. P(A) = p), then the odds ratio q and the corresponding subjective probability take the form:

q = p/(1−p)  ⇒  p = q/(1+q).

As we can see, the subjective dimension of this probability arises from the fact that it is the decision of a particular individual whether the odds are fair or not. Another individual might consider as fair the odds ratio q′, which implies that her subjective probability is p′ = q′/(1+q′) ≠ p. This is not surprising because the personal experiences which influence judgement are often different between individuals.

The question which naturally arises at this stage is whether such personal subjective probabilities will behave in accordance with the mathematical definition of probability. The answer is yes, under certain restrictions, as demonstrated by Ramsey (1926), de Finetti (1937) and Savage (1954).

B. Degrees of rational belief

Another question with regard to the degrees of belief interpretation of probability is whether one could find some way to establish that a particular odds ratio will be considered fair by a rational person, assuming a formal definition of rationality. The idea is that in such a case the subjective dimension will become less personalistic.

Keynes (1921) was the first to propose such an interpretation, often called logical probability. Carnap (1950) generalized and extended it, but the logical interpretation has been severely criticized on several grounds.


Empirical modeling. How appropriate is the degrees of belief interpretation of probability for empirical modeling? If data x0:=(x1, ..., xn) contain systematic information in the form of chance regularities, 'stochasticity' is a feature of real-world phenomena and exists independently of one's beliefs. Hence, the degrees of belief interpretation has a limited role to play in selecting an appropriate statistical model.

2.3 The frequency interpretation of probability

The frequency interpretation of probability can be traced back to the statistical regularities established during the 18th and 19th centuries, based on demographic, anthropometric, economic and social (crimes, violent deaths, etc.) data. The analysis of these data led to an amazing conclusion:

"despite the unpredictability at the individual level (people, firms etc.) there was a remarkable stability of the relative frequencies at the aggregate level (groups) over long periods of time."

In the context of the frequency interpretation, the probability of an event A is viewed as an empirical regularity associated with this event.

The frequentist interpretation. Consider the case where one is able to repeat an experiment under identical conditions, and denote the relative frequency of the event A after n trials by (m_n/n); then the frequency interpretation views the probability of event A as the limit of the relative frequency (m_n/n) as the number of repetitions goes to infinity:

Pr(A) = lim_{n→∞} (m_n/n).   (1)

Can one prove the above mathematical claim? NO, and thus the von Mises attempt to provide a frequentist interpretation of probability using (1) is a dead end for purely mathematical reasons! Instead, (1) is a mathematically false statement that reflects the intuition underlying the frequentist interpretation of probability. It becomes a mathematically true statement when the limit is interpreted in probabilistic terms; see Spanos (2013).

A formal justification for the frequentist interpretation as the limit of relative frequencies is grounded on the Strong Law of Large Numbers (SLLN), which gives precise meaning to the claim 'the sequence of relative frequencies {m_n/n}, n=1,2,..., converges to θ as n → ∞'. To see that, let A ≡ (X_k=1) and Ā ≡ (X_k=0), and define (m_n/n) = (1/n) Σ_{k=1}^n X_k := X̄_n.

Borel (1909). For an IID Bernoulli process {X_k, k∈N} defining the simple Bernoulli model:

M_θ(x):  X_k ∼ BerIID(θ, θ(1−θ)),  k∈N,   (2)

P( lim_{n→∞} (1/n) Σ_{k=1}^n X_k = θ ) = 1.   (3)

That is, as n → ∞ the stochastic sequence {X̄_n}, n=1,2,..., where X̄_n = (1/n) Σ_{k=1}^n X_k, converges to a constant θ with probability one.

[Fig. 1: The Strong Law of Large Numbers in action — a realization of the relative frequency process X̄_n plotted against the index n (vertical axis: Average, roughly 0.40 to 0.70).]
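As an aside (not part of the original notes), the long-run behaviour depicted in Fig. 1 can be reproduced with a few lines of simulation; this is only a sketch, assuming Python with numpy and an illustrative value θ = 0.5.

import numpy as np

# A minimal sketch of the SLLN in action for an IID Bernoulli process:
# simulate X_1,...,X_n and track the relative frequencies (1/n)*sum(X_k).
rng = np.random.default_rng(seed=1)        # seed chosen arbitrarily
theta = 0.5                                # illustrative 'true' value of theta
n = 100                                    # number of trials, as in Fig. 1
x = rng.binomial(1, theta, size=n)         # IID Bernoulli(theta) draws
rel_freq = np.cumsum(x) / np.arange(1, n + 1)
print(rel_freq[:5], rel_freq[-1])          # early values fluctuate, later ones settle near theta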

The long-run metaphor. It is also important to note that the long-run metaphor associated with the frequentist interpretation envisions repeating the mechanism represented by an IID Bernoulli process and observing the relative frequency process {X̄_n}, n=1,2,..., (almost surely) approximating θ as n → ∞.

It is crucial to emphasize that, when viewed in the context of a statistical model M_θ(x), the key dimension of the long-run metaphor is not the temporal aspect. Keynes's tongue-in-cheek comment that "in the long-run we will all be dead" is totally misplaced in this context.

The crucial dimension of the long-run is repeatability (in principle) of the data-generating process. To generate a realization like the one shown in Fig. 1 takes a fraction of a nanosecond! Moreover, the long-run can be operationalized on a computer for any statistical model using the statistical GM.

Example. In the case of the simple Normal model, one can generate as many sample realizations as wanted using pseudo-random numbers:

X_k = μ + σε_k,  ε_k ∼ N(0, 1),  k = 1, 2, ..., n.

Circularity? The issue often raised, when invoking the SLLN as a justification for the frequency definition of probability, is that the argument suffers from circularity:

one uses probability to define probability!

This claim is based on confusion. The SLLN states that, under certain restrictions on the probabilistic structure of {X_k, k∈N}, one can deduce:

P( lim_{n→∞} (1/n) Σ_{k=1}^n X_k = θ ) = 1.   (4)

This is a measure-theoretic result which asserts that the convergence holds everywhere in a domain D except on a subset D0 of D, the latter being a set of measure zero (P(D0) = 0), the impossible event. This mathematical theorem links the probability P(A) to the relative frequencies (1/n) Σ_{k=1}^n X_k via the interpretive provisions:

[i] data x0:=(x1, x2, ..., xn) is viewed as a 'truly typical' realization of the process {X_k, k∈N} specified by M_θ(x),
[ii] the 'typicality' of x0 (e.g. IID) can be assessed using trenchant Mis-Specification (M-S) testing.

These provisions provide a pertinent link between the mathematical framework and the data-generating mechanism that gave rise to x0.

3 Frequentist Inference

Statistical model M_θ(x) = {f(x;θ), θ∈Θ}, x∈R^n_X
   ⟹  Distribution of the sample: f(x;θ), x∈R^n_X
   combined with Data: x0 := (x1, x2, ..., xn)
   ⟹  Likelihood function: L(θ; x0), θ∈Θ

The frequentist approach to statistical inference

For the frequentist approach:

[a] The interpretation of probability is the frequency interpretation.

[b] The systematic statistical information in data x0, in the form of chance regularities, constitutes the only relevant statistical information for selecting the statistical model. Substantive information comes in the form of restrictions on statistical parameters, but should not be imposed at the outset; its validity needs to be established before being imposed.

[c] The primary aim of the frequentist approach is to learn from data x0 about the 'true' underlying data-generating mechanism M*(x) = {f(x;θ*)}, x∈R^n_X, where θ* denotes the true value of θ. In general, the expression 'θ* denotes the true value of θ' is a shorthand for saying that 'data x0 constitute a realization of the sample X with distribution f(x;θ*)'. This is achieved by employing reliable and effective inference procedures that revolve around the unknown parameter(s) θ based on ascertainable error probabilities.

The underlying inductive reasoning comes in two forms:
(i) factual: the true state of nature (estimation and prediction), whatever that happens to be, and
(ii) hypothetical: various hypothetical scenarios are compared to what actually happened (hypothesis testing).

[d] Frequentist inference is based exclusively on the distribution of the sample f(x;θ), x∈R^n_X. This is because all frequentist procedures — estimators, test statistics and predictors — are based on statistics of the form:

Y_n = h(X1, X2, ..., Xn),

whose sampling distribution is determined by f(x;θ) via:

F_n(y) = P(Y_n ≤ y) = ∫∫···∫_{{x: h(x1,...,xn) ≤ y}} f(x;θ) dx1 dx2 ··· dxn.   (5)

Its importance stems from the fact that all relevant error probabilities (coverage, type I and II, power, p-value) associated with frequentist inference are based on such a sampling distribution. In the case where the density function f(y) is continuous: f(y;θ) = dF_n(y)/dy.


3.0.1 The simple Bernoulli model

The discussion that follows focuses on the simple Bernoulli model, specified by:

M_θ(x):  X_k ∼ BerIID(θ, θ(1−θ)),  k∈N:=(1, 2, ..., n, ...),

where 'BerIID' stands for 'Bernoulli, Independent and Identically Distributed'. Using the IID assumptions, one can derive the distribution of the sample:

f(x1, x2, ..., xn; θ) = ∏_{k=1}^n f(x_k; θ)  [by Independence]
                      = ∏_{k=1}^n θ^(x_k)(1−θ)^(1−x_k)  [Bernoulli]
                      = θ^(Σ x_k)(1−θ)^(Σ(1−x_k)) = θ^(n x̄_n)(1−θ)^(n(1−x̄_n)),   (6)

where x̄_n = (1/n) Σ_{k=1}^n x_k is the sample mean, and y = n x̄_n denotes the number of 'successes' (x_k = 1) in n trials, i.e.

f(x;θ) = θ^(n x̄_n)(1−θ)^(n(1−x̄_n)), for all x∈R^n_X = {0, 1}^n,   (7)

which, viewed as a function of y = n x̄_n, is Binomially distributed. The likelihood function is defined as proportional to the distribution of the sample:

L(θ; x0) ∝ f(x0; θ), for all θ∈[0, 1].   (8)

Note that the proportionality (∝) is important because L(θ; x0) is viewed as a function of θ∈Θ, but f(x;θ) is a function of x∈R^n_X, and they usually have very different dimensions. Indeed, in the simple Bernoulli case f(x;θ) is discrete but L(θ; x0) is continuous in θ.

Example (a). For y = n x̄_n = 4, n = 20, the likelihood function takes the form:

L(θ; x0) ∝ θ^4 (1−θ)^16,  θ∈[0, 1].


[Fig. 1: f(x;θ), x∈R^n_X — the Binomial sampling distribution, n = 20, p = 0.2]
[Fig. 2: L(θ; x0), θ∈[0, 1] — the likelihood function]

The sampling distribution of y is shown in Fig. 1 and the normalized form of the likelihood function is given in Fig. 2.
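As an aside (not in the original notes), the two panels can be reproduced numerically; a sketch assuming Python with numpy/scipy, using the numbers of Example (a) (y = 4, n = 20):

import numpy as np
from scipy import stats

# Sampling distribution of y ~ Bin(n=20, theta=0.2): a discrete pmf over y = 0,...,20 (Fig. 1).
n, theta0 = 20, 0.2
y_vals = np.arange(n + 1)
pmf = stats.binom.pmf(y_vals, n, theta0)

# Likelihood L(theta; x0) ∝ theta^4 (1-theta)^16, a continuous function of theta (Fig. 2),
# normalized here so that the curve integrates (approximately) to one over [0, 1].
theta = np.linspace(0, 1, 501)
lik = theta**4 * (1 - theta)**16
lik_normalized = lik / (lik.sum() * (theta[1] - theta[0]))
print(pmf[4], lik_normalized.max())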

3.0.2 The simple Normal model

The discussion that follows focuses on the simple Normal model, specified by:

M_θ(x):  X_k ∼ NIID(μ, σ²),  k∈N:=(1, 2, ..., n, ...).

The distribution of the sample takes the form:

f(x;θ) = ∏_{k=1}^n (1/(σ√(2π))) exp{−(1/(2σ²))(x_k−μ)²} = (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²}.

This means that the likelihood function is:

L(μ, σ²; x0) ∝ (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²},  −∞ < μ < ∞, σ² > 0.

In the case where σ² is known this simplifies to:

L(μ; x0) ∝ exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²},  −∞ < μ < ∞.


3.1 Frequentist Estimation (point)

In general, the Maximum Likelihood (ML) estimator of θ is defined by:

θ̂(X) = arg max_{θ∈Θ} ln L(θ; x).

It is important to distinguish between three different concepts: θ — unknown constant, θ̂(X) — estimator of θ, θ̂(x0) — estimate of θ.

In the case of the simple Bernoulli model, the first-order condition for maximizing ln L(θ; x) is:

d ln L(θ; x)/dθ = (n x̄_n)/θ − n(1−x̄_n)/(1−θ) = 0,

which, when solved for θ, gives rise to the ML estimator:

θ̂(X) = X̄_n = (1/n) Σ_{k=1}^n X_k.   (9)

The sampling distribution of this estimator is:

θ̂(X) ∼ Bin(θ, θ(1−θ)/n; n),   (10)

where '∼ Bin(m, v; n)' reads 'distributed Binomially with mean m, variance v and sample size n'.

What is a good (optimal) estimator in frequentist inference? It is a statistic h(X) that pin-points θ*, the true value of θ, most effectively. Its capacity (effectiveness) to pin-point θ* is defined in terms of several optimal properties, such as (i) Unbiasedness: E(θ̂(X)) = θ*, (ii) Full Efficiency: Var(θ̂(X)) equals the Cramér-Rao lower bound CR(θ*), (iii) Strong Consistency: P(lim_{n→∞} θ̂(X) = θ*) = 1.

Example. The estimator θ̂(X) = (1/n) Σ_{k=1}^n X_k, whose sampling distribution is given by (10), is optimal because it is unbiased, fully efficient and strongly consistent!


3.2 Confidence Intervals (CIs)

The most straightforward procedure to derive a Confidence Interval (CI) is to find a pivotal quantity whose distribution under θ = θ* is known.

A. For the simple Bernoulli model such a pivotal quantity can be specified using the ML estimator θ̂(X) = (1/n) Σ_{k=1}^n X_k and its sampling distribution in (10):

d(X; θ) = √n(θ̂(X) − θ)/√(θ(1−θ))  ∼  Bin(0, 1; n) under θ = θ*.   (11)

In general, an optimal CI begins with an optimal estimator. Since its sampling distribution is known, one can specify a two-sided CI of the form:

P(−c_{α/2} ≤ d(X; θ) ≤ c_{α/2}; θ = θ*) = 1 − α,

and then 'solve' for θ to derive the optimal (1−α) Confidence Interval (CI):

P( θ̂ − c_{α/2}√(θ̂(1−θ̂)/n)  ≤  θ  ≤  θ̂ + c_{α/2}√(θ̂(1−θ̂)/n) ) = 1 − α,   (12)

where θ̂ is a shorthand for θ̂(X). In practice, finding c_{α/2} can be cumbersome, and in many cases (12) can be approximated (for a large enough sample size, say n ≥ 20) using the N(0, 1) distribution.

Example (a). In the case where y = n x̄_n = 4, θ̂ = .2, n = 20, the approximate .95 observed CI for θ is:

[ .2 − 1.96√(.2(.8)/20) = .025  ≤  θ  ≤  .375 = .2 + 1.96√(.2(.8)/20) ].

Example (b). In the case where y = n x̄_n = 12, θ̂ = .6, n = 20, the approximate .95 observed CI for θ is:

[ .6 − 1.96√(.6(.4)/20) = .385  ≤  θ  ≤  .815 = .6 + 1.96√(.6(.4)/20) ].
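A quick numerical check of Examples (a) and (b) — an illustration added here, assuming Python with scipy for the Normal quantile:

from scipy import stats

# Approximate (1-alpha) CI for theta in the simple Bernoulli model:
# theta_hat ± c_{alpha/2} * sqrt(theta_hat*(1-theta_hat)/n), as in (12).
def bernoulli_ci(theta_hat, n, alpha=0.05):
    c = stats.norm.ppf(1 - alpha / 2)                   # ≈ 1.96 for alpha = .05
    half_width = c * (theta_hat * (1 - theta_hat) / n) ** 0.5
    return theta_hat - half_width, theta_hat + half_width

print(bernoulli_ci(0.2, 20))   # ≈ (.025, .375), Example (a)
print(bernoulli_ci(0.6, 20))   # ≈ (.385, .815), Example (b)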


B. For the simple Normal model (with σ² known) the pivotal quantity can be specified using the ML estimator X̄_n = (1/n) Σ_{k=1}^n X_k, whose sampling distribution is:

d(X; μ) = √n(X̄_n − μ)/σ  ∼  N(0, 1) under μ = μ*.   (13)

Using (13) one can specify a two-sided CI of the form:

P(−c_{α/2} ≤ d(X; μ) ≤ c_{α/2}; μ = μ*) = 1 − α,

and then 'solve' for μ to derive the optimal (1−α) Confidence Interval (CI):

P( X̄_n − c_{α/2}(σ/√n) ≤ μ ≤ X̄_n + c_{α/2}(σ/√n) ) = 1 − α.   (14)

3.3 Frequentist Testing

A. In the context of the simple Bernoulli model, consider the hypotheses:

H0: θ = θ0  vs.  H1: θ > θ0.   (15)

As in the case of interval estimation, a good test procedure begins with a good estimator, which gives rise to a test statistic. In this case the test statistic looks similar to (11):

d(X) = √n(θ̂(X) − θ0)/√(θ0(1−θ0))  ∼  Bin(0, 1; n) under θ = θ0,

where θ̂(X) := X̄_n, but in fact it has two crucial differences:

(i) the test statistic d(X) involves no unknown parameters, unlike d(X; θ), and
(ii) the evaluation of its sampling distribution is under the null (θ = θ0), which is hypothetical, and not under θ = θ*, which is factual.

The optimal α-significance level test takes the form:

T_α:  d(X) = √n(X̄_n − θ0)/√(θ0(1−θ0)),   C_1(α) := {x : d(x) > c_α}.   (16)


Example. In the case where θ̂ = .75, n = 20, α = .05 (c_{α/2} = 1.96), θ0 = .5, test (16) yields:

d(x0) = √20(.75 − .5)/√(.5(1−.5)) = 2.236,

leading to rejecting H0. The p-value associated with this test is:

P(d(X) > 2.236; H0) = .013.   (17)

Example. For θ̂ = .6, n = 20, α = .05 (c_{α/2} = 1.96), θ0 = .5, test (16) yields:

d(x0) = √20(.6 − .5)/√(.5(1−.5)) = .894,

leading to accepting H0. The p-value associated with this test is:

P(d(X) > .894; H0) = .186.   (18)

B. In the context of the simple Normal model (with σ² known) consider the hypotheses:

H0: μ = μ0  vs.  H1: μ > μ0.

In this case the test statistic takes the form:

τ(X) = √n(X̄_n − μ0)/σ  ∼  N(0, 1) under μ = μ0,

which gives rise to the optimal α-significance level test:

T_α:  τ(X) = √n(X̄_n − μ0)/σ,   C_1(α) := {x : τ(x) > c_α}.

Example. For μ0 = 10, σ = 2, n = 100, α = .05 (c_{α/2} = 1.96), x̄_n = 10.6, this yields:

τ(x0) = √100(10.6 − 10)/2 = 3.0,

leading to rejecting H0. The p-value associated with this test is:

P(τ(X) > 3.0; H0) = .001.   (19)
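The three test statistics and p-values above can be verified in a few lines — an illustration added here, assuming Python with scipy:

from math import sqrt
from scipy import stats

# One-sided test for the Bernoulli case: d(x0) = sqrt(n)(theta_hat - theta0)/sqrt(theta0(1-theta0)),
# with the p-value approximated by the upper N(0,1) tail.
def bernoulli_test(theta_hat, theta0, n):
    d = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))
    return d, 1 - stats.norm.cdf(d)

# One-sided test for the Normal case (sigma known): tau(x0) = sqrt(n)(xbar - mu0)/sigma.
def normal_test(xbar, mu0, sigma, n):
    tau = sqrt(n) * (xbar - mu0) / sigma
    return tau, 1 - stats.norm.cdf(tau)

print(bernoulli_test(0.75, 0.5, 20))   # ≈ (2.236, .013)
print(bernoulli_test(0.60, 0.5, 20))   # ≈ (0.894, .186)
print(normal_test(10.6, 10, 2, 100))   # ≈ (3.0, .001)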


4 Bayesian Inference

4.1 The Bayesian framework

Bayesian inference begins with a statistical model:

M_θ(x) = {f(x;θ), θ∈Θ}, x∈R^n_X, for θ∈Θ⊂R^m,   (20)

where f(x;θ) is the distribution of the sample X:=(X1, ..., Xn), R^n_X is the sample space and Θ the parameter space. Bayesian inference modifies the frequentist inferential set-up, based exclusively on (20), in two crucial respects:

(A) It views the unknown parameter(s) θ as random variables with their own distribution, known as the prior distribution:

π(·): Θ → [0, 1],

which represents one's a priori assessment of how likely the various values of θ in Θ are; this amounts to ranking the different models M_θ(x) for all θ∈Θ. In frequentist inference θ is viewed as a set of unknown constants indexing f(x;θ), x∈R^n_X.

(B) It re-interprets the distribution of the sample as conditional on the unknown parameters θ, denoted by f(x|θ).

Taken together these modifications imply that for Bayesians the joint distribution of the sample is re-defined as:

f(x, θ) = f(x|θ)·π(θ), ∀θ∈Θ, ∀x∈R^n_X,   (21)

where ∀ denotes 'for all'. Distinguishing criteria:

[a] The Bayesian approach interprets probability as degrees of belief [subjective, logical or rational].

[b] In the context of Bayesian inference, relevant information includes:
(i) the data x0:=(x1, x2, ..., xn), and
(ii) the prior distribution π(θ), θ∈Θ.


[c] The primary aim of the Bayesian approach is to revise the initial ranking π(θ), ∀θ∈Θ, in light of the data x0, as summarized by the likelihood L(θ|x0), and to update the ranking in terms of the posterior distribution derived using Bayes' rule:

π(θ|x0) = f(x0|θ)·π(θ) / ∫_Θ f(x0|θ)·π(θ)dθ  ∝  L(θ|x0)·π(θ), ∀θ∈Θ,   (22)

where L(θ|x0) ∝ f(x0|θ), θ∈Θ, denotes the likelihood function, as re-interpreted by Bayesians.

Remark: in relation to (22) it is important to note that, contrary to claims by Bayesians (e.g. Ghosh et al., 2006, p. 31):

"f(x0|θ) is the density of X interpreted as the conditional density of X given θ. The numerator is the joint density of θ and X and the denominator is the marginal density of X",

(i) f(x0|θ) is not "the conditional density of X given θ", since the latter is defined by f(x|θ), ∀x∈R^n_X,
(ii) the numerator is not "the joint density of θ and X", since the latter necessitates the double quantifier as in (21),
(iii) the denominator is not "the marginal density of X", since the latter is defined by f(x) = ∫_Θ f(x|θ)·π(θ)dθ, ∀x∈R^n_X;
see Spanos (2014) for the details.

Prior probabilities: π(θ), θ∈Θ
Statistical model: M_θ(x) = {f(x;θ), θ∈Θ}, x∈R^n_X
Data: x0 := (x1, ..., xn)
   ⟹ [Bayes' rule] ⟹  Posterior distribution: π(θ|x0) ∝ π(θ)·L(θ|x0)

The Bayesian approach to statistical inference


A famous Bayesian, Savage (1954), summarized Bayesian inference succinctly by:

"Inference means for us the change of opinion induced by evidence on the application of Bayes' theorem." (p. 178)

O'Hagan (1994) is more specific:

"Having obtained the posterior density π(θ|x0), the final step of the Bayesian method is to derive from it suitable inference statements. The most usual inference question is this: After seeing the data x0, what do we now know about the parameter θ? The only answer to this question is to present the entire posterior distribution." (p. 6)

In this sense, learning from data in the context of the Bayesian perspective pertains to how the original beliefs π(θ) are revised in light of data x0, the revision coming in the form of the posterior: π(θ|x0), ∀θ∈Θ.

[d] For inference purposes, the only relevant point in the sample space R^n_X is just the data x0, as summarized by L(θ|x0), θ∈Θ. This feature of Bayesian inference is formalized by the Likelihood Principle.

Likelihood Principle. For inference purposes the only relevant sample information pertaining to θ is contained in the likelihood function L(x0|θ), ∀θ∈Θ. Moreover, two sample realizations x0 and y0 contain the same information about θ if their likelihoods are proportional to one another, i.e. L(x0|θ) = c·L(y0|θ) for some constant c > 0 (Berger and Wolpert, 1988, p. 19).

Indeed, Bayesians take delight in poking fun at frequentist testing by quoting Jeffreys's (1939) remark about the 'absurdity' of invoking realizations 'x∈R^n_X' other than x0:

"What the use of P [p-value] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure." (p. 385) [ha, ha, ha!!!!!]

Frequentist inference procedures, such as estimation (point and interval), hypothesis testing and prediction, DO invoke realizations x∈R^n_X other than x0, contravening the LP.

4.2 The posterior distribution and its role in inference

Bayesian inference begins with a statistical model M_θ(x), like the frequentist approach, but it adds a prior distribution π(θ), θ∈Θ, for the unknown parameter θ, viewed as a random variable with its own distribution.

A. In the case of the simple Bernoulli model, the prior of choice for Bayesians is the Beta distribution, defined in terms of two parameters α and β. For the discussion that follows let us assume that the prior π(θ) is Beta(α, β) distributed, with density function:

π(θ) = [1/B(α, β)] θ^(α−1)(1−θ)^(β−1),  α > 0, β > 0, 0 < θ < 1.   (23)

The Beta distribution is very flexible in the sense that for different values of the parameters (α, β) the density function can take many different shapes; see Fig. 4. Such a prior is proper because it covers the whole of the parameter space and it integrates to one, i.e. ∫_0^1 π(θ)dθ = 1.

Bayesian inferences are based on the posterior distribution of θ given data x0, derived via Bayes' formula:

π(θ|x0) = π(θ)f(x0|θ) / ∫_{θ∈[0,1]} π(θ)f(x0|θ)dθ = π(θ)f(x0|θ)/f(x0),  ∀θ∈[0, 1].   (24)

The denominator f(x0) = ∫_0^1 π(θ)f(x0|θ)dθ is a normalizing constant, which is important to ensure that π(θ|x0) is a proper density function, i.e. ∫_{θ∈[0,1]} π(θ|x0)dθ = 1.


[Fig. 4: Beta(α, β) densities for different values of (α, β): (1,1), (1,2), (1,4), (2,1), (2,2), (2,4), (4,1), (4,2), (4,4)]

Example. In the case where f(x0|θ) is given by (6) and π(θ) is given by (23), the posterior distribution is:

π(θ|x0) ∝ ([1/B(α, β)] θ^(α−1)(1−θ)^(β−1)) · (θ^(n x̄)(1−θ)^(n(1−x̄))) = [1/B(α, β)] θ^(n x̄+α−1)(1−θ)^(n(1−x̄)+β−1).   (25)

In view of the formula in (23), this is recognized as an 'unnormalized' density of a Beta(α*, β*), where:

α* = n x̄ + α,  β* = n(1−x̄) + β.   (26)

Since the prior distribution is Beta(α, β) and the posterior is also Beta(α*, β*), the Beta distribution is conjugate to the Bernoulli likelihood (Binomial), i.e. the Bernoulli and Beta constitute a conjugate pair because the prior and posterior belong to the same family of distributions.

[Fig. 5: Jeffreys' invariant prior for the Binomial — Beta(α=0.5, β=0.5): π(θ) = [1/B(.5, .5)] θ^(−.5)(1−θ)^(−.5)]

Example. For Jeffreys' prior θ ∼ Beta(.5, .5) (see Fig. 5) and y = n x̄ = 4, n = 20:

L(θ; x0) ∝ θ^4 (1−θ)^16,  θ∈[0, 1],

π(θ|x0) = [1/B(α*, β*)] θ^(n x̄+α−1)(1−θ)^(n(1−x̄)+β−1),   (27)

i.e. π(θ|x0) is Beta(α*, β*) with α* = n x̄ + α = 4.5, β* = n(1−x̄) + β = 16.5.
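For concreteness (an illustration added here, assuming Python with scipy), the conjugate updating in (26)-(27) is a one-line computation:

from scipy import stats

# Conjugate Beta-Bernoulli updating: prior Beta(a, b) + data (y successes in n trials)
# gives the posterior Beta(a + y, b + n - y), as in (26).
def beta_posterior(a, b, y, n):
    return a + y, b + (n - y)

a_star, b_star = beta_posterior(0.5, 0.5, y=4, n=20)   # Jeffreys prior, example above
print(a_star, b_star)                                  # 4.5, 16.5
posterior = stats.beta(a_star, b_star)
print(posterior.mean())                                # ≈ .214, used again in section 4.3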

[Fig. 6: the (scaled) likelihood L(θ; y = 4)]
[Fig. 7: Beta prior π(θ) vs. posterior π(θ|x0)]

The (scaled) likelihood L(θ; x0) is given in Fig. 6 and the posterior π(θ|x0) is compared to the prior π(θ) in Fig. 7. In Fig. 8, the likelihood is compared to the posterior. Note that all three are continuous, but the sampling distribution of Y ∼ Bin(nθ, nθ(1−θ)) is discrete; all relevant frequentist error probabilities come from this distribution.

[Fig. 8: Beta posterior π(θ|x0) vs. (scaled) likelihood L(θ; y = 4) — first: Beta(4.5, 16.5), second: Beta(5, 17)]
[Fig. 9: Y ∼ Bin(θ = .2; n = 20) — the discrete sampling distribution]

B. In the case of the simple Normal model (with σ² known), the prior often selected is Normal:

π(μ) = (1/(σ0√(2π))) exp{−(1/(2σ0²))(μ−μ0)²}, for μ∈R,

denoted by μ ∼ N(μ0, σ0²), where (μ0, σ0²) are known values. Hence, the posterior is defined by:

π(μ|x0) ∝ π(μ)·f(x0; μ)
        = [ (1/(σ0√(2π))) exp{−(1/(2σ0²))(μ−μ0)²} ] · [ (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²} ]
        ∝ exp{ −(1/2)[ ((μ−μ0)/σ0)² + (1/σ²) Σ_{k=1}^n (x_k−μ)² ] }
        ∝ exp{ −(1/2)[ μ²(1/σ0² + n/σ²) − 2μ(μ0/σ0² + n x̄_n/σ²) ] }
        ∝ exp{ −(1/(2σ1²))(μ − μ1)² },   (28)

where the parameters (μ1, σ1²) of the posterior (after scaling):

π(μ|x0) ∼ N(μ1, σ1²),

take the particular form:

μ1 = σ1²( μ0/σ0² + n x̄_n/σ² ),   σ1² = ( 1/σ0² + n/σ² )^(−1) = σ0²σ²/(σ² + nσ0²).

After some manipulation μ1 can be expressed in the form:

μ1 = λ x̄_n + (1−λ) μ0,  where  λ = σ0²/(σ0² + σ²/n) = σ1²(n/σ²),

which indicates that it is a convex combination of x̄_n and μ0, with the variance of the prior, σ0², and the variance of the MLE, Var(X̄_n) = σ²/n, providing the weights. Moreover, as n → ∞, λ → 1 and μ1 → x̄_n. In words, asymptotically (as n → ∞) the mean of the posterior μ1 converges to the sample mean x̄_n.


Example. (a) Consider the case where μ ∼ N(8, 1), σ = 2, n = 20, x̄_n = 11.6. The posterior is:

π(μ|x0) ∼ N(11.022, .167),

σ1² = (1/1 + 20/4)^(−1) = .167,   μ1 = .167(8/1 + 20(11.6)/4) = 11.022.

[Fig. 10: Normal prior π(μ) vs. posterior π(μ|x0)]

[Fig. 11: Normal posterior π(μ|x0) vs. likelihood L(μ|x0)]

(b) Consider the case where the prior is more precise, i.e. σ0² = .5: μ ∼ N(8, .5), σ = 2, n = 20, x̄_n = 11.6. The posterior has a smaller variance:

π(μ|x0) ∼ N(10.582, .143),

σ1² = (1/.5 + 20/4)^(−1) = .143,   μ1 = .143(8/.5 + 20(11.6)/4) = 10.582.
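Both sets of posterior parameters follow directly from the formulas for (μ1, σ1²) above; a sketch added here in plain Python (no extra libraries), noting that the notes' figures use the rounded σ1² values:

# Conjugate Normal-Normal updating with sigma^2 known:
# sigma1^2 = (1/sigma0^2 + n/sigma^2)^(-1),  mu1 = sigma1^2 * (mu0/sigma0^2 + n*xbar/sigma^2).
def normal_posterior(mu0, sigma0_sq, sigma_sq, n, xbar):
    sigma1_sq = 1.0 / (1.0 / sigma0_sq + n / sigma_sq)
    mu1 = sigma1_sq * (mu0 / sigma0_sq + n * xbar / sigma_sq)
    return mu1, sigma1_sq

print(normal_posterior(8, 1.0, 4, 20, 11.6))   # ≈ (11.0, .167); the notes' 11.022 uses the rounded .167
print(normal_posterior(8, 0.5, 4, 20, 11.6))   # ≈ (10.571, .143); the notes' 10.582 uses the rounded .143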

[Fig. 12: Normal prior π(μ) vs. posterior π(μ|x0)]
[Fig. 13: Normal posterior π(μ|x0) vs. likelihood L(μ|x0)]


Example. Consider the case of the simple Normal model (with σ² known), where the prior is Uniform:

π(μ) = 1 for μ∈R:=(−∞, ∞).

This prior is said to be improper because it does not integrate to one, since:

∫_{−∞}^{∞} π(μ)dμ = ∞ for any π(μ) = c ∈ (0, 1].

In this case the posterior is proportional to the likelihood:

π(μ|x0) ∝ L(μ; x0) = (1/(σ√(2π)))^n exp{−(1/(2σ²)) Σ_{k=1}^n (x_k−μ)²},  −∞ < μ < ∞
        = (1/(σ√(2π)))^n exp{ −(n/(2σ²))[ (1/n) Σ_{k=1}^n (x_k−x̄_n)² + (x̄_n − μ)² ] }
        = (1/(σ√(2π)))^n exp{ −n s²/(2σ²) } · exp{ −(n/(2σ²))(μ − x̄_n)² }.   (29)

Hence, ignoring the constant term exp{−n s²/(2σ²)}, where s² = (1/n) Σ_{k=1}^n (x_k−x̄_n)², the posterior is Normally distributed:

π(μ|x0) ∼ N(x̄_n, σ²/n),  −∞ < μ < ∞.   (30)

Note that in (30) the random variable is μ, which varies with different values of the unknown parameter (μ∈(−∞, ∞)), and x̄_n is a constant. In contrast, the sampling distribution of X̄_n:

X̄_n ∼ N(μ, σ²/n),  x∈R^n_X,   (31)

varies with different values of the sample realization (x∈R^n_X). Hence, any talk about the two results having identical tail areas has to be qualified by what the two tail areas really represent.

The intuitive difference between the two distributions is that one can draw (30) but not (31), since μ is unknown!


4.3 Bayesian Point Estimation

According to O'Hagan (1994):

"Classical inference theory is very concerned with constructing good inference rules. The primary concern of Bayesian inference, ..., is entirely different. The objective [of Bayesian inference] is to extract information concerning θ from the posterior distribution, and to present it helpfully via effective summaries. There are two criteria in this process. The first is to identify interesting features of the posterior distribution. ... The second criterion is good communication. Summaries should be chosen to convey clearly and succinctly all the features of interest." (p. 14)

In this sense, the frequentist inference procedures of estimation, testing and prediction are viewed by Bayesians as stylized inferences, which from the Bayesian perspective represent nothing more than different ways to summarize relevant information in the posterior distribution π(θ|x0). Hence, the relevant criteria for 'goodness' (optimality) pertain to how well they identify and communicate relevant/interesting features of the posterior.

In the case of the stylized inference of a point estimate for θ, one can think of several features of the posterior π(θ|x0), such as a measure of location, that could potentially play such a role by indicating the most 'representative' value of θ in [0, 1].

[1] An obvious choice is to use the mode θ̃ of the posterior distribution π(θ|x0) in (27):

π(θ̃|x0) = sup_{θ∈Θ} π(θ|x0),

which is the value of θ that is ranked highest by π(θ|x0). We know that in the case of θ ∼ Beta(α, β) the mode of π(θ) is (α−1)/(α+β−2). Hence, an example of a Bayesian estimate of θ is:

θ̃ = (α*−1)/(α*+β*−2) = (n x̄ + α − 1)/(n + α + β − 2).   (32)

If we compare this with the ML estimate θ̂(x0) = x̄_n = (1/n) Σ_{k=1}^n x_k, the two will coincide algebraically, i.e. θ̃ = x̄_n, only when α = β = 1: θ ∼ Beta(1, 1) = U(0, 1), π(θ) = 1 for θ∈[0, 1].

 

[Fig. 10: Uniform prior — Beta(α=1, β=1)]

[2] Another 'natural' choice for a most 'representative' value of θ is the mean of the posterior. We know that for θ ∼ Beta(α, β), E(θ) = α/(α+β), and thus:

θ̂_B = α*/(α*+β*) = (n x̄ + α)/(n + α + β).   (33)

Example. Let π(θ) ∼ Beta(.5, .5).

(a) y = n x̄ = 4, n = 20: α* = n x̄ + α = 4.5, β* = n(1−x̄) + β = 16.5,
θ̃ = 3.5/19 = .184,  θ̂_B = 4.5/21 = .214.

(b) y = n x̄ = 12, n = 20: α* = n x̄ + α = 12.5, β* = n(1−x̄) + β = 8.5,
θ̃ = 11.5/19 = .605,  θ̂_B = 12.5/21 = .595.
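These summaries (and the median, which appears in the loss-function discussion below) can be read off the Beta posterior directly; a small illustration added here, assuming Python with scipy:

from scipy import stats

# Posterior summaries for theta|x0 ~ Beta(a*, b*): mode, mean and median.
def beta_summaries(a_star, b_star):
    mode = (a_star - 1) / (a_star + b_star - 2)    # valid for a*, b* > 1
    mean = a_star / (a_star + b_star)
    median = stats.beta.ppf(0.5, a_star, b_star)
    return mode, mean, median

print(beta_summaries(4.5, 16.5))    # ≈ (.184, .214, .205), case (a)
print(beta_summaries(12.5, 8.5))    # ≈ (.605, .595, .598), case (b)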

As we can see from the above numerical examples, the two Bayesian estimates can give rise to different numerical values, depending on how asymmetric the posterior distribution is; the more asymmetric, the greater the difference.

How does a Bayesian decide which one of the above estimators is better?


O'Hagan answers that question by contrasting frequentist (classical) inferences with Bayesian inferences:

"In Bayesian terms, therefore, a good inference is one which contributes effectively to appropriating the information about θ which is conveyed by the posterior distribution." (p. 14)

Indeed, O'Hagan (1994), p. 14, proceeds to argue that criteria for 'good' frequentist inference procedures are only parasitical on Bayes' theorem and enter the picture via the decision theoretic perspective:

"... a study of decision theory has two potential benefits. First, it provides a link to classical inference. It thereby shows to what extent classical estimators, confidence intervals and hypotheses tests can be given a Bayesian interpretation or motivation. Second, it helps identify suitable summaries to give Bayesian answers to stylized inference questions which classical theory addresses."

Invoking information other than the data and the prior. In the case of point estimation, to select an optimal Bayesian estimate of θ one needs to invoke additional information from decision theory, like a loss (or utility) function L(θ̂, θ). Using different loss functions gives rise to different choices for the Bayes estimate:

(i) The Bayes estimate that minimizes (∀θ∈Θ) the expected zero-one loss function L0-1(θ̃, θ) = 0 if θ̃ = θ, 1 if θ̃ ≠ θ, is the mode of π(θ|x0).

(ii) The Bayes estimate θ̃ that minimizes (∀θ∈Θ) the expected absolute loss function L1(θ̃; θ) = |θ̃ − θ| is the median of π(θ|x0).

(iii) The Bayes estimate θ̂ that minimizes (∀θ∈Θ) the expected square loss function L2(θ̂; θ) = (θ̂ − θ)² is the mean of π(θ|x0).

However, as O'Hagan (1994), p. 15, explains, the notion of optimality is not primary:

"The posterior mean can therefore be seen as an estimate of θ which is best in the sense of minimizing expected square error. This is distinct from, but clearly related to, its more natural role as a useful summary of location of the posterior distribution."

What is a good (optimal) estimate in Bayesian inference? The one that minimizes a particular expected loss function. Hence, in contrast to frequentist estimation, where optimality is assessed by the estimator's capacity to pin-point θ*, the true value of θ, an optimal Bayesian estimate has nothing to do with that; it just minimizes a particular loss function for all possible values of θ.

The most widely used loss function is the square:

L(θ̂, θ) = (θ̂(X) − θ)², ∀θ∈Θ,

whose expected value (the risk function) is known as the Mean Square Error (MSE):

Bayesian:  MSE(θ̂(X); θ) = E(θ̂(X) − θ)², ∀θ∈Θ.   (34)

In relation to the expected loss function, a key property for an estimator θ̂(X) is considered to be admissibility.

Admissibility. An estimator θ̃(X) is inadmissible with respect to a risk function R(θ, θ̂) if there exists another estimator θ̂(X) such that:

R(θ, θ̂) ≤ R(θ, θ̃), ∀θ∈Θ,   (35)

and the strict inequality (<) holds for at least one value of θ. Otherwise, θ̃(X) is said to be admissible with respect to the loss function L(θ̂, θ).


Warning: the Bayesian definition of the MSE in (34), based on the universal quantifier 'for all θ∈Θ', is at odds with the frequentist definition of the MSE, which is defined at a point θ = θ* [the true value of θ]:

Frequentist:  MSE(θ̂(X); θ*) = E(θ̂(X) − θ*)².   (36)

The factual nature of frequentist reasoning in estimation also brings out the impertinence of the notion of admissibility, stemming from its reliance on the quantifier 'for all θ∈Θ'. To see that more clearly, let us consider the following example.

Example: is X̄_n no better than a crystal-ball estimator? For the simple Normal model:

X_k ∼ NIID(θ, σ²),  k = 1, 2, ..., n, ...,   (37)

consider a MSE comparison between two estimators of θ:

(i) the Maximum Likelihood Estimator (MLE): X̄_n = (1/n) Σ_{k=1}^n X_k,
(ii) the 'crystal-ball' estimator: θ̆(x) = 7405926 for all x∈R^n_X.

It turns out that both estimators are admissible and thus equally acceptable on admissibility grounds. This surprising result stems primarily from the quantifier 'for all θ∈Θ'. Indeed, for certain values of θ close to θ̆, say θ∈(θ̆ ± c(σ/√n)) for 0 < c < 1, θ̆ is 'better' than X̄_n since:

MSE(θ̆; θ) = (θ̆ − θ)²  ≤  σ²/n = MSE(X̄_n; θ),  for θ∈(θ̆ ± c(σ/√n)).

Common sense suggests that if a certain criterion of optimality cannot distinguish between X̄_n [a strongly consistent, unbiased, fully efficient and sufficient estimator] and an arbitrarily chosen real number that ignores the data altogether, it is practically useless for distinguishing between 'good' and 'bad' estimators in frequentist statistics, or for being viewed as a minimal property of estimators. Moreover, it is obvious that the source of the problem is the quantifier ∀θ∈Θ. In contrast to admissibility, the property of consistency instantly eliminates the crystal-ball estimator θ̆.
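To see the point numerically (an illustration added here, not in the original notes; Python with numpy, with an arbitrary constant and a 'true' θ placed near it purely to exhibit the effect):

import numpy as np

# MSE comparison at a single point theta*: X_bar_n vs. a constant ('crystal-ball') estimator.
rng = np.random.default_rng(seed=2)
crystal_ball = 7.4                        # an arbitrary constant estimator that ignores the data
theta_star, sigma, n, reps = 7.41, 1.0, 100, 50_000
x = rng.normal(theta_star, sigma, size=(reps, n))
mse_xbar = np.mean((x.mean(axis=1) - theta_star) ** 2)   # ≈ sigma^2/n = .01
mse_crystal = (crystal_ball - theta_star) ** 2           # = .0001 here, but explodes once theta* moves away
print(mse_xbar, mse_crystal)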

4.4 Bayesian Credible Intervals (BCIs)

A Bayesian (1−α) credible interval for θ is constructed by ensuring that the area between a and b is equal to (1−α):

π(a ≤ θ ≤ b) = ∫_a^b π(θ|x0)dθ = 1 − α.

In practice one can define an infinity of (1−α) credible intervals using the same posterior π(θ|x0). To avoid this indeterminacy one needs to impose additional restrictions, like the interval with the shortest length, or one with equal tails:

∫_a^1 π(θ|x0)dθ = 1 − α/2,   ∫_b^1 π(θ|x0)dθ = α/2;

see Robert (2007).

Example. For the simple Bernoulli model, the end points of an equal-tail credible interval can be evaluated using the F tables and the fact that:

θ ∼ Beta(α*, β*)  ⇒  (β*θ)/(α*(1−θ)) ∼ F(2α*, 2β*).

Denoting the (upper) α/2 and (1−α/2) percentiles of the F(2α*, 2β*) distribution by f(α/2) and f(1−α/2) respectively, the Bayesian (1−α) credible interval for θ is:

[ (1 + β*/(α*·f(1−α/2)))^(−1)  ≤  θ  ≤  (1 + β*/(α*·f(α/2)))^(−1) ].

For π(θ) ∼ Beta(.5, .5): (a) y = n x̄ = 2, n = 20, α = .05,
α* = n x̄ + α = 2.5, β* = n(1−x̄) + β = 18.5, f(1−α/2) = .163, f(α/2) = 2.93:

( 1 + 18.5/(2.5(.163)) )^(−1) ≤ θ ≤ ( 1 + 18.5/(2.5(2.93)) )^(−1) = (.0216 ≤ θ ≤ .284).   (38)
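In practice the F tables can be bypassed by taking the Beta posterior quantiles directly — an illustration added here, assuming Python with scipy; it reproduces (38) and the interval of example (b) below:

from scipy import stats

# Equal-tail (1-alpha) credible interval for theta|x0 ~ Beta(a*, b*): the alpha/2 and 1-alpha/2 quantiles.
def beta_credible_interval(a_star, b_star, alpha=0.05):
    return stats.beta.ppf([alpha / 2, 1 - alpha / 2], a_star, b_star)

print(beta_credible_interval(2.5, 18.5))   # ≈ (.0216, .284), case (a) in (38)
print(beta_credible_interval(18.5, 2.5))   # ≈ (.716, .979), case (b) below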


(b) y = n x̄ = 18, n = 20, α = .05,
α* = n x̄ + α = 18.5, β* = n(1−x̄) + β = 2.5, θ̂_B = 18.5/21 = .881, f(1−α/2) = .341, f(α/2) = 6.188:

( 1 + 2.5/(18.5(.341)) )^(−1) ≤ θ ≤ ( 1 + 2.5/(18.5(6.188)) )^(−1) = (.716 ≤ θ ≤ .979).

How does one interpret a (1−α) BCI such as (38)? It provides a summary of the posterior distribution π(θ|x0) by demarcating a and b that define the interval of length b−a with the highest probability, i.e. it includes (1−α)% of the highest ranked values of θ∈[0, 1].

4.4.1 Confidence vs. Credible Intervals

Example. For the simple (one-parameter; σ² is known) Normal model, the sampling distribution of X̄_n = (1/n) Σ_{k=1}^n X_k and the posterior distribution of μ derived on the basis of an improper uniform prior [π(μ) = 1, ∀μ∈R] are:

X̄_n ∼ N(μ*, σ²/n) under μ = μ*, x∈R^n_X;   π(μ|x0) ∼ N(x̄_n, σ²/n), μ∈R.   (39)

The two distributions can be used, respectively, to construct (1−α) Confidence and Credible Intervals:

P( X̄_n − c_{α/2}(σ/√n) ≤ μ ≤ X̄_n + c_{α/2}(σ/√n); μ = μ* ) = 1 − α,   (40)

π( x̄_n − c_{α/2}(σ/√n) ≤ μ ≤ x̄_n + c_{α/2}(σ/√n) | x0 ) = 1 − α.   (41)

The two intervals might appear the same, but they are drastically different.

First, in (40) the random variable is X̄_n and its sampling distribution f(x̄_n; μ) is defined over x∈R^n_X, but in (41) the random variable is μ and its posterior π(μ|x0) is defined over μ∈R.

Second, the reasoning underlying (40) is factual, but that of (41) is all-inclusive (∀μ∈R).


Hence, the (1−α) Confidence Interval (40) provides the shortest random upper bound U(X) = X̄_n + c_{α/2}(σ/√n) and lower bound L(X) = X̄_n − c_{α/2}(σ/√n) that cover μ*, the true value of μ, with probability (1−α). In contrast, the (1−α) Credible Interval (41) provides the shortest interval of length 2c_{α/2}(σ/√n) such that (1−α)% of the highest ranked (by π(μ|x0)) values of μ lie within it.

The above comparison suggests that a Bayesian (1−α) Credible Interval has nothing to do with the "true" μ!

4.5 Bayesian Testing

Bayesian testing of hypotheses is not as easy to handle using the posterior distribution, because of the technical difficulty in attaching probabilities to particular values of θ when the parameter space Θ is uncountable, since:

π(θ_i) = 0,  π(θ_i|x0) = 0,  ∀θ_i∈Θ.

In their attempt to deflect attention away from this technical difficulty, Bayesians often criticize the use of a point hypothesis θ = θ0 in frequentist testing as nonsensical because it can never be exactly true! This is a misplaced argument because the notion of exactly true has no place in statistics.

4.5.1 Point null and alternative hypotheses

There have been several attempts to address the difficulty with point hypotheses, but no agreement seems to have emerged; see Robert (2007). Let us consider one such attempt for testing the hypotheses:

H0: θ = θ0  vs.  H1: θ = θ1.

Like all Bayesian inferences, the basis is the posterior distribution. Hence, an obvious way to assess the respective degrees of belief is the posterior odds:

π(θ0|x0)/π(θ1|x0) = [f(x0|θ0)·π(θ0)] / [f(x0|θ1)·π(θ1)] = (π(θ0)/π(θ1)) · (L(θ0|x0)/L(θ1|x0)),   (42)

where the factor π(θ0)/π(θ1) represents the prior odds, and L(θ0|x0)/L(θ1|x0) the likelihood ratio. In light of the fact that the technical problem stems from the prior π(θ) assigning probabilities to particular values of θ, an obvious way to sidestep the problem is to cancel the prior odds factor, by using the ratio of the posterior to the prior odds to define the Bayes Factor (BF):

BF(θ0, θ1|x0) = [π(θ0|x0)/π(θ1|x0)] / [π(θ0)/π(θ1)] = L(θ0|x0)/L(θ1|x0).   (43)

This addresses the technical problem because the likelihood function is definable for particular values of θ.

For this reason Bayesian testing is often based on the BF, combined with certain rules of thumb concerning the strength of the degree of belief against H0 as it relates to the magnitude of BF(x0; θ0) (Robert, 2007):

- 1 ≤ BF(x0; θ0) ≤ 3.2: the degree of belief against H0 is poor,
- 3.2 < BF(x0; θ0) ≤ 10: the degree of belief against H0 is substantial,
- 10 < BF(x0; θ0) ≤ 100: the degree of belief against H0 is strong, and
- BF(x0; θ0) > 100: the degree of belief against H0 is decisive.

These rules of thumb, going from the BF to evidence for or against the null, have been questioned as largely ad hoc; see Kass and Raftery (1995).

The Likelihoodist approach. It is important to note that the Law of Likelihood, defining the likelihood ratio:

LR(θ0, θ1|x0) = L(θ0|x0)/L(θ1|x0),

provides the basis of the Likelihoodist approach to testing, but applies only to tests of point vs. point hypotheses.

4.5.2 Composite hypotheses

A. In the context of the simple Bernoulli model consider the hypotheses:

H0: θ ≤ θ0  vs.  H1: θ > θ0,  θ0 = .5,

assuming a Jeffreys invariant prior θ ∼ Beta(.5, .5) and data y = n x̄ = 12, n = 20.

An obvious way to evaluate the posterior odds for these two interval hypotheses is as follows:

π(θ ≤ θ0|x0) = [Γ(21)/(Γ(12.5)Γ(8.5))] ∫_0^.5 θ^11.5 (1−θ)^7.5 dθ = .186,

π(θ > θ0|x0) = 1 − π(θ ≤ θ0|x0) = .814.

One can then employ the posterior odds criterion:

π(θ ≤ θ0|x0)/π(θ > θ0|x0) = .186/.814 = .229,

which indicates that the degree of belief against H0 is poor.

which indicates that the degree of belief against 0 is poor.B. (a) In the context of the simple Normal model (2

known) consider the hypotheses:

0: ≤ 10 vs. 1: 10

assuming a Normal prior vN(8 1), =2 =20 =116As shown in section 4.2, the posterior is:

(|x0) v N³21(

020+

2),

202

(20+2)

´,

yielding the result: (|x0) v N(11022, 167),21=(

1

1+ 20

4)−1=167 1=167(

8

1+

20(116)

4)=11022

The posterior odds criterion yields:

(≤10|x0)(10|x0)=

R 10−∞

1√2(167)

exp(− 12(167)

(−11022)2)R∞10

1√2(167)

exp(− 12(167)

(−11022)2) =0062

994=0062
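Numerically (an illustration added here, assuming Python with scipy), the posterior odds follow from the Normal posterior's CDF at μ0 = 10:

from scipy import stats

# Posterior mu|x0 ~ N(mu1, sigma1^2) with mu1 = 11.022, sigma1^2 = .167 (section 4.2 example).
mu1, sigma1_sq = 11.022, 0.167
p_null = stats.norm.cdf(10, loc=mu1, scale=sigma1_sq ** 0.5)   # P(mu <= 10 | x0) ≈ .0062
print(p_null, p_null / (1 - p_null))                           # posterior odds ≈ .0062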


This result should be contrasted with that of a N-P test, yielding:

τ(x0) = √20(11.6 − 10)/2 = 3.578, with p(x0) = .0002,

which rejects H0 at any conventional significance level.

What is the intuitive explanation behind these contradictory results between a Bayesian and a frequentist test?

The frequentist test will reject H0 if τ(X) = √n(X̄_n − μ0)/σ > c_α, i.e. if X̄_n > μ0 + c_α(σ/√n). The posterior odds will "reject" H0 if x̄_n > μ0 + σ²(μ0 − 8)/(nσ0²).

This indicates that in the case where the null value μ0 is chosen equal to the mean of the prior distribution (μ0 = 8), i.e. μ ∼ N(μ0, σ0²), the posterior odds will "reject" H0 if x̄_n > μ0, which ignores the sample size n beyond its influence on x̄_n. On the other hand, when σ² = σ0², the rejection threshold μ0 + (μ0 − 8)/n decreases with n, in contrast to c_α(σ/√n) for the frequentist test.

(b) Let us change the prior to be an improper prior:

π(μ) = 1, μ∈(−∞, ∞).

As shown in section 4.2, in this case the posterior is:

π(μ|x0) ∼ N(x̄_n, σ²/n),  −∞ < μ < ∞,   (44)

and thus the tail areas of this posterior would seem to match perfectly those of the sampling distribution of X̄_n:

X̄_n ∼ N(μ, σ²/n),  x∈R^n_X,   (45)

but as argued in that section, this is an illusion; one is comparing apples and eggs!


4.5.3 Point null but composite alternative hypothesis

Pretending that point hypotheses are small intervals. A 'pragmatic' way to handle point hypotheses in Bayesian inference is to sidestep the technical difficulty in handling hypotheses of the form:

H0: θ = θ0  vs.  H1: θ ≠ θ0,   (46)

by pretending that H0 is actually H0: θ∈Θ0:=(θ0−ε, θ0+ε) and attaching a spiked prior of the form:

π(θ = θ0) = ν0,   ∫_0^1 π(θ ≠ θ0)dθ = 1 − ν0,   (47)

i.e. attach a prior of ν0 to θ = θ0, and then distribute the rest, 1 − ν0, over all the other values of θ; see Berger (1985).

Example: the large-n problem. In the context of the simple Bernoulli model, consider the case where θ0 = .2,

x̄_n = 106298/527135 = .20165233,  n = 527135.

[i] Let us apply the Bayes factor procedure to the hypotheses (46) using a spiked prior (47) with ν0 = .5. Since the ratio π(θ0)/π(θ ≠ θ0) cancels out, the posterior odds reduce to the Bayes factor:

BF(x0; θ0) = L(θ0; x0) / ∫_0^1 L(θ; x0)dθ
           = [ C(527135, 106298)(.2)^106298(1−.2)^(527135−106298) ] / [ ∫_0^1 C(527135, 106298) θ^106298 (1−θ)^(527135−106298) dθ ]
           = .000015394/.000001897 = 8.115.   (48)

The result BF(x0; θ0) = 8.115 indicates that data x0 favor the null (θ0 = .2) against all other values of θ substantially. It turns out, however, that this result is highly vulnerable to the fallacy of acceptance. For certain values θ‡ in a subset of Θ1 = [0, 1]−{.2}, in particular:

θ‡ ∈ (.2, .20331],   (49)

the values θ‡ are favored by BF(x0; θ‡) more strongly than θ0 = .2, contradicting the original result. Indeed, as pointed out by Mayo (1996), p. 200, there is always the maximally likely alternative, θ̈ = θ̂(x0) = .20165233, for which the Bayes factor favors θ̈ more than any other value of θ. In particular, the Bayes factor favors θ̈ 89 times more strongly than θ0 = .2!

(x0)=√527135(106298

527135−2)√

2(1−2)=2999 (50)

which rejects 0. The p-value (x0)=0027 suggesting thatdata x0 indicate ‘some’ discrepancy between 0 and the ‘true’ (that gave rise to x0), but provides no information aboutits magnitude. The post-data severity evaluation based on:

(; 1) = P(x: (X) ≤ (x0); 1 is false)(51)

indicates that, for a severity threshold of say 9, the claim forwhich data x0 provide evidence for is:

20095⇒ ∗ ≤ 00095

In this sense, the frequentist approach can address the fallacyof acceptance by evaluating the warranted discrepancy fromthe null for the particular data and test.


4.6 Where do prior distributions come from?

4.6.1 Conjugate prior and posterior distributions

This is the case where the prior π(θ) and the posterior:

π(θ|x0) ∝ π(θ)·L(θ; x0), ∀θ∈Θ,

belong to the same family of distributions, i.e. L(θ; x0) is family preserving.

Example. For the simple Bernoulli model:

π(θ) ∼ Beta(α, β),  L(θ; x0) ∝ θ^(n x̄)(1−θ)^(n(1−x̄))  ⟶  π(θ|x0) ∼ Beta(α*, β*).

Table 1 presents some examples of conjugate pairs of prior and posterior distributions, as they combine with different likelihood forms.

Conjugate pairs make mathematical sense, but do they make 'modeling' sense? The various justifications in the Bayesian literature vary from 'these help the objectivity of inference' to 'they enhance the allure of the Bayesian approach as a black box', and these claims are often contradictory!

Table 1 - Conjugate pairs (π(θ), π(θ|x0))
Likelihood                 π(θ)
Binomial (Bernoulli)       Beta(α, β)
Negative Binomial          Beta(α, β)
Poisson                    Gamma(α, β)
Exponential                Gamma(α, β)
Gamma                      Gamma(α, β)
Uniform                    Pareto(α, β)
Normal for θ = μ           N(μ0, σ0²), μ0∈R, σ0² > 0
Normal for θ = σ²          Inverse Gamma(α, β)


4.6.2 Jeffreys’s prior

Fisher (1921) criticized the notion of expressing prior ignorance using a uniform prior:

θ ∼ U(0, 1), ∀θ∈Θ,

to quantify a state of ignorance about the unknown parameter θ. Fisher's criticism was that such a prior is non-invariant to reparameterizations: one is ignorant about θ but very informed about ψ = h(θ).

[Fig. 10: Uniform prior — Beta(α=1, β=1)]
[Fig. 11: The Logistic prior — Logistic distribution, Loc=0, Scale=1]

Example. If θ is uniformly distributed, then the logit transformation:

ψ = ln(θ/(1−θ))

gives rise to a very informative prior for ψ:

f(ψ) = e^ψ/(1+e^ψ)²,  −∞ < ψ < ∞.

This distribution attaches much higher probability to the values of ψ around zero and very low probability to the values in the tails.

In response to Fisher's second criticism, Jeffreys (1939) proposed a new class of priors which are invariant to reparameterizations. This family of invariant priors is based on Fisher's average information:

I_∞(θ; x) = E_x( (1/n)[∂ ln L(θ; x)/∂θ]² ) = ∫···∫_{x∈R^n_X} (1/n)(∂ ln L(θ; x)/∂θ)² f(x; θ)dx.   (52)


Note that the above derivation involves some hand-waving inthe sense that if the likelihood function (;x0) is viewed,like the Bayesians do, as only a function of the data x0, thentaking expectations outside the brackets makes no sense; theexpectation is with respect to the distribution of the sample(x;) for all possible values of x∈R

. As we can see, thederivation of (;x) runs afoul to the likelihood principlesince all possible values of the sampleX, not just the observeddata x0, are taken into account. Note that in the case of arandom (IID) sample, the Fisher information (;x) for thesample X:=(1 2 ) is related to the above averageinformation via: (;x) = (;x)In the case of a single parameter, Jeffreys invariant

prior takes the form:

() ∝p(;x) (53)

i.e. the likelihood function determines the prior distribution.
The simple Bernoulli model. In view of the fact that the log-likelihood takes the form:

ln L(θ; x) = nx̄ ln(θ) + n(1−x̄) ln(1−θ),

∂ ln L(θ; x)/∂θ = nx̄/θ − n(1−x̄)/(1−θ),

∂² ln L(θ; x)/∂θ² = −nx̄/θ² − n(1−x̄)/(1−θ)².

From the second derivative, it follows that:

E{ (1/n)[∂ ln L(θ; x)/∂θ]² } = E{ −(1/n) ∂² ln L(θ; x)/∂θ² } = 1/(θ(1−θ)).   (54)

This follows directly from E(X̄ₙ) = θ, since:

E{ −(1/n) ∂² ln L(θ; x)/∂θ² } = θ/θ² + (1−θ)/(1−θ)² = 1/θ + 1/(1−θ) = 1/(θ(1−θ)).   (55)

From the definition of Jeffreys' invariant prior we can deduce that for θ:

π(θ) ∝ √I(θ; x) = √(1/(θ(1−θ))) = θ^(−1/2)(1−θ)^(−1/2), 0 < θ < 1,   (56)


which is an 'unnormalized' Beta(1/2, 1/2) distribution; it needs the scaling constant 1/B(1/2, 1/2). Note that Jeffreys' prior (56) is also the reference prior for a one-parameter statistical model; see Bernardo and Smith (1994).
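As a rough numerical check of (54)-(56), the sketch below (with an arbitrary sample size, θ-grid and simulation size) estimates the average Fisher information of the Bernoulli model by Monte Carlo and compares Jeffreys' prior with the Beta(1/2, 1/2) density:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
n = 50                                   # illustrative sample size

for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    x = rng.binomial(1, theta, size=(100_000, n))             # IID Bernoulli samples
    xbar = x.mean(axis=1)
    score = n * xbar / theta - n * (1 - xbar) / (1 - theta)   # d ln L / d theta
    avg_info = np.mean(score**2) / n                          # Monte Carlo I(theta)
    print(f"theta={theta:.1f}: MC {avg_info:6.2f} vs exact {1/(theta*(1-theta)):6.2f}")

# Jeffreys' prior pi(theta) ∝ theta^(-1/2)(1-theta)^(-1/2) is the Beta(1/2, 1/2)
# density once scaled by 1/B(1/2, 1/2) = 1/pi:
t = 0.3
print(beta.pdf(t, 0.5, 0.5), (t**-0.5 * (1 - t)**-0.5) / np.pi)   # identical values
```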

5 Bayesian charges against frequentist inference

[1] Bayesian inference is fundamentally sound because it can be given an axiomatic foundation based on coherent (rational) decision making, but frequentist inference suffers from several philosophical flaws.
Remark: what does an axiomatic foundation have to do with inductive inference?
[2] Frequentist inference is not very flexible and has a restrictive range of applicability.
According to Koop, Poirier and Tobias (2007):
"Non-Bayesians, who we hereafter refer to as frequentists,

argue that situations not admitting repetition under essentially identical conditions are not within the realm of statistical enquiry, and hence 'probability' should not be used in such situations. Frequentists define the probability of an event as its long-run relative frequency. ... that definition is nonoperational since only a finite number of trials can ever be conducted." (p. 2)
Remark: where have these guys been for the last 80 years? Is inference with time series data beyond the intended scope of frequentist statistics?
[3] Bayesian inference produces clear and direct inferences, in contrast to frequentist inference producing unclear and indirect inferences, e.g. credible intervals vs. confidence intervals.
“... the applied researcher would really like to be able to place

a degree of belief on the hypothesis.” (Press, 2003, p. 220)


Remark: if the primary aim of statistical inference is to learn from data x₀ about the 'true' underlying data-generating mechanism M*(x)={f(x; θ*)}, x∈R^n_X, what does a probability, say π(θ=θ₀|x₀) = 0.7, suggest about θ*?
[4] Bayesian inference makes use of all the available a priori information, but frequentist inference does not.
Remark: there is a crucial difference between substantive prior information and information in the form of a prior distribution. Frequentist inference is tailor-made to accommodate prior information in the form of restrictions on the statistical parameters suggested by substantive theories, say G(θ, ϕ)=0, where θ and ϕ denote the statistical and substantive parameters of interest. Indeed, substantive prior information in most scientific fields does not come in the form of a prior distribution π(θ), ∀θ∈Θ.
[5] A number of counter-examples, introduced by Bayesians, show that frequentist inference is fundamentally flawed.
Remark: the problem does not lie with frequentist testing, but with the statistical models introduced. The underlying statistical models are shown to be rigged; see Spanos (2010; 2011; 2012; 2013a-d).
[6] The subjectivity charge against Bayesians is misplaced because:
“All statistical methods that use probability are subjective in the sense of relying on mathematical idealizations of the world. Bayesian methods are sometimes said to be especially subjective because of their reliance on a prior distribution, but in most problems, scientific judgement is necessary to specify both the 'likelihood' and the 'prior' parts of the model.” (Gelman, et al. (2004), p. 14)
“... likelihoods are just as subjective as priors.” (Kadane, 2011,


p. 445)
Remark: I know how to test the adequacy of the probabilistic assumptions defining the likelihood (they are the model assumptions) vis-a-vis data x₀, but how do I test the adequacy of the assumptions defining a prior π(θ), ∀θ∈Θ?
[7] For inference purposes, the only relevant point in the sample space R^n_X is just the data x₀, as summarized by the likelihood function L(θ|x₀), θ∈Θ.
Remark: as mentioned in section 4.1, the interpretation of the posterior distribution as proportional to the conditional distribution of X given θ times the prior π(θ) is at odds with the Likelihood Principle; see Spanos (2014). Also, Mayo (2013) has shown that Birnbaum's (1962) 'proof' of the LP is erroneous.
[8] An effective way to generate frequentist optimal statistical procedures is to find the Bayes solution using a reasonable prior and then examine its frequentist properties to see whether it is satisfactory from the latter viewpoint; see Rubin (1984), Gelman et al. (2004).
Remark: this is based on assuming that admissibility (as defined by Bayesians) is a desirable minimal property; it is not! Indeed, expected losses do not constitute legitimate frequentist error probabilities. The minimal property for frequentist inference is consistency, not expected loss relative efficiency.

The above [1]-[8] comments, criticisms and charges leveled against frequentist inference are largely misplaced and stem mostly from insufficient understanding or just plain ignorance on the part of the critics.
Further arguments on how the above criticisms can be countered will be part of the class discussion.


6 Appendix A: the N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:

Mθ(x) = {f(x; θ), θ∈Θ:={θ₀, θ₁}}, x∈R^n_X,   (57)

and consider the problem of testing the simple hypotheses:

H₀: θ = θ₀  vs.  H₁: θ = θ₁.   (58)

¥ The fact that the assumed parameter space is Θ:={θ₀, θ₁} and that (58) constitutes a partition of it is often left out of most statistics textbook discussions of this famous lemma!
Existence. There exists an α-significance level Uniformly Most Powerful (UMP) [α-UMP] test based on:

d(X) = h(f(x; θ₁)/f(x; θ₀)), C₁(α) = {x: d(x) > cα},   (59)

where h(·) is a monotone function.
Sufficiency. If an α-level test of the form (59) exists, then it is UMP for testing (58).
Necessity. If {d(X), C₁(α)} is an α-UMP test, then it will be given by (59).
At first sight the N-P lemma seems rather contrived because it is an existence result for a simple statistical model Mθ(x) whose parameter space Θ:={θ₀, θ₁} is artificial, but it fits perfectly into the archetypal formulation. To operationalize the existence result one would need to do two things:

(1) Find a transformation h(·) that, when applied to f(x; θ₁)/f(x; θ₀), yields a meaningful test statistic d(X).
(2) Derive the distribution of d(X) under both H₀ and H₁.


Warning: this lemma is often misconstrued as suggesting that for an α-UMP test to exist one needs to confine testing to simple-vs-simple cases even when Θ is uncountable; nonsense!
¥ The construction of an α-UMP test in realistic cases has nothing to do with simple-vs-simple hypotheses. Instead,
(i) it should be based on the archetypal N-P testing formulation based on partitioning Θ, and
(ii) it should rely on monotone likelihood ratios and other features of the prespecified statistical model Mθ(x).
Example. To illustrate these issues consider the simple-

vs-simple hypotheses:

(i) H₀: μ = μ₀  vs.  H₁: μ = μ₁   (60)

in the context of a simple Normal (one parameter) model:

Xₜ ∼ NIID(μ, σ²), t = 1, 2, ..., n.   (61)

In this case, the N-P lemma does not apply because the two values (μ₀, μ₁) do not constitute a partition of the parameter space Θ = R.
Applying the N-P lemma requires setting up the ratio:

f(x; μ₁)/f(x; μ₀) = exp{ (n/σ²)(μ₁−μ₀)X̄ₙ − (n/(2σ²))(μ₁²−μ₀²) },   (62)

which is clearly not a test statistic as it stands. However, there exists a monotone function h(·) which transforms (62) into a familiar test statistic (Spanos, 1999, pp. 708-9):

d(X) = h(f(x; μ₁)/f(x; μ₀)) = [σ/(√n(μ₁−μ₀))]·ln(f(x; μ₁)/f(x; μ₀)) + √n(μ₁−μ₀)/(2σ) = √n(X̄ₙ−μ₀)/σ.

A UMP test can be derived when d(X) = √n(X̄ₙ−μ₀)/σ is combined with information relating to the framing of the hypotheses.
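A small numerical sketch of this transformation; the values of μ0, μ1, σ and n below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, mu0, mu1 = 25, 2.0, 0.0, 1.0                 # illustrative values only
xbar = rng.normal(mu0, sigma / np.sqrt(n), size=5)     # a few hypothetical sample means

# Log of the likelihood ratio (62); it depends on the data only through xbar.
log_ratio = (n / sigma**2) * (mu1 - mu0) * xbar - (n / (2 * sigma**2)) * (mu1**2 - mu0**2)

# The monotone transformation h(.) turns it into the familiar z-type statistic.
d_from_ratio = (sigma / (np.sqrt(n) * (mu1 - mu0))) * log_ratio + np.sqrt(n) * (mu1 - mu0) / (2 * sigma)
d_direct = np.sqrt(n) * (xbar - mu0) / sigma

print(np.allclose(d_from_ratio, d_direct))             # True: the two coincide
```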


Conditions that give rise to UMP tests

[1] Point null vs. one-sided alternative. In the case of the simple Normal model, each pair of hypotheses:

(i) H₀: μ ≤ μ₀ vs. H₁: μ > μ₀, and H₀: μ = μ₀ vs. H₁: μ > μ₀,

(ii) H₀: μ ≥ μ₀ vs. H₁: μ < μ₀, and H₀: μ = μ₀ vs. H₁: μ < μ₀,

gives rise to the same UMP test.
The existence of these α-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:
[2] Monotone likelihood ratio. The ratio (62) is a monotone function of the statistic X̄ₙ, in the sense that for any two values μ₁ > μ₀, f(x; μ₁)/f(x; μ₀) changes monotonically with X̄ₙ. This implies that f(x; μ₁)/f(x; μ₀) > k if and only if X̄ₙ > c₀.

This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Student's t, Pareto, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.].
[3] Convex alternative. The parameter space under H₁, say Θ₁, is convex [contiguous], i.e. for any two values (θ₁, θ₂)∈Θ₁, their convex combinations λθ₁ + (1−λ)θ₂ ∈ Θ₁ for any 0 ≤ λ ≤ 1.
When convexity does not hold, as with the 2-sided alternative:

(vi) (2-s): H₀: μ = μ₀ vs. H₁: μ ≠ μ₀,

the test T:={d(X), C₁(α)}, C₁(α) = {x: |d(x)| > c_{α/2}},

is α-UMPU (Unbiased); the α-level and p-value are:

α = P(|d(X)| > c_{α/2}; μ = μ₀), p(x₀) = P(|d(X)| > |d(x₀)|; μ = μ₀).
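A minimal sketch of the two-sided rejection rule and p-value for hypothetical data (all numbers below are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-sided z-test in the simple Normal model with sigma known.
mu0, sigma, n = 0.0, 1.0, 36
x = np.random.default_rng(3).normal(0.3, sigma, size=n)   # hypothetical data

d_x0 = np.sqrt(n) * (x.mean() - mu0) / sigma
c = norm.ppf(1 - 0.05 / 2)                   # two-sided threshold c_{alpha/2} at alpha = 0.05
p_value = 2 * norm.sf(abs(d_x0))             # P(|d(X)| > |d(x0)|; mu = mu0)

print(f"d(x0) = {d_x0:.3f}, reject at 5%: {abs(d_x0) > c}, p-value = {p_value:.4f}")
```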


7 Appendix B: Examples based on Jeffreys prior

For the simple Bernoulli model, consider selecting Jeffreys' invariant prior:

π(θ) = (1/B(0.5, 0.5)) θ^(−0.5)(1−θ)^(−0.5), θ∈[0, 1].

This gives rise to a posterior distribution of the form:

π(θ|x₀) ∼ Beta(nx̄ + 0.5, n(1−x̄) + 0.5), θ∈[0, 1].

¥ (a) For nx̄ = 2, n = 20, the likelihood function is: L(θ; x₀) ∝ θ²(1−θ)^18, θ∈[0, 1], and the posterior density is: π(θ|x₀) ∼ Beta(2.5, 18.5), θ∈[0, 1].
The Bayesian point estimates are:
θ̃ = 1.5/19 = 0.0789 (posterior mode), θ̂ = 2.5/21 = 0.119 (posterior mean).
A 0.95 credible interval for θ is: P(0.0214 ≤ θ ≤ 0.284) = 0.95, since
(1/B(2.5, 18.5)) ∫_{0.0214}^{1} θ^{1.5}(1−θ)^{17.5} dθ = 0.975,  (1/B(2.5, 18.5)) ∫_{0.284}^{1} θ^{1.5}(1−θ)^{17.5} dθ = 0.025.

¥ (b) For nx̄ = 18, n = 20, the likelihood function is: L(θ; x₀) ∝ θ^18(1−θ)², θ∈[0, 1], and the posterior density is: π(θ|x₀) ∼ Beta(18.5, 2.5), θ∈[0, 1].
The Bayesian point estimates are:
θ̃ = 17.5/19 = 0.921 (posterior mode), θ̂ = 18.5/21 = 0.881 (posterior mean).
A 0.95 credible interval for θ is: P(0.716 ≤ θ ≤ 0.9786) = 0.95, since
(1/B(18.5, 2.5)) ∫_{0.716}^{1} θ^{17.5}(1−θ)^{1.5} dθ = 0.975,  (1/B(18.5, 2.5)) ∫_{0.9786}^{1} θ^{17.5}(1−θ)^{1.5} dθ = 0.025.

¥ (c) For nx̄ = 72, n = 80, the likelihood function is: L(θ; x₀) ∝ θ^72(1−θ)^8, θ∈[0, 1],


and the posterior density is: π(θ|x₀) ∼ Beta(72.5, 8.5), θ∈[0, 1].
The Bayesian point estimates are:
θ̃ = 71.5/79 = 0.905 (posterior mode), θ̂ = 72.5/81 = 0.895 (posterior mean).
A 0.95 credible interval for θ is: P(0.82 ≤ θ ≤ 0.9515) = 0.95, since
(1/B(72.5, 8.5)) ∫_{0.82}^{1} θ^{71.5}(1−θ)^{7.5} dθ = 0.975,  (1/B(72.5, 8.5)) ∫_{0.9515}^{1} θ^{71.5}(1−θ)^{7.5} dθ = 0.025.

¥ (d) For nx̄ = 40, n = 80, the likelihood function is: L(θ; x₀) ∝ θ^40(1−θ)^40, θ∈[0, 1], and the posterior density is: π(θ|x₀) ∼ Beta(40.5, 40.5), θ∈[0, 1].
The Bayesian point estimates are:
θ̃ = 39.5/79 = 0.5 (posterior mode), θ̂ = 40.5/81 = 0.5 (posterior mean).
A 0.95 credible interval for θ is: P(0.3923 ≤ θ ≤ 0.6077) = 0.95, since
(1/B(40.5, 40.5)) ∫_{0.3923}^{1} θ^{39.5}(1−θ)^{39.5} dθ = 0.975,  (1/B(40.5, 40.5)) ∫_{0.6077}^{1} θ^{39.5}(1−θ)^{39.5} dθ = 0.025.

In view of the symmetry of the posterior distribution, even the asymptotic Normal credible interval should give a good approximation. Given that θ̂ = (nx̄ + α)/(n + α + β) = 0.5, the approximate credible interval is:

P( 0.5 − 1.96·√(0.5(1−0.5))/√80 = 0.390 ≤ θ ≤ 0.610 = 0.5 + 1.96·√(0.5(1−0.5))/√80 ) = 1 − α,

which provides a reasonably good approximation to the exact one.
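The calculations in cases (a)-(d) can be reproduced with a short script; this is merely a sketch using scipy's Beta distribution with the posterior parameters implied by Jeffreys' prior:

```python
from scipy.stats import beta

cases = {"a": (2, 20), "b": (18, 20), "c": (72, 80), "d": (40, 80)}   # (successes, n)

for label, (y, n) in cases.items():
    a_post, b_post = y + 0.5, (n - y) + 0.5          # Jeffreys Beta(1/2, 1/2) prior update
    post = beta(a_post, b_post)
    mode = (a_post - 1) / (a_post + b_post - 2)      # posterior mode
    lo, hi = post.interval(0.95)                     # equal-tailed 0.95 credible interval
    print(f"({label}) Beta({a_post}, {b_post}): mode={mode:.4f}, "
          f"mean={post.mean():.4f}, 0.95 CI=({lo:.4f}, {hi:.4f})")
```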
