Testing Statistical Hypotheses — George Mason University
mason.gmu.edu/~jgentle/csi9723/11s/l04b_11s.pdf
Testing Statistical Hypotheses
In statistical hypothesis testing, the basic problem is to decide
whether or not to reject a statement about the distribution of a
random variable.
The statement must be expressible in terms of membership in a
well-defined class.
The hypothesis can therefore be expressed by the statement that
the distribution of the random variable X is in the class
PH = {Pθ : θ ∈ ΘH}.
An hypothesis of this form is called a statistical hypothesis.
Testing is a statistical decision problem.
Issues
• optimality of tests: most powerful
• Neyman-Pearson Fundamental Lemma: the optimal procedure
for testing one simple hypothesis versus another simple hypothesis
• uniformly optimal
∗ impose restrictions, such as unbiasedness or invariance
· find optimal tests under those restrictions
∗ define uniformity in terms of a global averaging
Issues
• general methods for constructing tests
• asymptotic properties of the likelihood ratio tests
• nonparametric tests
• sequential tests
• multiple tests
Statistical Hypotheses
We are given (or assume) broad family of distributions,
P = {Pθ : θ ∈ Θ}.
As in other problems in statistical inference, the objective is to
decide whether the given observations arose from some subset
of distributions
PH ⊂ P.
The statistical hypothesis is a statement of the form
“the family of distributions is PH”,
where PH ⊂ P,
or perhaps
“θ ∈ ΘH”,
where ΘH ⊂ Θ.
Statistical Hypotheses
The full statement consists of two pieces, one part an assumption,
“assume the distribution of X is in the class P”, and the
other part the hypothesis, “θ ∈ ΘH, where ΘH ⊂ Θ.”
Given the assumptions, and the definition of ΘH, we often denote
the hypothesis as H, and write it as
H : θ ∈ ΘH.
Two Hypotheses
While, in general, to reject the hypothesis H would mean to
decide that θ /∈ ΘH, it is generally more convenient to formulate
the testing problem as one of deciding between two statements:
H0 : θ ∈ Θ0
and
H1 : θ ∈ Θ1,
where Θ0 ∩ Θ1 = ∅.
We do not treat H0 and H1 symmetrically; H0 is the hypothesis
(or “null hypothesis”) to be tested and H1 is the alternative.
This distinction is important in developing a methodology of
testing.
Tests of Hypotheses
To test the hypotheses means to choose one hypothesis or the
other; that is, to make a decision, d.
We have a sample X from the relevant family of distributions
and a statistic T(X).
A nonrandomized test procedure is a rule δ(X) that assigns two
decisions to two disjoint subsets, C0 and C1, of the range of
T(X).
We equate those two decisions with the real numbers 0 and 1,
so δ(X) is a real-valued function,
δ(x) = 0 for T(x) ∈ C0,
       1 for T(x) ∈ C1.
Note for i = 0, 1,
Pr(δ(X) = i) = Pr(T(X) ∈ Ci).
We call C1 the critical region, and generally denote it by just C.
If δ(X) takes the value 0, the decision is not to reject; if δ(X)
takes the value 1, the decision is to reject.
If the range of δ(X) is {0,1}, the test is a nonrandomized test.
Sometimes it is useful to choose the range of δ(X) as some other
set of real numbers, such as {d0, d1}, or even a set with cardinality
greater than 2.
If the range is taken to be the closed interval [0,1], we can
interpret a value of δ(X) as the probability that the null hypothesis
is rejected.
If it is not the case that δ(X) equals 0 or 1 a.s., we call the test
a randomized test.
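As a sketch of these two definitions (in Python; the statistic T, the critical region C1, and the rejection-probability function phi are all hypothetical names, not from the slides), a nonrandomized test returns the decision 0 or 1 directly, while a randomized test rejects with probability φ(x):

```python
import random

def delta_nonrandomized(x, T, C1):
    """Nonrandomized test: reject H0 (return 1) iff T(x) is in the critical region C1."""
    return 1 if T(x) in C1 else 0

def delta_randomized(x, phi):
    """Randomized test: phi(x) in [0,1] is the probability of rejecting H0."""
    return 1 if random.random() < phi(x) else 0
```

When φ(x) is identically 0 or 1, the randomized test reduces to a nonrandomized one.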
Errors in Decisions Made in Testing
There are four possibilities in a test of an hypothesis:
the hypothesis may be true, and the test may or may not reject
it,
or the hypothesis may be false, and the test may or may not
reject it.
The result of a statistical hypothesis test can be incorrect in two
distinct ways: it can reject a true hypothesis or it can fail to
reject a false hypothesis.
We call rejecting a true hypothesis a “type I error”, and failing
to reject a false hypothesis a “type II error”.
Errors
Our standard approach in hypothesis testing is to control the
level of the probability of a type I error under the assumptions,
and to try to find a test subject to that level that has a small
probability of a type II error.
We call the maximum allowable probability of a type I error the
“significance level”, and usually denote it by α.
We call the probability of rejecting the null hypothesis the power
of the test, and will denote it by β.
If the alternative hypothesis is the true state of nature, the power
is one minus the probability of a type II error.
It is clear that we can easily decrease the probability of one type
of error (if its probability is positive) at the cost of increasing
the probability of the other.
Errors
In a common approach to hypothesis testing under the given
assumptions on X, we choose α ∈]0,1[ and require that δ(X) be
such that
Pr(δ(X) = 1 | θ ∈ Θ0) ≤ α,
and, subject to this, find δ(X) so as to minimize
Pr(δ(X) = 0 | θ ∈ Θ1).
Optimality of a test δ is defined in terms of this constrained
optimization problem.
Notice that the restriction on the type I error applies ∀θ ∈ Θ0.
We call
sup_{θ∈Θ0} Pr(δ(X) = 1 | θ)
the size of the test.
Errors
In common applications, Θ0 ∪ Θ1 forms a connected region in IR^k,
Θ0 contains the set of common closure points of Θ0 and
Θ1, and Pr(δ(X) = 1 | θ) is a continuous function of θ; hence the
sup is generally a max.
If the size is less than the level of significance, the test is said
to be conservative, and in that case, we often refer to α as the
“nominal size”.
Example 1 Testing in an exponential distribution
Suppose we have observations X1, . . . , Xn i.i.d. as exponential(θ).
The Lebesgue PDF is
pθ(x) = θ−1e−x/θI]0,∞[(x),
with θ ∈]0,∞[.
Suppose now we wish to test
H0 : θ ≤ θ0 versus H1 : θ > θ0.
We know that X̄ is sufficient for θ.
A reasonable test may be to reject H0 if T(X) = X̄ > c, where c
is some fixed positive constant; that is,
δ(X) = I]c,∞[(T(X)).
Knowing the distribution of X̄ to be exponential(θ/n), we can
now work out
Pr(δ(X) = 1 | θ) = Pr(T(X) > c | θ),
which, for θ ≤ θ0, is the probability of a type I error.
For θ > θ0,
1 − Pr(δ(X) = 1 | θ)
is the probability of a type II error. These probabilities, as a
function of θ, are shown below.
Performance of Test
[Figure: Pr(δ(X) = 1 | θ) as a function of θ, with H0 to the left of θ0 and H1 to the right, showing the regions of type I error, type II error, and correct rejection.]
Now, for a given significance level α, we choose c so that
Pr(T (X) > c | θ ≤ θ0) ≤ α.
This is satisfied for c such that for a random variable Y with an
exponential(θ0/n), Pr(Y > c) = α.
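Under the exponential(θ0/n) distribution stated in the slides for X̄, the tail probability is Pr(Y > c) = exp(−cn/θ0), so c can be solved in closed form. A minimal sketch (the function name and the values θ0 = 1, n = 10, α = 0.05 are illustrative, not from the slides):

```python
import math

def critical_value(theta0, n, alpha):
    """Solve Pr(Y > c) = alpha for Y ~ exponential(theta0/n),
    i.e. exp(-c * n / theta0) = alpha, giving c = -(theta0/n) * log(alpha)."""
    return -(theta0 / n) * math.log(alpha)

# illustrative values: theta0 = 1, n = 10, alpha = 0.05
c = critical_value(1.0, 10, 0.05)
```

By construction, exp(−cn/θ0) then equals α exactly.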
p-Values
Note that there is a difference in choosing the test procedure,
and in using the test.
The question of the choice of α comes back.
Does it make sense to choose α first, and then proceed to apply
the test just to end up with a decision d0 or d1?
It is not likely that this rigid approach would be very useful for
most objectives.
In statistical data analysis our objectives are usually broader than
just deciding which of two hypotheses appears to be true.
On the other hand, if we have a well-developed procedure for
testing the two hypotheses, the decision rule in this procedure
could be very useful in data analysis.
p-Values
One common approach is to use the functional form of the rule,
but not to pre-define the critical region.
Then, given the same setup of null hypothesis and alternative,
to collect data X = x, and to determine the smallest value α(x)
at which the null hypothesis would be rejected.
The value α(x) is called the p-value of x associated with the
hypotheses.
The p-value indicates the strength of the evidence of the data
against the null hypothesis.
Example 2 Testing in an exponential distribution; p-value
Consider again the problem where we had observations X1, . . . , Xn
i.i.d. as exponential(θ), and wished to test
H0 : θ ≤ θ0 versus H1 : θ > θ0.
Our test was based on T(X) = X̄ > c, where c was some fixed
positive constant chosen so that Pr(Y > c) = α, where Y is a
random variable distributed as exponential(θ0/n).
Suppose instead of choosing c, we merely compute Pr(Y > x̄),
where x̄ is the mean of the set of observations.
This is the p-value for the null hypothesis and the given data.
If the p-value is less than a prechosen significance level α, then
the null hypothesis is rejected.
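With the same exponential(θ0/n) tail stated in the slides, this p-value also has a closed form, Pr(Y > x̄) = exp(−x̄ n/θ0). A sketch (function name illustrative):

```python
import math

def p_value(xbar, theta0, n):
    """p-value Pr(Y > xbar) for Y ~ exponential(theta0/n),
    following the distribution stated in the slides for the sample mean."""
    return math.exp(-xbar * n / theta0)
```

Larger observed means give smaller p-values, i.e. stronger evidence against H0: θ ≤ θ0.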
Example 3 Sampling in a Bernoulli distribution; p-values
and the likelihood principle revisited
We have considered the family of Bernoulli distributions that is
formed from the class of the probability measures Pπ({1}) = π
and Pπ({0}) = 1 − π on the measurable space (Ω = {0,1}, F = 2^Ω).
Suppose now we wish to test
H0 : π ≥ 0.5 versus H1 : π < 0.5.
As we indicated before, there are two ways we could set up an
experiment to make inferences on π.
One approach is to take a random sample of size n, X1, . . . , Xn,
from the Bernoulli(π), and then use some function of that sample
as an estimator.
An obvious statistic to use is the number of 1’s in the sample,
that is, T = ∑ Xi.
To assess the performance of an estimator using T, we would
first determine its distribution and then use the properties of that
distribution to decide what would be a good estimator based on T.
A very different approach is to take a sequential sample, X1, X2, . . .,
until a fixed number t of 1’s have occurred.
This yields N, the number of trials until t 1’s have occurred.
The distribution of T is binomial with parameters n and π; its
PDF is
pT(t ; n, π) = C(n, t) π^t (1 − π)^(n−t), t = 0, 1, . . . , n.
The distribution of N is the negative binomial with parameters
t and π, and its PDF is
pN(n ; t, π) = C(n − 1, t − 1) π^t (1 − π)^(n−t), n = t, t + 1, . . . ,
where C(a, b) denotes the binomial coefficient.
Suppose we do this both ways.
We choose n = 12 for the first method and t = 3 for the second
method.
Now, suppose that for the first method, we observe T = 3 and
for the second method, we observe N = 12.
The ratio of the likelihoods does not involve π, so by the
likelihood principle, we should make the same conclusions about
π.
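This is easy to check numerically with stdlib Python (a sketch; the PMF function names are my own): the ratio pT(3; 12, π)/pN(12; 3, π) reduces to C(12,3)/C(11,2) = 4 for every π, since the π^3(1 − π)^9 factors cancel.

```python
from math import comb, isclose

def binom_pmf(t, n, pi):
    """PMF of T, the number of 1's in a fixed sample of size n."""
    return comb(n, t) * pi**t * (1 - pi)**(n - t)

def negbinom_pmf(n, t, pi):
    """PMF of N, the number of trials until the t-th 1 occurs."""
    return comb(n - 1, t - 1) * pi**t * (1 - pi)**(n - t)

# the likelihood ratio at T = 3, N = 12 is the same constant for every pi
ratios = [binom_pmf(3, 12, pi) / negbinom_pmf(12, 3, pi)
          for pi in (0.1, 0.3, 0.5, 0.7)]
```

Every entry of `ratios` is 4 (up to floating-point rounding), which is the content of the likelihood-principle argument above.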
Let us now compute the respective p-values.
For the binomial setup we get p = 0.073 (using the R function
pbinom(3,12,0.5)), but for the negative binomial setup we get
p = 0.033 (using the R function 1-pnbinom(8,3,0.5), in which the
first argument is the number of “failures” before the number of
“successes” specified in the second argument).
The p-values are different, and in fact, if we had decided to
perform the test at the α = 0.05 significance level, in one case
we would reject the null hypothesis and in the other case we
would not.
This illustrates a problem in the likelihood principle; it ignores
the manner in which information is collected.
The problem often arises in this kind of situation, in which we
have either an experiment that gathers information in a manner
completely independent of that information, or an experiment
whose conduct depends on what is observed.
The latter type of experiment is often a type of Markov process
in which there is a stopping time.
Power of a Statistical Test
We call the probability of rejecting H0 the power of the test, and
denote it by β, or for the particular test δ(X), βT .
The power in the case that H1 is true is 1 minus the probability
of a type II error.
The probability of a type II error is generally a function of the
true distribution of the sample Pθ, and hence so is the power,
which we may emphasize by the notation βδ(Pθ) or βδ(θ).
We now can focus on the test under either hypothesis (that is,
under either subset of the family of distributions) in a unified
fashion.
Power of a Statistical Test
We define the power function of the test, for any given P ∈ P as
βδ(P) = EP (δ(X)).
Thus, minimizing the probability of a type II error is equivalent
to maximizing the power within Θ1.
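For the exponential example above, the power function has a closed form under the slides' exponential(θ/n) distribution for X̄: βδ(θ) = Pr(X̄ > c | θ) = exp(−cn/θ). A sketch (function name and the values c = 0.3, n = 10 are illustrative):

```python
import math

def power(theta, c, n):
    """Power function beta_delta(theta) = Pr(Xbar > c | theta) = exp(-c*n/theta),
    following the exponential(theta/n) claim for Xbar in the slides."""
    return math.exp(-c * n / theta)

# power increases with theta, so the test is more likely to reject
# the farther the truth lies into the alternative H1: theta > theta0
beta_at_1 = power(1.0, 0.3, 10)
beta_at_2 = power(2.0, 0.3, 10)
```

The monotonicity in θ is what makes the sup over Θ0 in the size of the test occur at the boundary θ0.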
Because the power is generally a function of θ, what does maxi-
mizing the power mean?
Power of a Statistical Test
That is, maximize it for what values of θ?
Ideally, we would like a procedure that yields the maximum for
all values of θ; that is, one that is most powerful for all values
of θ.
We call such a procedure a uniformly most powerful or UMP
test.
For a given problem, finding such procedures, or establishing that
they do not exist, will be one of our primary objectives.
Decision Theoretic Approach
In the decision-theoretic formulation of a statistical procedure,
the decision space is {0,1}, corresponding respectively to not
rejecting and rejecting the hypothesis.
As in the decision-theoretic setup, we seek to minimize the risk:
R(P, δ) = E(L(P, δ(X))).
In the case of the 0-1 loss function and the four possibilities, the
risk is just the probability of either type of error.
We want a test procedure that minimizes the risk.
The issue above of a uniformly most powerful test is equivalent
to the issue of a uniformly minimum risk test.
Randomized Tests
Just as in the theory of point estimation we found randomized
procedures useful for establishing properties of estimators or as
counterexamples to some statement about a given estimator,
we can use randomized test procedures to establish properties of
tests.
While randomized estimators rarely have application in practice,
randomized test procedures can actually be used to increase the
power of a conservative test.
Use of a randomized test in this way would not make much sense
in real-world data analysis, but if there are regulatory conditions
to satisfy, it might be useful.
Randomized Tests
We define a function δR that maps X into the decision space, and
we define a random experiment R that has two outcomes, associated
with not rejecting the hypothesis or with rejecting the hypothesis,
such that
Pr(R = d0) = 1 − δR(x)
and so
Pr(R = d1) = δR(x).
A randomized test can be constructed using a test δ(x) whose
range is {d0, d1} ∪ DR, with the rule that if δ(x) ∈ DR, then the
experiment R is performed with δR(x) chosen so that the overall
probability of a type I error is the desired level.
After δR(x) is chosen, the experiment R is independent of the
random variable whose distribution the hypothesis applies to.
Optimal Tests
Optimal tests are those that minimize the risk.
The risk considers the total expected loss.
In the testing problem, we generally prefer to restrict the prob-
ability of a type I error and then, subject to that, minimize the
probability of a type II error, which is equivalent to maximizing
the power under the alternative hypothesis.
An Optimal Test in a Simple Situation
First, consider the problem of picking the optimal critical region
C in a problem of testing the hypothesis that a discrete ran-
dom variable has the probability mass function p0(x) versus the
alternative that it has the probability mass function p1(x).
We will develop an optimal test for any given significance level
based on one observation.
For x such that p0(x) > 0, let
r(x) = p1(x)/p0(x),
and label the values of x for which r is defined so that
r(xr1) ≥ r(xr2) ≥ · · · .
Let N be the set of x for which p0(x) = 0 and p1(x) > 0.
Assume that there exists a j such that
∑_{i=1}^{j} p0(xri) = α.
If S is the set of x for which we reject the test, we see that the
significance level is
∑_{x∈S} p0(x),
and the power over the region of the alternative hypothesis is
∑_{x∈S} p1(x).
Then it is clear that if C = {xr1, . . . , xrj} ∪ N, then ∑_{x∈C} p1(x)
is maximized over all sets C subject to the restriction on the size
of the test.
If there does not exist a j such that ∑_{i=1}^{j} p0(xri) = α, the rule is
to put xr1, . . . , xrj in C so long as
∑_{i=1}^{j} p0(xri) = α∗ < α.
We then define a randomized auxiliary test R with
Pr(R = d1) = δR(xrj+1) = (α − α∗)/p0(xrj+1).
It is clear in this way that ∑_{x∈C} p1(x) is maximized subject to
the restriction on the size of the test.
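The whole construction above can be sketched in a few lines of Python (the function and variable names are my own; p0 and p1 are dicts mapping sample points to probabilities):

```python
def most_powerful_test(p0, p1, alpha):
    """Level-alpha most powerful test of p0 vs p1 on a finite sample space,
    built by the likelihood-ratio ordering described above.
    Returns (reject, boundary_x, gamma): reject H0 outright for x in `reject`,
    and for x == boundary_x reject with probability gamma."""
    # points with p0(x) = 0 and p1(x) > 0 cost nothing under H0: always reject
    N = {x for x in p1 if p0.get(x, 0) == 0 and p1[x] > 0}
    # order the remaining points by decreasing ratio r(x) = p1(x)/p0(x)
    ordered = sorted((x for x in p0 if p0[x] > 0),
                     key=lambda x: p1.get(x, 0) / p0[x], reverse=True)
    reject, size = set(N), 0.0
    for x in ordered:
        if size + p0[x] <= alpha + 1e-12:
            reject.add(x)        # add the point outright; size = alpha* so far
            size += p0[x]
        else:
            # randomize on the next point to use up exactly alpha - alpha*
            gamma = (alpha - size) / p0[x]
            return reject, x, gamma
    return reject, None, 0.0
```

For example, with p0 uniform on {0,1,2,3} and p1 = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1} at α = 0.3, the test rejects outright at x = 0 (p0-mass 0.25) and rejects with probability 0.2 at the boundary point x = 1.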
Example 4 Testing between two discrete distributions
Consider two distributions with support on a subset of {0, 1, 2, 3, 4, 5}.
Let p0(x) and p1(x) be the probability mass functions.
Based on one observation, we want to test H0 : p0(x) is the
mass function versus H1 : p1(x) is the mass function.
Suppose the distributions are as shown in the table, where we
also show the values of r and the labels on x determined by r.