Problems

MATH11400 Statistics 1 2013-14
Homepage http://www.stats.bris.ac.uk/~maotj/teaching.html

Problem Sheet 1: Basics of data analysis

Remember: when online, you can access the Statistics 1 data sets from an R console by typing

    load(url("http://www.maths.bris.ac.uk/~maotj/teach/stats1.RData"))

Ensure that you can run R. You may wish to investigate how to download R (for free) onto your own computer, or to try it in the undergraduate Computer Lab. Running R is a central part of this course, and you do need to be able to do it.

*1. In an experiment to investigate the heat of sublimation of iridium, the following 27 measurements were made, listed across the rows in the order they were taken. The data is contained in the Statistics 1 data set iridium.

136.6 145.2 151.5 162.7 159.1 159.8 160.8 173.9 160.1
160.4 161.1 160.6 160.2 159.5 160.3 159.2 159.3 159.6
160.0 160.2 160.1 160.0 159.7 159.5 159.5 159.6 159.5

(a) Use the R commands stem, hist, boxplot, and plot to make a stem-and-leaf plot, a histogram, a boxplot and a plot of the observations in the order they were taken. Print your plots and comment on the overall pattern and any striking features.

(b) Use the R commands median and mean to find the median and the mean. Use ?mean in R to see how to compute a trimmed mean in R. Compute the 10% and 20% trimmed means for the iridium data set. Compare how well the mean, the median and the trimmed means represent the centre of this data set.

(c) Use the R commands var and sd to find the sample variance and standard deviation, use the R commands fivenum and summary to find the hinges and the sample quartiles, and use the R command IQR to find the interquartile range (but see the comments on ‘Hinges and Quartiles’ at the end of this sheet). Again, compare how these values represent the spread of the data. [Care! In an ‘ideal’ data set, would the variance, standard deviation and inter-quartile range be the same?]

(d) What conclusions do you draw from your plots and numerical summaries? What effect do the outliers have on the numerical and graphical summaries? What would the corresponding results look like if the outliers were removed?

*2. Make a new version of the iridium data set, excluding the apparent outliers, by typing ir2<-iridium[-c(1,2,3,4,8)]. Create a histogram and stem-and-leaf plot of this new data set. Now make similar plots for an artificial sample made by generating the same number (22) of observations from a normal distribution (e.g. data<-rnorm(22)). Visually compare the plots for the real data and the artificial data.

Repeat a similar exercise with the storm.claims data set, comparing it with an artificial sample of 19 observations drawn from an exponential distribution, created using data<-rexp(19).

3. Construct an R function for calculating an empirical distribution function by typing in the following instructions (note that the R prompt will switch from > to + while it is waiting for the command to be completed):

    plot.edf <- function(x){
      n <- length(x)   # number of observations
      # step plot of the ordered data against the heights k/(n+1), k = 1, ..., n
      plot(sort(x),(1:n)/(n+1),type='s',xlab='data',ylab='empirical cdf')
    }

Having loaded the Statistics 1 data sets, produce an empirical distribution function (edf) plot of the iridium data by typing the command plot.edf(iridium) and comment on how the shape of the edf relates to the data.

4. Having loaded the Statistics 1 data set into R, use stem(us.temp,scale=4) to produce a stem-and-leaf plot of the dataset us.temp. The data gives the mean January temperatures for 60 U.S. metropolitan areas. Comment on any unusual pattern in the data and try to find a plausible explanation.

5. (a) (From a recent Guardian puzzle) A lazy flea is wandering along a ruler. He knows that at a certain time, he will receive an instruction to move to the 1 inch mark on the ruler, the 2 inch mark or the 11 inch mark. Which of these it will be is uncertain, and he can assume there is 1/3 probability of each of these possibilities. Where should he position himself to minimise the distance he has to move when instructed?

(b) How does your answer change if instead he wants to minimise the mean squared distance he has to move?

(c) What if he wants to minimise the maximum distance he might have to move?

(d) What does this question have to do with the issue of giving a numerical summary of the centre of a sample of data $x_1, x_2, \ldots, x_n$?

*6. Let $\{x_1, \ldots, x_n\}$ be a data set of real numbers and let $y_i = a x_i + b$, for $i = 1, \ldots, n$, for some $a \neq 0$.

(a) Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Show that $\bar{y} = a\bar{x} + b$ and $s_y^2 = a^2 s_x^2$.

(b) Find expressions for the median, interquartile range and trimmed mean of $\{y_i\}$ in terms of those of $\{x_i\}$. (Note: why do you need to consider the cases $a > 0$ and $a < 0$ separately?)

(c) Let x denote temperature in degrees centigrade and let y denote temperature in degrees Fahrenheit, so $y = 1.8x + 32$. Assume the $\{x_i\}$ data set has mean 68.1, median 68.9, variance 3.2 and IQR 7.7. Calculate the corresponding quantities for the $\{y_i\}$ data.

7. Boxplots are most useful for comparing more than one sample. The built-in data set InsectSprays in R gives the number of insects found on plants subjected to 6 different treatments labelled A-F. Type the following in R:

    data(InsectSprays)
    help(InsectSprays)
    InsectSprays
    boxplot(count ~ spray, data = InsectSprays)

The help command gives some background information about the data, and the command InsectSprays on its own prints out the data. For this data set, the boxplot command produces a separate boxplot (on common axes) for each of the treatments. Use this plot to compare the different treatments. Calculate the mean and variance for each of the treatment types and see if you come to the same conclusions. (It is good practice working out how to do this in R.)

Hinges and Quartiles

The lower hinge is ‘the median of the set of values ≤ the sample median’ and the upper hinge is ‘the median of the set of values ≥ the sample median’. Hinges were introduced by Tukey as a simple alternative to quartiles, since sources disagreed on how quartiles should be defined.
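As a rough illustration of the hinge definition, the sketch below computes the hinges by position in the sorted data and compares them with the second and fourth entries of R's built-in fivenum. The helper name hinges is made up for this example, and splitting by position (rather than by value) is an assumption that matters only when several observations tie with the median.

```r
# Tukey's hinges computed by position in the sorted data.
# 'hinges' is a hypothetical helper, not part of the course code.
hinges <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  lower_half <- xs[1:ceiling(n / 2)]       # up to and including the median position
  upper_half <- xs[(floor(n / 2) + 1):n]   # from the median position upwards
  c(lower = median(lower_half), upper = median(upper_half))
}

# Iridium measurements from question 1 of this sheet
iridium_values <- c(136.6, 145.2, 151.5, 162.7, 159.1, 159.8, 160.8, 173.9, 160.1,
                    160.4, 161.1, 160.6, 160.2, 159.5, 160.3, 159.2, 159.3, 159.6,
                    160.0, 160.2, 160.1, 160.0, 159.7, 159.5, 159.5, 159.6, 159.5)

hinges(iridium_values)
fivenum(iridium_values)[c(2, 4)]   # built-in lower and upper hinge
```

The two results should agree, which gives a quick check that fivenum reports hinges rather than interpolated quartiles.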

Loosely speaking, ‘quartiles’ are values that divide a dataset into four equal parts: a quarter of the data values are greater than the upper quartile, a quarter are between the upper quartile and the median, a quarter are between the median and the lower quartile, and a quarter are less than the lower quartile.

Given a dataset with n data values $x_1, x_2, \ldots, x_{n-1}, x_n$, denote the ordered values (the order statistics) by $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n-1)} \le x_{(n)}$. This suggests $Q_1$ should be roughly the n/4th observation (ordered in increasing size). If n/4 is not an integer, then $Q_1$ should lie between $x_{(i)}$ and $x_{(i+1)}$ where $i = [n/4]$ (the integer part of n/4). The methods actually used to compute, say, $Q_1$ are more complicated than this, but most have the following common basis:

Set $r = (n+1-2a)/4 + a$, set $i = [r]$ (the integer part of r) and set $\gamma = r - i$. Then the required value is $Q_1 = (1-\gamma)x_{(i)} + \gamma x_{(i+1)}$, i.e. the value that lies $\gamma$ of the way between $x_{(i)}$ and $x_{(i+1)}$.

Similarly for $Q_3$, set $s = 3(n+1-2a)/4 + a$, set $j = [s]$ and set $\gamma = s - j$. Then the required value is $Q_3 = (1-\gamma)x_{(j)} + \gamma x_{(j+1)}$.

Where methods differ is in the value of a: some have used a = 0 (Minitab), some a = 1 (Excel), some a = 1/2 (S-plus). Rice suggests using a = 0 or a = 1/2. R uses a = 1/2 in some places and a = 3/8 in other places. The differences between using different values of a, or indeed the differences between the hinges and the quartiles, have no real practical importance in terms of interpretation, and they are negligible numerically in larger data sets.
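The interpolation rule above can be sketched directly in R. The function name quartiles_a and the guards for very small samples are assumptions added for this illustration. In R's own quantile function the type argument selects between rules of this kind; as far as the documented definitions go, type = 6, 5, 7 and 9 correspond to a = 0, 1/2, 1 and 3/8 respectively, with type = 7 the default.

```r
# Quartiles from the common rule r = (n + 1 - 2a)/4 + a, for a chosen a.
# 'quartiles_a' is a hypothetical helper used only for this illustration.
quartiles_a <- function(x, a) {
  xs <- sort(x)
  n <- length(xs)
  interpolate <- function(pos) {
    i <- floor(pos)             # integer part [pos]
    g <- pos - i                # fractional part gamma
    if (i < 1) return(xs[1])    # guard for very small samples (an added assumption)
    if (i >= n) return(xs[n])
    (1 - g) * xs[i] + g * xs[i + 1]
  }
  c(Q1 = interpolate((n + 1 - 2 * a) / 4 + a),
    Q3 = interpolate(3 * (n + 1 - 2 * a) / 4 + a))
}

# Iridium measurements from question 1 of this sheet
iridium_values <- c(136.6, 145.2, 151.5, 162.7, 159.1, 159.8, 160.8, 173.9, 160.1,
                    160.4, 161.1, 160.6, 160.2, 159.5, 160.3, 159.2, 159.3, 159.6,
                    160.0, 160.2, 160.1, 160.0, 159.7, 159.5, 159.5, 159.6, 159.5)

quartiles_a(iridium_values, a = 0)
quartiles_a(iridium_values, a = 1/2)
quartiles_a(iridium_values, a = 1)
quantile(iridium_values, c(0.25, 0.75), type = 7)   # R's default rule, for comparison
```

Running the three calls side by side shows how little the choice of a changes the quartiles, even for a sample of only 27 values.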

Problem Sheet 2: Parametric families and method of moments

*1. Let $X_1, \ldots, X_n$ be a random sample from a Geom($\theta$) distribution, where $\theta$ is a single unknown parameter. State the value of $E(X; \theta)$ and hence find the method of moments estimator of the unknown parameter $\theta$.

2. Let $X_1, \ldots, X_n$ be a random sample from a $N(0, \theta^2)$ distribution, where $\theta > 0$ is a single unknown parameter. Find the method of moments estimator of the unknown parameter $\theta$.

3. The data {2.08, 2.81, 0.04, 1.54, 1.27, 0.74} are thought to come from a Uniform(0,3) distribution. Calculate the corresponding expected quantiles of the Uniform(0,3) distribution. Use R to plot the sample quantiles against these expected quantiles, and comment on the fit of the distribution to the data.

*4. The data in the disasters data set relates to all British coal mining disasters between March 1851 and March 1962 in which 10 or more men were killed. We will study the 120 gaps in days between successive disasters from the start of the series up to January 1889, which we can extract as follows:

    source("http://www.stats.bris.ac.uk/~maotj/teach/disasters.R")

(beware: the ~ symbol may not cut and paste from pdf into R)

    gaps<-disasters$gap[2:121]

(if you are not going to be online when using R, you have first to copy the file from the link on the website and save it on your computer, then use the Source R code item on the File menu in R to navigate to the saved file and load it in).

(a) Use R to plot a histogram of the gaps data. Does the histogram indicate that an exponential distribution would not be an appropriate model for this set of data? Note that if the occurrence of disasters was completely at random, the gaps would have the ‘lack of memory’ property, with an exponential distribution.

(b) Assuming an Exponential distribution with parameter $\theta$ is an appropriate model, show that the method of moments estimate of $\theta$ for this data set is $\hat{\theta} = 0.008681$.

(c) You are given that the distribution function $F_X(x; \theta)$ has inverse

$$F_X^{-1}(y; \theta) = -\log(1 - y)/\theta$$

Use R to produce a quantile plot of the sample quantiles (the ordered observations) against the corresponding approximate expected quantiles for the fitted Exponential distribution and comment on how well the estimated Exponential distribution fits the data.
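The mechanics of such a quantile plot can be sketched on simulated data. Everything specific here is an assumption for illustration: the data are generated with rexp rather than taken from a course data set, and the plotting positions k/(n+1) are one common convention (the same heights used by plot.edf on Problem Sheet 1), not necessarily the one used in the notes.

```r
set.seed(1)                            # reproducible illustration
x <- rexp(30, rate = 0.5)              # simulated stand-in data, not a course data set
n <- length(x)

theta_hat <- 1 / mean(x)               # method of moments estimate, since E(X) = 1/theta
p <- (1:n) / (n + 1)                   # plotting positions (an assumed convention)
expected <- -log(1 - p) / theta_hat    # inverse distribution function evaluated at p

plot(expected, sort(x),
     xlab = "expected quantiles", ylab = "sample quantiles")
abline(0, 1)                           # points close to this line suggest a reasonable fit
```

Because the simulated data really are exponential, the points should scatter around the line, with the largest few observations typically the most variable.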

*5. In an experiment to investigate the effect of seeding clouds, the rainfall measurements below were recorded for 25 seeded clouds. The data are contained in the Statistics 1 data set seeded.rain.

4.1 7.7 17.5 31.4 32.7 40.6 92.4 115.3 118.3
119.0 129.6 198.6 200.7 242.5 255.0 274.7 302.8 334.1
430.0 489.1 703.4 978.0 1656.0 1697.8 2745.6

It is thought that a distribution in the family Gamma($\alpha$, $\lambda$) may provide a good model for the data. Write down the two equations which determine the method of moments estimates of the two unknown parameters $\alpha$ and $\lambda$. Solve these equations to find explicit expressions for the estimates $\hat{\alpha}$ and $\hat{\lambda}$, each in terms of the sample moments $m_1$ and $m_2$.

6. The following data come from a series of experiments by Henry Cavendish in 1798, designed to measure the density of the earth, as a multiple of the density of water. The data are contained in the Statistics 1 data set cavendish.

5.50 5.61 4.88 5.07 5.26 5.55 5.36 5.29 5.58 5.65
5.57 5.53 5.62 5.29 5.44 5.34 5.79 5.10 5.27 5.39
5.42 5.47 5.63 5.34 5.46 5.30 5.75 5.68 5.85

(a) Use exploratory methods (histogram, boxplot etc.) to see if there is any immediate reason to believe a $N(\mu, \sigma^2)$ distribution would not provide an appropriate model for these data.

(b) Calculate the method of moments estimates of $\mu$ and $\sigma^2$. Use the estimates to produce a plot in R of the sample quantiles against the corresponding approximate expected quantiles for the fitted Normal distribution. Comment on how well the estimated Normal distribution fits the data.

(c) You can produce an automatic plot of the quantiles of a data set against the corresponding quantiles of a standard $N(0, 1)$ distribution, and, if desired, a fitted line through the first and third quartiles, with the commands

    > qqnorm(dataset)
    > qqline(dataset)

where dataset is the name of the data file. Comment on the similarities and differences between this plot and the plot in part (b) above.

Problem Sheet 3: Likelihood and Maximum Likelihood Estimation

1. Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed values $x_1, \ldots, x_n$ of a random sample of size n from a Poisson($\theta$) distribution where $\theta > 0$ is an unknown parameter.

*2. Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed values $x_1, \ldots, x_n$ of a random sample of size n from a continuous distribution with probability density function

$$f_X(x; \theta) = \begin{cases} \theta x^{\theta - 1} & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

where $\theta > 0$ is an unknown parameter.

In a random sample of size n = 5 from this distribution, the observed values of $x_1, x_2, \ldots, x_5$ were 0.07, 0.29, 0.45, 0.61 and 0.30 respectively. Compute the value of the maximum likelihood estimate of $\theta$ for this set of observations.

3. (a) Find an expression for the maximum likelihood estimate of $\theta$ in terms of the observed values $x_1, \ldots, x_n$ of a random sample of size n from a Binomial($K$, $\theta$) distribution where K is known and where $\theta$ is an unknown parameter such that $0 < \theta < 1$.

(b) Passing a course is based on an exam with 10 multiple choice questions, and a pass mark of 9. In seven mock exams a student has obtained scores of 9, 8, 10, 5, 8, 10 and 7. Assuming that she has independent probability $\theta$ of correctly answering each question in the exam, and assuming these seven scores are a random sample of size seven from a Binomial(10, $\theta$) distribution with the same value of $\theta$, find the maximum likelihood estimate of the probability that the student will pass the examination.

*4. Consider the following data, recording the failure time, in hours, for a batch of 25 lamps. The data are contained in the Statistics 1 data set lamp.

5.5 3.8 8.0 7.8 9.3 4.7 4.0 0.3 4.6 0.6 7.9 1.8 4.0
0.7 4.0 1.6 2.6 0.7 0.2 3.1 1.0 3.4 3.7 10.8 1.2

Assuming an Exponential distribution with parameter $\theta$ is an appropriate model, find $\hat{\theta}_{mle}$ based on the above data. Hence find the maximum likelihood estimates of:

(a) the median of the distribution of the lifetimes of lamps in the population;

(b) the probability of a randomly chosen lamp surviving beyond 10 hours.

Compare these to appropriate simple estimates calculated directly from the data, without assuming an Exponential distribution.

*5. Let $X_1, \ldots, X_n$ be a random sample of size n from a $N(\mu_X, \sigma^2)$ distribution and let $Y_1, \ldots, Y_m$ be a random sample of size m from a $N(\mu_Y, \sigma^2)$ distribution, and assume the samples are independent of each other. Note that the means of the two distributions are assumed to be possibly different, but the variances are assumed to be the same.

(a) Since all n + m random variables are independent, show that the loglikelihood function for $\mu_X$, $\mu_Y$ and $\sigma^2$ based on all n + m observations is given by

$$\sum_{i=1}^{n} \log f_X(x_i; \mu_X, \sigma^2) + \sum_{j=1}^{m} \log f_Y(y_j; \mu_Y, \sigma^2).$$

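(b) Hence, explain why the likelihood equations for this problem are

$$\frac{\partial}{\partial \mu_X} l(\mu_X, \mu_Y, \sigma^2) = \frac{\sum_{i=1}^{n} x_i - n\mu_X}{\sigma^2} = 0$$

$$\frac{\partial}{\partial \mu_Y} l(\mu_X, \mu_Y, \sigma^2) = \frac{\sum_{j=1}^{m} y_j - m\mu_Y}{\sigma^2} = 0$$

$$\frac{\partial}{\partial \sigma} l(\mu_X, \mu_Y, \sigma^2) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \mu_X)^2}{\sigma^3} - \frac{m}{\sigma} + \frac{\sum_{j=1}^{m} (y_j - \mu_Y)^2}{\sigma^3} = 0$$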

(c) Hence find expressions in terms of $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ for the joint maximum likelihood estimates of $\mu_X$, $\mu_Y$ and $\sigma^2$ from the combined sample.

6. Assume n subjects are each given an envelope. Half the envelopes contain the instructions “Tick box 1 if you have ever cheated on your tax return and tick box 0 otherwise”; the others contain the instructions “Toss a coin and tick box 1 if it falls heads and box 0 if it falls tails”. The two sets of instructions are allocated to the envelopes at random and only the subject knows which set of instructions applied to him (or her). Assume that all subjects follow the instructions in their envelope honestly and correctly.

(a) Assume the probability that any given subject cheated on their tax return is $\tau$ and let $\theta$ denote the probability that a randomly chosen subject, following the procedure above, will tick box 1. Express $\tau$ in terms of $\theta$. What are the possible values for $\theta$?

(b) Assume 8 out of a sample of 20 subjects ticked box 1; find the maximum likelihood estimate of $\theta$ based on these data. Hence estimate $\tau$.

Problem Sheet 4: Linear regression

*1. As part of a study of the relationship between ‘stress’ and ‘skill’, the stress levels for eight second year student volunteers were assessed and compared with their subject skill levels, as measured by each student’s average mark at the end of their first year.

Summary statistics for the data set are:

$n = 8$, $\sum x_i = 492$, $\sum y_i = 379$, $\sum x_i^2 = 32{,}894$, $\sum y_i^2 = 20{,}115$, $\sum y_i x_i = 21{,}087$

where $x_i$ is the subject skill level for the ith student and $y_i$ is their assessed stress level.

Calculate by hand the least squares estimates of $\alpha$ and $\beta$ and the equation for the fitted regression line, under the simple linear regression model $E(Y \mid x) = \alpha + \beta x$. What broad conclusion can you draw immediately from the fitted model? What stress level would you predict for a student with skill level x = 60?

2. The table below shows a data set with five pairs of values $(x_i, y_i)$, $i = 1, \ldots, 5$. It is thought that the data satisfy the simple linear regression model $E(Y \mid x) = \alpha + \beta x$, $\mathrm{Var}(Y \mid x) = \sigma^2$.

x  1  3  4  6  7
y  0  1  2  5  4

(a) Calculate by hand the least squares estimates of $\alpha$ and $\beta$.

(b) For $i = 1, \ldots, 5$, calculate by hand the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual values $e_i = y_i - \hat{y}_i$. Hence calculate by hand an estimate of the common variance $\sigma^2$. Also, find the sum of the residuals.

3. The table below shows the average weight (in kg) of piglets in a litter, for seven litters of varying size. The data are contained in the Statistics 1 data frame pig, variables littersize and wt respectively.

Litter size (x)       1    3    5    8    8    9    10
Average weight (y)    1.6  1.5  1.5  1.3  1.4  1.2  1.1

Perform a simple linear regression of average weight on litter size and output the results to the R object piglets, with the commands:

    > attach(pig); piglets <- lm(wt ~ littersize)

Calculate by hand the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ for the simple linear regression model $E(Y \mid x) = \alpha + \beta x$ and confirm your answers with the R command:

    > coef(piglets)

Draw a scatter plot of the data and add in the fitted regression line, using the commands:

    > plot(littersize,wt); abline(coef(piglets))

Use your fitted regression line to predict the average weight of a piglet in a litter of size 6.

Let $x_i$ denote the litter size for the ith litter and let $y_i$ denote the corresponding average weight for the piglets in that litter. Inspect the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual values $e_i = y_i - \hat{y}_i$ with the commands:

    > fitted(piglets); residuals(piglets)

Plot the residuals against the litter sizes using the command:

    > plot(littersize, residuals(piglets))

and comment on the fit of the model.

*4. The table below shows the rainfall (in inches) for the spring (April/May) and the following autumn (September/October) for each of ten consecutive years. The data are contained in the Statistics 1 data frame rain, in variables spring and autumn respectively. You can extract columns using rain$spring etc.

Spring rainfall (x)   1.6  5.3  2.8  9.6   6.7  1.5  5.4  8.5   4.1  3.9
Autumn rainfall (y)   4.6  6.0  2.9  11.1  8.2  1.3  9.1  10.2  5.2  8.3

(a) Let $x_i$ denote the spring rainfall in the ith year and let $y_i$ denote the corresponding autumn rainfall. Use R to calculate the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ for the simple linear regression model $E(Y \mid x) = \alpha + \beta x$. Use R to produce a scatter plot of the data and use R to add in the fitted regression line.

(b) For $i = 1, \ldots, 10$, use R to calculate the fitted values $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and the residual values $e_i = y_i - \hat{y}_i$. Use R to plot the residuals against the spring rainfalls, and comment on the fit of the model.

*5. Consider a regression problem where the data values $y_1, \ldots, y_n$ are observed values of response variables $Y_1, \ldots, Y_n$.

In the notes we assume that, for given values $x_1, \ldots, x_n$ of the predictor variable, the $Y_i$ satisfy the simple linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d. $\sim N(0, \sigma^2)$. The least squares estimates of the regression parameter(s) are defined to be the values which minimise the sum of squares of the differences between the observed $y_i$ and the fitted values. For this model, $E(Y_i \mid x_i) = \alpha + \beta x_i$, so the least squares estimates are the values minimising $\sum_{i=1}^{n} (y_i - (\alpha + \beta x_i))^2$.

Now consider an alternative model which takes the form $Y_i = \gamma x_i + e_i$, $i = 1, \ldots, n$, where the $e_i$ satisfy the same assumptions as before but where there is now a single unknown regression parameter $\gamma$. This model is sometimes used when it is clear from the problem description that the value of $E(Y)$ must be zero if the corresponding x value is zero.

Derive an expression, in terms of the $x_i$ and $y_i$ values, for the least squares estimate for $\gamma$ for this new model and suggest, with reasons, an appropriate estimate for $\sigma^2$.

Problem Sheet 5: Assessing the performance of estimators

*1. Let $X_1, \ldots, X_n$ be a random sample from the Uniform(0, $\theta$) distribution, for which the population median is $\tau = \theta/2$.

(a) The method of moments estimator of $\tau$ is $\hat{\tau}_{mom} = \bar{X}$. Find $E(\bar{X})$ and $\mathrm{Var}(\bar{X})$, and hence show that $\hat{\tau}_{mom}$ is unbiased and has mean square error $\theta^2/(12n)$.

(b) The maximum likelihood estimator of $\theta$ is $Y = \max\{X_1, \ldots, X_n\}$. You are given that Y has probability density function $f_Y(y; \theta) = n y^{n-1}/\theta^n$ for $0 < y < \theta$ (and $f_Y(y; \theta) = 0$ otherwise). Show that $E(Y) = n\theta/(n+1)$.

The maximum likelihood estimator of the population median $\tau = \theta/2$ is $\hat{\tau}_{mle} = \max\{X_1, \ldots, X_n\}/2$. Use the results for Y to show that $\hat{\tau}_{mle}$ has bias $-\theta/(2(n+1))$.

*2. The methods and R commands required for this question are similar to those in the notes for the simulation from the Uniform(0,1) distribution, but with the obvious adjustments to the names and to the formulae for the various estimators.

(a) Use the R commands below to construct a matrix xsamples with 1000 rows and 10 columns. The 10 data values in each row can be thought of as a random sample of size 10 from an Exp($\theta$) distribution with rate $\theta = 1$ and the 1000 rows can be thought of as B = 1000 independent repeated samples.

    > xvalues <- rexp(10000,rate=1)
    > xsamples <- matrix(xvalues,nrow=1000)
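To see how the rows of such a matrix act as separate samples, here is a scaled-down sketch with B = 4 samples of size 3. The integers 1 to 12 stand in for simulated values purely so the layout is visible, and the names values, samples and row_means are made up for this illustration.

```r
# Scaled-down version of the construction: B = 4 samples, each of size 3.
values <- 1:12                          # stand-in for simulated values
samples <- matrix(values, nrow = 4)     # matrix() fills column by column by default
samples                                 # row 1 holds values 1, 5 and 9

# Apply a statistic to each row, giving one value per sample
row_means <- apply(samples, 1, mean)
row_means
```

Since the simulated values are independent and identically distributed, it does not matter that a row collects every fourth value rather than consecutive ones; each row is still a random sample.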

(b) Calculate the maximum likelihood estimate $\hat{\theta} = 1/\bar{x}$ for each sample, and plot a histogram of the relative frequencies of the 1000 estimates of $\theta$ obtained.

(c) You may assume that the Exp($\theta$) distribution has median $\log(2)/\theta$. For each of the B = 1000 samples in part (a) above, calculate both the sample median and the maximum likelihood estimate of the population median $\tau(\theta) = \log(2)/\theta$. Produce a single plot containing a boxplot of the 1000 values of the sample median and a boxplot of the 1000 values of the mle of the population median.

(d) Since the observations above were drawn from a population distribution with $\theta = 1$, add a horizontal line at height log(2) to your plot showing the true value of the median for this population and use it to compare the sample median and the mle as estimators of the population median.

(e) Calculate the mean and the variance of the 1000 values of the sample median and the 1000 values of the maximum likelihood estimate of the population median and use these numerical values to compare the bias, variance and mean square error of the two estimators.

3. Let T be the total number of heads obtained when a fair coin is tossed 10 times. Let $A = \{T \le 1\}$ be the event that at most one head is obtained and let $B = \{T \ge 6\}$ be the event that at least 6 heads are obtained.

(a) Calculate the exact values of P(A) and P(B).

(b) Calculate the approximate values of P(A) and P(B) given by applying the central limit theorem without and with a continuity correction.

4. An architect is designing the car park for a new apartment block, which has 200 apartments. She believes that the residents in 20% of the apartments will require 2 parking places, that 70% will require 1 place, and that the remaining 10% will not have a car.

(a) Let X be the number of parking places required by the residents of a randomly chosen apartment. Find the mean and variance of X.

(b) If the architect provides 230 car parking places for residents, what is the probability that this will not be enough? How many places would she need to provide for there to be a 99% chance that there will be enough places to satisfy all the residents’ demands?

*5. Opinion polls indicate that support for the government has been about 37%, but it is thought that this may have changed in the light of recent events. Assume a random sample of n electors is interviewed. Let $X_i = 1$ if the ith elector sampled supports the government and let $X_i = 0$ otherwise, so that $T_n = X_1 + \cdots + X_n$ is the total number in the sample that say they support the government. Assume throughout that $X_1, X_2, X_3, \ldots$ are independent random variables and that $P(X_i = 1) = 0.37 = 1 - P(X_i = 0)$.

Assume the pollsters take a sample of size n = 1500. Use the central limit theorem to find the probability that the proportion in the sample supporting the government will differ from 0.37 by no more than 0.02, i.e. find $P(|T_n/n - 0.37| \le 0.02)$. Perform this calculation with and without a continuity correction, to see if it makes a significant difference in this case.

Problem Sheet 6: Sampling distributions related to the Normal distribution

*1. Let $X_1, \ldots, X_n$ be a random sample of size n from a general distribution with population mean denoted by $\mu = E(X)$ and population variance denoted by $\sigma^2 = \mathrm{Var}(X)$. (Note: we are not assuming here that the population has a Normal distribution).

(a) Show that the sample mean $\bar{X} = (X_1 + \cdots + X_n)/n$ has expected value $\mu$ (and so $\bar{X}$ is always an unbiased estimator of the population mean).

(b) Show also that $\bar{X}$ has variance $\sigma^2/n$ (and so $\bar{X}$ always has mean square error equal to $\sigma^2/n$ as an estimator of $\mu$, where $\mu$ denotes the population mean and $\sigma^2$ denotes the population variance).

(c) Assume the fact (see the notes) that $\sum_{j=1}^{n} X_j^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2 + n\bar{X}^2$. Show that, whatever the distribution of X, the sample variance $S^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2/(n-1)$ has expected value $\sigma^2$ (and so $S^2$ is always an unbiased estimator of the population variance $\sigma^2$).

2. Let Y have a Gamma($\alpha$, $\lambda$) distribution. Show that $E(Y) = \alpha/\lambda$, and show that, for $\alpha > 1$, $E(1/Y) = \lambda/(\alpha - 1)$. [Hint: Recall from your Probability 1 notes that $\int_0^\infty x^{a-1} e^{-bx}\,dx = \Gamma(a)/b^a$ for all $a > 0$ and $b > 0$.]

*3. Let $X_1, \ldots, X_n$ be a random sample of size n from the Exponential($\theta$) distribution. We found earlier that the maximum likelihood estimator of $\theta$ was $\hat{\theta}_{mle} = n/\sum_{i=1}^{n} X_i$.

(a) Find the moment generating function of $\sum_{i=1}^{n} X_i$. Hence show $\sum_{i=1}^{n} X_i$ has the Gamma($n$, $\theta$) distribution and state its mean.

(b) The population mean of the Exponential($\theta$) distribution is $\tau(\theta) = 1/\theta$ and the maximum likelihood estimator of this population mean is $\hat{\tau}_{mle} = 1/\hat{\theta}_{mle}$. Show that the maximum likelihood estimator $\hat{\tau}_{mle}$ has expected value $1/\theta$ (and so it is unbiased as an estimator of the population mean).

(c) Using the fact (see previous question) that for Y with a Gamma($\alpha$, $\lambda$) distribution, $E(1/Y) = \lambda/(\alpha - 1)$, show that $E(\hat{\theta}_{mle}) = n\theta/(n-1)$. Hence find the average error (i.e. the bias) of $\hat{\theta}_{mle}$ as an estimator of $\theta$ and show it is not an unbiased estimator of $\theta$.

4. Use the R commands help(TDist) and help(Chisquare) to find out how to compute the probability density function, distribution function and the inverse distribution function for the $t$ and the $\chi^2$ families of distributions.

(a) Plot the probability density function for the $t_r$ distribution over the interval $(-4, 4)$ for r = 1, 5, 10 and 15 degrees of freedom, and compare it with the probability density function for the $N(0, 1)$ distribution.

(b) Plot the probability density function for the $\chi^2_r$ distribution over the interval $(0, 20)$ for r = 5, 10 and 15 degrees of freedom.

*5. Let $X_1, \ldots, X_n$ be a random sample of size n from the $N(\mu, \sigma^2)$ distribution. Denote the sample variance by $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2/(n-1)$ and denote the maximum likelihood estimator of $\sigma^2$ by $\hat{\sigma}^2_{mle} = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$.

(a) State the distribution of $\sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma^2$ and its mean and variance, and thus find the mean and variance of both $S^2$ and $\hat{\sigma}^2_{mle}$.

(b) Let $\hat{\sigma}^2$ denote an estimator for $\sigma^2$. The bias of $\hat{\sigma}^2$ is defined as $E(\hat{\sigma}^2 - \sigma^2)$, while the mean square error is defined as $E[(\hat{\sigma}^2 - \sigma^2)^2]$. Note that it can be easier to calculate the mean square error from its representation as $E[(\hat{\sigma}^2 - \sigma^2)^2] = \mathrm{Var}(\hat{\sigma}^2) + [\mathrm{bias}(\hat{\sigma}^2)]^2$.

Use the results of part (a) to compare the performance of $S^2$ and $\hat{\sigma}^2_{mle}$ as estimators of $\sigma^2$ in terms of their bias (i.e. their average error) and their mean square error (i.e. their average squared error).
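For reference, the representation of the mean square error as variance plus squared bias follows from adding and subtracting the expected value of the estimator. A short sketch of the standard argument, writing $m = E(\hat{\sigma}^2)$ as shorthand introduced here:

```latex
\begin{aligned}
E\big[(\hat{\sigma}^2 - \sigma^2)^2\big]
  &= E\big[\big((\hat{\sigma}^2 - m) + (m - \sigma^2)\big)^2\big] \\
  &= E\big[(\hat{\sigma}^2 - m)^2\big]
     + 2\,(m - \sigma^2)\,E\big[\hat{\sigma}^2 - m\big]
     + (m - \sigma^2)^2 \\
  &= \mathrm{Var}(\hat{\sigma}^2) + \big[\mathrm{bias}(\hat{\sigma}^2)\big]^2 ,
\end{aligned}
```

since $E[\hat{\sigma}^2 - m] = 0$ makes the cross term vanish and $m - \sigma^2$ is the bias.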

*6. Suppose that $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ are all i.i.d. Normal(0, $\theta^2$) where $\theta$ is unknown.

(a) Write down the distribution of the random variable $T_i = X_i^2 + Y_i^2$ for each i. Hence find the maximum likelihood estimate of $\theta$ based on observations $t_1, t_2, \ldots, t_n$ of $T_1, T_2, \ldots, T_n$.

(b) Find also the maximum likelihood estimate of $\theta$ based on observations $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ of $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, respectively.

Problem Sheet 7: Confidence Intervals

To complete this sheet, you will need values of distributions such as $t$ and $\chi^2$. This requires access to R (or the use of statistical tables). Use of tables is not taught in the course, and in the exam, tables are not needed and will not be provided.

All of the standard distribution probability and quantile values needed can be found in the R output below.

    > z
    [1] 0.84162 0.95000 0.99000 1.28155 1.64485 1.95996 2.57583 2.84400
    > pnorm(z)
    [1] 0.80000 0.82894 0.83891 0.90000 0.95000 0.97500 0.99500 0.99777
    > qt( c(.9,.95,.975), 8)
    [1] 1.396815 1.859548 2.306004
    > qt( c(.9,.95,.975), 32)
    [1] 1.308573 1.693889 2.036933
    > qt( c(.9,.95,.975), 33)
    [1] 1.307737 1.692360 2.034515
    > qchisq(c(.025,.05,.1,.9,.95,.975),7)
    [1] 1.6899 2.1673 2.8331 12.0170 14.0671 16.0128
    > qchisq(c(.025,.05,.1,.9,.95,.975),8)
    [1] 2.1797 2.7326 3.4895 13.3616 15.5073 17.5345
    > qchisq(c(.025,.05,.1,.9,.95,.975),9)
    [1] 2.7004 3.3251 4.1682 14.6837 16.9190 19.0228
    > qchisq(c(.025,.05,.1,.9,.95,.975),32)
    [1] 18.291 20.072 22.271 42.585 46.194 49.480
    > qchisq(c(.025,.05,.1,.9,.95,.975),33)
    [1] 19.047 20.867 23.110 43.745 47.400 50.725
    > qchisq(c(.025,.05,.1,.9,.95,.975),34)
    [1] 19.806 21.664 23.952 44.903 48.602 51.966
    > qchisq(c(.025,.05,.1,.9,.95,.975),50)
    [1] 32.357 34.764 37.689 63.167 67.505 71.420
    > qchisq(c(.025,.05,.1,.9,.95,.975),51)
    [1] 33.162 35.600 38.560 64.295 68.669 72.616

Similar information may be provided in the exam, and it is important to know how to interpret it and extract the required information.

*1. In an experiment to determine the fuel consumption of a new model of car, a driver wasemployed to drive nine new cars, each for 100km. The fuel consumption in litres for eachof the nine 100km drives is displayed in the table below. The data is contained in theStatistics 1 data set fuel.

12.09 11.18 9.97 10.50 9.92 9.97 11.84 10.93 10.70.

(a) State clearly any assumptions you make. Explore the data (e.g. with a stem-and-leaf plot) to confirm that your assumptions are appropriate.

(b) Find a 90% confidence interval for the mean fuel consumption per 100km for cars of this type.

(c) Find a 90% confidence interval for the variance of fuel consumption per 100km for the population of cars of this type.

2. Neurobiological arguments suggest that learning to play an instrument may improve the spatial-temporal reasoning of pre-school children. A study measured the spatial-temporal reasoning of 34 preschool children before and after six months of piano lessons. The changes in the reasoning scores of the children are displayed in the table below. You may assume the data represents the observed values of a simple random sample of size n = 34 from a population with unknown mean µ and unknown variance $\sigma^2$. The data is contained in the Statistics 1 data set piano.

2 5 7 −2 2 7 4 1 0 7 3 4 3 4 9 4 5
2 9 6 0 3 6 −1 3 4 6 7 −1 7 −3 3 4 4

Construct a 95% confidence interval for the population mean improvement in reasoning scores, stating clearly any assumptions you make. You should summarize and display the distribution of the data, say in a histogram, and hence check that your assumptions are appropriate.

Under the same assumptions, construct a 95% confidence interval for the variance of the improvement in reasoning scores in the population.

Note: Given a single data set xdata, containing $n$ values with sample mean $\bar{x}$ and sample variance $s^2$, the R command t.test(xdata,conf.level=0.95) will produce output that includes a 95% confidence interval based on the end points

$c_L = \bar{x} - t_{n-1;\alpha/2}\, s/\sqrt{n}$ and $c_U = \bar{x} + t_{n-1;\alpha/2}\, s/\sqrt{n}$,

with α = 0.05. The confidence level can, of course, be changed as desired. You should answer the questions above by going through the relevant working yourself, but you may wish to use this command to check your answers in cases where it is appropriate.
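The end points in the note can be checked against t.test() on any small data set. The following is a minimal sketch using made-up numbers: the vector xdata below is hypothetical and is not one of the Statistics 1 data sets.

```r
# Hypothetical data, for illustration only.
xdata <- c(4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4)

n     <- length(xdata)
xbar  <- mean(xdata)   # sample mean
s     <- sd(xdata)     # sample standard deviation
alpha <- 0.05

# Upper alpha/2 point of the t distribution with n - 1 degrees of freedom.
t_crit <- qt(1 - alpha / 2, df = n - 1)

# End points c_L and c_U as in the note above.
cL <- xbar - t_crit * s / sqrt(n)
cU <- xbar + t_crit * s / sqrt(n)

# The same interval as reported by t.test().
ci <- t.test(xdata, conf.level = 1 - alpha)$conf.int

print(c(cL, cU))
print(as.numeric(ci))
```

The two printed intervals should agree, which is the sense in which t.test() can be used to check working done by hand.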

*3. Consider again the following failure-time data for the batch of 25 lamps, which you may assume is a simple random sample from an Exponential distribution with unknown parameter θ. The data is contained in the Statistics 1 data set lamp.

5.5 3.8 8.0 7.8 9.3 4.7 4.0 0.3 4.6 0.6 7.9 1.8 4.0
0.7 4.0 1.6 2.6 0.7 0.2 3.1 1.0 3.4 3.7 10.8 1.2

(a) Find an equal-tailed 95% confidence interval for the unknown parameter θ based on this set of 25 observations.

(b) Let $X_1, \ldots, X_n$ be a simple random sample of size $n$ from the Exp(θ) distribution. You may assume that $E\left(1/\sum_{i=1}^{n} X_i\right) = \theta/(n-1)$. Use this result to find the average length of a 95% confidence interval for θ based on a random sample of size n = 25, expressing your answer as a multiple of the unknown parameter θ.

4. Assume that a random sample of 1000 electors is interviewed and that 370 of those interviewed say that they support the government. Find a 99% confidence interval for the proportion of electors that support the government.

*5. Assume the 25 observations below are a random sample from the Unif(0, θ) distribution.

1.41 0.11 0.61 4.06 2.81 4.23 2.68 4.43 2.98 4.15 0.10 4.04 5.57
2.04 4.44 5.48 1.53 0.10 4.82 5.99 2.35 0.07 3.24 5.83 1.57

For the Unif(0, θ) distribution we saw earlier that the method of moments estimate $\hat{\theta}_{mom}$ and the maximum likelihood estimate $\hat{\theta}_{mle}$ were given by $\hat{\theta}_{mom} = 2\bar{X}$, where $\bar{X}$ is the sample mean, and $\hat{\theta}_{mle} = X_{(n)}$, where $X_{(n)} = \max(X_1, \ldots, X_n)$ is the sample maximum.

(a) Use the fact that for a random sample of size $n$ from the Unif(0, θ) distribution, $P(X_{(n)}/\theta \le v) = v^n$ for $0 < v < 1$, to find values $v_1$ and $v_2$ such that $P(X_{(25)}/\theta < v_1) = 0.025$ and $P(X_{(25)}/\theta > v_2) = 0.025$. Hence, following the general idea seen in the construction of other confidence intervals, but with different details, find an equal-tailed 95% confidence interval for θ based on $\hat{\theta}_{mle}$.

(b) Find an equal-tailed 95% confidence interval for θ based on $\hat{\theta}_{mom}$. [Hint: Use the Normal approximation to the distribution of $\bar{X}$ based on the Central Limit Theorem.] Comment on whether the interval you get is compatible with the data.

MATH11400 Statistics 1 2013-14
Homepage http://www.stats.bris.ac.uk/~maotj/teaching.html

Problem Sheet 8: Hypothesis tests

Remember: when online, you can access the Statistics 1 data sets from an R console by typing
load(url("http://www.maths.bris.ac.uk/~maotj/teach/stats1.RData"))

*1. A tyre company claims its tyres have a mean useful lifetime of 42,000 miles. A consumer association bought one of the company’s tyres from each of 10 randomly chosen outlets and tested them on a test rig that simulated normal road conditions. The observed lifetimes (in thousands of miles) were

42 36 46 43 41 35 43 45 40 39.

Thinking carefully about the context, what is an appropriate alternative hypothesis $H_1$ to use in testing the manufacturer’s claim? Use the data to test whether or not there is sufficient evidence to reject the manufacturer’s claim, using a test procedure with significance level α = 0.05. The data are contained in the Statistics 1 data set tyre.lifetimes.

[Your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the p-value and the critical region for the test, and a summary of your conclusions. You may find it helpful to check your numerical calculations using the t.test() function in R. If you do this question by hand calculation, it may help to know that pt(0.8808,9) gives 0.7993325 and qt(0.95,9) gives 1.833113.]

*2. A certain manufacturer produces packets of biscuits with a nominal weight of 200g. You may assume that it is known from experience that the standard deviation of the weight of the packets is 4g. To carry out a control check on the actual weight of the packets produced, an employee weighs 25 packets selected at random from a day’s production and finds that the average weight of the sample is $\bar{x} = 202.275$g.

Let µ denote the actual mean weight of 200g packets produced by the manufacturer. Test the null hypothesis $H_0: \mu = 200$ against the alternative $H_1: \mu \neq 200$, using a test procedure with significance level α = 0.01. For what range of significance levels would you reject $H_0$ in favour of $H_1$?

[Your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the p-value and the critical region for the test, and a summary of your conclusions.]

*3. A random variable X is known to have a Normal distribution with mean µ and variance 25. To test the hypotheses $H_0: \mu = 100$ versus $H_1: \mu > 100$ a test procedure is proposed which would take a simple random sample of size n from the population distribution of X and reject $H_0$ in favour of $H_1$ if the sample mean $\bar{x} > 102$, and otherwise accept $H_0$.

Find an expression in terms of the sample size n for the significance level α of this test procedure. Hence find the smallest sample size for which the significance level would be less than 0.05.

MATH11400 Statistics 1 2013-14
Homepage http://www.stats.bris.ac.uk/~maotj/teaching.html

Problem Sheet 9: Comparison of population means

Remember: when online, you can access the Statistics 1 data sets from an R console by typing
load(url("http://www.maths.bris.ac.uk/~maotj/teach/stats1.RData"))

Note: for each of the following questions your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the approximate p-value and the critical region for the test, and a summary of your conclusions. You may find it helpful to check your numerical calculations using the t.test() function in R.

*1. You have learned about several different hypothesis tests that are relevant to different situations. It is vitally important that you apply the correct test in each situation. The following are brief descriptions of experiments that are designed to test a hypothesis. For each description write down (i) the null and alternative hypothesis, and (ii) the type of hypothesis test that should be used. There is no need to write down the test statistic or perform the test.

(a) A car manufacturer claims that the level of nitrous oxide emissions from its new engine is lower than from its old engine. A researcher evaluates eight engines of the old type and nine of the new type. As a preliminary step the researcher establishes that there is no difference between the variances of nitrous oxide emissions in both engine types.

(b) A researcher measures the systolic blood pressure (SBP) of 20 men and 20 women in a clinic. She wants to know whether there is a difference between SBP in men and women.

(c) Students willingly volunteer for a test of the effects of alcohol on reaction times. A random sample of 24 students have their reaction times measured and are then given an alcoholic drink. Their reaction times are measured again, half an hour later.

(d) A manufacturer makes chocolate bars that are advertised as having a weight of 500g. To carry out a production control check, employees select 30 bars at random from a day’s production. They want to make sure that the company is not manufacturing underweight bars.

(e) An opinion polling company wants to know whether the majority of people in Bristol would support a congestion charge. They telephone a random sample of 1000 people in Bristol and 466 say they are in favour of a congestion charge.

2. In a study of blood glucose level, measurements were made on a sample of 52 pregnant women in their third trimester of pregnancy. Their values (in milligrams/100 millilitres of blood) were found to have a sample mean of 70.12. Healthy women who are not pregnant are known to have a mean value of 80 (mg/100ml), with a standard deviation of 10 (mg/100ml), and we will assume that the standard deviation for pregnant women is also 10 (mg/100ml).

(a) Use the data to test the hypothesis that pregnant women have a lower glucose level than women who are not pregnant. [Your answer should include a statement of any model assumptions, a brief description of your working at each stage of the test procedure including the p-value, and a summary of your conclusions. You will probably need to use R to calculate the actual p-value.]

(b) Now assume the test was carried out using a procedure with significance level α = 0.01. Calculate the probability of type II error under the alternative hypothesis that the true mean value of glucose level in pregnant women is 79 (mg/100ml). Hence calculate the power of the test to discriminate between the null hypothesis and this alternative.

3. (This question reminds you that sometimes the null hypothesis is true but due to chance the test statistic is such that the null hypothesis is rejected. In particular, under a test procedure with significance level α, about 100α% of samples will result in the null hypothesis being rejected even when $H_0$ is true.)

In this question we investigate the one-sided one sample t-test of the null hypothesis $H_0: \mu = \mu_0$ against the alternative $H_1: \mu > \mu_0$, using a test procedure with significance level α = 0.05.

(a) Choose valid numerical values of $\mu_0$ and $\sigma^2$ for normally distributed data. Use the R commands below to construct a matrix xsamples with 1000 rows and 10 columns. The 10 data values in each row can be thought of as a random sample of size 10 from a $N(\mu_0, \sigma^2)$ distribution. You will need to substitute in your own chosen values of $\mu_0$ and σ, or store them as separate objects in R.

> xvalues <- rnorm(10000, mean=mu0, sd=sigma)
> xsamples <- matrix(xvalues,nrow=1000)

Now calculate the sample mean and sample standard deviation for each sample, and store them as sample.mean and sample.sd respectively.
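One possible way to set this up is sketched below. The values mu0 = 5 and sigma = 2 and the seed are arbitrary illustrative choices, not values prescribed by the sheet; apply() with margin 1 evaluates a function on each row of the matrix.

```r
set.seed(1)          # arbitrary seed, only so the run is reproducible
mu0   <- 5           # illustrative choice of the null mean
sigma <- 2           # illustrative choice of the standard deviation

xvalues  <- rnorm(10000, mean = mu0, sd = sigma)
xsamples <- matrix(xvalues, nrow = 1000)   # 1000 rows, 10 columns

# Row-wise summaries: one mean and one standard deviation per sample of size 10.
sample.mean <- apply(xsamples, 1, mean)
sample.sd   <- apply(xsamples, 1, sd)

print(dim(xsamples))
print(head(cbind(sample.mean, sample.sd)))
```

rowMeans(xsamples) would give the same means; apply() is used here because it handles the standard deviations in the same way.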

(b) Calculate the observed test statistic for each sample, using the command:

> sample.tobs <- sqrt(10) * (sample.mean - mu0) / sample.sd

again substituting in your chosen value of $\mu_0$. Plot a histogram of the relative frequencies of the 1000 observed test statistics that you obtain. What sort of distribution does the histogram portray? What distribution should we expect to see?

(c) For a one-sided test the critical region will be of the form $C = \{T > c^*\}$. Calculate the exact value of $c^*$ for n = 10 and mark this point with a vertical line on your histogram. What do you notice?

(d) On average, how many of the 1000 sample test statistics should be inside the critical region? How many of the tests in your sample were significant?

*4. Eight athletes ran a 400 metre race at sea level and at a later meeting ran a 400 metre race at high altitude. Their times in seconds are shown in the table below. Test the null hypothesis that race times are unaffected by altitude, against the alternative that race times are greater at high altitude, using a test procedure with significance level α = 0.05. The data are contained in the Statistics 1 data set runner.

Runner         1     2     3     4     5     6     7     8
Sea Level      48.3  47.6  49.2  50.3  48.8  51.1  49.0  48.1
High Altitude  50.4  47.3  50.8  52.3  47.7  54.5  48.9  49.9

[These values may be useful: pt(2.1958,7) is 0.9679363 and qt(0.95,7) is 1.894579.]

*5. To investigate the relative size of secretarial starting salaries in the public and private sector, 9 private sector posts and 10 public sector posts were chosen at random from jobs advertised on the web. The table below shows the advertised starting salaries (in £1,000; this data set is quite old!). You may assume that the population variances are the same in the private and public sectors. Use the data to test at the 0.05-level the hypothesis that starting salaries are the same in the two sectors against the alternative that private sector starting salaries are higher. The data are contained in the Statistics 1 data set secretaries.

Private sector  12.1  13.4  11.3  10.6   9.7  12.5   9.6  13.6  11.2
Public sector    9.3   8.5   8.2  13.1   8.8  11.9  10.1   9.8  12.2  10.4

6. A study was conducted of 10 households to see if alerting them to high usage rates of electricity reduced their actual consumption. A small monitor was installed in each household, which activated a red flashing light whenever the current rate of usage exceeded a pre-set threshold. The monthly usage (in kilowatt-hours) before and after installation of the monitor is given below. Test at the 0.05-level whether the monitor is effective at reducing electrical consumption. The data are contained in the Statistics 1 data set kwh.

Household     1     2     3     4     5     6     7    8    9    10
Before      940  1370  1030  2030  1540  2300  1800  910  640  1200
After       900  1230  1060  2100  1250  2200  1820  900  630  1110

7. In a study to examine whether increasing the amount of calcium in the diet reduced blood pressure, a group of 10 men were given a calcium supplement in their diet for 12 weeks, and a control group of 11 men received a placebo (a pill that appeared identical, but contained no active substance). The table below shows the relative change (in mm of mercury) in blood pressure over the 12 week period (before - after) for each subject. What do you conclude from the results shown? The data are contained in the Statistics 1 data set bp.

Calcium group  −7    4  −18  −17   3  5  −1  −10  −11  2
Placebo group   1  −12    1    3  −3  5  −5   −2   11  1  3

MATH11400 Statistics 1 2013-14
Homepage http://www.stats.bris.ac.uk/~maotj/teaching.html

Problem Sheet 10: Linear Regression: Confidence Intervals & Hypothesis Tests

Remember: when online, you can access the Statistics 1 data sets from an R console by typing
load(url("http://www.maths.bris.ac.uk/~maotj/teach/stats1.RData"))

1. The table below shows a data set with five pairs of values, $(x_i, y_i)$, $i = 1, \ldots, 5$. Assume that the data satisfy the simple Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$.

x  1  3  4  6  7
y  0  1  2  5  4

(a) Test the null hypothesis $H_0: \alpha = 0$ against the alternative $H_1: \alpha \neq 0$ using a test procedure with significance level 0.05.

(b) Test the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$ using a test procedure with significance level 0.05.

In each case calculate the p-value of the observed test statistic.

*2. A study was conducted to examine the dependence of metabolic rate on body mass for 7 dogs, yielding data given in the table below.

Body mass (kg)              31.20  24.00  19.80  18.20   9.61   6.50   3.19
Metabolic rate (kcal/day)  1113.2  981.8  908.2  840.8  626.2  429.5  280.9

It was decided to analyse these data on a log scale; defining x as log body mass and y as log metabolic rate, we fit a Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$. Summary statistics calculated from the data are

$\sum x_i = 17.800$, $\sum y_i = 45.590$, $\sum x_i^2 = 49.239$, $\sum y_i^2 = 298.433$, $\sum y_i x_i = 118.365$.

Calculate the end points of 99% confidence intervals for α and β.

[In R, I found qt(c(0.95,0.975,0.99,0.995),5) gives [1] 2.015048 2.570582 3.364930 4.032143; some of these values may be useful.]

3. The table below shows the rainfall (in inches) for the spring and the following autumn for each of ten consecutive years. Let $x_i$ denote the observed spring rainfall in the $i$th year and $y_i$ denote the corresponding observed autumn rainfall. Assume that the data satisfy the simple Normal linear regression model $Y_i = \alpha + \beta x_i + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$. The data are contained in the Statistics 1 data set rain.

Spring rainfall (x)  1.6  5.3  2.8   9.6  6.7  1.5  5.4   8.5  4.1  3.9
Autumn rainfall (y)  4.6  6.0  2.9  11.1  8.2  1.3  9.1  10.2  5.2  8.3

(a) Find a 90% confidence interval for β.

(b) Test the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$, using a test procedure with significance level equal to 0.10.

(c) Show that in general a test of the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$ with significance level (say) γ will accept the null hypothesis if and only if the corresponding 100(1 − γ)% confidence interval for β contains the value β = 0.

*4. The table below shows the average weight (in kg) of piglets in a litter, for seven litters of varying size. The data are contained in the Statistics 1 data set pig.

Litter size (x)     1    3    5    8    8    9    10
Average weight (y)  1.6  1.5  1.5  1.3  1.4  1.2  1.1

Summary statistics are: $\sum x_i = 44$, $\sum y_i = 9.6$, $\sum x_i^2 = 344$, $\sum y_i^2 = 13.36$, $\sum y_i x_i = 57$.

When a simple linear regression of average weight on litter size is performed in R using the command piglets <- lm(wt ~ littersize,data=pig), the output of the command summary(piglets) includes the following lines:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.683051   0.064520  26.086 1.55e-06 ***
littersize  -0.049576   0.009204  -5.387  0.00297 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.07558 on 5 degrees of freedom

(a) Show how each of the numerical values on the three lines beginning (Intercept), littersize and Residual standard error were calculated, and explain the interpretation of the asterisks *** and ** at the end of the lines.

(b) What conclusion would you reach from the output for testing the null hypothesis $H_0: \beta = 0$ against the alternative $H_1: \beta \neq 0$?
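The quoted output can be reproduced directly from the table, which may help when checking hand calculations. This is a sketch that assumes the pig data set has columns named littersize and wt, as suggested by the lm() call above; the data frame is rebuilt here from the table rather than loaded from stats1.RData.

```r
# Rebuild the data from the table (column names assumed from the lm() call).
pig <- data.frame(
  littersize = c(1, 3, 5, 8, 8, 9, 10),
  wt         = c(1.6, 1.5, 1.5, 1.3, 1.4, 1.2, 1.1)
)

piglets <- lm(wt ~ littersize, data = pig)

# Coefficient table: estimates, standard errors, t values and p-values.
print(coef(summary(piglets)))

# Residual standard error and its degrees of freedom.
print(summary(piglets)$sigma)
print(df.residual(piglets))
```

Running summary(piglets) itself prints the full report, including the significance codes shown in the excerpt.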
