Lecture 24 – revision

1. Random quantities
2. Model fitting
3. Signal detection
4. Catalogs of sources
5. Radio astronomy
6. Polarized signals
7. Fourier matters
8. Interferometry
9. X-ray astronomy
10. Satellite observatories
11. Astronomical data
1. Random quantities

• Important properties of a random value Y:
– The probability density p(y).
– The mean or average, $\mu = \int y\,p(y)\,dy$.
– The variance, $\sigma^2 = \int (y-\mu)^2\,p(y)\,dy$.
• It is important to distinguish between the ideal values of these and estimates of them which one can calculate from a sample [y1,y2,...yN] of Y.
• Although the ideal values are formally unattainable, in practice good estimates of them may be available. Eg:
– There may be formulae which predict p, μ and (often most importantly) σ² (see eg radio astronomy).
– Long-term calibration measurements can provide good estimates.
• Estimating the three properties from N samples of Y:
– A frequency histogram serves as an estimate of p(y).
– Estimate of the mean:
$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} y_i$$
– Estimate of the variance*:
$$\hat\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(y_i - \hat\mu\right)^2$$
• The result of every (non-trivial) transformation of a set of random numbers is itself a random number. (For example, the estimators for the mean and variance.)

*This formula was incorrect on slide 5 of lecture 3. Note: the 'hats' here mean 'estimate'.
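A minimal numpy sketch of these two estimators (the sample values are simulated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=3.0, scale=2.0, size=1000)  # N samples of Y

N = len(y)
mu_hat = y.sum() / N                            # estimate of the mean
var_hat = ((y - mu_hat)**2).sum() / (N - 1)     # estimate of the variance

# np.var with ddof=1 applies the same 1/(N-1) correction:
assert np.isclose(var_hat, np.var(y, ddof=1))
print(mu_hat, var_hat)   # should be close to 3.0 and 4.0
```

Note that mu_hat and var_hat are themselves random numbers: rerunning with a different seed gives slightly different values.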
• Measurements of some physical quantity:
– We have a set of N measurements yi which are usually made at different values of some other, essentially non-random quantities ri (eg at different times, positions, frequencies, etc).
• It is often convenient to treat each measurement yi as a sum of signal si, background bi and noise ni – a decomposition sketched in code below.
– The division between signal and background is made largely on grounds of convenience, or interest – it is a bit like the distinction between 'useful' plants and weeds.
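As a concrete (entirely illustrative) sketch of this decomposition, with a localized Gaussian signal on a slowly varying background:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.linspace(0.0, 10.0, 200)                  # non-random coordinate, eg time
s = 5.0 * np.exp(-0.5 * ((r - 4.0) / 0.3)**2)    # 'signal': localized
b = 2.0 + 0.1 * r                                # 'background': everywhere
n = rng.normal(scale=0.5, size=r.size)           # 'noise': random
y = s + b + n                                    # the measurements y_i
```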
2. Model fitting

• Just as we do not have access to μ or σ², neither do we have access to the 'ideal' quantities s(r) or b(r).
– Model fitting is the process of estimating these quantities.
• The usual practice is to propose a model which depends on a small number M of parameters Θ = [θ1, θ2, ... θM].
• The background can usually be much better estimated than the signal, because background tends to occur at similar levels in all data points, whereas signal is localized.
– Hence it is often assumed that $\hat b = b$.
• The model-fitting procedure:
1. Decide on a model.
2. Choose a fitting statistic U: ie some formula which has the following properties:
   1. It returns a single number.
   2. It is a function of the data values yi and the model m(r; Θ).
   3. U should be a smooth function of the parameters Θ, with a quadratic (=analytic) minimum.
   4. The parameter values at that minimum should be, in some reasonable sense of the words, the best-fit parameters, ie provide the 'best' estimate of the ideal, unattainable s.
3. Calculate uncertainties, and the model plausibility.
• Types of U: first I'll look at the method of least squares.
• This prescribes a formula for U which is essentially a ratio between variances: the estimate $\hat\sigma^2$ we make from our data goes in the numerator, and the predicted σ² goes in the denominator.
– Before writing this down, let us reconsider the formula for $\hat\sigma^2$. Adapting the formula on slide 4 gives
$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left[y_i - m(r_i;\Theta)\right]^2.$$
– However, in general this gives too small a value, and the larger the number M of parameters, the greater this error. This is because, as M is increased, the model can better and better follow the ups and downs of the noisy part of y. The remedy is to divide by the number of degrees of freedom, N − M, rather than by N.
– If, for every data point yi, we replace ⟨y⟩ in the probability expressions by our model m(ri; Θ), then multiply all the probabilities together for i in [1, N], we have an expression for the likelihood L of a given set of parameters Θ. The best-fit parameter values are taken to be those which maximize L.
– Two further modifications are usually made:
1. Since the numbers are often very small, it is more convenient to work with log(L);
2. A minimization algorithm will need to work with the negative of the log-likelihood, ie the expression to minimize is −log(L).
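A sketch of the corresponding minimization, assuming Gaussian noise of known sigma (for which −log(L) reduces, up to a constant, to half the sum of squared weighted residuals):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
r = np.linspace(0.0, 10.0, 50)
sigma = 0.5
y = 2.0 + 0.3 * r + rng.normal(scale=sigma, size=r.size)

def neg_log_L(theta):
    m = theta[0] + theta[1] * r                  # the model replaces <y>
    return 0.5 * np.sum((y - m)**2 / sigma**2)   # -log(L) + constant

res = minimize(neg_log_L, x0=[0.0, 0.0])
print(res.x)                                      # the best-fit parameters
```

For Gaussian noise this gives the same answer as least squares; the likelihood route matters most when the noise is not Gaussian (eg Poisson counts).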
3. Signal detection

• This is like model fitting, but we start with a special class of model: one which contains only background.
– Often we don't need to do the fitting part, because we have obtained a good estimate of the background from other sets of data.
– All we have to do is test the model.
• Remember that a model is our hypothesis about what lies behind the data.
– This signal-less model is called the null hypothesis ('null' is from the Latin for 'nothing').
• This is ok, but it may not be the most sensitive way to detect signals. This is because signal and background usually have different spatial scales:
– Background usually extends over many data points;
– Signal usually extends only over a few data points.
• Thus, if one uses the whole data set to calculate U, a small signal may be swamped among the background and noise.
– Some selection and filtering are usually done first.
• A likelihood ratio between the likelihood Lbest, calculated for the best fit of a normal model m = b + s(Θ), and the likelihood Lnull for m = b alone, can also be used to test the null hypothesis.
– W Cash has shown that, for a large number of data points,
$$2\ln\!\left(\frac{L_\mathrm{best}}{L_\mathrm{null}}\right) \sim \chi^2_M,$$
where M is the number of fitted parameters.
• Don't get confused though – this is a test of the null hypothesis, not of the best fit. A low value of the resulting probability P means the null hypothesis is probably wrong, and therefore there is some signal there (see the sketch below).
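A sketch of how that result gets used in practice, with illustrative −log(L) values (scipy.stats.chi2.sf gives the upper-tail probability):

```python
from scipy.stats import chi2

neg_logL_null = 130.0   # -log(L) for the background-only model (illustrative)
neg_logL_best = 120.0   # -log(L) at the best fit of m = b + s(Theta)
M = 3                   # number of fitted signal parameters

delta = 2.0 * (neg_logL_null - neg_logL_best)   # = 2 ln(Lbest / Lnull)
P = chi2.sf(delta, df=M)                        # probability under the null
print(P)    # small P => the null hypothesis is disfavoured: signal present
```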
• The simplest case: you have only 1 random value y, and you want to decide whether it contains 'signal' or just 'background'.
– Eg the assignment question in which you had some values of the Stokes parameter V, plus its uncertainty. In this case the 'background' was zero. The question was: is there some 'signal' (ie a non-zero value of V) present?
– The way to do this is to compare y − b with the uncertainty σy (this σ is the 'expected' standard deviation, calculated from sources other than the single data point).
– The ratio (y − b)/σy is called the significance.
• If the ratio is about 1, then you might say "y is consistent with the background value." Ie you can't rule out the null hypothesis.
• For a ratio X, you will sometimes hear people call this an "X-sigma detection." X = 5 is a commonly used yardstick for a detection which is judged to be significant.
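A one-liner version of this test (the numbers are illustrative; norm.sf gives the one-sided Gaussian tail probability):

```python
from scipy.stats import norm

y, b, sigma_y = 0.042, 0.0, 0.008   # eg a Stokes V value and its uncertainty
X = (y - b) / sigma_y               # the significance
P = norm.sf(X)                      # chance of so large a value from noise alone
print(f"{X:.2f}-sigma detection, P = {P:.2g}")
```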
4. Catalogs of sources.

• Of interest is the frequency distribution of source amplitudes, n(A).
– Often one talks of the source flux or flux density rather than amplitude:
• Examples of units are erg cm-2 s-1 for x-rays and janskys for radio;
• The symbol used is often S.
– Hence n(S) is the more common notation.
• If the sources are distributed uniformly in space (known as a Euclidean distribution), n(S) will be a power law with index −2.5; in other words, $n(S) \propto S^{-2.5}$.
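The −2.5 index follows from a short argument worth remembering. For sources of a fixed luminosity L distributed uniformly in space, the number brighter than S is the number within distance R, where S = L/(4πR²), so R ∝ S^(−1/2) and

$$N(>S) \propto R^3 \propto S^{-3/2}, \qquad n(S) = -\frac{dN(>S)}{dS} \propto S^{-5/2}.$$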
5. Radio astronomy.

• Radio dishes are reflectors, just the same as the mirrors of optical telescopes; only the terminology is sometimes a bit different.
– Instead of PSF, one refers to the beam.
• They are sensitive to radiation from a small area of the sky, of angle ~ λ/D.
– The beam nearly always has sidelobes though.
• Unlike optical detectors, radio detectors are polarized – the output voltage varies between a maximum and zero, depending on the polarization state of the incoming radiation.
– Sometimes the detector is most sensitive to linearly polarized radiation at a given angle, sometimes to left- or right-circularly polarized radiation.
– Often these days, two detectors of opposite polarization are placed at the focus.
• Radio signals are nearly always noise-like. As such they can be mimicked by:
– placing a resistor at a certain temperature across the input terminals of the detection electronics, or
– pointing the antenna directly at a surface with the same temperature as the above resistor!
• Because the noise power spectral density from such a resistor is equal to kT watts Hz-1 (k is Boltzmann's constant and T the temperature in kelvin), radio engineers are in the habit of expressing all noise powers in terms of temperatures.
• Since powers are additive, so are the associated temperatures. Thus the total output noise temperature can be expressed as a sum of several terms, eg
$$T_\mathrm{total} = T_\mathrm{source} + T_\mathrm{background} + T_\mathrm{receiver} + \ldots$$
– Such temperatures are most of the time not 'real', in the sense that they are numbers you could read off a thermometer somewhere. They are just a handy way of expressing power spectral density.
• For example, the 'real' temperature of the ground is about 300 K, but Tbackground in the above sum won't be 300 K unless the antenna is pointed right at the ground. Normally you will only get a little contribution from this hot surface, from reflections and from far sidelobes.
• Two useful formulae. The flux density corresponding to an antenna temperature Tsource is
$$S = \frac{2 k T_\mathrm{source}}{A_e}$$
in W m-2 Hz-1 (you have to multiply by 10^26 to convert to janskys), where A_e is the effective collecting area. After integrating for a time t over a bandwidth Δν, the noise corresponds to a flux-density uncertainty of roughly
$$\Delta S \approx \frac{2 k T_\mathrm{total}}{A_e \sqrt{t\,\Delta\nu}}.$$
This last is roughly equal to the limiting sensitivity of the telescope – ie the minimum flux point source which can be detected. (Technically you'd want to set the detection threshold at 5σ or so.)
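Plugging illustrative numbers into the sensitivity formula (all parameter values here are assumptions, not from the lecture):

```python
import numpy as np

k = 1.380649e-23     # Boltzmann's constant, J/K
T_total = 50.0       # total system noise temperature, K (illustrative)
A_e = 500.0          # effective collecting area, m^2 (illustrative)
dnu = 100e6          # bandwidth, Hz
t = 3600.0           # integration time, s

dS = 2 * k * T_total / (A_e * np.sqrt(t * dnu))   # W m-2 Hz-1
print(dS * 1e26, "Jy")   # multiply by 1e26 to convert to janskys
```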
6. Polarized signals.

• Depolarization: basically due to the mixing of many slightly different, uncorrelated source polarizations within the width of the beam.
• Faraday rotation of the polarization angle:
$$\Delta\psi \propto \lambda^2\, D\, N_e\, B,$$
where D is the distance from the source, Ne is the average number density of electrons along that path, and B is the average magnetic field.
– This enhances depolarization, because uneven D·Ne·B within the beam amplifies differences between polarization angles.
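For reference (this convention is not from the slide itself): in the units astronomers usually quote, the rotation is written

$$\Delta\psi = \mathrm{RM}\,\lambda^2, \qquad \mathrm{RM} \approx 0.81 \int_0^D n_e\, B_\parallel\, dl \;\;\mathrm{rad\,m^{-2}},$$

with n_e in cm⁻³, B∥ (the field component along the line of sight) in microgauss, and dl in parsecs.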
7. Fourier matters.

• What is the difference between
– a Fourier series (FS);
– a Fourier transform (FT);
– a discrete* Fourier transform (DFT);
– the Fast Fourier Transform (FFT)?
• A Fourier series starts with a function f(t) defined on an interval [0,T]. Its FS to order N-1 is
$$f(t) \approx \sum_{j=0}^{N-1} A_j \exp\!\left(\frac{2\pi i j t}{T}\right),$$
where
$$A_j = \frac{2}{T}\int_0^T f(t)\,\exp\!\left(-\frac{2\pi i j t}{T}\right) dt, \quad \mathrm{except}\;\; A_0 = \frac{1}{T}\int_0^T f(t)\,dt.$$

*'Discrete' and 'discreet' mean different things – consult a dictionary!
• The FT of f(t) (sometimes indicated F{f}) is
$$F(\nu) = \int_{-\infty}^{\infty} f(t)\,\exp(-2\pi i \nu t)\,dt.$$
Analytic FTs are known for some functions, but not for others; in general they can't be calculated exactly.
• The DFT is the nearest one can get to a FT which is calculable. If f(t) is sampled at N evenly spaced points tj within the interval [0,T], then the DFT is defined as
$$F_k = \sum_{j=0}^{N-1} f(t_j)\,\exp\!\left(-\frac{2\pi i j k}{N}\right).$$
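A sketch verifying that numpy's FFT computes exactly this DFT sum, only faster:

```python
import numpy as np

N, T = 64, 1.0
t = np.arange(N) * T / N                  # N evenly spaced samples on [0, T)
f = np.sin(2 * np.pi * 5 * t / T)         # test signal: 5 cycles in [0, T]

# Direct DFT following the definition above (O(N^2) operations):
k = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, np.arange(N)) / N)
F_dft = W @ f

F_fft = np.fft.fft(f)                     # the FFT: same result, O(N log N)
assert np.allclose(F_dft, F_fft)
print(np.argmax(np.abs(F_fft[:N // 2])))  # peaks at k = 5, as expected
```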
• From that it is pretty clear that an autocorrelation (the correlation of a function against itself) must be real-valued.
– The FT of the autocorrelation of f is called the power spectrum of f.
• The normalized, zero-lag correlation is just the average of the product of f and g:
$$\langle f g \rangle = \frac{1}{N}\sum_{i=1}^{N} f_i\, g_i.$$
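A numerical check of these statements (using numpy's FFT conventions; the autocorrelation here is circular):

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.normal(size=256)
F = np.fft.fft(f)

acf = np.fft.ifft(F * np.conj(F))          # circular autocorrelation of f
assert np.allclose(acf.imag, 0.0)          # real-valued, as claimed

# Its FT is the power spectrum |F|^2:
assert np.allclose(np.fft.fft(acf), np.abs(F)**2)

# Zero lag, suitably normalized, is the average of the product of f with itself:
assert np.isclose(acf[0].real / f.size, np.mean(f * f))
```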
9. X-ray astronomy.

• The detector frame cycle:
– Longish time accumulating photons;
• Too bright a source gives more than 1 photon per pixel per frame, called pile-up.
– Shortish time reading the data out.
• Serial readout – slow – OOTEs (out-of-time events) – 'dead time'.
• Hardness ratios are a crude measure of an x-ray spectrum.
• With a ‘proper’ spectrum, if it is a power law, you have to be careful whether you are talking about number of photons per unit photon energy, or total energy per unit photon energy. The respective spectral indices differ by 1.
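In symbols: if the photon spectrum is

$$N(E) \propto E^{-\Gamma} \;\;[\mathrm{photons\;per\;unit\;energy}],$$

then the energy spectrum is F(E) = E·N(E) ∝ E^(−(Γ−1)), so the two indices indeed differ by 1.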
10. Satellite observatories.

– If we have the unit vector usky of some direction in the sky basis, and we want to calculate where that direction is in the spacecraft basis, all we need to do is multiply by the attitude matrix A:
$$\mathbf{u}_\mathrm{s/c} = A\,\mathbf{u}_\mathrm{sky}.$$
– To understand why this is, consider that the 1st component of us/c must be the scalar (dot) product of the s/c X axis vector ax (which is the 1st row of A) with usky; and so for the other two components of us/c.
– We can go further, and calculate the components of u in the instrument frame by multiplying by the boresight matrix B in similar fashion:
$$\mathbf{u}_\mathrm{inst} = B\,\mathbf{u}_\mathrm{s/c} = B A\,\mathbf{u}_\mathrm{sky};$$
just remember that matrix multiplication does not commute! AB is NOT the same as BA.
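A minimal numpy sketch (the attitude matrix here is an invented 30° rotation, purely for illustration):

```python
import numpy as np

# Attitude matrix A: its rows are the s/c X, Y, Z axes in the sky basis.
c, s = np.cos(np.radians(30.0)), np.sin(np.radians(30.0))
A = np.array([[  c,   s, 0.0],
              [ -s,   c, 0.0],
              [0.0, 0.0, 1.0]])

u_sky = np.array([1.0, 0.0, 0.0])   # a direction in the sky basis
u_sc = A @ u_sky                    # the same direction in the s/c basis
assert np.isclose(u_sc[0], A[0] @ u_sky)   # 1st component = dot product

B = np.eye(3)                       # hypothetical boresight matrix
u_inst = B @ A @ u_sky              # order matters: B @ A, not A @ B
```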
• Time variation:
– An important fact is that the shortest time-scale Δt on which the flux varies carries information about the size D of the object.
• D cannot be larger than cΔt, because time variation implies that some property of the object is changing in concert. But 'news' of such a change cannot travel from one side of the object to the other faster than light.
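As a worked example with an assumed Δt: if a source's flux varies on a time-scale of one hour, then

$$D \le c\,\Delta t = (3\times10^{8}\,\mathrm{m\,s^{-1}})(3600\,\mathrm{s}) \approx 1.1\times10^{12}\,\mathrm{m} \approx 7\,\mathrm{AU}.$$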
• Binned data:
– Dithering or randomization of position within the bin is useful to prevent Moiré effects on rebinning.
11. Astronomical data.

– Eg suppose you have a small histogram with four bins which have initial values [3, 15, 21, 16]. Their uncertainties are [1.7, 3.9, 4.6, 4.0] (ie the square roots). If you divide all values by the total, 55, you should also divide the uncertainties by 55. (This is simple propagation of uncertainties; see the sketch below.)
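The same example in numpy (mirroring the slide's simple recipe of dividing both values and uncertainties by the fixed total):

```python
import numpy as np

counts = np.array([3.0, 15.0, 21.0, 16.0])
sigma = np.sqrt(counts)        # Poisson uncertainties: [1.7, 3.9, 4.6, 4.0]

total = counts.sum()           # 55
normed = counts / total
normed_sigma = sigma / total   # scale the uncertainties by the same factor
print(normed, normed_sigma)
```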
• Cumulative histograms:
– Here again one can calculate uncertainties based on the square roots.
• Have a care though, because cumulative values are no longer uncorrelated – the value in one bin depends on others above or below it.
• Because of this, it is easy to over-estimate the degree to which a cumulative histogram deviates from what is expected.
• Data presentation and graphing:
– ALWAYS show uncertainties. Most data are useless without them.
– If you have some data values yi which you expect to be related to other values xi by some simple rule, you can of course graph y against x. However, it is often preferable to transform the x and/or y values such that the transformed coordinates x' and y' have a linear relationship, then plot y' against x'.
– Why? Because much of the time you are interested in seeing whether the supposed relationship holds:
1. It is much easier to tell by eye whether points lie on a straight line than to judge between various curves;
2. A straight line is also more straightforward to fit.
• Two examples:
1. Faraday rotation. If you have a set of xs which are wavelengths and ys which are the associated polarization angles, then you expect these to be related by y = A + Bx², for some constants A and B. Better to transform y' = y, x' = x², which obey the linear relation y' = A + Bx'.
2. A power law. If x is wavelength once again and y is flux, a power-law relation between them is y = Ax^α. Transforming y' = log(y) and x' = log(x) gives the linear relation y' = log(A) + αx'.
– If you transform y, you should of course also transform the uncertainties σy (using the propagation-of-errors formula), as in the sketch below.
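A sketch of the power-law example, including the transformed uncertainties (the data values are invented; for y' = log(y), propagation gives σy' = σy/y):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])          # eg wavelengths
y = np.array([3.1, 1.4, 0.8, 0.39, 0.21])         # eg fluxes, roughly 3 x^-1
sigma_y = np.array([0.3, 0.15, 0.08, 0.04, 0.02])

xp = np.log(x)                  # x' = log(x)
yp = np.log(y)                  # y' = log(y)
sigma_yp = sigma_y / y          # propagated uncertainty on y'

# Weighted straight-line fit y' = log(A) + alpha * x':
alpha, logA = np.polyfit(xp, yp, deg=1, w=1.0 / sigma_yp)
print(np.exp(logA), alpha)      # estimates of A and alpha
```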