
    Advanced Matlab:

    Exploratory Data Analysis and Computational Statistics

    Mark Steyvers

    January 14, 2015


    Contents

I Exploratory Data Analysis

1 Basic Data Analysis
  1.1 Organizing and Summarizing Data
  1.2 Visualizing Data

2 Dimensionality Reduction
  2.1 Independent Component Analysis
    2.1.1 Applications of ICA
    2.1.2 How does ICA work?
    2.1.3 Matlab examples

II Probabilistic Modeling

3 Sampling from Random Variables
  3.1 Standard distributions
  3.2 Sampling from non-standard distributions
    3.2.1 Inverse transform sampling with discrete variables
    3.2.2 Inverse transform sampling with continuous variables
    3.2.3 Rejection sampling

4 Markov Chain Monte Carlo
  4.1 Monte Carlo integration
  4.2 Markov Chains
  4.3 Putting it together: Markov chain Monte Carlo
  4.4 Metropolis Sampling
  4.5 Metropolis-Hastings Sampling
  4.6 Metropolis-Hastings for Multivariate Distributions
    4.6.1 Blockwise updating
    4.6.2 Componentwise updating
  4.7 Gibbs Sampling

5 Basic concepts in Bayesian Data Analysis
  5.1 Parameter Estimation Approaches
    5.1.1 Maximum Likelihood
    5.1.2 Maximum a posteriori
    5.1.3 Posterior Sampling
  5.2 Example: Estimating a Weibull distribution

6 Directed Graphical Models
  6.1 A Short Review of Probability Theory
  6.2 The Burglar Alarm Example
    6.2.1 Conditional probability tables
    6.2.2 Explaining away
    6.2.3 Joint distributions and independence relationships
  6.3 Graphical Model Notation
    6.3.1 Example: Consensus Modeling with Gaussian variables

7 Sequential Monte Carlo
  7.1 Hidden Markov Models
    7.1.1 Example HMM with discrete outcomes and states
    7.1.2 Viterbi Algorithm
  7.2 Bayesian Filtering
  7.3 Particle Filters
    7.3.1 Sampling Importance Resampling (SIR)
    7.3.2 Direct Simulation


    Note to Students

    Exercises

This course book contains a number of exercises in which you are asked to simulate Matlab code, produce new code, as well as produce graphical illustrations and answers to questions.

    The exercises marked with ** are optional exercises that can be skipped when time is limited.

    Matlab documentation

It will probably happen many times that you will need to find the name of a Matlab function or a description of the input and output variables for a given Matlab function. It is strongly recommended to have the Matlab documentation running in a separate window for quick consultation. You can access the Matlab documentation by typing doc in the command window. For specific help on a given Matlab function, such as the function fprintf, you can type doc fprintf to get a help screen in the Matlab documentation window, or help fprintf to get a description in the Matlab command window.

    Organizing answers to exercises

It is helpful to maintain a document that organizes all the material related to the exercises. Matlab can facilitate part of this organization using the publish option. For example, if you have a Matlab script that produces a figure, you can publish the code as well as the figure produced by the code to a single external document such as a pdf. You can find the publishing option in the Matlab editor under the publish menu. You can change the publish configuration (look under the file menu of the editor window) to produce pdfs by changing the output file format under the edit configurations window.
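The publishing step can also be scripted with the publish function; a minimal sketch, where 'myscript.m' is a placeholder for one of your own script files:

% Publish a script (code plus generated figures) to a pdf document
% 'myscript.m' is a placeholder for your own script file
publish( 'myscript.m' , 'pdf' );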


    Part I

    Exploratory Data Analysis


    Chapter 1

    Basic Data Analysis

1.1 Organizing and Summarizing Data

When analyzing any kind of data, it is important to make the data available in an intuitive representation that is easy to change. It is also useful when sharing data with other researchers to package all experimental or modeling data into a single container that can be easily shared and documented. Matlab has two ways to package data into a single variable that can contain many different types of variables. One method is to use structures. Another method that was recently introduced in Matlab is based on tables.

In tables, standard indexing can be used to access individual elements or sets of elements. Suppose that t is a table. The element in the fourth row, second column can be accessed by t(4,2). The result of this could be a scalar value, a string, a cell array, or whatever type of variable is stored in that element of the table. To access the second row of the table, you can use t(2,:). Similarly, to access the second column of the table, you can use t(:,2). Note that the last two operations return another table as a result.

Another way to access tables is by using the names of the columns. Suppose the name of the second column is Gender. To access all the values from this column, you can use t.Gender(:). Below is some Matlab code to create identical data representations using either a structure or a table. The gray text shows the resulting output produced in the command window. From the example code, it might not be apparent what the relative advantages and disadvantages of structures and tables are. The following exercises hopefully make it clearer why using the new table format might be advantageous. It will be useful to read the Matlab documentation on structures and tables under /LanguageFundamentals/DataTypes/Structures and /LanguageFundamentals/DataTypes/Tables.

    % Create a structure "d" with some example data

d.Age = [ 32 24 50 60 ]';

d.Gender = { 'Male' , 'Female' , 'Female' , 'Male' }';

d.ID = [ 215 433 772 445 ]';

    5

  • 7/25/2019 AdvancedMatlab.pdf

    7/84

    CHAPTER 1. BASIC DATA ANALYSIS 6

    % Show d

    disp( d )

% Create a table "t" with the same information
t = table( [ 32 24 50 60 ]' , ...
    { 'Male' , 'Female' , 'Female' , 'Male' }' , ...
    [ 215 433 772 445 ]' , ...
    'VariableNames' , { 'Age' , 'Gender' , 'ID' } );

    % Show this table

    disp( t )

    % Copy the Age values to a variable X

    x = t.Age

    % Extract the second row of the table

row = t( 2 , : )

    Age: [4x1 double]

    Gender: {4x1 cell}

    ID: [4x1 double]

    Age Gender ID

    ___ ________ ___

    32 Male 215

    24 Female 433

    50 Female 772

    60 Male 445

x =

    32
    24
    50
    60

row =

    Age    Gender    ID
    ___    ______    ___

    24     Female    433

    Exercises

1. In this exercise, we will load some sample data into Matlab and represent the data internally in a structure. In Matlab, execute the following command to create the structure d, which contains some sample data about patients: d = table2struct( readtable('patients.dat'), 'ToScalar', true ); Show in a single Matlab script how to a) calculate the mean age of the males, b) delete the data entries that correspond to smokers, and c) sort the entries according to age.

2. Let's repeat the last exercise but now represent the data internally with a table. In Matlab, execute the following command to create the table t, which contains the same data about patients: t = readtable('patients.dat'); Show in a single Matlab script how to a) extract the first row of the table, b) extract the numeric values of Age from the Age column, c) calculate the mean age of the males, d) delete the data entries that correspond to smokers, and e) sort the entries according to age. What is the advantage of using the table representation?

3. With the table representation of the patient data, use the tabulate function to calculate the frequency distribution of locations. What percentage of patients are located at the VA hospital?

4. Use the crosstab function to calculate the contingency table of Gender by Smoker. How many female smokers are there in the sample?

5. Use the prctile function to calculate the 25th and 75th percentiles of the weights in the sample.

    1.2 Visualizing Data

    Exercises

For these exercises, we will use data from the Human Connectome Project (HCP). This data is accessible in Excel format from http://psiexp.ss.uci.edu/research/programs_data/hcpdata1.xlsx. In the subset of the HCP data set that we will look at, there are 500 subjects for which the gender, age, height, and weight of each individual subject are recorded. Save the Excel file to a local directory that you can access with Matlab. You can load the data into Matlab using t = readtable( 'hcpdata1' ); For these exercises, it will be helpful to read the documentation of the histogram, scatter, and normpdf functions.
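As a hint for the first two exercises, the histogram function can plot probabilities instead of counts and accepts a fixed bin width; a minimal sketch, assuming the table t from above and that gender is coded as 'M' and 'F' in a Gender column:

% Probability-normalized histogram of male weights with 5-lb bins
t = readtable( 'hcpdata1' );
w = t.Weight( strcmp( t.Gender , 'M' ) ); % weights of the male subjects
histogram( w , 'Normalization' , 'probability' , 'BinWidth' , 5 );
xlabel( 'Weight (lbs)' ); ylabel( 'Probability' );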


    Figure 1.1: Example visualization of an empirical distribution of two different samples


Figure 1.2: Example visualization of an empirical distribution and a theoretical population distribution



    Figure 1.3: Example boxplot visualization

1. Recreate Figure 1.1 as best as possible. This figure shows the histogram of weights for males and females. Note that the vertical axis shows probabilities, not counts. The width of the bins is 5 lbs.

2. Recreate Figure 1.2 as best as possible. This figure shows the histogram of heights (regardless of gender). Note that the vertical axis shows density. The figure also has an overlay of the population distribution of adult height. For this population distribution, you can use a Normal distribution with a mean and standard deviation of 66.6 and 4.7 inches, respectively.

3. Recreate Figure 1.3 as best as possible using the boxplot function.

4. Recreate Figure 1.4, upper panel, as best as possible. This figure shows a scatter plot of the heights and weights for each gender.

5. Find some data that you find interesting and visualize it with a custom Matlab figure where you change some of the default parameters.

** 6 Recreate Figure 1.4, bottom panel, as best as possible. Note that this figure is a better visualization of the relationship between two variables in case there are several (x,y) pairs that are identical or visually very similar. In this plot, also known as a bubble plot, the bubble size indicates the frequency of encountering an (x,y) pair in a particular region of the space. One way to approach this problem is to use the scatter function and scale the sizes of the markers with the observation frequency. In this particular visualization, the markers were made transparent to help visualize the bubbles for multiple groups.


Figure 1.4: Example visualizations with scatter plots using the standard scatter function (top) and a custom-made bubble plot function with transparent patch objects (bottom)


    Chapter 2

    Dimensionality Reduction

An important part of exploratory data analysis is to get an understanding of the structure of the data, especially when a large number of variables or measurements are involved. Modern data sets are often high-dimensional. For example, in neuroimaging studies involving EEG, brain signals are measured at many (often 100+) electrodes on the scalp. In fMRI studies, the BOLD signal is typically measured for over 100K voxels. When analyzing text documents, the raw data might consist of counts of words in different documents, which can lead to extremely large matrices (e.g., how many times is the word X mentioned in document Y).

With these many measurements, it is challenging to visualize and understand the raw data. In the absence of any specific theory to analyze the data, it can be very useful to apply dimensionality-reduction techniques. Specifically, the goal might be to find a low-dimensional (simpler) description of the original high-dimensional data.

There are a number of dimensionality reduction techniques. We will discuss two standard approaches: independent component analysis (ICA) and principal component analysis (PCA).

    2.1 Independent Component Analysis

Note: the material in this section is based on a tutorial on ICA by Hyvarinen and Oja (2000), which can be found at www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf, and material from the Computational Statistics Handbook with Matlab by Martinez and Martinez.

The easiest way to understand ICA is to think about the blind-source separation problem. In this context, a number of signals (i.e., measurements or variables) are observed, and each signal is believed to be a linear combination of some unobserved source signals. The goal in blind-source separation is to infer the unobserved source signals without the aid of information about the nature of the signals.

Let's give a concrete example through the cocktail-party problem (see Figure 2.1). Suppose you are attending a party with several simultaneous conversations. Several microphones, located at different places in the room, are simultaneously recording the conversations. Each microphone recording can be considered as a linear mixture of each independent conversation. How can we infer what the original conversations were at each location from the observed mixtures?


    Figure 2.1: Illustration of the blind-source separation problem


    2.1.1 Applications of ICA

There are a number of applications of ICA that are related to blind-source separation. First, ICA has been applied to natural images in order to understand the statistical properties of real-world natural scenes. The independent components extracted from natural scenes turn out to be qualitatively similar to the visual receptive fields found in early visual cortex (see Figure 2.2), suggesting that the visual cortex might be operating in a manner similar to ICA to find the independent structure in visual scenes.

Another application of ICA is in the analysis of resting state data in fMRI analysis. For example, when subjects are not performing any particular task (they are "resting") while in the scanner, the correlations in the hemodynamic response pattern across brain regions suggest the presence of functional networks that might support a variety of cognitive functions. ICA has been applied to the BOLD activation to find the independent components of brain activation (see Figure 2.3), where each independent component is hypothesized to represent a functional network.

Finally, another application of ICA is to remove artifacts in neuroimaging data. For example, when recording EEG data, eye blinks, muscle activity, and heart noise can significantly contaminate the EEG signal. Figure 2.4 shows a 3-second portion of EEG data across 20 locations. The bottom two time series in the left panel show the electrooculographic (EOG) signal that measures eye blinks. Note that an eye-blink occurs around 1.8 seconds. Some of the derived ICA components show a sensitivity to the occurrence of these eye-blinks (for example, IC1 and IC2). The EEG signals can be corrected by removing the signals corresponding to the bad independent components.


    Figure 2.2: Independent components extracted from natural images

Figure 2.3: Independent components extracted from resting state fMRI (from Storti et al. (2013), Frontiers in Neuroscience)


Figure 2.4: Recorded EEG time series and its ICA component activations, the scalp topographies of four selected components, and the artifact-corrected EEG signals obtained by removing four selected EOG and muscle noise components from the data. From Jung et al. (2000). Psychophysiology, 37, 173-178.

Therefore, ICA can serve as a preprocessing tool that removes undesirable artifacts from the data.

    2.1.2 How does ICA work?

In this section, we will illustrate the ideas behind ICA without any in-depth mathematical treatment. To go back to the problem of blind-source separation, suppose there are two source signals, denoted by s1(t) and s2(t), and two microphones that record the mixture signals x1(t) and x2(t). Let's assume that each of the recorded signals is a linear weighted combination of the source signals. We can then express the relationship between x and s as a linear equation:

x1(t) = a11 s1(t) + a12 s2(t)
x2(t) = a21 s1(t) + a22 s2(t)    (2.1)

Note that a11, a12, a21, and a22 are weighting parameters that determine how the original signals are mixed. The problem now is to estimate the original signals s from just the


observed mixtures x, without knowing the mixing parameters a and having very little knowledge about the nature of the original signals. Independent component analysis (ICA) is one technique to approach this problem.

Note that ICA is formulated as a linear model: the original source signals are combined in a linear fashion to obtain the observed signals. We can rewrite the model such that the observed signals are the inputs and we use weights w11, w12, w21, and w22 to create new outputs y1(t) and y2(t):

y1(t) = w11 x1(t) + w12 x2(t)
y2(t) = w21 x1(t) + w22 x2(t)    (2.2)

    We can also write the previous two equations in matrix form:

Y = W X
X = A S    (2.3)

Through linear algebra¹, it can be shown that if we set the weight matrix W to the inverse of the original mixing matrix A, then Y = S. This means that there exists a set of weights W that will transform the observed mixtures X into the source signals S. The question now is how to find the appropriate set of weights W. In the examples below, we will first show some demonstrations of ICA with a function that will automatically produce the independent components. We will then give some example code that will illustrate the method behind ICA.
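As a quick numerical check of this identity, the following sketch (with an arbitrary invertible mixing matrix A, reusing the mixing constants from Listing 2.1 below) mixes two signals and recovers them exactly with W = inv(A):

% Check that W = inv(A) recovers the sources: Y = W*X = inv(A)*A*S = S
S = [ sin(linspace(0,10,1000)) ; sin(linspace(0,17,1000)+5) ]; % two source signals
A = [ 1.00 -2.00 ; 1.73 3.41 ]; % mixing matrix (constants from Listing 2.1)
X = A * S;      % observed mixtures
W = inv( A );   % unmixing weights
Y = W * X;      % reconstructed sources
max( abs( Y(:) - S(:) ) ) % should be zero up to numerical round-off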

2.1.3 Matlab examples

There are many different toolboxes for ICA. We will use the fastica toolbox because it is simple and fast. It is available at http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml. Note that when you need to apply ICA to specialized tasks such as EEG artifact removal or resting-state analysis, there are more suitable Matlab packages for these tasks.

To get started, let's give a few demonstrations of ICA in the context of blind source separation. Listing 2.1 shows Matlab code that produces the visual output shown in Figure 2.5. In this example, there are two input signals corresponding to sine waves of different wavelengths. These two signals are weighted to produce two mixtures x1 and x2. These are provided as inputs to the function fastica, which produces the independent components y1 and y2. Note how the reconstructed signals are not quite the same as the original signals. The first independent component (y1) is similar to the second source signal (s2), and the second independent component (y2) is similar to the first source signal (s1), but in both cases the amplitudes are different. This highlights two properties of ICA: the original ordering and variances of the source signals cannot be recovered.

¹ Note that if W = A^(-1), we can write A^(-1) X = A^(-1) A S = S.


    Listing 2.1: Demonstration of ICA to mixed sine wave signals.

%% ICA Demo 1: Mix two sine waves and unmix them with fastica
% (file = icademo1.m)

% Create two source signals (sinewaves with different frequencies)
s1 = sin(linspace(0,10, 1000));
s2 = sin(linspace(0,17, 1000)+5);

rng( 1 ); % set the random seed for replicability

% plot the source signals
figure(1); clf;
subplot(2,3,1); plot(s1,'r'); ylabel( 's1' );
title( 'Original Signals' );
subplot(2,3,4); plot(s2,'r'); ylabel( 's2' );

% Mix these sources to create two observed signals
x1 = 1.00*s1 - 2.00*s2; % mixed signal 1
x2 = 1.73*s1 + 3.41*s2; % mixed signal 2
subplot(2,3,2); plot(x1); ylabel( 'x1' ); % plot observed mixed signal 1
title( 'Mixed Signals' );
subplot(2,3,5); plot(x2); ylabel( 'x2' ); % plot observed mixed signal 2

% Apply ICA using the fastica function
y = fastica([x1;x2]);

% plot the unmixed (reconstructed) signals
subplot(2,3,3); plot(y(1,:),'g'); ylabel( 'y1' )
title( 'Reconstructed Signals' );
subplot(2,3,6); plot(y(2,:),'g'); ylabel( 'y2' )

Figure 2.5: Output of Matlab code of ICA applied to sine-wave signals


To motivate the method behind ICA, look at the Matlab code in Listing 2.2, which produces the visual output shown in Figure 2.6 and Figure 2.7. In this case, we take uniformly distributed random number sequences as independent source signals s1 and s2, and we linearly combine them to produce the mixtures x1 and x2. Figure 2.6 shows the marginal and joint distribution of the source signals. This is exactly as you would expect: the marginal distributions are approximately uniform (because we used uniform distributions) and the joint distribution shows no dependence between s1 and s2, because we independently generated the two source signals. The marginal and joint distribution of the mixtures looks very different, as shown in Figure 2.7. In this case, the joint distribution reveals a dependence between the two x-values: learning about one x value gives us information about possible values of the other x value. The key observation here is about the marginal distributions of x1 and x2. Note that these distributions are more Gaussian shaped than the original source distributions. There are good theoretical reasons to expect this result: the Central Limit Theorem, a classical result in probability theory, tells us that the distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions).

That brings us to the key idea behind ICA: nongaussian is independent. If we assume that the source signals are non-Gaussian (a reasonable assumption for many natural signals), and mixtures of non-Gaussians will look more Gaussian, then we can try to find linear combinations of the mixture signals that will make the resulting outputs (Y) look more non-Gaussian. In other words, by finding weights W that make Y look less Gaussian, we can find the original source signals S.

How do we achieve this? There are a number of ways to assess how non-Gaussian a signal is. One is by measuring the kurtosis of a signal. The kurtosis of a random variable is a measure of the "peakedness" of the probability distribution. Gaussian random variables are symmetric and have zero kurtosis (using the definition in which the Gaussian baseline is subtracted out). Non-Gaussian random variables (generally) have non-zero kurtosis. The methods behind ICA find the weights W that maximize the kurtosis in order to find one of the independent components. This component is then subtracted from the remaining mixture, and the process of optimizing the weights is repeated to find the remaining independent components.
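A quick numerical illustration of both points, using the kurtosis function from the Statistics Toolbox (note that this function returns 3 for a Gaussian, so 3 is subtracted here to obtain the zero-centered "excess" kurtosis):

% Means of non-Gaussian variables become more Gaussian (lower excess kurtosis)
x = exprnd( 1 , 1 , 100000 );             % exponential draws (non-Gaussian)
m = mean( exprnd( 1 , 5 , 100000 ) , 1 ); % each value is the mean of 5 draws
kurtosis( x ) - 3 % clearly positive excess kurtosis (about 6 for the exponential)
kurtosis( m ) - 3 % much closer to 0, i.e., more Gaussian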

    Exercises

For some of the exercises, you will need the fastica toolbox, which is available from http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml. Unzip the files into some folder accessible to you and add the folder to the Matlab path.

1. Create some example code to demonstrate that sums of non-Gaussian random variables become more Gaussian. For example, create some random numbers from an exponential distribution (or some other non-Gaussian distribution). Plot a histogram of these random numbers. Now create new random numbers where each random number is the mean of (say) 5 random numbers. Plot a histogram of these new random numbers. Do they look more Gaussian? Check that the kurtosis of the new random numbers is lower.


Listing 2.2: Matlab code to visualize joint and marginal distribution of source signals and mixed signals.

%% ICA Demo
% (file = icademo3.m)
% Plot the marginal distributions of the original and the mixture

% Create two source signals (uniform distributions)
s1 = unifrnd( 0, 1, 1,1000 );
s2 = unifrnd( 0, 1, 1,1000 );

% make the signals of equal length
minsize = min( [ length( s1 ) length( s2 ) ] );
s1 = s1( 1:minsize ); s2 = s2( 1:minsize );

% normalize the variance
s1 = s1 / std( s1 ); s2 = s2 / std( s2 );

% Mix these sources to create two observed signals
x1 = 1.00*s1 - 2.00*s2; % mixed signal 1
x2 = 1.73*s1 + 3.41*s2; % mixed signal 2

figure(3); clf; % and plot the source signals
scatterhist( s1 , s2 , 'Marker', '.' );
title( 'Joint and marginal distribution of s1 and s2' );

figure(4); clf; % and plot the mixed signals
scatterhist( x1 , x2 , 'Marker', '.' );
title( 'Joint and marginal distribution of x1 and x2' );


Figure 2.6: Joint and marginal distribution of two uniformly distributed source signals


Figure 2.7: Joint and marginal distribution of linear combinations of two uniformly distributed source signals


2. Adapt the example code in Listing 2.1 to work with sound files. Matlab has some example audio files available (although you should feel free to use your own). For example, the code

s1 = load( 'chirp' ); s1 = s1.y';
s2 = load( 'handel' ); s2 = s2.y';

will load a chirp signal and part of the Hallelujah chorus from Handel. These signals are not of equal length yet, so you'll need to truncate the longer signal to make them of equal length. Check that the reconstructed audio signals sound like the original ones. You can use the soundsc function to play vectors as sounds.

3. Adapt the example code in Listing 2.2 to work with sound files. Check that the marginal distributions of the mixtures are more Gaussian than the original source signals by computing the kurtosis.

4. Is the ICA procedure sensitive to the ordering of the measurements? For example, suppose we randomly scramble the temporal ordering of the sound files in the last exercise, but we apply the same scramble to each sound signal (e.g., the amplitude at time 1 might become the amplitude at time 65). Obviously, the derived independent components will be different, but suppose we unscramble the independent components (e.g., the amplitude at time 65 now becomes the amplitude at time 1); are the results the same? What is the implication of this result?

5. Find some neuroimaging data, such as EEG data from multiple electrodes or fMRI data, and apply ICA to find the independent components. Alternatively, find a set of (gray-scale) natural images and apply ICA to the set of images (you will have to convert the two-dimensional image values to a one-dimensional vector of gray-scale values).


    Part II

    Probabilistic Modeling



    Chapter 3

    Sampling from Random Variables

Probabilistic models proposed by researchers are often too complicated for analytic approaches. Increasingly, researchers rely on computational, numerical methods when dealing with complex probabilistic models. By using a computational approach, the researcher is freed from making unrealistic assumptions required for some analytic techniques (e.g., normality and independence).

The key to most approximation approaches is the ability to sample from distributions. Sampling is needed to predict how a particular model will behave under some set of circumstances, and to find appropriate values for the latent variables (parameters) when applying models to experimental data. Most computational sampling approaches turn the problem of sampling from complex distributions into subproblems involving simpler sampling distributions. In this chapter, we will illustrate two sampling approaches: the inverse transform method and rejection sampling. These approaches are appropriate mostly for the univariate case where we are dealing with single-valued outcomes. In the next chapter, we discuss Markov chain Monte Carlo approaches that can operate efficiently with multivariate distributions.

    3.1 Standard distributions

Some distributions are used so often that they have become part of a standard set of distributions supported by Matlab. The Matlab Statistics Toolbox supports a large number of probability distributions. Using Matlab, it becomes quite easy to calculate the probability density and cumulative density of these distributions, and to sample random values from them. Table 3.1 lists some of the standard distributions supported by Matlab. The Matlab documentation lists many more distributions that can be simulated with Matlab. Using online resources, it is often easy to find support for a number of other common distributions.

To illustrate how we can use some of these functions, Listing 3.1 shows Matlab code that visualizes the Normal(μ, σ) distribution where μ = 100 and σ = 15. To make things concrete, imagine that this distribution represents the observed variability of IQ coefficients in some population. The code shows how to display the probability density and the cumulative density.


Table 3.1: Examples of Matlab functions for evaluating probability density, cumulative density and drawing random numbers

Distribution           PDF        CDF        Random Number Generation
Normal                 normpdf    normcdf    normrnd
Uniform (continuous)   unifpdf    unifcdf    unifrnd
Beta                   betapdf    betacdf    betarnd
Exponential            exppdf     expcdf     exprnd
Uniform (discrete)     unidpdf    unidcdf    unidrnd
Binomial               binopdf    binocdf    binornd
Multinomial            mnpdf                 mnrnd
Poisson                poisspdf   poisscdf   poissrnd

Figure 3.1: Illustration of the Normal(μ, σ) distribution where μ = 100 and σ = 15.

It also shows how to draw random values from this distribution and how to visualize the distribution of these random samples using the hist function. The code produces the output as shown in Figure 3.1. Similarly, Figure 3.2 visualizes the discrete distribution Binomial(N, θ) where N = 10 and θ = 0.7. The binomial distribution arises in situations where a researcher counts the number of successes out of a given number of trials. For example, the Binomial(10, 0.7) distribution represents a situation where we have 10 total trials and the probability of success at each trial, θ, equals 0.7.

    Exercises

1. Adapt the Matlab program in Listing 3.1 to illustrate the Beta(α, β) distribution where α = 2 and β = 3. Similarly, show the Exponential(λ) distribution where λ = 2.

2. Adapt the Matlab program above to illustrate the Binomial(N, θ) distribution where N = 10 and θ = 0.7. Produce an illustration that looks similar to Figure 3.2.

3. Write a demonstration program to sample 10 values from a Bernoulli(θ) distribution with θ = 0.3. Note that the Bernoulli distribution is one of the simplest discrete


    Listing 3.1: Matlab code to visualize Normal distribution.

%% Explore the Normal distribution N( mu , sigma )
mu = 100; % the mean
sigma = 15; % the standard deviation
xmin = 70; % minimum x value for pdf and cdf plot
xmax = 130; % maximum x value for pdf and cdf plot
n = 100; % number of points on pdf and cdf plot
k = 10000; % number of random draws for histogram

% create a set of values ranging from xmin to xmax
x = linspace( xmin , xmax , n );
p = normpdf( x , mu , sigma ); % calculate the pdf
c = normcdf( x , mu , sigma ); % calculate the cdf

figure( 1 ); clf; % create a new figure and clear the contents

subplot( 1,3,1 );
plot( x , p , 'k' );
xlabel( 'x' ); ylabel( 'pdf' );
title( 'Probability Density Function' );

subplot( 1,3,2 );
plot( x , c , 'k' );
xlabel( 'x' ); ylabel( 'cdf' );
title( 'Cumulative Density Function' );

% draw k random numbers from a N( mu , sigma ) distribution
y = normrnd( mu , sigma , k , 1 );

subplot( 1,3,3 );
hist( y , 20 );
xlabel( 'x' ); ylabel( 'frequency' );
title( 'Histogram of random values' );

Figure 3.2: Illustration of the Binomial(N, θ) distribution where N = 10 and θ = 0.7.


distributions to simulate. There are only two possible outcomes, 0 and 1. With probability θ, the outcome is 1, and with probability 1 − θ, the outcome is 0. In other words, p(X = 1) = θ, and p(X = 0) = 1 − θ. This distribution can be used to simulate outcomes in a number of situations, such as head or tail outcomes from a weighted coin, correct/incorrect outcomes from true/false questions, etc. In Matlab, you can simulate the Bernoulli distribution using the binomial distribution with N = 1. However, for the purpose of this exercise, please write the code needed to sample Bernoulli distributed values that does not make use of the built-in binomial distribution.

4. It is often useful in simulations to ensure that each replication of the simulation gives the exact same result. In Matlab, when drawing random values from distributions, the values are different every time you restart the code. There is a simple way to seed the random number generators to ensure that they produce the same sequence. Write a Matlab script that samples two sets of 10 random values drawn from a uniform distribution between [0,1]. Use the seeding function between the two sampling steps to demonstrate that the two sets of random values are identical. Your Matlab code could use the following line:

    seed=1; rng( seed );

5. Suppose we know from previous research that in a given population, IQ coefficients are Normally distributed with a mean of 100 and a standard deviation of 15. Calculate the probability that a randomly drawn person from this population has an IQ greater than 110 but smaller than 130. You can achieve this using one line of Matlab code. What does this look like?

** 6 The Dirichlet distribution is currently not supported by Matlab. Can you find a Matlab function, using online resources, that implements sampling from a Dirichlet distribution?

    3.2 Sampling from non-standard distributions

Suppose we wish to sample from a distribution that is not one of the standard distributions supported by Matlab. In modeling situations, this frequently arises because a researcher can propose new noise processes or combinations of existing distributions. Computational methods for solving complex sampling problems often rely on sampling distributions that we do know how to sample from efficiently. The random values from these simple distributions can then be transformed or compared to the target distribution. In fact, some of the techniques discussed in this section are used by Matlab internally to sample from distributions such as the Normal and Exponential distributions.


    3.2.1 Inverse transform sampling with discrete variables

Inverse transform sampling (also known as the inverse transform method) is a method for generating random numbers from any probability distribution given the inverse of its cumulative distribution function. The idea is to sample uniformly distributed random numbers (between 0 and 1) and then transform these values using the inverse cumulative distribution function. The simplicity of this procedure lies in the fact that the underlying sampling is just based on transformed uniform deviates. This procedure can be used to sample many different kinds of distributions. In fact, this is how Matlab implements many of its random number generators.

It is easiest to illustrate this approach with a discrete distribution where we know the probability of each individual outcome. In this case, the inverse transform method just requires a simple table lookup.
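For a small distribution this lookup takes only a few lines; a minimal sketch with a made-up three-outcome distribution:

% Inverse transform sampling for a discrete distribution via table lookup
p = [ 0.2 0.5 0.3 ];    % made-up probabilities for outcomes 1, 2, 3
F = cumsum( p );        % cumulative probabilities 0.2, 0.7, 1.0
u = rand;               % uniform deviate between 0 and 1
x = find( u <= F , 1 ); % smallest outcome whose cumulative probability covers u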

To give an example of a non-standard discrete distribution, we use some data from experiments that have looked at how well humans can produce uniform random numbers (e.g., Treisman and Faulkner, 1987). In these experiments, subjects produce a large number of random digits (0, .., 9) and investigators tabulate the relative frequencies of each random digit produced. As you might suspect, subjects do not always produce uniform distributions. Table 3.2 shows some typical data. Some of the low and high numbers are underrepresented while some specific digits (e.g., 4) are overrepresented. For some reason, the digits 0 and 9 were never generated by the subject (perhaps because the subject misinterpreted the instructions). In any case, this data is fairly typical and demonstrates that humans are not very good at producing uniformly distributed random numbers.

Table 3.2: Probability of digits observed in a human random digit generation experiment. The generated digit is represented by X; p(X) and F(X) are the probability mass and cumulative probabilities, respectively. The data was estimated from subject 6, session 1, in the experiment by Treisman and Faulkner (1987).

X   p(X)    F(X)
0   0.000   0.000
1   0.100   0.100
2   0.090   0.190
3   0.095   0.285
4   0.200   0.485
5   0.175   0.660
6   0.190   0.850
7   0.050   0.900
8   0.100   1.000
9   0.000   1.000

    Suppose we now want to mimic this process and write an algorithm that samples digits


according to the probabilities shown in Table 3.2. Therefore, the program should produce a 4 with probability .2, a 5 with probability .175, etc. For example, the code in Listing 3.2 implements this process using the built-in Matlab function randsample. The code produces the illustration shown in Figure 3.3.

Instead of using built-in functions such as randsample or mnrnd, it is helpful to consider how to implement the underlying sampling algorithm using the inverse transform method. We first need to calculate the cumulative probability distribution. In other words, we need to know the probability that we observe an outcome equal to or smaller than some particular value. If F(X) represents the cumulative function, we need to calculate F(X = x) = p(X ≤ x).


Listing 3.2: Matlab code to simulate the human random digit generator.

%% Simulate the distribution observed in the human random digit generation task
theta = [ 0.000 0.100 0.090 0.095 0.200 0.175 0.190 0.050 0.100 0.000 ]; % digit probabilities from Table 3.2
digitset = 0:9; % the set of digits
K = 10000; % number of simulated draws
Y = randsample( digitset , K , true , theta ); % sample digits with the given probabilities

% create a new figure
figure( 1 ); clf;

% Show the histogram of the simulated draws
counts = hist( Y , digitset );
bar( digitset , counts , 'k' );
xlim( [ -0.5 9.5 ] );
xlabel( 'Digit' );
ylabel( 'Frequency' );
title( 'Distribution of simulated draws of human digit generator' );

Figure 3.3: Distribution of simulated draws of the human digit generator.

    Exercises

1. Create the Matlab program that implements the inverse transform method for discrete variables. Use it to sample random digits with probabilities as shown in Table 3.2. In order to show that the algorithm is working, sample a large number of random digits and create a histogram. Your program should never sample digits 0 and 9, as they are given zero probability in the table.

** 2 One solution to the previous exercise that does not require any loops is to use the multinomial random number generator mnrnd. Show how to use this function to sample digits according to the probabilities shown in Table 3.2.


Figure 3.4: Illustration of the inverse transform procedure for generating discrete random variables. Note that we plot the cumulative probabilities for each outcome. If we sample a uniform random number of U = 0.8, then this yields a random value of X = 6.

** 3 Explain why the algorithm as described above might be inefficient when dealing with skewed probability distributions. [hint: imagine a situation where the first N−1 outcomes have zero probability and the last outcome has probability one]. Can you think of a simple change to the algorithm to improve its efficiency?

    3.2.2 Inverse transform sampling with continuous variables

The inverse transform sampling approach can also be applied to continuous distributions. Generally, the idea is to draw uniform random deviates and to apply the inverse of the cumulative distribution function to each random deviate. In the following, let F(X) be the cumulative density function (CDF) of our target variable X and F^(-1)(X) the inverse of this function, assuming that we can actually calculate this inverse. We wish to draw random values for X. This can be done with the following procedure:

1. Draw U ∼ Uniform(0, 1)
2. Set X = F^(-1)(U)
3. Repeat

Let's illustrate this approach with a simple example. Suppose we want to sample random numbers from the exponential distribution. When λ > 0, the cumulative density function is F(x|λ) = 1 − exp(−x/λ). Using some simple algebra, one can find the inverse of this function, which is F^(-1)(u|λ) = −λ log(1 − u). This leads to the following sampling procedure to sample random numbers from an Exponential(λ) distribution:


1. Draw U ∼ Uniform(0, 1)
2. Set X = −λ log(1 − U)
3. Repeat
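In Matlab, this procedure vectorizes to essentially one line; a minimal sketch with an arbitrary choice of λ = 2:

% Inverse transform sampling from an Exponential( lambda ) distribution
lambda = 2;
u = rand( 1 , 10000 );      % uniform deviates on (0,1)
x = -lambda * log( 1 - u ); % apply the inverse CDF to each deviate
hist( x , 50 );             % shape should match exppdf evaluated on a grid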

    Exercises

1. Implement the inverse transform sampling method for the exponential distribution. Sample a large number of values from this distribution, and show the distribution of these values. Compare the distribution you obtain against the exact distribution as obtained by the PDF of the exponential distribution (use the command exppdf).

** 2 Matlab implements some of its own functions using Matlab code. For example, when you call the exponential random number generator exprnd, Matlab executes a function that is stored in its own internal directories. Please locate the Matlab function exprnd and inspect its contents. How does Matlab implement the sampling from the exponential distribution? Does it use the inverse transform method? Note that the path to this Matlab function will depend on your particular Matlab installation, but it probably looks something like

    C:\Program Files\MATLAB\R2009B\toolbox\stats\exprnd.m

    3.2.3 Rejection sampling

In many cases, it is not possible to apply the inverse transform sampling method because it is difficult to compute the cumulative distribution or its inverse. In this case, there are other options available, such as rejection sampling, and methods using Markov chain Monte Carlo approaches that we will discuss in the next chapter. The main advantage of the rejection sampling method is that it does not require any "burn-in" period. Instead, all samples obtained during sampling can immediately be used as samples from the target distribution.

One way to illustrate the general idea of rejection sampling (also commonly called the "accept-reject algorithm") is with Figure 3.5. Suppose we wish to draw points uniformly from within a circle centered at (0, 0) with radius 1. At first, it seems quite complicated to directly sample points within this circle in uniform fashion. However, we can apply rejection sampling by first drawing (x, y) values uniformly from within the square surrounding the circle, and rejecting any samples for which x^2 + y^2 > 1. Note that in this procedure, we used a very simple proposal distribution, such as the uniform distribution, as a basis for sampling from a much more complicated distribution.
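A minimal sketch of this circle example:

% Uniform points in the unit circle by rejecting points outside it
xy = unifrnd( -1 , 1 , 10000 , 2 ); % uniform points in the surrounding square
keep = sum( xy.^2 , 2 ) <= 1;       % accept points with x^2 + y^2 <= 1
xy = xy( keep , : );
plot( xy(:,1) , xy(:,2) , '.' ); axis square;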

Rejection sampling allows us to generate observations from a distribution that is difficult to sample from, but where we can evaluate the probability of any particular sample. In other words, suppose we have a distribution p(θ), and it is difficult to sample from this distribution directly, but we can evaluate the probability density or mass p(θ) for a particular value of θ.


Figure 3.5: Sampling points uniformly from the unit circle using rejection sampling

The first choice that the researcher needs to make is the proposal distribution. The proposal distribution is a simple distribution q(θ) that we can directly sample from. The idea is to evaluate the probability of the proposed samples under both the proposal distribution and the target distribution, and to reject samples that are unlikely under the target distribution relative to the proposal distribution.

Figure 3.6 illustrates the procedure. We first need to find a constant c such that cq(θ) ≥ p(θ) for all possible samples θ. The proposal function q(θ) multiplied by the constant c is known as the comparison distribution and will always lie on top of our target distribution. Finding the constant c might be non-trivial, but let's assume for now that we can do this using some calculus. We now draw a number u from a uniform distribution between [0, cq(θ)]. In other words, this is some point on the line segment between 0 and the height of the comparison distribution evaluated at our proposal θ. We will reject the proposal if u > p(θ) and accept it otherwise. If we accept the proposal, the sampled value θ is a draw from the target distribution p(θ). Here is a summary of the computational procedure:

1. Choose a density q(θ) that is easy to sample from

2. Find a constant c such that cq(θ) ≥ p(θ) for all θ

3. Sample a proposal θ from the proposal distribution q(θ)

4. Sample a uniform deviate u from the interval [0, cq(θ)]

5. Reject the proposal if u > p(θ), accept otherwise

6. Repeat steps 3, 4, and 5 until the desired number of samples is reached; each accepted sample is a draw from p(θ)

    The key to an efficient operation of this algorithm is to have as many samples acceptedas possible. This depends crucially on the choice of the proposal distribution. A proposal


Figure 3.6: Illustration of rejection sampling. The particular sample shown in the figure will be rejected

    distribution that is dissimilar to the target distribution will lead to many rejected samples,slowing the procedure down.
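To make the steps concrete, here is a minimal sketch of the accept-reject loop for a made-up target, the Beta(2, 2) density p(θ) = 6θ(1 − θ) on (0, 1), with a Uniform(0, 1) proposal and c = 1.5 (the maximum of this density):

% Rejection sampling from p(theta) = 6*theta*(1-theta) with a uniform proposal
n = 10000; samples = zeros( 1 , n ); count = 0;
c = 1.5; % c*q(theta) = 1.5 >= p(theta) for all theta in (0,1)
while count < n
    theta = rand;         % step 3: proposal from q = Uniform(0,1)
    u = unifrnd( 0 , c ); % step 4: uniform deviate on [0, c*q(theta)]
    if u <= 6 * theta * (1 - theta) % step 5: accept if u does not exceed p(theta)
        count = count + 1;
        samples( count ) = theta;
    end
end
hist( samples , 50 ); % shape should match betapdf( x , 2 , 2 )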

    Exercises

1. Suppose we want to sample from a Beta(α, β) distribution where α = 2 and β = 1. This gives the probability density p(x) = 2x for 0 < x ≤ 1. Implement a rejection sampling algorithm in Matlab that samples from this distribution using a uniform proposal distribution. Visualize a histogram of the sampled values and check that it matches the target density.

** 2 The procedure shown in Figure 3.5 can serve as the basis of the Box-Muller method for generating Gaussian distributed random variables. We first sample coordinate pairs (x, y) uniformly from the unit circle. Then, for each pair (x, y) we evaluate the quantities z1 = x [−2 ln(x^2 + y^2) / (x^2 + y^2)]^(1/2) and z2 = y [−2 ln(x^2 + y^2) / (x^2 + y^2)]^(1/2). The values z1 and z2 are each Gaussian distributed with zero mean and unit variance. Write a Matlab program that implements this Box-Muller method and verify that the sampled values are Gaussian distributed.


    Chapter 4

    Markov Chain Monte Carlo

The application of probabilistic models to data often leads to inference problems that require the integration of complex, high-dimensional distributions. Markov chain Monte Carlo (MCMC) is a general computational approach that replaces analytic integration by summation over samples generated from iterative algorithms. Problems that are intractable using analytic approaches often become possible to solve using some form of MCMC, even with high-dimensional problems. The development of MCMC is arguably the biggest advance in the computational approach to statistics. While MCMC is very much an active research area, there are now some standardized techniques that are widely used. In this chapter, we will discuss two forms of MCMC: Metropolis-Hastings and Gibbs sampling. Before we go into these techniques though, we first need to understand the two main ideas underlying MCMC: Monte Carlo integration, and Markov chains.

    4.1 Monte Carlo integration

Many problems in probabilistic inference require the calculation of complex integrals or summations over very large outcome spaces. For example, a frequent problem is to calculate the expectation of a function g(x) for the random variable x (for simplicity, we assume x is a univariate random variable). If x is continuous, the expectation is defined as:

E[g(x)] = ∫ g(x) p(x) dx    (4.1)

In the case of discrete variables, the integral is replaced by summation:

E[g(x)] = Σ_x g(x) p(x)    (4.2)

These expectations arise in many situations where we want to calculate some statistic of a distribution, such as the mean or variance. For example, with g(x) = x, we are calculating the mean of a distribution. Integration or summation using analytic techniques can become quite challenging for certain distributions. For example, the density p(x) might have a


functional form that does not lend itself to analytic integration. For discrete distributions, the outcome space might become so large as to make the explicit summation over all possible outcomes impractical.

The general idea of Monte Carlo integration is to use samples to approximate the expectation of a complex distribution. Specifically, we obtain a set of samples x(t), t = 1, . . . , n, drawn independently from the distribution p(x). In this case, we can approximate the expectations in 4.1 and 4.2 by a finite sum:

E[g(x)] ≈ (1/n) Σ_{t=1}^{n} g( x(t) )    (4.3)

In this procedure, we have now replaced analytic integration with summation over a suitably large set of samples. Generally, the approximation can be made as accurate as needed by increasing n. Crucially, the precision of the approximation depends on the independence of the samples. When the samples are correlated, the effective sample size decreases. This is not an issue with the rejection sampler discussed in the last chapter, but a potential problem with MCMC approaches.
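As a quick illustration, the following sketch approximates E[g(x)] with g(x) = x^2 for a standard Normal distribution, for which the exact answer is the variance, 1:

% Monte Carlo approximation of E[ x^2 ] under a standard Normal density
n = 100000;
x = normrnd( 0 , 1 , 1 , n ); % independent samples from p(x)
mean( x.^2 )                  % finite-sum approximation; should be close to 1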

    Exercises

1. Develop Matlab code to approximate the mean of a Beta(α, β) distribution with α = 3 and β = 4 using Monte Carlo integration. You can use the Matlab function betarnd to draw samples from a Beta distribution. You can compare your answer with the analytic solution: α/(α + β). [Note: this can be done with one line of Matlab code.]

2. Similarly, approximate the variance of a Gamma(a, b) distribution with a = 1.5 and b = 4 by Monte Carlo integration. The Matlab command gamrnd allows you to sample from this distribution. Your approximation should get close to the theoretically derived answer of a·b^2.

    4.2 Markov Chains

A Markov chain is a stochastic process where we transition from one state to another using a simple sequential procedure. We start a Markov chain at some state x(1), and use a transition function p(x(t)|x(t−1)) to determine the next state x(2), conditional on the last state. We then keep iterating to create a sequence of states:

x(1) → x(2) → . . . → x(t) → . . .    (4.4)

Each such sequence of states is called a Markov chain, or simply a chain. The procedure for generating a sequence of T states from a Markov chain is the following:

    1. Set t = 1


2. Generate an initial value u, and set x(t) = u

    3. Repeat

t = t + 1

Sample a new value u from the transition function p(x(t)|x(t−1)), and set x(t) = u

    4. Until t = T
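As a minimal illustration of this procedure (a sketch of our own, using an arbitrarily chosen transition function, not the one in the Example below), take p(x(t)|x(t−1)) = Normal(0.5 x(t−1), 1):

T = 1000; % number of states to generate
x = zeros( 1 , T );
x(1) = 10; % a deliberately extreme starting state
for t = 2:T
    % sample the next state conditional only on the previous state
    x(t) = normrnd( 0.5 * x(t-1) , 1 );
end
plot( x ); xlabel( 't' ); ylabel( 'x' ); % the first samples show the burnin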

Importantly, in this iterative procedure, the next state of the chain at t + 1 is based only on the previous state at t. Therefore, each Markov chain wanders around the state space and the transition to a new state is only dependent on the last state. It is this local dependency that makes this procedure Markov or memoryless. As we will see, this is an important property when using Markov chains for MCMC.

When initializing each Markov chain, the chain will wander in state space around the starting state. Therefore, if we start a number of chains, each with different initial conditions, the chains will initially be in a state close to the starting state. This period is called the burnin. An important property of Markov chains is that the starting state of the chain no longer affects the state of the chain after a sufficiently long sequence of transitions (assuming that certain conditions about the Markov chain are met). At this point, the chain is said to reach its steady state and the states reflect samples from its stationary distribution. This property that Markov chains converge to a stationary distribution regardless of where we started (if certain regularity conditions of the transition function are met) is quite important. When applied to MCMC, it allows us to draw samples from a distribution using a sequential procedure but where the starting state of the sequence does not affect the estimation process.

Example. Figure 4.1 shows an example of a Markov chain involving a (single) continuous variable x. For the transition function, samples were taken from a Beta(200(0.9x(t−1) + 0.05), 200(1 − 0.9x(t−1) − 0.05)) distribution. This function and its constants are chosen somewhat arbitrarily, but help to illustrate some basic aspects of Markov chains. The process was started with four different initial values, and each chain was continued for T = 1000 iterations. The two panels of the Figure show the sequence of states at two different time scales. The line colors represent the four different chains. Note that the first 10 iterations or so show a dependence of the sequence on the initial state. This is the burnin period. This is followed by the steady state for the remainder of the sequence (the chain would continue in the steady state if we didn't stop it). How do we know exactly when the steady state has been reached and the chain converges? This is often not easy to tell, especially in high-dimensional state spaces. We will defer the discussion of convergence until later.

    Exercises

1. Develop Matlab code to implement the Markov chain as described in the example. Create an illustration similar to one of the panels in Figure 4.1. Start the Markov chain with four different initial values uniformly drawn from [0,1]. [tip: if X is a T x K matrix in Matlab such that X(t, k) stores the state of the kth Markov chain at the tth iteration, the command plot(X) will simultaneously display the K sequences in different colors].

Figure 4.1: Illustration of a Markov chain starting with four different initial conditions. The right and left panels show the sequence of states at different temporal scales.

    4.3 Putting it together: Markov chain Monte Carlo

The two previous sections discussed the two main ideas underlying MCMC: Monte Carlo sampling and Markov chains. Monte Carlo sampling allows one to estimate various characteristics of a distribution such as the mean, variance, kurtosis, or any other statistic of interest to a researcher. Markov chains involve a stochastic sequential process where we can sample states from some stationary distribution.

The goal of MCMC is to design a Markov chain such that the stationary distribution of the chain is exactly the distribution that we are interested in sampling from. This is called the target distribution¹. In other words, we would like the states sampled from some Markov chain to also be samples drawn from the target distribution. The idea is to use some clever methods for setting up the transition function such that no matter how we initialize each chain, we will converge to the target distribution. There are a number of methods that achieve this goal using relatively simple procedures. We will discuss Metropolis, Metropolis-Hastings, and Gibbs sampling.

    4.4 Metropolis Sampling

We will start by illustrating the simplest of all MCMC methods: the Metropolis sampler. This is a special case of the Metropolis-Hastings sampler discussed in the next section.

¹The target distribution could be the posterior distribution over the parameters in the model or the posterior predictive distribution of a model (i.e., what future data could the model predict?), or any other distribution of interest to the researcher. Later chapters will hopefully make clearer that this abstract sounding concept of the target distribution can be something very important for a model.

Suppose our goal is to sample from the target density p(θ), with −∞ < θ < ∞. The Metropolis sampler creates a Markov chain that produces a sequence of values:

θ(1) → θ(2) → . . . → θ(t) → . . .    (4.5)

where θ(t) represents the state of the Markov chain at iteration t. In the Metropolis procedure, we initialize the first state, θ(1), to some initial value. We then use a proposal distribution² q(θ*|θ(t−1)) to generate a candidate point θ* that is conditional on the previous state of the sampler. The next step is to either accept or reject the proposal. The probability of accepting the proposal is:

α = min( 1, p(θ*) / p(θ(t−1)) )    (4.6)

To make this decision, we generate a uniform deviate u. If u ≤ α, we accept the proposal and the next state is set equal to the proposal: θ(t) = θ*. If u > α, we reject the proposal, and the next state is set equal to the old state: θ(t) = θ(t−1). We continue generating new proposals conditional on the current state of the sampler, and either accept or reject the proposals. This procedure continues until the sampler reaches convergence. At this point, the samples θ(t) reflect samples from the target distribution p(θ). Here is a summary of the steps of the Metropolis sampler:

    1. Set t = 1

2. Generate an initial value u, and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ* from q(θ*|θ(t−1))

Evaluate the acceptance probability α = min( 1, p(θ*) / p(θ(t−1)) )

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ(t) = θ*, else set θ(t) = θ(t−1).

4. Until t = T

Figure 4.2 illustrates the procedure for a sequence of two states. To intuitively understand why the process leads to samples from the target distribution, note that 4.6 will always accept a new proposal if the new proposal is more likely under the target distribution than the old state. Therefore, the sampler will move towards the regions of the state space where the target function has high density. However, note that if the new proposal is less likely than the current state, it is still possible to accept this worse proposal and move toward it. This process of always accepting a good proposal, and occasionally accepting a bad proposal, ensures that the sampler explores the whole state space, and samples from all parts of a distribution (including the tails).

²The proposal distribution is a distribution that is chosen by the researcher, and good choices for the distribution depend on the problem. One important constraint for the proposal distribution is that it should cover the state space such that each potential outcome in state space has some non-zero probability under the proposal distribution.

A key requirement for the Metropolis sampler is that the proposal distribution is symmetric, such that q(θ = θ(t)|θ(t−1)) = q(θ = θ(t−1)|θ(t)). Therefore, the probability of proposing some new state given the old state is the same as proposing to go from the new state back to the old state. This symmetry holds with proposal distributions such as the Normal, Cauchy, Student-t, as well as uniform distributions. If this symmetry does not hold, you should use the Metropolis-Hastings sampler discussed in the next section.

A major advantage of the Metropolis sampler is that Equation 4.6 involves only a ratio of densities. Therefore, any terms independent of θ in the functional form of p(θ) will drop out, and we do not need to know the normalizing constant of the density or probability mass function. The fact that this procedure allows us to sample from unnormalized distributions is one of its major attractions. Sampling from unnormalized distributions frequently happens in Bayesian models, where calculating the normalization constant is difficult or impractical.
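A quick numerical check of this point (our own sketch; the Gaussian target and the two states are chosen arbitrarily) shows that scaling the target density by any constant leaves the Metropolis ratio unchanged:

p = @( theta ) exp( -0.5 * theta.^2 ); % unnormalized standard Normal density
c = 1 / sqrt( 2*pi ); % the normalizing constant, known in this case
theta_old = 0.3; theta_star = 1.2; % two arbitrary states
ratio1 = p( theta_star ) / p( theta_old ) % ratio of unnormalized densities
ratio2 = ( c * p( theta_star ) ) / ( c * p( theta_old ) ) % identical value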

Example 1. Suppose we wish to generate random samples from the Cauchy distribution (note that there are better ways to sample from the Cauchy that do not rely on MCMC, but we just use it as an illustration of the technique). The probability density of the Cauchy is given by:

f(θ) = 1 / ( π(1 + θ²) )    (4.7)

Because we do not need any normalizing constants in the Metropolis sampler, we can rewrite this to:

f(θ) ∝ 1 / (1 + θ²)    (4.8)

Therefore, the Metropolis acceptance probability becomes

α = min( 1, (1 + [θ(t−1)]²) / (1 + [θ*]²) )    (4.9)

We will use the Normal distribution as the proposal distribution. Our proposals are generated from a Normal(θ(t−1), σ) distribution. Therefore, the mean of the distribution is centered on the current state, and the parameter σ, which needs to be set by the modeler, controls the variability of the proposed steps. This is an important parameter that we will investigate in the Exercises. Listing 4.1 shows the Matlab function that returns the unnormalized density of the Cauchy distribution. Listing 4.2 shows Matlab code that implements the Metropolis sampler. Figure 4.3 shows the simulation results for a single chain run for 500 iterations. The upper panel shows the theoretical density in the dashed red line and the histogram shows the distribution of all 500 samples. The lower panel shows the sequence of samples of one chain.

Figure 4.2: Illustration of the Metropolis sampler to sample from target density p(θ). (A) the current state of the chain is θ(t). (B) a proposal distribution around the current state is used to generate a proposal θ*. (C) the proposal was accepted and the new state is set equal to the proposal, and the proposal distribution now centers on the new state.

    Exercises

1. Currently, the program in Listing 4.2 takes all states from the chain as samples to approximate the target distribution. Therefore, it also includes samples while the chain is still burning in. Why is this not a good idea? Can you modify the code such that the effect of burnin is removed?

2. Explore the effect of different starting conditions. For example, what happens when we start the chain with θ = −30?

3. Calculate the proportion of samples that is accepted on average. Explore the effect of parameter σ on the average acceptance rate. Can you explain what is happening with the accuracy of the reconstructed distribution when σ is varied?

4. As a followup to the previous question, what is (roughly) the value of σ that leads to a 50% acceptance rate? It turns out that this is the acceptance rate for which the Metropolis sampler, in the case of Gaussian distributions, converges most quickly to the target distribution.

5. Suppose we apply the Metropolis sampler to a Normal(µ, σ) density as the target distribution, with µ = 0 and σ = 1. Write down the equation for the acceptance probability, and remove any proportionality constants from the density ratio. [note: the Matlab documentation for normpdf shows the functional form for the Normal density].

** 6. Modify the code such that the sequences of multiple chains (each initialized differently) are visualized simultaneously.

    Listing 4.1: Matlab function to evaluate the unnormalized Cauchy.

function y = cauchy( theta )
%% Returns the unnormalized density of the Cauchy distribution
y = 1 ./ (1 + theta.^2);

    4.5 Metropolis-Hastings Sampling

The Metropolis-Hastings (MH) sampler is a generalized version of the Metropolis sampler in which we can apply symmetric as well as asymmetric proposal distributions. The MH sampler operates in exactly the same fashion as the Metropolis sampler, but uses the following acceptance probability:

α = min( 1, [ p(θ*) / p(θ(t−1)) ] [ q(θ(t−1)|θ*) / q(θ*|θ(t−1)) ] )    (4.10)

    Listing 4.2: Matlab code to implement Metropolis sampler for Example 1

%% Chapter 2. Use Metropolis procedure to sample from Cauchy density

%% Initialize the Metropolis sampler
T = 500; % Set the maximum number of iterations
sigma = 1; % Set standard deviation of normal proposal density
thetamin = -30; thetamax = 30; % define a range for starting values
theta = zeros( 1 , T ); % Init storage space for our samples
seed=1; rand( 'state' , seed ); randn( 'state' , seed ); % set the random seed
theta(1) = unifrnd( thetamin , thetamax ); % Generate start value

%% Start sampling
t = 1;
while t < T % Iterate until we have T samples
    t = t + 1;
    % Propose a new value for theta using a normal proposal density
    theta_star = normrnd( theta(t-1) , sigma );
    % Calculate the acceptance ratio
    alpha = min( [ 1 cauchy( theta_star ) / cauchy( theta(t-1) ) ] );
    % Draw a uniform deviate from [ 0 1 ]
    u = rand;
    % Do we accept this proposal?
    if u < alpha
        theta(t) = theta_star; % If so, proposal becomes new state
    else
        theta(t) = theta(t-1); % If not, copy old state
    end
end

%% Display histogram of our samples
figure( 1 ); clf;
subplot( 3,1,1 );
nbins = 200;
thetabins = linspace( thetamin , thetamax , nbins );
counts = hist( theta , thetabins );
bar( thetabins , counts/sum(counts) , 'k' );
xlim( [ thetamin thetamax ] );
xlabel( '\theta' ); ylabel( 'p(\theta)' );

%% Overlay the theoretical density
y = cauchy( thetabins );
hold on;
plot( thetabins , y/sum(y) , 'r--' , 'LineWidth' , 3 );
set( gca , 'YTick' , [] );

%% Display history of our samples
subplot( 3,1,2:3 );
stairs( theta , 1:T , 'k' );
ylabel( 't' ); xlabel( '\theta' );
set( gca , 'YDir' , 'reverse' );
xlim( [ thetamin thetamax ] );
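One practical note on this listing (our own addition): because a rejected proposal copies the previous state, the average acceptance rate can be estimated after the fact by counting how often the stored chain changed value. A one-line sketch, assuming theta holds the chain produced by Listing 4.2 (for a continuous target, an accepted proposal equals the previous state with probability zero, so counting changes is safe):

acceptrate = mean( theta(2:end) ~= theta(1:end-1) ) % fraction of proposals accepted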

    30 20 10 0 10 20 3

    p()

    30 20 10 0 10 20 3

    0

    50

    100

    150

    200

    250

    300

    350

    400

    450

    500

    Figure 4.3: Simulation results where 500 samples were drawn from the Cauchy distributionusing the Metropolis sampler. The upper panel shows the theoretical density in the dashedred line and the histogram shows the distribution of the samples. The lower panel shows thesequence of samples of one chain

The MH sampler has the additional ratio of q(θ(t−1)|θ*) / q(θ*|θ(t−1)) in 4.10. This corrects for any asymmetries in the proposal distribution. For example, suppose we have a proposal distribution with a mean centered on the current state, but that is skewed in one direction. If the proposal distribution prefers to move, say, left over right, the proposal density ratio will correct for this asymmetry.

    Here is a summary of the steps of the MH sampler:

    1. Set t = 1

2. Generate an initial value u, and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ* from q(θ*|θ(t−1))

Evaluate the acceptance probability α = min( 1, [ p(θ*) / p(θ(t−1)) ] [ q(θ(t−1)|θ*) / q(θ*|θ(t−1)) ] )

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ(t) = θ*, else set θ(t) = θ(t−1).

4. Until t = T

The fact that asymmetric proposal distributions can be used allows the Metropolis-Hastings procedure to sample from target distributions that are defined on a limited range (other than the uniform for which the Metropolis sampler can be used). With bounded variables, care should be taken in constructing a suitable proposal distribution. Generally, a good rule is to use a proposal distribution that has positive density on the same support as the target distribution. For example, if the target distribution has support over 0 ≤ θ < ∞, the proposal distribution should have the same support.
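To see the Hastings correction in action, here is a minimal sketch of a single MH step (our own illustration; the unnormalized Exponential target, the precision value tau, and the current state are stand-ins chosen for this sketch). The proposal is a Gamma distribution with mean equal to the current state, which is asymmetric and has support θ > 0:

p = @( theta ) exp( -theta ); % unnormalized target with support theta > 0
tau = 5; % precision; higher values give smaller proposal steps
theta_old = 2.0; % current state of the sampler
theta_star = gamrnd( tau*theta_old , 1/tau ); % proposal with mean theta_old
% Hastings correction: ratio of the proposal densities in both directions
qratio = gampdf( theta_old , tau*theta_star , 1/tau ) / ...
    gampdf( theta_star , tau*theta_old , 1/tau );
alpha = min( [ 1 ( p(theta_star) / p(theta_old) ) * qratio ] );
if rand < alpha
    theta_new = theta_star; % accept the proposal
else
    theta_new = theta_old; % reject the proposal; keep the old state
end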

    Exercise

1. Suppose a researcher investigates response times in an experiment and finds that the Weibull(a, b) distribution with a = 2 and b = 1.9 captures the observed variability in response times. Write a Matlab program that implements the Metropolis-Hastings sampler in order to sample response times from this distribution. The pdf for the Weibull is given by the Matlab command wblpdf. Create a figure that is analogous to Figure 4.3. You could use a number of different proposal distributions in this case. For this exercise, use samples from a Gamma(θ(t)τ, 1/τ) distribution. This proposal density has a mean equal to θ(t) so it is centered on the current state. The parameter τ controls the acceptance rate of the sampler; it is a precision parameter such that higher values are associated with less variability in the proposal distribution. Can you find a value for τ to get (roughly) an acceptance rate of 50%? Calculate the variance of this distribution using the Monte Carlo approach with the samples obtained from the Metropolis-Hastings sampler. If you would like to know how close your approximation is, use online resources to find the analytically derived answer.

    4.6 Metropolis-Hastings for Multivariate Distributions

Up to this point, all of the examples we discussed involved univariate distributions. It is fairly straightforward though to generalize the MH sampler to multivariate distributions. There are two different ways to extend the procedure to sample random variables in multidimensional spaces.

    4.6.1 Blockwise updating

In the first approach, called blockwise updating, we use a proposal distribution that has the same dimensionality as the target distribution. For example, if we want to sample from a probability distribution involving N variables, we design an N-dimensional proposal distribution, and we either accept or reject the proposal (involving values for all N variables) as a block. In the following, we will use the vector notation θ = (θ1, θ2, . . . , θN) to represent a random variable involving N components, and θ(t) represents the tth state in our sampler. This leads to a generalization of the MH sampler where the scalar variables θ are now replaced by vectors θ:

    1. Set t = 1

2. Generate an initial value u = (u1, u2, . . . , uN), and set θ(t) = u

3. Repeat

t = t + 1

Generate a proposal θ* from q(θ*|θ(t−1))

Evaluate the acceptance probability α = min( 1, [ p(θ*) / p(θ(t−1)) ] [ q(θ(t−1)|θ*) / q(θ*|θ(t−1)) ] )

Generate a u from a Uniform(0,1) distribution

If u ≤ α, accept the proposal and set θ(t) = θ*, else set θ(t) = θ(t−1).

4. Until t = T

Example 1 (adopted from Gill, 2008). Suppose we want to sample from the bivariate exponential distribution

p(θ1, θ2) = exp( −(λ1 + λ)θ1 − (λ2 + λ)θ2 − λ max(θ1, θ2) )    (4.11)

For our example, we will restrict the range of θ1 and θ2 to [0,8] and set the constants to the following: λ1 = 0.5, λ2 = 0.1, λ = 0.01, max(θ1, θ2) = 8.

    Listing 4.3: Matlab code to implement bivariate density for Example 1

function y = bivexp( theta1 , theta2 )
%% Returns the density of a bivariate exponential function
lambda1 = 0.5; % Set up some constants
lambda2 = 0.1;
lambda = 0.01;
maxval = 8;
y = exp( -(lambda1+lambda)*theta1 - (lambda2+lambda)*theta2 - lambda*maxval );

This bivariate density is visualized in Figure 4.4, right panel. The Matlab function that implements this density function is shown in Listing 4.3. To illustrate the blockwise MH sampler, we use a uniform proposal distribution, where proposals for θ1 and θ2 are sampled from a Uniform(0,8) distribution. In other words, we sample proposals for θ* uniformly from within a box. Note that with this particular proposal distribution, we are not conditioning our proposals on the previous state of the sampler. This is known as an independence sampler. This is actually a very poor proposal distribution, but it leads to a simple implementation because the ratio q(θ(t−1)|θ*) / q(θ*|θ(t−1)) = 1 and therefore disappears from the acceptance ratio. The Matlab code that implements the sampler is shown in Listing 4.4. Figure 4.4, left panel, shows the approximated distribution using 5000 samples.

Figure 4.4: Results of a Metropolis sampler for the bivariate exponential: the theoretical density (right) and its approximation with 5000 samples (left).

    Listing 4.4: Blockwise Metropolis-Hastings sampler for bivariate exponential distribution

%% Chapter 2. Metropolis procedure to sample from Bivariate Exponential
% Blockwise updating. Use a uniform proposal distribution

%% Initialize the Metropolis sampler
T = 5000; % Set the maximum number of iterations
thetamin = [ 0 0 ]; % define minimum for theta1 and theta2
thetamax = [ 8 8 ]; % define maximum for theta1 and theta2
seed=1; rand( 'state' , seed ); randn( 'state' , seed ); % set the random seed
theta = zeros( 2 , T ); % Init storage space for our samples
theta(1,1) = unifrnd( thetamin(1) , thetamax(1) ); % Start value for theta1
theta(2,1) = unifrnd( thetamin(2) , thetamax(2) ); % Start value for theta2

%% Start sampling
t = 1;
while t < T % Iterate until we have T samples
    t = t + 1;
    % Propose a new value for theta
    theta_star = unifrnd( thetamin , thetamax );
    pratio = bivexp( theta_star(1) , theta_star(2) ) / ...
        bivexp( theta(1,t-1) , theta(2,t-1) );
    alpha = min( [ 1 pratio ] ); % Calculate the acceptance ratio
    u = rand; % Draw a uniform deviate from [ 0 1 ]
    if u < alpha % Do we accept this proposal?
        theta(:,t) = theta_star; % proposal becomes new value for theta
    else
        theta(:,t) = theta(:,t-1); % copy old value of theta
    end
end

%% Display histogram of our samples
figure( 1 ); clf;
subplot( 1,2,1 );
nbins = 10;
thetabins1 = linspace( thetamin(1) , thetamax(1) , nbins );
thetabins2 = linspace( thetamin(2) , thetamax(2) , nbins );
hist3( theta' , 'Edges' , {thetabins1 thetabins2} );
xlabel( '\theta_1' ); ylabel( '\theta_2' ); zlabel( 'counts' );
az = 61; el = 30;
view( az , el );

%% Plot the theoretical density
subplot( 1,2,2 );
nbins = 20;
thetabins1 = linspace( thetamin(1) , thetamax(1) , nbins );
thetabins2 = linspace( thetamin(2) , thetamax(2) , nbins );
[ theta1grid , theta2grid ] = meshgrid( thetabins1 , thetabins2 );
ygrid = bivexp( theta1grid , theta2grid );
mesh( theta1grid , theta2grid , ygrid );
xlabel( '\theta_1' ); ylabel( '\theta_2' );
zlabel( 'f(\theta_1,\theta_2)' );
view( az , el );

Example 2. Many researchers have proposed probabilistic models for order information. Order information can relate to preference rankings over political candidates, car brands and icecream flavors, but can also relate to knowledge about the relative order of items along some temporal or physical dimension. For example, suppose we ask individuals to remember the chronological order of US presidents. Steyvers, Lee, Miller, and Hemmer (2009) found that individuals make a number of mistakes in the ordering of presidents that can be captured by simple probabilistic models, such as Mallows model. To explain Mallows model, let's give a simple example. Suppose that we are looking at the first five presidents: Washington, Adams, Jefferson, Madison, and Monroe. We will represent this true ordering by a vector ω = (1, 2, 3, 4, 5) = (Washington, Adams, Jefferson, Madison, Monroe). Mallows model now proposes that the remembered orderings tend to be similar to the true ordering, with very similar orderings being more likely than dissimilar orderings. Specifically, according to Mallows model, the probability that an individual remembers an ordering θ is proportional to:

p(θ|ω, λ) ∝ exp( −d(θ, ω) λ )    (4.12)

In this equation, d(θ, ω) is the Kendall tau distance between two orderings. This distance measures the number of adjacent pairwise swaps that are needed to bring the two orderings into alignment. For example, if θ = (Adams, Washington, Jefferson, Madison, Monroe), then d(θ, ω) = 1 because one swap is needed to make the two orderings identical. Similarly, if θ = (Adams, Jefferson, Washington, Madison, Monroe), then d(θ, ω) = 2 because two swaps are needed. Note that in Kendall tau distance, only adjacent items can be swapped. The scaling parameter λ controls how sharply peaked the distribution of remembered orderings is around the true ordering. Therefore, by increasing λ, the model makes it more likely that the correct ordering (or something similar) will be produced.

The problem is now to generate orderings θ according to Mallows model, given the true ordering ω and scaling parameter λ. This can be achieved in very simple ways using a Metropolis sampler. We start the sampler with θ(1) corresponding to a random permutation of items. At each iteration, we then make proposals that slightly modify the current state. This can be done in a number of ways. The idea here is to use a proposal distribution where the current ordering is permuted by transposing any randomly chosen pair of items (and not just adjacent items). Formally, we draw proposals θ* from the proposal distribution

q(θ = θ* | θ(t−1)) = 1 / (N choose 2)   if S(θ*, θ(t−1)) = 1
                   = 0                  otherwise             (4.13)

where S(θ*, θ(t−1)) is the Cayley distance. This distance counts the number of transpositions of any pair of items needed to bring two orderings into alignment (therefore, the difference with the Kendall tau distance is that any pairwise swap counts as one, even nonadjacent swaps). This is just a complicated way to describe a very simple proposal distribution: just swap two randomly chosen items from the last ordering, and make that the proposed ordering.

Because the proposal distribution is symmetric, we can use the Metropolis sampler. The acceptance probability is

α = min( 1, p(θ*|ω, λ) / p(θ(t−1)|ω, λ) ) = min( 1, exp( −d(θ*, ω)λ ) / exp( −d(θ(t−1), ω)λ ) ).    (4.14)

The Matlab implementation of the Kendall tau distance function is shown in Listing 4.5. The Matlab code for the Metropolis sampler is shown in Listing 4.6. Currently, the code does not do all that much. It simply shows what the state is every 10 iterations for a total of 500 iterations. Here is some sample output from the program:

    Listing 4.5: Function to evaluate Kendall tau distance

function tau = kendalltau( order1 , order2 )
%% Returns the Kendall tau distance between two orderings
% Note: this is not the most efficient implementation
[ dummy , ranking1 ] = sort( order1(:)' , 2 , 'ascend' );
[ dummy , ranking2 ] = sort( order2(:)' , 2 , 'ascend' );
N = length( ranking1 );
[ ii , jj ] = meshgrid( 1:N , 1:N );
ok = find( jj(:) > ii(:) );
ii = ii( ok );
jj = jj( ok );
nok = length( ok );
sign1 = ranking1( jj ) > ranking1( ii );
sign2 = ranking2( jj ) > ranking2( ii );
tau = sum( sign1 ~= sign2 );
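As a quick check that this function matches the worked example above (our own snippet; it assumes kendalltau from Listing 4.5 is on the Matlab path):

% president ids: 1=Washington, 2=Adams, 3=Jefferson, 4=Madison, 5=Monroe
omega = [ 1 2 3 4 5 ]; % true chronological ordering
theta1 = [ 2 1 3 4 5 ]; % Washington and Adams swapped
theta2 = [ 2 3 1 4 5 ]; % Washington moved back two positions
d1 = kendalltau( theta1 , omega ) % returns 1
d2 = kendalltau( theta2 , omega ) % returns 2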

Listing 4.6: Implementation of Metropolis-Hastings sampler for Mallows model

%% Chapter 2. Metropolis sampler for Mallows model
% samples orderings from a distribution over orderings

%% Initialize model parameters
lambda = 0.1; % scaling parameter
labels = { 'Washington' , 'Adams' , 'Jefferson' , 'Madison' , 'Monroe' };
omega = [ 1 2 3 4 5 ]; % correct ordering
L = length( omega ); % number of items in ordering

%% Initialize the Metropolis sampler
T = 500; % Set the maximum number of iterations
seed=1; rand( 'state' , seed ); randn( 'state' , seed ); % set the random seed
theta = zeros( L , T ); % Init storage space for our samples
theta(:,1) = randperm( L ); % Random ordering to start with

%% Start sampling
t = 1;
while t < T % Iterate until we have T samples
    t = t + 1;

    % Our proposal is the last ordering but with two items switched
    lasttheta = theta(:,t-1); % Get the last theta
    % Propose two items to switch
    whswap = randperm( L ); whswap = whswap(1:2);
    theta_star = lasttheta;
    theta_star( whswap(1) ) = lasttheta( whswap(2) );
    theta_star( whswap(2) ) = lasttheta( whswap(1) );

    % calculate Kendall tau distances
    dist1 = kendalltau( theta_star , omega );
    dist2 = kendalltau( lasttheta , omega );

    % Calculate the acceptance ratio
    pratio = exp( -dist1*lambda ) / exp( -dist2*lambda );
    alpha = min( [ 1 pratio ] );
    u = rand; % Draw a uniform deviate from [ 0 1 ]
    if u < alpha % Do we accept this proposal