The Empirical Research Process; Fundamental Methodological Issues
• Theory; Data; Models/model selection; Estimation; Inference.
(Order?)
Examples: Presidential approval; International conflict/Civil war.
• Identification: Can quantities of interest be determined from the
model/data, assuming sufficient sample size? (an asymptotic concept)
Parameters in structural equation models, for example, are often of
theoretical interest or directly encode causal assumptions. Can they
be uniquely determined with the available measured variables?
Endogenous vs. exogenous variables; exclusion restrictions (certain
causal links are ruled out); order condition (a necessary condition for
identification: the number of excluded exogenous variables must at
least equal the number of included endogenous variables).
A single-equation model can be considered part of a SEM (with some
of the right-hand-side variables potentially endogenous). Standard
models (parametric, or non-parametric matching) typically assume a
set of “control” variables is measured that makes identification of
the causal parameter possible.
What variables should be in the model? Is the same model good for
both prediction and causal inference?
– Standard practice: use the same (parametric) model for prediction
and causal inference, often studying the causal effect of each
independent variable in the model in turn, e.g.:
Pr(Voting) = f(education, income, party ID, race, gender, etc.)
– But: different objectives may require very different x’s to enter
the model.
Prediction: all direct causes of y;
Causal inference on xi: all xj’s that confound the relationship
between xi and y.
[Causal diagram: x2 → x1; x1, x2, and x3 each affect y]
In this hypothetical causal structure:
∗ prediction of y: all x’s;
∗ causal effect of x1 on y: x1 and x2;
∗ causal effect of x2 on y: x2 alone (controlling for x1, its consequence,
biases the estimate of the total effect);
∗ causal effect of x3 on y: x3.
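A quick simulation can illustrate these rules. The setup below is hypothetical: the coefficients and NumPy code are my own choices, matching the structure above (x2 → x1, with x1, x2, x3 each affecting y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structure: x2 -> x1 -> y, x2 -> y, x3 -> y
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)            # x1 is a consequence of x2
y = 1.0 * x1 + 0.5 * x2 + 0.7 * x3 + rng.normal(size=n)

def ols(X, y):
    """Least-squares slope coefficients (all variables mean zero, so no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Effect of x1 on y: control for the confounder x2 (recovers ~1.0).
b_x1 = ols(np.column_stack([x1, x2]), y)[0]

# Total effect of x2 on y: regress on x2 alone (recovers ~0.5 + 1.0*0.8 = 1.3).
total_x2 = ols(x2.reshape(-1, 1), y)[0]

# Controlling for x1, a consequence of x2, leaves only the direct effect (~0.5),
# a biased estimate of the total effect.
direct_x2 = ols(np.column_stack([x2, x1]), y)[0]
```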
– Finding the “right” set of control variables is hard
∗ In practice, the decision is often made “informally, on a case-by-case
basis, resting on folklore and intuition rather than on hard
mathematics.” (Pearl 2009)
∗ Different studies of the same causal relationship often use
different sets of control variables, guided by even slightly different
substantive theories.
∗ This can lead not only to changes in magnitude but even to reversals
of sign in the estimated effects (“Simpson’s paradox”).
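A sign reversal of this kind is easy to produce. In the hypothetical simulation below (all numbers are mine), x has a negative effect on y, but because x is strongly positively associated with a confounder z that raises y, the slope of y on x alone is positive:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.normal(size=n)                        # confounder
x = z + 0.5 * rng.normal(size=n)              # x strongly associated with z
y = -1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is negative

# Omitting the confounder: the slope of y on x alone is POSITIVE (~0.6).
naive = np.cov(x, y)[0, 1] / np.var(x)

# Controlling for z recovers the negative effect (~ -1.0).
controlled = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0][0]
```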
– Pearl (2009) and related work (being introduced to political
science): causal graph theory
∗ the possibility of causal inference from observational data
∗ the “discovery” of underlying causal graphs from data;
∗ graphical tools for control variable selection based on the causal
graph.
• Data source/Measurement:
– Experimental data
If done right, the gold standard. Random assignment makes treatment
exogenous and the treatment and control groups comparable (for
sufficient N).
Can be expensive/infeasible (regime type change?)
Issues like noncompliance, external validity, and the Hawthorne effect
(the effect of being observed).
– Observational data, such as from surveys
Issues of sampling design, e.g., stratification with different sampling
rates (weighting necessary); clustering (correlations within
clusters).
Selection bias. Response-based sampling (e.g., rare events data).
Missing data; sensitive questions.
Cross-sectional, panel (small T), and TSCS (time-series cross-section) data.
– Measurement: e.g., party identification? Economic wellbeing?
Ideal points? Power? Structural characteristics of the international
system?
Some are easier, some harder. E.g., party ID can be obtained directly
from survey data; others require more sophisticated methods, as
in recovering ideal points from roll call data (e.g., item response
models).
Social network analysis is useful for measuring structural
characteristics (such as polarization, globalization).
• Modeling:
– Abstraction: no model is ever perfect (if it were, it would not be a
“model”). Reality itself is infinitely rich and complex.
Seek to capture the essential features of the data generating
process; a model is a collection of assumptions about the process.
– Systematic and stochastic components:
e.g. Linear regression:
Y = Xβ + ε (1)
(Why ε? We can never measure all relevant variables; plus the
universe is inherently probabilistic, according to quantum physics.)
Y : N × 1; X : N × k; β : k × 1;
ε ∼ N(0, σ²I)
Equivalently,
Y ∼ N(Xβ, σ²I)
For each individual i, i = 1, 2, . . . , N:
Yi ∼ N(xiβ, σ²)
Also equivalent:
Yi ∼ fN(yi | µi, σ²), µi = xiβ
where yi is an observed value of the random variable Yi. Read:
the density of Yi at a particular location yi is given by the normal
distribution density with mean µi = xiβ and variance σ².
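Read concretely, the stochastic component assigns each observation a density value. A minimal sketch (the numerical values are hypothetical):

```python
import math

def normal_logpdf(y, mu, sigma2):
    """Log of the normal density f_N(y | mu, sigma^2)."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)

# Hypothetical observation i: systematic component mu_i = x_i * beta = 2.0,
# variance sigma^2 = 1.0, observed value y_i = 2.5.
log_density = normal_logpdf(2.5, 2.0, 1.0)
```

Summing such log-densities over i gives the log-likelihood used in estimation below.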
→ We’ll be looking at a variety of forms of systematic and
stochastic components (distribution functions) suitable for different
types of data Y (binary, multinomial, ordinal, count,
censored/truncated, duration, etc.)
– Parametric, semi-parametric, non-parametric
∗ We’ve just seen an example of a parametric model.
The data generating process is known up to a set of unknown
parameters (in the regression model, {β, σ})
Estimation of these parameters (more below): OLS, least absolute
deviation, MLE, Bayesian, etc.
∗ Semi-parametric models combine a parametric component with
a non-parametric component, making them more flexible/robust
than fully parametric models (but less efficient, if the parametric
forms can be correctly specified). This can be in terms of a partially
specified functional form for the systematic part (as in neural
network models or the Cox proportional hazards model), or in the
form of avoiding distributional assumptions for the stochastic term.
Method of moments (MM; and GMM, the generalized method of
moments) estimation is semi-parametric, more robust to
distributional assumptions on the stochastic part.
Moments: mean, variance, etc.
n-th moment:
Mn = ∫ xⁿ f(x) dx
Basic idea: make use of the fact that sample moments approximate
population moments, regardless of the distribution. Find a set of
equations known to hold in the population given a model; the
equations involve population moments, which are functions of the
unknown parameters. Obtain estimates by substituting sample
moments for the population moments.
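As a concrete sketch of the substitution step (the distribution and parameter values are my hypothetical choices): for a Gamma(k, θ) distribution, E[X] = kθ and Var[X] = kθ², so solving these two moment equations and plugging in sample moments yields estimates of k and θ.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical population: Gamma with shape k = 3 and scale theta = 2,
# so E[X] = k*theta = 6 and Var[X] = k*theta^2 = 12.
x = rng.gamma(shape=3.0, scale=2.0, size=500_000)

m1 = x.mean()   # sample analogue of E[X]
v = x.var()     # sample analogue of Var[X]

# Solve the two moment equations for the parameters:
k_hat = m1**2 / v       # ~ 3.0
theta_hat = v / m1      # ~ 2.0
```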
e.g., the OLS estimator is also a method of moments estimator.
One of the key assumptions of the classical linear model is
E[εixi] = E[(yi − xiβ)xi] = 0
(for simplicity, assuming xi is scalar)
Sample version:
(1/N) ∑i (yi − xiβ) xi = 0
This is the same as the OLS normal equation (setting the first-order
derivative to 0):
min ∑i εi² = min ∑i (yi − xiβ)²
→ −2 ∑i (yi − xiβ) xi = 0
→ (1/N) ∑i (yi − xiβ) xi = 0
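The equivalence can be checked numerically; here is a sketch with a hypothetical scalar regression (no intercept, made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # hypothetical data-generating process

# OLS estimator for scalar x without an intercept:
beta_hat = (x * y).sum() / (x * x).sum()

# The sample moment condition (1/N) sum_i (y_i - x_i*beta) x_i = 0
# holds exactly at beta_hat (up to floating-point error).
moment = ((y - x * beta_hat) * x).mean()
```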
∗ Non-parametric models avoid such functional form assumptions
as well as distributional assumptions. The less assumed, the
more robust; but also the less efficient (in case the parametric
assumptions are correct).
e.g. 1: Kernel smoothing.
m̂h(x) = [∑i Kh(x − xi) yi] / [∑i Kh(x − xi)]
(K: some kernel function; h: bandwidth)
“Local” methods.
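A minimal implementation sketch of the kernel smoother above, with a Gaussian kernel (the test function, sample size, and bandwidth are my hypothetical choices):

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Kernel-smoothed estimate m_hat_h(x0) with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K_h(x0 - x_i), up to a constant that cancels
    return (w * y).sum() / w.sum()

# Hypothetical data from y = sin(x) + noise:
rng = np.random.default_rng(4)
x = rng.uniform(0, np.pi, size=5000)
y = np.sin(x) + 0.1 * rng.normal(size=5000)

m_hat = nadaraya_watson(np.pi / 2, x, y, h=0.1)   # should be close to sin(pi/2) = 1
```

The "local" character is visible in the weights: observations far from x0 (relative to h) contribute almost nothing.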
e.g. 2: non-parametric matching; the propensity score approach;
program evaluation. (Will discuss in detail later.)
∗ The vast majority of standard models used in political science are
parametric (logit/probit/ordered logit/Tobit/Heckit/Poisson
regression, etc.)
Pros: if the assumptions are (approximately) right, more efficient
inference; after estimation, the precise functional relations support
many uses, such as marginal effects and prediction.
Cons: the assumptions can be wrong.
∗ Examples of functional forms for the systematic part:
– Functional complexity in social science data. Neural networks as
universal learning machines.
[Figure 1: A one-hidden-layer feedforward neural network. Input layer:
x1, x2, x3; hidden layer: z1, z2, connected to the inputs by β weights
(β11, β12, β21, β22, β31, β32); output layer: y, connected to the hidden
units by γ weights (γ1, γ2).]
– Model selection:
Fitting vs. out-of-sample performance.
Bayesian model averaging: in the Bayesian framework, no single
model is “true”; each is valid with some probability. Average over
the models with relatively high probability of being “true”.
• Estimation: (focusing on parametric models)
How to learn about the unknown parameters (i.e., the unknown part
of the model) from data
– Estimation criteria/principles
How to fit a line/curve to the scatter plot data?
∗ visual
[Figure: scatter plot of y against x with two candidate fitted curves,
Model 1 and Model 2]
∗ Least squares: minimize the sum of squared errors (seen above).
∗ Least absolute deviation (more robust w.r.t. outliers);
mathematically more difficult to handle than OLS.
∗ Maximum likelihood: the parameter values that maximize the
probability of the observed data given the model are the most
plausible.
These are point estimates. Confidence intervals can be
constructed based on the sampling distribution of the estimators.
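For instance, a Bernoulli likelihood maximized by brute force over a grid (the data are simulated and hypothetical; for the Bernoulli the closed-form MLE is the sample mean, which the grid search recovers):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.3, size=100_000)   # hypothetical binary data, true p = 0.3

def loglik(p, y):
    """Bernoulli log-likelihood: sum_i [y_i log p + (1 - y_i) log(1 - p)]."""
    return y.sum() * np.log(p) + (len(y) - y.sum()) * np.log(1 - p)

# Crude grid-search optimization; the maximizer of the likelihood is the MLE.
grid = np.linspace(0.001, 0.999, 999)
p_mle = grid[np.argmax([loglik(p, y) for p in grid])]
```

In practice one uses numerical optimizers (Newton-type methods) rather than a grid, but the principle is the same.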
∗ The Bayesian approach: start with a prior belief about the
unknowns; update our knowledge according to Bayes’ rule. As
the “posterior” density is proportional to the likelihood times the
prior, the data influence inference only through the likelihood
function. When the data dominate the prior, the posterior
resembles the likelihood.
From the posterior distribution one can obtain point estimates
(e.g., the posterior mean or the most probable value) and interval
estimates (probability intervals based on the posterior
distribution).
P(θ|y) = P(θ, y) / P(y)
       = P(y|θ) P(θ) / P(y)
       = P(y|θ) P(θ) / ∫ P(y|θ) P(θ) dθ
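In conjugate cases the posterior is available in closed form, with no optimization or sampling needed. A Beta-Bernoulli sketch with hypothetical numbers:

```python
# Conjugate Beta-Bernoulli example (all numbers hypothetical): prior Beta(a, b);
# after observing k successes in n Bernoulli trials, the posterior is
# Beta(a + k, b + n - k) -- Bayes' rule applied analytically.
a, b = 2.0, 2.0   # prior belief about the success probability
k, n = 7, 10      # observed data

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)   # point estimate: 9/14
```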
Computationally, the main distinction is optimization of a function
vs. sampling from a distribution.
Maximum likelihood estimates are obtained through optimization:
find the values of the parameters that maximize the likelihood
function. But one can also explore the likelihood function by
sampling from the entire distribution (e.g., the Gill & King paper
on a non-invertible Hessian: when the mode doesn’t work, explore
the mean instead).
MCMC uses computational algorithms to obtain samples from
a distribution. Heavily used in Bayesian inference, e.g., the Gibbs
sampler (alternating conditional sampling), whose convergence is
proved. Software includes BUGS (Bayesian inference Using Gibbs
Sampling; WinBUGS is the Windows version) and JAGS (Just
Another Gibbs Sampler). Several R packages interface these with R
or implement various specific models (e.g., MCMCpack).
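A toy Gibbs sampler illustrates the alternating conditional sampling idea (the bivariate normal target, correlation, and chain length are my hypothetical choices):

```python
import numpy as np

# Toy Gibbs sampler for a bivariate standard normal with correlation rho:
# alternately draw x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2).
rng = np.random.default_rng(6)
rho = 0.8
x, y = 0.0, 0.0
draws = []
for t in range(60_000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    if t >= 10_000:               # discard burn-in
        draws.append((x, y))

draws = np.asarray(draws)
corr = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]   # should approach rho
```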
Note that MCMC ≠ Bayesian inference. Where the posterior
distribution is known or can be approximated through analytical
methods, MCMC is unnecessary. When the posterior/likelihood is
“well behaved” (such as being globally concave), optimization is
more efficient and more reliable. For complex functions/distributions,
MCMC returns some results when optimization is difficult to do. Of
course, where optimization may fail, the quality of the posterior
approximation through sampling could be low too. There is no magic.
– How special data features require special sampling and/or
estimation strategies, e.g., rare events (logit estimates are biased);
endogenous dependence structures (the independence assumption
doesn’t hold).
• Inference
– Quantities of interest can be computed based on the model and
the parameter estimates, e.g., the marginal effect of an x. Except in
linear models with no higher-order terms, this is generally not the
coefficient of x. But such quantities are usually functions of the
parameters.
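For example, in a logit model the marginal effect of x is p(1 − p)β, which depends on where x is evaluated (the coefficient values below are hypothetical):

```python
import math

# Logit model: Pr(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))).
# The marginal effect of x is dPr/dx = p*(1-p)*b1, a function of the
# parameters AND of x -- not the coefficient b1 itself.
b0, b1 = -1.0, 0.5   # hypothetical estimates
x = 2.0              # value of x at which to evaluate the effect

p = 1 / (1 + math.exp(-(b0 + b1 * x)))
marginal_effect = p * (1 - p) * b1   # here p = 0.5, so the effect is 0.125
```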
– Uncertainty measures should be reported, based on the uncertainty
measures for the parameters. (For quantities pertaining to
individual observations, also the fundamental uncertainty in the
error term, e.g. E(Yi|Xi) vs. Yi|Xi.)
– Model dependence: to what extent inference depends on the
assumption that the model is true.
Data quality: what kinds of questions can be reliably answered
from available data? Or, when can history be our guide?