The Empirical Research Process; Fundamental Methodological Issues
• Theory; Data; Models/model selection; Estimation; Inference.
(Order?)
Examples: Presidential approval; International conflict/Civil war.
• Identification: Can quantities of interest be determined from the
model/data, assuming sufficient sample size? (an asymptotic concept)
Parameters in structural equation models, for example, are often of
theoretical interest or directly encode causal assumptions. Can they
be uniquely determined with the available measured variables?
Endogenous vs. exogenous variables; exclusion restrictions (certain
causal links are ruled out); order condition (a necessary condition for
identification: the number of excluded exogenous variables must at
least equal the number of included endogenous variables).
A single-equation model can be considered part of a SEM (with some
of the right-hand-side variables potentially endogenous). Standard
models (parametric, or non-parametric matching) typically assume a
set of “control” variables is measured that makes identification of
the causal parameter possible.
What variables should be in the model? Is the same model good for
both prediction and causal inference?
– Standard practice: use the same (parametric) model for prediction
and causal inference, often studying the causal effect of each
independent variable in the model in turn, e.g.:
Pr(Voting) = f(education, income, party ID, race, gender, etc.)
– But: different objectives may require very different x’s to enter
the model.
Prediction: all direct causes of y;
Causal inference on xi: all xj’s that confound the relationship
between xi and y.
[Causal diagram: x2 → x1; x1, x2, and x3 each affect y]
In this hypothetical causal structure:
∗ prediction of y: all x’s;
∗ causal effect of x1 on y: x1 and x2;
∗ causal effect of x2 on y: x2 alone (controlling for x1, its consequence,
biases the estimate of the total effect);
∗ causal effect of x3 on y: x3.
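A quick simulation can illustrate these rules. The setup below is hypothetical: the coefficients and NumPy code are my own choices, matching the structure above (x2 → x1, with x1, x2, x3 each affecting y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structure: x2 -> x1 -> y, x2 -> y, x3 -> y
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)            # x1 is a consequence of x2
y = 1.0 * x1 + 0.5 * x2 + 0.7 * x3 + rng.normal(size=n)

def ols(X, y):
    """Least-squares slope coefficients (all variables mean zero, so no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Effect of x1 on y: control for the confounder x2 (recovers ~1.0).
b_x1 = ols(np.column_stack([x1, x2]), y)[0]

# Total effect of x2 on y: regress on x2 alone (recovers ~0.5 + 1.0*0.8 = 1.3).
total_x2 = ols(x2.reshape(-1, 1), y)[0]

# Controlling for x1, a consequence of x2, leaves only the direct effect (~0.5),
# a biased estimate of the total effect.
direct_x2 = ols(np.column_stack([x2, x1]), y)[0]
```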
– Finding the “right” set of control variables is hard
∗ In practice, the decision is often made “informally, on a case-by-case
basis, resting on folklore and intuition rather than on hard
mathematics.” (Pearl 2009)
∗ Different studies of the same causal relationship often use
different sets of control variables, guided by even slightly different
substantive theories.
∗ This can lead not only to changes in magnitude but even to reversals
of sign in the estimated effects (“Simpson’s paradox”).
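A sign reversal of this kind is easy to produce. In the hypothetical simulation below (all numbers are mine), x has a negative effect on y, but because x is strongly positively associated with a confounder z that raises y, the slope of y on x alone is positive:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

z = rng.normal(size=n)                        # confounder
x = z + 0.5 * rng.normal(size=n)              # x strongly associated with z
y = -1.0 * x + 2.0 * z + rng.normal(size=n)   # true effect of x is negative

# Omitting the confounder: the slope of y on x alone is POSITIVE (~0.6).
naive = np.cov(x, y)[0, 1] / np.var(x)

# Controlling for z recovers the negative effect (~ -1.0).
controlled = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0][0]
```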
– Pearl (2009) and related work (being introduced to political
science): causal graph theory
∗ the possibility of causal inference from observational data
∗ the “discovery” of underlying causal graphs from data;
∗ graphical tools for control variable selection based on the causal
graph.
• Data source/Measurement:
– Experimental data
If done right, the gold standard. Random assignment makes treatment
exogenous and the treatment and control groups comparable (for
sufficient N).
Can be expensive/infeasible (regime type change?)
Issues like noncompliance, external validity, and the Hawthorne effect
(the effect of being observed).
– Observational data, such as from surveys
Issues of sampling design, e.g., stratification with different sampling
rates (weighting necessary); clustering (correlations within
clusters).
Selection bias. Response-based sampling (e.g., rare events data).
Missing data; sensitive questions.
Cross-sectional, panel (small T), and TSCS (time-series cross-section) data.
– Measurement: e.g., party identification? Economic wellbeing?
Ideal points? Power? Structural characteristics of the international
system?
Some are easier, some harder. E.g., party ID can be obtained directly
from survey data; others require more sophisticated methods, as
in recovering ideal points from roll call data (e.g., item response
models).
Social network analysis is useful for measuring structural
characteristics (such as polarization, globalization).
• Modeling:
– Abstraction: no model is ever perfect (if it were, it would not be a
“model”). Reality itself is infinitely rich and complex.
Seek to capture the essential features of the data generating
process; a model is a collection of assumptions about the process.
– Systematic and stochastic components:
e.g. Linear regression:
Y = Xβ + ε (1)
(Why ε? We can never measure all relevant variables; plus the
universe is inherently probabilistic, according to quantum physics.)
Y : N × 1; X : N × k; β : k × 1;
ε ∼ N(0, σ²I)
Equivalently,
Y ∼ N(Xβ, σ²I)
For each individual i, i = 1, 2, . . . , N:
Yi ∼ N(xiβ, σ²)
Also equivalent:
Yi ∼ fN(yi | µi, σ²), µi = xiβ
where yi is an observed value of the random variable Yi. Read:
the density of Yi at a particular location yi is given by the normal
distribution density with mean µi = xiβ and variance σ².
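Read concretely, the stochastic component assigns each observation a density value. A minimal sketch (the numerical values are hypothetical):

```python
import math

def normal_logpdf(y, mu, sigma2):
    """Log of the normal density f_N(y | mu, sigma^2)."""
    return -0.5 * (math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / sigma2)

# Hypothetical observation i: systematic component mu_i = x_i * beta = 2.0,
# variance sigma^2 = 1.0, observed value y_i = 2.5.
log_density = normal_logpdf(2.5, 2.0, 1.0)
```

Summing such log-densities over i gives the log-likelihood used in estimation below.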
→ We’ll be looking at a variety of forms of systematic and
stochastic components (distribution functions) suitable for different
types of data Y (binary, multinomial, ordinal, count,
censored/truncated, duration, etc.)
– Parametric, semi-parametric, non-parametric
∗ We’ve just seen an example of a parametric model.
The data generating process is known up to a set of unknown
parameters (in the regression model, {β, σ})
Estimation of these parameters (more below): OLS, least absolute
deviation, MLE, Bayesian, etc.
∗ Semi-parametric models combine a parametric component with
a non-parametric component, making them more flexible/robust
than fully parametric models (but less efficient, if the parametric
forms can be correctly specified). This can be in terms of a partially
specified functional form for the systematic part (as in neural
network models or the Cox proportional hazards model), or in the
form of avoiding distributional assumptions for the stochastic term.
Method of moments (MM; and GMM, the generalized method of
moments) estimation is semi-parametric, more robust to
distributional assumptions on the stochastic part.
Moments: mean, variance, etc.
n-th moment:
Mn = ∫ xⁿ f(x) dx
Basic idea: make use of the fact that sample moments approximate
population moments, regardless of the distribution. Find a set of
equations known to hold in the population given a model; the
equations involve population moments, which are functions of the
unknown parameters. Obtain estimates by substituting sample
moments for the population moments.
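As a concrete sketch of the substitution step (the distribution and parameter values are my hypothetical choices): for a Gamma(k, θ) distribution, E[X] = kθ and Var[X] = kθ², so solving these two moment equations and plugging in sample moments yields estimates of k and θ.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical population: Gamma with shape k = 3 and scale theta = 2,
# so E[X] = k*theta = 6 and Var[X] = k*theta^2 = 12.
x = rng.gamma(shape=3.0, scale=2.0, size=500_000)

m1 = x.mean()   # sample analogue of E[X]
v = x.var()     # sample analogue of Var[X]

# Solve the two moment equations for the parameters:
k_hat = m1**2 / v       # ~ 3.0
theta_hat = v / m1      # ~ 2.0
```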
e.g., the OLS estimator is also a method of moments estimator.
One of the key assumptions of the classical linear model is
E[εixi] = E[(yi − xiβ)xi] = 0
(for simplicity, assuming xi is scalar)
Sample version:
(1/N) ∑i (yi − xiβ) xi = 0
This is the same as the OLS normal equation (setting the first-order
derivative to 0):
min ∑i εi² = min ∑i (yi − xiβ)²
→ −2 ∑i (yi − xiβ) xi = 0
→ (1/N) ∑i (yi − xiβ) xi = 0
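The equivalence can be checked numerically; here is a sketch with a hypothetical scalar regression (no intercept, made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # hypothetical data-generating process

# OLS estimator for scalar x without an intercept:
beta_hat = (x * y).sum() / (x * x).sum()

# The sample moment condition (1/N) sum_i (y_i - x_i*beta) x_i = 0
# holds exactly at beta_hat (up to floating-point error).
moment = ((y - x * beta_hat) * x).mean()
```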
∗ Non-parametric models avoid such functional form assumptions
as well as distributional assumptions. The less assumed, the
more robust; but also the less efficient (in case the parametric
assumptions are correct).
e.g. 1: Kernel smoothing.
m̂h(x) = [∑i Kh(x − xi) yi] / [∑i Kh(x − xi)]
(K: some kernel function; h: bandwidth)
“Local” methods.
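A minimal implementation sketch of the kernel smoother above, with a Gaussian kernel (the test function, sample size, and bandwidth are my hypothetical choices):

```python
import numpy as np

def nadaraya_watson(x0, x, y, h):
    """Kernel-smoothed estimate m_hat_h(x0) with a Gaussian kernel and bandwidth h."""
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)   # K_h(x0 - x_i), up to a constant that cancels
    return (w * y).sum() / w.sum()

# Hypothetical data from y = sin(x) + noise:
rng = np.random.default_rng(4)
x = rng.uniform(0, np.pi, size=5000)
y = np.sin(x) + 0.1 * rng.normal(size=5000)

m_hat = nadaraya_watson(np.pi / 2, x, y, h=0.1)   # should be close to sin(pi/2) = 1
```

The "local" character is visible in the weights: observations far from x0 (relative to h) contribute almost nothing.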
e.g. 2: non-parametric matching; the propensity score approach;
program evaluation. (Will discuss in detail later.)
∗ The vast majority of standard models used in political science are
parametric (logit/probit/ordered logit/Tobit/Heckit/Poisson
regression, etc.)
Pros: if the assumptions are (approximately) right, more efficient
inference; after estimation, the precise functional relations support
many uses, such as marginal effects and prediction.
Cons: the assumptions can be wrong.
∗ Examples of functional forms for the systematic part:
– Functional complexity in social science data. Neural networks as
universal learning machines.
[Figure 1: A one-hidden-layer feedforward neural network. Input layer:
x1, x2, x3; hidden layer: z1, z2, connected to the inputs by β weights
(β11, β12, β21, β22, β31, β32); output layer: y, connected to the hidden
units by γ weights (γ1, γ2).]
– Model selection:
Fitting vs. out-of-sample performance.
Bayesian model averaging: in the Bayesian framework, no single
model is “true”; each is valid with some probability. Average over
the models with relatively high probability of being “true”.
• Estimation: (focusing on parametric models)
How to learn about the unknown parameters (i.e., the unknown part
of the model) from data
– Estimation criteria/principles
How to fit a line/curve to the scatter plot data?
∗ visual
[Figure: scatter plot of y against x with two candidate fitted curves,
Model 1 and Model 2]
∗ Least squares: minimize the sum of squared errors (seen above).
∗ Least absolute deviation (more robust w.r.t. outliers);
mathematically more difficult to handle than OLS.
∗ Maximum likelihood: the parameter values that maximize the
probability of the observed data given the model are the most
plausible.
These are point estimates. Confidence intervals can be
constructed based on the sampling distribution of the estimators.
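For instance, a Bernoulli likelihood maximized by brute force over a grid (the data are simulated and hypothetical; for the Bernoulli the closed-form MLE is the sample mean, which the grid search recovers):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.3, size=100_000)   # hypothetical binary data, true p = 0.3

def loglik(p, y):
    """Bernoulli log-likelihood: sum_i [y_i log p + (1 - y_i) log(1 - p)]."""
    return y.sum() * np.log(p) + (len(y) - y.sum()) * np.log(1 - p)

# Crude grid-search optimization; the maximizer of the likelihood is the MLE.
grid = np.linspace(0.001, 0.999, 999)
p_mle = grid[np.argmax([loglik(p, y) for p in grid])]
```

In practice one uses numerical optimizers (Newton-type methods) rather than a grid, but the principle is the same.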
∗ The Bayesian approach: start with a prior belief about the
unknowns; update our knowledge according to Bayes’ rule. As
the “posterior” density is proportional to the likelihood times the
prior, the data influence inference only through the likelihood
function. When the data dominate the prior, the posterior
resembles the likelihood.
From the posterior distribution one can obtain point estimates
(e.g., the posterior mean or the most probable value) and interval
estimates (probability intervals based on the posterior
distribution).
P(θ|y) = P(θ, y) / P(y)
       = P(y|θ) P(θ) / P(y)
       = P(y|θ) P(θ) / ∫ P(y|θ) P(θ) dθ
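In conjugate cases the posterior is available in closed form, with no optimization or sampling needed. A Beta-Bernoulli sketch with hypothetical numbers:

```python
# Conjugate Beta-Bernoulli example (all numbers hypothetical): prior Beta(a, b);
# after observing k successes in n Bernoulli trials, the posterior is
# Beta(a + k, b + n - k) -- Bayes' rule applied analytically.
a, b = 2.0, 2.0   # prior belief about the success probability
k, n = 7, 10      # observed data

a_post, b_post = a + k, b + (n - k)
posterior_mean = a_post / (a_post + b_post)   # point estimate: 9/14
```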
Computationally, the main distinction is optimization of a function
vs. sampling from a distribution.
Maximum likelihood estimates are obtained through optimization:
find the values of the parameters that maximize the likelihood
function. But one can also explore the likelihood function by
sampling from the entire distribution (e.g., the Gill & King paper
on a non-invertible Hessian: when the mode doesn’t work, explore
the mean instead).
MCMC uses computational algorithms to obtain samples from
a distribution. Heavily used in Bayesian inference, e.g., the Gibbs
sampler (alternating conditional sampling), whose convergence is
proved. Software includes BUGS (Bayesian inference Using Gibbs
Sampling; WinBUGS is the Windows version) and JAGS (Just
Another Gibbs Sampler). Several R packages interface these with R
or implement various specific models (e.g., MCMCpack).
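A toy Gibbs sampler illustrates the alternating conditional sampling idea (the bivariate normal target, correlation, and chain length are my hypothetical choices):

```python
import numpy as np

# Toy Gibbs sampler for a bivariate standard normal with correlation rho:
# alternately draw x | y ~ N(rho*y, 1 - rho^2) and y | x ~ N(rho*x, 1 - rho^2).
rng = np.random.default_rng(6)
rho = 0.8
x, y = 0.0, 0.0
draws = []
for t in range(60_000):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    if t >= 10_000:               # discard burn-in
        draws.append((x, y))

draws = np.asarray(draws)
corr = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]   # should approach rho
```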
Note that MCMC ≠ Bayesian inference. Where the posterior
distribution is known or can be approximated through analytical
methods, MCMC is unnecessary. When the posterior/likelihood is
“well behaved” (such as being globally concave), optimization is
more efficient and more reliable. For complex functions/distributions,
MCMC returns some results when optimization is difficult to do. Of
course, where optimization may fail, the quality of the posterior
approximation through sampling could be low too. There is no magic.
– How special data features require special sampling and/or
estimation strategies, e.g., rare events (logit estimates are biased);
endogenous dependence structures (the independence assumption
doesn’t hold).
• Inference
– Quantities of interest can be computed based on the model and
the parameter estimates, e.g., the marginal effect of an x. Except in
linear models with no higher-order terms, this is generally not the
coefficient of x. But such quantities are usually functions of the
parameters.
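For example, in a logit model the marginal effect of x is p(1 − p)β, which depends on where x is evaluated (the coefficient values below are hypothetical):

```python
import math

# Logit model: Pr(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))).
# The marginal effect of x is dPr/dx = p*(1-p)*b1, a function of the
# parameters AND of x -- not the coefficient b1 itself.
b0, b1 = -1.0, 0.5   # hypothetical estimates
x = 2.0              # value of x at which to evaluate the effect

p = 1 / (1 + math.exp(-(b0 + b1 * x)))
marginal_effect = p * (1 - p) * b1   # here p = 0.5, so the effect is 0.125
```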
– Uncertainty measures should be reported, based on the uncertainty
measures for the parameters. (For quantities pertaining to
individual observations, also the fundamental uncertainty in the
error term, e.g. E(Yi|Xi) vs. Yi|Xi.)
– Model dependence: to what extent inference depends on the
assumption that the model is true.
Data quality: what kinds of questions can be reliably answered
from available data? Or, when can history be our guide?