8/9/2019 Stata Tutorial 13 v2 0
STATA 13 Tutorial
by Manfred W. Keil
to Accompany
Introduction to Econometrics
by James H. Stock and Mark W. Watson
------------------------------------------------------------------------------------------------------------------
1. STATA: INTRODUCTION
2. CROSS-SECTIONAL DATA
Interactive Use: Data Input and Simple Data Analysis
a) The Easy and Tedious Way: Manual Data Entry
b) Summary Statistics
c) Graphical Presentations
d) Simple Regression
e) Entering Data from a Spreadsheet
f) Importing Data Files directly into STATA
g) Multiple Regression Model
h) Data Transformations
Batch (Do-Files)
3. SUMMARY OF FREQUENTLY USED STATA COMMANDS
4. FINAL NOTE
-----------------------------------------------------------------------------------------------------------------
1. STATA: INTRODUCTION
This tutorial will introduce you to a statistical and econometric software package called STATA. The tutorial is an introduction to some of the most commonly used features in STATA. These features were used by the authors of your textbook to generate the statistical analysis reported in Chapters 3-9 (Stock and Watson, 2015). The tutorial provides the necessary background to reproduce the results of Chapters 3-9 and to carry out related exercises. It does not cover panel data (Chapter 10), binary dependent variables (Chapter 11), instrumental variable analysis (Chapter 12), or time-series analysis (Chapters 14-16).
The most current professional version is STATA 13. STATA 12 and STATA 13 are sufficiently similar that those who only have access to STATA 12 can also use this tutorial. As with many statistical packages, newer versions of a program allow you to use more advanced and recently developed techniques that you, as a first-time user, most likely will not encounter in a first course of statistics or econometrics. There are several versions of STATA 12, such as STATA/IC, STATA/SE, and STATA/MP. The difference is basically in terms of the number of variables STATA can handle and the speed at which information is processed. Most users will probably work with the Intercooled (IC) version.
STATA runs on Windows (2000, 2003, XP, Vista, Server 2008, or Windows 7), Mac, and Unix platforms. It is produced by StataCorp in College Station, TX. You can read about various product information at the firm's Web site, www.stata.com. There are 20 manuals that can be purchased with STATA 13, although subsets can be bought separately. Perhaps the most useful of these are the User's Guide and the Base Reference Manual, which can simply be downloaded. You can order STATA by calling (800) 782-8272 or by filling out a form at www.stata.com/order/quote-request/student/. In addition, if you purchase the Student Version, you can acquire STATA at a steep discount. Prices vary, but you could get a perpetual license for STATA/IC for $189, or a six-month license for as low as $69 (a business/single user pays $1,695 to purchase STATA). There is even a 30-day free evaluation copy of STATA.
Econometrics deals with three types of data: cross-sectional data, time series data, and panel (longitudinal) data (see Chapter 1 of Stock and Watson (2015)). In a cross-section you analyze data from multiple entities at a single point in time. In a time series you observe the behavior of a single entity over multiple time periods. This can range from high frequency data such as financial data (hours, days); to data observed at somewhat lower (monthly) frequencies, such as industrial production, inflation, and unemployment rates; to quarterly data (GDP) or annual (historical) data. One big difference between cross-sectional and time series analysis is that the order of the observation numbers does not matter in cross-sections. With time series, you would lose some of the most interesting features of the data if you shuffled the observations. Finally, panel data can be viewed as a combination of cross-sectional and time series data, since multiple entities are observed at multiple time periods. STATA allows you to work with all three types of data.
STATA is most commonly used for cross-sectional and panel data in academics, business, and government, but you can work with it relatively easily when you analyze time-series data. STATA allows you to store results within a program and to retrieve these results for further calculations later. Remember how you calculated confidence intervals in statistics, say for a population mean? Basically you needed the sample mean, the standard error, and some value from a statistical table. In STATA, you can calculate the mean and standard deviation of a sample and then temporarily store these. You then work with these numbers in a standard formula for confidence intervals. In addition, STATA provides the required numbers from the relevant distribution (normal, $\chi^2$, F, etc.).
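The stored-results idea described above can be sketched as follows. This is a minimal illustration rather than part of the original tutorial: it assumes a numeric variable named testscr is already in memory, and the scalar names (xbar, se_mean, tcrit, ci_lo, ci_hi) are made up for the example.

```stata
* Sketch: 95% confidence interval for a population mean from stored results.
* Assumes a numeric variable called testscr is in memory.
quietly summarize testscr
* r(mean), r(sd) and r(N) are left behind by summarize
scalar xbar    = r(mean)
scalar se_mean = r(sd) / sqrt(r(N))
* critical value from the t distribution with N-1 degrees of freedom
scalar tcrit   = invttail(r(N) - 1, 0.025)
scalar ci_lo   = xbar - tcrit * se_mean
scalar ci_hi   = xbar + tcrit * se_mean
display "95% confidence interval: [" ci_lo ", " ci_hi "]"
```

The function invttail() plays the role of the statistical table: it returns the value that leaves 2.5% in the upper tail of the t distribution.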
While STATA is truly interactive, you will run a program in batch mode sooner rather than later.
Interactive use: you type a STATA command in the STATA Command Window (see below) and hit the Return/Enter key on your keyboard. STATA executes the command and the results are displayed in the STATA Results Window. Then you enter the next command, STATA executes it, and so forth, until the analysis is complete. Even the simplest statistical analysis typically will involve several STATA commands.
Batch mode: all of the commands for the analysis are listed in a file, and STATA is told to read the file and execute all of the commands. These files are called Do-Files and are saved using a .do suffix.
In the good old days the equivalent of writing a Do-File was to submit a batch of cards, each card containing a single command (now a line), to a technician, who would use a card reader to enter these into the computer. The computer would then execute the sequence of statements. (The deck of cards was referred to as a file, and you typically stored each file in a filing cabinet with a rubber band around it.) While you will work at first in interactive mode by clicking on buttons or writing single line commands, you will very soon discover the advantage of running your regressions in batch mode. This method allows you to see the history of commands, and you can also analyze where exactly things went wrong if there are problems (errors) with any of your commands. This tutorial will initially explain the interactive use of STATA since it is more intuitive. However, we will switch to batch mode as soon as it makes sense, and you should seriously try to do your research/class work using this mode (Do-Files).
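To make the idea of batch mode concrete, here is a minimal sketch of what a Do-File might contain. The file names example.do and example.log are hypothetical, and the sketch assumes that a data set called caschool.dta sits in the current working directory.

```stata
* example.do : a hypothetical Do-File (all file names are placeholders)
* keep a plain-text record of commands and results
log using example.log, replace text
* load the data, discarding anything already in memory
use caschool.dta, clear
* a few commands that would otherwise be typed one at a time
summarize testscr str
regress testscr str, robust
log close
```

Typing do example.do in the Command Window would then execute every line in order, and the log file keeps the history of commands and output for later inspection.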
STATA produces highly professional-looking graphs and charts. However, it requires some practice to generate these. A separate manual (Graphics) is devoted to this topic alone. Since STATA works in a Windows format, it allows you to cut and paste the data into other Windows-based programs, such as Word or WordPerfect.
Finally, there is a warning about the limitations of this tutorial. The purpose is to help you gain an initial understanding of how to work with STATA. I hope that the tutorial looks less daunting than the manuals. However, it cannot replace the accompanying manuals, which you will have to consult for more detailed questions (alternatively use Help within the program). Feel free to provide me with feedback on how the tutorial can be improved for future generations of students (mkeil@claremontmckenna.edu). Colleagues of mine and I have decided to set up a Wiki run by students but supervised by faculty at my academic institution. We have found that the wisdom of crowds often produces valuable information for those who follow. This is, of course, just a suggestion. Finally, you may want to think about working with statistical software as learning a new language: practicing it routinely will result in improvement. If you set it aside for too long, you will only remember the most important lines but will forget the important details. Another danger of tutorials like this is that you simply follow the instructions and when you are done, you do not remember the commands. It is therefore a good idea to keep a separate sheet and to write down commands and examples of them if you think you will use them later. I will give you short exercises so that you can practice the commands on your own.
2. CROSS-SECTIONAL DATA
Interactive Use: Data Input and Simple Data Analysis
Let's get started. Click on the STATA icon to begin your session, or choose STATA 13 from your START window. Once you have started STATA, you will see a large window containing several smaller windows. At this point you can load a data set or enter data (described below) and begin the statistical analysis.
The results of your various operations will be displayed in the so-called Results Window. On the bottom left, there is a Variables Window, which shows the names of variables currently active in the data file. Above it is the Review Window, which lets you view previously used STATA commands. In interactive use, STATA allows you to execute commands either by clicking on command buttons or by typing the equivalent command into the Command Window.
In this tutorial, we will work with two data applications: two cross-sectional (California Test Score Data Set used in Chapters 4-9; and the Current Population Survey Data Set used in Chapters 3 and 8) as an exercise.
a) The Easy and Tedious Way: Manual Data Entry
In Chapters 4 to 9 you will work with the California Test Score Data Set. These are cross-sectional data. There are 420 observations from K-6 and K-8 school districts for the years 1998 and 1999. You will not want to enter a large amount of data manually, since it is tedious and leaves room for human error. As a result, it is generally not a recommended method of inputting data. However, there are occasions when you have collected data by yourself (something that economists are doing more and more). The alternative is to enter the data into a spreadsheet (Excel) and then to cut and paste the data (see below).
Entering data manually is used here for pedagogical purposes since it gives you an initial understanding of how to work with data in STATA. In other words, it will be useful that you become aware of entering, and editing, data in the program. Here I will use a sub-sample of 10 observations from the California Test Score Data Set.
To start, click on the Data Editor button on the toolbar, or type the command edit into the Command Window. This will open the following screen:
To enter data manually, start typing in the observations (you will name the variables subsequently). Here I have chosen 10 observations of test scores (testscr) and the student-teacher ratio (str) from the data set you will use in Chapter 4 of the textbook (type in the numbers for all three).
Make sure not to type the variable names in the three columns, only enter the numbers. Also, should numbers turn from black to red, then it means that STATA cannot identify the data you typed in.
testscr   str    school
606.8     19.5   1
631.1     20.1   2
631.4     21.5   3
631.8     20.1   4
631.9     20.4   5
632       22.4   6
632       22.9   7
638.5     19.1   8
638.7     20.2   9
639.3     19.7   10
After entering the data, double-click the grey box at the top of the first column. This will result in the following box appearing at the bottom right of your screen:
In the Name box, replace var1 with the name of the first column variable, here testscr. In the Label box, you may want to enter information that helps you remember how the data was created originally or as information for others who may subsequently work with your data. I suggest you enter here
Avg test score (=(read_scr+math_scr)/2)
Do a similar operation for the second column, that is, rename var2 as str. Similarly, for the variable str you could enter
Student teacher ratio (enrl_tot/teachers)
Finally, call the third column school.
After completing this task, the Data Editor screen should look as follows:
Next close the box. Note that your commands to edit the data now appear in the Results Box, your command to edit is listed in the Command Box, and your newly created variables are shown in the variable list on the upper right-hand side:
Entering data in this way is very tedious, and you will make data input errors frequently. You will see below how to enter data directly from a spreadsheet or an ASCII file, which are the most common forms of data you will receive in the future.
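As an alternative to clicking through the Data Editor, the same ten observations can be typed with the input command and labeled with label variable. This is a minimal sketch, not part of the original tutorial. The temporary name stratio is hypothetical and is used only as a precaution, on the assumption that input might read the bare word str as a string storage type; the column is renamed to str afterwards.

```stata
* Sketch: enter the ten observations from the table by command
clear
input testscr stratio school
606.8 19.5 1
631.1 20.1 2
631.4 21.5 3
631.8 20.1 4
631.9 20.4 5
632   22.4 6
632   22.9 7
638.5 19.1 8
638.7 20.2 9
639.3 19.7 10
end
* give the second column the name used in the tutorial
rename stratio str
* attach the descriptive labels suggested in the text
label variable testscr "Avg test score (=(read_scr+math_scr)/2)"
label variable str "Student teacher ratio (enrl_tot/teachers)"
list
```

Typing the data this way leaves a record of exactly what was entered, which makes data input errors easier to trace than edits made by hand in the Data Editor.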
In general, you can look at variables that already exist by typing in the command
list varname1 varname2 ...
where varname refers to a variable that exists in your workfile. Try it here by typing
list testscr str
This command will list, one screen at a time, the data on the variables for every observation in the data set. (Missing values are denoted by a period or "." in STATA.) Later on, you will work with large data sets, and you will probably not want to see all observations. You can imagine how long this may take with 5,000 observations or more. Failing to look at the data observation by observation, of course, takes away the ability to spot errors in the data set, perhaps generated by others during data entry. However, there are other methods to spot such problems, such as summarizing the data.
You can always stop the listing by hitting the break button on the toolbar (it looks like a red pentagon with a white x in the middle). This button can be used to stop the execution of any command in STATA.
You should see the following:
b) Summary Statistics
For the moment, let's just see if we are working with the same data set. Type in the following command
sum testscr str, detail
sum stands for summarize and the option detail gives you a more extensive list of summary statistics for each of the variables you have entered. These include the median and certain percentiles of the frequency distribution. You will learn later that you can also obtain summary statistics for a subset of your data by adding an if or in command following the variable name.
The summary statistics are explained in Chapter 2 of your textbook (for example, Kurtosis is defined in equation (2.15) on page 25 in Stock and Watson (2015)).
If your summary statistics differ, then check the data again. To return to the data observations, edit the data using the Data Editor. Once you have located the data problem, click on the observation and change it. After correcting the problem, press the Preserve button again.
After entering the data, there are various things you can do with it. You may want to keep a hard copy of what you just entered. If so, click on the Print button. This will print the entire output of what you have produced so far.
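The if and in qualifiers mentioned in the discussion of list and summarize can be sketched as follows. The cut-off value of 20 and the range 1/5 are arbitrary choices made for illustration; the sketch assumes the variables testscr and str from the ten-observation sample are in memory.

```stata
* show only the first five observations instead of the whole data set
list testscr str in 1/5
* summary statistics for districts with a student-teacher ratio below 20
summarize testscr if str < 20
* detailed statistics (median, percentiles, skewness, kurtosis) for one variable
summarize testscr, detail
```

Restricting the listing with in keeps the output short when a data set has thousands of observations, while if selects observations by their values rather than by their position.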
In general, it is a good idea to save the data and your work frequently in some form. Many of us have learned through multiple painful experiences how easy it is to lose hours of work by not backing up data/results in some fashion. To save the data set you created, either press the Save button or click on File and then Save As. Follow the usual Windows format for saving files (drives, directories, file type, etc.). If you save datasets in STATA readable format, then you should use the extension .dta. Once you have saved your work, you can call it up the next time you intend to use it by clicking on File and then Open. Try these operations by saving the current workfile under the name SW13smpl.dta.
c) Graphical Presentations
Most often it is a good idea to generate graphs (pictures) to get some feel for the data. You will be able to detect outliers which may be the result of data entry errors or you will be able to see if the data makes sense. Although STATA offers many graphing options, we will only go through a few commonly used ones here. (1)
There are three graphs that you will use most often:
- histograms;
- line graphs, where one or more variables are plotted across entities (these will become more important in time series analysis when you are plotting variables over time);
- scatterplots (crossplots), where one variable is graphed against another.
The purpose of histograms is to display absolute or relative frequencies for a single variable. In general, the command is
histogram varname, percent title( )
The percent option produces relative frequencies, and the title option adds whatever name you place between ( ) to the top of the graph.
You can either save the graph you have generated, or copy and paste it into another Windows-based document, such as Word. (Replacing percent with frequency would have resulted in absolute, rather than relative, frequencies being plotted; there are other options for you to explore, such as the number of classes (bins) to choose, etc.)
(1) I found the following STATA site particularly useful for graphs: http://www.stata.com/support/faqs/graphics/gph/statagraphs.html
Try
histogram testscr, percent title(Testscores)
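The other histogram options mentioned above can be tried in the same way. The following is a small sketch in which the choice of five bins and the file name testscores.gph are arbitrary assumptions made for illustration.

```stata
* absolute frequencies instead of percentages, with five classes (bins)
histogram testscr, frequency bin(5) title("Test scores")
* store the graph in STATA's own graph format so it can be reopened later
graph save testscores.gph, replace
```

With only ten observations the shape of the histogram changes noticeably with the number of bins, so it is worth trying a few values.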
To create a line graph in a cross section, you can add a third variable in your data set which takes on the number of the observation (here: 1, 2, 3, ..., 10), in this case, the variable school that we created.
Let's plot the student-teacher ratio for the first 10 observations using the scatter command. The command is followed by the two variables you would like to see plotted, where the first one appears on the Y axis and the second on the X axis.
scatter varname1 varname2
plots variable 1 against variable 2. Try this with the student-teacher ratio and the variable school.
The resulting graph just gives you the data points here. There are two ways to make this more informative. One is to connect the points by using the line command followed by the two variable names. Alternatively, you can use the twoway connected command to have both the points and the lines displayed.
Try both here:
line str school
twoway connected str school
[Figure: histogram titled "Testscores"; Y axis: Percent; X axis: Avg test score (=(read_scr+math_scr)/2)]
After the graph appears, you can edit it using the Graph Editor (either use File and then Start Graph Editor or push the Graph Editor button). Alter the graph until it looks like the one below. Some of the alterations can be made in the resulting dialog boxes.
Frequently you will be interested either in causal relationships between variables or in the ability of one variable to forecast another. As a result, it is a good idea to plot two variables in the same graph.
The first way to look for a relationship is to plot the observations of both variables. This can be done by generalizing the command twoway connected to include more than two variable names (the last one named is plotted on the X axis and the others on the Y axis). Try this here with
twoway connected str testscr school
The resulting graph is pretty uninformative, since test scores and student-teacher ratios are on a different scale. You can allow for two (or more) scales by entering the following command:
twoway (scatter str school, c(l) yaxis(1)) (scatter testscr school, c(l) yaxis(2))
This command instructs STATA to use two Y axes, one for the student-teacher ratio on the left side of the graph, and the other for test scores on the right side of the graph. You may want to beautify the resulting graph by using the graph editor. See if you can produce something like the graph below:
[Graph 1: Student-Teacher Ratio Across 10 School Districts; Y axis: Student-Teacher Ratio; X axis: School District]
To get an even better idea about the relationship, you can display a two-dimensional relationship in a scatterplot (see page 92 of your Stock and Watson (2015) textbook). Given our discussion above, you could simply use the command scatter testscr str. However, you may want to see what a fitted line through that scatter plot would look like, in which case you have to modify the command slightly:
scatter testscr str || lfit testscr str
where || is the key | typed twice.
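The same overlay can also be written with parentheses instead of ||, which leaves room for titles. This is a sketch only; the title and axis labels are taken from the finished graph shown in the tutorial, and the /// line continuation works in a Do-File but not when typed directly into the Command Window.

```stata
* scatterplot with a fitted regression line, plus titles
twoway (scatter testscr str) (lfit testscr str), ///
    title("Scatterplot of Test Scores vs Student-Teacher Ratio") ///
    ytitle("Test Scores") xtitle("Student-Teacher Ratio")
```

Setting the titles in the command rather than in the Graph Editor means the graph can be reproduced exactly by rerunning the Do-File.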
This will result in the following graph (after beautification):
[Graph 3: Scatterplot of Test Scores vs Student-Teacher Ratio, with fitted values; Y axis: Test Scores; X axis: Student-Teacher Ratio]
[Graph 2: Test Scores and Student-Teacher Ratio Across 10 School Districts; left Y axis: Student teacher ratio; right Y axis: Avg test score; X axis: School District]
(Not to worry about the positive slope here. Remember, this is a sample, and a very small one at that. After all, you may get 10 heads in 10 flips of a coin.)
d) Simple Regression
There is a commonly held belief among many parents that lower student-teacher ratios will result in better student performance. Consequently, in California, for example, all K-3 classes were reduced to a maximum student-teacher ratio of 20 (Class Size Reduction Act, CSR) in the late 1990s. This comes at a cost, of course. Initially, it was $1.8 billion a year. With dollar figures as big as these (ask yourself, if you laid down a dollar bill every second, how many years would it take to reach 1 billion?), the natural question arises whether or not it is worth it. That is why you are analyzing the effect of reducing student-teacher ratios in Chapters 4-9 of the Stock and Watson textbook.
For the 10 school districts in our sample, we seem to have found a positive relationship between larger classes and student performance. Not to worry: we will soon work with all 420 observations from the California School Data Set, and we will then find the negative relationship you have seen in the textbook. For now, we are more concerned about learning techniques in STATA.
In the previous section, we included a regression line in the scatterplot, something that you should have encountered towards the end of your statistics course. However, the graph of the regression line does not allow you to make quantitative statements about the relationship; you want to know the exact values of the slope and the intercept. For example, in general applications, you may want to predict the effect of an increase by one in the explanatory variable (here the student-teacher ratio) on the dependent variable (here the test scores).
To answer the questions relating to the more precise nature of the relationship between class size and student performance, you need to estimate the regression intercept and slope. A regression line is little else than fitting a line through the observations in the scatterplot according to some principle. You could, for example, draw a line from the test score for the lowest student-teacher ratio to the test score for the highest student-teacher ratio, ignoring all the observations in between. Or you could sort the data by student-teacher ratio and split the sample in half so that the five observations with the lowest student-teacher ratios are in one set, and the five observations with the highest student-teacher ratios are in the other set. For each of the two sets you could calculate the average student-teacher ratio and the corresponding average test score, and then connect the two resulting points. Or you could just eyeball the relationship. Some of these principles have better properties than others to infer the true underlying (population) relationship from the given sample. The principle of Ordinary Least Squares (OLS), for example, will give you desirable properties under certain restrictive assumptions that are discussed in Chapter 4 of the Stock/Watson textbook.
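For reference, the OLS principle mentioned above can be written out for a single explanatory variable. The notation ($Y_i$ for the dependent variable, $X_i$ for the explanatory variable, $N$ observations) follows this tutorial; $b_0$ and $b_1$ are simply names for candidate values of the intercept and slope. This is the standard textbook statement, given here as a reminder rather than a derivation.

```latex
% OLS chooses the intercept and slope that minimize the sum of squared mistakes
\min_{b_0,\, b_1} \; \sum_{i=1}^{N} \left( Y_i - b_0 - b_1 X_i \right)^2

% The resulting estimators of the slope and the intercept
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
                     {\sum_{i=1}^{N} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```

Here $\bar{X}$ and $\bar{Y}$ are the sample averages of the student-teacher ratio and the test score in the current application.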
Back to computing. If the dependent variable, Y, is only determined by a single explanatory variable X in a linear fashion of the type
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, 2, \ldots, N$$
with u representing the error, or random disturbance, not accounted for by the linear equation, then the task is to find a value for $\beta_0$ and $\beta_1$. If you had values for these coefficients, then $\beta_1$ describes the effect of a unit increase in X on Y.
Often a regression line is a linear approximation to an underlying complicated relationship and the intercept $\beta_0$ only has a useful meaning if observations around X=0 occur in the data. As we have seen in the scatterplot above, there are no observations around the student-teacher ratio of zero, and it is therefore better not to interpret the numerical value of the intercept at all. Your professor most likely will give you a serious penalty in the exam for interpreting the intercept here because with no students present, there is no score to record. (What would be the function of the teacher in that case?)
There are various ways to estimate the regression line. The command for regressing a variable Y on a constant (intercept) and another variable X is:
reg Y X
where reg stands for least squares regression. For the current application, type
reg testscr str, r
where the r following the comma indicates that you are using heteroskedasticity-robust standard errors (even though you have not requested an intercept to be included, STATA will automatically do so. There is an option for you to suppress the intercept, but you will most likely never use it).
The output appears as follows:
According to these results, lowering the student-teacher ratio by one student per class results in a decrease of 0.6 points, on average, in the district wide test score. Using the notation of your textbook, you should display the results as follows:
$$\widehat{TestScore} = \underset{(51.1)}{618.9} + \underset{(2.33)}{0.61}\, STR, \quad R^2 = 0.007, \quad SER = 9.8$$
Note that the result for the 10 chosen school districts is quite different from the sample of all 420 school districts. However, as pointed out before, this is a rather small sample and the regression $R^2$ is quite low. As a matter of fact, in Chapter 5 of your textbook, you will learn that the above slope coefficient is not statistically significant.
e) Entering Data from a Spreadsheet
So far you entered data manually. Most often you will work with larger data sets that are external to the STATA program, i.e., they will not be included in, or be part of, the program itself. This makes sense as data sets either become very large or are generated by another program, such as a spreadsheet.
Stock and Watson present the California Test Score Data Set in Chapter 4 of the textbook. Locate the corresponding Excel file caschool.xlsx on the accompanying website (where you found this tutorial) and open it. Next, following the procedures discussed previously, start STATA and open the Data Editor. Return to the Excel file and mark F1:R421. Next, using the copy and paste commands common to Windows programs, move the data block to STATA, choosing the option Treat First Row as Variable Names. You are presumably familiar with this procedure. Make sure to select the grey box to the immediate right of 1 before pasting. Note that STATA has conveniently included the name of the variables in the Data Editor.
This is what you should see in STATA:
When you are done, you are ready to save the file. Name it caschool.dta.
You can now reproduce Equation (4.7) from the textbook. Use the regression command you previously learned to generate the following output.
. reg testscr str, r
Linear regression                                  Number of obs =     420
                                                   F(  1,   418) =   19.26
                                                   Prob > F      =  0.0000
                                                   R-squared     =  0.0512
                                                   Root MSE      =  18.581
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
       _cons |    698.933   10.36436    67.44   0.000     678.5602    719.3057
(You can find the standard errors and the distribution of the estimators on p. 131 of the Stock and Watson (2015) textbook. The regression $R^2$, sum of squared residuals (SSR), and standard error of the regression (SER) are presented in Key Concept 4.3.)
f) Importing Data Files directly into STATA
Excel (Spreadsheet) Files
Even though the cut and paste method seemed straightforward enough, there is a second, more direct, way to import data into STATA from Excel, which does not involve copying and pasting data points.
In general, make sure your data is organized with the variable names in Row 1 of your spreadsheet with each column representing a different variable, and the observations in the rows beneath the variable names. Then, save your data set in Excel (or an alternative spreadsheet program) as a .csv file (this stands for comma separated values).
Start again with a new STATA file. Next, type the following command into the command window in STATA:
insheet using (filename)
where (filename) is the directory location of your file. (To find this, locate the file and right-click, selecting the Properties button. This should contain the location of the file, to which you must add the filename; here is an example: C:\Econometrics\StockWatson\caschool.csv.) If your filename has any spaces or any symbol that appears on the number keys of the keyboard, then you should put quotation marks around your filename. STATA reads spaces as denoting separations between words, and therefore will only read the filename up until the first space or symbol, and then considers the rest to be a separate command.
Note: In order to insheet data, there must be no data already stored in memory. To get rid of any data that is already stored, type the command
clear
before insheeting.
Once you have insheeted your data, you should see this reflected in your Results box and your variables should appear in your Variables List box. You can type edit to see your data in the data editor.
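Putting the pieces together, the import steps above might look as follows. The path is the example location quoted in the text and will differ on your computer; the quotation marks are optional here but become necessary as soon as the path contains a space.

```stata
* remove any data currently in memory, then read the .csv file
clear
insheet using "C:\Econometrics\StockWatson\caschool.csv"
* check that the variables and the number of observations look right
describe
* STATA 13 also provides the newer import delimited command, for example:
* import delimited using "C:\Econometrics\StockWatson\caschool.csv", clear
```

The describe command is a quick way to confirm that the variable names from Row 1 of the spreadsheet were picked up and that all observations were read.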
To save your data as a STATA file, click on File on the upper toolbar, then select Save As. When you save your file, make sure it is saved as a .dta file. This type of file can only be opened in STATA. Alternatively, you can type the command
save (filename)
where (filename) is the directory location and name of your file. If you have a previous version of this saved already, to overwrite the old version add replace after the save command, separated by a comma. Because the path in the following example contains a space, it is enclosed in quotation marks:
save "C:\My Documents\test.dta", replace
If you wish to save a file that has been previously saved in the same directory location as the previous version, you may use the command
save, replace
Note: When you save a STATA dataset, you are really only saving the dataset as it exists at thetime you chose to save. You are not retaining any of the analysis you may have conducted,such as running regressions or testing for the statistical significance of coefficients. However,if you have changed the data since opening the file, such as edited observations, these changeswill be reflected.
As an exercise, copy the caschool.xls or caschool.xlsx data file from the Stock and Watson website and save the Excel file in some subdirectory on your computer as a .csv file. Then import the data set using the insheet command. Finally run the simple regression of testscr on str and check that your output contains 420 observations and corresponds to the STATA regression output in the previous section.
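One way to carry out the check in this exercise is to look at the results STATA stores after a regression. The sketch below assumes the full caschool data set has already been imported; the assert line simply stops with an error if the number of observations is not 420.

```stata
* rerun the regression without printing the table
quietly regress testscr str, robust
* pull individual numbers out of the stored results
display "Observations:    " e(N)
display "Slope on str:    " _b[str] "  (robust SE " _se[str] ")"
display "Intercept:       " _b[_cons]
display "R-squared:       " e(r2)
display "SER (Root MSE):  " e(rmse)
* fail loudly if the import dropped or added observations
assert e(N) == 420
```

The displayed values can then be compared line by line with the regression output for Equation (4.7).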
ASCII data
You can also import data from an ASCII file (text file). This assumes that you either saved data from a different source as an ASCII file or that you received data in ASCII file format. The file must be organized with one observation in each row, and the variables in the data set must be in separate columns.
Using the infile command, type the name of the variable that represents each column, followed by the file name.
For example, consider an ASCII dataset that looks as follows:
ahe educ exper union married
10.75 12 6 1 0
16.50 16 3 0 0
..
12.10 12 8 1 1
and which you want to import into STATA.
Each row corresponds to observations on an entity (here an individual). The first column above is the hourly wage, the second is years of education, the third is potential experience, the fourth is a binary variable which equals one if the individual belongs to a union and is zero otherwise, and the last column is another binary variable which takes on the value of one if the individual is married and is zero otherwise.
To import the data, you type the following command:
infile ahe educ exper union married using (filename)
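A minimal sketch of the whole round trip is given here. The file name wages.txt is hypothetical, and the first few commands exist only to create a small text file so that the example can be run on its own. Note that the file written here has no header row: infile reads every entry as data, so a first row holding the variable names would have to be removed before importing.

```stata
* Sketch only: wages.txt is a hypothetical file name
* Write three rows of example data to a text file
tempname fh
file open `fh' using wages.txt, write text replace
file write `fh' "10.75 12 6 1 0" _n
file write `fh' "16.50 16 3 0 0" _n
file write `fh' "12.10 12 8 1 1" _n
file close `fh'

* Read the file: one variable name per column, values separated by blanks
clear
infile ahe educ exper union married using wages.txt
* Check that the values landed in the intended variables
list in 1/3
summarize
```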
STATA dataset
Data files that have been saved in STATA format carry the extension .dta.
To open a dataset that is already saved as a .dta file, you can either go to File and then Open to select your dataset, or you can type the command
use (filename)
This will open your dataset in STATA, as long as you have changed your working directory to the location on your computer where the data file is stored. The command to change the working directory is
cd C:\(location)
Here are two tricks that will be of help down the road.
(i) If you are not sure how to type in the location of your data file, just right-click on your Start button and select Explore. Then find your data set. Next right-click on the data set and choose Properties. A new window opens up. Copy the Location. Return to the Command Window in STATA, type use, and then paste the location. Add \ and the name of the file, including the extension. Then finish the command with , clear.
Here is an example from my computer:
use C:\ClaremontLectures\ECON125\STATA\baseb.dta, clear
(ii) The clear command is very important. It erases previous data, if there was any, from memory. I, and others, have wasted time trying to find errors in programming simply by not clearing memory. Even if you don't understand the reason, the advice is always to include the clear command when you read in a new data set.
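The two tricks above can be combined into a short sequence. This is a sketch under the assumption that caschool.dta has been saved in a folder called C:\statafiles; substitute your own location.

```stata
* Sketch only: the folder name is a placeholder
cd "C:\statafiles"
* pwd prints the current working directory so you can confirm the change
pwd
* dir lists the files in that directory
dir
* Load the data set; clear discards whatever is currently in memory
use caschool.dta, clear
* A quick first look at the variables
describe
```

Putting the path in double quotes is optional here, but it is required as soon as a folder name contains a space.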
You can try doing this with the caschool.dta data set from the Stock and Watson website. Simply save that data set on your computer, then double click on it. This will open STATA with the data loaded already. Obviously this is the easiest method to import data into STATA.
Regardless of which method you use to import data, it is always a good idea to inspect the data to check if there are some abnormalities. To do this, click on the Data Editor (Browse) button below the drop down menus.
g) Multiple Regression Model
Economic theory most often suggests that the behavior of a certain variable is influenced not only by a single variable, but by a multitude of factors. The demand for a product, e.g. LA Laker tickets, depends not only on the price of the product but also on the price of other goods, income, taste, etc. Similarly, the Phillips curve suggests that inflation depends not only on the unemployment rate, but also on inflationary expectations and possibly supply shocks, etc.
An extension of the simple regression model is the multiple regression model, which incorporates more than one regressor (see Equation (6.7) in the textbook on page 192).
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n.$$
To estimate the coefficients of the multiple regression model, you proceed in a similar way as in the simple regression model. The difference is that you now need to list the additional explanatory variables. In general, the command is:
reg Y X1 X2 ... Xk, (options)
where (options) can be omitted (this is the default and gives you homoskedasticity-only standard errors) or can be replaced by various possible entries (e.g. r for heteroskedasticity-robust standard errors).
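To make the role of the option concrete, here is a minimal sketch that runs the same regression twice, once without and once with the r option. It assumes that caschool.dta is in the current working directory.

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
* Homoskedasticity-only standard errors (the default)
reg testscr str el_pct meal_pct calw_pct
* Heteroskedasticity-robust standard errors; r abbreviates robust
reg testscr str el_pct meal_pct calw_pct, r
```

The coefficient estimates are the same in both runs; only the standard errors, and with them the t-statistics, p-values, confidence intervals, and the reported F-statistic, change.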
See if you can reproduce the following regression output, which corresponds to Column 5 in Table 7.1 of the Stock and Watson (2015) textbook (page 241). The option used below is r, which produces heteroskedasticity-robust standard errors (STATA refers to these as Robust Standard Errors).

. reg testscr str el_pct meal_pct calw_pct, r

Linear regression                                      Number of obs =     420
                                                       F(  4,   415) =  361.68
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.7749
                                                       Root MSE      =  9.0843

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.014353   .2688613    -3.77   0.000    -1.542853   -.4858534
      el_pct |  -.1298219   .0362579    -3.58   0.000     -.201094   -.0585498
    meal_pct |  -.5286191   .0381167   -13.87   0.000    -.6035449   -.4536932
    calw_pct |  -.0478537   .0586541    -0.82   0.415    -.1631498    .0674424
       _cons |   700.3918   5.537418   126.48   0.000      689.507    711.2767
------------------------------------------------------------------------------

The interpretation of the coefficients is equivalent to that of a controlled science experiment: each coefficient indicates the effect of a unit change in the relevant variable on the dependent variable, holding all other factors constant (ceteris paribus).
Section 7.2 of the Stock and Watson (2015) textbook discusses the F-statistic for testing restrictions involving multiple coefficients, the so-called Wald test. To test whether all of the above coefficients are zero with the exception of the intercept, you can use the test command followed by each restriction that you want to test in parentheses (STATA uses the name of the variable associated with the coefficient in combination with the restriction).
Type
test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)
STATA will generate the following output:
Note that the F-statistic is identical to the same statistic listed in the regression output.
See if you can generate the F-statistic of 5.43 following Equation (7.6) in the Stock and Watson (2015) text and listed at the bottom of page 223 (restrict the coefficients of STR and Expn to be zero).
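One way to approach this exercise is sketched here. It assumes that caschool.dta is in the working directory and that the expenditure variable is the one called expn_stu in that data set.

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
reg testscr str expn_stu el_pct, r
* Joint null hypothesis: the coefficients on str and expn_stu are both zero
test str expn_stu
* The same test written with explicit restrictions
test (str=0) (expn_stu=0)
* test leaves the F-statistic and its p-value in r(F) and r(p)
display "F = " r(F) "   p-value = " r(p)
```

Rescaling expn_stu (for example, expressing it in thousands of dollars) changes its coefficient but not the F-statistic, so the test can be run on the variable as stored.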
h) Data Transformations
So far, we have only used data in regressions that already existed in some file that we either created or used. Almost always, you will be required to transform some of the raw data that you received before you run a regression. In STATA you transform variables by using the gen (as in generate) command. For example, Chapter 8 of the Stock/Watson textbook introduces the polynomial regression model, logarithms, and interactions between variables. Let's reproduce Equations (8.2), (8.11), (8.18), and (8.37) here. The following commands generate the necessary variables²:
gen avginc2=avginc^2
gen avginc3=avginc^3
gen lavginc=log(avginc)
gen ltestscr=log(testscr)
gen strpctel=str*el_pct
Note how the commands and generated variables are displayed in STATA, including those in red when you make a mistake in the command (e.g. genr instead of gen).
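After generating variables it is worth checking them before they enter a regression. The sketch below assumes that caschool.dta is in the working directory; the variable labels are illustrative and not part of the original data set. In STATA, log() returns the natural logarithm (ln() is a synonym).

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
gen avginc2 = avginc^2
gen lavginc = log(avginc)
* Labels (illustrative) make the new variables easier to recognize later
label variable avginc2 "Average income squared"
label variable lavginc "Log of average income"
* Check the new variables before using them
summarize avginc avginc2 lavginc
* A quadratic specification in income with robust standard errors
reg testscr avginc avginc2, r
```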
² For example, I have generated a variable called avginc2, and assigned it to be the square of the previously defined variable avginc. Note that I am generating variable names that are self-explanatory. They could have been called variable1, variable2, variable3, etc. but it is a good idea to create variable names that you can remember.
. test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

 ( 1)  str = 0
 ( 2)  el_pct = 0
 ( 3)  meal_pct = 0
 ( 4)  calw_pct = 0

       F(  4,   415) =  361.68
            Prob > F =    0.0000
Next run the four regressions using the same technique as for multiple regression analysis.
Finally save your workfile again and exit STATA.
Exercise
One of the problems with the type of tutorial you are working on is that you just follow instructions without internalizing them. The typical student will finish the tutorial with few problems but then little is retained. If I asked you to retrieve a data set and to run a few regressions, for example, would you be able to do that? Or would you say "how do I do this?"
Let's see how much you understood. Go to the Stock and Watson website for the 3rd edition (http://www.pearsonhighered.com/stock_watson). Enter the Student Resources in the Companion Web Site, and download the CPS data set for Chapter 8 (Data Sets for Replicating Empirical Results: CPS Data Used in Chapter 8). Next open it in STATA³.
³ Note for STATA 12 users: if you just double click on the cps_ch8.dta file, an error message will occur that tells you that insufficient memory was allocated. Before you open the cps_ch8.dta file, increase your memory (usually set at 1 MB by default). You can do this by typing in the command
set mem 10m
which increases the memory to 10 megabytes. In general, make sure to set the memory large enough to handle the data set, but small enough for your computer to handle the program (use k for kilobyte, m for megabyte, and g for gigabyte).
Then replicate the results for column (1) from Table 8.1 on page 288 of the Stock and Watson (2015) textbook.
Why do you think your results differ from those listed in the table? What if you found a way to restrict your sample to only include individuals who are at least 30 but not older than 64? To find a way to restrict your sample, look for Help and the if command. Then, restricting your sample to those individuals in that age group, replicate columns (1) to (3). For column (4), define potential experience as the Mincer experience variable (age - Years of Education - 6).
Batch Files
So far, you have either clicked on buttons in STATA or used the Command Window to type executable statements (commands one by one, or line by line). But what if you wanted to keep a permanent record of all the transformations you made, regressions you tried, graphs you created, etc.? In that case, you would need to create a program that consists of a list of line commands similar to those that you used in the Command Window previously. After having created such a program, which is a text or ASCII file, you can then execute (run) it and view the output afterwards (if the program did not contain any errors). Batch files can also include loops and conditional branching (if you don't know what these are, not to worry). Batch files in STATA are called Do-Files.
Using STATA in batch mode has two important advantages over using STATA interactively:
(i) the Do-File provides an audit trail for your work. The file provides an exact record of each STATA command;
(ii) even the best computer programmers will make typing or other errors when using STATA. When a command contains an error, it won't be executed by STATA, or worse, it will be executed but produce the wrong result. Following an error, it is often necessary to start the analysis from the beginning. If you are using STATA interactively, you must retype all of the commands. If you are using a Do-File, then you only need to correct the command containing the error and rerun the file.
Let's create such a program. Click on the New Do-File Editor button. This opens the STATA Do-File Editor box.
Type in the following commands exactly as they appear.
log using \statafiles\stata1.log, replace
use \statafiles\caschool.dta
describe
generate income = avginc*1000
summarize income
log close
exit
Here is the meaning of the seven lines of this program:
Line 1: This is an administrative command that tells STATA where to store the results of your analysis. STATA output files are called log files. The current line tells STATA to open a log file called stata1.log (you could have used any name, such as love_metrix.log; the word stata1 is not required here). If there is already a file with the same name in the folder, STATA is instructed to replace it. Before you save the Do-File, replace the path in this line with the relevant path on the computer you are using.
Line 2: This line concerns the data set. As you learned earlier in the tutorial, datasets in STATA are called dta files. The dataset which you will use here is caschool.dta, which you downloaded earlier. The current line tells STATA the location and name of the dataset to be used for the analysis. Before you save the Do-File, replace the path in this line with the relevant path of the location where you saved caschool.dta.
Line 3: This line also concerns the data set. It tells STATA to describe the dataset (a shorter version of the command is des instead of describe). This command produces a list of the variable names and any variable descriptions stored in the data set.
Line 4: This line tells STATA to create a new variable called income (a shorter version of the command is gen instead of generate). The new variable is constructed by multiplying the variable avginc by 1000. The variable avginc is contained in the dataset and is the average household income in a school district expressed in thousands of dollars. The new variable income will be the average household income expressed in dollars instead of thousands of dollars.
Line 5: This line tells STATA to compute some summary statistics (a shorter version of the command is sum instead of summarize). STATA will produce the mean, standard deviation, etc.
Line 6: This line closes the file stata1.log, which contains the output.
Line 7: This line tells STATA that the program has ended.
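A common variation on the seven lines above is sketched here; it is not part of the original program. It assumes that the working directory already points to the folder that holds caschool.dta, so no paths are needed. The first command guards against the error that arises when a log file from an earlier, interrupted run is still open.

```stata
* Sketch only: assumes the working directory contains caschool.dta
* Close any log left open by an earlier run; capture suppresses the
* error that log close would raise when no log is open
capture log close
log using stata1.log, replace
* clear lets the file be rerun even if data are already in memory
use caschool.dta, clear
describe
generate income = avginc*1000
summarize income
log close
```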
As long as you have replaced the path in line 1 and line 2 with the relevant paths from the computer you are working on, and if you downloaded/saved the California Test Score Data Set, then we are good to go. Save the Do-File, using the .do suffix. Next execute this Do-File by first opening STATA on your computer. Next, click on the File menu, then Do, and then select the stata1.do file you just saved. This will run or execute the program.
(Alternatively, you can run the program, or even just part of the program, by hitting the Execute (do) button in the Do-file Editor.)
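The same can be done from the Command Window. The following is a sketch that assumes the Do-File was saved as stata1.do in a folder called C:\statafiles (a placeholder; use your own location).

```stata
* Sketch only: the folder name is a placeholder
cd "C:\statafiles"
* do executes the file and echoes each command and its output
do stata1.do
* run executes the same file but suppresses the output; clear first so that
* the use command inside the file does not stop because data are in memory
clear
run stata1.do
```

Because the Do-File writes a log, the results of the silent run can still be read in stata1.log afterwards.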
You will be able to see the program being executed in the Results Window. Since the execution will not fit into one screen, you can scroll up and see everything that happened during the run. Sometimes (although not here) you may see that the program execution pauses, and that
--more--
is displayed at the bottom of the Results Window. If this happens, push any key on the keyboard and execution will continue.
To exit STATA, click on the usual exit button at the top right of STATA (alternatively click on File and then Exit). STATA will ask you if you really want to exit, and you will respond Yes.
Your output has been saved in stata1.log and you can look at it by opening the file with any text editor (Notepad, for example) or in Word/WordPerfect. Here is what you should see:
-------------------------------------------------------------------------------
      name:  <unnamed>
       log:  yourpathhere
  log type:  text
 opened on:  yourdateandtimehere

. use C:\yourpathhere

. describe

Contains data from C:\yourpathhere\caschool.dta
  obs:           420
 vars:            13                          yourdatehere
 size:        20,160
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
enrl_tot        int    %8.0g
teachers        float  %8.0g
calw_pct        float  %8.0g
meal_pct        float  %8.0g
computer        int    %8.0g
testscr         float  %8.0g
comp_stu        float  %8.0g
expn_stu        float  %8.0g
str             float  %8.0g
avginc          float  %8.0g
el_pct          float  %8.0g
read_scr        float  %8.0g
math_scr        float  %8.0g
-------------------------------------------------------------------------------
Sorted by:

. generate income = avginc*1000

. summarize income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       420    15316.59     7225.89       5335      55328

. log close
      name:  <unnamed>
       log:  C:\yourpathhere\stata1.log
  log type:  text
 closed on:  yourdateandtimehere
-------------------------------------------------------------------------------
You now have an initial idea of how to work with Do-Files in STATA. The rest of this part of the tutorial will guide you through further commands and make the initial Do-File more complex.
I suggest that you continue to work with the batch file you just created and add new lines to this program (if you use the .pdf version of this tutorial or have printed the tutorial using a color printer, then the new commands will appear in red).
#delimit ;
******************************************;
* Administrative Commands;
******************************************;
set more off;
clear;
log using C:\statafiles\stata1.log, replace;
******************************************;
* Read in the data set;
*******************************************;
use C:\statafiles\caschool.dta;
des;
*******************************************;
* Transform Data and Create New Variables;
*******************************************;
* Construct Average District Income in $s;
*******************************************;
gen income = avginc*1000;
*******************************************;
* carry out statistical analysis;
*******************************************;
* summary statistics for Income;
*******************************************;
sum
income;
*******************************************;
* end of program;
*******************************************;
log close;
exit;
The new version of the Do-File carries out exactly the same calculations as before. However, it uses four features of STATA for more complicated analysis. The first new command is
#delimit ;
This command tells STATA that each STATA command ends with a semicolon. If STATA does not see a semicolon at the end of the line, then it assumes that the command carries over to the following line. This is useful because complicated commands in STATA are often too long to fit on a single line. (Make sure to place a ; at the end of the seven old commands.) The above Do-File contains an example of a STATA command written on two lines: near the bottom of the file you see the command sum income written on two lines. STATA combines these two lines into one command because the first line does not end with a semicolon. While two lines are not necessary for this command, some STATA commands can get quite long, so it is good to get used to employing this feature.
A word of warning: if you use the #delimit ; command, it is critical that you end each command with a semicolon. Forgetting the semicolon on even a single line means that the Do-File will not run properly (again, don't forget to add the seven ; in the first version of the program).
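The behavior of the delimiter can be sketched in a few lines. The example assumes that caschool.dta is in the working directory, and it must be run from a Do-File, since #delimit has no effect when typed interactively. It also shows #delimit cr, which switches back to the default, and the three-slash continuation, which is an alternative way to split a long command without changing the delimiter.

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
gen income = avginc*1000

#delimit ;
* From here on a command ends at the semicolon, not at the line break;
summarize
    income
    testscr;
* Return to the default, where every line is one command;
#delimit cr

* Alternative that keeps the default delimiter: three slashes at the
* end of a line continue the command on the next line
summarize income ///
    testscr
```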
The second new feature of the above Do-File is that many of the lines begin with an asterisk. STATA ignores the text on a line that begins with *, so that these lines can be used for comments or to describe what the commands that follow are doing. Note that each of these lines ends with a semicolon. Without the semicolon, STATA would include the next line as part of the text description.
A final new feature in the program is the command
set more off
This command eliminates the need to hit a key on your keyboard when STATA fills the Results Window and stops displaying further results (the --more-- message would appear).
Run the program and have a look at the new log file.
Next, change the previous version of the Do-File by adding commands until the new version looks as follows (again, new commands can be seen in red if your tutorial displays colors):
#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using C:\STATA\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use C:\STATA\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;
There are three new features in this new version.
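1) New variables are created using only a portion of the dataset. Two of the variables in the dataset are testscr (the average test score in a school district) and str (the district's average class size or student teacher ratio). The STATA command
gen testscr_lo = testscr if (str<20)
creates a new variable testscr_lo that equals testscr for districts with a student teacher ratio below 20 and is missing for all other districts. The variable testscr_hi is defined in the same way for districts with a student teacher ratio of 20 or more.
2) The ttest command is used in two forms. The first form is
ttest testscr=0

. ttest testscr=0

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 testscr |     420    654.1565    .9297082    19.05335    652.3291     655.984
------------------------------------------------------------------------------
    mean = mean(testscr)                                          t = 703.6149
Ho: mean = 0                                     degrees of freedom =      419

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

This command reports the sample mean and standard deviation of testscr, the t-statistic for the null hypothesis that the population mean is zero, and a 95% confidence interval for the population mean. (In this example, the t-test that the population mean of test scores is equal to zero is not really of interest, but the confidence interval for the mean is what we are looking for in this example.) The same command is then used for testscr_lo and testscr_hi (see Sections 3.2 and 3.3 in Stock and Watson (2015)).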
The second form of the command is
ttest testscr_lo=testscr_hi, unequal unpaired
Executing this statement will test the hypothesis that testscr_lo and testscr_hi come from populations with the same mean. That is, the command computes the t-statistic for the null hypothesis that the (population) mean of test scores for districts with class sizes less than 20 students is the same as the mean of test scores for districts with class sizes of 20 or more students. The command uses two options that are listed after the comma in the command. These options are unequal and unpaired. The option unequal tells STATA that the variances in the two populations may not be the same. The option unpaired tells STATA that the observations are for different districts, that is, these are not panel data representing the same entity at two different time periods (see Section 3.4 in Stock and Watson (2015)).
3) A third new feature in the Do-File is the command replace. This appears near the bottom of the file. Here, the analysis is to be carried out again, but using 19 as the cutoff for small classes. Since the variables testscr_lo and testscr_hi already exist (they were defined by the gen command earlier in the program), STATA cannot generate variables with the same name. Instead, the command replace is used to replace the existing series with new series. In essence, the command instructs the program to overwrite the previously stored data.
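One point is worth checking when the cutoff is changed: replace combined with an if qualifier changes only the observations that satisfy the condition and leaves all other observations as they were. The following is a minimal sketch, not part of the original program, of one way to reassign every district under the new cutoff. It assumes that caschool.dta is in the working directory.

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
gen testscr_lo = testscr if (str<20)
gen testscr_hi = testscr if (str>=20)
* Reassign every observation: missing (.) where the condition fails
replace testscr_lo = cond(str<19, testscr, .)
replace testscr_hi = cond(str>=19, testscr, .)
* The two counts should add up to the number of districts
count if !missing(testscr_lo)
count if !missing(testscr_hi)
```

An equally simple alternative is to drop the two variables and generate them again with the new cutoff.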
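. ttest testscr_lo=testscr_hi, unequal unpaired

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000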
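The same comparison can also be set up with a single grouping variable instead of two separate test score variables. This is a sketch of that alternative, assuming caschool.dta is in the working directory; the variable name small is made up for the illustration.

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
* Indicator equal to 1 for districts with str below 20 and 0 otherwise
gen small = (str<20)
* Compare mean test scores across the two groups, allowing unequal variances
ttest testscr, by(small) unequal
```

With by(), STATA reports the difference as the mean of the group coded 0 minus the mean of the group coded 1, so the sign of the difference and of the t-statistic is the reverse of the output above.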
You are now ready to execute (run) the program as done before.
As before, change the previous version of the Do-File by adding commands until the new version looks as follows (again, new commands can be seen in red if your tutorial displays colors):
#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, robust;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str<20);
reg testscr d, robust;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str<19);
replace testscr_hi=testscr if (str>=19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;
The new commands reproduce some of the empirical results shown in Chapters 4 and 5 of Stock and Watson (2015). There are several features of STATA included in the new commands which have not been used in the previous examples:
1) The summarize command (sum) now includes the option detail, which provides more detailed summary statistics. The command is written as
sum str testscr, detail
This command tells STATA to compute summary statistics for the two variables str and testscr. The option detail produces detailed summary statistics that include, for example, the percentiles that are reported in Table 4.1 on p. 113 of Stock and Watson (2015).
2) The command
twoway scatter testscr str || lfit testscr str
constructs a scatterplot of testscr versus str and includes the estimated regression line for the simple regression of the California Test Score Data Set, shown on p. 116 of Stock and Watson (2015).
3) The command
cor str testscr
tells STATA to compute the correlation between the student teacher ratio and test scores.
4) Next you will reproduce equations (4.11) and (5.8) in Stock and Watson (2015) by using the regress (or short reg) command:
reg testscr str, r
This instructs STATA to run an OLS regression with testscr as the dependent variable and str as the regressor. The robust (short r) option tells STATA to use heteroskedasticity-robust formulas for the standard errors of the regression coefficient estimators. Omitting this option results in the display of homoskedasticity-only standard errors.
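5) The final innovation over the previous version of the Do-File is contained in the two commands following the line Equation 5.18. First a binary (sometimes referred to as dummy or indicator) variable d is created using the STATA command
gen d = (str<20)
The expression in parentheses evaluates to 1 when it is true and to 0 when it is false, so d equals 1 for districts with a student teacher ratio below 20 and 0 otherwise. The second command then regresses testscr on d.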
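The steps in item 5) can be sketched as follows, assuming caschool.dta is in the working directory. The tabulate and display lines are additions for checking the result and are not part of the original program.

```stata
* Sketch only: assumes caschool.dta is in the working directory
use caschool.dta, clear
* A logical expression evaluates to 1 when true and 0 when false
gen d = (str<20)
* Tabulate the indicator to see how many districts fall in each group
tabulate d
* Regress test scores on the indicator with robust standard errors
reg testscr d, r
* _b[] and _se[] give access to the estimates after the regression
display "Difference in means = " _b[d] "   (SE = " _se[d] ")"
```

The coefficient on d is the difference between the mean test score in districts with d equal to 1 and the mean in districts with d equal to 0.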
...
gen testscr_lo = testscr if (str<20);
gen testscr_hi = testscr if (str>=20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, r;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str<20);
reg testscr d, r;
...
ttest ts_lostr=ts_histr if elq4==1, unp une;
*********************************************************;
* Equation 7.5 ;
*********************************************************;
reg testscr str el_pct, r;
*********************************************************;
* Equation 7.6 ;
*********************************************************;
replace expn_stu = expn_stu/1000;
reg testscr str expn_stu el_pct, r;
*********************************************************;
* Display Variance-Covariance Matrix;
*********************************************************;
vce;
*********************************************************;
* F-test reported in text;
*********************************************************;
test str expn_stu;
*********************************************************;
* Correlations reported in text;
*********************************************************;
cor testscr str expn_stu el_pct meal_pct calw_pct;
*********************************************************;
*Table 7.1, Column(1);
*********************************************************;
reg testscr str, r;
display "adjusted Rsquared = " e(r2_a);
* Column (2);
reg testscr str el_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (3);
reg testscr str el_pct meal_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (4);
reg testscr str el_pct calw_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (5);
reg testscr str el_pct meal_pct calw_pct, r;
display "adjusted Rsquared = " e(r2_a);
*********************************************************;
* Appendix rule of thumb F-Statistic;
*********************************************************;
reg testscr str expn_stu el_pct;
test str expn_stu;
reg testscr el_pct;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;
The file produces several of the empirical results from Chapter 7 of Stock and Watson (2015). As before, commands have been abbreviated throughout when there is no possibility of confusion (generate becomes gen, regress turns into reg, etc.).
In essence there are two new commands:
1) The first new command involves the testing of restrictions in equation 7.6 (page 221 of Stock and Watson (2015)). The command
reg testscr str expn_stu el_pct, r
instructs STATA to compute the regression. The command vce asks STATA to print out the estimated variances and covariances of the estimated regression coefficients. The command
test str expn_stu
gets STATA to carry out the joint test that the coefficients on str and expn_stu are both equal to zero.
2) The second new command is in the analysis of Table 7.1 on page 241 of Stock and Watson (2015). When STATA computes an OLS regression, it computes the adjusted R-squared ($\bar{R}^2$) as described in Section 6.4, page 197 of Stock and Watson (2015). However, STATA does not display all of the results it computes, including the adjusted R-squared (when the r option is invoked). The command
display "Adjusted Rsquared = " e(r2_a)
instructs STATA to print out (display) the adjusted R-squared. Whatever appears between the two quotation marks (" ") will be displayed in your output (you did not have to display the words "Adjusted Rsquared =" but could have chosen anything else, such as "My Measure of Fit"). However, e(r2_a) tells STATA where to retrieve the stored result from and cannot be changed. The adjusted R-squared is not the only statistic that STATA stores and does not display. You can use the Help function or look in the Reference volume under Saved Results for the reg command to find other statistics.
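The two commands can be tried together in a short session. The following is only a sketch: it assumes the auto dataset that ships with STATA in place of the California test score data, and it uses estat vce, the form documented in recent versions, where the tutorial writes vce (the two are assumed here to behave the same way).

```stata
* Assumption: auto.dta (shipped with STATA) stands in for caschool.dta
sysuse auto, clear

* Regression with heteroskedasticity-robust standard errors
regress price mpg weight length, r

* Estimated variances and covariances of the coefficient estimates
estat vce

* Joint test that the coefficients on mpg and weight are both zero
test mpg weight

* The text inside the quotation marks is arbitrary; e(r2_a) is not
display "My Measure of Fit = " e(r2_a)

* Show every result that regress stored, displayed or not
ereturn list
```

The ereturn list command is one way to see which other statistics are stored after a regression without looking them up in the manual.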
Other Examples of Do-Files
You will find other examples of Do-Files on the accompanying Web site for the Stock and Watson (2015) econometrics textbook. You can download STATA Do-Files from there to reproduce all of the analysis in Chapters 3-13. You will also find a STATA Do-File for the time series chapters 14-16 there. STATA programming for time series is somewhat more complicated than for cross-sectional or panel data. EViews and RATS are econometric
programs specifically designed for time series data, and the Web site contains EViews and RATS programs for Chapters 14-16, as well as a tutorial for EViews.
3. A SUMMARY OF SELECTED STATA COMMANDS
This section lists several of the most useful STATA commands. Many of these commands have options. For example, the command summarize has the option detail and the command regress has the option robust. In the descriptions below, options are shown in square brackets [ ]. Many of these commands have several options and can be used in many different ways. The descriptions below show how these commands are commonly used. Other uses and options can be found in STATA's Help menu and in the other sources listed at the beginning of this tutorial.
The list of commands provided here is a small fraction of the commands in STATA, but these are the important commands that you will need to get started for your econometrics course. You should extend the list or create your own in addition to what is listed here.
Administrative Commands
#delimit
sets the character that marks the end of a command. For example, the command #delimit ; tells STATA that all commands will end with a semicolon. This command is used in Do-Files.
clear
deletes/erases all variables from the current STATA session.
exit
in a Do-File, the command tells STATA that the program has ended. If you type exit in the STATA Command Window, then STATA will close.
log
controls STATA log files, which are where STATA writes output. There are two common uses of this command:
log using filename [, append replace]. This opens the file given by filename as a log file for STATA output. The options append and replace are used when there is already a file with the same name. With append, STATA will append the output to the bottom of the existing file. With replace, STATA will replace the existing file with the new output file.
log close. This closes the current log file.
set mat #
sets the maximum number of variables that can be used in a regression. The default maximum is 40. If you want to run a regression with 45 variables, then you will need to use the command set mat #, where # is a number greater than 45.
set memory #m
is used in Windows and Unix versions of STATA to set the amount of memory used by the program. For details, see the discussion within the tutorial.
set more off
tells STATA not to pause and display the --more-- message in the Results Window.
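The administrative commands are typically combined at the top and bottom of a Do-File. The skeleton below is a hypothetical sketch, not one of the tutorial's batch files: example.log is an illustrative file name, and the auto dataset that ships with STATA is used so that the file runs as written. Note that #delimit is meant for Do-Files rather than for the Command Window.

```stata
* Hypothetical Do-File skeleton; example.log is an illustrative name
#delimit ;
* Close any log left open by an earlier run, ignoring errors ;
capture log close ;
clear ;
set more off ;
log using example.log, replace ;
* Assumption: auto.dta (shipped with STATA) is the working dataset ;
sysuse auto ;
summarize price mpg ;
log close ;
* Switch back to one command per line ;
#delimit cr
```

In a finished Do-File the last command would usually be exit, as in the batch files shown earlier.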
Data Management
describe
describes the contents of data in memory or on disk. A related command is describe using filename, which describes the dataset in filename.
drop list of variables
this deletes/erases the variables in list of variables from the current STATA session. For example, drop str testscr will delete the two variables str and testscr.
keep list of variables
deletes/erases all of the variables from the current STATA session except those in list of variables. Equivalently, it keeps the variables in the list and drops everything else. For example, keep str testscr will keep the two variables str and testscr and delete all of the other variables in the current STATA session.
list list of variables
tells STATA to print all of the observations for the variables listed in list of variables.
save filename [, replace]
tells STATA to save the dataset that is currently in memory as a file with name filename. The option replace tells STATA that it may replace any other file with the name filename.
use filename
tells STATA to load a dataset from the file filename.
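A short session that strings these data management commands together might look as follows. This is a sketch under stated assumptions: it uses the auto dataset that ships with STATA, and auto_small is a made-up file name that will be written to the current working directory.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear
describe

* Keep four variables, then drop one of them again
keep make price mpg weight
drop weight

* Print the first five observations of the remaining variables
list make price mpg in 1/5

* Save under a hypothetical name, clear memory, and load it back
save auto_small, replace
clear
use auto_small
describe
```

After the final describe, the dataset in memory should contain only make, price, and mpg.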
Transforming and Creating New Variables
New variables are created using the command generate, and existing variables are modified using the command replace.
Examples:
generate newts = testscr/100
creates a new variable called newts that is constructed as the variable testscr divided by 100.
replace testscr = testscr/100
changes the variable testscr so that all observations are divided by 100.
You can use the standard arithmetic operations of addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^) in generate/replace commands. For example,
generate ts_squared = testscr*testscr
creates a new variable ts_squared as the square of testscr. (This could also have been accomplished by using the command gen ts_squared = testscr^2.)
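The same operations can be tried on any dataset. The sketch below assumes the auto dataset that ships with STATA rather than the test score data, and the new variable names (price_k, mpg_sq, mpg_sq2) are illustrative.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear

* New variable: price measured in thousands of dollars
generate price_k = price/1000

* Two equivalent ways of squaring a variable
generate mpg_sq = mpg^2
generate mpg_sq2 = mpg*mpg

* Overwrite an existing variable in place
replace price = price/1000

summarize price price_k mpg_sq mpg_sq2
```

The summary statistics for price and price_k, and for mpg_sq and mpg_sq2, should be identical.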
You can also use relational operators to construct binary variables. For example, in the fourth batch file, the following command was included
gen d = (str
Statistical Operations
cor list of variables
tells STATA to compute the correlation between each of the variables in list of variables.
twoway scatter var1 var2 || lfit var1 var2
produces a scatter plot of var1 on the Y-axis and var2 on the X-axis. If the || lfit part is included, then the fitted OLS line is also displayed.
predict newvarname [, residuals]
when this command follows the regress command, the OLS predicted values or residuals are calculated and stored under the name newvarname. When the option residuals is used, the residuals are computed; otherwise the predicted values are computed and placed into newvarname.
Example:
reg testscr str expn_stu el_pct, r
predict tshat
predict uhat, residuals
Here, testscr is regressed on str, expn_stu, and el_pct (first command); the fitted values are saved and stored under the name tshat (second command), and the residuals are saved under the name uhat (third command).
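One way to confirm what predict has stored is to check that the dependent variable equals the fitted value plus the residual. The lines below are a sketch that assumes the auto dataset shipped with STATA; the names price_hat, u_hat, and check are illustrative.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear
regress price mpg weight, r

* Fitted values (the default) and residuals
predict price_hat
predict u_hat, residuals

* price - fitted - residual should be zero up to rounding error
generate check = price - price_hat - u_hat
summarize check u_hat
```

Because predict stores its results in single precision by default, check is expected to be very close to zero rather than exactly zero.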
regress depvar list of variables [if expression] [,robust noconstant]
carries out an OLS regression of the variable depvar on list of variables. When if expression is used, then the regression is estimated using observations for which expression is true. The option robust tells STATA to use the heteroskedasticity-robust formula for the standard errors of the coefficient estimators. The option noconstant tells STATA not to include a constant (intercept) in the regression.
Examples:
reg testscr str, r
reg testscr str expn_stu el_pct, r
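The if expression and the noconstant option can be sketched in the same way. The lines below assume the auto dataset that ships with STATA, in which foreign is a binary variable equal to 0 for domestic cars.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear

* Robust regression using only observations with foreign equal to 0
regress price mpg weight if foreign == 0, r
display "Observations used = " e(N)

* All observations, but without a constant (intercept)
regress price mpg weight, noconstant
```

Note the double equal sign: inside an if expression, == tests for equality, whereas a single = assigns a value.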
summarize [list of variables] [, detail]
computes summary statistics. If the command is used without a list of variables, then summary statistics are computed for all of the variables in the dataset. If the command
is used with a list of variables, then summary statistics are computed for all variables in the list. If the option detail is used, more detailed summary statistics (including percentiles) are computed.
Examples:
sum testscr str
computes summary statistics for the variables testscr and str.
sum testscr str, detail
computes detailed summary statistics for the variables testscr and str.
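Like regress, the summarize command stores its results so that they can be used in later commands; here they are retrieved with r( ) rather than e( ). The following sketch assumes the auto dataset that ships with STATA.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear

* Basic and detailed summary statistics
summarize price mpg
summarize price, detail

* Stored results refer to the most recent summarize command
display "Mean = " r(mean) "   Median = " r(p50)

* List everything that summarize stored
return list
```

The median, r(p50), is only available when the detail option is used.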
test
this command is used to test hypotheses about regression coefficients. It can be used to test many types of hypotheses. The most common use of this command is to carry out a joint test that several coefficients are equal to zero. Used this way, the form of the command is test list of variables, where the test is to be carried out on the coefficients corresponding to the variables given in list of variables.
Example:
reg testscr str expn_stu el_pct, r
test str expn_stu
Here testscr is regressed on str, expn_stu, and el_pct (first command), and a joint test of the hypothesis that the coefficients on str and expn_stu are jointly equal to zero is carried out (second command).
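Other types of hypotheses follow the same pattern. The sketch below assumes the auto dataset that ships with STATA; the value -100 is an arbitrary number chosen only to illustrate the syntax.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear
regress price mpg weight length, r

* Joint test that both coefficients are equal to zero
test mpg weight

* Single restriction: the two coefficients are equal to each other
test mpg = weight

* Single restriction: a coefficient equals a specific nonzero value
test mpg = -100
```

Because the regression uses the r option, each of these tests is based on the heteroskedasticity-robust covariance matrix.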
ttest
this command is used to test a hypothesis about the mean or the difference between two means. The command has several forms. Here are a few:
ttest varname = # [if expression] [, level(#)]
Here you test the null hypothesis that the population mean of the series varname is equal to #. When if expression is used, then the test is computed using observations for which expression is true. The option level(#) is the desired level of the confidence interval. If this option is not used, then a confidence level of 95% is used.
Examples:
ttest testscr = 0;
tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95% confidence interval.
ttest testscr = 0, level(90);
tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 90% confidence interval.
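These forms of ttest can be tried on any numeric variable. The sketch below assumes the auto dataset that ships with STATA and an arbitrary null value of 20; it writes the test with ==, the form shown in STATA's own documentation, where the tutorial uses a single =.

```stata
* Assumption: auto.dta (shipped with STATA) is used for illustration
sysuse auto, clear

* Null hypothesis: the population mean of mpg is 20 (95% interval)
ttest mpg == 20

* Same test with a 90% confidence interval
ttest mpg == 20, level(90)

* Same test restricted to observations with foreign equal to 0
ttest mpg == 20 if foreign == 0
display "t = " r(t) "   p-value = " r(p)
```

The stored results r(t) and r(p) hold the t-statistic and the two-sided p-value of the most recent test.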
ttest testscr = 0 if (str
tutorial is not intended to replace the Reference or User's Guide. The best way to learn how to use the program is to spend some time exploring and working with it. For a nice visual introduction to the manuals, go to www.youtube.com/embed/xWJTFtWhQc4.
STATA replication batch files for all the results in the Stock/Watson textbook are available
from the Web site. You are invited to download these and study them.
There are many other tutorials on STATA available to you on the internet. If you prefer a visual one, www.ats.ucla.edu/stat/stata/notes/default.htm is a good place to start. STATA also has its own YouTube series, which you can find at www.stata.com/links/video-tutorials/.