Stata Tutorial 13 v2 0

8/9/2019 Stata Tutorial 13 v2 0


STATA 13 Tutorial

by Manfred W. Keil

to Accompany

Introduction to Econometrics

by James H. Stock and Mark W. Watson

------------------------------------------------------------------------------------------------------------------

1. STATA: INTRODUCTION

2. CROSS-SECTIONAL DATA

Interactive Use: Data Input and Simple Data Analysis

a) The Easy and Tedious Way: Manual Data Entry
b) Summary Statistics
c) Graphical Presentations
d) Simple Regression
e) Entering Data from a Spreadsheet
f) Importing Data Files directly into STATA
g) Multiple Regression Model
h) Data Transformations

Batch (Do-Files)

3. SUMMARY OF FREQUENTLY USED STATA COMMANDS

4. FINAL NOTE

-----------------------------------------------------------------------------------------------------------------



1. STATA: INTRODUCTION

This tutorial will introduce you to a statistical and econometric software package called STATA. The tutorial is an introduction to some of the most commonly used features in STATA. These features were used by the authors of your textbook to generate the statistical analysis reported in Chapters 3-9 (Stock and Watson, 2015). The tutorial provides the necessary background to reproduce the results of Chapters 3-9 and to carry out related exercises. It does not cover panel data (Chapter 10), binary dependent variables (Chapter 11), instrumental variable analysis (Chapter 12), or time-series analysis (Chapters 14-16).

The most current professional version is STATA 13. STATA 12 and STATA 13 are sufficiently similar that those who only have access to STATA 12 can also use this tutorial. As with many statistical packages, newer versions of a program allow you to use more advanced and recently developed techniques that you, as a first-time user, most likely will not encounter in a first course of statistics or econometrics. There are several versions of STATA 12, such as STATA/IC, STATA/SE, and STATA/MP. The difference is basically in terms of the number of variables STATA can handle and the speed at which information is processed. Most users will probably work with the Intercooled (IC) version.

STATA runs on Windows (2000, 2003, XP, Vista, Server 2008, or Windows 7), Mac, and Unix platforms. It is produced by StataCorp in College Station, TX. You can read about various product information at the firm's Web site, www.stata.com. There are 20 manuals that can be purchased with STATA 13, although subsets can be bought separately. Perhaps the most useful of these are the User's Guide and the Base Reference Manual, which can simply be downloaded. You can order STATA by calling (800) 782-8272 or by filling out a form at www.stata.com/order/quote-request/student/. In addition, if you purchase the Student Version, you can acquire STATA at a steep discount. Prices vary, but you could get a perpetual license for STATA/IC for $189, or a six-month license for as low as $69 (a business/single user pays $1,695 to purchase STATA). There is even a 30-day free evaluation copy of STATA.

Econometrics deals with three types of data: cross-sectional data, time series data, and panel (longitudinal) data (see Chapter 1 of Stock and Watson (2015)). In a cross-section you analyze data from multiple entities at a single point in time. In a time series you observe the behavior of a single entity over multiple time periods. This can range from high frequency data such as financial data (hours, days); to data observed at somewhat lower (monthly) frequencies, such as industrial production, inflation, and unemployment rates; to quarterly data (GDP) or annual (historical) data. One big difference between cross-sectional and time series analysis is that the order of the observation numbers does not matter in cross-sections. With time series, you would lose some of the most interesting features of the data if you shuffled the observations. Finally, panel data can be viewed as a combination of cross-sectional and time series data, since multiple entities are observed at multiple time periods. STATA allows you to work with all three types of data.

STATA is most commonly used for cross-sectional and panel data in academics, business, and government, but you can work with it relatively easily when you analyze time-series data. STATA allows you to store results within a program and to retrieve these results for further calculations later. Remember how you calculated confidence intervals in statistics, say for a population mean? Basically you needed the sample mean, the standard error, and some value from a statistical table. In STATA, you can calculate the mean and standard deviation of a sample and then temporarily store these. You then work with these numbers in a standard formula for confidence intervals. In addition, STATA provides the required numbers from the relevant distribution (normal, χ², F, etc.).
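The confidence-interval calculation described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the auto data set that ships with STATA as a stand-in for your own data, and the scalar names (n, xbar, se, tcrit, ci_lo, ci_hi) are made up for this example.

```stata
* Sketch: 95% confidence interval for a population mean from stored results.
* Assumption: the auto data set shipped with STATA stands in for your own data.
sysuse auto, clear

* summarize leaves the sample size, mean, and standard deviation in r()
summarize mpg
scalar n    = r(N)
scalar xbar = r(mean)
scalar se   = r(sd) / sqrt(r(N))

* invttail(df, p) returns the t value with upper-tail probability p
scalar tcrit = invttail(n - 1, 0.025)

* Standard formula: mean plus/minus critical value times standard error
scalar ci_lo = xbar - tcrit * se
scalar ci_hi = xbar + tcrit * se
display "95% CI for mean mpg: [" ci_lo ", " ci_hi "]"
```

The point of the sketch is that nothing has to be copied by hand from the output or from a statistical table: summarize stores the sample statistics, and invttail supplies the critical value.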

While STATA is truly interactive, you will sooner rather than later run programs in batch mode.

Interactive use: you type a STATA command in the STATA Command Window (see below) and hit the Return/Enter key on your keyboard. STATA executes the command and the results are displayed in the STATA Results Window. Then you enter the next command, STATA executes it, and so forth, until the analysis is complete. Even the simplest statistical analysis typically will involve several STATA commands.

Batch mode: all of the commands for the analysis are listed in a file, and STATA is told to read the file and execute all of the commands. These files are called Do-Files and are saved using a .do suffix.
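To make the idea of batch mode concrete, here is a minimal sketch of what a Do-File might contain. The file names example.do and example.log are hypothetical, and the auto data set that ships with STATA is used only as a stand-in for your own data.

```stata
* example.do -- a hypothetical Do-File (the file name is an assumption)
* Run all commands at once by typing:  do example.do

* Close any log left open by an earlier run, then record this run
capture log close
log using example.log, replace text

* Load data, then carry out the analysis step by step
sysuse auto, clear
summarize price mpg
regress price mpg, robust

log close
```

Because every command is written down in one place, the file doubles as a record of what was done, and the log file keeps the matching output.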

In the good old days the equivalent of writing a Do-File was to submit a batch of cards, each card containing a single command (now a line), to a technician, who would use a card reader to enter these into the computer. The computer would then execute the sequence of statements. (The deck of cards was referred to as a "file" and was typically stored in a filing cabinet, with a rubber band around each file or deck of cards.) While you will work at first in interactive mode by clicking on buttons or writing single line commands, you will very soon discover the advantage of running your regressions in batch mode. This method allows you to see the history of commands, and you can also analyze where exactly things went wrong if there are problems (errors) with any of your commands. This tutorial will initially explain the interactive use of STATA since it is more intuitive. However, we will switch to batch mode as soon as it makes sense, and you should seriously try to do your research/class work using this mode (Do-Files).

STATA produces highly professional-looking graphs and charts. However, it requires some practice to generate these. A separate manual (Graphics) is devoted to this topic alone. Since STATA works in a Windows format, it allows you to cut and paste the data into other Windows-based programs, such as Word or WordPerfect.

Finally, there is a warning about the limitations of this tutorial. The purpose is to help you gain an initial understanding of how to work with STATA. I hope that the tutorial looks less daunting than the manuals. However, it cannot replace the accompanying manuals, which you will have to consult for more detailed questions (alternatively use Help within the program). Feel free to provide me with feedback on how the tutorial can be improved for future generations of students (mkeil@claremontmckenna.edu). Colleagues of mine and I have decided to set up a Wiki run by students but supervised by faculty at my academic institution. We have found that the wisdom of crowds often produces valuable information for those who follow. This is, of course, just a suggestion. Finally, you may want to think about working with statistical software as learning a new language: practicing it routinely will result in improvement. If you set it aside for too long, you will only remember the most important lines but will forget the important details. Another danger of tutorials like this is that you simply follow the instructions and when you are done, you do not remember the commands. It is therefore a good idea to keep a separate sheet and to write down commands and examples of them if you think you will use them later. I will give you short exercises so that you can practice the commands on your own.

2. CROSS-SECTIONAL DATA

Interactive Use: Data Input and Simple Data Analysis

Let's get started. Click on the STATA icon to begin your session, or choose STATA 13 from your START window. Once you have started STATA, you will see a large window containing several smaller windows. At this point you can load a data set or enter data (described below) and begin the statistical analysis.


The results of your various operations will be displayed in the so-called Results Window. On the bottom left, there is a Variables Window, which shows the names of variables currently active in the data file. Above it is the Review Window, which lets you view previously used STATA commands. In interactive use, STATA allows you to execute commands either by clicking on command buttons or by typing the equivalent command into the Command Window.

In this tutorial, we will work with two data applications, both cross-sectional: the California Test Score Data Set used in Chapters 4-9, and the Current Population Survey Data Set used in Chapters 3 and 8 as an exercise.

a) The Easy and Tedious Way: Manual Data Entry

In Chapters 4 to 9 you will work with the California Test Score Data Set. These are cross-sectional data. There are 420 observations from K-6 and K-8 school districts for the years 1998 and 1999. You will not want to enter a large amount of data manually, since it is tedious and leaves room for human error. As a result, it is generally not a recommended method of inputting data. However, there are occasions when you have collected data by yourself (something that economists are doing more and more). The alternative is to enter the data into a spreadsheet (Excel) and then to cut and paste the data (see below).

Entering data manually is used here for pedagogical purposes since it gives you an initial understanding of how to work with data in STATA. In other words, it will be useful that you become aware of entering, and editing, data in the program. Here I will use a sub-sample of 10 observations from the California Test Score Data Set.

To start, click on the Data Editor button on the toolbar, or type the command edit into the Command Window. This will open the following screen:


To enter data manually, start typing in the observations (you will name the variables subsequently). Here I have chosen 10 observations of test scores (testscr) and the student-teacher ratio (str) from the data set you will use in Chapter 4 of the textbook (type in the numbers for all three).

Make sure not to type the variable names in the three columns; only enter the numbers. Also, should numbers turn from black to red, then it means that STATA cannot identify the data you typed in.

testscr   str    school
606.8     19.5   1
631.1     20.1   2
631.4     21.5   3
631.8     20.1   4
631.9     20.4   5
632       22.4   6
632       22.9   7
638.5     19.1   8
638.7     20.2   9
639.3     19.7   10


After entering the data, double-click the grey box at the top of the first column. This will result in the following box appearing at the bottom right of your screen:

In the Name box, replace var1 with the name of the first column variable, here testscr. In the Label box, you may want to enter information that helps you remember how the data was created originally, or that serves as information for others who may subsequently work with your data. I suggest you enter here

Avg test score (=(read_scr+math_scr)/2)

Do a similar operation for the second column, that is, rename var2 as str. Similarly, you could enter for the variable str

Student teacher ratio (teachers/enrl_tot)

Finally, call the third column school.

After completing this task, the Data Editor screen should look as follows:


Next close the box. Note that your commands to edit the data now appear in the Results Box, your command to edit is listed in the Command Box, and your newly created variables are shown in the variable list on the upper right-hand side:

Entering data in this way is very tedious, and you will make data input errors frequently. You will see below how to enter data directly from a spreadsheet or an ASCII file, which are the most common forms of data you will receive in the future.
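As an aside, the same ten observations can also be entered with commands instead of the Data Editor. The following is a minimal sketch using the input command. It starts from the placeholder names var1, var2, and var3 (an assumption chosen to mirror the renaming steps described above) and then attaches the suggested labels.

```stata
* Sketch: enter the ten observations by command rather than in the Data Editor
clear
input var1 var2 var3
606.8 19.5  1
631.1 20.1  2
631.4 21.5  3
631.8 20.1  4
631.9 20.4  5
632   22.4  6
632   22.9  7
638.5 19.1  8
638.7 20.2  9
639.3 19.7 10
end

* Give the columns the names used in the tutorial
rename var1 testscr
rename var2 str
rename var3 school

* Attach the descriptive labels suggested in the text
label variable testscr "Avg test score (=(read_scr+math_scr)/2)"
label variable str "Student teacher ratio (teachers/enrl_tot)"

* Inspect the result
describe
list
```

Typing the commands into a Do-File has the advantage that the data entry can be repeated or corrected later without retyping everything.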


In general, you can look at variables that already exist by typing in the command

list varname1 varname2 ...

where each varname refers to a variable that exists in your workfile. Try it here by typing

list testscr str

This command will list, one screen at a time, the data on the variables for every observation in the data set. (Missing values are denoted by a period, or ".", in STATA.) Later on, you will work with large data sets, and you will probably not want to see all observations. You can imagine how long this may take with 5,000 observations or more. Failing to look at the data observation by observation, of course, takes away the ability to spot errors in the data set, perhaps generated by others during data entry. However, there are other methods to spot such problems, such as summarizing the data.

You can always stop the listing by hitting the break button on the toolbar (it looks like a red pentagon with a white "x" in the middle). This button can be used to stop the execution of any command in STATA.

You should see the following:


b) Summary Statistics

For the moment, let's just see if we are working with the same data set. Type in the following command

sum testscr str, detail

sum stands for summarize, and the option detail gives you a more extensive list of summary statistics for each of the variables you have entered. These include the median and certain percentiles of the frequency distribution. You will learn later that you can also obtain summary statistics for a subset of your data by adding an if or in command following the variable name.

The summary statistics are explained in Chapter 2 of your textbook (for example, kurtosis is defined in equation (2.15) on page 25 in Stock and Watson (2015)).

If your summary statistics differ, then check the data again. To return to the data observations, edit the data using the Data Editor. Once you have located the data problem, click on the observation and change it. After correcting the problem, press the Preserve button again.

After entering the data, there are various things you can do with it. You may want to keep a hard copy of what you just entered. If so, click on the Print button. This will print the entire output of what you have produced so far.
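Returning to the list and summarize commands, the if and in qualifiers mentioned earlier can be sketched as follows. The auto data set that ships with STATA and its variables price and mpg are used here only as stand-ins for your own data.

```stata
* Sketch: restricting list and summarize to part of the data
* Assumption: the auto data set shipped with STATA stands in for your own data.
sysuse auto, clear

* List only the first five observations instead of the whole data set
list price mpg in 1/5

* Extended summary statistics (median, percentiles, skewness, kurtosis)
summarize price mpg, detail

* Summary statistics for a subset selected by a condition ...
summarize price if mpg > 25

* ... or by observation number
summarize price in 1/20
```

With a large data set, listing a handful of observations and summarizing the rest is usually a quicker way to spot data entry problems than scrolling through every row.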

In general, it is a good idea to save the data and your work frequently in some form. Many of us have learned through multiple painful experiences how easy it is to lose hours of work by not backing up data/results in some fashion. To save the data set you created, either press the Save button or click on File and then Save As. Follow the usual Windows format for saving files (drives, directories, file type, etc.). If you save datasets in STATA readable format, then you should use the extension .dta. Once you have saved your work, you can call it up the next time you intend to use it by clicking on File and then Open. Try these operations by saving the current workfile under the name SW13smpl.dta.
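The same save-and-reopen cycle can also be carried out by command. The sketch below uses a small made-up data set and a hypothetical file name, mycopy.dta, so that it does not overwrite any file of your own.

```stata
* Sketch: save a data set and open it again by command
* Assumptions: a ten-observation stand-in data set; mycopy.dta is a hypothetical name.
clear
set obs 10
generate school = _n

* Write the data in memory to disk; replace overwrites an older copy if one exists
save mycopy.dta, replace

* Empty memory, then read the file back in and check its contents
clear
use mycopy.dta
describe
```

Without a directory in the file name, STATA reads and writes in the current working directory.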

c) Graphical Presentations

Most often it is a good idea to generate graphs (pictures) to get some feel for the data. You will be able to detect outliers, which may be the result of data entry errors, or you will be able to see if the data makes sense. Although STATA offers many graphing options, we will only go through a few commonly used ones here.[1]

There are three graphs that you will use most often:

histograms;

line graphs, where one or more variables are plotted across entities (these will become more important in time series analysis when you are plotting variables over time);

scatterplots (crossplots), where one variable is graphed against another.

The purpose of histograms is to display absolute or relative frequencies for a single variable. In general, the command is

histogram varname, percent title( )

The percent option produces relative frequencies, and the title option adds whatever name you place between ( ) to the top of the graph.

You can either save the graph you have generated, or copy and paste it into another Windows-based document, such as Word. (Replacing percent with frequency would have resulted in absolute, rather than relative, frequencies being plotted; there are other options for you to explore, such as the number of classes (bins) to choose, etc.)
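The histogram options just mentioned can be tried out as in the sketch below. The auto data set that ships with STATA is an assumption made so that the example runs on its own, and the file name mpg_hist is hypothetical.

```stata
* Sketch: relative versus absolute frequencies, and choosing the number of bins
* Assumption: the auto data set shipped with STATA stands in for your own data.
sysuse auto, clear

* Relative frequencies (percent of observations in each bin)
histogram mpg, percent title("Relative frequencies")

* Absolute frequencies with eight bins
histogram mpg, frequency bin(8) title("Absolute frequencies")

* Save the graph currently displayed in STATA's own graph format (.gph)
graph save mpg_hist, replace
```

Each histogram command replaces the graph shown before it, so graph save stores the second one.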

[1] I found the following STATA site particularly useful for graphs: http://www.stata.com/support/faqs/graphics/gph/statagraphs.html

Try

histogram testscr, percent title(Testscores)

To create a line graph in a cross section, you can add a third variable in your data set which takes on the number of the observation (here: 1, 2, 3, ..., 10), in this case, the variable school that we created.

Let's plot the student-teacher ratio for the first 10 observations using the scatter command. The command is followed by the two variables you would like to see plotted, where the first one appears on the Y axis and the second on the X axis.

scatter varname1 varname2

plots variable 1 against variable 2. Try this with the student-teacher ratio and the variable school.

The resulting graph just gives you the data points here. There are two ways to make this more informative. One is to connect the points by using the line command followed by the two variable names. Alternatively, you can use the twoway connected command to have both the points and the lines displayed.

Try both here:

line str school

twoway connected str school

[Figure: histogram titled "Testscores"; vertical axis: Percent; horizontal axis: Avg test score (=(read_scr+math_scr)/2)]

After the graph appears, you can edit it using the Graph Editor (either use File and then Start Graph Editor or push the Graph Editor button). Alter the graph until it looks like the one below. Some of the alterations can be made in the resulting dialog boxes.

Frequently you will be interested either in causal relationships between variables or in the ability of one variable to forecast another. As a result, it is a good idea to plot two variables in the same graph.

The first way to look for a relationship is to plot the observations of both variables. This can be done by generalizing the command twoway connected to include more than two variable names (one or more for the Y axis and one for the X axis). Try this here with

twoway connected str testscr school

The resulting graph is pretty uninformative, since test scores and student-teacher ratios are on a different scale. You can allow for two (or more) scales by entering the following command:

twoway (scatter str school, c(l) yaxis(1)) (scatter testscr school, c(l) yaxis(2))

This command instructs STATA to use two Y axes, one for the student-teacher ratio on the left side of the graph, and the other for test scores on the right side of the graph. The character inside c( ) is the lowercase letter l, which connects the points with lines. You may want to beautify the resulting graph by using the graph editor. See if you can produce something like the graph below:

[Graph 1: Student-Teacher Ratio Across 10 School Districts; vertical axis: Student-Teacher Ratio; horizontal axis: School District]

To get an even better idea about the relationship, you can display a two-dimensional relationship in a scatterplot (see page 92 of your Stock and Watson (2015) textbook). Given our discussion above, you could simply use the command scatter testscr str. However, you may want to see what a fitted line through that scatter plot would look like, in which case you have to modify the command slightly:

scatter testscr str || lfit testscr str

where || is the key | typed twice.
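The || notation is shorthand for overlaying two plots in one twoway graph. As a sketch, the same kind of graph can also be written with parentheses, which makes it easy to add titles by command instead of in the Graph Editor. The auto data set that ships with STATA is used as a stand-in, and the title texts are made up for this example.

```stata
* Sketch: two equivalent ways to overlay a scatterplot and a fitted line
* Assumption: the auto data set shipped with STATA stands in for your own data.
sysuse auto, clear

* Short form with the || separator
scatter price mpg || lfit price mpg

* Parenthesized form, with graph and axis titles added as options
twoway (scatter price mpg) (lfit price mpg), title("Price and mileage") ytitle("Price") xtitle("Mileage (mpg)")
```

In both forms the dependent variable comes first and the explanatory variable second, exactly as in the scatter command.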

This will result in the following graph (after beautification):

[Graph 2: Test Scores and Student-Teacher Ratio Across 10 School Districts; left vertical axis: Avg test score; right vertical axis: Student teacher ratio; horizontal axis: School District; legend: Student-Teacher Ratio, Avg Test Score]

[Graph 3: Scatterplot of Test Scores vs Student-Teacher Ratio; vertical axis: Test Scores; horizontal axis: Student-Teacher Ratio; legend: Fitted values]

(Not to worry about the positive slope here. Remember, this is a sample, and a very small one at that. After all, you may get 10 heads in 10 flips of a coin.)

d) Simple Regression

There is a commonly held belief among many parents that lower student-teacher ratios will result in better student performance. Consequently, in California, for example, all K-3 classes were reduced to a maximum student-teacher ratio of 20 (Class Size Reduction Act, CSR) in the late 90s. This comes at a cost, of course. Initially, it was $1.8 billion a year. With dollar figures as big as these (ask yourself, if you laid down a dollar bill every second, how many years would it take to reach 1 billion?), the natural question arises whether or not it is worth it. That is why you are analyzing the effect of reducing student-teacher ratios in Chapters 4-9 of the Stock and Watson textbook.

For the 10 school districts in our sample, we seem to have found a positive relationship between larger classes and student performance. Not to worry: we will soon work with all 420 observations from the California School Data Set, and we will then find the negative relationship you have seen in the textbook. For now, we are more concerned about learning techniques in STATA.

In the previous section, we included a regression line in the scatterplot, something that you should have encountered towards the end of your statistics course. However, the graph of the regression line does not allow you to make quantitative statements about the relationship; you want to know the exact values of the slope and the intercept. For example, in general applications, you may want to predict the effect of an increase by one in the explanatory variable (here the student-teacher ratio) on the dependent variable (here the test scores).

To answer the questions relating to the more precise nature of the relationship between class size and student performance, you need to estimate the regression intercept and slope. A regression line is little else than fitting a line through the observations in the scatterplot according to some principle. You could, for example, draw a line from the test score for the lowest student-teacher ratio to the test score for the highest student-teacher ratio, ignoring all the observations in between. Or you could sort the data by student-teacher ratio and split the sample in half so that the observations with the lowest ten student-teacher ratios are in one set, and the observations with the highest ten student-teacher ratios are in the other set. For each of the two sets you could calculate the average student-teacher ratio and the corresponding average test score, and then connect the two resulting points. Or you could just eyeball the relationship. Some of these principles have better properties than others to infer the true underlying (population) relationship from the given sample. The principle of Ordinary Least Squares (OLS), for example, will give you desirable properties under certain restrictive assumptions that are discussed in Chapter 4 of the Stock/Watson textbook.
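To see what the OLS principle amounts to with one explanatory variable, the slope and intercept can be computed "by hand" from stored results and compared with STATA's regression command. The sketch below relies on the standard formulas (slope = sample covariance of Y and X divided by the sample variance of X; intercept = mean of Y minus slope times mean of X). The auto data set that ships with STATA and the scalar names are assumptions made for illustration.

```stata
* Sketch: OLS slope and intercept from the covariance matrix and the sample means
* Assumption: the auto data set shipped with STATA stands in for your own data.
sysuse auto, clear

* Covariance matrix of price (row/column 1) and mpg (row/column 2)
correlate price mpg, covariance
matrix C = r(C)

* Slope = cov(mpg, price) / var(mpg)
scalar b1 = C[2,1] / C[2,2]

* Intercept = mean(price) - slope * mean(mpg)
summarize price
scalar ybar = r(mean)
summarize mpg
scalar xbar = r(mean)
scalar b0 = ybar - b1 * xbar

display "slope = " b1 "   intercept = " b0

* The regression command should report the same two coefficients
regress price mpg
```

The competing principles described above (connecting the extreme points, or connecting the means of two halves of the sample) would generally give different lines; only the covariance-based formula reproduces the regress output.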


Back to computing. If the dependent variable, Y, is only determined by a single explanatory variable X in a linear fashion of the type

$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, 2, \ldots, N$

with u representing the error, or random disturbance, not accounted for by the linear equation, then the task is to find a value for $\beta_0$ and $\beta_1$. If you had values for these coefficients, then $\beta_1$ describes the effect of a unit increase in X on Y.

Often a regression line is a linear approximation to an underlying complicated relationship, and the intercept $\beta_0$ only has a useful meaning if observations around X = 0 occur in the data. As we have seen in the scatterplot above, there are no observations around the student-teacher ratio of zero, and it is therefore better not to interpret the numerical value of the intercept at all. Your professor most likely will give you a serious penalty in the exam for interpreting the intercept here because with no students present, there is no score to record. (What would be the function of the teacher in that case?)

There are various ways to estimate the regression line. The command for regressing a variable Y on a constant (intercept) and another variable X is:

reg Y X

where reg stands for least squares regression. For the current application, type

reg testscr str, r

where the r following the comma indicates that you are using heteroskedasticity-robust standard errors (even though you have not requested an intercept to be included, STATA will automatically do so. There is an option for you to suppress the intercept, but you will most likely never use it).

The output appears as follows:


According to these results, lowering the student-teacher ratio by one student per class results in a decrease of 0.6 points, on average, in the district wide test score. Using the notation of your textbook, you should display the results as follows:

$\widehat{TestScore} = 618.9 + 0.61 \times STR, \quad R^2 = 0.007, \quad SER = 9.8$
(standard errors: 51.1 for the intercept, 2.33 for the slope)

Note that the result for the 10 chosen school districts is quite different from the sample of all 420 school districts. However, as pointed out before, this is a rather small sample and the regression R² is quite low. As a matter of fact, in Chapter 5 of your textbook, you will learn that the above slope coefficient is not statistically significant.

e) Entering Data from a Spreadsheet

So far you entered data manually. Most often you will work with larger data sets that are external to the STATA program, i.e., they will not be included in, or be part of, the program itself. This makes sense as data sets either become very large or are generated by another program, such as a spreadsheet.

Stock and Watson present the California Test Score Data Set in Chapter 4 of the textbook. Locate the corresponding Excel file caschool.xlsx on the accompanying web site (where you found this tutorial) and open it. Next, following the procedures discussed previously, start STATA and open the Data Editor. Return to the Excel file and mark F1:R421. Next, using the copy and paste commands common to Windows programs, move the data block to STATA, choosing the option Treat First Row as Variable Names. You are presumably familiar with this procedure. Make sure to select the grey box to the immediate right of 1 before pasting. Note that STATA has conveniently included the name of the variables in the Data Editor.

This is what you should see in STATA:



When you are done, you are ready to save the file. Name it caschool.dta.

You can now reproduce Equation (4.7) from the textbook. Use the regression command you previously learned to generate the following output.

. reg testscr str, r

Linear regression                               Number of obs =     420
                                                F(  1,   418) =   19.26
                                                Prob > F      =  0.0000
                                                R-squared     =  0.0512
                                                Root MSE      =  18.581

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
       _cons |    698.933   10.36436    67.44   0.000     678.5602    719.3057
------------------------------------------------------------------------------

(You can find the standard errors and the distribution of the estimators on p. 131 of the Stock and Watson (2015) textbook. The regression R², sum of squared residuals (SSR), and standard error of the regression (SER) are presented in Key Concept 4.3.)

f) Importing Data Files directly into STATA

Excel (Spreadsheet) Files

Even though the cut and paste method seemed straightforward enough, there is a second, more direct, way to import data into STATA from Excel, which does not involve copying and pasting data points.

In general, make sure your data is organized with the variable names in Row 1 of your spreadsheet with each column representing a different variable, and the observations in the rows beneath the variable names. Then, save your data set in Excel (or an alternative spreadsheet program) as a .csv file (this stands for comma separated values).

Start again with a new STATA file. Next, type the following command into the command window in STATA:

insheet using (filename)

where (filename) is the directory location of your file. (To find this, locate the file and right-click, selecting the Properties button. This should contain the location of the file, to which you must add the filename; here is an example: C:\Econometrics\StockWatson\caschool.csv.) If your filename has any spaces or any symbol that appears on the number keys of the keyboard, then you should put quotation marks around your filename. STATA reads spaces as denoting separations between words, and therefore will only read the filename up until the first space or symbol, and then considers the rest to be a separate command.
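The effect of quotation marks around a file name can be tried out without any external file. The sketch below first writes a small comma separated file whose name contains a space and then reads it back in. The file name "my data.csv" is made up for this example, and the auto data set that ships with STATA serves as stand-in data.

```stata
* Sketch: a file name containing a space must be enclosed in quotation marks
* Assumptions: "my data.csv" is a hypothetical name; auto is stand-in data.
sysuse auto, clear

* Write two variables to a comma separated file in the working directory
outsheet price mpg using "my data.csv", comma replace

* Clear memory and read the file back in; the quotes keep the name in one piece
clear
insheet using "my data.csv"
describe
```

Leaving out the quotation marks in the last command would make STATA treat my and data.csv as two separate words.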

Note: In order to insheet data, there must be no data already stored in memory. To get rid of any data that is already stored, type the command

clear

before insheeting.

Once you have insheeted your data, you should see this reflected in your Results box, and your variables should appear in your Variables List box. You can type edit to see your data in the data editor.

To save your data as a STATA file, click on File on the upper toolbar, then select Save As. When you save your file, make sure it is saved as a .dta file. This type of file can only be opened in STATA. Alternatively, you can type the command

save (filename)

where (filename) is the directory location and name of your file. If you have a previous version of this saved already, add replace after the save command to overwrite the old version. In the following example the path contains a space, so it is enclosed in quotation marks:

save "C:\My Documents\test.dta", replace

If you wish to save a file that has been previously saved in the same directory location as the previous version, you may use the command

save, replace

Note: When you save a STATA dataset, you are really only saving the dataset as it exists at thetime you chose to save. You are not retaining any of the analysis you may have conducted,such as running regressions or testing for the statistical significance of coefficients. However,if you have changed the data since opening the file, such as edited observations, these changeswill be reflected.

As an exercise, copy the caschool.xls or caschool.xlsx data file from the Stock and Watson website and save the Excel file in some subdirectory on your computer as a .csv file. Then import the data set using the insheet command. Finally, run the simple regression of testscr on str and check that your output contains 420 observations and corresponds to the STATA regression output in the previous section.
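One possible way to carry out this exercise by command is sketched below. It assumes that you have already saved the spreadsheet as caschool.csv in STATA's current working directory; adjust the file name and location to your own setup.

```stata
* Sketch of the exercise
* Assumption: caschool.csv has been saved in the current working directory.
clear
insheet using caschool.csv

* The data set should contain 420 school districts; assert stops with an error otherwise
count
assert _N == 420

* Simple regression with heteroskedasticity-robust standard errors
regress testscr str, robust

* Keep a copy in STATA format for later sessions
save caschool.dta, replace
```

If the import worked, the coefficient table should match the regression output shown in the previous section.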



ASCII data

You can also import data from an ASCII file (text file). This assumes that you either saved data from a different source as an ASCII file or that you received data in ASCII file format. The file must be organized with one observation in each row, and the variables in the data set must be in separate columns.

Using the infile command, type the name of the variable that represents each column, followed by the file name.

For example, consider an ASCII dataset that looks as follows:

ahe educ exper union married

10.75 12 6 1 0

16.50 16 3 0 0

..

12.10 12 8 1 1

and which you want to import into STATA.

Each row corresponds to observations on an entity (here an individual). The first column above is the hourly wage, the second is years of education, the third is potential experience, the fourth is a binary variable which equals one if the individual belongs to a union and is zero otherwise, and the last column is another binary variable which takes on the value of one if the individual is married and is zero otherwise.

To import the data, you type the following command:

infile ahe educ exper union married using (filename)
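As an illustration of the infile command, here is a short sketch. The file name cps_sample.txt is hypothetical, and the sketch assumes that the text file holds only the numbers (if the first row contains the variable names, STATA cannot read it as numbers and the first observation will be missing).

```stata
* Sketch only: cps_sample.txt is a made-up file name in the working directory
* The file is assumed to contain five numeric columns and no header row
infile ahe educ exper union married using cps_sample.txt, clear

* Look at the first three observations to confirm the columns lined up
list in 1/3

* Quick plausibility check on the continuous variables
summarize ahe educ exper
```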

STATA dataset

Data files that have been saved in STATA format carry the extension .dta.

To open a dataset that is already saved as a .dta file, you can either go to File and then Open to select your dataset, or you can type the command

use (filename)



This will open your dataset in STATA, as long as you have changed your working directory to the location on your computer where the data file is stored. The command to change the working directory is

cd C:\(location)

Here are two tricks that will be of help down the road.

(i) If you are not sure how to type in the location of your data file, just right-click on your Start button and select Explore. Then find your data set. Next right-click on the data set and choose Properties. A new window opens up. Copy the Location. Return to the Command Window in STATA, type use, and then paste the location. Add \ and the name of the file, including the extension. Then finish the command with , clear.

Here is an example from my computer:

use C:\ClaremontLectures\ECON125\STATA\baseb.dta, clear

(ii) The clear command is very important. It erases previous data, if there was any, from memory. I, and others, have wasted time trying to find errors in programming simply by not clearing memory. Even if you don't understand the reason, the advice is always to include the clear command when you read in a new data set.

You can try doing this with the caschool.dta data set from the Stock and Watson website. Simply save that data set on your computer, then double click on it. This will open STATA with the data loaded already. Obviously this is the easiest method to import data into STATA.

Regardless of which method you use to import data, it is always a good idea to inspect the data to check if there are some abnormalities. To do this, click on the Data Editor (Browse) button below the drop down menus.
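If you prefer typing to clicking, the same steps can be sketched in the Command Window. The folder C:\statafiles is only an assumption; use the location where you stored caschool.dta.

```stata
* Sketch only: C:\statafiles is an assumed folder name
cd "C:\statafiles"

* Load the data and erase whatever was in memory before
use caschool.dta, clear

* Inspect the data before running any analysis
describe
summarize
browse
```

The browse command opens the Data Editor in browse mode, so you can look at the data without changing them by accident.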

g) Multiple Regression Model

Economic theory most often suggests that the behavior of a certain variable is influenced not only by a single variable, but by a multitude of factors. The demand for a product, e.g. LA Laker tickets, depends not only on the price of the product but also on the price of other goods, income, taste, etc. Similarly, the Phillips curve suggests that inflation depends not only on the unemployment rate, but also on inflationary expectations and possibly supply shocks, etc.

An extension of the simple regression model is the multiple regression model, which incorporates more than one regressor (see Equation (6.7) in the textbook on page 192).



$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n.$$

To estimate the coefficients of the multiple regression model, you proceed in a similar way as in the simple regression model. The difference is that you now need to list the additional explanatory variables. In general, the command is:

reg Y X1 X2 ... Xk, (options)

where (options) can be omitted (this is the default and gives you homoskedasticity-only standard errors) or can be replaced by various possible entries (e.g. r for heteroskedasticity-robust standard errors).

See if you can reproduce the following regression output, which corresponds to Column 5 in Table 7.1 of the Stock and Watson (2015) textbook (page 241). The option used below is r, which produces heteroskedasticity-robust standard errors (STATA refers to these as Robust Standard Errors).

. reg testscr str el_pct meal_pct calw_pct, r

Linear regression                               Number of obs     =        420
                                                F(4, 415)         =     361.68
                                                Prob > F          =     0.0000
                                                R-squared         =     0.7749
                                                Root MSE          =     9.0843

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.014353   .2688613    -3.77   0.000    -1.542853   -.4858534
      el_pct |  -.1298219   .0362579    -3.58   0.000     -.201094   -.0585498
    meal_pct |  -.5286191   .0381167   -13.87   0.000    -.6035449   -.4536932
    calw_pct |  -.0478537   .0586541    -0.82   0.415    -.1631498    .0674424
       _cons |   700.3918   5.537418   126.48   0.000      689.507    711.2767
------------------------------------------------------------------------------

The interpretation of the coefficients is equivalent to that of a controlled science experiment: each coefficient indicates the effect of a unit change in the relevant variable on the dependent variable, holding all other factors constant (ceteris paribus).

Section 7.2 of the Stock and Watson (2015) textbook discusses the F-statistic for testing restrictions involving multiple coefficients, the so-called Wald test. To test whether all of the above coefficients are zero with the exception of the intercept, you can use the test command followed by each restriction that you want to test in parentheses (STATA uses the name of the variable associated with the coefficient in combination with the restriction).

Type

test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

STATA will generate the following output:



. test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

 ( 1)  str = 0
 ( 2)  el_pct = 0
 ( 3)  meal_pct = 0
 ( 4)  calw_pct = 0

       F(  4,   415) =  361.68
            Prob > F =    0.0000

Note that the F-statistic is identical to the same statistic listed in the regression output.

See if you can generate the F-statistic of 5.43 following Equation (7.6) in the Stock and Watson (2015) text and listed at the bottom of page 223 (restrict the coefficients of STR and Expn to be zero).

h) Data Transformations

So far, we have only used data in regressions that already existed in some file that we either created or used. Almost always, you will be required to transform some of the raw data that you received before you run a regression. In STATA you transform variables by using the gen (as in generate) command. For example, Chapter 8 of the Stock/Watson textbook introduces the polynomial regression model, logarithms, and interactions between variables. Let's reproduce Equations (8.2), (8.11), (8.18), and (8.37) here. The following commands generate the necessary variables:2

gen avginc2=avginc^2

gen avginc3=avginc^3

gen lavginc=log(avginc)

gen ltestscr=log(testscr)

gen strpctel=str*el_pct

Note how the commands and generated variables are displayed in STATA, including those in red when you make a mistake in the command (e.g. genr instead of gen).

2 For example, I have generated a variable called avginc2, and assigned it to be the square of the previously defined variable avginc. Note that I am generating variable names that are self-explanatory. They could have been called variable1, variable2, variable3, etc. but it is a good idea to create variable names that you can remember.
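To see how a generated variable enters a regression, here is a minimal sketch for the data transformations of section h). It assumes that caschool.dta is in your working directory, and it shows only a quadratic specification in district income; the remaining regressions follow the same pattern.

```stata
* Sketch only: caschool.dta is assumed to be in the current working directory
use caschool.dta, clear

* Square of average district income (avginc is measured in thousands of dollars)
gen avginc2 = avginc^2

* Quadratic regression of test scores on income, robust standard errors
reg testscr avginc avginc2, r

* t-type test of whether the quadratic term is needed
test avginc2
```

If the test rejects the restriction that the coefficient on avginc2 is zero, the linear specification is too restrictive for these data.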


Next run the four regressions using the same technique as for multiple regression analysis.

Finally save your workfile again and exit STATA.

Exercise

One of the problems with this type of tutorial you are working on is that you just follow instructions without internalizing them. A typical student will finish the tutorial with few problems but then little is retained. If I asked you to retrieve a data set and to run a few regressions, for example, would you be able to do that? Or would you say how do I do this?

Let's see how much you understood. Go to the Stock and Watson website for the 3rd edition (http://www.pearsonhighered.com/stock_watson). Enter the Student Resources in the Companion Web Site, and download the CPS data set for Chapter 8 (Data Sets for Replicating Empirical Results: CPS Data used in Chapter 8). Next open it in STATA.3

3 Note for STATA 12 users: if you just double click on the cps_ch8.dta file, an error message will occur that tells you that insufficient memory was allocated. Before you open the cps_ch8.dta file, increase your memory (usually set at 1 MB by default). You can do this by typing in the command

set mem 10m

which increases the memory to 10 megabytes. In general, make sure to set the memory large enough to handle the data set, but small enough for your computer to handle the program (use k for kilobyte, m for megabyte, and g for gigabyte).


Then replicate the results for columns (1) from Table 8.1 on page 288 of the Stock and Watson (2015) textbook.

Why do you think your results differ from those listed in the table? What if you found a way to restrict your sample to only include individuals who are at least 30 but not older than 64? To find a way to restrict your sample, look for Help and the if command. Then, restricting your sample to those individuals in that age group, replicate column (1) to (3). For column (4), define potential experience as the Mincer experience variable (age – Years of Education – 6).

Batch Files

So far, you have either clicked on buttons in STATA or used the Command Window to type executable statements (commands one by one, or line by line). But what if you wanted to keep a permanent record of all the transformations you made, regressions you tried, graphs you created, etc.? In that case, you would need to create a program that consists of a list of line commands similar to those that you used in the Command Window previously. After having created such a program, which is a text or ASCII file, you can then execute (run) it and view the output afterwards (if the program did not contain any errors). Batch files can also include loops and conditional branching (if you do not know what these are, not to worry).

Batch files in STATA are called Do-Files.
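If you are curious what loops and conditional branching look like in a batch file, here is a small sketch. You do not need it for the rest of this tutorial. It assumes caschool.dta is in your working directory; the choice of variables and cutoffs is purely illustrative.

```stata
* Sketch only: caschool.dta is assumed to be in the current working directory
use caschool.dta, clear

* Loop: summarize several variables without retyping the command
foreach v of varlist testscr str avginc {
    summarize `v'
}

* Loop over numbers: count districts below a few class-size cutoffs
forvalues cutoff = 18/20 {
    count if str < `cutoff'
}

* Conditional branching: print a message only if the data set has 420 rows
if _N == 420 {
    display "All 420 districts are loaded"
}
```

Inside the braces, the loop variable is written between a left single quote and a right single quote, as in `v'.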



Using STATA in batch mode has two important advantages over using STATA interactively:

the Do-File provides an audit trail for your work. The file provides an exact record of each STATA command;

even the best computer programmers will make typing or other errors when using STATA. When a command contains an error, it won't be executed by STATA, or worse, it will be executed but produce the wrong result. Following an error, it is often necessary to start the analysis from the beginning. If you are using STATA interactively, you must retype all of the commands. If you are using a Do-File, then you only need to correct the command containing the error and rerun the file.

Let's create such a program. Click on the New Do-File Editor button. This opens the STATA Do-File Editor box.

Type in the following commands exactly as they appear.

log using \statafiles\stata1.log, replace
use \statafiles\caschool.dta
describe
generate income = avginc*1000
summarize income
log close
exit

Here is the meaning of the seven lines of this program:

Line 1: This is an administrative command that tells STATA where to display the results of your analysis. STATA output files are called log files. The current line tells STATA to open a log file called stata1.log (you could have used any name, such as love_metrix.log, meaning the word stata1 is not required here). If there is already a file with the same name in the folder, STATA is instructed to replace it. Before you save the Do-File, replace the path in this line with the relevant path on the computer you are using.

Line 2: This line concerns the data set. As you learned earlier in the tutorial, datasets in STATA are called dta files. The dataset which you will use here is caschool.dta, which you downloaded earlier. The current line tells STATA the location and name of the dataset to be used for the analysis. Before you save the Do-File, replace the path in this line with the relevant path of the location where you saved caschool.dta.

Line 3: This line also concerns the data set. It tells STATA to describe the dataset (a shorter version of the command is des instead of describe). This command produces a list of the variable names and any variable descriptions stored in the data set.



Line 4: This line tells STATA to create a new variable called income (a shorter version of the command is gen instead of generate). The new variable is constructed by multiplying the variable avginc by 1000. The variable avginc is contained in the dataset and is the average household income in a school district expressed in thousands of dollars. The new variable income will be the average household income expressed in dollars instead of thousands of dollars.

Line 5: This line tells STATA to compute some summary statistics (a shorter version of the command is sum instead of summarize). STATA will produce the mean, standard deviation, etc.

Line 6: This line closes the file stata1.log, which contains the output.

Line 7: This line tells STATA that the program has ended.
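One practical variation on the seven-line program is sketched below. If a Do-File stops with an error before it reaches log close, the log file stays open and the next run fails at the log using line. Starting the file with capture log close avoids this. The sketch assumes that you have already set the working directory with cd, so no paths are needed; that is a choice made here for brevity, not part of the program above.

```stata
* Sketch only: assumes the working directory already contains caschool.dta
* Close a log left open by an earlier, interrupted run (no error if none is open)
capture log close

log using stata1.log, replace
use caschool.dta, clear
describe
generate income = avginc*1000
summarize income
log close
```

The prefix capture tells STATA to carry on even if the command that follows it produces an error.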

As long as you have replaced the path in line 1 and line 2 with the relevant paths from the computer you are working on, and if you downloaded/saved the California Test Score Data Set, then we are good to go. Save the Do-File, using the .do suffix. Next execute this Do-File by first opening STATA on your computer. Next, click on the File menu, then Do, and then select the stata1.do file you just saved. This will run or execute the program.

(Alternatively, you can run the program, or even just part of the program, by hitting the Execute (do) button in the Do-file Editor.)

You will be able to see the program being executed in the Results Window. Since the execution will not fit into one screen, you can scroll up and see everything that happened during the run. Sometimes (although not here) you may see that the program execution pauses, and that

--more--

is displayed at the bottom of the Results Window. If this happens, push any key on the keyboard and execution will continue.

To exit STATA, click on the usual exit button at the top right of STATA (alternatively click on File and then Exit). STATA will ask you if you really want to exit, and you will respond Yes.

Your output has been saved in stata1.log and you can look at it by opening the file with any text editor (Notepad, for example) or in Word/WordPerfect. Here is what you should see:

-------------------------------------------------------------------------------
      name:
       log:  yourpathhere
  log type:  text
 opened on:  yourdateandtimehere

. use C:\yourpathhere

. describe

Contains data from C:\yourpathhere\caschool.dta
  obs:           420
 vars:            13                          yourdatehere
 size:        20,160
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
enrl_tot        int    %8.0g
teachers        float  %8.0g
calw_pct        float  %8.0g
meal_pct        float  %8.0g
computer        int    %8.0g
testscr         float  %8.0g
comp_stu        float  %8.0g
expn_stu        float  %8.0g
str             float  %8.0g
avginc          float  %8.0g
el_pct          float  %8.0g
read_scr        float  %8.0g
math_scr        float  %8.0g
-------------------------------------------------------------------------------
Sorted by:

. generate income = avginc*1000

. summarize income

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       420    15316.59     7225.89       5335      55328

. log close
      name:
       log:  C:\yourpathhere\stata1.log
  log type:  text
 closed on:  yourdateandtimehere
-------------------------------------------------------------------------------

You now have an initial idea of how to work with Do-Files in STATA. The rest of this part of the tutorial will guide you through further commands and make the initial Do-File more complex.

I suggest that you continue to work with the batch file you just created and add new lines to this program (if you use the .pdf version of this tutorial or have printed the tutorial using a color printer, then the new commands will appear in red).



#delimit ;
******************************************;
* Administrative Commands;
******************************************;
set more off;
clear;
log using C:\statafiles\stata1.log, replace;
******************************************;
* Read in the data set;
*******************************************;
use C:\statafiles\caschool.dta;
des;
*******************************************;
* Transform Data and Create New Variables;
*******************************************;
* Construct Average District Income in $s;
*******************************************;
gen income = avginc*1000;
*******************************************;
* carry out statistical analysis;
*******************************************;
* summary statistics for Income;
*******************************************;
sum
income;
*******************************************;
* end of program;
*******************************************;
log close;
exit;

The new version of the Do-File carries out exactly the same calculations as before. However, it uses four features of STATA for more complicated analysis. The first new command is

#delimit ;

This command tells STATA that each STATA command ends with a semicolon. If STATA does not see a semicolon at the end of the line, then it assumes that the command carries over to the following line. This is useful because complicated commands in STATA are often too long to fit on a single line. (Make sure to place a ; at the end of the seven old commands.) The above Do-File contains an example of a STATA command written on two lines: near the bottom of the file you see the command sum income written on two lines. STATA combines these two lines into one command because the first line does not end with a semicolon. While two lines are not necessary for this command, some STATA commands can get quite long, so it is good to get used to employing this feature.

A word of warning: if you use the #delimit ; command, it is critical that you end each command with a semicolon. Forgetting the semicolon on even a single line means that the Do-File will not run properly (again, don't forget to add the seven ; in the first version of the program).

The second new feature of the above Do-File is that many of the lines begin with an asterisk. STATA ignores the text that comes after *, so that these lines can be used for comments or to describe what the commands that follow are doing. Note that each of these lines ends with a semicolon. Without the semicolon, STATA would include the next line as part of the text description.

A final new feature in the program is the command

set more off

This command eliminates the need to hit a key on your keyboard when STATA fills the Results Window and stops displaying further results (the --more-- would appear).

Run the program and have a look at the new log file.
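The semicolon delimiter can also be switched off again inside a Do-File. The following sketch shows a long command spread over three lines and then returns to the default of one command per line with #delimit cr. It assumes caschool.dta is in your working directory; the line breaks in the regression are arbitrary.

```stata
* Sketch only: caschool.dta is assumed to be in the current working directory
#delimit ;
use caschool.dta, clear;

* One command spread over three lines; it ends at the semicolon;
reg testscr str el_pct
    meal_pct calw_pct,
    r;

#delimit cr
* From here on, each line is again one complete command
summarize testscr
```

Switching back with #delimit cr is optional. It is convenient when only a few commands in a file are too long for one line.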

Next, change the previous version of the Do-File by adding commands until the new version looks as follows (again, new commands can be seen in red if your tutorial displays colors):

#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using C:\STATA\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use C:\STATA\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str < 20);
gen testscr_hi = testscr if (str >= 20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str < 19);
replace testscr_hi=testscr if (str >= 19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;



There are three new features in this new version.

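1) New variables are created using only a portion of the dataset. Two of the variables in the dataset are testscr (the average test score in a school district) and str (the district's average class size or student teacher ratio). The STATA command

gen testscr_lo = testscr if (str < 20)

creates a new variable testscr_lo that equals testscr for districts with str below 20 and is missing for all other districts. The variable testscr_hi is defined in the same way for districts with str of 20 or more.

2) The second new feature is the ttest command, which is used in two forms. The first form of the command is

ttest testscr=0

which produces the following output:

. ttest testscr=0

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 testscr |     420    654.1565    .9297082    19.05335    652.3291     655.984
------------------------------------------------------------------------------
    mean = mean(testscr)                                          t = 703.6149
Ho: mean = 0                                     degrees of freedom =      419

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The output reports the sample mean, its standard error, the t-statistic for the null hypothesis that the population mean is zero, and the 95% confidence interval for the population mean. (In this example, the t-test that the population mean of test scores is equal to zero is not really of interest, but the confidence interval for the mean is what we are looking for in this example.) The same command is then used for testscr_lo and testscr_hi (see sections 3.2 and 3.3 in Stock and Watson (2015)).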

The second form of the command is

ttest testscr_lo=testscr_hi, unequal unpaired

Executing this statement will test the hypothesis that testscr_lo and testscr_hi come from populations with the same mean. That is, the command computes the t-statistic for the null hypothesis that the (population) mean of test scores for districts with class sizes less than 20 students is the same as the mean of test scores for districts with class sizes of 20 or more students. The command uses two options that are listed after the comma in the command. These options are unequal and unpaired. The option unequal tells STATA that the variances in the two populations may not be the same. The option unpaired tells STATA that the observations are for different districts, that is, these are not panel data representing the same entity at two different time periods (see section 3.4 in Stock and Watson (2015)).

. ttest testscr_lo=testscr_hi, unequal unpaired

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000

3) A third new feature in the Do-File is the command replace. This appears near the bottom of the file. Here, the analysis is to be carried out again, but using 19 as the cutoff for small classes. Since the variables testscr_lo and testscr_hi already exist (they were defined by the gen command earlier in the program), STATA cannot generate variables with the same name. Instead, the command replace is used to replace the existing series with new series. In essence, the command instructs the program to overwrite the previously stored data.
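The two-sample comparison can also be written with a single grouping variable instead of two separate test score variables. The sketch below is an alternative, not the method used in the Do-File above; the variable name small is made up for illustration, and caschool.dta is assumed to be in your working directory.

```stata
* Sketch only: caschool.dta is assumed to be in the current working directory
use caschool.dta, clear

* small equals 1 for districts with a student-teacher ratio below 20, else 0
gen small = (str < 20)

* Compare mean test scores across the two groups, allowing unequal variances
ttest testscr, by(small) unequal
```

With by(), STATA reports the difference as the mean of the group coded 0 minus the mean of the group coded 1, so the sign of the difference and of the t-statistic is reversed relative to the output for testscr_lo=testscr_hi.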



You are now ready to execute (run) the program as done before.

As before, change the previous version of the Do-File by adding commands until the new version looks as follows (again, new commands can be seen in red if your tutorial displays colors):

#delimit ;
*********************************************************;
*Administrative Commands;
*********************************************************;
set more off;
clear;
log using \statafiles\stata1.log, replace;
*********************************************************;
*Read in the Dataset;
*********************************************************;
use \statafiles\caschool.dta;
des;
*********************************************************;
*Transform Data and Create New Variables;
*********************************************************;
***** Construct Average District Income in $s;
gen income = avginc*1000;
***** Define variables for subset of data;
gen testscr_lo = testscr if (str < 20);
gen testscr_hi = testscr if (str >= 20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, robust;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str < 20);
reg testscr d, r;
*********************************************************;
sum testscr;
ttest testscr=0;
ttest testscr_lo=0;
ttest testscr_hi=0;
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*Repeat the Analysis using STR = 19;
*********************************************************;
replace testscr_lo=testscr if (str < 19);
replace testscr_hi=testscr if (str >= 19);
ttest testscr_lo=testscr_hi, unequal unpaired;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;

The new commands reproduce some of the empirical results shown in Chapters 4 and 5 of Stock and Watson (2015). There are several features of STATA included in the new commands which have not been used in the previous examples:

1) The summarize command (sum) now includes the option detail, which provides more detailed summary statistics. The command is written as

sum str testscr, detail

This command tells STATA to compute summary statistics for the two variables str and testscr. The option detail produces detailed summary statistics that include, for example, the percentiles that are reported in Table 4.1 on p. 113 of Stock and Watson (2015).

2) The command

twoway scatter testscr str || lfit testscr str

constructs a scatterplot of testscr versus str and includes the estimated regression line for the simple regression of the California Test Score Data Set, shown on p. 116 of Stock and Watson (2015).

3) The command

cor str testscr

tells STATA to compute the correlation between the student teacher ratio and test scores.

4) Next you will reproduce equations (4.11) and (5.8) in Stock and Watson (2015) by using the regress (or, for short, reg) command:

reg testscr str, r

This command instructs STATA to run an OLS regression with testscr as the dependent variable and str as the regressor. The robust (short r) option tells STATA to calculate heteroskedasticity-robust standard errors of the regression coefficient estimators. Omitting this option results in the display of homoskedasticity-only standard errors.
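To see what the robust option changes, you can run the same regression twice, as in the sketch below. It assumes caschool.dta is in your working directory. The option vce(robust) is the longer spelling of r or robust.

```stata
* Sketch only: caschool.dta is assumed to be in the current working directory
use caschool.dta, clear

* Homoskedasticity-only standard errors (the default)
reg testscr str

* Heteroskedasticity-robust standard errors; same as adding ", r"
reg testscr str, vce(robust)
```

The coefficient estimates are the same in both runs. Only the standard errors, t-statistics, p-values, and confidence intervals differ.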

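5) The final innovation over the previous version of the Do-File is contained in the two commands following the line Equation 5.18. First a binary (sometimes referred to as dummy or indicator) variable d is created using the STATA command

gen d = (str < 20)

which sets d equal to 1 for districts with str below 20 and to 0 otherwise. The second command then regresses testscr on d.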
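As a sketch of the binary-variable step, the commands below create the indicator, check how many districts fall into each group, and run the regression. The sketch assumes caschool.dta is in your working directory and is not a copy of the textbook's own Do-File.

```stata
* Sketch only: caschool.dta is assumed to be in the current working directory
use caschool.dta, clear

* d equals 1 if the student-teacher ratio is below 20, and 0 otherwise
gen d = (str < 20)

* Number of districts in each group
tabulate d

* Regression of test scores on the binary variable, robust standard errors
reg testscr d, r
```

In a regression on a single binary variable, the coefficient on d is the difference between the two group means, so it should agree with the difference of about 7.37 reported by the two-sample ttest earlier.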



...
gen testscr_lo = testscr if (str < 20);
gen testscr_hi = testscr if (str >= 20);
*********************************************************;
*Carry Out Statistical Analysis;
*********************************************************;
***** Summary Statistics for Income;
sum
income;
*********************************************************;
***** Table 4.1 *****;
*********************************************************;
sum str testscr, detail;
*********************************************************;
***** Figure 4.2 *****;
*********************************************************;
twoway scatter testscr str || lfit testscr str;
*********************************************************;
***** Correlation *****;
*********************************************************;
cor str testscr;
*********************************************************;
***** Equation 4.11 and 5.8 *****;
*********************************************************;
reg testscr str, r;
*********************************************************;
***** Equation 5.18 *****;
gen d = (str < 20);
...
ttest ts_lostr=ts_histr if elq4==1, unp une;
*********************************************************;
* Equation 7.5 ;
*********************************************************;
reg testscr str el_pct, r;
*********************************************************;
* Equation 7.6 ;
*********************************************************;
replace expn_stu = expn_stu/2000;
reg testscr str expn_stu el_pct, r;
*********************************************************;
* Display Variance-Covariance Matrix;
*********************************************************;
vce;
*********************************************************;
* F-test reported in text;
*********************************************************;
test str expn_stu;
*********************************************************;
* Correlations reported in text;
*********************************************************;
cor testscr str expn_stu el_pct meal_pct calw_pct;
*********************************************************;
*Table 7.1, Column(1);
*********************************************************;
reg testscr str, r;
display "adjusted Rsquared = " e(r2_a);
* Column (2);
reg testscr str el_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (3);
reg testscr str el_pct meal_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (4);
reg testscr str el_pct calw_pct, r;
display "adjusted Rsquared = " e(r2_a);
* Column (5);
reg testscr str el_pct meal_pct calw_pct, r;
display "adjusted Rsquared = " e(r2_a);
*********************************************************;
* Appendix rule of thumb F-Statistic;
*********************************************************;
reg testscr str expn_stu el_pct;
test str expn_stu;
reg testscr el_pct;
*********************************************************;
*End of Program;
*********************************************************;
log close;
exit;

The file produces several of the empirical results from Chapter 7 of Stock and Watson (2015). As before, commands are abbreviated throughout when there is no possibility of confusion (generate becomes gen, regress turns into reg, etc.).

In essence there are two new commands:

1) The first new command involves the testing of restrictions in equation 7.6 (page 221 of Stock and Watson (2015)). The command

reg testscr str expn_stu el_pct, r

instructs STATA to compute the regression. The command vce asks STATA to print out the estimated variances and covariances of the estimated regression coefficients. The command

test str expn_stu

gets STATA to carry out the joint test that the coefficients on str and expn_stu are both equal to zero.
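The sequence just described can be tried without the California data. The sketch below is only an illustration: the dataset is simulated, and the seed, sample size, and coefficient values are assumptions; only the variable names follow the tutorial.

```stata
* Simulated stand-in for the California data (all numbers are made up)
clear
set seed 12345
set obs 200
generate str      = 14 + 12*runiform()
generate expn_stu = 4 + 3*runiform()
generate el_pct   = 100*runiform()
generate testscr  = 700 - 2*str + 3*expn_stu - 0.5*el_pct + rnormal(0, 10)

* Regression with heteroskedasticity-robust standard errors
regress testscr str expn_stu el_pct, robust

* Estimated variances and covariances of the coefficient estimators
vce

* Joint test that the coefficients on str and expn_stu are both zero
test str expn_stu

* test leaves its results in r(); print the F-statistic and its p-value
display "F = " r(F) "   p-value = " r(p)
```

The test command always refers to the most recently estimated regression, so it has to follow the regress command it belongs to.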

2) The second new command is in the analysis of Table 7.1 on page 241 of Stock and Watson (2015). When STATA computes an OLS regression, it computes the adjusted R-squared ($\bar{R}^2$) as described in Section 6.4, page 197 of Stock and Watson (2015). However, STATA does not display all of the results it computes; in particular, the adjusted R-squared is not displayed when the r option is invoked. The command

display "adjusted Rsquared = " e(r2_a)

instructs STATA to print out (display) the adjusted R-squared. Whatever appears between the two quotation marks (" ") will be displayed in your output (you did not have to display the words "adjusted Rsquared" but could have chosen anything else, such as "My Measure of Fit"). However, e(r2_a) tells STATA where to retrieve the stored result from and cannot be changed. The adjusted R-squared is not the only statistic that STATA stores and does not display. You can use the Help function or look in the Reference volume under Saved Results for the reg command to find other statistics.
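To see which results a regression has stored, the following sketch may help. It uses simulated data (seed, sample size, and coefficients are assumptions made for illustration) and the ereturn list command, which lists everything the last estimation command left in e().

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str     = 14 + 12*runiform()
generate el_pct  = 100*runiform()
generate testscr = 700 - 2*str - 0.5*el_pct + rnormal(0, 10)

regress testscr str el_pct, robust

* List all results stored by regress
ereturn list

* Print selected stored results; the text in quotes is free to choose
display "adjusted Rsquared = " e(r2_a)
display "My Measure of Fit = " e(r2_a)
display "number of observations = " e(N)
display "root MSE = " e(rmse)
```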

Other Examples of Do-Files

You will find other examples of Do-Files on the accompanying Web site for the Stock and Watson (2015) econometrics textbook. You can download STATA Do-Files from there to reproduce all of the analysis in Chapters 3-13. You will also find a STATA Do-File for the time series chapters 14-16 there. STATA programming for time series is somewhat more complicated than for cross-sectional or panel data. EViews and RATS are econometric programs specifically designed for time series data, and the Web site contains EViews and RATS programs for Chapters 14-16, as well as a tutorial for EViews.

3. A SUMMARY OF SELECTED STATA COMMANDS

This section lists several of the most useful STATA commands. Many of these commands have options. For example, the command summarize has the option detail and the command regress has the option robust. In the descriptions below, options are shown in square brackets [ ]. Many of these commands have several options and can be used in many different ways. The descriptions below show how these commands are commonly used. Other uses and options can be found in STATA's Help menu and in the other sources listed at the beginning of this tutorial.

The list of commands provided here is a small fraction of the commands in STATA, but these are the important commands that you will need to get started for your econometrics course. You should extend the list or create your own in addition to what is listed here.

Administrative Commands

#delimit

sets the character that marks the end of a command. For example, the command #delimit ; tells STATA that all commands will end with a semicolon. This command is used in Do-Files.
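A short Do-File fragment may make the effect of the delimiter clearer. This is a sketch with made-up variables x and y; it is meant to be run from a Do-File rather than typed into the Command Window.

```stata
* Run this from a Do-File
#delimit ;
clear ;
set obs 5 ;
generate x = _n ;
* with a semicolon delimiter, one command may span several lines ;
generate y = 1
           + 2*x ;
list x y ;
* switch back to the default, where each line is one command ;
#delimit cr
display "done"
```

The command #delimit cr restores the carriage return as the end-of-command marker, so the final display line needs no semicolon.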

clear

deletes/erases all variables from the current STATA session.

exit

in a Do-File, the command tells STATA that the program has ended. If you type exit in the STATA Command Window, then STATA will close.

log

controls STATA log files, which is where STATA writes output. There are two common uses of this command:

log using filename [, append replace]. This opens the file given by filename as a log file for STATA output. The options append and replace are used when there is already a file with the same name. With append, STATA will append the output to the bottom of the existing file. With replace, STATA will replace the existing file with the new output file.

log close. This closes the current log file.
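The two uses of log can be sketched as follows. The file name tutorial_example.log is hypothetical, and the text option (which requests a plain-text log) and the leading capture log close (which closes any log that happens to be open) are additions not discussed above.

```stata
* Close a log that may still be open; capture suppresses the error if none is
capture log close

* Start a new log, overwriting any existing file of the same name
log using "tutorial_example.log", replace text
clear
set obs 10
generate x = _n
summarize x
log close

* Reopen the same file and add further output at the bottom
log using "tutorial_example.log", append text
display "this line is appended to the existing log"
log close
```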

set mat #

sets the maximum number of variables that can be used in a regression. The default maximum is 40. If you have a huge number of observations and want to run a regression with 45 variables, then you will need to use the command set mat #, where # is a number greater than 45.

set memory #m

is used in Windows and Unix versions of STATA to set the amount of memory used by the program. For details, see the discussion within the tutorial.

set more off

tells STATA not to pause and display the --more-- message in the Results Window.

Data Management

describe

describes the contents of data in memory or on disk. A related command is describe using filename, which describes the dataset in filename.

drop list of variables

this deletes/erases the variables in list of variables from the current STATA session. For example, drop str testscr will delete the two variables str and testscr.

keep list of variables

deletes/erases all of the variables from the current STATA session except those in list of variables. In other words, it keeps the variables in the list and drops everything else. For example, keep str testscr will keep the two variables str and testscr and delete all of the other variables in the current STATA session.

list list of variables

tells STATA to print all of the observations for the variables listed in list of variables.

save filename [, replace]

tells STATA to save the dataset that is currently in memory as a file with name filename. The option replace tells STATA that it may replace any other file with the name filename.

use filename

tells STATA to load a dataset from the file filename.
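The data management commands can be tried together on a tiny made-up dataset. In this sketch the file name toy_caschool.dta and all data values are hypothetical; the clear option of use, which discards the data currently in memory before loading, is an addition not covered in the list above.

```stata
* Build a five-observation toy dataset
clear
set obs 5
generate str     = 18 + _n
generate testscr = 650 + 2*_n
generate el_pct  = 10*_n

* Inspect and print the data, then save it to disk
describe
list str testscr
save "toy_caschool.dta", replace

* drop removes the named variable; keep removes everything not named
drop el_pct
keep str
describe

* Describe the saved file without loading it, then reload all three variables
describe using "toy_caschool.dta"
use "toy_caschool.dta", clear
describe
```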

Transforming and Creating New Variables

New variables are created using the command generate, and existing variables are modified using the command replace.



Examples:

generate newts = testscr/100

creates a new variable called newts that is constructed as the variable testscr divided by 100.

replace testscr = testscr/100

changes the variable testscr so that all observations are divided by 100.

You can use the standard arithmetic operations of addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^) in generate/replace commands. For example,

generate ts_squared = testscr*testscr

creates a new variable ts_squared as the square of testscr. (This could also have been accomplished by using the command gen ts_squared = testscr^2.)
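The arithmetic operators can be checked on a few made-up observations. In the sketch below the values of testscr and the names ts_sq_alt and ts_mix are hypothetical and serve only to illustrate the commands.

```stata
* Four made-up test scores: 620, 640, 660, 680
clear
set obs 4
generate testscr = 600 + 20*_n

* Division, multiplication, and exponentiation
generate newts      = testscr/100
generate ts_squared = testscr*testscr
generate ts_sq_alt  = testscr^2

* Addition and subtraction combined with parentheses
generate ts_mix = (testscr - 600) + 2*newts

* replace overwrites an existing variable
replace testscr = testscr/100

list
```

Note that generate refuses to overwrite a variable that already exists, which is why replace is needed in the last step.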

You can also use relational operators to construct binary variables. For example, in the fourth batch file, the following command was included:

gen d = (str



Statistical Operations

cor list of variables

tells STATA to compute the correlation between each of the variables in list of variables.
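A brief sketch of cor on simulated data follows; the seed, sample size, and coefficients are assumptions. When exactly two variables are listed, the correlation is also available afterwards as r(rho).

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str     = 14 + 12*runiform()
generate el_pct  = 100*runiform()
generate testscr = 700 - 2*str - 0.5*el_pct + rnormal(0, 10)

* Correlation matrix for three variables
cor testscr str el_pct

* With two variables, the correlation is stored in r(rho)
cor testscr str
display "correlation between testscr and str = " r(rho)
```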

twoway scatter var1 var2 || lfit var1 var2

produces a scatter plot of var1 on the Y-axis and var2 on the X-axis. If the || lfit part is included, then the fitted OLS line is also displayed.

predict newvarname [, residuals]

when this command follows the regress command, the OLS predicted values or residuals are calculated and stored under the name newvarname. When the option residuals is used, the residuals are computed; otherwise the predicted values are computed and placed into newvarname.

Example:

regress testscr str expn_stu el_pct

predict tshat

predict uhat, residuals

Here, testscr is regressed on str, expn_stu, and el_pct (first command); the fitted values are stored under the name tshat (second command), and the residuals are stored under the name uhat (third command).
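The relationship between fitted values and residuals can be verified directly. The sketch below uses simulated data (seed, sample size, and coefficients are assumptions), and the variable name check is hypothetical.

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str      = 14 + 12*runiform()
generate expn_stu = 4 + 3*runiform()
generate el_pct   = 100*runiform()
generate testscr  = 700 - 2*str + 3*expn_stu - 0.5*el_pct + rnormal(0, 10)

regress testscr str expn_stu el_pct
predict tshat
predict uhat, residuals

* Actual value = fitted value + residual, so check is zero up to rounding
generate check = testscr - tshat - uhat
summarize check uhat
list testscr tshat uhat in 1/5
```

Because the regression includes a constant, the OLS residuals in uhat also average to zero, which the summarize output should confirm up to rounding.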

regress depvar list of variables [if expression] [,robust noconstant]

carries out an OLS regression of the variable depvar on list of variables. When if expression is used, then the regression is estimated using observations for which expression is true. The option robust tells STATA to use the heteroskedasticity-robust formula for the standard errors of the coefficient estimators. The option noconstant tells STATA not to include a constant (intercept) in the regression.

Examples:

reg testscr str, r
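The if qualifier and the noconstant option can be illustrated in the same way. This is a sketch on simulated data; the seed, the coefficients, and the cutoff of 50 for el_pct are arbitrary assumptions.

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str     = 14 + 12*runiform()
generate el_pct  = 100*runiform()
generate testscr = 700 - 2*str - 0.5*el_pct + rnormal(0, 10)

* Heteroskedasticity-robust standard errors, all observations
reg testscr str, r

* Same regression using only observations with el_pct above 50
regress testscr str if el_pct > 50, robust
display "observations used = " e(N)

* Regression without an intercept
regress testscr str, noconstant

* Coefficient and standard error on str from the last regression
display "b = " _b[str] "   se = " _se[str]
```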


summarize [list of variables] [, detail]

computes summary statistics. If the command is used without a list of variables, then summary statistics are computed for all of the variables in the dataset. If the command is used with a list of variables, then summary statistics are computed for all variables in the list. If the option detail is used, more detailed summary statistics (including percentiles) are computed.

Examples:

sum testscr str

computes summary statistics for the variables testscr and str.

sum testscr str, detail

computes detailed summary statistics for the variables testscr and str.
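Like regress, summarize stores results that can be retrieved afterwards, in this case in r(). The sketch below uses simulated data (seed and coefficients are assumptions); when several variables are listed, r() refers to the last one.

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str     = 14 + 12*runiform()
generate testscr = 700 - 2*str + rnormal(0, 10)

* Basic and detailed summary statistics
sum testscr str
sum testscr, detail

* Retrieve stored results of the last summarize command
display "mean = " r(mean) "   median = " r(p50) "   sd = " r(sd)
```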

test

this command is used to test hypotheses about regression coefficients. It can be used to test many types of hypotheses. The most common use of this command is to carry out a joint test that several coefficients are equal to zero. Used this way, the form of the command is test list of variables, where the test is carried out on the coefficients corresponding to the variables given in list of variables.

Example:

regress testscr str expn_stu el_pct

test str expn_stu

Here testscr is regressed on str, expn_stu, and el_pct (first command), and a joint test of the hypothesis that the coefficients on str and expn_stu are jointly equal to zero is carried out (second command).
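Beyond the joint zero test, test also accepts explicit restrictions. The sketch below shows a few such forms on simulated data; the seed, the coefficients, and the hypothesized value of -1 are arbitrary assumptions chosen for illustration.

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str      = 14 + 12*runiform()
generate expn_stu = 4 + 3*runiform()
generate el_pct   = 100*runiform()
generate testscr  = 700 - 2*str + 3*expn_stu - 0.5*el_pct + rnormal(0, 10)

regress testscr str expn_stu el_pct, robust

* A single coefficient equal to a nonzero value
test str = -1

* Two coefficients equal to each other
test str = expn_stu

* Joint test with explicit restrictions; equivalent to: test str expn_stu
test (str = 0) (expn_stu = 0)
display "F = " r(F) "   restrictions = " r(df) "   p-value = " r(p)
```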

ttest

this command is used to test a hypothesis about the mean or the difference between two means. The command has several forms. Here are a few:

ttest varname = # [if expression] [, level(#)]

Here you test the null hypothesis that the population mean of the series varname is equal to #. When if expression is used, then the test is computed using observations for which expression is true. The option level(#) is the desired level of the confidence interval. If this option is not used, then a confidence level of 95% is used.

Examples:

ttest testscr = 0;



tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95% confidence interval.

ttest testscr = 0, level(90);

tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 90% confidence interval.
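The one-sample forms of ttest can be tried on simulated data as sketched below. The seed, the hypothesized mean of 660, and the cutoff used in the if qualifier are arbitrary assumptions, not values from the textbook.

```stata
* Simulated data; only the variable names follow the tutorial
clear
set seed 12345
set obs 200
generate str     = 14 + 12*runiform()
generate testscr = 700 - 2*str + rnormal(0, 10)

* H0: the population mean of testscr equals 660, with a 90% confidence interval
ttest testscr = 660, level(90)
display "t = " r(t) "   two-sided p-value = " r(p)

* Same test on a subsample selected with if, default 95% confidence interval
ttest testscr = 660 if str >= 20
display "observations used = " r(N_1)
```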

ttest testscr = 0 if (str



tutorial is not intended to replace the Reference or User's Guide. The best way to learn how to use the program is to spend some time exploring and working with it. For a nice visual introduction to the manuals, go to www.youtube.com/embed/xWJTFtWhQc4.

STATA replication batch files for all the results in the Stock/Watson textbook are available from the Web site. You are invited to download these and study them.

There are many other tutorials on STATA available to you on the internet. If you prefer a visual one, then www.ats.ucla.edu/stat/stata/notes/default.htm might be a good one to look at. STATA has its own YouTube series, which you can find at www.stata.com/links/video-tutorials/.