Stata Tutorial 13 v2 0

Jun 01, 2018

  • 8/9/2019 Stata Tutorial 13 v2 0


    STATA 13 Tutorial

    by Manfred W. Keil

    to Accompany

    Introduction to Econometrics

    by James H. Stock and Mark W. Watson

    ------------------------------------------------------------------------------------------------------------------

    1. STATA: INTRODUCTION

    2. CROSS-SECTIONAL DATA

    Interactive Use: Data Input and Simple Data Analysis

    a) The Easy and Tedious Way: Manual Data Entry
    b) Summary Statistics
    c) Graphical Presentations
    d) Simple Regression
    e) Entering Data from a Spreadsheet
    f) Importing Data Files directly into STATA
    g) Multiple Regression Model
    h) Data Transformations

    Batch (Do-Files)

    3. SUMMARY OF FREQUENTLY USED STATA COMMANDS

    4. FINAL NOTE

    -----------------------------------------------------------------------------------------------------------------


    1. STATA: INTRODUCTION

    This tutorial introduces you to a statistical and econometric software package called STATA. It covers some of the most commonly used features in STATA. These features were used by the authors of your textbook to generate the statistical analysis reported in Chapters 3-9 (Stock and Watson, 2015). The tutorial provides the necessary background to reproduce the results of Chapters 3-9 and to carry out related exercises. It does not cover panel data (Chapter 10), binary dependent variables (Chapter 11), instrumental variable analysis (Chapter 12), or time-series analysis (Chapters 14-16).

    The most current professional version is STATA 13. STATA 12 and STATA 13 are sufficiently similar that those who only have access to STATA 12 can also use this tutorial. As with many statistical packages, newer versions of the program allow you to use more advanced and recently developed techniques that you, as a first-time user, most likely will not encounter in a first course in statistics or econometrics. There are several versions of STATA 12, such as STATA/IC, STATA/SE, and STATA/MP. They differ mainly in the number of variables STATA can handle and the speed at which information is processed. Most users will probably work with the Intercooled (IC) version.

    STATA runs on Windows (2000, 2003, XP, Vista, Server 2008, or Windows 7), Mac, and Unix platforms. It is produced by StataCorp in College Station, TX. You can read about various product information at the firm's Web site, www.stata.com. There are 20 manuals that can be purchased with STATA 13, although subsets can be bought separately. Perhaps the most useful of these are the User's Guide and the Base Reference Manual, which can simply be downloaded. You can order STATA by calling (800) 782-8272 or by filling out a form at www.stata.com/order/quote-request/student/. In addition, if you purchase the Student Version, you can acquire STATA at a steep discount. Prices vary, but you could get a perpetual license for STATA/IC for $189, or a six-month license for as low as $69 (a business/single user pays $1,695 to purchase STATA). There is even a 30-day free evaluation copy of STATA.

    Econometrics deals with three types of data: cross-sectional data, time series data, and panel (longitudinal) data (see Chapter 1 of Stock and Watson (2015)). In a cross-section you analyze data from multiple entities at a single point in time. In a time series you observe the behavior of a single entity over multiple time periods. This can range from high-frequency data such as financial data (hours, days); to data observed at somewhat lower (monthly) frequencies, such as industrial production, inflation, and unemployment rates; to quarterly data (GDP) or annual (historical) data. One big difference between cross-sectional and time series analysis is that the order of the observations does not matter in cross-sections. With time series, you would lose some of the most interesting features of the data if you shuffled the observations. Finally, panel data can be viewed as a combination of cross-sectional and time series data, since multiple entities are observed at multiple time periods. STATA allows you to work with all three types of data.

    STATA is most commonly used for cross-sectional and panel data in academia, business, and government, but you can work with it relatively easily when you analyze time-series data. STATA allows you to store results within a program and to retrieve these results for further calculations later. Remember how you calculated confidence intervals in statistics, say for a population mean? Basically you needed the sample mean, the standard error, and some value from a statistical table. In STATA, you can calculate the mean and standard deviation of a sample and then temporarily store these. You then work with these numbers in a standard formula for confidence intervals. In addition, STATA provides the required numbers from the relevant distribution (normal, $\chi^2$, F, etc.).
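The idea of storing results and reusing them in a confidence-interval formula can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses auto.dta, a small data set that ships with STATA (it is not one of the data sets used in this tutorial), and the scalar names xbar, se, z, ci_lo, and ci_hi are made up for the example.

```stata
* Stand-in data (assumption): auto.dta ships with STATA
sysuse auto, clear

* summarize leaves its results behind in r(): r(mean), r(sd), r(N)
quietly summarize mpg
scalar xbar = r(mean)
scalar se   = r(sd) / sqrt(r(N))

* Critical value from the standard normal distribution (no table needed)
scalar z = invnormal(0.975)

* Standard large-sample 95% confidence interval for the population mean
scalar ci_lo = xbar - z * se
scalar ci_hi = xbar + z * se
display "95% confidence interval: [" ci_lo ", " ci_hi "]"
```

The same pattern works after most STATA commands: run the command, then pick up the stored values before the next command overwrites them.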

    While STATA is truly interactive, sooner rather than later you will run a program in batch mode.

    Interactive use: you type a STATA command in the STATA Command Window (see below) and hit the Return/Enter key on your keyboard. STATA executes the command and the results are displayed in the STATA Results Window. Then you enter the next command, STATA executes it, and so forth, until the analysis is complete. Even the simplest statistical analysis typically will involve several STATA commands.

    Batch mode: all of the commands for the analysis are listed in a file, and STATA is told to read the file and execute all of the commands. These files are called Do-Files and are saved using a .do suffix.
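To make the batch idea concrete, here is a minimal sketch of what a Do-File might contain. The file name example.do and the use of auto.dta (a data set shipped with STATA) are assumptions made purely for illustration.

```stata
* example.do: a hypothetical Do-File (the file name is an assumption)
* Run it from the Command Window by typing:  do example.do

clear all
sysuse auto, clear          // stand-in data set shipped with STATA

summarize mpg weight        // summary statistics for two variables
regress mpg weight, robust  // regression with robust standard errors
```

Each line is a command you could also have typed interactively; the Do-File simply records them so that the whole analysis can be rerun and checked.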

    In the good old days the equivalent of writing a Do-File was to submit a batch of cards, each card containing a single command (now a line), to a technician, who would use a card reader to enter these into the computer. The computer would then execute the sequence of statements. (The deck of cards was referred to as a "file" and was typically stored in a filing cabinet, with a rubber band around each file or deck of cards.) While you will work at first in interactive mode by clicking on buttons or writing single-line commands, you will very soon discover the advantage of running your regressions in batch mode. This method allows you to see the history of commands, and you can also analyze where exactly things went wrong if there are problems (errors) with any of your commands. This tutorial will initially explain the interactive use of STATA since it is more intuitive. However, we will switch to batch mode as soon as it makes sense, and you should seriously try to do your research/class work using this mode (Do-Files).

    STATA produces highly professional-looking graphs and charts. However, it requires some practice to generate these. A separate manual (Graphics) is devoted to this topic alone. Since STATA works in a Windows format, it allows you to cut and paste the data into other Windows-based programs, such as Word or WordPerfect.

    Finally, there is a warning about the limitations of this tutorial. The purpose is to help you gain an initial understanding of how to work with STATA. I hope that the tutorial looks less daunting than the manuals. However, it cannot replace the accompanying manuals, which you will have to consult for more detailed questions (alternatively use Help within the program). Feel free to provide me with feedback on how the tutorial can be improved for future generations of students (mkeil@claremontmckenna.edu). Colleagues of mine and I have decided to set up a Wiki run by students but supervised by faculty at my academic institution. We have found that the wisdom of crowds often produces valuable information for those who follow. This is, of course, just a suggestion. Finally, you may want to think about working with statistical software as learning a new language: practicing it routinely will result in improvement. If you set it aside for too long, you will only remember the most important lines but will forget the important details. Another danger of tutorials like this is that you simply follow the instructions and, when you are done, you do not remember the commands. It is therefore a good idea to keep a separate sheet and to write down commands and examples of them if you think you will use them later. I will give you short exercises so that you can practice the commands on your own.

    2. CROSS-SECTIONAL DATA

    Interactive Use: Data Input and Simple Data Analysis

    Let's get started. Click on the STATA icon to begin your session, or choose STATA 13 from your START window. Once you have started STATA, you will see a large window containing several smaller windows. At this point you can load a data set or enter data (described below) and begin the statistical analysis.

    The results of your various operations will be displayed in the so-called Results Window. On the bottom left, there is a Variables Window, which shows the names of variables currently active in the data file. Above it is the Review Window, which lets you view previously used STATA commands. In interactive use, STATA allows you to execute commands either by clicking on command buttons or by typing the equivalent command into the Command Window.

    In this tutorial, we will work with two data applications, both cross-sectional (the California Test Score Data Set used in Chapters 4-9, and the Current Population Survey Data Set used in Chapters 3 and 8 as an exercise).

    a) The Easy and Tedious Way: Manual Data Entry

    In Chapters 4 to 9 you will work with the California Test Score Data Set. These are cross-sectional data. There are 420 observations from K-6 and K-8 school districts for the years 1998 and 1999. You will not want to enter a large amount of data manually, since it is tedious and leaves room for human error. As a result, it is generally not a recommended method of inputting data. However, there are occasions when you have collected data by yourself (something that economists are doing more and more). The alternative is to enter the data into a spreadsheet (Excel) and then to cut and paste the data (see below).

    Entering data manually is used here for pedagogical purposes since it gives you an initial understanding of how to work with data in STATA. In other words, it will be useful for you to become familiar with entering, and editing, data in the program. Here I will use a sub-sample of 10 observations from the California Test Score Data Set.

    To start, click on the Data Editor button on the toolbar, or type the command edit into the Command Window. This will open the following screen:

    To enter data manually, start typing in the observations (you will name the variables subsequently). Here I have chosen 10 observations of test scores (testscr) and the student-teacher ratio (str) from the data set you will use in Chapter 4 of the textbook (type in the numbers for all three).

    testscr    str     school
    606.8      19.5    1
    631.1      20.1    2
    631.4      21.5    3
    631.8      20.1    4
    631.9      20.4    5
    632        22.4    6
    632        22.9    7
    638.5      19.1    8
    638.7      20.2    9
    639.3      19.7    10

    Make sure not to type the variable names in the three columns; only enter the numbers. Also, should numbers turn from black to red, then it means that STATA cannot identify the data you typed in.

    After entering the data, double-click the grey box at the top of the first column. This will result in the following box appearing at the bottom right of your screen:

    In the Name box, replace var1 with the name of the first column variable, here testscr. In the Label box, you may want to enter information that helps you remember how the data was created originally, or that serves as information for others who may subsequently work with your data. I suggest you enter here

    Avg test score (=(read_scr+math_scr)/2)

    Do a similar operation for the second column, that is, rename var2 as str. Similarly, you could enter the following label for the variable str:

    Student teacher ratio (teachers/enrl_tot)

    Finally, call the third column school.

    After completing this task, the Data Editor screen should look as follows:

    Next close the box. Note that your commands to edit the data now appear in the Results Box, your command to edit is listed in the Command Box, and your newly created variables are shown in the variable list on the upper right-hand side:

    Entering data in this way is very tedious, and you will make data input errors frequently. You will see below how to enter data directly from a spreadsheet or an ASCII file, which are the most common forms of data you will receive in the future.
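The renaming and labelling done through the Data Editor can also be typed as commands. The sketch below is only illustrative: it fabricates three placeholder columns named var1, var2, and var3 (the default names the Data Editor assigns) with made-up values, so that the rename and label variable commands have something to act on.

```stata
* Placeholder data (assumption): three unnamed columns with made-up values
clear
set obs 3
generate var1 = 600 + _n
generate var2 = 19 + _n
generate var3 = _n

* Give the columns the names used in the tutorial
rename var1 testscr
rename var2 str
rename var3 school

* Attach the descriptive labels suggested in the text
label variable testscr "Avg test score (=(read_scr+math_scr)/2)"
label variable str "Student teacher ratio (teachers/enrl_tot)"

* Show variable names and labels
describe
```

Typing the commands has the advantage that they can later be copied into a Do-File.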

    In general, you can look at variables that already exist by typing in the command

    list varname1 varname2 ...

    where varname refers to a variable that exists in your workfile. (Separate the variable names with spaces only; STATA treats whatever follows a comma as an option.) Try it here by typing

    list testscr str

    This command will list, one screen at a time, the data on the variables for every observation in the data set. (Missing values are denoted by a period or "." in STATA.) Later on, you will work with large data sets, and you will probably not want to see all observations. You can imagine how long this may take with 5,000 observations or more. Failing to look at the data observation by observation, of course, takes away the ability to spot errors in the data set, perhaps generated by others during data entry. However, there are other methods to spot such problems, such as summarizing the data.

    You can always stop the listing by hitting the break button on the toolbar (it looks like a red pentagon with a white x in the middle). This button can be used to stop the execution of any command in STATA.

    You should see the following:

    b) Summary Statistics

    For the moment, let's just see if we are working with the same data set. Type in the following command

    sum testscr str, detail

    sum stands for summarize and the option detail gives you a more extensive list of summary statistics for each of the variables you have entered. These include the median and certain percentiles of the frequency distribution. You will learn later that you can also obtain summary statistics for a subset of your data by adding an if or in qualifier following the variable name.

    The summary statistics are explained in Chapter 2 of your textbook (for example, kurtosis is defined in equation (2.15) on page 25 in Stock and Watson (2015)).

    If your summary statistics differ, then check the data again. To return to the data observations, edit the data using the Data Editor. Once you have located the data problem, click on the observation and change it. After correcting the problem, press the Preserve button again.

    After entering the data, there are various things you can do with it. You may want to keep a hard copy of what you just entered. If so, click on the Print button. This will print the entire output of what you have produced so far.
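The summarize command with the detail option, and the if and in qualifiers mentioned in the Summary Statistics section, can be sketched as follows. As an assumption for illustration, the example runs on auto.dta (shipped with STATA) rather than on the 10 school districts, so that it works without any typing of data; the same qualifiers apply to list.

```stata
* Stand-in data (assumption): auto.dta ships with STATA
sysuse auto, clear

* Detailed summary statistics; the median is stored in r(p50)
summarize mpg, detail
display "median mpg = " r(p50)

* Restrict to a subset with if (a condition) or in (a range of observations)
summarize mpg if foreign == 1
summarize mpg in 1/10

* list accepts the same qualifiers, which avoids scrolling through all rows
list make mpg in 1/5
list make rep78 if missing(rep78)
```

With the tutorial's data the analogous commands would use testscr and str in place of mpg.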

    In general, it is a good idea to save the data and your work frequently in some form. Many of us have learned through multiple painful experiences how easy it is to lose hours of work by not backing up data/results in some fashion. To save the data set you created, either press the Save button or click on File and then Save As. Follow the usual Windows format for saving files (drives, directories, file type, etc.). If you save data sets in STATA-readable format, then you should use the extension .dta. Once you have saved your work, you can call it up the next time you intend to use it by clicking on File and then Open. Try these operations by saving the current workfile under the name SW13smpl.dta.
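Saving and reopening a data set can also be done from the Command Window. The following is a minimal sketch, assuming the file is written to STATA's current working directory; auto.dta is loaded only as stand-in content so that there is something in memory to save.

```stata
* Stand-in content (assumption): any data set in memory will do
sysuse auto, clear

* Save in STATA format; replace overwrites an existing file of that name
save SW13smpl.dta, replace

* Clear memory, then call the file up again
clear
use SW13smpl.dta, clear
describe, short
```

The replace option is worth noticing: without it, STATA refuses to overwrite an existing file.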

    c) Graphical Presentations

    Most often it is a good idea to generate graphs (pictures) to get some feel for the data. You will be able to detect outliers, which may be the result of data entry errors, or you will be able to see if the data make sense. Although STATA offers many graphing options, we will only go through a few commonly used ones here. [1]

    There are three graphs that you will use most often:

    histograms;
    line graphs, where one or more variables are plotted across entities (these will become more important in time series analysis when you are plotting variables over time);
    scatterplots (crossplots), where one variable is graphed against another.

    The purpose of histograms is to display absolute or relative frequencies for a single variable. In general, the command is

    histogram varname, percent title( )

    The percent option produces relative frequencies, and the title option adds whatever name you place between ( ) to the top of the graph.

    You can either save the graph you have generated, or copy and paste it into another Windows-based document, such as Word. (Replacing percent with frequency would have resulted in absolute, rather than relative, frequencies being plotted; there are other options for you to explore, such as the number of classes (bins) to choose, etc.)

    [1] I found the following STATA site particularly useful for graphs: http://www.stata.com/support/faqs/graphics/gph/statagraphs.html

    Try

    histogram testscr, percent title(Testscores)

    [Figure: histogram titled "Testscores"; x-axis: Avg test score (=(read_scr+math_scr)/2), 600 to 640; y-axis: Percent]

    To create a line graph in a cross section, you can add a third variable in your data set which takes on the number of the observation (here: 1, 2, 3, ..., 10), in this case, the variable school that we created.

    Let's plot the student-teacher ratio for the first 10 observations using the scatter command. The command is followed by the two variables you would like to see plotted, where the first one appears on the Y axis and the second on the X axis.

    scatter varname1 varname2

    plots variable 1 against variable 2. Try this with the student-teacher ratio and the variable school.

    The resulting graph just gives you the data points here. There are two ways to make this more informative. One is to connect the points by using the line command followed by the two variable names. Alternatively, you can use the twoway connected command to have both the points and the lines displayed.

    Try both here:

    line str school

    twoway connected str school
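The histogram options mentioned earlier (frequency instead of percent, and a chosen number of bins) can be tried out along the following lines. This is a sketch under assumptions: it uses auto.dta, shipped with STATA, as stand-in data, and the title text and the file name mileage.gph are made up for the example.

```stata
* Stand-in data (assumption): auto.dta ships with STATA
sysuse auto, clear

* Absolute frequencies in 8 bins, with a title on top of the graph
histogram mpg, frequency bin(8) title("Mileage")

* Save the graph in STATA's own graph format for later editing
graph save mileage.gph, replace
```

With the tutorial's data, the corresponding command would use testscr in place of mpg.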

    After the graph appears, you can edit it using the Graph Editor (either use File and then Start Graph Editor or push the Graph Editor button). Alter the graph until it looks like the one below. Some of the alterations can be made in the resulting dialog boxes.

    [Graph 1: "Student-Teacher Ratio Across 10 School Districts"; x-axis: School District (1 to 10); y-axis: Student-Teacher Ratio (18 to 24)]

    Frequently you will be interested either in causal relationships between variables or in the ability of one variable to forecast another. As a result, it is a good idea to plot two variables in the same graph.

    The first way to look for a relationship is to plot the observations of both variables. This can be done by generalizing the command twoway connected to include more than two variable names (one or more for the Y axis, followed by one for the X axis). Try this here with

    twoway connected str testscr school

    The resulting graph is pretty uninformative, since test scores and student-teacher ratios are on a different scale. You can allow for two (or more) scales by entering the following command:

    twoway (scatter str school, c(l) yaxis(1)) (scatter testscr school, c(l) yaxis(2))

    This command instructs STATA to connect the points with lines (c(l), with a lowercase letter l, is short for connect(l)) and to use two Y axes, one for the student-teacher ratio on the left side of the graph, and the other for test scores on the right side of the graph. You may want to beautify the resulting graph by using the Graph Editor. See if you can produce something like the graph below:

    [Graph 2: "Test Scores and Student-Teacher Ratio Across 10 School Districts"; x-axis: School District (1 to 10); left y-axis: Student-teacher ratio (18 to 24); right y-axis: Avg test score (600 to 640); legend: Student-Teacher Ratio, Avg Test Score]

    To get an even better idea about the relationship, you can display a two-dimensional relationship in a scatterplot (see page 92 of your Stock and Watson (2015) textbook). Given our discussion above, you could simply use the command scatter testscr str. However, you may want to see what a fitted line through that scatterplot would look like, in which case you have to modify the command slightly:

    scatter testscr str || lfit testscr str

    where || is the key | typed twice.

    This will result in the following graph (after beautification):

    [Graph 3: "Scatterplot of Test Scores vs Student-Teacher Ratio"; x-axis: Student-Teacher Ratio (19 to 23); y-axis: Test Scores (600 to 640); legend: Fitted values]
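The || notation for overlaying a fitted line on a scatterplot has an equivalent form in which each plot is wrapped in parentheses after twoway. The sketch below shows that form and then checks the slope of the fitted line numerically. As an assumption for illustration, it uses auto.dta (shipped with STATA), and the title text is made up.

```stata
* Stand-in data (assumption): auto.dta ships with STATA
sysuse auto, clear

* Same overlay as "scatter mpg weight || lfit mpg weight", written with ()
twoway (scatter mpg weight) (lfit mpg weight), title("Mileage vs. weight")

* lfit draws the least squares line; regress reports its slope and intercept
regress mpg weight
display "slope of the fitted line = " _b[weight]
```

The parenthesized form tends to be easier to read once each plot carries its own options, as in the two-axis command above.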


    (Do not worry about the positive slope here. Remember, this is a sample, and a very small one at that. After all, you may get 10 heads in 10 flips of a coin.)

    d) Simple Regression

    There is a commonly held belief among many parents that lower student-teacher ratios will result in better student performance. Consequently, in California, for example, all K-3 classes were reduced to a maximum student-teacher ratio of 20 (Class Size Reduction Act, CSR) in the late 1990s. This comes at a cost, of course. Initially, it was $1.8 billion a year. With dollar figures as big as these (ask yourself, if you laid down a dollar bill every second, how many years would it take to reach 1 billion?), the natural question arises whether or not it is worth it. That is why you are analyzing the effect of reducing student-teacher ratios in Chapters 4-9 of the Stock and Watson textbook.

    For the 10 school districts in our sample, we seem to have found a positive relationship between larger classes and student performance. Not to worry: we will soon work with all 420 observations from the California School Data Set, and we will then find the negative relationship you have seen in the textbook. For now, we are more concerned with learning techniques in STATA.

    In the previous section, we included a regression line in the scatterplot, something that you should have encountered towards the end of your statistics course. However, the graph of the regression line does not allow you to make quantitative statements about the relationship; you want to know the exact values of the slope and the intercept. For example, in general applications, you may want to predict the effect of an increase by one in the explanatory variable (here the student-teacher ratio) on the dependent variable (here the test scores).

    To answer the questions relating to the more precise nature of the relationship between class size and student performance, you need to estimate the regression intercept and slope. A regression line is little else than fitting a line through the observations in the scatterplot according to some principle. You could, for example, draw a line from the test score for the lowest student-teacher ratio to the test score for the highest student-teacher ratio, ignoring all the observations in between. Or you could sort the data by student-teacher ratio and split the sample in half so that the observations with the lowest five student-teacher ratios are in one set, and the observations with the highest five student-teacher ratios are in the other set. For each of the two sets you could calculate the average student-teacher ratio and the corresponding average test score, and then connect the two resulting points. Or you could just eyeball the relationship. Some of these principles have better properties than others to infer the true underlying (population) relationship from the given sample. The principle of Ordinary Least Squares (OLS), for example, will give you desirable properties under certain restrictive assumptions that are discussed in Chapter 4 of the Stock/Watson textbook.
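For reference, the OLS principle can be written down compactly. The block below states the standard textbook formulas for the slope and intercept that minimize the sum of squared vertical distances between the observations and the line. The notation is an assumption of this sketch: Y is the dependent variable, X the explanatory variable, N the number of observations, bars denote sample means, and hats denote estimates.

```latex
\min_{b_0,\, b_1} \; \sum_{i=1}^{N} \left( Y_i - b_0 - b_1 X_i \right)^2
\quad \Longrightarrow \quad
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```

Unlike the two-point or split-sample lines described above, these formulas use every observation, which is one reason the resulting line behaves better in repeated samples.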

    Back to computing. If the dependent variable, Y, is only determined by a single explanatory variable X in a linear fashion of the type

    $Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, 2, \ldots, N$

    with u representing the error, or random disturbance, not accounted for by the linear equation, then the task is to find a value for $\beta_0$ and $\beta_1$. If you had values for these coefficients, then $\beta_1$ describes the effect of a unit increase in X on Y.

    Often a regression line is a linear approximation to an underlying complicated relationship and the intercept $\beta_0$ only has a useful meaning if observations around X=0 occur in the data. As we have seen in the scatterplot above, there are no observations around the student-teacher ratio of zero, and it is therefore better not to interpret the numerical value of the intercept at all. Your professor most likely will give you a serious penalty in the exam for interpreting the intercept here because with no students present, there is no score to record. (What would be the function of the teacher in that case?)

    There are various ways to estimate the regression line. The command for regressing a variable Y on a constant (intercept) and another variable X is:

    reg Y X

    where reg stands for least squares regression. For the current application, type

    reg testscr str, r

    where the r following the comma indicates that you are using heteroskedasticity-robust standard errors (even though you have not requested an intercept to be included, STATA will automatically do so. There is an option for you to suppress the intercept, but you will most likely never use it).

    The output appears as follows:

    According to these results, lowering the student-teacher ratio by one student per class results in a decrease of 0.6 points, on average, in the district-wide test score. Using the notation of your textbook, you should display the results as follows:

    $\widehat{TestScore} = \underset{(51.1)}{618.9} + \underset{(2.33)}{0.61}\, STR, \quad R^2 = 0.007, \quad SER = 9.8$

    Note that the result for the 10 chosen school districts is quite different from the sample of all 420 school districts. However, as pointed out before, this is a rather small sample and the regression $R^2$ is quite low. As a matter of fact, in Chapter 5 of your textbook, you will learn that the above slope coefficient is not statistically significant.

    e) Entering Data from a Spreadsheet

    So far you entered data manually. Most often you will work with larger data sets that are external to the STATA program, i.e., they will not be included in, or be part of, the program itself. This makes sense as data sets either become very large or are generated by another program, such as a spreadsheet.

    Stock and Watson present the California Test Score Data Set in Chapter 4 of the textbook. Locate the corresponding Excel file caschool.xlsx on the accompanying website (where you found this tutorial) and open it. Next, following the procedures discussed previously, start STATA and open the Data Editor. Return to the Excel file and mark F1:R421. Next, using the copy and paste commands common to Windows programs, move the data block to STATA, choosing the option Treat First Row as Variable Names. You are presumably familiar with this procedure. Make sure to select the grey box to the immediate right of 1 before pasting. Note that STATA has conveniently included the names of the variables in the Data Editor.

    This is what you should see in STATA:

    When you are done, you are ready to save the file. Name it caschool.dta.

    You can now reproduce Equation (4.7) from the textbook. Use the regression command you previously learned to generate the following output.

    . reg testscr str, r

    Linear regression                               Number of obs =      420
                                                    F(  1,   418) =    19.26
                                                    Prob > F      =   0.0000
                                                    R-squared     =   0.0512
                                                    Root MSE      =   18.581

                 |               Robust
         testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
           _cons |    698.933   10.36436    67.44   0.000     678.5602    719.3057

    (You can find the standard errors and the distribution of the estimators on p. 131 of the Stock and Watson (2015) textbook. The regression $R^2$, sum of squared residuals (SSR), and standard error of the regression (SER) are presented in Key Concept 4.3.)

    f) Importing Data Files directly into STATA

    Excel (Spreadsheet) Files

    Even though the cut and paste method seemed straightforward enough, there is a second, more direct, way to import data into STATA from Excel, which does not involve copying and pasting data points.

    In general, make sure your data is organized with the variable names in Row 1 of your spreadsheet, with each column representing a different variable, and the observations in the rows beneath the variable names. Then, save your data set in Excel (or an alternative spreadsheet program) as a .csv file (this stands for comma separated values).

    Start again with a new STATA file. Next, type the following command into the command window in STATA:

    insheet using (filename)

    where (filename) is the directory location of your file. (To find this, locate the file and right-click, selecting the Properties button. This should contain the location of the file, to which you must add the filename; here is an example: C:\Econometrics\StockWatson\caschool.csv.) If your filename has any spaces or any symbol that appears on the number keys of the keyboard, then you should put quotation marks around your filename. STATA reads spaces as denoting separations between words, and therefore will only read the filename up until the first space or symbol, and then considers the rest to be a separate command.

    Note: In order to insheet data, there must be no data already stored in memory. To get rid of any data that is already stored, type the command

    clear

    before insheeting.
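    The steps above can be sketched as follows. This is only an illustration: the path is the example location mentioned earlier in this tutorial, so you would replace it with the location on your own computer. The commented-out line shows import delimited, a newer command available in STATA 13 that can clear memory itself; insheet continues to work.

```stata
* Sketch only: replace the path with the location of your own .csv file.
* Quotation marks are used so that paths with spaces are read correctly.
clear
insheet using "C:\Econometrics\StockWatson\caschool.csv"

* List the variables that were read in.
describe

* Alternative in STATA 13 (clears memory as part of the command):
* import delimited using "C:\Econometrics\StockWatson\caschool.csv", clear
```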

    Once you have insheeted your data, you should see this reflected in your Results box, and your variables should appear in your Variables List box. You can type edit to see your data in the data editor.

    To save your data as a STATA file, click on File on the upper toolbar, then select Save As. When you save your file, make sure it is saved as a .dta file. This type of file can only be opened in STATA. Alternatively, you can type the command

    save (filename)

    where (filename) is the directory location and name of your file. If you have a previous version of this saved already, add replace after the save command to overwrite the old version. Because the path in the following example contains a space, it is enclosed in quotation marks:

    save "C:\My Documents\test.dta", replace

    If you wish to save a file that has been previously saved in the same directory location as the previous version, you may use the command save, replace.

    Note: When you save a STATA dataset, you are really only saving the dataset as it exists at the time you chose to save. You are not retaining any of the analysis you may have conducted, such as running regressions or testing for the statistical significance of coefficients. However, if you have changed the data since opening the file, for example by editing observations, these changes will be saved.

    As an exercise, copy the caschool.xls or caschool.xlsx data file from the Stock and Watson website and save the Excel file in some subdirectory on your computer as a .csv file. Then import the data set using the insheet command. Finally, run the simple regression of testscr on str and check that your output contains 420 observations and corresponds to the STATA regression output in the previous section.
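    One possible way to work through this exercise is sketched below. The folder is an assumption (it is the example path used earlier), and the sketch assumes the .csv file contains the variable names testscr and str in its first row.

```stata
* Sketch only: adjust the folder to wherever you saved caschool.csv.
clear
insheet using "C:\Econometrics\StockWatson\caschool.csv"

* Display the number of observations and stop with an error if it is not 420.
count
assert _N == 420

* Simple regression with heteroskedasticity-robust standard errors.
reg testscr str, r

* Keep a STATA-format copy for later sessions.
save "C:\Econometrics\StockWatson\caschool.dta", replace
```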


    ASCII data

    You can also import data from an ASCII file (text file). This assumes that you either saved data from a different source as an ASCII file or that you received data in ASCII file format. The file must be organized with one observation in each row, and the variables in the data set must be in separate columns.

    Using the infile command, type the name of the variable that represents each column, followed by the file name.

    For example, suppose you want to import an ASCII dataset that looks as follows:

    ahe      educ   exper   union   married
    10.75    12     6       1       0
    16.50    16     3       0       0
    ...
    12.10    12     8       1       1

    Each row corresponds to observations on an entity (here an individual). The first column above is the hourly wage, the second is years of education, the third is potential experience, the fourth is a binary variable which equals one if the individual belongs to a union and is zero otherwise, and the last column is another binary variable which takes on the value of one if the individual is married and is zero otherwise.

    To import the data, you type the following command:

    infile ahe educ exper union married using (filename)
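    As a small illustration, the command could be used as shown below. The file name wages.txt is hypothetical, and the sketch assumes the text file contains only the numbers, without a first row of variable names, since infile is told the names on the command line.

```stata
* Sketch only: wages.txt is a hypothetical file in the working directory.
* Assumes the file holds numeric data only (no header row).
clear
infile ahe educ exper union married using "wages.txt"

* Quick checks that the columns were read as intended.
summarize
list in 1/5
```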

    STATA dataset

    Data files that have been saved in STATA format carry the extension .dta.

    To open a dataset that is already saved as a .dta file, you can either go to File and then Open to select your dataset, or you can type the command

    use (filename)

    This will open your dataset in STATA, as long as you have changed your working directory to the location on your computer where the data file is stored. The command to change the working directory is

    cd C:\(location)

    Here are two tricks that will be of help down the road.

    (i) If you are not sure how to type in the location of your data file, just right-click on your Start button and select Explore. Then find your data set. Next, right-click on the data set and choose Properties. A new window opens up. Copy the Location. Return to the Command Window in STATA, type use, and then paste the location. Add \ and the name of the file, including the extension. Then finish the command with , clear.

    Here is an example from my computer:

    use C:\ClaremontLectures\ECON125\STATA\baseb.dta, clear

    (ii) The clear command is very important. It erases previous data, if there was any, from memory. I, and others, have wasted time trying to find errors in programming simply by not clearing memory. Even if you don't understand the reason, the advice is always to include the clear command when you read in a new data set.
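    Putting the working directory and the two tricks together, a session might start as sketched below. The folder name is an assumption taken from the example path used earlier in this tutorial; use the folder where your own copy of caschool.dta is stored.

```stata
* Sketch only: the folder below is an example, not a required location.
cd "C:\Econometrics\StockWatson"

* Show the current working directory to confirm the change.
pwd

* Load the data set, erasing whatever was in memory before.
use caschool.dta, clear
describe
```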

    You can try doing this with the caschool.dta data set from the Stock and Watson website. Simply save that data set on your computer, then double-click on it. This will open STATA with the data already loaded. Obviously this is the easiest method to import data into STATA.

    Regardless of which method you use to import data, it is always a good idea to inspect the data to check if there are some abnormalities. To do this, click on the Data Editor (Browse) button below the drop-down menus.
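    If you prefer typing commands to clicking, the following lines are one way to take a first look at a data set that is already loaded. They are only a suggestion and not part of the textbook's procedure.

```stata
* Assumes a data set is already in memory.
* Open the data in browse (read-only) mode.
browse

* Means, standard deviations, minima and maxima for all variables;
* unexpected minima or maxima often point to data problems.
summarize

* Print the first ten observations in the Results window.
list in 1/10
```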

    g) Multiple Regression Model

    Economic theory most often suggests that the behavior of a certain variable is influenced not only by a single variable, but by a multitude of factors. The demand for a product, e.g., LA Lakers tickets, depends not only on the price of the product but also on the price of other goods, income, taste, etc. Similarly, the Phillips curve suggests that inflation depends not only on the unemployment rate, but also on inflationary expectations and possibly supply shocks, etc.

    An extension of the simple regression model is the multiple regression model, which incorporates more than one regressor (see Equation (6.7) in the textbook on page 192).

    $$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n.$$

    To estimate the coefficients of the multiple regression model, you proceed in a similar way as in the simple regression model. The difference is that you now need to list the additional explanatory variables. In general, the command is:

    reg Y X1 X2 ... Xk, (options)

    where (options) can be omitted (this is the default and gives you homoskedasticity-only standard errors) or can be replaced by various possible entries (e.g., r for heteroskedasticity-robust standard errors).
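    To see the effect of the option, you could run the same regression twice, as sketched here. The choice of regressors (str and el_pct) is only an illustration using variables from caschool.dta; the coefficient estimates are the same in both runs, while the standard errors, t-statistics, and confidence intervals generally differ.

```stata
* Assumes caschool.dta is already loaded.
* Default: homoskedasticity-only standard errors.
reg testscr str el_pct

* Same regression with heteroskedasticity-robust standard errors.
reg testscr str el_pct, r
```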

    See if you can reproduce the following regression output, which corresponds to Column 5 in Table 7.1 of the Stock and Watson (2015) textbook (page 241). The option used below is r, which produces heteroskedasticity-robust standard errors (STATA refers to these as Robust Standard Errors).

    . reg testscr str el_pct meal_pct calw_pct, r

    Linear regression                               Number of obs =      420
                                                    F(  4,   415) =   361.68
                                                    Prob > F      =   0.0000
                                                    R-squared     =   0.7749
                                                    Root MSE      =   9.0843

                 |               Robust
         testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             str |  -1.014353   .2688613    -3.77   0.000    -1.542853   -.4858534
          el_pct |  -.1298219   .0362579    -3.58   0.000     -.201094   -.0585498
        meal_pct |  -.5286191   .0381167   -13.87   0.000    -.6035449   -.4536932
        calw_pct |  -.0478537   .0586541    -0.82   0.415    -.1631498    .0674424
           _cons |   700.3918   5.537418   126.48   0.000      689.507    711.2767

    The interpretation of the coefficients is equivalent to that of a controlled science experiment: each coefficient indicates the effect of a unit change in the relevant variable on the dependent variable, holding all other factors constant (ceteris paribus).

    Section 7.2 of the Stock and Watson (2015) textbook discusses the F-statistic for testing restrictions involving multiple coefficients, the so-called Wald test. To test whether all of the above coefficients are zero with the exception of the intercept, you can use the test command followed by each restriction that you want to test in parentheses (STATA uses the name of the variable associated with the coefficient in combination with the restriction).

    Type

    test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

    STATA will generate the following output:

    . test (str=0) (el_pct=0) (meal_pct=0) (calw_pct=0)

     ( 1)  str = 0
     ( 2)  el_pct = 0
     ( 3)  meal_pct = 0
     ( 4)  calw_pct = 0

           F(  4,   415) =  361.68
                Prob > F =    0.0000

    Note that the F-statistic is identical to the same statistic listed in the regression output.

    See if you can generate the F-statistic of 5.43 following Equation (7.6) in the Stock and Watson (2015) text and listed at the bottom of page 223 (restrict the coefficients of STR and Expn to be zero).

    h) Data Transformations

    So far, we have only used data in regressions that already existed in some file that we either created or used. Almost always, you will be required to transform some of the raw data that you received before you run a regression. In STATA you transform variables by using the gen (as in generate) command. For example, Chapter 8 of the Stock/Watson textbook introduces the polynomial regression model, logarithms, and interactions between variables. Let's reproduce Equations (8.2), (8.11), (8.18), and (8.37) here. The following commands generate the necessary variables (see footnote 2):

    gen avginc2=avginc^2

    gen avginc3=avginc^3

    gen lavginc=log(avginc)

    gen ltestscr=log(testscr)

    gen strpctel=str*el_pct

    Note how the commands and generated variables are displayed in STATA, including those in red when you make a mistake in the command (e.g. genr instead of gen).

    2 For example, I have generated a variable called avginc2, and assigned it to be the square of the previously defined variable avginc. Note that I am generating variable names that are self-explanatory. They could have been called variable1, variable2, variable3, etc., but it is a good idea to create variable names that you can remember.
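    Before using newly generated variables in a regression, it can be worth checking that they contain what you intended. The lines below are a suggested check, not part of the textbook procedure; they assume the five gen commands for avginc2, avginc3, lavginc, ltestscr, and strpctel have already been run on caschool.dta. Note that log() in STATA is the natural logarithm.

```stata
* Assumes caschool.dta is loaded and the gen commands have been executed.
* Summary statistics for the original and the transformed variables.
summarize avginc avginc2 avginc3 lavginc ltestscr strpctel

* Look at a few observations side by side.
list avginc avginc2 lavginc in 1/5

* Generated variables are stored as floats, so compare with a small
* relative tolerance instead of testing for exact equality.
assert reldif(avginc2, avginc^2) < 1e-6
```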

    Next run the four regressions using the same technique as for multiple regression analysis.

    Finally save your workfile again and exit STATA.

    Exercise

    One of the problems with the type of tutorial you are working on is that you just follow instructions without internalizing them. A typical student will finish the tutorial with few problems but then little is retained. If I asked you to retrieve a data set and to run a few regressions, for example, would you be able to do that? Or would you say "how do I do this?"

    Let's see how much you understood. Go to the Stock and Watson website for the 3rd edition (http://www.pearsonhighered.com/stock_watson). Enter the Student Resources in the Companion Web Site, and download the CPS data set for Chapter 8 (Data Sets for Replicating Empirical Results: CPS Data used in Chapter 8). Next open it in STATA (see footnote 3).

    3 Note for STATA 12 users: if you just double click on the cps_ch8.dta file, an error message will occur that tells you that insufficient memory was allocated. Before you open the cps_ch8.dta file, increase your memory (usually set at 1 MB by default). You can do this by typing in the command

    set mem 10m

    which increases the memory to 10 megabytes. In general, make sure to set the memory large enough to handle the data set, but small enough for your computer to handle the program (use k for kilobyte, m for megabyte, and g for gigabyte). (STATA 12 and later manage memory automatically and ignore this setting, so the step should only be needed in older versions.)

    Then replicate the results for column (1) from Table 8.1 on page 288 of the Stock and Watson (2015) textbook.

    Why do you think your results differ from those listed in the table? What if you found a way to restrict your sample to only include individuals who are at least 30 but not older than 64? To find a way to restrict your sample, look for Help and the if command. Then, restricting your sample to those individuals in that age group, replicate columns (1) to (3). For column (4), define potential experience as the Mincer experience variable (age - Years of Education - 6).

    Batch Files

    So far, you have either clicked on buttons in STATA or used the Command Window to type executable statements (commands one by one, or line by line). But what if you wanted to keep a permanent record of all the transformations you made, regressions you tried, graphs you created, etc.? In that case, you would need to create a program that consists of a list of line commands similar to those that you used in the Command Window previously. After having created such a program, which is a text or ASCII file, you can then execute (run) it and view the output afterwards (if the program did not contain any errors). Batch files can also include loops and conditional branching (if you don't know what these are, not to worry).

    Batch files in STATA are called Do-Files.
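    For the curious, here is a minimal sketch of what a loop and a conditional branch can look like inside a Do-File. It is not needed for the rest of this tutorial. It assumes caschool.dta is in the working directory, and the threshold of 100 is arbitrary and used only for illustration.

```stata
* Sketch only: assumes caschool.dta is in the current working directory.
use caschool.dta, clear

* Loop: repeat the same commands for each variable in the list.
foreach v of varlist str testscr {
    quietly summarize `v'
    display "`v': mean = " r(mean)

    * Conditional branch: only executed when the condition is true.
    if r(mean) > 100 {
        display "  (mean is above 100)"
    }
}
```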


    Using STATA in batch mode has two important advantages over using STATA interactively:

    the Do-File provides an audit trail for your work. The file provides an exact record of each STATA command;

    even the best computer programmers will make typing or other errors when using STATA. When a command contains an error, it won't be executed by STATA, or worse, it will be executed but produce the wrong result. Following an error, it is often necessary to start the analysis from the beginning. If you are using STATA interactively, you must retype all of the commands. If you are using a Do-File, then you only need to correct the command containing the error and rerun the file.

    Let's create such a program. Click on the New Do-File Editor button. This opens the STATA Do-File Editor box.

    Type in the following commands exactly as they appear.

    log using \statafiles\stata1.log, replace
    use \statafiles\caschool.dta
    describe
    generate income = avginc*1000
    summarize income
    log close
    exit

    Here is the meaning of the seven lines of this program:

    Line 1: This is an administrative command that tells STATA where to store the results of your analysis. STATA output files are called log files. The current line tells STATA to open a log file called stata1.log (you could have used any name, such as love_metrix.log; the word stata1 is not required here). If there is already a file with the same name in the folder, STATA is instructed to replace it. Before you save the Do-File, replace the path in this line with the relevant path on the computer you are using.

    Line 2: This line concerns the data set. As you learned earlier in the tutorial, datasets in STATA are called dta files. The dataset which you will use here is caschool.dta, which you downloaded earlier. The current line tells STATA the location and name of the dataset to be used for the analysis. Before you save the Do-File, replace the path in this line with the path of the location where you saved caschool.dta.

    Line 3: This line also concerns the data set. It tells STATA to describe the dataset (a shorter version of the command is des instead of describe). This command produces a list of the variable names and any variable descriptions stored in the data set.

    Line 4: This line tells STATA to create a new variable called income (a shorter version of the command is gen instead of generate). The new variable is constructed by multiplying the variable avginc by 1000. The variable avginc is contained in the dataset and is the average household income in a school district expressed in thousands of dollars. The new variable income will be the average household income expressed in dollars instead of thousands of dollars.

    Line 5: This line tells STATA to compute some summary statistics (a shorter version of the command is sum instead of summarize). STATA will produce the mean, standard deviation, etc.

    Line 6: This line closes the file stata1.log, which contains the output.

    Line 7: This line tells STATA that the program has ended.

    As long as you have replaced the path in line 1 and line 2 with the relevant paths from the computer you are working on, and if you downloaded/saved the California Test Score Data Set, then we are good to go. Save the Do-File, using the .do suffix. Next execute this Do-File by first opening STATA on your computer. Next, click on the File menu, then Do, and then select the stata1.do file you just saved. This will run or execute the program.

    (Alternatively, you can run the program, or even just part of the program, by hitting the Execute (do) button in the Do-file Editor.)
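    A Do-File can also be started from the Command Window with the do command. The sketch below assumes, purely for illustration, that stata1.do was saved in a folder called C:\statafiles and that the log file is written to the same folder; substitute your own location.

```stata
* Sketch only: C:\statafiles is an assumed folder; use your own.
cd "C:\statafiles"

* Execute every command in the Do-File, echoing results to the screen.
do stata1.do

* Print the saved log file in the Results window.
type stata1.log
```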

    You will be able to see the program being executed in the Results Window. Since the execution will not fit into one screen, you can scroll up and see everything that happened during the run. Sometimes (although not here) you may see that the program execution pauses, and that

    --more--

    is displayed at the bottom of the Results Window. If this happens, push any key on the keyboard and execution will continue.

    To exit STATA, click on the usual exit button at the top right of STATA (alternatively click on File and then Exit). STATA will ask you if you really want to exit, and you will respond Yes.

    Your output has been saved in stata1.log and you can look at it by opening the file with any text editor (Notepad, for example) or in Word/WordPerfect. Here is what you should see:

    -------------------------------------------------------------------------------
          name:  <unnamed>
           log:  yourpathhere
      log type:  text
     opened on:  yourdateandtimehere

    . use C:\yourpathhere

    . describe

    Contains data from C:\yourpathhere\caschool.dta
      obs:           420
     vars:            13                          yourdatehere
     size:        20,160
    -------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -------------------------------------------------------------------------------
    enrl_tot        int     %8.0g
    teachers        float   %8.0g
    calw_pct        float   %8.0g
    meal_pct        float   %8.0g
    computer        int     %8.0g
    testscr         float   %8.0g
    comp_stu        float   %8.0g
    expn_stu        float   %8.0g
    str             float   %8.0g
    avginc          float   %8.0g
    el_pct          float   %8.0g
    read_scr        float   %8.0g
    math_scr        float   %8.0g
    -------------------------------------------------------------------------------
    Sorted by:

    . generate income = avginc*1000

    . summarize income

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
          income |       420    15316.59     7225.89       5335      55328

    . log close
          name:  <unnamed>
           log:  C:\yourpathhere\stata1.log
      log type:  text
     closed on:  yourdateandtimehere
    -------------------------------------------------------------------------------

    You now have an initial idea of how to work with Do-Files in STATA. The rest of this part of the tutorial will guide you through further commands and make the initial Do-File more complex.

    I suggest that you continue to work with the batch file you just created and add new lines to this program (if you use the .pdf version of this tutorial or have printed the tutorial using a color printer, then the new commands will appear in red).

    #delimit ;
    *******************************************;
    * Administrative Commands;
    *******************************************;
    set more off;
    clear;
    log using C:\statafiles\stata1.log, replace;
    *******************************************;
    * Read in the data set;
    *******************************************;
    use C:\statafiles\caschool.dta;
    des;
    *******************************************;
    * Transform Data and Create New Variables;
    *******************************************;
    * Construct Average District Income in $s;
    *******************************************;
    gen income = avginc*1000;
    *******************************************;
    * carry out statistical analysis;
    *******************************************;
    * summary statistics for Income;
    *******************************************;
    sum
    income;
    *******************************************;
    * end of program;
    *******************************************;
    log close;
    exit;

    The new version of the Do-File carries out exactly the same calculations as before. However, it uses four features of STATA for more complicated analysis. The first new command is

    #delimit ;

    This command tells STATA that each STATA command ends with a semicolon. If STATA does not see a semicolon at the end of the line, then it assumes that the command carries over to the following line. This is useful because complicated commands in STATA are often too long to fit on a single line. (Make sure to place a ; at the end of the seven old commands.) The above Do-File contains an example of a STATA command written on two lines: near the bottom of the file you see the command sum income written on two lines. STATA combines these two lines into one command because the first line does not end with a semicolon. While two lines are not necessary for this command, some STATA commands can get quite long, so it is good to get used to employing this feature.

    A word of warning: if you use the #delimit ; command, it is critical that you end each command with a semicolon. Forgetting the semicolon on even a single line means that the Do-File will not run properly (again, don't forget to add the seven ; in the first version of the program).
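    The following short Do-File is a sketch of how the semicolon delimiter behaves with a longer command. The regression is just an example using variables from caschool.dta, which is assumed to be in the working directory. The line #delimit cr switches back to the default of one command per line. The #delimit command is meant for Do-Files and cannot be typed into the Command Window.

```stata
* Sketch only: save these lines in a Do-File and run it with do.
#delimit ;
use caschool.dta, clear;

* One command spread over two lines; it ends at the semicolon;
reg testscr str el_pct
    meal_pct calw_pct, r;

#delimit cr
* From here on, each line is a complete command again.
sum testscr
```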

    The second new feature of the above Do-File is that many of the lines begin with an asterisk. STATA ignores the text that comes after *, so that these lines can be used for comments or to describe what the commands that follow are doing. Note that each of these lines ends with a semicolon. Without the semicolon, STATA would include the next line as part of the text description.

    A final new feature in the program is the command

    set more off

    This command eliminates the need to hit a key on your keyboard in the case when STATA fills the Results Window and stops displaying further results (the --more-- message would appear).

    Run the program and have a look at the new log file.

    Next, change the previous version of the Do-File by adding commands until the new version looks as follows (again, new commands can be seen in red if your tutorial displays colors):

    #delimit ;
    *********************************************************;
    *Administrative Commands;
    *********************************************************;
    set more off;
    clear;
    log using C:\STATA\stata1.log, replace;
    *********************************************************;
    *Read in the Dataset;
    *********************************************************;
    use C:\STATA\caschool.dta;
    des;
    *********************************************************;
    *Transform Data and Create New Variables;
    *********************************************************;
    ***** Construct Average District Income in $s;
    gen income = avginc*1000;
    ***** Define variables for subset of data;
    gen testscr_lo = testscr if (str<20);
    gen testscr_hi = testscr if (str>=20);
    *********************************************************;
    *Carry Out Statistical Analysis;
    *********************************************************;
    ***** Summary Statistics for Income;
    sum
    income;
    sum testscr;
    ttest testscr=0;
    ttest testscr_lo=0;
    ttest testscr_hi=0;
    ttest testscr_lo=testscr_hi, unequal unpaired;
    *********************************************************;
    *Repeat the Analysis using STR = 19;
    *********************************************************;
    replace testscr_lo=testscr if (str<19);
    replace testscr_hi=testscr if (str>=19);
    ttest testscr_lo=testscr_hi, unequal unpaired;
    *********************************************************;
    *End of Program;
    *********************************************************;
    log close;
    exit;


    There are three new features in this new version.

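    1) New variables are created using only a portion of the dataset. Two of the variables in the dataset are testscr (the average test score in a school district) and str (the district's average class size or student-teacher ratio). The STATA command

    gen testscr_lo = testscr if (str<20)

    creates a new variable testscr_lo that equals testscr for districts with a student-teacher ratio below 20 and is missing for all other districts. The variable testscr_hi is created in the same way for districts with a student-teacher ratio of 20 or more.

    2) The second new feature is the ttest command, which is used in two forms. The first form is

    ttest testscr=0

    which produces the following output:

    . ttest testscr=0

    One-sample t test

    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
     testscr |     420    654.1565    .9297082    19.05335    652.3291     655.984

        mean = mean(testscr)                                          t = 703.6149
    Ho: mean = 0                                     degrees of freedom =      419

        Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
     Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

    The output contains the t-statistic for the hypothesis that the population mean of testscr is zero, together with the 95% confidence interval for the population mean. (In this example, the t-test that the population mean of test scores is equal to zero is not really of interest, but the confidence interval for the mean is what we are looking for in this example.) The same command is then used for testscr_lo and testscr_hi (see Sections 3.2 and 3.3 in Stock and Watson (2015)).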

    The second form of the command is

    ttest testscr_lo=testscr_hi, unequal unpaired

    Executing this statement will test the hypothesis that testscr_lo and testscr_hi come from populations with the same mean. That is, the command computes the t-statistic for the null hypothesis that the (population) mean of test scores for districts with class sizes less than 20 students is the same as the mean of test scores for districts with class sizes of 20 or more students. The command uses two options that are listed after the comma in the command. These options are unequal and unpaired. The option unequal tells STATA that the variances in the two populations may not be the same. The option unpaired tells STATA that the observations are for different districts, that is, these are not panel data representing the same entity at two different time periods (see Section 3.4 in Stock and Watson (2015)).

    . ttest testscr_lo=testscr_hi, unequal unpaired

    Two-sample t test with unequal variances

    Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
    testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
    testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
    ---------+--------------------------------------------------------------------
    combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
    ---------+--------------------------------------------------------------------
        diff |             7.37241    1.823689                3.787296    10.95752

        diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
    Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000

    3) A third new feature in the Do-File is the command replace. This appears near the bottom of the file. Here, the analysis is to be carried out again, but using 19 as the cutoff for small classes. Since the variables testscr_lo and testscr_hi already exist (they were defined by the gen command earlier in the program), STATA cannot generate variables with the same name. Instead, the command replace is used to replace the existing series with new series. In essence, the command instructs the program to overwrite the previously stored data.
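    The sketch below illustrates the subsample variables and one point to keep in mind when re-defining them. A replace command with an if condition only changes the observations that satisfy the condition, so districts with a student-teacher ratio between 19 and 20 would keep their earlier testscr_lo value. Using the cond() function, as shown here, is one way to make sure the remaining observations are set to missing; it is offered as a suggestion and is not the method used in the Do-File above. The sketch assumes caschool.dta is loaded, that testscr_lo and testscr_hi do not exist yet, and that str has no missing values.

```stata
* Assumes caschool.dta is loaded and the two variables are not yet defined.
gen testscr_lo = testscr if (str < 20)
gen testscr_hi = testscr if (str >= 20)

* Number of districts in each group; compare with the Obs column
* of the two-sample t test output in this tutorial (238 and 182).
count if !missing(testscr_lo)
count if !missing(testscr_hi)

* New cutoff of 19: cond() returns testscr inside the group and
* missing (.) outside it, so no old values are left behind.
replace testscr_lo = cond(str < 19, testscr, .)
replace testscr_hi = cond(str >= 19, testscr, .)

ttest testscr_lo == testscr_hi, unequal unpaired
```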

    You are now ready to execute (run) the program as done before.

    As before, change the previous version of the Do-File by adding commands until the newversion looks as follows (again, new commands can be seen in red if your tutorial displayscolors):

    #delimit ;
    *********************************************************;
    *Administrative Command;
    *********************************************************;
    set more off;
    clear;
    log using \statafiles\stata1.log, replace;
    *********************************************************;
    *Read in the Dataset;
    *********************************************************;
    use \statafiles\caschool.dta;
    des;
    *********************************************************;
    *Transform Data and Create New Variables;
    *********************************************************;
    ***** Construct Average District Income in $s;
    gen income = avginc*1000;
    ***** Define variables for subset of data;
    gen testscr_lo = testscr if (str<20);
    gen testscr_hi = testscr if (str>=20);
    *********************************************************;
    *Carry Out Statistical Analysis;
    *********************************************************;
    ***** Summary Statistics for Income;
    sum income;
    *********************************************************;
    ***** Table 4.1 *****;
    *********************************************************;
    sum str testscr, detail;
    *********************************************************;
    ***** Figure 4.2 *****;
    *********************************************************;
    twoway scatter testscr str || lfit testscr str;
    *********************************************************;
    ***** Correlation *****;
    *********************************************************;
    cor str testscr;
    *********************************************************;
    ***** Equation 4.11 and 5.8 *****;
    *********************************************************;
    reg testscr str, robust;
    *********************************************************;
    ***** Equation 5.18 *****;
    gen d = (str


    sum testscr;
    ttest testscr=0;
    ttest testscr_lo=0;
    ttest testscr_hi=0;
    ttest testscr_lo=testscr_hi, unequal unpaired;
    *********************************************************;
    *Repeat the Analysis using STR = 19;
    *********************************************************;
    replace testscr_lo=testscr if (str<19);
    replace testscr_hi=testscr if (str>=19);
    ttest testscr_lo=testscr_hi, unequal unpaired;
    *********************************************************;
    *End of Program;
    *********************************************************;
    log close;
    exit;

    The new commands reproduce some of the empirical results shown in Chapters 4 and 5 of Stock and Watson (2015). There are several features of STATA included in the new commands which have not been used in the previous examples:

    1) The summarize command (sum) now includes the option detail, which provides more detailed summary statistics. The command is written as

    sum str testscr, detail

    This command tells STATA to compute summary statistics for the two variables str and testscr. The option detail produces detailed summary statistics that include, for example, the percentiles that are reported in Table 4.1 on p. 113 of Stock and Watson (2015).

    2) The command

    twoway scatter testscr str || lfit testscr str

    constructs a scatterplot of testscr versus str and includes the estimated regression line for the simple regression of the California Test Score Data Set, shown on p. 116 of Stock and Watson (2015).

    3) The command

    cor str testscr

    tells STATA to compute the correlation between the student-teacher ratio and test scores.

    4) Next you will reproduce equations (4.11) and (5.8) in Stock and Watson (2015) by using the regress (or, for short, reg) command:

    reg testscr str, r

    instructs STATA to run an OLS regression with testscr as the dependent variable and str as the regressor. The robust (short r) option tells STATA to use the heteroskedasticity-robust formulas for the standard errors of the regression coefficient estimators. Omitting this option results in the display of homoskedasticity-only standard errors.

    5) The final innovation over the previous version of the Do-File is contained in the two commands following the line Equation 5.18. First, a binary (sometimes referred to as dummy or indicator) variable d is created using the STATA command

    gen d = (str


    gen testscr_lo = testscr if (str<20);
    gen testscr_hi = testscr if (str>=20);
    *********************************************************;
    *Carry Out Statistical Analysis;
    *********************************************************;
    ***** Summary Statistics for Income;
    sum income;
    *********************************************************;
    ***** Table 4.1 *****;
    *********************************************************;
    sum str testscr, detail;
    *********************************************************;
    ***** Figure 4.2 *****;
    *********************************************************;
    twoway scatter testscr str || lfit testscr str;
    *********************************************************;
    ***** Correlation *****;
    *********************************************************;
    cor str testscr;
    *********************************************************;
    ***** Equation 4.11 and 5.8 *****;
    *********************************************************;
    reg testscr str, r;
    *********************************************************;
    ***** Equation 5.18 *****;
    gen d = (str


    ttest ts_lostr=ts_histr if elq4==1, unp une;
    *********************************************************;
    * Equation 7.5 ;
    *********************************************************;
    reg testscr str el_pct, r;
    *********************************************************;
    * Equation 7.6 ;
    *********************************************************;
    replace expn_stu = expn_stu/2000;
    reg testscr str expn_stu el_pct, r;
    *********************************************************;
    * Display Variance-Covariance Matrix;
    *********************************************************;
    vce;
    *********************************************************;
    * F-test reported in text;
    *********************************************************;
    test str expn_stu;
    *********************************************************;
    * Correlations reported in text;
    *********************************************************;
    cor testscr str expn_stu el_pct meal_pct calw_pct;
    *********************************************************;
    *Table 7.1, Column(1);
    *********************************************************;
    reg testscr str, r;
    display "adjusted Rsquared = " e(r2_a);
    * Column (2);
    reg testscr str el_pct, r;
    display "adjusted Rsquared = " e(r2_a);
    * Column (3);
    reg testscr str el_pct meal_pct, r;
    display "adjusted Rsquared = " e(r2_a);
    * Column (4);
    reg testscr str el_pct calw_pct, r;
    display "adjusted Rsquared = " e(r2_a);
    * Column (5);
    reg testscr str el_pct meal_pct calw_pct, r;
    display "adjusted Rsquared = " e(r2_a);
    *********************************************************;
    * Appendix rule of thumb F-Statistic;
    *********************************************************;
    reg testscr str expn_stu el_pct;
    test str expn_stu;
    reg testscr el_pct;
    *********************************************************;
    *End of Program;
    *********************************************************;
    log close;
    exit;

    The file produces several of the empirical results from Chapter 7 of Stock and Watson (2015). As before, some commands have been abbreviated when there is no possibility of confusion.


    The file uses abbreviations for STATA commands throughout (generate becomes gen, regress turns into reg, etc.).

    In essence there are two new commands:

    1) The first new command involves the testing of restrictions in equation 7.6 (page 221 of Stock and Watson (2015)). The command

    reg testscr str expn_stu el_pct, r

    instructs STATA to compute the regression. The command vce asks STATA to print out the estimated variances and covariances of the estimated regression coefficients. The command

    test str expn_stu

    gets STATA to carry out the joint test that the coefficients on str and expn_stu are both equal to zero.

    2) The second new command is in the analysis of Table 7.1 on page 241 of Stock and Watson (2015). When STATA computes an OLS regression, it computes the adjusted R-squared ($\bar{R}^2$) as described in Section 6.4, page 197 of Stock and Watson (2015). However, STATA does not display all of the results it computes, including the adjusted R-squared (when the r option is invoked). The command

    display "Adjusted Rsquared = " e(r2_a)

    instructs STATA to print out (display) the adjusted R-squared. Whatever appears between the two quotation marks (" ") will be displayed in your output (you did not have to display the words Adjusted Rsquared but could have chosen anything else, such as My Measure of Fit). However, e(r2_a) tells STATA where to retrieve the stored result from and cannot be changed. The adjusted R-squared is not the only statistic that STATA stores and does not display. You can use the Help function or look in the Reference volume under Saved Results for the reg command to find other statistics.
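    The stored-result mechanism described above can be tried out with a short sketch. It assumes the auto dataset that ships with STATA (used here only so the example runs without caschool.dta); price, weight, and length are stand-ins for the tutorial's variables, and r2a_manual is a hypothetical scalar name. The by-hand line uses the usual adjusted R-squared formula, 1 - (1 - R-squared)(n - 1)/(n - k - 1), where e(df_r) holds n - k - 1.

```stata
* Sketch only: auto.dta is an assumption standing in for caschool.dta
sysuse auto, clear
regress price weight length, robust

* Show everything regress has stored, including e(r2_a)
ereturn list

* Retrieve the stored adjusted R-squared
display "Adjusted Rsquared = " e(r2_a)

* Recompute it by hand from other stored results
scalar r2a_manual = 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)
display "By hand           = " r2a_manual
```

    The two displayed numbers should agree, which is a quick way to convince yourself of what e(r2_a) contains.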

    Other Examples of Do-Files

    You will find other examples of Do-Files on the accompanying Web site for the Stock and Watson (2015) econometrics textbook. You can download STATA Do-Files from there to reproduce all of the analysis in Chapters 3-13. You will also find a STATA Do-File for the time series chapters 14-16 there. STATA programming for time series is somewhat more complicated than for cross-sectional or panel data. EViews and RATS are econometric programs specifically designed for time series data, and the Web site contains EViews and RATS programs for Chapters 14-16, as well as a tutorial for EViews.

    3. A SUMMARY OF SELECTED STATA COMMANDS

    This section lists several of the most useful STATA commands. Many of these commands have options. For example, the command summarize has the option detail and the command regress has the option robust. In the descriptions below, options are shown in square brackets [ ]. Many of these commands have several options and can be used in many different ways. The descriptions below show how these commands are commonly used. Other uses and options can be found in STATA's Help menu and in the other sources listed at the beginning of this tutorial.

    The list of commands provided here is a small fraction of the commands in STATA, but these are the important commands that you will need to get started for your econometrics course. You should extend the list or create your own in addition to what is listed here.

    Administrative Commands

    #delimit

    sets the character that marks the end of a command. For example, the command #delimit ; tells STATA that all commands will end with a semicolon. This command is used in Do-Files.

    clear

    deletes/erases all variables from the current STATA session.

    exit

    in a Do-File, the command tells STATA that the program has ended. If you type exit in the STATA Command Window, then STATA will close.

    log

    controls STATA log files, which are where STATA writes output. There are two common uses of this command:

    log using filename [, append replace]. This opens the file given by filename as a log file for STATA output. The options append and replace are used when there is already a file with the same name. With append, STATA will append the output to the bottom of the existing file. With replace, STATA will replace the existing file with the new output file.

    log close. This closes the current log file.
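    The two uses of log can be combined in a minimal sketch. The file name example_output.log is hypothetical, and the auto dataset that ships with STATA is assumed only so the example runs on its own; the leading capture log close simply guards against a log that is still open from an earlier run.

```stata
* Sketch only: example_output.log is a hypothetical file name
capture log close

* Open a log file, overwriting any earlier file of the same name
log using example_output.log, replace

* Anything shown in the Results Window from here on is written to the log
sysuse auto, clear
summarize price

* Close the log file
log close
```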

    set mat #


    sets the maximum number of variables that can be used in a regression. The default maximum is 40. If you have a huge number of observations and want to run a regression with 45 variables, then you will need to use this command with # set to a number greater than 45.

    set memory #m

    is used in Windows and Unix versions of STATA to set the amount of memory used by the program. For details, see the discussion within the tutorial.

    set more off

    tells STATA not to pause and display the -more- message in the Results Window.

    Data Management

    describe

    describes the contents of data in memory or on disk. A related command is describe using filename, which describes the dataset in filename.

    drop list of variables

    this deletes/erases the variables in list of variables from the current STATA session. For example, drop str testscr will delete the two variables str and testscr.

    keep list of variables

    deletes/erases all of the variables from the current STATA session except those in list of variables. In other words, it keeps the variables in the list and drops everything else. For example, keep str testscr will keep the two variables str and testscr and delete all of the other variables in the current STATA session.

    list list of variables

    tells STATA to print all of the observations for the variables listed in list of variables.

    save filename [, replace]

    tells STATA to save the dataset that is currently in memory as a file with name filename. The option replace tells STATA that it may replace any other file with the name filename.

    use filename

    tells STATA to load a dataset from the file filename.

    Transforming and Creating New Variables

    New variables are created using the command generate, and existing variables are modified using the command replace.


    Examples:

    generate newts = testscr/100

    creates a new variable called newts that is constructed as the variable testscr divided by 100.

    replace testscr = testscr/100

    changes the variable testscr so that all observations are divided by 100.

    You can use the standard arithmetic operations of addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^) in generate/replace commands. For example,

    generate ts_squared = testscr*testscr

    creates a new variable ts_squared as the square of testscr. (This could also have been accomplished by using the command gen ts_squared = testscr^2.)
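    The division of labor between generate and replace can be seen in a short sketch. It assumes the auto dataset that ships with STATA, with weight standing in for testscr; the names wt_thousands and wt_sq are hypothetical. The capture prefix is used so that the deliberately failing second generate does not stop the run.

```stata
* Sketch only: auto.dta and the new variable names are assumptions
sysuse auto, clear

* Create new variables with arithmetic operators
generate wt_thousands = weight/1000
generate wt_sq = weight^2

* generate refuses to overwrite an existing variable ...
capture generate wt_sq = weight*weight
display "return code from the second generate: " _rc

* ... so replace is used to overwrite it instead
replace wt_sq = weight*weight
```

    A nonzero return code from the second generate signals that the variable already exists, which is exactly the situation replace is meant for.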

    You can also use relational operators to construct binary variables. For example, in the fourth batch file, the following command was included:

    gen d = (str


    Statistical Operations

    cor list of variables

    tells STATA to compute the correlation between each of the variables in list of variables.

    twoway scatter var1 var2 || lfit var1 var2

    produces a scatter plot of var1 on the Y-axis and var2 on the X-axis. If the || lfit part is included, then the fitted OLS line is also displayed.

    predict newvarname [, residuals]

    when this command follows the regress command, the OLS predicted values or residuals are calculated and stored under the name newvarname. When the option residuals is used, the residuals are computed; otherwise the predicted values are computed and placed into newvarname.

    Example:

    reg testscr str expn_stu el_pct, r

    predict tshat

    predict uhat, residuals

    Here, testscr is regressed on str, expn_stu, el_pct (first command); the fitted values are saved and stored under the name tshat (second command), and the residuals are saved under the name uhat (third command).
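    As a sanity check on what predict stores, the residuals should equal the dependent variable minus the fitted values. The sketch below assumes the auto dataset that ships with STATA in place of caschool.dta; price_hat, u_hat, and check are hypothetical variable names.

```stata
* Sketch only: auto.dta and the new variable names are assumptions
sysuse auto, clear
regress price weight length, robust

* Fitted values (default) and residuals (option residuals)
predict price_hat
predict u_hat, residuals

* Residual = actual value minus fitted value
generate check = price - price_hat

* The two columns should match up to rounding
summarize u_hat check
```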

    regress depvar list of variables [if expression] [,robust noconstant]

    carries out an OLS regression of the variable depvar on list of variables. When if expression is used, then the regression is estimated using observations for which expression is true. The option robust tells STATA to use the heteroskedasticity-robust formula for the standard errors of the coefficient estimators. The option noconstant tells STATA not to include a constant (intercept) in the regression.

    Examples:

    reg testscr str, r

    reg testscr str expn_stu el_pct, r
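    To see what the robust option changes, the same regression can be run with and without it. This is a sketch that assumes the auto dataset shipped with STATA (price and weight stand in for testscr and str); the scalar names are hypothetical. The coefficient estimate is the same in both runs, and only the standard error differs.

```stata
* Sketch only: auto.dta and the scalar names are assumptions
sysuse auto, clear

* Homoskedasticity-only standard errors (no option)
regress price weight
scalar b_plain  = _b[weight]
scalar se_plain = _se[weight]

* Heteroskedasticity-robust standard errors
regress price weight, robust
scalar b_rob  = _b[weight]
scalar se_rob = _se[weight]

* Same coefficient, different standard errors
display "coefficient: " b_plain "  vs  " b_rob
display "std. error:  " se_plain "  vs  " se_rob
```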

    summarize [list of variables] [, detail]

    computes summary statistics. If the command is used without a list of variables, then summary statistics are computed for all of the variables in the dataset. If the command is used with a list of variables, then summary statistics are computed for all variables in the list. If the option detail is used, more detailed summary statistics (including percentiles) are computed.

    Examples:

    sum testscr str

    computes summary statistics for the variables testscr and str.

    sum testscr str, detail

    computes detailed summary statistics for the variables testscr and str.

    test

    this command is used to test hypotheses about regression coefficients. It can be used to test many types of hypotheses. The most common use of this command is to carry out a joint test that several coefficients are equal to zero. Used this way, the form of the command is test list of variables, where the test is to be carried out on the coefficients corresponding to the variables given in list of variables.

    Example:

    reg testscr str expn_stu el_pct, r

    test str expn_stu

    Here testscr is regressed on str, expn_stu, and el_pct (first command), and a joint test of the hypothesis that the coefficients on str and expn_stu are both equal to zero is carried out (second command).
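    The joint test also leaves its results behind for later use. The sketch below assumes the auto dataset that ships with STATA, with weight, length, and mpg standing in for str, expn_stu, and el_pct; after test, the F-statistic and its p-value can be retrieved from r(F) and r(p).

```stata
* Sketch only: auto.dta is an assumption standing in for caschool.dta
sysuse auto, clear
regress price weight length mpg, robust

* Joint test that the coefficients on weight and length are both zero
test weight length

* Retrieve the F-statistic and p-value stored by test
display "F = " r(F) "   p-value = " r(p)
```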

    ttest

    this command is used to test a hypothesis about the mean or the difference between two means. The command has several forms. Here are a few:

    ttest varname = # [if expression] [, level(#)]

    Here you test the null hypothesis that the population mean of the series varname is equal to #. When if expression is used, then the test is computed using observations for which expression is true. The option level(#) is the desired level of the confidence interval. If this option is not used, then a confidence level of 95% is used.

    Examples:

    ttest testscr = 0;


    tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 95% confidence interval.

    ttest testscr = 0, level(90);

    tests the null hypothesis that the population mean of testscr is equal to 0 and computes a 90% confidence interval.

    ttest testscr = 0 if (str


    tutorial is not intended to replace the Reference or User's Guide. The best way to learn how to use the program is to spend some time exploring and working with it. For a nice visual introduction to the manuals, go to www.youtube.com/embed/xWJTFtWhQc4.

    STATA replication batch files for all the results in the Stock/Watson textbook are available from the Web site. You are invited to download these and study them.

    There are many other tutorials on STATA available to you on the internet. If you prefer a visual one, then www.ats.ucla.edu/stat/stata/notes/default.htm might be a good one to look at. STATA has its own YouTube series, which you can find at www.stata.com/links/video-tutorials/.