Stata Tutorial 1

An Introduction to Stata


F. PERACCHI
Faculty of Economics, Tor Vergata University, Rome, Italy


Contents

Introduction

1 Getting Started
   1.1 STARTING AND STOPPING STATA
      1.1.1 THE STATA WINDOWS
      1.1.2 THE STATA TOOLBAR
      1.1.3 ALLOCATING MEMORY TO STATA
   1.2 STATA DOCUMENTATION AND UPDATES
      1.2.1 THE HELP SYSTEM
      1.2.2 THE REFERENCE MANUAL
      1.2.3 THE STATA TECHNICAL BULLETIN
      1.2.4 TUTORIALS
      1.2.5 STATA UPDATES
   1.3 VARIABLES AND OBSERVATIONS
      1.3.1 VARIABLES
      1.3.2 OBSERVATIONS
   1.4 INPUTTING DATA
      1.4.1 DIRECT TYPING
      1.4.2 THE DATA EDITOR
      1.4.3 LOADING AN ASCII (TEXT) DATA FILE
      1.4.4 LOADING A STATA DATA FILE
   1.5 BASIC DATA MANIPULATION
      1.5.1 DISPLAYING DATA
      1.5.2 LABELING DATA
      1.5.3 SUMMARIZING DATA
      1.5.4 CREATING NEW VARIABLES
      1.5.5 CHANGING AND RENAMING VARIABLES
      1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS
      1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET
   1.6 OUTPUTTING DATA
   1.7 LOG FILES

2 Stata Commands
   2.1 GENERAL SYNTAX
      2.1.1 BY
      2.1.2 WEIGHTS
      2.1.3 IF AND IN
      2.1.4 QUIETLY AND NOISILY
   2.2 BASIC DATA COMMANDS
      2.2.1 DESCRIBE
      2.2.2 LIST
      2.2.3 DROP AND KEEP
      2.2.4 GENERATE AND EGEN
      2.2.5 REPLACE
      2.2.6 SORT AND GSORT
   2.3 COMBINING DATA
      2.3.1 APPEND
      2.3.2 MERGE
   2.4 RESHAPING DATA
      2.4.1 COLLAPSE
      2.4.2 CONTRACT
      2.4.3 EXPAND
      2.4.4 FILLIN
      2.4.5 RESHAPE
   2.5 BASIC SAMPLE STATISTICS
      2.5.1 COUNT
      2.5.2 SUMMARIZE
      2.5.3 MEANS
      2.5.4 CENTILE
      2.5.5 CUMUL
      2.5.6 CORRELATE
      2.5.7 REGRESS
   2.6 TABLES
      2.6.1 TABLE
      2.6.2 TABULATE
      2.6.3 TABSUM

3 Graphics
   3.1 BASIC SYNTAX AND GRAPHIC STYLES
   3.2 COMMON GRAPH OPTIONS
   3.3 HISTOGRAMS
   3.4 TWO-WAY SCATTERPLOTS
   3.5 TWO-WAY SCATTERPLOT MATRICES
   3.6 BOX PLOTS

4 Programming and Matrix Commands
   4.1 PROGRAMMING STATA
      4.1.1 MACROS
      4.1.2 SYSTEM MACROS
      4.1.3 LOOPING
      4.1.4 BRANCHING
      4.1.5 PROGRAM ARGUMENTS
      4.1.6 TEMPORARY OBJECTS
      4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS
   4.2 DO FILES
   4.3 ADO FILES
   4.4 MATRIX COMMANDS
      4.4.1 ROW AND COLUMN NAMES
      4.4.2 SUBSCRIPTING AND SUBMATRICES
      4.4.3 MATRIX OPERATORS AND FUNCTIONS
      4.4.4 CROSS-PRODUCT MATRICES
      4.4.5 DATA TO MATRIX CONVERSION
      4.4.6 GETTING SYSTEM MATRICES
      4.4.7 MATRIX DECOMPOSITION

5 Statistical Inference Using Stata
   5.1 ESTIMATION
      5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS
      5.1.2 WEIGHTED ESTIMATION
      5.1.3 CONSTRAINED ESTIMATION
      5.1.4 ROBUST VARIANCE ESTIMATES
   5.2 POST-ESTIMATION COMMANDS
      5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS
      5.2.2 DISPLAYING THE VARIANCE ESTIMATES
      5.2.3 PREDICTIONS AND RESIDUALS
      5.2.4 HYPOTHESIS TESTING
   5.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS
      5.3.1 BOOTSTRAP
      5.3.2 MONTE CARLO SIMULATION

6 Statistical Models in Stata
   6.1 LINEAR MODELS
      6.1.1 ORDINARY LEAST SQUARES
      6.1.2 CONSTRAINED LINEAR REGRESSION
      6.1.3 LINEAR INSTRUMENTAL VARIABLES
   6.2 GENERALIZED LINEAR MODELS
      6.2.1 GLM
      6.2.2 LOGIT AND PROBIT
      6.2.3 POISSON AND NBREG
   6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS
      6.3.1 GROUPED BINARY RESPONSES
      6.3.2 ORDERED CATEGORICAL RESPONSES
      6.3.3 NESTED LOGIT
      6.3.4 MULTINOMIAL LOGIT
      6.3.5 BIPROBIT
      6.3.6 CENSORED AND TRUNCATED REGRESSION
   6.4 DURATION DATA
      6.4.1 PARAMETRIC DURATION MODELS
      6.4.2 COX PROPORTIONAL HAZARD MODEL
   6.5 TIME SERIES
      6.5.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS
      6.5.2 ARIMA MODELS
      6.5.3 ARCH-TYPE MODELS
   6.6 PANEL DATA
      6.6.1 LINEAR PANEL DATA MODELS
      6.6.2 DYNAMIC PANEL DATA MODELS
      6.6.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS
      6.6.4 GEE FOR PANEL DATA
      6.6.5 LOGIT AND PROBIT FOR PANEL DATA
      6.6.6 POISSON AND NEGATIVE BINOMIAL MODELS
   6.7 NONPARAMETRIC ESTIMATION
      6.7.1 DENSITY ESTIMATION
      6.7.2 REGRESSION SMOOTHERS
   6.8 ROBUST AND QUANTILE REGRESSION
      6.8.1 ROBUST REGRESSION
      6.8.2 QUANTILE REGRESSION
   6.9 GENERAL NONLINEAR METHODS

References

Introduction

Why use Stata? In my view, it has three main advantages over other statistical packages.

The first is portability: Stata runs on several platforms (Macintosh, Unix, Windows), and Stata programs written for one of them run with (almost) no change on any other one. The latest release is Stata 7.0. In what follows I focus on Stata 7.0 for Windows 98/95/NT.

The second is speed: Stata is fast because all data manipulations are carried out in RAM. The only limit is the amount of RAM available. With 100MB of RAM, one can work with a dataset containing 5 million observations on 4 real-valued variables or, equivalently, with one million observations on 20 real-valued variables.

The third advantage is that Stata contains “state-of-the-art” statistical procedures, is programmable, and is fully integrated with a matrix language.

This introduction to Stata is organized as follows. Chapter 1 describes the main features of the program. Chapter 2 introduces the syntax of a Stata command and presents some of the most used commands. Chapter 3 describes Stata's graphics capabilities. Chapter 4 introduces the elements of Stata programming and the Stata matrix language. Chapter 5 shows how to carry out statistical inference (estimation, prediction, hypothesis testing) using Stata. Finally, Chapter 6 reviews the main classes of statistical models implemented in Stata.

1 Getting Started

This chapter introduces the main aspects of Stata, namely how to start and stop the program (Section 1.1), where to look for documentation and updates (Section 1.2), the definition of variables and observations (Section 1.3), how to input data (Section 1.4), basic data manipulation (Section 1.5), how to output data (Section 1.6), and how to open and close log files (Section 1.7).

I adopt the following typographic conventions: the typewriter-style typeface is used for Stata commands or options that have to be typed in (e.g. describe or generate), italics are used for things that must be replaced by some other word (e.g. varname or varlist), small caps are used for keyboard keys (e.g. Enter or Ctrl+Break) and boldface is used for Windows commands or switches (e.g. Exit or Help).

1.1 STARTING AND STOPPING STATA

To start Stata, click on the Stata icon. To exit Stata, type exit or choose Exit from the File menu. To exit Stata when there are data in memory which have not been saved, type exit, clear.

To test the installation of Stata, type verinst. To test that the supplied ado files (see Section 4.3) are correctly installed, type crc.

To make Stata stop what it is doing and return to the Stata prompt, click the Break button or press Ctrl+Break.

1.1.1 THE STATA WINDOWS

The Stata interface consists of the following windows:

• the Stata Command window (where commands are typed in and then issued by pressing Enter),

• the Stata Results window (where results are displayed),

• the Review window (which shows past commands),

• the Variables window (which shows the list of variables).

All these windows may be resized and rearranged. The windowing preferences may be saved by choosing Prefs from the main menu bar.


The Stata Command window follows standard Windows editing style. The keys for editing in the Command window are Delete, Backspace, Esc, Home, End, Page Up and Page Down. One can copy one line at a time from the Results window into the clipboard and paste it into the Command window. Clicking once on a command in the Review window copies it to the Command window, where it can be edited before being entered.

1.1.2 THE STATA TOOLBAR

Going from left to right, the Stata toolbar contains the following buttons (holding the mouse pointer over a button brings up a box with a brief description):

1. Open (opens a Stata dataset),

2. Save (saves to disk the Stata dataset currently in memory),

3. Print (prints a graph or log),

4. Begin Log (starts a new log, appends to an existing log, and stops or suspends the current log),

5. Start Viewer (opens the Stata viewer for help on Stata),

6. Bring Dialog Window to Front (brings the Dialog window to the front of the other Stata windows),

7. Bring Results Window to Front (brings the Results window to the front of the other Stata windows),

8. Bring Graph Window to Front (brings the Graph window to the front of the other Stata windows),

9. Do-file Editor (opens the Do-file Editor or brings the Do-file Editor window to the front of the other Stata windows),

10. Data Editor (opens the data editor or brings the Data Editor window to the front of the other Stata windows),

11. Data Browser (opens the data browser or brings the Data Browser window to the front of the other Stata windows),

12. Clear -more- Condition (tells Stata to continue when it has paused in the middle of a long output),

13. Break (stops the current task in Stata).

1.1.3 ALLOCATING MEMORY TO STATA

Initially, Stata allocates 1MB of memory to each session. Under Windows, to permanently change the amount of memory used every time Stata is invoked, click on the Stata icon, pull down File and choose Properties, click on the Shortcut tab and put k# (in kilobytes) or m# (in megabytes) after the call to wstata.exe in the Target line.

The Start in line specifies the initial working directory. This may be changed by editing the line or by using the cd drive:/directory_name command from the Command line.

Memory allocation may also be changed within a given Stata session (although not permanently) by using the set memory #k command (in kilobytes) or the set memory #m command (in megabytes). This command requires that no data be present in memory.

If more memory is used than is physically available on the computer, Stata slows down. In this case, it is recommended to turn virtual memory on by typing

. set virtual on

1.2 STATA DOCUMENTATION AND UPDATES

The main documentation comes from the help system and the Stata reference manual. Additional documentation on Stata developments and updates is available through the Stata Technical Bulletin and the Stata Web site http://www.stata.com.

1.2.1 THE HELP SYSTEM

On-line help can be accessed by opening the Stata viewer from the toolbar or by choosing Help from the main menu bar. In either case, selecting Contents opens the table of contents for on-line help. Selecting Search . . . and then entering keyword searches for keyword in the list of help entries.

On-line help can also be accessed from the Command line by typing help keyword or lookup keyword. Try help, or help contents, or help list.

1.2.2 THE REFERENCE MANUAL

The reference manual consists of the introductory booklet Getting Started with Stata, plus seven volumes: the User’s Guide, the Graphics Manual, the Programming Manual and the Reference Manual in four volumes.

1.2.3 THE STATA TECHNICAL BULLETIN

The Stata Technical Bulletin (STB) is a printed and electronic journal with corresponding software. It contains articles written by Stata Corp., Stata users, and others. Articles have included enhancements to Stata (ado-files), tutorials on programming strategies, illustrations of data analysis techniques, discussions on teaching statistics, debates on appropriate statistical techniques, reports on other programs, along with interesting datasets, questions, and suggestions.

The STB is published every two months (in January, March, May, July, September and November). Every year, the 6 issues are bound into a volume.


1.2.4 TUTORIALS

Stata provides tutorials on a variety of aspects: introduction to Stata (intro.tut), data input, graphics, tables, and procedures for statistical modeling.

To run a tutorial, type tutorial tutname, where tutname is any of the following:

• contents (lists the available official Stata tutorials),

• intro (introductory tutorial),

• yourdata (how to input data),

• graphics (how to make graphs),

• tables (how to make tables),

• regress (estimating regression models, including 2SLS),

• anova (estimating one-, two- and N-way ANOVA and ANCOVA models),

• factor (estimating factor and principal component models),

• logit (estimating maximum-likelihood logit and probit models),

• survival (estimating maximum-likelihood survival models),

• ourdata (description of the data provided by Stata).

1.2.5 STATA UPDATES

The Web site http://www.stata.com contains, among other things, answers to frequently asked questions (FAQs), free additions to Stata (“Cool ado-files”) and the latest official updates to Stata, which are released fairly frequently (every 3 to 4 weeks). The updates can be downloaded directly using the update command.

Another useful command is net, which fetches and installs additions to Stata obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. The net search keywords command searches the Internet for user-written additions to Stata that contain the specified keywords.

1.3 VARIABLES AND OBSERVATIONS

In Stata, variables are associated with the columns of a data matrix, observations with its rows.

1.3.1 VARIABLES

Variables come in two types: alphabetic (strings) or numeric (real or integer valued).

Variables are called by their name, which must be 1 to 8 characters long. The first character must be a letter or an underscore, the other characters can be letters, digits or underscores (spaces or other characters are not allowed). It is better to avoid using the name e for variables and beginning variable names with an underscore (all Stata built-in variables begin with an underscore). Stata is case sensitive (xx, xX, Xx and XX are all different names). There are a few reserved names that cannot be used: e.g. if, in, int, with.

Variables can be renamed using the rename command. For example: rename x y renames the variable x as y.

Associated with each type of variable is a storage type. The available storage types are:

• String variables: str#, where # is an integer between 1 and 80 specifying the number of characters in the string. The maximum length of a string is 80 characters.

• Real valued variables: double (double precision or about 16 digits of accuracy) and float (single precision or about 7 digits of accuracy). Notice that Stata uses ‘.’ (a period) to denote both the decimal symbol and missing numerical values.

• Integer valued variables: long (integers between -2,147,483,648 and 2,147,483,646), int (integers between -32,768 and 32,766) and byte (integers between -127 and 126).

A double occupies twice as much space as a float, a long occupies the same space as a float, an int occupies half the space of a float, and a byte occupies half the space of an int. Thus, a byte occupies 1/4 of the space of a float and 1/8 of the space of a double, so categorical and indicator variables can be stored very efficiently.

The default for storing numeric variables is float, but Stata performs all internal calculations in double.

The compress command may be used to automatically optimize the storage type of the data in memory.
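The effect of storage types and of compress can be seen in a small sketch such as the following. The variable names are made up for the illustration, and comment lines start with an asterisk.

```stata
* Illustrative sketch; the variable names are hypothetical
clear
set obs 100
* stored as float by default
generate ind = mod(_n, 2)
* the same content stored explicitly as a byte
generate byte ind2 = mod(_n, 2)
* double keeps about 16 digits of accuracy
generate double d = _n/3
describe
* ind only holds 0 and 1, so compress can demote it to byte
compress
describe
```

Comparing the two describe listings shows which variables compress was able to shrink. The variable d should stay a double, because its values cannot be held in a smaller type without loss.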

1.3.2 OBSERVATIONS

Observations correspond to the rows of a data matrix. Stata automatically creates and updates the built-in system variable _n, which is a counter containing the number of the current observation, and the system variable _N, which contains the total number of observations in the dataset.

Notice that when the sorting of the data changes, so does the counter _n.
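A minimal sketch of this behavior, with made-up variable names, stores _n before and after sorting:

```stata
* Illustrative sketch; the variable names are hypothetical
clear
set obs 4
generate x = 5 - _n
* obs records the position of each observation before sorting
generate obs = _n
display _N
sort x
* after sorting, _n is renumbered, so obs and newobs disagree
generate newobs = _n
list x obs newobs
```

Because _n always refers to the current ordering, it is safer to store it in an ordinary variable (here obs) if the original order must be recovered later.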

1.4 INPUTTING DATA

Data can be entered into Stata by direct typing, with the data editor or from a file (an ASCII file or a Stata data file).


1.4.1 DIRECT TYPING

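The input command allows typing data directly into the dataset in memory. String variables must be declared with a str# storage type, and data entry is terminated by typing end.

Example:

. input x1 x2 str11 x3
  1. 4 3.5 "J. Neyman"
  2. 7 .01 "R.A. Fisher"
  3. 2 11.5 "J. Tukey"
  4. 3 21.1 "D.R. Cox"
  5. end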

1.4.2 THE DATA EDITOR

The data editor corresponds to the edit command. It may be accessed by clicking the Data Editor button on the Stata toolbar. The data editor is like a standard spreadsheet with columns corresponding to variables and rows to observations. Data may be entered or modified by choosing the cell, typing the value and then pressing Enter or Tab.

With the data editor, quotes around strings are unnecessary. Missing numeric values are recorded as ‘.’ (a period), missing string values are just empty strings.

The data editor initially names variables var1, var2, . . . . Variables may be renamed by double-clicking anywhere in the variable’s column, thus bringing up the Variable information dialog.

The data editor allows copying and pasting data created by other spreadsheet or database programs.

It is important to always check the numeric format of a spreadsheet before copying data to Stata. Unless otherwise instructed, Stata will interpret the number 1,314 as a string.

1.4.3 LOADING AN ASCII (TEXT) DATA FILE

Stata offers three basic commands for loading an ASCII (text) data file.

The first command, infile, is a very flexible way of reading an ASCII (text) datafile from disk into memory. The data can be in either free- or fixed-format, and a single observation may span any number of input lines.

The basic syntax for data in free-format (data may be separated by spaces, tabs or commas) is:

infile varlist using filename [, clear]

where varlist is a list of variable names with blanks in between (that is, varname1 varname2 . . . ), filename is the name of the disk datafile (including the path, if necessary) and clear is an option that clears data loaded in memory without saving them (I follow the convention of denoting items that are optional by enclosing them in square brackets). If the file name is specified without an extension, .raw is assumed.
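As a rough, self-contained sketch of free-format reading, the lines below first write a small ASCII file with outfile and then read it back with infile. The file name example and the variable names are hypothetical, and the str11 prefix is assumed to be needed so that x3 is read as a string.

```stata
* Illustrative sketch; file and variable names are hypothetical
clear
input x1 x2 str11 x3
4 3.5 "J. Neyman"
7 .01 "R.A. Fisher"
end
* write a free-format ASCII file (the extension .raw is assumed)
outfile x1 x2 x3 using example, replace
* read it back, declaring x3 as a string of up to 11 characters
infile x1 x2 str11 x3 using example, clear
list
```

The clear option is what allows infile to discard the data typed in at the start of the sketch.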


If the data are in fixed-format, a dictionary (.dct) file is necessary. A dictionary is an ASCII (text) file which describes the contents of a datafile. The data may be in the same file as the dictionary or in another file. The basic syntax is:

infile using filename [, using(filename2) clear]

where filename is the name of the dictionary file and filename2 is the name of the file containing the data. If the using() option is not specified, the data are assumed to follow the dictionary in filename or, if the dictionary specifies the name of some other file, that file is assumed to contain the data. Notice that if using(filename2) is specified, filename2 is used to obtain the data even if the dictionary itself says otherwise.

The basic syntax of a dictionary file is the following:

[infile] dictionary [using filename] {
    * comments may be included freely
    [type] varname
}
(data might appear here)
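To make the layout above concrete, here is a sketch of what a small dictionary file might look like. Everything in it is hypothetical (the file could be called persons.dct, and the variable names and values are invented); it declares three variables and carries its own data after the closing brace.

```stata
dictionary {
* hypothetical variables: an identifier, an income figure and a name
    int id
    float income
    str10 name
}
1 25000 "Smith"
2 31500 "Jones"
3 18250 "Brown"
```

Such a file would then be read by naming it as the dictionary file in the infile command. Keeping the data in a separate file and pointing to it from the dictionary line is the alternative described above.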

The second is the infix command, which reads ASCII files in fixed-column format. Again, a single observation may span any number of input lines. It is somewhat easier but less flexible than infile. Its basic syntax is:

infix using filename [, using(filename2) clear]

where filename is the name of a dictionary file and filename2 is the name of the file containing the data.

The third is the insheet command, which reads ASCII files created by a spreadsheet or database program. Regardless of the creator, this command reads ASCII files where there is one observation per line and the values are separated by tabs or commas. The first line of the file may contain the variable names. The basic syntax is:

insheet [varlist] using filename [, {comma|tab|delimiter("char")} clear]

The {comma|tab|delimiter("char")} option tells Stata how values are separated in the file (I follow the convention of denoting the available alternatives by enclosing them in curly brackets, with a vertical bar as a separator). Specifying tab or comma is not necessary because insheet can determine the separation character for itself when the character is a tab or comma.

The insheet command can also determine for itself whether the file includes variable names.
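The following sketch tries this out without needing an external spreadsheet: it writes a comma-separated file with outsheet (the companion command for writing such files) and reads it back with insheet. The file name example.csv and the variable names are made up for the illustration.

```stata
* Illustrative sketch; file and variable names are hypothetical
clear
input id str5 name
1 "Ann"
2 "Bob"
end
* write a comma-separated file with the variable names on the first line
outsheet id name using example.csv, comma replace
* no comma or names option is given: insheet works both out for itself
insheet using example.csv, clear
describe
list
```

If the separator were some other character, the delimiter("char") option would have to be given explicitly.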

1.4.4 LOADING A STATA DATA FILE

Stata data files have the default extension .dta.

Stata data files on disk may be loaded using the use command. The basic syntax is:

use filename [, clear]

where filename contains the full path to the data.


1.5 BASIC DATA MANIPULATION

I now introduce some basic facilities for manipulating data.

1.5.1 DISPLAYING DATA

Strings and values of scalar expressions may be displayed using the display command. This command may also be used interactively as a substitute for a hand calculator.

Examples:

. display "this is a string"

. display 5+exp(ln(10))

. display "the value of f(x) is" 5+exp(ln(10))

The content of the dataset in memory may be displayed by using the describe command. The content of a Stata data file on disk may be described without actually loading it by using the describe using filename command.

The complementary ds command lists variable names in a compact format, whereas lookfor string helps in finding variables by searching for string among all variable names and labels.

The list [varlist] command displays the values of variables. If no varlist is specified, the values of all the variables are displayed.

The display format of a variable may be specified using the command

format varlist %fmt

where %fmt is the chosen format for varlist. For example, format x %9.0g displays the variable x in g (generic numeric) format, whereas format x %8.3f displays x in f (fixed numeric) format with three decimals.

The set dp comma command may be used to display numerical values using comma as the decimal character. To switch the decimal character back to period type set dp period.

Changing the display format does not affect the internal precision with which variables are stored and manipulated.
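A short sketch, using a made-up variable x, shows that only the display changes while the stored values stay the same:

```stata
* Illustrative sketch; the variable name is hypothetical
clear
set obs 3
generate x = _n/3
* fixed format with three decimals
format x %8.3f
list x
* the comparison uses the stored value, not the three displayed decimals
count if x == float(1/3)
* back to the general numeric format
format x %9.0g
list x
```

The count command still finds the observation equal to 1/3 (at float precision), even though list shows it rounded to three decimals.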

1.5.2 LABELING DATA

Stata contains a number of commands for manipulating labels.

Variables are labeled by using the command

label variable varname "string"

where string (typed in quotes) is up to 80 characters long. If no label is specified, any existing variable label is removed.

The values of a variable are labeled by using the command

label value varname [lblname]


where lblname is the name of a value label defined through the command

label define lblname # "string" [# "string" ...]

Example:

label define sexlbl 0 "male" 1 "female"
label value sex sexlbl

The label dir command lists the names of value labels stored in memory, label drop lblnames eliminates the value labels lblnames, label drop _all eliminates all value labels, whereas label list lists the names and contents of value labels stored in memory.
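Putting the labeling commands of this section together, a sketch on a tiny invented dataset might look as follows. The data, the label texts and the names sex, age and sexlbl are all hypothetical.

```stata
* Illustrative sketch; data and names are hypothetical
clear
input sex age
0 34
1 29
1 41
end
label data "Hypothetical example data"
label variable sex "Sex of respondent"
label define sexlbl 0 "male" 1 "female"
label values sex sexlbl
* show what is stored in the value label
label list sexlbl
* list now prints male/female instead of 0/1
list
describe
```

The underlying values of sex are still 0 and 1, so expressions such as sex == 1 keep working after the value label has been attached.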

Data files are labeled by using the command

label data "string"

where string is up to 80 characters long. Data labels are displayed when the data are used or described. If no label is specified, any existing label is removed.

1.5.3 SUMMARIZING DATA

The command

summarize [varlist]

calculates and displays a variety of univariate summary statistics (number of nonmissing observations, mean, standard deviation, minimum and maximum value). If no varlist is specified, summary statistics are calculated for all the variables in the data.

The command

summarize [varlist], detail

produces additional statistics including skewness, kurtosis, the four smallest and four largest values, along with various percentiles.
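As a small sketch of both forms of the command, the lines below summarize an invented variable x. They also use the saved results r(mean) and r(p50), which are not discussed in this section; treat them as an assumption about what summarize leaves behind.

```stata
* Illustrative sketch; the variable name is hypothetical
clear
set obs 10
generate x = _n
summarize x
* the mean can be picked up from the saved results
display r(mean)
summarize x, detail
* with detail, percentiles such as the median are saved as well
display r(p50)
```

Saved results are overwritten by the next command that produces results, so they should be used or stored right away.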

1.5.4 CREATING NEW VARIABLES

New variables are created using the generate command. The basic syntax of this command is:

generate newvar = exp [options]

where exp is an expression and options are optional instructions that may restrict the application of exp.

For example, to generate the new variable y using positive values of an existing variable x, type:

. generate y = x*x + log(x) if x>0

where *, + and > are examples of Stata operators, log(x) is an example of a Stata function, and if x>0 is a qualifier that restricts the scope of the command to the observations for which x > 0.


To generate the new variable y containing lagged values of x, type:

. generate y = x[_n-1]

Typing generate y = x[1] sets every observation of y equal to the first observation in x, whereas generate y = x[_n] and generate y = x are equivalent.
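The subscripting rules can be tried on a few invented observations, as in this sketch (all variable names are hypothetical):

```stata
* Illustrative sketch; the variable names are hypothetical
clear
set obs 5
generate x = _n^2
* lag: the first observation is missing because x[0] does not exist
generate lagx = x[_n-1]
* first difference of x
generate dx = x - x[_n-1]
* every observation equals the first value of x
generate first = x[1]
list
```

Since subscripts refer to the current ordering of the data, lags computed this way are only meaningful if the data have been sorted appropriately beforehand.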

The Stata operators are:

• arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division), ^ (power);

• string operators: + (string concatenation), for example the expression "abc" + "def" produces the string "abcdef";

• relational operators: < (less than), > (greater than), <= (less or equal), >= (greater or equal), == (equal), ~= (not equal);

• logical operators: & (and), | (or), ~ (not).

The order of evaluation follows the standard rules. Parentheses may be used to force a different order of evaluation.
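The operators listed above can be explored interactively with display, as in the following sketch:

```stata
* power is evaluated before multiplication, multiplication before addition
display 2 + 3*4^2
* parentheses force a different order of evaluation
display (2 + 3)*4^2
* relational and logical expressions evaluate to 1 (true) or 0 (false)
display (3 > 2) & (1 == 2)
display ~(1 == 2) | (3 <= 2)
* + also concatenates strings
display "abc" + "def"
```

The first two lines display 50 and 80 respectively, which illustrates the effect of the parentheses.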

Functions are used in expressions. The argument(s) of a function may be any expression, including other functions. The arguments of a function are enclosed in parentheses. If there are multiple arguments, they are separated by commas. Functions return missing when the value of the function is undefined.

Stata has built in a number of functions:

• Mathematical functions, for example exp(x), log(x) or ln(x), sqrt(x), abs(x) and the main trigonometric functions.

• Statistical functions: density, distribution functions and quantile functions of various probability distributions, both discrete and continuous. If X denotes the name of a continuous distribution, then Stata usually provides X() (cumulative distribution function), Xtail() (upper tail cumulative distribution function), invX() (quantile function) and invXtail() (upper quantile function). The currently available distributions include chi2(df,x) (chi-square distribution with df degrees of freedom), F(df1,df2,x) (F distribution with df1 and df2 degrees of freedom), norm(x) (standard Gaussian), nchi2(df,L,x) (noncentral chi-square distribution with df degrees of freedom and noncentrality parameter L) and t(df,x) (t distribution with df degrees of freedom).

Examples:

. display Binomial(5,x,.25) (cumulative binomial with parameters n = 5 and π = .25)
. display normden(x) (standard normal density)
. display norm(x) (cumulative standard normal)
. display invnorm(p) (standard normal quantile function)
. display binorm(x,y,.5) (cumulative bivariate Gaussian with zero means, unit variances and correlation ρ = .5)

• Pseudo-random number generator: uniform(), which generates uniformly distributed pseudo-random numbers on the interval [0,1). It takes no arguments, and is based on George Marsaglia’s KISS (Keep It Simple Stupid) generator. Pseudo-random numbers according to any other continuous distribution may be generated through the inverse probability integral transform.

For example, pseudo-random numbers according to the standard normal distribution may be generated with the expression invnorm(uniform()), where the function invnorm evaluates the quantile function of the standard normal.

• String functions (which apply to string variables), for example lower(s) (returns the lowercased variant of s), real(s) (converts s into a numeric value), string(n) (converts n into a string), substr(s, n1, n2) (returns the substring of s starting at n1 for a length of n2; if n2 = ., the remaining portion of the string is returned), and upper(s) (returns the uppercased variant of s).

• Special functions, for example float(x) (returns the value of x rounded to float storage type), int(x) (returns the integer part of x), max(x1, x2, ..., xn) and min(x1, x2, ..., xn) (return respectively the maximum and the minimum of the arguments, ignoring missing values), round(x, y) (returns x rounded into units of y), sign(x) (returns -1 if x < 0, 0 if x = 0, 1 if x > 0, and . if x = .), and sum(x) (returns the running sum of x, treating missing values as zero).

A variety of date and time-series functions are also available, as well as matrix functions returning scalars (see Section 4.4.3).
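As a minimal sketch of how operators and functions combine in expressions, the lines below evaluate a few expressions with display and then apply the inverse probability integral transform to draw from a unit exponential distribution. The variable names u and e are my own choice, and the function names are those of the Stata version described in this tutorial (later releases may name some of them differently).

```stata
* Evaluate a few expressions directly
display 2^3 + sqrt(16)                 // 12
display "abc" + "def"                  // abcdef
display upper(substr("stata", 1, 3))   // STA
display round(3.14159, .01)

* Inverse probability integral transform:
* if U is uniform on [0,1), then -ln(1-U) is exponential with mean 1
drop _all
set obs 100
generate u = uniform()
generate e = -ln(1 - u)
summarize e
```

Since 1 - u lies in (0,1], the logarithm is always defined, and the sample mean of e should be close to 1.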

1.5.5 CHANGING AND RENAMING VARIABLES

The content of an existing variable may be changed by using the replace command, whereas the name of an existing variable may be changed (its contents remain unchanged) by using the rename command.

The recode varname command changes the values of varname according to the rules specified. For example,

. recode x 1=2 3=4

changes 1 in x to 2 and 3 to 4, whereas

. recode x 1 3/5 = 6

changes 1, 3, 4 and 5 to 6.

Given a string variable named varname, the command

encode varname, generate(newvar)

generates a new numeric variable named newvar based on varname, creating at the same time (or just using as necessary) the value label newvar. Do not use encode if varname contains numbers that merely happen to be stored as strings (e.g. the number ‘1,314’). In this case use instead

generate newvar = real(varname)

The decode command creates a new string variable named newvar based on the “encoded” numeric variable varname and its value label.
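The following sketch contrasts encode, decode and real() on a toy dataset typed in with input. All variable names (sex, idstr, sexcode, sex2, id) are hypothetical and chosen only for illustration.

```stata
* Toy data: a genuinely categorical string and digits stored as a string
drop _all
input str6 sex str4 idstr
"male"   "1314"
"female" "27"
"male"   "305"
end

* encode assigns integer codes in alphabetical order of the strings
* and attaches a value label named sexcode
encode sex, generate(sexcode)

* decode goes the other way, from labeled codes back to a string
decode sexcode, generate(sex2)

* digits stored as strings are converted with real(), not encode
generate id = real(idstr)

list
list, nolabel
```

Comparing the two list commands shows the value labels and the underlying numeric codes of sexcode side by side.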

1.5.6 ELIMINATING VARIABLES OR OBSERVATIONS

The drop command eliminates variables or observations from the data in memory. To eliminate variables use

drop varlist

To eliminate observations use

drop in range [if exp]

The drop _all command eliminates all variables and observations in memory.

The keep command works the same as drop except that we specify the variables or observations to be kept rather than those to be deleted.

The clear command essentially resets Stata and is equivalent to the set of commands:

. version 7.0

. drop _all

. label drop _all (drop all labels in memory)

. scalar drop _all (drop all scalar variables in memory)

. matrix drop _all (drop all matrices in memory)

. eq drop _all (drop all equations in memory)

. constraint drop _all (drop all constraints in memory)

. discard (drop all programs in memory)

1.5.7 INCREASING THE NUMBER OF OBSERVATIONS IN A DATASET

The set obs # command changes the number of observations in the current dataset to #, where # is an integer at least as large as the current number _N of observations. If there are variables in memory, the values of all new observations are set to missing.

For example,

. drop _all

. set obs 100

. gen x = _n

clears memory, makes 100 observations and assigns the variable x the values from 1 to 100.

1.6 OUTPUTTING DATA

Stata offers three basic commands for outputting data, corresponding to the use, infile and insheet commands discussed in Section 1.4.

The first command

save [filename] [, options]

stores the dataset currently in memory on disk in Stata format under the name filename. If filename is not specified, the name under which the data was last known to Stata is used. If filename is specified without an extension, .dta is assumed.

The available options are nolabel old replace all. The old option enables a dataset to be readable by someone with Stata 6.0, the option replace permits save to overwrite an existing dataset.

The second command

outfile [varlist] using filename [, options]

writes data to a disk file in ASCII (text) format. The data saved by outfile can be read back by infile. If filename is specified without an extension, .raw is assumed unless the dictionary option is specified, in which case .dct is assumed.

The third command

outsheet [varlist] using filename [, options]

writes data in tab- or comma-separated ASCII format into a file. This is the format that most spreadsheet programs prefer. If filename is specified without an extension, .out is assumed.
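A short sketch of the three output commands applied to the same small dataset. The file name mydata and the variables x and y are hypothetical, and the comma option of outsheet is used here on the assumption that a comma-separated file is wanted.

```stata
* Build a small dataset to write out
drop _all
set obs 5
generate x = _n
generate y = x^2

* Stata format: writes mydata.dta
save mydata, replace

* Plain ASCII, readable back with infile: writes mydata.raw
outfile x y using mydata, replace

* Comma-separated, with variable names on the first line: writes mydata.out
outsheet x y using mydata, comma replace
```

Because each command assumes a different default extension, the three files can share the same base name without overwriting one another.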

1.7 LOG FILES

The log command echoes a copy of a Stata session to a file or a device. More precisely:

log using filename [, options]

opens the file filename and echoes a copy of the Stata session to the file. If filename is specified without an extension, .smcl is assumed (SMCL is Stata’s output language). The available options are noproc append replace.

The log close command stops logging the session and closes the file, log off temporarily stops logging the session leaving the file open, while log on resumes logging to the file.

The set log command controls the dimensions of output sent to the log. Its format is:

set {display|log} {linesize|pagesize} #

where # is the line or page length, for example set linesize 120 or set pagesize 40.
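The logging commands fit together as in the sketch below. It assumes that no log is already open (otherwise log using reports an error), and the file name mysession and the variable x are hypothetical.

```stata
* Some data to work with
drop _all
set obs 10
generate x = _n

* Start logging; writes mysession.smcl, overwriting any earlier copy
log using mysession, replace
summarize x

* Output produced between log off and log on is not recorded
log off
list in 1/5
log on

describe
log close
```

The resulting file contains the summarize and describe output but not the listing.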

2

Stata Commands

In this chapter I describe the syntax of some frequently used Stata commands. My selection is of course subjective.

2.1 GENERAL SYNTAX

The general syntax of a Stata command is:

command [varlist] [ = exp] [weight] [if exp] [in range] [, options]

If no varlist appears, the command assumes a varlist of _all, that is, the command is applied to all the variables in the data.

The option = exp specifies the value to be assigned to a variable. It is most often used with generate and replace. For example:

. replace newvar = oldvar+2

Many commands take command-specific options. A single comma separates a command’s options from the rest of the command.

Most commands can be abbreviated. For example, one may type gen or simply g instead of generate, summ or simply su instead of summarize, des or simply d instead of describe, l instead of list, etc. See the on-line help or the Reference Manual for the shortest allowable abbreviation of a command.

The F-keys may be used to create shortcuts to some commands. For example, the F3 key comes defined as describe followed by Enter.

2.1.1 BY

Most Stata commands allow the by varlist: prefix. This causes command to be repeated for each subset of the data for which the values of the variables in varlist are equal. The use of by requires the data to be sorted by varlist beforehand.

Example:

. sort x

. by x: summarize y

Not all commands allow the by varlist: prefix. Some replace it with by(groupvar) in the options. For example, the syntax of the ttest command is:

ttest varname [if exp] [in range], by(groupvar) [unequal welch level(#)]

2.1.2 WEIGHTS

The option weight indicates the weight to be attached to each observation. The syntax of weight is [weightword = exp], where weightword is either weight (the default treatment of weights) or one of fweight, pweight, aweight and iweight, which correspond to the four kinds of weights that Stata understands (although not every command supports all four of them):

1. frequency weights (fweight) are integer-valued and indicate multiple observations,

2. probability or sampling weights (pweight) are inversely proportional to the sample inclusion probabilities,

3. analytic weights (aweight) are inversely proportional to the variance of an observation,

4. importance weights (iweight) indicate the relative “importance” of an observation.

The default treatment (weight) is each command’s idea of what the “natural” weights are and is one of the above weight types.
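To make the meaning of frequency weights concrete, the sketch below summarizes a toy dataset twice: once with fweights, and once after physically duplicating the observations with expand (described in Section 2.4.3). The variable names x and freq are hypothetical.

```stata
* Three distinct values of x with their frequencies
drop _all
input x freq
1 3
2 1
4 2
end

* Weighted summary: six observations, mean (1*3 + 2*1 + 4*2)/6 = 13/6
summarize x [fweight=freq]

* The same result from the expanded data
expand freq
summarize x
```

Both summarize commands report the same number of observations and the same mean, which is what "indicate multiple observations" means in practice.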

2.1.3 IF AND IN

The if exp qualifier restricts the scope of the command to those observations for which the value of the expression is true. For example:

. replace y = x+2 if x>0

The in range qualifier restricts the scope of the command to a specific observation range, where range is any of #, #/#, #/l or f/#.

For example, the first command below summarizes the values of x and y for observation 10 only, whereas the other two summarize them for the first 10 observations:

. summarize x y in 10

. summarize x y in 1/10

. summarize x y in f/10

2.1.4 QUIETLY AND NOISILY

Typing quietly command suppresses all terminal output for the duration of command, whereas noisily command turns terminal output back on, if appropriate, for the duration of command.

Example:

. quietly by x: generate y = sum(z)

2.2 BASIC DATA COMMANDS

I discuss nine basic commands: describe, list, drop and keep, generate and its extension egen, replace, sort and gsort.

2.2.1 DESCRIBE

This command displays a summary of the contents of either the data in memory or the data stored in a Stata-format dataset. Its syntax is

describe [varlist] [, short detail fullnames numbers]

in the first case, and

describe using filename [, short detail]

in the second case, where short suppresses the specific information about each variable, detail includes more detailed information (the width of a single observation, the maximum number of observations holding the number of variables constant, the maximum number of variables holding the number of observations constant, the maximum width for an observation, and the maximum size of the dataset), fullnames displays the full names of the variables (the default is to present an abbreviation when the variable name is longer than 15 characters), and numbers presents the variable number along with the variable name. The numbers and fullnames options may not both be specified together.

2.2.2 LIST

This command displays the values of variables. Its syntax is:

list [varlist] [if exp] [in range] [, [no]display nolabel noobs]

where [no]display forces the format into display or tabular (nodisplay) format (if neither of these two options is specified, then Stata chooses one based on its judgment), nolabel causes the numeric codes rather than label values to be displayed, and noobs suppresses printing of the observation numbers.

Examples:

. list in 1/10

. list x y

. list x y in 1/10

. list if x>20

. list x y if z>20

. list x y z if z>20 in 1/10

2.2.3 DROP AND KEEP

The drop command eliminates variables or observations from the data in memory. Its syntax is:

drop varlist

drop if exp

drop in range [if exp]

The keep command works exactly the same as drop except that one specifies the variables or observations to be kept.

Examples:

. drop in 1/33

. keep in 34/l (drop first 33 observations)

. drop in -10/l (drop last 10 observations)

. drop if x<21

. keep if x>=21

. sort y

. by y: keep if _n==_N

2.2.4 GENERATE AND EGEN

The generate command creates a new variable. Its syntax is:

generate [type] newvar[:lblname] = exp [if exp] [in range]

If type is not specified, float is the default (the default type may be changed using the set type command). If missing values are generated, the number of missing values in newvar is always reported.

To prevent Stata from returning an error when string variables are generated, type must be set to str#.

Examples:

. generate x2 = x*x

. generate bigz = z>100000 & z~=.

. gen double w = x/y

. gen xlag = x[_n-1]

. gen u = uniform() (U(0, 1) pseudo-random numbers)

. gen z = invnorm(uniform()) (N(0, 1) pseudo-random numbers)

The egen command provides an extension to generate. Its syntax is:

egen [type] newvar = fcn(stuff) [if exp] [in range] [, options]

egen creates newvar equal to fcn(stuff). Depending on fcn(), stuff refers to an expression, a list of variables, or a list of numbers. The options are similarly function dependent. Note that egen may change the sort order of the data.

Important examples of egen functions include:

• count(exp) [, by(varlist)] creates a constant (within varlist) containing the number of nonmissing observations of exp.

• diff(varlist) creates an indicator variable equal to 1 where the variables in varlist are not equal and 0 otherwise. It may not be combined with by.

• group(varlist) [, missing label truncate(num)] creates a single variable taking on values 1,2,... for the groups formed by varlist. It may not be combined with by. The label option returns integers from 1 up according to the distinct groups of varlist in sorted order. The integers are labeled with the values of varlist, or the value labels if they exist. The truncate() option truncates the values contributed to the label from each variable in varlist to the length specified by the integer argument num.

• iqr(exp) [, by(varlist)] creates a constant (within varlist) containing the interquartile range of exp. The same syntax holds for a number of other functions with argument exp, such as kurt (coefficient of kurtosis), mad (median absolute deviation from the median), max (maximum value), mean (mean), median (median), mdev (mean absolute deviation from the mean), min (minimum value), sd (standard deviation), skew (coefficient of skewness), sum (sum).

• ma(exp) [, t(#) nomiss] creates a #-period moving average of exp. If t() is not specified, t(3) is assumed. Notice that # must be odd and exp must not produce missing values.

• pctile(exp) [, p(#) by(varlist)] creates a constant (within varlist) containing the #-th percentile of exp. If p() is not specified, 50 is assumed, meaning medians.

• rmax(varlist) gives the maximum value in varlist for each observation (row). It may not be combined with by. The same syntax holds for a number of other functions with argument varlist, such as rmean (row mean), rmin (row minimum), rmiss (row number of missing values).

• std(exp) [, mean(#) std(#)] creates the standardized value of exp using the specified mean and standard deviation. The default is mean(0) and std(1), producing a variable with zero mean and unit variance.

Examples:

. egen avgx = mean(x)

. gen dev = x-avgx

. egen x = median(x2-x1) (expression, - means subtraction)

. egen y = rmean(x1 x2 x3)

. egen y = rmean(x1-x3) (varlist, - means through)

. egen sdx = sd(x)

. egen stdx = std(x), mean(100) std(10)

. egen sumx = sum(x), by(y)

. egen xy = group(x y)

2.2.5 REPLACE

This command changes the contents of an existing variable. Its syntax is:

replace oldvar = exp [if exp] [in range] [, nopromote]

where nopromote prevents replace from promoting the variable type to accommodate the change.

Examples:

. replace z=. if z<=0

. replace y = 25 in 1007

. sort z

. by z: gen avgx = sum(x)/sum(x~=.)

. by z: replace avgx = avgx[_N]

2.2.6 SORT AND GSORT

The sort command arranges the observations of the current data in ascending order of the values of the variables in varlist. Its syntax is

sort varlist [in range]

There is no limit to the number of variables in varlist and each variable can be numeric or string. Missing values are interpreted as being larger than any other number and are thus placed last (there is an exception: when sorting on a string variable, null strings are placed first).

The dataset is marked as being sorted by varlist unless in range is specified.

Examples:

. sort personid

. sort lstname frstname midinitl

Unlike sort, which can produce only ascending-order arrangements, gsort may arrange the observations in either ascending or descending order. Its syntax is

gsort [+|-]varname [[+|-]varname [...]] [, generate(newvar) mfirst]

The observations are placed in ascending order of varname if + or nothing is typed in front of the name and in descending order if - is typed.

The generate(newvar) option creates newvar containing 1,2,3,..., for each of the groups denoted by the ordered varnames. This is useful when one wishes to use the ordering with a subsequent by. The mfirst option specifies that missing values are to be placed first in descending orderings rather than last.

Examples:

. gsort x (same as sort x)

. gsort +x (same as gsort x)

. gsort -x (reverse sort)

. gsort -name (reverse alphabetical)

. gsort x y (ascending x, ascending y)

. gsort x -y (ascending x, descending y)

. gsort -x, gen(revx)

. quietly by revx: gen rcum = _N if _n==1

. replace rcum = sum(rcum)

. replace rcum = rcum/rcum[_N]

2.3 COMBINING DATA

I discuss two commands: append and merge.

2.3.1 APPEND

This command appends a Stata-format dataset stored on disk to the end of the dataset in memory. Its syntax is:

append using filename [, nolabel]

where nolabel prevents copying the value label definitions from the disk dataset. Even if this option is not specified, label definitions from the disk dataset never replace definitions already in memory.

If filename is specified without an extension, .dta is assumed.
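As an illustration, the sketch below builds two small datasets, saves them, and stacks the second under the first. The file names part1 and part2 and the variable id are hypothetical.

```stata
* First piece: observations 1 to 3
drop _all
set obs 3
generate id = _n
save part1, replace

* Second piece: observations 4 and 5
drop _all
set obs 2
generate id = _n + 3
save part2, replace

* Stack part2 under part1
use part1, clear
append using part2
list
```

After append the dataset in memory has five observations, with id running from 1 to 5.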

2.3.2 MERGE

This command joins corresponding observations from the dataset currently in memory (called the master dataset) with those from the Stata-format dataset stored as filename (called the using dataset) into single observations (if filename is specified without an extension, .dta is assumed). It can perform both one-to-one and match merges. Its syntax is:

merge [varlist] using filename [, nolabel update replace nokeep _merge(varname)]

where nokeep causes merge to ignore observations in the using data that have no corresponding observation in the master (the default is to add these observations to the merged result and mark them with _merge==2) and _merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _merge(_merge), which adds a new variable _merge to the data whose values are:

_merge==1 (obs. from master data)

_merge==2 (obs. from using data)

_merge==3 (obs. from both master and using data)

Examples:

. use data1 (one-to-one merge)

. merge using data2

. tab _merge

. use data2 (match merge)

. sort x

. save data2, replace

. use data1

. sort x

. merge x using data2

. tab _merge

2.4 RESHAPING DATA

I discuss five commands: collapse, contract, expand, fillin and reshape.

2.4.1 COLLAPSE

This command replaces the data in memory with a new dataset consisting of the means, medians, etc. of the specified variables. Its syntax is:

collapse clist [weight] [if exp] [in range] [, by(varlist) cw fast]

where clist is either

[(stat)] varlist [[(stat)] ...]

[(stat)] target_var=varname [target_var=varname ...] [[(stat)] ...]

or any combination of the varlist or target_var forms, and stat is one of the following: mean (means), sd (standard deviations), sum (sums), rawsum (sums ignoring optionally specified weights), count (number of nonmissing observations), max (maxima), min (minima), median (medians), p# (#th percentile), iqr (interquartile range). If stat is not specified, mean is assumed.

The by(varlist) option specifies the groups over which the means, etc., are to be calculated, cw specifies casewise deletion (if not specified, all observations possible are used for each calculated statistic) and fast specifies that collapse not go to extra work so that it can restore the original data should the user press Break.
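A minimal sketch of collapse using the target_var=varname form, so that several statistics of the same variable can be kept side by side. The variable names grp, x, avgx, sdx and n are hypothetical.

```stata
* Toy data: two groups
drop _all
input grp x
1 2
1 4
2 10
2 20
2 30
end

* One observation per group, with mean, standard deviation and count of x
collapse (mean) avgx=x (sd) sdx=x (count) n=x, by(grp)
list
```

The collapsed dataset has two observations: group 1 with mean 3 over 2 observations, and group 2 with mean 20 and standard deviation 10 over 3 observations. The original data are gone from memory, so save them first if they are still needed.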

2.4.2 CONTRACT

This command makes datasets of frequencies. It replaces the data in memory with a new dataset consisting of all combinations of varlist that exist in the data together with a new variable that contains the frequency of each combination. Its syntax is:

contract varlist [weight] [if exp] [in range] [, freq(varname) zero nomiss]

where freq(varname) specifies a name for the frequency variable (if not specified, _freq is used, the name must be new), zero specifies that combinations with frequency zero are wanted, and nomiss specifies that observations with missing values on any of the variables in varlist will be dropped (if not specified, all observations possible are used).
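For instance, with the hypothetical variables a and b below, contract reduces four observations to the three distinct (a, b) combinations and records how often each occurs in a new variable, here named count.

```stata
* Toy data: the combination (1,1) occurs twice
drop _all
input a b
1 1
1 1
1 2
2 2
end

* One observation per existing combination, plus its frequency
contract a b, freq(count)
list
```

Adding the zero option would also create an observation for the combination a = 2, b = 1, with a frequency of zero.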

2.4.3 EXPAND

This command replaces each observation in the current dataset with n copies of the observation, where n is equal to the integer part of the required expression (if the expression is less than one or equal to missing, then it is interpreted as if it were one, and the observation is retained but not duplicated). Its syntax is:

expand [=]exp [if exp] [in range]

Example:

. expand 2

2.4.4 FILLIN

This command rectangularizes a dataset by adding observations with missing data so that all interactions of the variables in varlist exist. It also adds the variable _fillin to the data (with value 1 for created observations and 0 for previously existing observations). Its syntax is:

fillin varlist

2.4.5 RESHAPE

This command converts data from wide to long form and vice versa. Its basic syntax is:

reshape wide varnames, i(varlist) [j(varname) string]

reshape long varnames, i(varlist) [j(varname) string]

where i(varlist) specifies the variable(s) whose unique values denote a logical observation, j(varname) specifies the variable whose unique values denote a subobservation, and string specifies that the j() may contain string values.

Examples:

. reshape long x1 x2, i(y) j(z) (converts from wide to long)

. reshape wide (converts back to wide)

. reshape ..., i(z) (single i() variable)

. reshape ..., i(z1 z2) (two i() variables)

. reshape long x, i(y) j(z 1-3 5) (specifying j() values)

. reshape long x, i(y) j(z) string (allow string variables in j())

2.5 BASIC SAMPLE STATISTICS

I discuss seven commands: count, summarize, means, centile, cumul, correlate and regress.

2.5.1 COUNT

This command counts observations satisfying the specified conditions. Its syntax is:

count [if exp] [in range]

If no condition is specified, count displays the number of observations in the dataset.

Examples:

. count if y<0

. by x: count if y<0

2.5.2 SUMMARIZE

This command reports a variety of univariate summary statistics. Its syntax is:

summarize [varlist] [weight] [if exp] [in range] [, {detail|meanonly} format]

where detail produces additional statistics (including skewness, kurtosis, the four smallest and four largest values, along with various percentiles), meanonly suppresses display of the results and calculation of the variance (it is allowed only when detail is not specified) and format requests that the summary statistics be displayed using the display format associated with the variables rather than the default g format.
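The meanonly option is mainly useful when the mean is needed for a later calculation rather than for display. The sketch below assumes that summarize leaves the mean behind in the saved result r(mean), which is not discussed in this section; the variable names x and xdev are hypothetical.

```stata
* Toy data: x = 1, 2, ..., 50
drop _all
set obs 50
generate x = _n

* Full set of statistics, including percentiles
summarize x, detail

* Compute the mean silently and reuse it
quietly summarize x, meanonly
generate xdev = x - r(mean)
summarize xdev
```

The last command should report a mean of zero for xdev, since x has been centered on its own mean.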

2.5.3 MEANS

This command reports the arithmetic, geometric, and harmonic means, along with their respective confidence intervals, for the specified variables. Its syntax is:

means [varlist] [if exp] [in range] [, add(#) only level(#)]

where add(#) adds the value # to each variable in varlist before computing the means and confidence intervals (this may be useful when analyzing variables with nonpositive values), only modifies the action of the add() option (if specified, the add() option only adds # to variables with at least one nonpositive value) and level(#) specifies the percentage confidence level for confidence intervals.

The ci command may be used if one simply wants arithmetic means and corresponding confidence intervals.

2.5.4 CENTILE

This command reports the (per)centiles of the specified variables and their confidence intervals. By default, confidence intervals are obtained using a binomial method that makes no assumptions as to the underlying distribution of the variable. The syntax is:

centile [varlist] [if exp] [in range] [, centile(numlist) cci normal meansd level(#)]

where centile(numlist) specifies the centiles to be reported, for example centile(25 50 75) (if not specified, medians are reported), cci (conservative confidence interval) prevents centile from interpolating when calculating the distribution-free (binomial-based) confidence limits, normal specifies that confidence intervals are to be obtained assuming that both the data and the centiles are normally distributed, meansd calculates confidence intervals assuming that the estimated centiles themselves are normally distributed.

The related command

pctile newvar = exp

creates a new variable containing the percentiles of exp, where exp is typically just another variable.

2.5.5 CUMUL

This command creates a new variable containing the empirical distribution function (edf) of a variable. Its syntax is:

cumul varname [weight] [if exp] [in range] , gen(newvar) [freq by(varlist)]

where gen(newvar) specifies the name of the new variable to be created (it is not optional), freq requests the edf to be in frequency units; otherwise it is normalized so that newvar is 1 for the largest value of varname, and by(varlist) specifies that edf’s be generated separately for each by-group.

2.5.6 CORRELATE

This command reports the covariance or correlation matrix of the specified variables. Observations are excluded from the calculation due to missing values on a casewise basis. The syntax is:

correlate [varlist] [weight] [if exp] [in range] [, means noformat covariance wrap]

where means causes summary statistics (means, standard deviations, minima and maxima) to be displayed along with the matrix, noformat displays the summary statistic requested by the means option in g format regardless of the display formats associated with the variables, covariance displays the covariances rather than the correlation coefficients, and wrap requests that no action be taken on wide matrices to make them readable.

2.5.7 REGRESS

This command estimates linear regression models with a single response or dependent variable. Estimation is carried out by least squares (either ordinary least squares or weighted least squares). Its basic syntax is:

regress yvar [xvars] [weight] [if exp] [in range] [, level(#) noconstant regress_options]

where level(#) specifies the confidence level (in percent) for the regression parameters (the default is 95%), noconstant suppresses the constant term (intercept) in the regression, and the additional regress_options are described in more detail in Section 6.1.1.
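A small simulated example of the basic syntax. The data-generating process (intercept 1, slope 2, standard normal errors), the seed and the variable names x and y are all my own assumptions, chosen only so that the commands have something to run on.

```stata
* Simulate y = 1 + 2x + e with e ~ N(0,1)
drop _all
set obs 100
set seed 12345
generate x = uniform()
generate y = 1 + 2*x + invnorm(uniform())

* Ordinary least squares with the default 95% confidence intervals
regress y x

* Same model, 90% confidence intervals
regress y x, level(90)

* Regression through the origin on a subsample
regress y x if x > .5, noconstant
```

In the first two regressions the estimates should lie reasonably close to the values 1 and 2 used in the simulation; only the width of the reported confidence intervals changes with level().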

2.6 TABLES

Stata offers three basic commands for producing tables: table, tabulate and tabulate, summarize.

2.6.1 TABLE

This command provides tables of summary statistics. Its syntax is a little involved:

table rowvar [colvar [supercolvar]] [weight] [if exp] [in range] [, contents(clist) by(superrow_varlist) cw row col scol format(%fmt) center left concise missing replace name(string) cellwidth(#) csepwidth(#) scsepwidth(#) stubwidth(#)]

where contents(clist) specifies the content of the table’s cells (up to 5 statistics may be specified, if contents() is not specified it is assumed to be contents(freq)), clist is as in collapse, row specifies a row is to be added to the table reflecting the total across rows, col specifies a column is to be added to the table reflecting the total across columns, format(%fmt) specifies the display format for presenting numbers in the table’s cells, center specifies results are to be centered in the table’s cells (the default is to right align), left specifies that column labels are to be left aligned (the default is to right align), missing specifies that missing statistics are to be shown in the table as periods (the default is to leave them blank). See the on-line help or the Reference Manual for a description of the other options.
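As a sketch of the most common use, the lines below cross-classify a toy dataset by two hypothetical variables r and c and report, in each cell, the frequency and the mean of a third variable z. The syntax is that of the table command as described here; later Stata releases changed it.

```stata
* Toy data: the cell (r=2, c=2) contains two observations
drop _all
input r c z
1 1 10
1 2 20
2 1 30
2 2 40
2 2 60
end

* Frequencies and cell means of z, with row and column totals
table r c, contents(freq mean z) row col
```

The cell for r = 2 and c = 2 should show a frequency of 2 and a mean of 50.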

2.6.2 TABULATE

This command provides one- and two-way tables of frequency counts along with various measures of association, including the common Pearson chi-squared, the likelihood ratio chi-squared, Cramer’s V, Fisher’s exact test, Goodman and Kruskal’s gamma, and Kendall’s tau-b.

The syntax for one-way tables is:

tabulate varname [weight] [if exp] [in range] [, generate(varname) matcell(matname) matrow(matname) missing nofreq nolabel plot subpop(varname)]

The syntax for two-way tables is:

tabulate varname1 varname2 [weight] [if exp] [in range] [, all cell chi2 column exact gamma lrchi2 matcell(matname) matcol(matname) matrow(matname) missing nofreq nolabel row taub V wrap]

where all is equivalent to specifying chi2 lrchi2 V gamma taub, cell displays the relative frequency of each cell in a two-way table, chi2 calculates and displays Pearson’s chi-squared for the hypothesis that the rows and columns in a two-way table are independent, column displays in each cell of a two-way table the relative frequency of that cell within its column, exact displays the significance calculated by Fisher’s exact test, gamma displays Goodman and Kruskal’s gamma along with its asymptotic standard error, generate(varname) creates a set of indicator variables reflecting the observed values of the tabulated variable, lrchi2 displays the likelihood-ratio chi-squared statistic (the request is ignored if any cell of the table contains no observations), matcell(matname) saves the reported frequencies in the matrix matname, matcol(matname) saves the numeric values of the column stub in the vector matname, matrow(matname) saves the numeric values of the row stub in the vector matname, missing requests that missing values be treated like other values in calculations of counts, percentages, and other statistics, nofreq suppresses printing the frequencies, nolabel causes the numeric codes to be displayed rather than the value labels, plot produces a bar chart of the relative frequencies in a one-way table, replace indicates that the immediate data specified as arguments to the command are to be left as the current data in place of whatever data was there, row displays in each cell of a two-way table the relative frequency of that cell within its row, subpop(varname) excludes observations for which varname = 0 in tabulating frequencies, taub displays Kendall’s tau-b along with its asymptotic standard error, and V (note capitalization) displays Cramer’s V.
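A brief sketch of the one-way and two-way forms on a toy dataset. The variable names a, b and the stub adum are hypothetical.

```stata
* Toy data: two binary variables
drop _all
input a b
1 1
1 2
2 1
2 2
2 2
1 1
end

* One-way table of frequencies
tabulate a

* Two-way table with row percentages and Pearson's chi-squared
tabulate a b, chi2 row

* One-way table that also creates indicator variables adum1 and adum2
tabulate a, generate(adum)
describe
```

The generate() option is a convenient way of building a set of dummy variables, one for each observed value of the tabulated variable.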

2.6.3 TABSUM

The tabulate, summarize() command produces one- and two-way tables of summary statistics. Although table is better, tabulate, summarize() is faster. Its syntax is:

tabulate varname1 [varname2] [weight] [if exp] [in range] , summarize(varname3) [[no]means [no]standard [no]freq [no]obs wrap nolabel missing]

where summarize(varname3) identifies the name of the variable for which summary statistics are to be reported (if this option is not specified, then a table of frequencies is produced), [no]means includes only or suppresses only the means from the table (the summarize() table normally includes the mean, standard deviation, frequency, and, if the data is weighted, the number of observations), [no]standard includes only or suppresses only the standard deviations from the table, [no]freq includes only or suppresses only the frequencies from the table, [no]obs includes only or suppresses only the reported number of observations from the table, and missing requests that missing values of varname1 and varname2 be treated as categories rather than as observations to be omitted from analysis.

Examples:

. tabulate y, summarize(z) (one-way table)

. tabulate y1 y2, summarize(z) (two-way table)

. sort x (n-way table)

. by x: tabulate y1 y2, summarize(z) means nofreq

3

Graphics

Stata graphics are not very fancy, but allow considerable flexibility and are relatively simple to use.

3.1 BASIC SYNTAX AND GRAPHIC STYLES

The basic syntax of the graph command is:

graph [varlist] [weight] [if exp] [in range] [, options]

Typed without arguments, graph recalculates and redisplays the last graph. Thesyntax to review a saved Stata graph is:

graph using filename [filename] [, options]

An existing Stata graph can be translated to another format (e.g. PostScript) usingthe translate command and printed using the print command.

Examples:

. translate mygraph.gph mygraph.eps (converts to Encapsulated Post-Script). translate mygraph.gph mygraph.wmf (converts to Windows metafile). translate mygraph.gph mygraph.prn (converts to printer format). print mygraph.gph (print mygraph). print @Graph (print the graph in the Graph window)

Notice that graphs may also be produced by other Stata commands such as kdensity (nonparametric density estimation), ksm (regression smoothers) and logistic (logistic regression diagnostic plot).

Stata offers eight basic graph styles:

1. histogram,

2. twoway (two-way scatterplots),

3. matrix (two-way scatterplot matrices),

4. box (box plots),

5. oneway (one-way scatterplots),


6. star (star charts),

7. bar (bar charts),

8. pie (pie charts).

After discussing some options of the graph command that are common across all styles (Section 3.2), I shall focus on the first four styles.

3.2 COMMON GRAPH OPTIONS

In this section I briefly discuss some of the general options of the graph command. I will henceforth refer to these options as common_options.

• Saving a graph to disk: saving(filename [, replace]). If an extension is not specified, .gph is assumed.

• Printing a graph: after the graph command, use the Print button in the Stata toolbar.

• Multiple-imaging options: by(varname) is allowed for all styles except matrix and star. It requests that graphs be drawn separately for the groups defined by varname and be combined into a single image.

• Specifying titles: graph allows up to two titles on every side of the graph (top, bottom, left and right), denoted by the options t1, t2, b1 (same as title or ti), b2, l1, l2, r1, and r2. The first title (e.g. t1) is always the one farther from the figure, the second (e.g. t2) the one closer to it. The argument of each option is some text enclosed in quotes. Quotes can be omitted if the text contains no special characters.

Example:

. graph y x, l1(y) b2(x) title("Figure 1: x-y scatterplot")

• Setting the gap: gap(#) sets the amount of space between the left title and the values along the y-axis. The default is gap(8).

• Labeling axes: by default, graph labels just the minimum and maximum of each variable. More aesthetically pleasing results may be obtained with the options {x|y|r|t}label[(#,...,#)]. Typed without arguments, {x|y|r|t}label chooses “round” values to be labelled.

• Adding ticks: graph automatically places tick marks on axes anywhere they are labelled. Additional ticking may be obtained with the options {x|y|r|t}tick[(#,...,#)].

• Adding lines: lines across the graph may be drawn with the options {x|y|r|t}line[(#,...,#)]. The yline and rline options draw horizontal lines, xline and tline draw vertical lines.


Example:

. graph y x, gap(4) xlabel ylabel(0,1) ytick(.5) yline(0,.5,1)

• Setting the scale: by default, graph scales each axis according to the minimum and maximum of all things that go on the axis (data, labeling or ticking). The options {x|y|r}scale(#,#) may be used to widen (but never to narrow) the scale used for drawing a graph on any style that has an axis.

• Setting the axes rendition: by default, graph draws an axis on any style that has an axis. The border option replaces axes with borders. The noaxis option suppresses both axes and borders.

• Creating log scales: the log option is used with the histogram style, the {x|y|r}log options with the twoway style.

• Plotting symbols: graph uses the following plotting symbols to specify the location of a point on a scatterplot: O (large circle, default for twoway), S (large square), T (large triangle), o (small circle, default for twoway with by, or matrix), d (small diamond), p (small plus), . (dot), i (invisible), [varname] (variable to be used as text), _n (observation number).

The sequence of plotting symbols for the variables in varlist is specified with the option symbol(s ... s), where s is any of the above symbols. If symbol is not specified, twoway chooses the symbols O, T, S, and . for the remaining variables. Combined with by(), it chooses instead o, p, d, and . for the remaining variables.

• Connecting points: graph offers the following alternatives to connect points on a scatterplot: . (do not connect, default), l (straight lines between points), L (straight line between ascending x-points), m (connect median bands using straight lines), s (connect median bands using cubic splines), J (connect rectilinearly making steps), || (connect two variables vertically (high-low)), II (same as || but cap bottom and top of line).

How the variables in varlist are connected is specified by the option connect(s ... s), where s is any of the above alternatives. If connect is not specified, points are not connected.

The connect option connects points in the order of the data, not the order of the x-axis. graph includes a sort option that automatically sorts the data according to the x-axis before graphing.

• Line patterns: for each line type, one can specify the pattern of the line by adding a [pattern] after the line type, where pattern is any combination of the following: l (a solid line, the default), _ (a long dash), - (a medium dash), . (a short dash, almost a dot) and # (a space).

Example:

. graph y1 y2 y3 x, symbol(...) connect(ll[_]l[-])


The set textsize # command controls the size of the text used in a graph. This is not an option but a separate command that must be issued before graph.

I refer to the Graphics Manual for other common options and to the remainder of this chapter for options specific to the various graph styles.

3.3 HISTOGRAMS

This is the default for graph when only one variable is specified. The basic syntax is:

graph [variable] [weight] [if exp] [in range], histogram [common_options bin(#) {freq|percent} normal[(#,#)] density(#)]

where bin(#) specifies the number of (equally spaced) bins to use for constructing the histogram (the default is bin(5)), freq and percent affect how the vertical axis is labeled (respectively, in frequency units and in percent), normal[(#,#)] overlays a normal density with specified mean and standard deviation (normal by itself uses the observed mean and standard deviation), and density(#) (only used with normal) specifies the number of points along the density to be calculated (the default is density(100)).

Examples:

. graph x (draws a histogram of x)

. graph x, bin(15) (uses 15 bins for histogram)

. graph x, normal(10,3) (overdraws a normal density with mean 10 and std. dev. 3)

3.4 TWO-WAY SCATTERPLOTS

This is the default for graph when more than one variable is specified. twoway may be combined with oneway or box, but in that case, one must specify twoway explicitly.

The basic syntax is:

graph [yvars xvar] [weight] [if exp] [in range], twoway [common_options rescale rbox {y|x|r}reverse]

where rescale scales each y-variable independently (if there are two y-variables, the scale of the first is presented on the left axis and the scale for the second on the right axis; if there are more than two y-variables, no vertical scale is labeled), rbox places a rangefinder box plot on the graph, and {y|x|r}reverse reverses the indicated scale to run from high to low.

Examples:

. graph y x (graph of y against x)

. graph z y x, rescale (graph of z and y against x)


3.5 TWO-WAY SCATTERPLOT MATRICES

A two-way scatterplot matrix is a set of two-way scatterplots arranged in a matrix. The basic syntax is:

graph [varlist] [weight] [if exp] [in range], matrix [common_options half]

where half draws only the lower half of the matrix.

Example:

. graph x y z if z>0, matrix

3.6 BOX PLOTS

A box plot is a graphical procedure with the following features: (i) it combines a measure of location (the median) and a measure of spread (the interquartile range), (ii) it shows the presence of possible outliers, and (iii) it provides some indication about the shape of the distribution of the data in terms of their symmetry or skewness. The basic syntax is:

graph [varlist] [weight] [if exp] [in range], box [common_options [no]alt vwidth root]

where [no]alt forces the labeling of the groups to be on a single line (noalt) or on multiple lines, vwidth makes the width of the box proportional to the number of observations, and root (only used with vwidth) makes the width of the box proportional to the square root of the number of observations.

Examples:

. graph y x, box (graphs box-and-whiskers for y and x)

. graph y, box by(z) (graphs box-and-whiskers for y by z groups)

. graph y x, box by(z) (graphs box-and-whiskers for y and x by z groups)

4

Programming and Matrix Commands

In this note I discuss the elements of Stata programming and the Stata matrix language.

4.1 PROGRAMMING STATA

The capabilities of Stata may be extended considerably by using programs. A Stata program is just a sequence of Stata commands enclosed between the commands program define progname and end. Thus, the general structure of a Stata program is

program define progname

Stata commands

end

Programs must be defined (loaded in memory) before they can be used. The simplest way to do so is to type the commands directly from the keyboard. This is not recommended, however, unless the program is short. Alternative ways of defining programs are described in Sections 4.2 and 4.3.

Programs are executed by typing progname. Displaying of the underlying commands is suppressed.

Programs may call other programs. Stata allows programs to be nested 32 deep.

4.1.1 MACROS

A macro is a user-defined string of characters, called the macro name, that stands for another string of characters, called the macro content.

Stata has two types of macros, local and global. Local macros are private, that is, specific to the program where they are defined; global macros are public. Their content is set by the local and global commands, respectively. Their general syntax is

{local|global} mname [[`]"[string]"[']|= exp|: extended_fcn]

where the macro name mname can be up to 7 characters long for local macros and up to 8 characters for global macros, and exp may be either a numeric or a string expression. For the use of an extended macro function see the Stata manual.


To copy string to mname (the maximum length of string is 18,623 characters) use:

{local|global} mname "string"

To evaluate exp and store the result in mname (the maximum length of exp is 80 characters) use:

{local|global} mname = exp

Macros can be used everywhere in programs, do-files and ado-files (see below). The content of a local macro is accessed by enclosing the macro name in `' (a left single quote before and a right single quote after), that of a global macro by prefixing its name with a $. This simply replaces the name of the macro with its content.

Examples:

local options "gap(4) sy(i) xlab(10) ylin"
graph x y, `options'

sort z
global options "by(z) gap(4) sy(.) c(l) xlab ylab"
graph x y, $options

If the string contains double quotes, compound double quotes (`" to open and "' to close) may be used to define the macro.

Typing macro drop mname eliminates the global macro mname. Typing macro drop _all eliminates all global macros.
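The difference between copying a string and evaluating an expression is easy to miss. The following sketch (macro names a, b and opts are arbitrary) shows both forms, as well as how local and global macros are referenced.

```stata
* copy a string: the macro holds the unevaluated text 2+2
local a "2+2"

* evaluate an expression: the macro holds the result, 4
local b = 2+2

display "`a' evaluates to `b'"

* a global macro is referenced by prefixing its name with $
global opts "detail"
display "options: $opts"

* remove the global macro when it is no longer needed
macro drop opts
```

The display command should print "2+2 evaluates to 4", since only the second macro was defined with an equal sign.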

4.1.2 SYSTEM MACROS

In addition to user-defined macros, Stata has a number of built-in global system macros that begin with the characters S_.

Examples:

$S_DATE : contains the current date in the format dd mon yyyy
$S_TIME : contains the current time in the format hh:mm:ss
$S_FN : contains the filename last specified with use or save

User-written programs may examine and, in general, change the content of system macros. Typing macro drop _all does not eliminate system macros, and the content of some system macros, such as S_DATE and S_TIME, cannot be changed.

4.1.3 LOOPING

Stata provides two commands for looping, while and forvalues. The syntax of while is simpler:

while exp {

Stata commands

}


This command evaluates exp and, if it is true (nonzero), executes the commands enclosed in the braces. It then repeats the process until exp evaluates to false (zero). while loops may be nested. If exp refers to any variables, their values in the first observation are used unless explicit subscripts are specified.

Example: The following code fragment may be used to iterate Stata commands 10 times:

local i = 1
local I = 10
while `i'<=`I' {
    Stata commands
    local i = `i'+1
}

The while command may also be used interactively.
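As a concrete instance of the fragment above, the following sketch accumulates the sum of the integers from 1 to 10 in a local macro (the name total is arbitrary).

```stata
* sum the integers 1, ..., 10 with a while loop
local i = 1
local total = 0
while `i' <= 10 {
    local total = `total' + `i'
    local i = `i' + 1
}
display "sum = `total'"
```

Forgetting the line that increments `i' is the most common mistake: the condition then never becomes false and the loop does not terminate.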

4.1.4 BRANCHING

The syntax of this programming command is:

if exp {
    Stata commands
}
else {
    other Stata commands
}

This command evaluates exp. If the result is true (nonzero), the commands inside the braces are executed. If the result is false (zero), those statements are ignored and the statements following the else, if specified, are executed.

Do not confuse this command with the if qualifier at the end of a command.

Example:

if x>0 {
    replace y = log(x)
}
else if x<0 {
    replace y = log(-x)
}
else {
    replace y = .
}
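The distinction between the if command and the if qualifier can be seen on a three-observation dataset. This is a sketch with invented variables x and y: when a variable appears in the expression of the if command, only its value in the first observation is used, whereas the if qualifier is applied observation by observation.

```stata
drop _all
set obs 3
* x takes the values -1, 0, 1
generate x = _n - 2
generate y = .

* if command: evaluated once, using x[1] = -1, so the block is skipped
if x > 0 {
    replace y = 1
}

* if qualifier: evaluated for every observation, so only x = 1 is changed
replace y = log(x) if x > 0
list
```

After these commands y is missing in the first two observations and equal to log(1) = 0 in the third.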

4.1.5 PROGRAM ARGUMENTS

Programs may take arguments, just like functions. Unlike functions, however, the arguments of a program are not enclosed in parentheses but simply follow the program name.


For example, if prog1 is a program and we type

prog1 x y

then x and y are the program’s arguments, respectively the first and the second argument.

Arguments are passed to programs via the local macros `0', `1', `2', ..., where the local macro `0' is exactly what the user typed, `1' is the first argument of the program, `2' the second argument, etc.

The positional macros `1', `2', etc., may be renamed to facilitate reading and understanding of a program. Thus, for example, the following two programs both produce a sequence of n equally spaced numbers between a and b (a deterministic grid on the interval [a, b], not pseudo-random draws):

program define prog1
    drop _all
    set obs `1'
    generate x = `2'+(_n-1)/(_N-1)*(`3'-`2')
end

program define prog2
    args n a b
    drop _all
    set obs `n'
    generate x = `a'+(_n-1)/(_N-1)*(`b'-`a')
end

Sometimes programs involve a variable number of arguments, with the same thing done to each argument. An example is the summarize command, which may be applied to one, two or more variables.

Programs with this feature may be coded by shifting through their arguments:

program define myprog
    while "`1'" ~= "" {
        Stata commands in terms of `1'
        macro shift
    }
end

where macro shift shifts `1', `2', `3', ..., one to the left: what was `1' disappears, what was `2' becomes `1', what was `3' becomes `2', etc. The while loop continues the process until the macro `1' is empty.

An alternative is the following:

program define myprog
    local i = 1
    while "``i''" ~= "" {
        Stata commands in terms of ``i''
        local i = `i'+1
    }
end
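The shifting idiom can be checked with a small program that simply counts its arguments. This is a sketch: the program name countargs is invented, and the rclass option and the return command used to hand the count back to the caller are those described in Section 4.1.7.

```stata
capture program drop countargs
program define countargs, rclass
    local k = 0
    * loop until the first positional macro is empty
    while "`1'" ~= "" {
        local k = `k' + 1
        * discard `1' and move the remaining arguments one to the left
        macro shift
    }
    return scalar k = `k'
end

countargs a b c
display "number of arguments: " r(k)
```

Called with the three arguments a, b and c, the program returns 3 in r(k).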

4.1.6 TEMPORARY OBJECTS

Programs often require objects (variables, scalars, matrices, data, etc.) that are temporary, that is, can be discarded once the program completes.

Stata provides three commands to deal with this: tempvar creates names for temporary variables, tempname creates names for temporary scalars and matrices, and tempfile creates names for temporary files. They all have the same syntax:

{tempvar|tempname|tempfile} mname [mname . . . ]

The command creates local macros containing names one may use.

Example:

...
tempvar x y
gen `x' = exp
gen `y' = exp
...

The drop `x' `y' command is not necessary when the program completes, because Stata automatically drops any variables with names assigned by tempvar.
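A minimal sketch of how tempvar and tempname are typically used inside a program follows. The program name meandev and the variable x are invented for illustration; the program computes the mean absolute deviation of a variable from its mean and returns it through r(), as described in Section 4.1.7.

```stata
capture program drop meandev
program define meandev, rclass
    args v
    * names for a temporary variable and a temporary scalar
    tempvar d
    tempname m
    quietly summarize `v'
    scalar `m' = r(mean)
    generate `d' = abs(`v' - `m')
    quietly summarize `d'
    return scalar md = r(mean)
end

drop _all
set obs 5
generate x = _n
meandev x
display "mean absolute deviation: " r(md)
describe
```

For x = 1, ..., 5 the result is 1.2, and describe should show that only x remains in the dataset once the program has completed.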

4.1.7 EXCHANGING RESULTS BETWEEN PROGRAMS

Stata commands that report results save them in places where they can be subsequently used by other commands or programs. Most commonly, commands save results in one of two places:

1. r-class commands (such as summarize) save their results in r(),

2. e-class (estimation) commands (such as regress) save their results in e().

Typing return list after an r-class command or estimates list after an e-class (estimation class) command summarizes what the command saved.

Results saved in r() and e() come in three flavors: scalars, macros and matrices. For example, the number of observations used by a command is saved in the scalar r(N) or e(N). After an e-class command, the command name and the name of the response (dependent) variable are saved in the macros e(cmd) and e(depvar), whereas the estimated coefficients and their variance matrix are saved in the matrices e(b) and e(V).

Regardless of their flavor, one may refer to saved results in two ways. One is simply to type r(name) or e(name). The other is to use macro substitution characters to produce `r(name)' or `e(name)'.


Example: After regress

. display "You can refer to " e(cmd) " or to `e(cmd)'"
You can refer to regress or to regress

Notice that after running an r-class command, running another one would change the content of r() but not the content of e(). On the other hand, running a new e-class command may change the content of both e() and r(). Thus, if one wants to access the results produced by a command, it is important to do so immediately.

User-defined programs may save their results if their class is specified on the program define line through the option rclass or eclass, depending on whether the program is intended to be r- or e-class. The code to save results in r() is

return scalar name = exp
return local name ...
return matrix name matname

while the code to save results in e() is

estimates scalar name = exp
estimates local name ...
estimates matrix name matname
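The following sketch puts the two sides together: it first reads results saved by the r-class command summarize, and then defines a small r-class program that saves results of its own. The program name myrange and the variable x are invented for illustration.

```stata
drop _all
set obs 10
generate x = _n

* read results left behind by an r-class command
quietly summarize x
display "N = " r(N) ", mean = " r(mean)

* a user-written r-class program that saves a scalar and a macro
capture program drop myrange
program define myrange, rclass
    args v
    quietly summarize `v'
    return scalar range = r(max) - r(min)
    return local varname "`v'"
end

myrange x
display "range of `r(varname)': " r(range)
return list
```

After myrange x, r(range) equals 9 and r(varname) contains x; the results previously saved by summarize are no longer available.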

4.2 DO FILES

A do-file is a standard ASCII (text) file containing a sequence of Stata commands, a separate command on each line. The sequence of commands is executed using the do or run commands, whose syntax is

{do|run} filename [arguments] [, nostop]

where nostop allows the do-file to continue executing even if an error occurs. If filename is specified without an extension, .do is assumed.

The difference between do and run is that do echoes the commands and their output, while run is silent.

A do-file completes the execution when: (i) the end of the file is reached, (ii) an exit is executed, or (iii) an error (nonzero return code) occurs (pressing Break while executing a do-file causes a nonzero return code and therefore stops the do-file).

A do-file may be used to define one or more programs or may call programs already defined. Do-files may also call other do-files. As for programs, Stata allows do-files to be nested 32 deep.

Here are some rules and recommendations for constructing a do-file.

• Start a do-file by typing version #, where # is the Stata release under which the file was written. This allows the do-file to run under later releases.

• Blank lines and comments may be included freely. Their proper use may considerably enhance understanding of a program.

• Comments may be included either by beginning a line with a ‘*’ (a star), or by placing the comment in /* */ delimiters. The /* */ delimiters can be put anywhere, at the end of a line or even in the middle.

Example:

version 7.0 /* do-file written under Stata 7.0 */
* read in the data
use mydata, clear
* summary statistics
summarize x y z, detail

• To avoid lines wider than the screen, the end-of-line delimiter may be changed from carriage return to, say, ‘;’ by including the #delimit ; command. The delimiter may later be changed back to carriage return by including the #delimit cr command.

• The output of a do-file may be sent to a log file by including the command log using filename [, replace]. Logging stops and the log file is closed when log close is encountered. Output to the log file is suppressed if run is used to execute a do-file.

• To prevent Stata from pausing when the screen is full, include the set more off command.

Do-files accept arguments, just like programs. Arguments are stored in the local macros `1', `2', and so on. For example, to repeat the same set of instructions for different variables one could write the do-file try.do

use mydata, clear
drop if `1'==.
summarize `2', detail

and then execute it by typing

. do try x y

The second command (drop if `1'==.) would be interpreted as drop if x==. because x is the first argument typed after do try, the third command (summarize `2') would be interpreted as summarize y because y is the second argument typed after do try.
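The recommendations above can be combined in a short do-file. The following is a sketch meant to be saved in a file (the name example.do is hypothetical) and executed with do, since #delimit has no effect when typed interactively; the data are generated inside the file so that it does not depend on any existing dataset.

```stata
version 7.0  /* release under which the file was written */
set more off

* build a small dataset
drop _all
set obs 4
generate x = _n

/* switch to ; as the end-of-line delimiter so that a long
   command can be split across several lines */
#delimit ;
generate y = 2*x
    + 1 ;
summarize x y ;
#delimit cr

* back to carriage return as the delimiter
list
```

While ; is the delimiter, every command must end with a semicolon, including comment lines that begin with a star; this is why the comment in that part of the file is written with /* */ delimiters before #delimit ; takes effect.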

4.3 ADO FILES

An ado-file defines a Stata command, although many commands (e.g. summarize or regress) are not defined by ado-files but are built directly into Stata.

An ado-file is an ASCII (text) file that contains a Stata program which defines (implements) a command. For example, the ci command produces confidence intervals and is implemented as an ado-file. This means that a file called ci.ado is stored in some directory that Stata can access.


Ado-files typically come with an associated help-file. Typing help ci (or pulling down Help and searching for ci) prompts Stata to look for the file ci.help, just as it does for the file ci.ado after the command ci is typed.

Stata looks for ado-files (and the associated help-files) in several places: the official ado-directories (the base directory and the updates directory), the personal ado-directories, and the current directory.

4.4 MATRIX COMMANDS

A Stata matrix is a rectangular array of double-precision numbers, none of which can be missing, and which is bordered by a row and a column of names. A vector is a special case of a matrix. Although Stata has scalars, they could also be handled as special cases of a matrix. Matrices can be used interactively or in programs, do-files and ado-files.

By default, the maximum matrix size is 40 × 40. The maximum matrix size can be increased to 800 × 800 by issuing the command set matsize 800. Thus, Stata matrices are unsuited for holding large amounts of data.

4.4.1 ROW AND COLUMN NAMES

Stata matrices always have row and column names. These names are used to produce “pretty” output. The matrix list command displays a matrix with its row and column names.

Row and column names have three parts: equation_name:ts_operator.subname. The first two parts may be blank.

Row and column names may be reset using the matrix rownames and matrix colnames commands.

4.4.2 SUBSCRIPTING AND SUBMATRICES

The basic syntax for subscripting is

matrix A = ...B[r,c] ...

where r and c are numeric or string scalar expressions.

Examples:

. matrix A = A/A[1,1]

. matrix B = A["weight","displ"]

. matrix D = G[1,"eq1:l1.gnp"]

The basic syntax for extracting submatrices is

matrix A = ...B[r0..r1, c0..c1] ...

where r0, r1, c0, and c1 are numeric or string scalar expressions.

Examples:


. matrix A = B[2..4, 3..6]

. matrix A = B[2..., 2...]

. matrix A = B[1, "price".."mpg"]

. matrix A = B["eq1:", "eq1:"]

The basic syntax for substituting submatrices is

matrix A[r,c] = ...

where r and c are numeric scalar expressions.

If the matrix expression to the right of the equal sign evaluates to a scalar or a 1 × 1 matrix, the indicated element of A is replaced. If the matrix expression evaluates to a matrix, the resulting matrix is placed in A with its upper left corner at (r, c).

Examples:

. matrix A[2,2] = B

. matrix A[rownumb(A,"price"), colnumb(A,"mpg")] = sqrt(2)
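The three operations (subscripting, extracting a submatrix and substituting a submatrix) can be tried on a small matrix typed in directly. This is a sketch; the column names price, mpg and weight are assigned only to illustrate subscripting by name.

```stata
* a 3 x 3 matrix entered row by row
matrix A = (1,2,3 \ 4,5,6 \ 7,8,9)
matrix colnames A = price mpg weight

* subscripting by position and by column name
matrix m = A[1, "mpg"]
display colnumb(A, "mpg")

* extract the lower left 2 x 2 block
matrix B = A[2..3, 1..2]

* overwrite the upper left 2 x 2 block with zeros
matrix A[1,1] = J(2,2,0)

matrix list A
matrix list B
```

After these commands B contains the rows (4, 5) and (7, 8), and A has zeros in its upper left 2 × 2 block while its other elements are unchanged.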

4.4.3 MATRIX OPERATORS AND FUNCTIONS

The matrix operators are:

• -B (negation),

• B’ (transposition),

• B \ C (adds the rows of C below the rows of B),

• B , C (adds the columns of C to the right of the columns of B),

• B + C (addition),

• B - C (subtraction),

• B * C (multiplication, including multiplication by a scalar),

• B / z (division by a scalar z),

• B # C (Kronecker product).

Parentheses may be used to enforce a particular order of evaluation.

Examples

. matrix C = (B + B’)/2

The matrix functions returning scalar are:

• mreldif(B,C) (relative difference),

• trace(B) (trace of a square matrix),

• det(B) (determinant of a square matrix),


• diag0cnt(B) (number of zeros on diagonal),

• rowsof(B) (number of rows),

• colsof(B) (number of columns),

• rownumb(A,s) (the first row number named s, where s is a string or string expression),

• colnumb(A,s) (the first column number named s, where s is a string or string expression),

• el(A,i,j) (the i, j element of A; this function is the same as A[i,j]).

Matrix functions returning scalar may be used in any expression context, not justmatrix expression contexts.

The matrix functions returning matrix are:

• I(n) (n× n identity matrix),

• J(n,m,z) (n×m matrix containing the constant element z),

• get(mname) (returns the system matrix mname),

• syminv(B) (inverse of a symmetric matrix; if B is not positive definite, returns a generalized inverse),

• inv(B) (inverse of a square matrix),

• sweep(B,j) (sweep of a square matrix; returns B with jth row/column swept),

• cholesky(B) (Cholesky decomposition of a symmetric matrix),

• corr(B) (correlation transform),

• diag(V) (V is a row or column n-vector; it returns a diagonal n × n matrix with diagonal elements equal to those of V),

• vecdiag(B) (returns a row vector containing the diagonal of a square matrix).

Examples

. matrix beta = syminv(X’*X)*X’*y

. display trace(X)

. matrix L = cholesky(0.1*I(rowsof(X)) + 0.9*X)

There are matrix utilities to list the currently defined matrices (matrix dir), display the contents of a matrix (matrix list), rename a matrix (matrix rename) and drop a matrix (matrix drop). The matrix drop _all command drops all matrices.
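A few of the operators and functions listed above can be verified on a 2 × 2 symmetric matrix whose trace and determinant are easy to compute by hand. This is only a sketch; the matrix names are arbitrary.

```stata
matrix B = (2,1 \ 1,3)

* symmetrization leaves a symmetric matrix unchanged
matrix C = (B + B')/2

* inverse of a symmetric matrix, and a check that B*inverse is the identity
matrix Binv = syminv(B)
matrix P = B*Binv
display mreldif(P, I(2))

* scalar-valued functions can be used in any expression
display trace(B)
display det(B)

* stack the 2 x 2 identity below B
matrix D = B \ I(2)
display rowsof(D) " x " colsof(D)

matrix dir
```

Here trace(B) = 2 + 3 = 5 and det(B) = 2·3 − 1·1 = 5, while D has 4 rows and 2 columns.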


4.4.4 CROSS-PRODUCT MATRICES

Statistical computations often involve matrix operations such as X'X or X'WX. In these cases, X usually has a large number of rows and a small to moderate number of columns, whereas W takes on a restricted form (diagonal, block diagonal, or known in some functional form so that it need not be stored). Computing X'X or X'WX by storing the matrices and then directly performing the matrix multiplications is inefficient and wasteful. Stata has a number of commands to compute these results efficiently.

The matrix accum command accumulates cross-product matrices from the data to form A = X'X.

The matrix glsaccum command accumulates cross-product matrices from the data using a specified inner weight matrix to form A = X'BX, where B is a block diagonal matrix.

The matrix vecaccum command accumulates the first variable against the remaining variables to form the row vector a = X1'X, where X = (X2, X3, ...).
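As an illustration, the cross-product matrix of a response and a regressor can be accumulated with matrix accum and then used to compute least squares coefficients without ever storing the data as a matrix. This is a sketch with invented variables y and x1; note that matrix accum and matrix vecaccum add a constant (_cons) as the last column unless told otherwise.

```stata
drop _all
set obs 4
generate x1 = _n
generate y = 1 + 2*x1

* A = Z'Z for Z = (y, x1, _cons), a 3 x 3 matrix
matrix accum A = y x1
matrix list A

* pick out X'X and X'y for X = (x1, _cons)
matrix XX = A[2..3, 2..3]
matrix Xy = A[2..3, 1]

* least squares coefficients: slope 2 and intercept 1
matrix b = syminv(XX)*Xy
matrix list b

* row vector y'X for X = (x1, _cons)
matrix vecaccum a = y x1
matrix list a
```

With these data the sum of x1·y is 70, the sum of y is 24 and the number of observations is 4, which are the entries A[2,1], A[3,1] and A[3,3] respectively.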

4.4.5 DATA TO MATRIX CONVERSION

Variables can be converted into matrices and matrices into variables through the mkmat and svmat commands. The mkmat command stores the k variables listed in varlist in k column vectors of the same name. Optionally, if matrix() is specified, they can be stored as a single matrix. The svmat command is the reverse of mkmat: it takes a matrix and stores its columns as new variables.

Their syntax is

mkmat varlist [if exp] [in range] [, matrix(matname)]

svmat [type] A [, names(col|eqcol|matcol|string)]

where type is a storage type for new variables, A is the name of an existing matrix, and the names(col|eqcol|matcol|string) option specifies how the new variables are to be named.

The related command

matname A namelist [, rows(range) columns(range) explicit]

renames the rows and columns of a matrix.

Examples:

. mkmat mpg

. mkmat foreign weight displ, matrix(X)

. matrix b = syminv(X’*X) * X’*mpg

. mkmat bvector1 if bvector1~=.

. matrix list bvector1

. matrix d = bvector1’

. matname d wei gr for _cons, c(.)

. matrix list d
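A round trip from variables to a matrix and back may help to fix ideas. In this sketch the variables x and z and the stub w are invented; names(w) asks svmat to call the new variables w1, w2, and so on.

```stata
drop _all
set obs 3
generate x = _n
generate z = _n^2

* store x and z as the two columns of the 3 x 2 matrix X
mkmat x z, matrix(X)

* manipulate the matrix
matrix Y = 2*X

* write the columns of Y back to the dataset as w1 and w2
svmat Y, names(w)
list
```

The new variables w1 and w2 should equal 2*x and 2*z respectively.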


4.4.6 GETTING SYSTEM MATRICES

The usual way to obtain matrices after a command that produces matrices is to refer to the returned matrix in the standard way. For example, all estimation commands (see Section 5.1) return the coefficient vector e(b) and the variance-covariance matrix e(V) of the estimates. These matrices can be referenced directly.

Examples:

. matrix list e(b)

. matrix S = vecdiag(e(V))

Other matrices are returned by various commands. They are obtained in the same way. Alternatively, the matrix get command also obtains matrices after certain commands.
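For instance, after a regression the coefficient vector and its variance matrix may be copied into ordinary matrices and manipulated. The data in this sketch are artificial and the matrix names b, V and v are arbitrary.

```stata
drop _all
set obs 5
generate x = _n
generate y = 1 + 2*x + mod(_n, 2)

quietly regress y x

* copy the system matrices
matrix b = e(b)
matrix V = e(V)

* row vector with the variances of the estimated coefficients
matrix v = vecdiag(V)

matrix list b
matrix list v
```

The first element of b is the coefficient on x and the second the intercept; the square roots of the elements of v are the corresponding standard errors.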

4.4.7 MATRIX DECOMPOSITION

The matrix symeigen command returns the eigenvectors of a symmetric n × n matrix in the columns of the n × n matrix X and the corresponding eigenvalues in the row n-vector V. The eigenvalues are sorted from largest to smallest: V[1,1] contains the largest eigenvalue and X[1...,1] its corresponding eigenvector; V[1,2] contains the second largest eigenvalue and X[1...,2] its corresponding eigenvector, and so on.

The singular value decomposition of an m × n matrix A (with m ≥ n) is carried out through the matrix svd command. This command returns an m × n matrix U, a row n-vector W and an n × n matrix V such that A = U diag(W) V'. In addition, the columns of U are orthogonal, the elements of W are positive or zero, and V is orthonormal.
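Both decompositions can be checked on a 2 × 2 symmetric matrix whose eigenvalues (3 and 1) are known in closed form. This is a sketch; the result names X, V, U, W and Vm are chosen by the user on the left of the equal sign.

```stata
matrix A = (2,1 \ 1,2)

* eigenvectors in the columns of X, eigenvalues in the row vector V
matrix symeigen X V = A
matrix list V

* singular value decomposition A = U*diag(W)*Vm'
matrix svd U W Vm = A

* rebuild A from its factors and compare with the original
matrix R = U*diag(W)*Vm'
display mreldif(R, A)
```

The relative difference between R and A should be zero up to rounding error.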

5

Statistical Inference Using Stata

5.1 ESTIMATION

5.1.1 GENERAL SYNTAX OF ESTIMATION COMMANDS

The general syntax of an estimation command is:

command varlist [weight] [if exp] [in range] [, options]

The first variable in varlist is the response or outcome variable, denoted by yvar; the other variables are the covariates or predictors, denoted by xvars.

The general syntax for multiple-equation commands, namely commands that estimate systems of equations, is similar:

command (varlist) (varlist) ... (varlist) [weight] [if exp] [in range] [, options]

All estimation commands share the following common features:

• To review the last estimates, just type the estimation command without arguments.

• In addition to the estimated parameters and their standard errors, confidence intervals for the coefficients are displayed. The confidence level may be set using the level(#) option, where # is the desired percentage level. The default is level(95).

• The estimated variance matrix of the estimators is computed under the assumption that the statistical model is correctly specified, but some commands allow for certain forms of model misspecification with the robust option.

• After estimation, one may obtain the estimated variance matrix of the estimators using the vce command (Section 5.2.2).

• After estimation, one may obtain predictions, residuals and influence statistics using the predict command (Section 5.2.3).

• After estimation, one can perform tests of hypotheses about the model parameters (Section 5.2.4).


• The command lincom computes point estimates, standard errors, t statistics, p-values and confidence intervals for a linear combination of coefficients after any estimation command except anova.

5.1.2 WEIGHTED ESTIMATION

Specifying weights allows weighted estimation. For example

. regress y x1 x2 [pweight=w]

gives a weighted least squares regression of y on x1 and x2 using the probability weights contained in the variable w.

5.1.3 CONSTRAINED ESTIMATION

Several commands (e.g. cnsreg) allow estimation subject to linear constraints on the model parameters through the constraint(clist) option, where clist is of the form #[-#][, #[-#] ...], with 1 ≤ # ≤ 999.

The constraint command defines, lists and drops linear constraints. Its syntax is:

constraint define # [exp=exp|coefficientlist]
constraint dir [clist|_all]
constraint drop {clist|_all}
constraint list [clist|_all]

where coefficientlist lists the variables whose coefficients are set equal to zero.

The following example estimates the linear model E(Y) = α + β1X1 + ... + β6X6 subject to the constraints that β1 = β2 = β3 = β6 and β4 = −β5 = α/10:

. constraint define 1 x1=x6



. constraint define 4 x4=-x5

. constraint define 5 x4=_cons/10

. cnsreg y x1-x6, constraint(1-3,4-5)
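A smaller, self-contained case may be easier to run than the six-regressor example. The following sketch uses artificial data (the variables x1, x2 and y are invented) and imposes a single constraint, namely that the coefficients on x1 and x2 are equal.

```stata
drop _all
set obs 20
generate x1 = _n
generate x2 = mod(_n, 5)
generate y = 1 + x1 + x2 + mod(_n, 3)

* constraint number 1: coefficient on x1 equals coefficient on x2
constraint define 1 x1 = x2
constraint list 1

* constrained least squares
cnsreg y x1 x2, constraint(1)
display _b[x1] - _b[x2]
```

After cnsreg, the two estimated coefficients coincide, so the displayed difference is zero up to rounding error.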

5.1.4 ROBUST VARIANCE ESTIMATES

Some commands (e.g. regress or glm) offer the option of estimating the variance matrix of the parameter estimators by relaxing the assumption that the statistical model is correctly specified and allowing for certain forms of model misspecification.

The robust option relaxes the assumption that the observations are identically distributed, thus allowing for heteroskedasticity of unknown form.

The following example estimates the linear model E(Y) = α + β1X1 + β2X2 respectively without and with heteroskedasticity-robust variance estimates:

. regress y x1 x2

. regress y x1 x2, robust


The robust cluster(varname) option only requires observations to be independent across clusters specified by the variable varname, thus relaxing the assumption of independence.
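The robust options change the estimated variance matrix but not the coefficient estimates. The sketch below, based on artificial data with invented variable names (id is a made-up cluster identifier with 5 clusters of 4 observations), makes this visible by comparing e(b) and e(V) across the two fits.

```stata
drop _all
set obs 20
generate x1 = _n
generate x2 = mod(_n, 5)
generate y = 1 + x1 + x2 + mod(_n, 3)*x1/10
generate id = int((_n-1)/4) + 1

* conventional variance estimates
quietly regress y x1 x2
matrix b0 = e(b)
matrix V0 = e(V)

* heteroskedasticity-robust variance estimates
quietly regress y x1 x2, robust
matrix b1 = e(b)
matrix V1 = e(V)

* the coefficients agree, the variance matrices do not
display mreldif(b0, b1)
display mreldif(V0, V1)

* variance estimates robust to dependence within clusters
regress y x1 x2, robust cluster(id)
```

The first displayed number is zero, the second is not.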

5.2 POST-ESTIMATION COMMANDS

5.2.1 ACCESSING COEFFICIENTS AND STANDARD ERRORS

After a (single-equation) estimation command, _b[varname] (or _coef[varname]) and _se[varname] contain the coefficient on varname and its standard error, both recorded to machine precision. In the case of a multiple-equation estimation command, use [eqno]_b[varname] (or simply [eqno][varname] or [eqno]varname) and [eqno]_se[varname], where eqno is the equation number.

The command mfx produces tables displaying the marginal effects or the elasticities (and their standard errors) instead of the estimated coefficients.
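A minimal sketch of how the stored coefficients and standard errors might be used after a single-equation command. The simulated data and the variable names are assumptions made for illustration only.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

regress y x1 x2

* Coefficient and standard error on x1
display _b[x1]
display _se[x1]

* t-ratio for x1, computed by hand from the stored results
display _b[x1]/_se[x1]

* The intercept is stored under the name _cons
display _b[_cons]
```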

5.2.2 DISPLAYING THE VARIANCE ESTIMATES

After model estimation, the estimated variance or correlation matrix of the estimators is displayed using the command

vce [, corr rho]

where corr and rho are synonyms and either displays the correlation matrix instead of the variance matrix.

To obtain a copy of the estimated variance matrix for further manipulation, type

matrix matname = e(V)
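The following sketch copies e(V) into a matrix and checks that the square root of its first diagonal element reproduces the standard error of the first regressor. The matrix name V and the simulated data are assumptions made for this example.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

regress y x1 x2

* Copy the estimated variance matrix and list it
matrix V = e(V)
matrix list V

* The first row and column refer to x1, so these two values should agree
display sqrt(V[1,1])
display _se[x1]
```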

5.2.3 PREDICTIONS AND RESIDUALS

The predict command calculates predictions, residuals and influence statistics after estimation. What predict can do depends to some extent on the previous estimation command. The general features of predict are:

1. The predict newvarname command creates newvarname containing the “predicted values” of the response. For example, after linear regression, predict newvarname creates the fitted values β̂′Xi. After probit, it creates the estimated probit probabilities Φ(β̂′Xi).

2. The predict newvarname, xb command creates newvarname containing the linear prediction β̂′Xi. For linear models, this command produces the same result as predict newvarname.

3. The predict newvarname, stdp command creates newvarname containing the standard error of the linear prediction.

4. Adding the nooffset option to any of the above performs the calculation ignoring any offset or exposure variable specified in the estimation command.


5. predict can be used to make in-sample or out-of-sample predictions.

6. In general, predict calculates the requested statistic for all observations possible, whether they were used in estimating the model or not.

7. One can restrict the prediction to the estimation sample by typing

. predict newvarname if e(sample), ...

8. Some statistics make sense only with respect to the estimation sample. In such cases, the calculation is automatically restricted to the estimation sample.

9. Out-of-sample predictions may be obtained by applying predict to other datasets.

Example:

. use data1

model estimation commands

. use data2 /* another dataset */

. predict hat, ... /* fill in the predictions */
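In-sample and out-of-sample prediction can also be illustrated within a single dataset. In the sketch below, the model is estimated on the first 50 observations only; the variable names and the simulated data are assumptions made for illustration.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

* Estimate on the first 50 observations only
regress y x1 x2 in 1/50

* Fitted values for every observation, in and out of sample
predict yhat_all

* Fitted values restricted to the estimation sample
predict yhat_in if e(sample)

* yhat_all is filled for all 100 observations, yhat_in for 50
summarize yhat_all yhat_in
```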

The options of predict are:

• xb calculates the linear prediction from the estimated model.

• stdp calculates the standard error of the linear prediction.

• stddp is allowed only after multiple-equation estimation commands. It computes the standard error of the difference in linear predictions between two equations.

• equation(eqno[,eqno]) is only relevant after multiple-equation estimation commands. It specifies to which equation one is referring.

equation(#1) means that calculations are to be made for the first equation, equation(#2) that they are to be made for the second, and so on. Alternatively, one could refer to the equations by their names. For example, equation(income) would refer to the equation named income, equation(hours) to the one named hours.

equation(#1) is the default when equation() is not specified.

Other statistics (for example stddp) refer to between-equation concepts. In those cases, one may use equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional.

• The nooffset option may be combined with most statistics and specifies that the calculation should be made ignoring any offset or exposure variable specified when the model was estimated. This option is available even if not documented for predict after a specific command. If neither the offset() nor the exposure() option was specified at the model estimation stage, specifying nooffset does nothing.

• other_options refers to command-specific options that are documented with each command.

5.2.4 HYPOTHESIS TESTING

The test command performs Wald-type tests of linear hypotheses, the testnl command performs Wald-type tests of nonlinear (or linear) hypotheses, and the lrtest command performs likelihood-ratio tests after ML estimation.
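A short sketch of Wald tests after a linear regression. The hypotheses tested and the simulated data are chosen only for illustration and are not taken from the text.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

regress y x1 x2

* Linear hypothesis: the coefficients on x1 and x2 are equal
test x1 = x2

* Joint linear hypothesis: both coefficients are zero
test x1 x2

* Nonlinear hypothesis: the ratio of the two coefficients equals 1
testnl _b[x1]/_b[x2] = 1
```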

5.3 BOOTSTRAPPING AND MONTE CARLO SIMULATIONS

Bootstrap and Monte Carlo simulations rely on Stata’s uniform() random number generator. Reproducibility of the results requires setting the random-number seed by typing set seed #.

5.3.1 BOOTSTRAP

The command

bstrap progname [, reps(#) size(#) dots args(...) level(#)
    cluster(varnames) idcluster(newvarname) saving(filename)
    double every(#) replace noisily]

runs the user-defined program progname reps(#) times on bootstrap samples of size size(#).

The command

bs "command" "exp_list" [, bstrap_options nowarn noesample]

runs the user-specified command, bootstrapping the statistics specified in exp_list. The expressions in exp_list must be separated by spaces and there must be no spaces within each expression. Note that command and exp_list must both be enclosed in double quotes. This command takes the same options as bstrap except for args().

The command

bstat varlist [, stat(#) level(#) title(text)]

displays bootstrap estimates of standard error and bias, and calculates confidence intervals using three different methods: normal approximation, percentile, and bias-corrected. The bstrap and bs commands automatically run bstat after completing all the bootstrap replications. If the user specifies the saving(filename) option with bstrap or bs, then bstat can be run on the data in filename to view the bootstrap estimates again.

Finally, the command

bsample [exp] [, cluster(varnames) idcluster(newvarname)]


is a low-level utility for those who prefer not to use bstrap or bs. It draws a sample with replacement from the existing data; the sample replaces the data in memory; exp specifies the size of the sample and must be less than or equal to _N. If exp is not specified, a sample of size _N is drawn (or of size n_c when the cluster() option is specified, where n_c is the number of clusters).
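Because bsample replaces the data in memory, it is natural to wrap it in preserve and restore. The sketch below draws a single bootstrap sample of size _N from simulated data. The variable name x and the seed are assumptions made for this example.

```stata
* Simulated data (hypothetical variable x)
clear
set obs 100
set seed 12345
generate x = uniform()

* Mean of the original sample
summarize x

* Keep a copy of the data, then draw one bootstrap sample of size _N
preserve
bsample

* Mean of the bootstrap sample
summarize x

* Bring back the original data
restore
```

Repeating the preserve, bsample, restore steps many times and storing the statistic of interest at each pass gives a hand-made version of what bstrap and bs automate.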

5.3.2 MONTE CARLO SIMULATION

The simul command is aimed at easing the programming task of performing Monte Carlo simulations. Its syntax is:

simul progname, reps(#) [args(whatever) dots double
    saving(filename) every(#) replace noisily]

where progname is the name of a program that performs a single simulation, reps(#) (not optional) specifies the number of replications to be performed, and args(whatever) specifies any arguments to be passed to progname.

Typing "simul progname, reps(#)" runs progname # times and collects the results.

6

Statistical Models in Stata

A broad range of statistical models may be estimated directly using the available Stata commands. In this note, I focus on estimation of linear models (Section 6.1), generalized linear models (Section 6.2) and parametric models for duration data (Section 6.4.1), and only provide a brief description for a number of other models.

6.1 LINEAR MODELS

Stata offers several commands for estimating linear models.

6.1.1 ORDINARY LEAST SQUARES

The regress command estimates a linear model by least squares (ordinary least squares or weighted least squares). Its syntax is:

regress yvar [xvars] [weight] [if exp] [in range] [, level(#)
    beta robust cluster(varname) hc2 hc3 hascons noconstant
    tsscons noheader eform(string) depname(varname) mse1 plus]

where beta requests that the normalized regression coefficients be reported instead of confidence intervals, robust and cluster(varname) have been discussed in Section 5.1.4, hc2 and hc3 specify alternative bias corrections for robust (they may not be specified with cluster()), hascons indicates that a user-defined constant or its equivalent is specified among the independent variables (some caution is recommended when using this option as resulting estimates may not be as accurate as they otherwise would be), noconstant suppresses the constant term (intercept) in the regression, tsscons forces the total sum of squares to be computed as though the model has a constant (i.e., as deviations from the mean of the dependent variable), and noheader, eform(), depname(), mse1 and plus are for ado-file writers.

The syntax of predict following regress is:

predict [type] newvarname [if exp] [in range] [, statistic]

where, in addition to xb (the default) and stdp, statistic may be: pr(a,b) (Pr(a < Y < b)), e(a,b) (E(Y | a < Y < b)), ystar(a,b) (E[max(a, min(Y, b))]), cooksd (Cook’s distance), leverage|hat (diagonal elements of the hat matrix), residuals (residuals), rstandard (standardized residuals), rstudent (Studentized or jackknifed residuals), stdf (standard error of the forecast), stdr (standard error of the residual), covratio (COVRATIO), dfbeta(varname) (DFBETA for varname), dfits (DFITS), or welsch (Welsch distance).

In addition to predict, the following commands can be used after regress for diagnosing sensitivity to individual observations:

• avplot (graphs an added-variable or leverage plot),
• cprplot (graphs a partial residual plot),
• lvr2plot (graphs a leverage vs. squared residual or L-R plot),
• rvfplot (graphs a residual-versus-fitted plot),
• rvpplot (graphs a residual-versus-predictor plot),
• ovtest (performs Ramsey’s RESET test for omitted variables),
• hettest (performs the Cook-Weisberg test for heteroskedasticity),
• dwstat (computes the Durbin-Watson test statistic),
• dfbeta (calculates the DFBETAs),
• vif (calculates the variance inflation factors).
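As an illustration of the predict statistics listed above, the sketch below stores residuals, leverage and Cook’s distance after regress. The new variable names (res, lev, cook) and the simulated data are assumptions made for this example.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

regress y x1 x2

* Residuals, diagonal of the hat matrix, and Cook's distance
predict res, residuals
predict lev, leverage
predict cook, cooksd

* Inspect the range of each diagnostic
summarize res lev cook
```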

6.1.2 CONSTRAINED LINEAR REGRESSION

The cnsreg command estimates constrained linear regression models. Its syntax is:

cnsreg yvar xvars [weight] [if exp] [in range],
    constraints(numlist) [level(#)]

where constraints(numlist) (not optional) specifies the constraint numbers of the constraints to be applied (see Section 5.1.3). Constraints are defined using the constraint command.
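A minimal runnable sketch of constrained estimation, imposing equality of two slope coefficients. The constraint number, the variable names and the simulated data are assumptions made for illustration.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 + 2*x2 + (uniform() - 0.5)

* Constraint 1: the coefficients on x1 and x2 are equal
constraint define 1 x1 = x2

* Constrained linear regression
cnsreg y x1 x2, constraints(1)

* The two estimated coefficients should coincide
display _b[x1] - _b[x2]
```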

6.1.3 LINEAR INSTRUMENTAL VARIABLES

The ivreg command estimates a linear regression model using instrumental variables (or two-stage least squares) of yvar on xvars1 and xvars2 using ivars (along with xvars1) as instruments for xvars2. The variables in xvars1 and ivars are the exogenous variables, those in xvars2 are the endogenous variables.

The syntax of this command is:

ivreg yvar [xvars1] (xvars2=ivars) [weight] [if exp] [in range]
    [, level(#) beta hascons noconstant robust cluster(varname)
    first noheader eform(string) depname(varname) mse1]


6.2 GENERALIZED LINEAR MODELS

Stata offers a single and very flexible command (glm) to estimate generalized linear models (McCullagh & Nelder 1989). It also offers, for selected models in this class, special commands (e.g. logit, probit, poisson) with a broader and more specific set of options, especially diagnostics and other post-estimation output.

6.2.1 GLM

The glm command fits generalized linear models. Estimation is carried out either by iteratively reweighted least squares (IRLS) or by using the Newton-Raphson (NR) method, which is the default.

The basic syntax is:

glm yvar [xvars] [weight] [if exp] [in range] [, max_options
    var_options output_options spec_options]

The max_options are:

iterate(#) ltolerance(#) mu(varname) nolog search fisher(#) irls

where iterate(#) specifies the maximum number of iterations allowed in estimating the model (iterate(50) is the default), ltolerance(#) specifies the convergence criterion for the change in deviance between iterations (ltolerance(1e-6) is the default), mu(varname) specifies varname as the initial estimate for the mean of yvar, nolog suppresses the iteration log, search specifies that the command should search for good starting values, fisher(#) specifies the number of NR steps that should use the Fisher scoring Hessian or expected information matrix before switching to the observed information matrix (both search and fisher() are only useful with NR optimization, not with IRLS), and irls requests IRLS minimization of the deviance instead of NR maximization of the log-likelihood.

The var_options are:

oim opg vfactor(#) robust cluster(varname) unbiased
nwest(wtname [#]) jknife jknife1 bstrap brep(#)
scale(x2|dev|#) disp(#) score(newvar) t(varname)

where oim specifies that the variance matrix should be calculated using the observed information matrix rather than the usual expected information matrix (option ignored if irls is not specified), opg specifies that the variance matrix be calculated using the Berndt, Hall, Hall, and Hausman (1974) variance estimator (this option is not allowed when cluster() is specified), vfactor(#) specifies a scalar by which to multiply the resulting variance matrix, robust and cluster(varname) have already been defined, unbiased specifies that the unbiased sandwich estimate of variances be used (robust is implied when unbiased is used), nwest(wtname [#]) specifies that a heteroskedasticity and autocorrelation consistent variance estimate be used, jknife and jknife1 specify that jackknife estimates of variance be used, bstrap specifies that the bootstrap estimate of variance be used, brep(#) specifies the number of bootstrap samples to consider in forming the bootstrap estimate (the default is brep(199)), scale(x2|dev|#) overrides the default scale parameter (by default, scale(1) is assumed for discrete distributions and scale(x2) for continuous distributions), scale(x2) specifies the scale parameter be set to the Pearson chi-squared (or generalized chi-squared) statistic divided by the residual degrees of freedom, scale(dev) sets the scale parameter to the deviance divided by the residual degrees of freedom (this provides an alternative to scale(x2) for continuous distributions and over- or under-dispersed discrete distributions), scale(#) sets the scale parameter to #, disp(#) multiplies the variance of yvar by # and divides the deviance by #, score(newvar) creates the new variable newvar containing each observation’s contribution to the score, and t(varname) specifies the variable name corresponding to the time index (this option is required if nwest() is specified).

The output_options are:

eform level(#) trace noheader nodisplay nodots

where eform displays the exponentiated coefficients and corresponding standard errors and confidence intervals (for binomial models with the logit link, exponentiation results in odds ratios; for Poisson models with the log link, exponentiated coefficients are rate ratios), trace requests that the estimated coefficient vector be printed at each iteration, noheader suppresses the header information from the output (the coefficient table is still printed), nodisplay suppresses the output (the iteration log is still displayed), and nodots specifies that a dot should not be printed for each fitted model when calculating jackknife or bootstrap estimates (by default, a single dot character is printed for each estimation that is performed).

The spec_options are:

family(familyname) link(linkname) noconstant [ln]offset(varname)

where family(familyname) specifies the parametric family, link(linkname) specifies the link function, noconstant specifies that the linear predictor has no intercept term, and [ln]offset(varname) specifies an offset to be added to the linear predictor.

familyname is either a user-written program or one of: binomial (Bernoulli/binomial), gamma, gaussian, igaussian (inverse Gaussian), nbinomial (negative binomial), poisson.

linkname is either a user-written program or one of: cloglog (complementary log-log), identity, log, logit, logc (log-complement), loglog (log-log), nbinomial (negative binomial), opower # (odds power), power #, probit.

If family() is specified but not link(), then the canonical link for the family is used, namely:

• link(identity) for family(gaussian) (same as regress),
• link(power -2) for family(igaussian),
• link(logit) for family(binomial) (same as logit),
• link(log) for family(poisson) (same as poisson),
• link(log) for family(nbinomial) (same as nbreg),
• link(power -1) for family(gamma).
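The correspondence between glm with the Gaussian family and regress can be checked directly. In the sketch below, which uses simulated data and hypothetical variable names, the three commands should report the same coefficient estimates; the reported standard errors may differ slightly because glm and regress do not use the same reference distribution and scaling by default.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 100
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

* Ordinary least squares
regress y x1 x2

* Gaussian family with the identity link stated explicitly
glm y x1 x2, family(gaussian) link(identity)

* Same model, relying on the canonical link for the family
glm y x1 x2, family(gaussian)
```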

The syntax of predict after glm is:

predict [type] newvarname [if exp] [in range] [, statistic nooffset
    standardized studentized modified adjusted]

where, in addition to xb and stdp, statistic may be: mu (predicted mean of the response, the default), eta (same as the xb option), cooksd (Cook’s distance), deviance (deviance residual), hat (diagonal of the hat matrix), likelihood (likelihood residual), pearson (Pearson residual), response (response residual), score (score residual), or working (working residual).

6.2.2 LOGIT AND PROBIT

The logit command estimates logit models by maximum likelihood (ML). Its syntax is:

logit yvar [xvars] [weight] [if exp] [in range] [, level(#)
    nocoef noconstant or robust cluster(varname) score(newvar)
    offset(varname) asis max_options]

where yvar==0 indicates a negative outcome, and yvar~=0 & yvar~=. (typically yvar==1) indicates a positive outcome.

The logistic command is the same as logit, except that it reports odds ratios rather than coefficients by default.

The following commands can be used after either logit or logistic to explore the nature of the fit: lfit (performs goodness-of-fit tests), lstat (reports summary statistics including a classification table), lroc (graphs the ROC curve), and lsens (graphs sensitivity and specificity versus probability cutoff).

The syntax of predict following logit or logistic is:

predict [type] newvarname [if exp] [in range] [, statistic rules
    asif nooffset]

where, in addition to xb and stdp, statistic may be: p (predicted probability of a positive outcome, the default), dbeta (Delta-Beta influence statistic, Pregibon 1981), deviance (deviance residual), dx2 (Delta chi-squared influence statistic, Hosmer & Lemeshow 1989), ddeviance (Delta-D influence statistic, Hosmer & Lemeshow 1989), hat (leverage, Pregibon 1981), number (sequential number of the covariate pattern), or residuals (Pearson residual).

The probit command is completely analogous and estimates probit models by ML.
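A small sketch of the three binary-response commands applied to the same simulated outcome. The variable names (d, x1, phat) and the data-generating rule are assumptions made only for this example.

```stata
* Simulated binary outcome whose probability rises with x1 (hypothetical)
clear
set obs 200
set seed 12345
generate x1 = uniform()
generate d = (uniform() < 0.2 + 0.6*x1)

* Logit coefficients
logit d x1

* Predicted probability of a positive outcome (the default statistic)
predict phat
summarize phat

* Same model, reported as odds ratios
logistic d x1

* Probit analogue
probit d x1
```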

6.2.3 POISSON AND NBREG

The command poisson produces ML estimates of the Poisson regression model. Its syntax is:

poisson yvar [xvars] [weight] [if exp] [in range] [, irr level(#)
    exposure(varname) offset(varname) robust cluster(varname)
    score(newvarname) noconstant constraints(numlist) nolog
    max_options]

where yvar is a nonnegative count variable and irr reports estimated coefficients transformed to incidence rate ratios.

The syntax of predict after poisson is:

predict [type] newvar [if exp] [in range] [, statistic nooffset]

where, in addition to xb and stdp, statistic may be n (predicted number of events, the default), or ir (incidence rate, equivalent to predict ..., n nooffset).
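The sketch below runs poisson on a simulated nonnegative integer variable. The counts are built from uniform draws purely to have something to fit, so they are not genuinely Poisson distributed; the variable names n, x1 and nhat are likewise assumptions.

```stata
* Simulated nonnegative integer counts (hypothetical, not truly Poisson)
clear
set obs 200
set seed 12345
generate x1 = uniform()
generate n = int(4*uniform()*(1 + x1))

* Coefficients on the log scale
poisson n x1

* Same model, reported as incidence rate ratios
poisson n x1, irr

* Predicted number of events (the default statistic)
predict nhat
summarize n nhat
```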

The command nbreg produces ML estimates of the negative binomial regression model (Poisson regression with overdispersion).

nbreg yvar [xvars] [weight] [if exp] [in range] [,
    dispersion(mean|constant) level(#) irr exposure(varname)
    offset(varname) robust cluster(varname) score(newvarnames)
    noconstant constraints(numlist) nolrtest nolog max_options]

where yvar is a nonnegative count variable.

Two different parameterizations of the negative binomial model may be estimated. The default, also given by the option dispersion(mean), has dispersion for the ith observation equal to 1 + α exp(β′Xi + offset), that is, the dispersion is a function of the expected mean of the counts for the ith observation: exp(β′Xi + offset). The alternative parameterization, given by the option dispersion(constant), has constant dispersion for all observations equal to 1 + δ.

For the default model, α = 0 (or ln(α) = −∞) corresponds to unit dispersion, and thus it is simply a Poisson model. For the alternative parameterization, δ = 0 (or ln(δ) = −∞) corresponds to unit dispersion, and is simply a Poisson model.

The syntax of predict after nbreg is the same as after poisson.

6.3 OTHER LIMITED DEPENDENT VARIABLES MODELS

6.3.1 GROUPED BINARY RESPONSES

The blogit and bprobit commands produce ML estimates of the logit and probit models for grouped data. The glogit and gprobit commands produce weighted least-squares (minimum chi-square) estimates.

6.3.2 ORDERED CATEGORICAL RESPONSES

The ologit and oprobit commands estimate ordered logit and probit models of an ordinal variable yvar on the covariates. The actual values taken on by the response variable are irrelevant except that larger values are assumed to correspond to “higher” outcomes. Up to 50 outcomes are allowed.


6.3.3 NESTED LOGIT

The command nlogit estimates a nested logit model by ML. The model may contain one or more levels. For a single-level model, nlogit estimates the same model as clogit.

6.3.4 MULTINOMIAL LOGIT

The mlogit command estimates multinomial logit models by ML. Constraints may be defined to perform constrained estimation.

6.3.5 BIPROBIT

The biprobit command produces ML estimates of two-equation probit models, either a bivariate probit or a system of two seemingly unrelated probit equations.

6.3.6 CENSORED AND TRUNCATED REGRESSION

The tobit command produces ML estimates of censored Gaussian regression models with a fixed censoring point. The cnreg command estimates the same class of models but allows the censoring points to vary across observations.

The intreg command estimates a bivariate censored Gaussian model where the response variables (Y1, Y2) can be point data, interval data (e.g. a ≤ Yi ≤ b), left-censored (e.g. a ≥ Y1), or right-censored (e.g. Y1 ≤ b).

The heckman command estimates Gaussian linear models with sample selection using either Heckman’s two-step estimator (Heckman 1976) or full ML. The related command heckprob estimates probit models with sample selection by ML.

The truncreg command produces ML estimates of a truncated Gaussian regression model.

6.4 DURATION DATA

6.4.1 PARAMETRIC DURATION MODELS

The ereg and weibull commands produce ML estimates of the exponential and Weibull (survival time) models, respectively, with and without gamma-distributed or inverse Gaussian unobserved heterogeneity (frailty). Their syntax is:

{ereg[het]|weibull[het]} yvar [xvars] [weight] [if exp] [in range]
    [, hazard hr tr dead(varname) t0(varname)
    frailty(gamma|invgaussian) ancillary(varlist) strata(varname)
    robust cluster(varname) score(newvars) constraints(numlist)
    level(#) nocoef noheader nolog maximize_options]

The syntax of predict following ereg and weibull is:

predict [type] newvarname [if exp] [in range] [, statistic]

where, in addition to xb and stdp, statistic may be: median time (predicted median survival time, the default), median lntime (predicted median log survival time), mean time (predicted mean survival time), mean lntime (predicted mean log survival time), hazard (predicted hazard), hr (predicted hazard ratio), surv (predicted survival probability), csnell (partial Cox-Snell residuals), or mgale (partial martingale-like residuals).

6.4.2 COX PROPORTIONAL HAZARD MODEL

The cox command estimates proportional hazards models by ML. The covariates may be either fixed or time-varying (fixed within intervals). The procedure allows for left truncation (delayed entry), as well as gaps and right censoring. The failure event may be unique or recurring.

A simplified version of cox is stcox.

6.5 TIME SERIES

The tsset timevar command declares the data to be a time series and designates that variable timevar (which must take on integer values) represents time. The tsset command must be used before time-series operators may be used. After tsset, the data will be sorted on timevar.
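A minimal sketch of declaring a time series and then using the lag operator L. The series is simulated noise, and the variable names t, y and ylag are assumptions made for this example.

```stata
* Simulated series with an integer time index (hypothetical)
clear
set obs 50
set seed 12345
generate t = _n
generate y = uniform()

* Declare the data to be a time series indexed by t
tsset t

* Time-series operators are now available: first lag of y
generate ylag = L.y

* The operator can also be used directly in an estimation command
regress y L.y
```

The first observation of ylag is missing because no earlier value of y exists, so the regression uses 49 observations.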

6.5.1 LINEAR MODELS WITH AUTOCORRELATED ERRORS

The prais command estimates a linear model with first-order autoregressive errors using the Prais-Winsten transformed regression estimator, the Cochrane-Orcutt transformed regression estimator, or a version of the Hildreth-Lu search method.

The newey command produces estimated standard errors for the OLS coefficients of linear regression models with heteroskedastic and possibly autocorrelated errors (see Newey & West 1987).

6.5.2 ARIMA MODELS

The arima command estimates a linear model with autoregressive moving-average (ARMA) errors. The response variable and the covariates may be differenced or seasonally differenced to any degree. When no covariate is specified, the command estimates autoregressive integrated moving-average (ARIMA) models for the response variable. Missing data are allowed and are handled using the Kalman filter.

6.5.3 ARCH-TYPE MODELS

The arch command estimates models with autoregressive conditional heteroskedasticity (ARCH) using conditional ML. In addition to ARCH terms, models may include multiplicative heteroskedasticity. Concerning the regression equation itself, models may also contain ARCH-in-mean and/or ARMA terms.

The following options may be used to estimate a number of models in the ARCH family:

• arch() (ARCH),
• arch() garch() (GARCH),
• archm arch() [garch()] (ARCH-in-mean),
• arch() garch() ar() ma() (GARCH with ARMA terms),
• earch() egarch() (EGARCH),
• abarch() atarch() sdgarch() (TARCH, threshold ARCH),
• arch() tarch() [garch()] (GJR, form of threshold ARCH),
• arch() saarch() [garch()] (SAARCH, simple asymmetric ARCH),
• parch() [pgarch()] (PARCH, power ARCH),
• narch() [garch()] (NARCH, nonlinear ARCH),
• narchk() [garch()] (NARCHK, NARCH with a single shift),
• aparch() [pgarch()] (A-PARCH, asymmetric power ARCH),
• nparch() [pgarch()] (NPARCH, nonlinear power ARCH).

6.6 PANEL DATA

The xt series of commands provides tools for analyzing longitudinal (panel) data. Each observation in a longitudinal dataset is indexed by a unit-specific index i and a time-specific index t. The iis command or the i() option sets the name of the variable corresponding to index i, while the tis command or the t() option sets the name of the variable corresponding to index t.

Some of the xt commands use time-series operators in their internal calculations and thus require the data to be tsset.

6.6.1 LINEAR PANEL DATA MODELS

The xtreg command estimates linear panel data models. This command can estimate fixed-effects (within-group), between-group, and random-effects models as well as population-averaged models. Which estimator is used is determined by the following options:

• be (between-group estimator),
• fe (fixed-effects estimator),
• re (GLS random-effects estimator),
• pa (GEE population-averaged estimator),
• mle (Gaussian ML random-effects estimator).

If no option is specified, re is assumed.

The xttest0 and xthaus commands after xtreg, re perform, respectively, the Breusch and Pagan (1980) Lagrange multiplier test for random effects and the Hausman (1978) specification test.

6.6.2 DYNAMIC PANEL DATA MODELS

The xtabond command estimates dynamic panel data models using Arellano and Bond (1989) one-step, one-step robust or two-step estimators. The command can be used with exogenously unbalanced panels and handles embedded gaps in the time series as well as opening and closing gaps.

6.6.3 SEEMINGLY UNRELATED REGRESSION EQUATIONS

The sureg command estimates a system of seemingly unrelated linear regression equations by feasible generalized least squares.

This command is not part of the xt series.

6.6.4 GEE FOR PANEL DATA

The xtgee command generalizes the glm command to panel data. It is very flexible and allows estimation of generalized linear models for panel data (see Liang & Zeger 1986) with different choices of parametric family, link function and within-group correlation structures.

The allowed distribution families are the same as for glm, namely Bernoulli/binomial, gamma, Gaussian (normal), inverse Gaussian, negative binomial, Poisson and user-supplied.

The allowed link functions are also the same, namely complementary log-log, identity, log, log-complement, log-log, logit, negative binomial, odds power, power, probit and user-supplied.

The allowed within-group correlation structures include independence, equicorrelation, kth order autocorrelation, kth order moving average, unstructured (arbitrary non-stationary) and user-supplied.

6.6.5 LOGIT AND PROBIT FOR PANEL DATA

The xtlogit command estimates a fixed-effects (fe), a random-effects (re), or a population-averaged (pa) logit model for panel data.

The xtprobit command estimates random-effects (re) and population-averaged (pa) probit models for panel data. The integrals in the individual terms of the log-likelihood of the random-effects model are computed using Gauss-Hermite quadrature. After estimating the model, the quality of the Gauss-Hermite quadrature approximation may be checked using the quadchk command.

Notice that the xtlogit, fe command is equivalent to the clogit command. Also notice that xtlogit, pa corresponds to

xtgee, family(binomial) link(logit) corr(exchangeable)

whereas xtprobit, pa corresponds to

xtgee, family(binomial) link(probit) corr(exchangeable)

6.6.6 POISSON AND NEGATIVE BINOMIAL MODELS

The xtpois command estimates fixed-effects (fe), random-effects (re), or population-averaged (pa) Poisson models for panel data. For the re option, either a gamma (the default) or a normal (Gaussian) distributed random-effects model is estimated.

As for xtprobit, re, the integrals in the individual terms of the log-likelihood of the Gaussian random-effects model are computed using Gauss-Hermite quadrature. After estimating the model, the quality of the Gauss-Hermite quadrature approximation may be checked using the quadchk command.

The xtnbreg command estimates fixed-effects (fe), random-effects (re), or population-averaged (pa) negative binomial models for panel data.

For both xtpois and xtnbreg, the population-averaged model assumes equicorrelation as default (that is, corr(exchangeable)).

6.7 NONPARAMETRIC ESTIMATION

6.7.1 DENSITY ESTIMATION

The kdensity command produces kernel density estimates and graphs the result.

The available options for the kernel function are: biweight, cosine, epan (Epanechnikov, the default), gauss (Gaussian), parzen, rectangle (uniform), triangle.

The width(#) option specifies the halfwidth of the kernel. If width() is not specified, then Stata uses the asymptotically optimal width for Gaussian data and a Gaussian kernel.

6.7.2 REGRESSION SMOOTHERS

The ksm command carries out unweighted and locally weighted smoothing of a response variable yvar on a single covariate xvar, displays the graph, and optionally saves the smoothed variable. Among the command’s capabilities is lowess (robust locally weighted regression, Cleveland 1979).

The smooth command applies resistant, nonlinear smoothers (running medians) to a single variable varname and stores the new series in newvar. Missing values at the beginning or end of the range of varname are ignored, but missing values in the middle of the series are not allowed.


6.8 ROBUST AND QUANTILE REGRESSION

6.8.1 ROBUST REGRESSION

The rreg command estimates a linear model by iteratively reweighted least squares using a particular set of robust weights.

6.8.2 QUANTILE REGRESSION

The qreg command estimates quantile (including median) regression models.
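A brief sketch of median and lower-quartile regression with qreg. The simulated data and the variable names are assumptions made for illustration.

```stata
* Simulated data (hypothetical variables y, x1, x2)
clear
set obs 200
set seed 12345
generate x1 = uniform()
generate x2 = uniform()
generate y = 1 + 2*x1 - x2 + (uniform() - 0.5)

* Median regression (the default quantile is 0.5)
qreg y x1 x2

* Regression for the first quartile
qreg y x1 x2, quantile(.25)
```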

The iqreg command estimates interquantile regressions (with a limit of 336 covariates). The estimated variance matrix of the estimators is obtained by bootstrap.

The sqreg command estimates simultaneous-quantile regression and produces the same coefficients as qreg for each quantile. The estimated variance matrix of the estimators is obtained by bootstrap and includes between-quantiles blocks. Thus, one can test and construct confidence intervals comparing coefficients describing different quantiles. This command has a limit of 336/q covariates, where q is the number of quantiles specified.

The bsqreg command is the same as sqreg. Although not as fast, it is not limited to 336 coefficients.

6.9 GENERAL NONLINEAR METHODS

The nl command fits an arbitrary nonlinear function to a response variable yvar by least squares. The function must be provided in a separate program.

The ml series of commands allows estimation of an arbitrary model by ML. See Gould and Sribney (1999) for details.

References

Berndt E., Hall B., Hall R., and Hausman J.A. (1974) Estimation and Inference in Nonlinear Structural Models. Annals of Economic and Social Measurement, 3/4: 653-665.

Breusch T.V. and Pagan A.R. (1980) The Lagrange Multiplier Test and Its Applications to Model Specification in Econometrics. Review of Economic Studies, 47: 239-254.

Cleveland W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 74: 829-836.

Gould W. and Sribney W. (1999) Maximum Likelihood Estimation with Stata, Stata Corporation, College Station, TX.

Hausman J.A. (1978) Specification Tests in Econometrics. Econometrica, 46: 1251-1272.

Heckman J.J. (1976) The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models. Annals of Economic and Social Measurement, 5: 475-492.

Hosmer D.W. and Lemeshow S. (1989) Applied Logistic Regression, Wiley, New York.

Liang K.Y. and Zeger S.L. (1986) Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73: 13-22.

McCullagh P. and Nelder J.A. (1989) Generalized Linear Models (2nd ed.), Chapman and Hall, London.

Newey W.K. and West K. (1987) A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica, 55: 703-708.

Peracchi F. (2001) Econometrics, Wiley, Chichester, UK.

Pregibon D. (1981) Logistic Regression Diagnostics. Annals of Statistics, 9: 705-724.
