Introduction to STATA

Mar 31, 2015



  • Slide 1

Introduction to STATA

About STATA
Basic Operations
Regression Analysis
Panel Data Analysis

  • Slide 2

About STATA

STATA is a modern, general-purpose, command-driven package for statistical analysis, data management, and graphics.
STATA provides commands to analyze panel data (cross-sectional time-series, longitudinal, repeated-measures, and correlated data), cross-sectional data, time-series data, survival-time data, and cohort studies.
STATA is user friendly.
STATA has an extraordinary set of reference books.
STATA has internet capabilities (installing new features, updating).

  • Slide 3

Getting ready

Download from the Econ 511 website
Unzip the file to U:\stata

  • Slide 4

Basic Operations

Entering Data
Exploring Data
Modifying Data
Managing Data
Analyzing Data

  • Slide 5

Entering Data

insheet: Read ASCII (text) data created by a spreadsheet (.csv files only)
infile: Read unformatted ASCII (text) data (space-delimited files)
input: Enter data from the keyboard
describe: Describe the contents of data in memory or on disk
compress: Compress data in memory
save: Store the dataset currently in memory on disk in Stata data format
count: Show the number of observations
list: List values of variables
clear: Clear the entire dataset and everything else
memory: Display a report on memory usage
set memory: Set the size of memory

  • Slide 6

Example

    cd u:\stata
    dir
    insheet using hs0.csv (if the file has variable names on the first line)
    save hs
    insheet gender id race ses schtyp prgtype read write math science socst using hs0_noname.csv, clear (if the file doesn't have variable names on the first line)
    count
    describe
    compress
    clear
    use hs, clear (only for files in Stata format; can also be used over the internet)
    memory
    set memory 5m (maximum: 256MB)

  • Slide 7

Exploring data

describe: Describe a dataset
list: List the contents of a dataset
codebook: Detailed contents of a dataset
log: Create a log file
summarize: Descriptive statistics
tabstat: Table of descriptive statistics
table: Create a table of statistics
stem: Stem-and-leaf plot
graph: High-resolution graphs
kdensity: Kernel density plot
sort: Sort observations in a dataset
histogram: Histogram for continuous and categorical variables
tabulate: One- and two-way frequency tables
correlate: Correlations
pwcorr: Pairwise correlations
type: Display an ASCII file

  • Slide 8

Example

    use hs0, clear
    describe
    list
    list gender-read
    codebook
    log using unit1, text replace (opens a log file called unit1 that saves all of the commands and output as text; replace overwrites the contents of an existing file with the current log)
    summarize
    summarize read math science write
    display 9.48^2 (note: the variance is the sd (9.48) squared)
    summarize write, detail
    sum write if read>=60
    sum write if prgtype=="academic"
    sum write in 1/40
    tabulate prgtype, summarize(read)
    stem write
    graph box write
    log close (close the log file)
    type unit1.log (see what is in the log file)

  • Slide 9

Modifying Data

label data: Apply a label to a dataset
order: Order the variables in a dataset
label variable: Apply a label to a variable
label define: Define a set of labels for the levels of a categorical variable
label values: Apply value labels to a variable
list: List the observations
rename: Rename a variable
recode: Recode the values of a variable
notes: Apply notes to the data file
generate: Create a new variable
replace: Replace one value with another value
egen: Extended generate; has special functions that can be used when creating a new variable

  • Slide 10

Example

    use hs0
    order id gender
    label variable schtyp "The type of school the student attended."
    label define scl 1 public 2 private
    label values schtyp scl
    codebook schtyp
    list schtyp in 1/10
    list schtyp in 1/10, nolabel
    encode prgtype, gen(prog) (create a new numeric version of the string variable prgtype)
    label variable prog "The type of program in which the student was enrolled."
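As an optional check on the encode step above, the following sketch inspects the mapping that encode created. It is an illustration rather than part of the original slides, and it assumes the commands above have already been run, so that prog exists along with its value label (which encode names after the new variable when no label() option is given).

```stata
* Sketch only: assumes hs0 is in memory and "encode prgtype, gen(prog)" has been run.

* Show each integer code and the text attached to it. encode assigns the
* codes 1, 2, 3, ... in alphabetical order of the string values.
label list prog

* Cross-tabulate the original string variable against the new numeric codes.
* Each row should have all of its observations in exactly one column.
tabulate prgtype prog, nolabel
```

If the cross-tabulation shows one string value spread over several codes, or several strings sharing a code, the string variable probably contains inconsistent spellings that should be cleaned before encoding.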
    codebook prog
    list prog in 1/10
    list prog in 1/10, nolabel

  • Slide 11

Example (cont)

    rename gender female (easier to work with, since we don't have to remember what the 0s and 1s mean)
    label variable female "The gender of the student."
    label define fm 1 female 0 male
    label values female fm
    codebook female
    list female in 1/10, nolabel
    gen total = read + write + math
    replace total = read + write + socst
    label variable total "The total of the read, write and socst."
    list race if race == 5
    recode race 5 = .
    list race if race == .
    sum total
    codebook total
    notes race: values of race coded as 5 were recoded to be missing
    egen zread = std(read) (using the special function std())
    save hs1

  • Slide 12

Managing Data

pwd: Show the current directory (pwd = print working directory)
dir or ls: Show files in the current directory
cd: Change directory
keep if: Keep observations if a condition is met
keep: Keep variables (dropping others)
drop: Drop variables (keeping others)
append using: Append a data file to the current file
merge: Merge a data file with the current file

  • Slide 13

Example

We take the hs1 data file, make a separate folder called honors, and store there a copy of our data containing only the students with reading scores of 60 or higher.

    use hs1, clear
    pwd
    dir
    ls
    cd honors
    keep if read >= 60
    describe
    summarize read
    save hsgoodread, replace
    use hsgoodread, clear
    drop ses
    save hsdropped, replace
    describe
    list in 1/20

  • Slide 14

Analyzing Data

ttest: t-test
regress: Regression
predict: Predictions after model estimation
kdensity: Kernel density estimates and graphs
pnorm: Graphs a standardized normal probability plot
qnorm: Graphs a quantile plot
rvfplot: Graphs a residual-versus-fitted plot
rvpplot: Graphs a residual-versus-individual-predictor plot
xi: Creates dummy variables during model estimation
test: Test linear hypotheses after model estimation
oneway: One-way analysis of variance
anova: Analysis of variance
logistic: Logistic regression (reports odds ratios by default)
logit: Logistic regression (reports coefficients by default)

  • Slide 15

Example

    use hs1, clear
    ttest write = 50 (the one-sample t-test, testing whether the sample of writing scores was drawn from a population with a mean of 50)
    ttest write = read (the paired t-test, testing whether the mean of write equals the mean of read)
    ttest write, by(female) (the two-sample independent t-test with pooled (equal) variances)
    ttest write, by(female) unequal (the two-sample independent t-test with separate (unequal) variances)
    oneway write prog
    anova write prog (both of these commands perform a one-way analysis of variance (ANOVA))
    anova write prog female prog*female (here the anova command performs a two-way analysis of variance (ANOVA))
    anova write prog female prog*female read, cont(read) (here the anova command performs an analysis of covariance (ANCOVA))

  • Slide 16

Example (cont)

    regress write read female (plain vanilla OLS regression)
    regress write read female, robust (the regression with robust standard errors; this is very useful when the error variance is not constant (heterogeneity of variance). The option does not affect the estimates of the regression coefficients.)
    predict p (the predict command calculates predictions, residuals, influence statistics, and the like after an estimation command; the default shown here is to calculate the predicted scores)
    predict r, resid (with the resid option, predict calculates the residuals)
    pnorm r (produces a normal probability plot; it is another way of checking whether the residuals from the regression are normally distributed)
    rvfplot (generates a plot of the residuals versus the fitted values; it is used after regress or anova)
    rvpplot read (produces a plot of the residuals versus a specified predictor; it is also used after regress or anova)

  • Slide 17

Example (cont)

    xi: regress write read i.prog (the xi prefix is used to dummy code categorical variables such as prog; the predictor prog has three levels and requires two dummy-coded variables)
    test _Iprog_2 _Iprog_3 (the test command tests the collective effect of the two dummy-coded variables; in other words, it tests the main effect of prog)
    xi: regress write i.prog*read (creates dummy variables for prog and for the interaction of prog and read)
    test _IproXread_2 _IproXread_3 (tests the overall interaction)
    test _Iprog_2 _Iprog_3 (tests the main effect of prog)
    gen honcomp = write >= 60 (creates a dichotomous variable called honcomp (honors composition) to use as our dependent variable)
    tab honcomp

The logistic command reports odds ratios by default but can display the coefficients if the coef option is used. Exactly the same results can be obtained with the logit command, which reports coefficients by default but displays odds ratios if the or option is used:

    logit honcomp read female
    logit honcomp read female, or

  • Slide 18

Logistic Regression

Classical Regression vs Logistic Regression

All of the previous regression examples have used continuous dependent variables. Logistic regression is used when the dependent variable is binary (dichotomous).

Different Assumptions

The population means of the dependent variable at each level of the independent variable are not on a straight line, i.e., no linearity.
The variance of the errors is not constant, i.e., no homogeneity of variance.
The errors are not normally distributed, i.e., no normality.

Logistic Regression Assumptions

The model is correctly specified, i.e.,
  1. the true conditional probabilities are a logistic function of the independent variables,
  2. no important variables are omitted,
  3. no extraneous variables are included, and
  4. the independent variables are measured without error.
The cases are independent.
The independent variables are not linear combinations of each other. Perfect multicollinearity makes estimation impossible, while strong multicollinearity makes estimates imprecise.

  • Slide 19

Logistic Regression - 2

Log