Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3
Generating new variables and manipulating data with STATA
Biostatistics 212
Lecture 3
Housekeeping
• Lab 1 cleanup
• Computer and software issues
• Change final session from 11/29 12/1– (Thursday instead of Tuesday)
• Change schedule – Excel NEXT session
Today...
• What we did last week, and why it was unrealistic• What does “data cleaning” mean?• How to generate a variable• How to manipulate the data in your new variable• How to label variables and otherwise document
your work• Examples
Last time…
• What was unrealistic?
Last time…
• What was unrealistic?– The dataset came as a Stata .dta file
Last time…
• What was unrealistic?– The dataset came as a Stata .dta file– The variables were ready to analyze
Last time…
• What was unrealistic?– The dataset came as a Stata .dta file– The variables were ready to analyze– Most variables were labeled
Last time…
• I.e. – The data was “clean”
How your data will arrive
• On paper forms
• In a text file (comma or tab delimited)
• In Excel
• In Access
• In another data format (SAS, etc)
Importing into Stata
• Options:– Cut and Paste– insheet, infile, fdause, other flexible
Stata commands– A convenience program like “Stat/Transfer”
Importing into Stata
• Make sure it worked– Look at the data
Importing into Stata
• Example – neonatal opiate withdrawal data
Exploring your data
• Figure out what all those variables mean
• Options– Browse, describe, summarize, list in STATA– Refer to a data dictionary– Refer to a data collection form– Guess, or ask the person who gave it to you
Exploring your data
• Example: Neonatal opiate withdrawal data
Exploring your data
• Example: Neonatal opiate withdrawal data
• Problems arise…– Sex is m/f, not 1/0
– Gestational age has nonsense values (0, 60)
– Breastfeeding has a bunch of weird text values
– Drug variables coded y or blank
– Many variable names are obscure
Cleaning your data
• You must “clean” your data so it is ready to analyze.
Cleaning your data
• Cleaning tasks– Check for consistency and clean up non-sense data and
outliers– Deal with missing values– Code all dichotomous variables 1/0– Categorize variables meaningfully (for Table 1, etc) – Derive new variables– Rename variables
• With common sense, or with a consistent scheme
– Label variables– Label the VALUES of coded variables
Cleaning your data
• The importance of documentation– Retracing your steps
• Document every step using a “do” file
Data cleaningBasic skill 1 – make a new variable
• Creating new variables
generate newvar = expression
• An expression can be:– A number (constant) - generate allzeros = 0– A variable - generate ageclone = age– A function - generate agesqrt = sqrt(age)
Data cleaningBasic skill 1 – make a new variable
• Getting rid of a variable
drop var
• Getting rid of observations
drop if boolean exp
Data cleaningBasic skill 2 – manipulating the values
• Changing the values of a variable
replace var = exp [if boolean exp]
• A boolean expression evaluates to true or false for each observation
Data cleaningBasic skill 2 – manipulating the values
• Examples
generate male = 0replace male = 1 if sex==“male”
generate ageover50 = 0replace ageover 50 = 1 if age>50
generate complexvar = agereplace complexvar = (ln(age)*3)
if (age>30 | male==1) & (othervar1>=othervar2)
Data cleaningBasic skill 2 – manipulating the values
• Logical operators for boolean expressions:
English StataEqual to ==Not equal to !=, ~=Greater than >Greater than/equal to>=Less than <Less than/equal to <=And &Or |
Data cleaningBasic skill 2 – manipulating the values
• Mathematical operators:
English StataAdd +Subtract -Multiply *Divide /To the power of ^Natural log of ln(expression)Base 10 log of log10(expression)Etcetera…
Data cleaningBasic skill 2 – manipulating the values
• Another way to manipulate data
Recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression]
• More complicated, but more flexible command than replace
Data cleaningBasic skill 2 – manipulating the values
• Examples
Generate male = 0Recode male 0=1 if sex==“male”
Generate raceethnic = raceRecode raceethnic 1=6 if ethnic==“hispanic”(Replace raceethnic = 6 if ethnic==“hispanic” & race==1)
Generate tertilescac = cacRecode min/54=1 55/82=2 83/max=3
Data cleaningBasic skill 3 – labeling variables
• You can label:– A dataset label data “label”
– A variable label var varname “label”
– Values of a variable (2-step process)label define labelname value1 “label1” [value2 “value2”…]
Label values varname labelname
Cleaning your data
• Cleaning tasks– Check for consistency and clean up non-sense data– Deal with missing values– Code all dichotomous variables 1/0– Categorize variables meaningfully (for Table 1, etc) – Derive new variables– Rename variables
• With common sense, or with a consistent scheme
– Label variables– Label the VALUES of coded variables
Data cleaning
• Example: Neonatal opiate withdrawal data
Data cleaning
• At the end of the day you have:– 1 raw data file, original format– 1 raw data file, Stata format– 1 do file that cleans it up– 1 log file that documents the cleaning– 1 clean data file, Stata format
Summary
• Data cleaning– ALWAYS necessary to some extent– ALWAYS use a do file, don’t overwrite
original data– Check your work– Watch out for missing values– Label as much as you can
Lab this week
• It’s long• It’s important• It’s hard
• But this year, we have 2 sessions for it!
• Email lab to [email protected]• Due 10/11 at Midnight
Preview of next week…
• Using Excel– What is it good for?– Formulas– Designing a good spreadsheet– Formatting