Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3

Dec 19, 2015


Generating new variables and manipulating data with STATA

Biostatistics 212

Lecture 3

Housekeeping

• Lab 1 cleanup

• Computer and software issues

• Change final session from 11/29 to 12/1 (Thursday instead of Tuesday)

• Change schedule – Excel NEXT session

Today...

• What we did last week, and why it was unrealistic
• What does “data cleaning” mean?
• How to generate a variable
• How to manipulate the data in your new variable
• How to label variables and otherwise document your work
• Examples

Last time…

• What was unrealistic?
  – The dataset came as a Stata .dta file
  – The variables were ready to analyze
  – Most variables were labeled

• I.e., the data was “clean”

How your data will arrive

• On paper forms

• In a text file (comma or tab delimited)

• In Excel

• In Access

• In another data format (SAS, etc.)

Importing into Stata

• Options:
  – Cut and paste
  – insheet, infile, fdause, other flexible Stata commands
  – A convenience program like “Stat/Transfer”

Importing into Stata

• Make sure it worked
  – Look at the data
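
As a rough illustration of importing and then looking at the data, the sketch below writes Stata's bundled auto dataset out as a comma-delimited text file and reads it back in with insheet. The file name auto_raw.csv is made up for the example, and the auto data stand in for whatever file you actually receive. Newer Stata versions also offer import delimited for the same job.

```stata
* Sketch only: create a comma-delimited file from the bundled auto data,
* then import it the way an external text file would arrive.
sysuse auto, clear
local n_before = _N                            // remember the number of observations
outsheet using "auto_raw.csv", comma replace   // hypothetical file name

* Import the text file into Stata
insheet using "auto_raw.csv", comma clear

* Make sure it worked: look at the data
describe
list in 1/5

* The number of observations should match what was written out
assert _N == `n_before'
```

Note that variables that carried value labels (here, foreign) come back as text after the round trip, which is typical of how real data arrive.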

Importing into Stata

• Example – neonatal opiate withdrawal data

Exploring your data

• Figure out what all those variables mean

• Options
  – browse, describe, summarize, list in Stata
  – Refer to a data dictionary
  – Refer to a data collection form
  – Guess, or ask the person who gave it to you
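
A minimal sketch of these exploration commands, run on Stata's bundled auto dataset rather than the class data. The tabulate line is an addition that is not on the slide; it is included because it shows missing values at a glance.

```stata
* Sketch only: explore a dataset with the commands named above
sysuse auto, clear

describe                        // variable names, storage types, labels
summarize                       // n, mean, SD, min, max for every variable
summarize price, detail         // percentiles and extremes for one variable
list make price rep78 in 1/10   // print the first 10 observations

* Not on the slide: a frequency table that also shows missing values
tabulate rep78, missing

* browse opens the data in a spreadsheet-style window (interactive use only)
```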


Exploring your data

• Example: Neonatal opiate withdrawal data

• Problems arise…
  – Sex is m/f, not 1/0
  – Gestational age has nonsense values (0, 60)
  – Breastfeeding has a bunch of weird text values
  – Drug variables coded y or blank
  – Many variable names are obscure

Cleaning your data

• You must “clean” your data so it is ready to analyze.

Cleaning your data

• Cleaning tasks
  – Check for consistency and clean up nonsense data and outliers
  – Deal with missing values
  – Code all dichotomous variables 1/0
  – Categorize variables meaningfully (for Table 1, etc.)
  – Derive new variables
  – Rename variables
    • With common sense, or with a consistent scheme
  – Label variables
  – Label the VALUES of coded variables

Cleaning your data

• The importance of documentation
  – Retracing your steps

• Document every step using a “do” file

Data cleaning
Basic skill 1 – make a new variable

• Creating new variables

generate newvar = expression

• An expression can be:
  – A number (constant): generate allzeros = 0
  – A variable: generate ageclone = age
  – A function: generate agesqrt = sqrt(age)
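
The three kinds of expression can be tried on Stata's bundled auto dataset. This is only a sketch: price stands in for age, and the new variable names are made up for the example.

```stata
* Sketch only: generate with a constant, a variable, and a function
sysuse auto, clear

generate allzeros   = 0             // constant
generate priceclone = price         // copy of an existing variable
generate pricesqrt  = sqrt(price)   // function of an existing variable

summarize allzeros priceclone pricesqrt

* Quick checks that the new variables hold what we intended
assert allzeros == 0
assert priceclone == price
assert abs(pricesqrt^2 - price) < 0.1
```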

Data cleaning
Basic skill 1 – make a new variable

• Getting rid of a variable

drop var

• Getting rid of observations

drop if boolean exp
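
A short sketch of both forms of drop, again on the bundled auto dataset. Because the data are reloaded from the system copy, nothing on disk is changed.

```stata
* Sketch only: drop a variable, then drop observations
sysuse auto, clear

generate allzeros = 0
drop allzeros               // getting rid of a variable

count                       // observations before
drop if missing(rep78)      // getting rid of observations with no repair record
count                       // observations after

* No missing repair records should remain
assert !missing(rep78)
```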

Data cleaning
Basic skill 2 – manipulating the values

• Changing the values of a variable

replace var = exp [if boolean exp]

• A boolean expression evaluates to true or false for each observation

Data cleaning
Basic skill 2 – manipulating the values

• Examples

generate male = 0
replace male = 1 if sex=="male"

generate ageover50 = 0
replace ageover50 = 1 if age>50

generate complexvar = age
replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
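
The generate/replace pattern above has one well-known trap: Stata treats a missing value (.) as larger than any number, so a condition like age>50 is true when age is missing. The sketch below shows this on the bundled auto dataset, where rep78 has some missing values. The variable names highrep and highrep2 are made up for the example, and the "safer" version is one common way to handle it, not the only one.

```stata
* Sketch only: generate/replace with boolean conditions, and the missing-value trap
sysuse auto, clear

* Naive version, same pattern as ageover50 above
generate highrep = 0
replace highrep = 1 if rep78 > 3

* Missing (.) counts as larger than any number, so cars with a
* missing rep78 were coded 1 by the line above
count if highrep == 1 & missing(rep78)

* Safer version: require a non-missing value, then carry missingness over
generate highrep2 = 0
replace highrep2 = 1 if rep78 > 3 & !missing(rep78)
replace highrep2 = . if missing(rep78)
tabulate highrep2, missing

* A compound condition combining | and &, as in the complexvar example
generate complexvar = price
replace complexvar = ln(price)*3 if (price > 5000 | foreign == 1) & (mpg >= 20)

assert missing(highrep2) if missing(rep78)
```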

Data cleaning
Basic skill 2 – manipulating the values

• Logical operators for boolean expressions:

English                   Stata
Equal to                  ==
Not equal to              !=, ~=
Greater than              >
Greater than/equal to     >=
Less than                 <
Less than/equal to        <=
And                       &
Or                        |

Data cleaning
Basic skill 2 – manipulating the values

• Mathematical operators:

English                   Stata
Add                       +
Subtract                  -
Multiply                  *
Divide                    /
To the power of           ^
Natural log of            ln(expression)
Base 10 log of            log10(expression)
Etcetera…
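
A small sketch that uses each of the mathematical operators on the bundled auto dataset. The derived variable names are made up for the example.

```stata
* Sketch only: arithmetic operators and log functions in generate
sysuse auto, clear

generate total_len   = length + turn     // add
generate price_diff  = price - 5000      // subtract
generate weight_kg   = weight * 0.4536   // multiply (pounds to kilograms)
generate price_perlb = price / weight    // divide
generate weight_sq   = weight^2          // to the power of
generate lnprice     = ln(price)         // natural log
generate log10price  = log10(price)      // base 10 log

summarize total_len-log10price

* The two logs differ only by the constant factor ln(10)
assert abs(lnprice - log10price*ln(10)) < 0.0001
```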

Data cleaning
Basic skill 2 – manipulating the values

• Another way to manipulate data

recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression]

• A more complicated, but more flexible, command than replace

Data cleaning
Basic skill 2 – manipulating the values

• Examples

generate male = 0
recode male 0=1 if sex=="male"

generate raceethnic = race
recode raceethnic 1=6 if ethnic=="hispanic"
(equivalent to: replace raceethnic = 6 if ethnic=="hispanic" & race==1)

generate tertilescac = cac
recode tertilescac min/54=1 55/82=2 83/max=3
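
The same recode pattern can be tried on the bundled auto dataset. This is a sketch under a few assumptions: the cut points (4500 and 6000) are arbitrary and not true tertiles, the variable names are made up, and each rule is written in parentheses, which is the form shown in Stata's current documentation.

```stata
* Sketch only: collapse categories and cut a continuous variable with recode
sysuse auto, clear

* Collapse the 5-level repair record into 3 groups; missing stays missing
generate repcat = rep78
recode repcat (1/2=1) (3=2) (4/5=3)
tabulate rep78 repcat, missing

* Cut price into three bands using min and max (arbitrary cut points)
generate pricecat = price
recode pricecat (min/4500=1) (4501/6000=2) (6001/max=3)
tabulate pricecat

assert inlist(pricecat, 1, 2, 3)
assert missing(repcat) if missing(rep78)
```

Working on a copy (generate first, then recode the copy) leaves the original variable intact, so the cross-tabulation can confirm the recode did what was intended.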

Data cleaning
Basic skill 3 – labeling variables

• You can label:
  – A dataset:
      label data "label"
  – A variable:
      label var varname "label"
  – Values of a variable (2-step process):
      label define labelname value1 "label1" [value2 "label2" …]
      label values varname labelname
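
All three kinds of label can be sketched on the bundled auto dataset. The variable expensive, the cutoff of 6000, and the label name yesno are made up for the example.

```stata
* Sketch only: label a dataset, a variable, and the values of a variable
sysuse auto, clear

* A 1/0 variable to label (price has no missing values in this dataset)
generate expensive = 0
replace expensive = 1 if price > 6000

label data "Auto data with a derived price indicator"
label var expensive "Price above 6000"

* Two-step process for value labels: define the mapping, then attach it
label define yesno 0 "No" 1 "Yes"
label values expensive yesno

tabulate expensive      // the table now shows No/Yes instead of 0/1
describe expensive

* Confirm the value label is attached
assert "`: value label expensive'" == "yesno"
```

Once defined, the same yesno label can be attached to every other 1/0 variable with further label values commands.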

Cleaning your data

• Cleaning tasks
  – Check for consistency and clean up nonsense data
  – Deal with missing values
  – Code all dichotomous variables 1/0
  – Categorize variables meaningfully (for Table 1, etc.)
  – Derive new variables
  – Rename variables
    • With common sense, or with a consistent scheme
  – Label variables
  – Label the VALUES of coded variables

Data cleaning

• Example: Neonatal opiate withdrawal data

Data cleaning

• At the end of the day you have:
  – 1 raw data file, original format
  – 1 raw data file, Stata format
  – 1 do file that cleans it up
  – 1 log file that documents the cleaning
  – 1 clean data file, Stata format
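
One possible skeleton for a do file that produces those five files is sketched below. All file names are hypothetical, and the "raw" text file is fabricated from Stata's bundled auto dataset only so that the sketch can run on its own; in practice the raw file is whatever you were given.

```stata
* cleaning.do (hypothetical name): a sketch of the overall workflow
capture log close
log using "cleaning.log", replace text         // log file that documents the cleaning

* Setup for illustration only: fabricate a raw text file from the auto data
sysuse auto, clear
outsheet using "raw_data.csv", comma replace   // raw data file, original format

* 1. Import the raw file and save an untouched copy in Stata format
insheet using "raw_data.csv", comma clear
save "raw_data.dta", replace                   // raw data file, Stata format

* 2. Cleaning steps: foreign arrives as text, so code it 1/0
assert inlist(foreign, "Domestic", "Foreign")
generate foreign01 = 0
replace foreign01 = 1 if foreign == "Foreign"
label var foreign01 "Foreign car (1 = yes, 0 = no)"

* 3. Save the result under a new name; the raw files are never overwritten
save "clean_data.dta", replace                 // clean data file, Stata format
log close
```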

Summary

• Data cleaning
  – ALWAYS necessary to some extent
  – ALWAYS use a do file, don’t overwrite original data
  – Check your work
  – Watch out for missing values
  – Label as much as you can
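
"Check your work" can be as simple as comparing each new variable against the one it came from. The sketch below does this on the bundled auto dataset; the variable heavy and the 3000-pound cutoff are made up for the example.

```stata
* Sketch only: derive a variable, then check it against its source
sysuse auto, clear

generate heavy = 0
replace heavy = 1 if weight > 3000 & !missing(weight)
replace heavy = . if missing(weight)

* Look at the result, including any missing values
tabulate heavy, missing

* The ranges of weight within each group should not overlap
bysort heavy: summarize weight

* assert stops the do file with an error if a check fails
assert weight > 3000 if heavy == 1
assert weight <= 3000 if heavy == 0
```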

Lab this week

• It’s long
• It’s important
• It’s hard

• But this year, we have 2 sessions for it!

• Email lab to [email protected]
• Due 10/11 at midnight

Preview of next week…

• Using Excel
  – What is it good for?
  – Formulas
  – Designing a good spreadsheet
  – Formatting