
Data File Structure and Content Joe Larson 5 / 6 / 09.

Jan 04, 2016


Transcript
Page 1: Data File Structure and Content Joe Larson 5 / 6 / 09.

Data File Structure and Content

Joe Larson

5/6/09

Page 2:

Outline

What’s in a Data Set?

- File Setup

- Key Variables

Data Conventions

Fun With Demographics

Page 3:

What’s in a Data Set?

Page 4:

File Setup

Data on the web is broken up into the forms it was collected on.

Different forms can have different collection time(s) and different participant subgroups.

Page 5:

Available Data is Broken up by Form

All data on the web is arranged by form

Exceptions:

- Outcomes file

- Demographics file

Variables within a data set are in the order of the questionnaire, with any computed variables at the end of the file

Page 6:

Available Data is Broken up by Form

Page 7:

Different Forms…Different Participants…Different Times

Forms collected only once result in a file with one record per person

Forms collected numerous times throughout follow-up result in a file with multiple records per person

Some data is only available for specific groups of participants (e.g. DM only, blood subsample)

Specifics for an individual file can be found in its corresponding data dictionary
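A quick way to see which kind of file you have is to count records per ID. The sketch below is a minimal Python illustration, not the WHI's own tooling; the rows and the helper name `records_per_person` are made up for the example.

```python
from collections import Counter

# Hypothetical rows from a form file: (participant ID, visit number).
rows = [(1001, 1), (1001, 3), (1002, 1), (1003, 1), (1003, 2), (1003, 3)]

def records_per_person(rows):
    """Count how many records each participant ID has."""
    return Counter(pid for pid, _ in rows)

counts = records_per_person(rows)

# True only when every ID appears exactly once (a one-record-per-person file).
is_one_record_per_person = all(n == 1 for n in counts.values())
```

If `is_one_record_per_person` is false, the form was collected more than once and analyses need to pick a visit or handle repeated measures.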

Page 8:

Example from Form 80

Page 9:

Key Variables

Some variables are found in every file (with the exception of the demographics and outcomes files)

- ID

- Days since randomization/enrollment

- Visit type / Visit number

- Form closest to visit

- Expected for visit

Page 10:

Key Variables

Let’s take a look at an actual Form 80 file

Page 11:

WHI Participant ID (ID)

Page 12:

Participant ID (ID)

The ID variable is common to all of the web files.

Completely independent of the member ID that is used at the individual clinics.

Also independent of the Public and blood draw IDs.

Page 13:

Days Since Randomization / Enrollment (F80DAYS)

Page 14:

Days Since Randomization / Enrollment (F80DAYS)

We do not give out actual dates for forms or events.

Time is calculated between randomization (CT) or enrollment (OS) and the form date.
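As a rough illustration of how a days-since variable like F80DAYS could be derived, here is a minimal Python sketch. The dates are hypothetical (real dates are not released) and the function name `days_since_start` is an assumption for the example.

```python
from datetime import date

def days_since_start(start_date, form_date):
    """Days from randomization (CT) or enrollment (OS) to the form date."""
    return (form_date - start_date).days

# Hypothetical dates, for illustration only.
randomization_date = date(1995, 3, 1)
form_date = date(1998, 6, 2)

f80days = days_since_start(randomization_date, form_date)  # 1189
```

Only the resulting day count appears in the web files; the two dates themselves do not.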

Page 15:

Visit Type (F80VTYP) & Visit Number (F80VNUM)

Page 16:

Visit Type (F80VTYP) & Visit Number (F80VNUM)

These variables combine to let you know when data was collected.

For example, in the second line of the data on the previous slide we can see that the record is for “Annual Visit 3”. This matches up well with the 1189 days since randomization.
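The sanity check in that example is simple arithmetic: 1189 days is a little over three years. A small Python sketch, assuming an average year of 365.25 days (an assumption of this example, not a WHI rule):

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length

def approx_years(days):
    """Convert days since randomization/enrollment to approximate years."""
    return days / DAYS_PER_YEAR

def nearest_annual_visit(days):
    """Annual visit number that a given day count sits closest to."""
    return round(approx_years(days))

years = approx_years(1189)          # about 3.26
visit = nearest_annual_visit(1189)  # 3
```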

Page 17:

Closest to Visit Within Visit Type and Number (F80VCLO)

Page 18:

Closest to Visit Within Visit Type and Number (F80VCLO)

On rare occasions multiple forms were filled out or entered for the same participant at the same follow-up visit

This variable identifies the visit closest to the actual date. For example, a year 1 annual visit with a value of “Yes” for VCLO will be the year 1 visit that is closest to 365 days from randomization/enrollment
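The idea behind a closest-to-visit flag can be sketched as follows. This is a minimal Python illustration with made-up records; the target of 365 days times the visit number generalizes the year 1 example and is an assumption, as are the field names `id`, `vnum`, `days` and `closest`.

```python
def flag_closest(records, days_per_year=365):
    """Flag, within each (id, visit number) group, the record nearest the target day."""
    best = {}  # (id, vnum) -> (distance from target, index of best record so far)
    for i, rec in enumerate(records):
        key = (rec["id"], rec["vnum"])
        dist = abs(rec["days"] - rec["vnum"] * days_per_year)
        if key not in best or dist < best[key][0]:
            best[key] = (dist, i)
    winners = {i for _, i in best.values()}
    # Return copies so the input records are left unchanged.
    return [dict(rec, closest=(i in winners)) for i, rec in enumerate(records)]

# Hypothetical data: participant 1001 has two year 1 forms.
records = [
    {"id": 1001, "vnum": 1, "days": 340},
    {"id": 1001, "vnum": 1, "days": 371},
    {"id": 1002, "vnum": 1, "days": 402},
]
flagged = flag_closest(records)
```

Here the form at day 371 wins for participant 1001 because it is 6 days from the target, against 25 days for the form at day 340.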

Page 19:

Expected for Visit (F80EXPC)

Page 20:

Expected for Visit (F80EXPC)

Sometimes forms are filled out by participants who should not be filling them out

The expected for visit flag identifies data that were expected by protocol
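In practice the flag is used as a filter. A minimal Python sketch, assuming the flag is coded 1 for expected and 0 otherwise (the actual coding is in the data dictionary) and using made-up rows:

```python
# Hypothetical rows; "expected" stands in for a flag such as F80EXPC.
rows = [
    {"id": 1001, "expected": 1},
    {"id": 1002, "expected": 0},
    {"id": 1003, "expected": 1},
]

# Keep only the forms that were expected by protocol.
expected_rows = [r for r in rows if r["expected"] == 1]
```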

Page 21:

File Setup / Key Variables

Files are arranged by form on the web at www.whiops.org

File structure and participant group vary by form and are described in the data dictionary

ID, Visit Type, and other important variables can be found at the start of each file

Page 22:

Any Questions?

Page 23:

Data Conventions

Skip patterns

Mark all that apply

Version differences

Page 24:

Skip Patterns

• Questions within a form are often set up with a hierarchical structure with parent questions and subquestions

• In most cases, the sub-questions are set to missing if the parent value indicates the sub-questions should not be answered. This is the application of a skip pattern

• In a few cases where the error percentage is high, the skip pattern is not applied
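Applying a skip pattern can be sketched in a few lines of Python. This is an illustration only: the variable names follow the pet example used in these slides, missing is represented as `None`, and a parent value of 0 is assumed to mean the sub-questions should not be answered.

```python
SUB_QUESTIONS = ["dog", "cat", "bird", "fish", "other"]

def apply_skip_pattern(record, parent="pet", subs=SUB_QUESTIONS):
    """Set sub-questions to missing (None) when the parent answer is 0 (No)."""
    cleaned = dict(record)  # copy so the raw record is untouched
    if cleaned.get(parent) == 0:
        for name in subs:
            cleaned[name] = None
    return cleaned

# Hypothetical raw record: no pet reported, but "cat" was marked anyway.
raw = {"pet": 0, "dog": 0, "cat": 1, "bird": 0, "fish": 0, "other": 0}
cleaned = apply_skip_pattern(raw)
```

Records where the parent is 1 pass through unchanged, so only inconsistent or unnecessary sub-question answers are blanked.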

Page 25:

Example: Skip Pattern Applied

Figure: values of Pet and its sub-questions (Dog, Cat, Bird, Fish, Other) before and after skip pattern QA is applied.

Error Percentage < 1%

Page 26:

Example: Skip Pattern Not Applied

Error Percentage ~ 6-12%

Page 27:

If the Skip Pattern is not Applied

It will be in the data dictionary

Page 28:

Mark All That Apply

What kind of pet do you have? (mark all that apply)

Choice      1       2       3        4     5
Label       Dog(s)  Cat(s)  Bird(s)  Fish  Other
Indicator   0       1       1        0     1

• One question with multiple choices is converted to separate indicator variables of 0’s and 1’s
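The conversion above can be sketched in Python as follows. The choice codes 1 to 5 and their labels come from the slide; the function name `to_indicators` and the lower-case variable names are assumptions for the example.

```python
CHOICES = {1: "dog", 2: "cat", 3: "bird", 4: "fish", 5: "other"}

def to_indicators(marked, choices=CHOICES):
    """Turn a list of marked choice codes into one 0/1 indicator per choice."""
    marked = set(marked)
    return {name: int(code in marked) for code, name in choices.items()}

# The slide's example: choices 2, 3 and 5 were marked.
indicators = to_indicators([2, 3, 5])
# {'dog': 0, 'cat': 1, 'bird': 1, 'fish': 0, 'other': 1}
```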

Page 29:

Order  Question           Question Number  Value  Value Description
17     Do you have a pet  11               1      Yes
18     Dog                11.1
19     Cat                11.1             2      Marked
20     Bird               11.1             3      Marked
21     Fish               11.1
22     Other              11.1             5      Marked

Mark all conversion

O17  O18  O19  O20  O21  O22
1    0    1    1    0    1

Page 30:

Version Issues

Sometimes questions are not asked on all versions of a form, leading to higher percentages of missing data

The data dictionary will note this

Page 31:

Data Conventions

Some cleaning was done to the data before it reached the web

Skip patterns and mark-all-that-apply conversions were usually done

Sometimes questions were not collected on all versions of a form

In all cases, any issues are documented in the data dictionary

Page 32:

Any Questions?

Page 33:

Fun With Demographics

Page 34:

The Demographics File

The demographics file is the glue that pulls most analyses together

It contains important variables that are used in just about every analysis

The file has one record per person
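Because the demographics file has one record per person, it can be attached to any form file by ID. The talk does this in SAS; the sketch below shows the same idea in plain Python with made-up records. The field names and region values are hypothetical.

```python
# One record per ID (demographics) and possibly many per ID (form file).
demographics = {
    1001: {"region": "West"},
    1002: {"region": "South"},
}
form_rows = [
    {"id": 1001, "vnum": 1},
    {"id": 1001, "vnum": 3},
    {"id": 1002, "vnum": 1},
]

def merge_on_id(form_rows, demographics):
    """Attach each participant's demographic fields to every one of their form records."""
    merged = []
    for row in form_rows:
        demo = demographics.get(row["id"], {})  # empty if the ID has no demographics
        merged.append({**row, **demo})
    return merged

merged = merge_on_id(form_rows, demographics)
```

The result keeps the form file's row count: each of a participant's visits carries the same demographic values.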

Page 35:

Trial Participation Flags

Page 36:

Trial Participation Flags

Trial Flags distinguish what part of the WHI a participant is in

In addition to CT and OS indicators, there are indicator variables for each clinical trial component

Page 37:

Basic Demographic Data

Page 38:

Basic Demographic Data

Basic demographic data, including age, ethnicity, education, and income, can be found here

Because clinical center data has not been released, the “U.S. Region” variable is the best variable to use for geographic location

Page 39:

Trial Arms

Page 40:

Trial Arms

These are the key variables for any analysis on the clinical trial.

The hormone arm variable can also be used to separate out participants in the two hormone trials

Page 41:

Days from CT to CaD Randomization

Page 42:

Days from CT to CaD Randomization

Key variable used to determine how far a follow-up visit is from CaD randomization

To determine days from CaD randomization

- Start with the days from CT randomization

- Subtract the days from CT to CaD randomization
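The two steps above amount to a single subtraction. A minimal Python sketch; the function name and the 400-day gap are hypothetical, while 1189 is the day count from the earlier Form 80 example.

```python
def days_from_cad(days_from_ct, ct_to_cad_days):
    """Days from CaD randomization = days from CT randomization minus the CT-to-CaD gap."""
    return days_from_ct - ct_to_cad_days

# A form at day 1189 after CT randomization, for a participant who joined
# CaD 400 days after CT randomization (hypothetical gap).
result = days_from_cad(1189, 400)  # 789
```

A negative result means the form was collected before the participant was randomized to CaD.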

Page 43:

BMD Subsample Indicator

Page 44:

BMD Subsample Indicator

A ‘yes’ response indicates that the participant was at one of the three BMD clinics

Page 45:

Fun With Demographics

The demographics file is a key file used in most analyses

It includes trial participation and treatment status variables, as well as basic demographic data

Page 46:

Questions?

Page 47:

Stay Tuned

Later I’ll be doing a beginning-to-end example:

- Going to the web

- Hunting down variables

- Downloading the data

- Loading it into SAS

- Merging files together

- Running some basic frequencies

And taking questions while I do it!

Page 48:

Thanks and Good Night