1
Getting to SDTM
Jeff Millstein, Ph.D., Principal Consultant, Lincoln Safety Group, Phase Forward, Inc.
The presentation was made at the Washington, DC CDISC User’s Group, held at MedImmune, Gaithersburg, MD, on October 15, 2009.
Please contact the author ([email protected]) before using any of this material.
2
Qualifications
Product Guidance: This presentation contains directional statements related to the features and availability of one or more unreleased products. These statements represent current intentions and goals which are subject to change or withdrawal at any time without further notice.
Safe Harbor: This presentation contains forward looking statements within the meaning of the Private Securities Litigation Reform Act of 1995, including, without limitation, statements related to the features and availability of one or more unreleased products. These statements are subject to a variety of risks and uncertainties, including without limitation, technical difficulties encountered in the development of the planned product or changes in product plans which might result in the failure of Phase Forward to release the product as scheduled, as described, or at all, as well as the risks set forth in Phase Forward's public filings with the Securities and Exchange Commission. We do not assume any obligation to update the forward-looking statements contained in this presentation.
This presentation contains forward-looking statements related to the features and availability of one or more unreleased products. These statements are subject to a variety of risks and uncertainties including those listed on the screen and set forth in our public filings.
3
Introduction
Phase Forward’s headquarters, in Waltham, MA, where I work.
4
History of the SDTM
[Figure: timeline of the SDTM, 1999 to 2009, with year labels 1999, 2002, 2004, 2006, and 2009. Milestones shown: CDISC SDS meets; SDS 1.0, SDS 2.0, and SDS 3.0; SDTM 1.0 / SDTMIG 3.1; SDTM 1.1 / SDTMIG 3.1.1; SDTM 1.2 / SDTMIG 3.1.2 (posted, final draft, released); FDA recognizes SDTM as an approved method for submitting the data tabulation component of the CRT; FDA releases the Critical Path Opportunities List (item 44 = SDTM); FDA releases a Notice of Proposed Rulemaking regarding the SDTM.]
It has been about ten years since the CDISC Submission Data Standards (SDS) Committee first met, and about five years since we have had SDTM 3.1, the first truly stable and usable version of the data model.
Over the past five years we at Phase Forward, and in particular, the consulting arm within the Lincoln Safety Group, have seen a steady increase in requests for conversion of InForm trials to SDTM.
This year SDTM 3.1.2 was released. Although the standard has improved by fixing issues and adding flexibility, this version currently sits in a “no man’s land”. This is because the 3.1.2 standard has not yet been adopted by the FDA, but some sponsors, believing that adoption is near, are converting to this standard now, while others use only what is currently accepted by the FDA.
5
Planning an SDTM Conversion
Albert Bierstadt, “Storm in the Mountains,” c. 1870, oil on canvas (collection of MFA, Boston, MA)
6
How much effort is this conversion going to be?
Victor Dubreuil, “Barrels of Money,” c. 1890s, oil on canvas (collection of Fed Reserve)
Invariably, the first question our consulting group is asked is something like “how long will this take?”, which can be converted into units of effort (e.g., staffing, meeting time) and/or money.
Conversions of any database are never trivial, and conversions of clinical trial data can be complex, time consuming, tedious, and often expensive. In addition, there may be considerable time pressure to complete the job. Therefore, it is important to have an organized methodology and perform the conversion carefully, step-by-step.
I have included several stop points in my procedure: points in the process where I have found it important to halt until the step is complete and approved.
7
How much effort is this conversion going to be?
Conversions are not often
• Completed ahead of time
• Completed within budget
• Simpler than you imagined
• Easily tested
• Done and forgotten
Data conversions are always a significant and non-trivial effort. Because of many factors, conversion project budgeting (time and money) often slips.
More often than not, conversions become more complicated as you understand the nuances and details of the CRF case book. For example, details of the exposure or treatment schedule, derivation of baselines, unit conversions, and coding issues are not always completely understood at project start.
Clinical trials can have a long lifespan, and a conversion once completed may re-appear months or years later for further work. Sometimes, old conversions are revived for comparison with current trials. Therefore, archive your work carefully.
8
How much effort is this conversion going to be?
Need to assign staff and estimate effort for
• Project management
• Project documents
• Issue management
• CRF Annotation
• Data review
• Client meetings
• Trial Design development
• Programming
• Mapping specification
• Technical assistance
• Testing
• Converted data review
• Change management
• Preparation of deliverables
The colors in this slide indicate the kind of expertise one should have to best perform the task, and stay within budget:
• Blue: managers
• Green: technical, with programming skills
• Purple: QA, testing experience
9
Planning: Why does the client want SDTM?
Submission-ready
SDTM to load a CDW
SDTM for safety review
Sharing and comparing
Ye Olde Data – Legacy conversions
Business planning
The reason for conducting an SDTM conversion is no longer always a submission. It may include standardizing for clinical data warehouse loading, safety review, sharing study results with partners, converting old trials or trials acquired in a merger. Sometimes conversions are performed to help measure effort to plan for future trial conversions.
10
Planning: What does the client want?
[Figure: conversion outputs. Items shown: project docs, SOWs, and contracts; SDTM datasets; annotated CRF; define.xml; mapping spec; tests & results; executable code; issues log.]
There’s more that a client wants than just SDTM data sets. Knowing the complete list of deliverables helps you determine project work sequence, task staffing, and budgeting.
11
Planning: What does the client have?
[Figure: conversion inputs. Items shown: SOWs and contracts; budget constraints; time frame; source data; CRFs (PDF or scans); company metadata; additional source data; protocol; staff available to review; other requirements and constraints.]
Clients usually have contracts, budgets, time constraints, source data and CRFs, but they often have additional information which can be extraordinarily useful if you request it at project start. For example, trials design domains, visit naming, and treatment schedules are usually explained in the study protocol document. Most pharma companies have company codelists that are useful in standardizing data.
Note that there are cases where a client has all except data because they are just starting the trial but want SDTM data sets periodically, as data is collected. In this case you might ask about test or UAT data used in trial development.
12
SDTM Submissions
Typical submission includes SDTM datasets + define.xml
[Figure: submission flow diagram. Stages labeled: Sponsor, Gateway, Repository, Review Environment. Components shown: sponsor data repository; eSub (SDTM XPT data, define.xml); FDA Electronic Document Room servers; staging area; WebSDM data load & validation; WebSDM DB; Janus SDTM materialized views; FDA/NCI Janus Data Warehouse; review tools (i-Review, SAS/JMP, WebSDM/CTSD).]
This diagram illustrates how WebSDM™ (a Phase Forward commercial product, see www.phaseforward.com) is currently used at the FDA.
WebSDM contains a load and check feature which can act as a roadblock for submission acceptance if your SDTM submission is incorrectly formatted. WebSDM’s rule set is also part of the Janus Data Warehouse load checking functionality.
13
Planning: Submission-Ready Data
• Data converted to SDTM
• Extra collected data in SUPPQUAL
• External data (labs, ECGs) integrated
• Clinical comments in CO, integrated
Derivations
• Data recoded to standards (e.g., MedDRA)
• Data mapped to internal test codes
• Data standardized (e.g., labs to SI units)
• Data derivations
– Baselines, Timepoints
– Categories, Groups
Metadata
• Trial design domains
• Related observations in RELREC
• Define.xml
QA Documents
• Mapping specification
• Tests and test results
Creating submission-ready SDTM conversions means more than SDTM-formatted data sets. A submission-ready package includes:
• SDTM-formatted data
• external data integrated into domains
• derivations and clinical comments (linked to observations)
• Trial design domains
• Define.xml data dictionary document
• SDTM-annotated CRF
• Project documents including mapping specifications, tests and test results, etc.
14
Budgeting, Estimating, Guesstimating
SDTM conversion job estimation is often tricky, particularly if the request is from a new client for a new study.
I start by examining the CRF, taking a guess as to the SDTM domain target, and estimating my programming effort. Then I add time for other factors including define.xml creation, mapping spec development, annotation of the CRF, and testing.
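The estimating approach described above can be sketched as a small calculation. This is only an illustration: the domain list, the hours per domain, and the overhead fractions are invented placeholder numbers, not figures from any real project or from the spreadsheet shown on the slide.

```python
# Hypothetical programming estimates (hours) per target SDTM domain,
# guessed from a first read of the CRF. All numbers are placeholders.
programming_hours = {"DM": 8, "EX": 12, "DS": 10, "AE": 12, "LB": 20, "VS": 8}

# Hypothetical overhead factors, each a fraction of total programming effort.
overhead_fractions = {
    "CRF annotation": 0.50,
    "mapping specification": 0.40,
    "define.xml": 0.20,
    "testing": 0.50,
}


def estimate_total_hours(programming, overheads):
    """Return (programming subtotal, overhead breakdown, grand total) in hours."""
    subtotal = sum(programming.values())
    breakdown = {task: subtotal * fraction for task, fraction in overheads.items()}
    return subtotal, breakdown, subtotal + sum(breakdown.values())


subtotal, breakdown, total = estimate_total_hours(programming_hours, overhead_fractions)
print(f"Programming: {subtotal} h")
for task, hours in breakdown.items():
    print(f"{task}: {hours:.0f} h")
print(f"Total: {total:.0f} h")
```

Keeping the per-domain guesses and the overhead factors separate makes it easier to revise one without redoing the other when the CRF turns out to be more complicated than expected.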
This spreadsheet is then provided to my manager who then meets with the sales staff. They negotiate a cost between company billing needs and the need to complete the sale.
15
Performing SDTM Conversions
Jean-François Millet, “The Gleaners” c. 1857, oil on canvas (collection of Musée d'Orsay, Paris)
16
General Conversion Process (steps 1 and 2 of 12)
1. Annotate CRF ↔ Review ↔ Update ↔ Review → Acceptance
• Often takes several cycles
• Most knowledge-intensive part of the SDTM conversion process
• May require sponsor domain experts
• Advisable to host a meeting to walk through annotation
Stop here until accepted by all parties
My overall conversion process has 12 steps.
Note: This is my current process, developed over a five-year period in which I was involved with well over 100 SDTM conversion projects. Your process may differ.
Step 1 is annotation of the CRF (case report form). I view this as a critical step as the SDTM-annotated CRF becomes the master specification for all the downstream programming and testing. Further, virtually everyone involved in the conversion process understands the structure and function of the CRF, so this is a good place to start.
I usually halt the conversion process until the CRF is completely annotated and approved. Continuing the process without an approved CRF always creates problems and wastes time and money downstream.
CDISC provides some guidance as to CRF annotation style (search www.cdisc.org).
17
Annotate the CRF
• Center of the conversion process
• Knowledge-intensive component
• Everyone knows about / understands the CRF or knows who knows
• Considerable effort to review
Never underestimate the amount of work that is required to annotate a CRF. While not technically complex (use Adobe® Acrobat®), it is without a doubt the most knowledge-intensive aspect of an SDTM conversion. The director/manager of the annotation process must understand both the trial structure and the SDTM target model. The most experienced SDTM developer available should direct or participate in this process along with someone who is familiar with the trial.
18
Annotate the CRF
Domain 1
Domain 2
More often than not, data from one form maps to multiple domains.
This increases work everywhere:
• More training
• More review
• More programming effort
• More QA
19
Annotate the CRF
• Requires careful review and thought
• Biostatisticians should be part of the review
Domain 4Domain 2
Domain 1 Domain 3
SDTM annotations require thoughtful and careful review of every page, as some domains will map just one question from some pages.
The bottom image shows a client-created mapping of 4 domains from a nine-question section. It could probably use more thought, as it is quite messy.
20
Annotate the CRF
Domain 4Domain 2
Domain 1 Domain 3
Indices that measure disease severity, such as the APACHE II Index shown here (APACHE II = “Acute Physiology and Chronic Health Evaluation II”), collect data across many dimensions and often map to multiple domains. Here’s a single form that maps to four findings domains. It probably also needs a RELREC.
21
General Conversion Process (step 2 of 12)
2. Install and review data
Need to check that
• All subjects can be identified by treatment / ARM
• Data set keys can be identified and data sets can be joined together
• Topic variables exist and --TESTCDs can be derived
• Visit numbering is understandable
• Disposition events can be identified
• Many other things, including
– SDTM keys, codes for items with controlled terminology
– Data for timing variables
– Potential RELREC relationships can be established
– Comments can be properly handled
– External data (labs, ECGs) is formatted in an understandable way
Useful to do this as soon as data’s available while sponsor staff is engaged
Stop here until confident about source data
It’s prudent to spend a couple of days thoroughly reviewing source data. I would not proceed until you are confident that you have what you need to complete the task.
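A few of the source data checks listed on this slide can be sketched in code. The following is a minimal illustration, assuming the source extracts have been read into lists of dictionaries; the table contents and column names (SUBJID, TRT_GROUP, VISIT, TEST, VALUE) are hypothetical and will differ from study to study.

```python
from collections import Counter

# Hypothetical source extracts. Column names and values are illustrative only.
enrollment = [
    {"SUBJID": "001", "TRT_GROUP": "Drug A"},
    {"SUBJID": "002", "TRT_GROUP": "Placebo"},
    {"SUBJID": "003", "TRT_GROUP": ""},  # no treatment group recorded
]
vitals = [
    {"SUBJID": "001", "VISIT": "1", "TEST": "SYSBP", "VALUE": "120"},
    {"SUBJID": "001", "VISIT": "1", "TEST": "SYSBP", "VALUE": "122"},  # duplicate key
    {"SUBJID": "004", "VISIT": "1", "TEST": "SYSBP", "VALUE": "118"},  # unknown subject
]


def subjects_without_arm(rows):
    """Subjects whose treatment group is missing or blank."""
    return [r["SUBJID"] for r in rows if not r.get("TRT_GROUP", "").strip()]


def duplicate_keys(rows, key_columns):
    """Key values that occur on more than one row."""
    counts = Counter(tuple(r[c] for c in key_columns) for r in rows)
    return [key for key, n in counts.items() if n > 1]


def orphan_subjects(rows, known_subjects):
    """Subjects present in a data set but absent from the enrollment list."""
    return sorted({r["SUBJID"] for r in rows} - set(known_subjects))


known = [r["SUBJID"] for r in enrollment]
print("No treatment group:", subjects_without_arm(enrollment))
print("Duplicate keys:", duplicate_keys(vitals, ["SUBJID", "VISIT", "TEST"]))
print("Not enrolled:", orphan_subjects(vitals, known))
```

Findings like these are worth raising while sponsor staff are still engaged, since each one usually needs a decision from someone who knows the study.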
22
General Conversion Process (step 3 of 12)
1. Annotate CRF
2. Install and review data
3. Develop trial design domains as a spreadsheet
• Requires protocol
• May require review cycles
• Important to have TA, TE, and TV settled at the start
• Useful to do this as soon as data’s available while sponsor staff is engaged, particularly for legacy conversions
Very hesitant to continue conversions for submissions without TD (especially TA and TV)
Quite often, trial design domain creation is left undone until the end of the project. This is often because trial design data is not collected, but is recorded in the study protocol document and must be extracted. In my opinion, this is a big mistake, and I urge SDTM developers to develop the trial design domains, in Excel spreadsheet form, first, and have the client review the form before proceeding.
My reason for this is that many domains, such as DM, EX, DS, IE, SV, and SE, all have some dependence on trial design data.
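To illustrate that dependence, here is a minimal sketch of a Trial Arms (TA) table held as rows, with two checks against it. The study, arm, and element values are invented, and the checks are examples of the kind of cross-reference that becomes possible once TA is settled; they are not a complete validation.

```python
# Hypothetical Trial Arms (TA) rows, as they might be typed into a spreadsheet.
ta_rows = [
    {"STUDYID": "XYZ1", "ARMCD": "A", "ARM": "Drug A", "TAETORD": 1, "ETCD": "SCRN", "EPOCH": "SCREENING"},
    {"STUDYID": "XYZ1", "ARMCD": "A", "ARM": "Drug A", "TAETORD": 2, "ETCD": "TRTA", "EPOCH": "TREATMENT"},
    {"STUDYID": "XYZ1", "ARMCD": "P", "ARM": "Placebo", "TAETORD": 1, "ETCD": "SCRN", "EPOCH": "SCREENING"},
    {"STUDYID": "XYZ1", "ARMCD": "P", "ARM": "Placebo", "TAETORD": 2, "ETCD": "PBO", "EPOCH": "TREATMENT"},
]

# Hypothetical demographics rows carrying each subject's arm code.
dm_rows = [
    {"USUBJID": "XYZ1-001", "ARMCD": "A"},
    {"USUBJID": "XYZ1-002", "ARMCD": "P"},
    {"USUBJID": "XYZ1-003", "ARMCD": "B"},  # arm code not defined in TA
]


def repeated_element_orders(ta):
    """(ARMCD, TAETORD) pairs that appear more than once in TA."""
    seen, repeated = set(), []
    for row in ta:
        key = (row["ARMCD"], row["TAETORD"])
        if key in seen:
            repeated.append(key)
        seen.add(key)
    return repeated


def subjects_with_unknown_arm(dm, ta):
    """Subjects in DM whose ARMCD does not appear in TA."""
    planned = {row["ARMCD"] for row in ta}
    return [row["USUBJID"] for row in dm if row["ARMCD"] not in planned]


print("Repeated element orders:", repeated_element_orders(ta_rows))
print("Unknown arm codes:", subjects_with_unknown_arm(dm_rows, ta_rows))
```

Real data may legitimately carry special arm codes (for example, for screen failures) that a check like this would need to allow for.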
23
Develop Trial Design Spreadsheet
Example of a trial design spreadsheet showing TA for three studies.
Develop these from the protocol document then have client review and approve before proceeding.
24
Develop Trial Design Spreadsheet
Example of a trial design spreadsheet showing TE and TV.
Develop these from the protocol document then have client review and approve before proceeding.
25
General Conversion Process (steps 4 and 5 of 12)
1. Annotate CRF
2. Install and review data
3. Develop trial design domains as a spreadsheet
4. Program “central 4” domains & start mapping spec document
5. Send “central 4” domains to client with mapping specification
• For review of a small set of data
– Gives you an idea of sponsor responsiveness, quality
• For comment on the structure of the mapping spec
– Mapping specs are very time consuming and risky to modify
Now that you have a set of approved SDTM-annotated CRFs, a set of reviewed source data, and a trial design spreadsheet, you are ready to proceed.
If possible, I will next map four domains that I consider central to the SDTM and the conversion process. Then I will provide these to the client for review.
26
My Central Four Domains
Trial Arms (TA)
• Planned structure of trial
• Shows all treatment regimens
• Links study Protocol to demographic information
Demography (DM)
• Who’s enrolled in the trial
• Who’s assigned to study ARMs
Exposure (EX)
• What treatment did subjects get?
• Check on DM.USUBJID x DM.ARM
Disposition (DS)
• What was each subject’s actual path through the trial?
• Final resolution of each subject
TA DM EX DS
The domains TA, DM, EX, and DS cover the trial from planning (TA), to enrollment (DM), to treatment (EX), to trial end (DS).
Creating these four domains allows one to ascertain with complete certainty the set of subjects in the trials, the set of subjects treated, and the fate of these subjects (these are not always so clear in some legacy studies). Additionally, everyone wants to know this information, it is required, and many other domains depend on getting these domains correct, so do it first. Besides, it can be a lot harder than you think.
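One way to make the subject accounting concrete is to compare the sets of subject identifiers found in DM, EX, and DS. The sketch below assumes the USUBJID values have already been pulled from each converted domain; the identifiers are invented.

```python
# Hypothetical USUBJID sets extracted from the converted DM, EX, and DS domains.
dm_subjects = {"XYZ1-001", "XYZ1-002", "XYZ1-003"}
ex_subjects = {"XYZ1-001", "XYZ1-002", "XYZ1-004"}
ds_subjects = {"XYZ1-001", "XYZ1-003"}


def reconcile(dm, ex, ds):
    """Report subjects that appear in one central domain but not in another."""
    return {
        "treated but not in DM": sorted(ex - dm),
        "in DM but never treated": sorted(dm - ex),
        "in DM without disposition": sorted(dm - ds),
        "disposition but not in DM": sorted(ds - dm),
    }


report = reconcile(dm_subjects, ex_subjects, ds_subjects)
for finding, subjects in report.items():
    print(f"{finding}: {subjects}")
```

Not every difference is an error. A subject in DM with no exposure may simply never have been dosed, but each difference deserves an explanation before the other domains are built on top.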
27
Mapping Specification Document
• Needs to clearly record the relationship between
1. CRF questions
2. source data items
3. SDTM domains and variables
In most cases clients want to have documentation of exactly what was programmed in a conversion. To do this, create a “mapping specification” document. A mapping spec shows the name (e.g., table.column or file.column) of source data, and the target (SDTM domain.variable) and a description of what was done to the data, or an algorithm of data processing.
This document is time-expensive to make, and difficult to maintain.
Creation of this document depends on the annotated CRF and the structure of the source data. If these change, then the mapping spec must be updated.
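As a rough illustration of the information a mapping specification row carries, the sketch below models one row as a small record and writes a few rows out as CSV text that could be opened as a spreadsheet. The field names, source table names, and rules are hypothetical; actual mapping spec layouts vary by company and client.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class MappingRow:
    crf_question: str  # question text as it appears on the CRF
    source: str        # source item, written as table.column
    target: str        # SDTM target, written as DOMAIN.VARIABLE
    rule: str          # description of what is done to the data


# Hypothetical rows; the source tables and rules are invented for illustration.
mapping_rows = [
    MappingRow("Date of birth", "DEMOG.BIRTHDT", "DM.BRTHDTC", "Reformat to ISO 8601 date"),
    MappingRow("Sex", "DEMOG.GENDER", "DM.SEX", "Map 1 -> M, 2 -> F"),
    MappingRow("Adverse event", "AELOG.AETEXT", "AE.AETERM", "Copy verbatim; drop rows where text is blank"),
]


def to_csv(rows):
    """Render mapping rows as CSV text, one header line plus one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(MappingRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def target_domains(rows):
    """Sorted list of SDTM domains that the mapping rows write to."""
    return sorted({row.target.split(".")[0] for row in rows})


print(to_csv(mapping_rows))
print("Domains covered:", target_domains(mapping_rows))
```

Holding the rows in a structured form rather than free text makes it easier to regenerate the document when the annotated CRF or the source data structure changes.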
28
Mapping Specification Spreadsheet
Some (many?) developers create the mapping specification first, seek approval, and then program the instructions it contains.
Although that is a reasonable strategy, I have followed this process over several years and dozens of trials and have found that it just doesn’t work well. This is because it is very difficult (at least for me) to mentally translate a complex CRF into spreadsheet form before you are “into” and familiar with the data.
I tell clients that the annotated CRF is the specification, and then I develop the mapping specification as I code the conversion; the mapping spec reflects what I’ve programmed.
29
General Conversion Process (step 6 of 12)
1. Annotate CRF
2. Install and review data
3. Develop trial design domains as a spreadsheet
4. Program “central 4” domains & start mapping specification
5. Send “central 4” domains & map spec out for review, comment
6. Development continues
• Program safety domains next
• Programming other domains after safety domains
• Update mapping specification as development proceeds
• Unit test coding during development
• WebSDM load & check as domains completed
Continue on.
Generally I develop the safety domains next (e.g., AE, CM, LB, VS, PE, EG, maybe one or two others) because this is often what clients want to see first.
I program and update the mapping spec and test SDTM model compliance with WebSDM continuously as the conversion develops.
30
Program, Test, Repeat
[Figure: development cycle diagram. Labels shown: annotations / conversion plans; program domain by domain, update map spec in parallel; generate SDTM datasets periodically; load & check in WebSDM; further refinement; formal testing.]
Review your work early and often. We use WebSDM SDTM load checking for this purpose because it’s fast and the issues are well-defined and traceable. This helps us detect inconsistencies within and across domains early, and it helps us differentiate errors due to logic from errors due to programming. It also highlights issues with source data (e.g., missing data, inconsistent units).
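For readers without access to a load-checking tool, the idea of checking early can still be illustrated with a few hand-written structural checks. The sketch below is not WebSDM’s rule set; it only tests two common SDTM conventions (identifier variables are populated, and test codes are at most 8 characters and do not start with a digit) on hypothetical vital signs rows.

```python
import re

# Up to 8 characters: uppercase letters, digits, or underscore, not starting with a digit.
TESTCD_PATTERN = re.compile(r"[A-Z_][A-Z0-9_]{0,7}")


def check_findings_rows(domain, rows):
    """Return (row number, message) pairs for a few basic structural problems."""
    issues = []
    for number, row in enumerate(rows, start=1):
        for variable in ("STUDYID", "DOMAIN", "USUBJID", f"{domain}SEQ"):
            if row.get(variable) in (None, ""):
                issues.append((number, f"{variable} is missing"))
        testcd = row.get(f"{domain}TESTCD") or ""
        if not TESTCD_PATTERN.fullmatch(testcd):
            issues.append((number, f"{domain}TESTCD {testcd!r} is not a valid test code"))
    return issues


# Hypothetical converted rows; the second row has two deliberate problems.
vs_rows = [
    {"STUDYID": "XYZ1", "DOMAIN": "VS", "USUBJID": "XYZ1-001", "VSSEQ": 1, "VSTESTCD": "SYSBP"},
    {"STUDYID": "XYZ1", "DOMAIN": "VS", "USUBJID": "", "VSSEQ": 2, "VSTESTCD": "DIASTOLIC_BP"},
]
vs_issues = check_findings_rows("VS", vs_rows)
for number, message in vs_issues:
    print(f"row {number}: {message}")
```

Running checks like these each time a domain is regenerated keeps the list of open issues short and traceable to a specific change.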
31
General Conversion Process (step 7 of 12)
1. Annotate CRF
2. Install and review data
3. Develop trial design domains as a spreadsheet
4. Program “central 4” domains & start mapping specification
5. Send “central 4” domains & map spec out for review, comment
6. Development continues
• Programming other domains on annotated CRF
• Update mapping specification as development proceeds
• Unit test during development
• WebSDM load & check as domains completed
7. Begin development of formal tests
Time to test.
Testing might be informal or ad hoc, or more formal including the development of written test scripts, and the execution of test scripts by experienced QA staff.
32
Testing SDTM Conversions
Subject and row counts
• Source data vs. SDTM data
• Non-trivial
Data value subset review & topic variable comparisons
Categorical aggregate review
Continuous aggregate review
Manual review
• By individual subject
• By proportion of subjects
Load, check, review with WebSDM or Empirica Study
Sponsor review
Data validation
• Usually a separate project
Combination or custom testing
There are many dimensions to testing.
Row counting between source data sets and SDTM data sets is often harder to complete than you’d think. This is because the SDTM allows you to grow rows (e.g., source has multiple AEs that are to be split, transposing to tall-narrow tables in findings domains), reduce rows (e.g., suppress rows where AETERM is null), or create rows (e.g., unique disposition milestone created when subjects enter a specific phase or set of circumstances).
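The row-growth case can be illustrated with a small transposition. The sketch below assumes a wide source table with one column per vital sign and a mapping decision that blank results are not carried into the findings domain; the column names and values are hypothetical. The expected count is computed separately from the transposition so that the two can be compared.

```python
# Hypothetical wide source table: one row per subject visit, one column per test.
source_vitals = [
    {"SUBJID": "001", "VISIT": "1", "SYSBP": "120", "DIABP": "80", "PULSE": "72"},
    {"SUBJID": "001", "VISIT": "2", "SYSBP": "118", "DIABP": "", "PULSE": "70"},
    {"SUBJID": "002", "VISIT": "1", "SYSBP": "130", "DIABP": "85", "PULSE": ""},
]
TEST_COLUMNS = ["SYSBP", "DIABP", "PULSE"]


def transpose_to_vs(rows):
    """Turn each non-blank test value into one tall-narrow findings row."""
    out = []
    for row in rows:
        for test in TEST_COLUMNS:
            if row[test] != "":
                out.append({
                    "USUBJID": row["SUBJID"],
                    "VISITNUM": row["VISIT"],
                    "VSTESTCD": test,
                    "VSORRES": row[test],
                })
    return out


def expected_row_count(rows):
    """Count non-blank values column by column, independently of the transposition."""
    return sum(sum(1 for row in rows if row[test] != "") for test in TEST_COLUMNS)


vs = transpose_to_vs(source_vitals)
print(f"source rows: {len(source_vitals)}, SDTM rows: {len(vs)}, expected: {expected_row_count(source_vitals)}")
```

Here three source rows become seven SDTM rows, so a naive comparison of row counts would flag a mismatch even though the conversion is correct.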
33
General Conversion Process (steps 8-11 of 12)
1. Annotate CRF
2. Install and review data
3. Develop trial design domains as a spreadsheet
4. Program “central 4” domains & start mapping specification
5. Send “central 4” domains & map spec out for review, comment
6. Development continues
• Programming other domains on annotated CRF
• Update mapping specification as development proceeds
• Unit test during development
• WebSDM load & check as domains completed
7. Begin development of formal tests
8. Intermediate dataset delivery for review
9. Testing commences
10. Issues Management as a result of client review
11. Define.xml
Development continues.
It’s generally a good idea to periodically supply your client with SDTM datasets that you’ve preliminarily completed for review; the more client review (even if it’s an internal client) the better, as they should understand their study best. Plus, without a doubt, most clients make changes of some sort when they actually see trial data reformatted to SDTM. Biostats, for example, may want data organized or grouped a specific way to prepare for analysis dataset creation.
It’s often best to create the define.xml data dictionary document absolutely last, to avoid rework.
34
Representing SDTM Metadata in define.xml
Define.xml: Description of datasets and their contents
Components
• Table of Contents: lists all domains
• Data Definition Tables: provide 7 kinds of information about variables
• Controlled Terminology: Codelists referenced
• Value Level Metadata: Map test codes to test name values
• Comments
Created programmatically by reading SDTM datasets
• Create from WebSDM, SDTM Mapper Tool or proprietary code
Generally needs editing to add:
• Link to blank annotated CRF and other documents
• Origins of each SDTM item (e.g., CRF pages, Derived, eDT, Protocol)
• Computational methods or derivations used
• Dictionary names and versions (e.g., MedDRA)
• Once we have a set of SDTM-formatted data files, the submission needs a dictionary-like document that describes the contents of these files to their recipients.
• We do this by creating a document called a “define”. We create this document using XML so that it displays as a web page.
• The Define.xml document consists of a description of datasets and their contents:
1. Table of Contents: lists all domains
2. Data Definition Tables: provide 7 kinds of information about variables
3. Controlled Terminology: Codelists referenced
4. Value Level Metadata: Map test codes to test name values
5. Comments
The define is created programmatically by reading SDTM datasets (we create from WebSDM v3 and then edit or post-process that document as needed).
Any define document generally needs editing to add:
• Link to blank annotated CRF
• Origins of each SDTM item (e.g., CRF pages, Derived, eDT, Protocol)
• Computational methods or derivations used
• Dictionary names and versions (e.g., MedDRA)
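To show what “created programmatically by reading SDTM datasets” can mean in the simplest case, the sketch below builds a bare XML skeleton from a dictionary of dataset names and variable names. It is deliberately incomplete: it uses ODM-style element names (ItemGroupDef, ItemRef) but omits namespaces, variable definitions, codelists, origins, and everything else a real define.xml requires, and it is not how WebSDM generates its output. The OID naming scheme and study identifier are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory of converted datasets and their variables, in order.
datasets = {
    "DM": ["STUDYID", "DOMAIN", "USUBJID", "SEX"],
    "AE": ["STUDYID", "DOMAIN", "USUBJID", "AESEQ", "AETERM"],
}


def build_skeleton(dataset_variables):
    """Build a simplified, non-schema-valid outline: one group per dataset."""
    odm = ET.Element("ODM")
    study = ET.SubElement(odm, "Study", OID="STUDY.XYZ1")
    metadata = ET.SubElement(study, "MetaDataVersion", OID="MDV.1", Name="Sketch")
    for domain, variables in dataset_variables.items():
        group = ET.SubElement(metadata, "ItemGroupDef", OID=f"IG.{domain}", Name=domain)
        for order, variable in enumerate(variables, start=1):
            ET.SubElement(
                group,
                "ItemRef",
                ItemOID=f"IT.{domain}.{variable}",
                OrderNumber=str(order),
            )
    return odm


skeleton = build_skeleton(datasets)
print(ET.tostring(skeleton, encoding="unicode"))
```

The items that generally need manual editing, such as origins, computational methods, and dictionary versions, cannot be read from the datasets themselves, which is why a generated document is only a starting point.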
35
Define.xml: TOC
Hyperlink to dataset description
Hyperlink to dataset
Natural keys
The top of a displayed define is a table of contents with a domain order specified by CDISC.
Hyperlinks can move you deeper into the document or invoke an external file.
The appearance of the define, its fonts, background colors, and so forth is controlled by another document called an XML style sheet.
Shown here is the standard CDISC style sheet, but you are free to create your own.
36
Define.xml: Data Definition Table
Hyperlink to Value Level Metadata
Hyperlink to Codelist
Hyperlink to description of Computational Method
Hyperlink to pages in CRF
Hyperlink to dataset
Within the define each domain and its contents display. If you excluded empty permissible variables then they do not display here.
Within the domain description section hyperlinks take you to:
1. Value-level metadata, or lists of code-like values in your data with their descriptions, such as lab test codes and lab test names
2. Lists of controlled terminology
3. The origin or source for each variable
4. A comment which may also link to a description of a derivation.
37
General Conversion Process (step 12)
1. Annotate CRF
2. Install and review data
3. Develop trial design domains as a spreadsheet
4. Program “central 4” domains & start mapping specification
5. Send “central 4” domains & map spec out for review, comment
6. Development continues
• Programming other domains on annotated CRF
• Update mapping specification as development proceeds
• Unit test during development
• WebSDM load & check as domains completed
7. Begin development of formal tests
8. Intermediate dataset delivery for review
9. Testing commences
10. Issues Management as a result of client review
11. Define.xml
12. Delivery
Finally, you get to deliver the SDTM-formatted data sets and other documents.
38
Final Delivery
Labeled SAS transport files
SQL scripts to be run on demand
SQL*Loader files
Other
• Oracle export .dmp
• SAS data sets
• ASCII files
• Excel workbooks
Although SDTM datasets are often SAS V5 transport files (*.xpt) created with SAS PROC COPY and the XPORT engine, SDTM delivery might include other types of content such as executable programs (e.g., SQL scripts, SAS, Java, etc.) to be run by the client on demand, or other types of files.
Delivery should always include a manifest document that lists the delivered files with their time stamps.
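A manifest of this kind can be generated rather than typed. The following is a minimal sketch, assuming all deliverables sit in one directory; the file names in the demonstration are invented, and the checksum column is an optional extra beyond the time stamps mentioned above.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def build_manifest(directory):
    """Return one tab-separated line per file: name, size, UTC time stamp, SHA-256."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories in this simple sketch
        info = os.stat(path)
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        stamp = modified.strftime("%Y-%m-%dT%H:%M:%SZ")
        lines.append(f"{name}\t{info.st_size}\t{stamp}\t{digest}")
    return lines


# Demonstration on a throwaway directory holding two placeholder files.
with tempfile.TemporaryDirectory() as delivery_dir:
    for name, content in [("dm.xpt", b"demo"), ("define.xml", b"<ODM/>")]:
        with open(os.path.join(delivery_dir, name), "wb") as handle:
            handle.write(content)
    demo_manifest = build_manifest(delivery_dir)

for line in demo_manifest:
    print(line)
```

Including a checksum lets the recipient confirm that each file arrived unchanged, which also helps when an archived conversion is reopened months or years later.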