SDTM - Specifications and Programming

Introduction

Instat has developed a standard process for SDTM programming. At a high level, the process is to

  • capture the SDTM specifications for the domains (datasets) to be generated in a standard spreadsheet
    • provide programming details including mapping of raw data variables to SDTM variables and computation algorithms to the spreadsheet
    • add details related to the content needed for the define.xml to the spreadsheet
  • write external SAS programs as needed to prepare the raw data for direct mapping to SDTM variables
  • execute standard SAS macros to generate the SDTM creation programs from the spreadsheet specifications
  • execute the generated creation programs to create the SDTM datasets

This process allows for flexibility in its application with respect to SAS code generation for the SDTM creation programs. The programmer can choose to use the spreadsheet to define as much of the mapping and derivation algorithms as possible or choose to limit the use of the spreadsheet to SDTM specifications alone and perform all data mapping and derivations entirely in SAS programs. The full use of the spreadsheet can provide benefits but in some cases adds unnecessary complexity.

This document will not cover CDISC SDTM standards or how to correctly map data to SDTM domains, but focuses on the process and tools used to write specifications and create programs for SDTM datasets.

General Process Overview

SDTM specifications

Who is responsible: the biostatistician or senior programmer experienced with SDTM standards

Setting up a new SDTM Mapping spreadsheet

Check the GoogleDocs templates for the latest SDTM Mapping template spreadsheet, and copy this to a new spreadsheet. Give it a name that includes the study short name and includes “SDTM Mapping Specifications” in the document name.

Do the same for the trial level domains data template. Trial level domains are populated with data from this spreadsheet and not from data mapped in from raw datasets.

The layout of the SDTM Mapping specifications spreadsheet

Within the spreadsheet document, there are a few sheets that list out the domains and their status. This helps to plan and track what is needed for your project. Also, a sheet exists for each of the common SDTM domains. One sheet per domain.

The domain specifications sheets have many columns of information. The left portion are the dataset specifications, such as variable name, type, length, label. These are followed by CDISC attributes for each variable, such as required, role and origin. In the right-hand portion of the sheet, are details related to the mappings and derivation algorithms, with columns for source dataset, source variables, pre-code and algorithm.

domain_sample

Additional domains can be added by adding a new sheet. Within domains, new variables can be added per the SDTM guidelines. The pre-filled standard domains and variables should not be removed, you never know when you might need to add them in to your dataset and it’s helpful to have those details readily available. Instead, you can suppress domains or variables to not be included in your results.

Writing specifications

The template is pre-filled with the variable names and details that are complaint with SDTM standards. These should not be modified so as to remain complaint with the standards. Also, if you don’t need a domain or a variable, you should not delete them but simply suppress them. Later you’ll see how.

Dataset specifications then mostly becomes an exercise of deciding

  1. which of the pre-filled standard domains and variables apply to your project and should be used, and
  2. what additional domains or variables should be added.

Don’t worry about define.xml related information or mapping details at this point. Focus on defining the domains and variables that should exist. Consult the SDTM standards and implementation guide to complete the specifications correctly.

Having the raw annotated CRF, a raw PROC CONTENTS report and the spreadsheet open side-by-side on a large monitor can help keep a view on what source/raw data is available and needs too be mapped so nothing is overlooked. You may also want to print the aCRF unique pages and keep notes on how things will map to SDTM as you go.

Programming details

As mentioned earlier, there is some flexibility in how you can get the programming done to create your SDTM datasets. The SAS macros developed to support this process have the ability to use information entered in the spreadsheet to generate SAS code. The biggest advantage by using the spreadsheet properly is that your generated dataset will have variable attributes as defined in the spreadsheet. But beyond that, you can also specify mapping details and even derivation algorithms that can also be auto-generated into SAS code.

If you choose to not use the spreadsheet for programming details, either mappings or derivations, you still need to link your own program into the spreadsheet process so at a minimum, you still get the variable attributes (name, type, length, label and which are kept in the end) driven by the spreadsheet. If you do not, your dataset variables will be defined in your own program and may be out of sync with the specifications.

When using the spreadsheet for mapping data

Use the columns in the spreadsheet to define

  • source dataset
  • source variable(s)
  • derivation algorithm

for straightforward mappings, such as a one to one mapping of a single instance of a raw dataset variable to an SDTM variable. When there is no need to restructure your data (transpose, collapse, etc) then this is quick and easy way to define these mapping right in the spreadsheet.

After entering the source dataset and source variable, a derivation algorithm is required to complete the mapping.

For example, let’s say we are defining the DM domain variable SEX and our raw incoming data has a gender variable in a dataset called enroll. Our column input will be:

  • source dataset = enroll
  • source variable = gender
  • derivation algorithm = #VAR# = gender

We always use #VAR# in place of the actual SDTM variable name, just in case we need to copy this algorithm to another variable. The macro will replace #VAR# with the variable name.

What if your source data has values that are not what CDISC SEX variable should use as its controlled terms? Any mapping of values can be done within the derivation algorithm, utilizing any SAS techniques you want. After all it’s just SAS code that will get dropped into a data step by the macro. For example:

if gender='M' then #VAR# = 'MALE';
else if gender='F' then #VAR# = 'FEMALE';

or if you are a format defined to map your values, you can use

#VAR# = put(gender, $sexmap.) 

as your derivation algorithm.

How the macro processes your mapping information

There’s no limit to your algorithm code, but you are always working in a one-to-one (observation) mapping situation when using the derivation algorithm and mapping details. Basically, the macro will simply read in the source dataset, keep the ID variables it needs to merge and the source variable(s) you specified, sort it by the ID variables and merge it on to the final data step where all derivation algorithms get added into that data step in the order they appear in the spreadsheet.

If you have a non-simple merge situation, or the need to restructure your data before the final data step merge, then you need to use the Pre-merge Code column of the spreadsheet.

Using the pre-merge code column

This column allows you to define a SAS program that will be executed prior to the final data step merge and applying the derivation algorithm. This allows you to handle most scenarios of mapping to SDTM variables. For example, if you need the first treatment date then your pre-merge code column might be something like this:

proc sql;
  create table trtsaart as
  select usubjid, min(dosedate) as trtstart 
  from rawds group by usubjid;
quit;      

The key here is that the last dataset to get created by your pre-merge code will be what is merged on to the last data step that creates the domain dataset and applies all the derivation algorithms. So this must be a single-observation per subject dataset, or the merge and resulting SDTM dataset will be incorrect.

In this example, the derivation algorithm would look like this:

#VAR# = trtstart

which is a variable available in the last data step, since we created it in the pre-merge code. The macro will merge on to the last data step any datasets created in your pre-merge codes, so those variables are available for derivations (mappings) to SDTM variables.

When using external SAS programs for mapping data

You may choose to not use the spreadsheet at all for your mapping details. In which case, you still need to link your programs with the SDTM domains, to get the variables created per the attributes specified in the spreadsheet.

To do this, enter a single pre-merge code in one of the top rows of your domain sheet, and have the pre-merge code simply be a %include of your program. Nothing else is required to be filled in to the mapping columns, not even a value on any variables for derivation algorithms.

Since your external program is linked, your final dataset you create in your program will still be added (merged) to the macro’s final data step where it applies the variable attributes and keep statement.

Important:

  • Your program should only create a SASWORK dataset and not the final permanent dataset for the domain.
  • Also, don’t try to set final attributes on your SDTM variables in your external program, these may conflict and will be overridden by the macro that uses the spreadsheet variable specifications.

Combination of spreadsheet and external SAS programs for mapping

When pre-merge code is involved or lengthy, or just for your preference, you can use external SAS programs to do any of your tasks needing restructuring and preparing of your source data before they can be mapped into the SDTM variables but still use the spreadsheet mapping columns for simple mappings. When to use which is up to you.

SDTM process macros

As you can imagine, the spreadsheet and any external programs you created now create an integrated and dependent process. The SDTM macros will pull it all together and create the final SDTM datasets. To do this, it will generate a SAS program for each domain. This let’s you see the actual code that generates each dataset and helps in QC and troubleshooting, by easily tracing back to the spreadsheet the parts that have errors or issues.

Macro Purpose Output
%sdtm_create Reads the specifications for a domain and generates the complete SAS program to create the SAS dataset (domain) cr_[domain name].sas (For example, cr_dm.sas)
%sdtm_create_all Reads your full list of domains from the spreadsheet and runs %sdtm_create for each. Optionally, it can also produce a specifications report cr_[domain name].sas for all domains

%sdtm_create and %sdtm_create_all Macros

The macro to read the specifications and generate the complete SAS creation program for a single SDTM domain is %sdtm_create and has the following parameters:

Parameter Description
speclib Libname for the specifications datasets, imported from the mapping spreadsheet
specds Name of the specification dataset for this domain
source_id_vars List of ID variables in the source datasets to use during the merge step
outlib Libname for the output dataset to be created by the auto-generated SAS code (default = WORK)
studyid Study ID to use as the value for the CDISC variable
pgmpath Path to the programs (default = &proot\derived)

A companion macro is %sdtm_create_all that can do all domains specified in the spreadsheet at once, with options to also execute the create programs or generate a specifications report.

Parameter Description
speclib Libname for the specifications datasets, imported from the mapping spreadsheet
source_id_vars List of ID variables in the source datasets to use during the merge step
outlib Libname for the output dataset to be created by the auto-generated SAS code (default = WORK)
source Set =1 if you want SAS option SOURCE2 during execution (default is 1)
notes Set =1 for SAS NOTES to be turned on (default is 1)
generate Set =1 to generate the cr_[domain] SAS programs (default is 1)
execute Set =1 to execute the cr_[domain] SAS programs which will create the SDTM domain datasets (default is 1)
specrpt Set =1 to generate a report of SDTM specifications (default is 0)
studyid Study ID to use as the value for the CDISC variable

Generating SDTM datasets

The following sections will cover how to use the macros to go from specifications / mapping spreadsheet to generated datasets.

Generating complete SDTM creation programs

To create the auto-generated SAS programs driven from the specifications, mapping and derivation information entered in the spreadsheet, we execute the following that uses the macro.

For one domain: DM

%setenv(sponsor=pharmaco, study=funproj, dbver=locked_20160701);
libname specs “&droot\&dbpath\specs”;
libname sdtm “&droot\&dbpath\sdtm”;

%imp_gdoc();

%sdtm_create(speclib=specs, specds=DM, outlib=sdtm, 
  studyid=FUNPROJ, pgmpath=&proot\sdtm, 
  source_id_vars=usubjid)

Executing the SDTM creation programs

The generated creation programs, cr_[domain].sas, are standalone programs - meaning they do not need to be executed in a special way within a _runall or executed by a macro. If you have your SAS session initialized to the study with an appropriate %setenv then you can execute an SDTM creation program individually. This is helpful to debug or QC the resulting dataset creation.

To assist in running all domains, you can also use the %sdtm_create_all macro with the execute=1 option.

In some cases, you may want to create your own custom _runall program that executes all the SDTM creation programs with the usual %include statement for each program.