www.sciencetranslationalmedicine.org/cgi/content/full/3/114/114ra127/DC1

Supplementary Materials for
Predicting Adverse Drug Events Using Pharmacological Network Models
Aurel Cami,* Alana Arnold, Shannon Manzi, Ben Reis
*To whom correspondence should be addressed. E-mail: [email protected]
Published 21 December 2011, Sci. Transl. Med. 3, 114ra127 (2011)
DOI: 10.1126/scitranslmed.3002774
This PDF file includes:
Methods
Table S1. Definition of covariates.
Table S2. List of drugs and their ATC codes.
Table S3. Number of missing observations for PubChem properties extracted for this study.
Table S4. Number of missing observations for DrugBank properties extracted for this study.
Table S5. Intercorrelation analysis of covariates.
Table S6. Prediction cases studies.
Table S7. List of supplementary source code files.
Fig. S1. Newly associated ADEs per drug in each ATC top-level group.
Fig. S2. Newly associated drugs per ADE in each MedDRA top-level group.
Fig. S3. Comparative histograms of scores for the observed edges and non-edges by the three model types.
Fig. S4. Three-way Venn diagrams for the sets of true and false positives generated by models NET, TAX, and INT.
Fig. S5. Comparative histograms of selected network covariates for the predicted edges and non-edges.
Fig. S6. Comparative histograms of selected taxonomic covariates for the predicted edges and non-edges.
Fig. S7. Comparative histograms of the intrinsic covariates for the predicted edges and non-edges.
Fig. S8. Drug-specific AUROCs.
Fig. S9. ADE-specific AUROCs.
Other Supplementary Material for this manuscript includes the following: (available at www.sciencetranslationalmedicine.org/cgi/content/full/3/114/114ra127/DC1)
File “meddra_mapping_code.sas” (SAS code to perform MedDRA mapping).
File “NET_INT_covariates.R” (R code to compute network and intrinsic covariates).
File “TAX_covariates.sas” (SAS code to compute taxonomic covariates).
File “Fig2-highres.tif” (high-resolution version of Fig. 2).
Supplementary Methods
Mapping ADE names to MedDRA taxonomy

We used the following approach to map ADE names to the Medical Dictionary for Regulatory Activities (MedDRA) terminology. First, we performed exact matching of each ADE name against the lowest level terms (LLTs) of MedDRA. This step matched approximately 40% of the unique ADE names to LLTs. Next, for each unmatched ADE name, we identified the two closest LLTs in terms of generalized edit distance (computed using the function COMPGED in the Statistical Analysis System (SAS) v9.2). Computer code to perform exact matching and to identify the two closest LLTs of an ADE name is provided below and as a supplementary online file (meddra_mapping_code.sas; table S7). Then, we manually scanned the list of unmatched ADE names and their two closest LLTs and were able to match an ADE name to one of its two closest LLTs for approximately half of that list. We coded the final 30% of ADE names, which were still unmatched after the preceding step, by performing term-based searches using a MedDRA browser. After all ADE names had been mapped to the MedDRA LLT level, we identified the unique preferred term (PT) corresponding to each LLT. Finally, we identified the list of high level terms (HLTs) that corresponded to each PT generated by the preceding step. In this study, all adverse events were represented by their MedDRA HLT codes.
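The exact-match step and the two-closest-LLT step described above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy vectors `ade_names` and `llt_names` and the helper `two_closest` are hypothetical, and base R's `adist` (a generalized Levenshtein distance) stands in for SAS `COMPGED`, which uses a different cost scheme, so the scores would not match those produced by the SAS macro.

```r
# Hypothetical toy inputs; the real lists come from the drug-ADE database
# and from MedDRA.
ade_names <- c("nausea", "head ache", "rash maculopapular", "dizzyness")
llt_names <- c("nausea", "headache", "rash maculo-papular", "dizziness",
               "vomiting")

# Step 1: exact matching of ADE names against the LLT names.
is_exact  <- ade_names %in% llt_names
unmatched <- ade_names[!is_exact]

# Step 2: for each unmatched name, find the two closest LLT names.
# ASSUMPTION: utils::adist() is used as a stand-in for SAS COMPGED.
two_closest <- function(name, candidates) {
  d   <- drop(adist(name, candidates))  # distances to every candidate
  ord <- order(d)[1:2]                  # indices of the two smallest
  data.frame(ae_name    = name,
             min_llt1   = candidates[ord[1]],
             min_score1 = d[ord[1]],
             min_llt2   = candidates[ord[2]],
             min_score2 = d[ord[2]],
             stringsAsFactors = FALSE)
}

candidates_tbl <- do.call(rbind,
                          lapply(unmatched, two_closest,
                                 candidates = llt_names))
print(candidates_tbl)
```

The resulting table has the same fields as the output of the SAS macro below (the ADE name, its two closest LLT names, and their scores) and would then be reviewed manually, as described in the text.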
Source code

Any reuse of all or part of this code must reference this publication. The corresponding SAS and R files are provided as supplementary online material.
MedDRA mapping SAS code

/*****************************************************************
Macro to exact-match ADE names to MedDRA LLT names.

"in_ds" should be a SAS library containing two kinds of input files:

First, it should contain a list of unique ADE names occurring in the
drug-ADE database. This list is assumed to have been stored in a SAS
data set named "_<YEAR>_aes", where YEAR is 2005 or 2010. This data
set contains one column named "ae_name".

Second, the library should contain a list of unique LLT names
occurring in MedDRA. This list is assumed to be stored in a SAS data
set named "_<YEAR>_unique_llts_meddra". This data set should contain
one column named "llt_name" as well as other columns including the
MedDRA information pertaining to an LLT, such as the pt_code,
pt_name, and so on.
******************************************************************/
%macro exact_match(year= );
  proc sort data=in_ds._&year._aes;
    by ae_name;
  run; quit;

  proc sort data=in_ds._&year._unique_llts_meddra;
    by llt_name;
  run; quit;

  data out_ds._&year._aes_meddra;
    merge in_ds._&year._aes(in=a)
          in_ds._&year._unique_llts_meddra(in=b rename=(llt_name=ae_name));
    by ae_name;
    if(a);
  run; quit;
%mend exact_match;

/******************************************************************
Macro to compute the smallest GED distance between ADE names and
MedDRA LLT names.

"in_ds" should be a SAS library containing two input SAS data sets:

First, this library is supposed to contain the list of ADE names that
were not exact-matched after running the previous macro. This list is
assumed to be stored in the SAS data set
"_<YEAR>_unique_aes_llt_nomatch", where YEAR is either 2005 or 2010.
This data set contains one column named "ae_name".

Second, this library is supposed to contain the list of all unique
LLT names in MedDRA. This list is supposed to be stored in a SAS data
set named "llt_lltname_only". This file has only one column, named
"llt_name".

out_ds: is a library that will contain output file(s) produced by the
macro.

To limit the running time to a few hours, this macro should be run in
a cluster, with each computing node processing a portion of the ADE
names contained in the input file "_<YEAR>_unique_aes_llt_nomatch".
This portion is defined by the macro variables:
  "jobnum":       taking values 1,...
  "rows_per_job": number of ADE names in the job
  "total_rows":   total number of ADE names to be processed

One output file per job will be produced. These partial output files
should in the end be merged together.

Each output file produced by a job contains the following fields:
  AE_name
  min_llt1:   the closest LLT name in terms of GED
  min_score1: the GED between AE_name and min_llt1
  min_llt2:   the second closest LLT name
  min_score2: the GED between AE_name and min_llt2

Notes on running time:
The GED computation for all the ADE names occurring in the drug-ADE
database in our study took a few hours using the Orchestra cluster
(http://ritg.med.harvard.edu/cluster.html) and a few hundred jobs.
******************************************************************/

/* The macro variables below should be re-defined as needed before
   submitting each job (e.g. using an external script). The values
   below are given for illustration purposes only */
%let jobnum=1;
%let rows_per_job=20;
%let total_rows=1000;

%macro find_min_ged_llt(year= ,num_llts= );
  /* create a local macro variable per LLT name */
  data _null_;
    set in_ds.llt_lltname_only;
    call symputx(cats("llt_name_", _N_), llt_name, "L");
  run; quit;

  /* find two closest LLTs to each ADE name */
  data out_ds._&year._unique_aes_geds_&jobnum;
    length min_llt1 min_llt2 $ 255;
    length min_score1 min_score2 8;
    set in_ds._&year._unique_aes_llt_nomatch;
    start_ob = (&jobnum - 1)* &rows_per_job + 1;
    end_ob = &jobnum * &rows_per_job;
    if (end_ob > &total_rows) then end_ob = &total_rows;
    if (_N_ < start_ob OR _N_ > end_ob) then delete;
    else do;
      *put "_N_ = " _N_;
      i = 1;
      min_score1 = 100000; min_score2 = 100000; /* very large values */
      min_llt1 = ""; min_llt2 = "";
      do while (i <= &num_llts);
        llt_name_i = SYMGET(cats("llt_name_", i));
        score = COMPGED(ae_name, llt_name_i);
        if (score < min_score1) then do;
          min_score2 = min_score1;
          min_score1 = score;
          min_llt2 = min_llt1;
          min_llt1 = llt_name_i;
        end;
        else if (score < min_score2) then do;
          min_score2 = score;
          min_llt2 = llt_name_i;
NET, INT covariates R code

#######################################################################
# The following functions require R libraries "network" and "sna".
# One way to acquire these libraries is to install the "statnet" suite
# of packages (www.statnetproject.org).
#
# Note on running time: The functions listed below were executed for
# 689,268 drug-ADE pairs in the network discussed in the paper.
#
# This task was carried out using the Orchestra cluster
# (http://ritg.med.harvard.edu/cluster.html), with a few hundred jobs
# in parallel, and took about one day to complete.
######################################################################

#######################################################################
# Function to compute the covariate "euclid-min" discussed in the
# paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a vector of length two. The first element of
# this vector denotes the value of covariate "euclid-min"
######################################################################
compute.value.euclid.dist.features = function(G, node1, node2) {
  all.attr.names = list.vertex.attributes(G)
  quant.attr.names = setdiff(all.attr.names,
      c("node_id","drug_name","DrugCard_ID","na",
        "PubChem_Compound_ID","stitch_compound_name1","vertex.names"))
  n.drugs = network.size(G)
  n.attributes = length(quant.attr.names)
  attr.mat = matrix(-999, nrow=n.drugs, ncol=n.attributes)
  for (i in 1:n.attributes) {
    attr.vec = get.vertex.attribute(G, quant.attr.names[i])
    attr.mat[,i] = attr.vec
  }
  attrs.node1 = attr.mat[node1, ]
  N2 = get.neighborhood(G, node2)
  N2 = setdiff(N2, node1)
  result.value.attr = numeric(2)
  min.feature.name = "min_euclid_dist"
  mean.feature.name = "mean_euclid_dist"
  names(result.value.attr) = c(min.feature.name, mean.feature.name)
  if (length(N2) == 0) {
    result.value.attr[min.feature.name] = 0
    result.value.attr[mean.feature.name] = 0
  } else {
    attr.mat.node2 = attr.mat[N2, ]
    merged.attr.mat = rbind(attrs.node1, attr.mat.node2)
    dist.mat = as.matrix(dist(merged.attr.mat))
    dist.vec = dist.mat[1,2:nrow(merged.attr.mat)]
    result.value.attr[min.feature.name] = min(dist.vec)
    result.value.attr[mean.feature.name] = mean(dist.vec)
  }
  return(result.value.attr)
}

#######################################################################
# Function to compute the distribution of Euclidean distances
# in the neighborhood of a drug-ADE pair. This distribution is used
# in the computation of covariate "euclid-KL" discussed in the
# paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a discretized version of the distribution
# of Euclidean distances in the neighborhood of pair (node1, node2)
######################################################################
compute.value.euclid.dist.features.full = function(G, node1, node2) {
  all.attr.names = list.vertex.attributes(G)
  quant.attr.names = setdiff(all.attr.names,
      c("node_id","drug_name","DrugCard_ID","na",
        "PubChem_Compound_ID","stitch_compound_name1","vertex.names"))
  n.drugs = network.size(G)
  n.attributes = length(quant.attr.names)
  attr.mat = matrix(-999, nrow=n.drugs, ncol=n.attributes)
  for (i in 1:n.attributes) {
    attr.vec = get.vertex.attribute(G, quant.attr.names[i])
    attr.mat[,i] = attr.vec
  }
  attrs.node1 = attr.mat[node1, ]
  N2 = get.neighborhood(G, node2)
  N2 = setdiff(N2, node1)
  nbins = 20
  result.value.attr = numeric(nbins)
  bin.names = character(nbins)
  for (i in 1:nbins) {
    bin.names[i] = paste("euclid_bin",i,sep="")
  }
  names(result.value.attr) = bin.names
  breaks.vec = c(seq(from=0, by=40, length.out=20), 10^10)
  if (length(N2) == 0) {
    result.value.attr=rep(0,20)
  } else {
    attr.mat.node2 = attr.mat[N2, ]
    merged.attr.mat = rbind(attrs.node1, attr.mat.node2)
    dist.mat = as.matrix(dist(merged.attr.mat))
    dist.vec = dist.mat[1,2:nrow(merged.attr.mat)]
    histogram.obj = hist(dist.vec, breaks=breaks.vec, plot=FALSE)
    result.value.attr = histogram.obj$density
  }
  return(result.value.attr)
}

#######################################################################
# Function to compute the covariates "degree-prod" and
# "degree-absdiff" discussed in the paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a vector of length four. The third and second
# elements of this vector denote "degree-prod" and "degree-absdiff",
# respectively.
######################################################################
compute.degree.features = function(G, node1, node2, D) {
  #D = degree(G, gmode="graph")
  D.node1 = D[node1]
  D.node2 = D[node2]
  result.degree = numeric(4)
  result.degree[1:4] <- NA
  names(result.degree) = c('degree_sum', 'degree_absdiff',
                           'degree_prod', 'degree_ratio')
  result.degree['degree_sum'] = D.node1 + D.node2
  result.degree['degree_absdiff'] = abs(D.node1 - D.node2)
  result.degree['degree_prod'] = D.node1 * D.node2
  result.degree['degree_ratio'] = D.node1/D.node2
  return(result.degree)
}

#######################################################################
# Function to compute the covariate "jackard-drug-max"
# discussed in the paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a vector of length three. The second
# element of this vector denotes covariate "jackard-drug-max"
######################################################################
compute.jackard.drug.features = function(G, node1, node2) {
  N1 = get.neighborhood(G, node1)
  N2 = get.neighborhood(G, node2)
  N2 = setdiff(N2, node1)
  n.neighbors = length(N2)
  result.jackard.drug = numeric(3)
  result.jackard.drug[1:3] <- NA
  names(result.jackard.drug) =
      c('jackard_drug_min','jackard_drug_max','jackard_drug_mean')
  if (n.neighbors == 0) {
    result.jackard.drug['jackard_drug_min'] = 0
    result.jackard.drug['jackard_drug_max'] = 0
    result.jackard.drug['jackard_drug_mean'] = 0
  } else {
    jackard.vector = numeric(n.neighbors)
    for (i in 1:n.neighbors) {
      neighbor.i = N2[i]
      N1.i = get.neighborhood(G, neighbor.i)
      intersection.i = intersect(N1, N1.i)
      union.i = union(N1, N1.i)
      jackard.vector[i] = length(intersection.i)/length(union.i)
    }
    result.jackard.drug['jackard_drug_min'] = min(jackard.vector)
    result.jackard.drug['jackard_drug_max'] = max(jackard.vector)
    result.jackard.drug['jackard_drug_mean'] = mean(jackard.vector)
  }
  return(result.jackard.drug)
}

#######################################################################
# Function to compute the distribution of Jackard coefficients in the
# neighborhood of a drug-ADE pair. This distribution is used in the
# computation of covariate "jackard-drug-KL" discussed in the paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a discretized version of the distribution
# of Jackard coefficients in the neighborhood of pair (node1, node2)
#######################################################################
compute.jackard.drug.features.full = function(G, node1, node2) {
  N1 = get.neighborhood(G, node1)
  N2 = get.neighborhood(G, node2)
  N2 = setdiff(N2, node1)
  n.neighbors = length(N2)
  nbins = 20
  result.jackard.drug = numeric(nbins)
  bin.names = character(nbins)
  for (i in 1:nbins) {
    bin.names[i] = paste("jackard_drugs_bin",i,sep="")
  }
  breaks.vec = seq(from=0, by=0.05, length.out=21)
  if (length(N2) == 0) {
    result.jackard.drug=rep(0,20)
  } else {
    jackard.vector = numeric(n.neighbors)
    for (i in 1:n.neighbors) {
      neighbor.i = N2[i]
      N1.i = get.neighborhood(G, neighbor.i)
      intersection.i = intersect(N1, N1.i)
      union.i = union(N1, N1.i)
      jackard.vector[i] = length(intersection.i)/length(union.i)
    }
    histogram.obj = hist(jackard.vector, breaks=breaks.vec, plot=FALSE)
    result.jackard.drug = histogram.obj$density
    #cat("sum ", sum(result.jackard.drug), "\n"); flush.console()
  }
  names(result.jackard.drug) = bin.names
  return(result.jackard.drug)
}

#######################################################################
# Function to compute the covariate "jackard-ADE-max" discussed in the
# paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a vector of length three. The second element
# of this vector denotes covariate "jackard-ADE-max"
#######################################################################
compute.jackard.ae.features = function(G, node1, node2) {
  N1 = get.neighborhood(G, node1)
  N2 = get.neighborhood(G, node2)
  N1 = setdiff(N1, node2)
  n.neighbors = length(N1)
  result.jackard.ae = numeric(3)
  result.jackard.ae[1:3] <- NA
  names(result.jackard.ae) =
      c('jackard_ae_min','jackard_ae_max','jackard_ae_mean')
  if (n.neighbors == 0) {
    result.jackard.ae['jackard_ae_min'] = 0
    result.jackard.ae['jackard_ae_max'] = 0
    result.jackard.ae['jackard_ae_mean'] = 0
  } else {
    jackard.vector = numeric(n.neighbors)
    for (i in 1:n.neighbors) {
      neighbor.i = N1[i]
      N2.i = get.neighborhood(G, neighbor.i)
      intersection.i = intersect(N2, N2.i)
      union.i = union(N2, N2.i)
      jackard.vector[i] = length(intersection.i)/length(union.i)
    }
    result.jackard.ae['jackard_ae_min'] = min(jackard.vector)
    result.jackard.ae['jackard_ae_max'] = max(jackard.vector)
    result.jackard.ae['jackard_ae_mean'] = mean(jackard.vector)
  }
  return(result.jackard.ae)
}

#######################################################################
# Function to compute the distribution of Jackard coefficients
# in the neighborhood of a drug-ADE pair. This distribution is used
# in the computation of covariate "jackard-ADE-KL" discussed in the
# paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a discretized version of the distribution
# of Jackard coefficients in the neighborhood of pair (node1, node2)
#######################################################################
compute.jackard.ae.features.full = function(G, node1, node2) {
  N1 = get.neighborhood(G, node1)
  N2 = get.neighborhood(G, node2)
  N1 = setdiff(N1, node2)
  n.neighbors = length(N1)
  nbins = 20
  result.jackard.ae = numeric(nbins)
  bin.names = character(nbins)
  for (i in 1:nbins) {
    bin.names[i] = paste("jackard_aes_bin",i,sep="")
  }
  breaks.vec = seq(from=0, by=0.05, length.out=21)
  if (n.neighbors == 0) {
    result.jackard.ae=rep(0,20)
  } else {
    jackard.vector = numeric(n.neighbors)
    for (i in 1:n.neighbors) {
      neighbor.i = N1[i]
      N2.i = get.neighborhood(G, neighbor.i)
      intersection.i = intersect(N2, N2.i)
      union.i = union(N2, N2.i)
      jackard.vector[i] = length(intersection.i)/length(union.i)
    }
    histogram.obj = hist(jackard.vector, breaks=breaks.vec, plot=FALSE)
    result.jackard.ae = histogram.obj$density
  }
  names(result.jackard.ae) = bin.names
  return(result.jackard.ae)
}

#######################################################################
# Function to compute the covariate "edge-density" discussed in the
# paper.
#
# G is a network object
# node1 denotes the PubChem_Compound_ID of a drug
# node2 denotes the HLT code (MedDRA) of an ADE
#
# All properties of drugs and ADEs should have been stored as vertex
# attributes of the network object G
#
# This function returns a vector of length one denoting the covariate
# "edge-density"
#######################################################################
compute.edge.density.features = function(G, node1, node2) {
  N1 = get.neighborhood(G, node1)
  N2 = get.neighborhood(G, node2)
  numer = sum(G[N2,N1])
  denom = length(N1)*length(N2)
  result.edge.dens = numeric(1)
  result.edge.dens[1] <- NA
  names(result.edge.dens) = c('edge_dens')
  result.edge.dens['edge_dens'] = numer/denom
  return(result.edge.dens)
}
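The drug-side Jackard (Jaccard) coefficients and the Kullback-Leibler step that turns a binned distribution into a "-KL" covariate can be illustrated on a toy network using base R only. This is a hedged sketch, not the authors' pipeline: the adjacency matrix `A`, the helpers `jaccard`, `jaccard_drug` and `kl_div`, and the uniform reference distribution `q` are all hypothetical, and bin proportions are used here in place of the `density` values returned by `hist` in the functions above.

```r
# Toy bipartite network (hypothetical): rows are drugs, columns are ADEs.
A <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 1, 1,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("drug", 1:4), paste0("ade", 1:3)))

# Jaccard coefficient of two sets.
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# Jaccard coefficients between the ADE set of `drug` and the ADE sets of
# the other drugs already linked to `ade`
# (same idea as compute.jackard.drug.features).
jaccard_drug <- function(A, drug, ade) {
  n1 <- names(which(A[drug, ] == 1))                # ADEs of the drug
  n2 <- setdiff(names(which(A[, ade] == 1)), drug)  # other drugs with the ADE
  if (length(n2) == 0) return(numeric(0))
  sapply(n2, function(d) jaccard(n1, names(which(A[d, ] == 1))))
}

# KL divergence (base 2) of a binned distribution p from a reference q.
# The small constant added to q mirrors the 0.000001 correction applied
# to the reference bins in the SAS macro compute_kldist.
kl_div <- function(p, q, eps = 1e-6) {
  keep <- p > 0
  sum(p[keep] * (log2(p[keep]) - log2(q[keep] + eps)))
}

j <- jaccard_drug(A, "drug1", "ade3")
jaccard_max <- max(j)   # analogue of "jackard-drug-max"

# Discretize into 20 bins of width 0.05, as in the ".full" functions.
h <- hist(j, breaks = seq(from = 0, by = 0.05, length.out = 21),
          plot = FALSE)
p <- h$counts / sum(h$counts)   # ASSUMPTION: proportions, not h$density
q <- rep(1 / 20, 20)            # ASSUMPTION: uniform reference
kl_value <- kl_div(p, q)
```

In the study the reference distributions were the mean bin values over edges, over non-edges, and over all pairs (see `compute_kldist` in the SAS code); the uniform `q` above is only a placeholder.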
TAX covariates SAS code

/*
 * Function to compute the covariate "atc-min" discussed in the
 * paper.
 *
 * all_pairs_ds is a list of all possible drug-ADE pairs
 *   node_id_1: drug id (PubChem_ID)
 *   node_id_2: AE id (HLT code)
 *
 * returns a new data set named <all_pairs_ds>_atc which also contains
 * the value of "atc-min" covariate for each drug-ADE pair
 */
%macro add_ATC_codes_min(all_pairs_ds= );
  %local macro_i;

  /* create hash tables */
  data in_ds.&all_pairs_ds._2005edge;
    set in_ds.&all_pairs_ds;
    where (is_old_edge EQ 1);
    keep node_id_1 node_id_2;
  run; quit;

  proc sort data=in_ds.&all_pairs_ds._2005edge;
    by node_id_2;
  run; quit;

  proc transpose data=in_ds.&all_pairs_ds._2005edge
                 out=in_ds.&all_pairs_ds._2005edge_t
                 prefix=node_id_1_;
    by node_id_2;
    var node_id_1;
  run; quit;

  data in_ds.&all_pairs_ds._2005edge_t;
    set in_ds.&all_pairs_ds._2005edge_t;
    drop node_id_1 _NAME_ _LABEL_;
  run; quit;

  /* perform ATC distance computations */
  data in_ds.&all_pairs_ds._atc;
    length node_id_1 8;
    length node_id_2 8;
    length atc_min_val 8;
    length atc_max_val 8;
    length atc_mean_val 8;
    %let macro_i=1;
    %do %while (&macro_i < 657); /* (1 + max AE degree) in the 2005 network */
      length node_id_1_&macro_i 8;
      %let macro_i = %eval(&macro_i + 1);
    %end;
    length PubChem_Compound_ID 8;
    length atc_code_min_dist 8;
    length atc_code_min_tmp 8;
    length atc_code1-atc_code11 $ 7;
    length L1 L3 L4 L1_prime L3_prime L4_prime $ 1;
    length L2 L2_prime $ 2;

    set in_ds.&all_pairs_ds;

    array atc_codes_d1{11} $ 8 atc_code_d1_1-atc_code_d1_11;
    array atc_codes_d2{11} $ 8 atc_code_d2_1-atc_code_d2_11;
    array atc_min_distances{657} 8;

    if (_N_ = 1) then do;
      /* for each drug there were 1-11 ATC codes */
      declare hash atcHash(dataset: 'in_ds._2005_drugs_atc_data');
      rc = atcHash.definekey('PubChem_Compound_ID');
      rc = atcHash.definedata('atc_code1', 'atc_code2', 'atc_code3',
                              'atc_code4', 'atc_code5', 'atc_code6',
                              'atc_code7', 'atc_code8', 'atc_code9',
                              'atc_code10', 'atc_code11');
      atcHash.definedone();

      /* for each HLT, the neighbors in 2005 network */
      declare hash neighborHash(dataset: "in_ds.&all_pairs_ds._2005edge_t");
      rc = neighborHash.definekey('node_id_2');
      rc = neighborHash.definedata(ALL: 'YES');
      neighborHash.definedone();
    end;

    atc_min_val = .;
    atc_max_val = .;
    atc_mean_val = .;

    PubChem_Compound_ID = node_id_1;
    rc = atcHash.find();
    if (rc NE 0) then
      put "Could not find" PubChem_Compound_ID;
    else do;
      atc_code_d1_1 = atc_code1;   atc_code_d1_2 = atc_code2;
      atc_code_d1_3 = atc_code3;   atc_code_d1_4 = atc_code4;
      atc_code_d1_5 = atc_code5;   atc_code_d1_6 = atc_code6;
      atc_code_d1_7 = atc_code7;   atc_code_d1_8 = atc_code8;
      atc_code_d1_9 = atc_code9;   atc_code_d1_10 = atc_code10;
      atc_code_d1_11 = atc_code11;

      rc = neighborHash.find();
      if (rc NE 0) then
        put "Could not find" node_id_2;
      else do;
        %let macro_i = 1;
        %do %while (&macro_i < 657);
          atc_min_distances[&macro_i] = .;
          %let macro_i = %eval(&macro_i + 1);
        %end;

        %let macro_i = 1;
        %do %while (&macro_i < 657);
          PubChem_Compound_ID = node_id_1_&macro_i;
          if (PubChem_Compound_ID NE . AND
              PubChem_Compound_ID NE node_id_1) then do;
            rc = atcHash.find();
            if (rc NE 0) then
              put "Could not find " PubChem_Compound_ID;
            else do;
              atc_code_d2_1 = atc_code1;   atc_code_d2_2 = atc_code2;
              atc_code_d2_3 = atc_code3;   atc_code_d2_4 = atc_code4;
              atc_code_d2_5 = atc_code5;   atc_code_d2_6 = atc_code6;
              atc_code_d2_7 = atc_code7;   atc_code_d2_8 = atc_code8;
              atc_code_d2_9 = atc_code9;   atc_code_d2_10 = atc_code10;
              atc_code_d2_11 = atc_code11;

              atc_code_min_dist = 99; /* very large value */
              i=1;
              do while (i <= 11); /* max number of unique ATC codes */
                if (atc_codes_d1[i] EQ "") then do;
                  leave;
                end;
                L1 = substrn(atc_codes_d1[i],1,1);
                L2 = substrn(atc_codes_d1[i],2,2);
                L3 = substrn(atc_codes_d1[i],4,1);
                L4 = substrn(atc_codes_d1[i],5,1);
                j=1;
                do while (j <= 11);
                  if (atc_codes_d2[j] EQ "") then do;
                    leave;
                  end;
                  L1_prime = substrn(atc_codes_d2[j],1,1);
                  L2_prime = substrn(atc_codes_d2[j],2,2);
                  L3_prime = substrn(atc_codes_d2[j],4,1);
                  L4_prime = substrn(atc_codes_d2[j],5,1);
                  if (L1 EQ L1_prime AND L2 EQ L2_prime AND
                      L3 EQ L3_prime AND L4 EQ L4_prime) then
                    atc_code_min_tmp = 2;
                  else if (L1 EQ L1_prime AND L2 EQ L2_prime AND
                           L3 EQ L3_prime) then
                    atc_code_min_tmp = 4;
                  else if (L1 EQ L1_prime AND L2 EQ L2_prime) then
                    atc_code_min_tmp = 6;
                  else if (L1 EQ L1_prime) then
                    atc_code_min_tmp = 8;
                  else
                    atc_code_min_tmp = 10;
                  if (atc_code_min_tmp < atc_code_min_dist) then
                    atc_code_min_dist = atc_code_min_tmp;
                  j = j+1;
                end; /* while j */
                i = i+1;
              end; /* while i */
              atc_min_distances[&macro_i] = atc_code_min_dist;
            end; /*else*/
          end; /* if Pubchem_compound_ID */
          %let macro_i = %eval(&macro_i + 1);
        %end; /*while macro i*/

        atc_min_val = min(of atc_min_distances{*});
        atc_max_val = max(of atc_min_distances{*});
        atc_mean_val = mean(of atc_min_distances{*});
      end;
    end;

    keep node_id_1 node_id_2 atc_min_val atc_max_val atc_mean_val
         is_old_edge is_old_edge_class is_new_edge is_new_edge_class
         is_test_pair is_test_pair_class;
  run; quit;
%mend add_ATC_codes_min;

/*
 * Function to compute the distribution of ATC distances
 * in the neighborhood of a drug-ADE pair. This distribution is used
 * in the computation of covariate "atc-KL" discussed in the
 * paper.
 *
 * all_pairs_ds is a list of all possible drug-ADE pairs
 *   node_id_1: drug id (PubChem_ID)
 *   node_id_2: AE id (HLT code)
 *
 * returns a new data set named <all_pairs_ds>_atcb which also contains
 * the distribution of ATC distances in the neighborhood of each pair
 */
%macro add_ATC_codes_bins(all_pairs_ds= );
  %local macro_i;

  /* create hash tables */
  data in_ds.&all_pairs_ds._05e;
    set in_ds.&all_pairs_ds;
    where (is_old_edge EQ 1);
    keep node_id_1 node_id_2;
  run; quit;

  proc sort data=in_ds.&all_pairs_ds._05e;
    by node_id_2;
  run; quit;

  proc transpose data=in_ds.&all_pairs_ds._05e
                 out=in_ds.&all_pairs_ds._05et
                 prefix=node_id_1_;
    by node_id_2;
    var node_id_1;
  run; quit;

  data in_ds.&all_pairs_ds._05et;
    set in_ds.&all_pairs_ds._05et;
    drop node_id_1 _NAME_ _LABEL_;
  run; quit;

  /* compute full distribution of distances */
  data in_ds.&all_pairs_ds._atcb;
    length node_id_1 8;
    length node_id_2 8;
    length atc_min_val 8;
    length atc_max_val 8;
    length atc_mean_val 8;
    length atc_bin1-atc_bin5 8;
    %let macro_i=1;
    %do %while (&macro_i < 657);
      length node_id_1_&macro_i 8;
      %let macro_i = %eval(&macro_i + 1);
    %end;
    length PubChem_Compound_ID 8;
    length atc_code_min_dist 8;
    length atc_code_min_tmp 8;
    length atc_code1-atc_code11 $ 7;
    length L1 L3 L4 L1_prime L3_prime L4_prime $ 1;
    length L2 L2_prime $ 2;

    set in_ds.&all_pairs_ds;

    array atc_codes_d1{11} $ 8 atc_code_d1_1-atc_code_d1_11;
    array atc_codes_d2{11} $ 8 atc_code_d2_1-atc_code_d2_11;
    array atc_min_distances{657} 8;
PubChem_Compound_ID = node_id_1_¯o_i; if (PubChem_Compound_ID NE . AND PubChem_Compound_ID NE node_id_1) then do; rc = atcHash.find(); if (rc NE 0) then put "Could not find " PubChem_Compound_ID; else do; atc_code_d2_1 = atc_code1; atc_code_d2_2 = atc_code2; atc_code_d2_3 = atc_code3; atc_code_d2_4 = atc_code4; atc_code_d2_5 = atc_code5; atc_code_d2_6 = atc_code6; atc_code_d2_7 = atc_code7; atc_code_d2_8 = atc_code8; atc_code_d2_9 = atc_code9; atc_code_d2_10 = atc_code10; atc_code_d2_11 = atc_code11; atc_code_min_dist = 99; i=1; do while (i <= 11); if (atc_codes_d1[i] EQ "") then do; leave; end; L1 = substrn(atc_codes_d1[i],1,1); L2 = substrn(atc_codes_d1[i],2,2); L3 = substrn(atc_codes_d1[i],4,1); L4 = substrn(atc_codes_d1[i],5,1); j=1; do while (j <= 11); if (atc_codes_d2[j] EQ "") then do; leave; end; L1_prime = substrn(atc_codes_d2[j],1,1); L2_prime = substrn(atc_codes_d2[j],2,2); L3_prime = substrn(atc_codes_d2[j],4,1); L4_prime = substrn(atc_codes_d2[j],5,1); if (L1 EQ L1_prime AND
19
L2 EQ L2_prime AND L3 EQ L3_prime AND L4 EQ L4_prime) then atc_code_min_tmp = 2; else if (L1 EQ L1_prime AND L2 EQ L2_prime AND L3 EQ L3_prime) then atc_code_min_tmp = 4; else if (L1 EQ L1_prime AND L2 EQ L2_prime) then atc_code_min_tmp = 6; else if (L1 EQ L1_prime) then atc_code_min_tmp = 8; else atc_code_min_tmp = 10; if (atc_code_min_tmp < atc_code_min_dist) then atc_code_min_dist = atc_code_min_tmp; j = j+1; end; /* while j */ i = i+1; end; /* while i */ atc_min_distances[¯o_i] = atc_code_min_dist; end; /*else*/ end; /* if Pubchem_compound_ID */ %let macro_i = %eval(¯o_i + 1); %end; /*while macro i*/ atc_min_val = min(of atc_min_distances{*}); atc_max_val = max(of atc_min_distances{*}); atc_mean_val = mean(of atc_min_distances{*}); atc_min_nonmiss = N(of atc_min_distances{*}); k = 1; do while (k <= dim(atc_min_distances));
20
if (atc_min_distances[k] NE .) then do;
  if (atc_min_distances[k] EQ 2) then atc_bin1 = atc_bin1 + 1;
  if (atc_min_distances[k] EQ 4) then atc_bin2 = atc_bin2 + 1;
  if (atc_min_distances[k] EQ 6) then atc_bin3 = atc_bin3 + 1;
  if (atc_min_distances[k] EQ 8) then atc_bin4 = atc_bin4 + 1;
  if (atc_min_distances[k] EQ 10) then atc_bin5 = atc_bin5 + 1;
end;
k = k+1;
end;
atc_bin1 = atc_bin1/atc_min_nonmiss;
atc_bin2 = atc_bin2/atc_min_nonmiss;
atc_bin3 = atc_bin3/atc_min_nonmiss;
atc_bin4 = atc_bin4/atc_min_nonmiss;
atc_bin5 = atc_bin5/atc_min_nonmiss;
end;
end;
keep node_id_1 node_id_2 is_old_edge atc_bin1 atc_bin2 atc_bin3 atc_bin4 atc_bin5 ;
run; quit;
%mend add_ATC_codes_bins;

/*
  Function to compute the Kullback-Leibler (KL) distance between a
  distribution and a desired reference distribution. This function is
  used to compute all KL-based covariates discussed in the paper.
  dist_type: what type of distribution--to distinguish between NET, TAX and INT covariates.
  bin_ds: is a data set containing the (discrete) distribution associated with each drug-ADE pair
  nbins: is the number of bins in that discrete distribution
*/
%macro compute_kldist(dist_type= ,bin_ds= ,nbins= );

/* mean distribution over the old (training) edges */
proc means data=in_ds.&bin_ds mean noprint;
  var &dist_type._bin1-&dist_type._bin&nbins;
  where (is_old_edge = 1);
  output out=in_ds.&bin_ds.M;
run; quit;
data in_ds.&bin_ds.M;
  set in_ds.&bin_ds.M;
  where (_STAT_ EQ "MEAN");
  keep &dist_type._bin1-&dist_type._bin&nbins;
run; quit;

/* mean distribution over the non-edges */
proc means data=in_ds.&bin_ds mean noprint;
  var &dist_type._bin1-&dist_type._bin&nbins;
  where (is_old_edge = 0);
  output out=in_ds.&bin_ds.N;
run; quit;
data in_ds.&bin_ds.N;
  set in_ds.&bin_ds.N;
  where (_STAT_ EQ "MEAN");
  keep &dist_type._bin1-&dist_type._bin&nbins;
run; quit;

/* mean distribution over all pairs */
proc means data=in_ds.&bin_ds mean noprint;
  var &dist_type._bin1-&dist_type._bin&nbins;
  output out=in_ds.&bin_ds.Q;
run; quit;
data in_ds.&bin_ds.Q;
  set in_ds.&bin_ds.Q;
  where (_STAT_ EQ "MEAN");
  keep &dist_type._bin1-&dist_type._bin&nbins;
run; quit;

%local macro_i ;

/* store the (smoothed) reference bins in macro variables */
data _null_;
  set in_ds.&bin_ds.M;
  if (_N_ = 1) then do;
    %let macro_i = 1;
    %do %while (&macro_i <= &nbins);
      corrected_bin = &dist_type._bin&macro_i + 0.000001;
      call symput("refBin1_&macro_i", corrected_bin);
      %let macro_i = %eval(&macro_i + 1);
    %end;
  end;
run; quit;
data _null_;
  set in_ds.&bin_ds.N;
  if (_N_ = 1) then do;
    %let macro_i = 1;
    %do %while (&macro_i <= &nbins);
      corrected_bin = &dist_type._bin&macro_i + 0.000001;
      call symput("refBin0_&macro_i", corrected_bin);
      %let macro_i = %eval(&macro_i + 1);
    %end;
  end;
run; quit;
data _null_;
  set in_ds.&bin_ds.Q;
  if (_N_ = 1) then do;
    %let macro_i = 1;
    %do %while (&macro_i <= &nbins);
      corrected_bin = &dist_type._bin&macro_i + 0.000001;
      call symput("refBin01_&macro_i", corrected_bin);
      %let macro_i = %eval(&macro_i + 1);
    %end;
  end;
run; quit;

/* KL distance of each pair's distribution to the three references */
data in_ds.&bin_ds.K;
  length kl0_&dist_type kl1_&dist_type kl01_&dist_type 8;
  set in_ds.&bin_ds;
  kl0_&dist_type = 0;
  kl1_&dist_type = 0;
  kl01_&dist_type = 0;
  %let macro_i = 1;
  %do %while (&macro_i <= &nbins);
    if (&dist_type._bin&macro_i > 0) then do;
      kl0_&dist_type = kl0_&dist_type + &dist_type._bin&macro_i*(log2(&dist_type._bin&macro_i)-log2(&&refBin0_&macro_i));
      kl1_&dist_type = kl1_&dist_type + &dist_type._bin&macro_i*(log2(&dist_type._bin&macro_i)-log2(&&refBin1_&macro_i));
      kl01_&dist_type = kl01_&dist_type + &dist_type._bin&macro_i*(log2(&dist_type._bin&macro_i)-log2(&&refBin01_&macro_i));
    end;
    %let macro_i = %eval(&macro_i + 1);
  %end;
  drop is_old_edge;
run; quit;
%mend compute_kldist;

/*
 * Function to compute the covariate "meddra-min" discussed in the
 * paper.
 *
 * all_pairs_ds is a list of all possible drug-ADE pairs
 * node_id_1: drug id (PubChem_ID)
 * node_id_2: AE id (HLT code)
 *
 * returns a new data set named <all_pairs_ds>_meddra_h which also contains
 * the value of "meddra-min" covariate for each drug-ADE pair
 */
%macro add_meddra_min_dist_hlt(all_pairs_ds= );
/* create hash tables */
data in_ds.&all_pairs_ds._2005edge;
  set in_ds.&all_pairs_ds;
  where (is_old_edge EQ 1);
  keep node_id_1 node_id_2;
run; quit; proc sort data=in_ds.&all_pairs_ds._2005edge; by node_id_1; run; quit; proc transpose data=in_ds.&all_pairs_ds._2005edge out=in_ds.&all_pairs_ds._2005edge_t prefix=node_id_2_; by node_id_1; var node_id_2; run; quit; data in_ds.&all_pairs_ds._2005edge_t; set in_ds.&all_pairs_ds._2005edge_t; drop node_id_2 _NAME_ _LABEL_; run; quit; data meddra.mdhier_hlt_hlgt; set meddra.mdhier; keep hlt_code hlgt_code; run; quit; proc sort data=meddra.mdhier_hlt_hlgt noduprecs; by hlt_code hlgt_code; run; quit; proc transpose data=meddra.mdhier_hlt_hlgt out=meddra.mdhier_hlt_hlgt_t prefix=hlgt_code; by hlt_code; var hlgt_code; run; quit; /* HLT to HLGT mapping */ data meddra.mdhier_hlt_hlgt_t; set meddra.mdhier_hlt_hlgt_t; drop _NAME_; run; quit; data meddra.mdhier_hlgt_soc; set meddra.mdhier; keep hlgt_code soc_code; run; quit; proc sort data=meddra.mdhier_hlgt_soc noduprecs; by hlgt_code soc_code; run; quit; proc transpose data=meddra.mdhier_hlgt_soc out=meddra.mdhier_hlgt_soc_t prefix=soc_code; by hlgt_code; var soc_code; run; quit;
/* HLGT to SOC mapping */
data meddra.mdhier_hlgt_soc_t;
  set meddra.mdhier_hlgt_soc_t;
  drop _NAME_;
run; quit;

/* perform meddra distance computations */
data in_ds.&all_pairs_ds._meddra_h;
  length meddra_h_min_val 8;
  length meddra_h_max_val 8;
  length meddra_h_mean_val 8;
  %let macro_i=1;
  %do %while (&macro_i < 213); /* (1 + max drug degree) in the 2005 network */
    length node_id_2_&macro_i 8;
    %let macro_i = %eval(&macro_i + 1);
  %end;
  length hlt_code hlgt_code 8;
  length hlgt_code1 hlgt_code2 8;
  length hlgt_code11 hlgt_code12 hlgt_code21 hlgt_code22 8;
  length soc_code1 soc_code2 8;
  length soc_code11 soc_code12 soc_code21 soc_code22 8;
  set in_ds.&all_pairs_ds;
  array meddra_h_min_distances{212} 8;
  if (_N_ = 1) then do;
    declare hash neighborHash(dataset: "in_ds.&all_pairs_ds._2005edge_t");
    rc = neighborHash.definekey('node_id_1');
    rc = neighborHash.definedata(ALL: 'YES');
    neighborHash.definedone();
    declare hash hltHash(dataset: "meddra.mdhier_hlt_hlgt_t");
    rc = hltHash.definekey('hlt_code');
    rc = hltHash.definedata('hlgt_code1','hlgt_code2');
    hltHash.definedone();
    declare hash hlgtHash(dataset: "meddra.mdhier_hlgt_soc_t");
    rc = hlgtHash.definekey('hlgt_code');
    rc = hlgtHash.definedata('soc_code1','soc_code2');
    hlgtHash.definedone();
  end;
  meddra_h_min_val = .;
  meddra_h_max_val = .;
  meddra_h_mean_val = .;
  rc = neighborHash.find();
  if (rc NE 0) then do;
    put "Could not find PubChem_Compound_ID " node_id_1;
  end;
  else do;
    %let macro_i = 1;
    %do %while (&macro_i < 213);
      meddra_h_min_distances[&macro_i] = .;
      %let macro_i = %eval(&macro_i + 1);
    %end;
    hlt_code = node_id_2;
    rc1 = hltHash.find();
    if (rc1 NE 0) then do; put "Could not find hlt1 in hltHash " hlt_code; end;
    else do; hlgt_code11 = hlgt_code1; hlgt_code12 = hlgt_code2; end;
    %let macro_i = 1;
    %do %while (&macro_i < 213);
      if (node_id_2_&macro_i NE . AND node_id_2_&macro_i NE node_id_2) then do;
        /* second HLT */
        hlt_code = node_id_2_&macro_i;
        rc1 = hltHash.find();
        if (rc1 NE 0) then do; put "Could not find hlt2 in hltHash " hlt_code; end;
        else do; hlgt_code21 = hlgt_code1; hlgt_code22 = hlgt_code2; end;
        if ((hlgt_code11 EQ hlgt_code21) OR
            (hlgt_code22 NE . AND hlgt_code11 EQ hlgt_code22) OR
            (hlgt_code12 NE . AND hlgt_code12 EQ hlgt_code21) OR
            (hlgt_code12 NE . AND hlgt_code22 NE . AND hlgt_code12 EQ hlgt_code22)) then do;
          meddra_h_min_distances[&macro_i] = 2;
        end;
        else do;
          /* hlgt_code11 */
          hlgt_code = hlgt_code11;
          rc1 = hlgtHash.find();
          if (rc1 NE 0) then do; put "Could not find hlgt_code11 in hlgtHash " hlgt_code; end;
          soc_code11 = soc_code1; soc_code12 = soc_code2;
          /* hlgt_code21 */
          hlgt_code = hlgt_code21;
          rc1 = hlgtHash.find();
          if (rc1 NE 0) then do; put "Could not find hlgt_code21 in hlgtHash " hlgt_code; end;
          soc_code21 = soc_code1; soc_code22 = soc_code2;
          if ((soc_code11 EQ soc_code21) OR
              (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
              (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
              (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
            meddra_h_min_distances[&macro_i] = 4;
          end;
          /* hlgt_code22 */
          if (hlgt_code22 NE .) then do;
            hlgt_code = hlgt_code22;
            rc1 = hlgtHash.find();
            if (rc1 NE 0) then do; put "Could not find hlgt_code22 in hlgtHash " hlgt_code; end;
            soc_code21 = soc_code1; soc_code22 = soc_code2;
            if ((soc_code11 EQ soc_code21) OR
                (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
                (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
                (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
              meddra_h_min_distances[&macro_i] = 4;
            end;
          end;
          /* hlgt_code12 */
          if (hlgt_code12 NE .) then do;
            hlgt_code = hlgt_code12;
            rc1 = hlgtHash.find();
            if (rc1 NE 0) then do; put "Could not find hlgt_code12 in hlgtHash " hlgt_code; end;
            soc_code11 = soc_code1; soc_code12 = soc_code2;
            /* hlgt_code21 */
            hlgt_code = hlgt_code21;
            rc1 = hlgtHash.find();
            if (rc1 NE 0) then do; put "Could not find hlgt_code 21 in hlgtHash " hlgt_code; end;
            soc_code21 = soc_code1; soc_code22 = soc_code2;
            if ((soc_code11 EQ soc_code21) OR
                (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
                (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
                (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
              meddra_h_min_distances[&macro_i] = 4;
            end;
            /* hlgt_code22 */
            if (hlgt_code22 NE .) then do;
              hlgt_code = hlgt_code22;
              rc1 = hlgtHash.find();
              if (rc1 NE 0) then do; put "Could not find in hlgtHash " hlgt_code; end;
              soc_code21 = soc_code1; soc_code22 = soc_code2;
              if ((soc_code11 EQ soc_code21) OR
                  (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
                  (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
                  (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
                meddra_h_min_distances[&macro_i] = 4;
              end;
            end;
          end; /* if hlgt_code12 NE . */
        end; /* else do */
        if (meddra_h_min_distances[&macro_i] EQ .) then meddra_h_min_distances[&macro_i] = 6;
      end; /* if node_id_2_&macro_i NE node_id_2 */
      %let macro_i = %eval(&macro_i + 1);
    %end;
    meddra_h_min_val = min(of meddra_h_min_distances{*});
    meddra_h_max_val = max(of meddra_h_min_distances{*});
    meddra_h_mean_val = mean(of meddra_h_min_distances{*});
    meddra_h_min_nonmiss = N(of meddra_h_min_distances{*});
  end; /* else do */
  keep node_id_1 node_id_2 meddra_h_min_val meddra_h_max_val meddra_h_mean_val
       is_old_edge is_old_edge_class is_new_edge is_new_edge_class
       is_test_pair is_test_pair_class;
run; quit;
%mend add_meddra_min_dist_hlt;

/*
 * Function to compute the distribution of MedDRA distances
 * in the neighborhood of a drug-ADE pair. This distribution is used
 * in the computation of covariate "meddra-KL" discussed in the
 * paper.
 *
 * all_pairs_ds is a list of all possible drug-ADE pairs
 * node_id_1: drug id (PubChem_ID)
 * node_id_2: AE id (HLT code)
 *
 * returns a new data set named <all_pairs_ds>_medb which also contains
 * the distribution of MedDRA distances in the neighborhood of each pair
 */
%macro add_meddra_min_dist_bins(all_pairs_ds= );
data in_ds.&all_pairs_ds._05e;
  set in_ds.&all_pairs_ds;
  where (is_old_edge EQ 1);
  keep node_id_1 node_id_2;
run; quit;
proc sort data=in_ds.&all_pairs_ds._05e;
  by node_id_1;
run; quit;
proc transpose data=in_ds.&all_pairs_ds._05e out=in_ds.&all_pairs_ds._05e_t prefix=node_id_2_;
  by node_id_1;
  var node_id_2;
run; quit;
data in_ds.&all_pairs_ds._05e_t; set in_ds.&all_pairs_ds._05e_t; drop node_id_2 _NAME_ _LABEL_; run; quit; data meddra.mdhier_hlt_hlgt; set meddra.mdhier; keep hlt_code hlgt_code; run; quit; proc sort data=meddra.mdhier_hlt_hlgt noduprecs; by hlt_code hlgt_code; run; quit; proc transpose data=meddra.mdhier_hlt_hlgt out=meddra.mdhier_hlt_hlgt_t prefix=hlgt_code; by hlt_code; var hlgt_code; run; quit; data meddra.mdhier_hlt_hlgt_t; set meddra.mdhier_hlt_hlgt_t; drop _NAME_; run; quit; data meddra.mdhier_hlgt_soc; set meddra.mdhier; keep hlgt_code soc_code; run; quit; proc sort data=meddra.mdhier_hlgt_soc noduprecs; by hlgt_code soc_code; run; quit; proc transpose data=meddra.mdhier_hlgt_soc out=meddra.mdhier_hlgt_soc_t prefix=soc_code; by hlgt_code; var soc_code; run; quit; data meddra.mdhier_hlgt_soc_t; set meddra.mdhier_hlgt_soc_t; drop _NAME_; run; quit; data in_ds.&all_pairs_ds._medb; length meddra_h_min_val 8; length meddra_h_max_val 8; length meddra_h_mean_val 8; length med_bin1-med_bin3 8;
end;
else do; hlgt_code11 = hlgt_code1; hlgt_code12 = hlgt_code2; end;
%let macro_i = 1;
%do %while (&macro_i < 213);
  if (node_id_2_&macro_i NE . AND node_id_2_&macro_i NE node_id_2) then do;
    /* second HLT */
    hlt_code = node_id_2_&macro_i;
    rc1 = hltHash.find();
    if (rc1 NE 0) then do; put "Could not find hlt2 in hltHash " hlt_code; end;
    else do; hlgt_code21 = hlgt_code1; hlgt_code22 = hlgt_code2; end;
    if ((hlgt_code11 EQ hlgt_code21) OR
        (hlgt_code22 NE . AND hlgt_code11 EQ hlgt_code22) OR
        (hlgt_code12 NE . AND hlgt_code12 EQ hlgt_code21) OR
        (hlgt_code12 NE . AND hlgt_code22 NE . AND hlgt_code12 EQ hlgt_code22)) then do;
      meddra_h_min_distances[&macro_i] = 2;
    end;
    else do;
      /* hlgt_code11 */
      hlgt_code = hlgt_code11;
      rc1 = hlgtHash.find();
      if (rc1 NE 0) then do; put "Could not find hlgt_code11 in hlgtHash " hlgt_code; end;
      soc_code11 = soc_code1; soc_code12 = soc_code2;
      /* hlgt_code21 */
      hlgt_code = hlgt_code21;
      rc1 = hlgtHash.find();
      if (rc1 NE 0) then do; put "Could not find hlgt_code21 in hlgtHash " hlgt_code; end;
      soc_code21 = soc_code1; soc_code22 = soc_code2;
      if ((soc_code11 EQ soc_code21) OR
          (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
          (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
          (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
        meddra_h_min_distances[&macro_i] = 4;
      end;
      /* hlgt_code22 */
      if (hlgt_code22 NE .) then do;
        hlgt_code = hlgt_code22;
        rc1 = hlgtHash.find();
        if (rc1 NE 0) then do; put "Could not find hlgt_code22 in hlgtHash " hlgt_code; end;
        soc_code21 = soc_code1; soc_code22 = soc_code2;
        if ((soc_code11 EQ soc_code21) OR
            (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
            (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
            (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
          meddra_h_min_distances[&macro_i] = 4;
        end;
      end;
      /* hlgt_code12 */
      if (hlgt_code12 NE .) then do;
        hlgt_code = hlgt_code12;
        rc1 = hlgtHash.find();
        if (rc1 NE 0) then do; put "Could not find hlgt_code12 in hlgtHash " hlgt_code; end;
        soc_code11 = soc_code1; soc_code12 = soc_code2;
        /* hlgt_code21 */
        hlgt_code = hlgt_code21;
        rc1 = hlgtHash.find();
        if (rc1 NE 0) then do; put "Could not find hlgt_code 21 in hlgtHash " hlgt_code; end;
        soc_code21 = soc_code1; soc_code22 = soc_code2;
        if ((soc_code11 EQ soc_code21) OR
            (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
            (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
            (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
          meddra_h_min_distances[&macro_i] = 4;
        end;
        /* hlgt_code22 */
        if (hlgt_code22 NE .) then do;
          hlgt_code = hlgt_code22;
          rc1 = hlgtHash.find();
          if (rc1 NE 0) then do; put "Could not find in hlgtHash " hlgt_code; end;
          soc_code21 = soc_code1; soc_code22 = soc_code2;
          if ((soc_code11 EQ soc_code21) OR
              (soc_code22 NE . AND soc_code11 EQ soc_code22) OR
              (soc_code12 NE . AND soc_code12 EQ soc_code21) OR
              (soc_code12 NE . AND soc_code22 NE . AND soc_code12 EQ soc_code22)) then do;
            meddra_h_min_distances[&macro_i] = 4;
          end;
        end;
      end; /* if hlgt_code12 NE . */
    end; /* else do */
    if (meddra_h_min_distances[&macro_i] EQ .) then meddra_h_min_distances[&macro_i] = 6;
  end; /* if node_id_2_&macro_i NE node_id_2 */
  %let macro_i = %eval(&macro_i + 1);
%end;
meddra_h_min_val = min(of meddra_h_min_distances{*});
meddra_h_max_val = max(of meddra_h_min_distances{*});
meddra_h_mean_val = mean(of meddra_h_min_distances{*});
meddra_h_min_nonmiss = N(of meddra_h_min_distances{*});
k = 1;
do while (k <= dim(meddra_h_min_distances));
  if (meddra_h_min_distances[k] NE .) then do;
    if (meddra_h_min_distances[k] EQ 2) then med_bin1 = med_bin1 + 1;
    if (meddra_h_min_distances[k] EQ 4) then med_bin2 = med_bin2 + 1;
    if (meddra_h_min_distances[k] EQ 6) then med_bin3 = med_bin3 + 1;
  end;
  k = k+1;
end;
med_bin1 = med_bin1/meddra_h_min_nonmiss;
med_bin2 = med_bin2/meddra_h_min_nonmiss;
med_bin3 = med_bin3/meddra_h_min_nonmiss;
end; /* else do */
keep node_id_1 node_id_2 is_old_edge med_bin1-med_bin3 ;
run; quit;
%mend add_meddra_min_dist_bins;
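The MedDRA distance logic in the two macros above is easier to see outside macro-expanded SAS. The following is a minimal Python sketch (not the authors' SAS) of the same idea, under these assumptions: an HLT may have more than one parent HLGT and an HLGT more than one parent SOC; the distance is 2 when two HLTs share an HLGT, 4 when any of their HLGTs share an SOC, and 6 otherwise; the bins are the fractions of distances equal to 2, 4, and 6. The hierarchy, the codes, and the function names are hypothetical toy values chosen for illustration.

```python
from collections import Counter

# Hypothetical toy hierarchy (real MedDRA codes are numeric).
hlt_to_hlgt = {
    "hlt_a": {"hlgt_1"},
    "hlt_b": {"hlgt_1", "hlgt_2"},  # an HLT may have more than one parent HLGT
    "hlt_c": {"hlgt_3"},
    "hlt_d": {"hlgt_4"},
}
hlgt_to_soc = {
    "hlgt_1": {"soc_x"},
    "hlgt_2": {"soc_y"},
    "hlgt_3": {"soc_x"},
    "hlgt_4": {"soc_z"},
}


def meddra_distance(hlt1, hlt2):
    """Return 2 if the HLTs share an HLGT, 4 if they share an SOC, else 6."""
    hlgts1, hlgts2 = hlt_to_hlgt[hlt1], hlt_to_hlgt[hlt2]
    if hlgts1 & hlgts2:
        return 2
    socs1 = set().union(*(hlgt_to_soc[h] for h in hlgts1))
    socs2 = set().union(*(hlgt_to_soc[h] for h in hlgts2))
    return 4 if socs1 & socs2 else 6


def meddra_min_and_bins(candidate_hlt, known_hlts):
    """Distances from a candidate ADE to the drug's other known ADEs.

    Returns (minimum distance, [share at 2, share at 4, share at 6]),
    or (None, None) when the drug has no other known ADE.
    """
    dists = [meddra_distance(candidate_hlt, k)
             for k in known_hlts if k != candidate_hlt]
    if not dists:
        return None, None
    counts = Counter(dists)
    bins = [counts[d] / len(dists) for d in (2, 4, 6)]
    return min(dists), bins


meddra_min, med_bins = meddra_min_and_bins("hlt_a", ["hlt_b", "hlt_c", "hlt_d"])
print(meddra_min, med_bins)
```

In this toy case the candidate sits at distance 2, 4, and 6 from the three known ADEs, so the minimum is 2 and each bin holds one third of the mass.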
SUPPLEMENTARY TABLES
Table S1. Definition of covariates. Variable $i$ denotes a drug, variable $j$ denotes an ADE, and $N(i)$ denotes the set of neighbors of node $i$. The definition of the taxonomic covariates relies on the pre‐computed ATC‐ and MedDRA‐based distances $d_{ATC}$ and $d_{MedDRA}$ discussed in the paper. The definition of the intrinsic covariates relies on the pre‐computed Euclidean distances $d_{INT}$ discussed in the paper.
Covariate name | Covariate definition | Additional information
Network covariates
degree‐prod | $X_1(i,j) = \mathrm{degree}(i) \times \mathrm{degree}(j)$ |
degree‐sum | $X_2(i,j) = \mathrm{degree}(i) + \mathrm{degree}(j)$ |
degree‐ratio | $X_3(i,j) = \mathrm{degree}(i) / \mathrm{degree}(j)$ |
degree‐absdiff | $X_4(i,j) = |\mathrm{degree}(i) - \mathrm{degree}(j)|$ |
jackard‐ADE‐max | $X_5(i,j) = \max_{k \in N(i) \setminus \{j\}} \{J(j,k)\}$ | $J(j,k) = |N(j) \cap N(k)| / |N(j) \cup N(k)|$ denotes the Jaccard coefficient between the sets $N(j)$ and $N(k)$
jackard‐ADE‐KL | $X_6(i,j)$: Kullback‐Leibler (KL) distance between the distribution $D_{ae}(i,j)$ of the variable $J(i,k),\ k \in N(j) \setminus \{i\}$ and a reference distribution | The reference distribution $D_{ae}$ was computed as the mean of distributions $D_{ae}(i,j)$ over the training edges $(i,j)$
jackard‐drug‐max | $X_7(i,j) = \max_{k \in N(j) \setminus \{i\}} \{J(i,k)\}$ |
jackard‐drug‐KL | $X_8(i,j)$: KL distance between the distribution $D_{drug}(i,j)$ of the variable $J(j,k),\ k \in N(i) \setminus \{j\}$ and a reference distribution | The reference distribution $D_{drug}$ was computed as the mean of distributions $D_{drug}(i,j)$ over the training edges $(i,j)$
edge‐density | $X_9(i,j)$: the edge density in the subgraph induced by $N(i) \cup N(j) \setminus \{i,j\}$ |
Taxonomic covariates
atc‐min | $X_{10}(i,j) = \min_{k \in N(j) \setminus \{i\}} \{d_{ATC}(i,k)\}$ |
atc‐KL | $X_{11}(i,j)$: KL distance between the distribution $D_{ATC}(i,j)$ of the variable $d_{ATC}(i,k),\ k \in N(j) \setminus \{i\}$ and a reference distribution | The reference distribution $D_{ATC}$ was computed as the mean of distributions $D_{ATC}(i,j)$ over the training edges $(i,j)$
meddra‐min | $X_{12}(i,j) = \min_{k \in N(i) \setminus \{j\}} \{d_{MedDRA}(j,k)\}$ |
meddra‐KL | $X_{13}(i,j)$: KL distance between the distribution $D_{MedDRA}(i,j)$ of the variable $d_{MedDRA}(j,k),\ k \in N(i) \setminus \{j\}$ and a reference distribution | The reference distribution $D_{MedDRA}$ was computed as the mean of distributions $D_{MedDRA}(i,j)$ over all training edges $(i,j)$
Intrinsic covariates
euclid‐min | $X_{14}(i,j) = \min_{k \in N(j) \setminus \{i\}} \{d_{INT}(i,k)\}$ |
euclid‐KL | $X_{15}(i,j)$: KL distance between the distribution $D_{INT}(i,j)$ of the variable $d_{INT}(i,k),\ k \in N(j) \setminus \{i\}$ and a reference distribution | The reference distribution $D_{INT}$ was computed as the mean of distributions $D_{INT}(i,j)$ over the training edges $(i,j)$
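The five KL covariates in Table S1 share one construction: a per-pair binned distribution is compared with a reference distribution obtained by averaging over training edges. The sketch below is an illustrative Python rendering (not the authors' code) that follows the conventions of the `compute_kldist` SAS macro in the source listing: a constant of 0.000001 is added to each reference bin, logarithms are base 2, and empty bins of the pair's distribution are skipped. The function names and the toy numbers are hypothetical.

```python
import math

EPS = 1e-6  # smoothing constant added to every reference bin


def reference_distribution(distributions):
    """Bin-wise mean of per-pair distributions (e.g. over training edges)."""
    n = len(distributions)
    nbins = len(distributions[0])
    return [sum(d[b] for d in distributions) / n for b in range(nbins)]


def kl_distance(p, ref):
    """KL distance of p from ref in bits; empty bins of p contribute nothing."""
    return sum(
        p_b * (math.log2(p_b) - math.log2(r_b + EPS))
        for p_b, r_b in zip(p, ref)
        if p_b > 0
    )


# Toy 3-bin distributions for two training edges (made-up numbers).
training = [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25]]
ref = reference_distribution(training)

# A pair whose distribution matches the reference scores close to 0;
# a pair with all its mass in a rare bin scores close to 3 bits here.
print(kl_distance([0.5, 0.375, 0.125], ref))
print(kl_distance([0.0, 0.0, 1.0], ref))
```

The SAS macro computes three such distances per pair, against the mean over old edges, the mean over non-edges, and the mean over all pairs; the sketch shows only the edge-based reference described in Table S1.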
Table S2. List of drugs and their ATC codes.
Drug name atc1 atc2 atc3 atc4 atc5 atc6 atc7 atc8 atc9 atc10 atc11
Note: * indicates a placeholder code that we created in the case of four drugs for which we were unable to determine an ATC code.
Table S3. Number of missing observations for PubChem properties extracted for this study. The total number of drugs in the study was 809. *XLogP3 and Tautomer Count were excluded from the study due to the missing values.
Property name | Number of missing observations
Molecular Weight | 0
XLogP3* | 41
H Bond Donor | 0
H Bond Acceptor | 0
Rotatable Bond Count | 0
Tautomer Count* | 336
Topol Polar Surface Area | 0
Heavy Atom Count | 0
Formal Charge | 0
Complexity | 0
Isotope Atom Count | 0
Defined Atom StereoCenter (SC) Count | 0
Undefined Atom SC Count | 0
Defined Bond SC Count | 0
Undefined Bond SC Count | 0
Covalently Bonded (CB) Unit Count | 0
Table S4. Number of missing observations for DrugBank properties extracted for this study. The total number of drugs in the study was 809. Melting Point and Half Life were excluded from the study owing to the missing values. The remaining two properties (Exp LogP Hydrophobicity and Protein Binding) were initially included in the study through data imputation. The effect of the imputed data on the predictive performance was assessed by excluding these two properties from the model as well.
Property name | Number of missing observations
Exp LogP Hydrophobicity | 91
Protein Binding | 218
Melting Point | 261
Half Life | 450
Table S5. Intercorrelation analysis of all covariates. The highest positive and negative Pearson correlations are bolded.
Table S6. Prediction case studies. The selected drug‐ADE pairs represent some prominent drug‐ADE associations newly discovered during the period of 2006 to 2010.
Table S7. List of supplementary source code files.
File name Comments
meddra_mapping_code.sas SAS code to perform MedDRA mapping
NET_INT_covariates.r R code to compute network and intrinsic covariates
TAX_covariates.sas SAS code to compute taxonomic covariates
SUPPLEMENTARY FIGURES
[Fig. S1 plot: mean number of AEs (y-axis) by ATC top-level group (x-axis: A, B, C, D, G, H, J, L, M, N, P, R, S, V).]
Fig. S1. Newly associated ADEs per drug in each ATC top‐level group. ATC top‐level groups: A, alimentary tract and metabolism; B, blood and blood forming organs; C, cardiovascular system; D, dermatologicals; G, genito‐urinary system and sex hormones; H, systemic hormonal preparations; J, anti‐infectives for systemic use; L, antineoplastic and immunomodulating agents; M, musculo‐skeletal system; N, nervous system; P, antiparasitic products, insecticides and repellents; R, respiratory system; S, sensory organs; V, various. Data are means and error bars represent 95% CIs.
[Fig. S2 plot: mean number of drugs (y-axis) by MedDRA top-level group (x-axis).]
Fig. S2. Newly associated drugs per ADE in each MedDRA top‐level group. MedDRA top‐level groups: blo, blood and lymphatic system disorders; car, cardiac disorders; con, congenital, familial and genetic disorders; ear, ear and labyrinth disorders; end, endocrine disorders; eye, eye disorders; gas, gastrointestinal disorders; gen, general disorders and administration site conditions; hep, hepatobiliary disorders; imm, immune system disorders; inf, infections and infestations; inj, injury, poisoning and procedural complications; inv, investigations; met, metabolism and nutrition disorders; mus, musculoskeletal and connective tissue disorders; neo, neoplasms benign, malignant and unspecified; ner, nervous system disorders; pre, pregnancy, puerperium and perinatal conditions; psy, psychiatric disorders; ren, renal and urinary disorders; rep, reproductive system and breast disorders; res, respiratory, thoracic and mediastinal disorders; ski, skin and subcutaneous tissue disorders; sur, surgical and medical procedures; vas, vascular disorders. Data are means and error bars represent 95% CIs.
[Fig. S3 plots: six histograms of score (x-axis) against percent (y-axis). Panel statistics in panel order: (A) Mean = 0.03, Std = 0.09; (B) Mean = 0.46, Std = 0.36; (C) Mean = 0.04, Std = 0.09; (D) Mean = 0.25, Std = 0.17; (E) Mean = 0.05, Std = 0.06; (F) Mean = 0.14, Std = 0.07.]
Fig. S3. Comparative histograms of scores for the observed edges and non‐edges by the three model types. (A and B) NET model, non‐edges (A) and edges (B). (C and D) TAX model, non‐edges (C) and edges (D). (E and F) INT model, non‐edges (E) and edges (F).
True positives False positives
Fig. S4. Three‐way Venn diagrams for the sets of true positives and false positives generated by models NET, TAX, and INT. Specificity was fixed at 0.95.
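The caption states that specificity was fixed at 0.95 but the supplement does not describe how the score cut-off was chosen. One plausible reading, sketched here in Python purely as an assumption, is to pick the cut-off from the scores of the negative pairs so that 95% of them fall at or below it, and then to count true and false positives above that cut-off. The function names and the toy scores are hypothetical.

```python
def threshold_at_specificity(neg_scores, specificity=0.95):
    """Smallest observed score with at least `specificity` of negatives at or below it."""
    ordered = sorted(neg_scores)
    n = len(ordered)
    for rank, score in enumerate(ordered, start=1):
        if rank / n >= specificity:
            return score
    return ordered[-1]


def count_positives(scores, labels, threshold):
    """Return (true positives, false positives) for predictions score > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    return tp, fp


# Toy data: 20 negative pairs and 5 positive pairs (made-up scores).
neg = [i / 100 for i in range(1, 21)]
pos = [0.05, 0.19, 0.3, 0.6, 0.9]
scores = neg + pos
labels = [0] * len(neg) + [1] * len(pos)

threshold = threshold_at_specificity(neg, 0.95)
tp, fp = count_positives(scores, labels, threshold)
print(threshold, tp, fp)
```

Applying the same procedure to each of the three models would yield one set of true positives and one set of false positives per model, which is the kind of input a three-way Venn diagram compares.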
[Fig. S5 plots: histograms (y-axis: percent) of degree-prod, degree-absdiff, jackard-ae-max, and jackard-ae-KL, shown separately for pairs predicted as non‐edges and pairs predicted as edges.]
Fig. S5. Comparative histograms of selected network covariates for the predicted edges and non‐edges. The predictions were generated by fixing the specificity of model NET at 0.95.
[Fig. S6 plots: histograms (y-axis: percent) of atc-min and atc-KL, shown separately for pairs predicted as non‐edges and pairs predicted as edges.]
Fig. S6. Comparative histograms of selected taxonomic covariates for the predicted edges and non‐edges. The predictions were generated by fixing the specificity of model TAX at 0.95.
[Fig. S7 plots: histograms (y-axis: percent) of euclid-min and euclid-KL, shown separately for pairs predicted as non‐edges and pairs predicted as edges.]
Fig. S7. Comparative histograms of the intrinsic covariates for the predicted edges and non‐edges. The predictions were generated by fixing the specificity of model INT at 0.95.
[Fig. S8 plots: (A) AUROC (y-axis, 0.0 to 1.0) by ATC top-level category (x-axis: A, B, C, D, G, H, J, L, M, N, P, R, S, V); (B) AUROC (y-axis, 0.0 to 1.0) against number of newly associated ADEs (x-axis), with regression line.]
Fig. S8. Drug‐specific AUROCs. (A) AUROCs were grouped by ATC top‐level category. Group means and 95% CIs (error bars) are shown in red. (B) AUROC was plotted against the number of newly associated ADEs. A regression line with slope of 0.00003 and P = 0.86 (F test) is shown.
[Fig. S9 plots: (A) AUROC (y-axis, 0.0 to 1.0) by MedDRA top-level category (x-axis); (B) AUROC (y-axis, 0.0 to 1.0) against number of newly associated drugs (x-axis), with regression line.]
Fig. S9. ADE‐specific AUROCs. (A) AUROCs were grouped by MedDRA top‐level category. Group means and 95% CIs (error bars) are shown in red. (B) AUROC was plotted against the number of newly associated drugs. A regression line with slope of ‐0.0015 and P < 0.0001 (F test) is shown.