
Survival Analysis Dimension Reduction Techniques
A Comparison of Select Methods

Claressa L. Ullmayer and Iván Rodríguez

Abstract

Although formal studies across many fields may yield copious data, those data can often be collinear (redundant) in terms of explaining particular outcomes. Thus, dataset dimensionality reduction becomes imperative for facilitating the explanation of phenomena given abundant covariates (independent variables). Principal Component Analysis (PCA) and Partial Least Squares (PLS) are established methods used to obtain components—linear combinations built from the eigenvectors of the given data's variance-covariance matrix—such that the variance of, or the covariance between, linear combinations of predictor and response variables is maximized. PCA employs orthogonal transformations on covariates to reduce dataset dimensionality by producing new uncorrelated variables. PLS, rather, projects both predictor and response variables into a new space to model their covariance structure. In addition to these standard procedures, three variants of Johnson-Lindenstrauss low-distortion Euclidean-space embeddings (random matrices, RM) were also investigated. Each technique's performance was explored by simulating 5,000 datasets using R statistical software. The semi-parametric Accelerated Failure Time (AFT) model was utilized to obtain predicted survivor curves. Then, total bias error (BE) and mean-squared error (MSE) between the true and estimated survivor curves were determined to find the error distributions of all methods. The results herein indicate that PCA outperforms PLS, the three RMs perform comparably to one another, and the RMs outdo both PCA and PLS.

Keywords: survival analysis; dimension reduction; big data; principal component analysis (PCA); partial least squares (PLS); Johnson-Lindenstrauss (JL); random matrices; accelerated failure time (AFT); bias; mean-squared error.


Contents

1 Introduction
2 Survival Analysis
3 Methods
  3.1 Dimension Reduction
    3.1.1 Notation
    3.1.2 Principal Component Analysis
    3.1.3 Partial Least Squares
    3.1.4 Random Matrices
  3.2 The Accelerated Failure Time Model
4 Method Assessments
  4.1 Simulated Datasets
5 Results
  5.1 Principal Component Analysis versus Partial Least Squares
  5.2 Random Matrices
  5.3 All Methods
6 Discussion
7 Conclusion
8 Acknowledgments
9 References
10 Appendix
  10.1 Error Plots
  10.2 Johnson-Lindenstrauss Testing
  10.3 Survival Curves


1 Introduction

Throughout various studies, researchers are able to associate covariates with a set of observations. From here, analysts naturally seek to explain the relationship between the two with regard to a given set of phenomena. Methods such as the Cox Proportional Hazards (CPH) and the Accelerated Failure Time (AFT) models have been proposed with this intent in mind (Cox, 1972). However, to successfully utilize both approaches, it is necessary to have more observations than covariates. Depending on the context, this property may not be satisfied, thus rendering both methods inept. One example of this complication arises in commonplace microarray gene expression data. In this situation, there can often be fewer observations—patients—than covariates attributed to them—genes. As a result, it becomes imperative to reduce the dimensionality of the dataset and then apply a suitable regression technique thereafter to understand the underlying relationships between the predictor and response variables. Naturally, reducing the original dataset's dimensionality implies some loss of information; thus, a favorable dimension reduction technique will minimize the loss of relevant information.

With this in mind, dimension-reduction techniques have abounded to meet this end. In this investigation, the methods of Principal Component Analysis (PCA), Partial Least Squares (PLS), and three variants of Johnson-Lindenstrauss-inspired Random Matrices (RM) are compared (Johnson and Lindenstrauss, 1984). The first approach, PCA, originated with and was described by Pearson (1901). PLS was first rigorously introduced and explained by Wold (1966). The three variants of RMs were constructed according to the specifications of Achlioptas (2003) and Dasgupta and Gupta (2003). This research was motivated in part by the results of Nguyen and Rocke (2004) and Nguyen (2005) regarding the performance of PCA vis-à-vis PLS. Furthermore, the two works of Nguyen and Rojo (2009)—one on the performance of PLS variants and one on a multitude of reduction and regression approaches—were utilized in this inquiry.

Typically, the Cox PH model has been the standard model in this application. In this paper, however, the AFT model was employed. Random datasets were first generated using the statistical software suite R. Each batch of simulated datasets had a fixed, true survivor function attributed to it. From here, the three dimension reduction approaches were applied to the simulated datasets. Then, the AFT model was used to generate a predicted survivor function. Bias and mean-squared error between the real and estimated curves were then calculated over a partition of fixed time values.


2 Survival Analysis

Before any serious discussion of the current work can begin, a familiarity with the area known as survival analysis must first be cultivated. In a sentence, survival analysis employs various methods to analyze data where the response variable is the time until an unambiguous event of interest occurs (Despa). This event must be rigorously defined—some examples include birth, death, marriage, divorce, job termination, promotion, arrest, revolution, heart attack, stroke, metastasis, and winning the lottery, to name a few (Ross).

Depending on the research domain, this wide field has many monikers. It is referred to as failure time analysis, hazard analysis, transition analysis, reliability theory/analysis in engineering, duration analysis/modeling in economics, and event history analysis in sociology (Allison). At the time of this investigation, 'survival analysis' serves as the umbrella term for all of the aforementioned names.

Survival analysis arose out of the desire to overcome some limitations present in standard linear regression approaches (Despa). One of the two immediate complications that survival analysis can successfully address is data where responses are all positive values—exempli gratia, survival times that range over t ∈ (0, ∞) (Despa). Secondly, survival analysis can grapple with censored data.

After the event of interest within a particular investigation has been rigorously declared, an observation is branded as 'censored' if the event was not observed. This can occur for a plethora of reasons. A common one involves a patient in a clinical trial dropping out of the study. In this case, it is unknown how much longer it may have taken for that individual to experience the particular event of interest. Another example of censoring in the real world involves observations that do not experience the event by the end of a formal investigation. That is, an individual managed to not experience the event of interest for the whole duration of a study, so they are necessarily labeled as censored.

With this ubiquitous term broadly explained, it is also necessary to understand that many forms of censoring exist. Typically, most data are 'right-censored'. This term signifies observations that have the potential to experience the declared event of interest after—or to the right of, on a time-line—the time they became censored. For instance, take an individual with a certain stage of cancer and declare the event of interest to be death. Then, if this person becomes censored, the event of interest is naturally bound to occur after the time they became censored. In a similar manner, 'left-censored' data occur when the event of interest occurred before the specific time a formal investigation began (Lunn). Understandably, this phenomenon is less commonplace in reality. An example of left-censored data involves providing a questionnaire to mothers inquiring whether or not they are actively breastfeeding (Vermeylen). Left-censoring would occur if a mother entered the study and


had hitherto stopped breastfeeding. Finally, a third type is known as 'interval censoring'. This might be observed in a case where clinical follow-ups are necessary. For a datum to be interval-censored, the event of interest would have to be observed within an interval between two successive follow-ups (Sun).

Survival analysis is a prominent regression approach because it can successfully incorporate both censored and uncensored data when modeling the relationship between predictors and responses (Despa). Typically, the response variables will have at least both a survival time and a censoring status associated with them. From here, methods exist to estimate both survival and hazard functions that facilitate the interpretation of the distribution of survival times (Despa).

Survivor curves give the probability that the event of interest has not been experienced by a particular time. Rigorously,

S(t) = P(T > t) = ∫_t^∞ f(τ) dτ = 1 − F(t),

where S(t) denotes the survivor function, t is a fixed time, T is a random variable, f(τ) is the probability density function of T, and F(t) is the cumulative distribution function of T.
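As a concrete illustration (the exponential form here is the same one used for the simulated survivor curves in Section 4): if T is exponential with rate λ, then f(τ) = λe^(−λτ), so S(t) = ∫_t^∞ λe^(−λτ) dτ = e^(−λt).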

The hazard, on the other hand, is defined as the rate at which events happen (Duerden). Thus, one can calculate the probability of an event happening within a small time interval as this hazard rate multiplied by the length of the interval (Duerden). Additionally, the hazard function describes the probability that an observation experiences the event of interest at a particular time (Duerden). This implies that the observation has already survived—that is, has not experienced the event of interest—up to the specified time (Duerden). In precise terms, the hazard function is defined as

h(t) = f(t) / S(t),

where f(t) denotes the probability density function and S(t) represents the survival function of a random variable T. From this expression, it is immediately possible to understand the intricate relationship between the density, survival, and hazard functions. As a result, many other expressions exist aside from this rather simplistic form.
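Continuing the exponential illustration, f(t) = λe^(−λt) and S(t) = e^(−λt), so h(t) = λe^(−λt)/e^(−λt) = λ; the exponential model therefore has a constant hazard rate.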

A natural question that may arise within survival analysis is whether results involving survivor curves or hazard functions are desired. In many contexts, researchers prefer survivor curves in order to interpret the results of their gathered data. Arguably, since these curves output a probability in response to an input of time, it becomes easier to comprehend trends and relationships than by doing so via the hazard. Furthermore, hazard functions and hazard rates are based on ratios of probability density functions and survival curves; this makes hazard results


more difficult to digest and understand.

Aside from these considerations, there is another factor involved in survival analysis to consider: the selection of the method used to relate predictor variables to the resulting survival times. The three main forms include parametric, semiparametric, and nonparametric models (Despa). These differ in the assumptions made about the given data.

Parametric approaches make the prime assumption that the distribution of the survival times follows a known probability distribution (Despa). For example, these can include the exponential and compound exponential, Weibull, Gompertz-Makeham, Rayleigh, gamma and generalized gamma, log-normal, log-logistic, generalized F, and Coale-McNeil models (Rodriguez, 2010). For these and other applicable methods, model parameters are estimated by maximizing the likelihood (Despa). In parametric techniques, the relationships between f(t), F(t), S(t), and h(t) are fully determined by the assumed distribution (Cook).

In contrast, a nonparametric model does not assert such bold assumptions. For instance, linearity and a smooth regression function are not necessary in a nonparametric context (Fox). Although this provides a researcher with much more flexibility, interpretation can oftentimes become more difficult.

A semiparametric model posits that the error attributed to a nonlinear regression model follows a well-defined probability distribution, with the errors being uncorrelated and identically distributed. In addition, a model of this form does not presume that the baseline hazard function has a particular 'shape' attributed to it. More generally, when a combination of both parametric and nonparametric assumptions is present, the regression model is appropriately described as being semiparametric in nature.

These three types of regression models are rigorously represented below. Let n denote the number of observations, Y represent the response variable, X signify the matrix of predictors, and let β be the regression coefficients with errors ε. Additionally, let m(·) = E(y_i | x_i) such that i = 1, . . . , n.

A parametric model can be expressed as

y_i = x_i^T β + ε_i,  i = 1, . . . , n.

In this case, the resulting curve is smooth and known. Furthermore, it is described by a finite set of parameters which will need to be estimated. Ultimately, interpretation is simple through this approach.

Then, for a nonparametric method,

y_i = m(x_i) + ε_i,  i = 1, . . . , n.

Here, the function m(·) is also smooth and flexible, yet it is now unknown. Furthermore, the interpretation of such a curve becomes ambiguous.


Lastly, in the case where a model is classified as semiparametric, we observe that

y_i = x_i^T β + m_z(z_i) + ε_i,  i = 1, . . . , n.

As previously mentioned, some parameters are necessarily estimated while others will be determined through the given data.

3 Methods

The main methods employed in this investigation were centered on different ways of performing dimension reduction. These methods were: Principal Component Analysis (PCA), Partial Least Squares (PLS), and a set of three distinct Random Matrices (RM). For each method, the AFT model was employed to generate survivor curve estimates. These methods will be discussed in greater detail here.

3.1 Dimension Reduction

The central goal of the three aforementioned dimension reduction techniques is to reduce a dataset with n observations and p covariates to a new dataset of dimensions n × k such that k ≪ p. Additionally, a competent method will achieve this end while retaining an acceptable amount of relevant information and omitting relatively collinear variables.

Both PCA and PLS reduce dimensionality through orthogonal transformations of the covariates; then, a subset of these is retained such that the new covariates predict the response with a satisfactory caliber of precision. Meanwhile, RM differs from these two procedures by generating a matrix with certain qualities that also reduces dimensionality.

To facilitate the explanation of these reduction techniques, pertinent notation will first be introduced.

3.1.1 Notation

Let X be the n × p column-centered matrix such that n and p denote the given observations and covariates, respectively. Also, let n ≪ p. Furthermore, let Y be the n × q matrix of observed responses.

In the microarray gene dataset example, n would represent the number of patients while p would denote the number of observed genes attributed to them. Thus, X would be a matrix that contains particular patients on the rows and their respective genes on the columns. Additionally, Y would serve as an n × 1 vector of survival times.


3.1.2 Principal Component Analysis

PCA reduces dataset dimensionality through orthogonal components obtained by maximizing the variance among linear combinations of the original predictors contained in X. More precisely, k weight vectors or 'loadings' w are constructed such that the rows of X map to principal component scores t. For a given observation n,

t_n = x_n w_k.

Ultimately, X can be completely decomposed into its components as follows:

T = XW.

Here, X has original dimensions n × p, W has dimensions p × p, and T, therefore, has dimensions n × p as expected. Additionally, the columns of W contain the eigenvectors of X^T X.

From here, a desired number of the resulting orthogonal components is chosen. These are referred to as 'principal components' (PCs) since they are chosen in order to maximize the variability along each direction of the new, reduced set of axes. What this transformation accomplishes, in other words, is to project the original data cloud into a new coordinate system via rotations of the initial coordinate system such that the variability of the initial data is maximized along each direction. Additionally, PCs are ranked according to how much variance they account for in their respective directions. That is, the PCs with the largest eigenvalues are ranked the highest and represent a sizable portion of the data, since variability is greatest along their eigenvectors' directions.

It is imperative to note that the PCs obtained from PCA rely on operations performed on X, the given dataset matrix. Thus, the response variable Y is not taken into account during this particular dimension reduction algorithm. Consequently, these PCs may not be laudable predictors of the response variable in a given context. Due to this property of PCA, it is often referred to as an 'unsupervised' technique.
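As a minimal sketch of this decomposition—using a small illustrative matrix X_demo rather than the simulated data, and base R rather than the FactoMineR routine used later—the loadings and scores can be computed directly from the eigenvectors of X^T X:

# Minimal PCA sketch: center a toy data matrix, take the eigenvectors
# of X^T X as the loading matrix W, and form the scores T = XW.
# 'X_demo' is an illustrative stand-in, not the simulated data.
set.seed(1)
X_demo <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)
X_c <- scale(X_demo, center = TRUE, scale = FALSE)

W <- eigen(t(X_c) %*% X_c)$vectors  # loadings (eigenvectors of X^T X)
T_scores <- X_c %*% W               # principal component scores

# Keeping only the first k columns of W (largest eigenvalues) yields
# the reduced n x k representation.
k <- 2
X_reduced <- X_c %*% W[, 1:k]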

3.1.3 Partial Least Squares

Whereas PCA reduces dimensionality through X alone, the method of PLS does so through a consideration of both the independent and dependent variables X and Y. Thus, this approach is often referred to as being 'supervised'.

This regression model is especially useful when there is either high collinearity among predictors or when the number of predictor variables is much greater than the number of observations. In these situations, ordinary least-squares regression would either perform poorly or fail entirely; it would also fail if Y were not one-dimensional—id est, if there were more than one observed response.


PLS extracts factors from both X and Y so that the covariance between these factors is maximized. In particular, PLS is largely based on the singular value decomposition of X^T Y. Recall that PLS does not require Y to be one-dimensional; an advantage of the PLS procedure is that Y can contain as many observed responses as are deemed necessary and practical by researchers.

The method of PLS decomposes both X and Y so that

X = TP^T + E  and  Y = UQ^T + F.

Here, T is a matrix of 'X-scores', P is a matrix of 'X-loadings', and E is a matrix of errors for X. Similarly, U, Q, and F represent the 'Y-scores', 'Y-loadings', and Y errors, respectively. Both X- and Y-scores are defined as linear combinations of the predictor and response variables, respectively. Then, the X- and Y-loadings are linear coefficients that form a bridge from X to T and from Y to U. A common assumption about E and F is that they are random variables with independent and identical distributions. This decomposition of X and Y is done in hopes of maximizing the covariance between T and U.

The PLS algorithm is an iterative procedure. First, two sets of weights must be constructed as linear combinations of the columns of both X and Y. These will be denoted by w and c, respectively. The goal here is to make their covariance maximal. Recall that the matrices T and U denote, accordingly, the X- and Y-scores. Then, the next step in the PLS approach is to obtain a first pair of vectors t = Xw and u = Yc such that w^T w = 1, t^T t = 1, and t^T u is maximized. After these first so-called 'latent vectors' have been obtained, they are subtracted from both X and Y. This procedure is then repeated, thereby eventually reducing X to a zero matrix.
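A minimal usage sketch with the plsreg1 function from plsdepot (the routine used in Section 4.1), applied to small illustrative placeholder data X_demo and y_demo:

# Minimal PLS sketch using plsreg1 from plsdepot, the routine
# employed in Section 4.1; X_demo and y_demo are placeholders.
library(plsdepot)

set.seed(1)
X_demo <- matrix(rnorm(30 * 10), nrow = 30, ncol = 10)
y_demo <- rnorm(30)

pls_fit <- plsreg1(X_demo, y_demo, comps = 3, crosval = FALSE)

# 'x.loads' holds the X-loadings; multiplying the centered predictors
# by them gives a reduced n x k representation, as in the Appendix.
X_reduced <- scale(X_demo, center = TRUE, scale = FALSE) %*% pls_fit$x.loads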

3.1.4 Random Matrices

Whereas the previously discussed methods of PCA and PLS reduce dimensionality through a careful analysis of X and Y, the third technique of constructing random matrices, as the name implies, is considerably more cavalier by comparison. In essence, a random matrix with a particular set of qualities is fabricated. Then, this matrix is multiplied with a given dataset—matrix X in this particular investigation. According to the lemma attributed to Johnson and Lindenstrauss, if two observations in X are considered as multidimensional points with an initial squared distance between them, then once these particular random matrices are applied to X, their initial distance is not distorted by too much. Similar to the approaches utilized in PCA and PLS, random matrices can reduce dimensionality without losing much information in the process. First, the Johnson-Lindenstrauss (JL) Lemma will be presented, as well as a description of the three particular random


matrices that were constructed in this research. The constraint on k was chosen according to Dasgupta-Gupta.

The Johnson-Lindenstrauss Lemma. For any ε ∈ (0, 1) and any n ∈ Z, let k ∈ Z be positive and let

k ≥ 4 ln(n) / (ε²/2 − ε³/3).

Then, for any set S of n points in R^d, there exists a mapping f : R^d → R^k such that, for all points u, v ∈ S,

(1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².

In terms of this investigation, n also represents the number of observations while ε denotes the error tolerance. Finally, k can be thought of as the resulting dimension in this given context after applying a random matrix to the dataset matrix X.
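To make the constraint concrete, a small helper (the function name jl_k_bound is illustrative, not part of the original scripts) evaluates this lower bound on k for a given n and ε:

# Lower bound on the reduced dimension k from the JL Lemma as stated
# above; 'jl_k_bound' is an illustrative helper.
jl_k_bound <- function(n, epsilon) {
  ceiling(4 * log(n) / (epsilon^2 / 2 - epsilon^3 / 3))
}

jl_k_bound(n = 100, epsilon = 0.5)  # smallest k the lemma guarantees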

An immediate complication of these so-called 'JL-embeddings' is that we may sometimes observe that k ≥ d as a result of strictly following the hypotheses of the lemma. Id est, by employing the results of this theorem, a researcher could be taking data from a smaller dimension and transforming it so that the data exist in a higher dimension. Ultimately, the JL Lemma may not reduce dimensionality at all, thus rendering it impractical for the desired purposes of this text. Thus, it became imperative in this research to observe the effects of ignoring the constraints on k of the JL Lemma and to deduce whether or not desirable results are obtained nonetheless. Having understood the motivation behind random matrices and these precise limitations, an explanation of the three random matrices themselves is now in order.

The first two random matrices were fabricated according to the previous results of Achlioptas, while the third was constructed by following the specifications of Dasgupta-Gupta. Let Γ1, Γ2, and Γ3 accordingly denote these random matrices. To keep consistent with the previous notation, recall that X is an n × p predictor matrix with observations on the rows and covariates on the columns. It follows that Γ1, Γ2, and Γ3 are p × k matrices. Once X is multiplied by one of them, the resulting matrix Ω will have dimensions n × k, where the goal is to have n > k.

Entries of Γ1 were produced from the following distribution:

(1/√k) × { −1 with probability 1/2, +1 with probability 1/2 }.

For Γ2, its entries were obtained from

√3 × { −1 with probability 1/6, 0 with probability 4/6, +1 with probability 1/6 }.


Finally, Γ3 is a Gaussian random matrix with entries generated from N(0, 1). The resulting rows of Γ3 are then normalized.
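A compact sketch of how the three matrices can be generated (this mirrors the Appendix code, where the first two projections are additionally scaled by 1/√k; p and k denote the original and reduced dimensions):

# Sketch of the three random matrices described above.
p <- 1000
k <- 37

# Gamma_1: +/-1 entries with equal probability, scaled by 1/sqrt(k).
Gamma1 <- matrix(sample(c(-1, 1), p * k, replace = TRUE), p, k) / sqrt(k)

# Gamma_2: sqrt(3) * {-1, 0, +1} with probabilities 1/6, 4/6, 1/6,
# also scaled by 1/sqrt(k) as in the Appendix.
Gamma2 <- matrix(sample(c(-sqrt(3), 0, sqrt(3)), p * k, replace = TRUE,
                        prob = c(1/6, 4/6, 1/6)), p, k) / sqrt(k)

# Gamma_3: N(0, 1) entries with each row normalized to unit length.
Gamma3 <- matrix(rnorm(p * k), p, k)
Gamma3 <- Gamma3 / sqrt(rowSums(Gamma3^2))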

3.2 The Accelerated Failure Time Model

The previously described techniques were employed in order to reduce dimensionality. After achieving this, it was necessary to generate a survival curve based on the modified data and compare it with the true survival curve. In this investigation, the AFT model was the vehicle used to generate estimates of the survivor curves.

The AFT model is seldom utilized compared to the celebrated Cox Proportional Hazards (PH) model, for various reasons. One reason to adopt the AFT approach in this investigation is the simplified interpretation it provides researchers of the data. This approach presents an interpretation of the relationship between observation covariates and given responses in terms of survivor curves. The Cox PH model, on the other hand, does so through hazard functions and hazard ratios that, while equally profound, are not as visually simple to comprehend as the AFT model's survivorship presentation. In simple terms, the hazard is the instantaneous event rate within a small neighborhood of a particular time. It is arguably more straightforward to understand results in terms of the probability that an individual 'survives', or does not experience the event of interest, after a particular time. Thus, this first reason to employ the AFT model is a matter of user preference and ease of interpretation of results. Another, technical, reason to employ the AFT model is that it directly models the given survival times. This is one luxury that the Cox PH model does not afford the researcher.

In this investigation, AFT was implemented according to the following underlying model:

ln(T_i) = μ + z_i'β + e_i.

Here, i represents a particular observation from a set of n observations. Furthermore, T_i denotes the survival time for the i-th observation. Meanwhile, μ designates the given theoretical mean, z_i is the vector of covariates for the i-th observation, and β is the vector of covariate/regression coefficients. Finally, e_i is the given error for the i-th observation.
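In this investigation the model was fit with the aftgee package (see the Appendix). A minimal fitting sketch, with Z_red and T_sim as illustrative placeholders for a reduced covariate matrix and simulated survival times, looks as follows:

# Minimal AFT fitting sketch with 'aftgee', as in the Appendix code;
# Z_red, T_sim, and delta are illustrative placeholders.
library(survival)
library(aftgee)

set.seed(1)
Z_red <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
T_sim <- rexp(100, rate = drop(exp(-Z_red %*% rep(0.1, 5))))
delta <- rep(1, 100)  # event indicator

fit <- aftgee(Surv(T_sim, delta) ~ -1 + Z_red,
              corstr = "independence", B = 0)
fit$coefficients  # estimated AFT regression coefficients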

4 Method Assessments

This research utilized a programming environment to simulate datasets that would undergo the reduction procedures of PCA, PLS, and the variants of RMs. Additionally, 'feeding' these data into the AFT model to obtain and compare the pairs of


survival curves was likewise accomplished through statistical software. This section addresses specifically how the research was performed.

4.1 Simulated Datasets

In order to compare the dimension reduction techniques, R statistical software is used to simulate data. The β regression coefficients, observations, covariates, and survival times are simulated using the previously discussed AFT formula, where the theoretical mean, μ, is set to 0 for simplicity. The dimensionality of the data matrix, X, is 100 observations by 1,000 covariates. A vector of 1,000 β regression coefficients relating to the 1,000 covariates is obtained by generating random values from U(−1×10^(−7), 1×10^(−7)). A vector, μ_j, of random values is generated from a N(0, 1) distribution for j = 1, . . . , p, where p represents the number of covariates. β and μ remain fixed for all simulations. Next, the 100 × 1,000 matrix X of the 1,000 covariates and 100 observations is generated with x_ij = e^(z_ij), where z_ij ~ N(μ_j, 1) for j = 1, . . . , p and i = 1, . . . , n, and where n is the number of observations; the data are therefore log-normally distributed. The survival times, T_i, are constructed from an exponential distribution with rate λ_i = e^(−x_i'β) for i = 1, . . . , n.
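A condensed sketch of this generation step, mirroring the Appendix code with n = 100 and p = 1000 (variable names here are illustrative):

# Condensed data-generation sketch for Section 4.1: log-normal
# covariates and exponentially distributed survival times.
set.seed(1)
n <- 100
p <- 1000

beta <- runif(p, min = -1e-7, max = 1e-7)  # fixed true coefficients
mu <- rnorm(p, mean = 0, sd = 1)           # fixed covariate means

Z <- matrix(rnorm(n * p, mean = rep(mu, each = n), sd = 1), n, p)
X <- exp(Z)                                # log-normal covariates

lambda <- drop(exp(-X %*% beta))           # per-observation rates
T_times <- rexp(n, rate = lambda)          # simulated survival times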

Now that all the data are generated, z_(n×p) is converted to z*_(n×p) by centering each column about its mean. PCA is applied to z*_(n×p) using the function PCA from the package FactoMineR (Husson et al., 2015) to obtain 99 principal components. After this procedure is completed, the principal components are narrowed down to 37, which represent 50% of the total variance of the model. PCA outputs a weight matrix of dimension 1000 × 37, which represents the weights given to each covariate by the 37 principal components. The data matrix, X, is multiplied by this weight matrix to obtain a reduced-dimension matrix of 100 × 37. A Surv object is created, which takes the survival times, the censoring type, and an event/censoring indicator vector, and outputs a response matrix. The T_i vector and the 37 principal components are fed into the AFT model in R using the package aftgee (Chiou et al., 2015) to obtain the 37 estimated β coefficients for the principal components. The weight matrix is then multiplied by these estimates to recover the 1,000 β estimates for the original covariates.

In order to acquire an estimated lambda value for the estimated survival function, the mean of the exponentiated negative product of the data matrix and the β estimates is taken. The estimated survival function is then Ŝ₀(t) = e^(−λ̂t), where λ̂ is the estimated mean lambda value. This procedure is repeated for PLS using the same number of components as PCA, except using the function plsreg1 from the package plsdepot (Sanchez, 2015) instead.

The matrices Γ1, Γ2, and Γ3 from Achlioptas and Dasgupta-Gupta are generated containing random entries that satisfy each author's probability


specifications. An algorithm in R is created to validate the Johnson-Lindenstrauss Lemma's dimension reduction ability for Γ1, Γ2, and Γ3. The algorithm takes two randomly picked vectors u, v from X and maps f : R^p → R^k, where k is the new reduced dimension. The Johnson-Lindenstrauss Lemma is then tested using varying values of ε and k over multiple simulations. It is shown that as long as k and ε follow the constraints given by Dasgupta and Gupta (2003), the Johnson-Lindenstrauss Lemma is satisfied 100% of the time. The value of ε is varied until the desired 1000 × 37 projection matrix satisfying the Johnson-Lindenstrauss Lemma is obtained. Unfortunately, a fairly high ε value of approximately 0.65 is required to satisfy the lemma. Therefore, either a high ε value is used or the lemma is not followed.

In order to compare the random matrices to PCA and PLS, X is multiplied by Γ1, Γ2, and Γ3, each with dimensions 1000 × 37, to obtain resulting k-dimensional matrices of 100 × 37. Then, the reduced matrices are fed into the AFT model and all the same steps as for PCA and PLS are performed. Therefore, five different estimated survival curves are produced: one each for PCA and PLS and three for the three random matrices.

The true survival curve is S₀(t) = e^(−λ̄t), where λ̄ is the mean of the λ_i values, which are created by exponentiating the negative product of the data matrix and the true β coefficients. The y-axis of the survival curve is partitioned into 20 equally spaced points from 0.025, . . . , 0.975, and the corresponding t_i values are then found along the x-axis. The bias and mean-squared error (MSE) are calculated at each of these t_i values to obtain the error distribution for each method. The bias is found by calculating the pointwise difference between the true and estimated survival curves, and the MSE is calculated from the squared difference. The bias and MSE at each t_i are summed over the 5,000 simulations, and the error distributions are compared across all methods.
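A small sketch of this error calculation on the 20-point grid; S_true and S_est below are hypothetical stand-ins for the true and estimated survivor curves:

# Pointwise bias and MSE over the 20-point grid described above;
# S_true and S_est are illustrative stand-ins.
lambda_bar <- 0.5
lambda_hat <- 0.6
S_true <- function(t) exp(-lambda_bar * t)
S_est <- function(t) exp(-lambda_hat * t)

u <- seq(0.025, 0.975, by = 0.05)   # 20 survival probabilities
tt <- -log(u) / lambda_bar          # corresponding times on the true curve

bias <- S_est(tt) - S_true(tt)      # pointwise bias at each time
mse <- (S_est(tt) - S_true(tt))^2   # pointwise squared error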

5 Results

In the following sections, the error distribution plots for the dimension reduction techniques are compared after 5,000 simulations. PLS and PCA are compared to each other, the random matrices are compared among themselves, and then all dimension reduction techniques are compared together. The goal is to minimize bias and MSE; therefore, the dimension reduction technique whose curve lies closest to zero is the more efficient method. In the bias plots, zero is at the top of the plot, and for MSE, the black horizontal line at the bottom denotes zero. Notice that the plots differ least at the extremes of the survival curve's domain, while the most variability is observed in the middle of the interval.


5.1 Principal Component Analysis versus Partial Least Squares

From the plots above, it is shown that PCA outperforms PLS by a maximum magnitude of approximately 0.07 for the bias and 0.03 for the MSE.


5.2 Random Matrices

In the plots above, RM1 denotes Γ1, RM2 denotes Γ2, and RM3 denotes Γ3. The results show that there is no significant difference in performance between the three random matrices in terms of bias and MSE.


5.3 All Methods

From both the bias and MSE plots, it is evident that all three random matrices outperform both PCA and PLS. The random matrices outperform PCA by a magnitude of approximately 0.03 and PLS by 0.10 for bias, and by 0.015 and 0.045, respectively, for MSE.


6 Discussion

We originally wanted to generate our β coefficients from a U(−0.2, 0.2) distribution, but when we computed x_i'β to get our λ_i values, we obtained very large values. Recall our formula λ_i = e^(−x_i'β). When x_i'β is very large, the λ_i values become very small, and within R's numerical precision the survival function is estimated as 1, creating a horizontal survival curve. Therefore, we had to reduce the β coefficients to U(−1×10^(−7), 1×10^(−7)) to obtain survival curves with realistic properties.

Before conducting our research, we investigated previous work in the field, such as the two papers of Nguyen and Rojo (2009). According to their findings, PLS outperformed PCA, which is the result we also expected to observe; instead, we found that PCA greatly outperformed PLS. We are not certain why our results differ from these works, but we suspect that it is due to not incorporating censored data. In both papers of Nguyen and Rojo, methods were compared using censored data, which we did not have time to incorporate into our research. Therefore, we suspect that PLS might outperform PCA when censored data are used, but that PCA outperforms PLS with uncensored data.

Obviously, in real-life studies, censored data can be a serious issue that needs to be taken into account. We wanted to incorporate censored data into our investigation but were unable to due to time constraints. This is something that we would like to add in future investigations. We also wanted to apply our findings to real microarray gene datasets, where there are a small number of patients with a specific type of cancer and a large number of genes. We wanted to work with these datasets and apply our dimension reduction techniques to obtain estimated survival curves where the event of interest was death and the survival curve modeled each patient's probability of surviving past a given time T_i. Unfortunately, we were not able to work with these real datasets, which is also something we would like to investigate at a future time.

7 Conclusion

The results of performing PLS, PCA, and the three Johnson-Lindenstrauss-inspired matrices from Achlioptas and Dasgupta-Gupta on log-normally distributed, uncensored data for estimating the survival curve under the AFT model show that PCA outperforms PLS in terms of both bias and MSE. The three random matrices do not show a significant difference between each other in terms of either bias or MSE. Overall, the random matrices outperform both PCA and PLS for both bias and MSE.


8 Acknowledgments

This research was supported by the National Security Agency through REU Grant H98230-15-1-0048 to the University of Nevada at Reno, Javier Rojo PI. We would like to greatly thank and acknowledge our advisor Dr. Javier Rojo, as well as Nathan Wiseman and Kyle Bradford from the University of Nevada, Reno, for their support and generous contributions to our research.

9 References

Cox, D.R. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34: 187-220, 1972.

Johnson, W.B. and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemp Math 26: 189-206, 1984.

Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2: 559-572, 1901.

Wold, H. Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiah (Ed.), Multivariate Analysis: 391-420, 1966.

Achlioptas, D. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4): 671-687, 2003.

Dasgupta, S. and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22(1): 60-65, 2003.

Nguyen, D.V. Partial least squares dimension reduction for microarray gene expression data with a censored response. Math Biosci 193: 119-137, 2005.

Nguyen, D.V. and D.M. Rocke. On partial least squares dimension reduction for microarray-based classification: A simulation study. Comput Stat Data Anal 46: 407-425, 2004.

Despa, Simona. What is Survival Analysis? StatNews 78: 1-2.

Ross, Eric. "Survival Analysis." 2012. PDF


Allison, Paul D. "Survival Analysis." 2013. PDF

Lunn, Mary. "Definitions and Censoring." 2012. PDF.

Vermeylen, Francoise. Censored Data. StatNews 67: 1, 2005.

Nguyen, Tuan S. and Javier Rojo. Dimension Reduction of Microarray Gene Expression Data: The Accelerated Failure Time Model. Journal of Bioinformatics and Computational Biology 7(6): 939-954, 2009.

Nguyen, Tuan S. and Javier Rojo. Dimension Reduction of Microarray Data in the Presence of a Censored Survival Response: A Simulation Study. Statistical Applications in Genetics and Molecular Biology 8(1), 2009.

Sun, Jianguo. "Interval Censoring." 2011. PDF.

Duerden, Martin. "What Are Hazard Ratios?" 2012. PDF.

Rodriguez, German. "Parametric Survival Models." Princeton. 2010. PDF.

Cook, Alex. "Survival and hazard functions." 2008. PDF.

Fox, John. "Introduction to Nonparametric Methods." 2005. PDF.

Husson et al. "Package ‘FactoMineR’." 2015. PDF.

Sanchez, Gaston. "Package ‘plsdepot’." 2015. PDF.

Chiou et al. "Package ‘aftgee’." 2015. PDF.

Therneau et al. "Package 'survival'." 2015. PDF.


10 Appendix

Herein, the R code utilized in this investigation is presented. The packages survival (Therneau et al., 2015), FactoMineR (Husson et al., 2015), plsdepot (Sanchez, 2015), and aftgee (Chiou et al., 2015) will need to be installed and loaded into R to successfully run the provided code.

10.1 Error Plots

Below is the code used to produce the six error plots for the five reduction methods.

library(survival)
# We created a Surv object using function 'Surv' from this
# package.

library(FactoMineR)
# We used the function 'PCA' from this package.

library(plsdepot)
# We used 'plsreg1' from this package.

library(aftgee)
# With this package, we were able to apply the AFT model to our
# simulated data using the function 'aftgee'.

sim <- function(s) {
# This function will produce 's' simulations and output
# error plots.

t1 <- Sys.time() # Initial time.

num <- 1 # Initial counter.

sum_PCA_BE_t <- matrix(0, 1, 20)
sum_PCA_MSE_t <- matrix(0, 1, 20)
sum_PLS_BE_t <- matrix(0, 1, 20)
sum_PLS_MSE_t <- matrix(0, 1, 20)
sum_RM1_BE_t <- matrix(0, 1, 20)
sum_RM1_MSE_t <- matrix(0, 1, 20)


sum_RM2_BE_t <- matrix(0, 1, 20)
sum_RM2_MSE_t <- matrix(0, 1, 20)
sum_RM3_BE_t <- matrix(0, 1, 20)
sum_RM3_MSE_t <- matrix(0, 1, 20)
# These will store the calculated bias and mean-squared
# error across 20 selected points after we have run 's'
# simulations.

beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.

mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.

X <- matrix(0, 100, 1000)
# A location for the dataset information.

while (num <= s) {
# Running the entire code for 's' iterations.

# No problems at the start of this iteration.

for (i in 1:100)
  for (j in 1:1000)
    X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations
# on the rows and covariates on the columns.

z <- exp(X)
# All entries of matrix 'X' have been exponentiated and
# stored in 'z', which has dimensions 100 by 1,000.

lambda <- matrix(0, 100, 1) # Rate values.

for (i in 1:100) # Generating lambda values.
  lambda[i] <- exp(t(-z[i, ]) %*% as.matrix(beta))


T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.

for (i in 1:100) # Survival times being generated.
  T[i] <- rexp(1, rate = lambda[i])

RM1 <- matrix(0, 1000, 37)
# Random matrix one with '-1's and '+1's.

for (m in 1:1000)
  for (n in 1:37)
    RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
                        prob = c(1/2, 1/2))

RM1 <- RM1 / sqrt(37)

RM2 <- matrix(0, 1000, 37)
# Random matrix two with
# '-sqrt(3)'s, '0's, and '+sqrt(3)'s.

for (m in 1:1000)
  for (n in 1:37)
    RM2[m, n] <- sample(c(-sqrt(3), 0, sqrt(3)),
                        1, replace = TRUE,
                        prob = c(1/6, 4/6, 1/6))

RM2 <- RM2 / sqrt(37)

RM3 <- matrix(0, 1000, 37)
# Random matrix three generated under a Gaussian


# distribution.

for (m in 1:1000)
  for (n in 1:37)
    RM3[m, n] <- rnorm(1, 0, 1)

RM3_norm <- matrix(0, 1000, 1)

for (p in 1:1000)
  RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))

for (m in 1:1000)
  for (n in 1:37)
    RM3[m, n] <- RM3[m, n] / RM3_norm[m, ]

z_star <- scale(z, center = TRUE, scale = FALSE)
# Column-centered 'z' matrix for PCA.

z_star_PCA <- PCA(z_star, graph = FALSE, ncp = 37)
z_star_PLS <- plsreg1(scale(z, center = TRUE, scale = TRUE),
                      T, comps = 37, crosval = FALSE)
# Applying PCA and PLS to the data.

z_double_star_PCA <- z_star %*% z_star_PCA$var$coord
z_double_star_PLS <- z_star %*% z_star_PLS$x.loads
z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.


delta <- matrix(0, nrow = 100, ncol = 1)
# An indicator matrix. Here, delta is a 100 by 1 matrix
# of zeros. The zeros are interpreted as meaning that the
# event of interest has definitively occurred. In other
# words, there is currently no censoring with 'delta'
# set up in this manner.

data_Surv <- Surv(time = T, event = delta,
                  type = c("right"))
# A Surv object that takes the survival times from 'T',
# censoring information from 'delta', and is specified
# as being right-censored.

data_AFT_fit_PCA <- aftgee(data_Surv ~ -1 + z_double_star_PCA,
                           corstr = "independence", B = 0)

data_AFT_fit_PLS <- aftgee(data_Surv ~ -1 + z_double_star_PLS,
                           corstr = "independence", B = 0)

data_AFT_fit_RM1 <- aftgee(data_Surv ~ -1 + z_double_star_RM1,
                           corstr = "independence", B = 0)

data_AFT_fit_RM2 <- aftgee(data_Surv ~ -1 + z_double_star_RM2,
                           corstr = "independence", B = 0)

data_AFT_fit_RM3 <- aftgee(data_Surv ~ -1 + z_double_star_RM3,
                           corstr = "independence", B = 0)

beta_hat_star_PCA <- data_AFT_fit_PCA$coefficients
beta_hat_star_PLS <- data_AFT_fit_PLS$coefficients
beta_hat_star_RM1 <- data_AFT_fit_RM1$coefficients
beta_hat_star_RM2 <- data_AFT_fit_RM2$coefficients
beta_hat_star_RM3 <- data_AFT_fit_RM3$coefficients
# The full beta/regression coefficients.

z_bar_star <- matrix(0, 1, 1000)


# Averaged columns of ’z’ will go here.

for (i in 1:1000) # Averaging 'z's columns.
  z_bar_star[1, i] <- mean(z[, i])

beta_hat_z_PCA <- z_star_PCA$var$coord %*% beta_hat_star_PCA
beta_hat_z_PLS <- z_star_PLS$x.loads %*% beta_hat_star_PLS
beta_hat_z_RM1 <- RM1 %*% beta_hat_star_RM1
beta_hat_z_RM2 <- RM2 %*% beta_hat_star_RM2
beta_hat_z_RM3 <- RM3 %*% beta_hat_star_RM3
# The final beta estimates for each technique.

lambda_hat_PCA <- mean(exp(-z %*% beta_hat_z_PCA))
lambda_hat_PLS <- mean(exp(-z %*% beta_hat_z_PLS))
lambda_hat_RM1 <- mean(exp(-z %*% beta_hat_z_RM1))
lambda_hat_RM2 <- mean(exp(-z %*% beta_hat_z_RM2))
lambda_hat_RM3 <- mean(exp(-z %*% beta_hat_z_RM3))
# Generating the lambda constant from each technique
# employed.

lambda_bar <- mean(lambda)
# Taking the average of all 'lambda' values and storing it
# in 'lambda_bar'.

S <- function(t) # The true survivor function.
  exp(-t * lambda_bar)

S_hat_naught_PCA <- function(t)
  # The predicted survivor function through PCA.


exp(-t * lambda_hat_PCA)

S_hat_naught_PLS <- function(t)
  # The predicted survivor function through PLS.
  exp(-t * lambda_hat_PLS)

S_hat_naught_RM1 <- function(t)
  # The predicted survivor function through RM1.
  exp(-t * lambda_hat_RM1)

S_hat_naught_RM2 <- function(t)
  # The predicted survivor function through RM2.
  exp(-t * lambda_hat_RM2)

S_hat_naught_RM3 <- function(t)
  # The predicted survivor function through RM3.
  exp(-t * lambda_hat_RM3)

u <- c(seq(0.025, 0.975, 0.05))
# Desired outputs 'u' that range from 0.025 to 0.975
# and are spaced out by 0.05, resulting in 20 points.

t <- (-1 / lambda_bar) * log(u)
# Input times 't' from the respective 'u's. There are 20
# generated times 't' in this vector.

for (i in 1:20) {
  # Storing bias across the 20 point pairs in PCA.
  sum_PCA_BE_t[i] <- sum_PCA_BE_t[i] +
    (S_hat_naught_PCA(t[i]) - S(t[i]))
}


for (i in 1:20) {
  # Storing mean-squared error across the 20 point pairs
  # in PCA.
  sum_PCA_MSE_t[i] <- sum_PCA_MSE_t[i] +
    (S_hat_naught_PCA(t[i]) - S(t[i])) ^ 2
}

for (i in 1:20) {
  # Storing bias across the 20 point pairs in PLS.
  sum_PLS_BE_t[i] <- sum_PLS_BE_t[i] +
    (S_hat_naught_PLS(t[i]) - S(t[i]))
}

for (i in 1:20) {
  # Storing mean-squared error across the 20 point pairs
  # in PLS.
  sum_PLS_MSE_t[i] <- sum_PLS_MSE_t[i] +
    (S_hat_naught_PLS(t[i]) - S(t[i])) ^ 2
}

for (i in 1:20) {
  # Storing bias across the 20 point pairs in RM1.
  sum_RM1_BE_t[i] <- sum_RM1_BE_t[i] +
    (S_hat_naught_RM1(t[i]) - S(t[i]))
}

for (i in 1:20) {
  # Storing mean-squared error across the 20 point pairs
  # in RM1.
  sum_RM1_MSE_t[i] <- sum_RM1_MSE_t[i] +
    (S_hat_naught_RM1(t[i]) - S(t[i])) ^ 2
}

for (i in 1:20) {


  # Storing bias across the 20 point pairs in RM2.
  sum_RM2_BE_t[i] <- sum_RM2_BE_t[i] +
    (S_hat_naught_RM2(t[i]) - S(t[i]))
}

for (i in 1:20) {
  # Storing mean-squared error across the 20 point pairs
  # in RM2.
  sum_RM2_MSE_t[i] <- sum_RM2_MSE_t[i] +
    (S_hat_naught_RM2(t[i]) - S(t[i])) ^ 2
}

for (i in 1:20) {
  # Storing bias across the 20 point pairs in RM3.
  sum_RM3_BE_t[i] <- sum_RM3_BE_t[i] +
    (S_hat_naught_RM3(t[i]) - S(t[i]))
}

for (i in 1:20) {
  # Storing mean-squared error across the 20 point pairs
  # in RM3.
  sum_RM3_MSE_t[i] <- sum_RM3_MSE_t[i] +
    (S_hat_naught_RM3(t[i]) - S(t[i])) ^ 2
}

print(paste("Simulation", num, "Complete."))

num <- num + 1
}

ymin_PCA_BE <- min(sum_PCA_BE_t)
ymin_PLS_BE <- min(sum_PLS_BE_t)
ymin_RM1_BE <- min(sum_RM1_BE_t)
ymin_RM2_BE <- min(sum_RM2_BE_t)
ymin_RM3_BE <- min(sum_RM3_BE_t)
# Finding the minimum bias per each technique after


# ’s’ simulations.

ymax_PCA_BE <- max(sum_PCA_BE_t)
ymax_PLS_BE <- max(sum_PLS_BE_t)
ymax_RM1_BE <- max(sum_RM1_BE_t)
ymax_RM2_BE <- max(sum_RM2_BE_t)
ymax_RM3_BE <- max(sum_RM3_BE_t)
# Finding the maximum bias per each technique after
# 's' simulations.

ymin_BE <- min(ymin_PCA_BE, ymin_PLS_BE, ymin_RM1_BE,
               ymin_RM2_BE, ymin_RM3_BE) / s

ymax_BE <- max(ymax_PCA_BE, ymax_PLS_BE, ymax_RM1_BE,
               ymax_RM2_BE, ymax_RM3_BE) / s

# Finding the minimum and maximum bias across all five
# techniques after 's' simulations. These will serve as
# the lower and upper range of the y-axis in the final plot.

ymin_PCA_PLS_BE <- min(ymin_PCA_BE, ymin_PLS_BE) / s
ymax_PCA_PLS_BE <- max(ymax_PCA_BE, ymax_PLS_BE) / s
# Calculating the averaged minimum and maximum bias for PCA
# and PLS after 's' simulations for plotting purposes.

ymin_RM_BE <- min(ymin_RM1_BE, ymin_RM2_BE, ymin_RM3_BE) / s

ymax_RM_BE <- max(ymax_RM1_BE, ymax_RM2_BE, ymax_RM3_BE) / s

# Calculating the averaged minimum and maximum bias for the
# three RMs after 's' simulations for plotting purposes.

ymin_PCA_MSE <- min(sum_PCA_MSE_t)
ymin_PLS_MSE <- min(sum_PLS_MSE_t)
ymin_RM1_MSE <- min(sum_RM1_MSE_t)
ymin_RM2_MSE <- min(sum_RM2_MSE_t)
ymin_RM3_MSE <- min(sum_RM3_MSE_t)
# Finding the minimum mean-squared error per each technique
# after 's' simulations.

ymax_PCA_MSE <- max(sum_PCA_MSE_t)
ymax_PLS_MSE <- max(sum_PLS_MSE_t)
ymax_RM1_MSE <- max(sum_RM1_MSE_t)


ymax_RM2_MSE <- max(sum_RM2_MSE_t)
ymax_RM3_MSE <- max(sum_RM3_MSE_t)
# Finding the maximum mean-squared error per each technique
# after 's' simulations.

ymin_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE, ymin_RM1_MSE,
                ymin_RM2_MSE, ymin_RM3_MSE) / s

ymax_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE, ymax_RM1_MSE,
                ymax_RM2_MSE, ymax_RM3_MSE) / s

# Finding the minimum and maximum mean-squared error across
# all techniques. These will serve as the lower and upper
# range of the y-axis in the final plot.

ymin_PCA_PLS_MSE <- min(ymin_PCA_MSE, ymin_PLS_MSE) / s
ymax_PCA_PLS_MSE <- max(ymax_PCA_MSE, ymax_PLS_MSE) / s
# Calculating the averaged minimum and maximum MSE for PCA
# and PLS after 's' simulations for plotting purposes.

ymin_RM_MSE <- min(ymin_RM1_MSE, ymin_RM2_MSE, ymin_RM3_MSE) / s

ymax_RM_MSE <- max(ymax_RM1_MSE, ymax_RM2_MSE, ymax_RM3_MSE) / s

# Calculating the averaged minimum and maximum MSE for the
# three RMs after 's' simulations for plotting purposes.

# Start of bias plot for PCA and PLS.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
     main = paste("Bias: PCA and PLS \n", s,
                  "Total Simulations"),
     xlab = "Time",
     ylab = "Average Bias",
     ylim = c(ymin_PCA_PLS_BE, ymax_PCA_PLS_BE),
     xlim = c(0, max(t)),
     col = "black")

points(t, (sum_PLS_BE_t) / s, pch = 15, col = "grey")

par(new = TRUE)

abline(0, 0, h = 0)


par(new = TRUE)

legend("topright", c("PCA", "PLS"), pch = c(15, 15),col = c("black", "grey"))

# End of bias plot for PCA and PLS.

# Start of the mean-squared error plot for PCA and PLS.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
     main = paste("Mean-Squared Error: PCA and PLS \n",
                  s, "Total Simulations"),
     xlab = "Time",
     ylab = "Average MSE",
     ylim = c(ymin_PCA_PLS_MSE, ymax_PCA_PLS_MSE),
     xlim = c(0, max(t)),
     col = "black")

points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "grey")

par(new = TRUE)

abline(0, 0, h = 0)

par(new = TRUE)

legend("topright", c("PCA", "PLS"), pch = c(15, 15),col = c("black", "grey"))

# End of mean-squared error plot for PCA and PLS.

# Start of the bias plot for the random matrices.
plot(t, (sum_RM1_BE_t) / s, pch = 15,
     main = paste("Bias: Random Matrices \n", s,
                  "Total Simulations"),
     xlab = "Time",
     ylab = "Average Bias",
     ylim = c(ymin_RM_BE, ymax_RM_BE),
     xlim = c(0, max(t)),
     col = "darkblue")

points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")

points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")


par(new = TRUE)

abline(0, 0, h = 0)

par(new = TRUE)

legend("topright", c("RM1", "RM2", "RM3"),pch = c(15, 15, 15),col = c("darkblue", "red", "gold"))

# End of bias plot for the random matrices.

# Start of the mean-squared error plot for the
# random matrices.
plot(t, (sum_RM1_MSE_t) / s, pch = 15,
     main = paste("Mean-Squared Error: Random Matrices \n",
                  s, "Total Simulations"),
     xlab = "Time",
     ylab = "Average MSE",
     ylim = c(ymin_RM_MSE, ymax_RM_MSE),
     xlim = c(0, max(t)),
     col = "darkblue")

points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")

points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")

par(new = TRUE)

abline(0, 0, h = 0)

par(new = TRUE)

legend("topright", c("RM1", "RM2", "RM3"),pch = c(15, 15, 15),col = c("darkblue", "red", "gold"))

# End of mean-squared error plot for the random matrices.

# Start of bias plot for all methods.
plot(t, (sum_PCA_BE_t) / s, pch = 15,
     main = paste("Bias: All Techniques \n",


s, "Total Simulations"), xlab = "Time",ylab = "Average Bias", ylim = c(ymin_BE, ymax_BE),xlim = c(0, max(t)),col = "black")

points(t, (sum_PLS_BE_t) / s, pch = 15, col = "gray")

points(t, (sum_RM1_BE_t) / s, pch = 15, col = "darkblue")

points(t, (sum_RM2_BE_t) / s, pch = 15, col = "red")

points(t, (sum_RM3_BE_t) / s, pch = 15, col = "gold")

par(new = TRUE)

abline(0, 0, h = 0)

par(new = TRUE)

legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),pch = c(15, 15, 15, 15, 15),col = c("black", "gray", "darkblue", "red", "gold"))

# End of bias plot for all methods.

# Start of mean-squared error plot for all methods.
plot(t, (sum_PCA_MSE_t) / s, pch = 15,
     main = paste("Mean-Squared Error: All Techniques \n", s,
                  "Total Simulations"),
     xlab = "Time", ylab = "Average MSE",
     ylim = c(ymin_MSE, ymax_MSE),
     xlim = c(0, max(t)), col = "black")

points(t, (sum_PLS_MSE_t) / s, pch = 15, col = "gray")

points(t, (sum_RM1_MSE_t) / s, pch = 15, col = "darkblue")

points(t, (sum_RM2_MSE_t) / s, pch = 15, col = "red")

points(t, (sum_RM3_MSE_t) / s, pch = 15, col = "gold")

par(new = TRUE)


abline(0, 0, h = 0)

par(new = TRUE)

legend("topright", c("PCA", "PLS", "RM1", "RM2", "RM3"),pch = c(15, 15, 15, 15, 15),col = c("black", "gray", "darkblue", "red", "gold"))

# End of mean-squared error plot for all methods.

t2 <- Sys.time() # End time.

total_time <- t2 - t1 # Difference between start and end times.

print(total_time) # Printing total time to run simulations
# and obtain the plots.


10.2 Johnson-Lindenstrauss Testing
Below is the code used for testing the Johnson-Lindenstrauss Lemma by varying k and ε.
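For reference, a pair of observations u and v is counted as "good" in the code below exactly when the Johnson-Lindenstrauss condition on squared distances holds,

\[
(1-\varepsilon)\,\lVert u - v\rVert^{2} \;\le\; \lVert f(u) - f(v)\rVert^{2} \;\le\; (1+\varepsilon)\,\lVert u - v\rVert^{2},
\]

where f denotes projection of a 1,000-dimensional observation by one of the three random matrices. The lemma guarantees that such a projection exists once the reduced dimension k is on the order of \(\log(n)/\varepsilon^{2}\) for n points, which motivates varying both k and ε in the test.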

good_points_RM1 <- 0
good_points_RM2 <- 0
good_points_RM3 <- 0
# Good-points counters for each random matrix.
# Points are considered 'good' if they satisfy
# the Johnson-Lindenstrauss Lemma.

sim <- function(s, k, epsilon) {
# This function takes in 's' simulations, a reduced dimension 'k',
# and a desired 'epsilon'. It returns the number of times the
# Johnson-Lindenstrauss Lemma was satisfied for each of the three
# random matrices.

t1 <- Sys.time() # Initial time.

num <- 1 # Initial counter.

mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.

X <- matrix(0, 100, 1000)
# A location for the dataset information.

while (num <= s) {
# Running the entire code for 's' iterations.

problem <- FALSE # No problems at the start of this
# iteration.

for (i in 1:100)
  for (j in 1:1000)
    X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations on
# the rows and covariates on the columns.


z <- exp(X) # All entries of matrix 'X' have been
# exponentiated and stored in 'z', which has dimensions
# 100 by 1,000.

u_v_rows <- sample(1:100, 2, replace = FALSE)
obs_u_old <- z[u_v_rows[1], ]
obs_v_old <- z[u_v_rows[2], ]
# We've selected two different rows from the dataset
# matrix 'z' and stored them as new variables. Here,
# observations 'u' and 'v' can be thought of as
# 1,000-dimensional points.

dist_old <- sum((obs_u_old - obs_v_old) ^ 2)
# Here, the squared distance has been calculated between
# observations 'u' and 'v'.

RM1 <- matrix(0, 1000, k)
# Random matrix one with '-1's and '+1's.

for (m in 1:1000)
  for (n in 1:k)
    RM1[m, n] <- sample(c(-1, 1), 1, replace = TRUE,
                        prob = c(1/2, 1/2))

RM1 <- RM1 / sqrt(k)

RM2 <- matrix(0, 1000, k)
# Random matrix two with '-sqrt(3)'s, '0's, and
# '+sqrt(3)'s.

for (m in 1:1000)
  for (n in 1:k)
    RM2[m, n] <- sample(c(-sqrt(3), 0, sqrt(3)), 1, replace = TRUE,
                        prob = c(1/6, 4/6, 1/6))

RM2 <- RM2 / sqrt(k)

RM3 <- matrix(0, 1000, k)
# Random matrix three generated under a Gaussian
# distribution.

for (m in 1:1000)
  for (n in 1:k)
    RM3[m, n] <- rnorm(1, mean = 0, sd = 1)

RM3_norm <- matrix(0, 1000, 1)

for (p in 1:1000)
  RM3_norm[p, ] <- sqrt(sum(RM3[p, ] ^ 2))

for (m in 1:1000)
  for (n in 1:k)
    RM3[m, n] <- RM3[m, n] / RM3_norm[m, ]

z_star <- scale(z, center = TRUE, scale = FALSE)
# Column-centered 'z' matrix.

z_double_star_RM1 <- z %*% RM1
z_double_star_RM2 <- z %*% RM2
z_double_star_RM3 <- z %*% RM3
# Reducing dimensionality.

obs_u_new_RM1 <- z_double_star_RM1[u_v_rows[1], ]
obs_v_new_RM1 <- z_double_star_RM1[u_v_rows[2], ]
obs_u_new_RM2 <- z_double_star_RM2[u_v_rows[1], ]
obs_v_new_RM2 <- z_double_star_RM2[u_v_rows[2], ]
obs_u_new_RM3 <- z_double_star_RM3[u_v_rows[1], ]
obs_v_new_RM3 <- z_double_star_RM3[u_v_rows[2], ]
# After reducing dimensions, points 'u' and 'v' now have
# new coordinates. Since there were three random
# matrices, there are three new 'u' and 'v' points.

dist_new_RM1 <- sum((obs_u_new_RM1 - obs_v_new_RM1) ^ 2)
dist_new_RM2 <- sum((obs_u_new_RM2 - obs_v_new_RM2) ^ 2)
dist_new_RM3 <- sum((obs_u_new_RM3 - obs_v_new_RM3) ^ 2)
# Calculating the new squared distance between the transformed
# points 'u' and 'v' for each generated random matrix.

if ((1 - epsilon) * dist_old <= dist_new_RM1 &&
    dist_new_RM1 <= (1 + epsilon) * dist_old)

good_points_RM1 <- good_points_RM1 + 1

if ((1 - epsilon) * dist_old <= dist_new_RM2 &&
    dist_new_RM2 <= (1 + epsilon) * dist_old)

good_points_RM2 <- good_points_RM2 + 1

if ((1 - epsilon) * dist_old <= dist_new_RM3 &&
    dist_new_RM3 <= (1 + epsilon) * dist_old)

good_points_RM3 <- good_points_RM3 + 1

# The preceding three 'if' statements check to see if
# the Johnson-Lindenstrauss Lemma was satisfied in this
# iteration for each different random matrix.

print(paste("Simulation", num, "Complete."))


num <- num + 1
}
# End of the simulation loop.

print(paste("For an epsilon of", epsilon, ", k is", k,"."))

print(paste("Number of times JL was satisfied, RM1:",good_points_RM1, "out of", s, "simulations."))

print(paste("Number of times JL was satisfied, RM2:",good_points_RM2, "out of", s, "simulations."))

print(paste("Number of times JL was satisfied, RM3:",good_points_RM3, "out of", s, "simulations."))

t2 <- Sys.time() # End time.

total_time <- t2 - t1
# Difference between start and end times.

print(total_time) # Printing total time to run simulations
# and obtain the counts.
}
# End of the 'sim' function for Johnson-Lindenstrauss testing.
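As a usage sketch only (the simulation counts and the (k, ε) pairs below are illustrative choices, not values taken from the report), the function can be called for several settings to see how often the bound is met:

# Hypothetical example calls; 's', 'k', and 'epsilon' values are
# illustrative and may be changed as desired.
sim(s = 100, k = 50, epsilon = 0.5)
sim(s = 100, k = 200, epsilon = 0.2)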


10.3 Survival Curves
Below is the code used for generating the true survival curve and the estimated survival curve under PCA.
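In the notation of the code below, the two curves drawn in each simulation are the true exponential survivor function, built from the average of the simulated rates, and its estimate obtained after projecting the covariates onto the principal components:

\[
S(t) = \exp\!\left(-\bar{\lambda}\,t\right), \qquad
\hat{S}_{0}(t) = \exp\!\left(-\hat{\lambda}\,t\right), \qquad
\hat{\lambda} = \exp\!\left(-\bar{z}^{*}\hat{\beta}_{z}\right),
\]

where \(\bar{\lambda}\) is the mean of the simulated rates 'lambda', \(\bar{z}^{*}\) holds the column means of the covariate matrix 'z', and \(\hat{\beta}_{z}\) are the AFT coefficient estimates mapped back from the component space.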

library(survival)
library(FactoMineR)

sim <- function(s)
# Making a function that takes in a simulation count 's'.

options(digits = 22) # Preserving more digits in hopes of
# less algorithm failure.

results <- matrix(0, s, 2) # A matrix with BE in column 1
# and MSE in column 2.

BE_T <- 0 # Initial total BE count.
MSE_T <- 0 # Initial total MSE count.
sum_BE_t <- matrix(0, 1, 20) # Matrix of BE at time 't'.
sum_MSE_t <- matrix(0, 1, 20) # Matrix of MSE at time 't'.
num <- 1 # Iteration counter.
sum_BE_t1 <- 0 # Bias error at time 't1'.
sum_MSE_t1 <- 0 # Mean-squared error at time 't1'.
beta <- c(runif(1000, min = -0.0000001, max = 0.0000001))
# Fixed coefficients.
mu <- c(rnorm(1000, mean = 0, sd = 1)) # Mean values.
X <- matrix(0, 100, 1000) # A location for the dataset
# information.

while (num <= s) # Running the entire code for a specified
# number of iterations.

problem <- FALSE # No problems at the start of this
# iteration.

for (i in 1:100)
  for (j in 1:1000)
    X[i, j] <- rnorm(1, mean = mu[j], sd = 1)
# A matrix of random data containing observations on the rows
# and covariates on the columns.

z <- exp(X) # All entries of matrix 'X' have been
# exponentiated and stored in 'z', which has dimensions
# 100 by 1,000.

lambda <- matrix(0, 100, 1) # Rate values.

for (i in 1:100)
  lambda[i] <- exp(t(-z[i, ]) %*% as.matrix(beta))
# Generating lambda values.

T <- matrix(0, nrow = 100, ncol = 1)
# Location for survival times.

for (i in 1:100)
  T[i] <- rexp(1, rate = lambda[i])

z_star <- scale(z, center = TRUE, scale = FALSE)

z_star_PCA <- PCA(z_star, graph=FALSE, ncp=37)

z_double_star <- z_star %*% z_star_PCA$var$coord

delta <- matrix(0, nrow = 100, ncol = 1) # An indicator
# matrix. Here, delta is a 100 by 1 matrix of zeros.
# The zeros are interpreted as meaning that the event of
# interest has definitively occurred. In other words,
# there is currently no censoring with 'delta' set up in
# this manner.

data_Surv <- Surv(time = T, event = delta, type = c("right"))
# A Surv object that takes the survival times from 'T',
# censoring information from 'delta', and is specified as
# being right-censored.

data_AFT_fit <- NULL

data_AFT_fit <- tryCatch(survreg(data_Surv ~ -1 + z_double_star,
                                 dist = "lognormal",
                                 control = survreg.control(maxiter = 100000000)),
                         warning = function(c) problem <<- TRUE)

if (!problem)
# If there's no problem, then the rest of the code will run.

beta_hat_star <- as.matrix(data_AFT_fit$coeff)
# These are beta estimates.

z_bar_star <- matrix(0, 1, 1000)
# Averaged columns of 'z' go here.

for (i in 1:1000)
  z_bar_star[1, i] <- mean(z[, i])
# Taking the average of each column of 'z'.

beta_hat_z <- matrix(0, 1, 1000)
# A location for our beta estimates.

beta_hat_z <- z_star_PCA$var$coord %*% beta_hat_star
# Beta estimates.

lambda_hat <- as.numeric(exp(-z_bar_star %*% beta_hat_z))
# Survival function constant, stored as a scalar so the
# survivor functions below vectorize over 't'.

lambda_bar <- mean(lambda)
# Taking the average of all 'lambda' values and storing
# it in 'lambda_bar'.


S_hat_naught <- function(t) exp(-t * lambda_hat)
# The predicted survivor function.

S <- function(t) exp(-t * lambda_bar)
# The true survivor function.

data_AFT_pred <- predict(data_AFT_fit, type = "terms",
                         se.fit = TRUE)

# Here, we get the predicted values from the 'survreg'
# object 'data_AFT_fit'. That is, we obtain the beta
# values and the standard errors in a 'list' format.

surv_curv <- curve(S_hat_naught, from = 0, to = 7, n = 1000,
                   type = "l", xlab = "", ylab = "",
                   xaxt = "n", yaxt = "n", col = "99")

# Plotting the predicted survivor function.

par(new = TRUE)

curve(S, from = 0, to = 7, n = 1000, type = "l",
      main = paste("Survivor Curves \n Simulation", num),
      xlab = expression(italic(t)),
      ylab = expression(S(italic(t))), col = "black")

u <- c(seq(0.025, 0.975, 0.05))
# Outputs 'u' that range from 0.025 to 0.975,
# spaced by 0.05, resulting in 20 points.

t <- (-1 / lambda_bar) * log(u)
# Input times 't', generated from 'u'. There
# are 20 generated times 't' in this vector.

print(paste("Simulation ", num, sep = ""))


num <- num + 1

else
