Classification project: application using SAS Base programming
Project classifying human carcinoid cells. The application uses SAS Base programming; PROC ACECLUS, PROC VARCLUS and PROC FASTCLUS are used. Given a data set of mRNA expression profiles that contain distinct adenocarcinoma subclasses, classify human lung carcinomas.
Project Scope
Given a data set of mRNA expression profiles that contain distinct adenocarcinoma subclasses, classify human lung carcinomas.
Data Set
This data set contains 56 variables measured on 12,625 genes using the Affymetrix GeneChip 95av2 (a platform dedicated to acquiring, analyzing and managing complex genetic information). Of the 56 variables, 20 relate to lung carcinoid (Carcinoid), 13 to the metastasis of colon cancer (Colon), 17 to normal lung function (Normal) and 6 to small cell lung carcinoma (SmallCell).
STEP 1: Data Manipulation
Missing data & Format
The first step is to make sure that the data set is suitable for the analysis. We check whether the format of the data is appropriate and whether there are missing values. The variables were converted to numeric format so that the analysis could be run. The data set contained no missing values.
Extreme values
We use PROC UNIVARIATE to search for extreme values among our 56 variables. The data set does contain extreme values. This behavior is somewhat expected, as we are analyzing cancer cells, which vary in size. Consequently, we remove the observations with extreme values on the Normal and SmallCell variables, as these variables should not lie more than 3 standard deviations from the mean. We keep the outliers for the Colon and Carcinoid variables. The data were also standardized, although, since all variables are measured in the same unit, this is not strictly required for computing the distances between clusters. Starting from an initial data set of 12,625 observations, we end up with a data set of 11,279 observations.
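The 3-standard-deviation rule can be illustrated with a small Python sketch. This is not the SAS code used in the project: the function name `drop_outliers`, the column names and the toy values are all hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def drop_outliers(rows, checked_columns, max_sd=3.0):
    """Keep rows whose values in checked_columns lie within max_sd
    sample standard deviations of the column mean."""
    limits = {}
    for col in checked_columns:
        values = [row[col] for row in rows]
        mu, sd = mean(values), stdev(values)
        limits[col] = (mu - max_sd * sd, mu + max_sd * sd)
    return [
        row for row in rows
        if all(limits[c][0] <= row[c] <= limits[c][1] for c in checked_columns)
    ]

# Hypothetical toy data: 20 genes, one Normal and one Colon variable.
genes = [{"Normal1": 1.0, "Colon1": 5.0} for _ in range(19)]
genes.append({"Normal1": 100.0, "Colon1": 5.0})  # extreme on Normal1: removed
genes[0]["Colon1"] = 500.0                        # extreme on Colon1: not checked, kept

kept = drop_outliers(genes, checked_columns=["Normal1"])
print(len(genes), "->", len(kept))  # 20 -> 19
```

Only the columns listed in `checked_columns` are screened, which mirrors the choice of filtering on the Normal and SmallCell variables while leaving the Colon and Carcinoid outliers in place.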
Getting an intuition over the data
We want to know whether the variables are related to one another. As the results from PROC CORR indicate, the variables are strongly correlated. Our intuition is that variables of the same type (Normal, Colon, Carcinoid or SmallCell) are more correlated with each other than with variables of another type, for example Colon and Normal. This is confirmed by dividing the variables into clusters using principal components as the criterion. We set the number of clusters to 4. We obtain one cluster (CLUSTER 1) with the Normal variables and half of the SmallCell variables, one cluster (CLUSTER 3) with the Colon variables, and two clusters with Carcinoid variables, one of which also contains the rest of the SmallCell variables (CLUSTER 2) and the other one Colon variable. The full table can be found in the ANNEX.
STEP 2: Choosing the right number of clusters
To find the right clusters, we tried three ways of transforming our initial variables:
1. Computing 56 canonical variables – PROC ACECLUS
2. Reducing the number of variables to 10 – PROC VARCLUS
3. Standardizing our initial 56 variables – PROC STANDARD
However, when comparing the Cubic Clustering Criterion, only the last approach was found appropriate for continuing our analysis.
As we do not know the number of clusters in advance, we apply automatic clustering methods to determine it. The data set is too large to apply the SAS procedure CLUSTER directly, so we first apply FASTCLUS to find a set of initial clusters, which are then used as input for PROC CLUSTER.
We set the maximum number of clusters for the FASTCLUS procedure to 53, that is, the square root of our total number of observations (11,279) divided by 2. As some clusters contained few observations, those with fewer than 9 were deleted and the rest became seeds for a second FASTCLUS run. The output of this run, which contains the clusters, is used as input for the CLUSTER procedure.
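The arithmetic behind the 53 initial clusters, and the removal of small clusters before the second pass, can be sketched in Python. This is an illustration only, not the SAS code used in the project; the function name `clusters_to_keep` and the toy cluster assignments are hypothetical.

```python
import math
from collections import Counter

n_obs = 11279  # observations left after outlier removal

# Square root of the number of observations, divided by 2, rounded down.
max_clusters = int(math.sqrt(n_obs) / 2)
print(max_clusters)  # 53

def clusters_to_keep(assignments, min_size=9):
    """Return the ids of clusters with at least min_size observations."""
    sizes = Counter(assignments)
    return sorted(c for c, n in sizes.items() if n >= min_size)

# Hypothetical first-pass assignments: clusters 2 and 4 have fewer than 9 members.
assignments = [1] * 40 + [2] * 8 + [3] * 9 + [4] * 2
print(clusters_to_keep(assignments))  # [1, 3]
```

The means of the clusters that are kept would then serve as seeds for the second pass.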
The criterion used in our clustering is the Ward distance. At each step, the two classes whose fusion causes the smallest loss of interclass inertia are merged, so the method seeks to keep the intraclass inertia low. The distance is calculated as the squared distance between the two barycenters, divided by the sum of the inverses of the numbers of individuals in the two classes.
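Written as a formula, the Ward distance between two classes A and B reads as follows. The notation is an assumption introduced here for illustration: g_A and g_B are the barycenters of the two classes, and n_A and n_B are their numbers of individuals.

```latex
\[
\Delta(A, B)
  = \frac{\lVert g_A - g_B \rVert^{2}}{\dfrac{1}{n_A} + \dfrac{1}{n_B}}
  = \frac{n_A \, n_B}{n_A + n_B} \, \lVert g_A - g_B \rVert^{2}
\]
```

The second form shows that merging two small classes costs less than merging two large classes whose barycenters are equally far apart.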
1. Computing 56 canonical variables – PROC ACECLUS
In order to compute canonical variables for the subsequent cluster analysis, we obtain approximate estimates of the pooled within-cluster covariance matrix using PROC ACECLUS. As our data set is large, we chose 0.1 as our within-cluster covariance coefficient. Data with poorly separated or elongated patterns need to be transformed, as do variables with different units of measurement or with variances of different sizes. In our case only the former applies, as all the variables use the same unit of measurement. For clustering, it is advisable to have spherical clusters rather than elongated elliptical ones.
We can apply this technique directly to the data, without prior clustering, as it requires no prior knowledge of cluster membership or of the number of clusters.
However, the clustering criteria below fail to validate the canonical variables as appropriate for our analysis. The negative value of the CCC indicates a strong presence of outliers in the data set, which makes it difficult to find an appropriate number of clusters.
Consequently, we will try to reduce the number of variables used.
FIG1. Clustering Criteria for the clusters obtained on canonical variables
2. Reducing the number of variables to 10 – PROC VARCLUS
The correlation between our variables is high, as the results from PROC CORR show.
FIG2- PROC CORR-best 8- correlation between variables.
Consequently, we can reduce the number of variables. To reduce our 56 variables to a smaller number, we use PROC VARCLUS. This procedure groups our variables into a number of clusters, from which we select the variables that are most representative of their respective cluster and use them in our analysis. The procedure is closely related to principal component analysis: it finds groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with the variables in other clusters.
Rather than fixing the number of clusters, we set the threshold for identifying additional dimensions within a cluster to 0.8. From each cluster we chose the variable with the lowest 1-R**2 ratio, as it is the most representative of its cluster.
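The selection rule can be sketched in Python. The 1-R**2 ratio reported by PROC VARCLUS is (1 - R² with the variable's own cluster) divided by (1 - R² with the next closest cluster). This is an illustration only: the variable names, cluster numbers and R² values below are hypothetical and are not taken from the project output.

```python
def one_minus_r2_ratio(r2_own, r2_nearest):
    """(1 - R² with own cluster) / (1 - R² with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_nearest)

def pick_representatives(stats):
    """stats maps variable -> (cluster, r2_own, r2_nearest).
    Returns cluster -> variable with the lowest 1-R**2 ratio."""
    best = {}
    for var, (cluster, r2_own, r2_nearest) in stats.items():
        ratio = one_minus_r2_ratio(r2_own, r2_nearest)
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (var, ratio)
    return {cluster: var for cluster, (var, _) in best.items()}

# Hypothetical R² values for four variables in two clusters.
stats = {
    "VarA": (1, 0.90, 0.50),  # ratio about 0.20
    "VarB": (1, 0.80, 0.60),  # ratio about 0.50
    "VarC": (2, 0.70, 0.10),  # ratio about 0.33
    "VarD": (2, 0.95, 0.50),  # ratio about 0.10
}
print(pick_representatives(stats))  # {1: 'VarA', 2: 'VarD'}
```

A low ratio means the variable is strongly tied to its own cluster and weakly tied to the nearest other cluster, which is why it makes a good representative.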
We consequently chose: Normal7, Carcinoid4, Carcinoid19, Carcinoid18, Carcinoid6, Colon5, Colon12, Colon3, SmallCell3, SmallCell4, Colon10.
Table 1 Classification of variables in 10 distinct clusters
When we ran our clustering procedure on these variables, the criteria improved, but not enough for further analysis. The CCC indicates a weaker presence of outliers and thus a better chance of obtaining a satisfactory clustering. However, in the range where the CCC value allows for a clustering, the pseudo t-squared statistic suggests 15 as a good number of clusters, as this is where a surge is followed by a drop. This number is rather large for our data of 10 variables and is consequently difficult to interpret properly. We therefore continue our analysis on all the variables, after standardization.
FIG4. Clustering Criteria for the clusters obtained on 10 variables
3. Standardized variables – PROC STANDARD
Our third attempt consists of running PROC CLUSTER on the standardized data, with the outliers removed for the Normal and SmallCell variables.
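As a reminder of what standardization does to a single variable, here is a minimal Python sketch. It assumes that standardizing means rescaling to mean 0 and sample standard deviation 1; the exact options used in the project are not stated, and the function name and values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

print(standardize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```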
We are looking for a Cubic Clustering Criterion (CCC) greater than 0, together with local maxima of the pseudo F and pseudo t-squared statistics. As we do not observe a local spike in the pseudo F plot, we use the pseudo t-squared statistic as the criterion. There are several local spikes, but we only consider those at 11 clusters or more, as for the others the CCC is negative, indicating the presence of outliers. We choose 12 clusters, which equals K+1, K being the number of clusters at which the pseudo t-squared statistic reaches a local maximum.
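The decision rule can be written as a short Python sketch: look for a local maximum of the pseudo t-squared statistic at K clusters, require a positive CCC at that K, and propose K+1 clusters. The function name is hypothetical and the values below are invented to illustrate the rule; they are not the values from the project output.

```python
def suggest_cluster_counts(ccc, pseudo_t2):
    """ccc and pseudo_t2 map a number of clusters to the statistic's value.
    Returns K+1 for every K where pseudo t-squared has a local maximum
    and the CCC is positive."""
    suggestions = []
    for k in sorted(pseudo_t2):
        if k - 1 not in pseudo_t2 or k + 1 not in pseudo_t2:
            continue  # a peak needs a neighbour on both sides
        is_peak = pseudo_t2[k] > pseudo_t2[k - 1] and pseudo_t2[k] > pseudo_t2[k + 1]
        if is_peak and ccc.get(k, float("-inf")) > 0:
            suggestions.append(k + 1)
    return suggestions

# Invented values: peaks at K=9 (CCC negative, rejected) and K=11 (CCC positive).
pseudo_t2 = {8: 120.0, 9: 300.0, 10: 150.0, 11: 410.0, 12: 90.0, 13: 80.0, 14: 60.0}
ccc = {8: -4.0, 9: -2.5, 10: -1.0, 11: 0.8, 12: 1.5, 13: 2.0, 14: 2.4}
print(suggest_cluster_counts(ccc, pseudo_t2))  # [12]
```

Checking the CCC at K rather than at K+1 is one reading of the rule described above.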
FIG5. Clustering Criteria for the clusters standardize variables
The resulting classification is shown in the table below. The results are robust, as no class contains only a few observations.
FIG 6 – Final clusters
FIG 7 Dendrogram obtained from the cluster procedure
We want to study the characteristics of each cluster. To do so, we look at the classification obtained by the VARCLUS procedure and create 4 variables, which we use to highlight the differences between the clusters we obtained.
FIG 8 Characteristics of clusters found
Interpretation of cluster values
The clusters that include most of our observations are clusters 3, 4 and 11. Clusters 3 and 11 are distinguishable because they contain values closer to 0. We can interpret these observations as being less prone to a medical problem. Cluster 1 contains the fewest observations, but all of its variables display high values, indicating a set of individuals whose medical condition is worse than that of the other observations. Individuals from cluster 5 also exhibit a salient pattern, as their value for the Colon variable is greater than the Colon values of the other clusters. Carcinoid1 and Carcinoid2 also have strikingly low negative values. Within each cluster, Carcinoid1 and Carcinoid2 display somewhat similar values.