Page 1: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification


Application of a Selective Gaussian

Naïve Bayes Model for Diffuse-Large

B-Cell Lymphoma Classification

A. Cano, J. García, A. Masegosa, S. Moral

Dpt. Computer Science and Artificial Intelligence

University of Granada. Spain. E-mail: {acu,fjgc,andrew,smc}@decsai.ugr.es

Page 2: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Diffuse Large B-Cell Lymphoma

Diffuse Large B-Cell Lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, has long been enigmatic in that 40 per cent of patients can be cured by combination chemotherapy whereas the remainder succumb to the disease.

Alizadeh et al (2000) discovered, using gene expression profiling, that DLBCL actually comprises two different diseases that are indistinguishable by current diagnostic methods. One subtype of DLBCL, termed germinal center B-like DLBCL (GCB), has a high survival index, whereas the other, termed activated B-cell-like (ABC), has a low survival index.

The gene expression profiling was carried out with a specialized cDNA microarray, the 'lymphochip', which allows the expression of thousands of genes that are preferentially expressed in lymphoid cells to be quantified in parallel.

Page 3: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

cDNA Microarray

This is a typical image of fluorescence hybridization on a cDNA microarray.

Each row represents a separate cDNA clone on the microarray and each column a separate mRNA sample.

In this case, there are 96 samples and 4,096 cDNA clones.

These ratios are a measure of relative gene expression in each experimental sample.

As indicated, the scale extends from fluorescence ratios of 0.25 to 4 (-2 to +2 in log base 2 units). Grey indicates missing or excluded data.

In these data sets, there are many missing or excluded values.

Page 4: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Diffuse Large B-Cell Lymphoma Classification

After Alizadeh et al (2000), there have been three important approaches to the classification of Diffuse Large B-Cell Lymphoma:

– Rosenwald et al (2002): built a new database with more cases, 274 in total; Alizadeh et al had only 42 cases of Diffuse Large B-Cell Lymphoma.

– Wright et al (2003): found 27 genes to classify GCB versus ABC.

– Lossos et al (2004): found only 6 genes needed to estimate the survival index of a patient.

Page 5: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Rosenwald et al (2002)

Andreas Rosenwald et al. 2002. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England Journal of Medicine, 346:1937–1947, June.

Biopsy samples of diffuse large-B-cell lymphoma from 240 patients were examined for gene expression with the aid of DNA microarrays.

Using hierarchical clustering, a new subclass of DLBCL, called Type III, was found with an intermediate probability of survival.

They constructed a predictor of overall survival after chemotherapy based on a linear combination of four signatures. A signature is a group of biologically related genes.

Page 6: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification


Wright et al (2003)

Wright et al. 2003. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large b cell lymphoma. Proceedings of National Academy of Sciences of the United States of America, 100:9991–9996, August.

A Bayesian predictor is proposed that estimates the probability of membership in one of the two cancer subgroups (GCB or ABC), using the data set of Rosenwald et al (2002).

Gene Expression Data: http://llmpp.nih.gov/DLBCLpredictor

– 8503 genes.

– 134 cases of GCB, 83 cases of ABC and 57 cases of Type III.

Page 7: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wright et al. (2003)

DLBCL subgroup predictor:

– Linear Predictor Score: LPS(X) = Σ_j a_j X_j, where X = (X1, X2, ..., Xn) is the vector of gene expression values and a_j is the weight of gene j (its t statistic).

– Only the k genes with the most significant t statistic were used to form the LPS; the optimal k was determined by a leave-one-out method. A model including 27 genes had the lowest average error rate.

– Probability of belonging to the GCB subgroup:

P(GCB | X) = N(LPS(X); μ1, σ1) / [ N(LPS(X); μ1, σ1) + N(LPS(X); μ2, σ2) ]

where N(x; μ, σ) represents a Normal density function with mean μ and standard deviation σ, and (μ1, σ1), (μ2, σ2) are estimated from the LPS values of the two subgroups in the training data.

Training set: 67 GCB + 42 ABC. Validation set: 67 GCB + 41 ABC + 57 Type III.
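As a minimal illustration of how the subgroup probability follows from the LPS (our own sketch; the means, deviations and LPS value below are placeholders, not figures from the paper):

```python
from scipy.stats import norm

def p_gcb(lps, mu_gcb, sd_gcb, mu_abc, sd_abc):
    """Probability of GCB membership from a Linear Predictor Score,
    assuming equal priors for the two subgroups."""
    f_gcb = norm.pdf(lps, loc=mu_gcb, scale=sd_gcb)
    f_abc = norm.pdf(lps, loc=mu_abc, scale=sd_abc)
    return f_gcb / (f_gcb + f_abc)

# Hypothetical parameter values, for illustration only.
print(p_gcb(lps=1.2, mu_gcb=2.0, sd_gcb=1.0, mu_abc=-2.0, sd_abc=1.0))
```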

Page 8: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wright et al (2003)

This predictor uses a cutoff of 90% certainty. The samples for which there was <90% probability of belonging to either subgroup are termed 'unclassified'.

Results:

– 9.3% of the samples remain unclassified.

– If the predictor classifies a sample, it is correct 97.0% of the time.

Page 9: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Lossos et al (2004)

In this paper, the authors studied 36 genes that had been reported by experts to predict survival in DLBCL.

In a univariate analysis, genes were ranked on the basis of their ability to predict survival. With this ranking, they developed a multivariate model based on the expression of only six genes.

Finally, they showed that this model was sufficient to predict the survival of a patient with DLBCL.

Mortality-predictor score =
−0.0273 × LMO2 − 0.2103 × BCL6 − 0.1878 × FN1 + 0.0346 × CCND2 + 0.1888 × SCYA3 + 0.5527 × BCL2
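Transcribed directly as a function (the expression values passed in below are placeholders, for illustration only):

```python
def mortality_predictor_score(lmo2, bcl6, fn1, ccnd2, scya3, bcl2):
    """Six-gene mortality-predictor score of Lossos et al (2004)."""
    return (-0.0273 * lmo2 - 0.2103 * bcl6 - 0.1878 * fn1
            + 0.0346 * ccnd2 + 0.1888 * scya3 + 0.5527 * bcl2)

# Placeholder expression values; higher scores correspond to worse predicted outcome.
print(mortality_predictor_score(1.0, 0.5, -0.2, 0.3, 0.1, 0.8))
```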

Page 10: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Our Approach: Selective Gaussian Naïve Bayes

It is a modified wrapper method to construct an optimal Naïve Bayes classifier with a minimum number of predictive genes.

The main steps of the algorithm are:

– Step 1: Anova Phase. A filter method based on one-way analysis of variance (ANOVA). It selects the most significant genes that are not correlated with one another.

– Step 2: Wrapper Phase. A wrapper search method is applied to reduce the gene subset selected by the Anova Phase. It uses a Gaussian Naïve Bayes classifier. The wrapper algorithm is applied M times, so the output of this phase is a group of M feature subsets.

– Step 3: Abduction Phase. A single gene subset is selected from the group of candidates produced by the Wrapper Phase.

Page 11: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Anova Phase

We propose a filter method to significantly reduce the high number of genes (8,503) before applying a computationally hard process such as a wrapper search.

Firstly, the genes are ranked by their F-statistic. A gene with a high F-statistic has different expression values for each subtype of DLBCL; that is to say, it is a good gene for discriminating between the subtypes.

In addition, the correlation between genes is also taken into account. The correlation is calculated for each pair of genes in the two subclasses, and when two genes are correlated in any subclass, the one with the lowest F-statistic is eliminated.

Page 12: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Anova Phase: 'The Algorithm' (1)

1. The genes are ranked by their F-statistic: {G1, ..., Gn}.

2. The gene G1 is selected.

[Figure: the space of genes, with the selected gene G1 highlighted.]

Page 13: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Anova Phase: 'The Algorithm' (2)

3. The genes correlated (in either subclass) with G1 are identified and removed. G1 becomes a final selected gene.

[Figure: the space of genes, showing the cluster of genes correlated with G1 being removed and G1 marked as a final selected gene.]

Page 14: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Anova Phase: 'The Algorithm' (3)

4. The next gene in the ranking that has not been removed, Gj, is selected.

5. The genes correlated with Gj are identified and removed too.

[Figure: the space of genes, showing the previous final selected gene, the newly selected gene Gj and its cluster of correlated genes.]

Page 15: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Anova Phase: 'The Algorithm' (4)

6. Steps 4 and 5 are repeated until every gene has been either removed or included in the final subset of selected genes.

[Figure: the space of genes, with the final selected genes highlighted.]

This is the reduced final subset of genes selected by the Anova Phase. The initial data set, projected over this subset of genes, will be the working data set for the Wrapper Phase.
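A minimal sketch of this filter, assuming a samples × genes expression matrix X and binary class labels y (the function and variable names are ours, not from the paper). For simplicity the correlation test is a plain threshold on the Pearson coefficient; the paper instead uses the lower limit of a confidence interval, as described later.

```python
import numpy as np
from scipy.stats import f_oneway, pearsonr

def anova_filter(X, y, corr_threshold=0.5):
    """Sketch of the Anova Phase: rank genes by F-statistic, then greedily
    keep the best-ranked remaining gene and drop genes positively correlated
    with it in either subclass.  X: (n_samples, n_genes), y: binary labels."""
    f_stats = np.array([f_oneway(X[y == 0, j], X[y == 1, j])[0]
                        for j in range(X.shape[1])])
    ranking = list(np.argsort(-f_stats))        # genes ordered by decreasing F
    selected = []
    while ranking:
        g = ranking.pop(0)                      # best remaining gene
        selected.append(g)
        survivors = []
        for h in ranking:                       # drop genes correlated with g
            correlated = any(pearsonr(X[y == c, g], X[y == c, h])[0]
                             > corr_threshold for c in (0, 1))
            if not correlated:
                survivors.append(h)
        ranking = survivors
    return selected
```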

Page 16: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Anova Phase

There are several important properties of this filter method:

– The most informative genes that are not positively correlated among themselves are selected.

– Genes that are negatively correlated with each other are not eliminated; only positive correlation is treated as redundancy in the final data set.

– A gene with a low score is not eliminated if it is not positively correlated with the other genes. This does not happen in traditional filter methods based on gene ranking, where the least informative genes are eliminated directly.

Page 17: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wrapper Phase

In this phase a wrapper algorithm is applied to find a reduced gene subset from the one obtained in the Anova Phase.

The underlying classification model is a Naïve Bayes classifier with continuous variables. We assume these variables follow a Gaussian distribution given the class.

This phase uses a K-fold cross-validation (KFC) methodology. It is used not only to estimate the classifier accuracy, but also to obtain a group of candidate feature subsets for the final classification subset.

Page 18: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wrapper Phase

As indicated by the KFC methodology, the data set D is randomly partitioned into K disjoint subsets, {D1, ..., DK}. The training algorithm is then applied K times, each time to the subset Tk = D − Dk.

If the wrapper algorithm is applied to each subset Tk, an optimal feature subset Gk is obtained for each case, so at the end of the procedure K feature subsets are produced by the Wrapper Phase.

If this process is repeated m times (i.e., the random partition of the data set D into K subsets is redone m times), we obtain M = K × m feature subsets.
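A sketch of how these M = K × m candidate subsets could be produced (sklearn's KFold handles the random partition; wrapper_select stands for the forward-selection wrapper sketched a few slides below):

```python
from sklearn.model_selection import KFold

def candidate_subsets(X, y, wrapper_select, K=5, m=3, seed=0):
    """Run the wrapper on every training fold of m independent K-fold
    partitions, collecting M = K * m candidate gene subsets."""
    subsets = []
    for rep in range(m):
        kf = KFold(n_splits=K, shuffle=True, random_state=seed + rep)
        for train_idx, _ in kf.split(X):
            subsets.append(wrapper_select(X[train_idx], y[train_idx]))
    return subsets
```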

Page 19: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

5-fold cross-validation methodology in the Wrapper Phase

[Figure: the data set is randomly partitioned into 5 subsets. A training data set and a testing data set are defined, the wrapper algorithm is applied to the training data set, and an optimal gene subset is obtained (e.g., {G1, G4, G5, G7, G10}). A new training data set is then defined following the 5-fold-cross methodology, the wrapper algorithm is applied again, and another optimal gene subset is obtained (e.g., {G1, G4, G5, G6, G9}).]

Page 20: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wrapper Phase

At the end, 5 gene subsets are obtained in this Wrapper Phase:

– {G1, G4, G5, G7, G10}

– {G1, G4, G5, G6, G9}

– {G2, G4, G5, G7, G11}

– {G1, G4, G2, G8, G9}

– {G1, G4, G2, G6, G9}

If this process is repeated (with a new random partition of the data set into 5 subsets), 5 new gene subsets will be obtained. Therefore, if the process is repeated m times, we obtain M = 5 × m gene subsets.

Page 21: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Gaussian Naïve Bayes Classifier

It is a simple classifier introduced by Langley et al (1992).

This classifier model makes the following assumptions:

– Each gene is independent of the other genes given the class.

– Each gene is considered a continuous random variable that follows a Normal distribution given the class.

An important advantage of this classifier is that it can work with missing values in the data set.

[Figure: Naïve Bayes classifier structure — class node C with gene nodes G1, G2, G3, G4 as children.]
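A minimal sketch of such a classifier (our own illustration, not the authors' code): per-class Gaussian parameters are estimated gene by gene, and missing entries (NaN) are simply skipped in the log-likelihood sum, which is what makes missing data unproblematic for this model.

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes with per-class Gaussian densities; NaN entries are ignored."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.stds_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = Xc.shape[0] / X.shape[0]
            self.means_[c] = np.nanmean(Xc, axis=0)          # ignore missing values
            self.stds_[c] = np.nanstd(Xc, axis=0) + 1e-9
        return self

    def predict_proba(self, X):
        log_post = np.zeros((X.shape[0], len(self.classes_)))
        for k, c in enumerate(self.classes_):
            z = (X - self.means_[c]) / self.stds_[c]
            log_pdf = -0.5 * z**2 - np.log(self.stds_[c]) - 0.5 * np.log(2 * np.pi)
            log_pdf = np.where(np.isnan(X), 0.0, log_pdf)     # skip missing genes
            log_post[:, k] = np.log(self.priors_[c]) + log_pdf.sum(axis=1)
        log_post -= log_post.max(axis=1, keepdims=True)       # normalise safely
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)
```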

Page 22: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

The Wrapper Algorithm

A wrapper algorithm proceeds in stages; in each stage, a single feature subset is obtained.

A simple wrapper search algorithm is implemented with the following parameters:

– Search Strategy: sequential (forward) selection.

– Initial Subset: the empty set.

– Evaluation Function: accuracy of the Gaussian Naïve Bayes classifier built on the feature subset selected at this stage. This accuracy is again calculated using K-fold cross-validation.

The wrapper algorithm begins with an empty set of features and, at each stage, sequentially adds the feature that maximizes the evaluation function.

When several features maximize the evaluation function, the one with the highest F-statistic is selected.
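A sketch of this forward-selection loop under those parameters, using sklearn's GaussianNB and cross-validated accuracy (the function name and the max_genes cap are ours; the F-statistic tie-breaking and the heuristic stop condition of the next slides are omitted for brevity, with a simple no-improvement rule used instead):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def wrapper_select(X, y, K=10, max_genes=15):
    """Forward selection of genes, scored by K-fold CV accuracy of a
    Gaussian Naive Bayes classifier; stops when accuracy stops improving."""
    selected, remaining = [], list(range(X.shape[1]))
    best_acc = 0.0
    while remaining and len(selected) < max_genes:
        scores = {g: cross_val_score(GaussianNB(), X[:, selected + [g]], y,
                                     cv=K).mean() for g in remaining}
        g_best = max(scores, key=scores.get)
        if scores[g_best] <= best_acc:      # simplistic stop rule (see next slides)
            break
        best_acc = scores[g_best]
        selected.append(g_best)
        remaining.remove(g_best)
    return selected
```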

Page 23: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wrapper Algorithm: Stop Condition

Due to the low number of cases in the training data set (about 100 cases in our case), it is much more likely than in traditional data sets with many cases that two similar feature subsets have exactly the same accuracy (each misclassified case changes the accuracy by roughly 1% in our case). It is therefore reasonable to expect the error rate to evolve in a very discontinuous way and to decrease slowly.

The typical evolution of the error rate is shown in the following graphic:

[Figure: typical evolution of the error rate (y-axis, 0 to 0.6) over the first 15 stages of the wrapper algorithm (x-axis).]

Page 24: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Wrapper Algorithm: Stop Condition

General stop conditions (notation: p is the stage of the algorithm, r_p is the current error rate, and ∆r = r_p − r_{p−1} is the error-rate increment):

– Stop if ∆r ≥ 0 (stop if there is no improvement):

• Early stopping. It is hard to get an improvement at every stage. (In the preceding example, this condition stops at stage number 4.)

– Stop if ∆r > 0 (stop if there is a deterioration):

• There is an overfitting problem, because many unnecessary features are introduced. (In the preceding example, this condition stops at stage number 12.)

The heuristic stop condition implemented combines both goals: avoid overfitting OR avoid early stopping.

[Figure: evolution of the error rate r and of the increment ∆r over the first 15 stages, with the stages where each condition evaluates to True or False marked.]
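The precise heuristic is not legible in this transcript. Purely as an illustration of how a rule combining "avoid overfitting" with "avoid early stopping" might look, here is one hypothetical variant (the patience and tolerance parameters are our own assumptions, not the authors'):

```python
def should_stop(error_history, patience=3, tolerance=0.02):
    """Hypothetical combined stop rule: do not stop on a single non-improving
    stage (avoids early stopping), but stop once the error rate has failed to
    reach a new minimum for `patience` stages or has risen clearly above the
    best value seen so far (avoids overfitting)."""
    best = min(error_history)
    last = error_history[-1]
    stages_since_best = len(error_history) - 1 - error_history.index(best)
    return stages_since_best >= patience or last > best + tolerance
```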

Page 25: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Abduction Phase

In the previous Wrapper Phase, M feature subsets were obtained as candidates for the final classification subset.

The procedure to select a single feature subset is based on abductive inference over a Bayesian network (BN) that encodes the joint distribution of which features are selected by the wrapper algorithm.

Firstly, a BN is learned from the M feature subsets selected in the Wrapper Phase, {F1, ..., FM}, following this procedure:

– Let Φ = {G1, ..., Gp} be the set of all features appearing in the subsets F1, ..., FM. For each Gi, we define a new discrete variable Yi with two states, {absent, present}.

– Learn a BN with the K2 learning algorithm.

Page 26: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Abduction Phase: An Example

Let

– F1 = {G1, G3, G7, G8}, F2 = {G2, G5, G7, G8},

– F3 = {G1, G3, G2, G5}, F4 = {G1, G9, G10}, F5 = {G1, G3, G7, G8, G10}.

Then Φ = {G1, G2, G3, G5, G7, G8, G9, G10}, so we create the set of variables {Y1, Y2, Y3, Y5, Y7, Y8, Y9, Y10}.

The cases are then built as follows: one case per subset Fi, with Yj = present if Gj belongs to Fi and Yj = absent otherwise.

With this data set, a BN is learned by the K2 algorithm. Secondly, an abduction algorithm is applied over the BN to obtain its two most probable configurations.
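A small sketch of how this binary data set could be assembled from the example subsets (pandas is used only for display; the K2 learning of the BN itself is not shown):

```python
import pandas as pd

subsets = [{"G1", "G3", "G7", "G8"},
           {"G2", "G5", "G7", "G8"},
           {"G1", "G3", "G2", "G5"},
           {"G1", "G9", "G10"},
           {"G1", "G3", "G7", "G8", "G10"}]

phi = sorted(set().union(*subsets), key=lambda g: int(g[1:]))
cases = pd.DataFrame(
    [{f"Y{g[1:]}": ("present" if g in f else "absent") for g in phi}
     for f in subsets])
print(cases)   # one row per wrapper run, one column per Y variable
```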

Page 27: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Abduction Phase

Second, an abduction algorithm is applied to obtain the T most probable configurations of the BN: {y*1, ..., y*T}.

For each configuration y*i = (y1, ..., yp), we obtain a new feature subset Hi containing the variables whose state is 'present'.

At the end, we have the T subsets most probable to have been selected by the Wrapper Phase, {H1, ..., HT}.

The final classification subset is the one that minimizes the average negative log-likelihood of the class over the complete training data set, using a Gaussian Naïve Bayes model and a leave-one-out methodology (see the sketch below).
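A sketch of this final selection step with sklearn (for brevity, sklearn's GaussianNB is used, which does not handle missing values; candidate_subsets is assumed to be a list of lists of gene column indices):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import LeaveOneOut

def neg_log_likelihood(X, y, genes):
    """Average -log P(true class | sample) under leave-one-out."""
    losses = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = GaussianNB().fit(X[train_idx][:, genes], y[train_idx])
        proba = clf.predict_proba(X[test_idx][:, genes])[0]
        true_col = list(clf.classes_).index(y[test_idx][0])
        losses.append(-np.log(proba[true_col] + 1e-12))
    return np.mean(losses)

def best_subset(X, y, candidate_subsets):
    return min(candidate_subsets, key=lambda genes: neg_log_likelihood(X, y, genes))
```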

Page 28: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Abduction Phase: An Example

Suppose, for example, that we obtain these two configurations as the most probable ones:

[Table: the two most probable configurations of the Y variables, each set to 'present' or 'absent'.]

From these two configurations, we obtain the two feature subsets most probable to be selected by the wrapper algorithm. In this case they are:

– H1 = {G1, G3, G7, G8}

– H2 = {G1, G2, G3, G5, G7, G8}

Page 29: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

DLBCL Classification

The data set used in this work is the one from Rosenwald et al (2002). This data set contains 8,503 genes and the following cases:

– 134 cases of GCB.

– 83 cases of ABC.

– 57 cases of Type III.

The data set was randomly divided into a training and a testing group in the following way:

– Training Data Set: 67 cases of GCB + 42 cases of ABC.

– Testing Data Set: 67 GCB + 41 ABC + 57 Type III cases.

The classifier accuracy estimate is obtained by evaluating the classifier on 10 different random divisions of the data.

Page 30: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

DLBCL Classification

The parameters for the implementation of the three phases are:

– A gene X is considered correlated with a gene Y if the lower limit of the confidence interval for their correlation coefficient is greater than 0.15 (see the sketch after this list).

– The K-fold cross-validation procedures were carried out with K = 10.

– M = 3 × 10 = 30 candidate feature subsets were obtained in the Wrapper Phase.

– The 20 most probable explanations were evaluated in the Abduction Phase.

– Samples with a probability lower than 80% of belonging to either subgroup were termed 'Unclassified'.
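One standard way to check that criterion is the Fisher z-transformation confidence interval for a Pearson coefficient; the sketch below assumes that interval, since the slides do not specify which one was used:

```python
import numpy as np
from scipy.stats import pearsonr, norm

def correlated(x, y, lower_bound=0.15, alpha=0.05):
    """True if the lower limit of the (1 - alpha) Fisher-z confidence interval
    for the Pearson correlation of x and y exceeds `lower_bound`."""
    r, _ = pearsonr(x, y)
    n = len(x)
    z = np.arctanh(r)                        # Fisher transformation
    se = 1.0 / np.sqrt(n - 3)
    z_low = z - norm.ppf(1 - alpha / 2) * se
    return np.tanh(z_low) > lower_bound
```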

Page 31: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Results I

Anova Phase (95% confidence intervals):

– Size (number of genes): [74.3, 83.1]

– Train accuracy rate (%): [96.8, 98.6]

– Test accuracy rate (%): [92.8, 95.4]

– Test −log likelihood: [0.38, 0.68]

– Type III Test −log likelihood: 'Infinity'

[Table: model predictions versus true DLBCL subgroup, for the training set and the validation set.]

Page 32: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Results I: Conclusions

The Anova Phase selects only about 1% of all genes (78 of the 8,503 genes), and the Gaussian Naïve Bayes classifier achieves 94.1% accuracy with these genes.

In this case, there are several Type III cases to which the classifier assigns extreme probabilities.

The classifier leaves only 2.5% of cases unclassified. If the predictor classifies a sample, it is correct in 95.4% of the cases.

Page 33: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Results II

Anova Phase + Search (Wrapper) Phase (95% confidence intervals):

– Size (number of genes): [6.17, 7.82]

– Train accuracy rate (%): [95.2, 98.0]

– Test accuracy rate (%): [88.83, 91.9]

– Test −log likelihood: [0.25, 0.37]

– Type III Test −log likelihood: [4.12, 5.08]

[Table: model predictions versus true DLBCL subgroup, for the training set and the validation set.]

Page 34: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Results II: Conclusions

The number of genes is reduced to about 10% of the previous set, from 78 to 7 genes.

The average −log likelihood is reduced to about half of the one obtained in the Anova Phase. Therefore, this is a better classifier.

For the Type III class, the predictor does not assign full probability of belonging to one class to any of its samples.

The classifier leaves 9.1% of cases unclassified. If the predictor classifies a sample, it is correct in 93.2% of the cases.

In Wright et al., 9.2% of cases are unclassified, and if the predictor classifies a sample, it is correct in 96.9% of the cases. But that evaluation is done on a single partition of the data set, so this percentage is not very reliable. In addition, some evaluations of our classifier achieved better accuracy.

On the other hand, we substantially reduced the number of genes in all cases: we obtain similar results with 7 genes, compared with the 27 genes of Wright et al.

Page 35: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Conclusions

We obtain a simple classification method that provides good results.

We have developed a new feature subset selection method that is very robust for data sets with many more features than cases.

The use of an abduction process provides several candidate subsets that summarize the distinct runs of the wrapper algorithm.

Three genes (LMO2, BCL6 and CCND2) are selected several times by our classification model and coincide with genes among the six selected by Lossos et al., who used medical information to predict survival in DLBCL.

Page 36: Application of a Selective Gaussian Naïve Bayes Model for Diffuse-Large B-Cell Lymphoma Classification

Future Work

Develop more sophisticated models:

– Include replacement variables to manage missing data.

– Consider multidimensional Gaussian distributions.

– Improve the MTE Gaussian Naïve Bayes model.

Apply this model to other data sets, such as breast cancer, colon cancer, etc.

Compare it with other models based on discrete variables.