CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
According to the American Cancer Society (2015), Chronic Myeloid Leukaemia (CML) is a type of cancer that affects the blood cells. The body is
made up of trillions of living cells. Normal body cells grow, divide to make new
cells, and die in an orderly way. During the early years of a person's life, normal cells divide only to replace worn-out, damaged, or dying cells. Cancer begins when cells
in a part of a body start to grow out of control (Cortes et al., 2011). There are many
kinds of cancer, but they all start because of this out-of-control growth of abnormal
cells. Cancer growth is different from normal cell growth. Instead of dying, cancer
cells keep on growing and form new cancer cells. These cancer cells can grow into
(invade) other tissues, something that normal cells cannot do (Cortes et al., 2012).
Being able to grow out of control and invade other tissues is what makes a cell a
cancer cell. In most cases, the cancer cells form a tumor. But some cancers, like
Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone
marrow. When cancer cells get into the bloodstream or lymph vessels, they can travel
to other parts of the body (Kantarjian and Cortes, 2008). There they begin to grow
and form new tumors that replace normal tissues. This process is called metastasis.
Leukaemia is a type of cancer that starts in cells that form new blood cells.
These cells are found in the soft, inner part of the bones called bone marrow. Chronic
myeloid Leukaemia (CML), also known as chronic myelogenous Leukaemia, is a
fairly slow growing cancer that starts in the bone marrow. It is a type of cancer that
affects the myeloid cells – cells that form blood cells, such as red blood cells,
platelets, and many types of white blood cells. In CML, Leukaemia cells tend to build
up in the body over time. In many cases, people do not have any symptoms for at
least a few years. CML can also change into a fast-growing, acute Leukaemia that
invades almost any organ in the body. Most cases of CML occur in adults, but it is
also very rarely found in children. As a rule, their treatment is the same as that for
adults.
According to Durosinmi et al. (2008), chronic myeloid Leukaemia (CML) has an annual worldwide incidence of 1/100,000 with a male-to-female ratio of 1.5:1. The median age of the disease incidence is about 60 years (Deininger and Druker, 2003).
In Nigeria and other African countries with similar demographic pattern, the median
age of the occurrence of CML is 38 years (Boma et al., 2006; Okanny et al., 1989).
In the United States of America (USA) however, the incidence of CML in the age
group under 70 years is higher among the African-Americans than among any other
racial/ethnic groups (Groves et al., 1995). It is probable that a combination of
environment and as yet unknown biological factors may account for the differential
age incidence pattern of CML between the Blacks and other races in the USA.
According to Oyekunle et al. (2012a), pediatric CML is rare, accounting for less than
10% of all cases of CML and less than 3% of all pediatric Leukaemias. Incidence increases with age: it is exceptionally rare in infancy, at about 0.7 per million/year at ages 1-14 years, rising to 1.2 per million/year in adolescents worldwide (Lee and Chung, 2011).
To date, only allogeneic stem cell transplantation (SCT) remains curative for chronic myeloid leukaemia (Robin et al., 2005), though its role has waned significantly in recent times due to the effectiveness of the tyrosine kinase inhibitors (TKIs) (Oyekunle et al., 2011; Oyekunle et al., 2012b). Although potentially curative,
SCT is associated with significant morbidity and mortality (Gratwohl et al., 1998).
Alpha interferon-based regimens adequately control the chronic phase of the disease,
but result in few long-term survivors (Bonifazi et al., 2001). Advances in targeted therapy resulted in the discovery of Imatinib mesylate, a selective competitive inhibitor of the BCR-ABL protein tyrosine kinase, which has been demonstrated to induce
both hematologic and cytogenetic remission in a significant proportion of CML
patients (Kantarjian et al., 2002). A number of prognostic scoring systems have been
developed for patients with CML, of which Sokal and Hasford (or Euro) scores are
most popular (Gratwohl et al., 2006). The Sokal score was generated using chronic
phase CML patients treated with busulphan or hydroxyurea (Sokal et al., 1984), while
the Hasford score was derived and validated, using patients treated with Interferon-
alpha (Hasford et al., 1998).
Survival Analysis deals with the application of methods to estimate the
likelihood of an event (death, survival, decay, child-birth etc.) occurring over a
variable time period (Dimitologlou et al., 2012); in short, it is concerned with
studying the time between entry to a study and a subsequent event (such as death).
The traditional statistical methods applied in the area of survival analysis include the Kaplan-Meier (KM) estimator (Kaplan and Meier, 1958) and the Cox proportional hazards (PH) model (Cox, 1972). The Kaplan-Meier method is a non-parametric estimator of the proportion of a population that survives a given length of time under the same circumstances, while the semi-parametric Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables.
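As an illustration, the sketch below computes the Kaplan-Meier product-limit estimate directly; the follow-up times (in months) and event indicators are hypothetical placeholders, not data from this study.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of S(t) from follow-up times (months)
    and event indicators (1 = death observed, 0 = censored)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    n_at_risk = len(times)
    survival = 1.0
    curve = []
    for t in np.unique(times):
        at_t = times == t
        deaths = events[at_t].sum()
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk  # step down at each death time
            curve.append((t, survival))
        n_at_risk -= at_t.sum()                   # deaths and censorings leave the risk set
    return curve

# Hypothetical follow-up data: months observed and event flags.
follow_up = [6, 12, 12, 18, 24, 30, 36, 60]
died      = [0,  1,  0,  1,  0,  1,  0,  0]
for t, s in kaplan_meier(follow_up, died):
    print(f"S({t:.0f} months) = {s:.3f}")
```

Before the advent of Imatinib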
as a treatment option for Chronic Myeloid Leukaemia (CML), the median survival time for CML was 3 - 5 years from the time of diagnosis of the disease (Hosseini and Ahmadi, 2013). According to Gambacorti-Passerini et al. (2011), a follow-up of
832 patients using Imatinib showed an overall survival rate of 95.2% after 8 years. A
10-year follow-up of 527 patients in Nigeria undergoing Imatinib treatment showed
an overall survival rate of 92% and 78% after 2 and 5 years respectively (Oyekunle et
al., 2013).
Machine learning (ML) is a branch of artificial intelligence that allows
computers to learn from past examples of data records (Quinlan, 1986; Cruz and
Wishart, 2006). Unlike traditional explanatory statistical modeling techniques,
machine learning does not rely on a prior hypothesis (Waijee et al., 2013a). Machine learning has found great importance in the area of predictive modeling in medical research, especially in the areas of risk assessment, survival prediction and recurrence prediction. Machine learning techniques can be broadly classified into supervised and unsupervised techniques; the former involves matching a set of input records to one of two or more target classes, while the latter is used to create clusters or attribute relationships from raw, unlabeled or unclassified datasets (Mitchell, 1997).
Supervised machine learning algorithms can be used in the development of
classification or regression models. A classification model is a supervised approach that allocates a set of input records to a discrete target class, unlike regression, which maps a set of records to a real value. This research focuses on using classification models to classify patients' survival as either survived or not survived.
Feature selection methods are unsupervised machine learning techniques used to identify the relevant attributes in a dataset. They are important for identifying irrelevant and redundant attributes, which may increase computational complexity and time (Yildirim, 2015; Hall, 1999). Feature selection methods are broadly classified as filter-based, wrapper-based and embedded methods. Filter-based methods were chosen for this study because of their ability to identify relevant attributes with respect to the target class (CML patient survival), unlike wrapper-based methods, which rely on the performance of a machine learning algorithm. Filter-based feature selection methods were used to identify the most relevant variables predictive of CML patient survival from the variables monitored during the follow-up of Imatinib treatment administered to Nigerian CML patients. The relevant features proposed using feature selection were used to formulate the predictive model for CML patients' survival classification using supervised machine learning techniques.
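As a hedged illustration of the filter-based approach, the sketch below scores hypothetical follow-up attributes against a binary survival class using scikit-learn; the attribute names and data are invented placeholders, not the variables elicited in this study.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical follow-up variables, for illustration only.
feature_names = ["age", "spleen_size_cm", "wbc_count", "platelet_count",
                 "basophils_pct", "blasts_pct", "haematocrit_pct"]
rng = np.random.default_rng(0)
X = rng.normal(size=(120, len(feature_names)))   # stand-in patient records
y = rng.integers(0, 2, size=120)                 # survived / not survived

# Filter-based selection scores each attribute against the target class
# directly, without consulting any downstream classifier.
selector = SelectKBest(score_func=mutual_info_classif, k=4).fit(X, y)
for name, score, kept in zip(feature_names, selector.scores_,
                             selector.get_support()):
    print(f"{name:16s} score={score:.3f} selected={kept}")
```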
1.2 Statement of Research Problem
Chronic Myeloid Leukaemia (CML) is a serious disease among Nigerians, yet there is just one government referral hospital in Nigeria that administers Imatinib treatment, with a limited number of experts compared to the number of cases attended to. In Nigeria, hematologists rely on scoring models proposed using datasets belonging to Caucasian (white race) and/or non-African CML patients undergoing treatment before the Imatinib era (e.g. the Sokal score used busulphan- or hydroxyurea-treated patients, the Hasford score used interferon-alfa, and the EUTOS score derives from the European Treatment and Outcome Study). These models have been deemed ineffective for Nigerian CML patients who are undergoing Imatinib treatment, and as such there is presently no existing predictive
model in Nigeria specifically for the survival of CML patients undergoing Imatinib
treatment. There is a need for a predictive model which will aid clinical decisions
concerning continuing treatment or alternative action affecting the survival of CML
patients receiving Imatinib treatment, hence this study.
1.3 Aim and Objectives of the Study
The aim of this research is to develop a predictive model which identifies the relevant
attributes required for classifying the survival of Chronic Myeloid Leukaemia patients
receiving Imatinib treatment in Nigeria using machine learning techniques. The
specific research objectives are to:
i. elicit knowledge on the variables monitored during the follow-up of
Imatinib treatment;
ii. propose the variables predictive for CML survival from (i) and use them to
formulate the predictive model;
iii. simulate the predictive model formulated in (ii); and
iv. validate the model in (iii) using historical data.
1.4 Research Methodology
In order to achieve the above listed objectives, the methodological approach for this
study was performed using the following methods.
A formal interview was conducted with two (2) hematologists to identify the parameters used to monitor survival, and anonymized, validated information about patients was collected.
Filter-based feature selection methods were used to identify the most relevant variables (prognostic factors) predictive of survival from the variables identified, after which the predictive model was formulated using supervised machine learning algorithms.
The formulated model was simulated using the Explorer interface of the Waikato Environment for Knowledge Analysis (WEKA) software, a lightweight Java-based suite of machine learning tools, using its preprocess, classify and select-attributes panels.
The collected historical data was used to validate the performance of the
model by determining the confusion matrix, recall, precision, accuracy and the area
under the Receiver Operating Characteristics (ROC) curve.
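A minimal sketch of how these validation metrics can be computed, here with scikit-learn on made-up hold-out labels and scores (WEKA reports the same quantities in its Classify output):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical hold-out labels (1 = survived) and model outputs.
y_true  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.91, 0.12, 0.78, 0.45, 0.30, 0.84, 0.58, 0.66, 0.72, 0.25]

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))  # area under the ROC curve
```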
1.5 Research Justification
The Nigerian Health sector has set ambitious targets for providing essential
health services to all citizens; improving the quality of decisions affecting treatment
options is very essential to reducing disease mortality rates in Nigeria. Predictive
models for Chronic Myeloid Leukaemia (CML) survival classification can help
identify the most relevant variables for patient survival and thus allow physicians to concentrate on a smaller number of important variables during clinical observations.
1.6 Scope and Limitations of the Study
This study is limited to the classification of 2-year and 5-year survival of Nigerian Chronic Myeloid Leukaemia (CML) patients receiving follow-up for Imatinib treatment at Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), Ile-Ife, Osun State. Also, the dataset used for this study was based on information collected from a single centre and a relatively limited number of CML patients.
1.7 Organization of Thesis
The first chapter of this thesis has been presented and the organization of the
remaining chapters is discussed in the following paragraphs.
Chapter two contains the Literature Review which consists of an introduction
to chronic myeloid Leukaemia (CML), its etiology, treatment and distribution around
the world, in Africa and in Nigeria; survival analysis and the existing stochastic methods
(Kaplan-Meier and the Cox proportional hazard models); Machine learning –
supervised, unsupervised and application of machine learning in healthcare; Feature
selection methods; Existing survival models and related works.
Chapter three contains the Research Methodology which consists of the
research framework, data collection methods, data identification and variable
description, feature selection results, model formulation methods – supervised
machine learning algorithms proposed, model simulation and the performance
evaluation metrics to be used.
Chapter four contains the Results and Discussion, which consists of the descriptive statistics of the data collected from the referral hospital, feature selection results and discussions, simulation results and discussions, and the performance evaluation of the machine learning algorithms used.
Chapter five contains the summary, conclusion, recommendations and the
possible future works of the study.
CHAPTER TWO
LITERATURE REVIEW
2.1 Chronic Myeloid Leukaemia (CML)
According to DeAngelo and Ritz (2004), chronic myeloid leukaemia (CML) is
a clonal hematopoietic disorder characterized by the reciprocal translocation
involving chromosomes 9 and 22. As a result of this translocation, a novel fusion
gene, BCR-ABL is created and constitutive activity of this tyrosine kinase plays a
central role in the pathogenesis of the disease process. Cancer begins when cells in a
part of the body start to grow out of control. There are many kinds of cancer, but they
all start because of this out-of-control growth of abnormal cells (NCCN, 2014).
Cancer cell growth is different from normal cell growth. Instead of dying, cancer
cells keep on growing and form new ones. These cancer cells can grow into (invade)
other tissues, something that normal cells cannot do (National Cancer Institute NCI,
2011). Being able to grow out of control and invade other tissues is what makes a cell
a cancer cell. In most cases the cancer cells form a tumor. But some cancers, like
Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone
marrow (see Figure 2.1 for family tree of blood cells).
Leukaemia is a group of cancers that usually begins in the bone marrow and results in high numbers of abnormal white blood cells (NCI, 2011). These white blood
cells are not fully developed and are called blasts or Leukaemia cells. Symptoms may
include bleeding and bruising problems, feeling very tired, fever and an increased risk
of infections.
Figure 2.1: Family Tree of Blood Cells
(Source: NCCN, 2014)
These symptoms occur due to a lack of normal blood cells. Diagnosis is typically by
blood tests or bone marrow biopsy. Clinically and pathologically, Leukaemia is
subdivided into a variety of large groups.
The first division is between its acute and chronic forms:
Acute Leukaemia is characterized by a rapid increase in the number of
immature blood cells (Locatelli and Niemeyer, 2015). Crowding due to such cells
makes the bone marrow unable to produce healthy blood cells. Immediate treatment is
required in acute Leukaemia due to the rapid progression and accumulation of the
malignant cells, which then spill over into the bloodstream and spread to other organs
of the body (Dohner et al., 2015). Acute forms of Leukaemia are the most common
forms of Leukaemia in children.
Chronic Leukaemia is characterized by the excessive buildup of relatively
mature, but still abnormal, blood cells. Typically taking months or years to progress,
the cells are produced at a much higher rate than normal, resulting in many abnormal
white blood cells (Shen et al., 2007). Whereas acute Leukaemia must be treated
immediately, chronic forms are sometimes monitored for some time before treatment
to ensure maximum effectiveness of therapy (Provan and Gribben, 2010). Chronic
Leukaemia mostly occurs in older people, but can theoretically occur in any age
group.
Additionally, the diseases are subdivided according to which kind of blood
cell is affected (Hira et al., 2014). This split divides Leukaemias into lymphoblastic or
lymphocytic Leukaemias and myeloid or myelogenous Leukaemias (Table 2.1):
Table 2.1: The Four Major Kinds of Leukaemia

- Lymphocytic (or lymphoblastic) Leukaemia: acute form, Acute lymphoblastic Leukaemia (ALL); chronic form, Chronic lymphocytic Leukaemia (CLL)
- Myelogenous (myeloid or nonlymphocytic) Leukaemia: acute form, Acute myelogenous Leukaemia (AML or myeloblastic); chronic form, Chronic myelogenous Leukaemia (CML)

(Source: Hira et al., 2014)
In lymphoblastic or lymphocytic Leukaemias, the cancerous change takes
place in a type of marrow cell that normally goes on to form lymphocytes, which are
infection-fighting immune system cells. Most lymphocytic Leukaemias involve a
specific subtype of lymphocyte, the B cell. In myeloid or myelogenous Leukaemias,
the cancerous change takes place in a type of marrow cell that normally goes on to
form red blood cells, some other types of white cells, and platelets.
Chronic myelogenous (or myeloid or myelocytic) Leukaemia (CML), also
known as chronic granulocytic Leukaemia (CGL), is a cancer of the white blood cells.
It is a form of Leukaemia characterized by increased and unregulated growth of
predominantly myeloid cells in the bone marrow and the accumulation of these cells
in the blood. CML is a clonal bone marrow stem cell disorder in which a proliferation
of mature granulocytes (neutrophils, eosinophils and basophils) and their precursors is
found. It is a type of myeloproliferative disease associated with a characteristic
chromosomal translocation called the Philadelphia chromosome (Figure 2.2).
Chronic myeloid Leukaemia (CML) is defined by the presence of the Philadelphia
chromosome (Ph) which arises from the reciprocal translocation of the ABL1 and BCR genes on chromosomes 9 and 22 respectively (Oyekunle et al., 2012c).
CML is characterized by the proliferation of a malignant clone containing the
BCR-ABL1 mutant fusion gene resulting in myeloid hyperplasia and peripheral blood
leucocytosis and thrombocytosis. It is believed that pediatric CML is rare, accounting
for less than 10% of all cases of CML and less than 3% of all pediatric Leukaemias
(Lee and Chung, 2011). Incidence increases with age: it is exceptionally rare in infancy, at about 0.7 per million/year at ages 1-14 years, rising to 1.2 per million/year in adolescents (National Cancer Institute (NCI), 2011).
Figure 2.2: Philadelphia Chromosome and BCR-ABL gene
Generally, children are diagnosed at a median age of 11 – 12 years (range, 1 –
18 years) with approximately 10% presenting in advanced phases (Suttorp and Millot,
2010).
2.1.1 CML diagnosis
According to the National Comprehensive Cancer Network (NCCN)
Guideline for Patients on CML (2014), in order to diagnose Chronic Myeloid
Leukaemia (CML), doctors use a variety of tests to analyze the blood and marrow
cells. This is because there are no special tests used in diagnosing CML. The best
form of diagnosis is the early report of symptoms. The following are a number of
tests useful in diagnosing CML in patients.
a. Complete Blood Count (CBC)
This test is used to measure the number and types of cells in the blood.
According to Tefferi et al. (2005), people with CML often have: a decreased hemoglobin concentration; an increased white blood cell count, often to very high levels; and a possible increase or decrease in the number of platelets, depending on the severity of the person's CML. Blood cells are stained (dyed) and examined with a light microscope. These samples show a specific pattern of white blood cells: a small proportion of immature cells (leukemic blast cells and promyelocytes) and a larger proportion of maturing and fully matured white blood cells (myelocytes and neutrophils). These blast cells, promyelocytes and myelocytes are normally not present in the blood of healthy individuals.
b. Bone Marrow Aspiration and Biopsy
These tests are used to examine marrow cells to find abnormalities and are
generally done at the same time (Raanani et al., 2005). The sample is usually taken from the patient's hip bone after medicine has been given to numb the skin (Figure
2.3). For a bone marrow aspiration, a special needle is inserted through the hip bone
and into the marrow to remove a liquid sample of cells. For a bone marrow biopsy, a
special needle is used to remove a core sample of bone that contains marrow. Both
samples are examined under a microscope to look for chromosomal and other cell
changes (Vardiman et al., 2001).
c. Cytogenetic Analysis
This test measures the number and structure of the chromosomes. Samples
from the bone marrow are examined to confirm the blood test findings and to see if
there are chromosomal changes or abnormalities, such as the Philadelphia (Ph)
chromosome (Cortes et al., 1995). The presence of the Ph chromosome (the shortened
chromosome 22) in the marrow cells, along with a high white blood cell count and
other characteristic blood and marrow test findings, confirms the diagnosis of CML.
The bone marrow cells of some people with CML have a Ph chromosome detectable
by cytogenetic analysis (Aurich et al., 1998). A small percentage of people with
clinical signs of CML do not have cytogenetically detectable Ph chromosome, but
they almost always test positive for the BCR-ABL fusion gene on chromosome 22
with other types of tests.
d. FISH (Fluorescence In Situ Hybridization)
FISH is a more sensitive method for detecting CML than the standard
cytogenetic tests that identify the Ph chromosome (Mark et al., 2006; Tkachuk et al.,
1990). FISH is a quantitative test that can identify the presence of the BCR-ABL gene
(Figure 2.4). Genes are made up of DNA segments. FISH uses color probes that bind
to DNA to locate the BCR and ABL genes in chromosomes. Both BCR and ABL
genes are labeled with chemicals each of which releases a different color of light
(Landstrom and Tefferi, 2006).
Figure 2.3: Bone marrow biopsy
Figure 2.4: Identifying the BCR-ABL Gene Using FISH
The color shows up on the chromosome that contains the gene— normally
chromosome 9 for ABL and chromosome 22 for BCR—so FISH can detect the piece
of chromosome 9 that has moved to chromosome 22. The BCR-ABL fusion gene is
shown by the overlapping colors of the two probes. Since this test can detect BCR-
ABL in cells found in the blood, it can be used to determine if there is a significant
decrease in the number of circulating CML cells as a result of treatment.
e. Polymerase Chain Reaction (PCR)
The BCR-ABL gene is also detectable by molecular analysis. A quantitative
PCR test is the most sensitive molecular testing method available. This test can be
performed with either blood or bone marrow cells (Branford et al., 2008). The PCR
test essentially increases or "amplifies" small amounts of specific pieces of either
RNA or DNA to make them easier to detect and measure. So, the BCR-ABL gene
abnormality can be detected by PCR even when present in a very low number of cells
(Hughes et al., 2003). About one abnormal cell in one million cells can be detected by
PCR testing. Quantitative PCR is used to determine the relative number of cells with
the abnormal BCR-ABL gene in the blood (Hughes et al., 2003). This has become
the most used and relevant type of PCR test because it can measure small amounts of
disease, and the test is performed on blood samples, so there is no need for a bone
marrow biopsy procedure. Blood cell counts, bone marrow examinations, FISH and
PCR may also be used to track a person's response to therapy once treatment has
begun (Ohm, 2013). Throughout treatment, the number of red blood cells, white blood
cells, platelets and CML cells is also measured on a regular basis.
2.1.2 Phases of chronic myeloid Leukaemia
Staging is the process of finding out how far a cancer has spread. Most types
of cancer are staged based on the size of the tumor and how far it has spread from
where it started. This system does not work for Leukaemias because they do not often form a solid mass or tumor (Vardiman et al., 2001; Millot et al., 2005). Also, Leukaemia starts in the bone marrow and, in many people, it has already spread to other organs by the time it is found. For someone with chronic myeloid Leukaemia (CML), the outlook depends on other factors such as features of the cells shown in lab tests and the results of imaging studies (Raanani et al., 2005). This information helps guide treatment decisions.
In Chronic myeloid Leukaemia, there are three phases. As the number of blast cells increases in the blood and bone marrow, there is less room for healthy white blood cells, red blood cells and platelets. This may result in infections, anemia and easy bleeding, as well as bone pain and pain or a feeling of fullness below the ribs on the left side. The number of blast cells in the blood and bone marrow and the severity of signs and symptoms determine the phase of the disease (NCI, 2011). The
three (3) phases of CML are:
a. Chronic Phase;
b. Accelerated Phase; and
c. Blast Crisis Phase.
A number of patients progress from the chronic phase, which can usually be well-managed, to the accelerated phase or blast crisis phase. This is because of additional genetic changes in the leukemic stem cells. Some of these additional chromosome abnormalities are identifiable by cytogenetic analysis (Cortes et al., 1995a; Aurich et al., 1998). However, there appear to be other genetic changes (low levels of drug-resistant mutations that may be present at diagnosis) in the CML stem cells that cannot be identified by the laboratory tests that are currently available.
a. Chronic Phase
The chronic phase is the first phase of CML; the number of white blood cells is increased, and immature white blood cells (blasts) make up less than 10% of cells in the peripheral blood and/or bone marrow. This means that fewer than 10 out of every 100 cells are blasts. CML in the chronic phase may cause mild symptoms, such as infections, but most often it does not cause any symptoms, since the changes in the blood cells are not severe. In this phase, the cancer
progresses very slowly. Thus, CML in this phase may progress over several months
or years. In general, people with CML in the chronic phase respond better to
treatment.
b. Accelerated Phase
The accelerated phase is the second phase of CML. In this phase, the number
of blast cells in the peripheral blood and/or bone marrow is usually higher than
normal. Other aspects of accelerated phase can include increased basophils, very low
platelets, or new chromosome changes. The number of white blood cells is also high.
In this phase, the Leukaemia cells grow more quickly and may cause symptoms such
as anemia and an enlarged spleen. A few different criteria groups can be used to define the accelerated phase; the two most commonly used are the World Health Organization criteria and the criteria from the MD Anderson Cancer Centre (Table 2.2).
c. Blast Crisis Phase
The blast phase is the final phase of CML progression. Also referred to as "blast crisis", CML in this phase can be life-threatening (NCCN, 2014). There are two criteria groups that may be used to define the blast phase (Table 2.3).
Table 2.2: Criteria for Accelerated Phase

MD Anderson criteria:
- 15% blasts in peripheral blood
- 30% blasts and promyelocytes in peripheral blood
- 20% basophils in peripheral blood
- Very high or very low platelet count that is unrelated to treatment
- Increasing spleen size and white blood cell count despite treatment
- New chromosome changes (mutations)

World Health Organization criteria:
- 10% to 19% blasts in peripheral blood and/or bone marrow
- 20% basophils in peripheral blood
- Very low platelet count that is unrelated to treatment
- New chromosome changes (mutations)

(Source: Faderi et al., 1999; Swerdlow et al., 2008)
Table 2.3: Criteria for Blast Phase

World Health Organization criteria:
- 20% blasts in peripheral blood or bone marrow
- Blasts found outside of blood or bone marrow
- Large groups of blasts found in bone marrow

International Bone Marrow Transplant Registry criteria:
- 30% blasts in peripheral blood or bone marrow
- Blasts found outside of blood or bone marrow

(Source: Swerdlow et al., 2008; Druker, 2007)
In this phase, the number of blast cells in the peripheral blood and/or bone
marrow is very high. Another defining feature of blast phase is that the blast cells
have spread outside the blood and/or bone marrow into other tissues (Swerdlow et al.,
2008). In the blast phase, the Leukaemia cells may be more similar to Acute myeloid
Leukaemia (AML) or Acute lymphoblastic Leukaemia (ALL). AML causes too many immature white blood cells called myeloblasts to be made. ALL results in too many immature white blood cells called lymphoblasts.
2.1.3 Chronic myeloid Leukaemia (CML) Treatment
There is more than one treatment for chronic myeloid Leukaemia; the type of treatment depends on factors such as age, general health, and the phase of the cancer. Some people with CML may have more than one treatment (Talpaz et al.,
2006; Kantarjian et al., 2006; Cortes et al., 2007). Primary treatment is the main
treatment used to rid the body of cancer. Tyrosine Kinase Inhibitors (TKIs) are often
used as primary treatment for CML. First-line treatment is the first set of treatments
given to CML patients and if the treatment fails, second-line treatment is the next
treatment or set of treatments given (Lichtman et al., 2006). This is also referred to as
follow-up treatment since it is given after follow-up tests show that the previous
treatment failed or stopped working.
a. Tyrosine kinase inhibitor (TKI) therapy
TKI (tyrosine kinase inhibitor) therapy is a type of targeted therapy used to
treat CML. Targeted therapy is treatment with drugs that target a specific or unique
feature of cancer cells not generally present in normal cells; as a result of targeting
cancer cells, they may be less likely to harm normal cells throughout the body
(Jabbour et al., 2012). TKIs target the abnormal BCR-ABL protein that causes the
overgrowth of abnormal white blood cells (CML cells). The BCR-ABL protein, made
by the BCR-ABL gene, is a type of protein called a tyrosine kinase. Tyrosine kinases
are proteins located on or near the surface of cells and they tell cells when to grow
and divide to make new cells (Jabbour et al., 2007). TKIs block (inhibit) the BCR-
ABL protein from sending the signals that cause too many abnormal white blood cells
to form. However, each TKI works in a different way.
The FDA (Food and Drug Administration) approved the first TKI for the
treatment of CML in 2001. Since then, several TKIs have been developed to treat
CML. The newer drugs are referred to as Second-generation TKIs. The TKIs used to
treat CML are listed in Table 2.4; the drugs made in the form of pills are swallowed
by a patient. The dose of the drug is measured in mg (milligrams).
Imatinib was the first TKI to be approved to treat CML. Thus, it is called a first-
generation TKI. Imatinib works by binding to the active site on the BCR-ABL
protein to block it from sending signals to make new abnormal white blood cells
(CML cells). Figure 2.5 shows how Imatinib treatment works.
Dasatinib is a second-generation TKI that was approved for the treatment of CML in
2006. Dasatinib is more potent than Imatinib and can bind to the active and inactive
sites on the BCR-ABL protein to block growth signals.
Nilotinib was approved to treat CML in 2007. It is a second-generation TKI that
works in almost the same way as Imatinib. However, Nilotinib is more potent than
Imatinib and it more selectively targets the BCR-ABL protein. Nilotinib also targets other proteins apart from the BCR-ABL protein.
Bosutinib was approved to treat CML in 2012. However, this second-generation TKI
is only approved to treat patients who experienced intolerance or resistance to prior
TKI therapy. It also targets other proteins, in the same way as Nilotinib.
Table 2.4: Tyrosine Kinase Inhibitor (TKI) drugs used to treat CML

Imatinib (sold as Gleevec®): first-line treatment for
1. newly diagnosed adults and children in chronic phase; and
2. adults in chronic, accelerated or blast phase after failure of interferon-alfa therapy.

Dasatinib (sold as Sprycel®): first-line treatment for
1. newly diagnosed adults in chronic phase; and
2. adults resistant or intolerant to prior therapy in chronic, accelerated or blast phase.

Nilotinib (sold as Tasigna®): first-line treatment for
1. newly diagnosed adults in the chronic phase; and
2. adults resistant or intolerant to prior therapy in chronic or accelerated phase.

Bosutinib (sold as Bosulif®): second-line treatment for
1. adults with chronic, accelerated or blast phase with resistance or intolerance to prior therapy.
Figure 2.5: How Imatinib works
Side effects are new or worsened unplanned physical or emotional conditions caused by
treatment. Each TKI for CML can cause side effects which depend on: the drug, the
amount taken, the length of treatment, and the person; most side effects can be
managed or even prevented. Supportive care is the treatment of symptoms caused by
CML or side effects caused by CML treatment.
b. Immunotherapy
The immune system is the body's natural defense against infection and disease. Immunotherapy is treatment with drugs that boost the immune system
response against cancer cells (Sharma et al., 2011). Interferon is a substance naturally
made by the immune system. Interferon can also be made in a laboratory to be used
as immunotherapy for CML. PEG (pegylated) interferon is a long-acting form of the
drug. Interferon is not recommended as a first-line treatment option for patients with
newly diagnosed CML. But, it may be considered for patients unable to tolerate TKIs
(NCCN, 2014). Interferon is often given as a liquid that is injected under the skin or
in a muscle with a needle.
c. Chemotherapy
Chemotherapy is a type of drug commonly used to treat cancer. Many people refer to this treatment as "chemo". Chemotherapy drugs kill cells that grow rapidly,
including cancer cells and normal cells. Different types of chemotherapy drugs attack
cancer cells in different ways. Therefore, more than one drug is often used (Bluhm,
2011).
Omacetaxine is one of the chemotherapy drugs used for CML treatment; it was approved in 2012 by the FDA for patients with resistance and/or intolerance to two or more TKIs. Resistance is when a CML patient does not respond to treatment;
intolerance is when treatment with a drug must be stopped due to severe side effects.
Omacetaxine works in part by blocking cells from making some of the proteins, such
as the BCR-ABL protein, needed for cell growth and division. This may slow or even
stop the growth of new CML cells.
Omacetaxine is administered as a liquid that is injected under the skin with a
needle. Other chemotherapy drugs may be given as a pill that is swallowed (NCCN,
2014). Chemotherapy is given in cycles of treatment days followed by days of rest.
The number of treatment days per cycle and the total number of cycles varies
depending on the chemotherapy drug given.
d. Stem cell transplant and donor lymphocyte infusion
An HSCT (hematopoietic stem cell transplant) is a medical procedure that
kills damaged or diseased blood stem cells in the body and replaces them with healthy
stem cells. HSCT is currently the only treatment for CML that may cure rather than
control the cancer. However, the excellent results with TKIs have challenged the role
of HSCT as first-line of treatment – the first set of treatments given to treat a disease.
For the treatment of CML, healthy blood stem cells are collected from another
person, called a donor. This is called an allogeneic HSCT. An allogeneic HSCT creates a new immune system for the body. The immune system is the body's natural defense against infection and disease. For this type of transplant, Human Leukocyte Antigen (HLA) testing is needed to check if the patient and donor are a good match.
A Donor Lymphocyte Infusion (DLI) is a procedure in which the patient
receives lymphocytes from the same person who donated blood stem cells for the
HSCT. A lymphocyte is a type of white blood cell that helps the body to fight infections. The purpose of the DLI is to stimulate an immune response called the Graft-versus-tumor (GVT) effect or Graft-versus-Leukaemia (GVL) effect. The GVT
effect is when the transplanted cells (the graft) see the cancer cells (tumor/Leukaemia)
as foreign and attack them. This treatment may be used after HSCT for CML that
did not respond to the transplant or that came back after an initial response.
2.1.4 Measuring CML treatment response
Measuring the response to treatment with blood and bone marrow testing is a
very important part of treatment for people with CML. In general terms, the greater
the response to drug therapy, the longer the disease will be controlled. Other factors
that affect a person‘s response to treatment include: the stage of the disease and the
features of the individual‘s CML at the time of diagnosis
Nearly all people with chronic phase CML have a ―complete hematologic
response‖ with Gleevec, Sprycel or Tasigna therapy; most of these people will
eventually achieve a ―complete cytogenetic response.‖ Patients who have a complete
cytogenetic response often continue to have a deeper response and achieve a ―major
molecular response.‖ Additionally, a growing number of patients achieve a ―complete
molecular response‖; table 2.5 shows the explanation of each term.
2.1.5 Imatinib treatment for Nigerian CML patients
According to Oyekunle et al. (2012b), Nigerian CML patients are presently
treated using Imatinib as the first line of treatment. Chromosome analysis is done
using cultured bone marrow aspirate samples; Philadelphia chromosomes are
estimated from the metaphase, and the proportion of Ph+ cells is noted. Patients in
the chronic phase receive oral Imatinib: 400mg daily while those in the accelerated or
blastic phase receive 600mg daily. Imatinib is continued for as long as there is
evidence of continued benefit from therapy.
Table 2.5: Chronic Myeloid Leukaemia (CML) Treatment Responses

Hematologic response (measured with a Complete Blood Count (CBC) with differential):
- Complete Hematologic Response (CHR): blood counts completely return to normal; no blasts in peripheral blood; no signs/symptoms of disease (the spleen returns to normal size).

Cytogenetic response (measured by bone marrow cytogenetics):
- Complete Cytogenetic Response (CCyR): no Philadelphia (Ph) chromosomes detected.
- Partial Cytogenetic Response (PCyR): 1% - 35% of cells have the Ph chromosome.
- Major Cytogenetic Response: 0% - 35% of cells have the Ph chromosome.
- Minor Cytogenetic Response: more than 35% of cells have the Ph chromosome.

Molecular response (measured by quantitative PCR (QPCR) using the International Scale (IS)):
- Complete Molecular Response (CMR): no BCR-ABL gene detectable.
- Major Molecular Response (MMR): at least a 3-log reduction* in BCR-ABL levels, or BCR-ABL of 0.1% or less.

* A 3-log reduction is a 1/1,000 (i.e., 1,000-fold) reduction of the level at the start of treatment.
Allopurinol (300mg daily) is given until the leucocyte count falls below 20 x 10⁹/L. Patients with hyperleucocytosis (leucocyte count > 100 x 10⁹/L) who are already on hydroxyurea continue on the latter for another 1 - 3 weeks, with monitoring of the full blood count before final withdrawal of the drug once the white cell count falls to less than 100 x 10⁹/L.

In individuals with severe Imatinib-induced myelosuppression, the drug is withheld until the neutrophil count rises to 1.5 x 10⁹/L and the platelet count to at least 75 x 10⁹/L. Patients with recurrent, therapy-induced myelosuppression can have the Imatinib dose reduced to 300mg daily (the minimum dose for therapeutic blood levels in adults) until the blood count normalizes. However, if the myelosuppression is related to blastic transformation, Imatinib is discontinued and appropriate supportive therapy given. Women of child-bearing age are advised to use barrier contraception. Imatinib treatment is withdrawn in patients who develop neutropenia (< 1,000/mm³) or thrombocytopenia (< 75,000/mm³) while on therapy, until the cytopenias are corrected, and is then re-commenced at lower doses.
2.2 Predictive Modeling
Predictive research aims at predicting future events or an outcome based on
patterns within a set of variables and has become increasingly popular in medical
research (Agbelusi, 2014; Idowu et al., 2015). Accurate predictive models can inform
patients and physicians about the future course of an illness or the risk of developing
illness and thereby help guide decisions on screening and/or treatment (Waijee et al., 2013a). There are several important differences between traditional explanatory research and predictive research. Explanatory research typically applies statistical methods to test causal hypotheses using prior theoretical constructs. In contrast,
predictive research applies statistical methods and/or machine learning techniques,
without preconceived theoretical constructs, to predict future outcomes (e.g.
predicting the risk of hospital readmission) (Breiman, 1984).
Although predictive models may be used to provide insight into the causality of the pathophysiology of the outcome, causality is neither a primary aim nor a requirement for variable inclusion (Moons et al., 2009). Non-causal predictive factors may be surrogates for other drivers of disease, with tumor markers as predictors of cancer
progression or recurrence being the most common example. Unfortunately, a poor
understanding of the differences in methodology between explanatory and predictive
research has led to a wide variation in the methodological quality of prediction
research (Hemingway et al., 2009).
2.2.1 Types of predictive models
Machine learning has previously been used to predict behavioral outcomes in business, such as identifying consumer preferences for products based on prior purchasing history. A number of different techniques to develop predictive
algorithms exist, using a variety of prediction analytic tools/software and have been
described in detail in literature (Waijee et al., 2010; Siegel et al., 2011). Some
examples include neural networks, support vector machines, decision trees, naïve
Bayes etc. Decision trees, for example, use techniques such as classification and
regression trees, boosting and random trees to predict various outcomes.
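As a rough illustration (not the models developed in this thesis), the sketch below fits several of the algorithm families just mentioned on synthetic data and compares their hold-out accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a labeled survival dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "decision tree":  DecisionTreeClassifier(random_state=1),
    "random forest":  RandomForestClassifier(n_estimators=100, random_state=1),
    "SVM":            SVC(kernel="rbf"),
    "naive Bayes":    GaussianNB(),
    "neural network": MLPClassifier(max_iter=1000, random_state=1),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:15s} accuracy = {acc:.3f}")
```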
Machine learning algorithms, such as random-forest approaches, have several advantages over traditional explanatory statistical modeling, such as the lack of a predefined hypothesis, which makes them less likely to overlook unexpected predictors (Liaw and Weiner, 2012). Approaching a predictive problem without a specific causal
hypothesis can be quite effective when many potential predictors are available and
when there are interactions between predictors, which are common in engineering,
biological and social causative processes. Predictive models using machine learning
algorithms may therefore facilitate the recognition of important variables that may
otherwise not be initially identified (Waijee et al., 2010). In fact, many examples of
discovery of unexpected predictor variables exist in the machine learning literature
(Singal et al., 2013).
2.2.2 Developing a predictive model
The first step in developing a predictive model, when using traditional
regression analysis, is selecting relevant candidate predictor variables for possible
inclusion in the model; however, there is no consensus for the best strategy to do so
(Royston et al., 2009). A backward-elimination approach starts with all candidate variables and sequentially applies hypothesis tests to determine which variables should be removed from the final model, whereas a full-model approach includes all candidate variables to avoid potential over-fitting and selection bias. Previously reported significant predictor variables should typically be included in the final model regardless of their statistical significance, but the number of variables included is usually limited by the sample size of the dataset (Greenland, 1989).
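A minimal sketch of backward elimination, using scikit-learn's RFE (recursive feature elimination) as a model-based stand-in for the hypothesis-test-driven procedure described above; the data and parameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data. RFE starts from all candidate variables and
# drops the weakest one each round, mirroring backward elimination
# (model-importance-based rather than hypothesis-test-based).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("kept variables  :", [i for i, k in enumerate(rfe.support_) if k])
print("elimination rank:", list(rfe.ranking_))
```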
Inappropriate selection of variables is an important and common cause of poor
model performance in this situation. Selection of variables is less of an issue using
machine learning techniques given that they are often not solely based on predefined
hypothesis (Ibrahim et al., 2012). There are several other important issues relating to
data management when developing a predictive model, such as dealing with missing
data and variable transformation (Kaambwa et al., 2012; Waijee et al., 2013b).
2.2.3 Validating a predictive model
For a prediction model to be valuable, it must not only have predictive ability
in the derivation cohort but must also perform well in a validation cohort
(Hemingway et al., 2009). A model‘s performance may differ substantially between
derivation and validation cohorts for several reasons including over-fitting of the
model, missing important predictor variables, and inter-observer variability of
predictors leading to measurement errors (Altman et al., 2009). Therefore model
performance in the derivation dataset may be overly optimistic and is not a guarantee
that the model will perform equally well in a new dataset. Much published prediction research focuses solely on model derivation, and validation studies are very scarce (Waijee et al., 2013b).
Validation can be performed using internal and external validation. A
common approach to internal validation is to split the data into two portions – a
training set and validation set. If splitting the dataset is not possible given the limited
available data, measures such as cross validation or bootstrapping can be used for
internal validation (Steyerberg et al., 2010). However, internal validation nearly
always yields optimistic results given that the derivation and validation dataset are
very similar (as they are from the same dataset). Although external validation is more
difficult as it requires data collected from similar sources in a different setting or a
different location, it is usually preferred to internal validation (Steyerberg et al.,
2001). When a validation study shows disappointing results, researchers are often
tempted to reject the initial model and develop a new predictive model using the
validation dataset.
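The split-sample and cross-validation strategies for internal validation described above can be sketched as follows (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=150, n_features=6, random_state=2)

# Split-sample internal validation: derive on one portion, validate on the other.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=2)
model = RandomForestClassifier(random_state=2).fit(X_tr, y_tr)
print("hold-out accuracy :", model.score(X_val, y_val))

# When the data are too few to split, k-fold cross-validation reuses every
# record for both training and validation.
scores = cross_val_score(RandomForestClassifier(random_state=2), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```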
2.2.4 Assessing the performance of predictive model
When assessing model performance, it is important to remember that
explanatory models are judged based on the strength of associations, whereas
predictive models are judged solely on their ability to make accurate predictions. The
performance of a predictive model is assessed using several complementary tests,
which assess overall performance, calibration, discrimination, and reclassification
(Steyerberg et al., 2010) (Table 2.6). Performance characteristics should be
determined and reported for both the derivation and validation datasets. The overall
model performance can be measured using R², which characterizes the degree of variation in risk explained by the model (Gerds et al., 2008). The adjusted R² has been proposed as a better measure, as it accounts for the number of predictors and helps prevent over-fitting. Brier scores are similar measures of performance,
which are used when the outcome of interest is categorical instead of continuous
(Czado et al., 2009).
Calibration is the difference between observed and predicted event rates for
groups of dataset and is assessed using the Hosmer-Lemeshow test (Hosmer et al.,
1997). Discrimination is the ability of a model to distinguish between records which
do and do not experience an outcome of interest, and it is commonly assessed using
Receiver Operating Characteristics (ROC) curves (Hagerty et al., 2005). However,
ROC analysis alone is relatively insensitive for assessing differences between good
predictive models (Cook, 2007); therefore several relatively novel performance
measures have been proposed. The net reclassification improvement and integrated
discrimination improvement are measures used to assess changes in predicted
outcome classification between two models (Pencina et al., 2012).
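A short sketch computing two of these complementary measures, the Brier score (overall performance for a categorical outcome) and the area under the ROC curve (discrimination), on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# Hypothetical predicted survival probabilities and observed outcomes.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.5, 0.65, 0.75])

# Brier score: mean squared distance between predicted probability and outcome.
print("Brier score:", brier_score_loss(y_true, y_prob))
# Discrimination: area under the ROC curve (the c statistic).
print("ROC AUC    :", roc_auc_score(y_true, y_prob))
```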
Table 2.6: Performance characteristics for a predictive model (measures of predictive error)

Overall performance:
- R² (continuous outcome): average squared difference between predicted and observed outcomes.
- Adjusted R² (continuous outcome): same as R², but penalizes for the number of predictors.
- Brier score (categorical outcome): average squared distance between the predicted and the observed outcomes.

Discrimination:
- ROC curve (c statistic) (continuous or categorical outcome): overall measure of how effectively the model differentiates between events and non-events.
- C-index (Cox model): overall measure of how effectively the model differentiates between events and non-events.

Calibration:
- Hosmer-Lemeshow test (categorical outcome): agreement between predicted and observed risks.

Reclassification (categorical outcome*):
- Reclassification table: number of records that move from one category to another by improving the prediction model.
- NRI: a quantitative assessment of the improvement in classification by improving the prediction model.
- IDI: similar to NRI but using all possible cutoffs to categorize events and non-events.

IDI, integrated discrimination index; NRI, net reclassification index. * Can be performed for continuous data as well if a risk cutoff is assigned.

(Source: Waijee et al., 2013b)
2.3 Machine Learning
Machine learning (ML) is a branch of artificial intelligence that allows
computers to learn from past examples using statistical and optimization techniques
(Quinlan, 1986; Cruz and Wishart, 2006). There are several applications for machine
learning, the most significant of which is predictive modeling (Dimitologlou et al.,
2012). Every instance (records/set of fields or attributes) in any dataset used by
machine learning algorithms is represented using the same set of features
(attributes/independent variables). The features may be continuous, categorical or
binary. If the instances are given with known labels (the corresponding target
outputs), then the learning is called supervised, in contrast to unsupervised learning,
where instances are unlabeled (Ashraf et al., 2013).
Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (logic-based techniques, perceptron-based techniques) and Statistics (Bayesian networks, instance-based techniques). The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown (Gauda and Chahar, 2013). There are two variations of supervised classification:
a. Regression (or prediction/forecasting) – the class label is represented by a continuous variable (e.g. a real number); and
b. Classification – the class label is represented by discrete values (e.g. categorical or nominal values).
Unsupervised machine learning algorithms perform learning tasks used for inferring a function to describe hidden structure from unlabeled data – data without a target class (Sebastiani, 2002). The goal of unsupervised machine learning is to identify examples that belong to the same group/cluster based on underlying characteristics common among the attributes of members of that cluster or group (Zamir et al., 1997; Jain et al., 1999; Zhao and Karypis, 2002). The only things that unsupervised learning methods have to work with are the observed input patterns x_i, which are often assumed to be independent samples from an underlying unknown probability distribution P(x), and some explicit or implicit a priori information as to what is important. Examples of unsupervised machine learning algorithms include the following (a clustering sketch is given after the list):
a. Clustering;
b. Maximum likelihood estimation;
c. Feature selection;
d. Association rule learning etc. (Becker and Plumbley, 1996)
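A minimal clustering sketch: k-means is given only unlabeled attribute vectors and must discover the group structure itself (toy data, for illustration only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled records: no target class is supplied to the learner.
X, _ = make_blobs(n_samples=90, centers=3, random_state=4)

# k-means groups records purely from similarities among their attributes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=4).fit(X)
print("cluster sizes :", [int((kmeans.labels_ == c).sum()) for c in range(3)])
print("cluster centres:\n", kmeans.cluster_centers_.round(2))
```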
2.3.1 Supervised machine learning algorithms
Supervised learning entails a mapping between a set of input variables X_j (features/attributes) and an output variable Y_j (where j indexes the records/CML cases), and applying this mapping to predict the outputs for unseen data (data containing values for X but no Y). Supervised machine learning is the most commonly used machine learning technique in engineering and medicine.

In the supervised machine learning paradigm, the goal is to infer a function f:

    f : \mathcal{X} \to \mathcal{Y}    (2.1)

This function f is the model inferred by the supervised ML algorithm from a sample of data, or training set, composed of pairs of inputs (x_i) and outputs (y_i) such that x_i \in \mathcal{X} and y_i \in \mathcal{Y}:

    S = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathcal{X}, \; y_i \in \mathcal{Y}    (2.2)

Typically, for regression problems, \mathcal{X} \subseteq \mathbb{R}^d (where d is the dimension, or number of features, of the vector x) and \mathcal{Y} \subseteq \mathbb{R}; for classification problems the values of \mathcal{Y} are discrete, while for binary classification \mathcal{Y} = \{-1, +1\}.

In the statistical learning framework, the first fundamental hypothesis is that the training data are independently and identically generated from an unknown but fixed joint probability distribution function P(X, Y). The goal of the learning algorithm is to find a function f attempting to model the dependency encoded in P(X, Y) between the input X and the output Y. \mathcal{F} will denote the set of functions in which the solution f is sought, such that f \in \mathcal{F} \subseteq \mathcal{G}, where \mathcal{G} is the set of all possible functions f.

The second fundamental concept is the notion of error or loss to measure the agreement between the prediction f(X) and the desired output Y. A loss (or cost) function L is introduced to evaluate this error (see equation 2.3):

    L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}, \quad (f(X), Y) \mapsto L(f(X), Y)    (2.3)

The choice of the loss function L(f(X), Y) depends on the learning problem being solved. Loss functions are classified according to their regularity or singularity properties and according to their ability to produce convex or non-convex criteria for optimization. In the case of pattern recognition, where \mathcal{Y} = \{-1, +1\}, a common choice for L is the misclassification error, which is measured as follows:

    L(f(X), Y) = \tfrac{1}{2} \, \lvert f(X) - Y \rvert    (2.4)

This cost is singular and symmetric. Practical algorithmic considerations may bias the choice of L. For instance, singular functions may be selected for their ability to provide sparse solutions.

For unsupervised learning, where no desired output is available, the problem may be expressed in a similar way using loss functions that depend only on the input and its representation f(X), for instance a reconstruction error (equations 2.5 and 2.6):

    L(f(X), X) = \lVert X - f(X) \rVert    (2.5)

    L(f(X), X) = \lVert X - f(X) \rVert^{2}    (2.6)

The loss function L leads to the definition of the risk for a function f, also called the generalization error:

    R(f) = \int L(f(x), y) \, dP(x, y)    (2.7)

In classification, the objective could be to find the function f in \mathcal{F} that minimizes R(f). Unfortunately, this is not possible because the joint probability P(x, y) is unknown. From a probabilistic point of view, using the input and output random variable notations X and Y, the risk can be expressed as in equation (2.8), which can be rewritten as two nested expectations in equation (2.9):

    R(f) = \mathbb{E}\left[ L(f(X), Y) \right]    (2.8)

    R(f) = \mathbb{E}_X \left[ \, \mathbb{E}_{Y|X}\left[ L(f(X), Y) \mid X \right] \right]    (2.9)

The expression in equation (2.9) offers the opportunity to separately minimize \mathbb{E}[L(f(X), Y) \mid X = x] with respect to the scalar value f(x). The resulting function is the Bayes estimator associated with the risk R.
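Because P(X, Y) is unknown, practical algorithms work with the empirical counterpart of R(f) computed on the training set; a minimal sketch under the misclassification loss of equation (2.4), on hypothetical labels:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """Empirical counterpart of R(f) under the misclassification loss:
    the average of L(f(x_i), y_i) = 0.5 * |f(x_i) - y_i| over the sample."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(0.5 * np.abs(y_pred - y_true)))

# Labels in {-1, +1}, as in the binary classification setting above.
y_true = [+1, -1, +1, +1, -1, +1]
y_pred = [+1, -1, -1, +1, -1, +1]
print("empirical risk:", empirical_risk(y_true, y_pred))  # 1 error in 6 ~ 0.167
```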
The learning problem is expressed as a minimization of R(f) for any classifier f. As the joint probability is unknown, the solution is inferred from the available training set S = \{(x_i, y_i)\}_{i=1}^{n}. There are two ways to address the problem. The first approach, called generative-based, tries to approximate the joint probability P(X, Y), or P(Y|X)P(X), and then compute the Bayes estimator with the obtained probability. The second approach, called discriminative-based, attacks the estimation of the risk R(f) head on (Liaw and Weiner, 2012). The following is a description of some of the most popular and effective supervised machine learning algorithms.
a. Decision Trees (DT)
Decision tree learning uses a decision tree as a predictive model which maps observations about the relevant indicators for CML survival classification, X_ij, to a conclusion about the target value – the patient's survival class (survived or not survived). Decision trees can be either classification or regression trees; for this study, classification trees were adopted, which can be used as input for decision making, describing the data using a top-down tree (Quinlan, 1986; Breiman et al., 1984). Each interior node (starting from the root/parent node) of the tree represents an attribute (a feature relevant for CML survival), with edges that correspond to the values/labels of each attribute leading to a child node below; the process continues for each subsequent value until a leaf is reached, the terminal node, which represents the target class (class of CML survival) alongside the probability distribution over the classes (Friedman, 1999).
Such decision tree algorithms include ID3 (Iterative Dichotomiser 3), C4.5 (an extension of ID3), CART (Classification and Regression Trees), CHAID (Chi-squared Automatic Interaction Detector), MARS, etc. In this study, the C4.5 decision tree algorithm was considered.
The tree is learned by splitting the training dataset into subsets based on an attribute-value test for each input variable; the process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target class, or when splitting no longer adds value to the predictions. This is also called top-down induction of decision trees (Rokach and Maimon, 2008), an example of a greedy (divide-and-conquer) algorithm. When constructing the tree, different decision tree algorithms use different metrics for selecting the best attribute on which to split; these generally measure the homogeneity of the target class (survival of CML) within the subsets induced by the candidate attributes (relevant indicators for CML survival) (Rokach and Maimon, 2005). Such metrics include Gini impurity, information gain and variance reduction; the C4.5 decision tree algorithm uses the information gain metric. The information gain criterion is defined by equation (2.10).
If S is the training dataset and a_i is an attribute (predictive of CML survival in patients) with values {v_1, ..., v_m} used to partition S into subsets S_1, ..., S_m, then:

Gain(S, a_i) = Entropy(S) − Σ_{j=1}^{m} (|S_j| / |S|) × Entropy(S_j)    (2.10)

where:

Entropy(S) = − Σ_k (|S_k| / |S|) × log2(|S_k| / |S|)

and |S_k| is the number of instances in S belonging to target class k.
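As an illustration of equation (2.10), the following sketch (Python, with a toy attribute and survival labels invented for illustration) computes the entropy of a label set and the information gain of a single candidate attribute, exactly as C4.5 would score it when choosing a split.

    import numpy as np
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a class-label sequence, in bits
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def information_gain(feature_values, labels):
        # Gain(S, a) = Entropy(S) - sum_j |S_j|/|S| * Entropy(S_j)  (equation 2.10)
        n = len(labels)
        remainder = 0.0
        for v in set(feature_values):
            subset = [c for f, c in zip(feature_values, labels) if f == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    # toy attribute (e.g. 'spleen enlarged?') against the survival class
    feature = ['yes', 'yes', 'no', 'no', 'yes']
    survival = ['not_survived', 'not_survived', 'survived', 'survived', 'survived']
    print(round(information_gain(feature, survival), 3))  # about 0.42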
b. Support Vector Machines (SVM)
Support vector machines (SVMs) also called support vector networks are
supervised learning models with associated learning algorithms that analyze data and
recognize patterns (Cortes and Vapnik, 1995). Consider a training dataset consisting of CML survival indicators as the input vector for each patient, together with that patient's CML survival class as the target, taking one of two categories; SVM attempts to build a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
An SVM model is a representation of the examples as points in space, mapped so that
the examples of the separate categories are divided by a clear gap that is as wide as
possible. New examples are then mapped into that same space and predicted to
belong to a category based on which side of the gap they fall on.
In formal terms, SVM constructs a hyper-plane or set of hyper-planes in a high-dimensional space, which can be applied to classification, regression or other tasks. A good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class – the support vectors – since in general the larger the margin, the lower the generalization error of the classifier. Although SVMs are primarily supervised learners used for developing classification and regression models, unsupervised variants such as support vector clustering also exist. To calculate the margin between the data belonging to the two classes, two parallel hyper-planes (the blue lines in figure 2.6) are constructed, one on either side of the separating hyper-plane (the solid black line), and pushed up against the two datasets (the survived and not-survived groups); the maximum-margin hyper-plane is the one for which the distance to the neighbouring data points of both classes is largest.
The parameters of the maximum-margin hyper-plane are derived by solving a large quadratic programming (QP) optimization problem. Several specialized algorithms exist for quickly solving the QP problems that arise from SVMs, mostly relying on heuristics for breaking the problem down into smaller chunks. This study implements John Platt's sequential minimal optimization (SMO) algorithm (Platt, 1998) for training the support vector classifier. SMO works by breaking the large QP problem into a series of smallest-possible sub-problems, each involving only two Lagrange multipliers, which can be solved analytically. The SMO implementation used is the one available in the Weka public-domain software; it globally replaces all missing values, transforms nominal attributes into binary values and, by default, normalizes all data.
Considering the use of a linear support vector classifier, as shown in figure 2.6, it is assumed that both classes are linearly separable. The training data containing the information about each CML patient, described by the relevant features (risk indicators for CML survival), are expressed as x_i ∈ R^d, i = 1, ..., n, while the target class is represented by y_i ∈ {−1, +1}. The hyper-plane can be defined by w · x + b = 0, where w ∈ R^d and b ∈ R. Since the classes are linearly separable, w and b can be determined such that:

y_i (w · x_i + b) ≥ 1, for all i = 1, ..., n

The decision function may be expressed as f(x) = sign(w · x + b), with f(x) = +1 for patients classified as survived and f(x) = −1 otherwise.

The SVM classification method aims at finding the optimal hyper-plane based on the maximization of the margin between the training data of both classes. Since the distance between a point x and the hyper-plane is |w · x + b| / ||w||, it is easy to show that the optimization problem takes the following form:

minimize (1/2) ||w||², subject to y_i (w · x_i + b) ≥ 1, i = 1, ..., n
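The sketch below illustrates this formulation on toy, linearly separable data using scikit-learn's SVC, whose libsvm backend solves the dual QP with an SMO-type decomposition in the same spirit as the Weka implementation used in this study; the data and parameter values are illustrative assumptions, not values from the study dataset.

    import numpy as np
    from sklearn.svm import SVC

    # toy linearly separable data: two survival-indicator features per patient,
    # labels +1 (survived) / -1 (not survived); values invented for illustration
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # libsvm (behind SVC) solves the dual QP with an SMO-type decomposition
    clf = SVC(kernel='linear', C=1e3).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]        # hyper-plane parameters w, b
    print(w, b, 2.0 / np.linalg.norm(w))          # margin width = 2 / ||w||
    print(clf.predict([[2.0, 2.0], [7.0, 7.0]]))  # -> [-1  1]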
Figure 2.6: Description of the linear SVM classifier
c. Artificial Neural Network (ANN) - Multi-layer Perceptron (MLP)
An artificial neural network (ANN) is an interconnected group of nodes, akin to the vast network of neurons in a human brain. In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that depend on a large number of inputs and are generally unknown (McCulloch and Pitts, 1943). ANNs are generally presented as systems of interconnected neurons which send messages to each other, such that each connection has a numeric weight that can be tuned based on experience, making neural networks adaptive to inputs and capable of learning.
The word network refers to the inter-connections between the neurons in the
different layers of each system. The first layer has input neurons which send data via
synapses to the middle layer of neurons, and then via more synapses to the third layer
of output neurons. The synapses store parameters called weights that manipulate the data in the calculations. An ANN is typically defined by three (3) types of
parameters, namely:
i. Interconnection pattern between the different layers of neurons;
ii. Learning process for updating the weights of the interconnections; and
iii. Activation function that converts a neuron's weighted input to its output
activation.
The simplest kind of neural network is a single-layer perceptron network,
which consists of a single layer of output nodes; the inputs are fed directly to the
outputs via a series of weights. In this way it can be considered the simplest kind of
feed-forward network. The sum of the products of the weights and the inputs is
calculated in each node, and if the value is above some threshold (typically 0) the
neuron fires and takes the activated value (typically 1); otherwise it takes the
deactivated value (typically -1).
A perceptron can be created using any values for the activated and deactivated
states as long as the threshold value lies between the two. Perceptrons can be trained
by a simple learning algorithm that is usually called the delta rule. It calculates the
errors between calculated output and sample output data, and uses this to create an
adjustment to the weights, thus implementing a form of gradient descent.
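The delta rule can be written out in a few lines. The sketch below (Python/NumPy) trains a single-layer perceptron on a toy, linearly separable task; the learning rate, epoch count and data are arbitrary illustrative choices, and the bias is folded in as an extra weight.

    import numpy as np

    def train_perceptron(X, y, lr=0.1, epochs=20):
        # delta rule: w <- w + lr * (target - output) * x, bias handled as w[0]
        w = np.zeros(X.shape[1] + 1)
        for _ in range(epochs):
            for xi, target in zip(X, y):
                output = 1 if w[0] + w[1:] @ xi > 0 else -1
                w[1:] += lr * (target - output) * xi
                w[0] += lr * (target - output)
        return w

    # toy linearly separable task (logical AND with +1/-1 labels)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([-1, -1, -1, 1])
    print(train_perceptron(X, y))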
Multi-layer networks use a variety of learning techniques, the most popular
being back-propagation. Here, the output values are compared with the correct answer
to compute the value of some predefined error-function. By various techniques, the
error is then fed back through the network. Using this information, the algorithm
adjusts the weights of each connection in order to reduce the value of the error
function by some small amount. After repeating this process for a sufficiently large
number of training cycles, the network will usually converge to some state where the
error of the calculations is small. In this case, one would say that the network has
learned a certain target function.
To adjust weights properly, one applies a general method for non-linear
optimization that is called gradient descent. For this, the derivative of the error
function with respect to the network weights is calculated, and the weights are then
changed such that the error decreases (thus going downhill on the surface of the error
function). For this reason, back-propagation can only be applied on networks with
differentiable activation functions.
Back-propagation, an abbreviation for backward propagation of errors, is a
common method of training artificial neural networks used in conjunction with an
optimization method such as gradient descent. The method calculates the gradient of a
loss function with respect to all the weights in the network. The gradient is fed to the
optimization method which in turn uses it to update the weights, in an attempt to
minimize the loss function. It is a generalization of the delta rule to multi-layered
feed-forward networks, made possible by using the chain rule to iteratively compute
gradients for each layer. Back-propagation requires that the activation function used
by the artificial neurons be differentiable.
The back-propagation learning algorithm can be divided into two phases:
propagation and weight update.
a. Phase 1 – Propagation: each propagation involves the following steps:
i. Forward propagation of the training pattern's input through the neural network in order to generate the propagation's output activations; and
ii. Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas of all output and hidden neurons.
b. Phase 2 – Weight update: for each weight-synapse, proceed as follows:
i. Multiply its output delta and input activation to get the gradient of the
weight; and
ii. Subtract a ratio (percentage) of the gradient from the weight.
Assume the input neurons are represented by variables X = {X_1, X_2, X_3, ..., X_i}, where i is the number of variables (input neurons). The combined effect of the synaptic weights W_ij on the inputs reaching a neuron in layer j is represented by the weighted sum in equation (2.14):

net_j = Σ_i W_ij X_i    (2.14)

Equation (2.14) is passed to the activation function (the sigmoid/logistic function), which limits the output to the interval (0, 1):

y = φ(net_j) = 1 / (1 + e^(−net_j))    (2.15)
The measure of discrepancy between the expected (target) output t and the actual output y is made using the squared error measure E:

E = (t − y)²

Recall, however, that the output y of a neuron depends on the weighted sum of all its inputs, as indicated in equation (2.14); this implies that the error E also depends on the incoming weights of the neuron, which need to be changed in the network to enable learning. The back-propagation algorithm aims to find the set of
weights that minimizes the error. In this study, the gradient descent algorithm is
applied in order to minimize the error and hence find the optimal weights that satisfy
the problem. Since back-propagation uses the gradient descent method, there is a
need to calculate the derivative of the squared error function with respect to the
weights of the network.
Hence, the squared error function is now redefined as (the ½ is required to cancel the exponent of 2 when differentiating):

E = (1/2)(t − y)²    (2.19)

For each neuron j, its output O_j is defined as:

O_j = φ(net_j) = φ( Σ_{k=1}^{n} W_kj O_k )    (2.20)
The input net_j to a neuron is the weighted sum of the outputs O_k of the previous neurons. The number of input neurons is n, and the variable W_kj denotes the weight between neurons k and j. The activation function φ is in general non-linear and differentiable, and the derivative of the logistic function in equation (2.15) is:

dφ(z)/dz = φ(z) (1 − φ(z))    (2.21)
The partial derivative of the error E with respect to a weight W_ij is obtained by applying the chain rule twice:

∂E/∂W_ij = (∂E/∂O_j) · (∂O_j/∂net_j) · (∂net_j/∂W_ij)

The last factor can be calculated from equation (2.20); since only one term of the weighted sum net_j depends on W_ij:

∂net_j/∂W_ij = ∂/∂W_ij ( Σ_{k=1}^{n} W_kj O_k ) = O_i

The derivative of the output of neuron j with respect to its input is the partial derivative of the activation function – the logistic derivative shown in equation (2.21):

∂O_j/∂net_j = φ(net_j) (1 − φ(net_j)) = O_j (1 − O_j)

The first factor is evaluated by differentiating the error function in equation (2.19) with respect to y, so if j is in the output layer such that y = O_j, then:

∂E/∂O_j = ∂E/∂y = y − t

However, if j is in an arbitrary inner layer of the network, finding the derivative of E with respect to O_j is less obvious. Considering E as a function of the inputs net_l of all neurons l receiving input from neuron j, and taking the total derivative with respect to O_j, a recursive expression for the derivative is obtained:

∂E/∂O_j = Σ_l (∂E/∂net_l · ∂net_l/∂O_j) = Σ_l (∂E/∂O_l · ∂O_l/∂net_l · W_jl)
Thus, the derivative with respect to O_j can be calculated if all the derivatives with respect to the outputs O_l of the next layer – the one closer to the output neuron – are known. Putting them all together:

∂E/∂W_ij = δ_j O_i    (2.27)

with:

δ_j = (O_j − t) · O_j (1 − O_j), if j is an output neuron;
δ_j = ( Σ_l δ_l W_jl ) · O_j (1 − O_j), if j is an inner neuron.

Therefore, in order to update the weight using gradient descent, one must choose a learning rate, η. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by −1:

ΔW_ij = −η ∂E/∂W_ij = −η δ_j O_i    (2.28)

Equation (2.28) is used by the back-propagation algorithm to adjust the values of the synaptic weights attached to the inputs of each neuron in equation (2.14), including the neurons in the inner layers of the multi-layer perceptron classifier.
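The derivation above condenses into a few lines of vectorized code. The sketch below (Python/NumPy, with an illustrative XOR task and arbitrarily chosen layer size, learning rate and epoch count) trains one hidden layer by back-propagation: the delta terms correspond to equation (2.27) and the weight update to equation (2.28).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_mlp(X, t, hidden=4, lr=0.5, epochs=10000, seed=0):
        # one-hidden-layer MLP trained by back-propagation
        rng = np.random.default_rng(seed)
        W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
        W2 = rng.normal(0.0, 1.0, (hidden, 1))
        for _ in range(epochs):
            H = sigmoid(X @ W1)                           # forward pass, hidden layer
            O = sigmoid(H @ W2)                           # forward pass, output layer
            delta_out = (O - t) * O * (1 - O)             # output delta (equation 2.27)
            delta_hid = (delta_out @ W2.T) * H * (1 - H)  # hidden delta (equation 2.27)
            W2 -= lr * H.T @ delta_out                    # weight update (equation 2.28)
            W1 -= lr * X.T @ delta_hid
        return W1, W2

    # toy non-linear task (XOR), targets in {0, 1}
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2 = train_mlp(X, t)
    print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))  # should approach [[0],[1],[1],[0]]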
2.3.2 General issues of supervised machine learning algorithms
The first step is collecting the dataset required for developing the predictive
model. If a requisite expert is available, he/she suggests which fields (attributes/features) are the most informative. If not, then the simplest method is brute force, which measures everything available in the hope that the right (informative, relevant but not redundant) features can be isolated. However, a
dataset collected by the brute-force method is not directly suitable for induction. It
contains in most cases noise and missing feature values, and therefore requires
significant pre-processing (Zhang et al., 2002). For this reason, methods suitable for
removing noise and missing values are important before deciding on the use of the
identified variables needed for developing predictive models using supervised
machine learning algorithms.
a. Data preparation and data pre-processing
There is a hierarchy of problems that are often encountered in data preparation
and preprocessing which includes:
i. Impossible input values;
ii. Unlikely input values;
iii. Missing input values; and
iv. Presence of irrelevant input features in the data.
Impossible values should be detected by the data handling software, ideally at
the point of input so that they can be re-entered. These errors are generally
straightforward, such as coming across negative values when positive values are
expected. If correct values cannot be entered, the problem is converted into missing
value category, by removing the data. Variable-by-variable data cleansing is a filter approach for unlikely values – values that are suspicious given their relationship to a specific probability distribution (for example, a value of 10 for a variable with mean 5 and standard deviation 3). Table 2.7 shows examples of how such metadata can help in detecting a number of possible data quality problems.
The process of selecting instances makes it possible to cope with the infeasibility of learning from very large datasets. Selection of instances from the original dataset is an optimization problem that maintains the mining quality while minimizing the sample size (Liu and Motoda, 2001). It reduces data and enables a
machine learning algorithm to function and work effectively with very large datasets.
Table 2.7: Examples for the use of variable-by-variable data cleansing

Problems         Metadata                 Examples/Heuristics
Illegal values   Cardinality              e.g., cardinality(gender) > 2 indicates a problem.
                 Max, Min                 Max and min should not be outside the permissible range.
                 Variance, Deviation      Variance and deviation of statistical values should not be higher than a threshold.
Misspellings     Feature values           Sorting on values often brings misspelled values next to correct values.

(Source: Kotsiantis et al., 2006)
There are a variety of procedures for sampling instances from large dataset.
The most well-known are:
i. Random sampling, which selects a subset of instances randomly.
ii. Stratified sampling, which is applicable when the class values are not
uniformly distributed in the training sets. Instances of the minority class(es)
are selected with greater frequency in order to even out the distribution.
Incomplete data is an unavoidable problem in dealing with most real world
data sources. Generally, there are some important factors to be taken into account
when processing unknown feature values. One of the most important ones is the
source of unknown-ness:
i. A value is missing because it was forgotten or lost;
ii. A certain feature is not applicable for a given instance (e.g., it does not exist
for a given instance).
iii. For a given observation, the designer of a training set does not care about the
value of a certain feature (so-called don’t care values).
Depending on the circumstances, there are a number of methods to choose from to handle missing data (Batista and Monard, 2003); two of them are sketched in code after this list:
i. Method of ignoring instances with unknown feature values: This method is
simplest; it involves ignoring any instances (records) which have at least one
unknown feature value.
ii. Most common feature value: The value of the feature that occurs most often is
selected to be the value for all the unknown values of the feature.
iii. Most common feature value in class: In this case, the value of the feature
which occurs most commonly within the same class is selected to be the value
for all the unknown values of the feature.
iv. Mean substitution: The mean value (computed from available cases) is used to
fill in missing data values on the remaining cases. A more sophisticated
solution than using the general feature mean is to use the feature mean for all
samples belonging to the same class to fill in the missing value.
v. Regression or classification methods: a regression or classification model
based on the complete case data for a given feature is developed. This model
treats the feature as the outcome and uses the other features as predictors.
vi. Hot deck imputing: The most similar complete case to the case with a missing value is identified, and the donor case's value is substituted for the missing value.
vii. Method of treating missing feature values as special values: unknown itself is
treated as a new value for the feature that contains missing values.
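A minimal sketch of two of these methods – mean substitution (method iv) and its class-conditional variant – is shown below in Python/pandas, on invented toy records; the column names are illustrative assumptions.

    import numpy as np
    import pandas as pd

    # toy follow-up records with missing entries; column names are illustrative
    df = pd.DataFrame({
        'age':       [38, 45, np.nan, 52, 41],
        'spleen_cm': [10, np.nan, 8, 15, np.nan],
        'survived':  ['yes', 'yes', 'no', 'no', 'yes'],
    })

    # method (iv), mean substitution: fill with the overall feature mean
    df['age'] = df['age'].fillna(df['age'].mean())

    # class-conditional variant of (iv): fill with the mean of the same class
    df['spleen_cm'] = df.groupby('survived')['spleen_cm'] \
                        .transform(lambda s: s.fillna(s.mean()))
    print(df)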
b. Feature selection
This is the process of identifying and removing as many irrelevant and
redundant features as possible (Yu and Liu, 2004). This reduces the dimensionality of
the data and enables data mining algorithms to operate faster and more effectively.
Generally, features are characterized as:
i. Relevant: are features that have an influence on the target class (output). Their
role cannot be assumed by the rest.
ii. Irrelevant: are features that do not have any influence on the target class.
Their values could be generated at random and not influence the output.
iii. Redundant: are features that can take the role of another (perhaps the simplest
way to incur model redundancy).
Feature selection algorithms in general have two (2) components:
i. A selection algorithm that generates proposed subsets of features and attempts
to find an optimal subset and
ii. An evaluation algorithm that determines how good a proposed feature subset
is.
However, without a suitable stopping criterion, the feature selection process may run
repeatedly through the space of subsets, taking up valuable computational time. The
stopping criteria might be whether:
i. addition (or deletion) of any feature does not produce a better subset; and
ii. an optimal subset according to some evaluation function is obtained.
The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by
constructing new features from the basic feature set (Markovitch and Rosenstein,
2002). This technique is called feature construction/transformation. These newly
generated features may lead to the creation of more concise and accurate classifiers.
In addition, the discovery of meaningful features contributes to better
comprehensibility of the produced classifier, and a better understanding of the learned
concept.
c. Algorithm selection
The choice of which specific learning algorithm should be used is a critical
step. The classifier's evaluation is most often based on prediction accuracy (the number of correct predictions divided by the total number of predictions). There are at least three techniques used to calculate a classifier's accuracy (Waijee et al., 2013b), illustrated in the sketch after this list:
i. One technique is to split the training set, using two-thirds (about 67% of the total cases) for training and the remaining one-third for estimating performance (testing).
ii. In another technique, known as cross-validation, the training set is divided into mutually exclusive and equal-sized subsets, and for each subset the classifier is trained on the union of all the other subsets. The average of the error rates on each subset is therefore an estimate of the error rate of the classifier.
iii. Leave-one-out validation is a special case of cross-validation in which every test subset consists of a single instance. This type of validation is, of course, more computationally expensive, but useful when the most accurate estimate of a classifier's error is required.
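The sketch below (Python/scikit-learn, on a synthetically generated dataset) illustrates all three estimation techniques side by side.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import (LeaveOneOut, cross_val_score,
                                         train_test_split)
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=120, n_features=8, random_state=0)
    clf = DecisionTreeClassifier(random_state=0)

    # (i) two-thirds / one-third holdout split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    holdout = clf.fit(X_tr, y_tr).score(X_te, y_te)

    # (ii) 10-fold cross-validation
    cv10 = cross_val_score(clf, X, y, cv=10).mean()

    # (iii) leave-one-out: every test subset is a single instance
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(holdout, cv10, loo)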
If the error rate is unsatisfactory, a variety of factors must be examined:
i. Perhaps relevant features of the problem are not being used;
ii. A larger training set is needed;
iii. The dimensionality of the problem is too high; and/or
iv. The selected algorithm is inappropriate or parameter tuning is needed.
A common method for comparing supervised ML algorithms is to perform statistical comparisons of the accuracies of trained classifiers on specific datasets.
Several heuristic versions of the t-test have been developed to handle this issue
(Dietterich, 1998; Nadeau and Bengio, 2003).
2.3.3 Machine learning for cancer prediction and prognosis
According to a literature survey on the application of machine learning in
healthcare data by Cruz and Wishart (2006), machine learning is not new to cancer
research. In fact, artificial neural networks (ANNs) and decision trees (DTs) have
been used in cancer detection and diagnosis for nearly 30 years (Circchetti, 1992;
Simes, 1985; Machin et al., 1991), from the detection and classification of tumors via X-rays and CRT images (Petricoin, 2004; Bocchi et al., 2004) to the classification of malignancies from proteomic and genomic (microarray) assays (Zhon et al., 2004; Wang et al., 2005).
The fundamental goals of cancer prediction and prognosis are distinct from the goals of cancer detection and diagnosis. In cancer prediction/prognosis one is
concerned with three predictive foci, namely:
a. The prediction of cancer susceptibility (i.e. risk assessment): involves an
attempt to predict the likelihood of developing a type of cancer prior to the
occurrence of the disease;
b. The prediction of cancer recurrence: involves the prediction of the likelihood
of redeveloping cancer after the apparent resolution of the disease; and
c. The prediction of cancer survivability: involves the prediction of the outcome
(life expectancy, survivability, progression, tumor-drug sensitivity) after the
diagnosis of the disease.
In the latter two situations, the success of the prognostic prediction is
obviously dependent, in part, on the success or quality of the diagnosis performed.
However a disease prognosis can only come after a medical diagnosis and a
prognostic prediction must take into account more than just a simple diagnosis
(Hagerty et al., 2005).
Indeed, a cancer prognosis typically involves multiple physicians from
different specialties using different subsets of biomarkers and multiple clinical
factors, including the age and general health of the patient, the location and type of
cancer, as well as the grade and size of the tumor (Fielding et al., 1992; Cochran et
al., 1997; Burke et al., 2005). Histological (cell-based), clinical (patient-based) and
demographic (population-based) information must be carefully integrated by the
attending physician to come up with a reasonable prognosis. Even for the most skilled clinician, this is not an easy job, and similar challenges face physicians and patients alike when it comes to the issues of cancer prevention and cancer susceptibility prediction. Family history, age, diet, Body Mass Index (BMI),
high risk habits (like smoking and drinking) and exposure to environmental
carcinogens (UV radiation, radon and asbestos) all play an important role in
predicting an individual‘s risk for developing cancer (Bach et al., 2003; Gascon et al.,
2004; Domchek et al., 2003).
In the past, the dependency of clinicians and physicians alike on macro-scale
information (tumor, patient, population and environmental data) generally kept the
number of variables small enough so that standard statistical methods or even the
physician's own intuition could be used to predict cancer risks and outcomes.
However, with today‘s high-throughput diagnostic and imaging technologies,
physicians are now faced with dozens or even hundreds of molecular, cellular and
clinical parameters. In these situations, human intuition and standard statistics do not
generally work efficiently; rather there is a reliance on non-traditional and intensively
computational approaches such as machine learning (ML). The use of computers
(and machine learning) in disease prediction and prognosis is part of a growing trend
towards personalized, predictive medicine (Weston and Hood, 2004).
Machine learning, like statistics, is used to analyze and interpret data. Unlike statistics, though, machine learning methods can employ Boolean logic (AND, OR,
NOT), absolute conditionality (IF, THEN, ELSE), conditional probabilities (the
probability of X given Y) and unconventional optimization strategies to model data or
classify patterns. These latter methods actually resemble the approaches humans
typically use to learn and classify. Although machine learning draws heavily from
statistics and probability, it is still fundamentally more powerful because it allows
inferences or decisions to be made that could not otherwise be made using
conventional statistical methodologies (Mitchell, 1997; Duda et al., 2001).
Many statistical methods employ multivariate regression or correlation analysis, and these approaches assume that the variables are independent and that the data can be modeled using linear combinations of these variables. When the relationships are non-linear and the variables are interdependent (or conditionally dependent), conventional statistics usually flounders. It is in these situations that machine
learning tends to shine. Many biological systems are fundamentally non-linear and
their parameters conditionally dependent. Many simple physical systems are linear
and their parameters are essentially independent.
Knowing which machine learning method is best for a given problem is not inherently obvious. This is why it is critically important to try more than one machine learning method on any given training set. Another common misunderstanding about
ML is that patterns a ML tool finds or the trends it detects are non-obvious or not
intrinsically detectable. On the contrary, many patterns or trends could be detected by
a human expert – if they looked hard enough at the dataset. Machine learning
basically saves the time and effort needed to discover the pattern or to develop the
classification scheme required.
2.4 Feature Selection for the Identification of Relevant Attributes
Feature Selection (FS) is important in machine learning tasks because it can
significantly improve performance by eliminating redundant and irrelevant features while at the same time speeding up the learning task (Yildirim, 2015). Given N features, the FS problem is to find the optimal subset among the 2^N possible choices. This problem usually becomes intractable as N increases. Feature subset selection is
the process of identifying and removing as much irrelevant and redundant information
as possible (Ashraf, 2013). This reduces the dimensionality of the data and may allow
learning algorithms to operate faster and more effectively (Novakovic, 2011).
In some cases, accuracy on future classification can be improved; in others,
the result is a more compact, easily interpreted representation of the target concept.
Therefore, the correct use of feature selection algorithms for selecting features improves inductive learning, whether in terms of generalization capacity, learning speed, or reducing the complexity of the induced model (Kumar and Minz, 2014).
There are two major approaches to FS. The first is Individual Evaluation, and the
second is Subset Evaluation. In the former, the features are ranked using a weight that measures the degree of relevance of each feature; in the latter, candidate subsets of features are constructed using a search strategy.
A feature selection algorithm (FSA) is a computational solution that is
motivated by a certain definition of relevance. However, the relevance of a feature
(or a subset of features) – as seen from inductive learning perspectives – may have
several definitions depending on the objective that is sought by the FS technique.
2.4.1 The relevance of a feature
The purpose of an FSA is to identify relevant features according to a definition of relevance. However, the notion of relevance in ML has not yet been rigorously defined by common agreement (Bell and Wang, 2000). Let E_i, with 1 ≤ i ≤ n, be the domains of the features X = {x_1, x_2, x_3, ..., x_n}, and let the instance space be defined as E = E_1 × E_2 × ... × E_n, where an instance is a point in this space. Consider p, a probability distribution on E, and T, a space of target labels. The motive is to model or identify an objective function c: E → T according to its relevant features. A dataset S composed of |S| instances can be seen as the result of sampling the attribute space E under the distribution p a total of |S| times and labeling its elements using the objective function c.
The notion of relevance according to a number of researchers is defined as a
relative relationship between the attributes and the objective function, the probability
distribution, sample, entropy or incremental usefulness (Novakovic et al., 2011; Novakovic, 2009). Following are a number of definitions of the relevance of a feature or set of attributes.
a. Definition I (relevance with respect to an objective function, c): A feature x_i ∈ X is relevant to an objective c if there exist two examples A, B in the instance space E such that A and B differ only in their assignment to x_i and c(A) ≠ c(B).
In other words, there exist two instances that can only be distinguished by x_i. The definition has the inconvenience that the learning algorithm cannot necessarily determine whether a feature x_i is relevant or not using only a sample S of E (Wang et al., 1998).
b. Definition II (strong relevance with respect to the sample, S): A feature x_i ∈ X is strongly relevant to the sample S if there exist two examples A, B ∈ S that differ only in their assignment to x_i and c(A) ≠ c(B).
This definition is the same as Definition I, but now A, B ∈ S and the definition is with respect to S (Blum and Langley, 1997).
c. Definition III (strong relevance with respect to the distribution, p): A feature x_i ∈ X is strongly relevant to an objective c in the distribution p if there exist two examples A, B with p(A) > 0 and p(B) > 0 that differ only in their assignment to x_i and c(A) ≠ c(B).
This definition is the natural extension of Definition II and, contrary to it, the distribution p is assumed to be known.
d. Definition IV (weak relevance with respect to the sample, S): A feature x_i ∈ X is weakly relevant to the sample S if there exists at least one proper subset X' ⊂ X (with x_i ∈ X') for which x_i is strongly relevant with respect to S.
A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.
e. Definition V (weak relevance with respect to a distribution, p): A feature x_i ∈ X is weakly relevant to the objective c in the distribution p if there exists at least one proper subset X' ⊂ X (with x_i ∈ X') for which x_i is strongly relevant with respect to p.
Instead of focusing on which features are relevant, it is possible to use
relevance as a complexity measure with respect to the objective c. In this case, it will
depend on the type of inducer used.
f. Definition VI (relevance as a complexity measure) (Blum and Langley, 1997): Given a data sample S and an objective c, define r(S, c) as the smallest number of features relevant to c – using Definition I restricted to S – such that the error in S is the least possible for the inducer.
It refers to the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.
g. Definition VII (relevance as incremental usefulness) (Caruana and Freitag, 1994): Given a data sample S, a learning algorithm L, and a subset of features X', the feature x_i is incrementally useful to L with respect to X' if the accuracy of the hypothesis that L produces using the group of features X' ∪ {x_i} is better than the accuracy reached using only the subset of features X'.
This definition is especially natural in feature selection algorithms (FSAs) that search the feature space in an incremental way, adding or removing features to a current solution. It is also related to the traditional understanding of relevance in the philosophy literature.
h. Definition VIII (relevance as an entropic measure) (Wang et al., 1998): Denoting the (Shannon) entropy by H(x) and the mutual information by I(x; y) = H(x) − H(x|y) (the reduction of the entropy of x generated by knowledge of y), the entropic relevance of x to y is defined as r(x; y) = I(x; y)/H(y). Let X be the original set of features and let C be the objective seen as a feature; a set X' ⊆ X is sufficient if I(X'; C) = I(X; C) (i.e., if it preserves the learning information). For a sufficient set X', it turns out that r(X'; C) = r(X; C), and the most favourable sufficient set is the one for which H(X') is smallest.
This implies that r(C; X') is greatest. In short, the aim is to have r(C; X') and r(X'; C) jointly maximized.
2.4.2 Characteristics of feature selection algorithms
Feature selection algorithms (with a few notable exceptions) perform a search
through the space of feature subsets, and, as a consequence, must address four (4)
basic issues affecting the nature of the search (Langley and Sage, 1994; Patil and
Sane, 2014):
a. Starting point
Selecting a point in the feature subset from which to begin the search can
affect the direction of the search. One option is to begin with no features and
successively add attributes. In this case, the search is said to proceed forward through
the search space. Conversely, the search can begin with all features and successively
remove them. In this case, the search proceeds backwards through the search space.
Another alternative is to begin somewhere in between (in the middle) and move
outwards from this point.
b. Search organization
An exhaustive search of the feature subspace is prohibitive for all but a small initial number of features. With N initial features, there exist 2^N possible subsets of features. Heuristic search strategies are more feasible than exhaustive search methods
and can also give good results, although they do not guarantee finding the optimal
subset (Hall et al., 2009). A number of search methods are highlighted as follows:
BestFirst: Searches the space of attribute subsets by greedy hill climbing combined with backtracking; backtracking is triggered when a set number of consecutive expanded nodes fails to improve performance. It may apply a forward approach, starting from the empty set of attributes and adding attributes one at a time; a backward approach, starting from the set of all attributes and removing them one by one; or a midway (hybrid) approach, searching in both directions by considering all possible single-attribute additions and deletions at a given point (Maji and Garai, 2013).
GreedyStepwise: Performs a greedy forward or backward search through the space of attribute subsets. It may start with no attributes, all attributes, or an arbitrary point in the space, and stops when the addition/deletion of any remaining attribute results in a decrease in the evaluation. It can also produce a ranked list of attributes by traversing the space from one side to the other and recording the order in which attributes are selected; a forward variant is sketched after this list.
Ranker: Individual evaluations of the attributes are done and they are ranked
accordingly (Hua-Liang and Billings, 2007). It is normally used in conjunction
with attribute evaluators (Relief, GainRatio, Entropy etc.).
Genetic Search: Genetic Algorithms (GAs) (Goldberg, 1989) are
optimization techniques that use a population of candidate solutions. They
explore the search space by evolving the population through four steps: parent
selection, crossover, mutation, and replacement. GAs have been seen as search
procedures that can locate high performance regions of vast and complex
search spaces, but they are not well suited for fine-tuning solutions (Holland,
1992). However, the components of the GAs may be specifically designed and
their parameters tuned, in order to provide effective local search behaviour.
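A forward greedy search of the GreedyStepwise kind can be sketched in a few lines; the version below (Python/scikit-learn) uses cross-validated accuracy as the subset evaluator, which is strictly a wrapper-style criterion – Weka's GreedyStepwise can equally drive a filter evaluator – and the dataset and classifier are illustrative choices.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def greedy_forward_selection(clf, X, y, cv=5):
        # start from the empty set; repeatedly add the single feature that most
        # improves cross-validated accuracy; stop when no addition helps
        selected, best = [], 0.0
        remaining = list(range(X.shape[1]))
        while remaining:
            scored = [(cross_val_score(clf, X[:, selected + [j]], y, cv=cv).mean(), j)
                      for j in remaining]
            score, j = max(scored)
            if score <= best:          # stopping criterion
                break
            selected.append(j)
            remaining.remove(j)
            best = score
        return selected, best

    X, y = make_classification(n_samples=100, n_features=6, n_informative=3,
                               random_state=0)
    print(greedy_forward_selection(GaussianNB(), X, y))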
c. Evaluation strategy
How feature subsets are evaluated is the single biggest differentiating factor
among most feature selection algorithms for machine learning. One paradigm, dubbed the filter approach (using distance, information, consistency, dependency metrics, etc.) (Kohavi, 1995; Kohavi and John, 1996), operates independently of any machine learning algorithm – undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. Another school of thought argues that the bias of a
particular induction algorithm should be taken into account when selecting features.
This method, called the wrapper (using predictive accuracy or cluster goodness) uses
an induction algorithm along with a statistical re-sampling technique such as cross-
validation to estimate the final accuracy of feature subsets.
d. Stopping criterion
A feature selector must decide when to stop searching through the space of
feature subsets. Depending on the evaluation strategy, a feature selector might stop
adding or removing features when none of the alternatives improves upon the merit of
a current feature subset. Alternatively, the algorithm might continue to revise the
feature subset as long as the merit does not degrade.
2.4.3 Filter-based feature selection methods
Among the evaluation strategies used by feature selection methods, filter-based feature selection (FS) methods were considered in this study to determine the relevant features among those present in the data collected from the study location (Maji and Garai, 2013). This is because filter-based FS algorithms define relevance by identifying the attributes that are most correlated with the target class, and they are less computationally expensive than wrapper-based FS algorithms, which require repeated runs of the supervised machine learning algorithm.
Three (3) classes of filter-based feature selection methods considered are as
follows:
Consistency-based
Consistency measures attempt to find a minimum number of features that distinguish between the classes as consistently as the full set of features does. An inconsistency arises when multiple training samples have the same feature values but different class labels. Dash and Liu (1997) presented an inconsistency-based FS technique called Set Cover. An inconsistency count is calculated as the difference between the number of all instances matching a pattern (ignoring the class label) and the largest number of those instances sharing a single class label, for a chosen feature subset. If there are n matching patterns in the training sample space, with c1 patterns belonging to class 1 and c2 patterns belonging to class 2, and if the largest number is c2, the inconsistency count is n − c2. Hence, given a training sample S, the inconsistency count of an instance A for a feature subset X' is defined as (Liu and Motoda, 1998):

IC_X'(A) = n_X'(A) − max_k n_X'(A, k)

where n_X'(A) is the number of instances in S equal to A using only the features in X', and n_X'(A, k) is the number of instances in S of class k equal to A using only the features in X'.

By summing all the inconsistency counts and averaging over the training sample size, a measure called the inconsistency rate for a given subset is obtained. The inconsistency rate of a feature subset X' in a sample S is then:

IR(X') = ( Σ_{A∈S} IC_X'(A) ) / |S|
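The inconsistency count and rate just defined translate directly into code; the sketch below (Python, with toy patterns invented for illustration) computes the inconsistency rate of a candidate feature subset.

    from collections import Counter

    def inconsistency_rate(patterns, labels):
        # IR(X') = sum over distinct patterns of (n - max_k n_k), divided by |S|
        matches, per_class = Counter(), Counter()
        for p, c in zip(patterns, labels):
            matches[tuple(p)] += 1
            per_class[(tuple(p), c)] += 1
        total = 0
        for key, n in matches.items():
            largest = max(cnt for (k, c), cnt in per_class.items() if k == key)
            total += n - largest
        return total / len(patterns)

    # three matching patterns, two in class 1 and one in class 0 -> count 3 - 2 = 1
    patterns = [(1, 0), (1, 0), (1, 0), (0, 1)]
    labels = [1, 1, 0, 1]
    print(inconsistency_rate(patterns, labels))  # 0.25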
Correlation-based (CFS)
Correlation measures are also called similarity or dependency measures. Gennari et al. (1989) stated that features are relevant if their values vary systematically with category membership; thus, a feature is useful if it is correlated with, or predictive of, the class, and otherwise it is irrelevant. Formally, a feature V_i (a variable monitored for CML survival) is said to be relevant (predictive of CML survival) if and only if there exist some v_i (a value of the variable – nominal or numeric) and c (a target class – survived or not survived) for which p(V_i = v_i) > 0, such that (Kohavi and John, 1996):

p(C = c | V_i = v_i) ≠ p(C = c)

The implication of this is that a good feature subset (set of variables predictive of CML survival) is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. It is important to state that a group of components (CML survival indicators) that are highly correlated with the target variable will at the same time bear low correlations with each other (Hall, 1999). Equation (2.30) is used as the heuristic measure for the merit of feature subsets in supervised classification:

Merit_S = (k · r̄_cf) / √(k + k(k − 1) · r̄_ff)    (2.30)

In equation (2.30), Merit_S is the heuristic merit of a feature subset S containing k features, r̄_cf is the average feature–class correlation, and r̄_ff is the average feature–feature inter-correlation. The equation forms the core of CFS and imposes a ranking on feature subsets in the search space of all possible feature subsets. Correlation criteria are often used for microarray data analysis.
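Equation (2.30) is straightforward to compute once the average correlations are available; the sketch below (Python) evaluates the merit of a hypothetical 3-feature subset, with the correlation values invented for illustration.

    import numpy as np

    def cfs_merit(feature_class_corr, feature_feature_corr, k):
        # Merit_S = k * r_cf / sqrt(k + k * (k - 1) * r_ff)   (equation 2.30)
        r_cf = np.mean(feature_class_corr)    # average feature-class correlation
        r_ff = np.mean(feature_feature_corr)  # average feature-feature correlation
        return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

    # a 3-feature subset highly correlated with the class, weakly with each other
    print(round(cfs_merit([0.60, 0.50, 0.55], [0.10, 0.20, 0.15], k=3), 3))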
Information-based
A probabilistic model of a nominal-valued feature Y (the target class, C) can be formed by estimating the individual probabilities of its values from the training data containing the records of the variables. If this model is used to estimate the value of the target class for a sample drawn from the training data, then the entropy of the model is the number of bits it would take, on average, to correct the output of the model. Entropy is a measure of the uncertainty or unpredictability of a system. The entropy of the target class Y is given by equation (2.31):

H(Y) = − Σ_y p(y) log2 p(y)    (2.31)
If the observed values of the target class in the training data are partitioned according to the values of an input feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of the target class prior to partitioning, then there is a relationship between the target class Y and the indicator variable X. Equation (2.32) gives the entropy of Y after observing X:

H(Y|X) = − Σ_x p(x) Σ_y p(y|x) log2 p(y|x)    (2.32)

The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called the information gain (Quinlan, 1986), or alternatively the mutual information. Thus, the information gain is given by:

IG(Y; X) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)    (2.33)
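In practice, information-based ranking of features can be performed with standard tooling; the sketch below (Python/scikit-learn, on a synthetic dataset) ranks features by an estimate of I(X_i; Y) – note that mutual_info_classif uses a nearest-neighbour estimator for continuous features rather than the discrete entropy formulas above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                               n_redundant=0, random_state=0)
    scores = mutual_info_classif(X, y, random_state=0)  # estimates I(X_i; Y)
    ranking = np.argsort(scores)[::-1]                  # Ranker-style ordering
    print(ranking, np.round(scores[ranking], 3))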
2.5 Existing Models for Risk Assessment of CML Survival
Factors that can be used to predict the likely outcome (prognosis) of CML
treatment are called prognostic factors. Prognostic scoring systems use these factors
to determine a patient‘s risk score. Based on the risk score, patients are classified into
risk groups – low, intermediate and high risk. People in the same risk group are
similar in certain ways and will likely respond to certain treatments in the same way.
Therefore, doctors often use risk scores to help guide treatment decisions. In general,
a person classified as low-risk is more likely to have a better response to treatment. The
two most popular prognostic scoring systems are Sokal score (Sokal et al., 1984) and
the Hasford score (Hasford et al., 1998).
2.5.1 Sokal score
The Sokal prognostic scoring system was developed from the examination of 813 patients with Philadelphia chromosome-positive, non-blastic chronic granulocytic Leukaemia (CGL) collected from six European and American series – Roswell Park Memorial Institute, University of Bologna, Italian Cooperative CML Study Group, Memorial Sloan-Kettering Cancer Centre, University of Barcelona and Duke University. The survival pattern of the population was typical of "good-risk" patients, and the median survival time was 47 months. The prognostic factors identified were age, spleen size, platelet count and blasts; equation 2.34 shows the hazard function for the Sokal scoring system, developed using Cox regression. The hazard ratio for each patient was calculated from the expression, which ranged from 0.41 to 5.68 for 677 patients.
Hazard ratio = exp[0.0116 × (age − 43.4) + 0.0345 × (spleen − 7.51) + 0.188 × ((platelet count / 700)² − 0.563) + 0.0887 × (blasts − 2.10)]    (2.34)
The model was used to identify a lower risk group of patients with a 2-year
survival of 90%, subsequent risk averaging somewhat less than 20%/year and median
survival of 5 years, an intermediate group and a high-risk group with a 2-year survival
of 65%, followed by a death rate of about 35%/year and median survival of 2.5 years.
A later study of CGL patients between 5 and 45 years of age was used to propose a Sokal scoring system for younger patients with CGL, complementing the earlier system proposed for people up to 84 years of age (Sokal et al., 1985). This scoring system identified five (5) prognostic factors, compared with four in the earlier scoring model: platelet count, spleen size, hematocrit, percentage of blasts and sex; equation 2.35 shows the expression of the scoring model proposed by Sokal for younger patients. The median survival of the patients under study exceeded four (4) years, while the hazard (death) rate averaged 22.5%, compared to 25% in the earlier study.
(In equation 2.35, platelets are expressed in 10⁹/L and sex is coded male = 1.0, female = 2.0.)
The Relative Risk (RR) of CML patients' survival using the Sokal score is classified as follows: a score below 0.8 indicates low risk; a score from 0.8 to 1.2, intermediate risk; and a score above 1.2, high risk.
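A direct transcription of equation (2.34) and the risk cut-offs is sketched below (Python); the patient values in the example are invented for illustration.

    import math

    def sokal_score(age, spleen_cm, platelets, blasts_pct):
        # hazard ratio of equation (2.34); platelets in 10^9/L
        return math.exp(0.0116 * (age - 43.4)
                        + 0.0345 * (spleen_cm - 7.51)
                        + 0.188 * ((platelets / 700.0) ** 2 - 0.563)
                        + 0.0887 * (blasts_pct - 2.10))

    def sokal_risk_group(score):
        if score < 0.8:
            return 'low'
        return 'intermediate' if score <= 1.2 else 'high'

    s = sokal_score(age=38, spleen_cm=10, platelets=350, blasts_pct=2)
    print(round(s, 3), sokal_risk_group(s))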
The Sokal score used scoring models developed from statistical regression
analysis for both older and younger CML patients undergoing treatment for CML. It
used four variables and five variables in the model for the older and younger CML
patients respectively. The scoring model developed measured survival as a
probabilistic value and as a function of the overall survival of the study cohort and
thus classifies the survival groups into three unique groups based on the interval of the
scores. Some variables collected in the study were removed because of their
inconsistency among data collected from the different hospitals used in the study.
Unlike the Sokal score, this study considers all the variables monitored during
Imatinib treatment of Nigerian CML patients from which relevant variables were
identified and the prediction model for CML survival classification developed using
machine learning algorithms.
2.5.2 Hasford score
The Hasford prognostic scoring system was developed from the examination of 1303 patients with Chronic Myeloid Leukaemia (CML) who were treated in prospective studies, including major randomized trials, separated into learning and validation samples; Cox regression analysis and the minimum P-value approach were used to identify the prognostic factors for the survival of CML patients undergoing Interferon-α treatment. According to Hasford et al. (1998), the survival model was developed owing to the limitation of the Sokal score in its ability to discriminate the survival risk groups of people undergoing Interferon-α treatment (Ohnishi et al., 1995; Hasford et al., 1996), the Sokal score having been developed for patients undergoing busulphan or hydroxyurea treatment. As a result of this, the Collaborative CML Prognostic Factors Project was started with the goal of extracting and validating a new prognostic scoring system for patients with CML treated with IFN-α.
The median survival time of the data collected for the study was 69 months
(within a range of 1 to 117 months); 908 patients were used as the learning sample
from which three distinct groups were identified. The low-risk group had a median
survival time of 98 months (n = 369, 40.6%), the intermediate-risk group had a
median survival time of 65 months (n = 406, 44.7%) and the high-risk group with a
median survival time of 42 months (n = 133, 14.6%). The prognostic scoring model
was validated using 285 patients' data. The dataset used was collected from 1573 individual patients from 14 studies, by searching the MEDLINE® biomedical literature database (National Library of Medicine, MD), from abstracts and conference reports, and by contacting pharmaceutical companies. Participants included were from Austria (Thaler et al., 1993; Thaler and Hilbe, 1996), Belgium, The Netherlands and Luxembourg, France (Guilhot, 1993; Guilhot et al., 1996), Germany (Hehlmann et al., 1994; Hehlmann et al., 1996; Kloke et al., 1993), the United Kingdom (Allan et al., 1995), Italy (Alimena et al., 1988), Japan (Ohnishi et al., 1995), Spain (Fernandez-Ranada et al., 1993) and the United States. The prognostic factors identified were age, spleen size, platelet count, blasts, basophils and eosinophils; equation 2.36 shows the hazard function for the Hasford scoring system, developed using Cox regression. The prognostic score for each patient was calculated from the expression: a score of 780 or less identifies the low-risk group, scores above 780 up to 1480 the intermediate-risk group, and scores above 1480 the high-risk group.
Score = (0.6666 × age + 0.0420 × spleen size + 0.0584 × blasts + 0.0413 × eosinophils + 0.2039 × basophils + 1.0956 × platelet count) × 1000    (2.36)

where age = 0 when < 50 years, 1 otherwise; spleen size is in cm below the costal margin; blasts and eosinophils are percentages of peripheral blood cells; basophils = 0 when < 3%, 1 otherwise; and platelet count = 0 when < 1500 × 10⁹/L, 1 otherwise.
The Relative Risk (RR) of CML patients' survival using the Hasford (Euro) score is classified as follows: a score of 780 or less indicates low risk; a score above 780 up to 1480, intermediate risk; and a score above 1480, high risk.
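Equation (2.36) and its cut-offs also translate directly into code; the sketch below (Python) computes the Hasford score for an invented, illustrative patient.

    def hasford_score(age, spleen_cm, blasts_pct, eosinophils_pct,
                      basophils_pct, platelets):
        # equation (2.36); platelets in 10^9/L
        return (0.6666 * (1 if age >= 50 else 0)
                + 0.0420 * spleen_cm
                + 0.0584 * blasts_pct
                + 0.0413 * eosinophils_pct
                + 0.2039 * (1 if basophils_pct >= 3 else 0)
                + 1.0956 * (1 if platelets >= 1500 else 0)) * 1000

    def hasford_risk_group(score):
        if score <= 780:
            return 'low'
        return 'intermediate' if score <= 1480 else 'high'

    s = hasford_score(age=38, spleen_cm=10, blasts_pct=2,
                      eosinophils_pct=1, basophils_pct=2, platelets=350)
    print(round(s), hasford_risk_group(s))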
According to Oyekunle et al. (2012b) the Sokal score was not predictive of
differences in the Overall Survival (OS) of Nigerian patients in the study but it
sufficiently differentiates the Progression Free Survival (PFS) of patients in the risk
groups identified. The Hasford scoring system performed better, as its positive predictive value was more statistically significant for differences in PFS; but it also failed to predict differences in the Overall Survival of CML patients treated with Imatinib.
The results of Oyekunle et al. (2012) revealed that the diagnostic parameters used in
the Sokal and Hasford scores may need to be reviewed, if they are to retain their
prognostic relevance in the Imatinib era, especially regarding predicting OS and
Complete Cytogenetic Responses (CCR).
The Hasford score uses a scoring model developed using statistical regression
analysis for European CML patients receiving Interferon-alfa treatment. It uses six
variables in the model for CML survival. Like the Sokal score, the scoring model
developed measures survival as a probabilistic value and as a function of the overall
survival of the study cohort and thus classifies the survival groups into three unique
groups based on the interval of the scores. Unlike the Sokal and Hasford scores, this
study considers all the variables monitored during Imatinib treatment of Nigerian
CML patients from which relevant variables were identified and the prediction model
for CML survival classification developed using machine learning algorithms.
2.5.3 European treatment and outcome study (EUTOS) Score
Following the failure of the Sokal and Hasford scoring models at predicting the overall survival (OS) of Chronic Myeloid Leukaemia in the Imatinib era, there was a need for the development of a newer scoring model based on CML patients undergoing Imatinib treatment. According to Jabbour et al. (2012), the European LeukaemiaNet developed a new prognostic scoring system (the European Treatment and Outcome Study [EUTOS] score) using data from 2060 patients with newly diagnosed Chronic Myeloid Leukaemia in chronic phase treated with Imatinib-based regimens (Hasford et al., 2011).
The EUTOS score classifies patients into two (2) risk groups – low and high – with significant correlations with the achievement of an 18-month complete cytogenetic response (CCR) and with progression-free survival (PFS). However, studies showed that the adoption of the EUTOS score in predicting the OS and PFS of CML patients undergoing Imatinib treatment still requires validation (Jabbour et al., 2012; Marin et al., 2011). Equation (2.37) shows the expression for the prognostic score proposed by the European LeukaemiaNet, the European Treatment and Outcome Study (EUTOS) score:

EUTOS score = 7 × basophils (%) + 4 × spleen size (cm)    (2.37)

The Relative Risk (RR) of CML patients' survival using the EUTOS score is classified as follows: a score of 87 or less indicates low risk, while a score above 87 indicates high risk.
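The EUTOS score is the simplest of the three to compute, as the sketch below (Python, with invented illustrative values) shows.

    def eutos_score(basophils_pct, spleen_cm):
        # equation (2.37)
        return 7 * basophils_pct + 4 * spleen_cm

    def eutos_risk_group(score):
        return 'low' if score <= 87 else 'high'

    s = eutos_score(basophils_pct=3, spleen_cm=12)
    print(s, eutos_risk_group(s))  # 69 low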
The EUTOS score used a scoring model developed from statistical regression
analysis for European CML patients receiving Imatinib treatment. It used two
variables in the model for CML survival. Like the earlier scoring models, EUTOS
measures survival as a probabilistic value and as a function of the overall survival of
the study cohort and thus classifies the survival groups into two unique groups based
on the interval of the scores. The EUTOS score has more relevance to CML patients receiving Imatinib treatment, but it was derived from, and so far restricted to, European CML patients.
The proposed study considers all the variables monitored during Imatinib
treatment of Nigerian CML patients from which relevant variables were identified and
the prediction model for CML survival classification developed using machine
learning algorithms.
A summary of the variables identified by each proposed scoring model is shown in Table 2.8. All variables were identified using statistical significance tests (p ≤ 0.05) to determine the most relevant of the variables collected during follow-up under the respective treatment used in proposing each scoring model.
2.6 Related Works
There are a number of published works on the application of machine learning algorithms to cancer risk assessment, survivability and recurrence. Most of the published works, however, address other types of cancer; none applies machine learning algorithms to CML survival. A number of such works have nevertheless stressed the effectiveness of machine learning in developing efficient prediction models.
Idowu et al. (2015) developed a predictive model for the survival of pediatric
sickle cell disease (SCD) using clinical variables. The predictive model was
developed with fuzzy logic using three (3) clinical variables, while the rules for the inference engine were elicited from an expert pediatrician. The fuzzy logic-based model
was not validated using live clinical datasets. Moreover, relevant variables for SCD
survival could have been easily identified using feature selection methods from a
larger collection of variables monitored for pediatric SCD survival.
Table 2.8: Variables identified by existing risk scoring models for CML

S/N  Sokal           Sokal (Younger)  Hasford           EUTOS
1    Age             Sex              Age               Basophils (%)
2    Spleen (cm)     Spleen (cm)      Spleen (cm)       Spleen (cm)
3    Platelet count  Platelet count   Platelet count    –
4    Blasts (%)      Blasts (%)       Blasts (%)        –
5    –               PCV              Eosinophils (%)   –
6    –               –                Basophils (%)     –
Agbelusi (2014) performed a comparative analysis of three supervised
machine learning algorithms for the prediction of the survival of pediatric HIV/AIDS
patients. The machine learning algorithms used were naïve Bayes, decision trees and
multi-layer perceptron, without the application of feature selection algorithms to
identify relevant features. Rather than basing the prediction of HIV/AIDS survival on
the features used in the study, a larger number of features monitored in HIV/AIDS
patients could have been collected and feature selection methods used to identify the
relevant features for HIV/AIDS survival.
Ahmad et al. (2013) performed a comparative analysis of three machine
learning algorithms for the prediction of breast cancer recurrence. Support vector
machines (SVM), multi-layer perceptron and decision trees algorithms were used to
formulate the model. Feature selection was not used before model formulation
to identify the relevant variables for cancer recurrence. The
identification of relevant variables could have improved the performance of the models
developed using the machine learning algorithms.
Thongkam and Sukmak (2012) performed a comparative analysis of the
combination of bagging with several weak learners to build five (5) breast cancer
survivability prediction models. 10-fold cross-validation was used to train the models
using decision tree learner algorithms, namely J48 (also called C4.5), REPTree and
decision stump, in combination with the bagging algorithm, using 14 and 10
attributes from the breast cancer dataset. The results showed a
significant improvement in the performance of the prediction model developed using
10 attributes compared to using the 14 attributes. Other supervised machine learning
algorithms apart from the decision trees algorithm could have been used to justify the
performance of the decision trees algorithm in developing predictive models for cancer
survival.
Ganda and Chahar (2013) performed a comparative analysis of predictive
models developed for predicting the survival of heart failure patients using unsupervised
machine learning algorithms. The k-means clustering algorithm was used to classify the
survival of heart failure patients into two (2) groups following the application of a
correlation-based feature selection algorithm to select relevant variables for heart
failure survival. The performance of the k-means clustering algorithm improved
when the identified relevant variables were used compared to using all the variables.
The predictive performance of other supervised machine learning
algorithms should have been compared to that of the k-means algorithm in order to
justify its performance.
Yussuff et al. (2012) applied statistical methods to the development of a
predictive model for the risk of breast cancer using features identified from breast
mammogram images. Logistic regression was used in the development of the
prediction model, whereas supervised machine learning algorithms would have
produced a more effective and efficient model owing to the identification of relevant
variables using feature selection methods.
Vanneschi et al. (2011) performed a comparative analysis of four supervised
machine learning algorithms for the prediction of breast cancer survival. The
algorithms used included the genetic algorithm (GA),
support vector machines (SVM), artificial neural networks (ANN) and random forest
decision trees, using 70 gene signatures as attribute variables. The genetic algorithm
outperformed the other methods due to its ability to identify relevant variables for
predicting breast cancer survival using its principle of natural selection. Feature
selection methods could have been used to identify the most relevant variables before
applying the machine learning algorithms, thereby improving performance.
Luo and Chang (2010) applied machine learning algorithms to the
classification of breast masses on digital mammograms. The C4.5 decision trees and
support vector machines algorithms were used following the application of
forward and backward feature selection algorithms for the identification of relevant
features. It was discovered that there was no significant difference in the performance
of the algorithms following the application of the feature selection methods. The
limitation of the study was the inability of the feature selection methods used to
adequately identify the variables predictive of breast cancer, leading to the inability to
improve the model's performance. Other filter-based feature selection methods should
have been explored in order to identify the most relevant features, leading to an
improved performance of the model.
CHAPTER THREE
RESEARCH METHODOLOGY
3.1 Introduction
In this chapter, the methodology applied in this study is clearly defined. The
chapter starts with a description of the framework for the research methodology,
which explains the series of steps required, from data identification and
collection through model formulation to performance evaluation of the developed
predictive models. Before the formulation of the model using machine learning
algorithms, filter-based feature selection methods were used to identify the relevant
features for predicting the survival of Chronic Myeloid Leukaemia (CML) in Nigerian
patients at the identified study location. In addition, the selected machine learning
algorithms adopted for the formulation of the predictive model are presented alongside
a description of their respective loss/cost functions used in the model formulation
process. Finally, the metrics of performance evaluation are presented alongside the
simulation environment chosen for the study.
3.2 Methodology Framework of the Study
This study involves the use of supervised machine learning algorithms in the
development of a predictive model for CML survival using data collected from a study
location in South-western Nigeria. Figure 3.1 shows the methodology framework
applied in the development of the predictive model for CML survival in
Nigerian patients. The study began with the identification of the variables monitored
during the follow-up of Imatinib treatment administered to CML patients by the
physicians at the study location, and the collection of the dataset containing the
identified variables for patients at the study location.
Figure 3.1: Methodology framework for predictive modeling of CML survival
The dataset collected from the hospital formed the basis of the historical
dataset, which contains records of predictive parameters (survival indicators)
and the survival time (output variable). Filter-based feature selection methods were
used to identify the most relevant and important features among those collected,
based on the distribution of the dataset collected from the study location. The reduced
feature set was identified to be predictive of CML patients' survival. Following this,
the historical dataset containing the reduced feature set was divided into training and
testing datasets and fed to each supervised machine learning algorithm proposed for
this study using the n-fold cross-validation evaluation method. The
performance of each combination of filter-based feature selection method and
supervised machine learning algorithm was used to identify the most effective and
efficient predictive model for CML survival.
3.3 Data Identification and Collection
This section highlights the process involved in identifying the data containing
the variables monitored during Imatinib treatment of Nigerian CML patients. Each
variable was identified and properly defined along with its respective units.
The method of data collection is also clearly stated, showing from whom the data
was collected and the instruments of data collection used at the data source, alongside
the identification of the different survival classes in the dataset.
3.3.1 Identification of variables monitored during Imatinib treatment
Following the review of literature in the body of knowledge of chronic
myeloid Leukaemia survival, a number of features were identified as being monitored
during the follow-up of CML patients receiving Imatinib treatment. The variables
identified in related literature were compared to the variables
monitored by physicians of the Obafemi Awolowo University Teaching Hospital
Complex (OAUTHC), Ile-Ife – the only referral hospital for Imatinib treatment of CML
in Nigeria.
Table 3.1 gives a description of the variables identified as being
monitored in CML patients receiving Imatinib treatment in Nigeria; this information
comprises socio-demographic, clinical and CML survival-related data. The
variables identified include: the age of the patient (in months), time to start of
treatment (Imatinib) from date of diagnosis (in months), gender (male or female),
spleen and liver size, packed cell volume (PCV), white blood cell (WBC) count,
platelet count, basophils (measured as a %), eosinophils (measured as a %), disease
phase at diagnosis – Chronic (CP), Acute (AP) and Blast crisis Phase (BP) – vital
status (alive or dead) and the survival time (measured in days).
A description of each identified variable (attribute) follows:
a. Time of Imatinib treatment from date of diagnosis: the time from the
date of the diagnosis of CML by the physician to the start of Imatinib
treatment; it is a numeric value recorded as a number of days or months;
b. Age at diagnosis: the age of the patient at the time of diagnosis; it is a numeric
value expressed as a number of days, months or years;
c. Gender: the sex of the CML patient; it is a nominal value which
is recorded as either male or female;
d. Spleen size: the distance by which the spleen extends below the costal margin;
it is a numeric value expressed in centimeters (cm);
e. Liver size: the increase in the size of the liver; it is a numeric
value expressed in centimeters (cm);
Table 3.1: Variables monitored during the follow-up of Imatinib treatment

Name                                              | Unit of Measure | Labels
Time of Imatinib treatment from date of diagnosis | Months          | Numeric
Age at diagnosis                                  | Months          | Numeric
Gender                                            | Nil             | Male, Female
Spleen size                                       | cm              | Numeric
Liver size                                        | cm              | Numeric
Packed Cell Volume (PCV)                          | %               | Numeric
White Blood Cell (WBC)                            | ×10⁹/L          | Numeric
Platelet Count                                    | ×10⁹/L          | Numeric
Basophil                                          | %               | Numeric
Eosinophil                                        | %               | Numeric
Phase at Diagnosis                                | Nil             | CP, AP, BP
Survival Time (ST)                                | Days            | Numeric
Vital Status (VS)                                 | Nil             | Alive, Dead
f. Packed Cell Volume (PCV): also called the hematocrit level, is the
volume percentage of red blood cells in the blood; it is normally about 45%
for men and 40% for women; it is a numeric value expressed as a percentage
(%);
g. White Blood Cell (WBC) count: an indicator of the presence of a systemic
disorder or a bone marrow disease; it is the number of white blood cells found
in one (1) litre of blood. It is a numeric value expressed as ×10⁹/L;
h. Basophil: a type of white blood cell that circulates in the human blood; it is a
numeric value expressed as a percentage (%) of the cells found in the blood;
i. Platelet count: a measure of how many platelets there are in the blood;
platelets allow the clotting of blood during injuries or cuts. It is a
numeric value expressed as ×10⁹/L;
j. Eosinophils: a count of eosinophils (a type of white blood cell) in the blood.
It is a numeric value expressed as a percentage (%);
k. Disease phase at diagnosis: the phase of the disease at the
time of diagnosis; it is a nominal value expressed as CP for chronic phase, AP
for acute phase and BP for blast crisis phase;
l. Vital Status: the status of the patient at the time
the information was collected; it is a nominal value expressed as either dead or
alive; and
m. Survival time: the period during which the patient was
monitored; it is a numeric value expressed as a number of days (or
months).
3.3.2 Data collection of variables monitored
Following ethical approval by the Health Records Ethical Committee (HREC)
approval board of the Obafemi Awolowo University Teaching Hospital Complex
(OAUTHC), the data required for the development of the predictive model for the
survival of CML patients receiving Imatinib treatment were collected. There was no
need for consent forms since the patients were not required to partake in the study;
rather, electronic data containing information about each CML patient, excluding
personal information (e.g. names, address, hospital ID, contact number etc.), were
collected from the OAUTHC health records.
The data collected were stored in spreadsheet format and copied using a flash
drive following the identification of the variables monitored during follow-up of
Imatinib treatment. For the purpose of handling the problem as a classification
problem, the target class (output variable) was determined using three labels, namely:
survived, not survived and censored.
Survived: refers to the CML patients that lived up to or beyond the
estimated survival time (2 or 5 years), whether dead or alive (vital
status);
Not Survived: refers to the CML patients that did not live up to the estimated
survival time and are dead; and
Censored: refers to the CML patients that were lost during follow-up for
one reason or another – the patient's survival time is less than the estimated
survival time and they are still alive.
The pseudo-code below was used to assign a target
class (Survived, Not Survived or Censored) to each patient's record using the
values of the vital status and the survival time of each patient (in days). The
number of days in a year was assumed to be 364, based on 52 weeks in a year
and 7 days in a week.
If (Survival time >= n)
Then Survival class = “Survived”
Else if ((Survival time < n) AND (Vital Status = “Alive”))
Then Survival class = “Censored”
Else
Survival Class = “Not Survived”.
End if.
where n is the time in days (728 days and 1820 days for 2 and 5 years respectively)
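A minimal Python sketch of this labeling rule follows; the function and variable names
are illustrative rather than part of the study's implementation.

# Minimal sketch of the survival-class assignment rule above (names illustrative).
def survival_class(survival_time_days, vital_status, n):
    # n is the estimated survival time: 728 days (2 years) or 1820 days (5 years)
    if survival_time_days >= n:
        return "Survived"
    elif vital_status == "Alive":
        return "Censored"        # lost to follow-up before reaching n days
    else:
        return "Not Survived"

# Example: a patient followed for 800 days who is still alive
print(survival_class(800, "Alive", 728))   # -> Survived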
Following the identification of the target survival class using the pseudo-code
above for each patient, the records with the target class identified as censored were all
removed from the original dataset. This was because the study is only
concerned with patients who were followed up for treatment and lived
up to the estimated time, or who died during the course of receiving treatment at
the hospital. The variables monitored are assumed to contain the variables relevant to
predicting the survival of CML disease.
3.4 Identification of Relevant Features Predictive for CML Survival
Following the process of identification and collection of the data needed
for developing the predictive model, it was necessary to determine which set of
variables is more predictive for CML survival classification. According to the
literature, identifying the variables relevant to CML survival prediction is likely to
improve the performance of the supervised machine learning algorithms
and also reduce the complexity of the model.
The basic algorithm for implementing the filter-based feature selection
methods used in this study is shown below. The training
data collected contains a set of CML patients' records X and a feature set F of
attributes (risk factors). The algorithm may start with either an empty set or a subset
of X, using a search strategy to select the initial feature subset X'. The independent
measure Im evaluates each generated subset Xg and compares it to the previous optimal
subset evaluation, starting from the initial feature subset X'. The search iterates until
the stopping criterion is met. Finally, the algorithm outputs the current optimal feature
subset Xopt.
The algorithm is presented as follows (Kumar and Minz, 2014):
INPUT:
  D = {X, F}   //A training dataset with n CML patients' records:
               //X – monitored variables from CML patients,
               //F – survival class labels of the n patients
  X'           //Predefined initial feature subset
  δ            //Stopping criterion
OUTPUT: Xopt   //An optimal subset – relevant indicators of CML survival
Begin:
  Initialize:
    Xopt = X';             //apply a search algorithm of choice
    γopt = Im(X', D);      //evaluate X' using an independent measure Im
  do begin
    Xg = generate(D);      //select next subset of CML indicators for evaluation
    γg = Im(Xg, D);        //evaluate current subset using Im
    if (γg is better than γopt)   //if new evaluation is better than previous
      γopt = γg;           //replace old evaluation with new
      Xopt = Xg;           //replace initial subset of features with new one
  repeat (until δ is reached);
  end
  return Xopt;             //identified optimal set of relevant features of CML survival
end;
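The loop above can be sketched in Python as follows; the subset generator, the
independent measure Im and the stopping test are assumed to be supplied by the
concrete FS method, and all names are illustrative.

# Sketch of the generic filter-based feature selection loop (names illustrative).
def filter_select(initial_subset, generate_next, im, stop):
    x_opt = set(initial_subset)
    gamma_opt = im(x_opt)              # evaluate the initial subset with Im
    while not stop(x_opt, gamma_opt):
        x_g = generate_next(x_opt)     # next candidate subset of CML indicators
        gamma = im(x_g)                # evaluate the candidate subset with Im
        if gamma > gamma_opt:          # keep the better evaluation
            gamma_opt, x_opt = gamma, set(x_g)
    return x_opt                       # optimal subset of relevant features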
Following is a description of the feature selection methods used for the
evaluation of relevant features in the CML survival training dataset alongside their
respective independent evaluation measure, Im.
3.4.1 Consistency-based
The consistency-based FS measures the minimum number of CML survival
features that distinguish between the target classes (Survived and Not Survived) as
consistently as the full set of 13 identified features. Inconsistency arises when
multiple CML patient records have the same feature values but different class labels.
An inconsistency count was calculated as the difference between the number of all
matching patterns (ignoring the class label) and the largest number of those patterns
belonging to a single class label for a chosen subset. Hence, for every n matching
patterns in the CML patients' training sample space, if there are c1 patterns belonging
to class 1 (e.g. Survived) and c2 patterns belonging to class 2 (e.g. Not Survived), and
the largest number is c2, then the inconsistency count was calculated as n – c2. Hence,
given the training sample S, the inconsistency count of an instance \(A \in S\) is:

\[ IC(A) = X(A) - \max_{k} X_{k}(A) \qquad (3.1) \]

where \(X(A)\) is the number of instances in S equal to A using only the features in
the candidate subset F', and \(X_k(A)\) is the number of instances in S of class k equal
to A using only the features in F'.
The sum of all the inconsistency counts, averaged over the size of the training
sample, was used to measure the inconsistency rate for a given subset of
attributes. The inconsistency rate of a feature subset F' in a sample S is then:

\[ IR(F', S) = \frac{\sum_{A \in S} IC(A)}{|S|} \qquad (3.2) \]
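A minimal Python sketch of equations (3.1) and (3.2), assuming each record has been
reduced to the tuple of values it takes on the candidate subset F' (names and example
values are illustrative):

from collections import Counter, defaultdict

def inconsistency_rate(rows):
    # rows: list of (pattern, class_label) pairs, where pattern is the tuple
    # of values a record takes on the candidate feature subset F'
    per_pattern = defaultdict(Counter)
    for pattern, label in rows:
        per_pattern[pattern][label] += 1
    # IC(A) = X(A) - max_k X_k(A), summed over all distinct patterns (eq. 3.1)
    total_ic = sum(sum(c.values()) - max(c.values())
                   for c in per_pattern.values())
    return total_ic / len(rows)        # IR(F', S) = sum of IC / |S| (eq. 3.2)

# Example: two records share a pattern but disagree on the survival class
rows = [(("<31", "CP"), "Survived"), (("<31", "CP"), "Not Survived"),
        ((">=31", "BP"), "Not Survived")]
print(inconsistency_rate(rows))        # -> 0.333...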
3.4.2 Correlation-based (CFS)
The correlation-based FS, also called a similarity measure, uses dependency
measures between features and the target class to determine relevant features.
The implication is that a good feature subset (set of variables that are predictive of
CML survival) is one that contains features highly correlated with (predictive of) the
class, yet uncorrelated with (not predictive of) each other. In other words, the group
of components (CML survival indicators) that are highly correlated with the target
variable will at the same time bear low correlations with each other. Equation (3.3)
shows the evaluation measure used by the correlation-based FS in identifying relevant
variables:

\[ Merit_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (3.3) \]

In equation (3.3), \(Merit_{S_k}\) was used to score a subset of risk factors
containing k features, \(\overline{r_{cf}}\) is the average feature–class
correlation and \(\overline{r_{ff}}\) is the average feature–feature inter-correlation.
The equation forms the core of CFS and imposes a ranking of feature subsets (sets of
predictive indicators of CML survival) in the search space of all possible feature
subsets.
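A small Python sketch of the merit computation in equation (3.3), assuming the
average correlations have already been estimated from the data (the values below are
illustrative):

import math

def cfs_merit(k, r_cf, r_ff):
    # k features, average feature-class correlation r_cf,
    # average feature-feature inter-correlation r_ff (equation 3.3)
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Example: 4 features, feature-class correlation 0.5, inter-correlation 0.2
print(round(cfs_merit(4, 0.5, 0.2), 2))   # -> 0.79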
3.4.3 Information-based
Information-based FS is a probabilistic model for a nominal-valued feature,
formed by estimating the individual probabilities of the values
of the target class from the training data containing the records of variables monitored
during follow-up of Imatinib treatment for CML. This model was used to estimate
the entropy of the target class followed by that of the attribute feature. The
entropy is the number of bits it would take, on average, to correct the output of the
model; it is a measure of the uncertainty or unpredictability in a system. The
entropy of the target class Y (CML survival – Survived or Not Survived) was
estimated from the training dataset using equation (3.4), while equation (3.5) was used
to estimate the entropy of the target class Y after observing the attribute X:

\[ H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) \qquad (3.4) \]

\[ H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \qquad (3.5) \]

If the observed values of the target class in the training data are partitioned
according to the values of an input feature X, and the entropy of Y with respect to the
partitions induced by X is less than the entropy of the target class prior to partitioning,
then there is a relationship between the target class Y and the indicator variable X.
The amount by which the entropy of Y decreases reflects the additional information
about Y provided by X and is called the information gain (Quinlan, 1986), or
alternatively mutual information. Thus, the information gain is given by:

\[ IG(Y; X) = H(Y) - H(Y \mid X) \qquad (3.6) \]
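A short Python sketch of equations (3.4) to (3.6), computing the information gain of a
single nominal attribute against the survival class (the example values are
illustrative):

import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum p(y) log2 p(y)  (equation 3.4)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    # IG(Y; X) = H(Y) - H(Y | X)  (equations 3.5 and 3.6)
    n = len(ys)
    parts = {}
    for x, y in zip(xs, ys):
        parts.setdefault(x, []).append(y)
    h_y_given_x = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(ys) - h_y_given_x

xs = ["high", "high", "low", "low"]                        # attribute values
ys = ["Survived", "Survived", "Not Survived", "Survived"]  # survival class
print(round(information_gain(xs, ys), 3))                  # -> 0.311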
3.5 Formulation of Predictive Model for CML Survival
Following the identification of the most relevant and predictive variables
(prognostic factors) for CML survival prediction, the next phase was the formulation
of the predictive model for CML survival using the identified variables. Mathematical
expressions called mapping functions were used to express the process of model
development (and its loss function), following the description of the selected
supervised machine learning (SML) algorithms adopted for the purpose of this study.
The training dataset S consists of the original features identified at the point of
data identification and collection, represented by \(X_i\), where i is the number of
features existing in the original dataset of patients whose records were collected
(number of CML survival cases); \(X_j\) consists of the features relevant for
predicting the survival of CML, where \(j \leq i\). The process of feature selection is
represented by the mapping:

\[ FS: X_i \rightarrow X_j \qquad (3.7) \]

where \(X_i\) is the original set of attributes collected and \(X_j\) is the set of relevant
features selected by the FS method.
Following the process of feature selection, the new CML patients' records are
\(X_{k,j}\), where k is the number of CML patients' records and j is the number of
relevant features selected from the original i features. If k records were selected for
training the supervised machine learning (SML) algorithms adopted for the model
using the relevant variables, then the model can be represented by the mapping:

\[ \varphi: X_{k,j} \rightarrow Y_k \qquad (3.8) \]

where \(X_{k,j}\) is the set of relevant attributes j for patient k and \(Y_k\) is the survival
class of patient k given the values of \(X_{k,j}\).
The mapping function which describes the predictive model formulated for
CML survival using the identified risk factors/variables (relevant features) is:

\[ \varphi(X_{k,j}) = \begin{cases} \text{Survived} \\ \text{Not Survived} \end{cases} \qquad (3.9) \]

where \(\varphi\) is as described in equation (3.8).
Supervised machine learning (SML) algorithms are generally black-box
models, which implies that there is no general equation that can be used to describe
the predictive model mathematically. However, all SML
algorithms have a metric that is used to evaluate how well the model is doing during
the training and testing process of model development. The following are the SML
algorithms that were used in this study for the development of the predictive model
for CML survival classification.
3.5.1 Decision trees algorithm (DT)
There are different implementations of the decision trees algorithm, but
for this study the C4.5 decision trees algorithm was adopted. The tree was built
from the training dataset S by making it the root node. For every iteration, the
entropy and information gain were estimated for each attribute X (risk factor for
CML survival) in the training dataset S. The algorithm selects the attribute with the
highest information gain, and the set S is split by the attribute's labels (e.g. phase of
disease is chronic, acute or blast) to produce subsets of data.
The algorithm was continued on each subset using attributes never used before, in
order to construct a tree with each non-terminal node representing the selected
attribute (CML survival risk factor) on which the data was split and the terminal nodes
representing the class labels (CML survival class) of the final branch.
Assuming that X (e.g. sex of patient) is an attribute and Values(X) is the set
of labels assigned to X (e.g. the labels assigned to sex of patients are male and female),
the Information Gain (IG) and Split Criterion used for the tree construction by the
C4.5 DT algorithm are shown in equations (3.10) and (3.11) respectively:

\[ IG(S, X) = H(S) - \sum_{v \in Values(X)} \frac{|S_v|}{|S|}\, H(S_v) \qquad (3.10) \]

\[ SplitInfo(S, X) = -\sum_{v \in Values(X)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \qquad (3.11) \]

where \(H(S)\) is the entropy of the training set S as defined in equation (3.4) and
\(S_v\) is the subset of S for which attribute X has value v. C4.5 selects the split that
maximizes the gain ratio: the information gain of equation (3.10) divided by the
split information of equation (3.11).
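A self-contained Python sketch of the gain-ratio computation implied by equations
(3.10) and (3.11); the attribute and class values passed in would be columns of the
CML training data (names illustrative):

import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(xs, ys):
    # xs: values of attribute X per record; ys: survival class per record
    n = len(ys)
    parts = {}
    for x, y in zip(xs, ys):
        parts.setdefault(x, []).append(y)
    info_gain = _entropy(ys) - sum(len(p) / n * _entropy(p)
                                   for p in parts.values())      # eq. (3.10)
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in parts.values())                   # eq. (3.11)
    return info_gain / split_info if split_info > 0 else 0.0

# C4.5 splits on the attribute with the highest gain ratio.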
3.5.2 Support vector machines (SVM)
The support vector machine (SVM) is a supervised machine learning
algorithm used for both classification and regression problems. In this study, the
sequential minimal optimization (SMO) algorithm proposed by John Platt was used to
formulate the predictive model for CML survival classification using SVM.
Consider the training dataset containing the relevant risk factors for CML survival,
consisting of j features (risk factors) \(x_i\) for each CML patient alongside the target
class for each patient, represented as \(y_i \in \{-1, 1\}\); the dataset can then be
represented as a set of pairs \((x_1, y_1), \ldots, (x_n, y_n)\).
The problem is to produce a soft-margin hyperplane able to
separate the members of each class; this was handled using the SMO
algorithm to solve the problem as a quadratic programming problem such that in dual
form we have:

\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (3.12) \]

subject to the constraints:

\[ 0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]
Figure 3.2: SVM linear hyperplane separation for CML survival classification
where C is an SVM hyper-parameter, \(K(x_i, x_j)\) is a kernel function and the
\(\alpha_i\) are Lagrange multipliers.
Assuming a kernel function was used for the binary classification of CML
survival and the data were observed to be linearly separable, Figure 3.2 shows a
description of the separation of the members of the two classes (Survived and Not
Survived cases) by the hyperplane created using the linear kernel function;
for the purpose of this study, however, a polynomial kernel function was used.
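For illustration only (the study itself used Weka's SMO implementation), an SVM
with a polynomial kernel can be sketched with scikit-learn's SVC; X_train and
y_train stand for a hypothetical matrix of selected risk factors and the corresponding
survival classes.

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative SVM with a polynomial kernel, comparable in spirit to Weka's SMO
model = make_pipeline(StandardScaler(),
                      SVC(kernel="poly", degree=2, C=1.0))
# model.fit(X_train, y_train); predictions = model.predict(X_test)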
3.5.3 Multi-layer perceptron (MLP)
The multi-layer perceptron is an artificial neural network architecture with
one or more hidden layers which makes use of the feed-forward and back-propagation
algorithms for the development of the predictive model for CML survival
classification from a given training dataset of CML patients' records. The relevant
attributes (risk factors) selected for CML survival classification were applied as
input neurons to the input layer of the MLP with initial weights \(w_i\) attached to each
respective input i. The initial values of the weights attached to each input were
randomly assigned using a random number generator; a bias value b was also
added to the sum of products of the inputs (risk factors) and weights. Equation
(3.13) shows a mathematical representation of what happens at the input layer of the
MLP.
If the i relevant risk factors for CML survival classification selected using the FS
methods, which were used as input neurons, are represented by the set
\(\{x_1, x_2, \ldots, x_i\}\), then the attachment of the synaptic weights and the bias
will be (see Figure 3.3):

\[ z = \sum_{i} w_i x_i + b \qquad (3.13) \]
Figure 3.3: Artificial Neural Network Structure for CML survival classification
Following the attachment of the weights to their respective input neurons, the
forward-propagation algorithm sends the result of equation (3.13) to the activation
functions in the k hidden layers. The input layer of the MLP is represented by k = 1
while the respective hidden layers are represented as k = 2 to k for k hidden layers.
The activation function used in this study was the sigmoid function, defined in
equation (3.14):

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (3.14) \]
The results of equation (3.14) were propagated through the hidden layers of
the MLP until they reached the output layer, where the value of the prediction made
by the model is presented as p. Following this, the second phase of the algorithm was
performed, where the difference between the predicted value p and the actual value y
is used to estimate the error E. Recall that the value of the output p depends on the
weighted sum of all the inputs as indicated in equation (3.13), implying that the error
depends on the incoming weights of the neurons, which need to be changed.
The gradient descent algorithm was used to minimize the error and find the
optimal weights that satisfy the problem. The error function is defined by equation
(3.15) while the partial derivative of the error with respect to each weight
assigned to the neurons is defined by equation (3.16):

\[ E = \frac{1}{2} \sum_{k} (p_k - y_k)^2 \qquad (3.15) \]

\[ \frac{\partial E}{\partial w_{ij}} = \delta_j O_i, \quad \text{where } \delta_j = \begin{cases} (O_j - y_j)\, O_j (1 - O_j) & \text{if } j \text{ is an output neuron} \\ \left( \sum_{k} \delta_k w_{jk} \right) O_j (1 - O_j) & \text{if } j \text{ is a hidden neuron} \end{cases} \qquad (3.16) \]
Thus, the derivative with respect to \(O_k\) was calculated from the estimated
derivatives with respect to the outputs \(O_k\) of the next layer (the one closer to the
output neuron). Therefore, in order to update the weight \(w_{ij}\) using gradient
descent, one must choose a learning rate \(\eta\). The change in weight, which is added
to the old weight of equation (3.13), was determined by equation (3.17):

\[ \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \qquad (3.17) \]
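A minimal numpy sketch of one gradient-descent update for a single sigmoid output
neuron, following equations (3.13) to (3.17); x, y and the learning rate eta are
illustrative values, not the study's settings.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # equation (3.14)

def update(w, b, x, y, eta=0.1):
    p = sigmoid(np.dot(w, x) + b)        # forward pass: equations (3.13), (3.14)
    delta = (p - y) * p * (1 - p)        # output-neuron error term: equation (3.16)
    w = w - eta * delta * x              # weight change: equation (3.17)
    b = b - eta * delta
    return w, b

w, b = np.zeros(3), 0.0                  # three illustrative risk factors
w, b = update(w, b, np.array([1.0, 0.0, 1.0]), y=1)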
Figure 3.4 shows a conceptual view of the comparative analysis approach used
in proposing the most effective combination of feature selection strategy and
supervised machine learning algorithm needed to develop the predictive model for the
classification of CML patients' survival. As shown in the diagram, three (3) feature
selection methods (each alongside a search strategy) were used to propose the
variables most predictive of the 2- and 5-year survival, following which three (3)
supervised machine learning algorithms were used to formulate the predictive models
using the variables selected by each feature selection process. In all, nine (9)
predictive models were developed for each of the 2- and 5-year survival data. Each
model's performance was evaluated using a validation procedure and a set of
performance metrics, following which the most effective and efficient predictive
model for CML survival was selected and proposed.
3.6 Simulation of Predictive Model for CML Survival
Following the identification of the supervised machine learning algorithms
needed to formulate the predictive model for CML survival, the simulation of
the predictive model was performed using the data collected on the variables
monitored in CML patients receiving Imatinib treatment at the referral hospital in
Nigeria.
Figure 3.4: Conceptual view of the comparative analysis
The Waikato Environment for Knowledge Analysis (WEKA) software – a
suite of machine learning algorithms – was used as the simulation environment for the
implementation of the predictive model.
3.6.1 Model training and validation process
For the purpose of developing the predictive model for the classification of
CML patients' survival, the collected data containing the values of
the identified indicators for CML survival were used to formulate the model using the
three proposed supervised machine learning algorithms.
The dataset collected was divided into two parts: training and testing data –
the training data was used to formulate the model while the test data was used to
validate it. The process of training and testing a predictive model is, according to the
literature, a challenging task, especially given the variety of available validation
procedures.
For classification problems, it is natural to measure a classifier's performance
in terms of the error rate. The classifier predicts the class of each instance – the CML
patient's record containing values for each survival indicator: if the prediction is
correct, it is counted as a success; if not, it is an error. The error rate is the proportion
of errors made over a whole set of instances, and it measures the overall performance
of the classifier. The error rate on the training dataset is not likely to be a good
indicator of future performance, because the classifier has been learned from the very
same training data.
In order to predict the performance of a classifier on new data, there is the
need to assess its error rate on a dataset that played no part in the formation of the
classifier. This independent dataset is called the test dataset, which should be as
representative a sample of the underlying problem as the training data. It is important
that the test dataset is not used in any way to create the classifier, since machine
learning classifiers involve two stages: one to come up with the basic structure of the
predictive model and the second to optimize the parameters involved in that structure.
3.6.2 Cross-validation
The process of leaving out part of a whole dataset as testing data while the rest
is used for training the model is called the holdout method. The challenge here is the
need to find a good classifier by using as much of the whole historical data
as possible for training, while obtaining a good error estimate by using as much as
possible for model testing. It is common to hold out one-third of the whole historical
dataset for testing and use the remaining two-thirds for training.
It is important to ensure that the random sampling of dataset records is done in
a way that guarantees that each class is properly represented in both training and
testing datasets; this procedure is referred to as stratification, thus stratified
holdout in the case of this study. Although stratification provides only a
primitive safeguard against uneven representation in training and testing datasets, a
more general way to mitigate the bias caused by the sample chosen is to repeat the
whole process, training and testing, several times with different random samples. For
each iteration, a certain proportion (two-thirds) is randomly selected for training and
the rest for testing.
For this study the cross-validation procedure was employed, which involves
dividing the whole dataset into a number of folds (or partitions). Each
partition was selected in turn for testing, with the remaining k – 1 partitions used for
training; the next partition was then used for testing with the remaining k – 1
partitions (including the first partition used for testing) used for training, until all k
partitions had been selected for testing. The error rates recorded from each run were
averaged to obtain the mean error rate. The procedure used in this study was the
stratified 10-fold cross-validation method, which involves splitting the whole dataset
into ten partitions.
Also, a single 10-fold cross-validation is not enough to obtain a reliable error
estimate, since stratification reduces the variation but does not eliminate it entirely;
different 10-fold cross-validation experiments are therefore required. Thus, the
10-fold cross-validation process was repeated ten times – that is, ten times 10-fold
cross-validation, giving rise to 100 cross-validation experiments – which is a reliable
way of generating a good measure of performance, though a computation-intensive
undertaking.
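The ten-times 10-fold procedure can be sketched in Python with scikit-learn's
RepeatedStratifiedKFold; this is illustrative only (the study used the Weka
Experimenter), and synthetic stand-in data is generated so the sketch runs on its own.

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: in the study, X holds the selected CML risk factors and
# y the survival class
X, y = make_classification(n_samples=146, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)   # 100 runs
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")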
3.6.3 Simulation environment
Weka is open source software released under the GNU General Public License. The
system was developed at the University of Waikato in New Zealand; Weka stands for
the Waikato Environment for Knowledge Analysis. The software is freely available
at http://www.cs.waikato.ac.nz/ml/weka. The system was written in the object-
oriented language Java. There are several different levels at which Weka can be
used. Weka provides implementations of state-of-the-art data mining and machine
learning algorithms, and contains modules for data preprocessing, classification,
clustering and association rule extraction for market basket analysis.
The main features of Weka include:
a. 49 data preprocessing tools;
b. 76 classification/regression algorithms;
c. 8 clustering algorithms;
d. 15 attribute/subset evaluators + 10 search algorithms for feature selection;
e. 3 algorithms for finding association rules; and
f. 3 graphical user interfaces, namely:
i. The Explorer for exploratory data analysis;
ii. The Experimenter for experimental environment; and
iii. The Knowledge Flow, a new process model inspired interface.
For the purpose of this study, the Explorer was used for performing the process
of feature selection using three different feature selection methods, each with its own
unique search strategy. Following the identification of the relevant indicators (input
variables), the training dataset containing those instances was tested using the
Experimenter interface of the Weka environment. Thus, the datasets were subjected to
10 runs of 10-fold cross-validation using the three selected supervised machine
learning algorithms, namely: decision trees using the C4.5 decision trees algorithm,
support vector machines using the sequential minimal optimization (SMO)
algorithm and the artificial neural network using the multi-layer perceptron algorithm.
Before subjecting the historical datasets containing the values of the variables
monitored during the follow-up of CML patients receiving Imatinib treatment,
alongside their survival class, to these algorithms, there was the need to store the
dataset in the default format for data representation needed for data mining tasks in
the Weka environment. The default file type is called the attribute relation file format
(.arff). The arff file type stores three categories of data: the first defining the title of
the relation, the second defining the relation's attributes alongside their respective
labels and the third defining the relation's data, followed by the values of the
attributes for each record. Also, data can be read from comma separated values (.csv)
format and from databases using Open Database Connectivity (ODBC).
3.7 Performance Evaluation Metrics
During the course of evaluating the predictive models, a number of metrics
were used to quantify each model's performance. In order to determine these metrics,
four parameters must be identified from the results of predictions made by the
classifier during model testing. These are: true positives (TP), true negatives (TN),
false positives (FP) and false negatives (FN). In this study, which involves a binary
classification, either of survived and not survived can be considered as positive while
the other is negative.
True positives are the correct predictions of positive cases, true negatives are
the correct predictions of negative cases, false positives are negative cases incorrectly
predicted as positive and false negatives are positive cases incorrectly predicted as
negative. The performance metrics are thus defined as follows:
a. Sensitivity/True positive rate/Recall: the proportion of actual positive
cases that were correctly predicted positive by the model, TP / (TP + FN).
b. Specificity/True negative rate: the proportion of actual negative cases that
were correctly predicted as negative by the model, TN / (TN + FP).
c. False positive rate/False alarm: the proportion of actual negative cases
that were predicted as positive by the model, FP / (FP + TN).
d. Precision: the proportion of the predicted positive (or negative) cases that
were actually positive (or negative). Equations (3.32) and (3.33) show the
precision for positive and negative cases:

\[ \text{Precision}^{+} = \frac{TP}{TP + FP} \qquad (3.32) \]

\[ \text{Precision}^{-} = \frac{TN}{TN + FN} \qquad (3.33) \]
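A small Python sketch of these metrics computed from the four counts (the example
counts are illustrative):

def metrics(tp, tn, fp, fn):
    return {
        "sensitivity (recall)":  tp / (tp + fn),
        "specificity":           tn / (tn + fp),
        "false positive rate":   fp / (fp + tn),
        "precision (positive)":  tp / (tp + fp),    # equation (3.32)
        "precision (negative)":  tn / (tn + fn),    # equation (3.33)
    }

# Example: a classifier with TP = 110, TN = 9, FP = 22, FN = 5
print(metrics(110, 9, 22, 5))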
CHAPTER FOUR
RESULTS AND DISCUSSIONS
4.1 Introduction
In this section of the study, the results of the methodological approach
described earlier are discussed. A thorough descriptive analysis of the dataset
collected was initially performed in order to understand the
distribution of the values of each attribute monitored in CML patients during the
follow-up of Imatinib treatment, using the minimum and maximum values and the
mean and standard deviation of the data distribution. Following this, the
number of missing values in the dataset for each monitored attribute is
discussed; these were all handled during model training by ignoring the missing
values when computing the relevant measures of model characteristics and
performance evaluation.
The mean value of each numeric attribute was used as a threshold to
convert all numeric values into their respective binary nominal values – for instance,
if the mean of all values for an attribute is k, then the binary nominal values for
that variable will be less than k (< k) and greater than or equal to k (≥ k). This was
used to convert all numeric attributes into nominal attributes, thus making them easier
to manipulate than their numeric counterparts (see the sketch after this paragraph).
Following this, three feature selection methods, alongside their respective search
methods, were used to identify the most relevant features among the ones monitored
during Imatinib follow-up for the survival of CML patients. The features selected by
each feature selection method were passed to three supervised machine learning
algorithms for the purpose of formulating the predictive models needed for classifying
the survival of CML patients.
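A pandas sketch of the mean-threshold conversion described above, assuming missing
values are kept as the '?' symbol used by Weka (the column values are illustrative):

import pandas as pd

def binarize_by_mean(values: pd.Series) -> pd.Series:
    k = values.mean()                      # threshold = attribute mean (NaNs ignored)
    out = values.map(lambda v: f"< {k:.1f}" if v < k else f">= {k:.1f}")
    return out.where(values.notna(), "?")  # keep missing values as '?'

# Example: values below/above the column mean become "< k" / ">= k"
print(binarize_by_mean(pd.Series([5.0, 20.0, None, 12.0])).tolist())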
The performance of the predictive models for CML survival developed using
each of the supervised machine learning algorithms (each formulated using the
variables proposed by each feature selection method) was evaluated in order to
determine the combination of feature selection and supervised machine learning
algorithm needed for developing an effective and efficient predictive model for CML
patients' survival. The variables identified by the feature selection method were thus
proposed as the most important and relevant indicators of the survival of CML
patients.
4.2 Results and Discussion of Data Summarization of Historical Dataset
For this study, a total of 272 CML patients' records were collected from the
study location – the Obafemi Awolowo University Teaching Hospital Complex
(OAUTHC), Ile-Ife. Using the value of the survival time for each CML patient, the
survival class of each patient was determined and classified as survived,
not survived or censored, using thresholds of 728 days for 2-year survival and
1820 days for 5-year survival (assuming there are 52 weeks in a year, each week
consisting of 7 days).
Table 4.1 gives a description of the number of each survival class found in the
2-year and 5-year survival datasets following the process of classification. The table
shows that in the 2-year survival data, 115 (42.28%) patients survived, 31 (11.40%)
did not survive and 126 (45.82%) were censored, while in the 5-year survival data, 25
(9.19%) patients survived, 49 (18.01%) did not survive and 198 (72.79%) were
censored. Table 4.2 shows the number of each survival class in the final dataset used
for this study after removing the censored patients' records from the original dataset,
which left 146 records for the 2-year survival and 74 for the 5-year survival. Figure
4.1 shows a graphical plot of the survival classes in both survival datasets while
Figure 4.2 gives a description of the stored file format (arff) for the data collected.
Table 4.1: Number of each survival class in the original dataset

Survival Class | 2-years      | 5-years
Survived       | 115 (42.28%) | 25 (9.19%)
Not Survived   | 31 (11.40%)  | 49 (18.01%)
Censored       | 126 (45.82%) | 198 (72.79%)
Total          | 272          | 272
Table 4.2: Number of each survival class in the final dataset

Survival Class | 2-years      | 5-years
Survived       | 115 (78.77%) | 25 (33.78%)
Not Survived   | 31 (21.23%)  | 49 (66.22%)
Total          | 146          | 74
Figure 4.1: Graphical plot of the distribution of classes among 2 and 5-year survival data
Figure 4.2: Screenshot of the data collected and stored in arff file format
Following the process of data identification and collection of the
historical data containing the values of the attributes monitored during the follow-up
of Imatinib treatment administered to CML patients, the missing values in the
data collected were first identified and coded in a manner that could be easily
manipulated by the simulation environment, Weka. According to the Weka
documentation, all missing values were to be replaced with the question mark symbol
(?), and the system was adjusted to ignore all missing values found in each attribute
during the process of data analysis. Table 4.3 shows a description of the missing
values found in the cells of each attribute identified for CML survival.
After identifying and replacing missing values of the attributes monitored
during the follow-up of Imatinib treatment administered to CML patients in the
Nigerian referral hospital, descriptive statistics methods were applied to analyze the
information stored in the dataset. Table 4.4 shows the summarization of the numeric
data types found among the variables collected from the CML patients, using the
metrics: minimum, maximum, mean and standard deviation.
Table 4.5 gives a description of the analysis of the nominal values of the
attributes in the final dataset, following the conversion of all numeric data types into
their respective nominal values using the threshold given by the
mean value identified in the descriptive analysis of the numeric data types. The
count of each nominal value for each attribute is presented alongside the number
of missing values, with the percentage of occurrence of each value for each
attribute selected for this study. The dataset's representation of all variables as
nominal values, alongside the target variable, was used in the feature
selection and predictive model development processes of this study.
Table 4.3: Missing data values in each identified attribute for CML survival

Attributes                 | Missing Data
Sex                        | 0
Vital Status               | 0
Time to start of Imatinib  | 0
Age                        | 0
Packed Cell Volume (PCV)   | 0
Platelets count            | 6
Percentage Blast           | 0
Spleen size                | 0
Liver size                 | 4
Eosinophils                | 9
Basophil                   | 13
White Blood Cell (WBC)     | 3
Disease Phase at Diagnosis | 4
Table 4.4: Data summarization of the numeric attributes in the final dataset

Attributes                        | Minimum | Maximum | Mean   | Standard Deviation
Time to Imatinib Treatment (days) | 2.00    | 2308.00 | 188.39 | 331.23
Age (years)                       | 20.00   | 75.00   | 40.20  | 12.98
Platelet Count (×10⁹/L)           | 10.00   | 1173.00 | 306.60 | 219.16
Packed Cell Volume (PCV, %)       | 13.00   | 49.00   | 31.58  | 7.14
White Blood Cell (WBC, ×10⁹/L)    | 2.10    | 710.00  | 123.05 | 120.27
Basophil (%)                      | 1.00    | 35.00   | 1.73   | 3.90
Eosinophil (%)                    | 0.00    | 21.00   | 2.43   | 3.39
Percentage Blast (%)              | 1.00    | 20.00   | 2.50   | 3.81
Spleen Size (cm)                  | 0.00    | 38.00   | 12.18  | 8.90
Liver Size (cm)                   | 0.00    | 22.00   | 3.06   | 4.37
Survival Time (days)              | 34.00   | 2548.00 | 790.53 | 602.76
Table 4.5: Descriptive statistics of data collected after data processing

Attributes                        | Labels
Sex                               | Female = 54 (36.99%); Male = 92 (63.01%)
Status                            | Alive = 96 (65.75%); Dead = 50 (34.25%)
Time to Imatinib Treatment (days) | Below 188 = 48 (32.88%); Above 188 = 98 (67.12%)
Age (years)                       | Below 40 = 81 (55.48%); Above 40 = 65 (44.52%)
Platelet Count (×10⁹/L)           | Above 306.6 = 46 (31.51%); Below 306.6 = 94 (64.38%); Missing = 6 (4.11%)
Packed Cell Volume (PCV, %)       | Below 31 = 79 (54.11%); Above 31 = 67 (45.89%)
White Blood Cell (WBC, ×10⁹/L)    | Below 123 = 86 (60.96%); Above 123 = 57 (39.04%)
Basophil (%)                      | Above 1.7 = 29 (19.86%); Below 1.7 = 104 (71.24%); Missing = 13 (8.90%)
Eosinophil (%)                    | Below 2.4 = 89 (60.96%); Above 2.4 = 48 (32.88%); Missing = 9 (6.16%)
Percentage Blast (%)              | Above 2.5 = 54 (36.99%); Below 2.5 = 92 (63.01%)
Spleen Size (cm)                  | Above 12.2 = 84 (57.53%); Below 12.2 = 62 (42.47%)
Liver Size (cm)                   | Below 3.1 = 94 (64.38%); Above 3.1 = 48 (32.88%); Missing = 4 (2.74%)
4.3 Results and Discussion of Feature Selection Process
Following the process of data description and transformation into the accepted
file format (arff), the next important step was the identification of the most relevant
variables – those among the identified factors that would improve the prediction of
CML survival the most. As stated earlier, filter-based feature selection methods were
used as supported in the simulation environment, Weka. For each feature selection
method chosen, a particular search strategy was chosen for selecting attributes
from the general set of attributes collected. Table 4.6 shows a description of the
feature selection methods used and the respective search algorithm employed in
each case, with the relevant attributes selected for CML prediction for the 2-year and
5-year datasets.
The feature selection methods are implemented as the following algorithms in
Weka:
a. Correlation-based feature selection algorithm – implemented using the class
weka.attributeSelection.CfsSubsetEval, which selects the subset of features
highly correlated with the target class but with low correlation among
themselves; the search algorithm used was the genetic search algorithm;
b. Information-based feature selection algorithm – implemented using the class
weka.attributeSelection.InfoGainAttributeEval, which selects features by
evaluating the worth of an attribute by measuring the information gain with
respect to the class (information gain of an attribute with respect to the
survival class); the search algorithm used was the ranker algorithm, which
ranked the attributes according to their individual evaluations; and
Table 4.6: Relevant attributes identified using three (3) feature selection methods

Feature Selection Method: Consistency-Based (Search Method: Greedy Step-wise)
  2-years: Basophils, PCV, Time to Start, Age, Percentage Blast, Disease Phase,
           Platelet Count, Eosinophil, Sex, Liver Size
  5-years: Basophils, WBC, Percentage Blast, Eosinophils, PCV, Spleen Size, Age, Sex

Feature Selection Method: Information-Based (Search Method: Ranker Search)
  2-years: Basophils, PCV, Percentage Blast, Disease Phase, Liver Size
  5-years: Basophils, PCV, Disease Phase, Liver Size, Spleen Size, Sex

Feature Selection Method: Correlation-Based (Search Method: Genetic Search)
  2-years: Basophils, PCV, Percentage Blast, Platelet Count
  5-years: Basophils, PCV, Disease Phase, Liver Size, Spleen Size, Sex
c. Consistency-based feature selection algorithm – implemented using the class
weka.attributeSelection.ConsistencySubsetEval, which selects features by
evaluating the worth of a subset of attributes by the level of consistency in the
class values when the training instances are projected onto the subset of
attributes; the search algorithm used was the greedy step-wise algorithm, which
performed a greedy forward and backward search through the space of
attribute subsets.
Out of the total thirteen (13) attributes identified from the factors monitored
during the follow-up of Imatinib treatment, a smaller number of relevant attributes
was selected by each filter-based feature selection algorithm for the 2-year and 5-year
datasets collected for this study. For the two (2) year survival dataset, 10 attributes
were selected by the consistency-based FS, 5 attributes by the information-based FS
and 4 attributes by the correlation-based FS, while for the five (5) year survival
dataset, 8 attributes were selected by the consistency-based FS and 6 attributes each by
the information-based and correlation-based FS methods.
It was discovered that the information-based FS method selected
exactly the same type and number of attributes from the original dataset as
the correlation-based FS method in the 5-year survival dataset. Of all the FS methods
considered in the study, only the consistency-based FS method was able to identify
age as a relevant feature, while sex was found to be one of the relevant features in the
2-year survival dataset only by the consistency-based FS method but was found
relevant by all FS methods applied to the 5-year survival dataset. PCV was selected
among the relevant variables by all FS methods for both the 2-year and 5-year
datasets. Spleen size was identified as one of the relevant attributes for CML survival
by all FS methods in the 5-year survival dataset but not in the 2-year survival dataset.
4.4 Results and Discussion of Model Formulation and Simulation Process
Following the process of feature selection used in identifying the most
relevant variables among the 13 identified variables monitored in CML patients
receiving Imatinib treatment, the next phase was model formulation using the
aforementioned supervised machine learning algorithms available in the Weka
software. The 10-fold cross-validation technique was used in evaluating the
performance of the developed predictive models for CML survival, using test
samples randomly selected from the historical dataset used for training the models.
For each supervised machine learning algorithm used in formulating the predictive
model for CML survival classification, three predictive models were developed, one
using the variables identified by each feature selection method applied to the original
dataset. This process was performed for both the 2-year and 5-year survival data
required for model development.
4.4.1 Results of model formulation and simulation using all 13 variables
CML patients' records containing all 13 variables (attributes) identified as
being monitored during Imatinib treatment were used as the training data for
developing the first set of predictive models for CML survival classification using the
three machine learning algorithms. The dataset used consisted of the set of CML
patients contained in the 2-year and 5-year survival data; for each patient's record in
the dataset, a target class defining the classification of survival was also provided,
labeled as survived or not survived.
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients' records, with 115
survived and 31 not survived cases, the results of the predictive models developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified, giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.4 – left). For the
SMO algorithm used in formulating the SVM classifier, none out of the 31 actual not
survived cases and 113 out of the 115 actual survived cases were correctly classified,
giving a total of 113 correct classifications out of 146 cases with an accuracy of
77.4% (Figure 4.4 – centre). For the MLP classifier, 13 out of the 31 actual not
survived cases and 94 out of the 115 actual survived cases were correctly classified,
giving a total of 107 correct classifications out of 146 cases with an accuracy of
73.3% (Figure 4.4 – right). The C4.5 decision trees algorithm formulated the most
accurate predictive model for CML 2-year survival classification using all 13
variables monitored during Imatinib treatment.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records, with 25
survived and 49 not survived cases, the results of the predictive models developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 43 out of the 49 actual not survived cases and 6 out of the 25
survived cases were correctly classified, giving a total of 49 correct classifications out
of 74 cases with an accuracy of 66.2% (Figure 4.5 – left). For the SMO algorithm
used in formulating the SVM classifier, 40 out of the 49 actual not survived cases and
3 out of the 25 actual survived cases were correctly classified, giving a total of 43
correct classifications out of 74 cases with an accuracy of 58.1% (Figure 4.5 – centre).
For the MLP classifier, 31 out of the 49 actual not survived cases and 5 out of the 25
actual survived cases were correctly classified, giving a total of 36 correct
classifications out of 74 cases with an accuracy of 48.6% (Figure 4.5 – right).
Figure 4.4: Results of model formulation using all 13 variables in 2-year survival
Figure 4.5: Results of model formulation using all 13 variables in 5-year survival
The C4.5 decision trees algorithm formulated the most accurate predictive
model for CML 5-year survival classification using all 13 variables monitored during
Imatinib treatment.
4.4.2 Results of model formulation and simulation using variables selected by
consistency criteria
The second set of predictive models was formulated using the CML patients'
records containing the variables identified using the consistency-based FS method.
Using the consistency-based FS method, 10 relevant variables were identified in the
2-year survival dataset while 8 relevant variables were identified in the 5-year survival
dataset. Thus, each supervised machine learning algorithm was used to formulate the
predictive model for CML survival classification using the relevant variables
identified using the consistency criterion for FS.
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients‘ record with 115
survived and 31 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.6 – left). For the
SMO algorithm used in formulating the SVM classifier, none out of the 31 actual not
survived cases and 114 out of the 115 actual survived cases were correctly classified
giving a total of 114 correct classifications out of 146 cases with an accuracy of
78.1% (Figure 4.6 – centre). For the MLP classifier, 12 out of the 31 actual not
survived cases and 97 out of the 115 actual survived cases were correctly classified
giving a total of 109 correct classifications out of 146 cases with an accuracy of 74.7% (Figure
4.6 – right). The C4.5 decision trees algorithm formulated the most accurate
predictive model for the 2-year survival classification of CML patients using the 10
relevant variables identified in the 2-year survival data. It was also observed that
there was no significant difference between the predictive models formulated by the
C4.5 decision trees algorithm using the original dataset containing 13 variables and
using the 10 variables identified by the consistency criteria for FS, but a significant
improvement was observed in the predictive models formulated by both SVM and
MLP with the consistency criteria for FS compared with using all 13 attributes.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records with 25
survived and 49 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 47 out of the 49 actual not survived cases and 5 out of the 25
survived cases were correctly classified giving a total of 52 correct classifications out
of 74 cases with an accuracy of 70.3% (Figure 4.7 – left). For the SMO algorithm
used in formulating the SVM classifier, 45 out of the 49 actual not survived cases and
none out of the 25 actual survived cases were correctly classified giving a total of 45
correct classifications out of 74 cases with an accuracy of 60.8% (Figure 4.7 – centre).
For the MLP classifier, 33 out of 49 actual not survived cases and 6 out of the 25
actual survived cases were correctly classified giving a total of 39 correct
classifications out of 74 cases with an accuracy of 52.7% (Figure 4.7 – right). The
C4.5 decision trees algorithm formulated the most accurate predictive model for the
5-year survival classification of CML patients using the 8 relevant variables identified
by the consistency criteria from the data monitored during Imatinib treatment.
Figure 4.6: Results of model formulation using 10 variables in 2-year
survival selected using consistency-based FS method

Figure 4.7: Results of model formulation using 8 variables in 5-year
survival selected using consistency-based FS method
There was also a significant improvement in the predictive model formulated
using the 8 relevant features compared to using all 13 monitored attributes.
4.4.3 Results of model formulation and simulation using variables selected by
information criteria
The third set of predictive models was formulated using the CML patients'
records that contained the variables identified using the information-based FS method.
Using this method, 5 relevant variables were identified in the 2-year survival dataset
while 6 relevant variables were identified in the 5-year survival dataset. Thus, each
supervised machine learning algorithm was used to formulate the predictive model
for CML survival classification using the relevant variables identified by the
information criteria for FS.
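Assuming the information criterion corresponds to ranking attributes by information gain with respect to the class, as in standard filter-based FS, a minimal sketch follows; the column names are assumptions, and continuous variables would first need to be discretised:

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy (bits) of a class distribution."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, feature: str, label: str = "survival") -> float:
    """Reduction in class entropy obtained by splitting on the feature."""
    prior = entropy(df[label])
    conditional = sum(len(g) / len(df) * entropy(g[label])
                      for _, g in df.groupby(feature))
    return prior - conditional

# Rank the 13 monitored variables and keep the top-scoring ones, e.g.:
# ranked = sorted(variables, key=lambda v: information_gain(df, v), reverse=True)
```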
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients' records with 115
survived and 31 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.8 – left). For the
SMO algorithm used in formulating the SVM classifier, none out of the 31 actual not
survived cases and all the 115 actual survived cases were correctly classified giving a
total of 115 correct classifications out of 146 cases with an accuracy of 78.8% (Figure
4.8 – centre). For the MLP classifier, 18 out of the 31 actual not survived cases and 105
out of the 115 actual survived cases were correctly classified giving a total of 123
correct classifications out of 146 cases with an accuracy of 84.3% (Figure 4.8 – right).
Figure 4.8: Results of model formulation using 5 variables in 2-year
survival selected using information-based FS method
The MLP algorithm formulated the most accurate predictive model for the
2-year survival classification of CML patients using the 5 relevant variables
identified in the 2-year survival data. It was also observed that there was no
significant difference between the predictive models formulated by the C4.5 decision
trees algorithm using the original dataset containing 13 variables and using the 5
variables identified by the information criteria for FS, but a significant improvement
was observed in the predictive model formulated by SVM with the information
criteria for FS compared with using all 13 attributes.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records with 25
survived and 49 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 41 out of the 49 actual not survived cases and 3 out of the 25
survived cases were correctly classified giving a total of 44 correct classifications out
of 74 cases with an accuracy of 59.5% (Figure 4.9 – left). For the SMO algorithm
used in formulating the SVM classifier, 48 out of the 49 actual not survived cases and
none out of the 25 actual survived cases were correctly classified giving a total of 48
correct classifications out of 74 cases with an accuracy of 64.9% (Figure 4.9 – centre).
For the MLP classifier, 34 out of 49 actual not survived cases and 12 out of the 25
actual survived cases were correctly classified giving a total of 46 correct
classifications out of 74 cases with an accuracy of 62.2% (Figure 4.9 – right). The
SVM algorithm formulated the most accurate predictive model for the 5-year survival
classification of CML patients using the 6 relevant variables identified by the
information criteria, an improvement over the earlier SVM models formulated using
the other variable sets.
Figure 4.9: Results of model formulation using 6 variables in 5-year
survival selected using information-based FS method
There was also a significant improvement in the predictive model formulated
using the MLP classifier but a reduced performance using the C4.5 decision trees
classifier.
4.4.4 Results of model formulation and simulation using variables selected by
correlation criteria
The final set of predictive models was formulated using the CML patients'
records that contained the variables identified using the correlation-based FS method.
Using this method, 4 relevant variables were identified in the 2-year survival dataset
while 6 relevant variables were identified in the 5-year survival dataset – the same as
those identified using the information-based criteria. Thus, each supervised machine
learning algorithm was used to formulate the predictive model for CML survival
classification using the relevant variables identified by the correlation criteria for FS.
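The correlation criterion can be illustrated with the merit function of Hall's (1999) correlation-based feature selection, which rewards subsets whose features correlate with the class but not with each other. The sketch below is an assumption-laden simplification: Pearson's r stands in for Hall's symmetrical-uncertainty measure, and the class is assumed to be coded 0/1:

```python
import numpy as np
import pandas as pd

def cfs_merit(df: pd.DataFrame, subset: list, label: str = "survival") -> float:
    """CFS merit (Hall, 1999): k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is
    the mean feature-class correlation and r_ff the mean feature-feature
    inter-correlation over the subset."""
    k = len(subset)
    r_cf = np.mean([abs(df[f].corr(df[label])) for f in subset])
    if k == 1:
        return float(r_cf)
    r_ff = np.mean([abs(df[a].corr(df[b]))
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))

# A greedy forward search keeps adding whichever variable most raises the merit.
```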
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients' records with 115
survived and 31 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.10 – left). For
the SMO algorithm used in formulating the SVM classifier, none out of the 31 actual
not survived cases and all the 115 actual survived cases were correctly classified
giving a total of 115 correct classifications out of 146 cases with an accuracy of
78.8% (Figure 4.10 – centre). For the MLP classifier, 13 out of the 31 actual not survived
cases and 107 out of the 115 actual survived cases were correctly classified giving a
total of 120 correct classifications out of 146 cases with an accuracy of 82.2% (Figure
4.10 – right). The MLP algorithm formulated the most accurate predictive model for
the 2-year survival classification of CML patients using the 4 relevant variables
identified in the 2-year survival data. It was also observed that there was no
significant difference between the predictive models formulated by the C4.5 decision
trees and SVM classifiers using any of the variable subsets selected by FS.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records with 25
survived and 49 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 41 out of the 49 actual not survived cases and 3 out of the 25
survived cases were correctly classified giving a total of 44 correct classifications out
of 74 cases with an accuracy of 59.5% (Figure 4.11 – left). For the SMO algorithm
used in formulating the SVM classifier, 48 out of the 49 actual not survived cases and
none out of the 25 actual survived cases were correctly classified giving a total of 48
correct classifications out of 74 cases with an accuracy of 64.9% (Figure 4.11 –
centre). For the MLP classifier, 34 out of 49 actual not survived cases and 12 out of
the 25 actual survived cases were correctly classified giving a total of 46 correct
classifications out of 74 cases with an accuracy of 62.2% (Figure 4.11 – right). There
was no difference in the results of the predictive models formulated using the three
supervised machine learning algorithms with the variables selected by either the
information-based or the correlation-based criteria for feature selection, since both
methods selected the same variables.
Figure 4.10: Results of model formulation using 4 variables in 2-year
survival selected using correlation-based FS method

Figure 4.11: Results of model formulation using 6 variables in 5-year
survival selected using correlation-based FS method
The results of the study showed the effect of the three feature selection
methods combined with the three supervised machine learning algorithms on the
formulation of the predictive models needed for the 2-year and 5-year survival
classification of CML patients receiving Imatinib treatment. The information-based
feature selection method combined with the multi-layer perceptron gave the best
performance for the 2-year survival classification model, while the consistency-based
feature selection method combined with the C4.5 decision trees algorithm gave the
best performance for the 5-year survival classification model.
4.5 Discussion of Results of the Formulation and Simulation of Prediction
Model for CML Survival Classification
For each prediction model developed using the combination of feature
selection and supervised machine learning algorithms, the confusion matrices were
constructed from the values of the correct (true positive and true negative) and
incorrect (false positive and false negative) classifications made by each prediction
model developed for CML survival. The not survived cases were taken as the
positive class for each prediction model, while the survived cases were taken as the
negative class.
The true positive and true negative values were used to evaluate the accuracy
of each prediction model, showing how much of the total number of cases was
correctly classified by the classifier – the efficiency of the model. Additional metrics
were estimated, including the true positive rate, which measures the ability of the
model to correctly classify the not survived cases; the true negative rate, which
measures the ability of the model to correctly classify the survived cases; the false
positive rate, which measures the incorrectly classified negative cases; and the area
under the receiver operating characteristics (ROC) curve (AUC), which measures the
effectiveness of the model.
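For concreteness, the following minimal sketch shows how these metrics follow from the four confusion-matrix counts; the Precision column in Tables 4.7 and 4.8 is consistent with the class-weighted average of the per-class precisions computed below, with the not survived cases as the positive class:

```python
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Metrics derived from a survival classifier's confusion matrix."""
    pos, neg, n = tp + fn, tn + fp, tp + tn + fp + fn
    prec_pos = tp / (tp + fp) if tp + fp else 0.0   # precision on not survived
    prec_neg = tn / (tn + fn) if tn + fn else 0.0   # precision on survived
    return {
        "accuracy": (tp + tn) / n,
        "tp_rate (sensitivity)": tp / pos,
        "tn_rate (specificity)": tn / neg,
        "fp_rate (false alarm)": fp / neg,
        "precision (class-weighted)": (pos * prec_pos + neg * prec_neg) / n,
    }

# C4.5 on the 2-year data with all 13 variables: 9 of 31 not survived and
# 110 of 115 survived correct, i.e. TP=9, FN=22, TN=110, FP=5; this yields
# accuracy 0.815, TP rate 0.290, TN rate 0.957, FP rate 0.043, precision 0.793.
print(evaluate(tp=9, tn=110, fp=5, fn=22))
```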
4.5.1 2-year survival classification model
Based on the results obtained for the TP, TN, FP and FN values from the
prediction models developed from each combination of feature selection method and
machine learning algorithm, the aforementioned metrics were estimated and are
shown in Table 4.7. From the table, the respective performance of each combination
of feature selection and machine learning algorithms is shown.
The C4.5 decision trees algorithm showed a consistent performance with no
significant change irrespective of the type of feature selection method used in
extracting the relevant variables; in all cases 81.5% of the total dataset was correctly
classified, with 29% of the not survived cases and 95.7% of the survived cases
correctly classified, corresponding to TP and TN rates of 0.290 and 0.957
respectively. The C4.5 decision trees algorithm outperformed the SVM and MLP
algorithms when using all 13 variables and when using the 10 relevant variables
selected by the consistency-based feature selection algorithm. Figure 4.12 shows the
decision tree constructed for 2-year CML survival classification using all 13 variables
(left) and using the selected variables (right). The trees show that the most relevant
attributes influencing 2-year survival are the Basophils and the Packed Cell Volume
(PCV), while other relevant variables are the percentage blast and the increase in the
spleen size of the CML patient.
Table 4.7: Results of the evaluation for the predictive model for 2-year survival classification

Feature Selection Method           Classifier                Correct Classification (Accuracy)  TP Rate  TN Rate  FP Rate  Precision  ROC Area
None (all 13 attributes)           Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.647
                                   Support Vector Machines   113 (77.40%)                        0.000    0.983    0.017    0.618      0.491
                                   Multi-layer Perceptron    104 (71.23%)                        0.419    0.817    0.183    0.683      0.576
Consistency-based (10 attributes)  Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.647
                                   Support Vector Machines   114 (78.08%)                        0.000    0.991    0.009    0.727      0.548
                                   Multi-layer Perceptron    106 (72.60%)                        0.387    0.843    0.157    0.732      0.538
Information-based (5 attributes)   Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.676
                                   Support Vector Machines   115 (78.77%)                        0.000    1.000    0.000    0.620      0.500
                                   Multi-layer Perceptron    123 (84.25%)                        0.581    0.913    0.095    0.831      0.734
Correlation-based (4 attributes)   Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.679
                                   Support Vector Machines   115 (78.77%)                        0.000    1.000    0.000    0.620      0.500
                                   Multi-layer Perceptron    120 (82.19%)                        0.419    0.930    0.075    0.808      0.694

Note: TP Rate = sensitivity/recall; TN Rate = specificity; FP Rate = false alarm; ROC Area = area under the Receiver Operating Characteristics (ROC) curve. Correct classifications are out of 146 cases.
Figure 4.12: Decision trees for all attributes (left) and selected attributes (right) in 2-year survival
From the decision trees shown in Figure 4.12, the following rules were
extracted; the first set of five rules from the tree on the left and the second set of
three rules from the tree on the right (a sketch rendering the feature-selected rules as
code follows the list).
I. Without feature selection
a. If (Basophils=below 1.7) Then (2-year survival=Survived);
b. If (Basophils=above 1.7) and (PCV=below 31) Then (2-year
survival=Survived);
c. If (Basophils=above 1.7) and (PCV=above 31) and (Percentage blast=above
2.5) Then (2-year survival=Not survived);
d. If (Basophils=above 1.7) and (PCV=above 31) and (Percentage blast=below
2.5) and (Spleen=below 12.2) Then (2-year survival=Not survived); and
e. If (Basophils=above 1.7) and (PCV=above 31) and (Percentage blast=below
2.5) and (Spleen=above 12.2) Then (2-year survival=Survived).
II. With feature selection
a. If (Basophils=below 1.7) Then (2-year survival=Survived);
b. If (Basophils=above 1.7) and (PCV=below 31) Then (2-year
survival=Survived); and
c. If (Basophils=above 1.7) and (PCV=above 31) Then (2-year survival=Not
survived).
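Read as code, the feature-selected rule set reduces to two threshold tests. A minimal sketch, with thresholds taken from rules II.a to II.c above (the function name is illustrative):

```python
def predict_2yr_survival(basophils: float, pcv: float) -> str:
    """2-year survival rules extracted from the feature-selected C4.5 tree
    (Figure 4.12, right)."""
    if basophils <= 1.7:          # rule II.a
        return "Survived"
    if pcv <= 31:                 # rule II.b
        return "Survived"
    return "Not survived"         # rule II.c
```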
The SMO classifier used in implementing the SVM algorithm for developing
the predictive model for CML survival classification showed very poor results for all
the different sets of attributes used in the 2-year survival classification prediction
model. The SVM was unable to correctly classify any of the not survived cases
(positive cases) for all the attribute sets used, but classified 98.3% of the survived
cases using all 13 attributes, and 99.1% and 100% using the attributes selected by
feature selection. Although the percentage of correct classification (accuracy) lies
within the interval 77.4% – 78.8%, the majority of the correct classifications were
from the survived cases (negative class). The SVM was unable to formulate an
effective predictive model using the attributes available in the 2-year survival dataset
for CML survival.
The back-propagation algorithm used by the multi-layer perceptron to develop
the 2-year CML survival classification model also showed interesting results. For the
dataset containing the 13 initial attributes, the MLP showed an accuracy of 71.2%,
which increased to 72.6% using the attributes selected by consistency-based FS, to
82.19% using the attributes selected by correlation-based FS, and to 84.3% using the
attributes selected by the information-based FS method. The MLP was able to
correctly predict about 58% of the not survived cases (positive class) and about 91%
of the survived cases (negative class), corresponding to values of 0.581, 0.913 and
0.095 for the TP, TN and FP rates respectively. The best performance was achieved
by the predictive model for CML survival classification developed using the 5
attributes selected by the information-based FS method, namely:
a. Basophils;
b. Packed cell volume (PCV);
c. Percentage blast;
d. Disease phase at diagnosis; and
e. Liver size.
4.5.2 5-year survival classification model
Based on the results obtained for the TP, TN, FP and FN values from the
prediction models developed from each combination of feature selection method and
machine learning algorithm, the aforementioned metrics were estimated and are
shown in Table 4.8. From the table, the respective performance of each combination
of feature selection and
machine learning algorithms is shown for the 5-year CML survival classification
model.
The C4.5 decision trees algorithm showed a consistent improvement in
performance on the 5-year CML survival dataset used in formulating the predictive
model. The C4.5 DT algorithm was able to correctly classify 66.2% of the total cases
using all 13 features, 70.3% using the 8 variables identified by the consistency
criteria and 67.57% using the 6 variables identified by the information and
correlation criteria respectively. The C4.5 DT algorithm had the highest performance
on the dataset containing the attributes selected by the consistency-based FS method,
from which 4 relevant features were used by the C4.5 DT algorithm to construct the
decision tree shown in Figure 4.13, consisting of the following attributes:
a. Packed Cell Volume (PCV);
b. White Blood Cell (WBC) count;
c. Spleen size; and
d. Basophils.
The most important attribute identified for 5-year survival was the PCV,
followed by the WBC, Spleen and Basophils, unlike the attributes identified to be
relevant for 2-year CML survival. From the decision tree shown in Figure 4.13, the
following five rules were extracted (a sketch rendering these rules as code follows
the list):
a. If (PCV=above 31) Then (5-year survival=Not survived);
b. If (PCV=below 31) and (WBC=above 123) Then (5-year survival=Survived);
c. If (PCV=below 31) and (WBC=below 123) and (Spleen=above 12.2) Then (5-
year CML survival=Not survived);
d. If (PCV=below 31) and (WBC=below 123) and (Spleen=below 12.2) and
(Basophils=below 1.7) Then (5-year CML survival=Not survived); and
e. If (PCV=below 31) and (WBC=below 123) and (Spleen=below 12.2) and
(Basophils=above 1.7) Then (5-year CML survival=Survived).
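As with the 2-year rules, these five rules reduce to a chain of threshold tests. A minimal sketch, with thresholds as stated in rules a to e (the function name is illustrative):

```python
def predict_5yr_survival(pcv: float, wbc: float, spleen: float, basophils: float) -> str:
    """5-year survival rules extracted from the consistency-selected C4.5
    tree (Figure 4.13)."""
    if pcv > 31:                  # rule a
        return "Not survived"
    if wbc > 123:                 # rule b
        return "Survived"
    if spleen > 12.2:             # rule c
        return "Not survived"
    return "Survived" if basophils > 1.7 else "Not survived"  # rules d and e
```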
Table 4.8: Results of the evaluation for the predictive model for 5-year survival classification

Feature Selection Method           Classifier                Correct Classification (Accuracy)  TP Rate  TN Rate  FP Rate  Precision  ROC Area
None (all 13 attributes)           Decision Trees            49 (66.22%)                         0.878    0.240    0.760    0.628      0.521
                                   Support Vector Machines   43 (58.11%)                         0.633    0.200    0.800    0.512      0.468
                                   Multi-layer Perceptron    36 (48.65%)                         0.816    0.136    0.864    0.476      0.473
Consistency-based (8 attributes)   Decision Trees            52 (70.27%)                         0.959    0.200    0.800    0.706      0.569
                                   Support Vector Machines   45 (60.81%)                         0.673    0.240    0.760    0.426      0.459
                                   Multi-layer Perceptron    39 (52.70%)                         0.918    0.000    1.000    0.512      0.458
Information-based (6 attributes)   Decision Trees            50 (67.57%)                         0.837    0.120    0.880    0.664      0.641
                                   Support Vector Machines   48 (64.86%)                         0.694    0.480    0.520    0.435      0.490
                                   Multi-layer Perceptron    46 (62.16%)                         0.980    0.000    1.000    0.629      0.633
Correlation-based (6 attributes)   Decision Trees            50 (67.57%)                         0.837    0.120    0.880    0.664      0.641
                                   Support Vector Machines   48 (64.86%)                         0.694    0.480    0.520    0.435      0.490
                                   Multi-layer Perceptron    46 (62.16%)                         0.980    0.000    1.000    0.629      0.633

Note: TP Rate = sensitivity/recall; TN Rate = specificity; FP Rate = false alarm; ROC Area = area under the Receiver Operating Characteristics (ROC) curve. Correct classifications are out of 74 cases.
Figure 4.13: Decision trees for selected attributes in 5-year survival
The SMO algorithm used in developing the SVM classifier for the 5-year
CML survival classification model was observed to correctly classify 58.1% of the
total cases using all 13 attributes, but 60.8% and 64.9% using the attributes selected
by the consistency-based and information/correlation-based FS methods respectively.
The SVM classifier showed its best performance using the attributes selected by the
information- and correlation-based FS methods, which identified the same set of
attributes. The SVM classifier was able to correctly predict about 69.4% of the not
survived cases (positive class) and 48% of the survived cases (negative class). It was
also discovered that the performance of the SVM classifier was improved by the use
of the relevant attributes for CML survival compared to using all 13 attributes.
The MLP using the back-propagation algorithm to develop the 5-year CML
survival model was observed to correctly classify 48.6% of the total cases using the
13 attributes, but 52.7% and 62.16% using the attributes selected by the consistency-
based and information/correlation-based FS methods respectively. The MLP showed
its highest performance using the variables selected by the information/correlation-
based FS methods, with TP, TN and FP rates of 0.98, 0.00 and 1.00 respectively. It
was also observed that the MLP was unable to effectively predict the survived cases,
unlike the not survived cases.
Table 4.9: A comparison of the variables selected for CML survival prediction

Model                                  Variables identified
2-year CML survival – C4.5 DT          Basophils (with and without FS); Spleen (without FS); Blast (without FS); PCV (with and without FS)
2-year CML survival – MLP              Basophils; Liver size; Blast; PCV; Disease phase
5-year CML survival – C4.5 DT          Basophils; Spleen; WBC; PCV
Existing model – SOKAL                 Platelets; Spleen; Blast; Age
Existing model – HASFORD               Basophils; Spleen; Blast; Age; Eosinophils
Existing model – EUTOS                 Basophils; Spleen
Out of the prediction models developed for CML 5-year survival, the
classification model developed by the C4.5 decision trees algorithm using the
consistency-based FS method was found to be the most effective 5-year CML
survival classification model.
Table 4.9 shows a comparison of the variables identified by each model for
CML survival using supervised machine learning and by the existing models based
on regression analysis. The table shows that all the survival models except the Sokal
score identified Basophils as an important attribute, just as the 2-year and 5-year
survival classification models developed using the FS and ML algorithms did. All
the survival models also identified Spleen as an important variable, except for the
2-year survival classification model using MLP, while the 2-year DT model identified
Spleen only without feature selection. All the survival models identified Blast as an
important attribute except for the EUTOS model and the 5-year survival
classification model.
Other variables peculiar to the survival of Nigerian CML patients include the
PCV, identified by the C4.5 decision trees algorithm and the MLP used in developing
the 2-year survival classification model alongside the C4.5 DT algorithm used in
developing the 5-year survival classification model; the liver size and disease phase,
identified by the MLP used in developing the 2-year survival classification model;
and finally the WBC, identified by the C4.5 DT algorithm used in developing the
5-year survival classification model.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary
This study focused on the development of an effective and efficient prediction
model using clinical information native to Nigerian CML patients in order to classify
the survival of CML patients receiving Imatinib treatment in Nigeria. Existing
survival models were developed using clinical information of non-Nigerians and
statistical regression modeling techniques, and have been largely ineffective in
estimating the survival of Nigerian CML patients receiving Imatinib treatment.
Feature selection algorithms were used to identify the variables which have a
strong correlation/relevance to CML survival from the dataset containing all possible
variables monitored from CML patients receiving Imatinib treatment at a referral
hospital. The variables identified using three feature selection methods were used to
formulate the prediction models for 2-year and 5-year CML survival using three
supervised machine learning algorithms.
The results of the study revealed the variables that are relevant to both 2-year
and 5-year survival of CML patients alongside the development of the prediction
model for CML 2-year and 5-year survival classification using the variables
identified.
5.2 Conclusion
Following the use of feature selection methods in identifying the variables
relevant to CML survival, basophils and spleen size were identified as the most
relevant to the 2-year survival and to the 5-year survival using decision trees, as
proposed by the EUTOS and Hasford scores, and the percentage of blasts was also
relevant to the 2-year survival, as identified by the Sokal and Hasford models. Other
variables identified were the PCV in the 2-year and 5-year survival, liver size and
disease phase in the 2-year survival, and WBC in the 5-year survival. Unlike the
other existing models, the variables age, eosinophil count and platelet count were
insignificant to the survival of Nigerian CML patients.
The prediction models developed using the dataset showed good results,
although they were more likely to classify one class than the other due to the unequal
proportions of survived and not survived cases in the original dataset. Thus, a dataset
with fewer censored values should yield more reliable predictive models for CML
survival classification.
The variables identified by the prediction models, and the rules extracted from
the decision trees constructed with those variables, can help provide insight into the
relationships that exist between the variables and the 2-year and 5-year survival of
CML patients.
5.3 Recommendations
Following the development of the prediction model for CML survival
classification, a better understanding of the relationships between the attributes
relevant to CML survival was provided. The model can also be integrated into
existing Health Information Systems (HIS), which capture and manage clinical
information that can be fed to the CML survival classification prediction model, thus
improving clinical decisions affecting CML survival and enabling real-time
assessment of clinical information affecting CML survival from remote locations.
It is advised that a continual assessment of the variables monitored during
CML treatment be made in order to increase the amount of information relevant to
creating an improved prediction model for CML survival classification using the
proposed feature selection and machine learning methods.
REFERENCES
Agbelusi, O. (2014). Development of a predictive model for survival of HIV/AIDS
patients in South-western Nigeria, Unpublished MPhil Thesis, Obafemi
Awolowo University, Ile-Ife, Nigeria.
Ahmad, L.G., Eshlaghy, A.T., Poorebrahimi, A., Ebrahimi, M. and Razavi, A.R.
(2013). J Health Med Inform 4(2): 1 – 3.
Alimena, G., Morra, E., Lazzarino, M., Liberati, A.M., Montefusco, E. and Inveradi,
D. (1988). Interferon alpha-2b as therapy for Ph'-positive chronic
myelogenous Leukaemia: A study of 82 patients treated with intermittent or
daily administration. Blood 72: 642 – 647.
Allan, N.C., Richards, S.M., Shepherd, P.C. (1995). UK Medical Research Council
randomised, multicentre trial of interferon-alpha n1 for chronic myeloid
leukaemia: improved survival irrespective of cytogenetic response. The UK
Medical Research Council's Working Parties for Therapeutic Trials in Adult
Leukaemia. Lancet 345:1392 – 1397.
Altman, D.G., Vergouwe, Y and Royston, P. (2009). Prognosis and Prognostic
research: validating a prognostic model. BMJ 338: 605.
American Cancer Society (2015). Cancer Facts & Figures 2015. Atlanta, Ga:
American Cancer Society.
Ashraf, M., Chetty, G. and Tran, D. (2013). Feature selection techniques on thyroid,
hepatitis and breast cancer datasets. International Journal on data mining and
intelligent information technology 3(1): 1 -8.
Aurich J., Duchayne E., Huguet-Rigal F., Bauduer F., Navarro M., Perel Y., Pris J.,
Caballin M. R., and Dastugue N. (1998). Clinical, morphological, cytogenetic
and molecular aspects of a series of Ph-negative chronic myeloid Leukaemias.
Hematol Cell Ther 40(4): 149 - 158.
Bach, P.B., Kattan, M.W. and Thornquist, M.D. (2003). Variations in lung cancer
risk among smokers. J Natl. Cancer Inst. 95: 470 -478.
Batista, G. and Monard, M.C. (2003). An analysis of four missing data treatment
methods for supervised learning. Appl Artif Intell 17: 519 – 533.
Becker, S and Plumbley, M. (1996) Unsupervised neural network learning procedures
for feature extraction and classification. International Journal of Applied
Intelligence 6: 185-203.
Bell, D. and Wang, H. (2010). A formalism for relevance and its application in
feature subset selection. Machine Learning 41(2): 175 – 195.
Bluhm, M.V. (2011). Factors Influencing Oncologist's Use of Chemotherapy in
Patients at the end of Life: a qualitative study. Published PhD thesis of the
University of Michigan, USA. Retrieved from
https://deepblue.lib.umich.edu/bitstream/handle/2027.42/84635/mbluhm_1.pdf
on 12 January, 2016.
Blum, A.L. and Langley, P. (1997). Selection of relevant features and examples in
machine learning. Artificial Intelligence on relevance 97: 245 – 271.
Bocchi, L., Coppini, G., Nori, J. and Valli, G. (2004). Detection of single and
clustered micro-calcifications in mammograms using fractals models and
neural networks. Med Eng Phys 26: 303 – 312.
Boma, P.O., Durosinmi, M.A., Adediran, I.A., Akinola, N.O. and Salawu, L. (2006).
Clinical and Prognostic features of Nigerians with Chronic Myeloid
Leukaemia. Niger Postgrad Med J. 6:47 – 52.
Bonifazi, F., de Vivo, A., Rosti, G., Guilot, J. and Trabacchi, E. (2001). Chronic
myeloid Leukaemia and interferon-α: A study of complete cytogenetic
responders. Blood 98(10): 3074 – 3081.
Branford S, Fletcher L, Cross N.C., Muller M.C., Hochhaus A., Kim D.W., Radich,
J.P., Saglio, G., Pne, F., Kamel-Reid, S., Wang, Y.L., Press, R.D., Lynch, K.,
Rudzki, Z., Goldman, J.M. and Hughes, T. (2009) Desirable performance
characteristics for BCR-ABL measurement on an international reporting scale
to allow consistent interpretation of individual patient response and
comparison of response rates between clinical trials. Blood 112: 3330-3338.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C. J. (1984). Classification and
regression trees. Wadsworth & Brooks/Cole Advanced Books & Software,
Monterey, CA. ISBN 978-0-412-04841-8.
Burke, H.B., Bostwick, D.G. and Meiers, I. (2005). Prostate cancer outcome:
epidemiology and biostatistics. Anal Quant Cytol Histol 27: 211 – 217.
Caruana, R.A. and Freitag, D. (1994). Greedy Attribute selection. In Proceedings of
the 11th International Conference on Machine Learning, New Brunswick, NJ,
Morgan Kaufmann Publishers: 28 – 36.
Cicchetti, D.W. (1992). Neural networks and diagnosis in the clinical laboratory:
state of the art. Clin. Chem. 38: 9 – 10.
Cochran, A.I. (1997). Prediction of outcome for patients with cutaneous melanoma.
Pigment Cell Res 10: 162 – 167.
Cook, N.R. (2007). Use and misuse of the receiver operating characteristics curve in
risk prediction. Circulation 115: 928 – 935.
Cortes J.E., Talpaz M., Beran M., O'Brien S.M., Rios M.B., Stass S. and Kantarjian
H. M. (1995a). Philadelphia chromosome-negative chronic myelogenous
Leukaemia with rearrangement of the breakpoint cluster region. Long-term
follow-up results. Cancer 75(2): 464 - 470.
Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning
20(3): 273 – 278.
Cortes, J., Bruemmendorf, T. and Kantarjian, H. (2007). Efficacy and safety of
bosutinib (SKI-606) among patients with chronic phase Ph+chronic
myelogenous Leukaemia (CML). Blood 110:733 – 741.
Cortes, J.E., Kantarjian, H. and Shah, N.P. (2012). Ponatinib in refractory
Philadelphia chromosome-positive Leukaemia. N Engl J Med. 367(22): 2075-
88.
Cortes, J.E., Kantarjian, H.M. and Brümmendorf, T.H. (2011). Safety and efficacy of
Bosutinib (SKI-606) in chronic phase Philadelphia chromosome-positive
chronic myeloid Leukaemia patients with resistance or intolerance to Imatinib.
Blood 118(17): 4567-76.
Cox, D.R. (1972). Regression models and Life Tables. Journal of Stat. Soc. Serv. 34:
187 - 220.
Cruz, J.A. and Wishart, D.S. (2006). Applications of Machine Learning in Cancer
Prediction and Prognosis. Cancer Informatics 2: 59 – 75.
Czado, C., Gneiting, T. and Held, L. (2009). Predictive model assessment for cohort data.
Biometrics 65: 1254 – 1261.
Dash, M. and Liu, H.(1997). Feature selection for classification. Computational
Methods: 211 - 218.
DeAngelo, D.J. and Ritz, J. (2004). Imatinib Therapy for Patients with Chronic
Myelogenous Leukaemia: Are Patients Living Longer? Clinical Cancer
Research 10 (1-3): 1 – 3.
Deninger, M.W.N and Druker, B.J. (2003). Specific Targeted Therapy of Chronic
Myelogenous Leukaemia with Imatinib. Pharmacol Rev. 55: 401 – 423.
Dietterich, T.G. (1998). Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Comput. 10(7): 1895 – 1924.
Dimitologlou, G., Adams, J.A. and Jim, C.M. (2012). Comparison of the C4.5 and a
naïve bayes classifier for the prediction of lung cancer survivability. Journal
of Computing 4(8): 1-9.
Dohner, H., Weisdorf, D.J. and Bloomfield, C.D. (2015) Acute Myeloid Leukemia.
The New England Journal of Medicine 373(12): 1136–52.
Domchek, S.M., Eisen, A. and Calzone, K. (2003). Application of breast cancer risk
prediction models in clinical practice. J Clin Oncol 21: 593 – 601.
Duda, R.O., Hart, P.E. and Stork, D.G. (2001). Pattern classification (2nd edition):
Wiley, New York.
Durosinmi, M.A., Faluyi, J.O., Oyekunle, A.A., Salawu, L., Adediran, I.A. and
Akinola, N.O. (2008). The Use of Imatinib mesylate (Glivec) in Nigerian
patients with chronic myeloid Leukaemia. Cellular Therapy and
Transplantation 1(2):58 – 62.
Fernandez-Ranada, J.M., Lavilla, E., Odriozola, J., Garcia-Larana, J., Lozano, M. and
Parody, R. (1993). Interferon alfa 2A in the treatment of chronic myelogenous
Leukaemia in chronic phase. Results of the Spanish Group. Leuk Lymphoma
11(Supplementary 1):175 – 179.
Fielding, L.P., Fenoglio-Preiser, C.M. and Friedman, L.S. (1992). The future of
prognostic factors in outcome prediction for patients with cancer. Cancer 70:
2367 – 2377.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford University.
Gambacorti-Passerimi, C, Antohni, L., Mahon, F. and Guilhot, F. (2011). Multicentre
Independent Assessment of Outcomes in CML Patients treated with Imatinib.
Journal of National Cancer Institute 103: 553 – 561.
Gascon, F., Valle, M. and Martos, R. (2004). Childhood obesity and hormonal
abnormalities associated with cancer risk. Eur J. Cancer Previ. 13: 193 – 197.
Gauda, R. and Chahar, V. (2013). A comparative study on feature selection using
data mining tools. International Journal of advanced research in computer
science and software engineering 3(9): 26 – 33.
Gennari, J.H., Langley, P. and Fisher, D. (1989). Models of incremental concept
formation. Artificial Intelligence 40: 11 – 61.
Gerds, T.A., Cai, T. and Schumacher, M. (2008). The performance of risk models.
Biom J 50: 457 – 479.
Gratwohl, A., Brand, R., Apperley, J., Ruutu, T. and Corradini, P. (2006) Allogenic
hematopoietic stem cell transplantation for chronic myeloid Leukaemia in
Europe 2006: Transplant activity, long term data and current results. An
analysis by the Chronic Leukaemia Working Party of the Europe Group for
Blood and Marrow Transplantation (EBMT). Haematologica 91(4): 513 –
521.
Gratwohl, A., Hemans, J., Goldman, J.M., Arcese, W., Carrens, E. and Devergie, A.
(1998). Risk assessment for patients with chronic myeloid Leukaemia before
allogenic blood or marrow transplantation. Chronic Leukaemia Working
Party of the European Group for Blood and Marrow Transplantation. Lancet
352(9134): 1087 – 1092.
Greenland, S. (1989). Modeling and variable selection in epidemiologic analysis. Am
J Public Health 79: 340 – 349.
Groves, F.D., Linet, M.S. and Devesa, S.S. (1995). Patterns of occurrence of the
Leukaemias. Eur J Cancer 31A: 941 – 994.
Guilhot, F. (1993). Interferon alfa and low-dose cytosine-arabinoside for the
treatment of patients with chronic myelogenous Leukaemia in chronic phase.
French CML Study Group. Semin Hematol 30(Supplementary 3):24–5.
Guilhot, F., Guerci, A., Fiere, D., Harousseau, J.L., Maloisel, F., Bouabdallah, R. et
al. (1996). The treatment of chronic myelogenous Leukaemia by interferon
and cytosine-arabinoside: rationale and design of the French trials. French
CML Study Group. Bone Marrow Transplant 17: S29–S32.
Hagerty, R.G., Butow, P.N. and Ellis, P.M. (2005). Communicating prognosis in
cancer care: a systematic review of the literature. Ann Oncol 16: 1005 – 1053.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H.
(2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations
11(1): 1 - 23.
Hall, M.A. (1999). Correlation-based Feature Selection for Machine learning. PhD
Thesis of the University of Waikato, Hamilton, New Zealand.
Hasford, J., Ansari, H., Pfirrmann, M. and Hehlmann, R. (1996). Analysis and
Validation of Prognostic factors for CML. German CML Study Group. Bone
Marrow Transplant 17(Supplementary 3): S49 – S54.
Hasford, J., Baccarani, M., and Hoffmann, V. (2011). Predicting Complete
Cytogenetic Response and subsequent Progression Free Survival in 2060
patients with CML on Imatinib treatment: the EUTOS Score. Blood
118(3):2177-2187.
Hasford, J., Pfirrmann, M., Hehlmann, R., Allan, N.C., Baccarani, M. and Kluin-
Nelemans, J.C. (1998). A new Prognostic score for survival of patients with
chronic myeloid Leukaemia treated with interferon alfa. Writing Committee
for the Collaborative CML Prognostic Factors Project Group. J Natl Cancer
Inst. 90(11): 850 – 858.
Hehlmann, R., Heimpel, H., Hasford, J., Kolb, H.J., Pralle, H. and Hossfeld, D.K.
Randomized comparison of interferon-alpha with busulfan and hydroxyurea in
chronic myelogenous Leukaemia. German CML Study Group. Blood
84:4064–77.
Hehlmann, R., Heimpel, H., Hossfeld, D.K., Hasford, J., Kolb, H.J. and Loffler, H.
(1996). Randomized study of the combination of hydroxyurea and interferon
alpha versus hydroxyurea mono-therapy during the chronic phase of chronic
myelogenous Leukaemia. (CML Study II) The German Study Group. Bone
Marrow Transplant 17 (Supplementary 3): S21–S24.
Hemingway, H., Riley, R.D. and Altman, D.G. (2009). Ten steps towards improving
prognostic research. BMJ 339: 4184 – 4193.
Hira, Z.M., Gillies, D.F. and Curry, E. (2014). Improving Classification accuracy of
response in Leukaemia treatment using feature selection over pathway
segmentation. Imperial College, London. Retrieved from
http://www.doc.ic.ac.uk/research/technicalreports/2014/DTR14-8.pdf on 12
January, 2016.
Holland, J.H. (1992) Adaptation in Natural and Artificial Systems. The MIT Press
Cambridge, MA, USA.
Hosmer, D.W., Hosmer, T. and Le Cessie, S. (1997). A comparison of goodness-of-
fit tests for the logistic regression model. Stat Med 16: 965 – 980.
Hosseini, N. and Ahmadi, R. (2014) Individual Characteristics of Patients with
Leukemia or Lymphoma in Hamedan – Northwest Iran. International Journal
of Advances in Clinical Engineering and Biological Sciences (IJACEBS) 1(1):
74 – 75.
Hua-Liang, W. and Billings, S.A. (2007) Feature Subset Selection and Ranking for
data dimensionality reduction. IEEE Transactions On Pattern Analysis And
Machine Intelligence 29(1): 1 – 12.
Hughes, T.P., Kaeda, J., Branford, S., Rudzki, Z.,Hochhaus, A., Hensley, M.L.
(2003) Frequency of major molecular responses to Imatinib or interferon alfa
plus cytarabine in newly diagnosed chronic myeloid leukemia. N Engl J Med
349: 1423 – 1432.
Ibrahim, J.G., Chu, M. and Chen, M.H. (2012). Missing data in clinical studies:
issues and methods. J Clin. Oncol. 30: 3297 – 3303.
Idowu, P.A., Aladekomo, T.A., Williams, K.O. and Balogun, J.A. (2015). Predictive
model for likelihood of Sickle cell aneamia (SCA) among pediatric patients
using fuzzy logic. Transactions in networks and communications 31(1): 31 –
44.
Jabbour, E., Cortes, J., Nazha, A., O'Brien, S., Quintas-Cardama, A., Pierce, S.,
Garcia-Manero, G. and Kantarjian, H. (2012). EUTOS score is not predictive
for survival and outcome of patients with early chronic myeloid Leukaemia
treated with tyrosine kinase inhibitors: a single institution experience. Blood
119(19):4524-4527.
Jabbour, E., Cortes, J.E., Giles, F.J., O'Brien, S. and Kantarjian, H.M. (2007). Current
and emerging treatment options in chronic myeloid Leukaemia. Cancer
109(11):2171-2181
Jain, A.K., Marty, M.N. and Flynn, P. (1999). Data Clustering: a review. ACM
Comput. Surveys 31(3): 264 – 323.
Kaambwa, B., Bryan, S., Bilingham, L. (2012). Do the methods used to analyze
missing data really matter? An examination of data from an observational
study of intermediate care patients. BMC Res Notes 5: 330.
Kantarjian, H. and Cortes, J (2008). Chronic myeloid Leukaemia. In: Abeloff, M.D.,
Armitage, J.O., Lichter, A.S., Niederhuber, J.E., Kastan, M.B., McKenna,
W.G. Clinical Oncology. 4th ed. Philadelphia, Pa. Elsevier: 2279-2289.
Kantarjian, H., Giles, F. and Wunderle, L. (2006). Nilotinib in imatinib-resistant CML
and Philadelphia chromosome-positive ALL. N Engl J Med. 354(24):2542 -
2551.
Kantarjian, H., Sawyers, C., Hochhaus, A., Guilhot, F., Schiffer, C and Gambacorti-
Passerini, C. (2002). Hematologic and cytogenetic responses to Imatinib
mesylate in chronic myelogenous Leukaemia. N Engl J Med 346(9): 645 –
652.
Kaplan, E.L. and Meier, P. (1958). Non-Parametric estimation from incomplete
observation. Journal of American Statistical Association 53: 457 – 481.
Kloke, O., Niederle, N., Qiu, J.Y., Wandl, U., Moritz, T. and Nagel-Hiemke, M.
(1993) Impact of interferon alpha-induced cytogenetic improvement on
survival in chronic myelogenous leukaemia. Br J Haematol 83:399–403.
Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision
graphs. Unpublished PhD thesis, Stanford University.
Kohavi, R. and John, G.H. (1996). Wrappers for feature selection. Artificial
Intelligence Special review on relevance 97: 273 – 324.
Kumar, V. and Minz, S. (2014). Feature selection: a literature review. Smart
Computing Review 4(3): 211 – 229.
Landstrom A.P. and Tefferi A. (2006). Fluorescent in situ hybridization in the
diagnosis, prognosis, and treatment monitoring of chronic myeloid
Leukaemia. Leuk Lymphoma 47(3): 397 - 402.
Langley, P. and Sage, S. (1994). Oblivious decision trees and abstract cases. In
working notes of the AAA194 Workshop on Statistical Techniques in Pattern
recognition, Prague, Czech Republic: 91 – 96.
Lee, J.W. and Chung, N.G. (2011). The treatment of pediatric chronic myelogenous
Leukaemia in the Imatinib era. Korean J Pediatr. 54(3): 111 – 116.
Liaw, A. and Wiener, M. (2012). Classification and Regression Trees by random
forest. R News 2: 18 – 22.
Lichtman, M.A., Beutler, E., Kipps, T.J., (2006). Williams Hematology seventh
edition. New York, NY: McGraw-Hil: 1238 - 1245.
Liu, H. and Motoda, H. (1998). Feature selection for knowledge discovery and data
mining. Kluwer Academic Publishers, Boston.
Locatelli, F. and Niemeyer, C.M. (2015). How to treat Juvenile Myelomonocytic
Leukemia. Blood 125(7): 1083 – 1090.
Luo, S.T. and Cheng, B.W. (2010). Diagnosing breast masses in digital
mammography using feature selection and ensemble methods. J Med Syst: 1 –
9.
Machin, P.S., Dempsey, J. and Brooks, J. (1991). Using neural networks to diagnose
cancer. J Med Syst 15: 11 – 19.
Maji, P. and Garai,P. (2013). Fuzzy Rough Simultaneous Attribute Selection and
Feature Extraction Algorithm. IEEE Transactions on Cybernetics 43(4): 1 –
12.
Marin, D., Ibrahim, A.R. and Goldman, J.M. (2011). European Treatment and
Outcome Study (EUTOS) score for chronic myeloid Leukaemia still requires
more confirmation. Journal of Clinical Oncology 29(29):3944-3945.
Mark H. F., Sokolic R. A., and Mark Y. (2006). Conventional cytogenetics and FISH
in the detection of BCR/ABL fusion in chronic myeloid Leukaemia (CML).
Exp Mol Pathol 81(1): 1 - 7.
Markovitch, S. and Rosenstein, D. (2002). Feature generation using general
construction functions. Machine Learning 49: 59 – 98.
McCulloch, W. and Walter, P. (1943). A Logical Calculus of Ideas Immanent in
Nervous Activity. Bulletin of Mathematical Biophysics 5(4): 115 – 133.
Millot, F, Traore, P, Guilhot, J, Nelken, B, Leblanc, T, Leverger, G, et al. (2005)
Clinical and Biological Features at Diagnosis in 40 Children with Chronic
Myeloid Leukaemia. Pediatrics 116:140-143.
Mitchell, T. (1997). Machine Learning, McGraw Hill, New York.
Moons, K.G., Royston, P. and Vergouwe, Y. (2009). Prognosis and prognostic
research: what, why and how? BMJ 338: 375 – 381.
Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine
Learning 52: 239 – 281.
National Cancer Institute (2011). SEER Cancer Statistics Review 1975 – 2008.
Available from http://seer.cancer.gov/csr/1975_2008/ and Accessed on 23
February, 2016.
National Comprehensive Cancer Network (NCCN) (2014) Chronic Myeloid
Leukaemia. NCCN Guidelines for Patients, version 1, 2014.
Novakovic, J (2009). Using information gain attribute evaluation to classify sonar
targets. 17th Telecommunications forum (TELFOR 2009), Serbia, Belgrade,
November 24 – 29, 2009: 1351 – 1354.
Novakovic, J., Strbac, P and Bulatovic, D. (2011). Towards optimal feature selection
using ranking methods and classification algorithms. Yugoslav Journal of
Operations Research 21(1): 119 – 135.
Ohm, L. (2013). Chronic Myeloid Leukaemia: Clinical Experimental and Health
economics study with special reference to Imatinib treatment. Published
Thesis of Karolinska University Hospital, Solna and Karolinska Institute,
Stockholm, Sweden. ISBN 978-91-7549-006-9.
Ohnishi, K., Ohno, R., Tomonaga, M, Kamada, N., Onozawa, K. and Kuramoto, A.
(1995). A randomized trial comparing interferon-alfa with busulphan for
newly diagnosed chronic myelogenous Leukaemia in chronic phase. Blood
86: 906 – 916.
Okanny, C.C. and Akinyanju, O.O. (1989). Chronic Leukaemia: an African
experience. Med Oncol Tumor Pharmacotherapy 6: 189 – 194.
Oyekunle, A. (2013). Survivorship in Nigerian patients with CML: A study of 527
patients over 10 years. Paper presentation at AORTIC Conference 2013,
Durban, South-Africa.
Oyekunle, A., Bolarinwa, R., Mamman, A.I. and Durosinmi, M. (2012c). The
Treatment of Childhood and adolescent chronic Myeloid Leukaemia in
Nigeria. Journal of pediatric sciences 4(4): 1 -5. Retrieved from Research
Gate at http://www.researchgate.net/publication23704415 on 28 June, 2015.
Oyekunle, A., Klyuchnikov, E., Ocheni, S., Kroger, N., Zander, A.R. and Baccarani,
M. (2011). Challenges for Allogenic Hematopoietic Stem Cell
Transplantation in Chronic Myeloid Leukaemia in the Era of Tyrosine Kinase
Inhibitors. Acta Haematologica 126(1): 30 – 39.
Oyekunle, A.A., Adelasoye, S.B., Bolarinwa, R.A., Ayansawo, T.A., Aladekomo,
T.A., Manam, A.I. and Durosinmi, M.A. (2012a). The treatment of childhood
and adolescent chronic myeloid Leukaemia in Nigeria. Journal of Pediatric
Sciences 4(4): pages 1 – 5.
Oyekunle, A.A., Osho, P.O., Aneke, J.C., Salawu, L. and Durosinmi, M.A. (2012b).
The Predictive Value of the Sokal and Hasford scoring systems in Chronic
Myeloid Leukaemia in the Imatinib Era. Journal of Hematological
Malignancies 2(2): pages 25 – 32.
Patil, M.D. and Sane, S. (2014) Effective Classification after Dimension Reduction: A
Comparative Study. International Journal of Scientific and Research
Publications 4(7): 1 – 4.
Pencina, M.J., D'Agostino, R.B. and Demier, O.O. (2012). Novel metrics for
evaluating improvement in discrimination: net classification and integrated
discrimination improvement for normal variables and nested models. Stat
Med 31: 101 – 113.
Pertricoin, E.F. (2004). SELDI-TOF-Based serum proteomic pattern diagnosis for
early detection of cancer. Curr Opin. Biotechnol. 15: 24 – 30.
Platt, J. (1998). Fast Training of Support Vector Machines using Sequential Minimal
Optimization. Advances in Kernel Methods – Support Vector Learning, 1998.
Provan, D. and Gribben, J.G. (2010). Chronic myelogenous leukemia. Molecular
Hematology (3rd ed.). Singapore: Wiley-Blackwell: 76.
Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning 1: 81-106.
Raanani, P, Trakhtenbrot, L, Rechavi, G, Rosenthal, E, Avigdor, A, Brok-Simoni, F,
et al. (2005). Philadelphia-Chromosome-Positive T-Lymphoblastic
Leukaemia: Acute Leukaemia or Chronic Myelogenous Leukaemia Blastic
Crisis. Acta Haematol.113:181 - 189.
Robin, M., Esperous, H., Peffault, R., Petropoulou, A.D., Xhaard, A. and Ribauad,
P.l. (2010) Splenectomy after allogeneic hematopoietic stem cell
transplantation in patients with primary myelofibrosis. Brit J Hematol. 150:
721–724.
Rokach, L. and Maimon, O. (2005). Top-down induction of decision trees classifiers-
a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part
C 35 (4): 476–487. doi: 10.1109/TSMCC.2004.843247.
Rokach, L. and Maimon, O. (2008). Data mining with decision trees: theory and
applications. World Scientific Pub Co Inc. ISBN 978-9812771711.
Royston, P., Moons, K.G. and Altman, D.G. (2009). Prognosis and prognostic
research: developing a prognostic model. BMJ 338: 604 – 610.
Sebastiani, F. (2002) Machine Learning in Automated Text Categorization. ACM
Computing Surveys 34(1): 1–47.
Sharma, P., Wagner, K., Wolchok, J.D. and Allison, J.P. (2011). Novel cancer
immunotherapy agents with survival benefit: recent successes and next
steps. Cancer 11(11): 805 - 812.
Shen, L., Au, W.-Y., Guo, T., Wong, M.L., Tsuchiyama, J., Yuen, P.-W., Kwong, Y.-
L., Liang, R.H. and Srivastava, G. (2007). Proteasome inhibitor bortezomib-
induced apoptosis in natural killer (nk)-cell Leukaemia and lymphoma: an in
vitro and in vivo preclinical evaluation, Blood 110(1): 469 – 470.
Siegel, C.A., Siegel, L.S. and Hyams, J.S. (2011). Real-time tool to display the
predicted disease course and treatment response for children with Crohn's
disease. Inflamm Bowel Dis 17: 30 – 38.
Simes, R.J. (1985). Treatment selection for cancer patients: application of statistical
decision theory to the treatment of advanced ovarian cancer. J Chronic Dis.
38: 171 – 186.
Singal, A.G., Mukherjee, A. and Higgins, P.D. (2013). Machine Learning Algorithms
outperform conventional regression models in identifying risk factors for
hepatocellular carcinoma in patients with cirrhosis. Am J. Gastroenterol 108:
1124 – 1130.
Sokal, J.E., Baccarani, M., Fiacchini, M., Carrates, F., Rozman, C., Gomez, G.A. and
Galton, A.G. (1985). Prognostic discrimination among younger patients with
chronic granulocytic Leukaemia: relevance to bone marrow transplantation.
Blood 66(6): 1352 – 1357.
Sokal, J.E., Cox, E.B., Baccarani, M., Tuna, S., Gomez, G.A. and Robertson, J.E.
(1984). Prognostic discrimination in "good risk" chronic granulocytic
Leukaemia. Blood 63(4): 789 – 799.
Steyerberg, E.W., Harrell, F.E. and Borsbrom, G.J. (2001). Internal validation of
prediction models: efficiency of some procedures for logistic regression
analysis. J Clin Epidemiol. 54: 774 – 781.
Steyerberg, E.W., Vickers, A.J. and Cook, N.R. (2010). Assessing the performance
of prediction models: a framework for traditional and novel measures.
Epidemiology 21: 128 – 132.
Suttorp, M. and Millot, F. (2010). Treatment of Pediatric chronic myeloid Leukaemia
in the year 2010: Use of Tyrosine Kinase Inhibitors and Stem-Cell
Transplantation. Hematology: 368 – 376.
Swerdlow, S.H., Camp, E., Harris, N.C., Jaffe, E.S., Pilieri, S.A., Stein, H., Thiele, J.
and Varimani, J.N. (2008). WHO classification of tumours of haematopoietic
and lymphoid tissues, 4th edition, IARC Press: Lyon.
Talpaz, M., Shah, N.P. and Kantarjian, H. (2006). Dasatinib in imatinib-resistant
Philadelphia chromosome positive Leukaemias. N Engl J Med. 354(24): 2531 -
2541.
Tefferi, A., Hanson, C.A. and Inwards, D.J. (2005). How to interpret and pursue an
abnormal complete blood cell count in adults. Mayo Clin Proc. 80(7): 923 –
936.
Thaler, J. and Hilbe, W. (1996) Comparative analysis of two consecutive phase II
studies with IFN-alpha and IFN-alpha + ara-C in untreated chronic-phase
CML patients. Austrian CML Study Group. Bone Marrow Transplant
17(Supplementary 3):S25–S28.
Thaler, J., Gastl, G., Fluckinger, T., Niederweiser, D., Huber, H. and Seewan, H.
(1993). Treatment of chronic myelogenous Leukaemia with interferon alfa-
2c: response rate and toxicity in a phase II multicenter study. The Austrian
Biological Response Modifier (BRM) Study Group. Semin Hematol
30(Supplementary 3):17–19.
Thongkam, J. and Sukmak, V. (2013). Cervical cancer survivability prediction
models using machine learning techniques. Journal of Convergence
Information Technology (JCIT) 8(15): 13 – 22.
Tkachuk D.C., Westbrook C.A., Andreeff M., Donlon T.A., Cleary M.L.,
Suryanarayan K., Homge M., Redner A., Gray J., and Pinkel D. (1990).
Detection of bcr-abl fusion in chronic myelogenous Leukaemia by in situ
hybridization. Science 250(4980): 559 - 562.
Vanneschi, L., Farinaccio, A., Mauri, G., Antoniotti, M., Provero, P. and Giacobini,
M. (2011). A comparison of machine learning techniques for survival
prediction in breast cancer. Bio Data Mining 4(12): 1 -13.
Vardiman, J, Pierre, R, Thiele, J, Imbert, M, Brunning, RD, Flandrin, G. (2001).
Chronic myelogenous leukaemia. In World Health Organization
Classification of Tumors: Pathology and Genetics of Tumors of Hematopoietic
and Lymphoid Tissues. Jaffe, E, Harris, NL, Stein, H, Vardiman, JW (eds.)
Lyon, France: IARC Press, 2001:20-26.
Waijee, A., Mukherjee, A. and Singal, A. (2013b). Comparison of modern imputation
methods for missing laboratory data in medicine. BMJ Open 3(8): 1 – 7.
Waijee, A.K., Higgings, P.D.R. and Singal, A.G. (2013a). A Primer on Predictive
Models. Clinical and Translational Gastroenterology 4(44): 1 – 4.
Waijee, A.K., Joyce, J.C. and Wang, S.J. (2010). Algorithms outperform metabolite
tests in predicting response of patients with inflammatory bone disease to
thiopurines. Clin Gastroenterol Hepatol 8: 143 – 150.
Wang, J.X., Zhang, B. and Yu, Y.K. (2005). Application of serum protein finger-
printing coupled with artificial neural network model in diagnosis of
hepatocellular carcinoma. Clin Med J (Engl) 118: 1278 – 1284.
Wang, K., Bell, D. and Murtagh, F. (1998). Relevance Approach to feature subset
selection: Kluwer Academic Publishers, Boston: 85 – 97.
Weston, A.D. and Hood, L. (2004). Systems biology proteomics and the future of
healthcare towards predictive and personalized medicine. J Proteomic Res. 3:
179 – 196.
Yildirim, P. (2015). Filter-Based Feature Selection Methods for Prediction of Risks
in Hepatitis Disease. International Journal of Machine Learning and
Computing 5(4): 258 – 263.
Yu, L. and Liu, H. (2004). Efficient feature selection via analysis of relevance and
redundancy. JMLR 5: 1205 – 1224.
Yussuff, H., Mohammad, N., Ngah, Y.K. and Yahaya, A.S. (2012). Breast cancer
analysis using logistic regression. IJRRAS 10(1): 14 -22.
Zamir, O., Etzioni, O., Madani, O. and Karp, R.M. (1997). Fast and Intuitive
Clustering of Web Documents. KDD’97: 287–290.
Zhang, S., Zhang, C. and Yang, Q. (2002). Data preparation for data mining. Appl
Artif. Intell.17: 375 – 381.
Zhao, Y. and Karypis, G. (2002) In Proceedings of CIKM. Evaluation of Hierarchical
Clustering Algorithms for Document Datasets: 1 – 7.
Zhou, X., Liu, K.T. and Wong, S.T. (2004). Cancer classification and prediction
using logistic regression with Bayesian gene selection. J. Biomed. Inform 37:
249 – 259.