CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
According to the American Cancer Society (2015), Chronic Myeloid Leukaemia (CML) is a type of cancer that affects the blood cells. The body is
made up of trillions of living cells. Normal body cells grow, divide to make new
cells, and die in an orderly way. During the early years of a person's life, normal cells divide only to replace worn-out, damaged, or dying cells. Cancer begins when cells
in a part of a body start to grow out of control (Cortes et al., 2011). There are many
kinds of cancer, but they all start because of this out-of-control growth of abnormal
cells. Cancer growth is different from normal cell growth. Instead of dying, cancer
cells keep on growing and form new cancer cells. These cancer cells can grow into
(invade) other tissues, something that normal cells cannot do (Cortes et al., 2012).
Being able to grow out of control and invade other tissues is what makes a cell a
cancer cell. In most cases, the cancer cells form a tumor. But some cancers, like
Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone
marrow. When cancer cells get into the bloodstream or lymph vessels, they can travel
to other parts of the body (Kantarjian and Cortes, 2008). There they begin to grow
and form new tumors that replace normal tissues. This process is called metastasis.
Leukaemia is a type of cancer that starts in cells that form new blood cells.
These cells are found in the soft, inner part of the bones called bone marrow. Chronic
myeloid Leukaemia (CML), also known as chronic myelogenous Leukaemia, is a
fairly slow growing cancer that starts in the bone marrow. It is a type of cancer that
affects the myeloid cells – cells that form blood cells, such as red blood cells,
platelets, and many types of white blood cells. In CML, Leukaemia cells tend to build
up in the body over time. In many cases, people do not have any symptoms for at
least a few years. CML can also change into a fast-growing, acute Leukaemia that
invades almost any organ in the body. Most cases of CML occur in adults, but it is
also very rarely found in children. As a rule, their treatment is the same as that for
adults.
According to Durosinmi et al. (2008), chronic myeloid Leukaemia (CML) has an annual worldwide incidence of 1/100,000 with a male-to-female ratio of 1.5:1. The median age of the disease incidence is about 60 years (Deininger and Druker, 2003).
In Nigeria and other African countries with similar demographic pattern, the median
age of the occurrence of CML is 38 years (Boma et al., 2006; Okanny et al., 1989).
In the United States of America (USA) however, the incidence of CML in the age
group under 70 years is higher among the African-Americans than among any other
racial/ethnic groups (Groves et al., 1995). It is probable that a combination of
environment and as yet unknown biological factors may account for the differential
age incidence pattern of CML between the Blacks and other races in the USA.
According to Oyekunle et al. (2012a), pediatric CML is rare, accounting for less than
10% of all cases of CML and less than 3% of all pediatric Leukaemias. Incidence increases with age: it is exceptionally rare in infancy, at about 0.7 per million/year at ages 1-14 years, rising to 1.2 per million/year in adolescents worldwide (Lee and Chung, 2011).
To date, only allogeneic stem cell transplantation (SCT) remains curative for chronic myeloid leukaemia (Robin et al., 2005), though its role has waned significantly in recent times due to the effectiveness of the tyrosine kinase inhibitors (TKIs) (Oyekunle et al., 2011; Oyekunle et al., 2012b). Although potentially curative,
SCT is associated with significant morbidity and mortality (Gratwohl et al., 1998).
Alpha interferon-based regimens adequately control the chronic phase of the disease,
but result in few long-term survivors (Bonifazi et al., 2001). Advances in targeted therapy resulted in the discovery of Imatinib mesylate, a selective competitive inhibitor of the BCR-ABL protein tyrosine kinase, which has been demonstrated to induce
both hematologic and cytogenetic remission in a significant proportion of CML
patients (Kantarjian et al., 2002). A number of prognostic scoring systems have been
developed for patients with CML, of which Sokal and Hasford (or Euro) scores are
most popular (Gratwohl et al., 2006). The Sokal score was generated using chronic
phase CML patients treated with busulphan or hydroxyurea (Sokal et al., 1984), while
the Hasford score was derived and validated, using patients treated with Interferon-
alpha (Hasford et al., 1998).
Survival Analysis deals with the application of methods to estimate the
likelihood of an event (death, survival, decay, child-birth etc.) occurring over a
variable time period (Dimitologlou et al., 2012); in short, it is concerned with
studying the time between entry to a study and a subsequent event (such as death).
The traditional statistical methods applied in the area of survival analysis include the Kaplan-Meier (KM) estimator (Kaplan and Meier, 1958) and the Cox proportional hazards (PH) model (Cox, 1972). The Kaplan-Meier method is a non-parametric estimator of the proportion of a population that survives a given length of time under the same circumstances, while the semi-parametric Cox model is a statistical technique for exploring the relationship between the survival of a patient and several explanatory variables.
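As an illustration, the sketch below computes the Kaplan-Meier product-limit estimate directly; the follow-up times (in months) and event indicators are hypothetical placeholders, not data from this study.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of S(t) from follow-up times (months)
    and event indicators (1 = death observed, 0 = censored)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    n_at_risk = len(times)
    survival = 1.0
    curve = []
    for t in np.unique(times):
        at_t = times == t
        deaths = events[at_t].sum()
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk  # step down at each death time
            curve.append((t, survival))
        n_at_risk -= at_t.sum()                   # deaths and censorings leave the risk set
    return curve

# Hypothetical follow-up data: months observed and event flags.
follow_up = [6, 12, 12, 18, 24, 30, 36, 60]
died      = [0,  1,  0,  1,  0,  1,  0,  0]
for t, s in kaplan_meier(follow_up, died):
    print(f"S({t:.0f} months) = {s:.3f}")
```

Before the advent of Imatinib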
as a treatment option for Chronic Myeloid Leukaemia (CML), the median survival time for CML was 3 - 5 years from the time of diagnosis of the disease (Hosseini and Ahmadi, 2013). According to Gambacorti-Passerini et al. (2011), a follow-up of
832 patients using Imatinib showed an overall survival rate of 95.2% after 8 years. A
10-year follow-up of 527 patients in Nigeria undergoing Imatinib treatment showed
an overall survival rate of 92% and 78% after 2 and 5 years respectively (Oyekunle et
al., 2013).
Machine learning (ML) is a branch of artificial intelligence that allows
computers to learn from past examples of data records (Quinlan, 1986; Cruz and
Wishart, 2006). Unlike traditional explanatory statistical modeling techniques,
machine learning does not rely on a prior hypothesis (Waijee et al., 2013a). Machine learning has found great importance in the area of predictive modeling in medical research, especially in the areas of risk assessment, survival prediction and recurrence prediction. Machine learning techniques can be broadly classified into supervised and unsupervised techniques; the former involves matching a set of input records to one of two or more target classes, while the latter is used to create clusters or attribute relationships from raw, unlabeled or unclassified datasets (Mitchell, 1997).
Supervised machine learning algorithms can be used in the development of
classification or regression models. A classification model is a supervised approach that allocates a set of input records to a discrete target class, unlike regression, which maps a set of records to a real value. This research focuses on using classification models to classify patients' survival as either survived or not survived.
Feature selection methods are unsupervised machine learning techniques used to identify the relevant attributes in a dataset. They are important for identifying irrelevant and redundant attributes, which may increase computational complexity and time (Yildirim, 2015; Hall, 1999). Feature selection methods are broadly classified as filter-based, wrapper-based and embedded methods. Filter-based methods were chosen for this study because of their ability to identify relevant attributes with respect to the target class (CML patient survival), unlike wrapper-based methods, which rely on the performance of a machine learning algorithm. Filter-based feature selection methods were used to identify the most relevant variables predictive of CML patient survival from the variables monitored during the follow-up of Imatinib treatment administered to Nigerian CML patients. The relevant features proposed using feature selection were used to formulate the predictive model for CML patients' survival classification using supervised machine learning techniques.
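As a hedged illustration of the filter-based approach, the sketch below scores hypothetical follow-up attributes against a binary survival class using scikit-learn; the attribute names and data are invented placeholders, not the variables elicited in this study.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical follow-up variables, for illustration only.
feature_names = ["age", "spleen_size_cm", "wbc_count", "platelet_count",
                 "basophils_pct", "blasts_pct", "haematocrit_pct"]
rng = np.random.default_rng(0)
X = rng.normal(size=(120, len(feature_names)))   # stand-in patient records
y = rng.integers(0, 2, size=120)                 # survived / not survived

# Filter-based selection scores each attribute against the target class
# directly, without consulting any downstream classifier.
selector = SelectKBest(score_func=mutual_info_classif, k=4).fit(X, y)
for name, score, kept in zip(feature_names, selector.scores_,
                             selector.get_support()):
    print(f"{name:16s} score={score:.3f} selected={kept}")
```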
1.2 Statement of Research Problem
Chronic Myeloid Leukaemia (CML) is a serious disease among Nigerians, yet there is just one government referral hospital in Nigeria that administers Imatinib treatment, with a limited number of experts compared to the number of cases attended to. In Nigeria, hematologists rely on scoring models proposed using datasets belonging to Caucasian (white race) and/or non-African CML patients undergoing treatment before the Imatinib era (e.g. the Sokal score used busulphan- or hydroxyurea-treated patients, the Hasford score used interferon-alfa, and the EUTOS score derives from the European Treatment and Outcome Study). These models have been deemed ineffective for Nigerian CML patients who are undergoing Imatinib treatment, and as such there is presently no existing predictive
model in Nigeria specifically for the survival of CML patients undergoing Imatinib
treatment. There is a need for a predictive model which will aid clinical decisions
concerning continuing treatment or alternative action affecting the survival of CML
patients receiving Imatinib treatment, hence this study.
1.3 Aim and Objectives of the Study
The aim of this research is to develop a predictive model which identifies the relevant
attributes required for classifying the survival of Chronic Myeloid Leukaemia patients
receiving Imatinib treatment in Nigeria using machine learning techniques. The
specific research objectives are to:
i. elicit knowledge on the variables monitored during the follow-up of
Imatinib treatment;
ii. propose the variables predictive for CML survival from (i) and use them to
formulate the predictive model;
iii. simulate the predictive model formulated in (ii); and
iv. validate the model in (iii) using historical data.
1.4 Research Methodology
In order to achieve the above listed objectives, the methodological approach for this
study was performed using the following methods.
A formal interview was conducted with two (2) hematologists to identify the parameters used to monitor survival, and anonymized, validated information about patients was collected.
Filter-based feature selection methods were used to identify the most relevant variables (prognostic factors) predictive of survival from the variables identified, after which the predictive model was formulated using supervised machine learning algorithms.
The formulated model was simulated using the Explorer interface of the Waikato Environment for Knowledge Analysis (WEKA) software, a lightweight Java-based suite of machine learning tools, using its preprocess, classify and select-attributes panels.
The collected historical data was used to validate the performance of the
model by determining the confusion matrix, recall, precision, accuracy and the area
under the Receiver Operating Characteristics (ROC) curve.
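A minimal sketch of how these validation metrics can be computed, here with scikit-learn on made-up hold-out labels and scores (WEKA reports the same quantities in its Classify output):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical hold-out labels (1 = survived) and model outputs.
y_true  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.91, 0.12, 0.78, 0.45, 0.30, 0.84, 0.58, 0.66, 0.72, 0.25]

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))  # area under the ROC curve
```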
1.5 Research Justification
The Nigerian Health sector has set ambitious targets for providing essential
health services to all citizens; improving the quality of decisions affecting treatment
options is very essential to reducing disease mortality rates in Nigeria. Predictive
models for Chronic Myeloid Leukaemia (CML) survival classification can help
identify the most relevant variables for patient survival and thus allow physicians to concentrate on a smaller number of important variables during clinical observations.
1.6 Scope and Limitations of the Study
This study is limited to the classification of 2-year and 5-year survival of Nigerian Chronic Myeloid Leukaemia (CML) patients receiving follow-up for Imatinib treatment at Obafemi Awolowo University Teaching Hospital Complex (OAUTHC), Ile-Ife, Osun State. Also, the dataset used for this study was based on information collected from a single centre and a relatively limited number of CML patients.
1.7 Organization of Thesis
The first chapter of this thesis has been presented and the organization of the
remaining chapters is discussed in the following paragraphs.
Chapter two contains the Literature Review which consists of an introduction
to chronic myeloid Leukaemia (CML), its etiology, treatment and distribution around
the world, in Africa and in Nigeria; survival analysis and the existing stochastic methods
(Kaplan-Meier and the Cox proportional hazard models); Machine learning –
supervised, unsupervised and application of machine learning in healthcare; Feature
selection methods; Existing survival models and related works.
Chapter three contains the Research Methodology which consists of the
research framework, data collection methods, data identification and variable
description, feature selection results, model formulation methods – supervised
machine learning algorithms proposed, model simulation and the performance
evaluation metrics to be used.
Chapter four contains the Results and Discussion, which consists of the descriptive statistics of the data collected from the referral hospital, feature selection results and discussions, simulation results and discussions, and the performance evaluation of the machine learning algorithms used.
Chapter five contains the summary, conclusion, recommendations and the
possible future works of the study.
CHAPTER TWO
LITERATURE REVIEW
2.1 Chronic Myeloid Leukaemia (CML)
According to DeAngelo and Ritz (2004), chronic myeloid leukaemia (CML) is
a clonal hematopoietic disorder characterized by the reciprocal translocation
involving chromosomes 9 and 22. As a result of this translocation, a novel fusion
gene, BCR-ABL is created and constitutive activity of this tyrosine kinase plays a
central role in the pathogenesis of the disease process. Cancer begins when cells in a
part of the body start to grow out of control. There are many kinds of cancer, but they
all start because of this out-of-control growth of abnormal cells (NCCN, 2014).
Cancer cell growth is different from normal cell growth. Instead of dying, cancer
cells keep on growing and form new ones. These cancer cells can grow into (invade)
other tissues, something that normal cells cannot do (National Cancer Institute NCI,
2011). Being able to grow out of control and invade other tissues is what makes a cell
a cancer cell. In most cases the cancer cells form a tumor. But some cancers, like
Leukaemia, rarely form tumors. Instead, these cancer cells are in the blood and bone
marrow (see Figure 2.1 for family tree of blood cells).
Leukaemia is a group of cancers that usually begins in the bone marrow and results in high numbers of abnormal white blood cells (NCI, 2011). These white blood
cells are not fully developed and are called blasts or Leukaemia cells. Symptoms may
include bleeding and bruising problems, feeling very tired, fever and an increased risk
of infections.
Figure 2.1: Family Tree of Blood Cells
(Source: NCCN, 2014)
These symptoms occur due to a lack of normal blood cells. Diagnosis is typically by
blood tests or bone marrow biopsy. Clinically and pathologically, Leukaemia is
subdivided into a variety of large groups.
The first division is between its acute and chronic forms:
Acute Leukaemia is characterized by a rapid increase in the number of
immature blood cells (Locatelli and Niemeyer, 2015). Crowding due to such cells
makes the bone marrow unable to produce healthy blood cells. Immediate treatment is
required in acute Leukaemia due to the rapid progression and accumulation of the
malignant cells, which then spill over into the bloodstream and spread to other organs
of the body (Dohner et al., 2015). Acute forms of Leukaemia are the most common
forms of Leukaemia in children.
Chronic Leukaemia is characterized by the excessive buildup of relatively
mature, but still abnormal, blood cells. Typically taking months or years to progress,
the cells are produced at a much higher rate than normal, resulting in many abnormal
white blood cells (Shen et al., 2007). Whereas acute Leukaemia must be treated
immediately, chronic forms are sometimes monitored for some time before treatment
to ensure maximum effectiveness of therapy (Provan and Gribben, 2010). Chronic
Leukaemia mostly occurs in older people, but can theoretically occur in any age
group.
Additionally, the diseases are subdivided according to which kind of blood
cell is affected (Hira et al., 2014). This split divides Leukaemias into lymphoblastic or
lymphocytic Leukaemias and myeloid or myelogenous Leukaemias (Table 2.1):
Table 2.1: The Four Major Kinds of Leukaemia

- Lymphocytic (or lymphoblastic) Leukaemia: acute form, Acute lymphoblastic Leukaemia (ALL); chronic form, Chronic lymphocytic Leukaemia (CLL)
- Myelogenous (myeloid or nonlymphocytic) Leukaemia: acute form, Acute myelogenous Leukaemia (AML or myeloblastic); chronic form, Chronic myelogenous Leukaemia (CML)

(Source: Hira et al., 2014)
In lymphoblastic or lymphocytic Leukaemias, the cancerous change takes
place in a type of marrow cell that normally goes on to form lymphocytes, which are
infection-fighting immune system cells. Most lymphocytic Leukaemias involve a
specific subtype of lymphocyte, the B cell. In myeloid or myelogenous Leukaemias,
the cancerous change takes place in a type of marrow cell that normally goes on to
form red blood cells, some other types of white cells, and platelets.
Chronic myelogenous (or myeloid or myelocytic) Leukaemia (CML), also
known as chronic granulocytic Leukaemia (CGL), is a cancer of the white blood cells.
It is a form of Leukaemia characterized by increased and unregulated growth of
predominantly myeloid cells in the bone marrow and the accumulation of these cells
in the blood. CML is a clonal bone marrow stem cell disorder in which a proliferation
of mature granulocytes (neutrophils, eosinophils and basophils) and their precursors is
found. It is a type of myeloproliferative disease associated with a characteristic
chromosomal translocation called the Philadelphia chromosome (Figure 2.2).
Chronic myeloid Leukaemia (CML) is defined by the presence of the Philadelphia
chromosome (Ph) which arises from the reciprocal translocation of the ABL1 and BCR genes on chromosomes 9 and 22 respectively (Oyekunle et al., 2012c).
CML is characterized by the proliferation of a malignant clone containing the
BCR-ABL1 mutant fusion gene resulting in myeloid hyperplasia and peripheral blood
leucocytosis and thrombocytosis. It is believed that pediatric CML is rare, accounting
for less than 10% of all cases of CML and less than 3% of all pediatric Leukaemias
(Lee and Chung, 2011). Incidence increases with age: it is exceptionally rare in infancy, at about 0.7 per million/year at ages 1-14 years, rising to 1.2 per million/year in adolescents (National Cancer Institute (NCI), 2011).
Figure 2.2: Philadelphia Chromosome and BCR-ABL gene
Generally, children are diagnosed at a median age of 11 – 12 years (range, 1 –
18 years) with approximately 10% presenting in advanced phases (Suttorp and Millot,
2010).
2.1.1 CML diagnosis
According to the National Comprehensive Cancer Network (NCCN)
Guideline for Patients on CML (2014), in order to diagnose Chronic Myeloid
Leukaemia (CML), doctors use a variety of tests to analyze the blood and marrow
cells. This is because there are no special tests used in diagnosing CML. The best
form of diagnosis is the early report of symptoms. The following are a number of
tests useful in diagnosing CML in patients.
a. Complete Blood Count (CBC)
This test is used to measure the number and types of cells in the blood.
According to Tefferi et al. (2005), people with CML often have: a decreased hemoglobin concentration; an increased white blood cell count, often to very high levels; and a possible increase or decrease in the number of platelets, depending on the severity of the person's CML. Blood cells are stained (dyed) and examined with a light microscope. These samples show a specific pattern of white blood cells: a small proportion of immature cells (leukemic blast cells and promyelocytes) and a larger proportion of maturing and fully matured white blood cells (myelocytes and neutrophils). These blast cells, promyelocytes and myelocytes are normally not present in the blood of healthy individuals.
b. Bone Marrow Aspiration and Biopsy
These tests are used to examine marrow cells to find abnormalities and are
generally done at the same time (Raanani et al., 2005). The sample is usually taken from the patient's hip bone after medicine has been given to numb the skin (Figure
2.3). For a bone marrow aspiration, a special needle is inserted through the hip bone
and into the marrow to remove a liquid sample of cells. For a bone marrow biopsy, a
special needle is used to remove a core sample of bone that contains marrow. Both
samples are examined under a microscope to look for chromosomal and other cell
changes (Vardiman et al., 2001).
c. Cytogenetic Analysis
This test measures the number and structure of the chromosomes. Samples
from the bone marrow are examined to confirm the blood test findings and to see if
there are chromosomal changes or abnormalities, such as the Philadelphia (Ph)
chromosome (Cortes et al., 1995). The presence of the Ph chromosome (the shortened
chromosome 22) in the marrow cells, along with a high white blood cell count and
other characteristic blood and marrow test findings, confirms the diagnosis of CML.
The bone marrow cells of some people with CML have a Ph chromosome detectable
by cytogenetic analysis (Aurich et al., 1998). A small percentage of people with
clinical signs of CML do not have cytogenetically detectable Ph chromosome, but
they almost always test positive for the BCR-ABL fusion gene on chromosome 22
with other types of tests.
d. FISH (Fluorescence In Situ Hybridization)
FISH is a more sensitive method for detecting CML than the standard
cytogenetic tests that identify the Ph chromosome (Mark et al., 2006; Tkachuk et al.,
1990). FISH is a quantitative test that can identify the presence of the BCR-ABL gene
(Figure 2.4). Genes are made up of DNA segments. FISH uses color probes that bind
to DNA to locate the BCR and ABL genes in chromosomes. Both BCR and ABL
genes are labeled with chemicals each of which releases a different color of light
(Landstrom and Tefferi, 2006).
Figure 2.3: Bone marrow biopsy
Figure 2.4: Identifying the BCR-ABL Gene Using FISH
The color shows up on the chromosome that contains the gene— normally
chromosome 9 for ABL and chromosome 22 for BCR—so FISH can detect the piece
of chromosome 9 that has moved to chromosome 22. The BCR-ABL fusion gene is
shown by the overlapping colors of the two probes. Since this test can detect BCR-
ABL in cells found in the blood, it can be used to determine if there is a significant
decrease in the number of circulating CML cells as a result of treatment.
e. Polymerase Chain Reaction (PCR)
The BCR-ABL gene is also detectable by molecular analysis. A quantitative
PCR test is the most sensitive molecular testing method available. This test can be
performed with either blood or bone marrow cells (Branford et al., 2008). The PCR
test essentially increases or "amplifies" small amounts of specific pieces of either
RNA or DNA to make them easier to detect and measure. So, the BCR-ABL gene
abnormality can be detected by PCR even when present in a very low number of cells
(Hughes et al., 2003). About one abnormal cell in one million cells can be detected by
PCR testing. Quantitative PCR is used to determine the relative number of cells with
the abnormal BCR-ABL gene in the blood (Hughes et al., 2003). This has become
the most used and relevant type of PCR test because it can measure small amounts of
disease, and the test is performed on blood samples, so there is no need for a bone
marrow biopsy procedure. Blood cell counts, bone marrow examinations, FISH and
PCR may also be used to track a person's response to therapy once treatment has
begun (Ohm, 2013). Throughout treatment, the number of red blood cells, white blood
cells, platelets and CML cells is also measured on a regular basis.
2.1.2 Phases of chronic myeloid Leukaemia
Staging is the process of finding out how far a cancer has spread. Most types
of cancer are staged based on the size of the tumor and how far it has spread from
where it started. This system does not work for Leukaemias because they do not often form a solid mass or tumor (Vardiman et al., 2001; Millot et al., 2005). Also, Leukaemia starts in the bone marrow and, in many people, it has already spread to other organs by the time it is found. For someone with chronic myeloid Leukaemia (CML), the outlook depends on other factors such as features of the cells shown in lab tests and the results of imaging studies (Raanani et al., 2005). This information helps guide treatment decisions.
In Chronic myeloid Leukaemia, there are three phases. As the number of blast cells increases in the blood and bone marrow, there is less room for healthy white blood cells, red blood cells and platelets. This may result in infections, anemia and easy bleeding, as well as bone pain and pain or a feeling of fullness below the ribs on the left side. The number of blast cells in the blood and bone marrow and the severity of signs and symptoms determine the phase of the disease (NCI, 2011). The
three (3) phases of CML are:
a. Chronic Phase;
b. Accelerated Phase; and
c. Blast Crisis Phase.
A number of patients progress from the chronic phase, which can usually be well-managed, to the accelerated phase or blast crisis phase. This is because of additional genetic changes in the leukemic stem cells. Some of these additional chromosome abnormalities are identifiable by cytogenetic analysis (Cortes et al., 1995a; Aurich et al., 1998). However, there appear to be other genetic changes (low levels of drug-resistant mutations that may be present at diagnosis) in the CML stem cells that cannot be identified by the laboratory tests that are currently available.
a. Chronic Phase
The chronic phase is the first phase of CML; the number of white blood cells is increased, and immature white blood cells (blasts) make up less than 10% of cells in the peripheral blood and/or bone marrow. This means that fewer than 10 out of every 100 cells are blasts. CML in the chronic phase may cause mild symptoms, such as infections, but most often it does not cause any symptoms, since the changes in the blood cells are not severe. In this phase, the cancer
progresses very slowly. Thus, CML in this phase may progress over several months
or years. In general, people with CML in the chronic phase respond better to
treatment.
b. Accelerated Phase
The accelerated phase is the second phase of CML. In this phase, the number
of blast cells in the peripheral blood and/or bone marrow is usually higher than
normal. Other aspects of accelerated phase can include increased basophils, very low
platelets, or new chromosome changes. The number of white blood cells is also high.
In this phase, the Leukaemia cells grow more quickly and may cause symptoms such
as anemia and an enlarged spleen. A few different criteria groups can be used to define the accelerated phase; the two most commonly used are the World Health Organization criteria and the criteria from the MD Anderson Cancer Centre (Table 2.2).
c. Blast Crisis Phase
The blast phase is the final phase of CML progression. Also referred to as "blast crisis", CML in this phase can be life-threatening (NCCN, 2014). There are two criteria groups that may be used to define the blast phase (Table 2.3).
Table 2.2: Criteria for Accelerated Phase

MD Anderson criteria:
- 15% blasts in peripheral blood
- 30% blasts and promyelocytes in peripheral blood
- 20% basophils in peripheral blood
- Very high or very low platelet count that is unrelated to treatment
- Increasing spleen size and white blood cell count despite treatment
- New chromosome changes (mutations)

World Health Organization criteria:
- 10% to 19% blasts in peripheral blood and/or bone marrow
- 20% basophils in peripheral blood
- Very low platelet count that is unrelated to treatment
- New chromosome changes (mutations)

(Source: Faderi et al., 1999; Swerdlow et al., 2008)
Table 2.3: Criteria for Blast Phase

World Health Organization criteria:
- 20% blasts in peripheral blood or bone marrow
- Blasts found outside of blood or bone marrow
- Large groups of blasts found in bone marrow

International Bone Marrow Transplant Registry criteria:
- 30% blasts in peripheral blood or bone marrow
- Blasts found outside of blood or bone marrow

(Source: Swerdlow et al., 2008; Druker, 2007)
In this phase, the number of blast cells in the peripheral blood and/or bone
marrow is very high. Another defining feature of blast phase is that the blast cells
have spread outside the blood and/or bone marrow into other tissues (Swerdlow et al.,
2008). In the blast phase, the Leukaemia cells may be more similar to Acute myeloid
Leukaemia (AML) or Acute lymphoblastic Leukaemia (ALL). AML causes too many immature white blood cells called myeloblasts to be made. ALL results in too many immature white blood cells called lymphoblasts.
2.1.3 Chronic myeloid Leukaemia (CML) Treatment
There is more than one treatment for chronic myeloid Leukaemia; the type of treatment depends on factors such as age, general health, and the phase of the cancer. Some people with CML may have more than one treatment (Talpaz et al.,
2006; Kantarjian et al., 2006; Cortes et al., 2007). Primary treatment is the main
treatment used to rid the body of cancer. Tyrosine Kinase Inhibitors (TKIs) are often
used as primary treatment for CML. First-line treatment is the first set of treatments
given to CML patients and if the treatment fails, second-line treatment is the next
treatment or set of treatments given (Lichtman et al., 2006). This is also referred to as
follow-up treatment since it is given after follow-up tests show that the previous
treatment failed or stopped working.
a. Tyrosine kinase inhibitor (TKI) therapy
TKI (tyrosine kinase inhibitor) therapy is a type of targeted therapy used to
treat CML. Targeted therapy is treatment with drugs that target a specific or unique
feature of cancer cells not generally present in normal cells; as a result of targeting
cancer cells, they may be less likely to harm normal cells throughout the body
(Jabbour et al., 2012). TKIs target the abnormal BCR-ABL protein that causes the
overgrowth of abnormal white blood cells (CML cells). The BCR-ABL protein, made
by the BCR-ABL gene, is a type of protein called a tyrosine kinase. Tyrosine kinases
are proteins located on or near the surface of cells and they tell cells when to grow
and divide to make new cells (Jabbour et al., 2007). TKIs block (inhibit) the BCR-
ABL protein from sending the signals that cause too many abnormal white blood cells
to form. However, each TKI works in a different way.
The FDA (Food and Drug Administration) approved the first TKI for the
treatment of CML in 2001. Since then, several TKIs have been developed to treat
CML. The newer drugs are referred to as Second-generation TKIs. The TKIs used to
treat CML are listed in Table 2.4; the drugs made in the form of pills are swallowed
by a patient. The dose of the drug is measured in mg (milligrams).
Imatinib was the first TKI to be approved to treat CML. Thus, it is called a first-
generation TKI. Imatinib works by binding to the active site on the BCR-ABL
protein to block it from sending signals to make new abnormal white blood cells
(CML cells). Figure 2.5 shows how Imatinib treatment works.
Dasatinib is a second-generation TKI that was approved for the treatment of CML in
2006. Dasatinib is more potent than Imatinib and can bind to the active and inactive
sites on the BCR-ABL protein to block growth signals.
Nilotinib was approved to treat CML in 2007. It is a second-generation TKI that
works in almost the same way as Imatinib. However, Nilotinib is more potent than
Imatinib and it more selectively targets the BCR-ABL protein. Nilotinib also targets other proteins apart from the BCR-ABL protein.
Bosutinib was approved to treat CML in 2012. However, this second-generation TKI
is only approved to treat patients who experienced intolerance or resistance to prior
TKI therapy. It also targets other proteins, in the same way as Nilotinib.
Table 2.4: Tyrosine Kinase Inhibitor (TKI) drugs used to treat CML

Imatinib (sold as Gleevec®): first-line treatment for
1. newly diagnosed adults and children in chronic phase; and
2. adults in chronic, accelerated or blast phase after failure of interferon-alfa therapy.

Dasatinib (sold as Sprycel®): first-line treatment for
1. newly diagnosed adults in chronic phase; and
2. adults resistant or intolerant to prior therapy in chronic, accelerated or blast phase.

Nilotinib (sold as Tasigna®): first-line treatment for
1. newly diagnosed adults in the chronic phase; and
2. adults resistant or intolerant to prior therapy in chronic or accelerated phase.

Bosutinib (sold as Bosulif®): second-line treatment for
1. adults with chronic, accelerated or blast phase with resistance or intolerance to prior therapy.
Figure 2.5: How Imatinib works
Side effects are new or worsened unplanned physical or emotional conditions caused by
treatment. Each TKI for CML can cause side effects which depend on: the drug, the
amount taken, the length of treatment, and the person; most side effects can be
managed or even prevented. Supportive care is the treatment of symptoms caused by
CML or side effects caused by CML treatment.
b. Immunotherapy
The immune system is the body's natural defense against infection and disease. Immunotherapy is treatment with drugs that boost the immune system
response against cancer cells (Sharma et al., 2011). Interferon is a substance naturally
made by the immune system. Interferon can also be made in a laboratory to be used
as immunotherapy for CML. PEG (pegylated) interferon is a long-acting form of the
drug. Interferon is not recommended as a first-line treatment option for patients with
newly diagnosed CML. But, it may be considered for patients unable to tolerate TKIs
(NCCN, 2014). Interferon is often given as a liquid that is injected under the skin or
in a muscle with a needle.
c. Chemotherapy
Chemotherapy is a type of drug commonly used to treat cancer. Many people refer to this treatment as "chemo". Chemotherapy drugs kill cells that grow rapidly,
including cancer cells and normal cells. Different types of chemotherapy drugs attack
cancer cells in different ways. Therefore, more than one drug is often used (Bluhm,
2011).
Omacetaxine is one of the chemotherapy drugs used for CML treatment; it was approved in 2012 by the FDA for patients with resistance and/or intolerance to two or more TKIs. Resistance is when a CML patient does not respond to treatment;
intolerance is when treatment with a drug must be stopped due to severe side effects.
Omacetaxine works in part by blocking cells from making some of the proteins, such
as the BCR-ABL protein, needed for cell growth and division. This may slow or even
stop the growth of new CML cells.
Omacetaxine is administered as a liquid that is injected under the skin with a
needle. Other chemotherapy drugs may be given as a pill that is swallowed (NCCN,
2014). Chemotherapy is given in cycles of treatment days followed by days of rest.
The number of treatment days per cycle and the total number of cycles varies
depending on the chemotherapy drug given.
d. Stem cell transplant and donor lymphocyte infusion
An HSCT (hematopoietic stem cell transplant) is a medical procedure that
kills damaged or diseased blood stem cells in the body and replaces them with healthy
stem cells. HSCT is currently the only treatment for CML that may cure rather than
control the cancer. However, the excellent results with TKIs have challenged the role
of HSCT as first-line of treatment – the first set of treatments given to treat a disease.
For the treatment of CML, healthy blood stem cells are collected from another
person, called a donor. This is called an allogeneic HSCT. An allogeneic HSCT creates a new immune system for the body. The immune system is the body's natural defense against infection and disease. For this type of transplant, Human Leukocyte Antigen (HLA) testing is needed to check if the patient and donor are a good match.
A Donor Lymphocyte Infusion (DLI) is a procedure in which the patient
receives lymphocytes from the same person who donated blood stem cells for the
HSCT. A lymphocyte is a type of white blood cell that helps the body to fight infections. The purpose of the DLI is to stimulate an immune response called the Graft-versus-tumor (GVT) effect or Graft-versus-Leukaemia (GVL) effect. The GVT
effect is when the transplanted cells (the graft) see the cancer cells (tumor/Leukaemia)
as foreign and attack them. This treatment may be used after HSCT for CML that
did not respond to the transplant or that came back after an initial response.
2.1.4 Measuring CML treatment response
Measuring the response to treatment with blood and bone marrow testing is a
very important part of treatment for people with CML. In general terms, the greater
the response to drug therapy, the longer the disease will be controlled. Other factors
that affect a person‘s response to treatment include: the stage of the disease and the
features of the individual‘s CML at the time of diagnosis
Nearly all people with chronic phase CML have a ―complete hematologic
response‖ with Gleevec, Sprycel or Tasigna therapy; most of these people will
eventually achieve a ―complete cytogenetic response.‖ Patients who have a complete
cytogenetic response often continue to have a deeper response and achieve a ―major
molecular response.‖ Additionally, a growing number of patients achieve a ―complete
molecular response‖; table 2.5 shows the explanation of each term.
2.1.5 Imatinib treatment for Nigerian CML patients
According to Oyekunle et al. (2012b), Nigerian CML patients are presently
treated using Imatinib as the first line of treatment. Chromosome analysis is done
using cultured bone marrow aspirate samples; Philadelphia chromosomes are
estimated from the metaphase, and the proportion of Ph+ cells is noted. Patients in
the chronic phase receive oral Imatinib: 400mg daily while those in the accelerated or
blastic phase receive 600mg daily. Imatinib is continued for as long as there is
evidence of continued benefit from therapy.
Table 2.5: Chronic Myeloid Leukaemia (CML) Treatment Responses

Hematologic response (measured with a Complete Blood Count (CBC) with differential):
- Complete Hematologic Response (CHR): blood counts completely return to normal; no blasts in peripheral blood; no signs/symptoms of disease (the spleen returns to normal size).

Cytogenetic response (measured by bone marrow cytogenetics):
- Complete Cytogenetic Response (CCyR): no Philadelphia (Ph) chromosomes detected.
- Partial Cytogenetic Response (PCyR): 1% - 35% of cells have the Ph chromosome.
- Major Cytogenetic Response: 0% - 35% of cells have the Ph chromosome.
- Minor Cytogenetic Response: more than 35% of cells have the Ph chromosome.

Molecular response (measured by quantitative PCR (QPCR) using the International Scale (IS)):
- Complete Molecular Response (CMR): no BCR-ABL gene detectable.
- Major Molecular Response (MMR): at least a 3-log reduction* in BCR-ABL levels, or BCR-ABL of 0.1% or less.

* A 3-log reduction is a 1/1,000 (i.e., 1,000-fold) reduction of the level at the start of treatment.
Allopurinol (300mg daily) is given until the leucocyte count falls below 20 x 10⁹/L. Patients with hyperleucocytosis (leucocyte count > 100 x 10⁹/L) who are already on hydroxyurea continue on the latter for another 1 - 3 weeks, with monitoring of the full blood count before final withdrawal of the drug once the white cell count falls to less than 100 x 10⁹/L.

In individuals with severe Imatinib-induced myelosuppression, the drug is withheld until the neutrophil count rises to 1.5 x 10⁹/L and the platelet count to at least 75 x 10⁹/L. Patients with recurrent, therapy-induced myelosuppression can have the Imatinib dose reduced to 300mg daily (the minimum dose for therapeutic blood levels in adults) until the blood count normalizes. However, if the myelosuppression is related to blastic transformation, Imatinib is discontinued and appropriate supportive therapy given. Women of child-bearing age are advised to use barrier contraception. Imatinib treatment is withdrawn in patients who develop neutropenia (< 1,000/mm³) or thrombocytopenia (< 75,000/mm³) while on therapy, until the cytopenias are corrected, and is then re-commenced at lower doses.
2.2 Predictive Modeling
Predictive research aims at predicting future events or an outcome based on
patterns within a set of variables and has become increasingly popular in medical
research (Agbelusi, 2014; Idowu et al., 2015). Accurate predictive models can inform
patients and physicians about the future course of an illness or the risk of developing
illness and thereby help guide decisions on screening and/or treatment (Waijee et al., 2013a). There are several important differences between traditional explanatory research and predictive research. Explanatory research typically applies statistical methods to test causal hypotheses using prior theoretical constructs. In contrast,
predictive research applies statistical methods and/or machine learning techniques,
without preconceived theoretical constructs, to predict future outcomes (e.g.
predicting the risk of hospital readmission) (Breiman, 1984).
Although predictive models may be used to provide insight into the causality of the pathophysiology of the outcome, causality is neither a primary aim nor a requirement for variable inclusion (Moons et al., 2009). Non-causal predictive factors may be surrogates for other drivers of disease, with tumor markers as predictors of cancer
progression or recurrence being the most common example. Unfortunately, a poor
understanding of the differences in methodology between explanatory and predictive
research has led to a wide variation in the methodological quality of prediction
research (Hemingway et al., 2009).
2.2.1 Types of predictive models
Machine learning has previously been used to predict behavioral outcomes in business, such as identifying consumer preferences for products based on prior purchasing history. A number of different techniques to develop predictive
algorithms exist, using a variety of prediction analytic tools/software and have been
described in detail in literature (Waijee et al., 2010; Siegel et al., 2011). Some
examples include neural networks, support vector machines, decision trees, naïve
Bayes etc. Decision trees, for example, use techniques such as classification and
regression trees, boosting and random trees to predict various outcomes.
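As a rough illustration (not the models developed in this thesis), the sketch below fits several of the algorithm families just mentioned on synthetic data and compares their hold-out accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a labeled survival dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "decision tree":  DecisionTreeClassifier(random_state=1),
    "random forest":  RandomForestClassifier(n_estimators=100, random_state=1),
    "SVM":            SVC(kernel="rbf"),
    "naive Bayes":    GaussianNB(),
    "neural network": MLPClassifier(max_iter=1000, random_state=1),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:15s} accuracy = {acc:.3f}")
```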
Machine learning algorithms, such as random-forest approaches, have several advantages over traditional explanatory statistical modeling, such as the lack of a predefined hypothesis, which makes them less likely to overlook unexpected predictors (Liaw and Weiner, 2012). Approaching a predictive problem without a specific causal
hypothesis can be quite effective when many potential predictors are available and
when there are interactions between predictors, which are common in engineering,
biological and social causative processes. Predictive models using machine learning
algorithms may therefore facilitate the recognition of important variables that may
otherwise not be initially identified (Waijee et al., 2010). In fact, many examples of
discovery of unexpected predictor variables exist in the machine learning literature
(Singal et al., 2013).
2.2.2 Developing a predictive model
The first step in developing a predictive model, when using traditional
regression analysis, is selecting relevant candidate predictor variables for possible
inclusion in the model; however, there is no consensus for the best strategy to do so
(Royston et al., 2009). A backward-elimination approach starts with all candidate variables and sequentially applies hypothesis tests to determine which variables should be removed from the final model, whereas a full-model approach includes all candidate variables to avoid potential over-fitting and selection bias. Previously reported significant predictor variables should typically be included in the final model regardless of their statistical significance, but the number of variables included is usually limited by the sample size of the dataset (Greenland, 1989).
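A minimal sketch of backward elimination, using scikit-learn's RFE (recursive feature elimination) as a model-based stand-in for the hypothesis-test-driven procedure described above; the data and parameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data. RFE starts from all candidate variables and
# drops the weakest one each round, mirroring backward elimination
# (model-importance-based rather than hypothesis-test-based).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("kept variables  :", [i for i, k in enumerate(rfe.support_) if k])
print("elimination rank:", list(rfe.ranking_))
```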
Inappropriate selection of variables is an important and common cause of poor
model performance in this situation. Selection of variables is less of an issue using
machine learning techniques given that they are often not solely based on predefined
hypothesis (Ibrahim et al., 2012). There are several other important issues relating to
data management when developing a predictive model, such as dealing with missing
data and variable transformation (Kaambwa et al., 2012; Waijee et al., 2013b).
2.2.3 Validating a predictive model
For a prediction model to be valuable, it must not only have predictive ability
in the derivation cohort but must also perform well in a validation cohort
(Hemingway et al., 2009). A model‘s performance may differ substantially between
derivation and validation cohorts for several reasons including over-fitting of the
model, missing important predictor variables, and inter-observer variability of
predictors leading to measurement errors (Altman et al., 2009). Therefore model
performance in the derivation dataset may be overly optimistic and is not a guarantee
that the model will perform equally well in a new dataset. Much published prediction research focuses solely on model derivation, and validation studies are very scarce (Waijee et al., 2013b).
Validation can be performed using internal and external validation. A
common approach to internal validation is to split the data into two portions – a
training set and validation set. If splitting the dataset is not possible given the limited
available data, measures such as cross validation or bootstrapping can be used for
internal validation (Steyerberg et al., 2010). However, internal validation nearly
always yields optimistic results given that the derivation and validation dataset are
very similar (as they are from the same dataset). Although external validation is more
difficult as it requires data collected from similar sources in a different setting or a
different location, it is usually preferred to internal validation (Steyerberg et al.,
2001). When a validation study shows disappointing results, researchers are often
tempted to reject the initial model and develop a new predictive model using the
validation dataset.
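The split-sample and cross-validation strategies for internal validation described above can be sketched as follows (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=150, n_features=6, random_state=2)

# Split-sample internal validation: derive on one portion, validate on the other.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=2)
model = RandomForestClassifier(random_state=2).fit(X_tr, y_tr)
print("hold-out accuracy :", model.score(X_val, y_val))

# When the data are too few to split, k-fold cross-validation reuses every
# record for both training and validation.
scores = cross_val_score(RandomForestClassifier(random_state=2), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```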
2.2.4 Assessing the performance of predictive model
When assessing model performance, it is important to remember that
explanatory models are judged based on the strength of associations, whereas
predictive models are judged solely on their ability to make accurate predictions. The
performance of a predictive model is assessed using several complementary tests,
which assess overall performance, calibration, discrimination, and reclassification
(Steyerberg et al., 2010) (Table 2.6). Performance characteristics should be
determined and reported for both the derivation and validation datasets. The overall
model performance can be measured using R², which characterizes the degree of variation in risk explained by the model (Gerds et al., 2008). The adjusted R² has been proposed as a better measure, as it accounts for the number of predictors and helps prevent over-fitting. Brier scores are similar measures of performance,
which are used when the outcome of interest is categorical instead of continuous
(Czado et al., 2009).
Calibration is the difference between observed and predicted event rates for
groups of dataset and is assessed using the Hosmer-Lemeshow test (Hosmer et al.,
1997). Discrimination is the ability of a model to distinguish between records which
do and do not experience an outcome of interest, and it is commonly assessed using
Receiver Operating Characteristics (ROC) curves (Hagerty et al., 2005). However,
ROC analysis alone is relatively insensitive for assessing differences between good
predictive models (Cook, 2007); therefore several relatively novel performance
measures have been proposed. The net reclassification improvement and integrated
discrimination improvement are measures used to assess changes in predicted
outcome classification between two models (Pencina et al., 2012).
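A short sketch computing two of these complementary measures, the Brier score (overall performance for a categorical outcome) and the area under the ROC curve (discrimination), on hypothetical predictions:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# Hypothetical predicted survival probabilities and observed outcomes.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.5, 0.65, 0.75])

# Brier score: mean squared distance between predicted probability and outcome.
print("Brier score:", brier_score_loss(y_true, y_prob))
# Discrimination: area under the ROC curve (the c statistic).
print("ROC AUC    :", roc_auc_score(y_true, y_prob))
```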
Table 2.6: Performance characteristics for a predictive model (measures of predictive error)

Overall performance:
- R² (continuous outcome): average squared difference between predicted and observed outcomes.
- Adjusted R² (continuous outcome): same as R², but penalizes for the number of predictors.
- Brier score (categorical outcome): average squared distance between the predicted and the observed outcomes.

Discrimination:
- ROC curve (c statistic) (continuous or categorical outcome): overall measure of how effectively the model differentiates between events and non-events.
- C-index (Cox model): overall measure of how effectively the model differentiates between events and non-events.

Calibration:
- Hosmer-Lemeshow test (categorical outcome): agreement between predicted and observed risks.

Reclassification (categorical outcome*):
- Reclassification table: number of records that move from one category to another by improving the prediction model.
- NRI: a quantitative assessment of the improvement in classification by improving the prediction model.
- IDI: similar to NRI but using all possible cutoffs to categorize events and non-events.

IDI, integrated discrimination index; NRI, net reclassification index. * Can be performed for continuous data as well if a risk cutoff is assigned.

(Source: Waijee et al., 2013b)
2.3 Machine Learning
Machine learning (ML) is a branch of artificial intelligence that allows
computers to learn from past examples using statistical and optimization techniques
(Quinlan, 1986; Cruz and Wishart, 2006). There are several applications for machine
learning, the most significant of which is predictive modeling (Dimitologlou et al.,
2012). Every instance (records/set of fields or attributes) in any dataset used by
machine learning algorithms is represented using the same set of features
(attributes/independent variables). The features may be continuous, categorical or
binary. If the instances are given with known labels (the corresponding target
outputs), then the learning is called supervised, in contrast to unsupervised learning,
where instances are unlabeled (Ashraf et al., 2013).
Supervised classification is one of the tasks most frequently carried out by so-called Intelligent Systems. Thus, a large number of techniques have been developed based on Artificial Intelligence (logic-based techniques, perceptron-based techniques) and Statistics (Bayesian networks, instance-based techniques). The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown (Gauda and Chahar, 2013). There are two variations of supervised classification:
a. Regression (or prediction/forecasting) – the class label is represented by a continuous variable (e.g. a real number); and
b. Classification – the class label is represented by discrete values (e.g. categorical or nominal values).
Unsupervised machine learning algorithms perform learning tasks used for inferring a function to describe hidden structure from unlabeled data – data without a target class (Sebastiani, 2002). The goal of unsupervised machine learning is to identify examples that belong to the same group/cluster based on underlying characteristics common among the attributes of members of that cluster or group (Zamir et al., 1997; Jain et al., 1999; Zhao and Karypis, 2002). The only things that unsupervised learning methods have to work with are the observed input patterns x_i, which are often assumed to be independent samples from an underlying unknown probability distribution P(x), and some explicit or implicit a priori information as to what is important. Examples of unsupervised machine learning algorithms include the following (a clustering sketch is given after the list):
a. Clustering;
b. Maximum likelihood estimation;
c. Feature selection;
d. Association rule learning etc. (Becker and Plumbley, 1996)
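A minimal clustering sketch: k-means is given only unlabeled attribute vectors and must discover the group structure itself (toy data, for illustration only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled records: no target class is supplied to the learner.
X, _ = make_blobs(n_samples=90, centers=3, random_state=4)

# k-means groups records purely from similarities among their attributes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=4).fit(X)
print("cluster sizes :", [int((kmeans.labels_ == c).sum()) for c in range(3)])
print("cluster centres:\n", kmeans.cluster_centers_.round(2))
```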
2.3.1 Supervised machine learning algorithms
Supervised learning entails a mapping between a set of input variables X_j (features/attributes) and an output variable Y_j (where j indexes the records/CML cases), and applying this mapping to predict the outputs for unseen data (data containing values for X but no Y). Supervised machine learning is the most commonly used machine learning technique in engineering and medicine.

In the supervised machine learning paradigm, the goal is to infer a function f:

    f : \mathcal{X} \to \mathcal{Y}    (2.1)

This function f is the model inferred by the supervised ML algorithm from a sample of data, or training set, composed of pairs of inputs (x_i) and outputs (y_i) such that x_i \in \mathcal{X} and y_i \in \mathcal{Y}:

    S = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathcal{X}, \; y_i \in \mathcal{Y}    (2.2)

Typically, for regression problems, \mathcal{X} \subseteq \mathbb{R}^d (where d is the dimension, or number of features, of the vector x) and \mathcal{Y} \subseteq \mathbb{R}; for classification problems the values of \mathcal{Y} are discrete, while for binary classification \mathcal{Y} = \{-1, +1\}.

In the statistical learning framework, the first fundamental hypothesis is that the training data are independently and identically generated from an unknown but fixed joint probability distribution function P(X, Y). The goal of the learning algorithm is to find a function f attempting to model the dependency encoded in P(X, Y) between the input X and the output Y. \mathcal{F} will denote the set of functions in which the solution f is sought, such that f \in \mathcal{F} \subseteq \mathcal{G}, where \mathcal{G} is the set of all possible functions f.

The second fundamental concept is the notion of error or loss to measure the agreement between the prediction f(X) and the desired output Y. A loss (or cost) function L is introduced to evaluate this error (see equation 2.3):

    L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}, \quad (f(X), Y) \mapsto L(f(X), Y)    (2.3)

The choice of the loss function L(f(X), Y) depends on the learning problem being solved. Loss functions are classified according to their regularity or singularity properties and according to their ability to produce convex or non-convex criteria for optimization. In the case of pattern recognition, where \mathcal{Y} = \{-1, +1\}, a common choice for L is the misclassification error, which is measured as follows:

    L(f(X), Y) = \tfrac{1}{2} \, \lvert f(X) - Y \rvert    (2.4)

This cost is singular and symmetric. Practical algorithmic considerations may bias the choice of L. For instance, singular functions may be selected for their ability to provide sparse solutions.

For unsupervised learning, where no desired output is available, the problem may be expressed in a similar way using loss functions that depend only on the input and its representation f(X), for instance a reconstruction error (equations 2.5 and 2.6):

    L(f(X), X) = \lVert X - f(X) \rVert    (2.5)

    L(f(X), X) = \lVert X - f(X) \rVert^{2}    (2.6)

The loss function L leads to the definition of the risk for a function f, also called the generalization error:

    R(f) = \int L(f(x), y) \, dP(x, y)    (2.7)

In classification, the objective could be to find the function f in \mathcal{F} that minimizes R(f). Unfortunately, this is not possible because the joint probability P(x, y) is unknown. From a probabilistic point of view, using the input and output random variable notations X and Y, the risk can be expressed as in equation (2.8), which can be rewritten as two nested expectations in equation (2.9):

    R(f) = \mathbb{E}\left[ L(f(X), Y) \right]    (2.8)

    R(f) = \mathbb{E}_X \left[ \, \mathbb{E}_{Y|X}\left[ L(f(X), Y) \mid X \right] \right]    (2.9)

The expression in equation (2.9) offers the opportunity to separately minimize \mathbb{E}[L(f(X), Y) \mid X = x] with respect to the scalar value f(x). The resulting function is the Bayes estimator associated with the risk R.
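Because P(X, Y) is unknown, practical algorithms work with the empirical counterpart of R(f) computed on the training set; a minimal sketch under the misclassification loss of equation (2.4), on hypothetical labels:

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """Empirical counterpart of R(f) under the misclassification loss:
    the average of L(f(x_i), y_i) = 0.5 * |f(x_i) - y_i| over the sample."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(0.5 * np.abs(y_pred - y_true)))

# Labels in {-1, +1}, as in the binary classification setting above.
y_true = [+1, -1, +1, +1, -1, +1]
y_pred = [+1, -1, -1, +1, -1, +1]
print("empirical risk:", empirical_risk(y_true, y_pred))  # 1 error in 6 ~ 0.167
```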
The learning problem is expressed as a minimization of R(f) for any classifier f. As the joint probability is unknown, the solution is inferred from the available training set S = \{(x_i, y_i)\}_{i=1}^{n}. There are two ways to address the problem. The first approach, called generative-based, tries to approximate the joint probability P(X, Y), or P(Y|X)P(X), and then compute the Bayes estimator with the obtained probability. The second approach, called discriminative-based, attacks the estimation of the risk R(f) head on (Liaw and Weiner, 2012). The following is a description of some of the most popular and effective supervised machine learning algorithms.
a. Decision Trees (DT)
Decision tree learning uses a decision tree as a predictive model which maps observations about the relevant indicators for CML survival classification, X_ij, to a conclusion about the target value – the patient's survival class (survived or not survived). Decision trees can be either classification or regression trees; for this study, classification trees were adopted, which can be used as input for decision making, describing the data using a top-down tree (Quinlan, 1986; Breiman et al., 1984). Each interior node (starting from the root/parent node) of the tree represents an attribute (a feature relevant for CML survival), with edges that correspond to the values/labels of each attribute leading to a child node below; the process continues for each subsequent value until a leaf is reached, the terminal node, which represents the target class (class of CML survival) alongside the probability distribution over the classes (Friedman, 1999).
Such decision tree algorithms include ID3 (Iterative Dichotomiser 3), C4.5 (an extension of ID3), CART (Classification and Regression Trees), CHAID (Chi-squared Automatic Interaction Detector), MARS, etc. In this study, the C4.5 decision tree algorithm was considered.
The tree is learned by splitting the training dataset into subsets based on an attribute-value test for each input variable; the process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target class, or when splitting no longer adds value to the predictions. This is also called top-down induction of decision trees (Rokach and Maimon, 2008), an example of a greedy (divide-and-conquer) algorithm. When constructing the tree, different decision tree algorithms use different metrics for selecting the best attribute on which to split; these generally measure the homogeneity of the target class (survival of CML) within the subsets induced by the candidate attributes (relevant indicators for CML survival) (Rokach and Maimon, 2005). Such metrics include Gini impurity, information gain and variance reduction; the C4.5 decision tree algorithm uses the information gain metric. The information gain criterion is defined by equation (2.10).
If S is the training dataset and a_i is an attribute (predictive of CML survival in patients) with values {v_1, ..., v_m} used to partition S into subsets S_1, ..., S_m, then:

Gain(S, a_i) = Entropy(S) − Σ_{j=1}^{m} (|S_j| / |S|) × Entropy(S_j)    (2.10)

where:

Entropy(S) = − Σ_k (|S_k| / |S|) × log2(|S_k| / |S|)

and |S_k| is the number of instances in S belonging to target class k.
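As an illustration of equation (2.10), the following sketch (Python, with a toy attribute and survival labels invented for illustration) computes the entropy of a label set and the information gain of a single candidate attribute, exactly as C4.5 would score it when choosing a split.

    import numpy as np
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a class-label sequence, in bits
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def information_gain(feature_values, labels):
        # Gain(S, a) = Entropy(S) - sum_j |S_j|/|S| * Entropy(S_j)  (equation 2.10)
        n = len(labels)
        remainder = 0.0
        for v in set(feature_values):
            subset = [c for f, c in zip(feature_values, labels) if f == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    # toy attribute (e.g. 'spleen enlarged?') against the survival class
    feature = ['yes', 'yes', 'no', 'no', 'yes']
    survival = ['not_survived', 'not_survived', 'survived', 'survived', 'survived']
    print(round(information_gain(feature, survival), 3))  # about 0.42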
b. Support Vector Machines (SVM)
Support vector machines (SVMs) also called support vector networks are
supervised learning models with associated learning algorithms that analyze data and
recognize patterns (Cortes and Vapnik, 1995). Consider a training dataset consisting of CML survival indicators as the input vector for each patient, together with that patient's CML survival class as the target, taking one of two categories; SVM attempts to build a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
An SVM model is a representation of the examples as points in space, mapped so that
the examples of the separate categories are divided by a clear gap that is as wide as
possible. New examples are then mapped into that same space and predicted to
belong to a category based on which side of the gap they fall on.
In formal terms, SVM constructs a hyper-plane or set of hyper-planes in a high-dimensional space, which can be applied to classification, regression or other tasks. A good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class – the support vectors – since in general the larger the margin, the lower the generalization error of the classifier. Although SVMs are primarily supervised learners used for developing classification and regression models, unsupervised variants such as support vector clustering also exist. To calculate the margin between the data belonging to the two classes, two parallel hyper-planes (the blue lines in figure 2.6) are constructed, one on either side of the separating hyper-plane (the solid black line), and pushed up against the two datasets (the survived and not-survived groups); the maximum-margin hyper-plane is the one for which the distance to the neighbouring data points of both classes is largest.
The parameters of the maximum-margin hyper-plane are derived by solving a large quadratic programming (QP) optimization problem. Several specialized algorithms exist for quickly solving the QP problems that arise from SVMs, mostly relying on heuristics for breaking the problem down into smaller chunks. This study implements John Platt's sequential minimal optimization (SMO) algorithm (Platt, 1998) for training the support vector classifier. SMO works by breaking the large QP problem into a series of smallest-possible sub-problems, each involving only two Lagrange multipliers, which can be solved analytically. The SMO implementation used is the one available in the Weka public-domain software; it globally replaces all missing values, transforms nominal attributes into binary values and, by default, normalizes all data.
Considering the use of a linear support vector classifier, as shown in figure 2.6, it is assumed that both classes are linearly separable. The training data containing the information about each CML patient, described by the relevant features (risk indicators for CML survival), are expressed as x_i ∈ R^d, i = 1, ..., n, while the target class is represented by y_i ∈ {−1, +1}. The hyper-plane can be defined by w · x + b = 0, where w ∈ R^d and b ∈ R. Since the classes are linearly separable, w and b can be determined such that:

y_i (w · x_i + b) ≥ 1, for all i = 1, ..., n

The decision function may be expressed as f(x) = sign(w · x + b), with f(x) = +1 for patients classified as survived and f(x) = −1 otherwise.

The SVM classification method aims at finding the optimal hyper-plane based on the maximization of the margin between the training data of both classes. Since the distance between a point x and the hyper-plane is |w · x + b| / ||w||, it is easy to show that the optimization problem takes the following form:

minimize (1/2) ||w||², subject to y_i (w · x_i + b) ≥ 1, i = 1, ..., n
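The sketch below illustrates this formulation on toy, linearly separable data using scikit-learn's SVC, whose libsvm backend solves the dual QP with an SMO-type decomposition in the same spirit as the Weka implementation used in this study; the data and parameter values are illustrative assumptions, not values from the study dataset.

    import numpy as np
    from sklearn.svm import SVC

    # toy linearly separable data: two survival-indicator features per patient,
    # labels +1 (survived) / -1 (not survived); values invented for illustration
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
                  [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    # libsvm (behind SVC) solves the dual QP with an SMO-type decomposition
    clf = SVC(kernel='linear', C=1e3).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]        # hyper-plane parameters w, b
    print(w, b, 2.0 / np.linalg.norm(w))          # margin width = 2 / ||w||
    print(clf.predict([[2.0, 2.0], [7.0, 7.0]]))  # -> [-1  1]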
Figure 2.6: Description of the linear SVM classifier
c. Artificial Neural Network (ANN) - Multi-layer Perceptron (MLP)
An artificial neural network (ANN) is an interconnected group of nodes, akin to the vast network of neurons in a human brain. In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that depend on a large number of inputs and are generally unknown (McCulloch and Pitts, 1943). ANNs are generally presented as systems of interconnected neurons which send messages to each other, such that each connection has a numeric weight that can be tuned based on experience, making neural networks adaptive to inputs and capable of learning.
The word network refers to the inter-connections between the neurons in the
different layers of each system. The first layer has input neurons which send data via
synapses to the middle layer of neurons, and then via more synapses to the third layer
of output neurons. The synapses store parameters called weights that manipulate the data in the calculations. An ANN is typically defined by three (3) types of
parameters, namely:
i. Interconnection pattern between the different layers of neurons;
ii. Learning process for updating the weights of the interconnections; and
iii. Activation function that converts a neuron's weighted input to its output
activation.
The simplest kind of neural network is a single-layer perceptron network,
which consists of a single layer of output nodes; the inputs are fed directly to the
outputs via a series of weights. In this way it can be considered the simplest kind of
feed-forward network. The sum of the products of the weights and the inputs is
calculated in each node, and if the value is above some threshold (typically 0) the
neuron fires and takes the activated value (typically 1); otherwise it takes the
deactivated value (typically -1).
A perceptron can be created using any values for the activated and deactivated
states as long as the threshold value lies between the two. Perceptrons can be trained
by a simple learning algorithm that is usually called the delta rule. It calculates the
errors between calculated output and sample output data, and uses this to create an
adjustment to the weights, thus implementing a form of gradient descent.
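The delta rule can be written out in a few lines. The sketch below (Python/NumPy) trains a single-layer perceptron on a toy, linearly separable task; the learning rate, epoch count and data are arbitrary illustrative choices, and the bias is folded in as an extra weight.

    import numpy as np

    def train_perceptron(X, y, lr=0.1, epochs=20):
        # delta rule: w <- w + lr * (target - output) * x, bias handled as w[0]
        w = np.zeros(X.shape[1] + 1)
        for _ in range(epochs):
            for xi, target in zip(X, y):
                output = 1 if w[0] + w[1:] @ xi > 0 else -1
                w[1:] += lr * (target - output) * xi
                w[0] += lr * (target - output)
        return w

    # toy linearly separable task (logical AND with +1/-1 labels)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([-1, -1, -1, 1])
    print(train_perceptron(X, y))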
Multi-layer networks use a variety of learning techniques, the most popular
being back-propagation. Here, the output values are compared with the correct answer
to compute the value of some predefined error-function. By various techniques, the
error is then fed back through the network. Using this information, the algorithm
adjusts the weights of each connection in order to reduce the value of the error
function by some small amount. After repeating this process for a sufficiently large
number of training cycles, the network will usually converge to some state where the
error of the calculations is small. In this case, one would say that the network has
learned a certain target function.
To adjust weights properly, one applies a general method for non-linear
optimization that is called gradient descent. For this, the derivative of the error
function with respect to the network weights is calculated, and the weights are then
changed such that the error decreases (thus going downhill on the surface of the error
function). For this reason, back-propagation can only be applied on networks with
differentiable activation functions.
Back-propagation, an abbreviation for backward propagation of errors, is a
common method of training artificial neural networks used in conjunction with an
optimization method such as gradient descent. The method calculates the gradient of a
loss function with respect to all the weights in the network. The gradient is fed to the
optimization method which in turn uses it to update the weights, in an attempt to
minimize the loss function. It is a generalization of the delta rule to multi-layered
feed-forward networks, made possible by using the chain rule to iteratively compute
gradients for each layer. Back-propagation requires that the activation function used
by the artificial neurons be differentiable.
The back-propagation learning algorithm can be divided into two phases:
propagation and weight update.
a. Phase 1 – Propagation: each propagation involves the following steps:
i. Forward propagation of the training pattern's input through the neural network in order to generate the propagation's output activations; and
ii. Backward propagation of the propagation's output activations through the neural network, using the training pattern's target, in order to generate the deltas of all output and hidden neurons.
b. Phase 2 – Weight update: for each weight-synapse, proceed as follows:
i. Multiply its output delta and input activation to get the gradient of the
weight; and
ii. Subtract a ratio (percentage) of the gradient from the weight.
Assume the input neurons are represented by variables X = {X_1, X_2, X_3, ..., X_i}, where i is the number of variables (input neurons). The combined effect of the synaptic weights W_ij on the inputs reaching a neuron in layer j is represented by the weighted sum in equation (2.14):

net_j = Σ_i W_ij X_i    (2.14)

Equation (2.14) is passed to the activation function (the sigmoid/logistic function), which limits the output to the interval (0, 1):

y = φ(net_j) = 1 / (1 + e^(−net_j))    (2.15)
The measure of discrepancy between the expected (target) output t and the actual output y is made using the squared error measure E:

E = (t − y)²

Recall, however, that the output y of a neuron depends on the weighted sum of all its inputs, as indicated in equation (2.14); this implies that the error E also depends on the incoming weights of the neuron, which need to be changed in the network to enable learning. The back-propagation algorithm aims to find the set of
weights that minimizes the error. In this study, the gradient descent algorithm is
applied in order to minimize the error and hence find the optimal weights that satisfy
the problem. Since back-propagation uses the gradient descent method, there is a
need to calculate the derivative of the squared error function with respect to the
weights of the network.
Hence, the squared error function is now redefined as (the ½ is required to cancel the exponent of 2 when differentiating):

E = (1/2)(t − y)²    (2.19)

For each neuron j, its output O_j is defined as:

O_j = φ(net_j) = φ( Σ_{k=1}^{n} W_kj O_k )    (2.20)
The input net_j to a neuron is the weighted sum of the outputs O_k of the previous neurons. The number of input neurons is n, and the variable W_kj denotes the weight between neurons k and j. The activation function φ is in general non-linear and differentiable, and the derivative of the logistic function in equation (2.15) is:

dφ(z)/dz = φ(z) (1 − φ(z))    (2.21)
The partial derivative of the error E with respect to a weight W_ij is obtained by applying the chain rule twice:

∂E/∂W_ij = (∂E/∂O_j) · (∂O_j/∂net_j) · (∂net_j/∂W_ij)

The last factor can be calculated from equation (2.20); since only one term of the weighted sum net_j depends on W_ij:

∂net_j/∂W_ij = ∂/∂W_ij ( Σ_{k=1}^{n} W_kj O_k ) = O_i

The derivative of the output of neuron j with respect to its input is the partial derivative of the activation function – the logistic derivative shown in equation (2.21):

∂O_j/∂net_j = φ(net_j) (1 − φ(net_j)) = O_j (1 − O_j)

The first factor is evaluated by differentiating the error function in equation (2.19) with respect to y, so if j is in the output layer such that y = O_j, then:

∂E/∂O_j = ∂E/∂y = y − t

However, if j is in an arbitrary inner layer of the network, finding the derivative of E with respect to O_j is less obvious. Considering E as a function of the inputs net_l of all neurons l receiving input from neuron j, and taking the total derivative with respect to O_j, a recursive expression for the derivative is obtained:

∂E/∂O_j = Σ_l (∂E/∂net_l · ∂net_l/∂O_j) = Σ_l (∂E/∂O_l · ∂O_l/∂net_l · W_jl)
Thus, the derivative with respect to O_j can be calculated if all the derivatives with respect to the outputs O_l of the next layer – the one closer to the output neuron – are known. Putting them all together:

∂E/∂W_ij = δ_j O_i    (2.27)

with:

δ_j = (O_j − t) · O_j (1 − O_j), if j is an output neuron;
δ_j = ( Σ_l δ_l W_jl ) · O_j (1 − O_j), if j is an inner neuron.

Therefore, in order to update the weight using gradient descent, one must choose a learning rate, η. The change in weight, which is added to the old weight, is equal to the product of the learning rate and the gradient, multiplied by −1:

ΔW_ij = −η ∂E/∂W_ij = −η δ_j O_i    (2.28)

Equation (2.28) is used by the back-propagation algorithm to adjust the values of the synaptic weights attached to the inputs of each neuron in equation (2.14), including the neurons in the inner layers of the multi-layer perceptron classifier.
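The derivation above condenses into a few lines of vectorized code. The sketch below (Python/NumPy, with an illustrative XOR task and arbitrarily chosen layer size, learning rate and epoch count) trains one hidden layer by back-propagation: the delta terms correspond to equation (2.27) and the weight update to equation (2.28).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_mlp(X, t, hidden=4, lr=0.5, epochs=10000, seed=0):
        # one-hidden-layer MLP trained by back-propagation
        rng = np.random.default_rng(seed)
        W1 = rng.normal(0.0, 1.0, (X.shape[1], hidden))
        W2 = rng.normal(0.0, 1.0, (hidden, 1))
        for _ in range(epochs):
            H = sigmoid(X @ W1)                           # forward pass, hidden layer
            O = sigmoid(H @ W2)                           # forward pass, output layer
            delta_out = (O - t) * O * (1 - O)             # output delta (equation 2.27)
            delta_hid = (delta_out @ W2.T) * H * (1 - H)  # hidden delta (equation 2.27)
            W2 -= lr * H.T @ delta_out                    # weight update (equation 2.28)
            W1 -= lr * X.T @ delta_hid
        return W1, W2

    # toy non-linear task (XOR), targets in {0, 1}
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    t = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2 = train_mlp(X, t)
    print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))  # should approach [[0],[1],[1],[0]]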
2.3.2 General issues of supervised machine learning algorithms
The first step is collecting the dataset required for developing the predictive
model. If a requisite expert is available, he/she suggests which fields (attributes/features) are the most informative. If not, then the simplest method is brute force, which measures everything available in the hope that the right (informative, relevant but not redundant) features can be isolated. However, a
dataset collected by the brute-force method is not directly suitable for induction. It
contains in most cases noise and missing feature values, and therefore requires
significant pre-processing (Zhang et al., 2002). For this reason, methods suitable for
removing noise and missing values are important before deciding on the use of the
identified variables needed for developing predictive models using supervised
machine learning algorithms.
a. Data preparation and data pre-processing
There is a hierarchy of problems that are often encountered in data preparation
and preprocessing which includes:
i. Impossible input values;
ii. Unlikely input values;
iii. Missing input values; and
iv. Presence of irrelevant input features in the data.
Impossible values should be detected by the data handling software, ideally at
the point of input so that they can be re-entered. These errors are generally
straightforward, such as coming across negative values when positive values are
expected. If correct values cannot be entered, the problem is converted into missing
value category, by removing the data. Variable-by-variable data cleansing is a filter approach for unlikely values – values that are suspicious given their relationship to a specific probability distribution (for example, a value of 10 for a variable with mean 5 and standard deviation 3). Table 2.7 shows examples of how such metadata can help in detecting a number of possible data quality problems.
The process of selecting instances makes it possible to cope with the infeasibility of learning from very large datasets. Selection of instances from the original dataset is an optimization problem that maintains the mining quality while minimizing the sample size (Liu and Motoda, 2001). It reduces data and enables a
machine learning algorithm to function and work effectively with very large datasets.
Table 2.7: Examples for the use of variable-by-variable data cleansing

Problems         Metadata                 Examples/Heuristics
Illegal values   Cardinality              e.g., cardinality(gender) > 2 indicates a problem.
                 Max, Min                 Max and min should not be outside the permissible range.
                 Variance, Deviation      Variance and deviation of statistical values should not be higher than a threshold.
Misspellings     Feature values           Sorting on values often brings misspelled values next to correct values.

(Source: Kotsiantis et al., 2006)
There are a variety of procedures for sampling instances from large dataset.
The most well-known are:
i. Random sampling, which selects a subset of instances randomly.
ii. Stratified sampling, which is applicable when the class values are not
uniformly distributed in the training sets. Instances of the minority class(es)
are selected with greater frequency in order to even out the distribution.
Incomplete data is an unavoidable problem in dealing with most real world
data sources. Generally, there are some important factors to be taken into account
when processing unknown feature values. One of the most important ones is the
source of unknown-ness:
i. A value is missing because it was forgotten or lost;
ii. A certain feature is not applicable for a given instance (e.g., it does not exist
for a given instance).
iii. For a given observation, the designer of a training set does not care about the
value of a certain feature (so-called don’t care values).
Depending on the circumstances, there are a number of methods to choose from to handle missing data (Batista and Monard, 2003); two of them are sketched in code after this list:
i. Method of ignoring instances with unknown feature values: This method is
simplest; it involves ignoring any instances (records) which have at least one
unknown feature value.
ii. Most common feature value: The value of the feature that occurs most often is
selected to be the value for all the unknown values of the feature.
iii. Most common feature value in class: In this case, the value of the feature
which occurs most commonly within the same class is selected to be the value
for all the unknown values of the feature.
iv. Mean substitution: The mean value (computed from available cases) is used to
fill in missing data values on the remaining cases. A more sophisticated
solution than using the general feature mean is to use the feature mean for all
samples belonging to the same class to fill in the missing value.
v. Regression or classification methods: a regression or classification model
based on the complete case data for a given feature is developed. This model
treats the feature as the outcome and uses the other features as predictors.
vi. Hot deck imputing: The most similar complete case to the case with a missing value is identified, and the donor case's value is substituted for the missing value.
vii. Method of treating missing feature values as special values: unknown itself is
treated as a new value for the feature that contains missing values.
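A minimal sketch of two of these methods – mean substitution (method iv) and its class-conditional variant – is shown below in Python/pandas, on invented toy records; the column names are illustrative assumptions.

    import numpy as np
    import pandas as pd

    # toy follow-up records with missing entries; column names are illustrative
    df = pd.DataFrame({
        'age':       [38, 45, np.nan, 52, 41],
        'spleen_cm': [10, np.nan, 8, 15, np.nan],
        'survived':  ['yes', 'yes', 'no', 'no', 'yes'],
    })

    # method (iv), mean substitution: fill with the overall feature mean
    df['age'] = df['age'].fillna(df['age'].mean())

    # class-conditional variant of (iv): fill with the mean of the same class
    df['spleen_cm'] = df.groupby('survived')['spleen_cm'] \
                        .transform(lambda s: s.fillna(s.mean()))
    print(df)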
b. Feature selection
This is the process of identifying and removing as many irrelevant and
redundant features as possible (Yu and Liu, 2004). This reduces the dimensionality of
the data and enables data mining algorithms to operate faster and more effectively.
Generally, features are characterized as:
i. Relevant: are features that have an influence on the target class (output). Their
role cannot be assumed by the rest.
ii. Irrelevant: are features that do not have any influence on the target class.
Their values could be generated at random and not influence the output.
iii. Redundant: are features that can take the role of another (perhaps the simplest
way to incur model redundancy).
Feature selection algorithms in general have two (2) components:
i. A selection algorithm that generates proposed subsets of features and attempts
to find an optimal subset and
ii. An evaluation algorithm that determines how good a proposed feature subset
is.
However, without a suitable stopping criterion, the feature selection process may run
repeatedly through the space of subsets, taking up valuable computational time. The
stopping criteria might be whether:
i. addition (or deletion) of any feature does not produce a better subset; and
ii. an optimal subset according to some evaluation function is obtained.
The fact that many features depend on one another often unduly influences the accuracy of supervised ML classification models. This problem can be addressed by
constructing new features from the basic feature set (Markovitch and Rosenstein,
2002). This technique is called feature construction/transformation. These newly
generated features may lead to the creation of more concise and accurate classifiers.
In addition, the discovery of meaningful features contributes to better
comprehensibility of the produced classifier, and a better understanding of the learned
concept.
c. Algorithm selection
The choice of which specific learning algorithm should be used is a critical
step. The classifier's evaluation is most often based on prediction accuracy (the number of correct predictions divided by the total number of predictions). There are at least three techniques used to calculate a classifier's accuracy (Waijee et al., 2013b), illustrated in the sketch after this list:
i. One technique is to split the training set, using two-thirds (about 67% of the total cases) for training and the remaining one-third for estimating performance (testing).
ii. In another technique, known as cross-validation, the training set is divided into mutually exclusive and equal-sized subsets, and for each subset the classifier is trained on the union of all the other subsets. The average of the error rates on each subset is therefore an estimate of the error rate of the classifier.
iii. Leave-one-out validation is a special case of cross-validation in which every test subset consists of a single instance. This type of validation is, of course, more computationally expensive, but useful when the most accurate estimate of a classifier's error is required.
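The sketch below (Python/scikit-learn, on a synthetically generated dataset) illustrates all three estimation techniques side by side.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import (LeaveOneOut, cross_val_score,
                                         train_test_split)
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=120, n_features=8, random_state=0)
    clf = DecisionTreeClassifier(random_state=0)

    # (i) two-thirds / one-third holdout split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    holdout = clf.fit(X_tr, y_tr).score(X_te, y_te)

    # (ii) 10-fold cross-validation
    cv10 = cross_val_score(clf, X, y, cv=10).mean()

    # (iii) leave-one-out: every test subset is a single instance
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(holdout, cv10, loo)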
If the error rate is unsatisfactory, a variety of factors must be examined:
i. Perhaps relevant features of the problem are not being used;
ii. A larger training set is needed;
iii. The dimensionality of the problem is too high; and/or
iv. The selected algorithm is inappropriate or parameter tuning is needed.
A common method for comparing supervised ML algorithms is to perform statistical comparisons of the accuracies of trained classifiers on specific datasets.
Several heuristic versions of the t-test have been developed to handle this issue
(Dietterich, 1998; Nadeau and Bengio, 2003).
2.3.3 Machine learning for cancer prediction and prognosis
According to a literature survey on the application of machine learning in
healthcare data by Cruz and Wishart (2006), machine learning is not new to cancer
research. In fact, artificial neural networks (ANNs) and decision trees (DTs) have
been used in cancer detection and diagnosis for nearly 30 years (Circchetti, 1992;
Simes, 1985; Machin et al., 1991), from the detection and classification of tumors via X-rays and CRT images (Petricoin, 2004; Bocchi et al., 2004) to the classification of malignancies from proteomic and genomic (microarray) assays (Zhon et al., 2004; Wang et al., 2005).
The fundamental goals of cancer prediction and prognosis are distinct from the goals of cancer detection and diagnosis. In cancer prediction/prognosis one is
concerned with three predictive foci, namely:
a. The prediction of cancer susceptibility (i.e. risk assessment): involves an
attempt to predict the likelihood of developing a type of cancer prior to the
occurrence of the disease;
b. The prediction of cancer recurrence: involves the prediction of the likelihood
of redeveloping cancer after the apparent resolution of the disease; and
c. The prediction of cancer survivability: involves the prediction of the outcome
(life expectancy, survivability, progression, tumor-drug sensitivity) after the
diagnosis of the disease.
In the latter two situations, the success of the prognostic prediction is
obviously dependent, in part, on the success or quality of the diagnosis performed.
However a disease prognosis can only come after a medical diagnosis and a
prognostic prediction must take into account more than just a simple diagnosis
(Hagerty et al., 2005).
Indeed, a cancer prognosis typically involves multiple physicians from
different specialties using different subsets of biomarkers and multiple clinical
factors, including the age and general health of the patient, the location and type of
cancer, as well as the grade and size of the tumor (Fielding et al., 1992; Cochran et
al., 1997; Burke et al., 2005). Histological (cell-based), clinical (patient-based) and
demographic (population-based) information must be carefully integrated by the
attending physician to come up with a reasonable prognosis. Even for the most skilled clinician, this is not an easy job, and similar challenges face physicians and patients alike when it comes to the issues of cancer prevention and cancer susceptibility prediction. Family history, age, diet, Body Mass Index (BMI),
high risk habits (like smoking and drinking) and exposure to environmental
carcinogens (UV radiation, radon and asbestos) all play an important role in
predicting an individual‘s risk for developing cancer (Bach et al., 2003; Gascon et al.,
2004; Domchek et al., 2003).
In the past, the dependency of clinicians and physicians alike on macro-scale
information (tumor, patient, population and environmental data) generally kept the
number of variables small enough so that standard statistical methods or even the
physician's own intuition could be used to predict cancer risks and outcomes.
However, with today‘s high-throughput diagnostic and imaging technologies,
physicians are now faced with dozens or even hundreds of molecular, cellular and
clinical parameters. In these situations, human intuition and standard statistics do not
generally work efficiently; rather there is a reliance on non-traditional and intensively
computational approaches such as machine learning (ML). The use of computers
(and machine learning) in disease prediction and prognosis is part of a growing trend
towards personalized, predictive medicine (Weston and Hood, 2004).
Machine learning, like statistics, is used to analyze and interpret data. Unlike statistics, though, machine learning methods can employ Boolean logic (AND, OR,
NOT), absolute conditionality (IF, THEN, ELSE), conditional probabilities (the
probability of X given Y) and unconventional optimization strategies to model data or
classify patterns. These latter methods actually resemble the approaches humans
typically use to learn and classify. Although machine learning draws heavily from
statistics and probability, it is still fundamentally more powerful because it allows
inferences or decisions to be made that could not otherwise be made using
conventional statistical methodologies (Mitchell, 1997; Duda et al., 2001).
Many statistical methods employ multivariate regression or correlation analysis, and these approaches assume that the variables are independent and that the data can be modeled using linear combinations of these variables. When the relationships are non-linear and the variables are interdependent (or conditionally dependent), conventional statistics usually flounders. It is in these situations that machine
learning tends to shine. Many biological systems are fundamentally non-linear and
their parameters conditionally dependent. Many simple physical systems are linear
and their parameters are essentially independent.
Knowing which machine learning method is best for a given problem is not inherently obvious. This is why it is critically important to try more than one machine learning method on any given training set. Another common misunderstanding about
ML is that patterns a ML tool finds or the trends it detects are non-obvious or not
intrinsically detectable. On the contrary, many patterns or trends could be detected by
a human expert – if they looked hard enough at the dataset. Machine learning
basically saves the time and effort needed to discover the pattern or to develop the
classification scheme required.
2.4 Feature Selection for the Identification of Relevant Attributes
Feature Selection (FS) is important in machine learning tasks because it can
significantly improve performance by eliminating redundant and irrelevant features while at the same time speeding up the learning task (Yildirim, 2015). Given N features, the FS problem is to find the optimal subset among the 2^N possible choices. This problem usually becomes intractable as N increases. Feature subset selection is
the process of identifying and removing as much irrelevant and redundant information
as possible (Ashraf, 2013). This reduces the dimensionality of the data and may allow
learning algorithms to operate faster and more effectively (Novakovic, 2011).
In some cases, accuracy on future classification can be improved; in others,
the result is a more compact, easily interpreted representation of the target concept.
Therefore, the correct use of feature selection algorithms for selecting features improves inductive learning, whether in terms of generalization capacity, learning speed, or reducing the complexity of the induced model (Kumar and Minz, 2014).
There are two major approaches to FS. The first is Individual Evaluation, and the
second is Subset Evaluation. In the former, the features are ranked using a weight that measures the degree of relevance of each feature; in the latter, candidate subsets of features are constructed using a search strategy.
A feature selection algorithm (FSA) is a computational solution that is
motivated by a certain definition of relevance. However, the relevance of a feature
(or a subset of features) – as seen from inductive learning perspectives – may have
several definitions depending on the objective that is sought by the FS technique.
2.4.1 The relevance of a feature
The purpose of an FSA is to identify relevant features according to a definition of relevance. However, the notion of relevance in ML has not yet been rigorously defined by common agreement (Bell and Wang, 2000). Let E_i, with 1 ≤ i ≤ n, be the domains of the features X = {x_1, x_2, x_3, ..., x_n}, and let the instance space be defined as E = E_1 × E_2 × ... × E_n, where an instance is a point in this space. Consider p, a probability distribution on E, and T, a space of target labels. The motive is to model or identify an objective function c: E → T according to its relevant features. A dataset S composed of |S| instances can be seen as the result of sampling the attribute space E under the distribution p a total of |S| times and labeling its elements using the objective function c.
The notion of relevance according to a number of researchers is defined as a
relative relationship between the attributes and the objective function, the probability
distribution, sample, entropy or incremental usefulness (Novakovic et al., 2011; Novakovic, 2009). Following are a number of definitions of the relevance of a feature or set of attributes.
a. Definition I (relevance with respect to an objective function, c): A feature x_i ∈ X is relevant to an objective c if there exist two examples A, B in the instance space E such that A and B differ only in their assignment to x_i and c(A) ≠ c(B).
In other words, there exist two instances that can only be distinguished by x_i. The definition has the inconvenience that the learning algorithm cannot necessarily determine whether a feature x_i is relevant or not using only a sample S of E (Wang et al., 1998).
b. Definition II (strong relevance with respect to the sample, S): A feature x_i ∈ X is strongly relevant to the sample S if there exist two examples A, B ∈ S that differ only in their assignment to x_i and c(A) ≠ c(B).
This definition is the same as Definition I, but now A, B ∈ S and the definition is with respect to S (Blum and Langley, 1997).
c. Definition III (strong relevance with respect to the distribution, p): A feature x_i ∈ X is strongly relevant to an objective c in the distribution p if there exist two examples A, B with p(A) > 0 and p(B) > 0 that differ only in their assignment to x_i and c(A) ≠ c(B).
This definition is the natural extension of Definition II and, contrary to it, the distribution p is assumed to be known.
d. Definition IV (weak relevance with respect to the sample, S): A feature x_i ∈ X is weakly relevant to the sample S if there exists at least one proper subset X' ⊂ X (with x_i ∈ X') for which x_i is strongly relevant with respect to S.
A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.
e. Definition V (weak relevance with respect to a distribution, p): A feature x_i ∈ X is weakly relevant to the objective c in the distribution p if there exists at least one proper subset X' ⊂ X (with x_i ∈ X') for which x_i is strongly relevant with respect to p.
Instead of focusing on which features are relevant, it is possible to use
relevance as a complexity measure with respect to the objective c. In this case, it will
depend on the type of inducer used.
f. Definition VI (relevance as a complexity measure) (Blum and Langley, 1997): Given a data sample S and an objective c, define r(S, c) as the smallest number of features relevant to c – using Definition I restricted to S – such that the error in S is the least possible for the inducer.
It refers to the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.
g. Definition VII (relevance as incremental usefulness) (Caruana and Freitag, 1994): Given a data sample S, a learning algorithm L, and a subset of features X', the feature x_i is incrementally useful to L with respect to X' if the accuracy of the hypothesis that L produces using the group of features X' ∪ {x_i} is better than the accuracy reached using only the subset of features X'.
This definition is especially natural in feature selection algorithms (FSAs) that search the feature space in an incremental way, adding or removing features to a current solution. It is also related to the traditional understanding of relevance in the philosophy literature.
h. Definition VIII (relevance as an entropic measure) (Wang et al., 1998): Denoting the (Shannon) entropy by H(x) and the mutual information by I(x; y) = H(x) − H(x|y) (the reduction of the entropy of x generated by knowledge of y), the entropic relevance of x to y is defined as r(x; y) = I(x; y)/H(y). Let X be the original set of features and let C be the objective seen as a feature; a set X' ⊆ X is sufficient if I(X'; C) = I(X; C) (i.e., if it preserves the learning information). For a sufficient set X', it turns out that r(X'; C) = r(X; C), and the most favourable sufficient set is the one for which H(X') is smallest.
This implies that r(C; X') is greatest. In short, the aim is to have r(C; X') and r(X'; C) jointly maximized.
2.4.2 Characteristics of feature selection algorithms
Feature selection algorithms (with a few notable exceptions) perform a search
through the space of feature subsets, and, as a consequence, must address four (4)
basic issues affecting the nature of the search (Langley and Sage, 1994; Patil and
Sane, 2014):
a. Starting point
Selecting a point in the feature subset from which to begin the search can
affect the direction of the search. One option is to begin with no features and
successively add attributes. In this case, the search is said to proceed forward through
the search space. Conversely, the search can begin with all features and successively
remove them. In this case, the search proceeds backwards through the search space.
Another alternative is to begin somewhere in between (in the middle) and move
outwards from this point.
b. Search organization
An exhaustive search of the feature subspace is prohibitive for all but a small initial number of features. With N initial features, there exist 2^N possible subsets of features. Heuristic search strategies are more feasible than exhaustive search methods
and can also give good results, although they do not guarantee finding the optimal
subset (Hall et al., 2009). A number of search methods are highlighted as follows:
BestFirst: Searches the space of attribute subsets by greedy hill climbing combined with backtracking; backtracking is triggered when a set number of consecutive expanded nodes fails to improve performance. It may apply a forward approach, starting from the empty set of attributes and adding attributes one at a time; a backward approach, starting from the set of all attributes and removing them one by one; or a midway (hybrid) approach, searching in both directions by considering all possible single-attribute additions and deletions at a given point (Maji and Garai, 2013).
GreedyStepwise: Performs a greedy forward or backward search through the space of attribute subsets. It may start with no attributes, all attributes, or an arbitrary point in the space, and stops when the addition/deletion of any remaining attribute results in a decrease in the evaluation. It can also produce a ranked list of attributes by traversing the space from one side to the other and recording the order in which attributes are selected; a forward variant is sketched after this list.
Ranker: Individual evaluations of the attributes are done and they are ranked
accordingly (Hua-Liang and Billings, 2007). It is normally used in conjunction
with attribute evaluators (Relief, GainRatio, Entropy etc.).
Genetic Search: Genetic Algorithms (GAs) (Goldberg, 1989) are
optimization techniques that use a population of candidate solutions. They
explore the search space by evolving the population through four steps: parent
selection, crossover, mutation, and replacement. GAs have been seen as search
procedures that can locate high performance regions of vast and complex
search spaces, but they are not well suited for fine-tuning solutions (Holland,
1992). However, the components of the GAs may be specifically designed and
their parameters tuned, in order to provide effective local search behaviour.
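A forward greedy search of the GreedyStepwise kind can be sketched in a few lines; the version below (Python/scikit-learn) uses cross-validated accuracy as the subset evaluator, which is strictly a wrapper-style criterion – Weka's GreedyStepwise can equally drive a filter evaluator – and the dataset and classifier are illustrative choices.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def greedy_forward_selection(clf, X, y, cv=5):
        # start from the empty set; repeatedly add the single feature that most
        # improves cross-validated accuracy; stop when no addition helps
        selected, best = [], 0.0
        remaining = list(range(X.shape[1]))
        while remaining:
            scored = [(cross_val_score(clf, X[:, selected + [j]], y, cv=cv).mean(), j)
                      for j in remaining]
            score, j = max(scored)
            if score <= best:          # stopping criterion
                break
            selected.append(j)
            remaining.remove(j)
            best = score
        return selected, best

    X, y = make_classification(n_samples=100, n_features=6, n_informative=3,
                               random_state=0)
    print(greedy_forward_selection(GaussianNB(), X, y))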
c. Evaluation strategy
How feature subsets are evaluated is the single biggest differentiating factor
among most feature selection algorithms for machine learning. One paradigm, dubbed the filter approach (using distance, information, consistency, dependency metrics, etc.) (Kohavi, 1995; Kohavi and John, 1996), operates independently of any machine learning algorithm – undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. Another school of thought argues that the bias of a
particular induction algorithm should be taken into account when selecting features.
This method, called the wrapper (using predictive accuracy or cluster goodness) uses
an induction algorithm along with a statistical re-sampling technique such as cross-
validation to estimate the final accuracy of feature subsets.
d. Stopping criterion
A feature selector must decide when to stop searching through the space of
feature subsets. Depending on the evaluation strategy, a feature selector might stop
adding or removing features when none of the alternatives improves upon the merit of
a current feature subset. Alternatively, the algorithm might continue to revise the
feature subset as long as the merit does not degrade.
2.4.3 Filter-based feature selection methods
Among the evaluation strategies used by feature selection methods, filter-based feature selection (FS) methods were considered in this study to determine the relevant features among those present in the data collected from the study location (Maji and Garai, 2013). This is because filter-based FS algorithms define relevance by identifying the attributes that are most correlated with the target class, and they are less computationally expensive than wrapper-based FS algorithms, which require repeated runs of the supervised machine learning algorithm.
Three (3) classes of filter-based feature selection methods considered are as
follows:
Consistency-based
Consistency measures attempt to find a minimum number of features that distinguish between the classes as consistently as the full set of features does. An inconsistency arises when multiple training samples have the same feature values but different class labels. Dash and Liu (1997) presented an inconsistency-based FS technique called Set Cover. An inconsistency count is calculated as the difference between the number of all instances matching a pattern (ignoring the class label) and the largest number of those instances sharing a single class label, for a chosen feature subset. If there are n matching patterns in the training sample space, with c1 patterns belonging to class 1 and c2 patterns belonging to class 2, and if the largest number is c2, the inconsistency count is n − c2. Hence, given a training sample S, the inconsistency count of an instance A for a feature subset X' is defined as (Liu and Motoda, 1998):

IC_X'(A) = n_X'(A) − max_k n_X'(A, k)

where n_X'(A) is the number of instances in S equal to A using only the features in X', and n_X'(A, k) is the number of instances in S of class k equal to A using only the features in X'.

By summing all the inconsistency counts and averaging over the training sample size, a measure called the inconsistency rate for a given subset is obtained. The inconsistency rate of a feature subset X' in a sample S is then:

IR(X') = ( Σ_{A∈S} IC_X'(A) ) / |S|
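The inconsistency count and rate just defined translate directly into code; the sketch below (Python, with toy patterns invented for illustration) computes the inconsistency rate of a candidate feature subset.

    from collections import Counter

    def inconsistency_rate(patterns, labels):
        # IR(X') = sum over distinct patterns of (n - max_k n_k), divided by |S|
        matches, per_class = Counter(), Counter()
        for p, c in zip(patterns, labels):
            matches[tuple(p)] += 1
            per_class[(tuple(p), c)] += 1
        total = 0
        for key, n in matches.items():
            largest = max(cnt for (k, c), cnt in per_class.items() if k == key)
            total += n - largest
        return total / len(patterns)

    # three matching patterns, two in class 1 and one in class 0 -> count 3 - 2 = 1
    patterns = [(1, 0), (1, 0), (1, 0), (0, 1)]
    labels = [1, 1, 0, 1]
    print(inconsistency_rate(patterns, labels))  # 0.25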
Correlation-based (CFS)
Correlation measures are also called similarity or dependency measures. Gennari et al. (1989) stated that features are relevant if their values vary systematically with category membership; thus, a feature is useful if it is correlated with, or predictive of, the class, and otherwise it is irrelevant. Formally, a feature V_i (a variable monitored for CML survival) is said to be relevant (predictive of CML survival) if and only if there exist some v_i (a value of the variable – nominal or numeric) and c (a target class – survived or not survived) for which p(V_i = v_i) > 0, such that (Kohavi and John, 1996):

p(C = c | V_i = v_i) ≠ p(C = c)

The implication of this is that a good feature subset (set of variables predictive of CML survival) is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. It is important to state that a group of components (CML survival indicators) that are highly correlated with the target variable will at the same time bear low correlations with each other (Hall, 1999). Equation (2.30) is used as the heuristic measure for the merit of feature subsets in supervised classification:

Merit_S = (k · r̄_cf) / √(k + k(k − 1) · r̄_ff)    (2.30)

In equation (2.30), Merit_S is the heuristic merit of a feature subset S containing k features, r̄_cf is the average feature–class correlation, and r̄_ff is the average feature–feature inter-correlation. The equation forms the core of CFS and imposes a ranking on feature subsets in the search space of all possible feature subsets. Correlation criteria are often used for microarray data analysis.
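Equation (2.30) is straightforward to compute once the average correlations are available; the sketch below (Python) evaluates the merit of a hypothetical 3-feature subset, with the correlation values invented for illustration.

    import numpy as np

    def cfs_merit(feature_class_corr, feature_feature_corr, k):
        # Merit_S = k * r_cf / sqrt(k + k * (k - 1) * r_ff)   (equation 2.30)
        r_cf = np.mean(feature_class_corr)    # average feature-class correlation
        r_ff = np.mean(feature_feature_corr)  # average feature-feature correlation
        return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

    # a 3-feature subset highly correlated with the class, weakly with each other
    print(round(cfs_merit([0.60, 0.50, 0.55], [0.10, 0.20, 0.15], k=3), 3))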
Information-based
A probabilistic model of a nominal-valued feature Y (the target class, C) can be formed by estimating the individual probabilities of its values from the training data containing the records of the variables. If this model is used to estimate the value of the target class for a sample drawn from the training data, then the entropy of the model is the number of bits it would take, on average, to correct the output of the model. Entropy is a measure of the uncertainty or unpredictability of a system. The entropy of the target class Y is given by equation (2.31):

H(Y) = − Σ_y p(y) log2 p(y)    (2.31)
If the observed values of the target class in the training data are partitioned according to the values of an input feature X, and the entropy of Y with respect to the partitions induced by X is less than the entropy of the target class prior to partitioning, then there is a relationship between the target class Y and the indicator variable X. Equation (2.32) gives the entropy of Y after observing X:

H(Y|X) = − Σ_x p(x) Σ_y p(y|x) log2 p(y|x)    (2.32)

The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called the information gain (Quinlan, 1986), or alternatively the mutual information. Thus, the information gain is given by:

IG(Y; X) = H(Y) − H(Y|X) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)    (2.33)
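In practice, information-based ranking of features can be performed with standard tooling; the sketch below (Python/scikit-learn, on a synthetic dataset) ranks features by an estimate of I(X_i; Y) – note that mutual_info_classif uses a nearest-neighbour estimator for continuous features rather than the discrete entropy formulas above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import mutual_info_classif

    X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                               n_redundant=0, random_state=0)
    scores = mutual_info_classif(X, y, random_state=0)  # estimates I(X_i; Y)
    ranking = np.argsort(scores)[::-1]                  # Ranker-style ordering
    print(ranking, np.round(scores[ranking], 3))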
2.5 Existing Models for Risk Assessment of CML Survival
Factors that can be used to predict the likely outcome (prognosis) of CML
treatment are called prognostic factors. Prognostic scoring systems use these factors
to determine a patient‘s risk score. Based on the risk score, patients are classified into
risk groups – low, intermediate and high risk. People in the same risk group are
similar in certain ways and will likely respond to certain treatments in the same way.
Therefore, doctors often use risk scores to help guide treatment decisions. In general,
a person classified as low-risk is more likely to have a better response to treatment. The
two most popular prognostic scoring systems are Sokal score (Sokal et al., 1984) and
the Hasford score (Hasford et al., 1998).
2.5.1 Sokal score
The Sokal prognostic scoring system was developed from the examination of 813 patients with Philadelphia chromosome-positive, non-blastic chronic granulocytic Leukaemia (CGL) collected from six European and American series – Roswell Park Memorial Institute, University of Bologna, Italian Cooperative CML Study Group, Memorial Sloan-Kettering Cancer Centre, University of Barcelona and Duke University. The survival pattern of the population was typical of "good-risk" patients, and the median survival time was 47 months. The prognostic factors identified were age, spleen size, platelet count and blasts; equation 2.34 shows the hazard function for the Sokal scoring system, developed using Cox regression. The hazard ratio for each patient was calculated from the expression, which ranged from 0.41 to 5.68 for 677 patients.
Hazard ratio = exp[0.0116 × (age − 43.4) + 0.0345 × (spleen − 7.51) + 0.188 × ((platelet count / 700)² − 0.563) + 0.0887 × (blasts − 2.10)]    (2.34)
The model was used to identify a lower risk group of patients with a 2-year
survival of 90%, subsequent risk averaging somewhat less than 20%/year and median
survival of 5 years, an intermediate group and a high-risk group with a 2-year survival
of 65%, followed by a death rate of about 35%/year and median survival of 2.5 years.
A later study of CGL patients between 5 and 45 years of age was used to propose a Sokal scoring system for younger patients with CGL, complementing the earlier system proposed for people up to 84 years of age (Sokal et al., 1985). This scoring system identified five (5) prognostic factors, compared with four in the earlier scoring model: platelet count, spleen size, hematocrit, percentage of blasts and sex; equation 2.35 shows the expression of the scoring model proposed by Sokal for younger patients. The median survival of the patients under study exceeded four (4) years, while the hazard (death) rate averaged 22.5%, compared to 25% in the earlier study.
(In equation 2.35, platelets are expressed in 10⁹/L and sex is coded male = 1.0, female = 2.0.)
The Relative Risk (RR) of CML patients' survival using the Sokal score is classified as follows: a score below 0.8 indicates low risk; a score from 0.8 to 1.2, intermediate risk; and a score above 1.2, high risk.
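A direct transcription of equation (2.34) and the risk cut-offs is sketched below (Python); the patient values in the example are invented for illustration.

    import math

    def sokal_score(age, spleen_cm, platelets, blasts_pct):
        # hazard ratio of equation (2.34); platelets in 10^9/L
        return math.exp(0.0116 * (age - 43.4)
                        + 0.0345 * (spleen_cm - 7.51)
                        + 0.188 * ((platelets / 700.0) ** 2 - 0.563)
                        + 0.0887 * (blasts_pct - 2.10))

    def sokal_risk_group(score):
        if score < 0.8:
            return 'low'
        return 'intermediate' if score <= 1.2 else 'high'

    s = sokal_score(age=38, spleen_cm=10, platelets=350, blasts_pct=2)
    print(round(s, 3), sokal_risk_group(s))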
The Sokal score used scoring models developed from statistical regression
analysis for both older and younger CML patients undergoing treatment for CML. It
used four variables and five variables in the model for the older and younger CML
patients respectively. The scoring model developed measured survival as a
probabilistic value and as a function of the overall survival of the study cohort and
thus classifies the survival groups into three unique groups based on the interval of the
scores. Some variables collected in the study were removed because of their
inconsistency among data collected from the different hospitals used in the study.
Unlike the Sokal score, this study considers all the variables monitored during
Imatinib treatment of Nigerian CML patients from which relevant variables were
identified and the prediction model for CML survival classification developed using
machine learning algorithms.
2.5.2 Hasford score
The Hasford prognostic scoring system was developed from the examination of 1303 patients with Chronic Myeloid Leukaemia (CML) who were treated in prospective studies, including major randomized trials, separated into learning and validation samples; Cox regression analysis and the minimum P-value approach were used to identify the prognostic factors for the survival of CML patients undergoing Interferon-α treatment. According to Hasford et al. (1998), the survival model was developed owing to the limitation of the Sokal score in its ability to discriminate the survival risk groups of people undergoing Interferon-α treatment (Ohnishi et al., 1995; Hasford et al., 1996), the Sokal score having been developed for patients undergoing busulphan or hydroxyurea treatment. As a result of this, the Collaborative CML Prognostic Factors Project was started with the goal of extracting and validating a new prognostic scoring system for patients with CML treated with IFN-α.
The median survival time of the data collected for the study was 69 months
(within a range of 1 to 117 months); 908 patients were used as the learning sample
from which three distinct groups were identified. The low-risk group had a median
survival time of 98 months (n = 369, 40.6%), the intermediate-risk group had a
median survival time of 65 months (n = 406, 44.7%) and the high-risk group with a
median survival time of 42 months (n = 133, 14.6%). The prognostic scoring model
was validated using 285 patients' data. The dataset used was collected from 1573 individual patients from 14 studies, by searching the MEDLINE® biomedical literature database (National Library of Medicine, MD), from abstracts and conference reports, and by contacting pharmaceutical companies. Participants included were from Austria (Thaler et al., 1993; Thaler and Hilbe, 1996), Belgium, The Netherlands and Luxembourg, France (Guilhot, 1993; Guilhot et al., 1996), Germany (Hehlmann et al., 1994; Hehlmann et al., 1996; Kloke et al., 1993), the United Kingdom (Allan et al., 1995), Italy (Alimena et al., 1988), Japan (Ohnishi et al., 1995), Spain (Fernandez-Ranada et al., 1993) and the United States. The prognostic factors identified were age, spleen size, platelet count, blasts, basophils and eosinophils; equation 2.36 shows the hazard function for the Hasford scoring system, developed using Cox regression. The prognostic score for each patient was calculated from the expression: a score of 780 or less identifies the low-risk group, scores above 780 up to 1480 the intermediate-risk group, and scores above 1480 the high-risk group.
Score = (0.6666 × age + 0.0420 × spleen size + 0.0584 × blasts + 0.0413 × eosinophils + 0.2039 × basophils + 1.0956 × platelet count) × 1000    (2.36)

where age = 0 when < 50 years, 1 otherwise; spleen size is in cm below the costal margin; blasts and eosinophils are percentages of peripheral blood cells; basophils = 0 when < 3%, 1 otherwise; and platelet count = 0 when < 1500 × 10⁹/L, 1 otherwise.
The Relative Risk (RR) of CML patients' survival using the Hasford (Euro) score is classified as follows: a score of 780 or less indicates low risk; a score above 780 up to 1480, intermediate risk; and a score above 1480, high risk.
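Equation (2.36) and its cut-offs also translate directly into code; the sketch below (Python) computes the Hasford score for an invented, illustrative patient.

    def hasford_score(age, spleen_cm, blasts_pct, eosinophils_pct,
                      basophils_pct, platelets):
        # equation (2.36); platelets in 10^9/L
        return (0.6666 * (1 if age >= 50 else 0)
                + 0.0420 * spleen_cm
                + 0.0584 * blasts_pct
                + 0.0413 * eosinophils_pct
                + 0.2039 * (1 if basophils_pct >= 3 else 0)
                + 1.0956 * (1 if platelets >= 1500 else 0)) * 1000

    def hasford_risk_group(score):
        if score <= 780:
            return 'low'
        return 'intermediate' if score <= 1480 else 'high'

    s = hasford_score(age=38, spleen_cm=10, blasts_pct=2,
                      eosinophils_pct=1, basophils_pct=2, platelets=350)
    print(round(s), hasford_risk_group(s))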
According to Oyekunle et al. (2012b) the Sokal score was not predictive of
differences in the Overall Survival (OS) of Nigerian patients in the study but it
sufficiently differentiates the Progression Free Survival (PFS) of patients in the risk
groups identified. The Hasford scoring system performed better, as its positive predictive value was more statistically significant for differences in PFS; but it also failed to predict differences in the Overall Survival of CML patients treated with Imatinib.
The results of Oyekunle et al. (2012) revealed that the diagnostic parameters used in
the Sokal and Hasford scores may need to be reviewed, if they are to retain their
prognostic relevance in the Imatinib era, especially regarding predicting OS and
Complete Cytogenetic Responses (CCR).
The Hasford score uses a scoring model developed using statistical regression
analysis for European CML patients receiving Interferon-alfa treatment. It uses six
variables in the model for CML survival. Like the Sokal score, the scoring model
developed measures survival as a probabilistic value and as a function of the overall
survival of the study cohort and thus classifies the survival groups into three unique
groups based on the interval of the scores. Unlike the Sokal and Hasford scores, this
study considers all the variables monitored during Imatinib treatment of Nigerian
CML patients from which relevant variables were identified and the prediction model
for CML survival classification developed using machine learning algorithms.
2.5.3 European treatment and outcome study (EUTOS) Score
Following the failure of the Sokal and Hasford scoring models at predicting the overall survival (OS) of Chronic Myeloid Leukaemia in the Imatinib era, there was a need for the development of a newer scoring model based on CML patients undergoing Imatinib treatment. According to Jabbour et al. (2012), the European LeukaemiaNet developed a new prognostic scoring system (the European Treatment and Outcome Study [EUTOS] score) using data from 2060 patients with newly diagnosed Chronic Myeloid Leukaemia in chronic phase treated with Imatinib-based regimens (Hasford et al., 2011).
The EUTOS score classifies patients into two (2) risk groups – low and high – with significant correlations with the achievement of an 18-month complete cytogenetic response (CCR) and with progression-free survival (PFS). However, studies showed that the adoption of the EUTOS score in predicting the OS and PFS of CML patients undergoing Imatinib treatment still requires validation (Jabbour et al., 2012; Marin et al., 2011). Equation (2.37) shows the expression for the prognostic score proposed by the European LeukaemiaNet, the European Treatment and Outcome Study (EUTOS) score:

EUTOS score = 7 × basophils (%) + 4 × spleen size (cm)    (2.37)

The Relative Risk (RR) of CML patients' survival using the EUTOS score is classified as follows: a score of 87 or less indicates low risk, while a score above 87 indicates high risk.
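The EUTOS score is the simplest of the three to compute, as the sketch below (Python, with invented illustrative values) shows.

    def eutos_score(basophils_pct, spleen_cm):
        # equation (2.37)
        return 7 * basophils_pct + 4 * spleen_cm

    def eutos_risk_group(score):
        return 'low' if score <= 87 else 'high'

    s = eutos_score(basophils_pct=3, spleen_cm=12)
    print(s, eutos_risk_group(s))  # 69 low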
The EUTOS score used a scoring model developed from statistical regression
analysis for European CML patients receiving Imatinib treatment. It used two
variables in the model for CML survival. Like the earlier scoring models, EUTOS
measures survival as a probabilistic value and as a function of the overall survival of
the study cohort and thus classifies the survival groups into two unique groups based
on the interval of the scores. The EUTOS score has more relevance to CML patients receiving Imatinib treatment, but it was derived from, and so far restricted to, European CML patients.
The proposed study considers all the variables monitored during Imatinib
treatment of Nigerian CML patients from which relevant variables were identified and
the prediction model for CML survival classification developed using machine
learning algorithms.
A summary of the variables identified by each proposed scoring model is shown in Table 2.8. All variables were identified using statistical significance tests (p ≤ 0.05) to determine the most relevant of the variables collected during follow-up under the respective treatment used in proposing each scoring model.
2.6 Related Works
There are a number of published works on the application of machine learning algorithms to cancer risk assessment, survivability and recurrence. Most of the published works, however, address other types of cancer; none applies machine learning algorithms to CML survival. A number of such works have nevertheless stressed the effectiveness of machine learning in developing efficient prediction models.
Idowu et al. (2015) developed a predictive model for the survival of pediatric
sickle cell disease (SCD) using clinical variables. The predictive model was
developed with fuzzy logic using three (3) clinical variables, while the rules for the inference engine were elicited from an expert pediatrician. The fuzzy logic-based model
was not validated using live clinical datasets. Moreover, relevant variables for SCD
survival could have been easily identified using feature selection methods from a
larger collection of variables monitored for pediatric SCD survival.
Table 2.8: Variables identified by existing risk scoring models for CML

S/N  Sokal           Sokal (Younger)  Hasford           EUTOS
1    Age             Sex              Age               Basophils (%)
2    Spleen (cm)     Spleen (cm)      Spleen (cm)       Spleen (cm)
3    Platelet count  Platelet count   Platelet count    –
4    Blasts (%)      Blasts (%)       Blasts (%)        –
5    –               PCV              Eosinophils (%)   –
6    –               –                Basophils (%)     –
Agbelusi (2014) performed a comparative analysis of three supervised
machine learning algorithms for the prediction of the survival of pediatric HIV/AIDS
patients. The machine learning algorithms used were naïve Bayes, decision trees and
multi-layer perceptron, without the application of feature selection algorithms to
identify relevant features. Rather than basing the prediction of HIV/AIDS survival on
the features used in the study, a larger number of features monitored in HIV/AIDS
patients could have been collected and feature selection methods used to identify the
relevant features for HIV/AIDS survival.
Ahmad et al. (2013) performed a comparative analysis of three machine
learning algorithms for the prediction of breast cancer recurrence. Support vector
machines (SVM), multi-layer perceptron and decision trees algorithms were used to
formulate the model. Feature selection was not used before model formulation
to identify the relevant variables for cancer recurrence. The
identification of relevant variables could have improved the performance of the models
developed using the machine learning algorithms.
Thongkam and Sukmak (2012) performed a comparative analysis of the
combination of bagging with several weak learners to build five (5) breast cancer
survivability prediction models. 10-fold cross-validation was used to train the models
using decision tree learner algorithms, namely J48 (also called C4.5), REPTree and
decision stump, in combination with the bagging algorithm, using 14 and 10
attributes from the breast cancer dataset. The results showed a
significant improvement in the performance of the prediction model developed using
10 attributes compared to using the 14 attributes. Other supervised machine learning
algorithms apart from the decision trees algorithm could have been used to justify the
performance of the decision trees algorithm in developing predictive models for cancer
survival.
Ganda and Chahar (2013) performed a comparative analysis of predictive
models developed for predicting the survival of heart failure patients using unsupervised
machine learning algorithms. The k-means clustering algorithm was used to classify the
survival of heart failure patients into two (2) groups following the application of a
correlation-based feature selection algorithm to select relevant variables for heart
failure survival. The performance of the k-means clustering algorithm improved
when the identified relevant variables were used compared to using all the variables.
The predictive performance of other supervised machine learning
algorithms should have been compared to that of the k-means algorithm in order to
justify its performance.
Yussuff et al. (2012) applied statistical methods to the development of a
predictive model for the risk of breast cancer using features identified from breast
mammogram images. Logistic regression was used in the development of the
prediction model, whereas supervised machine learning algorithms would have
produced a more effective and efficient model owing to the identification of relevant
variables using feature selection methods.
Vanneschi et al. (2011) performed a comparative analysis of four supervised
machine learning algorithms for the prediction of breast cancer survival. The
algorithms used included the genetic algorithm (GA),
support vector machines (SVM), artificial neural networks (ANN) and random forest
decision trees, using 70 gene signatures as attribute variables. The genetic algorithm
outperformed the other methods due to its ability to identify relevant variables for
predicting breast cancer survival using its principle of natural selection. Feature
selection methods could have been used to identify the most relevant variables before
applying the machine learning algorithms, thereby improving performance.
Luo and Chang (2010) applied machine learning algorithms to the
classification of breast masses on digital mammograms. The C4.5 decision trees and
support vector machines algorithms were used following the application of
forward and backward feature selection algorithms for the identification of relevant
features. It was discovered that there was no significant difference in the performance
of the algorithms following the application of the feature selection methods. The
limitation of the study was the inability of the feature selection methods used to
adequately identify the variables predictive of breast cancer, leading to the inability to
improve the model's performance. Other filter-based feature selection methods should
have been explored in order to identify the most relevant features, leading to an
improved performance of the model.
CHAPTER THREE
RESEARCH METHODOLOGY
3.1 Introduction
In this chapter, the methodology applied in this study is clearly defined. The
chapter starts with a description of the framework for the research methodology,
which explains the series of steps required, from data identification and
collection through model formulation to performance evaluation of the developed
predictive models. Before the formulation of the model using machine learning
algorithms, filter-based feature selection methods were used to identify the relevant
features for predicting the survival of Chronic Myeloid Leukaemia (CML) in Nigerian
patients at the identified study location. In addition, the selected machine learning
algorithms adopted for the formulation of the predictive model are presented alongside
a description of their respective loss/cost functions used in the model formulation
process. Finally, the metrics of performance evaluation are presented alongside the
simulation environment chosen for the study.
3.2 Methodology Framework of the Study
This study involves the use of supervised machine learning algorithms in the
development of a predictive model for CML survival using data collected from a study
location in South-western Nigeria. Figure 3.1 shows the methodology framework
applied in the development of the predictive model for CML survival in
Nigerian patients. The study began with the identification of the variables monitored
during the follow-up of Imatinib treatment administered to CML patients by the
physicians at the study location, and the collection of the dataset containing the
identified variables for patients at the study location.
Figure 3.1: Methodology framework for predictive modeling of CML survival
The dataset collected from the hospital formed the basis of the historical
dataset, which contains records of predictive parameters (survival indicators)
and the survival time (output variable). Filter-based feature selection methods were
used to identify the most relevant and important features among those collected,
based on the distribution of the dataset collected from the study location. The reduced
feature set was identified to be predictive of CML patients' survival. Following this,
the historical dataset containing the reduced feature set was divided into training and
testing datasets and fed to each supervised machine learning algorithm proposed for
this study using the n-fold cross-validation evaluation method. The
performance of each combination of filter-based feature selection method and
supervised machine learning algorithm was used to identify the most effective and
efficient predictive model for CML survival.
3.3 Data Identification and Collection
This section highlights the process involved in identifying the data containing
the variables monitored during Imatinib treatment of Nigerian CML patients. Each
variable was identified and properly defined along with its respective units.
The method of data collection is also clearly stated, showing from whom the data
was collected and the instruments of data collection used at the data source, alongside
the identification of the different survival classes in the dataset.
3.3.1 Identification of variables monitored during Imatinib treatment
Following the review of literature in the body of knowledge of chronic
myeloid Leukaemia survival, a number of features were identified as being monitored
during the follow-up of CML patients receiving Imatinib treatment. The variables
identified in related literature were compared to the variables
monitored by physicians of the Obafemi Awolowo University Teaching Hospital
Complex (OAUTHC), Ile-Ife – the only referral hospital for Imatinib treatment of CML
in Nigeria.
Table 3.1 gives a description of the variables identified as being
monitored in CML patients receiving Imatinib treatment in Nigeria; this information
comprises socio-demographic, clinical and CML survival-related data. The
variables identified include: the age of the patient (in months), time to start of
treatment (Imatinib) from date of diagnosis (in months), gender (male or female),
spleen and liver size, packed cell volume (PCV), white blood cell (WBC) count,
platelet count, basophils (measured as a %), eosinophils (measured as a %), disease
phase at diagnosis – Chronic (CP), Acute (AP) and Blast crisis Phase (BP) – vital
status (alive or dead) and the survival time (measured in days).
A description of each identified variable (attribute) follows:
a. Time of Imatinib treatment from date of diagnosis: the time from the
date of the diagnosis of CML by the physician to the start of Imatinib
treatment; it is a numeric value recorded as a number of days or months;
b. Age at diagnosis: the age of the patient at the time of diagnosis; it is a numeric
value expressed as a number of days, months or years;
c. Gender: the sex of the CML patient; it is a nominal value which
is recorded as either male or female;
d. Spleen size: the distance by which the spleen extends below the costal margin;
it is a numeric value expressed in centimeters (cm);
e. Liver size: the increase in the size of the liver; it is a numeric
value expressed in centimeters (cm);
Table 3.1: Variables monitored during the follow-up of Imatinib treatment

Name                                              | Unit of Measure | Labels
Time of Imatinib treatment from date of diagnosis | Months          | Numeric
Age at diagnosis                                  | Months          | Numeric
Gender                                            | Nil             | Male, Female
Spleen size                                       | cm              | Numeric
Liver size                                        | cm              | Numeric
Packed Cell Volume (PCV)                          | %               | Numeric
White Blood Cell (WBC)                            | ×10⁹/L          | Numeric
Platelet Count                                    | ×10⁹/L          | Numeric
Basophil                                          | %               | Numeric
Eosinophil                                        | %               | Numeric
Phase at Diagnosis                                | Nil             | CP, AP, BP
Survival Time (ST)                                | Days            | Numeric
Vital Status (VS)                                 | Nil             | Alive, Dead
f. Packed Cell Volume (PCV): also called the hematocrit level, is the
volume percentage of red blood cells in the blood; it is normally about 45%
for men and 40% for women; it is a numeric value expressed as a percentage
(%);
g. White Blood Cell (WBC) count: an indicator of the presence of a systemic
disorder or a bone marrow disease; it is the number of white blood cells found
in one (1) litre of blood. It is a numeric value expressed as ×10⁹/L;
h. Basophil: a type of white blood cell that circulates in the human blood; it is a
numeric value expressed as a percentage (%) of the cells found in the blood;
i. Platelet count: a measure of how many platelets there are in the blood;
platelets allow the clotting of blood during injuries or cuts. It is a
numeric value expressed as ×10⁹/L;
j. Eosinophils: a count of eosinophils (a type of white blood cell) in the blood.
It is a numeric value expressed as a percentage (%);
k. Disease phase at diagnosis: the phase of the disease at the
time of diagnosis; it is a nominal value expressed as CP for chronic phase, AP
for acute phase and BP for blast crisis phase;
l. Vital Status: the status of the patient at the time
the information was collected; it is a nominal value expressed as either dead or
alive; and
m. Survival time: the period during which the patient was
monitored; it is a numeric value expressed as a number of days (or
months).
3.3.2 Data collection of variables monitored
Following ethical approval by the Health Records Ethical Committee (HREC)
approval board of the Obafemi Awolowo University Teaching Hospital Complex
(OAUTHC), the data required for the development of the predictive model for the
survival of CML patients receiving Imatinib treatment were collected. There was no
need for consent forms since the patients were not required to partake in the study;
rather, electronic data containing information about each CML patient, excluding
personal information (e.g. names, address, hospital ID, contact number etc.), were
collected from the OAUTHC health records.
The data collected were stored in spreadsheet format and copied using a flash
drive following the identification of the variables monitored during follow-up of
Imatinib treatment. For the purpose of handling the problem as a classification
problem, the target class (output variable) was determined using three labels, namely:
survived, not survived and censored.
Survived: refers to the CML patients that lived up to or beyond the
estimated survival time (2 or 5 years), whether dead or alive (vital
status);
Not Survived: refers to the CML patients that did not live up to the estimated
survival time and are dead; and
Censored: refers to the CML patients that were lost during follow-up for
one reason or another – the patient's survival time is less than the estimated
survival time and they are still alive.
The pseudo-code below was used to assign a target
class (Survived, Not Survived or Censored) to each patient's record using the
values of the vital status and the survival time of each patient (in days). The
number of days in a year was assumed to be 364, based on 52 weeks in a year
and 7 days in a week.
If (Survival time >= n)
Then Survival class = “Survived”
Else if ((Survival time < n) AND (Vital Status = “Alive”))
Then Survival class = “Censored”
Else
Survival Class = “Not Survived”.
End if.
where n is the time in days (728 days and 1820 days for 2 and 5 years respectively)
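A minimal Python sketch of this labeling rule follows; the function and variable names
are illustrative rather than part of the study's implementation.

# Minimal sketch of the survival-class assignment rule above (names illustrative).
def survival_class(survival_time_days, vital_status, n):
    # n is the estimated survival time: 728 days (2 years) or 1820 days (5 years)
    if survival_time_days >= n:
        return "Survived"
    elif vital_status == "Alive":
        return "Censored"        # lost to follow-up before reaching n days
    else:
        return "Not Survived"

# Example: a patient followed for 800 days who is still alive
print(survival_class(800, "Alive", 728))   # -> Survived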
Following the identification of the target survival class using the pseudo-code
above for each patient, the records with the target class identified as censored were all
removed from the original dataset. This was because the study is only
concerned with patients who were followed up for treatment and lived
up to the estimated time, or who died during the course of receiving treatment at
the hospital. The variables monitored are assumed to contain the variables relevant to
predicting the survival of CML disease.
3.4 Identification of Relevant Features Predictive for CML Survival
Following the process of identification and collection of the data needed
for developing the predictive model, it was necessary to determine which set of
variables is more predictive for CML survival classification. According to the
literature, identifying the variables relevant to CML survival prediction is likely to
improve the performance of the supervised machine learning algorithms
and also reduce the complexity of the model.
The basic algorithm for implementing the filter-based feature selection
methods used in this study is shown below. The training
data collected contains a set of CML patients' records X and a feature set F of
attributes (risk factors). The algorithm may start with either an empty set or a subset
of X, using a search strategy to select the initial feature subset X'. The independent
measure Im evaluates each generated subset Xg and compares it to the previous optimal
subset evaluation, starting from the initial feature subset X'. The search iterates until
the stopping criterion is met. Finally, the algorithm outputs the current optimal feature
subset Xopt.
The algorithm is presented as follows (Kumar and Minz, 2014):
INPUT:
  D = {X, F}   //A training dataset with n CML patients' records:
               //X – monitored variables from CML patients,
               //F – survival class labels of the n patients
  X'           //Predefined initial feature subset
  δ            //Stopping criterion
OUTPUT: Xopt   //An optimal subset – relevant indicators of CML survival
Begin:
  Initialize:
    Xopt = X';             //apply a search algorithm of choice
    γopt = Im(X', D);      //evaluate X' using an independent measure Im
  do begin
    Xg = generate(D);      //select next subset of CML indicators for evaluation
    γg = Im(Xg, D);        //evaluate current subset using Im
    if (γg is better than γopt)   //if new evaluation is better than previous
      γopt = γg;           //replace old evaluation with new
      Xopt = Xg;           //replace initial subset of features with new one
  repeat (until δ is reached);
  end
  return Xopt;             //identified optimal set of relevant features of CML survival
end;
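The loop above can be sketched in Python as follows; the subset generator, the
independent measure Im and the stopping test are assumed to be supplied by the
concrete FS method, and all names are illustrative.

# Sketch of the generic filter-based feature selection loop (names illustrative).
def filter_select(initial_subset, generate_next, im, stop):
    x_opt = set(initial_subset)
    gamma_opt = im(x_opt)              # evaluate the initial subset with Im
    while not stop(x_opt, gamma_opt):
        x_g = generate_next(x_opt)     # next candidate subset of CML indicators
        gamma = im(x_g)                # evaluate the candidate subset with Im
        if gamma > gamma_opt:          # keep the better evaluation
            gamma_opt, x_opt = gamma, set(x_g)
    return x_opt                       # optimal subset of relevant features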
Following is a description of the feature selection methods used for the
evaluation of relevant features in the CML survival training dataset alongside their
respective independent evaluation measure, Im.
3.4.1 Consistency-based
The consistency-based FS measures the minimum number of CML survival
features that distinguish between the target classes (Survived and Not Survived) as
consistently as the full set of 13 identified features. Inconsistency arises when
multiple CML patient records have the same feature values but different class labels.
An inconsistency count was calculated as the difference between the number of all
matching patterns (ignoring the class label) and the largest number of those patterns
belonging to a single class label for a chosen subset. Hence, for every n matching
patterns in the CML patients' training sample space, if there are c1 patterns belonging
to class 1 (e.g. Survived) and c2 patterns belonging to class 2 (e.g. Not Survived), and
the largest number is c2, then the inconsistency count was calculated as n – c2. Hence,
given the training sample S, the inconsistency count of an instance \(A \in S\) is:

\[ IC(A) = X(A) - \max_{k} X_{k}(A) \qquad (3.1) \]

where \(X(A)\) is the number of instances in S equal to A using only the features in
the candidate subset F', and \(X_k(A)\) is the number of instances in S of class k equal
to A using only the features in F'.
The sum of all the inconsistency counts, averaged over the size of the training
sample, was used to measure the inconsistency rate for a given subset of
attributes. The inconsistency rate of a feature subset F' in a sample S is then:

\[ IR(F', S) = \frac{\sum_{A \in S} IC(A)}{|S|} \qquad (3.2) \]
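A minimal Python sketch of equations (3.1) and (3.2), assuming each record has been
reduced to the tuple of values it takes on the candidate subset F' (names and example
values are illustrative):

from collections import Counter, defaultdict

def inconsistency_rate(rows):
    # rows: list of (pattern, class_label) pairs, where pattern is the tuple
    # of values a record takes on the candidate feature subset F'
    per_pattern = defaultdict(Counter)
    for pattern, label in rows:
        per_pattern[pattern][label] += 1
    # IC(A) = X(A) - max_k X_k(A), summed over all distinct patterns (eq. 3.1)
    total_ic = sum(sum(c.values()) - max(c.values())
                   for c in per_pattern.values())
    return total_ic / len(rows)        # IR(F', S) = sum of IC / |S| (eq. 3.2)

# Example: two records share a pattern but disagree on the survival class
rows = [(("<31", "CP"), "Survived"), (("<31", "CP"), "Not Survived"),
        ((">=31", "BP"), "Not Survived")]
print(inconsistency_rate(rows))        # -> 0.333...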
3.4.2 Correlation-based (CFS)
The correlation-based FS, also called a similarity measure, uses dependency
measures between features and the target class to determine relevant features.
The implication is that a good feature subset (set of variables that are predictive of
CML survival) is one that contains features highly correlated with (predictive of) the
class, yet uncorrelated with (not predictive of) each other. In other words, the group
of components (CML survival indicators) that are highly correlated with the target
variable will at the same time bear low correlations with each other. Equation (3.3)
shows the evaluation measure used by the correlation-based FS in identifying relevant
variables:

\[ Merit_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (3.3) \]

In equation (3.3), \(Merit_{S_k}\) was used to score a subset of risk factors
containing k features, \(\overline{r_{cf}}\) is the average feature–class
correlation and \(\overline{r_{ff}}\) is the average feature–feature inter-correlation.
The equation forms the core of CFS and imposes a ranking of feature subsets (sets of
predictive indicators of CML survival) in the search space of all possible feature
subsets.
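A small Python sketch of the merit computation in equation (3.3), assuming the
average correlations have already been estimated from the data (the values below are
illustrative):

import math

def cfs_merit(k, r_cf, r_ff):
    # k features, average feature-class correlation r_cf,
    # average feature-feature inter-correlation r_ff (equation 3.3)
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Example: 4 features, feature-class correlation 0.5, inter-correlation 0.2
print(round(cfs_merit(4, 0.5, 0.2), 2))   # -> 0.79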
3.4.3 Information-based
Information-based FS is a probabilistic model for a nominal-valued feature,
formed by estimating the individual probabilities of the values
of the target class from the training data containing the records of variables monitored
during follow-up of Imatinib treatment for CML. This model was used to estimate
the entropy of the target class followed by that of the attribute feature. The
entropy is the number of bits it would take, on average, to correct the output of the
model; it is a measure of the uncertainty or unpredictability in a system. The
entropy of the target class Y (CML survival – Survived or Not Survived) was
estimated from the training dataset using equation (3.4), while equation (3.5) was used
to estimate the entropy of the target class Y after observing the attribute X:

\[ H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y) \qquad (3.4) \]

\[ H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \qquad (3.5) \]

If the observed values of the target class in the training data are partitioned
according to the values of an input feature X, and the entropy of Y with respect to the
partitions induced by X is less than the entropy of the target class prior to partitioning,
then there is a relationship between the target class Y and the indicator variable X.
The amount by which the entropy of Y decreases reflects the additional information
about Y provided by X and is called the information gain (Quinlan, 1986), or
alternatively mutual information. Thus, the information gain is given by:

\[ IG(Y; X) = H(Y) - H(Y \mid X) \qquad (3.6) \]
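A short Python sketch of equations (3.4) to (3.6), computing the information gain of a
single nominal attribute against the survival class (the example values are
illustrative):

import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum p(y) log2 p(y)  (equation 3.4)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    # IG(Y; X) = H(Y) - H(Y | X)  (equations 3.5 and 3.6)
    n = len(ys)
    parts = {}
    for x, y in zip(xs, ys):
        parts.setdefault(x, []).append(y)
    h_y_given_x = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(ys) - h_y_given_x

xs = ["high", "high", "low", "low"]                        # attribute values
ys = ["Survived", "Survived", "Not Survived", "Survived"]  # survival class
print(round(information_gain(xs, ys), 3))                  # -> 0.311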
3.5 Formulation of Predictive Model for CML Survival
Following the identification of the most relevant and predictive variables
(prognostic factors) for CML survival prediction, the next phase was the formulation
of the predictive model for CML survival using the identified variables. Mathematical
expressions called mapping functions were used to express the process of model
development (and its loss function), following the description of the selected
supervised machine learning (SML) algorithms adopted for the purpose of this study.
The training dataset S consists of the original features identified at the point of
data identification and collection, represented by \(X_i\), where i is the number of
features existing in the original dataset of patients whose records were collected
(number of CML survival cases); \(X_j\) consists of the features relevant for
predicting the survival of CML, where \(j \leq i\). The process of feature selection is
represented by the mapping:

\[ FS: X_i \rightarrow X_j \qquad (3.7) \]

where \(X_i\) is the original set of attributes collected and \(X_j\) is the set of relevant
features selected by the FS method.
Following the process of feature selection, the new CML patients' records are
\(X_{k,j}\), where k is the number of CML patients' records and j is the number of
relevant features selected from the original i features. If k records were selected for
training the supervised machine learning (SML) algorithms adopted for the model
using the relevant variables, then the model can be represented by the mapping:

\[ \varphi: X_{k,j} \rightarrow Y_k \qquad (3.8) \]

where \(X_{k,j}\) is the set of relevant attributes j for patient k and \(Y_k\) is the survival
class of patient k given the values of \(X_{k,j}\).
The mapping function which describes the predictive model formulated for
CML survival using the identified risk factors/variables (relevant features) is:

\[ \varphi(X_{k,j}) = \begin{cases} \text{Survived} \\ \text{Not Survived} \end{cases} \qquad (3.9) \]

where \(\varphi\) is as described in equation (3.8).
Supervised machine learning (SML) algorithms are generally black-box
models, which implies that there is no general equation that can be used to describe
the predictive model mathematically. However, all SML
algorithms have a metric that is used to evaluate how well the model is doing during
the training and testing process of model development. The following are the SML
algorithms that were used in this study for the development of the predictive model
for CML survival classification.
3.5.1 Decision trees algorithm (DT)
There are different implementations of the decision trees algorithm, but
for this study the C4.5 decision trees algorithm was adopted. The tree was built
from the training dataset S by making it the root node. For every iteration, the
entropy and information gain were estimated for each attribute X (risk factor for
CML survival) in the training dataset S. The algorithm selects the attribute with the
highest information gain, and the set S is split by the attribute's labels (e.g. phase of
disease is chronic, acute or blast) to produce subsets of data.
The algorithm was continued on each subset using attributes never used before, in
order to construct a tree with each non-terminal node representing the selected
attribute (CML survival risk factor) on which the data was split and the terminal nodes
representing the class labels (CML survival class) of the final branch.
Assuming that X (e.g. sex of patient) is an attribute and Values(X) is the set
of labels assigned to X (e.g. the labels assigned to sex of patients are male and female),
the Information Gain (IG) and Split Criterion used for the tree construction by the
C4.5 DT algorithm are shown in equations (3.10) and (3.11) respectively:

\[ IG(S, X) = H(S) - \sum_{v \in Values(X)} \frac{|S_v|}{|S|}\, H(S_v) \qquad (3.10) \]

\[ SplitInfo(S, X) = -\sum_{v \in Values(X)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \qquad (3.11) \]

where \(H(S)\) is the entropy of the training set S as defined in equation (3.4) and
\(S_v\) is the subset of S for which attribute X has value v. C4.5 selects the split that
maximizes the gain ratio: the information gain of equation (3.10) divided by the
split information of equation (3.11).
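A self-contained Python sketch of the gain-ratio computation implied by equations
(3.10) and (3.11); the attribute and class values passed in would be columns of the
CML training data (names illustrative):

import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(xs, ys):
    # xs: values of attribute X per record; ys: survival class per record
    n = len(ys)
    parts = {}
    for x, y in zip(xs, ys):
        parts.setdefault(x, []).append(y)
    info_gain = _entropy(ys) - sum(len(p) / n * _entropy(p)
                                   for p in parts.values())      # eq. (3.10)
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in parts.values())                   # eq. (3.11)
    return info_gain / split_info if split_info > 0 else 0.0

# C4.5 splits on the attribute with the highest gain ratio.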
3.5.2 Support vector machines (SVM)
The support vector machine (SVM) is a supervised machine learning
algorithm used for both classification and regression problems. In this study, the
sequential minimal optimization (SMO) algorithm proposed by John Platt was used to
formulate the predictive model for CML survival classification using SVM.
Consider the training dataset containing the relevant risk factors for CML survival,
consisting of j features (risk factors) \(x_i\) for each CML patient alongside the target
class for each patient, represented as \(y_i \in \{-1, 1\}\); the dataset can then be
represented as a set of pairs \((x_1, y_1), \ldots, (x_n, y_n)\).
The problem is to produce a soft-margin hyperplane able to
separate the members of each class; this was handled using the SMO
algorithm to solve the problem as a quadratic programming problem such that in dual
form we have:

\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (3.12) \]

subject to the constraints:

\[ 0 \leq \alpha_i \leq C \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]
Figure 3.2: SVM linear hyperplane separation for CML survival classification
where C is an SVM hyper-parameter, \(K(x_i, x_j)\) is a kernel function and the
\(\alpha_i\) are Lagrange multipliers.
Assuming a kernel function was used for the binary classification of CML
survival and the data were observed to be linearly separable, Figure 3.2 shows a
description of the separation of the members of the two classes (Survived and Not
Survived cases) by the hyperplane created using the linear kernel function;
for the purpose of this study, however, a polynomial kernel function was used.
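For illustration only (the study itself used Weka's SMO implementation), an SVM
with a polynomial kernel can be sketched with scikit-learn's SVC; X_train and
y_train stand for a hypothetical matrix of selected risk factors and the corresponding
survival classes.

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative SVM with a polynomial kernel, comparable in spirit to Weka's SMO
model = make_pipeline(StandardScaler(),
                      SVC(kernel="poly", degree=2, C=1.0))
# model.fit(X_train, y_train); predictions = model.predict(X_test)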
3.5.3 Multi-layer perceptron (MLP)
The multi-layer perceptron is an artificial neural network architecture with
one or more hidden layers which makes use of the feed-forward and back-propagation
algorithms for the development of the predictive model for CML survival
classification from a given training dataset of CML patients' records. The relevant
attributes (risk factors) selected for CML survival classification were applied as
input neurons to the input layer of the MLP with initial weights \(w_i\) attached to each
respective input i. The initial values of the weights attached to each input were
randomly assigned using a random number generator; a bias value b was also
added to the sum of products of the inputs (risk factors) and weights. Equation
(3.13) shows a mathematical representation of what happens at the input layer of the
MLP.
If the i relevant risk factors for CML survival classification selected using the FS
methods, which were used as input neurons, are represented by the set
\(\{x_1, x_2, \ldots, x_i\}\), then the attachment of the synaptic weights and the bias
will be (see Figure 3.3):

\[ z = \sum_{i} w_i x_i + b \qquad (3.13) \]
Figure 3.3: Artificial Neural Network Structure for CML survival classification
Following the attachment of the weights to their respective input neurons, the
forward-propagation algorithm sends the result of equation (3.13) to the activation
functions in the k hidden layers. The input layer of the MLP is represented by k = 1
while the respective hidden layers are represented as k = 2 to k for k hidden layers.
The activation function used in this study was the sigmoid function, defined in
equation (3.14):

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (3.14) \]
The results of equation (3.14) were propagated through the hidden layers of
the MLP until they reached the output layer, where the value of the prediction made
by the model is presented as p. Following this, the second phase of the algorithm was
performed, where the difference between the predicted value p and the actual value y
is used to estimate the error E. Recall that the value of the output p depends on the
weighted sum of all the inputs as indicated in equation (3.13), implying that the error
depends on the incoming weights of the neurons, which need to be changed.
The gradient descent algorithm was used to minimize the error and find the
optimal weights that satisfy the problem. The error function is defined by equation
(3.15) while the partial derivative of the error with respect to each weight
assigned to the neurons is defined by equation (3.16):

\[ E = \frac{1}{2} \sum_{k} (p_k - y_k)^2 \qquad (3.15) \]

\[ \frac{\partial E}{\partial w_{ij}} = \delta_j O_i, \quad \text{where } \delta_j = \begin{cases} (O_j - y_j)\, O_j (1 - O_j) & \text{if } j \text{ is an output neuron} \\ \left( \sum_{k} \delta_k w_{jk} \right) O_j (1 - O_j) & \text{if } j \text{ is a hidden neuron} \end{cases} \qquad (3.16) \]
Thus, the derivative with respect to \(O_k\) was calculated from the estimated
derivatives with respect to the outputs \(O_k\) of the next layer (the one closer to the
output neuron). Therefore, in order to update the weight \(w_{ij}\) using gradient
descent, one must choose a learning rate \(\eta\). The change in weight, which is added
to the old weight of equation (3.13), was determined by equation (3.17):

\[ \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \qquad (3.17) \]
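A minimal numpy sketch of one gradient-descent update for a single sigmoid output
neuron, following equations (3.13) to (3.17); x, y and the learning rate eta are
illustrative values, not the study's settings.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # equation (3.14)

def update(w, b, x, y, eta=0.1):
    p = sigmoid(np.dot(w, x) + b)        # forward pass: equations (3.13), (3.14)
    delta = (p - y) * p * (1 - p)        # output-neuron error term: equation (3.16)
    w = w - eta * delta * x              # weight change: equation (3.17)
    b = b - eta * delta
    return w, b

w, b = np.zeros(3), 0.0                  # three illustrative risk factors
w, b = update(w, b, np.array([1.0, 0.0, 1.0]), y=1)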
Figure 3.4 shows a conceptual view of the comparative analysis approach used
in proposing the most effective combination of feature selection strategy and
supervised machine learning algorithm needed to develop the predictive model for the
classification of CML patients' survival. As shown in the diagram, three (3) feature
selection methods (each alongside a search strategy) were used to propose the
variables most predictive of the 2- and 5-year survival, following which three (3)
supervised machine learning algorithms were used to formulate the predictive models
using the variables selected by each feature selection process. In all, nine (9)
predictive models were developed for each of the 2- and 5-year survival data. Each
model's performance was evaluated using a validation procedure and a set of
performance metrics, following which the most effective and efficient predictive
model for CML survival was selected and proposed.
3.6 Simulation of Predictive Model for CML Survival
Following the identification of the supervised machine learning algorithms
needed to formulate the predictive model for CML survival, the simulation of
the predictive model was performed using the data collected on the variables
monitored in CML patients receiving Imatinib treatment at the referral hospital in
Nigeria.
Figure 3.4: Conceptual view of the comparative analysis
The Waikato Environment for Knowledge Analysis (WEKA) software – a
suite of machine learning algorithms – was used as the simulation environment for the
implementation of the predictive model.
3.6.1 Model training and validation process
For the purpose of developing the predictive model for the classification of
CML patients' survival, the collected data containing the values of
the identified indicators for CML survival were used to formulate the model using the
three proposed supervised machine learning algorithms.
The dataset collected was divided into two parts: training and testing data –
the training data was used to formulate the model while the test data was used to
validate it. The process of training and testing a predictive model is, according to the
literature, a challenging task, especially given the variety of available validation
procedures.
For classification problems, it is natural to measure a classifier's performance
in terms of the error rate. The classifier predicts the class of each instance – the CML
patient's record containing values for each survival indicator: if the prediction is
correct, it is counted as a success; if not, it is an error. The error rate is the proportion
of errors made over a whole set of instances, and it measures the overall performance
of the classifier. The error rate on the training dataset is not likely to be a good
indicator of future performance, because the classifier has been learned from the very
same training data.
In order to predict the performance of a classifier on new data, there is the
need to assess its error rate on a dataset that played no part in the formation of the
classifier. This independent dataset is called the test dataset, which should be as
representative a sample of the underlying problem as the training data. It is important
that the test dataset is not used in any way to create the classifier, since machine
learning classifiers involve two stages: one to come up with the basic structure of the
predictive model and the second to optimize the parameters involved in that structure.
3.6.2 Cross-validation
The process of leaving out part of a whole dataset as testing data while the rest
is used for training the model is called the holdout method. The challenge here is the
need to find a good classifier by using as much of the whole historical data
as possible for training, while obtaining a good error estimate by using as much as
possible for model testing. It is common to hold out one-third of the whole historical
dataset for testing and use the remaining two-thirds for training.
It is important to ensure that the random sampling of dataset records is done in
a way that guarantees that each class is properly represented in both training and
testing datasets; this procedure is referred to as stratification, thus stratified
holdout in the case of this study. Although stratification provides only a
primitive safeguard against uneven representation in training and testing datasets, a
more general way to mitigate the bias caused by the sample chosen is to repeat the
whole process, training and testing, several times with different random samples. For
each iteration, a certain proportion (two-thirds) is randomly selected for training and
the rest for testing.
For this study the cross-validation procedure was employed, which involves
dividing the whole dataset into a number of folds (or partitions). Each
partition was selected in turn for testing, with the remaining k – 1 partitions used for
training; the next partition was then used for testing with the remaining k – 1
partitions (including the first partition used for testing) used for training, until all k
partitions had been selected for testing. The error rates recorded from each run were
averaged to obtain the mean error rate. The procedure used in this study was the
stratified 10-fold cross-validation method, which involves splitting the whole dataset
into ten partitions.
Also, a single 10-fold cross-validation is not enough to obtain a reliable error
estimate, since stratification reduces the variation but does not eliminate it entirely;
different 10-fold cross-validation experiments are therefore required. Thus, the
10-fold cross-validation process was repeated ten times – that is, ten times 10-fold
cross-validation, giving rise to 100 cross-validation experiments – which is a reliable
way of generating a good measure of performance, though a computation-intensive
undertaking.
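The ten-times 10-fold procedure can be sketched in Python with scikit-learn's
RepeatedStratifiedKFold; this is illustrative only (the study used the Weka
Experimenter), and synthetic stand-in data is generated so the sketch runs on its own.

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: in the study, X holds the selected CML risk factors and
# y the survival class
X, y = make_classification(n_samples=146, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)   # 100 runs
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")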
3.6.3 Simulation environment
Weka is open source software released under the GNU General Public License. The
system was developed at the University of Waikato in New Zealand; Weka stands for
the Waikato Environment for Knowledge Analysis. The software is freely available
at http://www.cs.waikato.ac.nz/ml/weka. The system was written in the object-
oriented language Java. There are several different levels at which Weka can be
used. Weka provides implementations of state-of-the-art data mining and machine
learning algorithms, and contains modules for data preprocessing, classification,
clustering and association rule extraction for market basket analysis.
The main features of Weka include:
a. 49 data preprocessing tools;
b. 76 classification/regression algorithms;
c. 8 clustering algorithms;
d. 15 attribute/subset evaluators + 10 search algorithms for feature selection;
e. 3 algorithms for finding association rules; and
f. 3 graphical user interfaces, namely:
i. The Explorer for exploratory data analysis;
ii. The Experimenter for experimental environment; and
iii. The Knowledge Flow, a new process model inspired interface.
For the purpose of this study, the Explorer was used for performing the process
of feature selection using three different feature selection methods, each with its own
unique search strategy. Following the identification of the relevant indicators (input
variables), the training dataset containing those instances was tested using the
Experimenter interface of the Weka environment. Thus, the datasets were subjected to
10 runs of 10-fold cross-validation using the three selected supervised machine
learning algorithms, namely: decision trees using the C4.5 decision trees algorithm,
support vector machines using the sequential minimal optimization (SMO)
algorithm and the artificial neural network using the multi-layer perceptron algorithm.
Before subjecting the historical datasets containing the values of the variables
monitored during the follow-up of CML patients receiving Imatinib treatment,
alongside their survival class, to these algorithms, there was the need to store the
dataset in the default format for data representation needed for data mining tasks in
the Weka environment. The default file type is called the attribute relation file format
(.arff). The arff file type stores three categories of data: the first defining the title of
the relation, the second defining the relation's attributes alongside their respective
labels and the third defining the relation's data, followed by the values of the
attributes for each record. Also, data can be read from comma separated values (.csv)
format and from databases using Open Database Connectivity (ODBC).
3.7 Performance Evaluation Metrics
During the course of evaluating the predictive models, a number of metrics
were used to quantify each model's performance. In order to determine these metrics,
four parameters must be identified from the results of predictions made by the
classifier during model testing. These are: true positives (TP), true negatives (TN),
false positives (FP) and false negatives (FN). In this study, which involves a binary
classification, either of survived and not survived can be considered as positive while
the other is negative.
True positives are the correct predictions of positive cases, true negatives are
the correct predictions of negative cases, false positives are negative cases incorrectly
predicted as positive and false negatives are positive cases incorrectly predicted as
negative. The performance metrics are thus defined as follows:
a. Sensitivity/True positive rate/Recall: the proportion of actual positive
cases that were correctly predicted positive by the model, TP / (TP + FN).
b. Specificity/True negative rate: the proportion of actual negative cases that
were correctly predicted as negative by the model, TN / (TN + FP).
c. False positive rate/False alarm: the proportion of actual negative cases
that were predicted as positive by the model, FP / (FP + TN).
d. Precision: the proportion of the predicted positive (or negative) cases that
were actually positive (or negative). Equations (3.32) and (3.33) show the
precision for positive and negative cases:

\[ \text{Precision}^{+} = \frac{TP}{TP + FP} \qquad (3.32) \]

\[ \text{Precision}^{-} = \frac{TN}{TN + FN} \qquad (3.33) \]
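A small Python sketch of these metrics computed from the four counts (the example
counts are illustrative):

def metrics(tp, tn, fp, fn):
    return {
        "sensitivity (recall)":  tp / (tp + fn),
        "specificity":           tn / (tn + fp),
        "false positive rate":   fp / (fp + tn),
        "precision (positive)":  tp / (tp + fp),    # equation (3.32)
        "precision (negative)":  tn / (tn + fn),    # equation (3.33)
    }

# Example: a classifier with TP = 110, TN = 9, FP = 22, FN = 5
print(metrics(110, 9, 22, 5))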
CHAPTER FOUR
RESULTS AND DISCUSSIONS
4.1 Introduction
In this section of the study, the results of the methodological approach
described earlier are discussed. A thorough descriptive analysis of the dataset
collected was initially performed in order to understand the
distribution of the values of each attribute monitored in CML patients during the
follow-up of Imatinib treatment, using the minimum and maximum values and the
mean and standard deviation of the data distribution. Following this, the
number of missing values in the dataset for each monitored attribute is
discussed; these were all handled during model training by ignoring the missing
values when computing the relevant measures of model characteristics and
performance evaluation.
The mean value of each numeric attribute was used as a threshold to
convert all numeric values into their respective binary nominal values – for instance,
if the mean of all values for an attribute is k, then the binary nominal values for
that variable will be less than k (< k) and greater than or equal to k (≥ k). This was
used to convert all numeric attributes into nominal attributes, thus making them easier
to manipulate than their numeric counterparts (see the sketch after this paragraph).
Following this, three feature selection methods, alongside their respective search
methods, were used to identify the most relevant features among the ones monitored
during Imatinib follow-up for the survival of CML patients. The features selected by
each feature selection method were passed to three supervised machine learning
algorithms for the purpose of formulating the predictive models needed for classifying
the survival of CML patients.
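A pandas sketch of the mean-threshold conversion described above, assuming missing
values are kept as the '?' symbol used by Weka (the column values are illustrative):

import pandas as pd

def binarize_by_mean(values: pd.Series) -> pd.Series:
    k = values.mean()                      # threshold = attribute mean (NaNs ignored)
    out = values.map(lambda v: f"< {k:.1f}" if v < k else f">= {k:.1f}")
    return out.where(values.notna(), "?")  # keep missing values as '?'

# Example: values below/above the column mean become "< k" / ">= k"
print(binarize_by_mean(pd.Series([5.0, 20.0, None, 12.0])).tolist())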
The performance of the predictive models for CML survival developed using
each of the supervised machine learning algorithms (each formulated using the
variables proposed by each feature selection method) was evaluated in order to
determine the combination of feature selection and supervised machine learning
algorithm needed for developing an effective and efficient predictive model for CML
patients' survival. The variables identified by the feature selection method were thus
proposed as the most important and relevant indicators of the survival of CML
patients.
4.2 Results and Discussion of Data Summarization of Historical Dataset
For this study, a total of 272 CML patients' records were collected from the
study location – the Obafemi Awolowo University Teaching Hospital Complex
(OAUTHC), Ile-Ife. Using the value of the survival time for each CML patient, the
survival class of each patient was determined and classified as survived,
not survived or censored, using thresholds of 728 days for 2-year survival and
1820 days for 5-year survival (assuming there are 52 weeks in a year, each week
consisting of 7 days).
Table 4.1 gives a description of the number of each survival class found in the
2-year and 5-year survival datasets following the process of classification. The table
shows that in the 2-year survival data, 115 (42.28%) patients survived, 31 (11.40%)
did not survive and 126 (45.82%) were censored, while in the 5-year survival data, 25
(9.19%) patients survived, 49 (18.01%) did not survive and 198 (72.79%) were
censored. Table 4.2 shows the number of each survival class in the final dataset used
for this study after removing the censored patients' records from the original dataset,
which left 146 records for the 2-year survival and 74 for the 5-year survival. Figure
4.1 shows a graphical plot of the survival classes in both survival datasets while
Figure 4.2 gives a description of the stored file format (arff) for the data collected.
Table 4.1: Number of each survival class in the original dataset

Survival Class | 2-years      | 5-years
Survived       | 115 (42.28%) | 25 (9.19%)
Not Survived   | 31 (11.40%)  | 49 (18.01%)
Censored       | 126 (45.82%) | 198 (72.79%)
Total          | 272          | 272
Table 4.2: Number of each survival class in the final dataset

Survival Class | 2-years      | 5-years
Survived       | 115 (78.77%) | 25 (33.78%)
Not Survived   | 31 (21.23%)  | 49 (66.22%)
Total          | 146          | 74
Figure 4.1: Graphical plot of the distribution of classes among 2 and 5-year survival data
Figure 4.2: Screenshot of the data collected and stored in arff file format
Following the process of data identification and collection of the
historical data containing the values of the attributes monitored during the follow-up
of Imatinib treatment administered to CML patients, the missing values in the
data collected were first identified and coded in a manner that could be easily
manipulated by the simulation environment, Weka. According to the Weka
documentation, all missing values were to be replaced with the question mark symbol
(?), and the system was adjusted to ignore all missing values found in each attribute
during the process of data analysis. Table 4.3 shows a description of the missing
values found in the cells of each attribute identified for CML survival.
After identifying and replacing missing values of the attributes monitored
during the follow-up of Imatinib treatment administered to CML patients in the
Nigerian referral hospital, descriptive statistics methods were applied to analyze the
information stored in the dataset. Table 4.4 shows the summarization of the numeric
data types found among the variables collected from the CML patients, using the
metrics: minimum, maximum, mean and standard deviation.
Table 4.5 gives a description of the analysis of the nominal values of the
attributes in the final dataset, following the conversion of all numeric data types into
their respective nominal values using the threshold given by the
mean value identified in the descriptive analysis of the numeric data types. The
count of each nominal value for each attribute is presented alongside the number
of missing values, with the percentage of occurrence of each value for each
attribute selected for this study. The dataset's representation of all variables as
nominal values, alongside the target variable, was used in the feature
selection and predictive model development processes of this study.
Table 4.3: Missing data values in each identified attribute for CML survival

Attributes                 | Missing Data
Sex                        | 0
Vital Status               | 0
Time to start of Imatinib  | 0
Age                        | 0
Packed Cell Volume (PCV)   | 0
Platelets count            | 6
Percentage Blast           | 0
Spleen size                | 0
Liver size                 | 4
Eosinophils                | 9
Basophil                   | 13
White Blood Cell (WBC)     | 3
Disease Phase at Diagnosis | 4
Table 4.4: Data summarization of the numeric attributes in the final dataset

Attributes                        | Minimum | Maximum | Mean   | Standard Deviation
Time to Imatinib Treatment (days) | 2.00    | 2308.00 | 188.39 | 331.23
Age (years)                       | 20.00   | 75.00   | 40.20  | 12.98
Platelet Count (×10⁹/L)           | 10.00   | 1173.00 | 306.60 | 219.16
Packed Cell Volume (PCV, %)       | 13.00   | 49.00   | 31.58  | 7.14
White Blood Cell (WBC, ×10⁹/L)    | 2.10    | 710.00  | 123.05 | 120.27
Basophil (%)                      | 1.00    | 35.00   | 1.73   | 3.90
Eosinophil (%)                    | 0.00    | 21.00   | 2.43   | 3.39
Percentage Blast (%)              | 1.00    | 20.00   | 2.50   | 3.81
Spleen Size (cm)                  | 0.00    | 38.00   | 12.18  | 8.90
Liver Size (cm)                   | 0.00    | 22.00   | 3.06   | 4.37
Survival Time (days)              | 34.00   | 2548.00 | 790.53 | 602.76
Table 4.5: Descriptive statistics of data collected after data processing

Attributes                        | Labels
Sex                               | Female = 54 (36.99%); Male = 92 (63.01%)
Status                            | Alive = 96 (65.75%); Dead = 50 (34.25%)
Time to Imatinib Treatment (days) | Below 188 = 48 (32.88%); Above 188 = 98 (67.12%)
Age (years)                       | Below 40 = 81 (55.48%); Above 40 = 65 (44.52%)
Platelet Count (×10⁹/L)           | Above 306.6 = 46 (31.51%); Below 306.6 = 94 (64.38%); Missing = 6 (4.11%)
Packed Cell Volume (PCV, %)       | Below 31 = 79 (54.11%); Above 31 = 67 (45.89%)
White Blood Cell (WBC, ×10⁹/L)    | Below 123 = 86 (60.96%); Above 123 = 57 (39.04%)
Basophil (%)                      | Above 1.7 = 29 (19.86%); Below 1.7 = 104 (71.24%); Missing = 13 (8.90%)
Eosinophil (%)                    | Below 2.4 = 89 (60.96%); Above 2.4 = 48 (32.88%); Missing = 9 (6.16%)
Percentage Blast (%)              | Above 2.5 = 54 (36.99%); Below 2.5 = 92 (63.01%)
Spleen Size (cm)                  | Above 12.2 = 84 (57.53%); Below 12.2 = 62 (42.47%)
Liver Size (cm)                   | Below 3.1 = 94 (64.38%); Above 3.1 = 48 (32.88%); Missing = 4 (2.74%)
4.3 Results and Discussion of Feature Selection Process
Following the process of data description and transformation into the accepted
file format (arff), the next important step was the identification of the most relevant
variables – those among the identified factors that would improve the prediction of
CML survival the most. As stated earlier, filter-based feature selection methods were
used as supported in the simulation environment, Weka. For each feature selection
method chosen, a particular search strategy was chosen for selecting attributes
from the general set of attributes collected. Table 4.6 shows a description of the
feature selection methods used and the respective search algorithm employed in
each case, with the relevant attributes selected for CML prediction for the 2-year and
5-year datasets.
The feature selection methods are implemented as the following algorithms in
Weka:
a. Correlation-based feature selection algorithm – implemented using the class
weka.attributeSelection.CfsSubsetEval, which selects the subset of features
highly correlated with the target class but with low correlation among
themselves; the search algorithm used was the genetic search algorithm;
b. Information-based feature selection algorithm – implemented using the class
weka.attributeSelection.InfoGainAttributeEval, which selects features by
evaluating the worth of an attribute by measuring the information gain with
respect to the class (information gain of an attribute with respect to the
survival class); the search algorithm used was the ranker algorithm, which
ranked the attributes according to their individual evaluations; and
Table 4.6: Relevant attributes identified using three (3) feature selection methods

Feature Selection Method: Consistency-Based (Search Method: Greedy Step-wise)
  2-years: Basophils, PCV, Time to Start, Age, Percentage Blast, Disease Phase,
           Platelet Count, Eosinophil, Sex, Liver Size
  5-years: Basophils, WBC, Percentage Blast, Eosinophils, PCV, Spleen Size, Age, Sex

Feature Selection Method: Information-Based (Search Method: Ranker Search)
  2-years: Basophils, PCV, Percentage Blast, Disease Phase, Liver Size
  5-years: Basophils, PCV, Disease Phase, Liver Size, Spleen Size, Sex

Feature Selection Method: Correlation-Based (Search Method: Genetic Search)
  2-years: Basophils, PCV, Percentage Blast, Platelet Count
  5-years: Basophils, PCV, Disease Phase, Liver Size, Spleen Size, Sex
c. Consistency-based feature selection algorithm – implemented using the class
weka.attributeSelection.ConsistencySubsetEval, which selects features by
evaluating the worth of a subset of attributes by the level of consistency in the
class values when the training instances are projected onto the subset of
attributes; the search algorithm used was the greedy step-wise algorithm, which
performed a greedy forward and backward search through the space of
attribute subsets.
Out of the total thirteen (13) attributes identified from the factors monitored
during the follow-up of Imatinib treatment, a smaller number of relevant attributes
was selected by each filter-based feature selection algorithm for the 2-year and 5-year
datasets collected for this study. For the two (2) year survival dataset, 10 attributes
were selected by the consistency-based FS, 5 attributes by the information-based FS
and 4 attributes by the correlation-based FS, while for the five (5) year survival
dataset, 8 attributes were selected by the consistency-based FS and 6 attributes each by
the information-based and correlation-based FS methods.
It was discovered that the information-based FS method selected
exactly the same type and number of attributes from the original dataset as
the correlation-based FS method in the 5-year survival dataset. Of all the FS methods
considered in the study, only the consistency-based FS method was able to identify
age as a relevant feature, while sex was found to be one of the relevant features in the
2-year survival dataset only by the consistency-based FS method but was found
relevant by all FS methods applied to the 5-year survival dataset. PCV was selected
among the relevant variables by all FS methods for both the 2-year and 5-year
datasets. Spleen size was identified as one of the relevant attributes for CML survival
by all FS methods in the 5-year survival dataset but not in the 2-year survival dataset.
4.4 Results and Discussion of Model Formulation and Simulation Process
Following the process of feature selection used in identifying the most
relevant variables among the 13 identified variables monitored in CML patients
receiving Imatinib treatment, the next phase was model formulation using the
aforementioned supervised machine learning algorithms available in the Weka
software. The 10-fold cross-validation technique was used in evaluating the
performance of the developed predictive models for CML survival, using test
samples randomly selected from the historical dataset used for training the models.
For each supervised machine learning algorithm used in formulating the predictive
model for CML survival classification, three predictive models were developed, one
using the variables identified by each feature selection method applied to the original
dataset. This process was performed for both the 2-year and 5-year survival data
required for model development.
4.4.1 Results of model formulation and simulation using all 13 variables
CML patients' records containing all 13 variables (attributes) identified as
being monitored during Imatinib treatment were used as the training data for
developing the first set of predictive models for CML survival classification using the
three machine learning algorithms. The dataset used consisted of the set of CML
patients contained in the 2-year and 5-year survival data; for each patient's record in
the dataset, a target class defining the classification of survival was also provided,
labeled as survived or not survived.
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients' records, with 115
survived and 31 not survived cases, the results of the predictive models developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified, giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.4 – left). For the
SMO algorithm used in formulating the SVM classifier, none out of the 31 actual not
survived cases and 113 out of the 115 actual survived cases were correctly classified,
giving a total of 113 correct classifications out of 146 cases with an accuracy of
77.4% (Figure 4.4 – centre). For the MLP classifier, 13 out of the 31 actual not
survived cases and 94 out of the 115 actual survived cases were correctly classified,
giving a total of 107 correct classifications out of 146 cases with an accuracy of
73.3% (Figure 4.4 – right). The C4.5 decision trees algorithm formulated the most
accurate predictive model for CML 2-year survival classification using all 13
variables monitored during Imatinib treatment.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records, with 25
survived and 49 not survived cases, the results of the predictive models developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 43 out of the 49 actual not survived cases and 6 out of the 25
survived cases were correctly classified, giving a total of 49 correct classifications out
of 74 cases with an accuracy of 66.2% (Figure 4.5 – left). For the SMO algorithm
used in formulating the SVM classifier, 40 out of the 49 actual not survived cases and
3 out of the 25 actual survived cases were correctly classified, giving a total of 43
correct classifications out of 74 cases with an accuracy of 58.1% (Figure 4.5 – centre).
For the MLP classifier, 31 out of the 49 actual not survived cases and 5 out of the 25
actual survived cases were correctly classified, giving a total of 36 correct
classifications out of 74 cases with an accuracy of 48.6% (Figure 4.5 – right).
Figure 4.4: Results of model formulation using all 13 variables in 2-year survival
Figure 4.5: Results of model formulation using all 13 variables in 5-year survival
The C4.5 decision trees algorithm formulated the most accurate predictive
model for CML 5-year survival classification using all 13 variables monitored during
Imatinib treatment.
4.4.2 Results of model formulation and simulation using variables selected by
consistency criteria
The second set of predictive models was formulated using the CML patients'
records containing the variables identified using the consistency-based FS method.
Using the consistency-based FS method, 10 relevant variables were identified in the
2-year survival dataset while 8 relevant variables were identified in the 5-year survival
dataset. Thus, each supervised machine learning algorithm was used to formulate the
predictive model for CML survival classification using the relevant variables
identified using the consistency criterion for FS.
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients‘ record with 115
survived and 31 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.6 – left). For the
SMO algorithm used in formulating the SVM classifier, none out of the 31 actual not
survived cases and 114 out of the 115 actual survived cases were correctly classified
giving a total of 114 correct classifications out of 146 cases with an accuracy of
78.1% (Figure 4.6 – centre). For the MLP classifier, 12 out of the 31 actual not
survived cases and 97 out of the 115 actual survived cases were correctly classified
giving a total of 109 correct classifications out of 146 cases with an accuracy of 74.7% (Figure
4.6 – right). The C4.5 decision trees algorithm formulated the most accurate
predictive model for the 2-year survival classification of CML patients using the 10
relevant variables identified in the 2-year survival data. It was also observed that
there was no significant difference between the predictive models formulated by the
C4.5 decision trees algorithm using the original dataset containing 13 variables and
using the 10 variables identified by the consistency criteria for FS, but a significant
improvement was observed in the predictive models formulated by both SVM and
MLP with the consistency criteria for FS compared with using all 13 attributes.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records with 25
survived and 49 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 47 out of the 49 actual not survived cases and 5 out of the 25
survived cases were correctly classified giving a total of 52 correct classifications out
of 74 cases with an accuracy of 70.3% (Figure 4.7 – left). For the SMO algorithm
used in formulating the SVM classifier, 45 out of the 49 actual not survived cases and
none out of the 25 actual survived cases were correctly classified giving a total of 45
correct classifications out of 74 cases with an accuracy of 60.8% (Figure 4.7 – centre).
For the MLP classifier, 33 out of 49 actual not survived cases and 6 out of the 25
actual survived cases were correctly classified giving a total of 39 correct
classifications out of 74 cases with an accuracy of 52.7% (Figure 4.7 – right). The
C4.5 decision trees algorithm formulated the most accurate predictive model for the
5-year survival classification of CML patients using the 8 relevant variables identified
by the consistency criteria from the data monitored during Imatinib treatment.
Figure 4.6: Results of model formulation using 10 variables in 2-year
survival selected using consistency-based FS method

Figure 4.7: Results of model formulation using 8 variables in 5-year
survival selected using consistency-based FS method
There was also a significant improvement in the predictive model formulated
using the 8 relevant features compared to using all 13 monitored attributes.
4.4.3 Results of model formulation and simulation using variables selected by
information criteria
The third set of predictive models was formulated using the CML patients'
records that contained the variables identified using the information-based FS method.
Using this method, 5 relevant variables were identified in the 2-year survival dataset
while 6 relevant variables were identified in the 5-year survival dataset. Thus, each
supervised machine learning algorithm was used to formulate the predictive model
for CML survival classification using the relevant variables identified by the
information criteria for FS.
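Assuming the information criterion corresponds to ranking attributes by information gain with respect to the class, as in standard filter-based FS, a minimal sketch follows; the column names are assumptions, and continuous variables would first need to be discretised:

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy (bits) of a class distribution."""
    p = labels.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, feature: str, label: str = "survival") -> float:
    """Reduction in class entropy obtained by splitting on the feature."""
    prior = entropy(df[label])
    conditional = sum(len(g) / len(df) * entropy(g[label])
                      for _, g in df.groupby(feature))
    return prior - conditional

# Rank the 13 monitored variables and keep the top-scoring ones, e.g.:
# ranked = sorted(variables, key=lambda v: information_gain(df, v), reverse=True)
```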
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients' records with 115
survived and 31 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.8 – left). For the
SMO algorithm used in formulating the SVM classifier, none out of the 31 actual not
survived cases and all the 115 actual survived cases were correctly classified giving a
total of 115 correct classifications out of 146 cases with an accuracy of 78.8% (Figure
4.8 – centre). For the MLP classifier, 18 out of the 31 actual not survived cases and 105
out of the 115 actual survived cases were correctly classified giving a total of 123
correct classifications out of 146 cases with an accuracy of 84.3% (Figure 4.8 – right).
Figure 4.8: Results of model formulation using 5 variables in 2-year
survival selected using information-based FS method
The MLP algorithm formulated the most accurate predictive model for the
2-year survival classification of CML patients using the 5 relevant variables
identified in the 2-year survival data. It was also observed that there was no
significant difference between the predictive models formulated by the C4.5 decision
trees algorithm using the original dataset containing 13 variables and using the 5
variables identified by the information criteria for FS, but a significant improvement
was observed in the predictive model formulated by SVM with the information
criteria for FS compared with using all 13 attributes.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records with 25
survived and 49 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 41 out of the 49 actual not survived cases and 3 out of the 25
survived cases were correctly classified giving a total of 44 correct classifications out
of 74 cases with an accuracy of 59.5% (Figure 4.9 – left). For the SMO algorithm
used in formulating the SVM classifier, 48 out of the 49 actual not survived cases and
none out of the 25 actual survived cases were correctly classified giving a total of 48
correct classifications out of 74 cases with an accuracy of 64.9% (Figure 4.9 – centre).
For the MLP classifier, 34 out of 49 actual not survived cases and 12 out of the 25
actual survived cases were correctly classified giving a total of 46 correct
classifications out of 74 cases with an accuracy of 62.2% (Figure 4.9 – right). The
SVM algorithm formulated the most accurate predictive model for the 5-year survival
classification of CML patients using the 6 relevant variables identified by the
information criteria, an improvement over the earlier SVM models formulated using
the other variable sets.
Figure 4.9: Results of model formulation using 6 variables in 5-year
survival selected using information-based FS method
There was also a significant improvement in the predictive model formulated
using the MLP classifier but a reduced performance using the C4.5 decision trees
classifier.
4.4.4 Results of model formulation and simulation using variables selected by
correlation criteria
The final set of predictive models was formulated using the CML patients'
records that contained the variables identified using the correlation-based FS method.
Using this method, 4 relevant variables were identified in the 2-year survival dataset
while 6 relevant variables were identified in the 5-year survival dataset – the same as
those identified using the information-based criteria. Thus, each supervised machine
learning algorithm was used to formulate the predictive model for CML survival
classification using the relevant variables identified by the correlation criteria for FS.
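The correlation criterion can be illustrated with the merit function of Hall's (1999) correlation-based feature selection, which rewards subsets whose features correlate with the class but not with each other. The sketch below is an assumption-laden simplification: Pearson's r stands in for Hall's symmetrical-uncertainty measure, and the class is assumed to be coded 0/1:

```python
import numpy as np
import pandas as pd

def cfs_merit(df: pd.DataFrame, subset: list, label: str = "survival") -> float:
    """CFS merit (Hall, 1999): k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is
    the mean feature-class correlation and r_ff the mean feature-feature
    inter-correlation over the subset."""
    k = len(subset)
    r_cf = np.mean([abs(df[f].corr(df[label])) for f in subset])
    if k == 1:
        return float(r_cf)
    r_ff = np.mean([abs(df[a].corr(df[b]))
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))

# A greedy forward search keeps adding whichever variable most raises the merit.
```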
i. The 2-year survival dataset
From the 2-year survival data containing 146 CML patients' records with 115
survived and 31 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 9 out of the 31 actual not survived cases and 110 out of the
115 survived cases were correctly classified giving a total of 119 correct
classifications out of 146 cases with an accuracy of 81.5% (Figure 4.10 – left). For
the SMO algorithm used in formulating the SVM classifier, none out of the 31 actual
not survived cases and all the 115 actual survived cases were correctly classified
giving a total of 115 correct classifications out of 146 cases with an accuracy of
78.8% (Figure 4.10 – centre). For the MLP classifier, 13 out of the 31 actual not survived
cases and 107 out of the 115 actual survived cases were correctly classified giving a
total of 120 correct classifications out of 146 cases with an accuracy of 82.2% (Figure
4.10 – right). The MLP algorithm formulated the most accurate predictive model for
the 2-year survival classification of CML patients using the 4 relevant variables
identified in the 2-year survival data. It was also observed that there was no
significant difference between the predictive models formulated by the C4.5 decision
trees and SVM classifiers using any of the variable subsets selected by FS.
ii. The 5-year survival dataset
From the 5-year survival data containing 74 CML patients' records with 25
survived and 49 not survived cases, the results of the predictive model developed
using the three supervised machine learning algorithms are as follows. For the C4.5
decision trees algorithm, 41 out of the 49 actual not survived cases and 3 out of the 25
survived cases were correctly classified giving a total of 44 correct classifications out
of 74 cases with an accuracy of 59.5% (Figure 4.11 – left). For the SMO algorithm
used in formulating the SVM classifier, 48 out of the 49 actual not survived cases and
none out of the 25 actual survived cases were correctly classified giving a total of 48
correct classifications out of 74 cases with an accuracy of 64.9% (Figure 4.11 –
centre). For the MLP classifier, 34 out of 49 actual not survived cases and 12 out of
the 25 actual survived cases were correctly classified giving a total of 46 correct
classifications out of 74 cases with an accuracy of 62.2% (Figure 4.11 – right). There
was no difference in the results of the predictive models formulated using the three
supervised machine learning algorithms with the variables selected by either the
information-based or the correlation-based criteria for feature selection, since both
methods selected the same variables.
Figure 4.10: Results of model formulation using 4 variables in 2-year
survival selected using correlation-based FS method

Figure 4.11: Results of model formulation using 6 variables in 5-year
survival selected using correlation-based FS method
The results of the study showed the effect of the three feature selection
methods combined with the three supervised machine learning algorithms on the
formulation of the predictive models needed for the 2-year and 5-year survival
classification of CML patients receiving Imatinib treatment. The information-based
feature selection method combined with the multi-layer perceptron gave the best
performance for the 2-year survival classification model, while the consistency-based
feature selection method combined with the C4.5 decision trees algorithm gave the
best performance for the 5-year survival classification model.
4.5 Discussion of Results of the Formulation and Simulation of Prediction
Model for CML Survival Classification
For each prediction model developed using the combination of feature
selection and supervised machine learning algorithms, the confusion matrices were
constructed from the values of the correct (true positive and true negative) and
incorrect (false positive and false negative) classifications made by each prediction
model developed for CML survival. The not survived cases were taken as the
positive class for each prediction model, while the survived cases were taken as the
negative class.
The true positive and true negative values were used to evaluate the accuracy
of each prediction model, showing how much of the total number of cases was
correctly classified by the classifier – the efficiency of the model. Additional metrics
were estimated, including the true positive rate, which measures the ability of the
model to correctly classify the not survived cases; the true negative rate, which
measures the ability of the model to correctly classify the survived cases; the false
positive rate, which measures the incorrectly classified negative cases; and the area
under the receiver operating characteristics (ROC) curve (AUC), which measures the
effectiveness of the model.
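For concreteness, the following minimal sketch shows how these metrics follow from the four confusion-matrix counts; the Precision column in Tables 4.7 and 4.8 is consistent with the class-weighted average of the per-class precisions computed below, with the not survived cases as the positive class:

```python
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Metrics derived from a survival classifier's confusion matrix."""
    pos, neg, n = tp + fn, tn + fp, tp + tn + fp + fn
    prec_pos = tp / (tp + fp) if tp + fp else 0.0   # precision on not survived
    prec_neg = tn / (tn + fn) if tn + fn else 0.0   # precision on survived
    return {
        "accuracy": (tp + tn) / n,
        "tp_rate (sensitivity)": tp / pos,
        "tn_rate (specificity)": tn / neg,
        "fp_rate (false alarm)": fp / neg,
        "precision (class-weighted)": (pos * prec_pos + neg * prec_neg) / n,
    }

# C4.5 on the 2-year data with all 13 variables: 9 of 31 not survived and
# 110 of 115 survived correct, i.e. TP=9, FN=22, TN=110, FP=5; this yields
# accuracy 0.815, TP rate 0.290, TN rate 0.957, FP rate 0.043, precision 0.793.
print(evaluate(tp=9, tn=110, fp=5, fn=22))
```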
4.5.1 2-year survival classification model
Based on the results obtained for the TP, TN, FP and FN values from the
prediction models developed from each combination of feature selection method and
machine learning algorithm, the aforementioned metrics were estimated and are
shown in Table 4.7. From the table, the respective performance of each combination
of feature selection and machine learning algorithms is shown.
The C4.5 decision trees algorithm showed a consistent performance with no
significant change irrespective of the type of feature selection method used in
extracting the relevant variables; in all cases 81.5% of the total dataset was correctly
classified, with 29% of the not survived cases and 95.7% of the survived cases
correctly classified, corresponding to TP and TN rates of 0.290 and 0.957
respectively. The C4.5 decision trees algorithm outperformed the SVM and MLP
algorithms when using all 13 variables and when using the 10 relevant variables
selected by the consistency-based feature selection algorithm. Figure 4.12 shows the
decision tree constructed for 2-year CML survival classification using all 13 variables
(left) and using the selected variables (right). The trees show that the most relevant
attributes influencing 2-year survival are the Basophils and the Packed Cell Volume
(PCV), while other relevant variables are the percentage blast and the increase in the
spleen size of the CML patient.
Table 4.7: Results of the evaluation for the predictive model for 2-year survival classification

Feature Selection Method           Classifier                Correct Classification (Accuracy)  TP Rate  TN Rate  FP Rate  Precision  ROC Area
None (all 13 attributes)           Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.647
                                   Support Vector Machines   113 (77.40%)                        0.000    0.983    0.017    0.618      0.491
                                   Multi-layer Perceptron    104 (71.23%)                        0.419    0.817    0.183    0.683      0.576
Consistency-based (10 attributes)  Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.647
                                   Support Vector Machines   114 (78.08%)                        0.000    0.991    0.009    0.727      0.548
                                   Multi-layer Perceptron    106 (72.60%)                        0.387    0.843    0.157    0.732      0.538
Information-based (5 attributes)   Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.676
                                   Support Vector Machines   115 (78.77%)                        0.000    1.000    0.000    0.620      0.500
                                   Multi-layer Perceptron    123 (84.25%)                        0.581    0.913    0.095    0.831      0.734
Correlation-based (4 attributes)   Decision Trees            119 (81.51%)                        0.290    0.957    0.043    0.793      0.679
                                   Support Vector Machines   115 (78.77%)                        0.000    1.000    0.000    0.620      0.500
                                   Multi-layer Perceptron    120 (82.19%)                        0.419    0.930    0.075    0.808      0.694

Note: TP Rate = sensitivity/recall; TN Rate = specificity; FP Rate = false alarm; ROC Area = area under the Receiver Operating Characteristics (ROC) curve. Correct classifications are out of 146 cases.
Figure 4.12: Decision trees for all attributes (left) and selected attributes (right) in 2-year survival
From the decision trees shown in Figure 4.12, the following rules were
extracted; the first set of five rules from the tree on the left and the second set of
three rules from the tree on the right (a sketch rendering the feature-selected rules as
code follows the list).
I. Without feature selection
a. If (Basophils=below 1.7) Then (2-year survival=Survived);
b. If (Basophils=above 1.7) and (PCV=below 31) Then (2-year
survival=Survived);
c. If (Basophils=above 1.7) and (PCV=above 31) and (Percentage blast=above
2.5) Then (2-year survival=Not survived);
d. If (Basophils=above 1.7) and (PCV=above 31) and (Percentage blast=below
2.5) and (Spleen=below 12.2) Then (2-year survival=Not survived); and
e. If (Basophils=above 1.7) and (PCV=above 31) and (Percentage blast=below
2.5) and (Spleen=above 12.2) Then (2-year survival=Survived).
II. With feature selection
a. If (Basophils=below 1.7) Then (2-year survival=Survived);
b. If (Basophils=above 1.7) and (PCV=below 31) Then (2-year
survival=Survived); and
c. If (Basophils=above 1.7) and (PCV=above 31) Then (2-year survival=Not
survived).
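Read as code, the feature-selected rule set reduces to two threshold tests. A minimal sketch, with thresholds taken from rules II.a to II.c above (the function name is illustrative):

```python
def predict_2yr_survival(basophils: float, pcv: float) -> str:
    """2-year survival rules extracted from the feature-selected C4.5 tree
    (Figure 4.12, right)."""
    if basophils <= 1.7:          # rule II.a
        return "Survived"
    if pcv <= 31:                 # rule II.b
        return "Survived"
    return "Not survived"         # rule II.c
```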
The SMO classifier used in implementing the SVM algorithm for developing
the predictive model for CML survival classification showed very poor results for all
the different sets of attributes used in the 2-year survival classification prediction
model. The SVM was unable to correctly classify any of the not survived cases
(positive cases) for all the attribute sets used, but classified 98.3% of the survived
cases using all 13 attributes, and 99.1% and 100% using the attributes selected by
feature selection. Although the percentage of correct classification (accuracy) lies
within the interval 77.4% – 78.8%, the majority of the correct classifications were
from the survived cases (negative class). The SVM was unable to formulate an
effective predictive model using the attributes available in the 2-year survival dataset
for CML survival.
The back-propagation algorithm used by the multi-layer perceptron to develop
the 2-year CML survival classification model also showed interesting results. For the
dataset containing the 13 initial attributes, the MLP showed an accuracy of 71.2%,
which increased to 72.6% using the attributes selected by consistency-based FS, to
82.19% using the attributes selected by correlation-based FS, and to 84.3% using the
attributes selected by the information-based FS method. The MLP was able to
correctly predict about 58% of the not survived cases (positive class) and about 91%
of the survived cases (negative class), corresponding to values of 0.581, 0.913 and
0.095 for the TP, TN and FP rates respectively. The best performance was achieved
by the predictive model for CML survival classification developed using the 5
attributes selected by the information-based FS method, namely:
a. Basophils;
b. Packed cell volume (PCV);
c. Percentage blast;
d. Disease phase at diagnosis; and
e. Liver size.
4.5.2 5-year survival classification model
Based on the results obtained for the TP, TN, FP and FN values from the
prediction models developed from each combination of feature selection method and
machine learning algorithm, the aforementioned metrics were estimated and are
shown in Table 4.8. From the table, the respective performance of each combination
of feature selection and
machine learning algorithms is shown for the 5-year CML survival classification
model.
The C4.5 decision trees algorithm showed a consistent improvement in
performance on the 5-year CML survival dataset used in formulating the predictive
model. The C4.5 DT algorithm was able to correctly classify 66.2% of the total cases
using all 13 features, 70.3% using the 8 variables identified by the consistency
criteria and 67.57% using the 6 variables identified by the information and
correlation criteria respectively. The C4.5 DT algorithm had the highest performance
on the dataset containing the attributes selected by the consistency-based FS method,
from which 4 relevant features were used by the C4.5 DT algorithm to construct the
decision tree shown in Figure 4.13, consisting of the following attributes:
a. Packed Cell Volume (PCV);
b. White Blood Cell (WBC) count;
c. Spleen size; and
d. Basophils.
The most important attribute identified for 5-year survival was the PCV,
followed by the WBC, Spleen and Basophils, unlike the attributes identified to be
relevant for 2-year CML survival. From the decision tree shown in Figure 4.13, the
following five rules were extracted (a sketch rendering these rules as code follows
the list):
a. If (PCV=above 31) Then (5-year survival=Not survived);
b. If (PCV=below 31) and (WBC=above 123) Then (5-year survival=Survived);
c. If (PCV=below 31) and (WBC=below 123) and (Spleen=above 12.2) Then (5-
year CML survival=Not survived);
d. If (PCV=below 31) and (WBC=below 123) and (Spleen=below 12.2) and
(Basophils=below 1.7) Then (5-year CML survival=Not survived); and
e. If (PCV=below 31) and (WBC=below 123) and (Spleen=below 12.2) and
(Basophils=above 1.7) Then (5-year CML survival=Survived).
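As with the 2-year rules, these five rules reduce to a chain of threshold tests. A minimal sketch, with thresholds as stated in rules a to e (the function name is illustrative):

```python
def predict_5yr_survival(pcv: float, wbc: float, spleen: float, basophils: float) -> str:
    """5-year survival rules extracted from the consistency-selected C4.5
    tree (Figure 4.13)."""
    if pcv > 31:                  # rule a
        return "Not survived"
    if wbc > 123:                 # rule b
        return "Survived"
    if spleen > 12.2:             # rule c
        return "Not survived"
    return "Survived" if basophils > 1.7 else "Not survived"  # rules d and e
```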
Table 4.8: Results of the evaluation for the predictive model for 5-year survival classification

Feature Selection Method           Classifier                Correct Classification (Accuracy)  TP Rate  TN Rate  FP Rate  Precision  ROC Area
None (all 13 attributes)           Decision Trees            49 (66.22%)                         0.878    0.240    0.760    0.628      0.521
                                   Support Vector Machines   43 (58.11%)                         0.633    0.200    0.800    0.512      0.468
                                   Multi-layer Perceptron    36 (48.65%)                         0.816    0.136    0.864    0.476      0.473
Consistency-based (8 attributes)   Decision Trees            52 (70.27%)                         0.959    0.200    0.800    0.706      0.569
                                   Support Vector Machines   45 (60.81%)                         0.673    0.240    0.760    0.426      0.459
                                   Multi-layer Perceptron    39 (52.70%)                         0.918    0.000    1.000    0.512      0.458
Information-based (6 attributes)   Decision Trees            50 (67.57%)                         0.837    0.120    0.880    0.664      0.641
                                   Support Vector Machines   48 (64.86%)                         0.694    0.480    0.520    0.435      0.490
                                   Multi-layer Perceptron    46 (62.16%)                         0.980    0.000    1.000    0.629      0.633
Correlation-based (6 attributes)   Decision Trees            50 (67.57%)                         0.837    0.120    0.880    0.664      0.641
                                   Support Vector Machines   48 (64.86%)                         0.694    0.480    0.520    0.435      0.490
                                   Multi-layer Perceptron    46 (62.16%)                         0.980    0.000    1.000    0.629      0.633

Note: TP Rate = sensitivity/recall; TN Rate = specificity; FP Rate = false alarm; ROC Area = area under the Receiver Operating Characteristics (ROC) curve. Correct classifications are out of 74 cases.
Figure 4.13: Decision trees for selected attributes in 5-year survival
The SMO algorithm used in developing the SVM classifier for the 5-year
CML survival classification model was observed to correctly classify 58.1% of the
total cases using all 13 attributes, but 60.8% and 64.9% using the attributes selected
by the consistency-based and information/correlation-based FS methods respectively.
The SVM classifier showed its best performance using the attributes selected by the
information- and correlation-based FS methods, which identified the same set of
attributes. The SVM classifier was able to correctly predict about 69.4% of the not
survived cases (positive class) and 48% of the survived cases (negative class). It was
also discovered that the performance of the SVM classifier was improved by the use
of the relevant attributes for CML survival compared to using all 13 attributes.
The MLP using the back-propagation algorithm to develop the 5-year CML
survival model was observed to correctly classify 48.6% of the total cases using the
13 attributes, but 52.7% and 62.16% using the attributes selected by the consistency-
based and information/correlation-based FS methods respectively. The MLP showed
its highest performance using the variables selected by the information/correlation-
based FS methods, with TP, TN and FP rates of 0.98, 0.00 and 1.00 respectively. It
was also observed that the MLP was unable to effectively predict the survived cases,
unlike the not survived cases.
Table 4.9: A comparison of the variables selected for CML survival prediction

Model                                  Variables identified
2-year CML survival – C4.5 DT          Basophils (with and without FS); Spleen (without FS); Blast (without FS); PCV (with and without FS)
2-year CML survival – MLP              Basophils; Liver size; Blast; PCV; Disease phase
5-year CML survival – C4.5 DT          Basophils; Spleen; WBC; PCV
Existing model – SOKAL                 Platelets; Spleen; Blast; Age
Existing model – HASFORD               Basophils; Spleen; Blast; Age; Eosinophils
Existing model – EUTOS                 Basophils; Spleen
Out of the prediction models developed for CML 5-year survival, the
classification model developed by the C4.5 decision trees algorithm using the
consistency-based FS method was found to be the most effective 5-year CML
survival classification model.
Table 4.9 shows a comparison of the variables identified by each model for
CML survival using supervised machine learning and by the existing models based
on regression analysis. The table shows that all the survival models except the Sokal
score identified Basophils as an important attribute, just as the 2-year and 5-year
survival classification models developed using the FS and ML algorithms did. All
the survival models also identified Spleen as an important variable, except for the
2-year survival classification model using MLP, while the 2-year DT model identified
Spleen only without feature selection. All the survival models identified Blast as an
important attribute except for the EUTOS model and the 5-year survival
classification model.
Other variables peculiar to the survival of Nigerian CML patients include the
PCV, identified by the C4.5 decision trees algorithm and the MLP used in developing
the 2-year survival classification model alongside the C4.5 DT algorithm used in
developing the 5-year survival classification model; the liver size and disease phase,
identified by the MLP used in developing the 2-year survival classification model;
and finally the WBC, identified by the C4.5 DT algorithm used in developing the
5-year survival classification model.
CHAPTER FIVE
SUMMARY, CONCLUSION AND RECOMMENDATIONS
5.1 Summary
This study focused on the development of an effective and efficient prediction
model using clinical information native to Nigerian CML patients in order to classify
the survival of CML patients receiving Imatinib treatment in Nigeria. Existing
survival models were developed using clinical information of non-Nigerians and
statistical regression modeling techniques, and have been largely ineffective in
estimating the survival of Nigerian CML patients receiving Imatinib treatment.
Feature selection algorithms were used to identify the variables which have a
strong correlation/relevance to CML survival from the dataset containing all possible
variables monitored from CML patients receiving Imatinib treatment at a referral
hospital. The variables identified using three feature selection methods were used to
formulate the prediction models for 2-year and 5-year CML survival using three
supervised machine learning algorithms.
The results of the study revealed the variables that are relevant to both 2-year
and 5-year survival of CML patients alongside the development of the prediction
model for CML 2-year and 5-year survival classification using the variables
identified.
5.2 Conclusion
Following the use of feature selection methods in identifying the variables
relevant to CML survival, basophils and spleen size were identified as the most
relevant to the 2-year survival and to the 5-year survival using decision trees, as
proposed by the EUTOS and Hasford scores, and the percentage of blasts was also
relevant to the 2-year survival, as identified by the Sokal and Hasford models. Other
variables identified were the PCV in the 2-year and 5-year survival, liver size and
disease phase in the 2-year survival, and WBC in the 5-year survival. Unlike the
other existing models, the variables age, eosinophil count and platelet count were
insignificant to the survival of Nigerian CML patients.
The prediction models developed using the dataset showed good results,
although they were more likely to classify one class than the other due to the unequal
proportions of survived and not survived cases in the original dataset. Thus, a dataset
with fewer censored values should yield more reliable predictive models for CML
survival classification.
The variables identified by the prediction models, and the rules extracted from
the decision trees constructed with those variables, can help provide insight into the
relationships that exist between the variables and the 2-year and 5-year survival of
CML patients.
5.3 Recommendations
Following the development of the prediction model for CML survival
classification, a better understanding of the relationships between the attributes
relevant to CML survival was provided. The model can also be integrated into
existing Health Information Systems (HIS), which capture and manage clinical
information that can be fed to the CML survival classification prediction model, thus
improving clinical decisions affecting CML survival and enabling real-time
assessment of clinical information affecting CML survival from remote locations.
It is advised that a continual assessment of the variables monitored during
CML treatment be made in order to increase the amount of information relevant to
creating an improved prediction model for CML survival classification using the
proposed feature selection and machine learning methods.
REFERENCES
Agbelusi, O. (2014). Development of a predictive model for survival of HIV/AIDS
patients in South-western Nigeria, Unpublished MPhil Thesis, Obafemi
Awolowo University, Ile-Ife, Nigeria.
Ahmad, L.G., Eshlaghy, A.T., Poorebrahimi, A., Ebrahimi, M. and Razavi, A.R.
(2013). J Health Med Inform 4(2): 1 – 3.
Alimena, G., Morra, E., Lazzarino, M., Liberati, A.M., Montefusco, E. and Inveradi,
D. (1988). Interferon alpha-2b as therapy for Ph'-positive chronic
myelogenous Leukaemia: A study of 82 patients treated with intermittent or
daily administration. Blood 72: 642 – 647.
Allan, N.C., Richards, S.M., Shepherd, P.C. (1995). UK Medical Research Council
randomised, multicentre trial of interferon-alpha n1 for chronic myeloid
leukaemia: improved survival irrespective of cytogenetic response. The UK
Medical Research Council's Working Parties for Therapeutic Trials in Adult
Leukaemia. Lancet 345:1392 – 1397.
Altman, D.G., Vergouwe, Y and Royston, P. (2009). Prognosis and Prognostic
research: validating a prognostic model. BMJ 338: 605.
American Cancer Society (2015). Cancer Facts & Figures 2015. Atlanta, Ga:
American Cancer Society.
Ashraf, M., Chetty, G. and Tran, D. (2013). Feature selection techniques on thyroid,
hepatitis and breast cancer datasets. International Journal on data mining and
intelligent information technology 3(1): 1 -8.
Aurich J., Duchayne E., Huguet-Rigal F., Bauduer F., Navarro M., Perel Y., Pris J.,
Caballin M. R., and Dastugue N. (1998). Clinical, morphological, cytogenetic
and molecular aspects of a series of Ph-negative chronic myeloid Leukaemias.
Hematol Cell Ther 40(4): 149 - 158.
Bach, P.B., Kattan, M.W. and Thornquist, M.D. (2003). Variations in lung cancer
risk among smokers. J Natl. Cancer Inst. 95: 470 -478.
Batista, G. and Monard, M.C. (2003). An analysis of four missing data treatment
methods for supervised learning. Appl Artif Intell 17: 519 – 533.
Becker, S and Plumbley, M. (1996) Unsupervised neural network learning procedures
for feature extraction and classification. International Journal of Applied
Intelligence 6: 185-203.
Bell, D. and Wang, H. (2010). A formalism for relevance and its application in
feature subset selection. Machine Learning 41(2): 175 – 195.
Bluhm, M.V. (2011). Factors Influencing Oncologist's Use of Chemotherapy in
Patients at the end of Life: a qualitative study. Published PhD thesis of the
University of Michigan, USA. Retrieved from
https://deepblue.lib.umich.edu/bitstream/handle/2027.42/84635/mbluhm_1.pdf
on 12 January, 2016.
Blum, A.L. and Langley, P. (1997). Selection of relevant features and examples in
machine learning. Artificial Intelligence on relevance 97: 245 – 271.
Bocchi, L., Coppini, G., Nori, J. and Valli, G. (2004). Detection of single and
clustered micro-calcifications in mammograms using fractals models and
neural networks. Med Eng Phys 26: 303 – 312.
Boma, P.O., Durosinmi, M.A., Adediran, I.A., Akinola, N.O. and Salawu, L. (2006).
Clinical and Prognostic features of Nigerians with Chronic Myeloid
Leukaemia. Niger Postgrad Med J. 6:47 – 52.
Bonifazi, F., de Vivo, A., Rosti, G., Guilot, J. and Trabacchi, E. (2001). Chronic
myeloid Leukaemia and interferon-α: A study of complete cytogenetic
responders. Blood 98(10): 3074 – 3081.
Branford S, Fletcher L, Cross N.C., Muller M.C., Hochhaus A., Kim D.W., Radich,
J.P., Saglio, G., Pne, F., Kamel-Reid, S., Wang, Y.L., Press, R.D., Lynch, K.,
Rudzki, Z., Goldman, J.M. and Hughes, T. (2009) Desirable performance
characteristics for BCR-ABL measurement on an international reporting scale
to allow consistent interpretation of individual patient response and
comparison of response rates between clinical trials. Blood 112: 3330-3338.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C. J. (1984). Classification and
regression trees. Wadsworth & Brooks/Cole Advanced Books & Software,
Monterey, CA. ISBN 978-0-412-04841-8.
Burke, H.B., Bostwick, D.G. and Meiers, I. (2005). Prostate cancer outcome:
epidemiology and biostatistics. Anal Quant Cytol Histol 27: 211 – 217.
Caruana, R.A. and Freitag, D. (1994). Greedy Attribute selection. In Proceedings of
the 11th International Conference on Machine Learning, New Brunswick, NJ,
Morgan Kaufmann Publishers: 28 – 36.
Cicchetti, D.W. (1992). Neural networks and diagnosis in the clinical laboratory:
state of the art. Clin. Chem. 38: 9 – 10.
Cochran, A.I. (1997). Prediction of outcome for patients with cutaneous melanoma.
Pigment Cell Res 10: 162 – 167.
Cook, N.R. (2007). Use and misuse of the receiver operating characteristics curve in
risk prediction. Circulation 115: 928 – 935.
Cortes J.E., Talpaz M., Beran M., O'Brien S.M., Rios M.B., Stass S. and Kantarjian
H. M. (1995a). Philadelphia chromosome-negative chronic myelogenous
Leukaemia with rearrangement of the breakpoint cluster region. Long-term
follow-up results. Cancer 75(2): 464 - 470.
Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning
20(3): 273 – 278.
Cortes, J., Bruemmendorf, T. and Kantarjian, H. (2007). Efficacy and safety of
bosutinib (SKI-606) among patients with chronic phase Ph+chronic
myelogenous Leukaemia (CML). Blood 110:733 – 741.
Cortes, J.E., Kantarjian, H. and Shah, N.P. (2012). Ponatinib in refractory
Philadelphia chromosome-positive Leukaemia. N Engl J Med. 367(22): 2075-
88.
Cortes, J.E., Kantarjian, H.M. and Brümmendorf, T.H. (2011). Safety and efficacy of
Bosutinib (SKI-606) in chronic phase Philadelphia chromosome-positive
chronic myeloid Leukaemia patients with resistance or intolerance to Imatinib.
Blood 118(17): 4567-76.
Cox, D.R. (1972). Regression models and Life Tables. Journal of Stat. Soc. Serv. 34:
187 - 220.
Cruz, J.A. and Wishart, D.S. (2006). Applications of Machine Learning in Cancer
Prediction and Prognosis. Cancer Informatics 2: 59 – 75.
Czado, C., Gneiting, T. and Held, L. (2009). Predictive model assessment for cohort data.
Biometrics 65: 1254 – 1261.
Dash, M. and Liu, H.(1997). Feature selection for classification. Computational
Methods: 211 - 218.
DeAngelo, D.J. and Ritz, J. (2004). Imatinib Therapy for Patients with Chronic
Myelogenous Leukaemia: Are Patients Living Longer? Clinical Cancer
Research 10 (1-3): 1 – 3.
Deninger, M.W.N and Druker, B.J. (2003). Specific Targeted Therapy of Chronic
Myelogenous Leukaemia with Imatinib. Pharmacol Rev. 55: 401 – 423.
Dietterich, T.G. (1998). Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Comput. 10(7): 1895 – 1924.
Dimitologlou, G., Adams, J.A. and Jim, C.M. (2012). Comparison of the C4.5 and a
naïve bayes classifier for the prediction of lung cancer survivability. Journal
of Computing 4(8): 1-9.
Dohner, H., Weisdorf, D.J. and Bloomfield, C.D. (2015) Acute Myeloid Leukemia.
The New England Journal of Medicine 373(12): 1136–52.
Domchek, S.M., Eisen, A. and Calzone, K. (2003). Application of breast cancer risk
prediction models in clinical practice. J Clin Oncol 21: 593 – 601.
Duda, R.O., Hart, P.E. and Stork, D.G. (2001). Pattern classification (2nd edition):
Wiley, New York.
Durosinmi, M.A., Faluyi, J.O., Oyekunle, A.A., Salawu, L., Adediran, I.A. and
Akinola, N.O. (2008). The Use of Imatinib mesylate (Glivec) in Nigerian
patients with chronic myeloid Leukaemia. Cellular Therapy and
Transplantation 1(2):58 – 62.
Fernandez-Ranada, J.M., Lavilla, E., Odriozola, J., Garcia-Larana, J., Lozano, M. and
Parody, R. (1993). Interferon alfa 2A in the treatment of chronic myelogenous
Leukaemia in chronic phase. Results of the Spanish Group. Leuk Lymphoma
11(Supplementary 1):175 – 179.
Fielding, L.P., Fenoglio-Preiser, C.M. and Friedman, L.S. (1992). The future of
prognostic factors in outcome prediction for patients with cancer. Cancer 70:
2367 – 2377.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford University.
Gambacorti-Passerimi, C, Antohni, L., Mahon, F. and Guilhot, F. (2011). Multicentre
Independent Assessment of Outcomes in CML Patients treated with Imatinib.
Journal of National Cancer Institute 103: 553 – 561.
Gascon, F., Valle, M. and Martos, R. (2004). Childhood obesity and hormonal
abnormalities associated with cancer risk. Eur J. Cancer Previ. 13: 193 – 197.
Gauda, R. and Chahar, V. (2013). A comparative study on feature selection using
data mining tools. International Journal of advanced research in computer
science and software engineering 3(9): 26 – 33.
Gennari, J.H., Langley, P. and Fisher, D. (1989). Models of incremental concept
formation. Artificial Intelligence 40: 11 – 61.
Gerds, T.A., Cai, T. and Schumacher, M. (2008). The performance of risk models.
Biom J 50: 457 – 479.
Gratwohl, A., Brand, R., Apperley, J., Ruutu, T. and Corradini, P. (2006) Allogenic
hematopoietic stem cell transplantation for chronic myeloid Leukaemia in
Europe 2006: Transplant activity, long term data and current results. An
analysis by the Chronic Leukaemia Working Party of the Europe Group for
Blood and Marrow Transplantation (EBMT). Haematologica 91(4): 513 –
521.
Gratwohl, A., Hemans, J., Goldman, J.M., Arcese, W., Carrens, E. and Devergie, A.
(1998). Risk assessment for patients with chronic myeloid Leukaemia before
allogenic blood or marrow transplantation. Chronic Leukaemia Working
Party of the European Group for Blood and Marrow Transplantation. Lancet
352(9134): 1087 – 1092.
Greenland, S. (1989). Modeling and variable selection in epidemiologic analysis. Am
J Public Health 79: 340 – 349.
Groves, F.D., Linet, M.S. and Devesa, S.S. (1995). Patterns of occurrence of the
Leukaemias. Eur J Cancer 31A: 941 – 994.
Guilhot, F. (1993). Interferon alfa and low-dose cytosine-arabinoside for the
treatment of patients with chronic myelogenous Leukaemia in chronic phase.
French CML Study Group. Semin Hematol 30(Supplementary 3):24–5.
Guilhot, F., Guerci, A., Fiere, D., Harousseau, J.L., Maloisel, F., Bouabdallah, R. et
al. (1996). The treatment of chronic myelogenous Leukaemia by interferon
and cytosine-arabinoside: rationale and design of the French trials. French
CML Study Group. Bone Marrow Transplant 17: S29–S32.
Hagerty, R.G., Butow, P.N. and Ellis, P.M. (2005). Communicating prognosis in
cancer care: a systematic review of the literature. Ann Oncol 16: 1005 – 1053.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H.
(2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations
11(1): 1 - 23.
Hall, M.A. (1999). Correlation-based Feature Selection for Machine learning. PhD
Thesis of the University of Waikato, Hamilton, New Zealand.
Hasford, J., Ansari, H., Pfirrmann, M. and Hehlmann, R. (1996). Analysis and
Validation of Prognostic factors for CML. German CML Study Group. Bone
Marrow Transplant 17(Supplementary 3): S49 – S54.
Hasford, J., Baccarani, M., and Hoffmann, V. (2011). Predicting Complete
Cytogenetic Response and subsequent Progression Free Survival in 2060
patients with CML on Imatinib treatment: the EUTOS Score. Blood
118(3):2177-2187.
Hasford, J., Pfirrmann, M., Hehlmann, R., Allan, N.C., Baccarani, M. and Kluin-
Nelemans, J.C. (1998). A new Prognostic score for survival of patients with
chronic myeloid Leukaemia treated with interferon alfa. Writing Committee
for the Collaborative CML Prognostic Factors Project Group. J Natl Cancer
Inst. 90(11): 850 – 858.
Hehlmann, R., Heimpel, H., Hasford, J., Kolb, H.J., Pralle, H. and Hossfeld, D.K.
Randomized comparison of interferon-alpha with busulfan and hydroxyurea in
chronic myelogenous Leukaemia. German CML Study Group. Blood
84:4064–77.
Hehlmann, R., Heimpel, H., Hossfeld, D.K., Hasford, J., Kolb, H.J. and Loffler, H.
(1996). Randomized study of the combination of hydroxyurea and interferon
alpha versus hydroxyurea mono-therapy during the chronic phase of chronic
myelogenous Leukaemia. (CML Study II) The German Study Group. Bone
Marrow Transplant 17 (Supplementary 3): S21–S24.
Hemingway, H., Riley, R.D. and Altman, D.G. (2009). Ten steps towards improving
prognostic research. BMJ 339: 4184 – 4193.
Hira, Z.M., Gillies, D.F. and Curry, E. (2014). Improving Classification accuracy of
response in Leukaemia treatment using feature selection over pathway
segmentation. Imperial College, London. Retrieved from
http://www.doc.ic.ac.uk/research/technicalreports/2014/DTR14-8.pdf on 12
January, 2016.
Holland, J.H. (1992) Adaptation in Natural and Artificial Systems. The MIT Press
Cambridge, MA, USA.
Hosmer, D.W., Hosmer, T. and Le Cessie, S. (1997). A comparison of goodness-of-
fit tests for the logistic regression model. Stat Med 16: 965 – 980.
Hosseini, N. and Ahmadi, R. (2014) Individual Characteristics of Patients with
Leukemia or Lymphoma in Hamedan – Northwest Iran. International Journal
of Advances in Clinical Engineering and Biological Sciences (IJACEBS) 1(1):
74 – 75.
Hua-Liang, W. and Billings, S.A. (2007) Feature Subset Selection and Ranking for
data dimensionality reduction. IEEE Transactions On Pattern Analysis And
Machine Intelligence 29(1): 1 – 12.
Hughes, T.P., Kaeda, J., Branford, S., Rudzki, Z.,Hochhaus, A., Hensley, M.L.
(2003) Frequency of major molecular responses to Imatinib or interferon alfa
plus cytarabine in newly diagnosed chronic myeloid leukemia. N Engl J Med
349: 1423 – 1432.
Ibrahim, J.G., Chu, M. and Chen, M.H. (2012). Missing data in clinical studies:
issues and methods. J Clin. Oncol. 30: 3297 – 3303.
Idowu, P.A., Aladekomo, T.A., Williams, K.O. and Balogun, J.A. (2015). Predictive
model for likelihood of Sickle cell aneamia (SCA) among pediatric patients
using fuzzy logic. Transactions in networks and communications 31(1): 31 –
44.
Jabbour, E., Cortes, J., Nazha, A., O'Brien, S., Quintas-Cardama, A., Pierce, S.,
Garcia-Manero, G. and Kantarjian, H. (2012). EUTOS score is not predictive
for survival and outcome of patients with early chronic myeloid Leukaemia
treated with tyrosine kinase inhibitors: a single institution experience. Blood
119(19):4524-4527.
Jabbour, E., Cortes, J.E., Giles, F.J., O'Brien, S. and Kantarjian, H.M. (2007). Current
and emerging treatment options in chronic myeloid Leukaemia. Cancer
109(11):2171-2181
Jain, A.K., Marty, M.N. and Flynn, P. (1999). Data Clustering: a review. ACM
Comput. Surveys 31(3): 264 – 323.
Kaambwa, B., Bryan, S., Bilingham, L. (2012). Do the methods used to analyze
missing data really matter? An examination of data from an observational
study of intermediate care patients. BMC Res Notes 5: 330.
Kantarjian, H. and Cortes, J (2008). Chronic myeloid Leukaemia. In: Abeloff, M.D.,
Armitage, J.O., Lichter, A.S., Niederhuber, J.E., Kastan, M.B., McKenna,
W.G. Clinical Oncology. 4th ed. Philadelphia, Pa. Elsevier: 2279-2289.
Kantarjian, H., Giles, F. and Wunderle, L. (2006). Nilotinib in imatinib-resistant CML
and Philadelphia chromosome-positive ALL. N Engl J Med. 354(24):2542 -
2551.
Kantarjian, H., Sawyers, C., Hochhaus, A., Guilhot, F., Schiffer, C and Gambacorti-
Passerini, C. (2002). Hematologic and cytogenetic responses to Imatinib
mesylate in chronic myelogenous Leukaemia. N Engl J Med 346(9): 645 –
652.
Kaplan, E.L. and Meier, P. (1958). Non-Parametric estimation from incomplete
observation. Journal of American Statistical Association 53: 457 – 481.
Kloke, O., Niederle, N., Qiu, J.Y., Wandl, U., Moritz, T. and Nagel-Hiemke, M.
(1993) Impact of interferon alpha-induced cytogenetic improvement on
survival in chronic myelogenous leukaemia. Br J Haematol 83:399–403.
Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision
graphs. Unpublished PhD thesis, Stanford University.
Kohavi, R. and John, G.H. (1996). Wrappers for feature selection. Artificial
Intelligence Special review on relevance 97: 273 – 324.
Kumar, V. and Minz, S. (2014). Feature selection: a literature review. Smart
Computing Review 4(3): 211 – 229.
Landstrom A.P. and Tefferi A. (2006). Fluorescent in situ hybridization in the
diagnosis, prognosis, and treatment monitoring of chronic myeloid
Leukaemia. Leuk Lymphoma 47(3): 397 - 402.
Langley, P. and Sage, S. (1994). Oblivious decision trees and abstract cases. In
working notes of the AAA194 Workshop on Statistical Techniques in Pattern
recognition, Prague, Czech Republic: 91 – 96.
Lee, J.W. and Chung, N.G. (2011). The treatment of pediatric chronic myelogenous
Leukaemia in the Imatinib era. Korean J Pediatr. 54(3): 111 – 116.
Liaw, A. and Wiener, M. (2012). Classification and Regression Trees by random
forest. R News 2: 18 – 22.
Lichtman, M.A., Beutler, E., Kipps, T.J., (2006). Williams Hematology seventh
edition. New York, NY: McGraw-Hil: 1238 - 1245.
Liu, H. and Motoda, H. (1998). Feature selection for knowledge discovery and data
mining. Kluwer Academic Publishers, Boston.
Locatelli, F. and Niemeyer, C.M. (2015). How to treat Juvenile Myelomonocytic
Leukemia. Blood 125(7): 1083 – 1090.
Luo, S.T. and Cheng, B.W. (2010). Diagnosing breast masses in digital
mammography using feature selection and ensemble methods. J Med Syst: 1 –
9.
Machin, P.S., Dempsey, J. and Brooks, J. (1991). Using neural networks to diagnose
cancer. J Med Syst 15: 11 – 19.
Maji, P. and Garai,P. (2013). Fuzzy Rough Simultaneous Attribute Selection and
Feature Extraction Algorithm. IEEE Transactions on Cybernetics 43(4): 1 –
12.
Marin, D., Ibrahim, A.R. and Goldman, J.M. (2011). European Treatment and
Outcome Study (EUTOS) score for chronic myeloid Leukaemia still requires
more confirmation. Journal of Clinical Oncology 29(29):3944-3945.
Mark H. F., Sokolic R. A., and Mark Y. (2006). Conventional cytogenetics and FISH
in the detection of BCR/ABL fusion in chronic myeloid Leukaemia (CML).
Exp Mol Pathol 81(1): 1 - 7.
Markovitch, S. and Rosenstein, D. (2002). Feature generation using general
construction functions. Machine Learning 49: 59 – 98.
McCulloch, W. and Walter, P. (1943). A Logical Calculus of Ideas Immanent in
Nervous Activity. Bulletin of Mathematical Biophysics 5(4): 115 – 133.
Millot, F, Traore, P, Guilhot, J, Nelken, B, Leblanc, T, Leverger, G, et al. (2005)
Clinical and Biological Features at Diagnosis in 40 Children with Chronic
Myeloid Leukaemia. Pediatrics 116:140-143.
Mitchell, T. (1997). Machine Learning, McGraw Hill, New York.
Moons, K.G., Royston, P. and Vergouwe, Y. (2009). Prognosis and prognostic
research: what, why and how? BMJ 338: 375 – 381.
Nadeau, C. and Bengio, Y. (2003). Inference for the generalization error. Machine
Learning 52: 239 – 281.
National Cancer Institute (2011). SEER Cancer Statistics Review 1975 – 2008.
Available from http://seer.cancer.gov/csr/1975_2008/ and Accessed on 23
February, 2016.
National Comprehensive Cancer Network (NCCN) (2014) Chronic Myeloid
Leukaemia. NCCN Guidelines for Patients, version 1, 2014.
Novakovic, J (2009). Using information gain attribute evaluation to classify sonar
targets. 17th Telecommunications forum (TELFOR 2009), Serbia, Belgrade,
November 24 – 29, 2009: 1351 – 1354.
Novakovic, J., Strbac, P and Bulatovic, D. (2011). Towards optimal feature selection
using ranking methods and classification algorithms. Yugoslav Journal of
Operations Research 21(1): 119 – 135.
Ohm, L. (2013). Chronic Myeloid Leukaemia: Clinical Experimental and Health
economics study with special reference to Imatinib treatment. Published
Thesis of Karolinska University Hospital, Solna and Karolinska Institute,
Stockholm, Sweden. ISBN 978-91-7549-006-9.
Ohnishi, K., Ohno, R., Tomonaga, M, Kamada, N., Onozawa, K. and Kuramoto, A.
(1995). A randomized trial comparing interferon-alfa with busulphan for
newly diagnosed chronic myelogenous Leukaemia in chronic phase. Blood
86: 906 – 916.
Okanny, C.C. and Akinyanju, O.O. (1989). Chronic Leukaemia: an African
experience. Med Oncol Tumor Pharmacotherapy 6: 189 – 194.
Oyekunle, A. (2013). Survivorship in Nigerian patients with CML: A study of 527
patients over 10 years. Paper presentation at AORTIC Conference 2013,
Durban, South-Africa.
Oyekunle, A., Bolarinwa, R., Mamman, A.I. and Durosinmi, M. (2012c). The
Treatment of Childhood and adolescent chronic Myeloid Leukaemia in
Nigeria. Journal of pediatric sciences 4(4): 1 -5. Retrieved from Research
Gate at http://www.researchgate.net/publication23704415 on 28 June, 2015.
Oyekunle, A., Klyuchnikov, E., Ocheni, S., Kroger, N., Zander, A.R. and Baccarani,
M. (2011). Challenges for Allogenic Hematopoietic Stem Cell
Transplantation in Chronic Myeloid Leukaemia in the Era of Tyrosine Kinase
Inhibitors. Acta Haematologica 126(1): 30 – 39.
Oyekunle, A.A., Adelasoye, S.B., Bolarinwa, R.A., Ayansawo, T.A., Aladekomo,
T.A., Manam, A.I. and Durosinmi, M.A. (2012a). The treatment of childhood
and adolescent chronic myeloid Leukaemia in Nigeria. Journal of Pediatric
Sciences 4(4): pages 1 – 5.
Oyekunle, A.A., Osho, P.O., Aneke, J.C., Salawu, L. and Durosinmi, M.A. (2012b).
The Predictive Value of the Sokal and Hasford scoring systems in Chronic
Myeloid Leukaemia in the Imatinib Era. Journal of Hematological
Malignancies 2(2): pages 25 – 32.
Patil, M.D. and Sane, S. (2014) Effective Classification after Dimension Reduction: A
Comparative Study. International Journal of Scientific and Research
Publications 4(7): 1 – 4.
Pencina, M.J., D'Agostino, R.B. and Demier, O.O. (2012). Novel metrics for
evaluating improvement in discrimination: net classification and integrated
discrimination improvement for normal variables and nested models. Stat
Med 31: 101 – 113.
Pertricoin, E.F. (2004). SELDI-TOF-Based serum proteomic pattern diagnosis for
early detection of cancer. Curr Opin. Biotechnol. 15: 24 – 30.
Platt, J. (1998). Fast Training of Support Vector Machines using Sequential Minimal
Optimization. Advances in Kernel Methods – Support Vector Learning, 1998.
Provan, D. and Gribben, J.G. (2010). Chronic myelogenous leukemia. Molecular
Hematology (3rd ed.). Singapore: Wiley-Blackwell: 76.
Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning 1: 81-106.
Raanani, P, Trakhtenbrot, L, Rechavi, G, Rosenthal, E, Avigdor, A, Brok-Simoni, F,
et al. (2005). Philadelphia-Chromosome-Positive T-Lymphoblastic
Leukaemia: Acute Leukaemia or Chronic Myelogenous Leukaemia Blastic
Crisis. Acta Haematol.113:181 - 189.
Robin, M., Esperous, H., Peffault, R., Petropoulou, A.D., Xhaard, A. and Ribauad,
P.l. (2010) Splenectomy after allogeneic hematopoietic stem cell
transplantation in patients with primary myelofibrosis. Brit J Hematol. 150:
721–724.
Rokach, L. and Maimon, O. (2005). Top-down induction of decision trees classifiers-
a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part
C 35 (4): 476–487. doi: 10.1109/TSMCC.2004.843247.
Rokach, L. and Maimon, O. (2008). Data mining with decision trees: theory and
applications. World Scientific Pub Co Inc. ISBN 978-9812771711.
Royston, P., Moons, K.G. and Altman, D.G. (2009). Prognosis and prognostic
research: developing a prognostic model. BMJ 338: 604 – 610.
Sebastiani, F. (2002) Machine Learning in Automated Text Categorization. ACM
Computing Surveys 34(1): 1–47.
Sharma, P., Wagner, K., Wolchok, J.D. and Allison, J.P. (2011). Novel cancer
immunotherapy agents with survival benefit: recent successes and next
steps. Cancer 11(11): 805 - 812.
Shen, L., Au, W.-Y., Guo, T., Wong, M.L., Tsuchiyama, J., Yuen, P.-W., Kwong, Y.-
L., Liang, R.H. and Srivastava, G. (2007). Proteasome inhibitor bortezomib-
induced apoptosis in natural killer (nk)-cell Leukaemia and lymphoma: an in
vitro and in vivo preclinical evaluation, Blood 110(1): 469 – 470.
Siegel, C.A., Siegel, L.S. and Hyams, J.S. (2011). Real-time tool to display the
predicted disease course and treatment response for children with Crohn's
disease. Inflamm Bowel Dis 17: 30 – 38.
Simes, R.J. (1985). Treatment selection for cancer patients: application of statistical
decision theory to the treatment of advanced ovarian cancer. J Chronic Dis.
38: 171 – 186.
Singal, A.G., Mukherjee, A. and Higgins, P.D. (2013). Machine Learning Algorithms
outperform conventional regression models in identifying risk factors for
hepatocellular carcinoma in patients with cirrhosis. Am J. Gastroenterol 108:
1124 – 1130.
Sokal, J.E., Baccarani, M., Fiacchini, M., Carrates, F., Rozman, C., Gomez, G.A. and
Galton, A.G. (1985). Prognostic discrimination among younger patients with
chronic granulocytic Leukaemia: relevance to bone marrow transplantation.
Blood 66(6): 1352 – 1357.
Sokal, J.E., Cox, E.B., Baccarani, M., Tuna, S., Gomez, G.A. and Robertson, J.E.
(1984). Prognostic discrimination in "good risk" chronic granulocytic
Leukaemia. Blood 63(4): 789 – 799.
Steyerberg, E.W., Harrell, F.E. and Borsbrom, G.J. (2001). Internal validation of
prediction models: efficiency of some procedures for logistic regression
analysis. J Clin Epidemiol. 54: 774 – 781.
Steyerberg, E.W., Vickers, A.J. and Cook, N.R. (2010). Assessing the performance
of prediction models: a framework for traditional and novel measures.
Epidemiology 21: 128 – 132.
Suttorp, M. and Millot, F. (2010). Treatment of Pediatric chronic myeloid Leukaemia
in the year 2010: Use of Tyrosine Kinase Inhibitors and Stem-Cell
Transplantation. Hematology: 368 – 376.
Swerdlow, S.H., Camp, E., Harris, N.C., Jaffe, E.S., Pilieri, S.A., Stein, H., Thiele, J.
and Varimani, J.N. (2008). WHO classification of tumours of haematopoietic
and lymphoid tissues, 4th edition, IARC Press: Lyon.
Talpaz, M., Shah, N.P. and Kantarjian, H. (2006). Dasatinib in imatinib-resistant
Philadelphia chromosome positive Leukaemias. N Engl J Med. 354(24): 2531 -
2541.
Tefferi, A., Hanson, C.A. and Inwards, D.J. (2005). How to interpret and pursue an
abnormal complete blood cell count in adults. Mayo Clin Proc. 80(7): 923 –
936.
Thaler, J. and Hilbe, W. (1996) Comparative analysis of two consecutive phase II
studies with IFN-alpha and IFN-alpha + ara-C in untreated chronic-phase
CML patients. Austrian CML Study Group. Bone Marrow Transplant
17(Supplementary 3):S25–S28.
Thaler, J., Gastl, G., Fluckinger, T., Niederweiser, D., Huber, H. and Seewan, H.
(1993). Treatment of chronic myelogenous Leukaemia with interferon alfa-
2c: response rate and toxicity in a phase II multicenter study. The Austrian
Biological Response Modifier (BRM) Study Group. Semin Hematol
30(Supplementary 3):17–19.
Thongkam, J. and Sukmak, V. (2013). Cervical cancer survivability prediction
models using machine learning techniques. Journal of Convergence
Information Technology (JCIT) 8(15): 13 – 22.
Tkachuk D.C., Westbrook C.A., Andreeff M., Donlon T.A., Cleary M.L.,
Suryanarayan K., Homge M., Redner A., Gray J., and Pinkel D. (1990).
Detection of bcr-abl fusion in chronic myelogenous Leukaemia by in situ
hybridization. Science 250(4980): 559 - 562.
Vanneschi, L., Farinaccio, A., Mauri, G., Antoniotti, M., Provero, P. and Giacobini,
M. (2011). A comparison of machine learning techniques for survival
prediction in breast cancer. Bio Data Mining 4(12): 1 -13.
Vardiman, J, Pierre, R, Thiele, J, Imbert, M, Brunning, RD, Flandrin, G. (2001).
Chronic myelogenous leukaemia. In World Health Organization
Classification of Tumors: Pathology and Genetics of Tumors of Hematopoietic
and Lymphoid Tissues. Jaffe, E, Harris, NL, Stein, H, Vardiman, JW (eds.)
Lyon, France: IARC Press, 2001:20-26.
Waijee, A., Mukherjee, A. and Singal, A. (2013b). Comparison of modern imputation
methods for missing laboratory data in medicine. BMJ Open 3(8): 1 – 7.
Waijee, A.K., Higgings, P.D.R. and Singal, A.G. (2013a). A Primer on Predictive
Models. Clinical and Translational Gastroenterology 4(44): 1 – 4.
Waijee, A.K., Joyce, J.C. and Wang, S.J. (2010). Algorithms outperform metabolite
tests in predicting response of patients with inflammatory bone disease to
thiopurines. Clin Gastroenterol Hepatol 8: 143 – 150.
Wang, J.X., Zhang, B. and Yu, Y.K. (2005). Application of serum protein finger-
printing coupled with artificial neural network model in diagnosis of
hepatocellular carcinoma. Clin Med J (Engl) 118: 1278 – 1284.
Wang, K., Bell, D. and Murtagh, F. (1998). Relevance Approach to feature subset
selection: Kluwer Academic Publishers, Boston: 85 – 97.
Weston, A.D. and Hood, L. (2004). Systems biology proteomics and the future of
healthcare towards predictive and personalized medicine. J Proteomic Res. 3:
179 – 196.
Yildirim, P. (2015). Filter-Based Feature Selection Methods for Prediction of Risks
in Hepatitis Disease. International Journal of Machine Learning and
Computing 5(4): 258 – 263.
Yu, L. and Liu, H. (2004). Efficient feature selection via analysis of relevance and
redundancy. JMLR 5: 1205 – 1224.
Yussuff, H., Mohammad, N., Ngah, Y.K. and Yahaya, A.S. (2012). Breast cancer
analysis using logistic regression. IJRRAS 10(1): 14 -22.
Zamir, O., Etzioni, O., Madani, O. and Karp, R.M. (1997). Fast and Intuitive
Clustering of Web Documents. KDD’97: 287–290.
Zhang, S., Zhang, C. and Yang, Q. (2002). Data preparation for data mining. Appl
Artif. Intell.17: 375 – 381.
Zhao, Y. and Karypis, G. (2002) In Proceedings of CIKM. Evaluation of Hierarchical
Clustering Algorithms for Document Datasets: 1 – 7.
Zhou, X., Liu, K.T. and Wong, S.T. (2004). Cancer classification and prediction
using logistic regression with Bayesian gene selection. J. Biomed. Inform 37:
249 – 259.