Breast Cancer Diagnostics with Bayesian Networks

Interpreting the Wisconsin Breast Cancer Database with BayesiaLab

Stefan Conrady, [email protected]
Dr. Lionel Jouffe, [email protected]

May 20, 2013


The Wisconsin Breast Cancer Database (WBCD) is a widely studied (and publicly available) data set from the field of breast cancer diagnostics. The creators of this database, Wolberg, Street, Heisey and Mangasarian, made an important contribution with their research towards automating diagnostics with image processing and machine learning.

Beyond the medical field, many statisticians and computer scientists have proposed a wide range of classification models based on WBCD. Such new methods have continuously raised the benchmark in terms of diagnostic performance.

Our white paper now reevaluates the Wisconsin Breast Cancer Database within the framework of Bayesian networks, which, to our knowledge, has not been done before. We demonstrate how the BayesiaLab software can extremely quickly and simply create a Bayesian network model whose performance is on par with virtually all existing models that have been developed from the WBCD over the last 15 years.

Table of Contents

Case Study & Tutorial

Introduction 4

Background 6

Wisconsin Breast Cancer Database 6

Notation 7

Model Development 8

Data Import 8

Unsupervised Learning 13

Model 1: Markov Blanket 16

Model 1: Performance 21

K-Folds Cross-Validation 23

Model 2: Augmented Markov Blanket 25

Model 2a: Performance 28

Structural Coefficient 32

Model 2b: Augmented Markov Blanket (SC=0.3) 38

Model 2b: Performance 39

Conclusion 40

Model Inference 41

Interactive Inference 42

Adaptive Questionnaire 43

Target Interpretation Tree 46

Summary 52

Appendix

Framework: The Bayesian Network Paradigm 53

Acyclic Graphs & Bayes’s Rule 53

Compact Representation of the Joint Probability Distribution 54

References 55

Contact Information 56

Bayesia USA 56

www.bayesia.us | www.bayesia.sg | www.bayesia.com

Bayesia Singapore Pte. Ltd. 56

Bayesia S.A.S. 56

Copyright 56


Case Study & Tutorial

Introduction

Data classification is one of the most common tasks in the field of statistical analysis, and countless methods have been developed for this purpose over time. A common approach is to develop a model based on known historical data, i.e. data where the class membership of each record is known, and to use this generalization to predict the class membership of a new set of observations.

Applications of data classification permeate virtually all fields of study, including the social sciences, engineering, biology, etc. In the medical field, classification problems often appear in the context of disease identification, i.e. making a diagnosis about a patient’s condition. The medical sciences have a long history of developing a large body of knowledge that links observable symptoms with known types of illnesses. It is the physician’s task to use the available medical knowledge to make inferences based on the patient’s symptoms, i.e. to classify the medical condition in order to enable appropriate treatment.

Over the last two decades, so-called medical expert systems have emerged, which are meant to support physicians in their diagnostic work. Given the sheer amount of medical knowledge in existence today, it should not be surprising that significant benefits are expected from such machine-based support in terms of medical reasoning and inference.

In this context, several papers by Wolberg, Street, Heisey and Mangasarian became much-cited examples. They proposed an automated method for the classification of Fine Needle Aspirates¹ through image processing and machine learning, with the objective of achieving greater accuracy in distinguishing between malignant and benign cells for the diagnosis of breast cancer. At the time of their study, the practice of visual inspection of FNA yielded inconsistent diagnostic accuracy. The proposed new approach would reliably increase this accuracy to over 95%. This research was quickly translated into clinical practice and has since been applied with continued success.

As part of their studies in the late 1980s and 1990s, the research team generated what became known as the Wisconsin Breast Cancer Database, which contains measurements of hundreds of FNA samples and the associated diagnoses. This database has been extensively studied, even outside the medical field. Statisticians and computer scientists have proposed a wide range of techniques for this classification problem and have continuously raised the benchmark for predictive performance.

Our objective with this paper is to present Bayesian networks as a highly practical framework for working with this kind of classification problem. We intend to demonstrate how the BayesiaLab software can extremely quickly, and relatively simply, create Bayesian network models that achieve the performance of the best custom-developed models, while requiring only a fraction of the development time.

¹ Fine needle aspiration (FNA) is a percutaneous (“through the skin”) procedure that uses a fine-gauge needle (22 or 25 gauge) and a syringe to sample fluid from a breast cyst or remove clusters of cells from a solid mass. With FNA, the cellular material taken from the breast is usually sent to the pathology laboratory for analysis.

Furthermore, we wish to illustrate how Bayesian networks can help researchers and practitioners generate a deeper understanding of the underlying problem domain. Beyond merely producing predictions, we can use Bayesian networks to precisely quantify the importance of individual variables and employ BayesiaLab to help identify the most efficient path towards a diagnosis.

BayesiaLab’s speed of model building, its excellent classification performance, plus the ease of interpretation provide researchers with a powerful new tool. Bayesian networks and BayesiaLab have thus become a driver in accelerating research.


Background

To provide context for this study, we quote Mangasarian, Street and Wolberg (1994), who conducted the original research on breast cancer diagnosis with digital image processing and machine learning:

Most breast cancers are detected by the patient as a lump in the breast. The majority of breast lumps are benign, so it is the physician’s responsibility to diagnose breast cancer, that is, to distinguish benign lumps from malignant ones. There are three available methods for diagnosing breast cancer: mammography, FNA with visual interpretation and surgical biopsy. The reported sensitivity, i.e. the ability to correctly diagnose cancer when the disease is present, of mammography varies from 68% to 79%, of FNA with visual interpretation from 65% to 98%, and of surgical biopsy close to 100%.

Therefore mammography lacks sensitivity, FNA sensitivity varies widely, and surgical biopsy, although accurate, is invasive, time consuming and costly. The goal of the diagnostic aspect of our research is to develop a relatively objective system that diagnoses FNAs with an accuracy that approaches the best achieved visually.

Wisconsin Breast Cancer Database

This breast cancer database was created through the clinical work of Dr. William H. Wolberg at the University of Wisconsin Hospitals in Madison. As of 1992, Dr. Wolberg had collected 699 instances of patient diagnoses in this database, consisting of two classes: 458 benign cases (65.5%) and 241 malignant cases (34.5%).

The following eleven attributes² are included in the database:

1. Sample code number

2. Clump Thickness (1 - 10)

3. Uniformity of Cell Size (1 - 10)

4. Uniformity of Cell Shape (1 - 10)

5. Marginal Adhesion (1 - 10)

6. Single Epithelial Cell Size (1 - 10)

7. Bare Nuclei (1 - 10)

8. Bland Chromatin (1 - 10)

9. Normal Nucleoli (1 - 10)

10. Mitoses (1 - 10)

11. Class (benign/malignant)


² “Attribute” and “variable” are used interchangeably throughout the paper.


Attributes #2 through #10 were computed from digital images of fine needle aspirates (FNA) of breast masses. These features describe the characteristics of the cell nuclei in the images. Attribute #11, Class, was established via subsequent biopsies or via long-term monitoring of the tumor.

We will not go into detail here regarding the definition of the attributes and their measurement. Rather, we refer the reader to the papers referenced in the bibliography.

The Wisconsin Breast Cancer Database is available to any interested researcher from the UC Irvine Machine Learning Repository.³ We use this database in its original format without any further transformation, so our results can be directly compared to the dozens of methods that have been developed since the original study.
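For readers who wish to inspect the raw data outside of BayesiaLab, the file is plain comma-separated text. The sketch below is our own illustration of reading it in Python; the field names are ours, while the class coding (2 = benign, 4 = malignant) and the “?” marker for missing values follow the UCI repository’s documentation.

```python
import csv

# Column layout of the WBCD file (per the UCI repository documentation):
# a sample code number, nine cytological attributes scored 1-10, and the
# class label (2 = benign, 4 = malignant). "?" marks a missing value.
FIELDS = ["sample_id", "clump_thickness", "cell_size_uniformity",
          "cell_shape_uniformity", "marginal_adhesion",
          "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
          "normal_nucleoli", "mitoses", "class"]

def parse_row(row):
    """Convert one CSV row (a list of strings) into a dict.
    Missing values become None; the class code becomes a readable label."""
    record = {}
    for name, value in zip(FIELDS, row):
        if name == "class":
            record[name] = "benign" if value == "2" else "malignant"
        else:
            record[name] = None if value == "?" else int(value)
    return record

def load_wbcd(path):
    """Read the whole database into a list of record dicts."""
    with open(path, newline="") as f:
        return [parse_row(row) for row in csv.reader(f)]
```

A record with a “?” in the Bare Nuclei column simply carries None there, matching the 16 missing values noted later during the data import.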

Notation

To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:

• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and printed in bold type. You can look up such terms in the BayesiaLab Library (library.bayesia.com) for more details.

• The names of variables, attributes, nodes, and node states are capitalized and italicized.


³ UC Irvine Machine Learning Repository website: http://archive.ics.uci.edu/ml/


Model Development

Data Import

Our modeling process begins with importing the database,⁴ which is formatted as a text file with comma-separated values. Therefore, we start with Data | Open Data Source | Text File.

The Data Import Wizard then guides us through the required steps. In the first dialogue box of the Data Import Wizard, we click on Define Typing and specify that we wish to set aside a Test Set from the database.


⁴ If we exclude the variable Sample code number, this database can also be used with the publicly available evaluation version of BayesiaLab, which is limited to a maximum of ten nodes. Deleting this variable does not affect the workflow or the results of the analysis.


Following common practice, we will randomly select 20% of the 699 records as the Test Set; consequently, the remaining 80% will serve as our Learning Set.⁵

In the next step, the Data Import Wizard suggests the data format for each variable. Attributes 2 through 10 are identified as continuous variables, and Class is read as a discrete variable. Only for the first variable, Sample code number, do we have to specify Row Identifier, so it is not mistaken for a continuous predictor variable.
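Outside of BayesiaLab, the same kind of reproducible 80/20 split can be sketched in a few lines of Python. This is an illustration only; the seed value is arbitrary and merely plays the role of BayesiaLab’s Fixed Seed option mentioned later.

```python
import random

def split_learning_test(records, test_fraction=0.2, seed=42):
    """Randomly set aside a test set, mirroring the 80/20 split in the text.

    A fixed seed makes the split reproducible across runs (illustrative
    sketch; the seed value 42 is our own arbitrary choice)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Return (learning set, test set).
    return shuffled[n_test:], shuffled[:n_test]
```

With 699 records, this yields 560 learning cases and 139 test cases, matching the counts used throughout the paper.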

In the next step, the Information Panel reports that we have a total of 16 missing values in the entire dataset. We can also see that the column Bare Nuclei is labeled with a small question mark, indicating the presence of missing values in this particular column.


⁵ “Learning/Test Set” and “Learning/Test Sample” are used interchangeably in this paper.


We now need to specify the type of Missing Values Imputation. Given the small size of the dataset and the small number of missing values, we choose the Structural EM method.⁶

A critical element of the data import process is the discretization of all continuous variables. On the next screen we click Select All Continuous to apply the same discretization algorithm across all continuous variables. Alternatively, we could choose the type of discretization individually by variable. However, we will not discuss this option any further in this paper.

As the objective of this exercise is classification, we choose the Decision Tree algorithm from the drop-down menu in the Multiple Discretization panel. This discretizes each variable for a maximum information gain with respect to the Target variable, Class.


⁶ For more details on missing values imputation with Bayesian networks, see Conrady and Jouffe (2012).


Bayesian networks are entirely non-parametric, probabilistic models, and their estimation requires a certain minimum number of observations. To help us select the number of discretization levels (or Intervals), we use the heuristic of five observations per parameter and probability cell. Given that we have a relatively small database with only 560 observations,⁷ three discretization intervals for each variable appear to be an appropriate choice. If we used a higher number of Intervals, we would need more observations for a reliable estimation of the parameters.
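The five-observations-per-cell heuristic is easy to make concrete. The following sketch (our own illustration, not a BayesiaLab function) counts the probability cells of a node’s conditional probability table and the implied minimum sample size:

```python
def cells_per_node(n_states, parent_state_counts):
    """Number of probability cells in a node's conditional probability
    table: its own state count times the product of its parents' state
    counts."""
    cells = n_states
    for k in parent_state_counts:
        cells *= k
    return cells

def min_observations(n_states, parent_state_counts, obs_per_cell=5):
    """Heuristic from the text: roughly five observations per cell."""
    return obs_per_cell * cells_per_node(n_states, parent_state_counts)
```

For example, a three-state node with two three-state parents has 3 × 3 × 3 = 27 cells, suggesting roughly 135 observations; the same configuration with five-state nodes would call for about 625 observations, more than the 560 learning records available.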

Upon clicking Finish, we immediately see a representation of the newly imported database in the form of a fully unconnected Bayesian network in the Graph Panel. Each variable is now represented as a blue node.


⁷ 560 cases are in the Learning Set (80%) and 139 are in the Test Set (20%).


The question mark symbol associated with the Bare Nuclei node indicates that there are missing values for this variable. Hovering over the question mark with the mouse pointer while pressing the “i” key will show the number of missing values.

Optionally, BayesiaLab can display an import report summarizing the obtained discretizations for all variables.


Unsupervised Learning

When exploring a new domain, we generally recommend performing Unsupervised Learning on the newly imported database. This is also the case here, even though our principal objective is predictive modeling, for which Supervised Learning will later be the main tool.

Learning | Unsupervised Structural Learning | EQ initiates the EQ algorithm, which is suitable for the initial review of the database. For larger databases with significantly more variables, the Maximum Weight Spanning Tree is a very fast algorithm and can be used instead.

Upon learning, the initial Bayesian network looks like this:

In its “raw” form, the crossing arcs make this network somewhat difficult to read. BayesiaLab has a number of layout algorithms that can quickly “disentangle” such a network and produce a much more user-friendly format.


We can select View | Automatic Layout or, alternatively, use the shortcut “P”.

Now we can visually review the learned network structure and compare it to our own domain knowledge. This allows for a “sanity check” of the database and the variables, and it may highlight any inconsistencies.

Beyond visually inspecting the network structure, BayesiaLab allows us to visualize the quantitative part of this network. To do this, we first need to switch into the Validation Mode by clicking on the highlighted button in the lower left-hand corner of the Graph Panel, or, alternatively, by using the “F5” key as a shortcut.

We can now display the Pearson Correlation between the nodes that are directly linked in the graph by selecting Analysis | Visual | Pearson’s Correlation from the menu.


Each arc’s thickness is now proportional to the Pearson Correlation between the connected nodes. Also, the blue and red colors indicate positive and negative correlations, respectively. Any unexpected sign of a correlation would thus become apparent very quickly. In our example, we only have positive correlations, and thus all arcs are blue.
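For reference, the Pearson Correlation displayed on each arc is the standard sample correlation coefficient, which can be sketched as:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences:
    the covariance of x and y divided by the product of their standard
    deviations. Returns a value in [-1, 1]."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of +1 corresponds to a perfect positive linear relationship (a blue arc at maximum thickness), and -1 to a perfect negative one (red).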

Additionally, callouts indicate that further information can be displayed. We can opt to display this numerical information via View | Display Arc Comments.


This function is also available via a button in the menu:

Model 1: Markov Blanket

Now that we have performed an initial review of the dataset with the Unsupervised Learning step, we can return to the Modeling Mode by clicking on the corresponding button in the lower left-hand corner of the screen or by using the shortcut “F4”.⁸

This allows us to proceed to the modeling stage. Given our objective of predicting the state of the variable Class, i.e. benign versus malignant, we define Class as the Target Variable by right-clicking on the node and selecting Set as Target Variable from the contextual menu. Alternatively, we can double-click on Class while holding the shortcut “T” pressed. We need to specify this explicitly so that the subsequent Supervised Learning algorithm can use Class as the dependent variable.

This setting is confirmed by the “bullseye” appearance of the new Target Node.


⁸ We will mostly omit further references to switching between Modeling Mode (F4) and Validation Mode (F5). The required modes can generally be inferred from the context.


Upon this selection, all Supervised Learning algorithms become available under Learning | Supervised Learning.

In many cases, the Markov Blanket algorithm is a good starting point for a predictive model. This algorithm is extremely fast and can even be applied to databases with thousands of variables and millions of records, even though database size is not a concern in this particular study.

Upon learning the Markov Blanket for Class, and once again applying the Automatic Layout, the resulting Bayesian network looks as follows:

Markov Blanket Definition

The Markov Blanket for a node A is the set of nodes composed of A’s parents, its children, and its children’s other parents (spouses).

The Markov Blanket of node A contains all the variables which, if we know their states, will shield node A from the rest of the network. This means that the Markov Blanket of a node is the only knowledge needed to predict the behavior of that node. Learning a Markov Blanket selects relevant predictor variables, which is particularly helpful when there is a large number of variables in the database. In fact, this can also serve as a highly efficient variable selection method in preparation for other types of modeling, e.g. neural networks.
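The definition above translates directly into code. The following sketch (a toy illustration, not BayesiaLab’s implementation) extracts the Markov Blanket from a DAG stored as a child-to-parents mapping:

```python
def markov_blanket(dag, node):
    """Return the Markov Blanket of `node` in a DAG given as a
    {child: [parents]} mapping: the node's parents, its children,
    and its children's other parents (spouses)."""
    parents = set(dag.get(node, []))
    children = {c for c, ps in dag.items() if node in ps}
    spouses = {p for c in children for p in dag[c] if p != node}
    return parents | children | spouses
```

For the toy DAG A→C, B→C, C→D, E→D, the blanket of C is {A, B, D, E}: its parents A and B, its child D, and D’s other parent (spouse) E.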


This network suggests that Class has a direct probabilistic relationship with all variables except Marginal Adhesion and Single Epithelial Cell Size, which are both disconnected. The lack of a connection with the Target indicates that these nodes are independent of Class given the nodes in the Markov Blanket.

Beyond distinguishing between predictors (connected nodes) and non-predictors (disconnected nodes), we can further examine the relationships with the Target Node Class by highlighting the Mutual Information of the arcs connecting the nodes. This function is accessible within the Validation Mode via Analysis | Visual | Arcs’ Mutual Information.

Note

We can see on the graph learned earlier with the EQ algorithm that Uniformity of Cell Shape is the node that makes these two nodes conditionally independent of Class.


We will also go ahead and immediately select View | Display Arc Comments.

The thickness of the arcs is now proportional to the Mutual Information, i.e. the strength of the relationship between the nodes. Intuitively, Mutual Information measures the information that X and Y share: it measures how much knowing one of these variables reduces our uncertainty about the other. For example, if X and Y are independent, then knowing X does not provide any information about Y, and vice versa, so their Mutual Information is zero. At the other extreme, if X and Y are identical, then all information conveyed by X is shared with Y: knowing X determines the value of Y, and vice versa.

Formal Definition of Mutual Information

$$I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$$
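This definition can be sketched in a few lines of Python. Note that Mutual Information itself is symmetric; the direction-dependent percentages shown on the arcs are relative (normalized) values, presumably obtained by dividing I(X;Y) by the entropy of the node whose uncertainty is being reduced.

```python
import math

def mutual_information(joint):
    """Mutual information I(X;Y), in bits, computed from a joint
    distribution given as a {(x, y): probability} dict.
    Terms with p(x, y) = 0 contribute nothing to the sum."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal p(x)
        py[y] = py.get(y, 0.0) + p   # marginal p(y)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```

For two independent binary variables the result is 0 bits; for two identical fair binary variables it is exactly 1 bit, matching the two extremes described above.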


In the top part of the comment box attached to each arc, the Mutual Information of the arc is shown. Expressed as a percentage and highlighted in blue, we see the relative Mutual Information in the direction of the arc (parent node ➔ child node). And, at the bottom, we have the relative Mutual Information in the opposite direction of the arc (child node ➔ parent node).

Model 1: Performance

As we are not equipped with specific domain knowledge about the variables, we will not further interpret these relationships but rather run an initial test of the Network Performance. We want to know how well this Markov Blanket model can predict the states of the Class variable, i.e. Benign versus Malignant. This test is available via Analysis | Network Performance | Target.

Using our previously defined Test Set to validate the model, we obtain the following, rather encouraging results:


Of the 88 Benign cases in the Test Set, 3 were incorrectly identified, which corresponds to a false positive rate of 3.41%. More importantly, all of the 51 Malignant cases were identified correctly (true positives), with no false negatives. The overall performance can be expressed as the Total Precision, which is computed as the total number of correct predictions (true positives + true negatives) divided by the total number of cases in the Test Set, i.e. (85 + 51) ÷ 139 = 97.84%.
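The arithmetic behind these figures is straightforward. A small sketch, treating Malignant as the positive class (the counts are passed in by the caller):

```python
def performance_summary(tp, fp, fn, tn):
    """Summary measures used in the text, derived from a 2x2 confusion
    matrix with Malignant taken as the positive class."""
    total = tp + fp + fn + tn
    return {
        "total_precision": (tp + tn) / total,       # share of correct calls
        "false_positive_rate": fp / (fp + tn),      # benign flagged malignant
        "false_negative_rate": fn / (fn + tp),      # malignant missed
    }
```

With the counts above (tp = 51, fp = 3, fn = 0, tn = 85), total_precision is 136 ÷ 139 ≈ 97.84% and false_positive_rate is 3 ÷ 88 ≈ 3.41%.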

As the selection of the Learning Set and the Test Set during the data import process is random, BayesiaLab may learn slightly different networks based on different Learning Sets after each data import. Hence, your own network performance evaluation could deviate from what is shown above, unless you chose the same Fixed Seed for the random number generator when you defined the Data Typing during the data import process.


K-Folds Cross-Validation

To mitigate the sampling artifacts that may occur in a one-off test, we can systematically learn networks on a sequence of different subsets and then aggregate the test results. Analogous to the original papers on this topic, we perform K-Folds Cross-Validation, which iteratively selects K different Learning Sets and Test Sets and then, based on those, learns the networks and tests their performance.

The Cross Validation can then be started via Tools | Cross Validation | Targeted Evaluation | K-Folds.

We use the same learning algorithm as before, i.e. the Markov Blanket, and we choose 10 as the number of sub-samples to be analyzed. Of the total dataset of 699 cases, each of the ten iterations creates a Test Set of approximately 69 randomly drawn samples and uses the remaining 630 as the Learning Set. This means that BayesiaLab learns one network per Learning Set and then tests its performance on the respective Test Set.
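The K-Folds procedure itself is simple to sketch outside of BayesiaLab (illustrative only; BayesiaLab handles the splitting internally):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k folds. Each fold serves once as the
    test set while the remaining folds form the learning set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        learn = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield learn, test
```

With n = 699 and k = 10, the folds contain 69 or 70 cases each, matching the roughly 69/630 split described above, and every case appears in exactly one test fold.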


The summary, including the synthesized results, is shown below.

These results confirm the good performance of this model. The Total Precision is 97%, with a false negative rate of 2%. This means 2% of the cases were predicted as Benign while they were actually Malignant.


Clicking Comprehensive Report produces a summary, which can also be saved in HTML format. This is convenient for subsequent editing, as the generated HTML file can be opened and edited as a spreadsheet.

Sampling Method: K-Folds
Learning Algorithm: Markov Blanket
Target: Class

Relative Gini Index Mean: 98.53%
Relative Lift Index Mean: 99.37%
Total Precision: 96.85%
R: 0.93817485358
R²: 0.88017205588

Index                  Benign     Malignant
Gini Index             33.95%     64.59%
Relative Gini Index    98.50%     98.55%
Mean Lift              1.42       2.04
Relative Lift Index    99.74%     99%

Occurrences            Benign (458)   Malignant (241)
Benign (446)           441            5
Malignant (253)        17             236

Reliability            Benign (458)   Malignant (241)
Benign (446)           98.88%         1.12%
Malignant (253)        6.72%          93.28%

Precision              Benign (458)   Malignant (241)
Benign (446)           96.29%         2.07%
Malignant (253)        3.71%          97.93%
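The Reliability and Precision tables are simply the row- and column-normalized versions of the Occurrences table. A sketch of that derivation (our own helper, with the confusion matrix encoded as nested dictionaries):

```python
def reliability_and_precision(occurrences):
    """Derive Reliability (row-normalized: of the cases predicted as P,
    how many were actually A) and Precision (column-normalized: of the
    cases actually A, how many were predicted as P) from an Occurrences
    matrix given as {predicted: {actual: count}}."""
    actuals = {a for row in occurrences.values() for a in row}
    col_totals = {a: sum(row.get(a, 0) for row in occurrences.values())
                  for a in actuals}
    reliability = {p: {a: c / sum(row.values()) for a, c in row.items()}
                   for p, row in occurrences.items()}
    precision = {p: {a: c / col_totals[a] for a, c in row.items()}
                 for p, row in occurrences.items()}
    return reliability, precision
```

Using the Occurrences counts above, 441 of the 446 Benign predictions were truly Benign (Reliability 98.88%), and 441 of the 458 actually Benign cases were predicted Benign (Precision 96.29%).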

As our Markov Blanket model is already performing at a level comparable to the models that have been published in the literature, we might be tempted to conclude our analysis at this point. However, we will attempt to see whether further performance improvements are possible.

Model 2: Augmented Markov Blanket

BayesiaLab offers an extension to the Markov Blanket algorithm, namely the Augmented Markov Blanket, which performs Unsupervised Learning on the nodes in the Markov Blanket. This allows identifying influence paths between the predictor variables and can potentially help improve the prediction performance.


This algorithm can be started via Learning | Supervised Learning | Augmented Markov Blanket.

As expected, the resulting network is somewhat more complex than the standard Markov Blanket.

If we save the original Markov Blanket and the new Augmented Markov Blanket under different file names, we can use Tools | Compare | Structure to highlight the differences between both. Given that the addition of three arcs is immediately visible, this function may appear as overkill for our particular example. However, in more complex situations, Structure Comparison can be rather helpful, so we will spell out the details.

We choose the original network and the newly learned network as the Reference Network and the Comparison Network, respectively.

Upon selection, a table provides a list of common arcs and those arcs that have been added in the Comparison Network, which was learned with the Augmented Markov Blanket algorithm:


Clicking Charts provides a visual representation of these differences. The additional arcs, compared to the original Markov Blanket network, are now highlighted in blue. Conversely, had any arcs been deleted, those would be shown in red.

Model 2a: Performance

We now proceed to performance evaluation with this new Augmented Markov Blanket network, analogous to the Markov Blanket model: Analysis | Network Performance | Target

Given that we had originally split the dataset into a Learning Set and a Test Set, the Network Performance evaluation is once again carried out separately on both subsets.


Interestingly, the performance on the Test Set is better than on the Learning Set. This indicates that overfitting is not a problem here.


A summary for either subset can be saved by clicking Comprehensive Report. The out-of-sample Test Set report is generally the more important one. It is shown below.

Target: Class

Relative Gini Index Mean: 99.53%
Relative Lift Index Mean: 99.85%
Total Precision: 98.56%
R: 0.97499525394
R²: 0.95061574521

Index                  Benign     Malignant
Gini Index             36.52%     63.01%
Relative Gini Index    99.53%     99.53%
Mean Lift              1.45       1.99
Relative Lift Index    99.92%     99.79%

Occurrences            Benign (88)   Malignant (51)
Benign (86)            86            0
Malignant (53)         2             51

Reliability            Benign (88)   Malignant (51)
Benign (86)            100%          0%
Malignant (53)         3.77%         96.23%

Precision              Benign (88)   Malignant (51)
Benign (86)            97.73%        0%
Malignant (53)         2.27%         100%

As with the earlier model, we repeat the K-Folds Cross-Validation for the Augmented Markov Blanket. The results are shown below, first as a screenshot and then as a spreadsheet generated via Comprehensive Report.


Sampling Method: K-Folds
Learning Algorithm: Augmented Markov Blanket
Target: Class

Relative Gini Index Mean: 98.52%
Relative Lift Index Mean: 99.37%
Total Precision: 96.85%
R: 0.93877413371
R²: 0.88129687412

Index                  Benign     Malignant
Gini Index             33.95%     64.58%
Relative Gini Index    98.50%     98.55%
Mean Lift              1.42       2.04
Relative Lift Index    99.75%     98.99%

Occurrences            Benign (458)   Malignant (241)
Benign (448)           442            6
Malignant (251)        16             235

Reliability            Benign (458)   Malignant (241)
Benign (448)           98.66%         1.34%
Malignant (251)        6.37%          93.63%

Precision              Benign (458)   Malignant (241)
Benign (448)           96.51%         2.49%
Malignant (251)        3.49%          97.51%

Despite the greater complexity of this new network, we do not see an improvement in any of the performance measures.


Structural Coefficient

Up to this point, the difference in network complexity was only a function of the choice of learning algorithm. We will now address the Structural Coefficient (SC), which is the only parameter adjustable across all the learning algorithms in BayesiaLab. In essence, this parameter determines a kind of significance threshold, and thus it influences the degree of complexity of the induced networks.

By default, this Structural Coefficient is set to 1, which reliably prevents the learning algorithms from overfitting the model to the data. In studies with relatively few observations, the analyst's judgment is needed for determining a potential downward adjustment of this parameter. On the other hand, when data sets are very large, increasing the parameter to values higher than 1 will help manage the network complexity.

Given the fairly simple network structure of the Markov Blanket model, complexity was of no concern. The Augmented Markov Blanket is more complex, but still very manageable. The question is, could a more complex network provide greater precision without overfitting? To answer this question, we will perform a Structural Coefficient Analysis, which generates several metrics that help in making the trade-off between complexity and precision: Tools | Cross Validation | Structural Coefficient Analysis

BayesiaLab prompts us to specify the range of the Structural Coefficient to be examined and the number of iterations to be performed. It is worth noting that the Minimum Structural Coefficient should not be set to 0, or even close to 0. A value of 0 would imply a fully connected network, which can take a very long time to learn depending on the number of variables, or even exceed the memory capacity of the computer running BayesiaLab.

Number of Iterations determines the interval steps to be taken within the specified range of the Structural Coefficient. Given the relatively light computational load, we choose 25 iterations. With more complex models, we might be more conservative, as each iteration re-learns and re-evaluates the network. Furthermore, we choose to compute all metrics.


The resulting report shows how the network changes as a function of the Structural Coefficient. This can be interpreted as the degree of confidence the analyst should have in any particular arc in the structure.


Clicking Graphs will show a synthesized network, consisting of all structures generated during the iterative learning process.

The reference structure is represented by black arcs, which show the original network learned immediately prior to the start of the Structural Coefficient Analysis. The blue-colored arcs are not contained in the reference structure, but they appear in networks that have been learned as a function of the different Structural Coefficients (SC). The thickness of the arcs is proportional to the frequency of individual arcs existing in the learned networks.

More important for us, however, is determining the correct level of network complexity for reliable and accurate prediction performance while avoiding overfitting the data. We can plot several different metrics in this context by clicking Curve.


Typically, the "elbow" of the L-shaped curve above identifies a suitable value for the Structural Coefficient (SC). More formally, we would look for the point on the curve where the second derivative is maximized. By visual inspection, an SC value of around 0.3 appears to be a good candidate for that point. The portion of the curve where SC values approach 0 shows the characteristic pattern of overfitting, which is to be avoided.
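The second-derivative criterion can be sketched numerically. The curve below is purely illustrative: a piecewise-linear L-shape with its bend deliberately placed at SC=0.3, not the actual output of the Structural Coefficient Analysis:

```python
import numpy as np

# Illustrative L-shaped curve: steeply falling below SC = 0.3, flat above it.
sc = np.linspace(0.05, 1.0, 20)
score = np.maximum(0.3 - sc, 0.0) * 40 + 10

# Discrete first and second derivatives; the elbow is where the
# second derivative is maximized.
d2 = np.gradient(np.gradient(score, sc), sc)
elbow = sc[np.argmax(d2)]
print(f"Elbow at SC = {elbow:.2f}")  # 0.30
```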

We will also plot the Target's Precision alone as a function of the SC. On the surface, the curve for the Learning Set resembles an L-shape too, but the curve moves only within roughly 2 percentage points, i.e. between 97% and 99%. For practical purposes, this means that the curve is virtually flat.


As a result, the Structure/Target's Precision Ratio, i.e. Structure / Target's Precision, is primarily a function of the numerator, i.e. the Structure, as the denominator, Target's Precision, is nearly constant across a wide range of SC values, as per the graph above.

If both Learning and Test Sets are available, a Validation Measure γ can be computed to help choose the most appropriate Structural Coefficient. This measure is based on the Test Set's mean negative log-likelihood (returned by the network learned from the Learning Set) and on the variances of the negative log-likelihood of the Test Set and Learning Set (returned by the network learned from the Learning Set).

γ = µ_LL,Test × max(1, σ²_LL,Test / σ²_LL,Learning)

The range between roughly 0.3 and 0.6, i.e. the section around the minimum of the curve, suggests suitable values for the Structural Coefficient.
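The Validation Measure can be computed directly from the per-case negative log-likelihoods. The function below mirrors the formula above, while the input values are illustrative rather than taken from this study:

```python
import numpy as np

def validation_measure(nll_test, nll_learning):
    """Gamma: mean Test Set NLL, scaled by the Test/Learning
    variance ratio whenever that ratio exceeds 1."""
    ratio = np.var(nll_test) / np.var(nll_learning)
    return np.mean(nll_test) * max(1.0, ratio)

# Equal variances: gamma is simply the mean Test Set NLL.
print(validation_measure([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 2.0
```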


Model 2b: Augmented Markov Blanket (SC=0.3)

Given the results from the Structural Coefficient Analysis, we now wish to relearn the network with an SC value of 0.3. The SC value can be set by right-clicking on the background of the Graph Panel and then selecting Edit Structural Coefficient from the Contextual Menu, or alternatively via the menu, i.e. Edit | Edit Structural Coefficient.

Once we relearn the network, using the same Augmented Markov Blanket algorithm as before, we obtain a more complex network. The key question is, will this increase in complexity improve the performance or perhaps be counterproductive?


Model 2b: Performance

We repeat the Network Performance Analysis and generate the Comprehensive Report for the Test Set.

Target: Class

                     Benign     Malignant
Gini Index           36.60%     63.15%
Relative Gini Index  99.75%     99.75%
Mean Lift            1.45       1.99
Relative Lift Index  99.96%     99.90%

Occurrences       Benign (88)   Malignant (51)
Benign (86)       86            0
Malignant (53)    2             51

Reliability       Benign (88)   Malignant (51)
Benign (86)       100%          0%
Malignant (53)    3.77%         96.23%

Precision         Benign (88)   Malignant (51)
Benign (86)       97.73%        0%
Malignant (53)    2.27%         100%

Relative Gini Index Mean: 99.75%
Relative Lift Index Mean: 99.93%
Total Precision: 98.56%
R: 0.97908818201
R2: 0.95861366815


Secondly, we perform K-Folds Cross Validation:

Sampling Method: K-Folds
Learning Algorithm: Augmented Markov Blanket
Target: Class

                     Benign     Malignant
Gini Index           33.86%     64.42%
Relative Gini Index  98.28%     98.28%
Mean Lift            1.42       2.04
Relative Lift Index  99.69%     99.05%

Occurrences       Benign (458)  Malignant (241)
Benign (447)      441           6
Malignant (252)   17            235

Reliability       Benign (458)  Malignant (241)
Benign (447)      98.66%        1.34%
Malignant (252)   6.75%         93.25%

Precision         Benign (458)  Malignant (241)
Benign (447)      96.29%        2.49%
Malignant (252)   3.71%         97.51%

Relative Gini Index Mean: 98.28%
Relative Lift Index Mean: 99.37%
Total Precision: 96.71%
R: 0.94052337963
R2: 0.88458422762

Conclusion

All of the models reviewed, Model 1 (Markov Blanket), Model 2a (Augmented Markov Blanket, SC=1), and Model 2b (Augmented Markov Blanket, SC=0.3), performed at very similar levels in terms of classification performance. Total Precision and false positives/negatives are shown as the key metrics in the summary table below.

Summary                             Test Set (n=139)                        10-Fold Cross-Validation (n=699)
                                    Total Precision  False Pos.  False Neg.  Total Precision  False Pos.  False Neg.
Markov Blanket (SC=1)               97.84%           3           0           96.85%           17          5
Augmented Markov Blanket (SC=1)     98.56%           2           0           96.85%           16          6
Augmented Markov Blanket (SC=0.3)   98.56%           2           0           96.71%           17          6

Reestimating these models with more observations could potentially change the results and might more clearly differentiate the classification performance. For now, we select the Augmented Markov Blanket (SC=1), which will serve as the basis for the next section of this paper, Model Inference.


Model Inference

Without further discussion of the merits of each model specification, we will now show how the learned Augmented Markov Blanket model can be applied in practice and used for inference. First, we need to go to Validation Mode (F5). We can now bring up all the Monitors in the Monitor Panel by selecting all the nodes (Ctrl+A) and double-clicking on any one of them. More conveniently, the Monitors can be displayed by right-clicking inside the Monitor Panel and selecting Sort | Target Correlation from the Contextual Menu.

Alternatively, we can do the same via Monitor | Sort | Target Correlation.

Monitors are then automatically created for all the nodes correlated with the Target Node. The Monitor of the Target Node is placed first in the Monitor Panel, followed by the other Monitors in order of their correlation with the Target Node, from highest to lowest.


Interactive Inference

For instance, we can now use BayesiaLab to review the individual predictions made based on the model. This feature is called Interactive Inference, which can be accessed from the menu via Inference | Interactive Inference.

Also, we have a choice of using either the Learning Set or the Test Set for inference. For our purposes, we choose the Test Set.

The Navigation Bar allows scrolling through each record of the Test Set. Record #0 can be seen below with all the associated observations highlighted in green. Given the observations shown, the model predicts a 99.97% probability that Class is Benign (the Monitor of the Target Node is highlighted in red).

Most cases are rather clear-cut, as above, with probabilities for either diagnosis around 99% or higher. However, there are a number of exceptions, such as case #11. Here, the probability of malignancy is approximately 75%.

Adaptive Questionnaire

In situations when only individual cases are under review, rather than a batch of cases from a database, BayesiaLab can provide case-by-case diagnosis support with the Adaptive Questionnaire.

For a Target Node with more than two states, the Adaptive Questionnaire requires that we define a Target State. Setting the Target State allows BayesiaLab to compute Binary Mutual Information and then focus on the defined Target State. Technically, setting the Target State is not necessary in our particular example, as the Target Node is binary.

The Adaptive Questionnaire can be started from the menu via Inference | Adaptive Questionnaire.

We can set Based on a Target State to Malignant, as we want to highlight this particular state.

Furthermore, we can set the cost of collecting observations via the Cost Editor, which can be started via the Edit Costs button. This is helpful when certain observations are more costly to obtain than others.9

Unfortunately, our example is not ideally suited to illustrate this feature, as the FNA attributes are all collected at the same time, rather than consecutively. However, one can imagine that in other contexts a physician will start the diagnosis process by collecting easy-to-obtain data, such as blood pressure, before proceeding to more elaborate (and more expensive) diagnostic techniques, such as performing an angiogram.


9 Beyond monetary measures, "cost" could reflect, for instance, the degree of pain associated with a surgical procedure.


Once the Adaptive Questionnaire is started, BayesiaLab presents the Monitor of the Target Node (red) and its marginal probability, with the Target State highlighted. Again, as shown below, the Monitors are automatically ordered in the sequence of their importance, from high to low, with regard to diagnosing the Target State of the Target Node.

This means that the ideal first piece of evidence is Uniformity of Cell Size. Let us suppose this metric is equal to 3 (<=4.5) for the case under investigation. Upon setting this first observation, BayesiaLab will compute the new probability distribution of the Target Node, given the evidence. We see that the probability of Class=Malignant has increased to 58.53%. Given the evidence, BayesiaLab also recomputes the ideal new order of questions and now presents Bare Nuclei as the next most relevant question.

Let us now assume that Bare Nuclei is not available for observation. We instead set the node Clump Thickness to Clump Thickness<=4.5.


Given this latest piece of evidence, the probability distribution of Class is once again updated, as is the array of questions. The small gray arrows inside the Monitors indicate how the probabilities have changed compared to the prior iteration.

It is important to point out that it is not only the Target Node that is updated as we set evidence. Rather, all nodes are updated upon setting evidence, reflecting the omnidirectional nature of inference within a Bayesian network.

We can continue this process of updating until we have exhausted all available evidence, or until we have reached an acceptable level of certainty regarding the diagnosis.

Target Interpretation Tree

Although its tree structure is not displayed, the Adaptive Questionnaire is a dynamic tree for seeking evidence. More specifically, it is a tree that applies to one specific case given its observed evidence. The Target Interpretation Tree is a static tree that is induced from all cases. As such, it provides a more general approach to searching for the optimum sequence of gathering evidence.


The Target Interpretation Tree can be started from the menu via Analysis | Target Interpretation Tree.

Upon starting this function, we need to set several options. We define the Search Stop Criteria, and set the Maximum Size of Evidence to 3 and the Minimum Joint Probability to 1 (percent). Furthermore, we check the Center on State box and select Malignant from the drop-down menu. This way, Malignant will be highlighted in each node of the to-be-generated tree.

By default, the tree is presented in a top-down format.

Often, it may be more convenient to change the layout to a left-to-right format via the Switch Position button in the upper left-hand corner of the window that contains the tree.


The following tree is presented in the left-to-right layout.

This tree prescribes in which sequence evidence should be sought to gain the maximum amount of information towards a diagnosis. Going from left to right, we see how the probability distribution for Class changes given the evidence set thus far.

The leftmost node in the tree, without any evidence set, shows the marginal probability distribution of Class. The bottom panel of this node shows Uniformity of Cell Size as the most important evidence to seek.


The three branches that emerge from the node represent the possible states of Uniformity of Cell Size, i.e. the hard evidence we can observe. If we set evidence analogously to what we did in the Adaptive Questionnaire, we will choose the middle branch with the value Uniformity of Cell Size<=4.5 (2/3).

This evidence updates the probabilities of the Target State, now predicting a 58.53% probability of Class=Malignant. At the same time, we can see the next best piece of evidence to seek. Here, it is Bare Nuclei, which will provide the greatest information gain towards the diagnosis of Class. The information gain is quantified with the Score displayed at the bottom of the node.

The Score is the Conditional Mutual Information of the node Bare Nuclei with regard to the Target Node, divided by the cost of observing the evidence if the option Utilize Evidence Cost was checked. In our case, as we did not check this option, the Score is equal to the Conditional Mutual Information.
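Mutual Information (and, once evidence has been set, Conditional Mutual Information) can be computed directly from a joint probability table. A sketch with a hypothetical 2x2 table, not the actual WBCD distributions:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits, from a joint probability table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (px * py))
    return float(np.nansum(terms))          # 0 * log(0) cells contribute 0

# Hypothetical joint distribution of an evidence node and the Target Node:
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
print(f"I = {mutual_information(joint):.4f} bits")
```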

We can quickly verify the Score of 7.1% by running the Mapping function. First, we set the evidence on Uniformity of Cell Size (<=4.5) and then run Analysis | Visual | Mapping.


The Mapping window features drop-down menus for Node Analysis and Arc Analysis. However, we are only interested in Node Analysis, and we select Mutual Information with the Target Node as the metric to be displayed.

The size of the nodes, beyond a fixed minimum size,10 is now proportional to the Mutual Information with the Target Node. To see the specific values, we right-click on the background of the window and select Display Scores on Nodes from the Contextual Menu.


10 The minimum and maximum sizes can be changed via Edit Sizes from the Contextual Menu in the Mapping Window.


This shows us, given Uniformity of Cell Size<=4.5, the Mutual Information of Bare Nuclei with the Target Node is 0.0711, or 7.1%. Note that the node on which evidence has already been set, i.e. Uniformity of Cell Size, shows a Conditional Mutual Information of 0.

So, learning Bare Nuclei will bring the highest information gain among the remaining variables. For instance, if we now observed Bare Nuclei>5.5 (3/3), the probability of Class=Malignant would reach 98.33%.


Finally, BayesiaLab also reports the joint probability of each tree node, i.e. the probability that all pieces of evidence in a branch, up to and including that tree node, would occur.

This says that the joint probability of Uniformity of Cell Size<=4.5 and Bare Nuclei>5.5 is 5.32%.
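This joint probability is simply the chain rule applied along the branch. The two probabilities below are hypothetical values chosen for illustration; they are not read from the actual network:

```python
# Chain rule for a branch of evidence:
# P(e1, e2) = P(e1) * P(e2 | e1)
p_e1 = 0.28           # assumed: P(Uniformity of Cell Size <= 4.5)
p_e2_given_e1 = 0.19  # assumed: P(Bare Nuclei > 5.5 | e1)

p_branch = p_e1 * p_e2_given_e1
print(f"Joint probability of the branch: {p_branch:.2%}")  # 5.32%
```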

In contrast to this somewhat artificial illustration of a Target Interpretation Tree in the context of FNA-based diagnosis, Target Interpretation Trees are often prepared for emergency situations, such as triage classification, in which rapid diagnosis with constrained resources is essential. We believe that our example still conveys the idea of "optimum escalation" in obtaining evidence towards a diagnosis.

Summary

By using Bayesian networks as the framework and BayesiaLab as the tool, we have shown a practical new modeling and analysis approach based on the widely studied Wisconsin Breast Cancer Database.

BayesiaLab can rapidly machine-learn reliable models, even without prior domain knowledge and without a hypothesis. The classification performance of the BayesiaLab-generated Bayesian network models is on par with all studies on this topic published to date. Beyond the predictive performance, BayesiaLab enables a range of analysis and interpretation functions, which can help the researcher gain deeper domain knowledge and perform inference more efficiently.


Appendix

Framework: The Bayesian Network Paradigm11

Acyclic Graphs & Bayes’s Rule

Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with the work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence, such models are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702-1761), whose rule for updating probabilities in the light of new evidence is the foundation of the approach.

Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B does not equal zero:

P(A|B) = P(B|A) P(A) / P(B)

In Bayes’ theorem, each probability has a conventional name:

P(A) is the prior probability (or "unconditional" or "marginal" probability) of A. It is "prior" in the sense that it does not take into account any information about B; however, the event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes's rule was called the "antecedent" probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was called "a priori" by Ronald A. Fisher.

P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.

P(B|A) is the conditional probability of B given A. It is also called the likelihood.

P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Bayes' theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.
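A worked numeric example of the rule, with hypothetical diagnostic numbers (these are not estimates from the WBCD): let A be the event "malignant" and B a positive result on some test.

```python
# Hypothetical inputs:
p_a = 0.35              # prior P(A)
p_b_given_a = 0.95      # likelihood P(B|A)
p_b_given_not_a = 0.02  # P(B|not A), the false-positive rate

# Normalizing constant P(B) via the law of total probability:
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule:
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.4f}")
```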

The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based schemes.


11 Adapted from Pearl (2000), used with permission.


The nodes in a Bayesian network represent variables of interest (e.g. the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event) and the links represent statistical (informational) or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the posterior probabilities of any subset of variables given evidence about any other subset.

Compact Representation of the Joint Probability Distribution

“The central paradigm of probabilistic reasoning is to identify all relevant variables x1, . . . , xN in the environment [i.e. the domain under study], and make a probabilistic model p(x1, . . . , xN) of their interaction [i.e. represent the variables’ joint probability distribution].”

Bayesian networks are very attractive for this purpose as they can, by means of factorization, compactly represent the joint probability distribution of all variables.

“Reasoning (inference) is then performed by introducing evidence that sets variables in known states, and subsequently computing probabilities of interest, conditioned on this evidence. The rules of probability, combined with Bayes’ rule make for a complete reasoning system, one which includes traditional deductive logic as a special case.” (Barber, 2012)


References

Abdrabou, E. A.M.L., and A. E.B.M. Salem. "A Breast Cancer Classifier Based on a Combination of Case-Based Reasoning and Ontology Approach" (n.d.).

Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

Conrady, Stefan, and Lionel Jouffe. “Missing Values Imputation -  A New Approach to Missing Values Processing with Bayesian Networks,” January 4, 2012. http://bayesia.us/index.php/missingvalues.

El-Sebakhy, E. A., K. A. Faisal, T. Helmy, F. Azzedin, and A. Al-Suhaim. "Evaluation of Breast Cancer Tumor Classification with Unconstrained Functional Networks Classifier." In The 4th ACS/IEEE International Conf. on Computer Systems and Applications, 281–287, 2006.

Hung, M. S, M. Shanker, and M. Y Hu. “Estimating Breast Cancer Risks Using Neural Networks.” Journal of the Operational Research Society 53, no. 2 (2002): 222–231.

Karabatak, M., and M. C Ince. “An Expert System for Detection of Breast Cancer Based on Association Rules and Neural Network.” Expert Systems with Applications 36, no. 2 (2009): 3465–3469.

Mangasarian, Olvi L., W. Nick Street, and William H. Wolberg. "Breast Cancer Diagnosis and Prognosis via Linear Programming." Operations Research 43 (1995): 570–577.

Mu, T., and A. K. Nandi. "Breast Cancer Diagnosis from Fine-Needle Aspiration Using Supervised Compact Hyperspheres and Establishment of Confidence of Malignancy" (n.d.).

Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.

Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November 2000. http://bayes.cs.ucla.edu/csl_papers.html.

Wolberg, W. H., W. N. Street, D. M. Heisey, and O. L. Mangasarian. "Computer-derived Nuclear Features Distinguish Malignant from Benign Breast Cytology." Human Pathology 26, no. 7 (1995): 792–796.

Wolberg, William H., W. Nick Street, and O. L. Mangasarian. "Machine Learning Techniques to Diagnose Breast Cancer from Image-Processed Nuclear Features of Fine Needle Aspirates" (n.d.). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.2109.

Wolberg, William H., W. Nick Street, and Olvi L. Mangasarian. "Breast Cytology Diagnosis Via Digital Image Analysis" (1993). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.9894.



Contact Information

Bayesia USA

312 Hamlet's End Way
Franklin, TN 37067
USA
Phone: +1 888-386-8383
[email protected]

Bayesia Singapore Pte. Ltd.

20 Cecil Street
#14-01, Equity Plaza
Singapore 049705
Phone: +65 3158
[email protected]

Bayesia S.A.S.

6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
Phone: +33(0)2 43 49 75
[email protected]

Copyright

© 2013 Bayesia S.A.S., Bayesia USA and Bayesia Singapore. All rights reserved.
