Optimized Scoring Systems: Towards Trust in Machine Learning for Healthcare
and Criminal Justice
Cynthia Rudin and Berk Ustun
Duke University and Harvard University
Questions of trust in machine learning models are becoming increasingly important, as these tools are
starting to be used widely for high-stakes decisions in medicine and criminal justice. Transparency of models
is a key aspect affecting trust. This paper reveals that there is new technology to build transparent machine
learning models that are often as accurate as black box machine learning models. These methods have had
impact already in medicine and criminal justice. This work calls into question the overall need for black box
models in these applications.
There has been an increasing trend in healthcare and criminal justice to leverage machine
learning for high-stakes prediction problems such as detecting heart attacks (Weng et al.
2017), diagnosing Alzheimer’s disease (Pekkala et al. 2017), and assessing recidivism
risk (Berk & Bleich 2013, Tollenaar & van der Heijden 2013). In many of these problems,
practitioners are deploying black box machine learning models that do not explain their
predictions in a way that humans can understand. In some cases, model development is
outsourced to private companies, who build and sell proprietary predictive models using
confidential datasets, without regulatory oversight.
The lack of transparency and accountability of a predictive model can have severe
consequences when it is used to make decisions that significantly affect human lives. In
criminal justice, proprietary predictive models can lead to questions about due process, or
may discriminate based on race or poverty status (Wexler 2017b). In 2015, for instance, Billy Ray
Johnson was imprisoned based on evidence from software developed by a private company,
Rudin and Ustun: Optimized Scoring Systems. Article submitted to Interfaces.
TrueAllele, which refused to reveal how the software worked. This led to a landmark case
(People v. Chubbs) where the California Appeals Court ruled that such companies were
not required to reveal how their software worked. As a different example, consider the
controversy surrounding the COMPAS recidivism prediction model (Northpointe 2015),
which is used for several applications in the U.S. criminal justice system, but does not
provide clear reasons for its predictions. COMPAS has been accused of discriminating on
the basis of race (Angwin et al. 2016, Citron 2016), and possibly uses socioeconomic
information such as how often the individual is not paid above minimum wage.
A key problem with proprietary models is that they are prone to data-entry errors. There
have been cases such as that of Glenn Rodríguez, a prisoner with a nearly perfect record,
who was denied parole as a result of an incorrectly calculated COMPAS score (Wexler
2017b,a), with little recourse to argue, or even to determine how his score was computed.
There have been cases where criminological risk scores (even simple ones) were
miscalculated, allowing dangerous criminals to be released who subsequently committed
murders (Ho 2017) or other crimes. Issues like those discussed above have led to new regulations such as
the European Union’s “right to explanation” (Goodman & Flaxman 2016), which requires
explanations from any algorithmic decision-making tool that significantly affects humans.
Because mistakes in healthcare and criminal justice can be serious, or even deadly, it
can be beneficial for companies not to disclose their models. If the model is allowed to
be hidden, the company never needs to fully justify why any particular prediction was
made, and can avoid liability when the model makes mistakes. This creates misaligned
incentives: users of these tools would strongly benefit from transparent predictive
models, but transparency would undermine the profits from selling proprietary models.
Since these industries have a strong disincentive to build transparent models, little work
has been done to answer the following questions:
1. Are there interpretable predictive models that are as accurate as black box models? When
we trust companies to build black box models, we are implicitly assuming that their
models are more accurate than transparent models. Is it possible that for many given
black box models, an alternative model exists that is just as accurate, but that is so
simple that it can fit on an index card? We claim the answer is yes. A compelling argument
of Breiman (2001), called the Rashomon effect, indicates that for many applications,
there may exist a large class of models that predict almost equally well. Among this
large class of models are those from the various black box machine learning methods
(e.g., support vector machines, random forests, boosted decision trees, neural networks).
There is no inherent reason that this class would exclude interpretable models. This
observation also helps to explain the 40 years of literature on the surprising performance
of simple linear models (Dawes 1979, Holte 1993).
2. What are the desired characteristics of an interpretable model, if one exists? The answer
to this question changes for each audience and application (Kodratoff 1994, Pazzani
2000, Freitas 2014). We might desire accuracy in predictions, risks that are calibrated,
and we might want the model to be calculated by a judge or a doctor without a
calculator, which makes it easier to explain to a defendant or medical patient. Predictions
from simpler models are much easier to verify, leading to fewer calculation errors and
more robust decisions. A model with all of the characteristics listed above may not exist
for any given problem, but if it does, it would be better to use than a black box.
3. If an interpretable model does exist, is it possible to find it? Interpretability,
transparency, usability, and other desirable characteristics in predictive models lead to
computationally hard optimization problems, such as mixed-integer non-linear programs. It
is much easier to find an accurate unintelligible model than an interpretable one.
A shift from proprietary predictive models back to interpretable predictive models
can be only partially driven by regulations such as the “right to explanation.” Instead,
the restoration of interpretable models should fundamentally be driven by technology: it
must be demonstrated that interpretable models can achieve performance comparable with
black box models. That is the focus of this work.
We will present two machine learning algorithms, called Supersparse Linear Integer
Models (SLIM) and Risk-Calibrated Supersparse Linear Integer Models (RiskSLIM), which
solve mixed-integer linear and nonlinear programs. They produce sparse linear models
directly from data that are faithful to the century-old scoring-system model form. SLIM produces scoring
systems optimized for desired true positive / false positive tradeoffs, whereas RiskSLIM
produces risk scores. Both methods leverage modern optimization tools and avoid well-
known pitfalls of rounding methods. The models come with optimality guarantees, meaning
that they allow one to test for the existence of interpretable models that are as accurate
as black box models. RiskSLIM’s models are risk-calibrated across the spectrum of true
positives and false positives (or sensitivity and specificity), and both methods honor
constraints imposed by the domain. Software for both methods is public, and could be used
to challenge the use of black box models for high-stakes decisions.
SLIM and RiskSLIM are already challenging decision-making processes for applications
in medicine and criminal justice. We will focus on three of them in this work. (i) Sleep
Apnea Screening: In joint work with Massachusetts General Hospital (Ustun et al. 2016),
we determined that a scoring system built using a patient’s medical history can be as
accurate as one that relies on reported symptoms. This improves the efficiency and
effectiveness of medical care for sleep apnea patients. (ii) ICU Seizure Prediction: In joint
work with Massachusetts General Hospital (Struck et al. 2017), we created the first scoring
system that uses continuous EEG measurements to predict seizures, called 2HELPS2B.
The model provides concise reasons why a patient may be at risk. (iii) Recidivism
Prediction: The recent public debate regarding recidivism prediction, and whether COMPAS’
proprietary predictions are racially biased (Angwin et al. 2016), leads to the question of
whether interpretable models exist for recidivism prediction. In our studies of recidivism
(Zeng et al. 2017, Ustun & Rudin 2016a, 2017), we used the largest publicly available
dataset on recidivism, and showed that SLIM and RiskSLIM could produce small scoring
systems that are as accurate as state-of-the-art machine learning models. This calls into
question the necessity of tools like COMPAS, and the rationale for government expenditures
on predictions from proprietary models.
Scoring Systems: Applications and Prior Art
The use of predictive models is not new to society; only the use of black box models is
relatively new. Scoring systems, which are a widely used form of interpretable predictive
model, date back at least to work on parole violation by Burgess (1928). An example
of a scoring system is the CHADS2 score (Gage et al. 2001), shown in Figure 1, which
predicts stroke in patients with atrial fibrillation, and is arguably the most widely used
predictive model in medicine. Scoring systems are sparse linear models with small integer
coefficients. The coefficients are the “point scores”: for CHADS2, the coefficients are 1, 1,
1, 1, and 2.
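To make the mechanics concrete, here is a minimal sketch (ours, not from the paper) of how the CHADS2 score in Figure 1 can be computed and mapped to a stroke risk; the dictionary keys are illustrative feature names.

```python
# CHADS2: sum integer points for five binary risk factors, then map the
# total score to an annual stroke risk (point values from Gage et al. 2001).
CHADS2_POINTS = {
    "congestive_heart_failure": 1,
    "hypertension": 1,
    "age_geq_75": 1,
    "diabetes_mellitus": 1,
    "prior_stroke_or_tia": 2,
}
STROKE_RISK = {0: 0.019, 1: 0.028, 2: 0.040, 3: 0.059, 4: 0.085, 5: 0.125, 6: 0.182}

def chads2(patient: dict) -> tuple[int, float]:
    """Return (score, stroke risk) for a dict of boolean risk factors."""
    score = sum(pts for factor, pts in CHADS2_POINTS.items() if patient.get(factor))
    return score, STROKE_RISK[score]

score, risk = chads2({"hypertension": True, "prior_stroke_or_tia": True})
# score = 3, risk = 5.9%
```

The entire model is two small lookup tables and a sum, which is precisely what makes scoring systems easy to verify by hand.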
1. Congestive Heart Failure                    1 point
2. Hypertension                                1 point
3. Age ≥ 75                                    1 point
4. Diabetes Mellitus                           1 point
5. Prior Stroke or Transient Ischemic Attack   2 points
   ADD POINTS FROM ROWS 1–5                    SCORE

SCORE         0      1      2      3      4      5      6
STROKE RISK   1.9%   2.8%   4.0%   5.9%   8.5%   12.5%  18.2%

Figure 1 CHADS2 score to assess stroke risk (Gage et al. 2001). For each patient, the score is computed as the
sum of the patient’s points. The score is translated into the 1-year stroke risk using the lower table.

The vast majority of predictive models in the healthcare system and justice system are
scoring systems. Other examples from healthcare include: SAPS I, II and III (Le Gall et al.
1993, Moreno et al. 2005); APACHE I, II and III to assess ICU mortality risk (Knaus
et al. 1981, 1985, 1991); TIMI to assess the risk of death and ischemic events (Antman
et al. 2000); HEART (Six et al. 2008) and EDACS (Than et al. 2014) for cardiac events;
PCL to screen for PTSD (Weathers et al. 2013); and SIRS to detect systemic inflammatory
response syndrome (Bone et al. 1992). Examples from criminal justice include the Ohio
Risk Assessment System (Latessa et al. 2009), the Kentucky Pretrial Risk Assessment
Instrument (Austin et al. 2010), the Salient Factor Score (Hoffman & Adelberg 1980,
Hoffman 1994), and the Criminal History Category (CHC) (U.S. Sentencing Commission
1987).
None of the scoring systems listed in the previous paragraphs were optimized for pre-
dictive performance on data. Each scoring system was created using a different method.
Some of them were built using domain expertise alone (no data), and some were created
using logistic regression followed by rounding of coefficients to obtain integer-valued point
scores.
Serious problems with rounding heuristics are well documented in the optimization
literature. When we solve a relaxed problem and round values to integers afterward, we know
that (unless the problem has specific properties) the solution can become infeasible or
suboptimal. It is easy to find problems in discrete optimization textbooks where rounding
leads to flawed solutions. In the case of linear regression or linear classification models,
coefficients that are small are all rounded to zero, and thus an important part of the signal
can easily be lost. We should not use rounding heuristics if we want a reliable,
high-quality solution, despite the government’s recommendation (Gottfredson & Snyder 2005)
to round logistic regression coefficients.
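A toy illustration (ours, not from the paper) of how naive rounding can destroy a linear model: several small coefficients that jointly carry the signal all round to zero, flipping the prediction.

```python
# A linear classifier with several small coefficients: each is weak,
# but together they decide the prediction (predict +1 if score > 0).
weights = [0.4, 0.4, 0.4]              # fitted real-valued coefficients
rounded = [round(w) for w in weights]  # naive rounding -> [0, 0, 0]

x = [1, 1, 1]                          # a patient with all three risk factors
score = sum(w * xi for w, xi in zip(weights, x))           # 1.2 -> predict +1
rounded_score = sum(w * xi for w, xi in zip(rounded, x))   # 0.0 -> predict -1

print(score > 0, rounded_score > 0)    # prints: True False
```

Rescaling the coefficients before rounding changes which ones survive, which is exactly why rounding is a heuristic rather than a principled route to integer point scores.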
An additional set of challenges arises when models need to satisfy operational constraints,
which are user-defined requirements for the model (e.g., false positive rate below 20%).
It is extremely difficult to design rounding heuristics that produce accurate models that
also obey operational constraints. Heuristics for model design lead to suboptimal models,
which in turn could lead to poor decision-making for high-stakes applications.
The field of discrete optimization has advanced steadily since its inception, yet the
scoring systems above were all built without discrete optimization technology. Let us
describe the optimization problems that we actually want to solve when building scoring
systems.
Optimization Problems and Methods
We will discuss two kinds of scoring systems:
1. Decision rules, which are scoring systems for decision-making, produced by SLIM. Here,
predictions are based on whether the total score exceeds a threshold value (i.e., predict
“yes” if total score > 1). The choice of variables and points in the score function is
optimized for accuracy at a specific decision point (a specific true positive rate or false
positive rate). The desired choice of true positive rate (TPR) or false positive rate
(FPR) depends on the application. For medical screening, one might desire a larger
false positive rate so that the test is more likely to falsely identify someone as positive
for a disease than to dismiss someone who has the disease by giving them a negative
test result. The user could specify the maximum false positive rate they are willing to
tolerate, and SLIM will optimize the true positive rate subject to that constraint.
2. Risk scores, which are scoring systems for risk assessment, produced by RiskSLIM.
These models use the score to output a risk estimate. The choice of variables and points
in the score function is optimized for risk calibration. A scoring system is risk calibrated
when the predicted risk of the outcome (from the model) matches the risk of outcome
in the data. These models do not optimize a specific TPR/FPR tradeoff; rather, they
aim to achieve the highest true positive rate for each false positive rate.
We illustrate the difference between these two types of scoring systems in Figure 2, where
we show SLIM and RiskSLIM models for predicting whether a prisoner will be arrested
within three years of being released from prison. Both models were built using the largest
publicly available dataset on recidivism and perform similarly to state-of-the-art machine
learning models (as discussed in the applications section). The SLIM scoring system
outputs a decision rule (predict “yes” if the total score exceeds a threshold), whereas the
RiskSLIM scoring system outputs a table of risk estimates for each distinct score. In both
cases, the choice of variables and the number of points are chosen to optimize the relevant
performance metric by solving a discrete optimization problem.
SLIM solves one constrained optimization problem to produce decision rules, and
RiskSLIM solves a different problem to produce risk scores. Solving these optimization
problems directly is principled, obviates the need for rounding and other manipulation,
and directly encodes what we desire in a scoring system. The optimization problems are
described mathematically in the appendix. In particular:
• In both optimization problems (the decision rule optimization and risk score model
optimization), hard constraints are used to force the coefficients to integer values.
• In both optimization problems, the objective we minimize includes a term that encour-
ages the number of questions asked in the scoring system to be small (model sparsity).
SLIM scoring system (recidivism)

1. Age at Release between 18 and 24    2 points
2. Prior Arrests ≥ 5                   2 points
3. Prior Arrest for Misdemeanor        1 point
4. No Prior Arrests                   -1 point
5. Age at Release ≥ 40                -1 point
   ADD POINTS FROM ROWS 1–5           SCORE

PREDICT ARREST FOR ANY OFFENSE IF SCORE > 1

RiskSLIM risk score (recidivism)

1. Prior Arrests ≥ 2                   1 point
2. Prior Arrests ≥ 5                   1 point
3. Prior Arrests for Local Ordinance   1 point
4. Age at Release between 18 and 24    1 point
5. Age at Release ≥ 40                -1 point
   ADD POINTS FROM ROWS 1–5           SCORE

SCORE   -1      0       1       2       3       4
RISK    11.9%   26.9%   50.0%   73.1%   88.1%   95.3%
Figure 2 Optimized scoring systems for recidivism prediction built using SLIM (top) and RiskSLIM (bottom).
The outcome variable for both models is whether a prisoner is arrested within 3 years of release
from prison. The SLIM scoring system outputs a predicted outcome. It has a test TPR/FPR of
76.6%/44.5%, and a mean 5-fold cross validation TPR/FPR of 78.3%/46.5%. The RiskSLIM scoring
system outputs a risk estimate. It has a 5-fold cross validation mean test CAL/AUC of 1.7%/0.697
and training CAL/AUC of 2.6%/0.701. We provide a definition of these performance metrics in the
Evaluation section. See Zeng et al. (2017), Ustun & Rudin (2016a) for more details.
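The two models in Figure 2 can be transcribed directly into code. The sketch below (ours) applies both to a hypothetical individual; the function and argument names are illustrative.

```python
def slim_score(age, n_prior_arrests, prior_misdemeanor):
    """SLIM decision rule from Figure 2: predict arrest if score > 1."""
    score = 0
    score += 2 if 18 <= age <= 24 else 0
    score += 2 if n_prior_arrests >= 5 else 0
    score += 1 if prior_misdemeanor else 0
    score += -1 if n_prior_arrests == 0 else 0
    score += -1 if age >= 40 else 0
    return score

def riskslim_risk(age, n_prior_arrests, prior_local_ordinance):
    """RiskSLIM risk score from Figure 2: map total score to arrest risk."""
    score = 0
    score += 1 if n_prior_arrests >= 2 else 0
    score += 1 if n_prior_arrests >= 5 else 0
    score += 1 if prior_local_ordinance else 0
    score += 1 if 18 <= age <= 24 else 0
    score += -1 if age >= 40 else 0
    risk = {-1: 0.119, 0: 0.269, 1: 0.500, 2: 0.731, 3: 0.881, 4: 0.953}
    return risk[score]

# A hypothetical 22-year-old with 6 prior arrests, one a misdemeanor:
print(slim_score(22, 6, True) > 1)   # score = 5 -> predict arrest
print(riskslim_risk(22, 6, False))   # score = 3 -> 0.881
```

The point of the transcription is that a defendant, judge, or auditor can recompute either output by hand and check it against the software.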
• In the objective for SLIM, there is a term that encourages the point values to be small
(e.g., it prefers value ‘1 point’ rather than value ‘7 points’). This also encourages the
point values to be co-prime, meaning they share no common prime factors. Thus, this
formulation would never choose point scores ‘10, 10, 20, 10, 40’, rather it would choose
‘1, 1, 2, 1, 4’ to solve the same problem.
• In the formulation for RiskSLIM, the objective includes a term used in logistic regression
(the logistic loss) that encourages the scores to be small and risk calibrated. As we
define later, a model is risk calibrated when its predicted risks agree with risks calculated
directly from the data.
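Putting the bullets above together, the risk score problem minimizes the logistic loss plus a sparsity penalty over a lattice of integer coefficients. A sketch of the formulation, paraphrasing the appendix in our own notation (λ are the coefficients, C₀ trades off sparsity, and the coefficient bound of ±5 is an illustrative choice):

```latex
\min_{\lambda}\;\; \frac{1}{n}\sum_{i=1}^{n}\log\bigl(1+\exp(-y_i\,\lambda^\top x_i)\bigr)
\;+\; C_0\,\lVert\lambda\rVert_0
\qquad \text{s.t.}\;\; \lambda_j \in \{-5,\dots,5\}\ \text{for all } j.
```

The logistic loss drives risk calibration, the ℓ0 term drives sparsity, and the integrality constraints yield point scores directly, with no rounding step.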
Both optimization problems can accommodate constraints on the solution that are spe-
cific to the domain (operational constraints). Some types of constraints are in Table 1.
Constraint Type        Example
Feature Selection      Choose up to 10 features
Group Sparsity         Include either Male or Female, not both
Optimal Thresholding   Use at most 3 thresholds for Age, e.g., Age ≤ 30, Age ≤ 50, Age ≤ 75
Logical Structure      If Male is in model, then also include Hypertension
Probability            Predict Pr(y = +1 | x) ≥ 0.90 when Male = TRUE
Fairness               Ensure that the predicted outcome y is +1 an equal number of times for Male and Female

Table 1 Examples of operational constraints that can be addressed. Both SLIM and RiskSLIM can handle
constraints on model form. SLIM handles constraints related to error metrics (e.g., fairness constraints).
RiskSLIM handles constraints on risk estimates (e.g., probability constraints, as in the second-to-last row).
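Many of these constraints have standard mixed-integer encodings. For instance, the feature-selection constraint ("choose up to 10 features") admits a textbook encoding with binary indicator variables (our notation, not the paper's; Λ is the largest allowed point value):

```latex
z_j \in \{0,1\}, \qquad -\Lambda\, z_j \;\le\; \lambda_j \;\le\; \Lambda\, z_j, \qquad \sum_{j} z_j \;\le\; 10.
```

Here z_j = 1 exactly when feature j may carry a nonzero point value, so bounding the sum of the z_j bounds the number of questions in the scoring system.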
Both optimization problems are computationally hard, but theoretical results allow
practical improvements in speed. As a result, both the decision rule optimization problem and
the risk score optimization problem can be solved for reasonably large datasets in minutes.
The risk score problem is a mixed-integer non-linear program, because the logistic loss is
nonlinear. However, since the logistic loss is convex, cutting planes would be a natural type
of technique for this problem. Cutting plane techniques produce piecewise linear
approximations to the objective (cuts), which produce a surrogate lower bound, labeled “cutting
plane approximation” in the illustration in Figure 3. However, traditional cutting plane
methods fail badly for the risk score problem. Since the feasible region is the integer lattice,
a traditional cutting plane method would need to solve a mixed-integer program (MIP)
to optimality to develop each new cut. If this surrogate MIP is not solved to optimality,
we have no way of knowing when we have reached the solution to the risk score problem.
After several iterations, enough cuts would accumulate that the mixed integer program
Figure 3 A convex loss function (smooth curve) and its surrogate lower bound (lines).
could not be solved to optimality in a reasonable amount of time, and the program would
stall and fail to provide optimal scoring systems. This necessitates a new approach.
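To see why cuts form a piecewise-linear lower bound (as in Figure 3), the one-dimensional sketch below (ours) builds tangent-line cuts for the logistic loss; because the loss is convex, the maximum over the cuts underestimates it everywhere.

```python
import math

def loss(w):
    """Logistic loss at a single training point with y * x = 1."""
    return math.log(1 + math.exp(-w))

def grad(w):
    """Derivative of the logistic loss above."""
    return -1 / (1 + math.exp(w))

def make_cut(w0):
    """Tangent line at w0: a linear underestimator of the convex loss."""
    f0, g0 = loss(w0), grad(w0)
    return lambda w: f0 + g0 * (w - w0)

cuts = [make_cut(w0) for w0 in (-2.0, 0.0, 2.0)]

def surrogate(w):
    """Piecewise-linear cutting-plane approximation: max over all cuts."""
    return max(cut(w) for cut in cuts)

# Convexity guarantees the surrogate never exceeds the true loss:
for w in (-3.0, -1.0, 0.5, 4.0):
    assert surrogate(w) <= loss(w) + 1e-12
```

Each new cut tightens the surrogate; the difficulty described above is not the cuts themselves but the mixed-integer problem a traditional method must solve to decide where to place the next one.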
We developed a new branch-and-bound cutting plane method used in RiskSLIM for
solving the risk score problem. This method does not stall, involves solving linear
programs rather than mixed-integer programs, and can be implemented using standard
callback functions in CPLEX (ILOG 2007). The method gracefully handles arbitrarily large
datasets (even millions of observations), since computation scales linearly with the number
of observations. The RiskSLIM model in Figure 2 was fit on a dataset with N = 22,530
observations in 20 minutes.
SLIM’s decision rule problem (unlike the risk-score problem we just described for
RiskSLIM) is a mixed-integer linear program. It can be solved with optimization software
like CPLEX, but the solver is made more efficient with a specialized bound that we
constructed, which reduces the amount of data we use without changing the solution to the
optimization problem (discussed in Ustun & Rudin 2016b).
In the appendix, we discuss the optimization problems solved by SLIM and RiskSLIM.
Before we discuss applications, let us discuss means of evaluation.
Evaluation Methodology for Machine Learning Models
The fields of machine learning and data mining use rigorous empirical evaluation
techniques. Cross-validation is commonly used to provide a measure of uncertainty of prediction
quality. To perform 5-fold cross-validation, the data are divided into five equal-size folds.
Four of the folds are used to train the algorithm, and predictions are made out-of-sample
on the fifth “test” fold. The test fold rotates, and we report a mean and standard deviation
(or range) across folds.
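The rotation described above can be sketched in a few lines (ours); the model-fitting calls are placeholders for whatever learning algorithm is being evaluated.

```python
# Minimal 5-fold cross-validation skeleton: rotate which fold is held out
# for testing, train on the remaining four, and aggregate the scores.
def five_fold_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test = indices[k * fold_size : (k + 1) * fold_size]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

scores = []
for train, test in five_fold_splits(100):
    # model.fit(X[train], y[train])                 # placeholder training step
    # scores.append(score(model, X[test], y[test])) # placeholder evaluation
    scores.append(len(test) / 100)                  # stand-in "score" for illustration

mean = sum(scores) / len(scores)   # report mean (and std. dev. or range) across folds
```

In practice one would also shuffle the data before splitting and stratify by class; both are omitted here for brevity.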
In this work, we are interested in the following evaluation measures for classification
problems. The true positive rate (TPR), also called sensitivity, is the fraction of positive
test observations predicted to be positive. Specificity is the true negative rate, the
fraction of negative test observations predicted to be negative. The false positive
rate (FPR) is the fraction of negative test observations predicted to be positive, and is
equal to one minus the specificity. The Receiver Operating Characteristic (ROC) curve
is a plot of true positive rate for each possible value of the false positive rate. The area
under the ROC curve (AUC) is important, since if the true positive rate is high for each
value of the false positive rate, the algorithm has a high AUC and is performing well. An
AUC value of .5 would be obtained for random guessing, an AUC of 1 is perfect, and for
most of the problems we consider here, an AUC value of .8 would be considered excellent.
AUC is a useful evaluation measure particularly when the positive and negative classes are
imbalanced, meaning that only a small fraction of the data are positive (or negative). For
instance, for the seizure prediction problem we discuss below, only 13.5% of observations
in the seizure prediction data correspond to true seizures, while the rest are non-seizures.
For risk score prediction, we are also interested in calibration (CAL), which is a measure
of how closely the predicted positive rate from the model matches the empirical positive
rate in the data. We will discuss CAL later.
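The metrics just defined are simple to compute from first principles; the sketch below (ours) does so directly, using the rank interpretation of AUC rather than numerical integration of the ROC curve.

```python
def rates(y_true, y_pred):
    """TPR and FPR from binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg     # (TPR, FPR); specificity = 1 - FPR

def auc(y_true, y_score):
    """AUC = probability a random positive is scored above a random
    negative (ties count half); equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true  = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.2]
tpr, fpr = rates(y_true, [1 if s >= 0.5 else 0 for s in y_score])
# tpr = 0.5, fpr = 0.5; auc(y_true, y_score) = 0.75
```

The threshold 0.5 fixes one (FPR, TPR) operating point; sweeping the threshold traces out the ROC curve that AUC summarizes.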
In general, we find that when the form or size of the model is not constrained, AUC
values for all machine learning algorithms tend to be similar for the majority of
applications. AUCs start to differ when operational constraints are imposed. We will see this
in more depth for the sleep apnea and seizure examples below.
Applications and Insights
Both SLIM and RiskSLIM have had an impact on several applications in healthcare and
criminal justice. In what follows, we discuss three applications, and provide insight gained
by producing interpretable models.
Sleep Apnea Screening
Obstructive Sleep Apnea (OSA) is a serious medical condition that can lead to morbidity
and mortality, and can severely affect quality of life. A major goal of every sleep clinic is
to screen patients for this disease correctly. Testing for OSA is problematic. Preliminary
screening is mainly based on patient-reported symptoms and scoring systems. However,
surprisingly, patient-reported symptoms are neither reliably reported nor very useful
for determining whether a patient has OSA. In particular, doctors often use
the Epworth Sleepiness scale (Johns et al. 1991) or other scoring systems to screen for
OSA, which are based on typical reported OSA symptoms like snoring, nocturnal
gasping, witnessed apneas, sleepiness and other daytime complaints. Each of these predictive
factors alone is weak; the comorbidities provided in medical records are much stronger.
Hypertension, for instance, is a good predictor of OSA. Thus, it is reasonable that the staff
of the Massachusetts General Hospital hypothesized that a scoring system using only
routinely available medical records – without reported symptoms – could be just as
accurate as the widely used scoring systems.
The data provided for this study were records from all patients at the Massachusetts
General Hospital Sleep Lab above 18 years old who underwent a definitive test for OSA
called polysomnography (1,922 patients) between 2009 and 2013. Polysomnography is an
expensive test for obstructive sleep apnea in which patients stay at the hospital overnight in
order to collect information about brain activity, blood oxygen levels, heart rate, breathing
patterns, eye movements and leg movements. Our goal was to predict OSA using only
information that was available before the polysomnography. Such information included
standard medical information (e.g. gender, age, BMI, past heart problems, hypertension,
diabetes, smoking), as well as self-reported information on sleep patterns (e.g. caffeine
consumption, insomnia, snoring, gasping, dry mouth in morning, leg jerks, falls back to
sleep slowly). A full list of the features is provided in Table 1 of Ustun et al. (2016).
The domain experts also required several operational constraints on the form of the
model, such as constraints on the size of the model, and the signs of the coefficients. The
domain experts considered these constraints vital to their trust in the model.
If a scoring system could be developed that accurately screens patients for sleep apnea,
using only the patient’s medical records, without using the patient-reported symptoms,
it would create an actionable tool that could allow automatic screening (as opposed to
manual screening where a doctor would be involved). This type of automated scoring would
allow wise use of the limited resources available for direct patient encounters.
To summarize, our domain experts (Brandon Westover and Matt Bianchi at
Massachusetts General Hospital) had two important goals: (i) create an accurate transparent
model for obstructive sleep apnea that obeyed operational constraints; (ii) determine the
value of the patient-reported symptoms (e.g. gasping, insomnia, caffeine consumption) as
compared with information that is already within the patient’s medical record.
Prior to our work, the best previous scoring system for sleep apnea screening was
arguably the STOP-BANG score (Chung et al. 2008). STOP-BANG relies on 8 features
including self-reported snoring, tiredness, and breathing problems in addition to medical
record information. Its sensitivity is 83.6% and specificity is 56.4%, which precludes it
from being used as a screening tool. The specificity is the percentage of negatives identified
correctly, meaning that the false positive rate is 100% − 56.4% = 43.6%, much higher than
the 20% FPR that our domain experts were targeting.
SLIM Model for Sleep Apnea Screening
One of the models that our collaboration produced has sensitivity 61.4% and specificity
79.1%, so that the FPR was 20.9%. The scoring system was produced by SLIM, and is in