Predicting Post-SafeTrack Metro Reliability
Georgetown SCS Data Science
Team Members: Patrick McGrady, Micah Melling, and Drew Wheatley
Problem Statement:
With an average weekday ridership exceeding 800,000 passengers, the Washington DC Metro is the second-busiest rapid transit system in the United States. Through an ever-expanding
hub-and-spoke system of 91 stations, Metro provides service to two states and the District of
Columbia. Many riders choose Metro as an alternative to what is arguably the worst street traffic
congestion in the country. Given the heavy reliance of the Washington-area population on the rail
system, delays in train service can lead to serious losses in productivity.
In May 2016, following a series of high-profile delays, a deadly smoke crisis affecting the
yellow line, and a blistering report from the National Transportation Safety Board, Metro
officials announced the SafeTrack project. SafeTrack is a comprehensive track work
maintenance effort designed to improve safety and reliability. Track work was previously
restricted to the 33 hours a week during which train service was shut down, but SafeTrack calls for
maintenance work that cuts into Metro’s operating schedule. This, in turn, leads to station
shutdowns, widespread single-tracking, and reduced service hours. WMATA officials say the
project will take 12 months and has an estimated price tag of $60 million.
The three members of our team, each a Metro commuter, were curious about the potential effect
SafeTrack would have on our daily schedules and on the region as a whole. We set out to create a
Metrorail simulation model to inform riders about the potential impact of SafeTrack on their
commutes. At the conclusion of our project, we wanted to gauge the effectiveness of the
maintenance project and answer the question on every Metro commuter’s mind: Will it be worth
it?
Methodology:
Our project adhered to the Data Science Pipeline outlined by Tony Ojeda and Ben Bengfort,
which identifies five stages of data research. We will show below how each step led us to our
final product.
Data Ingestion and Wrangling:
To develop a simulation model for DC’s Metrorail system, we needed two main data inputs:
1) the theoretical runtime of each line and 2) the ways in which the theoretical runtime is interrupted.
At a high level, we needed data that would allow us to, as accurately as possible, portray how the
Metro system is disrupted from reaching its theoretical operating condition.
Data on theoretical perfect runtimes were obtained from timetables on WMATA’s website. To
obtain data on system interruptions, we attempted to pull delay data from WMATA’s API.
However, we discovered that WMATA no longer made this data available, necessitating a
change of course.
Therefore, our team used a list of disruption reports found on the website Open Data DC. The
dataset was made available by a WMATA employee who was formerly tasked with compiling
this information and adding it to wmata.com. The dataset includes 23,630 instances of delays on
Metro trains. Each instance includes date, time, the line on which the incident occurred, the
direction the train was heading, a brief description of the incident, a cause, and the length of delay
in number of minutes. The data was in a downloadable CSV format. There were 349 different
types of disruption “causes”, which we reclassified into one of three categories:
Technical: technology/mechanical failure (3rd rail power fail, radio malfunction, signal
problem, switch issue, unscheduled maintenance)
Operational: act of nature/scheduling issues (fires, medical emergency, police
investigation, track obstruction, schedule adherence, train spacing)
Unclassified
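The reclassification above can be sketched as a keyword lookup. This is a minimal illustration, not the mapping we actually applied to all 349 causes: the keyword lists are hypothetical and are seeded only with the example causes named above, and the function name classify_cause is our own.

```python
# Hypothetical keyword lists, seeded with the example causes listed above.
CATEGORY_KEYWORDS = {
    "Technical": ["3rd rail", "radio", "signal", "switch", "unscheduled maintenance"],
    "Operational": ["fire", "medical", "police", "track obstruction",
                    "schedule adherence", "train spacing"],
}


def classify_cause(cause):
    """Return the first category with a keyword found in the cause text."""
    text = cause.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    # Anything that matches no keyword falls into the third category.
    return "Unclassified"


for cause in ["Signal Problem", "Police Investigation", "Other"]:
    print(cause, "->", classify_cause(cause))
```

A lookup of this kind keeps the mapping in one place, so a cause can be moved between categories without touching the rest of the analysis.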
Although useful, the data from Open Data DC is imperfect, causing us to pivot direction. First, the
WMATA employee stated the data may not be 100% complete. For example, few small delays of
one, two, or three minutes appear in the dataset, indicating not all delays were recorded.
Additionally, on inspection, the dataset did not appear to include the impact of compounding
delays, meaning we would need to build those effects into the model. Lastly, the data did not
include delays specific to each station. Without station-specific data, we could not create a true
discrete-event simulation model.
We recognized this data would give us the opportunity to analyze runtime by day rather than by
trip, as we could summarize delay data by day. Using the timetable data from WMATA’s
website, we then determined how many minutes per day trains run on each line. As a
simple example, if it takes 30 minutes to get from one end of a line to the other, and two trains
make this trip per day, trains run for a total of 60 minutes on this line.
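The daily runtime calculation can be written as a one-line sum. The sketch below assumes the timetable has already been reduced to (minutes per trip, trips per day) pairs; that input shape and the function name are illustrative choices, not the format of WMATA's timetables.

```python
def daily_runtime_minutes(trips):
    """Total minutes per day that trains are running on a line.

    trips: iterable of (minutes_per_trip, trips_per_day) pairs.
    """
    return sum(minutes * count for minutes, count in trips)


# The example from the text: a 30-minute end-to-end trip made twice a day.
print(daily_runtime_minutes([(30, 2)]))
```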
Although not the case for most data science projects, our work did not use a database. In
comparison to other types of data science work, our simulation model necessitated much less
data. In most cases, each simulation required under a dozen unique data points! We believed that
querying a database would have added unnecessary complexity to the work; therefore, data was
saved in CSV files. In retrospect, however, we could have used a database to save our simulation
outputs.
Computation and Modeling:
Due to the data limitations described above, we came to a crossroads in the computation and
modeling step. A simulation model using the “real” data we could find would be imperfect
but, given the short timeline for this project, would provide the most feasible path
forward to modeling the potential impact of SafeTrack. However, we still saw the
value in developing software that could take more accurate data, should we find it in the future.
Therefore, modeling was broken into two branches, meaning our project has two final outputs: 1)
development of a discrete-event simulation model based on dummy data to provide a foundation
for future work, and 2) creation of a less robust simulation model to provide an approximate
estimate of the potential impact of SafeTrack. This approach delivers on our original promise yet
also provides an avenue for us to conduct advanced work in the future. Both branches are
described below.
Discrete-event simulation model. Since we could not find station-by-station data, developing a
true discrete-event simulation model on real data was not possible. However, we wanted to
develop software for future use, knowing that once we can obtain “real” station-by-station
data, this program could be quite powerful. (See the future research section for details on
populating this script with “real” data).
We used the Python library SimPy to develop a simulation model on dummy data. For software
development, we opted to simulate a portion of the silver line (East Falls Church to Farragut
West, which is the commute for one of the team members). Although the current script only
focuses on part of one line, it can easily be adapted to other lines in totality.
At a high level, the simulation software works as follows:
1. A generator creates an initial train and adds trains into the simulation at specified
intervals.
2. Trains move throughout the system, with each station-to-station trip having its own
specified time.
3. Station-to-station trips are occasionally interrupted by delays, which are thrown into the
system based on a probability.
4. Only one train can access a station at a time; therefore, a delayed train will prevent a
subsequent train from accessing a station, causing compounding delays throughout the
system.
5. Each train’s “actions” are recorded, such as how long it is delayed and when it reaches a
new destination.
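The five steps above can be sketched in a few lines. To keep the example free of dependencies, it uses plain Python rather than SimPy and replaces the event queue with a simple rule: a train cannot enter a station until the train ahead has left it. All numbers (segment times, headway, dwell time, delay length and probability) are dummy values, not WMATA figures, and the names are illustrative.

```python
import random

# Dummy data, not WMATA figures: scheduled minutes for each station-to-station trip.
SEGMENT_MINUTES = [3, 2, 2, 3, 2]
HEADWAY = 4        # minutes between trains entering the line
DWELL = 1          # minutes a train occupies a station
DELAY_MINUTES = 6  # length of an injected delay
SCHEDULED_TRIP = sum(SEGMENT_MINUTES) + DWELL * len(SEGMENT_MINUTES)


def simulate(n_trains, delay_prob=0.1, seed=0):
    """Return one record per train: (train, departure, arrival, own_delay)."""
    rng = random.Random(seed)
    # Time at which each station is next free; a station holds one train at a time.
    station_free_at = [0] * len(SEGMENT_MINUTES)
    records = []
    for train in range(n_trains):            # step 1: trains enter at fixed intervals
        departure = clock = train * HEADWAY
        own_delay = 0
        for station, minutes in enumerate(SEGMENT_MINUTES):
            clock += minutes                 # step 2: scheduled segment time
            if rng.random() < delay_prob:    # step 3: random delay on this segment
                clock += DELAY_MINUTES
                own_delay += DELAY_MINUTES
            # step 4: wait until the train ahead has cleared the station
            clock = max(clock, station_free_at[station]) + DWELL
            station_free_at[station] = clock
        records.append((train, departure, clock, own_delay))  # step 5: log actions
    return records


for train, departure, arrival, own_delay in simulate(8):
    late = arrival - departure - SCHEDULED_TRIP
    print(f"train {train}: left {departure}, arrived {arrival}, "
          f"own delay {own_delay} min, total late {late} min")
```

When a train's total lateness exceeds its own delay, the difference is time spent waiting behind an earlier delayed train, which is the compounding effect described in step 4.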
Below is a visualization of a simulated run of the system, using dummy data, from East Falls
Church to Farragut West. The simulation time was set to 100 minutes, with new trains being
injected into the system approximately every four minutes. (In the graph below, the blue line
represents the actual trip time, while the gray line shows the forecasted time based on timetable
data). We can see the system runs smoothly until train 6, which experiences delays. These delays
reverberate throughout the system, with all subsequent trains running well off their forecast.
Simulation modeling on real-world delay data. To deliver on our promise to demonstrate the
potential impact of SafeTrack, we also developed a simulation model on days of trips for each
line. Although limited due to the lack of robust data, this model can provide insight into the
possible impact of SafeTrack. Below is a description of how this model works.
1. The optimal runtime was set to be the expected number of minutes per day the line has
trains operating, based on timetable data from WMATA’s website.
2. The optimal runtime was interrupted by five tiers of delays. On each line, there is a certain
probability for different severities of delay. For example, on the blue line, there is a 20%
probability of a four-minute delay and a 50% probability of a 17-minute delay. After
inspecting density plots, histograms, and quartile breakdowns, we deemed five delay tiers
to be optimal. Our goal was to minimize the standard deviation in each tier, providing the
most stability in simulation design.
3. Since the source data did not appear to include compounding delays, we built those into
the delay data, based on how often trains are scheduled to run. For example, using a
weighted average, blue line trains are scheduled to run every thirteen minutes throughout
the day, which means that only tier 5 delays will spur compounding effects throughout the
system.
4. The model was set to simulate approximately 300 days of trips for each line.
5. To simulate the potential impact of improvements after SafeTrack, we adjusted the
probability and severity of the delay tiers, demonstrating what lines might look like under
different conditions. (The results of different scenarios are presented in the next section.)
6. Knowing the Metrorail system is stochastic, we also placed random noise into each line.
Random noise was based on a probability distribution set to inject noise that hovered
around 5% of a day’s expected runtime.
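The steps above can be sketched as follows. This is an illustration under stated assumptions, not our production script: each tier is assumed to fire at most once per day, a delay longer than the headway is assumed to hold the following train for the excess minutes, and noise is assumed to be normally distributed. The optimal runtime and noise level in the usage lines are placeholders, and only the two blue line tiers quoted in step 2 are included.

```python
import random
import statistics


def simulate_line(optimal_minutes, tiers, headway, days=300, noise_sd=0.0, seed=0):
    """Simulate total daily runtime (minutes) for one line.

    tiers: list of (probability, delay_minutes) pairs, one per delay tier.
    Assumption: each tier fires at most once per day.
    """
    rng = random.Random(seed)
    daily_totals = []
    for _ in range(days):
        total = optimal_minutes
        for probability, delay in tiers:
            if rng.random() < probability:
                total += delay
                # Assumed compounding rule: a delay longer than the headway
                # also holds the following train for the excess minutes.
                if delay > headway:
                    total += delay - headway
        # Random noise; the normal distribution is an assumption.
        total += rng.gauss(0, noise_sd)
        daily_totals.append(total)
    return daily_totals


# Two blue line tiers quoted in the text; 6700 and 100 are placeholder values.
blue_tiers = [(0.20, 4), (0.50, 17)]
runs = simulate_line(optimal_minutes=6700, tiers=blue_tiers, headway=13, noise_sd=100)
print(round(statistics.mean(runs), 1), round(statistics.stdev(runs), 1))
```

With a 13-minute headway, only the 17-minute delay in this example triggers the compounding rule, mirroring the observation in step 3 that only the longest tier spurs compounding effects.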
Though it provides an insightful basis, we acknowledge this model has limitations. First, delay
tiers, runtimes, and train “spacing” are based on simple or weighted averages. The system is more
nuanced, and these figures change over the course of a day. Second, knowing how much random
noise to inject into the system is tricky; our choice was based on the team’s gut instinct, which is not an optimal approach.
Throwing in too much or too little noise will certainly impact results. Third, the current model
does a poor job of accounting for highly severe delays, such as those greater than 30 minutes. Per
our source data, these delays are quite rare (less than 0.5% of all delays); still, they should be
included as a possibility. As we will discuss later, most of our key conclusions from this section
of the project stem from process rather than model results.
As we will discuss in the future research section, the optimal path forward will be to obtain real-
world data that can be used in our discrete-event simulation script (the first branch described
above). However, the simulation model developed on the “real” data we could find can still
provide intuition of SafeTrack’s impact.
Visualization and Presentation of Results:
Results for our simulation models using the “real” data are shown below. (All outcomes are
measured in minutes of trips per day.) After running two improved scenarios (scenario 1 and
scenario 2), we realized that more drastic improvements were needed and, in some cases, that we
likely needed to change the amount of random noise in the system. Therefore, we ran scenarios 3-5
and compared them against the current scenario. Still, some scenarios did not show marked
improvements in mean performance, though standard deviations tended to fluctuate from model to
model. We experimented with several different changes to the system and describe those
below. We were unable to find a forecasted improvement rate tied to SafeTrack, which
necessitated experimentation. Future modeling should take a more stringent approach from the
beginning and include more hypotheses to test.
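Most scenarios below follow the same pattern: cut each tier's probability by a percentage and shorten each delay by a fixed number of minutes. A minimal sketch of that adjustment is shown here. It assumes "reduced by 10%" means a relative cut (a 20% probability becomes 18%) and that delay lengths cannot drop below zero; the function name and the tier representation are illustrative.

```python
def adjust_tiers(tiers, prob_cut=0.10, minutes_cut=1):
    """Return a new tier list with lower probabilities and shorter delays.

    tiers: list of (probability, delay_minutes) pairs.
    Assumption: prob_cut is a relative reduction, and delays are floored at zero.
    """
    return [(p * (1 - prob_cut), max(0, m - minutes_cut)) for p, m in tiers]


# Two blue line tiers quoted earlier in the report; the other tiers are omitted.
current = [(0.20, 4), (0.50, 17)]
scenario_1 = adjust_tiers(current, prob_cut=0.10, minutes_cut=1)
scenario_2 = adjust_tiers(current, prob_cut=0.10, minutes_cut=2)
print(scenario_1)
print(scenario_2)
```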
Orange Line
Current:
Mean: 8121.386
Standard Deviation: 95.95478
Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by one minute
Mean: 8117.496
Standard Deviation: 97.34168
Scenario 2: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by two minutes
Mean: 8115.761
Standard Deviation: 99.95325
Scenario 3: Cut tier 5 delays in half
Mean: 8114.653
Standard Deviation: 104.40773
Scenario 4: For all tiers, cut probabilities by 30% and reduced each delay by three
minutes
Mean: 8104.472
Standard Deviation: 99.70228
Scenario 5: Eliminated tier 1 and cut other tiers in half in terms of probability and
severity
Mean: 8093.360
Standard Deviation: 98.42959
Silver Line
Current:
Mean: 9861.402
Standard Deviation: 102.51522
Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by one minute
Mean: 9868.713
Standard Deviation: 97.10936
Scenario 2: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by two minutes
Mean: 9854.400
Standard Deviation: 108.57028
Scenario 3: Cut tier 5 delays in half
Mean: 9852.256
Standard Deviation: 102.1384
Scenario 4: Eliminated tiers 2 and 4; reduced probability of tiers 1 and 3 by 20% and
reduced severity by 2 minutes; cut tier 5 severity by 3 minutes
Mean: 9850.429
Standard Deviation: 101.7149
Scenario 5: Eliminated tiers 2 and 4; reduced probability of tiers 1 and 3 by 30% and
reduced severity by 3 minutes; cut tier 5 severity by 50%
Mean: 9848.057
Standard Deviation: 104.1241
Blue Line
Current:
Mean: 6811.311
Standard Deviation: 108.8495
Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by one minute
Mean: 6815.635
Standard Deviation: 100.7707
Scenario 2: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by two minutes
Mean: 6816.787
Standard Deviation: 105.2023
Scenario 3: Cut tier 5 delays in half
Mean: 6809.531
Standard Deviation: 108.57136
Scenario 4: Reduced probabilities in each tier by 20% and cut severity by two minutes
Mean: 6810.966
Standard Deviation: 97.19704
Scenario 5: Reduced probabilities in each tier by 30% and cut severity by three minutes.
Reduced tier 5 by 50% in severity and 30% in probability.
Mean: 6809.322
Standard Deviation: 98.01096
In retrospect, it appears we might have too much random noise in the blue line.
Red Line
Current:
Mean: 11149.50
Standard Deviation: 97.38860
Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by one minute
Mean: 11159.63
Standard Deviation: 98.45129
Scenario 2: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by two minutes
Mean: 11146.33
Standard Deviation: 99.59112
Scenario 3: Cut tier 5 delays in half
Mean: 11138.77
Standard Deviation: 112.83935
Scenario 4: For each tier, reduced probability by 30% and severity by three minutes
Mean: 11132.07
Standard Deviation: 97.06135
Scenario 5: Eliminated tier 1 and cut other tiers in half in terms of both probability and
severity
Mean: 11123.84
Standard Deviation: 101.22620
Green Line
Current (this version is only comparable to scenarios 1 and 2):
Mean: 6762.053
Standard Deviation: 97.83910
Current (Adjusted Noise; this version is only comparable to scenarios 3-5):
Mean: 6562.225
Standard Deviation: 52.97352
Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by one minute
Mean: 6765.060
Standard Deviation: 96.90318
Scenario 2: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by two minutes
Mean: 6759.091
Standard Deviation: 103.26655
Scenario 3: Cut tier 5 delays in half
Mean: 6557.466
Standard Deviation: 55.09180
Scenario 4: For each tier, reduced probability by 20% and severity by 3 minutes
Mean: 6552.855
Standard Deviation: 53.31641
Scenario 5: For tiers 1-4, reduced probability by 20% and severity by 4 minutes. Cut
severity and probability of Tier 5 by 50%
Mean: 6540.793
Standard Deviation: 48.94717
Yellow Line
Current (this version is only comparable to scenarios 1 and 2):
Mean: 5280.572
Standard Deviation: 100.55666
Current (Adjusted Noise; this version is only comparable to scenarios 3-5):
Mean: 5025.293
Standard Deviation: 41.42158
Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by one minute
Mean: 5261.651
Standard Deviation: 92.57486
Scenario 2: Reduced the probability of each delay tier by 10% and reduced the severity of
each delay by two minutes
Mean: 5262.043
Standard Deviation: 114.09398
Scenario 3: Cut tier 5 delays in half
Mean: 5020.811
Standard Deviation: 41.13303
Scenario 4: For each tier, reduced probabilities by 20% and severity by 2 minutes
Mean: 5014.868
Standard Deviation: 41.25119
Scenario 5: For tiers 1-4, reduced probabilities by 20% and severity by 3 minutes. Cut
tier 5 probability by 20% and severity in half.
Mean: 5013.925
Standard Deviation: 40.98048
Conclusions:
The development of software for a discrete-event simulation model of the Metrorail system
provides a solid framework for future research and modeling. It gives Metro the ability to either
1) insert hypothetical situations into the model and see what might happen to the system or
2) provide real-world, station-by-station data to more accurately model the current system and
adjust those parameters to understand the possible impact of track improvements.
The major conclusions from the results of the simulation model based on “real” data stem from
process and theory rather than the actual results. Again, the numeric outcomes should be viewed
tentatively, due to likely issues with the source data and the imprecision of injecting random noise
into the system. Below are our most important lessons learned:
1. Several improved scenarios of the lines showed little impact on system performance.
Essentially, it appears that modest improvements were sometimes not enough to
compensate for the stochastic nature of the Metrorail system. One item with which we
struggled was the amount of random noise to inject into the system. However, this struggle
provided us with a key learning: In a stochastic system, improvements must be drastic
enough so they are not canceled out by the random variation in the system. SafeTrack’s
improvements may not be noticed if they do not overcome the system’s random noise.
2. Recognizing bias and stating assumptions is key to data science. Throughout our project,
we recognized issues with our source data and that we needed to make certain
assumptions to run a model. We have attempted to be open and honest about such biases
and assumptions.
3. Likewise, the importance of accurate data cannot be overstated, and dealing with what
appears to be flawed data can be a tricky process.
4. Additional data sources could be used to create a more powerful model.
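The first lesson can be illustrated with a rough check of whether a scenario's improvement stands out from the noise. The sketch below divides the difference in mean daily runtime by its standard error, using the orange line figures from the results section. It assumes each scenario was simulated for exactly 300 days (the report says approximately 300) and that the daily values are independent; this check was not part of our original analysis.

```python
import math


def z_score(mean_a, sd_a, mean_b, sd_b, n=300):
    """Difference in means divided by its standard error (two samples of n days)."""
    standard_error = math.sqrt(sd_a**2 / n + sd_b**2 / n)
    return (mean_a - mean_b) / standard_error


# Orange line (mean, standard deviation) pairs from the results section.
current = (8121.386, 95.95478)
scenario_1 = (8117.496, 97.34168)
scenario_5 = (8093.360, 98.42959)

print(round(z_score(*current, *scenario_1), 2))
print(round(z_score(*current, *scenario_5), 2))
```

The first value comes out at roughly 0.5 and the second at roughly 3.5. By the common rule of thumb that a value above 2 is distinguishable from chance, the modest scenario 1 improvement is lost in the noise, while the drastic scenario 5 improvement is not.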
Future Research:
As hinted at multiple times above, a firm foundation for future work has been established.
Broadly, future work would involve finding accurate data to place into our station-to-station,
discrete-event simulation software program. This data could likely be supplied by one of two sources:
WMATA’s API or WMATA authorities. If pulling trip data from the API, we would
be tasked with separating delays from random noise when inspecting train arrival times. With
the right data, we believe this software could be powerful in helping WMATA understand the
impact of different system issues as well as project what a certain improvement could mean.
Additionally, to improve the software, we would focus on adding a user interface that would
visualize results and print descriptive statistics for different scenarios. Lastly, it would likely be
beneficial to develop separate models for different times of the day and perhaps even for
different days of the week or months of the year.
This process has been incredibly rewarding and enlightening. We look forward to continuing to
build our data science skills to make a positive impact on our communities!