Predicting Post-SafeTrack Metro Reliability

Georgetown SCS Data Science

Team Members: Patrick McGrady, Micah Melling, and Drew Wheatley

Problem Statement:

With an average weekday ridership exceeding 800,000 passengers, the Washington DC Metro is the second busiest rapid transit system in the United States. Through an ever-expanding

hub-and-spoke system of 91 stations, Metro provides service to two states and the District of

Columbia. Many riders choose Metro as an alternative to what is arguably the worst street traffic

congestion in the country. Given the heavy reliance of the Washington-area population on the rail

system, delays in train service can lead to serious losses in productivity.

In May 2016, following a series of high-profile delays, a deadly smoke crisis affecting the

Yellow Line, and a blistering report from the National Transportation Safety Board, Metro

officials announced the SafeTrack project. SafeTrack is a comprehensive track work

maintenance effort designed to improve safety and reliability. Track work was previously

restricted to the 33 hours a week when train service was shut down, but SafeTrack calls for

maintenance work that cuts into Metro’s operating schedule. This, in turn, leads to station

shutdowns, widespread single-tracking, and reduced service hours. WMATA officials say the

project will take 12 months and has an estimated price tag of $60 million.

The three members of our team, each a Metro commuter, were curious about the potential effect

SafeTrack would have on our daily schedules and on the region as a whole. We set out to create a

Metrorail simulation model to inform riders about the potential impact of SafeTrack on their

commutes. At the conclusion of our project, we wanted to gauge the effectiveness of the

maintenance project and answer the question on every Metro commuter’s mind: Will it be worth

it?

Methodology:

Our project adhered to the Data Science Pipeline outlined by Tony Ojeda and Ben Bengfort,

which identifies five stages of data research. We will show below how each step led us to our

final product.

Data Ingestion and Wrangling:

To develop a simulation model for DC’s Metrorail system, we needed two main data inputs:

1) the theoretical runtime of each line and 2) the ways in which the theoretical runtime is interrupted.

At a high level, we needed data that would allow us to portray, as accurately as possible, how

disruptions keep the Metro system from reaching its theoretical operating condition.

Data on theoretical perfect runtimes were obtained from timetables on WMATA’s website. To

obtain data on system interruptions, we attempted to pull delay data from WMATA’s API.

However, we discovered that WMATA no longer made this data available, necessitating a

change of course.

Therefore, our team used a list of disruption reports found on the website Open Data DC. The

dataset was made available by a WMATA employee who was formerly tasked with compiling

this information and adding it to wmata.com. The dataset includes 23,630 instances of delays on

Metro trains. Each instance includes date, time, the line on which the incident occurred, the

direction the train was heading, a brief description of the incident, a cause, and the length of delay

in minutes. The data was available as a downloadable CSV file. There were 349 different

types of disruption “causes”, which we reclassified into one of three categories:

Technical: technology/mechanical failure (3rd rail power fail, radio malfunction, signal

problem, switch issue, unscheduled maintenance)

Operational: act of nature/scheduling issues (fires, medical emergency, police

investigation, track obstruction, schedule adherence, train spacing)

Unclassified
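The three categories above could be assigned with a simple keyword lookup. The sketch below is illustrative only: the keyword lists are taken from the examples named in the category descriptions, while the function name and the matching rule are assumptions, not the mapping actually applied to the 349 causes.

```python
# Hypothetical keyword lists, drawn from the examples listed in the text.
TECHNICAL_KEYWORDS = ("3rd rail", "radio", "signal", "switch", "maintenance")
OPERATIONAL_KEYWORDS = ("fire", "medical", "police", "obstruction",
                        "schedule", "spacing")


def classify_cause(cause: str) -> str:
    """Map a free-text disruption cause to one of the three categories."""
    text = cause.lower()
    # Technical is checked first so that "unscheduled maintenance" is not
    # caught by the operational keyword "schedule".
    if any(keyword in text for keyword in TECHNICAL_KEYWORDS):
        return "Technical"
    if any(keyword in text for keyword in OPERATIONAL_KEYWORDS):
        return "Operational"
    return "Unclassified"


# Made-up sample causes for illustration.
sample_causes = ["Signal Problem", "Police Investigation", "Unknown"]
print([classify_cause(cause) for cause in sample_causes])
```

A lookup like this would still need manual review, since free-text causes rarely match keywords cleanly.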

Although useful, the data from Open Data DC is imperfect, causing us to pivot direction. First, the

WMATA employee stated the data may not be 100% complete. For example, few small delays of

one, two, or three minutes appear in the dataset, indicating not all delays were recorded.

Additionally, on inspection, the dataset did not appear to include the impact of compounding

delays, meaning we would need to build those effects into the model. Lastly, the data did not

include delays specific to each station. Without station-specific data, we could not create a true

discrete-event simulation model.

We recognized this data would give us the opportunity to analyze runtime by day rather than by

trip, as we could summarize delay data by day. Using the timetable data from WMATA’s

website, we then determined how many minutes per day trains run on each line. As a

simple example, if it takes 30 minutes to get from one end of a line to another, and two trains

make this trip per day, there would be 60 minutes that trains are running on this line.
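The arithmetic in this example, together with the per-day delay summary, can be sketched in a few lines of Python. The function names and the sample delay records are hypothetical; only the 30-minute, two-trip example comes from the text.

```python
from collections import defaultdict


def scheduled_minutes(trip_minutes, trips_per_day):
    """Minutes per day that trains are running on one line."""
    return trip_minutes * trips_per_day


def delay_minutes_by_day(records):
    """Sum delay minutes per date from (date, delay_minutes) pairs."""
    totals = defaultdict(int)
    for date, minutes in records:
        totals[date] += minutes
    return dict(totals)


# The example from the text: a 30-minute trip made twice a day.
print(scheduled_minutes(30, 2))

# Made-up delay records standing in for rows of the disruption CSV.
records = [("2016-05-02", 4), ("2016-05-02", 17), ("2016-05-03", 6)]
print(delay_minutes_by_day(records))
```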

Unlike most data science projects, our work did not use a database. In

comparison to other types of data science work, our simulation model necessitated much less

data. In most cases, each simulation required under a dozen unique data points! We believed that

querying a database would have added unnecessary complexity to the work; therefore, data was

saved in CSV files. In retrospect, however, we could have used a database to save our simulation

outputs.

Computation and Modeling:

Due to the data limitations described above, we came to a crossroads in the computation and

modeling step. A simulation model using the “real” data we could find would be imperfect,

though, given the short timeline for this project, it would provide the most feasible path

forward to modeling the potential impact of SafeTrack. However, we still saw the

value in developing software that could take more accurate data, should we find it in the future.

Therefore, modeling was broken into two branches, meaning our project has two final outputs: 1)

development of a discrete-event simulation model based on dummy data to provide a foundation

for future work, and 2) creation of a less robust simulation model to provide a rough

estimate of the potential impact of SafeTrack. This approach delivers on our original promise yet

also provides an avenue for us to conduct advanced work in the future. Both branches are

described below.

Open Data DC disruption reports: http://opendatadc.org/dataset/wmata-disruption-reports/resource/2fa53a5a-2382-49a7-b8cc-15fc72f98e08

Discrete-event simulation model. Since we could not find station-by-station data, developing a

true discrete-event simulation model on real data was not possible. However, we wanted to

develop software for future use, knowing that once we can obtain “real” station-by-station

data, this program could be quite powerful. (See the future research section for details on

populating this script with “real” data.)

We used the Python library SimPy to develop a simulation model on dummy data. For software

development, we opted to simulate a portion of the Silver Line (East Falls Church to Farragut

West, which is the commute for one of the team members). Although the current script only

focuses on part of one line, it can easily be adapted to cover other lines in full.

At a high level, the simulation software works as follows:

1. A generator creates an initial train and adds trains into the simulation at specified

intervals.

2. Trains move throughout the system, with each station-to-station trip having its own

specified time.

3. Station-to-station trips are occasionally interrupted by delays, which are thrown into the

system based on a probability.

4. Only one train can access a station at a time; therefore, a delayed train will prevent a

subsequent train from accessing a station, causing compounding delays throughout the

system.

5. Each train’s “actions” are recorded, such as how long it is delayed and when it reaches a

new destination.
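The five steps above can be approximated in plain Python. This is a dependency-free sketch of the same logic, not the team's SimPy script: in SimPy, each station would more naturally be a capacity-one resource that trains request and release. The segment times, dwell time, delay probability, and delay length below are all dummy values, and the function name is hypothetical.

```python
import random


def simulate_line(segment_minutes, headway, n_trains, delay_prob,
                  delay_minutes, dwell=0.5, seed=0):
    """Run trains down one line and return one record per train."""
    rng = random.Random(seed)
    # Time at which each station's platform is next free (used in step 4).
    station_free_at = [0.0] * len(segment_minutes)
    log = []
    for train in range(n_trains):
        start = train * headway              # step 1: inject trains at intervals
        clock = float(start)
        delayed = 0.0
        for i, minutes in enumerate(segment_minutes):
            clock += minutes                 # step 2: station-to-station time
            if rng.random() < delay_prob:    # step 3: random delay
                clock += delay_minutes
                delayed += delay_minutes
            # Step 4: hold the train while the previous one occupies the station.
            blocked = max(0.0, station_free_at[i] - clock)
            clock += blocked
            delayed += blocked
            clock += dwell                   # time spent at the platform
            station_free_at[i] = clock
        # Step 5: record what happened to this train.
        log.append({"train": train + 1, "start": start,
                    "trip_minutes": clock - start, "delay_minutes": delayed})
    return log


# Dummy inputs: seven segments, one train every four minutes for 100 minutes.
dummy_segments = [3, 2, 2, 2, 3, 4, 2]
for record in simulate_line(dummy_segments, headway=4, n_trains=25,
                            delay_prob=0.03, delay_minutes=6):
    print(record)
```

Because trains cannot overtake one another, processing them in order is enough here: each train only needs to know when the train ahead of it cleared each station. A delay to one train therefore pushes back every train that catches up to it, which is the compounding effect described in step 4.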

Below is a visualization of a simulated run of the system, using dummy data, from East Falls

Church to Farragut West. The simulation time was set to 100 minutes, with new trains being

injected into the system approximately every four minutes. (In the graph below, the blue line

represents the actual trip time, while the gray line shows the forecasted time based on timetable

data). We can see the system runs smoothly until train 6, which experiences delays. These delays

reverberate throughout the system, with all subsequent trains running well off their forecast.

Simulation modeling on real-world delay data. To deliver on our promise to demonstrate the

potential impact of SafeTrack, we also developed a model that simulates days of trips for each

line. Although limited by the lack of robust data, this model can provide insight into the

possible impact of SafeTrack. Below is a description of how this model works.

1. The optimal runtime was set to be the expected number of minutes per day the line has

trains operating, based on timetable data from WMATA’s website.

2. The optimal runtime was interrupted by five tiers of delays. On each line, there is a certain

probability for different severities of delay. For example, on the blue line, there is a 20%

probability of a four-minute delay and a 50% probability of a 17-minute delay. After

inspecting density plots, histograms, and quartile breakdowns, we deemed five delay tiers

to be optimal. Our goal was to minimize the standard deviation in each tier, providing the

most stability in simulation design.

3. Since the source data did not appear to include compounding delays, we built those into

the delay data, based on how often trains are scheduled to run. For example, using a

weighted average, blue line trains are scheduled to run every thirteen minutes throughout

the day, which means that only tier 5 delays will spur compounding effects throughout the

system.

4. The model was set to simulate approximately 300 days of trips for each line.

5. To simulate the potential impact of improvements after SafeTrack, we adjusted the

probability and severity of the delay tiers, demonstrating what lines might look like under

different conditions. (The results of different scenarios are presented in the next section.)

6. Knowing the Metrorail system is stochastic, we also placed random noise into each line.

Random noise was based on a probability distribution set to inject noise that hovered

around 5% of a day’s expected runtime.
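The six steps above can be sketched as follows. This is a rough illustration under stated assumptions, not the team's model: the optimal runtime and the three middle tiers are placeholders, the compounding rule (a delay longer than the headway also holds the next train for the excess) is an assumption, the noise is assumed to be normally distributed around 5% of the optimal runtime, and scenario cuts to probability are assumed to be relative. Only the 13-minute headway and the (20%, 4-minute) and (50%, 17-minute) pairs come from the Blue Line example in the text.

```python
import random
import statistics


def simulate_days(optimal_minutes, tiers, headway, n_days=300,
                  noise_share=0.05, noise_spread=0.01, seed=0):
    """Return simulated minutes of trips per day; tiers are (probability, minutes)."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        runtime = float(optimal_minutes)             # step 1: timetable runtime
        for probability, delay in tiers:             # step 2: delay tiers
            if rng.random() < probability:
                runtime += delay
                # Step 3 (assumed rule): a delay longer than the headway also
                # holds up the following train by the excess.
                if delay > headway:
                    runtime += delay - headway
        # Step 6 (assumed distribution): noise centered on 5% of the runtime.
        runtime += rng.gauss(noise_share * optimal_minutes,
                             noise_spread * optimal_minutes)
        days.append(runtime)
    return days


def improve(tiers, probability_cut, minutes_cut):
    """Step 5: cut each tier's probability (relative) and severity (minutes)."""
    return [(p * (1 - probability_cut), max(0, d - minutes_cut))
            for p, d in tiers]


# Placeholder tiers; only the first and last pairs appear in the text.
tiers = [(0.20, 4), (0.15, 7), (0.10, 10), (0.05, 12), (0.50, 17)]
current = simulate_days(6400, tiers, headway=13)     # 6400 is a placeholder
better = simulate_days(6400, improve(tiers, 0.10, 1), headway=13)
for label, days in (("current", current), ("improved", better)):
    print(label, round(statistics.mean(days), 1), round(statistics.stdev(days), 1))
```

Keeping the tiers as plain (probability, minutes) pairs makes each post-SafeTrack scenario a one-line transformation of the current tiers.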

Though it provides an insightful basis, we acknowledge this model has limitations. First, delay

tiers, runtimes, and train “spacing” are based on simple or weighted averages. The system is more

nuanced, and these figures change over the course of a day. Second, knowing how much random

noise to inject into the system is tricky; we relied on the team’s gut instinct, which is not an optimal approach.

Throwing in too much or too little noise will certainly impact results. Third, the current model

does a poor job of accounting for highly severe delays, such as those greater than 30 minutes. Per

our source data, these delays are quite rare (less than 0.5% of all delays); still, they should be

included as a possibility. As we will discuss later, most of our key conclusions from this section

of the project stem from process rather than model results.

As we will discuss in the future research section, the optimal path forward will be to obtain real-

world data that can be used in our discrete-event simulation script (the first branch described

above). However, the simulation model developed on the “real” data we could find can still

provide intuition about SafeTrack’s impact.

Visualization and Presentation of Results:

Results for our simulation models using the “real” data are shown below. (All outcomes are

measured in minutes of trips per day). After running two improved scenarios (scenario 1 and

scenario 2), we realized that more drastic improvements were needed and, in some cases, that we

likely need to change the amount of random noise in the system. Therefore, we ran scenarios 3-5

and compared against the current scenario. Still, some scenarios did not show marked

improvements in mean performance, though standard deviations tended to fluctuate from model to

model. Note that we experimented with several different changes to the system and describe those

below. We were unable to find a forecasted improvement rate tied to SafeTrack, which

necessitated experimentation. Future modeling should take a more stringent approach from the

beginning and include more hypotheses to test.

Orange Line

Current:

Mean: 8121.386

Standard Deviation: 95.95478

Scenario 1: Reduced the probability of each delay tier by 10% and reduced the severity of each delay by one minute

Mean: 8117.496

Scenario 2: [...] each delay by two minutes

Mean: 8115.761


Scenario 3: Cut tier 5 delays in half

Mean: 8114.653


Scenario 4: For all tiers, cut probabilities by 30% and reduced each delay by three

minutes

Mean: 8104.472


Scenario 5: Eliminated tier 1 and cut other tiers in half in terms of probability and

severity

Mean: 8093.360


Silver Line

Current:

Mean: 9861.402

Scenario 1:

Mean: 9868.713

Scenario 2:

Mean: 9854.400

Scenario 3:

Mean: 9852.256


Scenario 4: Eliminated tiers 2 and 4; reduced probability of tiers 1 and 3 by 20% and

reduced severity by 2 minutes; cut tier 5 severity by 3 minutes

Mean: 9850.429


Scenario 5: Eliminated tiers 2 and 4; reduced probability of tiers 1 and 3 by 30% and

reduced severity by 3 minutes; cut tier 5 severity by 50%

Mean: 9848.057


Blue Line

Current:

Mean: 6811.311

Scenario 1:

Mean: 6815.635

Scenario 2:

Mean: 6816.787

Scenario 3:

Mean: 6809.531


Scenario 4: Reduced probabilities in each tier by 20% and cut severity by two minutes

Mean: 6810.966


Scenario 5: Reduced probabilities in each tier by 30% and cut severity by three minutes.

Reduced tier 5 by 50% in severity and 30% in probability.

Mean: 6809.322


In retrospect, it appears we might have injected too much random noise into the Blue Line.

Red Line

Current:

Mean: 11149.50

Scenario 1:

Mean: 11159.63

Scenario 2:

Mean: 11146.33

Scenario 3:

Mean: 11138.77


Scenario 4: For each tier, reduced probability by 30% and severity by three minutes

Mean: 11132.07


Scenario 5: Eliminated tier 1 and cut other tiers in half in terms of both probability and

severity

Mean: 11123.84


Green Line

Current (this version is only comparable to scenarios 1 and 2):

Mean: 6762.053


Current (Adjusted Noise; this version is only comparable to scenarios 3-5):

Mean: 6562.225

Scenario 1:

Mean: 6765.060

Scenario 2:

Mean: 6759.091

Scenario 3:

Mean: 6557.466


Scenario 4: For each tier, reduced probability by 20% and severity by 3 minutes

Mean: 6552.855


Scenario 5: For tiers 1-4, reduced probability by 20% and severity by 4 minutes. Cut

severity and probability of Tier 5 by 50%

Mean: 6540.793


Yellow Line

Current (this version is only comparable to scenarios 1 and 2):

Mean: 5280.572


Current (Adjusted Noise; this version is only comparable to scenarios 3-5):

Mean: 5025.293

Scenario 1:

Mean: 5261.651

Scenario 2:

Mean: 5262.043

Scenario 3:

Mean: 5020.811


Scenario 4: For each tier, reduced probabilities by 20% and severity by 2 minutes

Mean: 5014.868


Scenario 5: For tiers 1-4, reduced probabilities by 20% and severity by 3 minutes. Cut

tier 5 probability by 20% and severity in half.

Mean: 5013.925


Conclusions:

The development of software for a discrete-event simulation model on the Metrorail system

provides a solid framework for future research and modeling. It could allow Metro to either

1) insert hypothetical situations into the model and see what might happen to the system or

2) provide real-world, station-by-station data to more accurately model the current system and

adjust those parameters to understand the possible impact of track improvements.

The major conclusions from the results of the simulation model based on “real” data stem from

process and theory rather than the actual results. Again, the numeric outcomes should be viewed

tentatively, due to likely issues with the source data and the imprecision of injecting random noise

into the system. Below are our most important lessons learned:

1. Several improved scenarios of the lines showed little impact on system performance.

Essentially, it appears that modest improvements were sometimes not enough to

compensate for the stochastic nature of the Metrorail system. One item with which we

struggled was the amount of random noise to inject into the system. However, this struggle

provided us with a key learning: In a stochastic system, improvements must be drastic

enough so they are not canceled out by the random variation in the system. SafeTrack’s

improvements may not be noticed if they do not overcome the system’s random noise.

2. Recognizing bias and stating assumptions is key to data science. Throughout our project,

we recognized issues with our source data and that we needed to make certain

assumptions to run a model. We have attempted to be open and honest about such biases

and assumptions.

3. Likewise, the importance of accurate data cannot be overstated, and dealing with what

appears to be flawed data can be a tricky process.

4. Additional data sources could be used to create a more powerful model.
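The first lesson above can be made concrete with the Orange Line figures reported in the results. The sketch below compares the gap between two scenario means with the standard error of that gap. It rests on assumptions: each scenario is treated as an independent run of about 300 simulated days, and every scenario is assumed to share the standard deviation reported for the current scenario, since the other standard deviations are not available here.

```python
import math


def standard_error_of_difference(sd, n_days):
    """Standard error of the gap between two independent sample means."""
    return sd * math.sqrt(2 / n_days)


# Orange Line figures from the results section.
orange_sd = 95.95478          # reported for the current scenario only
n_days = 300                  # approximate number of simulated days
current_mean = 8121.386
scenario_means = {"Scenario 1": 8117.496, "Scenario 5": 8093.360}

se = standard_error_of_difference(orange_sd, n_days)
for name, mean in scenario_means.items():
    gap = current_mean - mean
    print(f"{name}: gap {gap:.2f} min, {gap / se:.2f} standard errors")
```

Under these assumptions, the Scenario 1 gap is smaller than one standard error, so it cannot be told apart from random variation, while the Scenario 5 gap is more than three standard errors.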

Future Research:

As hinted at multiple times above, a firm foundation for future work has been established.

Broadly, future work would involve finding accurate data to place into our station-to-station,

discrete-event simulation software program. This data could likely come from one of two sources:

WMATA’s API or WMATA authorities. If pulling trip data from the API, we would

be tasked with separating delays from random noise when inspecting train arrival times. With

the right data, we believe this software could be powerful in helping WMATA understand the

impact of different system issues and project what a certain improvement could mean.

Additionally, to improve the software, we would focus on adding a user interface that would

visualize results and print descriptive statistics for different scenarios. Lastly, it would likely be

beneficial to develop separate models for different times of the day and perhaps even for

different days of the week or months of the year.

This process has been incredibly rewarding and enlightening. We look forward to continuing to

build our data science skills to make a positive impact on our communities!
