An Open-Source Framework for Adaptive Traffic Signal Control
Wade Genders and Saiedeh Razavi

W. Genders was a Ph.D. student with the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: [email protected]). S. Razavi is an Associate Professor, Chair in Heavy Construction and Director of the McMaster Institute for Transportation & Logistics at the Department of Civil Engineering, McMaster University, Hamilton, Ontario, Canada (e-mail: [email protected]).
Abstract—Sub-optimal control policies in transportation systems negatively impact mobility, the environment and human health. Developing optimal transportation control systems at the appropriate scale can be difficult, as cities' transportation systems can be large, complex and stochastic. Intersection traffic signal controllers are an important element of modern transportation infrastructure where sub-optimal control policies can incur high costs to many users. Many adaptive traffic signal controllers have been proposed by the community, but research is lacking regarding their relative performance differences - which adaptive traffic signal controller is best remains an open question. This research contributes a framework for developing and evaluating different adaptive traffic signal controller models in simulation - both learning and non-learning - and demonstrates its capabilities. The framework is used first to investigate the performance variance of the modelled adaptive traffic signal controllers with respect to their hyperparameters, and second to analyze the performance differences between controllers with optimal hyperparameters. The proposed framework contains implementations of some of the most popular adaptive traffic signal controllers from the literature: Webster's, Max-pressure and Self-Organizing Traffic Lights, along with deep Q-network and deep deterministic policy gradient reinforcement learning controllers. This framework will aid researchers by accelerating their work from a common starting point, allowing them to generate results faster with less effort. All framework source code is available at https://github.com/docwza/sumolights.
Index Terms—traffic signal control, adaptive traffic signal control, intelligent transportation systems, reinforcement learning, neural networks.
I. INTRODUCTION

Cities rely on road infrastructure for transporting individuals, goods and services. Sub-optimal control policies incur environmental, human mobility and health costs. Studies observe that vehicles consume a significant amount of fuel accelerating, decelerating or idling at intersections [1]. Land transportation emissions are estimated to be responsible for one third of all mortality from fine particulate matter pollution in North America [2]. Globally, over three million deaths are attributed to air pollution per year [3]. In 2017, residents of three of the United States' biggest cities, Los Angeles, New
York and San Francisco, spent between three and four days on average delayed in congestion over the year, respectively costing 19, 33 and 10 billion USD from fuel and individual time waste [4]. It is paramount to ensure transportation systems are optimal to minimize these costs.
Automated control systems are used in many aspects of transportation systems. Intelligent transportation systems seek to develop optimal solutions in transportation using intelligence. Intersection traffic signal controllers are an important element of many cities' transportation infrastructure where sub-optimal solutions can contribute high costs. Traditionally, traffic signal controllers have functioned using primitive logic which can be improved. Adaptive traffic signal controllers can improve upon traditional traffic signal controllers by conditioning their control on current traffic conditions.
Traffic microsimulators such as SUMO [5], Paramics, VISSIM and Aimsun have become popular tools for developing and testing adaptive traffic signal controllers before field deployment. However, researchers interested in studying adaptive traffic signal controllers are often burdened with developing their own adaptive traffic signal control implementations de novo. This research contributes an adaptive traffic signal control framework, including Webster's, Max-pressure, Self-organizing traffic lights (SOTL), deep Q-network (DQN) and deep deterministic policy gradient (DDPG) implementations for the freely available SUMO traffic microsimulator to aid researchers in their work. The framework's capabilities are demonstrated by studying the effect of optimizing traffic signal controllers' hyperparameters and comparing optimized adaptive traffic signal controllers' relative performance.
II. BACKGROUND

A. Traffic Signal Control
An intersection is composed of traffic movements, or ways that a vehicle can traverse the intersection beginning from an incoming lane to an outgoing lane. Traffic signal controllers use phases, combinations of coloured lights that indicate when specific movements are allowed, to control vehicles at the intersection.

Fundamentally, a traffic signal control policy can be decoupled into two sequential decisions at any given time: what should the next phase be, and for how long? A variety of models have been proposed as policies. The simplest and most popular traffic signal controller determines the next phase by displaying the phases in an ordered sequence known as a cycle, where each phase in the cycle has a fixed, potentially unique, duration - this is known as a fixed-time, cycle-based traffic signal controller. Although simple, fixed-time, cycle-based traffic signal controllers are ubiquitous in
transportation networks because they are predictable, stable and effective, as traffic demands exhibit reliable patterns over regular periods (i.e., times of the day, days of the week). However, as ubiquitous as the fixed-time controller is, researchers have long sought to develop improved traffic signal controllers which can adapt to changing traffic conditions.
Actuated traffic signal controllers use sensors and boolean logic to create dynamic phase durations. Adaptive traffic signal controllers are capable of acyclic phase sequences and dynamic phase durations to adapt to changing intersection traffic conditions. Adaptive controllers attempt to achieve higher performance at the expense of complexity, cost and reliability. Various techniques have been proposed as the foundation for adaptive traffic signal controllers, from analytic mathematical solutions to heuristics and machine learning.
B. Literature Review
Developing an adaptive traffic signal controller ultimately requires some type of optimization technique. For decades, researchers have proposed adaptive traffic signal controllers based on a variety of techniques, such as evolutionary algorithms [6], [7], [8], [9], [10], [11] and heuristics such as pressure [12], [13], [14], immunity [15], [16] and self-organization [17], [18], [19]. Additionally, many comprehensive adaptive traffic signal control systems have been proposed, such as OPAC [20], SCATS [21], RHODES [22] and ACS-Lite [23].
Reinforcement learning has been demonstrated to be an effective method for developing adaptive traffic signal controllers in simulation [6], [24], [25], [26], [27], [28]. Recently, deep reinforcement learning has been used for adaptive traffic signal control with varying degrees of success [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. A comprehensive review of reinforcement learning adaptive traffic signal controllers is presented in Table I.
Readers interested in additional adaptive traffic signal control research can consult extensive review articles [40], [41], [42], [43], [44].
Although ample research exists proposing novel adaptive traffic signal controllers, it can be arduous to compare previously proposed ideas. Developing adaptive traffic signal controllers can be challenging, as many of them require defining many hyperparameters. The authors seek to address these problems by contributing an adaptive traffic signal control framework to aid researchers in their work.
C. Contribution
The authors' work contributes in the following areas:
• Diverse Adaptive Traffic Signal Controller Implementations: The proposed framework contributes adaptive traffic signal controllers based on a variety of paradigms, the broadest being non-learning (e.g., Webster's, SOTL, Max-pressure) and learning (e.g., DQN and DDPG). The diversity of adaptive traffic signal controllers allows researchers to experiment at their leisure without investing time developing their own implementations.
• Scalable, Optimized: The proposed framework is optimized for use with parallel computation techniques leveraging modern multicore computer architecture. This feature significantly reduces the compute time of learning-based adaptive traffic signal controllers and the generation of results for all controllers. By making the framework computationally efficient, the search for optimal hyperparameters is tractable with modest hardware (e.g., an 8 core CPU). The framework was designed to scale to develop adaptive controllers for any SUMO network.
All source code used in this manuscript can be retrieved from https://github.com/docwza/sumolights.
III. TRAFFIC SIGNAL CONTROLLERS
Before describing each traffic signal controller in detail, elements common to all are detailed. All of the included traffic signal controllers share the following: a set of intersection lanes L, decomposed into incoming lanes L_inc and outgoing lanes L_out, and a set of green phases P. The set of incoming lanes with green movements in phase p ∈ P is denoted as L_p,inc and their outgoing lanes as L_p,out.
A. Non-Learning Traffic Signal Controllers
1) Uniform: A simple cycle-based, uniform phase duration traffic signal controller is included for use as a baseline comparison to the other controllers. The uniform controller's only hyperparameter is the green duration u, which defines the same duration for all green phases; the next phase is determined by a cycle.
2) Webster's: Webster's method develops a cycle-based, fixed phase length traffic signal controller using phase flow data [53]. The authors propose an adaptive Webster's traffic signal controller by collecting data for a time interval W in duration and then using Webster's method to calculate the cycle and green phase durations for the next W time interval. This adaptive Webster's essentially uses the most recent W interval to collect data and assumes the traffic demand will be approximately the same during the next W interval. The selection of W is important and exhibits various trade-offs: smaller values allow for more frequent adaptations to changing traffic demands at the risk of instability, while larger values adapt less frequently but allow for increased stability. Pseudo-code for the Webster's traffic signal controller is presented in Algorithm 1.
In Algorithm 1, F represents the set of phase flows collected over the most recent W interval and R represents the total cycle lost time. In addition to the time interval hyperparameter W, the adaptive Webster's algorithm also has hyperparameters defining a minimum cycle duration c_min, maximum cycle duration c_max and lane saturation flow rate s.
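For illustration, a minimal Python sketch of this computation is given below, assuming Webster's standard optimal cycle formula C = (1.5R + 5)/(1 − Y) and an undersaturated intersection (Y < 1); the function and argument names are illustrative rather than the framework's API.

def websters_timings(flows, s, R, c_min, c_max):
    # Flow ratio y = f/s for each green phase's flow f.
    y = [f / s for f in flows]
    Y = sum(y)  # assumes Y < 1 (undersaturated)
    # Webster's optimal cycle duration, bounded to [c_min, c_max].
    C = (1.5 * R + 5.0) / (1.0 - Y)
    C = max(c_min, min(c_max, C))
    # Allocate the effective green time proportional to flow.
    G = C - R
    return C, [G * (y_p / Y) for y_p in y]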
3) Max-pressure: The Max-pressure algorithm develops an acyclic, dynamic phase length traffic signal controller. The Max-pressure algorithm models vehicles in lanes as a substance in a pipe and enacts control in a manner which attempts to maximize the relief of pressure between incoming
TABLE I
ADAPTIVE TRAFFIC SIGNAL CONTROL RELATED WORK
(Columns: Research, Network, Intersections, Multi-agent, RL, Function Approximation; e.g., [45]: Grid, 15, Max-plus, Model-based, N/A.)

Algorithm 1 Webster's Algorithm
1: procedure WEBSTER(F, R, c_min, c_max, s)
2:   #phase flow ratios
3:   Y = { f/s for f in F }
4:   #Webster's optimal cycle duration
5:   C = (1.5R + 5)/(1 − ∑Y)
6:   if C < c_min then
7:     C = c_min
8:   else if C > c_max then
9:     C = c_max
10:  end if
11:  G = C − R
12:  #allocate green time proportional to flow
13:  return C, { G(y/∑Y) for y in Y }
14: end procedure
and outgoing lanes [13]. For a given green phase p, the pressure is defined in (1).

Pressure(p) = ∑_{l ∈ L_p,inc} |V_l| − ∑_{l ∈ L_p,out} |V_l|   (1)

Where L_p,inc represents the set of incoming lanes with green movements in phase p, L_p,out represents the set of outgoing lanes from all incoming lanes in L_p,inc, and |V_l| is the number of vehicles on lane l.
Pseudo-code for the Max-pressure traffic signal controller is presented in Algorithm 2.
Algorithm 2 Max-pressure Algorithm
1: procedure MAXPRESSURE(g_min, t_p, P)
2:   if t_p < g_min then
3:     t_p = t_p + 1
4:   else
5:     t_p = 0
6:     #next phase has largest pressure
7:     return argmax({ Pressure(p) for p in P })
8:   end if
9: end procedure
In Algorithm 2, t_p represents the time spent in the current phase. The Max-pressure algorithm requires a minimum green time hyperparameter g_min which ensures a newly enacted phase has a minimum duration.
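A minimal Python sketch of the pressure computation and phase selection follows; lane_counts maps a lane to |V_l|, the minimum green constraint of Algorithm 2 is assumed to be handled by the caller, and all names are illustrative.

def pressure(p, inc_lanes, out_lanes, lane_counts):
    # Pressure of phase p: vehicles on its incoming lanes minus
    # vehicles on the corresponding outgoing lanes (Eq. 1).
    return (sum(lane_counts[l] for l in inc_lanes[p])
            - sum(lane_counts[l] for l in out_lanes[p]))

def max_pressure_phase(P, inc_lanes, out_lanes, lane_counts):
    # Enact the green phase with the largest pressure.
    return max(P, key=lambda p: pressure(p, inc_lanes, out_lanes, lane_counts))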
4) Self Organizing Traffic Lights: Self-organizing traffic lights (SOTL) [17], [18], [19] develop a cycle-based, dynamic phase length traffic signal controller based on self-organizing principles, where a "...self-organizing system would be one in which elements are designed to dynamically and autonomously solve a problem or perform a function at the system level." [18, p. 2].
Pseudo-code for the SOTL traffic signal controller is presented in Algorithm 3.
Algorithm 3 SOTL Algorithm
1: procedure SOTL(t_p, g_min, θ, ω, µ)
2:   #accumulate red phase vehicles time integral
3:   κ = κ + ∑_{l ∈ L_inc − L_p,inc} |V_l|
4:   if t_p > g_min then
5:     #vehicles approaching in current green phase
6:     #< ω distance of stop line
7:     n = ∑_{l ∈ L_p,inc} |V_l|
8:     #only consider phase change if no platoon
9:     #or too large n > µ
10:    if n > µ or n == 0 then
11:      if κ > θ then
12:        κ = 0
13:        #next phase in cycle
14:        i = i + 1
15:        return P_{i mod |P|}
16:      end if
17:    end if
18:  end if
19: end procedure
The SOTL algorithm functions by changing lights according to a vehicle-time integral threshold θ, constrained by a minimum green phase duration g_min. Additionally, small (i.e., n < µ) vehicle platoons are kept together by preventing a phase change if they are sufficiently close (i.e., at a distance < ω) to the stop line.
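A minimal Python sketch of Algorithm 3 as a stateful controller is shown below; the caller is assumed to supply, at each second, the vehicle count on red-phase incoming lanes and the count of green-phase vehicles within ω of the stop line. All names are illustrative.

class SOTL:
    # Sketch of Algorithm 3; green_count is assumed to be restricted
    # to vehicles within distance omega of the stop line.
    def __init__(self, phases, g_min, theta, mu):
        self.phases, self.g_min = phases, g_min
        self.theta, self.mu = theta, mu
        self.kappa, self.i, self.t_p = 0, 0, 0

    def step(self, red_count, green_count):
        self.t_p += 1
        # Integrate vehicle-time on lanes currently served red.
        self.kappa += red_count
        if self.t_p > self.g_min:
            # Change only if there is no platoon (n == 0) or it is too
            # large (n > mu); small platoons are kept together.
            if green_count > self.mu or green_count == 0:
                if self.kappa > self.theta:
                    self.kappa, self.t_p = 0, 0
                    self.i += 1  # next phase in the cycle
                    return self.phases[self.i % len(self.phases)]
        return None  # keep the current phase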
B. Learning Traffic Signal Controllers
Reinforcement learning uses the framework of Markov Decision Processes to solve goal-oriented, sequential decision-making problems by repeatedly acting in an environment.
At discrete points in time t, a reinforcement learning agent observes the environment state s_t and then uses a policy π to determine an action a_t. After implementing its selected action, the agent receives feedback from the environment in the form of a reward r_t and observes a new environment state s_{t+1}. The reward quantifies how 'well' the agent is achieving its goal (e.g., score in a game, completed tasks). This process is repeated until a terminal state s_terminal is reached, and then begins anew. The return G_t = ∑_{k=0}^{T} γ^k r_{t+k} is the accumulation of rewards by the agent over some time horizon T, discounted by γ ∈ [0, 1). The agent seeks to maximize the expected return E[G_t] from each state s_t. The agent develops an optimal policy π* to maximize the return.
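For example, with γ = 0.9, T = 2 and illustrative rewards r_t = −4, r_{t+1} = −2, r_{t+2} = −1, the return is G_t = −4 + (0.9)(−2) + (0.9)²(−1) = −6.61.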
There are many techniques for an agent to learn the optimal policy; however, most of them rely on estimating value functions. Value functions are useful to estimate future rewards. State value functions V^π(s) = E[G_t | s_t = s] represent the expected return starting from state s and following policy π. Action value functions Q^π(s, a) = E[G_t | s_t = s, a_t = a] represent the expected return starting from state s, taking action a and following policy π. In practice, value functions are unknown and must be estimated using sampling and function approximation techniques. Parametric function approximation, such as a neural network, uses a set of parameters θ to estimate an unknown function f(x|θ) ≈ f(x). To develop accurate approximations, the function parameters must be developed with some optimization technique.
Experiences are tuples e_t = (s_t, a_t, r_t, s_{t+1}) that represent an interaction between the agent and the environment at time t. A reinforcement learning agent interacts with its environment in trajectories, or sequences of experiences e_t, e_{t+1}, e_{t+2}, .... Trajectories begin in an initial state s_init and end in a terminal state s_terminal. To accurately estimate value functions, experiences are used to optimize the parameters. If neural network function approximation is used, the parameters are optimized using experiences to perform gradient-based techniques and backpropagation [54], [55]. Additional technical details regarding the proposed reinforcement learning adaptive traffic signal controllers can be found in the Appendix.
To train reinforcement learning controllers for all intersections, a distributed acting, centralized learning architecture is developed [56], [57], [58]. Using parallel computing, multiple actors and learners are created, illustrated in Figure 1. Actors have their own instance of the traffic simulation and neural networks for all intersections. Learners are assigned a subset of all intersections; for each they have a neural network and an experience replay buffer D. Actors generate experiences e_t for all intersections and send them to the appropriate learner. Learners only receive experiences for their assigned subset of intersections. The learner stores the experiences in an experience replay buffer, which is uniformly sampled for batches to optimize the neural network parameters. After computing parameter updates, learners send new parameters to all actors.
There are many benefits to this architecture, foremost being that it makes the problem feasible; because there are hundreds of agents, distributing computation across many actors and learners is necessary to decrease training time. Another benefit is experience diversity, granted by multiple environments and varied exploration rates.
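The communication pattern can be sketched with Python's multiprocessing queues; the following toy is only illustrative of the architecture, with random payloads standing in for experiences and a round-robin intersection-to-learner assignment, not the framework's actual implementation.

import multiprocessing as mp
import random

def actor(queues, n_tsc, steps=100):
    # One actor: runs its (stand-in) simulation and routes each
    # intersection's experience to the learner assigned to it.
    for t in range(steps):
        for tsc in range(n_tsc):
            e = (tsc, t, random.random())      # stand-in for (s, a, r, s')
            queues[tsc % len(queues)].put(e)   # round-robin assignment
    for q in queues:
        q.put(None)                            # shutdown signal for the demo

def learner(learner_id, queue):
    # One learner: buffers experiences for its assigned intersections; a
    # real learner samples batches, updates parameters and broadcasts them.
    replay = []
    for e in iter(queue.get, None):
        replay.append(e)
    print(f"learner {learner_id} stored {len(replay)} experiences")

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(2)]
    procs = [mp.Process(target=actor, args=(queues, 4))]
    procs += [mp.Process(target=learner, args=(i, q)) for i, q in enumerate(queues)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()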
C. DQN
The proposed DQN traffic signal controller enacts control by choosing the next green phase without utilizing a phase cycle. This acyclic architecture is motivated by the observation that enacting phases in a repeating sequence may contribute to a sub-optimal control policy. After the DQN has selected the next phase, it is enacted for a fixed duration known as an action repeat a_repeat. After the phase has been enacted for the action repeat duration, a new phase is selected acyclically.
1) State: The proposed state observation for the DQN is a combination of the most recent green phase and the density and queue of incoming lanes at the intersection at time t. Assume each intersection has a set L of incoming lanes and a set P of green phases. The state space is then defined as S ∈ (R^{2|L|} × B^{|P|+1}). The density and queue of each lane are normalized to the range [0, 1] by dividing by the lane's jam density k_j. The most recent phase is encoded as a one-hot vector B^{|P|+1}, where the plus one encodes the all-red clearance phase.
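As an illustration, a minimal Python sketch of constructing this state vector is given below; the function and argument names are illustrative rather than the framework's API.

import numpy as np

def dqn_state(densities, queues, phase_idx, n_phases, jam_density):
    # Normalize per-lane density and queue to [0, 1] by the jam density.
    dens = np.asarray(densities, dtype=np.float32) / jam_density
    que = np.asarray(queues, dtype=np.float32) / jam_density
    # One-hot encode the most recent phase; index n_phases is all-red.
    phase = np.zeros(n_phases + 1, dtype=np.float32)
    phase[phase_idx] = 1.0
    return np.concatenate([dens, que, phase])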
2) Action: The proposed action space for the DQN traffic signal controller is the next green phase. The DQN selects one action from a discrete set, in this model one of the possible green phases a_t ∈ P. After a green phase has been selected, it is enacted for a duration equal to the action repeat a_repeat.
3) Reward: The reward used to train the DQN traffic signal controller is a function of vehicle delay. Delay d is the difference in time between a vehicle's free-flow travel time and actual travel time. Specifically, the reward is the negative sum of all vehicles' delay at the intersection, defined in (2):

r_t = −∑_{v ∈ V} d_t^v   (2)

Where V is the set of all vehicles on incoming lanes at the intersection, and d_t^v is the delay of vehicle v at time t. Defined in this way, the reward is a punishment, with the agent's goal to minimize the amount of punishment it receives. Each intersection saves the reward with the largest magnitude experienced to perform minimum reward normalization r_t/|r_min|, scaling the reward to the range [−1, 0] for stability.
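A minimal sketch of computing this reward with SUMO's TraCI Python API is shown below; it uses the common stepwise approximation of delay as time lost relative to free-flow speed, which is an assumption, and the function name and normalization handling are illustrative.

import traci  # SUMO's TraCI Python API

def intersection_reward(incoming_lanes, r_min):
    # Sum each vehicle's per-step delay, approximated as the time lost
    # relative to free-flow speed, then negate (Eq. 2).
    total_delay = 0.0
    for lane in incoming_lanes:
        v_free = traci.lane.getMaxSpeed(lane)
        for veh in traci.lane.getLastStepVehicleIDs(lane):
            total_delay += 1.0 - traci.vehicle.getSpeed(veh) / v_free
    r = -total_delay
    # Normalize by the largest reward magnitude observed so far
    # (tracked by the caller) to scale into [-1, 0].
    return r / abs(r_min) if r_min != 0 else r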
4) Agent Architecture: The agent approximates the action-value Q function with a deep artificial neural network. The action-value function Q is two hidden layers of 3|s_t| fully connected neurons with exponential linear unit (ELU) activation functions, and the output layer is |P| neurons with linear activation functions. The Q function's input is the local intersection state s_t. A visualization of the DQN is presented in Fig. 1.
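A Keras sketch of this Q-network might look as follows; the builder function name is illustrative, and He initialization follows the technical appendix.

import tensorflow as tf

def build_dqn(state_size, n_phases):
    # Two hidden layers of 3|s_t| ELU neurons and |P| linear outputs,
    # one Q-value estimate per green phase.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(3 * state_size, activation='elu',
                              kernel_initializer='he_normal',
                              input_shape=(state_size,)),
        tf.keras.layers.Dense(3 * state_size, activation='elu',
                              kernel_initializer='he_normal'),
        tf.keras.layers.Dense(n_phases, activation='linear'),
    ])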
D. DDPG Traffic Signal Controller
The proposed DDPG traffic signal controller implements a cycle with dynamic phase durations. This architecture is motivated by the observation that cycle-based policies can
Fig. 1. Adaptive traffic signal control DDPG and DQN neural network agents (left) and distributed acting, centralized learning architecture (right) composed of actors and learners. Each actor has one SUMO network as an environment and neural networks for all intersections. Each learner is assigned a subset of intersections at the beginning of training and is only responsible for computing parameter updates for its assigned intersections, effectively distributing the computational load for learning. However, learners distribute parameter updates to all actors.
maintain fairness and ensure a minimum quality of service between all intersection users. Once the next green phase has been determined using the cycle, the policy π is used to select its duration. Explicitly, the reinforcement learning agent is learning how long in duration to make the next green phase in the cycle to maximize its return. Additionally, the cycle skips phases when no vehicles are present on incoming lanes.
1) Actor State: The proposed state observation for the actor is a combination of the current phase and the density and queue of incoming lanes at the intersection at time t. The state space is then defined as S ∈ (R^{2|L|} × B^{|P|+1}). The density and queue of each lane are normalized to the range [0, 1] by dividing by the lane's jam density k_j. The current phase is encoded as a one-hot vector B^{|P|+1}, where the plus one encodes the all-red clearance phase.
2) Critic State: The proposed state observation for the critic combines the state s_t and the actor's action a_t, depicted in Figure 1.
3) Action: The proposed action space for the adaptive traffic signal controller is the duration of the next green phase in seconds. The action controls the duration of the next phase; there is no agency over what the next phase is, only how long it will last. The DDPG algorithm produces a continuous output, a real number over some range a_t ∈ R. Since the DDPG algorithm outputs a real number and the phase duration is defined in intervals of seconds, the output is rounded to the nearest integer. In practice, phase durations are bounded by minimum time g_min and maximum time g_max hyperparameters to ensure a minimum quality of service for all users. Therefore the agent selects an action {a_t ∈ Z | g_min ≤ a_t ≤ g_max} as the next phase duration.
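As a sketch, one way to realize this bounding and rounding is an affine map from the actor's tanh output; the affine form is an assumption, as the text only specifies rounding and the [g_min, g_max] bounds.

def scale_action(tanh_out, g_min, g_max):
    # Map the actor's tanh output in [-1, 1] to an integer green phase
    # duration in [g_min, g_max] seconds.
    duration = g_min + 0.5 * (tanh_out + 1.0) * (g_max - g_min)
    return int(round(duration))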
4) Reward: The reward used to train the DDPG traffic signal controller is the same delay reward used by the DQN traffic signal controller, defined in (2).
5) Agent Architecture: The agent approximates the policy π and action-value Q function with deep artificial neural networks. The policy function is two hidden layers of 3|s_t| fully connected neurons, each with batch normalization and ELU activation functions, and the output layer is one neuron with a hyperbolic tangent activation function. The action-value function Q is two hidden layers of 3(|s_t| + |a_t|) fully connected neurons with batch normalization and ELU activation functions, and the output layer is one neuron with a linear activation function. The policy's input is the intersection's local traffic state s_t, and the action-value function's input is the local state concatenated with the local action s_t + a_t. The action-value Q function also uses an L2 weight regularization of λ = 0.01.
By deep reinforcement learning standards the networks used are not that deep; however, their architecture is selected for simplicity and they can easily be modified within the framework. Simple deep neural networks were also implemented to allow for future scalability, as the proposed framework can be deployed to any SUMO network - to reduce the computational load the default networks are simple.
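A Keras sketch of these two networks follows; the builder names are illustrative, while layer sizes and regularization follow the description above.

import tensorflow as tf

def build_actor(state_size):
    # pi(s|phi): two hidden layers of 3|s_t| units with batch
    # normalization + ELU, one tanh output for the phase duration.
    s = tf.keras.Input(shape=(state_size,))
    x = s
    for _ in range(2):
        x = tf.keras.layers.Dense(3 * state_size)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('elu')(x)
    out = tf.keras.layers.Dense(1, activation='tanh')(x)
    return tf.keras.Model(s, out)

def build_critic(state_size, action_size=1):
    # Q(s, a|theta): state concatenated with action, two batch-normalized
    # ELU hidden layers, linear output, L2 weight regularization (0.01).
    s = tf.keras.Input(shape=(state_size,))
    a = tf.keras.Input(shape=(action_size,))
    x = tf.keras.layers.Concatenate()([s, a])
    for _ in range(2):
        x = tf.keras.layers.Dense(
            3 * (state_size + action_size),
            kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('elu')(x)
    q = tf.keras.layers.Dense(1, activation='linear')(x)
    return tf.keras.Model([s, a], q)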
IV. EXPERIMENTS
A. Hyperparameter Optimization
To demonstrate the capabilities of the proposed framework, experiments are conducted on optimizing adaptive traffic signal control hyperparameters. The framework is for use with the SUMO traffic microsimulator [5], which was used to evaluate the developed adaptive traffic signal controllers. Understanding how sensitive any specific adaptive traffic signal controller's performance is to changes in hyperparameters is important to instill confidence that the solution is robust. Determining optimal hyperparameters is necessary to ensure a balanced comparison between adaptive traffic signal control methods.
Fig. 2. Two intersection SUMO network (intersections gneJ0 and gneJ6) used for hyperparameter experiments. In addition to this two intersection network, a single, isolated intersection is also included with the framework.
Using the hyperparameter optimization script included in the framework, a grid search is performed over the implemented controllers' hyperparameters on a two intersection network, shown in Fig. 2, under a simulated three hour dynamic traffic demand scenario. The results for each traffic signal controller are displayed in Fig. 3 and collectively in Fig. 4.
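The grid search itself amounts to exhaustively evaluating the Cartesian product of candidate hyperparameter values; a minimal sketch is shown below, with an illustrative SOTL-style grid and a stand-in evaluate function representing a batch of seeded SUMO simulations (the framework's own script differs in detail).

import itertools

def grid_search(grid, evaluate):
    # evaluate(config) -> mean travel time over seeded simulations.
    results = {}
    for values in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        results[values] = evaluate(config)
    # Return the configuration with the lowest travel time.
    return min(results, key=results.get)

# Illustrative SOTL-style grid; values here are examples only.
sotl_grid = {
    'g_min': [5, 10, 15],          # minimum green time (s)
    'theta': [50, 100, 200, 400],  # vehicle-time integral threshold
    'mu': [3, 5, 7],               # platoon size limit
}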
As can be observed in Fig. 3 and Fig. 4, the choice of hyperparameters significantly impacts the performance of the given traffic signal controller. As a general trend observed in Fig. 3, methods with larger numbers of hyperparameters (e.g., SOTL, DDPG, DQN) exhibit greater performance variance than methods with fewer hyperparameters (e.g., Max-pressure). Directly comparing methods in Fig. 4 demonstrates the non-learning adaptive traffic signal control methods' (e.g., Max-pressure, Webster's) robustness to hyperparameter values and high performance (i.e., lowest travel time). Learning-based methods exhibit higher variance with changes in hyperparameters, DQN more so than DDPG. In the following section, the best hyperparameters for each adaptive traffic signal controller will be used to further investigate and compare performance.
B. Optimized Adaptive Traffic Signal Controllers
Using the optimized hyperparameters, all traffic signal controllers are subjected to an additional 32 simulations with random seeds to estimate their performance, quantified using network travel time and individual intersection queue and delay measures of effectiveness (MoE). Results are presented in Fig. 5 and Fig. 6.
Observing the travel time boxplots in Fig. 5, the SOTL controller produces the worst results, exhibiting a mean travel time almost twice that of the next closest method, with many significant outliers. The Max-pressure algorithm achieves the best performance, with the lowest mean and median along with the lowest standard deviation. The DQN, DDPG, Uniform and Webster's controllers achieve approximately equal performance; however, the DQN controller has significant outliers, indicating some vehicles experience much longer travel times than most.
Each intersection's queue and delay MoE with respect to each adaptive traffic signal controller is presented in Fig. 6. The results are consistent with previous observations from the hyperparameter search and travel time data; however, the reader's attention is directed to comparing the performance of DQN and DDPG in Fig. 6. The DQN controller performs poorly (i.e., high queues and delay) at the beginning and end of the simulation when traffic demand is low. However, at the demand peak, the DQN controller performs just as well, if not a little better, than every method except the Max-pressure controller. Simultaneously considering the DDPG controller, its performance is the opposite of the DQN controller's. The DDPG controller achieves relatively low queues and delay at the beginning and end of the simulation and then is bested by the DQN controller in the middle of the simulation when the demand peaks. This performance difference can potentially be understood by considering the difference between the DQN and DDPG controllers. The DQN's ability to select the next phase acyclically under high traffic demand may allow it to reduce queues and delay more than the cycle-constrained DDPG controller. However, it is curious that under low demands the DQN controller's performance suffers when it should be relatively simple to develop the optimal policy. The DQN controller may be overfitting to the periods in the environment when the magnitude of the rewards is large (i.e., in the middle of the simulation when the demand peaks) and converging to a policy that doesn't generalize well to the environment when the traffic demand is low. The authors present these findings to readers and suggest future research investigate this and other issues to understand the performance differences between reinforcement learning traffic signal controllers. Understanding the advantages and disadvantages of a variety of controllers can provide insight into developing future improvements.
V. CONCLUSION & FUTURE WORK
Learning and non-learning adaptive traffic signal controllers have been developed within an optimized framework for the traffic microsimulator SUMO for use by the research community. The proposed framework's capabilities were demonstrated by studying adaptive traffic signal control algorithms' sensitivity to hyperparameters: performance was found to be sensitive for hyperparameter-rich controllers (i.e., learning) and relatively insensitive for hyperparameter-sparse controllers (i.e., heuristics). Poor hyperparameters can drastically alter the performance of an adaptive traffic signal controller, leading researchers to erroneous conclusions about an adaptive traffic signal controller's performance. This research provides evidence that dozens or hundreds of hyperparameter configurations may have to be tested before selecting the optimal one.
Using the optimized hyperparameters, each adaptive controller's performance was estimated, and the Max-pressure controller was found to achieve the best performance, yielding the lowest travel times, queues and delay. This manuscript's research provides evidence that heuristics can offer powerful solutions even compared to complex deep-learning methods. This is not to suggest that this is definitively the case in all environments and circumstances. The authors hypothesize that learning-based controllers can be further developed to offer
Fig. 3. Individual hyperparameter results for each traffic signal controller (panels: Uniform, SOTL, DDPG, Max-pressure, DQN and Webster's; axes: mean travel time (s) versus standard deviation of travel time (s)). Travel time is used as a measure of effectiveness and is estimated for each hyperparameter from eight simulations with random seeds in units of seconds (s). The coloured dots gradient from green (best) to red (worst) orders the hyperparameters by the sum of the travel time mean and standard deviation. Note the differing scales between graph axes, making direct visual comparison biased.
Fig. 4. Comparison of all traffic signal controllers' hyperparameter travel time performance (mean travel time (s) versus standard deviation of travel time (s) for DDPG, DQN, Max-pressure, SOTL, Uniform and Webster's). Note both the vertical and horizontal axis limits have been clipped at 200 to improve readability.
improved performance that may yet best the non-learning, heuristic-based methods detailed in this research. Promising extensions that have improved reinforcement learning in other applications, and may do the same for adaptive traffic signal control, include richer function approximators [59], [60], [61] and reinforcement learning algorithms [62], [63], [64].
The authors intend for the framework to grow, with the addition of more adaptive traffic signal controllers and features. In its current state, the framework can already help adaptive traffic signal control researchers rapidly experiment on a SUMO network of their choice. Acknowledging the importance of optimizing our transportation systems, the authors hope this research helps others solve practical problems.
Fig. 5. Boxplots depicting the distribution of travel times for each traffic signal controller, with (µ, σ, median) travel times in seconds of DDPG (72, 34, 65), DQN (78, 46, 66), Max-pressure (59, 21, 54), SOTL (158, 169, 85), Uniform (79, 37, 74) and Webster's (71, 30, 66). The solid white line represents the median, the solid coloured box the interquartile range (IQR) (i.e., from first (Q1) to third quartile (Q3)), solid coloured lines Q1−1.5IQR and Q3+1.5IQR, and coloured crosses the outliers.
Fig. 6. Comparison of traffic signal controllers' individual intersection (gneJ0 and gneJ6) queue and delay measures of effectiveness over time (min), in units of vehicles (veh) and seconds (s). Solid coloured lines represent the mean and shaded areas represent the 95% confidence interval. SOTL has been omitted to improve readability since its queue and delay values are exclusively outside the graph range.
APPENDIX A
DQN
Deep Q-Networks [65] combine Q-learning and deep neural networks to produce autonomous agents capable of solving complex tasks in high-dimensional environments. Q-learning [66] is a model-free, off-policy, value-based temporal difference [67] reinforcement learning algorithm which can be used to develop an optimal discrete action space policy for a given problem. Like other temporal difference algorithms, Q-learning uses bootstrapping [68] (i.e., using an estimate to improve future estimates) to develop an action-value function Q(s, a) which can estimate the expected return of taking action a in state s and acting optimally thereafter. If the Q function can be estimated accurately, it can be used to derive the optimal policy π* = argmax_a Q(s, a). In DQN the Q function is approximated with a deep neural network. The DQN algorithm
utilizes two techniques to ensure stable development of the Q function with DNN function approximation - a target network and experience replay. Two parameter sets are used when training a DQN, online θ and target θ′. The target parameters θ′ are used to stabilize the return estimates when performing updates to the neural network and are periodically changed to the online parameters θ′ = θ at a fixed interval. The experience replay is a buffer which stores the most recent D experience tuples to create a slowly changing dataset. The experience replay is uniformly sampled for experience batches to update the Q function online parameters θ.
Training a deep neural network requires a loss function, which is used to determine how to change the parameters to achieve better approximations of the training data. Reinforcement learning develops value functions (e.g., neural networks) using experiences from the environment.
The DQN is trained with the mean squared error between the return G_t target y_t and the prediction, defined in (3); the gradient of this loss is used to update the online parameters.

y_t = r_t + γQ(s_{t+1}, argmax_a Q(s_{t+1}, a|θ′)|θ′)
L_DQN(θ) = (y_t − Q(s_t, a_t|θ))²   (3)
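A minimal sketch of computing this target for a batch, assuming a Keras-style target network with a predict method, is shown below; terminal states mask out the bootstrap term.

import numpy as np

def dqn_targets(rewards, next_states, dones, q_target_net, gamma=0.99):
    # y_t = r_t + gamma * max_a Q(s_{t+1}, a | theta'), with the
    # bootstrap term zeroed at terminal states.
    next_q = q_target_net.predict(next_states).max(axis=1)
    return rewards + gamma * (1.0 - dones) * next_q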
APPENDIX B
DDPG

Deep deterministic policy gradients [69] are an extension of DQN to continuous action spaces. Similar to Q-learning, DDPG is a model-free, off-policy reinforcement learning algorithm. The DDPG algorithm is an example of actor-critic learning, as it develops a policy function π(s|φ) (i.e., actor) using an action-value function Q(s, a|θ) (i.e., critic). The actor interacts with the environment and modifies its behaviour based on feedback from the critic.
The DDPG critic is trained with the mean squared error between the return G_t target y_t and the prediction, defined in (4).

y_t = r_t + γQ(s_{t+1}, π(s_{t+1}|φ′)|θ′)
L_Critic(θ) = (y_t − Q(s_t, a_t|θ))²   (4)
The DDPG actor's loss is the sampled policy gradient, defined in (5), taken with respect to the policy parameters φ.

∇_φ L_Actor = ∇_φ Q(s_t, π(s_t|φ)|θ)   (5)
Like DQN, DDPG uses two sets of parameters, online θ and target θ′, and experience replay [70] to reduce instability during training. DDPG performs updates on the parameters of both the actor and critic by uniformly sampling batches of experiences from the replay. The target parameters are slowly updated towards the online parameters according to θ′ = (1−τ)θ′ + τθ after every batch update.
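A sketch of this soft target update for Keras-style networks follows; τ = 0.001 is shown only as a typical value, not the value used in this work.

def soft_update(target_net, online_net, tau=0.001):
    # theta' <- (1 - tau) * theta' + tau * theta after every batch update.
    new_weights = [(1.0 - tau) * wt + tau * w
                   for wt, w in zip(target_net.get_weights(),
                                    online_net.get_weights())]
    target_net.set_weights(new_weights)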
APPENDIX C
TECHNICAL
Software used includes SUMO 1.2.0 [5], Tensorflow 1.13 [71], SciPy [72] and public code [73]. The neural network parameters were initialized with He initialization [74] and optimized using Adam [75].
To ensure intersection safety, two second yellow change and three second all-red clearance phases were inserted between all green phase transitions. For the DQN and DDPG traffic signal controllers, if no vehicles are present at the intersection, the phase defaults to all-red, which is considered a terminal state s_terminal. Each intersection's state observation is bounded by 150 m (i.e., the queue and density are calculated from vehicles up to a maximum of 150 m from the intersection stop line).
ACKNOWLEDGMENTS
This research was enabled in part by support in the form of computing resources provided by SHARCNET (www.sharcnet.ca), their McMaster University staff and Compute Canada (www.computecanada.ca).
REFERENCES

[1] L. Wu, Y. Ci, J. Chu, and H. Zhang, “The influence of intersections on fuel consumption in urban arterial road traffic: a single vehicle test in Harbin, China,” PLoS ONE, vol. 10, no. 9, 2015.
[2] R. A. Silva, Z. Adelman, M. M. Fry, and J. J. West, “The impact of individual anthropogenic emissions sectors on the global burden of human mortality due to ambient air pollution,” Environmental Health Perspectives, vol. 124, no. 11, p. 1776, 2016.
[3] World Health Organization et al., “Ambient air pollution: A global assessment of exposure and burden of disease,” 2016.
[4] G. Cookson, “INRIX global traffic scorecard,” INRIX, Tech. Rep., 2018.
[5] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of SUMO - Simulation of Urban MObility,” International Journal On Advances in Systems and Measurements, vol. 5, no. 3&4, pp. 128–138, December 2012.
[6] S. Mikami and Y. Kakazu, “Genetic reinforcement learning for cooperative traffic signal control,” in Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on. IEEE, 1994, pp. 223–228.
[7] J. Lee, B. Abdulhai, A. Shalaby, and E.-H. Chung, “Real-time optimization for adaptive traffic signal control using genetic algorithms,” Journal of Intelligent Transportation Systems, vol. 9, no. 3, pp. 111–122, 2005.
[8] H. Prothmann, F. Rochner, S. Tomforde, J. Branke, C. Müller-Schloer, and H. Schmeck, “Organic control of traffic lights,” in International Conference on Autonomic and Trusted Computing. Springer, 2008, pp. 219–233.
[9] L. Singh, S. Tripathi, and H. Arora, “Time optimization for traffic signal control using genetic algorithm,” International Journal of Recent Trends in Engineering, vol. 2, no. 2, p. 4, 2009.
[10] E. Ricalde and W. Banzhaf, “Evolving adaptive traffic signal controllers for a real scenario using genetic programming with an epigenetic mechanism,” in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017, pp. 897–902.
[11] X. Li and J.-Q. Sun, “Signal multiobjective optimization for urban traffic network,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 11, pp. 3529–3537, 2018.
[12] T. Wongpiromsarn, T. Uthaicharoenpong, Y. Wang, E. Frazzoli, and D. Wang, “Distributed traffic signal control for maximum network throughput,” in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 588–595.
[13] P. Varaiya, “The max-pressure controller for arbitrary networks of signalized intersections,” in Advances in Dynamic Network Modeling in Complex Transportation Systems. Springer, 2013, pp. 27–66.
[14] J. Gregoire, X. Qian, E. Frazzoli, A. De La Fortelle, and T. Wongpiromsarn, “Capacity-aware backpressure traffic signal control,” IEEE Transactions on Control of Network Systems, vol. 2, no. 2, pp. 164–173, 2015.
[15] S. Darmoul, S. Elkosantini, A. Louati, and L. B. Said, “Multi-agent immune networks to control interrupted flow at signalized intersections,” Transportation Research Part C: Emerging Technologies, vol. 82, pp. 290–313, 2017.
[16] A. Louati, S. Darmoul, S. Elkosantini, and L. ben Said, “An artificial immune network to control interrupted flow at a signalized intersection,” Information Sciences, vol. 433, pp. 70–95, 2018.
[17] C. Gershenson, “Self-organizing traffic lights,” arXiv preprint nlin/0411066, 2004.
[18] S.-B. Cools, C. Gershenson, and B. D'Hooghe, “Self-organizing traffic lights: A realistic simulation,” in Advances in Applied Self-Organizing Systems. Springer, 2013, pp. 45–55.
[19] S. Goel, S. F. Bush, and C. Gershenson, “Self-organization in traffic lights: Evolution of signal control with advances in sensors and communications,” arXiv preprint arXiv:1708.07188, 2017.
[20] N. Gartner, “A demand-responsive strategy for traffic signal control,” Transportation Research Record, vol. 906, pp. 75–81, 1983.
[21] P. Lowrie, “SCATS, Sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic,” 1990.
[22] P. Mirchandani and L. Head, “A real-time traffic signal control system: architecture, algorithms, and analysis,” Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
[23] F. Luyanda, D. Gettman, L. Head, S. Shelby, D. Bullock, and P. Mirchandani, “ACS-Lite algorithmic architecture: applying adaptive control system technology to closed-loop traffic signal control systems,” Transportation Research Record: Journal of the Transportation Research Board, no. 1856, pp. 175–184, 2003.
[24] T. L. Thorpe and C. W. Anderson, “Traffic light control using SARSA with three state representations,” Citeseer, Tech. Rep., 1996.
[25] E. Bingham, “Reinforcement learning in neurofuzzy traffic signal control,” European Journal of Operational Research, vol. 131, no. 2, pp. 232–241, 2001.
[26] B. Abdulhai, R. Pringle, and G. J. Karakoulas, “Reinforcement learning for true adaptive traffic signal control,” Journal of Transportation Engineering, vol. 129, no. 3, pp. 278–285, 2003.
[27] L. Prashanth and S. Bhatnagar, “Reinforcement learning with function approximation for traffic signal control,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2011.
[28] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, 2013.
[29] T. Rijken, “DeepLight: Deep reinforcement learning for signalised traffic control,” Master's thesis, University College London, 2015.
[30] E. van der Pol, “Deep reinforcement learning for coordination in traffic light control,” Master's thesis, University of Amsterdam, 2016.
[31] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016.
[32] W. Genders and S. Razavi, “Using a deep reinforcement learning agent for traffic signal control,” arXiv preprint arXiv:1611.01142, 2016, https://arxiv.org/abs/1611.01142.
[33] M. Aslani, M. S. Mesgari, and M. Wiering, “Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events,” Transportation Research Part C: Emerging Technologies, vol. 85, pp. 732–752, 2017.
[34] S. S. Mousavi, M. Schukat, P. Corcoran, and E. Howley, “Traffic light control using deep policy-gradient and value-function based reinforcement learning,” arXiv preprint arXiv:1704.08883, 2017.
[35] X. Liang, X. Du, G. Wang, and Z. Han, “Deep reinforcement learning for traffic light control in vehicular networks,” arXiv preprint arXiv:1803.11115, 2018.
[36] ——, “A deep reinforcement learning network for traffic light cycle control,” IEEE Transactions on Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, 2019.
[37] S. Wang, X. Xie, K. Huang, J. Zeng, and Z. Cai, “Deep reinforcement learning-based traffic signal control using high-resolution event-based data,” Entropy, vol. 21, no. 8, p. 744, 2019.
[38] W. Genders and S. Razavi, “Asynchronous n-step Q-learning adaptive traffic signal control,” Journal of Intelligent Transportation Systems, vol. 23, no. 4, pp. 319–331, 2019.
[39] T. Chu, J. Wang, L. Codecà, and Z. Li, “Multi-agent deep reinforcement learning for large-scale traffic signal control,” IEEE Transactions on Intelligent Transportation Systems, 2019.
[40] A. Stevanovic, Adaptive Traffic Control Systems: Domestic and Foreign State of Practice, 2010, no. Project 20-5 (Topic 40-03).
[41] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Design of reinforcement learning parameters for seamless application of adaptive traffic signal control,” Journal of Intelligent Transportation Systems, vol. 18, no. 3, pp. 227–245, 2014.
[42] S. Araghi, A. Khosravi, and D. Creighton, “A review on computational intelligence methods for controlling traffic signal timing,” Expert Systems with Applications, vol. 42, no. 3, pp. 1538–1550, 2015.
[43] P. Mannion, J. Duggan, and E. Howley, “An experimental review of reinforcement learning algorithms for adaptive traffic signal control,” in Autonomic Road Transport Support Systems. Springer, 2016, pp. 47–66.
[44] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, “A survey on reinforcement learning models and algorithms for traffic signal control,” ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017.
[45] L. Kuyer, S. Whiteson, B. Bakker, and N. Vlassis, “Multiagent reinforcement learning for urban traffic control using coordination graphs,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2008, pp. 656–671.
[46] J. C. Medina and R. F. Benekohal, “Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy,” in Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012, pp. 596–601.
[47] M. Abdoos, N. Mozayani, and A. L. Bazzan, “Holonic multi-agent system for traffic signals control,” Engineering Applications of Artificial Intelligence, vol. 26, no. 5, pp. 1575–1587, 2013.
[48] M. A. Khamis and W. Gomaa, “Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework,” Engineering Applications of Artificial Intelligence, vol. 29, pp. 134–151, 2014.
[49] T. Chu, S. Qu, and J. Wang, “Large-scale traffic grid signal control with regional reinforcement learning,” in American Control Conference (ACC), 2016. IEEE, 2016, pp. 815–820.
[50] N. Casas, “Deep deterministic policy gradient for urban traffic light control,” arXiv preprint arXiv:1703.09035, 2017.
[51] W. Liu, G. Qin, Y. He, and F. Jiang, “Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks dynamic clustering,” IEEE Transactions on Vehicular Technology, vol. 66, no. 10, pp. 8667–8681, 2017.
[52] W. Genders, “Deep reinforcement learning adaptive traffic signal control,” Ph.D. dissertation, McMaster University, 2018.
[53] F. Webster, “Traffic signal settings, Road Research Technical Paper No. 39,” Road Research Laboratory, 1958.
[54] S. Linnainmaa, “Taylor expansion of the accumulated rounding error,” BIT Numerical Mathematics, vol. 16, no. 2, pp. 146–160, 1976.
[55] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, p. 533, 1986.
[56] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
[57] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, “Distributed prioritized experience replay,” arXiv preprint arXiv:1803.00933, 2018.
[58] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., “IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561, 2018.
[59] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 449–458.
[60] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[61] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” arXiv preprint arXiv:1806.06923, 2018.
[62] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1-2, pp. 181–211, 1999.
[63] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[64] S. Sharma, A. S. Lakshminarayanan, and B. Ravindran, “Learning to repeat: Fine grained action repetition for deep reinforcement learning,” arXiv preprint arXiv:1702.06054, 2017.
[65] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[66] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[67] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[68] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.
[69] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[70] L.-J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.
[71] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[72] E. Jones, T. Oliphant, P. Peterson et al., “SciPy: Open source scientific tools for Python,” 2001, http://www.scipy.org/.
[73] P. Tabor, 2019. [Online]. Available: https://github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/PolicyGradient/DDPG/pendulum/tensorflow/ddpg_orig_tf.py
[74] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[75] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Wade Genders earned a Software B.Eng. & Society in 2013, Civil M.A.Sc. in 2014 and Civil Ph.D. in 2018 from McMaster University. His research interests include traffic signal control, intelligent transportation systems, machine learning and artificial intelligence.
Saiedeh Razavi is the inaugural Chair in Heavy Construction, Director of the McMaster Institute for Transportation and Logistics and Associate Professor at the Department of Civil Engineering at McMaster University. Dr. Razavi has a multidisciplinary background and considerable experience in collaborating on and leading national and international multidisciplinary team-based projects in sensing and data acquisition, sensor technologies, data analytics, data fusion and their applications in safety, productivity, and mobility of transportation, construction, and other systems. She combines several years of industrial experience with academic teaching and research. Her formal education includes degrees in Computer Engineering (B.Sc.), Artificial Intelligence (M.Sc.) and Civil Engineering (Ph.D.). Her research, funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) as well as the Ministry of Transportation of Ontario, focuses on connected and automated vehicles, on smart and connected work zones and on computational models for improving safety and productivity of highway construction. Dr. Razavi brings together the private and public sectors with academia for the development of high quality research in smarter mobility, construction and logistics. She has received several awards, including the McMaster Students Union Merit Award for Teaching, the Faculty of Engineering Team Excellence Award, and the Construction Industry Institute best poster award.