Cooperative Adaptive Cruise Control Performance Analysis

Qi Sun

Ecole Centrale de Lille, 2016. English. NNT: 2016ECLI0020. HAL Id: tel-01491026, https://tel.archives-ouvertes.fr/tel-01491026, submitted on 16 Mar 2017.
Global vehicle production has risen significantly with the development of the automobile industry in recent years. [44] reported that 41 million cars were produced around the world in the year 2000 alone; in 2005, 47 million cars were produced worldwide; and in 2015, almost 70 million passenger cars were produced, as seen in Fig. 1.1. The exceptions were 2008 and 2009, when car sales dried up on account of the economic crisis. Due to increased demand, especially from Asian markets, the volume of automobiles sold is now back to pre-crisis levels. Passenger car sales are expected to keep increasing, to about 100 million units worldwide in 2017. China is ranked as the largest passenger car manufacturer in the world, having produced more than 18 million cars in 2013, accounting for more than 22 percent of the world's passenger vehicle production. Transport infrastructure investment is projected to grow at an average annual rate of about 5% worldwide over the period 2014 to 2025. Roads will likely remain the biggest area of investment, especially in growth markets. This is partly due to the rise in prosperity and, hence, car ownership in developing countries (Figure 1.2).
Figure 1.1 – Worldwide automobile production from 2000 to 2015 (in million vehicles)
Alongside this growth, on the one hand, we benefit from vehicles in different respects. In Europe, for example, road transport accounts for the largest share of intra-EU transport. The share of EU-28¹ inland freight transported by road (74.9%) was more than four times as high as the share transported by rail (18.2%), while the remainder (6.9%) of the freight transported in the EU-28 in 2013 was carried along inland waterways. The total inland freight transport in the EU-28 was over 2,200 billion tonne-kilometers in 2013 [35]. Passenger cars accounted for 83.2% of inland passenger transport in the EU-28 in 2013, with motor coaches, buses and trolley buses (9.2%) and trains (7.6%) each accounting for less than a tenth of all traffic [36].

¹ EU-28: The European Union (EU) was established on 1 November 1993 with 12 Member States. Their number has grown to the present 28 on 1 July 2013, through a series of enlargements.

Figure 1.2 – Cumulative transport infrastructure investment (in trillion dollars)
On the other hand, we have to face a growing set of traffic problems:
• Accidents and safety. Rising traffic volumes have produced a growing number of accidents and fatalities. Nearly 1.3 million people die in road crashes each year, on average 3,287 deaths a day, and 20-50 million are injured or disabled. A large proportion of accidents are caused by incorrect driving behaviors, such as violating regulations, speeding, driving while fatigued and drunk driving.
• Congestion. Traffic jams are a very common transport problem in urban agglomerations, usually due to the lag between infrastructure construction and increasing vehicle ownership. Other causes include improper traffic signal timing, inappropriate road construction and accidents.
• Environmental impacts. Noise pollution and air pollution are by-products of road transportation systems, especially in metropolises where vehicles are considerably concentrated. Smog produced by vehicles, industry and heating facilities harms people's health. The exhaust from incomplete combustion when a vehicle is stuck in congestion is even more polluting.
• Loss of public space. In order to deal with the congestion and parking difficulties caused by the increasing number of vehicles, streets are widened and parking areas are built, which takes away space for public activities such as markets, parades and community interactions.
As can be seen from the 2004 White Paper, the European Commission set the ambitious aim of decreasing the number of road traffic fatalities by 2014. Much progress has been achieved: the total number of fatalities in road traffic accidents decreased by 45% between 2004 and 2014 at the level of the EU-28 (Figure 1.3). Road mobility nevertheless comes at a high price in terms of lives lost: in 2014, slightly over 25 thousand persons lost their lives in road accidents within the EU-28. A general trend towards fewer road traffic fatalities has long been observed in all countries in Europe. However, at the level of the EU, this downward trend has come to a standstill, as the total number of fatalities registered in 2014 remained at the same level as in 2013 [37].
Figure 1.3 – Total number of fatalities in road traffic accidents, EU-28
A solution to the traffic problems is to build adequate highways and streets.
However, it is becoming increasingly difficult to build additional highways, for both financial and environmental reasons. Data show that the traffic capacity added every year by construction lags behind the annual increase in traffic demand, making congestion increasingly worse. Therefore, the solution must lie in other approaches, one of which is to optimize the use of highway and fuel resources and provide safe and comfortable transportation while having minimal impact on the environment. It is a great challenge to develop vehicles that can satisfy these diverse and often conflicting requirements. To meet this challenge, the new approach of "Intelligent Transportation Systems" (ITS) has shown its potential to increase safety, reduce congestion and improve driving conditions. Early studies show that it is possible to cut accidents by 18%, gas emissions by 15%, and fuel consumption by 12% by employing the ITS approach [161].
1.2. Intelligent Transportation Systems
1.2.1. Definition of ITS
A concept transportation system named "Futurama" was exhibited at the 1940 World's Fair in New York; this marked the origin of the Intelligent Transportation System (ITS). After many research efforts and projects between 1980 and 1990 in Europe, North America and Japan, today's mainstream ITS took shape. ITS is a transport system comprising an advanced information and telecommunications network for users, roads and vehicles. By sharing vital information, ITS allows people to get more from transport networks, with greater safety and efficiency and less impact on the environment. The conceptual principle of ITS is illustrated in Figure 1.4.
For example, [64] designed a road ITS architecture for commercial vehicles. This system reduces fuel consumption through fuel-saving advice, maintains driver and vehicle safety with remote vehicle diagnostics, and enables drivers to access information more conveniently. Generally speaking, an ITS system is organized in the following layers, see Figure 1.5:
Figure 1.4 – Conceptual principle of ITS
• Information collection: This layer employs vehicle terminals together with roadside surveillance equipment, including vehicle sensors, CCTV cameras, intelligent vehicle identification, etc. Meanwhile, it enables information exchange with other units and infrastructures, such as the parking information system, the dynamic bus information center, the police radio station, the traffic division dispatch center and the freeway bureau center.
• Communication: This layer ensures real-time, secure and reliable transmission between the other layers via different networks, such as 3G/4G, Wi-Fi, Bluetooth, wired networks and optical fiber.
• Information processing: In this layer, diverse applications using various technologies are implemented, such as cloud computing, data analytics, information processing and artificial intelligence. Vehicle services are supported by a cloud-based back-end platform that has a network connection to vehicles and runs advanced data analytics applications. Different categories of services can be supplied, including collision notification, roadside rescue, remote diagnostics and position monitoring.
• Information publishing and strategy execution: In this layer, each individual vehicle transmits information on its state and control strategy to the different centers. These centers are therefore able to publish traffic conditions, manage all connected vehicles and execute comprehensive strategies based on the collected information in different situations, e.g. lane changes, traffic lights and intersections, freeways, etc.
Figure 1.5 – Instance for road ITS system layout
1.2.2. ITS applications
Although ITS may refer to all modes of transport, EU Directive 2010/40/EU (7 July 2010) defines ITS as systems in which information and communication technologies are applied in the field of road transport, including infrastructure, vehicles and users, in traffic management and mobility management, and in interfaces with other modes of transport, see Figure 1.6. ITS is in fact a broad system encompassing a wide range of technologies and diverse activities.
Figure 1.6 – ITS applications
• Adaptive Cruise Control (ACC): ACC systems perform longitudinal control by actuating the throttle and brakes so as to maintain a desired spacing from the preceding vehicle. A significant benefit of ACC is the avoidance of rear-end collisions. The SeiSS study reported that it could prevent up to 4,000 accidents in Europe in 2010 if only 3% of vehicles were equipped [3].
• Lane Change Assistant (LCA): The LCA checks for obstacles in a vehicle's course when the driver intends to change lanes. The same study estimated that 1,500 accidents could be avoided in 2010 given a penetration rate of only 0.6%, while a penetration rate of 7% in 2020 would lead to 14,000 fewer accidents.
• Collision Avoidance (CA): A CA system operates like a cruise control system, maintaining a constant desired speed in the absence of preceding vehicles. If a preceding vehicle appears, the CA system judges whether the operating speed is safe; if not, it reduces the throttle and/or applies the brakes to slow the vehicle down, while a warning is provided to the driver.
• Drive-by-wire: This technology replaces the traditional mechanical and hydraulic control systems with electronic control systems using electromechanical actuators and human-machine interfaces such as pedal and steering-feel emulators. The benefits of applying electronic technology are improved performance, safety and reliability with reduced manufacturing and operating costs. Some sub-systems using "by-wire" technology have already appeared in new car models.
• Vehicle navigation system: This typically uses a GPS navigation device to acquire position data and locate the user on a road in the unit's map database. Using the road database, the unit can give directions to other locations along roads also in its database.
• Emergency vehicle notification systems: The in-vehicle eCall is generated either manually by the vehicle occupants or automatically via activation of in-vehicle sensors after an accident. When activated, the in-vehicle eCall device establishes an emergency call carrying both voice and data directly to the nearest emergency point. The voice call enables the vehicle occupants to communicate with the trained eCall operator. At the same time, data about the incident is sent to the eCall operator receiving the voice call, including time, precise location, the direction the vehicle was traveling, and vehicle identification.
• Automatic road enforcement: A traffic enforcement camera system, consisting
of a camera and a vehicle-monitoring device, is used to detect and identify
vehicles disobeying a speed limit or some other road legal requirement and
automatically ticket offenders based on the license plate number. Traffic tick-
ets are sent by mail.
• Variable speed limits: Recently some jurisdictions have begun experimenting with variable speed limits that change with road congestion and other factors. Typically such speed limits are only lowered during poor conditions, rather than raised in good ones. Initial results indicated savings in journey times, smoother-flowing traffic, and a fall in the number of accidents.

Figure 2.6 – The convergence of both the value function and the policy to their optima
Algorithm 1: Policy Iteration [151]
Require: An MDP model ⟨S, A, T, r, γ⟩
  /* Initialization */
  t = 0, k = 0
  ∀s ∈ S: initialize π_t(s) with an arbitrary action
  ∀s ∈ S: initialize V_k(s) with an arbitrary value
  repeat
    /* Policy evaluation */
    repeat
      ∀s ∈ S: V_{k+1}(s) = r(s, π_t(s)) + γ Σ_{s'∈S} T(s, π_t(s), s') V_k(s')
      k ← k + 1
    until ∀s ∈ S: |V_k(s) − V_{k−1}(s)| < ε
    /* Policy improvement */
    ∀s ∈ S: π_{t+1}(s) = arg max_{a∈A} [r(s, a) + γ Σ_{s'∈S} T(s, a, s') V_k(s')]
    t ← t + 1
  until π_t = π_{t−1}
  π* = π_t
  return An optimal policy π*
Policy evaluation (Steps 5 to 8) consists of calculating the value of policy π_{t+1} by solving equations (2.19) for all states s ∈ S. An efficient iterative way to solve these equations is to initialize the value function of π_{t+1} with the value function V_k of the previous policy, and then repeat the update operation until the values converge.
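The pseudocode above can be condensed into a short Python sketch for a finite MDP. This is an illustrative reconstruction, not code from the thesis; the array layout (transitions T[s, a, s'] and rewards R[s, a] as NumPy arrays) and the tolerance eps are assumptions.

import numpy as np

def policy_iteration(T, R, gamma=0.95, eps=1e-6):
    """Policy iteration on a finite MDP (Algorithm 1).

    T: (S, A, S) transition probabilities; R: (S, A) rewards.
    """
    n_states, _ = R.shape
    pi = np.zeros(n_states, dtype=int)    # arbitrary initial policy
    V = np.zeros(n_states)                # arbitrary initial value function
    while True:
        # Policy evaluation: repeat the Bellman backup for the fixed policy
        while True:
            T_pi = T[np.arange(n_states), pi]                  # (S, S)
            V_new = R[np.arange(n_states), pi] + gamma * T_pi @ V
            converged = np.max(np.abs(V_new - V)) < eps
            V = V_new
            if converged:
                break
        # Policy improvement: act greedily with respect to the evaluated values
        Q = R + gamma * T @ V                                  # (S, A)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):                         # policy is stable
            return pi, V
        pi = pi_new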
Figure 3.4 – String stability comparison of ACC and two CACC functionalities with different transmission delays: ACC (dashed black), conventional CACC (black) and TVACACC, in which the second vehicle is shown in black and the remaining vehicles in color
the normal situation. Instead, string stability is maintained in the TVACACC case. The TVACACC model uses not only the input from vehicle i−1 but also that from vehicle i−2, so that communication degradation from vehicle i−1 has less influence, which leads to better string stability.
3.4.2. Comparison of ACC, CACC and TVACACC
The string stability transfer function of the conventional CACC system is the same as that of the second vehicle in the TVACACC system, as mentioned above. Therefore, the transfer function is the same as equation 3.26:
\[
\|\Gamma_{\mathrm{CACC}}(j\omega)\|_{\mathcal{L}_2} = \frac{\|D(s) + G(s)K(s)\|_{\mathcal{L}_2}}{\|H(s)\,(1 + G(s)K(s))\|_{\mathcal{L}_2}} \tag{3.28}
\]
Moreover, the ACC system is easily obtained by setting the transmission delay block D(s) = 0, because there is no transmission between the host and its preceding vehicle. The transfer function of ACC is then derived as
\[
\|\Gamma_{\mathrm{ACC}}(j\omega)\|_{\mathcal{L}_2} = \frac{\|G(s)K(s)\|_{\mathcal{L}_2}}{\|H(s)\,(1 + G(s)K(s))\|_{\mathcal{L}_2}} \tag{3.29}
\]
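As a quick numeric check of these criteria, the following Python sketch evaluates |Γ(jω)| on a frequency grid and tests sup_ω |Γ(jω)| ≤ 1. The forms G(s) = 1/(s²(τs+1)), K(s) = k_p + k_d s, H(s) = 1 + hs and D(s) = e^{−θs} are assumptions consistent with the parameters used in this chapter, not a verbatim copy of the thesis implementation.

import numpy as np

tau, kp, kd, h, theta = 0.1, 0.2, 0.7, 0.5, 0.2   # parameters of this chapter
w = np.logspace(-2, 2, 2000)                       # frequency grid (rad/s)
s = 1j * w

G = 1.0 / (s**2 * (tau * s + 1.0))   # assumed vehicle longitudinal dynamics
K = kp + kd * s                      # assumed PD feedback controller
H = 1.0 + h * s                      # constant-time-headway spacing policy
D = np.exp(-theta * s)               # V2V transmission delay

gamma_cacc = (D + G * K) / (H * (1.0 + G * K))   # equation (3.28)
gamma_acc = (G * K) / (H * (1.0 + G * K))        # equation (3.29)

for name, g in (("CACC", gamma_cacc), ("ACC", gamma_acc)):
    peak = np.max(np.abs(g))
    verdict = "string stable" if peak <= 1.0 else "string unstable"
    print(f"{name}: sup |Gamma(jw)| = {peak:.4f} -> {verdict}")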
In the case of a transmission delay θ = 0.2 s, the frequency domain responses of the ACC and conventional CACC systems are represented by the dashed black and solid black lines in Figure 3.4a, respectively. It is clearly shown that for an ACC system, the disturbance amplifies in the upstream direction of the platoon, with worse effects on the remaining vehicles. On the contrary, thanks to V2V technology, the conventional CACC system as well as the proposed TVACACC system guarantee string stability.
If the transmission delay degrades to θ = 1 s, the string stability is illustrated in Figure 3.4b. In this situation, the same platoon using the conventional CACC system is no longer string stable: the amplification of the disturbance is almost the same as in the ACC system. The ACC system itself is unchanged, as no V2V transmission is used. Therefore, if transmission degradation occurs, the CACC functionality degrades, and if the transmission delay continues to increase, the performance may become worse than that of the ACC system.
In the case of the TVACACC system, therefore, an increased traffic flow and a decreased disturbance are obtained compared to the conventional CACC system. Moreover, it performs better in the face of an increasing transmission delay than the existing CACC system.
3.5. Simulation tests
To validate the theoretical results and demonstrate the feasibility of the conventional and proposed CACC functionalities, a series of simulations is carried out within a platoon of vehicles equipped with V2V communication. The simulations show whether the disturbance of the leading vehicle is attenuated upstream through the platoon, which is defined as string stability in chapter 2. Therefore, the vehicles' velocity and acceleration are selected as string stability performance measures. Results in both normal and degraded situations will be shown.
To validate the theory of the proposed model in the previous sections, a stop-and-go scenario is chosen, because it is the most dangerous of all possible situations in longitudinal control. The platoon is composed of six CACC-equipped vehicles, which are assumed to share identical characteristics. The platoon starts in steady state at a speed of 30 m/s (108 km/h). At t = 10 s, the leading vehicle of the platoon brakes with a deceleration of −5 m/s², and at t = 30 s it reaccelerates at 2 m/s² until regaining the initial velocity of 30 m/s.
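The sketch below reproduces this scenario with a simplified platoon model: double-integrator vehicles with a first-order actuation lag and a CTH-based feedback law. The controller structure, gains and standstill gap are illustrative stand-ins for the systems compared in this section.

import numpy as np

n, dt, t_end = 6, 0.01, 60.0
tau, kp, kd, h, r0 = 0.1, 0.2, 0.7, 0.5, 5.0   # lag, gains, headway, standstill gap

def leader_accel(t, v):
    if 10.0 <= t < 30.0 and v > 0.0:   # braking phase
        return -5.0
    if t >= 30.0 and v < 30.0:         # reacceleration phase
        return 2.0
    return 0.0

pos = -np.arange(n) * (r0 + h * 30.0)   # steady-state initial spacing
vel = np.full(n, 30.0)
acc = np.zeros(n)
for k in range(int(t_end / dt)):
    t = k * dt
    u = np.zeros(n)
    u[0] = leader_accel(t, vel[0])
    for i in range(1, n):
        e = (pos[i - 1] - pos[i] - r0) - h * vel[i]   # CTH spacing error
        de = (vel[i - 1] - vel[i]) - h * acc[i]
        u[i] = kp * e + kd * de                       # ACC-like feedback
    acc += dt * (u - acc) / tau    # first-order actuation lag
    vel += dt * acc
    pos += dt * vel
print("final velocities:", np.round(vel, 2))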
3.5.1. Comparison of ACC, CACC and TVACACC
The conventional CACC and ACC systems are introduced in Figure 3.5(a) and (c) to allow a clear comparison with the TVACACC system. Each vehicle follows its preceding vehicle, respecting a safe distance with a headway time of 0.5 s. The transmission delay of the input u_i is set to 0.2 s for the CACC system, while there is no V2V communication in the ACC system. It can clearly be seen that the simulation results correspond to the theoretical analysis shown in Figure 3.4a. Under the designed condition, the platoon equipped with the conventional CACC system is string stable: the influence of the acceleration disturbance decreases in the upstream direction. However, the ACC system is not string stable under the same condition. The further a following vehicle is from the leading vehicle, the greater its acceleration and deceleration responses. String stability is a crucial criterion for CACC systems, as it ensures the following vehicles' safety and low fuel cost. On the contrary, string instability results in larger accelerations and decelerations in the stop-and-go scenario, which is the case for the ACC system here. If there are more vehicles in the platoon, the last vehicle will suffer hard braking and acceleration to catch up with the platoon, possibly beyond its physical limits, which is not only harmful for the overall traffic flow, safety and comfort, but may also result in rear-end collisions. That is the reason why the conventional ACC system requires a greater headway time to guarantee string stability, which means lower traffic flow.
The simulation of the proposed TVA-CACC system in the same scenario with the same parameters is shown in Figure 3.5(b). It is clear that string stability is obtained, as with the conventional CACC system, i.e. the acceleration and deceleration disturbances decrease in the upstream direction. Moreover, the acceleration response is smaller, which means better string stability. The result corresponds to the theoretical analysis in Figure 3.4a. Therefore, with the proposed system, better traffic flow and a safer and more comfortable driving experience are obtained, compared to the conventional one-vehicle-ahead CACC system.
Figure 3.5 – Acceleration response of a platoon in the stop-and-go scenario using the conventional CACC system (a), TVA-CACC system (b) and ACC system (c), with a communication delay of 0.2 s
3.5.2. Increased transmission delay
In this subsection, it is assumed that the CACC systems suffer from transmission delay. Instead of the normal delay of 0.2 s, the degraded transmission delay is 1 s. In Figure 3.6, it is clearly seen that the conventional CACC system is badly degraded by the increased transmission delay, compared to the normal situation shown in Figure 3.5. The acceleration response overshoots and increases in the upstream direction, which means the system is string unstable. These results correspond to the theoretical analysis of string stability. One solution to regain string stability is to increase the headway time, which however decreases the traffic flow. On the contrary, in the case of the TVA-CACC system, the acceleration disturbance still attenuates in the upstream direction, i.e., string stability is maintained in the degraded situation. However, the acceleration response of the same vehicle slightly increases, which means the string is less stable than in the normal transmission situation. And if the transmission is delayed even further, the proposed CACC system cannot guarantee string stability; the threshold according to simulation is about 2.5 s.
Figure 3.6 – Acceleration response of a platoon in the stop-and-go scenario using the conventional CACC system (a) and the TVACACC system (b), with a communication delay of 1 s
3.6. Conclusion
In this chapter, we concentrated on vehicle longitudinal control system design. The spacing policy and its associated control law were designed under the constraint of string stability. The CTH spacing policy was adopted to determine the desired spacing from the preceding vehicle. It was shown that the proposed TVA-CACC system could ensure string stability. In addition, through the comparisons between the TVACACC and the conventional CACC and ACC systems, we found obvious advantages of the proposed system in improving traffic capacity, especially in high-density traffic conditions. The above proposed longitudinal control system was validated to be effective.
It is shown in [146] that, in order to satisfy p(a), the covariance C_u(τ) of the white noise input u in equation 4.5 is

\[
C_u(\tau) = 2\alpha\sigma_a^2\,\delta(\tau) \tag{4.8}
\]

where δ is the unit impulse function. As a result, the random variable a in equation 4.5, satisfying a probability density function p(a) with variance σ_a², is described with a white noise input u(t) satisfying equation 4.8.
Using the acceleration model 4.5, the corresponding equation of motion can be
described in the state space as
\[
\dot{x}(t) = Ax(t) + Bu(t) \tag{4.9}
\]
\[
y(t) = Cx(t) \tag{4.10}
\]
where x^T = [s v a], in which s, v and a represent the host vehicle's position, velocity and acceleration, respectively. The vector y^T = [s v] is the output of the model, which in practice is measured by the vehicle's onboard sensors. The matrices A, B and C are defined as
\[
A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & -\alpha \end{bmatrix}, \quad
B = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \quad
C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \tag{4.11}
\]
Note that the state equation 4.9 closely resembles the vehicle dynamics model
in equation 3.7 when replacing α by 1/τ.
The model 4.9 is used as a basis for estimating the object vehicle's acceleration by means of a Kalman filter. To design this observer, the state-space model 4.9 is extended to include a process noise term w(t), representing model uncertainty, and a measurement noise term v(t), yielding
\[
\dot{x}(t) = Ax(t) + w(t) \tag{4.12}
\]
\[
y(t) = Cx(t) + v(t) \tag{4.13}
\]
The input u(t) in equation 4.9, which was assumed to be white noise, is included in 4.12 by choosing w(t) = Bu(t). v(t) is a white noise signal with covariance matrix R = E[v(t)v^T(t)], as determined by the noise parameters of the onboard sensor used in the implementation of the observer. Furthermore, using equation 4.8, the continuous-time process noise covariance matrix Q = E[w(t)w^T(t)] is equal to
\[
Q = B\,E[u(t)u^T(t)]\,B^T = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 2\alpha\sigma_a^2 \end{bmatrix} \tag{4.14}
\]
With the given Q and R matrices, the following continuous-time observer is obtained:
\[
\dot{\hat{x}}(t) = A\hat{x} + K(y - C\hat{x}) \tag{4.15}
\]
where \hat{x} is the estimate of the object vehicle state x^T = [s v a], K is the continuous-time Kalman filter gain matrix, and y is the measurement vector, consisting of the position s and velocity v of the object vehicle. This observer provides a basis for the design of the fallback control strategy, as explained in the following subsection.
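A sketch of this observer design is given below: it computes the steady-state Kalman gain from the matrices of equations (4.11) and (4.14) via the dual Riccati equation. The measurement noise variances reuse the values assumed in section 4.3.3, and σ_a² is an assumed placeholder, since the acceleration variance is scenario-dependent.

import numpy as np
from scipy.linalg import solve_continuous_are

alpha, sigma_a2 = 1.25, 1.0          # maneuver bandwidth, accel. variance (assumed)
sigma_d2, sigma_dv2 = 0.029, 0.029   # distance / relative-speed sensor noise

A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, -alpha]])
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Q = np.diag([0.0, 0.0, 2.0 * alpha * sigma_a2])   # process noise, eq. (4.14)
R = np.diag([sigma_d2, sigma_dv2])                # measurement noise

# Steady-state Kalman gain: solve the filter Riccati equation (dual form),
# then K = P C^T R^{-1}.
P = solve_continuous_are(A.T, C.T, Q, R)
K = P @ C.T @ np.linalg.inv(R)
print("Kalman gain K =\n", np.round(K, 3))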
4.3.2. DTVACACC
The fallback CACC strategy, hereafter referred to as Degraded Two-Vehicle-Ahead CACC (DTVACACC), uses the observer 4.15 to estimate the acceleration a_{i−1} of the preceding vehicle when the communication between the host and its nearest preceding vehicle is degraded. However, the measurement y in equation 4.15, containing the absolute vehicle position and velocity, is not available; instead, the onboard sensor of the host vehicle provides the inter-vehicle distance and relative velocity. Consequently, the estimation algorithm needs to be adapted, as described below.
When the transmission of a_{i−1} is lost or badly degraded, the observer 4.15 is described in the Laplace domain by a transfer function T(s), which takes the actual position s_{i−1} and velocity v_{i−1} of the preceding vehicle, contained in the measurement vector y, as inputs. The output of T(s) is the estimate \hat{a}_{i−1} of the preceding vehicle's acceleration, i.e., the third element of the estimated state. This yields the estimator
\[
\hat{a}_{i-1}(s) = T(s) \begin{bmatrix} s_{i-1}(s) \\ v_{i-1}(s) \end{bmatrix} \tag{4.16}
\]
where \hat{a}_{i−1}(s) denotes the Laplace transform of \hat{a}_{i−1}(t), and s_{i−1}(s) and v_{i−1}(s) are the Laplace transforms of s_{i−1}(t) and v_{i−1}(t), respectively. Moreover, the estimator transfer function T(s) is derived from equation 4.15:
\[
T(s) = \bar{C}\,(sI - A + KC)^{-1} K \tag{4.17}
\]

where \bar{C} = [0\; 0\; 1] selects the acceleration component of the estimated state.
The second step involves a transformation to relative coordinates, using the relations
\[
s_{i-1}(s) = d_i(s) + s_i(s) \tag{4.18}
\]
\[
v_{i-1}(s) = \Delta v_i(s) + v_i(s) \tag{4.19}
\]
where \Delta v_i(s) denotes the Laplace transform of the relative velocity \Delta v_i(t) = \dot{d}_i(t). Substituting 4.18 and 4.19 into 4.16, we obtain
\[
\hat{a}_{i-1}(s) = T(s) \begin{bmatrix} d_i(s) \\ \Delta v_i(s) \end{bmatrix} + T(s) \begin{bmatrix} s_i(s) \\ v_i(s) \end{bmatrix} \tag{4.20}
\]
As a result, the acceleration estimator is, in fact, split into a relative-coordinate estimator \Delta\hat{a}_i(s) and an absolute-coordinate estimator \hat{a}_i(s), i.e., \hat{a}_{i-1}(s) = \Delta\hat{a}_i(s) + \hat{a}_i(s), with
\[
\Delta\hat{a}_i(s) := T(s) \begin{bmatrix} d_i(s) \\ \Delta v_i(s) \end{bmatrix} \tag{4.21}
\]
\[
\hat{a}_i(s) := T(s) \begin{bmatrix} s_i(s) \\ v_i(s) \end{bmatrix} \tag{4.22}
\]
where \Delta\hat{a}_i(s) is the Laplace transform of the estimated relative acceleration \Delta\hat{a}_i(t) and \hat{a}_i(s) is the Laplace transform of the estimated local acceleration.
Finally, \hat{a}_i(s) in 4.22 can easily be computed as

\[
\hat{a}_i(s) = T(s) \begin{bmatrix} s_i(s) \\ v_i(s) \end{bmatrix}
= \begin{pmatrix} T_{as}(s) & T_{av}(s) \end{pmatrix} \begin{bmatrix} s_i(s) \\ v_i(s) \end{bmatrix}
= \left( \frac{T_{as}(s)}{s^2} + \frac{T_{av}(s)}{s} \right) a_i(s) := T_{aa}(s)\, a_i(s) \tag{4.23}
\]
This uses the fact that the local position s_i(t) and velocity v_i(t) are the result of integrating the locally measured acceleration a_i(t), thereby avoiding the use of a potentially inaccurate absolute position measurement by means of a global positioning system. The transfer function T_{aa}(s) acts as a filter on the measured acceleration a_i, yielding the "estimated" acceleration \hat{a}_i. In other words, the local vehicle acceleration measurement a_i is synchronized with the estimated relative acceleration \Delta\hat{a}_i by taking the observer phase lag of the latter into account.
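The sketch below evaluates the frequency response of the estimator T(s) = C̄(sI − A + KC)^{-1}K and of the resulting filter T_aa(s) = T_as(s)/s² + T_av(s)/s from equation (4.23), assuming the matrices A, C and the Kalman gain K computed in the previous sketch.

import numpy as np

def taa_response(A, C, K, w):
    """Return Taa(jw) on the frequency grid w (rad/s)."""
    Ca = np.array([0.0, 0.0, 1.0])   # selects the acceleration state
    I = np.eye(A.shape[0])
    out = np.empty(len(w), dtype=complex)
    for k, wk in enumerate(w):
        s = 1j * wk
        T = Ca @ np.linalg.solve(s * I - A + K @ C, K)   # [Tas(s), Tav(s)]
        out[k] = T[0] / s**2 + T[1] / s                  # eq. (4.23)
    return out

# Example use with A, C, K from the observer sketch above:
# w = np.logspace(-2, 2, 500)
# print(np.abs(taa_response(A, C, K, w))[:5])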
The control law of the fallback DTVACACC system is now obtained by replacing the preceding vehicle's input u_{i−1} in equation 3.6 with the estimated acceleration \hat{a}_{i−1}. As a result, the control law is formulated in the Laplace domain as
\[
u_i(s) = H^{-1}(s)\left(K(s)e_i(s) + T(s) \begin{bmatrix} d_i(s) \\ \Delta v_i(s) \end{bmatrix} + T_{aa}(s)\,a_i(s)\right) \tag{4.24}
\]
which can be implemented using the radar measurements of the distance d_i and the relative velocity \Delta v_i, together with the locally measured acceleration a_i and velocity v_i, the latter being required to calculate the distance error e_i. The corresponding block diagram of the closed-loop DTVACACC system resulting from this approach is shown in Figure 4.2, which can be compared with Figure 3.3, showing the TVACACC scheme.
Figure 4.2 – Block diagram of the DTVACACC system
4.3.3. String stability analysis
To analyze the string stability properties of DTVACACC, the output of interest is chosen to be the acceleration. Recall that the parameters are the same as defined in the previous chapter: τ = 0.1, k_p = 0.2, k_d = 0.7, k_{dd} = 0 (so that measured acceleration is not fed back), h = 0.5 s and transmission delay θ = 0.2 s. In addition, the new parameters for DTVACACC are defined as a_max = 3 m/s², a_min = −5 m/s², P_max = P_min = 0.01, P_0 = 0.1, P_r = 0.11, α = 1.25, σ²_d = 0.029 and σ²_{Δv} = 0.029. As a result, with the closed-loop configuration given in Figure 4.2, the transfer function is obtained:
\[
\|\Gamma_{\mathrm{DTVACACC}}(j\omega)\|_{\mathcal{L}_2} = \frac{\|a_i(s)\|_{\mathcal{L}_2}}{\|a_{i-1}(s)\|_{\mathcal{L}_2}}
= \frac{\|G(s)K(s) + 0.5\,s^2 T_{aa}(s)G(s) + 0.5\,D(s)/\Gamma_2(j\omega)\|_{\mathcal{L}_2}}{\|H(s)\,(1 + G(s)K(s))\|_{\mathcal{L}_2}} \tag{4.25}
\]
where Γ_2 is the transfer function of the second vehicle in the platoon, which receives only one input, from the leading vehicle, and therefore uses the conventional CACC system. Its transfer function is the same as equation 3.26:
\[
\|\Gamma_{2}(j\omega)\|_{\mathcal{L}_2} = \frac{\|a_2(s)\|_{\mathcal{L}_2}}{\|a_1(s)\|_{\mathcal{L}_2}}
= \frac{\|D(s) + G(s)K(s)\|_{\mathcal{L}_2}}{\|H(s)\,(1 + G(s)K(s))\|_{\mathcal{L}_2}} \tag{4.26}
\]
The platoon of vehicles is string stable if the infinity norm of the transfer function does not exceed 1, i.e., \|\Gamma_{\mathrm{DTVACACC}}(j\omega)\|_{\mathcal{L}_\infty} \le 1. If the system is string unstable, \|\Gamma_{\mathrm{DTVACACC}}(j\omega)\|_{\mathcal{L}_\infty} exceeds 1; in that case, we still aim at making this norm as low as possible to minimize disturbance amplification. The \mathcal{L}_2 norm is used here to compare the different CACC systems. The frequency response magnitudes |\Gamma_{\mathrm{DTVACACC}}(j\omega)| from 4.25, |\Gamma_{\mathrm{TVACACC}}(j\omega)| from 3.27 and |\Gamma_{\mathrm{ACC}}(j\omega)| from 3.29, as functions of the frequency ω, are shown in Figures 4.3a and 4.3b for headway times h = 0.5 s and h = 2 s, respectively.
Figure 4.3 – Frequency response magnitude with different headway times, (a) h = 0.5 s and (b) h = 2 s, in the case of TVACACC (blue), DTVACACC (green) and ACC (red)
Recall the string stability criterion defined in equation 2.4: \|\Gamma_i(j\omega)\|_{\mathcal{L}_\infty} = \sup_\omega |\Gamma_i(j\omega)| \le 1. From the frequency response magnitudes, it follows that for h = 0.5 s, only the TVACACC system results in string-stable behavior, with \|\Gamma_{\mathrm{TVACACC}}(j\omega)\|_{\mathcal{L}_\infty} = 1, whereas both the DTVACACC and ACC systems are not string stable: \|\Gamma_{\mathrm{DTVACACC}}(j\omega)\|_{\mathcal{L}_\infty} = 1.0192 and \|\Gamma_{\mathrm{ACC}}(j\omega)\|_{\mathcal{L}_\infty} = 1.2782. Even when the system is string unstable, we seek the lowest response in order to keep the disturbance amplification as small as possible. It is therefore clear that the DTVACACC system improves the performance compared to the ACC system when there is no communication from vehicle i−1.
For h = 1.3 s, both TVACACC and DTVACACC yield string stability, while ACC is still not string stable: \|\Gamma_{\mathrm{TVACACC}}(j\omega)\|_{\mathcal{L}_\infty} = \|\Gamma_{\mathrm{DTVACACC}}(j\omega)\|_{\mathcal{L}_\infty} = 1 and \|\Gamma_{\mathrm{ACC}}(j\omega)\|_{\mathcal{L}_\infty} = 1.0859. This is logical, because increasing the headway time helps to improve string stability, which however results in large inter-vehicle distances and low traffic flow capacity.
4.3.4. Model switch strategy
Until now, either full wireless communication under nominal conditions or a persistent loss of communication has been considered. In practice, however, the loss of the wireless link is often preceded by increasing communication latency, represented by the time delay θ. Intuitively, it can be expected that above a certain maximum allowable latency, wireless communication is no longer effective, at which point switching from TVACACC to DTVACACC is beneficial in view of string stability. This section proves this intuition to be true and also calculates the exact switching value of the latency, thereby providing a criterion for the activation of DTVACACC.
Figure 4.4 – Minimum headway time h_{min,TVACACC} (blue) and h_{min,DTVACACC} (red) versus wireless communication delay θ
From the string stability analysis of the DTVACACC system in equation 4.25, the magnitude of the transfer function, and hence the string stability, changes when a different headway time is chosen. The infinity norm \|\Gamma(j\omega)\|_{\mathcal{L}_\infty} is reduced by increasing the headway time h, whose effect is to increase H(s) in the denominator. Consequently, for TVACACC, a minimum string-stable headway time h_{min,TVACACC} must exist, which depends on the delay θ. Along the same line of thought, it can be shown that a minimum string-stable headway time also exists for DTVACACC, which is obviously independent of the communication delay. Figure 4.4 shows h_{min,TVACACC} and h_{min,DTVACACC} as functions of θ. Here, the minimum headway time is obtained by searching for the smallest h for each θ such that \|\Gamma_i(j\omega)\|_{\mathcal{L}_\infty} = 1 for each system. This figure clearly shows a breakeven point θ_b of the delay θ, i.e., h_{min,DTVACACC} = h_{min,TVACACC}(θ_b), which is equal to θ_b = 1.53 s for the current controller and acceleration observer. The figure also indicates that for θ ≤ θ_b it is beneficial to use TVACACC in view of string stability, since this allows for smaller time gaps, whereas for θ ≥ θ_b DTVACACC is preferred. This is an important result, since it provides a criterion for switching from TVACACC to DTVACACC and vice versa in the event that communication is not (yet) totally lost, although it requires monitoring the communication time delay while CACC is operational. As a final remark, it should be noted that the above analysis only holds for a communication delay that varies slowly compared with the system dynamics. Moreover, it does not cover the situation in which data samples (packets) are intermittently lost rather than delayed.
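The search behind Figure 4.4 can be sketched as follows: for each delay θ, bisect on the smallest headway h such that sup_ω |Γ(jω)| ≤ 1. For brevity, Γ uses the one-vehicle-ahead CACC form (3.28) with the transfer-function shapes assumed earlier; the same procedure applies to the TVACACC and DTVACACC transfer functions.

import numpy as np

tau, kp, kd = 0.1, 0.2, 0.7
w = np.logspace(-2, 2, 2000)
s = 1j * w
G = 1.0 / (s**2 * (tau * s + 1.0))   # assumed vehicle dynamics
K = kp + kd * s                      # assumed feedback controller

def sup_gamma(h, theta):
    H = 1.0 + h * s
    D = np.exp(-theta * s)
    return np.max(np.abs((D + G * K) / (H * (1.0 + G * K))))

def h_min(theta, lo=0.05, hi=5.0, iters=40):
    """Bisection on the smallest string-stable headway time."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sup_gamma(mid, theta) <= 1.0:
            hi = mid        # mid is string stable: search below it
        else:
            lo = mid        # mid is string unstable: search above it
    return hi

for theta in (0.0, 0.2, 0.5, 1.0, 1.53):
    print(f"theta = {theta:4.2f} s -> h_min ~ {h_min(theta):.2f} s")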
4.4. Simulation
To test the performance of the proposed model, a stop-and-go scenario is chosen, because it is the most dangerous of all possible situations in longitudinal control. The platoon consists of several CACC-equipped vehicles, which are assumed to be identical. The platoon starts at a constant speed of 30 m/s. At t = 50 s, the leading vehicle of the platoon brakes with a deceleration of −5 m/s², and at t = 70 s it reaccelerates at 2 m/s² until regaining the initial velocity of 30 m/s. Results for different headway times will be shown. The numerical parameters are described in the table below.
The conventional ACC and TVACACC systems are introduced to allow a clear comparison with the DTVACACC system.

Figure 4.5 – Acceleration response of the third vehicle in the stop-and-go scenario using the conventional ACC system (red), TVACACC system (gray) and DTVACACC system (blue), with a communication delay of 1 s and headway 0.5 s

Figure 4.6 – Velocity response of the third vehicle in the stop-and-go scenario using the conventional ACC system (red), TVACACC system (gray) and DTVACACC system (blue), with a communication delay of 1 s and headway 0.5 s

The first vehicle's acceleration is represented by the black line, and the third vehicle is chosen to investigate the differences between the conventional ACC system (red), the TVACACC system (gray) and the DTVACACC system (blue), shown in Figure 4.5. Each vehicle follows its preceding vehicle, respecting a safe distance with a headway time of 0.5 s. However, the string stability criterion is clearly not satisfied by the existing ACC system, as the absolute values of its deceleration and acceleration are much greater than those of the first vehicle. The DTVACACC system is not string stable either; however, its response overshoots less.
The vehicle maintains lower acceleration and deceleration while carrying out its following objective. It is reasonable to conclude that the proposed Kalman-filter acceleration estimation approach helps to improve string stability in case of V2V communication degradation. Similar results are obtained for the velocity responses: the ACC system always responds with greater amplitude than the leading vehicle. If a platoon consists of a large number of vehicles, the velocity, acceleration and spacing errors will become extremely large in the upstream direction under this condition when using the existing ACC system, which is uncomfortable and dangerous. The proposed DTVACACC system again clearly outperforms ACC, but is still worse than the TVACACC system, because the transmission from the preceding vehicle i−1 is degraded or lost.
Figure 4.7 – Velocity response of the third vehicle in the stop-and-go scenario using the conventional ACC system (red), TVACACC system (gray) and DTVACACC system (blue), with a communication delay of 1 s and headway 1.5 s
As discussed above, increasing the headway time can improve string stability. Therefore, headway times of 1.5 s and 3 s are chosen to assess the improvement in performance. With h = 1.5 s, shown in Figure 4.7, the DTVACACC system is now string stable, while ACC still is not. If we further increase the headway time to h = 3 s, shown in Figure 4.8, all three systems achieve string stability. The ACC system needs the largest headway time to keep the platoon string stable, followed by DTVACACC and finally TVACACC, in agreement with our theoretical analysis. String instability not only wastes energy but also makes the situation dangerous: in a platoon of twenty vehicles, the last vehicle would suffer hard braking and acceleration, possibly beyond its physical limits, which may result in rear-end collisions.

Figure 4.8 – Velocity response of the third vehicle in the stop-and-go scenario using the conventional ACC system (red), TVACACC system (gray) and DTVACACC system (blue), with a communication delay of 1 s and headway 3 s
4.5. Conclusion
In this chapter, we concentrated on the degradation of CACC system.
To accelerate the practical implementation of CACC in everyday traffic, wireless communication faults must be taken into account. To this end, a graceful degradation technique for CACC was presented, serving as an alternative to an ACC fallback. The idea behind the proposed approach is to minimize the loss of CACC functionality when the wireless link fails or when the preceding vehicle is not equipped with wireless communication means. The proposed strategy, referred to as DTVACACC, uses an estimate of the preceding vehicle's current acceleration as a replacement for the desired acceleration that would normally be communicated over the wireless link for this type of CACC. In addition, a criterion was presented for switching from TVACACC to DTVACACC in the case where wireless communication is not (yet) lost but shows increased latency. It was shown that the performance, in terms of string stability, of DTVACACC can be maintained at a much higher level compared with an ACC fallback scenario. Both theoretical and experimental results showed that the DTVACACC system outperforms the ACC fallback scenario with respect to string stability characteristics, reducing the minimum string-stable time gap to less than half the required value in case of
5.1. Introduction
Endowing vehicles with human-like abilities to perform specific skills in a smooth and natural way is one of the important goals of ITS. Reinforcement learning (RL) is a key tool for creating vehicles that can learn new skills by themselves, much as humans do. Reinforcement learning is realized by interacting with an environment. In RL, the learner is a decision-making agent that takes actions in an environment and receives a reinforcement signal for its actions while trying to accomplish a task. This signal, known as the reward (or penalty), evaluates an action's outcome, and the agent seeks to learn to select a sequence of actions, i.e. a policy, that maximizes the total accumulated reward over time. Reinforcement learning can be formulated as a Markov Decision Process. Model-based RL algorithms can be used if the state transition function T(s, a, s') is known.
The whole learning scenario is a process of trial-and-error runs. We apply a Boltzmann probability distribution to tackle the exploration-exploitation trade-off, that is, the dilemma between exploiting past experience by selecting actions that, as far as we know, are beneficial, and exploring new and potentially more rewarding states. Under these circumstances, the policies are stochastic.
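A minimal sketch of this Boltzmann (softmax) action-selection rule is shown below; the temperature parameter is illustrative, with high values favoring exploration and low values favoring exploitation.

import numpy as np

def boltzmann_action(q_values, temperature=1.0, rng=None):
    """Sample an action index with probability proportional to exp(Q/T)."""
    rng = np.random.default_rng() if rng is None else rng
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

# Example: three actions; lowering the temperature makes the choice greedier
print(boltzmann_action([1.0, 1.2, 0.8], temperature=0.5))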
Analytic approaches to ACC and CACC control problems are often difficult because of nonlinear dynamics and high-dimensional state spaces. Generally speaking, linearization is not sufficient to solve this problem; it is therefore preferable to investigate new approaches, particularly RL, in which knowledge of the underlying Markov decision process (MDP) is not necessary. In this chapter, a new RL approach to the CACC system based on policy search is designed, i.e., the parameters of a control policy are modified directly based on the obtained rewards. The policy-gradient method is adopted because, unlike other RL methods, it behaves well on high-dimensional systems. The advantages of policy-gradient methods are obvious [114]. Most importantly, the policy representation can be chosen such that it is useful for the task, i.e., domain knowledge can easily be incorporated. The proposed approach, in general, leads to fewer parameters in the learning process compared to other RL methods. Besides, many different algorithms for policy-gradient estimation already exist in the literature, most of which rest on strong theoretical foundations. Finally, policy-gradient methods are model-free and can therefore also be applied to problems whose task and reward models are not known analytically. Consequently, in this chapter, we propose a policy-gradient algorithm for CACC, in which the algorithm repeatedly estimates the gradient of the value with respect to the parameters, based on the information observed during policy trials, and then updates the parameters in the gradient direction.
5.2. Related Work
Most research on CACC systems has relied on classical control theory to develop autonomous controllers. However, recent projects based on machine-learning approaches have produced promising theoretical and practical results for the resolution of control problems in uncertain and partially observable environments, and it would be desirable to apply them to CACC. One of the first efforts to use machine learning for autonomous vehicle control was Pomerleau's Autonomous Land Vehicle In a Neural Network (ALVINN) [121], which consisted of a computer vision system, based on a neural network, that learned to correlate observations of the road with the correct action to take. The resulting autonomous controller drove a real vehicle by itself for more than 30 miles. In [175, 179, 177], an RL-based self-learning algorithm is designed for cases where prior experience is not available to the learning agents, which are obliged to find a robust policy by interacting with the environment. Experiments are carried out on autonomous navigation tasks for mobile robots.
To our knowledge, Yu [185] was the first researcher to propose the use of RL for steering control. According to this work, the RL approach allows control designers to remove the requirement for extra supervision and also provides continuous learning abilities. RL is a machine-learning approach that can be viewed as the adaptive optimal control of a process P, in which the controller (called the agent) interacts with P and learns to control it. To this end, the agent learns behavior through trial-and-error interactions with P. The agent perceives the state of P and selects an action that maximizes the cumulative return, based on a real-valued reward signal that comes from P after each action. Thus, RL relies on modifying the control policy, which associates an action a with a state s, based on the state of the environment. Vehicle following has also been investigated in [110], using RL and a vision sensor. Through RL, the control system indirectly learns the vehicle-road interaction dynamics, knowledge of which is essential to stay on the road in high-speed road tracking.
In [43], the author aimed at obtaining a vehicle controller using instance-based RL. To this end, stored instances of past observations are used as value estimates for controlling autonomous vehicles, extended to automobile control tasks. Simulations in an extensible environment and a hierarchical control architecture for autonomous vehicles have been realized. In particular, the controllers produced by this architecture were evaluated and improved in the simulator until difficult traffic scenarios could be handled in a variety of (simulated) highway networks. However, this approach is limited by memory storage, which can grow very rapidly in a realistic application.
More recently, in [107], an adaptive control system using gain scheduling learned by RL is proposed. This approach partly preserves the nonlinear nature of the vehicle dynamics. The proposed controller performs better than a simple linearization of the longitudinal model, which is not suitable over the entire operating range of the vehicle. The performance of the proposed approach at specific operating points shows accurate tracking of both velocity and position in most cases. However, when the adaptive controller is deployed in a convoy or a platoon, the tracking performance is less desirable. In particular, the second car attempts to track the leader, resulting in slight oscillations. These oscillations are passed on to the following vehicles, but they decrease in the upstream direction of the platoon, implying string stability. Thus, this approach is more convenient for platooning control than for CACC, because in the latter case it sometimes engenders slight oscillations.
Thus, although some research has been dedicated to longitudinal control using RL, no one has specifically used RL for controlling CACC. This chapter addresses this gap.
5.3. Neural Network Model
An artificial neural network (ANN) [136, 105] is organized in layers, each composed of a number of "neuron" nodes. A neuron is a computational unit that reads inputs, processes them and generates an output; see Figure 5.1 for an example.
Figure 5.1 – A neural network example
The whole network is constructed by interconnecting many neurons. In this figure, each circle represents one neuron. The leftmost layer of the network is called the input layer and the rightmost layer the output layer; the middle layer of nodes is called the hidden layer, since its values are not observed in the training set. The input layer and output layer serve, respectively, as the inputs and outputs of the neural network. The neurons labeled "+1" are called bias units; a bias unit has no input and always outputs +1. Hence, this neural network has 3 input units (excluding the bias unit), 3 hidden units (excluding the bias unit), and 1 output unit.

We use n_l to denote the number of layers and label layer l as L_l. In Figure 5.1, n_l = 3, layer L_1 is the input layer, and layer L_{n_l} is the output layer.
The links connecting two neurons are called weights, representing the connection strength between the neurons. The parameters of the neural network are (W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}), where W^{(l)}_{ij} denotes the weight associated with the connection between unit j in layer l and unit i in layer l+1. Also, b^{(l)}_i is the bias associated with unit i in layer l+1. Thus, we have W^{(1)} ∈ R^{3×3} and W^{(2)} ∈ R^{1×3}.¹

Each neuron in the network contains an activation function that controls its output. We denote the activation of unit i in layer l by a^{(l)}_i. For the input layer L_1, a^{(1)}_i = x_i, the i-th input of the whole network. For the other layers, a^{(l)}_i = f(z^{(l)}_i), where z^{(l)}_i denotes the total weighted sum of inputs to unit i in layer l, including the bias term (e.g., z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i).
Given a fixed setting of the parameters (W, b), the neural network outputs a real number that is defined as the hypothesis h_{W,b}(x). Specifically, the computation that this neural network represents is given by:
\begin{align}
a^{(2)}_1 &= f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1), \\
a^{(2)}_2 &= f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2), \\
a^{(2)}_3 &= f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3), \\
h_{W,b}(x) &= a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1). \tag{5.1}
\end{align}
¹ b^{(l)}_i can also be interpreted as the weight connecting the bias unit in layer l, which always outputs +1, to unit i in layer l+1. Thus b^{(l)}_i may be replaced by W^{(l)}_{i0}; in this way, W^{(1)} ∈ R^{3×4} and W^{(2)} ∈ R^{1×4}.

For a more compact expression, we can extend the activation function f(·) to apply to vectors element-wise, i.e., f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]; we can then write the equations above as:
\begin{align}
a^{(1)} &= x, \\
z^{(2)} &= W^{(1)} a^{(1)} + b^{(1)}, \\
a^{(2)} &= f(z^{(2)}), \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)}, \\
h_{W,b}(x) &= a^{(3)} = f(z^{(3)}). \tag{5.2}
\end{align}
Here x = [x_1, x_2, x_3]^\top is the vector of input-layer values. This computational process, from inputs to outputs, is called forward propagation. More generally, given the activation a^{(l)} of any layer l, we can compute the activation a^{(l+1)} of the next layer l+1 as:
\begin{align}
z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)}, \\
a^{(l+1)} &= f(z^{(l+1)}). \tag{5.3}
\end{align}
In this dissertation, we choose f(·) to be the sigmoid function f : \mathbb{R} \to (0, 1):

\[
f(z) = \frac{1}{1 + \exp(-z)}. \tag{5.4}
\]

Its derivative is given by

\[
f'(z) = f(z)\,(1 - f(z)). \tag{5.5}
\]
The advantage of putting all variables and parameters into matrices is that we can greatly speed up the computation by using matrix-vector operations.
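A minimal vectorized forward-propagation sketch in the notation of equations (5.2)-(5.4) follows; the layer sizes and the random initialization are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # eq. (5.4)

def forward(weights, biases, x):
    """Return the activations of every layer for an input vector x."""
    a = [np.asarray(x, dtype=float)]
    for W, b in zip(weights, biases):
        z = W @ a[-1] + b             # eq. (5.3)
        a.append(sigmoid(z))
    return a                          # a[-1] is the hypothesis h_{W,b}(x)

# 3-3-1 network as in Figure 5.1, with small random parameters
rng = np.random.default_rng(0)
Ws = [rng.normal(0.0, 0.01, (3, 3)), rng.normal(0.0, 0.01, (1, 3))]
bs = [np.zeros(3), np.zeros(1)]
print(forward(Ws, bs, [1.0, 2.0, 3.0])[-1])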
Neural networks can also have multiple hidden layers or multiple output units. Taking Figure 5.2 as an example, this network has two hidden layers L_2 and L_3 and two output units in layer L_4.

Figure 5.2 – A neural network example with two hidden layers

Forward propagation applies to all architectures of feedforward neural networks: to compute the output of the network, we start with the input layer L_1 and successively compute all the activations in layer L_2, then layer L_3, and so on, up to the output layer L_{n_l}.
5.3.1. Backpropagation Algorithm
Suppose we have a fixed training set \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\} of m training examples. We can train our neural network using batch gradient descent. In detail, for a single training example (x, y), we define the cost function with respect to that single example to be:

\[
J(W, b; x, y) = \frac{1}{2}\,\|h_{W,b}(x) - y\|^2. \tag{5.6}
\]
This is a squared-error cost function. Given a training set of m examples, we
then define the overall cost function J(W, b) to be:
\begin{align}
J(W, b) &= \left[\frac{1}{m}\sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)})\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W^{(l)}_{ji}\right)^2 \\
&= \left[\frac{1}{m}\sum_{i=1}^{m} \left(\frac{1}{2}\,\|h_{W,b}(x^{(i)}) - y^{(i)}\|^2\right)\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W^{(l)}_{ji}\right)^2. \tag{5.7}
\end{align}
Here s_l denotes the number of nodes in layer l (not counting the bias unit). The first term in the definition of J(W, b) is an average sum-of-squares error term. The second term is a regularization term that tends to decrease the magnitude of the weights and helps prevent overfitting; regularization is applied only to W, not to b. λ is the regularization parameter, which controls the relative importance of the two terms. Note that J(W, b; x, y) is the squared-error cost with respect to a single example, while J(W, b) is the overall cost function, which includes the regularization term.
The goal of backpropagation is to minimize J(W, b) as a function of W and b. To train the neural network, we first initialize each parameter W^{(l)}_{ij} and each b^{(l)}_i to a small random value near zero, and then apply an optimization algorithm such as batch gradient descent. It is important to initialize the parameters randomly rather than to all 0's: if all parameters start at identical values, then all hidden-layer units end up learning the same function of the input. More formally, W^{(1)}_{ij} would be the same for all values of i, so that a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \dots for any input x. Random initialization serves the purpose of symmetry breaking.
One iteration of gradient descent updates the parameters W, b as follows:
\begin{align}
W^{(l)}_{ij} &= W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b), \\
b^{(l)}_i &= b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b). \tag{5.8}
\end{align}
The parameter α is the learning rate; it determines how fast W and b move towards their optimal values. If α is too large, the parameters may overshoot the optimum and diverge; if α is too small, convergence may take a long time.
The key step in Equation (5.8) is computing the partial derivatives of the overall cost function J(W, b). From Equation (5.7), we can easily obtain:
\begin{align}
\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b) &= \left[\frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x^{(i)}, y^{(i)})\right] + \lambda W^{(l)}_{ij}, \\
\frac{\partial}{\partial b^{(l)}_i} J(W, b) &= \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial b^{(l)}_i} J(W, b; x^{(i)}, y^{(i)}). \tag{5.9}
\end{align}
One of the main tasks of the backpropagation algorithm is to compute the partial derivative terms \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x^{(i)}, y^{(i)}) and \frac{\partial}{\partial b^{(l)}_i} J(W, b; x^{(i)}, y^{(i)}) in Equation (5.9).
The backpropagation algorithm for one training example is as follows:

1. Perform a forward propagation, computing the activations for layers L_2, L_3, and so on up to the output layer L_{n_l}.

2. For each output unit i in the output layer n_l, set
\[
\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \left(\frac{1}{2}\,\|y - h_{W,b}(x)\|^2\right) = -(y_i - a^{(n_l)}_i)\, f'(z^{(n_l)}_i). \tag{5.10}
\]

3. For l = n_l−1, n_l−2, n_l−3, \dots, 2: for each node i in layer l, set
\[
\delta^{(l)}_i = \left(\sum_{j=1}^{s_{l+1}} W^{(l)}_{ji}\,\delta^{(l+1)}_j\right) f'(z^{(l)}_i). \tag{5.11}
\]

4. Compute the desired partial derivatives, which are given as:
\begin{align}
\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y) &= a^{(l)}_j\,\delta^{(l+1)}_i, \\
\frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y) &= \delta^{(l+1)}_i. \tag{5.12}
\end{align}
Given a training example (x, y), we first run a forward propagation to compute all the activations throughout the network, including the output value of the hypothesis h_{W,b}(x). Then, for each node i in layer l, we compute an error term \delta^{(l)}_i that measures how much that node was "responsible" for any errors in the output. For an output node, we can directly measure the difference \delta^{(n_l)}_i between the network's activation and the true target value; for hidden units, we compute \delta^{(l)}_i based on a weighted average of the error terms of the nodes that use a^{(l)}_i as an input.
In practice, we use matrix-vector operations to reduce the computational cost. We use "◦" to denote the element-wise product operator². By definition, if C = A ◦ B, then (C)_{ij} = (A ◦ B)_{ij} = (A)_{ij} \cdot (B)_{ij}.

² Also called the Hadamard product.
The algorithm for one training example can then be written:

1. Perform a forward propagation, computing the activations for layers L_2, L_3, up to the output layer L_{n_l}, using the equations defining the forward propagation steps.

2. For the output layer n_l, set
\[
\delta^{(n_l)} = -(y - a^{(n_l)}) \circ f'(z^{(n_l)}). \tag{5.13}
\]

3. For l = n_l−1, n_l−2, n_l−3, \dots, 2, set
\[
\delta^{(l)} = \left((W^{(l)})^\top \delta^{(l+1)}\right) \circ f'(z^{(l)}). \tag{5.14}
\]

4. Compute the desired partial derivatives:
\begin{align}
\nabla_{W^{(l)}} J(W, b; x, y) &= \delta^{(l+1)} \left(a^{(l)}\right)^\top, \\
\nabla_{b^{(l)}} J(W, b; x, y) &= \delta^{(l+1)}. \tag{5.15}
\end{align}
In steps 2 and 3 above, we need to compute $f'(z_i^{(l)})$ for each value of $i$. Assuming $f(z)$ is the sigmoid activation function, we already have $a_i^{(l)}$ stored from the forward propagation throughout the whole network. Thus, using Equation (5.5) for $f'(z)$, we can compute this as $f'(z_i^{(l)}) = a_i^{(l)} (1 - a_i^{(l)})$.
After obtaining all the desired partial derivatives, we can finally implement the gradient descent algorithm. One iteration of batch gradient descent proceeds as follows:

1. Set $\Delta W^{(l)} := 0$, $\Delta b^{(l)} := 0$ (matrix/vector of zeros) for all $l$.

2. For $i = 1$ to $m$,

(a) Use backpropagation to compute $\nabla_{W^{(l)}} J(W, b; x^{(i)}, y^{(i)})$ and $\nabla_{b^{(l)}} J(W, b; x^{(i)}, y^{(i)})$.

(b) Set $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W, b; x^{(i)}, y^{(i)})$.

(c) Set $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W, b; x^{(i)}, y^{(i)})$.
3. Update the parameters:
\[
W^{(l)} = W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right], \qquad
b^{(l)} = b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]. \tag{5.16}
\]
$\Delta W^{(l)}$ is a matrix of the same dimension as $W^{(l)}$, and $\Delta b^{(l)}$ is a vector of the same dimension as $b^{(l)}$.
To train the neural network, we can repeatedly take steps of gradient descent to
reduce our cost function J(W, b).
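As an illustration of Equations (5.13)–(5.16), the following NumPy sketch implements the vectorized backpropagation pass and one batch gradient descent iteration for a fully connected sigmoid network. It is a minimal sketch under our own naming assumptions, not the thesis implementation: the helpers backprop and batch_gradient_descent_step are ours, and W and b are assumed to be lists of per-layer weight matrices and bias vectors.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(W, b, x, y):
    """Backpropagation for one example, Equations (5.13)-(5.15).
    W, b: lists of weight matrices / bias column vectors; x, y: column vectors."""
    # Forward pass: store all activations.
    a = [x]
    for Wl, bl in zip(W, b):
        a.append(sigmoid(Wl @ a[-1] + bl))
    # Output-layer error term, Equation (5.13); f'(z) = a(1 - a) for the sigmoid.
    delta = -(y - a[-1]) * a[-1] * (1 - a[-1])
    grads_W = [None] * len(W)
    grads_b = [None] * len(b)
    # Backward pass: Equations (5.14) and (5.15).
    for l in reversed(range(len(W))):
        grads_W[l] = delta @ a[l].T
        grads_b[l] = delta
        if l > 0:
            delta = (W[l].T @ delta) * a[l] * (1 - a[l])
    return grads_W, grads_b

def batch_gradient_descent_step(W, b, X, Y, alpha, lam):
    """One iteration of batch gradient descent, Equation (5.16)."""
    m = len(X)
    acc_W = [np.zeros_like(Wl) for Wl in W]
    acc_b = [np.zeros_like(bl) for bl in b]
    for x, y in zip(X, Y):
        gW, gb = backprop(W, b, x, y)
        for l in range(len(W)):
            acc_W[l] += gW[l]
            acc_b[l] += gb[l]
    for l in range(len(W)):
        W[l] -= alpha * (acc_W[l] / m + lam * W[l])  # weight-decay term on W only
        b[l] -= alpha * (acc_b[l] / m)               # no regularization on b
    return W, b

Note that, matching Equation (5.16), the regularization term $\lambda W^{(l)}$ is applied to the weights only; the bias update carries no regularization.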
5.4. Model-Free Reinforcement Learning Method
In our work, we study a reinforcement learning approach for the longitudinal control problem of intelligent vehicles. The leading vehicle takes random decisions, and the following vehicles sequentially choose actions over a sequence of time steps, in order to maximize a cumulative reward. We model the problem as a Markov Decision Process: a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition dynamics distribution $P(s_{t+1} \mid s_t, a_t)$ satisfying the Markov property $P(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = P(s_{t+1} \mid s_t, a_t)$ for any trajectory $s_1, a_1, s_2, a_2, \dots, s_T, a_T$ in state-action space, and a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. A stochastic policy $\pi(s_t, a_t) = P(a_t \mid s_t)$ is used to select actions and produce a trajectory of states, actions and rewards $s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T$ over $\mathcal{S} \times \mathcal{A} \times \mathbb{R}$.
An on-policy method learns the value of the policy that is used to make decisions; the value functions are updated using results from executing actions determined by that policy. An off-policy method can learn the value of the optimal policy independently of the agent's actions; it updates the estimated value functions using hypothetical actions, i.e., actions that have not actually been tried.
We focus on model-free RL methods, in which the vehicle derives an optimal policy without explicitly learning a model of the environment. The Q-learning algorithm [172] is one of the major model-free reinforcement learning algorithms.
Q-learning is an important off-policy model-free reinforcement learning algorithm for temporal-difference learning. It can be proven that, given sufficient training under any ε-soft policy, the algorithm converges with probability 1 to a close approximation of the action-value function for an arbitrary target policy. Q-learning learns the optimal policy even when actions are selected according to a more exploratory or even random policy.
The update of state-action values in Q-learning is defined by
\[
Q(s_t, a_t) := Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]. \tag{5.17}
\]
The parameters used in the Q-value update process are:

• α – the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, hence nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.

• γ – the discount factor, also set between 0 and 1. It models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be set less than 1 for the algorithm to converge.
In this case, the learned action-value function Q directly approximates $Q^*$, the optimal action-value function, independently of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state-action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, $Q_t$ has been shown to converge with probability 1 to $Q^*$. The Q-learning algorithm is shown in Algorithm 3.
5.5. CACC based on Q-Learning
One of the strengths of Q-learning is that it is able to compare the expected utility
of the available actions without requiring a model of the environment. Q-learning
can handle problems with stochastic transitions and rewards.
Algorithm 3: One-step Q-learning algorithm [172]
1: Initialize Q(s, a) arbitrarily;
2: repeat
3:    Initialize s;
4:    repeat
5:       Choose a from s using the policy derived from Q;
6:       Take action a, observe r, s′;
7:       Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)];
8:       s ← s′;
9:    until s is terminal
10: until all episodes end.
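For illustration only, a minimal tabular transcription of Algorithm 3 is sketched below. The environment interface (env.reset, env.step, env.actions) is an assumption made for the sketch, and ε-greedy action selection is used here as one possible exploratory policy, whereas the controller presented later relies on a Boltzmann policy.

import random
from collections import defaultdict

def one_step_q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular one-step Q-learning, Algorithm 3 / Equation (5.17)."""
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy policy derived from Q (exploration step)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-value update, Equation (5.17)
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q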
This section explains the design of an autonomous CACC system that integrates both sensors and inter-vehicle communication in its control loop to keep a secure longitudinal vehicle-following behavior. To this end, we will use the Q-learning method described in the previous section to learn a vehicle control policy by direct interaction with a complex simulated driving environment. In this section, we will present the simulated driving scenario, show the learning simulations in detail, and evaluate the performance of the resulting policies.
The learning task considered in this chapter is the same as in the previous chapters, corresponding to a Stop-and-Go scenario. This type of scenario is the most interesting, because it usually occurs on urban roads. It has been used by many researchers for the development of autonomous controllers and the evaluation of their efficiency and effects on the traffic flow. In this case, the learning vehicle's objective is to learn to follow the leading vehicle while keeping a pre-defined headway of 2 s.
5.5.1. State and Action Spaces
Since the reinforcement learning problem can be modeled as an MDP, we first need to define the state space $\mathcal{S}$ and the action space $\mathcal{A}$. For the definition of the states, the following three state variables are considered:
• headway time $H_\omega$: Headway time (also called the “range”) is defined as the distance in time to the front vehicle and is calculated as follows:
\[
H_\omega = \frac{S_{Leader} - S_{Follower}}{V_{Follower}} \tag{5.18}
\]
where $S_{Leader}$ and $S_{Follower}$ are the positions of the leading vehicle and the following vehicle respectively, and $V_{Follower}$ is the velocity of the following vehicle. This measurement is widely adopted for inter-vehicle spacing and has the advantage of depending on the current velocity of the following vehicle. This state representation is also interesting because it is independent of the velocity of the front vehicle, which is good for a heterogeneous platoon. Thus, a behavior learned using these states will generalize to all possible front-vehicle velocities.
• headway time derivative $\Delta H_\omega$: The headway time derivative (also called the “range rate”) contains valuable information about the relative velocity between the two vehicles and is expressed by
\[
\Delta H_\omega = H_\omega^{t} - H_\omega^{t-1} \tag{5.19}
\]
It shows whether the following vehicle has been moving closer to or farther from the front vehicle since the previous update of the value. Both the headway and the headway derivative can be derived by using a simulated laser sensor. Although continuous values are considered, we limit the extent of the state space by bounding these variables to specific intervals that contain the experience valuable for learning vehicle-following behavior. Thus, the possible values of the headway are bounded from 0 to 10 s, whereas the headway derivative is bounded from −0.1 s to 0.1 s.
• front-vehicle acceleration $a_{i-1}$: The acceleration of the front vehicle, which can be obtained through wireless V2V communication, is another important state variable of our system. As with the two previous state variables, the acceleration values are bounded to a particular interval, ranging from −3 m/s² to 5 m/s².
Finally, the action space is composed of the following three actions: 1) a braking action (B); 2) a gas action (G); and 3) a no-operation action (NO-OP). The state and action spaces of our framework can formally be described as follows:
\[
\mathcal{S} = \{ H_\omega, \Delta H_\omega, a_{i-1} \} \tag{5.20}
\]
\[
\mathcal{A} = \{ B, G, NO\text{-}OP \} \tag{5.21}
\]
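A direct transcription of Equations (5.18)–(5.21) could look as follows. The function name build_state and the guard against a near-zero follower velocity are illustrative assumptions; the clipping bounds are the ones stated above.

def build_state(s_leader, s_follower, v_follower, hw_prev, a_front):
    """Bounded state vector (H_w, dH_w, a_{i-1}) of Equation (5.20)."""
    hw = (s_leader - s_follower) / max(v_follower, 1e-3)  # Equation (5.18), guarded for v ~ 0
    hw = min(max(hw, 0.0), 10.0)                          # headway bounded to [0, 10] s
    dhw = min(max(hw - hw_prev, -0.1), 0.1)               # Equation (5.19), bounded to [-0.1, 0.1] s
    a_front = min(max(a_front, -3.0), 5.0)                # bounded to [-3, 5] m/s^2
    return (hw, dhw, a_front)

ACTIONS = ("B", "G", "NO-OP")                             # action space of Equation (5.21)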
5.5.2. Reward Function
The progress of the learning phase depends on the reward function used by the
agent, because this function is mostly used by the learning algorithm to direct the
agent in areas of the state space where it will gather the maximum expected reward.
It is used to evaluate how good or how bad the selected action is. Obviously, the
reward function must be designed to be positive reward values to actions that get
the agent toward the safe inter-vehicle distance to the preceding vehicle (see Figure
5.3).
Figure 5.3 – Reward of CACC system in RL approach
As the secure inter-vehicle distance should be around the pre-defined value of 2 s (a common value in industrialized countries' legislation), we choose a large positive reward given when the vehicle enters the zone that extends ±0.1 s around the headway goal of 2 s. Moreover, we also define an even smaller zone at ±0.05 s around the safe distance, where the agent receives the largest reward. The desired effect of such a reward function is to advise the agent to stay as close as possible to the safe distance. On the contrary, we give negative rewards to the vehicle when it is located very far from the safe distance or when it is too close to the preceding vehicle. To reduce learning times, we also use a technique called reward shaping, which directs the exploration of the agent by giving small positive rewards to actions that make the agent progress along a desired trajectory through the state space (i.e., by giving positive rewards when the vehicle is very far but gets closer to its front vehicle).
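The reward just described can be sketched as below. Only the zone boundaries (±0.05 s and ±0.1 s around the 2 s goal) come from the text; the reward magnitudes and the far/close thresholds are illustrative assumptions.

def reward(hw, hw_prev, goal=2.0):
    """Shaped reward around the 2 s headway goal (illustrative magnitudes)."""
    err = abs(hw - goal)
    if err <= 0.05:
        return 10.0        # innermost zone: largest reward
    if err <= 0.1:
        return 5.0         # +/- 0.1 s zone: large positive reward
    if hw < 0.5 or hw > 8.0:
        return -10.0       # dangerously close or very far: negative reward
    if err < abs(hw_prev - goal):
        return 0.5         # reward shaping: small bonus for progressing toward the goal
    return -0.5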
5.5.3. The Stochastic Control Policy
A reinforcement learning agent learns from the consequences of its state-action pairs rather than from being explicitly taught, and it selects its actions on the basis of its past experiences as well as new choices. If we could visit each state-action pair (s, a) a sufficiently large number of times, we could obtain the state values via, for example, Monte Carlo methods. However, this is not realistic and, even worse, many state-action pairs would not be visited even once. It is therefore important to deal with the exploration-exploitation trade-off.
In our work, we adopt a Boltzmann distribution to express a stochastic control policy. The learning agent tries out actions probabilistically based on their Q-values. Given a state s, the stochastic policy outputs an action a with probability
\[
\pi(s, a) = P(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{b \in \mathcal{A}} e^{Q(s,b)/T}}, \tag{5.22}
\]
where T is the temperature that controls the stochasticity of the action selection. If T is high, all the action Q-values tend to be weighted equally, and the agent chooses actions essentially at random. If T is low, differences between the action Q-values matter, and the action with the highest Q-value tends to be picked. Note that $P(a \mid s) \propto e^{Q(s,a)/T} > 0$ and $\sum_a P(a \mid s) = 1$.
We do not fix the temperature to a constant, since random exploration throughout the whole self-learning process would take too long to focus on the best actions. At the beginning, all the Q(s, a) estimates are inaccurate, so a high T is set to guarantee exploration, i.e., that all actions have a roughly equal chance of being selected. As time goes on, a large amount of random exploration has been done, and the agent can gradually exploit its accumulating knowledge. Thus, the agent decreases T, and the actions with the higher Q-values become more and more likely to be picked. Finally, as we assume Q is converging to $Q^*$, T approaches zero (pure exploitation) and we tend to pick only the action with the highest Q-value:
\[
P(a \mid s) =
\begin{cases}
1, & \text{if } Q(s, a) = \max_{b \in \mathcal{A}} Q(s, b) \\
0, & \text{otherwise}
\end{cases} \tag{5.23}
\]
In sum, the agent starts with high exploration and shifts to exploitation as time goes on, so that after a while it only selects (s, a) pairs that have worked out at least moderately well before.
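A sketch of the Boltzmann selection of Equation (5.22) with a decaying temperature is given below; the particular decay schedule is an assumption, since the text only states that T starts high and decreases toward zero.

import numpy as np

def boltzmann_action(q_values, T):
    """Sample an action index from the Boltzmann policy of Equation (5.22)."""
    prefs = np.asarray(q_values, dtype=float) / T
    prefs -= prefs.max()                       # numerical stability; probabilities unchanged
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(probs), p=probs)

def temperature(step, T0=5.0, T_min=0.05, decay=0.999):
    """Illustrative schedule: exploratory at first, near-greedy in the limit."""
    return max(T_min, T0 * decay ** step)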
5.5.4. State-Action Value Iteration
The Q-value function expresses the mapping from the perceived state of the environment to the action to execute. One Q-value Q(s_t, a_t) corresponds to one specific state and one action in that state. As in much RL research, the state and action spaces are large. Traditionally, all the state-action values are stored in a Q-table. However, this is impractical and computationally expensive for large-scale problems. In our method, we propose to predict all the action Q-values of a state by using a three-layer neural network, as shown in Figure 5.4.
The inputs are the state features that the vehicle perceives in the surrounding environment, and the outputs correspond to all the action Q-values. Therefore, according to Equations (5.20) and (5.21), the network has 3 neurons in the input layer and 3 in the output layer. Moreover, 8 neurons are used in the hidden layer.
The bias units are set to 1. The weight matrix $W^{(1)} \in \mathbb{R}^{8 \times 4}$ connects the input layer to the hidden layer, and similarly the weight matrix $W^{(2)} \in \mathbb{R}^{3 \times 9}$ links the hidden layer to the output layer. The sigmoid function is used to compute the activations in the hidden and output layers.

Figure 5.4 – A three-layer neural network architecture
We denote by $Q(s_t)$ the vector of all action values in the state $s_t$, and use $Q(s_t, a_t)$ to specify the Q-value of taking $a_t$ in $s_t$. Thus,
\[
Q(s_t) = \begin{bmatrix} Q(s_t, a_1) \\ Q(s_t, a_2) \\ Q(s_t, a_3) \end{bmatrix}.
\]
The action-value iteration is realized by updating the neural network by means of its weights. In the previous chapter, the neural network was applied to supervised learning, where the label for each training state-action pair was explicitly provided. In contrast, the neural network in reinforcement learning does not have labeled outputs. Q-learning is a process of value iteration, and the optimal value after each iteration serves as the target value for neural network training. The update rule is
\[
Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a \in \mathcal{A}} Q_k(s_{t+1}, a) - Q_k(s_t, a_t) \right], \tag{5.24}
\]
where the initial action values $Q_0$ of all the state-action pairs are generated randomly between 0 and 1. $Q_{k+1}(s_t, a_t)$ is treated as the target value for $Q_k(s_t, a_t)$ in the $(k+1)$-th iteration.
In the vector $Q_k(s_t)$, only $Q_k(s_t, a_t)$ is updated to $Q_{k+1}(s_t, a_t)$, and the remaining elements stay unchanged. Sometimes $Q_{k+1}(s_t, a_t)$ may exceed the range [0, 1]; then we need to rescale $Q_{k+1}(s_t)$ to make sure all its components are in [0, 1]. We denote by $\overline{Q}_{k+1}(s_t)$ the rescaled action-value vector. To make it clear, the update of the Q-value is realized along the chain $Q_k \to Q_{k+1} \to \overline{Q}_{k+1}$.
The network error is a vector of the form
\[
\delta_{k+1} = \overline{Q}_{k+1}(s_t) - Q_k(s_t). \tag{5.25}
\]
We employ stochastic gradient descent (SGD) to train the neural network online. The goal is to minimize the cross-entropy cost function J defined as
\[
J = - \sum_{i=1}^{N_A} \left[ (\overline{Q}_{k+1})_i \log (Q_k)_i + \big(1 - (\overline{Q}_{k+1})_i\big) \log\big(1 - (Q_k)_i\big) \right], \tag{5.26}
\]
where $N_A$ is the number of actions used for training. In our task, $N_A = 3$.
The action Q-values are nonlinear functions of the weights of the network. SGD optimizes J and updates the weights using one or a few training examples according to
\[
W^{(i)} \leftarrow W^{(i)} - \alpha \frac{\partial J}{\partial W^{(i)}}. \tag{5.27}
\]
Each iteration outputs new weights $W^{(i)}$, and a new cost $J'$ is calculated. This update repeats until it reaches a maximum number of iterations or until $|J' - J| < \varepsilon$.
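One value-iteration step as described above (build the target of Equation (5.24), rescale it to [0, 1], and form the network error of Equation (5.25)) can be sketched as follows. Clipping is used here as one possible reading of the rescaling step, and the function name is an assumption of ours; q_vec and q_next_vec are NumPy vectors of action values.

import numpy as np

def q_target(q_vec, a_idx, r, q_next_vec, alpha=0.5, gamma=0.9):
    """Training target for the network: Equation (5.24) followed by rescaling."""
    target = q_vec.astype(float).copy()
    target[a_idx] += alpha * (r + gamma * q_next_vec.max() - q_vec[a_idx])
    return np.clip(target, 0.0, 1.0)  # rescaled target vector for state s_t

# Network error of Equation (5.25), used to drive one SGD update:
# delta = q_target(q_vec, a_idx, r, q_next_vec) - q_vec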
5.5.5. Algorithm
The longitudinal control problem via NNQL can be divided into two processes: the training process, which endows the vehicle with the self-learning ability, and the tracking process, which uses the trained policy to execute an independent tracking task.
5.5.5.1. Training Process of NNQL
Training the vehicle is done by exposing it to a set of learning episodes, each with a different environment. The variety helps the vehicle to encounter as many situations as possible, which accelerates learning.
Training efficiency depends greatly on how the accumulated sequence of state-action pairs and their Q-values is used. Much previous work [124, 69, 179] used one-step Q-learning to update one Q-value at a time: when the vehicle is at a new state, only the new Q-value is updated and the previous action values are discarded. Others used batch learning [134], which updates all the Q-values once they have all been collected. This also has drawbacks. First, without online updates, we cannot guarantee that the collected Q-values have their optimal target values. Moreover, waiting for all the values to be obtained wastes time. We propose to update not only the current Q-value online but also to gather the previous values and train on them together.
The learning algorithm is given in Algorithm 4.
Algorithm 4: Training algorithm of NNQL
1: Initialize the NN weights W(1) and W(2) randomly;
2: for all episodes do
3:    Initialize the leading vehicle state;
4:    Read the sensor inputs;
5:    Observe current state s1;
6:    t ← 1;
7:    for all moving steps do
8:       Compute all action values {Q(st, ai)}i in state st via the NN;
9:       Select one action at according to the stochastic policy π(s, a) in (5.22), and then execute it;
10:      Observe new state st+1 and state property pt+1;
11:      Obtain the immediate reward rt;
12:      Update the Q-value function from Qk(st, at) to Qk+1(st, at) via (5.24);
13:      Apply feature scaling to Q so that it lies in the range [0, 1];
14:      Apply SGD to train on (input, target) and update the weights W(1) and W(2);
15:      t ← t + 1;
16:   end for
17: end for
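Tying these pieces together, the inner loop of Algorithm 4 could be sketched as follows, reusing the hypothetical helpers from the previous sketches (boltzmann_action, q_target). The methods net.predict_q and net.sgd_fit stand for a forward pass and one SGD update of the three-layer network; they are assumptions, not the thesis code.

def run_episode(env, net, T, alpha_q=0.5, gamma=0.9):
    """One training episode of Algorithm 4 (illustrative sketch)."""
    s = env.reset()                            # initialize leader, read sensors (steps 3-5)
    done = False
    while not done:
        q = net.predict_q(s)                   # all action values via the NN (step 8)
        a = boltzmann_action(q, T)             # stochastic policy of (5.22) (step 9)
        s_next, r, done = env.step(a)          # observe s_{t+1} and reward (steps 10-11)
        target = q_target(q, a, r, net.predict_q(s_next), alpha_q, gamma)  # steps 12-13
        net.sgd_fit(s, target)                 # SGD update of W(1), W(2) (step 14)
        s = s_next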
5.5.5.2. Tracking Problem Using NNQL
After training the vehicle, the resulting policy is still stochastic but close to deterministic; it is the policy used by the vehicle for future tracking problems in various environments.
The tracking problem algorithm is shown in Algorithm 5.
Algorithm 5: Tracking problem using NNQL
1: Load the trained NN weights W(1) and W(2);
2: Initialize the leading vehicle state randomly;
3: Load the vehicle initial state;
4: t ← 1;
5: for all moving steps do
6:    Observe current state st and state property pt;
7:    Compute all action Q-values {Q(st, ai)}i via the neural network;
8:    Pick the moving action at according to the greedy policy, and then move;
9: end for
5.6. Experimental Results
Due to the stochastic nature of the learning algorithm, a hundred learning simulations, resulting in a hundred different control policies, have been executed. After the learning phase, the policy that obtained the highest reward sum is chosen and tested in the Stop-and-Go scenario that was used for learning. The results are presented in Figures 5.5 and 5.6, which show, respectively, the accelerations and velocities of both vehicles, the inter-vehicle distance, and the headway time in the simulation.
The headway response of the following vehicle, as shown in Figure 5.6b, indicates that, when the front vehicle is braking, the follower is able to keep a safe distance by using the learned policy. During this period, the headway of the follower oscillates close to the desired value of 2 s (approximately from 1.95 s to 2.05 s). Note that this oscillatory behavior is due to the small number of discrete time steps that we defined in this simulation. From time steps 200 to 400, however, we can see that CACC drives the vehicle away from the desired headway, so that it gets closer to its front vehicle. This behavior results from the fact that, at this time step, the front vehicle has stopped accelerating.

Figure 5.5 – Acceleration and velocity response of tracking problem using RL: (a) acceleration response; (b) velocity response

Figure 5.6 – Inter-vehicle distance and headway time of tracking problem using RL: (a) inter-vehicle distance; (b) headway time

Thus, to select actions of the following
vehicle, its controller observes a constant velocity (acceleration of 0) of the front
vehicle and accordingly selects actions. In reality, at this time, the following vehicle is still rolling faster than the front vehicle (as shown in the velocity profile in Fig. 5.5b). As a consequence, the following vehicle has a tendency to get closer to the front vehicle, because it uses “no-op” actions although it should still be braking for a small amount of time. The RL approach for CACC is also interesting when looking at the acceleration response of the following vehicle. As shown in Fig. 5.5a, CACC does not need to use as much braking as the leader (around −4 m/s²), i.e., string stability is obtained. This is a consequence of the defined actions, where only a deceleration of −5 m/s² is considered.
Macroscopically, the performance is desirable, because it shows that there is no amplification of the velocity disturbance, which would otherwise result, within a vehicle platoon, in a complete halt of the traffic flow further downstream. Thus, string stability is kept, and the presence of the acceleration signal of the leader enables the learning of a better control policy.
5.7. Conclusion
In this chapter, we have proposed a novel design approach to obtain an autonomous longitudinal vehicle controller. To achieve this objective, a vehicle architecture with its CACC subsystem has been designed. With this architecture, we have also described the specific definitions for an efficient autonomous vehicle control policy through RL and the simulator in which the learning engine is embedded. The policy-gradient algorithm is used to optimize the policy, with a backpropagation neural network for achieving the longitudinal control. Experimental results in a Stop-and-Go scenario have then shown that the proposed RL approach results in efficient behavior for CACC.
Conclusions and Perspectives
Conclusions
In this thesis, we addressed the issue of CACC performance.
In chapter 1, a general introduction to intelligent road transportation systems was presented. Firstly, the current traffic problems and situation were introduced. Then several historical research efforts worldwide were presented. In order to reduce the accidents caused by human errors, autonomous vehicles are being developed by research organizations and companies all over the world. Research in autonomous vehicle development was introduced in this chapter as well. Secondly, ITS, AHS and the intelligent vehicle were introduced, which are considered the most promising solutions to the traffic problems. Thirdly, CACC, as an extension of ACC systems enabled by communication among the vehicles in a platoon, was presented. CACC systems relieve the driver of repetitive tasks like adjusting speed and distance to the preceding vehicle. Fourthly, V2X communication, an important technology in developing ITS, was introduced. VANETs are formed to enable communications among these agents, so that autonomous vehicles can be upgraded into cooperative systems, in which a vehicle's range of awareness can be extended. Finally, machine learning technology, which can be applied to intelligent vehicles, was introduced.
Chapter 2 presented the most important criterion to evaluate the performance of an intelligent vehicle platoon: string stability. Then the Markov decision processes, which are the underlying structure of reinforcement learning, were described in detail. Several classical algorithms for solving MDPs were also briefly introduced. The fundamental concepts of reinforcement learning were then presented.
Chapter 3 concentrated on the vehicle longitudinal control system design. The spacing policy and its associated control law were designed under the constraints of string stability. The CTH spacing policy was adopted to determine the desired spacing from the preceding vehicle. It was shown that the proposed TVACACC system could ensure both the stability of the individual vehicle and the string stability. In addition, through the comparisons between TVACACC and the conventional CACC and ACC systems, we could see the obvious advantages of the proposed system in improving traffic capacity, especially in high-density traffic conditions. The proposed longitudinal control system was validated to be effective through a series of simulations in the stop-and-go scenario.
In chapter 4, a degradation approach for TVACACC was presented, used as an alternative fallback strategy to ACC. The concept of the proposed approach is to retain the functionality of TVACACC with minimum loss when the wireless communication fails or when the preceding vehicle is not intelligent, i.e., not equipped with wireless communication units. The proposed degraded system, referred to as DTVACACC, uses the Kalman filter to estimate the preceding vehicle's current acceleration to replace the desired acceleration, which is normally communicated over a wireless V2V link in the conventional CACC system. Moreover, a switching criterion from TVACACC to DTVACACC was presented, for the case where the wireless communication is not (yet) lost completely but is suffering from increased transmission delay. Theoretical results have shown that the performance, in terms of string stability of DTVACACC, can be kept at a much higher level compared with an ACC fallback strategy. Both theoretical and experimental results have shown that the DTVACACC system outperforms the ACC fallback scenario by reducing the minimum string-stable time gap to less than half the value required in the case of ACC.
Finally, in chapter 5, we have proposed a novel approach to obtain an autonomous longitudinal vehicle CACC controller. To achieve this objective, a vehicle architecture with its CACC subsystem has been presented. Using this architecture, the specific requirements for an efficient autonomous vehicle control policy through RL, and the simulator in which the learning engine is embedded, are described. The policy-gradient estimation algorithm has been applied, and a backpropagation neural network has been used to achieve the longitudinal control. Experimental results, through Stop-and-Go scenario simulation, have then shown that this design approach can result in efficient behavior for CACC.
Future work
Much work can still be done to improve the performance of the vehicle longitudinal controller proposed in this thesis.
• Further experimental validation of the proposed framework, TVACACC, on a real platoon is part of future research. Moreover, varying headway times and communication delays should be considered, owing to different factors such as road conditions and weather.
• The approach to estimate the front vehicle's acceleration in case of loss of the V2V communication can be improved. In this thesis, we used a typical Kalman filter for estimation based on the inter-vehicle distance and relative speed. Other estimation technologies can be applied to improve the performance of CACC systems.
• The state and action of the vehicle in RL are not precisely defined. More factors of the vehicle state and action should be taken into account. The oscillatory behavior of our vehicle control policy could be resolved by using continuous actions. This approach would require further study to be realized efficiently, because it adds complexity to the learning process.
• Some elements of our simulation of the RL approach can also be improved, with the ultimate goal of having an even more realistic environment in which to run our learning experiments. In fact, an important aspect to consider, as we did in chapter 3, would be to build a more accurate simulation of the sensing and communication systems, including sensor and communication delays, data loss and noise. These factors would make the learning process more complex, but the results would be much closer to real-life environments.
• Our controller can also be completed by adding an autonomous lateral control system. Again, this issue can be tackled using RL, and a potential solution is to use a reward function in the form of a potential function over the width of a lane, similar to the force feedback given by existing lane-keeping assistance systems. This reward function would direct the driving agent toward learning an adequate lane-change policy.
Extended Summary
Introduction
This thesis is devoted to research on the application of intelligent control theory to future road transportation systems. Owing to the development of human society, the demand for transport is higher than at any other period of history. More flexible and more comfortable, private cars are preferred by many people. Moreover, the development of the automobile industry has reduced the cost of owning a car, so the number of cars has increased rapidly worldwide, especially in large cities. However, the growing number of cars makes our society suffer from traffic congestion, exhaust pollution and accidents. These negative effects require us to find solutions. In this context, the concept of Intelligent Transportation Systems (ITS) has been proposed. Scientists and engineers have been working for decades to apply multidisciplinary technologies to transportation, in order to obtain systems that are more stable, more efficient, more effort-saving and environmentally friendly.

One line of thought is the (semi-)autonomous system. The main idea is to use applications to assist or replace human operation and decision-making. Advanced Driver Assistance Systems (ADAS) are designed to help drivers by alerting them when a danger occurs (lane departure, forward collision warning), by providing additional information for decision-making (route planning, congestion avoidance) and by relieving them of repetitive maneuvers (adaptive cruise control, parking). In semi-automatic systems, the driving process still requires the human driver: the driver has to set certain parameters in the system, and can decide whether or not to follow the advisory assistance. Recently, with the improvement of sensing technologies and artificial intelligence, companies and institutes have engaged in the research and development of autonomous driving. In certain scenarios, for example highways and main roads, with the help of highly precise sensors and maps, hands-off and feet-off driving experiences could be achieved. The elimination of human error will make road transportation much safer, and the optimization of inter-vehicle spacing will improve the use of road capacity. However, cars still need the driver's anticipation in some scenarios with complicated traffic situations or limited information. The interior structure of autonomous vehicles would be no different from that of current cars, because the steering wheel and the pedals are still necessary. The next step after autonomous driving is driverless driving, that is, the car drives entirely by itself. The seat dedicated to the driver would disappear and the people on board would concentrate on their own business. The economic benefits of sharing driverless cars would be enormous: in the future, people would prefer a driverless car whenever they need a private car. Thus, congestion and pollution could be relieved.
Another line of thought is the cooperative system. Obviously, in current road transportation the notifications are designed for human drivers, such as traffic lights and roadside signs. Current autonomous vehicles are equipped with cameras dedicated to detecting these signs. However, human-oriented notifications are not efficient enough for autonomous vehicles, because the use of cameras is limited by range and visibility, and algorithms have to be designed to recognize these signs. If the interaction between vehicles and the environment is enabled, notifications can be transmitted via Vehicle-to-X (V2X) communications. Vehicles can thus be notified at a greater distance, even beyond line of sight, and the transmitted information is more precise than what is detected by sensors. When the communication rate of driverless cars is high enough, physical traffic lights and signs would no longer be necessary. Virtual personal traffic signs can be communicated to individual vehicles by the traffic manager. In cooperative systems, an individual does not need to acquire all the information through its own sensors, but can rely on the help of the others through communication. Consequently, autonomous intelligence can be extended to cooperative intelligence.

The research presented in this thesis focuses on the development of applications to improve the safety and efficiency of intelligent transportation systems in the context of autonomous vehicles and V2X communications. This research therefore targets cooperative systems. Control strategies are designed to define the way in which vehicles interact with each other.
Main Contributions
A new decentralized Two-Vehicle-Ahead Cooperative Adaptive Cruise Control (TVACACC) system is proposed in this thesis. It is shown that the proposed controller, with the desired accelerations of the two preceding vehicles as inputs, makes it possible to reduce the inter-vehicle distance by using a velocity-dependent spacing policy. Moreover, a frequency-domain approach to string stability is theoretically analyzed. By using multiple wireless communication among the vehicles, a better string stability is demonstrated compared to the conventional system, which results in weaker disturbances. A vehicle platoon in the Stop-and-Go scenario is simulated with degraded V2V communication. It is shown that the proposed system yields string-stable behavior.

A graceful degradation technique is proposed for CACC, which constitutes an alternative fallback to ACC. The idea of the proposed approach is to obtain the minimum loss of CACC functionality when the wireless communication fails or the preceding vehicle is not equipped with a wireless communication module. The proposed strategy, called Degraded TVACACC (DTVACACC), uses an estimate of the preceding vehicle's current acceleration as a replacement for the desired acceleration, which is normally transmitted by wireless communication.

A new design approach to obtain an autonomous longitudinal vehicle controller is proposed. To achieve this objective, a CACC vehicle architecture has been presented. With this architecture, we have described the specific requirements for efficient autonomous vehicle control through Reinforcement Learning (RL) and the simulator in which the learning engine is embedded. A policy-gradient estimation algorithm has been introduced, using a backpropagation neural network for the longitudinal control.
Conclusions and Perspectives
In this thesis, we addressed the issue of CACC performance. In chapter 1, an introduction to intelligent road transportation systems was presented. First of all, the current traffic problems and situation were introduced. Then, several historical research efforts around the world were presented. In order to reduce the accidents caused by human errors, autonomous vehicles are being developed by research organizations and companies all over the world. The development of such vehicles was also introduced in this chapter. Secondly, ITS, AHS and the intelligent vehicle were introduced, which are considered promising solutions to the traffic problems. Thirdly, CACC, as an extension of ACC systems enabled by communication among the vehicles of a platoon, was presented. CACC systems relieve the driver of repetitive tasks by maintaining speed and inter-vehicle distance in a more optimized way than ACC and CC systems. Fourthly, V2X communication, an important technology in the development of ITS, was introduced. VANETs are formed to enable communication among the agents, so that autonomous vehicles can be upgraded into cooperative systems, in which a vehicle's range of awareness is extended. Finally, machine learning technology, which can be applied to intelligent vehicles, was introduced.

Chapter 2 presented the most important criterion for evaluating the performance of an intelligent vehicle platoon: string stability. Then the Markov Decision Process (MDP), which is the underlying structure of Reinforcement Learning (RL), was described in detail. Several classical algorithms for solving MDPs were also briefly introduced. The fundamental concepts of RL were then presented.

Chapter 3 concentrated on the design of the vehicle longitudinal control system. The spacing policy and its associated control law were designed under string stability constraints. The CTH spacing policy was adopted to determine the desired spacing from the preceding vehicle. It was shown that the proposed TVACACC system could ensure both the stability of the individual vehicle and the string stability. In addition, through comparisons between TVACACC, conventional CACC and ACC, we demonstrated the clear advantages of the proposed system in improving traffic capacity, particularly under high-density traffic conditions. The proposed longitudinal control system was validated through a series of simulations in the stop-and-go scenario.

In chapter 4, a graceful degradation technique for CACC was presented, as an alternative fallback scenario to ACC. The idea of the proposed approach is to obtain the minimum loss of CACC functionality when the wireless link fails or when the preceding vehicle is not equipped with wireless communication. The proposed strategy, called DTVACACC, uses the Kalman filter to estimate the preceding vehicle's current acceleration as a replacement for the desired acceleration, which is normally communicated over a wireless link for this type of CACC. In addition, a criterion for switching from TVACACC to DTVACACC was presented, for the case where the wireless communication is not (yet) lost but exhibits an increased delay. It was shown that the performance, in terms of string stability of DTVACACC, can be maintained at a much higher level than with an ACC system. Theoretical and experimental results showed that the DTVACACC system outperforms ACC in terms of string stability characteristics, reducing the minimum time gap to half the value required in the case of ACC.

Finally, in chapter 5, we proposed a new learning approach to obtain a longitudinal vehicle cruise controller. To achieve this, a vehicle architecture within CACC was presented. With this architecture, we also described the specific requirements of an autonomous vehicle, the control policy through RL, and the simulator in which the learning engine is embedded. An algorithm estimation method, the policy gradient, was introduced and used with a backpropagation neural network to realize the longitudinal control. Experimental results, through simulation, showed that this design approach can lead to efficient behavior for CACC.
Much work can still be done to improve the vehicle controller proposed in this thesis.

Further experimental validation of the proposed framework, TVACACC, on a real vehicle platoon is part of future research. In addition, varied time gaps and communication delays may have to be taken into account, owing to different factors, for example road and weather conditions.

The approach for estimating the preceding vehicle's acceleration in case of loss of the V2V communication can be improved. In this thesis, we used a typical Kalman filter for estimation based on the inter-vehicle distance and the relative speed. Other estimation techniques can be applied to improve the degraded CACC system.

The state and action of the vehicle in RL are not precisely defined. More factors of the vehicle's state and action should be taken into account. Problems related to the oscillatory behavior of our vehicle control policy can be alleviated by continuous actions. This case would require a more thorough study of the approach, because it brings additional complexity to the learning process.

Some elements of our simulation of the RL approach can also be improved, with the ultimate goal of an even more realistic environment. In fact, an important aspect to consider, as we did in chapter 3, would be to integrate a more accurate simulator for the sensing and communication systems, that is, sensor and communication delays, with data loss and noise. This would make the learning process more complex, but the resulting environment would be much closer to real conditions.

Our controller can also be completed with an autonomous lateral control system. Again, this can be done using RL. One possible solution is to use a reward function in the form of a potential function over the lane, similar to the force feedback currently given by existing lane-keeping assistance systems. This reward function will surely direct the driving agent toward an adequate lane-change policy.
Bibliography
[1] Pieter Abbeel, Adam Coates, and Andrew Y Ng. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 2010. (Cited page 33.)

[2] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. Advances in Neural Information Processing Systems, 19:1, 2007. (Cited page 33.)

[3] J. Abele, C. Kerlen, S. Krueger, H. Baum, et al. Exploratory study on the potential socio-economic impact of the introduction of intelligent safety systems in road vehicles. Final report, SEISS, Teltow, Germany, January 2005. (Cited page 14.)

[4] M. Ali, Z. Hou, and M. Noori. Stability and performance of feedback control systems with time delays. Computers & Structures, 66(2-3):241–248, 1998. (Cited page 82.)

[5] Andrew H. Card. Hearing before the subcommittee on investigations and oversight of the committee on science, space and technology. U.S. House of Representatives, 103rd Congress, First Session, pp. 108–109, U.S. Printing Office, November 1993. (Cited page 19.)

[6] Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009. (Cited page 33.)
[7] Bassam Bamieh, Fernando Paganini, and Munther A Dahleh. Distributed control of spatially invariant systems. IEEE Transactions on Automatic Control, 47(7):1091–1107, 2002. (Cited page 39.)
[8] E Barbieri. Stability analysis of a class of interconnected systems. Journal
of Dynamic Systems, Measurement, and Control, 115(3):546–551, 1993. (Cited
page 39.)
[9] Lakshmi Dhevi Baskar, Bart De Schutter, J Hellendoorn, and Zoltan Papp.
Traffic control and intelligent vehicle highway systems: a survey. Intelligent
Transport Systems, IET, 5(1):38–52, 2011. (Cited page 20.)
[10] Dimitri P Bertsekas. Dynamic Programming and Optimal Control, volume 1.
Cooperative Adaptive Cruise Control Performance Analysis
Abstract: This PhD thesis is dedicated to the performance analysis of Cooperative Adaptive Cruise Control (CACC) systems for intelligent vehicle platoons, with the main aims of alleviating traffic congestion and improving traffic safety. A frequency-domain approach to string stability is presented, string stability being generally defined as the requirement that the disturbance of the leading vehicle is not amplified through the upstream of the platoon.

At first, the Constant Time Headway (CTH) spacing policy for vehicle platoons is introduced. Based on this spacing policy, a novel decentralized Two-Vehicle-Ahead CACC (TVACACC) system is proposed, in which the desired accelerations of the two front vehicles are taken into account. The string stability of the proposed system is then theoretically analyzed. It is shown that, by using multiple wireless communication among vehicles, a better string stability is obtained compared to the conventional system. A vehicle platoon in the Stop-and-Go scenario is simulated with both normal and degraded communication, including high transmission delay and data loss. The proposed system yields a string-stable behavior, in accordance with the theoretical analysis.

Secondly, a graceful degradation technique for CACC is presented, as an alternative fallback strategy when wireless communication is lost or badly degraded. The proposed strategy, referred to as DTVACACC, uses a Kalman filter to estimate the preceding vehicle's current acceleration as a replacement for the desired acceleration. It is shown that the performance, in terms of string stability of DTVACACC, can be maintained at a much higher level.

Finally, a Reinforcement Learning (RL) approach for the CACC system is proposed. The policy-gradient algorithm is introduced to achieve the longitudinal control. Simulation has then shown that this new RL approach results in efficient performance for CACC.

Keywords: Intelligent Transportation Systems, Autonomous Vehicles, Cooperative Adaptive Cruise Control, Performance Analysis, Longitudinal Control, Communication Degradation, Reinforcement Learning.