A new reinforcement learning-based variable speed limit control approach to improve traffic efficiency against freeway jam waves
Yu Han*a, Andreas Hegyib, Le Zhangc, Zhengbing Hed, Edward Chunge, Pan Liu*a
a School of Transportation, Southeast University, Nanjing, China
b Department of Transport and Planning, Delft University of Technology, the Netherlands
c School of Economics and Management, Nanjing University of Science and Technology, Nanjing, China
d Beijing Key Laboratory of Traffic Engineering, Beijing University of Technology, China
e Department of Electrical Engineering, Hong Kong Polytechnic University, Hong Kong, China
Abstract
Conventional reinforcement learning (RL) models of variable speed limit (VSL) control systems (and traffic control systems in general) cannot be trained in the real traffic process because new control actions are usually explored randomly, which may result in high costs (delays) due to exploration and learning. For this reason, existing RL-based VSL control approaches need a traffic simulator for training. However, the performance of those approaches is dependent on the accuracy of the simulators. This paper proposes a new RL-based VSL control approach to overcome the aforementioned problems. The proposed VSL control approach is designed to improve traffic efficiency by using VSLs against freeway jam waves. It applies an iterative training framework, where the optimal control policy is updated by exploring new control actions both online and offline in each iteration. The explored control actions are evaluated in the real traffic process, which avoids the RL model learning only from a traffic simulator. The proposed VSL control approach is tested using a macroscopic traffic simulation model that represents real-world traffic flow dynamics. By comparison with existing VSL control approaches, the proposed approach is demonstrated to have advantages in the following two aspects: (i) it alleviates the impact of model mismatch, which occurs in both model-based VSL control approaches and existing RL-based VSL control approaches, by replacing knowledge from the models with knowledge from the real process, and (ii) it significantly reduces the exploration and learning costs compared to existing RL-based VSL control approaches.
Keywords: Variable speed limits, Freeway traffic control, Reinforcement learning, Data-driven approach
1. Introduction
A jam wave, also known as a wide moving jam or shock wave in some studies, e.g., (Kerner and Rehborn, 1996; Hegyi et al., 2005b), is a common type of traffic jam on freeways. A jam wave usually originates from a traffic breakdown that occurs due to high traffic demand, and its head and tail both propagate upstream. Various empirical studies have distilled some common features of jam waves. For example, the propagation speed of a jam wave is roughly constant, typically between 15 and 20 km/h (Kerner, 2002). A jam wave can propagate for a long time and over a long distance, and resolves only when the traffic demand decreases (Kerner and Rehborn, 1996). The queue discharge rate from a jam wave is typically around 30 percent lower than the free-flow capacity (Schönhof and Helbing, 2007). Jam waves create many problems, including capacity reduction, travel delays, and safety risks. Therefore, eliminating jam waves can greatly improve freeway traffic operation efficiency.
One way to alleviate jam waves is to avoid the activation of infrastructural bottlenecks, e.g., on-ramp bottlenecks, by applying traffic control measures such as ramp metering and variable speed limits (Hadiuzzaman et al., 2013; Lu et al., 2015). As jam waves usually originate from standing queues that form at infrastructural bottlenecks, removing those bottlenecks may significantly reduce the occurrence of jam waves. However, due to the limited storage space at on-ramps, on-ramp bottlenecks cannot be fully avoided by ramp metering. On the other hand, variable speed limits (VSLs) can reduce the mainstream flow upstream of a bottleneck, so as to avoid the activation of the bottleneck. Carlson et al. (2011) proposed a feedback-based variable speed limit control method for local bottlenecks. Chen et al. (2014); Chen and Ahn (2015) developed analytical VSL approaches based on kinematic wave theory for recurrent and non-recurrent infrastructural bottlenecks. Hegyi et al. (2005a); Lu et al. (2010); Zhang and Ioannou (2016); Carlson et al. (2014) combined VSLs with other control measures, such as ramp metering and lane-changing control, to improve traffic operation efficiency at infrastructural bottlenecks. Carlson et al. (2010b); Wang et al. (2020) proposed optimal VSL control methods for large-scale freeway networks. The aforementioned VSL control approaches may create a high-density region in or upstream of the VSL-controlled area, which may trigger new jam waves. Hence, traffic control measures aimed at eliminating stationary bottlenecks may not be able to fully avoid the formation of jam waves.
Another way to alleviate jam waves is to suppress them after their formation using VSLs. Different theories and algorithms exist to determine the parameter values of the VSLs. The SPECIALIST algorithm proposed by Hegyi et al. (2008) is an analytical approach for determining VSL parameters using shockwave theory (Lighthill and Whitham, 1955; Richards, 1956). It was successfully implemented and tested in practice (Hegyi and Hoogendoorn, 2010). However, since the SPECIALIST algorithm has a feed-forward structure, disturbances that occur after the activation of a VSL scheme cannot be handled. Hegyi et al. (2005b) presented a model predictive control (MPC) approach of VSLs, where the design was based on a macroscopic second-order traffic flow model, METANET (Messmer and Papageorgiou, 1990; Kotsialos et al., 2002a). The nonlinear and non-convex formulation of METANET-based MPC approaches may result in a high computational load, especially if the optimization is solved by a standard SQP algorithm (Hegyi et al., 2005a). Moreover, globally optimal VSL control is often unattainable for that type of approach (Frejo and Camacho, 2012; Frejo et al., 2014). Muralidharan and Horowitz (2015); Roncoli et al. (2015); Hadiuzzaman and Qiu (2013); Han et al. (2017b); Zhang and Ioannou (2018) developed simpler MPC approaches with less computational complexity based on the cell transmission model and its variants. However, those models cannot accurately reproduce the propagation of jam waves (Han et al., 2016). Han et al. (2017b, 2021) proposed MPC approaches of VSLs based on discrete first-order traffic flow models formulated in Eulerian and Lagrangian coordinates; due to the linear formulations of the optimal controllers, those approaches significantly improved the computational efficiency. Despite the successful demonstration of the above MPC approaches via simulation, MPC for traffic systems is in general difficult to implement in practice, partially because MPC approaches are sensitive to the accuracy of the prediction models.
In recent years, data-driven approaches such as reinforcement learning (RL) have attracted greater attention in the realm of road traffic control as more traffic data become available. RL applications to road traffic control were initially investigated in urban traffic networks for traffic signal optimization problems (Arel et al., 2010; Prashanth and Bhatnagar, 2010; El-Tantawy et al., 2013; Li et al., 2016; Ozan et al., 2015). Regarding freeway traffic control, most RL applications have focused on improving traffic operation efficiency at local bottlenecks. Davarynejad et al. (2011) addressed a local ramp metering problem considering the storage capacity of on-ramps using a Q-learning algorithm. Li et al. (2017) presented a Q-learning-based VSL control approach for recurrent freeway bottlenecks. Schmidt-Dumont and van Vuuren (2015) proposed a decentralized RL approach that integrated ramp metering and VSLs. Belletti et al. (2017) presented a deep RL-based ramp metering strategy; in a simulation test, the strategy achieved a control performance comparable to the classical feedback ramp metering method ALINEA (Papageorgiou et al., 1991). Wu et al. (2020) proposed a deep actor-critic algorithm of lane-based VSLs to eliminate recurrent freeway bottlenecks. Han et al. (2022) proposed a physics-informed reinforcement learning approach for local and coordinated ramp metering.
Most existing RL-based traffic control approaches train their RL models using traffic simulators. Therefore, similar to the aforementioned MPC approaches, which are sensitive to the accuracy of the prediction models, the control performance of those RL-based approaches is also dependent on the accuracy of the simulators. Nevertheless, the training processes of those RL models cannot be performed in the real world, for two reasons. First, control actions are usually explored randomly in those approaches. Such action exploration can only be performed in a simulation environment, as a real traffic control system cannot accept randomly generated control actions that may lead to very poor traffic performance. Second, the training process with random exploration may require a large amount of training data, which may not be feasible to collect because the speed of data collection in the real world is restricted by physical time and the "slowness" of the traffic process. Furthermore, training those RL models using historical field data is also infeasible, because effective training data collected from the field are lacking: traffic flows in the real world are regulated by a limited number of pre-defined control strategies. In addition, many practical traffic control systems are not used for eliminating traffic jams or improving traffic efficiency. For example, many traffic signal control systems and speed control systems in reality only implement fixed signal timing plans and fixed speed limit values. The field data collected from those control systems cannot be used for training an RL model. Therefore, it remains a challenge to develop RL-based traffic control strategies for real-world implementation.
In this paper, we propose a new RL-based VSL control approach that trains the RL model on both offline synthetic data and data collected from the real system, where the real data gradually replace the synthetic data. The proposed VSL control approach consists of an offline training process and an online control process, which interact iteratively. In the online control process, data on the states, control actions, and the associated performance are collected as they occur in the real traffic process. In the offline training process, the data collected online are fed into a learning algorithm to update the control policy. To explore new control actions that may lead to better traffic performance, synthetic data generated from a macroscopic traffic flow model are also added to the training data set in the offline process. In the online control process, the VSL control policy obtained from the offline training process is applied to regulate the traffic flow, and at the same time a new batch of data is collected. Over the course of the iterations, the control performance is expected to improve as more real data are utilized by the RL.
The proposed approach is tested using the METANET model, which simulates real-world traffic flow dynamics; therefore, in this paper the data generated from the METANET model are referred to as real data. To reproduce the difference between the traffic prediction model and the real traffic process, we use another traffic flow model, the extended cell transmission model (CTM), as the offline data generation model. Data generated from the extended CTM are referred to as synthetic data. To demonstrate the performance of the proposed approach against model mismatch, it is compared with an MPC approach that uses the same extended CTM for prediction and with an existing RL-based VSL control approach that also uses the same extended CTM for training. The proposed approach is also compared with an existing RL-based VSL control approach with random exploration, to demonstrate its performance in reducing the exploration and learning costs.
The rest of this paper is organized as follows. Section 2 describes the VSL control problem. Section 3 presents the RL-based VSL control approach, including the offline training and online control processes. Section 4 describes the simulation design for testing the proposed approach, and Section 5 discusses the simulation results. The conclusions and topics for future research are discussed in Section 6.
2. The RL-based VSL control problem
This section presents the RL-based VSL control problem addressed in this paper. Section 2.1 describes the VSL
control mechanism in resolving freeway jam waves. Section 2.2 defines the RL-based VSL control problem. A
solution algorithm to that problem is presented in Section 2.3.
2.1. VSL control mechanism
As presented in Hegyi et al. (2008), two types of traffic jams are usually identified on freeways. Traffic jams with the head fixed at the bottleneck are known as standing queues, and jams that have an upstream-moving head and tail are known as jam waves (also called wide moving jams in some studies, e.g., Kerner and Rehborn (1996)). Fig. 1 shows a jam wave and a standing queue observed in real data. Both types of traffic jams can be eliminated by VSLs, based on two different mechanisms explained as follows.
Figure 1: Examples of a jam wave and a standing queue observed in real data. Data were collected from Dutch freeway A20 on January 23, 2006.
Standing queues form at infrastructural bottlenecks, e.g., an on-ramp bottleneck or a lane-drop bottleneck. VSL control strategies against infrastructural bottlenecks are developed based on the assumption that VSLs below the critical speed lead to a fundamental diagram with a lower capacity than under normal conditions. The application of VSLs upstream of a bottleneck permanently reduces the mainstream arriving flow, so as to avoid the bottleneck activation and the related throughput reduction resulting from the capacity drop. Capacity flow can then be established at the downstream bottleneck and the mainstream throughput is maximized, leading to a decrease of the total time spent. Fig. 2 shows the mechanism schematically. This mechanism forms the basis of the VSL control strategies in many studies, such as Carlson et al. (2010a,b); Hadiuzzaman et al. (2013); Li et al. (2017); Wang et al. (2020).
Figure 2: The mechanism of VSLs against infrastructural bottlenecks. The on-ramp is a potential bottleneck. q_{out} is the outflow of the VSL-controlled area, and q^c_{VSL} is the VSL-induced capacity.
The mechanism of VSLs in eliminating jam waves is different from that against standing queues. SPECIALIST is one of the earliest theories that systematically explained the mechanism of VSLs against jam waves (Hegyi et al., 2008). In Fig. 3, the time-space graph (left) shows the traffic states on a road stretch and their propagation over time. The density-flow diagram (right) shows the corresponding density and flow values for these states. According to kinematic wave theory, the boundary (front) between two states in the left figure has the same slope as the line connecting the two states in the right figure. Area 2 represents a jam wave that propagates upstream and is surrounded by traffic in free flow (areas 1 and 6). As soon as the jam wave is detected, VSLs are applied directly upstream of the jam wave, where the traffic state changes from state 6 to state 3. Subsequently, the size of the jam wave (area 2) is reduced because the inflow to the jam is lower than its outflow. The required length of the speed-limited stretch to resolve the jam depends on the density and flow associated with state 2 and the physical length of the detected jam. When the jam wave is resolved, there remains an area with the speed limits active (state 4) with a moderate density (higher than in free flow, lower than in the jam wave). It was assumed that the traffic from area 4 can flow out more efficiently than a queue discharging from full congestion, as in the jam wave (the flow of state 2). This assumption was confirmed in later research by analyzing the data from the SPECIALIST field test (Hegyi and Hoogendoorn, 2010).
The similarity between these two VSL control mechanisms is that they both assume the traffic jams are associated with a capacity drop, and that the major benefit of VSLs is to reduce travel delays by eliminating the capacity drop. The difference is that the two mechanisms eliminate the capacity drop in different ways, which may lead to different consequences. The mechanism of VSLs against jam waves takes advantage of the transition flow created by VSLs, which only lasts for a relatively short period of time, just long enough to resolve the jam. As it aims to keep the VSL-induced density at a moderate value (e.g., area 4 in SPECIALIST), these VSLs can keep the traffic stable under the speed limits. It is assumed that the demand is always lower than the free-flow capacity, so that the jam wave can be resolved without creating new congestion. On the other hand, VSLs against standing queues do not need that assumption, because even if the demand at the bottleneck exceeds the capacity, the traffic system still benefits from eliminating the capacity drop and maximizing the throughput. However, new congestion may be created when VSLs are applied to eliminate standing queues. For example, Papageorgiou et al. (2008) found that the VSL-induced capacity may be lower than the free-flow capacity. At a different site, however, Soriguera et al. (2017) could not identify any permanent flow reduction attributable to VSLs, even with a speed limit value as low as 40 km/h. Under such circumstances, speed limits lower than 40 km/h are needed to create a sufficiently low flow to eliminate the standing queue, which creates new congestion upstream of the VSL-controlled area.
Most of the experiments (both simulations and field tests) on VSLs against jam waves have been performed on a homogeneous freeway stretch (Hegyi et al., 2008; Han et al., 2017b, 2021). For VSLs against standing queues, some strategies have been tested via macroscopic simulation on larger freeway networks that include multiple on- and off-ramps (Carlson et al., 2010b; Wang et al., 2020).
Figure 3: Illustration of traffic evolution under SPECIALIST (Hegyi et al., 2008). The left figure is the time-space graph (location [km] versus time [h]) and the right figure is the fundamental diagram (flow [veh/h] versus density [veh/km]). Areas 3 and 4 are the VSL-controlled areas.
In this paper, we focus only on jam waves, and the mechanism of the VSLs follows the theory of SPECIALIST. From the SPECIALIST field test, it was concluded that some jam waves were not successfully resolved because the VSL-induced flows were not sufficiently low (Han et al., 2017a). In other failed cases, it was found that new jam waves were triggered upstream of the VSL-controlled area because the densities in that area were too high (Hegyi and Hoogendoorn, 2010). Therefore, an effective VSL control scheme to improve traffic efficiency against freeway jam waves should be able to (i) create a sufficiently low flow to resolve the jam, and (ii) maintain the density of the VSL-controlled area at a moderate value so as to avoid triggering a new jam wave. In the next section we formulate an RL controller that is capable of both, using stretches of VSLs that are directly upstream of the jam and that can vary in length.
2.2. RL-based VSL control system
Reinforcement learning concerns the problem of a learning agent that interacts with its environment to achieve a goal (Sutton and Barto, 2018). The agent and the environment generally interact in discrete time steps. At each time step k, the agent takes an action a(k) based on the state s(k) received from the environment. The environment responds to the action by assigning a reward r(k) to the agent and presenting a new state, s(k+1). The agent's objective at time step k is to maximize the accumulated reward-to-go over a given time horizon,

G(k) = \sum_{\tau=k}^{K_T} \gamma^{\tau-k} r(\tau),    (1)

where K_T denotes the time index at which the state of the environment reaches the terminal state, r(\tau) is the reward received at time \tau, and \gamma (0 \le \gamma \le 1) is the discount factor, so that \gamma^{\tau-k} defines the relative importance of the reward at time \tau.
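As a side illustration, the reward-to-go in (1) can be accumulated backwards in a few lines of Python. This is a minimal sketch with our own function and variable names, not code from the paper:

# Discounted reward-to-go G(k) of equation (1); rewards[j] holds r(k+j)
# for the remaining steps up to the terminal time K_T.
def reward_to_go(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: g <- r + gamma*g
        g = r + gamma * g
    return g

# Example: three remaining rewards 1, 2, 3 with gamma = 0.5.
assert abs(reward_to_go([1.0, 2.0, 3.0], 0.5) - (1 + 0.5*2 + 0.25*3)) < 1e-12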
In this paper, we consider an RL-based VSL control system, where the traffic dynamics on the freeway are the environment and the VSL controller is the agent. More specifically, we consider a long homogeneous freeway stretch, which is suitable for applying VSLs to resolve jam waves. The agent decides the speed limit values displayed to the drivers at different positions along the freeway. It is assumed that the freeway is equipped with fixed-location sensors, e.g., loop detectors, which divide the freeway into cells. Variable message signs (VMSs), which display the speed limit values, are placed above the freeway.

The state, action, and reward of the RL system are defined considering the mechanism of VSLs presented in the previous section. The freeway stretch is divided into four areas, indexed I, II, III, and IV from upstream to downstream, as shown in Fig. 4. They represent the area upstream of VSL control (I), the VSL-controlled area (II), the jam area (III), and the area downstream of the jam (IV), respectively. Each area consists of a number of consecutive cells, so the area boundaries coincide with the cell boundaries. Area I has a length of v_f \cdot T_k, where v_f denotes the free-flow speed and T_k is the unit time step duration. Area II, the VSL-controlled area, denotes the freeway section that is controlled by a number of consecutive VMSs displaying the speed limits. It is assumed that this area resides immediately upstream of the congestion area. Area III, the congestion area, consists of all the cells that are in congestion. Cell i is defined to be in congestion if v_i \le v_{jmax} and q_i \le q_{jmax} are both satisfied, where v_{jmax} and q_{jmax} are predefined speed and flow thresholds, respectively. Area IV covers the part where the discharging traffic recovers to the free-flow speed; its length should be long enough for the acceleration, e.g., longer than 1 km. These four areas move along with the jam wave, and their traffic states are updated in each control cycle accordingly. Note that the VSL controller is switched on only when there is a single jam area on the freeway. When there are multiple, disconnected congestion areas, e.g., multiple jam waves, the VSL controller will not be activated.
Figure 4: Dividing the freeway stretch into four areas.
As presented in the previous section, an effective VSL control scheme against freeway jam waves should be able to (i) create a sufficiently low flow to resolve the jam, and (ii) maintain the density of the VSL-controlled area at a moderate value. Therefore, the state and action variables of the RL model should be able to capture the traffic dynamics of the jam and the VSL-controlled area. According to the conservation law, the dynamic evolution of a traffic jam is related to the size of the jam at the current time step, i.e., how many vehicles are in the jam, and to the inflow and outflow of the jam. Likewise, the density variation of the VSL-controlled area is related to its current density and to the inflow and outflow of this area. Therefore, to resolve the jam wave and also maintain the density of the VSL-controlled area at a moderate value, the state, action, and reward functions of the RL system should take all those variables into account.

To define the state, action, and reward, the VSL control system is discretized in time. The state of discrete time step k, s(k), and the action, a(k), are defined as:

s(k) = [\bar{q}_I(k), \bar{\rho}_V(k), l_{jam}(k), \bar{v}_{jam}(k), P_{jam}(k)],    (2)
a(k) = [V(k), P_V(k)],    (3)
where V(k) denotes the speed limit value, and P_V(k) denotes the index of the most upstream cell of the VSL-controlled area. It is assumed that VSLs are applied directly upstream of P_{jam}, the most upstream cell of the jam area. Therefore, the variables V(k), P_V(k), and P_{jam}(k) determine the speed limit value displayed on every VMS. For the other state variables, \bar{q}_I denotes the average flow of area I, which is considered as the flow arriving at the VSL-controlled area in one time step. \bar{\rho}_V represents the average density of the VSL-controlled area. l_{jam} and \bar{v}_{jam} are the length and average speed of the congestion area, respectively; these two variables represent the size of the jam wave. All the state and action variables are summarized in Table 1. They can effectively capture the traffic dynamics of the jam and the VSL-controlled area:
1. The state variable \bar{q}_I(k) determines the inflow of the VSL-controlled area. The state variable \bar{\rho}_V(k) and the action variable V(k) approximate the outflow of the VSL-controlled area. The state variable \bar{\rho}_V(k) also represents the density of the VSL-controlled area at the current time step. Therefore, the RL system captures the density variation of the VSL-controlled area based on those variables.
2. The state variables l_{jam}(k) and \bar{v}_{jam}(k) represent the size of the jam at the current time step. The inflow to area III is equal to the outflow of area II; thus, the state variable \bar{\rho}_V(k) and the action variable V(k) approximate the inflow to the jam. According to the empirical study of Yuan et al. (2015), the outflow of a jam wave is dependent on the speed in the jam. Therefore, the state variable \bar{v}_{jam} can capture the outflow of the jam. The RL system captures the dynamic evolution of the jam based on those variables.
Table 1: The state and action variables of the proposed RL system.

State variables:
\bar{q}_I(k) [veh/h]            The inflow to the VSL-controlled area, calculated as the arithmetic mean of the measured flow of all cells in area I.
\bar{\rho}_V(k) [veh/km/lane]   The average density of the VSL-controlled area, calculated as the arithmetic mean of the density of all cells in area II.
l_{jam}(k) [km]                 The length of the congestion area, i.e., area III.
\bar{v}_{jam}(k) [km/h]         The average speed of the jam area, calculated as the arithmetic mean of the measured speed of all cells in area III.
P_{jam}(k)                      The index of the most upstream cell of area III, the jam area.

Action variables:
V(k) [km/h]                     The speed limit value.
P_V(k)                          The index of the most upstream cell of the VSL-controlled area.
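To make the data structures concrete, the state vector (2) and action vector (3) can be represented as small record types. The following sketch is for illustration only; the field names mirror Table 1 but are our own choice, not notation from the paper:

from typing import NamedTuple

class VSLState(NamedTuple):      # s(k) of equation (2)
    q_in: float     # q̄_I(k)  [veh/h], mean flow of area I
    rho_vsl: float  # ρ̄_V(k)  [veh/km/lane], mean density of area II
    l_jam: float    # l_jam(k) [km], length of the congestion area
    v_jam: float    # v̄_jam(k) [km/h], mean speed in the jam
    p_jam: int      # P_jam(k), most upstream cell index of area III

class VSLAction(NamedTuple):     # a(k) of equation (3)
    v_limit: float  # V(k) [km/h], displayed speed limit value
    p_vsl: int      # P_V(k), most upstream cell index of area II

s = VSLState(q_in=1500.0, rho_vsl=30.0, l_jam=1.2, v_jam=20.0, p_jam=14)
a = VSLAction(v_limit=60.0, p_vsl=11)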
For the presented VSL control system, the VSL controller is activated when a jam wave is detected and deactivated when the jam is resolved or considered unresolvable. The jam wave is considered resolved if, for every cell i, v_i > v_{jmax} and q_i > q_{jmax} are both satisfied. The jam wave is considered unresolvable if the congestion has reached the upstream boundary of the freeway stretch or multiple jam waves have been observed during the VSL control process.

It is assumed that for a single jam wave only one speed limit value is applied to the entire VSL-controlled area over the control horizon. In other words, on detection of a jam wave the speed limit value is decided, and it is kept unchanged until the speed limits are deactivated again. This setting is consistent with SPECIALIST, which has been implemented in practice. It requires less attention from drivers because they only need to decelerate once and accelerate once after the jam wave is resolved. Therefore, this setting is more acceptable to drivers and may also avoid new breakdowns induced by frequent acceleration and deceleration. For a different jam wave, however, the speed limit value can be chosen from all available values based on the learning result. Besides, when VSL control is activated, the speed limit values are gradually reduced from the default speed limit to the target value to avoid sharp decelerations.
The reward should reflect the improvement in traffic performance caused by VSLs. Intuitively, the reward should be a function of the freeway throughput, since the foremost improvement resulting from resolving jam waves is the elimination of the capacity drop. Unfortunately, the throughput increment produced by the VSL control can hardly be observed until the jam wave is resolved. Thus, for faster learning we define a reward function based on an artificial variable, J(k), that represents the congestion severity:

J(k) = \frac{l_{jam}(k)}{\bar{v}_{jam}(k)}.    (4)

The congestion severity decreases as the average speed in the congestion area increases and the length of the congestion area diminishes. The change of J(k) may take place soon after the VSL is implemented, before the jam wave is resolved. The reward, r(k), is defined as the reduction in congestion severity:

r(k) = J(k) - J(k+1).    (5)
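The severity measure and reward are then straightforward to compute; a minimal sketch with our own function names, where J has units of hours given the units of Table 1:

# Congestion severity J(k) = l_jam(k) / v̄_jam(k), equation (4).
def severity(l_jam_km, v_jam_kmh):
    return l_jam_km / v_jam_kmh

# One-step reward r(k) = J(k) - J(k+1), equation (5): positive if the jam shrinks.
def reward(l_now, v_now, l_next, v_next):
    return severity(l_now, v_now) - severity(l_next, v_next)

# Example: the jam shrinks from 1.5 km at 15 km/h to 1.2 km at 20 km/h.
print(reward(1.5, 15.0, 1.2, 20.0))  # 0.10 - 0.06 = 0.04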
2.3. The solution algorithm
The RL problem presented in the previous section can be solved by a number of methods (Sutton and Barto, 2018). In this section, we briefly introduce a model-free Q-learning method, which has been extensively used in RL-based traffic control systems (Watkins and Dayan, 1992; Davarynejad et al., 2011; Li et al., 2017). To apply the Q-learning method, the variables in the state and reward functions, i.e., equations (2) and (5), need to be discretized. The domain of each variable is divided into discrete intervals, and the value of each interval is represented by its midpoint.

The Q-learning method estimates the optimal value function Q^* using temporal-difference learning. The Q-value, Q(s,a), stores the value of a state-action pair, and it is updated according to:

Q(s,a) \leftarrow Q(s,a) + \kappa(s,a) \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],    (6)

where r is the observed reward of the transition from the current state s to the new state s' under action a; a' denotes the action chosen at state s'; and \kappa(s,a) is the learning rate, which controls how fast the Q-values are altered. Typically, the learning rate decreases over time to ensure convergence. Some studies, e.g., Li et al. (2017), defined the learning rate of a state-action pair as a function of the number of visits to that pair. In this paper we adopt the same method and define \kappa(s,a) as:

\kappa(s,a) = \left[ \frac{1}{1 + C(s,a)(1-\gamma)} \right]^{0.7},    (7)

where C(s,a) is the number of visits to the state-action pair (s,a).
For Q-learning, the selection rule for the action taken at a given state should consider the trade-off between exploitation and exploration. Even though pure exploitation may greatly reduce the learning time, it may prevent the discovery of new, potentially better actions. Conversely, although pure exploration outperforms pure exploitation in the capability of discovering better actions, it is quite time-consuming, as it selects actions without making use of the learning results. In this paper, the RL agent's exploration method follows Li et al. (2017), in which the probability of selecting action a from state s is:

p_s(a) = \frac{e^{Q(s,a)/T}}{\sum_{a' \in A_s} e^{Q(s,a')/T}},    (8)

where A_s is the set of available actions at state s, and T is the so-called temperature parameter. When T is large, each action has approximately the same probability of being selected (more exploration). When T is small, actions are selected in proportion to their estimated values (more exploitation).
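Equation (8) is the standard Boltzmann (softmax) selection rule; a minimal Python sketch with our own function names:

import math, random

# Selection probabilities of equation (8); q_row maps each available action
# a in A_s to Q(s, a).
def softmax_probs(q_row, temperature):
    m = max(q_row.values())  # subtract the max for numerical stability
    w = {a: math.exp((q - m) / temperature) for a, q in q_row.items()}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}

def select_action(q_row, temperature):
    probs = softmax_probs(q_row, temperature)
    return random.choices(list(probs), weights=list(probs.values()))[0]

q_row = {(60, 11): 1.2, (50, 11): 0.8}        # two candidate actions [V, P_V]
print(softmax_probs(q_row, temperature=10))   # near-uniform: more exploration
print(softmax_probs(q_row, temperature=0.1))  # near-greedy: more exploitation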
The pseudocode of Q-learning is shown in Algorithm 1. \mathcal{T} denotes the set of training data, where each training data slice is represented by a state transition tuple, [s, a, s', r]. S and A represent the set of states and the set of actions in the training data, respectively. \epsilon denotes a very small positive value used in the convergence test. The terminal state is defined as the state at the time the speed limit control is deactivated, i.e., when the jam wave is resolved or considered unresolvable. Note that any RL algorithm that can learn directly from data can be used in the proposed VSL control approach. In this paper, a simple Q-learning algorithm is used because the amount of training data is relatively small.
Algorithm 1 The pseudocode of Q-learning.
Input: \mathcal{T}, S, A
1: Initialize Q(s,a) = 0, C(s,a) = 0, \kappa(s,a) = 1, \forall s \in S, a \in A;
2: repeat
3:    Initialize s;
4:    repeat
5:       choose a from s based on equation (8);
6:       C(s,a) += 1;
7:       update \kappa(s,a) based on equation (7);
8:       update Q(s,a) based on equations (4)-(6);
9:       s \leftarrow s';
10:   until s is a terminal state
11: until convergence: \sqrt{\sum_s \sum_a (Q(s,a) - Q_{prev}(s,a))^2} \le \epsilon
Output: Q(s,a), \forall s \in S, a \in A
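For concreteness, a compact Python sketch of the Q-table update is given below. It is a simplified batch variant of Algorithm 1 under our own naming: it sweeps over the stored transition tuples instead of replaying episodes with softmax selection, but it applies update rule (6), the visit-count learning rate (7), and a convergence test on the size of the sweep's updates:

import math, random
from collections import defaultdict

GAMMA, EPS = 0.95, 0.01

def q_learning(transitions, n_sweeps=200):
    # transitions: list of [s, a, s_next, r] tuples with discretized,
    # hashable states and actions, as in the training set.
    Q = defaultdict(float)
    C = defaultdict(int)
    by_sa = defaultdict(list)       # observed outcomes per state-action pair
    actions_at = defaultdict(set)   # actions observed at each state
    for s, a, s_next, r in transitions:
        by_sa[(s, a)].append((s_next, r))
        actions_at[s].add(a)
    for _ in range(n_sweeps):
        delta = 0.0
        for (s, a), outcomes in by_sa.items():
            s_next, r = random.choice(outcomes)  # sample an observed outcome
            C[(s, a)] += 1
            kappa = (1.0 / (1.0 + C[(s, a)] * (1.0 - GAMMA))) ** 0.7  # eq. (7)
            best_next = max((Q[(s_next, a2)] for a2 in actions_at[s_next]),
                            default=0.0)          # 0 at terminal states
            td = r + GAMMA * best_next - Q[(s, a)]                    # eq. (6)
            Q[(s, a)] += kappa * td
            delta += (kappa * td) ** 2
        if math.sqrt(delta) <= EPS:  # stop once the Q-table barely changes
            break
    return Q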
3. An iterative RL approach of VSLs
As explained earlier in this paper, the conventional training method for RL-based traffic control strategies, which relies solely on traffic simulators to generate the training data, is flawed because accurate traffic simulators are difficult to obtain (Papageorgiou, 1998). In fact, the modeling errors of some well-known simulators were shown to be between 10% and 20% (Spiliopoulou et al., 2014; Han et al., 2017a). Moreover, how the error of a traffic simulator affects the performance of the corresponding RL controller is still unclear. On the other hand, training the RL controller with field data is also infeasible because of random exploration. The control actions that are randomly explored may lead to very poor traffic performance. Furthermore, the training process with random exploration may require a large amount of training data, which may not be feasible to collect because the speed of data collection in the real world is restricted by physical time.

In light of the above, our RL approach combines the two ways of training by using both offline simulation data and real data, where real data gradually replace simulation data. The proposed approach applies an iterative training framework, where the optimal control policy is updated by exploring new control actions both online and offline in each training iteration. Section 3.1 presents the general framework of the proposed approach. Sections 3.2 and 3.3 explain the offline training and online control processes, respectively.
3.1. Framework of the iterative RL
The proposed iterative RL approach of VSLs consists of an offline training process and an online control process, which interact through the iterations. The interaction between the two processes in each iteration is shown in Fig. 5. For the offline training process, the input includes historical data and a new batch of data collected from the last iteration. Historical data are sliced in the form of state transition tuples and added to the training dataset. To explore new control actions that may lead to better traffic performance, new synthetic data, generated from a macroscopic traffic flow model based on the new batch of real data, are also added to the training dataset. The process of offline synthetic data generation is presented in Section 3.2. The output of the offline training process is the Q-table that contains the Q-values, Q(s,a), of all available state-action pairs in the data.

After training, the optimal control policy is fed into the online control process. In each iteration, the VSL control policy associated with a fixed state-value table is implemented in the online process for a period of time. This duration is determined considering the trade-off between the control performance and the learning rate. If the time duration of each stage is too long, it takes more time for the VSL agent to improve the traffic performance. If the time duration is too short, the data gathered from the online process may be too limited for the RL agent to improve. A new control action is explored online only if the RL state is not in the Q-table. The online exploration, which follows a certain set of rules, is presented in Section 3.3.

Exploration in RL is always the price to pay for improving the system performance. However, in a real traffic system, there are many restrictions that limit the exploration of new control actions. For example, a poor exploration method, e.g., the random exploration used in many existing RL systems, may lead to very poor traffic performance or even unsafe traffic situations. Furthermore, an inefficient exploration method may not lead to any improvement in the real world simply because of limited physical time. The presented offline/online exploration method prevents, to some extent, poor control actions from being explored in the real traffic process, so as to reduce the exploration and learning costs. With the interaction between the offline training and the online control, the optimal policy is updated iteratively. Over the course of the iterations, the traffic performance is expected to improve with the updating of the optimal policy, because the model mismatch is alleviated by replacing knowledge from the models with knowledge from the real process.
Figure 5: The offline training process and online control process of the proposed approach in an iteration.
3.2. Offline training

The offline training process in an iteration is shown in the left block of Fig. 5. In the offline training process of iteration x, the set of training data slices, \mathcal{T}_x, which includes both real data and synthetic data, is gathered and fed into Algorithm 1 to obtain the Q-table. A training data slice is represented as a state transition tuple of the form [s, a, s', r]. In the training data, the real data are represented as \mathcal{T}^{real}_x, with

\mathcal{T}^{real}_x = \mathcal{T}^{start}_x \cup \bigcup_{n=1}^{x-1} \mathcal{T}^{real}_n,    (9)

where \mathcal{T}^{start}_x is the initial training data, e.g., the training data collected from previous implementations of other VSL control strategies at the target site. If no VSL control strategy was implemented before, \mathcal{T}^{start}_x is empty.
Table 2: The notation of variables in Algorithm 2.

T                     A training data slice
\tilde{T}             A synthetic training data slice
\mathcal{T}           The set of training data slices
\mathcal{T}^{real}    The set of real training data
\mathcal{T}^{start}   The initial training data
x                     Index of the iterations
m                     Index of a traffic data slice
Z                     The set of traffic data slices
z_m                   Traffic state data slice m, z = [\hat{\rho}, \hat{q}, \hat{v}], where \hat{\rho}, \hat{q}, \hat{v} are the vectors of density, flow, and speed of all the cells in the freeway network
s_m                   The RL state of data slice m
\tilde{A}_m           The set of feasible synthetic control actions for data slice m
\tilde{a}             A synthetic control action
\tilde{z}^{next}_m    The synthetic future traffic state predicted from state z_m
\tilde{s}^{next}_m    The synthetic future RL state predicted from state z_m
F                     The operator that predicts the traffic state using a traffic flow model
H                     The operator that transforms a traffic state into an RL state
The synthetic training data of iteration x are generated based on the set of traffic data slices, Z_{x-1}, collected from the online control process in iteration x-1. Each traffic data slice is represented by the vectors of density, flow, and speed of all the cells in the freeway network. For data slice z_m \in Z_{x-1}, we use a traffic flow model to predict its future traffic state one step ahead under all feasible new VSL control actions, represented by the set \tilde{A}_m. We define F as the operator that predicts the future traffic state using the traffic flow model:

\tilde{z}^{next}_m = F(z_m, \tilde{a}),    (10)

where \tilde{a} \in \tilde{A}_m and \tilde{z}^{next}_m represents the predicted traffic state. The predicted reward, \tilde{r}, can be obtained from the predicted traffic state according to (4)-(5). \tilde{z}^{next}_m is transformed into the corresponding RL state, \tilde{s}^{next}_m, according to the definition of the variables in (2). We define H as the operator that transforms a traffic state into an RL state:

\tilde{s}^{next}_m = H(\tilde{z}^{next}_m).    (11)

We define S_x as the set of all RL states observed in the real data, \mathcal{T}^{real}_x. If the predicted RL state is in that set, i.e., \tilde{s}^{next}_m \in S_x, then the synthetic data slice [s_m, \tilde{a}, \tilde{r}, \tilde{s}^{next}_m] is added to the training data set, \mathcal{T}_x, where s_m is the RL state corresponding to traffic state z_m.
Therefore, in the offline process, new actions are generated by which the process can go from state s to state s', where both states have been observed in the real data but the transition between them has not yet been observed. We use this data generation method for two reasons. Firstly, the reliability of the explored control actions depends on the accuracy of the traffic prediction. Since the proposed method predicts traffic state transitions only one step ahead, the prediction accuracy should be better than that of a multi-step prediction, in which the prediction error accumulates. Secondly, this method restricts the ratio of synthetic data in the training dataset. If the offline model also produced new states, the fraction of synthetic data might remain large and dominant in the training data, and consequently the model mismatch would not be alleviated.

By adding the information of such possible transitions to the training data, new actions can be explored in the offline training process. In addition, a new action leading to s' will only be chosen in the online control process if the associated Q-value is high enough (based on earlier experiences), which prevents choosing actions that lead to very poorly performing states. The pseudocode of the offline training process is shown in Algorithm 2, where the notation of the variables can be found in Tab. 2.
notation of variables can be found in Tab. 2.
In the training dataset, the ratio of real data increases with the number of iterations because only real data are
accumulated. Similar to many heuristic exploration methods such as softmax, the proposed method also has a higher
11
probability of exploration at the beginning of the training than at the end when the policy is close to the greedy policy.
At the early stage of the iterations, the training data set contains a high proportion of synthetic data, enabling the RL
agent to explore more actions. With an increasing number of iterations, the real data become dominant in the training
data set, and the RL agent explores fewer control actions, to guarantee the improvement of trac performance.
Algorithm 2 The pseudocode of the offline training process in iteration x.
Input: \mathcal{T}^{real}_x, S_x, A_x, Z_{x-1} = {z_1, z_2, ..., z_M}
1: \mathcal{T}_x = \mathcal{T}^{real}_x
2: for m = 1, 2, ..., M do
3:    s_m = H(z_m)
4:    for \tilde{a} \in \tilde{A}_m do
5:       \tilde{z}^{next}_m, \tilde{r} = F(z_m, \tilde{a})
6:       \tilde{s}^{next}_m = H(\tilde{z}^{next}_m)
7:       if \tilde{s}^{next}_m \in S_x then
8:          \tilde{T} = [s_m, \tilde{a}, \tilde{r}, \tilde{s}^{next}_m]
9:          \mathcal{T}_x \leftarrow \mathcal{T}_x \cup \{\tilde{T}\}
10:      end if
11:   end for
12: end for
13: Q(s,a) = Algorithm 1(\mathcal{T}_x, S_x, A_x)
14: Q^*_x(s,a) = Q(s,a)
Output: Q^*_x(s,a), \forall s \in S_x, a \in A_x
3.3. Online VSL control
The online control process is shown in the right block of Fig. 5. First, the control system detects jam waves based on traffic flow measurements, following the criteria presented in Section 2.2. The VSL controller is activated once a jam wave is detected. For the first control step k, if the RL state is in the RL state data set, i.e., s(k) \in S_x, the control action is decided by:

a(k) = \arg\max_a Q^*(s(k), a),  if s(k) \in S_x,    (12)

where a(k) = [V(k), P_V(k)]. If the RL state is not in the RL state data set, the control action of the first step is determined by an existing VSL control strategy, e.g., SPECIALIST. For the subsequent control steps, the speed limit value V = V(k) is kept unchanged, and only the boundaries of the VSL-controlled area are allowed to change. For step k, if the RL state s(k) is in the state data set, the controller exploits the optimal policy: among all the state-action pairs associated with that state, the action that produces the largest Q-value is chosen and implemented in the traffic process:

a(k) = \arg\max_a Q^*(s(k), a),  with a = [V, P_V], V = V(k),  if s(k) \in S_x.    (13)

If the RL state is not in the state-value table, a new control action is explored and implemented in the traffic process. For the new control action, it is assumed that the speed limit value is the same as in the previous control step, i.e., V(k+1) = V(k), and that the index of the most upstream cell of the VSL-controlled area changes by no more than 1, i.e., |P_V(k) - P_V(k+1)| \le 1. Note that this constraint not only prevents frequent acceleration and deceleration of drivers caused by VSLs, but also reduces the exploration space of the RL; if the exploration space is too large, finding the actions that improve the system performance may take an unrealistically long time.
For the states that do not exist in the state-value table, we apply a simple method to determine P_V(k). The method intends to keep the density of the VSL-controlled area at a moderate value. Specifically, we define a tuning parameter \rho^{cr}_V, which represents the critical density of the VSL-controlled area, and use \rho^{up}_V to denote the density of the most upstream cell of the VSL-controlled area. For \rho^{up}_V in two consecutive steps, \rho^{up}_V(k-1) and \rho^{up}_V(k), there are four possible situations:

1. The density is lower than the critical value and decreasing: \rho^{up}_V(k) \le \rho^{cr}_V and \rho^{up}_V(k) \le \rho^{up}_V(k-1);
2. The density is lower than the critical value and increasing: \rho^{up}_V(k) \le \rho^{cr}_V and \rho^{up}_V(k) > \rho^{up}_V(k-1);
3. The density is higher than the critical value and decreasing: \rho^{up}_V(k) > \rho^{cr}_V and \rho^{up}_V(k) \le \rho^{up}_V(k-1);
4. The density is higher than the critical value and increasing: \rho^{up}_V(k) > \rho^{cr}_V and \rho^{up}_V(k) > \rho^{up}_V(k-1).

For situation 1, the upstream boundary of the VSL-controlled area moves one cell downstream. For situations 2 and 3, the upstream boundary remains at the same position. For situation 4, the upstream boundary moves upstream by one cell. Therefore, the next boundary position is determined as:

P_V(k+1) = P_V(k) - 1,  if situation 1 and s(k) \notin S_x,
P_V(k+1) = P_V(k),      if situation 2 or 3 and s(k) \notin S_x,    (14)
P_V(k+1) = P_V(k) + 1,  if situation 4 and s(k) \notin S_x.
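The boundary update of (14) amounts to comparing the upstream-cell density against the critical value and against its previous value. A minimal sketch (our function name, assuming the reconstruction of (14) above):

# Update the most upstream cell index P_V according to rule (14).
# rho_prev, rho_now: density of the most upstream VSL cell at steps k-1, k;
# rho_cr: tuning parameter, the critical density of the VSL-controlled area.
def update_upstream_boundary(p_v, rho_prev, rho_now, rho_cr):
    if rho_now <= rho_cr and rho_now <= rho_prev:  # situation 1
        return p_v - 1
    if rho_now > rho_cr and rho_now > rho_prev:    # situation 4
        return p_v + 1
    return p_v                                     # situations 2 and 3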
4. Simulation experiment design
This section presents the simulation experiments for testing the proposed VSL control approach. The purpose of the simulations is to show that the proposed approach (i) can effectively eliminate jam waves and reduce travel delays, (ii) performs better than approaches affected by the model mismatch, and (iii) has lower exploration and learning costs than an RL method with random exploration. The following experiment scenarios are designed.
1. Testing the proposed iterative RL approach of VSLs using macroscopic traffic simulation. The purpose of this scenario is to investigate the performance of the proposed approach in reducing travel delays during the iterative training process. As it is impossible to directly test the proposed approach in the field, the METANET model is used as the process model to represent the real-world traffic flow dynamics. For this scenario, the overall framework is presented in Section 4.1. The process model and simulation settings are presented in Section 4.2. The parameter settings of the RL controller are presented in Section 4.3.
2. Comparing the proposed approach to SPECIALIST. In SPECIALIST, the traffic state transitions under VSLs are predicted based on kinematic wave theory. The accuracy of the prediction is influenced by the tuning parameters and by external disturbances such as demand fluctuations. Therefore, the mismatch between the prediction results and the real process may affect the control performance. The purpose of this scenario is to demonstrate that the proposed approach can outperform SPECIALIST in terms of reducing travel delays by eliminating the model mismatch. Parameter settings of SPECIALIST are presented in Section 4.4.
3. Comparing the proposed approach to an existing MPC approach against freeway jam waves (Han et al., 2017b). The MPC approach was developed based on the extended CTM (Han et al., 2016). As the prediction model of the MPC is different from the process model (METANET), its performance is affected by the model mismatch. This scenario intends to demonstrate that the proposed approach can outperform the MPC approach in terms of reducing travel delays by alleviating the model mismatch.
4. Comparing the proposed approach to an existing RL-based VSL control approach with random online exploration. In this scenario, the RL model is trained directly in the real traffic process using the Double DQN (DDQN) algorithm (see scenario 5). As the DDQN explores control actions randomly, the exploration and learning costs during the training may be very high; for example, a randomly explored control action may lead to very poor traffic performance and even increase the travel delay. The purpose of this scenario is to demonstrate that the proposed approach has much lower exploration and learning costs than the random exploration method.
5. Comparing the proposed approach to an existing RL-based VSL control approach with zero-shot policy transfer. In this scenario, an existing deep reinforcement learning algorithm, namely Double DQN (DDQN, Van Hasselt et al. (2016)), is used as the training algorithm, and the same extended CTM is used as the training environment. After training, the optimal policy is directly transferred to the real traffic process. As the training environment is different from the real traffic process, this RL-based approach is also affected by the model mismatch. The purpose of this scenario is to demonstrate that the proposed approach can outperform this RL-based approach by alleviating the model mismatch.
6. Comparing the proposed approach to an existing RL-based VSL control approach with continual online learning. In this scenario, the DDQN-based VSL control strategy of scenario 5 is assumed to continue learning from the online environment after the offline optimal policy has been transferred. This scenario intends to investigate whether the DDQN can continually improve the traffic performance in the online environment, and also to quantify the online learning cost of the DDQN.

The simulation results of these experiment scenarios are presented in Sections 5.1-5.5.
4.1. Overall framework of experiment scenario 1
The simulation experiment for testing the proposed approach includes the following steps.
1. Implementing the starting VSL control approach. We assume SPECIALIST as the starting VSL approach, applied before implementing the proposed approach. Therefore, in (9), \mathcal{T}^{start}_x is the set of training data collected from the SPECIALIST implementation. The time period of implementing SPECIALIST is represented by 100 online simulations, where in each simulation one jam wave is artificially created.
2. The offline-online interaction process. The iterations start from the offline training process. Offline, the synthetic data are generated from the extended CTM, which is briefly presented in Section 4.5. In each iteration, the optimal VSL control policy associated with a fixed state-value table is implemented in the online process for a period of time, represented by 100 online simulations. Other parameters of the RL controller are specified in Section 4.3.
3. Stop criterion. In the online control process, if the RL state is in the state-value table, i.e., the state has appeared in the historical traffic data, the action that produces the largest Q-value is implemented in the process. If the RL state is not in the state-value table, a new control action is implemented. We define the actions selected from the state-value table as RL control actions. In each stage, the total number of RL control actions, N^{RL}_x, and the total number of all control actions, N_x, are recorded. The ratio between N^{RL}_x and N_x, denoted as \eta_x, represents the percentage of states that have appeared in the historical data. In general, \eta_x should increase with the number of iterations and the expansion of the training data. The experiment ends when \eta_x exceeds 0.8, i.e., when a large percentage of the states have appeared in the historical data.
Note that two different macroscopic traffic flow models are used in the simulation experiments. The METANET model is used as the process model, which represents the real-world traffic flow dynamics. The extended CTM is used as the offline data generation model. Therefore, the simulations using METANET are referred to as online simulations, and the simulations using the extended CTM are referred to as offline simulations.

The stochasticity of traffic flow is considered in the experiment by incorporating noise into the process model for different jam waves. Detailed settings of the process model are presented in Section 4.2. The simulation experiment is repeated 20 times to avoid unreliable results due to the stochasticity of the simulation environment.
4.2. The METANET model and simulation settings
The second-order macroscopic traffic flow model METANET (Messmer and Papageorgiou, 1990; Kotsialos et al., 2002b) has been extensively used for freeway traffic simulation. The METANET model predicts the dynamic evolution of traffic speeds based on a steady speed-density relation and some heuristic terms that express driver behavior. Hegyi et al. (2005a) extended the METANET model to account for the effect of VSLs, and the model with the VSL extension has been validated using field data (Han et al., 2017b; Frejo et al., 2019). In this simulation test, the model presented in Hegyi et al. (2005a) is used as the process model to represent real-world traffic dynamics. We choose METANET as the process model because it has been validated to reproduce the propagation of jam waves with reasonable accuracy (Han et al., 2017a; Frejo et al., 2019), and because it runs much faster than microscopic simulations.
In the METANET model, the freeway is divided into cells with a uniform geometric structure. For cell i, the desired speed at time t is calculated as:

V(\rho_i(t)) = \min\left( V_{C,i}(t),\; v_{f,i} \cdot \exp\left( -\frac{1}{a_m} \left( \frac{\rho_i(t)}{\rho_{cr,i}} \right)^{a_m} \right) \right),    (15)

where the first term, V_{C,i}, is the speed limit of cell i. We assume that the drivers fully comply with the speed limit control. The second term describes the steady speed-density relation of the model, which is characterized by three parameters, namely a_m, v_{f,i}, and \rho_{cr,i}. In the fundamental diagram, v_{f,i} and \rho_{cr,i} represent the free-flow speed and the critical density, respectively. For the sake of compactness, the equations that describe the traffic dynamics of METANET are given in Appendix A.
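For illustration, equation (15) with the parameter values of Section 4.2 reads as follows in Python (a sketch; the function and argument names are ours):

import math

# Desired speed of equation (15): the minimum of the displayed speed limit
# V_{C,i}(t) and the steady speed-density relation of METANET.
def desired_speed(rho, v_limit, v_f=108.0, rho_cr=27.6, a_m=2.5):
    v_fd = v_f * math.exp(-(1.0 / a_m) * (rho / rho_cr) ** a_m)
    return min(v_limit, v_fd)  # full compliance with the speed limit

print(desired_speed(rho=27.6, v_limit=120.0))  # 108*e^(-1/2.5) ≈ 72.4 km/h
print(desired_speed(rho=27.6, v_limit=60.0))   # capped at the 60 km/h limit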
Most of the experiments on VSLs against jam waves (both simulations and field tests) have been performed on homogeneous freeway stretches. In our experiments, a three-lane synthetic freeway stretch is used as the test bed for the proposed VSL control approach. The homogeneous freeway stretch is 7.5 km in length and is divided into 25 cells. A graphical representation of the synthetic freeway is shown in Fig. 6. The parameter values of the process model are taken from Kotsialos et al. (1999); Hegyi et al. (2005a); Han et al. (2017b). Specifically, \rho_{cr} = 27.6 veh/km/lane and a_m = 2.5 for every cell, and v_f = 108 km/h.
Figure 6: A graphical representation of the synthetic freeway stretch.
In practice, traffic flow conditions (e.g., traffic demand and capacity) may vary from day to day. To reproduce this stochastic feature of traffic flow, we assume that the parameters v_f, a_m, and ρ_cr, which influence the shape of the fundamental diagram, are stochastic. Each of the three parameters is assumed to follow a Gaussian distribution, where the mean is equal to the reference value and the standard deviation is 2% of the mean. In each online simulation run, a sample of these parameters is taken, which gives (slightly) different fundamental diagrams for different simulation runs. Fig. 7 (a) shows the free-flow capacities obtained from 100 random online simulation runs; in most runs the free-flow capacity ranges from 1900 veh/h/lane to 2100 veh/h/lane. Furthermore, to reproduce real-world demand fluctuations, the demands in the online simulation runs are also assumed to follow a Gaussian distribution. Specifically, each online simulation run lasts for 2 hours, consisting of one hour of peak time and one hour of off-peak time. The means of the peak-hour and off-peak-hour demands are set to 90% of the capacity (which varies between simulation runs) and 4000 veh/h, respectively. The standard deviations are set to 5% of the mean.
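For the METANET fundamental diagram above, the per-lane capacity has the closed form ρ_cr · v_f · exp(−1/a_m), since the flow ρV(ρ) is maximized at ρ = ρ_cr; for the reference parameters this gives roughly 1998 veh/h/lane. The sketch below, our illustration with hypothetical names, samples the three parameters as described and evaluates the resulting capacity, reproducing the spread reported in Fig. 7 (a).

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the stochastic fundamental-diagram parameters (std = 2% of mean)
# and compute the implied per-lane free-flow capacity.
def sample_capacity(v_f=108.0, a_m=2.5, rho_cr=27.6, rel_std=0.02):
    v = rng.normal(v_f, rel_std * v_f)
    a = rng.normal(a_m, rel_std * a_m)
    r = rng.normal(rho_cr, rel_std * rho_cr)
    return r * v * np.exp(-1.0 / a)   # veh/h/lane

caps = [sample_capacity() for _ in range(100)]
# Most samples fall roughly in the 1900-2100 veh/h/lane range (cf. Fig. 7a).
```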
Figure 7: Results of 100 online simulation runs: (a) the road capacity of each online simulation run; (b) the density-flow plot of all cells.
Jam waves in reality usually form at a relatively fixed location of a site (Hegyi and Hoogendoorn, 2010). In the simulations, jam waves are artificially triggered at the downstream boundary of the freeway stretch: the density downstream of the stretch is set to 100 veh/km/lane during minutes 32-34. To give an impression of the resulting stochasticity, we ran the simulation 100 times with the presented demand and parameter settings. The density-flow plot, taken from the data of every cell in every minute, is shown in Fig. 7 (b). The length of the congested area of the created jam waves varies from 0.9 km to 2.4 km, which is consistent with empirical observations. Fig. 8 shows an example of the simulated jam waves.
Figure 8: (a) Speed and (b) Flow contour plots of an example of the simulated jam waves.
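The jam-wave trigger described above can be sketched as a time-dependent downstream boundary condition. The snippet is our illustration; the boundary density outside the trigger window (20 veh/km/lane below) is our assumption, not a value from the paper.

```python
def downstream_boundary_density(t_min, jam_density=100.0, free_density=20.0):
    """Boundary condition that triggers a jam wave: a high downstream
    density (100 veh/km/lane) is imposed during minutes 32-34."""
    return jam_density if 32 <= t_min < 34 else free_density
```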
4.3. Settings of the RL controller
In the offline training process of the proposed VSL control approach, the state and reward variables need to be discretized. The domain of each variable is divided into discrete intervals, and the value of each interval is represented by its midpoint. The discrete interval sizes of q_I, ρ_V, l_jam, v_jam, and P_jam are set to 100 veh/h/lane, 2 veh/km/lane, 0.3 km, 5 km/h, and 1 cell, respectively, which reflects the trade-off between data resolution and the size of the variable space. The discrete intervals and the upper and lower bounds of the state and action variables are summarized in Tab. 3. A penalty of -200 min is added to the terminal state if the jam wave is not successfully resolved. In the Q-learning, the convergence threshold is set to 0.01 min.
Table 3: The discrete intervals and the upper and lower bounds of the state and action variables.

Variable               Discrete interval   Upper bound   Lower bound
q̄_I [veh/h]            100                 2000          1000
ρ̄_V [veh/km/lane]      2                   100           10
l_jam [km]             0.3                 3             0.3
v̄_jam [km/h]           5                   50            5
P_jam [cell]           1                   25            1
P_V [cell]             1                   24            1
V [km/h]               10                  60            50
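The midpoint discretization described above can be sketched as follows. This is our illustration; in particular, the clipping behavior at the bounds is an assumption, as the paper does not specify how out-of-range values are handled.

```python
import numpy as np

# Map a continuous variable to the midpoint of its discrete interval,
# following the intervals and bounds of Table 3.
def discretize(value, interval, lower, upper):
    value = np.clip(value, lower, upper)
    bin_idx = int((value - lower) // interval)
    midpoint = lower + (bin_idx + 0.5) * interval
    return min(midpoint, upper - 0.5 * interval)

# Example: an inflow of 1530 veh/h with a 100 veh/h interval maps to the
# midpoint 1550 veh/h.
q_I = discretize(1530.0, 100.0, 1000.0, 2000.0)
```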
In the proposed control system, we assume that two speed limit values are used: 50 km/h and 60 km/h. These two values are chosen based on both empirical evidence and trial-and-error tuning. From extensive simulation tests it is found that (i) a speed limit lower than 50 km/h would result in a higher density in the VSL-controlled area, which increases the risk of inducing new traffic breakdowns, and (ii) a speed limit higher than 60 km/h may not be able to trigger a sufficiently low flow to resolve the jam waves. For the reader's reference, the displayed speed limit value in the SPECIALIST system is 60 km/h (Hegyi and Hoogendoorn, 2010). For some traffic situations, the traffic performance might be further improved if more speed limit values could be displayed. However, the solution space of the RL increases exponentially with the size of the action space, which may require an impractical amount of time to gather sufficient training data for the RL agent to improve the traffic performance. Therefore, the number of speed limit values is determined by considering the trade-off between the potential traffic improvement and the time required to achieve that improvement.
In the online control process, the duration of a control time step, T_k, is set to 30 s. When VSL control is activated, 100 km/h and 80 km/h are used as lead-in values to avoid a sharp reduction of the speed limit, i.e., from the free-flow speed directly to 50 or 60 km/h. The same approach was used in Hegyi and Hoogendoorn (2010). The critical density of the VSL control region, ρ_V^cr, is set to 30 veh/km/lane. The VSL control is deactivated when the jam wave is resolved.
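A minimal sketch of the lead-in scheme follows. Whether the intermediate values are displayed over successive control steps or successive gantries is an implementation detail the paper does not spell out, so the function below simply returns the display sequence.

```python
def lead_in_sequence(target_limit):
    """Lead-in as described above: instead of dropping from free flow
    directly to the target limit, 100 and 80 km/h are displayed first
    (cf. Hegyi and Hoogendoorn, 2010)."""
    assert target_limit in (50, 60)
    return [100, 80, target_limit]
```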
4.4. The starting VSL control approach
We assume that the SPECIALIST algorithm is the original VSL control approach in operation before the RL-based VSL control approach is implemented. SPECIALIST has multiple tuning parameters, which have clear physical interpretations. These parameters can be tuned based on heuristic tuning rules using offline traffic data. In this simulation test, we mimic the implementation of SPECIALIST in the METANET simulation. A brief introduction of SPECIALIST and the tuning rules of its parameters are presented in Appendix B. Fig. 9 shows an example from the simulation in which the jam wave is successfully resolved by SPECIALIST.
Figure 9: (a) Speed and (b) Flow contour plots of an example in the simulation in which the jam wave is successfully resolved by SPECIALIST. In
(a), the VSL-controlled area is enclosed by black lines.
4.5. Model mismatch
In scenario 1 of the simulation experiments, we use the extended CTM, proposed by Han et al. (2016), as the offline data generation model. The model extends the original CTM to reproduce the capacity drop and the propagation of jam waves. Since there is always a mismatch between the real traffic process and a traffic simulation model, we choose the extended CTM, which has a different mechanism from the METANET model, as the offline synthetic data generation model to reproduce such a mismatch.

Although the process model (the METANET model) and the offline data generation model (the extended CTM) have some similarities, e.g., both assume a fundamental diagram for homogeneous traffic states, their mechanisms are still quite different. For example, the METANET model considers driver behavior in the traffic speed dynamics, such as anticipation of spatially increasing or decreasing densities, while the extended CTM does not. In the simulation experiment, the extended CTM is calibrated with simulation data from the METANET model. For a detailed presentation of the extended CTM, readers are referred to Han et al. (2016, 2017b).
Furthermore, in scenario 3 of the simulation experiments, the extended CTM is used as the prediction model of an MPC controller of VSLs for comparison. In scenario 4, the training environment of an existing RL-based VSL control strategy, which is used for comparison, is also developed based on the extended CTM. In these two scenarios, the model mismatch is reproduced as a result of the difference between METANET and the extended CTM.
5. Simulation results and analysis
This section presents the results of the simulation experiments, and each sub-section corresponds to one of the
experiment scenarios described in Section 4.
5.1. Performance of the proposed approach
This section presents the results of the simulation experiment in testing the proposed approach, described as
scenario 1 in Section 4. In the simulation experiment, 100 online simulations are performed in each iteration. The
trac performance at each iteration is evaluated using the average total travel delay as the performance indicator,
which is calculated as the dierence between the total time spent by all vehicles in the freeway stretch and the sum of
all the vehicles’ free-flow travel time. The simulation experiment is repeated for 20 times to avoid getting unreliable
results due to the stochasticity of the simulation environment. A whisker plot that depicts the trac performance of
the proposed VSL control approach is shown in Fig. 10 (a). The average total travel delay saving of the proposed
approach is 31.3% during the entire training process.
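Given density and flow trajectories from the simulation, this performance indicator can be computed as sketched below. This is our reconstruction of the stated definition, using the total distance traveled divided by the free-flow speed as the free-flow travel time; names are illustrative.

```python
import numpy as np

# Total delay = total time spent minus the vehicles' free-flow travel time.
def total_delay(rho, q, cell_len_km, T_s_h, n_lanes=3, v_free=108.0):
    """rho: (T, I) densities in veh/km/lane; q: (T, I) flows in veh/h
    (all lanes); cell_len_km: cell length; T_s_h: step length in hours."""
    tts = np.sum(rho * n_lanes * cell_len_km) * T_s_h   # total time spent, veh*h
    ttd = np.sum(q * cell_len_km) * T_s_h               # total distance traveled, veh*km
    return tts - ttd / v_free                           # delay, veh*h
```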
Fig. 10 (b) shows the total travel delay improvement of the presented VSL control approach for different values of η. In the figure, each box represents a 10 percent interval of that ratio. The dashed blue line in each box indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points. The red line represents the average total travel delay reduction for the different intervals of η. In general, the average total travel delay saving increases with η, except for the interval [20%, 30%], where only three data points are observed. Moreover, it can be observed that the lower bound of the total travel delay reduction also increases once η is higher than 50%, which indicates that the presented VSL control approach becomes more robust as η grows. These results are as expected, because with the increase of η, more control actions are explored and more data are utilized by the RL. Hence, the actions selected by the RL controller become more reliable, because the RL controller takes the stochasticity of the traffic environment into account.

Fig. 10 (c) shows the change of η with the number of iterations. The average number of iterations in the simulation experiment is 15.9. The offline training time of the RL agent varies from less than one minute to 5 minutes. During earlier stages, when the amount of training data is smaller, it takes less time for the Q-learning to converge.
Figure 10: (a) The whisker plot of the total travel delay reduction (compared to the no-VSL case) for different iterations; (b) the total travel delay reduction for different ratios of RL actions; (c) the share of RL control actions for different iterations.
5.2. Comparison with SPECIALIST
This section presents the comparison between the proposed approach and SPECIALIST, described as scenario 2 in Section 4. SPECIALIST is utilized as the starting VSL control strategy in scenario 1 of the simulation experiments. The average total travel delay reduction of SPECIALIST is 15.5%. In all the simulation runs, about 70% of the jam waves are classified as resolvable, and the VSL schemes generated by SPECIALIST are implemented in those cases. Among the cases where VSLs are implemented, over 60% of the jam waves are successfully resolved. Some of the failures are attributed to the mismatch between the predicted traffic dynamics and the real traffic process, for example, a sudden demand increase.

In contrast, the proposed VSL control approach reduces the average total travel delay by 35.1% once η is larger than 0.8. About 80% of the jam waves are resolved by VSLs, which is much higher than with the SPECIALIST algorithm. Its better performance is attributed to two main reasons. First, the RL controller has a feedback structure: it determines the VSL control actions based on the online measured traffic states and is thus able to handle disturbances such as demand increases. Second, the RL controller does not rely on online traffic prediction, because the optimal control actions are obtained mainly from real traffic data.
Fig. 11 shows an example comparing SPECIALIST and the proposed VSL control approach. In this example, both VSL control approaches are tested using the same demand profile and the same parameter values for the process model. Under SPECIALIST, the VSL-controlled area is too short to generate a transition flow that lasts long enough to resolve the jam wave, because the outflow of the jam (the flow of area 1 in Fig. 3) is overestimated. Moreover, as SPECIALIST has a feed-forward control structure, it is very sensitive to errors in the traffic flow prediction. By contrast, the RL-based controller successfully resolves the jam wave.
It is worth noting that we also tried a different set of SPECIALIST parameters. Although the performance of SPECIALIST with the new parameters is inferior, the performance of the proposed approach that starts from this inferior tuning of SPECIALIST is not affected: it still reduces the total travel delay by 35% at the end of the training. The reason is that the proposed approach explores new control actions and evaluates them in every stage. The actions that lead to a good traffic performance are kept, and the actions that lead to a worse traffic performance are discarded by the RL model. When the amount of training data becomes sufficiently rich, the SPECIALIST data form only a small proportion of the training data and are overruled by the real data. Therefore, the performance of the proposed approach is not sensitive to the tuning parameters of SPECIALIST.
Figure 11: Comparison between SPECIALIST and the proposed VSL control approach in an example. In this example, (a) and (d) are the simulated speed (km/h) and flow (veh/h) contour plots without VSL control; (b) and (e) are the simulation results under SPECIALIST; (c) and (f) are the simulation results under the proposed VSL control approach. In (b) and (c), the VSL-controlled areas are enclosed by black lines.
5.3. Comparison with an MPC approach
This section presents the results of the comparison between the proposed VSL control approach and the MPC approach, described as scenario 3 in Section 4. The same extended CTM is used for traffic prediction in the MPC. The MPC has a feedback control structure, and the optimal VSL control scheme is recalculated in every control step based on traffic state feedback. The prediction horizon is set to 20 minutes and the duration of a control step to 30 seconds. The model parameters are calibrated with the online simulation data. The minimum VSL value in the optimization of the MPC is set to 50 km/h. At each optimization step, the traffic demand in the prediction is set to a constant value for the entire prediction horizon, predicted as the measured average demand of the last 15 minutes. For a full presentation of the MPC, readers are referred to Han et al. (2017b).
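A sketch of this demand predictor follows, assuming one flow measurement per minute; the class name and the default value returned before any measurement arrives are our assumptions.

```python
import collections

# Rolling-average demand predictor: the demand over the whole prediction
# horizon is set to the average flow measured during the last 15 minutes.
class DemandPredictor:
    def __init__(self, window=15):
        self.buffer = collections.deque(maxlen=window)

    def update(self, measured_demand):
        self.buffer.append(measured_demand)

    def predict(self, default=4000.0):
        if not self.buffer:
            return default        # fallback before any measurement (assumed)
        return sum(self.buffer) / len(self.buffer)
```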
The MPC controller is run with the online process model for 100 simulation runs. It reduces the average total travel delay by 25.9%, which is higher than SPECIALIST but lower than the proposed VSL controller. The performance of the MPC controller depends on the accuracy of the traffic prediction: it may generate ineffective control schemes if the predicted traffic dynamics are not consistent with the simulated traffic process. Fig. 12 shows an example in which the MPC controller fails to resolve the jam wave because of an inaccurate traffic prediction. In this example, the capacity of the process model is set to 1950 veh/h/lane, which is slightly lower than that of the prediction model, 2000 veh/h/lane. At minute 35, when the MPC controller is activated, the predicted traffic demand is 4900 veh/h but the actual traffic demand is about 5400 veh/h. Therefore, the congestion severity of the jam wave is underestimated by the MPC. As a result, the MPC only narrows the jam wave but is unable to completely resolve it. Using the same demand profile and parameter values, the proposed VSL control approach successfully resolves the jam wave, as shown by the speed and flow contour plots in Fig. 12 (c) and (f).
In this simulation experiment, the prediction model of the MPC controller is the same as the data generation model in the proposed RL-based VSL control approach. However, the performance of the MPC controller is restricted by the accuracy of the prediction model, as evidenced by the above example. In contrast, the performance of the proposed VSL control approach is not restricted by the accuracy of that model, because the explored actions produced by the data generation model are evaluated in the online process (i.e., the reality), and the actions that lead to worse traffic performance are discarded.
Figure 12: Comparison between the MPC approach and the proposed VSL control approach in an example. In this example, (a) and (d) are the
simulated speed (km/h) and flow (veh/h) contour plots without VSL control; (b) and (e) are the simulation results under the MPC control approach;
(c) and (f) are the simulation results under the proposed VSL control approach. In (b) and (c), the VSL-controlled areas are enclosed by black lines.
5.4. The exploration and learning costs
This section compares the proposed approach with an existing RL-based VSL control approach that uses random exploration, in terms of exploration and learning costs. The RL model with random exploration is trained directly in the online simulation environment using the DDQN algorithm. The performance curves of the random exploration approach are shown in Fig. 13. We use data from the first 10000 simulation runs to evaluate the exploration and learning costs, as the performance of the random exploration approach stabilizes after 10000 simulation runs. For the proposed approach, data from all the online simulations are used for the evaluation.

The exploration cost is represented by the performance in terms of travel delay. During the first 10000 simulation runs, the average total travel delay of the random exploration approach is 168.5 h, and in 32.6% of the online simulation runs the VSLs lead to a worse traffic performance, i.e., they increase the total travel delay. For the proposed approach, the average travel delay during the training phase is 142.2 h, and only in 17.9% of the simulation runs do the VSLs lead to a worse traffic performance. Furthermore, for the random exploration approach, the average total travel delay saving after 10000 simulation runs is 28.1%, whereas the proposed approach achieves a comparable performance using fewer than 200 simulation runs, as shown in Fig. 10. Therefore, the exploration cost of the proposed approach is much lower than that of the random exploration approach.
Figure 13: Performance curves of the random exploration approach in the online training.
5.5. Comparison with an existing RL approach
In this section, we compare the proposed approach with an existing RL-based VSL control approach with zero-shot transfer, described as scenario 5 in Section 4. Specifically, the same extended CTM is used as the training environment. An existing deep reinforcement learning algorithm, namely Double DQN (DDQN, Van Hasselt et al. (2016)), is applied as the training algorithm. DDQN has been successfully applied to RL-based traffic signal control systems in multiple studies, such as Zeng et al. (2018); Liang et al. (2019). During the training process, the RL agent receives states and rewards from the environment while the environment implements the actions taken by the agent. After training, the optimal policy is directly transferred to the online simulations.

In the training environment, the settings of the traffic demand and the model parameters are the same as those in Section 4.2. The state, action, and reward are the same as those defined in Sections 2.1 and 4.5. The RL model is trained using data from 20000 offline simulation runs. Fig. 14 shows the performance curves of the training. The control policy at the end of the training is implemented in the online simulations for 100 runs. It reduces the average total travel delay by 22.4%, which is not as good as the proposed control approach.
Figure 14: Performance curves of the training with DDQN in the extended CTM environment.
Fig. 15 shows an example that highlights the comparison between the proposed approach and the DDQN-based approach. In this example, the proposed approach successfully resolves the jam wave, but the DDQN-based VSL control approach fails. The DDQN-based approach chooses the speed limit value 60 km/h at the beginning of the VSL activation, as shown in Fig. 15 (b) and (j). As time advances, although the upstream end of the VSL-controlled area nearly reaches the upstream boundary of the freeway stretch, the VSL control still cannot create a transition flow that is sufficiently low to fully resolve the jam, as shown in Fig. 15 (e). In comparison, the proposed approach chooses the speed limit value 50 km/h at the beginning of the VSL activation, so the created transition flow is sufficiently low to resolve the jam wave, as shown in Fig. 15 (a), (d), and (i).
The performance of the DDQN-based VSL control approach is also tested in the training environment, i.e., the extended CTM, using the same traffic demand as in the aforementioned example. In the training environment, the DDQN-based VSL control approach successfully resolves the jam wave and achieves a higher downstream throughput, as shown in Fig. 15 (c) and (f). The different performances in the offline training environment and the online simulation indicate that the DDQN-based approach is affected by the model mismatch, i.e., the difference between the training environment and the online simulation. Even though a well-trained RL strategy performs well in the training environment, it is not guaranteed that the strategy will perform equally well in the real traffic process, where there is always a mismatch.
Figure 15: Comparison between the proposed VSL control approach and the DDQN-based approach in an example. In this example, (a) and (d) are the simulated speed (km/h) and flow (veh/h) contour plots under the proposed approach; (b) and (e) are the simulation results under the DDQN-based approach; (c) and (f) are the simulation results under the DDQN-based approach in the training environment. (i)-(k) are the corresponding VSL profiles. In (i), the speed limit is chosen as 50 km/h, while in (j), the speed limit is chosen as 60 km/h.
5.6. DDQN with continual online learning
This section presents the results of scenario 6, the DDQN with continual online learning. Two sub-scenarios are tested. In sub-scenario 1, it is assumed that there is no online exploration after the offline optimal policy is transferred to the online environment; the DDQN therefore adopts the greedy policy to update its parameters. In sub-scenario 2, it is assumed that online exploration continues after the offline optimal policy is transferred; the DDQN adopts the ε-greedy policy to update its parameters. The performance curves of both sub-scenarios are shown in Fig. 16.
Figure 16: Performance curves of the DDQN with continual learning.
Across the two sub-scenarios, the DDQN with the ε-greedy policy reduces the total travel delay substantially more than the DDQN with the greedy policy. To quantify the learning cost, we use the average delay of the proposed method over the entire training period as a benchmark. For the DDQN with the ε-greedy policy, the average travel delay during the first 2000 simulation runs is 155.9 h, which is 9.6% higher than the average of the proposed method, shown as the red lines in Fig. 16. The DDQN eventually achieves a performance similar to that of the proposed method, but at a significantly higher learning cost during the online process.
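The difference between the two update policies reduces to the action-selection rule sketched below. This is our illustration; the paper does not report the value of ε.

```python
import numpy as np

rng = np.random.default_rng()

# Greedy vs. epsilon-greedy action selection over the Q-values of the
# transferred network: greedy always exploits, epsilon-greedy keeps
# exploring online with probability epsilon (value assumed here).
def select_action(q_values, epsilon=0.1):
    if rng.random() < epsilon:                 # epsilon-greedy: explore
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))            # greedy: exploit

# Sub-scenario 1 corresponds to epsilon = 0 (pure exploitation);
# sub-scenario 2 keeps epsilon > 0 during online learning.
```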
6. Discussion and conclusions
Reinforcement learning has attracted extensive attention in the traffic control area. Most existing RL-based traffic control approaches explore control actions randomly, which may induce high exploration and learning costs. For those approaches, the RL learning cannot be based purely on real-world exploration. Furthermore, the training process with random exploration may require a large amount of training data, which may not be feasible to collect because the speed of data collection in the real world is restricted by the "slowness" of the traffic process. Therefore, to date most existing RL-based traffic control approaches train their RL models solely using traffic simulators. However, the mismatch between the training simulators and the real traffic process affects the performance of those approaches.
In this paper we have proposed a new reinforcement learning-based VSL control approach to resolve freeway jam waves. The proposed VSL control approach applies an iterative training framework, in which the optimal control policy is updated by exploring new control actions both online and offline in each iteration. The offline/online exploration method largely prevents poor control actions from being explored in the real traffic process, thereby reducing the exploration and learning costs. The explored control actions are evaluated in the real traffic process. Thus the proposed approach avoids letting the RL model learn only from a traffic simulator, and it alleviates the impact of the model mismatch by replacing knowledge from the model with knowledge from the real process.
The proposed VSL control approach has been tested using a macroscopic traffic simulation model, namely METANET, which represents the real-world traffic flow dynamics. The simulation results have shown that the RL controller decreases the total travel delay further as more control actions are explored and more training data are fed into the RL. The proposed approach has also been compared with several existing VSL control approaches to demonstrate its advantages. Owing to the alleviation of model mismatch errors, the proposed approach performed better in reducing travel delays than SPECIALIST, the MPC-based approach, and the approach based on an existing RL method. Its advantage in reducing the exploration and learning costs has been demonstrated by the comparison with an existing RL-based approach with random exploration.
Although the proposed approach has been demonstrated to alleviate the impact of the model mismatch, it is not guaranteed to lead to a system-optimal performance. In the proposed method, actions are mainly explored in a smaller space created from the offline model rather than in the entire action space. Therefore, the policy of the RL can be suboptimal if the optimal control actions lie outside the exploration space. In future research, we will further investigate whether there are better training methods that can incorporate random online exploration and lead to a system-optimal performance.
The proposed VSL control approach is designed to resolve freeway jam waves based on the VSL control mechanism against jam waves. In future research, we will extend the proposed approach to address infrastructural bottlenecks such as on-ramp and lane-drop bottlenecks. The test bed will also be extended to larger freeway networks. Other methods that can deal more efficiently with the scarcity of real data in RL-based traffic control problems will also be investigated.
Acknowledgement
This research is jointly supported by the National Natural Science Foundation of China (No.52002065, No.52131203),
and the Natural Science Foundation of Jiangsu (No.BK20200378).
Appendix A. METANET model
In the METANET model, the following equations describe the evolution of the freeway traffic dynamics over time. The outflow of each cell is equal to the density times the mean speed and the number of lanes of that cell (represented by λ_i):

\[
q_i(t) = \rho_i(t)\, v_i(t)\, \lambda_i, \tag{A.1}
\]

The density of a cell follows the vehicle conservation law, which is represented as:

\[
\rho_i(t+1) = \rho_i(t) + \frac{T_s}{l_i \lambda_i} \left( q_{i-1}(t) - q_i(t) \right), \tag{A.2}
\]

where l_i is the length of cell i. The mean speed of cell i at time step t+1, v_i(t+1), depends on the mean speed at time step t, the speed of the inflowing vehicles, and the downstream density. Specifically,

\[
v_i(t+1) = v_i(t) + \frac{T_s}{\tau_M}\big( V(\rho_i(t)) - v_i(t) \big) + \frac{T_s}{l_i} v_i(t)\big( v_{i-1}(t) - v_i(t) \big) - \frac{\vartheta T_s}{\tau_M l_i} \frac{\rho_{i+1}(t) - \rho_i(t)}{\rho_i(t) + \kappa}, \tag{A.3}
\]

where τ_M, ϑ, and κ are model parameters. In the experiment, τ_M is set to 18 s, κ to 40 veh/km/lane, and ϑ to 30 km²/h.
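For reference, a compact sketch of one METANET update step implementing Eqs. (A.1)-(A.3) together with Eq. (15) is given below. The step length T_s and the boundary handling (fixed upstream demand, free downstream outflow) are our assumptions; the cell length and lane count follow the 7.5 km, 25-cell, three-lane stretch of Section 4.2.

```python
import numpy as np

# One METANET update step on a homogeneous stretch (illustrative sketch).
def metanet_step(rho, v, v_limit, q_in,
                 T_s=10/3600, l=0.3, lam=3,
                 v_f=108.0, rho_cr=27.6, a_m=2.5,
                 tau=18/3600, kappa=40.0, theta=30.0):
    """rho, v: arrays of cell densities (veh/km/lane) and speeds (km/h);
    q_in: upstream demand (veh/h); time constants in hours, lengths in km."""
    q = rho * v * lam                                    # (A.1), veh/h
    q_up = np.concatenate(([q_in], q[:-1]))              # inflow of each cell
    rho_next = rho + T_s / (l * lam) * (q_up - q)        # (A.2)

    V_des = np.minimum(v_limit,
                       v_f * np.exp(-(1/a_m) * (rho/rho_cr)**a_m))  # Eq. (15)
    v_up = np.concatenate(([v[0]], v[:-1]))              # assumed inflow speed
    rho_down = np.concatenate((rho[1:], [rho[-1]]))      # assumed free outflow
    v_next = (v + T_s/tau * (V_des - v)                  # relaxation term
              + T_s/l * v * (v_up - v)                   # convection term
              - theta*T_s/(tau*l) * (rho_down - rho)/(rho + kappa))  # (A.3)
    return np.maximum(rho_next, 0.0), np.clip(v_next, 0.0, None)
```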
Appendix B. SPECIALIST
There are multiple tuning parameters for the SPECIALIST algorithm, which correspond to the traffic states in Fig. 3. The control scheme can be constructed given the measured and calculated traffic states 1-6. The densities, speeds, and flows of the six states are denoted as ρ[j], v[j], q[j], j ∈ {1, ..., 6}. In the experiments, these parameters are determined using the same method as in Hegyi and Hoogendoorn (2010). One of the most important tuning parameters is the density associated with state 4. The speed of state 4 is determined by the speed limits; the choice of the density, however, is a design variable that influences the shape of the control scheme. Based on trial-and-error tuning, ρ[4] is set to 30 veh/km/lane, and ρ[5] and q[5] are set to 27 veh/km/lane and 2000 veh/h/lane, respectively.

After the construction of the control scheme, the resolvability is assessed. If the constructed control scheme satisfies certain conditions, the jam wave is considered resolvable and the control scheme is applied. These conditions include: (i) the heads and tails of areas 2 and 4 should converge; (ii) the speed of area 6 should be higher than the speed limits; and (iii) the necessary length of the speed-limited stretch is smaller than the available upstream free-flow area. In the experiment, it is assumed that SPECIALIST can choose one speed limit value from 50 km/h and 60 km/h; if both values satisfy the resolvability conditions, the higher value, 60 km/h, is chosen. The VSL control is activated at minute 35, when the jam wave has already formed.
References

Arel, I., Liu, C., Urbanik, T., and Kohls, A. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.
Belletti, F., Haziza, D., Gomes, G., and Bayen, A. M. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 19(4):1198–1207, 2017.
Carlson, R. C., Papamichail, I., Papageorgiou, M., and Messmer, A. Optimal motorway traffic flow control involving variable speed limits and ramp metering. Transportation Science, 44(2):238–253, 2010a.
Carlson, R. C., Papamichail, I., Papageorgiou, M., and Messmer, A. Optimal mainstream traffic flow control of large-scale motorway networks. Transportation Research Part C: Emerging Technologies, 18(2):193–212, 2010b.
Carlson, R. C., Papamichail, I., and Papageorgiou, M. Local feedback-based mainstream traffic flow control on motorways using variable speed limits. IEEE Transactions on Intelligent Transportation Systems, 12(4):1261–1276, 2011.
Carlson, R. C., Papamichail, I., and Papageorgiou, M. Integrated feedback ramp metering and mainstream traffic flow control on motorways using variable speed limits. Transportation Research Part C: Emerging Technologies, 46:209–221, 2014.
Chen, D. and Ahn, S. Variable speed limit control for severe non-recurrent freeway bottlenecks. Transportation Research Part C: Emerging Technologies, 51:210–230, 2015.
Chen, D., Ahn, S., and Hegyi, A. Variable speed limit control for steady and oscillatory queues at fixed freeway bottlenecks. Transportation Research Part B: Methodological, 70:340–358, 2014.
Davarynejad, M., Hegyi, A., Vrancken, J., and van den Berg, J. Motorway ramp-metering control with queuing consideration using Q-learning. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1652–1658. IEEE, 2011.
El-Tantawy, S., Abdulhai, B., and Abdelgawad, H. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3):1140–1150, 2013.
Frejo, J. R. D., Núñez, A., De Schutter, B., and Camacho, E. F. Hybrid model predictive control for freeway traffic using discrete speed limit signals. Transportation Research Part C: Emerging Technologies, 46:309–325, 2014.
Frejo, J. R. D., Papamichail, I., Papageorgiou, M., and De Schutter, B. Macroscopic modeling of variable speed limits on freeways. Transportation Research Part C: Emerging Technologies, 100:15–33, 2019.
Frejo, J. R. D. and Camacho, E. F. Global versus local MPC algorithms in freeway traffic control with ramp metering and variable speed limits. IEEE Transactions on Intelligent Transportation Systems, 13(4):1556–1565, 2012.
Hadiuzzaman, M. and Qiu, T. Z. Cell transmission model based variable speed limit control for freeways. Canadian Journal of Civil Engineering, 40(1):46–56, 2013.
Hadiuzzaman, M., Qiu, T. Z., and Lu, X.-Y. Variable speed limit control design for relieving congestion caused by active bottlenecks. Journal of Transportation Engineering, 139(4):358–370, 2013.
Han, Y., Yuan, Y., Hegyi, A., and Hoogendoorn, S. P. New extended discrete first-order model to reproduce propagation of jam waves. Transportation Research Record: Journal of the Transportation Research Board, (2560):108–118, 2016.
Han, Y., Hegyi, A., Yuan, Y., and Hoogendoorn, S. Validation of an extended discrete first-order model with variable speed limits. Transportation Research Part C: Emerging Technologies, 83:1–17, 2017a.
Han, Y., Hegyi, A., Yuan, Y., Hoogendoorn, S., Papageorgiou, M., and Roncoli, C. Resolving freeway jam waves by discrete first-order model-based predictive control of variable speed limits. Transportation Research Part C: Emerging Technologies, 77:405–420, 2017b.
Han, Y., Wang, M., He, Z., Li, Z., Wang, H., and Liu, P. A linear Lagrangian model predictive controller of macro- and micro-variable speed limits to eliminate freeway jam waves. Transportation Research Part C: Emerging Technologies, 128:103–121, 2021.
Han, Y., Wang, M., Li, L., Roncoli, C., Gao, J., and Liu, P. A physics-informed reinforcement learning-based strategy for local and coordinated ramp metering. Transportation Research Part C: Emerging Technologies, 137:103584, 2022.
Hegyi, A. and Hoogendoorn, S. Dynamic speed limit control to resolve shock waves on freeways - field test results of the SPECIALIST algorithm. In 2010 International IEEE Conference on Intelligent Transportation Systems, pages 519–524. IEEE, 2010.
Hegyi, A., Hoogendoorn, S., Schreuder, M., Stoelhorst, H., and Viti, F. SPECIALIST: A dynamic speed limit control algorithm based on shock wave theory. In 2008 International IEEE Conference on Intelligent Transportation Systems, pages 827–832. IEEE, 2008.
Hegyi, A., De Schutter, B., and Hellendoorn, H. Model predictive control for optimal coordination of ramp metering and variable speed limits. Transportation Research Part C: Emerging Technologies, 13(3):185–209, 2005a.
Hegyi, A., De Schutter, B., and Hellendoorn, J. Optimal coordination of variable speed limits to suppress shock waves. IEEE Transactions on Intelligent Transportation Systems, 6(1):102–112, 2005b.
Kerner, B. S. Empirical macroscopic features of spatial-temporal traffic patterns at highway bottlenecks. Physical Review E, 65(4):046138, 2002.
Kerner, B. S. and Rehborn, H. Experimental features and characteristics of traffic jams. Physical Review E, 53(2):R1297, 1996.
Kotsialos, A., Papageorgiou, M., and Messmer, A. Optimal coordinated and integrated motorway network traffic control. In 14th International Symposium on Transportation and Traffic Theory, 1999.
Kotsialos, A., Papageorgiou, M., Diakaki, C., Pavlis, Y., and Middelham, F. Traffic flow modeling of large-scale motorway networks using the macroscopic modeling tool METANET. IEEE Transactions on Intelligent Transportation Systems, 3(4):282–292, 2002a.
Kotsialos, A., Papageorgiou, M., Mangeas, M., and Haj-Salem, H. Coordinated and integrated control of motorway networks via non-linear optimal control. Transportation Research Part C: Emerging Technologies, 10(1):65–84, 2002b.
Li, L., Lv, Y., and Wang, F.-Y. Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3):247–254, 2016.
Li, Z., Liu, P., Xu, C., Duan, H., and Wang, W. Reinforcement learning-based variable speed limit control strategy to reduce traffic congestion at freeway recurrent bottlenecks. IEEE Transactions on Intelligent Transportation Systems, 18(11):3204–3217, 2017.
Liang, X., Du, X., Wang, G., and Han, Z. A deep Q learning network for traffic lights' cycle control in vehicular networks. IEEE Transactions on Vehicular Technology, 68(2):1243–1253, 2019.
Lighthill, M. J. and Whitham, G. B. On kinematic waves. II. A theory of traffic flow on long crowded roads. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 229, pages 317–345. The Royal Society, 1955.
Lu, X.-Y., Qiu, T. Z., Varaiya, P., Horowitz, R., and Shladover, S. E. Combining variable speed limits with ramp metering for freeway traffic control. In Proceedings of the 2010 American Control Conference, pages 2266–2271. IEEE, 2010.
Lu, X.-Y., Shladover, S. E., Jawad, I., Jagannathan, R., and Phillips, T. Novel algorithm for variable speed limits and advisories for a freeway corridor with multiple bottlenecks. Transportation Research Record, 2489(1):86–96, 2015.
Messmer, A. and Papageorgiou, M. METANET: A macroscopic simulation program for motorway networks. Traffic Engineering & Control, 31(9), 1990.
Muralidharan, A. and Horowitz, R. Computationally efficient model predictive control of freeway networks. Transportation Research Part C: Emerging Technologies, 2015.
Ozan, C., Baskan, O., Haldenbilen, S., and Ceylan, H. A modified reinforcement learning algorithm for solving coordinated signalized networks. Transportation Research Part C: Emerging Technologies, 54:40–55, 2015.
Papageorgiou, M. Some remarks on macroscopic traffic flow modelling. Transportation Research Part A: Policy and Practice, 32(5):323–329, 1998.
Papageorgiou, M., Hadj-Salem, H., and Blosseville, J.-M. ALINEA: A local feedback control law for on-ramp metering. Transportation Research Record, 1320(1):58–67, 1991.
Papageorgiou, M., Kosmatopoulos, E., and Papamichail, I. Effects of variable speed limits on motorway traffic flow. Transportation Research Record: Journal of the Transportation Research Board, (2047):37–48, 2008.
Prashanth, L. and Bhatnagar, S. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2):412–421, 2010.
Richards, P. I. Shock waves on the highway. Operations Research, 4(1):42–51, 1956.
Roncoli, C., Papageorgiou, M., and Papamichail, I. Traffic flow optimisation in presence of vehicle automation and communication systems - part II: Optimal control for multi-lane motorways. Transportation Research Part C: Emerging Technologies, 57:260–275, 2015.
Schmidt-Dumont, T. and van Vuuren, J. H. Decentralised reinforcement learning for ramp metering and variable speed limits on highways. IEEE Transactions on Intelligent Transportation Systems, 14(8):1, 2015.
Schönhof, M. and Helbing, D. Empirical features of congested traffic states and their implications for traffic modeling. Transportation Science, 41(2):135–166, 2007.
Soriguera, F., Martínez, I., Sala, M., and Menéndez, M. Effects of low speed limits on freeway traffic flow. Transportation Research Part C: Emerging Technologies, 77:257–274, 2017.
Spiliopoulou, A., Kontorinaki, M., Papageorgiou, M., and Kopelias, P. Macroscopic traffic flow model validation at congested freeway off-ramp areas. Transportation Research Part C: Emerging Technologies, 41:18–29, 2014.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
Wang, Y., Yu, X., Zhang, S., Zheng, P., Guo, J., Zhang, L., Hu, S., Cheng, S., and Wei, H. Freeway traffic control in presence of capacity drop. IEEE Transactions on Intelligent Transportation Systems, 22(3):1497–1516, 2020.
Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
Wu, Y., Tan, H., Qin, L., and Ran, B. Differential variable speed limits control for freeway recurrent bottlenecks via deep actor-critic algorithm. Transportation Research Part C: Emerging Technologies, 117:102649, 2020.
Yuan, K., Knoop, V. L., and Hoogendoorn, S. P. Capacity drop: Relationship between speed in congestion and the queue discharge rate. Transportation Research Record, 2491(1):72–80, 2015.
Zeng, J., Hu, J., and Zhang, Y. Adaptive traffic signal control with deep recurrent Q-learning. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1215–1220. IEEE, 2018.
Zhang, Y. and Ioannou, P. A. Combined variable speed limit and lane change control for highway traffic. IEEE Transactions on Intelligent Transportation Systems, 18(7):1812–1823, 2016.
Zhang, Y. and Ioannou, P. A. Stability analysis and variable speed limit control of a traffic flow model. Transportation Research Part B: Methodological, 118:31–65, 2018.