A new reinforcement learning-based variable speed limit control approach to improve traffic efficiency against freeway jam waves
Yu Han*a, Andreas Hegyib, Le Zhangc, Zhengbing Hed, Edward Chunge, Pan Liu*a
a School of Transportation, Southeast University, Nanjing, China
b Department of Transport and Planning, Delft University of Technology, the Netherlands
c School of Economics and Management, Nanjing University of Science and Technology, Nanjing, China
d Beijing Key Laboratory of Traffic Engineering, Beijing University of Technology, China
e Department of Electrical Engineering, Hong Kong Polytechnic University, Hong Kong, China
Abstract
Conventional reinforcement learning (RL) models of variable speed limit (VSL) control systems (and traffic control systems in general) cannot be trained in the real traffic process because new control actions are usually explored randomly, which may result in high costs (delays) due to exploration and learning. For this reason, existing RL-based VSL control approaches need a traffic simulator for training. However, the performance of those approaches is dependent on the accuracy of the simulators. This paper proposes a new RL-based VSL control approach to overcome the aforementioned problems. The proposed VSL control approach is designed to improve traffic efficiency by using VSLs against freeway jam waves. It applies an iterative training framework, where the optimal control policy is updated by exploring new control actions both online and offline in each iteration. The explored control actions are evaluated in the real traffic process, which avoids the RL model learning only from a traffic simulator. The proposed VSL control approach is tested using a macroscopic traffic simulation model that represents real-world traffic flow dynamics. By comparison with existing VSL control approaches, the proposed approach is demonstrated to have advantages in the following two aspects: (i) it alleviates the impact of model mismatch, which occurs in both model-based VSL control approaches and existing RL-based VSL control approaches, by replacing knowledge from the models with knowledge from the real process, and (ii) it significantly reduces the exploration and learning costs compared to existing RL-based VSL control approaches.
Keywords: Variable speed limits, Freeway traffic control, Reinforcement learning, Data-driven approach
1. Introduction
A jam wave, also known as a wide moving jam or shock wave in some studies, e.g., (Kerner and Rehborn, 1996; Hegyi et al., 2005b), is a common type of traffic jam on freeways. A jam wave usually originates from a traffic breakdown that occurs due to high traffic demand, and its head and tail both propagate upstream. Various empirical studies have distilled some common features of jam waves. For example, the propagation speed of a jam wave is roughly constant, typically between 15 and 20 km/h (Kerner, 2002). A jam wave can propagate for a long time and over a long distance, and resolves only when the traffic demand decreases (Kerner and Rehborn, 1996). The queue discharge rate from a jam wave is typically around 30 percent lower than the free-flow capacity (Schönhof and Helbing, 2007). Jam waves create many problems, including capacity reduction, travel delays, and safety risks. Therefore, eliminating jam waves can greatly improve freeway traffic operation efficiency.
One way to alleviate jam waves is to avoid the activation of infrastructural bottlenecks, e.g., on-ramp bottlenecks, by applying traffic control measures such as ramp metering and variable speed limits (Hadiuzzaman et al., 2013; Lu et al., 2015). As jam waves usually originate from standing queues that form at infrastructural bottlenecks, removing those bottlenecks may significantly reduce the occurrence of jam waves. However, due to the limited storage space at on-ramps, on-ramp bottlenecks cannot be fully avoided by ramp metering. On the other hand, variable speed limits (VSLs) can reduce the mainstream flow upstream of a bottleneck, so as to avoid the activation of the bottleneck. Carlson et al. (2011) proposed a feedback-based variable speed limit control method for local bottlenecks. Chen et al. (2014); Chen and Ahn (2015) developed analytical VSL approaches based on kinematic wave theory for recurrent and non-recurrent infrastructural bottlenecks. Hegyi et al. (2005a); Lu et al. (2010); Zhang and Ioannou (2016); Carlson et al. (2014) combined VSLs with other control measures, such as ramp metering and lane-changing control, to improve traffic operation efficiency at infrastructural bottlenecks. Carlson et al. (2010b); Wang et al. (2020) proposed optimal VSL control methods for large-scale freeway networks. The aforementioned VSL control approaches may create a high-density region in or upstream of the VSL-controlled area, which may trigger new jam waves. Hence, traffic control measures aimed at eliminating stationary bottlenecks may not be able to fully avoid the formation of jam waves.
Another way to alleviate jam waves is to suppress them after their formation using VSLs. Different theories and algorithms exist to determine the parameter values of the VSLs. The SPECIALIST algorithm proposed by Hegyi et al. (2008) is an analytical approach for determining VSL parameters using shockwave theory (Lighthill and Whitham, 1955; Richards, 1956). It was successfully implemented and tested in practice (Hegyi and Hoogendoorn, 2010). However, since the SPECIALIST algorithm has a feed-forward structure, disturbances that occur after the activation of a VSL scheme cannot be handled. Hegyi et al. (2005b) presented a model predictive control (MPC) approach of VSLs, where the design was based on a macroscopic second-order traffic flow model, METANET (Messmer and Papageorgiou, 1990; Kotsialos et al., 2002a). The nonlinear and non-convex formulation of METANET-based MPC approaches may result in a high computational load, especially if the optimization is solved by a standard SQP algorithm (Hegyi et al., 2005a). Moreover, globally optimal VSL control is often unattainable for that type of approach (Frejo and Camacho, 2012; Frejo et al., 2014). Muralidharan and Horowitz (2015); Roncoli et al. (2015); Hadiuzzaman and Qiu (2013); Han et al. (2017b); Zhang and Ioannou (2018) developed simpler MPC approaches with less computational complexity based on the cell transmission model and its variants. However, those models cannot accurately reproduce the propagation of jam waves (Han et al., 2016). Han et al. (2017b, 2021) proposed MPC approaches of VSLs based on discrete first-order traffic flow models formulated in Eulerian and Lagrangian coordinates; due to the linear formulations of the optimal controllers, those approaches significantly improved the computational efficiency. Despite the successful demonstration of the above MPC approaches via simulation, MPC for traffic systems is in general difficult to implement in practice, partially because MPC approaches are sensitive to the accuracy of the prediction models.
In recent years, data-driven approaches such as reinforcement learning (RL) have attracted greater attention in the realm of road traffic control as more traffic data become available. RL applications to road traffic control were initially investigated in urban traffic networks for traffic signal optimization problems (Arel et al., 2010; Prashanth and Bhatnagar, 2010; El-Tantawy et al., 2013; Li et al., 2016; Ozan et al., 2015). Regarding freeway traffic control, most RL applications have focused on improving traffic operation efficiency at local bottlenecks. Davarynejad et al. (2011) addressed a local ramp metering problem considering the storage capacity of on-ramps using a Q-learning algorithm. Li et al. (2017) presented a Q-learning-based VSL control approach for recurrent freeway bottlenecks. Schmidt-Dumont and van Vuuren (2015) proposed a decentralized RL approach that integrated ramp metering and VSLs. Belletti et al. (2017) presented a deep RL-based ramp metering strategy; in a simulation test, the strategy achieved a control performance comparable to the classical feedback ramp metering method ALINEA (Papageorgiou et al., 1991). Wu et al. (2020) proposed a deep actor-critic algorithm of lane-based VSLs to eliminate recurrent freeway bottlenecks. Han et al. (2022) proposed a physics-informed reinforcement learning approach for local and coordinated ramp metering.
Most existing RL-based traffic control approaches train their RL models using traffic simulators. Therefore, similar to the aforementioned MPC approaches, which are sensitive to the accuracy of the prediction models, the control performance of those RL-based approaches is also dependent on the accuracy of the simulators. Nevertheless, the training processes of those RL models cannot be performed in the real world, for two reasons. First, control actions are usually explored randomly in those approaches. Such action exploration can only be performed in a simulation environment, as a real traffic control system cannot accept randomly generated control actions that may lead to very poor traffic performance. Second, the training process with random exploration may require a large amount of training data, which may not be feasible to collect because the speed of data collection in the real world is restricted by physical time and the "slowness" of the traffic process. Furthermore, training those RL models using historical field data is also infeasible, because effective training data collected from the field are lacking: traffic flows in the real world are regulated by a limited number of pre-defined control strategies. In addition, many practical traffic control systems are not used for eliminating traffic jams or improving traffic efficiency. For example, many traffic signal control systems and speed control systems in reality only implement fixed signal timing plans and fixed speed limit values. The field data collected from those control systems cannot be used for training an RL model. Therefore, it remains a challenge to develop RL-based traffic control strategies for real-world implementation.
In this paper, we propose a new RL-based VSL control approach that trains the RL model on both offline synthetic data and data collected from the real system, where the real data gradually replace the synthetic data. The proposed VSL control approach consists of an offline training process and an online control process, which interact iteratively. In the online control process, data on the states, control actions, and the associated performance are collected as they occur in the real traffic process. In the offline training process, the data collected online are fed into a learning algorithm to update the control policy. To explore new control actions that may lead to better traffic performance, synthetic data generated from a macroscopic traffic flow model are also added to the training data set in the offline process. In the online control process, the VSL control policy obtained from the offline training process is applied to regulate the traffic flow, and at the same time a new batch of data is collected. Over the course of the iterations, the control performance is expected to improve as more real data are utilized by the RL.
The proposed approach is tested using the METANET model, which simulates real-world traffic flow dynamics; therefore, in this paper the data generated from the METANET model are referred to as real data. To reproduce the difference between the traffic prediction model and the real traffic process, we use another traffic flow model, the extended cell transmission model (CTM), as the offline data generation model. Data generated from the extended CTM are referred to as synthetic data. To demonstrate the performance of the proposed approach against model mismatch, it is compared with an MPC approach that uses the same extended CTM for prediction and with an existing RL-based VSL control approach that also uses the same extended CTM for training. The proposed approach is also compared with an existing RL-based VSL control approach with random exploration, to demonstrate its performance in reducing the exploration and learning costs.
The rest of this paper is organized as follows. Section 2 describes the VSL control problem. Section 3 presents the RL-based VSL control approach, including the offline training and online control processes. Section 4 describes the simulation design for testing the proposed approach, and Section 5 discusses the simulation results. The conclusions and topics for future research are discussed in Section 6.
2. The RL-based VSL control problem
This section presents the RL-based VSL control problem addressed in this paper. Section 2.1 describes the VSL
control mechanism in resolving freeway jam waves. Section 2.2 defines the RL-based VSL control problem. A
solution algorithm to that problem is presented in Section 2.3.
2.1. VSL control mechanism
As presented in Hegyi et al. (2008), two types of traffic jams are usually identified on freeways. Traffic jams with the head fixed at the bottleneck are known as standing queues, and jams that have an upstream-moving head and tail are known as jam waves (also called wide moving jams in some studies, e.g., Kerner and Rehborn (1996)). Fig. 1 shows a jam wave and a standing queue observed in real data. Both types of traffic jams can be eliminated by VSLs, based on two different mechanisms explained as follows.
Figure 1: Examples of a jam wave and a standing queue observed in real data. Data were collected from Dutch freeway A20 on January 23, 2006.
Standing queues form at infrastructural bottlenecks, e.g., an on-ramp bottleneck or a lane-drop bottleneck. VSL control strategies against infrastructural bottlenecks are developed based on the assumption that VSLs below the critical speed lead to a fundamental diagram with a lower capacity than under normal conditions. The application of VSLs upstream of a bottleneck permanently reduces the mainstream arriving flow, so as to avoid the bottleneck activation and the related throughput reduction resulting from the capacity drop. Capacity flow can then be established at the downstream bottleneck and the mainstream throughput is maximized, leading to a decrease of the total time spent. Fig. 2 shows the mechanism schematically. This mechanism forms the basis of the VSL control strategies in many studies, such as Carlson et al. (2010a,b); Hadiuzzaman et al. (2013); Li et al. (2017); Wang et al. (2020).
Figure 2: The mechanism of VSLs against infrastructural bottlenecks. The on-ramp is a potential bottleneck. q_{out} is the outflow of the VSL-controlled area, and q^c_{VSL} is the VSL-induced capacity.
The mechanism of VSLs in eliminating jam waves is different from that against standing queues. SPECIALIST is one of the earliest theories that systematically explained the mechanism of VSLs against jam waves (Hegyi et al., 2008). In Fig. 3, the time-space graph (left) shows the traffic states on a road stretch and their propagation over time. The density-flow diagram (right) shows the corresponding density and flow values for these states. According to kinematic wave theory, the boundary (front) between two states in the left figure has the same slope as the line connecting the two states in the right figure. Area 2 represents a jam wave that propagates upstream and is surrounded by traffic in free flow (areas 1 and 6). As soon as the jam wave is detected, VSLs are applied directly upstream of the jam wave, where the traffic state changes from state 6 to state 3. Subsequently, the size of the jam wave (area 2) is reduced because the inflow to the jam is lower than its outflow. The required length of the speed-limited stretch to resolve the jam depends on the density and flow associated with state 2 and the physical length of the detected jam. When the jam wave is resolved, there remains an area with the speed limits active (state 4) with a moderate density (higher than in free flow, lower than in the jam wave). It was assumed that the traffic from area 4 can flow out more efficiently than a queue discharging from full congestion, as in the jam wave (the flow of state 2). This assumption was confirmed in later research by analyzing the data from the SPECIALIST field test (Hegyi and Hoogendoorn, 2010).
The similarity between these two VSL control mechanisms is that they both assume the traffic jams are associated with a capacity drop, and that the major benefit of VSLs is to reduce travel delays by eliminating the capacity drop. The difference is that the two mechanisms eliminate the capacity drop in different ways, which may lead to different consequences. The mechanism of VSLs against jam waves takes advantage of the transition flow created by VSLs, which only lasts for a relatively short period of time, just long enough to resolve the jam. As it aims to keep the VSL-induced density at a moderate value (e.g., area 4 in SPECIALIST), these VSLs can keep the traffic stable under the speed limits. It is assumed that the demand is always lower than the free-flow capacity, so that the jam wave can be resolved without creating new congestion. On the other hand, VSLs against standing queues do not need that assumption, because even if the demand at the bottleneck exceeds the capacity, the traffic system still benefits from eliminating the capacity drop and maximizing the throughput. However, new congestion may be created when VSLs are applied to eliminate standing queues. For example, Papageorgiou et al. (2008) found that the VSL-induced capacity may be lower than the free-flow capacity. At a different site, however, Soriguera et al. (2017) could not identify any permanent flow reduction attributable to VSLs, even with a speed limit value as low as 40 km/h. Under such circumstances, speed limits lower than 40 km/h are needed to create a sufficiently low flow to eliminate the standing queue, which creates new congestion upstream of the VSL-controlled area.
Most of the experiments (both simulations and field tests) on VSLs against jam waves have been performed on a homogeneous freeway stretch (Hegyi et al., 2008; Han et al., 2017b, 2021). For VSLs against standing queues, some strategies have been tested via macroscopic simulation on larger freeway networks that include multiple on- and off-ramps (Carlson et al., 2010b; Wang et al., 2020).
Figure 3: Illustration of traffic evolution under SPECIALIST (Hegyi et al., 2008). The left figure is the time-space graph (location [km] versus time [h]) and the right figure is the fundamental diagram (flow [veh/h] versus density [veh/km]). Areas 3 and 4 are the VSL-controlled areas.
In this paper, we focus only on jam waves, and the mechanism of the VSLs follows the theory of SPECIALIST. From the SPECIALIST field test, it was concluded that some jam waves were not successfully resolved because the VSL-induced flows were not sufficiently low (Han et al., 2017a). In other failed cases, it was found that new jam waves were triggered upstream of the VSL-controlled area because the densities in that area were too high (Hegyi and Hoogendoorn, 2010). Therefore, an effective VSL control scheme to improve traffic efficiency against freeway jam waves should be able to (i) create a sufficiently low flow to resolve the jam, and (ii) maintain the density of the VSL-controlled area at a moderate value so as to avoid triggering a new jam wave. In the next section we formulate an RL controller that is capable of both, using stretches of VSLs that are directly upstream of the jam and that can vary in length.
2.2. RL-based VSL control system
Reinforcement learning concerns the problem of a learning agent that interacts with its environment to achieve a goal (Sutton and Barto, 2018). The agent and the environment generally interact in discrete time steps. At each time step k, the agent takes an action a(k) based on the state s(k) received from the environment. The environment responds to the action by assigning a reward r(k) to the agent and presenting a new state, s(k+1). The agent's objective at time step k is to maximize the accumulated reward-to-go over a given time horizon,

G(k) = \sum_{\tau=k}^{K_T} \gamma^{\tau-k} r(\tau),    (1)

where K_T denotes the time index at which the state of the environment reaches the terminal state, r(\tau) is the reward received at time \tau, and \gamma (0 \le \gamma \le 1) is the discount factor, so that \gamma^{\tau-k} defines the relative importance of the reward at time \tau.
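As a side illustration, the reward-to-go in (1) can be accumulated backwards in a few lines of Python. This is a minimal sketch with our own function and variable names, not code from the paper:

# Discounted reward-to-go G(k) of equation (1); rewards[j] holds r(k+j)
# for the remaining steps up to the terminal time K_T.
def reward_to_go(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: g <- r + gamma*g
        g = r + gamma * g
    return g

# Example: three remaining rewards 1, 2, 3 with gamma = 0.5.
assert abs(reward_to_go([1.0, 2.0, 3.0], 0.5) - (1 + 0.5*2 + 0.25*3)) < 1e-12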
In this paper, we consider an RL-based VSL control system, where the traffic dynamics on the freeway are the environment and the VSL controller is the agent. More specifically, we consider a long homogeneous freeway stretch, which is suitable for applying VSLs to resolve jam waves. The agent decides the speed limit values displayed to the drivers at different positions along the freeway. It is assumed that the freeway is equipped with fixed-location sensors, e.g., loop detectors, which divide the freeway into cells. Variable message signs (VMSs), which display the speed limit values, are placed above the freeway.

The state, action, and reward of the RL system are defined considering the mechanism of VSLs presented in the previous section. The freeway stretch is divided into four areas, indexed I, II, III, and IV from upstream to downstream, as shown in Fig. 4. They represent the area upstream of VSL control (I), the VSL-controlled area (II), the jam area (III), and the area downstream of the jam (IV), respectively. Each area consists of a number of consecutive cells, so the area boundaries coincide with the cell boundaries. Area I has a length of v_f \cdot T_k, where v_f denotes the free-flow speed and T_k is the unit time step duration. Area II, the VSL-controlled area, denotes the freeway section that is controlled by a number of consecutive VMSs displaying the speed limits. It is assumed that this area resides immediately upstream of the congestion area. Area III, the congestion area, consists of all the cells that are in congestion. Cell i is defined to be in congestion if v_i \le v_{jmax} and q_i \le q_{jmax} are both satisfied, where v_{jmax} and q_{jmax} are predefined speed and flow thresholds, respectively. Area IV covers the part where the discharging traffic recovers to the free-flow speed; its length should be long enough for the acceleration, e.g., longer than 1 km. These four areas move along with the jam wave, and their traffic states are updated in each control cycle accordingly. Note that the VSL controller is switched on only when there is a single jam area on the freeway. When there are multiple, disconnected congestion areas, e.g., multiple jam waves, the VSL controller will not be activated.
Figure 4: Dividing the freeway stretch into four areas.
As presented in the previous section, an effective VSL control scheme against freeway jam waves should be able to (i) create a sufficiently low flow to resolve the jam, and (ii) maintain the density of the VSL-controlled area at a moderate value. Therefore, the state and action variables of the RL model should be able to capture the traffic dynamics of the jam and the VSL-controlled area. According to the conservation law, the dynamic evolution of a traffic jam is related to the size of the jam at the current time step, i.e., how many vehicles are in the jam, and to the inflow and outflow of the jam. Likewise, the density variation of the VSL-controlled area is related to its current density and to the inflow and outflow of this area. Therefore, to resolve the jam wave and also maintain the density of the VSL-controlled area at a moderate value, the state, action, and reward functions of the RL system should take all those variables into account.

To define the state, action, and reward, the VSL control system is discretized in time. The state of discrete time step k, s(k), and the action, a(k), are defined as:

s(k) = [\bar{q}_I(k), \bar{\rho}_V(k), l_{jam}(k), \bar{v}_{jam}(k), P_{jam}(k)],    (2)
a(k) = [V(k), P_V(k)],    (3)
where V(k) denotes the speed limit value, and P_V(k) denotes the index of the most upstream cell of the VSL-controlled area. It is assumed that VSLs are applied directly upstream of P_{jam}, the most upstream cell of the jam area. Therefore, the variables V(k), P_V(k), and P_{jam}(k) determine the speed limit value displayed on every VMS. For the other state variables, \bar{q}_I denotes the average flow of area I, which is considered as the flow arriving at the VSL-controlled area in one time step. \bar{\rho}_V represents the average density of the VSL-controlled area. l_{jam} and \bar{v}_{jam} are the length and average speed of the congestion area, respectively; these two variables represent the size of the jam wave. All the state and action variables are summarized in Table 1. They can effectively capture the traffic dynamics of the jam and the VSL-controlled area:
1. The state variable \bar{q}_I(k) determines the inflow of the VSL-controlled area. The state variable \bar{\rho}_V(k) and the action variable V(k) approximate the outflow of the VSL-controlled area. The state variable \bar{\rho}_V(k) also represents the density of the VSL-controlled area at the current time step. Therefore, the RL system captures the density variation of the VSL-controlled area based on those variables.
2. The state variables l_{jam}(k) and \bar{v}_{jam}(k) represent the size of the jam at the current time step. The inflow to area III is equal to the outflow of area II; thus, the state variable \bar{\rho}_V(k) and the action variable V(k) approximate the inflow to the jam. According to the empirical study of Yuan et al. (2015), the outflow of a jam wave is dependent on the speed in the jam. Therefore, the state variable \bar{v}_{jam} can capture the outflow of the jam. The RL system captures the dynamic evolution of the jam based on those variables.
Table 1: The state and action variables of the proposed RL system.

State variables:
\bar{q}_I(k) [veh/h]            The inflow to the VSL-controlled area, calculated as the arithmetic mean of the measured flow of all cells in area I.
\bar{\rho}_V(k) [veh/km/lane]   The average density of the VSL-controlled area, calculated as the arithmetic mean of the density of all cells in area II.
l_{jam}(k) [km]                 The length of the congestion area, i.e., area III.
\bar{v}_{jam}(k) [km/h]         The average speed of the jam area, calculated as the arithmetic mean of the measured speed of all cells in area III.
P_{jam}(k)                      The index of the most upstream cell of area III, the jam area.

Action variables:
V(k) [km/h]                     The speed limit value.
P_V(k)                          The index of the most upstream cell of the VSL-controlled area.
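To make the data structures concrete, the state vector (2) and action vector (3) can be represented as small record types. The following sketch is for illustration only; the field names mirror Table 1 but are our own choice, not notation from the paper:

from typing import NamedTuple

class VSLState(NamedTuple):      # s(k) of equation (2)
    q_in: float     # q̄_I(k)  [veh/h], mean flow of area I
    rho_vsl: float  # ρ̄_V(k)  [veh/km/lane], mean density of area II
    l_jam: float    # l_jam(k) [km], length of the congestion area
    v_jam: float    # v̄_jam(k) [km/h], mean speed in the jam
    p_jam: int      # P_jam(k), most upstream cell index of area III

class VSLAction(NamedTuple):     # a(k) of equation (3)
    v_limit: float  # V(k) [km/h], displayed speed limit value
    p_vsl: int      # P_V(k), most upstream cell index of area II

s = VSLState(q_in=1500.0, rho_vsl=30.0, l_jam=1.2, v_jam=20.0, p_jam=14)
a = VSLAction(v_limit=60.0, p_vsl=11)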
For the presented VSL control system, the VSL controller is activated when a jam wave is detected and deactivated when the jam is resolved or considered unresolvable. The jam wave is considered resolved if, for every cell i, v_i > v_{jmax} and q_i > q_{jmax} are both satisfied. The jam wave is considered unresolvable if the congestion has reached the upstream boundary of the freeway stretch or multiple jam waves have been observed during the VSL control process.

It is assumed that for a single jam wave only one speed limit value is applied to the entire VSL-controlled area over the control horizon. In other words, on detection of a jam wave the speed limit value is decided, and it is kept unchanged until the speed limits are deactivated again. This setting is consistent with SPECIALIST, which has been implemented in practice. It requires less attention from drivers because they only need to decelerate once and accelerate once after the jam wave is resolved. Therefore, this setting is more acceptable to drivers and may also avoid new breakdowns induced by frequent acceleration and deceleration. For a different jam wave, however, the speed limit value can be chosen from all available values based on the learning result. Besides, when VSL control is activated, the speed limit values are gradually reduced from the default speed limit to the target value to avoid sharp decelerations.
The reward should reflect the improvement in traffic performance caused by VSLs. Intuitively, the reward should be a function of the freeway throughput, since the foremost improvement resulting from resolving jam waves is the elimination of the capacity drop. Unfortunately, the throughput increment produced by the VSL control can hardly be observed until the jam wave is resolved. Thus, for faster learning we define a reward function based on an artificial variable, J(k), that represents the congestion severity:

J(k) = \frac{l_{jam}(k)}{\bar{v}_{jam}(k)}.    (4)

The congestion severity decreases as the average speed in the congestion area increases and the length of the congestion area diminishes. The change of J(k) may take place soon after the VSL is implemented, before the jam wave is resolved. The reward, r(k), is defined as the reduction in congestion severity:

r(k) = J(k) - J(k+1).    (5)
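The severity measure and reward are then straightforward to compute; a minimal sketch with our own function names, where J has units of hours given the units of Table 1:

# Congestion severity J(k) = l_jam(k) / v̄_jam(k), equation (4).
def severity(l_jam_km, v_jam_kmh):
    return l_jam_km / v_jam_kmh

# One-step reward r(k) = J(k) - J(k+1), equation (5): positive if the jam shrinks.
def reward(l_now, v_now, l_next, v_next):
    return severity(l_now, v_now) - severity(l_next, v_next)

# Example: the jam shrinks from 1.5 km at 15 km/h to 1.2 km at 20 km/h.
print(reward(1.5, 15.0, 1.2, 20.0))  # 0.10 - 0.06 = 0.04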
2.3. The solution algorithm
The RL problem presented in the previous section can be solved by a number of methods (Sutton and Barto, 2018). In this section, we briefly introduce a model-free Q-learning method, which has been extensively used in RL-based traffic control systems (Watkins and Dayan, 1992; Davarynejad et al., 2011; Li et al., 2017). To apply the Q-learning method, the variables in the state and reward functions, i.e., equations (2) and (5), need to be discretized. The domain of each variable is divided into discrete intervals, and the value of each interval is represented by its midpoint.

The Q-learning method estimates the optimal value function Q^* using temporal-difference learning. The Q-value, Q(s,a), stores the value of a state-action pair, and it is updated according to:

Q(s,a) \leftarrow Q(s,a) + \kappa(s,a) \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],    (6)

where r is the observed reward of the transition from the current state s to the new state s' under action a; a' denotes the action chosen at state s'; and \kappa(s,a) is the learning rate, which controls how fast the Q-values are altered. Typically, the learning rate decreases over time to ensure convergence. Some studies, e.g., Li et al. (2017), defined the learning rate of a state-action pair as a function of the number of visits to that pair. In this paper we adopt the same method and define \kappa(s,a) as:

\kappa(s,a) = \left[ \frac{1}{1 + C(s,a)(1-\gamma)} \right]^{0.7},    (7)

where C(s,a) is the number of visits to the state-action pair (s,a).
For Q-learning, the selection rule for the action taken at a given state should consider the trade-off between exploitation and exploration. Even though pure exploitation may greatly reduce the learning time, it may prevent the discovery of new, potentially better actions. Conversely, although pure exploration outperforms pure exploitation in the capability of discovering better actions, it is quite time-consuming, as it selects actions without making use of the learning results. In this paper, the RL agent's exploration method follows Li et al. (2017), in which the probability of selecting action a from state s is:

p_s(a) = \frac{e^{Q(s,a)/T}}{\sum_{a' \in A_s} e^{Q(s,a')/T}},    (8)

where A_s is the set of available actions at state s, and T is the so-called temperature parameter. When T is large, each action has approximately the same probability of being selected (more exploration). When T is small, actions are selected in proportion to their estimated values (more exploitation).
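Equation (8) is the standard Boltzmann (softmax) selection rule; a minimal Python sketch with our own function names:

import math, random

# Selection probabilities of equation (8); q_row maps each available action
# a in A_s to Q(s, a).
def softmax_probs(q_row, temperature):
    m = max(q_row.values())  # subtract the max for numerical stability
    w = {a: math.exp((q - m) / temperature) for a, q in q_row.items()}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}

def select_action(q_row, temperature):
    probs = softmax_probs(q_row, temperature)
    return random.choices(list(probs), weights=list(probs.values()))[0]

q_row = {(60, 11): 1.2, (50, 11): 0.8}        # two candidate actions [V, P_V]
print(softmax_probs(q_row, temperature=10))   # near-uniform: more exploration
print(softmax_probs(q_row, temperature=0.1))  # near-greedy: more exploitation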
The pseudocode of Q-learning is shown in Algorithm 1. \mathcal{T} denotes the set of training data, where each training data slice is represented by a state transition tuple, [s, a, s', r]. S and A represent the set of states and the set of actions in the training data, respectively. \epsilon denotes a very small positive value used in the convergence test. The terminal state is defined as the state at the time the speed limit control is deactivated, i.e., when the jam wave is resolved or considered unresolvable. Note that any RL algorithm that can learn directly from data can be used in the proposed VSL control approach. In this paper, a simple Q-learning algorithm is used because the amount of training data is relatively small.
Algorithm 1 The pseudocode of Q-learning.
Input: \mathcal{T}, S, A
1: Initialize Q(s,a) = 0, C(s,a) = 0, \kappa(s,a) = 1, \forall s \in S, a \in A;
2: repeat
3:    Initialize s;
4:    repeat
5:       choose a from s based on equation (8);
6:       C(s,a) += 1;
7:       update \kappa(s,a) based on equation (7);
8:       update Q(s,a) based on equations (4)-(6);
9:       s \leftarrow s';
10:   until s is a terminal state
11: until convergence: \sqrt{\sum_s \sum_a (Q(s,a) - Q_{prev}(s,a))^2} \le \epsilon
Output: Q(s,a), \forall s \in S, a \in A
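For concreteness, a compact Python sketch of the Q-table update is given below. It is a simplified batch variant of Algorithm 1 under our own naming: it sweeps over the stored transition tuples instead of replaying episodes with softmax selection, but it applies update rule (6), the visit-count learning rate (7), and a convergence test on the size of the sweep's updates:

import math, random
from collections import defaultdict

GAMMA, EPS = 0.95, 0.01

def q_learning(transitions, n_sweeps=200):
    # transitions: list of [s, a, s_next, r] tuples with discretized,
    # hashable states and actions, as in the training set.
    Q = defaultdict(float)
    C = defaultdict(int)
    by_sa = defaultdict(list)       # observed outcomes per state-action pair
    actions_at = defaultdict(set)   # actions observed at each state
    for s, a, s_next, r in transitions:
        by_sa[(s, a)].append((s_next, r))
        actions_at[s].add(a)
    for _ in range(n_sweeps):
        delta = 0.0
        for (s, a), outcomes in by_sa.items():
            s_next, r = random.choice(outcomes)  # sample an observed outcome
            C[(s, a)] += 1
            kappa = (1.0 / (1.0 + C[(s, a)] * (1.0 - GAMMA))) ** 0.7  # eq. (7)
            best_next = max((Q[(s_next, a2)] for a2 in actions_at[s_next]),
                            default=0.0)          # 0 at terminal states
            td = r + GAMMA * best_next - Q[(s, a)]                    # eq. (6)
            Q[(s, a)] += kappa * td
            delta += (kappa * td) ** 2
        if math.sqrt(delta) <= EPS:  # stop once the Q-table barely changes
            break
    return Q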
3. An iterative RL approach of VSLs
As explained earlier in this paper, the conventional training method for RL-based traffic control strategies, which relies solely on traffic simulators to generate the training data, is flawed because accurate traffic simulators are difficult to obtain (Papageorgiou, 1998). In fact, the modeling errors of some well-known simulators were shown to be between 10% and 20% (Spiliopoulou et al., 2014; Han et al., 2017a). Moreover, how the error of a traffic simulator affects the performance of the corresponding RL controller is still unclear. On the other hand, training the RL controller with field data is also infeasible because of random exploration. The control actions that are randomly explored may lead to very poor traffic performance. Furthermore, the training process with random exploration may require a large amount of training data, which may not be feasible to collect because the speed of data collection in the real world is restricted by physical time.

In light of the above, our RL approach combines the two ways of training by using both offline simulation data and real data, where real data gradually replace simulation data. The proposed approach applies an iterative training framework, where the optimal control policy is updated by exploring new control actions both online and offline in each training iteration. Section 3.1 presents the general framework of the proposed approach. Sections 3.2 and 3.3 explain the offline training and online control processes, respectively.
3.1. Framework of the iterative RL
The proposed iterative RL approach of VSLs consists of an offline training process and an online control process, which interact through the iterations. The interaction between the two processes in each iteration is shown in Fig. 5. For the offline training process, the input includes historical data and a new batch of data collected from the last iteration. Historical data are sliced in the form of state transition tuples and added to the training dataset. To explore new control actions that may lead to better traffic performance, new synthetic data, generated from a macroscopic traffic flow model based on the new batch of real data, are also added to the training dataset. The process of offline synthetic data generation is presented in Section 3.2. The output of the offline training process is the Q-table that contains the Q-values, Q(s,a), of all available state-action pairs in the data.

After training, the optimal control policy is fed into the online control process. In each iteration, the VSL control policy associated with a fixed state-value table is implemented in the online process for a period of time. This duration is determined considering the trade-off between the control performance and the learning rate. If the time duration of each stage is too long, it takes more time for the VSL agent to improve the traffic performance. If the time duration is too short, the data gathered from the online process may be too limited for the RL agent to improve. A new control action is explored online only if the RL state is not in the Q-table. The online exploration, which follows a certain set of rules, is presented in Section 3.3.

Exploration in RL is always the price to pay for improving the system performance. However, in a real traffic system, there are many restrictions that limit the exploration of new control actions. For example, a poor exploration method, e.g., the random exploration used in many existing RL systems, may lead to very poor traffic performance or even unsafe traffic situations. Furthermore, an inefficient exploration method may not lead to any improvement in the real world simply because of limited physical time. The presented offline/online exploration method prevents, to some extent, poor control actions from being explored in the real traffic process, so as to reduce the exploration and learning costs. With the interaction between the offline training and the online control, the optimal policy is updated iteratively. Over the course of the iterations, the traffic performance is expected to improve with the updating of the optimal policy, because the model mismatch is alleviated by replacing knowledge from the models with knowledge from the real process.
Figure 5: The offline training process and online control process of the proposed approach in an iteration.
3.2. Offline training

The offline training process in an iteration is shown in the left block of Fig. 5. In the offline training process of iteration x, the set of training data slices, \mathcal{T}_x, which includes both real data and synthetic data, is gathered and fed into Algorithm 1 to obtain the Q-table. A training data slice is represented as a state transition tuple of the form [s, a, s', r]. In the training data, the real data are represented as \mathcal{T}^{real}_x, with

\mathcal{T}^{real}_x = \mathcal{T}^{start}_x \cup \bigcup_{n=1}^{x-1} \mathcal{T}^{real}_n,    (9)

where \mathcal{T}^{start}_x is the initial training data, e.g., the training data collected from previous implementations of other VSL control strategies at the target site. If no VSL control strategy was implemented before, \mathcal{T}^{start}_x is empty.
Table 2: The notation of variables in Algorithm 2.

T                     A training data slice
\tilde{T}             A synthetic training data slice
\mathcal{T}           The set of training data slices
\mathcal{T}^{real}    The set of real training data
\mathcal{T}^{start}   The initial training data
x                     Index of the iterations
m                     Index of a traffic data slice
Z                     The set of traffic data slices
z_m                   Traffic state data slice m, z = [\hat{\rho}, \hat{q}, \hat{v}], where \hat{\rho}, \hat{q}, \hat{v} are the vectors of density, flow, and speed of all the cells in the freeway network
s_m                   The RL state of data slice m
\tilde{A}_m           The set of feasible synthetic control actions for data slice m
\tilde{a}             A synthetic control action
\tilde{z}^{next}_m    The synthetic future traffic state predicted from state z_m
\tilde{s}^{next}_m    The synthetic future RL state predicted from state z_m
F                     The operator that predicts the traffic state using a traffic flow model
H                     The operator that transforms a traffic state into an RL state
The synthetic training data of iteration x are generated based on the set of traffic data slices, Z_{x-1}, collected from the online control process in iteration x-1. Each traffic data slice is represented by the vectors of density, flow, and speed of all the cells in the freeway network. For data slice z_m \in Z_{x-1}, we use a traffic flow model to predict its future traffic state one step ahead under all feasible new VSL control actions, represented by the set \tilde{A}_m. We define F as the operator that predicts the future traffic state using the traffic flow model:

\tilde{z}^{next}_m = F(z_m, \tilde{a}),    (10)

where \tilde{a} \in \tilde{A}_m and \tilde{z}^{next}_m represents the predicted traffic state. The predicted reward, \tilde{r}, can be obtained from the predicted traffic state according to (4)-(5). \tilde{z}^{next}_m is transformed into the corresponding RL state, \tilde{s}^{next}_m, according to the definition of the variables in (2). We define H as the operator that transforms a traffic state into an RL state:

\tilde{s}^{next}_m = H(\tilde{z}^{next}_m).    (11)

We define S_x as the set of all RL states observed in the real data, \mathcal{T}^{real}_x. If the predicted RL state is in that set, i.e., \tilde{s}^{next}_m \in S_x, then the synthetic data slice [s_m, \tilde{a}, \tilde{r}, \tilde{s}^{next}_m] is added to the training data set, \mathcal{T}_x, where s_m is the RL state corresponding to traffic state z_m.
Therefore, in the offline process, new actions are generated by which the process can go from state s to state s', where both states have been observed in the real data but the transition between them has not yet been observed. We use this data generation method for two reasons. Firstly, the reliability of the explored control actions depends on the accuracy of the traffic prediction. Since the proposed method predicts traffic state transitions only one step ahead, the prediction accuracy should be better than that of a multi-step prediction, in which the prediction error accumulates. Secondly, this method restricts the ratio of synthetic data in the training dataset. If the offline model also produced new states, the fraction of synthetic data might remain large and dominant in the training data, and consequently the model mismatch would not be alleviated.

By adding the information of such possible transitions to the training data, new actions can be explored in the offline training process. In addition, a new action leading to s' will only be chosen in the online control process if the associated Q-value is high enough (based on earlier experiences), which prevents choosing actions that lead to very poorly performing states. The pseudocode of the offline training process is shown in Algorithm 2, where the notation of the variables can be found in Tab. 2.
notation of variables can be found in Tab. 2.
In the training dataset, the ratio of real data increases with the number of iterations because only real data are
accumulated. Similar to many heuristic exploration methods such as softmax, the proposed method also has a higher
11
probability of exploration at the beginning of the training than at the end when the policy is close to the greedy policy.
At the early stage of the iterations, the training data set contains a high proportion of synthetic data, enabling the RL
agent to explore more actions. With an increasing number of iterations, the real data become dominant in the training
data set, and the RL agent explores fewer control actions, to guarantee the improvement of trac performance.
Algorithm 2 The pseudocode of the offline training process in iteration x.
Input: \mathcal{T}^{real}_x, S_x, A_x, Z_{x-1} = {z_1, z_2, ..., z_M}
1: \mathcal{T}_x = \mathcal{T}^{real}_x
2: for m = 1, 2, ..., M do
3:    s_m = H(z_m)
4:    for \tilde{a} \in \tilde{A}_m do
5:       \tilde{z}^{next}_m, \tilde{r} = F(z_m, \tilde{a})
6:       \tilde{s}^{next}_m = H(\tilde{z}^{next}_m)
7:       if \tilde{s}^{next}_m \in S_x then
8:          \tilde{T} = [s_m, \tilde{a}, \tilde{r}, \tilde{s}^{next}_m]
9:          \mathcal{T}_x \leftarrow \mathcal{T}_x \cup \{\tilde{T}\}
10:      end if
11:   end for
12: end for
13: Q(s,a) = Algorithm 1(\mathcal{T}_x, S_x, A_x)
14: Q^*_x(s,a) = Q(s,a)
Output: Q^*_x(s,a), \forall s \in S_x, a \in A_x
3.3. Online VSL control
The online control process is shown in the right block of Fig. 5. First, the control system detects jam waves based on traffic flow measurements, following the criteria presented in Section 2.2. The VSL controller is activated once a jam wave is detected. For the first control step k, if the RL state is in the RL state data set, i.e., s(k) \in S_x, the control action is decided by:

a(k) = \arg\max_a Q^*(s(k), a),  if s(k) \in S_x,    (12)

where a(k) = [V(k), P_V(k)]. If the RL state is not in the RL state data set, the control action of the first step is determined by an existing VSL control strategy, e.g., SPECIALIST. For the subsequent control steps, the speed limit value V = V(k) is kept unchanged, and only the boundaries of the VSL-controlled area are allowed to change. For step k, if the RL state s(k) is in the state data set, the controller exploits the optimal policy: among all the state-action pairs associated with that state, the action that produces the largest Q-value is chosen and implemented in the traffic process:

a(k) = \arg\max_a Q^*(s(k), a),  with a = [V, P_V], V = V(k),  if s(k) \in S_x.    (13)

If the RL state is not in the state-value table, a new control action is explored and implemented in the traffic process. For the new control action, it is assumed that the speed limit value is the same as in the previous control step, i.e., V(k+1) = V(k), and that the index of the most upstream cell of the VSL-controlled area changes by no more than 1, i.e., |P_V(k) - P_V(k+1)| \le 1. Note that this constraint not only prevents frequent acceleration and deceleration of drivers caused by VSLs, but also reduces the exploration space of the RL; if the exploration space is too large, finding the actions that improve the system performance may take an unrealistically long time.
For the states that do not exist in the state-value table, we apply a simple method to determine P_V(k). The method intends to keep the density of the VSL-controlled area at a moderate value. Specifically, we define a tuning parameter \rho^{cr}_V, which represents the critical density of the VSL-controlled area, and use \rho^{up}_V to denote the density of the most upstream cell of the VSL-controlled area. For \rho^{up}_V in two consecutive steps, \rho^{up}_V(k-1) and \rho^{up}_V(k), there are four possible situations:

1. The density is lower than the critical value and decreasing: \rho^{up}_V(k) \le \rho^{cr}_V and \rho^{up}_V(k) \le \rho^{up}_V(k-1);
2. The density is lower than the critical value and increasing: \rho^{up}_V(k) \le \rho^{cr}_V and \rho^{up}_V(k) > \rho^{up}_V(k-1);
3. The density is higher than the critical value and decreasing: \rho^{up}_V(k) > \rho^{cr}_V and \rho^{up}_V(k) \le \rho^{up}_V(k-1);
4. The density is higher than the critical value and increasing: \rho^{up}_V(k) > \rho^{cr}_V and \rho^{up}_V(k) > \rho^{up}_V(k-1).

For situation 1, the upstream boundary of the VSL-controlled area moves one cell downstream. For situations 2 and 3, the upstream boundary remains at the same position. For situation 4, the upstream boundary moves upstream by one cell. Therefore, the next boundary position is determined as:

P_V(k+1) = P_V(k) - 1,  if situation 1 and s(k) \notin S_x,
P_V(k+1) = P_V(k),      if situation 2 or 3 and s(k) \notin S_x,    (14)
P_V(k+1) = P_V(k) + 1,  if situation 4 and s(k) \notin S_x.
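The boundary update of (14) amounts to comparing the upstream-cell density against the critical value and against its previous value. A minimal sketch (our function name, assuming the reconstruction of (14) above):

# Update the most upstream cell index P_V according to rule (14).
# rho_prev, rho_now: density of the most upstream VSL cell at steps k-1, k;
# rho_cr: tuning parameter, the critical density of the VSL-controlled area.
def update_upstream_boundary(p_v, rho_prev, rho_now, rho_cr):
    if rho_now <= rho_cr and rho_now <= rho_prev:  # situation 1
        return p_v - 1
    if rho_now > rho_cr and rho_now > rho_prev:    # situation 4
        return p_v + 1
    return p_v                                     # situations 2 and 3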
4. Simulation experiment design
This section presents the simulation experiments for testing the proposed VSL control approach. The purpose of the simulations is to show that the proposed approach (i) can effectively eliminate jam waves and reduce travel delays, (ii) performs better than approaches affected by the model mismatch, and (iii) has lower exploration and learning costs than an RL method with random exploration. The following experiment scenarios are designed.
1. Testing the proposed iterative RL approach of VSLs using macroscopic traffic simulation. The purpose of this scenario is to investigate the performance of the proposed approach in reducing travel delays during the iterative training process. As it is impossible to directly test the proposed approach in the field, the METANET model is used as the process model to represent the real-world traffic flow dynamics. For this scenario, the overall framework is presented in Section 4.1. The process model and simulation settings are presented in Section 4.2. The parameter settings of the RL controller are presented in Section 4.3.
2. Comparing the proposed approach to SPECIALIST. In SPECIALIST, the traffic state transitions under VSLs are predicted based on kinematic wave theory. The accuracy of the prediction is influenced by the tuning parameters and by external disturbances such as demand fluctuations. Therefore, the mismatch between the prediction results and the real process may affect the control performance. The purpose of this scenario is to demonstrate that the proposed approach can outperform SPECIALIST in terms of reducing travel delays by eliminating the model mismatch. Parameter settings of SPECIALIST are presented in Section 4.4.
3. Comparing the proposed approach to an existing MPC approach against freeway jam waves (Han et al., 2017b). The MPC approach was developed based on the extended CTM (Han et al., 2016). As the prediction model of the MPC is different from the process model (METANET), its performance is affected by the model mismatch. This scenario intends to demonstrate that the proposed approach can outperform the MPC approach in terms of reducing travel delays by alleviating the model mismatch.
4. Comparing the proposed approach to an existing RL-based VSL control approach with random online exploration. In this scenario, the RL model is trained directly in the real traffic process using the Double DQN (DDQN) algorithm (see scenario 5). As the DDQN explores control actions randomly, the exploration and learning costs during the training may be very high; for example, a randomly explored control action may lead to very poor traffic performance and even increase the travel delay. The purpose of this scenario is to demonstrate that the proposed approach has much lower exploration and learning costs than the random exploration method.
5. Comparing the proposed approach to an existing RL-based VSL control approach with zero-shot policy transfer. In this scenario, an existing deep reinforcement learning algorithm, namely Double DQN (DDQN, Van Hasselt et al. (2016)), is used as the training algorithm, and the same extended CTM is used as the training environment. After training, the optimal policy is directly transferred to the real traffic process. As the training environment is different from the real traffic process, this RL-based approach is also affected by the model mismatch. The purpose of this scenario is to demonstrate that the proposed approach can outperform this RL-based approach by alleviating the model mismatch.
6. Comparing the proposed approach to an existing RL-based VSL control approach with continual online learning. In this scenario, the DDQN-based VSL control strategy of scenario 5 is assumed to continue learning from the online environment after the offline optimal policy has been transferred. This scenario intends to investigate whether the DDQN can continually improve the traffic performance in the online environment, and also to quantify the online learning cost of the DDQN.

The simulation results of these experiment scenarios are presented in Sections 5.1-5.5.
4.1. Overall framework of experiment scenario 1
The simulation experiment for testing the proposed approach includes the following steps.
1. Implementing the starting VSL control approach. We assume SPECIALIST as the starting VSL approach, applied before implementing the proposed approach. Therefore, in (9), \mathcal{T}^{start}_x is the set of training data collected from the SPECIALIST implementation. The time period of implementing SPECIALIST is represented by 100 online simulations, where in each simulation one jam wave is artificially created.
2. The offline-online interaction process. The iterations start from the offline training process. Offline, the synthetic data are generated from the extended CTM, which is briefly presented in Section 4.5. In each iteration, the optimal VSL control policy associated with a fixed state-value table is implemented in the online process for a period of time, represented by 100 online simulations. Other parameters of the RL controller are specified in Section 4.3.
3. Stop criterion. In the online control process, if the RL state is in the state-value table, i.e., the state has appeared in the historical traffic data, the action that produces the largest Q-value is implemented in the process. If the RL state is not in the state-value table, a new control action is implemented. We define the actions selected from the state-value table as RL control actions. In each stage, the total number of RL control actions, N^{RL}_x, and the total number of all control actions, N_x, are recorded. The ratio between N^{RL}_x and N_x, denoted as \eta_x, represents the percentage of states that have appeared in the historical data. In general, \eta_x should increase with the number of iterations and the expansion of the training data. The experiment ends when \eta_x exceeds 0.8, i.e., when a large percentage of the states have appeared in the historical data.
Note that two different macroscopic traffic flow models are used in the simulation experiments. The METANET model is used as the process model, which represents the real-world traffic flow dynamics. The extended CTM is used as the offline data generation model. Therefore, the simulations using METANET are referred to as online simulations, and the simulations using the extended CTM are referred to as offline simulations.

The stochasticity of traffic flow is considered in the experiment by incorporating noise into the process model for different jam waves. Detailed settings of the process model are presented in Section 4.2. The simulation experiment is repeated 20 times to avoid unreliable results due to the stochasticity of the simulation environment.
4.2. The METANET model and simulation settings
The second-order macroscopic traffic flow model METANET (Messmer and Papageorgiou, 1990; Kotsialos et al., 2002b) has been extensively used for freeway traffic simulation. The METANET model predicts the dynamic evolution of traffic speeds based on a steady speed-density relation and some heuristic terms that express driver behavior. Hegyi et al. (2005a) extended the METANET model to account for the effect of VSLs, and the model with the VSL extension has been validated using field data (Han et al., 2017b; Frejo et al., 2019). In this simulation test, the model presented in Hegyi et al. (2005a) is used as the process model to represent real-world traffic dynamics. We choose METANET as the process model because it has been validated to reproduce the propagation of jam waves with reasonable accuracy (Han et al., 2017a; Frejo et al., 2019), and because it runs much faster than microscopic simulations.
In the METANET model, the freeway is divided into cells with a uniform geometric structure. For cell i, the desired speed at time t is calculated as:

V(\rho_i(t)) = \min\left( V_{C,i}(t),\; v_{f,i} \cdot \exp\left( -\frac{1}{a_m} \left( \frac{\rho_i(t)}{\rho_{cr,i}} \right)^{a_m} \right) \right),    (15)

where the first term, V_{C,i}, is the speed limit of cell i. We assume that the drivers fully comply with the speed limit control. The second term describes the steady speed-density relation of the model, which is characterized by three parameters, namely a_m, v_{f,i}, and \rho_{cr,i}. In the fundamental diagram, v_{f,i} and \rho_{cr,i} represent the free-flow speed and the critical density, respectively. For the sake of compactness, the equations that describe the traffic dynamics of METANET are given in Appendix A.
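For illustration, equation (15) with the parameter values of Section 4.2 reads as follows in Python (a sketch; the function and argument names are ours):

import math

# Desired speed of equation (15): the minimum of the displayed speed limit
# V_{C,i}(t) and the steady speed-density relation of METANET.
def desired_speed(rho, v_limit, v_f=108.0, rho_cr=27.6, a_m=2.5):
    v_fd = v_f * math.exp(-(1.0 / a_m) * (rho / rho_cr) ** a_m)
    return min(v_limit, v_fd)  # full compliance with the speed limit

print(desired_speed(rho=27.6, v_limit=120.0))  # 108*e^(-1/2.5) ≈ 72.4 km/h
print(desired_speed(rho=27.6, v_limit=60.0))   # capped at the 60 km/h limit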
Most of the experiments on VSLs against jam waves (both simulations and field tests) have been performed on homogeneous freeway stretches. In our experiments, a three-lane synthetic freeway stretch is used as the test bed for the proposed VSL control approach. The homogeneous freeway stretch is 7.5 km in length and is divided into 25 cells. A graphical representation of the synthetic freeway is shown in Fig. 6. The parameter values of the process model are taken from Kotsialos et al. (1999); Hegyi et al. (2005a); Han et al. (2017b). Specifically, \rho_{cr} = 27.6 veh/km/lane and a_m = 2.5 for every cell, and v_f = 108 km/h.
Figure 6: A graphical representation of the synthetic freeway stretch.
In practice, traffic flow conditions (e.g., traffic demand and capacity) may vary from day to day. To reproduce this stochastic feature of traffic flow, we assume that the parameters v_f, a_m, and ρ_cr, which influence the shape of the fundamental diagram, are stochastic. Each of the three parameters is assumed to follow a Gaussian distribution, where the mean is equal to the reference value and the standard deviation is 2% of the mean. In each online simulation run, a sample of these parameters is taken, which gives (slightly) different fundamental diagrams for different simulation runs. Fig. 7 (a) shows the free-flow capacities obtained from 100 random online simulation runs; in most runs the free-flow capacity ranges from 1900 veh/h/lane to 2100 veh/h/lane. Furthermore, to reproduce real-world demand fluctuations, the demands in the online simulation runs are also assumed to follow a Gaussian distribution. Specifically, each online simulation run lasts for 2 hours, consisting of one hour of peak time and one hour of off-peak time. The means of the peak-hour and off-peak-hour demands are set to 90% of the capacity (which varies between simulation runs) and 4000 veh/h, respectively. The standard deviations are set to 5% of the mean.
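For the METANET fundamental diagram above, the per-lane capacity has the closed form ρ_cr · v_f · exp(−1/a_m), since the flow ρV(ρ) is maximized at ρ = ρ_cr; for the reference parameters this gives roughly 1998 veh/h/lane. The sketch below, our illustration with hypothetical names, samples the three parameters as described and evaluates the resulting capacity, reproducing the spread reported in Fig. 7 (a).

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the stochastic fundamental-diagram parameters (std = 2% of mean)
# and compute the implied per-lane free-flow capacity.
def sample_capacity(v_f=108.0, a_m=2.5, rho_cr=27.6, rel_std=0.02):
    v = rng.normal(v_f, rel_std * v_f)
    a = rng.normal(a_m, rel_std * a_m)
    r = rng.normal(rho_cr, rel_std * rho_cr)
    return r * v * np.exp(-1.0 / a)   # veh/h/lane

caps = [sample_capacity() for _ in range(100)]
# Most samples fall roughly in the 1900-2100 veh/h/lane range (cf. Fig. 7a).
```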
Figure 7: Results of 100 online simulation runs: (a) the road capacity of each online simulation run; (b) the density-flow plot of all cells.
Jam waves in reality usually form at a relatively fixed location of a site (Hegyi and Hoogendoorn, 2010). In the simulations, jam waves are artificially triggered at the downstream boundary of the freeway stretch: the density downstream of the stretch is set to 100 veh/km/lane during minutes 32-34. To give an impression of the resulting stochasticity, we ran the simulation 100 times with the presented demand and parameter settings. The density-flow plot, taken from the data of every cell in every minute, is shown in Fig. 7 (b). The length of the congested area of the created jam waves varies from 0.9 km to 2.4 km, which is consistent with empirical observations. Fig. 8 shows an example of the simulated jam waves.
Figure 8: (a) Speed and (b) Flow contour plots of an example of the simulated jam waves.
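The jam-wave trigger described above can be sketched as a time-dependent downstream boundary condition. The snippet is our illustration; the boundary density outside the trigger window (20 veh/km/lane below) is our assumption, not a value from the paper.

```python
def downstream_boundary_density(t_min, jam_density=100.0, free_density=20.0):
    """Boundary condition that triggers a jam wave: a high downstream
    density (100 veh/km/lane) is imposed during minutes 32-34."""
    return jam_density if 32 <= t_min < 34 else free_density
```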
4.3. Settings of the RL controller
In the offline training process of the proposed VSL control approach, the state and reward variables need to be discretized. The domain of each variable is divided into discrete intervals, and the value of each interval is represented by its midpoint. The discrete interval sizes of q_I, ρ_V, l_jam, v_jam, and P_jam are set to 100 veh/h/lane, 2 veh/km/lane, 0.3 km, 5 km/h, and 1 cell, respectively, which reflects the trade-off between data resolution and the size of the variable space. The discrete intervals and the upper and lower bounds of the state and action variables are summarized in Tab. 3. A penalty of -200 min is added to the terminal state if the jam wave is not successfully resolved. In the Q-learning, the convergence threshold is set to 0.01 min.
Table 3: The discrete intervals and the upper and lower bounds of the state and action variables.

Variable               Discrete interval   Upper bound   Lower bound
q̄_I [veh/h]            100                 2000          1000
ρ̄_V [veh/km/lane]      2                   100           10
l_jam [km]             0.3                 3             0.3
v̄_jam [km/h]           5                   50            5
P_jam [cell]           1                   25            1
P_V [cell]             1                   24            1
V [km/h]               10                  60            50
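The midpoint discretization described above can be sketched as follows. This is our illustration; in particular, the clipping behavior at the bounds is an assumption, as the paper does not specify how out-of-range values are handled.

```python
import numpy as np

# Map a continuous variable to the midpoint of its discrete interval,
# following the intervals and bounds of Table 3.
def discretize(value, interval, lower, upper):
    value = np.clip(value, lower, upper)
    bin_idx = int((value - lower) // interval)
    midpoint = lower + (bin_idx + 0.5) * interval
    return min(midpoint, upper - 0.5 * interval)

# Example: an inflow of 1530 veh/h with a 100 veh/h interval maps to the
# midpoint 1550 veh/h.
q_I = discretize(1530.0, 100.0, 1000.0, 2000.0)
```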
In the proposed control system, we assume that two speed limit values are used: 50 km/h and 60 km/h. These two values are chosen based on both empirical evidence and trial-and-error tuning. From extensive simulation tests it is found that (i) a speed limit lower than 50 km/h would result in a higher density in the VSL-controlled area, which increases the risk of inducing new traffic breakdowns, and (ii) a speed limit higher than 60 km/h may not be able to trigger a sufficiently low flow to resolve the jam waves. For the reader's reference, the displayed speed limit value in the SPECIALIST system is 60 km/h (Hegyi and Hoogendoorn, 2010). For some traffic situations, the traffic performance might be further improved if more speed limit values could be displayed. However, the solution space of the RL increases exponentially with the size of the action space, which may require an impractical amount of time to gather sufficient training data for the RL agent to improve the traffic performance. Therefore, the number of speed limit values is determined by considering the trade-off between the potential traffic improvement and the time required to achieve that improvement.
In the online control process, the duration of a control time step, T_k, is set to 30 s. When VSL control is activated, 100 km/h and 80 km/h are used as lead-in values to avoid a sharp reduction of the speed limit, i.e., from the free-flow speed directly to 50 or 60 km/h. The same approach was used in Hegyi and Hoogendoorn (2010). The critical density of the VSL control region, ρ_V^cr, is set to 30 veh/km/lane. The VSL control is deactivated when the jam wave is resolved.
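A minimal sketch of the lead-in scheme follows. Whether the intermediate values are displayed over successive control steps or successive gantries is an implementation detail the paper does not spell out, so the function below simply returns the display sequence.

```python
def lead_in_sequence(target_limit):
    """Lead-in as described above: instead of dropping from free flow
    directly to the target limit, 100 and 80 km/h are displayed first
    (cf. Hegyi and Hoogendoorn, 2010)."""
    assert target_limit in (50, 60)
    return [100, 80, target_limit]
```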
4.4. The starting VSL control approach
We assume that the SPECIALIST algorithm is the original VSL control approach in operation before the RL-based VSL control approach is implemented. SPECIALIST has multiple tuning parameters, which have clear physical interpretations. These parameters can be tuned based on heuristic tuning rules using offline traffic data. In this simulation test, we mimic the implementation of SPECIALIST in the METANET simulation. A brief introduction of SPECIALIST and the tuning rules of its parameters are presented in Appendix B. Fig. 9 shows an example from the simulation in which the jam wave is successfully resolved by SPECIALIST.
Figure 9: (a) Speed and (b) Flow contour plots of an example in the simulation in which the jam wave is successfully resolved by SPECIALIST. In
(a), the VSL-controlled area is enclosed by black lines.
4.5. Model mismatch
In scenario 1 of the simulation experiments, we use the extended CTM, proposed by Han et al. (2016), as the offline data generation model. The model extends the original CTM to reproduce the capacity drop and the propagation of jam waves. Since there is always a mismatch between the real traffic process and a traffic simulation model, we choose the extended CTM, which has a different mechanism from the METANET model, as the offline synthetic data generation model to reproduce such a mismatch.

Although the process model (the METANET model) and the offline data generation model (the extended CTM) have some similarities, e.g., both assume a fundamental diagram for homogeneous traffic states, their mechanisms are still quite different. For example, the METANET model considers driver behavior in the traffic speed dynamics, such as anticipation of spatially increasing or decreasing densities, while the extended CTM does not. In the simulation experiment, the extended CTM is calibrated with simulation data from the METANET model. For a detailed presentation of the extended CTM, readers are referred to Han et al. (2016, 2017b).
Furthermore, in scenario 3 of the simulation experiments, the extended CTM is used as the prediction model of an MPC controller of VSLs for comparison. In scenario 4, the training environment of an existing RL-based VSL control strategy, which is used for comparison, is also developed based on the extended CTM. In these two scenarios, the model mismatch is reproduced as a result of the difference between METANET and the extended CTM.
5. Simulation results and analysis
This section presents the results of the simulation experiments, and each sub-section corresponds to one of the
experiment scenarios described in Section 4.
5.1. Performance of the proposed approach
This section presents the results of the simulation experiment in testing the proposed approach, described as
scenario 1 in Section 4. In the simulation experiment, 100 online simulations are performed in each iteration. The
trac performance at each iteration is evaluated using the average total travel delay as the performance indicator,
which is calculated as the dierence between the total time spent by all vehicles in the freeway stretch and the sum of
all the vehicles’ free-flow travel time. The simulation experiment is repeated for 20 times to avoid getting unreliable
results due to the stochasticity of the simulation environment. A whisker plot that depicts the trac performance of
the proposed VSL control approach is shown in Fig. 10 (a). The average total travel delay saving of the proposed
approach is 31.3% during the entire training process.
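Given density and flow trajectories from the simulation, this performance indicator can be computed as sketched below. This is our reconstruction of the stated definition, using the total distance traveled divided by the free-flow speed as the free-flow travel time; names are illustrative.

```python
import numpy as np

# Total delay = total time spent minus the vehicles' free-flow travel time.
def total_delay(rho, q, cell_len_km, T_s_h, n_lanes=3, v_free=108.0):
    """rho: (T, I) densities in veh/km/lane; q: (T, I) flows in veh/h
    (all lanes); cell_len_km: cell length; T_s_h: step length in hours."""
    tts = np.sum(rho * n_lanes * cell_len_km) * T_s_h   # total time spent, veh*h
    ttd = np.sum(q * cell_len_km) * T_s_h               # total distance traveled, veh*km
    return tts - ttd / v_free                           # delay, veh*h
```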
Fig. 10 (b) shows the total travel delay improvement of the presented VSL control approach for different values of η. In the figure, each box represents a 10 percent interval of that ratio. The dashed blue line in each box indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points. The red line represents the average total travel delay reduction for the different intervals of η. In general, the average total travel delay saving increases with η, except for the interval [20%, 30%], where only three data points are observed. Moreover, it can be observed that the lower bound of the total travel delay reduction also increases once η is higher than 50%, which indicates that the presented VSL control approach becomes more robust as η grows. These results are as expected, because with the increase of η, more control actions are explored and more data are utilized by the RL. Hence, the actions selected by the RL controller become more reliable, because the RL controller takes the stochasticity of the traffic environment into account.

Fig. 10 (c) shows the change of η with the number of iterations. The average number of iterations in the simulation experiment is 15.9. The offline training time of the RL agent varies from less than one minute to 5 minutes. During earlier stages, when the amount of training data is smaller, it takes less time for the Q-learning to converge.
Figure 10: (a) The whisker plot of the total travel delay reduction (compared to the no-VSL case) for different iterations; (b) the total travel delay reduction for different ratios of RL actions; (c) the share of RL control actions for different iterations.
5.2. Comparison with SPECIALIST
This section presents the comparison between the proposed approach and SPECIALIST, described as scenario 2 in Section 4. SPECIALIST is utilized as the starting VSL control strategy in scenario 1 of the simulation experiments. The average total travel delay reduction of SPECIALIST is 15.5%. In all the simulation runs, about 70% of the jam waves are classified as resolvable, and the VSL schemes generated by SPECIALIST are implemented in those cases. Among the cases where VSLs are implemented, over 60% of the jam waves are successfully resolved. Some of the failures are attributed to the mismatch between the predicted traffic dynamics and the real traffic process, for example, a sudden demand increase.

In contrast, the proposed VSL control approach reduces the average total travel delay by 35.1% once η is larger than 0.8. About 80% of the jam waves are resolved by VSLs, which is much higher than with the SPECIALIST algorithm. Its better performance is attributed to two main reasons. First, the RL controller has a feedback structure: it determines the VSL control actions based on the online measured traffic states and is thus able to handle disturbances such as demand increases. Second, the RL controller does not rely on online traffic prediction, because the optimal control actions are obtained mainly from real traffic data.
Fig. 11 shows an example comparing SPECIALIST and the proposed VSL control approach. In this example, both VSL control approaches are tested using the same demand profile and the same parameter values for the process model. Under SPECIALIST, the VSL-controlled area is too short to generate a transition flow that lasts long enough to resolve the jam wave, because the outflow of the jam (the flow of area 1 in Fig. 3) is overestimated. Moreover, as SPECIALIST has a feed-forward control structure, it is very sensitive to errors in the traffic flow prediction. By contrast, the RL-based controller successfully resolves the jam wave.
It is worth noting that we also tried a different set of SPECIALIST parameters. Although the performance of SPECIALIST with the new parameters is inferior, the performance of the proposed approach that starts from this inferior tuning of SPECIALIST is not affected: it still reduces the total travel delay by 35% at the end of the training. The reason is that the proposed approach explores new control actions and evaluates them in every stage. The actions that lead to a good traffic performance are kept, and the actions that lead to a worse traffic performance are discarded by the RL model. When the amount of training data becomes sufficiently rich, the SPECIALIST data form only a small proportion of the training data and are overruled by the real data. Therefore, the performance of the proposed approach is not sensitive to the tuning parameters of SPECIALIST.
Figure 11: Comparison between SPECIALIST and the proposed VSL control approach in an example. In this example, (a) and (d) are the simulated speed (km/h) and flow (veh/h) contour plots without VSL control; (b) and (e) are the simulation results under SPECIALIST; (c) and (f) are the simulation results under the proposed VSL control approach. In (b) and (c), the VSL-controlled areas are enclosed by black lines.
5.3. Comparison with an MPC approach
This section presents the results of the comparison between the proposed VSL control approach and the MPC approach, described as scenario 3 in Section 4. The same extended CTM is used for traffic prediction in the MPC. The MPC has a feedback control structure, and the optimal VSL control scheme is recalculated in every control step based on traffic state feedback. The prediction horizon is set to 20 minutes and the duration of a control step to 30 seconds. The model parameters are calibrated with the online simulation data. The minimum VSL value in the optimization of the MPC is set to 50 km/h. At each optimization step, the traffic demand in the prediction is set to a constant value for the entire prediction horizon, predicted as the measured average demand of the last 15 minutes. For a full presentation of the MPC, readers are referred to Han et al. (2017b).
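A sketch of this demand predictor follows, assuming one flow measurement per minute; the class name and the default value returned before any measurement arrives are our assumptions.

```python
import collections

# Rolling-average demand predictor: the demand over the whole prediction
# horizon is set to the average flow measured during the last 15 minutes.
class DemandPredictor:
    def __init__(self, window=15):
        self.buffer = collections.deque(maxlen=window)

    def update(self, measured_demand):
        self.buffer.append(measured_demand)

    def predict(self, default=4000.0):
        if not self.buffer:
            return default        # fallback before any measurement (assumed)
        return sum(self.buffer) / len(self.buffer)
```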
The MPC controller is run with the online process model for 100 simulation runs. It reduces the average total travel delay by 25.9%, which is higher than SPECIALIST but lower than the proposed VSL controller. The performance of the MPC controller depends on the accuracy of the traffic prediction: it may generate ineffective control schemes if the predicted traffic dynamics are not consistent with the simulated traffic process. Fig. 12 shows an example in which the MPC controller fails to resolve the jam wave because of an inaccurate traffic prediction. In this example, the capacity of the process model is set to 1950 veh/h/lane, which is slightly lower than that of the prediction model, 2000 veh/h/lane. At minute 35, when the MPC controller is activated, the predicted traffic demand is 4900 veh/h but the actual traffic demand is about 5400 veh/h. Therefore, the congestion severity of the jam wave is underestimated by the MPC. As a result, the MPC only narrows the jam wave but is unable to completely resolve it. Using the same demand profile and parameter values, the proposed VSL control approach successfully resolves the jam wave, as shown by the speed and flow contour plots in Fig. 12 (c) and (f).
In this simulation experiment, the prediction model of the MPC controller is the same as the data generation model in the proposed RL-based VSL control approach. However, the performance of the MPC controller is restricted by the accuracy of the prediction model, as evidenced by the above example. In contrast, the performance of the proposed VSL control approach is not restricted by the accuracy of that model, because the explored actions produced by the data generation model are evaluated in the online process (i.e., the reality), and the actions that lead to worse traffic performance are discarded.
Figure 12: Comparison between the MPC approach and the proposed VSL control approach in an example. In this example, (a) and (d) are the
simulated speed (km/h) and flow (veh/h) contour plots without VSL control; (b) and (e) are the simulation results under the MPC control approach;
(c) and (f) are the simulation results under the proposed VSL control approach. In (b) and (c), the VSL-controlled areas are enclosed by black lines.
5.4. The exploration and learning costs
This section compares the proposed approach with an existing RL-based VSL control approach that uses random exploration, in terms of exploration and learning costs. The RL model with random exploration is trained directly in the online simulation environment using the DDQN algorithm. The performance curves of the random exploration approach are shown in Fig. 13. We use data from the first 10000 simulation runs to evaluate the exploration and learning costs, as the performance of the random exploration approach stabilizes after 10000 simulation runs. For the proposed approach, data from all the online simulations are used for the evaluation.

The exploration cost is represented by the performance in terms of travel delay. During the first 10000 simulation runs, the average total travel delay of the random exploration approach is 168.5 h, and in 32.6% of the online simulation runs the VSLs lead to a worse traffic performance, i.e., they increase the total travel delay. For the proposed approach, the average travel delay during the training phase is 142.2 h, and only in 17.9% of the simulation runs do the VSLs lead to a worse traffic performance. Furthermore, for the random exploration approach, the average total travel delay saving after 10000 simulation runs is 28.1%, whereas the proposed approach achieves a comparable performance using fewer than 200 simulation runs, as shown in Fig. 10. Therefore, the exploration cost of the proposed approach is much lower than that of the random exploration approach.
Figure 13: Performance curves of the random exploration approach in the online training.
5.5. Comparison with an existing RL approach
In this section, we compare the proposed approach with an existing RL-based VSL control approach with zero-shot transfer, described as scenario 5 in Section 4. Specifically, the same extended CTM is used as the training environment. An existing deep reinforcement learning algorithm, namely Double DQN (DDQN, Van Hasselt et al. (2016)), is applied as the training algorithm. DDQN has been successfully applied to RL-based traffic signal control systems in multiple studies, such as Zeng et al. (2018); Liang et al. (2019). During the training process, the RL agent receives states and rewards from the environment while the environment implements the actions taken by the agent. After training, the optimal policy is directly transferred to the online simulations.

In the training environment, the settings of the traffic demand and the model parameters are the same as those in Section 4.2. The state, action, and reward are the same as those defined in Sections 2.1 and 4.5. The RL model is trained using data from 20000 offline simulation runs. Fig. 14 shows the performance curves of the training. The control policy at the end of the training is implemented in the online simulations for 100 runs. It reduces the average total travel delay by 22.4%, which is not as good as the proposed control approach.
Figure 14: Performance curves of the training with DDQN in the extended CTM environment.
Fig. 15 shows an example that highlights the comparison between the proposed approach and the DDQN-based approach. In this example, the proposed approach successfully resolves the jam wave, but the DDQN-based VSL control approach fails. The DDQN-based approach chooses the speed limit value 60 km/h at the beginning of the VSL activation, as shown in Fig. 15 (b) and (j). As time advances, although the upstream end of the VSL-controlled area nearly reaches the upstream boundary of the freeway stretch, the VSL control still cannot create a transition flow that is sufficiently low to fully resolve the jam, as shown in Fig. 15 (e). In comparison, the proposed approach chooses the speed limit value 50 km/h at the beginning of the VSL activation, so the created transition flow is sufficiently low to resolve the jam wave, as shown in Fig. 15 (a), (d), and (i).
The performance of the DDQN-based VSL control approach is also tested in the training environment, i.e., the extended CTM, using the same traffic demand as in the aforementioned example. In the training environment, the DDQN-based VSL control approach successfully resolves the jam wave and achieves a higher downstream throughput, as shown in Fig. 15 (c) and (f). The different performances in the offline training environment and the online simulation indicate that the DDQN-based approach is affected by the model mismatch, i.e., the difference between the training environment and the online simulation. Even though a well-trained RL strategy performs well in the training environment, it is not guaranteed that the strategy will perform equally well in the real traffic process, where there is always a mismatch.
Figure 15: Comparison between the proposed VSL control approach and the DDQN-based approach in an example. In this example, (a) and (d) are the simulated speed (km/h) and flow (veh/h) contour plots under the proposed approach; (b) and (e) are the simulation results under the DDQN-based approach; (c) and (f) are the simulation results under the DDQN-based approach in the training environment. (i)-(k) are the corresponding VSL profiles. In (i), the speed limit is chosen as 50 km/h, while in (j), the speed limit is chosen as 60 km/h.
5.6. DDQN with continual online learning
This section presents the results of scenario 6, the DDQN with continual online learning. Two sub-scenarios are tested. In sub-scenario 1, it is assumed that there is no online exploration after the offline optimal policy is transferred to the online environment; the DDQN therefore adopts the greedy policy to update its parameters. In sub-scenario 2, it is assumed that online exploration continues after the offline optimal policy is transferred; the DDQN adopts the ε-greedy policy to update its parameters. The performance curves of both sub-scenarios are shown in Fig. 16.
Figure 16: Performance curves of the DDQN with continual learning.
Across the two sub-scenarios, the DDQN with the ε-greedy policy reduces the total travel delay substantially more than the DDQN with the greedy policy. To quantify the learning cost, we use the average delay of the proposed method over the entire training period as a benchmark. For the DDQN with the ε-greedy policy, the average travel delay during the first 2000 simulation runs is 155.9 h, which is 9.6% higher than the average of the proposed method, shown as the red lines in Fig. 16. The DDQN eventually achieves a performance similar to that of the proposed method, but at a significantly higher learning cost during the online process.
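The difference between the two update policies reduces to the action-selection rule sketched below. This is our illustration; the paper does not report the value of ε.

```python
import numpy as np

rng = np.random.default_rng()

# Greedy vs. epsilon-greedy action selection over the Q-values of the
# transferred network: greedy always exploits, epsilon-greedy keeps
# exploring online with probability epsilon (value assumed here).
def select_action(q_values, epsilon=0.1):
    if rng.random() < epsilon:                 # epsilon-greedy: explore
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))            # greedy: exploit

# Sub-scenario 1 corresponds to epsilon = 0 (pure exploitation);
# sub-scenario 2 keeps epsilon > 0 during online learning.
```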
6. Discussion and conclusions
Reinforcement learning has attracted extensive attention in the traffic control area. Most existing RL-based traffic control approaches explore control actions randomly, which may induce high exploration and learning costs. For those approaches, the RL learning cannot be based purely on real-world exploration. Furthermore, the training process with random exploration may require a large amount of training data, which may not be feasible to collect because the speed of data collection in the real world is restricted by the "slowness" of the traffic process. Therefore, to date most existing RL-based traffic control approaches train their RL models solely using traffic simulators. However, the mismatch between the training simulators and the real traffic process affects the performance of those approaches.
In this paper we have proposed a new reinforcement learning-based VSL control approach to resolve freeway jam waves. The proposed VSL control approach applies an iterative training framework, in which the optimal control policy is updated by exploring new control actions both online and offline in each iteration. The offline/online exploration method largely prevents poor control actions from being explored in the real traffic process, thereby reducing the exploration and learning costs. The explored control actions are evaluated in the real traffic process. Thus the proposed approach avoids letting the RL model learn only from a traffic simulator, and it alleviates the impact of the model mismatch by replacing knowledge from the model with knowledge from the real process.
The proposed VSL control approach has been tested using a macroscopic traffic simulation model, namely METANET, which represents the real-world traffic flow dynamics. The simulation results have shown that the RL controller decreases the total travel delay further as more control actions are explored and more training data are fed into the RL. The proposed approach has also been compared with several existing VSL control approaches to demonstrate its advantages. Owing to the alleviation of model mismatch errors, the proposed approach performed better in reducing travel delays than SPECIALIST, the MPC-based approach, and the approach based on an existing RL method. Its advantage in reducing the exploration and learning costs has been demonstrated by the comparison with an existing RL-based approach with random exploration.
Although the proposed approach has been demonstrated to alleviate the impact of the model mismatch, it is not guaranteed to lead to a system-optimal performance. In the proposed method, actions are mainly explored in a smaller space created from the offline model rather than in the entire action space. Therefore, the policy of the RL can be suboptimal if the optimal control actions lie outside the exploration space. In future research, we will further investigate whether there are better training methods that can incorporate random online exploration and lead to a system-optimal performance.
The proposed VSL control approach is designed to resolve freeway jam waves based on the VSL control mechanism against jam waves. In future research, we will extend the proposed approach to address infrastructural bottlenecks such as on-ramp and lane-drop bottlenecks. The test bed will also be extended to larger freeway networks. Other methods that can deal more efficiently with the scarcity of real data in RL-based traffic control problems will also be investigated.
Acknowledgement
This research is jointly supported by the National Natural Science Foundation of China (No.52002065, No.52131203),
and the Natural Science Foundation of Jiangsu (No.BK20200378).
Appendix A. METANET model
In the METANET model, the following equations describe the evolution of the freeway traffic dynamics over time. The outflow of each cell is equal to the density times the mean speed and the number of lanes of that cell (represented by λ_i):

\[
q_i(t) = \rho_i(t)\, v_i(t)\, \lambda_i, \tag{A.1}
\]

The density of a cell follows the vehicle conservation law, which is represented as:

\[
\rho_i(t+1) = \rho_i(t) + \frac{T_s}{l_i \lambda_i} \left( q_{i-1}(t) - q_i(t) \right), \tag{A.2}
\]

where l_i is the length of cell i. The mean speed of cell i at time step t+1, v_i(t+1), depends on the mean speed at time step t, the speed of the inflowing vehicles, and the downstream density. Specifically,

\[
v_i(t+1) = v_i(t) + \frac{T_s}{\tau_M}\big( V(\rho_i(t)) - v_i(t) \big) + \frac{T_s}{l_i} v_i(t)\big( v_{i-1}(t) - v_i(t) \big) - \frac{\vartheta T_s}{\tau_M l_i} \frac{\rho_{i+1}(t) - \rho_i(t)}{\rho_i(t) + \kappa}, \tag{A.3}
\]

where τ_M, ϑ, and κ are model parameters. In the experiment, τ_M is set to 18 s, κ to 40 veh/km/lane, and ϑ to 30 km²/h.
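For reference, a compact sketch of one METANET update step implementing Eqs. (A.1)-(A.3) together with Eq. (15) is given below. The step length T_s and the boundary handling (fixed upstream demand, free downstream outflow) are our assumptions; the cell length and lane count follow the 7.5 km, 25-cell, three-lane stretch of Section 4.2.

```python
import numpy as np

# One METANET update step on a homogeneous stretch (illustrative sketch).
def metanet_step(rho, v, v_limit, q_in,
                 T_s=10/3600, l=0.3, lam=3,
                 v_f=108.0, rho_cr=27.6, a_m=2.5,
                 tau=18/3600, kappa=40.0, theta=30.0):
    """rho, v: arrays of cell densities (veh/km/lane) and speeds (km/h);
    q_in: upstream demand (veh/h); time constants in hours, lengths in km."""
    q = rho * v * lam                                    # (A.1), veh/h
    q_up = np.concatenate(([q_in], q[:-1]))              # inflow of each cell
    rho_next = rho + T_s / (l * lam) * (q_up - q)        # (A.2)

    V_des = np.minimum(v_limit,
                       v_f * np.exp(-(1/a_m) * (rho/rho_cr)**a_m))  # Eq. (15)
    v_up = np.concatenate(([v[0]], v[:-1]))              # assumed inflow speed
    rho_down = np.concatenate((rho[1:], [rho[-1]]))      # assumed free outflow
    v_next = (v + T_s/tau * (V_des - v)                  # relaxation term
              + T_s/l * v * (v_up - v)                   # convection term
              - theta*T_s/(tau*l) * (rho_down - rho)/(rho + kappa))  # (A.3)
    return np.maximum(rho_next, 0.0), np.clip(v_next, 0.0, None)
```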
Appendix B. SPECIALIST
There are multiple tuning parameters for the SPECIALIST algorithm, which correspond to the traffic states in Fig. 3. The control scheme can be constructed given the measured and calculated traffic states 1-6. The densities, speeds, and flows of the six states are denoted as ρ[j], v[j], q[j], j ∈ {1, ..., 6}. In the experiments, these parameters are determined using the same method as in Hegyi and Hoogendoorn (2010). One of the most important tuning parameters is the density associated with state 4. The speed of state 4 is determined by the speed limits; the choice of the density, however, is a design variable that influences the shape of the control scheme. Based on trial-and-error tuning, ρ[4] is set to 30 veh/km/lane, and ρ[5] and q[5] are set to 27 veh/km/lane and 2000 veh/h/lane, respectively.

After the construction of the control scheme, the resolvability is assessed. If the constructed control scheme satisfies certain conditions, the jam wave is considered resolvable and the control scheme is applied. These conditions include: (i) the heads and tails of areas 2 and 4 should converge; (ii) the speed of area 6 should be higher than the speed limits; and (iii) the necessary length of the speed-limited stretch is smaller than the available upstream free-flow area. In the experiment, it is assumed that SPECIALIST can choose one speed limit value from 50 km/h and 60 km/h; if both values satisfy the resolvability conditions, the higher value, 60 km/h, is chosen. The VSL control is activated at minute 35, when the jam wave has already formed.
References

Arel, I., Liu, C., Urbanik, T., and Kohls, A. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.
Belletti, F., Haziza, D., Gomes, G., and Bayen, A. M. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 19(4):1198–1207, 2017.
Carlson, R. C., Papamichail, I., Papageorgiou, M., and Messmer, A. Optimal motorway traffic flow control involving variable speed limits and ramp metering. Transportation Science, 44(2):238–253, 2010a.
Carlson, R. C., Papamichail, I., Papageorgiou, M., and Messmer, A. Optimal mainstream traffic flow control of large-scale motorway networks. Transportation Research Part C: Emerging Technologies, 18(2):193–212, 2010b.
Carlson, R. C., Papamichail, I., and Papageorgiou, M. Local feedback-based mainstream traffic flow control on motorways using variable speed limits. IEEE Transactions on Intelligent Transportation Systems, 12(4):1261–1276, 2011.
Carlson, R. C., Papamichail, I., and Papageorgiou, M. Integrated feedback ramp metering and mainstream traffic flow control on motorways using variable speed limits. Transportation Research Part C: Emerging Technologies, 46:209–221, 2014.
Chen, D. and Ahn, S. Variable speed limit control for severe non-recurrent freeway bottlenecks. Transportation Research Part C: Emerging Technologies, 51:210–230, 2015.
Chen, D., Ahn, S., and Hegyi, A. Variable speed limit control for steady and oscillatory queues at fixed freeway bottlenecks. Transportation Research Part B: Methodological, 70:340–358, 2014.
Davarynejad, M., Hegyi, A., Vrancken, J., and van den Berg, J. Motorway ramp-metering control with queuing consideration using Q-learning. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1652–1658. IEEE, 2011.
El-Tantawy, S., Abdulhai, B., and Abdelgawad, H. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3):1140–1150, 2013.
Frejo, J. R. D., Núñez, A., De Schutter, B., and Camacho, E. F. Hybrid model predictive control for freeway traffic using discrete speed limit signals. Transportation Research Part C: Emerging Technologies, 46:309–325, 2014.
Frejo, J. R. D., Papamichail, I., Papageorgiou, M., and De Schutter, B. Macroscopic modeling of variable speed limits on freeways. Transportation Research Part C: Emerging Technologies, 100:15–33, 2019.
Frejo, J. R. D. and Camacho, E. F. Global versus local MPC algorithms in freeway traffic control with ramp metering and variable speed limits. IEEE Transactions on Intelligent Transportation Systems, 13(4):1556–1565, 2012.
Hadiuzzaman, M. and Qiu, T. Z. Cell transmission model based variable speed limit control for freeways. Canadian Journal of Civil Engineering, 40(1):46–56, 2013.
Hadiuzzaman, M., Qiu, T. Z., and Lu, X.-Y. Variable speed limit control design for relieving congestion caused by active bottlenecks. Journal of Transportation Engineering, 139(4):358–370, 2013.
Han, Y., Yuan, Y., Hegyi, A., and Hoogendoorn, S. P. New extended discrete first-order model to reproduce propagation of jam waves. Transportation Research Record: Journal of the Transportation Research Board, (2560):108–118, 2016.
Han, Y., Hegyi, A., Yuan, Y., and Hoogendoorn, S. Validation of an extended discrete first-order model with variable speed limits. Transportation Research Part C: Emerging Technologies, 83:1–17, 2017a.
Han, Y., Hegyi, A., Yuan, Y., Hoogendoorn, S., Papageorgiou, M., and Roncoli, C. Resolving freeway jam waves by discrete first-order model-based predictive control of variable speed limits. Transportation Research Part C: Emerging Technologies, 77:405–420, 2017b.
Han, Y., Wang, M., He, Z., Li, Z., Wang, H., and Liu, P. A linear Lagrangian model predictive controller of macro- and micro-variable speed limits to eliminate freeway jam waves. Transportation Research Part C: Emerging Technologies, 128:103–121, 2021.
Han, Y., Wang, M., Li, L., Roncoli, C., Gao, J., and Liu, P. A physics-informed reinforcement learning-based strategy for local and coordinated ramp metering. Transportation Research Part C: Emerging Technologies, 137:103584, 2022.
Hegyi, A. and Hoogendoorn, S. Dynamic speed limit control to resolve shock waves on freeways - field test results of the SPECIALIST algorithm. In 2010 International IEEE Conference on Intelligent Transportation Systems, pages 519–524. IEEE, 2010.
Hegyi, A., Hoogendoorn, S., Schreuder, M., Stoelhorst, H., and Viti, F. SPECIALIST: A dynamic speed limit control algorithm based on shock wave theory. In 2008 International IEEE Conference on Intelligent Transportation Systems, pages 827–832. IEEE, 2008.
Hegyi, A., De Schutter, B., and Hellendoorn, H. Model predictive control for optimal coordination of ramp metering and variable speed limits. Transportation Research Part C: Emerging Technologies, 13(3):185–209, 2005a.
Hegyi, A., De Schutter, B., and Hellendoorn, J. Optimal coordination of variable speed limits to suppress shock waves. IEEE Transactions on Intelligent Transportation Systems, 6(1):102–112, 2005b.
Kerner, B. S. Empirical macroscopic features of spatial-temporal traffic patterns at highway bottlenecks. Physical Review E, 65(4):046138, 2002.
Kerner, B. S. and Rehborn, H. Experimental features and characteristics of traffic jams. Physical Review E, 53(2):R1297, 1996.
Kotsialos, A., Papageorgiou, M., and Messmer, A. Optimal coordinated and integrated motorway network traffic control. In 14th International Symposium on Transportation and Traffic Theory, 1999.
Kotsialos, A., Papageorgiou, M., Diakaki, C., Pavlis, Y., and Middelham, F. Traffic flow modeling of large-scale motorway networks using the macroscopic modeling tool METANET. IEEE Transactions on Intelligent Transportation Systems, 3(4):282–292, 2002a.
Kotsialos, A., Papageorgiou, M., Mangeas, M., and Haj-Salem, H. Coordinated and integrated control of motorway networks via non-linear optimal control. Transportation Research Part C: Emerging Technologies, 10(1):65–84, 2002b.
Li, L., Lv, Y., and Wang, F.-Y. Traffic signal timing via deep reinforcement learning. IEEE/CAA Journal of Automatica Sinica, 3(3):247–254, 2016.
Li, Z., Liu, P., Xu, C., Duan, H., and Wang, W. Reinforcement learning-based variable speed limit control strategy to reduce traffic congestion at freeway recurrent bottlenecks. IEEE Transactions on Intelligent Transportation Systems, 18(11):3204–3217, 2017.
Liang, X., Du, X., Wang, G., and Han, Z. A deep Q learning network for traffic lights' cycle control in vehicular networks. IEEE Transactions on Vehicular Technology, 68(2):1243–1253, 2019.
Lighthill, M. J. and Whitham, G. B. On kinematic waves. II. A theory of traffic flow on long crowded roads. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 229, pages 317–345. The Royal Society, 1955.
Lu, X.-Y., Qiu, T. Z., Varaiya, P., Horowitz, R., and Shladover, S. E. Combining variable speed limits with ramp metering for freeway traffic control. In Proceedings of the 2010 American Control Conference, pages 2266–2271. IEEE, 2010.
Lu, X.-Y., Shladover, S. E., Jawad, I., Jagannathan, R., and Phillips, T. Novel algorithm for variable speed limits and advisories for a freeway corridor with multiple bottlenecks. Transportation Research Record, 2489(1):86–96, 2015.
Messmer, A. and Papageorgiou, M. METANET: A macroscopic simulation program for motorway networks. Traffic Engineering & Control, 31(9), 1990.
Muralidharan, A. and Horowitz, R. Computationally efficient model predictive control of freeway networks. Transportation Research Part C: Emerging Technologies, 2015.
Ozan, C., Baskan, O., Haldenbilen, S., and Ceylan, H. A modified reinforcement learning algorithm for solving coordinated signalized networks. Transportation Research Part C: Emerging Technologies, 54:40–55, 2015.
Papageorgiou, M. Some remarks on macroscopic traffic flow modelling. Transportation Research Part A: Policy and Practice, 32(5):323–329, 1998.
Papageorgiou, M., Hadj-Salem, H., and Blosseville, J.-M. ALINEA: A local feedback control law for on-ramp metering. Transportation Research Record, 1320(1):58–67, 1991.
Papageorgiou, M., Kosmatopoulos, E., and Papamichail, I. Effects of variable speed limits on motorway traffic flow. Transportation Research Record: Journal of the Transportation Research Board, (2047):37–48, 2008.
Prashanth, L. and Bhatnagar, S. Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2):412–421, 2010.
Richards, P. I. Shock waves on the highway. Operations Research, 4(1):42–51, 1956.
Roncoli, C., Papageorgiou, M., and Papamichail, I. Traffic flow optimisation in presence of vehicle automation and communication systems - part II: Optimal control for multi-lane motorways. Transportation Research Part C: Emerging Technologies, 57:260–275, 2015.
Schmidt-Dumont, T. and van Vuuren, J. H. Decentralised reinforcement learning for ramp metering and variable speed limits on highways. IEEE Transactions on Intelligent Transportation Systems, 14(8):1, 2015.
Schönhof, M. and Helbing, D. Empirical features of congested traffic states and their implications for traffic modeling. Transportation Science, 41(2):135–166, 2007.
Soriguera, F., Martínez, I., Sala, M., and Menéndez, M. Effects of low speed limits on freeway traffic flow. Transportation Research Part C: Emerging Technologies, 77:257–274, 2017.
Spiliopoulou, A., Kontorinaki, M., Papageorgiou, M., and Kopelias, P. Macroscopic traffic flow model validation at congested freeway off-ramp areas. Transportation Research Part C: Emerging Technologies, 41:18–29, 2014.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.
Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
Wang, Y., Yu, X., Zhang, S., Zheng, P., Guo, J., Zhang, L., Hu, S., Cheng, S., and Wei, H. Freeway traffic control in presence of capacity drop. IEEE Transactions on Intelligent Transportation Systems, 22(3):1497–1516, 2020.
Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
Wu, Y., Tan, H., Qin, L., and Ran, B. Differential variable speed limits control for freeway recurrent bottlenecks via deep actor-critic algorithm. Transportation Research Part C: Emerging Technologies, 117:102649, 2020.
Yuan, K., Knoop, V. L., and Hoogendoorn, S. P. Capacity drop: Relationship between speed in congestion and the queue discharge rate. Transportation Research Record, 2491(1):72–80, 2015.
Zeng, J., Hu, J., and Zhang, Y. Adaptive traffic signal control with deep recurrent Q-learning. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1215–1220. IEEE, 2018.
Zhang, Y. and Ioannou, P. A. Combined variable speed limit and lane change control for highway traffic. IEEE Transactions on Intelligent Transportation Systems, 18(7):1812–1823, 2016.
Zhang, Y. and Ioannou, P. A. Stability analysis and variable speed limit control of a traffic flow model. Transportation Research Part B: Methodological, 118:31–65, 2018.