A DEEP REINFORCEMENT LEARNING-BASED RAMP METERING CONTROL
FRAMEWORK FOR IMPROVING TRAFFIC OPERATION AT FREEWAY WEAVING
SECTIONS
Mofeng Yang
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: yangmofeng@seu.edu.cn
Zhibin Li, Ph.D., Corresponding Author
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: lizhibin@seu.edu.cn
Zemian Ke
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: kezemian@seu.edu.cn
Meng Li
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: seulimeng@163.com
Word count: 6,581 words text + 3 tables × 250 words (each) = 7,331 words
Submission Date: August 1, 2018
Yang, Li, Ke and Li 2
ABSTRACT
Ramp metering (RM) dynamically adjusts the ramp flow merging into the freeway mainline according to
real-time traffic conditions to improve traffic operation. The effectiveness of RM is mainly
determined by its control strategies, which decide how to calculate the metering flow for various traffic states.
Traditional RM control strategies are limited by their response speed to traffic changes and
their online computation workload. They also require substantial human knowledge about
the traffic flow problems in the study segments. In this study, we propose a deep reinforcement
learning-based RM control framework, named DQN-based RM. This new
control framework incorporates the deep Q network (DQN) algorithm and the RM in order to
reduce total travel time on freeways. A typical freeway weaving bottleneck section was simulated
based on the Simulation of Urban Mobility (SUMO) platform. The results show that the proposed
DQN-based RM strategy is able to respond proactively to different traffic states and take
immediate and correct actions to prevent traffic breakdown, without full prior knowledge of traffic
flow theories. The DQN-based RM could reach the optimal control target within a short training
time, and the total travel time was reduced by 51.48% and 50.58% with 15 s and 30 s as the control
cycle. We also compare the performances of various RM strategies. The results show that the
DQN-based RM outperforms the traditional fixed-time and the feedback-based RM control
strategies in mitigating congestions and reducing travel time on freeways.
Keywords: Ramp metering, Deep reinforcement learning, Congestion, Freeway weaving sections,
Bottleneck
INTRODUCTION
The continuous increase of travel demand and flow on freeways causes frequent and severe
congestion, which greatly increases total travel time, delays, crash risks and fuel
consumption (1-4). Freeway weaving sections are typical bottlenecks, usually formed by
a merge area closely followed by a diverge area (5, 6). Within the weaving sections, vehicles
that enter and exit the freeway perform intense lane changes to access the target lanes, which are
the main cause of traffic congestion and traffic breakdown (7). The traffic operation of freeway
weaving bottlenecks should be improved by properly controlling the traffic flow (8).
Ramp metering is the most commonly used active traffic management (ATM) strategy that
controls the number of ramp vehicles merging into the mainline by installing traffic lights and a
set of loop detectors on mainlines and ramps (9-11). The main objective is to set an appropriate
ramp flow release rate in order to minimize the negative impact of ramp flow disturbance on
mainline traffic to prevent the occurrences of traffic breakdown and capacity drop. The general
framework in previous studies is shown in Figure 1 (a). First, a study site containing
recurrent bottlenecks is selected; second, traffic data at loop detector stations are collected. Then, traffic
flow features as well as key congestion attributes (e.g., occupancy thresholds for predicting
capacity drop) are analyzed based on traffic flow theories. Finally, the key parameters in the RM
control strategies are decided.
Early RM studies and practice tend to use the fixed-time control strategies, which decide
the metering rate by setting the control cycle, green light phase and red light phase (10). Such
strategies are easy to implement, but the major drawback is that the fixed parameters fail in
responding to the changing traffic states, which greatly reduce the strategy performances. Later,
the feedback-based RM algorithms were proposed, which were traffic responsive in nature. The
most famous feedback strategy, named the Asservissement Linéaire d’Entrée Autoroutière
(ALINEA) (12), and its variants (13-16), aim at adjusting the bottleneck occupancy at the expected
value to maintain the maximum flow and prevent traffic breakdown. Such approaches have already
been implemented in many practical uses (17, 18). However, the feedback-based RM adjusts the
metering rate passively according to the traffic conditions. The performance is greatly limited by
the feedback nature especially in the fast-changing traffic environments. In addition, the set of
control parameters in the feedback controller rely on human prior knowledge about the capacity
drop and the breakdown probabilities.
Some research used the online optimization methods, such as the model predictive control,
to calculate the metering rate by solving on-line optimal control problems (19-22). Though such
approaches can theoretically obtain the mathematical optimal solutions of the RM control problem,
they require accurate models to predict the traffic dynamics and contain large online computing
workload, which is considered infeasible for large-scale applications. Some researchers proposed
that the traffic flow theory-based RM control strategy improved traffic operation by forming a
free-flow pocket in bottleneck traffic flow (1, 23-25). The strategy was applied and tested on the
I-805 and I-5 freeway section in California and it was found to outperform the feedback-based
strategy in their case study (1). However, such approaches require deep understanding about the
traffic flow characteristics and congestion mechanisms at the target bottlenecks. The obtained
strategies may be site-specific and may not work well when applied on other freeway bottlenecks.
FIGURE 1 (a) General framework for RM studies; (b) DQN-based RM control framework.
In this study, a deep reinforcement learning (RL)-based RM control framework was
proposed for reducing total travel time at freeway bottleneck areas, as shown in Figure 1 (b). The
framework directly bridges up the correlation between the traffic flow data and the RM control
strategies. The RL enables the agent to obtain optimal strategy in an unfamiliar environment
without prior knowledge on traffic flow. In other words, the RL-based RM control strategy does
not require complex analyses based on traffic flow theories and thus reduces the need for
human knowledge. Considering that the complexity and mechanisms of some traffic issues, such as
capacity drop and traffic breakdown, are still not fully explored and understood by researchers,
previous RM strategies have clear cognitive limitations, and our framework has the potential to
lead to improved performance.
Previous researchers have incorporated the Q-learning (QL) (26), which is a basic RL
algorithm (27-29), into the RM control tasks (8, 30-35). Though the training process is extensive,
the QL-based RM can obtain the best metering actions for various traffic states. The effects of the
QL-based RM control strategies are widely reported in the previous studies. However, the QL uses
discretized traffic state sets with large intervals which may not accurately reflect the traffic
conditions. In addition, the QL’s performance is greatly limited by the storage ability of the Q-
table which is used to determine the optimal control actions in the iteration process.
Deep Q learning (DQN) is a combined use of the QL and the deep neural network (DNN)
(36, 37). The DeepMind team has put forward a number of successful applications built on the DQN
and its successors, such as Atari game agents and the AlphaGo and AlphaZero Go programs (36-39). Compared to the
traditional QL, the DQN can deal with more complex tasks with large and continuous state space
by adding the DNN in the process of the agent’s learning structure. Some recent studies have
applied the DQN to solve intersection signal control problems (40-43). They reported that the
DQN finds appropriate and stable signal timing policies, resulting in reductions in total travel time
and collision risks.
A recent study proposed a multi-task deep RL-based RM that achieved a control
performance on par with ALINEA (44). However, that study focused on large-scale freeway
sections with multiple on-ramps and did not pay particular attention to the specific freeway
bottlenecks where the occurrence of capacity drop is the main reason for the low efficiency of
traffic flow.
The literature review shows that none of the previous studies has incorporated the DQN
algorithm with the RM control to reduce total travel time and to improve traffic operation at
freeway bottleneck areas. This study aims to fill the gap. A typical bottleneck is simulated in the
Simulation of Urban Mobility (SUMO). A training procedure for the DQN agent is proposed to
obtain the optimal actions. The effects of the DQN-based RM strategy are evaluated and compared
with some traditional RM control strategies.
METHODOLOGY
The RL approach was initially inspired by behaviorist psychology, which considers how an agent
ought to take actions in an environment to maximize the cumulative reward (27-29). An RL agent
interacts with its environment in discrete time steps, which is typically formulated as a Markov
decision process (MDP). The RL requires the determination of a metering flow control action for
the current traffic state at each decision interval. After the agent has taken a control action, the
current traffic state changes into a new state. The state transition can be evaluated by the reward
function. Therefore, the RM control problem can be formulated as a typical MDP problem and can
be resolved by the RL technique.
Deep Q Network
In a QL, the Q-value is assigned to each state-action pair to evaluate the quality of the action. The
set of Q-values can be represented as:
Q: S × A → R    (1)
where S is the set of possible states, A is the set of possible actions, and R is the set of
rewards. In an infinite horizon discounted reward problem, the agent’s goal is to maximize
Σ_{t=0}^{∞} γ^t · R_t    (2)
where R_t is the reward at time step t, and γ is the discount factor (0 ≤ γ ≤ 1) that defines the relative
importance of current rewards versus those earned later. For a non-deterministic
environment, the Q-value is updated with every new training sample according to
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + κ(s_t, a_t) · [R_{t+1} + γ · max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]    (3)
where Q_{t+1}(s_t, a_t) is the Q-value for the state-action pair (s_t, a_t) at time step t+1, R_{t+1} is the
reward received after performing action a_t at state s_t and moving to the new state s_{t+1}, and κ(s_t,
a_t) is the learning rate which controls how fast the Q-values are altered.
The Q-table is updated in the training process. The Q-values will converge if each state-
action pair is executed several times, and the optimal action for each state is determined by the
action with the largest Q-value. Then, the QL agent can be used for the optimal control according
to the knowledge it obtained in the training process.
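The tabular update of Eq. (3) can be sketched in a few lines of Python; the state names, reward value, and two-action set (0 = green, 1 = red) below are illustrative, not taken from the paper:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, kappa=0.1, gamma=0.99):
    """One tabular Q-learning update following Eq. (3):
    Q(s,a) <- Q(s,a) + kappa * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += kappa * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical two-action metering example (0 = green, 1 = red):
Q = defaultdict(float)                       # unvisited pairs default to 0
actions = (0, 1)
new_q = q_update(Q, s="state_A", a=0, r=5.0, s_next="state_B", actions=actions)
```

Starting from an all-zero table, one update with reward 5.0 moves Q(state_A, green) to kappa × 5.0 = 0.5, illustrating how Q-values accumulate toward the discounted return.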
The Q-table based approach is effective and efficient if the state-action set is not large.
However, in practical applications, the state-action space is usually very large, resulting in the size
of the lookup table growing exponentially. The algorithm is hard to converge, because all state-
action pairs need to be visited multiple times according to the inherent logic of QL. In addition,
limited by the size of the lookup table, the number of variables for traffic state representation is
restricted, which makes the QL incapable of handling more complicated traffic control tasks. The
drawbacks associated with the QL greatly limit its performance and applications.
The DQN incorporates a DNN in the QL and relaxes the limitation of the state size (36,
37). In a DQN, the DNN takes the state as input and maps it to Q-values,
allowing for a larger or continuous state space. There are two common ways to stabilize the
DQN, the target network freezing and the prioritized experience replay (45, 46). The target
network freezing splits the Q-value estimation into two different networks, i.e., a value network to
estimate the Q-value of the current state and a target network to compute the targets. By selecting
an appropriate freezing interval, the targets can be partially stabilized. Experience replay is a way
to sample a batch of past experiences during the training process. In the DQN with prioritized
experience replay (46), the Temporal-Difference (TD) error is used to preferentially sample experiences
with larger expected learning progress.
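The target-network idea can be illustrated with a minimal numpy sketch; the linear Q-function and the weight shapes here are assumptions purely for illustration, not the paper's network:

```python
import numpy as np

def td_targets(rewards, next_states, w_target, gamma=0.99):
    """TD targets r + gamma * max_a Q_target(s', a), computed with the
    frozen target network; here Q(s) = s @ W is a toy linear Q-function."""
    q_next = next_states @ w_target            # shape: (batch, n_actions)
    return rewards + gamma * q_next.max(axis=1)

# The online (value) network is trained every step; the target network is a
# frozen copy that is re-synced only every `freeze_interval` steps:
rng = np.random.default_rng(0)
w_value = rng.normal(size=(4, 2))
w_target = w_value.copy()   # later, every freeze_interval steps: w_target = w_value.copy()
```

Keeping w_target fixed between syncs is what partially stabilizes the regression targets, as described above.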
The key idea is that the transition with higher expected learning progress measured by
absolute TD-error is more likely to be replayed. The priority of transition t is calculated as
p_t = |δ_t| + π    (4)
where δ_t is the TD-error, and π is a small positive constant that prevents transitions from never
being sampled again once their TD-error is zero. The TD-error is calculated as

δ_t = R_t + γ · max_a Q(s_t, a) − Q(s_{t−1}, a_{t−1})    (5)
Then the probability of sampling transition t is determined as
P(t) = p_t^α / Σ_k p_k^α    (6)
where p_t > 0, the exponent α indicates the level of prioritization, and α = 0 corresponds to
uniform sampling.
Importance-Sampling (IS) weights are used to correct bias introduced by prioritized replay
(47). Hence, the parameter of the value network can be updated as
θ ← θ + η · Σ_{t=1}^{k} ω_t · δ_t · ∇_θ Q(s_t, a_t)    (7)
where k is the mini-batch size when sampling, and ω_t is the IS weight of sampled transition t,
calculated as

ω_t = (1/N · 1/P(t))^β    (8)
where N is the memory size, and β = 1 indicates that the non-uniform sampling probabilities are
fully compensated.
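The prioritization and importance-sampling formulas (Eqs. 4, 6 and 8) can be sketched as follows; the small constant eps stands in for the π of Eq. (4), and normalizing the weights by their maximum is a common stabilization choice not stated in the paper:

```python
import numpy as np

def sample_probs(td_errors, alpha=0.6, eps=1e-3):
    """Priorities p_t = |delta_t| + eps (cf. Eq. 4) and sampling
    probabilities P(t) = p_t^alpha / sum_k p_k^alpha (Eq. 6)."""
    p = np.abs(td_errors) + eps
    pa = p ** alpha
    return pa / pa.sum()

def is_weights(probs, n, beta=1.0):
    """Importance-sampling weights w_t = (1 / (N * P(t)))^beta (Eq. 8),
    normalized by their maximum for stability."""
    w = (1.0 / (n * probs)) ** beta
    return w / w.max()
```

Setting alpha=0 recovers uniform sampling, and beta=1 fully compensates the sampling bias, matching the limiting cases described above.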
In our study, the DQN with the prioritized experience replay and target network freezing
was used to support the proposed RM control strategy.
DQN-Based RM Control Strategy
A DQN-based RM control strategy was proposed in this section and the flowchart is shown in
Figure 2. The DQN-based RM consists of two parts, the DQN Agent and the simulation network.
These two parts exchange traffic states, rewards and actions. The DQN agent iterates toward
the optimal control strategy by continuously interacting with the simulation network. Details will
be discussed in the next section.
The DQN-based RM first perceives a particular traffic state and selects a traffic light color
at each decision interval. The traffic light leads the state transition. Then the feedback reward and
the transition of this state-action pair are stored in the replay memory of the DQN agent. The
DQN agent will evaluate the reward, learn and update the policy with a batch of sample memories
with a given learning value and learning rate. The crucial elements in the DQN-based RM are
designed as follows:
(1) State. The selected states should be able to help the agent perceive traffic situations in
the freeway segment. In previous QL-based RM research, states were discretized into several
intervals. However, discrete states are not able to accurately describe the dynamics of freeway
traffic. In our study, three traffic flow variables and one traffic light status were used to represent
the traffic state at a freeway weaving section. They were: the density at the immediate upstream
of the weaving area, the density at the weaving area, the density on the on-ramp, and the color of
the traffic light. The traffic flow variable values are kept to one decimal place and the light color is
represented by 0 (green) and 1 (red).
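As a minimal sketch, the four-element state described above might be encoded as follows; the function name and input units are illustrative:

```python
def encode_state(k_up, k_weave, k_ramp, light_is_red):
    """Build the four-element state vector: the three densities (upstream,
    weaving area, on-ramp), each kept to one decimal place, plus the
    traffic light color encoded as 0 (green) or 1 (red)."""
    return (round(k_up, 1), round(k_weave, 1), round(k_ramp, 1),
            1 if light_is_red else 0)

state = encode_state(34.27, 51.08, 12.5, light_is_red=False)
```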
(2) Action. In this study, the action set is set to be the traffic light color which includes two
choices, i.e. turning green or turning red. The control period should be set to ensure that the effect
(or reward) after the action taken can be perceived by the agent after the interval. In our study, we
tested two control periods for the DQN-based RM which were 15 s and 30 s. In other words, the
DQN-based RM was able to turn the traffic light green or red 15 s or 30 s after the previous
action.
(3) Reward. The objective of the DQN-based RM was to reduce the system total travel
time (TTT). The TTT over a time horizon K can be calculated by:
TTT = η · Σ_{k=1}^{K} N(k)    (9)
where N(k) is the total number of vehicles in the network at time k, and η is the time interval.
In the simulation, the number of vehicles entering the freeway system at each time step
from both upstream mainline and on-ramp was recorded. The number of vehicles leaving the
freeway system at downstream mainline and off-ramp was also recorded. Any control measures
that managed to increase the early exit flows of the freeway section would lead to a decrease in
the total travel time (1-3, 10, 24, 48). Thus, the reward function could be determined by the
discharge flow at the bottleneck, which is defined as:
R(s) = q(t)    (10)
where R(s) was the reward for state s, and q(t) was the bottleneck discharge flow within
interval t, which can be obtained by simply counting the number of vehicles that passed through
the bottleneck.
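Eqs. (9) and (10) amount to simple counting, which can be sketched as:

```python
def total_travel_time(vehicle_counts, eta=1.0):
    """Eq. (9): TTT = eta * sum_k N(k), where N(k) is the number of
    vehicles in the network at step k and eta is the step length."""
    return eta * sum(vehicle_counts)

def reward(passed_vehicle_ids):
    """Eq. (10): the reward is the bottleneck discharge flow, obtained by
    counting distinct vehicles that crossed the bottleneck in the interval."""
    return len(set(passed_vehicle_ids))
```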
(4) Learning Parameters. The DQN has many learning parameters and the selection of
their values greatly affects the performance. The learning rate decides the learning speed of the
agent at each time step. If the learning rate is too large, the gradient descent might be too fast and
the DQN may not find the optimal policy; if the learning rate is too small, the computation cost
and training time become excessive and the algorithm may be hard to converge. In previous studies,
the learning rate was typically set to be a small constant value to ensure that the optimal policy can
be found.
FIGURE 2 Flowchart of the DQN-based RM control strategy.
Another important consideration is to make a balance between exploitation and exploration
when selecting actions. The DQN should fully learn the information that has been presented in the
Q-values. Using pure exploitation may greatly save the learning time, but may also prohibit the
discovery of better actions and lead to local optimization. On the other hand, pure exploration
enhances the capability of discovering new and better actions. However, it may result in a random
action selection without making use of the existing learning results and, accordingly, is quite time
consuming. In our study, the exploitation and exploration were balanced by gradually decreasing
the exploration rate from 0.9 to 0.1 during the learning process. The DQN starts with a large
exploration rate. As the agent's knowledge matures, the exploration rate decreases accordingly.
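A linear annealing schedule is one way to realize the 0.9-to-0.1 decay described above; the paper states only the endpoints, so the linear form is an assumption:

```python
def exploration_rate(epoch, n_epochs, eps_start=0.9, eps_end=0.1):
    """Linearly anneal the exploration rate from eps_start to eps_end over
    the training epochs (clamped so late epochs stay at eps_end)."""
    frac = min(epoch / max(n_epochs - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```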
Traditional RM Control Strategies
(1) Fixed-time Control Strategies. The traffic light changes its color according to the pre-designed
cycle length and green/red phases. For the fixed-time RM control strategy, the ramp flow can be
calculated by:
r = λ_r · 1800 · G / c    (11)
where λr is the number of lanes on ramp, G is the green time in the signal cycle, and c is
the signal cycle. This strategy is able to maintain a stable metering rate for simple scenarios.
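Eq. (11) can be evaluated directly; the 1800 veh/h factor corresponds to releasing one vehicle per lane every 2 s of green:

```python
def fixed_time_rate(n_ramp_lanes, green_s, cycle_s):
    """Eq. (11): metered ramp flow r = lambda_r * 1800 * G / c (veh/h)."""
    return n_ramp_lanes * 1800.0 * green_s / cycle_s
```

For a one-lane ramp with a 2 s green phase in a 5 s cycle (one of the fixed-time settings tested later), this gives 720 veh/h.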
(2) Feedback-based Control Strategies. The most representative feedback-based RM strategy,
i.e. ALINEA, is considered in this study for comparison.
r(k) = r(k−1) + K_R · [ô − o_out(k)]    (12)
where r(k) is the metering rate at decision interval k, r(k-1) is the metering rate at decision
interval k−1, K_R > 0 is a regulator parameter, ô is the desired occupancy at the bottleneck, which is
usually set as the critical occupancy, and o_out(k) is the real-time measurement of occupancy at
decision interval k. The change in the metering rate is proportional to the difference between the
expected and measured downstream occupancy.
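A single ALINEA update step (Eq. 12) can be sketched as follows; the clipping bounds are our assumption, since practical deployments bound the metering rate, and occupancies are expressed in percent:

```python
def alinea_rate(r_prev, occ_measured, occ_desired=22.0, k_r=70.0,
                r_min=200.0, r_max=1800.0):
    """Eq. (12): r(k) = r(k-1) + K_R * (o_hat - o_out(k)), with the result
    clipped to practical bounds (bounds are illustrative, not from the paper)."""
    r = r_prev + k_r * (occ_desired - occ_measured)
    return max(r_min, min(r_max, r))
```

When the measured occupancy exceeds the desired value, the rate is reduced; when it falls below, the rate is increased, which is the feedback behavior described above.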
DEVELOPMENT OF SIMULATION PLATFORM
SUMO Simulation Platform
An open-source traffic simulation software, the Simulation of Urban MObility (SUMO), is used
as the simulation platform. Compared to macroscopic simulations such as Cell Transmission
Model, SUMO can capture fine vehicle movements and generate traffic flow data with higher
granularity. Compared to other microscopic simulators, SUMO provides numerous car-following
and lane-changing models that can meet distinct needs for different road segments and study
purposes. More importantly, SUMO gives users greater freedom for further and deeper
developments. It also enables communications and interactions with other software, such as
importing python packages.
SUMO includes two main modules: Netedit module and Sumoconfig module. The Netedit
module enables users to define the network information, the traffic dynamic information, and flow
demand. Sumoconfig module enables users to define simulation information, such as simulation
period, waiting time, etc. SUMO also provides an Application Programing Interface (API), called
TraCI. Users are allowed to retrieve the real-time detector data, or change the state of the network
elements (detectors, traffic control etc.) by calling the TraCI.
In our study, an interactive simulation-control system was established, as shown in Figure
3. The network information was edited and stored within the Netedit module and was incorporated
into the Sumoconfig module. The DQN agent was defined in the Python Integrated Development
Environment (IDE). At the beginning of the learning process, Python IDE calls TraCI API to start
the simulation. The real-time simulation data can be obtained at each decision interval. Data was
then sent back to the Python IDE to feed the DQN agent and generate the real-time RM control strategy.
The state of the traffic control changes according to the given RM strategy, which is also delivered by
the Python IDE. The above steps form a loop. By following this loop, the freeway network can
be continuously simulated to help the DQN agent iterate toward the optimal control strategy.
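The interaction loop can be sketched with a stub standing in for the SUMO side; in the real system the observe/apply calls would go through the TraCI API, and all numbers below are synthetic:

```python
import random

class StubNetwork:
    """Stand-in for the simulation network (real deployments would query
    detectors and set the ramp signal through TraCI)."""
    def observe(self):
        # three illustrative densities: upstream, weaving area, on-ramp
        return tuple(round(random.uniform(10, 60), 1) for _ in range(3))

    def apply(self, action):             # 0 = green, 1 = red
        # synthetic reward: vehicles discharged during the interval
        return random.randint(200, 300)

def run_episode(agent_policy, net, n_intervals=8):
    """One episode of the observe -> act -> reward loop described above."""
    history = []
    for _ in range(n_intervals):
        state = net.observe()
        action = agent_policy(state)
        rwd = net.apply(action)
        history.append((state, action, rwd))
    return history

random.seed(42)
# hypothetical threshold policy: meter (red) when the weaving density is high
log = run_episode(lambda s: 0 if s[1] < 35 else 1, StubNetwork())
```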
FIGURE 3 SUMO simulation platform structure and interaction.
Experiment Design
A freeway weaving section is developed in the simulation model (see Figure 2). The upstream and
downstream sections of the mainline comprise 3 lanes, and the weaving area is 250 meters long
and contains 4 lanes. The speed limit of all lanes is 33 m/s (approximately 75 mph). To simulate real-
world traffic flow features near the bottleneck area, the traffic demand and the parameters in the
SUMO were carefully calibrated based on the empirical loop detector data obtained from the
Freeway Performance Measurement System (PeMS) (50).
The duration of the simulation was set to 4 hours with a 30-min warm-up period. Traffic
demand on the mainline and on-ramp started at 2200 veh/h and 400 veh/h during the warm-up period.
The peak period lasted for two hours with 3000 veh/h on the mainline and 1000 veh/h on the on-
ramp. Then the demand dropped back to 2200 veh/h and 400 veh/h and lasted for another hour.
The capacity of the freeway mainline before capacity drop was found to be 3300 veh/h. The
magnitude of capacity drop was found to be 15.2%.
TRAINING OF DQN-BASED RM CONTROL STRATEGY
Experience replay and the freeze interval are the features that most distinguish the DQN from
the traditional QL algorithms. According to previous studies (46, 48), the replay memory size
should correspond to the specific scenarios of experiments and an appropriate freeze interval can
mitigate the correlation between the experiences and stabilize the learning process. In our study,
the parameters are carefully determined according to preliminary tests and suggestions from
previous studies (36-37, 40-43, 49), as shown in Table 1.
The simulation was activated by calling the TraCI API. The maximum epoch for the
training process is pre-defined. The DQN-based RM starts to perceive the traffic state at the
beginning of every decision interval, chooses an action (green or red light) and generates a
corresponding RM control strategy. The transition of this state-action pair is stored in the replay
memory of the DQN agent. At the end of the decision interval, the DQN agent would
perceive new state and a corresponding reward to the previous action. The DQN agent learns and
updates the policy with a batch of sample memories with the given learning value and learning
rate. During the simulation, the total travel time was computed at each step and the bottleneck
discharge flow was computed every 5 min.
TABLE 1 Learning Parameters for DQN-based RM Control Strategy
Parameter               Value
Optimizer               RMSProp
Replay memory size      10,000
Experience sampling     0.6
Learning rate           0.00025
Batch size              32
Exploration rate        From 0.9 to 0.1
Discount factor         0.99
Freeze interval         2,000
State matrix size       4
State matrix frame      1
Action size             2
SIMULATION RESULTS
Results of DQN-based RM Control Strategy
For the DQN-based RM with the 15 s control cycle, the simulation with the smallest total travel
time was reached at the 160th training epoch; for the DQN-based RM with the 30 s control cycle, the
simulation with the smallest total travel time was reached at the 204th training epoch. The time for
the agent to reach the optimal epoch varies, since the decision interval affects the learning time in
each epoch. The DQN-based RM was capable of quickly finding an optimal solution when facing the
control scenarios. We also considered the situation in which none of the control strategies was
used.
The speed, occupancy, total travel time and bottleneck discharge flow were collected and
calculated. There was no difference between the various strategies in the initial 30 min off-peak
period after the warm-up period. Thus, Figure 4 (a) and (b) illustrate the speed and occupancy
profiles, respectively, starting from the peak period at the bottleneck under different control strategies.
In the no control scenario, congestion occurred and lasted until the end of the simulation. With the
DQN-based RM strategies, the speed and occupancy profiles presented similar trends, such as
maintaining higher speeds and lower occupancy during peak hours. Table 2 lists the mean
speed and occupancy. Compared to the no control scenario, both DQN-based RM strategies
obviously increased the mean speed from 5.87 m/s to 12.95 and 13.82 m/s, and reduced the mean
occupancy from 35.60 to 20.12 and 18.51, respectively. The total travel time was reduced from
3306.26 veh·h to 1604.09 veh·h and 1633.94 veh·h with the DQN-based RM strategies, as
compared to the no control case, indicating a 51.48% and 50.58% reduction respectively. The
bottleneck discharge flow increased from 3185 veh/h to 3483 veh/h and 3462 veh/h with the two
DQN-based RM strategies, indicating a 9.39% and 8.69% increment in the bottleneck discharge
flow. The results suggest that the proposed DQN-based RM control strategy significantly reduced
the total travel time and increased the bottleneck discharge flow at freeway weaving bottlenecks.
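As a sanity check, the reported travel-time reductions follow directly from the values in Table 3:

```python
def pct_reduction(base, new):
    """Percent reduction relative to the no-control baseline."""
    return (base - new) / base * 100.0

# Total travel time: no control = 3306.26 veh·h (Table 3)
tt_cut_15s = pct_reduction(3306.26, 1604.09)   # ~51.48%, as reported
tt_cut_30s = pct_reduction(3306.26, 1633.94)   # ~50.58%, as reported
```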
FIGURE 4 (a) Occupancy profiles (%); (b) Speed profiles (m/s).
TABLE 2 Effects of DQN-based RM Control Strategies
Control strategy               Mean speed (m/s)    Mean occupancy (%)
No control                     5.87                35.60
DQN-based RM (cycle=15 s)      12.95               20.12
DQN-based RM (cycle=30 s)      13.82               18.51
The difference between the two DQN-based RM strategies could be attributed to the
difference in control cycles, which affects the responding speed. After each action was
implemented, the ramp vehicles need a period to merge into the mainline and then impact
the mainline traffic flow. Compared to the DQN-based RM with the 30 s control cycle, the DQN-based RM with the
15 s control cycle can make a change of control action every 15 s which ensures a faster responding
speed to the change of traffic state. As shown in Table 2, the mean occupancy was maintained
around 20% which is close to the occupancy that triggers the capacity drop (which is
approximately 23% in our simulation). Note that if the control cycle is too short, for example 5 s,
it can mislead the agent's learning and degrade control performance because the impact of the
control action on traffic flow may not yet have fully materialized.
Comparison between Different Control Strategies
The performances of different RM control strategies were compared. The fixed-time RM control
strategies used 5 s (with 2 s green phase and 3 s red phase) and 7 s (with 2 s green phase and 5 s
red phase) as the control cycle. Two ALINEA strategies were used with the parameters defined
according to the traffic flow features in the bottleneck section (18): desired occupancy = 22% or
23%, control cycle = 80 s and K_R = 70 veh/h. The QL was also trained for the RM control
based on the same state and action sets as the DQN. The experiments show that the QL was not
able to converge, or in other words, to find the optimal control strategy in the training process.
The main reason for the failure is that the QL has limited ability to deal with large state spaces and
complex control tasks.
The total travel time and the bottleneck discharge flow in different scenarios were
compared in Table 3 and Figure 5. The results show that the two fixed-time RM control strategies
performed the worst, reducing total travel time by only 23.33% and 32.75% and increasing
bottleneck discharge flow by only 0.29% and 2.48%, respectively; the two ALINEA strategies
improved the freeway system moderately, with 41.77% and 38.45% reductions in total travel time
and 7.79% and 7.21% increases in bottleneck discharge flow. All RM control strategies could improve the
freeway efficiency to some extent, but two proposed DQN-based RM strategies outperformed the
others.
The causes for the differences between RM control strategies are discussed. The fixed-time
RM control strategy cannot adjust to different traffic conditions. Though the effects can be slightly
improved by setting proper control cycle parameters, such strategies performed the worst among
all strategies. The control parameters in the ALINEA strategies highly rely on human knowledge
about the capacity drop and breakdown probabilities at bottlenecks, as different expected
occupancies lead to different control effects. In addition, the feedback nature makes these strategies
slow in responding to traffic changes and taking proper actions, which is another major cause
of the reduced effects compared to the DQN-based RM control strategies.
TABLE 3 Comparison Result for Different Control Strategies

Control strategy            Total travel time (veh∙h)  Improvement (%)  Discharge flow (veh/h)  Improvement (%)
No control                  3306.26                    /                3185                    /
Fixed-time 5-s              2534.96                    23.33            3194                    0.29
Fixed-time 7-s              2223.41                    32.75            3264                    2.48
ALINEA (occ=22)             1925.37                    41.77            3433                    7.79
ALINEA (occ=23)             2034.90                    38.45            3415                    7.21
DQN-based RM (cycle=15 s)   1604.09                    51.48            3483                    9.39
DQN-based RM (cycle=30 s)   1633.94                    50.58            3462                    8.69
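The travel-time improvement percentages in Table 3 follow directly from the no-control baseline, as the quick check below confirms. (The discharge-flow percentages are not reproduced here; they were presumably computed from unrounded flows, since the rounded table values give slightly different figures.)

```python
baseline = 3306.26  # total travel time with no control (veh.h), from Table 3
travel_times = {
    "Fixed-time 5-s": 2534.96,
    "Fixed-time 7-s": 2223.41,
    "ALINEA (occ=22)": 1925.37,
    "ALINEA (occ=23)": 2034.90,
    "DQN-based RM (cycle=15 s)": 1604.09,
    "DQN-based RM (cycle=30 s)": 1633.94,
}
for name, tt in travel_times.items():
    improvement = (baseline - tt) / baseline * 100.0
    # Matches the travel-time Improvement column of Table 3
    print(f"{name}: {improvement:.2f}%")
```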
FIGURE 5 (a), (c) Comparison of bottleneck discharge flow with different RM control
strategies; (b), (d) comparison of total travel time with different RM control strategies. A:
No control; B: Fixed-time 5-s; C: Fixed-time 7-s; D: ALINEA (occ=22); E: ALINEA
(occ=23); F: DQN-based RM (cycle=15 s); G: DQN-based RM (cycle=30 s)
CONCLUSIONS AND DISCUSSION
This study proposed a DQN-based RM control strategy aimed at reducing total travel time at
freeway weaving bottlenecks. A novel framework that couples an open-source microscopic
simulation software, SUMO, with a deep reinforcement learning agent was proposed to
automatically obtain the optimal control action. A training procedure was designed for the
DQN-based RM in an off-line scheme, which could obtain the optimal control strategy within a
short training period; the trained agent can then be applied to real-time RM control tasks. The
results showed that the well-trained DQN-based RM is capable of predicting traffic state
transitions and acting in a proactive control scheme. In our simulation tests, improved
performance in reducing total travel time and increasing bottleneck discharge flow was observed
compared with traditional RM control strategies. In addition, compared with the QL-based RM
control, the DQN-based RM uses continuous and larger state inputs, which allows it to perceive
the traffic state and take control actions more precisely.
The major contribution of the proposed framework is that the DQN-based RM directly
builds the mapping between traffic parameters at the freeway bottleneck and the RM actions,
removing the need for human prior knowledge of traffic flow theories (such as oscillation
patterns, congestion triggers, and the capacity drop mechanism). Only the state variables,
actions, reward function, and DNN-related parameters need to be given; no other traffic-related
parameters need to be calibrated in the proposed strategy. The proposed framework can be
further extended and transferred to other control scenarios such as coordinated ramp metering,
variable speed limits, and signal timing.
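To illustrate how little must be specified, the sketch below shows a trained Q-network mapping a detector-based state directly to a metering action via epsilon-greedy selection. Everything here is hypothetical: the state dimension, the action set, and the tiny two-layer network standing in for the trained DNN are illustrative assumptions, not the paper's actual architecture or settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical specification: what the framework needs from the user.
STATE_DIM = 8     # e.g. speeds/occupancies from mainline and ramp detectors
ACTIONS = [0, 1]  # e.g. 0 = red phase, 1 = green phase for the next cycle

# Stand-in for the trained DNN: a small two-layer network, random weights.
W1 = rng.normal(size=(STATE_DIM, 16))
W2 = rng.normal(size=(16, len(ACTIONS)))

def q_values(state):
    """Forward pass approximating Q(s, a) for every action."""
    hidden = np.maximum(state @ W1, 0.0)  # ReLU hidden layer
    return hidden @ W2

def act(state, epsilon=0.1):
    """Epsilon-greedy: explore during training, exploit when deployed."""
    if rng.random() < epsilon:
        return int(rng.choice(ACTIONS))
    return ACTIONS[int(np.argmax(q_values(state)))]

state = rng.normal(size=STATE_DIM)  # one (hypothetical) detector reading
print(act(state, epsilon=0.0))      # greedy action of this untrained network
```

Because the state is fed to the network as raw continuous values, no manual discretization or traffic-theoretic calibration is required, which is the practical advantage over tabular QL noted above.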
Although the training process is reasonably fast, the DQN-based RM still requires a
number of state variables and sufficient training time to obtain the optimal control strategy
before it can be applied. For large freeway networks or complex scenarios, the DQN agent may
have difficulty converging, meaning the control performance is not optimized; a balance must be
carefully struck between computing efficiency and learning effectiveness. In addition, the DQN
agent with a DNN behaves like a black box, making its inherent mechanism hard to explain.
This may reduce the easiness and practicability of transferring it to other case studies. A
continuous learning procedure in the reinforcement learning algorithm is recommended
to enhance the robustness and transferability of the control performance when applied to new
environments (51).
In the present study, a typical freeway weaving section was simulated with the proposed
DQN-based RM control strategy. Other types of bottlenecks, caused by lane reductions, traffic
incidents, or work zones, also need to be investigated. Besides, this study used the bottleneck
discharge flow as the reward function in the DQN method, and a successful reduction in travel
time and increase in bottleneck discharge flow were observed. Future studies could consider
other reward functions, such as social equity between ramp and mainline vehicles, for system
optimization. Furthermore, for a complex freeway segment with two or more on-ramps, the
coordination of multiple reinforcement learning agents with RM control strategies and variable
speed limit control strategies can be evaluated to improve the overall performance in reducing
freeway traffic congestion.
ACKNOWLEDGEMENTS
This research was jointly sponsored by the National Natural Science Foundation of China
(51508094), the National Key Research and Development Program of China: Key Projects of
International Scientific and Technological Innovation Cooperation Between Governments
(2016YFE0108000), and the Fundamental Research Funds for the Central Universities
(2242017K40130).
AUTHOR CONTRIBUTION STATEMENT
The authors confirm contribution to the paper as follows: study conception and design: Mofeng
Yang, Zhibin Li, Zemian Ke; simulation: Mofeng Yang, Meng Li; analysis and interpretation of
results: Mofeng Yang, Zhibin Li; draft manuscript preparation: Mofeng Yang, Zhibin Li. All
authors reviewed the results and approved the final version of the manuscript.
REFERENCES
1. Cassidy, M. J., & Rudjanakanoknad, J.. Increasing the Capacity of an Isolated Merge by
Metering its On-ramp. Transportation Research Part B: Methodological, 2005. 39(10):
896-913.
2. Chung, K., Rudjanakanoknad, J., and Cassidy, M. J.. Relation between Traffic
Density and Capacity Drop at Three Freeway Bottlenecks. Transportation Research Part
B: Methodological, 2007. 41(1): 82-95.
3. Zhang, L., and Levinson, D.. Ramp Metering and Freeway Bottleneck Capacity.
Transportation Research Part A: Policy and Practice, 2011. 44(4): 218-235.
4. Oh, S., and Yeo, H.. Microscopic Analysis on the Causal Factors of Capacity Drop in
Highway Merging Sections. Presented at 91st Annual Meeting of the Transportation
Research Board, Washington, D.C., 2012.
5. Transportation Research Board. Highway Capacity Manual. Technical report, Washington,
D.C., 2000.
6. Transportation Research Board. Highway Capacity Manual. Technical report, Washington,
D.C., 2010.
7. Paul Ryus, Mark Vandehey, Lily Elefteriadou, Richard G Dowling, and Barbara K Ostrom.
New TRB Publication: Highway Capacity Manual 2010. TR News, 2011. (273).
8. Yang, H., and Rakha, H.. Reinforcement Learning Ramp Metering Control for Weaving
Sections in a Connected Vehicle Environment. Presented at 96th Annual Meeting of the
Transportation Research Board, Washington, D.C., 2017.
9. Shaaban, K., Khan, M. A., and Hamila, R.. Literature Review of Advancements in
Adaptive Ramp Metering. Procedia Computer Science, 2016. 83: 203-211.
10. Papageorgiou, M., and Kotsialos, A.. Freeway Ramp Metering: An Overview. IEEE
Transactions on Intelligent Transportation Systems, 2002. 3(4): 271-281.
11. Papageorgiou, M., and Papamichail, I.. Overview of Traffic Signal Operation Policies for
Ramp Metering. Transportation Research Record: Journal of the Transportation Research
Board, 2008. 2047: 28-36.
12. Papageorgiou, M., Hadj-Salem, H., and Blosseville, J. M.. ALINEA: A Local Feedback
Control Law for On-ramp Metering. Transportation Research Record: Journal of the
Transportation Research Board, 1991. 1320: 58-67.
13. Zhang, M., Kim, T., Nie, X., Jin, W., Chu, L., and Recker, W.. California Partners for
Advanced Transit and Highways (PATH), 2001.
14. Smaragdis, E., Papageorgiou, M., and Kosmatopoulos, E.. A Flow-maximizing Adaptive
Local Ramp Metering Strategy. Transportation Research Part B: Methodological, 2004.
38(3): 251-270.
15. Kotsialos, A., Papageorgiou, M., Hayden, J., Higginson, R., McCabe, K., and Rayman, N..
Discrete Release Rate Impact on Ramp Metering Performance. IEE Proceedings-
Intelligent Transport Systems, 2006. 153(1): 85-96.
16. Smaragdis, E., and Papageorgiou, M.. Series of New Local Ramp Metering Strategies.
Transportation Research Record: Journal of the Transportation Research Board, 2003.
1856: 74-86.
17. Demiral, C., & Celikoglu, H. B.. Application of ALINEA Ramp Control Algorithm to
Freeway Traffic Flow on Approaches to Bosphorus Strait Crossing Bridges. Procedia-
Social and Behavioral Sciences, 2011. 20: 364-371
18. Papageorgiou, M., Hadj-Salem, H., Middelham, F.. “ALINEA Local Ramp Metering:
Summary of Field Results”, Transportation Research Record: Journal of the
Transportation Research Board, 1997. 1603: 90-98
19. Bellemans, T., B.D. Schutter, and B.D. Moor.. Model Predictive Control for Ramp
Metering of Motorway Traffic: A Case Study. Control Engineering Practice, 2006. 4(7):
757–767.
20. Hegyi, A., De Schutter, B., and Hellendoorn, H.. Model Predictive Control for Optimal
Coordination of Ramp Metering and Variable Speed Limits. Transportation Research Part
C: Emerging Technologies, 2005. 13(3): 185-209.
21. Papamichail, I., Kotsialos, A., Margonis, I., and Papageorgiou, M. Coordinated Ramp
Metering for Freeway Networks–A Model-predictive Hierarchical Control Approach.
Transportation Research Part C: Emerging Technologies, 2010. 18(3): 311-331.
22. Zegeye, S. K., De Schutter, B., Hellendoorn, J., Breunesse, E. A., and Hegyi, A.. Integrated
Macroscopic Traffic Flow, Emission, and Fuel Consumption Model for Control Purposes.
Transportation Research Part C: Emerging Technologies, 2013. 31: 158-171.
23. Cassidy, M. J., & Windover, J. R.. Methodology for Assessing Dynamics of Freeway
Traffic Flow. Transportation Research Record: Journal of the Transportation Research
Board, 1995. 1484: 73-79.
24. Cassidy, M. Freeway On-ramp Metering, Delay Savings, and Diverge Bottleneck.
Transportation Research Record: Journal of the Transportation Research Board, 2003.
1856: 1-5.
25. Kim, K., and Cassidy, M. J.. A Capacity-increasing Mechanism in Freeway Traffic.
Transportation Research Part B: Methodological, 2012. 46(9): 1260-1272.
26. Watkins, C., and Dayan, P.. Q-Learning. Machine Learning, 1992. 8(3): 279-292.
27. Sutton, R.S., and Barto, A. G.. Reinforcement Learning: An Introduction. Cambridge, MA:
MIT Press, 1998.
28. Barto, A. G., and Mahadevan, S.. Recent Advances in Hierarchical Reinforcement
Learning. Discrete Event Dynamic Systems, 2003. 13(1-2): 41-77.
29. Mahadevan, S.. Average Reward Reinforcement Learning: Foundations, Algorithms, and
Empirical Results. Machine Learning, 1996. 22(1-3): 159-195.
30. Wang, X., Liu, B., Niu, X., and Miyagi, T.. Reinforcement Learning Control for On-
ramp Metering Based on Traffic Simulation. ICCTP 2009: Critical Issues in
Transportation Systems Planning, Development, and Management, 2009. 1-7.
31. Veljanovska, K., Bombol, K. M., and Maher, T.. Reinforcement Learning Technique in
Multiple Motorway Access Control Strategy Design. PROMET-Traffic&Transportation,
2010. 22(2): 117-123.
32. Davarynejad, M., Hegyi, A., Vrancken, J., and van den Berg, J. Motorway Ramp-metering
Control with Queuing Consideration Using Q-learning. Intelligent Transportation Systems
(ITSC), 2011 14th International IEEE Conference on. IEEE, 2012: 1652-1658.
33. Wang, X. J., Xi, X. M., and Gao, G. F.. Reinforcement Learning Ramp Metering without
Complete Information. Journal of Control Science and Engineering, 2012. 2.
34. Rezaee, K., Abdulhai, B., and Abdelgawad, H.. Application of Reinforcement
Learning with Continuous State Space to Ramp Metering in Real-world Conditions.
Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on.
IEEE, 2012. 1590-1595.
35. Rezaee, K.. Decentralized Coordinated Optimal Ramp Metering Using Multi-agent
Reinforcement Learning. Doctoral dissertation, University of Toronto (Canada), 2014.
36. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and
Riedmiller, M.. Playing Atari with Deep Reinforcement Learning. arXiv preprint, 2013.
arXiv:1312.5602.
37. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... and
Petersen, S.. Human-level Control through Deep Reinforcement Learning. Nature, 2015.
518(7540): 529.
38. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... &
Dieleman, S. Mastering the Game of Go with Deep Neural Networks and Tree Search.
Nature, 2016. 529(7587): 484-489.
39. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... and
Chen, Y. Mastering the Game of Go without Human Knowledge. Nature, 2017. 550(7676):
354.
40. Li, L., Lv, Y., and Wang, F. Y.. Traffic Signal Timing via Deep Reinforcement Learning.
IEEE/CAA Journal of Automatica Sinica, 2016. 3(3): 247-254.
41. Van der Pol, E., and Oliehoek, F. A.. Coordinated Deep Reinforcement Learners for Traffic
Light Control. Proceedings of Learning, Inference and Control of Multi-Agent Systems at
NIPS, 2016.
42. Van der Pol, E.. Deep Reinforcement Learning for Coordination in Traffic Light Control.
Master’s thesis, University of Amsterdam, 2016.
43. Mousavi, S. S., Schukat, M., and Howley, E.. Traffic Light Control Using Deep Policy-
Gradient and Value-function-based Reinforcement Learning. IET Intelligent Transport
Systems, 2017. 11(7): 417-423.
44. Belletti, F., Haziza, D., Gomes, G., and Bayen, A. M.. Expert Level Control of Ramp
Metering Based on Multi-task Deep Reinforcement Learning. IEEE Transactions on
Intelligent Transportation Systems, 2017.
45. Van Hasselt, H., Guez, A., & Silver, D.. Deep Reinforcement Learning with Double Q-
Learning. AAAI, 2016. 16: 2094-2100.
46. Schaul, T., Quan, J., Antonoglou, I., and Silver, D.. Prioritized Experience Replay. arXiv
preprint, 2015. arXiv:1511.05952.
47. Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S.. Weighted Importance
Sampling for Off-policy Learning with Linear Function Approximation. Advances in
Neural Information Processing Systems, 2014. 3014-3022.
48. Daganzo, C. F.. Fundamentals of transportation and traffic operations. Pergamon Press,
Oxford, 1997.
49. Lu, C., Huang, J., and Gong, J.. Reinforcement Learning for Ramp Control: An Analysis
of Learning Parameters. PROMET-Traffic&Transportation, 2016. 28(4): 371-381.
50. Chen, C., Petty, K., Skabardonis, A., Varaiya, P., and Jia, Z.. Freeway Performance
Measurement System: Mining Loop Detector Data. Transportation Research Record:
Journal of the Transportation Research Board, 2001. 1748: 96-102.
51. Li, Z., Liu, P., Xu, C., Duan, H., & Wang, W.. Reinforcement Learning-based Variable
Speed Limit Control Strategy to Reduce Traffic Congestion at Freeway Recurrent
Bottlenecks. IEEE Transactions on Intelligent Transportation Systems, 2017. 18(11):
3204-3217.