A DEEP REINFORCEMENT LEARNING-BASED RAMP METERING CONTROL
FRAMEWORK FOR IMPROVING TRAFFIC OPERATION AT FREEWAY WEAVING
SECTIONS
Mofeng Yang
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: yangmofeng@seu.edu.cn
Zhibin Li, Ph.D., Corresponding Author
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: lizhibin@seu.edu.cn
Zemian Ke
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: kezemian@seu.edu.cn
Meng Li
Jiangsu Key Laboratory of Urban ITS, Southeast University
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies
Dong Nan Da Xue Rd. #2, Nanjing, China, 211189
Email: seulimeng@163.com
Word count: 6,581 words text + 3 tables × 250 words (each) = 7,331 words
Submission Date: August 1, 2018
ABSTRACT
Ramp metering (RM) dynamically adjusts the ramp flow merging into the freeway mainline according to real-time traffic conditions to improve traffic operation. The effectiveness of RM is mainly determined by its control strategy, which decides how to calculate the metering flow for various traffic states. Traditional RM control strategies are limited by their response speed to traffic changes and by their online computation workloads. They also require substantial human knowledge about the traffic flow problems in the study segments. In this study, we propose a deep reinforcement learning-based RM control framework, named DQN-based RM. The new framework incorporates the deep Q network (DQN) algorithm into RM in order to reduce total travel time on freeways. A typical freeway weaving bottleneck section was simulated on the Simulation of Urban MObility (SUMO) platform. The results show that the proposed DQN-based RM strategy is able to respond proactively to different traffic states and take immediate and correct actions to prevent traffic breakdown, without full prior knowledge of traffic flow theories. The DQN-based RM reached the optimal control target within a short training time, and the total travel time was reduced by 51.48% and 50.58% with 15 s and 30 s control cycles, respectively. We also compare the performance of various RM strategies. The results show that the DQN-based RM outperforms the traditional fixed-time and feedback-based RM control strategies in mitigating congestion and reducing travel time on freeways.
Keywords: Ramp metering, Deep reinforcement learning, Congestion, Freeway weaving sections,
Bottleneck
INTRODUCTION
The continuous increase of travel demand and flow on freeways causes frequent and severe congestion, which greatly increases total travel time, delays, crash risks, and fuel consumption (1-4). Freeway weaving sections are typical bottlenecks, usually formed by a merge area closely followed by a diverge area (5, 6). Within weaving sections, vehicles that enter and exit the freeway perform intense lane changes to access their target lanes, which is the main cause of traffic congestion and traffic breakdown (7). The traffic operation of freeway weaving bottlenecks should therefore be improved by properly controlling the traffic flow (8).
Ramp metering is the most commonly used active traffic management (ATM) strategy; it controls the number of ramp vehicles merging into the mainline with traffic lights and a set of loop detectors installed on the mainline and ramps (9-11). The main objective is to set an appropriate ramp release rate in order to minimize the negative impact of the ramp flow disturbance on mainline traffic and to prevent the occurrence of traffic breakdown and capacity drop. The general framework used in previous studies is shown in Figure 1 (a). First, a study site containing recurrent bottlenecks is selected; second, traffic data at the loop detector stations are collected. Then, traffic flow features as well as key congestion attributes (e.g., occupancy thresholds for predicting capacity drop) are analyzed based on traffic flow theories. Finally, the key parameters in the RM control strategies are decided.
Early RM studies and practice tended to use fixed-time control strategies, which set the metering rate through a pre-defined control cycle, green phase, and red phase (10). Such strategies are easy to implement, but their major drawback is that the fixed parameters fail to respond to changing traffic states, which greatly reduces their performance. Later, feedback-based RM algorithms were proposed, which are traffic responsive in nature. The most famous feedback strategy, the Asservissement Linéaire d'Entrée Autoroutière (ALINEA) (12), and its variants (13-16) aim at keeping the bottleneck occupancy at an expected value to maintain the maximum flow and prevent traffic breakdown. Such approaches have already been implemented in many practical deployments (17, 18). However, feedback-based RM adjusts the metering rate passively according to the traffic conditions. Its performance is greatly limited by this feedback nature, especially in fast-changing traffic environments. In addition, the control parameters in the feedback controller rely on human prior knowledge about the capacity drop and breakdown probabilities.
Some research used online optimization methods, such as model predictive control, to calculate the metering rate by solving on-line optimal control problems (19-22). Though such approaches can theoretically obtain mathematically optimal solutions of the RM control problem, they require accurate models to predict the traffic dynamics and involve a heavy online computing workload, which makes them unsuitable for large-scale applications. Other researchers proposed a traffic flow theory-based RM control strategy that improved traffic operation by forming a free-flow pocket in the bottleneck traffic flow (1, 23-25). The strategy was applied and tested on the I-805 and I-5 freeway sections in California and was found to outperform the feedback-based strategy in that case study (1). However, such approaches require a deep understanding of the traffic flow characteristics and congestion mechanisms at the target bottlenecks. The obtained strategies may be site-specific and may not work well when applied to other freeway bottlenecks.
FIGURE 1 (a) General framework for RM studies; (b) DQN-based RM control framework.
In this study, a deep reinforcement learning (RL)-based RM control framework is proposed for reducing total travel time at freeway bottleneck areas, as shown in Figure 1 (b). The framework directly links the traffic flow data to the RM control strategy. RL enables the agent to obtain an optimal strategy in an unfamiliar environment without prior knowledge of traffic flow. In other words, the RL-based RM control strategy does not require complex analyses based on traffic flow theories and thus reduces the need for human knowledge. Considering that the complexity and mechanisms of some traffic issues, such as capacity drop and traffic breakdown, are still not fully explored and understood by researchers, previous RM strategies have clear cognitive limitations, and our framework has the potential to lead to improved performance.
Previous researchers have incorporated Q-learning (QL) (26), a basic RL algorithm (27-29), into RM control tasks (8, 30-35). Though the training process is extensive, QL-based RM can obtain the best metering actions for various traffic states, and its effects are widely reported in previous studies. However, QL uses discretized traffic state sets with large intervals, which may not accurately reflect the traffic conditions. In addition, QL's performance is greatly limited by the storage capacity of the Q-table, which is used to determine the optimal control actions in the iteration process.
The deep Q network (DQN) combines QL with a deep neural network (DNN) (36, 37). The DeepMind team has demonstrated a number of successful deep RL applications, including DQN agents that play Atari games as well as AlphaGo and AlphaZero for the game of Go (36-39). Compared to traditional QL, the DQN can deal with more complex tasks with large and continuous state spaces by adding the DNN to the agent's learning structure. Some recent studies have applied the DQN to intersection signal control problems (40-43). They reported that the DQN finds appropriate and stable signal timing policies and achieves reductions in total travel time and collision risks.
A recent study proposed a multi-task deep RL-based RM that achieved a control performance on par with ALINEA (44). However, that study focused on large-scale freeway sections with multiple on-ramps and did not pay particular attention to the specific freeway bottlenecks where the occurrence of capacity drop is the main reason for the low efficiency of traffic flow.
The literature review shows that no previous study has incorporated the DQN algorithm into RM control to reduce total travel time and improve traffic operation at freeway bottleneck areas. This study aims to fill that gap. A typical bottleneck is simulated in the Simulation of Urban MObility (SUMO). A training procedure for the DQN agent is proposed to obtain the optimal actions. The effects of the DQN-based RM strategy are evaluated and compared with those of traditional RM control strategies.
METHODOLOGY
The RL approach was originally inspired by behaviorist psychology; it considers how an agent ought to take actions in an environment to maximize the cumulative reward (27-29). An RL agent interacts with its environment in discrete time steps, which is typically formulated as a Markov decision process (MDP). RM control requires the determination of a metering flow control action for the current traffic state at each decision interval. After the agent has taken a control action, the current traffic state changes into a new state, and the state transition can be evaluated by a reward function. Therefore, the RM control problem can be formulated as a typical MDP and can be solved with the RL technique.
Deep Q Network
In QL, a Q-value is assigned to each state-action pair to evaluate the quality of the action. The set of Q-values can be represented as

Q: S \times A \rightarrow R    (1)

where S is the set of possible states, A is the set of possible actions, and R is the set of rewards. In an infinite-horizon discounted reward problem, the agent's goal is to maximize

\sum_{t=0}^{\infty} \gamma^{t} R_{t}    (2)

where R_t is the reward at time step t, and γ is the discount factor (0 ≤ γ ≤ 1) that defines the relative importance of immediate rewards and future rewards. For a non-deterministic environment, the Q-value is updated with every new training sample according to

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \kappa(s_t, a_t) \left[ R_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right]    (3)

where Q_{t+1}(s_t, a_t) is the Q-value for the state-action pair (s_t, a_t) at time step t+1, R_{t+1} is the reward received after performing action a_t at state s_t and moving to the new state s_{t+1}, and κ(s, a) is the learning rate, which controls how fast the Q-values are altered.
The Q-table is updated during the training process. The Q-values converge if each state-action pair is visited a sufficient number of times, and the optimal action for each state is the action with the largest Q-value. The QL agent can then be used for optimal control according to the knowledge it obtained in the training process.
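To make the update rule of Equation (3) concrete, the following minimal Python sketch performs one tabular Q-learning update; the state and action encodings are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Q-table mapping (state, action) pairs to Q-values, initialized to 0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions, kappa=0.1, gamma=0.99):
    """One tabular Q-learning update following Equation (3)."""
    # Best achievable Q-value from the next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    # Move the current estimate toward the TD target at the learning rate kappa.
    Q[(state, action)] += kappa * (reward + gamma * best_next - Q[(state, action)])

# Example: states are discretized density bins, actions are light colors (0 = green, 1 = red).
q_update(state=3, action=0, reward=12.0, next_state=4, actions=[0, 1])
```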
The Q-table based approach is effective and efficient if the state-action set is not large.
However, in practical applications, the state-action space is usually very large, resulting in the size
of the lookup table growing exponentially. The algorithm is hard to converge, because all state-
action pairs need to be visited multiple times according to the inherent logic of QL. In addition,
limited by the size of the lookup table, the number of variables for traffic state representation is
restricted, which makes the QL incapable of handling more complicated traffic control tasks. The
drawbacks associated with the QL greatly limit its performance and applications.
The DQN incorporates a DNN into the QL framework and relaxes the limitation on the size of the state space (36, 37). In a DQN, the Q-values are mapped from the states, which are used as the input of the DNN, allowing for a large or continuous state space. There are two common ways to stabilize the DQN: target network freezing and prioritized experience replay (45, 46). Target network freezing splits the Q-value estimation into two different networks, i.e., a value network to estimate the Q-value of the current state and a target network to compute the targets. By selecting an appropriate freezing interval, the targets can be partially stabilized. Experience replay samples a batch of stored experiences during the training process. In the DQN with prioritized experience replay (46), the temporal-difference (TD) error is used to preferentially sample experiences with larger errors.
The key idea is that a transition with a higher expected learning progress, measured by the absolute TD-error, is more likely to be replayed. The priority of transition t is calculated as

p_t = |\delta_t| + \pi    (4)

where δ_t is the TD-error and π is a small positive constant that prevents transitions from never being sampled again once their TD-error is zero. The TD-error is calculated as

\delta_t = R_t + \gamma \max_{a} Q(s_t, a) - Q(s_{t-1}, a_{t-1})    (5)

Then the probability of sampling transition t is determined as

P(t) = p_t^{\alpha} / \sum_{k} p_k^{\alpha}    (6)

where p_t > 0, the exponent α indicates the level of prioritization, and α = 0 corresponds to uniform sampling.

Importance-sampling (IS) weights are used to correct the bias introduced by prioritized replay (47). The parameters of the value network are then updated as

\theta_{t+1} = \theta_t + \eta \cdot \sum_{t=1}^{k} \omega_t \cdot \delta_t \cdot \nabla_{\theta} Q(s_t, a_t)    (7)

where k is the mini-batch size when sampling, and ω_t is the IS weight of sampled transition t, calculated as

\omega_t = \left( \frac{1}{N} \cdot \frac{1}{P(t)} \right)^{\beta}    (8)

where N is the memory size, and β = 1 indicates that the non-uniform sampling probabilities are fully compensated.
In our study, the DQN with the prioritized experience replay and target network freezing
was used to support the proposed RM control strategy.
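As an illustration of Equations (4)-(8), the short numpy sketch below computes priorities, sampling probabilities, and importance-sampling weights for a memory of stored TD-errors; the constant names (alpha, beta, pi_const) mirror the symbols above and the values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=1.0, pi_const=1e-6):
    """Sample transition indices and IS weights following Eqs. (4)-(8)."""
    priorities = np.abs(td_errors) + pi_const          # Eq. (4)
    probs = priorities ** alpha                        # Eq. (6), numerator
    probs /= probs.sum()                               # Eq. (6), normalization
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    n = len(td_errors)
    weights = (1.0 / (n * probs[idx])) ** beta         # Eq. (8)
    weights /= weights.max()                           # common normalization for stability
    return idx, weights

# Example: replay memory holding 10,000 TD-errors, sampling a mini-batch of 32.
idx, w = sample_prioritized(rng.normal(size=10000), batch_size=32)
```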
DQN-Based RM Control Strategy
A DQN-based RM control strategy is proposed in this section, and its flowchart is shown in Figure 2. The DQN-based RM consists of two parts, the DQN agent and the simulation network, which exchange traffic states, rewards, and actions. The DQN agent iterates toward the optimal control strategy by continuously interacting with the simulation network; the details are discussed in the next section.
The DQN-based RM first perceives the current traffic state and selects a traffic light color at each decision interval. The traffic light then leads the state transition. The feedback reward and the transition of this state-action pair are stored in the DQN agent's replay memory. The DQN agent evaluates the reward and learns and updates its policy with a batch of sampled memories using the given learning parameters. The crucial elements in the DQN-based RM are designed as follows:
(1) State. The selected states should enable the agent to perceive the traffic situation in the freeway segment. In previous QL-based RM research, states were discretized into several intervals. However, discrete states cannot accurately describe the dynamics of freeway traffic. In our study, three traffic flow variables and one traffic light status were used to represent the traffic state at a freeway weaving section: the density at the immediate upstream of the weaving area, the density in the weaving area, the density on the on-ramp, and the color of the traffic light. The traffic flow variables are rounded to one decimal place, and the light color is represented by 0 (green) and 1 (red).
(2) Action. In this study, the action set is the traffic light color, which includes two choices, i.e., turning green or turning red. The control period should be set so that the effect (or reward) of the action taken can be perceived by the agent after the interval. In our study, we tested two control periods for the DQN-based RM, 15 s and 30 s. In other words, the DQN-based RM was able to turn the traffic light green or red 15 s or 30 s after the previous action.
(3) Reward. The objective of the DQN-based RM is to reduce the system total travel time (TTT). The TTT over a time horizon K can be calculated by

TTT = \eta \sum_{k=1}^{K} N(k)    (9)

where N(k) is the total number of vehicles in the network at time k, and η is the time interval. In the simulation, the number of vehicles entering the freeway system at each time step from both the upstream mainline and the on-ramp was recorded. The number of vehicles leaving the freeway system at the downstream mainline and off-ramp was also recorded. Any control measure that manages to increase the early exit flows of the freeway section leads to a decrease in the total travel time (1-3, 10, 24, 48). Thus, the reward function can be determined by the discharge flow at the bottleneck, defined as

R(s) = q(t)    (10)

where R(s) is the reward for state s, and q(t) is the bottleneck discharge flow within interval t, which can be obtained by simply counting the number of vehicles that pass through the bottleneck.
(4) Learning Parameters. The DQN has many learning parameters, and the selection of their values greatly affects the performance. The learning rate decides the learning speed of the agent at each time step. If the learning rate is too large, the gradient descent may move too fast and the DQN may not find the optimal policy; if the learning rate is too small, the computation cost and training time become too long and the algorithm may be hard to converge. In previous studies, the learning rate was typically set to a small constant value to ensure that the optimal policy can be found.
FIGURE 2 Flowchart of the DQN-based RM control strategy.
Another important consideration is to strike a balance between exploitation and exploration when selecting actions. The DQN should fully exploit the information that has already been captured in the Q-values. Pure exploitation may greatly save learning time, but it may also prevent the discovery of better actions and lead to local optima. On the other hand, pure exploration enhances the capability of discovering new and better actions. However, it may result in random action selection without making use of the existing learning results and, accordingly, is quite time consuming. In our study, exploitation and exploration were balanced by gradually decreasing the exploration rate from 0.9 to 0.1 during the learning process. The DQN starts with a large exploration rate, and as the agent's knowledge matures, the exploration rate decreases accordingly.
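A minimal sketch of this epsilon-greedy scheme, with the exploration rate decaying linearly from 0.9 to 0.1 over the training epochs, is shown below; the function names and the epoch count are hypothetical.

```python
import random

def exploration_rate(epoch, max_epoch, eps_start=0.9, eps_end=0.1):
    """Linearly decay the exploration rate from eps_start to eps_end."""
    frac = min(epoch / max_epoch, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, epoch, max_epoch):
    """Epsilon-greedy choice between action 0 (green) and action 1 (red)."""
    if random.random() < exploration_rate(epoch, max_epoch):
        return random.choice([0, 1])                 # explore: random light color
    return int(q_values.index(max(q_values)))        # exploit: largest Q-value

# Example: Q-value estimates for [green, red] at the current state, epoch 50 of 200.
action = select_action([3.2, 2.7], epoch=50, max_epoch=200)
```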
Traditional RM Control Strategies
(1) Fixed-time Control Strategies. The traffic light changes its color according to a pre-designed cycle length and green/red phases. For the fixed-time RM control strategy, the ramp flow can be calculated by

r = \lambda_r \cdot 1800 \cdot G / c    (11)

where λ_r is the number of lanes on the ramp, G is the green time in the signal cycle, and c is the signal cycle length. This strategy is able to maintain a stable metering rate for simple scenarios.
(2) Feedback-based Control Strategies. The most representative feedback-based RM strategy,
i.e. ALINEA, is considered in this study for comparison.
r(k) = r(k-1) + K_R \left[ \hat{o} - o_{out}(k) \right]    (12)

where r(k) is the metering rate at decision interval k, r(k-1) is the metering rate at decision interval k-1, K_R > 0 is a regulator parameter, ô is the desired occupancy at the bottleneck, usually set to the critical occupancy, and o_out(k) is the real-time measurement of the downstream occupancy at decision interval k. The ramp metering rate is adjusted in proportion to the difference between the expected and measured downstream occupancies.
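For reference, the ALINEA update of Equation (12) amounts to a few lines of Python; the regulator parameter and desired occupancy shown here (K_R = 70 veh/h, 22%) follow the settings used in the comparison experiments later in the paper, while the rate bounds are illustrative placeholders.

```python
def alinea_update(prev_rate, measured_occupancy, desired_occupancy=22.0, k_r=70.0,
                  min_rate=200.0, max_rate=1800.0):
    """One ALINEA step (Eq. 12): integral feedback on downstream occupancy (veh/h)."""
    rate = prev_rate + k_r * (desired_occupancy - measured_occupancy)
    # Practical deployments bound the metering rate; the bounds here are placeholders.
    return max(min_rate, min(max_rate, rate))

# Example: measured occupancy of 25% exceeds the 22% target, so the rate is reduced.
new_rate = alinea_update(prev_rate=900.0, measured_occupancy=25.0)
```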
DEVELOPMENT OF SIMULATION PLATFORM
SUMO Simulation Platform
An open-source traffic simulation package, the Simulation of Urban MObility (SUMO), is used as the simulation platform. Compared to macroscopic simulations such as the cell transmission model, SUMO can capture fine vehicle movements and generate traffic flow data with higher granularity. Compared to other microscopic simulators, SUMO provides numerous car-following and lane-changing models that can meet distinct needs for different road segments and study purposes. More importantly, SUMO gives users extensive access for further development and enables communication and interaction with external programs, for example through Python packages.
SUMO includes two main modules, the Netedit module and the Sumoconfig module. The Netedit module enables users to define the network information, the traffic dynamic information, and the flow demand. The Sumoconfig module enables users to define simulation information, such as the simulation period, waiting time, etc. SUMO also provides an Application Programming Interface (API) called TraCI. Users are allowed to retrieve real-time detector data or change the state of network elements (detectors, traffic control, etc.) by calling TraCI.
In our study, an interactive simulation-control system was established, as shown in Figure 3. The network information was edited and stored within the Netedit module and was incorporated into the Sumoconfig module. The DQN agent was defined in a Python Integrated Development Environment (IDE). At the beginning of the learning process, the Python IDE calls the TraCI API to start the simulation. The real-time simulation data can be obtained at each decision interval. The data are then sent back to the Python IDE to feed the DQN agent and generate the real-time RM control strategy. The state of the control changes according to the given RM strategy, which is also delivered by the Python IDE. The above steps form a loop. By following this loop, the freeway network can be continuously simulated to help the DQN agent iterate toward the optimal control strategy.
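The loop described above can be sketched with the TraCI Python API as follows; the configuration file name, detector IDs, traffic light ID, and the agent's act/observe interface are placeholders, and occupancies read from induction loops stand in for the densities used in the paper.

```python
import traci

SUMO_CMD = ["sumo", "-c", "freeway.sumocfg"]                 # placeholder Sumoconfig file
DETECTORS = ["det_upstream", "det_weaving", "det_ramp"]       # hypothetical detector IDs
TL_ID = "ramp_meter"                                          # hypothetical ramp signal ID
DECISION_INTERVAL = 15                                        # control cycle in seconds

def run_episode(agent, sim_seconds=4 * 3600):
    """Run one simulated episode, letting the agent set the ramp signal every interval."""
    traci.start(SUMO_CMD)
    discharge = 0
    for step in range(sim_seconds):
        traci.simulationStep()
        # Count vehicles passing the bottleneck detector for the reward of Eq. (10).
        discharge += traci.inductionloop.getLastStepVehicleNumber("det_bottleneck")
        if step % DECISION_INTERVAL == 0:
            state = [traci.inductionloop.getLastStepOccupancy(d) for d in DETECTORS]
            action = agent.act(state)                         # 0 = green, 1 = red
            # One character per controlled link; a single-link meter is assumed here.
            traci.trafficlight.setRedYellowGreenState(TL_ID, "G" if action == 0 else "r")
            agent.observe(state, action, reward=discharge)
            discharge = 0
    traci.close()
```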
FIGURE 3 SUMO simulation platform structure and interaction.
Experiment Design
A freeway weaving section is developed in the simulation model (see Figure 2). The upstream and downstream sections of the mainline consist of 3 lanes, and the weaving area is 250 meters long and contains 4 lanes. The speed limit of all lanes is 33 m/s (approximately 75 mph). To simulate real-world traffic flow features near the bottleneck area, the traffic demand and the parameters in SUMO were carefully calibrated based on empirical loop detector data obtained from the Freeway Performance Measurement System (PeMS) (50).
The duration of the simulation was set to 4 hours with a 30-min warm-up period. Traffic demand on the mainline and on-ramp started at 2200 veh/h and 400 veh/h in the warm-up hour. The peak period lasted for two hours with 3000 veh/h on the mainline and 1000 veh/h on the on-ramp. Then the demand dropped back to 2200 veh/h and 400 veh/h and lasted for another hour. The capacity of the freeway mainline before capacity drop was found to be 3300 veh/h, and the magnitude of the capacity drop was found to be 15.2%.
TRAINING OF DQN-BASED RM CONTROL STRATEGY
Experience replay and the freeze interval are the most distinct features separating the DQN from the traditional QL algorithm. According to previous studies (46, 48), the replay memory size should correspond to the specific experimental scenario, and an appropriate freeze interval can mitigate the correlation between experiences and stabilize the learning process. In our study, the parameters were carefully determined according to preliminary tests and suggestions from previous studies (36-37, 40-43, 49), as shown in Table 1.
The simulation was activated by calling the TraCI API. The maximum number of epochs for the training process is pre-defined. The DQN-based RM starts to perceive the traffic state at the beginning of every decision interval, chooses an action (green or red light), and generates a corresponding RM control strategy. The transition of this state-action pair is stored in the DQN agent's replay memory. At the end of the decision interval, the DQN agent would
perceive the new state and the corresponding reward for the previous action. The DQN agent learns and updates its policy with a batch of sampled memories using the given learning parameters. During the simulation, the total travel time was computed at each step and the bottleneck discharge flow was computed every 5 min.
TABLE 1 Learning Parameters for DQN-based RM Control Strategy
Parameter             Value
Optimizer             RMSProp
Replay memory size    10,000
Experience sampling   0.6
Learning rate         0.00025
Batch size            32
Exploration rate      from 0.9 to 0.1
Discount factor       0.99
Freeze interval       2,000
State matrix size     4
State matrix frame    1
Action size           2
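A compact sketch of the value/target network pair implied by Table 1 (RMSProp optimizer, learning rate 0.00025, discount factor 0.99, freeze interval 2,000) is given below; the mini-batch of size 32 is assumed to come from the prioritized sampler sketched earlier, and PyTorch and the hidden-layer width are assumptions, since the paper does not specify its DNN library or architecture.

```python
import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected network mapping the 4-element state to 2 action values."""
    def __init__(self, state_size=4, action_size=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_size, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_size))

    def forward(self, x):
        return self.net(x)

value_net = QNet()
target_net = copy.deepcopy(value_net)       # frozen copy used to compute targets
optimizer = torch.optim.RMSprop(value_net.parameters(), lr=0.00025)
gamma, freeze_interval = 0.99, 2000

def train_step(step, states, actions, rewards, next_states, is_weights):
    """One DQN update on a sampled mini-batch, weighted by the IS weights of Eq. (7)."""
    q = value_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = (is_weights * (target - q) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % freeze_interval == 0:          # refresh the frozen target network
        target_net.load_state_dict(value_net.state_dict())
    return (target - q).detach().abs()       # new |TD-errors| for updating priorities
```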
SIMULATION RESULTS
Results of DQN-based RM Control Strategy
For the DQN-based RM with the 15 s control cycle, the simulation with the smallest total travel time was reached at the 160th training epoch; for the DQN-based RM with the 30 s control cycle, the simulation with the smallest total travel time was reached at the 204th training epoch. The time for the agent to reach the optimal epoch varies, since the decision interval affects the learning time in each epoch. The DQN-based RM was capable of quickly finding an optimal solution for the control scenarios. We also considered the situation in which no control strategy was used.
The speed, occupancy, total travel time, and bottleneck discharge flow were collected and calculated. There was no difference between the strategies in the initial 30-min non-peak period after the warm-up period. Thus, Figure 4 (a) and (b) illustrate the speed and occupancy profiles at the bottleneck, starting from the peak hour, under the different control strategies. In the no-control scenario, congestion occurs and lasts until the end of the simulation. With the DQN-based RM strategies, the speed and occupancy profiles present similar trends, maintaining higher speeds and lower occupancies during peak hours. Table 2 lists the mean speed and occupancy. Compared to the no-control scenario, the two DQN-based RM strategies increased the mean speed from 5.87 m/s to 12.95 and 13.82 m/s and reduced the mean occupancy from 35.60% to 20.12% and 18.51%, respectively. The total travel time was reduced from 3306.26 veh·h to 1604.09 veh·h and 1633.94 veh·h with the DQN-based RM strategies, as compared to the no-control case, indicating a 51.48% and 50.58% reduction, respectively. The bottleneck discharge flow increased from 3185 veh/h to 3483 veh/h and 3462 veh/h with the two DQN-based RM strategies, indicating a 9.39% and 8.69% increase in the bottleneck discharge flow. The results suggest that the proposed DQN-based RM control strategy significantly reduced the total travel time and increased the bottleneck discharge flow at the freeway weaving bottleneck.
FIGURE 4 (a) Occupancy profiles (%); (b) speed profiles (m/s)
TABLE 2 Effects of DQN-based RM Control Strategies
Control strategy             Mean speed (m/s)   Mean occupancy (%)
No control                   5.87               35.60
DQN-based RM (cycle=15 s)    12.95              20.12
DQN-based RM (cycle=30 s)    13.82              18.51
The difference between the two DQN-based RM strategies can be attributed to the difference in control cycles, which affects the response speed. After each action is implemented, the ramp vehicles need a period to merge into the mainline and then affect the mainline traffic flow. Compared to the DQN-based RM with the 30 s control cycle, the DQN-based RM with the 15 s control cycle can change the control action every 15 s, which ensures a faster response to changes in the traffic state. As shown in Table 2, the mean occupancy was maintained around 20%, which is close to the occupancy that triggers the capacity drop (approximately 23% in our simulation). Note that if the control cycle is too short, for example 5 s, it can mislead the agent's learning and degrade the control performance, because the impact of the control action on the traffic flow may not yet have fully appeared.
Comparison between Different Control Strategies
The performances of different RM control strategies were compared. The fixed-time RM control strategies used 5 s (with a 2 s green phase and a 3 s red phase) and 7 s (with a 2 s green phase and a 5 s red phase) as the control cycle. Two ALINEA strategies were used with the parameters defined according to the traffic flow features in the bottleneck section (18): desired occupancy = 22% or 23%, control cycle = 80 s, and KR = 70 veh/h. The QL was also trained for the RM control based on the same state and action sets as the DQN. The experiments show that the QL was not able to converge, or in other words, to find the optimal control strategy in the training process. The main reason for this failure is that QL has limited ability to deal with large state spaces and complex control tasks.
The total travel time and the bottleneck discharge flow in the different scenarios are compared in Table 3 and Figure 5. The results show that the two fixed-time RM control strategies perform the worst, reducing the total travel time by only 23.33% and 32.75% and increasing the bottleneck discharge flow by 0.29% and 2.48%, respectively; the two ALINEA strategies improve the freeway system moderately, resulting in 41.77% and 38.45% reductions in total travel time and 7.79% and 7.21% increases in bottleneck discharge flow. All RM control strategies improve the freeway efficiency to some extent, but the two proposed DQN-based RM strategies outperform the others.
The causes of the differences between the RM control strategies are discussed as follows. The fixed-time RM control strategy cannot adjust to different traffic conditions. Though its effects can be slightly improved by setting proper control cycle parameters, such strategies performed the worst among all strategies. The control parameters in the ALINEA strategies rely heavily on human knowledge about the capacity drop and breakdown probabilities at bottlenecks, as different expected occupancies lead to different control effects. In addition, the feedback nature means the strategies are slow in responding to traffic changes and taking proper actions, which is another major cause of their reduced effects compared to the DQN-based RM control strategies.
TABLE 3 Comparison Result for Different Control Strategies
Control strategy             Total travel time (veh∙h)   Improvement (%)   Discharge flow (veh/h)   Improvement (%)
No control                   3306.26                     /                 3185                     /
Fixed-time 5-s               2534.96                     23.33             3194                     0.29
Fixed-time 7-s               2223.41                     32.75             3264                     2.48
ALINEA (occ=22)              1925.37                     41.77             3433                     7.79
ALINEA (occ=23)              2034.90                     38.45             3415                     7.21
DQN-based RM (cycle=15 s)    1604.09                     51.48             3483                     9.39
DQN-based RM (cycle=30 s)    1633.94                     50.58             3462                     8.69
FIGURE 5 (a),(c) Comparison of bottleneck discharge flow with different RM control
strategies; (b),(d) Comparison of total travel time with different RM control strategies. A:
No control; B: Pre-timed 3-s; C: Pre-timed 5-s; D: ALINEA (occ=22); E: ALINEA (occ=23);
F: DQN-based RM (cycle=15 s); G: DQN-based RM (cycle=30 s)
CONCLUSIONS AND DISCUSSION
This study proposed a DQN-based RM control strategy that aims at reducing total travel time at freeway weaving bottlenecks. A novel framework incorporating an open-source microscopic simulation package, SUMO, and a deep reinforcement learning agent was proposed to automatically obtain the optimal control action. A training procedure was proposed for the DQN-based RM in an off-line scheme, which could obtain the optimal control strategy within a short training period. The trained agent can then be applied to real-time RM control tasks. The results showed that the well-trained DQN-based RM has the capability of predicting traffic state transitions and acting in a proactive control scheme. In our simulation tests, improved performance in reducing total travel time and increasing bottleneck discharge flow was observed as compared to traditional RM control strategies. In addition, compared to QL-based RM control, the DQN-based RM uses continuous and larger state inputs, which allows it to perceive the traffic state and take control actions more precisely.
The major contribution of our proposed framework is that the DQN-based RM directly links the traffic parameters at freeway bottlenecks to the RM actions, relaxing the need for human prior knowledge of traffic flow theories (such as oscillation patterns, congestion causes, and capacity drop mechanisms). Only the state variables, actions, reward function, and DNN-related parameters need to be given; no other traffic-related parameters need to be calibrated in the proposed strategy. The proposed framework can be further extended and transferred to other control scenarios such as coordinated ramp metering, variable speed limits, and signal timing.
Though the training process is considered reasonably fast, the DQN-based RM still requires a number of state variables and sufficient training time to obtain the optimal control strategy before it can be applied. For large freeway networks or complex scenarios, the DQN agent may be difficult to converge, meaning the control performance may not be optimized. A balance should be carefully struck between computing efficiency and learning effectiveness. In addition, the DQN agent with the DNN behaves like a black box, which makes its inherent mechanism hard to explain. This may reduce the ease and practicability of transferring it to other case studies. A continuous learning procedure in the reinforcement learning algorithm is recommended
to enhance the robustness and transferability of its control performance when applied to new environments (51).
In the present study, a typical freeway weaving section was simulated with the proposed DQN-based RM control strategy. Other types of bottlenecks caused by lane reductions, traffic incidents, or work zones also need to be investigated. Besides, this study used the bottleneck discharge flow as the reward function in the DQN method, and successful reductions in travel time and increases in bottleneck discharge flow were observed. Future studies could consider other reward functions, such as social equity between ramp and mainline vehicles, for system optimization. Furthermore, for a complex freeway segment with two or more on-ramps, the coordination of multiple reinforcement learning agents with RM control strategies and variable speed limit control strategies can be evaluated to improve the overall performance in reducing freeway traffic congestion.
ACKNOWLEDGEMENTS
This research was jointly sponsored by the National Natural Science Foundation of China
(51508094), the National Key Research and Development Program of China: Key Projects of
International Scientific and Technological Innovation Cooperation Between Governments
(2016YFE0108000), and the Fundamental Research Funds for the Central Universities
(2242017K40130).
AUTHOR CONTRIBUTION STATEMENT
The authors confirm contribution to the paper as follows: study conception and design: Mofeng
Yang, Zhibin Li, Zemian Ke; simulation: Mofeng Yang, Meng Li; analysis and interpretation of
results: Mofeng Yang, Zhibin Li; draft manuscript preparation: Mofeng Yang, Zhibin Li. All
authors reviewed the results and approved the final version of the manuscript.
REFERENCES
1. Cassidy, M. J., & Rudjanakanoknad, J.. Increasing the Capacity of an Isolated Merge by
Metering its On-ramp. Transportation Research Part B: Methodological, 2005. 39(10):
896-913.
2. Chung, K., Rudjanakanoknad, J., and Cassidy, M. J.. Relation between Traffic Density and Capacity Drop at Three Freeway Bottlenecks. Transportation Research Part B: Methodological, 2007. 41(1): 82-95.
3. Zhang, L., and Levinson, D.. Ramp Metering and Freeway Bottleneck Capacity.
Transportation Research Part A: Policy and Practice, 2011. 44(4): 218-235.
4. Oh, S., and Yeo, H.. Microscopic Analysis on the Causal Factors of Capacity Drop in Highway Merging Sections. Presented at 91st Annual Meeting of the Transportation Research Board, Washington, D.C., 2012.
5. Transportation Research Board. Highway Capacity Manual. Technical report, Washington,
D.C., 2000.
6. Transportation Research Board. Highway Capacity Manual. Technical report, Washington,
D.C., 2010.
7. Paul Ryus, Mark Vandehey, Lily Elefteriadou, Richard G Dowling, and Barbara K Ostrom.
New TRB Publication: Highway Capacity Manual 2010. TR News, 2011. (273).
8. Yang, H., and Rakha, H.. Reinforcement Learning Ramp Metering Control for Weaving
Sections in a Connected Vehicle Environment. Presented at 96th Annual Meeting of the
Transportation Research Board, Washington, D.C., 2017.
9. Shaaban, K., Khan, M. A., and Hamila, R.. Literature Review of Advancements in
Adaptive Ramp Metering. Procedia Computer Science, 2016. 83: 203-211.
10. Papageorgiou, M., and Kotsialos, A.. Freeway ramp metering: An overview. IEEE
Transactions on Intelligent Transportation Systems, 2002. 3(4): 271-281.
11. Papageorgiou, M., and Papamichail, I.. Overview of Traffic Signal Operation Policies for
Ramp Metering. Transportation Research Record: Journal of the Transportation Research
Board, 2008. 2047: 28-36.
12. Papageorgiou, M., Hadj-Salem, H., and Blosseville, J. M.. ALINEA: A Local Feedback
Control Law for On-ramp Metering. Transportation Research Record: Journal of the
Transportation Research Board, 1991. 1320: 58-67.
13. Zhang, M., Kim, T., Nie, X., Jin, W., Chu, L., and Recker, W.. California Partners for
Advanced Transit and Highways (PATH), 2001.
14. Smaragdis, E., Papageorgiou, M., and Kosmatopoulos, E.. A Flow-maximizing Adaptive
Local Ramp Metering Strategy. Transportation Research Part B: Methodological, 2004.
38(3): 251-270.
15. Kotsialos, A., Papageorgiou, M., Hayden, J., Higginson, R., McCabe, K., and Rayman, N..
Discrete Release Rate Impact on Ramp Metering Performance. IEE Proceedings-
Intelligent Transport Systems, 2006. 153(1): 85-96.
16. Smaragdis, E., & Papageorgiou, M.. Series of New Local Ramp Metering Strategies:
Emmanouil smaragdis and markos papageorgiou. Transportation Research Record:
Journal of the Transportation Research Board, 2003. 1856: 74-86.
17. Demiral, C., & Celikoglu, H. B.. Application of ALINEA Ramp Control Algorithm to
Freeway Traffic Flow on Approaches to Bosphorus Strait Crossing Bridges. Procedia-
Social and Behavioral Sciences, 2011. 20: 364-371
18. Papageorgiou, M., Hadj-Salem, H., Middelham, F.. “ALINEA Local Ramp Metering:
Summary of Field Results”, Transportation Research Record: Journal of the
Transportation Research Board, 1997. 1603: 90-98
19. Bellemans, T., B.D. Schutter, and B.D. Moor.. Model Predictive Control for Ramp
Metering of Motorway Traffic: A Case Study. Control Engineering Practice, 2006. 4(7):
757–767.
20. Hegyi, A., De Schutter, B., and Hellendoorn, H.. Model Predictive Control for Optimal
Coordination of Ramp Metering and Variable Speed Limits. Transportation Research Part
C: Emerging Technologies, 2005. 13(3): 185-209.
21. Papamichail, I., Kotsialos, A., Margonis, I., and Papageorgiou, M. Coordinated Ramp
Metering for Freeway Networks–A Model-predictive Hierarchical Control Approach.
Transportation Research Part C: Emerging Technologies, 2010. 18(3): 311-331.
22. Zegeye, S. K., De Schutter, B., Hellendoorn, J., Breunesse, E. A., and Hegyi, A.. Integrated
Macroscopic Traffic Flow, Emission, and Fuel Consumption Model for Control Purposes.
Transportation Research Part C: Emerging Technologies, 2013. 31: 158-171.
23. Cassidy, M. J., & Windover, J. R.. Methodology for Assessing Dynamics of Freeway
Traffic Flow. Transportation Research Record: Journal of the Transportation Research
Board, 1995. 1484: 73-79.
24. Cassidy, M. Freeway On-ramp Metering, Delay Savings, and Diverge Bottleneck. Transportation Research Record: Journal of the Transportation Research Board, 2003. 1856: 1-5.
25. Kim, K., and Cassidy, M. J.. A Capacity-increasing Mechanism in Freeway Traffic.
Transportation Research Part B: Methodological, 2012. 46(9): 1260-1272.
26. Watkins, C., and Dayan, P.. Q-Learning. Machine Learning, 1992. 8(3): 279-292.
27. Sutton, R.S., and Barto, A. G.. Reinforcement Learning: An Introduction. Cambridge, MA:
MIT Press, 1998.
28. Barto, A. G., and Mahadevan, S.. Recent Advances in Hierarchical Reinforcement
Learning. Discrete Event Dynamic Systems, 2003. 13(1-2): 41-77.
29. Mahadevan, S.. Average Reward Reinforcement Learning: Foundations, Algorithms, and
Empirical Results. Machine Learning, 1996. 22(1-3): 159-195.
30. Wang, X., Liu, B., Niu, X., and Miyagi, T.. Reinforcement Learning Control for On-ramp Metering Based on Traffic Simulation. ICCTP 2009: Critical Issues in Transportation Systems Planning, Development, and Management, 2009. 1-7.
31. Veljanovska, K., Bombol, K. M., and Maher, T.. Reinforcement Learning Technique in Multiple Motorway Access Control Strategy Design. PROMET-Traffic&Transportation, 2010. 22(2): 117-123.
32. Davarynejad, M., Hegyi, A., Vrancken, J., and van den Berg, J. Motorway Ramp-metering
Control with Queuing Consideration Using Q-learning. Intelligent Transportation Systems
(ITSC), 2011 14th International IEEE Conference on. IEEE, 2012: 1652-1658.
33. Wang, X. J., Xi, X. M., and Gao, G. F.. Reinforcement Learning Ramp Metering without
Complete Information. Journal of Control Science and Engineering, 2012. 2.
34. Rezaee, K., Abdulhai, B., and Abdelgawad, H.. Application of Reinforcement Learning with Continuous State Space to Ramp Metering in Real-world Conditions. Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. IEEE, 2012. 1590-1595.
35. Rezaee, K.. Decentralized Coordinated Optimal Ramp Metering Using Multi-agent Reinforcement Learning. Doctoral dissertation, University of Toronto (Canada), 2014.
36. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and
Riedmiller, M.. Playing Atari with Deep Reinforcement Learning. arXiv preprint, 2013.
arXiv:1312.5602.
37. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... and
Petersen, S.. Human-level Control through Deep Reinforcement Learning. Nature, 2015.
518(7540): 529.
38. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... &
Dieleman, S. Mastering the Game of Go with Deep Neural Networks and Tree Search.
Nature, 2016. 529(7587): 484-489.
39. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... and
Chen, Y. Mastering the Game of Go without Human Knowledge. Nature, 2017. 550(7676):
354.
40. Li, L., Lv, Y., and Wang, F. Y.. Traffic Signal Timing via Deep Reinforcement Learning.
IEEE/CAA Journal of Automatica Sinica, 2016. 3(3): 247-254.
41. Van der Pol, E., and Oliehoek, F. A.. Coordinated Deep Reinforcement Learners for Traffic
Light Control. Proceedings of Learning, Inference and Control of Multi-Agent Systems at
NIPS, 2016.
42. Van der Pol, E.. Deep Reinforcement Learning for Coordination in Traffic Light Control.
Master’s thesis, University of Amsterdam, 2016.
43. Mousavi, S. S., Schukat, M., and Howley, E.. Traffic Light Control Using Deep Policy-
Gradient and Value-function-based Reinforcement Learning. IET Intelligent Transport
Systems, 2017. 11(7): 417-423.
44. Belletti, F., Haziza, D., Gomes, G., & Bayen, A. M.. Expert level control of ramp metering
based on multi-task deep reinforcement learning. IEEE Transactions on Intelligent
Transportation Systems, 2017.
45. Van Hasselt, H., Guez, A., & Silver, D.. Deep Reinforcement Learning with Double Q-
Learning. AAAI, 2016. 16: 2094-2100.
46. Schaul, T., Quan, J., Antonoglou, I., and Silver, D.. Prioritized Experience Replay. arXiv
preprint, 2015. arXiv:1511.05952.
47. Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S.. Weighted Importance Sampling for Off-policy Learning with Linear Function Approximation. Advances in Neural Information Processing Systems, 2014. 3014-3022.
48. Daganzo, C. F.. Fundamentals of transportation and traffic operations. Pergamon Press,
Oxford, 1997.
49. Lu, C., Huang, J., and Gong, J.. Reinforcement Learning for Ramp Control: An Analysis
of Learning Parameters. PROMET-Traffic&Transportation, 2016. 28(4): 371-381.
50. Chen, C., Petty, K., Skabardonis, A., Varaiya, P., and Jia, Z.. Freeway Performance
Measurement System: Mining Loop Detector Data. Transportation Research Record:
Journal of the Transportation Research Board, 2001. 1748: 96-102.
51. Li, Z., Liu, P., Xu, C., Duan, H., & Wang, W.. Reinforcement Learning-based Variable
Speed Limit Control Strategy to Reduce Traffic Congestion at Freeway Recurrent
Bottlenecks. IEEE Transactions on Intelligent Transportation Systems, 2017. 18(11):
3204-3217.
... On-ramp traffic has been shown to have an impact on mainline traffic flow, especially when the inflow rate of ramps is high. Additionally, merging traffic tends to increase the probability of traffic breakdowns [15,21]. Consequently, dynamic ramp meters generally use the traffic density of the mainline as a trigger threshold to ensure that the density of the mainline remains within an acceptable range [22]. ...
... The stationary sensors installed on the highway collect traffic information from the corresponding road segments. Previous research has demonstrated the temporal and spatial correlation of traffic conditions in adjacent roadway segments [11,21]. Therefore, traffic data from upstream, downstream, and ramps, from current and previous time intervals, have been collected for each segment of the roadway. ...
Article
Full-text available
Traffic breakdown is the transition of traffic flow from an uncongested state to a congested state. During peak hours, when a large number of on-ramp vehicles merge with mainline traffic, it can cause a significant drop in speed and subsequently lead to traffic breakdown. Therefore, ramp meters have been used to regulate the traffic flow from the ramps to maintain stable traffic flow on the mainline. However, existing traffic breakdown prediction models do not consider on-ramp traffic flow. In this paper, an algorithm based on artificial neural networks (ANN) is developed to predict the probability of a traffic breakdown occurrence on freeway segments with merging traffic, considering temporal and spatial correlations of the traffic conditions from the location of interest, the ramp, and the upstream and downstream segments. The feature selection analysis reveals that the traffic condition of the ramps has a significant impact on the occurrence of traffic breakdown on the mainline. Therefore, the traffic flow characteristics of the on-ramp, along with other significant features, are used to build the ANN model. The proposed ANN algorithm can predict the occurrence of traffic breakdowns on freeway segments with merging traffic with an accuracy of 96%. Furthermore, the model has been deployed at a different location, which yields a predictive accuracy of 97%. In traffic operations, the high probability of the occurrence of a traffic breakdown can be used as a trigger for the ramp meters.
... For instance, during a drone-assisted bridge inspection, the drone can initially learn from a human demonstration how to examine the bridge, identify defects, and navigate the environment securely. Subsequently, reinforcement learning can be applied to further optimize the drone's path planning and inspection strategy based on feedback from the inspection process (Ke et al. 2021;Yang et al. 2019). This approach can lead to a more efficient and secure inspection process, requiring less human intervention while delivering more effective performance in detecting and classifying defects. ...
Conference Paper
Bridge inspections are characterized by their labor-intensive nature and inherent risks, relying predominantly on engineers' visual analysis. Although the integration of drones has alleviated the safety concerns associated with human labor, the accurate identification of defects in vital elements continues to necessitate inspectors' specialized knowledge. Aggregating multi-inspector experiences can improve the localization of critical defects. The challenge lies in capturing and explaining drone trajectories into reusable and explainable strategies. This paper presents a framework to capture inspectors' strategies by analyzing drone control in bridge inspection simulations. It gathers and scrutinizes inspectors' drone control histories to understand their intentions. Due to the vast search space of inspection strategies in dynamic, uncertain contexts, imitation and reinforcement learning are utilized to learn reusability and explainability. Experiments demonstrate that drone trajectories aligned with bridge elements can explain inspection knowledge. Inspectors with explainable patterns, such as the human attention between the different spans inside the span, achieve better defect detection performance (correlation coefficient of 0.5). This framework promotes inspector-drone collaboration that adaptively supports human inspectors, resulting in more reliable inspections.
... Te parameters were carefully determined in our research based on the preliminary testing and recommendations from previous studies [21,29,[44][45][46][47]. Te parameters are shown in Table 1. ...
Article
Full-text available
Most of the current variable speed limit (VSL) strategies are designed to alleviate congestion in relatively short freeway segments with a single bottleneck. However, in reality, consecutive bottlenecks can occur simultaneously due to the merging flow from multiple ramps. In such situations, the existing strategies use multiple VSL controllers that operate independently, without considering the traffic flow interactions and speed limit differences. In this research, we introduced a multiagent reinforcement learning-based VSL (MARL-VSL) approach to enhance collaboration among VSL controllers. The MARL-VSL approach employed a centralized training with decentralized execution structure to achieve a joint optimal solution for a series of VSL controllers. The consecutive bottleneck scenarios were simulated in the modified cell transmission model to validate the effectiveness of the proposed strategy. An independent single-agent reinforcement learning-based VSL (ISARL-VSL) and a feedback-based VSL (feedback-VSL) were also applied for comparison. Time-varying heterogeneous traffic flow stemming from the mainline and ramps was loaded into the freeway network. The results demonstrated that the proposed MARL-VSL achieved superior performance compared to the baseline methods. The proposed approach reduced the total time spent by the vehicles by 18.01% and 17.07% in static and dynamic traffic scenarios, respectively. The control actions of the MARL-VSL were more appropriate in maintaining a smooth freeway traffic flow due to its superior collaboration performance. More specifically, the MARL-VSL significantly improved the average driving speed and speed homogeneity across the entire freeway.
... When using the reinforcement learning algorithm, the steps of setup, tuning, calculation, and measurement are more complex and require greater computing power support, but better optimization results are often achieved in ramp metering. For example, compared with ALINEA, the deep reinforcement learningbased ramp metering method can respond proactively to diferent trafc states and take more correct actions for trafc breakdown prevention [11]. Using an online reinforcement learning method to model ramp metering could avoid the establishment of an accurate trafc model and the reliance on prior knowledge, and this method could increase the average vehicle speed by 6.80% and decrease the total travel time by 5.22% when compared to the ALINEA [12]. ...
Article
Full-text available
The traffic congestion problem on urban expressways, especially in the weaving areas, has become severe. Some cooperative methods have been proven to be more effective than a separate approach in optimizing the traffic state in weaving areas on urban expressways. However, a cooperative method that combines channelization with ramp metering has not been presented and its effectiveness has not been examined yet. Thus, to fill this research gap, this study proposes a reinforcement learning-based cooperative method of channelization and ramp metering to achieve automated traffic state optimization in the weaving area. This study uses an unmanned aerial vehicle to collect the real traffic flow data, and four control strategies (i.e., two kinds of channelization methods, a ramp metering method, and a cooperative method of channelization and ramp metering) and a baseline (without controls) are designed in the simulation platform (Simulation of Urban Mobility). The speed distributions of different control strategies on each lane were obtained and analyzed in this study. The results show that the cooperative method of channelization and ramp metering is superior to other methods, with significantly higher increases in vehicle speeds. This cooperative method can increase the average vehicle speeds in lane-1, lane-2, and lane-3 by 14.51%, 14.81%, and 37.03%, respectively. Findings in this study can contribute to the improvement of traffic efficiency and safety in the weaving area of urban expressways.
... However, both questionnaire surveys and driving simulators are biased towards respondents' subjective factors, which might lead to overestimation of VSL systems' safety benefits. Traffic simulation methods have been widely used to examine both the operational and safety benefits of different ATM control strategies [6][7][8][9]. To better reproduce the real-world effect of VSL control, different traffic simulation methods have also been applied to evaluate VSL benefits by simulating different driver groups' behavioral responses to the VSL [7][8][9]. ...
Article
Variable speed limit (VSL) control dynamically adjusts the displayed speed limit to harmonize traffic speed, prevent congestion, and reduce crash risk based on prevailing traffic stream and weather conditions. Previous studies have examined the impacts of VSL control on reducing corridor-level crash risks and improving bottleneck throughput. However, less attention has been paid to using real-world data to examine how compliant drivers are under different VSL values and how aggregate driving behavior changes. This study aims to fill that gap. With high-resolution lane-by-lane traffic data collected from a European motorway, this study performs statistical analysis to measure the difference in driving behavior under different VSL values and analyzes the safety impacts of VSL controls on aggregate driving behavior (mean speed, average speed difference, and the percentage of small space headways). The data analytics show that VSL control can effectively decrease the mean speed, the speed difference, and the percentage of small space headways. The safety impacts of VSL control on aggregated driving behavior are also discussed. The aggregated driving behavior variables follow a trend of first decreasing and then increasing as VSL values continuously decrease, indicating that potential traffic safety benefits can be achieved by adopting suitable VSL values that match prevailing traffic conditions.
Conference Paper
The frequent emergence of traffic congestion on freeways can degrade the transportation system over time. Without appropriate countermeasures, congestion can escalate and adversely affect other parts of the traffic network. As a result, there is a growing need for reliable and optimal traffic control. The goal of this research is to manage the number of vehicles entering the main freeway from the ramp merging area, so as to balance demand and capacity and maximize utilization of the freeway capacity. Building on extensive research into different ramp metering techniques, this study utilizes a fuzzy cognitive map as a macroscopic traffic flow model in conjunction with the Q-learning algorithm. This combination prevents freeway congestion and maintains optimal performance by keeping freeway density below a key threshold. The inherent uncertainty of traffic conditions is addressed through reinforcement learning, which is built on the principles of the Markov decision process. The approach balances exploration and exploitation, as implemented through the Q-learning algorithm. The proposed technique was evaluated for its efficacy in regulating freeway ramp metering in both controlled and uncontrolled simulations. The findings demonstrate a significant improvement in the control of the mainstream traffic flow.
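As a rough illustration of the Q-learning component described above (the fuzzy-cognitive-map traffic model itself is not reproduced here), the sketch below shows a tabular Q-learning update in which the agent is rewarded for keeping freeway density below a critical threshold; all names and values are illustrative assumptions.

```python
import numpy as np

# Minimal tabular Q-learning sketch for ramp metering (illustrative only).
N_STATES, N_ACTIONS = 20, 4          # discretized density bins x metering rates
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
CRITICAL_DENSITY = 28.0              # veh/km/lane, assumed threshold

Q = np.zeros((N_STATES, N_ACTIONS))

def reward(density):
    """Reward free flow, penalize exceeding the critical density."""
    return 1.0 if density < CRITICAL_DENSITY else -(density - CRITICAL_DENSITY)

def choose_action(state, rng=np.random.default_rng()):
    """Epsilon-greedy exploration-exploitation trade-off."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(Q[state].argmax())

def update(state, action, density_next, state_next):
    """One Q-learning step after observing the next traffic state."""
    td_target = reward(density_next) + GAMMA * Q[state_next].max()
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```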
Article
Shared autonomous vehicles (SAVs) are a fleet of autonomous taxis that provide point-to-point transportation services for travellers and have the potential to reshape the transportation market in terms of operational costs, environmental outcomes, tolling efficiency, etc. However, the number of waiting passengers could become arbitrarily large when the fleet size is too small for the travel demand, which could cause an unstable network. An unstable network makes passengers impatient, and some will switch to alternative travel modes such as metro or bus. To achieve stable and reliable SAV services, this study designs a dynamic queueing model for waiting passengers and provides a fast maximum-stability dispatch policy for SAVs under which the average number of waiting passengers is bounded in expectation, as analytically proven using Lyapunov drift techniques. The stability proof is then extended to a more realistic scenario that accounts for exiting passengers. Unlike previous work, this study considers exiting passengers in the stability analysis for the first time. Moreover, under the proposed dispatch policy, maximum stability of the network does not require a planning horizon. The simulation results show that the proposed dispatch policy keeps the waiting queues and the number of exiting passengers bounded in several experimental settings.
Article
Ramp metering has been considered one of the most effective approaches to dealing with traffic congestion on freeways. Modelling freeway traffic flow dynamics is challenging because of its non-linearity and uncertainty. Recently, the Koopman operator, which maps a non-linear system to a linear system in an infinite-dimensional space, has been studied for modelling complex dynamics. In this paper, we propose a data-driven modelling approach based on neural networks, denoted the deep Koopman model, to learn a finite-dimensional approximation of the Koopman operator. To capture the sequential relations of the ramps and main roads on the freeway, a long short-term memory network is applied. Furthermore, a model predictive controller with the trained deep Koopman model is proposed for real-time control of ramp metering on the freeway. To validate the performance of the proposed approach, experiments are conducted in the traffic simulation software Simulation of Urban MObility (SUMO). The results demonstrate the effectiveness of the proposed approach for both dynamics prediction and real-time ramp metering control.
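The cited approach learns a lifting of the traffic state into a latent space where the dynamics are approximately linear and then runs model predictive control on that linear model. The sketch below illustrates the prediction-and-planning loop, with a hand-written polynomial lifting standing in for the learned encoder and exhaustive search standing in for a proper MPC solver; all matrices, names, and the density cost are assumptions.

```python
import numpy as np
from itertools import product

# Koopman-style prediction plus a brute-force MPC step for ramp metering.
# The lifting map and the (A, B) matrices are illustrative stand-ins for the
# learned deep Koopman model in the cited work.
def lift(x):
    """Stand-in for the learned encoder: lift the traffic state nonlinearly."""
    return np.concatenate([x, x**2])

def predict(z, u, A, B):
    """Linear latent dynamics: z' = A z + B u."""
    return A @ z + B @ u

def mpc_step(x0, A, B, candidate_rates, horizon=3, target_density=25.0):
    """Return the first metering rate of the best open-loop rate sequence."""
    best_cost, best_first = np.inf, candidate_rates[0]
    for seq in product(candidate_rates, repeat=horizon):
        z, cost = lift(x0), 0.0
        for r in seq:
            z = predict(z, np.array([r]), A, B)
            cost += (z[0] - target_density) ** 2   # z[0] assumed ~ bottleneck density
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first
```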
Article
Active traffic management (ATM) strategies are useful methods to reduce crash risk and improve safety on expressways. Although there are some studies on ATM strategies, few studies take the moving vehicle group as the object of analysis. Based on the crash risk prediction of moving vehicle groups in a connected vehicle (CV) environment, this study developed various ATM safety strategies, that is, variable speed limits (VSLs), ramp metering (RM), and coordinated VSL and RM (VSL-RM) strategies. VSLs were updated to minimize the crash risk of multiple moving vehicle groups in the next time interval, which is 1 min, and the updated speed limits were sent directly to the CVs in the moving vehicle group. The metering rate and RM opening time were determined using mainline occupancy, the crash risk of upcoming moving vehicle groups, and the predicted time at which moving vehicle groups arrived at the on-ramp. The VSL-RM strategy was used to simultaneously control and coordinate traffic flow on the mainline and ramps. These strategies were tested in a well-calibrated and validated micro-simulation network. The crash risk index and conflict count were utilized to evaluate the safety effects of these strategies. The results indicate that the ATM strategies improved the expressway safety benefits by 2.84–15.92%. The increase in CV penetration rate would promote the safety benefits of VSL and VSL-RM. Moreover, VSL-RM was superior to VSL and RM in reducing crash risk and conflict count.
Article
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo.
Article
Recent advances in combining deep neural network architectures with reinforcement learning techniques have shown promising results in solving complex control problems with high-dimensional state and action spaces. Inspired by these successes, in this paper we build two kinds of reinforcement learning agents, deep policy-gradient and value-function based, that can predict the best possible traffic signal for an intersection. At each time step, these adaptive traffic light control agents receive a snapshot of the current state of a graphical traffic simulator and produce control signals. The policy-gradient based agent maps its observation directly to the control signal, whereas the value-function based agent first estimates values for all legal control signals and then selects the control action with the highest value. Our methods show promising results in a traffic network simulated in the SUMO traffic simulator, without suffering from instability issues during the training process.
Thesis
The cost of traffic congestion in the EU is large, estimated to be 1% of the EU's GDP, and good solutions for traffic light control may reduce traffic congestion, saving time and money and reducing environmental pollution. To find optimal traffic light control policies, reinforcement learning uses reward signals from the environment to learn how to make optimal decisions. This approach can be deployed in traffic light control to learn optimal traffic light policies to reduce traffic congestion. However, earlier reinforcement learning approaches to traffic light control relied on simplifying assumptions over the state and manual feature extraction, so that potentially vital information about the state is lost. Techniques from the field of deep learning can be used in deep reinforcement learning to enable the use of more information over the state and to potentially find better traffic light policies. This thesis builds upon the Deep Q-learning algorithm and applies it to the problem of traffic light control. The contribution of this thesis is twofold: first, it extends earlier research on applying Deep Q-learning to the problem of controlling traffic lights on intersections with the goal of achieving optimal traffic throughput, and shows that, although Deep Q-learning can find very good policies for the traffic control problem without manual feature extraction, stability is not a guarantee. Second, it combines the Deep Q-learning algorithm with an existing multi-agent coordination algorithm to achieve cooperation between traffic lights and improves upon earlier work related to coordination for traffic light control. This thesis is the first work to combine transfer planning and deep reinforcement learning, an approach that is empirically shown to be promising.
Article
Reinforcement Learning (RL) has been proposed to deal with ramp control problems under dynamic traffic conditions; however, there is a lack of sufficient research on the behaviour and impacts of different learning parameters. This paper describes a ramp control agent based on the RL mechanism and thoroughly analyzes the influence of three learning parameters, namely the learning rate, discount rate, and action selection parameter, on the algorithm performance. Two indices for learning speed and convergence stability were used to measure the algorithm performance, based on which a series of simulation-based experiments were designed and conducted using a macroscopic traffic flow model. Simulation results showed that, compared with the discount rate, the learning rate and action selection parameter had more remarkable impacts on the algorithm performance. Based on the analysis, suggestions are provided on how to select parameter values that achieve superior performance.
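A rough sketch of the kind of parameter study described above is given below: combinations of learning rate, discount rate, and action selection parameter are scored by a learning-speed index and a convergence-stability index. The train_agent function is a hypothetical stand-in for the macroscopic ramp-control simulation, and the grid values are illustrative.

```python
import itertools
import statistics

def train_agent(alpha, gamma, epsilon, episodes=200):
    """Hypothetical trainer returning the reward obtained in each episode."""
    raise NotImplementedError("plug in the macroscopic ramp metering simulation here")

def evaluate(rewards, target=0.9):
    """Score one run: episodes until near-peak reward, and late-run variability."""
    peak = max(rewards)
    speed = next((i for i, r in enumerate(rewards) if r >= target * peak),
                 len(rewards))                      # learning-speed index
    stability = statistics.pstdev(rewards[-50:])    # convergence-stability index
    return speed, stability

grid = itertools.product([0.05, 0.1, 0.5],          # learning rate
                         [0.5, 0.9, 0.99],          # discount rate
                         [0.01, 0.1, 0.3])          # action selection (epsilon)
# for alpha, gamma, epsilon in grid:
#     speed, stability = evaluate(train_agent(alpha, gamma, epsilon))
```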
Conference Paper
Experience replay lets online reinforcement learning agents remember and reuse experiences from the past. In prior work, experience transitions were uniformly sampled from a replay memory. However, this approach simply replays transitions at the same frequency that they were originally experienced, regardless of their significance. In this paper we develop a framework for prioritizing experience, so as to replay important transitions more frequently, and therefore learn more efficiently. We use prioritized experience replay in Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved human-level performance across many Atari games. DQN with prioritized experience replay achieves a new state-of-the-art, outperforming DQN with uniform replay on 41 out of 49 games.
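For reference, the proportional variant of prioritized replay can be sketched as follows: each transition is stored with a priority derived from its TD error, sampling probabilities are proportional to priority raised to alpha, and importance-sampling weights correct the induced bias. This is a minimal illustrative version, not the authors' implementation.

```python
import numpy as np

# Minimal proportional prioritized replay sketch (illustrative only):
# transitions with larger TD error are replayed more often.
class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.buffer) >= self.capacity:          # drop the oldest entry
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, beta=0.4, rng=np.random.default_rng()):
        p = np.array(self.priorities)
        p = p / p.sum()                                # sampling probabilities
        idx = rng.choice(len(self.buffer), size=batch_size, p=p)
        weights = (len(self.buffer) * p[idx]) ** (-beta)   # importance sampling
        weights = weights / weights.max()
        return [self.buffer[i] for i in idx], idx, weights
```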
Article
The primary objective of this paper was to incorporate the reinforcement learning technique in variable speed limit (VSL) control strategies to reduce system travel time at freeway bottlenecks. A Q-learning (QL)-based VSL control strategy was proposed. The controller included two components: a QL-based offline agent and an online VSL controller. The VSL controller was trained to learn the optimal speed limits for various traffic states to achieve a long-term goal of system optimization. The control effects of the VSL were evaluated using a modified cell transmission model for a freeway recurrent bottleneck. A new parameter was introduced in the cell transmission model to account for the overspeed of drivers in unsaturated traffic conditions. Two scenarios that considered both stable and fluctuating traffic demands were evaluated. The effects of the proposed strategy were compared with those of the feedback-based VSL strategy. The results showed that the proposed QL-based VSL strategy outperformed the feedback-based VSL strategy. More specifically, the proposed VSL control strategy reduced the system travel time by 49.34% in the stable demand scenario and 21.84% in the fluctuating demand scenario.
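The cell transmission model used above propagates densities by taking the minimum of each cell's sending and receiving flows, with the speed limit capping the sending flow. A minimal single-lane sketch with assumed parameter values is given below; the overspeed parameter introduced in the paper is omitted.

```python
import numpy as np

# Minimal cell transmission model (CTM) step with the speed limit capping each
# cell's sending flow; parameter values are illustrative.
def ctm_step(density, v_limit, inflow=1800.0, dt=10/3600, dx=0.5,
             q_max=2200.0, w=20.0, rho_jam=180.0):
    """density: veh/km per cell; v_limit: km/h per cell; flows in veh/h."""
    sending = np.minimum(v_limit * density, q_max)           # demand of each cell
    receiving = np.minimum(q_max, w * (rho_jam - density))   # supply of each cell
    flow = np.empty(len(density) + 1)                        # interface flows
    flow[0] = min(inflow, receiving[0])                      # upstream boundary
    flow[1:-1] = np.minimum(sending[:-1], receiving[1:])     # interior interfaces
    flow[-1] = sending[-1]                                   # free outflow
    return density + dt / dx * (flow[:-1] - flow[1:])        # conservation update
```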
Article
This article shows how the recent breakthroughs in Reinforcement Learning (RL) that have enabled robots to learn to play arcade video games, walk or assemble colored bricks, can be used to perform other tasks that are currently at the core of engineering cyberphysical systems. We present the first use of RL for the control of systems modeled by discretized non-linear Partial Differential Equations (PDEs) and devise a novel algorithm to use non-parametric control techniques for large multi-agent systems. We show how neural network based RL enables the control of discretized PDEs whose parameters are unknown, random, and time-varying. We introduce an algorithm of Mutual Weight Regularization (MWR) which alleviates the curse of dimensionality of multi-agent control schemes by sharing experience between agents while giving each agent the opportunity to specialize its action policy so as to tailor it to the local parameters of the part of the system it is located in.
Article
In this paper, we propose a set of algorithms to design signal timing plans via deep reinforcement learning. The core idea of this approach is to set up a deep neural network (DNN) to learn the Q-function of reinforcement learning from the sampled traffic state/control inputs and the corresponding traffic system performance output. Based on the obtained DNN, we can find the appropriate signal timing policies by implicitly modeling the control actions and the change of system states. We explain the possible benefits and implementation tricks of this new approach. The relationships between this new approach and some existing approaches are also carefully discussed.
Chapter
This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
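The R-learning method evaluated in this chapter maintains a running estimate of the average reward and learns action values relative to it. A minimal sketch of the update, under assumed step sizes and a tabular Q, is:

```python
import numpy as np

# Sketch of the R-learning update (average-reward RL); step sizes are illustrative.
def r_learning_step(Q, rho, s, a, r, s_next, alpha=0.1, beta=0.01, greedy=True):
    """Update relative action values Q and the average-reward estimate rho."""
    td = r - rho + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td
    if greedy:  # update rho only when a non-exploratory (greedy) action was taken
        rho += beta * (r + Q[s_next].max() - Q[s].max() - rho)
    return Q, rho
```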