
Smart Edge-Enabled Traffic Light Control: Improving Reward-Communication Trade-offs with Federated Reinforcement Learning

Nathaniel Hudson, Pratham Oza, Hana Khamfroush, and Thidapat Chantem
Department of Computer Science, University of Kentucky
Department of Electrical & Computer Engineering, Virginia Tech
Abstract—Traffic congestion is a costly phenomenon of every-
day life. Reinforcement Learning (RL) is a promising solution due
to its applicability to solving complex decision-making problems
in highly dynamic environments. To train smart traffic lights
using RL, large amounts of data are required. Recent RL-based
approaches consider training to occur on some nearby server or
a remote cloud server. However, this requires that traffic lights
all communicate their raw data to some central location. For
large road systems, communication cost can be impractical, par-
ticularly if traffic lights collect heavy data (e.g., video, LIDAR).
As such, this work pushes training to the traffic lights directly
to reduce communication cost. However, completely independent
learning can reduce the performance of trained models. As such,
this work considers the recent advent of Federated Reinforcement
Learning (FedRL) for edge-enabled traffic lights so they can
learn from each other’s experience by periodically aggregating
locally-learned policy network parameters rather than share
raw data, hence keeping communication costs low. To do this,
we propose the SEAL framework which uses an intersection-
agnostic representation to support FedRL across traffic lights
controlling heterogeneous intersection types. We then evaluate
our FedRL approach against Centralized and Decentralized RL
strategies. We compare the reward-communication trade-offs of
these strategies. Our results show that FedRL is able to reduce
the communication costs associated with Centralized training
by 36.24%, while only seeing a 2.11% decrease in average
reward (where higher reward corresponds to decreased traffic congestion).
Index Terms—Smart Traffic, Traffic Light Control, Reinforcement Learning, Edge Computing, Federated Learning
I. INTRODUCTION
According to recent transportation analytics data by INRIX,
traffic congestion cost the United States economy $88 billion
in 2019 alone [1]. Traffic congestion poses a constant threat to
the economy and safety within an urban environment, which
can be alleviated by using the compute and communication
resources available in smart cities. Urban traffic networks
exemplify a typical cyber-physical system (CPS) where data, communication, and
connected infrastructure can now jointly optimize traffic oper-
ations within a road network. Communication capabilities of
the vehicles, traffic lights, and other road-side units (RSUs)
powered by vehicle-to-everything (V2X) and vehicular ad-hoc
networks (VANETs) provide opportunities for novel strategies
to mitigate traffic congestion over large and complex urban
road networks [2], [3]. Such strategies may require reliable
computing resources for the strict needs of urban traffic
networks. The recent advent of Edge Computing (EC) [4]
pushes compute resources to the network edge via compute
node servers, known as “edge servers”, that are close to the
smart city infrastructure. EC can be used to support more
compute-intensive tasks for vehicular networks.
Many recent works trying to support smart decision making
for traffic lights (commonly referred to as adaptive traffic
signal control) consider Reinforcement Learning (RL)-based
approaches [5], [6], [7], [8], [9], [10], [11]. RL is a popular
technique for training sequential decision-making policies for
problems that are highly dynamic and complex. Smart traffic
light strategies that incorporate RL typically employ either a
centralized [12], [13], [14] or decentralized [15], [10], [16]
technique for training policies. In the centralized case, a
policy is trained (typically on a roadside server) from the
observations collected by detectors and other infrastructural
components throughout the system. This central, roadside
server then communicates actions to each of the traffic lights.
Because the policy is learning over observations throughout
the road network, these approaches perform well in terms of
maximizing total reward. However, in practice, the amount
of communication needed to send all observational data to
the server can be costly. Decentralized approaches push the
policy training to the traffic lights based on observations local
to that traffic light, meaning less communication is needed
since training is local to the traffic light itself. However, in
decentralized approaches, the performance of the trained poli-
cies can be compromised because policies are learning in an
isolated and independent manner. Therein lies a natural trade-
off between policy performance w.r.t. maximizing reward and
the communication cost associated with training. To the best
of our knowledge, this trade-off has not been formally studied
for smart traffic light control with RL.
To this end, we study the reward-communication trade-off
for training smart traffic light control policies in an edge-
enabled traffic system. We do this by proposing a Federated
Reinforcement Learning (FedRL) technique inspired by the
recent Federated Learning (FL) paradigm [17], [18]. Under
our FedRL technique, we train traffic lights in a decentralized
manner to reduce overall communication costs. Periodically,
traffic lights will communicate their current policy network to
a roadside edge server (hereafter referred to as “edge-RSU”)
Fig. 1. Example of our traffic system where edge-enabled traffic lights (e.g., one controlling 4 lanes, another controlling 2 lanes), each hosting a smart traffic AI model, communicate with an edge-enabled roadside unit (Edge-RSU).
which will then aggregate the policy network parameters
using a weighted averaging method based on total reward.
This newly-averaged policy network is then distributed to
traffic lights for further training until the next aggregation
phase. This aggregation will allow traffic lights to learn from
each other without sharing raw observational data. For our
FedRL to work, representation of current traffic conditions
must be consistent across the road network, even in the
face of heterogeneous intersection types. In this way, the
representation needs to be transferable across road networks
and intersections. For this, we design a novel, intersection-
agnostic Markov Decision Process (MDP) [19] which we refer
to as Smart Edge-enabled trAffic Lights (SEAL). The central
contributions of this work can be summarized as follows:
• Design of a novel, intersection-agnostic MDP for representing traffic conditions at traffic lights, which we call SEAL. SEAL is designed to provide a general representation of traffic conditions at intersections.
• Proposal of a Federated Reinforcement Learning (FedRL) approach for training RL decision-making policies for smart traffic light control.
• Improvement of the reward-communication cost trade-off associated with solving SEAL using our proposed FedRL approach, reducing communication costs by 36.24% on average while losing only 2.11% in average reward when compared to Centralized training.
II. SYSTEM DESCRIPTION
We now describe the system requirements for traffic infras-
tructure, data, and communication capabilities for our model.
Fig. 1 shows a typical traffic environment where our model
could be deployed. Our system considers a road network with
one or more intersections (depending on the road topology),
each equipped with a traffic light $k \in \mathcal{K}$, where $\mathcal{K}$ denotes the
set of traffic lights in the entire system. Each traffic light $k \in \mathcal{K}$
controls the traffic flow entering the intersection through its
incoming lanes. The set of such controlled lanes is denoted
by $\mathcal{L}_k$. A traffic controller, located either at each intersection
or at a server, calculates a "phase state" $\varphi_k^t$ for the
signals at a given traffic light $k$ at time-step $t$. The assigned
phase state is such that each signal will be assigned a
green, yellow, or red "signal state", represented by G, y, and r,
respectively. Therefore, a phase state is a string representing
the signal states of the traffic lights at all controlled lanes at an
intersection. For example, the phase state for an intersection
with eight controlled lanes would be GyrrGyrr. For a visual
example of phase states, refer to Fig. 2. Note that the phase
states are assigned such that the vehicles with conflicting
traffic flows are not allowed to access the intersection at once.
The length of the phase state is based on the number of
incoming controlled lanes at a given intersection. Our model
also expects that the vehicles obey the traffic regulations and
do not violate the assigned phase permissions indicated by
the traffic lights. Finally, in our system, we assume each
traffic light $k \in \mathcal{K}$ cannot change phase states until 4 seconds
have elapsed since $k$'s last phase state change. Further, we
assume each traffic light $k \in \mathcal{K}$ must change after 120 seconds
have elapsed since $k$'s last phase state change. These timings
are calculated in accordance with the U.S. federal highway
administration (FHWA) guidelines based on average traffic
behavior [20] and can be changed as per traffic regulatory
requirements. This is enforced for all training and evaluation.
Traffic infrastructure is either equipped with road-side sensors
installed within every controlled lane to measure traffic
parameters such as lane occupancy, average traffic flow speed,
etc. (detailed in §III), or relies on connected vehicles to report
such data to the traffic lights via the connected infrastructure.
Traffic lights are equipped with edge compute
resources to process the data and perform local learning. The
edge resources also enable connectivity among all traffic
lights within the traffic network, as well as with the centralized cloud
server, to enable global optimization of the learning models.
For simplicity, we assume the presence of a single deployed
edge-RSU server in the region that maintains communication
channels to all the traffic lights in that region to support additional
processes. Additionally, both the traffic lights and the edge-RSU
server are equipped with compute resources. As
a simplifying assumption, we assume compute resources at
both the traffic lights and the edge-RSU server are sufficient to
train policies for smart traffic decisions. Succinctly, this work
aims to improve the implicit reward-communication trade-off
associated with distributed learning solutions to support smart
traffic systems using FedRL.
III. PROPOSED SEAL MODEL DEFINITION
Here, we define the Smart Edge-enabled trAffic
Lights (SEAL) system. SEAL is modeled as a Markov
Decision Process (MDP) [19] with the goal to minimize
traffic congestion in road networks. SEAL's novelty is in
defining a general state space representation that can describe
current traffic conditions at a traffic light in an intersection-
agnostic way. This is necessary to support policy aggregation
in our FedRL approach (discussed later in §IV-C).
The work most similar to ours is that of Zhou et al.’s DRLE
framework in [15]. This work considers a distributed
multi-agent RL approach to smart traffic light control with
convergence guarantees. However, this does not consider the
possibility of traffic lights themselves training their own policy
Fig. 2. Example traffic light action transition graph over the phase states GGrGGr, yyryyr, rrGrrG, and rryrry. Consider that the given traffic light $k$'s current phase state is GGrGGr. If the action $a_k^t = 1$ at time-step $t$, then the phase state for $k$ will transition to yyryyr if sufficient time has elapsed since its last transition. Otherwise, its phase state remains the same, unless too much time has elapsed since its last change.
networks. Instead, the DRLE framework sets traffic lights
to communicate their local state observations to a roadside
server to perform state aggregation to form a “global” state.
This global statefulness allows for convergence guarantees, but
may not be attractive for future solutions where traffic lights
may collect large volumes of data (e.g., hyper-spectral images,
videos, LIDAR imaging, etc.) to make decisions. Having large
numbers of traffic lights stream these data in real time to
make timely decisions may not scale well. Thus, we consider
SEAL. Investigating possible convergence bounds for SEAL
is of interest but is beyond the scope of this work.
A. Action Space
In prior works investigating the use of RL for traffic light
control, various kinds of actions have been considered. These
include phase switch [15], [16], phase duration [9], and the
phase state itself [7]. The phase state considers a discrete
space of size $n$, where $n$ is the number of possible states for
a traffic light. Since the phase state depends on the number of
controlled lanes, and hence the traffic lights at an intersection,
it is infeasible to aggregate knowledge among intersections
with varying topologies. For this work, we consider a simpler
phase switch approach in which each traffic light $k \in \mathcal{K}$
takes an action $a_k^t \in \{0, 1\}$ in time-step $t$, where
$a_k^t = 1$ signifies that traffic light $k$ will attempt to change to
the next phase state. Otherwise, $a_k^t = 0$ signifies that no phase
state change will be attempted by traffic light $k$ at time-step $t$.
Note that if a traffic light $k$ attempts to change in some time-step
$t$ (i.e., $a_k^t = 1$), a change can only occur if enough time
has elapsed since its last change; further, a traffic light $k$ will
be forced to change its phase state, regardless of its action, if
too much time has elapsed since its last change. This is due
to the phase state timer (discussed in §II), which ensures policies
meet mandatory regulations related to road safety [20]. Refer
to Fig. 2 for an illustrated example of phase state logic and the
transitions made when $a_k^t = 1$.
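To make the action semantics above concrete, the following minimal sketch (in Python) shows how an action $a_k^t$ could be translated into an actual phase transition under the 4-second minimum and 120-second maximum dwell times from §II. The class and method names are illustrative assumptions, not our released implementation.

```python
MIN_PHASE_DURATION = 4    # seconds; minimum dwell time before a change is allowed (Sec. II)
MAX_PHASE_DURATION = 120  # seconds; a change is forced once this much time has elapsed

class PhaseController:
    """Hypothetical per-traffic-light controller applying the phase-switch action a_k^t."""

    def __init__(self, phase_cycle):
        self.phase_cycle = phase_cycle      # e.g., ["GGrGGr", "yyryyr", "rrGrrG", "rryrry"]
        self.phase_index = 0
        self.time_since_change = 0.0

    def step(self, action, dt=1.0):
        """Advance one time-step; `action` is a_k^t in {0, 1}. Returns the phase state."""
        self.time_since_change += dt
        force_change = self.time_since_change >= MAX_PHASE_DURATION
        allowed = self.time_since_change >= MIN_PHASE_DURATION
        if force_change or (action == 1 and allowed):
            self.phase_index = (self.phase_index + 1) % len(self.phase_cycle)
            self.time_since_change = 0.0
        return self.phase_cycle[self.phase_index]
```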
B. State Space
State space features consist of the following for a traffic
light $k$ in time-step $t$: lane occupancy ($o_k^t$), halted lane
occupancy ($h_k^t$), average speed ($\psi_k^t$), and phase state ratios ($\varphi_k^t(\cdot)$)
for each possible signal state (e.g., green, yellow, red).
1) Lane Occupancy: The average ratio of occupancy across
all lanes controlled by a traffic light $k$ in time-step $t$. Each
traffic light $k$ controls some set of lanes. Thus, we consider
the occupancy of a lane $l$ to be how much of the lane's length
(in meters) is occupied by vehicles (as a ratio). However, we
average this across all lanes controlled by traffic light $k$. The
formal definition for lane occupancy is provided below:
$$o_k^t \triangleq \frac{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{V}_l^t} \mathrm{len}(v)}{\sum_{l \in \mathcal{L}_k} \mathrm{len}(l)} \qquad (1)$$
where $\mathcal{L}_k$ is the set of lanes controlled by traffic light $k$, $\mathcal{V}_l^t$ is
the set of vehicles occupying lane $l$ in time-step $t$, and $\mathrm{len}(\cdot)$
is the length of the vehicle or lane (in meters).
2) Halted Lane Occupancy: SEAL's goal is to minimize
congestion in road systems. Thus, we consider how much of
a lane is occupied with halted vehicles. As such, we consider
$h_k^t$ to be the halted lane occupancy of traffic light $k$ in time-step
$t$, where we consider a vehicle to be halted if its current
speed is at most 0.1 meters/second. Thus, we define $h_k^t$ below:
$$h_k^t \triangleq \frac{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{H}_l^t} \mathrm{len}(v)}{\sum_{l \in \mathcal{L}_k} \mathrm{len}(l)} \qquad (2)$$
where $\mathcal{H}_l^t$ is the set of halted vehicles occupying lane $l$ in
time-step $t$.
3) Average Speed: We also consider the average speed ($\psi_k^t$)
among vehicles occupying lanes controlled by a traffic light $k$
at time-step $t$ as a feature. Similar to the other features, this
one is also normalized as a ratio in the range $[0, 1]$. The formal
definition is below:
$$\psi_k^t \triangleq \begin{cases} \dfrac{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{V}_l^t} \min\!\left(\mathrm{spd}_v^t,\, \mathrm{spd}_l^{\max}\right)}{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{V}_l^t} \mathrm{spd}_l^{\max}} & \text{if } \left|\bigcup_{l \in \mathcal{L}_k} \mathcal{V}_l^t\right| \geq 1 \\ 1.0 & \text{otherwise} \end{cases} \qquad (3)$$
where $\mathrm{spd}_v^t$ is the moving speed of vehicle $v$ in time-step $t$
and $\mathrm{spd}_l^{\max}$ is the speed limit (or maximum speed allowed)
on lane $l$. The second case in Eq. (3) is for cases when there
are no vehicles occupying lanes controlled by traffic light $k$.
4) Phase State Ratio: The current phase state of a traffic
light has been used as a feature in prior works (namely, [15]).
This is possible because simple road networks are considered
with homogeneous intersections where traffic lights have the
same sets of possible phase states. To handle heterogeneous
phase state sets across different intersection types, we instead
represent the ratio of how much each possible traffic light signal
(e.g., green, yellow, red) makes up the entire phase state. Thus,
we denote the ratio of a traffic light signal for a traffic light $k$ in
time-step $t$ by $\varphi_k^t(\cdot) \in [0, 1]$. For instance, given a phase state
GGrGGr at traffic light $k$ in time-step $t$, we denote how much
of the phase state is red lights, r, by $\varphi_k^t(\mathrm{r}) = 2/6$ (similarly,
for prioritized green lights, G, $\varphi_k^t(\mathrm{G}) = 4/6$). Because we
represent the ratio rather than assign an arbitrary discrete value
to represent the entire phase state, the representation is general
and can be used across different road networks with various
intersections. It should be noted that $\sum_{p \in \mathcal{P}_k} \varphi_k^t(p) = 1 \;\; \forall (k, t)$,
where $\mathcal{P}_k$ is the set of signal states for traffic light $k$.
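As a concrete illustration of Eqs. (1)–(3) and the phase state ratios, the sketch below computes the four state features for a single traffic light from per-lane measurements. The input structure (a list of per-lane records with vehicle lengths and speeds) is a hypothetical stand-in for the actual sensor/SUMO interface.

```python
def state_features(lanes, phase_state, halt_speed=0.1):
    """Compute (o_k^t, h_k^t, psi_k^t, phase ratios) for one traffic light.

    `lanes` is a list of dicts, one per controlled lane, each with keys:
      "length" (m), "speed_limit" (m/s), and "vehicles": list of (veh_length, veh_speed).
    `phase_state` is a string such as "GGrGGr".
    """
    total_lane_len = sum(l["length"] for l in lanes)
    occupied = sum(v_len for l in lanes for v_len, _ in l["vehicles"])
    halted = sum(v_len for l in lanes for v_len, v_spd in l["vehicles"] if v_spd <= halt_speed)

    occupancy = occupied / total_lane_len          # Eq. (1)
    halted_occupancy = halted / total_lane_len     # Eq. (2)

    # Eq. (3): normalized speed ratio, defaulting to 1.0 when no vehicles are present.
    num = sum(min(v_spd, l["speed_limit"]) for l in lanes for _, v_spd in l["vehicles"])
    den = sum(l["speed_limit"] for l in lanes for _ in l["vehicles"])
    avg_speed = num / den if den > 0 else 1.0

    # Phase state ratios, e.g., {"G": 4/6, "y": 0.0, "r": 2/6} for "GGrGGr".
    ratios = {s: phase_state.count(s) / len(phase_state) for s in ("G", "y", "r")}
    return occupancy, halted_occupancy, avg_speed, ratios
```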
C. Reward Function
The goal of SEAL is to reduce congestion in a given road
network. With that in mind, we let the reward $r_k^t$ for a traffic
light $k$ at time-step $t$ be a function of both lane occupancy ($o_k^t$)
and halted lane occupancy ($h_k^t$). We define it below:
$$r_k^t \triangleq -\left(o_k^t + h_k^t\right)^2. \qquad (4)$$
These state space features are summed to penalize traffic lights
with more congestion. We let halted vehicles incur more
penalty since they contribute to both lane occupancy and halted
lane occupancy. From there, we define the total reward, $r^t$,
over the whole road network at time-step $t$ as
$$r^t \triangleq \sum_{k \in \mathcal{K}} r_k^t. \qquad (5)$$
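A minimal sketch of Eqs. (4) and (5), assuming the per-light occupancy features are computed as in the state-space sketch above:

```python
def local_reward(occupancy, halted_occupancy):
    """Eq. (4): penalize congestion; halted vehicles count twice (once in each term)."""
    return -((occupancy + halted_occupancy) ** 2)

def total_reward(per_light_features):
    """Eq. (5): sum of local rewards over all traffic lights in the road network."""
    return sum(local_reward(o, h) for (o, h) in per_light_features)
```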
D. Communication Model
As discussed in §II, we require robust communication capa-
bilities between vehicles, traffic lights and edge-enabled RSUs
to support smart traffic control. Depending on the training
approach (detailed in §IV), a traffic control system must ac-
count for different communication channel utilization and their
incurred costs. We therefore consider the following 6 different
types of possible communications that can take place under
the SEAL system: (i) policy network parameters from edge-
RSU to traffic light, (ii) policy network parameters from traffic
light to edge-RSU, (iii) action from edge-RSU to traffic light,
(iv) observations from traffic light to edge-RSU, (v) vehicle-
to-infrastructure (V2I) communication from vehicle to traffic
light, and (vi) congestion ranks from edge-RSU to traffic
light. We will evaluate the associated communication costs
while training our proposed model in §VI. To reiterate,
we assume that edge-enabled traffic lights and the edge-RSU
have sufficient compute capacity to perform policy training.
Thus, we do not consider compute constraints and focus on
communication cost instead.
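The sketch below illustrates one way the communication cost could be accounted for during training by tallying the bytes of each of the six communication types listed above. The enumeration and the tracker interface are illustrative assumptions rather than the exact bookkeeping used in our experiments.

```python
from collections import defaultdict
from enum import Enum, auto

class CommType(Enum):
    PARAMS_RSU_TO_TL = auto()   # (i) policy network parameters, edge-RSU -> traffic light
    PARAMS_TL_TO_RSU = auto()   # (ii) policy network parameters, traffic light -> edge-RSU
    ACTION_RSU_TO_TL = auto()   # (iii) action, edge-RSU -> traffic light
    OBS_TL_TO_RSU = auto()      # (iv) observations, traffic light -> edge-RSU
    V2I_VEH_TO_TL = auto()      # (v) vehicle -> traffic light (V2I)
    RANKS_RSU_TO_TL = auto()    # (vi) congestion ranks, edge-RSU -> traffic light

class CommCostTracker:
    """Accumulates total bytes transmitted, broken down by communication type."""

    def __init__(self):
        self.bytes_by_type = defaultdict(int)

    def record(self, comm_type: CommType, num_bytes: int, count: int = 1):
        """Record `count` messages of `comm_type`, each of size `num_bytes`."""
        self.bytes_by_type[comm_type] += num_bytes * count

    def total_bytes(self) -> int:
        return sum(self.bytes_by_type.values())
```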
IV. TRAINING ALGORITHMS
The goal of SEAL is to learn optimal traffic light control
policies to minimize congestion for a given road network. To
solve the SEAL model, we adopt model-free reinforcement
learning techniques. More specifically, we will incorporate the
recent Proximal Policy Optimization (PPO) [21] algorithm.
Solutions to SEAL will aim to find a smart traffic light control
policy, $\pi$, such that
$$Q^\pi(s, a) = (1 - \gamma) \cdot \mathbb{E}\!\left[\, \sum_{t=1}^{\infty} \gamma^{t-1} \cdot r^t \;\Big|\; s^1 = s,\, a^1 = a \right] \qquad (6)$$
is maximized, where the policy is a decision-making function
$\pi : \mathcal{S} \mapsto \mathcal{A}$ and $\gamma$ is the discount factor. Eq. (6) is known
as the Q-function. The optimal policy that maximizes the Q-function
is defined as $\pi^\star = \arg\max_\pi Q^\pi(s, \pi(s))\ \forall s$. For the
sake of convenience, we denote $Q^\star(s, a) = Q^{\pi^\star}(s, a)\ \forall (s, a)$,
where $s$ and $a$ are a state and action, respectively.

Fig. 3. Training approaches considered for solving SEAL: Centralized, Decentralized, and Federated training.
RL algorithms can be implemented in real-world systems
in various ways. As such, we consider 3 different approaches
for facilitating the PPO algorithm to solve SEAL: (i) centralized
training, (ii) decentralized training, and (iii) federated
training. A visual example of how these approaches compare
can be found in Fig. 3. For a comprehensive overview on the
theory of RL, please refer to [22].
A. Centralized Training
Under centralized training, there is a single policy network
that is hosted on the nearby edge-RSU. At each time-step t,
each traffic light $k \in \mathcal{K}$ submits its current state $s_k^t$ to
the edge-RSU, which then returns an action $a_k^t$ to traffic
light $k$. Since a single policy network is learning across all
observations in the system, it is expected to learn the optimal
policy faster than the other approaches. However, this comes at the
expense of incurring a large amount of overhead in terms of
communication cost, because the traffic lights are in constant
communication with the edge-RSU to take actions. For this
work, we view this approach as an upper bound in terms of
how quickly the optimal policy, $\pi^\star$, can be learned.
1) Centralized Training Communication Costs: Decision-
making in a centralized manner requires traffic lights to constantly
communicate with the edge-RSU, leading to higher communication
costs. Under Centralized training, the following communications
take place at each time-step: actions from the edge-RSU
to traffic lights, observations from traffic lights to edge-RSUs,
and V2I communications from vehicles to traffic lights.
B. Decentralized Training
Unlike centralized training, decentralized training equips
each traffic light $k \in \mathcal{K}$ with a policy network that aims to
independently learn an optimal local policy, $\pi_k^\star$, for traffic light $k$,
optimizing reward using only observations local to
that traffic light. In essence, if all traffic lights in the system
are able to learn an optimal policy, then the entire road network
can benefit. Zhou et al. in [15] proved that a
decentralized training approach using per-traffic-light policies
for smart traffic light control can converge to a centralized
approach if given infinite time. In general, this approach can
approach if given infinite time. In general, this approach can
attain good performance if given enough time. While the
decentralized approach is bested by the centralized approach
in finding an optimal policy, since the latter is learning from
global observations, the former approach is of interest as it
requires less communication.
1) Decentralized Training Communication Costs: In the
decentralized case, since the traffic lights never communicate
to the edge-RSU for making decisions, little communication
occurs. The only communication that takes place is V2I
communication from vehicles to traffic lights.
C. Federated Training
With the expectation that decentralized training will not
perform as well as centralized training due to policies learning
over fewer observations, but will require less communication,
we wish to achieve the best of both worlds. A novel contribu-
tion of this work is that we leverage the findings of the recent
federated learning (FL) paradigm [17], [18] for distributed
systems. Here we apply it to decentralized training to allow
the traffic lights to learn from each other without needing
to communicate raw data. We refer to this notion aptly as
Federated Reinforcement Learning (FedRL) [23], [24]. FL has
been shown to reduce communication cost in the literature [25]
while providing an immediate layer of privacy because no
raw data are communicated. These are crucial advantages for
smart traffic light control for future systems. For instance,
consider a system that considers live video feed as a feature in
the state space representation. Because identifying information
(e.g., license plate numbers and faces of pedestrians) may be
included, privacy is crucial. Additionally, such data may be
very large and incur hefty data transmission costs. As such,
we will focus on the benefit of federated training for smart
traffic light control w.r.t. the trade-off between communication
cost on the system and maximizing reward.
In FedRL, the traffic light agents training their own policy
networks will periodically communicate the learned policy
network parameters to the edge-RSU. The edge-RSU will
then aggregate them using an averaging function. The newly
aggregated policy network parameters are then communicated
back to the traffic lights for further learning. Aggregation will
occur after a fixed number of time-steps. We refer to this
time period as a frame and denote it by $F$. We denote the
policy network parameters learned by traffic light $k$ at the end
of frame $F$ by $\omega_k^F$.
In [18], the federated averaging (FedAvg) technique was
proposed. This technique addresses the challenge of non-independent
and identically distributed (non-iid) data distributions
across different client devices. FedAvg uses a weighted average
of the clients' locally-updated model parameters based on
the number of data items owned by each client. This weighting
combats non-iid data distributions common in distributed
systems. For the sake of this work, we make the simplifying
assumption that traffic lights have identical data sampling rates,
resulting in the same number of observations. Below is the
definition of the averaging we consider,
$$\omega^{F+1} \triangleq \sum_{k \in \mathcal{K}} \frac{1}{|\mathcal{K}|}\, \omega_k^{F+1} \qquad (7)$$
Fig. 4. Example Grid-3×3 road network with heterogeneous intersection types (border roads have 1 lane per direction, while more central roads have up to 2 lanes running east/west). Note that the number of lanes increases as roads become more central.
where the newly-aggregated, global parameters $\omega^{F+1}$ are the
average of the parameters collected from all the traffic lights.
These parameters are then sent back to the traffic lights at the
start of frame $F+1$ to resume training. Asynchronous aggregation
techniques to address heterogeneous data sampling
rates among traffic lights are beyond the scope of this work.
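A minimal sketch of one end-of-frame aggregation round under Eq. (7), assuming each traffic light exposes its policy network parameters as a list of NumPy arrays; the surrounding collect/redistribute loop and method names are illustrative, not our exact implementation.

```python
import numpy as np

def fedavg(local_params):
    """Eq. (7): uniform average of per-traffic-light parameter lists.

    `local_params` maps a traffic light id to a list of NumPy arrays (one per layer).
    Returns the aggregated global parameters to redistribute at the next frame.
    """
    num_clients = len(local_params)
    layer_lists = list(local_params.values())
    return [
        sum(client[layer_idx] for client in layer_lists) / num_clients
        for layer_idx in range(len(layer_lists[0]))
    ]

def end_of_frame(traffic_lights):
    """Illustrative end-of-frame round: collect local parameters, average, redistribute."""
    collected = {tl.id: tl.get_policy_params() for tl in traffic_lights}  # (ii) TL -> edge-RSU
    global_params = fedavg(collected)
    for tl in traffic_lights:
        tl.set_policy_params(global_params)                               # (i) edge-RSU -> TL
```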
1) Federated Training Communication Costs: With fed-
erated training, communications that occur at each time-
step are mostly identical to that for decentralized training
(discussed in §IV-B1). The only difference is that at the end of
each frame (which occurs less frequently than each time-step),
2 additional communications occur: policy network parameters
from edge-RSU to traffic lights and policy network parameters
from traffic lights to edge-RSU.
V. EXPERIMENT DESIGN
We implement the SEAL framework using the Python
programming language. Further, we implement the training
approaches described in §IV using the SUMO traffic simula-
tor [26] for the traffic simulation and Ray’s RLlib [27] toolbox
for the RL pipeline. Our software serves as the interface between
these tools to fit our work's specific needs. Thus, we
train the policy networks with PPO only in simulations
using these tools.
A. Considered Road Network Topologies
For training the policies using Ray’s RLlib [27] and per-
forming evaluation via simulation, we consider 3 road network
topologies (an example is provided in Fig. 4): (a) Grid-3×3, (b) Grid-5×5,
and (c) Grid-7×7. Roads on the border of the
network have 1 lane going in each direction, with the number
of lanes going north/south and east/west increasing by 1
when approaching the central north/south and east/west roads.
This is to introduce heterogeneous road network topologies.
For an example, refer to Fig. 4. For simplicity, we do not
allow vehicles to make turns to prevent the vehicles from
getting stuck in the simulation. Note that this is a limitation
of SUMO and SEAL’s design is general enough to support
turning vehicles. Each training approach (discussed in §IV)
will learn policies over each road network topology. Vehicle
routes for training and evaluation are randomly generated
using the randomTrips.py module provided by SUMO,
with 360 vehicles per lane per hour (VPLPH) generated.
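For reference, a hypothetical invocation of SUMO's randomTrips.py for one of these networks might look like the sketch below; the file names are placeholders, and the insertion period would be tuned per network so that the generated demand corresponds to 360 VPLPH.

```python
import subprocess

# Hypothetical example: generate random trips for a 1-hour (3600 s) episode on a grid network.
# randomTrips.py ships with SUMO (typically under $SUMO_HOME/tools). The --period value
# controls the insertion rate and would be chosen per network to match 360 VPLPH.
subprocess.run([
    "python", "randomTrips.py",
    "-n", "grid-3x3.net.xml",     # placeholder road network file
    "-o", "grid-3x3.trips.xml",   # placeholder output trips file
    "-b", "0", "-e", "3600",      # begin and end times (seconds)
    "-p", "2.0",                  # insertion period (seconds between departures); illustrative value
    "--seed", "42",
], check=True)
```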
Fig. 5. Learning curves (mean episode reward vs. training time-steps) with each training approach (Federated, Centralized, Decentralized) on each road network: (a) Grid-3×3, (b) Grid-5×5, and (c) Grid-7×7.
B. Training Parameters
We use Proximal Policy Optimization (PPO) [21] to
train policies to solve SEAL. We use the following hyper-
parameters. The learning rate is $5 \times 10^{-5}$. The SGD minibatch size
is 128. The PPO clip parameter is set to 0.3. The target value for KL
divergence is 0.3. Train batch size is 4000 time-steps. (Note
policy network parameter aggregation, described in §IV-C,
occurs every 4000 steps.) Roll-out fragment length (size of
batches collected from each worker) is 200. We use General-
ized Advantage Estimator (GAE) and the GAE parameter is
set to 1.0. The VF clip parameter is set to 10.
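For reproducibility, these hyper-parameters map onto an RLlib-style PPO configuration roughly as sketched below. Exact key names vary across RLlib versions, so this should be read as an illustrative mapping rather than our verbatim configuration.

```python
# Illustrative RLlib-style PPO configuration mirroring Sec. V-B; key names follow the
# classic dict-based RLlib API and may differ slightly between RLlib versions.
ppo_config = {
    "lr": 5e-5,                       # learning rate
    "sgd_minibatch_size": 128,        # SGD minibatch size
    "clip_param": 0.3,                # PPO clip parameter
    "kl_target": 0.3,                 # target KL divergence
    "train_batch_size": 4000,         # time-steps per training batch (and per aggregation frame)
    "rollout_fragment_length": 200,   # batch size collected from each worker
    "use_gae": True,
    "lambda": 1.0,                    # GAE parameter
    "vf_clip_param": 10.0,            # value function clip parameter
}
```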
VI. RESULTS & DISCUSSION
A. Reward Evaluation During Training
First, we compare the different training strategies discussed
in §IV in terms of the reward achieved by the policy networks
during training. In Fig. 5, we can see the learning curves
of each training strategy when used on each of the 3 road
network topologies described in §V-A. From these results,
we make the following general observations:
(i) Centralized training generally achieves the greatest reward,
(ii) Decentralized training generally achieves the worst
reward, and (iii) Federated training achieves greater reward
than Decentralized training (and often nearly matches that of
Centralized training). These observations are fairly intuitive.
Since Centralized training trains a single policy network over
all observations collected in the environment, it has more
to learn from. Conversely, with Decentralized training, each
traffic light learns independently using its own observations
meaning each traffic light’s policy learns over fewer ob-
servations. Since Federated training expands on Decentralized
training by allowing parameter aggregation among the policy
networks learned by the traffic lights, the traffic lights are
essentially able to learn from each other without explicitly
sharing observations and other raw data. More specifically, we
find that Decentralized training suffers from an 8.01% drop
in reward compared to Centralized training. Meanwhile,
Federated training only suffers from a 2.11% drop in reward
compared to Centralized training.
B. Communication Cost Evaluation During Training
Given Federated training is able to more closely approxi-
mate the reward achieved by Centralized training when com-
pared to Decentralized training, we next compare the com-
munication costs associated with each training strategy. We
Fig. 6. Communication cost (i.e., data size in bytes) transmitted during training time under each training strategy (Federated, Centralized, Decentralized) for each road network: (a) Grid-3×3, (b) Grid-5×5, and (c) Grid-7×7.
do this by tracking the number of times each communication
type occurs (refer to §III-D) and weighting each occurrence
by the number of bytes needed to transmit the data
for that communication. In Fig. 6, we compare the size of the
data needed to be communicated through the system during
training using each of the training strategies under each of the
road network topologies. There is a glaring difference in terms
of communication efficiency between Centralized and De-
centralized/Federated. Because Centralized training requires
constant communication between the edge-RSU and the traffic
lights in order to transmit observations, actions, and other data,
it naturally incurs much greater communication cost. Mean-
while, Decentralized and Federated training greatly reduce this
cost due to them keeping communication mostly between the
vehicles and the traffic light. The only communication between
the Edge-RSU and the traffic lights under Federated training
is when policy network parameters are aggregated after each
frame concludes. It is interesting to note that Federated is able
to best Decentralized training in terms of communication cost
in these results. This is due to the Federated training strategy
producing better policy networks and removing vehicles from
the system more efficiently than the Decentralized model
resulting in less vehicle-to-infrastructure communication.
More numerically speaking, from our results Decentralized
and Federated training are able to achieve a communication
cost reduction of 34.65% and 36.24%, respectively, when
compared to Centralized training.
C. Trained Policy Network Performance
Here, we are interested in two questions: (1) Can RL-based
traffic lights trained with SEAL improve traffic conditions?
(2) Can policy networks trained with SEAL perform well when
used on road networks they were not trained on? To answer
the first question, we compare our trained policy networks
against a standard traffic light control baseline: a pre-timed
control [20] where traffic lights cycle through phase states at
fixed time intervals. We support this comparison using real-
world traffic metrics to evaluate the experience of drivers in the
system. Namely, we consider both “Travel Time” and “Waiting
Time”. The former is the total amount of (simulation) time
taken for vehicles to reach their destination; the latter is the
amount of (simulation) time vehicles are waiting to move at a
traffic light. The results of this evaluation are shown in Fig. 7.
We see that in nearly all cases, the RL-based training strategies
Fig. 7. Evaluation of trained policy networks on each road network using trip metrics, namely Travel Time (top row) and Waiting Time (bottom row), for policies trained on Grid-3x3, Grid-5x5, and Grid-7x7 and tested on each network. We compare the results to a Pre-Timed phase transition model as a baseline. Results confirm the RL-based solutions generally outperform the baseline.
outperform the Pre-Timed baseline. The only outlier
is the Centralized trainer when learning in the Grid-3×3
road network. As for the second question regarding possible
transferability of the policy networks, we observe in Fig. 7 that
the policy networks generally perform comparably
to one another (ignoring the Centralized trainer when trained
on Grid-3×3). This generally holds true when policy networks
tested on the same road network they were trained on
are compared to policy networks trained on other networks.
These results serve to motivate the use of RL-based approaches
for future smart traffic applications. We find that (on average)
Centralized, Decentralized, and Federated reduce travel time
compared to Pre-Timed by 11.63%, 18.16%, and 18.14%,
respectively. Also, we find that (on average) Centralized,
Decentralized, and Federated reduce waiting time compared
to Pre-Timed by 42.81%, 58.92%, and 58.93%, respectively.
The underperformance of Centralized here, compared to De-
centralized and Federated, is likely due to the outlier scenarios
when it trains on Grid-3×3. We attribute these anomalies to
potential overfitting, though further experiments are needed.
VII. RELATED WORKS
Improving traffic light signal control in road networks has
been a widely studied subject. Much work is being done
to improve traffic conditions by developing adaptive traffic
signal control (ATSC) where traffic lights adapt intelligently
based on current traffic demands [28]. Many different tech-
niques have been considered for realizing ATSC. Early works
considered linear optimization frameworks [29]. While linear
programming is straightforward, it is not an appropriate match
for ATSC because of the highly dynamic nature of real-
world traffic systems making accurate objective functions
and constraints difficult to define. Genetic (or evolutionary)
algorithms have also been considered in prior works [30]. In
the early 2000s, initial works focusing on the application of
Reinforcement Learning (RL) techniques for ATSC were pub-
lished [6], [12]. While seminal, these initial works considered
very simple road network scenarios. With advancements in
both vehicular communication [2], [3] and RL algorithms [22],
interest in RL for ATSC (or smart traffic) has been renewed.
However, recent RL algorithms use more complex policy
networks that require more compute resources to train.
Works considering RL for smart traffic light signal control
have greatly increased over the years [7], [9], [10]. Because of
the large number of entities in a traffic system (e.g., multiple
traffic lights, multiple vehicles), multi-agent RL techniques
have been applied to smart traffic light control [16], [11]. El-
Tantawy et al. in [16] propose a multi-agent RL framework
where agents can either be independent or collaborative in how
they make decisions with other traffic light agents. Chu et al.
in [8] propose a decentralized, multi-agent RL framework to
provide robust learning using a scalable approach. Chen
et al. in [5] propose a decentralized actor-critic model and a
difference reward method to accelerate the convergence of the
trained policies for smart traffic light control. Mousavi et al.
in [13] study both policy- and value-based deep reinforcement
learning approaches for smart traffic light control. However,
they only consider a single intersection, where the state
space is a screenshot of the intersection provided by a traffic
simulator. These works focus on improving training first and
foremost. As a result, the communication cost for training
these policies is neglected.
Edge Computing (EC) [4], [31] is a recent enabling technol-
ogy that pushes compute resources to the network edge. This
has become an increasingly popular context for deploying AI
(e.g., machine learning, deep learning, and RL) services to the
network edge to provide low-latency intelligence. A significant
recent work by Zhou et al. in [15] studied the applicability of
edge computing for decentralized RL for smart traffic lights. A
central contribution of that work is its theoretical guarantees,
which show that the proposed decentralized framework can
achieve near-optimal traffic reduction if given
enough time. Different from this work, we design a framework
that allows heterogeneous traffic lights to train policy networks
in a federated manner to reduce communication costs.
The central gap in the literature related to RL for smart
traffic light control is that the trade-off between reward and
communication cost has been neglected. Additionally, recent
advancements in the realm of Federated Learning (FL) or,
more specifically, Federated Reinforcement Learning (FedRL)
have yet to be applied to the smart traffic control problem.
VIII. CONCLUSIONS
In closing, this work is, to the best of our knowledge,
the first to approach smart traffic light control using Federated
Reinforcement Learning (FedRL) in an edge computing-enabled
system. We do this by proposing SEAL,
an intersection-agnostic Markov Decision Process for smart
traffic light control to support aggregating learned policy
network parameters across heterogeneous intersection types.
This allows traffic lights to learn from each other’s experiences
without sharing raw experience data which reduces communi-
cation workloads (while providing some level of privacy). Our
experiments demonstrate that SEAL combined with FedRL
approach is able to closely match the rewards provided by a
Centralized training approach (only a 2.11% decrease) when
compared to the Decentralized approach, which shows an 8.01%
drop in reward. Further, our FedRL approach reduces the
communication cost by 36.24% when compared to Central-
ized training. Hence, FedRL improves the implicit reward-communication
trade-off for training smart traffic systems in a
distributed manner. In the future, we aim to extend our work to further
analyze the theoretical bounds of SEAL and to study its
effectiveness in small robotic testbed systems.
REFERENCES
[1] INRIX, “Congestion costs each American nearly 100 hours, $1,400
a year,” Mar 2020. [Press release]. Retrieved from https://inrix.com/
press-releases/2019-traffic-scorecard-us/.
[2] M. S. Anwer and C. Guy, “A survey of VANET technologies,” Journal
of Emerging Trends in Computing and Information Sciences.
[3] S. K. Bhoi and P. M. Khilar, “Vehicular communication: a survey, IET
networks, vol. 3, no. 3, pp. 204–217, 2014.
[4] P. Mach and Z. Becvar, “Mobile edge computing: A survey on archi-
tecture and computation offloading, IEEE Communications Surveys &
Tutorials, vol. 19, no. 3, 2017.
[5] Y. Chen, C. Li, W. Yue, H. Zhang, and G. Mao, “Engineering a
large-scale traffic signal control: A multi-agent reinforcement learning
approach,” IEEE INFOCOM 2021 Workshops, pp. 1–6, 2021.
[6] M. A. Wiering, “Multi-agent reinforcement learning for traffic light con-
trol,” in Machine Learning: Proceedings of the Seventeenth International
Conference (ICML’2000), pp. 1151–1158, 2000.
[7] L. Prashanth and S. Bhatnagar, “Reinforcement learning with function
approximation for traffic signal control, IEEE Transactions on Intelli-
gent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2010.
[8] T. Chu, J. Wang, L. Codecà, and Z. Li, “Multi-agent deep reinforcement
learning for large-scale traffic signal control,” IEEE Transactions on
Intelligent Transportation Systems, vol. 21, pp. 1086–1095, 2020.
[9] M. Aslani, M. S. Mesgari, and M. Wiering, “Adaptive traffic signal
control with actor-critic methods in a real-world traffic network with
different traffic disruption events, Transportation Research Part C-
emerging Technologies, vol. 85, 2017.
[10] X. Wang, L. Ke, Z. Qiao, and X. Chai, “Large-scale traffic signal control
using a novel multiagent reinforcement learning, IEEE Transactions on
Cybernetics, vol. 51, pp. 174–187, 2021.
[11] T. Tan, T. Chu, and J. Wang, “Multi-agent bootstrapped deep q-network
for large-scale traffic signal control, 2020 IEEE CCTA, 2020.
[12] B. Abdulhai, R. Pringle, and G. J. Karakoulas, “Reinforcement learning
for true adaptive traffic signal control, Journal of Transportation
Engineering-asce, vol. 129, pp. 278–285, 2003.
[13] S. S. Mousavi, M. Schukat, and E. Howley, “Traffic light control using
deep policy-gradient and value-function-based reinforcement learning,
IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[14] P. G. Balaji, X. German, and D. Srinivasan, “Urban traffic signal control
using reinforcement learning agents,” Iet Intelligent Transport Systems,
vol. 4, pp. 177–188, 2010.
[15] P. Zhou, X. Chen, Z. Liu, T. Braud, P. Hui, and J. Kangasharju, “DRLE:
Decentralized reinforcement learning at the edge for traffic light control
in the iov, IEEE Transactions on Intelligent Transportation Systems,
vol. 22, no. 4, pp. 2262–2273, 2020.
[16] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent rein-
forcement learning for integrated network of adaptive traffic signal
controllers (marlin-atsc): Methodology and large-scale application on
downtown toronto, IEEE Transactions on Intelligent Transportation
Systems, vol. 14, pp. 1140–1150, 2013.
[17] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and
D. Bacon, “Federated learning: Strategies for improving communication
efficiency,” arXiv preprint arXiv:1610.05492, 2016.
[18] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
data,” in Artificial intelligence and statistics, PMLR, 2017.
[19] R. Bellman, “A Markovian decision process, Journal of mathematics
and mechanics, vol. 6, no. 5, pp. 679–684, 1957.
[20] P. Koonce and L. Rodegerdts, “Traffic signal timing manual.,” tech. rep.,
United States. Federal Highway Administration, 2008.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” ArXiv, vol. abs/1707.06347,
2017.
[22] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint
arXiv:1701.07274, 2017.
[23] H. H. Zhuo, W. Feng, Q. Xu, Q. Yang, and Y. Lin, “Federated
reinforcement learning,” 2019.
[24] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learn-
ing: a learning architecture for navigation in cloud robotic systems,”
IEEE Robotics and Automation Letters, pp. 4555–4562, 2019.
[25] N. Hudson, M. J. Hossain, M. Hosseinzadeh, H. Khamfroush,
M. Rahnamay-Naeini, and N. Ghani, “A framework for edge intelligent
smart distribution grids via federated learning,” in 2021 IEEE ICCCN,
pp. 1–9, 2021.
[26] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent
development and applications of SUMO-Simulation of Urban MObility,”
International journal on advances in systems and measurements, vol. 5,
no. 3&4, 2012.
[27] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gon-
zalez, M. Jordan, and I. Stoica, “RLlib: Abstractions for distributed rein-
forcement learning,” in International Conference on Machine Learning,
PMLR, 2018.
[28] Z. Liu, “A survey of intelligence methods in urban traffic signal con-
trol,” IJCSNS International Journal of Computer Science and Network
Security, vol. 7, no. 7, 2007.
[29] M. Dotoli, M. P. Fanti, and C. Meloni, A signal timing plan formulation
for urban traffic control, Control engineering practice, 2006.
[30] H. Ceylan and M. G. Bell, “Traffic signal timing optimisation based on
genetic algorithm approach, including drivers’ routing, Transportation
Research Part B: Methodological, vol. 38, no. 4, pp. 329–342, 2004.
[31] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “Mo-
bile edge computing: Survey and research outlook, arXiv preprint
arXiv:1701.01090, 2017.
... While these approaches allow for developing a global model, their applicability is restricted to specific types of intersections. Scholars have attempted a unified representation of agent states to address this limitation, specifically exploring intersection-agnostic state representations for heterogeneous intersections 18 . However, on the one hand, the accumulation of too many factors hinders a deep network's ability to comprehend the environmental state more effectively 7 . ...
... Centralized learning models can achieve optimal global results but require excessive communication and computational resources 40 . Distributed learning solutions allocate agents to learn local models, hindering generalization across different intersections 18 . This paper seeks to establish a robust model capable of aggregating knowledge from intelligent agents at various intersections with lower communication and computational costs. ...
Article
Full-text available
Intelligent Transportation has seen significant advancements with Deep Learning and the Internet of Things, making Traffic Signal Control (TSC) research crucial for reducing congestion, travel time, emissions, and energy consumption. Reinforcement Learning (RL) has emerged as the primary method for TSC, but centralized learning poses communication and computing challenges, while distributed learning struggles to adapt across intersections. This paper presents a novel approach using Federated Learning (FL)-based RL for TSC. FL integrates knowledge from local agents into a global model, overcoming intersection variations with a unified agent state structure. To endow the model with the capacity to globally represent the TSC task while preserving the distinctive feature information inherent to each intersection, a segment of the RL neural network is aggregated to the cloud, and the remaining layers undergo fine-tuning upon convergence of the model training process. Extensive experiments demonstrate reduced queuing and waiting times globally, and the successful scalability of the proposed model is validated on a real-world traffic network in Monaco, showing its potential for new intersections.
... Depending on the application, there may be benefit in aggregating across regions to improve the performance of the trained models. Relevant real-world use cases where hierarchical FL can be applied include (but are not limited to): (i) smart farming systems [14], (ii) smart traffic control systems [15], and (iii) smart energy grid systems [16]. ...
... In such a system, traffic lights can learn optimal traffic-light control policies based on current traffic conditions. Federated Reinforcement Learning has been shown to improve the implicit trade-off between model performance and communication costs [15]. Hierarchical FL can enable cities to collaboratively train such FL-enabled infrastructure, with partial models being developed in a regional/ geographical context, and then aggregated to support particular application requirements. ...
Preprint
Full-text available
Federated learning has shown enormous promise as a way of training ML models in distributed environments while reducing communication costs and protecting data privacy. However, the rise of complex cyber-physical systems, such as the Internet-of-Things, presents new challenges that are not met with traditional FL methods. Hierarchical Federated Learning extends the traditional FL process to enable more efficient model aggregation based on application needs or characteristics of the deployment environment (e.g., resource capabilities and/or network connectivity). It illustrates the benefits of balancing processing across the cloud-edge continuum. Hierarchical Federated Learning is likely to be a key enabler for a wide range of applications, such as smart farming and smart energy management, as it can improve performance and reduce costs, whilst also enabling FL workflows to be deployed in environments that are not well-suited to traditional FL. Model aggregation algorithms, software frameworks, and infrastructures will need to be designed and implemented to make such solutions accessible to researchers and engineers across a growing set of domains. H-FL also introduces a number of new challenges. For instance, there are implicit infrastructural challenges. There is also a trade-off between having generalised models and personalised models. If there exist geographical patterns for data (e.g., soil conditions in a smart farm likely are related to the geography of the region itself), then it is crucial that models used locally can consider their own locality in addition to a globally-learned model. H-FL will be crucial to future FL solutions as it can aggregate and distribute models at multiple levels to optimally serve the trade-off between locality dependence and global anomaly robustness.
... Hudson et al., [22] have been conducted research on Federated Reinforcement. Instead of exchanging data in its raw form, edge-enabled lights can combine locally learned policies and network settings to learn from one another's experiences, lowering the cost of communication. ...
Article
Full-text available
City traffic congestion can be reduced with the help of adaptable traffic signal control system. The technique improves the efficiency of traffic operations on urban road networks by quickly adjusting the timing of signal values to account for seasonal variations and brief turns in traffic demand. This study looks into how adaptive signal control systems have evolved over time, their technical features, the state of adaptive control research today, and Control solutions for diverse traffic flows composed of linked and autonomous vehicles. This paper finally came to the conclusion that the ability of smart cities to generate vast volumes of information, Artificial Intelligence (AI) approaches that have recently been developed are of interest because they have the power to transform unstructured data into meaningful information to support decision-making (For instance, using current traffic information to adjust traffic lights based on actual traffic circumstances). It will demand a lot of processing power and is not easy to construct these AI applications. Unique computer hardware/technologies are required since some smart city applications require quick responses. In order to achieve the greatest energy savings and QoS, it focuses on the deployment of virtual machines in software-defined data centers. Review of the accuracy vs. latency trade-off for deep learning-based service decisions regarding offloading while providing the best QoS at the edge using compression techniques. During the past, computationally demanding tasks have been handled by cloud computing infrastructures. A promising computer infrastructure is already available and thanks to the new edge computing advancement, which is capable of meeting the needs of tomorrow's smart cities.
... Centralized learning models can achieve optimal global results but require excessive communication and computational resources 29 . Distributed learning solutions allocate agents to learn local models, hindering generalization across different intersections 30 . This paper seeks to establish a robust model capable of aggregating knowledge from intelligent agents at various intersections with lower communication and computational costs. ...
Preprint
Full-text available
Intelligent Transportation has seen significant advancements with Deep Learning (DL) and Internet of Things, making Traffic Signal Control (TSC) research crucial for reducing congestion, travel time, emissions, and energy consumption. Reinforcement Learning (RL) has emerged as the primary method for TSC, but centralized learning poses communication and computing challenges, while distributed learning struggles to adapt across intersections. This paper presents a novel approach using Federated Learning (FL)-based RL for TSC. FL integrates knowledge from local agents into a global model, overcoming intersection variations with a unified agent state structure. Additionally, the output layer's parameters are not aggregated to handle different intersection settings; instead, fine-tuning is performed after model training for deployment. Extensive experiments demonstrate reduced queuing and waiting times globally, and successful scalability of the proposed model is validated on a real-world traffic network in Monaco, showing its potential for new intersections.
Article
Resource constraints on the computing continuum require that we make smart decisions for serving AI-based services at the network edge. AI-based services typically have multiple implementations (e.g., image classification implementations include SqueezeNet, DenseNet, and others) with varying trade-offs (e.g., latency and accuracy). The question then is how should AI-based services be placed across Function-as-a-Service (FaaS) based edge computing systems in order to maximize total Quality-of-Service (QoS). To address this question, we propose a problem that jointly aims to solve (i) edge AI service placement and (ii) request scheduling. These are done across two time-scales (one for placement and one for scheduling). We first cast the problem as an integer linear program. We then decompose the problem into separate placement and scheduling subproblems and prove that both are NP-hard. We then propose a novel placement algorithm that places services while considering device-to-device communication across edge clouds to offload requests to one another. Our results show that the proposed placement algorithm is able to outperform a state-of-the-art placement algorithm for AI-based services, and other baseline heuristics, with regard to maximizing total QoS. Additionally, we present a federated learning-based framework, FLIES, to predict the future incoming service requests and their QoS requirements. Our results also show that our FLIES algorithm is able to outperform a standard decentralized learning baseline for predicting incoming requests and show comparable predictive performance when compared to centralized training.
Chapter
In vehicular networks, one of the traffic light signal parameters is the current inefficient traffic light control and causes problems such as long delay and energy waste. To improve traffic efficiency, dynamically adjusting the traffic light duration, taking into account real-time traffic information, is a logical and reasonable method. In this study; a deep reinforcement learning model is proposed to control the traffic light. In order to reduce the waiting time of the intersection users, the model and timing of the changes were optimized using deep reinforcement learning for the signals. In addition to the existing studies, Shibuya Crossing is chosen as an exemplary intersection application, focusing on encrypted intersections as the application target of traffic control with deep reinforcement learning. A traffic simulation SUMO is used to create the perimeter of Shibuya Crossing. Traffic signals are optimized using DQN, A2C and PPO algorithms. As a result, by using reinforcement learning, the waiting time has been reduced by about four times compared to the signal patterns currently used. In the study, the behavior of the optimized signal is also analyzed, explaining how the accuracy of the learning process changes when the method or condition observation is changed.KeywordsVehicular networktraffic signal controldeep reinforcement learningSUMO
Article
As vehicles become increasingly automated, novel vehicular applications emerge to enhance the safety and security of the vehicles and improve user experience. This brings ever-increasing data and resource requirements for timely computation on the vehicle's on-board computing systems. To alleviate these demands, prior work proposes deploying vehicular edge computing (VEC) resources on the road-side units (RSUs) in the traffic infrastructure, to which the vehicles can communicate and offload compute-intensive tasks. Due to the limited communication range of these RSUs, the communication link between the vehicles and the RSUs, and therefore the response times of the offloaded applications, are significantly impacted by the vehicle's mobility through road traffic. Existing task offloading strategies do not consider the influence of traffic lights on vehicular mobility while offloading workloads to the RSUs, and thereby cause deadline misses and quality-of-service (QoS) reduction for the offloaded tasks. In this paper, we present a novel task model that captures time- and location-specific requirements for vehicular applications. We then present a deadline-based strategy that incorporates traffic light data to opportunistically offload tasks. Our approach allows up to 33% more tasks to be offloaded onto the RSUs, compared to existing work, without causing any deadline misses, thereby maximizing resource utilization on the RSUs.
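The core offloading test implied by the abstract, offload only when the RSU can respond within both the deadline and the vehicle's dwell time in coverage, can be sketched as below; the parameter names and the dwell-time model are assumptions for illustration, not the paper's task model.

def should_offload(rsu_exec_time, tx_time, deadline, time_in_range, red_phase_dwell):
    # Offload only if the RSU's response can arrive both before the task's
    # deadline and before the vehicle leaves the RSU's communication range.
    # An upcoming red phase extends how long the vehicle stays in range.
    available_window = time_in_range + red_phase_dwell
    response_time = tx_time + rsu_exec_time
    return response_time <= min(deadline, available_window)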
Preprint
The Internet of Vehicles (IoV) enables real-time data exchange among vehicles and roadside units and thus provides a promising solution to alleviate traffic jams in urban areas. Meanwhile, better traffic management via efficient traffic light control can benefit the IoV as well by enabling a better communication environment and decreasing the network load. As such, IoV and efficient traffic light control can form a virtuous cycle. Edge computing, an emerging technology to provide low-latency computation capabilities at the edge of the network, can further improve the performance of this cycle. However, while the collected information is valuable, an efficient solution for better utilization and faster feedback has yet to be developed for edge-empowered IoV. To this end, we propose Decentralized Reinforcement Learning at the Edge for traffic light control in the IoV (DRLE). DRLE exploits the ubiquity of the IoV to accelerate the collection of traffic data and its interpretation towards alleviating congestion and providing better traffic light control. DRLE operates within the coverage of the edge servers and uses aggregated data from neighboring edge servers to provide city-scale traffic light control. DRLE decomposes the highly complex problem of large-area control into a decentralized multi-agent problem. We prove its global optimality with concrete mathematical reasoning. The proposed decentralized reinforcement learning algorithm running at each edge node adapts the traffic lights in real time. We conduct extensive evaluations and demonstrate the superiority of this approach over several state-of-the-art algorithms.
Article
Reinforcement learning (RL) is a promising data-driven approach for adaptive traffic signal control (ATSC) in complex urban traffic networks, and deep neural networks further enhance its learning power. However, centralized RL is infeasible for large-scale ATSC due to the extremely high dimension of the joint action space. Multi-agent RL (MARL) overcomes the scalability issue by distributing the global control to each local RL agent, but it introduces new challenges: now the environment becomes partially observable from the viewpoint of each local agent due to limited communication among agents. Most existing studies in MARL focus on designing efficient communication and coordination among traditional Q-learning agents. This paper presents, for the first time, a fully scalable and decentralized MARL algorithm for the state-of-the-art deep RL agent: advantage actor critic (A2C), within the context of ATSC. In particular, two methods are proposed to stabilize the learning procedure, by improving the observability and reducing the learning difficulty of each local agent. The proposed multi-agent A2C is compared against independent A2C and independent Q-learning algorithms, in both a large synthetic traffic grid and a large real-world traffic network of Monaco city, under simulated peak-hour traffic dynamics. Results demonstrate its optimality, robustness, and sample efficiency over other state-of-the-art decentralized MARL algorithms.
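One of the stabilization ideas mentioned above, letting each local agent observe an attenuated view of its neighbors, can be sketched as follows; the flat-vector state format and the specific discount value are assumptions, not the paper's exact formulation.

import numpy as np

def augmented_observation(own_state, neighbor_states, spatial_discount=0.75):
    # Concatenate the agent's own state with down-weighted neighbor states so
    # the local actor-critic sees nearby intersections without requiring
    # full observability of the whole network.
    scaled = [spatial_discount * np.asarray(s, dtype=float) for s in neighbor_states]
    return np.concatenate([np.asarray(own_state, dtype=float)] + scaled)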
Article
Recent advances in combining deep neural network architectures with reinforcement learning techniques have shown promising potential in solving complex control problems with high-dimensional state and action spaces. Inspired by these successes, in this paper, we build two kinds of reinforcement learning agents: a deep policy-gradient agent and a value-function-based agent, which can predict the best possible traffic signal for a traffic intersection. At each time step, these adaptive traffic light control agents receive a snapshot of the current state of a graphical traffic simulator and produce control signals. The policy-gradient-based agent maps its observation directly to the control signal, whereas the value-function-based agent first estimates values for all legal control signals. The agent then selects the control action with the highest value. Our methods show promising results in a traffic network simulated in the SUMO traffic simulator, without suffering from instability issues during the training process.
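The distinction drawn above between the two agent types can be made concrete with the following sketch; the function names and the softmax sampling are illustrative assumptions.

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy_gradient_action(policy_logits_fn, observation):
    # Policy-based agent: map the observation directly to a distribution over
    # signal phases and sample an action from it.
    probs = softmax(policy_logits_fn(observation))
    return int(np.random.choice(len(probs), p=probs))

def value_based_action(q_values_fn, observation):
    # Value-based agent: estimate a value for every legal phase first, then
    # select the phase with the highest estimate.
    return int(np.argmax(q_values_fn(observation)))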
Conference Paper
Recent advances in distributed data processing and machine learning provide new opportunities to enable critical, time-sensitive functionalities of smart distribution grids in a secure and reliable fashion. Combining recent advances in edge computing (EC) and edge intelligence (EI) with existing advanced metering infrastructure (AMI) has the potential to reduce overall communication cost, preserve user privacy, and provide improved situational awareness. In this paper, we provide an overview of how EC and EI can supplement applications relevant to AMI systems. Additionally, using such systems in tandem can enable distributed deep learning frameworks (e.g., federated learning) to empower distributed data processing and intelligent decision making for AMI. Finally, to demonstrate the efficacy of the considered architecture, we approach the non-intrusive load monitoring (NILM) problem using federated learning to train a deep recurrent neural network architecture in a 2-tier and 3-tier manner. In this approach, smart homes locally train a neural network using their metering data and only share the learned model parameters with AMI components for aggregation. Our results show this can reduce the communication cost associated with distributed learning and provide an immediate layer of privacy, since no raw data is communicated to AMI components. Further, we show that FL is able to closely match the model loss of standard centralized deep learning, where raw data is communicated for centralized training.
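A minimal sketch of the 3-tier aggregation pattern described above (smart homes to edge aggregators to the AMI head-end) follows; equal weighting and the dict-of-arrays model format are assumptions made for illustration.

import numpy as np

def fedavg(param_dicts):
    # Plain federated averaging of model parameters, assuming equal client weights.
    return {name: np.mean([p[name] for p in param_dicts], axis=0)
            for name in param_dicts[0]}

def three_tier_round(homes_per_edge):
    # homes_per_edge: one list of locally trained home models per edge aggregator.
    edge_models = [fedavg(home_models) for home_models in homes_per_edge]
    return fedavg(edge_models)  # head-end aggregates the edge-level models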
Article
Finding the optimal signal timing strategy is a difficult task for large-scale traffic signal control (TSC). Multiagent reinforcement learning (MARL) is a promising method to solve this problem. However, there is still room for improvement in extending to large-scale problems and in modeling the behaviors of other agents for each individual agent. In this article, a new MARL algorithm, called cooperative double Q-learning (Co-DQL), is proposed, which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the upper confidence bound (UCB) policy, which eliminates the overestimation problem of traditional independent Q-learning while ensuring exploration. It uses mean-field approximation to model the interaction among agents, thereby making agents learn a better cooperative strategy. In order to improve the stability and robustness of the learning process, we introduce a new reward allocation mechanism and a local state sharing method. In addition, we analyze the convergence properties of the proposed algorithm. Co-DQL is applied to TSC and tested on various traffic flow scenarios of TSC simulators. The results show that Co-DQL outperforms the state-of-the-art decentralized MARL algorithms in terms of multiple traffic metrics.
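The double-estimator update and UCB exploration referred to above are standard building blocks; a tabular sketch (the step size, discount, UCB constant, and the averaging of the two estimators are assumptions) is given below.

import numpy as np

def ucb_action(Q_A, Q_B, counts, state, t, c=2.0):
    # UCB policy over the average of the two estimators: prefer actions with
    # high value estimates or few visits.
    q = 0.5 * (Q_A[state] + Q_B[state])
    bonus = c * np.sqrt(np.log(t + 1) / (counts[state] + 1))
    return int(np.argmax(q + bonus))

def double_q_update(Q_A, Q_B, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One estimator selects the greedy next action, the other evaluates it,
    # which removes the overestimation bias of the single max operator.
    if np.random.rand() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))
        Q_A[s][a] += alpha * (r + gamma * Q_B[s_next][a_star] - Q_A[s][a])
    else:
        a_star = int(np.argmax(Q_B[s_next]))
        Q_B[s][a] += alpha * (r + gamma * Q_A[s_next][a_star] - Q_B[s][a])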
Article
This paper was motivated by the problem of how to make robots fuse and transfer their experience so that they can effectively use prior knowledge and quickly adapt to new environments. To address the problem, we present a learning architecture for navigation in cloud robotic systems: Lifelong Federated Reinforcement Learning (LFRL). In this work, we propose a knowledge fusion algorithm for upgrading a shared model deployed on the cloud. Then, effective transfer learning methods in LFRL are introduced. LFRL is consistent with human cognitive science and fits well in cloud robotic systems. Experiments show that LFRL greatly improves the efficiency of reinforcement learning for robot navigation. The cloud robotic system deployment also shows that LFRL is capable of fusing prior knowledge. In addition, we release a cloud robotic navigation-learning website that provides a service based on LFRL: www.shared-robotics.com
Article
The transportation demand is rapidly growing in metropolises, resulting in chronic traffic congestion in dense downtown areas. Adaptive traffic signal control, as a principal part of intelligent transportation systems, has a primary role in effectively reducing traffic congestion by making real-time adaptations in response to the changing traffic network dynamics. Reinforcement learning (RL) is an effective approach in machine learning that has been applied for designing adaptive traffic signal controllers. Among the most efficient and robust types of RL algorithms are continuous-state actor-critic algorithms, which have the advantage of fast learning and the ability to generalize to new and unseen traffic conditions. These algorithms are utilized in this paper to design adaptive traffic signal controllers called actor-critic adaptive traffic signal controllers (A-CATs controllers). The contribution of the present work rests on the integration of three threads: (a) showing performance comparisons of both discrete and continuous A-CATs controllers in a traffic network with recurring congestion (24-h traffic demand) in the upper downtown core of Tehran city, (b) analyzing the effects of different traffic disruptions, including opportunistic pedestrian crossings, parking lanes, non-recurring congestion, and different levels of sensor noise, on the performance of A-CATs controllers, and (c) comparing the performance of different function approximators (tile coding and radial basis functions) on the learning of A-CATs controllers. To this end, an agent-based traffic simulation of the study area is first carried out. Then, six different scenarios are conducted to find the best A-CATs controller that is robust enough against different traffic disruptions. We observe that the A-CATs controller based on radial basis function networks (RBF (5)) outperforms the others. This controller is benchmarked against discrete-state Q-learning, Bayesian Q-learning, fixed-time, and actuated controllers, and the results reveal that it consistently outperforms them.
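The radial basis function approximation named above follows the standard construction; in the usual form (the linear critic below is an illustration, not necessarily the paper's exact parameterization), a state $s$ is mapped to Gaussian features over fixed centers $c_i$:
\[
\phi_i(s) = \exp\!\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right),
\qquad
V(s) \approx \sum_i w_i\, \phi_i(s),
\]
where the widths $\sigma_i$ are fixed and only the weights $w_i$ are learned by the critic.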
Article
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
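The "surrogate" objective referred to above is PPO's clipped objective, which in the paper's notation reads
\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]
where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is a small clipping parameter; multiple epochs of minibatch updates are run on this objective for each batch of sampled experience.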