
Smart Edge-Enabled Traffic Light Control: Improving Reward-Communication Trade-offs with Federated Reinforcement Learning

Nathaniel Hudson, Pratham Oza, Hana Khamfroush, and Thidapat Chantem
Department of Computer Science, University of Kentucky
Department of Electrical & Computer Engineering, Virginia Tech
Abstract—Traffic congestion is a costly phenomenon of every-
day life. Reinforcement Learning (RL) is a promising solution due
to its applicability to solving complex decision-making problems
in highly dynamic environments. To train smart traffic lights
using RL, large amounts of data are required. Recent RL-based
approaches consider training to occur on some nearby server or
a remote cloud server. However, this requires that traffic lights
all communicate their raw data to some central location. For
large road systems, communication cost can be impractical, par-
ticularly if traffic lights collect heavy data (e.g., video, LIDAR).
As such, this work pushes training to the traffic lights directly
to reduce communication cost. However, completely independent
learning can reduce the performance of trained models. As such,
this work considers the recent advent of Federated Reinforcement
Learning (FedRL) for edge-enabled traffic lights so they can
learn from each other’s experience by periodically aggregating
locally-learned policy network parameters rather than share
raw data, hence keeping communication costs low. To do this,
we propose the SEAL framework which uses an intersection-
agnostic representation to support FedRL across traffic lights
controlling heterogeneous intersection types. We then evaluate
our FedRL approach against Centralized and Decentralized RL
strategies. We compare the reward-communication trade-offs of
these strategies. Our results show that FedRL is able to reduce
the communication costs associated with Centralized training
by 36.24%, while only seeing a 2.11% decrease in average
reward (where higher reward corresponds to decreased traffic congestion).
Index Terms—Smart Traffic, Traffic Light Control, Reinforcement Learning, Edge Computing, Federated Learning
I. INTRODUCTION
According to recent transportation analytics data by INRIX,
traffic congestion cost the United States economy $88 billion
in 2019 alone [1]. Traffic congestion poses a constant threat to
the economy and safety within an urban environment, which
can be alleviated by using the compute and communication
resources available in smart cities. Urban traffic networks
exemplify a typical cyber-physical system (CPS) where data, communication, and
connected infrastructure can now jointly optimize traffic oper-
ations within a road network. Communication capabilities of
the vehicles, traffic lights, and other road-side units (RSUs)
powered by vehicle-to-everything (V2X) and vehicular ad-hoc
networks (VANETs) provide opportunities for novel strategies
to mitigate traffic congestion over large and complex urban
road networks [2], [3]. Such strategies may require reliable
computing resources for the strict needs of urban traffic
networks. The recent advent of Edge Computing (EC) [4]
pushes compute resources to the network edge via compute
node servers, known as “edge servers”, that are close to the
smart city infrastructure. EC can be used to support more
compute-intensive tasks for vehicular networks.
Many recent works trying to support smart decision making
for traffic lights (commonly referred to as adaptive traffic
signal control) consider Reinforcement Learning (RL)-based
approaches [5], [6], [7], [8], [9], [10], [11]. RL is a popular
technique for training sequential decision-making policies for
problems that are highly dynamic and complex. Smart traffic
light strategies that incorporate RL typically employ either a
centralized [12], [13], [14] or decentralized [15], [10], [16]
technique for training policies. In the centralized case, a
policy is trained (typically on a roadside server) from the
observations collected by detectors and other infrastructural
components throughout the system. This central, roadside
server then communicates actions to each of the traffic lights.
Because the policy is learning over observations throughout
the road network, these approaches perform well in terms of
maximizing total reward. However, in practice, the amount
of communication needed to send all observational data to
the server can be costly. Decentralized approaches push the
policy training to the traffic lights based on observations local
to that traffic light, meaning less communication is needed
since training is local to the traffic light itself. However, in
decentralized approaches, the performance of the trained poli-
cies can be compromised because policies are learning in an
isolated and independent manner. Therein lies a natural trade-
off between policy performance w.r.t. maximizing reward and
the communication cost associated with training. To the best
of our knowledge, this trade-off has not been formally studied
for smart traffic light control with RL.
To this end, we study the reward-communication trade-off
for training smart traffic light control policies in an edge-
enabled traffic system. We do this by proposing a Federated
Reinforcement Learning (FedRL) technique inspired by the
recent Federated Learning (FL) paradigm [17], [18]. Under
our FedRL technique, we train traffic lights in a decentralized
manner to reduce overall communication costs. Periodically,
traffic lights will communicate their current policy network to
a roadside edge server (hereafter referred to as “edge-RSU”)
Fig. 1. Example of our traffic system where edge-enabled traffic lights (e.g., one controlling 4 lanes, another controlling 2 lanes), each hosting a smart traffic AI model, communicate with an edge-enabled roadside unit (Edge-RSU).
which will then aggregate the policy network parameters
using a weighted averaging method based on total reward.
This newly-averaged policy network is then distributed to
traffic lights for further training until the next aggregation
phase. This aggregation will allow traffic lights to learn from
each other without sharing raw observational data. For our
FedRL to work, representation of current traffic conditions
must be consistent across the road network, even in the
face of heterogeneous intersection types. In this way, the
representation needs to be transferable across road networks
and intersections. For this, we design a novel, intersection-
agnostic Markov Decision Process (MDP) [19] which we refer
to as Smart Edge-enabled trAffic Lights (SEAL). The central
contributions of this work can be summarized as follows:
• Design of a novel, intersection-agnostic MDP for representing traffic conditions at traffic lights, which we call SEAL. SEAL is designed to provide a general representation of traffic conditions at intersections.
• Proposal of a Federated Reinforcement Learning (FedRL) approach for training RL decision-making policies for smart traffic light control.
• Improvement of the reward-communication cost trade-off associated with solving SEAL using our proposed FedRL approach, reducing communication costs by 36.24% on average while losing only 2.11% in average reward when compared to Centralized training.
II. SYSTEM DESCRIPTION
We now describe the system requirements for traffic infras-
tructure, data, and communication capabilities for our model.
Fig. 1 shows a typical traffic environment where our model
could be deployed. Our system considers a road network with
one or more intersections (depending on the road topology),
each equipped with a traffic light $k \in \mathcal{K}$, where $\mathcal{K}$ denotes the
set of traffic lights in the entire system. Each traffic light $k \in \mathcal{K}$
controls the traffic flow entering the intersection through its
incoming lanes. The set of such controlled lanes is denoted
by $\mathcal{L}_k$. A traffic controller, located either at each intersection
or at a server, calculates a "phase state" $\varphi_k^t$ for the
signals at a given traffic light $k$ at time-step $t$. The assigned
phase state is such that each signal will be assigned a
green, yellow, or red "signal state", represented by G, y, and r,
respectively. Therefore, a phase state is a string representing
the signal states of the traffic lights at all controlled lanes at an
intersection. For example, the phase state for an intersection
with eight controlled lanes would be GyrrGyrr. For a visual
example of phase states, refer to Fig. 2. Note that the phase
states are assigned such that the vehicles with conflicting
traffic flows are not allowed to access the intersection at once.
The length of the phase state is based on the number of
incoming controlled lanes at a given intersection. Our model
also expects that the vehicles obey the traffic regulations and
do not violate the assigned phase permissions indicated by
the traffic lights. Finally, in our system, we assume each
traffic light $k \in \mathcal{K}$ cannot change phase states until 4 seconds
have elapsed since $k$'s last phase state change. Further, we
assume each traffic light $k \in \mathcal{K}$ must change after 120 seconds
have elapsed since $k$'s last phase state change. These timings
are calculated in accordance with the U.S. federal highway
administration (FHWA) guidelines based on average traffic
behavior [20] and can be changed as per traffic regulatory
requirements. This is enforced for all training and evaluation.
Traffic infrastructure is either equipped with road-side sensors
installed within every controlled lane to measure traffic
parameters such as lane occupancy, average traffic flow speed,
etc. (detailed in §III), or relies on connected vehicles to report
such data to the traffic lights via the connected infrastructure.
Traffic lights are equipped with edge compute
resources to process the data and perform local learning. The
edge resources also enable connectivity among all traffic
lights within the traffic network, as well as with the centralized cloud
server, to enable global optimization of the learning models.
For simplicity, we assume the presence of a single deployed
edge-RSU server in the region that maintains communication
channels to all the traffic lights in that region to support additional
processes. Additionally, both the traffic lights and the edge-RSU
server are equipped with compute resources. As
a simplifying assumption, we assume compute resources at
both the traffic lights and the edge-RSU server are sufficient to
train policies for smart traffic decisions. Succinctly, this work
aims to improve the implicit reward-communication trade-off
associated with distributed learning solutions to support smart
traffic systems using FedRL.
III. PROPOSED SEAL MODEL DEFINITION
Here, we define the Smart Edge-enabled trAffic
Lights (SEAL) system. SEAL is modeled as a Markov
Decision Process (MDP) [19] with the goal to minimize
traffic congestion in road networks. SEAL's novelty is in
defining a general state space representation that can describe
current traffic conditions at a traffic light in an intersection-
agnostic way. This is necessary to support policy aggregation
in our FedRL approach (discussed later in §IV-C).
The work most similar to ours is that of Zhou et al.’s DRLE
framework in [15]. This work considers a distributed
multi-agent RL approach to smart traffic light control with
convergence guarantees. However, this does not consider the
possibility of traffic lights themselves training their own policy
Fig. 2. Example traffic light action transition graph over the phase states GGrGGr, yyryyr, rrGrrG, and rryrry. Consider that the given traffic light $k$'s current phase state is GGrGGr. If the action $a_k^t = 1$ at time-step $t$, then the phase state for $k$ will transition to yyryyr if sufficient time has elapsed since its last transition. Otherwise, its phase state remains the same, unless too much time has elapsed since its last change.
networks. Instead, the DRLE framework sets traffic lights
to communicate their local state observations to a roadside
server to perform state aggregation to form a “global” state.
This global statefulness allows for convergence guarantees, but
may not be attractive for future solutions where traffic lights
may collect large volumes of data (e.g., hyper-spectral images,
videos, LIDAR imaging, etc.) to make decisions. Having large
numbers of traffic lights stream these data in real time to
make timely decisions may not scale well. Thus, we consider
SEAL. Investigating possible convergence bounds for SEAL
is of interest but is beyond the scope of this work.
A. Action Space
In prior works investigating the use of RL for traffic light
control, various kinds of actions have been considered. These
include phase switch [15], [16], phase duration [9], and the
phase state itself [7]. The phase state considers a discrete
space of size $n$, where $n$ is the number of possible states for
a traffic light. Since the phase state depends on the number of
controlled lanes, and hence the traffic lights at an intersection,
it is infeasible to aggregate knowledge among intersections
with varying topologies. For this work, we consider a simpler
phase switch approach in which each traffic light $k \in \mathcal{K}$
takes an action $a_k^t \in \{0, 1\}$ in time-step $t$, where
$a_k^t = 1$ signifies that traffic light $k$ will attempt to change to
the next phase state. Otherwise, $a_k^t = 0$ signifies that no phase
state change will be attempted by traffic light $k$ at time-step $t$.
Note that if a traffic light $k$ attempts to change in some time-step
$t$ (i.e., $a_k^t = 1$), a change can only occur if enough time
has elapsed since its last change; further, a traffic light $k$ will
be forced to change its phase state, regardless of its action, if
too much time has elapsed since its last change. This is due
to the phase state timer (discussed in §II), which ensures policies
meet mandatory regulations related to road safety [20]. Refer
to Fig. 2 for an illustrated example of phase state logic and the
transitions made when $a_k^t = 1$.
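To make the action semantics above concrete, the following minimal sketch (in Python) shows how an action $a_k^t$ could be translated into an actual phase transition under the 4-second minimum and 120-second maximum dwell times from §II. The class and method names are illustrative assumptions, not our released implementation.

```python
MIN_PHASE_DURATION = 4    # seconds; minimum dwell time before a change is allowed (Sec. II)
MAX_PHASE_DURATION = 120  # seconds; a change is forced once this much time has elapsed

class PhaseController:
    """Hypothetical per-traffic-light controller applying the phase-switch action a_k^t."""

    def __init__(self, phase_cycle):
        self.phase_cycle = phase_cycle      # e.g., ["GGrGGr", "yyryyr", "rrGrrG", "rryrry"]
        self.phase_index = 0
        self.time_since_change = 0.0

    def step(self, action, dt=1.0):
        """Advance one time-step; `action` is a_k^t in {0, 1}. Returns the phase state."""
        self.time_since_change += dt
        force_change = self.time_since_change >= MAX_PHASE_DURATION
        allowed = self.time_since_change >= MIN_PHASE_DURATION
        if force_change or (action == 1 and allowed):
            self.phase_index = (self.phase_index + 1) % len(self.phase_cycle)
            self.time_since_change = 0.0
        return self.phase_cycle[self.phase_index]
```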
B. State Space
State space features consist of the following for a traffic
light $k$ in time-step $t$: lane occupancy ($o_k^t$), halted lane
occupancy ($h_k^t$), average speed ($\psi_k^t$), and phase state ratios ($\varphi_k^t(\cdot)$)
for each possible signal state (e.g., green, yellow, red).
1) Lane Occupancy: The average ratio of occupancy across
all lanes controlled by a traffic light $k$ in time-step $t$. Each
traffic light $k$ controls some set of lanes. Thus, we consider
the occupancy of a lane $l$ to be how much of the lane's length
(in meters) is occupied by vehicles (as a ratio). However, we
average this across all lanes controlled by traffic light $k$. The
formal definition for lane occupancy is provided below:
$$o_k^t \triangleq \frac{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{V}_l^t} \mathrm{len}(v)}{\sum_{l \in \mathcal{L}_k} \mathrm{len}(l)} \qquad (1)$$
where $\mathcal{L}_k$ is the set of lanes controlled by traffic light $k$, $\mathcal{V}_l^t$ is
the set of vehicles occupying lane $l$ in time-step $t$, and $\mathrm{len}(\cdot)$
is the length of the vehicle or lane (in meters).
2) Halted Lane Occupancy: SEAL's goal is to minimize
congestion in road systems. Thus, we consider how much of
a lane is occupied with halted vehicles. As such, we consider
$h_k^t$ to be the halted lane occupancy of traffic light $k$ in time-step
$t$, where we consider a vehicle to be halted if its current
speed is at most 0.1 meters/second. Thus, we define $h_k^t$ below:
$$h_k^t \triangleq \frac{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{H}_l^t} \mathrm{len}(v)}{\sum_{l \in \mathcal{L}_k} \mathrm{len}(l)} \qquad (2)$$
where $\mathcal{H}_l^t$ is the set of halted vehicles occupying lane $l$ in
time-step $t$.
3) Average Speed: We also consider the average speed ($\psi_k^t$)
among vehicles occupying lanes controlled by a traffic light $k$
at time-step $t$ as a feature. Similar to the other features, this
one is also normalized as a ratio in the range $[0, 1]$. The formal
definition is below:
$$\psi_k^t \triangleq \begin{cases} \dfrac{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{V}_l^t} \min\!\left(\mathrm{spd}_v^t,\, \mathrm{spd}_l^{\max}\right)}{\sum_{l \in \mathcal{L}_k} \sum_{v \in \mathcal{V}_l^t} \mathrm{spd}_l^{\max}} & \text{if } \left|\bigcup_{l \in \mathcal{L}_k} \mathcal{V}_l^t\right| \geq 1 \\ 1.0 & \text{otherwise} \end{cases} \qquad (3)$$
where $\mathrm{spd}_v^t$ is the moving speed of vehicle $v$ in time-step $t$
and $\mathrm{spd}_l^{\max}$ is the speed limit (or maximum speed allowed)
on lane $l$. The second case in Eq. (3) is for cases when there
are no vehicles occupying lanes controlled by traffic light $k$.
4) Phase State Ratio: The current phase state of a traffic
light has been used as a feature in prior works (namely, [15]).
This is possible because simple road networks are considered
with homogeneous intersections where traffic lights have the
same sets of possible phase states. To handle heterogeneous
phase state sets across different intersection types, we instead
represent the ratio of how much each possible traffic light signal
(e.g., green, yellow, red) makes up the entire phase state. Thus,
we denote the ratio of a traffic light signal for a traffic light $k$ in
time-step $t$ by $\varphi_k^t(\cdot) \in [0, 1]$. For instance, given a phase state
GGrGGr at traffic light $k$ in time-step $t$, we denote how much
of the phase state is red lights, r, by $\varphi_k^t(\mathrm{r}) = 2/6$ (similarly,
for prioritized green lights, G, $\varphi_k^t(\mathrm{G}) = 4/6$). Because we
represent the ratio rather than assign an arbitrary discrete value
to represent the entire phase state, the representation is general
and can be used across different road networks with various
intersections. It should be noted that $\sum_{p \in \mathcal{P}_k} \varphi_k^t(p) = 1 \;\; \forall (k, t)$,
where $\mathcal{P}_k$ is the set of signal states for traffic light $k$.
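As a concrete illustration of Eqs. (1)–(3) and the phase state ratios, the sketch below computes the four state features for a single traffic light from per-lane measurements. The input structure (a list of per-lane records with vehicle lengths and speeds) is a hypothetical stand-in for the actual sensor/SUMO interface.

```python
def state_features(lanes, phase_state, halt_speed=0.1):
    """Compute (o_k^t, h_k^t, psi_k^t, phase ratios) for one traffic light.

    `lanes` is a list of dicts, one per controlled lane, each with keys:
      "length" (m), "speed_limit" (m/s), and "vehicles": list of (veh_length, veh_speed).
    `phase_state` is a string such as "GGrGGr".
    """
    total_lane_len = sum(l["length"] for l in lanes)
    occupied = sum(v_len for l in lanes for v_len, _ in l["vehicles"])
    halted = sum(v_len for l in lanes for v_len, v_spd in l["vehicles"] if v_spd <= halt_speed)

    occupancy = occupied / total_lane_len          # Eq. (1)
    halted_occupancy = halted / total_lane_len     # Eq. (2)

    # Eq. (3): normalized speed ratio, defaulting to 1.0 when no vehicles are present.
    num = sum(min(v_spd, l["speed_limit"]) for l in lanes for _, v_spd in l["vehicles"])
    den = sum(l["speed_limit"] for l in lanes for _ in l["vehicles"])
    avg_speed = num / den if den > 0 else 1.0

    # Phase state ratios, e.g., {"G": 4/6, "y": 0.0, "r": 2/6} for "GGrGGr".
    ratios = {s: phase_state.count(s) / len(phase_state) for s in ("G", "y", "r")}
    return occupancy, halted_occupancy, avg_speed, ratios
```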
C. Reward Function
The goal of SEAL is to reduce congestion in a given road
network. With that in mind, we let the reward $r_k^t$ for a traffic
light $k$ at time-step $t$ be a function of both lane occupancy ($o_k^t$)
and halted lane occupancy ($h_k^t$). We define it below:
$$r_k^t \triangleq -\left(o_k^t + h_k^t\right)^2. \qquad (4)$$
These state space features are summed to penalize traffic lights
with more congestion. We let halted vehicles incur more
penalty since they contribute to both lane occupancy and halted
lane occupancy. From there, we define the total reward, $r^t$,
over the whole road network at time-step $t$ as
$$r^t \triangleq \sum_{k \in \mathcal{K}} r_k^t. \qquad (5)$$
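A minimal sketch of Eqs. (4) and (5), assuming the per-light occupancy features are computed as in the state-space sketch above:

```python
def local_reward(occupancy, halted_occupancy):
    """Eq. (4): penalize congestion; halted vehicles count twice (once in each term)."""
    return -((occupancy + halted_occupancy) ** 2)

def total_reward(per_light_features):
    """Eq. (5): sum of local rewards over all traffic lights in the road network."""
    return sum(local_reward(o, h) for (o, h) in per_light_features)
```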
D. Communication Model
As discussed in §II, we require robust communication capa-
bilities between vehicles, traffic lights and edge-enabled RSUs
to support smart traffic control. Depending on the training
approach (detailed in §IV), a traffic control system must ac-
count for different communication channel utilization and their
incurred costs. We therefore consider the following 6 different
types of possible communications that can take place under
the SEAL system: (i) policy network parameters from edge-
RSU to traffic light, (ii) policy network parameters from traffic
light to edge-RSU, (iii) action from edge-RSU to traffic light,
(iv) observations from traffic light to edge-RSU, (v) vehicle-
to-infrastructure (V2I) communication from vehicle to traffic
light, and (vi) congestion ranks from edge-RSU to traffic
light. We will evaluate the associated communication costs
while training our proposed model in §VI. To reiterate,
we assume that edge-enabled traffic lights and the edge-RSU
have sufficient compute capacity to perform policy training.
Thus, we do not consider compute constraints and focus on
communication cost instead.
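The sketch below illustrates one way the communication cost could be accounted for during training by tallying the bytes of each of the six communication types listed above. The enumeration and the tracker interface are illustrative assumptions rather than the exact bookkeeping used in our experiments.

```python
from collections import defaultdict
from enum import Enum, auto

class CommType(Enum):
    PARAMS_RSU_TO_TL = auto()   # (i) policy network parameters, edge-RSU -> traffic light
    PARAMS_TL_TO_RSU = auto()   # (ii) policy network parameters, traffic light -> edge-RSU
    ACTION_RSU_TO_TL = auto()   # (iii) action, edge-RSU -> traffic light
    OBS_TL_TO_RSU = auto()      # (iv) observations, traffic light -> edge-RSU
    V2I_VEH_TO_TL = auto()      # (v) vehicle -> traffic light (V2I)
    RANKS_RSU_TO_TL = auto()    # (vi) congestion ranks, edge-RSU -> traffic light

class CommCostTracker:
    """Accumulates total bytes transmitted, broken down by communication type."""

    def __init__(self):
        self.bytes_by_type = defaultdict(int)

    def record(self, comm_type: CommType, num_bytes: int, count: int = 1):
        """Record `count` messages of `comm_type`, each of size `num_bytes`."""
        self.bytes_by_type[comm_type] += num_bytes * count

    def total_bytes(self) -> int:
        return sum(self.bytes_by_type.values())
```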
IV. TRAINING ALGORITHMS
The goal of SEAL is to learn optimal traffic light control
policies to minimize congestion for a given road network. To
solve the SEAL model, we adopt model-free reinforcement
learning techniques. More specifically, we will incorporate the
recent Proximal Policy Optimization (PPO) [21] algorithm.
Solutions to SEAL will aim to find a smart traffic light control
policy, $\pi$, such that
$$Q^\pi(s, a) = (1 - \gamma) \cdot \mathbb{E}\!\left[\, \sum_{t=1}^{\infty} \gamma^{t-1} \cdot r^t \;\Big|\; s^1 = s,\, a^1 = a \right] \qquad (6)$$
is maximized, where the policy is a decision-making function
$\pi : \mathcal{S} \mapsto \mathcal{A}$ and $\gamma$ is the discount factor. Eq. (6) is known
as the Q-function. The optimal policy that maximizes the Q-function
is defined as $\pi^\star = \arg\max_\pi Q^\pi(s, \pi(s))\ \forall s$. For the
sake of convenience, we denote $Q^\star(s, a) = Q^{\pi^\star}(s, a)\ \forall (s, a)$,
where $s$ and $a$ are a state and action, respectively.

Fig. 3. Training approaches considered for solving SEAL: Centralized, Decentralized, and Federated training.
RL algorithms can be implemented in real-world systems
in various ways. As such, we consider 3 different approaches
for facilitating the PPO algorithm to solve SEAL: (i) centralized
training, (ii) decentralized training, and (iii) federated
training. A visual example of how these approaches compare
can be found in Fig. 3. For a comprehensive overview on the
theory of RL, please refer to [22].
A. Centralized Training
Under centralized training, there is a single policy network
that is hosted on the nearby edge-RSU. At each time-step t,
each traffic light $k \in \mathcal{K}$ submits its current state $s_k^t$ to
the edge-RSU, which then returns an action $a_k^t$ to traffic
light $k$. Since a single policy network is learning across all
observations in the system, it is expected to learn the optimal
policy faster than the other approaches. However, this comes at the
expense of incurring a large amount of overhead in terms of
communication cost, because the traffic lights are in constant
communication with the edge-RSU to take actions. For this
work, we view this approach as an upper bound in terms of
how quickly the optimal policy, $\pi^\star$, can be learned.
1) Centralized Training Communication Costs: Decision-
making in a centralized manner requires traffic lights to constantly
communicate with the edge-RSU, leading to higher communication
costs. Under Centralized training, the following communications
take place at each time-step: actions from the edge-RSU
to traffic lights, observations from traffic lights to edge-RSUs,
and V2I communications from vehicles to traffic lights.
B. Decentralized Training
Unlike centralized training, decentralized training equips
each traffic light $k \in \mathcal{K}$ with a policy network that aims to
independently learn an optimal local policy, $\pi_k^\star$, for traffic light $k$,
optimizing reward using only observations local to
that traffic light. In essence, if all traffic lights in the system
are able to learn an optimal policy, then the entire road network
can benefit. Zhou et al. in [15] proved that a
decentralized training approach using per-traffic-light policies
for smart traffic light control can converge to a centralized
approach if given infinite time. In general, this approach can
approach if given infinite time. In general, this approach can
attain good performance if given enough time. While the
decentralized approach is bested by the centralized approach
in finding an optimal policy, since the latter is learning from
global observations, the former approach is of interest as it
requires less communication.
1) Decentralized Training Communication Costs: In the
decentralized case, since the traffic lights never communicate
to the edge-RSU for making decisions, little communication
occurs. The only communication that takes place is V2I
communication from vehicles to traffic lights.
C. Federated Training
With the expectation that decentralized training will not
perform as well as centralized training due to policies learning
over fewer observations, but will require less communication,
we wish to achieve the best of both worlds. A novel contribu-
tion of this work is that we leverage the findings of the recent
federated learning (FL) paradigm [17], [18] for distributed
systems. Here we apply it to decentralized training to allow
the traffic lights to learn from each other without needing
to communicate raw data. We refer to this notion aptly as
Federated Reinforcement Learning (FedRL) [23], [24]. FL has
been shown to reduce communication cost in the literature [25]
while providing an immediate layer of privacy because no
raw data are communicated. These are crucial advantages for
smart traffic light control for future systems. For instance,
consider a system that considers live video feed as a feature in
the state space representation. Because identifying information
(e.g., license plate numbers and faces of pedestrians) may be
included, privacy is crucial. Additionally, such data may be
very large and incur hefty data transmission costs. As such,
we will focus on the benefit of federated training for smart
traffic light control w.r.t. the trade-off between communication
cost on the system and maximizing reward.
In FedRL, the traffic light agents training their own policy
networks will periodically communicate the learned policy
network parameters to the edge-RSU. The edge-RSU will
then aggregate them using an averaging function. The newly
aggregated policy network parameters are then communicated
back to the traffic lights for further learning. Aggregation will
occur after a fixed number of time-steps. We refer to this
time period as a frame and denote it by $F$. We denote the
policy network parameters learned by traffic light $k$ at the end
of frame $F$ by $\omega_k^F$.
In [18], the federated averaging (FedAvg) technique was
proposed. This technique addresses the challenge of non-independent
and identically distributed (non-iid) data distributions
across different client devices. FedAvg uses a weighted average
of the clients' locally-updated model parameters based on
the number of data items owned by each client. This weighting
combats non-iid data distributions common in distributed
systems. For the sake of this work, we make the simplifying
assumption that traffic lights have identical data sampling rates,
resulting in the same number of observations. Below is the
definition of the averaging we consider,
$$\omega^{F+1} \triangleq \sum_{k \in \mathcal{K}} \frac{1}{|\mathcal{K}|}\, \omega_k^{F+1} \qquad (7)$$
Fig. 4. Example Grid-3×3 road network with heterogeneous intersection types (border roads have 1 lane per direction, while more central roads have up to 2 lanes running east/west). Note that the number of lanes increases as roads become more central.
where the newly-aggregated, global parameters $\omega^{F+1}$ are the
average of the parameters collected from all the traffic lights.
These parameters are then sent back to the traffic lights at the
start of frame $F+1$ to resume training. Asynchronous aggregation
techniques to address heterogeneous data sampling
rates among traffic lights are beyond the scope of this work.
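A minimal sketch of one end-of-frame aggregation round under Eq. (7), assuming each traffic light exposes its policy network parameters as a list of NumPy arrays; the surrounding collect/redistribute loop and method names are illustrative, not our exact implementation.

```python
import numpy as np

def fedavg(local_params):
    """Eq. (7): uniform average of per-traffic-light parameter lists.

    `local_params` maps a traffic light id to a list of NumPy arrays (one per layer).
    Returns the aggregated global parameters to redistribute at the next frame.
    """
    num_clients = len(local_params)
    layer_lists = list(local_params.values())
    return [
        sum(client[layer_idx] for client in layer_lists) / num_clients
        for layer_idx in range(len(layer_lists[0]))
    ]

def end_of_frame(traffic_lights):
    """Illustrative end-of-frame round: collect local parameters, average, redistribute."""
    collected = {tl.id: tl.get_policy_params() for tl in traffic_lights}  # (ii) TL -> edge-RSU
    global_params = fedavg(collected)
    for tl in traffic_lights:
        tl.set_policy_params(global_params)                               # (i) edge-RSU -> TL
```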
1) Federated Training Communication Costs: With fed-
erated training, communications that occur at each time-
step are mostly identical to that for decentralized training
(discussed in §IV-B1). The only difference is that at the end of
each frame (which occurs less frequently than each time-step),
2 additional communications occur: policy network parameters
from edge-RSU to traffic lights and policy network parameters
from traffic lights to edge-RSU.
V. EXPERIMENT DESIGN
We implement the SEAL framework using the Python
programming language. Further, we implement the training
approaches described in §IV using the SUMO traffic simula-
tor [26] for the traffic simulation and Ray’s RLlib [27] toolbox
for the RL pipeline. Our software serves as the interface between
these tools to fit our work's specific needs. Thus, we
train the policy networks with PPO only in simulations
using these tools.
A. Considered Road Network Topologies
For training the policies using Ray’s RLlib [27] and per-
forming evaluation via simulation, we consider 3 road network
topologies (an example is provided in Fig. 4): (a) Grid-3×3, (b) Grid-5×5,
and (c) Grid-7×7. Roads on the border of the
network have 1 lane going in each direction, with the number
of lanes going north/south and east/west increasing by 1
when approaching the central north/south and east/west roads.
This is to introduce heterogeneous road network topologies.
For an example, refer to Fig. 4. For simplicity, we do not
allow vehicles to make turns to prevent the vehicles from
getting stuck in the simulation. Note that this is a limitation
of SUMO and SEAL’s design is general enough to support
turning vehicles. Each training approach (discussed in §IV)
will learn policies over each road network topology. Vehicle
routes for training and evaluation are randomly generated
using the randomTrips.py module provided by SUMO,
with 360 vehicles per lane per hour (VPLPH) generated.
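For reference, a hypothetical invocation of SUMO's randomTrips.py for one of these networks might look like the sketch below; the file names are placeholders, and the insertion period would be tuned per network so that the generated demand corresponds to 360 VPLPH.

```python
import subprocess

# Hypothetical example: generate random trips for a 1-hour (3600 s) episode on a grid network.
# randomTrips.py ships with SUMO (typically under $SUMO_HOME/tools). The --period value
# controls the insertion rate and would be chosen per network to match 360 VPLPH.
subprocess.run([
    "python", "randomTrips.py",
    "-n", "grid-3x3.net.xml",     # placeholder road network file
    "-o", "grid-3x3.trips.xml",   # placeholder output trips file
    "-b", "0", "-e", "3600",      # begin and end times (seconds)
    "-p", "2.0",                  # insertion period (seconds between departures); illustrative value
    "--seed", "42",
], check=True)
```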
Fig. 5. Learning curves (mean episode reward vs. training time-steps) with each training approach (Federated, Centralized, Decentralized) on each road network: (a) Grid-3×3, (b) Grid-5×5, and (c) Grid-7×7.
B. Training Parameters
We use Proximal Policy Optimization (PPO) [21] to
train policies to solve SEAL. We use the following hyper-
parameters. The learning rate is $5 \times 10^{-5}$. The SGD minibatch size
is 128. The PPO clip parameter is set to 0.3. The target value for KL
divergence is 0.3. Train batch size is 4000 time-steps. (Note
policy network parameter aggregation, described in §IV-C,
occurs every 4000 steps.) Roll-out fragment length (size of
batches collected from each worker) is 200. We use General-
ized Advantage Estimator (GAE) and the GAE parameter is
set to 1.0. The VF clip parameter is set to 10.
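For reproducibility, these hyper-parameters map onto an RLlib-style PPO configuration roughly as sketched below. Exact key names vary across RLlib versions, so this should be read as an illustrative mapping rather than our verbatim configuration.

```python
# Illustrative RLlib-style PPO configuration mirroring Sec. V-B; key names follow the
# classic dict-based RLlib API and may differ slightly between RLlib versions.
ppo_config = {
    "lr": 5e-5,                       # learning rate
    "sgd_minibatch_size": 128,        # SGD minibatch size
    "clip_param": 0.3,                # PPO clip parameter
    "kl_target": 0.3,                 # target KL divergence
    "train_batch_size": 4000,         # time-steps per training batch (and per aggregation frame)
    "rollout_fragment_length": 200,   # batch size collected from each worker
    "use_gae": True,
    "lambda": 1.0,                    # GAE parameter
    "vf_clip_param": 10.0,            # value function clip parameter
}
```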
VI. RESULTS & DISCUSSION
A. Reward Evaluation During Training
First, we compare the different training strategies discussed
in §IV in terms of the reward achieved by the policy networks
during training. In Fig. 5, we can see the learning curves
of each training strategy when used on each of the 3 road
network topologies described in §V-A. From these results,
we make the following general observations:
(i) Centralized training generally achieves the greatest reward,
(ii) Decentralized training generally achieves the worst
reward, and (iii) Federated training achieves greater reward
than Decentralized training (and often nearly matches that of
Centralized training). These observations are fairly intuitive.
Since Centralized training trains a single policy network over
all observations collected in the environment, it has more
to learn from. Conversely, with Decentralized training, each
traffic light learns independently using its own observations
meaning each traffic light’s policy learns over fewer ob-
servations. Since Federated training expands on Decentralized
training by allowing parameter aggregation among the policy
networks learned by the traffic lights, the traffic lights are
essentially able to learn from each other without explicitly
sharing observations and other raw data. More specifically, we
find that Decentralized training suffers from an 8.01% drop
in reward compared to Centralized training. Meanwhile,
Federated training only suffers from a 2.11% drop in reward
compared to Centralized training.
B. Communication Cost Evaluation During Training
Given Federated training is able to more closely approxi-
mate the reward achieved by Centralized training when com-
pared to Decentralized training, we next compare the com-
munication costs associated with each training strategy. We
Fig. 6. Communication cost (i.e., data size in bytes) transmitted during training time under each training strategy (Federated, Centralized, Decentralized) for each road network: (a) Grid-3×3, (b) Grid-5×5, and (c) Grid-7×7.
do this by tracking the number of times each communication
type occurs (refer to §III-D) and weighting each occurrence
by the number of bytes needed to transmit the data
for that communication. In Fig. 6, we compare the size of the
data needed to be communicated through the system during
training using each of the training strategies under each of the
road network topologies. There is a glaring difference in terms
of communication efficiency between Centralized and De-
centralized/Federated. Because Centralized training requires
constant communication between the edge-RSU and the traffic
lights in order to transmit observations, actions, and other data,
it naturally incurs much greater communication cost. Mean-
while, Decentralized and Federated training greatly reduce this
cost due to them keeping communication mostly between the
vehicles and the traffic light. The only communication between
the Edge-RSU and the traffic lights under Federated training
is when policy network parameters are aggregated after each
frame concludes. It is interesting to note that Federated is able
to best Decentralized training in terms of communication cost
in these results. This is due to the Federated training strategy
producing better policy networks and removing vehicles from
the system more efficiently than the Decentralized model
resulting in less vehicle-to-infrastructure communication.
More numerically speaking, from our results Decentralized
and Federated training are able to achieve a communication
cost reduction of 34.65% and 36.24%, respectively, when
compared to Centralized training.
C. Trained Policy Network Performance
Here, we are interested in two questions: (1) Can RL-based
traffic lights trained with SEAL improve traffic conditions?
(2) Can policy networks trained with SEAL perform well when
used on road networks they were not trained on? To answer
the first question, we compare our trained policy networks
against a standard traffic light control baseline: a pre-timed
control [20] where traffic lights cycle through phase states at
fixed time intervals. We support this comparison using real-
world traffic metrics to evaluate the experience of drivers in the
system. Namely, we consider both “Travel Time” and “Waiting
Time”. The former is the total amount of (simulation) time
taken for vehicles to reach their destination; the latter is the
amount of (simulation) time vehicles are waiting to move at a
traffic light. The results of this evaluation are shown in Fig. 7.
We see that in nearly all cases, the RL-based training strategies
Fig. 7. Evaluation of trained policy networks on each road network using trip metrics, namely Travel Time (top row) and Waiting Time (bottom row), for policies trained on Grid-3x3, Grid-5x5, and Grid-7x7 and tested on each network. We compare the results to a Pre-Timed phase transition model as a baseline. Results confirm the RL-based solutions generally outperform the baseline.
outperform the Pre-Timed baseline. The only outlier
is the Centralized trainer when learning in the Grid-3×3
road network. As for the second question regarding possible
transferability of the policy networks, we observe in Fig. 7 that
the policy networks generally perform comparably
to one another (ignoring the Centralized trainer when trained
on Grid-3×3). This generally holds true when policy networks
tested on the same road network they were trained on
are compared to policy networks trained on other networks.
These results serve to motivate the use of RL-based approaches
for future smart traffic applications. We find that (on average)
Centralized, Decentralized, and Federated reduce travel time
compared to Pre-Timed by 11.63%, 18.16%, and 18.14%,
respectively. Also, we find that (on average) Centralized,
Decentralized, and Federated reduce waiting time compared
to Pre-Timed by 42.81%, 58.92%, and 58.93%, respectively.
The underperformance of Centralized here, compared to De-
centralized and Federated, is likely due to the outlier scenarios
when it trains on Grid-3×3. We attribute these anomalies to
potential overfitting, though further experiments are needed.
VII. RELATED WORKS
Improving traffic light signal control in road networks has
been a widely studied subject. Much work is being done
to improve traffic conditions by developing adaptive traffic
signal control (ATSC) where traffic lights adapt intelligently
based on current traffic demands [28]. Many different tech-
niques have been considered for realizing ATSC. Early works
considered linear optimization frameworks [29]. While linear
programming is straightforward, it is not an appropriate match
for ATSC because of the highly dynamic nature of real-
world traffic systems making accurate objective functions
and constraints difficult to define. Genetic (or evolutionary)
algorithms have also been considered in prior works [30]. In
the early 2000s, initial works focusing on the application of
Reinforcement Learning (RL) techniques for ATSC were pub-
lished [6], [12]. While seminal, these initial works considered
very simple road network scenarios. With advancements in
both vehicular communication [2], [3] and RL algorithms [22],
interest in RL for ATSC (or smart traffic) has been renewed.
However, recent RL algorithms use more complex policy
networks that require more compute resources to train.
Works considering RL for smart traffic light signal control
have greatly increased over the years [7], [9], [10]. Because of
the large number of entities in a traffic system (e.g., multiple
traffic lights, multiple vehicles), multi-agent RL techniques
have been applied to smart traffic light control [16], [11]. El-
Tantawy et al. in [16] propose a multi-agent RL framework
where agents can either be independent or collaborative in how
they make decisions with other traffic light agents. Chu et al.
in [8] propose a decentralized, multi-agent RL framework to
provide robust learning using a scalable approach. Chen
et al. in [5] propose a decentralized actor-critic model and a
difference reward method to accelerate the convergence of the
trained policies for smart traffic light control. Mousavi et al.
in [13] study both policy- and value-based deep reinforcement
learning approaches for smart traffic light control. However,
they only consider a single intersection, where the state
space is a screenshot of the intersection provided by a traffic
simulator. These works focus on improving training first and
foremost. As a result, the communication cost for training
these policies is neglected.
Edge Computing (EC) [4], [31] is a recent enabling technol-
ogy that pushes compute resources to the network edge. This
has become an increasingly popular context for deploying AI
(e.g., machine learning, deep learning, and RL) services to the
network edge to provide low-latency intelligence. A significant
recent work by Zhou et al. in [15] studied the applicability of
edge computing for decentralized RL for smart traffic lights. A
central contribution of that work is its theoretical guarantees,
which show that the proposed decentralized framework can
achieve near-optimal traffic reduction if given
enough time. Different from this work, we design a framework
that allows heterogeneous traffic lights to train policy networks
in a federated manner to reduce communication costs.
The central gap in the literature related to RL for smart
traffic light control is that the trade-off between reward and
communication cost has been neglected. Additionally, recent
advancements in the realm of Federated Learning (FL) or,
more specifically, Federated Reinforcement Learning (FedRL)
have yet to be applied to the smart traffic control problem.
VIII. CONCLUSIONS
In closing, this work is, to the best of our knowledge,
the first to approach smart traffic light control using Federated
Reinforcement Learning (FedRL) in an edge computing-enabled
system. We do this by proposing SEAL,
an intersection-agnostic Markov Decision Process for smart
traffic light control to support aggregating learned policy
network parameters across heterogeneous intersection types.
This allows traffic lights to learn from each other’s experiences
without sharing raw experience data which reduces communi-
cation workloads (while providing some level of privacy). Our
experiments demonstrate that SEAL combined with FedRL
approach is able to closely match the rewards provided by a
Centralized training approach (only a 2.11% decrease) when
compared to the Decentralized approach, which shows an 8.01%
drop in reward. Further, our FedRL approach reduces the
communication cost by 36.24% when compared to Central-
ized training. Hence, FedRL improves the implicit reward-communication
trade-off for training smart traffic systems in a
distributed manner. In the future, we aim to extend our work to further
analyze the theoretical bounds of SEAL and to study its
effectiveness in small robotic testbed systems.
REFERENCES
[1] INRIX, “Congestion costs each American nearly 100 hours, $1,400
a year,” Mar 2020. [Press release]. Retrieved from https://inrix.com/
press-releases/2019-traffic-scorecard-us/.
[2] M. S. Anwer and C. Guy, “A survey of VANET technologies,” Journal
of Emerging Trends in Computing and Information Sciences.
[3] S. K. Bhoi and P. M. Khilar, “Vehicular communication: a survey, IET
networks, vol. 3, no. 3, pp. 204–217, 2014.
[4] P. Mach and Z. Becvar, “Mobile edge computing: A survey on archi-
tecture and computation offloading, IEEE Communications Surveys &
Tutorials, vol. 19, no. 3, 2017.
[5] Y. Chen, C. Li, W. Yue, H. Zhang, and G. Mao, “Engineering a
large-scale traffic signal control: A multi-agent reinforcement learning
approach,” IEEE INFOCOM 2021 Workshops, pp. 1–6, 2021.
[6] M. A. Wiering, “Multi-agent reinforcement learning for traffic light con-
trol,” in Machine Learning: Proceedings of the Seventeenth International
Conference (ICML’2000), pp. 1151–1158, 2000.
[7] L. Prashanth and S. Bhatnagar, “Reinforcement learning with function
approximation for traffic signal control, IEEE Transactions on Intelli-
gent Transportation Systems, vol. 12, no. 2, pp. 412–421, 2010.
[8] T. Chu, J. Wang, L. Codecà, and Z. Li, “Multi-agent deep reinforcement
learning for large-scale traffic signal control,” IEEE Transactions on
Intelligent Transportation Systems, vol. 21, pp. 1086–1095, 2020.
[9] M. Aslani, M. S. Mesgari, and M. Wiering, “Adaptive traffic signal
control with actor-critic methods in a real-world traffic network with
different traffic disruption events, Transportation Research Part C-
emerging Technologies, vol. 85, 2017.
[10] X. Wang, L. Ke, Z. Qiao, and X. Chai, “Large-scale traffic signal control
using a novel multiagent reinforcement learning, IEEE Transactions on
Cybernetics, vol. 51, pp. 174–187, 2021.
[11] T. Tan, T. Chu, and J. Wang, “Multi-agent bootstrapped deep q-network
for large-scale traffic signal control, 2020 IEEE CCTA, 2020.
[12] B. Abdulhai, R. Pringle, and G. J. Karakoulas, “Reinforcement learning
for true adaptive traffic signal control, Journal of Transportation
Engineering-asce, vol. 129, pp. 278–285, 2003.
[13] S. S. Mousavi, M. Schukat, and E. Howley, “Traffic light control using
deep policy-gradient and value-function-based reinforcement learning,
IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[14] P. G. Balaji, X. German, and D. Srinivasan, “Urban traffic signal control
using reinforcement learning agents,” Iet Intelligent Transport Systems,
vol. 4, pp. 177–188, 2010.
[15] P. Zhou, X. Chen, Z. Liu, T. Braud, P. Hui, and J. Kangasharju, “DRLE:
Decentralized reinforcement learning at the edge for traffic light control
in the iov, IEEE Transactions on Intelligent Transportation Systems,
vol. 22, no. 4, pp. 2262–2273, 2020.
[16] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent rein-
forcement learning for integrated network of adaptive traffic signal
controllers (marlin-atsc): Methodology and large-scale application on
downtown toronto, IEEE Transactions on Intelligent Transportation
Systems, vol. 14, pp. 1140–1150, 2013.
[17] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and
D. Bacon, “Federated learning: Strategies for improving communication
efficiency,” arXiv preprint arXiv:1610.05492, 2016.
[18] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,
“Communication-efficient learning of deep networks from decentralized
data,” in Artificial intelligence and statistics, PMLR, 2017.
[19] R. Bellman, “A Markovian decision process, Journal of mathematics
and mechanics, vol. 6, no. 5, pp. 679–684, 1957.
[20] P. Koonce and L. Rodegerdts, “Traffic signal timing manual.,” tech. rep.,
United States. Federal Highway Administration, 2008.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” ArXiv, vol. abs/1707.06347,
2017.
[22] Y. Li, “Deep reinforcement learning: An overview,” arXiv preprint
arXiv:1701.07274, 2017.
[23] H. H. Zhuo, W. Feng, Q. Xu, Q. Yang, and Y. Lin, “Federated
reinforcement learning,” 2019.
[24] B. Liu, L. Wang, and M. Liu, “Lifelong federated reinforcement learn-
ing: a learning architecture for navigation in cloud robotic systems,”
IEEE Robotics and Automation Letters, pp. 4555–4562, 2019.
[25] N. Hudson, M. J. Hossain, M. Hosseinzadeh, H. Khamfroush,
M. Rahnamay-Naeini, and N. Ghani, “A framework for edge intelligent
smart distribution grids via federated learning,” in 2021 IEEE ICCCN,
pp. 1–9, 2021.
[26] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent
development and applications of SUMO-Simulation of Urban MObility,”
International journal on advances in systems and measurements, vol. 5,
no. 3&4, 2012.
[27] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gon-
zalez, M. Jordan, and I. Stoica, “RLlib: Abstractions for distributed rein-
forcement learning,” in International Conference on Machine Learning,
PMLR, 2018.
[28] Z. Liu, “A survey of intelligence methods in urban traffic signal con-
trol,” IJCSNS International Journal of Computer Science and Network
Security, vol. 7, no. 7, 2007.
[29] M. Dotoli, M. P. Fanti, and C. Meloni, A signal timing plan formulation
for urban traffic control, Control engineering practice, 2006.
[30] H. Ceylan and M. G. Bell, “Traffic signal timing optimisation based on
genetic algorithm approach, including drivers’ routing, Transportation
Research Part B: Methodological, vol. 38, no. 4, pp. 329–342, 2004.
[31] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “Mo-
bile edge computing: Survey and research outlook, arXiv preprint
arXiv:1701.01090, 2017.
... While these approaches allow for developing a global model, their applicability is restricted to specific types of intersections. Scholars have attempted a unified representation of agent states to address this limitation, specifically exploring intersection-agnostic state representations for heterogeneous intersections 18 . However, on the one hand, the accumulation of too many factors hinders a deep network's ability to comprehend the environmental state more effectively 7 . ...
... Centralized learning models can achieve optimal global results but require excessive communication and computational resources 40 . Distributed learning solutions allocate agents to learn local models, hindering generalization across different intersections 18 . This paper seeks to establish a robust model capable of aggregating knowledge from intelligent agents at various intersections with lower communication and computational costs. ...
Article
Full-text available
Intelligent Transportation has seen significant advancements with Deep Learning and the Internet of Things, making Traffic Signal Control (TSC) research crucial for reducing congestion, travel time, emissions, and energy consumption. Reinforcement Learning (RL) has emerged as the primary method for TSC, but centralized learning poses communication and computing challenges, while distributed learning struggles to adapt across intersections. This paper presents a novel approach using Federated Learning (FL)-based RL for TSC. FL integrates knowledge from local agents into a global model, overcoming intersection variations with a unified agent state structure. To endow the model with the capacity to globally represent the TSC task while preserving the distinctive feature information inherent to each intersection, a segment of the RL neural network is aggregated to the cloud, and the remaining layers undergo fine-tuning upon convergence of the model training process. Extensive experiments demonstrate reduced queuing and waiting times globally, and the successful scalability of the proposed model is validated on a real-world traffic network in Monaco, showing its potential for new intersections.
... Depending on the application, there may be benefit in aggregating across regions to improve the performance of the trained models. Relevant real-world use cases where hierarchical FL can be applied include (but are not limited to): (i) smart farming systems [14], (ii) smart traffic control systems [15], and (iii) smart energy grid systems [16]. ...
... In such a system, traffic lights can learn optimal traffic-light control policies based on current traffic conditions. Federated Reinforcement Learning has been shown to improve the implicit trade-off between model performance and communication costs [15]. Hierarchical FL can enable cities to collaboratively train such FL-enabled infrastructure, with partial models being developed in a regional/ geographical context, and then aggregated to support particular application requirements. ...
Preprint
Full-text available
Federated learning has shown enormous promise as a way of training ML models in distributed environments while reducing communication costs and protecting data privacy. However, the rise of complex cyber-physical systems, such as the Internet-of-Things, presents new challenges that are not met with traditional FL methods. Hierarchical Federated Learning extends the traditional FL process to enable more efficient model aggregation based on application needs or characteristics of the deployment environment (e.g., resource capabilities and/or network connectivity). It illustrates the benefits of balancing processing across the cloud-edge continuum. Hierarchical Federated Learning is likely to be a key enabler for a wide range of applications, such as smart farming and smart energy management, as it can improve performance and reduce costs, whilst also enabling FL workflows to be deployed in environments that are not well-suited to traditional FL. Model aggregation algorithms, software frameworks, and infrastructures will need to be designed and implemented to make such solutions accessible to researchers and engineers across a growing set of domains. H-FL also introduces a number of new challenges. For instance, there are implicit infrastructural challenges. There is also a trade-off between having generalised models and personalised models. If there exist geographical patterns for data (e.g., soil conditions in a smart farm likely are related to the geography of the region itself), then it is crucial that models used locally can consider their own locality in addition to a globally-learned model. H-FL will be crucial to future FL solutions as it can aggregate and distribute models at multiple levels to optimally serve the trade-off between locality dependence and global anomaly robustness.
... Hudson et al., [22] have been conducted research on Federated Reinforcement. Instead of exchanging data in its raw form, edge-enabled lights can combine locally learned policies and network settings to learn from one another's experiences, lowering the cost of communication. ...
Article
Full-text available
City traffic congestion can be reduced with the help of adaptable traffic signal control system. The technique improves the efficiency of traffic operations on urban road networks by quickly adjusting the timing of signal values to account for seasonal variations and brief turns in traffic demand. This study looks into how adaptive signal control systems have evolved over time, their technical features, the state of adaptive control research today, and Control solutions for diverse traffic flows composed of linked and autonomous vehicles. This paper finally came to the conclusion that the ability of smart cities to generate vast volumes of information, Artificial Intelligence (AI) approaches that have recently been developed are of interest because they have the power to transform unstructured data into meaningful information to support decision-making (For instance, using current traffic information to adjust traffic lights based on actual traffic circumstances). It will demand a lot of processing power and is not easy to construct these AI applications. Unique computer hardware/technologies are required since some smart city applications require quick responses. In order to achieve the greatest energy savings and QoS, it focuses on the deployment of virtual machines in software-defined data centers. Review of the accuracy vs. latency trade-off for deep learning-based service decisions regarding offloading while providing the best QoS at the edge using compression techniques. During the past, computationally demanding tasks have been handled by cloud computing infrastructures. A promising computer infrastructure is already available and thanks to the new edge computing advancement, which is capable of meeting the needs of tomorrow's smart cities.
... Centralized learning models can achieve optimal global results but require excessive communication and computational resources 29 . Distributed learning solutions allocate agents to learn local models, hindering generalization across different intersections 30 . This paper seeks to establish a robust model capable of aggregating knowledge from intelligent agents at various intersections with lower communication and computational costs. ...
Preprint
Full-text available
Intelligent Transportation has seen significant advancements with Deep Learning (DL) and Internet of Things, making Traffic Signal Control (TSC) research crucial for reducing congestion, travel time, emissions, and energy consumption. Reinforcement Learning (RL) has emerged as the primary method for TSC, but centralized learning poses communication and computing challenges, while distributed learning struggles to adapt across intersections. This paper presents a novel approach using Federated Learning (FL)-based RL for TSC. FL integrates knowledge from local agents into a global model, overcoming intersection variations with a unified agent state structure. Additionally, the output layer's parameters are not aggregated to handle different intersection settings; instead, fine-tuning is performed after model training for deployment. Extensive experiments demonstrate reduced queuing and waiting times globally, and successful scalability of the proposed model is validated on a real-world traffic network in Monaco, showing its potential for new intersections.
Article
Resource constraints on the computing continuum require that we make smart decisions for serving AI-based services at the network edge. AI-based services typically have multiple implementations (e.g., image classification implementations include SqueezeNet, DenseNet, and others) with varying trade-offs (e.g., latency and accuracy). The question then is how should AI-based services be placed across Function-as-a-Service (FaaS) based edge computing systems in order to maximize total Quality-of-Service (QoS). To address this question, we propose a problem that jointly aims to solve (i) edge AI service placement and (ii) request scheduling. These are done across two time-scales (one for placement and one for scheduling). We first cast the problem as an integer linear program. We then decompose the problem into separate placement and scheduling subproblems and prove that both are NP-hard. We then propose a novel placement algorithm that places services while considering device-to-device communication across edge clouds to offload requests to one another. Our results show that the proposed placement algorithm is able to outperform a state-of-the-art placement algorithm for AI-based services, and other baseline heuristics, with regard to maximizing total QoS. Additionally, we present a federated learning-based framework, FLIES, to predict the future incoming service requests and their QoS requirements. Our results also show that our FLIES algorithm is able to outperform a standard decentralized learning baseline for predicting incoming requests and show comparable predictive performance when compared to centralized training.
Chapter
In vehicular networks, one of the traffic light signal parameters is the current inefficient traffic light control and causes problems such as long delay and energy waste. To improve traffic efficiency, dynamically adjusting the traffic light duration, taking into account real-time traffic information, is a logical and reasonable method. In this study; a deep reinforcement learning model is proposed to control the traffic light. In order to reduce the waiting time of the intersection users, the model and timing of the changes were optimized using deep reinforcement learning for the signals. In addition to the existing studies, Shibuya Crossing is chosen as an exemplary intersection application, focusing on encrypted intersections as the application target of traffic control with deep reinforcement learning. A traffic simulation SUMO is used to create the perimeter of Shibuya Crossing. Traffic signals are optimized using DQN, A2C and PPO algorithms. As a result, by using reinforcement learning, the waiting time has been reduced by about four times compared to the signal patterns currently used. In the study, the behavior of the optimized signal is also analyzed, explaining how the accuracy of the learning process changes when the method or condition observation is changed.KeywordsVehicular networktraffic signal controldeep reinforcement learningSUMO
Article
As vehicles become increasingly automated, novel vehicular applications emerge to enhance the safety and security of the vehicles and improve user experience. This brings ever-increasing data and resource requirements for timely computation on the vehicle's on-board computing systems. To alleviate these demands, prior work proposes deploying vehicular edge computing (VEC) resources on the road-side units (RSUs) in the traffic infrastructure, to which the vehicles can communicate and offload compute-intensive tasks. Due to the limited communication range of these RSUs, the communication link between the vehicles and the RSUs, and therefore the response times of the offloaded applications, are significantly impacted by the vehicle's mobility through road traffic. Existing task offloading strategies do not consider the influence of traffic lights on vehicular mobility while offloading workloads to the RSUs, and thereby cause deadline misses and quality-of-service (QoS) reduction for the offloaded tasks. In this paper, we present a novel task model that captures time- and location-specific requirements for vehicular applications. We then present a deadline-based strategy that incorporates traffic light data to opportunistically offload tasks. Our approach allows up to 33% more tasks to be offloaded onto the RSUs, compared to existing work, without causing any deadline misses, thereby maximizing resource utilization on the RSUs.
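The core offloading test implied by the abstract, offload only when the RSU can respond within both the deadline and the vehicle's dwell time in coverage, can be sketched as below; the parameter names and the dwell-time model are assumptions for illustration, not the paper's task model.

def should_offload(rsu_exec_time, tx_time, deadline, time_in_range, red_phase_dwell):
    # Offload only if the RSU's response can arrive both before the task's
    # deadline and before the vehicle leaves the RSU's communication range.
    # An upcoming red phase extends how long the vehicle stays in range.
    available_window = time_in_range + red_phase_dwell
    response_time = tx_time + rsu_exec_time
    return response_time <= min(deadline, available_window)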
Preprint
The Internet of Vehicles (IoV) enables real-time data exchange among vehicles and roadside units and thus provides a promising solution to alleviate traffic jams in urban areas. Meanwhile, better traffic management via efficient traffic light control can benefit the IoV as well by enabling a better communication environment and decreasing the network load. As such, IoV and efficient traffic light control can form a virtuous cycle. Edge computing, an emerging technology to provide low-latency computation capabilities at the edge of the network, can further improve the performance of this cycle. However, while the collected information is valuable, an efficient solution for better utilization and faster feedback has yet to be developed for edge-empowered IoV. To this end, we propose Decentralized Reinforcement Learning at the Edge for traffic light control in the IoV (DRLE). DRLE exploits the ubiquity of the IoV to accelerate the collection of traffic data and its interpretation towards alleviating congestion and providing better traffic light control. DRLE operates within the coverage of the edge servers and uses aggregated data from neighboring edge servers to provide city-scale traffic light control. DRLE decomposes the highly complex problem of large-area control into a decentralized multi-agent problem. We prove its global optimality with concrete mathematical reasoning. The proposed decentralized reinforcement learning algorithm running at each edge node adapts the traffic lights in real time. We conduct extensive evaluations and demonstrate the superiority of this approach over several state-of-the-art algorithms.
Article
Reinforcement learning (RL) is a promising data-driven approach for adaptive traffic signal control (ATSC) in complex urban traffic networks, and deep neural networks further enhance its learning power. However, centralized RL is infeasible for large-scale ATSC due to the extremely high dimension of the joint action space. Multi-agent RL (MARL) overcomes the scalability issue by distributing the global control to each local RL agent, but it introduces new challenges: now the environment becomes partially observable from the viewpoint of each local agent due to limited communication among agents. Most existing studies in MARL focus on designing efficient communication and coordination among traditional Q-learning agents. This paper presents, for the first time, a fully scalable and decentralized MARL algorithm for the state-of-the-art deep RL agent: advantage actor critic (A2C), within the context of ATSC. In particular, two methods are proposed to stabilize the learning procedure, by improving the observability and reducing the learning difficulty of each local agent. The proposed multi-agent A2C is compared against independent A2C and independent Q-learning algorithms, in both a large synthetic traffic grid and a large real-world traffic network of Monaco city, under simulated peak-hour traffic dynamics. Results demonstrate its optimality, robustness, and sample efficiency over other state-of-the-art decentralized MARL algorithms.
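One of the stabilization ideas mentioned above, letting each local agent observe an attenuated view of its neighbors, can be sketched as follows; the flat-vector state format and the specific discount value are assumptions, not the paper's exact formulation.

import numpy as np

def augmented_observation(own_state, neighbor_states, spatial_discount=0.75):
    # Concatenate the agent's own state with down-weighted neighbor states so
    # the local actor-critic sees nearby intersections without requiring
    # full observability of the whole network.
    scaled = [spatial_discount * np.asarray(s, dtype=float) for s in neighbor_states]
    return np.concatenate([np.asarray(own_state, dtype=float)] + scaled)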
Article
Recent advances in combining deep neural network architectures with reinforcement learning techniques have shown promising potential in solving complex control problems with high-dimensional state and action spaces. Inspired by these successes, in this paper, we build two kinds of reinforcement learning agents: a deep policy-gradient agent and a value-function-based agent, which can predict the best possible traffic signal for a traffic intersection. At each time step, these adaptive traffic light control agents receive a snapshot of the current state of a graphical traffic simulator and produce control signals. The policy-gradient-based agent maps its observation directly to the control signal, whereas the value-function-based agent first estimates values for all legal control signals. The agent then selects the control action with the highest value. Our methods show promising results in a traffic network simulated in the SUMO traffic simulator, without suffering from instability issues during the training process.
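The distinction drawn above between the two agent types can be made concrete with the following sketch; the function names and the softmax sampling are illustrative assumptions.

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def policy_gradient_action(policy_logits_fn, observation):
    # Policy-based agent: map the observation directly to a distribution over
    # signal phases and sample an action from it.
    probs = softmax(policy_logits_fn(observation))
    return int(np.random.choice(len(probs), p=probs))

def value_based_action(q_values_fn, observation):
    # Value-based agent: estimate a value for every legal phase first, then
    # select the phase with the highest estimate.
    return int(np.argmax(q_values_fn(observation)))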
Conference Paper
Recent advances in distributed data processing and machine learning provide new opportunities to enable critical, time-sensitive functionalities of smart distribution grids in a secure and reliable fashion. Combining recent advances in edge computing (EC) and edge intelligence (EI) with existing advanced metering infrastructure (AMI) has the potential to reduce overall communication cost, preserve user privacy, and provide improved situational awareness. In this paper, we provide an overview of how EC and EI can supplement applications relevant to AMI systems. Additionally, using such systems in tandem can enable distributed deep learning frameworks (e.g., federated learning) to empower distributed data processing and intelligent decision making for AMI. Finally, to demonstrate the efficacy of the considered architecture, we approach the non-intrusive load monitoring (NILM) problem using federated learning to train a deep recurrent neural network architecture in a 2-tier and 3-tier manner. In this approach, smart homes locally train a neural network using their metering data and only share the learned model parameters with AMI components for aggregation. Our results show this can reduce the communication cost associated with distributed learning and provide an immediate layer of privacy, since no raw data is communicated to AMI components. Further, we show that FL is able to closely match the model loss of standard centralized deep learning, where raw data is communicated for centralized training.
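A minimal sketch of the 3-tier aggregation pattern described above (smart homes to edge aggregators to the AMI head-end) follows; equal weighting and the dict-of-arrays model format are assumptions made for illustration.

import numpy as np

def fedavg(param_dicts):
    # Plain federated averaging of model parameters, assuming equal client weights.
    return {name: np.mean([p[name] for p in param_dicts], axis=0)
            for name in param_dicts[0]}

def three_tier_round(homes_per_edge):
    # homes_per_edge: one list of locally trained home models per edge aggregator.
    edge_models = [fedavg(home_models) for home_models in homes_per_edge]
    return fedavg(edge_models)  # head-end aggregates the edge-level models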
Article
Finding the optimal signal timing strategy is a difficult task for large-scale traffic signal control (TSC). Multiagent reinforcement learning (MARL) is a promising method to solve this problem. However, there is still room for improvement in extending to large-scale problems and in modeling the behaviors of other agents for each individual agent. In this article, a new MARL algorithm, called cooperative double Q-learning (Co-DQL), is proposed, which has several prominent features. It uses a highly scalable independent double Q-learning method based on double estimators and the upper confidence bound (UCB) policy, which eliminates the overestimation problem of traditional independent Q-learning while ensuring exploration. It uses mean-field approximation to model the interaction among agents, thereby making agents learn a better cooperative strategy. In order to improve the stability and robustness of the learning process, we introduce a new reward allocation mechanism and a local state sharing method. In addition, we analyze the convergence properties of the proposed algorithm. Co-DQL is applied to TSC and tested on various traffic flow scenarios of TSC simulators. The results show that Co-DQL outperforms the state-of-the-art decentralized MARL algorithms in terms of multiple traffic metrics.
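The double-estimator update and UCB exploration referred to above are standard building blocks; a tabular sketch (the step size, discount, UCB constant, and the averaging of the two estimators are assumptions) is given below.

import numpy as np

def ucb_action(Q_A, Q_B, counts, state, t, c=2.0):
    # UCB policy over the average of the two estimators: prefer actions with
    # high value estimates or few visits.
    q = 0.5 * (Q_A[state] + Q_B[state])
    bonus = c * np.sqrt(np.log(t + 1) / (counts[state] + 1))
    return int(np.argmax(q + bonus))

def double_q_update(Q_A, Q_B, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # One estimator selects the greedy next action, the other evaluates it,
    # which removes the overestimation bias of the single max operator.
    if np.random.rand() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))
        Q_A[s][a] += alpha * (r + gamma * Q_B[s_next][a_star] - Q_A[s][a])
    else:
        a_star = int(np.argmax(Q_B[s_next]))
        Q_B[s][a] += alpha * (r + gamma * Q_A[s_next][a_star] - Q_B[s][a])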
Article
This paper was motivated by the problem of how to make robots fuse and transfer their experience so that they can effectively use prior knowledge and quickly adapt to new environments. To address the problem, we present a learning architecture for navigation in cloud robotic systems: Lifelong Federated Reinforcement Learning (LFRL). In this work, we propose a knowledge fusion algorithm for upgrading a shared model deployed on the cloud. Then, effective transfer learning methods in LFRL are introduced. LFRL is consistent with human cognitive science and fits well in cloud robotic systems. Experiments show that LFRL greatly improves the efficiency of reinforcement learning for robot navigation. The cloud robotic system deployment also shows that LFRL is capable of fusing prior knowledge. In addition, we release a cloud robotic navigation-learning website that provides a service based on LFRL: www.shared-robotics.com
Article
The transportation demand is rapidly growing in metropolises, resulting in chronic traffic congestion in dense downtown areas. Adaptive traffic signal control, as a principal part of intelligent transportation systems, has a primary role in effectively reducing traffic congestion by making real-time adaptations in response to the changing traffic network dynamics. Reinforcement learning (RL) is an effective approach in machine learning that has been applied for designing adaptive traffic signal controllers. Among the most efficient and robust types of RL algorithms are continuous-state actor-critic algorithms, which have the advantage of fast learning and the ability to generalize to new and unseen traffic conditions. These algorithms are utilized in this paper to design adaptive traffic signal controllers called actor-critic adaptive traffic signal controllers (A-CATs controllers). The contribution of the present work rests on the integration of three threads: (a) showing performance comparisons of both discrete and continuous A-CATs controllers in a traffic network with recurring congestion (24-h traffic demand) in the upper downtown core of Tehran city, (b) analyzing the effects of different traffic disruptions, including opportunistic pedestrian crossings, parking lanes, non-recurring congestion, and different levels of sensor noise, on the performance of A-CATs controllers, and (c) comparing the performance of different function approximators (tile coding and radial basis functions) on the learning of A-CATs controllers. To this end, an agent-based traffic simulation of the study area is first carried out. Then, six different scenarios are conducted to find the best A-CATs controller that is robust enough against different traffic disruptions. We observe that the A-CATs controller based on radial basis function networks (RBF (5)) outperforms the others. This controller is benchmarked against discrete-state Q-learning, Bayesian Q-learning, fixed-time, and actuated controllers, and the results reveal that it consistently outperforms them.
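The radial basis function approximation named above follows the standard construction; in the usual form (the linear critic below is an illustration, not necessarily the paper's exact parameterization), a state $s$ is mapped to Gaussian features over fixed centers $c_i$:
\[
\phi_i(s) = \exp\!\left(-\frac{\lVert s - c_i \rVert^2}{2\sigma_i^2}\right),
\qquad
V(s) \approx \sum_i w_i\, \phi_i(s),
\]
where the widths $\sigma_i$ are fixed and only the weights $w_i$ are learned by the critic.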
Article
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.
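The "surrogate" objective referred to above is PPO's clipped objective, which in the paper's notation reads
\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\]
where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is a small clipping parameter; multiple epochs of minibatch updates are run on this objective for each batch of sampled experience.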