Towards Multi-agent Reinforcement Learning for Wireless Network Protocol Synthesis

Hrishikesh Dutta and Subir Biswas
Electrical and Computer Engineering, Michigan State University, East Lansing, MI, USA
duttahr1@msu.edu, sbiswas@msu.edu
Abstract: This paper proposes a multi-agent reinforcement
learning based medium access framework for wireless networks.
The access problem is formulated as a Markov Decision Process
(MDP), and solved using reinforcement learning with every
network node acting as a distributed learning agent. The solution
components are developed step by step, starting from a single-node
access scenario in which a node agent incrementally learns to
control MAC layer packet loads for reining in self-collisions. The
strategy is then scaled up for multi-node fully-connected scenarios
by using more elaborate reward structures. It also demonstrates
preliminary feasibility for more general partially connected
topologies. It is shown that by learning to adjust MAC layer
transmission probabilities, the protocol is not only able to attain
theoretical maximum throughput at an optimal load, but unlike
classical approaches, it can also retain that maximum throughput
at higher loading conditions. Additionally, the mechanism is
agnostic to heterogeneous loading while preserving that feature. It
is also shown that access priorities of the protocol across nodes can
be parametrically adjusted. Finally, it is also shown that the online
learning feature of reinforcement learning is able to make the
protocol adapt to time-varying loading conditions.
Index Terms: Multi-agent Reinforcement Learning, Medium
Access Control (MAC), Multi-agent Markov Decision Process
(MAMDP), ALOHA, Heterogeneous network
I. INTRODUCTION
The objective of this paper is to explore an online learning
paradigm for medium access control (MAC) in wireless
networks with multiple levels of heterogeneity. Towards the
long-term goal of developing a general-purpose learning
framework, in this particular paper we start with a basic scenario
in which wireless nodes run a rudimentary MAC logic without
relying on carrier sensing and other complex features from the
underlying hardware. In other words, the developed
mechanisms are suitable for very simple transceivers that are
found in low-cost wireless sensors, Internet of Things (IoTs),
and other embedded devices.
The current best practice for programming MAC logic in an
embedded wireless node is to implement known protocols such
as ALOHA, Slotted-ALOHA, CSMA, and CSMA-CA (i.e.,
WiFi, BT, etc.) depending on the available lower layer hardware
support. The choice of such protocols is often driven by
heuristics and the past experience of network designers. While
such methods provide a standard path to network and protocol
deployment, they do not necessarily maximize MAC layer
performance in a fair manner, especially in the presence of
network and data heterogeneity. For example, when nodes
without carrier sensing abilities run the ALOHA family of
protocols, their performance starts degrading due to collisions
once the application traffic load in the network exceeds an
optimal level. Such problems are further compounded in the
presence of various forms of heterogeneity in terms of traffic
load, topology, and node-specific access priorities. The key
reason for such a performance gap is that the nodes are statically
programmed with a protocol logic that is not cognizant of time-
varying load situations and such heterogeneities. The proposed
framework allows wireless nodes to learn how to detect such
conditions and change transmission strategies in real-time for
maximizing the network performance, even under node-specific
access prioritization.
The key concept in this work is to model the MAC layer
logic as a Markov Decision Process (MDP) [1], and solve it
dynamically using Reinforcement Learning (RL) as a temporal
difference solution [2] under varying traffic and network
conditions. An MDP solution is the correct set of transmission
actions taken by the network nodes, which act as the MDP
agents. RL provides opportunities for the nodes to learn on the
fly without the need for any prior training data. The proposed
framework allows provisions for heterogeneous traffic load,
network topology, and node-specific priority while striking the
right balance between node level and network level
performance. Learning adjustments to temporal variations of
such heterogeneities and access priorities are also supported by
leveraging the inherent real-time adaptability of Reinforcement
Learning. It is shown that the nodes can learn to self-regulate
collisions in order to attain theoretically maximum MAC level
performance at optimal loading conditions. For higher loads, the
nodes learn to adjust their transmit probabilities in order to
reduce the effective load and keep collisions low, thus
maintaining the maximum MAC level performance.
Specific contributions of this work are as follows. First, the
MAC layer logic without carrier sensing is modeled as a Markov
Decision Process (MDP), which is solved using Reinforcement
Learning (RL). Second, it is shown that such learning has the
ability to make the agents/nodes self-regulate their individual
traffic loads in order to attain and maintain the theoretically
maximum throughput even when the application level traffic
load is increased beyond the point of optimality. Third, the
learning mechanism is enhanced to handle heterogeneous traffic
load. Fourth, node-level access priorities are incorporated in the
native learning process. Finally, it is preliminarily demonstrated
that the core concept of MDP formulation and RL solution can
be extended for non-fully connected network topologies.
II. NETWORK AND TRAFFIC MODEL
The first network model is a multi-point to point sparse
network. As shown in Fig. 1(a), N fully connected sensor nodes
send data to a wirelessly connected base station using fixed size
packets. Performance, measured as node- and network-level
throughput, is affected by collisions at the base station caused
by overlapping transmissions from multiple nodes. The primary
objective of a learning-based MAC protocol is to learn the
optimal transmission strategy that can minimize collisions at the
base station. It is assumed that nodes do not have carrier sensing
abilities, and each of them is able to receive transmissions from
all other nodes in the network.
Fig. 1: Network and data upload models for (a) Fully connected nodes
uploading data to a single Base Station (b) Partially connected nodes
uploading data to multiple Base Stations
The second network model, as shown in Fig. 1(b), is a non-
fully connected scenario in which the sensor nodes upload data
to different receiver base stations, while they form an arbitrary
mesh topology among themselves. In this case, collisions
happen at the receiver base stations and the learning goal is to
minimize such collisions in a distributed manner. Unlike in the
fully connected scenario, a node in this case can receive
transmissions only from its direct neighbors.
III. NETWORK PROTOCOL MODELING AS MARKOV DECISION
PROCESS (MDP)
A. Markov Decision Process
An MDP is a Markov process [3] in which a system
transitions stochastically within a state space
$S = \{s_1, s_2, \dots, s_n\}$ as a result of actions taken by an agent in
each state. When the agent is in state $s$ and takes an action $a$,
a set of transition probabilities determines the next system state
$s'$. A reward, indicating a physical benefit, is associated with
each such transition. Formally, an MDP can be represented by
the tuple $(S, A, T, R)$, where $S$ is the state space, $A$ is the set of all
possible actions, $T$ is the set of state transition probabilities, and
$R$ is the reward function. For any system whose dynamic behavior
can be represented by an MDP, there exists an optimal set of
actions for each state such that the long-term expected reward
is maximized [4]. Such an optimal action sequence is referred
to as a solution to the MDP.
B. Reinforcement Learning (RL)
RL is a class of algorithms for solving an MDP [5] without
necessarily requiring an exact mathematical model for the
system, as needed by the classical dynamic programming
methods. As shown in Fig. 2, in RL, an agent interacts with its
environment by taking an action, which causes a state-change.
Each action results in a reward that the agent receives from the
environment. Q-Learning [6], a model-free and value-based
reinforcement learning technique, is used in this work. Using Q-
Learning, by taking each possible action in all the states
repeatedly, an agent learns to take the best set of actions that
represents the optimal MDP solution, which maximizes the
expected long-term reward. Each agent maintains a Q-table with
entries $Q(s,a)$, each representing the Q-value of action $a$ when
taken in state $s$. After an action, the Q-value is updated using
the Bellman equation given by Eq. (1), where $r$ is the reward
received, $\alpha$ is a learning rate, $\gamma$ is a discount factor, and $s'$ is the
next state caused by action $a$.

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \quad (1)$$
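For concreteness, the fragment below sketches how a node agent could implement the tabular update of Eq. (1) in C, the language of the simulator used later in the paper. The table sizes, the epsilon-greedy exploration routine, and the specific constants are illustrative assumptions, not details taken from the paper.

```c
#include <stdlib.h>

#define NUM_STATES  24   /* discretized congestion levels (illustrative size)  */
#define NUM_ACTIONS  5   /* candidate transmission actions (illustrative size) */

static double Q[NUM_STATES][NUM_ACTIONS];
static const double ALPHA = 0.1;    /* learning rate (example value)   */
static const double GAMMA = 0.95;   /* discount factor (example value) */

/* Largest Q-value in state s: the bootstrap term max_a' Q(s',a') of Eq. (1). */
static double max_q(int s)
{
    double best = Q[s][0];
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (Q[s][a] > best) best = Q[s][a];
    return best;
}

/* One tabular Q-learning step: the agent took action a in state s,
 * then observed reward r and next state s_next.                     */
static void q_update(int s, int a, double r, int s_next)
{
    double target = r + GAMMA * max_q(s_next);
    Q[s][a] += ALPHA * (target - Q[s][a]);
}

/* Epsilon-greedy action selection, one common way to realize the
 * "take each possible action repeatedly" exploration described above. */
static int choose_action(int s, double epsilon)
{
    if ((double)rand() / RAND_MAX < epsilon)
        return rand() % NUM_ACTIONS;
    int best = 0;
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}
```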
Fig. 2: Reinforcement Learning (RL) for network protocol MDP
C. Modeling Network Protocols as MDP
We represent the MAC layer protocol logic using an MDP,
where the network nodes behave as the learning agents.
The environment is the network within which the nodes/agents
interact via their actions. The actions here are to transmit with
various probabilities and to wait, which the agents need to learn
in a network state-specific manner so that the expected reward,
which is throughput, can be maximized. A solution to this MDP
problem would represent a desirable MAC layer protocol.
State Space: The MDP state space for an agent/node is defined
as the network congestion level as perceived by the agent.
Congestion is coded with two elements, namely, inter-collision
probability and self-collision probability. An inter-collision
takes place when the collided packets at a receiver are from two
different nodes. A self-collision occurs at a transmitter when a
node’s application layer generates a packet in the middle of one
of its own ongoing transmissions. It will be shown later how,
using RL, a node learns to deal with both self-collisions and
inter-collisions, thus maximizing the reward or throughput.
Inter-collision and self-collision probabilities are defined by
Eqns. (2) and (3) respectively. To keep the state space discrete,
these probabilities are discretized in multiple distinct ranges as
described in the experimental results sections.

$$p_{inter} = \frac{\text{Number of packets lost due to inter-collisions}}{\text{Total number of packets transmitted}} \quad (2)$$

$$p_{self} = \frac{\text{Number of packets lost due to self-collisions}}{\text{Total number of packets transmitted}} \quad (3)$$
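To illustrate how the discretization can be realized, the sketch below maps measured self-collision and inter-collision statistics to a discrete state index, using the 6 and 4 equal-width levels adopted in the later experiments; the counter variables and function names are hypothetical.

```c
/* Map a node's measured collision statistics to a discrete MDP state.
 * The 6 self-collision and 4 inter-collision levels follow the state
 * space used in the experiments; everything else is illustrative.     */
#define SELF_LEVELS   6
#define INTER_LEVELS  4

static int quantize(double p, int levels)
{
    int q = (int)(p * levels);               /* equal-width bins over [0,1] */
    return (q >= levels) ? levels - 1 : q;   /* clamp p == 1.0 into top bin */
}

/* pkts_tx: packets handed to the MAC in the last epoch; self_lost and
 * inter_lost: packets lost to self- and inter-collisions respectively. */
static int congestion_state(long pkts_tx, long self_lost, long inter_lost)
{
    double p_self  = pkts_tx ? (double)self_lost  / pkts_tx : 0.0;
    double p_inter = pkts_tx ? (double)inter_lost / pkts_tx : 0.0;
    return quantize(p_self, SELF_LEVELS) * INTER_LEVELS
         + quantize(p_inter, INTER_LEVELS);
}
```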
Action Space: As for the agents’ actions, two different
formulations are explored, namely, fixed actions and
incremental actions. With the first one, the actions are defined
as the probabilities of packet transmissions by an agent/node.
For the latter, the actions are defined as a change to the current
transmission probability. Details about these two action space
formulations and their performance are presented in Section V.
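The two formulations can be sketched as follows. The fixed probability set matches the one used for the single-node experiments in Section V, while the incremental step size of 0.05 is an assumed value for illustration only.

```c
/* Translating a learned action into the node's transmit probability. */
static const double FIXED_TX_PROB[5] = { 0.0, 0.25, 0.5, 0.75, 1.0 };

/* Fixed action strategy: the action index directly selects a probability. */
static double apply_fixed_action(int action)
{
    return FIXED_TX_PROB[action];             /* action in 0..4 */
}

/* Incremental action strategy: the action nudges the current probability
 * up, down, or leaves it unchanged (step size assumed here).             */
static double apply_incremental_action(double current_p, int action)
{
    const double STEP = 0.05;
    if (action == 0)      current_p += STEP;
    else if (action == 1) current_p -= STEP;
    if (current_p < 0.0) current_p = 0.0;     /* keep probability in [0,1] */
    if (current_p > 1.0) current_p = 1.0;
    return current_p;
}
```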
Rewards: In a fully connected network, each RL agent/node
keeps track of its own throughput and those of all other agents.
The latter is computed based on the successfully received
packets from other nodes. Using such information, an agent-i
constructs a reward function as follows:

$$r_i = \alpha S + \beta_i F_i \quad (4)$$

Here $S$ is the network-wide throughput (expressed in packets/packet
duration, or Erlang) computed by adding the measured throughputs
of all nodes including node-i itself; $s_j$ is the throughput of node-$j$,
and $F_i = \min_j(s_j)/\max_j(s_j)$ represents a fairness factor. A
larger $F_i$ indicates a fairer system from node-i's perspective. The
coefficients $\alpha$ and $\beta_i$ are learning hyper-parameters. The
coefficient $\alpha$ provides weightage towards maximizing network-wide
throughput, and $\beta_i$ contributes towards making the throughput
distribution fairer. The node-specific parameter $\beta_i$ determines a
media access priority for node-i. In addition to the baseline reward
structure in Eqn. (4), a learning recovery protection is provided by
assigning a penalty of 0.8 to all agents
if the network-wide throughput S ever goes to zero.
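Based on the reconstructed form of Eqn. (4) above, a node's reward computation might look like the following sketch; the min/max fairness term and the sign of the recovery penalty are assumptions carried over from that reconstruction rather than the authors' exact implementation.

```c
/* Reward of node i per the reconstructed Eqn. (4):
 *   r_i = alpha * S + beta_i * F_i, with a recovery penalty when S == 0.
 * s[] holds the measured per-node throughputs in Erlang; n is the
 * number of nodes; alpha and beta_i are the learning hyper-parameters. */
static double reward_fully_connected(const double s[], int n,
                                     double alpha, double beta_i)
{
    double total = 0.0, lo = s[0], hi = s[0];
    for (int j = 0; j < n; j++) {
        total += s[j];
        if (s[j] < lo) lo = s[j];
        if (s[j] > hi) hi = s[j];
    }
    if (total == 0.0)
        return -0.8;   /* learning recovery penalty (applied as a negative reward) */

    double fairness = (hi > 0.0) ? lo / hi : 0.0;  /* assumed min/max form of F_i */
    return alpha * total + beta_i * fairness;
}
```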
For a non-fully connected topology, as shown in Fig. 1(b), a
modified reward function has been formulated since the
individual nodes here do not have the information about the
entire network. In this case, a node is able to monitor the number
of non-overlapping and overlapping transmissions with only its
one-hop neighbor nodes. Let $t_{ij}$ be the number of non-overlapping
(with that of node-i) transmissions per packet duration from node-$j$,
as observed by node-i. The quantity $t_{ij}$ may not represent the actual
throughput of node-$j$ since some of those packets may collide with
node-j's own neighbors. Similarly, some of node-j's transmissions that
are overlapping with node-i's transmissions may actually succeed at
one of node-j's neighbor receivers. In the absence of such 2-hop
information, we assume that $t_{ij}$ represents a reasonable estimate of
node-j's throughput from node i's perspective. The reward function for
the non-fully connected topology is formulated as:

$$r_i = \alpha \Big( s_i + \sum_{j} t_{ij} \Big) + \beta_i \, \frac{\min_j(t_{ij})}{\max_j(t_{ij})} \quad (5)$$

Fairness of bandwidth distribution across the nodes within node-i's
immediate neighborhood is given by the term
$\min_j(t_{ij})/\max_j(t_{ij})$, where $j$ ranges over all of node-i's
one-hop neighbors. The coefficient $\alpha$ accounts for maximizing the
estimated throughput of node-i and its one-hop neighbors, whereas
$\beta_i$ tries to distribute the throughput equally among node-i's
neighbors. As in the fully connected scenario, a recovery protection is
added here too by assigning a fixed penalty if the estimated
neighborhood throughput goes to zero. More about the design of
hyper-parameters for both fully connected and non-fully
connected scenarios is presented in Section VI.
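A corresponding sketch for the partially connected case is shown below. Here the node relies only on its own measured throughput and the per-neighbor counts of non-overlapping transmissions; the exact reward form follows the reconstructed Eqn. (5) and the fixed penalty value is assumed.

```c
/* Reward of node i in the partially connected case, per the reconstructed
 * Eqn. (5). s_i is node-i's own measured throughput; t[j] is the count of
 * node-j transmissions per packet duration that did not overlap with
 * node-i's own transmissions (node-i's throughput estimate for neighbor j). */
static double reward_partially_connected(double s_i, const double t[],
                                         int num_nbrs, double alpha, double beta)
{
    double sum = s_i, lo = 0.0, hi = 0.0;
    for (int j = 0; j < num_nbrs; j++) {
        sum += t[j];
        if (j == 0 || t[j] < lo) lo = t[j];
        if (j == 0 || t[j] > hi) hi = t[j];
    }
    if (sum == 0.0)
        return -0.8;   /* recovery penalty; the exact value is assumed */

    double fairness = (hi > 0.0) ? lo / hi : 0.0;  /* min/max over 1-hop neighbors */
    return alpha * sum + beta * fairness;
}
```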
IV. DISTRIBUTED MULTI-AGENT REINFORCEMENT LEARNING
MAC (DMRL-MAC)
Using the state and action spaces designed in Section III, we
develop an MRL (Multi-agent RL) system in which each
network node acts as an independent RL agent. Distributed
agent behavior gives rise to a medium access control protocol
that is termed DMRL-MAC. We used a special flavor of
multi-agent RL, known as Hysteretic Q-learning [7].
Hysteretic Q-learning (HQL): HQL is used in a cooperative and
decentralized multi-agent environment, where each agent is
treated as an independent learner. An agent, without the
knowledge of the actions taken by the other agents in the
environment, learns to achieve a coherent joint behavior. The
agents learn to converge to an optimal policy without the
requirement of explicit communications. The Q-table update
rule in Eqn. (1) is modified here as:

$$\delta = r + \gamma \max_{a'} Q(s',a') - Q(s,a)$$
$$Q(s,a) \leftarrow \begin{cases} Q(s,a) + \alpha\,\delta & \text{if } \delta \geq 0 \\ Q(s,a) + \beta\,\delta & \text{otherwise} \end{cases} \quad (6)$$

In a multi-agent environment, the rewards and penalties
received by an agent depend not only on its own action, but also
on the set of actions taken by all other agents. Even if the action
taken by an agent is optimal, it may still receive a penalty as a
result of the bad actions taken by the other agents. Therefore, in
hysteretic Q-learning, an agent gives less importance to a
penalty received for an action that was rewarded in the past. This
is taken into consideration by the use of two different learning
rates $\alpha$ and $\beta$ (not to be confused with the reward
coefficients in Eqn. (4)). This can be seen in the Q-table update
rule in Eqn. (6), where $\alpha$ and $\beta$ are the increase and decrease
rates of the Q values. The learning rate is selected based on the
sign of the parameter $\delta$. The parameter $\delta$ is positive if the
actions taken were beneficial for attaining the desired optimum of
the system and vice-versa. In order to assign less importance to
the penalties, $\beta$ is chosen such that it is always less than $\alpha$.
However, the decrease rate $\beta$ is set to non-zero in order to make
sure that the hysteretic agents are not totally blind to penalties.
In this way, the agents avoid converging to a sub-optimal
equilibrium.
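A minimal C sketch of the hysteretic update in Eqn. (6) is given below. The specific increase and decrease rates are example values chosen only to satisfy the stated requirement that the decrease rate be smaller than, but not zero relative to, the increase rate.

```c
#define NUM_ACTIONS 5                  /* illustrative action-space size */

static const double RATE_INC = 0.1;    /* increase rate (alpha in Eqn. (6))  */
static const double RATE_DEC = 0.01;   /* decrease rate (beta), beta < alpha */

/* Hysteretic Q-learning update: the smaller rate is applied when the
 * temporal-difference term delta is negative, so penalties that may be
 * caused by other agents' actions are discounted but not ignored.       */
static void hysteretic_update(double q[][NUM_ACTIONS],
                              int s, int a, double r, int s_next, double gamma)
{
    double best_next = q[s_next][0];
    for (int i = 1; i < NUM_ACTIONS; i++)
        if (q[s_next][i] > best_next) best_next = q[s_next][i];

    double delta = r + gamma * best_next - q[s][a];
    q[s][a] += (delta >= 0.0 ? RATE_INC : RATE_DEC) * delta;
}
```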
In the proposed DMRL-MAC protocol, each node in the
network acts as a hysteretic agent, which is unaware of the
actions taken by the other nodes/agents in the network. The
actions are evaluated by the rewards assigned using Eqns. (4)
and (5). The fact that the Q-table size is independent of the number
of nodes makes the protocol scalable with the network size.
V. SINGLE AGENT REINFORCEMENT LEARNING
Before delving into multi-node scenarios, we experiment with
the key concepts of RL-based MAC logic using a single network
node that sends data to a base station.
A. Analysis of Self-collision
A self-collision occurs when a node’s application layer
generates a packet in the middle of one of its own ongoing
transmissions. Consider a situation in which an application
generates fixed size packets of duration $T$ at the rate $g$ Erlang
(with Poisson distributed arrivals), and hands them to the MAC
layer. The MAC layer transmits a packet immediately, occupying the
channel from time $t$ to $t + T$. In order for the packet to not
trigger any self-collision, no other packet should arrive from the
application layer during that transmission, the probability of which
follows directly from the Poisson arrival assumption. The other
scenario in which self-collision is avoided is when one prior packet
arrived and was transmitted within the preceding packet duration, and
no packet arrived between the end of that prior transmission and time
$t + T$; the expected probability of this case is obtained by
averaging over the arrival time of the prior packet. Combining these
two scenarios gives the probability that a packet avoids
self-collision, which leads to the single node throughput as a
function of the offered load $g$.
Fig. 3: Single node throughput under self-collisions; throughput comparison
between ALOHA and RL-based MAC Logic
Using a C language simulator, we have implemented single-
node pure ALOHA simulation experiments in which the MAC
layer transmits a packet whenever it arrives from the application
layer. Fig. 3 shows single-node throughput computed
analytically as well as from the simulation experiments. The
maximum throughput is around 68% at an application layer load
of 2.4. In the absence of inter-collisions with other nodes,
throughput loss is solely due to the self-collisions modeled
above. The throughput reduces asymptotically with load (not
shown) as increasing self-collisions eventually prevent any
packets from being successfully transmitted.
The analytical model can be further refined by considering the
packet arrival history further back than just two packet
durations. However, the close match between the results from
the model and from the experiments in Fig. 3 indicates that such
refinements are not necessary.
B. Handling Self-collisions with Reinforcement Learning
We have programmed a single node to use the classical Q-
Learning [2] for medium access. Using the reward formulation
in Eqn. 4, the agent learns an appropriate action from the state
and action spaces specified in Section III. The state space for
single node RL-based MAC is obtained by discretizing the self-
collision probability into 6 equal discrete levels. For the fixed
action strategy, the actions are defined by the probability of
transmission by the node. Five distinct values of transmit
probability {0,0.25,0.5,0.75,1} are used. The learning
hyperparameters and other relevant Q-Learning parameters are
summarized in Table I.
As shown in Fig. 3, the RL-based MAC is able to achieve the
maximum ALOHA throughput by learning to take the
appropriate transmission actions from its available action space.
However, unlike the regular ALOHA logic, the RL-based logic
can maintain the maximum throughput even after the load
exceeds 2.4, which is the optimal load for the ALOHA logic.
This is achieved by adjusting the transmit probability so that the
self-collisions are reduced. For ALOHA, if a node is in the
TABLE I: BASELINE EXPERIMENTAL PARAMETERS
No. of packets per epoch: 1000
Learning rate: 0.1
Discount factor: 0.95
Additional parameter values listed: 1.0, 0, 0
Fig. 4: Convergence plots for RL-based MAC logic: effective load and throughput for three different initial loads using fixed action strategy
[Fig. 4 panels: initial loads g = 10.0, 2.5, and 1.5; the converged average throughput s is about 0.6 to 0.66, with an average effective load g* of about 2.5, in all three cases. Axes: Learning Epoch ID vs. Effective Load (g*) and Throughput (s).]
middle of a transmission, it transmits the packets in its queue
irrespective of its current transmission status. But with RL, the
node can learn not to transmit when it is in the middle of an
ongoing transmission. As a result, the effective load g* to the
MAC layer is lowered compared to the original load g from the
application layer, thus maintaining the maximum throughput
even when the application layer load is increased. Such learning
provides a new direction for medium access, which will be
explored further for multi-node networks later in the paper.
As mentioned in Section III, the RL-based MAC is
implemented using two different action strategies by the RL
agent: fixed action strategy and incremental action strategy.
Figs. 4 and 5 show the convergence plots for these two cases
respectively. It can be observed that the convergence speed is
much higher for the fixed action strategy than for the
incremental one. This is because the search space for the optimal
transmit probability is smaller when fixed actions are used as
compared to the incremental actions. This comes at the
expense of accuracy, because the granularity of the transmit
probability for the fixed action strategy is coarser than that of the
incremental action strategy. However, for applications that
require the nodes to learn fast and in which accuracy is not
critical, the fixed action strategy proves to be useful.
To summarize, the results in this section for a single-node
scenario demonstrate the ability of a reinforcement learning
based MAC logic to attain the theoretically maximum
throughput and to maintain that for higher application layer
loads by controlling self-collisions.
Fig. 6: DMRL-MAC performance in a two-node network: homogeneous load
VI. MULTI-NODE FULLY CONNECTED NETWORKS
A. Performance in a Two-Node Network
Unlike the single node implementation in Section V, which
uses classical Q-Learning, we implemented the Distributed
Multi-agent Reinforcement Learning MAC (DMRL-MAC) for
multi-node networks. This implementation is based on
Hysteretic Q-learning as described in Section IV.
Another key augmentation over the single-node case is that
the state space in the multi-node scenario contains inter-collision
probabilities in addition to the probabilities for self-collisions.
In other words, DMRL-MAC uses a 2-dimensional discrete
state space with 6 and 4 equal discrete levels of self-collision
and inter-collision probabilities respectively.
Fig. 7: Performance of DMRL-MAC in a 2-node network with homogeneous
load and a total throughput maximization strategy
Homogeneous Loading: As shown in Fig. 6, for the two-node
homogeneous loading case, the pure ALOHA protocol can
achieve its maximum nodal throughput only at the
optimal per-node loading. The figure also shows that
the DMRL-MAC logic is able to learn to attain that maximum
throughput, and is then able to sustain it for larger application
layer loads (i.e., g). Like in the single-node case, such
sustenance is achieved via learning to adjust the effective MAC
layer load (i.e., g*) by prudently discarding packets from the
application layer. This keeps both self-collisions and inter-
collisions at bay for higher throughputs.
We ran DMRL-MAC in a 2-node network with the objective
of maximizing network-wide throughput rather than individual
node throughputs; the results are shown in Fig. 7. This was achieved
Fig. 5: Convergence plots for RL-based MAC logic: effective load and throughput for three different initial loads using incremental action strategy
by setting the coefficients in Eqn. (4) so that only the
network-wide throughput term is weighted. As shown in Fig. 7, for
traffic loads up to the optimal level, the individual node
throughputs with DMRL-MAC mimic those of pure ALOHA. With
higher load, however, with DMRL-MAC one node's throughput goes to
zero so that the other node is able to utilize the entire available
throughput, which is the one-node throughput as shown in Fig.
7. This way, the network level throughput is maximized at the
expense of the throughput of one of the nodes, which is chosen
randomly by the underlying distributed reinforcement learning.
Heterogeneous Loading: Results in this section correspond to
when the application layer data rates from different nodes are
different. Fig. 8 shows the performance for three scenarios,
namely, $g_1 < g_{opt}$, $g_1 = g_{opt}$, and $g_1 > g_{opt}$, where
$g_{opt}$ is the application layer load for which the optimal
throughput is obtained for pure ALOHA. For a 2-node network,
the value of $g_{opt}$ is found to be 0.4 Erlangs. Node 2's
application layer load is varied from 0 to 5 Erlangs. The
behavior of the system can be categorized into three broad
cases. Case-I: when both loads are at most $g_{opt}$, DMRL-MAC mimics
the performance of regular ALOHA. Case-II: when one of the two
loads exceeds $g_{opt}$ while the other does not, the node with the
higher load adjusts accordingly such that the optimal ALOHA
throughput is maintained. Case-III: when both loads exceed
$g_{opt}$, wireless bandwidth is fairly distributed, and both the
nodes transmit such that the effective load boils down to
$g_{opt}$. Thus, unlike the regular ALOHA protocol, the DMRL-MAC
can learn to maintain the optimal ALOHA throughput for higher
traffic scenarios. It does so via learning to reduce both self-
and inter-collisions by discarding packets from the application
layer.
As can be seen from Fig. 9, an important feature of the RL-
based DMRL-MAC protocol is that it can adjust to dynamic
traffic environments with time varying loads. When the traffic
generated in the network changes, the protocol can adjust
transmit probability accordingly, so that the optimal throughput
is maintained. It can achieve and maintain the known optimal
throughputs and fairly distribute the available bandwidth under
heterogeneous loading conditions. This is useful in scenarios in
which application layer packet generation fluctuates over time.
Fig. 9: Convergence plot for DMRL-MAC for dynamic load
Fig. 10: Load vs throughput plots for different values of the priority coefficients (priority between the nodes) for a two-node network.
Prioritized Access: One notable feature of the proposed DMRL-MAC
is that node-specific access priorities can be achieved by
assigning specific values of the coefficients $\beta_i$ in Eqn. (4). In
Fig. 10, the load-throughput plots are shown for two different
settings of the priority coefficients $(\beta_1, \beta_2)$. For loads up
to the optimal level, DMRL-MAC mimics the performance of Pure
ALOHA for any values of $\beta_i$. If $\beta_1 = \beta_2$, the system
performs as ALOHA, that is, the individual throughput for each
node is equal to the ALOHA maximum. With an increase in $\beta_1$,
Fig. 8: Performance of DMRL-MAC in a two-node network with heterogeneous load [s1 and s2 are the individual throughputs of node 1 and node 2 respectively; S is the networkwide throughput; g1 and g2 are the loads (in Erlangs) from the application layers of node 1 and node 2 respectively]
node 1 gets the priority and the individual throughput for node
2 approaches zero. This kind of prioritized access is
useful when data from specific sensors are more critical
compared to others, especially when the available wireless
bandwidth is not sufficient.
B. Performance in Larger Networks
Performance of DMRL-MAC for a 3-node network is shown in
Fig. 11. As shown by the simulated ALOHA performance, the
maximum network-wide throughput for a homogeneous load
distribution occurs when g1 = g2 = g3 are at an optimal value,
and that maximum throughput comes with a fair distribution
among the nodes. It can be observed that, like in the 1-node and
2-node scenarios, DMRL-MAC can learn the theoretically
feasible maximum throughput and maintain that at higher loads
by avoiding both self- and inter-collisions.
Fig. 11: Performance of DMRL-MAC in a three-node network with
homogeneous load
For networks with 2 or more nodes, the learning
hyper-parameters in Eqn. (4) are made empirically dependent on
the network size N as follows. We set $\alpha = 1/(N-1)$ because with
larger N, both the number of contributing terms in the
expression of $S$ in Eqn. (4) and the value of $S$ itself go up. This
effect is compensated by making $\alpha$ (the coefficient of $S$)
inversely proportional to the number of one-hop neighbors
(which is N-1 for a fully connected network). After setting the
value of $\alpha$, the parameter $\beta$ in Eqn. (4) is determined
empirically. It is observed that for a given $\alpha$, a range of values
of $\beta$ can be obtained for which the system converges. That range
decreases with larger N. The relationship was experimentally
found as $\beta = 0.33 - 0.05 \times N$. Using this empirical relationship,
the reward expression from Eqn. (4) can be rewritten as:

$$r_i = \frac{1}{N-1} S + (0.33 - 0.05\,N)\, F_i \quad (7)$$
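Under the empirical relations reconstructed above, the network-size dependent hyper-parameters can be computed as in the sketch below; the helper names are hypothetical.

```c
/* Network-size dependent hyper-parameters for the fully connected case,
 * following alpha = 1/(N-1) and beta = 0.33 - 0.05*N as reconstructed above. */
static double alpha_for_size(int n_nodes)
{
    return 1.0 / (double)(n_nodes - 1);
}

static double beta_for_size(int n_nodes)
{
    return 0.33 - 0.05 * (double)n_nodes;
}
```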
Fig. 13: Load versus throughput plots for different values of the priority coefficients (priority among the nodes) for a three-node network.
Performance under heterogeneous loads is analyzed and
reported in Fig. 12. Three different situations with different
fixed loads on two of the nodes are studied. In all three cases,
throughput variation is studied by varying node 2's application
layer load. It can be seen that for loads below the optimal value,
DMRL-MAC mimics the performance of regular ALOHA. When the load
of any of these nodes goes higher than the optimal value
($g_{opt}$), learning enables the node to adjust its transmit
probability so that the optimal ALOHA throughput is maintained by
limiting both types of collisions.
Priority among the nodes can be assigned using the
coefficients $\beta_i$ in the reward function in Eqn. (4). As shown in
Fig. 12: Performance of DMRL-MAC in a three-node network with heterogeneous load [s1, s2 and s3 are the individual throughputs of node 1, node 2 and node 3 respectively; S is the networkwide throughput; g1, g2 and g3 are the loads (in Erlangs) from the application layers of node 1, node 2 and node 3 respectively]
Fig. 13, when the loads are below the optimal value, DMRL-MAC
mimics the performance of Pure ALOHA for any values of the
priority coefficients. Also, if the coefficients are equal for all
nodes, the channel bandwidth is fairly distributed. However, when
the coefficient of node 1 and those of the other nodes are set to
0.5 and 0.005 respectively, the throughput for node 1
(i.e., s1) increases with increasing load. The node with the
largest coefficient value gets the highest portion of the available
wireless bandwidth. These results demonstrate how DMRL-MAC's
prioritized access holds for a 3-node fully connected network.
Fig. 14 depicts the performance of larger networks with 4, 5,
and 6 nodes. It shows that the desirable properties of DMRL-
MAC in attaining the theoretical maximum throughput and
maintaining it at higher loads are still valid for such larger
networks. However, the convergence becomes increasingly
slower as the networks grow in size. The convergence time
distributions in Fig. 15 in fact show that the learning almost
stops working for networks with 7 or more nodes.
Fig. 15: Convergence behavior for different number of nodes (pmf)
To investigate this more, we have plotted the self- and inter-
collision probabilities in Fig. 16. The first notable observation
is that while the self-collision probability increases with higher
application layer loads, it is almost insensitive to the network size. As
expected, the inter-collisions do increase with larger network
size. However, the rate of increase of the inter-collisions with
increasing application layer loads reduces in larger networks.
The implication of this observation is that as the network size
increases, the inter-collisions become less and less sensitive to
individual node’s transmission decisions. In other words, the
reinforcement learning agents’ actions become less influential
for changing the system’s states. Thus, the system becomes
stateless and moves out of the realm of the reinforcement learning
based solution discussed here. A stateless Q-learning or Multi-arm
bandit-based MDP solution [2] is needed for this situation, and it
will be explored as the next step beyond this paper, which reports
our initial approaches and preliminary results.
Fig. 16: Variation of inter-collision and self-collision probability with
number of nodes (N) and traffic load (g)
VII. PARTIALLY CONNECTED MULTI-NODE NETWORKS
Although the primary goal of this paper is to report the key
concepts of RL based wireless medium access and its
performance in fully connected networks, here we include a
preliminary set of results to demonstrate how the protocol
behaves in a partially connected scenario shown in Fig. 1(b). In
this network, since node-2 can listen to both node-1 and node-
3, the transmissions from node-2 experience more collisions
than those experienced by node-1 and node-3. Fig. 17(a) shows
the load-throughput plot for this network when the nodes run
regular ALOHA protocol. It is observed that the maximum
throughput for node-2 is half the maximum throughput for
node-1 and node-3. The
goal of the proposed DMRL-MAC is to adjust the transmit
probability of each node such that the channel bandwidth is
maximized, with the condition that it is fairly distributed among
the three nodes. Fig. 17(b) shows the throughput variation with
the applied load when the throughput is equal for all three
Fig. 14: Performance of DMRL-MAC in a network with 4,5 and 6 nodes for homogeneous load
nodes. It can be seen from the plot that, due to more collisions of
packets from node-2, the load of node-2 needs to be larger than
those of node-1 and node-3 for it to receive equal bandwidth. This
analysis provides the benchmark per-node and network-wide
throughputs for testing the DMRL-MAC protocol. This benchmark
throughput is lower than the maximum attainable throughput
with homogeneous load.
Implementation of DMRL-MAC for a partially connected
network is different from the fully connected case in the
following ways. A node in this case has no throughput
information about its 2-hop nodes and beyond, and has only
partial information about the throughput of its 1-hop neighbors. A
node can monitor the number of transmissions from its 1-hop
neighbors that are overlapping and non-overlapping with its
own transmissions. In the absence of any network-wide
information, a node running DMRL-MAC treats: i) its
immediate neighborhood (i.e., 1-hop) as the complete network,
and ii) the estimated throughput of its 1-hop neighbors
computed from monitored non-overlapping transmissions as
their approximated actual throughputs.
Performance of DMRL-MAC executed with such partial
information for the partially connected linear network from Fig.
1(b) is shown in Fig. 17(c). It can be observed that this
distributed learning-based approach can achieve a near-optimal
throughput in this case. In this scenario, when the load is below
the optimal value, the system achieves pure ALOHA performance,
and when the load exceeds the optimal value, a near-optimal
throughput is maintained even at higher loads. Moreover, the
throughput is equally distributed among all the nodes even
though node-2 is in a more disadvantageous position than node-
1 and node-3 in terms of collisions. It should be noted that the
results in this section are a preliminary reporting for the partially
connected scenarios. More protocol refinements and
experiments will be needed to address learning in the absence
of complete throughput information in the neighborhood. These
results are just to indicate the promise of distributed
reinforcement learning in partially connected networks. More
results on this will be published elsewhere.
VIII. RELATED WORK
A number of Reinforcement Learning based network access
strategies have been explored in the literature. The approaches in [8,
9] apply Q-learning in order to increase MAC layer throughput
by reducing collisions in a slotted access system. Unlike the
DMRL framework in this paper, the approach in [9] does not
solve the access solution in scenarios with load and network
topology heterogeneity. The ability to deal with load
heterogeneity is also missing in [10,11], which present Q-
learning based adaptive learning for underwater networks.
A number of papers explored RL as a means to control
network resource allocation. Throughput maximization for
secondary nodes in a cognitive network was considered in [12].
In a time-slotted system, under the assumption of a
homogeneous secondary network, Q-learning is used for
solving a Partially Observable Markov Decision Process (POMDP) in
order to reduce the interference between the primary and
secondary users. In [13] and [14], Q-learning was explored to
find efficient and load-dependent radio transmission schedules.
It was shown that the approach delivers good throughput and
delay compared to the classical MAC protocols. However, the
protocols are designed considering a homogeneous and dense
network. Moreover, unlike the proposed DMRL in this paper,
the approach in [13] does not learn to do load modulation for
maintaining good throughput at higher loads. Two more Q-
learning-based resource allocation protocols were proposed in
[15] and [16]. These protocols rely on the ability of carrier
sensing to dynamically modulate an access contention window
[16] so that collisions are reduced. This work is not directly
comparable to the work presented here due to its reliance on
carrier sensing. In [17], the authors develop an RL-based
mechanism for scheduling sleep and active transmission periods
in a wireless network. While this technique was shown to be
able to achieve throughputs that are higher than the traditional
MAC protocols, the throughput falls with the increase in the
network traffic. Such loss of performance at high application
layer load is specifically avoided in our proposed DMRL work.
IX. SUMMARY AND CONCLUSIONS
A multi-agent Reinforcement Learning based protocol for
MAC layer radio access has been proposed. The protocol allows
Fig. 17: Load vs throughput plot (a) for ALOHA when g1 = g2 = g3, (b) for ALOHA when the throughput is equal for all three nodes, and (c) for DMRL-MAC when g1 = g2 = g3
network nodes to individually adjust transmit probabilities so
that self-collisions and inter-collisions are reduced. As a result,
the nodes can attain the theoretical maximum throughput and
sustain it for higher loading conditions. An important feature of
the protocol is that it can deal with network heterogeneity in
terms of load, topology, and QoS needs in the form of access
priorities. Moreover, learning allows the nodes to self-adjust in
a dynamic environment with a time-varying traffic load. The
proposed protocol has been tested on networks of various sizes
with nodes that lack carrier sensing abilities. From a Markov Decision
Process (MDP) perspective, it is shown that the system becomes
increasingly stateless for larger networks. In such scenarios,
stateless Q-learning or Multi-arm bandit-based solutions can be
used, which will be explored in a separate paper. As next steps,
we will explore MDP solutions in networks with arbitrary
topology. We will also explore the framework for nodes with
channel sensing capabilities and compare it with the existing
CSMA family of protocols. Future work will also consider other
access performance metrics such as energy and delay.
X. REFERENCES
[1] Z. Lan, H. Jiang and X. Wu, "Decentralized cognitive MAC protocol
design based on POMDP and Q-Learning," 7th International Conference
on Communications and Networking in China, Kun Ming, 2012, pp. 548-
551, doi: 10.1109/ChinaCom.2012.6417543.
[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An
introduction. MIT press, 2018.
[3] Xu, Yi-Han, et al. "Reinforcement Learning (RL)-Based Energy Efficient
Resource Allocation for Energy Harvesting-Powered Wireless Body Area
Network." Sensors 20.1 (2020): 44.
[4] Leon-Garcia, Alberto. "Probability, statistics, and random processes for
electrical engineering." (2017).
[5] Van Otterlo, Martijn, and Marco Wiering. "Reinforcement learning and
markov decision processes." Reinforcement Learning. Springer, Berlin,
Heidelberg, 2012. 3-42.
[6] Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine
learning 8.3-4 (1992): 279-292.
[7] Matignon, Laëtitia, Guillaume J. Laurent, and Nadine Le Fort-Piat.
"Hysteretic q-learning: an algorithm for decentralized reinforcement
learning in cooperative multi-agent teams." 2007 IEEE/RSJ International
Conference on Intelligent Robots and Systems. IEEE, 2007.
[8] Yu, Yiding, Taotao Wang, and Soung Chang Liew. "Deep-reinforcement
learning multiple access for heterogeneous wireless networks." IEEE
Journal on Selected Areas in Communications 37.6 (2019): 1277-1290.
[9] Lee, Taegyeom, and Ohyun Jo Shin. "CoRL: Collaborative Reinforcement
Learning-Based MAC Protocol for IoT Networks." Electronics 9.1 (2020):
143.
[10] S. H. Park, P. D. Mitchell and D. Grace, "Reinforcement Learning Based
MAC Protocol (UW-ALOHA-Q) for Underwater Acoustic Sensor
Networks," in IEEE Access, vol. 7, pp. 165531-165542, 2019, doi:
10.1109/ACCESS.2019.2953801.
[11] S. H. Park, P. D. Mitchell and D. Grace, "Performance of the ALOHA-Q
MAC Protocol for Underwater Acoustic Networks," 2018 International
Conference on Computing, Electronics & Communications Engineering
(iCCECE), Southend, United Kingdom, 2018, pp. 189-194, doi:
10.1109/iCCECOME.2018.8658631.
[12] Z. Lan, H. Jiang and X. Wu, "Decentralized cognitive MAC protocol
design based on POMDP and Q-Learning," 7th International Conference
on Communications and Networking in China, Kun Ming, 2012, pp. 548-
551, doi: 10.1109/ChinaCom.2012.6417543.
[13] Galzarano S., Liotta A., Fortino G., “QL-MAC: A Q-Learning Based
MAC for Wireless Sensor Networks”, Algorithms and Architectures for
Parallel Processing. ICA3PP 2013, Lecture Notes in Computer Science,
vol 8286. Springer, Cham.
[14] Galzarano S., Fortino G., Liotta A., “A Learning-Based MAC for Energy
Efficient Wireless Sensor Networks”, Internet and Distributed Computing
Systems, IDCS 2014. Lecture Notes in Computer Science, vol 8729.
Springer, Cham.
[15] Bayat-Yeganeh, Hossein, Vahid Shah-Mansouri, and Hamed Kebriaei. "A
multi-state Q-learning based CSMA MAC protocol for wireless
networks." Wireless Networks 24.4 (2018): 1251-1264.
[16] Ali, Rashid, et al. "Deep reinforcement learning paradigm for performance
optimization of channel observation-based MAC protocols in dense
WLANs." IEEE Access 7 (2018): 3500-3511.
[17] Zhenzhen Liu and I. Elhanany, "RL-MAC: A QoS-Aware Reinforcement
Learning based MAC Protocol for Wireless Sensor Networks," 2006 IEEE
International Conference on Networking, Sensing and Control, Ft.
Lauderdale, FL, 2006, pp. 768-773, doi: 10.1109/ICNSC.2006.1673243.