Task Partitioning and Scheduling Based on
Stochastic Policy Gradient in Mobile Crowdsensing
Tianjing Wang, Member, IEEE, Yu Zhang, Hang Shen, Member, IEEE, and Guangwei Bai
Abstract—Deep reinforcement learning (DRL) has become prevalent for task assignment decision-making in mobile crowdsensing (MCS). However, when facing sensing scenarios with varying numbers of workers or task attributes, existing DRL-based task assignment schemes fail to generate matching policies continuously and are susceptible to environmental fluctuations. To overcome these issues, a twin-delayed deep stochastic policy gradient (TDDS) approach is presented for balanced and low-latency MCS task decomposition and parallel subtask allocation. A masked attention mechanism is incorporated into the policy network to enable TDDS to adapt to task-attribute and subtask variations. To enhance environmental adaptability, an off-policy DRL algorithm incorporating experience replay is developed to eliminate sample correlation during training. Gumbel-Softmax sampling is integrated into the twin-delayed deep deterministic policy gradient (TD3) algorithm to support discrete action-space decisions, and a customized reward strategy is designed to reduce task completion delay and balance workloads. Extensive simulation results confirm that the proposed scheme outperforms mainstream DRL baselines in terms of environmental adaptability, task completion delay, and workload balancing.
Index Terms—Attention mechanism, Gumbel-Softmax sampling, mobile crowdsensing (MCS), parallel subtask allocation, task partition.
I. INTRODUCTION
MOBILE crowdsensing (MCS) technology [1] has become prevalent in recent years due to the rapid proliferation of intelligent mobile devices with computing, perception, storage, and communication capabilities. Unlike traditional sensor networks, MCS leverages intelligent devices carried by mobile users as basic sensing units and forms working groups to collaboratively complete large-scale data sensing. MCS has been applied to various fields such as object tracking [2], environmental monitoring [3], and smart cities [4].
An MCS system must recruit a large number of workers
to complete the sensing tasks continuously submitted by the
platform [5],[6]. How task allocation is performed to optimize
the system's sensing performance is crucial. Task allocation can be offline or online [7]. The former follows a predetermined plan for task allocation, which cannot adjust the allocation strategy in real time to adapt to environmental dynamics. The latter can handle task allocation flexibly under dynamic environments. Song et al. [8] established an online multiskill task allocation model and integrated a greedy algorithm to match dynamic tasks with specific-skill workers. Schmitz and Lykourentzou [9] designed an online-optimized greedy algorithm for reliable task allocation when the budget and quality cycle change. However, these works focused on allocation for simple tasks. The emergence of various MCS applications such as ride-hailing (e.g., DiDi¹), on-demand delivery (e.g., Ele.me²), and live maps (e.g., Waze³) has made it inevitable for MCS systems to support continuous allocation of complex tasks.
Unlike simple tasks that a worker can complete independently, a complex task requires division into multiple subtasks. These subtasks must then be allocated to several workers for collaborative completion. In this scenario, any delay in one subtask can potentially affect the timely completion of the entire task. Researchers have explored reinforcement learning (RL) and deep RL (DRL) approaches to decide on multitask allocation. Xu et al. [10] utilized RL to optimize the allocation strategy of parallel subtasks. However, RL-based methods face dimension explosion as the number of parallel subtasks increases. DRL-based schemes can handle high-dimensional action spaces for parallel subtask allocation. Xu and Song [11] designed a multiagent DRL algorithm to train a local model for each worker and obtain parallel allocation actions through multiagent cooperation. Ding et al. [12] improved proximal policy optimization (PPO), solving the unreasonable action-matching problem caused by spatiotemporal complexity by dynamically matching tasks, workers, and workplaces. However, most existing DRL-based task allocation schemes directly or indirectly assume that the number of tasks or workers is static within a period, reducing their usability.
A. Challenging Issues and Related Works
For DRL-based MCS task allocation in a time-varying environment, many challenges remain.
¹https://www.didiglobal.com/
²https://www.ele.me
³https://www.waze.com
1) Dynamics of Parallel Subtasks and Workers: The number of parallel subtasks divided from a complex task is not constant due to the tasks' heterogeneity. Xie et al. [13] presented a multistage complex task decomposition framework that dynamically divides tasks according to knowledge-intensive types and assigns subtasks to suitable service providers. Liu and Zhao [14] developed a multiattribute E-CARGO task assignment model based on adaptive heterogeneous residual networks, considering heterogeneous workers and tasks. However, these solutions assume the number of workers is fixed to reduce model complexity. Sun et al. [15] pointed out that online task allocation in MCS is dynamic and uncertain. They designed a spatial-perception multiagent Q-learning algorithm for dynamic spatial task allocation. Liu et al. [16] proposed a distributed execution framework based on DRL to provide reliable and accurate sensing services when the number of tasks and vehicles changes. In [17], a DRL-based algorithm was designed for scheduling workflows on small time scales for task offloading in space-air-ground integrated vehicular networks. However, these methods deal with simple tasks rather than parallel subtasks. Designing a DRL model for dynamic task division and worker selection is challenging.
2) Diversity of Task Allocation Environments: Differentiated tasks require MCS systems to provide differentiated system services. Several studies have proposed DRL-based multitask allocation to address the challenges of multitask concurrency, task and worker heterogeneity, and participant preference changes. Han et al. [18] proposed a multiagent DRL-based multitask allocation scheme to provide differentiated sensing responses. Considering the complicated and dynamic environment of vehicular computing, Qi et al. presented a DRL-based parallel task scheduling approach [19], where the output branches of multitask learning are fine-matched to parallel scheduling. Zhao et al. [20] designed a similarity function on the task transfer graph to promote the allocation of personalized multitasks. The advantage of these on-policy algorithms is that parallel subpolicies can output personalized allocation decisions, but the samples they collect are strongly correlated, leading to weak model generalization and difficulty in guaranteeing service quality. Unlike on-policy algorithms that periodically abandon samples collected from interaction with the environment, off-policy algorithms maintain an experience replay pool that stores diversified samples to enhance the model's adaptability to differentiated tasks. A double deep Q-network with a priority experience replay pool is studied in [21], planning a travel path that meets the requirements of each mobile user. Existing deep deterministic policy gradient (DDPG) [22] and twin-delayed deep deterministic policy gradient (TD3) [23] methods can also solve task scheduling, but they operate in continuous action spaces; random sampling is not differentiable in discrete action spaces, so the model cannot be trained using backpropagation. Therefore, exploring an off-policy algorithm for discrete action spaces is necessary to handle environment diversity and ensure service stability.
3) Long-Term Balanced Scheduling in Continuous Task Allocation: Load balancing is important for achieving long-term optimized task scheduling. Several studies have proposed DRL-based task allocation considering the long-term utility of workers and requesters. Zhao et al. [24] proposed a discrete threshold task allocation algorithm based on policy gradient that accounts for long-term utility, significantly improving the utility of long-term continuous task allocation. In [25], a multiagent DRL solution was proposed to generate a multitask allocation strategy that considers the long-term interests of workers and requesters. This scheme designs a reward function that combines local and global returns to balance short-term and long-term benefits and achieve a long-term equilibrium task completion rate. Ma et al. [26] proposed a real-time task dynamic scheduling model based on centralized learning, which makes more accurate continuous task scheduling decisions by analyzing the processor load of workers. Its load balance is better than that of random task allocation methods, improving CPU utilization and service quality. However, these methods do not incorporate balance indicators into the DRL reward function, making it difficult for the model to learn scheduling experience that satisfies long-term load balancing.
B. Contributions and Organization
In response to the above issues, we propose a twin-delayed deep stochastic policy gradient (TDDS) approach for long-term balanced and low-latency task allocation via dynamic partitioning and scheduling. The main contributions include the following.
1) We construct a scalable policy network consisting of two shared linear layers, which extract state features of subtasks and workers, along with a masked attention mechanism to match subtasks and workers. This network can independently infer an optimal subpolicy for each subtask, with enhanced robustness of task allocation.
2) An off-policy algorithm based on TD3 is designed, which uses Gumbel-Softmax sampling to enable TD3 to output allocation decisions for parallel subtasks in discrete action spaces. The rich samples in the experience replay pool enhance model generalization to adapt to heterogeneous MCS environments.
3) We develop an appropriate reward function considering completion delay and load balancing. This encourages the model to learn from allocation experiences that optimize both indicators simultaneously, ensuring the long-term stability of task scheduling. Simulation results demonstrate that the proposed approach outperforms typical DRL-based baselines in task completion delay and environmental adaptability.
The rest of this article is organized as follows. Section II presents the system model for MCS task partitioning and parallel subtask allocation. Section III proposes a parallel subtask allocation scheme based on TDDS. Section IV analyzes the evaluation results under simulation experiments. Finally, we summarize the research work in Section V. The main notations and variables are listed in Table I.
II. SYSTEM MODEL
This section begins with an overview of MCS task partitioning and continuous subtask assignment. Then, the task partitioning and assignment are transformed into a long-term optimization problem.
TABLE I
MAIN NOTATIONS AND VARIABLES

Symbols: Definition
$a_{t,i,m,n}$: Allocation strategy for task $i$ in time window $t$
$\mathcal{I}_t$: Set of tasks in time window $t$
$L_{i,n}$: Set of incomplete subtasks for worker $n$ upon receiving $\mathcal{U}_{i,n}$
$M$: Maximum number of task subdivisions
$\mathcal{M}_i$: Set of subtasks in state $i$
$\mathcal{M}_{t,i}/M_{t,i}$: Set/number of subtasks for task $i$ in time window $t$
$N$: Maximum number of workers
$\mathcal{N}_i$: Set of workers in state $i$
$\mathcal{N}_t/N_t$: Set/number of workers in time window $t$
$\mathcal{T}/T$: Set/number of time windows
$\mathcal{U}_{i,n}/U_{i,n}$: Set/number of task $i$'s subtasks allocated to worker $n$
$u_{i,n}[j]$: The $j$th subtask executed in $\mathcal{U}_{i,n}$
$w_{i,n}[j]$: Delay from receiving $\mathcal{U}_{i,n}$ to the start of transmission of $u_{i,n}[j]$
$z^{\mathrm{sen}}_{i,n}/z^{\mathrm{tra}}_{i,n}$: Sensing/transmission time for worker $n$ from receiving $\mathcal{U}_{i,n}$ to completing $L_{i,n}$
$z^{\mathrm{sen}}_{t,i,n}/z^{\mathrm{tra}}_{t,i,n}$: The value of $z^{\mathrm{sen}}_{i,n}/z^{\mathrm{tra}}_{i,n}$ in time window $t$
Fig. 1. Consecutive MCS task allocation.
A. System Overview
Fig. 1 illustrates an MCS system comprising a control platform, task requesters, and workers. As a dispatch center, the control platform connects the task requesters and workers via base stations. Requesters create tasks that require environmental sensing, and workers with different sensing and computing abilities cooperate to complete them. The control platform assigns a continuous stream of tasks to a group of workers following the first-in-first-out (FIFO) rule. Each task is divided into parallel subtasks, and the control platform employs DRL to select the most suitable workers to complete these subtasks, taking into account the resource competition among the subtasks.

Fig. 2. Completion delay of two subtasks under FIFO.
B. Task Completion Latency Model
We now explain the allocation and execution of parallel subtasks and model the completion latency. Let $\mathcal{U}_{i,n}$ denote the set of subtasks for task $i$ allocated to worker $n$, and $U_{i,n}$ indicate the total count of these subtasks. A worker movement minimization method [27] is used to determine the execution order of subtasks in $\mathcal{U}_{i,n}$. Let $u_{i,n}[j]$ represent the $j$th subtask executed in $\mathcal{U}_{i,n}$, and $o_{i,n}[j]$ represent the size of $u_{i,n}[j]$. $l_{i,n}[j]$ refers to the location of $u_{i,n}[j]$, where $l_{i,n}[0]$ specifies the initial location of worker $n$ prior to commencing $\mathcal{U}_{i,n}$, with $l_{i,n}[0] = l_{i-1,n}[U_{i-1,n}]$. It is assumed that the movement, sensing, and transmission rates of worker $n$ are $v^{\mathrm{mov}}_n$, $v^{\mathrm{sen}}_n$, and $v^{\mathrm{tra}}_n$, respectively, and the corresponding delays are $d^{\mathrm{mov}}_{i,n}[j]$, $d^{\mathrm{sen}}_{i,n}[j]$, and $d^{\mathrm{tra}}_{i,n}[j]$:
$$
d^{\mathrm{mov}}_{i,n}[j] = \big|l_{i,n}[j] - l_{i,n}[j-1]\big| / v^{\mathrm{mov}}_n,\quad
d^{\mathrm{sen}}_{i,n}[j] = o_{i,n}[j] / v^{\mathrm{sen}}_n,\quad
d^{\mathrm{tra}}_{i,n}[j] = o_{i,n}[j] / v^{\mathrm{tra}}_n. \tag{1}
$$
Fig. 2 explains the process of worker $n$ executing the subtasks in $\mathcal{U}_{i,n}$ when $\mathcal{U}_{i,n}$ contains two subtasks, and this can be extended to the general case. Assume $L_{i,n}$ is the set of subtasks that worker $n$ has not yet completed when $\mathcal{U}_{i,n}$ arrives, with the remaining sensing and transmission delays being $z^{\mathrm{sen}}_{i,n}$ and $z^{\mathrm{tra}}_{i,n}$, respectively. After completing the sensing of $L_{i,n}$, worker $n$ moves to $l_{i,n}[1]$ to start executing the subtasks in $\mathcal{U}_{i,n}$ in sequence. Note that the delivery of a subtask cannot start until the sensing of this subtask is completed and the transmission of all subtasks before it is finished. We define $w_{i,n}[j]$ as the delay from receiving $\mathcal{U}_{i,n}$ to the start of transmission of $u_{i,n}[j]$. The recursive expression for $w_{i,n}[j]$ is
$$
w_{i,n}[j] =
\begin{cases}
\max\left\{ z^{\mathrm{sen}}_{i,n} + d^{\mathrm{mov}}_{i,n}[1] + d^{\mathrm{sen}}_{i,n}[1],\; z^{\mathrm{tra}}_{i,n} \right\}, & \text{if } j = 1\\[4pt]
\max\left\{ z^{\mathrm{sen}}_{i,n} + \sum_{j'=1}^{j}\left(d^{\mathrm{mov}}_{i,n}[j'] + d^{\mathrm{sen}}_{i,n}[j']\right),\; w_{i,n}[j-1] + d^{\mathrm{tra}}_{i,n}[j-1] \right\}, & \text{otherwise.}
\end{cases} \tag{2}
$$
Based on $w_{i,n}[j]$, the latency of completing all subtasks in $\mathcal{U}_{i,n}$ is $w_{i,n}[U_{i,n}] + d^{\mathrm{tra}}_{i,n}[U_{i,n}]$. Since parallel subtasks are allocated to multiple workers for processing, the completion latency of task $i$ is
$$
d_i = \max_{n}\left\{ w_{i,n}[U_{i,n}] + d^{\mathrm{tra}}_{i,n}[U_{i,n}] \right\}. \tag{3}
$$
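As a concrete illustration of (1)-(3), the sketch below evaluates the recursion $w_{i,n}[j]$ for a single worker and takes the maximum over workers to obtain $d_i$. The data layout (1-D locations, per-worker argument dictionaries) and the function names are our own simplifying assumptions, not the paper's implementation.

```python
# Minimal sketch of the latency model in (1)-(3); 1-D locations for simplicity.

def worker_delay(z_sen, z_tra, start_loc, subtasks, v_mov, v_sen, v_tra):
    """Time until worker n delivers its last subtask: w[U_{i,n}] + d_tra[U_{i,n}]."""
    w_prev, d_tra_prev = 0.0, 0.0
    loc = start_loc
    cum_move_sense = 0.0                          # running sum of d_mov + d_sen in (2)
    for j, (subtask_loc, size) in enumerate(subtasks, start=1):
        d_mov = abs(subtask_loc - loc) / v_mov    # eq. (1)
        d_sen = size / v_sen
        d_tra = size / v_tra
        cum_move_sense += d_mov + d_sen
        if j == 1:
            w = max(z_sen + d_mov + d_sen, z_tra)                  # first case of (2)
        else:
            w = max(z_sen + cum_move_sense, w_prev + d_tra_prev)   # second case of (2)
        w_prev, d_tra_prev, loc = w, d_tra, subtask_loc
    return w_prev + d_tra_prev

def task_delay(per_worker_args):
    """Completion latency d_i of task i: the maximum over its workers, eq. (3)."""
    return max(worker_delay(**kw) for kw in per_worker_args)

# toy example with two workers
print(task_delay([
    dict(z_sen=0.5, z_tra=0.2, start_loc=0.0, subtasks=[(1.0, 0.6), (1.5, 0.8)],
         v_mov=0.5, v_sen=0.15, v_tra=0.15),
    dict(z_sen=0.0, z_tra=0.0, start_loc=0.5, subtasks=[(0.8, 0.7)],
         v_mov=0.4, v_sen=0.12, v_tra=0.18),
]))
```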
C. Problem Formulation
In time window $t$, the group of workers is denoted as $\mathcal{N}_t$, with $N_t$ being its cardinality. A good allocation strategy should make the sensing queue lengths of the workers similar so that the network load is balanced. We relabel $z^{\mathrm{sen}}_{i,n}$ in the following as $z^{\mathrm{sen}}_{t,i,n}$ and define the balance index of task allocation as
$$
\phi_{t,i} \triangleq \frac{1}{N_t} \sum_{n\in\mathcal{N}_t} \left( z^{\mathrm{sen}}_{t,i,n} - \frac{1}{N_t} \sum_{n'\in\mathcal{N}_t} z^{\mathrm{sen}}_{t,i,n'} \right)^2. \tag{4}
$$
The smaller $\phi_{t,i}$ is, the more balanced the task allocation.
To evaluate long-term task allocation in a time-varying environment, the set of time windows is defined as $\mathcal{T}$, with $T$ being its cardinality. The set of tasks in window $t\in\mathcal{T}$ is denoted as $\mathcal{I}_t$, with $I_t$ as its cardinality, and the set of subtasks into which task $i$ is divided is $\mathcal{M}_{t,i}$. Let $a_{t,i,m,n}=1$ indicate that subtask $m\in\mathcal{M}_{t,i}$ is assigned to worker $n$; otherwise, $a_{t,i,m,n}=0$. Assuming that the completion delay of task $i$ in window $t$ is $d_{t,i}$, the task partitioning and continuous parallel subtask assignment are transformed into the following long-term optimization problem:
$$
\mathbf{P1:}\quad \min \; \lim_{T\to\infty} \frac{1}{T} \sum_{t\in\mathcal{T}} \left( \zeta\,\mathbb{E}(d_{t,i}) + (1-\zeta)\,\mathbb{E}(\phi_{t,i}) \right)
$$
$$
\text{s.t.}\quad \sum_{n\in\mathcal{N}_t} a_{t,i,m,n} = 1,\quad \forall t\in\mathcal{T},\, i\in\mathcal{I}_t,\, m\in\mathcal{M}_{t,i} \tag{5a}
$$
$$
a_{t,i,m,n} \in \{0,1\},\quad \forall t\in\mathcal{T},\, i\in\mathcal{I}_t,\, m\in\mathcal{M}_{t,i},\, n\in\mathcal{N}_t \tag{5b}
$$
where $\zeta$ is a weight parameter. Constraint (5a) states that each subtask is assigned to exactly one worker, and (5b) defines the 0-1 decision variable of subtask assignment.
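To make (4) and the per-window cost inside P1 concrete, the following snippet computes the balance index as the variance of the workers' remaining sensing times and combines it with the completion delay using the weight ζ = 0.75 (the value later listed in Table II); the inputs are made-up numbers.

```python
import numpy as np

def balance_index(z_sen):
    """phi_{t,i} in (4): variance of the workers' remaining sensing times."""
    z = np.asarray(z_sen, dtype=float)
    return float(np.mean((z - z.mean()) ** 2))

def window_cost(delays, balances, zeta=0.75):
    """Per-window term of P1: zeta*E(d_{t,i}) + (1 - zeta)*E(phi_{t,i})."""
    return zeta * float(np.mean(delays)) + (1.0 - zeta) * float(np.mean(balances))

z_sen_queues = [3.0, 5.0, 4.0, 8.0]        # toy sensing-queue lengths (min)
print(balance_index(z_sen_queues))         # 3.5
print(window_cost(delays=[12.0, 9.5], balances=[3.5, 2.0]))
```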
III. PROPOSED SOLUTION
A DRL approach that supports continuous parallel subtask allocation is used to solve problem P1. The MCS platform is abstracted as an agent interacting with the environment at discrete time steps. During the $i$th interaction with the environment, the agent obtains an action $a_i$ according to the environment state $s_i$ and policy $\pi_\phi$. Then, the environment transitions to the next state $s_{i+1}$ according to the action $a_i$ and returns a reward $r_i$. The state space, action space, and reward are described as follows.
1) State Space: To unify the dimensions of input tensors, the number of parallel subtasks is set not to exceed $M$. When the number is less than $M$, the states of all missing subtasks are filled with 0. Similarly, when the number of workers is less than $N$, the states of all missing workers are also filled with 0 (a brief padding sketch is given after this list). The state of subtask $m$ is represented by $\hat{s}_{i,m} = (l_{i,m}, o_{i,m})$, where $l_{i,m}$ and $o_{i,m}$ represent the location and size of subtask $m$, respectively. The state of task $i$ is represented by the combination of the states of the $M$ subtasks
$$
s^{\mathrm{task}}_i = (\hat{s}_{i,1}, \ldots, \hat{s}_{i,M}). \tag{6}
$$
The state of worker $n$ is represented by $\tilde{s}_{i,n} = (v^{\mathrm{mov}}_n, v^{\mathrm{sen}}_n, v^{\mathrm{tra}}_n, l_{i,n}[0], z^{\mathrm{sen}}_{i,n}, z^{\mathrm{tra}}_{i,n})$. Similarly, the state of the $N$ workers is represented by
$$
s^{\mathrm{worker}}_i = (\tilde{s}_{i,1}, \ldots, \tilde{s}_{i,N}). \tag{7}
$$
Finally, the system state is composed by concatenating the task state and the worker state
$$
s_i = (s^{\mathrm{task}}_i, s^{\mathrm{worker}}_i). \tag{8}
$$
2) Action Space: Let $\mathcal{M}_i$ and $\mathcal{N}_i$ denote the sets of subtasks and workers at state $s_i$, respectively. The dimension of the action space for assigning $\mathcal{M}_i$ to $\mathcal{N}_i$ is at most $N^M$. If the number of output-layer neurons of the DRL policy network were set to $N^M$, the high-dimensional action space would make learning difficult to converge. For this reason, the allocation decision for the $M$ subtasks is decomposed into $M$ subdecisions, and the number of output-layer neurons of the policy network is reduced to $M\cdot N$. For each subdecision, the action space is $\{1, 2, \ldots, N\}$, where $n$ means that the corresponding subtask is assigned to worker $n$. The actions corresponding to missing subtasks are ignored if the number of subtasks is less than $M$. Thus, the output action $a_i$ represents assigning the $M$ subtasks of task $i$ to the $N$ workers.
3) Reward Function: During DRL training, we give an immediate reward $r_i = r(s_i, a_i)$ that evaluates the merit of the selected action. The goal of task allocation is to minimize the task completion delay and the balance variance, while the goal of DRL is to maximize the long-term reward, so the reward function is defined as
$$
r_i = \frac{\sigma_1 - \left( \zeta d_i + (1-\zeta)\phi_i \right)}{\sigma_2} \tag{9}
$$
where $\phi_i$ represents the balance index of the allocation of task $i$, and the parameters $\sigma_1$ and $\sigma_2$ are used to control the range of $d_i$ and $\phi_i$ for the sake of DRL training.
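Referring back to the state-space definition in (6)-(8), the sketch below shows one way to assemble the zero-padded state vector. The per-subtask layout (2-D location, size) and per-worker layout (three rates, 2-D location, two queue delays) follow the definitions above; the helper name and exact feature ordering are assumptions. With M = 8 and N = 16 this yields a 136-dimensional state, which matches the MLP input size reported in the ablation study (Section IV-E).

```python
import numpy as np

M_MAX, N_MAX = 8, 16          # maximum numbers of subtasks and workers
SUB_DIM, WORKER_DIM = 3, 7    # (x, y, size) and (v_mov, v_sen, v_tra, x, y, z_sen, z_tra)

def build_state(subtasks, workers):
    """Concatenate zero-padded subtask and worker states as in (6)-(8)."""
    s_task = np.zeros((M_MAX, SUB_DIM))
    s_task[:len(subtasks)] = subtasks          # missing subtasks stay zero
    s_worker = np.zeros((N_MAX, WORKER_DIM))
    s_worker[:len(workers)] = workers          # missing workers stay zero
    return np.concatenate([s_task.ravel(), s_worker.ravel()])

subtasks = [(0.3, 1.2, 0.6), (1.8, 0.4, 0.9)]                      # two real subtasks
workers = [(0.4, 0.15, 0.12, 1.0, 1.0, 2.0, 1.5)]                  # one real worker
print(build_state(subtasks, workers).shape)    # (8*3 + 16*7,) = (136,)
```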
The agent interacts with the environment and generates a sampling trajectory $\varsigma = \{s_1, a_1, r_1, \ldots, s_i, a_i, r_i, \ldots\}$ under policy $\pi$ based on (8) and (9). The optimal allocation policy can be obtained by maximizing the expected return of the sampling trajectory, expressed as
$$
\pi^{*} = \max_{\pi} J(\pi) = \mathbb{E}_{\varsigma\sim\pi}\left[ \sum_{i\geq 0} \gamma^{i} r_{i+1} \right] \tag{10}
$$
where $\gamma$ is a discount factor.
Fig. 3. TDDS model structure.

Fig. 4. Policy network for multiple subtasks.

Most studies use on-policy DRL strategies such as asynchronous advantage actor-critic (A3C) [28] and PPO for task allocation [29]. Although on-policy strategies are suitable for environments where data are continuously generated, they are susceptible to noise and may quickly forget previously learned information. The off-policy strategy shows more significant advantages in diverse environments by utilizing an experience pool to store and reuse past data, as it can learn from historical data with enhanced adaptability. TD3 is an off-policy algorithm for continuous control that effectively alleviates the overestimation and high variance of the expected long-term return of states or state-action pairs. This motivates us to apply it to parallel subtask allocation. However, converting TD3 from continuous to discrete control remains challenging.
To address this issue, we construct a twin-delayed deep stochastic policy gradient (TDDS) model based on TD3, which uses two critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and two target critic networks $Q_{\theta'_1}$ and $Q_{\theta'_2}$ with multilayer perceptron (MLP) architectures, as shown in Fig. 3. Moreover, we create an experience replay pool that enables TDDS to store samples collected by interacting with the environment.
A. Policy Network Design
Considering the inconsistent action-space dimensions caused by dynamic task division, we design the policy network $\pi_\phi$ and the target policy network $\pi_{\phi'}$ in Fig. 4 with linear layers and an attention aggregation layer. The policy network takes the sampled state $s_i$ as input and passes the $M$ subtask states $(\hat{s}_{i,1}, \ldots, \hat{s}_{i,M})$ through linear layer 1 to obtain $M$ queries $\{q_{i,m}\}$ of dimension $D$. Similarly, it passes the $N$ worker states $(\tilde{s}_{i,1}, \ldots, \tilde{s}_{i,N})$ through linear layer 2 to obtain $N$ keys $\{\mathrm{key}_{i,n}\}$ of dimension $D$. The attention score for $q_{i,m}$ and $\mathrm{key}_{i,n}$ is calculated as
$$
\omega_{i,m,n} =
\begin{cases}
\dfrac{q_{i,m}\cdot \mathrm{key}_{i,n}}{\sqrt{D}}, & \text{if } m\in\mathcal{M}_i,\, n\in\mathcal{N}_i\\[6pt]
-\infty, & \text{otherwise.}
\end{cases} \tag{11}
$$
The attention weight of query $m$ selecting key $n$ is determined as
$$
\alpha_{i,m,n} = \frac{\exp(\omega_{i,m,n})}{\sum_{n'\in\mathcal{N}} \exp(\omega_{i,m,n'})}. \tag{12}
$$
The larger the value of $\alpha_{i,m,n}$, the higher the matching degree between subtask $m$ and worker $n$. Denoting the action distribution of subpolicy $m$ as $\pi_{\phi,m}(\cdot|s_i) = (\alpha_{i,m,1}, \ldots, \alpha_{i,m,N})$, the output of policy network $\pi_\phi$ is the set of $M$ subpolicies paired with the $M$ subtasks, $\{\pi_{\phi,m}(\cdot|s_i)\}$. The attention aggregation layer can perceive the resource competition among parallel subtasks and learn how to map the state associating one subtask with the $N$ workers to the corresponding subpolicy. Benefiting from the masked attention mechanism, the missing subtasks or workers used for padding do not affect the policy network update [30].
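A minimal PyTorch sketch of the masked attention aggregation in (11)-(12) is given below. The layer sizes, feature dimensions, and class name are illustrative assumptions rather than the paper's exact configuration; padded subtasks and workers are masked with $-\infty$ before the softmax, so they receive zero attention weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAttentionPolicy(nn.Module):
    """Maps (subtask states, worker states) to M per-subtask distributions over N workers."""
    def __init__(self, sub_dim=3, worker_dim=7, d=128):
        super().__init__()
        self.d = d
        self.query = nn.Linear(sub_dim, d)     # linear layer 1: subtasks -> queries
        self.key = nn.Linear(worker_dim, d)    # linear layer 2: workers -> keys

    def forward(self, s_task, s_worker, sub_mask, worker_mask):
        # s_task: (B, M, sub_dim), s_worker: (B, N, worker_dim); masks are boolean.
        q = self.query(s_task)                             # (B, M, d)
        k = self.key(s_worker)                             # (B, N, d)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5   # (B, M, N), eq. (11)
        invalid = ~(sub_mask.unsqueeze(-1) & worker_mask.unsqueeze(1))
        scores = scores.masked_fill(invalid, float("-inf"))
        probs = F.softmax(scores, dim=-1)                  # eq. (12): one subpolicy per subtask
        return torch.nan_to_num(probs)                     # rows of padded subtasks become all-zero

policy = MaskedAttentionPolicy()
s_task, s_worker = torch.randn(1, 8, 3), torch.randn(1, 16, 7)
sub_mask = torch.tensor([[True] * 2 + [False] * 6])        # 2 real subtasks
worker_mask = torch.tensor([[True] * 10 + [False] * 6])    # 10 real workers
print(policy(s_task, s_worker, sub_mask, worker_mask).shape)  # torch.Size([1, 8, 16])
```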
B. Gumbel-Softmax Sampling
TDDS adapts TD3, which originally operates in a continuous action space, to work in a discrete action space by applying Gumbel-Softmax sampling. Suppose the action probability vector $p_{i,m} = (p_{i,m,1}, \ldots, p_{i,m,N})$ output by subpolicy $m$ under the state $s_i$ satisfies $\sum_{n\in\mathcal{N}} p_{i,m,n} = 1$. The common Gumbel-Max trick [31] is used to sample the discrete probability distribution $p_{i,m}$, and one-hot encoding is used to represent the sampled action as
$$
F(p_{i,m}) = \mathrm{one\_hot}\Big( \arg\max_{n} \big( g_{i,m,n} + \log p_{i,m,n} \big) \Big) \tag{13}
$$
where $g_{i,m,n} \sim \mathrm{Gumbel}(0,1)$. Because (13) is not differentiable with respect to $p_{i,m}$, backpropagation cannot be used to update the network parameters. The continuous Softmax function
$$
e_{i,m,n} = \frac{\exp\big( (g_{i,m,n} + \log p_{i,m,n}) / \tau \big)}{\sum_{n'\in\mathcal{N}} \exp\big( (g_{i,m,n'} + \log p_{i,m,n'}) / \tau \big)} \tag{14}
$$
is used to approximate (13), obtaining the differentiable Gumbel-Softmax sample
$$
G(p_{i,m}) = (e_{i,m,1}, \ldots, e_{i,m,N}). \tag{15}
$$
In (14), $\tau$ is the temperature coefficient used to control the degree of approximation of $G(p_{i,m})$ to $F(p_{i,m})$. To describe the allocation of multiple subtasks, we use $p = (p_1, \ldots, p_M)$ to denote a multidimensional probability distribution and $G(p) = (G(p_1), \ldots, G(p_M))$ to denote the Gumbel-Softmax sampling of $p$.
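The relaxation in (13)-(15) takes only a few lines of PyTorch. The helper below draws Gumbel(0,1) noise, applies (14) with temperature τ, and optionally applies a straight-through one-hot pass (an implementation variant we assume here, not something stated in the paper). PyTorch's built-in F.gumbel_softmax(logits, tau) provides the same functionality on logits.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(probs, tau=0.5, hard=False):
    """Differentiable sample G(p) from a categorical distribution p, eqs. (13)-(15)."""
    u = torch.rand_like(probs).clamp_min(1e-20)
    g = -torch.log(-torch.log(u))                                  # Gumbel(0,1) noise
    y = F.softmax((g + torch.log(probs + 1e-20)) / tau, dim=-1)    # eq. (14)
    if hard:
        # optional straight-through variant: one-hot forward pass, soft gradient
        one_hot = F.one_hot(y.argmax(dim=-1), probs.shape[-1]).float()
        y = (one_hot - y).detach() + y
    return y

p = torch.tensor([[0.7, 0.2, 0.1]], requires_grad=True)   # toy subpolicy over 3 workers
a = gumbel_softmax_sample(p, tau=0.5)
a.sum().backward()                                         # gradients flow back to p
print(a, p.grad)
```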
C. Model Update Strategy
The agent uses the policy network $\pi_\phi$ to interact with the environment and obtains samples $(s_i, a_i, r_i, s_{i+1})$ to fill the experience replay pool (see step 1 in Fig. 3). Let $s'_k$ represent the next state of $s_k$, and let $\mathcal{B}$ represent the set of random indices of the $B$ samples drawn from the experience replay pool. When there are enough samples in the pool, the agent randomly samples $B$ tuples $(s_k, a_k, r_k, s'_k)$, $k\in\mathcal{B}$, and inputs $s'_k$ into the target policy network $\pi_{\phi'}$ to obtain the action probabilities of the $M$ subpolicies $\pi_{\phi'}(\cdot|s'_k) \triangleq (\pi_{\phi',1}(\cdot|s'_k), \ldots, \pi_{\phi',M}(\cdot|s'_k))$. Each subpolicy of $\pi_{\phi'}(\cdot|s'_k)$ is sampled to obtain a concatenated action vector $\tilde{a}_k \triangleq (\tilde{a}_{k,1}, \ldots, \tilde{a}_{k,M})$. Then $\tilde{a}_k$ and $s'_k$ are input into $Q_{\theta'_1}$ and $Q_{\theta'_2}$, respectively, to obtain two temporal-difference targets $\hat{y}_{k,1}$ and $\hat{y}_{k,2}$:
$$
\hat{y}_{k,h} = r_k + \gamma Q_{\theta'_h}(s'_k, \tilde{a}_k),\quad h\in\{1,2\}. \tag{16}
$$
Let $\hat{y}_k = \min(\hat{y}_{k,1}, \hat{y}_{k,2})$, which serves as the target value of the critic networks. The mean square error (MSE) is used to establish the loss functions of critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$, expressed as
$$
\mathrm{loss}_h = \frac{1}{B} \sum_{k\in\mathcal{B}} \left( \hat{y}_k - Q_{\theta_h}(s_k, a_k) \right)^2,\quad h\in\{1,2\}. \tag{17}
$$
Then the Nadam optimizer updates $Q_{\theta_1}$ and $Q_{\theta_2}$ using the gradients of $\mathrm{loss}_1$ and $\mathrm{loss}_2$ with respect to $\theta_1$ and $\theta_2$, respectively (see step 2 in Fig. 3). After the critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ have been updated $c$ times, the agent updates the policy network $\pi_\phi$ once to ensure model training stability (see step 3 in Fig. 3).
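The clipped double-Q target in (16) and the critic losses in (17) map directly to a few lines of PyTorch. The sketch below is illustrative only: it assumes critics that take the concatenated state and flattened one-hot subtask actions as input, and it uses a random stand-in for the target policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def critic_losses(q1, q2, q1_t, q2_t, target_policy, batch, gamma=0.8, tau=0.5):
    """Clipped double-Q targets (16) and the two MSE critic losses (17) for one batch."""
    s, a, r, s_next = batch                               # a: flattened one-hot actions
    with torch.no_grad():
        probs_next = target_policy(s_next)                # (B, M, N) target subpolicies
        a_next = F.gumbel_softmax(torch.log(probs_next + 1e-20), tau=tau).flatten(1)
        sa_next = torch.cat([s_next, a_next], dim=1)
        y1 = r + gamma * q1_t(sa_next).squeeze(-1)        # eq. (16), h = 1
        y2 = r + gamma * q2_t(sa_next).squeeze(-1)        # eq. (16), h = 2
        y = torch.min(y1, y2)                             # clipped target value
    sa = torch.cat([s, a], dim=1)
    mse = nn.MSELoss()
    return mse(q1(sa).squeeze(-1), y), mse(q2(sa).squeeze(-1), y)   # eq. (17)

# toy usage: state dim 136, flattened action dim 8*16 = 128, critic input 264
make_critic = lambda: nn.Sequential(nn.Linear(264, 64), nn.PReLU(), nn.Linear(64, 1))
q1, q2, q1_t, q2_t = make_critic(), make_critic(), make_critic(), make_critic()
target_policy = lambda s: torch.softmax(torch.randn(s.shape[0], 8, 16), dim=-1)
batch = (torch.randn(4, 136), torch.randn(4, 128), torch.randn(4), torch.randn(4, 136))
loss1, loss2 = critic_losses(q1, q2, q1_t, q2_t, target_policy, batch)
print(loss1.item(), loss2.item())
```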
According to the deterministic policy gradient (DPG) theorem [32], the policy gradient of the expected return in (10) is expressed as
$$
\begin{aligned}
\nabla_\phi J(\phi) &= \mathbb{E}_{s\sim\rho^{\pi_\phi},\, a\sim\pi_\phi(\cdot|s)}\left[ \nabla_\phi Q^{\pi_\phi}(s, a) \right]\\
&\approx \frac{1}{B} \sum_{k\in\mathcal{B}} \nabla_\phi Q^{\pi_\phi}(s_k, a_k)\\
&= \frac{1}{B} \sum_{k\in\mathcal{B}} \nabla_{a_k} Q^{\pi_\phi}(s_k, a_k)\, \nabla_\phi G\big(\pi_\phi(\cdot|s_k)\big)\\
&= \frac{1}{B} \sum_{k\in\mathcal{B}} \nabla_{a_k} Q^{\pi_\phi}(s_k, a_k)\, \nabla_p G(p)\, \nabla_\phi \pi_\phi(\cdot|s_k)
\end{aligned} \tag{18}
$$
where $\rho^{\pi_\phi}$ denotes the discounted state distribution [32] and $Q^{\pi_\phi}$ represents the state-action value function under policy $\pi_\phi$. In (18), $G(\pi_\phi(\cdot|s_k))$ approximates a one-hot vector, which conforms to the discrete action form. To compute $\nabla_\phi J(\phi)$, TD3 uses $Q_{\theta_1}$ instead of $Q^{\pi_\phi}$ to ensure differentiability with respect to $a_k$. The value function $Q^{\pi_\phi}$ can be estimated by either value network $Q_{\theta_1}$ or $Q_{\theta_2}$. Since the two networks are equivalent, their mean is used to approximate the policy gradient. Accordingly, $Q^{\pi_\phi}(s_k, a_k)$ is approximated as
$$
Q^{\pi_\phi}(s_k, a_k) = \frac{1}{2}\left( Q_{\theta_1}(s_k, a_k) + Q_{\theta_2}(s_k, a_k) \right) \tag{19}
$$
and $\nabla_\phi J(\phi)$ is used by the Nadam optimizer to update the policy network $\pi_\phi$.
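In implementation terms, (18)-(19) reduce to evaluating the mean of the two critics at the Gumbel-Softmax relaxed actions $G(\pi_\phi(\cdot|s))$ and ascending its gradient with the Nadam optimizer. The sketch below is a hedged illustration; the network shapes and names are placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def actor_loss(policy, q1, q2, s, tau=0.5):
    """Negative surrogate of (18)-(19): maximize the mean critic value at the
    relaxed actions G(pi_phi(.|s)), so gradients reach the policy through G."""
    probs = policy(s)                                            # (B, M, N) subpolicies
    a_relaxed = F.gumbel_softmax(torch.log(probs + 1e-20), tau=tau).flatten(1)
    sa = torch.cat([s, a_relaxed], dim=1)
    q_mean = 0.5 * (q1(sa) + q2(sa))                             # eq. (19)
    return -q_mean.mean()                                        # minimizing this ascends J(phi)

# toy usage with placeholder modules
policy = nn.Sequential(nn.Linear(136, 8 * 16), nn.Unflatten(1, (8, 16)), nn.Softmax(dim=-1))
make_critic = lambda: nn.Sequential(nn.Linear(264, 64), nn.PReLU(), nn.Linear(64, 1))
q1, q2 = make_critic(), make_critic()
opt = torch.optim.NAdam(policy.parameters(), lr=0.01)            # Nadam, as in the paper
loss = actor_loss(policy, q1, q2, torch.randn(4, 136))
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```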
Algorithm 1: TDDS-Based Parallel Subtask Allocation
Input: Sampling batch $B$, soft update factor $\beta$, discount factor $\gamma$, policy network update period $c$, maximum training rounds max_epochs
Output: Policy network $\pi_\phi$ for task allocation
1: Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ with parameters $\theta_1$, $\theta_2$ and the policy network $\pi_\phi$ with parameters $\phi$; assign the target networks as $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, $\phi' \leftarrow \phi$;
2: for epoch = 1 to max_epochs do
3:   Get the initial system state $s_1$;
4:   $i \leftarrow 1$;
5:   while $s_i$ is not the terminal state do
6:     Select action $a_i \sim \pi_\phi(\cdot|s_i)$ according to the current policy $\pi_\phi$;
7:     Execute action $a_i$, calculate reward $r_i$; the system transitions to the next state $s_{i+1}$;
8:     Put $(s_i, a_i, r_i, s_{i+1})$ into the experience replay pool;
9:     $i \leftarrow i + 1$;
10:  Sample $B$ tuples from the experience replay pool;
11:  Calculate the temporal-difference targets $\hat{y}_{k,1}$ and $\hat{y}_{k,2}$ according to (16);
12:  $\hat{y}_k \leftarrow \min(\hat{y}_{k,1}, \hat{y}_{k,2})$;
13:  Calculate $\mathrm{loss}_1$ and $\mathrm{loss}_2$ according to (17);
14:  Update the critic networks;
15:  if epoch mod $c$ = 0 then
16:    Calculate $\nabla_\phi J(\phi)$ according to (18);
17:    Update the policy network;
18:    Update the target networks according to (20);
19: return $\pi_\phi$
After updating $\pi_\phi$, the parameters of the target critic networks $Q_{\theta'_1}$, $Q_{\theta'_2}$ and the target policy network $\pi_{\phi'}$ are updated by (see step 4 in Fig. 3)
$$
\theta'_h \leftarrow (1-\beta)\theta'_h + \beta\theta_h,\quad h\in\{1,2\}, \qquad
\phi' \leftarrow (1-\beta)\phi' + \beta\phi \tag{20}
$$
where $\beta \ll 1$ is the soft update factor.
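For completeness, the Polyak soft update in (20) is sketched below as a small helper (β = 0.005 as in Table II); the function name is our own.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def soft_update(target_net, online_net, beta=0.005):
    """theta' <- (1 - beta) * theta' + beta * theta, as in (20)."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - beta).add_(beta * p)

net, net_target = nn.Linear(4, 2), nn.Linear(4, 2)
net_target.load_state_dict(net.state_dict())    # targets start as copies, as in Algorithm 1
soft_update(net_target, net)
```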
The execution of TDDS-based parallel subtask allocation is summarized in Algorithm 1. Initially, the parameters of the critic networks and policy network are randomly initialized and assigned to the corresponding target networks (line 1). The agent then periodically interacts with the environment, collecting a variety of samples to populate the experience replay pool (lines 3–9), and updates the critic networks by minimizing the loss functions in each time window (lines 10–14). Whenever the critic networks have been updated $c$ times, the policy network is updated using the policy gradient (lines 15–17), and the target networks are updated simultaneously (line 18). This process is repeated until TDDS training converges.
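Algorithm 1 assumes a large experience replay pool (capacity 100 000 in Section IV) from which batches of B tuples are drawn uniformly at random. A minimal ring-buffer sketch, with illustrative names, is given below.

```python
import random
from collections import deque

class ReplayPool:
    """Fixed-capacity experience replay pool of (s, a, r, s_next) tuples."""
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)   # oldest samples are evicted automatically

    def put(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size=4096):
        batch = random.sample(self.pool, batch_size)
        return list(zip(*batch))             # (states, actions, rewards, next_states)

pool = ReplayPool()
for k in range(5000):
    pool.put(s=k, a=k % 16, r=-1.0, s_next=k + 1)   # toy transitions
states, actions, rewards, next_states = pool.sample(batch_size=64)
print(len(states))
```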
IV. PERFORMANCE EVALUATION
PyTorch was used as the deep learning framework to implement the proposed solution. The crowdsensing area was set to 2 km × 2 km. The inter-arrival time of tasks (in minutes) follows an exponential distribution with rate parameter $\lambda$. The platform divided each task into one to eight subtasks, each with a size ranging from 0.5 to 1 GB. The number of workers varied between 8 and 16. The sensing rate (GB/min), transmission rate (GB/min), and movement rate (km/min) varied in the ranges [0.1, 0.2], [0.09, 0.21], and [0.3, 0.6], respectively. The experience replay pool capacity was set to 100 000. The output dimension $D$ of the policy network's linear layers was set to 1000. The critic networks used a 264 × 1000 × 1 MLP with PReLU as the activation function. Other simulation parameters are given in Table II.

TABLE II
EXPERIMENTAL PARAMETERS

Parameter: Value
Discount factor ($\gamma$): 0.8
Soft update factor ($\beta$): 0.005
Temperature coefficient ($\tau$): 0.5
Objective function weight parameter ($\zeta$): 0.75
Reward function parameters ($\sigma_1$, $\sigma_2$): 56, 29
Policy network update cycle ($c$): 8
For comprehensive comparison and verification, three baseline approaches were selected and designed as follows.
1) Random assignment (RA) [33]: The agent randomly assigns one worker to each subtask.
2) PPO [34]: The agent uses PPO to perform task partitioning and allocation, where the policy network relies on an MLP to generate parallel subpolicies.
3) Independent deep Q-network (IDQN) [35]: Multiple agents perform parallel subtask allocation, where each agent uses one Q-network to allocate one subtask.
A. Convergence Analysis
The first experiments evaluated the convergence of TDDS under different learning rates by calculating the average cumulative reward of the policy network over multiple time windows. Sampling larger batches of tuples from the experience replay pool helps ensure learning stability [36], so the sampling batch was first set to $B = 4096$.

The Nadam optimizers for the critic and policy networks use the same learning rate, denoted as $\eta$. As shown in Fig. 5(a), when $\eta = 0.025$, the convergence curve of TDDS showed large oscillations, and the policy network converged to stability only after 340 updates; when $\eta = 0.001$, the average cumulative reward converged slowly to 73. This shows that an overly large or small $\eta$ hampered the convergence of TDDS. When $\eta$ was set to 0.01 or 0.005, TDDS balanced convergence speed and stability, and the average cumulative reward stabilized at around 77 after 300 updates. Thus, the subsequent experiments all took $\eta = 0.01$.
Fig. 5(b) tests the effect of $B$ on the convergence of TDDS when $\eta = 0.01$. In the case of $B = 512$, small-batch sampling made the gradient estimation inaccurate, which led to slow convergence, and the convergence value was only around 64. Increasing $B$ to 1024 and 2048 accelerated convergence, but the convergence curve fluctuated slightly after reaching stability. The convergence curve for $B = 4096$ was close to that for $B = 8192$ after 170 updates. This reflects that increasing the sampling batch beyond a certain size does not necessarily improve the convergence of TDDS and may even increase the model training cost. Hence, the subsequent experiments all took $B = 4096$.
To evaluate the convergence of TDDS in detail, we tracked multiple variables during training. Because the losses of the two critic networks were almost the same, only $\mathrm{loss}_1$ is given. Fig. 6(a) demonstrates that the $\mathrm{loss}_1$ curve converges steadily, showing that the critic networks can accurately predict the expected return after training and providing a solid foundation for updating the policy network. In Fig. 6(b), the expected return curve rises steadily, indicating that the policy network was gradually optimized. In Fig. 6(c), the policy entropy [37] gradually decreased from 2.5, reflecting that the agent's exploration gradually decreased and the policy tended to become stable.
B. Adaptability Analysis
Multiple indicators are extracted to evaluate the allocation driven by TDDS. We took $T = 30\,000$ time windows for all subsequent evaluations to assess the applicability to different environments. Fig. 7(a) shows that the expected task completion delay $\mathbb{E}(d_{t,i})$ of TDDS was 44%, 65%, and 70% of that of RA, PPO, and IDQN, respectively. Let $z^{\mathrm{tra}}_{t,i,n}$ be the value of $z^{\mathrm{tra}}_{i,n}$ in time window $t$. Fig. 7(b) and (c) show that the expected sensing and transmission queue lengths $\mathbb{E}(z^{\mathrm{sen}}_{t,i,n})$ and $\mathbb{E}(z^{\mathrm{tra}}_{t,i,n})$ were both smaller than those of the other three algorithms, indicating that TDDS effectively reduced the queuing cost in executing subtasks. At the same time, in Fig. 7(d), the expected movement distance to complete a task under TDDS was 0.16 km, while those of RA, PPO, and IDQN were all larger than 0.37 km. This means that TDDS can achieve an optimal matching according to the spatial information of subtasks and workers and has stronger environmental adaptability. In Fig. 7(e), the expected balance index $\mathbb{E}(\phi_{t,i})$ of TDDS was much smaller than that of RA and was 72% and 81% of PPO and IDQN, respectively, so TDDS provides a more balanced scheduling strategy. The above results show that the five indicators yielded consistent outcomes. Accordingly, we used the objective value of P1 as a simple and effective criterion to evaluate the following simulations.
C. Impact of Task Attributes
When evaluating the average impact of a certain quantity on the objective value under different environments, we fix this quantity in all test environments and keep the rest of the variables following their original distributions.

This group of experiments first considered the impact of task arrival intensity in each time window on the objective value. The task inter-arrival time followed an exponential distribution with rate $\lambda$, so the higher the $\lambda$, the higher the task arrival rate, the more subtasks accumulated at the workers, and the more rapidly the objective value rose, as shown in Fig. 8(a). When $\lambda = 1$, IDQN, PPO, and RA found it difficult to cope with the densely arriving tasks, and their objective values were 138, 127, and 151, respectively, while that of TDDS was only 94. In addition, when $\lambda = 1/5$, the task inter-arrival time was longer, so the workers had relatively sufficient time to complete each subtask, and the objective values of the four algorithms were all small.

Fig. 5. Convergence curves of TDDS. (a) Convergence curves with varying learning rate. (b) Convergence curves with varying batch size.

Fig. 6. Variation of $\mathrm{loss}_1$, $J(\phi)$, and policy entropy. (a) Variation of $\mathrm{loss}_1$. (b) Variation of $J(\phi)$. (c) Variation of policy entropy.

Fig. 7. Comprehensive analysis of task allocation. (a) Comparison of $\mathbb{E}(d_{t,i})$. (b) Comparison of $\mathbb{E}(z^{\mathrm{sen}}_{t,i,n})$. (c) Comparison of $\mathbb{E}(z^{\mathrm{tra}}_{t,i,n})$. (d) Comparison of moving distance. (e) Comparison of $\mathbb{E}(\phi_{t,i})$.
Next, the impact of the number of tasks $I_t$ in each time window on the objective value was considered. When the number of tasks in a time window increased, the cumulative effect caused unprocessed subtasks to accumulate continuously, affecting the completion delay of subsequently arriving tasks. For $\lambda = 2/7$, Fig. 8(b) shows the growth trend of the objective value of the four algorithms under varying task numbers, where TDDS still achieved the best allocation, with a delay growth rate of 36%. When $I_t = 500$, the objective values of RA, IDQN, and PPO were 132, 68, and 65, respectively, while that of TDDS was 33. Facing a long-term high demand from task requesters, TDDS provided the highest service quality.

Fig. 8. Impact of task properties on objective value. (a) Impact of $\lambda$. (b) Impact of $I_t$. (c) Impact of $M_{t,i}$. (d) Impact of $o_{i,m}$.

Fig. 9. Impact of worker properties on objective value. (a) Impact of $N_t$. (b) Impact of $v^{\mathrm{sen}}_n$. (c) Impact of $v^{\mathrm{mov}}_n$. (d) Impact of $v^{\mathrm{tra}}_n$.
Let $M_{t,i}$ be the number of subtasks of task $i$ in time window $t$. As shown in Fig. 8(c), as the number of subtasks $M_{t,i}$ into which each task is divided gradually increased, the objective values of all algorithms showed an upward trend, but TDDS rose the slowest, with a maximum objective value of 104. When $M_{t,i} \in [1,2]$, the low-difficulty allocation made the task completion of all algorithms similar. However, the objective value of TDDS increased by only 86 over $M_{t,i} \in [3,8]$, while those of the other three algorithms all increased by more than 155. Among them, IDQN was most affected by the variation in the number of subtasks, with its objective value increasing by 180. This is because, in IDQN, each agent tends to assign subtasks to workers with strong abilities, which may cause most of the subtasks to be assigned to the same worker, thus delaying the completion of the entire task. Especially when $M_{t,i}$ was large, the objective value of IDQN was close to that of RA.

From Fig. 8(d), the objective value is positively correlated with subtask size. Randomly assigning subtasks increases the difficulty of low-ability workers in handling complex tasks, so RA's objective value was much higher than those of PPO, IDQN, and TDDS. PPO and IDQN output subtask allocation strategies that could usually match subtasks and workers well, so their objective values were significantly lower than RA's. TDDS's attention aggregation layer further enhanced the matching degree between subtasks and workers, and its objective value was 43%–50%, 67%–74%, and 70%–77% of that of RA, PPO, and IDQN, respectively, when $o_{i,m} \in [0.5, 1]$.
D. Impact of Worker Attributes
As shown in Fig. 9(a), a gradually increasing number of workers can share more subtasks, so the objective values of the four algorithms all dropped rapidly. However, TDDS achieved the lowest objective value through a more optimal allocation strategy, which was about 26–67 and 5–18 lower than those of the other three algorithms when there were 8 and 16 workers, respectively. This shows that TDDS has obvious advantages under different numbers of workers.

The sensing rate $v^{\mathrm{sen}}_n$ is limited by the ability of the sensing devices carried by workers. For example, sensing devices with high-definition cameras and GPU chips can sense high-quality data faster. Fig. 9(b) shows that a faster sensing rate accelerated task completion for all four algorithms. At the same sensing rate, the objective value of TDDS was much smaller than those of the other three algorithms. For example, when $v^{\mathrm{sen}}_n = 0.1$, the objective values of IDQN, PPO, and RA were 84, 82, and 110, respectively, while that of TDDS was only 58. On the other hand, the speed of worker movement also affected the allocation.

In Fig. 9(c), faster-moving workers reach the subtask locations earlier, which helps reduce the objective value. TDDS's curve was relatively flat, unlike the fluctuating curves of the other three algorithms. When $v^{\mathrm{mov}}_n$ increased from 0.3 to 0.6, the change in the objective value of TDDS was 6, while those of IDQN, PPO, and RA were 20, 19, and 28, respectively. This is because TDDS keeps the average movement distance of workers short, which reduces the movement delay.

Workers can transmit data while sensing and moving, so data transmission does not affect the sensing of the next subtask.
Fig. 10. Convergence curves of different models.
In Fig. 9(d), the objective values of the four algorithms did not change much as the transmission rate increased when $v^{\mathrm{tra}}_n \geq 0.14$, and TDDS still had the lowest objective value, about 62%–69% of those of IDQN and PPO. Fig. 9 shows that TDDS adapts to changes in worker attributes and continuously outputs better online allocation strategies than the other three algorithms.
E. Ablation Experiments
To verify the effectiveness of TDDS, two task allocation
models were set up for ablation experiments.
1) TDDS-SF: The policy network uses the score function (SF) estimator to calculate the gradient instead of Gumbel-Softmax sampling. In this case, the policy gradient changes according to (21).
2) TDDS-MLP: The policy network does not use the attention mechanism but relies on an MLP to output the eight subpolicies, with the structure 136 × 500 × 500 × 128.
In Fig. 10, the average cumulative rewards of TDDS-SF and TDDS-MLP after convergence were 62 and 58, respectively, lower than the 77 of TDDS. At the same time, the MLP of TDDS-MLP could not easily capture the correlation between subtasks and workers, which made its curve fluctuate greatly, so its stability needs improvement.

Fig. 11 plots the probability density curves of the objective value for the three models over 30 000 time windows; the objective values of TDDS-SF, TDDS-MLP, and TDDS concentrated around 34, 36, and 26, respectively, with TDDS having the narrowest curve. The mean and variance of each curve demonstrate that TDDS performs the best allocation.
Fig. 11. Probability density of objective values.
To further verify the advantages of TDDS, we compared the task allocation performance of the three models under different environmental states. As shown in Fig. 12(a), the objective value ranges of the three models were similar (between 11 and 24) when $M_{t,i} \in [1,3]$. However, when the number of subtasks increased to 8, the objective value of TDDS-MLP rose to 155, which was 37 and 62 higher than those of TDDS-SF and TDDS, respectively. The effect of $N_t$ on the objective values of the three models is illustrated in Fig. 12(b). As $N_t$ increased, the objective values of the three models decreased rapidly. However, TDDS outperformed TDDS-SF and TDDS-MLP in all cases. When $N_t \in [8,11]$, TDDS-SF had a lower objective value than TDDS-MLP, but still 10–19 higher than TDDS. When $N_t \in [12,16]$, the objective values of TDDS-MLP and TDDS-SF were similar (about 26–35) but 5–9 higher than that of TDDS. Fig. 12(c) shows the impact of $I_t$ on the objective values. With the increase of $I_t$, the objective values of the three models also increased rapidly. However, TDDS had a lower objective value than TDDS-SF and TDDS-MLP in all scenarios. When $I_t = 500$, TDDS had a 36% and 48% lower objective value than TDDS-SF and TDDS-MLP, respectively.
The results in Fig. 12 demonstrate that using the score function estimator instead of Gumbel-Softmax sampling leads to inaccurate estimation of the policy network gradient, and using an MLP instead of the attention mechanism fails to capture the correlation between subtasks and workers. As a result, the policy networks of TDDS-SF and TDDS-MLP cannot optimally match subtasks and workers, which results in significantly higher objective values than TDDS under different environmental states.
$$
\nabla_\phi J(\phi) \approx \frac{1}{B} \sum_{k\in\mathcal{B}} Q^{\pi_\phi}\big(s_k, (\tilde{a}_{k,1}, \ldots, \tilde{a}_{k,M})\big)\, \nabla_\phi \log \prod_{m=1}^{M} \pi_{\phi,m}(\cdot|s_k). \tag{21}
$$

Fig. 12. Impact of worker and task properties on objective value. (a) Impact of $M_{t,i}$. (b) Impact of $N_t$. (c) Impact of $I_t$.

The proposed task allocation model uses Gumbel-Softmax sampling and the attention mechanism to help the policy network generate allocation strategies, with which the MCS system can meet task needs in a more timely manner while balancing the load for workers.
V. CONCLUSION
We have presented a TDDS-based approach for continuous parallel subtask assignment in MCS. The policy network in TDDS uses shared linear layers to reduce the number of network parameters and introduces a masked attention mechanism to match the dynamically changing numbers of subtasks and workers. Considering that off-policy DRL has high sample utilization and good generalization, we introduce Gumbel-Softmax sampling so that the off-policy TD3 algorithm can be applied to discrete action spaces, and the feasibility of the proposed algorithm is demonstrated through convergence analysis. Compared with mainstream DRL baseline algorithms, TDDS shortens the task completion delay by 30%–56% while balancing the load and reducing workers' movement distance. Regarding adaptation to the dynamics of tasks and workers, TDDS performs more stably and is less affected by environmental fluctuations than the other baseline algorithms. Ablation studies verify the effectiveness of masked attention and Gumbel-Softmax sampling in TDDS.

When tasks arrive intensively, the allocation of previous tasks significantly impacts that of subsequent tasks, and the proposed approach may not achieve global optimality. If an offline method is integrated into online assignment, tasks can be assigned after multiple tasks have been received, and allocation efficiency can be improved by controlling when task assignment is performed; this is our follow-up research direction.
REFERENCES
[1] X. Cheng, B. He, G. Li, and B. Cheng, “A survey of crowdsensing and
privacy protection in digital city, IEEE Trans. Comput. Social Syst.,
vol. 10, no. 6, pp. 3471–3487, Dec. 2023.
[2] S. Du and S. Wang, “An overview of correlation-filter-based object
tracking,” IEEE Trans. Comput. Social Syst., vol. 9, no. 1, pp. 18–31,
Feb. 2022.
[3] I. Koukoutsidis, “Estimating spatial averages of environmental parame-
ters based on mobile crowdsensing, ACM Trans. Sensor Netw., vol. 14,
no. 1, pp. 1–26, 2017.
[4] Y. Gu, H. Shen, G. Bai, T. Wang, and X. Liu, “QOL-aware incentive for
multimedia crowdsensing enabled learning system,” Multimedia Syst.,
vol. 26, pp. 3–16, Feb. 2020.
[5] L. Zhang, Y. Ding, X. Wang, and L. Guo, “Conflict-aware participant
recruitment for mobile crowdsensing, IEEE Trans. Comput. Social Syst.,
vol. 7, no. 1, pp. 192–204, Feb. 2020.
[6] H. Shen, G. Bai, Y. Hu, and T. Wang, “P2TA: Privacy-preserving task
allocation for edge computing enhanced mobile crowdsensing, J. Syst.
Archit., vol. 97, pp. 130–141, 2019.
[7] Y. Tong, Z. Zhou, Y. Zeng, L. Chen, and C. Shahabi, “Spatial crowd-
sourcing: A survey,” VLDB J., vol. 29, pp. 217–250, Jan. 2020.
[8] T. Song, K. Xu, J. Li, Y. Li, and Y. Tong, “Multi-skill aware task
assignment in real-time spatial crowdsourcing, GeoInformatica, vol. 24,
pp. 153–173, Jan. 2020.
[9] H. Schmitz and I. Lykourentzou, “Online sequencing of non-
decomposable macrotasks in expert crowdsourcing, ACM Trans. Social
Comput., vol. 1, no. 1, pp. 1–33, 2018.
[10] Y. Xu, Y. Wang, J. Ma, and Q. Jin, “PSARE: A RL-based online par-
ticipant selection scheme incorporating area coverage ratio and degree
in mobile crowdsensing, IEEE Trans. Veh. Technol., vol. 71, no. 10,
pp. 10923–10933, Oct. 2022.
[11] C. Xu and W. Song, “Decentralized task assignment for mobile crowd-
sensing with multi-agent deep reinforcement learning,” IEEE Internet
Things J., vol. 10, no. 18, pp. 16564–16578, Sep. 2023.
[12] W. Ding, Z. Ming, G. Wang, and Y. Yan, “System-of-systems approach
to spatio-temporal crowdsourcing design using improved PPO algorithm
based on an invalid action masking, Knowl. Based Syst., vol. 285, 2024,
Art. no. 111381.
[13] S. Xie, X. Wang, B. Yang, M. Long, J. Zhang, and L. Wang, “A multi-
stage framework for complex task decomposition in knowledge-intensive
crowdsourcing, in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manage.
(IEEM), 2021, pp. 1432–1436.
[14] Z. Liu and Z. Zhao, “Multiattribute E-CARGO task assignment model
based on adaptive heterogeneous residual networks, IEEE Trans. Com-
put. Social Syst., early access, doi: 10.1109/TCSS.2023.3344173.
[15] Y. Sun, J. Wang, and W. Tan, “Dynamic worker-and-task assignment
on uncertain spatial crowdsourcing, in Proc. IEEE Int. Conf. Comput.
Supported Cooperative Work Des. (CSCWD), 2018, pp. 755–760.
[16] C. H. Liu, Z. Dai, H. Yang, and J. Tang, “Multi-task-oriented vehicular
crowdsensing: A deep learning approach,” in Proc. IEEE Conf. Comput.
Commun. (INFOCOM), 2020, pp. 1123–1132.
[17] H. Shen, Y. Tian, T. Wang, and G. Bai, “Slicing-based task offloading
in space-air-ground integrated vehicular networks, IEEE Trans. Mobile
Comput., early access, doi: 10.1109/TMC.2023.3283852.
[18] J. Han, Z. Zhang, and X. Wu, “A real-world-oriented multi-task allo-
cation approach based on multi-agent reinforcement learning in mobile
crowd sensing, Information, vol. 11, no. 2, 2020, Art. no. 101.
[19] Q. Qi et al., “Scalable parallel task scheduling for autonomous driving
using multi-task deep reinforcement learning,” IEEE Trans. Veh. Tech-
nol., vol. 69, no. 11, pp. 13861–13874, Nov. 2020.
[20] B. Zhao, H. Dong, and D. Yang, “A spatio-temporal task allocation
model in mobile crowdsensing based on knowledge graph, Smart Cities,
vol. 6, no. 4, pp. 1937–1957, 2023.
[21] X. Tao and A. S. Hafid, “DeepSensing: A novel mobile crowdsensing
framework with double deep Q-network and prioritized experience
replay, IEEE Internet Things J., vol. 7, no. 12, pp. 11547–11558,
Dec. 2020.
[22] L. Li, H. Xu, J. Ma, A. Zhou, and J. Liu, “Joint EH time and transmit
power optimization based on DDPG for EH communications,” IEEE
Commun. Lett., vol. 24, no. 9, pp. 2043–2046, Sep. 2020.
[23] S. Fujimoto, H. Hoof, and D. Meger, Addressing function approxima-
tion error in actor-critic methods,” in Proc. Int. Conf. Mach. Learn.,
2018, pp. 1587–1596.
[24] B. Zhao, H. Dong, Y. Wang, and T. Pan, “PPO-TA: Adaptive task
allocation via proximal policy optimization for spatio-temporal crowd-
sourcing,” Knowl. Based Syst., vol. 264, 2023, Art. no. 110330.
[25] P. Zhao, X. Li, S. Gao, and X. Wei, “Cooperative task assignment
in spatial crowdsourcing via multi-agent deep reinforcement learning,”
J. Syst. Archit., vol. 128, 2022, Art. no. 102551.
[26] Y. Ma, Z. Bi, Z. Yin, and A. Chai, “Research and implementation of a
real-time task dynamic scheduling model based on reinforcement learn-
ing,” in Proc. Int. Conf. Intell. Comput. Technol. Automat. (ICICTA),
2020, pp. 717–722.
[27] A. Bjorklund, “Determinant sums for undirected Hamiltonicity,” SIAM
J. Comput., vol. 43, no. 1, pp. 280–299, 2014.
[28] M. Min et al., “Geo-perturbation for task allocation in 3-D mobile
crowdsourcing: An A3C-based approach, IEEE Internet Things J.,
vol. 11, no. 2, pp. 1854–1865, Jan. 2024.
[29] J. Jin and Y. Xu, “Optimal policy characterization enhanced proximal
policy optimization for multitask scheduling in cloud computing,” IEEE
Internet Things J., vol. 9, no. 9, pp. 6418–6433, May 2022.
[30] S. Huang and S. Ontañón, “A closer look at invalid action masking in
policy gradient algorithms,” in Proc. Int. FLAIRS Conf. Proc., vol. 35,
May 2022.
[31] C. J. Maddison, D. Tarlow, and T. Minka, “A* sampling, in Proc. Adv.
Neural Inf. Process. Syst., vol. 27, pp. 3086–3094, 2014.
[32] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
“Deterministic policy gradient algorithms,” in Proc. Int. Conf. Mach.
Learn., 2014, pp. 387–395.
[33] D. Li, J. Zhu, and Y. Cui, “Prediction-based task allocation in mobile
crowdsensing, in Proc. Int. Conf. Mobile Ad-Hoc Sensor Netw. (MSN),
2019, pp. 89–94.
[34] J. Jin and Y. Xu, “Optimal policy characterization enhanced proximal
policy optimization for multitask scheduling in cloud computing,” IEEE
Internet Things J., vol. 9, no. 9, pp. 6418–6433, May 2022.
[35] A. Tampuu et al., “Multiagent cooperation and competition with
deep reinforcement learning,” PLoS One, vol. 12, no. 4, 2017,
Art. no. e0172395.
[36] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep
learning: An in-depth concurrency analysis,” ACM Comput. Surveys,
vol. 52, no. 4, pp. 1–43, 2019.
[37] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
Tianjing Wang (Member, IEEE) received the B.Sc. degree in mathematics from Nanjing Normal University, Nanjing, China, in 2000, the M.Sc. degree in mathematics from Nanjing University, Nanjing, China, in 2002, and the Ph.D. degree in signal and information system from Nanjing University of Posts and Telecommunications (NUPT), Nanjing, China, in 2009.
From 2011 to 2013, she was a Full-Time Postdoctoral Fellow with the School of Electronic Science and Engineering, NUPT. From 2013 to 2014, she was a Visiting Scholar with the Department of Electrical and Computer Engineering at the State University of New York, Stony Brook, NY, USA. She is an Associate Professor with the Department of Communication Engineering, Nanjing Tech University, Nanjing, China. Her research interests include mobile crowdsensing, cellular V2X communication networks, and distributed machine learning for multimedia networking. She has published research papers in prestigious international journals and conferences, including IEEE TRANSACTIONS ON MOBILE COMPUTING, IEEE TRANSACTIONS ON BROADCASTING, Journal of Systems Architecture, Multimedia Systems, Peer-to-Peer Networking and Applications, IEEE ICC, and IEEE ISCC.
Yu Zhang received the B.M. degree in engineering management from Beijing University of Civil Engineering and Architecture, Beijing, China. He is currently working toward the M.S. degree in computer science with Nanjing Tech University, Nanjing, China.
His research interests include combinatorial optimization, deep reinforcement learning, and its applications in crowdsensing.
Hang Shen (Member, IEEE) received the Ph.D. degree (with honors) in computer science from Nanjing University of Science and Technology, in 2015.
He worked as a Full-Time Postdoctoral Fellow with the Broadband Communications Research (BBCR) Lab, ECE Department, University of Waterloo, Waterloo, ON, Canada, from 2018 to 2019. He is an Associate Professor with the Department of Computer Science and Technology, Nanjing Tech University, Nanjing, China. His research interests involve mobile crowdsensing, vehicular networks, cybersecurity, and privacy computing.
Dr. Shen serves as an Associate Editor for Journal of Information Processing Systems and IEEE ACCESS. He was a Guest Editor for Peer-to-Peer Networking and Applications and a TPC member of the 2021 Annual International Conference on Privacy, Security and Trust (PST). He is a Senior Member of CCF and an Executive Committee Member of the ACM Nanjing Chapter.
Guangwei Bai received the B.Eng. and M.Eng. degrees in computer engineering from Xi’an Jiaotong University, Xi’an, China, in 1983 and 1986, respectively, and the Ph.D. degree in computer science from the University of Hamburg, Hamburg, Germany, in 1999.
From 1999 to 2001, he worked as a Research Scientist with the German National Research Center for Information Technology, Germany. In 2001, he joined the University of Calgary, Calgary, AB, Canada, as a Research Associate. Since 2005, he has been working as a Professor in computer science with Nanjing Tech University, Nanjing, China. From October to December 2010, he was a Visiting Professor with the ECE Department at the University of Waterloo, Waterloo, ON, Canada. His research interests include architecture and protocol design for future networks, QoS provisioning, cybersecurity, and privacy computing. He has authored and coauthored more than 70 peer-reviewed papers in international journals and conferences, including IEEE TRANSACTIONS ON MOBILE COMPUTING, IEEE TRANSACTIONS ON BROADCASTING, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, Performance Evaluation, Ad Hoc Networks, Journal of Systems Architecture, Multimedia Systems, Computer Communications, IEEE ICC, and IEEE LCN.
Dr. Bai is an ACM member and a CCF Distinguished Member.