Task Partitioning and Scheduling Based on
Stochastic Policy Gradient in Mobile Crowdsensing
Tianjing Wang, Member, IEEE, Yu Zhang, Hang Shen, Member, IEEE, and Guangwei Bai
Abstract—Deep reinforcement learning (DRL) has become prevalent for task assignment decision-making in mobile crowdsensing (MCS). However, when facing sensing scenarios with varying numbers of workers or task attributes, existing DRL-based task assignment schemes fail to generate matching policies continuously and are susceptible to environmental fluctuations. To overcome these issues, a twin-delayed deep stochastic policy gradient (TDDS) approach is presented for balanced and low-latency MCS task decomposition and parallel subtask allocation. A masked attention mechanism is incorporated into the policy network to enable TDDS to adapt to task-attribute and subtask variations. To enhance environmental adaptability, an off-policy DRL algorithm incorporating experience replay is developed to eliminate sample correlation during training. Gumbel-Softmax sampling is integrated into the twin-delayed deep deterministic policy gradient (TD3) algorithm to support discrete action-space decisions, and a customized reward strategy is designed to reduce task completion delay and balance workloads. Extensive simulation results confirm that the proposed scheme outperforms mainstream DRL baselines in terms of environmental adaptability, task completion delay, and workload balancing.
Index Terms—Attention mechanism, Gumbel-Softmax sampling, mobile crowdsensing (MCS), parallel subtask allocation, task partition.
I. INTRODUCTION
MOBILE crowdsensing (MCS) technology [1] has become prevalent in recent years due to the rapid proliferation of intelligent mobile devices with computing, perception, storage, and communication capabilities. Unlike traditional sensor networks, MCS leverages intelligent devices carried by mobile users as basic sensing units and forms working groups to collaboratively complete large-scale data sensing. MCS has been applied to various fields such as object tracking [2], environmental monitoring [3], and smart cities [4].
An MCS system must recruit a large number of workers
to complete the sensing tasks continuously submitted by the
platform [5],[6]. How task allocation is performed to optimize
the system's sensing performance is crucial. Task allocation can be offline or online [7]. The former follows a predetermined plan for task allocation, which cannot adjust the allocation strategy in real time to adapt to environmental dynamics. The latter can handle task allocation flexibly under dynamic environments. Song et al. [8] established an online multiskill task allocation model and integrated a greedy algorithm to match dynamic tasks with specific-skill workers. Schmitz and Lykourentzou [9] designed an online-optimized greedy algorithm for reliable task allocation when the budget and quality cycle change. However, these works focused on allocation for simple tasks. The emergence of various MCS applications such as ride-hailing (e.g., DiDi¹), on-demand delivery (e.g., Ele.me²), and live maps (e.g., Waze³) has made it inevitable for MCS systems to support continuous allocation of complex tasks.
Unlike simple tasks that a worker can complete independently, a complex task requires division into multiple subtasks. These subtasks must then be allocated to several workers for collaborative completion. In this scenario, any delay in one subtask can potentially affect the timely completion of the entire task. Researchers have explored reinforcement learning (RL) and deep RL (DRL) approaches to decide on multitask allocation. Xu et al. [10] utilized RL to optimize the allocation strategy of parallel subtasks. However, RL-based methods face dimension explosion as the number of parallel subtasks increases. DRL-based schemes can handle high-dimensional action spaces for parallel subtask allocation. Xu and Song [11] designed a multiagent DRL algorithm to train a local model for each worker and obtain parallel allocation actions through multiagent cooperation. Ding et al. [12] improved proximal policy optimization (PPO), solving the unreasonable action-matching problem caused by spatiotemporal complexity by dynamically matching tasks, workers, and workplaces. However, most existing DRL-based task allocation schemes directly or indirectly assume that the number of tasks or workers is static within a period, reducing their usability.
A. Challenging Issues and Related Works
For DRL-based MCS task allocation in a time-varying environment, many challenges remain.
¹https://www.didiglobal.com/
²https://www.ele.me
³https://www.waze.com
1) Dynamics of Parallel Subtasks and Workers: The number of parallel subtasks divided from a complex task is not constant due to the tasks' heterogeneity. Xie et al. [13] presented a multistage complex task decomposition framework that dynamically divides tasks according to knowledge-intensive types and assigns subtasks to suitable service providers. Liu and Zhao [14] developed a multiattribute E-CARGO task assignment model based on adaptive heterogeneous residual networks, considering heterogeneous workers and tasks. However, these solutions assume the number of workers is fixed to reduce model complexity. Sun et al. [15] pointed out that online task allocation in MCS is dynamic and uncertain. They designed a spatial-perception multiagent Q-learning algorithm for dynamic spatial task allocation. Liu et al. [16] proposed a distributed execution framework based on DRL to provide reliable and accurate sensing services when the number of tasks and vehicles changes. In [17], a DRL-based algorithm was designed for scheduling workflows on small time scales for task offloading in space-air-ground integrated vehicular networks. However, these methods deal with simple tasks rather than parallel subtasks. Designing a DRL model for dynamic task division and worker selection is challenging.
2) Diversity of Task Allocation Environments: Differentiated tasks require MCS systems to provide differentiated system services. Several studies have proposed DRL-based multitask allocation to address the challenges of multitask concurrency, task and worker heterogeneity, and participant preference changes. Han et al. [18] proposed a multiagent DRL-based multitask allocation scheme to provide differentiated sensing responses. Considering the complicated and dynamic environment of vehicular computing, Qi et al. presented a DRL-based parallel task scheduling approach [19], where the output branches of multitask learning are fine-matched to parallel scheduling. Zhao et al. [20] designed a similarity function on the task transfer graph to promote the allocation of personalized multitasks. The advantage of these on-policy algorithms is that parallel subpolicies can output personalized allocation decisions, but the samples they collect are strongly correlated, leading to weak model generalization and difficulty in guaranteeing service quality. Unlike on-policy algorithms that periodically abandon samples collected from interaction with the environment, off-policy algorithms maintain an experience replay pool that stores diversified samples to enhance the model's adaptability to differentiated tasks. A double deep Q-network with a priority experience replay pool is studied in [21], planning a travel path that meets the requirements of each mobile user. Existing deep deterministic policy gradient (DDPG) [22] and twin-delayed deep deterministic policy gradient (TD3) [23] methods can also solve task scheduling, but they operate in continuous action spaces; random sampling is not differentiable in discrete action spaces, so the model cannot be trained using backpropagation. Therefore, exploring an off-policy algorithm for discrete action spaces is necessary to handle environment diversity and ensure service stability.
3) Long-Term Balanced Scheduling in Continuous Task Allocation: Load balancing is important for achieving long-term optimized task scheduling. Several studies have proposed DRL-based task allocation considering the long-term utility of workers and requesters. Zhao et al. [24] proposed a discrete threshold task allocation algorithm based on policy gradient that accounts for long-term utility, significantly improving the utility of long-term continuous task allocation. In [25], a multiagent DRL solution was proposed to generate a multitask allocation strategy that considers the long-term interests of workers and requesters. This scheme designs a reward function that combines local and global returns to balance short-term and long-term benefits and achieve a long-term equilibrium task completion rate. Ma et al. [26] proposed a real-time task dynamic scheduling model based on centralized learning, which makes more accurate continuous task scheduling decisions by analyzing the processor load of workers. Its load balance is better than that of random task allocation methods, improving CPU utilization and service quality. However, these methods do not incorporate balance indicators into the DRL reward function, making it difficult for the model to learn scheduling experience that satisfies long-term load balancing.
B. Contributions and Organization
In response to the above issues, we propose a twin-delayed deep stochastic policy gradient (TDDS) approach for long-term balanced and low-latency task allocation via dynamic partitioning and scheduling. The main contributions include the following.
1) We construct a scalable policy network consisting of two shared linear layers, which extract state features of subtasks and workers, along with a masked attention mechanism to match subtasks and workers. This network can independently infer an optimal subpolicy for each subtask, with enhanced robustness of task allocation.
2) An off-policy algorithm based on TD3 is designed, which uses Gumbel-Softmax sampling to enable TD3 to output allocation decisions for parallel subtasks in discrete action spaces. The rich samples in the experience replay pool enhance model generalization to adapt to heterogeneous MCS environments.
3) We develop an appropriate reward function considering completion delay and load balancing. This encourages the model to learn from allocation experiences that optimize both indicators simultaneously, ensuring the long-term stability of task scheduling. Simulation results demonstrate that the proposed approach outperforms typical DRL-based baselines in task completion delay and environmental adaptability.
The rest of this article is organized as follows. Section II presents the system model for MCS task partitioning and parallel subtask allocation. Section III proposes a parallel subtask allocation scheme based on TDDS. Section IV analyzes the evaluation results under simulation experiments. Finally, we summarize the research work in Section V. The main notations and variables are listed in Table I.
II. SYSTEM MODEL
This section begins with an overview of MCS task partitioning and continuous subtask assignment. Then, the task partitioning and assignment are transformed into a long-term optimization problem.
TABLE I
MAIN NOTATIONS AND VARIABLES

Symbols: Definition
$a_{t,i,m,n}$: Allocation strategy for task $i$ in time window $t$
$\mathcal{I}_t$: Set of tasks in time window $t$
$L_{i,n}$: Set of incomplete subtasks for worker $n$ upon receiving $\mathcal{U}_{i,n}$
$M$: Maximum number of task subdivisions
$\mathcal{M}_i$: Set of subtasks in state $i$
$\mathcal{M}_{t,i}/M_{t,i}$: Set/number of subtasks for task $i$ in time window $t$
$N$: Maximum number of workers
$\mathcal{N}_i$: Set of workers in state $i$
$\mathcal{N}_t/N_t$: Set/number of workers in time window $t$
$\mathcal{T}/T$: Set/number of time windows
$\mathcal{U}_{i,n}/U_{i,n}$: Set/number of task $i$'s subtasks allocated to worker $n$
$u_{i,n}[j]$: The $j$th subtask executed in $\mathcal{U}_{i,n}$
$w_{i,n}[j]$: Delay from receiving $\mathcal{U}_{i,n}$ to the start of transmission of $u_{i,n}[j]$
$z^{\mathrm{sen}}_{i,n}/z^{\mathrm{tra}}_{i,n}$: Sensing/transmission time for worker $n$ from receiving $\mathcal{U}_{i,n}$ to completing $L_{i,n}$
$z^{\mathrm{sen}}_{t,i,n}/z^{\mathrm{tra}}_{t,i,n}$: The value of $z^{\mathrm{sen}}_{i,n}/z^{\mathrm{tra}}_{i,n}$ in time window $t$
Fig. 1. Consecutive MCS task allocation.
A. System Overview
Fig. 1 illustrates an MCS system comprising a control platform, task requesters, and workers. As a dispatch center, the control platform connects the task requesters and workers via base stations. Requesters create tasks that require environmental sensing, and workers with different sensing and computing abilities cooperate to complete them. The control platform assigns a continuous stream of tasks to a group of workers following the first-in-first-out (FIFO) rule. Each task is divided into parallel subtasks, and the control platform employs DRL to select the most suitable workers to complete these subtasks, taking into account the resource competition among the subtasks.

Fig. 2. Completion delay of two subtasks under FIFO.
B. Task Completion Latency Model
We now explain the allocation and execution of parallel subtasks and model the completion latency. Let $\mathcal{U}_{i,n}$ denote the set of subtasks for task $i$ allocated to worker $n$, and $U_{i,n}$ indicate the total count of these subtasks. A worker movement minimization method [27] is used to determine the execution order of subtasks in $\mathcal{U}_{i,n}$. Let $u_{i,n}[j]$ represent the $j$th subtask executed in $\mathcal{U}_{i,n}$, and $o_{i,n}[j]$ represent the size of $u_{i,n}[j]$. $l_{i,n}[j]$ refers to the location of $u_{i,n}[j]$, where $l_{i,n}[0]$ specifies the initial location of worker $n$ prior to commencing $\mathcal{U}_{i,n}$, with $l_{i,n}[0] = l_{i-1,n}[U_{i-1,n}]$. It is assumed that the movement, sensing, and transmission rates of worker $n$ are $v^{\mathrm{mov}}_n$, $v^{\mathrm{sen}}_n$, and $v^{\mathrm{tra}}_n$, respectively, and the corresponding delays are $d^{\mathrm{mov}}_{i,n}[j]$, $d^{\mathrm{sen}}_{i,n}[j]$, and $d^{\mathrm{tra}}_{i,n}[j]$:
$$
d^{\mathrm{mov}}_{i,n}[j] = \big|l_{i,n}[j] - l_{i,n}[j-1]\big| / v^{\mathrm{mov}}_n,\quad
d^{\mathrm{sen}}_{i,n}[j] = o_{i,n}[j] / v^{\mathrm{sen}}_n,\quad
d^{\mathrm{tra}}_{i,n}[j] = o_{i,n}[j] / v^{\mathrm{tra}}_n. \tag{1}
$$
Fig. 2 explains the process of worker $n$ executing the subtasks in $\mathcal{U}_{i,n}$ when $\mathcal{U}_{i,n}$ contains two subtasks, and this can be extended to the general case. Assume $L_{i,n}$ is the set of subtasks that worker $n$ has not yet completed when $\mathcal{U}_{i,n}$ arrives, with the remaining sensing and transmission delays being $z^{\mathrm{sen}}_{i,n}$ and $z^{\mathrm{tra}}_{i,n}$, respectively. After completing the sensing of $L_{i,n}$, worker $n$ moves to $l_{i,n}[1]$ to start executing the subtasks in $\mathcal{U}_{i,n}$ in sequence. Note that the delivery of a subtask cannot start until the sensing of this subtask is completed and the transmission of all subtasks before it is finished. We define $w_{i,n}[j]$ as the delay from receiving $\mathcal{U}_{i,n}$ to the start of transmission of $u_{i,n}[j]$. The recursive expression for $w_{i,n}[j]$ is
$$
w_{i,n}[j] =
\begin{cases}
\max\left\{ z^{\mathrm{sen}}_{i,n} + d^{\mathrm{mov}}_{i,n}[1] + d^{\mathrm{sen}}_{i,n}[1],\; z^{\mathrm{tra}}_{i,n} \right\}, & \text{if } j = 1\\[4pt]
\max\left\{ z^{\mathrm{sen}}_{i,n} + \sum_{j'=1}^{j}\left(d^{\mathrm{mov}}_{i,n}[j'] + d^{\mathrm{sen}}_{i,n}[j']\right),\; w_{i,n}[j-1] + d^{\mathrm{tra}}_{i,n}[j-1] \right\}, & \text{otherwise.}
\end{cases} \tag{2}
$$
Based on $w_{i,n}[j]$, the latency of completing all subtasks in $\mathcal{U}_{i,n}$ is $w_{i,n}[U_{i,n}] + d^{\mathrm{tra}}_{i,n}[U_{i,n}]$. Since parallel subtasks are allocated to multiple workers for processing, the completion latency of task $i$ is
$$
d_i = \max_{n}\left\{ w_{i,n}[U_{i,n}] + d^{\mathrm{tra}}_{i,n}[U_{i,n}] \right\}. \tag{3}
$$
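As a concrete illustration of (1)-(3), the sketch below evaluates the recursion $w_{i,n}[j]$ for a single worker and takes the maximum over workers to obtain $d_i$. The data layout (1-D locations, per-worker argument dictionaries) and the function names are our own simplifying assumptions, not the paper's implementation.

```python
# Minimal sketch of the latency model in (1)-(3); 1-D locations for simplicity.

def worker_delay(z_sen, z_tra, start_loc, subtasks, v_mov, v_sen, v_tra):
    """Time until worker n delivers its last subtask: w[U_{i,n}] + d_tra[U_{i,n}]."""
    w_prev, d_tra_prev = 0.0, 0.0
    loc = start_loc
    cum_move_sense = 0.0                          # running sum of d_mov + d_sen in (2)
    for j, (subtask_loc, size) in enumerate(subtasks, start=1):
        d_mov = abs(subtask_loc - loc) / v_mov    # eq. (1)
        d_sen = size / v_sen
        d_tra = size / v_tra
        cum_move_sense += d_mov + d_sen
        if j == 1:
            w = max(z_sen + d_mov + d_sen, z_tra)                  # first case of (2)
        else:
            w = max(z_sen + cum_move_sense, w_prev + d_tra_prev)   # second case of (2)
        w_prev, d_tra_prev, loc = w, d_tra, subtask_loc
    return w_prev + d_tra_prev

def task_delay(per_worker_args):
    """Completion latency d_i of task i: the maximum over its workers, eq. (3)."""
    return max(worker_delay(**kw) for kw in per_worker_args)

# toy example with two workers
print(task_delay([
    dict(z_sen=0.5, z_tra=0.2, start_loc=0.0, subtasks=[(1.0, 0.6), (1.5, 0.8)],
         v_mov=0.5, v_sen=0.15, v_tra=0.15),
    dict(z_sen=0.0, z_tra=0.0, start_loc=0.5, subtasks=[(0.8, 0.7)],
         v_mov=0.4, v_sen=0.12, v_tra=0.18),
]))
```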
C. Problem Formulation
In time window $t$, the group of workers is denoted as $\mathcal{N}_t$, with $N_t$ being its cardinality. A good allocation strategy should make the sensing queue lengths of the workers similar so that the network load is balanced. We relabel $z^{\mathrm{sen}}_{i,n}$ in the following as $z^{\mathrm{sen}}_{t,i,n}$ and define the balance index of task allocation as
$$
\phi_{t,i} \triangleq \frac{1}{N_t} \sum_{n\in\mathcal{N}_t} \left( z^{\mathrm{sen}}_{t,i,n} - \frac{1}{N_t} \sum_{n'\in\mathcal{N}_t} z^{\mathrm{sen}}_{t,i,n'} \right)^2. \tag{4}
$$
The smaller $\phi_{t,i}$ is, the more balanced the task allocation.
To evaluate long-term task allocation in a time-varying environment, the set of time windows is defined as $\mathcal{T}$, with $T$ being its cardinality. The set of tasks in window $t\in\mathcal{T}$ is denoted as $\mathcal{I}_t$, with $I_t$ as its cardinality, and the set of subtasks into which task $i$ is divided is $\mathcal{M}_{t,i}$. Let $a_{t,i,m,n}=1$ indicate that subtask $m\in\mathcal{M}_{t,i}$ is assigned to worker $n$; otherwise, $a_{t,i,m,n}=0$. Assuming that the completion delay of task $i$ in window $t$ is $d_{t,i}$, the task partitioning and continuous parallel subtask assignment are transformed into the following long-term optimization problem:
$$
\mathbf{P1:}\quad \min \; \lim_{T\to\infty} \frac{1}{T} \sum_{t\in\mathcal{T}} \left( \zeta\,\mathbb{E}(d_{t,i}) + (1-\zeta)\,\mathbb{E}(\phi_{t,i}) \right)
$$
$$
\text{s.t.}\quad \sum_{n\in\mathcal{N}_t} a_{t,i,m,n} = 1,\quad \forall t\in\mathcal{T},\, i\in\mathcal{I}_t,\, m\in\mathcal{M}_{t,i} \tag{5a}
$$
$$
a_{t,i,m,n} \in \{0,1\},\quad \forall t\in\mathcal{T},\, i\in\mathcal{I}_t,\, m\in\mathcal{M}_{t,i},\, n\in\mathcal{N}_t \tag{5b}
$$
where $\zeta$ is a weight parameter. Constraint (5a) states that each subtask is assigned to exactly one worker, and (5b) defines the 0-1 decision variable of subtask assignment.
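To make (4) and the per-window cost inside P1 concrete, the following snippet computes the balance index as the variance of the workers' remaining sensing times and combines it with the completion delay using the weight ζ = 0.75 (the value later listed in Table II); the inputs are made-up numbers.

```python
import numpy as np

def balance_index(z_sen):
    """phi_{t,i} in (4): variance of the workers' remaining sensing times."""
    z = np.asarray(z_sen, dtype=float)
    return float(np.mean((z - z.mean()) ** 2))

def window_cost(delays, balances, zeta=0.75):
    """Per-window term of P1: zeta*E(d_{t,i}) + (1 - zeta)*E(phi_{t,i})."""
    return zeta * float(np.mean(delays)) + (1.0 - zeta) * float(np.mean(balances))

z_sen_queues = [3.0, 5.0, 4.0, 8.0]        # toy sensing-queue lengths (min)
print(balance_index(z_sen_queues))         # 3.5
print(window_cost(delays=[12.0, 9.5], balances=[3.5, 2.0]))
```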
III. PROPOSED SOLUTION
A DRL approach that supports continuous parallel subtask allocation is used to solve problem P1. The MCS platform is abstracted as an agent interacting with the environment at discrete time steps. During the $i$th interaction with the environment, the agent obtains an action $a_i$ according to the environment state $s_i$ and policy $\pi_\phi$. Then, the environment transitions to the next state $s_{i+1}$ according to the action $a_i$ and returns a reward $r_i$. The state space, action space, and reward are described as follows.
1) State Space: To unify the dimensions of input tensors, the number of parallel subtasks is set not to exceed $M$. When the number is less than $M$, the states of all missing subtasks are filled with 0. Similarly, when the number of workers is less than $N$, the states of all missing workers are also filled with 0 (a brief padding sketch is given after this list). The state of subtask $m$ is represented by $\hat{s}_{i,m} = (l_{i,m}, o_{i,m})$, where $l_{i,m}$ and $o_{i,m}$ represent the location and size of subtask $m$, respectively. The state of task $i$ is represented by the combination of the states of the $M$ subtasks
$$
s^{\mathrm{task}}_i = (\hat{s}_{i,1}, \ldots, \hat{s}_{i,M}). \tag{6}
$$
The state of worker $n$ is represented by $\tilde{s}_{i,n} = (v^{\mathrm{mov}}_n, v^{\mathrm{sen}}_n, v^{\mathrm{tra}}_n, l_{i,n}[0], z^{\mathrm{sen}}_{i,n}, z^{\mathrm{tra}}_{i,n})$. Similarly, the state of the $N$ workers is represented by
$$
s^{\mathrm{worker}}_i = (\tilde{s}_{i,1}, \ldots, \tilde{s}_{i,N}). \tag{7}
$$
Finally, the system state is composed by concatenating the task state and the worker state
$$
s_i = (s^{\mathrm{task}}_i, s^{\mathrm{worker}}_i). \tag{8}
$$
2) Action Space: Let $\mathcal{M}_i$ and $\mathcal{N}_i$ denote the sets of subtasks and workers at state $s_i$, respectively. The dimension of the action space for assigning $\mathcal{M}_i$ to $\mathcal{N}_i$ is at most $N^M$. If the number of output-layer neurons of the DRL policy network were set to $N^M$, the high-dimensional action space would make learning difficult to converge. For this reason, the allocation decision for the $M$ subtasks is decomposed into $M$ subdecisions, and the number of output-layer neurons of the policy network is reduced to $M\cdot N$. For each subdecision, the action space is $\{1, 2, \ldots, N\}$, where $n$ means that the corresponding subtask is assigned to worker $n$. The actions corresponding to missing subtasks are ignored if the number of subtasks is less than $M$. Thus, the output action $a_i$ represents assigning the $M$ subtasks of task $i$ to the $N$ workers.
3) Reward Function: During DRL training, we give an immediate reward $r_i = r(s_i, a_i)$ that evaluates the merit of the selected action. The goal of task allocation is to minimize the task completion delay and the balance variance, while the goal of DRL is to maximize the long-term reward, so the reward function is defined as
$$
r_i = \frac{\sigma_1 - \left( \zeta d_i + (1-\zeta)\phi_i \right)}{\sigma_2} \tag{9}
$$
where $\phi_i$ represents the balance index of the allocation of task $i$, and the parameters $\sigma_1$ and $\sigma_2$ are used to control the range of $d_i$ and $\phi_i$ for the sake of DRL training.
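Referring back to the state-space definition in (6)-(8), the sketch below shows one way to assemble the zero-padded state vector. The per-subtask layout (2-D location, size) and per-worker layout (three rates, 2-D location, two queue delays) follow the definitions above; the helper name and exact feature ordering are assumptions. With M = 8 and N = 16 this yields a 136-dimensional state, which matches the MLP input size reported in the ablation study (Section IV-E).

```python
import numpy as np

M_MAX, N_MAX = 8, 16          # maximum numbers of subtasks and workers
SUB_DIM, WORKER_DIM = 3, 7    # (x, y, size) and (v_mov, v_sen, v_tra, x, y, z_sen, z_tra)

def build_state(subtasks, workers):
    """Concatenate zero-padded subtask and worker states as in (6)-(8)."""
    s_task = np.zeros((M_MAX, SUB_DIM))
    s_task[:len(subtasks)] = subtasks          # missing subtasks stay zero
    s_worker = np.zeros((N_MAX, WORKER_DIM))
    s_worker[:len(workers)] = workers          # missing workers stay zero
    return np.concatenate([s_task.ravel(), s_worker.ravel()])

subtasks = [(0.3, 1.2, 0.6), (1.8, 0.4, 0.9)]                      # two real subtasks
workers = [(0.4, 0.15, 0.12, 1.0, 1.0, 2.0, 1.5)]                  # one real worker
print(build_state(subtasks, workers).shape)    # (8*3 + 16*7,) = (136,)
```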
The agent interacts with the environment and generates a sampling trajectory $\varsigma = \{s_1, a_1, r_1, \ldots, s_i, a_i, r_i, \ldots\}$ under policy $\pi$ based on (8) and (9). The optimal allocation policy can be obtained by maximizing the expected return of the sampling trajectory, expressed as
$$
\pi^{*} = \max_{\pi} J(\pi) = \mathbb{E}_{\varsigma\sim\pi}\left[ \sum_{i\geq 0} \gamma^{i} r_{i+1} \right] \tag{10}
$$
where $\gamma$ is a discount factor.
Fig. 3. TDDS model structure.

Fig. 4. Policy network for multiple subtasks.

Most studies use on-policy DRL strategies such as asynchronous advantage actor-critic (A3C) [28] and PPO for task allocation [29]. Although on-policy strategies are suitable for environments where data are continuously generated, they are susceptible to noise and may quickly forget previously learned information. The off-policy strategy shows more significant advantages in diverse environments by utilizing an experience pool to store and reuse past data, as it can learn from historical data with enhanced adaptability. TD3 is an off-policy algorithm for continuous control that effectively alleviates the overestimation and high variance of the expected long-term return of states or state-action pairs. This motivates us to apply it to parallel subtask allocation. However, converting TD3 from continuous to discrete control remains challenging.
To address this issue, we construct a twin-delayed deep stochastic policy gradient (TDDS) model based on TD3, which uses two critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and two target critic networks $Q_{\theta'_1}$ and $Q_{\theta'_2}$ with multilayer perceptron (MLP) architectures, as shown in Fig. 3. Moreover, we create an experience replay pool that enables TDDS to store samples collected by interacting with the environment.
A. Policy Network Design
Considering the inconsistent action-space dimensions caused by dynamic task division, we design the policy network $\pi_\phi$ and the target policy network $\pi_{\phi'}$ in Fig. 4 with linear layers and an attention aggregation layer. The policy network takes the sampled state $s_i$ as input and passes the $M$ subtask states $(\hat{s}_{i,1}, \ldots, \hat{s}_{i,M})$ through linear layer 1 to obtain $M$ queries $\{q_{i,m}\}$ of dimension $D$. Similarly, it passes the $N$ worker states $(\tilde{s}_{i,1}, \ldots, \tilde{s}_{i,N})$ through linear layer 2 to obtain $N$ keys $\{\mathrm{key}_{i,n}\}$ of dimension $D$. The attention score for $q_{i,m}$ and $\mathrm{key}_{i,n}$ is calculated as
$$
\omega_{i,m,n} =
\begin{cases}
\dfrac{q_{i,m}\cdot \mathrm{key}_{i,n}}{\sqrt{D}}, & \text{if } m\in\mathcal{M}_i,\, n\in\mathcal{N}_i\\[6pt]
-\infty, & \text{otherwise.}
\end{cases} \tag{11}
$$
The attention weight of query $m$ selecting key $n$ is determined as
$$
\alpha_{i,m,n} = \frac{\exp(\omega_{i,m,n})}{\sum_{n'\in\mathcal{N}} \exp(\omega_{i,m,n'})}. \tag{12}
$$
The larger the value of $\alpha_{i,m,n}$, the higher the matching degree between subtask $m$ and worker $n$. Denoting the action distribution of subpolicy $m$ as $\pi_{\phi,m}(\cdot|s_i) = (\alpha_{i,m,1}, \ldots, \alpha_{i,m,N})$, the output of policy network $\pi_\phi$ is the set of $M$ subpolicies paired with the $M$ subtasks, $\{\pi_{\phi,m}(\cdot|s_i)\}$. The attention aggregation layer can perceive the resource competition among parallel subtasks and learn how to map the state associating one subtask with the $N$ workers to the corresponding subpolicy. Benefiting from the masked attention mechanism, the missing subtasks or workers used for padding do not affect the policy network update [30].
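A minimal PyTorch sketch of the masked attention aggregation in (11)-(12) is given below. The layer sizes, feature dimensions, and class name are illustrative assumptions rather than the paper's exact configuration; padded subtasks and workers are masked with $-\infty$ before the softmax, so they receive zero attention weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAttentionPolicy(nn.Module):
    """Maps (subtask states, worker states) to M per-subtask distributions over N workers."""
    def __init__(self, sub_dim=3, worker_dim=7, d=128):
        super().__init__()
        self.d = d
        self.query = nn.Linear(sub_dim, d)     # linear layer 1: subtasks -> queries
        self.key = nn.Linear(worker_dim, d)    # linear layer 2: workers -> keys

    def forward(self, s_task, s_worker, sub_mask, worker_mask):
        # s_task: (B, M, sub_dim), s_worker: (B, N, worker_dim); masks are boolean.
        q = self.query(s_task)                             # (B, M, d)
        k = self.key(s_worker)                             # (B, N, d)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5   # (B, M, N), eq. (11)
        invalid = ~(sub_mask.unsqueeze(-1) & worker_mask.unsqueeze(1))
        scores = scores.masked_fill(invalid, float("-inf"))
        probs = F.softmax(scores, dim=-1)                  # eq. (12): one subpolicy per subtask
        return torch.nan_to_num(probs)                     # rows of padded subtasks become all-zero

policy = MaskedAttentionPolicy()
s_task, s_worker = torch.randn(1, 8, 3), torch.randn(1, 16, 7)
sub_mask = torch.tensor([[True] * 2 + [False] * 6])        # 2 real subtasks
worker_mask = torch.tensor([[True] * 10 + [False] * 6])    # 10 real workers
print(policy(s_task, s_worker, sub_mask, worker_mask).shape)  # torch.Size([1, 8, 16])
```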
B. Gumbel-Softmax Sampling
TDDS adapts TD3, which originally operates in a continuous action space, to work in a discrete action space by applying Gumbel-Softmax sampling. Suppose the action probability vector $p_{i,m} = (p_{i,m,1}, \ldots, p_{i,m,N})$ output by subpolicy $m$ under the state $s_i$ satisfies $\sum_{n\in\mathcal{N}} p_{i,m,n} = 1$. The common Gumbel-Max trick [31] is used to sample the discrete probability distribution $p_{i,m}$, and one-hot encoding is used to represent the sampled action as
$$
F(p_{i,m}) = \mathrm{one\_hot}\Big( \arg\max_{n} \big( g_{i,m,n} + \log p_{i,m,n} \big) \Big) \tag{13}
$$
where $g_{i,m,n} \sim \mathrm{Gumbel}(0,1)$. Because (13) is not differentiable with respect to $p_{i,m}$, backpropagation cannot be used to update the network parameters. The continuous Softmax function
$$
e_{i,m,n} = \frac{\exp\big( (g_{i,m,n} + \log p_{i,m,n}) / \tau \big)}{\sum_{n'\in\mathcal{N}} \exp\big( (g_{i,m,n'} + \log p_{i,m,n'}) / \tau \big)} \tag{14}
$$
is used to approximate (13), obtaining the differentiable Gumbel-Softmax sample
$$
G(p_{i,m}) = (e_{i,m,1}, \ldots, e_{i,m,N}). \tag{15}
$$
In (14), $\tau$ is the temperature coefficient used to control the degree of approximation of $G(p_{i,m})$ to $F(p_{i,m})$. To describe the allocation of multiple subtasks, we use $p = (p_1, \ldots, p_M)$ to denote a multidimensional probability distribution and $G(p) = (G(p_1), \ldots, G(p_M))$ to denote the Gumbel-Softmax sampling of $p$.
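The relaxation in (13)-(15) takes only a few lines of PyTorch. The helper below draws Gumbel(0,1) noise, applies (14) with temperature τ, and optionally applies a straight-through one-hot pass (an implementation variant we assume here, not something stated in the paper). PyTorch's built-in F.gumbel_softmax(logits, tau) provides the same functionality on logits.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(probs, tau=0.5, hard=False):
    """Differentiable sample G(p) from a categorical distribution p, eqs. (13)-(15)."""
    u = torch.rand_like(probs).clamp_min(1e-20)
    g = -torch.log(-torch.log(u))                                  # Gumbel(0,1) noise
    y = F.softmax((g + torch.log(probs + 1e-20)) / tau, dim=-1)    # eq. (14)
    if hard:
        # optional straight-through variant: one-hot forward pass, soft gradient
        one_hot = F.one_hot(y.argmax(dim=-1), probs.shape[-1]).float()
        y = (one_hot - y).detach() + y
    return y

p = torch.tensor([[0.7, 0.2, 0.1]], requires_grad=True)   # toy subpolicy over 3 workers
a = gumbel_softmax_sample(p, tau=0.5)
a.sum().backward()                                         # gradients flow back to p
print(a, p.grad)
```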
C. Model Update Strategy
The agent uses the policy network $\pi_\phi$ to interact with the environment and obtains samples $(s_i, a_i, r_i, s_{i+1})$ to fill the experience replay pool (see step 1 in Fig. 3). Let $s'_k$ represent the next state of $s_k$, and let $\mathcal{B}$ represent the set of random indices of the $B$ samples drawn from the experience replay pool. When there are enough samples in the pool, the agent randomly samples $B$ tuples $(s_k, a_k, r_k, s'_k)$, $k\in\mathcal{B}$, and inputs $s'_k$ into the target policy network $\pi_{\phi'}$ to obtain the action probabilities of the $M$ subpolicies $\pi_{\phi'}(\cdot|s'_k) \triangleq (\pi_{\phi',1}(\cdot|s'_k), \ldots, \pi_{\phi',M}(\cdot|s'_k))$. Each subpolicy of $\pi_{\phi'}(\cdot|s'_k)$ is sampled to obtain a concatenated action vector $\tilde{a}_k \triangleq (\tilde{a}_{k,1}, \ldots, \tilde{a}_{k,M})$. Then $\tilde{a}_k$ and $s'_k$ are input into $Q_{\theta'_1}$ and $Q_{\theta'_2}$, respectively, to obtain two temporal-difference targets $\hat{y}_{k,1}$ and $\hat{y}_{k,2}$:
$$
\hat{y}_{k,h} = r_k + \gamma Q_{\theta'_h}(s'_k, \tilde{a}_k),\quad h\in\{1,2\}. \tag{16}
$$
Let $\hat{y}_k = \min(\hat{y}_{k,1}, \hat{y}_{k,2})$, which serves as the target value of the critic networks. The mean square error (MSE) is used to establish the loss functions of critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$, expressed as
$$
\mathrm{loss}_h = \frac{1}{B} \sum_{k\in\mathcal{B}} \left( \hat{y}_k - Q_{\theta_h}(s_k, a_k) \right)^2,\quad h\in\{1,2\}. \tag{17}
$$
Then the Nadam optimizer updates $Q_{\theta_1}$ and $Q_{\theta_2}$ using the gradients of $\mathrm{loss}_1$ and $\mathrm{loss}_2$ with respect to $\theta_1$ and $\theta_2$, respectively (see step 2 in Fig. 3). After the critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ have been updated $c$ times, the agent updates the policy network $\pi_\phi$ once to ensure model training stability (see step 3 in Fig. 3).
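The clipped double-Q target in (16) and the critic losses in (17) map directly to a few lines of PyTorch. The sketch below is illustrative only: it assumes critics that take the concatenated state and flattened one-hot subtask actions as input, and it uses a random stand-in for the target policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def critic_losses(q1, q2, q1_t, q2_t, target_policy, batch, gamma=0.8, tau=0.5):
    """Clipped double-Q targets (16) and the two MSE critic losses (17) for one batch."""
    s, a, r, s_next = batch                               # a: flattened one-hot actions
    with torch.no_grad():
        probs_next = target_policy(s_next)                # (B, M, N) target subpolicies
        a_next = F.gumbel_softmax(torch.log(probs_next + 1e-20), tau=tau).flatten(1)
        sa_next = torch.cat([s_next, a_next], dim=1)
        y1 = r + gamma * q1_t(sa_next).squeeze(-1)        # eq. (16), h = 1
        y2 = r + gamma * q2_t(sa_next).squeeze(-1)        # eq. (16), h = 2
        y = torch.min(y1, y2)                             # clipped target value
    sa = torch.cat([s, a], dim=1)
    mse = nn.MSELoss()
    return mse(q1(sa).squeeze(-1), y), mse(q2(sa).squeeze(-1), y)   # eq. (17)

# toy usage: state dim 136, flattened action dim 8*16 = 128, critic input 264
make_critic = lambda: nn.Sequential(nn.Linear(264, 64), nn.PReLU(), nn.Linear(64, 1))
q1, q2, q1_t, q2_t = make_critic(), make_critic(), make_critic(), make_critic()
target_policy = lambda s: torch.softmax(torch.randn(s.shape[0], 8, 16), dim=-1)
batch = (torch.randn(4, 136), torch.randn(4, 128), torch.randn(4), torch.randn(4, 136))
loss1, loss2 = critic_losses(q1, q2, q1_t, q2_t, target_policy, batch)
print(loss1.item(), loss2.item())
```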
According to the deterministic policy gradient (DPG) theorem [32], the policy gradient of the expected return in (10) is expressed as
$$
\begin{aligned}
\nabla_\phi J(\phi) &= \mathbb{E}_{s\sim\rho^{\pi_\phi},\, a\sim\pi_\phi(\cdot|s)}\left[ \nabla_\phi Q^{\pi_\phi}(s, a) \right]\\
&\approx \frac{1}{B} \sum_{k\in\mathcal{B}} \nabla_\phi Q^{\pi_\phi}(s_k, a_k)\\
&= \frac{1}{B} \sum_{k\in\mathcal{B}} \nabla_{a_k} Q^{\pi_\phi}(s_k, a_k)\, \nabla_\phi G\big(\pi_\phi(\cdot|s_k)\big)\\
&= \frac{1}{B} \sum_{k\in\mathcal{B}} \nabla_{a_k} Q^{\pi_\phi}(s_k, a_k)\, \nabla_p G(p)\, \nabla_\phi \pi_\phi(\cdot|s_k)
\end{aligned} \tag{18}
$$
where $\rho^{\pi_\phi}$ denotes the discounted state distribution [32] and $Q^{\pi_\phi}$ represents the state-action value function under policy $\pi_\phi$. In (18), $G(\pi_\phi(\cdot|s_k))$ approximates a one-hot vector, which conforms to the discrete action form. To compute $\nabla_\phi J(\phi)$, TD3 uses $Q_{\theta_1}$ instead of $Q^{\pi_\phi}$ to ensure differentiability with respect to $a_k$. The value function $Q^{\pi_\phi}$ can be estimated by either value network $Q_{\theta_1}$ or $Q_{\theta_2}$. Since the two networks are equivalent, their mean is used to approximate the policy gradient. Accordingly, $Q^{\pi_\phi}(s_k, a_k)$ is approximated as
$$
Q^{\pi_\phi}(s_k, a_k) = \frac{1}{2}\left( Q_{\theta_1}(s_k, a_k) + Q_{\theta_2}(s_k, a_k) \right) \tag{19}
$$
and $\nabla_\phi J(\phi)$ is used by the Nadam optimizer to update the policy network $\pi_\phi$.
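In implementation terms, (18)-(19) reduce to evaluating the mean of the two critics at the Gumbel-Softmax relaxed actions $G(\pi_\phi(\cdot|s))$ and ascending its gradient with the Nadam optimizer. The sketch below is a hedged illustration; the network shapes and names are placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def actor_loss(policy, q1, q2, s, tau=0.5):
    """Negative surrogate of (18)-(19): maximize the mean critic value at the
    relaxed actions G(pi_phi(.|s)), so gradients reach the policy through G."""
    probs = policy(s)                                            # (B, M, N) subpolicies
    a_relaxed = F.gumbel_softmax(torch.log(probs + 1e-20), tau=tau).flatten(1)
    sa = torch.cat([s, a_relaxed], dim=1)
    q_mean = 0.5 * (q1(sa) + q2(sa))                             # eq. (19)
    return -q_mean.mean()                                        # minimizing this ascends J(phi)

# toy usage with placeholder modules
policy = nn.Sequential(nn.Linear(136, 8 * 16), nn.Unflatten(1, (8, 16)), nn.Softmax(dim=-1))
make_critic = lambda: nn.Sequential(nn.Linear(264, 64), nn.PReLU(), nn.Linear(64, 1))
q1, q2 = make_critic(), make_critic()
opt = torch.optim.NAdam(policy.parameters(), lr=0.01)            # Nadam, as in the paper
loss = actor_loss(policy, q1, q2, torch.randn(4, 136))
opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```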
Algorithm 1: TDDS-Based Parallel Subtask Allocation
Input: Sampling batch $B$, soft update factor $\beta$, discount factor $\gamma$, policy network update period $c$, maximum training rounds max_epochs
Output: Policy network $\pi_\phi$ for task allocation
1: Initialize critic networks $Q_{\theta_1}$, $Q_{\theta_2}$ with parameters $\theta_1$, $\theta_2$ and the policy network $\pi_\phi$ with parameters $\phi$; assign the target networks as $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, $\phi' \leftarrow \phi$;
2: for epoch = 1 to max_epochs do
3:   Get the initial system state $s_1$;
4:   $i \leftarrow 1$;
5:   while $s_i$ is not the terminal state do
6:     Select action $a_i \sim \pi_\phi(\cdot|s_i)$ according to the current policy $\pi_\phi$;
7:     Execute action $a_i$, calculate reward $r_i$; the system transitions to the next state $s_{i+1}$;
8:     Put $(s_i, a_i, r_i, s_{i+1})$ into the experience replay pool;
9:     $i \leftarrow i + 1$;
10:  Sample $B$ tuples from the experience replay pool;
11:  Calculate the temporal-difference targets $\hat{y}_{k,1}$ and $\hat{y}_{k,2}$ according to (16);
12:  $\hat{y}_k \leftarrow \min(\hat{y}_{k,1}, \hat{y}_{k,2})$;
13:  Calculate $\mathrm{loss}_1$ and $\mathrm{loss}_2$ according to (17);
14:  Update the critic networks;
15:  if epoch mod $c$ = 0 then
16:    Calculate $\nabla_\phi J(\phi)$ according to (18);
17:    Update the policy network;
18:    Update the target networks according to (20);
19: return $\pi_\phi$
After updating $\pi_\phi$, the parameters of the target critic networks $Q_{\theta'_1}$, $Q_{\theta'_2}$ and the target policy network $\pi_{\phi'}$ are updated by (see step 4 in Fig. 3)
$$
\theta'_h \leftarrow (1-\beta)\theta'_h + \beta\theta_h,\quad h\in\{1,2\}, \qquad
\phi' \leftarrow (1-\beta)\phi' + \beta\phi \tag{20}
$$
where $\beta \ll 1$ is the soft update factor.
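For completeness, the Polyak soft update in (20) is sketched below as a small helper (β = 0.005 as in Table II); the function name is our own.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def soft_update(target_net, online_net, beta=0.005):
    """theta' <- (1 - beta) * theta' + beta * theta, as in (20)."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - beta).add_(beta * p)

net, net_target = nn.Linear(4, 2), nn.Linear(4, 2)
net_target.load_state_dict(net.state_dict())    # targets start as copies, as in Algorithm 1
soft_update(net_target, net)
```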
The execution of TDDS-based parallel subtask allocation is summarized in Algorithm 1. Initially, the parameters of the critic networks and policy network are randomly initialized and assigned to the corresponding target networks (line 1). The agent then periodically interacts with the environment, collecting a variety of samples to populate the experience replay pool (lines 3–9), and updates the critic networks by minimizing the loss functions in each time window (lines 10–14). Whenever the critic networks have been updated $c$ times, the policy network is updated using the policy gradient (lines 15–17), and the target networks are updated simultaneously (line 18). This process is repeated until TDDS training converges.
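Algorithm 1 assumes a large experience replay pool (capacity 100 000 in Section IV) from which batches of B tuples are drawn uniformly at random. A minimal ring-buffer sketch, with illustrative names, is given below.

```python
import random
from collections import deque

class ReplayPool:
    """Fixed-capacity experience replay pool of (s, a, r, s_next) tuples."""
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)   # oldest samples are evicted automatically

    def put(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size=4096):
        batch = random.sample(self.pool, batch_size)
        return list(zip(*batch))             # (states, actions, rewards, next_states)

pool = ReplayPool()
for k in range(5000):
    pool.put(s=k, a=k % 16, r=-1.0, s_next=k + 1)   # toy transitions
states, actions, rewards, next_states = pool.sample(batch_size=64)
print(len(states))
```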
IV. PERFORMANCE EVALUATION
PyTorch was used as the deep learning framework to implement the proposed solution. The crowdsensing area was set to 2 km × 2 km. The inter-arrival time of tasks (in minutes) follows an exponential distribution with rate parameter $\lambda$. The platform divided each task into one to eight subtasks, each with a size ranging from 0.5 to 1 GB. The number of workers varied between 8 and 16. The sensing rate (GB/min), transmission rate (GB/min), and movement rate (km/min) varied in the ranges [0.1, 0.2], [0.09, 0.21], and [0.3, 0.6], respectively. The experience replay pool capacity was set to 100 000. The output dimension $D$ of the policy network's linear layers was set to 1000. The critic networks used a 264 × 1000 × 1 MLP with PReLU as the activation function. Other simulation parameters are given in Table II.

TABLE II
EXPERIMENTAL PARAMETERS

Parameter: Value
Discount factor ($\gamma$): 0.8
Soft update factor ($\beta$): 0.005
Temperature coefficient ($\tau$): 0.5
Objective function weight parameter ($\zeta$): 0.75
Reward function parameters ($\sigma_1$, $\sigma_2$): 56, 29
Policy network update cycle ($c$): 8
For comprehensive comparison and verification, three baseline approaches were selected and designed as follows.
1) Random assignment (RA) [33]: The agent randomly assigns one worker to each subtask.
2) PPO [34]: The agent uses PPO to perform task partitioning and allocation, where the policy network relies on an MLP to generate parallel subpolicies.
3) Independent deep Q-network (IDQN) [35]: Multiple agents perform parallel subtask allocation, where each agent uses one Q-network to allocate one subtask.
A. Convergence Analysis
The first experiments evaluated the convergence of TDDS under different learning rates by calculating the average cumulative reward of the policy network over multiple time windows. Sampling larger batches of tuples from the experience replay pool helps ensure learning stability [36], so the sampling batch was first set to $B = 4096$.

The Nadam optimizers for the critic and policy networks use the same learning rate, denoted as $\eta$. As shown in Fig. 5(a), when $\eta = 0.025$, the convergence curve of TDDS showed large oscillations, and the policy network converged to stability only after 340 updates; when $\eta = 0.001$, the average cumulative reward converged slowly to 73. This shows that an overly large or small $\eta$ hampered the convergence of TDDS. When $\eta$ was set to 0.01 or 0.005, TDDS balanced convergence speed and stability, and the average cumulative reward stabilized at around 77 after 300 updates. Thus, the subsequent experiments all took $\eta = 0.01$.
Fig. 5(b) tests the effect of $B$ on the convergence of TDDS when $\eta = 0.01$. In the case of $B = 512$, small-batch sampling made the gradient estimation inaccurate, which led to slow convergence, and the convergence value was only around 64. Increasing $B$ to 1024 and 2048 accelerated convergence, but the convergence curve fluctuated slightly after reaching stability. The convergence curve for $B = 4096$ was close to that for $B = 8192$ after 170 updates. This reflects that increasing the sampling batch beyond a certain size does not necessarily improve the convergence of TDDS and may even increase the model training cost. Hence, the subsequent experiments all took $B = 4096$.
To evaluate the convergence of TDDS in detail, we tracked multiple variables during training. Because the losses of the two critic networks were almost the same, only $\mathrm{loss}_1$ is given. Fig. 6(a) demonstrates that the $\mathrm{loss}_1$ curve converges steadily, showing that the critic networks can accurately predict the expected return after training and providing a solid foundation for updating the policy network. In Fig. 6(b), the expected return curve rises steadily, indicating that the policy network was gradually optimized. In Fig. 6(c), the policy entropy [37] gradually decreased from 2.5, reflecting that the agent's exploration gradually decreased and the policy tended to become stable.
B. Adaptability Analysis
Multiple indicators are extracted to evaluate the allocation driven by TDDS. We took $T = 30\,000$ time windows for all subsequent evaluations to assess the applicability to different environments. Fig. 7(a) shows that the expected task completion delay $\mathbb{E}(d_{t,i})$ of TDDS was 44%, 65%, and 70% of that of RA, PPO, and IDQN, respectively. Let $z^{\mathrm{tra}}_{t,i,n}$ be the value of $z^{\mathrm{tra}}_{i,n}$ in time window $t$. Fig. 7(b) and (c) show that the expected sensing and transmission queue lengths $\mathbb{E}(z^{\mathrm{sen}}_{t,i,n})$ and $\mathbb{E}(z^{\mathrm{tra}}_{t,i,n})$ were both smaller than those of the other three algorithms, indicating that TDDS effectively reduced the queuing cost in executing subtasks. At the same time, in Fig. 7(d), the expected movement distance to complete a task under TDDS was 0.16 km, while those of RA, PPO, and IDQN were all larger than 0.37 km. This means that TDDS can achieve an optimal matching according to the spatial information of subtasks and workers and has stronger environmental adaptability. In Fig. 7(e), the expected balance index $\mathbb{E}(\phi_{t,i})$ of TDDS was much smaller than that of RA and was 72% and 81% of PPO and IDQN, respectively, so TDDS provides a more balanced scheduling strategy. The above results show that the five indicators yielded consistent outcomes. Accordingly, we used the objective value of P1 as a simple and effective criterion to evaluate the following simulations.
C. Impact of Task Attributes
When evaluating the average impact of a certain quantity on the objective value under different environments, we fix this quantity in all test environments and keep the rest of the variables following their original distributions.

This group of experiments first considered the impact of task arrival intensity in each time window on the objective value. The task inter-arrival time followed an exponential distribution with rate $\lambda$, so the higher the $\lambda$, the higher the task arrival rate, the more subtasks accumulated at the workers, and the more rapidly the objective value rose, as shown in Fig. 8(a). When $\lambda = 1$, IDQN, PPO, and RA found it difficult to cope with the densely arriving tasks, and their objective values were 138, 127, and 151, respectively, while that of TDDS was only 94. In addition, when $\lambda = 1/5$, the task inter-arrival time was longer, so the workers had relatively sufficient time to complete each subtask, and the objective values of the four algorithms were all small.

Fig. 5. Convergence curves of TDDS. (a) Convergence curves with varying learning rate. (b) Convergence curves with varying batch size.

Fig. 6. Variation of $\mathrm{loss}_1$, $J(\phi)$, and policy entropy. (a) Variation of $\mathrm{loss}_1$. (b) Variation of $J(\phi)$. (c) Variation of policy entropy.

Fig. 7. Comprehensive analysis of task allocation. (a) Comparison of $\mathbb{E}(d_{t,i})$. (b) Comparison of $\mathbb{E}(z^{\mathrm{sen}}_{t,i,n})$. (c) Comparison of $\mathbb{E}(z^{\mathrm{tra}}_{t,i,n})$. (d) Comparison of moving distance. (e) Comparison of $\mathbb{E}(\phi_{t,i})$.
Next, the impact of the number of tasks $I_t$ in each time window on the objective value was considered. When the number of tasks in a time window increased, the cumulative effect caused unprocessed subtasks to accumulate continuously, affecting the completion delay of subsequently arriving tasks. For $\lambda = 2/7$, Fig. 8(b) shows the growth trend of the objective value of the four algorithms under varying task numbers, where TDDS still achieved the best allocation, with a delay growth rate of 36%. When $I_t = 500$, the objective values of RA, IDQN, and PPO were 132, 68, and 65, respectively, while that of TDDS was 33. Facing a long-term high demand from task requesters, TDDS provided the highest service quality.

Fig. 8. Impact of task properties on objective value. (a) Impact of $\lambda$. (b) Impact of $I_t$. (c) Impact of $M_{t,i}$. (d) Impact of $o_{i,m}$.

Fig. 9. Impact of worker properties on objective value. (a) Impact of $N_t$. (b) Impact of $v^{\mathrm{sen}}_n$. (c) Impact of $v^{\mathrm{mov}}_n$. (d) Impact of $v^{\mathrm{tra}}_n$.
Let $M_{t,i}$ be the number of subtasks of task $i$ in time window $t$. As shown in Fig. 8(c), as the number of subtasks $M_{t,i}$ into which each task is divided gradually increased, the objective values of all algorithms showed an upward trend, but TDDS rose the slowest, with a maximum objective value of 104. When $M_{t,i} \in [1,2]$, the low-difficulty allocation made the task completion of all algorithms similar. However, the objective value of TDDS increased by only 86 over $M_{t,i} \in [3,8]$, while those of the other three algorithms all increased by more than 155. Among them, IDQN was most affected by the variation in the number of subtasks, with its objective value increasing by 180. This is because, in IDQN, each agent tends to assign subtasks to workers with strong abilities, which may cause most of the subtasks to be assigned to the same worker, thus delaying the completion of the entire task. Especially when $M_{t,i}$ was large, the objective value of IDQN was close to that of RA.

From Fig. 8(d), the objective value is positively correlated with subtask size. Randomly assigning subtasks increases the difficulty of low-ability workers in handling complex tasks, so RA's objective value was much higher than those of PPO, IDQN, and TDDS. PPO and IDQN output subtask allocation strategies that could usually match subtasks and workers well, so their objective values were significantly lower than RA's. TDDS's attention aggregation layer further enhanced the matching degree between subtasks and workers, and its objective value was 43%–50%, 67%–74%, and 70%–77% of that of RA, PPO, and IDQN, respectively, when $o_{i,m} \in [0.5, 1]$.
D. Impact of Worker Attributes
As shown in Fig. 9(a), a gradually increasing number of workers can share more subtasks, so the objective values of the four algorithms all dropped rapidly. However, TDDS achieved the lowest objective value through a more optimal allocation strategy, which was about 26–67 and 5–18 lower than those of the other three algorithms when there were 8 and 16 workers, respectively. This shows that TDDS has obvious advantages under different numbers of workers.

The sensing rate $v^{\mathrm{sen}}_n$ is limited by the ability of the sensing devices carried by workers. For example, sensing devices with high-definition cameras and GPU chips can sense high-quality data faster. Fig. 9(b) shows that a faster sensing rate accelerated task completion for all four algorithms. At the same sensing rate, the objective value of TDDS was much smaller than those of the other three algorithms. For example, when $v^{\mathrm{sen}}_n = 0.1$, the objective values of IDQN, PPO, and RA were 84, 82, and 110, respectively, while that of TDDS was only 58. On the other hand, the speed of worker movement also affected the allocation.

In Fig. 9(c), faster-moving workers reach the subtask locations earlier, which helps reduce the objective value. TDDS's curve was relatively flat, unlike the fluctuating curves of the other three algorithms. When $v^{\mathrm{mov}}_n$ increased from 0.3 to 0.6, the change in the objective value of TDDS was 6, while those of IDQN, PPO, and RA were 20, 19, and 28, respectively. This is because TDDS keeps the average movement distance of workers short, which reduces the movement delay.

Workers can transmit data while sensing and moving, so data transmission does not affect the sensing of the next subtask.
Fig. 10. Convergence curves of different models.
In Fig. 9(d), the objective values of the four algorithms did not change much as the transmission rate increased when $v^{\mathrm{tra}}_n \geq 0.14$, and TDDS still had the lowest objective value, about 62%–69% of those of IDQN and PPO. Fig. 9 shows that TDDS adapts to changes in worker attributes and continuously outputs better online allocation strategies than the other three algorithms.
E. Ablation Experiments
To verify the effectiveness of TDDS, two task allocation
models were set up for ablation experiments.
1) TDDS-SF: The policy network uses the score function (SF) estimator to calculate the gradient instead of Gumbel-Softmax sampling. In this case, the policy gradient changes according to (21).
2) TDDS-MLP: The policy network does not use the attention mechanism but relies on an MLP to output the eight subpolicies, with the structure 136 × 500 × 500 × 128.
In Fig. 10, the average cumulative rewards of TDDS-SF and TDDS-MLP after convergence were 62 and 58, respectively, lower than the 77 of TDDS. At the same time, the MLP of TDDS-MLP could not easily capture the correlation between subtasks and workers, which made its curve fluctuate greatly, so its stability needs improvement.

Fig. 11 plots the probability density curves of the objective value for the three models over 30 000 time windows; the objective values of TDDS-SF, TDDS-MLP, and TDDS concentrated around 34, 36, and 26, respectively, with TDDS having the narrowest curve. The mean and variance of each curve demonstrate that TDDS performs the best allocation.
Fig. 11. Probability density of objective values.
To further verify the advantages of TDDS, we compared the task allocation performance of the three models under different environmental states. As shown in Fig. 12(a), the objective value ranges of the three models were similar (between 11 and 24) when $M_{t,i} \in [1,3]$. However, when the number of subtasks increased to 8, the objective value of TDDS-MLP rose to 155, which was 37 and 62 higher than those of TDDS-SF and TDDS, respectively. The effect of $N_t$ on the objective values of the three models is illustrated in Fig. 12(b). As $N_t$ increased, the objective values of the three models decreased rapidly. However, TDDS outperformed TDDS-SF and TDDS-MLP in all cases. When $N_t \in [8,11]$, TDDS-SF had a lower objective value than TDDS-MLP, but still 10–19 higher than TDDS. When $N_t \in [12,16]$, the objective values of TDDS-MLP and TDDS-SF were similar (about 26–35) but 5–9 higher than that of TDDS. Fig. 12(c) shows the impact of $I_t$ on the objective values. With the increase of $I_t$, the objective values of the three models also increased rapidly. However, TDDS had a lower objective value than TDDS-SF and TDDS-MLP in all scenarios. When $I_t = 500$, TDDS had a 36% and 48% lower objective value than TDDS-SF and TDDS-MLP, respectively.
The results in Fig. 12 demonstrate that using the score function estimator instead of Gumbel-Softmax sampling leads to inaccurate estimation of the policy network gradient, and using an MLP instead of the attention mechanism fails to capture the correlation between subtasks and workers. As a result, the policy networks of TDDS-SF and TDDS-MLP cannot optimally match subtasks and workers, which results in significantly higher objective values than TDDS under different environmental states.
$$
\nabla_\phi J(\phi) \approx \frac{1}{B} \sum_{k\in\mathcal{B}} Q^{\pi_\phi}\big(s_k, (\tilde{a}_{k,1}, \ldots, \tilde{a}_{k,M})\big)\, \nabla_\phi \log \prod_{m=1}^{M} \pi_{\phi,m}(\cdot|s_k). \tag{21}
$$

Fig. 12. Impact of worker and task properties on objective value. (a) Impact of $M_{t,i}$. (b) Impact of $N_t$. (c) Impact of $I_t$.

The proposed task allocation model uses Gumbel-Softmax sampling and the attention mechanism to help the policy network generate allocation strategies, with which the MCS system can meet task needs in a more timely manner while balancing the load for workers.
V. CONCLUSION
We have presented a TDDS-based approach for continuous parallel subtask assignment in MCS. The policy network in TDDS uses shared linear layers to reduce the number of network parameters and introduces a masked attention mechanism to match the dynamically changing numbers of subtasks and workers. Considering that off-policy DRL has high sample utilization and good generalization, we introduce Gumbel-Softmax sampling so that the off-policy TD3 algorithm can be applied to discrete action spaces, and the feasibility of the proposed algorithm is demonstrated through convergence analysis. Compared with mainstream DRL baseline algorithms, TDDS shortens the task completion delay by 30%–56% while balancing the load and reducing workers' movement distance. Regarding adaptation to the dynamics of tasks and workers, TDDS performs more stably and is less affected by environmental fluctuations than the other baseline algorithms. Ablation studies verify the effectiveness of masked attention and Gumbel-Softmax sampling in TDDS.

When tasks arrive intensively, the allocation of previous tasks significantly impacts that of subsequent tasks, and the proposed approach may not achieve global optimality. If an offline method is integrated into online assignment, tasks can be assigned after multiple tasks have been received, and allocation efficiency can be improved by controlling when task assignment is performed; this is our follow-up research direction.
REFERENCES
[1] X. Cheng, B. He, G. Li, and B. Cheng, “A survey of crowdsensing and
privacy protection in digital city, IEEE Trans. Comput. Social Syst.,
vol. 10, no. 6, pp. 3471–3487, Dec. 2023.
[2] S. Du and S. Wang, “An overview of correlation-filter-based object
tracking,” IEEE Trans. Comput. Social Syst., vol. 9, no. 1, pp. 18–31,
Feb. 2022.
[3] I. Koukoutsidis, “Estimating spatial averages of environmental parame-
ters based on mobile crowdsensing, ACM Trans. Sensor Netw., vol. 14,
no. 1, pp. 1–26, 2017.
[4] Y. Gu, H. Shen, G. Bai, T. Wang, and X. Liu, “QOL-aware incentive for
multimedia crowdsensing enabled learning system,” Multimedia Syst.,
vol. 26, pp. 3–16, Feb. 2020.
[5] L. Zhang, Y. Ding, X. Wang, and L. Guo, “Conflict-aware participant
recruitment for mobile crowdsensing, IEEE Trans. Comput. Social Syst.,
vol. 7, no. 1, pp. 192–204, Feb. 2020.
[6] H. Shen, G. Bai, Y. Hu, and T. Wang, “P2TA: Privacy-preserving task
allocation for edge computing enhanced mobile crowdsensing, J. Syst.
Archit., vol. 97, pp. 130–141, 2019.
[7] Y. Tong, Z. Zhou, Y. Zeng, L. Chen, and C. Shahabi, “Spatial crowd-
sourcing: A survey,” VLDB J., vol. 29, pp. 217–250, Jan. 2020.
[8] T. Song, K. Xu, J. Li, Y. Li, and Y. Tong, “Multi-skill aware task
assignment in real-time spatial crowdsourcing, GeoInformatica, vol. 24,
pp. 153–173, Jan. 2020.
[9] H. Schmitz and I. Lykourentzou, “Online sequencing of non-
decomposable macrotasks in expert crowdsourcing, ACM Trans. Social
Comput., vol. 1, no. 1, pp. 1–33, 2018.
[10] Y. Xu, Y. Wang, J. Ma, and Q. Jin, “PSARE: A RL-based online par-
ticipant selection scheme incorporating area coverage ratio and degree
in mobile crowdsensing, IEEE Trans. Veh. Technol., vol. 71, no. 10,
pp. 10923–10933, Oct. 2022.
[11] C. Xu and W. Song, “Decentralized task assignment for mobile crowd-
sensing with multi-agent deep reinforcement learning,” IEEE Internet
Things J., vol. 10, no. 18, pp. 16564–16578, Sep. 2023.
[12] W. Ding, Z. Ming, G. Wang, and Y. Yan, “System-of-systems approach
to spatio-temporal crowdsourcing design using improved PPO algorithm
based on an invalid action masking, Knowl. Based Syst., vol. 285, 2024,
Art. no. 111381.
[13] S. Xie, X. Wang, B. Yang, M. Long, J. Zhang, and L. Wang, “A multi-
stage framework for complex task decomposition in knowledge-intensive
crowdsourcing, in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manage.
(IEEM), 2021, pp. 1432–1436.
[14] Z. Liu and Z. Zhao, “Multiattribute E-CARGO task assignment model
based on adaptive heterogeneous residual networks, IEEE Trans. Com-
put. Social Syst., early access, doi: 10.1109/TCSS.2023.3344173.
[15] Y. Sun, J. Wang, and W. Tan, “Dynamic worker-and-task assignment
on uncertain spatial crowdsourcing, in Proc. IEEE Int. Conf. Comput.
Supported Cooperative Work Des. (CSCWD), 2018, pp. 755–760.
[16] C. H. Liu, Z. Dai, H. Yang, and J. Tang, “Multi-task-oriented vehicular
crowdsensing: A deep learning approach,” in Proc. IEEE Conf. Comput.
Commun. (INFOCOM), 2020, pp. 1123–1132.
[17] H. Shen, Y. Tian, T. Wang, and G. Bai, “Slicing-based task offloading
in space-air-ground integrated vehicular networks, IEEE Trans. Mobile
Comput., early access, doi: 10.1109/TMC.2023.3283852.
[18] J. Han, Z. Zhang, and X. Wu, “A real-world-oriented multi-task allo-
cation approach based on multi-agent reinforcement learning in mobile
crowd sensing, Information, vol. 11, no. 2, 2020, Art. no. 101.
[19] Q. Qi et al., “Scalable parallel task scheduling for autonomous driving
using multi-task deep reinforcement learning,” IEEE Trans. Veh. Tech-
nol., vol. 69, no. 11, pp. 13861–13874, Nov. 2020.
[20] B. Zhao, H. Dong, and D. Yang, “A spatio-temporal task allocation
model in mobile crowdsensing based on knowledge graph, Smart Cities,
vol. 6, no. 4, pp. 1937–1957, 2023.
[21] X. Tao and A. S. Hafid, “DeepSensing: A novel mobile crowdsensing
framework with double deep Q-network and prioritized experience
replay, IEEE Internet Things J., vol. 7, no. 12, pp. 11547–11558,
Dec. 2020.
[22] L. Li, H. Xu, J. Ma, A. Zhou, and J. Liu, “Joint EH time and transmit
power optimization based on DDPG for EH communications,” IEEE
Commun. Lett., vol. 24, no. 9, pp. 2043–2046, Sep. 2020.
[23] S. Fujimoto, H. Hoof, and D. Meger, Addressing function approxima-
tion error in actor-critic methods,” in Proc. Int. Conf. Mach. Learn.,
2018, pp. 1587–1596.
[24] B. Zhao, H. Dong, Y. Wang, and T. Pan, “PPO-TA: Adaptive task
allocation via proximal policy optimization for spatio-temporal crowd-
sourcing,” Knowl. Based Syst., vol. 264, 2023, Art. no. 110330.
[25] P. Zhao, X. Li, S. Gao, and X. Wei, “Cooperative task assignment
in spatial crowdsourcing via multi-agent deep reinforcement learning,”
J. Syst. Archit., vol. 128, 2022, Art. no. 102551.
[26] Y. Ma, Z. Bi, Z. Yin, and A. Chai, “Research and implementation of a
real-time task dynamic scheduling model based on reinforcement learn-
ing,” in Proc. Int. Conf. Intell. Comput. Technol. Automat. (ICICTA),
2020, pp. 717–722.
[27] A. Bjorklund, “Determinant sums for undirected Hamiltonicity,” SIAM
J. Comput., vol. 43, no. 1, pp. 280–299, 2014.
[28] M. Min et al., “Geo-perturbation for task allocation in 3-D mobile
crowdsourcing: An A3C-based approach, IEEE Internet Things J.,
vol. 11, no. 2, pp. 1854–1865, Jan. 2024.
[29] J. Jin and Y. Xu, “Optimal policy characterization enhanced proximal
policy optimization for multitask scheduling in cloud computing,” IEEE
Internet Things J., vol. 9, no. 9, pp. 6418–6433, May 2022.
[30] S. Huang and S. Ontañón, “A closer look at invalid action masking in
policy gradient algorithms,” in Proc. Int. FLAIRS Conf. Proc., vol. 35,
May 2022.
[31] C. J. Maddison, D. Tarlow, and T. Minka, “A* sampling, in Proc. Adv.
Neural Inf. Process. Syst., vol. 27, pp. 3086–3094, 2014.
[32] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
“Deterministic policy gradient algorithms,” in Proc. Int. Conf. Mach.
Learn., 2014, pp. 387–395.
[33] D. Li, J. Zhu, and Y. Cui, “Prediction-based task allocation in mobile
crowdsensing, in Proc. Int. Conf. Mobile Ad-Hoc Sensor Netw. (MSN),
2019, pp. 89–94.
[34] J. Jin and Y. Xu, “Optimal policy characterization enhanced proximal
policy optimization for multitask scheduling in cloud computing,” IEEE
Internet Things J., vol. 9, no. 9, pp. 6418–6433, May 2022.
[35] A. Tampuu et al., “Multiagent cooperation and competition with
deep reinforcement learning,” PLoS One, vol. 12, no. 4, 2017,
Art. no. e0172395.
[36] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep
learning: An in-depth concurrency analysis,” ACM Comput. Surveys,
vol. 52, no. 4, pp. 1–43, 2019.
[37] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948.
Tianjing Wang (Member, IEEE) received the B.Sc. degree in mathematics from Nanjing Normal University, Nanjing, China, in 2000, the M.Sc. degree in mathematics from Nanjing University, Nanjing, China, in 2002, and the Ph.D. degree in signal and information system from Nanjing University of Posts and Telecommunications (NUPT), Nanjing, China, in 2009.
From 2011 to 2013, she was a Full-Time Postdoctoral Fellow with the School of Electronic Science and Engineering, NUPT. From 2013 to 2014, she was a Visiting Scholar with the Department of Electrical and Computer Engineering at the State University of New York, Stony Brook, NY, USA. She is an Associate Professor with the Department of Communication Engineering, Nanjing Tech University, Nanjing, China. Her research interests include mobile crowdsensing, cellular V2X communication networks, and distributed machine learning for multimedia networking. She has published research papers in prestigious international journals and conferences, including IEEE TRANSACTIONS ON MOBILE COMPUTING, IEEE TRANSACTIONS ON BROADCASTING, Journal of Systems Architecture, Multimedia Systems, Peer-to-Peer Networking and Applications, IEEE ICC, and IEEE ISCC.
Yu Zhang received the B.M. degree in engineering management from Beijing University of Civil Engineering and Architecture, Beijing, China. He is currently working toward the M.S. degree in computer science with Nanjing Tech University, Nanjing, China.
His research interests include combinatorial optimization, deep reinforcement learning, and its applications in crowdsensing.
Hang Shen (Member, IEEE) received the Ph.D. degree (with honors) in computer science from Nanjing University of Science and Technology, in 2015.
He worked as a Full-Time Postdoctoral Fellow with the Broadband Communications Research (BBCR) Lab, ECE Department, University of Waterloo, Waterloo, ON, Canada, from 2018 to 2019. He is an Associate Professor with the Department of Computer Science and Technology, Nanjing Tech University, Nanjing, China. His research interests involve mobile crowdsensing, vehicular networks, cybersecurity, and privacy computing.
Dr. Shen serves as an Associate Editor for Journal of Information Processing Systems and IEEE ACCESS. He was a Guest Editor for Peer-to-Peer Networking and Applications and a TPC member of the 2021 Annual International Conference on Privacy, Security and Trust (PST). He is a Senior Member of CCF and an Executive Committee Member of the ACM Nanjing Chapter.
Guangwei Bai received the B.Eng. and M.Eng. degrees in computer engineering from Xi’an Jiaotong University, Xi’an, China, in 1983 and 1986, respectively, and the Ph.D. degree in computer science from the University of Hamburg, Hamburg, Germany, in 1999.
From 1999 to 2001, he worked as a Research Scientist with the German National Research Center for Information Technology, Germany. In 2001, he joined the University of Calgary, Calgary, AB, Canada, as a Research Associate. Since 2005, he has been working as a Professor in computer science with Nanjing Tech University, Nanjing, China. From October to December 2010, he was a Visiting Professor with the ECE Department at the University of Waterloo, Waterloo, ON, Canada. His research interests include architecture and protocol design for future networks, QoS provisioning, cybersecurity, and privacy computing. He has authored and coauthored more than 70 peer-reviewed papers in international journals and conferences, including IEEE TRANSACTIONS ON MOBILE COMPUTING, IEEE TRANSACTIONS ON BROADCASTING, IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, Performance Evaluation, Ad Hoc Networks, Journal of Systems Architecture, Multimedia Systems, Computer Communications, IEEE ICC, and IEEE LCN.
Dr. Bai is an ACM member and a CCF Distinguished Member.