Abstract

Task assignment is a key issue in mobile crowdsensing (MCS). Previous task assignment methods have mainly been static, offline assignments. However, in the actual assignment process, the MCS platform must handle dynamically arriving workers and tasks online, so a reliable dynamic assignment strategy is crucial to improving the platform's efficiency. This paper proposes an MCS dynamic task assignment framework to solve the task maximization assignment problem with spatiotemporal properties. First, a single worker is modeled as a Markov decision process, and a deep reinforcement learning algorithm (DDQN) is used to learn offline from historical task data. Then, during dynamic assignment, we consider the impact of current decisions on future decisions: a maximum flow model is used to maximize the number of tasks completed in each period while maximizing the expected Q value of all workers, thereby achieving an optimal global assignment. Experiments show that the proposed strategy performs well compared with baseline strategies under different conditions.

1. Introduction

With the development of technologies such as 5G and microsensors, MCS has been widely used in different fields, such as intelligent transportation systems [1], environmental monitoring [2], and public safety [3]. A typical MCS system usually consists of a mobile sensing platform, workers, and task issuers. The task issuer posts tasks on the platform, the platform distributes the collected perception tasks to workers, and the workers complete the perception tasks and get paid. Task assignment is a key issue in MCS. Current research can be divided into two categories according to the level of worker participation: opportunistic task assignment and participatory task assignment. In opportunistic task assignment, workers do not need to change their original trajectories, and the platform selects workers offline based on predictions of worker mobility. For example, the literature [4–7] proposes different strategies for selecting a predefined number of workers to maximize the perceived quality of the task. In participatory task assignment, workers need to generate their moving routes according to the tasks assigned by the platform, which requires candidates to report their real-time locations continuously, and the system selects workers online. If a worker is selected and assigned to a specific task location, he or she changes the original route, moves to the specified location, and receives the corresponding reward. Different studies [8–11] use different reward models for workers.

In real scenarios, the platform's task assignment is constrained in time and space and exhibits characteristics such as dynamics, strong randomness, and multiple stages. Most existing MCS task assignment research has the following problems:
(1) In MCS, workers need to collect task data at a precise location and time, so the impact of worker and task spatiotemporal information on platform task assignment must be considered.
(2) In the actual assignment process, tasks and workers change dynamically, and the platform only knows local spatiotemporal information. Therefore, the platform needs to mine historical spatiotemporal information to optimize its task assignment strategy.
(3) Due to spatiotemporal continuity, the platform's current decision affects not only the current assignment result but also subsequent assignment results. Most existing assignment strategies are limited to the optimal assignment within a single stage, while the platform's optimization goal is the globally optimal assignment over the entire period.

In order to solve the above problems, this paper designs a dynamic participatory task assignment framework. The framework models a single worker as a Markov decision process and introduces a deep reinforcement learning model to support the platform's decision-making in each period. In building the deep reinforcement learning model, each decision the platform makes for a single worker estimates the state-action value function within the worker's current perception range and guides the worker toward places where future tasks are denser. We use the DDQN [12] model to train effectively on historical data, and the generated Q network assists the platform's online assignment decisions. At the same time, this paper designs a dynamic MCS task assignment framework that fully considers the spatiotemporal properties and randomness of tasks and workers. In each period, there is a set of executable tasks within each worker's perception range, and the platform maximizes the total number of tasks completed by the workers across all periods through reasonable task assignment. Specifically, based on the maximum flow model [13] and combined with the auxiliary decisions of the Q network, this paper realizes dynamic online assignment over multiple periods and reaches the global maximum number of completed tasks on the platform. In summary, this paper's dynamic task assignment framework has the following characteristics:
(1) Before task assignment begins, a single worker is modeled as a Markov decision process. We first use the DDQN model for offline training on the historical task set in crowdsensing task assignment to generate a Q network for real-time prediction.
(2) When performing online dynamic task assignment, the predicted value of the Q network and the impact of current decisions on future decisions are considered to achieve a globally optimal task assignment while the platform completes the largest number of tasks in each period.
(3) According to the characteristics of MCS dynamic task assignment, this paper thoroughly mines the spatiotemporal information of historical data, establishes an MCS dynamic task assignment framework based on deep reinforcement learning, and verifies the good performance of this framework through experiments.

The goal of task assignment is to assign perception tasks to eligible users, and a large body of work has studied task assignment algorithms for traditional MCS [14, 15]. Liu et al. [16] considered the comprehensive sensing quality of MCS to optimize the utility of the whole system. Xiao et al. [17] proposed a task assignment scheme with independent perception and cooperative perception, where the optimization goal is to minimize the average completion time of tasks. Yang et al. [18] maximize the information obtained by task performers under budget constraints. Liu et al. [19] proposed two dual-objective multitask assignment models, FPMT and MPFT, and presented corresponding solving algorithms. Zhang et al. [20] proposed a hybrid perception task model with the goal of maximizing task completion and perception coverage. Zhang et al. [21] used a greedy heuristic algorithm and a genetic algorithm to solve two optimization objectives in a vehicular crowdsensing system. Li et al. [22] established an optimization model for crowdsourcing task assignment in heterogeneous spaces that maximizes task coverage while minimizing incentive cost, and designed a greedy swarm intelligence optimization algorithm to solve the two objectives.

In studies where workers select a single MCS task, task assignment has specific goals and constraints. For example, the literature proposes different recruitment strategies to select a predefined number of workers to maximize the perceived quality of the task [4–6]. Zhang et al. [23] choose a minimum number of workers to ensure a certain degree of perceived quality. Ji et al. [24] propose a socially aware mobile crowdsensing system and design an improved MOEA/D algorithm to obtain the Pareto-optimal solution set. As the number of MCS tasks increases and tasks become interrelated, some studies consider the overall utility of multiple concurrent perception tasks. For example, Song et al. [25] and Wang et al. [26] both proposed multitask assignment algorithms that maximize the system's overall utility when tasks share a limited incentive budget. The multitask assignment strategy proposed by Wang et al. [27] aims to optimize the overall utility when multiple tasks share a constrained worker pool.

Some studies collect location-related data centrally without any time constraints. Shah-Mansouri and Wong [28] used an auction mechanism to maximize the profit of a single-task platform without a time limit while providing satisfactory rewards for workers. Zhou et al. [29] used a single-objective optimization method with budget constraints to maximize task quality efficiency while considering worker reputation, without time constraints. However, in practice, users need to complete tasks before certain deadlines, so the impact of time on task assignment must be considered. Some recent studies have focused on centralized task assignment schemes with time constraints. For example, Cheung et al. [30] considered the collection of time-sensitive and location-dependent perception data by multiple users and proposed a distributed algorithm to help users determine their task choices and mobility plans. Estrada et al. [31] studied the trade-off between quality, budget, and time constraints of perception tasks over a period of time and provided a service computing framework for task assignment with time and location constraints.

In recent years, reinforcement learning has been used in MCS to make a range of decisions in uncertain environments. Ji et al. [32] construct a dynamic task allocation model and propose a Q-learning-based hyperheuristic evolutionary algorithm to maximize the average perceived quality of all tasks in each period. Akter et al. [33] proposed a deep Q-learning-based algorithm to determine the assignment of tasks and workers and iteratively used the asymmetric travelling salesman problem (ATSP) heuristic to find the task completion order of workers. Wang et al. [34] proposed a privacy-enhanced multiregional task assignment strategy (PMTA) for Healthcare 4.0 that uses deep differential privacy, a deep Q network, federated learning, and blockchain to effectively protect the privacy of tasks and patients and obtain better system performance. Tao and Song [35] used deep reinforcement learning to find a more efficient task assignment solution and applied DDQN to solve the task assignment problem with time windows. Wang et al. [36] proposed a blockchain-based secure data aggregation strategy (BSDA) for edge computing-enhanced IoT, which adopts three mechanisms to prevent privacy leakage and develops a deep reinforcement learning method for energy-efficient data aggregation. Han et al. [37] proposed a real-world-oriented multitask assignment method based on multiagent reinforcement learning; it fully considers worker and task heterogeneity and builds on an improved soft Q-learning method that lets workers, as agents, learn multiple solutions independently, optimizing the perceived quality of tasks. The above literature mainly applies reinforcement learning to task assignment optimization with time windows, where the platform finds the optimal assignment strategy offline after knowing all the information about workers and tasks. However, in actual task assignment, workers and tasks change dynamically. This paper instead uses reinforcement learning in crowdsensing task assignment to train offline on historical data and to support the platform's decision-making during dynamic assignment.

3. Model Description and Problem Definition

Assume that the MCS platform has a batch of workers and tasks with spatiotemporal attributes, and the platform needs to assign appropriate tasks to each worker dynamically. To simplify the model, this paper divides the platform's entire task assignment horizon into equal small periods. The workers over all periods form the worker set W, and each worker has the following attributes: the worker's current latitude and longitude coordinates and the period in which the worker joined the platform. After joining, a worker continues to participate in task assignment until reaching its deadline, so each worker is also associated with the period it is currently in. The tasks over all periods form the task set L, and each task has the following attributes: the task's latitude and longitude, the period in which the task is posted to the platform, and the maximum number of different workers that may complete it. In this article, tasks are time-sensitive and need to be completed within the period in which they are posted.
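For concreteness, the worker and task records described above can be represented as in the following minimal sketch; the field names (lat, lon, join_period, deadline, post_period, max_workers) are illustrative assumptions rather than the paper's notation.

from dataclasses import dataclass

@dataclass
class Worker:
    lat: float          # current latitude
    lon: float          # current longitude
    join_period: int    # period in which the worker joined the platform
    deadline: int       # last period in which the worker participates

@dataclass
class Task:
    lat: float          # task latitude
    lon: float          # task longitude
    post_period: int    # period in which the task is posted
    max_workers: int    # maximum number of different workers that may complete it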

Unlike traditional task assignment, tasks and workers are dynamic in the assignment process of this paper, and the platform only has information on the workers and tasks of the current period: the worker set and task set of each period are only revealed when that period begins. Each worker can be assigned at most one task in each period; an assignment indicator equals 1 if the worker is assigned a task in the current period and 0 otherwise. Figure 1 shows the system model of dynamic task assignment in each period. First, the platform obtains information about the workers and tasks of the current period. Then, based on the current spatiotemporal information of workers and tasks, the platform assigns each worker a task within the maximum perception radius, and the assignment result is returned to the worker. After receiving the task instruction, the worker moves to the task location, completes the task within this period, and uploads the perception information to the platform. Finally, the platform returns the perception result to the task issuer, updates the spatiotemporal information of workers and tasks, and enters the next period's assignment. In particular, to simplify the model, this paper considers the ideal case in which the tasks appearing in each period are time-sensitive and the value of a task is much larger than the cost of the worker moving within its perception range. Therefore, we do not consider worker cost and assume that all tasks within a worker's perception range are of equal value, while the main goal of the platform is to guide workers to complete more tasks. If no task can be assigned to a worker within its perception range in the current period, the worker stays at the original location or randomly moves a short distance within the perception range, waiting for the next period's assignment.

Therefore, we define the cumulative number of tasks completed by the platform as the sum of the number of tasks completed in multiple periods throughout the assignment:

In Equation (1), the outer sum runs over all periods, the inner sum runs over the workers present in each period, and the summand indicates whether a given worker completes a task in that period. In the actual assignment process, the platform needs to consider the positions of workers and tasks in real time. To ensure that workers can complete perception tasks within a period, this paper defines a maximum task perception radius for workers:

The distance between the worker and the task in Equation (2) is obtained from their latitude and longitude coordinates according to the haversine formula:
$d(w, l) = 2R \arcsin\left( \sqrt{ \sin^2\frac{\varphi_2 - \varphi_1}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\frac{\lambda_2 - \lambda_1}{2} } \right)$,
where $R$ is the radius of the Earth and $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ are the latitude and longitude coordinates of the worker and the task, respectively.
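As a concrete illustration of the distance computation and the perception-radius check, a small Python sketch might look as follows; the function and constant names are ours, and the worker/task fields follow the hypothetical records introduced earlier.

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius R

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two latitude/longitude points, in kilometres.
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_perception_range(worker, task, r_max_km):
    # Spatial feasibility check: the task must lie inside the worker's
    # maximum perception radius.
    return haversine_km(worker.lat, worker.lon, task.lat, task.lon) <= r_max_km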

The goal of MCS is to maximize the total number of completed tasks, from which we obtain the optimization objective and spatiotemporal constraints for dynamic task assignment:

Subject to:

Equation (4) indicates that the optimization goal of the platform is to maximize the number of completed tasks over all periods. Equation (5) states that the number of workers who complete a task must not exceed the task's maximum allowed number of workers. Equation (6) is the time constraint: a worker must arrive at the task location within the period in which the task is posted in order to complete it. Equation (7) is the space constraint: in each period, the distance between a worker and its assigned task must not exceed the maximum perception radius.
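In symbols, the optimization problem described above can be written roughly as follows. The notation is assumed here for illustration and is not necessarily the paper's: $h$ periods, $m_t$ workers in period $t$, $x_i^t \in \{0,1\}$ indicating that worker $i$ completes a task in period $t$, $y_{i,j}^t \in \{0,1\}$ indicating that worker $i$ is assigned task $j$ in period $t$, $\mathrm{maxA}_j$ the quota of task $j$, $t_j$ its release period, and $R_{\max}$ the maximum perception radius.

\max \sum_{t=1}^{h} \sum_{i=1}^{m_t} x_i^{t}
\quad \text{s.t.} \quad
\sum_{t}\sum_{i} y_{i,j}^{t} \le \mathrm{maxA}_j \;\; \forall j, \qquad
y_{i,j}^{t} = 1 \Rightarrow t = t_j, \qquad
y_{i,j}^{t} = 1 \Rightarrow d(w_i, l_j) \le R_{\max}.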

4. Problem Solving

The most important feature of dynamic task assignment is that the platform does not have complete spatiotemporal information about workers and tasks over the whole horizon, so the current decision will affect future decisions. Therefore, we treat the dynamic task assignment problem as a multistage sequential decision-making problem. First, a Markov decision process is modeled for a single worker; then the improved DDQN algorithm is used to train on historical data and generate a Q network. Finally, the overall framework of MCS dynamic task assignment is designed to achieve reasonable task matching in all periods.

4.1. Markov Decision Process Modeling from a Single Worker Perspective

A Markov decision process (MDP) is a discrete-time stochastic control process that provides a mathematical model for decision-making problems. Most of the literature conducts MDP modeling on the entire MCS platform [31, 33]; however, this will make the model complex, and it is difficult to fully learn the information in the environment. In this paper, MDP modeling of task assignment is carried out from the perspective of a single worker, and a single worker is regarded as an agent, which significantly simplifies the definitions of state transitions, actions, and rewards. The relevant definitions are as follows:

State is defined as the information of a single worker in the system, namely, the worker's latitude and longitude coordinates and the worker's current period. When the current period exceeds the worker's deadline, the state is a termination state. In particular, when training the Q network, in order to better explore the historical environment and reduce the influence of the randomness of the initial position, the agent is not limited by the perception range in the initial state. The initial state is represented by a virtual location label, and the agent can select any task in the initial period, so that it can fully explore the expected value of historical tasks.

Action is defined as the task information assigned to the worker by the platform. Since the worker can only complete a task within the current period, we set the action to the position of the task, that is, the task's latitude and longitude coordinates. The feasible action search of the agent is shown in Figure 2: (a) in a given state, all tasks within the agent's perception range are feasible actions; (b) if there is no executable task within the perception range, a virtual task (virtual action) with a reward of 0 is constructed, and the worker stays in place or moves to any point within the perception range, waiting for the next period's assignment; (c) in particular, in the initial state used when training the Q network, all tasks in the current period are feasible actions of the agent.

The worker goes to the task location, performs the task, and transitions to the next state, whose coordinates are the latitude and longitude of the task and whose period is incremented by one. Reward represents the worker's benefit from completing an action in a given state. In this paper, we assume that all tasks are equivalent and want workers to travel to task-dense regions to avoid having no feasible tasks within the perception range in the following period. Therefore, the reward is set to a constant whose magnitude reflects the number of tasks within the worker's perception range. In particular, the reward is 0 when the virtual action is taken. In the initial state, since the agent is not limited by the perception range, the rewards of all actions are equal.
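A minimal sketch of the feasible-action search in Figure 2, reusing the hypothetical Worker/Task records and the haversine_km helper from the earlier sketches; the virtual action is represented by None here, and the unrestricted initial state of training (case (c)) would simply return all tasks of the initial period.

VIRTUAL_ACTION = None  # zero-reward virtual task: stay put or drift within range

def feasible_actions(worker, period_tasks, r_max_km):
    # Case (a): every task within the perception radius is a feasible action.
    candidates = [
        task for task in period_tasks
        if haversine_km(worker.lat, worker.lon, task.lat, task.lon) <= r_max_km
    ]
    # Case (b): no reachable task, so fall back to the virtual action.
    return candidates if candidates else [VIRTUAL_ACTION]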

The state-action value function $Q^{\pi}(s, a)$ represents the expectation of the total benefit that the worker can obtain in the future by taking action $a$ in state $s$:
$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{K} \gamma^{k} r_{k} \,\middle|\, s_{0}=s, a_{0}=a\right]$,
where $K$ is the number of steps from state $s$ to the terminal state and $\gamma$ is the discount coefficient of future rewards.

Policy refers to a probability distribution over actions in a given state. In this paper, model-free reinforcement learning is used to learn the optimal policy through the interaction between the agent and the environment so as to maximize the expected cumulative return. The greedy policy with respect to a learned $Q$ is given by $\pi(s) = \arg\max_{a} Q(s, a)$.

4.2. Offline Q Network Training Based on Improved DDQN

Based on the definition of the MDP, this paper adopts a reinforcement learning algorithm to train on the historical data. Q-learning [38] is a widely used reinforcement learning method that estimates the value function for each state-action pair with a value table. For any state $s$ and action $a$, Q-learning predicts the value of the state-action pair by iteratively applying
$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$,
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r$ is the reward for the transition from state $s$ to $s'$ after taking action $a$; $\max_{a'} Q(s', a')$ is the largest value function among all possible actions in the new state $s'$.

Since dynamic task assignment involves a continuous, spatiotemporal state space and action space, the dimension of the state space is large. This paper adopts a single-worker Markov decision process solution based on the DDQN [12] model and approximates the value function in Q-learning with a deep neural network to find the optimal policy. In terms of network architecture, traditional DQN assumes a small discrete action space, using the state as input and multiple outputs corresponding to the values of fixed actions, as shown in Figure 3(a). Because the action space in the historical training data is huge and changes continuously over time, the actions cannot be enumerated. Therefore, we modify the network structure of DDQN following literature [39]; Figure 3(b) shows the structure of the deep neural network. The input consists of the state and the action: the state contains the worker's latitude and longitude coordinates and the period, and the action contains the task's latitude and longitude coordinates. Following literature [40], the hidden layers use a three-layer fully connected network with 64, 128, and 16 neurons, respectively. The output is the state-action value.
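A possible PyTorch realization of the network in Figure 3(b), under the stated layer sizes (64, 128, 16) and ReLU activations; the exact input encoding, the output head, and any normalization are assumptions of this sketch.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Estimates Q(s, a) for a state (lat, lon, period) and an action (lat, lon).
    def __init__(self, state_dim: int = 3, action_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 16), nn.ReLU(),
            nn.Linear(16, 1),  # scalar state-action value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))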

This paper uses the interaction of a single agent with the historical data environment to explore the expected value of tasks at different locations. The agent starts from the virtual initial state, generates the sample sequences required for training the neural network by interacting with the environment of the historical task set, and stores them in the replay memory. When the replay memory is full, a minibatch of data is randomly selected for training. In the Q network training framework based on the DDQN algorithm, the minibatch update via backpropagation is essentially a step in solving a regression problem with the loss function (Equation (11))
$L(\theta) = \mathbb{E}\left[(y - Q(s, a; \theta))^{2}\right]$,
where the target value $y$ (Equation (12)) is calculated from the DDQN target
$y = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^{-}\right)$,
with $\theta$ the parameters of the evaluation network and $\theta^{-}$ the parameters of the target network.

Finally, after training is complete, the worker's state-action pair is used as input, and the predicted state-action value is output. Combining the characteristics of MCS dynamic task assignment, we design a Q network training algorithm based on the DDQN model. The specific steps are as follows:

Input: Historical data set H, replay memory D, maximum training episodes N, a constant Z, initialized evaluation network Q and target network
Output: Evaluation network Q
 1: for episode from 1 to N do
 2:  Initialize the worker state s to the virtual initial state
 3:  while s is not the termination state do
 4:   if s is the initial state then
 5:    Take the tasks whose period equals the initial period in the historical data set H as the action set of the current state s
 6:   else
 7:    Obtain the action set of the current state s from the historical data set H according to the spatiotemporal constraints of Equation (6) and Equation (7)
 8:   end if
 9:   if the action set of s is empty then
 10:   The worker executes the virtual task, the state transitions to the next state s', and the reward is 0
 11:   Store (s, a, r, s', done) in the replay memory D, where s' is the next state, r is the reward, and done indicates whether s' is the termination state
 12:  else
 13:   Take s and each feasible action as input to the evaluation network Q to get the value of each state-action pair
 14:   Use the ε-greedy method to select an action according to the output Q values
 15:   Obtain the reward r, the next state s', and the termination flag done according to the selected action, and store (s, a, r, s', done) in the replay memory D
 16:  end if
 17:  if replay memory D is full then
 18:   Overwrite the oldest record in D and randomly sample a minibatch for learning
 19:   Calculate the target value y according to Equation (12)
 20:   Update the evaluation network Q parameters by gradient descent on the loss function of Equation (11)
 21:   Update the target network parameters every Z steps
 22:  end if
 23:  s = s'
 24: end while
 25: end for

Algorithm 1 first selects the historical dataset as the task environment and has all the spatiotemporal information of this environment during training. The worker starts from the initial state, explores the historical task environment, generates training tuples (s, a, r, s', done), and puts them into the replay memory D. After the replay memory is full, the neural network is trained and the replay memory is updated. When the worker reaches the deadline, the state becomes terminal and the next episode begins. Through continuous training, the algorithm converges and outputs the evaluation network Q.
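As a rough illustration of the update in steps 17-21 (Equations (11) and (12)), one possible DDQN gradient step is sketched below. The replay-tuple layout and helper names are assumptions, and the loop over samples (rather than a batched forward pass) reflects the fact that each transition has its own candidate action set.

import random
import torch
import torch.nn.functional as F

def ddqn_update(q_net, target_net, optimizer, replay, batch_size=64, gamma=0.9):
    # Randomly sample a minibatch of (s, a, r, s_next, next_actions, done) tuples.
    batch = random.sample(replay, batch_size)
    losses = []
    for s, a, r, s_next, next_actions, done in batch:
        q_sa = q_net(s, a).squeeze()
        with torch.no_grad():
            if done or not next_actions:
                y = torch.tensor(float(r))
            else:
                # Double DQN target: the online network selects the next action,
                # the target network evaluates it.
                q_online = torch.stack([q_net(s_next, a2).squeeze() for a2 in next_actions])
                best = int(torch.argmax(q_online))
                y = r + gamma * target_net(s_next, next_actions[best]).squeeze()
        losses.append(F.mse_loss(q_sa, y))
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()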

4.3. A Dynamic Task Matching Strategy for Multiple Workers in Each Period

In the actual task assignment process, there may be multiple workers and tasks in each period. The optimization goal of the MCS platform is to maximize the number of completed tasks, which can be achieved with the maximum flow model [13] in each period. However, due to the continuity of time and space, the platform's current task assignment decision affects subsequent assignment results, and the traditional maximum flow algorithm may fall into a local optimum for the current period. Therefore, we combine the prediction values of the Q network and construct the worker-task matching of each period as a maximum-Q-value maximum flow model to achieve optimal matching over the entire horizon. Figure 4 shows the network model. The first layer and the last layer are the source node and the tail node, respectively, and the middle layers contain the platform's worker nodes and task nodes of the current period. Each edge has two attributes: capacity and cost. For the edges between the source node and the worker nodes, the capacity of each edge is 1, reflecting that each worker is assigned at most one task per period; the cost of each edge is 0, because these edges are only used for inflow. For the edges between worker nodes and task nodes, the platform adds an edge for each task within a worker's maximum perception range under the current spatiotemporal information; the capacity of each edge is 1, and the cost of each edge is the Q value predicted for the corresponding worker-task pair. For the edges between task nodes and the tail node, the capacity of each edge equals the task's maximum allowed number of workers, reflecting the maximum number of different workers that can complete the task; the cost of each edge is 0, because these edges are only used for outflow.

This paper proposes a maximum-value maximum flow matching strategy (MaxflowQ) for each period, as shown in Algorithm 2. While matching workers to tasks with maximum flow in each period, the platform maximizes the sum of the workers' Q values, so as to achieve overall optimization of the dynamic assignment. Algorithm 2 consists of two parts: constructing the flow network of the model and finding the optimal solution in the flow network. The platform constructs a flow network, as shown in Figure 4, based on the worker set of the current period, the task set within the perception range of each worker, and the workers' Q values, where each edge carries a capacity and a cost. Subsequently, the optimal solution is found in the flow network. The procedure is as follows: first, initialize the flow graph; then greedily select the augmenting path with the maximum Q value from the source node to the tail node in the residual network and increase the flow along it, until there are no additional augmenting paths in the residual network. Finally, the algorithm outputs the worker-task matching set of the current period.

Input: The platform worker set of the current period, the task set within the perception range of each worker, the Q values of the workers.
Output: Worker-task matching set of the current period.
 1: Construct the flow graph from the worker set, the reachable task sets, and the Q values
 2: Initialize flow f to 0
 3: while there exists an augmenting path in the residual network do
 4:  Select an augmenting path with the largest Q value
 5:  Compute the residual capacity of the selected path
 6:  Augment flow f along the path
 7:  Update the residual network
 8:  Save the worker-task matches on the path
 9:  Update the matching set
 10: end while
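The same objective can also be expressed as a min-cost max-flow problem and handed to an off-the-shelf solver instead of the hand-written augmenting-path search above. The sketch below uses networkx and encodes "maximize total Q among maximum flows" as minimizing integer-scaled negative Q costs; the argument conventions (reachable, q_values, quotas) are our own and this is a substitute for, not a transcription of, Algorithm 2.

import networkx as nx

def maxflow_q_match(workers, reachable, q_values, quotas):
    # workers: active workers of the period; reachable[i]: task indices within
    # worker i's perception range; q_values[(i, j)]: predicted Q value for the
    # worker-task pair; quotas[j]: maximum number of workers allowed for task j.
    G = nx.DiGraph()
    for i, _ in enumerate(workers):
        G.add_edge("source", f"w{i}", capacity=1, weight=0)
        for j in reachable[i]:
            # Maximizing Q value on a max flow == minimizing negated, scaled cost.
            G.add_edge(f"w{i}", f"t{j}", capacity=1,
                       weight=-int(round(1000 * q_values[(i, j)])))
    for j, quota in enumerate(quotas):
        if G.has_node(f"t{j}"):
            G.add_edge(f"t{j}", "sink", capacity=quota, weight=0)
    if "source" not in G or "sink" not in G:
        return []  # nothing to match in this period
    flow = nx.max_flow_min_cost(G, "source", "sink")
    return [(i, j) for i, _ in enumerate(workers)
            for j in reachable[i] if flow[f"w{i}"].get(f"t{j}", 0) == 1]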
4.4. MCS Dynamic Task Assignment Solution Framework

Through the above discussion, we construct the overall framework to solve the dynamic task assignment of the MCS platform, as shown in Figure 5.

The task assignment system of the MCS platform consists of two modules: the Q network module and the dynamic task assignment module. Before task assignment starts, the MCS platform trains offline on historical data according to Algorithm 1 to generate a Q network for decision support. During task assignment, the MCS platform uses the dynamic task assignment module to match workers and tasks in real time in each period, in the following three steps: (1) The platform first obtains the spatiotemporal information of the worker set and task set of the current period and, based on this information, searches for the task set within each worker's perception range. (2) The platform generates each worker's state-action pairs according to that worker's feasible task set. The Q network module is then called to generate Q values for all actions in the worker's current state, which are used to assist the platform's task assignment decisions. (3) Finally, the platform constructs a flow network according to Algorithm 2 and generates the optimal worker-task matching for the current period. After receiving the task information sent by the platform, the workers move to the corresponding locations to complete the tasks and enter the next period's assignment. When the deadline is reached, the platform completes the assignment and exits the dynamic task assignment framework. From this, we formulate the dynamic task assignment of the MCS platform as Algorithm 3.

Input: Task set L, worker set W, evaluation network Q
Output: Completed task set WL
 1: for period t from 1 to h do
 2:  Obtain the worker set and the task set of the current period
 3:  for each worker in the current worker set do
 4:   Obtain the worker's feasible task set according to the spatiotemporal constraints of Equation (6) and Equation (7)
 5:   if the feasible task set is empty then
 6:    The worker stays in place or moves a short random distance within the perception range, waiting for the next period's assignment
 7:    Continue
 8:   end if
 9:   if the worker is in its initial state then
 10:   Set the location in the worker's Q network input to the virtual location label
 11:  end if
 12:  Generate the worker's state-action pairs
 13:  Input the state-action pairs to the Q network generated by Algorithm 1 and obtain the worker's Q values
 14: end for
 15: Generate the worker-task matching set of the current period according to Algorithm 2
 16: Update worker and task information
 17: Proceed to the next period's assignment
 18: end for
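Putting the pieces together, a compact sketch of Algorithm 3's per-period loop might look as follows, reusing the hypothetical helpers from the earlier sketches (haversine_km, QNetwork, maxflow_q_match); input normalization and the virtual initial-state handling are omitted for brevity.

import torch

def run_dynamic_assignment(workers, tasks_by_period, q_net, r_max_km, num_periods):
    completed = 0
    for t in range(1, num_periods + 1):
        active = [w for w in workers if w.join_period <= t <= w.deadline]
        period_tasks = tasks_by_period.get(t, [])
        reachable, q_values = [], {}
        for i, w in enumerate(active):
            idxs = [j for j, task in enumerate(period_tasks)
                    if haversine_km(w.lat, w.lon, task.lat, task.lon) <= r_max_km]
            reachable.append(idxs)
            for j in idxs:
                state = torch.tensor([w.lat, w.lon, float(t)])
                action = torch.tensor([period_tasks[j].lat, period_tasks[j].lon])
                q_values[(i, j)] = float(q_net(state, action))
        quotas = [task.max_workers for task in period_tasks]
        matches = maxflow_q_match(active, reachable, q_values, quotas)
        for i, j in matches:  # matched workers move to their task locations
            active[i].lat, active[i].lon = period_tasks[j].lat, period_tasks[j].lon
        completed += len(matches)
    return completed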

5. Experiments and Analysis

5.1. Experimental Dataset Selection and Processing

To verify the effectiveness of the framework, two real datasets, the Gowalla dataset [41] and the Foursquare dataset [42], were selected for model training and testing. The Gowalla dataset contains check-in records from a large number of users; each record includes user ID, check-in time, latitude, longitude, and location ID. The Foursquare dataset contains check-ins collected in New York City and Tokyo; each check-in includes user ID, venue ID, venue category ID, venue category name, latitude, longitude, time zone offset in minutes, and UTC time.

In the experiment, two months of check-in data from 9:00 to 12:00 in New York City were selected from each dataset as the training set, and data from different days after the training period were used as the test sets. The latitude and longitude of the check-in records are treated as the locations of MCS tasks, and a random maximum number of allowed workers is generated for each task. To simplify the model, the check-in data of the Gowalla dataset and the Foursquare dataset are divided into 24 and 40 equal periods, respectively, and duplicate check-ins at the same place within the same period are deleted. In addition, the initial check-in locations and periods of workers with different IDs are used as the initial locations and periods of the workers in the task assignment tests. Since the check-in locations in New York City show clearly hot and cold areas in both datasets, a rectangular area of the city was selected, as shown in Figure 6.
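For reproducibility, the preprocessing described above could be sketched roughly as follows with pandas; the column names, the random quota range, and the assumption that the 9:00-12:00 window itself is split into equal periods are ours, not the paper's.

import numpy as np
import pandas as pd

def build_task_set(checkins: pd.DataFrame, num_periods: int, seed: int = 0) -> pd.DataFrame:
    # checkins is assumed to have datetime 'timestamp' plus 'lat' and 'lon' columns.
    rng = np.random.default_rng(seed)
    df = checkins[(checkins["timestamp"].dt.hour >= 9) &
                  (checkins["timestamp"].dt.hour < 12)].copy()
    # Split the 9:00-12:00 window into num_periods equal-length periods.
    minutes = (df["timestamp"].dt.hour - 9) * 60 + df["timestamp"].dt.minute
    df["period"] = (minutes * num_periods // 180).clip(upper=num_periods - 1) + 1
    # Remove repeated check-ins at the same place within the same period.
    df = df.drop_duplicates(subset=["period", "lat", "lon"])
    # Draw a random maximum number of workers allowed to complete each task.
    df["max_workers"] = rng.integers(1, 4, size=len(df))
    return df[["period", "lat", "lon", "max_workers"]]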

5.2. Experimental Environment and Parameter Settings

The experiment is implemented in Python and uses PyTorch to build the deep neural network model. The choice of parameters when training a deep reinforcement learning network may affect the solution results; there are general principles to follow and related literature for reference [12, 39]. The random action probability ε decreases from 0.2 to 0.1 during training. The discount factor weighs the contribution of subsequent state-action values to the total return, so it is set close to 1 in the experiment. The learning rate is set to 0.001 (see Section 5.3.1), the replay memory capacity and remaining settings are listed in Table 1, the sampling minibatch size is 64, the network parameters are randomly initialized, the target network parameters are updated with a delay of Z steps, the ReLU activation function is used, and the adaptive moment estimation (Adam) optimizer is selected. At the same time, to remove dimensional effects, the data fed to the input layer of the neural network is normalized. The specific parameter settings of Q network training and the dynamic task assignment test are shown in Table 1.

5.3. Evaluation of Q Network
5.3.1. Parameter Selection of Training

When training on historical data, we hope to learn an optimal policy through worker-environment interaction that maximizes the expected cumulative return. Therefore, this paper uses the improved DDQN model to learn the optimal strategy. The DDQN model is sensitive to the choice of hyperparameters such as the learning rate. To explore the effect of different learning rates on DDQN training, the experiment sets the worker's initial period to 1 in each iteration and compares the worker's reward value from the same initial state. The learning rate is set to 0.01, 0.001, and 0.0001, respectively; the maximum perception radius is set to 2 km, and the maximum number of episodes is set to 30,000.

Figure 7 shows the evolution of the reward value when training on the historical data of Gowalla and Foursquare, using a sliding window of 50 episodes to compute the reward curve. As can be seen from the figure, among the three settings, the highest training reward is achieved with a learning rate of 0.001. The learning rate of 0.0001 converges the slowest. Although the reward rises fastest with a learning rate of 0.01, it converges to worse values and fluctuates more than with a learning rate of 0.001. The stability of the training process and the quality of the resulting strategy are more important than training speed. Therefore, we set the learning rate to 0.001 in the Q network training.

5.3.2. Q Network Train and Analysis

The experiment uses Algorithm 1 to train on the historical data. The converged Q network can be used to predict the expected values of dynamically changing worker-task pairs in real environments. The worker's initial state is set to the virtual initial state. Since in practice workers mostly appear at the beginning of the platform's assignment horizon, the initial period is randomly selected as an integer from 1 to 3 at each iteration. The maximum perception radius of a single worker is set to 2 km, and the number of training episodes is set to 50,000.

Figure 8 shows the change of the loss function value during training on the Gowalla dataset and the Foursquare dataset, respectively. A sliding window of 50 episodes was used to compute the loss curve. It can be seen from the figure that the loss converges well. During training, the loss fluctuates because the agent explores the environment and acquires new environmental information. As the number of training episodes increases, the fluctuation range gradually becomes smaller and shows a decreasing trend.

In this paper, we expect workers to travel to task-intensive areas based on the predicted values of the Q network, so as to avoid having no feasible tasks within the perception range of the next period. To verify the predictive effectiveness of the Q network, we compare the expected values of different locations in the same period. The state value function represents the expected cumulative reward that a worker can obtain by following the policy from a state until the terminal state. Assuming that a greedy strategy based on the state-action value is always used in the assignment process, the state value function is given by $V(s) = \max_{a} Q(s, a)$.

Figure 9 shows the distribution of the V value (the maximum Q value over all feasible actions in the current state) for different states in a certain period in the Gowalla and Foursquare datasets, respectively. It can be seen that the values of different states in the same period are mainly affected by location. The regions with large (red) values in the figure correspond to the area with denser tasks in the lower right corner of Figure 6, while the blank regions and the small (blue) values correspond to the blank and sparse areas in Figure 6. This shows that the Q network predicts the hot and cold distribution of tasks in different regions well.

In summary, these results show the good performance of the Q network. The Markov decision process is modeled for a single worker, and the Q network generated by training evaluates the worker's state to find a better strategy. However, when the platform performs dynamic assignment, multiple workers affect each other. Next, the experiments perform joint dynamic assignment of multiple workers on the test task sets to simulate assignment in a realistic scenario and verify the effectiveness of the proposed framework.

5.4. Evaluation of Dynamic Task Assignment System Framework

In this section, we evaluate the performance of the overall dynamic task assignment framework in different experimental environments. The baseline matching strategies in each period are as follows:
(1) Random: the platform randomly selects tasks for workers to execute in each period.
(2) Ranking algorithm [43]: the ranking algorithm for the online bipartite graph matching problem proposed by Karp et al. The workers are sorted before the start of each period, and the platform matches the workers in that order until the deadline is reached.
(3) Maximum flow [13]: worker-task matching is performed to maximize the number of completed tasks in each period. We use the Ford-Fulkerson algorithm to compute the maximum flow.
(4) Maximum value maximum flow (MaxflowQ): the strategy of this paper. Based on the maximum flow strategy and considering the impact of current decisions on future decisions, the DDQN model [12] is used to train on historical data and generate a Q network for real-time prediction, which optimizes the dynamic matching strategy. In each period, the action-state values output by the Q network are used as the weights of the maximum flow matching, and the platform obtains both the maximum number of matches and the maximum total Q value in the current period.

5.4.1. The Effect of Different Numbers of Workers on Task Assignment

When the platform performs dynamic assignment, multiple workers influence each other, which changes the number of tasks completed. This section compares the number of tasks completed by the platform on the Gowalla dataset and the Foursquare dataset under different total numbers of workers. The maximum perception radius is set to 2 km, and the test sets are the 10 days and 15 days of data following the training sets of Gowalla and Foursquare, respectively. Figure 10(a) shows that our strategy outperforms the other baselines on the Gowalla dataset, and as the number of workers increases, it completes a larger number of tasks than the other strategies. Figure 10(b) shows that on the Foursquare dataset, when the number of workers is 30, the strategy of this paper is close to the baselines; as the number of workers increases, it becomes better than the other baselines. The maximum flow strategy, however, is more likely to fall into a local period optimum as the number of periods and workers increases. Combining Figures 10(a) and 10(b), it can be seen that, thanks to the Q network in the decision-making process, our strategy performs better as the number of workers increases and is less likely to fall into a local period optimum, which reflects the effectiveness of introducing reinforcement learning to account for the impact of current decisions on future decisions.

5.4.2. The Impact of Different Task Test Sets on Task Assignment

In the dynamic assignment process, the environment is dynamic and uncertain. To explore the performance of the framework in different environments, the experiments in this section select test task sets from different days of the Gowalla dataset and the Foursquare dataset and compare the number of tasks completed by the platform. The maximum perception radius of workers is set to 2 km, and the numbers of test workers on the Gowalla and Foursquare test sets are 70 and 50, respectively. Figure 11(a) shows that the strategy of this paper outperforms the other baselines on the different Gowalla test sets, and it remains superior when the test sets are sparse, as with the 7-day and 9-day data. Figure 11(b) shows that on the Foursquare dataset, due to the sparse tasks in the 10-day and 12-day test sets, the maximum flow matching strategy falls into a local period optimum, resulting in a low overall number of completed tasks. On the 20-day task set, the number of tasks completed by the maximum flow matching strategy is close to that of our strategy because the tasks are denser. Together, Figures 11(a) and 11(b) show that the strategy of this paper outperforms the other comparison algorithms on different test sets and is less likely to fall into a local period optimum, indicating that the dynamic assignment framework is stable in realistic test environments.

5.4.3. The Effect of Different Perception Radius on Task Assignment

A worker's perception range determines the task set the worker can choose from in each period, which affects the platform's matching result for that worker. The experiments in this section compare the number of tasks completed by the platform on the Gowalla dataset and the Foursquare dataset for workers with different maximum perception radii. The test set in the Gowalla dataset is 10 days of check-in data with 60 test workers, and the test set in the Foursquare dataset is 15 days of check-in data with 50 test workers. Based on the Q network with a maximum perception radius of 2 km, we also trained Q networks with maximum perception radii of 1 km and 3 km, respectively. Figures 12(a) and 12(b) show that the strategy of this paper completes more tasks under the different perception ranges. As the perception range increases, a worker can perceive more tasks in each assignment, and the platform can more easily find the optimal assignment for each worker, resulting in better performance.

6. Conclusion

This paper constructs a dynamic task assignment framework for MCS based on deep reinforcement learning. First, a single worker is modeled as a Markov decision process, and the DDQN model is used to train on the historical task set and generate a Q network for real-time prediction. Then, in the dynamic task assignment process, the maximum value maximum flow matching strategy is used to maximize the number of completed tasks in each period while guiding workers toward task-intensive areas, so as to avoid having no feasible tasks within the perception range in future periods. As a result, the platform achieves the global maximum number of completed tasks. In future work, we will focus on the heterogeneity of workers and tasks, explore the cooperation and competition of workers in large-scale task assignment, and try to introduce related theory, such as game theory and multiagent reinforcement learning, to optimize the framework.

Data Availability

The experiments use the Gowalla dataset [41] and the Foursquare dataset [42]. Both datasets can be accessed from the relevant references.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62172182).