Abstract

Urban wireless sensor networks (UWSNs) are an important application scenario for the Internet of Things (IoT). With the emergence of many computationally intensive urban applications, sensors in wireless sensor networks cannot meet requirements such as latency because of their limited resources. Fog computing architectures can relieve resource-constrained sensor nodes of data transmission through data offloading. Data collection and scalable collaboration under fog architectures therefore remain a challenge. For the multinode data offloading problem, a multinode data offloading strategy based on stable matching and the multi-armed bandit (MAB) model is proposed to maximize the offloading success rate while guaranteeing the latency requirements of the source task nodes. First, the multinode data offloading problem is modelled. Second, selection conflicts among multiple nodes are considered, and both selection conflicts and information exchange are reduced through the MAB model and a fallback timer. Then, an adjustment strategy is proposed based on the deferred-acceptance idea from stable matching in game theory. Finally, the multinode data offloading problem is solved by successive iterations. The proposed algorithm not only coordinates cooperation between nodes to achieve high-quality data offloading and avoid collisions but also reduces the amount of information exchanged between nodes. Its effectiveness is demonstrated by theoretical analysis and simulation experiments.

1. Introduction

Our lives increasingly depend on smart IoT devices that sample and collect data from the environment and act in the physical world [1]. Many IoT devices, such as sensor nodes, are deployed in cities and large metropolitan areas for applications such as security monitoring, traffic and pollution monitoring, infotainment, and energy management [2, 3], as well as in vehicular ad hoc networks to support data communication between nodes via wireless multihop transmission [4]. In the future, many new applications will emerge in everyday life, such as smart homes, smart cities, and smart grids. The proliferation of urban IoT applications has created an unprecedented amount of data, including physical quantities sampled from the environment [5]. Specifically, data are collected from the real world by IoT devices and then transmitted over the internet to processing centres for computation and processing [6]. In addition, the proliferation of heterogeneous IoT devices and dynamic changes in the environment add to the difficulty of real-time computation and decision-making for IoT devices.

To realize the above vision of the future, devices in the network need to perform many real-time calculations and make decisions constantly, which places high demands on their computing power and battery life. To reduce device costs, a viable option today is to use cloud computing and cloud infrastructure to provide the network with sufficient computing and storage resources. Cloud computing can offload data and computing tasks from a device to a remote cloud for supplementary computation [7]. This approach reduces the burden and energy consumption of the device itself, saves local storage resources, and lowers the cost of network equipment. For example, in [8], real-time workflow scheduling in a cloud-based environment effectively reduces network costs. However, with the rapid spread of emerging applications such as virtual reality (VR), face recognition, image recognition, and video processing, centralized cloud-based solutions raise concerns about the latency caused by the long distance between end devices and the cloud, as well as about link load. As a result, the concept of fog computing was introduced: fog computing networks distribute computing, storage, control, and communication services across a cloud-to-thing continuum rather than offloading them entirely to the cloud [9]. Network edge devices that are closer to end devices, such as small base stations, wireless access points, laptops, moving vehicles, and smartphones with some computing and storage capability, serve as an intermediate layer between the cloud and end devices, i.e., the fog layer. Fog computing networks are considered a promising architecture that uses edge network resources to improve services such as offloading, triage, and caching for devices.
Compared to cloud computing, fog computing is not only effective in reducing latency but also in reducing the number of communications between the base station and the core network, thereby reducing link load. In summary, fog computing can effectively support compute-intensive and latency-sensitive applications and can better meet the needs of the future Internet of Everything.

There are many application scenarios for the IoT, such as the industrial IoT [10] and the IoT in urban environments. Urban wireless sensor networks are one of the key IoT application scenarios. Sensor nodes deployed in urban wireless sensor networks sample data from the surrounding environment to support intelligent urban applications, such as monitoring temperature, humidity, water levels, vehicle activity on roads, and alley crime [11]. With the growth of the urban population, the massive increase in data volume, and the proliferation of heterogeneous devices, urban environments are increasingly likely to interfere with wireless sensor network communications, degrading network performance. Research on urban wireless sensor networks based on fog computing is therefore necessary. Sensor nodes can effectively support urban applications only if they first successfully offload sampled data to fog layer devices (fog nodes). Sensor nodes have limited resources and cannot store sampled data indefinitely. It is therefore crucial that data be offloaded to resource-rich nodes before storage space is exhausted, a problem known as the “data loss problem.”

The sensor nodes in the network that need to offload data are referred to as task nodes, while resource-rich nodes, i.e., nodes that can help resource-limited nodes offload data, are referred to as helper nodes. Designing an offloading solution that prevents the “data loss problem” is very challenging: an effective strategy requires careful collaboration between nodes, but the limited resources of sensor nodes preclude coordination mechanisms with significant overhead. In real urban wireless sensor networks, multiple task nodes often need to offload simultaneously, so considering multinode data offloading schemes is a natural step. In the multinode setting, selection conflicts and fairness issues arise between nodes: multiple task nodes compete for the same “best” helper node. Therefore, based on a scenario where multiple nodes coexist, this paper addresses the pressing challenges of internode collisions and pairing fairness.

Most existing strategies formulate offloading or content caching as a constrained convex optimization problem, choosing different metrics and constraints such as latency, network throughput, and energy efficiency. In [12], Lan et al. proposed two schemes, both of which aim to maximize the average utility of the system. The first formulates a task caching optimization problem based on stochastic theory and designs task caching algorithms to solve it. The second describes the task offloading and resource optimization problem as a mixed-integer nonlinear dynamic programming problem. Jiang [13] studied the edge cache optimization problem in fog radio access networks. Aiming to minimize delay, they considered joint and parallel transmission strategies, which turn the objective into a nondeterministic polynomial-time (NP) problem. The authors of [14] aimed to maximize the weighted computation rate of all wireless devices in the network by jointly optimizing the selection of individual computation modes and the allocation of system transmission time. They first assumed that the mode selection is given and applied decoupled optimization to solve the problem; however, this approach is highly complex in large networks. The authors of [15] investigated joint task offloading and resource allocation to maximize offloading gains for users; the problem is formulated as a mixed-integer nonlinear program jointly optimizing offloading decisions, the uplink transmission power of mobile users, and the computational resource allocation of mobile edge cloud servers. The authors of [16] treated the total network gain, via computation offloading decisions, resource allocation, and content caching policies, as an optimization problem; they transformed it into a convex problem and proposed an algorithm based on the alternating direction method of multipliers (ADMM) to solve it.
In addition, many researchers use game theory or its variants to solve the above problems. For example, a game theory-based smart gateway providing offloading and migration functions for the IoT in fog computing is proposed in [17], where the authors used noncooperative game theory to reduce latency and energy consumption during application execution. The authors of [18] proposed a game theory-based multiuser partial computation offloading strategy, modelling the computational overhead of the multiuser partial offloading problem in a mobile edge computing environment over wireless channels. Other studies [19, 20] are likewise based on game theory. Almost all of the approaches mentioned above assume that a complete model of the system and the transition probabilities between individual states can be obtained. In real dynamic scenarios, however, such assumptions are too idealistic.

In addition to formulating the problem as a convex optimization problem, many scholars have studied data collection and offloading in fog computing-based IoT scenarios through other theoretical frameworks. The authors of [21] used the cloud and fog computing paradigms to design a multilayer data offloading protocol for a variety of data-centric applications in urban scenarios. Specifically, to reduce the probability of data loss, the protocol exploits the heterogeneity in the network and the characteristics of fog computing by logically dividing sensor nodes into “in-need” and “helper” nodes based on Markov chains, enabling sensor nodes to collaboratively offload data to each other or to the fog nodes. However, their solution assumes that the nodes have enough storage space to store all the data and that the mobility of the nodes is fixed and known in advance. The authors of [22] focused on security and privacy preservation during data collection and offloading. To achieve efficient and secure collection of large sensory data in fog computing-based IoT, they first proposed a sampling perturbation encryption method that does not sacrifice data relevance. Second, they developed an optimization model of the measurement matrix to ensure the accuracy of data reconstruction. Finally, they developed an efficient offloading decision algorithm that determines the offloading ratio for a jointly optimal allocation of resources, with minimum offloading time as the goal. This paper focuses on making the data offloading success rate as high as possible while satisfying the maximum tolerable offloading delay; data security and privacy during offloading are therefore not its focus.
In addition, many scholars are studying offloading decisions in fog computing-based IoT scenarios. For example, the authors of [23] focused on fog computing-based offloading in industrial IoT scenarios. Their aim was for computational tasks to be completed within the desired energy and latency budgets with minimal energy consumption. They not only found the optimal value with an accelerated gradient algorithm that jointly optimizes the offloading rate and transmission time but also developed an alternating minimization algorithm that accounts for dynamic compression techniques. Similarly, the authors of [24] developed a latency-minimizing offloading decision and resource allocation scheme for fog computing-based IoT, using queueing theory to formulate a joint optimization problem over offloading decisions, local resources, and other fog node parameters. They decomposed this mixed-integer nonlinear programming problem into two subproblems, each optimized by a simulated annealing algorithm.

To address the overly idealized assumptions of such models in real dynamic scenarios, some researchers use reinforcement learning (RL) based methods. Guo et al. [25] argued that existing cache strategies are “blind” to users; that is, users do not know whether a file they are about to request is in the local cache of nearby nodes. The authors therefore made the local cache efficient by actively informing users of what the base station has cached. Since the request probability is unknown, they used Q-learning to learn the request probabilities and the random arrival and departure of mobile users and then optimized the cache replacement strategy. In [26], the authors proposed a new joint offloading and resource allocation method in a WiFi-based multiuser mobile edge architecture. Their goal was to minimize the energy consumption of mobile terminals under application delay constraints, with the optimal offloading strategy implemented jointly with radio resource allocation. The authors formulated this as a new online reinforcement learning problem and proposed a Q-learning-based strategy to solve it, considering both delay and device computing constraints. Q-learning is a model-free RL algorithm that can learn the best strategy by continually interacting with the environment, without knowing the complete system model and state. However, as the number of states and actions grows, the computational complexity and storage cost of Q-learning grow exponentially, so it is not suitable for complex, dynamic IoT scenarios.

With the growing adoption of reinforcement learning as a core technology, research attention on it has increased across many fields. The MAB model is a classic application model in reinforcement learning. In the stochastic MAB problem, given a set of arms (actions), one arm is selected in each trial, and a reward is drawn from that arm’s reward distribution. Each arm has an unknown random reward, and by pulling an arm, the player immediately receives a reward. The player decides which arm to pull in a series of trials so as to maximize the reward accumulated over time. The multinode data offloading problem in urban wireless sensor networks based on fog computing can naturally be modelled as a MAB problem. During data offloading, task nodes can be modelled as players and helper nodes as arms. Similarly, each helper node yields an unknown random reward for data offloading. A task node offloads data through its chosen helper node and, after completion, observes the feedback (reward) for that offload. The task node decides which helper node to use in each round so as to maximize the long-term accumulated offloading feedback. MAB algorithms are currently used in many fields, such as website optimization [27], optimal control for robots [28], distributed dynamic recommender systems [29], competitive cooperative learning [30], and learning in changing environments [31].

The MAB model features the classic exploitation-exploration (EE) trade-off, in which the player must refine the strategy through trial and error to reach the optimal strategy with the highest expected reward. This learning feature forces the decision maker to constantly balance exploitation and exploration: exploiting accumulated data to improve the probability of earning rewards, while also exploring unselected arms to gather more data and evaluate the current strategy more accurately. This requires many interactions with the environment, some of which are useful and some of which are wasted, such as repeatedly exploring an arm already known to yield low rewards.

By far the most practical and effective exploration strategy in practice is the reward bonus, a mechanism that adds an exploration bonus to arms that are unexplored or rarely chosen, driving the decision maker to explore them. The mechanism is mainly based on the principle of optimism in the face of uncertainty [32]: use historical data to build an estimate of the maximum plausible reward of the unknown environment, i.e., an upper confidence bound, and then select arms according to this bound. A simple idea is to compute the upper confidence bound of the reward function and add it to the reward estimate; a larger value then indicates greater uncertainty about an arm, which drives the decision maker to collect more data about it. For example, the authors of [33] proposed an online learning-based offloading strategy for dynamic fog computing networks using the UCB1 algorithm to optimize the average offloading delay and offloading success rate, enabling optimal offloading decisions in real time. However, that work considers only a single node performing offloading. In practical IoT scenarios, with the massive increase in heterogeneous IoT devices, it is both necessary and challenging to consider scenarios where multiple nodes offload data.
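As a concrete illustration of the upper-confidence-bound idea described above, the following sketch computes the standard UCB1 index (empirical mean plus an exploration bonus that shrinks as an arm is pulled more often) and picks the arm with the highest index. This is a generic UCB1 sketch under our own naming, not the exact formulation used later in this paper.

```python
import math

def ucb1_index(mean_reward, n_pulls, t):
    """UCB1 index: empirical mean plus an exploration bonus.
    t is the current round; n_pulls is how often this arm was chosen."""
    if n_pulls == 0:
        return float("inf")  # unexplored arms are always tried first
    return mean_reward + math.sqrt(2 * math.log(t) / n_pulls)

def select_arm(means, counts, t):
    """Pick the arm with the highest UCB1 index."""
    indices = [ucb1_index(m, n, t) for m, n in zip(means, counts)]
    return max(range(len(indices)), key=indices.__getitem__)
```

The bonus term grows with the round counter but shrinks with the arm's pull count, so arms that have been neglected are revisited periodically, which is exactly the reward-bonus mechanism described above.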

In summary, this paper considers the multinode data offloading problem in a dynamic urban sensor network based on fog computing, with the maximum tolerable offloading delay as a constraint, so as to improve the data offloading success rate. Using the UCB1 algorithm, which is based on the optimistic principle, the multinode data offloading problem is modelled as a MAB problem in reinforcement learning; decision makers in the network can then exploit and explore, without prior knowledge of the complete system model or the state transition probabilities, by continuously learning through interaction with the environment and using the upper confidence bound of the reward. In addition, to handle selection conflicts and fairness in multinode offloading scenarios, the MAB framework is augmented with stable matching theory [34] and fallback timers, making the system robust and effective in terms of delay and offloading success. This one-to-one selection mechanism avoids selection conflicts among task nodes, and the fallback timer eliminates a large amount of information exchange. To maximize the data offloading success rate under the delay constraint, the MTDOsa-MAB strategy is proposed, taking both the selection conflict problem and the fairness problem into account. Our main contributions are summarized as follows:
(i) A novel learning framework is proposed that treats the multinode data offloading problem as a MAB problem without requiring any prior knowledge about the characteristics of the helper nodes.
(ii) To reduce computational overhead and resolve conflicts in a distributed manner, an offloading strategy is proposed, namely, MTDOsa-MAB (a multinode data offloading strategy based on stable matching and the MAB model). This strategy selects helper nodes using stable matching theory and a fallback timer, eliminating collisions to ensure efficient data collection and avoiding extensive information exchange.
(iii) An upper bound on the regret of the MTDOsa-MAB strategy is derived, and the effectiveness of the strategy is verified by simulation experiments.

2. Model and Problem Formulation

2.1. Network Model and Offload Model

Figure 1 illustrates the dynamic fog computing model for a certain time slot. With the support of helper nodes, task nodes offload data to fog nodes before resources are exhausted to support urban applications. The helper node acts as a forwarder of data and does not process the data, which is all processed by the fog node. In this paper, we consider a many-to-many (multiple task nodes, multiple helper nodes) data offloading scenario with indivisible data.

This paper focuses on optimizing data offloading for fog networks in the context of resource-limited sensor nodes. It uses the fog network paradigm to design a data offloading protocol, shown in Figure 2, for various data-centric applications in urban wireless sensor network scenarios. Specifically, it exploits the heterogeneity in fog computing networks, so sensor nodes can collaboratively offload data to each other or to mobile fog nodes. To reduce the coordination overhead between sensor nodes and fog nodes, sensor nodes are logically divided into task node and helper node categories based on their buffer availability and relative locations. Firstly, when a task node sends a request to offload data, the helper node updates its own status and feeds the status information back to the task node. Secondly, the historical offload reward information of a helper node and the request conditions of the task node to offload data are fed back to the fog node for decision analysis and to complete the data offload. Finally, after the data offloading is completed, the fog node feeds the offloading result under this decision to the task node. Table 1 shows the symbols used in this paper and their meanings.

Time is divided into several time slots. At the beginning of each time slot, a task node issues a request to offload data , represented by the triplet , where denotes the size of the offloaded data, indicates the number of CPU cycles required to process the data, and is the maximum tolerable delay for completing the offload request . The system model is assumed to consist of task nodes, helper nodes, and fog nodes.

Minimizing the average data offloading delay and increasing the data offloading success rate are the objectives of the data offloading strategy designed in this paper. The delay for completing an offload request comprises two components: transmission delay and computation delay. In general, since the returned results are small (a few to a few tens of bits), the delay of returning results is negligible [35].

There are two parts of transmission delay in the model: the transmission delay of the task node offloading data to the helper node , and the transmission delay of the helper node forwarding data to the fog node . That is,

where is the transmission rate for offloading data to the helper node and is the transmission rate for offloading data to the fog node . According to Shannon’s formula,

where is the transmit power, is the channel gain, and is the noise power. The data are processed by the fog node. The computational capacity of the fog node is assumed to be , from which the computational latency of the offloaded data at the fog node can be obtained.
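The Shannon-rate expression referenced above lost its symbols in extraction; its standard form, using hypothetical symbol names ($B$ for bandwidth, $p$ for transmit power, $g$ for channel gain, $\sigma^2$ for noise power), is:

```latex
r = B \log_2\!\left(1 + \frac{p\,g}{\sigma^2}\right)
```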

Ultimately, the delay in offloading data to the fog node for processing via helper node can be expressed as
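Consistent with the two-part transmission delay plus computation delay above, the total delay can be sketched with hypothetical symbols ($d$ for data size, $r_{th}$ and $r_{hf}$ for the task-to-helper and helper-to-fog link rates, $c$ for the required CPU cycles, $f$ for the fog node's computational capacity):

```latex
T = \underbrace{\frac{d}{r_{th}}}_{\text{task}\to\text{helper}}
  + \underbrace{\frac{d}{r_{hf}}}_{\text{helper}\to\text{fog}}
  + \underbrace{\frac{c}{f}}_{\text{computation at fog}}
```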

2.2. Problem Formulation

The problem of data offloading can be naturally modelled as a MAB problem. The task nodes are considered as players, and the helper nodes are considered as arms. In each round, the player expects to choose the arm with the highest reward. Similarly, during the data offloading process in each time slot, the task node offloads data to the helper node and expects a high reward, e.g., a high level of data offloading success.

The objective of this paper is to maximize the data offloading success rate and overall system performance while satisfying the task nodes’ offloading delay requirements. The DoS (degree of satisfaction) of a single task node is defined as

where is the task node’s satisfaction, which equals one if the data offloading delay is less than or equal to the maximum tolerable delay and zero otherwise.
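The DoS definition above amounts to an indicator on the delay constraint; with hypothetical symbols $T_i$ for the realized offloading delay of task node $i$ and $t_i^{\max}$ for its maximum tolerable delay, it can be written as:

```latex
\mathrm{DoS}_i = \mathbb{1}\!\left\{\, T_i \le t_i^{\max} \,\right\}
```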

At moment , a task node selects a helper node for offloading. If no other task node selects the same helper node at that moment and the value is one, the task node receives an instantaneous reward . Otherwise, if multiple task nodes select the same helper node, a selection conflict occurs, and none of the conflicting task nodes receives a reward (i.e., the reward is zero). The instantaneous rewards are independent random variables following a uniform distribution. Without loss of generality, the rewards are normalized to . The random variable has mean , and is the total number of time slots. The mean is not known in advance by the task nodes, and different task nodes have different means. The set of mean rewards of all task nodes over all helper nodes is denoted . In the data offloading process proposed in this paper, task nodes continually explore and learn to estimate and predict the availability of helper nodes.

Considering the simultaneous existence of multiple task nodes, selection conflicts between nodes are bound to occur, and an effective data offloading strategy is therefore needed. In the MAB framework, the regret is used to measure the performance of a bandit algorithm [36]. It is defined as the difference between the total system reward obtained by the optimal strategy in the ideal case and the total reward obtained by the learning strategy . The regret can be expressed as follows:

where specifies the pairing of task nodes and helper nodes that yields the maximum reward, and denotes the total reward of the task nodes after time slots, as given in the following:
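The regret expression referred to above follows the standard MAB form; with hypothetical notation $\pi^{*}$ for the optimal pairing strategy, $\pi$ for the learning strategy, and $r_t(\cdot)$ for the system reward obtained in slot $t$, it reads:

```latex
R(T) = \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\big(\pi^{*}\big)\right]
     - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\big(\pi\big)\right]
```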

In the above equation, denotes the conflict coefficient between task nodes. At moment , when only one task node selects a certain helper node, it is considered that no conflict occurs, i.e., is one; otherwise, it is zero.

In summary, the objective function for maximizing the satisfaction of all task nodes is expressed as

3. MTDOsa-MAB

3.1. Design of Algorithms

In this paper, the bilateral matching method from game theory and a fallback timer are used to improve the MAB framework, and a data offloading strategy is proposed to solve the above problems. A bilateral match is a one-to-one, many-to-one, or many-to-many matching of elements in one set with elements in another set. All task nodes are considered one set, and all helper nodes the other. The data offloading process can be seen as a bilateral matching of one or more task nodes to one helper node. A stable matching is a state of bilateral matching in which a Nash equilibrium, in the game-theoretic sense, is reached [37]. Several definitions related to stable matching follow.

3.1.1. Partial Order Relations

Let be a set and let the preference relation on be a binary relation such that:
(1) For each , or holds (the relation is complete)
(2) does not hold (the relation is irreflexive)
(3) If and , then (the relation is transitive)

3.1.2. Matching Problem

A matching problem is described as the existence of two sets, a set of task nodes and a set of helper nodes. Each helper node has a preference relation on the set of task nodes. A task node considers a helper node to be superior to , and this relationship is expressed as .

3.1.3. Bilateral Matching

The matching in this paper is a bijection from the set of task nodes to the set of helper nodes. If a pairing is included in a match , the offload request of task node is matched to helper node .

3.1.4. Matching Opposition

A helper node and a task node oppose a match if they believe that there exists an individual on the other side that is better than the individual they are matched to under the current match. A match is stable on the premise that there is no such opposition.

3.1.5. Stable Match

stands for a match. A match is unstable if some helper node prefers another task node’s request to its partner under , while that task node also prefers this helper node to the one it is matched with under ; a match is stable if no such pair exists. A stable matching always exists.

The advantages of stable matching theory for helper node allocation are as follows. (i) Because stable matching theory always yields a stable one-to-one match for any preference function, multinode competition under this interference model is avoided. (ii) When the preference values of the bilateral elements are all distinct, the stable matching solution is unique. (iii) Stable matching theory allows each participant (i.e., task node and helper node) to define its own utility based on local information. Because the proposed algorithm incurs no significant coordination overhead, its computational complexity is greatly reduced.
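The deferred-acceptance (Gale-Shapley) procedure underlying stable matching, mentioned in the abstract, can be sketched as follows for the one-to-one case. The function names and dictionary layout are our own illustration, not the paper's implementation; it assumes equal-sized sides with complete preference lists.

```python
def deferred_acceptance(task_prefs, helper_prefs):
    """One-to-one stable matching via Gale-Shapley deferred acceptance.
    task_prefs[t]  : helper ids ordered from most to least preferred
    helper_prefs[h]: task ids ordered from most to least preferred
    Returns a dict mapping task id -> helper id."""
    # Precompute each helper's ranking of tasks (lower rank = preferred)
    rank = {h: {t: i for i, t in enumerate(prefs)}
            for h, prefs in helper_prefs.items()}
    next_prop = {t: 0 for t in task_prefs}  # next helper each task proposes to
    engaged = {}                            # helper -> currently held task
    free = list(task_prefs)
    while free:
        t = free.pop()
        h = task_prefs[t][next_prop[t]]
        next_prop[t] += 1
        if h not in engaged:
            engaged[h] = t                  # helper tentatively accepts
        elif rank[h][t] < rank[h][engaged[h]]:
            free.append(engaged[h])         # helper trades up, old task freed
            engaged[h] = t
        else:
            free.append(t)                  # proposal rejected, try next helper
    return {t: h for h, t in engaged.items()}
```

Because helpers only ever trade up, the result contains no blocking pair, i.e., no "matching opposition" in the sense defined above.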

In [32], the authors proposed the UCB1 strategy, which is a decision learning algorithm. In a UCB1 policy, the task node selects the policy to be executed at the next moment by analysing a series of historical policies and their resulting rewards. To provide an optimistic evaluation of the helper nodes, subject to the latency requirement, the UCB1 algorithm associates an index called the UCB index to each task node and helper node pair. The UCB index of each task node and helper node pair is calculated to estimate the reward expectation of the response, and the pair with the highest index is selected.

When performing bilateral matching, each element of the bilateral set will have a preference ranking of all elements of the other set, and this ranking rule needs to be developed first. The order of preference of task nodes to helper nodes is to first identify the helper nodes that satisfy the time delay requirement and then to rank them according to their UCB values [32], with higher UCB values having higher priority. Conversely, helper nodes are preference ordered based on the maximum tolerable offload latency of the data posted by the task node, with requests with low latency being given high priority.

At each time slot , when a task node and a helper node are paired, all values of ( is the set containing information about the pairing of task and helper nodes) can be observed. is the average reward estimate generated by pairing a task node with a helper node up to the current time slot, and is the number of times that helper node has been selected by task node up to the current time slot. and are updated in the following way:
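The update of the average reward estimate and the selection count after each offload is the standard incremental-mean rule; a minimal sketch, with illustrative names of our own, is:

```python
def update_stats(mean, count, reward):
    """Incrementally update the average observed reward and the
    selection count after one more offload through this pairing."""
    new_count = count + 1
    new_mean = mean + (reward - mean) / new_count
    return new_mean, new_count
```

This avoids storing the full reward history: only the running mean and the count per (task, helper) pair are kept.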

Based on the above description, the pseudo-code of the algorithm is as follows. The MTDOsa-MAB algorithm is carried out iteratively.

//Initialization
 WHILE a new data offload request is broadcast DO
  FOR each task node DO
   FOR each helper node within communication range DO
    Update the set of candidate helper nodes.
    Calculate the predicted time delay for the nodes in the set.
    IF the predicted delay satisfies the maximum tolerable delay THEN
     Put the helper node into the feasible set.
    END IF
   END FOR
   FOR each helper node in the feasible set DO
    Randomly select the helper node and offload once.
    Update the average reward and selection count for the pair.
   END FOR
  END FOR
  //Main loop
  WHILE 1 DO
   FOR each task node DO
    Update the set of candidate helper nodes.
    Calculate the predicted time delay for the nodes in the set.
    IF the predicted delay satisfies the maximum tolerable delay THEN
     Put the helper nodes that satisfy the time delay into the feasible set.
    END IF
    FOR each helper node in the feasible set DO
     Calculate the UCB index for the task node and helper node pair.
    END FOR
    Run Algorithm 2 for matching task nodes and helper nodes.
    Update the average reward and selection count accordingly.
   END FOR
  END WHILE
 END WHILE

Algorithm 1 is based on the idea of the UCB1 strategy and is divided into two phases: an initialization phase and a main loop phase. Lines 2 to 15 of Algorithm 1 form the initialization phase, which accumulates information about the available helper nodes. When a task node sends an offload request, the initialization phase identifies the helper nodes within communication range that can satisfy the request within the maximum tolerable offload delay; each task node then executes every feasible choice once, at random, and records the reward corresponding to each choice. Once every feasible choice has been executed once, the task node enters the main loop phase, lines 16 to 30 of the pseudo-code. In the main loop, helper node decisions are made using the historical offload information and feedback accumulated during the initialization phase: each task node evaluates the historical choices and their rewards through the UCB1 equation. The main loop then decides on a pairing between the two sets, i.e., a strategy that maximizes the overall UCB index so that the best task node and helper node pairs are formed.

The fallback timer is introduced to determine the timing of reselection after a selection conflict. Instead of waiting for the colliding helper node to become free and then sending data immediately, a task node that has stopped sending after a conflict postpones transmission (the fallback) for a period of time; otherwise, the moment the colliding helper node becomes free, all contending task nodes would transmit simultaneously and another conflict would arise. The fallback time is determined by the task node's UCB index for the helper node. Assume that all task nodes use the same fallback function, a monotonically decreasing function of the UCB index of the task and helper node pair. At the beginning of each time slot after the initialization phase, each task node calculates its UCB index for a particular helper node and maps it to a fallback time through this predetermined common decreasing function, then sets its fallback timer accordingly. Figure 3 shows an example of such a fallback function. The first task node to reselect and send data after a conflict is therefore the one with the highest UCB index for the helper node.
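The paper does not give the exact form of the common decreasing function, so the sketch below uses one simple choice satisfying the stated properties (monotonically decreasing in the UCB index, positive); `t_max` and `k` are illustrative parameters.

```python
def fallback_time(ucb_index, t_max=10.0, k=1.0):
    """Map a UCB index to a fallback (backoff) delay via a common,
    monotonically decreasing function: a higher UCB index yields a
    shorter wait, so the most promising pair retransmits first."""
    return t_max / (1.0 + k * ucb_index)
```

Because every task node uses the same function, the ordering of fallback times exactly mirrors the (reversed) ordering of UCB indices, which is what resolves the conflict deterministically.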

 INPUT: the preference lists of the bilateral elements according to Algorithm 1.
 OUTPUT: the set of matched task node and helper node pairs.
 FOR each unmatched task node DO
  Select the most preferred helper node.
 END FOR
 FOR each helper node DO
  Keep the most preferred requesting task node and reject the rest.
  IF the rejected task node's fallback time is the smallest among the conflicting nodes THEN
   The rejected task node continues to wait for this helper node.
  ELSE
   The rejected task node enters the rejected queue.
  END IF
 END FOR
 WHILE the rejected queue is not empty DO
  Take a task node out of the rejected queue.
  FOR each such task node DO
   Select the next most preferred helper node.
  END FOR
  FOR each helper node DO
   Keep the most preferred requesting task node and reject the rest.
   IF the rejected task node's fallback time is the smallest among the conflicting nodes THEN
    The rejected task node continues to wait for this helper node.
   ELSE
    The rejected task node enters the rejected queue.
   END IF
  END FOR
 END WHILE

The main steps of Algorithm 2 are as follows: (1) In lines 3 to 5, each task node selects the helper node with the highest priority in its own preference list. (2) In lines 6 to 7, each helper node selects, according to its own preferences over the offload requests, one task node from those that selected it, accepts that node's offload request, and rejects the remaining requests. (3) In lines 8 to 27, for each task node rejected by a helper node in the previous step, the UCB index of the task node with respect to that helper node is calculated and mapped to a fallback time. If the fallback time of this task node is the smallest among the conflicting task nodes, the task node continues to wait for this helper node; otherwise, it enters the rejected queue and continues to iterate through its preference list. (4) When no task node is rejected, i.e., the length of the rejected queue is zero, the algorithm terminates. At termination, a stable match is achieved, i.e., no element on either side objects to the output matching; the maximum UCB index is attained subject to the task nodes' tolerable delay requirements, yielding a higher offload success rate and maximum task node satisfaction.
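The core of Algorithm 2 is deferred acceptance. The sketch below shows that core as a plain one-to-one deferred-acceptance matching, omitting the fallback-timer tie-breaking for brevity; all names are illustrative.

```python
def stable_match(task_prefs, helper_prefs):
    """One-to-one deferred-acceptance matching: task nodes propose to
    helper nodes in preference order; each helper keeps its most
    preferred proposer and rejects the rest, until no task node is
    rejected. Returns a task -> helper mapping."""
    # Precompute each helper's ranking of task nodes for O(1) comparison.
    rank = {h: {t: i for i, t in enumerate(p)} for h, p in helper_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}   # next helper each task will try
    engaged = {}                               # helper -> currently held task
    free = list(task_prefs)                    # rejected queue
    while free:
        t = free.pop(0)
        if next_choice[t] >= len(task_prefs[t]):
            continue                           # t has exhausted its list
        h = task_prefs[t][next_choice[t]]
        next_choice[t] += 1
        cur = engaged.get(h)
        if cur is None:
            engaged[h] = t
        elif rank[h][t] < rank[h][cur]:        # helper prefers the newcomer
            engaged[h] = t
            free.append(cur)                   # previous task re-enters the queue
        else:
            free.append(t)                     # t rejected, will try its next helper
    return {t: h for h, t in engaged.items()}
```

The loop terminates because each task node proposes to each helper at most once, mirroring the termination condition in step (4).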

3.2. Convergence Analysis of the MTDOsa-MAB

Following [32], the proposed data offloading strategy MTDOsa-MAB admits the following convergence analysis of its regret value.

Theorem 1. The expected regret of MTDOsa-MAB is at most

Proof. Each time a task node and a helper node pair up to complete a data offload, the task node receives a reward value. We therefore define the regret in terms of any maximal element of the optimal set and the number of times non-optimal task node and helper node pairings are chosen. That is, after the initialization process, whenever a non-optimal set is selected in a time slot, there exists at least one task node and helper node pairing in it that does not belong to the optimal set. One set contains the task and helper node pairs chosen in each round; the optimal set contains the pairs with optimal reward information. The regret after all rounds can then be written as a sum of per-pair terms, so the total regret can be bounded by bounding each term separately. An indicator function records whether a given non-optimal pair is chosen in a round: it equals one if the pair is selected and zero otherwise. Because there is always uncertainty in the estimated reward information of the helper nodes, an exploration phase is necessary. In the exploration phase of the algorithm, the set of helper nodes whose predicted delay satisfies the maximum tolerable condition is filtered, and the elements of this set are chosen by a random strategy; ideally, each helper node is chosen at least once. In the exploitation phase, a non-optimal helper node may still be chosen, and the number of times it is chosen in the later rounds is captured by the indicator function. Then, the following equation holds. Equation (14) represents the number of times a particular helper node is chosen in the ideal situation. In fact, the aim of the random strategy is to collect historical offloading experience about the helper nodes to a limited extent; however, it cannot guarantee that a helper node is never chosen repeatedly. Let the threshold be a positive integer of arbitrary size.
Therefore, the number of times a task node chooses a given helper node for data offloading is greater than or equal to equation (14), so the following equation can be obtained:

In each round of decision-making, there is a possibility that the predicted reward of the finally chosen helper node is greater than or equal to that of the optimal helper node for that task node, so that a suboptimal helper node is chosen by the task node even after learning. Restructuring the above inequality according to the UCB equation gives

By exhaustively enumerating the relevant indices, the following inequalities are obtained:

In the initial rounds, any helper node can be selected only a bounded number of times, which bounds the corresponding selection counts. Expanding all possibilities yields the following inequality:

By contradiction, it can be shown that for the situation described by the above inequality to occur, at least one of the following three inequalities must hold:

For the third case, given the range of values the estimate can take, it follows that

Therefore, inequality (21) does not hold.

Once the selection count exceeds the threshold, at least one of inequalities (19) and (20) must hold for the situation to occur. Applying the Chernoff–Hoeffding bound to inequalities (19) and (20), respectively, we get

The bilateral set of the multinode data offloading policy model designed in this paper is composed of task nodes and helper nodes. From the above inequality relations, it is not difficult to see that, over all task node and helper node pairing possibilities, at least one of the following equations holds:

This inequality in turn implies the following bound:

From the preceding relation, it follows that

Thus, the expression for the total regret value under the MTDOsa-MAB strategy is

Using equations (12) and (13), equation (27) can be rewritten in terms of the number of decisions made for each helper node. Thus, the following equation can be obtained:

Therefore, the proof of equation (11) is completed.

4. Evaluation and Simulation

4.1. Data and Experimental Environment

In this paper, the effectiveness of the MTDOsa-MAB offloading strategy was evaluated through computer simulation implemented in Python (using PyCharm). The specific simulation parameters are shown in Table 2.

Task nodes explore the merits of the helper nodes under an unknown expected-reward pattern: the expected reward is what a task node receives for choosing a given helper node. To simplify the process, the instantaneous reward is assumed to follow a Gaussian distribution.
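The reward model can be sketched as follows; the standard deviation and the clipping to [0, 1] are illustrative assumptions, since the paper only states that the instantaneous reward is Gaussian.

```python
import random

def sample_reward(mean, sigma=0.1, rng=random):
    """Instantaneous reward for a (task, helper) pairing, drawn from a
    Gaussian around the pair's unknown expected reward and clipped to
    [0, 1]. sigma and the clipping range are illustrative choices."""
    return min(1.0, max(0.0, rng.gauss(mean, sigma)))
```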

The metrics evaluated in this paper are task node satisfaction, average data offload latency, cumulative regret, and the data loss rate due to selection conflicts. Task node satisfaction is the ratio of the number of data offload requests completed within the tolerable time delay to the total number of offload requests. Average data offload latency is the time each policy takes to process the relevant information and arrive at an offload decision. The cumulative regret is the accumulated difference in reward between the optimal offloading strategy and the actual offloading strategy as the number of experimental rounds increases. The loss rate due to selection conflict is the percentage of data offload failures caused by conflicts among multiple nodes under the different policies.
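The first and last of these metrics are simple ratios and can be sketched directly (function names are illustrative):

```python
def satisfaction(completed_in_time, total_requests):
    """Task node satisfaction: share of offload requests completed
    within their tolerable delay."""
    return completed_in_time / total_requests if total_requests else 0.0

def loss_rate(conflict_failures, total_requests):
    """Data loss rate: share of offload requests lost to selection
    conflicts among multiple nodes."""
    return conflict_failures / total_requests if total_requests else 0.0
```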

In order to verify the effectiveness of the MTDOsa-MAB policy, it was compared with the following policies through simulation experiments:
(1) EpsGreedy algorithm: the relationship between a random number and an a priori epsilon value decides whether to select the helper node with the highest reward or a random helper node.
(2) UCB1 algorithm: considers the average reward and confidence interval of the arms; each helper node corresponds to a confidence interval, and the helper node with the largest upper confidence bound is selected for data offloading.
(3) RR (round-robin) algorithm: based on round-robin scheduling; helper nodes are selected in turn to offload data.
(4) MTDOsa-MAB algorithm: improves the UCB1 strategy with stable matching theory from game theory and a fallback timer to select suitable helper nodes for data offloading.
(5) Random algorithm: randomly selects one node from the set of candidate helper nodes for data offloading.
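As an example of the baselines, the EpsGreedy selection rule in (1) can be sketched as follows; the dictionary-based interface is an illustrative assumption.

```python
import random

def eps_greedy_select(avg_reward, eps=0.1, rng=random):
    """EpsGreedy baseline: with probability eps pick a random helper
    node (exploration), otherwise pick the helper node with the
    highest average reward observed so far (exploitation)."""
    helpers = list(avg_reward)
    if rng.random() < eps:
        return rng.choice(helpers)
    return max(helpers, key=lambda h: avg_reward[h])
```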

4.2. Analysis of Experimental Results
4.2.1. Task Node Satisfaction

Task node satisfaction is one of the important evaluation metrics in this paper. It refers to the ratio of the number of data offload requests completed within the tolerable time delay to the total number of offload requests. Figure 4 shows the impact of the number of offload requests on task node satisfaction under the five policies. As can be seen, satisfaction under the random policy remained at around 31.8%: with randomly selected helper nodes, the main factor affecting satisfaction is selection conflict, and conflicting task nodes receive no reward. Compared to the random strategy, the RR strategy selects helper nodes by time-slice rotation, so the probability of selection conflict is reduced and the satisfaction rate is around 49%. Task node satisfaction was higher for EpsGreedy, UCB1, and MTDOsa-MAB than for the random and RR strategies. As the number of data offload requests increased, task node satisfaction tended to decrease under all three of these strategies; however, the decline was slower for MTDOsa-MAB, with satisfaction remaining above 90% even when the number of tasks reached 85. Specifically, as the number of data offload requests grew, MTDOsa-MAB satisfaction exceeded that of EpsGreedy and UCB1 by 32 and 22 percentage points, respectively. This is due to the stable one-to-one matching strategy, which effectively avoids conflicts and thus secures the reward feedback for data offloading.

4.2.2. Average Offload Delay

Figure 5 shows the effect of the number of data offload requests on the average offload latency. As can be seen in Figure 5, the random and RR offload policies have longer average offload latencies than the other three policies due to the nature of their selection rules. The EpsGreedy, UCB1, and MTDOsa-MAB policies all take longer to execute as the number of tasks increases. Compared to EpsGreedy, MTDOsa-MAB saved approximately 5 ms in execution time when the number of requests was low and approximately 32 ms as the number of tasks increased. This is because EpsGreedy, based on the greedy idea, mostly exploits the currently best helper node rather than exploring potentially better ones; moreover, repeatedly selecting the same helper node tends to exhaust its energy and cause node "death" in a wireless sensor network, which generates additional offloading delay. Task nodes in MTDOsa-MAB always make the best choice under the current conditions through the one-to-one matching mechanism and move to the next-best choice after being rejected, with linearly bounded time complexity. In the UCB1 strategy, the confidence-interval width of a helper node reflects its degree of uncertainty: the larger the interval, the higher the uncertainty, and vice versa. As the number of trials increases, the confidence intervals narrow, and the mean reward and confidence interval of each helper node are reestimated before each selection, reflecting the historical results of the trials already made. Compared to the UCB1 strategy, MTDOsa-MAB saves an average of about 9.75 ms in execution time.
MTDOsa-MAB therefore has a significant advantage in execution time and can effectively reduce latency expenditure for latency-sensitive applications.

4.2.3. Cumulative Regret Values for Several Bandit Strategies

The EpsGreedy and UCB1 offloading strategies are classical stochastic bandit algorithms in the MAB framework, while MTDOsa-MAB improves the MAB framework with stable matching theory and a fallback timer. The regret comparison therefore considers only these three strategies. Figure 6 illustrates how the cumulative regret of the three strategies changes as the number of experimental rounds increases. The cumulative regret of all three bandit algorithms grows logarithmically with the number of experiments. In each round, the pairing of multiple task nodes and multiple helper nodes, whether or not optimal, produces feedback about the data offloading reward after the offload completes; the difference between this feedback and the reward of the optimal pairing strategy is the regret for that round. As the number of rounds increases, the EpsGreedy and UCB1 algorithms have difficulty converging under multitask node data offloading, while the cumulative regret of MTDOsa-MAB converges more readily to a stable value, because the stable matching-based strategy effectively resolves collisions between nodes. For the multitask node offloading problem, a plain UCB1-based strategy does not resolve conflicts well, so the MTDOsa-MAB strategy, which improves on UCB1, gains reward by greatly reducing node conflicts.
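The per-round regret described above accumulates into the curves of Figure 6; a minimal sketch of that accumulation, in terms of expected rewards (names illustrative):

```python
def cumulative_regret(optimal_mean, chosen_means):
    """Cumulative regret: running sum of the gap between the optimal
    expected reward and the expected reward of each round's chosen
    pairing. Returns the regret curve, one value per round."""
    total, curve = 0.0, []
    for m in chosen_means:
        total += optimal_mean - m
        curve.append(total)
    return curve
```

Rounds where the optimal pairing is chosen contribute zero, so a strategy that converges to the optimal pairing produces a curve that flattens out, as MTDOsa-MAB does in Figure 6.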

4.2.4. Data Loss Rate due to Conflicts between Nodes

Figure 7 shows the rate of data loss due to selection conflicts for the EpsGreedy, UCB1, and MTDOsa-MAB policies under different numbers of offload requests. As the number of data offload requests increases, the percentage of data loss rises gradually for the EpsGreedy and UCB1 strategies, reaching around 18% and 12.4%, respectively. Evidently, the MTDOsa-MAB strategy, based on stable matching theory, avoids selection conflicts among task nodes under its one-to-one selection mechanism, so its data loss rate stays below 1.5% and remains almost constant as the number of requests grows. This smoother behaviour implies that the proposed MTDOsa-MAB policy has the potential to scale to larger network deployments.

5. Conclusions

In this paper, addressing the problem of data loss in the UWSN environment, we investigated how multiple task nodes can offload data in a coordinated, collaborative manner so that data loss due to selection conflicts is avoided. By modelling this problem as a MAB problem, a multitask node data offloading strategy based on stable matching and fallback timers was proposed. Simulation experiments show that, compared to two classical bandit algorithms, the MTDOsa-MAB algorithm achieves higher task node satisfaction, a significant reduction in execution time, and a lower data loss rate. Moreover, in the long run, its regret value converges to a smoother state. Beyond selection conflict and competition, the energy demand of nodes in UWSNs is another important factor affecting overall network performance and is a major direction worth investigating in future work.

Data Availability

The simulated evaluation data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant no. 62171413.