Abstract

With ongoing technological advancements, the practical challenges of establishing long-distance communication can be addressed using hop-oriented routing networks. However, long-distance data transmission usually deteriorates the quality of service (QoS), especially in terms of considerable communication delay. Therefore, in the presented work, a reward-based routing mechanism is proposed that aims at minimizing the overall delay and is evaluated under various scenarios. The routing process involves a refined CH selection mechanism based on a mathematical model until a threshold number of simulations is reached. Illustrations of the coverage calculations of CHs during route discovery are also provided for possible routes between the source and the destination to deliver quality service. Based on this information, the data gathered from past simulations are passed to the learning mechanism of the Q-learning model. The work is evaluated in terms of throughput, PDR, and first dead node in order to achieve minimal transmission delay. Furthermore, area variation is also considered to investigate the effect of an increase in the deployment area and the number of nodes on the Q-learning-based mechanism aimed at minimizing delay. The comparative analysis against four existing studies justifies the success of the proposed mechanism in terms of throughput, first dead node, and delay analysis.

1. Introduction

Wireless sensor networks (WSNs) are composed of tiny sensor nodes powered by small batteries. The sensor nodes are designed to transfer data from one end to the other, i.e., from a source node to a terminal node. The transfer takes place in collaboration with other nodes in the network, and a multihop path is used as the route for the data transfer. Every data transfer follows a process of broadcasting route requests (R-REQ) and collecting the responses, which are termed route replies (R-REP). This process is categorized under hierarchical routing [1]. When a sensor node is overoccupied with data to be transferred, or when it cannot transfer data due to a physical or software-oriented barrier, a time lapse is generated, which is termed delay [2]. One of the major reasons for delay is node overhead, and in order to reduce node overhead, the concept of clustering was introduced in the early stages of WSN development [3]. When a sensor node is unable to transfer or receive data, it indicates that the node has run out of stored energy, also called residual energy (RE). A node with zero residual energy is termed a dead node. Researchers have focused on maximizing the time until the first node dies in order to increase the lifetime and reduce the overall delay of the network, since a larger number of dead nodes produces packet overhead and delay. The CH selection method plays a crucial role in reducing the overall computational complexity and delay in the network. Since a WSN is scalable, it can always accommodate more nodes or gadgets; since it is adaptable, physical divisions are possible. A centralized monitoring system provides access to all WSN nodes, and no cords or wires are needed because the network is wireless. WSNs may be used extensively across a wide range of industries, including mining, healthcare, surveillance, and agriculture. They employ various security algorithms in accordance with the underlying wireless technology and thereby offer users a dependable network.

The use of procedures or technologies on a network to manage traffic and assure the functioning of critical applications under limited network capacity is known as quality of service (QoS). It enables enterprises to prioritize specific high-performance applications while adjusting total network traffic. QoS is commonly applied in networks that carry traffic for resource-intensive systems and is typically required for services such as Internet protocol television (IPTV), online gaming, streaming media, videoconferencing, video on demand (VOD), and voice over IP (VoIP). Using QoS in networking, organizations can improve the performance of the various applications running on their network and gain insight into bit rate, latency, jitter, and packet rate. By doing so, they can control network traffic and alter how packets are routed to the Internet or other networks to prevent transmission delays. This also ensures that the organization provides applications with the desired service quality and user experience. According to the definition of QoS, the main objective is to give networks and organizations the ability to prioritize traffic, accomplished by providing dedicated bandwidth, managed jitter, and decreased latency. The technologies employed to make this possible are crucial for improving the functionality of corporate applications and wide-area networks.

There are certain methods that may be applied to raise the service quality. Scheduling, traffic shaping, admission control, and resource reservation are the four most frequently used techniques. A switch or router receives packets from various flows for processing, and a good scheduling method treats the various flows fairly and appropriately. Several scheduling strategies are intended to raise the level of service; three of them are weighted fair queuing, priority queuing, and FIFO queuing. Traffic shaping is a method of regulating the volume and rate of traffic transmitted to the network; the token bucket and the leaky bucket are two mechanisms that can shape traffic. Resources such as buffers, bandwidth, CPU time, and others are required for a data flow. Admission control is the technique through which a router or switch accepts or rejects a flow based on established parameters known as flow requirements. Before accepting a flow for processing, a router examines the flow requirements to see whether its capacity (in terms of bandwidth, buffer size, CPU speed, and so on) and its prior commitments to other flows can accommodate the incoming flow.
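As a minimal illustration of the traffic-shaping idea mentioned above, the following sketch implements a simple token bucket; the rate and capacity values are arbitrary examples and are not taken from the paper.

```python
import time

class TokenBucket:
    """Simple token-bucket traffic shaper: packets may be forwarded only
    while tokens are available; tokens refill at a fixed rate."""

    def __init__(self, rate_tokens_per_s, capacity):
        self.rate = rate_tokens_per_s   # refill rate (tokens/second)
        self.capacity = capacity        # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, packet_cost=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_cost:
            self.tokens -= packet_cost
            return True      # packet conforms to the shaped rate, forward it
        return False         # packet exceeds the shaped rate, queue or drop it

# Example: shape traffic to roughly 100 packets/second with bursts of up to 20.
bucket = TokenBucket(rate_tokens_per_s=100, capacity=20)
print(bucket.allow())
```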

Reward-based routing encourages adopting the pathways that result in the least amount of interaction between intermediate nodes while taking into account the fact that each freshly forwarded packet causes some additional exposure. The reward (metric) connected with each routing choice is carried through incentive-based packets, which enables routers to assess their rules in light of past usage. In order to avoid the most vulnerable locations, reward-based routing is intended as a technology-neutral routing technique that develops a link between route cost and population exposure. Reward-based routing uses virtual currency (reward) to forward packets; therefore, the more exposure a node causes, the more reward it needs to do so. Reward-based routing, which is a legitimate stand-in for the aforementioned exposure index (EI), takes exposure into consideration in terms of the number of packets. There are two underlying ideas in reward-based routing: the first is to adjust the hop cost in a way that accounts for exposure, and the second is to create and maintain the least-exposed routes. Reward-based routing may have the interesting side effect of reducing the frequency of retransmissions, which further limits the exposure. It attempts to avoid routes with greater loads, which results in fewer collisions and retransmission events. A longer (average) battery lifespan, which encourages energy-aware behavior, is another benefit of the load balancing promoted by the RBR strategy.

Wireless sensor networks are gaining popularity in contemporary IoT-enabled industrial and home applications that use either heterogeneous or homogeneous sensors to gather the intended data. Because their applications are often geographically critical, WSNs are intended to function using self-powered sensor nodes. Such nodes must operate energy-efficiently to ensure an adequate network lifespan. Cluster head selection is an important element in a WSN design that focuses on minimizing network energy usage. It organizes sensor nodes in such a manner that a well-structured network of clusters is produced, resulting in an increased lifetime and reduced power consumption. The stability of the cluster head (CH) has a significant impact on the network's robustness and scalability; a stable CH assures minimal intracluster and intercluster communication, which decreases the overhead. This article proposes a novel CH selection method based on a reward mechanism generated by the proposed Q-learning method to enhance network performance and to reduce the overall delay. In typical routing algorithms, routing decisions are made using shortest-path algorithms. As a result, the nodes that lie along the shortest paths deplete faster than other nodes, resulting in a shorter network lifespan. Numerous routing techniques have been developed to increase network longevity. Energy-efficient algorithms also take coverage into account: they can maintain k-coverage while achieving maximum coverage for a given region using the least amount of energy. Low energy usage alone does not, however, guarantee a long network lifetime; the distribution of residual energy also affects the network lifespan. Q-learning follows a reward generation mechanism that is defined based on agents carrying state variables against a specified set of actions, as described in the methodology section of this paper. The proposed network model is scalable, and Q-learning algorithms have been found to be well suited to scalable problems [4]. The Q-learning algorithm produces a Q-value for each defined action. The proposed overall algorithm architecture is compared with other state-of-the-art techniques discussed in the related work section. The predominant objective of this article is to reduce the overall delay of the network by selecting the most suitable CH based on the reward generated by Q-learning. The main contributions of this article are as follows:
(1) The article considers a quality of service (QoS)-oriented network with minimization of delay as the primary objective. The defined network illustrates delay minimization by considering the packet delivery ratio and the packet flow ratio, alias throughput.
(2) The article proposes a reward-based routing mechanism to reduce the overall delay of the network under all the scenarios encapsulated in the article. The routing mechanism utilizes the CHs formed in the network. CH formation is performed using the mathematical model until a threshold number of simulations is reached. The information gathered over the "t" threshold simulations is passed on to the learning mechanism of the proposed Q-learning model. Once a historical record of the participant CHs is created, a hybrid mechanism is applied that combines the current selection through the mathematical model with the reward generated by the proposed Q-learning mechanism, and the CH selection is performed on the basis of the node's hybrid score.
Table 1 illustrates the notations used in the paper.

The rest of the article is organized in the following manner. Section 2 presents the related work, which encapsulates route selection methods using machine learning and QoS. Section 3 presents the detailed architecture of the proposed hybrid protocol for route selection, path optimization, and delay minimization. Section 4 presents the results and analysis, and Section 5 concludes the article.

2. Related Work

The passage of time has witnessed a revolution in the design of sensors used in wireless networks. The advanced sensors are characterized by light weight, miniature size, and low power consumption. However, the delay observed in the data transmission process has remained a critical challenge that constantly draws the attention of the scientific community. The rising impact of WSNs in real-world applications has added pace to the research on strategies that guarantee network reliability and minimize energy consumption and end-to-end delay during data transfer. However, such issues still arise at some stage during communication. Förster and Murphy conducted an investigational study of machine learning at all layers of the WSN network stack. A detailed outline of various machine learning algorithms, including decision trees, neural networks, and reinforcement learning, was provided. The study was mainly referred to for its discussion and application of RL algorithms, and the existing work illustrates the efforts and applications at each level of the stack [5]. Predictive analytics, as is well known, uses methods such as predictive modeling and machine learning to examine historical data and forecast future patterns; in contrast to conventional forecasting techniques, neural networks are unique. Compared with a neural network, the most popular model, linear regression, is a fairly straightforward approach to problem solving. Because of their hidden layers, neural networks perform predictive analytics more effectively: linear regression models use only input and output nodes to generate predictions, whereas a neural network also uses hidden layers to improve prediction accuracy, since it "learns" in a way similar to humans. Neural networks can be prohibitively expensive due to their high computing power requirements, and massive data sets are required to train them, which an organization might not have. But as computing becomes more affordable, the first obstacle could soon vanish, and thanks to technologies such as ANNs, "unpleasant shocks" may soon become rare. In this respect, Velmani and Kaarthick proposed a simple scheme to mitigate node mobility and delay issues. The proposed cluster-independent data collection tree scheme addressed the challenges of cluster formation and CH selection based on CH location. The simulation analysis showed that the scheme provided better quality of service in terms of throughput, energy consumption, network lifetime, and end-to-end delay [6]. In the same year, Alsheikh et al. presented a review of machine learning algorithms in WSNs in which they evaluated the advantages and disadvantages of existing machine learning algorithms. They also outlined the existing challenges while summarizing the machine learning-based methods adopted to address these challenges in distinct research areas [7]. Han et al. evaluated CH selection strategies in different routing protocols. The nodes in the existing heterogeneous WSN were analyzed to investigate when the first node died and its impact on the number of packets transferred from source to sink. The most important aspect of the research was the evaluation of the monitoring ability of each protocol under study for selecting the number of CHs in the defined network [8]. Levendovszky and Thai developed a quality-aware routing algorithm that could perform routing in an energy-efficient manner.
This energy efficiency was achieved by selecting particular nodes in the multihop network for packet transfer. The algorithm achieved energy balancing under various reliability constraints and was also termed a high quality-of-service routing algorithm by the authors due to its selection of near-optimal paths within the network [9]. Chen et al. addressed various constraints, such as high energy consumption, long transmission time, and path-movement constraints, that are important for network optimization. Furthermore, the work also led to a reduction in the energy consumed by nodes during latency periods [10]. Wang et al. adopted a cross-layered routing protocol that was capable of power control and could adapt and coordinate with network dynamics. In the process, the researchers implemented a multiagent Q-learning mechanism that was analyzed using a delay-based nonselfish cost function. The simulation analysis performed using the reward function improved the throughput and end-to-end latency [11]. Fei et al. made development efforts to address various conflicting optimization criteria. The work was mainly referred to for its metric curves and its concept of multiobjective optimization. The study also summarized the energy consumption vs. latency trade-offs along with the lifetime vs. performance trade-offs of existing approaches [12]. Kumar and Kumar used hierarchical clustering in which two swarm intelligence algorithms were integrated one after the other: the artificial bee colony (ABC) algorithm was implemented for the selection of CHs, and ant colony optimization (ACO) was used for efficient data transmission. The CHs were selected, and subclusters were formed based on a threshold value, resulting in enhanced stability of the designed network architecture [13]. Sun and Park provided route choice modeling using SVM and neural network architectures and were mainly referred to for distinguishing route prediction accuracy. The route prediction accuracy was 0.6833 using NN and 0.7086 using SVM, with computing times of 7640.20 s for NN and 602.27 s for SVM [14]. SVM is one of the most effective machine learning algorithms since it is a highly sophisticated and mathematically sound method. It is a versatile method that can address a variety of problems, including regression problems; binary, binomial, and multiclass classification problems; and linear and nonlinear problems. By utilizing the idea of margins and attempting to maximize the separation between two classes, SVM lowers the likelihood of model overfitting and increases the model's stability. Because kernels are readily available and because SVM is based on fundamental principles, it can operate with ease whenever the data are high-dimensional; it is precise on high-dimensional datasets to the point where it can compete with algorithms such as Naive Bayes, which is adept at handling classification problems with extremely high dimensions. SVM is renowned for its memory management and processing speed; in particular, compared with other machine learning and deep learning algorithms, with which SVM frequently competes and occasionally even surpasses to this day, it requires less memory. Jin et al. observed that in some circumstances the replacement of batteries becomes too difficult; therefore, in addition to low network latency and a high transmission rate, an extended network lifetime is a necessity.
In this context, the authors postulated a Q-learning-based delay-aware routing algorithm that could easily adapt to the changing environment in the vicinity of the sensor nodes. Furthermore, an action utility function was defined in which decisions were based on the observed delay and residual energy to assure uniform routing. Simulation analysis showed that the mechanism achieved a 20% to 25% reduction in end-to-end delay [15]. Jan et al. addressed the energy utilization issues in WSNs; however, balanced energy consumption was achieved at the cost of a slight increase in the end-to-end delay observed during transmission [16]. Li et al. adopted optimized CH selection to minimize network delay and reduce the energy consumed in the CH selection process. A variable step size firefly algorithm was adopted for the identification of the head node to assign the CH within each cluster. The objective function was based on a number of parameters such as intracluster distance, probable cluster heads, and residual energy. The delay and latency observed during data transmission were attributed to the controlled size of the transferred data packets. It was observed that lower duty-cycle operation followed by efficient cluster formation significantly reduced the network delay and latency of the sensor network [17]. Li et al. proposed a routing algorithm that could minimize the interference delay during data transmission; furthermore, it was also associated with efficient management of energy resources, resulting in quality routing [18]. Elappila et al. presented survival path routing based on a survivability factor, the observed interference, and the noise within the traced path from one hop to another. The evaluation studies showed that the designed protocol improved the throughput and packet delivery at the destination with comparatively lower end-to-end delay; however, the number of packets delivered decreased due to congestion observed during data transmission [19]. Ilamathi and Rangarajan focused on determining the shortest path using optimization techniques and a neural network architecture. The simulation analysis showed that integrating harmony search as the optimization technique resulted in reduced time complexity and energy consumption [20]. Alghamdi addressed the challenges of CH selection based on delay, energy, distance, and security. In WSNs, optimal CH selection is the key element of data transmission, which was addressed using hybrid optimization in which the firefly algorithm optimized the positions and the dragonfly algorithm replaced and updated them. Thus, a more refined CH selection was performed, resulting in the least computational time of 9102 ms for the proposed hybrid-optimization-based CH selection, analyzed over 1000 simulation rounds. It was observed that the risk probability, energy, and live-node analyses were performed for 2000 simulations, but the delay analysis was restricted to only 1000 simulations; it was concluded that the presented observations might change if the analysis were performed for more simulation rounds [21]. Adil et al. adopted a hybrid routing scheme that combines three routing protocols.
The CH formation was performed dynamically for a particular time duration using the dynamic cluster-based static routing protocol (DCBSRP), associated with the AODV routing protocol and the low-energy adaptive clustering hierarchy (LEACH) protocol. It was observed that static routing limited the number of nodes selected for a fixed interval in the CH selection process. Once that particular interval was completed, the nodes associated with the CH were released, and reassignment of the CH node was performed. The simulation analysis showed that, using the proposed scheme, 95.9% of the total nodes participated in routing, which significantly balanced the network load and minimized the transmission delay. The latency of the network under the designed scheme was found to be quite consistent due to the unicast communication of the clusters [22]. On large networks, dynamic routing is considered simple to set up. It is also believed to be more user-friendly than static routing in terms of choosing the optimum path, detecting route modifications, and discovering faraway networks. However, because routers constantly exchange updates, dynamic routing always uses more bandwidth than static routing does. The extra burden caused by routing protocols is also felt by the routers' CPUs and RAM. Finally, dynamic routing is believed to be less secure than static routing. There are several interior gateway protocols (IGPs) that might be applied within the network, and every router and server operating system, including Windows 2003 Server and Linux, supports these protocols. It follows that routing protocols are really a collection of languages that routers employ to share routing data with other routers. The key advantage of a routing protocol is its capacity to adapt to changing network topology.

In the reactive protocol known as ad hoc on-demand distance vector (AODV), routes are created only when they are required. AODV is most commonly used in mobile networks. One entry is used in the routing table for each destination. In order to ensure that routing information is current and to avoid routing loops, sequence numbers are employed. Reactive protocols such as AODV tend to reduce the overhead of control traffic messages. Only the nodes affected by network topology changes receive updates from AODV, which responds to these changes reasonably quickly. The AODV routing protocol conserves energy and storage space. Nodes reply only once to the initial request for a given destination and disregard subsequent duplicates. Each destination has at most one entry in the routing table. AODV scales to a large number of mobile nodes and is loop-free. Route processing is handled by AODV without the need for any central management.

LEACH is a low-energy clustering hierarchy protocol, and the LEACH procedure has various benefits. First, the cluster heads aggregate the data, which reduces overall network traffic. Second, since routing from nodes to cluster heads requires only one hop, there is a significant reduction in energy usage. Third, this protocol enhances and lengthens the network's life. Fourth, it is not required to know a node's location to construct a cluster. LEACH is entirely distributed and self-organized since it does not require any control information from the ground station. The LEACH protocol was the first hierarchical routing protocol that enabled data fusion, and it is crucial among clustering routing protocols. In addition to the discussed literature, numerous researchers have combined multipath techniques with machine learning approaches. A handful of the referred machine learning-inspired research publications are further summarized in Table 2.

It has been observed that the selection of CH in each cluster is one of the most challenging tasks to minimize transmission delays and energy consumption. The aforementioned literature shows that the efficacy of CH selection was addressed by numerous researchers to minimize the transmission delay, increase network lifetime, and reduce energy consumption.

3. Proposed Algorithm Architecture

The proposed algorithm architecture is a three-phase model. The first phase performs data aggregation based on statistical routing approaches. The second phase generates a reward mechanism that utilizes the data aggregated in the first phase using the Q-learning architecture, and the third phase incorporates the reward generated in the second phase into the routing architecture of the first phase. The overall system architecture is explained using Figure 1. The proposed network is simulated under an elastic notion of area and node variation. Each of the "n" nodes in the specified network N is assigned an initial residual energy RE = 100 mJ at the beginning of the simulation. Each of the four zones that make up the full simulated region is patrolled by a different drone. A zone may contain several clusters. To save communication overhead, the drones communicate only with the cluster's cluster head (CH). The deployment locations of the sensor nodes are chosen at random. The total number of CHs needed to account for the "n" nodes in the network must be determined before the simulation can begin. For the initial simulation, the maximum RE is used to determine each region's CH. To make the most of the available resources, each node in the proposed simulation environment may receive data from and send data to other nodes. To exercise the routing, source and terminal nodes are created; they identify their CH and pass the data packet, carrying the packet characteristics, to the CH.

3.1. Phase 1: The Routing Architecture and the Data Aggregation with Action Label Generation

The proposed network is simulated under the elastic concept of area and node variation. The designed network N has "n" nodes, each with RE = 100 mJ (the energy assigned at the start of the simulation), the energy consumption model Econ defined in Table 1, and a simulation area "Ar". Four mobile sinks, positioned according to the geographic location of the sensor nodes, are shown in Figure 2. The mobile sinks are drones and assist the faulty wireless nodes in the network. In the proposed work, CH formation is distinct from zone formation for the drones. The total number of CHs in the network is decided by the following ordinal measures:
(a) The entire simulation area is divided into 4 zones, and each zone is covered by one drone
(b) One zone may have more than one cluster
(c) The drones communicate only with the cluster head (CH) of each cluster to avoid communication overhead
(d) The deployment position of the sensor nodes is random

The existing literature referred for various parameters is listed in Table 3.

The simulation starts with the identification of the total number of CHs required in order to compensate for the "n" nodes in the network. As the RE is identical for each node at the start of the simulation, the distance to the base station becomes the only criterion for selecting the CH. Drones follow a random walk model to move from one location to another [23]. In the objective function, Econ is the total energy consumption, delay is the propagated delay, r is the total number of simulation rounds, and "n" is the total number of nodes in the network. The objective function aims to minimize the delay produced in each simulation round for every participating node in the list.
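Interpreting the variables just listed, one illustrative way to write this objective (an assumed form, not necessarily the exact expression used by the authors) is

\min \; F = \frac{1}{r}\sum_{i=1}^{r}\sum_{j=1}^{n} \mathit{delay}_{i,j} \quad \text{subject to} \quad \sum_{i=1}^{r}\sum_{j=1}^{n} E_{con}^{(i,j)} \le n \cdot RE,

where delay_{i,j} and Econ^{(i,j)} denote the delay produced and the energy consumed by node j in simulation round i.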

The CH selection process utilizes the following equation in order to determine the total number of CHs required for the supplied number of nodes and area [24]:
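As an illustrative form only (the exact equation from [24] may differ), the optimal number of cluster heads is commonly written in the LEACH-style literature as

k_{opt} = \sqrt{\frac{n}{2\pi}} \cdot \sqrt{\frac{\varepsilon_{fs}}{\varepsilon_{mp}}} \cdot \frac{\sqrt{Ar}}{d_{toBS}^{2}},

where n is the number of nodes, Ar the deployment area (so that \sqrt{Ar} is its side length), \varepsilon_{fs} and \varepsilon_{mp} the free-space and multipath amplifier energies, and d_{toBS} the average distance to the base station.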

The attraction index is calculated as follows:

For the first simulation, the CH of every region is selected based on the maximum available RE.

In the proposed simulation environment, each node can receive and transfer data from multiple nodes in order to utilize the resources as much as possible. In order to exercise the routing, source and terminal nodes are generated that identify their CH and forward the data packet, containing the packet attributes, to the CH. The node vector used in the present work is inspired by Lee's work [31] and comprises 7 features, as shown in Table 4. In further processing, each feature becomes a part of the learning architecture. The device type indicates the type of device that needs to forward the data (sensor node or drone). RSSI is an abbreviation for the received signal strength indicator; it is an approximate measure of the power level received by an RF client device from an access point or router. At greater distances the signal weakens and wireless communication speeds decline, resulting in reduced total data throughput. The RSSI measures signal strength and, in most circumstances, indicates how effectively a specific radio can hear the remotely linked client radios. RSSI levels are often lower on wider channels; smaller channel widths are thus advised in all but a few exceptional cases when installing EnGenius APs. The signal strength is passed on in the form of the RSSI value computed in the algorithm. Hop number counts the hop number within that particular route. Cov list holds the neighbouring nodes of the sender node. Neighbour node number gives the number of the next node to which data are currently being forwarded. RE left measures the remaining residual energy of the sender node. Distance depicts the current distance between the sender and the current intermediate node to which the data are being forwarded.
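A minimal sketch of how such a 7-feature node vector could be represented is given below; the field names are illustrative, inferred from the description above rather than copied from Table 4.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeVector:
    """One routing record passed on to the learning phase (assumed field names)."""
    device_type: str            # "sensor" or "drone"
    rssi_dbm: float             # received signal strength indicator
    hop_number: int             # position of this hop in the route
    cov_list: List[int] = field(default_factory=list)  # neighbours in coverage
    neighbour_node: int = -1    # next node currently receiving the data
    re_left_mj: float = 0.0     # residual energy left at the sender (mJ)
    distance_m: float = 0.0     # sender-to-intermediate distance (m)

# Example record for a sensor node forwarding to node 17 on its 2nd hop.
sample = NodeVector("sensor", rssi_dbm=-61.5, hop_number=2,
                    cov_list=[12, 17, 23], neighbour_node=17,
                    re_left_mj=82.4, distance_m=134.0)
print(sample)
```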

The CH uses a broadcast mechanism and follows the ad hoc on-demand distance vector (AODV) routing protocol for the dynamic response messages, i.e., route replies (R-REP). The process of broadcasting the route request (R-REQ) continues among the CHs until the CH associated with the terminal (destination) node is found.

3.1.1. The Routing Protocol

The AODV protocol was first developed without any consideration of security; therefore, no protective mechanism was created to recognize the presence of a malicious attack. Due to the fast change in network structure, maintaining a fresh route to the target is one of AODV's essential jobs, and it is carried out by the route discovery procedure. In AODV, the destination sequence number and the number of hops are crucial characteristics for identifying a route's freshness, and these characteristics are simple for attackers to manipulate. Because of this, the security of ad hoc routing protocols such as AODV is essential, and MANET researchers all over the world are continually looking for ways to create a routing protocol that is both safe and effective for wireless ad hoc networks. AODV is a dynamic reactive routing protocol: a route is constructed on demand (upon request by the source node). Route request (RREQ) and route reply (RREP) are two crucial control messages in AODV route discovery. Both control messages contain a crucial property, the destination sequence number, whose incremental value is used to assess the route's freshness. The proposed routing protocol is inspired by the AODV routing protocol, which uses a broadcast mechanism. Cluster-oriented routing has been incorporated to reduce the computation delay of data communication. In order to demonstrate the routing strategy, a working example with 80 nodes is illustrated as follows.

Figure 3 represents the deployment of a network over a 1000 m × 1000 m area, i.e., the network height and width are both taken as 1000 m. Here, 80 nodes are deployed, and the whole network is clustered into 8 segments based on coverage capability. Afterwards, a cluster head (CH) is selected in each region (represented in red in Figure 3); the CH node numbers are 48, 60, 9, 8, 20, 62, 76, and 50. In this scenario, node 11 is considered the source, and node 28 is considered the destination node. Here, the source node belongs to CH5 and the destination node belongs to CH8, so a route formation mechanism is needed to transmit data packets from the source to the destination node. The most suitable and fastest routing protocol is used to discover a route from the source to the destination node via CHs on demand. The "Request (RREQ)–Reply (RREP)" routing mechanism used here helps to minimize the transmission delay by avoiding routing loops, according to Table 5.

Here, the source node is a member node of CH5 and transfers its data packets to CH5. Afterwards, CH5 starts broadcasting an RREQ message packet to the neighbouring CHs based on the coverage limit. According to Figure 4, the calculated coverage of the CHs is given in Table 6.

Here, CH5 starts broadcasting an RREQ message packet to its neighbours CH4 and CH7. The broadcast structure contains information about the source CH and the destination CH and can be illustrated as (source CH, hops, destination CH); in the current example, it is (CH5, hops, CH8). CH4 and CH7 receive this packet and revert back to CH5 with RREP = [CH5, Hop-Count, CH4/CH7]. The R-REQ is the "hello message" carrying information about the source CH and the destination CH with their hop count, and the RREP is generated in response by the neighbouring nodes and identifies those nodes. This process continues until the destination CH is found, and the entire route discovery mechanism is shown in Figure 5.
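A compact sketch of the RREQ/RREP exchange between CHs described above follows; the message layout mirrors the (source CH, hops, destination CH) triple from the text, while the class and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class RREQ:
    src_ch: str        # CH of the source node, e.g. "CH5"
    hop_count: int     # hops traversed so far
    dst_ch: str        # CH of the destination node, e.g. "CH8"

@dataclass
class RREP:
    src_ch: str        # the CH the reply travels back to
    hop_count: int
    replying_ch: str   # the neighbour CH that answered

def broadcast(req: RREQ, neighbours):
    """Each neighbour CH in coverage answers the request with an RREP."""
    return [RREP(req.src_ch, req.hop_count + 1, ch) for ch in neighbours]

# CH5 broadcasts (CH5, 0, CH8); CH4 and CH7 are in coverage and reply.
replies = broadcast(RREQ("CH5", 0, "CH8"), ["CH4", "CH7"])
print(replies)  # [RREP(src_ch='CH5', hop_count=1, replying_ch='CH4'), ...]
```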

The CH of node 11 starts sending routing request packets (RREQ) to all neighbouring CHs to find an optimal path through the calculation of the total power consumption and transmission delay. After receiving the RREQ, the intermediate CHs reply to the sender CH with an RREP packet, creating a reverse path to the CH of node 11. This process continues, forwarding the RREQ message to neighbouring CHs, until the message packet is received by the CH of destination node 28. The destination CH of node 28 then generates an RREP message and transmits it through the reverse path to the CH of source node 11. After the CH of source node 11 receives RREP message packets through the various possible routes, a forward route is selected based on the minimum power consumption and transmission delay. There are a total of four possible routes from source to destination via CHs using this routing mechanism. Based on Figure 5, the constructed routing table is given in Table 7.

Based on the concept of minimum power consumption and transmission delay, route 2 is selected as an optimal route for data transmission from the source to the destination node. The algorithmic structure of the protocol is given as follows.

3.1.2. AODV Routing Protocol

Input: WN ⟶ Number of Nodes
WS ⟶ Acts as the Source Node
WD ⟶ Acts as the Destination Node
Output: WSN-FR ⟶ the final route developed from WS to WD
(1) Start
(2) WN start broadcasting an RREQ (Route Request) message to neighbour nodes and CH, where CH ⟶ Cluster Head
(3) Create an RREQ message, RREQ = [WN, Hop Count, WD]
(4) Initially consider Hop Count = 0 // at the start of route formation, no node can yet be considered an intermediate node
(5) Consider an array for routing, W-Route = [] // an empty variable to store route information
(6) W-Route (1st Node) = WS // routing always starts from the source node
(7) Set a variable WD-Found = 0 // initially, the destination is not found
(8) While (WD-Found ≠ 1)
(a) The 1st node broadcasts the RREQ to the neighbours in its own coverage and records the Hop Count
(b) Each neighbour node receives and analyzes the request
(c) If W-Route(Node [WN, Hop Count, WD]) == NN[WN, Hop Count, WD] // where NN ⟶ Neighbour Node
(i) W-Route += Neighbour NN is the WD
(ii) Each node sends an RREP (Route Reply) to WS // feedback sent by a node in the coverage of NN
(iii) Hop Count += 1
(d) Else
(i) W-Route += Neighbour WN
(ii) Send RREP to WS
(iii) Hop Count += 1
(e) End-If
(i) Update the route array and repeat the above steps until WD is found
(f) If (WD == 1)
(i) WD-Found = 1
(ii) Break
(g) End-If
(i) Possible routes, R = R1, R2, R3, …, RN
(h) For r = 1 ⟶ R
(i) Current W-Route, R = R(r)
(ii) Calculate the distance (D) from WS to WD
(iii) If D is minimum then
(1) WSN-FR = R(r) // the current W-Route becomes the final route
(iv) Else
(1) Check the next route condition
(v) End-If
(i) End-For
(9) End-While
(10) Return: WSN-FR as the final route from WS to WD in the WSN
(11) End-Function

3.1.3. The Ground Truth

A formed route is categorized as a good and effective route if the evaluated QoS parameters fit the network demand. The proposed solution evaluates the throughput, energy consumption, network delay, and packet delivery ratio (PDR), where r is the total number of simulation rounds, Pn is the total number of participating nodes in one route, and timelapse( ) computes the time interval between the expected arrival time "at" and the actual time of arrival "act."
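With these quantities, one hedged reading of the delay metric (an assumed form; the exact equations belong to the original formulation) is

\mathit{delay} = \frac{1}{r \cdot P_n} \sum_{i=1}^{r} \sum_{j=1}^{P_n} \mathrm{timelapse}(at_{i,j}, act_{i,j}), \qquad \mathrm{timelapse}(at, act) = act - at,

and the PDR is taken, as is standard, as the ratio of packets received at the destination to packets sent by the source.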

The aggregated parameters are passed to the k-means algorithm, keeping the centroid centred on the attribute list in order to reduce the convergence issue of k-means. Two action labels, "Stable" and "NonStable," are applied to the aggregated data. The labelling is performed by calculating the root mean squared error (R-MSE) between the route classified under a class and the centroid of that class. A lower R-MSE value indicates that the record belongs to the good class due to the correlation between its data elements, where t is the total number of attributes used for the centroid calculation and C is the centroid of the respective cluster.
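With t attributes per route vector x and centroid C, the R-MSE used for the labelling can be read as the usual root mean squared distance (illustrative form):

\mathrm{R\text{-}MSE}(x, C) = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \left(x_i - C_i\right)^2}.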

3.2. Phase 2: Proposed Learning Algorithm for Reward Generation for Optimal Route Selection

The proposed learning mechanism for reward generation follows the Q-learning mechanism and integrates a support vector machine (SVM) for the classification of the action labels. The action labels are defined as stable and nonstable routes based on the R-MSE calculated in phase 1. The naming convention is elaborated in the algorithm description. The learning mechanism is illustrated as follows.

3.2.1. State

In the case of the proposed solution, the state variables are collected over "r" simulations in which "rt" routes were formed. The proposed solution aggregates the data from the discovered routes defined in the multiroute description later in this phase. In the case of the proposed solution, rt is equal to 10,000. Hence, over the interval 1 to r, the state S(t) is defined by the aggregated route attributes together with two binary indicators that act as the ground truth. When the R-MSE value is higher, the binary indicator is set to 1 and remains 0 in the other case. The binary indicators are also referred to as action labels in this article.

3.2.2. Action

The action is the selection of the CH in the route with the maximum value of the objective function, where R is the estimation function defined as follows:

3.2.3. Reward

The reward function is defined as the delay efficacy obtained by executing the action A(r) in the state S(t).

Delay efficacy (DE) is measured in terms of the difference between the minimum delay and the average delay of the routes discovered so far. The reward is generated if the transition function does not cross the limit of the delay efficacy (DE).
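Read literally, the delay efficacy at round r can be written as (assumed form)

DE(r) = \overline{\mathit{delay}}(r) - \mathit{delay}_{\min}(r),

i.e., the gap between the average delay and the minimum delay over the routes discovered up to round r.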

3.2.4. Transition Function

As a result of the data transfer in the network in a given simulation round r, a transition function T(A(r), S(r)) is evaluated under the architecture of the policies generated through the training mechanism of the support vector machine (SVM). Due to the absence of information on the strategies followed by other CHs to choose the next CH in the route list, each CH demonstrates a selfish pattern, selecting a strategy to attain the maximum reward. This selection procedure might distort the selection strategy, which may increase the delay further. This leads to the formation of a collaborative inference engine to minimize the overall delay: a drone-assisted strategy inference engine defined over a finite set of Y candidate strategies.

The process results in knowledge inference of the other CHs to select the next CH in the list for the anticipated discounted return, where CH count is the total number of CHs in the list. The transition function uses a training mechanism borrowed from the utilization architecture of SVM: a kernel is utilized to form the architecture of the transition function. SVM utilizes a kernel function to attain hyperplane separation. The most common kernel is the polynomial kernel, expressed in the equation below, but it is not recommended for complex and high-volume data.
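The standard polynomial kernel referred to here, with the same adjustable parameters named in the next paragraph, has the well-known form

K(x_i, x_j) = \left(\alpha \, x_i^{\top} x_j + c\right)^{d}.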

The slope α, the constant term c, and the polynomial degree d are adjustable parameters. The proposed solution utilizes the radial basis function (RBF) as the kernel for SVM.
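The RBF kernel mentioned here, with the width parameter σ discussed in the next paragraph, takes the familiar form

K(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right).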

The performance of the RBF kernel depends on the adjustable parameter sigma, which should be carefully tuned according to the dataset. If it is overestimated, the exponential behaves almost linearly, and the higher-dimensional projection starts to lose its nonlinear power. On the other hand, if it is underestimated, the function lacks regularization, and the decision boundary becomes highly sensitive to noise in the training data. The learning of the engine is based on a learning rate, which indicates how strongly the learning agent changes its Q-value: a learning rate of 0 denotes no learning, whereas a learning rate of 1 indicates that new information totally replaces the old. The learning rate of the proposed model is stored in a structure containing all the information. The Bellman equation states that the long-term benefit of a particular action equals the sum of the reward from the present action and the anticipated rewards from the subsequent actions; it may therefore be used to determine whether the long-term reward, which is the purpose of reinforcement learning, has been maximized. Bellman's optimality equation can be written as follows.
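The textbook statement of the Bellman optimality equation described in the next paragraph (quoted here in standard notation, which may differ slightly from the authors' own symbols) is

q_{*}(s, a) = \mathbb{E}\!\left[R_{t+1} + \gamma \max_{a'} q_{*}(s', a')\right],

where \gamma is the discount factor and (s', a') is the next state-action pair.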

This equation states that, at time t, for any state-action pair (s, a), the expected return from starting in state s, taking action a, and then following the optimal policy afterwards equals the expected reward Rt+1 obtained by choosing an action in state s plus the maximum "expected discounted return" achievable from any potential next state-action pair (s′, a′). The key point is that a reinforcement learning system can use this equation to determine the action that maximizes q and thereby obtain the optimal q-value and the optimal policy; this is why the equation is crucial. The Bellman optimality equation and the optimal value function are connected iteratively. The Q-table is updated using Bellman's optimality equation for the reward generation, as defined below. The proposed algorithm obtains the discount factor by evaluating the record with SVM, as illustrated in the following. The proposed algorithm views every CH as a node and every route as a link edge between CHs. If there exists a direct route between a node and a CH, the Q-table represents it as 1, and if there is no direct link between two nodes, it is represented as −1.

If there is no direct communication link between node A and node B but communication exists through any other node, then the connection value is 0. In order to illustrate the architecture, a 5-node sample architecture is presented in Figure 6.
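A minimal sketch of this Q-table encoding and of a Bellman-style update for a hypothetical 5-node example follows; the adjacency values and the learning rate and discount factor are invented for illustration and do not correspond to Figure 6.

```python
import numpy as np

# Q-table encoding from the text: 1 = direct link, -1 = no link,
# 0 = reachable only through an intermediate node. Values are illustrative.
q_table = np.array([
    [-1,  1,  0, -1,  1],
    [ 1, -1,  1,  0, -1],
    [ 0,  1, -1,  1,  0],
    [-1,  0,  1, -1,  1],
    [ 1, -1,  0,  1, -1],
], dtype=float)

alpha, gamma = 0.5, 0.8   # learning rate and discount factor (assumed values)

def bellman_update(q, state, action, reward, next_state):
    """Standard Q-learning update based on the Bellman optimality equation."""
    best_next = q[next_state].max()
    q[state, action] += alpha * (reward + gamma * best_next - q[state, action])

# Example: taking the edge 0 -> 1 yields a reward of 10 and lands in state 1.
bellman_update(q_table, state=0, action=1, reward=10, next_state=1)
print(q_table[0, 1])
```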

As the Q-learning algorithm initially selects a random action from the list of policies defined by the broadcast mechanism, the transition function provides a discount factor based on the semisupervised policy imposed by the RBF kernel. If the selected CH is classified as true against its own action label, it receives a discount and is not penalized. If the selected CH is not classified as true against its own action label, the transition function computes the following, where rtd is the total number of discovered routes in which all nodes have a direct connection to each other as per the Q-table explained in Table 4 and rt is the total number of discovered routes. The proposed transition function aims to optimize the delay as the reward, utilizing throughput, PDR, and energy consumption, considering that a rise in delay necessarily raises the energy consumption. Looking at the other scenario, a sustainable PDR can be attained only via a sustainable route, and hence the energy consumption is directly proportional to the produced delay. The proposed transition function takes the maximum reward in order to increase the chances of the delay being reduced through the most sustainable route [18]. The generation of the reward mechanism is illustrated by the reward mechanism algorithm as follows (Algorithm 1).

(1) // extract the states by collecting the route values as illustrated in the state definition earlier in the article //
(2) // create an empty Q-repository that will contain the Q-table and the transition results used to update the Q-table //
(3) // set the value of k = 2, as there are two action labels, namely optimal and nonoptimal //
(4) // apply k-means on the data gathered in the itr time interval with k = 2; k-means returns the route index, with the action label defined as 1 for optimal and 2 for nonoptimal //
(5) // compute the R-MSE of each route against its cluster centroid using equation (7) //
(6) // if the R-MSE is higher, the class is labelled as nonoptimal, and vice versa //
(7) // initialize the Q-table as illustrated in Table 4 //
(8) // SVM rating follows a polynomial kernel for hyperplane separation //
(9) // the transition is said to be successful if the data gathered as state variables map onto the hyperplane while being selected for the gradient satisfaction of the defined policies //
(10) where T′ .
(11)
(12) // represents that the route will get the maximum reward for this action //
(13)
(14) // represents that the route will get a discounted reward for this action //
(15) Else
(16) // represents that the route will not get any reward for this action and might also be penalized depending on the distance of the result from its ground truth //
(17) // create a policy T-policy according to the transition actions and update the learning policy //
(18) // the proposed transition function introduces a semisuccessful transition: if the state was selected during the plane policy of the transition function but was unsuccessful under the mapping policy, the transition is called semisuccessful and receives a partial reward //
(19) // where T is the transition function and the selection strategy is the one declared in kernel K //
(20) // calculate the defined parameters stated under equations (14) to (16) //
(21) // both the neutralized throughput and the PDR will surely be high if the delay is low, as will the energy consumption //
(22)
(23) // calculate the transition value of the route in the list by applying equation (20) to the state variable of the current route //
(24)
(25) // dcb is the direct connection benefit; if the transition function is satisfied completely, the value of dcb is 0.1 //
(26) If  > ψ ||  < ψ
(27) Reward =
(28) Else
(29) Reward = 0
(30) End If
(31) Update Q-table
(32) End For
(33)
(34) Choose Route
(35) Add to List()
(36) Return List if the transition is completed.

With the updated set of Q-table, the proposed algorithm selects the route with maximum reward and transfers the data through the route.
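Putting the pieces of Algorithm 1 together, the following sketch shows how the per-route reward loop and the final selection could look in code. The dcb bonus of 0.1 and the use of k-means labels plus an RBF SVM follow the algorithm description, while the ψ value, the feature layout, and all function names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def select_route(route_features, route_delays, psi=0.05, dcb=0.1, gamma=0.8):
    """Label routes with k-means, train an RBF SVM on those labels,
    then reward each route and return the index of the best one."""
    X = np.asarray(route_features, dtype=float)

    # Phase-1 style labelling: two clusters, the lower-delay cluster is 'optimal'.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    optimal_cluster = np.argmin(
        [np.mean(np.asarray(route_delays)[labels == c]) for c in (0, 1)])
    y = (labels == optimal_cluster).astype(int)

    clf = SVC(kernel="rbf").fit(X, y)          # transition-policy classifier
    agreement = clf.predict(X) == y            # successful transition check

    rewards = np.zeros(len(X))
    for i, delay in enumerate(route_delays):
        transition_value = gamma * (1.0 / (1.0 + delay))   # lower delay -> higher value
        if agreement[i]:
            transition_value += dcb                         # direct-connection benefit
        rewards[i] = transition_value if transition_value > psi else 0.0

    return int(np.argmax(rewards))

# Toy usage: three candidate routes described by [throughput, PDR, energy, delay].
features = [[900, 0.91, 0.4, 30], [700, 0.62, 0.7, 55], [950, 0.95, 0.35, 25]]
delays = [30, 55, 25]
print(select_route(features, delays))   # index of the route with the highest reward
```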

4. Results and Analysis

The results are presented in two subsequent segments. The first segment illustrates the working behavior of Q-learning and its updates to the Q-table, and the second segment illustrates the effects of the Q-learning mechanism on the quality-of-service (QoS) parameters. The parameters are evaluated on the basis of throughput, PDR, and delay. The primary objective is to minimize the overall delay, and hence segment 2 focuses more on the delay analysis. The simulation parameters of the proposed work are summarized in Table 8.

4.1. Segment 1: The Q-Learning Behavior Results and Illustrations

The Q-learning architecture can be explained using the Q-table. To illustrate this, a selected route architecture with Q-learning is presented in Table 9. The selected routes demonstrated the least delay in each selection made by Q-learning through the Q-table. Each possible discovered route is treated as a candidate in the table. To show the separation of the routes, the end of each particular route having multiple paths is underlined, and the route with the minimum delay is selected as shown and marked in red.

Figure 7 represents the route formation and the optimal route selection mechanism in the network. In the network, a total of 7 routes are possible based on the QoS parameters, as shown in the first segment of Table 9, and the possible routes are (32, 6, 4, 5, 19), (32, 6, 1, 5, 19), (32, 6, 2, 5, 19), (32, 6, 3, 5, 19), (32, 6, 5, 5, 19), (32, 6, 6, 5, 19), and (32, 6, 7, 5, 19). Here, nodes 32 and 19 are the source and destination, respectively. In the figure, the 7 possible routes are represented with different colors, and the selected route is marked with a bidirectional green line. Based on the reward mechanism using the Q-learning algorithm, the optimal route (represented in green) is selected for data transmission within the network.

If the possible discoveries are passed to the k-means architecture, as explained in the pseudocode, they are divided into two segments as shown in Figure 8.

The 49 routes were separated into two clusters by the k-means algorithm, which provides the index of each record as follows.

K-means divides the aggregated data into two segments, with "X" marking the centroid of each cluster. The x-axis contains the PDR in the range {0,1}, and the energy consumption, computed in kW, is plotted on the y-axis. It is evident from the distribution that one segment contains the PDR values just below the 0.4 mark.

The divided index can be seen in Table 10, and the formed centroids are given in Table 11.

The separated classes are treated as states for the routes, and the action is set to {0, 1}, where R_A is the action taken against the state variable. To modify the current policy update rule applied using Bellman's equation, the separated data are passed to the SVM for binary distributed labelling of the routes.
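A small sketch of this clustering-plus-labelling step is given below; the synthetic PDR/energy values are random stand-ins, since the real 49-route data of Tables 10 and 11 are not repeated here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 49 synthetic routes described by [PDR in {0..1}, energy consumption].
routes = np.column_stack([rng.uniform(0, 1, 49), rng.uniform(0.1, 2.0, 49)])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(routes)
print(km.labels_)           # cluster index of each of the 49 routes (cf. Table 10)
print(km.cluster_centers_)  # the two centroids marked "X" in Figure 8 (cf. Table 11)

# The cluster indices become binary action labels, refined by an SVM whose
# decision function later feeds the Q-table policy update.
svm = SVC(kernel="rbf").fit(routes, km.labels_)
print(svm.score(routes, km.labels_))  # agreement of the SVM with the k-means labels
```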

The trained SVM provides the training architecture as illustrated in Table 12.

The update over the trained sample based on the selected policy of learning is defined under the Q-learning architecture, and the following results have been attained that are displayed in Tables 13 and 14.

When compared with the ground truth values of the data, it is evident that the learning rate stabilizes by the 3rd iteration, and more than 90% of the data are classified correctly. The scaling values of the SVM are also updated using the modified learning system and are presented in Table 15.

In order to demonstrate the algorithm's behavior, a sample of 49 discoveries has been presented; in total, the proposed algorithm utilized more than 71,361 discovered routes. Amplified learning helps in route optimization, as falsely classified routes are not pursued for long intervals and hence the QoS is not affected. This results in the optimal QoS values illustrated in the second segment of this section.

4.2. Segment 2: The Improvement in QoS

The QoS parameters are evaluated in order to check the effects of the proposed algorithm on the routes. The evaluation contains two parameters, namely, throughput and delay. Since the network is sustained only while its nodes are alive, the first dead node and the last dead node over 5000 simulations have also been recorded and are illustrated as a part of the overall results. The proposed work is compared with the other state-of-the-art techniques illustrated in the literature. The evaluation has been conducted for 5000 test simulations apart from the data aggregation simulations. The comparison of throughput and PDR is demonstrated over an increasing number of nodes. The delay is illustrated with variation in area size as well as variation in the number of nodes to support the elasticity of the network.

The throughput is evaluated under a standard 1500 m × 1500 m network layout. As is evident from Table 16, with an increasing number of nodes, every algorithm demonstrates an increase in throughput. This is expected: if the number of nodes increases, the number of carriers for the data packets also increases, and hence the table demonstrates an upward trend. The proposed algorithm stands highest in terms of throughput in comparison with the other state-of-the-art techniques. The significant improvement in throughput is due to the proposed algorithm architecture, which boosts the route selection policy by embedding the classification system into the discoveries. The proposed routing policies enable the sensor nodes to serve longer in the network, and as a result, the average throughput is the most efficient for the proposed simulation setup. The second most efficient algorithm, next to the proposed algorithm, is observed to be the routing policy proposed by Adil et al. For 300 nodes, the proposed algorithm provided 14653 packets/second, whereas Adil et al. resulted in 12548 packets/second. In terms of improvement, the proposed algorithm is 14.77% more efficient in throughput compared with Adil et al.

The first dead node and the last dead node are evaluated over 5000 iterations, but the maximum sustainability was attained at 3500 iterations. Figure 9 presents the evaluation of the first dead node.

As displayed in Figure 9, as the number of nodes increases, the number of active participants increases, and hence less energy is consumed per node. This further results in less battery drainage and fewer dead-node penalties. In comparison with the other state-of-the-art techniques, the proposed algorithm performs significantly better: a margin of 8-9% is attained up to 170 nodes. After 170 nodes, the gain in efficiency becomes static in terms of dead nodes. When compared with Adil et al., the percentage difference is quite small after 170 nodes, as illustrated in Table 17.

The same trend after 170 nodes, where the percentage difference relative to Adil et al. remains small, is also illustrated in Table 16 and presented in Figure 10.

This research article aims to minimize the overall propagation delay, and hence the delay is also evaluated under different area sizes. The evaluation has been performed for 4 different area sizes. The purpose is to check the elasticity of the proposed algorithm against the other state-of-the-art algorithms. The tabular analysis is presented in Tables 18–21.

The evaluated delay values are computed in ms. As the network deployment area increases, the propagation delay increases throughout the network. With an increasing number of nodes, more participants are available to transfer the data, so the delay decreases for a given deployment area size, but it is also evident that an increase in the area raises the propagation delay. For example, the delay for the proposed algorithm in the larger area size, with a full stack of 170 nodes, is noted to be 26.334 ms, whereas for the same simulation setup with the area decreased by 200 units the delay is noted to be 25.435 ms, approximately 1 ms less than for the next larger area size. This pattern can also be observed for the other algorithmic architectures, but the proposed algorithm still outperforms the existing algorithms by more than 3 ms on average in every overall scenario, which marks an improvement of 7% over the existing algorithms.

5. Conclusion

In the present work, an improved route discovery mechanism is proposed based on the Q-learning model of reinforcement learning. This reward-based learning mechanism is used to significantly minimize the overall communication delay observed during long-distance communication in WSNs. To evaluate the work, the deployment area is varied from 1000 m2 to 1800 m2 along with variation in the number of nodes deployed in the network. The number of nodes varies from 50 to 300, which provides a comprehensive investigation of the designed mechanism. In addition to the delay analysis, throughput, PDR, and first dead node analyses were also performed to justify the effectiveness of the enhanced routing mechanism in delivering quality of service with minimal communication delay. The practical difficulties of creating long-distance communication may now be overcome by adopting hop-oriented routing networks, thanks to technical improvements. However, the quality of service (QoS), particularly in terms of significant communication latency, is often deteriorated by long-distance data transmission. As a result, in this work a reward-based routing system that aims to minimize the total latency is provided and assessed under various circumstances. The routing method entails fine-tuning the CH selection mechanism based on a mathematical model until a threshold number of simulations is reached. To provide a high-quality service, examples of CH coverage estimations are also given for potential paths between the source and the destination. Based on this knowledge, the learning process of the Q-learning model receives the data obtained from previous simulations. In order to obtain the shortest possible transmission delay, the work is assessed in terms of throughput, PDR, and the first dead node. Area variation is also investigated in order to determine how an increase in the deployment area and node count affects the Q-learning-based approach designed to reduce latency. The success of the suggested method in terms of throughput, the first dead node, and delay analysis is justified by a comparison with four other studies. Q-learning is a prominent technique at present since it is model-free. Deep learning may also be used to assist the Q-learning model: the artificial neural networks used in deep learning choose appropriate weights to obtain the best possible answer, and deep Q-learning is a form of Q-learning that uses neural networks. These methods help organizations make significant progress in task completion and decision making.

Data Availability

The data are available on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.