Abstract

Drone examination has been rapidly embraced by the natural disaster mitigation and management (NDMM) sector to survey the state of affected regions. Manual video analysis by human observers takes time and is subject to mistakes. Automated human detection in pictures captured by drones provides a practical method for saving the lives of people trapped under debris during earthquakes, in floods, and so on. Drone investigation for security and search and rescue (SAR) requires the drone to scan the affected area using a camera mounted on an unmanned aerial vehicle (UAV) to identify the specific locations where assistance is required. The existing methods (Balmukund et al. 2020) used faster region-based convolutional neural networks (F-RCNN), single shot detector (SSD), and region-based fully convolutional network (R-FCN) for human detection and action recognition. Some of the existing methods used only 700 images with six classes, whereas the proposed model uses 1,996 images with eight classes. The proposed model uses the YOLOv3 (you only look once) algorithm for the detection and recognition of actions. In this study, we present the fundamental ideas underlying an object detection model. To find the most effective model for human recognition and detection, we trained the YOLOv3 algorithm on the image dataset, evaluated its performance, and compared the outcomes with existing algorithms such as F-RCNN, SSD, and R-FCN. The accuracies of F-RCNN, SSD, and R-FCN (existing algorithms) and YOLOv3 (proposed algorithm) are 53%, 73%, 93%, and 94.9%, respectively; among these, YOLOv3 gives the highest accuracy of 94.9%. The proposed work shows that existing models are inadequate for critical applications like search and rescue, which motivated us to propose a model supported by a pyramidal-feature-extracting SSD for human localization and action recognition. The suggested model is 94.9% accurate when applied to the proposed dataset, which is an important contribution. Likewise, the suggested model reduces prediction time in comparison with the cutting-edge detection models of existing strategies: the average time taken by our proposed technique to process a picture is 0.40 ms, which is much better than the existing methods. The proposed model can likewise work on video and can be utilized for real-time recognition, and the SSD model can also be used to detect text, if present, in the picture.

1. Introduction

Drone reconnaissance is becoming popular nowadays for relief and rescue in catastrophic events. Nearly every week another natural disaster somewhere in the world makes another headline. In India, the greatest dangers to the nation's economy are catastrophic events such as floods, tropical storms, typhoons, and a wide range of other natural disasters. Debacles and natural catastrophes cause immense and widespread human, material, ecological, and financial loss.

A variety of drone types are available on the market. A few of the basic drones are listed as follows:
(a) Multirotor
(b) Fixed wing
(c) Fixed-wing hybrid vertical take-off and landing (VTOL)

1.1. Multirotor

It is the easiest and the cheapest option of all the drones. It allows the user to attach a small camera and “keep an eye in the sky” for a short period of time.

1.2. Fixed-Wing

Fixed-wing drones (not “rotary wing,” like helicopters) use a wing, as a typical aeroplane does, to generate lift instead of vertical-lift rotors. Because of this, they only need to use energy to move forward, rather than to hold themselves in the air, and so are considerably more efficient. They are able to stay in the air for long periods of time.

1.3. Fixed Wing Hybrid VTOL

The fixed-wing hybrid VTOL is a hybrid category that combines the advantages of fixed-wing UAVs with hover capability and can also take off and land vertically. Different kinds of designs are under development: some are primarily existing fixed-wing aircraft with added vertical-lift motors. Others are “tail sitters,” which look like a conventional aircraft resting on its tail on the ground, pointing straight up for take-off before pitching over to fly normally; there are also “tilt rotor” types, in which the rotors or even the whole wing with the attached propellers can rotate upwards for take-off and then point forward for level flight. Various types of natural disasters occur every day in different parts of the world, such as earthquakes, floods, and wildfires.

A wildfire, sometimes known as a forest fire, is an unplanned fire that happens in a forest, grassland, or other natural area. Wildfires can occur anywhere at any moment. They frequently result from human activity or natural occurrences like lightning, and the cause of about half of the registered forest fires remains unknown.

According to a report in the Times of India (TOI), India's average annual economic losses due to disasters are estimated at $9.8 billion. Moreover, speaking of recent disasters, around 103 lives were lost in the Assam floods (source: Hindustan Times).

Drone surveillance involves using unmanned aerial vehicles (UAVs) to capture still pictures and video-based data in order to collect information on particular targets, which can be individuals, groups, or environments. In this document [1], the authors examined the use of UAV-based human detection technology in search and rescue operations during natural disasters. Drone surveillance is used to surreptitiously gather data about a target from a distance. It is shown in Figure 1.

2. Literature Survey

This section examines some recent and significant work on human detection and action recognition for search and rescue in disasters. More than 1000 images were used in these papers [2–10] and [11], whereas fewer than 1000 images were used in these studies [12–18] and [19]. In these studies [16] and [18], the authors used live simulation data/real-time data, United Nations World Tourism Organization data [20], etc. The YOLO algorithm was used in these papers [8, 9, 11, 21] and [22], and different types of CNN algorithms (R-CNN, Faster R-CNN, fine-tuned CNN, and U-Net CNN) were used in these papers [2, 6, 8, 9, 16, 19] and [10] for human detection and action recognition for SAR. Machine learning algorithms were used in these papers [5, 7, 12–15, 23–26] and [20]. Cloud computing technology was used in these articles [4, 27] and [22]. Robots/smart wearable devices/sensors/IoT were used in these papers [6, 18, 22, 28, 29] and [30]. OPNET was used to evaluate the network performance in these studies [29, 31]. These details are summarized in Table 1. The relevant works are grouped based on the following criteria:
(a) Dataset size/live simulation data/real-time data/other data
(b) Deep learning algorithms
(c) Machine learning algorithms
(d) Cloud computing technology
(e) Robots/smart wearable devices/IoT
(f) Network performance

2.1. Dataset Size/Live Simulation Data/Real Time Data/Other Data

More than 1000 images were used in these papers [2–10] and [11]. They were given as follows.

In this research paper [2], the authors suggested a cost-effective CNN fire detection architecture for keeping track of videos. The model was based on the GoogLeNet architecture, chosen for its reasonable computational complexity compared with other high-cost computing networks such as AlexNet. To balance efficiency with accuracy, the model was fine-tuned considering the nature of the target and the fire data. The experimental results on fire benchmark datasets demonstrated the effectiveness of the proposed framework and validated its suitability for fire detection in CCTV monitoring systems compared with the latest methods.

A significant topic in emergency response management was considered: the allocation and scheduling of rescue units. A multiobjective mixed integer linear programme (MOMILP) was used by the authors to allocate and schedule rescue units in the event of a natural disaster. The first goal was to minimize the sum of weighted relief operation completion times, and the second was to minimize the makespan. This model used a single-objective mixed integer programme with a linear utility function to tackle the problems of prior models. Their experiment demonstrated the efficacy of the recommended strategy in achieving high-quality results. However, determining the processing time, travel time, and severity level still takes time [3].

In their previous work, through the use of VANETs, cloud computing, and simulations, the authors created a system for handling disasters and developed evacuation strategies for a city. In this paper, they built on that research by utilising deep learning to forecast the behaviour of urban traffic. They also used GPUs to address the computationally intensive nature of deep learning methods. They were the first to bring a deep learning method to disaster management. They utilized real-world open road-traffic data for a city, available through the UK Department for Transport. Their findings demonstrated the effectiveness of deep learning in managing disasters and accurately predicting traffic behaviour in emergency situations [4].

In this research [5], the authors proposed a method to improve postnatural disaster management operations by detecting the disaster type and site. They initially retrieved disaster-related tweets from the Twitter API using predefined keywords. The posts were cleaned and the noise level was reduced in the second stage. The third stage then extracted the disaster type and geolocation; the Named Entity Recognizer library and the Google Maps Geocoding API were utilized to acquire the geolocation. They used the same three steps to retrieve news using the News API. To determine the trueness of each Twitter message, they compared the Twitter data with the news data.

In this paper [6], the authors provided a framework for early fire detection for CCTV security cameras that used customised convolutional neural networks to detect fire in a variety of indoor and outdoor environments. To ensure an autonomous response, they recommended an adaptive prioritisation technique for the surveillance system's cameras. Finally, they offered a dynamic channel selection approach for cameras based on cognitive radio networks to guarantee precise data transfer. Experimental results demonstrated the higher accuracy of their fire detection methodology in comparison to state-of-the-art technologies and supported the applicability of their framework for successful fire catastrophe management.

The three goals of this study [7] were: (1) creating instructional resources for teaching disaster mitigation; (2) understanding student learning results when utilising developed materials; and (3) understanding student reactions to developed materials.

In this article, the authors discussed real-time human identification on a fully autonomous rescue UAV using the YOLOv3 algorithm. The embedded system that was created could find swimmers in open water by using deep learning algorithms. This improved the operating capability of first responders by enabling the UAV to give help precisely and entirely unattended. The unique aspect of this work was the integration of computer vision algorithms and the global navigation satellite system (GNSS) for exact human recognition and release of rescue equipment. The hardware configuration as well as the system's performance assessment were covered in detail [8].

In order to identify aeroplanes in satellite photos, this study compared and evaluated the most recent CNN-based object detection models. The DOTA dataset was used to train the networks, and the DOTA dataset as well as independent Pleiades satellite pictures were used to assess their performance. According to COCO metrics and F1 scores, the Faster R-CNN network produced the best results. With less processing time, the YOLOv3 architecture also produced promising results, but SSD was unable to converge effectively on the training data within few iterations. With more rounds and various parameter settings, all of the networks tended to learn more. Compared with other networks, YOLOv3 can be said to have a faster convergence capability; however, optimization techniques also play a significant part in the process. SSD was superior in object localization while having the weakest detection performance. The disparity in object sizes and diversities also had an impact on the results. Imbalances should be avoided, or the categories should be broken down into finer grains, such as aeroplanes, gliders, small planes, jet planes, and warplanes, while training deep learning architectures [9].

In this document [10], the authors came up with a drone dataset for human action recognition. This dataset can also serve detection and other similar tasks in different surveillance applications. The dataset provided was diverse in colour, size, actor, and background; this variation allows the proposed dataset to generalize to various applications. In addition, their primary purpose was to support search and rescue operations through drone surveillance. An experimental comparison of deep learning detection models applied to the proposed dataset and to additional publicly available datasets was presented. A new detection model for action recognition was also proposed in this document, which obtained an mAP value 7% higher than the advanced SSD when used on the publicly available Okutama dataset. The suggested model also achieved 0.98 mAP when applied to their two-class action detection dataset for SAR, which is a good performance value for a live application.

In this paper [11], to establish a uniform PASCAL VOC format picture database, the authors gathered pill images and annotated them with LabelImg. The pill dataset was used to train RetinaNet, SSD, and YOLOv3, three of the most popular object detection techniques currently available. The YOLOv3 model's loss function converged more quickly, which suggested that its training time was less than that of the other two models; as a result, it could better handle retraining of the model owing to the frequent changes of pills in pharmacies. Using mAP and FPS as the evaluation metrics, they compared the three models.

In these articles [12–18] and [19], the authors used less than 1000 images. They were given as follows.

The conceptual model for deducing the emergency plan in the event of a natural disaster was the main purpose of this document. The multiagent system (MAS) based emergency response plan deduction architecture, including both the internal structure and the reporting mechanism of the agents, was designed. The suggested deduction model was tested on the JADE platform. The findings indicated that the natural disaster contingency plan deduction conceptual model was developed and was better able to meet the challenge of modelling the complex system of contingency plan inference and could achieve the desired results. Within NDREPMACS, there was a problem of coordination and collaboration among separate agents, as referred to in the study. To address these issues, in-depth cooperative research in emergency management, artificial intelligence, disaster science, and other fields was needed [12].

The primary contribution of this article was to create a formal RSLD model for managing natural disasters that could be customised to represent any real RSLD situation. Various stochastic aspects were considered during RSLD modelling, including trajectory selection, trajectory destruction, vehicle selection, vehicle destruction, and trajectory traversal time. The findings of the analysis demonstrated the efficacy of the proposed framework, which could be extended to formally model and assess other aspects of disaster management such as evacuation shelters and evacuation planning [13].

They recommended a hybrid heuristic approach [14] built on a bilevel optimization model and a machine-learning framework. An optimization framework was used to iteratively find a better scenario using a supervised learning (regression) model trained on data from several contraflow scenarios. Real datasets from Winnipeg and Sioux Falls were used as benchmarks to assess this approach. It was programmed as a single computer programme that worked in conjunction with the General Algebraic Modeling System (GAMS) for optimization, MATLAB for machine learning, Equilibre Multimodal/Multimodal Equilibrium 3 (EMME/3) for traffic simulation, MS-Access for data storage, and MS-Excel for data analysis (as an interface). By changing the direction of some roads, the algorithm improved accessibility to Winnipeg's centrally located, crowded, and congested districts while also producing globally optimal solutions for the Sioux Falls example.

In this study [15], “deep learning” classification techniques were compared with human coding of photos released during Hurricane Harvey in 2017. The VGG-16 convolutional neural network/multilayer perceptron classifiers were used in their framework for feature extraction to classify the urgency and time period of a given image. They found that machine learning algorithms did not always capture the unique characteristics of disaster situations. Together, these techniques helped locate relevant material and requests by sorting through the volume of irrelevant social media posts.

In order to operationalize the science and technology roadmap for disaster risk reduction (DRR), this article evaluated the main knowledge gaps and the potential for boosting the transdisciplinary approach (TDA). To promote science-based policy implementation at global, regional, and local levels for risk-informed decision-making throughout the post-2015 Agenda, it was necessary to strengthen the links between science and policy. The crucial TDA elements for enhancing DRR, climate change, and sustainable development strategies must be included, and DRR stakeholders must improve collaboration through an integrated strategy that uses science, technology, and these TDA components [16].

The objectives of this work [17] were to investigate cutting-edge disaster recovery (DR) solutions, thoroughly analyze recently published research, and identify various methodologies that were described in the literature. 49 research studies, spanning the years 2007–2017, were examined as part of a comprehensive mapping effort. Numerous DR techniques were being researched. The findings revealed a variety of pertinent concerns, such as justifications for adopting DR solutions, implementation strategies, analytical methods, and metrics taken into account when DR solutions were analysed.

In this paper [18], the authors provided a useful foundation for studies on intelligent search, rescue, and disaster recovery missions. They examined the fundamental machine-learning methods needed for object detection and path planning in intelligent rescue operations. They also demonstrated the viability of the suggested architecture using a proof-of-concept hardware-in-the-loop (HIL) simulator framework supporting a specific rescue mission scenario. By building a proof-of-concept prototype for search, rescue, and disaster recovery operations, they illustrated the Internet of Things (IoT) architecture and put it to the test.

In this research [19], they reviewed the ways in which post-tsunami disaster response had benefited from the development of remote sensing techniques. The performance assessments of the remote sensing techniques were addressed in light of the requirements for responding to the tsunami disaster in the long term.

In some other papers, the authors used live simulation data/real-time data [16] and [18], United Nations World Tourism Organization data [20], etc. They are given as follows.

In this study [16], population estimation in the region of interest and migration patterns were investigated using cutting-edge object detectors. It was shown that the majority of detectors exhibited multiple detections for a single object, thereby exaggerating the object count. In order to reduce redundant detections and enhance the count and mean average precision of the detected classes, a nonmaximum suppression (NMS) technique was applied after detection.

A dataset on natural and man-made catastrophe occurrences was incorporated into a model of international tourism flows in order to evaluate the effects of various disaster types on foreign arrivals at the national level [20]. The findings demonstrated that various events can modify tourist flows to differing degrees. While a positive effect was observed in some situations, the results were frequently unfavorable and led to reduced tourist arrivals after an event. Destination managers who make critical decisions regarding recovery, reconstruction, and marketing benefited from understanding the relationship between catastrophic events and tourism.

2.2. Deep Learning Algorithms

The YOLO algorithm was used in these articles [8, 9, 11, 21] and [22], and different types of CNN algorithms such as R-CNN, Faster R-CNN, fine-tuned CNN, and U-Net CNN were used in these papers [2, 6, 8, 9, 16, 19] and [10] for human detection and action recognition for SAR. They were given as follows.

In this paper [21], the authors looked at the usage of composite photos to hone an efficient victim detector. They were motivated by the difficulty of finding real victim photographs for training and by the fact that state-of-the-art detectors trained on the COCO dataset could not reliably identify disaster victims. They suggested that, in order to create composite victim photos, human body parts should be copied and then pasted onto a background of rubble. Their approach took into account the fact that actual victims were frequently covered in rubble and could only be seen in fragments; the body parts were thereby randomly pasted, as opposed to earlier approaches that copied and pasted an entire object instance. The experimental findings showed that fine-tuning the detectors with these composite images could significantly increase the average precision (AP). They examined several cutting-edge detectors. Their unsupervised deep harmonisation network, which could create harmonic composite images for training and helped improve the detectors even further, also proved beneficial. The YOLOv3 algorithm was used to detect the humans.

In this paper [22], they intended to introduce the use of robots for the initial investigation of the disaster site. The robots toured the area and used the video stream (with audio) they had recorded to locate the human survivors. They proceeded to transmit the survivor’s discovered position to a main cloud server. In order to establish whether it is safe for rescue personnel to access the chosen area, it was also necessary to monitor the area’s associated air quality. They employed a human detection model for images with a mean average precision (mAP) of 70.2%. The F1 score of the suggested method’s speech detection technology was 0.9186, while the architecture’s overall accuracy was 95.83%. To increase the accuracy of the detection, they merged audio and image detection (YOLOv3 algorithm) approaches.

2.3. Machine Learning Algorithms

The machine learning algorithms were used in these papers [5, 7, 12–15, 23–26] and [20]. They were given as follows.

In this work [23], a critical review of current disaster management and warning systems was performed, finding that the SMS alert system was highly useful but had not yet been approved in Romania. Various gadgets were available to help build the system, as well as software modules capable of managing devices all over the world. In addition, along with the SMS alert system, the real-time data collection system was a vital component. Some international researchers had made a few feeble attempts to explore the subject, but this technique had been noted as more accurate.

In this article [24], the authors outlined automated techniques for extracting information from microblog posts. In particular, they emphasised extracting precious “nuggets of information”: short, self-contained pieces of information relevant to disaster response. In order to categorise messages into a set of precise classes, their system made use of cutting-edge machine learning algorithms.

In this paper [25], the authors discussed a platform called artificial intelligence for disaster response, which classifies tweets in real time into a number of user-defined situational awareness categories. The platform used a combination of machine learning and human intelligence to assign tags to a subset of messages and create an algorithmic classification for the remaining messages. The platform used an active learning approach to select possible messages to tag and continuously learned, increasing its classification accuracy as new training examples became available.

The focus of this study was on a specific natural calamity, i.e., landslides, for which the preparedness phase was lacking. This research provided a review of the conceptual framework for managing landslides as a natural hazard. Background, goal, model, substance, legitimation, implementation, and contribution were the seven criteria used in the evaluation. The evaluation revealed the framework's strengths and flaws, which could be used to identify any gaps. The findings showed that the framework could be used as a guideline for landslide management during the preparedness phase [26].

2.4. Cloud Computing Technology

The cloud computing technology was used in these articles [4, 27] and [22]. They were given as follows.

In this article [27], the authors put forward a disaster recovery (DR) framework for the e-learning environment. In particular, they described the use of the provided framework and showed the significance of responding to earthquake and tsunami events for the e-learning environment. They constructed the model system according to the suggested framework and described the results of its experimental use and examination. Going forward, they implemented their disaster recovery framework on a cloud orchestration framework such as OpenStack and tried to validate that it is effective in a cross-organizational environment with multipoint organizations. They also believed it was necessary to reconfigure the OpenFlow control method in order to shorten live migration time.

2.5. Robots/Smart Wearable Devices/IoT

The robots/smart wearable devices/sensors/IoT/other devices were used in these papers [6, 18, 22, 28–30]. They were given as follows.

In this article [28], the authors presented a short review of how UAV capabilities have been used in disaster management, with examples of existing use and adoption considerations. Disaster domains included fire, tornadoes, flooding, building and dam collapses, crowd monitoring, search and rescue, and postdisaster critical infrastructure monitoring. This review may increase awareness of both the capabilities and the problems of UAVs among those facing crisis and disaster management.

In this study [30], the authors highlighted unresolved research questions about unmanned aerial vehicles (UAVs) in disaster management (DM) and indicated the key uses of UAV networks for DM. Based on the reviewed studies, UAV networks, along with wireless sensor networks (WSNs) and the Internet of Things (IoT), are promising future technologies for DM applications. The combined role that WSN, IoT, and UAV systems could play in both natural and man-made DM was the main focus of this article. This paper's first important contribution was the classification of ongoing research projects that use various technologies combined with UAVs for DM. It also concentrated on the various network architectures and technologies utilized in DM systems. Following that, it covered further crucial facets of using UAVs to provide emergency communication during natural and man-made disasters.

2.6. Network Performance

OPNET was used to evaluate the network performance in these papers [29, 31]. They were given as follows.

In this document [31], the authors evaluated the performance of a UAV-assisted intelligent on-board computing network to speed up search and rescue (SAR) missions and capabilities, because it could be rolled out in a short period of time and could assist most people in the event of a disaster. They examined network parameters such as delay, speed, and traffic sent and received, as well as path loss, for the suggested network. It was also established that, with the suggested parameter optimization, network performance increases considerably, ultimately resulting in much more efficient SAR missions in the event of a disaster and in tough environments.

In this article [29], the authors concentrated on network performance for effective disaster management through the collaboration of drone edge intelligence and smart wearable devices. They focused mostly on network connectivity factors to enhance real-time data exchange between wearable smart devices and drone edge intelligence. In this study, pertinent characteristics such as throughput, delay, and the load from drone edge intelligence were taken into account. Furthermore, it was demonstrated that when the aforementioned parameters were properly optimised, network performance may greatly improve, which would have a positive impact on the effectiveness and efficiency of search and rescue (SAR) team guidance and coordination.

3. Dataset Description and Sample Data

The dataset for the proposed work has been taken from https://www.leadingindia.ai/data-set.

The number of action classes is eight. Camera motion is present (slow and steady). The total number of images used for the proposed model is 1,996. The resolution is 1920 × 1080. The annotations are bounding boxes in .txt (YOLO) format. The information about the dataset used for the proposed model is shown in Table 2.
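For reference, each line of a YOLO-format .txt label file encodes one object as a class index followed by the box centre coordinates and box size, all normalized to [0, 1] by the image width and height. A hypothetical two-object label file for one frame (class indices and values are illustrative only) looks as follows:

    0 0.512 0.430 0.051 0.167
    5 0.238 0.615 0.047 0.152

Here, the first line marks an object of class 0 whose box centre lies at (0.512, 0.430), with a width of 0.051 and a height of 0.167 of the image size.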

Table 3 shows sample images from the dataset with various actions of people captured by the drone from various angles and locations. The actions are human standing and waving, human running, human standing, human sitting, human walking, human standing and running, human waving, and human standing and walking.

4. Proposed Algorithm with Flowchart

The proposed algorithm uses YOLOv3 for action detection together with SSD for text detection. The steps involved in the proposed algorithm are given as follows:
(i) Step 1. Dataset collection and action selection
(ii) Step 2. Data cleaning and preprocessing
(iii) Step 3. Training the model
(iv) Step 4. Testing the model
(v) Step 5. Performance evaluation

4.1. Step 1: Data Set Collection and Action Selection

The dataset has been collected from drone-surveillance videos from the leadingindia.ai website. Other details about the dataset are mentioned above.

4.2. Step 2: Data Cleaning and Pre-Processing

Images whose annotations were not given were removed. For data preprocessing, the following steps were taken:
(a) Frame selection: consecutive extracted frames repeat the same action features. We therefore experimented with keeping one frame out of every 10, 15, 20, and 30 frames of the suggested dataset (a sketch of this step is given after this list).
(b) Annotations: Labelme software was used to manually annotate the 1,996 images, as the given annotations were not suitable for our proposed algorithm.
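A minimal Python sketch of the frame-selection step, assuming OpenCV and a hypothetical video path (the stride values 10, 15, 20, and 30 correspond to the settings tried above):

    import cv2  # OpenCV for video decoding

    def sample_frames(video_path, stride=15, out_prefix="frame"):
        # Keep one frame out of every `stride` frames of the video.
        cap = cv2.VideoCapture(video_path)
        index, kept = 0, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # end of video
            if index % stride == 0:  # keep every `stride`-th frame, drop the rest
                cv2.imwrite(f"{out_prefix}_{kept:05d}.jpg", frame)
                kept += 1
            index += 1
        cap.release()
        return kept

    # Hypothetical usage: sample_frames("drone_survey.mp4", stride=15)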

4.3. Step 3: Training the Model

80% of the dataset images were used to train the network. We train our model using the Darknet-53 convolutional neural network. We trained the network for about a week and achieved 94 percent accuracy on the validation set of the human action detection dataset. The Darknet framework is used for all training and testing. The model was then adapted to carry out detection. Following the lead of earlier work, we add four convolutional layers and two fully connected layers with randomly initialized weights. As detection often requires fine-grained visual information, we increased the network input resolution from 224 × 224 to 448 × 448. Class probabilities and bounding box coordinates are predicted by our final layer. We want only one bounding box predictor to be responsible for each object during training, so an object is assigned to the predictor whose current prediction has the highest IoU with the ground truth. As a result, the bounding box predictors become increasingly specialized.
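For illustration only (the actual training used the Darknet framework), a PyTorch-style sketch of the detection head described above might look as follows; every layer size here is an assumption, not the authors' exact configuration:

    import torch.nn as nn

    # Hypothetical head appended to the backbone: four convolutional layers and
    # two fully connected layers that output an S x S x (B*5 + C) prediction map.
    S, B, C = 7, 2, 8   # grid size, boxes per cell, action classes (assumed)
    head = nn.Sequential(
        nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(1024, 1024, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Flatten(),
        nn.Linear(1024 * S * S, 4096), nn.LeakyReLU(0.1),
        nn.Linear(4096, S * S * (B * 5 + C)),
    )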

A convolutional neural network is generally composed of an input layer, convolutional layers, pooling layers, and an output layer. The system takes a two-dimensional image, and the convolutional layer extracts and maps the important features of the image in a sliding-window fashion; the pooling layer downsamples the input feature maps, reduces the intricacy of the calculations, and extracts the main features. Image characteristic information is obtained by convolving. The flowchart of the YOLOv3 algorithm is shown in Figure 2.

The third iteration of the YOLO object detection algorithm is called YOLOv3. A YOLO network's structure resembles that of a typical CNN: it has multiple convolutional and max pooling layers before ending with fully connected layers. By employing techniques such as multiscale prediction and bounding box prediction through logistic regression, YOLOv3 greatly improved the design. A novel CNN architecture called Darknet-53 is used in YOLOv3. Darknet-53, a variant inspired by the ResNet architecture, was created primarily for object detection tasks and achieves cutting-edge performance on a variety of object detection benchmarks thanks to its 53 convolutional layers. The anchor boxes in YOLOv3 have various scales and aspect ratios, which are adjusted to better fit the size and shape of the objects being detected.
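As a concrete reference, the anchor boxes shipped with the standard yolov3.cfg (widths and heights in pixels, for a 416 × 416 input) are grouped three per prediction scale, roughly as follows:

    # YOLOv3 default anchors (width, height); three per detection scale.
    anchors = [(10, 13), (16, 30), (33, 23),        # finest scale: small objects
               (30, 61), (62, 45), (59, 119),       # middle scale: medium objects
               (116, 90), (156, 198), (373, 326)]   # coarsest scale: large objects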

Furthermore, “feature pyramid networks” (FPNs) are introduced in YOLOv3. FPNs are a CNN architecture that can recognise objects at various scales: they build a pyramid of feature maps and use each level of the pyramid to find objects at a different scale. Because the model can view objects at different scales, this improves detection performance for small objects, and the range of object sizes and aspect ratios that YOLOv3 can handle has increased. Figure 3 depicts the architecture diagram for the YOLOv3 algorithm, which has 106 layers.

The features of the one-step YOLOv3 approach are given as follows.

This network does not look at the complete picture; rather, it focuses on the parts of the picture that are likely to contain the object.
(i) A single neural network predicts bounding boxes and class probabilities for these boxes
(ii) The input image is divided into S × S grids, each with “m” bounding boxes
(iii) For each bounding box, the network generates an offset value and a class probability
(iv) The objects in the image are located using the selected bounding boxes whose class probability is higher than a threshold value

YOLO predicts B bounding boxes for every grid cell, with a confidence score for each box; each grid cell detects only one object, regardless of the number of boxes B; and it forecasts C conditional class probabilities (one for each class, the probability of the object belonging to that class).
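A quick back-of-the-envelope check of these quantities, using the grid settings implied in Section 4.4 (S = 7, B = 2) and our C = 8 action classes:

    # Each box carries 4 coordinates + 1 confidence; each cell adds C class probabilities.
    S, B, C = 7, 2, 8
    boxes_per_image = S * S * B        # 98 bounding boxes per image (cf. Section 4.4)
    values_per_cell = B * 5 + C        # 18 predicted values per grid cell
    print(boxes_per_image, S * S * values_per_cell)   # 98 882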

As we can understand, the key point of the flow is the cutoff (threshold) value, which selects the bounding box for an object in the image, as displayed in Figure 4.

The object detection method used by the SSD (single shot multibox detector) has two parts as follows:
(i) Get feature maps
(ii) Use convolution filters to find objects

SSD uses VGG16 to extract feature maps. The next step is object detection using the Conv4_3 layer. There is a bounding box for each prediction, and the class with the greatest score is chosen as the class of the bounded object. There are 21 classes (one of which is reserved for no object). There are four predictions per cell in Conv4_3 and a total of 38 × 38 × 4 predictions, regardless of the depth of the feature maps.

Many predictions, as expected, contain no object. SSD designates class “0” to indicate that a prediction has no object. SSD has no delegated region proposal network; instead, the process is very simple: small convolution filters are used to compute the location and class scores. After extracting the feature maps, SSD applies small 3 × 3 convolution filters to each cell to create predictions.
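The prediction count quoted above can be verified directly; the snippet below assumes the Conv4_3 settings described in the text (a 38 × 38 map with 4 default boxes per location and 21 class scores plus 4 box offsets per prediction):

    fmap_size, boxes_per_location = 38, 4
    num_boxes = fmap_size * fmap_size * boxes_per_location   # 5776 predictions
    values_per_box = 21 + 4                                  # class scores + box offsets
    print(num_boxes, num_boxes * values_per_box)             # 5776 144400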

4.4. Step 4: Testing the Model

YOLO is exceptionally quick during testing compared to classifier-based techniques because it only requires one network evaluation. The grid layout guarantees spatial diversity in the bounding box forecasts. Because it is frequently obvious to which grid cell an object belongs, the network predicts only one box for each object. For each picture, our network produces 98 bounding boxes with corresponding class probabilities. On the other hand, large objects, or those near the boundary of several cells, can be localised by multiple cells. Nonmaximal suppression was used with IoU > 0.5, and the outcomes were better.
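A minimal sketch of the greedy nonmaximal suppression step used here (pure Python; boxes are (x1, y1, x2, y2) corners, and 0.5 is the IoU threshold mentioned above):

    def iou(a, b):
        # IoU of two boxes given as (x1, y1, x2, y2) corners.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        # Keep the highest-scoring box, drop boxes overlapping it above the threshold, repeat.
        order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
        keep = []
        while order:
            best = order.pop(0)
            keep.append(best)
            order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        return keep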

4.5. Step 5: Performance Evaluation

IoU (intersection over union), mAP (mean average precision), precision, and recall are used for performance evaluation. They are given as follows.

The overlap between two boundaries is measured through the IoU. We use it to see how close our predicted boundary is to the ground truth (the boundary of the real object). In some datasets, an IoU cutoff (for example, 0.5) is used to determine whether a prediction is correct or wrong.

IoU is shown in the following equation:

\[ \mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}}. \]

IoU (intersection over union) represents the predicted bounding box during testing versus the actual bounding box. If the prediction coincides exactly with the ground truth, the IoU is 1; a value of IoU nearer to 1 indicates that our model is working well.

For object detection, the mAP (mean average precision) is the mean of the average precision (AP) attained across all classes. It is also worth noting that some studies use the terms average precision and mAP interchangeably. It is shown in the following equation:

\[ \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i, \]

where \(N\) = number of classes in the dataset.

Precision (P) measures how accurate the predictions are, that is, the percentage of positive predictions that are correct. It is shown in the following equation:

\[ P = \frac{TP}{TP + FP}. \]

Recall (R) measures how well all the positives are found. It is shown in the following equation:

\[ R = \frac{TP}{TP + FN}, \]

where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
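These metrics reduce to a few lines of Python; the counts below are hypothetical and serve only to show the arithmetic:

    def precision(tp, fp):
        return tp / (tp + fp) if (tp + fp) else 0.0

    def recall(tp, fn):
        return tp / (tp + fn) if (tp + fn) else 0.0

    def mean_average_precision(ap_per_class):
        # mAP: the mean of the per-class average precisions.
        return sum(ap_per_class) / len(ap_per_class)

    print(precision(tp=90, fp=10))   # 0.9
    print(recall(tp=90, fn=15))      # 0.857...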

5. Experimental Results

Natural disasters are unforeseen occurrences that can result in significant economic and environmental losses as well as the loss of lives. As an organisation, it is our duty to make sure that suitable recovery plans are in place in case these natural tragedies occur. A save our souls (SOS) system can be put in place to deal with these kinds of situations.

The SOS system ought to be able to retrieve live video feeds and messages from a variety of sources, including cameras, news stations, and podcasts. The most typical SOS architecture consists of three main parts: alert, assess, and act.

Data on natural disasters, the weather, terrorism, disease outbreaks, etc., are all fed into an SOS system, and an alarm can be generated from these data. Both international and local incidents are entered into the SOS system.

As part of the assessment step, the threat data are regularly analysed to weed out false positives and identify useful information. Filtering incident data according to our specific region of interest, the incident's timestamp, its severity, etc., allows us to perform the assessment. The impact on people, property, and businesses is taken into account, as well as past occurrences, chance of occurrence, and other variables. Ad hoc search, range filtering, and geographical queries should be used to evaluate the data and turn it into useful information during this phase. By doing this, we can simply query the data from a single interface, enabling the analyst to drill down into the data according to its seriousness or apply any specific filter that makes the data more understandable.

The incident data are shared with victims during the action phase. After the alert and assess phases, the decision is made on whether the occurrence is critical and whether it affects our assets. Emails, personal or business phone numbers, or other forms of contact can be used to communicate this to victims. The matter is carefully monitored at this point by focusing on it and alerting upper management about the incident. Carrying out the communication phase can be difficult.

An SOS system can be used in natural disasters to locate and track the current impact zone and location of natural catastrophes such as earthquakes, floods, storms, and volcanic eruptions.

Effective reaction to security issues, improved operational performance, and protection of the business’s assets are some of the benefits.

Real-time location alertness, data sharing between businesses, legal issues, poor implementation, recipients’ lack of understanding, device failure, lack of a mobile network, difficulties communicating internationally, language barrier, absence of a good procedure manual, etc., are disadvantages of SOS.

An SOS (save our souls) image (a “help is required” image) is given to the SSD with text-based detection. The proposed algorithm's detection accuracy is 73%. The output showing that help is required, with its accuracy, is shown in Figure 5.

A natural catastrophe is defined by the abnormal intensity of a natural agent (such as a flood, mudslide, earthquake, avalanche, or drought) when the typical precautions taken to mitigate the damage either could not prevent its emergence or could not be taken at all. No natural catastrophe or call for help is detected in Figure 6; that is, no help is required.

A “help is not required” image without text is given to the SSD with text-based detection. The proposed algorithm's detection accuracy is 73%. The output showing that help is not required, with its accuracy, is shown in Figure 6.

The YOLOv3 algorithm is used only for the detection and localization of humans in images. Detection of various human actions in drone images, such as person standing, person running, person waving, and person walking, using the YOLOv3 deep learning technique is shown in Figure 7.

The YOLOv3 deep learning technique detects various actions such as person standing, person waving, and person walking with confidence scores, as shown in Figure 8, which presents the detections along with localization and confidence scores for the same action classes.

The proposed model uses the YOLOv3 algorithm, which is compared with the existing algorithms F-RCNN, SSD, and R-FCN. The existing algorithms detected only six action classes, could not detect text images, and showed action detection without a confidence score, as shown in Figure 9. YOLOv3 detects eight action classes, also detects text images, and shows action detection with a confidence score, as shown in Figures 5–7.

6. Results and Discussion

The proposed model detects images with eight actions using YOLOv3, also detects text images using SSD, and shows action detection with a confidence score. The accuracy comparison between the existing algorithms (F-RCNN, SSD, and R-FCN) and the proposed algorithm (YOLOv3) is shown in Table 4 for the eight classes: human standing and waving, human running, human standing, human sitting, human walking, human standing and running, human waving, and human standing and walking.

Figure 10 shows the comparative analysis of the existing algorithms (F-RCNN, SSD, and R-FCN) and the proposed algorithm (YOLOv3) using mean average precision values. Figure 11 illustrates that the proposed algorithm (YOLOv3) provides the best accuracy of 95%. The accuracy achieved for SSD with text detection on a graphics processing unit (GPU) is 73%.

Hence, this result shows that our proposed algorithm stands out with both fewer and more classes: the existing models were trained on only around 700 images and six classes, whereas the proposed model has been trained on eight classes with 1,996 images, achieving higher accuracy and faster results.

Figure 12 shows human action detection on videos and real-time detection by the proposed model.

7. Conclusion and Future Work

In conclusion, the proposed algorithm (YOLOv3) is faster (it gives results in milliseconds) and more accurate, and it has worked on more classes than the existing detection models; the proposed model also succeeded in real-time detection, where other models fail to do so. As the main objective is to assist natural disaster management and mitigation teams in drone surveillance, an experimental assessment of the deep-learning action detection model used was reported in the suggested study.

The average loss was low and the learning rate was high in the proposed method, which was one of the reasons for the higher accuracy of 95%. Fast YOLO is among the fastest and most adaptable object detection systems, and it pushes the limits of real-time object detection. YOLO is excellent for applications which need speed and reliable object recognition because it can adapt to new domains.

In future work, more action classes can be included in the datasets, such as help (waving both hands) and searching, which are more specific to the search and rescue area. We can try to increase the accuracy and decrease the detection time by using more efficient algorithms, such as YOLOv4, which is right now in the testing phase. We can also use better-quality drones for clearer pictures, which can then be used to train the system. The proposed work can be further extended to other disasters such as floods, tornadoes, or tsunamis.

Data Availability

The data used to support the study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.