Abstract

The advancements in digital video technology have enabled video surveillance to play a vital role in ensuring security and safety. Public and private enterprises use surveillance systems to monitor and analyze daily activities. Consequently, a massive volume of video data is generated that requires further processing to enforce security protocols. Analyzing video content is a tedious and time-consuming task, and it also requires high-speed computing hardware. The concept of video summarization has emerged to overcome these limitations. This paper presents a customized video summarization framework based on deep learning. The proposed framework enables a user to summarize a video according to an Object of Interest (OoI), for example, person, airplane, mobile phone, bike, or car. Various experiments are conducted to evaluate the performance of the proposed framework on the video summarization (VSUMM) dataset, the title-based video summarization (TVSum) dataset, and our own dataset. The accuracy on the VSUMM, TVSum, and own datasets is 99.6%, 99.9%, and 99.2%, respectively. A desktop application is also developed to help the user summarize a video based on the OoI.

1. Introduction

Security is a primary concern across the world. Alongside other security measures, video surveillance cameras have been installed on private and public premises to cope with this challenge. Several kinds of security surveillance cameras (i.e., static and movable) have been installed in public places, homes, shops, airports, banks, and so on. These cameras play a vital role in real-time monitoring and detecting suspicious behavior. They are also helpful in investigating events or crime scenes, for example, road accidents, robbery, murder, and terrorist activity [1].

Furthermore, the global estimated number of currently operational cameras is more than 770 million [2]. These cameras usually remain active round the clock and generate more than 2,500 petabytes of video data per day [3]. Figure 1 exhibits daily statistics of the real-world data produced by the video surveillance cameras.

Considerable progress has already been made in developing video analytics tools that automatically perform content-based video interpretation, including motion detection [4, 5], facial recognition [6, 7], people counting [8–10], and license plate recognition [11–13]. However, manual intervention (by security guards, police officers, etc.) is still required to analyze the recorded videos. Visual analysis of video content to extract meaningful information is complex and time-consuming because it requires concentrating on and watching the whole video [14]. It may also result in false negatives, especially in the case of long videos. Therefore, there is a pressing need for a solution that reduces the human effort and time spent on manual analysis. Multiple efforts are being made towards video summarization to address this concern and generate a video summary that quickly provides the gist of the whole video [15]. Video summarization (VS) is the process of creating a summary of extensive video content by detecting and presenting to potential users the material that is most informative and up to date. VS is used in security surveillance systems to detect and analyze suspicious or anomalous activities. Personal VS is used to share occasional videos on social media, generate sports highlights, make trailers of movies and serials, and index video content to facilitate fast browsing of large videos through a video search engine, and so on [16–19].

Researchers have made several efforts to propose automatic VS. Most VS techniques generate a summary by selecting keyframes that represent the video through a skimming process [8–12, 15–17, 20–37]. Feature-based approaches for VS produce a generalized video summary rather than focusing on a specific object [20–37]. Shot boundary detection approaches are also well known for video summarization [38–45]. These approaches show limitations in detecting objects precisely and hence fail to fulfill the user's requirements. Clustering [20, 24, 46–49] and trajectory-based [17, 28] techniques summarize the video by focusing on similar activities, events, and objects. However, these approaches do not summarize a video according to the user's interest. Consequently, they are of limited use for retrieval tasks and do not enhance the user's viewing experience.

This study presents an effective VS framework based on the OoI to cope with these issues. The OoI refers to an object such as a person, car, mobile phone, or bike that a user selects in order to summarize the video by collecting all frames in which the selected object appears. The proposed VS framework works in three steps: (i) the OoI selection phase, (ii) the object localization or detection phase, and (iii) the video summarization phase. Initially, the OoI is selected from a dictionary (a database of objects) so that unnecessary noisy objects (other than the OoI) can be ignored, which is essential for object segmentation. After that, the You Only Look Once (YOLOv3) detector is applied to localize the OoI. Once the OoI is localized, the proposed VS algorithm summarizes the video based on the OoI.

Based on the above discussion, the contributions of the proposed work can be summarized as follows:
(i) In the OoI selection step, the proposed algorithm selects the object from the dictionary and automatically ignores all unnecessary objects. After that, YOLOv3 is used to detect the desired object.
(ii) The proposed VS framework can detect single and multiple objects present in the video.
(iii) The proposed VS algorithm effectively summarizes the video and overcomes the challenges present in the VSUMM [50], TVSum [51], and own datasets.
(iv) The experimental findings highlight that the proposed VS framework performs remarkably well compared to state-of-the-art video summarization methods.

The rest of the paper is organized as follows: Section 2 describes the literature review of existing techniques. Section 3 presents the proposed VS model and explains the video summarization method. Section 4 discusses the results and the comparison with other techniques, and finally, the conclusion is presented in Section 5.

2. Literature Review

Several video summarization techniques have been proposed. Uchihachi et al. [42] presented a packing algorithm to define an optimal layout. This algorithm packs the selected frames to produce the best sequence in a block, organizes several shots, and produces a concise video. However, this approach is useful only when the camera is moving. Ngo et al. [34] presented a method that relies on perceptual quality and redundancy reduction to preserve the content of the video summary. Video clusters are generated through temporal slice coherency to partition the video into shots and subshots. The authors then adopted the motion attention framework presented in [52] to analyze the clusters and the quality of the shots. A temporal graph is also formed to determine the importance of the clusters, and the graph's attention values are used to select the appropriate scenes for the video summary. The summary generated using this approach is only about 10–25% of the entire video.

Rav-Acha et al. [32] presented a video abstraction approach in which all events and activities of the video are consolidated to generate a video summary. This is done by directly detecting the moving objects and then optimizing the video using the detected objects. However, this approach cannot join parts of different scenes because of discontinuity. Damnjanovic et al. [35] introduced an event-based video summarization technique. Initially, the technique calculates each frame's energy by summing the absolute differences between the current frame and the reference frame pixel values. In this way, all the events present in the frames are determined. A video summarization algorithm is then applied to extract keyframes. The suggested approach is suitable for a static environment; however, in the case of a dynamic or changing background, the system's performance is unacceptable. Almeida et al. [30] presented a video summarization method that performs three steps to summarize the video content: extraction of visual features, summarization of the video, and filtration. First, visual features are extracted with a color histogram that labels the visual content by manipulating color. Second, a fast and straightforward algorithm is applied to condense the video; its purpose is to detect similar content and select the relevant frames. Finally, filtration is performed on the selected frames to remove noise and redundant data and generate the video summary. However, the approach is highly hardware-dependent and requires high computational speed, and its performance is unacceptable for object summarization in lengthy videos. Wang and Ngo [20] discussed a method in which a hierarchical hidden Markov model recognizes motion features and maps low-level features to high-level semantic concepts. After identifying the object's features, the most representative clips are selected from the video to generate a video summary. The generated summary is 50 times faster than the original video. The inclusion of similar shots in the summary is a major limitation of this approach, as it causes redundancy; another limitation is that moving objects are ignored.

Miniakhmetova and Zymbler [40] described a personalized video summarization method that works in two stages. The first stage is video structuring, where various scene detection techniques are applied and a video summary is generated. In the second stage, objects are detected from a subset of video scenes using a detection bank, and the video summary is composed of the most influential scenes in which objects of the user's interest are detected. However, the authors only proposed the idea of such a system without building a prototype. Varghese and Nair [38] proposed a method in which a video is summarized through three main steps: shot boundary detection, redundant frame elimination, and stroboscopic imaging. Shot boundary detection compares the current frame with its neighboring frame, and repetitive frames are removed using the structural similarity index (SSI). Stroboscopic imaging is used to understand the common background and show the activities present in the video. Compared to the original video, the presented technique reduces the volume of the summarized video by 55%. Lai et al. [45] presented a frame recomposition-based approach using a clustering algorithm, optical flow, and background subtraction to detect foreground objects. A foreground object is detected by fusing a group of pixels. After detecting the objects/activities, a sliding window is used to combine the detected objects in consecutive frames and build a spatiotemporal trajectory. The video summary is generated by combining the entire spatiotemporal trajectory, and the algorithm achieved an accuracy of 97%. However, this technique is suitable only when the camera position remains static. Srinivas et al. [21] discussed how a video can be summarized by computing three factors. First, a score is assigned to each frame based on various features such as quality, color, hue, static attention, temporal segment, demonstration, and uniformity. Second, weights are assigned to each score based on feature importance for selecting keyframes, with the standard deviation used for the weight assignment. Finally, redundancy is eliminated by removing repetitive frames, which are collected in ascending order of their scores. The presented method showed slightly better performance than the improved frame-blocks features method (IFBFM); however, a comparison with the state of the art is not presented.

Davila and Zanibbi [33] discussed frame selection in lecture videos based on segmentation, by resolving conflicts between content regions, removing objects, and rebuilding each frame to generate a video summary. The compression rate of the approach is not specified, and the approach is tested on lecture videos only. Ajmal et al. [29] discussed a method in which human motion is tracked with the help of a Kalman filter to find the trajectory. Color features are used, with a color histogram employed for shot detection and generation of the video summary. However, this approach is designed for surveillance videos only. Ma et al. [39] presented a collaborative representation of adjacent frames to detect abnormal frames and remove noisy content from the video. Keyframes are selected using minimum sparse reconstruction to remove the noisy data and prevent the loss of important information; a frame with a high collaborative representation error is considered a keyframe. A greedy iterative algorithm is utilized for model optimization that controls the number of keyframes with the help of the average percentage of reconstruction (APOR) and the sparse boundary. However, this approach is ineffective for videos with highly varying frames. Sridevi and Kharde [53] performed video summarization by detecting highlights. In this method, a two-stream architecture consisting of deep convolutional neural networks (DCNNs) is used: a two-dimensional convolutional neural network (2D-CNN) exploits spatial information, and a three-dimensional convolutional neural network (3D-CNN) exploits temporal information to score video segment highlights. This method achieved a precision of 43.9.

Meyer et al. [54] presented a cloud-based system known as HOMER for the generation of video highlights. In this system, the video summary is generated by detecting the user's emotions. Two different datasets are used for experimental analysis: a dataset filmed through a dual-camera setup and home videos randomly selected from Microsoft's Video Titles in the Wild (VTW) dataset. As a result, HOMER achieved a 38% improvement over the baseline. Afzal and Tahir [55] described video summarization by combining ResNet 152 and a gated recurrent unit (GRU). In this method, ResNet 152 is used to extract deep features from the video, while the GRU improves the method's robustness and performance. The experimental analysis was performed on the SumMe dataset, and the F-measure was 43.7. Gunawardena et al. [56] performed OoI-based video summarization by generating features from the video according to different scenes with the help of VGG16-1. The technique generates features of the OoI using VGG16-2 by taking the frames (containing objects) from the selected video. The accuracy of the proposed method is 88%; however, only three objects are used in the experiments and the computational time is high. Meng et al. [57] proposed a technique that summarizes a video into several key objects through representative object proposals generated from video frames. The proposed technique is tested only on a few objects such as a clock, a microphone, and signs, and its overall accuracy is not given. Fataliyev et al. [58] proposed a method that summarizes a video with the help of object motion pattern analysis. The method is based on key position extraction and index frame generation. Gaussian mixtures are used for object extraction and adaptive background subtraction, and morphological opening and closing operations are adopted for noise reduction. The overall accuracy of the method is 82%; however, the method is tested only for a single object, namely a person.

Table 1 summarizes some of the existing VS techniques discussed in the current section.

In public places, many surveillance cameras have been installed to monitor suspicious activities such as mobile snatching, terrorism, and robbery, where the information contained in every single frame is essential. Most existing techniques work on the principle of keyframe selection by eliminating redundant frames, which may result in the loss of important information related to the user's interest. Due to this limitation (the disappearance of objects and events), these techniques cannot produce significant results. Although some techniques summarize the video based on the OoI, their main limitations are high computational power requirements and low accuracy. Therefore, there is a need for a framework that provides robustness, high accuracy, support for multiple static and dynamic objects, and provision for investigating numerous scenarios. The framework should be flexible enough to accommodate a wide range of OoI.

3. Proposed Framework

The proposed VS framework takes a video and an OoI as input. The frames containing the OoI are then detected using an object detection module. Finally, only the detected frames are combined to produce a video summary as output. The architecture of the proposed framework is shown in Figure 2. It comprises the following main modules:
(i) Selection of input: it takes the video and the OoI
(ii) OoI detection module: it detects the OoI from the video using a deep learning technique
(iii) Video summarization module: it takes the frames that contain the OoI and generates the video summary as output

The description of each module is given in the following subsections.

3.1. Selection of Inputs

In this work, a desktop application is developed using Python that provides an interactive user interface for selecting input video and OoI. The detailed working of the application is as follows.

3.1.1. Input Video Selection

The front-end of the application developed for selecting input video is shown in Figure 3. It contains the information related to the input and performs a video format validation check. The application only supports MP4 and AVI standard formats.

3.1.2. OoI Selection

After selecting the video, the next step is to choose the object type (i.e., the OoI) to be detected from the input video. The user may select the OoI from a dropdown menu, as shown in Figure 4. A dictionary has been developed using the MS COCO dataset. The dataset contains 330 thousand images, of which more than 200 thousand are labeled. Moreover, it has 1.5 million object instances across 80 object categories such as car, person, and suitcase. The 11 supercategories of the MS COCO dataset are person, animal, outdoor objects, indoor objects, vehicle, sports, kitchenware, food, appliance, furniture, and electronics [59]. The pistol dataset contains 2986 images with a single annotation class, pistol; its images include cartoon and staged studio-quality images of guns and pistols in hand [60]. A sample set of images from the MS COCO and pistol datasets is shown in Figures 5 and 6.
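As an illustration of how such a dictionary of selectable OoI could be assembled, the following minimal Python sketch loads class labels from a Darknet-style class list and adds the pistol class; the file name "coco.names" and the load_ooi_dictionary helper are assumptions made for illustration, not part of the released application.

```python
# Minimal sketch: build the OoI dictionary from a Darknet-style class list
# ("coco.names", one label per line) and add the extra pistol class.
def load_ooi_dictionary(coco_names_path="coco.names"):
    with open(coco_names_path) as f:
        classes = [line.strip() for line in f if line.strip()]
    classes.append("pistol")  # single class from the pistol dataset
    return {name: idx for idx, name in enumerate(classes)}

ooi_dictionary = load_ooi_dictionary()
selected_ooi = "person"  # value chosen from the dropdown menu (Figure 4)
assert selected_ooi in ooi_dictionary, "OoI must exist in the dictionary"
```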

3.2. OoI Detection Module

In the proposed framework, YOLOv3 (You Only Look Once) [42] is used to detect the OoI. This module determines the scene, event, and frame where the desired object is located. YOLOv3 uses a variant of Darknet containing 53 layers trained on ImageNet. A further 53 layers are added for the detection task, providing the fully convolutional underlying architecture of YOLOv3, which consists of 106 layers. There is no pooling layer in YOLOv3; instead, a convolutional layer with stride 2 is used to downsample the feature maps, preventing the loss of low-level features [61]. YOLOv3 applies a single neural network to the full video, where the network divides the frames into regions and predicts probabilities and bounding boxes [62]. The architecture of YOLOv3 is presented in Figure 7.

In YOLOv3, each class score is predicted with the help of logistic regression, and multiple labels can be predicted for an object using a threshold: classes with scores higher than the threshold are assigned to the box [59]. In the proposed framework, object detection is performed using a bounding box to demarcate the OoI. In the case of multiple objects in a frame, this method helps describe the spatial location of the OoI. The prediction of the bounding box is described in Figure 8.

In Figure 8, (b_x, b_y, b_w, b_h) are the center coordinates and dimensions of the bounding box. For each bounding box, YOLOv3 predicts four coordinates (t_x, t_y, t_w, t_h). If the cell is offset from the top-left corner of the image by (c_x, c_y) and the bounding box prior has width p_w and height p_h, then the predictions are as follows:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

where σ(·) denotes the sigmoid function.
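As a hedged, worked example of these equations (an illustration rather than code from the authors' implementation), the snippet below decodes one raw prediction into box coordinates:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv3 raw outputs (tx, ty, tw, th) into box centre and size,
    given the grid-cell offset (cx, cy) and the anchor prior size (pw, ph)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx          # b_x = sigma(t_x) + c_x
    by = sigmoid(ty) + cy          # b_y = sigma(t_y) + c_y
    bw = pw * math.exp(tw)         # b_w = p_w * e^(t_w)
    bh = ph * math.exp(th)         # b_h = p_h * e^(t_h)
    return bx, by, bw, bh
```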

In this work, YOLOv3 is used for OoI detection because it is much faster than its competitors [63]. Figure 9 shows a comparison of YOLOv3 with other object detection models in terms of speed. The processing speed of YOLOv3 is 45 fps, which is quite impressive compared to Single Shot Detectors (SSD), Faster R-CNN, and R-FCN [63, 64]. Although the accuracy of YOLOv3 is slightly lower than that of some of these competitors, its processing speed is much higher; it processes 45 fps, while the Faster R-CNN family processes only 5 fps [63, 64]. Figure 10 shows the comparison of YOLOv3 with its competitors.
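For concreteness, a minimal detection routine in the spirit of this module is sketched below using OpenCV's Darknet importer; the yolov3.cfg/yolov3.weights file names, the 0.5 confidence threshold, and the frame_contains_ooi helper are illustrative assumptions rather than the exact implementation used in this work.

```python
import cv2
import numpy as np

# Assumed standard Darknet configuration and weight files for YOLOv3.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()

def frame_contains_ooi(frame, ooi_class_id, conf_threshold=0.5):
    """Return True if the selected OoI is detected in the frame."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(out_names):
        for detection in output:  # [cx, cy, w, h, objectness, class scores...]
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            if class_id == ooi_class_id and scores[class_id] > conf_threshold:
                return True
    return False
```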

3.3. Video Summarization Module

The VS module takes the frames that contain an OoI as input and generates the video summary as output. The steps of the VS process are as follows:
(1) Read the current frame from the input video.
(2) Perform OoI detection in the current frame using YOLOv3.
(3) If the OoI is found, save the current frame in the buffer; otherwise, discard it.
(4) If the current frame is the last, go to step 5; otherwise, go to step 1 for the next frame.
(5) Finalize the video summarization process by combining all the buffered frames containing the OoI.

The algorithmic flow of the proposed video summarization framework is also given in Algorithm 1.

Algorithm 1: Video summary generation (VS)
Inputs: Video X, OoI O
Output: Summarized video Y
Start process: VS(X, O)
  N ← number of frames in X
  j ← 0
  for i = 0 to N − 1 do
    Read the current frame F[i]
    status ← OoI_detection(F[i], O)
    if status = 1 then
      Y[j] ← F[i]
      j ← j + 1
    else
      Discard F[i]
    end if
  end for
  Save Y
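A minimal Python rendering of Algorithm 1 is sketched below, assuming OpenCV for video I/O and the hypothetical frame_contains_ooi detector sketched earlier; the codec and output file name are illustrative assumptions only.

```python
import cv2

def summarize_video(input_path, ooi_class_id, output_path="summary.mp4"):
    """Keep only the frames in which the OoI is detected (Algorithm 1)."""
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:                               # last frame processed
            break
        if frame_contains_ooi(frame, ooi_class_id):
            writer.write(frame)                  # keep frames containing the OoI
        # frames without the OoI are discarded
    cap.release()
    writer.release()
    return output_path
```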

4. Experimental Analysis

All the experiments were performed on a machine equipped with an Intel Core i5-6200U processor (running at 2.4 GHz) and 8 gigabytes (GB) of RAM, and Python was used as the programming language.

In this work, a subjective method is used to evaluate the performance of the proposed framework. For each test stream, a summarized video is generated manually (with the help of the video editing tool "Filmora") and automatically through the proposed framework. The performance of the proposed framework is evaluated based on precision, recall, F1-score, and accuracy, defined in the standard way from the confusion-matrix counts, where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
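As a small worked example (not part of the original evaluation scripts), these standard measures can be computed from the confusion-matrix counts as follows:

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```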

Three different datasets are used to compare and validate the efficiency of the proposed framework against the manual method: the VSUMM dataset, the TVSum dataset, and our own dataset. The VSUMM dataset consists of 50 videos from the Open Video Project (OVP). All VSUMM videos are in MPEG-1 format at 30 fps with a resolution of 352 × 256 pixels. The videos in the VSUMM dataset belong to several categories (educational, documentary, historical, ephemeral, and lecture) and range from 1 to 4 minutes. TVSum contains 50 videos taken from different video websites that belong to several genres such as news, how-to, documentary, vlog, and egocentric. Our own dataset contains videos taken from multiple sources in AVI and MP4 format with different resolutions such as 320 × 240, 352 × 240, 640 × 360, 854 × 480, and 1280 × 720. Tables 2–4 list the sample test video sequences taken from the VSUMM, TVSum, and own datasets, along with their specifications.

Extensive experiments are performed to evaluate the performance of the proposed framework on videos with different durations and resolutions. Some of the scenarios are discussed in the subsequent sections.

4.1. Evaluation of the VSUMM Dataset
4.1.1. Scenario 1

In this scenario, the video sequence consists of scenes captured from a lecture video. The video has a duration of 58 seconds and a resolution of 352 × 256. In this video, the person is considered the OoI; hence, video summarization is performed based on the object "person." This video has been summarized using both the user-based (manual) method and the proposed automated model.

Figure 11 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reveals that the frames captured by both methods are the same in number, and there is no missing frame in this scenario.

The confusion matrix for the person as an OoI is shown in Table 5. It shows that all frames containing the person have been successfully detected by the proposed method. There was no error in the detection.

4.1.2. Scenario 2

In this scenario, the video sequence contains scenes captured from a clinic. The video has a duration of 1.08 minutes, and its resolution is 352 × 256. It contains several objects such as a person, glasses, a pen, and a clock. In the video, the person is considered the OoI.

Figure 12 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and no frame is missed by the proposed method.

The confusion matrix for the person taken as an OoI is shown in Table 6. It shows that, out of 420 frames containing the person, all frames have been successfully detected by the proposed method.

4.1.3. Scenario 3

In this scenario, the video sequence contains scenes captured from a documentary on farmers' lifestyle. In the video, a person is considered the OoI. The video has a duration of 3.17 minutes, and its resolution is 352 × 256.

Figure 13 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first four frames are captured in both methods, while the fifth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as an OoI is shown in Table 7. It shows that, out of 3780 frames containing the person, 3778 frames are detected by the proposed method, while 2 of the frames are wrongly predicted.

4.1.4. Scenario 4

In this scenario, the video sequence contains the captured scenes from the news. In the video, the person is considered an OoI. The video has a duration of 9 seconds with a resolution of 352 × 256.

Figure 14 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and no frame is missed by the proposed method.

The confusion matrix for the person taken as an OoI is shown in Table 8. It shows that, out of 360 frames containing the person, the proposed method has successfully detected all of the frames.

4.1.5. Scenario 5

In this scenario, the video sequence contains the captured scenes from the documentary on GYM workouts. The video has a duration of 2.58 minutes, and its resolution is 352 × 256. It comprises several objects such as person and car. In the video, the car is considered an OoI. Thus, the VS is performed using the object “car.”

Figure 15 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first three frames are captured by both methods, while the fourth and fifth frames are missed by the proposed method. The reason for the missed frames is that the size of the object in those frames is too small; it can be perceived with the naked eye but is not detected by the proposed method.

The confusion matrix for the car as an OoI is shown in Table 9. It indicates that, out of 300 total frames containing the car, 120 frames are detected by the proposed method. There are 180 frames in which a car exists, but the proposed framework did not detect them.

4.1.6. Results Summary

Table 10 presents the experimental results of the proposed framework. Several scenarios have been taken from different scenes or locations such as lectures, documentaries, news, and GYM containing various objects such as person and cars.

The object type and scenario columns detail the object considered the OoI in each scenario. The duration of the summarized video is also recorded, describing the video length before and after processing. In the best cases (lecture, documentary, and news), the recall and precision are 100%, showing that the proposed framework accurately identifies the object and generates a complete video summary. In the worst cases (documentary 2 and GYM), the recall is lower because the objects present in the frames are too small and the video quality is poor; therefore, the proposed method could not detect the object. The overall accuracy of the proposed framework is 99.6%, and the total saved time is 82.84%.

4.2. Evaluation of the TVSum Dataset
4.2.1. Scenario 1

In this scenario, the video sequence consists of captured scenes from the documentary on the honey bee. The video has a duration of 1.38 minutes and 640 × 360 resolution. In this video, the person is considered an OoI.

Figure 16 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first three frames are captured in both methods, while the fourth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as an OoI is shown in Table 11. It shows that all the frames captured by both methods are the same in number, while the proposed method wrongly predicted 21 frames.

4.2.2. Scenario 2

In this scenario, the video sequence consists of captured scenes from the news. The video has a duration of 2.18 minutes and 480 × 360 resolution. In this video, the person is considered an OoI.

Figure 17 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first four frames are captured by both methods, while the fifth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as an OoI is shown in Table 12. It shows that all the frames captured by both methods are the same in number, while only one frame is wrongly predicted by the proposed method in this video.

4.2.3. Scenario 3

This video sequence consists of captured scenes from the truck accident video. In this video, the truck is considered an OoI. It comprises several objects such as a person, car, chair, and truck. It has a duration of 5.22 minutes and 640 × 360 resolution.

Figure 18 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and not a single frame is missed by the proposed method.

The confusion matrix for the truck as an OoI is shown in Table 13. It shows that all 2940 frames containing the truck have been successfully detected by the proposed method.

4.2.4. Scenario 4

In this scenario, the video sequence consists of captured scenes from the festival. In this video, the truck is considered an OoI. It comprises several objects such as a person, truck, and balloons. The video has a duration of 1.50 minutes and 640 × 360 resolution.

Figure 19 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and not a single frame is missed by the proposed method.

The confusion matrix for the truck as an OoI is shown in Table 14. It shows that the proposed method has successfully detected all the 60 frames containing the truck.

4.2.5. Scenario 5

In this scenario, the video sequence monitors pet dog behavior scenes. The video has a duration of 2.10 minutes and 640 × 360 resolution. It comprises several objects such as a person and a clock. In this video, the dog is considered an OoI.

Figure 20 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and not a single frame is missed by the proposed method.

The confusion matrix for the dog as an OoI is shown in Table 15. It shows that all 1740 frames containing the dog have been successfully detected by the proposed method.

4.2.6. Results Summary

Table 16 presents the experimental results of the proposed framework. In the experimental analysis, several scenarios have been taken from different scenes or locations such as documentaries, festivals, and news, containing various types of objects such as a person, dog, and truck.

In the best cases, such as the truck accident, festival, and news videos, the recall and precision of the proposed method are 100%, which shows that the proposed framework identifies the object precisely and generates a complete video summary. In the worst cases, such as the news and honey bee documentary videos, the precision is lower. The overall accuracy of the proposed framework is 99.9%, and the total saved time is 78.82%.

4.3. Evaluation of Own Dataset
4.3.1. Scenario 1

In this scenario, the video sequence consists of scenes captured from an airport environment, and VS is performed with the airplane as the OoI. The video comprises several objects, for example, person, airplane, tree, and mountain. It has been summarized using both the user-based (manual) method and the proposed automated model. The video has a duration of 24 seconds and a resolution of 1280 × 738.

Figure 21 presents the frame-level comparisons of the proposed method with the manual method. It reveals that the frames captured by both methods are the same in number, and there is no missing frame in this scenario.

The confusion matrix for the airplane as an OoI is shown in Table 17. It shows that all 360 frames containing the airplane have been successfully detected by the proposed method. There was no error in the detection.

4.3.2. Scenario 2

In this scenario, the video sequence contains scenes captured from a roadside/street environment. It captures several objects such as a person, car, and bike. In the video, the car is considered the OoI. The video has a duration of 3.06 minutes with a resolution of 1280 × 738.

Figure 22 shows frame-level comparisons of the proposed method with the manual method. It reveals that the frames captured by both methods are the same in number, and no frame is missed by the proposed method.

The confusion matrix for the car as an OoI is shown in Table 18. It shows that the proposed method has successfully detected all of the 120 frames containing the car.

4.3.3. Scenario 3

In this scenario, the video sequence contains scenes captured from a parking environment. It comprises several objects such as person, car, and tree. In this video, the person is taken as the OoI. The video has a duration of 5.16 minutes, and its resolution is 854 × 480.

Figure 23 shows the frame-level comparisons of the proposed method with the manual method. It reflects that the first four frames are captured in both methods, while the fifth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as the OoI is shown in Table 19. It shows that, out of 779 frames containing the person, 718 frames are detected by the proposed method, while the remaining 61 frames are wrongly classified as frames without the person.

4.3.4. Scenario 4

In this scenario, the video sequence contains scenes showing precautions related to mobile snatching, and the mobile phone is taken as the OoI. The video has a duration of 2.49 minutes, and its resolution is 640 × 360. It comprises several objects such as person, mobile phone, and bike.

Figure 24 shows frame-level comparisons of the proposed method with the manual method. It reflects that the first six frames are captured by both methods, while the seventh and eighth frames shown in Figure 24(a) are missed by the proposed method. The reason is that the mobile phone in those frames is tiny and can only be seen with the naked eye (manual method). Similarly, the proposed method wrongly predicts the seventh frame in Figure 24(b).

The confusion matrix for the mobile phone as the OoI is shown in Table 20. It shows that, out of 992 frames containing the mobile phone, 660 frames are detected by the proposed method, while 262 frames are missed and 3 frames are falsely detected.

4.3.5. Scenario 5

In this scenario, the video sequence contains bike snatching scenes captured from the roadside, and the bike is considered the OoI. The video comprises several objects such as a person, bike, and trees. Its length is 30 seconds, and its resolution is 640 × 360.

Figure 25 presents a frame-level comparison of the proposed method with the manual method. It shows that all the frames captured by both methods are properly detected, so there is no incorrectly detected or missing frame.

Table 21 shows the confusion matrix for the bike as an OoI. It shows that, out of 300 frames containing the bike, all frames are detected by the proposed method. None of the frames is incorrectly detected or missed by the proposed method.

4.3.6. Scenario 6

In this scenario, the video sequence contains gun testing scenes. The video has a duration of 3.03 minutes, and its resolution is 640 × 360. It comprises several objects such as a person, umbrella, and pistol. In this video, the pistol is considered an OoI.

Figure 26 presents a frame-level comparison of the proposed method with the manual method. It shows that all the frames captured by both methods are properly detected; therefore, there is no incorrectly detected or missing frame.

Table 22 shows the confusion matrix for pistol as an OoI. It shows that, out of 9000 frames containing the pistol, all of the frames are detected using the proposed method. None of the frames is incorrectly detected or missed by the proposed method.

4.3.7. Results Summary

Table 23 presents the experimental results of the proposed framework. In experimental analysis, several scenarios have been taken from different scenes or locations such as airport, parking, and street containing various objects such as a person, bike, and airplane.

In the best cases, such as the street video, airport, and bike snatching scenarios, the recall and precision of the proposed method are 100%, which shows that the proposed framework identifies the object precisely and generates a complete video summary. In the worst cases, such as mobile snatching, the recall is lower because the object in the video frames is tiny and can be seen only with the naked eye; consequently, the proposed method is unable to detect such objects. The overall accuracy of the proposed framework is 99.33%, and the total saved time is 87.86%.

4.4. Comparative Analysis

This section presents a comparative analysis of the proposed framework with the existing VS techniques. The comparative analysis is based on the following fundamental features:
(i) F1: customized object type (OoI)
(ii) F2: frame extraction based on object
(iii) F3: object detection accuracy
(iv) F4: summarization rate

Table 24 shows that most existing techniques perform general object detection rather than focusing on a specific object (i.e., they do not consider an object as an Object of Interest). Similarly, many techniques perform frame extraction by redundant frame elimination and scene elimination instead of focusing on the objects. The analysis shows that the proposed framework is unique and contains the most relevant features for VS. The uniqueness of the proposed framework is that it summarizes the video based on the Object of Interest supplied to the system as input. Added advantages of the proposed VS framework are its simplicity and ease of understanding, with accuracies of 99.6%, 99.9%, and 99.3% and summarization rates of 82.8%, 78.8%, and 91.7% on the VSUMM, TVSum, and own datasets, respectively, which increases its efficiency compared to other methods.

To further evaluate the performance of our proposed framework for VS, a comparative analysis between the proposed framework and other state-of-the-art VS techniques is performed as given in Table 25.

5. Conclusion

This paper presents an effective VS framework that summarizes a video based on the OoI. The proposed framework is effective and performs much faster than other state-of-the-art methods for summarizing videos. The OoI-based approach makes it more reliable and flexible in generating a relevant video summary, and YOLOv3 enables the proposed framework to detect various objects efficiently and precisely. To validate the proposed framework, extensive experiments are performed on three different datasets: the VSUMM dataset, the TVSum dataset, and our own dataset. The proposed VS framework achieved an accuracy of 99.6% with high processing speed on the VSUMM dataset, saving 82.8% of the time that would otherwise be spent watching the full video to detect the OoI. Similarly, an accuracy of 99.9% with a summarization rate of 78.8% is achieved on the TVSum dataset. The accuracy on the own dataset is 99.3%, and the overall saved time is 87.86%. A desktop application is also developed that provides ease of use and customized object selection. In the future, this work can be extended by enriching the dictionary and training the model for more OoI. It can also be deployed in a real-time environment to record summarized videos for various types of crime scenes.

Data Availability

The data used to support the study’s findings are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.