Abstract

The advancements in digital video technology have enabled video surveillance to play a vital role in ensuring security and safety. Public and private enterprises use surveillance systems to monitor and analyze daily activities. Consequently, a massive volume of video data is generated that requires further processing to enforce security protocols. Analyzing video content is a tedious and time-consuming task, and it also requires high-speed computing hardware. The concept of video summarization has emerged to overcome these limitations. This paper presents a customized video summarization framework based on deep learning. The proposed framework enables a user to summarize a video according to an Object of Interest (OoI), for example, person, airplane, mobile phone, bike, or car. Various experiments are conducted to evaluate the performance of the proposed framework on the video summarization (VSUMM) dataset, the title-based video summarization (TVSum) dataset, and our own dataset. The accuracy on the VSUMM, TVSum, and own datasets is 99.6%, 99.9%, and 99.2%, respectively. A desktop application is also developed to help the user summarize a video based on the OoI.

1. Introduction

Security is a primary concern across the world. Alongside other security measures, video surveillance cameras have been installed on private and public premises to cope with this challenge. Several kinds of security surveillance cameras (i.e., static and movable) have been installed in public places, homes, shops, airports, banks, and so on. These cameras play a vital role in real-time monitoring and detecting suspicious behavior. They are also helpful in investigating events or crime scenes, for example, road accidents, robbery, murder, and terrorist activity [1].

Furthermore, the global estimated number of currently operational cameras is more than 770 million [2]. These cameras usually remain active round the clock and generate more than 2,500 petabytes of video data per day [3]. Figure 1 exhibits daily statistics of the real-world data produced by the video surveillance cameras.

Considerable progress has already been made in developing video analytics tools that automatically perform content-based video interpretation, including motion detection [4, 5], facial recognition [6, 7], people counting [8–10], and license plate recognition [11–13]. However, manual intervention (by security guards, police officers, etc.) is still required to analyze the recorded videos. Visual analysis of video content to extract meaningful information is complex and time-consuming because it requires concentrating on and watching the whole video [14]. It may also result in false negatives, especially in the case of long videos. Therefore, there is a pressing need for a solution that reduces the human effort and time spent on manual analysis. Multiple efforts are being made towards video summarization to address this concern and generate a video summary that quickly provides the gist of the whole video [15]. Video summarization (VS) is the process of creating a summary of extensive video content by detecting and presenting to potential users the material that is most informative and up to date. VS is used in security surveillance systems to detect and analyze suspicious or anomalous activities. Personal VS is used to share occasional videos on social media, generate sports highlights, make trailers of movies and serials, and index video content to facilitate fast browsing of large videos through a video search engine, and so on [16–19].

Researchers have made several efforts to propose automatic VS. Most VS techniques generate a summary by selecting keyframes that represent the video through a skimming process [8–12, 15–17, 20–37]. Feature-based approaches for VS produce a generalized video summary rather than focusing on a specific object [20–37]. Shot boundary detection approaches are also well known for video summarization [38–45]. These approaches show limitations in detecting objects precisely and hence fail to fulfill the user's requirements. Clustering [20, 24, 46–49] and trajectory-based [17, 28] techniques summarize the video by focusing on similar activities, events, and objects. However, these approaches do not summarize a video according to the user's interest. Consequently, they are of limited use for retrieval tasks and do not enhance the user's viewing experience.

This study presents an effective VS framework based on the OoI to cope with these issues. The OoI refers to an object such as a person, car, mobile phone, or bike that a user selects in order to summarize the video by collecting all frames in which the selected object appears. The proposed VS framework works in three steps: (i) the OoI selection phase, (ii) the object localization or detection phase, and (iii) the video summarization phase. Initially, the OoI is selected from a dictionary (a database of objects) so that unnecessary noisy objects (other than the OoI) can be ignored, which is essential for object segmentation. After that, the You Only Look Once (YOLOv3) detector is applied to localize the OoI. Once the OoI is localized, the proposed VS algorithm summarizes the video based on the OoI.

Based on the above discussion, the contributions of the proposed work can be summarized as follows:
(i) In the OoI selection step, the proposed algorithm selects the object from the dictionary and automatically ignores all unnecessary objects. After that, YOLOv3 is used to detect the desired object.
(ii) The proposed VS framework can detect single and multiple objects present in the video.
(iii) The proposed VS algorithm effectively summarizes the video and overcomes the challenges present in the VSUMM [50], TVSum [51], and own datasets.
(iv) The experimental findings highlight that the proposed VS framework performs remarkably well compared to state-of-the-art video summarization methods.

The rest of the paper is organized as follows: Section 2 describes the literature review of existing techniques. Section 3 presents the proposed VS model and explains the video summarization method. Section 4 discusses the results and the comparison with other techniques, and finally, the conclusion is presented in Section 5.

2. Literature Review

Several video summarization techniques have been proposed. Uchihachi et al. [42] presented a packing algorithm to define an optimal layout. This algorithm packs the selected frames to produce the best sequence in a block, organizes several shots, and produces a concise video. However, this approach is useful only when the camera is moving. Ngo et al. [34] presented a method that relies on perceptual quality and redundancy reduction to preserve the content of the video summary. Video clusters are generated through temporal slice coherency to partition the video into shots and subshots. The authors then adopted the motion attention framework presented in [52] to analyze the clusters and the quality of the shots. A temporal graph is also formed to determine the importance of the clusters, and the graph's attention values are used to select the appropriate scenes for the video summary. The summary generated using this approach is only about 10–25% of the entire video.

Rav-Acha et al. [32] presented a video abstraction approach in which all events and activities of the video are consolidated to generate a video summary. This is done by directly detecting the moving objects and then optimizing the video using the detected objects. However, this approach cannot join parts of different scenes because of discontinuity. Damnjanovic et al. [35] introduced an event-based video summarization technique. Initially, the technique calculates each frame's energy by summing the absolute differences between the current frame and the reference frame pixel values. In this way, all the events present in the frames are determined. A video summarization algorithm is then applied to extract keyframes. The suggested approach is suitable for a static environment; however, in the case of a dynamic or changing background, the system's performance is unacceptable. Almeida et al. [30] presented a video summarization method that performs three steps to summarize the video content: extraction of visual features, summarization of the video, and filtration. First, visual features are extracted with a color histogram that labels the visual content by manipulating color. Second, a fast and straightforward algorithm is applied to condense the video; its purpose is to detect similar content and select the relevant frames. Finally, filtration is performed on the selected frames to remove noise and redundant data and generate the video summary. However, the approach is highly hardware-dependent and requires high computational speed, and its performance is unacceptable for object summarization in lengthy videos. Wang and Ngo [20] discussed a method in which a hierarchical hidden Markov model recognizes motion features and maps low-level features to high-level semantic concepts. After identifying the object's features, the most representative clips are selected from the video to generate a video summary. The generated summary is 50 times faster than the original video. The inclusion of similar shots in the summary is a major limitation of this approach, as it causes redundancy; another limitation is that moving objects are ignored.

Miniakhmetova and Zymbler [40] described a personalized video summarization method that works in two stages. The first stage is video structuring, where various scene detection techniques are applied and a video summary is generated. In the second stage, objects are detected from a subset of video scenes using a detection bank, and the video summary is composed of the most influential scenes in which objects of the user's interest are detected. However, the authors only proposed the idea of such a system without building a prototype. Varghese and Nair [38] proposed a method in which a video is summarized through three main steps: shot boundary detection, redundant frame elimination, and stroboscopic imaging. Shot boundary detection compares the current frame with its neighboring frame, and repetitive frames are removed using the structural similarity index (SSI). Stroboscopic imaging is used to understand the common background and show the activities present in the video. Compared to the original video, the presented technique reduces the volume of the summarized video by 55%. Lai et al. [45] presented a frame recomposition-based approach using a clustering algorithm, optical flow, and background subtraction to detect foreground objects. A foreground object is detected by fusing a group of pixels. After detecting the objects/activities, a sliding window is used to combine the detected objects in consecutive frames and build a spatiotemporal trajectory. The video summary is generated by combining the entire spatiotemporal trajectory, and the algorithm achieved an accuracy of 97%. However, this technique is suitable only when the camera position remains static. Srinivas et al. [21] discussed how a video can be summarized by computing three factors. First, a score is assigned to each frame based on various features such as quality, color, hue, static attention, temporal segment, demonstration, and uniformity. Second, weights are assigned to each score based on feature importance for selecting keyframes, with the standard deviation used for the weight assignment. Finally, redundancy is eliminated by removing repetitive frames, which are collected in ascending order of their scores. The presented method showed slightly better performance than the improved frame-blocks features method (IFBFM); however, a comparison with the state of the art is not presented.

Davila and Zanibbi [33] discussed frame selection in lecture videos based on segmentation, by resolving conflicts between content regions, removing objects, and rebuilding each frame to generate a video summary. The compression rate of the approach is not specified, and the approach is tested on lecture videos only. Ajmal et al. [29] discussed a method in which human motion is tracked with the help of a Kalman filter to find the trajectory. Color features are used, with a color histogram employed for shot detection and generation of the video summary. However, this approach is designed for surveillance videos only. Ma et al. [39] presented a collaborative representation of adjacent frames to detect abnormal frames and remove noisy content from the video. Keyframes are selected using minimum sparse reconstruction to remove the noisy data and prevent the loss of important information; a frame with a high collaborative representation error is considered a keyframe. A greedy iterative algorithm is utilized for model optimization that controls the number of keyframes with the help of the average percentage of reconstruction (APOR) and the sparse boundary. However, this approach is ineffective for videos with highly varying frames. Sridevi and Kharde [53] performed video summarization by detecting highlights. In this method, a two-stream architecture consisting of deep convolutional neural networks (DCNNs) is used: a two-dimensional convolutional neural network (2D-CNN) exploits spatial information, and a three-dimensional convolutional neural network (3D-CNN) exploits temporal information to score video segment highlights. This method achieved a precision of 43.9.

Meyer et al. [54] presented a cloud-based system known as HOMER for the generation of video highlights. In this system, the video summary is generated by detecting the user's emotions. Two different datasets are used for experimental analysis: a dataset filmed through a dual-camera setup and home videos randomly selected from Microsoft's Video Titles in the Wild (VTW) dataset. As a result, HOMER achieved a 38% improvement over the baseline. Afzal and Tahir [55] described video summarization by combining ResNet 152 and a gated recurrent unit (GRU). In this method, ResNet 152 is used to extract deep features from the video, while the GRU improves the method's robustness and performance. The experimental analysis was performed on the SumMe dataset, and the F-measure was 43.7. Gunawardena et al. [56] performed OoI-based video summarization by generating features from the video according to different scenes with the help of VGG16-1. The technique generates features of the OoI using VGG16-2 by taking the frames (containing objects) from the selected video. The accuracy of the proposed method is 88%; however, only three objects are used in the experiments and the computational time is high. Meng et al. [57] proposed a technique that summarizes a video into several key objects through representative object proposals generated from video frames. The proposed technique is tested only on a few objects such as a clock, a microphone, and signs, and its overall accuracy is not given. Fataliyev et al. [58] proposed a method that summarizes a video with the help of object motion pattern analysis. The method is based on key position extraction and index frame generation. Gaussian mixtures are used for object extraction and adaptive background subtraction, and morphological opening and closing operations are adopted for noise reduction. The overall accuracy of the method is 82%; however, the method is tested only for a single object, namely a person.

Table 1 summarizes some of the existing VS techniques discussed in the current section.

In public places, many surveillance cameras have been installed to monitor suspicious activities such as mobile snatching, terrorism, and robbery, where the information contained in every single frame is essential. Most existing techniques work on the principle of keyframe selection by eliminating redundant frames, which may result in the loss of important information related to the user's interest. Due to this limitation (the disappearance of objects and events), these techniques cannot produce significant results. Although some techniques summarize the video based on the OoI, their main limitations are high computational power requirements and low accuracy. Therefore, there is a need for a framework that provides robustness, high accuracy, support for multiple static and dynamic objects, and provision for investigating numerous scenarios. The framework should be flexible enough to accommodate a wide range of OoI.

3. Proposed Framework

The proposed VS framework takes a video and an OoI as input. The frames containing the OoI are then detected using an object detection module. Finally, only the detected frames are combined to produce a video summary as output. The architecture of the proposed framework is shown in Figure 2. It comprises the following main modules:
(i) Selection of input: it takes the video and the OoI
(ii) OoI detection module: it detects the OoI from the video using a deep learning technique
(iii) Video summarization module: it takes the frames that contain the OoI and generates the video summary as output

The description of each module is given in the following subsections.

3.1. Selection of Inputs

In this work, a desktop application is developed using Python that provides an interactive user interface for selecting input video and OoI. The detailed working of the application is as follows.

3.1.1. Input Video Selection

The front-end of the application developed for selecting input video is shown in Figure 3. It contains the information related to the input and performs a video format validation check. The application only supports MP4 and AVI standard formats.

3.1.2. OoI Selection

After selecting the video, the next step is to choose the object type (i.e., the OoI) to be detected from the input video. The user may select the OoI from a dropdown menu, as shown in Figure 4. A dictionary has been developed using the MS COCO dataset. The dataset contains 330 thousand images, of which more than 200 thousand are labeled. Moreover, it has 1.5 million object instances across 80 object categories such as car, person, and suitcase. The 11 supercategories of the MS COCO dataset are person, animal, outdoor objects, indoor objects, vehicle, sports, kitchenware, food, appliance, furniture, and electronics [59]. The pistol dataset contains 2986 images with a single annotation class, pistol; its images include cartoon and staged studio-quality images of guns and pistols in hand [60]. A sample set of images from the MS COCO and pistol datasets is shown in Figures 5 and 6.
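As an illustration of how such a dictionary of selectable OoI could be assembled, the following minimal Python sketch loads class labels from a Darknet-style class list and adds the pistol class; the file name "coco.names" and the load_ooi_dictionary helper are assumptions made for illustration, not part of the released application.

```python
# Minimal sketch: build the OoI dictionary from a Darknet-style class list
# ("coco.names", one label per line) and add the extra pistol class.
def load_ooi_dictionary(coco_names_path="coco.names"):
    with open(coco_names_path) as f:
        classes = [line.strip() for line in f if line.strip()]
    classes.append("pistol")  # single class from the pistol dataset
    return {name: idx for idx, name in enumerate(classes)}

ooi_dictionary = load_ooi_dictionary()
selected_ooi = "person"  # value chosen from the dropdown menu (Figure 4)
assert selected_ooi in ooi_dictionary, "OoI must exist in the dictionary"
```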

3.2. OoI Detection Module

In the proposed framework, YOLOv3 (You Only Look Once) [42] is used to detect the OoI. This module determines the scene, event, and frame where the desired object is located. YOLOv3 uses a variant of Darknet containing 53 layers trained on ImageNet. A further 53 layers are added for the detection task, providing the fully convolutional underlying architecture of YOLOv3, which consists of 106 layers. There is no pooling layer in YOLOv3; instead, a convolutional layer with stride 2 is used to downsample the feature maps, preventing the loss of low-level features [61]. YOLOv3 applies a single neural network to the full video, where the network divides the frames into regions and predicts probabilities and bounding boxes [62]. The architecture of YOLOv3 is presented in Figure 7.

In YOLOv3, each class score is predicted with the help of logistic regression, and multiple labels can be predicted for an object using a threshold: classes with scores higher than the threshold are assigned to the box [59]. In the proposed framework, object detection is performed using a bounding box to demarcate the OoI. In the case of multiple objects in a frame, this method helps describe the spatial location of the OoI. The prediction of the bounding box is described in Figure 8.

In Figure 8, (b_x, b_y, b_w, b_h) are the center coordinates and dimensions of the bounding box. For each bounding box, YOLOv3 predicts four coordinates (t_x, t_y, t_w, t_h). If the cell is offset from the top-left corner of the image by (c_x, c_y) and the bounding box prior has width p_w and height p_h, then the predictions are as follows:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)

where σ(·) denotes the sigmoid function.
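As a hedged, worked example of these equations (an illustration rather than code from the authors' implementation), the snippet below decodes one raw prediction into box coordinates:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv3 raw outputs (tx, ty, tw, th) into box centre and size,
    given the grid-cell offset (cx, cy) and the anchor prior size (pw, ph)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx          # b_x = sigma(t_x) + c_x
    by = sigmoid(ty) + cy          # b_y = sigma(t_y) + c_y
    bw = pw * math.exp(tw)         # b_w = p_w * e^(t_w)
    bh = ph * math.exp(th)         # b_h = p_h * e^(t_h)
    return bx, by, bw, bh
```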

In this work, YOLOv3 is used for OoI detection because it is much faster than its competitors [63]. Figure 9 shows a comparison of YOLOv3 with other object detection models in terms of speed. The processing speed of YOLOv3 is 45 fps, which is quite impressive compared to Single Shot Detectors (SSD), Faster R-CNN, and R-FCN [63, 64]. Although the accuracy of YOLOv3 is slightly lower than that of some of these competitors, its processing speed is much higher; it processes 45 fps, while the Faster R-CNN family processes only 5 fps [63, 64]. Figure 10 shows the comparison of YOLOv3 with its competitors.
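For concreteness, a minimal detection routine in the spirit of this module is sketched below using OpenCV's Darknet importer; the yolov3.cfg/yolov3.weights file names, the 0.5 confidence threshold, and the frame_contains_ooi helper are illustrative assumptions rather than the exact implementation used in this work.

```python
import cv2
import numpy as np

# Assumed standard Darknet configuration and weight files for YOLOv3.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()

def frame_contains_ooi(frame, ooi_class_id, conf_threshold=0.5):
    """Return True if the selected OoI is detected in the frame."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    for output in net.forward(out_names):
        for detection in output:  # [cx, cy, w, h, objectness, class scores...]
            scores = detection[5:]
            class_id = int(np.argmax(scores))
            if class_id == ooi_class_id and scores[class_id] > conf_threshold:
                return True
    return False
```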

3.3. Video Summarization Module

The VS module takes the frames that contain an OoI as input and generates the video summary as output. The steps of the VS process are as follows:
(1) Read the current frame from the input video.
(2) Perform OoI detection in the current frame using YOLOv3.
(3) If the OoI is found, save the current frame in the buffer; otherwise, discard it.
(4) If the current frame is the last, go to step 5; otherwise, go to step 1 for the next frame.
(5) Finalize the video summarization process by combining all the buffered frames containing the OoI.

The algorithmic flow of the proposed video summarization framework is also given in Algorithm 1.

Algorithm 1: Video summary generation (VS)
Inputs: Video X, OoI O
Output: Summarized video Y
Start process: VS(X, O)
  N ← number of frames in X
  j ← 0
  for i = 0 to N − 1 do
    Read the current frame F[i]
    status ← OoI_detection(F[i], O)
    if status = 1 then
      Y[j] ← F[i]
      j ← j + 1
    else
      Discard F[i]
    end if
  end for
  Save Y
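A minimal Python rendering of Algorithm 1 is sketched below, assuming OpenCV for video I/O and the hypothetical frame_contains_ooi detector sketched earlier; the codec and output file name are illustrative assumptions only.

```python
import cv2

def summarize_video(input_path, ooi_class_id, output_path="summary.mp4"):
    """Keep only the frames in which the OoI is detected (Algorithm 1)."""
    cap = cv2.VideoCapture(input_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:                               # last frame processed
            break
        if frame_contains_ooi(frame, ooi_class_id):
            writer.write(frame)                  # keep frames containing the OoI
        # frames without the OoI are discarded
    cap.release()
    writer.release()
    return output_path
```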

4. Experimental Analysis

All the experiments were performed on a machine equipped with an Intel Core i5-6200U processor (running at 2.4 GHz) and 8 gigabytes (GB) of RAM, and Python was used as the programming language.

In this work, a subjective method is used to evaluate the performance of the proposed framework. For each test stream, a summarized video is generated manually (with the help of the video editing tool "Filmora") and automatically through the proposed framework. The performance of the proposed framework is evaluated based on precision, recall, F1-score, and accuracy, defined in the standard way from the confusion-matrix counts, where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
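As a small worked example (not part of the original evaluation scripts), these standard measures can be computed from the confusion-matrix counts as follows:

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```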

Three different datasets are used to compare and validate the efficiency of the proposed framework against the manual method: the VSUMM dataset, the TVSum dataset, and our own dataset. The VSUMM dataset consists of 50 videos from the Open Video Project (OVP). All VSUMM videos are in MPEG-1 format at 30 fps with a resolution of 352 × 256 pixels. The videos in the VSUMM dataset belong to several categories (educational, documentary, historical, ephemeral, and lecture) and range from 1 to 4 minutes. TVSum contains 50 videos taken from different video websites that belong to several genres such as news, how-to, documentary, vlog, and egocentric. Our own dataset contains videos taken from multiple sources in AVI and MP4 format with different resolutions such as 320 × 240, 352 × 240, 640 × 360, 854 × 480, and 1280 × 720. Tables 2–4 list the sample test video sequences taken from the VSUMM, TVSum, and own datasets, along with their specifications.

Extensive experiments are performed to evaluate the performance of the proposed framework on videos with different durations and resolutions. Some of the scenarios are discussed in the subsequent sections.

4.1. Evaluation of the VSUMM Dataset
4.1.1. Scenario 1

In this scenario, the video sequence consists of scenes captured from a lecture video. The video has a duration of 58 seconds and a resolution of 352 × 256. In this video, the person is considered the OoI; hence, video summarization is performed based on the object "person." This video has been summarized using both the user-based (manual) method and the proposed automated model.

Figure 11 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reveals that the frames captured by both methods are the same in number, and there is no missing frame in this scenario.

The confusion matrix for the person as an OoI is shown in Table 5. It shows that all frames containing the person have been successfully detected by the proposed method. There was no error in the detection.

4.1.2. Scenario 2

In this scenario, the video sequence contains scenes captured from a clinic. The video has a duration of 1.08 minutes, and its resolution is 352 × 256. It contains several objects such as a person, glasses, a pen, and a clock. In the video, the person is considered the OoI.

Figure 12 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and no frame is missed by the proposed method.

The confusion matrix for the person taken as an OoI is shown in Table 6. It shows that, out of 420 frames containing the person, all frames have been successfully detected by the proposed method.

4.1.3. Scenario 3

In this scenario, the video sequence contains scenes captured from a documentary on farmers' lifestyle. In the video, a person is considered the OoI. The video has a duration of 3.17 minutes, and its resolution is 352 × 256.

Figure 13 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first four frames are captured in both methods, while the fifth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as an OoI is shown in Table 7. It shows that, out of 3780 frames containing the person, 3778 frames are detected by the proposed method, while 2 of the frames are wrongly predicted.

4.1.4. Scenario 4

In this scenario, the video sequence contains the captured scenes from the news. In the video, the person is considered an OoI. The video has a duration of 9 seconds with a resolution of 352 × 256.

Figure 14 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and no frame is missed by the proposed method.

The confusion matrix for the person taken as an OoI is shown in Table 8. It shows that, out of 360 frames containing the person, the proposed method has successfully detected all of the frames.

4.1.5. Scenario 5

In this scenario, the video sequence contains the captured scenes from the documentary on GYM workouts. The video has a duration of 2.58 minutes, and its resolution is 352 × 256. It comprises several objects such as person and car. In the video, the car is considered an OoI. Thus, the VS is performed using the object “car.”

Figure 15 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first three frames are captured by both methods, while the fourth and fifth frames are missed by the proposed method. The reason for the missed frames is that the size of the object in those frames is too small; it can be perceived with the naked eye but is not detected by the proposed method.

The confusion matrix for the car as an OoI is shown in Table 9. It indicates that, out of 300 total frames containing the car, 120 frames are detected by the proposed method. There are 180 frames in which a car exists, but the proposed framework did not detect them.

4.1.6. Results Summary

Table 10 presents the experimental results of the proposed framework. Several scenarios have been taken from different scenes or locations such as lectures, documentaries, news, and GYM containing various objects such as person and cars.

The object type and scenario columns detail the object considered the OoI in each scenario. The duration of the summarized video is also recorded, describing the video length before and after processing. In the best cases (lecture, documentary, and news), the recall and precision are 100%, showing that the proposed framework accurately identifies the object and generates a complete video summary. In the worst cases (documentary 2 and GYM), the recall is lower because the objects present in the frames are too small and the video quality is poor; therefore, the proposed method could not detect the object. The overall accuracy of the proposed framework is 99.6%, and the total saved time is 82.84%.

4.2. Evaluation of the TVSum Dataset
4.2.1. Scenario 1

In this scenario, the video sequence consists of captured scenes from the documentary on the honey bee. The video has a duration of 1.38 minutes and 640 × 360 resolution. In this video, the person is considered an OoI.

Figure 16 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first three frames are captured in both methods, while the fourth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as an OoI is shown in Table 11. It shows that all the frames captured by both methods are the same in number, while the proposed method wrongly predicted 21 frames.

4.2.2. Scenario 2

In this scenario, the video sequence consists of captured scenes from the news. The video has a duration of 2.18 minutes and 480 × 360 resolution. In this video, the person is considered an OoI.

Figure 17 shows frame-level (few frames) comparisons of the proposed method with the manual method. It reflects that the first four frames are captured by both methods, while the fifth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as an OoI is shown in Table 12. It shows that all the frames captured by both methods are the same in number, while only one frame is wrongly predicted by the proposed method in this video.

4.2.3. Scenario 3

This video sequence consists of captured scenes from the truck accident video. In this video, the truck is considered an OoI. It comprises several objects such as a person, car, chair, and truck. It has a duration of 5.22 minutes and 640 × 360 resolution.

Figure 18 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and not a single frame is missed by the proposed method.

The confusion matrix for the truck as an OoI is shown in Table 13. It shows that all 2940 frames containing the truck have been successfully detected by the proposed method.

4.2.4. Scenario 4

In this scenario, the video sequence consists of captured scenes from the festival. In this video, the truck is considered an OoI. It comprises several objects such as a person, truck, and balloons. The video has a duration of 1.50 minutes and 640 × 360 resolution.

Figure 19 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and not a single frame is missed by the proposed method.

The confusion matrix for the truck as an OoI is shown in Table 14. It shows that the proposed method has successfully detected all the 60 frames containing the truck.

4.2.5. Scenario 5

In this scenario, the video sequence monitors pet dog behavior scenes. The video has a duration of 2.10 minutes and 640 × 360 resolution. It comprises several objects such as a person and a clock. In this video, the dog is considered an OoI.

Figure 20 shows frame-level comparisons of the proposed method with the manual method. It shows that the frames captured by both methods are the same in number, and not a single frame is missed by the proposed method.

The confusion matrix for the dog as an OoI is shown in Table 15. It shows that all 1740 frames containing the dog have been successfully detected by the proposed method.

4.2.6. Results Summary

Table 16 presents the experimental results of the proposed framework. In the experimental analysis, several scenarios have been taken from different scenes or locations such as documentaries, festivals, and news, containing various types of objects such as a person, dog, and truck.

In the best cases, such as the truck accident, festival, and news videos, the recall and precision of the proposed method are 100%, which shows that the proposed framework identifies the object precisely and generates a complete video summary. In the worst cases, such as the news and honey bee documentary videos, the precision is lower. The overall accuracy of the proposed framework is 99.9%, and the total saved time is 78.82%.

4.3. Evaluation of Own Dataset
4.3.1. Scenario 1

In this scenario, the video sequence consists of scenes captured from an airport environment, and VS is performed with the airplane as the OoI. The video comprises several objects, for example, person, airplane, tree, and mountain. It has been summarized using both the user-based (manual) method and the proposed automated model. The video has a duration of 24 seconds and a resolution of 1280 × 738.

Figure 21 presents the frame-level comparisons of the proposed method with the manual method. It reveals that the frames captured by both methods are the same in number, and there is no missing frame in this scenario.

The confusion matrix for the airplane as an OoI is shown in Table 17. It shows that all 360 frames containing the airplane have been successfully detected by the proposed method. There was no error in the detection.

4.3.2. Scenario 2

In this scenario, the video sequence contains scenes captured from a roadside/street environment. It captures several objects such as a person, car, and bike. In the video, the car is considered the OoI. The video has a duration of 3.06 minutes with a resolution of 1280 × 738.

Figure 22 shows frame-level comparisons of the proposed method with the manual method. It reveals that the frames captured by both methods are the same in number, and no frame is missed by the proposed method.

The confusion matrix for the car as an OoI is shown in Table 18. It shows that the proposed method has successfully detected all of the 120 frames containing the car.

4.3.3. Scenario 3

In this scenario, the video sequence contains scenes captured from a parking environment. It comprises several objects such as person, car, and tree. In this video, the person is taken as the OoI. The video has a duration of 5.16 minutes, and its resolution is 854 × 480.

Figure 23 shows the frame-level comparisons of the proposed method with the manual method. It reflects that the first four frames are captured in both methods, while the fifth frame is wrongly predicted by the proposed method.

The confusion matrix for the person as the OoI is shown in Table 19. It shows that, out of 779 frames containing the person, 718 frames are detected by the proposed method, while the remaining 61 frames are wrongly classified as frames without the person.

4.3.4. Scenario 4

In this scenario, the video sequence contains scenes showing precautions related to mobile snatching, and the mobile phone is taken as the OoI. The video has a duration of 2.49 minutes, and its resolution is 640 × 360. It comprises several objects such as person, mobile phone, and bike.

Figure 24 shows frame-level comparisons of the proposed method with the manual method. It reflects that the first six frames are captured by both methods, while the seventh and eighth frames shown in Figure 24(a) are missed by the proposed method. The reason is that the mobile phone in those frames is tiny and can only be seen with the naked eye (manual method). Similarly, the proposed method wrongly predicts the seventh frame in Figure 24(b).

The confusion matrix for the mobile phone as the OoI is shown in Table 20. It shows that, out of 992 frames containing the mobile phone, 660 frames are detected by the proposed method, while 262 frames are missed and 3 frames are falsely detected.

4.3.5. Scenario 5

In this scenario, the video sequence contains bike snatching scenes captured from the roadside, and the bike is considered the OoI. The video comprises several objects such as a person, bike, and trees. Its length is 30 seconds, and its resolution is 640 × 360.

Figure 25 presents a frame-level comparison of the proposed method with the manual method. It shows that all the frames captured by both methods are properly detected, so there is no incorrectly detected or missing frame.

Table 21 shows the confusion matrix for the bike as an OoI. It shows that, out of 300 frames containing the bike, all frames are detected by the proposed method. None of the frames is incorrectly detected or missed by the proposed method.

4.3.6. Scenario 6

In this scenario, the video sequence contains gun testing scenes. The video has a duration of 3.03 minutes, and its resolution is 640 × 360. It comprises several objects such as a person, umbrella, and pistol. In this video, the pistol is considered an OoI.

Figure 26 presents a frame-level comparison of the proposed method with the manual method. It shows that all the frames captured by both methods are properly detected; therefore, there is no incorrectly detected or missing frame.

Table 22 shows the confusion matrix for pistol as an OoI. It shows that, out of 9000 frames containing the pistol, all of the frames are detected using the proposed method. None of the frames is incorrectly detected or missed by the proposed method.

4.3.7. Results Summary

Table 23 presents the experimental results of the proposed framework. In experimental analysis, several scenarios have been taken from different scenes or locations such as airport, parking, and street containing various objects such as a person, bike, and airplane.

In the best cases, such as the street video, airport, and bike snatching scenarios, the recall and precision of the proposed method are 100%, which shows that the proposed framework identifies the object precisely and generates a complete video summary. In the worst cases, such as mobile snatching, the recall is lower because the object in the video frames is tiny and can be seen only with the naked eye; consequently, the proposed method is unable to detect such objects. The overall accuracy of the proposed framework is 99.33%, and the total saved time is 87.86%.

4.4. Comparative Analysis

This section presents a comparative analysis of the proposed framework with the existing VS techniques. The comparative analysis is based on the following fundamental features:
(i) F1: customized object type (OoI)
(ii) F2: frame extraction based on object
(iii) F3: object detection accuracy
(iv) F4: summarization rate

Table 24 shows that most existing techniques perform general object detection rather than focusing on a specific object (i.e., they do not consider an object as an Object of Interest). Similarly, many techniques perform frame extraction by redundant frame elimination and scene elimination instead of focusing on the objects. The analysis shows that the proposed framework is unique and contains the most relevant features for VS. The uniqueness of the proposed framework is that it summarizes the video based on the Object of Interest supplied to the system as input. Added advantages of the proposed VS framework are its simplicity and ease of understanding, with accuracies of 99.6%, 99.9%, and 99.3% and summarization rates of 82.8%, 78.8%, and 91.7% on the VSUMM, TVSum, and own datasets, respectively, which increases its efficiency compared to other methods.

To further evaluate the performance of our proposed framework for VS, a comparative analysis between the proposed framework and other state-of-the-art VS techniques is performed as given in Table 25.

5. Conclusion

This paper presents an effective VS framework that summarizes a video based on the OoI. The proposed framework is effective and performs much faster than other state-of-the-art methods for summarizing videos. The OoI-based approach makes it more reliable and flexible in generating a relevant video summary, and YOLOv3 enables the proposed framework to detect various objects efficiently and precisely. To validate the proposed framework, extensive experiments are performed on three different datasets: the VSUMM dataset, the TVSum dataset, and our own dataset. The proposed VS framework achieved an accuracy of 99.6% with high processing speed on the VSUMM dataset, saving 82.8% of the time that would otherwise be spent watching the full video to detect the OoI. Similarly, an accuracy of 99.9% with a summarization rate of 78.8% is achieved on the TVSum dataset. The accuracy on the own dataset is 99.3%, and the overall saved time is 87.86%. A desktop application is also developed that provides ease of use and customized object selection. In the future, this work can be extended by enriching the dictionary and training the model for more OoI. It can also be deployed in a real-time environment to record summarized videos for various types of crime scenes.

Data Availability

The data used to support the study’s findings are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.