With the rapid development of digital information technology and the spread of Internet applications, the volume of video data is growing explosively. Applications with strict real-time requirements, such as object detection, need strong online video storage and analysis capabilities. Key-frame extraction is an important technique in video analysis: it provides an organizational framework for dealing with video content and reduces the amount of data required in video indexing. To meet these requirements, this study proposes a key-frame extraction method based on the HSV (hue, saturation, value) histogram and adaptive clustering. The HSV histogram is used as the color feature of each frame, which reduces the amount of data. Furthermore, because each frame is transformed into a one-dimensional feature vector, a fixed number of features is extracted from images of different sizes. Then, a cluster validation technique, the silhouette coefficient, is employed to obtain an appropriate number of clusters without manually setting any clustering parameters. Finally, several algorithms are compared experimentally. The density peak clustering algorithm (DPCA) model is shown to be more effective than the other four models in precision and F-measure.

1. Introduction

Advances in digital storage, content distribution, and digital video recorders have made recording digital content easy [1]. Handling such a volume of content is a challenge for real-time applications such as video surveillance, education, video lectures, and sports highlights [2]. Users do not always have enough time to watch an entire video, and not all of its content is of interest or importance to them [3]. Among all media types (text, image, graphic, audio, and video), video is the most expressive because it combines the information of all the others. However, video processing is relatively time-consuming because video data are large and unstructured. Key frames provide a suitable abstraction and framework for video indexing, browsing, and retrieval. Their use greatly reduces the quantity of data required in video browsing and provides an organizational framework for dealing with video content [4]. Key-frame extraction has been recognized as one of the important research issues in video information retrieval [5].

Clustering is a powerful technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, data compression, and computer graphics. Traditional clustering algorithms, such as K-means, require prior knowledge to determine their initial parameters. Most of these parameters must be specified manually, and finding their optimal values is difficult. Much research has been devoted to extracting video key frames in recent years, but the existing approaches have high computational complexity and do not capture the main visual content effectively.

In this study, a cluster validation technique, the silhouette coefficient [6], is employed to obtain the optimal cluster number. Setting video coding aside, a video key frame is a relatively subjective concept, and so far there is no unified criterion for evaluating the quality of key frames. Extracting key frames with an unsupervised clustering algorithm makes good use of the characteristics of the video content, because a clustering algorithm automatically groups video frames according to their similarity. Three common clustering algorithms are used in the experiments.

The main contributions of this paper are as follows: (1) The HSV histogram is used to transform high-dimensional abstract video image data into quantifiable low-dimensional data, which reduces the computational complexity while capturing image features. (2) The silhouette coefficient (SC) index (discussed in Section 2.2) is used to find the best k-value a priori, which reduces computation time. (3) Key-frame extraction is converted into a clustering problem, in which each cluster centroid, or the frame nearest to it, is declared a key frame. (4) The density peak clustering algorithm (DPCA) (discussed in Section 3.2.3) is proposed to extract key frames with a better F-measure. Moreover, it only requires computing the distances between all pairs of data points and needs no prior parameters other than the optimal cluster number k.

The rest of this paper is organized as follows. Section 2 provides a brief survey of the related work. In Section 3, the proposed key-frame extraction algorithms are described. The experimental results and analysis are given in Section 4. Section 5 concludes this paper.

2. Related Work

Lu et al. [7] divide existing research on video summaries into key-frame-based and video-skim-based approaches. A video summary falls into one of two categories: a sequence of still images, called a storyboard, or a sequence of moving images, called a skim [8]. The video storyboard is defined as a group of stationary key frames that summarizes the important video content with minimal data [9]. This class of video summaries has been well explored using numerous clustering algorithms, in which different clusters are formed on the basis of the similarity between frames [10–13]. Video skimming, on the other hand, retains the important information without losing the semantics of the video sequences [14]. Generally speaking, the video storyboard mainly analyzes the visual content rather than the audio information. Its construction and presentation are relatively simple, and it can be organized flexibly for browsing and indexing. The dynamic video summary takes the whole multimedia information flow into account. It usually contains rich audio, action, and even text information, which can express the content of the original video more clearly. Therefore, the dynamic video summary is more entertaining to watch, but it is difficult to realize [15].

Data clustering is an unsupervised pattern classification method that has been widely used in video data analysis in recent years. It clusters video data streams according to the principle of minimizing the similarity between clusters and maximizing the similarity within clusters. Cluster centers are then selected as class representatives to eliminate redundancy. Jain et al. [16] classified clustering algorithms into two categories: partitional and hierarchical. The former divides the data once to determine all classes, while the latter classifies recursively in an agglomerative or divisive way. Amiri and Fathy [17] use an improved K-means algorithm to cluster shot-level key frames. Compared with the traditional K-means, their algorithm obtains the number of clusters adaptively. Kumar et al. [8] propose a key-frame extraction technique to summarize video lectures so that a reader can get the critical information in real time. Singh et al. [9] use the k-medoids algorithm to extract key frames and implement the Calinski–Harabasz- (CH-) based cluster validation technique to get the optimal cluster set. Similarly, Kumar et al. [18] employ the Davies–Bouldin index to choose the desired number of key frames without incurring additional computational costs. Other common clustering methods include fuzzy clustering, spectral clustering, and self-organizing maps [19]. However, no single algorithm suits all data types and application backgrounds. Therefore, in practical applications, the clustering algorithm should be selected according to the characteristics of the data.

2.1. Interframe Distance Metric

The distance between adjacent frames is defined as the difference of their visual content, which can combine color, texture, shape, or other features [20]. First, we extract a total of N frames of size $W \times H$ from a test video, where $W$ and $H$ represent the width and height of a frame, respectively. The content of adjacent frames does not change much, but the data of the three RGB channels need to be calculated separately [14]. To reduce the computational burden, an improved HSV histogram method [21] is used to reduce the dimensionality of the data.

All N color frames of a video are converted from the RGB color space into the HSV color space. Then, considering the resolution of human vision, the hue component is divided into 8 parts, and the saturation and value components are each divided into 3 equal parts. The resulting quantization levels $H$, $S$, and $V$ are defined by equation (1).

As in the RGB color space, any pixel is represented by three components $h$, $s$, and $v$, where $h \in [0, 360]$, $s \in [0, 1]$, and $v \in [0, 1]$, and every pixel is quantized in the HSV color space using equation (1). The sensitivity of human eyes to the H component is greater than that to the S component, and the sensitivity to the S component is greater than that to the V component. Finally, the three color components are merged into a one-dimensional feature value, as shown in the following equation:

$$F = H Q_S Q_V + S Q_V + V,$$

where $Q_S$ and $Q_V$ are the quantization parameters of S and V, respectively, and both of them are set to 3. Therefore, $F = 9H + 3S + V \in \{0, 1, \ldots, 71\}$, and each frame is converted into a one-dimensional vector with 72 attributes, which is independent of the size of the video frame. The HSV histogram is depicted in Figure 1, in which the horizontal axis represents the 72 attributes of the one-dimensional feature vector F, and the vertical axis represents the number of pixels of the image that fall on each value of F.
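
The construction of the 72-attribute feature vector can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bin boundaries of equation (1) are not reproduced here, so the sketch assumes uniform bins (8 for hue, 3 each for saturation and value), and the names `quantize` and `hsv_histogram` are hypothetical.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 8, 3, 3   # 8 * 3 * 3 = 72 histogram attributes
Q_S, Q_V = 3, 3                    # quantization parameters of S and V


def quantize(value, bins):
    """Map a value in [0, 1] to a level in 0..bins-1 (uniform bins, an assumption)."""
    return min(int(value * bins), bins - 1)


def hsv_histogram(pixels):
    """Return the 72-bin histogram of F = H*Q_S*Q_V + S*Q_V + V.

    `pixels` is an iterable of (r, g, b) tuples with components in 0..255.
    """
    hist = [0] * (H_BINS * S_BINS * V_BINS)
    for r, g, b in pixels:
        # colorsys returns h, s, v all scaled to [0, 1].
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        level_h = quantize(h, H_BINS)
        level_s = quantize(s, S_BINS)
        level_v = quantize(v, V_BINS)
        hist[level_h * Q_S * Q_V + level_s * Q_V + level_v] += 1
    return hist
```

Because the histogram always has 72 entries, frames of any resolution yield feature vectors of the same length; dividing each entry by the pixel count would additionally make frames of different sizes directly comparable.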

The interframe difference approach [3] is used to estimate the change between two consecutive frames. $d(f_i, f_j)$ represents the difference between any two frames $f_i$ and $f_j$ using the Euclidean distance:

$$d(f_i, f_j) = \sqrt{\sum_{k=1}^{n} \left(f_{ik} - f_{jk}\right)^2}.$$

Each frame $f_i$ is represented by a one-dimensional vector with n attribute values, where $f_{ik}$ is the kth element of $f_i$.
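
As a small illustration of this metric, the sketch below computes the Euclidean distance between two feature vectors and the sequence of distances between consecutive frames. The function names are hypothetical.

```python
import math


def interframe_distance(f_i, f_j):
    """Euclidean distance between two frame feature vectors of equal length."""
    if len(f_i) != len(f_j):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def consecutive_distances(features):
    """Distances between each frame and the next one in the sequence."""
    return [
        interframe_distance(features[t], features[t + 1])
        for t in range(len(features) - 1)
    ]
```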

2.2. Optimal Cluster Number

Obtaining the optimal cluster number is a challenge. Determining the number of key frames a priori is more suitable for the abstraction process than determining it a posteriori or fixing it in advance [22]. To address this problem, an internal cluster evaluation method, the silhouette coefficient (SC), is used to validate the clustering. The SC value measures how similar an object is to its own cluster (cohesion) compared with other clusters (separation), and it ranges from −1 to +1. A high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters [23]. During the experiment, the silhouette coefficient is calculated for a series of candidate k-values, and the optimal cluster number is usually the one at which the SC index reaches its maximum. The SC value of each sample point $x_i$ is defined in the following equation:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $a(i)$ denotes the average distance between $x_i$ and all other data within the same cluster, and $b(i)$ indicates the smallest average distance from $x_i$ to all sample points in any other cluster, of which $x_i$ is not a member. The average of $s(i)$ over all the points of the entire dataset is the SC index, which reflects how appropriately the data have been clustered.
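
The following sketch shows one way to compute the SC index and to pick k from a range of candidates. It is an illustration under stated assumptions: points in singleton clusters are given a silhouette value of 0 (a common convention that the text does not specify), and `cluster_fn` is a hypothetical caller-supplied function that returns a list of cluster labels for a given k.

```python
import math


def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels):
    """Mean silhouette value; labels[i] is the cluster id of points[i]."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention (assumption)
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def best_k(points, candidate_ks, cluster_fn):
    """Return the candidate k whose clustering maximizes the SC index."""
    return max(
        candidate_ks,
        key=lambda k: silhouette_coefficient(points, cluster_fn(points, k)),
    )
```

In this form the clustering algorithm is run once per candidate k, so the cost of choosing k grows with the width of the candidate range.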

3. Key-Frame Extraction Algorithm

3.1. Key-Frame Extraction Algorithm in Compressed Domain

From the perspective of video coding, an I-frame [24] is a complete image and the first frame of each GOP (group of pictures, the frame grouping used in MPEG video compression). It is moderately compressed and serves as a reference point for random access. I-frames do not require other video frames to decode, because they are encoded without reference to any frame other than themselves [25]. Therefore, we apply the I-frame method to extract key frames. The I-frame method directly uses characteristics of the compressed video data to analyze and process the video. A simple and feasible method of key-frame extraction in the compressed domain is to extract the I-frames with FFmpeg [26] and then take the extracted I-frames as the key frames.
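
The exact FFmpeg invocation is not given in the text. One common way to keep only I-frames is the `select` filter with the expression `eq(pict_type,I)`, sketched below as a Python wrapper. The function names and the output pattern are hypothetical, and the `-vsync vfr` option is deprecated in recent FFmpeg releases in favor of `-fps_mode vfr`, so the flag may need adjusting for the installed version.

```python
import subprocess


def iframe_command(video_path, out_pattern="iframe_%04d.png"):
    """Build an FFmpeg argument list that writes every I-frame as an image."""
    return [
        "ffmpeg",
        "-i", video_path,
        # Keep only frames whose picture type is I.
        "-vf", "select='eq(pict_type,I)'",
        # Do not duplicate frames to fill the gaps left by dropped frames.
        "-vsync", "vfr",
        out_pattern,
    ]


def extract_iframes(video_path, out_pattern="iframe_%04d.png"):
    """Run FFmpeg; requires the ffmpeg executable to be on the PATH."""
    subprocess.run(iframe_command(video_path, out_pattern), check=True)
```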

3.2. Improved Key-Frame Extraction Algorithm in Uncompressed Domain

Key-frame extraction in the uncompressed domain requires the video to be decompressed, which takes a certain amount of time. In return, the key frames are extracted mainly on the basis of the video content, so the characteristics of the video itself can be fully used.

When a clustering algorithm is applied to extract key frames, frames with high similarity are clustered into one class, and the cluster center is regarded as a key frame of the video. Classic clustering algorithms fall mainly into the following three categories.

3.2.1. Partition-Based Clustering Algorithm

The partition-based method generally adopts a mutually exclusive partition of the dataset, which means that every object belongs to exactly one cluster. Most partition-based methods are based on distance. Among them, the most classic is K-means [27], whose goal is to divide the given data points into k clusters by minimizing the distance between the data points and the selected cluster centers. Each data point is assigned to the cluster whose center is closest to it, and each cluster center is then recalculated from the data points currently in the cluster. A cluster center together with the data points assigned to it represents a cluster. The process of the improved K-means clustering algorithm can be summarized as Algorithm 1.

Algorithm 1: Key-frame extraction with improved K-means.
(1) Initialization: convert each video frame into a one-dimensional vector by HSV histogram
(2) Optimal cluster number: compute the SC index over the candidate values of k and take the k that maximizes it
(3) Perform the K-means clustering algorithm for the given k-value
(4) For each of the k clusters, regard the frame at the centroid, or the frame nearest to it, as the key frame

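
Steps (3) and (4) of the algorithm above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the initial centers are k frames evenly spaced in time (chosen here only to make the sketch deterministic), the value of k is assumed to have been selected beforehand with the SC index, and the function name is hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans_key_frames(features, k, iterations=100):
    """Cluster frame features with K-means; return (key-frame indices, labels)."""
    n = len(features)
    # Deterministic initialization (an assumption): k frames evenly spaced in time.
    centers = [list(features[(2 * c + 1) * n // (2 * k)]) for c in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each frame joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda c: dist(f, centers[c])) for f in features
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are stable too
        labels = new_labels
        # Update step: each center becomes the mean of its member frames.
        for c in range(k):
            members = [features[i] for i in range(n) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    key_frames = []
    for c in range(k):
        members = [i for i in range(n) if labels[i] == c]
        if members:
            # The real frame nearest to the centroid represents the cluster.
            key_frames.append(
                min(members, key=lambda i: dist(features[i], centers[c]))
            )
    return sorted(key_frames), labels
```

Selecting the nearest real frame matters because a centroid is an average of histograms and generally does not correspond to any frame in the video.
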
3.2.2. Hierarchical Clustering Algorithm

Hierarchical clustering (also known as hierarchical cluster analysis or tree clustering) is a cluster analysis method that seeks to build a hierarchy of clusters. The algorithm does not require a prespecified number of clusters. Strategies for hierarchical clustering are generally classified into two categories: AGNES (Agglomerative Nesting) and DIANA (Divisive Analysis). The agglomerative method (also called the bottom-up method) starts with each object as a separate cluster and then merges the nearest clusters until all samples are merged into the same cluster. The divisive method is also known as the top-down approach. Initially, all samples are in the same cluster, and the largest cluster is split repeatedly until each object is separated [28]. Video data are highly abstract and complex, so the divisive form of hierarchical clustering is not feasible. An improved agglomerative hierarchical clustering algorithm (AGNES) is therefore considered, which is implemented in Algorithm 2.

Algorithm 2: Key-frame extraction with improved AGNES.
(1) Initialization: convert each video frame into a one-dimensional vector and put it into its own initial cluster
(2) Optimal cluster number: compute the SC index over the candidate values of k and take the k that maximizes it
(3) Repeat
(4) Calculate the distance between any two adjacent clusters
(5) Merge the two nearest clusters to generate a new cluster
(6) Until the expected k clusters are generated or another termination condition is satisfied

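
The repeat-until loop of the algorithm above can be sketched as follows. This is a simplified illustration: it compares every pair of clusters rather than only adjacent ones, uses average linkage as the cluster distance (one of the criteria discussed in Section 4.3), and stops when k clusters remain. The function names are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(features, a, b):
    """Mean pairwise distance between the frames of clusters a and b."""
    total = sum(dist(features[i], features[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agnes(features, k):
    """Merge the two nearest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(features))]  # one cluster per frame
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(features, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]
```

The naive pair search makes this sketch cubic in the number of frames, so it is only suitable for short clips or subsampled frame sequences.
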
3.2.3. Density-Based Clustering Algorithm

Unlike the previous two clustering methods, the density-based clustering algorithm defines clusters as areas of higher density than the rest of the dataset. Clusters consist of all density-connected objects and all objects that are within the range of these objects. Objects in sparse areas separate the clusters and are usually considered noise or boundary points. Density-based spatial clustering of applications with noise (DBSCAN) easily detects clusters of arbitrary shape. However, DBSCAN requires a density threshold: it discards points in regions whose density is lower than the threshold as noise and assigns disconnected regions of higher density to different clusters. The threshold directly affects the results of the algorithm, and choosing an appropriate one is a difficult task.

The density peak clustering algorithm (DPCA) (clustering by fast search and find of density peaks) [29] is a new density-based clustering algorithm, which can find clusters with different densities through a visualization method, quickly find the density peak points (i.e., cluster centers) of a dataset, and efficiently assign sample points and eliminate outliers [30]. It requires that each cluster has a point of maximum density as its center, that each cluster center attracts and connects the lower-density points around it, and that different cluster centers are relatively far apart [31]. That is, the density peak clustering algorithm is based on two assumptions: (1) the density of a cluster center is greater than that of its neighbors within the same cluster, and (2) a cluster center is at a relatively large distance from any point of higher density. Therefore, two main quantities need to be calculated for each point: the local density $\rho_i$ and the distance from points of higher density $\delta_i$. The local density is defined as follows:

$$\rho_i = \sum_{j \neq i} \chi\left(d_{ij} - d_c\right),$$

where $\rho_i$ denotes the local density of each datum $x_i$, $d_{ij}$ represents the interframe distance between the sample points $x_i$ and $x_j$, and $d_c$ indicates the cutoff distance, which is chosen within the range of the empirical parameter recommended in [29]. The method is robust with respect to changes in the metric that do not significantly affect the distances below $d_c$.

$\chi$ is an indicator function, which is defined as follows:

$$\chi(x) = \begin{cases} 1, & x < 0, \\ 0, & \text{otherwise}. \end{cases}$$

The distances between all pairs of points in the dataset U are calculated and sorted in ascending order. Then, $d_c$ takes the value at position t of this ascending sequence, where t is given as a percentage of the sequence length. $\rho_i$ therefore tells how many points are within the distance $d_c$ of $x_i$. Next, the distance from points of higher density is defined as follows:

$$\delta_i = \begin{cases} \max_{j} d_{ij}, & \text{if } \rho_i \text{ is the global maximum}, \\ \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{otherwise}. \end{cases}$$

When $\rho_i$ is the global maximum, $\delta_i$ is the maximum distance between $x_i$ and any other point $x_j$. Otherwise, $\delta_i$ is the minimum distance between $x_i$ and any other sample $x_j$ whose local density is greater than that of $x_i$. DPCA thus looks for data objects with high local density and large relative distance to serve as cluster centers. These cluster centers attract and connect the low-density points around them, and they are relatively far away from each other. From the calculated $\rho_i$ and $\delta_i$, a two-dimensional decision graph is generated that plots $\delta_i$ as a function of $\rho_i$ for each point, where the horizontal axis represents $\rho$ and the vertical axis represents $\delta$. Data points in the upper-right corner of the decision graph can represent different cluster centers because of their high local density and relatively large distance from other clusters. The process of the improved density peak clustering algorithm (DPCA) is shown as Algorithm 3.

Algorithm 3: Key-frame extraction with improved DPCA.
(1) Initialization: convert each video frame into a one-dimensional vector by HSV histogram
(2) Optimal cluster number: compute the SC index over the candidate values of k and take the k that maximizes it
(3) Calculate the distance between any two points and sort the distances in ascending order
(4) Take the value at t = 2% of the ascending sequence as the cutoff distance $d_c$
(5) Calculate $\rho_i$ and $\delta_i$ for each frame and generate the decision graph
(6) Using the decision graph, select the k cluster centers in descending order of $\gamma_i = \rho_i \delta_i$
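
Steps (3) to (6) of the algorithm above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: ties in local density are broken by frame order, so that only one point is treated as the global density peak, and the k centers are taken as the points with the largest product of local density and distance, without inspecting the decision graph by eye. The function name is hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dpca_key_frames(features, k, t=0.02):
    """Return (indices of the k cluster centers, rho, delta)."""
    n = len(features)
    d = [[dist(features[i], features[j]) for j in range(n)] for i in range(n)]
    # Steps (3)-(4): cutoff distance at position t of the ascending sequence.
    pair_d = sorted(d[i][j] for i in range(n) for j in range(i + 1, n))
    d_c = pair_d[min(len(pair_d) - 1, int(t * len(pair_d)))]
    # Step (5): local density = number of other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and d[i][j] < d_c) for i in range(n)]
    # Visit points by decreasing density; the stable sort breaks ties by index.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    delta[order[0]] = max(d[order[0]])  # density peak: largest distance
    for pos in range(1, n):
        i = order[pos]
        # Nearest point among those ranked as denser than point i.
        delta[i] = min(d[i][j] for j in order[:pos])
    # Step (6): rank by gamma = rho * delta and keep the top k as centers.
    gamma = [rho[i] * delta[i] for i in range(n)]
    centers = sorted(range(n), key=lambda i: gamma[i], reverse=True)[:k]
    return sorted(centers), rho, delta
```

For very short sequences the default t = 2% can select the smallest pairwise distance as the cutoff, which leaves every local density at zero, so a larger t is needed when experimenting on a handful of frames.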

4. Experimental Result

In this section, two test datasets are used to verify the proposed algorithm. The datasets are formed according to the optimal cluster number k, and each contains 50 videos from the Open Video Project (OVP) [32]. Each video lasts two minutes on average. To obtain the optimal cluster number k, the SC validation technique is applied to two test videos with k ranging from 3 to 15 and from 3 to 20, respectively. The maximum of the SC index identifies the optimal k-value. According to the experimental results, the SC index reaches its maximum at k = 6 and k = 17 for the two datasets, as shown in Figures 2 and 3, respectively. The first dataset contains 50 videos whose optimal cluster number k is 6, and the second dataset contains 50 videos whose optimal cluster number k is 17.

4.1. Key-Frame Extraction Algorithm in Uncompressed Domain

Here, several experiments are discussed to verify the efficiency and performance of the clustering-based key-frame extraction algorithms.

The official OVP website provides a storyboard for each video. These storyboards are selected by experts and are used as the ground truth. Several key-frame extraction algorithms have been proposed in recent years, but there is no unified criterion for assessing the various models. A relatively mature approach assesses the performance of each algorithm with three metrics: precision, recall, and F-measure [33].

There are four definitions based on the ground truth and the results of the algorithm:
(i) True positive (TP): a frame that belongs to both the ground truth and the output of the algorithm
(ii) False positive (FP): a frame that is selected by the algorithm but does not belong to the ground truth
(iii) True negative (TN): a frame that is neither selected by the algorithm nor a key frame in the ground truth
(iv) False negative (FN): a frame that is not selected by the algorithm but belongs to the ground truth

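The precision, recall, and F-measure are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where Precision indicates the ability to remove useless frames, Recall represents the ability to keep important information, and F-measure is the harmonic mean of precision and recall. Overall, a higher F-measure indicates a more accurate algorithm.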
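
As a worked illustration of these metrics, the sketch below evaluates a set of extracted key frames against a ground-truth storyboard. It assumes that frames are matched by exact index, which is a simplification: the text does not state how an extracted frame is matched to a ground-truth frame. The function name is hypothetical.

```python
def evaluate(selected, ground_truth):
    """Return (precision, recall, f_measure) for two collections of frame indices."""
    selected, ground_truth = set(selected), set(ground_truth)
    tp = len(selected & ground_truth)   # selected and in the ground truth
    fp = len(selected - ground_truth)   # selected but not in the ground truth
    fn = len(ground_truth - selected)   # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, selecting frames {1, 2, 3, 4} against a ground truth of {2, 3, 5} gives TP = 2, FP = 2, and FN = 1, so the precision is 1/2, the recall is 2/3, and the F-measure is 4/7.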

4.2. Experiment Results of K-Means

The computation process of the K-means algorithm has been discussed in Section 3.2.1. The optimal k-values of the two test videos are determined to be 6 and 17, respectively. Comparisons between the key frames extracted by the K-means algorithm and the ground truth of the two test videos are shown in Figures 4 and 5.

4.3. Experiment Results of AGNES

The algorithm flow of AGNES has been discussed in Section 3.2.2. Unlike K-means, AGNES provides four methods to calculate the distance between two clusters. This choice is called the linkage criterion, and it specifies the distance to be used between sets of observations: (1) ward-linkage minimizes the variance of the clusters being merged; (2) average-linkage uses the average of the distances between all observations of the two sets; (3) complete-linkage uses the maximum of the distances between all observations of the two sets; (4) single-linkage uses the minimum of the distances between all observations of the two sets. Different linkage criteria lead to different experimental results. The results for ward-linkage are shown in Figures 6 and 7 as an example. In Figure 7, 21 key frames are extracted from test video 2. The number of extracted key frames is allowed to exceed the k-value of 17 because several clusters contain more than one discontinuous frame sequence.
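
The four linkage criteria can be written compactly as distance functions between two clusters, each given as a list of feature vectors. This is a sketch with hypothetical function names; the ward criterion is expressed here as the increase in the within-cluster sum of squares caused by a merge, which is the standard formulation and is assumed to match the one used in the experiments.

```python
import math


def dist(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(a, b):
    """Minimum distance between observations of the two clusters."""
    return min(dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Maximum distance between observations of the two clusters."""
    return max(dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Average distance between observations of the two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def ward_linkage(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    gap = dist(centroid(a), centroid(b))
    return len(a) * len(b) / (len(a) + len(b)) * gap ** 2
```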

4.4. Experiment Results of DPCA

A higher t-value leads to a larger cutoff distance $d_c$, which makes $\rho_i$ larger. In the experiments, t = 2% is adopted. As discussed, only the points with high $\delta_i$ and relatively high $\rho_i$ are likely to be cluster centers. A hint for choosing the cluster centers is provided by the plot of $\gamma_i = \rho_i \delta_i$ sorted in decreasing order: the data points with the highest $\gamma_i$ values are the most likely to become cluster centers. The decision graph of the first test video is depicted in Figure 8, where points 35, 390, 450, 540, 943, and 1480 have the six largest $\gamma_i$ values and can therefore be considered cluster centers. In other words, f35, f390, f450, f540, f943, and f1480 are the six key frames extracted by the DPCA model. Comparisons between the key frames extracted by the DPCA model and the ground truth of the two test videos are shown in Figures 9 and 10.

4.5. Experiment Results of I-Frame Method

The comparison between the average number of key frames extracted by the I-frame method and the true values is shown in Table 1. Clearly, extracting key frames with the I-frame method is not effective. The I-frame method does not consider the video content and extracts key frames only from the perspective of video coding, so redundant key frames appear. To further confirm this conclusion, we add a comparative experiment. The test video is a 20-second white-screen video, from which we extract key frames with the I-frame method. The experimental results are shown in Table 2. A large number of redundant key frames are obtained, although in theory the white-screen video should have only one key frame. The shortcoming of the method is that it uses little information about the video content during key-frame extraction, which leads to poor key-frame quality.

4.6. Experimental Comparison

The qualitative analysis of the above experimental results shows that the key frames extracted by the DPCA model closely match the ground truth. Tables 3 and 4 show the quantitative evaluation of the models on the two test datasets, with the best results shown in bold. Based on these results, we conclude that two of the key-frame evaluation indexes, the precision and the F-measure, are higher for the DPCA model than for the other four models in this paper. Therefore, the DPCA model achieves better performance than the other models.

5. Conclusion

Cluster analysis has shown promising prospects owing to its successful application in many fields. In addition to the classical clustering algorithms, new clustering algorithms are continuously being put forward. In practical applications, each clustering algorithm has its own appropriate scenarios, and the performance of the same algorithm varies greatly across datasets. The DPCA algorithm only requires computing the distances between all pairs of data points and needs no prior parameters other than the optimal cluster number. Its two key quantities are of great practical significance. A frame with a higher $\rho$ describes a scene or event that accounts for a larger proportion of the whole shot or video. Moreover, a higher $\delta$ means that the distance between different key frames is large enough to eliminate redundant key frames. The experimental results show that the DPCA model has the best performance among the key-frame extraction algorithms considered in this paper. Applied to the environment perception module of autonomous vehicles, the improved key-frame extraction algorithm can effectively capture the salient visual content of vehicle-mounted cameras. Because the environment perception module can rely on a few key frames and does not need every single frame to detect its surroundings, the algorithm is well suited to applications that require real-time content analysis, for example, object detection.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This research work was supported by the National Science Foundation of China under Grant nos. 51668043 and 61262016, the CERNET Innovation Project under Grant nos. NGII20160311 and NGII20160112, and the Gansu Science Foundation of China under Grant no. 18JR3RA156.