Abstract

Medical data security is an important guarantee for intelligent medical systems. Medical video data can help doctors understand a patient's condition. Medical video retargeting can greatly reduce the storage requirements of the data while preserving the original content information as much as possible. The smaller volume of medical data can reduce the execution time of data encryption and threat detection algorithms and improve the performance of medical data security methods. Existing methods mainly focus on the temporal pixel relationship and foreground motion between adjacent frames, but they ignore the user's attention to the video content and the impact of background movement on retargeting, resulting in serious deformation of important content and areas. To solve these problems, this paper proposes an innovative video retargeting method based on visual attention and motion estimation. Firstly, the visual attention map is obtained from eye tracking data by the K-means clustering method and a Euclidean distance factor equation. Secondly, the motion estimation map is generated from both the foreground and background displacements, which are calculated from the feature points and salient object positions between adjacent frames. Then, the visual attention map, the motion estimation map, and the gradient map are fused into the importance map. Finally, video retargeting is performed by mesh deformation based on the importance map. Experiments on open datasets show that the proposed method can protect important areas and better suppresses salient object jitter.

1. Introduction

With the rapid development of high-tech medical imaging [1–3], blockchain technology [4], artificial intelligence [5], Internet of Things (IoT) [6], and 5G networks [7], intelligent medical systems [8] and intelligent diagnosis [9] are becoming more and more popular. However, data security threats [10] make protecting the security of medical data an urgent problem. The volume of medical video data is larger than that of typical data, which makes the execution of medical data security methods, such as data encryption [11] and integrity detection [12], time-consuming. Video retargeting [13] can greatly reduce the storage capacity of video data while preserving the original content information as much as possible. Medical video retargeting yields a smaller volume of medical data, thereby reducing the execution time of data encryption and threat detection algorithms and improving the performance of medical data security methods.

Traditional image and video retargeting methods, mainly uniform scaling and direct cropping, only consider the original size and the target size of images, without considering image content, so their results are often unsatisfactory. To improve image and video retargeting performance, researchers proposed content-aware retargeting techniques, which are mainly classified into three types: discrete retargeting [14, 15], continuous retargeting [16–19], and multi-operator retargeting [20–23].

Video retargeting has one more dimension, time, than image retargeting, so it must take into account the correlation between the contents of adjacent frames. Regarding video as a three-dimensional space-time pixel matrix, Rubinstein et al. proposed FSC [15], which finds and deletes pixel seams common to adjacent frames to eliminate content jitter. NCV [17], proposed by Wolf et al., combines the gradient map, face detection, and foreground motion to produce an importance map and then uses mesh deformation to realize video retargeting. Nam et al. [24] proposed a video retargeting method based on Kalman filtering and saliency fusion to reduce video content jitter and thus enhance the robustness of video retargeting. Wang et al. [25] proposed a multi-operator method based on improved seam carving to realize video retargeting. Cho and Kang [26] proposed an interpolation-based video retargeting method built on an image deformation vector network, which performs interpolation with displacement vectors generated by a convolutional neural network. Kaur et al. [27] proposed a spatiotemporal seam carving video retargeting method based on Kalman filtering.

The existing video retargeting methods mainly focus on the pixel relationship and foreground motion between adjacent frames. These methods aim to preserve the shape of important content in the process of retargeting. However, they consider neither the attention of users to the video content nor the impact of background movement on retargeting, resulting in serious deformation of the important content or poor quality of the retargeting results. Furthermore, the human visual system can quickly find the required information in a visual scene and direct visual attention to the focus of the scene [28]. Consequently, besides moving objects and important targets, the attention focus also includes the areas where change is about to happen at the next moment, such as the place where the sun will rise before sunrise, the place where actors will appear on the stage before the performance, and the direction in which a ball is moving.

This paper makes full use of the user's eye tracking data and the motion information of both the background and foreground in the video and proposes a video retargeting method based on visual attention and motion estimation to reduce the deformation of important areas. Firstly, clustering is carried out on the eye tracking data to generate the visual attention energy map. Secondly, the motion estimation map is obtained from the corresponding feature points of the foreground and background between adjacent frames. Thirdly, the importance map is generated by fusing the visual attention energy map, the motion estimation energy map, and the gradient map. Finally, video retargeting is performed by mesh deformation.

The proposed method utilizes the attention attribute of the human visual system and the motion of content in the video, so the retargeting result is more in line with people's visual requirements. Experimental results on public datasets show that the proposed method outperforms the compared methods in protecting important areas and reducing salient object jitter.

2. Proposed Method

As shown in Figure 1, the framework of the proposed VAMEVR (visual attention and motion estimation-based video retargeting) method mainly includes visual attention data clustering, salience detection, SIFT feature detection, motion estimation, mesh deformation, and so on.

2.1. Visual Attention

In a video, the areas concerned by the human visual system are usually regarded as important areas. These areas should be assigned higher energy to reduce their deformation in the retargeting process. In this paper, eye tracking data are utilized as the basis of visual attention and abstracted into visual foci. Then, visual attention energy is generated according to the visual foci.

2.1.1. Visual Attention Focus

This paper uses the eye tracking data of the DAVSOD [29] dataset for demonstration. As shown in Figure 2, the eye tracking data exist in the form of discrete points. Through observation, it is found that most eye tracking data points form two clusters.

In this paper, the K-means method [30] is utilized to cluster the eye tracking data points into 2 groups, and the center of each group is taken as a visual focus. Firstly, we randomly select 2 data points as the initial cluster centroids. Secondly, we assign each data point to one of 2 mutually exclusive clusters according to its Euclidean distance to the current centroids. Thirdly, the average position of each cluster is computed as the new centroid. Steps 2 and 3 are repeated until the centroid positions no longer change.
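To make this step concrete, the sketch below clusters gaze points with OpenCV's k-means (a minimal illustration assuming the eye tracking data are available as an N×2 array of pixel coordinates; the function and variable names are ours, not from the original implementation).

```python
import numpy as np
import cv2

def visual_focus(gaze_points: np.ndarray, k: int = 2) -> np.ndarray:
    """Cluster eye tracking points into k groups and return the
    cluster centers, which serve as the visual attention foci.

    gaze_points: (N, 2) array of (x, y) gaze coordinates.
    """
    data = gaze_points.astype(np.float32)
    # Stop after 100 iterations or when centers move less than 0.1 pixel.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.1)
    # 10 random restarts; the clustering with the lowest compactness is kept.
    _, _, centers = cv2.kmeans(data, k, None, criteria, 10,
                               cv2.KMEANS_RANDOM_CENTERS)
    return centers  # (k, 2) array of visual foci

# Example: two synthetic point clouds around (100, 80) and (400, 300).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal((100, 80), 15, (50, 2)),
                 rng.normal((400, 300), 15, (50, 2))])
print(visual_focus(pts))
```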

An example of the focusing result is presented in Figure 3. Figure 3(a) shows the original frame. Figure 3(b) shows the eye tracking data and the focusing result: the white points are the eye tracking data, and the two red points are the centers of the two clusters. Figure 3(c) shows the visual attention energy map.

2.1.2. Visual Attention Energy

Visual attention energy indicates the attention of the human visual system to important positions in the image. The greater the energy, the higher the attention, and vice versa.

The two cluster centroids described in Section 2.1.1 are denoted as $c_1$ and $c_2$. The distances from each pixel $(x, y)$ of the frame to $c_1$ and $c_2$ are denoted as $d_1(x, y)$ and $d_2(x, y)$, respectively. Then, the visual attention energy of each pixel position in the frame is defined as

$$E_v(x, y) = 1 - \frac{\min\left(d_1(x, y),\, d_2(x, y)\right)}{\sqrt{W^2 + H^2}}, \qquad (1)$$

where $W$ and $H$ are separately the width and height of the video frame. The generated energy map is shown in Figure 3(c), which is produced from the cluster results of the eye tracking data in Figure 3(b).
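Under this reconstruction of equation (1) (the exact functional form was lost in extraction, so the normalization by the frame diagonal is our assumption), the energy map can be computed as follows.

```python
import numpy as np

def attention_energy(W: int, H: int, c1, c2) -> np.ndarray:
    """Visual attention energy per pixel: high near either visual
    focus, falling off with Euclidean distance, normalized by the
    frame diagonal (our reconstruction of equation (1))."""
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))   # pixel grid, (H, W)
    d1 = np.hypot(xs - c1[0], ys - c1[1])              # distance to focus 1
    d2 = np.hypot(xs - c2[0], ys - c2[1])              # distance to focus 2
    diag = np.hypot(W, H)                              # frame diagonal
    return 1.0 - np.minimum(d1, d2) / diag             # values in (0, 1]

E_v = attention_energy(640, 360, (100, 80), (400, 300))
print(E_v.shape, E_v.min(), E_v.max())
```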

2.2. Motion Estimation

In a video, the background and foreground are usually both moving, and the moving direction and speed of the background differ from those of the foreground. The human visual system pays greater attention to the direction in which an object is moving. For example, in a tennis video, the direction in which a player runs attracts more attention; in a racing video, more attention is paid to the area in front of the car.

Between adjacent video frames, the motion distance and direction of the background and foreground can be calculated to predict the motion trajectory of the salient object. Both the current position and the upcoming position of the foreground object are taken as important areas, which protects the visual attention areas, reduces the deformation of these important areas in the process of retargeting, and improves the visual effect of the retargeting results.

2.2.1. Feature Detection

In the background of a video frame, the mean displacement of the feature points is used as the basis of the moving speed; the same applies to the foreground. The position that the foreground salient object will reach is estimated according to the moving speed. Then, both the current position of the foreground and the estimated position after motion are regarded as important areas.

SIFT (scale-invariant feature transform) [31] is a computer vision algorithm proposed by Lowe to detect regional features in images. The core idea of the SIFT algorithm is to find extreme points across multiple spatial scales and to extract descriptors that are invariant to position, rotation, illumination, and scale. The SIFT algorithm has good robustness, distinctiveness, extensibility, and efficiency.

In this paper, the SIFT algorithm is used to detect the background and foreground motion information between adjacent frames, and the 20 feature points with the highest reliability are selected as the basis for motion speed calculation. An example of detected feature points is shown in Figure 4.
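A minimal sketch of this step with OpenCV is given below (cv2.SIFT_create requires OpenCV 4.4+ or the contrib build; Lowe's ratio threshold of 0.75 is a common default rather than a value reported in the paper).

```python
import cv2
import numpy as np

def top_matches(frame_a, frame_b, n: int = 20):
    """Detect SIFT features in two adjacent frames and return the n
    most reliable matched point pairs as two (n, 2) arrays."""
    sift = cv2.SIFT_create()
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)
    # Lowe's ratio test rejects ambiguous matches.
    knn = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    # Keep the n matches with the smallest descriptor distance.
    good.sort(key=lambda m: m.distance)
    good = good[:n]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    return pts_a, pts_b
```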

2.2.2. Foreground Separation

In a video frame, the salient object is generally in the foreground area. By salience detection, the foreground area can be separated from the background. Compared with other algorithms, SSAV [29] obtains clearer and more accurate results. SSAV is mainly composed of a pyramid dilated convolution module and a saliency-shift-aware module: the former robustly learns static salience features, and the latter combines a convolutional long short-term memory (ConvLSTM) network with a saliency-shift-aware attention mechanism. This paper uses the SSAV method to separate the salient foreground object from video frames.

2.2.3. Motion Detection and Estimation

From the SIFT feature points, we select points with high reliability as the basis for motion detection and estimation. Concretely, the SIFT feature points contained in the background are recorded as $p_{b,1}, p_{b,2}, \ldots, p_{b,N_b}$, where $N_b$ is their number. Similarly, the SIFT feature points contained in the foreground are recorded as $p_{f,1}, p_{f,2}, \ldots, p_{f,N_f}$, where $N_f$ is their number. From frame $t$ to frame $t+1$, the average moving speed of the feature points in the background is recorded as $v_b$:

$$v_b = \frac{1}{N_b} \sum_{i=1}^{N_b} \left( p_{b,i}^{\,t+1} - p_{b,i}^{\,t} \right). \qquad (2)$$

Similarly, from frame $t$ to frame $t+1$, the average moving speed of the feature points in the foreground is denoted as $v_f$:

$$v_f = \frac{1}{N_f} \sum_{i=1}^{N_f} \left( p_{f,i}^{\,t+1} - p_{f,i}^{\,t} \right). \qquad (3)$$

For a video, the estimated actual motion speed of the foreground, $v_e$, is defined as the difference between the motion speed of the foreground and that of the background:

$$v_e = v_f - v_b. \qquad (4)$$

Bringing equations (2) and (3) into equation (4) gives

$$v_e = \frac{1}{N_f} \sum_{i=1}^{N_f} \left( p_{f,i}^{\,t+1} - p_{f,i}^{\,t} \right) - \frac{1}{N_b} \sum_{i=1}^{N_b} \left( p_{b,i}^{\,t+1} - p_{b,i}^{\,t} \right). \qquad (5)$$
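Given the matched point pairs and a binary salience mask, equations (2)–(5) reduce to a few array operations, as in the following sketch (the array shapes and the mask-lookup convention are our assumptions).

```python
import numpy as np

def estimated_object_speed(pts_a, pts_b, fg_mask):
    """Split matched SIFT points into foreground/background using the
    salience mask, then return v_e = v_f - v_b (equations (2)-(5)).

    pts_a, pts_b: (N, 2) matched point coordinates in frames t and t+1.
    fg_mask: (H, W) boolean salience mask of frame t.
    """
    xs = pts_a[:, 0].astype(int)
    ys = pts_a[:, 1].astype(int)
    in_fg = fg_mask[ys, xs]              # per-point foreground lookup
    disp = pts_b - pts_a                 # per-point displacement
    v_b = disp[~in_fg].mean(axis=0)      # equation (2)
    v_f = disp[in_fg].mean(axis=0)       # equation (3)
    # If no point lands on the foreground, the centroid fallback of
    # Section 2.2.3 applies instead.
    return v_f - v_b                     # equations (4)/(5)
```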

As shown in Figure 5, after obtaining the salience map of the current frame, we extract the edge of the salient region with the Canny [32] method. The edge is then translated by the estimated actual motion speed to obtain the predicted position of the salient object. The polygon surrounding method [33] is used to obtain the external polygon of both the current and the predicted object contours. Finally, the area enclosed by the polygon is the important region after motion estimation. The motion estimation energy map is the binary map of this important area, as shown in Figure 5(d).
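The construction of the motion estimation map can be sketched as follows (cv2.convexHull is used here as a simple stand-in for the polygon surrounding method [33], whose exact procedure is not detailed in this section).

```python
import cv2
import numpy as np

def motion_estimation_map(sal_mask: np.ndarray, v_e) -> np.ndarray:
    """Binary map covering both the current salient region and its
    predicted position after moving by v_e."""
    edges = cv2.Canny(sal_mask.astype(np.uint8) * 255, 100, 200)
    pts = np.argwhere(edges > 0)[:, ::-1]            # edge pixels as (x, y)
    shift = np.round(np.asarray(v_e)).astype(int)
    shifted = pts + shift                            # predicted contour
    # Enclosing polygon of current + predicted contours (convex hull
    # as a stand-in for the polygon surrounding method [33]).
    hull = cv2.convexHull(np.vstack([pts, shifted]).astype(np.int32))
    E_m = np.zeros(sal_mask.shape, np.uint8)
    cv2.fillPoly(E_m, [hull], 1)                     # fill enclosed area
    return E_m
```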

When the salient object is too small or its features are not obvious, the first n (n = 20) feature points detected by the SIFT algorithm may lie wholly in the background area. In this situation, the centroid displacement of the salient object detected by SSAV is directly used as the moving speed of the foreground object to predict the position where the foreground will go.

The points in the salient object area are denoted as $q_1, q_2, \ldots, q_{N_s}$, where $N_s$ is their number. From frame $t$ to frame $t+1$, the motion speed of the foreground's centroid is denoted as $v_c$, where

$$v_c = \frac{1}{N_s} \sum_{i=1}^{N_s} q_i^{\,t+1} - \frac{1}{N_s} \sum_{i=1}^{N_s} q_i^{\,t}. \qquad (6)$$

The actual motion speed of the foreground is the difference between the motion speed of the foreground's centroid and the motion speed of the background:

$$v_e = v_c - v_b. \qquad (7)$$

Bringing equations (2) and (6) into equation (7) gives

$$v_e = \frac{1}{N_s} \sum_{i=1}^{N_s} \left( q_i^{\,t+1} - q_i^{\,t} \right) - \frac{1}{N_b} \sum_{i=1}^{N_b} \left( p_{b,i}^{\,t+1} - p_{b,i}^{\,t} \right). \qquad (8)$$
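This fallback amounts to a centroid difference, as the short sketch below illustrates (the binary masks are assumed to come from SSAV).

```python
import numpy as np

def centroid_speed(mask_t: np.ndarray, mask_t1: np.ndarray, v_b):
    """Fallback when no reliable SIFT points lie on the foreground:
    displacement of the salient object's centroid between frames
    (equation (6)) minus the background speed (equations (7)/(8))."""
    c_t = np.argwhere(mask_t)[:, ::-1].mean(axis=0)    # centroid, frame t
    c_t1 = np.argwhere(mask_t1)[:, ::-1].mean(axis=0)  # centroid, frame t+1
    return (c_t1 - c_t) - v_b
```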

2.3. Importance Map Fusion

The importance map is the direct basis for image retargeting. The visual attention energy map and motion estimation map obtained in the above steps need to be fused, together with the gradient map, into the importance map.

We denote $E_v$ as the normalized visual attention energy map, $E_g$ as the normalized gradient energy map, $E_m$ as the normalized motion estimation energy map, and $E$ as the importance map. The coefficient $\alpha \in [0, 1]$ is the weight of the visual attention energy map in the importance map, relative to the gradient energy map. Then, the importance map is defined as follows:

$$E = \alpha E_v + (1 - \alpha) E_g + E_m. \qquad (9)$$

The parameter $\alpha$ determines the contribution of visual attention energy to the importance map. The smaller $\alpha$ is, the smaller the proportion of visual attention energy and the weaker its impact on the retargeting results; the larger $\alpha$ is, the greater the proportion of visual attention energy and the stronger its impact. When $\alpha = 0$, the retargeting results only reflect the gradient information and motion estimation information, not the visual attention information. Conversely, when $\alpha = 1$, the retargeting results only reflect the visual attention information and motion estimation information, not the gradient information.
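Under the reconstruction of equation (9) above, the fusion is a single weighted sum (min–max normalization is our assumption; the text only states that the maps are normalized).

```python
import numpy as np

def fuse_importance(E_v, E_g, E_m, alpha: float = 0.5) -> np.ndarray:
    """Fuse the normalized energy maps into the importance map
    (our reconstruction of equation (9))."""
    def norm(e):
        # Min-max normalization to [0, 1]; epsilon guards flat maps.
        return (e - e.min()) / (e.max() - e.min() + 1e-8)
    return alpha * norm(E_v) + (1.0 - alpha) * norm(E_g) + norm(E_m)
```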

2.4. Mesh Deformation

This paper uses Wang's method [18] for mesh deformation to realize video retargeting. The input frame is divided into a quadrilateral mesh $M = \{V, E, F\}$, where $V$, $E$, and $F$ represent the sets of vertices, edges, and quads separately. Each quad $f$ is associated with a scaling factor $s_f$, and the average importance energy of quad $f$ is denoted as $w_f$. The quad deformation energy is defined as

$$D_U = \sum_{f \in F} w_f \sum_{(i,j) \in E(f)} \left\| (v_i' - v_j') - s_f (v_i - v_j) \right\|^2,$$

where $v_i$ and $v_i'$ are the positions of vertex $i$ before and after deformation, and $E(f)$ is the set of edges of quad $f$.

The grid line bending energy is described as

$$D_L = \sum_{(i,j) \in E} \left\| (v_i' - v_j') - l_{ij} (v_i - v_j) \right\|^2, \quad l_{ij} = \frac{\| v_i' - v_j' \|}{\| v_i - v_j \|}.$$

The total energy is the sum of $D_U$ and $D_L$:

$$D = D_U + D_L.$$

Wang's method [18] uses an iterative solver for the mesh deformation. In each iteration, the scaling factor of each quad is calculated by local optimization, and then the mesh vertices are updated by global optimization under the constraint of the target image boundary conditions. The iteration terminates when the energy no longer decreases or the displacement of every mesh vertex is less than 0.5. In the local step, the scaling factor of each quad is generated by minimizing that quad's deformation energy with respect to $s_f$, which gives the closed form

$$s_f = \frac{\sum_{(i,j) \in E(f)} (v_i - v_j)^{\mathsf T} (v_i' - v_j')}{\sum_{(i,j) \in E(f)} \left\| v_i - v_j \right\|^2}.$$
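For illustration, the local step has a closed-form solution per quad, sketched below under our reconstruction of the deformation energy (this is a sketch of the idea, not Wang's implementation [18]).

```python
import numpy as np

def quad_scale(v: np.ndarray, v_new: np.ndarray) -> float:
    """Closed-form optimal scaling factor of one quad in the local
    step, minimizing its deformation energy
    sum ||(v'_i - v'_j) - s (v_i - v_j)||^2 over its four edges.

    v, v_new: (4, 2) quad vertex positions before/after deformation,
    ordered so that consecutive vertices share an edge.
    """
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    num = sum(np.dot(v[i] - v[j], v_new[i] - v_new[j]) for i, j in edges)
    den = sum(np.dot(v[i] - v[j], v[i] - v[j]) for i, j in edges)
    return num / den

# A quad uniformly stretched by 1.5x yields s_f = 1.5.
quad = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
print(quad_scale(quad, 1.5 * quad))
```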

2.5. The Algorithm of the Proposed Method

The implementation steps of the proposed method are shown in Algorithm 1.

Input: original video $V$, the number of frames $K$, importance map fusion coefficient $\alpha$
Output: retargeting result video $V'$
For $i = 1$ to $K - 1$ do
       Calculate the two cluster centers of the eye tracking data of Frame$_i$ by the K-means method
       Use (1) to produce the visual attention energy map $E_v$ of Frame$_i$
       Separate the salient object of Frame$_i$ and Frame$_{i+1}$ by the SSAV [29] model
       Calculate the positions of corresponding features of Frame$_i$ and Frame$_{i+1}$ by the SIFT method
       Get the number of trusted feature points in the foreground and denote it as $N_f$
       If $N_f > 0$
         Calculate the background speed between Frame$_i$ and Frame$_{i+1}$ by (2)
         Calculate the foreground speed between Frame$_i$ and Frame$_{i+1}$ by (3)
         Calculate the actual moving speed of the salient object by (4) and (5)
       Else
         Calculate the background speed between Frame$_i$ and Frame$_{i+1}$ by (2)
         Calculate the foreground speed between Frame$_i$ and Frame$_{i+1}$ by (6)
         Calculate the actual moving speed of the salient object by (7) and (8)
       End If
       Estimate the position of the foreground
       Calculate the circumscribed polygon of both the estimated position and the current position of the foreground
       Generate the foreground motion estimation map $E_m$ according to the salient areas in the polygon
       Compose the importance map $E$ from the visual attention energy map $E_v$, foreground motion estimation map $E_m$, and gradient map $E_g$ by (9)
       Use the mesh deformation method described in Section 2.4 to produce the retargeting result of Frame$_i$
End for
Output result $V'$

3. Results and Analysis

3.1. Experimental Environment and Parameter Settings

To validate the performance of the proposed method, we conducted experiments on a computer with an Intel CPU and 16 GB of RAM. The proposed method was implemented in MATLAB R2016a on Windows.

The number of visual attention data clusters $k$ is set to 2. In the importance map fusion process, the visual attention weight $\alpha$ is set to 0.1, 0.5, and 0.9 separately.

In order to illustrate the universality of the proposed method, the public dataset DAVSOD [29] is selected as the experimental input. DAVSOD is a large-scale video salient object dataset that mainly serves the evaluation of video salient object detection and video retargeting. DAVSOD contains 226 video sequences and 24,000 frames, covering a variety of scenes, object categories, and motion patterns, and is annotated strictly according to human eye tracking data.

For each input video, 3 methods were applied in the comparison experiments: forward seam carving (FSC) [15], SNS [18], and the proposed VAMEVR.

3.2. Experimental Result and Analysis

We randomly select the "select_0115" and "select_0194" videos of DAVSOD as the experimental input. "select_0115" is a tennis video clip with 105 frames, and "select_0194" is a motorcycle race video clip with 133 frames; the frame sizes are listed in Table 1. In both videos, the camera moves during shooting, that is, the background is moving.

The experimental results are shown in Figures 6 and 7.

From Figures 6 and 7, we can see that the deformation of the salient area is small; in particular, the area in the direction the object is moving toward is well protected. Concretely, as shown in Figure 7(d), the region that the tennis ball is moving toward in frame "0145" undergoes smaller deformation, and so does the area in front of the motorcycle in frame "0818".

The main reason for the above results is that the important areas are assigned high energy by visual attention and motion estimation. In frames "0145" and "0150" of Figure 7(c), it can be seen that people pay more attention to the direction in which the player is moving toward the ball. Similarly, in frames "0815" and "0818" of Figure 7(c), people pay more attention to the forward direction of the motorcycle and less attention to the area behind it.

Specifically, as shown in Figures 6(c), 7(a), and 7(c), the smaller $\alpha$ is, the weaker the effect of visual attention is; the larger $\alpha$ is, the more obvious the effect of visual attention is.

3.3. Time Analysis

The size of video frames and average processing time of each frame in this paper are shown in Table 1.

It can be seen from Table 1 that FSC takes the longest time, at 6.03 s per frame. The average time per frame of the proposed VAMEVR is 0.53 s, which is 0.24 s longer than SNS. The increased time is mainly spent on calculating visual attention energy and motion estimation.

3.4. Discussion

The human visual system is more sensitive to salient objects. The more consistent the displacement of salient objects in adjacent frames before and after retargeting, the lower the content jitter. In this experiment, 30 frames of the motorcycle racing video are randomly selected for retargeting.

For the proposed VAMEVR, the centroid displacement of the salient object in the retargeting result is basically the same as that in the original video. When the weight coefficient $\alpha$ of the visual attention energy map is 0.1 and 0.9, the comparative analysis of horizontal and vertical displacement is shown in Figure 8.

The displacement correlation of the salient objects indicates the visual consistency between the original video and the retargeting result. The centroid displacements of the salient object in the input video and in the retargeting result are denoted as $X$ and $Y$, respectively. Their covariance is $\operatorname{cov}(X, Y)$, and the standard deviations of $X$ and $Y$ are $\sigma_X$ and $\sigma_Y$. The Pearson correlation coefficient is defined as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}. \qquad (10)$$
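Equation (10) is the standard Pearson correlation and can be computed directly, as in this short sketch (equivalent to numpy's corrcoef).

```python
import numpy as np

def displacement_correlation(X: np.ndarray, Y: np.ndarray) -> float:
    """Pearson correlation between the centroid displacement series of
    the original video (X) and the retargeting result (Y), eq. (10)."""
    # np.cov uses the N-1 normalization by default, so ddof=1 matches it.
    return np.cov(X, Y)[0, 1] / (np.std(X, ddof=1) * np.std(Y, ddof=1))
```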

As shown in Table 2, for VAMEVR, the displacements of the salient objects before and after retargeting are more positively correlated than for SNS and FSC. The visual effect of our results is therefore more consistent with the original video than that of SNS and FSC.

4. Conclusion

This paper proposes a visual attention and motion estimation-based video retargeting method for medical data security. Firstly, clustering is carried out on the eye tracking data to generate the visual attention energy map. Secondly, the motion estimation map is obtained from the corresponding feature points of the foreground and background between adjacent frames. Thirdly, the importance map is generated by fusing the visual attention energy map, the motion estimation map, and the gradient map. Finally, video retargeting is performed by mesh deformation. Experiments show that the proposed method can protect the important areas concerned by the human visual system, and the displacement of salient objects in the retargeting results is closer to that in the input video. Therefore, the visual effect is more in line with human visual needs. Our future work is to study multi-object separation and then video retargeting based on multi-object motion estimation for medical data security.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the Hubei Natural Science Foundation under grant no. 2021CFB156 and the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) under grant no. JP21K17737.