Abstract
With the rapid development of interactive multimedia services and camera sensor networks, the number of network videos is exploding, which has formed a natural carrier library for steganography. In this study, a coverless steganography scheme based on motion analysis of video is proposed. For every video in the database, the robust histograms of oriented optical flow (RHOOF) are obtained, and the index database is constructed. The hidden information bits are mapped to the hash sequences of RHOOF, and the corresponding indexes are sent by the sender. At the receiver, through calculating hash sequences of RHOOF from the cover video, the secret information can be extracted successfully. During the whole process, the cover video remains original without any modification and has a strong ability to resist steganalysis. The capacity is investigated and shows good improvement. The robustness performance is prominent against most attacks such as pepper and salt noise, speckle noise, MPEG4 compression, and motion JPEG 2000 compression. Compared with the existing coverless information hiding schemes based on images, the proposed method not only obtains a good tradeoff between hiding information capacity and robustness but also can achieve higher hiding success rate and lower transmission data load, which shows good practicability and feasibility.
1. Introduction
In recent years, the demand for information hiding has continued to grow, especially in cloud computing environments. Traditional information hiding technologies usually embed secret information in the carrier [1–5], which inevitably modifies the carrier features. Steganography schemes that hide information by constructing a mapping relationship between cover features and secret information [6, 7], or by using autogeneration technology [8–10], leave the carrier unmodified and therefore have a strong ability to resist steganalysis. They have aroused the interest of many researchers.
Most existing coverless steganography schemes are based on text and images. The text-based methods extract text features such as Chinese numeral expressions [11], word rank maps [12], or word frequency [13, 14] and quantify them. The mapping relationship between text features and secret information is then established, and the indexes are constructed. In the image-based methods, the key problem is how to extract the main features of an image efficiently, which has been extensively studied in previous research [15–19]. In the method proposed by Zhou et al. [6], the secret information was converted to bits and divided into several data segments. The image with the same hash sequence as the data segment was selected and transmitted to the receiver as the cover image, from which the receiver could extract the secret information. Zheng et al. [20] used the direction information of scale-invariant feature transform (SIFT) points to design an image hash and used an inverted index with a quadtree structure to improve the capacity and retrieval efficiency. An algorithm based on histograms of oriented gradients (HOG) was proposed by Zhou et al. [21], which obtained the hash sequences from the nonoverlapping blocks of the image. After block discrete cosine transformation (DCT) [22] or discrete wavelet transformation (DWT) [23], the relationship between coefficients of adjacent blocks was used to generate robust feature sequences. Partitioning the image improves the hiding capacity, but a larger partition number reduces the robustness. Zou et al. [24] and Cao et al. [25] used average pixel values of subimages, which achieved a high hiding success rate and capacity. In [26], the local binary pattern (LBP) feature of the medical image was extracted and mapped to privacy information. Recently, Luo et al. [27] used recognized objects to hide secret information.
At the same time, live network platforms and video social applications are becoming more and more popular [28]. A large number of short videos have been generated and spread on the Internet, which provides sufficient carrier for information hiding. Compared with image, video not only has texture, shape, and color features but also has rich spatial and temporal features, from which some motion characteristics can be mined. Theoretically, motion characteristic is robust and cannot easily be tampered with, which is suitable for steganography. This motivates us to design a novel coverless steganography scheme based on short videos by constructing a mapping function between the motion characteristics and information bits.
Existing research results on coverless steganography based on videos are still rare. Some researchers have proposed zero-watermarking technologies for copyright protection of video, which construct watermark information by extracting video features. Li et al. [29] proposed a zero-watermarking algorithm based on logarithmic polar coordinate transformation. After 2D-DWT and 3D-DCT transformation of the original image, the zero-watermarking was realized by transforming the logarithmic polar coordinates. Liu et al. [30] proposed a zero-watermarking scheme for three-dimensional video, which extracted the features of two-dimensional video frames and depth maps to generate copyright information and used secret sharing schemes to achieve copyright protection. Compared with the zero-watermarking algorithms, coverless steganography based on videos has a higher requirement for capacity.
As shown in Figure 1, there is a baby walking in the video. The simplest mapping function connects the stepping of each foot with a different information bit, for example,
$$b = \begin{cases} 0, & \text{if the left foot steps}, \\ 1, & \text{if the right foot steps}. \end{cases} \tag{1}$$
Then, the continuous stepping of the baby can represent a sequence of information bits. However, this kind of mapping has some shortcomings: first, the stepping characteristic is semantic and can easily be understood and cracked. Second, the calculation complexity of stepping recognition is still relatively high, although the motion analysis and tracking of a video have achieved significant progress recently [31, 32]. Third, the capacity of information hiding is low since only one bit is hidden in every frame. Therefore, how to mine the nonsemantic motion characteristic and construct a correlated mapping function for information bits is the key issue of coverless steganography based on videos.
Video recognition has been studied in depth by many researchers, and many algorithms have been proposed. Optical flow is the most classical method for video analysis [33–37]. In this work, we mainly study the optical flow characteristics of video and map them to hidden information to realize coverless steganography. The main contributions of this work are as follows: first, we construct a novel coverless steganography scheme based on motion analysis of video. Second, the mapping algorithm between the robust directional characteristics of video optical flow and secret information is proposed and optimized. Finally, information hiding capacity, robustness performance, efficiency, hiding success rate, and transmission data load of the proposed scheme are analysed and compared with the existing coverless steganography schemes based on images.
The study is organized as follows: preliminaries are introduced in the second section, and the proposed method is described in the third section. Experimental results and comparisons are shown in the fourth section. Finally, we conclude this study in the last section.
2. Preliminaries
2.1. Optical Flow
Optical flow is the instantaneous velocity distribution of the brightness pattern, which is caused by the movements of objects [33, 34] and has been applied widely for motion analysis. In recent studies, optical flow was used to estimate the traffic flow parameters of moving vehicles in different scenarios and showed its effectiveness [35, 36]. Lv et al. realized subpixel image registration based on the optical flow model and feature-point matching [37].
The basic idea of the optical flow method is to find the corresponding relationship between adjacent frames in the image sequence using the change of pixels in time domain. Then, the motion information of objects can be calculated.
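Assume that a pixel at position $(x, y)$ in one frame at time $t$ has light intensity $I(x, y, t)$ and that it moves to $(x + dx, y + dy)$ in the next frame at time $t + dt$. According to the assumption of constant brightness, we can get the following equation:
$$I(x, y, t) = I(x + dx, y + dy, t + dt). \tag{2}$$
Applying the Taylor series approximation to the right-hand side of equation (2), we get
$$I(x + dx, y + dy, t + dt) = I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt + \varepsilon, \tag{3}$$
where $\varepsilon$ is the second-order infinitesimal term and can be neglected. Therefore, combining equations (2) and (3) gives
$$\frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt = 0. \tag{4}$$
Assuming $u$ and $v$ are the velocity components of the optical flow along the X axis and Y axis, respectively, we have
$$u = \frac{dx}{dt}, \quad v = \frac{dy}{dt}. \tag{5}$$
Dividing equation (4) by $dt$, it can be transformed as
$$I_x u + I_y v + I_t = 0, \tag{6}$$
where $I_x = \partial I / \partial x$, $I_y = \partial I / \partial y$, and $I_t = \partial I / \partial t$. This is the basic constraint equation of optical flow.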
The constraint equation alone cannot determine the two unknowns u and v, so an additional assumption is needed, and there are two classical methods. One is a global differential method, which assumes that the optical flow changes smoothly over the entire image. The other is a local differential method, which assumes that the motion vector remains constant over a small spatial domain. The local method is therefore suitable for small motion detection but fails for large motion detection. To overcome this limitation, the pyramidal implementation was proposed by Bouguet [38].
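The local differential idea can be illustrated with a minimal sketch. It assumes a least-squares solution of the constraint equation over one square window, with central differences for the spatial gradients and a plain frame difference for the temporal gradient. The function names `lucas_kanade_patch` and `make_frame` and the synthetic quadratic test pattern are illustrative assumptions, not part of the proposed scheme.

```python
def lucas_kanade_patch(frame1, frame2, cx, cy, half_win):
    """Estimate one flow vector (u, v) for the window centred at (cx, cy)."""
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half_win, cy + half_win + 1):
        for x in range(cx - half_win, cx + half_win + 1):
            # Central differences approximate the spatial gradients I_x and I_y.
            ix = (frame1[y][x + 1] - frame1[y][x - 1]) / 2.0
            iy = (frame1[y + 1][x] - frame1[y - 1][x]) / 2.0
            # The frame difference approximates the temporal gradient I_t.
            it = frame2[y][x] - frame1[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Solve the 2x2 normal equations of I_x*u + I_y*v + I_t = 0.
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture to determine the flow")
    u = (sxy * syt - syy * sxt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


def make_frame(size, shift_x=0.0, shift_y=0.0):
    """Synthetic quadratic brightness pattern, optionally shifted (hypothetical test data)."""
    c = size // 2
    return [[(x - shift_x - c) ** 2 + (y - shift_y - c) ** 2
             for x in range(size)] for y in range(size)]


frame_a = make_frame(11)
frame_b = make_frame(11, shift_x=0.3, shift_y=-0.2)  # pattern moved by (0.3, -0.2)
u, v = lucas_kanade_patch(frame_a, frame_b, cx=5, cy=5, half_win=3)
print(round(u, 6), round(v, 6))
```

Because the window assumes one constant motion vector, the estimate is only reliable for subpixel or small displacements, which is the limitation the pyramidal implementation addresses.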
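The pyramid layering method reduces the size of the image layer by layer, thereby reducing the motion displacement of the object between two frames. The process is shown in Figure 2, and the specific steps are as follows:
Step 1: A pyramid is created for every frame, and the resolution is sequentially lowered from the bottom to the top.
Step 2: Starting at the top level, the optical flow at every point in the top-level image is obtained by minimizing the sum of matching errors within the neighbourhood of each point. Assuming $d = (d_x, d_y)$ is the optical flow, the residual function is defined as [38]
$$\varepsilon(d) = \sum_{(x, y) \in W} \left( I(x, y) - J(x + d_x, y + d_y) \right)^2, \tag{7}$$
where $W$ is the neighbourhood window, $(x, y)$ is the point of the original image $I$, and $(x + d_x, y + d_y)$ is the point of the target image $J$. Supposing there are $L$ layers in the pyramid and the first layer is the original image, if the total displacement is $d$, then the displacement at layer $l$ is
$$\frac{d}{2^{\,l-1}}, \quad l = 1, \ldots, L. \tag{8}$$
Step 3: The optical flow of the layer $L$ is propagated to the layer $L - 1$ as follows:
$$g^{L-1} = 2 \left( g^{L} + d^{L} \right), \tag{9}$$
where $g^{l}$ is the initial guess of the optical flow at layer $l$ (with $g^{L} = 0$ at the top layer) and $d^{l}$ is the residual optical flow estimated at layer $l$.
For the layer $l$, the calculation of optical flow is based on the minimization of the sum of matching errors for all points in the neighbourhood area, as the following equation:
$$\varepsilon^{l}(d^{l}) = \sum_{(x, y) \in W} \left( I^{l}(x, y) - J^{l}(x + g^{l}_x + d^{l}_x,\; y + g^{l}_y + d^{l}_y) \right)^2, \tag{10}$$
where $I^{l}$ and $J^{l}$ are the images at layer $l$.
The optical flow is propagated down the pyramid until it reaches the bottom layer. Then, the optical flow is calculated by
$$d = g^{1} + d^{1}. \tag{11}$$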
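The benefit of the coarse-to-fine propagation can be seen in a one-dimensional toy simulation. This is only a sketch under stated assumptions: layers are numbered from 1 (original image) to L (coarsest), each layer halves the resolution, and the local solver is replaced by a stand-in that recovers the residual flow exactly but only up to `max_step` pixels. The names `coarse_to_fine`, `clamp`, and `max_step` are hypothetical.

```python
def clamp(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def coarse_to_fine(true_displacement, layers, max_step):
    """Simulate pyramidal flow estimation for a single 1-D displacement."""
    guess = 0.0  # initial guess at the coarsest layer
    for layer in range(layers, 0, -1):
        # At layer l the displacement shrinks by a factor of 2**(l - 1).
        scaled_true = true_displacement / 2 ** (layer - 1)
        # Stand-in for the local solver: it only captures small residuals.
        residual = clamp(scaled_true - guess, max_step)
        if layer == 1:
            return guess + residual  # final flow at the original resolution
        # Propagate to the next finer layer: doubled guess plus residual.
        guess = 2.0 * (guess + residual)


print(coarse_to_fine(13.0, layers=4, max_step=2.0))  # four-layer pyramid
print(coarse_to_fine(13.0, layers=1, max_step=2.0))  # no pyramid
```

With four layers the 13-pixel motion appears as about 1.6 pixels at the coarsest layer, which the small-motion solver can handle, whereas the single-layer run saturates at the solver limit.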
2.2. Robust Histogram of Oriented Optical Flow
Since the size of the moving target usually changes with time in a video, the dimension of the corresponding optical flow descriptor will also change. At the same time, the original optical flow is also sensitive to the background noise, scale change, and the direction of motion. For information hiding, the extracted features are expected to be more stable, which can gain better robustness. Therefore, it is necessary to find a method based on the optical flow that can not only characterize the temporal motion information but also be insensitive to scale. Histogram of oriented optical flow (HOOF) was proposed by Chaudhry et al. [39]. The scale invariance of HOOF feature was achieved by the normalized histogram. In order to further enhance the robustness, the robust histogram of oriented optical flow (RHOOF) is achieved by only counting the number of optical flows located in the directional bins, while the amplitude information is ignored. This means that RHOOF will not be affected by the amplitude variation of optical flow, which is different from the original HOOF.
For every two frames, the optical flow is calculated. Then, the directional angle of the optical flow vector can be obtained by
$$\theta = \operatorname{atan2}(y, x), \tag{12}$$
where atan2(·) is a four-quadrant inverse tangent function, $x$ is the horizontal component, and $y$ is the vertical component of the optical flow vector. The range of $\theta$ is $[-\pi, \pi]$. If the angle range is divided into several groups, then the histogram distribution is statistically obtained. As shown in Figure 3, the bin number of the histogram is 6 and the bin size is $\pi/3$, which means that the angle range of optical flow is divided into 6 groups. The distribution of the angles is shown in the histogram on the right. The sum of the probabilities of all groups is 1.
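A minimal sketch of the direction-only histogram follows. It assumes the flow field of one subblock is given as a list of (x, y) vectors and that vectors with exactly zero motion are skipped because they have no direction; the latter is an assumption, since the text does not say how zero vectors are treated. The function name `rhoof` is illustrative.

```python
import math


def rhoof(flow_vectors, n_bins):
    """Normalized histogram of flow directions; vector amplitudes are ignored."""
    counts = [0] * n_bins
    for x, y in flow_vectors:
        if x == 0 and y == 0:
            continue  # assumption: a zero vector carries no direction
        theta = math.atan2(y, x)  # four-quadrant angle in [-pi, pi]
        index = int((theta + math.pi) / (2 * math.pi) * n_bins)
        counts[min(index, n_bins - 1)] += 1  # theta == pi falls in the last bin
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# (1, 0.5) and (2, 1) share a direction, so they land in the same bin
# even though their amplitudes differ.
hist = rhoof([(1, 0.5), (2, 1), (-1, 0.5), (0.5, -1)], n_bins=6)
print(hist)
```

Counting each vector once, regardless of its length, is what makes the histogram insensitive to amplitude variation.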
3. Proposed Coverless Steganography Scheme
Our proposed coverless steganography scheme based on videos is shown in Figure 4. The framework mainly includes three parts: index construction, secret information hiding, and secret information extraction. First, the video database is composed of multiple videos with different topics. The video database is shared by both secret information sender and receiver, which can be stored on cloud platform to save the storage space of the end user. Second, calculate RHOOF for every video in the video database. After the hash sequences of RHOOF are calculated, the video index database is constructed. The construction of video database and index database is the basis of coverless information hiding.
During the information hiding process, the secret information needs to be preprocessed and divided to binary bit groups. Every bit group can be mapped to a hash sequence of RHOOF. After searching in the video index database, the appropriate one or several videos are selected as cover, and the corresponding mapping indexes are returned to the sender. The mapping indexes will be sent to the receiver. At the receiver, cover video can be found accurately and efficiently according to the received mapping indexes. Through calculating the hash sequences of RHOOF from the cover video, the secret information can be recovered successfully. During the whole process, the cover video remains original without any modification. Therefore, it can resist the detection of steganalysis.
3.1. Generation of Hash Sequence
As described previously, RHOOF can reflect the main movement characteristic of the video. We propose a hash sequence generation method based on RHOOF as shown in Figure 5. For the two adjacent frames (assuming as frame i and frame i + 1) of a video, they are transformed to gray scale first. Second, the two frames are median filtered in order to suppress the possible noise and protect the edge information. Third, the pixel changes of these two frames are calculated, and the oriented optical flows are achieved as described in the previous section. We can analyse the orientation values of optical flow. The histogram is calculated in several bins for every subblock. In Figure 5, we set the number of subblocks to 4 and the number of bins to 8 as an example.
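Assume the histogram is denoted as
$$H = \{h_1, h_2, \ldots, h_N\}, \tag{13}$$
where $N$ is the number of the bins. We set the threshold as
$$T = \frac{\alpha}{N} \sum_{i=1}^{N} h_i, \tag{14}$$
where $\alpha$ is a correction factor. Then, the hash sequence $B = \{b_1, b_2, \ldots, b_N\}$ is obtained by comparing each histogram value with the threshold as the following equation:
$$b_i = \begin{cases} 1, & h_i \ge T, \\ 0, & \text{otherwise}, \end{cases} \quad i = 1, \ldots, N. \tag{15}$$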
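The thresholding step can be sketched as follows. The exact form of the threshold is an assumption here: the sketch takes the threshold to be the correction factor times the mean bin value and sets a bit to 1 when the bin value reaches the threshold. The function name `rhoof_hash` and the default `alpha=1.0` are hypothetical.

```python
def rhoof_hash(histogram, alpha=1.0):
    """Binarize a histogram into an N-bit hash string, one bit per bin."""
    # Assumed threshold: correction factor times the mean bin value.
    threshold = alpha * sum(histogram) / len(histogram)
    return "".join("1" if h >= threshold else "0" for h in histogram)


example_hist = [0.0, 0.25, 0.0, 0.5, 0.0, 0.25]
print(rhoof_hash(example_hist))             # threshold is 1/6
print(rhoof_hash(example_hist, alpha=2.0))  # threshold is 1/3
```

A larger correction factor keeps only the dominant directions, so it changes which bit patterns the video database can produce.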
3.2. Construction of Video Index Database
In order to find the cover video efficiently and accurately, the construction of video index database is necessary and important. Therefore, we construct an efficient index database with two levels as shown in Figure 6. The first level index is the hash sequence and the second level index contains the information items of cover video and cover frame.
The index items are sorted by the hash sequence. Here, the bin number is also set to 8 as an example. Therefore, the value of hash sequence varies from “00000000” to “11111111” in the index table. For every index item corresponding to the hash sequence, index ID, video ID, frame ID, and subblock ID are contained in the index database. Index ID is the serial number of an index, which is incremental. Video ID means the storage path and the name of the cover video. Frame ID means the corresponding frame, and subblock ID means the corresponding subblock. With such information, the cover video can be found accurately, and the hash sequence can be calculated efficiently and conveniently. For one hash sequence, it is possible to contain multiple index items, which means there are multiple cover subblocks with the same hash sequence in the video database. In this situation, any index item can be chosen in principle for subsequent secret information mapping.
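The two-level structure can be pictured with an in-memory stand-in for the index table. This is a sketch only: it uses a Python dictionary rather than a database table, and the sample records, including the file name "running.avi" and all frame and subblock numbers, are hypothetical.

```python
from collections import defaultdict


def build_index(records):
    """Group (hash, video, frame, subblock) records by hash sequence."""
    index = defaultdict(list)
    for index_id, (hash_seq, video_id, frame_id, subblock_id) in enumerate(records, start=1):
        index[hash_seq].append({
            "index_id": index_id,      # incremental serial number
            "video_id": video_id,      # path and name of the cover video
            "frame_id": frame_id,
            "subblock_id": subblock_id,
        })
    # First level: hash sequences in sorted order; second level: index items.
    return dict(sorted(index.items()))


records = [
    ("11111111", "walking.avi", 220, 2),
    ("00000010", "walking.avi", 23, 4),
    ("11111111", "running.avi", 7, 1),
]
index = build_index(records)
for hash_seq, items in index.items():
    print(hash_seq, [item["index_id"] for item in items])
```

One hash sequence may hold several index items, which is what later allows the sender to prefer items from a single video.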
3.3. Secret Information Hiding
During secret information hiding, the most critical part is how to map the secret information to the cover video efficiently. The whole process is summarized as follows:
Step 1: construct a video database, which is shared by both the sender and the receiver.
Step 2: for each video in the database, obtain the frame optical flows as described in the previous section.
Step 3: calculate the directional angle of the optical flow vector by equation (12).
Step 4: for every frame, count the robust histogram distribution in subblocks based on the orientation information of the optical flow, and obtain the hash sequences as described previously.
Step 5: construct the video index database as described previously.
Step 6: preprocess the secret information before sending. Assuming the length of the secret information is $k$ bits, it is divided into $m$ segments of $N$ bits each as
$$m = \left\lceil \frac{k}{N} \right\rceil, \tag{16}$$
where $N$ is the bin number of RHOOF statistics. For the last segment, “0” bits are padded to the tail, and the number of the padding bits is
$$p = mN - k. \tag{17}$$
Step 7: for every segment, search the index database for the index item whose hash sequence equals the information bits. Multiple index items may map to the same hash sequence. To increase the efficiency of information extraction, index items from the same video file should be chosen as much as possible. Within the same video file, the mapping index item with the smaller index ID is chosen.
Step 8: send the mapping indexes corresponding to the secret information segments to the receiver. To enhance security, the index information can be encrypted before transmission.
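The preprocessing in Step 6 can be sketched as below, assuming the secret is already available as a string of "0" and "1" characters. The function name `split_and_pad` is illustrative.

```python
def split_and_pad(secret_bits, n_bins):
    """Pad the bitstream with trailing zeros and cut it into n_bins-bit segments."""
    pad = (-len(secret_bits)) % n_bins  # zeros needed to reach a multiple of n_bins
    padded = secret_bits + "0" * pad
    segments = [padded[i:i + n_bins] for i in range(0, len(padded), n_bins)]
    return segments, pad


segments, pad = split_and_pad("111111110000001", n_bins=8)
print(segments, pad)
```

For a 15-bit secret and 8 bins this gives two segments and one padding bit, so each segment can be matched against an 8-bit hash sequence.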
The detailed algorithm of index database construction is described in Algorithm 1.

The information hiding algorithm at the sender is described in Algorithm 2.
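The selection rule of Step 7 can be sketched as follows. The heuristic is an assumption: it prefers the video that can cover the largest number of segments, and within a video it takes the smallest index ID. The function name `choose_index_items` and the sample index, including "running.avi" and its entries, are hypothetical.

```python
from collections import Counter


def choose_index_items(segments, index):
    """Pick one index item per segment, preferring a shared cover video."""
    candidates = []
    for segment in segments:
        if segment not in index:
            raise LookupError("no cover subblock maps to segment " + segment)
        candidates.append(index[segment])
    # Count how many segments each video could cover.
    coverage = Counter()
    for items in candidates:
        for video_id in {item["video_id"] for item in items}:
            coverage[video_id] += 1
    # Prefer the best-covering video, then a fixed video order, then small index ID.
    return [
        min(items, key=lambda it: (-coverage[it["video_id"]], it["video_id"], it["index_id"]))
        for items in candidates
    ]


sample_index = {
    "11111111": [
        {"index_id": 900, "video_id": "running.avi", "frame_id": 7, "subblock_id": 1},
        {"index_id": 902, "video_id": "walking.avi", "frame_id": 220, "subblock_id": 2},
    ],
    "00000010": [
        {"index_id": 9, "video_id": "walking.avi", "frame_id": 23, "subblock_id": 4},
    ],
}
chosen = choose_index_items(["11111111", "00000010"], sample_index)
print([item["index_id"] for item in chosen])
```

Keeping all items in one video means the receiver only has to open a single file during extraction.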

3.4. Extraction of Secret Information
At the receiver, by calculating the hash sequence of RHOOF based on the cover video, the secret information can be extracted successfully. The process of secret information extraction is as follows: Step 1: after receiving the index information sent by the transmitter, the receiver will decrypt it first if necessary. Then, the index items will be analysed, and video ID, frame ID, and subblock ID can be obtained. Step 2: the corresponding frame can be found according to video ID and frame ID. The optical flow of the corresponding frame is calculated as described previously. Step 3: the directional angle of the optical flow vector is calculated by equation (12) Step 4: the robust histogram distribution is statistically counted in subblocks according to the oriented information of the optical flow. The hash sequence is obtained. Step 5: repeat steps 1–4 until all the hash sequences corresponding to the mapping index items have been extracted Step 6: after connecting the hash sequences and removing the padding bits from the tail, the bitstream of secret information is recovered successfully
The detailed algorithm of secret information extraction is described in Algorithm 3.
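The final assembly at the receiver can be sketched as below. Two assumptions are made: the hash computation for a given video, frame, and subblock is supplied as a callable (here a lookup table stands in for the RHOOF calculation), and the number of padding bits is known to the receiver, for example because it is sent along with the indexes. The names `extract_secret`, `hash_of`, and `pad_bits` are hypothetical.

```python
def extract_secret(index_items, hash_of, pad_bits):
    """Concatenate the hash sequences of the indexed subblocks and drop the padding."""
    if pad_bits < 0:
        raise ValueError("pad_bits must be non-negative")
    bits = "".join(
        hash_of(item["video_id"], item["frame_id"], item["subblock_id"])
        for item in index_items
    )
    if pad_bits > len(bits):
        raise ValueError("more padding bits than extracted bits")
    return bits[:len(bits) - pad_bits]


# Stand-in for recomputing the hash from the cover video.
stored_hashes = {
    ("walking.avi", 220, 2): "11111111",
    ("walking.avi", 23, 4): "00000010",
}
received = [
    {"video_id": "walking.avi", "frame_id": 220, "subblock_id": 2},
    {"video_id": "walking.avi", "frame_id": 23, "subblock_id": 4},
]
secret = extract_secret(received, lambda v, f, s: stored_hashes[(v, f, s)], pad_bits=1)
print(secret)
```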

Figure 7 shows an example of the transmission and extraction of the secret bitstream “111111110000001.” The bin number N is 8. As mentioned before, the length of the padded bitstream should be an integral multiple of N. Therefore, the 15-bit bitstream is first padded with one “0” bit at the tail, which makes its length 16. Then, the padded bitstream is segmented into 2 groups of 8 bits. Next, the hash sequences equal to the segmented bitstreams “11111111” and “00000010” are searched in the video index database. From Figure 6, it can be seen that there are multiple index items corresponding to the hash sequences “11111111” and “00000010.” The index items with index ID 902 and 9 are selected since they have the same video ID; they are marked in green in Figure 6. Therefore, the video “walking.avi” is our cover video. At the receiver, RHOOF are calculated based on frames 220 and 221 and frames 23 and 24 of “walking.avi.” Then, the hash sequences “11111111” and “00000010” are obtained from subblock 2 and subblock 4, respectively. After removing the padding bit “0” from the tail, the secret bitstream “111111110000001” is recovered successfully.
3.5. Algorithm Improvement
One prerequisite of the optical flow method is that the brightness should remain constant. In the case of noise or random interference, the optical flow value will be greatly affected, which will lead to wrong extraction of the hidden information. Therefore, the algorithm is further optimized with an averaging window applied. Before the optical flow is calculated, we update the data of every frame by averaging the pixel values of adjacent frames. Through this smoothing operation, the influence of noise and random interference can be reduced. The improved algorithm of index database construction is described in Algorithm 4.
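The smoothing step can be sketched as a temporal averaging window over grayscale frames stored as nested lists. The alignment of the window is an assumption: each frame is averaged with the frames that follow it, and the window is truncated at the end of the video. The function name `average_frames` is illustrative.

```python
def average_frames(frames, window):
    """Replace every frame by the pixel-wise mean of up to `window` frames."""
    averaged = []
    for i in range(len(frames)):
        group = frames[i:i + window]  # truncated near the end of the video
        rows, cols = len(group[0]), len(group[0][0])
        averaged.append([
            [sum(frame[r][c] for frame in group) / len(group) for c in range(cols)]
            for r in range(rows)
        ])
    return averaged


# Four 1x1 frames with increasing brightness.
frames = [[[0]], [[10]], [[20]], [[30]]]
smoothed = average_frames(frames, window=2)
print(smoothed)
```

Averaging suppresses zero-mean noise but also spreads an isolated outlier pixel across neighbouring frames, which is relevant for impulse-type noise.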

The improved information extraction algorithm is described in Algorithm 5.

4. Experimental Results and Analysis
The experiments are conducted on an Intel(R) Core(TM) i7-6500X CPU @ 2.50 GHz with 16.00 GB RAM. MATLAB 2018b is used for algorithm simulation, and MySQL Workbench 6.3 is used for the index database construction.
A video database is used for the test, which is composed of videos with different movements and scenarios that are randomly chosen from the UCF101 and HMDB51 datasets, as shown in Figure 8. The file size of the videos is about 200∼800 KB, and the duration is about 2∼10 seconds.
4.1. Capacity Analysis
The bit number of the generated hash sequence based on the cover video determines the capacity of information hiding. Assuming the frame number of the video is $F$, the number of optical flow images is $F - 1$. Every optical flow image is divided into $S$ subblocks, and the oriented histograms are statistically counted in $N$ bins. Therefore, for every optical flow image, the number of mapped bits is
$$C_f = S \times N. \tag{18}$$
Then, for every video, the number of mapped bits is
$$C_v = (F - 1) \times S \times N. \tag{19}$$
It can be seen that the capacity of information hiding in our scheme is related to the bin number N, subblock number S, and frame number F. For a specific video, the frame number is fixed, and then, the capacity is determined by the number of bins and subblocks. The larger the N and S are, the larger the capacity is. We compare the capacity of single optical flow image in our scheme with existing coverless information hiding schemes based on single image in Table 1. Here, in the proposed method, the subblock number is set as 4 and the bin number is set as 8.
For a secret message to be hidden, the larger the capacity of a single image is, the fewer cover images are needed. Assuming the length of the secret information is $K$ bits and the capacity of a single image is $C$ bits, the number of required cover images is
$$M = \left\lceil \frac{K}{C} \right\rceil. \tag{20}$$
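The capacity arithmetic can be checked with a short sketch. It assumes each subblock contributes one bit per bin, which matches the 16 bits per frame quoted in Section 4.4 for 4 subblocks and 4 bins. The function names, the 101-frame video, and the 100-bit secret are hypothetical.

```python
def bits_per_flow_image(subblocks, bins):
    """Bits mapped by one optical flow image: one bit per bin in every subblock."""
    return subblocks * bins


def bits_per_video(frames, subblocks, bins):
    """A video with F frames yields F - 1 optical flow images."""
    return (frames - 1) * subblocks * bins


def covers_needed(secret_bits, capacity):
    """Smallest number of cover images whose total capacity holds the secret."""
    return -(-secret_bits // capacity)  # integer ceiling division


print(bits_per_flow_image(4, 8))   # S = 4, N = 8
print(bits_per_video(101, 4, 8))   # hypothetical 101-frame video
print(covers_needed(100, 32))      # hypothetical 100-bit secret
```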
With the same hidden information, the number of images needed for different methods is compared in Table 2. It can be seen that our proposed method (with S = 4 and N = 8) has a larger capacity than the other methods. Increasing S and N enlarges the capacity further. However, if N is increased, the video database also needs to be enlarged in order to ensure the success rate of information mapping. The robustness is also affected by the variation of S and N, which is investigated in the following tests.
4.2. Robustness Analysis
We investigate the robustness against salt and pepper noise, Gaussian noise, and speckle noise with different parameters for performance evaluation. Videos are commonly compressed, so the effects of MPEG4 compression (.mp4 file) and motion JPEG 2000 compression (.mj2 file) are also investigated. Assuming the original bitstream is $B = \{b_1, b_2, \ldots, b_n\}$ and the extracted bit sequence is $B' = \{b'_1, b'_2, \ldots, b'_n\}$, the accuracy rate is calculated by
$$AR = \frac{1}{n} \sum_{i=1}^{n} f(i), \tag{21}$$
where
$$f(i) = \begin{cases} 1, & b_i = b'_i, \\ 0, & \text{otherwise}. \end{cases}$$
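A minimal sketch of the accuracy measure, assuming it is the fraction of bit positions at which the extracted sequence agrees with the original one. The function name `accuracy_rate` and the sample bit strings are hypothetical.

```python
def accuracy_rate(original_bits, extracted_bits):
    """Fraction of positions where the extracted bit equals the original bit."""
    if not original_bits or len(original_bits) != len(extracted_bits):
        raise ValueError("bit sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(original_bits, extracted_bits) if a == b)
    return matches / len(original_bits)


# One flipped bit out of eight after a simulated attack.
print(accuracy_rate("11111111", "11111011"))
```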
The results of accuracy rate against different types of attacks with different bin number N are shown in Table 3, in which the subblock number S is fixed as 4. It can be seen that the robustness of the proposed scheme is good, especially against salt and pepper noise and MPEG4 compression. At the same time, the increment of bin number will lead to the decrease of accuracy. It is because the smaller bin number will cause bigger bin size, which is less sensitive to the variation of angle distribution. But according to equation (19), the capacity will be decreased with the lower bin number.
The accuracy results with different types of attacks and different subblock number S are also investigated as given in Table 4. Here, the bin number N is fixed as 8. It can be seen that the effect of subblock number S is relatively small and the variation trend is irregular. The reason is that the spatial distribution of the optical flow is different for various types of videos. Therefore, according to equation (19), we can increase the capacity of information hiding by increasing the subblock number if necessary.
4.3. Improvement Analysis of Frame Averaging
We investigate the performance improvement of introducing the frame averaging window before calculating optical flow. Here, the length of frame averaging window is set as 10, which means that the data of every frame are updated by averaging the pixel values of adjacent 10 frames. The comparison of accuracy rate with different types of noises and compression is shown in Figures 9–12, where the subblock number is set as 4 and the bin number is set as 4, 8, 12, and 16, respectively.
It can be seen that the frame averaging operation can improve the robustness significantly. Figure 9 is the accuracy comparison with compression attacks, which shows that the improvement for mj2 compression is much larger than for mp4 compression. For mj2 compression, the accuracy rate is increased by more than 7% with any bin number, while there is only a weak improvement for mp4 compression. Figures 10 and 11 show the accuracy comparison with Gaussian noise and speckle noise, respectively, both of which show significant improvement. The increment of accuracy rate is even bigger when the bin number is increased. However, for salt and pepper noise, the accuracy rate decreases with the frame averaging operation, as shown in Figure 12, and the larger the bin number is, the more obvious the impact is. This is because salt and pepper noise is approximately equal in amplitude but randomly distributed in different locations. Therefore, frame averaging may spread the noise from contaminated pixels to clean pixels at the same position in adjacent frames.
4.4. Robustness Comparison with Different Methods
We use the methods based on images for performance comparison after converting the videos to frame images. The latest DWT method [22], DCT method [23], and Hash method [20] are considered and tested on our video database. During the tests, the subblock number of the DWT method and the DCT method is set as 8. The subblock number of our method is 4, and the bin number is also set as 4. The accuracy of the different methods with Gaussian noise, salt and pepper noise, and video compression is shown in Figures 13–15. It can be seen that the proposed method performs well with salt and pepper noise and video compression, while it is more sensitive to Gaussian noise than the other three methods. Although the overall accuracy of the proposed method is slightly worse than that of the DWT method and the DCT method, the accuracy rate still reaches 0.97 in most cases. Moreover, the hiding capacity of our method is 16 bits for every frame, which is twice that of the DWT method and the DCT method in the tests. Therefore, our proposed method obtains a good tradeoff between steganographic capacity and robustness.
4.5. Hiding Success Rate Analysis
For coverless information hiding based on feature mapping, the extracted feature sequences should not only ensure the robustness but also reflect the differences of features. Therefore, under the premise of a given database, the success rate of information hiding is also an important indicator of the feasibility and practicability of the secret transmission scheme. Assuming that every hidden information segment contains $n$ bits and that the number of different mapping sequences that can be generated by the current video library is $k$, the hiding success rate is
$$SR = \frac{k}{2^{n}}. \tag{22}$$
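The hiding success rate can be sketched as the share of all possible n-bit segments that the database can actually produce. This assumes the rate is the number of distinct hash sequences divided by the number of possible n-bit patterns. The function name `hiding_success_rate` and the sample hash list are hypothetical.

```python
def hiding_success_rate(hash_sequences, n_bits):
    """Distinct n-bit hash sequences available, relative to all 2**n_bits patterns."""
    distinct = {h for h in hash_sequences if len(h) == n_bits}
    return len(distinct) / 2 ** n_bits


# Three of the four possible 2-bit patterns occur; "10" cannot be hidden.
print(hiding_success_rate(["00", "01", "01", "11"], n_bits=2))
```

Duplicated hash sequences do not raise the rate, which is why highly similar adjacent frames contribute little.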
In the test, the videos are chosen randomly from the UCF101 and HMDB51 datasets. The success rate of our scheme is compared with that of the DWT method and the DCT method. The subblock number of all three algorithms is set as 8. The results are shown in Figure 16. It can be seen that the hiding success rate increases with the number of videos. With the same number of videos, the hiding success rate of our method is much higher than that of the other two methods. With only about 70 videos, our method achieves a hiding success rate of more than 90%. The improvement comes from the use of optical flow features between adjacent frames in our method. In contrast, the DWT method and the DCT method based on images only consider each frame separately, and adjacent frames usually have similar texture features. Therefore, the generated bit sequences also have high similarity, which reduces the hiding success rate.
4.6. Complexity and Efficiency Analysis
The complexity of the proposed method mainly lies in the construction of the video index database because the RHOOF of every video frame needs to be calculated. However, the video index database only needs to be constructed once in advance at the sender. During real secret transmission, we only need to consider the time cost of the specific secret information hiding and extraction, in which the main work load lies in the feature analysis of the cover frames. We investigate the efficiency of different methods based on the time cost of hiding information bits with the same length. In Table 5, the time cost of different methods is listed, where “s/B” means the number of seconds that are required for hiding one information byte. It can be seen that the time cost of our method is more than other methods due to the complexity of hierarchical optical flow calculation.
4.7. Transmission Data Load Analysis
In the proposed steganography scheme, the video database is shared by both the sender and the receiver. During the information hiding process, the secret information is preprocessed and mapped to the hash sequences of RHOOF. After searching in the video index database, the corresponding mapping indexes are sent to the receiver. Then, the receiver can find the cover video in the video database. Therefore, the data transmission load only includes the contents of the index items, namely video ID, frame ID, and subblock ID. Assuming the size of the secret information is n bytes, the transmission loads of different methods are analysed in Table 6. Here, the subblock number of the DWT method and the DCT method is 8. It can be seen that the transmission load of our scheme is greatly reduced since the cover video need not be transmitted. However, the sender and the receiver are required to share and update the video database synchronously to ensure successful information hiding and extraction.
5. Conclusion
In this study, a coverless steganography scheme based on motion analysis of videos is proposed. The capacity, robustness, hiding success rate, time cost, and transmission data load have been investigated and compared with the existing methods. It is shown that the proposed method not only obtains a good tradeoff between hiding information capacity and robustness but also achieves higher hiding success rate and lower transmission data load, which shows good practicability and feasibility. However, the time cost of our method is higher due to the complexity of hierarchical optical flow calculation. We will try to improve the efficiency in the future work.
Data Availability
The UCF101 data used to support the findings of this study are available at http://crcv.ucf.edu/data/UCF101/UCF101.rar and the HMDB51 data are available at http://serrelab.clps.brown.edu/resource/hmdbalargehumanmotiondatabase/#Downloads.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (61772561 and 62002392), the Key Research and Development Plan of Hunan Province (2019SK2022), the Science Research Projects of Hunan Provincial Education Department (18A174 and 19B584), the Degree and Postgraduate Education Reform Project of Hunan Province (2019JGYB154), the National Natural Science Foundation of Hunan (2020JJ4140 and 2020JJ4141), and the Postgraduate Excellent Teaching Team Project of Hunan Province ([2019] 370–133).