Mobile Information Systems

Volume 2017 (2017), Article ID 5418978, 18 pages

https://doi.org/10.1155/2017/5418978

## Detecting Steganography of Adaptive Multirate Speech with Unknown Embedding Rate

^{1}College of Computer Science and Technology, National Huaqiao University, Xiamen 361021, China

^{2}Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Correspondence should be addressed to Hui Tian and Tian Wang

Received 9 December 2016; Accepted 23 April 2017; Published 18 May 2017

Academic Editor: Elio Masciari

Copyright © 2017 Hui Tian et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Steganalysis of adaptive multirate (AMR) speech is a significant research topic for preventing cybercrimes based on steganography in mobile speech services. Unlike existing work, this paper focuses on steganalysis of AMR speech with unknown embedding rate and presents three schemes based on support vector machines (SVMs) to address this problem. The first two schemes evolve from existing image steganalysis schemes and adopt different global classifiers. One is trained on a comprehensive speech sample set including original samples and steganographic samples with various embedding rates, while the other is trained on a particular speech sample set containing original samples and steganographic samples with uniform distributions of embedded information. Further, we present a hybrid steganalysis scheme, which employs Dempster–Shafer theory (DST) to fuse the evidence from multiple specific classifiers and provide a synthesized detection result. All the steganalysis schemes are evaluated using a well-selected feature set based on statistical characteristics of pulse pairs and compared with the optimal steganalysis that adopts specialized classifiers for the corresponding embedding rates. The experimental results demonstrate that all three steganalysis schemes are feasible and effective for detecting the existing steganographic methods with unknown embedding rates in AMR speech streams, while the DST-based scheme outperforms the others overall.

#### 1. Introduction

Steganography is an ancient but effective technique for covert communication that hides confidential messages in seemingly innocent carriers with imperceptible distortion. Although its history dates back to 440 BC [1], its candidate carriers have been evolving continuously over the years [2]. In recent years, steganographic carriers have expanded from images [3, 4] to almost all media forms (e.g., video [5, 6], audio [7, 8], text [9, 10], network protocols [11, 12], and Voice over IP [13–16]). However, steganography is a double-edged sword. Illegal usage of this technique would facilitate cybercrime and thereby pose a great threat to information security. Thus, its countermeasure, steganalysis, whose purpose is to detect potential steganographic behavior effectively, has also been attracting considerable attention [17–25].

In today’s mobile world, the adaptive multirate (AMR) codec has become a well-known and important compression standard for speech coding and has been widely employed not only in 3G and 4G speech services [26–28] but also in various mobile instant messaging apps (such as WhatsApp, Snapchat, LINE, and WeChat). Moreover, AMR is also a popular file format for storing AMR-encoded spoken audio, supported by almost all mobile communication devices. Due to its increasing popularity and broad influence in mobile communications, AMR speech is naturally considered an ideal carrier by the steganographic research community, and some relevant studies have been successfully performed [29–33].

AMR is a typical codec based on an algebraic code-excited linear prediction algorithm, in which algebraic codebook indices (ACIs), also called fixed codebook indices (FCIs), occupy a large percentage of each speech frame [26–28]. Taking the AMR speech codec at 12.2 kbps mode [28] as an example, 140 of the 244 frame bits are allocated to FCIs, so FCIs account for a large proportion (57.38%) of all frame bits [33]. Therefore, they are widely regarded as good candidates for steganographic carriers in the existing studies [29–33]. Geiser and Vary [29] first incorporated information hiding into speech coding of the AMR codec by modifying the fixed-codebook-search algorithm. Specifically, two secret bits can be hidden in each track by limiting the search range of the second FCI to two of eight candidate values. Their experimental results demonstrate that this method can offer a steganographic bandwidth of 2 kbit/s for the AMR speech codec at 12.2 kbps mode, while guaranteeing an imperceptible impact on speech quality and fairly small computational complexity. Following a similar idea, Miao et al. [30] proposed an adaptive suboptimal pulse combination constrained method for steganography in the AMR speech stream. Its main advantage over the previous method is that the steganographic capacity can be regulated by introducing an embedding factor. For example, for the AMR speech codec at 12.2 kbps mode, the embedding factor can typically be set to 1, 2, or 4, so the steganographic bandwidths are correspondingly 1, 2, or 3 kbit/s [32, 33]. It has been demonstrated that, by choosing a suitable embedding factor, this method can achieve a good trade-off between the distortion of speech quality and the embedding capacity [30].
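The figures quoted above for the 12.2 kbps mode can be checked with a few lines of arithmetic. The sketch below is only illustrative: the frame duration (20 ms) and the four subframes per frame are standard properties of this AMR mode rather than values stated in the text, and the function names are hypothetical.

```python
# Frame layout of AMR at 12.2 kbps.
FRAME_BITS = 244           # bits per speech frame (from the text)
FCI_BITS = 140             # bits allocated to fixed codebook indices (from the text)
FRAME_DURATION_MS = 20     # assumption: standard AMR frame length
SUBFRAMES_PER_FRAME = 4    # assumption: standard for this mode
TRACKS_PER_SUBFRAME = 5    # five tracks per subframe


def fci_share(fci_bits=FCI_BITS, frame_bits=FRAME_BITS):
    """Fraction of the frame bits occupied by fixed codebook indices."""
    return fci_bits / frame_bits


def hidden_bandwidth_bps(bits_per_track):
    """Hidden bits per second when every track carries `bits_per_track` secret bits."""
    bits_per_frame = bits_per_track * TRACKS_PER_SUBFRAME * SUBFRAMES_PER_FRAME
    return bits_per_frame * 1000 / FRAME_DURATION_MS


print(f"FCI share of frame bits: {fci_share():.2%}")
print(f"Hidden bandwidth at 2 bits per track: {hidden_bandwidth_bps(2):.0f} bit/s")
```

With two hidden bits per track this gives 40 bits per frame, which matches the 2 kbit/s reported for the method of Geiser and Vary.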

To prevent potential cybercrimes based on the above steganographic methods, some steganalysis studies have accordingly been conducted. Miao et al. [31] first presented two steganalysis methods for AMR speech. One is a Markov-based method that adopts Markov transition probabilities to evaluate the relationship between pulse positions in each track, while the other is an entropy-based method that employs the joint entropy and the conditional entropy to measure the uncertainty of pulse positions [31]. However, these two kinds of statistical features are not accurate enough for characterizing AMR speech, because they ignore the fact that the pulse positions may often be interchanged in the AMR encoding process [33]. Moreover, Ren et al. [32] presented a steganalysis method called Fast-SPP, which employs probabilities of same pulse positions (SPP) as the features to detect the existing steganographic methods [29, 30]. However, the SPP features only reflect the distributions of two track pulses being in the same position, which are not comprehensive enough to characterize AMR speech [33]. In particular, if a steganographic method deliberately avoids the track pulses with the same positions and the ones that would become the same after the embedding operation, Fast-SPP could not detect any abnormalities [33]. Therefore, in our previous work [33], we presented more accurate and more complete features for steganalysis of AMR speech. To avoid the impact induced by possible interchange of pulse positions in each track, we employ the statistical features of pulse pairs to characterize AMR speech, including the probability distributions of pulse pairs reflecting the long-term distribution of speech signals, Markov transition probabilities of pulse pairs depicting the short-term invariant characteristic of speech signals, and joint probability matrices of pulse pairs characterizing the track-to-track correlation [33]. 
Moreover, to optimize the feature set as well as cut down the dimension, a feature selection mechanism using adaptive boosting (AdaBoost) [34–38] is designed. Employing the selected optimal feature set, a support-vector-machine (SVM) based steganalysis of AMR speech was presented. The experimental results show that the proposed method significantly outperforms the previous ones.

However, all the above steganalysis methods assume that the embedding rate (also called the usage rate of the cover, which is the ratio between the number of bits actually embedded and the total number of cover bits) of steganographic samples in a given test set is exactly known. In other words, they generally train specific classifiers for steganographic samples with predefined embedding rates, and each specialized classifier is expected to detect the steganographic samples with the corresponding embedding rate. Unfortunately, in practice, we usually cannot ascertain whether the steganographic operation has been performed on a given sample, let alone know the concrete embedding rate. Thus, it is necessary and significant to develop detection techniques for steganography with unknown embedding rate [39–41]. To the best of our knowledge, this paper is the first to address this problem in the speech steganalysis field. In the image steganalysis field, however, some pioneering researchers have presented two useful schemes for detecting image steganography with unknown embedding rate. Both schemes adopt global classifiers based on a machine-learning algorithm (e.g., SVM) as the detectors, but the components of their training sets are different. Specifically, the training set of the first scheme includes original (untouched) samples and steganographic samples with various embedding rates [40, 41], while that of the other one consists of original samples and steganographic samples with uniform distributions of embedded data [40]. In this work, we first extend the two existing schemes to AMR speech steganalysis with unknown embedding rate, employing the state-of-the-art steganalysis features presented in our recent work [33]. Besides, incorporating Dempster–Shafer theory (DST) [42, 43], we further present a hybrid steganalysis scheme for AMR speech based steganography with unknown embedding rate. 
DST, also called evidence theory, is a well-established framework for uncertain reasoning, which can fuse available evidence from different sources and achieve a level of belief (confidence; trust) by considering all of them [42–46]. The main idea behind the presented steganalysis scheme is employing an algorithm based on DST to combine all the evidence from a set of classifiers intended for detecting steganographic approaches with specific embedding rates and accordingly providing a synthesized judgement for having or not having hidden information. All the three steganalysis schemes are evaluated with a great number of AMR-encoded speech samples and compared with the optimal steganalysis that uses every specialized classifier to detect the steganography with the corresponding embedding rate. The experimental results show that all these steganalysis schemes are feasible and efficient for detecting the state-of-the-art steganographic methods with unknown embedding rates in AMR speech streams, while the DST-based scheme can achieve better detection performance than the other ones.
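To make the idea of evidence fusion concrete, the following minimal sketch applies the textbook Dempster rule of combination to a two-hypothesis frame (cover versus stego). It is not the fusion algorithm of this paper, which is specified in Section 3; the mass values assigned to the three classifiers are hypothetical, as are all function and variable names.

```python
from itertools import product

COVER = frozenset({"cover"})
STEGO = frozenset({"stego"})
EITHER = COVER | STEGO  # mass that a source leaves uncommitted


def combine(m1, m2):
    """Dempster's rule for two mass functions given as {frozenset: mass} dicts."""
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


def fuse_all(masses):
    """Combine a non-empty list of mass functions from left to right."""
    result = masses[0]
    for m in masses[1:]:
        result = combine(result, m)
    return result


# Hypothetical evidence from three classifiers, each specialised for one embedding rate.
evidence = [
    {STEGO: 0.6, COVER: 0.1, EITHER: 0.3},
    {STEGO: 0.5, COVER: 0.2, EITHER: 0.3},
    {STEGO: 0.2, COVER: 0.3, EITHER: 0.5},
]
fused = fuse_all(evidence)
decision = "stego" if fused.get(STEGO, 0.0) > fused.get(COVER, 0.0) else "cover"
print(fused, decision)
```

How each classifier output is turned into a mass function is the part that a concrete scheme has to define; here the masses are simply given.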

The remainder of this paper is organized as follows. To make this paper self-contained, Section 2 first reviews the state-of-the-art steganalysis features based on statistical characteristics of pulse pairs. Section 3 presents the three steganalysis schemes for detecting AMR speech based steganography with unknown embedding rate. Section 4 evaluates the performance of the three steganalysis schemes through a set of comprehensive experiments, followed by concluding remarks in Section 5.

#### 2. Steganalysis Features Based on Statistical Characteristics of Pulse Pairs

In this work, all the presented steganalysis schemes adopt the state-of-the-art detection features based on statistical characteristics of pulse pairs for AMR speech, which consist of long-term features, short-term features, and track-to-track features [33].

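The probability distributions of the pulse pairs are employed to depict the long-term features of AMR speech. Assume that the given AMR speech sample to be detected has $N$ subframes and each subframe contains $T$ tracks. For the $j$th track in the $i$th subframe, two pulse positions can be extracted as a pulse pair $(u_{i,j}, v_{i,j})$. For a pulse pair $(x, y)$ of the $j$th track, its probability of appearing across all subframes (denoted by $p_j(x, y)$) can be determined as follows:

$$p_j(x, y) = \frac{1}{N} \sum_{i=1}^{N} \delta\big( (u_{i,j} = x \;\&\; v_{i,j} = y) \;|\; (u_{i,j} = y \;\&\; v_{i,j} = x) \big), \tag{1}$$

where “&” is the binary AND operation, “|” is the binary OR operation, and $\delta(\cdot)$ is a characteristic function defined as follows:

$$\delta(\omega) = \begin{cases} 1, & \text{if } \omega \text{ is true}, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$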

Let the number of candidate positions for every pulse in each track be $m$; the number of the possible pulse pairs (denoted by $K$) is

$$K = \binom{m}{2} + m = \frac{m(m+1)}{2}. \tag{3}$$

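Therefore, with $T$ tracks per subframe and $K$ possible pulse pairs per track as given by (3), there are $T \times K$ pulse pairs in each subframe. That is to say, the dimension of the long-term feature set (LTFS) for pulse pairs is $T \times K$.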
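The long-term features described above can be sketched as follows. This is a minimal illustration rather than the reference implementation of [33]: the data layout (`sample[i][j]` holding the pulse pair of track `j` in subframe `i`), the zero-based positions, and the function names are assumptions made for the example. Sorting each pair treats the two orderings of the same positions as one state, which is the role of the OR in the probability definition.

```python
from collections import Counter
from itertools import combinations_with_replacement


def pair_states(m):
    """All unordered pulse-position pairs over positions 0..m-1; there are m*(m+1)/2."""
    return list(combinations_with_replacement(range(m), 2))


def long_term_features(sample, m):
    """Return one probability per (track, unordered pair), track by track.

    sample[i][j] is the (u, v) pulse pair of track j in subframe i.
    """
    n_subframes = len(sample)
    n_tracks = len(sample[0])
    states = pair_states(m)
    features = []
    for j in range(n_tracks):
        # Sorting makes (u, v) and (v, u) count as the same pair.
        counts = Counter(tuple(sorted(sample[i][j])) for i in range(n_subframes))
        features.extend(counts[s] / n_subframes for s in states)
    return features


# Toy input: 3 subframes, 2 tracks, 3 candidate positions per pulse.
toy = [
    [(0, 1), (2, 2)],
    [(1, 0), (0, 2)],
    [(2, 1), (2, 2)],
]
ltfs = long_term_features(toy, m=3)
print(len(ltfs), ltfs)
```

For the toy input the vector has 2 × 6 = 12 entries, and the entries belonging to each track sum to one.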

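According to the short-term invariance of speech signals [47], the pulse pair of a track in the current subframe is bound to have a strong correlation with that of the same track in the prior subframe [33]. In this sense, for the $j$th pulse pairs (i.e., the pulse pairs of the $j$th tracks) in all $N$ subframes, the sequence of pulse-position pairs $X_{1,j}, X_{2,j}, \ldots, X_{N,j}$ can be considered as a Markov chain. Accordingly, the Markov transition matrix (MTM) can be employed to describe the transitive correlation of pulse-pair states in the given track. Moreover, as a first-order Markov chain, the sequence satisfies, for arbitrary states $c_1, \ldots, c_{i+1}$,

$$P(X_{i+1,j} = c_{i+1} \mid X_{i,j} = c_i, X_{i-1,j} = c_{i-1}, \ldots, X_{1,j} = c_1) = P(X_{i+1,j} = c_{i+1} \mid X_{i,j} = c_i).$$

In the $j$th tracks of all subframes, the probability that the pulse pair $s_b$ occurs after the pulse pair $s_a$ is

$$p_j(s_b \mid s_a) = \frac{\sum_{i=1}^{N-1} \delta(X_{i,j} = s_a \;\&\; X_{i+1,j} = s_b)}{\sum_{i=1}^{N-1} \delta(X_{i,j} = s_a)}.$$

Further, the MTM for the $j$th track (denoted by $\mathbf{M}_j$) can be determined as follows:

$$\mathbf{M}_j = \begin{bmatrix} p_j(s_1 \mid s_1) & p_j(s_2 \mid s_1) & \cdots & p_j(s_K \mid s_1) \\ p_j(s_1 \mid s_2) & p_j(s_2 \mid s_2) & \cdots & p_j(s_K \mid s_2) \\ \vdots & \vdots & \ddots & \vdots \\ p_j(s_1 \mid s_K) & p_j(s_2 \mid s_K) & \cdots & p_j(s_K \mid s_K) \end{bmatrix},$$

where $K$ is the number of all possible pulse-position pairs for the $j$th track, which can be determined by (3), and $s_k = (x_k, y_k)$ is the $k$th possible pulse-position pair for the $j$th track, where $x_k$ and $y_k$ are the potential pulse positions for the $j$th track. Moreover, assuming that there are $m$ candidate positions for each pulse, the index $k$ and the positions $x_k$ and $y_k$ are related by a one-to-one enumeration of the $K$ unordered pairs.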

Since there are $K$ possible pulse-position pairs in each track, the size of each MTM is $K \times K$. Taking the MTMs of all $T$ tracks into account, the dimension of the feature set would be very large. However, the characteristics of all the MTMs are similar. Therefore, we adopt the average Markov transition probabilities (MTPs) as the steganalysis features instead. The average MTM (denoted by $\bar{\mathbf{M}}$) is determined as

$$\bar{\mathbf{M}} = \frac{1}{T} \sum_{j=1}^{T} \mathbf{M}_j,$$

where $\mathbf{M}_j$ is the MTM of the $j$th track.

Accordingly, the dimension of the short-term feature set (STFS) for pulse pairs is $K \times K$, where $K$ is the number of possible pulse pairs per track.
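A possible way to compute the averaged transition matrix is sketched below. The data layout (`sample[i][j]` is the pulse pair of track `j` in subframe `i`), the zero-based positions, the enumeration order of the pair states, and the handling of states that are never followed by another subframe (their rows stay zero) are all assumptions of this example, not details confirmed by the text.

```python
from itertools import combinations_with_replacement


def average_transition_matrix(sample, m):
    """Average, over tracks, of the first-order transition matrices of pulse pairs."""
    states = list(combinations_with_replacement(range(m), 2))
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    n_tracks = len(sample[0])
    avg = [[0.0] * size for _ in range(size)]
    for j in range(n_tracks):
        # State sequence of track j; sorting ignores the order inside a pair.
        seq = [index[tuple(sorted(sub[j]))] for sub in sample]
        counts = [[0] * size for _ in range(size)]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
        for a in range(size):
            row_total = sum(counts[a])
            if row_total == 0:
                continue  # assumed convention: no observed transition, row stays zero
            for b in range(size):
                avg[a][b] += counts[a][b] / row_total / n_tracks
    return avg


# Toy input: 3 subframes, 2 tracks, 3 candidate positions per pulse.
toy = [
    [(0, 1), (2, 2)],
    [(1, 0), (0, 2)],
    [(2, 1), (2, 2)],
]
mtm = average_transition_matrix(toy, m=3)
for row in mtm:
    print(row)
```

Because rows of unseen states are left at zero, the rows of the averaged matrix need not sum to one on short samples.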

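Furthermore, the joint probability matrices of the pulse pairs in different tracks are employed to characterize the track-to-track features. To be specific, for the pulse pair of the $j$th track and the one of the $l$th track ($j \neq l$), the joint probability matrix (JPM) is

$$\mathbf{J}_{j,l} = \begin{bmatrix} p(s_1^{(j)}, s_1^{(l)}) & p(s_1^{(j)}, s_2^{(l)}) & \cdots & p(s_1^{(j)}, s_K^{(l)}) \\ p(s_2^{(j)}, s_1^{(l)}) & p(s_2^{(j)}, s_2^{(l)}) & \cdots & p(s_2^{(j)}, s_K^{(l)}) \\ \vdots & \vdots & \ddots & \vdots \\ p(s_K^{(j)}, s_1^{(l)}) & p(s_K^{(j)}, s_2^{(l)}) & \cdots & p(s_K^{(j)}, s_K^{(l)}) \end{bmatrix},$$

where $K$ is the number of all possible pulse-position pairs for a track, which can be determined by (3); $s_k^{(j)}$ ($s_k^{(l)}$) is the $k$th possible pulse-position pair for the $j$th ($l$th) track; and $p(s_a^{(j)}, s_b^{(l)})$ is the joint probability of $s_a^{(j)}$ and $s_b^{(l)}$. Specifically, the joint probability of the pulse-position pair $s_a^{(j)}$ in the $j$th track and the pulse-position pair $s_b^{(l)}$ in the $l$th track can be determined as follows:

$$p(s_a^{(j)}, s_b^{(l)}) = \frac{1}{N} \sum_{i=1}^{N} \delta\big( X_{i,j} = s_a^{(j)} \;\&\; X_{i,l} = s_b^{(l)} \big),$$

where $N$ is the number of the subframes, $X_{i,j}$ ($X_{i,l}$) is the pulse pair in the $j$th ($l$th) track of the $i$th subframe ($1 \le i \le N$), $\delta(\cdot)$ is a characteristic function defined as (2), and “&” is the binary AND operation.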

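Like the STFS above, we adopt the average JPM as the track-to-track feature set (TTFS) instead of all JPMs to reduce the computational complexity. Specifically, the average JPM (denoted by $\bar{\mathbf{J}}$) is

$$\bar{\mathbf{J}} = \frac{1}{|\mathcal{P}|} \sum_{(j,l) \in \mathcal{P}} \mathbf{J}_{j,l},$$

where $\mathbf{J}_{j,l}$ is the JPM of the $j$th and $l$th tracks and $\mathcal{P}$ is the set of track pairs $(j, l)$ with $j \neq l$ over which the JPMs are computed.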
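The averaged joint probability matrix can be sketched in the same style. The choice of averaging over unordered track pairs with the first track index smaller than the second is an assumption of this example (the text does not fix the set of track pairs), as are the data layout, the zero-based positions, and the function name. The sample must contain at least two tracks.

```python
from itertools import combinations, combinations_with_replacement


def average_joint_matrix(sample, m):
    """Average joint probability matrix of pulse pairs over pairs of distinct tracks."""
    states = list(combinations_with_replacement(range(m), 2))
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    n_subframes = len(sample)
    n_tracks = len(sample[0])
    # Assumption: every unordered pair of distinct tracks (j < l) is used once.
    track_pairs = list(combinations(range(n_tracks), 2))
    weight = 1.0 / (n_subframes * len(track_pairs))
    avg = [[0.0] * size for _ in range(size)]
    for j, l in track_pairs:
        for sub in sample:
            a = index[tuple(sorted(sub[j]))]  # state of track j in this subframe
            b = index[tuple(sorted(sub[l]))]  # state of track l in the same subframe
            avg[a][b] += weight
    return avg


# Toy input: 3 subframes, 2 tracks, 3 candidate positions per pulse.
toy = [
    [(0, 1), (2, 2)],
    [(1, 0), (0, 2)],
    [(2, 1), (2, 2)],
]
jpm = average_joint_matrix(toy, m=3)
for row in jpm:
    print(row)
```

Since each subframe contributes exactly one co-occurrence per track pair, all entries of the averaged matrix sum to one.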

Apparently, the dimension of the TTFS is $K \times K$, where $K$ is the number of possible pulse pairs per track. Accordingly, with $T$ tracks per subframe, the total dimension of all three feature sets is $TK + 2K^2$. Taking the AMR speech codec at 12.2 kbps mode as an example, there are five tracks in each subframe (i.e., $T = 5$), where the two pulses of a track share eight candidate positions, that is, $m = 8$. Thus, there are $K = 8 \times 9 / 2 = 36$ pulse pairs in each track, and the total dimension of all feature sets is $5 \times 36 + 2 \times 36^2 = 2772$. A feature set of this size is still too large to be directly adopted in a machine-learning based steganalysis scheme, since very-high-dimensional features would not only cause huge computational costs in the detection phase but also be more likely to induce overfitting in the training phase [33]. Thus, a feature selection mechanism based on AdaBoost [34–38] is employed to optimize the feature set as well as reduce the dimension. In the previous work [33], this mechanism yields a reduced feature set with the 498 most effective features for the AMR speech codec at 12.2 kbps mode, the composition of which is shown in Table 1. Given that the effectiveness of the selected feature set for steganalysis of AMR speech has been verified, we directly employ it in this paper.
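The dimension count for the 12.2 kbps mode can be reproduced with a short calculation. The function and key names below are hypothetical; the only inputs are the five tracks and eight candidate positions stated in the text.

```python
def feature_dimensions(n_tracks, n_positions):
    """Dimensions of the three pulse-pair feature sets before feature selection."""
    pairs = n_positions * (n_positions + 1) // 2  # unordered pairs, repeats allowed
    ltfs = n_tracks * pairs   # one probability per track and pair
    stfs = pairs * pairs      # averaged Markov transition matrix
    ttfs = pairs * pairs      # averaged joint probability matrix
    return {
        "pairs": pairs,
        "LTFS": ltfs,
        "STFS": stfs,
        "TTFS": ttfs,
        "total": ltfs + stfs + ttfs,
    }


dims = feature_dimensions(n_tracks=5, n_positions=8)
print(dims)
```

The result is 180 long-term, 1296 short-term, and 1296 track-to-track features, 2772 in total, of which 498 remain after the AdaBoost-based selection reported in [33].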