Abstract

Steganography is a means of covert communication that conceals both the occurrence and the real purpose of the communication. The adaptive multirate wideband (AMR-WB) codec is a widely adopted format in mobile handsets and is also the recommended speech codec for VoLTE. In this paper, a novel AMR-WB speech steganography is proposed based on a diameter-neighbor codebook partition algorithm. Different embedding capacities may be achieved by adjusting the iterative parameter during codebook division. The experimental results show that the presented AMR-WB steganography may provide higher and more flexible embedding capacity than the state-of-the-art methods without inducing perceptible distortion. With 48 iterations of cluster merging, twice the embedding capacity of the complementary-neighbor-vertices-based embedding method may be obtained with a decrease of only around 2% in speech quality and much the same undetectability. Moreover, both the quality of the stego speech and the security against statistical steganalysis are better than those of the recent speech steganography based on neighbor-index-division codebook partition.

1. Introduction

With the rapid development of the Internet and the growing popularity of instant messaging applications, people are increasingly using audio-based communication. How to avoid interception and secure communication has become one of the most important research problems. Encryption is a conventional method of protecting communication; however, the transmission of ciphered content may easily arouse attackers’ suspicion. In recent years, steganography has been presented as an effective means of covert communication. Audio steganography can transfer important messages secretly by embedding them into cover audio files with the use of information hiding techniques [1]. Data hiding in audio is especially challenging because the human auditory system operates over a wider dynamic range than the human visual system.

Many works on audio steganography have already been reported. Gruhl et al. [2] proposed an audio steganographic method of echo hiding by the introduction of synthetic resonances in the form of closely spaced echoes. Gopalan [3] presented a method of embedding a covert audio message into a cover utterance by altering one bit in each of the cover utterance samples. Gopalan et al. [4] provided two methods of secret message embedding by modifying the phase or amplitude of perceptually masked or significant regions of a host. A direct-sequence spread-spectrum watermarking method with strong robustness against common audio editing procedures was proposed in [5]. In addition, many audio steganographic applications, including Steghide and Hide4PGP, can be freely downloaded from the Internet. However, most of these methods are not robust to AMR-WB speech coding.

Based on segmental SNR analysis of modification to the encoded bits in a frame, Liu et al. [6] selected the perceptually least important bits to embed secret message in G.729 speech. In [7], a simple and effective steganographic approach, which may be applied to 5.3 Kbps G.723.1 speech, was presented based on analyzing the redundancy of code parameters, and augmented identity matrix was utilized to lower the distortion of cover speech. Similarly, by calculating speech quality sensitivity on each encoded bit out of 244 bits using perceptual evaluation of speech quality (PESQ) criterion, a data hiding approach to embedding data in enhanced full rate (EFR) compressed speech bitstream is proposed in [8]. In addition, Nishimura [9] proposed three methods of hiding data in the pitch delay data of the AMR speech.

Based on complementary neighbor vertices codebook partition algorithm (CNV), Xiao et al. [10] presented an approach to information hiding in compressed speech with the use of quantization index modulation (QIM) [11]. Huang et al. [12] proposed a steganographic algorithm for embedding data in different speech encoding parameters of the inactive frames, the embedding capacity of which is bounded by the number of inactive frames in the cover speech. In [13], Huang et al. also presented a method for steganography in low bit-rate VoIP streams based on pitch period prediction. It can achieve high quality of stegospeech and prevent statistical steganalysis, but the embedding rate is still low (only about 133.3 bps). And an adaptive suboptimal pulse combination constrained (ASOPCC) method was presented in [14] to embed data into compressed speech signal of AMR-WB codec. However, most of the PESQ scores in different coding modes are not high. In [15], a key-based codebook partition strategy, which dynamically determines the adopted division scheme, was designed to improve the security of the QIM steganography in speech bitstream. Although the stegospeech quality is guaranteed to be good, the embedding capacity is very limited and not adjustable. Liu et al. [16] proposed a neighbor-index-division codebook division algorithm (NID) for G.723.1 speech. Differing from the existing CNV method, NID divides neighbor-indexed codewords into separated subcodebooks according to a suitable stegocoding strategy. The embedding capacity is improved by using multiple division and multi-ary coding strategy.

The adaptive multirate wideband (AMR-WB) codec is a widely adopted format in mobile handsets and is also the recommended speech codec for VoLTE. AMR-WB speech may therefore be a good candidate for the cover medium in audio steganography, and we focus on AMR-WB speech steganography in this paper. Firstly, a new diameter-neighbor (DN) codebook partition algorithm for AMR-WB speech is proposed. Based on DN codebook division, we develop a novel AMR-WB speech steganography capable of providing flexible embedding capacity with different values of the iterative parameter. For example, with 48 iterations of cluster merging, twice the embedding capacity of the CNV-based method may be obtained with a decrease of only about 2% in speech quality and much the same undetectability. Moreover, both the quality of the stego speech and the security against statistical steganalysis [17, 18] are better than those of the recent NID-based speech steganography.

The remainder of this paper is organized as follows. In Section 2, the related work is briefly introduced. In Section 3, the proposed DN codebook partition algorithm and the novel AMR-WB speech steganography are described in detail. The experimental results and analysis are provided in Section 4. Finally, conclusions are presented.

2. Related Work

In this section, a technical overview of the AMR-WB codec is first presented. Then two related codebook partition algorithms, CNV [10] and NID [16], are briefly reviewed.

2.1. AMR-WB Codec

The AMR-WB speech codec is standardized by 3GPP (3rd Generation Partnership Project) and adopted as the standard G.722.2 by ITU-T in 2002 [19]. It is a multirate wideband speech codec applied in modern mobile communication systems to remarkably improve the speech quality. The AMR-WB codec operates at a multitude of bit rates ranging from 6.6 kbit/s to 23.85 kbit/s.

The input audio signal, sampled at 16 kHz, is divided into 20 ms frames. Linear prediction analysis (LPA) is performed on every frame, and the LP coefficients are converted to immittance spectrum pair (ISP) coefficients. The ISP coefficients are then converted to the frequency domain, giving immittance spectral frequency (ISF) coefficients, for quantization. Except for mode 0 (6.6 kbit/s), the ISF coefficients are quantized using two-stage vector quantization with split-by-2 in the first stage and split-by-5 in the second stage. Both the second and the third codebooks in the second stage have 128 codewords, and the ISF indices of the codewords in these codebooks may be employed to embed the secret message.

In the decoder, the transmitted indices are first parsed from the received bitstream and then decoded to obtain the code parameters for each transmitted frame, such as the ISP vector, the 4 fractional pitch lags, the 4 LTP filtering parameters, the 4 innovative code vectors, and the 4 sets of vector quantized pitch and innovative gains. For a more detailed description, one should refer to [19]. From the received ISF indices, which may have been modified because of secret message embedding, the receiver can recover the embedded secret message.

2.2. Complementary Neighbor Vertices

CNV is a new type of codebook partition algorithm proposed in [10], in which each codeword in a codebook is viewed as a vertex in the multidimensional space. The relationship between two codewords $c_i$ and $c_j$ is described as an edge $e_{ij}$ connecting the two codewords’ vertices, and the weight $w_{ij}$ of the edge is defined as the Euclidean distance between $c_i$ and $c_j$. A small value of $w_{ij}$ indicates that $c_i$ and $c_j$ closely resemble each other. The vertex nearest to $c_i$ is referred to as $c_i$’s neighbor vertex, which is denoted by $\mathcal{N}(c_i)$. The vertex set $V$ together with the edge set $E$ forms a graph $G(V, E)$ in the multidimensional space.

The codebook partition is realized by constructing the graph and labelling its vertices. First, each vertex in $V$ is connected with its neighbor vertex by an edge. Thus, the graph is divided into several isolated subgraphs, each of which may be proved to be acyclic and 2-colorable. Second, every vertex and its neighbor vertex in a subgraph are labelled oppositely using “0” or “1.” Third, all of the vertices with the same label are collected into a subcodebook; hence, two subcodebooks are obtained.

Based on the generated subgraphs and the label assigned to each codeword in them, CNV-based steganography applies the QIM concept to embed the secret message. More specifically, when the label of the codeword $c_i$, which is associated with the cover quantization index $i$, agrees with the secret bit, $i$ remains unchanged; otherwise it is replaced with the quantization index of the neighbor codeword $\mathcal{N}(c_i)$, which belongs to the opposite subcodebook.
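The CNV partition and its QIM-style embedding rule can be illustrated with a minimal Python sketch. It is not the implementation of [10]: the function names, the random stand-in codebook, and the breadth-first 2-colouring are assumptions made for illustration, and the sketch assumes that all pairwise distances are distinct so that every subgraph is acyclic.

```python
import math
import random
from collections import deque

def nearest_neighbor(codebook, i):
    """Index of the codeword closest to codebook[i] (Euclidean distance)."""
    return min((j for j in range(len(codebook)) if j != i),
               key=lambda j: math.dist(codebook[i], codebook[j]))

def cnv_labels(codebook):
    """Connect every codeword to its neighbor vertex, then 2-colour each subgraph."""
    n = len(codebook)
    neighbor = [nearest_neighbor(codebook, i) for i in range(n)]
    adjacency = [set() for _ in range(n)]
    for i, j in enumerate(neighbor):
        adjacency[i].add(j)
        adjacency[j].add(i)
    label = [None] * n
    for start in range(n):
        if label[start] is not None:
            continue
        label[start] = 0
        queue = deque([start])
        while queue:                      # breadth-first traversal of one subgraph
            u = queue.popleft()
            for v in adjacency[u]:
                if label[v] is None:
                    label[v] = 1 - label[u]
                    queue.append(v)
    return neighbor, label

def cnv_embed(index, bit, neighbor, label):
    """Keep the index if its label matches the secret bit, else use the neighbor."""
    return index if label[index] == bit else neighbor[index]

def cnv_extract(index, label):
    """The embedded bit is the label of the received codeword."""
    return label[index]

# Hypothetical 16-codeword, 4-dimensional codebook used only for the demonstration.
random.seed(0)
codebook = [[random.random() for _ in range(4)] for _ in range(16)]
neighbor, label = cnv_labels(codebook)
stego_index = cnv_embed(5, 1, neighbor, label)
print(stego_index, cnv_extract(stego_index, label))
```

Because a codeword and its neighbor vertex always carry opposite labels, each index hides exactly one bit and is moved by at most one nearest-neighbor step.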

The key characteristic of CNV-based steganography is that the distortion is bounded even in the worst case. However, the embedding capacity is limited, as analyzed experimentally in Section 4. Moreover, the number of possible combinations of flipping coefficients, which determine whether the labels in a subgraph are flipped, is large. Extra information about the flipping process must be transmitted to the receiver, and thus the effective embedding capacity may be decreased further.

2.3. Neighbor Index Division

NID assumes that the codewords with neighboring indices (i.e., neighboring positions) in a codebook are close together. Hence the codewords in a codebook can be easily separated into subcodebooks according to their indices instead of the Euclidean distance. Specifically, select an appropriate integer $k$ according to the demand for embedding capacity and label the $i$th codeword with the digit $i \bmod k$. Then collect all the codewords with the same label into a subcodebook and obtain $k$ different subcodebooks.

In order to make full use of the embedding capacity, the binary secret message should be transformed into $k$-ary digits denoted by $d$ ($0 \le d \le k-1$). When the codeword related to the cover quantization index belongs to a subcodebook whose label differs from the $k$-ary digit $d$ to be embedded, this index should be substituted with that of the closest codeword in the subcodebook labelled $d$.

NID-based steganography is an information hiding method based on neighbor-index codebook partition, of which the embedding capacity may be controlled by the number of subcodebooks $k$. However, as illustrated in [16], only about 34% of the pairs of neighbor-index codewords happen to be pairs of neighbor-vertex codewords, and the mean distance between neighbor-index codewords is apparently larger than that between neighbor-vertex codewords. Therefore, the amount of distortion induced by NID-based steganography may be somewhat large, which is confirmed by the experimental results provided in Section 4.
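The index-based labelling and substitution described above can be sketched as follows. This is a simplified illustration rather than the method of [16]: the modulo labelling, the function names, and the random stand-in codebook are assumptions.

```python
import math
import random

def nid_label(index, k):
    """Assumed labelling: the i-th codeword belongs to subcodebook i mod k."""
    return index % k

def nid_embed(index, digit, codebook, k):
    """Embed one k-ary digit by moving to the closest codeword labelled `digit`."""
    if nid_label(index, k) == digit:
        return index                       # label already matches: no change
    candidates = range(digit, len(codebook), k)
    return min(candidates,
               key=lambda j: math.dist(codebook[index], codebook[j]))

def nid_extract(index, k):
    """The embedded digit is the label of the received index."""
    return nid_label(index, k)

# Hypothetical 128-codeword, 3-dimensional codebook used only for the demonstration.
random.seed(1)
codebook = [[random.random() for _ in range(3)] for _ in range(128)]
k = 4
stego_index = nid_embed(37, 2, codebook, k)
print(stego_index, nid_extract(stego_index, k))
```

The sketch makes the trade-off visible: a larger $k$ hides more information per index, but the substitute codeword is chosen from a smaller subcodebook and may lie farther from the cover codeword.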

3. Proposed Method

The diagram of the proposed method is shown in Figure 1. Based on DN codebook partition of the codebooks described in Section 2.1, secret message can be embedded into an AMR-WB speech file. After the stego AMR-WB speech file is received, the embedded secret message can be extracted without errors. At the same time, the decoded speech without perceptible distortion will also be obtained. In the following section, the diameter-neighbor codebook partition algorithm (DN) is first introduced. Then the embedding and extraction procedure of our proposed method are described.

3.1. Codebook Partition

A codebook may be viewed as a list of isolated code vectors (i.e., codewords) in the multidimensional space. The codebook partition algorithm used for audio steganography is to divide the codebook into several clusters, in each of which the codewords can be replaced with each other without causing perceptible distortion.

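Let $C = \{c_1, c_2, \ldots, c_N\}$ denote the original codebook with $N$ codewords, and let $S = \{s_1, s_2, \ldots, s_m\}$ denote a cluster with $m$ codewords. The centroid $\bar{S}$ of a cluster is defined as follows:

$$\bar{S}_k = \frac{1}{m} \sum_{i=1}^{m} s_{i,k},$$

where $\bar{S}_k$ and $s_{i,k}$ are the $k$th components of $\bar{S}$ and $s_i$, respectively.

The centroid (average code vector) $\bar{S}$ is used to represent the corresponding cluster $S$; hence, the cluster may also be considered as a vector in the multidimensional codebook space. In order to describe the similarity between two clusters $S_a$ and $S_b$, the Euclidean distance between them is defined as follows:

$$d(S_a, S_b) = \sqrt{\sum_{k=1}^{K} \left(\bar{S}_{a,k} - \bar{S}_{b,k}\right)^2},$$

where $\bar{S}_a$ and $\bar{S}_b$ are the corresponding geometric center points of the two clusters $S_a$ and $S_b$, $K$ is the dimension of a codeword, and $\bar{S}_{a,k}$ and $\bar{S}_{b,k}$ are the $k$th components of $\bar{S}_a$ and $\bar{S}_b$, respectively.

Let $\Omega = \{S_1, S_2, \ldots, S_n\}$ denote a cluster set. The diameter of $\Omega$ is defined as the maximal Euclidean distance over all cluster pairs in the cluster set $\Omega$, that is,

$$D(\Omega) = \max_{1 \le a < b \le n} d(S_a, S_b),$$

where $n$ is the number of clusters within the cluster set $\Omega$. The cluster pair with the maximal Euclidean distance $D(\Omega)$, called the diameter cluster pair, is denoted by $(S_p, S_q)$. The neighbor of a cluster $S_a$ in $\Omega$ is represented by $\mathcal{N}(S_a)$; then we have

$$\mathcal{N}(S_a) = \arg\min_{S_b \in \Omega,\ b \ne a} d(S_a, S_b).$$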

Figure 2 illustrates the proposed DN codebook partition algorithm, and its detailed procedure is given in Algorithm 1. The original codebook is divided into clusters by iteratively merging each cluster of the diameter cluster pair with its respective neighbor. An iteration parameter $T$, the number of merging iterations, is applied to obtain flexible embedding capacity by controlling the merging procedure. The relationship between $T$ and the embedding capacity is discussed in Section 4.3.

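Algorithm 1: Diameter-neighbor (DN) codebook partition.
Input: Codebook $C = \{c_1, \ldots, c_N\}$, iterative parameter $T$
Output: Cluster set $\Omega$
// $\Omega'$ is a helper cluster set
$\Omega$.clear();
$\Omega'$.clear();
// Each codeword is taken as an initial cluster
for $i = 1$; $i \le N$; $i$++ do
    $\Omega'$.push($\{c_i\}$);
end
// Iterative merging
while $T > 0$ do
    if $\Omega'$ is empty then
        $\Omega' = \Omega$;
        $\Omega$.clear();
    end
    $(S_p, S_q)$ = diameter cluster pair of $\Omega'$;
    $S_p' = \mathcal{N}(S_p)$;
    $S_q' = \mathcal{N}(S_q)$;
    $\Omega$.push($S_p \cup S_p'$);
    $\Omega$.push($S_q \cup S_q'$);
    $\Omega'$.remove($S_p$);
    $\Omega'$.remove($S_p'$);
    $\Omega'$.remove($S_q$);
    $\Omega'$.remove($S_q'$);
    $T = T - 1$;
end
// Put the remaining clusters in $\Omega'$ into $\Omega$
for $S = \Omega'$.begin(); $S \ne \Omega'$.end(); $S$++ do
    $\Omega$.push($S$);
end
return $\Omega$;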
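The merging procedure can also be written as a short runnable Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `centroid` and `dn_partition` are hypothetical, the codebook is a random stand-in rather than a real AMR-WB ISF codebook, each neighbor is searched among the clusters that have not yet been merged in the current pass (so that the two merged clusters never share a member), and merging stops when fewer than four clusters remain available.

```python
import math
import random
from itertools import combinations

def centroid(cluster, codebook):
    """Component-wise mean of the codewords whose indices are in `cluster`."""
    dim = len(codebook[0])
    return [sum(codebook[i][k] for i in cluster) / len(cluster)
            for k in range(dim)]

def dn_partition(codebook, iterations):
    """Return a list of clusters, each a tuple of codeword indices."""
    pool = [(i,) for i in range(len(codebook))]  # clusters waiting to be merged
    formed = []                                  # clusters merged in this pass

    def dist(a, b):
        return math.dist(centroid(a, codebook), centroid(b, codebook))

    for _ in range(iterations):
        if not pool:                  # pass finished: merged clusters are reused
            pool, formed = formed, []
        if len(pool) < 4:             # assumption: two disjoint merges need 4 clusters
            break
        # Diameter cluster pair: the two clusters that are farthest apart.
        a, b = max(combinations(pool, 2), key=lambda pair: dist(*pair))
        pool.remove(a)
        pool.remove(b)
        for cluster in (a, b):
            # Assumption: the neighbor is taken from the not-yet-merged clusters.
            nearest = min(pool, key=lambda other: dist(cluster, other))
            pool.remove(nearest)
            formed.append(tuple(sorted(cluster + nearest)))
    return formed + pool              # remaining clusters join the result

# Hypothetical 128-codeword, 3-dimensional codebook used only for the demonstration.
random.seed(2)
codebook = [[random.random() for _ in range(3)] for _ in range(128)]
clusters = dn_partition(codebook, 48)
print(len(clusters), {len(c) for c in clusters})
```

With 128 codewords, the first 32 iterations pair up all codewords and the next 16 iterations merge the pairs, so 48 iterations leave 32 clusters of four codewords each.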

Figure 3 is provided as an example to illustrate the proposed codebook partition algorithm. The white circle “○” denotes a codeword. The shaded oval denotes a codeword and its neighbor being processed, while the unshaded oval represents a cluster that has been formed. The “0,” “1,” “00,” “01,” “10,” or “11” in a circle “○” is the label of a codeword in the cluster. The cross marks the centroid of the cluster it belongs to, and a line “−” represents the diameter of a cluster set. The first to third merging iterations are shown in Figures 3(a)–3(c), respectively. The fourth merging iteration comprises Figures 3(d) and 3(e), and Figure 3(f) demonstrates the labelling of the codewords.

3.2. Embedding Procedure

In our proposed method, the ISF indices corresponding to the codewords in the codebook are first obtained by parsing the host AMR-WB speech. Then the ISF indices are employed to embed the secret message based on the codebook partition. Generally, the codewords that lie in the same cluster as the codeword referred to by an ISF index are considered to be replaceable with each other. According to the secret message to be embedded, the ISF index may be substituted by the index of one of the other codewords within the same cluster. The number of secret message bits that can be embedded depends on the size of the specific cluster. The embedding procedure is given in the following.

Step 1. Search the cluster set $\Omega$ for the cluster $S$ which contains the codeword $c_i$ referred to by the ISF index $i$.

Step 2. If there are $m$ codewords in $S$, the number of secret bits that can be embedded into $i$ is calculated as $n = \log_2 m$.

Step 3. Read $n$ not-yet-embedded bits, denoted by $w$, from the secret message. The index $i$ is replaced with $i'$, which indexes the codeword in $S$ with the same label as $w$.

Step 4. Repeat Steps 1–3 until all the secret bits are embedded.

Figure 4 is an example of embedding two secret bits into one cover ISF index. Let us assume that the cluster set contains two clusters and that the codeword indexed by $i$ is $c_i$. The cover ISF index shown in Figure 4 is then replaced with the index of the codeword in the same cluster whose label equals the secret bits “01.”
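Steps 1–4 can be sketched in Python as follows. The sketch is illustrative only: the function name `embed`, the toy cluster set, and the convention that a codeword's label is the binary form of its position inside the cluster tuple are assumptions, and the message length is assumed to be known to the receiver.

```python
def embed(cover_indices, secret_bits, clusters):
    """Embed a bit string into a sequence of ISF indices.

    `clusters` is a list of tuples of codeword indices whose sizes are
    powers of two. Returns the stego indices and the number of bits embedded.
    """
    owner = {idx: cluster for cluster in clusters for idx in cluster}
    stego, pos = [], 0
    for idx in cover_indices:
        cluster = owner[idx]                     # Step 1: find the cluster
        n = len(cluster).bit_length() - 1        # Step 2: n = log2(cluster size)
        chunk = secret_bits[pos:pos + n]         # Step 3: next n secret bits
        if n > 0 and len(chunk) == n:
            idx = cluster[int(chunk, 2)]         # codeword whose label equals chunk
            pos += n
        stego.append(idx)                        # Step 4: continue with next index
    return stego, pos

# Toy cluster set with two clusters of four codewords each (hypothetical values).
clusters = [(0, 1, 2, 3), (4, 5, 6, 7)]
stego, used = embed([2, 5, 7], "011100", clusters)
print(stego, used)
```

In this toy run the first cover index 2 is replaced by the codeword labelled “01” in its own cluster, so the substitution never leaves the cluster of the cover codeword.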

3.3. Extracting Procedure

When the stego AMR-WB speech is transferred to the intended receiver, the stego indices may be obtained by parsing AMR-WB speech stream and used to extract the embedded secret message. The message extraction procedures from the stegoindex are given below.

Step 1. Search the cluster set $\Omega$, which is the same as that employed in the embedding procedure, for the cluster $S$ which contains the codeword referred to by the stego ISF index $i'$.

Step 2. If there are in total $m$ codewords in $S$, the number of secret bits carried by $i'$ is computed as $n = \log_2 m$.

Step 3. Read the $n$-bit label of the codeword indexed by $i'$ as the extracted bits, which are appended to the secret message bit sequence.

Step 4. Repeat Steps 1–3 until all the secret bits are recovered.

Figure 5 is the corresponding example of extracting two secret bits from the stegoindex generated by the previous embedding instance shown in Figure 4. It can be easily seen that the extracted secret bits are identical to the embedded secret bits.
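The extraction steps admit an equally short sketch. As before, this is an assumption-laden illustration: the function name `extract` and the toy cluster set are hypothetical, and a codeword's label is assumed to be the binary form of its position inside the cluster tuple, which must match the convention used by the sender.

```python
def extract(stego_indices, clusters):
    """Recover the embedded bit string from a sequence of stego ISF indices."""
    bits = []
    for idx in stego_indices:
        cluster = next(c for c in clusters if idx in c)   # Step 1: find the cluster
        n = len(cluster).bit_length() - 1                 # Step 2: bits carried
        if n > 0:
            label = format(cluster.index(idx), f"0{n}b")  # Step 3: read the label
            bits.append(label)
    return "".join(bits)                                  # Step 4: all indices done

# Same toy cluster set as a sender would use (hypothetical values).
clusters = [(0, 1, 2, 3), (4, 5, 6, 7)]
print(extract([1, 7, 4], clusters))
```

Since the receiver only reads labels, extraction needs the shared cluster set but not the original cover indices.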

4. Experimental Results and Analysis

In order to demonstrate the performance of the proposed method, the perceptual quality of the stego AMR-WB speech with secret message embedded using our method is computed and compared to that of the stego AMR-WB speech generated with CNV and NID steganography. Moreover, the flexibility of embedding capacity and the security regarding statistical detection are analyzed in detail.

4.1. Audio Database

TIMIT acoustic-phonetic continuous speech corpus (https://catalog.ldc.upenn.edu/ldc93s1) is an audio database which contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, and all audio sentences are sampled at 16 kHz. In our experiments, 1000 audio sentences are randomly chosen from TIMIT database. The average, maximum, and minimum length of the chosen audio sentences are 3.47 s, 3.96 s, and 3.12 s. All audio files are converted into AMR-WB format using standard codec.

4.2. Speech Quality Evaluation

The perceptual evaluation of speech quality (PESQ) described in the ITU-T P.862 Recommendation [20] may be employed to evaluate speech quality. Moreover, according to ITU-T P.862.2 [21], the raw PESQ score can be converted to mean opinion score-listening quality objective (MOS-LQO), which is more suitable for evaluating wideband speech. Hence, MOS-LQO is applied in our experiments. The normal range of MOS-LQO score is 1.017 to 4.549. The higher the score, the better the quality.

Figure 6 shows the MOS-LQO scores of the 1000 cover AMR-WB speeches in 23.85 kbit/s mode and the corresponding stego AMR-WB speeches using three different codebook partition algorithms. Three progressive embedding rates, that is, 100 bps, 200 bps, and 300 bps, are employed in our experiments. The indices of speech samples are sorted according to the MOS-LQO scores of our proposed method. It can be seen from Figure 6 that the overall scores of the stego AMR-WB speeches generated with our method are higher than those of the NID-based stego AMR-WB speeches, especially when the embedding rates are 200 bps and 300 bps. And the MOS-LQO scores of the CNV-based stego AMR-WB speeches are slightly higher than ours when the embedding rate is 100 bps, which means there are no obvious discrepancies in speech quality between them. Besides, when the high embedding rate, that is, 200 bps or 300 bps, is used, the decrease in MOS-LQO scores of our stego AMR-WB speeches is significantly smaller than that of NID-based steganography.

Moreover, Table 1 gives the average MOS-LQO scores of the cover AMR-WB speeches and of the stego AMR-WB speeches generated with the three codebook partition algorithms (CNV, NID, and DN) for four rate modes (12.65 kbit/s, 15.85 kbit/s, 19.85 kbit/s, and 23.85 kbit/s) and three embedding rates (100 bps, 200 bps, and 300 bps). For the embedding rates of 200 bps and 300 bps, only the scores of the NID-based and DN-based methods are given, because the embedding capacity of CNV-based steganography cannot exceed 100 bps.

When the embedding rate is 100 bps, which is almost the limit of CNV steganography, we can see from Table 1 that the mean MOS-LQO scores of our proposed method are only about 0.3% worse than CNV-based steganography. The slight decrease may be almost imperceptible by human auditory system (HAS). And there are significant increases of approximately 3.8% in the mean MOS-LQO scores when our presented method is compared to NID-based steganography. And it can be observed that when the embedding rates are 200 bps and 300 bps, the scores of our approach are improved by about 7% and 15% correspondingly in contrast to those of NID-based steganography.

Furthermore, the experimental results of the four rate modes are similar. The decrease in speech quality caused by NID-based steganography is more than twice that caused by DN-based steganography. The proposed method can obtain twice the embedding capacity of CNV-based steganography by sacrificing less than 2% of speech quality on average. In addition, the speech quality of the proposed DN-based method at 300 bps is only slightly lower than that of the NID-based method at 200 bps.

4.3. Flexible Embedding Capacity

Compared to CNV-based steganography, the proposed method provides flexible embedding capacity to satisfy different practical demands. The steganographic capacity can be adjusted by changing the iteration parameter $T$. For different values of $T$, the average embedding capacity and the MOS-LQO scores are given in Figure 7(a), and the corresponding results of NID-based steganography are provided in Figure 7(b) for comparison. Without loss of generality, only the 23.85 kbit/s mode is used.

From Figure 7, we can observe that the embedding rate increases significantly with the iteration parameter $T$ while the MOS-LQO score goes down only slightly. However, as far as NID-based steganography is concerned, the MOS-LQO score declines rapidly with the increase of the embedding rate. Therefore, the proposed DN-based steganography can achieve higher embedding capacity with only a slight decrease in speech quality. For example, when $T = 48$, the size of each cluster in the cluster set $\Omega$ is equal to 4 and we can embed $2 \times \log_2 4 = 4$ bits per frame, one ISF index from each of the two codebooks; with 20 ms frames, that is, 50 frames per second, the embedding rate is 200 bps. In contrast, the CNV algorithm can embed at most 2 bits per frame (100 bps).
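The arithmetic linking the iteration parameter to the embedding rate can be checked with a small Python sketch. It rests on several assumptions that are not stated in the paper: every iteration merges four clusters into two, merging stops when fewer than four clusters remain, and the ISF indices are used uniformly, so the figures are nominal rates rather than the measured averages of Figure 7. The function names are hypothetical.

```python
import math

FRAMES_PER_SECOND = 50      # one frame every 20 ms
CODEBOOKS_PER_FRAME = 2     # second and third codebooks of the second stage
CODEBOOK_SIZE = 128         # codewords per codebook

def cluster_sizes(iterations, codebook_size=CODEBOOK_SIZE):
    """Map cluster size -> number of clusters after `iterations` merges."""
    pool_size, pool_count = 1, codebook_size   # clusters waiting to be merged
    formed_count = 0                           # clusters of size 2 * pool_size
    for _ in range(iterations):
        if pool_count == 0:                    # start the next merging pass
            pool_size, pool_count, formed_count = 2 * pool_size, formed_count, 0
        if pool_count < 4:
            break
        pool_count -= 4                        # diameter pair plus two neighbors
        formed_count += 2                      # two merged clusters
    return {pool_size: pool_count, 2 * pool_size: formed_count}

def embedding_rate(iterations):
    """Nominal embedding rate in bps, assuming uniformly used ISF indices."""
    sizes = cluster_sizes(iterations)
    bits_per_index = sum(size * count * math.log2(size)
                         for size, count in sizes.items()) / CODEBOOK_SIZE
    return bits_per_index * CODEBOOKS_PER_FRAME * FRAMES_PER_SECOND

for t in (32, 40, 48, 56):
    print(t, embedding_rate(t))
```

Under these assumptions, 32, 48, and 56 iterations correspond to nominal rates of 100 bps, 200 bps, and 300 bps, and intermediate values such as 40 iterations give rates in between.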

4.4. Resistibility of Statistical Steganalysis

Speech steganography aims to hide secret message into cover speech without arousing suspicion. It is very important for a steganographic method to resist statistical steganalysis, which is the technique of detecting the presence of hidden message. Two state-of-the-art steganalytic methods [17, 18] are used to evaluate the performance of statistical undetectability of our proposed method. In [17], mel-cepstrum coefficients and Markov transition features from the second-order derivative of the audio signal are extracted to capture the statistical distortions caused by audio steganography, while, in [18], the correlation characteristics of split vector quantization codewords of linear predictive coding filter coefficients are utilized to steganalyze QIM-based steganography in low-bit-rate speech (such as G.723.1 and G.729). Both steganalytic methods use a support vector machine to predict the existence of hidden message in given audios.

In our experiments, the sentences chosen from the “TIMIT” database as stated in Section 4.1 are first encoded using the standard AMR-WB codec. These AMR-WB recordings constitute the cover speech set. Then the secret message is embedded into each cover AMR-WB speech with different embedding rates, that is, 100 bps, 200 bps, and 300 bps, by CNV-based, NID-based, and DN-based steganography. The rates of 200 bps and 300 bps are omitted for CNV-based steganography because of its limited embedding capacity. Seven stegospeech sets are thus generated: one for the CNV-based method and three each for the NID-based and DN-based methods. Moreover, only the 23.85 kbit/s mode is used without loss of generality.

In each experiment, a pair of cover and stego speech sets is randomly divided into training and testing sets according to three training rates, that is, 0.4, 0.5, and 0.6. For example, if the training rate is 0.4, the training set contains 40% of the speech samples randomly chosen from each of the cover and stegospeech sets, and the remaining 60% of the samples go into the testing set. As described in [17, 18], LIBSVM [22] is used as the classifier, and the radial basis function (RBF) kernel and the grid-search technique are employed to obtain better classification performance. For Li et al.’s steganalytic method, principal component analysis (PCA) is first used, as suggested in [18], to reduce the dimension of the feature vectors to 300. Let the samples in the cover speech set denote negatives and those in the stego speech set stand for positives. Hence, the accuracy may be defined as follows:

$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}},$$

where TP are true positives, TN are true negatives, FN are false negatives, and FP are false positives.
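The evaluation protocol can be summarized with a small Python sketch covering only the random split and the accuracy computation; feature extraction and the LIBSVM classifier are left out. The function names and the fixed seed are assumptions made for illustration.

```python
import random

def split(samples, training_rate, seed=0):
    """Randomly divide `samples` into training and testing lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(training_rate * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples (stego = positive)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Each of the cover and stego sets is split with the same training rate.
cover_train, cover_test = split(range(1000), 0.4)
stego_train, stego_test = split(range(1000), 0.4, seed=1)
print(len(cover_train), len(cover_test))
print(accuracy(tp=300, tn=300, fp=300, fn=300))
```

An accuracy close to 0.5 on such a balanced testing set means that the classifier does no better than random guessing.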

The steganalytic results are given in Table 2. It can be seen that when the embedding rate is 100 bps, the accuracy of detecting the CNV-based and DN-based methods is almost the same, about 50%, while that of detecting NID-based steganography increases to 60% when the MFCC-based steganalytic method is applied. Moreover, there is an apparent increase in the accuracy of detecting the NID-based hiding method as the embedding rate increases to 200 bps or 300 bps when Liu et al.’s methods (i.e., the Markov and MFCC-based steganalytic methods) are applied. In contrast, the accuracy of steganalyzing our proposed method, DN-based steganography, stays at the same level of about 50%. Therefore, the proposed method may defend against Liu et al.’s statistical steganalysis [17] even at higher embedding rates.

According to the definition of the correlation index given in [18], the experimental results of the correlation indices of 1000 AMR-WB speeches, which are randomly selected from “TIMIT,” are shown in Figure 8. Based on these results, two strong quantization codeword correlation network (QCCN) models, namely, SS-QCCN and RS-QCCN, can be constructed as illustrated in Figure 9. These two models are then used to steganalyze our proposed steganography. The steganalytic results are also presented in Table 2. It can be seen from Table 2 that the accuracy of both SS-QCCN and RS-QCCN is less than 50% for all of the AMR-WB stegospeeches. A possible reason is that only the second and third codebooks in the second stage are employed in the AMR-WB speech steganography, which means that only the two vertices corresponding to these codebooks in each frame may be changed during steganography, while none of them is utilized in Li et al.’s steganalytic method except for the edge “33” in the RS-QCCN model. Besides, we also used an adapted QCCN model (i.e., one utilizing the edges “22,” “33,” and “23′”) targeted at AMR-WB speech, but the accuracy is still less than 50%. This may be because the correlation of those edges is not strong enough for steganalysis according to Figure 8. Therefore, it is reasonable to conclude that the AMR-WB speech steganography can defend against the steganalytic method proposed in [18].

In order to visualize the detection performance, the receiver operating characteristic (ROC) curves of steganalyzing CNV-based steganography with the 100 bps embedding rate and NID-based and DN-based steganography with the 100 bps, 200 bps, and 300 bps embedding rates are provided in Figure 10 (the ROC curves for SS-QCCN and RS-QCCN are omitted because these two methods fail to steganalyze AMR-WB steganography regardless of the embedding capacity). Figure 10 shows that all three steganographic methods can resist statistical steganalysis when the embedding rate is 100 bps. While the statistical steganalytic methods, especially MFCC-based steganalysis, may detect the existence of a hidden message embedded with NID-based steganography when the embedding rate is above 100 bps, the proposed DN-based steganography may still have good security against both Markov-based and MFCC-based steganalysis.

5. Conclusion

The adaptive multirate wideband (AMR-WB) codec is a widely adopted format in mobile handsets and is also the recommended speech codec for VoLTE. AMR-WB speech may be a good candidate for the cover medium in speech steganography. In this paper, a novel AMR-WB speech steganographic method is proposed. The experimental results demonstrate the effectiveness of our proposed method. The main contributions of this paper are as follows:
(1) A novel AMR-WB speech steganography is proposed based on the diameter-neighbor codebook partition algorithm. It can provide higher capacity without a noticeable decrease in speech quality and better performance against statistical steganalysis than the NID-based method.
(2) Flexible embedding capacity may be easily achieved with different numbers of cluster merging iterations. Twice the embedding capacity of the CNV-based embedding method may be obtained with 48 iterations of cluster merging.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under Grant no. 61632013.