Table of Contents Author Guidelines Submit a Manuscript
Security and Communication Networks
Volume 2018, Article ID 1781384, 21 pages
Research Article

Blind Key Based Attack Resistant Audio Steganography Using Cocktail Party Effect

1St. Thomas’ College of Engineering & Technology, Kolkata 700023, India
2University of Calcutta, Kolkata 700098, India

Correspondence should be addressed to Barnali Gupta Banik; moc.liamg@ilanrab.atpug

Received 19 October 2017; Revised 13 January 2018; Accepted 5 February 2018; Published 16 April 2018

Academic Editor: Emanuele Maiorana

Copyright © 2018 Barnali Gupta Banik and Samir Kumar Bandyopadhyay. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Steganography is a popular technique of digital data security. Among all digital steganography methods, audio steganography is very delicate as human auditory system is highly sensitive to noise; hence small modification in audio can make significant audible impact. In this paper, a key based blind audio steganography method has been proposed which is built on discrete wavelet transform (DWT) as well as discrete cosine transform (DCT) and adheres to Kerckhoff’s principle. Here image has been used as secret message which is preprocessed using Arnold’s Transform. To make the system more robust and undetectable, a well-known problem of audio analysis has been explored here, known as Cocktail Party Problem, for wrapping stego audio. The robustness of the proposed method has been tested against Steganalysis attacks like noise addition, random cropping, resampling, requantization, pitch shifting, and mp3 compression. The quality of resultant stego audio and retrieved secret image has been measured by various metrics, namely, “peak signal-to-noise ratio”; “correlation coefficient”; “perceptual evaluation of audio quality”; “bit error rate”; and “structural similarity index.” The embedding capacity has also been evaluated and, as seen from the comparison result, the proposed method has outperformed other existing DCT-DWT based technique.

1. Introduction

In the present era, communicating through Internet has become vulnerable as there may be several intruders who can eavesdrop for secret messages to capture and disburse them for unlawful misconducts. Henceforth nowadays it is most necessary to camouflage secret message in such a way that stego cannot be identified as carrier of secret message. Camouflaging secret message through carrier objects introduces the age-old technique of steganography. However, with the current enormous use of Internet and elevation of various Steganalysis attacks, it is required to have an extra shield to protect steganography techniques. This is the reason cocktail party effect in audio steganography has been explored to ensure enhanced security during data transmission.

2. Related Work

2.1. Audio Steganography Techniques

In audio steganography, audio is used as cover media. In [1], authors have described different spatial and frequency domain techniques of audio steganography. The popular spatial domain techniques are as follows.

Least Significant Bit (LSB) Encoding. This is the simplest method of audio steganography where Least Significant Bit of each audio sample is modified with bits of secret message vector. With the extensive use of this method it becomes more prone to attack and its embedding capacity is poor compared to others. To cope up with the necessity of increasing capacity, authors of [2] have proposed an enhanced method of LSB technique where it has been proved that 2nd and 3rd LSB modification does not make audible difference in audio sample. In [3], authors have suggested another enhancement over LSB technique by shifting LSB modification from 3rd bit to 4th bit which incur more embedding capacity compared to previous methods of LSB encoding.

Parity Encoding. In this approach, audio signal is broken into number of samples [4]. Depending on sample’s parity bit, secret message is embedded in the LSB of the sample byte stream.

Echo Hiding. In this method, a short echo signal is introduced as part of cover audio where secret message is hidden [5]. Study shows that the echo signal is inaudible provided the delay between cover audio and echo signal is up to 1 ms.

The widespread frequency domain techniques are as follows.

Phase Coding. As human auditory system cannot percept phase component modulation, hence, in this technique, secret data is embedded by modification of selected phase component of cover audio signal. Using psychoacoustic model, a threshold is calculated which can be used as masking threshold [6]. In [7], authors have used difference between the phase values of the selected component frequencies and their adjacent frequencies of the cover signal as a medium to hide secret data bits. This method provides more robustness than the previous approaches.

Spread Spectrum. The basic principle of spread spectrum is to spread the secret message over the frequency spectrum of cover audio signal. In [8], Direct Sequence Spread Spectrum is used to hide text data in an audio. Here a key is used to embed message to the noise. In [9], authors have discovered that low spreading rate improves performance of spread spectrum audio steganography. Therefore, authors have proposed a technique which decreases correlation between original signal and spread data signal by having phase shift in each subband signal of original audio.

Discrete Wavelet Transforms (DWT). DWT decomposes a signal in four frequency components, popularly known as subbands. These sub bands are Low-Low (LL), Low-High (LH), High-Low (HL), and High-High (HH), as shown in Figure 1. The LL subband describes approximation details. The HL band demonstrates variation along the -axis or horizontal details and the LH band demonstrates the -axis variation or vertical details [10]. In other words, the low frequency subband is a low-pass approximation of the original signal and contains most energy of the signal. The other subbands include mainly detailed components which have low energy level. This is the reason LH subband is very popular for data hiding.

Figure 1: Block diagram of 1 level 2D DWT.

In [11], authors have proposed a method to create DWT of cover audio and select higher frequency to embed image data using low bit encoding technique. In [12], authors have decomposed the cover audio signal using Haar DWT and then choose coefficient to embed data. This is done using a precalculated threshold value to flip data. In [13], secret audio is embedded using synchronizing code in the low frequency part of DWT of cover audio.

Discrete Cosine Transforms (DCT). DCT is used to convert a signal from spatial domain into frequency domain. DCT decomposes a signal into a series of cosine functions. The two-dimensional DCT can be performed by executing one-dimensional DCT twice, initially in the direction, next by direction. The formulation of the 2D DCT for an input signal with rows and columns and the output signal has been given in where and and

Inverse 2D DCT is also available to transform a frequency domain coefficient to spatial domain signal, as specified in where and .

DCT can be performed in block-by-block basis like , , and blocks.

As shown in Figure 2(a), the top left coefficient is called DC coefficient holding the approximate value of the whole signal; normally it has coefficients with zero frequency and the remaining 15 coefficients are called AC coefficients holding most detailed parameters of the signal, having coefficients with nonzero frequency. There are some DCT coefficients which hold quite similar values. Human brains are less sensitive to detect changes where all the elements hold more or less the same value. Therefore, this region of similar values can be selected for data hiding purpose. This region is known as midband region, as shown in Figure 2(b).

Figure 2

In [14], authors have used speech signal as cover, where voiced and nonvoiced part of the speech are separated by zero crossing count and short time energy. The secret data is embedded by modifying DCT coefficient of nonvoiced part. In [15], authors have decomposed the cover audio in nonoverlapping block and secret data is hidden in the DC coefficient and 4th AC coefficient in line. In [16], authors have embedded secret data in the low frequency component of DCT quantization. In [17], authors have decomposed the cover audio into block and then each of those blocks was decomposed further into frames. Embedding of secret message depends on the difference between first or last two frames.

2.2. Correlation Coefficient (CC)

A correlation coefficient is a measure of linear relationship between two random variables. This term was first coined by Karl Pearson in 1896. The value of correlation coefficient can vary from −1 to 1. If the value is perfect −1 or 1 that indicates both variables are linearly related. If the value is 0 that indicates there is no relation between the said variables. Moreover, the sign indicates that the variables are positively related or negatively related [18]. There are three types of correlation coefficients: Pearson’s coefficient (), Spearman’s rho coefficient (), and Kendall’s tau coefficient (). Pearson’s coefficient, which is also known as product-moment correlation coefficient, is the most widely used popular correlation coefficient. It is given by paired measurements as mentioned in where and are the mean of and , respectively. Correlation coefficient can also be used as quality metrics to measure similarity between two signals.

2.3. Arnold Transform

Arnold’s Transform is a chaotic bidirectional map proposed by Vladimir Arnold in 1960. A chaotic map is an evaluation function which demonstrates some sort of chaotic nature, as seen in the following transformation function:

An image is collection of pixels in row and column arrangement, which can be organized in square or nonsquare shape. If Arnold transform is applied to an image, it scrambles the image by “” times iteration (e.g., iteration 1 will scramble less and iteration 10 will scramble more), which makes the image imperceptible. This undetectable image format can be used for data hiding securely as it is unable to reveal any existence of secret data. Hence scrambling an image can be a preprocessing step of data hiding technique.

Traditionally Arnold transform can be applied only for square matrices; however later it has been improvised to apply on any matrix, by where is the element of original matrix and () is the element of transformed matrix and is the order of the matrix; as shown in Figure 3, the point () is sheared through - and -axis to get ().

Figure 3: Representation of point () sheared to point ().

The function is important to regenerate the original image. The functions to shear in -axis, -axis, and modulo function is represented in

Arnold transformation is reversible [19]. To recover original image from scrambled image there are two ways, the traditional way is periodicity, and the better approach is to use inverse matrix, which is also known as Reverse Arnold Transformation [20] and expressed by

In [21], authors have used Arnold’s transformation to scramble the image before embedding into the DWT coefficient of cover audio. In [22], authors have embedded scrambled image in “Redundant Discrete Wavelet Transform” coefficient using Singular Value Decomposition (SVD) technique. In [23], authors have proposed data hiding in DWT and DCT domain using SVD where the secret image is scrambled before embedding.

2.4. Cocktail Party Problem

Cocktail Party Problem is a classic example of source separation which is very popular in digital signal processing. In this problem, several people are talking to each other in a banquet room and a listener is trying to recognize one specific speech from that crowd of partying guests. Human brain can distinguish one explicit signal component from a mixed signal combination in real time which is popularly known as “Auditory Scene Analysis.” However, in digital signal processing, it is difficult to extract only one speaker’s voice from the rest in cocktail party situation.

In [24], Colin Cherry first revealed the ability of human auditory system to separate a single speech or audio from a combination of voices, which may turn into noise through properties like pitch, gender, rate of speech, and/or direction of speech. This task of separating single source audio from a noise is known as dichotic listening task [25]. In [26], authors have reviewed the same techniques to train machine to segregate signals. In [27], Broadbent has concluded that simultaneous listening can be performed for small messages, not for long ones. Human ability to identify audio from a mixed signal can be improved by listening by two ears [28]. It has been seen that, in ideal circumstances, the signal detection threshold of binaural listening is 25 dB more than monaural listening. In [29], it has been stated that cocktail party effect can be explained by Binaural Masking Level Difference (BMLD). As per BMLD, for binaural listening the desired signal coming from one direction is ineffectively masked by the noise generated in different direction. In [30], Kassebaum et al. discussed two methods for signal separation—Back Propagation (BP) and Self-Organizing Neural Network (SONN). That experiment was carried out through 4 kHz channel using a modem data signal and a male speech signal. It has been concluded that BP requires more inputs and training time than SONN.

In [31] authors have discussed 3 types of approach to solve Cocktail Party Problem:(i) Temporal binding and oscillatory correlation(ii) Cortronic network(iii) Blind source separation.

In [32], von der Malsburg explained the temporal binding technique. He stated that neuron carries two distinct signals and the binding is accomplished by correlation. The synchronization allows neuron to create topological network. In [33], von der Malsburg and Schneider proposed a cocktail party processor enhancing this idea—the Oscillatory Correlation which is the basis of Computational Auditory Scene Analysis. In [34, 35], multistage neural model has been proposed to separate speech from interfering sounds using oscillatory correlation.

In [36], authors have proposed a biological approach to solve Cocktail Party Problem using artificial neural network named as cortronic network. A cortronic neural network describes connection among neurons in several regions which demonstrates the output links of each neuron and the strength of the connections.

The Blind Source Separation (BSS) is the technique of separating signal from a mixed source without having knowledge of source signals and the process of mixing. There are different methods of BSS among which Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Time and Frequency domain approaches are significant. PCA and ICA are both statistical approaches which are better than Time or Frequency domain approach, since Fourier components of data segments are fixed in frequency domain whereas in statistical domain the transformation depends on the data to be analyzed [37].

PCA is a mathematical technique of transforming large correlated dataset into a small number of major components known as principal components [38]. It is moderately related to mathematical theory of Singular Value Decomposition (SVD), which is used to implement PCA [39]. Independent Component Analysis can also be implemented with SVD, though there are subtle differences between PCA and ICA. The aim of PCA is to find decorrelated variables whereas the aim of ICA is to find independent variables. PCA and ICA both perform matrix factorization for linear transformation, though PCA perform low rank matrix factorization whereas ICA performs full-rank matrix factorization. The advantage of ICA over PCA is that PCA just removes correlations whereas ICA removes correlations and higher order dependencies [40]. ICA has extensive use in biomedical imaging and audio processing [41]. ICA can also be used for transformation to independent variable using multiplication of observed data and for demixing matrix [42]. It depends on the fact that there are as many sources as channels of data available, which are to be separated as independent sources—by utilizing this fact, ICA is used in Blind Source Separation. In [43], author described a fast method for ICA using fixed point iteration. This algorithm is popularly known as FastICA.

In Table 1, comparison of the existing techniques for solving Cocktail Party Problem has been discussed. It can be noted that each of these techniques has its own advantage and disadvantages. However, as blind steganographic approach is considered more robust and secure than the nonblind steganography techniques, hence, in this proposed method, “Blind Source Separation” approach has been chosen for solving cocktail party effect.

Table 1: Advantage and disadvantage of different approaches for solving Cocktail Party Problem.

3. Proposed Method

3.1. In a Nutshell

Steganography can be broadly grouped into two types: blind and nonblind techniques. The technique where cover object is not required to retrieve the secret is called blind steganography. The method where cover object is required to regain secret is called nonblind or cover escrow technique of steganography. To create a most robust method of steganography, here a blind steganography technique has been proposed.

In this proposed method, image has been used as secret message. This secret image is scrambled using Arnold transform. Then Haar filter is applied for two-dimensional DWT on the cover source audio. Since audio is one-dimensional signal, hence it must be reshaped into two-dimensional matrix to perform 2D DWT. Haar is simple, fast, and memory efficient compared to other available DWT filters like Daubechies and Coiflets. After DWT application, LH subband has been chosen for further decomposition into blocks where two-dimensional DCT has been applied. As shown in Figure 2(b), in Section 2.1, midband region of those blocks has been chosen and embedding has been performed by the following equation:where indicates midband frequency region; is the embedding factor; and PN is the pseudorandom number. Equation (9) has been further explained in Section 3.6; embedding factor () has been discussed in Section 3.4 and pseudorandom number (PN) has been discussed in Section 3.5.

After embedding, the resultant cover becomes stego audio. To increase security of the proposed method, this stego audio is blended with other audio signals to produce cocktail party effect—afterwards this has been securely transmitted through the web to reach the intended recipient. Even if any intruder is able to break the communication channel and get access to the transmitted media, neither he would decipher the cocktail party effect to identify stego audio nor he would able to decode the stego audio to recognize the secret message without knowing the key required for extraction, whereas the intended receiver knowing the key as well as the entire algorithms is able to easily extract the secret message implanted without any loss of data. The proposed method is also tested against well-known Steganalysis attacks and the outcomes are quite impressive (discussed in Section 4.3)—hence this technique provides complete security.

Once the intended recipient receives the cocktail effect, using the demixing algorithm (discussed in Section 3.8) s/he can separate the audios and can also apply the extraction procedure on them, as the recipient is aware of the key. The extraction algorithm performs correlation between the coefficients and extracts the secret bits, from which the scrambled secret image can be generated. Finally, by applying inverse Arnold transform, the secret image can be reconstructed. The flowcharts for embedding and extraction procedure have been shown in Figures 4 and 5, respectively.

Figure 4: Flowchart of embedding procedure.
Figure 5: Flowchart of extraction procedure.
3.2. Input Preparation

Cover Audio Source. Any speech or music can be used here as cover audio sources. For this demonstration, popular English songs have been chosen—as mentioned below. All the audio sources have been sampled at 44100 kHz in monochannel with 16-bit depth, cut to 26 seconds’ duration for optimizing embedding capacity calculation, and finally saved as  .wav file.

The following are the audio sources used for this research experiment:(1)“My Heart Will Go On” by Celine Dion from film “Titanic” saved as tt.wav(2)“Beat It” by Michael Jackson from album “Thriller” saved as mj.wav(3)“Like a Rolling Stone” by Bob Dylan from album “Highway 61 Revisited” saved as bob.wav(4)Title song from film “Mamma Mia!” by Meryl Streep saved as mm.wav(5)Title song from film “High School Musical” by chorus saved as hsm.wav.

Secret Image. Though any types of grayscale image (.jpg or  .bmp) can be used here as secret, however for this experiment binary images (.pbm) have been chosen for better quality extraction. For this proposed method, secret images need to transform to binary, which is lossy conversion; hence any true-color RGB images cannot be applied here as, after extraction, the retrieved image will only have two colors—black and white. Secret image size here is taken as , which can be further increased if the length of input cover audio source is more than 26 seconds. For this experiment, secret images have been either downloaded from Internet (these do not have any copyright restriction) or drawn by Microsoft Paint software.

3.3. Scrambling and Descrambling Algorithm for Secret Image

The “Arnold transform” algorithm randomizes the input image by number of iterations to create scrambled image.

Input: Any binary Image (), number of iteration ()
Output: Scrambled Image ()
Algorithm: written as function Arnold ()
 Step : Find out the size of and store in and
 Step :
  for to
   for to
    for to
    Find out ;
     , mod

Once applied to the scrambled image, the “Reverse Arnold Transform” algorithm returns the original secret image after specified iterations.

Input: Any scrambled binary Image (), number of
iteration ()
Output: Descrambled Image ()
Algorithm: written as function iArnold ()
 Step : Find out the size of and store in and
 Step :
  for to
   for to
    for to
     Find out ;
3.4. Embedding and Multiplicative Factors

As shown in (9) in Section 3.1, embedding factor () has been multiplied with PN to offset the increment of DCT coefficient value such that, after embedding, stego audio will not have any audible noise. Hence the value of α must be between 0 and 1. After repeated experiments, it has been observed that when value of embedding factor nears 1, then the extracted message is having very high PSNR and SSIM—which tends to high robustness—however simultaneously, in stego audio, there are audible artifacts identified, which is differentiating with the cover audio. This signifies value of α near to 1 compromise imperceptibility. On the other hand, if the value of approaches 0, the stego audio would be just like the original cover audio (the PSNR between these two audios reaches around 100 dB), whereas then the secret image extracted is completely corrupted. These test results indicate that, to get an optimum outcome, the tradeoff must be done between robustness and imperceptibility.

While experimenting with several cover audios along with various secret images, it has been also noticed that keeping a constant value of embedding factor () cannot ensure similar quality outcome, after extraction. Henceforth it is decided to set α depending on the cover to generate the optimal result. As the data hiding takes place in the LH subband of DWT, hence, to formularize , maximum coefficient value of the LH subband has been chosen as one of the aspects of the following formula:

Finally, for this proposed method, the value of Multiplicative Factor has been universally set as 0.2, based on the experimental outcome, as shown in Table 2.

Table 2: Experimental Results with different embedding and multiplicative factors.
3.5. Pseudorandom Number

For embedding secret into cover, in this proposed method “pseudorandom number” (PN) has been used; PN is generated using Linear Feedback Shift Register (LFSR), as shown in Figure 6. Here LFSR has been designed using only right shift operator and the operation of this shift register is completely deterministic. It must be initialized with a set of numbers and, at any given point, the value of LFSR can be determined by its present state.

Figure 6: Simplified block diagram of LFSR.

In this proposed method, two simple algorithms have been designed to generate two different sets of PN values for a given key with the same initial sequence of numbers. This initial sequence can be altered any time. Here, for easy illustration purpose, “” has been chosen as initial sequence.

Description: The below algorithm(s) generates endless
non-sequential lists of numbers in binary base
using Linear Feedback Shift Register.
Input: A number as Key
Output: Pseudo-random Numbers, PN1 and PN2
Algorithm 1: written as function SRPN1 (Key)
Step 1: set = Key;
Step 2: set initial state of shift register as
  state =
Step 3: set PN1 = ;
Step 4:
   for to
   PN1 = [PN1 state]
   if state == state
    then set temp = 0;
    else set temp = 1;
   set state = state;
   set state = state;
   set state = state;
   set state = state;
   set state = temp;
Algorithm 2: written as function SRPN2 (Key)
Step 1: set = Key;
Step 2: set initial state of shift register as
  state =
Step 3: set PN2 = ;
Step 4:
   for to
   PN2 = [PN2 state]
   if state == state
    then set temp 1 = 0;
    else set temp 1 = 1;
   if state == temp 1
    then set temp 2 = 0;
    else set temp 2 = 1;
   if state == temp 2
    then set temp 3 = 0;
    else set temp 3 = 1;
   set state = state;
   set state = state;
   set state = state;
   set state = state;
   set state = temp 3;
3.6. Embedding Algorithm

To ensure more security and imperceptibility, in this proposed method, the secret message is embedded in the transform domain using discrete wavelet transform (DWT) as well as by discrete cosine transform (DCT).  Description: algorithm for embedding secret data.Input: a Cover Audio (), Secret message as an image ()Output: a Stego Audio (Steg_Aud).Algorithm:Step 1: read cover audio ()Step 2: read secret message ()Step 3: set iteration as a number = Step 4: call function Arnold() which returns scrambled image ()Step 5: set Key as a number = Step 6: call function SRPN1() which returns PN1;Step 7: call function SRPN2() which returns PN2;Step 8: apply 2D DWT on to decompose in LL, LH, HL and HH;Step 9: find maxf = max(value of coefficients in LH);Step 10: set embedding factor () = Multiplicative Factor × maxfStep 11: apply 2D DCT over LH and get .Step 12: find mid-band coefficient region of and term it as mid();Step 13: if == 0then set mid() = mid() + α × PN1;else set mid() = mid() + α × PN2; end;Step 14: perform inverse DCT to get new(LH).Step 15: perform inverse DWT using LL, new(LH), HL, HH and get StegoStep 16: write Stego in Steg_Aud

3.7. Mixing Algorithm

This algorithm mixes two audio sources from two different channels to create cocktail effect of two audio signals.Input: two monochannel  .wav files ( and ) having same duration and sampling rate of 44100 HzOutput:  .wav files having cocktail sound effect ( and )Algorithm: written as function Mixing (, )Step 1: set Gain Factor () as decimal ()Step 2: read and in sig1 & sig2 while keeping their respective sampling frequencies stored in Fs1 and Fs2Step 3: set Mixed1 = sig1 + ( sig2) and Mixed2 = sig2 + ( sig1);Step 4: write Mixed1 in audio file with Fs1 and   write Mixed2 in audio file with Fs2

3.8. Demixing Algorithm

Here, for demixing, FastICA MATLAB package (ver. 2.5) has been used which estimates the independent components from given multidimensional signals using Blind Source Separation technique.Input: two  .wav files ( and ) containing mixed signals from different channelsOutput: two unmixed source.wav files ()Algorithm: written as function Demixing ()Step 1: read and in & while keeping their respective sampling frequencies stored in Fs1 and Fs2Step 2: find complex conjugate transpose of and , store them in and Step 3: create one matrix from and , store it in Step 4: set = FastICA();Step 5: extract two sources from as source1 and source2Step 6: write source1 in with Fs1 and source2 in with Fs2

3.9. Extraction Algorithm

Input: stego audio (Steg_Aud)Output: secret image ()Algorithm:Step 1: read Stego audio (Steg_Aud) in Step 2: set Key as a number = Step 3: call function SRPN1() which returns PN1;Step 4: call function SRPN2() which returns PN2;Step 5: apply 2D DWT on to decompose it in LL, LH, HL and HH;Step 6: apply 2D DCT over LH and get Step 7: find mid-band coefficient region of and term it as mid()Step 8: if Correlation(mid(), ) >= Correlation(mid(), )   then = 0 else = 1; end;Step 9: reshape the image bits stored in to get secret scrambled imageStep 10: set iteration as a number = Step 11: call function iArnold which returns secret image ()

4. Experimental Results and Analysis

This proposed method has been applied on several sets of cover audio and secret images, though, for efficient use of space, here only 2 sets of robustness test results have been presented for Steganalysis attacks.

4.1. Adherence to Kerckhoff’s Principle

In this research article, a key based steganography technique has been proposed. Hence it should follow Kerckhoff’s principle of cryptography [48], which says an exemplary method should be secure even if the public is aware of all the details of that method except the key. As mentioned in Section 3.5, here LFSR has been used both at sender’s end and at receiver’s end. It requires a unique key to generate the same set of pseudorandom numbers [49] which are used in embedding equation (9) and again in Step of the extraction algorithm for comparing correlation coefficients. If the exact same key is not used during embedding and extraction, then LFSR will generate different set of pseudorandom numbers using which secret image cannot be extracted from the stego audio. Henceforth it is proved that the proposed method complies with Kerckhoff’s principle.

4.2. Outcome of Quality Metrics

Embedding Capacity (EC). EC is measured by the ratio between size of hidden message (in bits) and size of cover (in bits), as shown in (11) below. In this research experiment, it has been observed that, to hide size of a secret image, it requires cover audio size of 1048576 bits—which implies embedding capacity value of 1.5625%. Similarly, to implant a secret image, 262144 bits of cover audio is needed—this again confirms the proportion of embedding capacity as 1.5625%.

Peak Signal-to-Noise Ratio (PSNR). PSNR represents the ratio between maximum power of test signal and the power of reference signal. The mathematical representation for PSNR is as follows:where is maximum signal value or maximum fluctuation in the input image data type (e.g., for 8-bit unsigned integer data type, is 255) and MSE is the Mean Squared Error, which is given by where represents original signal; represents degraded signal; and represent numbers of rows and columns of the signal matrix, respectively; represents index of row and represents index of column.

Structural Similarity Index (SSIM). SSIM is a measurement of similarity, calculated through luminance, contrast, and structural differences between two images as given below.where and are the mean of secret image S and extracted image E, respectively; and are the standard deviation of S and E; is correlation of S and E.

Bit Error Rate (BER). BER is defined by number of error bits divided by total number of transmitted bits, as shown in the following equation:

Here the BER is calculated between original secret image and extracted secret image.

Table 3 shows the quality outcome of the secret and extracted images with respect to PSNR, SSIM, BER, and correlation coefficient (CC, discussed in Section 2.2).

Table 3: Quality analysis of secret and extracted image.

Perceptual Evaluation of Audio Quality (PEAQ). PEAQ is a standardized metric to evaluate audio quality utilizing human perceptual properties, output of which is given in a scale of 1 to 5 (where 1 signifies poor and 5 implies excellent) depending on the Mean Opinion Score (MOS) of all listeners. The quality of output audio is measured by comparing with a reference audio.

Normalized Cross-Correlation (NCC). NCC quantifies degree of similarity between two signals. NCC computes normalized two-dimensional cross-correlation values between two image metrics. The values of correlation coefficients lie between −1 and 1, where 1 signifies identical images and −1 denotes totally different image. It is formulated aswhere is the extracted image and is the reference image. NCC is used to produce surface plot, which depicts functional relationship between two independent variables and map to a plane which is parallel to - plane. Here, in Figure 7, the surface plot of NCC between secret and extracted image has been shown.

Figure 7: Surface plot of NCC between secret and extracted image.

In Table 4, quality analysis of the cover and stego audio has been shown in PSNR, PEAQ, and CC.

Table 4: Quality analysis of cover and stego audio.
4.3. Robustness Tests by Steganalysis Attacks

By Random Cropping. On average, English music or a full song has duration of over 5 minutes, that is, more than 300 seconds. In this proposed method, only 25 seconds of audio is required to hide a secret image having size of . This secret can be kept anywhere within the stego, that is, at the start or at the end or after th seconds—in short, the secret can be moved throughout the cover and the exact place of hiding is not predetermined. That is why 9 out of 10 attempts of random cropping leave the secret image intact, as stego has been cropped elsewhere. For the remaining 1 out of 10 attempts, that is, when the stego audio has been cropped in such a place where secret image was embedded, Figures 8(a), 8(b), and 8(c) provide the results.

Figure 8

As shown in Figure 8(a), from a stego audio of 26 seconds’ duration, 8-second-long window (from 2nd to 10th second) has been chosen and the remaining audio signal has been replaced with zero. When the intended recipient applies the extraction mechanism on such modified stego audio, it generates only a portion of scrambled secret image as shown in Figure 8(b). However, when “Reverse Arnold Transform” has been applied on such partially scrambled secret image, it still recovers the extracted secret as shown in Figure 8(c). Quality analysis of the extracted secret image has revealed PSNR value of 55.7633 and SSIM value of 0.9867, when compared with the original secret image which was embedded.

By Adding White Gaussian Noise. In this type of attack, “Additive White Gaussian Noise” (AWGN) is added to the stego audio to distort the hidden message. AWGN can be added to any signal, and it has uniform power and is distributed with respect to time. As shown in Table 5, to test robustness of the proposed method, here 20, 30, and 40 dB of SNR (Signal-to-Noise Ratio) per sample is added to the stego audio signal, assuming the power of stego signal is 0 dBW (decibel-watt is a unit of power in decibel scale, relative to 1 watt).

Table 5: Experimental results of AWGN stego analysis attack.

By Resampling. While writing audio data into a file, sampling rate of the audio is generally mentioned as Fs. In the resampling attack, at first this sampling rate has been changed to a higher or lower frequency while saving the same audio in a new file. As resampling causes impact on audio file length, hence, to maintain the same length as of original cover, modified audio has been cut or filled with zeros. Once saved, resampling has been performed again on the modified audio to revert it back to the original sampling frequency—by this, audibly no differences will be noted; however it will distort the embedded secret message (if any). In Table 6, result of such resampling attack has been shown.

Table 6: Outcome of resampling attack.

By Requantization. The number of bits required to express each audio sample is known as bit depth. It is a measurement of sound accuracy: the higher the bit depth is, the more it would be precise. In the requantization attack, this bit depth of stego audio has been changed to pervert the embedded secret image. Table 7 illustrates the outcome of the extraction process after requantization attack.

Table 7: Results of requantization attack.

By Pitch Shifting. Pitch means tone of a signal; it describes the quality of a sound by the rate of vibrations. In pitch shifting attack, original pitch of an audio is lifted or dropped without modifying its length to destroy the hidden message embedded in a stego audio. Here pitch shifting has been done by utilizing time-scale modification algorithm called “Phase Vocoder” [50], the result of which is shown in Table 8.

Table 8: Experimental results of pitch shifting attack.

By MP3 Compression. In this Steganalysis attack, stego.wav file has been compressed to MP3 format to eliminate redundant data, by which embedded secret message would be completely removed. Here mp3write MATLAB function has been used to convert the stego.wav file into mp3 format and mp3read MATLAB function has been applied to read from the mp3 file during extraction process.

Table 9 reflects the extraction outcome from three different mp3 files of the same stego audio which has been encoded with bitrates 128 kbps, 192 kbps, and 320 kbps, respectively.

Table 9: Experimental results of MP3 compression attack.
4.4. Comparison with Existing Method

For comparison with the proposed method, research articles published in SCI indexed journal have been searched—where data hiding in audio has been performed by DWT along with DCT and extraction mechanism is blind. Authors of [51] have proposed DCT-DWT based data hiding technique using 16-bit Barker code as synchronizing code to accommodate binary image as secret message. From the comparison results presented in Table 10, this can be proved that the proposed method has outperformed the existing one in terms of quality and robustness test against Steganalysis attacks.

Table 10: Comparison results.

In Table 10, “✓” signifies “satisfactory result obtained”; “” signifies “unsatisfactory result or method does not comply”; and “-” implies “details not mentioned.”

5. Conclusion

Secret communication using age-old steganography techniques often increases chances of detectability through the perceivable noise. Hence, in this article, the cocktail party effect has been considered which has effectively reduced the probability of detectability. This has also been proved by the help of different Steganalysis techniques. Additionally, PSNR, CC, and PEAQ values are also analyzed to determine the perceptual noise recorded due to secret message embedding and extraction. Since all the above results verify the undetectability and robustness of the system, hence it can be concluded that this audio steganography technique is successful in secret communication with very high robustness.

In future, this proposed method can be further improvised by utilizing speaker diarization technique, which determines “who spoke when.” Application of speaker diarization along with speech recognition would identify a speaker’s voice and this concept will permit segregating secret audio stream into multiple speech segments, ensuring another novel approach of data hiding.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


  1. B. G. Banik and S. K. Bandyopadhyay, “Review on steganography in digital media,” International Journal of Science and Research, vol. 4, no. 2, pp. 265–274, 2015. View at Google Scholar
  2. M. Asad, J. Gilani, and A. Khalid, “An enhanced least significant bit modification technique for audio steganography,” in Proceedings of the 1st International Conference on Computer Networks and Information Technology (ICCNIT '11), pp. 143–147, IEEE, Pakistan, July 2011. View at Publisher · View at Google Scholar · View at Scopus
  3. N. Cvejic and T. Seppanen, “Increasing the capacity of LSB-based audio steganography,” in Proceedings of the 2002 5th IEEE Workshop on Multimedia Signal Processing (MMSP '02), pp. 336–338, IEEE, USA, December 2002. View at Publisher · View at Google Scholar · View at Scopus
  4. Jayaram, Ranganatha, and Anupama, “Information Hiding Using Audio Steganography - A Survey,” The International Journal of Multimedia & Its Applications, vol. 3, no. 3, pp. 86–96, 2011. View at Google Scholar
  5. D. Gruhl, A. Lu, and W. Bender, “Echo hiding,” in Information Hiding, vol. 1174 of Lecture Notes in Computer Science, pp. 295–315, Springer Berlin Heidelberg, Berlin, Heidelberg, 1996. View at Publisher · View at Google Scholar
  6. D. Xiaoxiao, M. F. Bocko, and Z. Ignjatovic, “Robustness analysis of a digital audio steganographic method based on phase manipulation,” in Proceedings of the 7th International Conference on Signal Processing (ICSP '04), vol. 3, pp. 2375–2378, IEEE, Beijing, China, 2004. View at Publisher · View at Google Scholar
  7. N. Parab, M. Nathan, and K. T. Talele, “Audio Steganography Using Differential Phase Encoding,” in Technology Systems and Management, vol. 145 of Communications in Computer and Information Science, pp. 146–151, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. View at Publisher · View at Google Scholar
  8. R. M. Nugraha, “Implementation of Direct Sequence Spread Spectrum steganography on audio data,” in Proceedings of the 2011 International Conference on Electrical Engineering and Informatics (ICEEI '11), pp. 1–6, IEEE, Indonesia, July 2011. View at Publisher · View at Google Scholar · View at Scopus
  9. H. Matsuoka, “Spread spectrum audio steganography using sub-band phase shifting,” in Proceedings of the 2006 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP '06), pp. 3–6, IEEE, USA, December 2006. View at Publisher · View at Google Scholar · View at Scopus
  10. G. Prabakaran and R. Bhavani, “A modified secure digital image steganography based on discrete wavelet transform,” in Proceedings of the 2012 International Conference on Computing, Electronics and Electrical Technologies (ICCEET '12), pp. 1096–1100, IEEE, India, March 2012. View at Publisher · View at Google Scholar · View at Scopus
  11. N. Gupta and N. Sharma, “Dwt and LSB based Audio Steganography,” in Proceedings of the 2014 International Conference on Reliability, Optimization and Information Technology (ICROIT '14), pp. 428–431, IEEE, India, February 2014. View at Publisher · View at Google Scholar · View at Scopus
  12. S. S. Verma, R. Gupta, and G. Shrivastava, “A novel technique for data hiding in audio carrier by using sample comparison in DWT domain,” in Proceedings of the 2014 4th International Conference on Communication Systems and Network Technologies (CSNT '14), pp. 639–643, IEEE, India, April 2014. View at Publisher · View at Google Scholar · View at Scopus
  13. W. Junjie, M. Qian, M. Dongxia, and Y. Jun, “Research for synchronic audio information hiding approach based on DWT domain,” in Proceedings of the 2009 International Conference on E-Business and Information System Security (EBISS '09), pp. 1–5, IEEE, China, May 2009. View at Publisher · View at Google Scholar · View at Scopus
  14. A. Kanhe and G. Aghila, “DCT based audio steganography in voiced and un-voiced frames,” in Proceedings of the 1st International Conference on Informatics and Analytics (ICIA '16), pp. 1–4, ACM Press, India, August 2016. View at Publisher · View at Google Scholar · View at Scopus
  15. Z. Zhou and L. Zhou, “A novel algorithm for robust audio watermarking based on quantification DCT domain,” in Proceedings of the 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP '07), pp. 441–444, IEEE, Taiwan, November 2007. View at Publisher · View at Google Scholar · View at Scopus
  16. W. Yongqi and Y. Yang, “A synchronous audio watermarking algorithm based on chaotic encryption in DCT domain,” in Proceedings of the 2008 International Symposium on Information Science and Engineering (ISISE '08), pp. 371–374, IEEE, China, December 2008. View at Publisher · View at Google Scholar · View at Scopus
  17. S. Roy, N. Sarkar, A. K. Chowdhury, and S. M. A. Iqbal, “An efficient and blind audio watermarking technique in DCT domain,” in Proceedings of the 18th International Conference on Computer and Information Technology (ICCIT '15), pp. 362–367, IEEE, Bangladesh, December 2015. View at Publisher · View at Google Scholar · View at Scopus
  18. B. Ratner, “The correlation coefficient: Its values range between +1/−1, or do they?” Journal of Targeting, Measurement and Analysis for Marketing, vol. 17, no. 2, pp. 139–142, 2009. View at Publisher · View at Google Scholar · View at Scopus
  19. L. Min, L. Ting, and H. Yu-jie, “Arnold Transform Based Image Scrambling Method,” in Proceedings of the 3rd International Conference on Multimedia Technology (ICMT '13), pp. 1309–1316, Publisher Atlantis Press, Guangzhou, China, November 2013. View at Publisher · View at Google Scholar
  20. L. Wu, J. Zhang, W. Deng, and D. He, “Arnold transformation algorithm and anti-Arnold transformation algorithm,” in Proceedings of the 1st International Conference on Information Science and Engineering (ICISE '09), pp. 1164–1167, IEEE, China, December 2009. View at Publisher · View at Google Scholar · View at Scopus
  21. N. V. Lalitha, S. Rao, and P. V. JayaSree, “DWT - Arnold Transform based audio watermarking,” in Proceedings of the 2013 IEEE Postgraduate Research in Microelectronics and Electronics Asia (PrimeAsia), pp. 196–199, IEEE, Visakhapatnam, India, December 2013. View at Publisher · View at Google Scholar
  22. S. Gaur and V. K. Srivastava, “Robust embedding of improved arnold transformed watermark in digital images using RDWT-SVD,” in Proceedings of the 4th IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC '16), pp. 563–568, IEEE, India, December 2016. View at Publisher · View at Google Scholar · View at Scopus
  23. Z. Zhang, C. Wang, and X. Zhou, “Image watermarking scheme based on Arnold transform and DWT-DCT-SVD,” in Proceedings of the 13th IEEE International Conference on Signal Processing (ICSP '16), pp. 805–810, IEEE, China, November 2016. View at Publisher · View at Google Scholar · View at Scopus
  24. E. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,” The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953. View at Publisher · View at Google Scholar
  25. R. Russell, Cognition: Theory and Practice, Worth Publishers, 2013.
  26. B. Arons, “A Review of The Cocktail Party Effect,” Journal of The American Voice I/O Society, vol. 12, pp. 35–50, 1992. View at Google Scholar
  27. D. E. Broadbent, “Selective listening to speech,” in Perception and Communication, pp. 11–35, Pergamon Press, 1958. View at Publisher · View at Google Scholar
  28. N. I. Durlach and H. S. Colburn, “Binaural Phenomena,” in Hearing, pp. 365–466, Elsevier, 1978. View at Google Scholar
  29. J. Blauert and R. A. Butler, “Spatial hearing: the psychophysics of human sound localization,” The Journal of the Acoustical Society of America, vol. 77, no. 1, pp. 334-335, 1985. View at Publisher · View at Google Scholar
  30. J. Kassebaum, M. F. Tenorio, and C. Schaefers, “The Cocktail Party Problem: Speech/Data Signal Separation Comparison between Backpropagation and SONN,” in Proceedings of the 2nd International Conference on Neural Information Processing Systems, pp. 542–549, MIT Press Cambridge, Cambridge, USA, 1990.
  31. S. Haykin and Z. Chen, “The cocktail party problem,” Neural Computation, vol. 17, no. 9, pp. 1875–1902, 2005. View at Publisher · View at Google Scholar · View at Scopus
  32. C. von der Malsburg, “The correlation theory of brain function,” in Models of Neural Networks, Temporal Aspects of Coding and Information Processing in Biological Systems, pp. 95–119, Springer New York, New York, NY, USA, 1994. View at Publisher · View at Google Scholar
  33. C. von der Malsburg and W. Schneider, “A neural cocktail-party processor,” Biological Cybernetics, vol. 54, no. 1, pp. 29–40, 1986. View at Publisher · View at Google Scholar · View at Scopus
  34. D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 10, no. 3, pp. 684–697, 1999. View at Publisher · View at Google Scholar · View at Scopus
  35. G. J. Brown and D. L. Wang, “An oscillatory correlation framework for computational auditory scene analysis,” in Advances in Neural Information Processing Systems 12, pp. 747–753, MIT Press, 2000. View at Google Scholar
  36. B. Sagi, S. C. Nemat-Nasser, R. Kerr, R. Hayek, C. Downing, and R. Hecht-Nielsen, “A biologically motivated solution to the cocktail party problem,” Neural Computation, vol. 13, no. 7, pp. 1575–1602, 2001. View at Publisher · View at Google Scholar · View at Scopus
  37. S. Ao, Z. Luo, N. Zhao, and R. Wang, “Blind source separation based on principal component analysis- independent component analysis for acoustic signal during laser welding process,” in Proceedings of the 2010 International Conference on Digital Manufacturing and Automation (ICDMA '10), pp. 336–339, IEEE, China, December 2010. View at Publisher · View at Google Scholar · View at Scopus
  38. I. T. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer, New York, NY, USA, 2nd edition, 2002. View at MathSciNet
  39. M. E. Wall, A. Rechtsteiner, and L. M. Rocha, “Singular value decomposition and principal component analysis,” in A Practical Approach to Microarray Data Analysis, pp. 91–109, Kluwer Academic Publishers, 2003. View at Publisher · View at Google Scholar
  40. J. Wellhausen, “Audio signal separation using independent subspace analysis and improved subspace grouping,” in Proceedings of the 7th Nordic Signal Processing Symposium (NORSIG '06), pp. 310–313, IEEE, Iceland, June 2006. View at Publisher · View at Google Scholar · View at Scopus
  41. Q. Cai and X. Tang, “A digital audio watermarking algorithm based on independent component analysis,” in Proceedings of the 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, CISP-BMEI 2016, pp. 1053–1058, IEEE, China, October 2016. View at Publisher · View at Google Scholar · View at Scopus
  42. A. Hyv\"arinen, “Independent component analysis: recent advances,” Philosophical Transactions of the Royal Society A: Mathematical, Physical & Engineering Sciences, vol. 371, no. 1984, 20110534 pages, 2012. View at Google Scholar · View at MathSciNet
  43. A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 10, no. 3, pp. 626–634, 1999. View at Publisher · View at Google Scholar · View at Scopus
  44. H. Helfrich, Time and Mind II: Information Processing Perspectives, Hogrefe & Huber Publishers, Cambridge, MA, USA, 2004.
  45. M. F. Casanova and I. Opris, Eds., Recent Advances on the Modular Organization of the Cortex, Springer Netherlands, Dordrecht, 2015. View at Publisher · View at Google Scholar
  46. C. Ionescu and R. De Keyser, “Exploring the advantages of blind source separation in monitoring input respiratory impedance during apneic events,” Journal of Control Engineering and Applied Informatics, vol. 10, no. 3, pp. 53–59, 2008. View at Google Scholar
  47. Q. Su, Y. Shen, W. Jian, and P. Xu, “Blind source separation algorithm based on modified bacterial colony chemotaxis,” in Proceedings of the 5th International Conference on Intelligent Control and Information Processing (ICICIP '14), pp. 354–359, IEEE, Dalian, China, August 2014. View at Publisher · View at Google Scholar · View at Scopus
  48. F. A. P. Petitcolas, “Kerckhoffs’ Principle,” in Encyclopedia of Cryptography and Security, H. C. A. van Tilborg and S. Jajodia, Eds., 675 pages, Springer, Boston, MA, USA, 2011. View at Publisher · View at Google Scholar
  49. Paar, Christof, and J. Pelzl, “Stream Ciphers,” in In Understanding Cryptography, p. 35, Springer, Berlin, Heidelberg, Germany, 2010. View at Google Scholar
  50. J. Laroche and M. Dolson, “New phase-vocoder techniques for pitch-shifting, harmonizing and other exotic effects,” in Proceedings of the 1999 Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 91–94, IEEE, New Paltz, NY, USA. View at Publisher · View at Google Scholar
  51. X.-Y. Wang and H. Zhao, “A novel synchronization invariant audio watermarking scheme based on DWT and DCT,” IEEE Transactions on Signal Processing, vol. 54, no. 12, pp. 4835–4840, 2006. View at Publisher · View at Google Scholar · View at Scopus