Abstract
This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool for analyzing the spectra of vector-valued discrete-time signals. The parallel implementation is developed in terms of a mathematical framework built on a set of block matrix operations, which contribute to the analysis, design, and implementation of parallel algorithms on multicore processors. In this work, the mathematical framework is implemented and investigated experimentally using MATLAB with the Parallel Computing Toolbox. We found that using multicore processors and a parallel computing environment is advantageous for reducing the high execution time. Additionally, the speedup increases as the number of logical processors and the length of the signal increase.
1. Introduction
This paper considers the space of vector-valued discrete-time signals with a finite number of samples, where each sample is a complex vector of fixed length. Vector-valued discrete-time signals are used in many applications in signal processing and electrical engineering, for example, vector quantization of images [1], time-frequency localization with wavelets [2], image coding [3], vector filter bank theory [4], linear time-dependent MISO [5], and analysis of MMSE estimation for compressive sensing of block sparse signals [6].
To analyze the spectra of vector-valued discrete-time signals, a novel tool called the vector-valued DFT was developed [7, 8]. This transform has applications in vector analysis in complex, quaternion, biquaternion, and Clifford algebras [8]. Additionally, the vector-valued DFT is used in digital signal processing, for example, in the study of new complex-valued constant amplitude zero autocorrelation (CAZAC) signals [9], which serve as coefficients for phase-coded waveforms with prescribed vector-valued ambiguity function behavior; this is relevant to time-frequency analysis, vector sensor, and MIMO technologies [7].
This paper presents a parallel framework for the vector-valued DFT. Its major contributions are summarized as follows:
(1) The construction of a new mathematical structure for the vector-valued DFT, based on block matrix theory, that allows a parallel implementation on multicore processors.
(2) A reduction of the elapsed time needed to compute the vector-valued DFT of a vector-valued discrete-time signal, obtained with parallel computing through this new mathematical framework.
This new framework is developed with a set of block matrix operations, for example, the Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator (see Section 2.1 for details). These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors [10–12]. The framework is inspired by the matrix representation of the Cooley-Tukey fast Fourier transform (FFT) algorithm for complex discrete-time signals, which corresponds to the decomposition of the transform size into a product of two factors and is developed in [10, 12, 13].
The paper is organized as follows. Section 2 gives the mathematical background on block matrix operations and the discrete Fourier transform. Section 3 defines the vector-valued DFT for vector-valued discrete-time signals. Section 4 develops a mathematical framework for the vector-valued DFT in terms of block matrix operations, for vector-valued discrete-time signals whose length is a product of two factors; this framework supports the implementation of parallel algorithms. Section 5 describes an implementation and experimental investigation of the framework using parallel computing on multicore processors with MATLAB. Finally, conclusions are presented in Section 6.
Throughout the paper, the following notations are used. is the additive group of integers modulo , is the matrix space of rows and columns with complex numbers entries and . The rows and columns of are indexed by elements of and , respectively. , , , and represent entry , row , column , and transpose matrix of , respectively. is identity matrix.
2. Background
2.1. Block Matrix Operations
A block matrix with row partitions and column partitions and a block vector with row blocks are defined as respectively, where designates block and designates block. In this paper, the following block matrix operations are used: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator.
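The Kronecker product of two matrices $A \in \mathbb{C}^{m \times n}$ and $B \in \mathbb{C}^{p \times q}$ is defined as $A \otimes B = [a_{ij} B] \in \mathbb{C}^{mp \times nq}$, and it replaces every entry $a_{ij}$ of $A$ by the matrix $a_{ij} B$. In the special case $I_n \otimes B$, it is called a parallel operation [12].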
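The Kronecker product can be illustrated with a minimal Python sketch. The helper names `kron` and `identity` are hypothetical and matrices are plain lists of rows; this is an illustration of the definition, not the implementation used in the paper.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows.

    Row i*p + k and column j*q + l of the result hold A[i][j] * B[k][l],
    where B has p rows and q columns.
    """
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n-by-n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]

K = kron(A, B)            # every entry a_ij of A is replaced by a_ij * B
P = kron(identity(2), B)  # block diagonal: B acts independently on each block
```

In `P`, the two copies of `B` sit on the block diagonal and never interact, which is why a product of the form identity-Kronecker-matrix can be split into independent tasks.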
The direct sum of matrices constructs a block diagonal matrix from a set of matrices, that is, for , such that : where , , and .
Let $N = RS$. The stride permutation matrix $L^{N}_{R} \in \mathbb{C}^{N \times N}$ is defined such that it permutes the elements of the input signal $x$ as $(L^{N}_{R} x)(iS + j) = x(jR + i)$, $0 \le i < R$, and $0 \le j < S$ [12, 14]. This permutation matrix governs the data flow required to parallelize a Kronecker product computation [12]. We clarify that the superscript $N$ is an index, not a power.
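A stride permutation can be sketched in a few lines of Python. The index convention used here, `y[i*S + j] = x[j*R + i]`, is an assumption chosen to match the usual Cooley-Tukey data flow, and the function name `stride_permutation` is hypothetical.

```python
def stride_permutation(x, R):
    """Read x at stride R: y[i*S + j] = x[j*R + i], 0 <= i < R, 0 <= j < S.

    The length N of x must factor as N = R * S.
    """
    N = len(x)
    if N % R != 0:
        raise ValueError("R must divide the length of x")
    S = N // R
    return [x[j * R + i] for i in range(R) for j in range(S)]


x = list(range(6))
y = stride_permutation(x, 2)     # gathers x[0], x[2], x[4], then x[1], x[3], x[5]
back = stride_permutation(y, 3)  # stride 3 undoes stride 2 when N = 6
```

Under this convention, the permutation with stride $S$ is the inverse of the permutation with stride $R$ whenever $N = RS$.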
The vec operator, $\operatorname{vec}$, transforms a matrix into a vector by stacking all the columns of this matrix one underneath the other. Conversely, the vec inverse operator, $\operatorname{vec}^{-1}_{M \times N}$, transforms a vector of dimension $MN$ into a matrix of size $M \times N$.
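The two operators can be sketched as follows, assuming matrices are stored as lists of rows; the function names `vec` and `vec_inverse` are illustrative only.

```python
def vec(A):
    """Stack the columns of A (a list of rows) into one flat list."""
    rows, cols = len(A), len(A[0])
    return [A[i][j] for j in range(cols) for i in range(rows)]


def vec_inverse(v, rows, cols):
    """Rebuild the rows-by-cols matrix whose column stacking equals v."""
    if len(v) != rows * cols:
        raise ValueError("len(v) must equal rows * cols")
    return [[v[j * rows + i] for j in range(cols)] for i in range(rows)]


A = [[1, 2, 3],
     [4, 5, 6]]
v = vec(A)                  # columns (1, 4), (2, 5), (3, 6) stacked in order
A_again = vec_inverse(v, 2, 3)
```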
2.2. Discrete Fourier Transform
Let be the set of valued signals on ; that is, if and only if [9]. Additionally, for each , , where and . The discrete Fourier transform (DFT) of is represented as such that , where and .
As mentioned in [14], there are two different approaches to representing the DFT: as matrix-vector products or as summations. Consequently, fast algorithms using parallel computing are expressed either in a matrix formalism, as in [10, 12–14], or with summations, as in most signal processing books. Below, the matrix formalism is introduced and used to express the Cooley-Tukey FFT algorithm, which corresponds to the decomposition of the transform size into a product of two factors.
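The matrix representation of the DFT of $x$ is $\hat{x} = F_N x$, where $F_N \in \mathbb{C}^{N \times N}$ such that $F_N(k, n) = \omega_N^{kn}$ with $\omega_N = e^{-2\pi i / N}$. If $N = RS$, then the matrix formalism can be used to express $F_N$ as a factorization of matrices using block matrix operations [10, 12, 13]:
$$F_N = (F_R \otimes I_S)\, T^{N}_{S}\, (I_R \otimes F_S)\, L^{N}_{R}. \qquad (3)$$
Here, $T^{N}_{S}$ is a diagonal matrix containing the twiddle factors. We clarify that the superscript $N$ is an index, not a power. This factorization of $F_N$ is the matrix representation of the Cooley-Tukey FFT for $N = RS$. In addition, this representation of $F_N$ allows an implementation using parallel computing [14].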
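The four factors of the Cooley-Tukey factorization can be applied one after another and compared with the direct DFT. The sketch below assumes the kernel $e^{-2\pi i kn/N}$, the stride convention `y[i*S + j] = x[j*R + i]`, and twiddle factors $\omega_N^{ij}$ at position $iS + j$; these conventions are assumptions, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT by summation, with kernel exp(-2*pi*i*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def cooley_tukey(x, R, S):
    """Apply stride permutation, block DFTs, twiddles, and strided DFTs in turn."""
    N = R * S
    if len(x) != N:
        raise ValueError("len(x) must equal R * S")
    # Stride permutation: y[i*S + j] = x[j*R + i].
    y = [x[j * R + i] for i in range(R) for j in range(S)]
    # Identity-Kronecker-DFT: R independent DFTs of size S on contiguous blocks.
    y = [v for i in range(R) for v in dft(y[i * S:(i + 1) * S])]
    # Diagonal twiddle matrix: entry i*S + j is scaled by exp(-2*pi*i*(i*j)/N).
    y = [
        y[i * S + j] * cmath.exp(-2j * cmath.pi * i * j / N)
        for i in range(R)
        for j in range(S)
    ]
    # DFT-Kronecker-identity: S independent DFTs of size R at stride S.
    out = [0j] * N
    for j in range(S):
        column = dft([y[i * S + j] for i in range(R)])
        for k in range(R):
            out[k * S + j] = column[k]
    return out


signal = [1, 2 - 1j, 0, -1, 3j, 4]
direct = dft(signal)
factored = cooley_tukey(signal, 2, 3)
```

The second and fourth stages each consist of sub-transforms that share no data, which is the property a parallel implementation exploits.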
3. DFT for Vector-Valued Signals
Based on [2, 6–9, 15, 16], the space of vector-valued discrete-time signals with samples is defined as The space is the set of valued signals on ; that is, if and only if . Additionally, for each , , where and . Furthermore, if , then . Now, for , there is a kind of DFT for vector-valued signals called the vector-valued DFT. This transform is defined as such that where is the matrix kernel. Algorithm 1 shows the implementation of (5). This implementation is a sequential algorithm.

In the reviewed literature, there are two kinds of kernels for this transform: the first one is the hypercomplex DFT kernel [8]: where such that , and the second one is the DFT frame kernel [7]: where with for . It is called the DFT frame kernel because , where is a DFT frame. In this paper, subsets are used, such that , although it does not represent a DFT frame.
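A sequential transform with a diagonal kernel can be sketched as below. The kernel form $\operatorname{diag}(e^{-2\pi i s_j m / N})$, the sign of the exponent, and the names `frame_kernel` and `vector_valued_dft` are all assumptions made for illustration; the exact kernel definitions are those of [7, 8], which this sketch does not claim to reproduce.

```python
import cmath


def frame_kernel(N, s):
    """Return K(m), the diagonal of an assumed kernel diag(exp(-2*pi*i*s_j*m/N))."""
    def K(m):
        return [cmath.exp(-2j * cmath.pi * sj * m / N) for sj in s]
    return K


def vector_valued_dft(u, K):
    """Sequential transform U[k] = sum over n of K(n*k) applied to u[n].

    u is a list of N samples, each a list of M complex numbers, and K(m)
    returns the M diagonal entries of the kernel matrix for index m.
    """
    N = len(u)
    M = len(u[0])
    U = []
    for k in range(N):
        acc = [0j] * M
        for n in range(N):
            diag = K(n * k)
            for j in range(M):
                acc[j] += diag[j] * u[n][j]
        U.append(acc)
    return U


u = [[1, 1], [2, 0], [0, 1], [1, 3]]   # N = 4 samples, each of length M = 2
U = vector_valued_dft(u, frame_kernel(4, [1, 2]))
```

With a diagonal kernel, each component of the output is a scalar DFT of the matching input component sampled at a component-specific frequency, so the nested loops cost on the order of $N^2 M$ multiplications.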
Lemma 1. Let be a hypercomplex DFT kernel or DFT frame kernel. Then (1).(2).(3)If and , then .(4)If , then .
Proof. For the hypercomplex DFT kernel, the proof of each case is similar to the corresponding proof for the $N$th roots of unity. For the DFT frame kernel, is a diagonal matrix, and then the proof of each case is straightforward.
4. A Parallel Framework for the Vector-Valued DFT
In this section, the main results of this paper are presented. First, a block matrix representation of the vector-valued DFT is given. Second, a new mathematical framework is derived from the matrix representation of the vector-valued DFT, using a block matrix formalism (i.e., Theorem 2). This result is inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, which corresponds to the decomposition of the transform size into a product of two factors and is developed in [10, 12, 13]. The result obtained in Theorem 2 is then transformed into a new block matrix representation that contributes to the analysis, design, and implementation of parallel algorithms (i.e., Corollary 3). This new result is inspired by (3). Finally, a computational complexity analysis of the new algorithm is developed.
Similar to the DFT matrix representation explained in Section 2.2, there are two different approaches to representing the vector-valued DFT: as summations (see (5)) or as matrix-vector products. Both approaches allow a parallel implementation. In fact, the proof of Theorem 2 is developed using summation notation.
The vector-valued DFT can be presented as matrix-vector products. The block matrix representation of the vector-valued DFT of is defined as , where such that , for . We clarify that the superscript is an index, not a power. In this section, a block matrix factorization of is developed, and it is inspired by (3). First, a generalization of the stride permutation is defined. Let . The block stride permutation matrix [14, 17] is defined as such that , and, for each with blocks , the operation permutes each block of the input block as , , and .
Theorem 2. Let and let be the block matrix of the DFT for vector-valued signals. Then where such that .
Proof. Let , let , and let . The block vector is defined. Then Now, let . From Lemma 1, ; then Let . Then But ; then Let , let , and let because , , and . Then
Now, if , , and , the following equality [17] is obtained:
From Theorem 2 and (14), the following corollary presents a matrix factorization of such that it permits an implementation using parallel computing.
Corollary 3. Let and let be the block matrix of the DFT for vector-valued signals. Then where was defined in Theorem 2.
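Passing from a factor of the form $A \otimes I_S$ to the parallel form $I_S \otimes A$ relies on conjugation by stride permutations. The sketch below checks the standard scalar identity $A \otimes I_S = L^{N}_{R} (I_S \otimes A) L^{N}_{S}$ for an $R \times R$ matrix $A$ and $N = RS$; it is assumed, not confirmed by the text, that the equality taken from [17] is the block analogue of this identity. All function names are hypothetical.

```python
def stride_permutation(x, R):
    """y[i*S + j] = x[j*R + i] for 0 <= i < R, 0 <= j < S, with len(x) = R*S."""
    S = len(x) // R
    return [x[j * R + i] for i in range(R) for j in range(S)]


def apply_blockwise(A, x):
    """Identity-Kronecker-A: multiply each contiguous block of x by A."""
    R = len(A)
    out = []
    for start in range(0, len(x), R):
        block = x[start:start + R]
        out.extend(sum(A[k][i] * block[i] for i in range(R)) for k in range(R))
    return out


def apply_strided(A, x):
    """A-Kronecker-identity: out[k*S + j] = sum over i of A[k][i] * x[i*S + j]."""
    R = len(A)
    S = len(x) // R
    out = [0] * len(x)
    for j in range(S):
        for k in range(R):
            out[k * S + j] = sum(A[k][i] * x[i * S + j] for i in range(R))
    return out


A = [[1, 2], [3, 4]]      # R = 2
x = list(range(6))        # N = 6, so S = 3
lhs = apply_strided(A, x)
rhs = stride_permutation(apply_blockwise(A, stride_permutation(x, 3)), 2)
```

The right-hand side performs the same arithmetic as the left, but each multiplication by `A` touches one contiguous block, so the blocks can be handed to separate workers.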
Algorithm 2 shows a parallel implementation of (15).

Steps (3)–(5), (6)–(8), and (12)–(14) of Algorithm 2 each consist of independent processes, which makes this approach a parallel operation. A model of Algorithm 2 is shown in Figure 1.
4.1. Computational Complexity Analysis
In this section, the computational complexity analysis of (15) is developed. First, consider the matrix operation . The computational complexity (CC) of is [8] because it is the multiplication between a block matrix in and a block vector in . But the operation can be implemented with a CC (see, e.g., [12, 14]).
Let be the block matrix and vector-valued signal , where . It is known that the CC of operation is . Now consider operation using (15). If we consider each matrix-vector multiplication, we obtain the following:
(1) The CC of is .
(2) The CC of is , because it is a block diagonal matrix multiplication.
(3) The CC of is , because is a diagonal matrix multiplication.
(4) The CC of is .
(5) The CC of is , because it is a block diagonal matrix multiplication.
(6) The CC of is .
Therefore, the CC of using (15) is
Thus, the CC of operation is and the CC of operation using (15) is . This comparison shows the efficiency of the matrix formulation in (15).
5. Implementation and Experimental Investigation
5.1. General Information
The experiments were carried out on a computer with a multicore processor: an Intel Core i7-3632QM CPU with 4 cores, a system clock of 2.20 GHz, and 8 GB of RAM. The experiment consists of implementing and testing Algorithms 1 and 2 with the hypercomplex DFT kernel and the DFT frame kernel. Algorithm 1 does not use any parallel implementation, unlike Algorithm 2. A CAZAC signal in is used; it is generated using a Wiener CAZAC signal in [9] with and , where , , , , and .
Algorithms 1 and 2 are implemented in MATLAB to compute the vector-valued DFT, and Algorithm 2 uses the Parallel Computing Toolbox. MATLAB offers two forms of parallelism: built-in multithreading and parallelism using MATLAB workers. This work uses MATLAB workers. With the Parallel Computing Toolbox, multiple MATLAB workers (MATLAB computational engines) can run on a multicore computer to execute applications in parallel. This approach allows more control over the parallelism than built-in multithreading. The parallel MATLAB programs of the parallel framework for the vector-valued DFT are written with programming constructs such as parallel for-loops (parfor) and batch.
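The worker-based pattern described here can be sketched in Python, used in this sketch in place of MATLAB. A thread pool stands in for the pool of MATLAB workers and `pool.map` plays the role of a parfor loop over independent blocks; the function names are hypothetical, and Python threads illustrate only the structure of the computation, not the speedups reported in this paper.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(block):
    """Direct DFT of one block, with kernel exp(-2*pi*i*k*n/S)."""
    S = len(block)
    return [
        sum(block[n] * cmath.exp(-2j * cmath.pi * k * n / S) for n in range(S))
        for k in range(S)
    ]


def parallel_block_dfts(x, S, workers=4):
    """Transform each length-S block of x independently, one task per block."""
    blocks = [x[start:start + S] for start in range(0, len(x), S)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in block order, like a parfor over block indices.
        results = list(pool.map(dft, blocks))
    return [value for block in results for value in block]


x = [1, 2, 3, 4, 5, 6]
parallel_result = parallel_block_dfts(x, 3, workers=2)
```

Because no block reads or writes data belonging to another block, the tasks need no synchronization beyond collecting the results.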
5.2. Results and Discussion
Let be the execution time of Algorithm 1 without any parallel implementation, and let be the execution time of Algorithm 2, where is the number of cores. The value of needs to be less than that of for two reasons: Algorithm 2 has a parallel implementation and the matrix multiplication size is different. Algorithm 2 is computed with matrices in and . Algorithm 1 is computed with matrices in , where .
The computational performance of Algorithm 2 is evaluated using two metrics: speedup (or acceleration) and efficiency. The speedup is the ratio between the execution time of the parallel implementation on one core and its execution time on two or more cores [18]; it is given by $S_p = T_1 / T_p$, where $T_p$ denotes the execution time of Algorithm 2 on $p$ cores. The efficiency estimates how well the processors are utilized in solving the problem, compared to how much effort is wasted in communication and synchronization [18]. It is the ratio between the speedup and the number of processing elements, $E_p = S_p / p$.
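Both metrics are simple to compute from measured times. In the sketch below the timings are hypothetical placeholders chosen for easy arithmetic, not the values measured in this paper, and the function names are illustrative.

```python
def speedup(t_one_core, t_p_cores):
    """Speedup: one-core parallel time divided by the time on p cores."""
    return t_one_core / t_p_cores


def efficiency(t_one_core, t_p_cores, p):
    """Efficiency: speedup divided by the number of cores p."""
    return speedup(t_one_core, t_p_cores) / p


# Hypothetical execution times in seconds, keyed by number of cores.
timings = {1: 8.0, 2: 5.0, 4: 4.0}

metrics = {
    p: (speedup(timings[1], t), efficiency(timings[1], t, p))
    for p, t in timings.items()
}
```

With these placeholder timings, going from two to four cores raises the speedup but lowers the efficiency, which is the kind of trade-off the two metrics are meant to expose.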
Table 1 shows the execution time, in seconds (s), of both algorithms. A significant reduction in the parallel execution time of the vector-valued DFT is observed. For a Wiener CAZAC signal and the hypercomplex kernel, Table 1 reports the serial execution time of Algorithm 1 together with the execution times of Algorithm 2 on one to four cores, each of which is a fraction of the serial time. This result shows the advantage of using multicore processors and a parallel computing environment to reduce the high execution time of the vector-valued DFT. Parallel computing is a form of computation in which many calculations are carried out simultaneously [19, 20]; it operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently, thereby reducing the execution time [20, 21]. The difference between the serial time of Algorithm 1 and the one-core time of Algorithm 2 arises because the two algorithms multiply matrices of different sizes.
Table 2 presents the speedup of Algorithm 2. The acceleration of the vector-valued DFT increases as the number of cores increases, regardless of the signal size. The speedups obtained with the proposed parallel implementation on two, three, and four cores for a Wiener CAZAC signal are listed in Table 2. These results imply that the highest speedup is obtained with four cores.
Table 3 presents the efficiency of Algorithm 2. The table shows that a good efficiency (greater than 65%) is reached with two cores, but the efficiency of the vector-valued DFT decreases (down to 36%) as the number of cores increases, regardless of the signal size. This is attributed to a decrease in the share of simultaneous computation of the partial vector-valued DFTs in the parallel steps of Algorithm 2, which is responsible for the main effect. The results in Table 3 imply that the highest efficiency is obtained with two cores.
6. Conclusion
This work presented a parallel framework for the vector-valued DFT of vector-valued discrete-time signals. The framework was inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, which corresponds to the decomposition of the transform size into a product of two factors and is developed in [10, 12]. It was expressed in (15) and Algorithm 2. The parallel framework was formulated as a matrix representation using a set of block matrix operations: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator. These operations contributed to the analysis, design, and implementation of the parallel algorithm. Two kernels were used in the vector-valued DFT: the hypercomplex DFT kernel and the DFT frame kernel.
The experimental investigation indicated that there is a benefit in using MATLAB with the Parallel Computing Toolbox on a computer with a multicore processor. First, using multicore processors and a parallel computing environment was advantageous for reducing the high execution time (the times obtained with the hypercomplex DFT kernel are reported in Table 1). Second, the speedup increased with the number of cores regardless of the signal size, and a good efficiency (above 65%) was obtained with two cores.
As future work, we would like to extend the proposed parallel framework to vector-valued discrete-time signals in , where , using the idea of the Pease algorithm for complex discrete-time signals [22]. Additionally, we would like to explore further design tradeoffs of different approaches beyond those shown in this paper, for example, the approach developed in [23].
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was supported by Vicerrectoría de Investigación y Extensión of Instituto Tecnológico de Costa Rica.