A Parallel Framework with Block Matrices of a Discrete Fourier Transform for Vector-Valued Discrete-Time Signals
This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool for analyzing the spectra of vector-valued discrete-time signals. The parallel implementation is developed in terms of a mathematical framework built on a set of block matrix operations. These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors. In this work, the mathematical framework is implemented and investigated experimentally using MATLAB with the Parallel Computing Toolbox. We found that using multicore processors and a parallel computing environment reduces the high execution time. Additionally, speedup increases when the number of logical processors and the length of the signal increase.
Let be the space of vector-valued discrete-time signals with samples, where each sample is a complex vector of length . Vector-valued discrete-time signals are used often in signal processing and electrical engineering applications, for example, vector quantization of images , time-frequency localization with wavelets , image coding , vector filter bank theory , linear time-dependent MISO , and analysis of MMSE estimation for compressive sensing of block sparse signals .
To analyze the spectra of vector-valued discrete-time signals, a novel tool called the vector-valued DFT was developed [7, 8]. This transform has applications in vector analysis in complex, quaternion, biquaternion, and Clifford algebras . Additionally, the vector-valued DFT is used in digital signal processing, for example, in the study of new complex-valued constant amplitude zero autocorrelation (CAZAC) signals , which serve as coefficients for phase-coded waveforms with prescribed vector-valued ambiguity function behavior; this is relevant to time-frequency analysis, vector sensor, and MIMO technologies .
This paper presents a parallel framework for the vector-valued DFT. The major contributions of this paper are summarized as follows:
(1) The construction of a new mathematical structure for the vector-valued DFT using block matrix theory, which allows a parallel implementation on multicore processors.
(2) A reduction in the elapsed time to compute the vector-valued DFT of a vector-valued discrete-time signal, achieved with parallel computing through this new mathematical framework.
This new framework is developed with a set of block matrix operations, for example, Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator (see Section 2.1 for details). These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms on multicore processors [10–12]. The framework is inspired by the matrix representation of the Cooley-Tukey fast Fourier transform (FFT) algorithm for complex discrete-time signals, corresponding to the decomposition of the transform size into the product of two factors and , which is developed in [10, 12, 13].
The paper is organized as follows. Section 2 reviews the mathematical background on block matrix operations and the discrete Fourier transform. Section 3 defines the vector-valued DFT for vector-valued discrete-time signals. Section 4 develops a mathematical framework for the vector-valued DFT in terms of block matrix operations for vector-valued discrete-time signals with length . This mathematical framework contributes to the implementation of parallel algorithms. Section 5 describes an implementation and experimental investigation of this framework using parallel computing on multicore processors with MATLAB. Finally, some conclusions are presented in Section 6.
Throughout the paper, the following notation is used. is the additive group of integers modulo , is the matrix space of rows and columns with complex entries and . The rows and columns of are indexed by elements of and , respectively. , , , and represent entry , row , column , and the transpose of , respectively. is the identity matrix.
2.1. Block Matrix Operations
A block matrix with row partitions and column partitions and a block vector with row blocks are defined as respectively, where designates block and designates block. In this paper, the following block matrix operations are used: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator.
The Kronecker product of two matrices and is defined as and it replaces every entry of by the matrix . In the special case , it is called a parallel operation .
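The definition can be illustrated with a minimal sketch in plain Python, where matrices are lists of rows. The function name `kron` is illustrative and does not come from the paper. When the left factor is an identity matrix, the result is block diagonal with repeated copies of the right factor, which is why that case maps naturally onto independent (parallel) computations.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        # Entry (i_a * rows_b + i, j) is a[i_a][j // cols_b] * b[i][j % cols_b].
        [a_row[j // cols_b] * b[i][j % cols_b]
         for j in range(len(a_row) * cols_b)]
        for a_row in a
        for i in range(rows_b)
    ]


identity_2 = [[1, 0], [0, 1]]
b = [[1, 2], [3, 4]]
# Identity on the left: two independent copies of b on the diagonal.
block_diagonal = kron(identity_2, b)
```

Here `block_diagonal` is the 4 x 4 matrix with `b` in the top-left and bottom-right corners and zeros elsewhere.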
The direct sum of matrices constructs a block diagonal matrix from a set of matrices, that is, for , such that : where , , and .
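As a minimal sketch of the direct sum (plain Python lists stand in for matrices; the name `direct_sum` is illustrative), each input matrix is placed on the diagonal and padded with zeros on either side:

```python
def direct_sum(*blocks):
    """Block diagonal matrix built from the given matrices (lists of rows)."""
    total_cols = sum(len(block[0]) for block in blocks)
    result, offset = [], 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            # Zeros before the block, the block row, zeros after the block.
            padding_right = total_cols - offset - width
            result.append([0] * offset + list(row) + [0] * padding_right)
        offset += width
    return result


m = direct_sum([[1]], [[2, 3], [4, 5]])
```

Because the off-diagonal blocks are zero, multiplying a vector by a direct sum splits into one small, independent product per diagonal block.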
Let . The stride permutation matrix is defined as such that it permutes the elements of the input signal as , , and [12, 14]. This permutation matrix governs the data flow required to parallelize a Kronecker product computation . We clarify that the superscript is an index, not a power.
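A stride permutation can be sketched as follows. Index conventions for stride permutations vary between references, so the convention used here is an assumption: the input is read at stride `s`, starting from offsets 0, 1, ..., s - 1. The name `stride_permute` is illustrative.

```python
def stride_permute(x, s):
    """Read x at stride s: x[0], x[s], x[2s], ..., then x[1], x[1+s], ..."""
    n = len(x)
    if n % s != 0:
        raise ValueError("stride must divide the signal length")
    return [x[j] for start in range(s) for j in range(start, n, s)]


x = [0, 1, 2, 3, 4, 5]
by_two = stride_permute(x, 2)          # [0, 2, 4, 1, 3, 5]
restored = stride_permute(by_two, 3)   # stride 3 undoes stride 2 when n = 6
```

Under this convention, a stride-`s` permutation of a length `r * s` vector is inverted by a stride-`r` permutation, which is equivalent to transposing an `r` x `s` array stored in memory.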
The vec operator, , transforms a matrix into a vector by stacking all the columns of this matrix one underneath the other. On the other hand, the vec inverse operator, , transforms a vector of dimension into a matrix of size .
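Both operators can be sketched in a few lines of plain Python (matrices as lists of rows; the names `vec` and `vec_inv` are illustrative). The inverse needs the target shape because a vector alone does not determine it.

```python
def vec(m):
    """Stack the columns of m (a list of rows) into one vector."""
    rows, cols = len(m), len(m[0])
    return [m[i][j] for j in range(cols) for i in range(rows)]


def vec_inv(v, rows, cols):
    """Rebuild a rows x cols matrix from its column-stacked vector."""
    if len(v) != rows * cols:
        raise ValueError("vector length must equal rows * cols")
    return [[v[j * rows + i] for j in range(cols)] for i in range(rows)]


m = [[1, 2], [3, 4], [5, 6]]
stacked = vec(m)                  # [1, 3, 5, 2, 4, 6]
rebuilt = vec_inv(stacked, 3, 2)  # equals m
```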
2.2. Discrete Fourier Transform
Let be the set of -valued signals on ; that is, if and only if . Additionally, for each , , where and . The discrete Fourier transform (DFT) of is represented as such that , where and .
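For reference, the summation form of the DFT can be sketched directly. The sign convention (a negative exponent in the forward transform) and the absence of a normalization factor are assumptions here, chosen because they are the most common; the function name `dft` is illustrative.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT: X[k] = sum over m of x[m] * exp(-2*pi*i*m*k/N)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


impulse_spectrum = dft([1, 0, 0, 0])   # flat spectrum, every entry close to 1
constant_spectrum = dft([1, 1, 1, 1])  # all energy in bin 0
```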
As mentioned in , there are two different approaches to representing the DFT: as matrix-vector products or as summations. Consequently, fast algorithms using parallel computing are expressed either in a matrix formalism, as in [10, 12–14], or with summations, as in most signal processing books. Below, the matrix formalism is introduced and used to express the Cooley-Tukey FFT algorithm, corresponding to the decomposition of the transform size into the product of two factors and ; that is, .
The matrix representation of the DFT of is , where such that . If , then the matrix formalism can be used to express as a factorization into matrices built from block matrix operations [10, 12, 13]:
Here, is a diagonal matrix containing the twiddle factors. We clarify that the superscript is an index, not a power. This factorization of is the matrix representation of the Cooley-Tukey FFT for . In addition, this representation of allows an implementation using parallel computing .
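In the cited references, the Cooley-Tukey factorization for a transform of size r * s is usually written in the form F_rs = (F_r ⊗ I_s) T (I_r ⊗ F_s) L, where L is a stride permutation and T is the diagonal twiddle matrix. The following sketch applies those four factors to a vector from right to left and can be checked against the direct DFT. The symbols r and s, the index conventions, and all function names are assumptions made for illustration, not the paper's notation.

```python
import cmath


def dft(x):
    """Direct DFT, used both as the small transform and as the reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def cooley_tukey(x, r, s):
    """DFT of length r*s computed stage by stage from the factorization."""
    n = r * s
    if len(x) != n:
        raise ValueError("len(x) must equal r * s")
    # Stage 1 (stride permutation): r blocks of length s, read at stride r.
    blocks = [x[a::r] for a in range(r)]
    # Stage 2 (I_r kron F_s): r independent length-s DFTs, one per block.
    blocks = [dft(block) for block in blocks]
    # Stage 3 (twiddle factors): scale entry k2 of block a by w_n^(a*k2).
    blocks = [[blocks[a][k2] * cmath.exp(-2j * cmath.pi * a * k2 / n)
               for k2 in range(s)] for a in range(r)]
    # Stage 4 (F_r kron I_s): s independent length-r DFTs across the blocks.
    out = [0j] * n
    for k2 in range(s):
        column = dft([blocks[a][k2] for a in range(r)])
        for k1 in range(r):
            out[k1 * s + k2] = column[k1]
    return out


signal = [complex(v) for v in range(6)]
fast = cooley_tukey(signal, 2, 3)
```

Stages 2 and 4 each consist of mutually independent small transforms, which is the structure a parallel implementation can exploit.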
3. DFT for Vector-Valued Signals
Based on [2, 6–9, 15, 16], the space of vector-valued discrete-time signals with samples is defined as The space is the set of -valued signals on ; that is, if and only if . Additionally, for each , , where and . Furthermore, if , then . Now, for , there is a kind of DFT for vector-valued signals called vector-valued DFT. This transform is defined as such that
where is the matrix kernel. Algorithm 1 shows the implementation of (5). This implementation is a sequential algorithm.
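A sequential computation of this kind can be sketched as follows. The sketch assumes that the transform has the form X(k) = sum over m of W^(m*k) x(m), with W a d x d kernel matrix, and it handles only a diagonal kernel whose entries are exp(-2*pi*i*s_j/N) for integers s_j. Both the form of the sum and the diagonal kernel are assumptions made for illustration; this is not a reproduction of Algorithm 1.

```python
import cmath


def vector_valued_dft(x, exponents):
    """Sequential vector-valued DFT sketch for a diagonal kernel.

    x: list of N samples, each a list of d complex numbers.
    exponents: list of d integers s_j defining the diagonal kernel entries.
    """
    n, d = len(x), len(exponents)
    diagonal = [cmath.exp(-2j * cmath.pi * s / n) for s in exponents]
    out = []
    for k in range(n):
        acc = [0j] * d
        for m in range(n):
            for j in range(d):
                # A diagonal kernel raised to the power m*k acts entrywise.
                acc[j] += diagonal[j] ** (m * k) * x[m][j]
        out.append(acc)
    return out


constant_signal = [[1, 1]] * 4
spectrum = vector_valued_dft(constant_signal, [1, 1])
```

With every exponent equal to 1, each coordinate reduces to an ordinary scalar DFT, so a constant signal concentrates in the first output sample.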
From the reviewed literature, there are two kinds of kernels for this transform: the first one is hypercomplex DFT kernel : where such that , and the second one is DFT frame kernel : where with for . It is called DFT frame kernel because , where is a DFT frame. In this paper, subsets are used, such that , although it does not represent a DFT frame.
Lemma 1. Let be a hypercomplex DFT kernel or DFT frame kernel. Then (1).(2).(3)If and , then .(4)If , then .
Proof. For the hypercomplex DFT kernel, the proof of each case is similar to the corresponding proof for th roots of unity. For the DFT frame kernel, is a diagonal matrix, and so the proof of each case is straightforward.
4. A Parallel Framework for
In this section, the main results of this paper are presented. First, a block matrix representation of the vector-valued DFT is given. Second, a new mathematical framework is derived from the matrix representation of the vector-valued DFT using a block matrix formalism (Theorem 2). This new result is inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, corresponding to the decomposition of the transform size into the product of two factors and , which is developed in [10, 12, 13]. The result obtained in Theorem 2 is then transformed into a new block matrix representation that contributes to the analysis, design, and implementation of parallel algorithms (Corollary 3). This new result is inspired by (3). Finally, a computational complexity analysis of the new algorithm is developed.
As with the DFT matrix representation explained in Section 2.2, there are two different approaches to representing the vector-valued DFT: as summations (see (5)) or as matrix-vector products. Both approaches allow a parallel implementation. In fact, the proof of Theorem 2 is developed using summation notation.
The vector-valued DFT can be presented as matrix-vector products. The block matrix representation of the vector-valued DFT of is defined as , where such that , for . We clarify that the superscript is an index, not a power. In this section, a block matrix factorization of is developed, and it is inspired by (3). First, a generalization of the stride permutation is defined. Let . The block stride permutation matrix [14, 17] is defined as such that , and, for each with blocks , the operation permutes each block of the input block as , , and .
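One natural reading of a block stride permutation is a stride permutation that moves whole blocks of length d instead of single entries. The sketch below takes that reading as an assumption, along with the stride convention (blocks are read at stride `s`); the name `block_stride_permute` is illustrative.

```python
def block_stride_permute(x, s, d):
    """Stride-permute a flat vector whose entries are grouped in blocks of d."""
    if len(x) % d != 0:
        raise ValueError("block length must divide the vector length")
    blocks = [x[i:i + d] for i in range(0, len(x), d)]
    n = len(blocks)
    if n % s != 0:
        raise ValueError("stride must divide the number of blocks")
    # Reorder whole blocks; entries inside each block keep their order.
    permuted = [blocks[j] for start in range(s) for j in range(start, n, s)]
    return [entry for block in permuted for entry in block]


flat = [0, 0, 1, 1, 2, 2, 3, 3]            # four blocks of length d = 2
shuffled = block_stride_permute(flat, 2, 2)
```

Here `shuffled` is `[0, 0, 2, 2, 1, 1, 3, 3]`: blocks 0 and 2 come first, followed by blocks 1 and 3.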
Theorem 2. Let and let be the block matrix of DFT for vector-valued signals. Then where such that .
Proof. Let , let , and let . The block vector is defined. Then Now, let . From Lemma 1, ; then Let . Then But ; then Let , let , and let because , , and . Then
Now, if , , and , the following equality  is obtained:
From Theorem 2 and (14), the following corollary presents a matrix factorization of such that it permits an implementation using parallel computing.
Corollary 3. Let and let be the block matrix of DFT for vector-valued signals. Then
where was defined in Theorem 2.
4.1. Computational Complexity Analysis
In this section, the computational complexity analysis of (15) is developed. First, consider the matrix operation . The computational complexity (CC) of is  because it is the multiplication between a block matrix in and a block vector in . But the operation can be implemented with a CC (see, e.g., [12, 14]).
Let be the block matrix and vector-valued signal , where . It is known that the CC of operation is . Now consider operation using (15). If we consider each matrix-vector multiplication, we obtain the following:
(1) The CC of is .
(2) The CC of is , because it is a block diagonal matrix multiplication.
(3) The CC of is , because is a diagonal matrix multiplication.
(4) The CC of is .
(5) The CC of is , because it is a block diagonal matrix multiplication.
(6) The CC of is .
Therefore, the CC of using (15) is
5. Implementation and Experimental Investigation
5.1. General Information
The investigations were carried out on a computer with a multicore processor: an Intel Core i7-3632QM CPU with 4 cores, a system clock of 2.20 GHz, and 8 GB of RAM. The experiment implements and tests Algorithms 1 and 2 with both the hypercomplex DFT kernel and the DFT frame kernel. Algorithm 1 does not use any parallel implementation, unlike Algorithm 2. A CAZAC signal in is used; it is generated using a Wiener CAZAC signal in  with and , where , , , , and .
Algorithms 1 and 2 for computing the vector-valued DFT are implemented in MATLAB, and Algorithm 2 uses the Parallel Computing Toolbox. MATLAB offers two forms of parallelism: built-in multithreading and parallelism with MATLAB workers. This work uses MATLAB workers. With the Parallel Computing Toolbox, multiple MATLAB workers (MATLAB computational engines) can run on a multicore computer to execute applications in parallel. This approach allows more control over the parallelism than built-in multithreading. The parallel MATLAB programs of the framework for the vector-valued DFT are written with programming constructs such as parallel for-loops (parfor) and batch.
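The worker-based pattern can be sketched outside MATLAB as well. The following Python analogue is not the paper's MATLAB code: it maps independent small DFTs over a pool of workers from the standard `concurrent.futures` module, in the same spirit as a parfor loop over blocks. A thread pool is used only to keep the sketch short; in CPython, pure-Python arithmetic in threads does not run simultaneously, so a process pool would be needed for an actual speedup. All names are illustrative.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(x):
    """Direct DFT of one block."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def parallel_block_dfts(blocks, workers=4):
    """Apply the same small DFT to every block, one task per block."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves the input order of the blocks in its results.
        return list(pool.map(dft, blocks))


results = parallel_block_dfts([[1, 0], [1, 1]], workers=2)
```

The tasks share no data, so no synchronization is needed beyond collecting the results in order.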
5.2. Results and Discussion
Let be the execution time of Algorithm 1 without any parallel implementation, and let be the execution time of Algorithm 2, where is the number of cores. The value of is expected to be less than that of for two reasons: Algorithm 2 has a parallel implementation, and the sizes of the matrix multiplications differ. Algorithm 2 is computed with matrices in and , whereas Algorithm 1 is computed with matrices in , where .
The computational performance analysis of Algorithm 2 is evaluated using the metrics speedup (or acceleration) and efficiency. The speedup is the ratio between the execution times of parallel implementations with one core and parallel implementations with two or more cores . The speedup is represented by the formula . The efficiency estimates how well utilized the processors are in solving the problem compared to how much effort is wasted in communication and synchronization . The efficiency is determined by the ratio between the speedup and the number of processing elements, represented by the formula .
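Both metrics follow directly from the verbal definitions above: speedup is the one-core parallel time divided by the p-core parallel time, and efficiency is the speedup divided by p. The sketch below uses made-up timings purely for illustration; they are not measurements from this paper, and the function names are hypothetical.

```python
def speedup(time_one_core, time_p_cores):
    """Ratio of the one-core parallel time to the p-core parallel time."""
    return time_one_core / time_p_cores


def efficiency(time_one_core, time_p_cores, p):
    """Speedup divided by the number of processing elements."""
    return speedup(time_one_core, time_p_cores) / p


# Hypothetical timings in seconds (illustrative only, not measured values).
t1, t4 = 8.0, 2.5
s4 = speedup(t1, t4)        # about 3.2
e4 = efficiency(t1, t4, 4)  # about 0.8
```

An efficiency of 1 would mean that every core is fully used; values below 1 reflect time lost to communication, synchronization, and the sequential parts of the computation.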
Table 1 shows the execution time, in seconds (s), of both algorithms. A significant reduction in the parallel execution time of the vector-valued DFT is observed. Table 1 shows that Algorithm 1 with the hypercomplex kernel for a Wiener CAZAC signal in produces a serial execution time of s. Using Algorithm 2, however, we obtain ( of ), s ( of ), s ( of ), and s ( of ). This result shows the advantage of using multicore processors and a parallel computing environment to reduce the high execution time of the vector-valued DFT. Parallel computing is a form of computation in which many calculations are carried out simultaneously [19, 20]; it operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to reduce the execution time [20, 21]. The difference between and arises because is computed with matrices in and , whereas Algorithm 1 is computed with matrices in , where .
Table 2 represents the speedup of Algorithm 2. The acceleration of the vector-valued DFT increases when increases regardless of the value of . The results show that, using the proposed parallel implementation with cores, where , the speedup to compute the vector-valued DFT of a Wiener CAZAC signal is , , and , respectively. These results imply that, to get the highest speedup, one should prefer the approach with four cores.
Table 3 reports the efficiency of Algorithm 2. The table shows that good efficiency (greater than 65%) is reached with . However, the efficiency of the vector-valued DFT decreases (down to 36%) when increases, regardless of the value of . This is attributed to a decrease in the share of simultaneous computation of the partial vector-valued DFT in Algorithm 2 (steps ()–() and ()–()), which is responsible for the main effect. The results in Table 3 imply that the approach with two cores should be preferred when efficiency is the priority, because it yields the highest efficiency.
6. Conclusions
This work presented a parallel framework for the vector-valued DFT of vector-valued discrete-time signals. The mathematical framework was inspired by the matrix representation of the Cooley-Tukey FFT algorithm for complex discrete-time signals, corresponding to the decomposition of the transform size into the product of two factors and , which is developed in [10, 12]. It was expressed in (15) and Algorithm 2. The parallel framework was formulated as a matrix representation using a set of block matrix operations: Kronecker product, direct sum, stride permutation, vec operator, and vec inverse operator. These operations contributed to the analysis, design, and implementation of the parallel algorithm. Two kernels were used in the vector-valued DFT: the hypercomplex DFT kernel and the DFT frame kernel.
The experimental investigation indicated that there is a benefit to using MATLAB with the Parallel Computing Toolbox on a computer with a multicore processor. First, multicore processors and a parallel computing environment reduced the high execution time (with the hypercomplex DFT kernel, we obtained s, , s, s, and s). Second, speedup increased when increased regardless of the value of , and good efficiency (above 65%) was obtained when .
As future work, we would like to extend the proposed parallel framework to vector-valued discrete-time signals in , where , using the idea of the Pease algorithm for complex discrete-time signals . Additionally, we would like to explore design tradeoffs of approaches beyond those shown in this paper, for example, the approach developed in .
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
This work was supported by Vicerrectoría de Investigación y Extensión of Instituto Tecnológico de Costa Rica.
J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, “A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures,” Circuits, Systems, and Signal Processing, vol. 9, no. 4, pp. 449–500, 1990.
D. Rodriguez, J. Seguel, and E. Cruz, “Algebraic methods for the analysis and design of time-frequency signal processing algorithms,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '93), vol. 1, pp. 196–199, IEEE, Chicago, Ill, USA, May 1993.
R. Tolimieri, M. An, and C. Lu, Algorithms for Discrete Fourier Transform and Convolution, Signal Processing and Digital Filtering, Springer, Berlin, Germany, 1997.
C. Van Loan, Computational Frameworks for the Fast Fourier Transform, Frontiers in Applied Mathematics, SIAM, 2012.
A. Saberi, A. Stoorvogel, and P. Sannuti, Internal and External Stabilization of Linear Systems with Constraints, Systems & Control: Foundations & Applications, Springer, Berlin, Germany, 2012.
A. Shirazinia, S. Chatterjee, and M. Skoglund, “Performance bounds for vector quantized compressive sensing,” in Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '12), pp. 289–293, October 2012.
R. Tolimieri, M. An, C. Lü, and C. Burrus, Mathematics of Multidimensional Fourier Transform Algorithms, Signal Processing and Digital Filtering, Springer, Berlin, Germany, 1997.
M. D. McCool, A. D. Robison, and J. Reinders, Structured Parallel Programming: Patterns for Efficient Computation, Morgan Kaufmann Publishers, Elsevier, 2012.
G. S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin-Cummings Publishing Company, 1989.
R. Trobec, M. Vajteric, and P. Zinterhof, Parallel Computing: Numerics, Applications, and Trends, Springer, 2009.
M. O. Tokhi, M. A. Hossain, and M. H. Shaheed, Parallel Computing for Real-Time Signal Processing and Control, Springer, Berlin, Germany, 2003.