Abstract

The enumeration of two-dimensional Costas arrays is a problem with factorial time complexity and has been solved for sizes up to 29 using computer clusters. Costas arrays of higher dimensionality have recently been proposed and their properties are beginning to be understood. This paper presents, to the best of our knowledge, the first proposed implementations for enumerating these multidimensional arrays in GPUs and FPGAs, as well as the first discussion of techniques to prune the search space and reduce enumeration run time. Both GPU and FPGA implementations rely on Costas array symmetries to reduce the search space and perform concurrent explorations over the remaining candidate solutions. The fine grained parallelism utilized to evaluate and progress the exploration, coupled with the additional concurrency provided by the multiple instanced cores, allowed the FPGA (XC5VLX330-2) implementation to achieve speedups of up to 30× over the GPU (GeForce GTX 580).

1. Introduction

A two-dimensional Costas array (2DCA) of size is a permutation such that for every integer , , and , where and , , , implies that . Thus, informally, a size Costas array is defined as matrix containing exactly dots, where every row and column contain exactly one dot and the vectors joining each pair of dots are distinct. Figure 1 illustrates a Costas array of size , both as a matrix and a permutation. The figure also shows the array’s difference triangle, which organizes the differences between the various permutation digits in rows where each row corresponds to a fixed . By definition, each row in the difference triangle of a Costas array must consist of unique numbers.

Their definition implies that Costas arrays have perfect autocorrelation (autocorrelation = 1), which makes them useful in communications where signals must be recoverable even in the presence of considerable noise. Costas arrays are useful in many security and communication applications, such as remote object recognition and optical communications [1]. Furthermore, some special cases of Costas arrays can be used for digital watermarking [2].

Costas arrays with dimensions higher than two were introduced in 2008 by Drakakis [3]. These arrays maintain perfect autocorrelation, which broadens their applicability in optical communications, for example, 3D optical orthogonal codes [4]. Multidimensional periodic Costas arrays (MPCAs) over elementary Abelian groups, introduced by Moreno and Tirkel, add the property of being periodic over all dimensions. This extends their applicability to digital watermarking of video and combined video and audio, where higher-dimensionality codes are desired [2]. A formal definition and some of their properties are presented in Section 3. In this paper, we focus on the latter kind of multidimensional periodic Costas arrays (MPCAs) due to their richer application range.

The enumeration of 2D Costas arrays has been a topic of interest since their discovery by Costas in the 1960s [5]. With each new size enumerated, new properties and generation techniques may be discovered [6]. Ortiz-Ubarri et al. presented MPCA transformations and their first enumeration in [7]. Given their relatively new discovery, it is expected that the enumeration of MPCAs will, as with 2DCAs, help researchers improve their understanding.

Both 2DCAs and MPCAs can be generated using algebraic constructions based on finite fields like the Welch and Lempel-Golomb constructions [7]. Small sizes can be enumerated using hand computation, yet the only known method to guarantee complete enumeration is by exhaustive exploration. The search space for complete enumeration of 2DCAs grows factorially with , thus computer-based exploration is the only practical approach for medium and large sizes. The most common approach for the enumeration of 2DCAs is to use a backtracking algorithm that incorporates symmetries and other observations to further prune the search space. This paper presents the first discussion of search space pruning techniques for MPCAs. Both the FPGA and GPU implementations utilize these methods to significantly reduce the search space. Nevertheless, the worst case timecomplexity is still factorial, requiring tremendous run times even for small cases of and (the dimension).

This paper discusses our implementations for the enumeration of -dimensional Costas arrays in GPUs and FPGAs and constitutes the first description of such an enumeration. We present the techniques chosen to prune the search space as well as the organization of our designs. Our FPGA implementation achieved speedups of up to 30 times faster than the GPU. Furthermore, the modules that were created as part of the design process can easily be adapted to other constraint satisfaction problems that use backtracking.

The rest of this paper is organized as follows. Section 2 presents the relevant previous work, while Section 3 defines -dimensional Costas arrays and some of their symmetries. Section 4 describes the backtracking algorithm and its use for the enumerations. Section 5 introduces several techniques that allow us to prune the search space during the enumerations. Sections 6 and 7 discuss the GPU and FPGA designs, respectively. Section 8 reports and discusses the results and Section 9 provides our conclusions.

2. Previous Work

Most recent enumerations of 2DCAs have been completed using general purpose processors [8, 9]. The latest enumerations have been achieved by deploying many computer clusters (in all, thousands of cores) to concurrently explore disjoint parts of the search space. For and , the time per single CPU was determined to be 70 and 366.55 years, respectively.

To the best of our knowledge, the only reported FPGA-accelerated 2DCA enumerations have been [10, 11]. Devlin and Rickard implemented sizes through on a Xilinx Spartan-3 XC3S1000 running at 25 Mhz and extrapolated their results to sizes up to using the same device as well as the Virtex-5 XC5VLX110 [10]. The extrapolated execution times for and were and years, respectively. Arce-Nazario and Ortiz-Ubarri compared the execution of 2DCA enumeration in an FPGA (Virtex 5-XC5VLX330-2) and a GPU (GeForce GTX 580). The FPGA implementation achieved speedups of up to 40× over the GPU and 4.44× over the fastest reported software implementation [9].

The first generalizations of Costas arrays were given by Drakakis in [12]. He defined various classes of multi-dimensional arrays by modifying the Costas property constraints and presented examples. Moreno et al. defined the -dimensional periodic Costas arrays over elementary Abelian groups [13] and described algebraic Welch Costas constructions. More recently, Ortiz-Ubarri et al. presented symmetries over the MPCAs that allow the expansion of the families discovered by Moreno [7]. They reported the first enumeration for sizes , , and , yet no details are offered regarding enumeration or solution space pruning techniques.

3. -Dimensional Costas Arrays

We begin by providing the definitions of the -dimensional periodic Costas arrays over elementary Abelian groups.

Definition 1. A generic -dimensional periodic Costas array over the elementary Abelian group is a permutation function , where means -. This function has the distinct difference property: for any , , , implies , where the addition and subtraction operations are performed in the corresponding Abelian group.

Remark 2. Since, by definition, the -dimensional periodic Costas arrays over the elementary Abelian group are fully periodic; the periodic shifts of an MPCA on any of its dimensions result in a different -dimensional periodic Costas array over the elementary Abelian group.

Example 3. The following is a grid defined over :

As a shorthand method, we may also enumerate the elements in a Costas array using the index mapping

The distinct difference property can be verified, in a manner similar to 2DCAs, by using difference matrices. The differences for a array are organized into matrices. For instance, for , each difference matrix is and contains at each cell the result of . The cells for differences that involve position are represented using *.

Figure 2 shows a Costas array along with its corresponding difference matrices. For example, cell of the difference matrix contains the difference between 5. A MPCA satisfies the distinct difference property if each of its difference matrices contain each number in - exactly once.

3.1. Addition and Multiplication (Modulo ) Symmetries

Two algebraic symmetries introduced by Moreno et al. can be used to significantly reduce the search space for MPCA enumeration [13].

Theorem 4. Multiplication (modulo ) of a periodic Costas array by an integer less than and relatively prime to generates a new periodic Costas array.

Example 5. Multiplying , the array in Figure 2, by yields the following MPCA:

Theorem 6. Addition (modulo ) of any integer less than to a periodic Costas array generates a new periodic Costas array.

Example 7. Adding 3 to yields the MPCA:

4. Backtracking

Backtracking is a general algorithm for solving a computational problem by incrementally generating all possible solutions. The execution of a backtracking algorithm can be modelled as a search tree where every node is a partial solution. Moving forward corresponds to approaching a valid solution, and going backward corresponds to abandoning a partial candidate that cannot possibly generate a valid solution.

For the purpose of this discussion, we define a subpermutation , where . For MPCAs, . The next subpermutation of size of , where , expressed as is defined as the next subpermutation in lexicographical order of size that conserves the first elements.

Example 8. For , let , the next subpermutation of size 4, . The next subpermutation of size 5, . The next subpermutation of size 3, .

Example 9. For , let . , that is, there is no next subpermutation beginning with . . .

Algorithm 1 illustrates the backtracking algorithm used for enumerating all MPCAs in given a seed permutation . Figure 3 illustrates the steps in the computational tree of the backtracking approach, given the seed for .

Inputs: : MPCA dimensions
    : initial permutation
Output: Displays all MPCAs in
Perm = Perminit
= length of Perminit
while ( >= length of Perminit){
 if (IsCostas (Perm) {
  if ( == ) {
   display Perm
   do {
     −−;
   } until (Next (Perm, )!= empty)
  }
  else
    ++; //explore a deeper level
 }
 else {
  // backtrack
  do {
    −−;
  } until (Next (Perm, )!= empty)
 }
 Perm = Next (Perm, );
}

5. Techniques for Pruning the Search Space and Efficient Evaluation of Candidate Arrays

MDCA symmetries can be leveraged to reduce the search space in their enumeration. For instance, it can be deduced from Theorems 4 and 6 that backtracking exploration must proceed only through permutations lexicographically smaller than . (1)MPCAs with the in any position other than are generated by periodic shifts of the arrays with the in position . These include all the geometric symmetries (horizontal flip, vertical flip, and rotations) of the MPCAs.(2) Any permutation starting with , , and followed by a , can be obtained by multiplying a lexicographically smaller subpermutation by , that is, using Theorem 4. For example, for , any permutation including and higher than can be obtained by multiplying a smaller permutation by , for instance, by multiplying by .(3)Any permutation starting with , followed by a nonzero element can be obtained by adding to a permutation that starts , that is,using Theorem 6. For example, is obtained from the addition of to .

Thus, the required exploration is reduced from to approximately permutations in the worst case.

5.1. Evaluation of Candidate Arrays

Computationally, we determine if a permutation is an MPCA by using the distinct difference property. In the backtracking algorithm, every time the algorithm moves forward by adding a new element to the permutation, the new differences generated by subtracting and are added to the corresponding arrays. Thus, the difference arrays fill up as the permutation in the backtracking tree grows and deplete as the backtracking algorithm moves backward.

From the MPCA definition it can be deduced that only half of the difference matrices must be maintained. To understand this, let us define the negative of a distance vector.

Definition 10. Let be a distance vector of -dimensional periodic Costas array over the elementary Abelian group . The negative of the distance vector , expressed as , is defined as . In other words, is a vector in the direction then is a vector of same length in the opposite direction.

Example 11. The negative of the difference vector over the elementary Abelian group is .

The difference matrix for a distance vector accumulates the negatives of the differences collected by the matrix for the distance vector . This is illustrated in Figure 4. The current subpermutation being evaluated has produced the differences shown in the matrices. Notice how the differences in and are the negatives of each other, for example, −4 = 4, −1 = 7, −5 = 3, and −2 = 6. The same behavior is obtained for the rest of the and pairs, for example, and , and , and and . Therefore, we only need to keep track of either the or of each pair, that is, the other matrix contains redundant information.

Furthermore, using the index mapping , we can demonstrate that the matrices can be completed by computing all , where .

Theorem 12. The differences , , where complete all the differences matrices.

Proof. Without loss of generality, we consider , that is, . The matrix collects all differences . If then . Else, , that is, which implies one of two cases.(1). This implies that , in which case, for the negative, and . Thus, the negative of this case can be found using a difference covered by .(2) and . This implies that , in which case, for , and . Thus, the negative of this case can be found using a difference covered by .

Example 13. For the difference matrix the computation for , that is, can be obtained from , that is, .

In our implementations, the difference matrices are managed as follows.(i) A hash table of size is used for each of the matrices to keep track of its differences.(ii) Whenever the permutation length increases ( to ), the differences between the last added digit and the rest of the digits are computed and compared to the contents of the corresponding hash tables. A hit in any of the tables indicates a repeated difference and hence a non-Costas permutation. Otherwise, the differences are registered in the corresponding tables.(iii) Whenever the permutation length decreases ( to ), the differences between the dropped digit and the rest of the digits are deleted from the hash tables.

6. GPU Design

We perform our computations in a GeForce GTX 580 with 16 multiprocessors, each with 32 cores at 1.54 GHz clock rate, 1.5 GB of global memory, and 48 K of shared memory using the CUDA parallel computing architecture. Similar to many other computational problems where GPUs are used to accelerate algorithms in parallel, our implementation essentially decomposes the enumeration into many disjoint subspaces, which are deployed as threads to the GPU. Figure 5 illustrates the workflow of the GPGPU implementation. The Host (CPU) generates a set of subpermutations of size that comply with the Costas property. The set is then passed to the Device (GPU), where for each subpermutation a thread is generated to complete the exploration, that is, each thread determines all (if any) Costas arrays that begin with its given subpermutation. While the threads are executing, the Host is generating the next set of subpermutations. When all threads complete, the results are passed to the Host, and the next set of subpermutations is transferred to the GPU.

Two quite similar versions of Algorithm 1 run in the Host and each of the threads of the GPU. In the Host, Algorithm 1 is used to generate all subpermutations of length compliant with the Costas property. As each subpermutation is found it is added to an array of size of the data type shown in Algorithm 2. When the array is full it is copied to the GPU global memory and the CUDA threads are deployed to process the copied subpermutations.

typedef struct{
 char counter;
 char subcostas[ ];
 char costas[MAX_COSTAS_PT][ ];
} CostasData;

Each CUDA thread runs a version of Algorithm 1 that takes one of the subpermutations as and copies any found Costas arrays back to the GPU global memory. When all the GPU threads are done, the found Costas arrays of size are copied to the Host. The number of subpermutations () to be passed to the GPU is determined by the number of cores of the GPU. In our experiments we obtained the best performance with number  of  cores 128.

7. FPGA Design

Many FPGA implementations obtain their impressive performance by exploiting fine-grained parallelism through deep, custom pipelines. However, backtracking is in essence a serial process, that is, generates permutation, then evaluates, then accepts or backtracks. Given this scenario, we opted to implement a highly tuned, low-resource serial MPCA enumeration (MPCAEn) core that can be instantiated many times inside the FPGA. Hence, the acceleration provided by our design comes mainly from two factors: (a) the rapid generation of candidate permutations and their evaluation within each core and (b) the high number of cores working in parallel on subsets of the enumeration.

7.1. Backtracking Functionality and Array Evaluation

Figure 6 illustrates the basic blocks of our MPCAEn core. A shift register is used for constructing or reversing the candidate permutation. The candidate permutation is constructed by shifting left and concatenating a new digit in the right-most position. As a permutation is generated, its compliance with the Costas array definition is verified by the Costas evaluation block. If the candidate complies, the next permutation is generated by shifting left and concatenating the lowest available digit to the right-most position. If not, then one of two cases may occur. (1) There is an available digit that is higher than the rightmost digit. In this case, the rightmost digit is substituted by . For example, is evaluated and does not comply. Since the available digits are 3, 5, and 7, then 6 is substituted by 7 to compose the next candidate permutation .(2)There are no digits available higher than the right-most digit. In this case the shift register is shifted right and the digit that is shifted out is added to the available digits. This is repeated until there is an available digit that is higher than the current rightmost digit, at which case we perform the substitution described in the first case. For example, is determined to not comply with the MPCA difference property; the available digits are , and but none of them is higher than so the permutation must backtrack to . The available digits are now , and , thus is substituted by to form the next permutation .

Figure 7 illustrates the operation of the MPCA evaluation block. When a new digit is added to the current permutation, for example, in the figure, the differences between the new digit and the rest are computed. The negatives of the differences are also computed, since they might be used to update some of the difference matrices (as explained in Section 5.1). Depending on the index of the newly added digit, the encoder/mux block routes the results of the differences to the corresponding hash tables. For the example in the figure, the result from corresponds to , that is, ; thus it will update the hash table for the difference matrix . When is added to the permutation, corresponds to , that is, ; thus its negative will be used to update the hash table for the difference matrix .

For 2DCAs, the deeper rows of the difference triangle are responsible for the detection of only a small percentage of the rejected arrays, as confirmed in [10, 11]. Our enumeration core for MPCAs is parameterizable in the number of distance vectors that are used to evaluate whether an array is an MPCA. The designer may choose to implement less than distance vectors to save FPGA resources. If so, the few false positives that will come out from the FPGA enumeration can be eliminated in software once they are communicated to the general purpose processor (GPP).

7.2. Resource Utilization and Core Organization

Figure 8 shows the amount of FPGA LUTs required by the MPCAEn core for and , implementing various amounts of vector differences. LUTs are the most strained resource in the MPCAEn implementation (versus registers). The resource utilization results were obtained from the synthesis process using Xilinx ISE 11.5 targeting a Virtex5-XC5VLX330-2 FPGA. It was found through experimentation that keeping track of more than 7 differences for and 14 differences for did not significantly reduce enumeration times. The enumerators that were implemented for obtaining the results used those amounts of differences, for example, 7 for and 14 for .

Since resource utilization per enumerator is low, multiple MPCAEn modules were instantiated in the FPGA as illustrated in Figure 9. The transfer of subpermutations and collection of results is performed by the transfer/collector module, which is connected through low-width data lines to the enumerators in order to save connection resources. MPCAs are so scarcely found during the enumeration that, regardless of the bandwidth between enumerators and transfer/collector module, these connections never become a bottleneck.

8. Results and Discussion

To compare GPU and FPGA performance, we implemented sizes , , and of our MPCA enumeration designs in GPU (GeForce GTX 580) and FPGA (one Xilinx XC5VLX330-2 device of the four provided in a Convey HC-1 Server). Table 1 shows the number of found MPCAs starting with , the execution times for both designs as well as the growth rate as a function of , and the speedup of FPGA versus GPU. Results for and are wall clock times while is the worst case estimation of the run time based on sample runs.

We attribute the speedup mainly to the fact that the FPGA implementation was able to exploit the following two levels of parallelism, whereas the GPU could only make use of the highest level: (1)coarse-grained level parallelism, which is achieved by splitting the solution space into multiple disjoint sets;(2)fine-grained parallelism at the level of individual candidate evaluations, that is, the operations for the evaluation of each (sub)permutation are performed in parallel (as discussed in Section 7 and illustrated in Figure 7). An analogous, low-level technique could be used in general purpose processors by utilizing SIMD instructions. However, CUDA programs compile to the PTX instruction set, which does not contain SIMD instructions.

The GPU was greatly overshadowed by the data transfer times between FPGA/GPP; therefore, we only consider for fair comparisons the cases and . For these larger cases we observe FPGA versus GPU speedups similar to those reported for 2DCAs in [11]. The enumeration of MPCAs exhibits a slower growth rate (approximately 3) per additional permutation digit as compared to the reported for two-dimensional Costas arrays (approximately 5) [8, 11]. We conjecture that the reason for the slower growth rate is that the conditions imposed in the MPCA definition are more strained, thus eliminating more candidate subpermutations earlier than the case of 2D Costas arrays.

All enumerated MPCAs turned out to be either Welch Costas constructions as presented in [13] or their symmetries introduced in [7], that is, no spurious MPCAs were found similar to some sizes of 2DCAs. These results support the conjecture that MPCAs (of all sizes and dimensions) are fully characterized by multidimensional Welch Costas arrays along with their symmetries.

9. Conclusions

In this work, we presented designs for the enumeration of multidimensional periodic Costas arrays in GPUs and FPGAs. Also, we introduced several MPCA symmetries and showed how they are used in our designs to significantly prune the search space. Both GPU and FPGA implementations rely on the concurrent exploration of multiple disjoint areas of the search space. In the GPU implementation, hundreds of threads are deployed to complete the search using the many GPU cores. For the FPGA, a multidimensional periodic Costas arrays enumeration core was designed taking into consideration pruning techniques while maintaining a low use of logic resources. Multiple cores were instantiated in the FPGA to provide further acceleration. The fine grained parallelism utilized to evaluate and progress the exploration, coupled with the additional concurrency provided by the multiple cores allowed the FPGA implementation to achieve speedups of up to 30× over the GPU. The implementations completed the first reported enumeration for MPCAs. Furthermore, the MPCA properties and symmetries discovered throughout the process help improve our understanding of these new structures.

Acknowledgments

Dr. R. A. Arce-Nazario was partially supported by NSF Grant number CNS-0923152. Dr. J. R. Ortiz-Ubarri was partially supported by the UPR-RP FIPI funds. Thanks are due to Glen Edwards of Convey Computer for his engineering support in the HC1 implementations and Jonathan Vélez for his help in the GPU implementations.