Abstract
The N-Queens problem plays an important role in academic research and practical applications. Heuristic algorithms are often used to solve variant 2 of the N-Queens problem. During the solving process, the evaluation of candidate solutions, namely, the fitness function, often accounts for the vast majority of the running time and is therefore the key to improving speed. In this paper, three parallel schemes based on the CPU and four parallel schemes based on the GPU are proposed, and a serial scheme is implemented as the baseline. The experimental results show that, for a large-scale N-Queens problem, the coarse-grained GPU scheme achieves a maximum 307-fold speedup over its single-threaded CPU counterpart in evaluating a candidate solution. When the coarse-grained GPU scheme is applied to simulated annealing for solving N-Queens problem variant 2 with a problem size of no more than 3000, the speedup is up to 9.3.
1. Introduction
The Eight Queens problem was first proposed by Max Bezzel in a Berlin chess magazine in 1848 [1]. The original question was how to place eight queens on the chessboard so that they cannot attack each other. If the number of queens is generalized, it becomes the famous N-Queens problem. The N-Queens problem has many applications in real-world and theoretical research, such as artificial intelligence, graph theory, circuit design, air traffic control, data compression, and computer task scheduling [2]. The only input of the N-Queens problem is the number of queens. Depending on the requirements, the output can be the number of solutions or the sequence of each solution. Since the problem has been proved to be NP-hard, the only way to obtain the number of solutions is to find these solutions. Therefore, the amount of calculation required to obtain these two types of outputs is the same, and the only difference is whether each solution is saved during the calculation. There are three variants of the N-Queens problem according to the demand for the number of valid solutions:
Variant 1: finding one valid solution for each problem size
Variant 2: finding a set of different valid solutions for a larger problem scale
Variant 3: finding all valid solutions if the problem scale is small enough
Variant 1 was solved in 1969. Hoffman [3] proposed a construction method by analyzing the inherent mathematical laws of the N-Queens problem. This method can obtain a valid constructive solution within the time complexity of . However, this construction method yields only one fixed valid solution for each value of N, so it is not general.
Variant 3 was mathematically proven to be NP-hard, and there is no known deterministic polynomial-time algorithm that finds all valid solutions. For variant 3 with a small N value, brute force, backtracking, and tree-based search algorithms can be used to get all valid solutions. Somers [4] solves the problem for by a recursive backtracking algorithm; Kise et al. [5] solve the problem for using MPI (Message-Passing Interface) on general-purpose processors; Caromel et al. [6] solve the problem for by using a heterogeneous grid of 260 computers; and Preußer et al. [7] solve a 26-queens problem in 270 days using a cluster of 26 FPGAs (field programmable gate arrays); to the best of our knowledge, this is the current world record.
Given the performance of current computers, it is not realistic, in terms of running time and storage space, to find and save all valid solutions of the large-scale N-Queens problem. Therefore, for N-Queens problem variant 3, some researchers study the case of N greater than 26 by building devices with higher computing power, while others design new parallel algorithms to accelerate the case of .
Before dynamic parallelism became available on GPUs, the usual way to solve a combinatorial optimization problem with a GPU-based backtracking algorithm was to divide the work into two steps: first, precalculated subsearch trees are generated on the CPU, and then these subtrees are assigned to GPU threads to complete the search [8–10]. Amrasinghe et al. [11] use the NVIDIA Cg language to solve the N-Queens problem on GeForce 6800 hardware, but the performance of their algorithm was shown to be lower than that of a Pentium M CPU running at 2.0 GHz. Pamplona et al. [12] design an N-Queens problem solver running on a GeForce 9600 by using CUDA 1.0. The performance of their algorithm is also lower than that of the C++ implementation on an Intel quad-core processor running at 2.4 GHz. Feinbube et al. [10] port Somers’ algorithm to the GPU for parallelization and apply four optimization methods, such as using shared memory, to improve its performance. Their parallel algorithm is used to speed up the solving process of N-Queens problems with sizes ranging from 14 to 17 on GPU devices with compute capabilities of 1.1 and 1.3 (GTX275, GTX295, NVS295, and GeForce8600M). Zhang et al. [13] optimize a GPU-based N-Queens solver by increasing the L1 cache configuration, reducing shared memory bank conflicts, balancing thread load, etc., and obtain more than a 10-fold speedup with the number of queens ranging from 15 to 19 on a GTX480. Thouti et al. [14] use the OpenCL programming model to analyze and solve the issues of atomicity and synchronization and obtain a speedup of 20X with the number of queens between 16 and 21 on the Quadro FX 3800. Plauth et al. [15] use CDP (CUDA dynamic parallelism) technology to solve the N-Queens problem. In their scheme, the kernel in each layer of the CDP recursive stack is responsible for one or more rows of the chessboard.
Plauth et al. use this scheme to solve the N-Queens problem with problem sizes between 8 and 16 on a Tesla K20, and the experimental results show that the performance of the scheme is lower than that of Feinbube’s non-CDP GPU scheme and, in some cases, even lower than that of Somers’ serial scheme. Carneiro et al. [16, 17] use a CDP-based backtracking algorithm to solve N-Queens problem variant 3 with sizes ranging from 10 to 17. They concluded that CDP is less dependent on parameter tuning but that, because of the high cost of dynamically launched kernels, the CDP-based scheme is outperformed by a non-CDP bitset-based implementation with well-tuned parameters and by its multicore counterparts.
The methods used to solve variant 3 can also be used to solve variant 2 to obtain a set of different solutions, but the efficiency is very low because these algorithms need to ensure that every position in the solution space is searched without omission. In addition, because these algorithms search the whole solution space in a fixed order, the solutions they find are likely to be located close together in the solution space, so the diversity of the solutions is limited. Therefore, variant 2 is usually solved with heuristic and randomized algorithms.
The output of variant 2, a set of valid solutions for the large-scale N-Queens problem, can be used in scientific research and practice. For example, if we want a neural network to be able to generate N-Queens layouts with zero or few conflicts, the output of variant 2 can provide a set of solutions as training samples. In circuit design, for reasons of signal interference, or in arts and crafts, simply for the sake of beauty, it may be required that devices or patterns not be placed in the same row, column, or diagonal. In this case, the output of variant 2 can meet this requirement.
Variant 2 can be regarded as a permutation-based combinatorial optimization problem or a constraint satisfaction problem, and researchers often obtain valid solutions by using randomized and heuristic algorithms. The solving process is to generate a group of random solutions first, then guide these candidate solutions to evolve in a better direction through various heuristic information, and finally obtain the optimal solution. Hu Xiaohui et al. [18] use an improved PSO (Particle Swarm Optimization) algorithm to obtain part of the valid solutions with ; Jafarzadeh et al. [19] use PSO and SA (Simulated Annealing) to obtain part of the valid solutions with ; Zhang et al. [20] use the CRO (Chemical Reaction Optimization) algorithm to solve an eight-queens problem; Hu Nengfa et al. [21] use a simplified GA (Genetic Algorithm) to calculate a valid solution for N = 500 in 45 seconds; Zhang Buzhong et al. [22] implement an operator-oriented parallel genetic algorithm on a multicore platform for solving in 20655 seconds; Turky et al. [23] use the genetic algorithm to obtain a valid solution with in 9123 seconds; Wang et al. [24] use a four-core i5 processor to implement the parallel genetic algorithm and achieve a speedup of 2.8 compared with a serial counterpart when the problem scale reaches 512; and Cao et al. [25] construct a two-level parallel genetic algorithm based on a GPU cluster, which expands the N-Queens solution scale to 10000 in an acceptable time.
Those heuristic algorithms need to evaluate candidate solutions generated in the search process. The number of queens with conflicts in the candidate solutions is an appropriate evaluation criterion. The conflict number can be calculated with the following formulas:
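One form consistent with the permutation encoding and the pairwise comparisons described later in the paper is given below. It is stated here as an assumption rather than as the authors' exact notation: $Q_i$ denotes the row of the queen in column $i$, and, because $Q$ is a permutation of $1,\dots,N$, only diagonal conflicts can occur.

```latex
\[
  f(Q) \;=\; \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} c_{ij},
  \qquad
  c_{ij} \;=\;
  \begin{cases}
    1, & \text{if } \lvert Q_i - Q_j \rvert = \lvert i - j \rvert,\\
    0, & \text{otherwise.}
  \end{cases}
\]
```

Under this reading, a valid solution is one with $f(Q) = 0$.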
This calculation process is usually encapsulated as an evaluation function. In other works, it is also called the cost function, objective function, or fitness function. Because every pair of queens has to be compared, this function has a time complexity of O(N^2), and it is highly parallel. Because it needs to be executed repeatedly, this function takes up a large share of the running time of the heuristic algorithm.
Simulated annealing [26] is a kind of heuristic algorithm which simulates the process that metals tend to be stable during heating and cooling in metallurgy. Simulated annealing algorithm has been proved to be asymptotically convergent and can converge to the global optimal solution with probability 1 under the condition that the initial temperature is high enough, the cooling speed is slow enough, and the termination temperature is low enough. Because the simulated annealing algorithm is simple and has few control parameters, we use this algorithm to demonstrate the acceleration effect of parallelization of evaluation function on the whole algorithm.
Because the time cost to ensure the algorithm converges to the valid solution with probability 1 is too high, we use the parameters in Table 1 to get the result with a probability higher than 0.5. With this set of parameters, the algorithm can get a valid solution in a few hours. The average running time of the algorithm is 9443 seconds with problem size 3000. We use to shuffle the sequence from 1 to N to get the initial random solution.
Figure 1 depicts the running time proportion of the evaluation function in the whole simulated annealing algorithm for different N values.
It can be seen from Figure 1 that the larger the N, the higher the proportion of the running time spent in the evaluation function. When N is greater than 700, the proportion of the evaluation function exceeds 99%. Therefore, for heuristic algorithms based on search and evaluation, such as simulated annealing, the genetic algorithm, and chemical reaction optimization, accelerating the evaluation function is the key to improving the speed of the whole algorithm in solving a large-scale N-Queens problem.
Since the evaluation function is highly parallel and consists of simple operations, it is very suitable for the GPU (graphics processing unit), which uses the SIMT (Single Instruction Multiple Threads) model. The GPU was originally designed for graphics applications and is optimized for high throughput by allocating more transistors to compute units instead of caches, prediction units, etc.
Our objective is to speed up search- and evaluation-based heuristic algorithms in solving variant 2 of the N-Queens problem by accelerating the fitness function. The main acceleration method is the thread-level parallel technology of the CPU and GPU. In this paper, we propose four GPU-based parallel schemes by using CUDA 8.0 [27] parallel technology with different parallel granularities and three CPU-based parallel schemes by using C++ multithreading technology, the Intel TBB library, and the Java Fork/Join framework. The performance of these schemes is verified through experiments. The scheme with the highest speedup is applied to the simulated annealing algorithm for accelerating N-Queens problem variant 2.
The organization of this paper is as follows. In Section 1, we introduce three variants of the N-Queens problem and related research. The significance of improving the performance of the evaluation function is also discussed here through an experiment. In Section 2, one CPU-based serial scheme, three CPU-based parallel schemes, and four GPU-based parallel schemes for realizing the evaluation function are proposed. In Section 3, we compare the performance of the first seven schemes, and the eighth scheme is discussed separately. In Section 4, the validity of the GPU-based coarse-grained scheme is verified with the simulated annealing algorithm. Section 5 discusses future work and concludes this paper.
2. Parallel Schemes of Fitness Function
The N-Queens problem is a two-dimensional optimization problem. In order to facilitate mutation, crossover, synthesis, splitting, and other operations in the evolution process of the heuristic algorithm, the candidate solution is usually encoded with integers and expressed as a one-dimensional array or vector. The subscripts of the array or vector are used as abscissas, and the element values are used as ordinates. For example, we use array to represent a candidate solution and variable N to store the array length.
The initial solution is obtained by shuffling the number sequence from 1 to N with function. After the heuristic algorithm improves these initial solutions according to the heuristic information, the candidate solutions are sent to the fitness function. The number of conflicts calculated by fitness function is returned to heuristic algorithm as evaluation result. This process is often repeated many times.
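A minimal sketch of generating such an initial candidate follows. It assumes `std::iota` and `std::shuffle` from the C++ standard library and a Mersenne Twister generator; the function name `random_candidate` is hypothetical, and the paper does not state which shuffle routine or random engine was used.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Build a random candidate solution: a permutation of 1..n, where
// element i holds the row of the queen placed in column i.
// (Hypothetical helper; the choice of engine is an assumption.)
std::vector<int> random_candidate(int n, std::mt19937& rng) {
    assert(n >= 0);
    std::vector<int> q(n);
    std::iota(q.begin(), q.end(), 1);        // 1, 2, ..., n
    std::shuffle(q.begin(), q.end(), rng);   // uniform random permutation
    return q;
}
```

Because the result is always a permutation, no two queens share a row or a column, so the fitness function only has to look for diagonal conflicts.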
2.1. CPU Single-Threaded Scheme
With a single thread, the scheme has to compare all pairs of queens sequentially. This scheme is described in Algorithm 1 and is used only as a baseline to calculate the speedup of the parallel schemes.
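The sequential comparison can be sketched as follows. This is an illustration under the assumption that the candidate is a permutation (so only diagonal conflicts are checked) and that conflicts are counted per pair; the function name `count_conflicts_serial` is hypothetical and is not taken from Algorithm 1.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Count conflicting pairs of queens in a permutation-encoded layout.
// q[i] is the row of the queen in column i.
long long count_conflicts_serial(const std::vector<int>& q) {
    const int n = static_cast<int>(q.size());
    long long conflicts = 0;
    for (int i = 0; i < n - 1; ++i) {
        for (int j = i + 1; j < n; ++j) {
            // Two queens share a diagonal when their row distance
            // equals their column distance.
            if (std::abs(q[i] - q[j]) == j - i) {
                ++conflicts;
            }
        }
    }
    return conflicts;
}
```

The double loop visits each of the N(N-1)/2 pairs exactly once, which is the work that the parallel schemes divide among threads.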

2.2. CPU-Adaptive Multithreaded Scheme
We use the class introduced in C++11 to implement a CPU multithreaded scheme; the task of the thread function is described in Algorithm 2. In order to avoid the high cost of atomic operations, we design a counter array with the same length as the number of threads, in which each element stores the number of conflicts calculated by the corresponding thread. After all threads have finished, STL function is used to get the total number of conflicts over all threads. Algorithm 3 describes the process of accumulating all conflicts. Limited by the number of cores, the most appropriate number of CPU threads is often less than 100, so the summation of the array can be performed quickly.
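The per-thread counter idea can be sketched as below. This is a sketch under stated assumptions rather than the paper's Algorithms 2 and 3: it assumes `std::thread` as the C++11 thread class and `std::accumulate` as the summation routine, and it assigns columns to threads in an interleaved fashion, which is an illustrative choice. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <numeric>
#include <thread>
#include <vector>

// Multithreaded conflict count with one private counter slot per thread,
// so no atomic operations are needed while counting.
long long count_conflicts_threads(const std::vector<int>& q,
                                  unsigned num_threads) {
    assert(num_threads > 0);
    const int n = static_cast<int>(q.size());
    const int stride = static_cast<int>(num_threads);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&q, &partial, n, stride, t] {
            long long local = 0;
            // Thread t handles columns t, t + stride, t + 2*stride, ...
            // (interleaving is an assumption made for load balance).
            for (int i = static_cast<int>(t); i < n - 1; i += stride) {
                for (int j = i + 1; j < n; ++j) {
                    if (std::abs(q[i] - q[j]) == j - i) {
                        ++local;
                    }
                }
            }
            partial[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    // Sum the per-thread counters after all threads have finished.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Accumulating into a local variable and writing the slot once at the end also keeps threads from repeatedly touching neighbouring array elements while they count.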





In this scheme, the number of threads can be set manually. We observe the performance of this scheme for different problem sizes when the number of threads varies from 1 to 60 and find that the optimal number of threads is related to the size of the problem. Figure 2 shows the speedup of this scheme over the single-threaded scheme for different thread numbers; different colors in the legend indicate different problem sizes.
For the case of , the maximum speedup is less than 5, and the optimal number of threads is about 10. If the number of threads participating in the calculation exceeds the optimal number, the cost of redundant threads is greater than the benefit and the performance will decrease.
For the case of , the maximum speedup is between 10 and 15, and the optimal number of threads is about 20. For the cases of , the maximum speedup is between 10 and 20, and the optimal number of threads is about 40, which is the number of logical cores of our test platform. If the number of threads exceeds the optimal value, the performance is only slightly affected because, at this point, all processors are fully utilized and additional threads cannot further improve processor utilization.
The maximum speedup and the corresponding optimal number of threads for each different problem scale are plotted in Figure 3. We observed that as the size of the problem increased, the speedup and the corresponding number of optimal threads gradually approached the number of cores. We store the optimal number of threads corresponding to each problem size in data structure. In the later experiments, this map is used to select the optimal number of threads for different scale problems, so that the algorithm has a certain adaptive ability.
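A lookup of this kind could be sketched as follows, assuming the data structure is a `std::map` keyed by problem size. The function name `pick_thread_count` is hypothetical, and any table contents passed to it are placeholders, not the measured optima.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Return the thread count stored for the largest tabulated problem size
// that does not exceed n; sizes below the first entry use the first entry.
unsigned pick_thread_count(const std::map<int, unsigned>& table, int n) {
    assert(!table.empty());
    auto it = table.upper_bound(n);   // first key strictly greater than n
    if (it == table.begin()) {
        return it->second;            // n is below the smallest key
    }
    return std::prev(it)->second;     // largest key <= n
}
```

For example, with placeholder entries `{100: 10, 1000: 20, 10000: 40}`, a problem of size 5000 would be run with 20 threads.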
2.3. CPU Intel TBB Scheme
Intel Threading Building Blocks (TBB) is a library that takes full advantage of multicore performance. The key notion of TBB is to separate logical task patterns from physical threads and to delegate task scheduling to the system. Compared with using a raw thread library, such as POSIX threads, , or Boost threads, users can focus on the decomposition of tasks instead of manually allocating computation and data to threads and handling synchronization among them.
We use the function provided by TBB to decompose and summarize the calculation tasks of the evaluation function. To make use of this function, we need to design a class and override function and in the NQClass, which is defined in Algorithm 4. The function SubTask() is used to complete the decomposed subtask, that is, to calculate the number of conflicts caused by the queen . This function is called in function .
The blocked_range used in Algorithm 5 is a class defined by TBB in the file blocked_range.h to indicate the range of data to be processed. After the required class is defined, the use of function is very simple. Without specifying the number of threads manually, TBB can automatically decompose subtasks and complete them in parallel.
2.4. CPU Fork/Join Scheme
Fork/Join is a framework provided by Java 7 for parallel task execution. By using work-stealing technology to schedule tasks, the Fork/Join framework can achieve better load balancing among multiple cores. The key to implementing the evaluation function with this framework is to inherit class RecursiveTask and override function .
As shown in Algorithm 6, tasks larger than the threshold are divided recursively into smaller tasks and delivered to multiple cores for execution. We tried different thresholds and found that the best performance can be achieved when the threshold is between 2 and 10. The experimental data used in the following section are obtained with , which means that each thread only calculates the number of conflicts caused by a single queen.
2.5. GPU Fine-Grained Scheme 1
Owing to its SIMT structure, the GPU can run thousands of threads at the same time. To make full use of these threads in a fine-grained scheme, we use one thread to compare one pair of queens and determine whether there is a conflict between them. We use the CPU to calculate the subscript pairs of queens that need to be compared, as shown in Algorithm 7. There are a total of N(N-1)/2 pairs of subscripts to be stored in array. The array is then transferred along with the candidate solution to the GPU through the PCIe data bus. In the GPU kernel, each GPU thread reads a pair of subscripts and extracts the locations of the corresponding queens according to the subscripts. If there is a conflict between the two, an atomic operation is used to add one to the counter in the global memory of the GPU.
Array and subscript array have been transferred from the host memory to the GPU memory by before the kernel runs. The task of each GPU thread is described in Algorithm 8; since each thread compares a single pair of queens, it has O(1) time complexity. Variable conflicts is a global variable that can be accessed by all threads.

When the problem size N increases to 50000, approximately 1.164G pairs of queens need to be compared. If the subscripts are saved with the unsigned short type, more than 4 GB of memory is needed. The huge amount of data transferred between the CPU and GPU takes up most of the running time, which completely offsets the benefits of parallel computing. The performance of this scheme is 2 to 4 orders of magnitude lower than that of the coarse-grained GPU scheme. As the scale of the problem increases, the disadvantage continues to grow. Even if the GPU can reduce the cost of data transmission by using multiple streams and overlapping computation with data transmission, this scheme has few performance advantages. So, we did not continue to test the performance of this scheme for .
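The memory estimate above can be checked with a few lines of arithmetic. The sketch assumes two 16-bit subscripts are stored per pair, as the text suggests; the function names are hypothetical.

```cpp
#include <cstdint>

// Number of queen pairs that must be compared for n queens: n(n-1)/2.
constexpr std::uint64_t pair_count(std::uint64_t n) {
    return n * (n - 1) / 2;
}

// Bytes needed to store both subscripts of every pair as unsigned short
// (16-bit) values, which is large enough for n up to 65535.
constexpr std::uint64_t subscript_bytes(std::uint64_t n) {
    return pair_count(n) * 2 * sizeof(std::uint16_t);
}
```

For N = 50000 this gives 1,249,975,000 pairs, which is about 1.164 x 2^30, and 4,999,900,000 bytes, or roughly 4.66 GiB, of subscript data to move across the bus for every evaluation setup.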
2.6. GPU Fine-Grained Scheme 2
In fine-grained scheme 1, the subscript array is generated by the CPU and transferred to the GPU through the PCIe bus. As the problem size N increases, the memory consumption of the subscript array grows as O(N^2). To avoid the communication overhead caused by a large amount of data transmission between the CPU and GPU, in this scheme, the subscripts of the two queens that each GPU thread needs to compare are calculated by the GPU thread itself according to its own index. The process of using the GPU to generate subscripts is shown in Algorithm 9. This scheme can improve the parallelism and the utilization of GPU resources. The part of the algorithm for calculating subscripts has time complexity.
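One way a thread could derive its pair of subscripts from its own linear index is sketched below on the CPU side. This is an illustration under stated assumptions, not a transcription of Algorithm 9: it assumes pairs are enumerated row by row, (0,1), (0,2), ..., (N-2,N-1), and that the mapping is computed with a simple loop. The function name `pair_from_index` is hypothetical.

```cpp
#include <cstdint>
#include <utility>

// Map a linear index k in [0, n(n-1)/2) to the k-th pair (i, j) with i < j,
// where pairs are ordered (0,1), (0,2), ..., (0,n-1), (1,2), ...
// Precondition: n >= 2 and k < n(n-1)/2.
std::pair<std::uint32_t, std::uint32_t> pair_from_index(std::uint64_t k,
                                                        std::uint32_t n) {
    std::uint32_t i = 0;
    std::uint64_t row_len = static_cast<std::uint64_t>(n) - 1;  // pairs starting at i
    while (k >= row_len) {  // data-dependent loop: skip whole rows
        k -= row_len;
        ++i;
        --row_len;
    }
    return {i, i + 1 + static_cast<std::uint32_t>(k)};
}
```

The number of loop iterations depends on the index, so threads in the same group can take different paths through this code, which is the kind of loop-and-branch structure that hurts SIMT execution.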
The disadvantage of this scheme is that each thread has too little work to make full use of the GPU's computing power, which makes the performance of this scheme lower than that of the coarse-grained scheme by about two orders of magnitude. Since this huge performance difference cannot be compensated for by merging thread tasks, we gave up this scheme when the number of queens reached 50000.
2.7. GPU Coarse-Grained Scheme
Each GPU thread corresponds to one queen's position: it checks whether this queen conflicts with any of the queens after it and accumulates the number of conflicts into a counter in the global memory of the GPU with atomic operations. Algorithm 10 describes the task of one GPU thread; since a thread compares its queen with at most N-1 others, it has a time complexity of O(N).
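The per-thread work can be emulated on the CPU to make the division of labour concrete. The sketch below is not the CUDA kernel of Algorithm 10: it uses `std::atomic` in place of a GPU atomic add on global memory and runs the logical threads one after another, and the function names are hypothetical.

```cpp
#include <atomic>
#include <cstdlib>
#include <vector>

// Work of one logical thread: compare queen `tid` with every queen after it
// and bump a shared counter once per conflict found.
void coarse_thread_task(const std::vector<int>& q, int tid,
                        std::atomic<long long>& conflicts) {
    const int n = static_cast<int>(q.size());
    if (tid >= n - 1) {
        return;  // the last queen has nothing after it; also a bounds guard
    }
    for (int j = tid + 1; j < n; ++j) {
        if (std::abs(q[tid] - q[j]) == j - tid) {
            // Stands in for an atomic add on a counter in global memory.
            conflicts.fetch_add(1, std::memory_order_relaxed);
        }
    }
}

// Sequential emulation of launching one logical thread per queen.
long long count_conflicts_coarse(const std::vector<int>& q) {
    std::atomic<long long> conflicts{0};
    for (int tid = 0; tid < static_cast<int>(q.size()); ++tid) {
        coarse_thread_task(q, tid, conflicts);
    }
    return conflicts.load();
}
```

Thread `tid` performs N - 1 - tid comparisons, so the amount of work shrinks linearly with the thread index, and the number of atomic updates equals the number of conflicts in the candidate.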


The GPU kernel is launched with the following parameters: . With Nvidia Tesla K80 [28], this scheme can theoretically be used to calculate the number of conflicts for queens at maximum. Since only the candidate solution array needs to be transferred to the GPU, the data transmission burden of this scheme is very small.
2.8. GPU CDP-Based Scheme
The coarse-grained GPU scheme is relatively simple, but the workload is unbalanced between threads. For example, for 2000 queens, thread_0 has to check 1999 pairs of queens for conflicts, while thread_1998 checks only 1 pair; therefore, the task amounts of these two threads differ by a factor of 1999. For larger N values, the issue is even more serious. To balance the workload among threads, we set a threshold for using CDP (CUDA dynamic parallelism). CUDA dynamic parallelism first appeared in NVIDIA devices with a compute capability of 3.5 or higher. It enables GPU kernels to launch nested subkernels by themselves, without the participation of the CPU, thereby avoiding the communication cost between the CPU and GPU. When a thread in the kernel is responsible for more queen-pair comparisons than the threshold, this thread uses dynamic parallelism to launch a subkernel, divides its task into finer-grained subtasks, and hands them to the subkernel to run. Algorithm 11 describes the process of deciding whether to call the subkernel according to the threshold, and Algorithm 12 describes the task of each subkernel. In theory, this scheme can alleviate the problem of unbalanced tasks between threads and also improve the parallelism. For the case of N = 2000 and a threshold of 32, the task amount (pairs of queens to be checked) of thread_0 is 1999, which is greater than the threshold. This task is divided into 63 small subtasks, each with a task amount not greater than the threshold. The number of subtasks can be determined by the following formula: Each subtask is completed by a thread in the subkernel, so the launch parameters of the subkernel are as follows: .
The parent grid and the subkernel have their exclusive local memory and shared memory space, so the parent kernel should pass data to the child kernel by passing values or global memory space pointers instead of pointers of local memory or shared memory space.
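The splitting arithmetic in the example (1999 comparisons, threshold 32, 63 subtasks) is consistent with a ceiling division, which can be sketched as follows. The function names are hypothetical, and the rounding rule is inferred from the numbers in the text rather than taken from the paper's formula.

```cpp
// Number of comparisons assigned to coarse-grained thread `tid`
// for n queens: the queens that come after position tid.
constexpr int task_amount(int n, int tid) {
    return n - 1 - tid;
}

// Whether the thread would hand its work to a subkernel.
constexpr bool needs_subkernel(int task, int threshold) {
    return task > threshold;
}

// Number of subtasks of at most `threshold` comparisons each
// (ceiling division; an inference from the 1999 / 32 -> 63 example).
constexpr int subtask_count(int task, int threshold) {
    return (task + threshold - 1) / threshold;
}
```

With n = 2000 and a threshold of 32, thread 0 has 1999 comparisons and would be split into 63 subtasks, whereas thread 1998 has a single comparison and would not launch a subkernel.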



3. Experiment
3.1. Test Platform
All trials were performed on an HP ProLiant DL580 Gen9 server with a Tesla K80. The configuration of the experimental platform is shown in Table 2.
With 2 Xeon E7-4820 v4 CPUs, our experimental platform has 20 physical cores, which can be presented as 40 logical cores through Intel Hyper-Threading Technology and run 40 threads at the same time. The NVIDIA Tesla K80 has up to 2.91 teraflops of double-precision performance and 480 GB/s of memory bandwidth. Most of the experimental data used in the charts below were averaged over 100 runs, and a few very time-consuming experiments use the average of 10 runs. We use the high-precision library provided in the C++11 standard for timing and use microseconds as the timing unit for the small-scale problem and milliseconds for the large-scale problem . CUDA function is used for synchronization between the GPU and CPU.
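A timing harness of this kind could be sketched as follows. It assumes the C++11 `std::chrono` facilities with `steady_clock`; the paper does not name the clock it used, and the helper name `mean_microseconds` is hypothetical.

```cpp
#include <chrono>

// Run `work` `runs` times and return the mean wall-clock time per run
// in microseconds. For GPU code, `work` would need to include the
// device synchronization so that the kernel has actually finished.
template <typename F>
double mean_microseconds(F&& work, int runs) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < runs; ++r) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / runs;
}
```

Dividing the result by 1000 gives milliseconds for the larger problem sizes.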
Schemes based on the CPU are implemented in C++ and Java, and schemes based on the GPU are implemented in CUDA C. We set the block size of GPU kernels to 512 based on experience. For the GK210, the maximum number of concurrent threads in each SM is 2048, and the block size we set ensures that there are 4 blocks in each SM. Because the number of subkernel threads launched through CUDA dynamic parallelism is often on the order of tens or hundreds, we set a smaller block size (32) for the subkernel. For fairness and portability across different hardware, all code is compiled with the default optimization option.
The Java virtual machine supports hotspot detection technology, which can analyze hotspot code and optimize it automatically. We tried and compiler with the option in the compilation phase and forced JIT mode with the -Xcomp option in the runtime phase. The results show that the best performance can be achieved by default compiling and running configuration .
3.2. Performance Comparison
As shown in Figure 4, the seven schemes rank, from best to worst performance, as follows: the coarse-grained GPU scheme, the multithreaded CPU schemes (including the adaptive scheme and the TBB scheme), the Fork/Join scheme, the single-threaded CPU scheme, and fine-grained GPU schemes 1 and 2. The dynamic parallelism scheme is discussed separately in Section 3.3.
Because of the cost of thread startup and management, the performance of the CPU-adaptive multithreaded scheme and the TBB scheme is lower than that of the single-threaded scheme for small problem sizes. However, once the problem size reaches 300, these two schemes keep their advantage over the other schemes until the problem size exceeds 3000, where they are overtaken by the GPU coarse-grained scheme.
The Fork/Join scheme shows more extreme behavior: when the problem size is less than 2000, its performance is even lower than that of the single-threaded scheme; when the problem size reaches 50000, its performance exceeds that of the multithreaded scheme and the TBB scheme. The highest speedups of the multithreaded and TBB schemes are 22.04 and 20.71, respectively, while the Fork/Join scheme achieves a maximum speedup of 29.18. Considering that the 40 logical cores of our experimental platform are provided by 20 physical cores through Intel Hyper-Threading Technology, this scheme achieves high CPU resource utilization.
The performance of fine-grained scheme 1 is lower than that of the CPU schemes because of the large memory usage, the high transmission cost, and the fact that the number of tasks per thread is too small to offset the overhead of thread management.
Fine-grained scheme 2 gives the task of calculating the subscripts of the queens to be compared to the GPU in order to avoid a large amount of data transmission. However, the SIMT architecture of the GPU is suited to executing code with a large amount of work and a simple control structure. While calculating the subscripts of the queens, we use loop and branch structures, which cause GPU thread divergence. In the most extreme cases, the 32 threads in a warp execute one after another, which seriously reduces execution efficiency. The experimental results show that, in a heterogeneous CPU + GPU architecture, an optimization scheme must weigh various factors together, such as thread parallelism, data transmission throughput, SM (Streaming Multiprocessor) core utilization, and load balancing among multiple SMs. Increasing parallelism alone does not necessarily lead to a performance improvement.
When the problem size , the performance of the coarse-grained GPU scheme is lower than that of its CPU single-threaded counterpart. The reason is that, for small problem sizes, only a small number of threads participate in the calculation, the utilization of GPU cores is low, and the overhead caused by GPU thread startup and data transmission outweighs the gain brought by computational parallelism. When , the advantages brought by the massive parallelism of the GPU make the speedup over the single-threaded scheme continue to rise. When the size of the problem reaches 400,000, the speedup stabilizes at approximately 300, more than ten times that of the CPU multithreaded scheme. These trends can also be seen in the speedups of these schemes for different problem sizes, which are plotted on a logarithmic scale in Figure 5.
3.3. Performance of the GPU CDP-Based Scheme
The NVIDIA Tesla K80 has a compute capability of 3.7 and supports dynamic parallelism. The function call cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, MAX_DEPTH) is used to set the depth of the dynamic parallelism stack, and the maximum depth is 24. If the recursive call depth of dynamic parallelism in the program exceeds this limit, no error is reported, but the result returned from the GPU is wrong.
In our scheme, for threads whose task amount exceeds the threshold, dynamic parallelism is triggered only once, so the recursion depth of dynamic parallelism is 1. The task is divided into several subtasks whose task amount does not exceed the threshold. Because the dynamic parallelism scheme is very sensitive to the threshold parameter, we discuss the performance of this scheme separately in this section.
Tables 3 and 4 record the running time of the dynamic parallelism scheme when the threshold is set to 1 k, 2 k, 4 k, 8 k, 16 k, 32 k, 64 k, and 128 k. The first column is the number of queens. The second column is the running time of the coarse-grained GPU scheme, which is used here for comparison. The remaining columns are the running times of the dynamic parallelism scheme with different thresholds. The time unit used here is the millisecond.
The experimental results show that, no matter how large the threshold is, the performance drops sharply as soon as dynamic parallelism is triggered. This experiment shows that CDP is not suitable for accelerating the evaluation function of the N-Queens problem. We believe that, for the larger N-Queens problem, the coarse-grained GPU scheme already starts enough threads and reaches a high SM occupancy. If a GPU thread uses dynamic parallelism to launch new subkernels, it does not reduce the total workload (the number of queen comparisons) but instead increases the number of extra thread startups and the management workload, resulting in performance degradation.
CDP is very suitable for writing recursive patterns to implement divide-and-conquer or backtracking algorithms. The advantage of this technology lies in dealing with irregular tasks, such as searching the unbalanced tree of N-Queens problem variant 3. However, evaluating a candidate solution of the N-Queens problem with a fixed size is a regular workload whose amount of calculation can be predicted in advance, and the overhead caused by dynamic subkernel launches outweighs the benefit of the improved load balance yielded by CDP.
3.4. Stability Analysis of the Coarse-Grained Scheme
Statistics show that a random candidate solution contains a number of conflicts equal to approximately 2/3 of the length of the solution. During the evolutionary process of a heuristic algorithm, candidate solutions keep evolving toward the optimum, and the number of conflicts they contain gradually decreases until a valid solution with zero conflicts appears. This process reduces the number of atomic operations performed when evaluating a candidate solution, which in theory shortens the running time of the coarse-grained GPU scheme.
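The 2/3 figure can be checked on small instances. The sketch below assumes that a candidate is a permutation (one queen per row and column) and that a conflict is a pair of queens sharing a diagonal; the paper's exact counting convention is defined elsewhere, so this convention is an assumption. Under it, the expected number of conflicts of a uniformly random permutation of length N is (2N - 1)/3, which is consistent with the approximate 2/3 ratio. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def count_conflicts(solution):
    """Count queen pairs that share a diagonal (O(N^2) pairwise check)."""
    n = len(solution)
    conflicts = 0
    for i in range(n):
        for j in range(i + 1, n):
            if abs(solution[i] - solution[j]) == j - i:
                conflicts += 1
    return conflicts


def mean_conflicts_exact(n):
    """Average conflict count over all n! permutations (small n only)."""
    total = 0
    count = 0
    for perm in permutations(range(n)):
        total += count_conflicts(perm)
        count += 1
    return Fraction(total, count)
```

For example, `mean_conflicts_exact(6)` can be compared against `Fraction(2 * 6 - 1, 3)`; exhaustive enumeration is only practical for very small N.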
To observe how the reduction in the number of conflicts affects the performance of the coarse-grained GPU scheme, we test the scheme on a random solution set and a valid solution set, with solution lengths ranging from 100 to 1 million. We shuffle the sequence from 1 to N to construct the random candidate solution set and use the Hoffman construction method to construct the valid solution set.
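A minimal sketch of how the two kinds of input might be generated is given below. The random candidate follows the shuffle described above. For the valid solution, the sketch uses a well-known explicit even/odd row-listing construction as a stand-in; it is not claimed to be identical to the Hoffman construction used in the experiments. Both function names are hypothetical.

```python
import random


def random_candidate(n, rng=random):
    """Shuffle 1..n into a random permutation (row of the queen in each column)."""
    rows = list(range(1, n + 1))
    rng.shuffle(rows)
    return rows


def explicit_valid_solution(n):
    """Return a zero-conflict placement for n >= 4 via an even/odd row listing."""
    if n < 4:
        raise ValueError("this construction needs n >= 4")
    evens = list(range(2, n + 1, 2))
    odds = list(range(1, n + 1, 2))
    if n % 6 == 2:
        # Swap 1 and 3, then move 5 to the end of the odd list.
        odds[0], odds[1] = odds[1], odds[0]
        odds.remove(5)
        odds.append(5)
    elif n % 6 == 3:
        # Move 2 to the end of the evens and 1, 3 to the end of the odds.
        evens = evens[1:] + [2]
        odds = odds[2:] + [1, 3]
    return evens + odds
```

A placement is conflict-free when all values of row minus column are distinct and all values of row plus column are distinct, which gives a cheap O(N) validity check for the constructed set.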
As shown in Figure 6, the performance difference between the two datasets gradually decreases as the length of the solution increases. The reason is that, with longer solutions, the number of threads and the amount of computation grow, so the latency of atomic operations on GPU global memory is more likely to be hidden by other running threads or warps. Compared with the valid solution set, the performance of the coarse-grained GPU scheme on the random candidate solution set is reduced by 0.95% on average. A fluctuation of about 1% indicates that the coarse-grained GPU scheme is stable across different datasets.
4. Application of the GPU Coarse-Grained Scheme to Simulated Annealing
To verify the effectiveness of our scheme, we apply the coarse-grained GPU scheme to simulated annealing (SA) for solving N-Queens problem variant 2. We keep the parameters and the experimental platform unchanged and only replace the CPU single-threaded evaluation function with the coarse-grained GPU scheme.
As can be seen from Figure 7, because the evaluation function accounts for a very high proportion of the running time of the whole simulated annealing algorithm, the performance gain that GPU parallelism brings to the fitness function directly improves the performance of the SA algorithm. Taking into account experimental error and the fact that simulated annealing is a probabilistic technique, we conclude that the acceleration of the fitness function is directly reflected in the SA algorithm.
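The idea of swapping only the evaluation function can be sketched as follows. This is a generic swap-based simulated annealing loop written for illustration, not the implementation used in the experiments; the cooling schedule, the parameters `steps`, `t0`, `alpha`, and `seed`, and the reference evaluator `count_conflicts` are all assumptions. The point is that `evaluate` is passed in as an argument, so a faster evaluator with the same interface could be substituted without touching the rest of the algorithm.

```python
import math
import random


def count_conflicts(solution):
    """Reference evaluator: number of queen pairs sharing a diagonal."""
    n = len(solution)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(solution[i] - solution[j]) == j - i
    )


def simulated_annealing(n, evaluate, steps=20000, t0=1.0, alpha=0.999, seed=0):
    """Swap-based SA over permutations; `evaluate` is the pluggable fitness function."""
    rng = random.Random(seed)
    current = list(range(n))
    rng.shuffle(current)
    current_cost = evaluate(current)
    best, best_cost = current[:], current_cost
    temperature = t0
    for _ in range(steps):
        if best_cost == 0:
            break
        i, j = rng.sample(range(n), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = evaluate(current)
        delta = new_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_cost = new_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        else:
            # Reject the move: undo the swap.
            current[i], current[j] = current[j], current[i]
        temperature *= alpha
    return best, best_cost
```

A call such as `simulated_annealing(8, count_conflicts)` returns the best permutation found and its conflict count. Because every iteration calls `evaluate` once, the cost of that function dominates the loop for large N.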
5. Conclusions
Variant 2 of the N-Queens problem is a classical problem that has been proved to be NP-hard, so heuristic algorithms are often used to obtain valid solutions. At present, these methods are usually parallelized at the algorithm level, for example by dividing a large population into several small populations that evolve in parallel or by mutating individuals in parallel. In this paper, we focus on improving the speed of heuristic algorithms for solving variant 2 by accelerating the evaluation function.
Besides a CPU single-threaded serial scheme, three CPU multithreaded parallel schemes and four GPU parallel schemes are proposed to evaluate candidate solutions for the N-Queens problem. The performance of all schemes is measured under uniform experimental conditions. In solving N-Queens problem variant 2 with the simulated annealing algorithm, the advantage of the coarse-grained GPU scheme has been demonstrated. The evaluation function is usually the most time-consuming part of a heuristic algorithm, and we believe that our CPU-based and GPU-based schemes can improve the performance of all search-and-evaluation algorithms for solving N-Queens problem variant 2 without changing the algorithm process or parameters. These algorithms include simulated annealing, genetic algorithms, chemical reaction optimization, etc. Users can choose the appropriate scheme according to their hardware to speed up their computation. Our scheme does not conflict with parallel methods at the algorithm level; the two can be used together. For example, when GPU hardware is available, replacing the fitness function in the island model of a parallel genetic algorithm with the GPU scheme proposed in this paper can further shorten the execution time.
The performance of the coarse-grained GPU scheme can be further improved by the following means, which constitute our future work:
(1) In the current coarse-grained GPU scheme, device memory is allocated and freed for each evaluation function call. In theory, performance can be improved if device memory is reused across multiple evaluation function calls. Data transfer from CPU to GPU and computation in the kernel can be overlapped by using CUDA multi-stream technology.
(2) Some CUDA kernel configuration parameters can be further optimized. We plan to use NVIDIA NVVP [29] to read the hardware counters in the GPU, analyze the micro-level performance bottlenecks of the program, and further improve the performance of the GPU schemes by optimizing parameters such as the block size and the L1D/shared memory settings. For the dynamic parallelism scheme, we plan to use bypass technology to cancel some subkernel launches at random in order to reduce the thread management burden and improve performance.
(3) This paper focuses on using thread-level parallelism to improve the efficiency of the evaluation algorithm. Instruction-level parallelism is also an important optimization method. In the next step, we plan to combine these two methods to further improve the performance of the algorithm.
Data Availability
The experimental code, scripts, and results used to support the findings of this study are available at https://github.com/grasshoper97/NQueens.git.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this manuscript.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61672123).