Research Article  Open Access
Xiang Yu, Yu Qiao, Qingpeng Li, Gang Xu, Chuanxiong Kang, Claudio Estevez, Chengzhi Deng, Shengqian Wang, "Parallelizing Comprehensive Learning Particle Swarm Optimization by Open Computing Language on an Integrated Graphical Processing Unit", Complexity, vol. 2020, Article ID 6589658, 17 pages, 2020. https://doi.org/10.1155/2020/6589658
Parallelizing Comprehensive Learning Particle Swarm Optimization by Open Computing Language on an Integrated Graphical Processing Unit
Abstract
Comprehensive learning particle swarm optimization (CLPSO) is a powerful metaheuristic for global optimization. This paper studies parallelizing CLPSO by open computing language (OpenCL) on the integrated Intel HD Graphics 520 (IHDG520) graphical processing unit (GPU) with a low clock rate. We implement a coarsegrained allGPU model that maps each particle to a separate work item. Two enhancement strategies, namely, generating and transferring random numbers from the central processor to the GPU as well as reducing the number of instructions in the kernel, are proposed to shorten the model’s execution time. This paper further investigates parallelizing deterministic optimization for implicit stochastic optimization of China’s Xiaowan Reservoir. The deterministic optimization is performed on an ensemble of 62 years’ historical inflow records with monthly time steps, is solved by CLPSO, and is parallelized by a coarsegrained multipopulation model extended from the allGPU model. The multipopulation model involves a large number of work items. Because of the capacity limit for a buffer transferring data from the central processor to the GPU and the size of the global memory region, the random number generation strategy is modified by generating a small number of random numbers that can be flexibly exploited by the large number of work items. Experiments conducted on various benchmark functions and the case study demonstrate that our proposed allGPU and multipopulation parallelization models are appropriate; and the multipopulation model achieves the consumption of significantly less execution time than the corresponding sequential model.
1. Introduction
A graphical processing unit (GPU) is a processor specially designed to rapidly manipulate the creation of images in a frame buffer intended for output to a display device. By providing functionalities such as texture mapping, rendering, shading, antialiasing, color space, and video decoding, a GPU is an indispensable aid to a central processing unit (CPU) to manage and boost the performance of graphics. A CPU consists of only a few processing elements optimized for sequential processing, whereas a GPU consists of a large number of compute units, with each compute unit in turn containing many processing elements, thereby constituting a massively parallel architecture for handling multiple computing tasks simultaneously. People have recently studied leveraging the massively parallel architectures of GPUs for accelerating nongraphical generalpurpose computing in a wide range of areas [1–14]. GPUbased parallel computing is implemented by a host program and kernel(s). The host program runs on the CPU and can launch a kernel on a connected GPU. A kernel is a function executed in parallel on the processing elements within the GPU. The parallel threads of the kernel can be organized into groups, with each group of concurrent threads executed on the same compute unit. Data are transferred between the CPU and the GPU.
Metaheuristics are naturally suitable to be parallelized on GPUs. A metaheuristic is essentially a set of natureinspired intelligent search strategies and is promising for singleobjective global optimization [15]. A metaheuristic usually finds the optimum using a population of individuals, with each individual representing a candidate solution. All the individuals are initialized randomly and evolve iteratively and randomly in the search space to gradually locate the optimum. Each individual is thus associated with a fitness indicating the individual’s search performance. The final solution obtained is the individual that exhibits the best fitness among the population when the evolution ends. Compared with traditional optimization methods such as linear programming, nonlinear programming, dynamic programming, and optimal control, metaheuristics do not require the objective and constraints of the optimization problem to be continuous, differentiable, linear, or convex. In addition, metaheuristics can be directly linked with simulation models without requiring simplifying any assumption in the models. According to Tan and Ding [16], GPUbased parallelization implementations of metaheuristics are classified into four categories, namely, naive model, multiphase model, allGPU model, and multipopulation model. The naive model offloads fitness calculation onto the GPU side. The naive model can be coarsegrained or finegrained. For the coarsegrained naive model, the fitness of each individual is calculated by a separate thread. If the fitness is an aggregation of a number of partial calculations that can be run in parallel, then the finedgrained naive model can be used to further accelerate fitness calculation. Besides fitness calculation, the multiphase model also parallelizes other phases of a metaheuristic to leverage the GPU’s computing power. Multiple kernels are used to parallelize different phases, and different kernels can take different parallelization granularities to best fit the corresponding tasks. The multiphase model might incur the overhead of frequent kernel launching. The allGPU model combines multiple kernels into one kernel. Instead of relying on a single population, multiple populations are employed in the multipopulation model, with each population evolving separately with different thread(s).
Particle swarm optimization (PSO) is a class of metaheuristics simulating the foodsearching behavior of a bird flock [17, 18]. In PSO, population is termed as swarm, and an individual is termed as the particle. All the particles “fly” in the search space. Each particle is accordingly additionally associated with a position and velocity. Each particle updates the flight trajectory iteratively and randomly and tries to gradually move towards the optimum. Many PSO variants have been proposed in the literature since 1995 [17, 18], and some variants have been parallelized on GPUs. Open computing language (OpenCL) [19] and compute unified device architecture (CUDA) [20] are two commonly used industry standards that facilitate parallel computing on GPUs. Based on the C/C++ programming language, OpenCL and CUDA both provide a series of runtime application programming interfaces (APIs). OpenCL enables programming across heterogeneous types of processors including not only GPU but also digital signal processor, and field programming gate array, whereas CUDA only supports NVIDIA proprietary GPUs. Global PSO (GPSO) is the earliest PSO variant [17]. In GPSO, the flight of each particle is guided by the particle’s historical best search experience (i.e., personal best position) and also the historical best search experience out of the entire swarm (i.e., global best position). GPSO was parallelized on GPUs by CUDA following the multiphase model in [21], the allGPU model in [1–3, 22], and the multipopulation model in [4, 5]. GPSO is liable to get stuck in premature convergence, and local PSO (LPSO) [23] is an improved variant based on GPSO. A static topology is constructed in LPSO. For a particle, its neighborhood contains itself and the particles that it directly connects to in the topology. Instead of referring to the global best position, the historical best search experience of the neighborhood (i.e., local best position) is used for updating particle’s velocity. An allGPU parallelization implementation of LPSO by CUDA was proposed in [6]. In contrast with LPSO, standard PSO (SPSO) [24] maintains a dynamic topology. An allGPU parallelization model of SPSO implemented by CUDA and that by OpenCL were, respectively, introduced in [25, 26]. Comprehensive learning PSO (CLPSO) [27–29] differs from GPSO, LPSO, and SPSO in that CLPSO encourages each particle to learn from different exemplars on different dimensions during the flight trajectory update, whereas GPSO, LPSO, and PSO enforce the same pair of personal best position and global/local best position for updating particle’s velocity on all the dimensions. CLPSO is good at preserving the particles’ diversity and exhibits excellent optimization performance. Papadakis and Bakrtzis [7] investigated developing an allGPU model of CLPSO by OpenCL. Kilic et al. [8], Ouyang et al. [9], Souza et al. [10], Sharma et al. [11, 12], and Rabinovich [30] studied parallelizing other PSO variants on GPUs.
In this paper, we study parallelizing CLPSO on a platform with the Intel Core i76500U CPU, the thirdgeneration double data rate (DDR3) main memory, and the Intel HD Graphics 520 (IHDG520) GPU by OpenCL. IHDG520 is an integrated GPU; it can be found in various ultralowvoltage mobile CPUs and is suited for laptop (particularly, ultrabook) computers. The IHDG520 GPU lacks dedicated graphics memory and has to access the main memory. The sequentialization implementation of CLPSO was evaluated with a swarm of 40 particles in [27–29]. We propose a coarsegrained allGPU model with the kernel being concurrently executed by 40 threads. Two enhancement strategies, i.e., generating and transferring random numbers from the CPU to the IHDG520 GPU as well as reducing the number of instructions in the kernel, are introduced into the model for the purpose of shortening the execution time. We further investigate parallelizing deterministic optimization for implicit stochastic optimization (ISO) of China’s Xiaowan Reservoir. The deterministic optimization is performed on an ensemble of 62 years’ historical inflow records with monthly time steps and is solved by CLPSO. The deterministic optimization is parallelized by a coarsegrained multipopulation model extended from the allGPU model, using each swarm of particles to address the optimal operation in a separate year. The multipopulation model involves a large number of threads. Because of the limit on data transfer capacity between the CPU and the IHDG520 GPU, we modify the random number generation strategy by just generating a small number of random numbers that can be flexibly exploited by the large number of threads without hurting randomness.
A reservoir is a hydraulic structure that impounds water and uses water to serve various purposes such as hydropower generation, flood control, navigation, sediment control, and water provisioning for agricultural, domestic, and industrial demands. A reservoir system consists of one or more cascaded reservoirs constructed within the same river basin. The optimal operation of a reservoir system is to schedule outflows of the reservoir (s) over a series of consecutive time steps in order to optimize a specific objective, trying to fulfill the multipurpose development of the system. The optimal operation of a reservoir system is complex because the optimization problem has to take into account inflow imprecision and uncertainties, the dynamic multistage nature of decisionmaking, and different physical and operational constraints [31–36]. The optimal operation of a reservoir system is either deterministic or stochastic. Deterministic optimization assumes that inflows into the reservoirs (s) over all the time steps are available. However, in practice, only limited inflow forecasting information can be obtained. Alternatives to avoid perfect inflow knowledge during the entire planning horizon include ISO and explicit stochastic optimization (ESO) [35, 37]. ISO is also referred to as Monte Carlo optimization. ISO optimizes over a long continuous series of historically recorded or synthetically generated inflows, or many shorter equally likely sequences. Accordingly, stochasticity of the inflows is implicitly addressed, and deterministic optimization can be applied on the ensemble of inflows. Operation rules for the outflows of each reservoir over all the time steps conditioned on information, e.g., the reservoir’s present storage volume (or forebay elevation), previous inflows, and limited forecasted inflows, are then abstracted from the deterministic optimization results using a fitting method, e.g., rule curve [38–42], linear regression [43–48], artificial neural network [49–52], neurofuzzy inference system [53, 54], decision tree [50, 55], genetic programming [56], and support vector regression [57]. Compared with ISO, ESO directly operates on probabilistically described inflows [58–62]. The deterministic optimization is usually nonlinear, nonconvex, and nondifferentiable [63–66] and has been extensively addressed by metaheuristics recently [67].
Integrated GPUs are prevalent nowadays and can be found in both laptop (e.g., the IHDG520 GPU and the Intel HD Graphics 620 GPU) and desktop computers (e.g., the Intel HD Graphics 530 GPU). The clock rates of the Intel HD Graphics 520, 620, and 530 GPUs are all rather low, being 0.3 GHz, 0.3 GHz, and 0.35 GHz, respectively. The Intel HD Graphics 620 and 530 GPUs also lack dedicated graphics memory. The Intel integrated GPUs feature significantly different architectures from NVIDIA and AMD GPUs studied in the existing literature body, e.g., the NVIDIA GPUs studied in [1–4, 6–10, 12, 21, 22, 25, 26, 30] and the AMD GPUs studied in [13, 14, 68, 69] have much higher clock rates and have dedicated graphics memories. Many NVIDIA and AMD GPUs are quite larger in size and require higher power supply. The motherboard and the power supply of a brand desktop computer often differ considerably from those of a selfassembled desktop computer and may not support adding an NVIDIA or AMD GPU. As a result, this paper experimenting on the IHDG520 GPU is of critical practical meaning. In addition, to the best of our knowledge, this paper is the first pioneering work investigating parallelizing the deterministic optimization for the ISO of a reservoir system on a GPU.
The rest of this paper is organized as follows. The working procedure of CLPSO, the knowledge about OpenCL, and the characteristics of the IHDG520 GPU are detailed in Section 2. Section 3 introduces the case study of the Xiaowan Reservoir and formulates the deterministic optimization for the ISO of the Xiaowan Reservoir. Section 4 presents our proposed coarsegrained allGPU and multipopulation models of CLSPO implemented by OpenCL. The performance of the models is evaluated in Section 5. Section 6 concludes this paper.
2. Background
2.1. Comprehensive Learning Particle Swarm Optimization
Let the search space be Ddimensional, and there are N particles in the swarm. Each particle, denoted as i (i = 1, 2, …, N), is associated with velocity _{i} = _{i, 1}, _{i, 2}, …, _{i, D} and a position _{i} = _{i, 1}, _{i, 2}, …, _{i, D}. In each iteration (or generation), _{i} and _{i} are updated on each dimension d (d = 1, 2, …, D) as follows:where is the inertia weight; δ_{i,d} is a random number uniformly distributed in [0, 1]; and E_{i} = E_{i, 1}, E_{i, 2}, …, E_{i, D} is the exemplar that guides the update of i’s flight trajectory.
The dimensional velocity is clamped to a positive value , i.e.,
Let and , respectively, be the lower bound and the upper bound of the search space on each dimension d, . At the beginning of the run of CLPSO, is initialized as a random value uniformly distributed in , and the dimensional position is initialized as a random value uniformly distributed in .
The weight linearly decreases from 0.9 to 0.4 in each generation k according to the following equation:where K is the predefined number of generations.
CLPSO maintains a personal best position B_{i} = B_{i, 1}, B_{i, 2}, …, B_{i, D} for each particle i. Initially, B_{i} = . Let f be the fitness function. In each generation, if i’s fitness value f () is better than i’s personal best fitness value f (B_{i}), then B_{i} = ; otherwise, B_{i} does not change. The dimensional exemplar E_{i,d} can be B_{i, d} or B_{j, d} with j ≠ i, and the choice depends on i’s learning probability . To be specific, the value of is
On each dimension d, a random number uniformly distributed in [0, 1] is generated. If the generated number is no less than L_{i}, E_{i, d} = B_{i, d}; otherwise, E_{i, d} = B_{j, d}. To determine j, two different particles excluding i are randomly selected, and j is the particle with a better personal best fitness value. If E_{i} = B_{i} on all the dimensions, CLPSO randomly chooses one dimension d and one particle j with j ≠ i to force E_{i, d} = B_{j, d}. The exemplar E_{i} is not updated unless f (B_{i}) ceases improving for a refreshing gap of 7 generations. In each generation, f () is evaluated only if is feasible, i.e., is within on each dimension d. As is always feasible, infeasible will eventually be drawn back to the search space. After K generations, CLPSO determines the global best position G = G_{1}, G_{2}, …, G_{D} that exhibits the best fitness value among all the personal best positions. f (G) is the swarm’s global best fitness value. The stepbystep flowchart of CLPSO is depicted in Figure 1. If N is large enough, the chances of selecting two same particles and j = i at Step 3 are very small; hence, the procedure to determine j can be simplified and just slightly affect the performance of CLPSO by just randomly selecting two particles when the generated random number is greater than L_{i} and randomly selecting a particle when E_{i} = B_{i} on all the dimensions.
2.2. Open Computing Language
As can be seen from Figure 2, OpenCL views the hardware platform as a collection of heterogeneous compute devices attached to and managed by a CPU. A compute device can be a GPU or some other type of processor. OpenCL implements parallel computing with a host program and kernels. The host program runs on the CPU and can launch a kernel on a compute device. The parallel threads of a kernel are termed as work items. The work items can be organized into a number of independent socalled work groups, and only the work items executed on the same compute unit can be included in one work group. Each work item is identified by a unique global ID. The index space of the global IDs is one, two, or threedimensional, with each attribute starting at zero. Each work item can also be identified by the group ID of the work group and the local ID of the work item relative to the work group.
The host program creates a context. A context specifies kernel (s) to be executed on one or more compute devices. Besides kernel, a context also manages objects such as command queue, memory, and program. A command queue holds commands (or operations) that will be executed on a compute device. Commands placed into a command queue can be classified into three categories, i.e., kernel management, memory management, and synchronization. Values for the input parameters of a kernel are transferred between the CPU and a compute device. OpenCL represents generic data by a buffer and supports creating a buffer only for a onedimensional array. The memory space of a compute device is divided into four regions, i.e., global memory, constant memory, local memory, and private memory. The global memory permits read/write access to all the work items in all the work groups. Being writable by the CPU but not the compute device, the constant memory remains constant during the execution of a kernel. A local memory is just shared by all the work items in one specific work group. Each work item has a private memory, invisible by any other work item. Memory region qualifiers, “__global,” “__constant,” “__local,” and “__private,” can be applied on an input parameter of a kernel to, respectively, restrict that the parameter is to be stored in the global memory region, the constant memory region, the local memory region, or the private memory region. An input parameter with no memory region qualifier is stored in the private memory region by default. An input parameter with the data transferred by a buffer can only be stored in the global memory region. All the variables and constants additionally declared inside the kernel are stored in the private memory region. OpenCL is able to synchronize all the work items in the same work group, but cannot synchronize work items across different work groups. A program consists of one or more kernels.
2.3. Intel HD Graphics 520 Graphical Processing Unit
IHDG520 is an integrated GPU, i.e., it is embedded on the same die as the CPU. Integrated GPUs lead to less heat output and less power usage; thus, they have been widely taken in laptop (particularly, ultrabook) computers. The IHDG520 GPU has 24 compute units clocked at 0.3 GHz. Each compute unit is composed of 256 processing elements. The IHDG520 GPU has to share the main memory with the CPU. For the IHDG520 GPU, the size of the constant memory region and that of the local memory region are both zero; in other words, only the global memory region and the private memory region located in the main memory can be used. The size of the global memory region is 1.3 GB. The maximum size of a buffer created in the global memory region is 511 MB. The IHDG520 GPU uses onchip registers to store kernel instructions. The IHDG520 GPU supports singleprecision floating point calculation, but does not support doubleprecision floating point calculation.
3. Case Study and Deterministic Optimization Problems’ Formulation
3.1. Case Study
The Xiaowan Reservoir located on Lancang River is taken as the case for study. Lancang River is the upper stream of Mekong River in China. Mekong River is a crossborder river in Southeast Asia. Originating from the QinghaiTibet Plateau, Mekong River runs through 6 countries, i.e., China, Myanmar, Laos, Thailand, Cambodia, and Vietnam, sequentially. Mekong River is the world’s 12^{th} longest river, with a length of 4350 km. The length of Lancang River is 2139 km, draining an area of 0.16 million km^{2} over the provinces including Qinghai, Tibet, and Yunnan. The Xiaowan Reservoir is constructed at the west of Yunnan and on the middle reach of Lancang River, with a longitude of 100°05′28″ and a latitude of 24°42′18″. Figure 3 shows the Lancang River basin in Yunnan and the Xiaowan Reservoir. The Xiaowan Reservoir is mainly used for hydropower generation and also serves flood control, irrigation, sediment control, and navigation. For the Xiaowan Reservoir, the installed power generation capacity is 4200 MW, the normal forebay elevation is 1240 m, the dead forebay elevation is 1166 m, the flood control forebay elevation is 1236 m, and the total storage volume is 14,914 million m^{3}. The Xiaowan Reservoir is affected by a monsoon climate, and the inflows feature seasonal variations. The flood season is from June to September. The guaranteed hydropower generation per year is 190⋅10^{8} kWh. Historical inflow records for the Xiaowan Reservoir from the year 1953 to 2014 are available.
3.2. Deterministic Optimization Problems’ Formulation
The deterministic optimization problems for the ISO of the Xiaowan Reservoir are formulated with a yearly planning horizon of 12 monthly time steps and an ensemble of M years. Assuming that, for each year m (m = 1, 2, …, M), the inflow I_{m, t} into the reservoir in each month t (t = 1, 2, …, 12) of year m is already known, the deterministic optimization problem related to year m tries to determine the power discharge rate Q_{m, t} and the spillage rate S_{m, t} in each month t of year m for the objective of maximizing the total hydropower generation over the yearly planning horizon of year m; I_{m, t}, Q_{m, t}, and S_{m, t} are all measured by the unit of m^{3}/s, and the following equation gives the objective:where U_{m, t} is the power output in month t of year m and is measured by the unit of kW and X_{m, t} is the number of days in month t of year m.
U_{m, t} is calculated bywhere C_{m, t} is the water conversion rate in month t of year m and is measured by the unit of m^{3}/kWh. C_{m, t} is affected by the water head H_{m, t} in month t of year m. H_{m, t} is the difference of the forebay elevation Y_{m, t} and the tailrace elevation Z_{m, t} in month t of year m, i.e.,
Let A_{m, t} be the storage volume at the beginning of month t of year m. Y_{m, t} is a function of the average storage volume in month t of year m. is a function of the outflow rate in month t of year m, and = + .
cannot surpass the power discharge rate upper bound in month t of year m. As a result, takes a smaller value of and , as expressed in the following equation:
is a function of H_{m, t}.
Let be the initial/final storage volume bound. The initial storage volume A_{m,1} is known, and . The storage volume at the end of each month is calculated based on water balance, i.e.,
The problem is associated with the following constraints:where and are, respectively, the lower and upper bounds of the outflow rate in each month t and and are, respectively, the lower and upper bounds of the storage volume at the end of each month t.
The deterministic optimization needs to solve M problems, with one problem for each year m separately. Monthly operation rules can be abstracted from the deterministic optimization results of all the M problems.
4. Parallelizing Comprehensive Learning Particle Swarm Optimization
4.1. Basic CoarseGrained AllGPU Model
A basic parallelization scheme is presented here and works as the basis of our proposed enhancement strategies. The basic parallelization scheme follows the allGPU model and implements a single kernel. CLPSO needs to generate random numbers uniformly distributed in [0, 1] at Steps 1 and 3. An OpenCL program is composed of both the host part and the kernel part. OpenCL provides no builtin primitive for generating any kind of random number in the kernel part. We write an auxiliary inline function that the kernel function can invoke for generating a random unsigned integer number based on the multiplicative linear congruential (MLC) principle [70], i.e.,where ϕ is a random unsigned integer number and works as the seed for generating the next random unsigned integer number. A random float number δ uniformly distributed in [0, 1] is obtained by
When the inline function is called, the function code gets directly inserted at the point of each function call, thereby shortening the function call overhead.
The allGPU model is coarsegrained, with each particle mapped to a separate work item in a onedimensional index space. Each work item is identified by the global ID. N, D, , , E_{i, d}, , , , , B_{i, d}, L_{i}, f (), f (B_{i}), G_{d}, f (G), and ϕ_{i} are input parameters of the kernel function, while k and K are additionally declared as variables/constants inside the kernel function. ϕ_{i} is the seed for each particle/work item i to generate a random unsigned integer number. Buffers are created for and the memory region qualifier “__global” is applied on , , E_{i, d}, , , , , B_{i, d}, L_{i}, f (), f (B_{i}), G_{d}, f (G), and ϕ_{i}; hence, they are stored in the global memory region. N, D, k, and K are stored in the private memory region. The detailed working procedures of the host part and the kernel part are illustrated in Figure 4. The CPU first needs to initialize numerical values for some input parameters including N, D, , , , and ϕ_{i}. The numerical values are then transferred from the CPU to the IHDG520 GPU before the kernel function executes. The work items execute concurrently, and each work item is just responsible for performing the operations related to the corresponding particle at Steps 1, 3, and 4. Only one prespecified work item executes Steps 2 and 5. When the kernel function finishes execution, G_{d} and f (G) are transferred back to the CPU.
4.2. Enhancement Strategies
Two enhancement strategies, namely, generating and transferring random numbers from the CPU as well as reducing the number of instructions in the kernel, are employed to accommodate the characteristics of the IHDG520 GPU and the OpenCL APIs for the purpose of significantly shortening the execution time of the basic coarsegrained allGPU model.
OpenCL provides no builtin primitive for generating any type of random number. In the basic coarsegrained allGPU model, an auxiliary inline function is written to assist the kernel function generating random numbers based on the MLC principle. The MLC principle generates a random unsigned integer number based on an unsigned integer input, and a random float number uniformly distributed in [0, 1] can be obtained by dividing the random unsigned integer number timed with 1.0 over 2147483647.0. Most GPUs including the IHDG520 GPU are not good at dealing with the integer multiplication and modulation operations as well as the float division operation involved in the MLC random number generation process and need many clock cycles to execute such costly operations. In addition, the IHDG520 GPU is slow at execution because its clock rate is just 0.3 GHz. Step 1 of CLPSO randomly initializes each particle i’s dimensional velocity and position on each dimension d and requires 2ND random numbers. At Step 3, a random number is compared with each particle i’s learning probability L_{i} on each dimension d, two particles are randomly selected for determining the dimensional exemplar E_{i, d} if the number is greater than L_{i}, and a dimension is randomly selected to learn from a particle that is also randomly selected when E_{i} equals to the personal best position B_{i} on all the dimensions; thus, maximally, 3 ((K − 1)/7 + 1) ND + 2 ((K − 1)/7 + 1) N random numbers are needed. Step 4 updates each particle i’s dimensional velocity with a random coefficient on each dimension d, and uses KND random numbers. It would be very timeconsuming to generate all the 2ND + 3 ((K − 1)/7 + 1) ND + 2 ((K − 1)/7 + 1) N + KND random numbers on the IHDG520 GPU. The CPU is clocked at 2.5 GHz, and the host part of the parallelization program can invoke some highly efficient C/C++ library function to generate a random number uniformly distributed in [0, 1]. Therefore, we can generate all the random numbers on the CPU and transfer all the random numbers from the CPU to the IHDG520 GPU before the kernel function begins execution.
With respect to the basic coarsegrained allGPU model, , , E_{i, d}, B_{i, d}, L_{i}, f (), and f (B_{i}) are input parameters of the kernel function. OpenCL supports creating a buffer only for a onedimensional array. Buffers representing onedimensional arrays with ND elements are created for , , E_{i, d}, and B_{i, d}, and buffers representing onedimensional arrays with D elements are created for L_{i}, f (), and f (B_{i}). All the work items are indexed in a onedimensional space, and the global IDs of the work items range from 0 to N − 1, with each work item standing for a separate particle. The kernel function thus accesses , , E_{i, d}, and B_{i, d} by the index iN + d from the corresponding onedimensional array and accesses L_{i}, f (), and f () by the index i from the corresponding onedimensional array, with i also being the global ID of the work item. B_{i, d} and f (B_{i}) are shared among the particles for exemplar redetermination at Step 3 of CLPSO, while , , E_{i, d}, L_{i}, and f () are not shared by the particles. We can alternatively declare each particle’s dimensional velocities, positions, and exemplars as onedimensional arrays with D elements and also each particle’s learning probability and fitness value as variables inside the kernel function, thereby reducing the number of addressing instructions. There are a number of ifclause conditional statements in the kernel function of the basic coarsegrained allGPU model. Most GPUs including the IHDG520 GPU are also slow at performing conditional operations. Some ifclause conditional statements in the kernel function can actually be replaced by OpenCL builtin primitives. To be specific, clamping the dimensional velocity based on the maximum dimensional velocity can be done by the primitive clamp (, , ), and judging whether the dimensional position is feasible can be implemented as step (, ) + step (, ) through using the step primitive. The step primitive returns 0 if the right input parameter is less than the left input parameter, and 1, otherwise. The use of the OpenCL builtin primitives to replace some ifclause conditional statements therefore helps to shorten the time consumed on conditional operations.
4.3. CoarseGrained Multipopulation Model
Like in [66], the storage volume constraints as expressed in equations (12) and (13) are repaired as follows.
First, for each year m, reversely starting from t = 12 and ending at 1, sequentially calculate the new lower bound and upper bound for the storage volume at the end of month t according to equations (16) and (17), respectively:
Second, incrementally starting from t = 1 and ending at 12, sequentially calculate Δ_{m, t} which is the deviation of the storage volume at the end of month t according to equation (18), modify O_{m,t} according to equation (19), and update A_{m,t + 1} if Δ_{m,t} ≠ 0. Note that O_{m,t} is kept feasible in equation (19).
Let be the violation of the constraint for the storage volume at the end of month t. is calculated according to the following equation:
The original constrained problem is converted to an unconstrained problem by optimizing the following objective that incorporates the violations:where λ is the penalty factor and is a large positive number. The term is the benefit, and the term is the violation cost.
The unconstrained problem is solved by CLPSO. Each particle’s position is a 12dimensional vector representing the outflow rates over the yearly planning horizon. The power discharge and spillage rates in each month can be easily determined from the outflow rate in the corresponding month. The deterministic optimization for ISO on the ensemble of M years is parallelized by a coarsegrained multipopulation model extended from the allGPU model. The kernel part of the multipopulation model is illustrated in Figure 5. M work groups are used to concurrently tackle the M optimal operation problems, with each work group consisting of N work items and solving the optimal operation problem related to a different year following the allGPU model. Each work group determines the global best position, global best fitness, global best benefit, and global best violation cost for the corresponding optimal operation problem. The global best benefit is the benefit of the global best position, and the global best violation cost is the violation cost of the global best position. The global best position, global best fitness, global best benefit, and global best violation cost results of all the work groups are transferred back from the GPU to the CPU when the kernel function finishes execution. By summing the global best benefit results and global best violation cost results, respectively, the CPU is able to obtain the total best benefit and the total best violation cost for the deterministic optimization. The summation can only be done by the host part on the CPU because OpenCL does not support synchronizing different work groups.
A serious challenge arises and needs to be addressed for the multipopulation model. The maximum size of a buffer created in the global memory region of the IHDG520 GPU is limited to be 511 MB, and the size of the global memory region is 1.3 GB. The multipopulation model needs maximally M (2ND + 3 ((K − 1)/7 + 1) ND + 2 ((K − 1)/7 + 1) N + KND) random numbers uniformly distributed in [0, 1]. Suppose N = 40, D = 12, K = 10000, and M = 62, as a singleprecision float number occupies 4 bytes, storing about 432 million random numbers requires a memory space of around 1.6 GB, exceeding the 511 MB limit for a buffer and the 1.3 GB size of the global memory region. We propose to create a buffer representing a onedimensional array of 2ND + 3 ((K − 1)/7 + 1) ND+2 ((K − 1)/7 + 1) N + KND + M − 1 random numbers. Each work item is identified by the group ID m and the local ID i simultaneously, with m ranging from 0 to M − 1 and i ranging from 0 to N − 1 in a onedimensional index space. Each work item with the group ID m and the local ID i can access 2D + 3 ((K − 1)/7 + 1)D+2 ((K − 1)/7 + 1) + KD random numbers starting at the index (2D+3 ((K − 1)/7 + 1) D + 2 ((K − 1)/7 + 1) + KD) i + m from the onedimensional array stored in the buffer. This modified random number generation strategy aims to generate a small number of random numbers that can be flexibly used by the large number of work items without negatively impacting randomness.
5. Experimental Studies
5.1. Experimental Setup
In [27–29], the sequentialization implementation of CLPSO with a swarm of 40 particles was evaluated on various 30dimensional benchmark global optimization functions for 25 runs. Four experimental issues are investigated in this paper: (1) how is the parallelization performance of the basic coarsegrained allGPU model on all the benchmark functions? (2) how do the two enhancement strategies improve the parallelization performance of the basic allGPU model on all the benchmark functions? (3) how is the parallelization performance of the multipopulation model extended from the allGPU model on the case study of the deterministic optimization for the ISO of the Xiaowan Reservoir? and (4) would it be better to develop multiple kernels for parallelizing different phases of CLPSO?
For the 1^{st} and 2^{nd} issues, a sequentialization model of CLPSO and 3 coarsegrained allGPU models as listed in Table 1 are implemented and compared. The sequentialization model involves a swarm of 40 particles. In all the 3 allGPU models, 40 work items are used to concurrently execute the kernel on the IHDG520 GPU. Each particle/work item loops for 5000 generations. Eight 30dimensional benchmark functions which were also used in [27–29] are evaluated. Table 2 gives the name, description, global optimum , corresponding function value f (), and search space of each function. f_{1}, f_{2}, and f_{3} are unimodal, and all the other functions are multimodal. f_{7} and f_{8} are rotated.


Regarding the 3^{rd} issue, the multi population model is compared with the sequentialization model. The deterministic optimization for the ISO of the Xiaowan Reservoir is performed on the historical monthly inflow data recorded during the period of 62 years from 1953 to 2014. The optimal operation related to each year is solved by CLPSO with a swarm of 40 particles in the sequentialization model. The multipopulation model takes advantage of 62 work groups to concurrently tackle the 62 optimal operation problems. Each work group solves the optimal operation with respect to a separate year, is composed of 40 work items, and follows the final coarsegrained allGPU model. Each work item iterates for 10,000 generations. The multipopulation model employs the modified random generation strategy. The initial/final storage volume bound is 145.57 10^{8} m^{3}, corresponding to the normal forebay elevation as 1240 m. The penalty factor is 36810^{8}.
Concerning the 4^{th} issue, we need to understand the overhead of kernel launching. The execution time of a parallelization model is the time gap between the initialization of parameters and the release of OpenCL objects and is the addition of CPUside execution time and GPUside execution time. The GPUside execution time is the time spent on blocking until all the enqueued commands in the command queue are issued to the GPU and have completed. The CPUside execution time can be divided into 6 segments: segment 1 is the time initializing the numerical values for some input parameters; segment 2 is the time creating a context, a command queue, and buffers, as well as enqueueing commands to write some of the buffers; segment 3 is the time building an executable program; segment 4 is the time creating a kernel declared in the program, setting input parameters of the kernel and enqueueing a command to execute the kernel; segment 5 is the time reading the results from the GPU and releasing the kernel; and segment 6 is the time releasing the other objects.
5.2. Experimental Results and Discussion
All the sequentialization and parallelization models are executed for 25 runs on all the benchmark functions and the case study. The speedup of a parallelization model as compared with the sequentialization model is the value calculated as the mean execution time of the parallelization model divided by that of the sequentialization model. Table 3 gives the statistical (i.e., mean, standard deviation, maximum, and minimum) execution time and global best fitness value results of the sequentialization model and the intermediate, basic, and final allGPU models on all the functions. To determine whether the solutions obtained by the sequentialization model are statistically different from those obtained by the final allGPU model, twotailed ttests with the assumption of equal variances and the significance level 0.05 are carried out, and the ttest results are listed in Table 4. Table 5 lists the statistical execution time, total best benefit, and total best violation cost results of the sequentialization model and the multipopulation model on the case study. The twotailed ttest and speedup results from the comparison of the sequentialization model and the multipopulation model on the case study are listed in Table 6. For the Xiaowan Reservoir, year 1954 is a typical wet year with natural inflow 1528 m^{3}/s in average, 1976 is a typical normal year with natural inflow 1186 m^{3}/s in average, and 2011 is a typical dry year with natural inflow 974 m^{3}/s in average. The monthly natural inflow records as well as outflow rate, forebay elevation, and power output results determined from the median run by the multipopulation model for 1954, 1976, and 2011 are, respectively, shown in Figures 6–9. Table 7 gives the mean CPUside execution time results of the final allGPU model on some benchmark functions and the multipopulation model on the case study.





The original sequentialization implementation of CLPSO proposed in [27–29] also uses 40 particles that iterate for 5000 generations and was also evaluated on the 30dimensional Sphere, Schewefel’s P2.22, Noise, Rosenbrock’s, Rastrigin’s, Ackley’s, rotated Schwefel’s, and rotated Rastrigin’s functions for 25 runs. Our sequentialization model described in Table 3 differs from the original sequentialization implementation in that the inequality conditions enforced on determining the dimensional exemplar are relaxed to facilitate generating and transferring random numbers from the CPU. The statistical global best fitness results of our sequentialization model as given in Table 3 are similar with those of the original sequentialization implementation found in [27–29], verifying that the relaxation of the inequality conditions causes little negative impact on the final solution’s quality when the number of particles is large enough. The sequentialization model and the basic, intermediate, and final allGPU models all use singleprecision float numbers and directly report 0 for sufficiently small float numbers, e.g., the statistical global best fitness results of all the models are all 0 on functions f_{1} and f_{2} in Table 3.
As can be seen from Table 3, the statistical global best fitness results of the basic, intermediate, and final allGPU models are the same as those of the sequentialization model on f_{1} and f_{2} and similar with those of the sequentialization model on the other functions. The ttest results reported in Table 4 indicate that the global best fitness results of the final model are statistically indifferent on f_{3} to f_{8} as the ttest results are greater than the significance level 0.05. A ttest cannot proceed when the two pairs of data samples are all 0; hence, the ttest results on f_{1} and f_{2} are blank. Functions f_{1} to f_{3} are unimodal, and f_{4} to f_{8} are multimodal. The statistical global best fitness results given in Table 3 show that the sequentialization model and all the allGPU models of CLPSO can find the global optimum on f_{1} and f_{2} in all the runs, can find the global optimum on f_{5} in some runs and a nearoptimum in the other runs, can find a nearoptimum on f_{3} and f_{6} in all the runs, and can only find a local optimum on f_{5}, f_{7}, and f_{8} in all the runs.
For the sequentialization model, a significant part of the execution time is spent on fitness calculation. Functions f_{5} and f_{6} include cosine operations. Compared with f_{5}, f_{6} additionally needs to calculate exponential values. f_{7} and f_{8} are rotated functions and multiply the original decision vector by an orthogonal matrix. f_{8} is a rotated variant of f_{5}. There are sine operations in f_{7}. Therefore, the evaluation of f_{7} and f_{8} is most timeconsuming followed by f_{5} and f_{6}, and f_{1} to f_{4} are least timeconsuming. The analysis is clearly validated by the statistical execution time results of the sequentialization model given in Table 3.
In Table 3, the basic, intermediate, and final allGPU models exhibit similar statistical global best fitness results on all the functions. Owing to the ways adopting the enhancement strategies, the statistical execution time results of the 3 allGPU models are quite different, particularly the basic model and the intermediate model. The mean execution time results of the basic model are around 6000 ms on f_{1}, f_{2}, f_{3}, f_{4}, f_{5}, f_{6}, and f_{8} and around 3000 ms on f_{7}. The mean execution time results of the intermediate allGPU model are around 600 ms on f_{1} to f_{6}, around 700 ms on f_{7}, and around 900 ms on f_{8}, far less than those of the basic model. The dramatic execution time difference between the basic model and the intermediate model on the same function is attributed to random number generation. The basic model generates random numbers in the kernel function based on the MLC principle. Performing the integer multiplication and modulo as well as float division operations involved in the MLC random number generation process on the IHDG520 GPU is very timeconsuming. In contrast, the intermediate model generates random numbers on the high clock rate CPU efficiently and transfers the random numbers from the CPU to the GPU. The basic model takes less time on f_{7} than on the other functions because the landscape of f_{7} is highly mountainous; each particle is likely to fly to a position that leads to a better personal best fitness value during the trajectory update, and the model thus goes through much less times of exemplar redetermination and invokes much less times of random number generation. The mean execution time results of the final model are around 100 ms less than those of the intermediate model on f_{1} to f_{6}, around 30 ms less on f_{7}, and around 300 ms less on f_{8}, verifying that the strategy reducing the number of instructions in the kernel benefits shortening the execution time. With respect to f_{8}, multiplication of the dimensional position by the orthogonal matrix is actually a twolayer nested forloop; as a result, shortening of the execution time is much more on f_{8} than on the other functions. Particles are likely to be infeasible when the work items concurrently optimize the highly mountainous function f_{7}; hence, shortening of the execution time is not that noticeable on f_{7}. The mean execution time results of the final model however are still more than those of the sequentialization model on all the functions; this is because a considerable amount of time (in the scale of hundreds of ms) must be spent on creating/releasing OpenCL objects (e.g., context, command queue, buffer, program, and kernel) and transferring buffers between the CPU and the GPU, as verified by the mean CPUside execution time results given in Table 7. It can be seen from Table 3 that the standard deviation execution time results of the sequentialization model and all the allGPU models are relatively small as compared with the corresponding mean execution time results, meaning that the execution time results of each model do not vary much in each run.
The deterministic optimization for the ISO of the Xiaowan Reservoir is multimodal, as reflected from the statistical total best benefit results of the sequentialization model listed in Table 5. The standard deviation total best benefit of the sequentialization model is 0.42 10^{8} kWh. The statistical total best benefit results of the multipopulation model are similar with those of the sequentialization model. The ttest results given in Table 6 indicate that the total best benefit results of the two models are statistically indifferent. Accordingly, the modified random generation strategy does not hurt randomness. The mean total best benefit of the sequentialization model and that of the multipopulation model are around 14,550 10^{8} kWh; hence, in average, the optimized hydropower generation is about 235 10^{8} kWh per year, much more than the guaranteed hydropower generation 190 10^{8} kWh per year, validating the powerful global optimization capability of CLPSO. The solutions are feasible as the statistical total best violation cost results of the 2 models are all 0. The sequentialization model is very timeconsuming, and its mean execution time is 16,090.00 ms. As we can see from Tables 5 and 6, the mean execution time of the multipopulation model is 1165.60 ms which is significantly less than that of the sequentialization model, and the speedup is 13.80. The significant speedup is achieved by parallelizing the 62 optimal operation problems and parallelizing the 40 particles for each optimal operation problem. The 2 models are robust in terms of execution time results in all the runs as the standard deviation execution time results are small.
It can be observed from Figure 6 that the natural inflows are large in the flood season from June to September for all the 3 typical years. The natural inflows of year 1954 are considerably more than those of 1976 in July, August, September, and October, and those of 1976 are noticeably more than those of 2011 in June, July, September, and October. As Figures 7–9 show, the Xiaowan Reservoir needs to release much more outflows in April, May, June, July, September, October, and November of 1954 than in the corresponding months of 1976, leading to much lower forebay elevations in April, May, June, July and August, less power outputs in March, June, and July, and more power outputs in April, October, November, and December. The outflow rates and power outputs of 1976 are more than those of 2011 in April, May, June, September, October, November, and December.
As we can see from Table 7, the mean segment 1 time, segment 2 time, segment 4 time, segment 5 time, and segment 6 time results of the final allGPU model are similar on functions f_{1}, f_{3}, and f_{7}, with the mean segment 1 time results more than, the mean segment 2 time results less than, and the mean segment 4, 5, and 6 time results similar to the mean segment 1 time result, the mean segment 2 time result, and the mean segment 4, 5, and 6 time results of the multipopulation model on the case study, respectively. The segment 1 time is mainly the time generating the random numbers and is thus affected by the number of random numbers. The segment 2 time increases with the number of buffers created. The mean segment 3 time results of the final allGPU model on f_{1} and f_{3} are similar and are much less than those of the final allGPU model on f_{7} and the multipopulation model on the case study, indicating that the more difficult fitness evaluation of a particle, the more time building the program. Steps 1, 3, and 4 of CLPSO involve operations related to each work item, while Steps 2 and 5 are executed by just one prespecified work item. Steps 2, 3, and 4 constitute a forloop. When multiple kernels are used for parallelizing different phases of CLPSO, intermediate results must be transferred back from a kernel, and if the kernel is not the last kernel, then the intermediate results need to be transferred to the next kernel. Steps 2, 3, and 4 cannot be implemented as multiple kernels as the forloop causes the overhead of frequently enqueueing commands to write some buffers, enqueueing commands to execute the kernels, and reading results from the kernels. The overhead can be much large with respect to many generations. Steps 1 and 5 also do not benefit from being split as multiple kernels because all the work items are occupied at Step 1, and for the multipopulation model, each work group has one work item occupied at Step 5. An alternative is to implement 3 kernels, respectively, corresponding to Step 1, the forloop, and Step 5. The alternative incurs a small overhead of enqueueing commands to write some buffers, creating kernels, setting input parameters of the kernels, enqueueing commands to execute the kernels, reading results from the kernels, and releasing the kernels. Accordingly, our proposed final allGPU model and multipopulation model for parallelizing CLPSO are appropriate.
6. Conclusions
In this paper, we have studied parallelizing CLPSO by OpenCL on the integrated IHDG520 GPU. We have firstly proposed a basic coarsegrained allGPU model, with one kernel written and each work item representing a separate particle. As the IHDG520 GPU features a low clock rate and the CPU has a high clock rate, two strategies, i.e., generating and transferring random numbers from the CPU to the GPU as well as reducing the number of instructions in the kernel, have been adopted to shorten the basic model’s execution time. To facilitate parallelization implementation of CLPSO, the inequality conditions used when determining a dimensional exemplar are relaxed. We have also studied a realworld case parallelizing the deterministic optimization for the ISO of the Xiaowan Reservoir. The deterministic optimization has been solved by CLPSO on 62 years’ monthly natural inflow records and has been parallelized by a multipopulation model using a large number of work items extended from the allGPU model. Owing to the size limits for a buffer transferring data from the CPU to the GPU and for storing the data in the global memory region, the random number generation strategy has been further modified by generating a small number of random numbers that can be flexibly exploited by the large number of work items without harming randomness. Experiments have been conducted on various unimodal/multimodal 30dimensional benchmark global optimization functions and the case study. The experimental results demonstrate that (1) the relaxation of the inequality conditions causes little negative impact on the final solution’s quality; (2) the two enhancement strategies help improve the basic model’s efficiency; (3) the modified random number generation strategy is suitable for the case of a large number of work items; and (4) the multi population model is able to achieve the consumption of significantly less execution time than the corresponding sequentialization model. In the future, we will investigate adapting and applying the proposed models for parallelizing more advanced metaheuristics [29, 71–74] and solving more realworld largescale problems.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was financially supported by the National Natural Science Foundation of China Projects (61703199, 61866023, and 61865012), the Shaanxi Province Natural Science Foundation Basic Research Project (2020JM278), and the Central Universities Fundamental Research Foundation Project (GK202003006).
References
 V. Roberge and M. Tarbouchi, “Parallel particle swarm optimization on graphical processing unit for pose estimation,” WSEAS Transactions on Computers, vol. 11, no. 6, pp. 170–179, 2012. View at: Google Scholar
 V. Roberge and M. Tarbouchi, “Efficient parallel particle swarm optimizers on GPU for realtime harmonic minimization in multilevel inverters,” in Proceedings of the Annual Conference on IEEE Industrial Electronics Society,, IEEE, Montreal, Canada, October 2012. View at: Publisher Site  Google Scholar
 J. Platos, V. Snasel, T. Jezowicz et al., “A PSObased document classification algorithm accelerated by the CUDA platform,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics,, IEEE, Seoul, South Korea, October 2012. View at: Publisher Site  Google Scholar
 J. Zhao, W. Wang, W. Pedrycz, and X. Tian, “Online parameter optimizationbased prediction for converter gas system by parallel strategies,” IEEE Transactions on Control Systems Technology, vol. 20, no. 3, pp. 835–845, 2012. View at: Publisher Site  Google Scholar
 M. S. Nobile, D. Besozzi, P. Cazzaniga, G. Mauri, and D. Pescini, “Estimating reaction constants in stochastic biological systems with a multiswarm PSO running on GPUs,” in Proceedings of the Annual Conference Companion on Genetic and Evolutionary Computation, ACM, Philadelphia, PA, USA, July 2012. View at: Publisher Site  Google Scholar
 J. RegueraSalgado and J. MartínHerrero, “High performance GCPbased particle swarm optimization of orthorectification of airborne pushbroom imagery,” in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium,, IEEE, Munich, Germany, July 2012. View at: Publisher Site  Google Scholar
 S. E. Papadakis and A. G. Bakrtzis, “A GPU accelerated PSO with application to Economic Dispatch problem,” in Proceedings of theInternational Conference on Intelligent System Application to Power Systems,, IEEE, Hersonissos, Greece, September 2011. View at: Publisher Site  Google Scholar
 O. Kilic, E. ElAraby, Q. Nguyen, and V. Dang, “Bioinspired optimization for electromagnetic structure design using fullwave techniques on GPUs,” International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, vol. 26, no. 6, pp. 649–669, 2013. View at: Publisher Site  Google Scholar
 A. Ouyang, Z. Tang, X. Zhou, Y. Xu, G. Pan, and K. Li, “Parallel hybrid PSO with CUDA for lD heat conduction equation,” Computers & Fluids, vol. 110, pp. 198–210, 2015. View at: Publisher Site  Google Scholar
 D. L. Souza, O. N. Teixeira, D. C. Monteiro et al., “A new cooperative evolutionary multiswarm optimizer algorithm based on CUDA architecture applied to engineering optimization,” in Combinations of Intelligent Methods and Applications, pp. 95–115, Springer, Berlin, Germany, 2013. View at: Publisher Site  Google Scholar
 B. Sharma, R. K. Thulasiram, and P. Thulasiraman, “Portfolio management using particle swarm optimization on GPU,” in Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications,, IEEE, Leganes, Spain, July 2012. View at: Publisher Site  Google Scholar
 B. Sharma, R. K. Thulasiram, and P. Thulasiraman, “Normalized particle swarm optimization for complex chooser option pricing on graphics processing unit,” The Journal of Supercomputing, vol. 66, no. 1, pp. 170–192, 2013. View at: Publisher Site  Google Scholar
 D. Tristram, D. Hughes, and K. Bradshaw, “Accelerating a hydrological uncertainty ensemble model using graphics processing units (GPUs),” Computers & Geosciences, vol. 62, pp. 178–186, 2014. View at: Publisher Site  Google Scholar
 M. Abdelaziz, “GPUopencl accelerated probabilistic power flow analysis using montecarlo simulation,” Electric Power Systems Research, vol. 147, pp. 70–72, 2017. View at: Publisher Site  Google Scholar
 I. Boussaïd, J. Lepagnot, and P. Siarry, “A survey on optimization metaheuristics,” Information Sciences, vol. 237, pp. 82–117, 2013. View at: Publisher Site  Google Scholar
 Y. Tan and K. Ding, “A survey on GPUbased implementation of swarm intelligence algorithms,” IEEE Transactions on Cybernetics, vol. 46, no. 9, pp. 2028–2041, 2016. View at: Publisher Site  Google Scholar
 J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proceedings of the IEEE International Conference on Neural Networks, pp. 1942–1948, IEEE, November 1995, Perth, Australia. View at: Google Scholar
 Y.D. Zhang, S.H. Wang, and G.L. Ji, “A comprehensive survey on particle swarm optimization algorithm and its applications,” Mathematical Problems in Engineering, vol. 2015, Article ID 931256, 38 pages, 2015. View at: Publisher Site  Google Scholar
 Khronos Group, “OpenCL overview,” 2019, https://www.khronos.org/opencl/. View at: Google Scholar
 Nvidia Corporation, “CUDA toolkit,” 2019, https://developer.nvidia.com/cudatoolkit. View at: Google Scholar
 Y. Hung and W. Wang, “Accelerating parallel particle swarm optimization via GPU,” Optimization Methods and Software, vol. 27, no. 1, pp. 33–51, 2012. View at: Publisher Site  Google Scholar
 M. P. Wachowiak and A. E. L. Foster, “GPUbased asynchronous global optimization with particle swarm,” Journal of Physics: Conference Series, vol. 385, 2012. View at: Publisher Site  Google Scholar
 J. Kennedy and R. Mendes, “Population structure and particle swarm performance,” in Proceedings of the IEEE Congress on Evolutionary Computation, IEEE, Brisbane, Australia, June 2002. View at: Google Scholar
 Particle Swarm Central, “Standard PSO 2006,” 2006, http://www.particleswarm.info/Standard_PSO_2006.c. View at: Google Scholar
 Y. Zhou and Y. Tan, “GPUbased parallel particle swarm optimization,” in Proceedings of the IEEE Congress on Evolutionary Computation, IEEE, Trondheim, Norway, May 2009. View at: Google Scholar
 S. Cagnoni, A. Bacchini, and L. Mussi, “OpenCL implementation of particle swarm optimization: a comparison between multicore CPU and GPU performances,” in Proceedings of the European Conference on the Applications of Evolutionary Computation, Springer, Málaga, Spain, April 2012. View at: Google Scholar
 J. J. Liang, A. K. Qin, P. N. Suganthan, and S. Baskar, “Comprehensive learning particle swarm optimizer for global optimization of multimodal functions,” IEEE Transactions on Evolutionary Computation, vol. 10, no. 3, pp. 281–295, 2006. View at: Publisher Site  Google Scholar
 Z.H. Zhan, J. Zhang, Y. Li, and Y.H. Shi, “Orthogonal learning particle swarm optimization,” IEEE Transactions on Evolutionary Computation, vol. 15, no. 6, pp. 832–847, 2011. View at: Publisher Site  Google Scholar
 X. Yu and X. Zhang, “Enhanced comprehensive learning particle swarm optimization,” Applied Mathematics and Computation, vol. 242, pp. 265–276, 2014. View at: Publisher Site  Google Scholar
 M. Rabinovich, “Particle swarm optimization on a GPU,” in Proceedings of the IEEE International Conference on Electro/Information Technology, IEEE, Brookings, SD, USA, May 2012. View at: Google Scholar
 S. Simonovic, “The implicit stochastic model for reservoir yield optimization,” Water Resources Research, vol. 23, no. 12, pp. 2159–2165, 1987. View at: Publisher Site  Google Scholar
 W. W.G. Yeh, “Reservoir management and operations models: a stateoftheart review,” Water Resources Research, vol. 21, no. 12, pp. 1797–1818, 1985. View at: Publisher Site  Google Scholar
 J. D. C. Little, “The use of storage water in a hydroelectric system,” Journal of the Operations Research Society of America, vol. 3, no. 2, pp. 187–197, 1955. View at: Publisher Site  Google Scholar
 R. A. Wurbs, “Reservoirsystem simulation and optimization models,” Journal of Water Resources Planning and Management, vol. 119, no. 4, pp. 455–472, 1993. View at: Publisher Site  Google Scholar
 J. W. Labadie, “Optimal operation of multireservoir systems: stateoftheart review,” Journal of Water Resources Planning and Management, vol. 130, no. 2, pp. 93–111, 2004. View at: Publisher Site  Google Scholar
 S.M. Choong and A. ElShafie, “Stateoftheart for modelling reservoir inflows and management optimization,” Water Resources Management, vol. 29, no. 4, pp. 1267–1282, 2015. View at: Publisher Site  Google Scholar
 A. B. Celeste and M. Billib, “Evaluation of stochastic reservoir operation optimization models,” Advances in Water Resources, vol. 32, no. 9, pp. 1429–1443, 2009. View at: Publisher Site  Google Scholar
 F.J. Chang, L. Chen, and L.C. Chang, “Optimizing the reservoir operating rule curves by genetic algorithms,” Hydrological Processes, vol. 19, no. 11, pp. 2277–2289, 2005. View at: Publisher Site  Google Scholar
 L. Chen, J. McPhee, and W. W.G. Yeh, “A diversified multiobjective GA for optimizing reservoir rule curves,” Advances in Water Resources, vol. 30, no. 5, pp. 1082–1093, 2007. View at: Publisher Site  Google Scholar
 L. Le Ngo, H. Madsen, and D. Rosbjerg, “Simulation and optimisation modelling approach for operation of the hoa binh reservoir, Vietnam,” Journal of Hydrology, vol. 336, no. 34, pp. 269–281, 2007. View at: Publisher Site  Google Scholar
 W. Suiadee and T. Tingsanchali, “A combined simulationgenetic algorithm optimization model for optimal rule curves of a reservoir: a case study of the nam oon irrigation project, Thailand,” Hydrological Processes, vol. 21, no. 23, pp. 3211–3225, 2007. View at: Publisher Site  Google Scholar
 J. R. Lund and I. Ferreira, “Operating rule optimization for missouri river reservoir system,” Journal of Water Resources Planning and Management, vol. 122, no. 4, pp. 287–295, 1996. View at: Publisher Site  Google Scholar
 G. K. Young, “Finding reservoir operating rules,” Journal of the Hydraulics Division, vol. 93, no. 6, pp. 297–322, 1967. View at: Google Scholar
 M. I. Hejazi, X. Cai, and B. L. Ruddell, “The role of hydrologic information in reservoir operation—learning from historical releases,” Advances in Water Resources, vol. 31, no. 12, pp. 1636–1650, 2008. View at: Publisher Site  Google Scholar
 Z. Yang, P. Liu, L. Cheng, H. Wang, B. Ming, and W. Gong, “Deriving operating rules for a largescale hydrophotovoltaic power system using implicit stochastic optimization,” Journal of Cleaner Production, vol. 195, pp. 562–572, 2018. View at: Publisher Site  Google Scholar
 M. Karamouz, M. H. Houck, and J. W. Delleur, “Optimization and simulation of multiple reservoir systems,” Journal of Water Resources Planning and Management, vol. 118, no. 1, pp. 71–81, 1992. View at: Publisher Site  Google Scholar
 I. Nalbantis and D. Koutsoyiannis, “A parametric rule for planning and management of multiplereservoir systems,” Water Resources Research, vol. 33, no. 9, pp. 2165–2177, 1997. View at: Publisher Site  Google Scholar
 P. Liu, S. Guo, X. Xu, and J. Chen, “Derivation of aggregationbased joint operating rule curves for cascade hydropower reservoirs,” Water Resources Management, vol. 25, no. 13, pp. 3177–3200, 2011. View at: Publisher Site  Google Scholar
 A. Cancelliere, G. Giuliano, A. Ancarani, and G. Rossi, “A neural networks approach for deriving irrigation reservoir operating rules,” Water Resources Management, vol. 16, no. 1, pp. 71–88, 2002. View at: Publisher Site  Google Scholar
 A. R. S. Kumar, M. K. Goyal, C. S. P. Ojha, R. D. Singh, P. K. Swamee, and R. K. Nema, “Application of ANN, fuzzy logic and decision tree algorithms for the development of reservoir operating rules,” Water Resources Management, vol. 27, no. 3, pp. 911–925, 2013. View at: Publisher Site  Google Scholar
 V. Chandramouli and H. Raman, “Multireservoir modeling with dynamic programming and neural networks,” Journal of Water Resources Planning and Management, vol. 127, no. 2, pp. 89–98, 2001. View at: Publisher Site  Google Scholar
 M. Sangiorgio and G. Guariso, “NNbased implicit stochastic optimization of multireservoir systems management,” Water, vol. 10, no. 3, p. 303, 2018. View at: Publisher Site  Google Scholar
 P. Chaves and T. Kojiri, “Deriving reservoir operational strategies considering water quantity and quality objectives by stochastic fuzzy neural networks,” Advances in Water Resources, vol. 30, no. 5, pp. 1329–1341, 2007. View at: Publisher Site  Google Scholar
 S. J. Mousavi, K. Ponnambalam, and F. Karray, “Inferring operating rules for reservoir operations using fuzzy regression and ANFIS,” Fuzzy Sets and Systems, vol. 158, no. 10, pp. 1064–1082, 2007. View at: Publisher Site  Google Scholar
 C.C. Wei and N.S. Hsu, “Derived operating rules for a reservoir operation system: comparison of decision trees, neural decision trees and fuzzy decision trees,” Water Resources Research, vol. 44, no. 2, Article ID W02428, 2008. View at: Publisher Site  Google Scholar
 L. Li, P. Liu, D. E. Rheinheimer, C. Deng, and Y. Zhou, “Identifying explicit formulation of operating rules for multireservoir systems using genetic programming,” Water Resources Management, vol. 28, no. 6, pp. 1545–1565, 2014. View at: Publisher Site  Google Scholar
 C.M. Ji, T. Zhou, and H.T. Huang, “Operating rules derivation of jinsha reservoirs system with parameter calibrated support vector regression,” Water Resources Management, vol. 28, no. 9, pp. 2435–2451, 2014. View at: Publisher Site  Google Scholar
 A. B. Alaya, “Optimization of nebhana reservoir water allocation by stochastic dynamic programming,” Water Resources Management, vol. 17, no. 4, pp. 259–272, 2003. View at: Google Scholar
 P. P. Mujumdar and B. Nirmala, “A bayesian stochastic optimization model for a multireservoir hydropower system,” Water Resources Management, vol. 21, no. 9, pp. 1465–1485, 2007. View at: Publisher Site  Google Scholar
 V. Jothiprakash and G. Shanthi, “Comparison of policies derived from stochastic dynamic programming and genetic algorithm models,” Water Resources Management, vol. 23, no. 8, pp. 1563–1580, 2009. View at: Publisher Site  Google Scholar
 R. Yun, V. P. Singh, and Z. Dong, “Longterm stochastic reservoir operation using a noisy genetic algorithm,” Water Resources Management, vol. 24, no. 12, pp. 3159–3172, 2010. View at: Publisher Site  Google Scholar
 D. Etkin, P. Kirshen, D. Watkins et al., “Stochastic programming for improved multiuse reservoir operation in Burkina Faso, west Africa,” Journal of Water Resources Planning and Management, vol. 141, no. 3, Article ID 04014056, 2015. View at: Publisher Site  Google Scholar
 M. E. ElHawary and G. S. Christensen, Optimal Economic Operation of Electric Power Systems, Academic Press, Cambridge, MA, USA, 1979.
 G. W. Tauxe, R. R. Inman, and D. M. Mades, “Multiple objectives in reservoir operation,” Journal of the Water Resources Planning and Management Division, vol. 106, no. 1, pp. 225–238, 1980. View at: Google Scholar
 C. Lyra and L. R. M. Ferreira, “A multiobjective approach to the shortterm scheduling of a hydroelectric power system,” IEEE Transactions on Power Systems, vol. 10, no. 4, pp. 1750–1755, 1995. View at: Publisher Site  Google Scholar
 X. Zhang, X. Yu, and H. Qin, “Optimal operation of multireservoir hydropower systems using enhanced comprehensive learning particle swarm optimization,” Journal of HydroEnvironment Research, vol. 10, pp. 50–63, 2016. View at: Publisher Site  Google Scholar
 M. S. Hossain and A. ElShafie, “Intelligent systems in optimizing reservoir operation policy: a review,” Water Resources Management, vol. 27, no. 9, pp. 3387–3407, 2013. View at: Publisher Site  Google Scholar
 D. A. Augusto and H. J. C. Barbosa, “Accelerated parallel genetic programming tree evaluation with opencl,” Journal of Parallel and Distributed Computing, vol. 73, no. 1, pp. 86–100, 2013. View at: Publisher Site  Google Scholar
 A.K. C. Ahamed and F. Magoules, “Conjugate gradient method with graphics processing unit acceleration: Cuda vs opencl,” Advances in Engineering Software, vol. 111, pp. 32–42, 2017. View at: Publisher Site  Google Scholar
 S. K. Park and K. W. Miller, “Random number generators: good ones are hard to find,” Communications of the ACM, vol. 31, no. 10, pp. 1192–1201, 1988. View at: Publisher Site  Google Scholar
 M. Taherkhani and R. Safabakhsh, “A novel stabilitybased adaptive inertia weight for particle swarm optimization,” Applied Soft Computing, vol. 38, pp. 281–295, 2016. View at: Publisher Site  Google Scholar
 X. Xia, J. Liu, and Z. Hu, “An improved particle swarm optimizer based on tabu detecting and local learning strategy in a shrunk search space,” Applied Soft Computing, vol. 23, pp. 76–90, 2014. View at: Publisher Site  Google Scholar
 H. Wang, W. Wang, X. Zhou et al., “Firefly algorithm with neighborhood attraction,” Information Sciences, vol. 382383, pp. 374–387, 2017. View at: Publisher Site  Google Scholar
 H. Wang, X. Zhou, H. Sun et al., “Firefly algorithm with adaptive control parameters,” Soft Computing, vol. 21, no. 17, pp. 5091–5102, 2017. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2020 Xiang Yu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.