Research Article  Open Access
Yong Chen, Hai Jin, Han Jiang, Dechao Xu, Ran Zheng, Haocheng Liu, "Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems", Mobile Information Systems, vol. 2017, Article ID 1897476, 10 pages, 2017. https://doi.org/10.1155/2017/1897476
Implementation and Optimization of GPU-Based Static State Security Analysis in Power Systems
Abstract
Static state security analysis (SSSA) is one of the most important computations to check whether a power system is in a normal and secure operating state. It is a challenge to satisfy real-time requirements with CPU-based concurrent methods due to the intensive computations. A sensitivity-analysis-based method with the graphics processing unit (GPU) is proposed for power systems, which can reduce calculation time by 40% compared to the execution on a 4-core CPU. The proposed method involves load flow analysis and sensitivity analysis. In load flow analysis, a multifrontal method for sparse LU factorization is explored on GPU through dynamic frontal task scheduling between CPU and GPU. The varying matrix operations during sensitivity analysis on GPU are highly optimized in this study. The results of performance evaluations show that the proposed GPU-based SSSA with optimized matrix operations can achieve a significant reduction in computation time.
1. Introduction
An electric power system is a network of connected electrical components used to supply, transfer, distribute, and consume electric power. It is necessary to know the physical properties of the power system for safety. Different experiments can be conducted to study these properties. Sometimes, however, it is impossible to perform such experiments on large-scale, complex power systems because the experiments are too expensive or the measurements are too difficult to take. Therefore, power system analysis [1] becomes a popular alternative and plays an increasingly important role in the planning, design, and operation of electrical power systems. Power system analysis aims at building an equivalent model for a given power system, running load flow calculations, evaluating the effects of disturbances, and generating ways to improve the stability performance of the power system. Power system analysis can be classified into steady state analysis and transient stability analysis. The former is further categorized into load flow analysis and static state security analysis (SSSA). Load flow analysis determines voltages and power flows through a stable system, in which any transient disturbances are assumed to have settled down. SSSA involves power flow calculation when a chosen device is connected or disconnected. Transient stability is analyzed to estimate the system's stability under dynamic conditions during disturbances.
Load flow analysis is one of the most significant computations in power system planning and operation. In load flow analysis, we need to model the entire network with all generators, loads, transmission lines, transformers, and reactors. Since a power system is a large-scale and highly nonlinear dynamic system, the power flow calculation then involves solving high-dimensional sparse nonlinear algebraic equations based on the nodal admittance form. The load flow equations allow us to compute bus voltage magnitudes and phase angles as well as branch current magnitudes. Solving nonlinear equations requires iterative computation, which is data- and computation-intensive. SSSA can be replaced by a series of load flow analyses; however, the computation would be far more intensive if a rigorous load flow calculation were used for every case. Some studies focus on improving the performance of SSSA [2, 3]. Many highly simplified algorithms have been worked out to save iterative computation, and a few approaches with parallel load flow analysis have been proposed. The graphics processing unit (GPU) is becoming an attractive accelerator for parallel computation [4] and can achieve high performance for general-purpose computation. Load flow analysis solves high-dimensional sparse nonlinear equations, and sensitivity analysis solves changes in the network state based on the load flow results, using a number of sparse matrix operations such as matrix multiplication, addition, and inversion. These features make GPU a suitable and viable platform for SSSA. While we have previously provided a GPU-based SSSA solution [5], in practice a considerable amount of work is needed to continue improving its speed and performance.
Many operations in SSSA can be parallelized on GPU to improve performance. However, a hybrid approach combining CPU and GPU computations is necessary in GPU-based SSSA: branch prediction and speculative execution are not supported on GPU, which is therefore not good at the iterative operations and convergence judgements involved in solving nonlinear equations. For a large sparse matrix, a nonzero-only storage and computation mechanism should be adopted. The proposed method can also combine various small matrices into one matrix for multiplication to better utilize the GPU threads.
The remainder of this paper is organized as follows: the background and related work are described in Section 2. Section 3 describes the workflow of static state security analysis and its modules. Section 4 addresses the optimization of matrix operations. Section 5 evaluates the performance of the proposed system and addresses optimization issues. Section 6 contains the conclusions of this study as well as a discussion of future improvements to the proposed system.
2. Related Work
SSSA is an important computational tool in the design and operation of large, interconnected power systems. It determines whether a system is operating in a secure state at any given time with respect to unforeseen outages of buses, lines, transformers, or generators. Branch outage simulation is a basic validation of power grid safety [6]. SSSA can be implemented by running one load flow calculation for each outage case; if a large number of cases are provided, the time needed for calculation and analysis can be very long. Hence, other methods, such as the DC (direct current) power method and the sensitivity analysis method, have been proposed to speed up static state security analysis.
The DC power method [7] solves nonlinear equations in power systems by transforming them into linear equations. This reduces the computational complexity and makes the calculation more efficient, but it can lead to poor precision. In practical application, the DC power method can only check for overloads; if the voltage rises above its upper limit, the violation will not be detected. The DC power method is commonly used in power system design, where the active power flows are the main concern.
In sensitivity analysis, the outage of one transmission line is regarded as a perturbation of normal circumstances [8]. The method involves calculating a sensitivity matrix from the Taylor series expansion of the load flow equations. The impact of a line outage can be simulated by net injection and withdrawal changes. Many analyses of outages on the same network thus conduct only one load flow calculation and perform sensitivity analysis on its results. This bypasses many intermediate, iterative steps of the calculation and significantly increases the efficiency of line-outage analysis. The method not only improves performance and precision but also obtains the active power, reactive power, voltage magnitude, and angle at each system node. Therefore, it is commonly used in practice.
Static state security analysis pays more attention to linear algebraic operations. Reference [9] presents a systematic approach for deriving a large set of frequently encountered dense linear algebra operations, which is shown to yield new high-performance algorithms. There are many GPU-based studies, among which cuBLAS [10] is the most commonly used library. cuBLAS is a linear algebra library on GPU, an implementation of BLAS (Basic Linear Algebra Subprograms) on NVIDIA's CUDA (Compute Unified Device Architecture) runtime. When programming with cuBLAS, we use it like BLAS and need not care about the details of the thread model or the storage model of CUDA programming.
Reference [11] proposed a hybrid CPU-GPU implementation of sparse Cholesky factorization based on the multifrontal method. Reference [12] introduced an efficient GPU-based sparse solver for circuit problems and developed a hybrid parallel LU factorization approach combining task-level and data-level parallelism on GPUs. Reference [13] proposed an implementation of the Newton-Raphson (NR) load flow algorithm as it pertains to parallelization and implementation in CUDA. However, there is no simple and efficient GPU-based static state security analysis method that conducts branch outage simulations.
3. Workflow and Modules of Static State Security Analysis
3.1. Overview of Static State Security Analysis
Sensitivity analysis is a highly parallelizable method, which makes it convenient to use on GPU to accelerate the calculations in static state security analysis. The topology of the power grid can be modeled simply as a graph with nodes and edges, in which the nodes represent buses (power stations or transformers) and the edges represent transmission lines [14]. The graph network can be converted into a nodal admittance matrix or a nodal impedance matrix; both are highly sparse for a large power system. According to the conservation of complex power theorem, the load flow equations can be written as follows:

P_i = V_i Σ_{j∈i} V_j (G_ij cos θ_ij + B_ij sin θ_ij),
Q_i = V_i Σ_{j∈i} V_j (G_ij sin θ_ij − B_ij cos θ_ij),    (1)

where P_i and Q_i are the injected active and reactive powers, respectively, at node i, and G_ij + jB_ij are the elements of the nodal admittance matrix. Load flow calculation can be roughly considered as the problem of solving the node voltages (V_i, θ_i) at each node when the injected complex powers (P_i, Q_i) are given. This nonlinear system can be solved by the Newton-Raphson (NR) method, which is based on a truncated Taylor series. NR needs to iteratively solve (2) until the mismatches converge within a given tolerance (ε_P, ε_Q):

[ΔP; ΔQ] = J [Δθ; ΔV/V],    (2)

where J is the Jacobian matrix of the load flow equations.
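As an illustrative sketch of these equations, the following Python snippet solves a hypothetical two-bus case with the Newton-Raphson method. The line parameters, load values, and the finite-difference Jacobian are assumptions made for brevity; the paper's implementation factorizes the analytic sparse Jacobian instead.

```python
import math

def powerflow_2bus(p_spec, q_spec, r=0.01, x=0.1, tol=1e-8, max_iter=20):
    """NR load flow for a toy 2-bus system (hypothetical line data r + jx).
    Bus 1 is the slack bus (V=1.0, theta=0); bus 2 is a PQ bus."""
    d = r * r + x * x
    g, b = r / d, -x / d            # series admittance y = g + jb
    v2, th2 = 1.0, 0.0              # flat start

    def injections(v2, th2):
        # P2, Q2 from the load flow equations with Y21 = -y, Y22 = y
        p2 = v2 * ((-g * math.cos(th2) - b * math.sin(th2)) + v2 * g)
        q2 = v2 * ((-g * math.sin(th2) + b * math.cos(th2)) - v2 * b)
        return p2, q2

    for _ in range(max_iter):
        p2, q2 = injections(v2, th2)
        dp, dq = p_spec - p2, q_spec - q2
        if max(abs(dp), abs(dq)) < tol:      # convergence test on mismatches
            break
        # 2x2 Jacobian by finite differences, for brevity only
        h = 1e-6
        p_t, q_t = injections(v2, th2 + h)
        p_v, q_v = injections(v2 + h, th2)
        j11, j12 = (p_t - p2) / h, (p_v - p2) / h
        j21, j22 = (q_t - q2) / h, (q_v - q2) / h
        det = j11 * j22 - j12 * j21
        dth = (dp * j22 - dq * j12) / det    # solve J [dth, dv]^T = [dp, dq]^T
        dv = (dq * j11 - dp * j21) / det
        th2, v2 = th2 + dth, v2 + dv
    return v2, th2

# Hypothetical load at bus 2: 0.5 pu active and 0.2 pu reactive consumption
v2, th2 = powerflow_2bus(-0.5, -0.2)
```

The iteration converges in a few steps for this lightly loaded case, yielding a slightly depressed voltage and a small negative angle at bus 2.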
For a large system, the load flow calculation may require significant computational resources to calculate, store, and factorize the Jacobian matrix J. Following load flow analysis, the normal state of the system is determined. If there is a randomly injected power disturbance ΔW or a network change ΔY, the state vector x is correspondingly changed by Δx. The equation can be expressed as follows:

W_0 + ΔW = f(x_0 + Δx, Y_0 + ΔY),    (3)

in which W_0 is the active and reactive power of the nodes in the normal state, x_0 is the voltage vector of the nodes, and Y_0 is the normal network parameter. When the power disturbance is ignored, ΔW = 0, and a Taylor series expansion [15] is applied to (3). Finally, we obtain the following equations:

Δx = S_0 ΔW_Y,  S_0 = J_0^{-1},    (4)

where ΔW_Y is the equivalent injected power change caused by the network change. In the above, S_0 is the sensitivity matrix, which is equal to J_0^{-1}. If only one line outage is considered in SSSA, the injected power change of the nodes along the outaged line can be obtained by (5) and (6). In these equations, the elements of the Jacobian matrix J are calculated as in (7) from the load flow solution, and the corresponding elements related to the outaged-line nodes in the sensitivity matrix are calculated from J_0^{-1}. We find that the items on the right-hand side of (6) come from the load flow calculation, and only two matrix operations occur in the equation.
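A minimal sketch of the sensitivity idea, with purely illustrative numbers in place of a real Jacobian: the inverse is computed once from the base case and then reused across many outage cases, which is exactly what saves the repeated NR iterations.

```python
# Hypothetical 2x2 Jacobian of a solved base case (illustrative numbers only)
J = [[10.0, 1.0],
     [0.5, 8.0]]

def inv2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

S = inv2(J)  # sensitivity matrix S0 = J0^{-1}, computed once

def state_change(dw):
    """dx = S0 * dW: state change for one equivalent injection change dW."""
    return [S[0][0] * dw[0] + S[0][1] * dw[1],
            S[1][0] * dw[0] + S[1][1] * dw[1]]

# Many outage cases reuse the same S0, so no repeated NR iterations are needed
cases = [[-0.3, 0.1], [0.2, -0.4]]
changes = [state_change(dw) for dw in cases]
```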
The overall workflow of the static state security analysis system is shown in Figure 1. It consists of four main modules: I/O, preprocessing, power flow calculation, and static state security analysis.
When the original data is input, it is first preprocessed for power flow calculation. The system then conducts a full power flow calculation and determines the Jacobian matrix J, which is prepared for static state security analysis. Finally, the changed network parameters caused by a line outage are assumed, and the new system state variables are identified. Once a set of critical elements is simulated as tripped, we can determine through SSSA whether there are equipment-limit violations. Hence, the state of the power system (secure or insecure) can be easily determined.
3.2. Preprocessing
Data preprocessing is a major and essential stage before power flow calculation, and it can speed up static state security analysis. There are three preprocessing steps: sparse matrix storage, node numbering optimization, and result storage optimization, which aim at reducing the storage space for matrices and speeding up the calculation on nonzero elements.
Following raw data input, the data of each large sparse matrix is compressed in CSR (Compressed Row Storage) format [16]. Although CSR requires an indirect addressing step for every single scalar operation in a matrix-vector product, it is very storage-efficient: the nonzero elements of each matrix row are placed in contiguous locations in memory and traversed in a row-wise fashion.
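A small pure-Python sketch of CSR storage and the row-wise traversal described above (the array names `vals`, `cols`, and `ptr` are illustrative conventions, not taken from the paper):

```python
def to_csr(dense):
    """Compress a dense matrix into CSR: values, column indices, row pointers."""
    vals, cols, ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                vals.append(v)
                cols.append(j)
        ptr.append(len(vals))       # ptr[i+1] - ptr[i] = nonzeros in row i
    return vals, cols, ptr

def csr_matvec(vals, cols, ptr, x):
    """y = A x, traversing only the stored nonzeros row by row."""
    y = []
    for i in range(len(ptr) - 1):
        s = 0.0
        for k in range(ptr[i], ptr[i + 1]):
            s += vals[k] * x[cols[k]]   # indirect addressing through cols[k]
        y.append(s)
    return y

A = [[4.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [2.0, 0.0, 5.0]]
vals, cols, ptr = to_csr(A)
y = csr_matvec(vals, cols, ptr, [1.0, 1.0, 1.0])   # -> [5.0, 3.0, 7.0]
```

Only five of the nine entries are stored, and the inner loop touches nothing else, which is the saving the preprocessing stage is after.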
By matrix ordering with the minimum-degree minimum-length algorithm [17] or the minimum-degree minimum-number algorithm [18], the rows and columns of the sparse matrix are permuted to avoid creating new nonzero elements in the matrix during factorization, so that the forward and backward substitutions can save many nonzero computations. Such node numbering optimizations are conducted with an elimination tree, which can determine the positions of the injected elements in the factorization.
Static state security analysis is performed on the results of the power flow calculation. Since three read operations are necessary every time CSR-formatted data is accessed, it incurs a considerable cost in I/O operations. Hence, the matrix is decompressed prior to calculation. Assembling several matrices may take up a lot of space, and the memory may not be large enough to store them all. Therefore, preprocessing involves the following three steps:
(1) Analyze the result of the power flow calculation, where the sensitivity matrix S_0 is obtained as the inverse of the Jacobian matrix. Allocate memory to store the matrices assembled from it.
(2) Generate the submatrices and vectors required by (6) and (7) from the load flow results.
(3) Check the matrix size. If it does not exceed the device capacity, copy it directly into GPU device memory; otherwise, partition the matrix and copy the submatrices into the device.
3.3. Power Flow Calculation
Power flow calculation is executed with the Newton-Raphson (NR) method to solve the large sparse nonlinear equations. The NR method repeatedly solves large, sparse, linear systems of equations. Factorizing the Jacobian matrix at each iteration is a significant challenge in both computation and memory capacity; hence, it is crucial to design a fast and efficient solution for the factorization. The multifrontal algorithm is a particularly useful method for factorizing large sparse linear systems and is adopted in the proposed solution. The workflow of the method is shown in Figure 2.
The multifrontal algorithm [19] is an effective method for large sparse matrix factorization. A frontal matrix is a small, dense matrix; the term "multi" refers to the fact that multiple frontal matrices are used during the factorization. The multifrontal method turns the factorization of a sparse matrix into a sequence of factorizations of smaller, dense matrices organized by an elimination tree or an assembly tree. The method achieves satisfactory data locality and great potential for parallelization. For example, [20] investigates automatic tuning of an SpMV (sparse matrix-vector) multiplication kernel in a partitioned global address space language, which supports a hybrid thread- and process-based communication layer for multicore systems. The multifrontal method proposed here is combined with a supernodal implementation, so that multiple columns with the same nonzero pattern can be grouped together as a dense kernel for concurrent factorization.
The method is formulated in terms of frontal matrices and update matrices. The processing order of the frontal matrices is determined by the elimination tree from the preprocessing. The order is expressed as a frontal matrix chain, in which each frontal matrix is designated as a leaf and processed by a CPU thread. For each node, the subtree update matrix is the sum of the update matrices of all its subtrees; the frontal matrix is formed by assembly over the subtree update matrix and can be partitioned into four submatrices as in (10). The Cholesky factor vector and the update matrix of the node can then be solved with the corresponding Cholesky factorization.
Matrix operations consume a considerable amount of time in the multifrontal method. A threshold, derived from the scale of the matrix calculations, is defined to distinguish CPU tasks from GPU tasks. If the number of computations is greater than the threshold, the main thread allocates the task to the GPU; otherwise, the small-scale matrix is computed on the CPU. The nodes in the frontal matrix chain are calculated one by one until all of them are computed. The correction Δx is then computed by backward substitution with the decomposed triangular matrices to update x. If the absolute value of the mismatch is less than the convergence tolerance ε, the calculation is concluded; otherwise, the iterations continue until a convergent result is obtained.
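The threshold rule for dispatching frontal matrices can be sketched as follows. The cost model (dense-factorization flop count) and the threshold value are assumptions for illustration, not the paper's tuned values.

```python
# Illustrative cutoff between CPU and GPU work (an assumed value, not tuned)
GPU_THRESHOLD = 64 ** 3

def dispatch(frontal_sizes):
    """Assign each frontal matrix to 'gpu' or 'cpu' by its O(n^3) dense cost."""
    plan = []
    for n in frontal_sizes:
        cost = n ** 3            # rough dense factorization cost of an n x n front
        plan.append(("gpu" if cost > GPU_THRESHOLD else "cpu", n))
    return plan

# Small fronts stay on the CPU; large fronts are worth the transfer to the GPU
plan = dispatch([8, 32, 128, 512])
```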
3.4. Static State Security Analysis
In this module, the GPU-based sensitivity method is used for static state security analysis. The elements of the matrices in (6) are prepared from the previous load flow analysis. The analysis consists of four steps.
Step 1. Calculate the product term in (6) by matrix multiplication.
Step 2. Calculate the sum term in (6) by matrix addition.
Step 3. Calculate the inverse of the result of Step 2 by matrix inversion.
Step 4. Calculate (5) by multiplying a matrix by a vector.
Two matrix multiplications and one matrix inversion are involved in the analysis. Following these steps, the changes in the node state variables can be obtained, so that the power flow on each branch following a line outage can be acquired. Moreover, many SSSA cases can be combined in parallel on a GPU to improve efficiency.
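The four steps above can be sketched with toy 2×2 matrices. The block values and the injection vector are hypothetical placeholders for the matrices prepared from the load flow results.

```python
def matmul(a, b):
    """Dense matrix product (list-of-lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2(m):
    """Closed-form 2x2 inverse."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

# Hypothetical prepared blocks and equivalent injection change
A = [[2.0, 0.0], [1.0, 3.0]]
B = [[1.0, 1.0], [0.0, 2.0]]
C = [[1.0, 0.0], [0.0, 1.0]]
dw = [0.5, -0.1]

M = matmul(A, B)          # Step 1: matrix multiplication
M = matadd(M, C)          # Step 2: matrix addition
Minv = inv2(M)            # Step 3: matrix inversion
dx = matvec(Minv, dw)     # Step 4: matrix-vector multiplication
```

Each step maps to one GPU kernel in the proposed system, and independent outage cases run the same pipeline in parallel.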
4. Optimization of Matrix Operations
4.1. Small Matrix Multiplication
In matrix multiplication on GPU, 16 threads in one block execute concurrent element multiplications, as shown in Figure 3. The 16 threads are completely independent and can communicate with one another only within the block. However, the same instruction is executed across a warp, which launches 32 threads at once. Whether a task has 32 or 16 threads, it is launched by a warp at the same time cost. Hence, if only one submatrix multiplication occupies a block with a single warp, 16 threads are launched and the other 16 are idle.
It is therefore best practice to allocate more than one multiplication operation to a block. A block can contain a maximum of 1024 threads, which can accommodate 64 submatrix multiplication operations. Host threads combine their data into one large matrix and copy it into the shared memory of the block. Since the instructions are the same within a block, each thread must access its own portion of the large matrix. The distribution of logical memory is shown in Figure 4. We assume that every 16 threads process one matrix multiplication operation, so that 64 matrix multiplications can be handled concurrently in a block.
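A CPU-side simulation of this packing scheme, assuming 4×4 submatrices with one virtual thread per output element; the 4×4 size is an assumption for illustration, since the paper only states that 16 threads handle each submatrix product.

```python
def batched_matmul_4x4(As, Bs):
    """Simulate one GPU block: 16 'threads' per 4x4 product, up to 64 products
    per 1024-thread block. Each virtual thread computes one output element."""
    n = len(As)
    Cs = [[[0.0] * 4 for _ in range(4)] for _ in range(n)]
    for tid in range(n * 16):          # flattened thread index within the block
        m, e = divmod(tid, 16)         # which submatrix, which output element
        i, j = divmod(e, 4)
        Cs[m][i][j] = sum(As[m][i][k] * Bs[m][k][j] for k in range(4))
    return Cs

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[float(i + j) for j in range(4)] for i in range(4)]
Cs = batched_matmul_4x4([A, A], [I4, I4])   # two products packed in one "block"
```

With two products packed together, all 32 virtual threads of a warp do useful work instead of half of them idling.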
4.2. Small Matrix Inversion
The analysis in (6) involves a matrix inversion operation. In general, there are two methods for matrix inversion. One is the adjoint matrix method, which computes the inverse as the adjoint matrix divided by the determinant. The adjoint matrix must be calculated first, and it is too complicated to derive its expression for matrices of order greater than 3. The other is the elementary transformation method, which is simpler than the adjoint matrix method for high-order matrices but requires more iterations and is inefficient in terms of GPU resource consumption. In order to enhance efficiency, the partitioned matrix inversion formula in (11) is used.
If a matrix is provided, it can be divided into four submatrices. In this manner, the inversion task is divided and assigned to four threads, which can complete their tasks almost simultaneously. In (11), only matrix multiplication, matrix addition, and small matrix inversion operations appear, and the small matrix inversion can be expressed in closed form as (12).
In this way, four threads can use (12) to calculate the inversions of the four submatrices in formula (11). The results are then substituted into (11) to work out the inverse of the whole matrix.
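A sketch of the partitioned inversion for a 4×4 matrix using the standard block (Schur complement) identity, which is assumed here to correspond to (11) and (12); each of the four block results could be produced by its own thread.

```python
def mul2(a, b):
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

def sub2(a, b):
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]

def neg2(a):
    return [[-x for x in row] for row in a]

def inv2(m):
    """Closed-form 2x2 inverse: the base case handed to each thread."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def block_inv4(M):
    """Invert a 4x4 matrix by partitioning into 2x2 blocks [[A, B], [C, D]]."""
    A = [r[:2] for r in M[:2]]; B = [r[2:] for r in M[:2]]
    C = [r[:2] for r in M[2:]]; D = [r[2:] for r in M[2:]]
    Ai = inv2(A)
    S = sub2(D, mul2(C, mul2(Ai, B)))          # Schur complement of A
    Si = inv2(S)
    # The four blocks of the inverse (each computable independently)
    TL = [[Ai[i][j] + mul2(mul2(Ai, B), mul2(Si, mul2(C, Ai)))[i][j]
           for j in range(2)] for i in range(2)]
    TR = neg2(mul2(Ai, mul2(B, Si)))
    BL = neg2(mul2(Si, mul2(C, Ai)))
    return [TL[0] + TR[0], TL[1] + TR[1], BL[0] + Si[0], BL[1] + Si[1]]

M = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 1.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 5.0]]
Mi = block_inv4(M)
```

Applying the same split recursively reduces any power-of-two order to 2×2 base cases, which is why only multiplications, additions, and small inversions are needed.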
4.3. Vector Multiplication by CrossCombining Storage
When solving (5), each change in power flow is obtained by multiplying a row vector by a column vector. A set of vector-vector multiplications can be combined into one batched multiplication, as shown in Figure 5.
In sensitivity analysis, 32n (n is a positive integer) row vectors are merged into one long row vector placed in global memory, which allows a warp of 32 threads to access it concurrently. As Figure 6 shows, when there are 32 row vectors, one long row vector is created in global memory, in which the first, second, third, and fourth column elements of the 32 row vectors are grouped in turn. When a row vector is multiplied by the column vector, the corresponding elements are multiplied pairwise. Since the column vector is accessed many times, it is stored in shared memory.
A number of row vectors are thus stored in one row vector by cross-combination, so a reduce operation must be executed after the vector multiplication. The result of the elementwise multiplication is stored back into the long vector; the four result elements belonging to each row vector are grouped, summed up, and stored at the corresponding position. Eventually, the output contains the results of the 32 vector-vector multiplications.
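A sketch of the cross-combining layout and the subsequent reduction, assuming the interleaving places element k of row vector t at position k·32 + t so that a warp reads consecutive addresses (the exact layout in Figure 6 may differ).

```python
def cross_combine(rows):
    """Interleave 32 length-4 row vectors: element k of vector t is stored at
    position k*32 + t, so 32 'threads' read consecutive addresses together."""
    n, length = len(rows), len(rows[0])
    long_vec = [0.0] * (n * length)
    for t, row in enumerate(rows):
        for k, x in enumerate(row):
            long_vec[k * n + t] = x
    return long_vec

def warp_dot(long_vec, v, n=32):
    """Each of the n 'threads' computes one row-vector . column-vector product:
    a strided elementwise multiply followed by a per-thread reduction."""
    length = len(v)
    prod = [long_vec[i] * v[i // n] for i in range(n * length)]  # multiply stage
    return [sum(prod[k * n + t] for k in range(length))          # reduce stage
            for t in range(n)]

rows = [[float(t), 1.0, 0.0, 2.0] for t in range(32)]   # 32 illustrative rows
v = [1.0, 10.0, 100.0, 1000.0]                          # shared column vector
dots = warp_dot(cross_combine(rows), v)                 # dots[t] = t + 2010
```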
4.4. RCMM Storage Method
In general practice, the cublasSgemm function in cuBLAS is called for GPU matrix multiplication. Matrices in C/C++ are in row-major order, but cuBLAS assumes that matrices are stored on the device in column-major order, and the order exchange is a time-consuming operation. Hence, the RCMM (Row Column-Matrix Multiplication) method is adopted, in which the three matrices involved in the multiplication are stored in column-major, row-major, and column-major order, respectively.
For the multiplications in the multifrontal method, one matrix is stored in row-major order and the other in column-major order. The matrices of the same frontal chain share an array, so the entire row or column data is copied to global memory at once. Each computation uses only part of the data, which avoids having to move large amounts of data on demand.
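The idea of mixing storage orders so that both operands are read contiguously can be sketched as follows; this is an interpretation of the RCMM layout for illustration, not the paper's exact kernel.

```python
def matmul_rcmm(a_rm, b_cm, n):
    """C = A*B for n x n matrices with A stored row-major and B column-major,
    so both operands are traversed contiguously in the inner product."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            # A's row i and B's column j are both contiguous runs of length n
            c[i * n + j] = sum(a_rm[i * n + k] * b_cm[j * n + k]
                               for k in range(n))
    return c   # result returned in row-major order here

A = [1.0, 2.0,
     3.0, 4.0]            # A = [[1,2],[3,4]] in row-major order
B_cm = [5.0, 7.0,         # column 0 of B = [[5,6],[7,8]]
        6.0, 8.0]         # column 1
C = matmul_rcmm(A, B_cm, 2)   # -> [19.0, 22.0, 43.0, 50.0]
```

No transposition pass is needed at any point, which is the cost RCMM is designed to avoid.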
5. Performance Evaluation
5.1. Dataset and Experimental Environment
The experiments are conducted on a server with an Intel i7 950 (3.07 GHz) CPU, 16 GB memory, and an NVIDIA GeForce GTX 460. The CentOS 5.9 Linux operating system with CUDA 4.0 is used. From MATPOWER [21], professional software for power calculation, a dataset derived from a real power grid and some IEEE standard testing datasets are chosen, as shown in Table 1. CA300 is the power flow data for the IEEE 300-bus test case. CA3012wp and CA3120sp are the power flow data cases for the Polish system winter 2007-08 evening peak and summer 2008 morning peak, respectively. In order to evaluate the computation involved in a large power system, two identical Polish power systems are connected to form the large symmetric power systems CA5472, CA5492wp, CA6024, and CA6240.

The speedup ratio is defined in (13), where T_CPU and T_GPU are the execution times on CPU and GPU, respectively:

speedup = T_CPU / T_GPU.    (13)

The program executed on the CPU is a multithreaded program on a multicore CPU.
5.2. Evaluation of Static State Security Analysis System
The datasets in Table 1 are used for testing on both the GPU and CPU platforms. The results are shown in Table 2.

In Table 2, the CPU is a 4-core processor, with which the program almost achieves its best performance. However, as the number of active processor cores in the compute node increases further, the program on the CPU platform suffers serious performance degradation: on dataset CA3012wp, the calculation time on an 8-core CPU increased by 13% compared with a 4-core CPU.
The experimental results show that the execution time on GPU is shorter than on CPU, except at the scale of 300 nodes. Transferring data between host and device takes much time on GPU; at 300 nodes, the transfer time cannot be ignored and occupies a significant proportion of the GPU execution time. On the other hand, with only 300 submatrices, every two submatrices are processed in a warp on average. However, with 336 cores on the GTX 460, this means that over half the cores are idle throughout the processing, a serious waste of GPU resources that hinders overall performance.
Excluding the 300-node dataset, the accelerating effect of GPU computation is evident, and the speedup grows with increasing data scale. GPU computing reduces calculation time by 40% compared with execution on a 4-core CPU. The increasing trend of GPU speedup is shown in Figure 7.
5.3. Evaluation of Small Matrix Multiplication
Equation (6) is chosen for the performance experiments. The formula consists of matrix multiplication and matrix addition. Two experiments are conducted. In Experiment 1 (Exp1), two combined submatrices are stored in one block after optimization. In Experiment 2 (Exp2), one submatrix is stored in one block. The experimental results for matrix multiplication and addition in (6) with the two matrix storage schemes are shown in Figure 8.
The optimization method achieves a speedup of approximately 1.7. Parallel threads are launched by warps, each of which can launch 32 threads. If the matrices are not merged within a block, half the threads in each warp are idle. After matrix combination, all 32 threads in the warp are kept busy, which shows that the optimization method performs well.
5.4. Evaluation of Small Matrix Inversion
Two experiments are conducted on the matrix inversion in (5). cuBLAS provides a large-matrix inversion function, which is inefficient for batched small-matrix inversion. The small batch matrix inversion method is therefore implemented as a GPU kernel in which one submatrix is processed by one thread. In Exp1, the small batch matrix inversion function is called once, and 100 inversion threads are scheduled on the GPU. In Exp2, the cuBLAS matrix inversion function is called 100 times. As Figure 9 shows, the small batch matrix inversion method yields better performance (Exp1: 925 ms versus Exp2: 1055 ms).
6. Conclusion and Future Work
In this paper, GPU-based static state security analysis is proposed for power systems. The GPU-based multifrontal method is implemented to solve large sparse matrices, and sensitivity analysis is chosen for static state security analysis on GPU. To make full use of the GPU device, several optimization methods for matrix operations are presented, such as data combination in multiple small-scale matrix multiplication operations and the partitioned matrix method for matrix inversion.
Experimental results indicate that the proposed algorithm on GPU can significantly improve system performance. Our results show a speedup of 1.7–1.9 on power system simulation cases at scales from 3,000 to 6,000 nodes.
In future work, it may be desirable to further improve performance by porting the system and methods to more scalable distributed-memory environments, such as multi-GPUs [22]. Compilers could also be used to speed up system migration and dynamic task scheduling in CPU-GPU heterogeneous parallel systems. Our way of dealing with small-scale matrices can be applied to scientific calculations in other fields, and the processing method for special-dimension matrices can be extended to matrices of all dimensions.
Competing Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61133008), the National 973 Key Basic Research Plan of China (no. 2013CB2282036), Major Subject of State Grid Corporation of China (no. SGCCMPLG001(001031)2012), the National 863 Research and Development Program of China (no. 2011AA05A118), and the National Science and Technology Pillar Program (no. 2012BAH14F02).
References
[1] J. D. Glover, M. S. Sarma, and T. Overbye, Power System Analysis and Design, Cengage Learning, 2016.
[2] P. Ding, Y. Li, D. Xu, F. Tian, J. Yan, and Z. Yu, "Improved algorithm of fast static state security analysis of power systems," Proceedings of the Chinese Society of Electrical Engineering, vol. 30, no. 31, pp. 77–82, 2010.
[3] G. Zhou, X. Zhang, Y. Lang et al., "A novel GPU-accelerated strategy for contingency screening of static security analysis," International Journal of Electrical Power & Energy Systems, vol. 83, pp. 33–39, 2016.
[4] J. D. Owens, D. Luebke, N. Govindaraju et al., "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum, vol. 26, no. 1, pp. 80–113, 2007.
[5] Y. Chen, H. Jin, H. Jiang, D. Xu, R. Zheng, and H. Liu, "GPU-based static state security analysis in power systems," in Proceedings of the 9th Asia-Pacific Services Computing Conference (APSCC '15), pp. 258–267, Bangkok, Thailand, December 2015.
[6] X. Li and Z. Guo, "The transmission interface real power flow control based on N-1 static safety restriction," Electric Power, vol. 38, no. 3, pp. 26–28, 2005.
[7] K. Purchala, L. Meeus, D. Van Dommelen, and R. Belmans, "Usefulness of DC power flow for active power flow analysis," in Proceedings of the IEEE Power Engineering Society General Meeting, pp. 454–459, San Francisco, Calif, USA, June 2005.
[8] X. Wang, W. Fang, and Z. Du, Modern Power System Analysis, Science Press, 2016.
[9] P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Orti, and R. A. van de Geijn, "The science of deriving dense linear algebra algorithms," ACM Transactions on Mathematical Software, vol. 31, no. 1, pp. 1–26, 2005.
[10] NVIDIA, "cuBLAS," https://developer.nvidia.com/cublas.
[11] R. Zheng, W. Wang, H. Jin, S. Wu, Y. Chen, and H. Jiang, "GPU-based multifrontal optimizing method in sparse Cholesky factorization," in Proceedings of the 26th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP '15), pp. 90–97, IEEE, Ontario, Canada, July 2015.
[12] X. Chen, L. Ren, Y. Wang, and H. Yang, "GPU-accelerated sparse LU factorization for circuit simulation with performance modeling," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 3, pp. 786–795, 2015.
[13] D. J. Sooknanan and A. Joshi, "GPU computing using CUDA in the deployment of smart grids," in Proceedings of the SAI Computing Conference (SAI '16), pp. 1260–1266, London, United Kingdom, July 2016.
[14] L. Xuan, L. Tianqi, and L. Xingyuan, "A novel evolving model for power grids," Science China Technological Sciences, vol. 53, no. 10, pp. 2862–2866, 2010.
[15] S. Wang, Z. Zheng, and C. Wang, "Power system transient stability simulation under uncertainty based on Taylor model arithmetic," Frontiers of Electrical and Electronic Engineering in China, vol. 4, no. 2, pp. 220–226, 2009.
[16] J. L. Greathouse and M. Daga, "Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '14), pp. 769–780, November 2014.
[17] A. Gómez and L. G. Franquelo, "An efficient ordering algorithm to improve sparse vector methods," IEEE Transactions on Power Systems, vol. 3, no. 4, pp. 1538–1544, 1988.
[18] R. Betancourt, "An efficient heuristic ordering algorithm for partial matrix refactorization," IEEE Transactions on Power Systems, vol. 3, no. 3, pp. 1181–1187, 1988.
[19] J. W. Liu, "The multifrontal method for sparse matrix solution: theory and practice," SIAM Review, vol. 34, no. 1, pp. 82–109, 1992.
[20] S. G. Li, C. J. Hu, J. C. Zhang, and Y. Q. Zhang, "Automatic tuning of sparse matrix-vector multiplication on multicore clusters," Science China Information Sciences, vol. 58, no. 9, pp. 1–14, 2015.
[21] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, "MATPOWER: steady-state operations, planning, and analysis tools for power systems research and education," IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 12–19, 2011.
[22] D. Chen, W. Chen, and W. Zheng, "CUDA-Zero: a framework for porting shared memory GPU applications to multi-GPUs," Science China Information Sciences, vol. 55, no. 3, pp. 663–676, 2012.
Copyright
Copyright © 2017 Yong Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.