Scientific Programming

Volume 2018, Article ID 9340697, 11 pages

https://doi.org/10.1155/2018/9340697

## HPGraph: High-Performance Graph Analytics with Productivity on the GPU

^1 Department of Computer, National University of Defense Technology, Changsha 410000, China
^2 National Key Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410000, China

Correspondence should be addressed to Haoduo Yang; 18976522633@163.com

Received 7 July 2018; Accepted 31 October 2018; Published 11 December 2018

Academic Editor: Can Özturan

Copyright © 2018 Haoduo Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The growing use of graphs in many fields has sparked broad interest in developing high-level graph analytics programs. Existing GPU implementations achieve limited performance or compromise on productivity. HPGraph, our high-performance bulk-synchronous graph analytics framework for the GPU, provides an abstraction that maps vertex programs to generalized sparse matrix operations on a GPU backend. HPGraph strikes a balance between performance and productivity by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that lets users implement various graph algorithms with relatively little effort. We evaluate the performance of HPGraph on four graph primitives (BFS, SSSP, PageRank, and TC). Our experiments show that HPGraph matches or even exceeds the performance of high-performance GPU graph libraries such as MapGraph, nvGRAPH, and Gunrock. HPGraph also runs significantly faster than advanced CPU graph libraries.

#### 1. Introduction

Graph computing has become critical for analyzing data in many domains, such as bioinformatics, social networking, web analysis, and traffic engineering. Over the past decade, various parallel graph computing frameworks have been proposed to handle large-scale graphs by leveraging modern massively parallel processors, specifically graphics processing units (GPUs). GPUs offer excellent peak throughput and energy efficiency [1] and have demonstrated very strong computational performance with appropriate optimization. However, the unpredictable control flow and memory divergence on GPUs caused by the irregularity of graph topologies require sophisticated strategies to ensure efficiency, making an efficient GPU implementation a challenge. With graphs getting larger and queries getting more complex, high-level graph analysis frameworks are needed to help users extract the information they need with minimal programming effort.

In order to bridge the gap between high performance and productivity, we propose HPGraph, a high-level parallel graph analytics framework for the GPU. The key abstraction of our framework maps vertex programs to generalized sparse matrix-vector multiplication (SPMV) operations implemented as CUDA kernels. Unlike other GPU graph computing models, which focus on sequencing the steps of computation [2], we convert graph traversal to matrix operations so that we can focus on manipulating data structures and deliver the high performance of optimized generalized SPMV. In addition, HPGraph encapsulates the complexity of GPU programming and achieves high productivity by hiding the underlying matrix primitives from users. Users with limited knowledge of low-level GPU architectures are therefore able to assemble complex graph primitives.

Our contributions to this field are as follows:
(1) We propose an efficient graph analytics framework that maps vertex programs to generalized sparse matrix operations on the GPU. This abstraction, unlike the abstractions of previous GPU graph processing libraries, can express a wide range of graph primitives while simultaneously delivering high performance.
(2) HPGraph is productive for users. We hide the low-level details of how HPGraph actually gets the job done and provide a set of flexible APIs to express several graph primitives at a high level of abstraction.
(3) HPGraph integrates a variety of GPU-specific optimization strategies for data structures, graph traversal, and memory access into its core to further improve performance. In our experiments, our graph primitives significantly outperform advanced CPU graph analytics frameworks and achieve comparable or even superior performance to other state-of-the-art GPU analytics libraries.
(4) We provide a detailed experimental evaluation of our graph primitives, including comparisons to several advanced CPU and GPU implementations.

The remainder of this paper is organized as follows: Section 2 presents the existing graph frameworks and the motivations for our work. Section 3 describes our implementation and optimizations in detail. Section 4 discusses the implementation details of the graph algorithms. Section 5 provides the results of measuring the performance of the frameworks. In Section 6, we conclude the paper and discuss potential areas for future research.

#### 2. Related Work and Motivation

This section discusses the research landscape of large-scale graph analytics frameworks, which differ both in their programming models and in the platforms they support.

Parallel graph analytics frameworks propose various high-level programming models, such as vertex programming, matrix operations, task models, and declarative programming. Among them, vertex programming is widely applied and generally productive for writing graph programs. However, because it focuses on sequencing the steps of computation and lacks a strong mathematical model, it is difficult to analyze and often suffers from long runtimes and high memory consumption [3]. In contrast, matrix models are built on a solid mathematical foundation; e.g., graph traversal computations are modeled as operations on a semiring in CombBLAS [4] and nvGRAPH [5]. This is beneficial for reasoning and performing optimizations, but such models are considered difficult to program [6].

CPU-based systems are in common use for graph computation. Giraph [7] uses an iterative vertex-centric programming model similar to Google's Pregel [8] and is built on Apache Hadoop's MapReduce. PowerGraph [9], designed for real-world graphs with skewed power-law degree distributions, uses the more flexible Gather-Apply-Scatter (GAS) abstraction; it partitions edges across nodes with a vertex-cut, which exposes greater parallelism in natural graphs. Galois [10–12], which adopts a task-based abstraction, is one of the highest-performance graph systems for shared memory. CombBLAS [4] and GraphMat [3] are two popular matrix programming models. CombBLAS is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics. GraphMat, developed by Intel, is a single-node multicore graph framework that maps Pregel-like vertex programs to high-performance sparse matrix operations. Recent works [3, 13] have compared different graph frameworks on CPUs and show that GraphMat significantly outperforms many other frameworks in most cases. Moreover, mapping many diverse graph operations to a small set of matrix operations makes GraphMat's backend considerably easier to maintain and extend, for example, to multiple nodes.

GPUs are power-efficient and provide high memory bandwidth, allowing them to exploit parallelism in computationally demanding applications. Most high-level GPU programming models for graph analytics today mirror CPU programming models. For instance, Zhong and He introduced Medusa [14], a high-level GPU-based system for parallel graph computing using Pregel's message model [8]. VertexAPI2 [15], MapGraph [16], and CuSha [17, 18] adopt PowerGraph's GAS programming model [9]. Gunrock [2] is a more recent library for developing graph algorithms on a single GPU. Rather than designing an abstraction around computation, Gunrock uses a GPU-specific data-centric model built on operations on a vertex or edge frontier. Wang et al. [2] report that Gunrock achieves performance comparable to the fastest hardwired GPU implementations and better performance than any other high-level GPU graph library. nvGRAPH (available at https://developer.nvidia.com/nvgraph) is a high-performance GPU graph analytics library developed by NVIDIA. It harnesses the power of GPUs for linear algebra and matrix computations to handle large-scale graph analytics [5]; its core functionality expresses graph computation as semiring SPMV operations [2]. It currently supports three algorithms: PageRank, SSSP, and Single-Source Widest Path.

Compared to CPU graph frameworks, existing high-level GPU graph frameworks usually achieve improved performance thanks to their hardware strengths, generalized load-balancing strategies, and optimized GPU primitives. Nevertheless, the unpredictable control flow and memory divergence on GPUs caused by irregular graph topologies require sophisticated strategies to ensure efficiency, which can result in relatively low productivity and high memory consumption.

Some matrix-based frameworks on CPUs, e.g., CombBLAS [4], GraphMat [3], and PEGASUS [19], have proven that a vertex-based programming model can be built on a matrix backend for graph programming. Meanwhile, GPUs have the potential to accelerate sparse matrix algebra, which is memory-bound by nature. A variety of optimizations have been developed to improve the performance of SPMV [20], one of the most important operations in high-performance computing (HPC), on GPUs. However, as far as we know, existing matrix-based graph analytics frameworks on GPUs achieve nowhere near the performance of these optimized libraries [21, 22]. In this work, our goal is to combine the high performance of an optimized sparse matrix backend with the productivity of vertex programming for GPUs.

#### 3. The HPGraph’s Abstraction and Optimizations

##### 3.1. HPGraph’s Abstraction

HPGraph is based on the idea that traversal from a vertex can be expressed as an operation similar to a dot product, the elementary step of SPMV routines on the graph adjacency matrix (or its transpose). Hence, HPGraph maps vertex-programming graph analytics to generalized SPMV on the GPU to deliver high performance. It targets graph operations that can be expressed as iterative convergent processes. By "iterative," we refer to operations that run a series of steps repeatedly, with the results of one iteration serving as the starting point for the next. By "convergent," we mean that these iterations reach the correct answer with sufficient accuracy before the process terminates.

HPGraph uses the bulk-synchronous parallel (BSP) model [23]. In BSP, parallel programs are executed in synchronous phases known as supersteps; each iteration in HPGraph is one superstep. Such operations are sufficient for portability and efficiency on GPUs. Our abstraction differs from other frameworks, particularly other GPU-based frameworks: rather than focusing on sequencing the steps of computation, we focus on mapping vertex programs to operations on data structures. The graph primitives we describe in this paper comprise three main steps: *PREPROCESS*, *SPMV*, and *APPLY*.
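To make the three-phase BSP loop concrete, the following is a minimal, self-contained sketch in Python using BFS levels as the iterative convergent primitive. All names (`bfs_bsp`, the adjacency-list stand-in for the sparse matrix) are illustrative, not HPGraph's API.

```python
# Sketch of a PREPROCESS / SPMV / APPLY superstep loop, with BFS as the
# primitive; each while-loop iteration corresponds to one BSP superstep.

def bfs_bsp(edges, n, source):
    # PREPROCESS: adjacency lists stand in for the sparse matrix G.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    depth = [-1] * n          # per-vertex property vector x
    depth[source] = 0
    active = {source}         # frontier of "active" vertices
    level = 0
    while active:             # convergent: stop when no vertex is active
        level += 1
        # SPMV-like traversal: visit unvisited neighbors of active vertices.
        y = {v for u in active for v in adj[u] if depth[v] == -1}
        # APPLY: commit new properties and form the next frontier.
        for v in y:
            depth[v] = level
        active = y
    return depth

print(bfs_bsp([(0, 1), (0, 2), (1, 3), (2, 3)], 4, 0))  # [0, 1, 1, 2]
```

The frontier set plays the role of the "active" mask described below; on the GPU this loop body becomes a generalized SPMV kernel rather than Python sets.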

*PREPROCESS*: in the *PREPROCESS* phase, the graph is converted to an adjacency matrix stored in GPU memory. Furthermore, according to the specific requirements of each graph algorithm, HPGraph generates the property data for vertices and configures the framework parameters. In terms of data structures, HPGraph represents all per-vertex data as structure-of-arrays (SOA) layouts, which allow coalesced memory accesses with minimal memory divergence.
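As an illustration of the *PREPROCESS* step, the sketch below converts an edge list into a compressed sparse row (CSR) adjacency matrix, the typical GPU-friendly layout; the function name and arrays are illustrative, and per-vertex properties would be kept as separate flat arrays in the SOA spirit.

```python
# Build CSR (row offsets + column indices) from an unweighted edge list.

def to_csr(edges, n):
    row_ptr = [0] * (n + 1)
    for u, _ in edges:                 # count out-degree per row
        row_ptr[u + 1] += 1
    for i in range(n):                 # prefix sum -> row offsets
        row_ptr[i + 1] += row_ptr[i]
    col_idx = [0] * len(edges)
    nxt = row_ptr[:-1].copy()          # next free slot in each row
    for u, v in edges:
        col_idx[nxt[u]] = v
        nxt[u] += 1
    return row_ptr, col_idx

row_ptr, col_idx = to_csr([(0, 1), (0, 2), (1, 2), (2, 0)], 3)
print(row_ptr)  # [0, 2, 3, 4]
print(col_idx)  # [1, 2, 2, 0]
```

Separating offsets from indices (rather than storing per-vertex neighbor lists as objects) is what enables the coalesced accesses mentioned above.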

*SPMV*: similar to Giraph [7], HPGraph marks some vertices (or a single vertex) as "*active*." In each iteration, each vertex only visits and interacts with its "*active*" neighbors. Suppose *G* is an *M*-by-*N* sparse matrix storing the graph and *x* is a vector storing user-defined vertex properties. In the *SPMV* phase, graph traversal is completed by the generalized SPMV *y* = *G* · *x* (or *y* = *G*^T · *x*). The vector *y* stores the prospective new properties of each vertex, which will be used in the *APPLY* phase. The corresponding operations on the sparse matrix are based on the idea that visiting adjacent vertices can be performed through a dot product, as follows: if a vertex *r* visits one of its "*active*" neighbors *l* along out-edges (i.e., *G*(*r*, *l*) ≠ 0), a function named "*gather*" is executed using *G*(*r*, *l*), *x*(*l*), and the properties of these two vertices. Conversely, visiting along in-edges requires us to first perform a transposition to obtain the matrix *G*^T, so that *G*^T(*r*, *l*) and *x*(*l*) can be used directly. The "*reduce*" function then summarizes a new property from the results of the "*gather*" operations and stores it in the resultant vector *y*. This whole process amounts to one dot product of the generalized SPMV.
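The generalized SPMV with user-supplied *gather* and *reduce*, restricted to "active" neighbors, can be sketched as follows. Only the names *gather* and *reduce* come from the text; the function signature, the CSR arrays, and the SSSP-style usage are illustrative assumptions, not HPGraph's actual API.

```python
# Generalized SPMV over a CSR matrix: one "dot product" per vertex r,
# combining gather results from active neighbors with a reduce function.

def generalized_spmv(row_ptr, col_idx, vals, x, active, gather, reduce_fn, identity):
    n = len(row_ptr) - 1
    y = [identity] * n
    for r in range(n):
        acc = identity
        for e in range(row_ptr[r], row_ptr[r + 1]):
            l = col_idx[e]
            if l in active:                       # only visit active neighbors
                acc = reduce_fn(acc, gather(vals[e], x[l]))
        y[r] = acc
    return y

# Usage: one SSSP-style relaxation step on a 3-vertex graph whose in-edges
# are 0->1 (w=2), 0->2 (w=5), 1->2 (w=1); gather = add weight, reduce = min.
INF = float("inf")
row_ptr, col_idx, vals = [0, 0, 1, 3], [0, 0, 1], [2, 5, 1]
y = generalized_spmv(row_ptr, col_idx, vals, [0, INF, INF], {0},
                     gather=lambda w, d: d + w, reduce_fn=min, identity=INF)
print(y)  # [inf, 2, 5]
```

Swapping (min, +) for (+, ×) recovers the ordinary SPMV, which is what makes the semiring-style abstraction expressive enough for several primitives.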

Figure 1 shows a simple example of calculating out-degrees. A native SPMV operation, which takes as input the adjacency matrix converted from the graph and a vector of all ones, produces the out-degrees of all vertices in a vector. Concretely, each vertex visits along its out-edges with multiplications (the "gather" operation) and adds the products together (the "reduce" operation) to obtain its out-degree.