Computational Intelligence and Neuroscience

Volume 2017 (2017), Article ID 8348671, 8 pages

https://doi.org/10.1155/2017/8348671

## High Performance Implementation of 3D Convolutional Neural Networks on a GPU

^1College of Computer, National University of Defense Technology, Changsha 410073, China
^2National Key Laboratory of Parallel and Distributed Processing, Changsha 410073, China

Correspondence should be addressed to Qiang Lan; lanqiang_nudt@163.com

Received 16 April 2017; Revised 19 July 2017; Accepted 6 August 2017; Published 8 November 2017

Academic Editor: Athanasios Voulodimos

Copyright © 2017 Qiang Lan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Convolutional neural networks have proven to be highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, whose inputs are 3D and which therefore requires far larger amounts of memory and much more computation. FFT based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. On the other hand, the Winograd Minimal Filtering Algorithm (WMFA) can reduce the number of operations required and thus speed up the computation without increasing the required memory. This strategy was shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks, apply it to a popular 3D network used for video classification, and compare it to cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version.

#### 1. Introduction

Convolutional neural networks have proven advantages over traditional machine learning methods in applications such as image classification [1–4], tracking [5, 6], and detection [7–11]. However, the primary downside of convolutional neural networks is their high computational cost. This becomes especially challenging for 3D convolution, where handling even the smallest instances requires substantial resources.

3D convolutional neural networks have recently come to the attention of the scientific community. In [12], a database for 3D object recognition named ObjectNet3D is presented. The database focuses on the problem of recognizing the 3D pose and shape of objects from 2D images. Another repository of 3D CAD models of objects is ShapeNet [13]. In [14], the authors propose VoxNet, a 3D convolutional neural network, to solve the robust object recognition task with the help of 3D information, while the authors of [15] propose a 3D convolutional neural network for human-action recognition.

In light of these successful applications, it is worthwhile to explore new ways of speeding up the 3D convolution operation. In this paper we do so by deriving the 3D convolution forms of the minimal filtering algorithms invented by Toom and Cook [16] and generalized by Winograd [17]. Our experiments show this algorithm to be very efficient at accelerating 3D convolutional neural networks in video classification applications.

#### 2. Related Work

Many approaches aim to directly reduce the computational cost of CNNs. In [18], the authors analyse the algebraic properties of CNNs and propose an algorithmic improvement to reduce the computational workload, achieving a 47% reduction in computation without affecting accuracy. In [19], convolution operations are replaced with pointwise products in the Fourier domain, which can reduce the amount of computation significantly. Reference [20] evaluates two fast Fourier transform (FFT) convolution implementations, one based on NVIDIA cuFFT [21] and the other on Facebook's FFT implementation. The FFT method achieves a clear speedup when the filter size is large, but its disadvantage is that it consumes much more memory than the standard method.

In [22], the authors use the WMFA (Winograd Minimal Filtering Algorithm) [17] to implement the convolution operation. In theory, WMFA needs fewer multiplications while requiring little extra memory. WMFA is also easy to parallelize: Lavin and Gray [22] implemented the algorithm on a GPU and achieved better performance than the then-fastest cuDNN library. In [23], the authors present a novel architecture implemented in OpenCL on an FPGA platform; the algorithm they use for the convolution is WMFA, which significantly boosts the performance of the FPGA. However, both works implemented only 2D convolutional neural networks.

In this paper, we make four main contributions. First, we derive the 3D form of WMFA and design a detailed algorithm for the 3D convolution operation based on 3D WMFA. Second, we analyse the arithmetic complexity of 3D WMFA and show that it reduces the amount of computation in theory. Third, we implement 3D WMFA on a GPU platform and propose several optimization techniques to improve its performance. Finally, we evaluate several implementations of 3D convolutional neural networks and demonstrate the advantage of our proposed 3D WMFA method.

#### 3. Fast 3D Convolution Algorithm

##### 3.1. Preliminary: 3D Convolutional Neural Networks

For 2D convolution, kernels have fixed width and height, and they are slid along the width and height of the input feature maps. For 3D convolution, both feature maps and kernels have a depth dimension, and the convolution also slides along the depth direction. We can compute the output of a 3D convolutional layer using the following formula:

$$Y_{i,x,y,z} = \sum_{c=1}^{C}\sum_{u=1}^{K_d}\sum_{v=1}^{K_h}\sum_{w=1}^{K_w} I_{c,\,x+u,\,y+v,\,z+w}\, F_{i,c,u,v,w}, \tag{1}$$

where $Y_{i,x,y,z}$ represents the result of the convolution operation at position $(x, y, z)$ of the $i$th output channel, $I_c$ is one of the input features, and $F_{i,c}$ is one of the filters. Equation (1) represents the direct convolution method, which is computationally intensive. The detailed arithmetic complexity of this method is shown in Section 3.3.
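To make the indexing in (1) concrete, the following minimal pure-Python sketch implements direct (valid) 3D convolution with nested loops; the array layouts and names are our illustration, not the paper's GPU code:

```python
def conv3d_direct(inputs, filters):
    """Direct (valid) 3D convolution following Equation (1).

    inputs:  nested list of shape [C][D][H][W]        (input channels)
    filters: nested list of shape [K][C][kd][kh][kw]  (output channels)
    returns: nested list of shape [K][D-kd+1][H-kh+1][W-kw+1]
    """
    C = len(inputs)
    D, H, W = len(inputs[0]), len(inputs[0][0]), len(inputs[0][0][0])
    K = len(filters)
    kd = len(filters[0][0])
    kh = len(filters[0][0][0])
    kw = len(filters[0][0][0][0])
    out = [[[[0.0 for _ in range(W - kw + 1)]
             for _ in range(H - kh + 1)]
            for _ in range(D - kd + 1)]
           for _ in range(K)]
    for i in range(K):                     # output channel
        for x in range(D - kd + 1):
            for y in range(H - kh + 1):
                for z in range(W - kw + 1):
                    acc = 0.0
                    for c in range(C):     # reduction over input channels
                        for u in range(kd):
                            for v in range(kh):
                                for w in range(kw):
                                    acc += (inputs[c][x + u][y + v][z + w]
                                            * filters[i][c][u][v][w])
                    out[i][x][y][z] = acc
    return out
```

The seven nested loops make the arithmetic cost of the direct method explicit: every output point performs $C \cdot K_d \cdot K_h \cdot K_w$ multiply-accumulates.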

##### 3.2. 3D WMFA

We introduce a new, fast algorithm to compute a 3D convolutional layer, based on WMFA. In order to introduce the 3D WMFA, we first give a brief introduction to the 1D WMFA. WMFA computes an output tile of size $m$ at a time; we use $F(m, r)$ to denote this computation, where $r$ is the filter size. According to the definition of convolution, $2 \times 3 = 6$ multiplications are required to compute $F(2, 3)$, but we can reduce the number of multiplications if we use the following WMFA:

$$F(2,3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix}, \tag{2}$$

where

$$\begin{aligned} m_1 &= (d_0 - d_2)\,g_0, & m_2 &= (d_1 + d_2)\,\frac{g_0 + g_1 + g_2}{2},\\ m_4 &= (d_1 - d_3)\,g_2, & m_3 &= (d_2 - d_1)\,\frac{g_0 - g_1 + g_2}{2}. \end{aligned} \tag{3}$$

The number of multiplications needed is 4; however, four additions are needed to transform the input image, three additions to transform the filter, and four additions to transform the result of the dot product. We can use a matrix form to represent the computation:

$$Y = A^{T}\left[(G g) \odot (B^{T} d)\right]. \tag{4}$$

We call $A^{T}$, $G$, and $B^{T}$ the transform matrices, and their values are

$$B^{T} = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix},\quad G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac12 & \tfrac12 & \tfrac12 \\ \tfrac12 & -\tfrac12 & \tfrac12 \\ 0 & 0 & 1 \end{bmatrix},\quad A^{T} = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}. \tag{5}$$

In (4), $d$ and $g$ represent the input tile and the filter tile, respectively. As described in [22], the 2D WMFA takes the form

$$Y = A^{T}\left[(G g G^{T}) \odot (B^{T} d B)\right] A, \tag{6}$$

where $g$ is the filter with size $3 \times 3$ and $d$ is the image tile with size $4 \times 4$. To compute $F(2 \times 2, 3 \times 3)$, we need $4 \times 4 = 16$ multiplications; however, $2 \times 2 \times 3 \times 3 = 36$ multiplications are needed according to the convolution definition. Therefore, 2D WMFA reduces the number of multiplications by a factor of $36/16 = 2.25$, at the cost of 32 additions in the data transformation stage, 28 floating point instructions in the filter transformation stage, and 24 additions in the inverse transformation stage. For a convolutional layer, the numbers of input and output channels are large, which means each input channel is convolved with many different filters, so each transformed input tile can be reused as many times as the number of output channels. Each filter is slid along the $x$ and $y$ directions of an input channel during convolution, so each transformed filter is reused as many times as the number of subtiles of the input channel.
And since the output tile is accumulated along the input channels, the inverse transformation is performed after the reduction; the number of inverse transformations is therefore determined by the number of output channels. Consequently, the costs of the data transformation, filter transformation, and inverse transformation stages remain low in a real convolutional layer implementation.
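The $F(2,3)$ recipe in (2)-(3) can be checked numerically. The following pure-Python sketch (our own illustration) compares the four-multiplication Winograd form against the direct six-multiplication convolution:

```python
def winograd_f23(d, g):
    """F(2,3): two convolution outputs from a 4-element input tile d
    and a 3-tap filter g, using only 4 multiplications (Eqs. (2)-(3))."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Direct sliding-window computation: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]
```

For example, `winograd_f23([1, 2, 3, 4], [1, 1, 1])` returns `[6.0, 9.0]`, matching `direct_f23` on the same inputs. The filter-dependent factors in $m_2$ and $m_3$ are computed once per filter and reused across all input tiles, which is why the filter transformation cost amortizes away in a full layer.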

We can also apply WMFA to 3D convolution. To compute $F(2 \times 2 \times 2, 3 \times 3 \times 3)$, we apply the 3D Winograd transformation to the input tile and the filter tile and apply the 3D inverse Winograd transformation to the elementwise product of the transformed input tile and the transformed filter tile. Algorithm 1 is a general form of the 3D Winograd transformation. In the algorithm, $T$ is the transformation matrix; it can be $G$, to transform the filter tile, or $B^{T}$, to transform the input image tile. The elementwise products of the transformed input tiles and transformed filter tiles are accumulated along the channels, which can be converted to a matrix multiplication similar to the description in [22]. Consider the sum

$$M_{k,b} = \sum_{c=1}^{C} U_{k,c} \odot V_{c,b}, \tag{7}$$

where $U_{k,c}$ is the transformed filter tile for output channel $k$ and input channel $c$, and $V_{c,b}$ is the transformed input tile for input channel $c$ and tile index $b$. This sum can be divided into several submatrix multiplications: labeling each scalar component of the $4 \times 4 \times 4$ transformed tiles with coordinates $(\xi, \nu, \tau)$ yields

$$M^{(\xi,\nu,\tau)}_{k,b} = \sum_{c=1}^{C} U^{(\xi,\nu,\tau)}_{k,c} V^{(\xi,\nu,\tau)}_{c,b}. \tag{8}$$

This equation represents a matrix multiplication for each coordinate $(\xi,\nu,\tau)$ and can be simplified as

$$M^{(\xi,\nu,\tau)} = U^{(\xi,\nu,\tau)} V^{(\xi,\nu,\tau)}. \tag{9}$$

Algorithm 2 gives an overview of the 3D WMFA. The algorithm consists of four stages: the Winograd transformation of the input feature tiles; the Winograd transformation of the filter tiles; the matrix multiplications converted from the elementwise products of the transformed tiles; and the inverse Winograd transformation of the matrix multiplication results.
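Because the Winograd transforms are separable, the 3D transforms can be obtained by applying the 1D transform matrices of (5) along each of the three axes. The following pure-Python sketch (our own illustration, not the paper's GPU kernel) computes a single-channel $F(2 \times 2 \times 2, 3 \times 3 \times 3)$ this way: a $2 \times 2 \times 2$ output tile costs $4^3 = 64$ elementwise multiplications instead of $2^3 \cdot 3^3 = 216$ for the direct method, a theoretical reduction factor of about 3.4.

```python
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # input transform
G  = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]    # filter transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                               # inverse transform

def transform3d(M, t):
    """Apply matrix M along all three axes of tensor t (nested lists),
    i.e., out[u][v][w] = sum_{a,b,c} M[u][a] M[v][b] M[w][c] t[a][b][c]."""
    n_out, n_in = len(M), len(M[0])
    return [[[sum(M[u][a] * M[v][b] * M[w][c] * t[a][b][c]
                  for a in range(n_in)
                  for b in range(n_in)
                  for c in range(n_in))
              for w in range(n_out)]
             for v in range(n_out)]
            for u in range(n_out)]

def winograd3d_f222(d, g):
    """F(2x2x2, 3x3x3): d is a 4x4x4 input tile, g is a 3x3x3 filter."""
    V = transform3d(BT, d)                       # transformed input tile, 4x4x4
    U = transform3d(G, g)                        # transformed filter tile, 4x4x4
    M = [[[U[i][j][k] * V[i][j][k]               # 64 elementwise multiplications
           for k in range(4)] for j in range(4)] for i in range(4)]
    return transform3d(AT, M)                    # inverse transform -> 2x2x2 output
```

In a full layer, the elementwise product line is exactly what Equations (7)-(9) turn into $4^3 = 64$ independent matrix multiplications over the channel dimension, which is what makes the method map well onto GPU hardware.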