Scientific Programming

Volume 2019, Article ID 3185137, 11 pages

https://doi.org/10.1155/2019/3185137

## Low-Complexity Scalable Architectures for Parallel Computation of Similarity Measures

^{1}Department of Computer Engineering, Princess Sumaya University for Technology, Amman, Jordan
^{2}Department of Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada
^{3}Department of Computer Engineering, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
^{4}Department of Microelectronics, Electronics Research Institute, Cairo 11622, Egypt

Correspondence should be addressed to Awos Kanan; aws_kanan@yahoo.com

Received 1 March 2019; Accepted 11 April 2019; Published 26 May 2019

Guest Editor: Mohamed Zahran

Copyright © 2019 Awos Kanan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Processor array architectures have been employed as accelerators to compute the similarity distances found in a variety of data mining algorithms. However, most of the architectures proposed in the existing literature are designed in an ad hoc manner without taking into consideration the size and dimensionality of the datasets. Furthermore, data dependencies have not been analyzed, and often only one design choice is considered for the scheduling and mapping of computational tasks. In this work, we present a systematic methodology to design scalable and area-efficient linear (1-D) processor arrays for the computation of similarity distance matrices. Six possible design options are obtained and analyzed in terms of area and time complexities. The obtained architectures provide the flexibility to choose the one that meets hardware constraints for a specific problem size. Comparisons with previously reported architectures demonstrate that one of the proposed architectures achieves a lower area and area-delay product, in addition to its scalability to high-dimensional data.

#### 1. Introduction

The computational complexity of machine learning and data mining algorithms that are frequently used in today’s embedded applications, such as handwriting analysis, fingerprint/iris/signature verification, and face recognition, makes the design of efficient hardware architectures for these algorithms a challenge. The computation of similarity distance matrices is one of the computational kernels required by several machine learning and data mining algorithms to measure the degree of similarity between data samples [1]. For several algorithms, such as K-means [2], SVM [3], and KNN [4], distance calculation is a computationally intensive task that accounts for a significant portion of the overall processing time [5].

Given the complexity of today’s applications, machine learning and data mining algorithms are expected to handle big and high-dimensional data. In [6], an optimized FPGA implementation of the K-means clustering algorithm has been presented. The authors reported that the maximum number of features that could fit on a Stratix V A7 FPGA is around 160. Even partitioning the computation and caching partial results in local memory to accommodate larger sizes was not efficient, due to excessive global memory transactions. Most of the existing hardware architectures for similarity distance computation have not taken into consideration the size and/or dimensionality of the datasets. Theoretical time and area complexities for some architectures, including the ones presented in [7–9], have not been validated experimentally. Implementation results reported for other architectures, including [10, 11], are for low-dimensional datasets of dimensions 4 and 9, respectively. Although these architectures have low theoretical time complexities, their poor scalability to high-dimensional data makes them either unsuitable for hardware implementation or implementable only with poor performance, as discussed in [6].

In our recent work [12], we systematically explored the design space of 2-D processor array architectures for similarity distance computation. Using the employed methodology, we were able to obtain the same architectures proposed in [7, 8] and also to identify four additional architectures with improved area and time complexities. Furthermore, the obtained architectures have been classified into two groups based on the size and dimensionality of input datasets. 2-D processor arrays are generally faster than 1-D (linear) processor arrays, as more processing elements (PEs) are used to perform the computation in parallel. In contrast, linear arrays are more suitable for resource-constrained applications with limited area and I/O bandwidth, typically found in embedded systems. In this work, we present a systematic technique to explore the design space of linear processor arrays for the computation of similarity distance matrices in order to obtain additional design options for area and bandwidth efficiency optimization, which is desirable in embedded system design.

In summary, the key contributions of this paper are as follows:

(i) We present an algebraic technique to design scalable low-complexity linear processor arrays for the computation of similarity distance matrices based on an algebraic analysis of data dependencies. Compared to the classical approach of analyzing data dependencies, which relies on studying how output variables depend on inputs, the employed technique relies on defining a computational domain using algorithm indices and studying how input and output variables depend on these indices.

(ii) We propose six scheduling functions using computational geometry and matrix algebra. In addition to the minimum restrictions we used in [12] to obtain valid scheduling vectors, more time restrictions are introduced in this work to meet area and bandwidth constraints. Associated projection matrices for the obtained scheduling vectors are also introduced, to map points in the 3-D computation domain to PEs in the projected 1-D processor arrays.

(iii) We perform full design space exploration using the proposed scheduling vectors and their associated projection matrices. Six design options are obtained, analyzed in terms of area, speed, and bandwidth efficiency, and compared analytically and experimentally with existing architectures in the literature.

The rest of this paper is organized as follows: related work is presented in the next section. The similarity distance computation problem is formulated in Section 3. In Section 4, the systematic technique used to parallelize distance computation is introduced. In Section 5, a systematic design space exploration is performed to obtain the proposed architectures. Design comparison and implementation results are presented in Section 6 and Section 7, respectively. Finally, Section 8 concludes the paper.

#### 2. Related Work

Several processor array architectures have been proposed for accelerating the computation of similarity distances. In [7], a distance calculation unit for a VLSI cluster analysis architecture has been proposed as a 2-D processor array to calculate similarity distances between *N* samples of an input dataset and *K* cluster centroids. For datasets with a large number of samples *N*, the proposed architecture is not feasible for hardware implementation, as it consists of a large number of processing elements (PEs) with numerous input features being fed simultaneously. The authors of [8] proposed a 2-D processor array for the calculation of similarity distances between samples of an *M*-dimensional dataset and *K* cluster centroids. For high-dimensional datasets with a large number of features *M* per sample, the proposed architecture is not feasible for hardware implementation due to chip constraints on I/O bandwidth and the number of pins.

Compared to 2-D processor arrays, linear arrays are generally more area-efficient, with lower bandwidth and energy demands. In [9], a linear processor array for the computation of similarity distance has been proposed. The proposed architecture is used to calculate similarity distances between data samples of an input dataset and cluster centroids in a VLSI clustering analyzer. Input data samples are fed in a feature-serial format. The proposed linear array has higher time complexity than 2-D processor arrays; however, both the area complexity and the number of I/O pins have been reduced. Another linear array for the computation of similarity measures has been proposed in [13]. That architecture calculates a special case of the similarity distance matrix required by some machine learning algorithms, in which pairwise distances among all samples of a dataset are computed.
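To make the feature-serial operation concrete, the following is a minimal software sketch, not the actual hardware design of [9], of a linear array with one PE per centroid: the features of a sample arrive one per cycle, and every PE updates its running Manhattan distance in parallel. The function name and data layout are illustrative assumptions.

```python
def linear_array_feature_serial(x, C):
    # One accumulator per PE; PE k holds the running distance to centroid C[k].
    acc = [0] * len(C)
    # Features of sample x arrive one per cycle (feature-serial input).
    for m, feature in enumerate(x):
        # In hardware, all PEs update concurrently in the same cycle.
        for k in range(len(C)):
            acc[k] += abs(feature - C[k][m])
    return acc  # K Manhattan distances after M cycles
```

Note how the sequential outer loop over features models the M-cycle latency of the feature-serial format, while the inner loop models spatial parallelism across the K PEs.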

In [10], a distance calculation unit has been proposed to calculate similarity distances between data samples and cluster centroids in a hardware implementation of the K-means clustering algorithm. The proposed design calculates the *K* distances between a data sample of *M* features and *K* cluster centroids concurrently using *K* adder trees. A similar architecture with pipelined adder trees has been presented in [11] to minimize the critical path delay and improve the throughput.
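The adder-tree organization described above can be sketched in software as a pairwise reduction: each of the K concurrent distance units sums its M absolute differences through a binary tree of adders, giving a tree depth of about log2(M). The function names below are illustrative and not taken from the cited designs; M is assumed to be at least 1.

```python
def adder_tree_sum(values):
    # Pairwise (binary-tree) reduction, mirroring a hardware adder tree.
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:           # odd element passes through to next level
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

def k_parallel_distances(x, C):
    # K distance units operate concurrently in hardware; modeled serially here.
    return [adder_tree_sum(abs(a - b) for a, b in zip(x, c)) for c in C]
```

Each `while` iteration corresponds to one level of the adder tree, which is why pipelining the levels, as in [11], shortens the critical path.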

#### 3. Similarity Distance Computation

Given a dataset *X* of *N* samples and a dataset *Y* of *K* samples, with each sample in the two datasets having *M* features, a similarity measure such as Manhattan, Euclidean, or Cosine distance [1, 13] can be used to generate a distance matrix *D* of *N* × *K* elements. The distance between the *n*th sample of dataset *X* and the *k*th sample of dataset *Y* is represented by the value of element *D*(*n*, *k*) of matrix *D*. In this work, the calculation of the similarity distance matrix, using Manhattan distance between data samples of the two datasets *X* and *Y*, is used to illustrate the introduced concepts and methodologies. Manhattan distance can be expressed as follows:

*D*(*n*, *k*) = ∑_{m=1}^{M} |*X*(*n*, *m*) − *Y*(*k*, *m*)|,  (1)

where *N* and *K* are the number of samples of datasets *X* and *Y*, respectively, and *M* is the dimensionality (number of features) of the two datasets. The emphasis of this paper is on the parallelization of similarity distance computation rather than the similarity measure used. Hence, the work presented in this paper can be generalized to other similarity measures.
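For reference, a direct sequential computation of the Manhattan distance matrix described above can be sketched as follows; the variable names and row-per-sample layout are illustrative assumptions, not the paper's hardware design.

```python
def manhattan_distance_matrix(X, Y):
    # X: N samples, Y: K samples, each with M features (lists of lists).
    N, M = len(X), len(X[0])
    K = len(Y)
    D = [[0] * K for _ in range(N)]
    for n in range(N):
        for k in range(K):
            # One element of D: sum of M absolute feature differences.
            D[n][k] = sum(abs(X[n][m] - Y[k][m]) for m in range(M))
    return D
```

The three nested loops over n, k, and m perform N · K · M elementary operations; the rest of the paper is concerned with scheduling and mapping these operations onto processor arrays.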

Similarity distance computation in the K-means clustering algorithm [2], for instance, is performed in the same way as described in this section. Distances between the *N* samples of the input dataset and the centroids of the *K* clusters are calculated in order to identify the closest cluster for each data sample.
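As a small illustration of how these distances are used in K-means, the following hypothetical helper assigns each sample to its nearest centroid under Manhattan distance (the function name is an assumption for this sketch):

```python
def closest_centroids(X, C):
    # For each sample, pick the index of the nearest of the K centroids
    # under Manhattan distance; ties go to the lowest index.
    labels = []
    for x in X:
        dists = [sum(abs(a - b) for a, b in zip(x, c)) for c in C]
        labels.append(dists.index(min(dists)))
    return labels
```
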

#### 4. Parallelizing the Computation of Similarity Distance

In our recent work [12], we systematically explored the design space of 2-D processor array architectures for similarity distance computation using the methodology proposed by Gebali for designing systolic arrays for digital filters [14]. In this work, we focus on extending the methodology to explore the design space of area-efficient linear processor arrays for the computation of similarity distance matrices. For more details on the employed methodology, refer to [15].

##### 4.1. Computation Domain

As shown in Figure 1, the computation domain of Manhattan distance (1) is defined by the algorithm indices *k*, *m*, and *n*. Every point in the computation domain has three coordinates, represented as follows: