Abstract

Many highly parallel algorithms usually generate large volumes of data containing both valid and invalid elements, and high-performance solutions to the stream compaction problem prove extremely important in such scenarios. Although parallel stream compaction has been extensively studied on GPU-based platforms and, more recently, on the Intel Xeon Phi platform, no study has yet considered its parallelization on a low-cost computing cluster, even though general-purpose single-board computers are gaining popularity in the scientific community due to their high performance per dollar and per watt. In this work, we consider the case of an extremely low-cost cluster composed of four Odroid C2 single-board computers (SBCs), showing that stream compaction can also benefit from this kind of platform, with important speedups being obtained. To do so, we derive two parallel implementations of the stream compaction problem using MPI. Then, we evaluate them considering varying numbers of processes and/or SBCs, as well as different input sizes. In general, we see that unless the number of elements in the stream is too small, the best results are obtained when eight MPI processes are distributed among the four SBCs that make up the cluster. To add value to the obtained results, we also consider the execution of the two parallel implementations of the stream compaction problem on a very high-performance but power-hungry 18-core Intel Xeon E5-2695 v4 multicore processor, finding that the Odroid C2 SBC cluster constitutes a much more efficient alternative when both the resulting execution time and the required energy are taken into account. Finally, we also implement and evaluate a parallel version of the stream split problem, which stores the invalid elements after the valid ones. Our implementation shows good scalability on the Odroid C2 SBC cluster and a more balanced computation/communication ratio than the stream compaction problem.

1. Introduction

Continuous improvements in the technologies used to build computers have recently made possible the fabrication of extremely low-cost general-purpose single-board computing devices. Nowadays, one can buy one of these tiny computers for a few dollars and make it run the Windows 10 or Ubuntu Linux operating systems [1, 2]. Among the variety of vendors providing these single-board computers (SBCs), perhaps the most renowned ones are Raspberry Pi and Odroid. Although the initial aim of these devices was to promote the teaching of basic computer science in schools [3, 4] and developing countries [5–7], the recent appearance of single-board computers with multicore ARM CPU chips and several gigabytes of main memory also provides a desirable hardware platform for the project-based learning paradigm in computer science and engineering education [8–11] and has attracted the interest of a multitude of projects trying to take advantage of their very attractive cost/performance ratio (e.g., for scientific computing [12–14]), in contrast with other energy-efficient but more expensive alternatives [15].

Whereas Raspberry Pi SBCs seem to have put the focus more on a “stand-alone” scenario, Odroid devices provide a higher processor frequency, more main memory, and higher-bandwidth Ethernet capabilities. In particular, the Raspberry Pi 3 Model B, launched in February 2016, features a 1.2 GHz 4-core ARM Cortex-A53 CPU chip, 1 GB of main memory, and a 10/100 Ethernet port; compared with its predecessor, the Raspberry Pi 2 Model B released in February 2015, it adds wireless connectivity (2.4 GHz 802.11n Wi-Fi and Bluetooth 4.1). In contrast, the Odroid C2 sacrifices wireless connectivity in favor of a higher clock frequency (1.5 GHz 4-core ARM Cortex-A53 CPU chip), larger main memory (2 GB), and a Gigabit Ethernet connection. These characteristics make the Odroid C2 more appropriate for building high-performance, low-cost clusters able to meet the demands of some scientific applications.

On the other hand, a common characteristic found in many highly parallel algorithms is that they usually generate large volumes of data containing both valid and invalid elements. In these scenarios, high-performance solutions to the data reduction problem are extremely important. Stream compaction (also known as stream reduction) has been proposed to “compact” an input stream mixed with both valid and invalid elements into a subset with only the valid elements [16]. Thus, stream compaction is found in many applications, ranging from data mining and machine learning (to prune invalid nodes after each parallel breadth-first tree traversal step [17]) to deferred shading (to obtain the subset of pixels whose rays intersect, which allows for better workload balancing among the participating threads [18, 19]), and, more specifically, to speed up dosimetric computations for radiotherapy using Monte Carlo methods (by compacting the computations on photons that take longer than others [20]) and during the voxelization of surfaces and solids [21].

Formally, given a list of elements belonging to the set I and a predicate function F that maps each element to true (valid) or false (invalid), stream compaction divides I into valid and invalid elements (those that satisfy the predicate F and those that do not) and keeps the relative order of all the valid elements in the output (O) [18]. As shown in Algorithm 1, the serial stream compaction of I under the predicate function F is O = [x ∈ I | F(x)], with the elements of O appearing in the same order as in I. Therefore, the output O simply contains all valid elements copied from the input I. An example of the execution of Algorithm 1 can be observed in Figure 1. The list of input elements is composed of numbers between 0 and 4. The serial stream compaction selects all elements that are not zero (assuming that zero represents the invalid value), based on the predicate function F, as shown in the lower part of Figure 1. Although Algorithm 1 is simple, its parallelization is not trivial because the output position of each valid element cannot be obtained until all its preceding valid elements have been discovered [22].

Input: Vector I of length n
Input: Predicate function F
Output: Vector O of valid elements
Output: nvalid: the number of valid elements
(1) nvalid ← 0
(2) for i ← 0 to n − 1 do
(3)   if F(I[i]) then
(4)     O[nvalid] ← I[i]; nvalid ← nvalid + 1
(5)   end if
(6) end for
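For concreteness, the following C routine is a minimal sketch of Algorithm 1. The predicate, which marks nonzero elements as valid as in the example of Figure 1, and the function names are illustrative choices rather than the paper's actual code.

#include <stddef.h>

/* Returns nonzero for valid elements; here, "valid" means nonzero,
 * as in the example of Figure 1 (this predicate is an assumption). */
static int predicate(int x)
{
    return x != 0;
}

/* Serial stream compaction (Algorithm 1): copies the valid elements of
 * I[0..n-1] into O, preserving their relative order, and returns their count. */
static size_t stream_compact(const int *I, size_t n, int *O)
{
    size_t nvalid = 0;
    for (size_t i = 0; i < n; i++) {
        if (predicate(I[i])) {
            O[nvalid++] = I[i];
        }
    }
    return nvalid;
}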

Parallel stream compaction has been extensively studied on GPU-based platforms [16, 18, 22–25], and, more recently, parallel implementations for the Intel Xeon Phi processor have also been proposed [26]. In this work, we consider the case of an extremely low-cost cluster composed of four Odroid C2 single-board computers (SBCs), showing that stream compaction can also benefit from this kind of platform, with important speedups being obtained. To do so, we derive two parallel implementations of the stream compaction problem using MPI. Then, we evaluate them considering varying numbers of processes and/or SBCs, as well as different input sizes. In general, we see that unless the number of elements in the stream is too small, the best results are obtained when 8 MPI processes are distributed among the 4 SBCs that make up the cluster.

This manuscript extends a preliminary version of this work [27] by making the following two important additional contributions:
(i) To highlight the importance of our study, we also consider the execution of the two parallel implementations of the stream compaction problem on a very high-performance but power-hungry 18-core Intel Xeon E5-2695 v4 processor. Overall, the obtained results show that the Odroid C2 SBC cluster constitutes an appealing alternative to a traditional high-end multicore processor in those contexts in which both low-cost and energy efficiency requirements are present.
(ii) We derive a parallel version of the stream split problem, which appends the invalid elements to the output stream after the valid ones. We evaluate it on the Odroid C2 SBC cluster, observing good results in terms of scalability that lead to important speedups, as well as a better balance between computation and communication requirements than in the stream compaction problem.

The rest of the paper is organized as follows. The parallelization strategies that we have implemented and evaluated for the stream compaction problem are explained in Section 2. In Section 3, we give the details of the cluster of Odroid C2 SBCs used for the evaluation and then present the experimental results. The parallelization of the stream split problem and its results on the Odroid C2 SBC cluster are presented in Section 4. Finally, Section 5 draws some important conclusions obtained from this work.

2. Parallel Stream Compaction on a Cluster of Odroid C2 SBCs

In this section, we present the two parallelization strategies that we have considered in this work. In both cases, we have implemented them using MPI [28].

2.1. Parallel Stream Compaction

We have based the parallel stream compaction scheme shown in Algorithm 2 on the implementation proposed in the Thrust library [29]. A vector of a particular length, the predicate function, the number of processes, and the pid of each process are the inputs. We have divided Algorithm 2 into four phases, namely, the Validation phase (lines 4–8), the Scan phase (lines 9–12), the Communication phase (lines 13–21), and the Scatter phase (lines 22–26). During the Validation phase, the input vector (I) is examined in parallel and, taking into consideration the predicate function, each process annotates the validity of each of its assigned elements in array A (where 1 represents a valid element and 0 an invalid one). The parallel Scan phase needs an additional array (S) to compute the so-called exclusive prefix-sum [30–32], where each element is the sum of all its preceding elements, excluding itself. In this way, each process obtains in parallel the number of valid elements (nvalid) in its portion of the stream. Following this, in the Communication phase, each process, identified by its pid, sends the number of valid elements that it has found to all the processes with higher pids. All the processes, except the first one, receive the numbers of valid elements and compute the position (pos) of the first of their valid elements. Finally, during the Scatter phase, based on the A and S arrays, all valid elements are transferred from the input array to the output one (I and O, resp.), preserving the order in which these elements appear in the input array.

Input: Vector I of length n
Input: Predicate function F
Input: Number of processes p
Input: pid of process
Output: Vector O of valid elements
Output: nvalid: the number of valid elements
Output: pos: position to write
(1) size ← n/p
(2) start ← pid × size
(3) pos ← 0 // the A and S arrays are initialized to 0
(4) for i ← start to start + size − 1 in parallel do
(5)   if F(I[i]) then
(6)     A[i] ← 1
(7)   end if
(8) end for
(9) for i ← start + 1 to start + size − 1 in parallel do
(10)   S[i] ← S[i − 1] + A[i − 1]
(11) end for
(12) nvalid ← S[start + size − 1] + A[start + size − 1]
(13) for j ← pid + 1 to p − 1 in parallel do
(14)   Send nvalid to process j
(15) end for
(16) if pid ≠ 0 then
(17)   for j ← 0 to pid − 1 in parallel do
(18)     Receive nvalid_j from process j
(19)     pos ← pos + nvalid_j
(20)   end for
(21) end if
(22) for i ← start to start + size − 1 in parallel do
(23)   if A[i] = 1 then
(24)     O[pos + S[i]] ← I[i]
(25)   end if
(26) end for

Figure 2 shows an example of an execution with four MPI processes for a list of input elements composed of numbers ranging between 0 and 4. In this case, the predicate function F selects all elements that are not zero. Now, the input vector of 16 positions is divided among the four MPI processes (P0, P1, P2, and P3). All the processes carry out the Validation and Scan phases in parallel. The position (pos) computed by each process is shown below the corresponding vector. Finally, the output O is built taking into account the A and S vectors, as well as the pos values previously computed.
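As a concrete illustration, the following C/MPI fragment sketches the four phases of Algorithm 2 from the point of view of one process. It is a minimal sketch, not our actual implementation: it assumes that n is divisible by p, that every process holds the whole input vector I, and that the packed valid elements are later placed into the global output O starting at position pos (e.g., gathered on one node); all identifiers are illustrative.

#include <mpi.h>
#include <stdlib.h>

static int F(int x) { return x != 0; }   /* predicate: nonzero elements are valid */

/* One process's view of Algorithm 2. 'packed' must have room for at least
 * n/p elements. Returns the number of local valid elements; *pos_out receives
 * the global output position of the first of them. */
int compact_local(const int *I, int n, int *packed, int *pos_out)
{
    int p, pid;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    int size  = n / p;                      /* assumes n is divisible by p */
    int start = pid * size;
    int *A = calloc(size, sizeof(int));     /* validity flags              */
    int *S = calloc(size, sizeof(int));     /* exclusive prefix-sum of A   */

    /* Validation phase */
    for (int i = 0; i < size; i++)
        A[i] = F(I[start + i]) ? 1 : 0;

    /* Scan phase: exclusive prefix-sum over the local segment */
    for (int i = 1; i < size; i++)
        S[i] = S[i - 1] + A[i - 1];
    int nvalid = S[size - 1] + A[size - 1];

    /* Communication phase: send the local count to the higher pids and
     * accumulate the counts received from the lower pids into pos */
    for (int j = pid + 1; j < p; j++)
        MPI_Send(&nvalid, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
    int pos = 0;
    for (int j = 0; j < pid; j++) {
        int c;
        MPI_Recv(&c, 1, MPI_INT, j, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        pos += c;
    }

    /* Scatter phase: pack the local valid elements in order; they belong
     * at O[pos .. pos + nvalid - 1] of the global output */
    for (int i = 0; i < size; i++)
        if (A[i]) packed[S[i]] = I[start + i];

    free(A); free(S);
    *pos_out = pos;
    return nvalid;
}

A final step (for instance, an MPI_Gatherv of the packed segments) can then assemble the compacted stream on one node; that step is omitted from the sketch.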

2.2. Parallel Work-Efficient Stream Compaction

A work-efficient stream compaction algorithm aimed at improving the computational complexity of the parallel stream compaction shown in Algorithm 2 is presented in [26]. Again using MPI, we have developed a parallel version of this work-efficient stream compaction, shown in Algorithm 3. Now, during the Validation phase (lines 5–10), each process saves the validity of each element in the array A and stores the number of valid elements in the vector V. Therefore, the additional array of integers (S) needed in Algorithm 2 is no longer necessary. In the Communication phase (lines 11–26), all processes except the first one send their number of valid elements to the first process (that with pid 0), which executes the inclusive prefix-sum on vector V [31], where each element is the sum of all its preceding elements, including itself. Then, each process is sent back the corresponding value of V, that is, the number of valid elements found by all the processes with lower pids, which it will use as a shift. Following this, each process executes the Scan phase (lines 27–30) on its own segment independently, based on the shifting value received previously. Finally, in the Scatter phase (lines 31–35), the validity of each element is rechecked by evaluating two consecutive positions of the array A, and the output array (O) is obtained with the valid elements from the input array (I).

Input: Vector I of length n
Input: Predicate function F
Input: Number of processes p
Input: pid of process
Output: Vector O of valid elements
Output: nvalid: the number of valid elements
(1) size ← n/p
(2) start ← pid × size
(3) shift ← 0 // the A array is initialized to 0
(4) V[pid] ← 0
(5) for i ← start to start + size − 1 in parallel do
(6)   if F(I[i]) then
(7)     A[i] ← 1
(8)     V[pid] ← V[pid] + 1
(9)   end if
(10) end for
(11) if pid ≠ 0 then
(12)   Send V[pid] to process pid 0
(13) end if
(14) if pid = 0 then
(15)   for i ← 1 to p − 1 do
(16)     Receive V[i] from process pid i
(17)     V[i] ← V[i] + V[i − 1]
(18)   end for
(19)   for i ← 1 to p − 1 do
(20)     Send V[i − 1] to process pid i
(21)   end for
(22)   nvalid ← V[p − 1]
(23) end if
(24) if pid ≠ 0 then
(25)   Receive shift
(26) end if
(27) A[start] ← A[start] + shift
(28) for i ← start + 1 to start + size − 1 in parallel do
(29)   A[i] ← A[i] + A[i − 1]
(30) end for
(31) for i ← start to start + size − 1 in parallel do
(32)   if A[i] ≠ A[i − 1] then // A[start − 1] is taken as shift
(33)     O[A[i] − 1] ← I[i]
(34)   end if
(35) end for

Figure 3 illustrates an example for a list of elements ranging between 0 and 4 and a predicate function F that selects all elements that are not zero, for an execution with four MPI processes. As in the previous example, 4 input elements are assigned to each MPI process, and the Validation phase is applied, producing directly the validity of each element in vector A together with the number of valid elements that each process finds. The latter is stored in vector V. Then, process P0 executes the inclusive prefix-sum on vector V and sends the result back to the rest of the processes, as indicated by the arrows in Figure 3. Finally, each process enters the Scan and Scatter phases taking into account the corresponding shifting value calculated by P0.
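To make the Communication phase of Algorithm 3 concrete, the following C/MPI sketch shows how process 0 can collect the per-process counts with point-to-point messages, compute the inclusive prefix-sum on vector V, and return to every other process the shift that its local Scan phase needs. It is a minimal sketch under the same assumptions as before, with illustrative names and no error handling.

#include <mpi.h>
#include <stdlib.h>

/* Communication phase of Algorithm 3 (lines 11-26): returns the shift that
 * this process must add to its local inclusive scan. On process 0, *nvalid
 * additionally receives the total number of valid elements. */
int exchange_counts(int local_count, int p, int pid, int *nvalid)
{
    int shift = 0;
    if (pid != 0) {
        MPI_Send(&local_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(&shift, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        int *V = malloc(p * sizeof(int));
        V[0] = local_count;
        for (int i = 1; i < p; i++) {
            MPI_Recv(&V[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            V[i] += V[i - 1];                 /* inclusive prefix-sum on V */
        }
        for (int i = 1; i < p; i++)
            MPI_Send(&V[i - 1], 1, MPI_INT, i, 1, MPI_COMM_WORLD);
        *nvalid = V[p - 1];
        free(V);
    }
    return shift;
}

The same exchange could also be expressed with collectives (e.g., MPI_Gather followed by MPI_Scatter, or MPI_Exscan), but the point-to-point form above follows the structure of Algorithm 3 more closely.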

3. Experiments

We have built a cluster composed of four Odroid C2 nodes. Each node contains a 1.5 GHz quad-core 64-bit ARM Cortex-A53 CPU and 2 GB of RAM. All the nodes are interconnected through a Gigabit Ethernet switch. The operating system installed on each node is Ubuntu 16.04.02 LTS. On this cluster, we have installed MPICH (v3.2) as the MPI library implementation.
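As a quick sanity check of how mpiexec actually places the MPI processes on the boards, a trivial program such as the following can be launched with the desired number of processes and host configuration. This is a convenience sketch, not part of the measured benchmarks.

#include <mpi.h>
#include <stdio.h>

/* Prints which node hosts each MPI rank, which is useful to verify the
 * process placements (e.g., 8 processes over 4 boards) discussed below. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d of %d running on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}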

We have executed and measured the two parallelization strategies for stream compaction presented in Section 2 on this cluster. The baseline for all the comparisons is the sequential version of Algorithm 2 without the Communication phase. Moreover, we have configured different parallel execution scenarios for the two parallel versions of the stream compaction problem explained before. We consider parallel executions with 2, 4, 8, and 16 MPI processes, running on the same Odroid C2 board or on different boards (up to 4). We have chosen several input data sizes for our tests. In particular, we consider input arrays with 1M, 8M, 32M, and 64M integer elements ranging between 0 and 4. The predicate function in all cases marks as valid all numbers that are not zero. The 64M input set is the largest configuration that we could run given the 2 GB memory limit of the Odroid C2 SBC.
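For reference, synthetic inputs of this kind can be reproduced with a few lines of C like the ones below; the seeding scheme, the use of rand(), and the timing helper are illustrative choices rather than the exact ones used in our experiments.

#include <mpi.h>
#include <stdlib.h>

/* Builds an input stream of n integers in [0, 4]; zero marks an invalid
 * element, matching the predicate used throughout the paper. */
int *make_input(size_t n, unsigned seed)
{
    int *I = malloc(n * sizeof(int));
    srand(seed);
    for (size_t i = 0; i < n; i++)
        I[i] = rand() % 5;
    return I;
}

/* Times one run of a compaction kernel in milliseconds using MPI_Wtime,
 * with barriers so that all processes measure the same interval. */
double time_ms(void (*kernel)(void))
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    kernel();
    MPI_Barrier(MPI_COMM_WORLD);
    return (MPI_Wtime() - t0) * 1e3;
}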

3.1. Execution Time Results

Figures 4(a), 4(b), 5(a), and 5(b) show the execution times (in milliseconds) that are observed for input data sizes of 1M, 8M, 32M, and 64M elements, respectively. For all these figures, from left to right, we first present the result obtained for the sequential version (Sequential), and then we show the results for the parallel stream compaction (Compaction) and parallel work-efficient stream compaction (Compaction-Shifted) parallelization strategies, respectively. For each one of them, we consider 2, 4, and 8 MPI processes running on one Odroid C2 board (2P-1C2, 4P-1C2, and 8P-1C2, resp.) and on 2 Odroid C2 boards, having 1, 2, and 4 MPI processes per board in each case (2P-2C2, 4P-2C2, and 8P-2C2, resp.), and finally 2, 4, 8, and 16 MPI processes running on 4 Odroid C2 boards, having 1, 2, or 4 processes per board as appropriate (2P-4C2, 4P-4C2, 8P-4C2, and 16P-4C2, resp.).

From Figure 4(a), we can see that the two proposed parallelization strategies for the stream compaction problem obtain noticeable speedups with respect to the sequential version when they are executed on a single Odroid C2 board with 2 or 4 MPI processes. However, the executions on different Odroid C2 boards show negative outcomes from the performance point of view when the size of the input array is so small (1M elements). The difference is that, in the first case, all communications take place on the same board and can therefore be performed with low latency, whereas, when communications involve several Odroid C2 boards, the time required for communication is not compensated by the small processing time needed to compact such a small number of elements (the communication time constitutes between 65% and 92% of the execution time). Moreover, the executions on a single Odroid C2 SBC with 8 MPI processes (2 MPI processes per core) also result in slowdowns, revealing (as expected) that a configuration with more than one MPI process per core increases communication, which represents up to 58% of the total time, and slows down computation.

Taking a closer look at the results for one Odroid C2 board and the 1M input size, we see that the speedups of the parallel stream compaction strategy for 2 and 4 MPI processes are 1.68 and 1.64, respectively. Similarly, the work-efficient stream compaction parallelization strategy obtains speedups of 2.25 and 1.63 for 2 and 4 MPI processes, respectively. Therefore, in both cases, fewer MPI processes, and hence less communication among processes, bring the best results. These two parallel versions do not scale due to the small computation/communication ratio that they exhibit (approximately 4 and 2 for 2 and 4 processes, respectively, for both proposals), which decreases as the number of processes grows.

In general, from Figures 4(b), 5(a), and 5(b), we can see that as the input data size increases, so do the speedups obtained by the two parallelization strategies analyzed in this work when more cores are involved. The exception is the configuration with 16 processes running on 4 Odroid C2 boards (4 processes per board), which reaches lower speedups than that with 8 processes running on 4 Odroid C2 boards (2 processes per board).

More specifically, Figure 4(b) shows the results for 8M elements. In this case, the two proposed parallelization strategies obtain significant speedups with respect to the sequential version when executed on a single Odroid C2 board with 2, 4, or 8 MPI processes. Additionally, the scalability is good for 2 and 4 MPI processes, with speedups of 1.69 and 2.06 for the parallel stream compaction strategy and 2.14 and 2.74 for the parallel work-efficient stream compaction approach. Therefore, for medium input data sizes, the computation/communication ratio is appropriate (approximately 100 and 7 for 2 and 4 processes). Although the two parallelization strategies also achieve gains for the configuration with 2 processes per core on a single Odroid C2 (8P-1C2), with speedups of 1.44 and 1.31, respectively, these speedups are (as expected) lower than those of the 4P-1C2 case. Having twice as many MPI processes as available cores introduces extra scheduling overhead and causes worse use of core resources (such as caches). On the other hand, the executions on different Odroid C2 SBCs (except for 8P-4C2 and 16P-4C2) present important speedups and good scalability for 2, 4, and 8 MPI processes for the two proposed parallelization strategies. Thus, increasing the number of processes per Odroid C2 keeps the cluster operating efficiently, and the communication latency among the different boards does not drag down performance. It is in the 8P-4C2 case where the performance differences between the two parallelization strategies start to appear. Whereas the more efficient strategy (namely, Compaction-Shifted) achieves its highest speedup for this configuration, the other one cannot improve on the results reached by 4P-4C2, demonstrating its more limited scalability for medium-sized workloads. Finally, the large number of processes involved in 16P-4C2 results in excessively small computation/communication ratios, which is the reason for the negative outcomes observed in both cases (the fraction of time due to communication reaches 87%).

As we can observe in Figures 5(a) and 5(b), with larger input data sizes the two parallel stream compaction strategies obtain significant gains in all the configurations. For both input data sizes, Compaction and Compaction-Shifted obtain speedups close to those observed for the 8M-element case when executed on a single Odroid C2 SBC with 2, 4, or 8 MPI processes. However, the resulting speedups become even more important as the number of involved cores grows. Moreover, they scale nicely for 2, 4, and 8 MPI processes, achieving their highest values for 8 MPI processes running on 4 Odroid C2 SBCs (5.10 and 5.06 for the parallel stream compaction with input data sizes of 32M and 64M, resp., and 6.96 and 7.04 for the parallel work-efficient stream compaction with input data sizes of 32M and 64M, resp.). It is also worth noting that, even for these large input sizes, the results reached for the 16P-4C2 configuration are worse than those of 8P-4C2 in both cases, although the differences between them become narrower as the input data size increases.

3.2. Energy Efficiency Results

To give readers a more complete view that can help put our results in context, we also consider the case of executing the parallel stream compaction and parallel work-efficient stream compaction approaches described in Sections 2.1 and 2.2, respectively, on a conventional high-performance multicore processor. In particular, we have considered a state-of-the-art Intel Xeon E5-2695 v4 multicore processor running at 2.10 GHz. This Intel Xeon processor has 18 cores, its price is approximately 8 times that of the complete cluster, and it is installed in a dual-socket configuration.

The comparison between the 4-board Odroid C2 cluster and the Intel Xeon is done by taking into consideration the execution times of each version on every platform and the reported thermal design power (TDP) in each case (16 W for the Odroid C2 and 120 W for the Intel Xeon processor). We have measured the energy consumption using RAPL [33] on the Intel Xeon processor. Although for the 1M input data size the measured power is lower than 120 W, this TDP is exceeded for the rest of the input data sizes. Therefore, we have used the TDP as an average power figure to estimate energy consumption.
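In other words, the energy of each run is approximated as the product of the platform TDP and the measured execution time. As a purely illustrative example with a hypothetical 2-second run, this approximation charges 16 W × 2 s = 32 J to the Odroid C2 platform and 120 W × 2 s = 240 J to the Intel Xeon.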

Figures 6 and 7 show the total energy consumption (in joules) for parallel stream compaction and parallel work-efficient stream compaction, respectively. Again, results for input data sizes of 1M, 8M, 32M, and 64M elements are reported. In both figures, we show the results for 2, 4, 8, and 16 MPI processes running on 4 Odroid C2 boards, having 1, 2, or 4 processes per board as appropriate (2P-OC2, 4P-OC2, 8P-OC2, and 16P-OC2, resp.), and running on the Intel Xeon using 2, 4, 8, and 16 cores (2P-Xeon, 4P-Xeon, 8P-Xeon, and 16P-Xeon, resp.).

In these figures, we can see that the trend observed for the low-cost SBC cluster does not hold in the case of the Intel Xeon, where the best results are obtained with 16 cores. The fact that, in this case, all communications occur on the same chip significantly reduces the overhead of involving a larger number of cores in the computation. This is also evidenced by the fact that speedups are obtained even for the small problem sizes. However, even though the computational power of the Intel Xeon is much greater than that of the Odroid C2 cluster, the very large TDP of the Intel Xeon drags down its results when energy efficiency is also considered. In particular, the best results for the Odroid C2 cluster (obtained when 2 processes run on 4 boards) clearly outperform those achieved when 16 processes are executed using 2 Intel Xeon chips, demonstrating that the Odroid C2 SBC cluster constitutes an appealing alternative to a traditional high-end multicore processor in those contexts in which both low-cost and energy efficiency requirements are present.

4. Extension to Stream Split

There are some applications, for example, radix sort [34] or random forest-based data classifiers [35], in which the invalid elements need to be appended to the end of the output stream, after the valid elements. This is the so-called stream split problem. In this work, we have also developed a parallel solution to the stream split problem, which we present in Algorithm 4. The stream split algorithm is very similar to the parallel work-efficient stream compaction version presented in Algorithm 3. The main differences are that now the first process must send the number of valid elements to all the processes with higher pids (line 23), that the other processes receive this number of valid elements (line 27), and that, if an element is invalid, it is stored after the valid elements at an offset given by i − A[i] (the number of invalid elements that precede it), as we can observe in lines 36 and 37.

Input: Vector I of length n
Input: Predicate function F
Input: Number of processes p
Input: pid of process
Output: Vector O with the valid elements followed by the invalid ones
Output: nvalid: the number of valid elements
(1) size ← n/p
(2) start ← pid × size
(3) shift ← 0 // the A array is initialized to 0
(4) V[pid] ← 0
(5) for i ← start to start + size − 1 in parallel do
(6)   if F(I[i]) then
(7)     A[i] ← 1
(8)     V[pid] ← V[pid] + 1
(9)   end if
(10) end for
(11) if pid ≠ 0 then
(12)   Send V[pid] to process pid 0
(13) end if
(14) if pid = 0 then
(15)   for i ← 1 to p − 1 do
(16)     Receive V[i] from process pid i
(17)     V[i] ← V[i] + V[i − 1]
(18)   end for
(19)   for i ← 1 to p − 1 do
(20)     Send V[i − 1] to process pid i
(21)   end for
(22)   nvalid ← V[p − 1]
(23)   Send nvalid to all processes
(24) end if
(25) if pid ≠ 0 then
(26)   Receive shift
(27)   Receive nvalid
(28) end if
(29) A[start] ← A[start] + shift
(30) for i ← start + 1 to start + size − 1 in parallel do
(31)   A[i] ← A[i] + A[i − 1]
(32) end for
(33) for i ← start to start + size − 1 in parallel do
(34)   if A[i] ≠ A[i − 1] then // A[start − 1] is taken as shift
(35)     O[A[i] − 1] ← I[i]
(36)   else
(37)     O[nvalid + i − A[i]] ← I[i]
(38)   end if
(39) end for
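The only step that changes substantially with respect to Algorithm 3 is the final scatter. A minimal C sketch of that step is shown below; for clarity it assumes that A already holds the shifted inclusive scan and is addressed with global indices, and all names are illustrative.

/* Scatter phase of the stream split (lines 33-39 of Algorithm 4). A[i] is the
 * number of valid elements up to and including position i (shift included),
 * so, for an invalid element at position i, i - A[i] is the number of invalid
 * elements that precede it. */
void split_scatter(const int *I, const int *A, int *O,
                   int start, int end, int nvalid, int shift)
{
    for (int i = start; i <= end; i++) {
        int prev = (i == start) ? shift : A[i - 1];
        if (A[i] != prev)
            O[A[i] - 1] = I[i];              /* valid: compacted at the front */
        else
            O[nvalid + i - A[i]] = I[i];     /* invalid: appended at the back */
    }
}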

Figures 8(a), 8(b), 9(a), and 9(b) show the execution times (in milliseconds) observed for input data sizes of 1M, 8M, 32M, and 64M elements, respectively. For all these figures, from left to right, we first present the results obtained for the sequential version (Sequential), and then we show the results for the parallel stream split (Split). As before, we consider 2, 4, and 8 MPI processes running on one Odroid C2 board (2P-1C2, 4P-1C2, and 8P-1C2, resp.) and on 2 Odroid C2 boards, having 1, 2, and 4 MPI processes per board in each case (2P-2C2, 4P-2C2, and 8P-2C2, resp.), and finally 2, 4, 8, and 16 MPI processes running on 4 Odroid C2 boards, having 1, 2, or 4 processes per board as appropriate (2P-4C2, 4P-4C2, 8P-4C2, and 16P-4C2, resp.).

In general, from Figures 8(a), 8(b), 9(a), and 9(b), we can see that the trend observed for the different input sizes is very similar to that already explained for the Compaction and Compaction-Shifted proposals, except for the configuration with 16 processes running on 4 Odroid C2 boards (4 processes per board) and input sizes from 8M to 64M elements, which reaches higher speedups than the rest of the configurations and than those observed for the two previous approaches. Speedups of 4.74, 6.66, and 8.36 with respect to the sequential version are achieved for 8M, 32M, and 64M elements, respectively. Now, the increase in computation due to the storage of the invalid elements compensates for the communication requirements, and significant speedups are obtained.

More specifically, the scalability is good for 2, 4, and 8 MPI processes running on 2 Odroid C2 boards, with speedups of 1.52, 1.75, and 2.08 for the parallel stream split approach with 8M elements. Therefore, for medium input data sizes, the computation/communication ratio is appropriate. In the same way, the scalability is also suitable for larger input data sizes: for example, for 64M elements, the speedups achieved are 2.25, 3.63, and 4.52 for 2, 4, and 8 MPI processes executing on 2 Odroid C2 boards. Moreover, the scalability is good for 2, 4, 8, and 16 MPI processes running on 4 Odroid C2 boards, with speedups of 1.47, 2.76, 3.47, and 4.74 for the medium input size, while the gains for the large input sizes are very similar, reaching, for instance, 2.25, 4.39, 7.02, and 8.36 for 64M elements.

5. Conclusions

In this work, we have studied the parallelization of the stream compaction problem on a low-cost cluster of single-board computers. In particular, we have built the low-cost cluster from 4 Odroid C2 SBCs interconnected using a typical Gigabit Ethernet switch. We have implemented two parallel versions of the stream compaction problem using MPI and evaluated them considering varying numbers of processes and/or SBCs, as well as different input sizes. In general, we see that when the number of elements in the stream is too small, the most important benefits are observed when all participating processes run on the same Odroid board. In this case, the low computation/communication ratio for a small number of input elements cannot make up for the overhead entailed by inter-SBC communication. As the number of elements in the input stream increases, so does the number of processes that can profitably participate in parallel executions, and important speedups are reached. Overall, the best results are reached when eight MPI processes are distributed among the four SBCs that make up the cluster. In this case, speedups of 5.10 and 7.04 are obtained for the Compaction and Compaction-Shifted strategies, respectively, for the largest problem size considered in this work (an input data size of 64M elements). Moreover, to add value to the obtained results, we have also considered the execution of the two parallel implementations of the stream compaction problem on a very high-performance but power-hungry 18-core Intel Xeon E5-2695 v4 multicore processor, finding that the Odroid C2 SBC cluster constitutes a much more efficient alternative when both the resulting execution time and the required energy are taken into account. Finally, a parallelization of the stream split problem has been implemented and evaluated on the Odroid C2 SBC cluster. In this case, for input data sizes starting from 8M elements, important speedups are achieved and the computation/communication ratio is more balanced due to the storage of the invalid elements. In summary, the best results are obtained for the configuration with 16 MPI processes running on 4 Odroid C2 boards, where speedups of 6.66 and 8.36 are reached for input data sizes of 32M and 64M elements, respectively.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Spanish MINECO and by European Commission FEDER funds, under Grant TIN2015-66972-C5-3-R.