Abstract

On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. Homogenous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications have parallel data subsets that are handled individually by the same program running in different cores. Therefore, data-parallel applications are able to obtain good speedup on homogenous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. When establishing the speedup performance model, the network communication latency and the ways of storing data of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the ways of storing data. Several practical suggestions emerge from the analysis of the performance model. Finally, three data-parallel applications are performed on our cycle-accurate homogenous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogenous OLPCs.

1. Introduction and Motivation

As technology advances, on-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which integrate a number of computing cores that run in parallel and adopt an on-chip network that provides concurrent pipelined communication. These many-core network-based systems are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. OLPCs can be highly homogeneous or irregular and heterogeneous. A homogenous OLPC features strong regularity and scalability, since all of its processor cores and routers are identical and each processor core has the same computation capability. As one way of parallel processing, data parallelism partitions data into several blocks that are mapped to different processors, and the processors work in SPMD (Single Program Multiple Data) mode; that is, they handle their own data blocks by running the same program. Data-parallel applications have a parallel data set that can be partitioned into data subsets, each of which can be handled individually by the same program with marginal synchronization overhead, so they scale well and can exploit the potential of multiple computing cores. Therefore, homogenous OLPCs and data-parallel applications match each other well, and data-parallel applications are able to obtain good speedup on homogenous OLPCs. The focus of the paper is hence to provide a workable way to estimate and evaluate the performance of homogenous OLPCs running data-parallel applications.

Scalability is one of the important features of homogenous OLPCs. In homogenous OLPCs, as the network size is scaled up, the network communication latency increases and becomes one of the most significant factors affecting the system performance. Therefore, we first propose two abstract concepts, equivalent serial packet and equivalent serial communication, and then construct a detailed network communication latency model. Based on Amdahl's Law, we then propose a performance model that includes the detailed network communication latency. Two traffic models (uniform and hotspot) are used to reflect the two ways of storing data of data-parallel applications. The uniform traffic model matches the distributed way, in which data are equally distributed over all nodes, while the hotspot traffic model matches the centralized way, in which data are maintained only in the central node. Our models also analyze the performance impact of the noncommunication/communication ratio. Several practical suggestions emerge from the analysis of the performance model. Finally, we map three data-parallel applications (Wavefront Computation, Vector Norm, and Block Matching Algorithm in Motion Estimation) on our cycle-accurate homogenous OLPC experimental platform to validate and demonstrate our performance analysis.

The contributions of the paper are summarized as follows.
(1) Since homogenous OLPCs match data-parallel applications well and vice versa, our study exhibits a workable way to formulate and evaluate the speedup performance of data-parallel applications on homogenous OLPCs before application programming and hardware design.
(2) Two abstract concepts, equivalent serial packet and equivalent serial communication, are proposed and then used to construct the detailed network communication latency model (see Section 4.3).
(3) Based on Amdahl's Law, we propose a performance model of homogeneous OLPCs for data-parallel applications (see Section 4.4). The proposed performance model includes the proposed network communication latency model and adopts two traffic models (uniform and hotspot), so it has two forms (see Sections 4.4.1 and 4.4.2), which, respectively, reflect the distributed way and the centralized way of storing data of data-parallel applications.
(4) A cycle-accurate homogenous OLPC experimental platform is built and three real data-parallel applications are mapped onto it to validate the effectiveness of the proposed performance model.

The rest of the paper is organized as follows. Section 2 presents the background and related work. Section 3 discusses the characteristics of homogenous OLPCs and data-parallel applications and their relationship. Section 4 proposes the communication latency model and the performance model of homogenous OLPCs and details the analysis. Section 5 maps three data-parallel applications on our homogenous OLPC platform to validate the effectiveness of the performance model. Section 6 discusses the applicability and the limitations of our performance model. Finally, we conclude in Section 7.

2. Background and Related Work

The development of on-chip computation presents two trends. One is towards a growing number of processors integrated on a chip [1, 2]. Computing is moving away from a sequential to a parallel paradigm, leading to tens, dozens, hundreds, and soon even thousands of computing cores on a single chip. Such a number of computing cores can potentially cooperate in parallel to obtain higher performance for parallel applications. The other trend is about the interconnection of on-chip resources. The communication infrastructure is developing into a similarly parallel structure, which is often called a Network-on-Chip (NoC) [3–5]. Shared, serial buses are replaced by pipelined communication networks that allow hundreds or thousands of communications to go on concurrently at any time. Combining the two trends, on-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. Understanding the speedup potential that OLPC computing platforms can offer is a fundamental question in continually pursuing higher performance.

With respect to performance analysis, Amdahl's Law [6] provides a simple yet very useful method to evaluate the performance of a parallel system. Its fundamental hypothesis is that the computation problem size does not change when running on enhanced parallel systems. Its main result shows that the percentage of the serial portion dominates the speedup limit. Amdahl's Law takes the pessimistic view that the speedup does not increase infinitely along with the increase of the number of parallel processor cores. Based on Amdahl's Law, many researchers have discussed variants for different purposes. In [7], Li and Malek discussed the effect of the noncommunication/communication ratio on the speedup based on Amdahl's Law, but their communication delay model is simple and does not consider the details of the interconnect. In [8], Paul revisited Amdahl's Law on the single-chip heterogeneous multiprocessor, focusing on the performance impact induced by different types of processor cores with different processing capabilities. In [9], Cho and Melhem presented corollaries to Amdahl's Law in order to study the interaction between parallelization and energy consumption. In [10], Hill and Marty offered a corollary of a simple model of multicore hardware resources based on Amdahl's Law. They showed that an enhanced core is necessary for high system performance but that the parallelism supported by systems with such cores suffers. In [11], Loh extended Hill's work to study the performance impact of uncore function units on the multicore system's throughput. In both Hill's and Loh's discussions, the effect of network communication latency is omitted. In OLPCs, the enhancement of application performance may be restricted by the increasing network communication latency, even though the number of cores increases. We note that little of the aforementioned work discusses the effect of network communication latency on the performance of OLPCs. In the paper, we detail the network communication latency by proposing two abstract concepts, equivalent serial packet and equivalent serial communication, and establish the performance model of homogenous OLPCs. Our model, verified by real data-parallel applications, exhibits a workable way to estimate and evaluate the performance of homogenous OLPCs.

3. Homogenous OLPCs and Data-Parallel Applications

Homogenous OLPCs are a suitable architecture for data-parallel applications and vice versa. Regularity and scalability are the key features of homogenous OLPCs. Figure 1(a) shows an example of a homogenous OLPC. The communication infrastructure is a regular 2D-mesh NoC, which is the most popular NoC topology proposed today [12]. As we can see, the processor type and the local memory volume in each Processor-Memory (PM) node are the same, so that each PM node has the same computation capability. All PM nodes are networked by routers, and the network size is scalable. As one way of parallel processing, data parallelism partitions data into several blocks that are mapped to different processors. Processors handle their own data blocks by running the same program. Data parallelism is efficient for applications with high computation complexity (e.g., image processing and hydrodynamics computing). These data-parallel applications scale well and their data are regular, so they are easily parallelized by partitioning their data. Figure 1(b) illustrates one way of partitioning the data of a data-parallel application. Assume that there are 144 (12 × 12) data items to be processed by a data-parallel application on a homogenous OLPC with a network size of 36 (6 × 6). Since the computation ability of each PM node is the same, the natural choice is to partition the 144 data items into 36 equal parts. Each part contains 4 data items and is handled by one PM node. As the network size is scaled up and hence more PM nodes are included, we can repartition the data to suit the number of PM nodes in order to gain higher performance. However, the network communication limits the performance. We consider two traffic models which reflect two ways of storing the data of data-parallel applications. The uniform traffic model matches the distributed way, in which data are distributed equally over the local memories of all nodes. The hotspot traffic model matches the centralized way, in which data are maintained only in the central node.

4. Models and Analysis

4.1. Problem Definition

The problem we consider is the performance of homogeneous OLPCs in the context of data-parallel applications, with a detailed analysis of communication latency. The program running on an OLPC is divided into several subprograms running on different processor nodes. A subprogram can be abstracted as a set of subtasks and communications (see Figure 2(a)). A communication denotes the interaction between two communicating processor nodes and contains one or more packets transmitted in the network. A subtask denotes the noncommunication processing (e.g., computation and memory access) between two successive communications. To facilitate constructing the models of communication latency and performance, we make the following three assumptions.
(1) The noncommunication time and communication time of the subprogram assigned to each node are the same across nodes; that is, the subprogram in each node contains the same number of subtasks and communications.
(2) The execution time of each subtask is equal to that of the others.
(3) The time of each communication is equal to that of the others.

Figure 2(b) is the reabstracted subprogram based on assumptions (2) and (3). The total amount of subtask and communication time in Figure 2(b) is equal to that in Figure 2(a).

4.2. Notations

To facilitate the analysis, we first define a set of symbols in the Notations section.

4.3. Communication Latency Model

Communication latency contains two parts: minimal (noncontention) latency and contention latency.

The minimal latency is determined by the distance between the two communicating nodes. We use the hop count to calculate this latency. Table 1 lists the average hop counts, calculated following [13]. We consider two representative traffic models (Uniform and Hotspot) in 2D-mesh networks. For hotspot traffic, the central node is chosen as the hotspot node.
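To make the hop-count calculation concrete, the following Python sketch enumerates the average hop count of a $k \times k$ mesh for both traffic models by brute force and compares it against the closed forms $\bar{h} = \frac{2(k^2-1)}{3k}$ (uniform) and $\bar{h} = \frac{k^2-1}{2k}$ (hotspot), which are the standard mesh results and which we assume are the quantities listed in Table 1 (averaged over all $N = k^2$ source positions):

```python
# A minimal sketch: brute-force average hop counts in a k x k 2D mesh,
# assuming minimal (Manhattan-distance) routing. The closed forms are
# standard mesh results and are assumed to match Table 1.
from itertools import product

def avg_hops_uniform(k):
    """Average Manhattan distance over all ordered source/destination pairs."""
    nodes = list(product(range(k), repeat=2))
    total = sum(abs(a - c) + abs(b - d)
                for (a, b) in nodes for (c, d) in nodes)
    return total / len(nodes) ** 2

def avg_hops_hotspot(k):
    """Average Manhattan distance from every node to the central node (k odd)."""
    c = k // 2
    total = sum(abs(x - c) + abs(y - c) for x, y in product(range(k), repeat=2))
    return total / k ** 2

for k in (3, 5, 9):
    print(k,
          round(avg_hops_uniform(k), 3), round(2 * (k * k - 1) / (3 * k), 3),
          round(avg_hops_hotspot(k), 3), round((k * k - 1) / (2 * k), 3))
```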

The contention latency mainly depends on the behavior of the parallel applications running on OLPCs. In general, it is difficult to quantify the contention latency exactly: "when to communicate," "which processor core starts a message passing," and "where the destination is" all lead to different contention latencies. If no contention occurs, transmitting a packet over one hop takes 1 cycle ($t_h = 1$) in our experimental platform shown in Figure 8. However, network contention makes the actual transmission time uncertain. Hence, in order to facilitate constructing the performance models, we consider the contention latency from another angle. Since network contention occurs only when multiple communications issued by different processor nodes appear simultaneously in the on-chip network, we introduce an abstract concept: equivalent serial communication. The equivalent serial communications are sequential when the program is running, so that network contention does not exist at all. To a certain extent, the number of equivalent serial communications ($c_e$) reflects the network contention. Equivalent serial communication is discussed in detail in Step 3 below.

Next, we establish the communication latency model in three steps.

Step 1 (calculating the time of transmitting a packet). With packet switching, the average time of transmitting a packet in the network is

$t_p = \bar{h} \cdot t_h$,  (1)

where $\bar{h}$ reflects the distance and $t_h$ reflects the architectural latency without contention.

Step 2 (calculating the time of a communication). In general, a communication issued by a processor node contains one or more packets, all launched by the same processor node, and their transmissions may overlap. In the best case, each packet in a communication is launched one cycle after the preceding packet and transmits in the on-chip network without waiting for the completion of its predecessor's transmission; the packet transmissions fully overlap. In the worst case, all packets are transmitted serially; that is, a packet is not transmitted until the previous one is finished. The overlap among packet transmissions improves the performance by shortening the network communication latency.

To measure the time of a communication, we define an abstract concept: equivalent serial packet. Equivalent serial packets are considered to be transmitted sequentially, and a communication is abstracted to consist of several equivalent serial packets. As shown in Figure 3(a), assuming that the communication contains four packets, the program behavior determines the concurrent degree of the packets' transmission. For example, Packet 1 and Packet 2 are almost fully overlapped, while only a small portion of Packets 3 and 4 overlaps. For ease of measuring the communication time, the communication is abstracted to be composed of several equivalent serial packets. In Figure 3(b), the number of equivalent serial packets ($m_e$) is about 2.67, which is less than the packet number, 4. $m_e$ meets the inequation below:

$1 < m_e \le m$.  (2)

$m_e$ describes the concurrent degree of packet transmission in a communication. The ideal best case would be that all packets are transmitted concurrently. However, it cannot be reached, because there is only one physical channel from the node to the router. The achievable best case is that the packets in a communication are launched one cycle after another, so $m_e$ is close to, but not equal to, 1. In the worst case, all packets are transmitted sequentially and $m_e = m$; that is, the number of equivalent serial packets is equal to the number of real packets ($m$) in a communication.
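As a concrete illustration, the sketch below operationalizes $m_e$ in one plausible way, as the elapsed span of a communication divided by the single-packet time $t_p$; the launch timestamps are illustrative and mimic Figure 3 rather than reproduce measured data:

```python
# A sketch of estimating m_e (equivalent serial packets) from a packet trace.
# Assumption: each packet occupies [start, start + t_p), and the communication
# lasts from the first launch to the last completion, so m_e = span / t_p.
def equivalent_serial_packets(starts, t_p):
    span = max(starts) + t_p - min(starts)
    return span / t_p

t_p = 30
starts = [0, 1, 35, 50]      # Packets 1 and 2 almost fully overlap; 3 and 4 partially do
m = len(starts)
m_e = equivalent_serial_packets(starts, t_p)
assert 1 < m_e <= m          # the bound of Formula (2)
print(round(m_e, 2))         # 2.67, as in the Figure 3 example
```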

From (1) and (2), we can obtain the time of a communication:

$t_c = m_e \cdot t_p = m_e \bar{h} t_h$.  (3)

Step 3 (calculating the communication overhead of a program). The program is parallelized over $N$ nodes, so the subprogram in each node contains $p/N$ communications. Communications issued by the same node are sequential, because the subprogram is sequentially executed in the processor node. Communications issued by different nodes may exist in the network at the same time. In the best case, the program is fully parallelized and the communication overhead of the entire program is equal to the communication latency of the subprogram in each node. In the worst case, communications from different nodes do not overlap one another, and the communication overhead of the entire program is equal to the sum of the communication latencies of all nodes; in this case, there is no network contention. In general, however, communications are partially overlapped and network contention always exists. Moreover, the existence of multiple communications in the network leads to the occurrence of network contention. The behavior of parallel programs (e.g., "when a communication is generated" and "which node sends or receives packets in the communication") determines the concurrent degree of communications and the network contention latency.
Therefore, in order to quantify the network contention and measure the communication overhead of the entire program, we define an abstract concept: equivalent serial communication. Equivalent serial communications are considered to be sequential, so that there is no network contention, and a program is abstracted to contain several equivalent serial communications (a measurement sketch follows the list below). As shown in Figure 4(a), assume that the program is mapped on two nodes, Node 1 and Node 2, and that there are four communications: Communication 1_1 and Communication 1_2 are generated by Node 1, while Communication 2_1 and Communication 2_2 are generated by Node 2. Communications generated by different nodes may overlap due to the program behavior. For example, Communication 1_2 overlaps with Communication 2_2, so there is network contention between them. The number of equivalent serial communications ($c_e$) is about 3.33, which is less than the communication number, 4. $c_e$ meets the inequation below:

$\frac{p}{N} \le c_e \le p$.  (4)

$c_e$ describes the concurrent degree of communications as well as the network contention. The equivalent serial communications are sequential when the program is running, so that no network contention occurs. Therefore, the contention latency is removed and fused into $c_e$ when calculating the network communication latency. The network contention and the concurrent degree of communications together determine the value of $c_e$.
(i) If communications are concurrent but they all fall in the same local area, resulting in a hotspot, the network contention is heavy. In this case, the total communication time of the program is longer and hence $c_e$ is larger, close to $p$. For instance, as illustrated in Figure 5(a), Node (1,1), Node (2,1), Node (1,2), Node (2,2), and Node (3,3) communicate with Node (3,1) concurrently. A hotspot is formed near Node (3,1) and network contention is heavy there. Although the five communications are issued concurrently, the network contention serializes them.
(ii) If communications are concurrent and they are uniformly distributed over the entire on-chip network, the network contention becomes light. In this case, the total communication time of the program is shorter and hence $c_e$ is smaller, close to $p/N$. For instance, as shown in Figure 5(b), there are also five communications occurring concurrently in the network. However, they belong to different source nodes and destination nodes and their routing tracks do not overlap, so there is no network contention. Therefore, its $c_e$ is smaller than that in Figure 5(a).
(iii) If communications are sequential, although the network contention is not heavy, the total communication time of the program is always long and hence $c_e$ is large, close to $p$. For instance, as shown in Figure 5(c), Node (1,1) communicates with Node (3,1), Node (2,2) communicates with Node (1,2), and Node (3,3) communicates with Node (3,2). After that, Node (2,1) communicates with Node (2,2) and Node (1,2) communicates with Node (1,3) (see Figure 5(d)). Although there is no network contention, the five communications are not issued concurrently. Therefore, its $c_e$ is bigger than that in Figure 5(b).
(iv) In the best case, all nodes are fully concurrent and there is no network contention, and the number of equivalent serial communications is equal to the number of real communications in each node ($c_e = p/N$). In the worst case, communications from all nodes occur sequentially, and the number of equivalent serial communications is equal to the sum of the numbers of real communications over all nodes ($c_e = p$).
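The sketch below gives one plausible way to extract $c_e$ from a communication trace, counting the time during which at least one communication is in flight and dividing by $t_c$, so that fully concurrent communications count once while sequential ones accumulate; the intervals are illustrative, mimicking Figure 4:

```python
# A sketch of estimating c_e (equivalent serial communications) from a trace.
# Assumption: c_e = (total time during which at least one communication is in
# flight) / t_c. Each tuple is (start, end) of one communication.
def equivalent_serial_comms(intervals, t_c):
    points = sorted(set(t for iv in intervals for t in iv))
    busy = sum(b - a for a, b in zip(points, points[1:])
               if any(s <= a and b <= e for s, e in intervals))
    return busy / t_c

t_c = 30
intervals = [(0, 30), (50, 80),    # Communications 1_1 and 1_2 (Node 1)
             (30, 60), (70, 100)]  # Communications 2_1 and 2_2 (Node 2)
c_e = equivalent_serial_comms(intervals, t_c)
print(round(c_e, 2))               # 3.33, as in the Figure 4 example
```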

From (1), (2), and (4), we can calculate the communication overhead of a program running on homogenous OLPCs:

$T_C = c_e \cdot t_c = c_e m_e \bar{h} t_h$.  (5)

From (5), we can observe the following: (i) when $m_e = m$ and $c_e = p$, $T_C = p\,m\,\bar{h}\,t_h$ is the maximal communication overhead of the program, corresponding to the worst case in which all packets are transmitted in the network sequentially; (ii) when $m_e$ approaches 1 and $c_e = p/N$, $T_C$ approaches $\frac{p}{N}\bar{h}t_h$, the minimal communication overhead of the program, corresponding to the ideal best case in which all packets in a communication are transmitted concurrently, all communications from different nodes are concurrent, and no network contention occurs; and (iii) when the network size is scaled up, $\bar{h}$ increases due to the longer communication distance, and $T_C$ increases with it.

Network contention is hard to quantify exactly. The concrete behavior of parallel applications leads to different traffic patterns, packet generation rates, and other factors, all of which influence the network contention. In this section, by introducing two abstract concepts, equivalent serial packet and equivalent serial communication, we can quantify the network contention and formulate the network communication latency. The equivalent serial packets and equivalent serial communications are sequential, so that the network contention does not exist; to a certain extent, the effect of network contention is fused into the number of equivalent serial packets ($m_e$) and the number of equivalent serial communications ($c_e$). With the two extremes of traffic patterns (uniform and hotspot traffic models), we obtain the upper and lower bounds of $m_e$ and $c_e$ (see Formulas (2) and (4)). The bounds are determined by the number of packets in a communication ($m$), the parallel part of the program ($p$), and the total processor number ($N$); $m$ also reflects the packet generation rate. Our model offers a feasible way to evaluate the network communication latency of homogenous OLPCs, but a question remains: how do we determine or estimate $N$, $f$, $m$, $m_e$, and $t_s$? The network size of the OLPC decides $N$. Different applications have their own $f$. Data-parallel applications are scalable and their data are regular; their programs generally consist of a set of identical subtasks. By analyzing the computation and communication behavior of the subtask, we can determine $m$ and estimate $m_e$ and $t_s$. Section 5.3 exemplifies the way of estimating these parameters. Based on the analysis in this subsection, we can draw an implication.

Implication 1. Network communication latency has a significant influence on the system's performance. The three basic avenues to reduce the latency are (1) decreasing the number of communications in the program and the number of packets in a communication, (2) improving the concurrency of communications and packets, and (3) avoiding hotspot traffic. Architects or programmers can pursue these three avenues by optimizing hardware design and application mapping, for instance by offering support for outstanding transactions or caching remote data in the local memory.

4.4. Performance Model

In this subsection, inspired by Amdahl's Law, we establish the performance model for homogenous OLPCs, incorporating the network communication latency. We elaborate the performance model under both uniform and hotspot traffic patterns. Under the two traffic models, we discuss and analyze the performance's trend, limit, minimum, and maximum. The impacts of the network size ($N$), the ratio of the serial part to the parallel part in a program ($f$), the number of equivalent serial packets in a communication ($m_e$), and the execution time of a subtask ($t_s$) on the performance are also discussed in detail. $m_e$ reflects the influence of network contention and congestion, while $t_s$ reflects the influence of the noncommunication/communication ratio.

As with Amdahl's Law, we assume that the total problem size is fixed as the number of computing nodes increases. The parallel part in the program is sped up, and the parallel part assigned to each processor node decreases with the increase of the system size. So we can get the performance model as the formula below shows:

$S = \dfrac{(s + p)\,t_s}{s\,t_s + \frac{p}{N}\,t_s + T_C}$.  (6)

By including (5), we can get

$S = \dfrac{(s + p)\,t_s}{s\,t_s + \frac{p}{N}\,t_s + c_e m_e \bar{h} t_h} = \dfrac{f + 1}{f + \frac{1}{N} + \frac{c_e m_e \bar{h} t_h}{p\,t_s}}$,  (7)

where $f = s/p$.

The last item in the denominator describes the communication overhead. If this item is ignored, (7) can be simplified to

$S = \dfrac{f + 1}{f + \frac{1}{N}}$,  (8)

which is Amdahl's Law [6].
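The following sketch implements Formula (7) under the notation used in this paper ($c_e$ is passed normalized by $p$, so it is $1/N$ for uniform traffic and $1$ for hotspot traffic, as explained below) and checks that it collapses to Amdahl's Law when the communication term vanishes; the parameter values are illustrative:

```python
# A sketch of the performance model of Formula (7). c_e_over_p is c_e / p:
# 1/N under uniform traffic, 1 under hotspot traffic (see Section 4.4).
def speedup(N, f, t_s, m_e, c_e_over_p, h_bar, t_h=1.0):
    comm = c_e_over_p * m_e * h_bar * t_h / t_s   # communication term of (7)
    return (f + 1) / (f + 1 / N + comm)

def amdahl(N, f):                                 # Formula (8)
    return (f + 1) / (f + 1 / N)

# With the communication term removed, (7) reduces to Amdahl's Law:
assert abs(speedup(64, 0.01, 100, 4, 0.0, 5.0) - amdahl(64, 0.01)) < 1e-12
print(speedup(64, 0.01, 100, 4, 1 / 64, 5.0))     # uniform-like example
```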

The behavior of parallel programs determines the communication patterns, affecting the values of $m_e$ and $c_e$. The uniform traffic model is a well-distributed traffic model, while the hotspot traffic model is a centralized traffic model. They are two extremes, representing the upper bound and the lower bound of the communication patterns, respectively. Hence, we consider both uniform and hotspot traffic models below to analyze the speedup in detail. Although hotspot traffic has a smaller average hop count and hence less minimal latency, it causes much heavier network contention than uniform traffic. Uniform traffic has lower network contention, and its $c_e$ is closer to $p/N$; hotspot traffic has higher network contention, and its $c_e$ is closer to $p$, because of the serialization effect at the destination node. Therefore, to facilitate the formula transformation and analysis, we take $c_e = p/N$ for uniform traffic and $c_e = p$ for hotspot traffic. This assumption is reasonable and does not affect the analysis of the performance trend.

4.4.1. Uniform Traffic Model

Assuming $c_e = p/N$ and $\bar{h} = \frac{2(N-1)}{3\sqrt{N}}$ (the uniform-traffic hop count of Table 1), we can refine (7) as

$S_u = \dfrac{f + 1}{f + \frac{1}{N}\left(1 + \frac{2 m_e t_h (N - 1)}{3 t_s \sqrt{N}}\right)}$.  (9)

Since $t_h$ reflects the architectural latency without contention, it is a constant for a given homogenous OLPC architecture. Therefore, the speedup ($S_u$) is a quaternion function, $S_u = \varphi(N, f, m_e, t_s)$; its value is determined by $N$, $f$, $m_e$, and $t_s$. To obtain the variation trend of $S_u$, we conduct the two steps below.

Step 1 (calculating the speedup's limit). We have the limit of $S_u$ as below:

$\lim_{N \to \infty} S_u = \dfrac{f + 1}{f}$.  (10)

Step 2 (calculating the value of $N$ related to the extreme minimal value of $S_u$). Let $\partial S_u / \partial N = 0$; then, we can get

$-\dfrac{1}{N^2} - \dfrac{m_e t_h}{3 t_s N^{3/2}} + \dfrac{m_e t_h}{t_s N^{5/2}} = 0$.  (11)

Let $a = \frac{2 m_e t_h}{3 t_s}$; Formula (11) is refined as

$a N + 2\sqrt{N} - 3a = 0$.  (12)

From formula (12), we can get

$\sqrt{N} = \dfrac{\sqrt{1 + 3a^2} - 1}{a}$, that is, $N = \left(\dfrac{\sqrt{1 + 3a^2} - 1}{a}\right)^2$.  (13)

The extreme minimal value of $S_u$ exists; its related $N$ is defined as $N_{\min}$. Because $N$ is a positive integer, we have

$N_{\min} = \left\lceil \left(\dfrac{\sqrt{1 + 3a^2} - 1}{a}\right)^2 \right\rceil$.  (14)

Note that the extreme point given by (13) is smaller than 3 for any $a > 0$, so $N_{\min} \in \{1, 2, 3\}$.

The OLPC hosts at least one processor core, so $N \ge 1$. Combining the two steps, we can obtain the following.
(1) When $N_{\min} = 1$:
(i) $S_u$ monotonically increases with the increase of $N$; parallelization enables the performance improvement; however, $S_u$ is bounded by $\frac{f+1}{f}$ as $N \to \infty$; the ratio of the serial part in a program limits the performance improvement.
(2) When $N_{\min} = 2$ or $N_{\min} = 3$:
(i) when $N < N_{\min}$, $S_u$ decreases with the increase of $N$; parallelization degrades the performance rather than improving it, because the negative effect of network communication latency on the performance surpasses the positive effect of the cooperation of multiple processor cores;
(ii) when $N = N_{\min}$, $S_u$ reaches its minimum ($S_{\min}$);
(iii) when $N > N_{\min}$, $S_u$ increases as $N$ increases; the positive effect of parallelization surpasses the negative effect of network communication latency, thus improving the performance.

(3) The ratio between the serial part and the parallel part in a program ($f$) determines the upper limit of $S_u$. The limit $\frac{f+1}{f}$ is approximately inversely proportional to $f$. It indicates that reducing the serial part or enlarging the parallel part in a program is good for improving the performance limit. (A numerical sketch of these trends follows.)
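The trends above can be checked numerically; the sketch below scans Formula (9) over $N$ for a few illustrative parameter settings, locating the early minimum ($N_{\min} \le 3$) and the slow climb toward the limit $(f+1)/f$:

```python
# A numerical check of the uniform-traffic model, Formula (9), with
# h_bar = 2(N - 1)/(3*sqrt(N)) and c_e = p/N. Parameter values illustrative.
def s_uniform(N, f, t_s, m_e, t_h=1.0):
    h_bar = 2 * (N - 1) / (3 * N ** 0.5)
    return (f + 1) / (f + (1 + m_e * h_bar * t_h / t_s) / N)

for f, t_s, m_e in [(0.01, 10, 16), (0.01, 100, 256), (0.001, 1000, 1)]:
    curve = [s_uniform(n, f, t_s, m_e) for n in range(1, 257)]
    n_min = curve.index(min(curve)) + 1          # always lands at N <= 3
    print(f"f={f}, t_s={t_s}, m_e={m_e}: N_min={n_min}, "
          f"S_u(256)={curve[-1]:.1f}, limit={(f + 1) / f:.1f}")
```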

As we can see, $S_u$ reaches its minimum when $N$ is very small, while an OLPC hosts a large number of processor cores. Therefore, over most of the range of $N$, $S_u$ keeps going up as $N$ increases. To further discuss the effects of $N$, $f$, $m_e$, and $t_s$ on $S_u$, Figure 6 shows the performance trends of $S_u$ under the uniform traffic model. Without loss of trend analysis, we consider the following:
(a) the network size $N$ is scaled up from 1 to 256; the increase of the network size makes more processor cores involved;
(b) with the increase of $f$, the serial part takes a larger proportion of the program, and the performance limit ($\frac{f+1}{f}$) becomes smaller;
(c) the number of equivalent serial packets in a communication ($m_e$) increases from 1 and 16 to 256; more packets lead to larger network communication latency, causing a negative effect on the performance;
(d) the execution time of a subtask ($t_s$) increases from 10 and 100 to 1000; increasing the noncommunication time brings a positive effect on the performance.

From the aforementioned formula transformation and Figure 6, we can draw results regarding the performance under the uniform traffic model.
(I) The increase of the network size ($N$) provides more processor cores, exploiting larger parallelism. As shown in Figure 6, as $N$ increases, the speedup ($S_u$) first decreases and soon reaches its minimum at $N = N_{\min}$ (this situation is visible in Figure 6(g); it also exists in the other subfigures, but it is not obvious, since the plotted range of $N$ is much larger); then, $S_u$ increases. However, $S_u$ increases more and more slowly; it is finally limited by $\frac{f+1}{f}$.
(II) Both the incremental ratio and the limit of $S_u$ are deeply influenced by $f$. As shown in all subfigures, as $f$ increases, the incremental ratio of $S_u$ becomes very low and the limit of $S_u$ becomes very small. Even if the network size ($N$) is scaled up, the performance improvement is very little.
(III) As $m_e$ increases, the network hosts more packets, worsening network congestion or contention and thus generating larger network communication latency, which hurts the performance. Frequent network communication and huge latency make the performance very bad. For instance, for fixed $f$ and $t_s$ (see Figures 6(a), 6(d), and 6(g)), (i) when $m_e = 1$, $S_u$ can reach its maximum; (ii) when $m_e = 16$, the maximal speedup becomes smaller; (iii) when $m_e = 256$, the heavy network communication leaves the performance hardly improved at all.
(IV) The increase of $t_s$ can improve the performance, alleviating and making up for the negative effect of network communication latency. For instance, for fixed $f$ and $m_e$ (see Figures 6(d), 6(e), and 6(f)), (i) when $t_s = 10$, the maximal speedup is limited; (ii) when $t_s = 100$, the maximal speedup becomes larger; (iii) as $t_s$ rises to 1000, the maximal speedup is close to the ideal maximal value (256).

4.4.2. Hotspot Traffic Model

Assuming $c_e = p$ and $\bar{h} = \frac{N-1}{2\sqrt{N}}$ (the hotspot-traffic hop count of Table 1, where $k = \sqrt{N}$ is odd so that a central node exists), we can refine (7) as

$S_h = \dfrac{f + 1}{f + \frac{1}{N} + \frac{m_e t_h (N - 1)}{2 t_s \sqrt{N}}}$.  (15)

As in Section 4.4.1, the speedup ($S_h$) is also a quaternion function, $S_h = \varphi(N, f, m_e, t_s)$; its value is decided by $N$, $f$, $m_e$, and $t_s$. In (15), when $N$ becomes larger, $\frac{1}{N}$ decreases but the communication term $\frac{m_e t_h (N-1)}{2 t_s \sqrt{N}}$ increases, so $S_h$ may increase or decrease. To obtain the variation trend of $S_h$, we also conduct the two steps below.

Step 1 (calculating the speedup's limit). We have the limit of $S_h$ as below:

$\lim_{N \to \infty} S_h = 0$.  (16)

Step 2 (calculating the value of $N$ related to the extreme maximal value of $S_h$). Let $\partial S_h / \partial N = 0$; then, we can get

$-\dfrac{1}{N^2} + \dfrac{m_e t_h}{4 t_s}\left(\dfrac{1}{\sqrt{N}} + \dfrac{1}{N^{3/2}}\right) = 0$, that is, $\sqrt{N}\,(N + 1) = \dfrac{4 t_s}{m_e t_h}$.  (17)

The extreme maximal value of $S_h$ exists; its related $N$, denoted $N_{\mathrm{opt}}$, is obtained by the formula below (approximating $N + 1 \approx N$ for $N \gg 1$):

$N_{\mathrm{opt}} \approx \left(\dfrac{4 t_s}{m_e t_h}\right)^{2/3}$.  (18)

With Formulas (15) and (17), we can have the extreme maximal value of $S_h$:

$S_{\max} = \dfrac{f + 1}{f + \frac{1}{N_{\mathrm{opt}}} + \frac{m_e t_h (N_{\mathrm{opt}} - 1)}{2 t_s \sqrt{N_{\mathrm{opt}}}}}$.  (19)

Let $b = \frac{m_e t_h}{2 t_s}$; Formula (19) is refined as

$S_{\max} \approx \dfrac{f + 1}{f + 3\left(\frac{b}{2}\right)^{2/3}}$.  (20)

Because $N$ is a positive integer, combining the two steps, we can obtain the following.
(1) When $N_{\mathrm{opt}} \le 1$:
(i) $S_h$ monotonically decreases with the increase of $N$; parallelization degrades the performance rather than improving it, because the negative effect of network communication latency on the performance surpasses the positive effect of the cooperation of multiple processor cores; $S_h$ tends to zero as $N \to \infty$.
(2) When $N_{\mathrm{opt}} > 1$:
(i) when $N < N_{\mathrm{opt}}$, $S_h$ increases with the increase of $N$; within this range, the network communication latency is not large and parallelization is able to improve the performance;
(ii) when $N = N_{\mathrm{opt}}$, $S_h$ reaches its maximum ($S_{\max}$);
(iii) when $N > N_{\mathrm{opt}}$, $S_h$ decreases as $N$ keeps going up; performance degrades because the network communication latency dominates.
(3) By (18) and (20), $N_{\mathrm{opt}} \approx \left(\frac{4 t_s}{m_e t_h}\right)^{2/3}$ and $S_{\max} \approx \frac{f+1}{f + 3(b/2)^{2/3}}$; when $t_s$ increases and $m_e$ decreases, $b$ decreases, resulting in a larger $N_{\mathrm{opt}}$ and a larger $S_{\max}$. It indicates that increasing the noncommunication time and improving packet concurrency can increase the extreme value of $S_h$, and the performance improvement then covers a larger system size.
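The sketch below numerically scans Formula (15) and compares the scanned optimum with the closed-form estimate of Formula (18); the parameter values are illustrative:

```python
# A numerical check of the hotspot-traffic model, Formula (15), with
# h_bar = (N - 1)/(2*sqrt(N)) and c_e = p, against the estimate of (18).
def s_hotspot(N, f, t_s, m_e, t_h=1.0):
    h_bar = (N - 1) / (2 * N ** 0.5)
    return (f + 1) / (f + 1 / N + m_e * h_bar * t_h / t_s)

for f, t_s, m_e in [(0.01, 10, 1), (0.01, 100, 16), (0.01, 1000, 256)]:
    curve = [s_hotspot(n, f, t_s, m_e) for n in range(1, 257)]
    n_opt = curve.index(max(curve)) + 1
    n_est = (4 * t_s / m_e) ** (2 / 3)            # Formula (18), t_h = 1
    print(f"t_s={t_s}, m_e={m_e}: scanned N_opt={n_opt}, estimated {n_est:.1f}")
```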

To further discuss the effects of $N$, $f$, $m_e$, and $t_s$ on $S_h$, Figure 7 shows the performance trends of $S_h$ under the hotspot traffic model. We consider the same values of $N$, $f$, $m_e$, and $t_s$ as in Section 4.4.1. From the aforementioned formula transformation and Figure 7, we can draw results regarding the performance under the hotspot traffic model.
(I) Although the increase of the network size ($N$) could involve more processor cores cooperating so as to seek higher parallel performance, it also induces network communication latency, limiting the performance improvement and even worsening the performance. As shown in Figure 7, as $N$ increases, in some cases (see Figures 7(a), 7(b), 7(c), 7(e), 7(f), and 7(i)) the speedup ($S_h$) first increases and then decreases after reaching its maximum; in other cases (see Figures 7(d), 7(g), and 7(h)) it monotonically decreases. In all cases, as $N$ increases, $S_h$ finally tends to zero.
(II) Both the incremental/decremental ratio and the maximal value of $S_h$ are influenced by $f$. As shown in all subfigures, as $f$ increases, the incremental/decremental ratio becomes very low. With the increase of $f$, the maximal value of $S_h$ may increase or decrease: (i) if $3(b/2)^{2/3} < 1$ in Formula (20), $S_{\max}$ decreases (see Figures 7(a), 7(b), 7(c), 7(e), 7(f), and 7(i)); (ii) if $3(b/2)^{2/3} = 1$, $S_{\max}$ stays at 1; (iii) if $3(b/2)^{2/3} > 1$, $S_{\max}$ increases (see Figures 7(d), 7(g), and 7(h)).
(III) As $m_e$ increases, the network hosts more packets, and the larger network communication latency worsens the performance; the maximal value of $S_h$ reached by parallelism declines. For instance, for fixed $f$ and $t_s$ (see Figures 7(c), 7(f), and 7(i)), (i) when $m_e = 1$, $S_h$ can reach a large maximum; (ii) when $m_e = 16$, the maximal speedup becomes smaller; (iii) when $m_e = 256$, the heavy network communication makes the speedup soon reach a small maximum.
(IV) The increase of $t_s$ can improve the performance, alleviating and making up for the negative effect of network communication latency. For instance, for fixed $f$ and $m_e$ (see Figures 7(a), 7(b), and 7(c)), (i) when $t_s = 10$, the maximal value of $S_h$ is very small; (ii) when $t_s = 100$, the maximal speedup becomes bigger; (iii) as $t_s$ rises to 1000, the maximal speedup becomes larger still and is reached at a larger $N_{\mathrm{opt}}$.

In all, the performance under the hotspot traffic model is worse than that under the uniform traffic model.

With the performance analysis in this subsection, we could have the following.

Implication 2. With the uniform traffic model, the communication overhead is modest, assuming that there is limited contention, so the performance can keep improving. Under the uniform traffic model, the concurrent degree of communications is usually high, so architects or programmers should pay more attention to improving the concurrent degree of packets within a communication; the performance improvement benefits more from improved packet concurrency.

Implication 3. With the hotspot traffic model, parallelization cannot always improve the system's performance, because the network communication latency eventually dominates. To alleviate the impact of network communication latency on the performance and hence keep the performance improving, designers need to focus on increasing the noncommunication time and improving packet concurrency.

Implication 4. Exploiting the parallelism of multiple processor cores well can compensate for the negative effect of network communication latency and even sustain continuous performance improvement. Following this view, architects or programmers need to pay more attention to exploiting the parallelism of processor cores.

Implication 5. In addition, increasing the noncommunication time is a viable way to alleviate the negative effect induced by the network communication latency.

5. Experiments and Results

In this section, we run three real data-parallel applications on our cycle-accurate homogenous OLPC experimental platform to validate and demonstrate the effectiveness of our performance analysis.

5.1. Experimental Platform

Figure 8 shows our homogenous OLPC experimental platform. The platform uses the LEON3 [14] as the processor in each PM node and the Nostrum NoC [15] as the on-chip network. Each Processor-Memory (PM) node has a LEON3 processor and an enhanced memory controller plus a local memory. The enhanced memory controller extends the function of LEON3's own memory control module to support memory accesses from/to remote nodes via the network. The LEON3 processor core is a synthesizable VHDL model of a 32-bit processor compatible with the SPARC V8 architecture. The Nostrum NoC is a 2D-mesh packet-switched network with configurable size. Moving one hop in the network takes one cycle ($t_h = 1$).

5.2. Application Examples

We use Wavefront Computation, Vector Norm, and Block Matching Algorithm in Motion Estimation as application examples and perform experiments on various instances of the three applications. Wavefront Computation and Vector Norm are mostly used in wireless communication, computer vision, and image/video processing, and the Block Matching Algorithm in Motion Estimation is one of the basic components of image/video processing.

5.2.1. Wavefront Computation

Wavefront Computations are common in scientific applications. Given a matrix (see Figure 9(a)) whose left and top edges are all a constant, the computation of each remaining element depends on its neighbors to the left, above, and above-left. If the solution is computed in parallel, the computation at any instant forms a wavefront propagating through the solution space; this form of computation therefore gets the name wavefront. We use the same method as [16] to parallelize the Wavefront Computation: the rows of the matrix are assigned to PM nodes in a round-robin fashion (see Figure 9(b)). With this static scheduling policy, to compute an element, only the availability of its above neighbor needs to be checked (synchronized). For instance, PM node 0 computes the elements in row 1. PM node 1 cannot compute the elements in row 2 until the corresponding elements in row 1 have been computed by PM node 0. After finishing the computation in row 1, PM node 0 goes on to compute the elements in row 3 according to the round-robin scheduling policy. In our experiment, we conduct various instances of Wavefront Computation described below (a parallelization sketch follows the list).
(1) Two ways of data storing are realized to reflect the two traffic models. One is "Uniform," meaning that the matrix data are uniformly distributed over all nodes. The other is "Hotspot," meaning that the matrix data are located only in the central node.
(2) Both an integer matrix and a floating-point matrix are implemented to vary the noncommunication time $t_s$. For the same problem size and algorithm, floating-point computation needs more time than integer computation and hence has a bigger $t_s$.
(3) The Wavefront Computation conducts a matrix with the size of 256 × 256 on the homogenous OLPC with the network size varying from 1 × 1 (1), 1 × 2 (2), 2 × 2 (4), 2 × 4 (8), 4 × 4 (16), 4 × 8 (32), 8 × 8 (64), and 8 × 16 (128) to 16 × 16 (256). The total problem size is fixed, and the problem size assigned to each node varies from 256 rows down to 128, 64, 32, 16, 8, 4, 2, and 1 row.
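For concreteness, the sketch below mimics the round-robin row mapping of Figure 9(b) in sequential Python; the update rule is an assumed stand-in for the real kernel, and on the real platform each node would synchronize on the availability of the above neighbor instead of relying on loop order:

```python
# A sketch of the round-robin wavefront parallelization (Figure 9).
# Row r is owned by PM node r % N_NODES; an element can be computed once its
# above neighbor, produced by the predecessor node, is available.
N_NODES, SIZE = 4, 8
owner = {r: r % N_NODES for r in range(SIZE)}    # round-robin row mapping

x = [[1.0] * SIZE for _ in range(SIZE)]          # constant left and top edges
for r in range(1, SIZE):
    for c in range(1, SIZE):
        # On hardware, node owner[r] first checks that x[r-1][c] is ready
        # (it is produced by node owner[r-1]); here loop order guarantees it.
        x[r][c] = x[r - 1][c - 1] + x[r - 1][c] + x[r][c - 1]
print(x[SIZE - 1][SIZE - 1])
```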

5.2.2. Vector Norm

Vector Norm is used to compute the magnitude (length) of a vector; Figure 10(a) shows its formula. When the order is 2, the Vector Norm is also called the 2-Norm or Euclidean Norm, which is common in 2D/3D computer graphics operations. In the paper, we choose to parallelize and compute the 2-Norm. Figure 10(b) illustrates the parallelization of the 2-Norm on our OLPC platform. Different from Wavefront Computation, Vector Norm can only be partially parallelized. Its computation contains two steps. Step 1 is parallel: PM nodes are responsible for computing the squares $x_i^2$ of the elements $x_i$, which are assigned to PM nodes in a round-robin fashion. Step 2 is sequential: a central PM node takes charge of computing the square root of the sum of all the squares. For instance, as shown in Figure 10(b), two PM nodes compute the 2-Norm of a vector with four elements. In Step 1, PM node 0 computes the square of $x_1$, while PM node 1 computes the square of $x_2$. After finishing the computation of $x_1^2$, PM node 0 goes on to compute the square of $x_3$ according to the round-robin scheduling policy. In Step 2, PM node 1 (the central PM node) computes the square root of the sum of the four squares. In our experiment, we apply various instances of the 2-Norm described below (a parallelization sketch follows the list).
(1) Two ways of data storing are realized to reflect the two traffic models. One is "Uniform," meaning that the data used in Step 1 are uniformly distributed over all nodes. The other is "Hotspot," meaning that all data of both Steps 1 and 2 are located only in the central node.
(2) Both the integer data type and the floating-point data type are implemented to vary the noncommunication time $t_s$. For the same problem size and algorithm, floating-point computation needs more time than integer computation and hence has a bigger $t_s$.
(3) The 2-Norm conducts a vector with 1024 elements on the homogenous OLPC with the network size varying from 1 × 1 (1), 1 × 2 (2), 2 × 2 (4), 2 × 4 (8), 4 × 4 (16), 4 × 8 (32), 8 × 8 (64), and 8 × 16 (128) to 16 × 16 (256). The total problem size is fixed, and the problem size assigned to each node varies from 1024 elements down to 512, 256, 128, 64, 32, 16, 8, and 4 elements.
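A minimal sequential sketch of the two-step scheme, with node counts and data chosen for illustration; the round-robin assignment in Step 1 is expressed with Python slicing:

```python
# A sketch of the two-step parallel 2-Norm (Figure 10). Step 1 (parallel):
# node j squares elements j, j + n_nodes, j + 2*n_nodes, ... and accumulates
# a partial sum. Step 2 (serial): the central node reduces and takes the root.
import math

def parallel_l2_norm(vec, n_nodes):
    partials = [sum(v * v for v in vec[j::n_nodes]) for j in range(n_nodes)]
    return math.sqrt(sum(partials))          # the serial part of the program

vec = list(range(1, 1025))                   # a 1024-element vector, as in Sec. 5.2.2
print(parallel_l2_norm(vec, 4))
assert parallel_l2_norm(vec, 4) == math.sqrt(sum(v * v for v in vec))
```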

5.2.3. Block Matching Algorithm in Motion Estimation

Motion Estimation is one of the important parts of the H.264/AVC standard, which aims at high coding efficiency and good picture quality [17]. Finding the best Motion Vector is central to Motion Estimation. The Block Matching Algorithm in Motion Estimation looks for the best matching block, with the best Motion Vector, in the Reference Frame. Figure 11(a) illustrates the Block Matching Algorithm. As shown in the figure, there is a Current Block (C) in the Current Frame. For a Reference Frame, the Block Matching Algorithm first predicts a Search Center (SC) according to the position of the Current Block (C). Then, it exhaustively checks all search points (i.e., candidate Reference Blocks, e.g., R) in the Search Window (SW) of the Reference Frame to find the best matching block with the best Motion Vector (MV). The position of the Search Window (SW) is decided by the Search Center (SC), while its size is decided by the Search Range (SR). Obviously, a larger Search Window (SW) leads to a more accurate prediction of the best matching block with the best Motion Vector but consumes more computation time. Figure 11(b) shows how the Block Matching Algorithm is parallelized on our OLPC platform. We uniformly assign the candidate Reference Blocks to the PM nodes so that each PM node handles the same number of candidate Reference Blocks. For instance, assume that there are n search points in the Search Window (SW) and two PM nodes take charge of obtaining the best matching block: PM node 0 is responsible for comparing one half of the candidate Reference Blocks with the Current Block (C), while PM node 1 takes charge of comparing the other half. In our experiment, we perform various instances described below (a parallelization sketch follows the list).
(1) We also realize two ways of data storing to reflect the two traffic models. One is "Uniform," meaning that the candidate Reference Blocks are uniformly distributed over all nodes. The other is "Hotspot," meaning that all candidate Reference Blocks are located in the central node.
(2) Only the integer data type is considered, since the data in image processing are integers.
(3) We conduct a Search Window with the size of 128 × 128 (i.e., 16384 candidate Reference Blocks) on the homogenous OLPC with the network size varying from 1 × 1 (1), 1 × 2 (2), 2 × 2 (4), 2 × 4 (8), 4 × 4 (16), 4 × 8 (32), 8 × 8 (64), and 8 × 16 (128) to 16 × 16 (256). The total problem size is fixed, and the problem size assigned to each node varies from 16384 down to 8192, 4096, 2048, 1024, 512, 256, 128, and 64 Reference Blocks.
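The sketch below illustrates the partitioned search, using the sum of absolute differences (SAD) as the matching cost, a common choice that we assume here; the frames and block size are illustrative:

```python
# A sketch of parallelized Block Matching (Figure 11): candidate Reference
# Blocks are dealt round-robin to PM nodes, each node keeps its best SAD, and
# a final reduction picks the best Motion Vector.
def sad(cur, ref):
    return sum(abs(a - b) for a, b in zip(cur, ref))

def block_match(cur_block, candidates, n_nodes):
    # candidates: list of (motion_vector, reference_block) search points
    per_node_best = [
        min(((sad(cur_block, ref), mv) for mv, ref in candidates[j::n_nodes]),
            default=(float("inf"), None))
        for j in range(n_nodes)
    ]
    return min(per_node_best)                # (best SAD, best Motion Vector)

cur = [10, 20, 30, 40]
cands = [((dx, 0), [v + dx for v in cur]) for dx in range(-2, 3)]
print(block_match(cur, cands, 2))            # -> (0, (0, 0))
```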

5.3. Theoretical Speedup Estimation

To compare our theoretical analysis with the real simulation results, we first estimate the theoretical speedups of the three applications.

5.3.1. Wavefront Computation

(1) The program of Wavefront Computation can be fully parallelized; thus $s = 0$ and $f = 0$.
(2) The subtask on each node is the update of the current matrix element $x_{i,j}$ in Figure 9 from $x_{i-1,j-1}$, $x_{i,j-1}$, and $x_{i-1,j}$, its neighboring elements to the above-left, left, and above, respectively. The time $t_s$ of such a subtask (including computation and local memory references) is collected in our experiment; for the same problem size, the measured $t_s$ of the floating-point data type is much larger than that of the integer data type.
(3) For "Uniform" data storing, the elements computed by a PM node are located in the local memory of that PM node. Hence, $x_{i,j}$ and $x_{i,j-1}$ are local, while $x_{i-1,j-1}$ and $x_{i-1,j}$ are remote. There are two packet transmissions ($m = 2$) in a communication, and considering packet concurrency we assume an $m_e$ between 1 and $m$. For "Hotspot" data storing, all elements are located in the central node; hence, there are four packet transmissions ($m = 4$) in a communication. Considering that packet transmissions overlap, we assume an $m_e$ smaller than $m$.

5.3.2. Vector Norm

(1) The program of Vector Norm is partially parallelized. The serial part consumes much time.
(2) Step 1 is parallel. In Step 1, the subtask on each node is the computation of one square $x_i^2$. The time $t_s$ of such a subtask (including computation and local memory references) is collected in our experiment; the measured $t_s$ of the floating-point data type is much larger than that of the integer data type. Because the vector contains 1024 elements, $p = 1024$. Step 2 is sequential; in our experiment, the computation of the square root of the sum takes 32700 cycles for the integer data type and 337560 cycles for the floating-point data type. So $f = 32700 / (1024\,t_s)$ for the integer data type and $f = 337560 / (1024\,t_s)$ for the floating-point data type.
(3) For "Uniform" data storing, the elements used by a PM node in Step 1 are located in the local memory of that PM node, and the result is stored in the central PM node. Hence, there is one packet transmission ($m = 1$) in a communication, so $m_e = 1$. For "Hotspot" data storing, all data are located in the central node; hence, there are two packet transmissions ($m = 2$) in a communication. Considering that packet transmissions overlap, we assume an $m_e$ smaller than $m$.

5.3.3. Block Matching Algorithm in Motion Estimation

(1) The Reference Frame has been computed and stored in the on-chip local memories during the previous Motion Estimation. In the current Motion Estimation, the "Block Matching" processing does not start until the Current Block of the Current Frame has been transferred from the off-chip DRAM into the on-chip memory. The elapsed time of transferring the Current Block from the off-chip DRAM into the on-chip memory is the serial part of the Block Matching Algorithm. In our OLPC platform, the central PM node features an External Memory Interface connecting to the off-chip DRAM. The External Memory Interface reads a datum from the DRAM in 20 cycles, and the size of the Current Block is 16 × 16. Hence, for "Hotspot" data storing, where all data are stored in the central PM node, the data transfer takes 5120 (= 16 × 16 × 20) cycles. For "Uniform" data storing, where data are uniformly stored in each PM node, the Current Block is transferred from the DRAM to the External Memory Interface and routed to all PM nodes in a broadcast way, so the time of the Current Block's transfer is 16 × 16 × 20 cycles plus a small broadcast latency (a packet from the central node to a corner node takes on the order of $\sqrt{N}$ hops), approximately equal to 5120 cycles. The subtask on each node is the comparison of the Current Block and a candidate Reference Block, consuming 7680 cycles. The problem size is 128 × 128, so the parallel part takes 125829120 (= 7680 × 128 × 128) cycles. Hence $f = 5120 / 125829120 \approx 4 \times 10^{-5}$.
(2) The subtask on each node is the comparison of the Current Block and a candidate Reference Block. The time of such a subtask (including computation and local memory references) is collected in our experiment: $t_s = 7680$ cycles.
(3) For "Uniform" data storing, the Current Block and the candidate blocks are located in each PM node, so there is no network communication and the communication term vanishes ($m = 0$). For "Hotspot" data storing, the Current Block and the candidate blocks are in the central PM node; hence, there are 512 (= 16 × 16 × 2) packet transmissions ($m = 512$) in a communication. Considering that so many packets are routed to the central node, the network contention is extremely heavy and we assume that $m_e$ is close to its upper bound $m$.

Then, using Formulas (9) and (15), we estimate the theoretical speedups of the three applications.
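As an illustration of the plug-in procedure, the sketch below feeds the Block Matching parameters of Section 5.3.3 into Formulas (9) and (15); $m_e = 512$ for the hotspot case is the upper-bound assumption made above, so the resulting numbers are estimates, not measurements:

```python
# A sketch of Section 5.3's estimation step for the Block Matching Algorithm:
# f ~ 5120/125829120, t_s = 7680 cycles, m = 0 under "Uniform" (no traffic),
# and m_e assumed at its bound m = 512 under "Hotspot".
def s_uniform(N, f, t_s, m_e, t_h=1.0):      # Formula (9)
    h_bar = 2 * (N - 1) / (3 * N ** 0.5)
    return (f + 1) / (f + (1 + m_e * h_bar * t_h / t_s) / N)

def s_hotspot(N, f, t_s, m_e, t_h=1.0):      # Formula (15)
    h_bar = (N - 1) / (2 * N ** 0.5)
    return (f + 1) / (f + 1 / N + m_e * h_bar * t_h / t_s)

f_bma, t_s_bma = 5120 / 125829120, 7680
for N in (1, 4, 16, 64, 256):
    print(N, round(s_uniform(N, f_bma, t_s_bma, 0), 1),
          round(s_hotspot(N, f_bma, t_s_bma, 512), 2))
```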

5.4. Simulation Results

The real speedups of the three applications are calculated from the simulation results on our homogenous OLPC experimental platform. (Because the sequential part of the Vector Norm program dominates, its performance improvement is limited.)

5.5. Analysis and Discussion
5.5.1. Effect of Network Size

The effect of network size on the performance reflects the scalability of homogenous OLPCs. Figures 12, 13, 14, 15, and 16 plot the real and theoretical speedups versus the size of the homogenous OLPC from 1 × 1 (1), 1 × 2 (2), 2 × 2 (4), 2 × 4 (8), 4 × 4 (16), 4 × 8 (32), 8 × 8 (64), and 8 × 16 (128) to 16 × 16 (256). From the five figures, we can see that (i) the theoretical speedups have the same trend as the real speedups; (ii) for the uniform traffic model, the speedup usually increases when the network size is scaled up; and (iii) for the hotspot traffic model, the speedup reaches its maximum when the network size is scaled up to a certain size and begins to decrease when the network size keeps increasing.

5.5.2. Effect of Traffic Models

Figure 12 shows the effect of the traffic models on the real and theoretical speedups of both integer and floating-point Wavefront Computation, Figure 13 shows the same for integer and floating-point Vector Norm, and Figure 14 shows the same for the Block Matching Algorithm in Motion Estimation.
(i) For the uniform traffic model, consistent with the theoretical speedup performance model, the real speedup increases as the network size is scaled up, no matter whether the data type is integer or floating-point. This is because the contention latency induced by uniform traffic is not enough to cancel the performance improvement introduced by parallelization; however, it can slow the improvement down.
(ii) Because the hotspot traffic model incurs heavy contention latency, the speedup increases when the network size is small but begins decreasing once the network size is scaled up beyond a certain finite value. Using (17), we can calculate the network size $N_{\mathrm{opt}}$ of the maximal speedup. (i) For Wavefront Computation with the integer data type, $N_{\mathrm{opt}}$ falls between 32 and 64, so Figure 12(a) shows that both the theoretical and the real speedups go up from 1 × 1 to 4 × 8 (32), the speedups on 8 × 8 (64) are approximately equal to those on 4 × 8 (32), and the speedups turn to fall as the network size goes on increasing to 16 × 16 (256). (ii) For Wavefront Computation with the floating-point data type, $N_{\mathrm{opt}}$ exceeds 256, so Figure 12(b) shows both the theoretical and the real speedups ascending over the whole range from 1 × 1 to 16 × 16 (256). (iii) For Vector Norm with the integer data type, $N_{\mathrm{opt}}$ also falls between 32 and 64, so Figure 13(a) shows that both the theoretical and the real speedups go up from 1 × 1 to 4 × 8 (32), the speedups on 8 × 8 (64) are approximately equal to those on 4 × 8 (32), and the speedups fall when the network size goes on increasing to 16 × 16 (256). (iv) For Vector Norm with the floating-point data type, $N_{\mathrm{opt}}$ exceeds 256, so Figure 13(b) shows both the theoretical and the real speedups ascending from 1 × 1 to 16 × 16 (256). (v) For the Block Matching Algorithm, $N_{\mathrm{opt}}$ is around 16, so Figure 14 shows that both the theoretical and the real speedups go up from 1 × 1 to 4 × 4 but fall when the network size is from 4 × 8 (32) to 16 × 16 (256). From Figure 14, we can also see that the speedup under the hotspot traffic model is very small, because the Block Matching Algorithm in Motion Estimation makes a large number of memory accesses flow towards the central PM node and hence results in huge network contention.
(iii) Because the hotspot traffic model incurs much more network contention latency than the uniform traffic model, the speedup with hotspot traffic is smaller than that with uniform traffic for the same network size, and the difference becomes larger as the network size increases.
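As a consistency check of the hotspot analysis against the Block Matching parameters of Section 5.3.3 ($t_s = 7680$ cycles, $t_h = 1$, and the assumed $m_e \approx m = 512$), the closed-form estimate of Formula (18) lands near the observed peak:

```python
# Estimate of the optimal network size for hotspot Block Matching, using
# Formula (18) with the (partly assumed) parameters of Section 5.3.3.
t_s, m_e, t_h = 7680, 512, 1
n_opt = (4 * t_s / (m_e * t_h)) ** (2 / 3)
print(round(n_opt, 1))   # ~15.3, i.e., close to 4 x 4 (16), matching Figure 14
```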

5.5.3. Effect of Noncommunication/Communication Ratio

Figure 15 shows the effect of the noncommunication/communication ratio on the real and theoretical speedups of Wavefront Computation under both uniform and hotspot traffic models, and Figure 16 shows the same for Vector Norm.
(i) For the same network factors, the theoretical and real speedups for the floating-point data type are higher than those for the integer data type. This is as expected: when the noncommunication time increases, the portion of communication latency becomes less significant, thus achieving higher performance.
(ii) For the hotspot traffic model, the increase of the noncommunication/communication ratio shifts the optimal network size ($N_{\mathrm{opt}}$) to a larger value. For the integer data type, $N_{\mathrm{opt}}$ lies between 32 and 64 for both Wavefront Computation and Vector Norm; for the floating-point data type, $N_{\mathrm{opt}}$ exceeds 256 for both.

6. Applicability and Limitation

6.1. Applicability

The target architectures and applications of our study are homogenous OLPCs and data-parallel applications, respectively. Homogenous OLPCs are on-chip computing platforms that have a number of computing cores running in parallel and an on-chip network providing concurrent pipelined communication, and data-parallel applications represent a wide range of applications whose data sets can be partitioned into data subsets handled individually by the same program running in different processor cores. Scalability is the common characteristic of both. Considering that homogenous OLPCs and data-parallel applications match each other very well in nature, and hence that data-parallel applications can obtain good speedup on homogenous OLPCs, the performance model proposed in the paper is applicable to homogenous OLPCs running data-parallel applications.

The performance model is general for homogenous OLPCs and a variety of data-parallel applications. For any particular application, a customized many-core platform such as an application-specific architecture or a hardware accelerator will be superior, but a NoC-based homogenous OLPC would be better than such custom-designed hardware when a variety of data-parallel applications share the same OLPC. A custom-designed many-core platform is specific and thus falls outside the range of general homogenous OLPCs. The GPU (Graphics Processing Unit) is such a hardware accelerator for graphics processing, as the name suggests. Although the GPGPU (General-Purpose GPU) exhibits generality to some extent by providing programmability, it is still specific, because the programmable GPU adopts a special structure for accelerating graphics processing applications and its interconnections are specialized for operations such as stream processing and data shuffling that are common in graphics processing. Therefore, the GPGPU is not in the range of homogenous OLPCs. Besides data-parallel applications, there exist other applications that do not have the scalability feature, so the proposed model is not applicable to those applications' performance analysis.

6.2. Limitation

The proposed performance model is not suitable for many-core platforms in specific application areas or for applications without the characteristic of scalability. The purpose of the model is to offer a general but workable way to estimate and evaluate the performance of homogenous OLPCs for data-parallel applications. Because the network communication latency and the ways of storing the data of data-parallel applications are two of the most significant factors affecting the performance of homogenous OLPCs, they are emphasized and modeled in detail when we establish the speedup performance model. Processor behavior such as the cache hierarchy and cache misses is therefore not considered. We assume that all of the data are moved from the external memory to the appointed on-chip memory regions in the different nodes before the system handles the data, and the performance is measured from the time when the system begins handling the data. Even if the data were fed continuously from the external memory, part of that latency could be hidden by the data handling; since we emphasize analyzing the effect of the data storing ways, the model does not describe the situation in which data are moved from the external memory.

7. Conclusion

Understanding the speedup potential that homogenous OLPC computing platforms can offer is a fundamental question in continually pursuing higher performance. This paper has focused on analyzing the performance of homogeneous OLPCs for data-parallel applications. Because the enhancement of application performance in OLPCs may be restricted by the increasing network communication latency even though the number of cores increases, one main issue for the analysis is to properly capture the network communication. We first detailed a network communication latency model by proposing two abstract concepts (equivalent serial packet and equivalent serial communication). Then, based on the network communication latency model, we proposed the performance model. By considering the uniform and hotspot traffic models, the performance model takes two detailed forms that reflect the distributed way and the centralized way of storing the data of data-parallel applications. Essentially, the performance model revisits Amdahl's Law in the context of homogenous OLPCs. Theoretical analysis and real application experiments demonstrate that our model provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogenous OLPCs.

In the future, we plan to extend the performance model by considering the cache hierarchy and cache miss and the external memory access. Another direction is to emphasize studying the effect of topologies and communication protocols on the performance models of homogenous OLPCs.

Notations

$k$: Number of nodes in each dimension
$N$: The number of processor nodes, $N = k^2$
$s$: The number of subtasks in the serial part of a program
$p$: The number of subtasks or communications in the parallel part of a program
$f$: The ratio between the serial part and the parallel part in a program, $f = s/p$
$t_c$: The time of a communication
$t_s$: The execution time of a subtask, that is, noncommunication time
$\bar{h}$: Average hop count of transmitting a packet
$t_h$: The time of transmitting a packet in one hop
$t_p$: Average time of transmitting a packet in the network
$m$: The number of packets in a communication
$m_e$: The number of equivalent serial packets in a communication
$c_e$: The number of equivalent serial communications in a program
$T_C$: The communication overhead of a program on OLPCs
$S$: Speedup
$S_{\max}$: Maximal speedup
$S_{\min}$: Minimal speedup.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research is partially supported by the Hunan Natural Science Foundation of China (no. 2015JJ3017), the Doctoral Program of the Ministry of Education in China (no. 20134307120034), and the National Natural Science Foundation of China (no. 61402500).