Abstract

This paper proposes a new replica placement algorithm that expands the exhaustive search limit with reasonable calculation time. It combines a new type of parallel data-flow processor with an architecture tuned for fast calculation. The replica placement problem is to find a replica-server set satisfying service constraints in a content delivery network (CDN). It is derived from the set cover problem which is known to be NP-hard. It is impractical to use exhaustive search to obtain optimal replica placement in large-scale networks, because calculation time increases with the number of combinations. To reduce calculation time, heuristic algorithms have been proposed, but it is known that no heuristic algorithm is assured of finding the optimal solution. The proposed algorithm suits parallel processing and pipeline execution and is implemented on DAPDNA-2, a dynamically reconfigurable processor. Experiments show that the proposed algorithm expands the exhaustive search limit by the factor of 18.8 compared to the conventional algorithm search limit running on a Neumann-type processor.

1. Introduction

Content delivery networks (CDNs) [13] are being developed to improve the user’s experience when downloading voluminous files such as music and videos, a rapidly growing component of the traffic on the Internet. A CDN consists of two types of servers: origin server and replica server. The original data is stored in the origin server and then copied to the replica servers, which are geographically distributed. A user requesting content is connected to a replica server automatically selected by the network, which then sends the content to the user. Replica selection is based on the distance between the server and the user, and usually, the closest server is selected [4].

One important issue in CDN performance is replica placement [5]. The problem is in deciding which servers are to hold which replicas. Replica servers cache the original servers’ contents to prevent traffic congestion and to maintain user performance. They allow CDN providers to minimize the capital expenditures (CapEx) and operational expenditures (OpEx). Note that cache size is restricted; no replica server can hold all of the contents held by the origin server. For each content, we must pick those servers, not all, that will hold the replica. In addition, in order to achieve adequate user performance, each replica server must have a limited delivery area. The delivery area is expressed as the distance from the replica server. Each user must lie within the delivery area of at least one replica server. For maximizing resource utilization, the number of replicas holding a content should be minimized while satisfying the delivery area constraint.

Selecting a combination that satisfies the constraint conditions in all combinations for replica placement is called the set cover problem. The replica placement problem is an extension of the set cover problem [6, 7], which is known to be NP-hard [8]. Several greedy algorithms have been proposed with the aim of decreasing the calculation time, since it rapidly increases when the number of servers is large [5, 912]. Greedy algorithms are widely used because of their simplicity. However, it has been mathematically proved that no greedy algorithm can guarantee the optimal solution. The optimal solution is the minimum replica-server set that satisfies the constraints, for example, quality of services. Our simulations also show that over 90 percent of the solutions output by the greedy algorithm are not optimal; we must search the entire solution space to get the optimal solution. There is a report to try exhaustive search [13] with liner programming, but they define sever groups and fixes location of servers and reduce the number of combinations. This requires an impractically long time if the network is large, especially when the calculations are run on a Neumann-type sequential processor.

In this paper, we propose a fast calculation method that uses parallel processing for the replica placement problem. Users’ demands for guarantees of the quality of services trigger constraints such as the latency for content access and download time. The demands of providers, on the other hand, focus on maintaining the service level within minimum replica placement. Since contents and user requests are dynamically changing, it is necessary to rework replica placement dynamically. Shorter computation time contributes to satisfying the user and provider demands without delay.

The replica placement problem is to find the minimum replica-server set that satisfies all constraints such as quality. Our algorithm is not only for split Beeler’s algorithm into parallel tasks. Our proposal is to expand the exhaustive search limit within reasonable calculation time by extract the maximum hardware performance with our architecture. Our proposed algorithm divides all patterns of replica server selection into groups. Each group is run in parallel using the pipeline technique. We prove how many groups are optimal. This algorithm suits the architectures of current dynamically reconfigurable processors, most of which have many processor elements (PEs). While the time complexity of the conventional method is , that of the proposed algorithm is , where is the combination number, is the number of servers in the network, and is the number of selected servers in that search. In addition, we implement the proposed algorithm on DAPDNA-2, a commercial dynamically reconfigurable processor developed by IPFlex Inc. [14]. Experiments show that our proposed algorithm reduces the execution time by a factor of 18.8 compared to the conventional method on an Intel Pentium 4 2.8 GHz.

The rest of the paper is organized as follows. Section 2 describes the replica placement problem and related work. Section 3 explains our proposed algorithm for fast solution of the replica placement problem. Section 4 describes DAPDNA-2 implementation and performance evaluation results of our experiments. Finally, Section 5 concludes the paper.

2. Replica Placement Problem

Let be an undirected graph, where is the set of servers and is the set of links between any two servers. Link is associated with cost , which denotes the cost of the link between servers and . It is assumed that the graph is connected so that any server can connect to any other server via a path. The cost of a path is defined as the sum of the costs of the links along the path and the cost of nodes between a user and a replica server. is the cost of the shortest path between and .

The origin server, which at the start is the only server holding the data, is expressed as . A replica server is a server that is assigned a replica. Each client is connected to a local server and sends its request to this server. If the server has the data requested, the request is processed locally. Otherwise, the request is redirected to the closest server that has a replica of the data requested. The client then obtains the data from that replica server. Every server has a distance constraint, , to guarantee users’ performance such as response time and download speed. A request from a client associated with server must be serviced by a replica server within distance constraint . If the closest replica server lies outside , the request cannot be resolved. Therefore, each server must be connected to at least one replica server that can satisfy the constraint condition.

The replica placement problem is to decide which servers are to be selected as replica servers while satisfying the distance constraint, see (1), and minimizing the number of replica servers. A binary bit pattern of server selection is called a replication strategy and is expressed as a set of replica servers, . If a replication strategy meets the constraint requirement, it is called feasible

Figures 1 and 2 illustrate examples of replica placement. The numbers in the rings are the index number of the server; they lie between 0 an , where is the total number of servers. The number on a link is the sum of the cost of the link between nodes and both nodes. In both figures, the origin server is server 0, and the distance constraint is 8, so the replication strategies shown are feasible.

The replica placement problem is derived from the set cover problem, which is known to be NP-hard. The definition of the set cover problem is as follows.

Minimum Weight Set Cover Problem
Let be the universal set and the family of . The solution is subfamily such that the weight is minimized and is satisfied.

The replica placement problem is also NP-hard, since it has been proved that the minimum weight set cover problem is NP-hard. The computational complexity of replica placement is extremely high, especially when the number of servers in the network is large. Some greedy algorithms have been proposed [5, 912]. Johnson proposed a greedy algorithm for the minimum weight set cover problem [15]. The algorithm is a straightforward heuristic algorithm, requiring time . However, it has been mathematically proven that no greedy algorithm can guarantee the optimal solution. For example, Figure 1 shows replica placement with the greedy algorithm that selects the server that has the largest cover area. It is not the optimal solution, since the number of replica servers is 2, as shown in Figure 2. In addition, our simulation results show that about 90% of the replication strategies output by the greedy algorithm are not optimal as shown in Figure 3. In Figure 3, node degree represents the number of links connected to a neighbor node. Greedy algorithms are realistic, since suboptimal solutions are obtained within realistic time. However, many of the solutions deviate from the optimal solution by over 100%. If we can obtain the optimal solution within realistic time, we do not need to pay the cost penalty of the suboptimal solutions.

In order to obtain the optimal solution, a fast exhaustive search algorithm is required. This approach is unrealistic if implemented on a Neumann-type sequential processor, since the number of replication strategies is too large.

3. Proposed Method

3.1. Overview

In this paper, we propose a fast calculation method to obtain the optimal solution for replica placement. The proposed architecture is based on exhaustive search, because only exhaustive search can obtain the optimal solution. Figure 4 overviews the procedures of our proposal. It consists of two phases: generation phase and inspection phase. In the generation phase, all replication strategies are calculated. The inspection phase checks whether each replication strategy can cover all servers and satisfy the distance constraint. Finally, the replication strategy with the minimum number of replica servers is selected as the optimal solution.

In the generation phase, calculation of the optimal division number of combinations and the seed replication strategies are complicated and not repeatable, so software is suitable for implementation. Generation of replication strategy with binary bit patterns (replication strategies) and inspection of cover area are simple and repeatable, so hardware is suitable for implementation. Therefore, our proposed architecture needs both hardware for software execution and binary bit operation.

Candidate hardware for implementation is FPGA (field programmable gate array) and a reconfigurable processor. It is difficult to create software-execution function on FPGA, because we need private compiler. So, we selected a reconfigurable processor. Though there are many reconfigurable processors, we selected DAPDNA-2 because of that DAPDNA-2 combines high-performance RISC (reduced instruction set computer) processor and high-capacity process elements, and hardware configuration can be changed by one clock. We implemented the proposed algorithm on DAPDNA-2 closely cooperated with hardware DAPDNA-2 and software and drew out the maximum performance of it. DAPDNA-2 is commercially available dynamically reconfigurable processor developed by IPFlex Inc. [14, 16].

In order to employ parallel pipeline execution to speed up exhaustive search, the replication strategies are divided into several groups in the generation phase. The replication strategies of each group are simultaneously generated using Beeler’s algorithm [17, 18], which generates the next replication strategy from the previous replication strategy. Beeler’s algorithm is a sequential algorithm, so it is not suitable to obtain the th replication strategy directly. To divide all replication strategies into groups, we propose a new algorithm that calculates the th replication strategy directly. In addition, determining the division number that minimizes the calculation clock total is a problem to be solved. We derive the optimal number of divisions theoretically in Section 3.2.4.

3.2. Bit Pattern Generation Phase
3.2.1. Bit Pattern Notation

First, the replication strategy notation is introduced. A replication strategy can be expressed as an -bit bit pattern, where the th bit corresponds to the server indexed as and is the total number of servers in the network. 1 on the th bit means server is selected as a replica server. By using this notation, when (≤) servers are picked from servers, -bits of the -bit bit pattern are set to 1, where is the number of original and replica servers. For example, 010110 represents the replication strategy for 6 servers. These bit patterns allow us to sort replication strategies in ascending order. Figure 5 shows the bit pattern order when servers are selected from servers. It seems that these values increase irregularly, since the differences of neighboring values are not constant.

[Replication strategy covers all servers] [Replication strategy does not cover all servers].

3.2.2. Beeler’s Algorithm

Our proposed algorithm aims to minimize the hardware operation clocks to minimize calculation time. A simple counter that generates ascending values needs many clocks to find the combination binary patterns because the “1” bit-combination patterns in ascending values are random, and there is no regularity. On the other hand, Beeler’s algorithm ensures regularity and needs fewer clocks than a simple counter. Accordingly, we utilize Beeler’s algorithm.

Beeler proposed an algorithm that generates all combinations, where outcomes are picked from possibilities in ascending order [17, 18]. These combinations can be expressed as -digit binary sequences. For example, the binary bit pattern “010101” represents replica server numbers (0, 2, 4) when , as shown at seed binary bit pattern of group 2 in Figure 5. Combinations can be ordered in ascending order. Replica server numbers (0, 2, and 4) are smaller than replica server numbers (1, 3, and 4), because the binary bit pattern “010100” is smaller than the binary bit pattern “011010”. Beeler’s algorithm can generate all bit patterns from “000111” to “111000” in order. The algorithm is explained as follows.

There are five steps to generate the next binary bit pattern- from the original binary bit pattern-.

Beeler’s Algorithm
There are five steps to generate the next binary bit pattern- from the original binary bit pattern-.(1)Let be given, as all bits are unset except for the least significant bit 1 in binary pattern-.(2).(3)Let be given, as all bits are unset except for the least significant bit 1 in binary pattern-.(4) [≫: shift right -bit].(5), and is next to . When , , , for example, is calculated as follows:(1),(2), (3),(4),(5).

With Beeler’s algorithm, we can generate all bit patterns, representing replication strategies, in ascending order. For example, when , bits are selected from bits; we start from 000111 as the first bit pattern, and from this, we get the next bit pattern 001011. Consequently, by sequential calculation, 19 iterations of Beeler’s algorithm produces all bit patterns shown in Figure 5.

Parallel pipeline execution is a technique to speed up exhaustive search. However, Beeler’s algorithm in itself is not applicable to parallel pipeline execution, since it has data dependency; we need the previous bit pattern to calculate the next bit pattern. In parallel pipeline execution, we divide all replication strategies into groups, and the bit patterns of each group are simultaneously generated from the first value of each group. The first replication strategy of a group is called the group’s seed. Therefore, an algorithm that can calculate the bit pattern of a seed directly is required for parallel pipeline execution. Unfortunately, Beeler’s algorithm is not suitable for this purpose, and no known algorithm is useful either. We propose here an algorithm to calculate the bit patterns of the seeds directly.

3.2.3. Direct Calculation of a Seed Bit Pattern

Our proposed algorithm that generates any-order patterns in sequences that are sorted in ascending order. More generally, a nonordered pattern is generated by the following equation. We use the characteristic of the following equation to calculate the bit pattern representing the th replication strategy. Here, is the bit pattern number, the bit pattern representing the th replication strategy is called the th pattern To get the th pattern, find the smallest that satisfies means the bit patterns whose most significant bit with value “1” is the th bit; there are “1”s between the 1st bit and the th bit because there are “1”s in total. Hence, the th bit of the th pattern is “1”. The th pattern corresponds to -th in . Replace as follows: Next, find the smallest that satisfies means the patterns whose most significant bit with value “1” is the th bit; there are “1”s between 1st and th bit. Hence, the th bit of the pattern is “1”. can be obtained by repeating times in a similar way. Setting the corresponding bit to “1” yields the th pattern.

For example, the 6th pattern in can be obtained as follows: Apply (2) to , because includes the 6th pattern. Hence, , . Apply (2) to , because includes the 2nd pattern. Hence, , . The 1st pattern corresponds to . Hence, . Set the corresponding bit to 1 which yields the 6th pattern, 010101.

After dividing all bit patterns, pick -bits from -bits, into groups, and then calculate the bit patterns of the seeds, such as 1, th bit patterns using the above algorithm.

3.2.4. Optimal Number of Groups

Determining the number of groups is a problem that must be solved prior to parallel pipeline execution of exhaustive search. As the number of groups increases, the calculation clocks needed to generate the bit patterns of the seeds increases. The optimal number of groups depends on the number of all replication strategies and the calculation clocks of Beeler’s algorithm. In order to solve this problem, we introduce here the theoretical optimal number of groups.

Let be the number of clocks needed to calculate the bit pattern of a seed and the number of clocks needed to execute Beeler’s algorithm. clocks are required to generate all combinations created by picking outcomes from possibilities. When we divide all replication strategies into 2 groups, clocks are required. When we divide them into 3 groups, clocks are required. More generally, when we divide them into groups, clocks are required as follows: According to the relationship between arithmetic mean and geometric mean, The equality is satisfied if and only if . Hence, This is the optimal division number.

3.3. Cover Inspection Phase

In the inspection phase, we check whether each replication strategy is feasible or not. The cover area of each replica server represents the distance constraint, and it can be calculated using a shortest path algorithm such as Dijkstra’s algorithm [19]. In this paper, it is assumed that the data representing the cover areas for all servers are calculated with Dijkstra’s algorithm before the inspection phase. The data is expressed as a bit pattern as in the replication strategies. In a network with servers, we use an -bit bit pattern to represent the cover areas. The -bit of the bit pattern corresponds to the cover area of one server. The value of 1 on the th bit means that the server indexed as is covered. For example, 010110 indicates that servers are covered.

Figure 6 shows the procedure of the cover phase when the replication strategies are and in the network shown in Figure 2. The data of the cover area of the servers where the replicas are placed is looked up using the bit pattern of a replication strategy. Next, all looked-up data is subjected to Bit-wise OR operation. If all -bits of the result are 1, the replication strategy covers all servers. Otherwise, it does not cover all servers.

4. DAPDNA-2 Implementation and Evaluation

4.1. DAPDNA-2 Implementation

DAPDNA-2 is a heterogeneous dual core processor. It consists of two cores: DAP (digital application processor), and DNA (distributed network architecture). These processors have different architectures. DAP is a 32-bit RISC core for controlling DNA; DNA is a reconfigurable parallel data-flow machine. Typically, the main data processing of an algorithm is run on DNA. DNA consists of 376 processor elements (PEs), each comprising computation units, internal memory, synchronizers, and counters; the PEs are arranged in a grid pattern. We can design the connections between PEs when implementing an algorithm on DAPDNA-2 to yield a parallel data-flow machine. Each structure is called a configuration. DNA can keep three configurations in its own internal cache. These configurations can be switched within one clock. Thus, this chip combines the advantages of the high-speed processing of hardware and the flexibility of software. Details of its architecture are described in [16]. Let be the number of nodes and (≤) the number of original and replica servers. In our implementation, , since PE word size in DNA is 32-bits long.

Figure 7 shows a block diagram of the DAPDNA-2 implementation design, and Figure 8 shows a block diagram of the cover inspection unit as implemented on DNA; see Figure 7. We must generate and check huge number of combinations to solve the replica placement problem. So, we focus on reducing the calculation time for generation phase and inspection phase with parallel and pipeline computation and increasing the number of parallel computation. So, we considered four major issues in implementation.(1)We designed Beeler’s algorithm unit and inspection unit in Figure 7 to operate with minimum clocks and minimum PEs.(2)We set the depth of the pipeline operation for Beeler’s algorithm unit in Figure 7 to 18. This means that we can get the first output after 18 clocks and the next outputs arrive one by one at each clock until the pipeline is empty.(3)We implemented inspection unit in Figure 7 as four divided cover-area-data tables with 8-bit indexing to reduce the PE resources and operation clocks. Because direct 32-bit indexing needs bytes of memory and operation clock is 1, the bit-position-number search method, shown in Figure 6, needs many clocks to find bit-position-number and to index cover-area-data tables. Therefore, we produced the cover-area-data tables shown in Figure 8; they need only four memory tables with byte memories and operation clock is 4.(4)We could minimize the PE resources for Beeler’s algorithm unit and inspection unit and create four blocks for parallel calculation in DNA.

Calculation is executed in the following 8 steps.Step 1: DAP calculates the seed binary bits patterns (replication strategies) using the algorithm described in Section 3.2.Step 2: DAP writes seed replication strategies into external memory.Step 3: DNA reads seeds replication strategies in external memory.Step 4: Beeler’s algorithm unit in DNA executes Beeler’s algorithm by pipeline calculation. The depth of pipeline calculation in Beeler’s algorithm calculation unit is 18; we can get the first output after 18 clocks, and the next outputs arrive one by one at each clock until the pipeline is empty. So, we divide all replication strategies into 18 groups per calculation to minimize the calculation time.Step 5: Inspection unit in DNA checks the cover area, which is the output of Beeler’s algorithm unit. The principle of the inspection algorithm is described in Figure 8 shows the implementation of the inspection unit. The bit pattern representing a replication strategy, which is 32-bits long, was divided into 4 data blocks; each block is 8-bits long, and the cover area is looked up using the 8-bit data. The 4-bit patterns of the looked-up cover area are then subjected to the bitwise OR operation. Finally, the result of the OR operation is compared to 0xFFFFFFFF.Step 6: Inspection unit in DNA writes the inspection results to external memory.Step 7: DAP reads the inspection results in external memory.Step 8: DAP checks the inspection results and decides replica placement.

4.2. Evaluation

We conducted experiments to evaluate the performance of our proposal; values of were increased until a solution was found, where is the number of replica and original servers. Three types of network models were considered: grid network ( (=3, 4, 5, 6, 7, 8)), ring network ( (=3, 4, 5, 6, 7, 8)), and nation-wide network . Figure 9(a) shows the nation-wide network model. The distance constraint is expressed as the maximum number of hops . We compare the performances of our proposed approach and the conventional approach. The implementation mentioned above is used as the proposed approach, and the conventional approach, sequential Beeler’s algorithm, was executed on a Intel Pentium 4 3.6 GHz processor. In the proposed approach, 4 pairs of Beeler’s algorithm unit and the inspection unit were used as shown in Figure 7. The conventional approach was implemented as a C program and compiled with GNU C Compiler (GCC) with “-O3” optimization option.

Figure 10 shows the execution time of the proposed and the conventional algorithm approach for the ring, grid, and nation-wide network models, where the distance constraint . The grid and ring network model each have nodes, while the nation-wide network model has nodes. The result shows that the proposed algorithm has much shorter execution times than the conventional algorithm for all network models. It is over 16 times, 14 times, and 12 times faster for the ring, grid, and nation-wide network models, respectively. This is because the proposed algorithm makes good use of the parallel and pipeline calculation provided by DAPDNA-2. The conventional algorithm is a strictly sequential algorithm, and a lot of replication strategies must be generated and searched. Note that execution time is strongly dependent on topology. For example, the proposed approach takes 363 milliseconds for the ring model but only 9.5 milliseconds for the nation-wide network model. This is due to the difference in average node degrees of the topologies. The average node degree of ring network model is 2, grid network model is 3.25, and nation-wide network model is 4.1. So, the cover area of a replica server in the ring network model is smaller than that of in the grid network model, and that of grid network model is smaller than that of nation-wide network model. When the cover area is small, an increased number of binary combination patterns must be generated and inspected. Therefore, the execution time increases.

Next, we investigate the characteristics of our proposed algorithm. Figure 11 shows the execution time of the proposed algorithm versus the node cost constraint. We evaluated node cost, where link cost is constant and node cost is variable by hops, on the grid and ring networks, where is the number of hops; the number of nodes . As increases, the execution time decreases. This is because the cover area of a replica server is large when is large; in other words, the latency constraint is relaxed. Larger cover areas can reduce the execution time and reduce the number of replica servers.

Figure 12 shows the execution time of the proposed algorithm versus the number of nodes on the ring and grid networks. The execution time increases as the number of nodes and the number of replica servers increases. The execution time is proportional to the number of combinations created by the number of nodes and the number of replica servers, because the generated combination binary patterns and inspection patterns increase. The calculation load greatly increases with the number of nodes.

We can theoretically estimate the execution time when the number of nodes. The experiment showed that the execution time per bit pattern of DAPDNA-2 and Pentium 4 was 1.5 nanoseconds and 28 nanoseconds, respectively. If the execution time per combination binary patterns is expressed as , the total execution time is given by where is the minimum number of replica servers for which the replication strategy is feasible. To identify , we assume the grid network model as shown in Figure 9(b) with nodes. The ideal value of when is given by where is the average node degree of the network. In the grid network, the average degree From (12), (13), and (14), and the execution time per binary combination pattern of DAPDNA-2, and Pentium 4, we can estimate the execution times even if the number of nodes .

Figure 13 show the execution time of the proposed algorithm and of the conventional algorithm derived from the above equations, where , , and from 2 to 32. The execution time of the proposed algorithm is 18.8 times faster than that of the conventional algorithm for all cases examined.

5. Conclusion

This paper proposed a fast calculation method to solve the replica server placement problem; exhaustive search is implemented in parallel form to discover the optimum solution in practical time. To achieve fast exhaustive search, we employ parallel pipeline execution. Exhaustive search consists of two phases: the generation phase and inspection phase. In order to be able to use parallel pipeline execution, we have introduced the following two points:(1)calculation method that directly determines the th replication strategy expressed as a bit pattern,(2)theoretical proof of the optimal number of groups.

In addition, we implemented our proposed method on DAPDNA-2, a dynamically reconfigurable processor. The results of experiments on ring, grid, and nation-wide networks showed that our proposed method is 18.8 times faster than the conventional method on an Intel Pentium 4. This is because the proposed method can realize close cooperation between hardware (DAPDNA-2) and software, and thus fully utilize the performance of DAPDNA-2. Therefore, the proposed algorithm expands the exhaustive search limit by a factor of 18.8 compared to the conventional algorithm limit running on a Neumann-type processor.

Acknowledgments

This work is supported by the PREDICT program of the Ministry of Internal Affairs and Communications (MIC) of Japan. This work is also supported in part by a Grant-in-Aid for the Global Center of Excellence for high-Level Global Cooperation for Leading-Edge Platform on Access Spaces from the Ministry of Education, Culture, Sport, Science, and Technology in Japan.