#### Abstract

We consider a problem of minimum cost (energy) data aggregation in wireless sensor networks computing certain functions of sensed data. We use in-network aggregation such that data can be combined at the intermediate nodes en route to the sink. We consider two types of functions: firstly the summation-type which includes *sum*, *mean*, and *weighted sum*, and secondly the extreme-type which includes *max* and *min*. However for both types of functions the problem turns out to be NP-hard. We first show that, for *sum* and *mean*, there exist algorithms which can approximate the optimal cost by a factor logarithmic in the number of sources. For *weighted sum* we obtain a similar result for Gaussian sources. Next we reveal that the problem for extreme-type functions is intrinsically different from that for summation-type functions. We then propose a novel algorithm based on the crucial tradeoff in reducing costs between local aggregation of flows and finding a low cost path to the sink: the algorithm is shown to empirically find the best tradeoff point. We argue that the algorithm is applicable to many other similar types of problems. Simulation results show that significant cost savings can be achieved by the proposed algorithm.

#### 1. Introduction

*Motivation*. In this paper we consider the problem of minimum cost (energy) data aggregation in wireless sensor networks (WSN) where the aggregated data is to be reported to a single sink. A common objective of a WSN is to retrieve a certain *summary* of the sensed data instead of the entire set of data. The relevant summary is defined as a certain function applied to a set of measured data [1]. Specifically we are given a function $f$ such that, for a set of measurement data $X_1, \dots, X_n$, the goal of the sink is to retrieve $f(X_1, \dots, X_n)$. Examples of $f$ are mean, max, min, and so forth. When the mean function is used, $f(X_1, \dots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i$. For applications such as “alarm” systems, one can use max as $f$, for example, where $X_i$ can be temperature values in forest-fire monitoring systems or the structural stress values measured in a building. We will refer to $f$ as a *summary function* throughout this paper. Certain types of $f$ allow us to combine data at the intermediate nodes en route to the sink. Such combining techniques are commonly referred to as *in-network aggregation* [2–4]. By using in-network aggregation one can potentially save communication costs by reducing the amount of traffic [5–7]. For instance, in applications such as wireless multimedia sensor networks (WMSN), where the transmitted multimedia data has a far greater volume than that in typical WSNs, the in-network aggregation technique is crucial for saving energy and extending network lifetime [8, 9]. While in-network aggregation offers many benefits, it poses a significant challenge for network design, for example, designing routing algorithms so as to minimize costs such as energy expenditure and delay. In particular, we show that it is crucial to take into account how the summary function affects the statistical properties of sensed data.

*Objectives*. In this paper we study the minimum cost aggregation problem for several types of $f$. The performance of in-network aggregation relies heavily on the properties of the function $f$. To be specific let us briefly look at the problem formulation. Consider the single-sink aggregation problem where we define the cost function as follows. Let $E$ denote the set of links in the network. We would like to minimize

$$\sum_{e \in E} w_e\, g_e, \tag{1}$$

where $w_e$ represents the weight associated with link $e$ and $g_e$ represents the average number of bits transmitted over $e$. Note that an objective similar to (1) has been considered in [10–14] as well. The most relevant objective associated with (1) is the *energy consumption*. To see this, let us define the weight $w_e = c\, d_e^{\alpha}$ where $d_e$ is the distance between the nodes connected by link $e$, $\alpha$ is the path loss exponent, and $c$ is the related channel parameter. Hence (1) is proportional to the total transmission energy consumed throughout the data aggregation. Note that in [13, 14] the authors consider the same energy cost function. We refer to $g$ as the *aggregation cost function* (we will use the notation $g$ to denote the cost function in general, whereas $g_e$ is used to denote the cost function specifically on link $e$). Note that $g_e$ depends on the source measurements aggregated on $e$, and also on $f$, which is the summary function applied to the measurements. The work in [15] also studies an aggregation problem in sensor networks computing summary functions, assuming that all the packets generated in the network have the same size. However, the amount of information generated at intermediate nodes may vary, since a summary of data can be statistically different from the original data; this is our key observation.

Let us take an example. Consider the network in Figure 1 where Nodes 1 and 2 are the source nodes, and the shaded node represents the sink. The sink wants to receive a summary of information from Nodes 1 and 2. The sensor readings generated at Nodes 1 and 2 are represented by the random variables (RVs) $X_1$ and $X_2$, respectively. Since Node 1 is a “leaf” node, it simply transmits the raw reading $X_1$ to Node 2. Node 2 combines $X_1$ with its own data, $X_2$, by computing the summary function $f(X_1, X_2)$, which is then transmitted to the sink. We define the aggregation cost function as follows. Suppose the sensor information to be transmitted on Edge $e$ is the random variable $Y_e$. The average number of bits to be transmitted on $e$, or $g_e$, is defined as

$$g_e = H(Y_e), \tag{2}$$

where $H(\cdot)$ denotes the entropy function. (We temporarily ignore communication overheads incurred in addition to the sensor information, e.g., the packet header size. We will however take such overheads into account later when we formally define $g$.) Note that the entropy function has also been adopted as a cost function in [10, 12], and throughout this paper we will define $g$ in terms of $H$. The average numbers of bits transmitted on Edges 1 and 2 are, respectively, given by

$$g_1 = H(X_1), \qquad g_2 = H\big(f(X_1, X_2)\big). \tag{3}$$

Suppose $f$ is given by sum. Since $H(X_1 + X_2) \neq H(X_1)$ in general, the costs incurred at Edges 1 and 2, that is, $g_1$ and $g_2$, are different. If we had used other types of $f$, such as max, we would have that $g_2 = H(\max(X_1, X_2))$, which would incur a different cost from the case where $f$ was sum. In many cases we will assume symmetric sources; that is, $g_e$ depends only on the number of sensor readings to which $f$ is applied. In those cases we will treat $g$ as a function of that number $n$; that is, $g(n) = H(f(X_1, \dots, X_n))$ (we will also examine the cases of asymmetric sources). We will show that $f$ determines the properties of $g$ such as convexity and monotonicity, and the structure of the aggregation problem heavily depends on those properties. Hence the aggregation scheme must be designed to capture key aspects of aggregation cost functions under the given summary function.
The abovementioned links among summary functions, cost functions, and optimal aggregation strategies have not been previously well studied, as we will see in Section 2 through reviewing related works.
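
The two-source example can be checked numerically. The sketch below is only an illustration under assumed conditions: the toy source distribution (uniform on four values) and the helper names `entropy` and `pmf_of` are not from the paper. It computes the entropy of a raw reading, of the sum, and of the max of two independent readings.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def pmf_of(func, pmf_x, pmf_y):
    """pmf of func(X, Y) for independent X and Y."""
    out = Counter()
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[func(x, y)] += px * py
    return dict(out)

# Assumed toy source: uniform on {0, 1, 2, 3}, i.e. 2 bits per raw reading.
x = {v: 0.25 for v in range(4)}

h_raw = entropy(x)                                   # cost on Edge 1
h_sum = entropy(pmf_of(lambda a, b: a + b, x, x))    # Edge 2 if f is sum
h_max = entropy(pmf_of(max, x, x))                   # Edge 2 if f is max

print(h_raw, h_sum, h_max)
```

For this toy source the sum costs more bits than a raw reading while the max costs fewer, which is the kind of asymmetry between summary functions that the example points to.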

*Contributions*. In this paper we investigate the minimum energy aggregation problem for several widely used summary functions. We consider two types of $f$. The first type is called the *summation* type, which involves sums of measurements: specifically sum, mean, and weighted sum. The second type is called the *extreme* type, which is related to the extreme statistics of the data: specifically max and min. We will use the entropy function as the measure of information rate. We show that, when $f$ is sum or mean, and if the source data is i.i.d., $g$ is indeed concave and increasing, *irrespective of the distribution of the source data*. This implies that one can use well-known algorithms such as the Hierarchical Matching (HM) algorithm [16] in order to approximate the optimal cost. When $f$ is weighted sum, however, it is unclear how to associate the flow aggregation problem with the cost function. Nonetheless we prove that, if the source data are independent Gaussian random variables, there exists an efficient algorithm for the problem of aggregating the weighted sum of data with arbitrary weights.

Next we consider extreme type summary functions such as max. We will show that for certain distributions of source data, $g$ can be *convex* and *decreasing* in the (nonzero) number of aggregated measurements. Note that the single-sink aggregation problems for concave/increasing [16–20] or convex/increasing cost functions [21, 22] have been widely studied; however, the convex and decreasing case has not been well studied yet. We propose a novel algorithm which effectively captures such properties of $g$. We begin by observing that there are two aspects in cost reduction as follows. Since $g$ is convex and decreasing, $g$ decreases faster when the number of aggregated data is smaller. The intuition is that it pays to locally aggregate data among nearby sources in the *early* stages of aggregation, that is, when the number of measurements aggregated at sensors is small. This leads us to find a low-cost *local* clustering of sources, which is a “microscopic” aspect of cost reduction. Meanwhile we need to simultaneously find a low-cost route to the sink, which must take the *global* structure of the network into account and thus is a “macroscopic” aspect of cost reduction. These are conflicting aspects and a good tradeoff point between them should be sought. To that end we propose the Hierarchical Cover and Steiner Tree (HCST) algorithm. The algorithm consists of multiple stages and is designed to empirically find the best tradeoff point over the stages. We show by simulation that the algorithm can significantly reduce cost compared to baseline schemes such as a greedy heuristic using shortest path routing, or the HM algorithm.

Our results show that the summary function can significantly impact the design of aggregation schemes. However there are many choices for $f$: suppose, for example, we would like to compute the $\ell_p$ norm of the vector of measurement data. Note that the sum and max functions which we study in this paper are in fact related: if the measurement data is always positive, then the sum function is simply the $\ell_1$ norm and max is the $\ell_\infty$ norm of a data vector. One could ask: what are good aggregation strategies if we take $f$ as a different norm, say the $\ell_2$ norm? We leave such questions as future work.

*Paper Organization*. We briefly review related work in Section 2. Section 3 introduces the model and problem formulation. Sections 4 and 5 discuss the optimal routing problem for summation and extreme type summary functions, respectively. Simulation results are presented in Section 6. Section 7 concludes the paper.

#### 2. Related Work

In general the single-sink aggregation problem to minimize (1) is NP-hard [23], and a substantial amount of research has been devoted to designing approximation algorithms depending on certain properties of $g$. In our case it is important to note that such properties of $g$ are determined by the choice of $f$. Let us briefly review the related work on the single-sink aggregation problem for two types of $g$. Most research on the single-sink aggregation problem has focused on the case where $g$ is *concave* and *increasing*. Due to the concavity of $g$, the link costs associated with the amount of aggregated data exhibit *economies of scale*, that is, the marginal cost of adding a flow at a link is cheaper when the current number of aggregated flows is greater at the link. Buy-at-bulk network design [23, 24] is based on such a property of $g$. A number of approximation algorithms have been proposed, for example, [17–19]. When $g$ is known in advance, a constant factor approximation to the optimal cost is possible [20, 25]. Even when $g$ is unknown but is concave and increasing, Goel and Estrin [16] have proposed a simple randomized algorithm called *Hierarchical Matching* (HM). The algorithm computes minimum weight matchings of the source nodes hierarchically over stages, and outputs a tree for aggregation. The HM algorithm can approximate the optimal cost by a factor logarithmic in the number of sources [16]. Nonuniform variants of this problem in which $g$ differs among the links are also studied [26, 27], in which a polylogarithmic approximation to the optimal cost is shown to be achievable.

The case where is* convex* and* increasing* in the number of aggregated measurements has been studied in [21, 22]. Here exhibits (dis)economies of scale, that is, the marginal cost of routing a flow at a link is more expensive when a greater number of flows are aggregated at the link. Such phenomenon can be observed from many applications, such as speed scaling of microprocessors modeled by where is the clock speed, and are constants, and is the energy consumption at the processor. Notably the authors show that the problem can intrinsically differ from that for a concave and increasing . For example the authors show that constant-factor approximation algorithms do not exist for certain convex and increasing [21]. They nevertheless proposed a constant factor approximation algorithm for the case . These results show that the single-sink aggregation problem crucially depends on certain properties of such as convexity. However, none of the above works deal with convex and decreasing which we will study in the sequel.

There have been many studies regarding the intermediate data combining in conjunction with routing in order for an efficient retrieval of the complete sensor readings. Scaling laws for achievable rates under joint source coding and routing are studied in [28]. The work [11] studies the problem of minimizing the flow costs under distributed source coding. They show that when is linear in , firstly applying Slepian-Wolf coding at the sources, and secondly routing coded information via shortest path tree from the sources to the sink is optimal. In [10] a single-input coding model was adopted in which the coding of information among the nodes can be done only in pairs, but joint coding of the source data from more than two nodes is not allowed. Assuming reduction in packet size is a linear function of correlation coefficient between each pair of nodes, they proposed a minimum-energy routing algorithm. The impact of spatial correlation on routing has been explored in [12]. They showed that, assuming the correlation decays over distance, it pays to form clusters of nearby nodes and aggregate data at the clusterheads. The aggregated information is then routed from clusterheads to the sink. The algorithm is shown to perform well for various correlation models. The tradeoff between integrity of aggregated information and energy consumption has been studied in [29]. Further works on in-network aggregation combined with routing include [30, 31] which propose efficient protocols for routing excessive values among sensed data. A scheme using spatially adaptive aggregation so as to mitigate traffic congestion was proposed in [32].

The above works aim at retrieving the *entire* set of data, instead of a summary, subject to certain degrees of data integrity. In our case, we design energy efficient aggregation schemes to compute the summary function of the sensor readings. Also, in the above-mentioned works, in-network aggregation reduces cost mainly by removing correlation among the data set. In our work, by contrast, we focus on losslessly retrieving a summary of *statistically independent* sensor readings. We assume the independence of sensor readings because we would like to *decouple* the cost savings from removing correlation and the savings from applying the summary function in association with aggregation strategies; we focus on the latter. Moreover, the assumption of independence among the readings represents the “worst” case in terms of cost savings, since one cannot reduce the energy cost by removing correlations in sensor readings. In fact, the independence assumption can be valid in certain cases. For example, consider a large sensor network in which the sensed data is spatially correlated and the correlation decays quickly over distance. If the source nodes are sparsely deployed and thus tend to be far apart from one another, the correlation among their data can be very weak. Such sparse node placement is motivated by cost efficiency: sparse placement of nodes enables us to reap as much information as possible given a fixed number of sensor devices, assuming that the network senses a homogeneous field and the measure of information is given by the joint entropy function.

#### 3. Model

##### 3.1. Preliminaries

We are given an undirected graph $G = (V, E)$ where $V$ and $E$ denote the set of vertices and edges, respectively. For $u, v \in V$, $(u, v)$ denotes the (undirected) edge connecting nodes $u$ and $v$. With each edge $e$ in $E$ we associate a *weight* denoted by $w_e$. A weight captures the cost of transmitting a unit amount of data between two nodes, for example, the expenditure of transmission energy in order to compensate for path loss. The set $S \subseteq V$ denotes the set of source nodes, that is, the nodes which generate measurement data to be reported to the sink. Also define $k = |S|$ where $|\cdot|$ denotes the cardinality of a set. For a source node $i \in S$, its measured data is modeled by an RV denoted by $X_i$. We assume that the $X_i$’s are independent and identically distributed among the sources. The measured data is to be aggregated at the sink node denoted by $t$. The nodes which are not source nodes act as relays in the aggregation process. For simplicity we will assume that any node in the network transmits data at most once during the aggregation process. Such an assumption has been made in other works such as [15]. Thus the routes for aggregation constitute a tree whose root is given by $t$. We refer to such a tree as an *aggregation tree*. The aggregation process is performed as follows. The sources initiate transmissions. An intermediate node waits for all the data from the sources which are descendants of the node to arrive. Next the node computes the summary function of the aggregated data, which is then relayed to the next hop.

In this paper a summary function is defined to be a nonnegative function denoted by $f$ which is a *divisible* function. Divisible functions are a class of summary functions which can be computed in a divide-and-conquer manner [1]. Divisible functions are defined as follows: given $n$ data samples, consider a partition of the samples into sets of size $m$ and $n - m$ denoted by $A$ and $B$, respectively. If $f$ is divisible, $f(A \cup B) = f\big(f(A), f(B)\big)$ holds for any such partition. Examples of divisible functions are sum, max, and min. In particular, when $f$ is divisible, the aggregation can be performed in a divide-and-conquer manner as follows. Suppose a set of data samples are aggregated at a node. If the node is a source, it applies $f$ to the collected samples and its own data. If the node is simply a relay, it applies $f$ to the aggregated data samples. In either case the resulting summary is transmitted to the next hop.
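
The divide-and-conquer aggregation described above can be sketched in a few lines. This is a minimal illustration, not the paper's protocol: the tree layout, the node names, and the helper `aggregate` are hypothetical, and every leaf is assumed to be a source so that each node has at least one input.

```python
def aggregate(tree, readings, node, f):
    """Summary forwarded by `node`: f applied to the summaries received
    from its children plus its own reading (if the node is a source)."""
    parts = [aggregate(tree, readings, child, f) for child in tree.get(node, [])]
    if node in readings:
        parts.append(readings[node])
    return f(parts)

# Hypothetical aggregation tree, given as parent -> children.
tree = {"sink": ["a"], "a": ["b", "c"], "c": ["d"]}
# Sources and their readings; node "c" is a pure relay.
readings = {"a": 3, "b": 7, "d": 5}

total = aggregate(tree, readings, "sink", sum)    # in-network sum
largest = aggregate(tree, readings, "sink", max)  # in-network max
print(total, largest)
```

Because sum and max are divisible, the value that reaches the sink equals the function applied directly to all readings, even though each node only ever sees partial summaries.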

Abusing notation for the sake of simplicity, we let the function take a set, a vector, or their combination as its argument. For example if is sum, , and , and also . For some , we define as the set of RVs representing the measurements from the nodes in ; that is, . Thus is the aggregation function applied to the set , for example, if is sum then

##### 3.2. Problem Formulation

We will define the problem of minimizing communication costs as follows. There exists a sink to which the data is to be aggregated. Our goal is to find a minimum-cost aggregation tree rooted at the sink. We would like to solve the following aggregation problem:

$$\min_{T} \sum_{e \in T} w_e\, g_e, \tag{5}$$

where the minimization is over aggregation trees $T$ rooted at the sink and $g_e$ represents the average number of bits communicated over Edge $e$. Note that the objective of (5) has been considered in the works [10, 12] as well. We call $g_e$ the *aggregation cost function*, which we define as follows.

We will use the entropy function as our measure of information rate, similar to the works [13, 14]. We assume that the average number of bits to represent a random sensor measurement $X$ is given by $H(X)$. A precise definition of the entropy function depends on the nature of $X$: if $X$ is a discrete RV, $H(X)$ denotes the usual Shannon entropy. If $X$ is a continuous RV, $H(X)$ is implicitly defined to be $H(\hat{X})$ where $\hat{X}$ is a discrete RV obtained by applying uniform scalar quantization to $X$ with some quantization step size, say $2^{-m}$ for some integer $m$. If the quantization precision is sufficiently high, it is known [33] that $H(\hat{X}) \approx h(X) + m$ where $h(\cdot)$ denotes the differential entropy of continuous RVs. Note that a similar approximation has been made in defining the information rates for continuous RVs in [13, 14]. Hence in this paper, we will assume that a continuous RV $X$ incurs the cost of $h(X) + m$ bits where $m$ is a sufficiently large parameter, and we denote such costs by $H(X)$.

In addition, the measured data is transmitted as a packet in the network. Hence for each packet transmission, there is an overhead of metadata, for example, the packet header. For any measurement $X$, no matter how small $H(X)$ is, there is always an overhead of transmitting such metadata in practice. We will assume the header length is fixed to $b$ bits throughout this paper. Hence the average number of bits required to send measurement information $X$ per transmission over a link is given by

$$H(X) + b. \tag{6}$$

For a given aggregation tree $T$, let $p_i$ denote the path from a source $i$ to the sink. For a given Edge $e$, let $S_e$ denote the set of source nodes whose aggregated measurements are transmitted over $e$, that is, $S_e = \{i \in S : e \in p_i\}$. The information to be communicated over Edge $e$ is the function $f$ applied to the set of measurement values from $S_e$, that is, $f(X_{S_e})$. Hence we define the aggregation cost function as follows:

$$g_e = H\big(f(X_{S_e})\big) + b. \tag{7}$$

We would like to solve (5) using the definition of $g_e$ given by (7). In the following sections we investigate several widely used summary functions and the associated optimal aggregation problems.
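
To make the objective concrete, the following sketch evaluates the cost of a given aggregation tree for symmetric sources, where the per-edge cost depends only on how many sources are routed over the edge. The names `tree_cost`, `parent`, `weight`, and the example cost function (an 8-bit header plus a logarithmic term) are assumptions made for illustration only.

```python
from math import log2

def tree_cost(parent, weight, sources, g):
    """Sum over tree edges of weight * g(number of sources routed over the edge)."""
    load = {}
    for s in sources:
        node = s
        while node in parent:              # walk from the source up to the sink
            edge = (node, parent[node])
            load[edge] = load.get(edge, 0) + 1
            node = parent[node]
    return sum(weight[e] * g(n) for e, n in load.items())

# Hypothetical tree: b -> a -> sink and c -> sink.
parent = {"b": "a", "a": "sink", "c": "sink"}
weight = {("b", "a"): 1.0, ("a", "sink"): 2.0, ("c", "sink"): 4.0}
sources = ["a", "b", "c"]

# Hypothetical per-edge cost in bits: fixed header plus a concave payload term.
g = lambda n: 8 + log2(1 + n)
print(tree_cost(parent, weight, sources, g))
```

Swapping in a different `g` (for instance a decreasing one) changes which tree is cheapest, which is why the shape of the aggregation cost function drives the algorithm design.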

#### 4. Aggregation Schemes for Summation-Type Summary Functions

We consider the summary functions of sum, mean, and weighted sum.

##### 4.1. sum and mean

We first discuss the case where $f$ is sum. We have that

$$f(X_1, \dots, X_n) = \sum_{i=1}^{n} X_i. \tag{8}$$

Clearly sum is a divisible function. Thus the aggregation process is as follows: a node simply applies the sum function to the aggregated data, and relays the aggregated information to the next hop.

When the source data is i.i.d., we will show that there exists a randomized algorithm which finds an aggregation tree whose expected cost is within a factor of $O(\log k)$ of the optimal cost of (5), where $k$ is the number of sources.

Proposition 1. *Suppose the $X_i$’s are i.i.d. For any distribution of $X_i$, there exists an algorithm yielding the mean cost within a factor of $O(\log k)$ of the optimal cost of (5).*

Goel and Estrin [16] studied a single-sink data aggregation problem as follows. A source generates a unit flow which needs to be routed to a sink, where the flows are aggregated through a tree. Their objective is to minimize the following cost function:

$$\sum_{e} w_e\, g(x_e), \tag{9}$$

where $w_e$ is the weight on Edge $e$, $x_e$ is the number of flows on Edge $e$, and $g$ is a function that maps the total size of flow to its cost. They proposed an algorithm to minimize (9) when $g$ is a canonical aggregation function defined as follows.

*Definition 2 (see [16]).* The function $g$ is called a canonical aggregation function (CAF) if it has the following properties:

(1) $g(0) = 0$.
(2) $g$ is increasing.
(3) $g$ is concave.

Their algorithm, called Hierarchical Matching (HM) [16], guarantees the mean cost to be within a factor of $O(\log k)$ of the optimal *irrespective of $g$*, provided that $g$ is a CAF. As mentioned previously, since the $X_i$’s are i.i.d., $g_e$ depends only on $|S_e|$. Specifically we will define $g(n)$ as follows:

$$g(n) = H\Big(\sum_{i=1}^{n} X_i\Big) + b \quad \text{for } n \ge 1, \qquad g(0) = 0, \tag{10}$$

where $b$ is the fixed header length. We will show that $g$ is a CAF by showing that $g$ satisfies the three properties of Definition 2. Note this implies that the HM algorithm can be used to approximately solve (5), since (9) and the objective of (5) are identical.

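*Proof of Proposition 1.* For the first property, it trivially holds that $g(0) = 0$. For the second property, for any two independent RVs $X$ and $Y$, it is known that $H(X + Y) \ge H(X + Y \mid Y) = H(X)$, implying that $H(X + Y) \ge H(X)$; that is, adding an independent RV never decreases entropy [33], which implies that $g(n)$ is increasing in $n$. For the third property, consider the following. It is shown in [34] that the entropy of the sum of independent RVs is a submodular set function. That is, the following holds for independent RVs $X$, $Y$, and $Z$ [34, Theorem I]:

$$H(X + Y + Z) + H(Y) \le H(X + Y) + H(Y + Z). \tag{11}$$

Now consider sensor measurements $X_1, \dots, X_{n+1}$, and make the substitutions $X = X_n$, $Y = X_1 + \cdots + X_{n-1}$, and $Z = X_{n+1}$ in (11). We have that

$$H\Big(\sum_{i=1}^{n+1} X_i\Big) + H\Big(\sum_{i=1}^{n-1} X_i\Big) \le H\Big(\sum_{i=1}^{n} X_i\Big) + H\Big(\sum_{i=1}^{n-1} X_i + X_{n+1}\Big). \tag{12}$$

If we apply the definition of $g$ given by (10) to (12), the following holds due to symmetry:

$$g(n+1) + g(n-1) \le 2\, g(n). \tag{13}$$

Hence $g(n+1) - g(n) \le g(n) - g(n-1)$ holds, that is, the slope is decreasing in $n$, which implies that $g$ is concave on the domain of integers. Thus $g$ satisfies all the properties of Definition 2, and is a CAF. This implies that, by using the HM algorithm, one can achieve an expected cost which is within a factor of $O(\log k)$ of the optimal cost of (5).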
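
The increasing and concave behavior established in the proof can be observed numerically. The sketch below is an illustration under assumptions: the source distribution is an arbitrary toy pmf on {0, 1, 2}, the header term is omitted, and the helper names are hypothetical. It computes the entropy of the sum of n i.i.d. readings by repeated convolution and inspects the increments.

```python
from math import log2

def convolve(p, q):
    """pmf of X + Y for independent X ~ p and Y ~ q on {0, 1, 2, ...}."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def entropy(p):
    """Shannon entropy in bits of a pmf given as a list."""
    return -sum(x * log2(x) for x in p if x > 0)

def sum_entropies(pmf, n_max):
    """[H(X_1), H(X_1 + X_2), ..., H(X_1 + ... + X_n_max)] for i.i.d. X_i ~ pmf."""
    result, current = [], [1.0]
    for _ in range(n_max):
        current = convolve(current, pmf)
        result.append(entropy(current))
    return result

pmf = [0.7, 0.2, 0.1]          # assumed skewed toy distribution
H = sum_entropies(pmf, 8)
# Increments g(n) - g(n-1), taking g(0) = 0 and ignoring the header.
increments = [H[0]] + [H[i] - H[i - 1] for i in range(1, len(H))]
print([round(d, 4) for d in increments])
```

Positive increments correspond to the increasing property, and increments that shrink with n correspond to concavity on the integers.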

Next we consider mean as the summary function. Note that mean, as well as weighted sum considered in the next section, are not divisible functions in general. We will nevertheless show that the problem for those summary functions can be reduced to the sum problem as follows. Suppose every source node is aware of the total number of sources, that is, $k$. In our scheme every source simply scales its measurement by $1/k$ prior to transmission, that is, Source $i$ transmits $X_i / k$; such scaled measurements are then aggregated in the same way as in the sum problem. The average number of bits transmitted over Edge $e$ then follows from (7) applied to the scaled sum $\frac{1}{k}\sum_{i \in S_e} X_i$. Since the scaled measurements $X_i / k$ are i.i.d., for the minimum cost aggregation problem for mean we can use the same algorithm as that used for sum, for example, the HM algorithm.
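
The reduction of mean to sum amounts to a local scaling step followed by ordinary sum aggregation. A minimal sketch, with a hypothetical helper name and made-up readings:

```python
def mean_via_scaled_sum(readings):
    """Each source scales its own reading by 1/k; the network then only sums."""
    k = len(readings)                      # every source is assumed to know k
    scaled = [x / k for x in readings]     # done locally at each source
    return sum(scaled)                     # divisible, so it can be done in-network

result = mean_via_scaled_sum([2.0, 4.0, 9.0])
print(result)
```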

##### 4.2. weighted sum

Next we consider the case where $f$ is weighted sum as follows. We assign arbitrary weights $a_i$, $i \in S$, to the source nodes. The goal of the sink is to compute $\sum_{i \in S} a_i X_i$. Our method of aggregation is similar to that for the case of mean, that is, Source $i$ scales its measurement by $a_i$, then transmits $a_i X_i$, where the aggregation process is the same as that for sum. However the effective source data seen by the network is no longer i.i.d., unless the $a_i$’s are identical for all $i$. The aggregation cost function is given by

$$g_e = H\Big(\sum_{i \in S_e} a_i X_i\Big) + b, \tag{14}$$

where $b$ is the fixed header length. The difficulty is that it is hard to associate a “flow” with the source data due to asymmetry, that is, the problem is no longer a flow optimization. Moreover, it is easily seen that (14) is not a CAF in general. Thus we restrict our attention to a specific distribution of $X_i$. We will show that, if the $X_i$’s are independent Gaussian RVs, the problem for weighted sum is indeed a single-sink aggregation problem with concave costs, and there exist algorithms similar to the HM algorithm which have a good approximation ratio. Specifically we prove that our problem is equivalent to the single-sink aggregation/flow optimization problem with *nonuniform* source demands.

Proposition 3. *Suppose , and are independent. Let be weighted sum with arbitrary weights . For sufficiently large , there exists an algorithm yielding the mean cost within a factor of of the optimal cost of .*

*Proof. *Consider the information communicated over Edge denoted by :Since ’s are independent Gaussian RVs, is also Gaussian with variance where . Thus the differential entropy of is given byWe observe that, from (16), we can treat as the “flow” generated by Source , and the sum of flows at Edge incurs the entropy cost as in (16). Specifically we will make the following definitions:Here represents the (unsplittable) flow demand generated by Source , and denotes the minimum demand. Hence under a flow routing scheme, the total amount of flow at Link is given by . Then from (16), the associated communication cost incurred at Link is given by bits, that is, represents the information rate of a flow aggregated at Link . Unlike the previously defined cost functions, is no longer a function of the number of sources on a link, but instead the function of the amount of flow on that link. Finally we define the aggregation cost function in terms of as in (20) in order to meet the concavity condition for as follows: is essentially identical to , and ifone can show that is concave and increasing for all . Hence under the condition (21), is an increasing concave function of the total flow aggregated on a link. In that case we can use the algorithm proposed by Meyerson et al. [19] which essentially extends the HM algorithm to the problems with nonuniform source flow demands, and can approximate the optimal cost by a factor of on average.
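
As a hedged illustration of why independent Gaussian sources admit a flow interpretation, the following worked equation uses assumed notation: source weights $a_i$, source variances $\sigma_i^2$, source means $\mu_i$, and $S_e$ for the set of sources routed over an edge $e$. These symbols are chosen for this sketch and may differ from the paper's. It relies only on two standard facts: a linear combination of independent Gaussians is Gaussian, and a Gaussian with variance $v$ has differential entropy $\frac{1}{2}\log_2(2\pi e v)$.

$$
\sum_{i \in S_e} a_i X_i \;\sim\; \mathcal{N}\Big(\sum_{i \in S_e} a_i \mu_i,\; \sum_{i \in S_e} a_i^2 \sigma_i^2\Big),
\qquad
h\Big(\sum_{i \in S_e} a_i X_i\Big) = \frac{1}{2}\log_2\Big(2\pi e \sum_{i \in S_e} a_i^2 \sigma_i^2\Big).
$$

Under these assumptions each source contributes an additive quantity $a_i^2 \sigma_i^2$, and the entropy on an edge is $\frac{1}{2}\log_2(2\pi e\, x)$ evaluated at the total $x$ of those quantities, which is concave and increasing in $x > 0$.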

In summary, the key question was whether can be cast as a flow aggregation problem, if is weighted sum. In general, it is difficult to make such association due to asymmetry; however, we revealed that such formulation is possible for independent Gaussian sources.

##### 4.3. Discussions

Note that some properties of the $X_i$’s, such as the submodularity relation in (11) used to show that $g$ is a CAF, rely heavily on the independence of the $X_i$’s. When the $X_i$’s are correlated, we can find examples of $g$ which are not CAFs for the summary function sum as follows. Let $X_1$ and $X_2$ be jointly Gaussian with the same marginal given by $\mathcal{N}(0, \sigma^2)$ and with correlation coefficient $\rho$. Then $X_1 + X_2$ is distributed according to $\mathcal{N}(0, 2(1 + \rho)\sigma^2)$; thus we have that, if $\rho < -1/2$, then

$$h(X_1 + X_2) < h(X_1). \tag{22}$$

Thus the entropy function does not satisfy the second condition of Definition 2, that is, the increasing property, as a CAF. Hence for arbitrarily correlated sources, presumably few meaningful arguments can be made on optimal aggregation problems, even for simple summary functions such as sum.
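
The correlated counterexample can be verified with the closed-form Gaussian entropy. In this sketch the zero mean, the common variance `sigma2`, and the correlation coefficient `rho` are assumed parameters, and the function names are hypothetical.

```python
from math import e, log2, pi

def gaussian_entropy_bits(variance):
    """Differential entropy (bits) of a Gaussian RV with the given variance."""
    return 0.5 * log2(2 * pi * e * variance)

def single_vs_sum(sigma2, rho):
    """Return (h(X1), h(X1 + X2)) for jointly Gaussian X1, X2 with
    common variance sigma2 and correlation coefficient rho."""
    var_sum = 2 * sigma2 * (1 + rho)       # Var(X1 + X2) = 2*sigma2 + 2*rho*sigma2
    return gaussian_entropy_bits(sigma2), gaussian_entropy_bits(var_sum)

h1_neg, hsum_neg = single_vs_sum(1.0, -0.75)   # strongly negatively correlated
h1_ind, hsum_ind = single_vs_sum(1.0, 0.0)     # independent
print(h1_neg, hsum_neg, h1_ind, hsum_ind)
```

With `rho = -0.75` the sum has smaller variance than a single reading, so its entropy is lower and the increasing property fails; with `rho = 0` the sum gains exactly half a bit.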

The discussion so far enables us to deal with more general objective functions extended from (5). Consider a function $\phi$ which is concave and increasing. We now define the communication overhead on an edge as a function $\phi$ of the average number of bits transmitted over the edge. Namely, we consider the following extension of (5):

$$\min_{T} \sum_{e \in T} w_e\, \phi(g_e). \tag{23}$$

Consider (23) for the summary function sum with i.i.d. sources and weighted sum with independent Gaussian sources. Note that the composition of two concave and increasing functions is also concave and increasing [35]. Thus $\phi(g_e)$ is a concave and increasing function of the amount of flow at an edge, and thus is a CAF. Hence the HM algorithm can be used to approximate (23).

#### 5. Aggregation Schemes for Extreme-Type Summary Functions

##### 5.1. Case Study

In this section we consider summary functions regarding the extreme statistics of measurements, that is, max or min. We will first investigate the entropy of the extreme statistics of a set of RVs. Consider $n$ measurements denoted by $X_i$, $i = 1, \dots, n$. Since $\min_i X_i = -\max_i (-X_i)$, we will focus only on max without loss of generality. It is easily seen that the max function is divisible, thus the aggregation process is similar to that for sum: a node simply applies the max function to the aggregated data. For example, suppose a node receives data given by $x_1, \dots, x_n$. The node simply computes $\max(x_1, \dots, x_n)$ and forwards it to the next hop.

For extreme-type summary functions, we will show that $g$ is in general *not* a CAF. In particular we consider several cases of practical importance.

*Case 1 (Gaussian RVs). *We consider the problem of retrieving the maximum of i.i.d. Gaussian RVs. We assume that for where we again assume that for and some constant . We provide a numerical evaluation of on the left of Figure 2. We observe that is strictly* convex* and* decreasing* in for , thus is not a CAF.

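[Figure 2: Aggregation cost function for max. (a) Case 1, i.i.d. Gaussian sources. (b) Case 2, extreme data retrieval.]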

*Case 2 (Extreme data retrieval problem).* We consider the problem of *extreme data retrieval* defined as follows. Assume that a source node $i$ measures some physical quantity which is distributed according to a continuous RV $X_i$. We assume the $X_i$’s are independent but not necessarily identically distributed. Suppose that with some probability $X_i$ is equal to a large number, which indicates an “abnormal” event. An important application of sensor networks is to detect the maximum *abnormality* among the measurements. The abnormality is defined as how far a sensor’s measurement has deviated from its usual statistics as follows. Let us denote the cumulative distribution function (CDF) of $X_i$ by $F_i(x)$ or $P(X_i \le x)$, $i \in S$. Consider realizations of $X_i$ given by $x_i$. We will quantify the abnormality at Source $i$ in terms of how *unlikely* the measurement is: specifically the goal of the sink is to retrieve $\min_i P(X_i > x_i)$, or alternatively,

$$\max_i F_i(x_i); \tag{24}$$

thus the abnormality of $x_i$ is defined by $F_i(x_i)$. Let $U_i = F_i(X_i)$. We will assume that the nodes transmit and aggregate $U_i$ instead of $X_i$, and the goal of the sink is to retrieve $\max_i U_i$. Note that since $U_i$ is the RV $X_i$ evaluated at its own distribution function, one can show that the $U_i$’s are i.i.d. RVs uniformly distributed on $[0, 1]$. Thus the problem reduces to an optimal aggregation problem retrieving the max of i.i.d. uniform RVs.
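
The reduction to uniform RVs rests on the probability integral transform: a continuous RV evaluated at its own CDF is uniform on [0, 1]. The simulation below is a small sanity check under assumed conditions; the exponential source model, the rates, the sample size, and the helper name `abnormality` are all hypothetical choices for illustration.

```python
import random
from math import exp

def abnormality(x, rate):
    """CDF of an exponential source with the given rate, evaluated at x."""
    return 1.0 - exp(-rate * x)

rng = random.Random(0)                     # fixed seed for reproducibility
rates = [0.5, 1.0, 4.0]                    # assumed heterogeneous sources
samples = [[abnormality(rng.expovariate(r), r) for _ in range(20000)] for r in rates]

# Although the raw readings follow different distributions, the transformed
# values should all look Uniform(0, 1): mean near 0.5, a quarter below 0.25.
means = [sum(u) / len(u) for u in samples]
low_fracs = [sum(1 for v in u if v < 0.25) / len(u) for u in samples]
print(means, low_fracs)
```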

We will show that associated with the extreme data retrieval problem is* convex* and* decreasing* function when the number of aggregated measurements is greater than or equal to 2. Suppose the data aggregated at a node is given by and define . As previously we assume that the node requires on average bits to transmit .

Proposition 4. *Consider the extreme data retrieval problem. The aggregation cost function is convex and decreasing for .*

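*Proof.* Let $Y$ denote the maximum of $m$ i.i.d. RVs uniformly distributed on $[0,1]$. The CDF of $Y$, denoted by $F_Y$, is given by

$$F_Y(y) = y^m, \quad 0 \le y \le 1.$$

Thus the probability density function (pdf) of $Y$, denoted by $f_Y$, is given by $f_Y(y) = m y^{m-1}$. If we compute the differential entropy $h(Y)$ in nats,

$$h(Y) = -\int_0^1 m y^{m-1} \ln\left(m y^{m-1}\right) dy = -\ln m - m(m-1)\int_0^1 y^{m-1} \ln y \, dy.$$

Since $\int_0^1 y^{m-1} \ln y \, dy = -1/m^2$, we obtain

$$h(Y) = 1 - \frac{1}{m} - \ln m.$$

By regarding $m$ as a continuous variable, we have that, for $m \ge 1$,

$$\frac{dh}{dm} = \frac{1-m}{m^2}, \qquad \frac{d^2 h}{dm^2} = \frac{m-2}{m^3}.$$

Clearly $h(Y)$ is decreasing for $m \ge 1$, and since its second-order derivative is nonnegative for $m \ge 2$, $h(Y)$ is convex for $m \ge 2$. Using base-2 logarithms only rescales $h(Y)$ by a positive constant, and the aggregation cost function differs from the entropy term only by additive constants, so the same holds for the aggregation cost function.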
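The claim of Proposition 4 can be sanity-checked numerically. The sketch below assumes that the entropy term is the differential entropy of the maximum of $m$ i.i.d. uniform RVs on the unit interval, whose standard closed form in nats is $1 - 1/m - \ln m$; the function names are hypothetical. It compares the closed form with a direct numerical integration and then inspects first and second differences over integer $m$.

```python
import math

def entropy_closed_form(m):
    """Differential entropy (nats) of the max of m i.i.d. Uniform(0,1) RVs."""
    return 1.0 - 1.0 / m - math.log(m)

def entropy_numeric(m, steps=200000):
    """Midpoint-rule estimate of -integral of f*ln(f), with f(y) = m*y**(m-1) on (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        f = m * y ** (m - 1)
        total -= f * math.log(f) * h
    return total

values = [entropy_closed_form(m) for m in range(1, 12)]
# Negative first differences: the entropy decreases in m.
first_diff = [b - a for a, b in zip(values, values[1:])]
# Second differences: second_diff[0] is centred at m = 2 and still uses the
# value at m = 1; the remaining entries use only values with m >= 2.
second_diff = [b - a for a, b in zip(first_diff, first_diff[1:])]
```

The first differences are all negative, and the second differences are positive once only values with $m \ge 2$ are involved, which is consistent with convexity holding from $m = 2$ onward rather than from $m = 1$.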

On the right of Figure 2 the plot of is shown. Note is strictly convex for , but overall appears to be approximately convex. Note that is nonpositive, thus one could select a sufficiently large such that , so that for all .

In general, for a convex and decreasing , is clearly NP-hard since the problem contains the Steiner tree problem as a special case. In the following section we present a novel algorithm which captures key properties of convex and decreasing . We later show by simulation that the algorithm achieves low cost in practice.

##### 5.2. Algorithm for Convex and Decreasing Aggregation Cost Functions

###### 5.2.1. Motivation

Before describing the algorithm we present the motivation behind it. An important observation for data aggregation problems was made in [25] for* concave* and* increasing *. The authors proposed a “hub-and-spoke” model for the so-called* facility location problem*. The idea is that when is concave and increasing, one should first aggregate flows at some “hubs,” then route the aggregated flow from the hubs to the sink at minimum cost; this is done by building an approximately optimal Steiner tree where the hubs (facility locations) are the Steiner nodes. The rationale is that, once multiple flows are aggregated at hubs, routing them collectively to the sink is cheaper than routing the sources’ flows separately, due to the concavity of . We observe two aspects of such hub-and-spoke schemes. First, by locally aggregating flows at hubs we greedily reduce costs based on local information, which we view as the* microscopic* approach to cost reduction. Second, by building an approximately optimal Steiner tree with respect to the hubs and the sink, we take the* global* network structure into account, which can be seen as the* macroscopic* aspect of cost reduction. Hence there exists a tradeoff between the microscopic and macroscopic aspects of cost reduction. A similar observation on this tradeoff was made in [12]. Our key question, however, is the following: how do we achieve an optimal tradeoff between these aspects for a* convex* and* decreasing *?

Consider the three examples of aggregation cost functions denoted by , , and which are decreasing and convex for as shown in Figure 3. In case of , we see that is flat for , that is, the average number of bits communicated over a link is constant irrespective of the number of flows passing through it. Thus, the minimum cost routing problem reduces to a Steiner tree problem, in which case a completely “macroscopic” solution is optimal. In case of , we see that decreases slowly in . Thus, the more flows merge on a link, the fewer bits are needed to transmit the merged information. Suppose we use the hub-and-spoke scheme to aggregate flows in a local manner. The number of flows aggregated at a hub is at least 2; note, however, that is approximately “flat” for . This implies that, once more than two flows are aggregated, the benefits from further local flow aggregation will be negligible. Hence the optimal routing problem from the hubs to the sink approximately reduces to the Steiner tree problem. Thus one could expect that local aggregation (microscopic approach) followed by an optimal Steiner tree construction (macroscopic approach) would yield a good solution. Now let us consider . The overall rate of decrease of is higher than that of . It appears that when the number of aggregated flows is sufficiently large, for example, is greater than 6, becomes effectively “flat.” This suggests that one should keep aggregating flows until a sufficient number of flows, say 6, has been aggregated, that is, the microscopic cost reduction should be applied* multiple times in a hierarchical manner*, and then build an optimal Steiner tree with respect to the aggregated sources, that is, apply the macroscopic reduction.

**Figure 3:** Three examples, (a), (b), and (c), of aggregation cost functions that are convex and decreasing.

The example provides us with some insights. Since is convex decreasing, the marginal benefit of local aggregation is large for small but decreases with increasing . In other words, when is small, that is, in the early stages of the overall aggregation process, one should focus on low-cost local aggregation in order to benefit from high rate of decrease of for small . Meanwhile, once a large number of flows are aggregated, it pays to perform macroscopic cost reduction from there on by building the optimal Steiner trees since becomes more “flat” with increasing . This suggests that there exists a tradeoff point at which such microscopic and macroscopic reduction are optimally balanced. Unfortunately it is difficult to know such a tradeoff point in advance. The proposed algorithm not only exploits both the microscopic and macroscopic aspects of cost reduction for a convex and decreasing , but also empirically searches for the optimal tradeoff point. Details are presented in the following section.

###### 5.2.2. Outline

An outline of the proposed algorithm is as follows. The algorithm consists of multiple stages. A hub-and-spoke problem (or facility location problem) is approximately solved at each stage, and the flows from the source nodes are merged at the hubs. The hubs at the present stage become the source nodes in the next stage, that is, the flows are merged hierarchically. Instead of solving a complex facility location problem, we find a* minimum weight edge cover* (MWEC) on the source nodes at each stage as a simple approximation. The rationale is that we would like to cluster sources for local aggregation at low cost, and by definition the MWEC covers every source with the minimum total edge weight. The MWEC consists of multiple connected components, each of which is a tree. For each connected component we select a source as a hub and call it a* center node* (details on the selection of center nodes are provided later). The flows in that component are aggregated at the center node.
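The clustering step can be illustrated with a small sketch. Note that this is not an exact MWEC solver: as a simplifying assumption it uses a nearest-neighbor edge cover heuristic, in which every source is joined to its cheapest other source, and then picks one maximum-degree node per connected component as the center node. The function names, the tie-breaking rule, and the toy distances are all hypothetical.

```python
def nearest_neighbor_cover(dist, sources):
    """Edge cover heuristic: join every source to its cheapest other source.

    Assumes at least two sources; ties are broken by the smaller node label.
    """
    edges = set()
    for u in sources:
        v = min((s for s in sources if s != u), key=lambda s: (dist[u][s], s))
        edges.add((min(u, v), max(u, v)))
    return edges

def pick_centers(sources, edges):
    """Return one maximum-degree node per connected component of the cover."""
    parent = {s: s for s in sources}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = {s: 0 for s in sources}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        parent[find(u)] = find(v)

    best = {}
    for s in sorted(sources):
        root = find(s)
        if root not in best or degree[s] > degree[best[root]]:
            best[root] = s
    return sorted(best.values())

# Toy instance: five sources placed on a line (hypothetical positions).
pos = {0: 0.0, 1: 1.0, 2: 5.0, 3: 6.0, 4: 2.0}
sources = sorted(pos)
dist = {u: {v: abs(pos[u] - pos[v]) for v in pos} for u in pos}
cover = nearest_neighbor_cover(dist, sources)
centers = pick_centers(sources, cover)
```

On this toy instance the cover splits the sources into the clusters {0, 1, 4} and {2, 3}, with centers 1 and 2. An exact MWEC, as used in the paper, could yield a cheaper cover on other instances.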

At each stage, once the center nodes are determined, we build an approximately optimal Steiner tree with respect to the center nodes and the sink. We use the algorithm in [36] for the Steiner tree construction. That algorithm provides the best known -approximation for the Steiner tree problem where .
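To make the macroscopic step concrete, the sketch below builds a Steiner tree approximation over a set of terminals. It does not implement the algorithm of [36]; as a plainly named substitute it uses the classical minimum spanning tree on the metric closure of the terminals, which is known to be a 2-approximation. The function name and the toy distances are hypothetical, and `dist` is assumed to hold shortest-path distances.

```python
def mst_on_metric_closure(dist, terminals):
    """Prim's algorithm restricted to the terminals, using shortest-path distances.

    Each returned edge stands for a shortest path in the underlying graph.
    """
    terminals = list(terminals)
    in_tree = {terminals[0]}
    edges, total = [], 0.0
    while len(in_tree) < len(terminals):
        w, u, v = min((dist[a][b], a, b)
                      for a in in_tree for b in terminals if b not in in_tree)
        in_tree.add(v)
        edges.append((u, v))
        total += w
    return edges, total

# Toy instance: three terminals that are leaves of a star with unit edges,
# so every pair of terminals is at shortest-path distance 2.
leaves = [0, 1, 2]
dist = {a: {b: (0.0 if a == b else 2.0) for b in leaves} for a in leaves}
tree_edges, tree_weight = mst_on_metric_closure(dist, leaves)
```

Here the optimal Steiner tree is the star itself with weight 3, while the sketch returns weight 4, which is within the factor of 2.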

Each stage outputs an aggregation tree. The output tree at Stage is the* union* of the paths from all the hierarchical aggregations found up to Stage and the Steiner tree built at Stage . Namely, the output tree at Stage is a combination of * consecutive* hierarchical aggregations (microscopic cost reduction) and a Steiner tree with respect to the sink and Stage hubs (macroscopic cost reduction).

Hence, over the stages, the algorithm progressively changes the balance between microscopic and macroscopic aspects of cost reduction in the output trees. Roughly speaking, the output trees from later stages are more biased towards the microscopic aspect. After the stages are over, we pick the tree with the minimum cost among the output trees. As a result the algorithm empirically searches for the point of the “best” balance between the two aspects of cost reduction over the stages. Hence one could expect that our algorithm will work well for any convex and decreasing .

###### 5.2.3. Algorithm Description

We present a formal description of the proposed algorithm followed by an explanation of further details. For a given aggregation tree , let denote the total energy cost associated with , as in the objective of .

*Hierarchical Cover and Steiner Tree (HCST) Algorithm*

Begin Algorithm

(1) (Metric completion of ) If is not a complete graph, perform a metric completion of to yield a complete graph. Namely, if there exist any pair of vertices without an edge, create an edge between the pair and assign the edge a weight which is the distance between the pair. The distance is measured in terms of the sum of the weights on the shortest path between the pair.

(2) (Initialization) , .

(3) (Initialize flows at sources) , for all .

(4) (Initial output is a Steiner tree) Jump to Step 7.

(5) (Minimum weight edge cover) Let us denote the subgraph of induced by by . Find a minimum-weight edge cover in . Let be the subgraph of induced by the cover.

(6) (Node selection) Suppose has connected components, and denote the th connected component of by for . For each , select a node with the maximum degree (ties are arbitrarily broken), say , which is called a* center node*. is a tree, and becomes the root of . All the flows in are aggregated at such that every node transmits data to its parent node after the data from its child nodes has been aggregated at the node. The total flow at is updated as follows: Remove all the noncenter nodes from , and let be the resulting set of source nodes.

(7) (Steiner tree construction) Build -optimal Steiner tree with respect to the source nodes in and the sink, using the algorithm in [36].

(8) (Merging trees) If , merge all the MWECs found up to the present stage and the Steiner tree found in Step 7; that is, let If , . We call the output tree of Stage .

(9) (Loop) If , and go back to Step 5. If , continue to Step 10.

(10) (Tree selection) The final output is the tree such that that is, the minimum cost tree among the output trees from all the stages.

End Algorithm
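Step 1 of the algorithm can be sketched as an all-pairs shortest-path computation. The version below uses the Floyd–Warshall recursion and is only an illustration: the function name and the toy graph are hypothetical. It also replaces the weight of an existing edge whenever a cheaper path exists, which is slightly broader than Step 1 as stated, where only missing edges are added.

```python
import math

def metric_completion(nodes, weighted_edges):
    """All-pairs shortest-path distances (Floyd-Warshall) for an undirected graph."""
    dist = {u: {v: (0.0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v, w in weighted_edges:
        if w < dist[u][v]:
            dist[u][v] = dist[v][u] = float(w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Toy path graph 1 - 2 - 3 - 4 with hypothetical edge weights.
nodes = [1, 2, 3, 4]
edges = [(1, 2, 3), (2, 3, 4), (3, 4, 1)]
closure = metric_completion(nodes, edges)
```

In the toy graph, the missing pair (1, 4) receives the weight 3 + 4 + 1 = 8, the length of the only path between them.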

###### 5.2.4. Comments

We explain the details of several steps in the algorithm. In Step 3 the flow variables denoted by , , associated with the source nodes are initialized; these variables track the amount of flow throughout the algorithm. In Step 6 it is natural to select a node with the maximum degree as the center node, since such a node is literally a “hub.” When solving the hub-and-spoke problem at each stage, we choose to solve the MWEC problem, whereas in [25] the load-balanced facility location problem is solved. An advantage of the MWEC problem is that it is considerably simpler than load-balanced facility location problems, since an MWEC problem can be reduced to a minimum weight perfect matching problem [37]. Note that the algorithm in [25] solves the hub-and-spoke problem only once, that is, its output is analogous to the output tree from Stage 1 of our algorithm. Meanwhile the HM algorithm solves a minimum weight perfect matching problem at each stage in order to locally aggregate flows at low cost. It solves the matching problem hierarchically until all the flows are aggregated to a single source, and its final output is the union of those matchings. Thus its final output is analogous to that from the final stage of our algorithm. In other words, the outputs of the abovementioned algorithms correspond to those from individual stages of our algorithm. The HIERARCHY algorithm proposed in [20] hierarchically constructs Steiner trees and solves load-balanced facility location problems, but in a way that relies heavily on the concave and increasing property of . Thus that algorithm may not be suitable for convex and decreasing .

##### 5.3. Performance Analysis

In this section we analyze the performance of the HCST algorithm. For a set of weighted edges, let denote the sum of its edge weights, that is, . For a given source set , let denote the edge set of the optimal Steiner tree associated with .

Proposition 5. *For given network graph , the cost achieved by HCST algorithm is higher than the optimal algorithm by a factor of at most defined as**where denotes the stage at which HCST algorithm terminates. denotes the approximation ratio for Steiner tree problem, and is the ratio of the sums of edge weights between MWEC at Stage of HCST algorithm and the Steiner tree associated with source set , that is,**Also is defined as**where , and denotes the th smallest value of the edge weights of . Note that the second summation term of (32) is defined to be 0 if .*

*Proof. *Denote the optimal cost by . We first find a lower bound for . Let denote the set of edges of the optimal aggregation tree. Let us sort the amount of edge flows of in increasing order, and denote them by , that is, where has edges. There are at least nonzero flows since there are sources, hence and hold. In addition is at most , since is a tree. Also it is clear that , for , and is at most for . This implies that, since is decreasing, , . Let us denote the weight of the edge that carries flow by . For real numbers and , let . We have thatwhere (36) is by Jensen’s inequality due to the convexity of , and (37) is from the definition of Steiner trees. Considering that is decreasing, we would like to make the argument of in (37) as large as possible in order to find a lower bound for . Hence we would like to maximize defined aswhere , are chosen from the edge weights of . For the purpose of maximizing (38), we will assume WLOG, because over all possible permutations of , is maximized when .

We first observe that is decreasing in , since if , we have thatHence can be maximized over by choosing smallest weights from the edge weights of , that is, by letting for .

Next we would like to derive an upper bound for as follows:For inequality (42), we used the fact that (41) is increasing in , hence we chose and the largest possible weights for , in order to maximize . From (42), we obtain . Hence from (37), we obtainNow let us consider the cost of output tree at Stage of HCST algorithm, or . Recall that in HCST algorithm, denotes the source set at Stage , and denotes the output tree at Stage . The cost of is divided into (i) the cost incurred by hierarchical MWECs , and (ii) the cost of -approximate Steiner tree associated with . Hencewhere denote the amount of flow at Edge under HCST algorithm. Note that, the amount of flow in the network at Stage is at least , since the flows are agglomerated through MWECs at every stage. Since is decreasing, the first summation of (44) is at mostNote that the first summation of (44) is 0 for Stage 0. As for the second summation of (44),Inequality (48) is due to ; specifically, the Steiner tree for is a tree that spans , hence by definition, the sum of edge weights of is no more than that of the Steiner tree associated with .

In conclusion, we have that, from (43), (45) and (48),Since the cost of HCST algorithm is , the proposition is proved.

An interpretation for ratio in (32) is as follows: the first term in the bracket of represents a bound on the macroscopic cost associated with the Steiner tree approximation. The second term in the bracket of is a bound on the cost associated with the hierarchical aggregation of flows, that is, the microscopic cost reduction. Clearly we have that , due to , thus , is a decreasing sequence where , for all . The progressive cost reduction due to hierarchical flow aggregation is reflected in . As in (32), is the minimum of numbers, each of which contains a weighted sum of in different combination of weights . Hence represents the empirical minimum of different degrees of tradeoff between microscopic and macroscopic cost reduction.

Next we discuss constant in (34). Firstly observe that ; the first summation of the numerator of (34) is at most , in which case the first term of (34) is at most . Note that a naive upper bound for is simply , yielding a lower bound ; however we observe that our bound (43) improves such a bound since .

can be numerically computed for a given graph, and in the next section we provide numerical examples of . We also apply HCST algorithm to a specific graph as an example.

##### 5.4. Illustrating Examples

In this section we consider a simple convex and decreasing . As previously the packet header length is bits, and we assume that the maximum packet size is 10 times the header length, that is, . We will accordingly consider which is convex and decreasing for of the following form:Clearly holds for .

Figures 4 and 5 show the numerical examples of the performance bound . is computed and averaged over randomly generated graphs of uniformly distributed nodes in a square area. In Figure 4, network size is fixed to 200, and is plotted against the number of source nodes . We consider two types of cost functions: the curve labelled “harmonic” represents the cost function (50) in which decreases as a harmonic sequence. The curve labelled “exp” corresponds to the case where the term in (50) is replaced by where the parameter controls the decay rates of the cost function. We set in this example. In addition, we compare with a simple analytical bound; suppose we build a -approximate Steiner tree based on . The cost under that tree is at most . By combining that cost with (43), we obtain a simple approximation ratio of for the approximately optimal Steiner tree. In Figure 4, the plots of such bounds based on -approximate Steiner tree are added for both harmonic and exponential cost functions, and are labelled as “Steiner(har)” and “Steiner(exp),” respectively. We observe that provides improved bounds as compared to those based on -approximate Steiner tree. In Figure 5, is plotted against varying under the aforementioned harmonic and exponential cost function where we fixed to 10. In Figures 4 and 5, we observe that eventually becomes nearly constant, or increases very slowly at most, even if the system size grows. Hence we conclude that provides an approximation ratio which remains effectively constant irrespective of the system size.

Next we present an example of the application of the HCST algorithm to a specific graph. An example of is given in Figure 6(a). consists of nodes where Node 1 is the sink, that is, . There are four source nodes: where the sources are shaded. Each source generates 1 unit of data. We will again consider convex and decreasing given by (50), and assume . Figure 6(b) shows the output of Stage 0, or , which is an approximately optimal Steiner tree. Figure 7 shows the MWECs over the stages. Figure 7(a) shows the metric completion of the subgraph induced by . Figure 7(b) shows the MWEC at Stage 1. Nodes 4 and 5 become the center nodes, as emphasized in the figure. Figure 7(c) shows the MWEC and the center node at Stage 2.

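**Figure 6:** (a) Example network graph; (b) output of Stage 0, an approximately optimal Steiner tree.

**Figure 7:** (a) Metric completion of the subgraph induced by the source nodes; (b) MWEC at Stage 1; (c) MWEC and center node at Stage 2.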

Figure 8(a) shows the full paths of the MWEC at Stage 1, that is, that in Figure 7(b), in . By building an approximately optimal Steiner tree associated with and taking the union of and as in Step 8, we get as in Figure 8(b). Similarly Figure 9 demonstrates Stage 2 of the algorithm. The full paths for the MWEC from Figure 7(c) in are shown in Figure 9(a). Note that Node 4 is selected as the center node, and the output from Stage 2 or is shown in Figure 9(b). Let us compare the energy costs from all the stages. For , a total of three flows pass through the link between Node 1 and 3, while the flow on the other links is simply 1. Thus, the cost of from Stage 0 is given bySimilarly, we have thatThus the final output of HCST is with the final cost of 275.5. Note that in this example, the Shortest Path Tree (SPT) heuristic incurs the energy cost of 374.
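The cost comparison across stages relies on evaluating the energy cost of a given tree. The sketch below shows one way to do this, under the assumption (our reading of the text) that the cost of a tree is the sum over its edges of the edge weight times the aggregation cost function evaluated at the number of flows on that edge. The tree, the weights, and the cost function `g` are hypothetical and are not those of Figure 6 or equation (50).

```python
def tree_cost(parent, weight, sources, g):
    """Return (cost, flow) for a tree rooted at the sink.

    parent[c] is the next hop of node c; the sink has no entry in parent.
    weight[c] is the weight of the edge from c to parent[c].
    flow[c] counts the sources whose data traverse that edge.
    """
    flow = {}
    for s in sources:
        node = s
        while node in parent:  # walk up until the sink is reached
            flow[node] = flow.get(node, 0) + 1
            node = parent[node]
    cost = sum(weight[c] * g(m) for c, m in flow.items())
    return cost, flow

# Hypothetical tree: sources 4, 5, 6 send to relay 3, which sends to sink 1.
parent = {3: 1, 4: 3, 5: 3, 6: 3}
weight = {3: 2.0, 4: 1.0, 5: 1.0, 6: 1.0}

def g(m):
    """Hypothetical convex, decreasing cost in bits for m aggregated flows."""
    return 1.0 + 1.0 / m

cost, flow = tree_cost(parent, weight, [4, 5, 6], g)
```

In this toy tree three flows share the edge into the sink, so that edge contributes 2 · g(3), while each of the other three edges carries a single flow and contributes g(1).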

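**Figure 8:** (a) Full paths of the Stage 1 MWEC; (b) output tree of Stage 1.

**Figure 9:** (a) Full paths of the Stage 2 MWEC; (b) output tree of Stage 2.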

Next consider such thatAssume that the algorithm has yielded the same , and as the previous case. Since is constant for , the problem reduces to the Steiner tree problem, thus one would expect that would perform the best since is intended to be an approximately optimal Steiner tree. The energy costs are given bythus indeed the HCST algorithm will output as the best solution with cost 37, whereas the SPT heuristic will yield the energy cost of 41. This demonstrates that our algorithm can effectively deal with various types of convex and decreasing aggregation cost functions. In the following section we will evaluate the performance of the HCST algorithm by simulation.

#### 6. Simulation

In our simulation we randomly generate as follows. The node locations are generated independently and uniformly on a unit square. We define as the Delaunay graph induced by the node locations. An example of is depicted in Figure 10 for . As previously it is assumed that the average number of bits required to transmit the aggregated information is approximately where we set header length to 1 and the number of quantization bits to 3. The edge weights are randomly selected from which represents the energy consumption per transmitted bit. In our simulation two types of sources are considered. The first type, called uniform type, is associated with the extreme data retrieval problem, that is, are i.i.d. uniformly on . The second type, called Gaussian type, is associated with retrieving the maximum of Gaussian source data where . The summary function is given by max function.

We compare the performance of the HCST algorithm with the HM algorithm [16] and the SPT heuristic. Figure 11 shows the average energy consumption of the algorithms when we fix the number of sources to 8 with varying . The energy cost shown on the left (resp. right) of Figure 11 is associated with sources of uniform (resp. Gaussian) type. We observe that the HCST algorithm achieves lower energy costs than the SPT heuristic for both types of sources. The energy savings of the HCST algorithm range from 35% to 38% for uniform type sources and from 24% to 25% for Gaussian type sources. Compared to the HM algorithm, our algorithm reduces the energy consumption by 20–21% and 14–15% for uniform and Gaussian type sources, respectively. The HM algorithm focuses on microscopic cost reduction, which may be effective for concave and increasing cost functions but not for convex and decreasing ones. Comparing the SPT heuristic and the HCST algorithm, we observe that the difference in their mean energy consumption increases slightly with . This can be interpreted as follows: for larger networks there is more room for improvement by HCST, for example, there are more choices for Steiner nodes and more ways to merge sources at low cost by MWEC. Thus the performance gain of the HCST algorithm relative to the SPT heuristic is expected to grow with , as shown in the simulation.

Figure 12 shows the mean energy costs with varying where we scale the number of sources proportionally to . Specifically in the simulation we let , that is, one out of five nodes is a source node. In the figure we see that the HCST algorithm again outperforms the SPT heuristic. The relative energy savings of the HCST algorithm range from 19% to 41% for uniform type sources and from 14% to 27% for Gaussian type sources. Relative to the HM algorithm, the HCST algorithm reduces energy costs by 20–23% and 14–17% for uniform and Gaussian type sources, respectively. As in the case of a fixed number of sources, the difference in the energy cost of the algorithms increases with ; however, the rate of increase is higher when the number of sources varies. This can be explained as follows. First, by the previous argument, a larger network leaves more room for improvement by HCST, so its relative gain increases with the network size. Second, since the number of sources grows proportionally with the network size, the total number of stages of the HCST algorithm also increases. Since HCST chooses the best tree from the intermediate output trees collected over the stages, a larger number of stages implies that the final output tree is chosen from a larger pool of trees having various degrees of tradeoff between the microscopic and macroscopic aspects of cost reduction. Thus the abundance of source nodes enables us to choose an aggregation tree with a “refined” tradeoff, which is crucial for a convex and decreasing . This explains the enhanced performance of HCST with an increasing number of sources. Hence we conclude from the simulation that the HCST algorithm can improve performance for various proportions of source nodes in the network.

#### 7. Conclusion

In this paper we have studied a single-sink aggregation problem for wireless sensor networks computing several widely used summary functions. We observed that the problem is characterized by the aggregation cost function , which maps the amount of aggregated measurements to transmission costs at a link. We showed that the properties of depend heavily on the chosen summary function . When is given by sum or mean, we showed that is concave and increasing, implying that there exist algorithms such as the HM algorithm which can approximate the optimal cost within a factor logarithmic in the number of sources. A similar argument was made when is weighted sum for i.i.d. Gaussian sources. When is given by max, however, we showed that is convex and decreasing for certain types of sources. For such we identified a tradeoff between two aspects of cost reduction: first, local clustering of sources, which is the microscopic aspect, and second, low-cost routing from the clustered sources to the sink, which is the macroscopic aspect. We proposed the Hierarchical Cover and Steiner Tree algorithm, which empirically searches for the best tradeoff point between these aspects. Numerical examples and simulation results were presented to demonstrate that the HCST algorithm is versatile and improves performance for various types of convex and decreasing . A future direction is to investigate the optimal aggregation problem for a wider range of summary functions. In addition, the evaluation of the HCST algorithm in a real-world testbed environment is part of our future work.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This work was supported by Basic Science Research Program through The National Research Foundation of Korea (NRF) funded by The Ministry of Science, ICT & Future Planning (NRF-2013R1A1A1062500), and in part by the ICT R&D program of MSIP/IITP, (10-911-05-006, High Speed Virtual Router that Supports Dynamic Circuit Network).