Abstract

We generalize Ripley’s function to get a new function, , to characterize the spatial structure of a point pattern relatively to another one. We show that this new approach is pertinent in ecology when space is not homogenous and the size of objects matters. We present how to use the function and test the data against the null hypothesis of independence between points. In a tropical tree data set we detect intraspecific aggregation and interspecific competition.

1. Introduction

Investigating the spatial structure of point patterns has been a long-time challenge for ecologists. Pielou [1] claimed that the information an ecologist wants to have immediately when observing a point set representing a vegetal community is the density of each species and the existence of interactions between plants. Density can be estimated by various methods [2] but interactions have motivated a living literature for more than half a century. In ecology, Ripley’s function [3], or its square-root transformation [4], is the most used tool to characterize them [5], assuming the pattern is a realization of a homogenous point process; that is to say the probability to find a point is the same everywhere.

However, identifying interactions under the assumption of nonhomogeneity of space is still an open question. Twenty years ago, Cuzick and Edwards [6] followed by Diggle and Chetwynd [7] paved the way by introducing some specific tests. The function proposed by Diggle and Chetwynd is defined as the difference between the function for the studied points (called cases) and the function for others (called controls). It is not completely satisfactory yet because both functions are computed separately so all the data contained in the relative position of cases and controls is lost. A more recent important advance was proposed by Baddeley et al. [8] who generalized to inhomogeneous point processes. They developed a complete theoretical framework but practical applications are still difficult because assumptions are necessary about the relative scales of heterogeneity and interactions, leading to possibly opposite analyses of the results [9]. A recent review of these methods can be found in [5].

Other tools were also developed by economists [10] with a different approach, comparing the distribution of points of interest relatively to that of other points. Brülhart and Traeger [11] call them relative measures, as opposed to topographic measures such as which take space (measured by areas) as their benchmark.

In this study, we introduce a relative measure of spatial structure, namely, the function [12] to extend the ecologists’ toolbox and allow a more pertinent approach when the null hypothesis is that points of interest are distributed like others. It bypasses the issue of heterogeneity and allows weighting points. Moreover, its computation is easy.

The paper is organized as follows: first, we derive the function as a generalization of Ripley’s function; then, we apply it to theoretical examples and real data sets in a tropical forest and in epidemiology; we finally discuss the way it can be usefully applied after clarifying the assumptions it relies on.

2. Methods

2.1. Ripley’s Function
2.1.1. Definition

The theoretical framework is a point process whose realization is observed in a window of area . Nonparametric methods such as Ripley’s are used to reject the null hypothesis of independence of points. To use Ripley’s , we will assume that the point process is stationary (i.e., its intensity does not vary by translation). All point processes used in this paper are second-order stationary (i.e., interactions between points do not vary under translation) and isotropic (i.e., they do not depend on direction). The null hypothesis to reject is therefore complete spatial randomness (CSR); that is, the point pattern is a realization of a homogenous Poisson process. More details on can be found in [5] or [13]. We focus on its estimator here.

Points are denoted . We call a point ’s neighbors all the points less than apart from it (all points in a disk of radius centered on the point ).

Ripley’s function estimator was built by counting neighbors (indexed by ) around reference points (indexed by ), which can belong to the same type (univariate function) or not (intertype or bivariate function). points are found in the window, and we denote the indicator function equal to 1 if the distance between and is less than or equal to , 0 else.

An unbiased estimator of univariate [14] with no edge-effect correction is The bivariate version of (denoted ) is very similar. We denote the number of reference points and the number of neighbors. We have

2.1.2. Edge-Effect Correction

Points located close to the window borders are problematic because a part of the circle inside which points are supposed to be counted is outside the window. Various answers have been proposed to correct for this [15, 16]. We prefer Besag’s [4] correction. Let us denote the part of the area of the circle of radius r centered on the point located inside the window. We count the number of neighbors inside the circle, and we correct it by the ratio between the circle’s area and its inside part. We suppose that the outside part of the circle would have contained the same neighbor density than the inside part. Finally, an unbiased estimator of with edge-effect correction is

2.1.3. Normalization

Besag [4] proposed to normalize to obtain a benchmark of rather than . The well-known function is defined as . It can be interpreted as a distance [17]: means that as many neighbors are found around reference points up to distance as would be expected at distance under CSR. We believe that is a better normalization. Its reference value is 1, and it can be interpreted as the density of neighbors around reference points divided by the density of neighbors anywhere.

2.2. The Function
2.2.1. Definition of

Equation (3) can be rearranged: Around each point , is the number of neighbors divided by the area where it is counted. Its average value is compared to what it is expected to be all over the window, .

Topographic measures like use space as their benchmark; that is, the number of points is divided by an area. The benchmark may also be another point pattern; for example, the number of trees of a species under study may be divided by the total number of neighbor trees, defining relative measures.

We transpose into a relative framework. The ratio is now built comparing a number of neighbors of interest to the total number of neighbors. Weights can be associated to points without changing the construction of the measure. Reference points are indexed by ( is a reference point), neighbor points by ; all points whatever their type (i.e., the benchmark) by ; their numbers are , , and . is the weight of point , is the total weight of this type of points.

The average weighted ratio of neighbor points around reference points is

In the whole window, the same ratio is if neighbor points and reference points belong to the same type, else. If reference and neighbor points belong to the same type, and .

We define the univariate function asThe bivariate function isEquations (5) and (6) are simplified when points are not weighted:

2.2.2. Case-Control Design

A particular attention must be paid to case-control designs. In practical terms, all points of interest (called cases) are carefully referenced, and the benchmark point set (called controls) is just sampled. This approach has been widely used for spatial clustering of diseases [7, 9, 1820]: sick people are the cases and the rest of the population the controls. Case-control design is of course not limitative to geographical epidemiology and can easily be applied to ecology questions.

The function defined previously can be slightly modified to take into account this feature. Since the controls are chosen to be a representative sample of the population at every scale, neighbors of any kind are replaced by controls, indexed by . Reference and neighbor points are cases; their total weight is . After simplifications, can be written as follows:

2.2.3. Significance

The first-order property (intensity) of the process must be controlled to allow the detection of the second-order property (nonindependence of points, that is to say interactions between the objects they represent). Thus, a point distribution generated according to the null hypothesis must respect, on the one hand, the local values of the density of the process the point distribution is a realization of and, on the other hand, its points must be distributed independently from each other.

The practical difficulty comes from the lack of knowledge of the point process that gave the point distribution, which is its unique available realization. Its first-order property is consequently widely unknown. We can only assume that the actual set of point locations is a good approximation of it, following Duranton and Overman [10]. Consequently, we generate random data sets for the univariate and case-control functions by redistributing the actual point set (type and weight couples) on the actual location set (coordinates). The confidence interval of the null hypothesis is then computed by the Monte Carlo technique [21].

The intertype function must support two null hypotheses [22]. The random labeling hypothesis is simulated by permuting the point types, keeping point locations and weights unchanged. The population independence hypothesis is more complex to test. The reference points are kept unchanged, so that the spatial structure of the reference point type is maintained, and all other points are redistributed across the available locations. This allows testing the independence of populations considering the structure of the reference point type. Then, the reference and neighbor types are interchanged and the test is repeated. If both    and    functions leave their null hypothesis confidence envelope in a range of distances, then population independence is rejected. This test requires that some points do not belong to either the reference or the neighbor point type or there will be nothing to redistribute. More generally, testing the relative spatial structure only makes sense if the tested point types are a small part of the point pattern; see discussion Section 4.1.

The tests based on Monte Carlo simulations are actually not correct because they are repeated at each step of the function (see [23] for an extensive discussion). A global test, without the need of simulations, is available only for against CSR [14]. We can follow Loosmore and Ford’s goodness-of-fit (GoF) test to obtain a correct -value to reject the null hypothesis. We first need to compute the average value of on all simulations, more exactly [24]: where is the number of simulations and the value of in the th simulation. Then, the statistic is calculated for the th simulation by summing on all values of , where is the difference between the next value of and the present one: The same statistic for the actual data, denoted , is compared to the simulated values to get a -value:

If is greater than all simulated values, the -value to reject the null hypothesis erroneously is around 0. To avoid 0 or 1 -values, we can assume that another simulation would have given a value of higher or lower than and write or .

2.3. Examples

Three theoretical examples are given. Two of them illustrate very simple point patterns on a homogeneous space for a comparison of and functions (Sections 3.1.1 and 3.1.2). The third one computes an inhomogenous Poisson point process to show how the function controls for the first-order property of point processes (Section 3.1.3). No theoretical example is given with weighted points because they are not so easy to understand visually. Three real point patterns are considered then. They do not allow a classical analysis by the function because of heterogeneity. Cuzick and Edwards [6] introduced the first formal way to deal with nonhomogeneous point processes: they used a dataset (published with the paper) concerning the location of 62 cases of childhood leukemia between 1974 and 1986 in the North Humberside area, England. A control set of 141 children representing the whole concerned population was chosen from the birth register (all weights are 1). They could conclude that the cases were significantly clumped. We use this data set to go further: we are now able to corroborate their conclusion and also to precise the size of aggregates. The function is computed according to the case-control design, (8).

We overall want to provide evidence of the interest of relative spatial structure in ecology. Trees are considered in a 25 ha plot of tropical rainforest in Paracou field station in French Guiana [25]. We investigate the spatial structure of two species, Vouacapoua americana Aublet (Caesalpiniaceae) and Qualea rosea Aublet (Vochysiaceae) in a point set of 11,276 trees above 10 cm diameter at breast height (DBH), excluding flooded zones. All trees above 1 cm diameter have been measured and plotted for a few species, allowing us to study the spatial relations between saplings (up to 10 cm DBH) and possibly reproductive trees (30 cm or more) of V. americana. Points are weighted by the basal area of the tree they represent, the reference and neighbor points are mentioned in the results, and all trees of the maps, including references and neighbors, are used as the benchmark.

3. Results

3.1. Theoretical Examples

In what follows, we generate a point pattern (“black points,” represented by closed circles in the figures) to investigate its spatial structure with the and univariate functions. Other points (“grey points” represented by grey open circles) are used by only: grey and black points together constitute the benchmark. Black points may be considered as trees of a species of interest in a forest plot, while grey points are all other trees.

All confidence intervals are computed at 1% risk level generated from 10,000 simulations.

3.1.1. Aggregates

200 grey points are completely randomly distributed. Black points are generated by a Matérn process [26]: 5 aggregates (radius 0.5) of 5 points. All point weights equal 1. The map is in Figure 1; the curves are in Figure 2.

The curve shape is similar to ’s: significant, positive peaks denote concentration. The benchmark points of the function are distributed almost homogeneously so the number of neighbors around each point is proportional to the area: the relative and the topographic measures are nearly equivalent. Nevertheless, while peaks approximately correspond to the diameter of aggregates [27], peaks occur exactly at distances at which the local density is the greatest, that is, approximately the distance between points in the aggregates. The differences are due to the way is normalized: peaks occur at the same distance as those of (not shown on the figure).

3.1.2. Regularity

200 grey points are drawn from a homogenous Poisson process again. 100 black points have a regular distribution around a square, 1-by-1 grid, with a perturbation: each point is randomly moved horizontally and vertically within a 0.4 interval around the grid nodes (Figure 3). All point weights equal 1.

The first part of the univariate curve (Figure 4) is made of 0 values, showing the absence of neighbors at any distance up to 0.6. Note that the univariate curve shape is different since its original value is 0 and its minimum slope is −1 by construction.

Negative peaks of both the univariate and functions allow detecting the grid size (before , and 3). Maximum values correspond to the diagonal of the grid ( is the diagonal length, then and so on).

3.1.3. Inhomogeneous Point Set

We generated two completely random point sets in a 10-by-10 window. Then, we transformed the point coordinates: after having calculated the polar coordinates of each point from the center of the point set, we squared the distance to get . The result is a nonhomogeneous Poisson pattern, shown in Figure 5. Both point types have the same random distribution, but the center of the map shows a greater density.

It can be seen (Figure 6) that the function is not applicable: assuming homogeneity, it interprets the black point distribution as a single big aggregate. This issue is known as “virtual aggregation” [28, 29]. The function is able to control for density variations: since the pattern of the black points does not differ from that of all points, values are around 1.

3.2. Cuzick and Edwards Data Set

The case-control function is used (Figure 7). 0.7 km apart from cases, the average case density is about 70% higher than it would be if the cases followed the control pattern (at this distance, the peak of the function reaches 1.7).

In the discussion of [6], Diggle (page 101) suggested that the function, equal to , would be appropriate for this point pattern. The next year, Diggle and Chetwynd [7] published the function and applied it to the same dataset. We recomputed considering the rectangle window shown in the figure. This data set has been widely used and gave slightly different results according to the window definition in [7, page 1160] or [30, page 634]. It can be seen that the and functions give the same results if points are not weighted. Nevertheless, values cannot be interpreted easily and cannot be compared accross distances.

Both methods suffer here a severe lack of power due to the very little number of controls. The confidence envelopes are computed at 10% levels (from 1000 simulations). The GoF test applied to returns . Diggle and Chetwynd obtained a -value equal to 14% for . Increasing the number of controls would not have been a real problem if the experimental design had included a distance-based point pattern analysis.

3.3. V. americana and Q. rosea

The dataset (map in Figure 8) contains 156 V. americana, 388 Q. rosea, and 10,732 other trees in a 25 ha plot where flooded zones have been excluded, leaving a 20.06 ha, polygonal shape for the study.

Aggregation of both species is detected up from 4–6 meters (Figures 9(a) and 9(d)) by the univariate function. The species repulse each other (Figures 9(b) and 9(c)). in all cases, according to the bivariate function. All trees are used as the benchmark in all analyses.

These results suggest competition if our null hypothesis is correct: we suppose that both species could locate anywhere if the other did not impede it. Of course, it might be wrong so further work is necessary to test alternate hypotheses: the environment may be different and niche preferences may be the reason for segregation, or else the spatial distribution of populations may not be in equilibrium, and we may be observing the contact of two colonization fronts.

3.4. V. americana Regeneration

The density of V. americana is very variable (Figure 10(a)). The univariate function is applied to saplings (reference and neighbor points are saplings, the benchmark is all trees; weights are basal areas). Saplings are aggregated (Figure 10(b)) at all distances, . The intertype function shows that potentially reproducing trees repulse saplings (Figure 10(c)), with significant results up to 15 meters. Actually, (reference points are potentially reproducing trees, neighbors are saplings, and the benchmark is all trees; weights are basal areas).

Jansen et al. [31] already mentioned the absence of seedlings around adult V. americana at short distances (6 meters). Our results show that no sapling can be found less than 3 meters apart from adult trees, but also that the relative abundance of saplings among neighbors is low (these results are significant according to Monte-Carlo simulations), suggesting that Vouacapoua regeneration follows the Janzen-Connell hypothesis [32, 33].

4. Discussion

4.1. Using M

The theoretical examples illustrate that is equivalent to when applied to a homogenous, nonweighted point pattern. The univariate function is not affected by virtual aggregation when the point pattern is not homogenous, using all points as its benchmark. This means that the points of interest should be a small enough fraction of the whole point set to allow considering the latter as a valid control set. The case-control approach is meaningful when the points of interest must not be included in the control set. Then, a sufficient number of controls is required, or the test against the null hypothesis of independence of points will not be powerful.

Although the function requires several point types, it is completely different from a bivariate or function. This may be illustrated by the example of Section 3.1.1: the univariate function characterizes the spatial structure of black points, exactly like ; grey points are added to black points to obtain a benchmark for (the number of points whatever their type less than apart from each black point), while the disk area is the benchmark for . In Section 3.3, the bivariate function could be computed instead of the bivariate function: it would only consider the 156 V. americana and 388 Q. rosea trees (and suppose both are distributed homogenously), while also includes the 10,732 other trees to constitute its benchmark.

Figure 9(a) shows a good example of the behavior of the function. Confidence envelopes are around the expected value equal to 1. At very low distances, they are not defined if no pair of points less than apart exists, and the confidence interval is very wide then because of stochasticity, amplified by the little number of point pairs. At long distances, all values tend to 1. When point weights are not homogenously distributed, the envelope is not around 1 (Figure 10). Heuristically, measures the spatial structure of square centimeters of basal areas of trees: when points are redistributed independently from each other under the null hypothesis, square centimeters are still aggregated. More or less aggregation than under the null hypothesis is detected relatively to its envelope and by the GoF test. As shown by Loosmore and Ford [23], the classical, Monte-Carlo-generated confidence envelope may be too optimistic: while the curve is clearly out of the 1% envelope, the -value for V. americana regeneration (Figure 10) is only 6.7% (but no power study for the GoF has been conducted as far as we know).

4.2. Relative versus Topographic Measures

Distance-based measures of spatial concentration can be classified into two main categories [34] following Brülhart and Traeger [11]. Topographic ones compare a number of neighbors to a measure of space (a surface area) while relative ones compare it to another number of neighbors. All indices used in ecology are topographic, except for Diggle and Chetwynd [7] . On the opposite, economists, who are often interested in the spatial distribution of firms on a territory, mostly use relative measures: using the distribution of the whole industry as a benchmark to study the spatial structure of an economic sector is even one of the good properties a measure must respect according to Duranton and Overman [10]. We believe both frameworks can help addressing ecological questions, as they allow different null models to be tested.

The topographic toolbox is already well furnished, with Baddeley et al.’s [8] and Wiegand and Moloney’s [28, 35] O-ring allowing to deal with inhomogenous point patterns. Diggle et al. [9] separated control points to evaluate intensity and cases to evaluate dependence in ; that is to say they used as a relative measure. The function is designed for this purpose. It is similar to with a simple box kernel [13] with bandwidth parameter used to estimate density around each reference point, but it also allows weighting points. The relative structure of basal areas of trees is more meaningful than that of individuals in many applications (biomass spatial structure, competition for light, etc.): roughly speaking, a big tree is often more similar to many small trees than to a single one.

5. Conclusion

The function is defined as a generalization of Ripley’s function to allow its application to inhomogeneous point processes and to take into account point weights. From a more theoretical point of view, it is a weighted, relative measure of spatial structure. We believe that relative measures (comparing a point pattern to another) are powerful tools, even though the topographic approach is more used.

To allow the effective use of the function, we developed the necessary code for [36], available as a supplementary material available online at http://dx.doi.org/10.1155/2012/619281.

Acknowledgments

The authors thank the editor and an anonymous referee for useful suggestions. This work has benefited from an “Investissement d’Avenir” grant managed by Agence Nationale de la Recherche (CEBA, ref. ANR-10-LABX-0025). This paper partially incorporates earlier unpublished work written by E. Marcon and F. Puech (generalizing Ripley's function to inhomogeneous populations, halshs-00372631, version 1).

Supplementary Materials

R code used to compute the M function. The code is fully commented and an example is provided.

  1. Supplementary Material