Journal of Probability and Statistics

Volume 2016, Article ID 1285026, 9 pages

http://dx.doi.org/10.1155/2016/1285026

## Exploratory Methods for the Study of Incomplete and Intersecting Shape Boundaries from Landmark Data

Fathi M. O. Hamed^{1} and Robert G. Aykroyd^{2}

^{1}University of Benghazi, Benghazi, Libya

^{2}University of Leeds, Leeds, UK

Received 25 July 2016; Revised 10 October 2016; Accepted 12 October 2016

Academic Editor: Z. D. Bai

Copyright © 2016 Fathi M. O. Hamed and Robert G. Aykroyd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Structured spatial point patterns appear in many applications within the natural sciences. The points often record the location of key features, called landmarks, on continuous object boundaries, such as anatomical features on a human face. In other situations, the points may simply be arbitrarily spaced marks along a smooth curve, such as on handwritten numbers. This paper proposes novel exploratory methods for the identification of structure within point datasets. In particular, points are linked together to form curves which estimate the original shape, of which the points are the only recorded information. Nonparametric regression methods are applied to polar coordinate variables obtained from the point locations, and periodic modelling allows closed curves to be fitted even when data are available on only part of the boundary. Further, the model allows discontinuities to be identified to describe rapid changes in the curves. These generalizations are particularly important when the points represent shapes which are occluded or are intersecting. A range of real-data examples is used to motivate the modelling and to illustrate the flexibility of the approach. The method successfully identifies the underlying structure, and its output could also be used as the basis for further analysis.

#### 1. Introduction

Many scientific investigations involve the recording of spatially located data. These data might summarize objects within an image as digitized versions of continuous curves. Once the data are collected, the original context is often lost, and the aim of the analysis is to identify which points are associated with each other and to link the points to reconstruct the original shape. These can then be seen as estimates of continuous curves and object outlines. If the original scene contains multiple structures, then the analysis must also divide the points into groups, with separate curves used to describe the points in each group. It is important to note that this is likely to form only the first part of an analysis and hence can be seen as exploratory data analysis.

This paper looks at the use of smoothing splines to identify and describe geometric patterns in sets of points. It is assumed that the points lie on smooth curves but that a dataset may contain multiple intersecting curves. It is vital that this be done in a nonparametric way so that the widest possible range of patterns can be highlighted. In general, these are closed, or nearly closed, curves, and so a transformation to polar coordinates is used to simplify the analysis. Intersecting curves are described by allowing discontinuities in the fitted curves. These procedures are illustrated using simulated data and varied real datasets describing human faces, gorilla skulls, handwritten number 3's, and an archaeological site. These provide a wide variety of point patterns and reinforce the general usefulness of the proposed methods. For detailed mathematical descriptions and applications of shape-based analysis of points, see, for example, Batschelet [1], Bookstein [2], Dryden and Mardia [3], and Lele and Richtsmeier [4].
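The polar coordinate transformation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the pole is placed at the centroid of the landmark set, with each point then described by an angle and a radius.

```python
import numpy as np

def to_polar(points):
    """Convert (x, y) landmark points to polar coordinates (theta, r)
    measured about the centroid of the point set."""
    pts = np.asarray(points, dtype=float)
    centre = pts.mean(axis=0)      # centroid used as the pole
    dx, dy = (pts - centre).T
    r = np.hypot(dx, dy)           # radial distance from the centroid
    theta = np.arctan2(dy, dx)     # angle in (-pi, pi]
    return theta, r

# Hypothetical example: four points on a unit circle centred at (2, 3)
landmarks = [(3, 3), (2, 4), (1, 3), (2, 2)]
theta, r = to_polar(landmarks)
# For points on a circle about the centroid, all radii are equal
```

Fitting a periodic curve to `r` as a function of `theta` then reduces the closed-boundary problem to a one-dimensional regression.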

To allow for this wide variety of possible curves, a nonparametric fitting approach, such as splines, can be used (see, e.g., [5, 6]). The flexibility is helpful in the exploratory statistical analysis of a dataset, and the results can be used to suggest parametric equations for later analysis. Nonparametric regression is the general name for a range of curve fitting techniques which make few a priori assumptions about the true shape. In nonparametric regression, several different families of basis functions can be used to describe curves; one of the most common bases for smooth curves is the spline. Splines are generally defined as piecewise polynomials in which curve, or line, segments are joined together to form a continuous function. The spline smoothing approach to nonparametric regression is discussed, for example, by Silverman [7] and extended to deal with branching curves, by defining a roughness penalty, by Silverman and Wood [8]. For an introduction to natural cubic splines see Green and Silverman [9]. For further reviews of spline methods in statistics see Wegman and Wright [10], Silverman [11], Silverman [7], Nychka [12], and Wahba [13].

It is important to note that there are many existing general frameworks for performing spline-based regression. Examples include multivariate adaptive regression splines (MARS) [14] and its more robust generalizations, RMARS [15] and RCMARS [16], with a good overview and comparison in [17]. These follow the general approach of generalized additive modelling [18] and give a formal framework for fitting and model selection.

A brief introduction to splines, along with the extension to circular data, is given in Section 2. The main results of this paper are given in Section 3 by considering modelling for single curves with occlusions and multiple intersecting curves. Although simulated examples are used to illustrate, the main real-data examples are given in Section 4. General discussion appears in Section 5.

#### 2. Nonparametric Curve Estimation and Periodic Splines

A smoothing spline is a nonparametric curve estimator that is defined as the solution to a minimization problem. It provides a flexible smooth function for situations in which a simple polynomial or nonlinear regression model is not suitable. For a set of observations $(x_i, y_i)$, $i = 1, \dots, n$, consider a regression problem where the observations are assumed to satisfy
$$ y_i = g(x_i) + \varepsilon_i, \quad i = 1, \dots, n, $$
where the errors $\varepsilon_i$ are uncorrelated with zero mean and constant variance, $\sigma^2$. Then the spline smoothing method uses the data to construct a curve by minimizing the objective function
$$ S(g) = \sum_{i=1}^{n} \left\{ y_i - g(x_i) \right\}^2 + \lambda \int \left\{ g^{(m)}(x) \right\}^2 \, dx, $$
where $g^{(m)}$ represents the $m$th derivative of $g$, with $m$ being a positive integer, and $\lambda$ is a smoothing parameter. For more details of smoothing splines see, for example, Eubank [19], Eubank [6], and Cantoni and Hastie [20]. An alternative definition of the level of smoothing is in terms of an *equivalent degrees of freedom*, Df, which describes the amount of information in the data needed to estimate the residuals. The function *smooth.spline* [21] allows either $\lambda$ or Df to be specified, but the degrees of freedom have been used in what follows as this gives a more intuitive interpretation.
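The penalized objective above can be illustrated with a discrete analogue. The sketch below is a Whittaker-type smoother with a second-difference penalty, written as a hypothetical Python stand-in for the R function *smooth.spline*, not the authors' implementation: replacing the integrated squared derivative by squared second differences turns the minimization into a single linear solve.

```python
import numpy as np

def roughness_penalty_smooth(y, lam):
    """Discrete analogue of the spline objective: find g minimising
    sum_i (y_i - g_i)^2 + lam * sum_i (second difference of g)^2.
    Setting the gradient to zero gives (I + lam * D'D) g = y,
    where D is the second-difference matrix."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second differences
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Hypothetical noisy observations of a smooth curve
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 100)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

g_hat = roughness_penalty_smooth(y, lam=50.0)  # larger lam -> smoother fit
```

As in the continuous case, `lam` trades agreement with the data against roughness of the fitted values.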

The above objective function consists of two parts: the first measures the agreement of the function and the data and the second is a roughness penalty reflecting the total curvature; this can also be interpreted in a Bayesian setting as the likelihood and prior. Hence, for given Df, the estimate of $g$ is given by
$$ \hat{g} = \arg\min_{g} S(g). $$
If Df is large then the function is rough but closely fits the data, whereas when Df is small then the function is smooth but may not fit the data well. Here the choice of Df is made automatically using standard leave-one-out cross-validation [22]; that is,
$$ \mathrm{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left\{ y_i - \hat{g}_{\lambda}^{(-i)}(x_i) \right\}^2, $$
where $\hat{g}_{\lambda}^{(-i)}$ is the fitted spline curve, for given smoothing parameter $\lambda$, with the $i$th data point, $(x_i, y_i)$, being removed. Then $\hat{g}_{\hat{\lambda}}$ is the fitted curve using the cross-validation estimate of the degrees of freedom.
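For any linear smoother $\hat{g} = H y$, the leave-one-out criterion can be computed without refitting, since the deleted residuals satisfy $y_i - \hat{g}^{(-i)}(x_i) = (y_i - \hat{g}(x_i))/(1 - H_{ii})$. The sketch below applies this shortcut, again using a discrete second-difference smoother as an assumed stand-in for *smooth.spline*; the data and grid of smoothing parameters are hypothetical.

```python
import numpy as np

def cv_score(y, lam):
    """Exact leave-one-out CV for the linear smoother g_hat = H y with
    H = (I + lam * D'D)^{-1}: the deleted residuals are
    (y - H y) / (1 - diag(H)), so no refitting is required."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)          # second-difference penalty
    H = np.linalg.inv(np.eye(n) + lam * D.T @ D)
    loo_resid = (y - H @ y) / (1.0 - np.diag(H))
    return np.mean(loo_resid ** 2)

# Choose the smoothing parameter by minimising the CV score over a grid
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 80)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

grid = 10.0 ** np.linspace(-2, 6, 9)
scores = np.array([cv_score(y, lam) for lam in grid])
best_lam = grid[np.argmin(scores)]
```

The minimizing `best_lam` plays the role of the cross-validation estimate $\hat{\lambda}$ above.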

Figure 1 shows fitted curves using splines with different degrees of freedom, Df. The true curve is a sine function observed with additive noise at a fixed signal-to-noise ratio. In (a) Df is about half the value found using cross-validation, which is itself used in (b), with (c) using double the cross-validation degrees of freedom. The small degrees-of-freedom value gives a smoother fitted curve that ignores many of the points in the data, whereas a large value produces a rougher fit which more closely follows the data. The automatic cross-validation choice gives a very good fit to the data, reproducing the sine curve well.
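The qualitative behaviour in Figure 1 can be reproduced with the same kind of discrete roughness-penalty smoother (a hypothetical Python stand-in, with illustrative rather than the paper's actual settings): too little smoothing chases the noise, while too much flattens the sine.

```python
import numpy as np

def smooth(y, lam):
    """Roughness-penalised fit: minimise ||y - g||^2 + lam * ||D2 g||^2."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # second-difference operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0 * np.pi, 100)
truth = np.sin(x)
y = truth + rng.normal(scale=0.2, size=x.size)

# Undersmoothed, moderate, and oversmoothed fits
fits = {lam: smooth(y, lam) for lam in (0.01, 50.0, 1e6)}
errors = {lam: np.mean((g - truth) ** 2) for lam, g in fits.items()}
```

The moderate setting gives the smallest error against the true curve, mirroring panels (a)-(c) of the figure.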