Mathematical Problems in Engineering

Volume 2018, Article ID 4763050, 18 pages

https://doi.org/10.1155/2018/4763050

## Shape Recognition Based on Projected Edges and Global Statistical Features

Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Prater 50/A, Budapest 1083, Hungary

Correspondence should be addressed to Attila Stubendek; stubendek.attila@gmail.com

Received 16 September 2017; Revised 12 January 2018; Accepted 13 February 2018; Published 19 April 2018

Academic Editor: Daniel Zaldivar

Copyright © 2018 Attila Stubendek and Kristóf Karacs. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A combined shape descriptor for object recognition is presented, along with an offline and online learning method. The descriptor is composed of a local edge-based part and global statistical features. We also propose a two-level, nearest neighborhood type multiclass classification method, in which classes are bounded, defining an inherent rejection region. In the first stage, global features are used to filter model instances, in contrast to the second stage, in which the projected edge-based features are compared. Our experimental results show that the combination of independent features leads to increased recognition robustness and speed. The core algorithms map easily to cellular architectures or dedicated VLSI hardware.

#### 1. Introduction

Recognizing shapes is an essential task in computer vision, especially in understanding digital images and image flows. A wide spectrum of application areas relies on shape recognition, including robotics, healthcare, security systems, and assistance for the impaired.

The goal of computer vision is to generate answers to visual queries based on the input image. Depending on the query, several levels can be identified in a vision problem. A typical categorization distinguishes between detection, localization, and recognition. Detection examines whether an object is present; localization determines its position; recognition, in turn, identifies the detected objects, possibly considering their context in the visual scene. However, the definition of the object depends on the task [1, 2]. In typical computer vision systems, the result is computed from the image through its features, as a verified hypothesis [3, 4]. Similar to queries, features may incorporate local details as well as global image properties. If patches or complete contours are extracted from the image, the shape of the resulting region is one of the most important local features besides color, texture, and other details [5].

The key to efficient shape recognition is to use an appropriate representation that comprises all important characteristics of a shape in a compact descriptor. A shape description is considered to be efficient from a recognition point of view if

(i) the representation is compact,
(ii) a metric for the comparison of the feature vectors can be efficiently computed,
(iii) the representation is insensitive to minor changes and noise,
(iv) the description is invariant to several distortions.

The most basic classification of shape descriptions distinguishes between contour-based and region-based techniques. Each method extracts specific features that capture some meaningful aspect of the information in the shape. Using only one feature type thus limits the descriptor in terms of discriminative power and classification performance [6].

Contour-based shape features describe the shape based on its contour lines in various representations, such as contour moments [7, 8], centroid distances and shape signatures [9–11], scale space methods [12], spectral transforms [13, 14], and structural representation [15, 16]. Common drawbacks of contour methods are the complexity of feature matching, representation of holes and detached parts of the shape, and noise sensitivity [6].

Region-based techniques describe the shape based on every point of the shape and mainly represent its global features. Moment invariants are derived as statistical features of the shape points [17]. Orthogonal moment descriptors such as Zernike and Legendre descriptors employ polynomials instead of the moment transform kernels [18–20]. Complex shape moments are robust and their matching is straightforward; however, lower-order moments represent the shape only coarsely, while higher orders are more sensitive to noise and more difficult to derive [21]. The generic Fourier descriptor represents the shape as the 2D Fourier transform of the polar-transformed shape.

The requirement of compactness calls for a maximal level of independence of the feature data without sacrificing comparison and recognition performance. In other words, redundancy in the feature vector is acceptable if it significantly simplifies the subsequent processing of the vector, thus accelerating the classification, and possibly increases the accuracy of the recognition. Combining different features allows capturing different essential aspects of the shape, and although it may introduce redundant data, it also increases robustness [22–24]. However, employing compound feature vectors requires a decision method that suits the different parts of the description. In machine learning, several ensemble classifiers are known that handle compound features, such as boosting, bagging, or stacking [25–27].

Representations of the same real-world object may differ due to several effects such as lighting conditions, camera settings, position, and noise. The major challenges of object detection are to ignore differences in the representation resulting from sensing and preprocessing and to recognize when the difference is caused by different input objects. Several invariance requirements are standard expectations of shape recognition methods, but the exact set of requirements has to be defined for each individual task, considering other parameters as well, such as hardware constraints.

The principal motivation of our work was to create methods for portable vision applications, where safety and reliability are the primary goals, such as aids for the visually impaired, as well as other vision-based recognition systems. The requirements of the application determine the specifications of the algorithms used. We aim to recognize mainly rigid, nonflexible objects in video images, but due to varying image acquisition conditions and poor image quality, a significant amount of noise has to be handled and several invariance requirements have to be fulfilled. The application is valuable only if it is reliable; it is not critical to classify every frame, but false answers can easily cause dangerous situations. Thus minimizing false-positive errors has priority over maximizing the cover ratio. Finally, we preferred algorithms that are suited to dedicated VLSI architectures but provide real-time processing even on standard cell phone CPUs and GPUs.

Visual environments containing real-world objects normally encountered by humans contain a practically infinite number of object classes. Depending on the task, the number of relevant classes may be orders of magnitude smaller than the number of irrelevant ones; thus representing each irrelevant class with a representative instance is not efficient, if feasible at all. Hence, our primary goal is to develop a framework that can handle multiclass recognition problems with only a few classes considered relevant, which requires performance evaluation metrics adapted to this flavor of multiclass classification.

The paper is organized as follows. In Section 2, we review the invariance requirements of a recognition tool. In Section 3, we describe the performance evaluation methods used in the paper. In Section 4, we present our proposed compound description method, the Global Statistical and Projected Principal Edge Description. In Section 5, a gradual classification method is presented, including a limited nearest neighborhood decision. The related online and offline learning methods are presented in Sections 6 and 7. Finally, in Section 8, we show our results, and in Section 9, we conclude with future directions.

#### 2. The Role of Description and Classification

We investigate the classic machine learning decomposition and the role of edges and their appropriate, efficient representation. The estimation of the ground truth is based on limited sensing, resulting in different representations of essentially the same object. The key point of the recognition is a model that draws the boundaries of the output classes. However, classes may differ in various traits; thus the selection of discriminative features is also essential. From this point of view, we divide recognition into feature extraction and classification.

In this paper, we investigate shape recognition that models the decision based on supervised learning, where the model is built from previously labeled inputs denoted as templates; the set of already known inputs is denoted as the training set. Independently of the exact type and behavior of the classifier, classification is a comparison of the input to labeled elements of the training set (or to a model built from the set), where the decision is a function of the representation. The difference between representations of the same object is a result of various distortions that occur during image acquisition and preprocessing. Note that distortions also affect the elements of the training set.

The input shape $S$ is the result of a transformation $T_p$ of the original shape $S^{*}$, where $p$ denotes the parameter(s) of the transformation and $P$ is the set of all possible parameters of the transformation:

$$S = T_p(S^{*}), \qquad p \in P.$$

Similarly, each template shape $S_i$ is the result of a transformation of an original template shape $S_i^{*}$:

$$S_i = T_{p_i}(S_i^{*}), \qquad p_i \in P.$$

The output class $c$ of $S$ is given by a decision function $D$, depending on one or more labeled shapes $S_1, \ldots, S_n$ comprising the representative set $R$:

$$c = D(S, R), \qquad R = \{S_1, \ldots, S_n\}.$$

The task of the recognition is not the reconstruction of the original shape $S^{*}$ by mathematical operations but to classify $S$ independently of the transformations that distort the original and the template shapes and thus to estimate the ground truth class $c^{*}$. From this aspect, a transformation can also be considered as noise, and noise can be considered as a transformation.

In the next paragraphs, we give an overview of the possible distortions of a shape in an object recognition problem and formalize the deviations mathematically. Then we define the ability to represent similarity by formalizing tolerance and invariance, in general and for the target shapes in particular. Finally, we give an overview of possible ways of ensuring invariance and tolerance in a description-based recognition system.

##### 2.1. Distortions in a Shape Description Problem

To find all the possible deviations of a shape, we follow the process by which the binary shape is generated from a real-world object. Although a shape can generally be defined as a multidimensional set of points, in this paper we focus only on 2D shapes that are projections of flat objects in 3D space and on characteristic silhouettes of 3D objects (2D representation of 3D objects from different viewpoints, where the object has to be modeled or multiple shapes are needed to reconstruct it, is not the subject of this paper).

Applying the constraints above, during image acquisition by a camera, where the 3D-2D transformation and the sampling take place, the following geometric and pixel-level deviations may occur:

(a) Rotation of the object in its plane relative to the camera axes
(b) Position difference of the object relative to the camera, which can be split into
(ba) distance difference between the camera and the object
(bb) position difference between the projected camera origin and the object
(c) Angular deviation between the object plane normal vector and the camera projection direction
(d) Appearance of noise due to sensing limitations and sampling errors
(e) Some part of the shape being missing or the shape being joined with another pattern
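A few of these deviations can be simulated directly on a binary pixel grid. The sketch below (illustrative Python using NumPy; the test pattern, random seed, and noise density are arbitrary choices of ours, not from the paper) applies an exact 90-degree rotation for (a), a translation for (bb), and XOR pixel noise for (d):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small binary "L"-shaped test pattern on an 8x8 canvas.
shape = np.zeros((8, 8), dtype=np.uint8)
shape[2:6, 2] = 1
shape[5, 2:5] = 1

# (a) In-plane rotation; 90 degrees is exact on a pixel grid.
rotated = np.rot90(shape)

# (bb) Translation on the canvas (np.roll keeps the canvas size fixed).
translated = np.roll(shape, shift=(1, 2), axis=(0, 1))

# (d) Sensing noise modeled as flipping a random subset of pixels (XOR).
noise_mask = (rng.random(shape.shape) < 0.05).astype(np.uint8)
noisy = shape ^ noise_mask

# Rotation and translation preserve the pixel count; noise generally does not.
assert rotated.sum() == shape.sum()
assert translated.sum() == shape.sum()
```

Scale change and perspective distortion require resampling and are omitted here; they are exactly the cases where sampling noise interacts with the geometric transformation.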

Note that, from practical considerations, geometric variances can be represented in other spaces too. If we consider the characteristic motifs of the shape to be larger than the sampling rate, the deviation in (d) is limited to the sensing noise. However, inappropriate focusing may also cause a loss of shape details that in most cases exceeds the sampling error.

The shape is generated from the input image by various image processing algorithms, such as segmentation, pattern extraction, and morphological operations. We will not investigate these preprocessing phases here, but generally it can be stated that shape generation is a binarization of some characteristic pattern of the image; thus deviation (e) may occur due to varying lighting conditions and ambiguous shape edges.

Summarizing the deviations, we can name those variations of which shape recognition should be independent or for which the similarity index should be proportional to the deviation. From the aspect of the shape, distance variation appears as a different scale of the shape. Positioning variance results in a different location of the shape on the image canvas; rotation of the image in its plane results in a rotated shape. Angular deviation of the image plane together with positioning difference results in perspective variance. Binarization ambiguity and noise not only result in misplaced edge pixels on the desired shape but may also lead to detached shape parts or to holes in the original shape.

##### 2.2. Decomposition Model for Shape Similarity

Variance in the appearance of an object can be modeled in a mathematical sense as noise. We call shapes similar if their difference is due to different observation properties and processing noise. If the shape is rigid, the observation properties reduce to geometrical transformations. To achieve classification consistency across various distortions, we identify two different aspects with respect to these distortions: invariance and tolerance.

Invariance of a recognition engine with respect to a particular type of deviation is defined as the ability to return the same result for all inputs that differ only in the given deviation:

$$D(T_p(S), R) = D(S, R) \qquad \forall p \in P.$$

We speak about tolerance to an effect if a difference in the input causes no difference in the output up to a certain limit $\varepsilon$:

$$D(T_p(S), R) = D(S, R) \qquad \forall p : \|p\| < \varepsilon.$$

Note that the norm $\|p\|$ of the transformation parameter is essentially an abstract function, which cannot be measured directly but can only be estimated based on the transformed shape. Similarly, the limit $\varepsilon$ also represents an abstract value. Both the norm and the limit are determined by the actual interpretation of the similarity.

Tolerance can be defined as a limited, local invariance, and, vice versa, invariance as a global tolerance. Consequently, invariance with respect to an effect implies tolerance over the whole domain, while invariance can be achieved by overlapping regions of tolerance.

The human similarity metric highly depends on the actual task; thus no general statement can be defined on which deviations should be eliminated and which should be tolerated during shape recognition. The environment in some cases does provide some references regarding the projection details. Some of the parameters described above might be fixed, previously adjusted (e.g., relative orientation or position of the camera and the object), or can be derived from the image metadata (e.g., distance of the focused subject of an image and angular difference from the horizontal plane). In these cases, deviations in the given parameters result in a different shape; thus invariance is needed only if the human notion of the shape is not dependent on the distortion, and only tolerance is required if the given parameters are not exact or the human perception does tolerate deviations with a certain limit.

The transformations above can be characterized by their possible outputs. The range of transformation $T$ on a shape $S$ is defined as the set of all possible results of applying $T$ to $S$:

$$R_T(S) = \{\, T_p(S) : p \in P \,\}.$$

In the case of a reversible transformation $T_p$, the inverse transformation is denoted here as $T_p^{-1}$. To represent noise as a transformation, we choose the parameter to be a shape $Q$ and define the noise transformation as

$$N_Q(S) = S \oplus Q,$$

where $\oplus$ stands for the logical X-OR operation. By using this formalization, the random property of the noise transformation is ensured by the random selection of the parameter $Q$. This notation allows us to represent the noise as a reversible operation, where $N_Q^{-1} = N_Q$.
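The involution property $N_Q^{-1} = N_Q$ of this XOR formalization can be checked numerically. The short sketch below (illustrative Python; grid size, densities, and seed are arbitrary choices of ours) applies the same noise pattern twice and recovers the original shape:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random binary shape S and a random noise pattern Q on the same grid.
S = (rng.random((16, 16)) < 0.3).astype(np.uint8)
Q = (rng.random((16, 16)) < 0.05).astype(np.uint8)

# Noise transformation: N_Q(S) = S XOR Q flips exactly the pixels set in Q.
noisy = S ^ Q

# Applying the same noise pattern again undoes it: N_Q(N_Q(S)) = S.
restored = noisy ^ Q
assert np.array_equal(restored, S)
```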

We denote shapes $S_1$ and $S_2$ to be separated by transformation $T$ if there are no parameters $p_1$ and $p_2$ that transform $S_1$ and $S_2$ to the same shape:

$$T_{p_1}(S_1) \neq T_{p_2}(S_2) \qquad \forall p_1, p_2 \in P.$$

If the transformation is reversible, then $S_1$ and $S_2$ are separated by transformation $T$ if

$$S_2 \notin R_T(S_1).$$

If we assume that output classes are separated by transformation $T$ and no reference system is given, the recognition should be invariant to $T$. If the classes are not separated, the recognition should only tolerate the difference caused by $T$.

Without any assumptions about the noise, adding sampling and preprocessing noise to a shape (noise transformation) may result in an arbitrary distortion; no shapes are separated by noise transformation, and thus the recognition should only be tolerant to the noise transformation. If the noise is bounded, the result space is limited.

The geometric transformations, except for the 90-degree perspective distortion, are closed transformations; thus invariance with respect to rotation, scale, and translation and tolerance to perspective distortion are standard requirements in shape recognition. However, the distortions affecting a shape cannot be handled separately: sampling noise can be significant at a low-resolution scale or in a flat perspective view. Hence, scale invariance and perspective tolerance are limited to scales where the essential details of the shape are still present.

Invariance and tolerance regarding different distortions can be ensured in various ways. Feature extraction generalizes the shape from a specific aspect, independently of the effects that are irrelevant for the classification, and classification performs a decision based on a complex distance. Hence, feature extraction is generally responsible for ensuring invariance, and classification for tolerating differences up to a specific limit. However, as described in Section 2.2, invariance can be achieved by continuous tolerance, and tolerance is a partial invariance; thus the encoding of similarities may occur in different parts of the recognition unit. In addition, many classifiers also include generalization power (e.g., through kernel functions).

#### 3. Performance Evaluation in Multiclass Classification

In open-world multiclass recognition problems, only a relatively small subset of the classes is considered relevant for the given task. This is similar to a binary classification scheme with only positive and negative labels, with the difference that inside the positive class we need to be able to differentiate between several “positive” labels, which are considered relevant by themselves, as opposed to the irrelevant ones, among which no differentiation is necessary. More precisely, the relevancy attribute partitions the set of classes into the relevant and the irrelevant subsets.

For appropriate evaluation, performance metrics need to be adapted to this nature. Due to the prevalence of the positive-negative property in this multiclass case, it makes sense to rely on classic binary performance metrics, including recall and precision. To be able to use them, we need to extend the binary confusion matrix scheme of positive and negative decisions. Since we do not differentiate between irrelevant classes, all decisions from and into irrelevant classes are counted as true-negative (TN). True-positive (TP) counts all correct positive, that is, relevant, classifications; false-negative (FN) refers to the number of decisions where a relevant input was classified as irrelevant. False-positive decisions are split into two categories: $FP_R$ indicates the number of false classifications between relevant classes, while $FP_I$ counts decisions where an irrelevant input is classified as a relevant one.

Using this extended taxonomy, precision and recall can be defined as follows:

$$\text{precision} = \frac{TP}{TP + FP_R + FP_I}, \qquad \text{recall} = \frac{TP}{TP + FN + FP_R},$$

where $FP_R$ and $FP_I$ denote false-positive decisions between relevant classes and from irrelevant into relevant classes, respectively. For $F_\beta$, being a weighted average of precision and recall, the definition does not need to be changed:

$$F_\beta = (1 + \beta^2)\,\frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}},$$

with recall weighted as more important for $\beta > 1$ and precision weighted as more important for $\beta < 1$. As we primarily target real-time recognition tasks on video sequences, type II errors have a much lower cost than type I errors. Hence, we have used a value $\beta < 1$, which reflects this preference.
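One plausible implementation of the extended confusion counts and the $F_\beta$ score is sketched below (illustrative Python; the function names, the label encoding, and the placement of the two false-positive types in the denominators are our reading, not the paper's code):

```python
def extended_counts(pairs, relevant):
    """Count TP, TN, FN and the two false-positive types from
    (true_label, predicted_label) pairs, where `relevant` is the set
    of relevant class labels (all other labels are irrelevant)."""
    tp = tn = fn = fp_rel = fp_irr = 0
    for true, pred in pairs:
        if true not in relevant and pred not in relevant:
            tn += 1        # decisions among irrelevant classes
        elif true in relevant and pred == true:
            tp += 1        # correct relevant classification
        elif true in relevant and pred not in relevant:
            fn += 1        # relevant input rejected as irrelevant
        elif true in relevant and pred in relevant:
            fp_rel += 1    # confusion between two relevant classes
        else:
            fp_irr += 1    # irrelevant input accepted as relevant
    return tp, tn, fn, fp_rel, fp_irr

def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall;
    beta < 1 weights precision as more important."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# Toy run: "a" and "b" are relevant classes, "x" is irrelevant.
pairs = [("a", "a"), ("a", "b"), ("a", "x"), ("x", "a"), ("x", "x")]
tp, tn, fn, fp_rel, fp_irr = extended_counts(pairs, {"a", "b"})
precision = tp / (tp + fp_rel + fp_irr)
recall = tp / (tp + fn + fp_rel)
```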

#### 4. The Global Statistical and Projected Principal Edge Description

Since shapes have different properties depending on several aspects and the distinctive characteristics may be encoded in different aspects, a descriptor composed of independent shape features may provide a more accurate representation. As mentioned earlier, the most important aspects are scope (global versus local) and basis (region versus edge). We propose a shape description denoted as Global Statistical and Projected Principal Edge Description (GSPPED) that combines these shape features in order to represent the different aspects.

The descriptor consists of global statistical features and principal edge descriptors representing local characteristics. Structurally, the descriptor is divided into three parts:

(a) A highly expressive header including eccentricity and area fill ratio
(b) A region-based feature set with histogram moments representing global shape properties
(c) A contour-based edge description employing a modified Projected Principal Edge Distribution description for shapes
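The three-part layout can be mirrored in a simple container type. The sketch below is a hypothetical Python structure of ours (field names and types are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class GsppedDescriptor:
    # (a) Header: highly expressive scalar features.
    eccentricity: float
    area_fill_ratio: float
    # (b) Region-based global features: histogram moments.
    histogram_moments: list = field(default_factory=list)
    # (c) Contour-based part: projected principal edge features.
    edge_features: list = field(default_factory=list)

# The header alone already allows cheap filtering of obvious mismatches.
d = GsppedDescriptor(eccentricity=0.9, area_fill_ratio=0.5)
```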

##### 4.1. General Region-Based Global Features

Moments and general statistical features derived from moments are frequently used descriptors in shape and pattern recognition [6, 28]. A series of moments expresses the properties of a shape from basic features to fine details [17]; however, moments of higher orders are more vulnerable to noise and variances in the shape. Thus, in vision applications where patterns belonging to the same class may vary due to camera position or segmentation, using higher-order moments is less effective [21].

The header part of the proposed description aims to depict the shape in the most compressed and expressive way. We are looking for combinations that are perceptually linear but may be calculated by nonlinear operations from easily measurable operands. Eccentricity and area ratio describe the basic outline of the shape; however, they are only suitable as primary features for filtering out obviously false matches [6]. Moreover, they are simple scalars that encode information a human easily understands and finds characteristic. The smaller the eccentricity, the closer the shape is to a circle, while a shape with an eccentricity of one is a line. The area ratio is the ratio of the area occupied by the shape to the area of the minimal rectangle covering the shape.
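A sketch of the header computation is given below (illustrative Python; the moment-based eccentricity formula is one common definition, since the paper does not spell out the exact formula used):

```python
import numpy as np

def header_features(shape):
    """Eccentricity and area fill ratio of a binary shape (2D 0/1 array).
    Eccentricity is derived from the second central moments of the pixel
    coordinates: near 0 for a circle, approaching 1 for a line-like shape."""
    ys, xs = np.nonzero(shape)
    cov = np.cov(np.stack([ys, xs]))              # 2x2 coordinate covariance
    lam = np.sort(np.linalg.eigvalsh(cov))        # lam[0] <= lam[1]
    eccentricity = float(np.sqrt(1.0 - lam[0] / lam[1])) if lam[1] > 0 else 0.0
    # Area fill ratio: shape area over the area of its bounding rectangle.
    bbox_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    area_ratio = len(ys) / bbox_area
    return eccentricity, area_ratio

# A filled rectangle occupies its whole bounding box and is elongated.
ecc, ratio = header_features(np.ones((4, 10), dtype=np.uint8))
assert ratio == 1.0 and 0.8 < ecc < 1.0
```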

The region-based feature set consists of the first four moments of the horizontal and vertical histograms of the shape (Figure 1). Using more moments would enable us to describe the shape in more detail, but we would lose general recognition ability. Thus, we used the first four moments: mean, variance, skewness, and kurtosis. For the sake of simplicity, and without losing dimensional information, the moments are computed from the histograms of the shape. This solution reduces computational complexity compared to 2-dimensional moment calculation and provides advantages when the descriptor is computed on a VLSI architecture. The distribution of the region-based features is shown in Figures 2 and 3.
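A minimal version of this feature set could look as follows (illustrative Python; the standardized-moment formulas are the textbook ones, as the paper does not specify its exact normalization):

```python
import numpy as np

def histogram_moments(shape):
    """Mean, variance, skewness, and kurtosis of the horizontal and
    vertical histograms (projections) of a binary shape, yielding the
    eight region-based features described above."""
    feats = []
    for axis in (0, 1):
        hist = shape.sum(axis=axis).astype(float)  # projection histogram
        p = hist / hist.sum()                      # normalize to a distribution
        x = np.arange(len(hist))
        mean = (p * x).sum()
        var = (p * (x - mean) ** 2).sum()
        std = np.sqrt(var)
        skew = (p * ((x - mean) / std) ** 3).sum() if std > 0 else 0.0
        kurt = (p * ((x - mean) / std) ** 4).sum() if std > 0 else 0.0
        feats.extend([mean, var, skew, kurt])
    return feats

# A symmetric shape has (numerically) zero skewness in both directions.
f = histogram_moments(np.ones((5, 5), dtype=np.uint8))
```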