Abstract

Olive trees have held significant economic and cultural value since the pre-Roman era. In 2019, the Al-Jouf region, in the north of the Kingdom of Saudi Arabia, gained global recognition by entering the Guinness World Records for the largest number of olive trees in the world. Detecting and counting olive trees in satellite images is an important and difficult computer vision problem. Because olive farms are spread over large areas, counting the trees manually is impractical. Moreover, accurate automatic detection and counting of olive trees in satellite images face many challenges, such as scale variations, weather changes, perspective distortions, and orientation changes. Another problem is the lack of a standard database of olive trees for deep learning applications. To address these problems, we first build a large-scale olive dataset dedicated to deep learning research and applications. The dataset consists of 230 RGB images collected over the territory of Al-Jouf, KSA. We then propose an efficient deep learning model (SwinTUnet) for detecting and counting olive trees from satellite imagery. The proposed SwinTUnet is a Unet-like network consisting of an encoder, a decoder, and skip connections. The Swin Transformer block is the fundamental unit of SwinTUnet and is used to learn local and global semantic information. An experimental study on the proposed dataset shows that SwinTUnet outperforms related studies in overall detection, with a 0.94% estimation error.

1. Introduction

The availability of satellite services is enabling promising research and applications, such as detecting, locating, and counting objects (for example, people, vehicles, buildings, and farms), monitoring, 3D reconstruction, and geographical analysis [1].

In computer vision, object counting is a well-known problem that aims to determine how many objects appear in a static image or video frame. It is an active research field with use cases in diverse domains, such as ecological studies, crowd counting, microcell counting, and vehicle counting [2, 3].

In traditional methods, handcrafted features (e.g., SIFT [4] and HOG [5]) are extracted to detect and count olive trees in a still image. Unfortunately, the performance of these methods is affected by many factors, such as scale variations, weather changes, perspective distortions, and orientation changes. Recently, deep learning detection models such as the single shot multibox detector (SSD) [6] and the region convolutional neural network (R-CNN) [7] have achieved high performance and offer a promising solution to these challenges [3]. Despite the success of deep learning methods, there is no standard database of olive trees available for deep learning applications.

Therefore, we first created a large-scale olive dataset for deep learning research and applications. The dataset consists of 230 RGB satellite images collected across Al-Jouf, Saudi Arabia. The images were obtained from Satellites Pro [8], which provides satellite imagery and maps for most countries and cities around the world. To lighten the workload and expedite the annotation process, the olive trees are labeled with center points. The proposed dataset is useful for many deep learning applications involving olive trees, such as detection, counting, and segmentation.

Then, inspired by the success of the Swin Transformer [9], we propose an efficient deep learning model (SwinTUnet) for detecting and counting olive trees. SwinTUnet is a Unet-like network that includes an encoder, a decoder, and skip connections. Instead of the convolution operation, the Swin Transformer is used to better learn local and global semantic information. According to an experimental study on the proposed dataset, SwinTUnet outperforms related studies in overall detection, with a 0.94% estimation error.

The rest of this paper is structured as follows. Section 2 reviews related work. Section 3 explains and discusses the proposed SwinTUnet architecture. Section 4 discusses the experiments and their outcomes. Finally, Section 5 presents the conclusion and future work.

2. Related Work

Remote sensing was first used to identify trees automatically in forestry applications [10]. In recent years, tree detection and enumeration in crop fields have been a priority for the research community, and numerous techniques are available for identifying and counting olive trees. Based on the image processing methods they use, approaches to identifying and counting olive trees in satellite and UAS images can generally be classified into three groups.

2.1. Classification Methods

These methods use image classification to classify and then count the olive trees. Bazi et al. [11] adapted the Gaussian process classifier (GPC) and the erosion morphological filter [12] to automatically count olive trees in a selected zone of Saudi Arabia. The algorithm detected 1124 of 1167 trees, a 96% overall detection rate. The land cover classes comprised nonolive trees, buildings, and ground. Olive trees were counted using blob analysis and classified into land cover classes. Despite the high overall accuracy, the classifier was trained on a small number of training samples, so there was room for improvement.

Moreno-Garcia et al. [13] proposed an approach based on fuzzy logic to classify and then count olive trees in very high resolution (VHR) images using the k-neighbor approach. The method was tested on RGB satellite imagery obtained from the SIGPAC viewer covering an area of Spain. With a 1-in-6 commission rate and a zero omission rate, the results were promising, but the number and diversity of the test images were insufficient.

Peters et al. [14] applied object-based feature extraction and classification to olive landscape mapping, using VHR data from various sensors (optical and radar). A four-step model was used to detect olive trees across the French countryside: segmentation, feature extraction, classification, and, finally, mapping of the results. Synergy methods were applied at each phase by merging features from multiple sensors. The resulting overall accuracy was 84.3%.

2.2. Segmentation Methods

Image segmentation partitions the input image to extract a region of interest (ROI) that holds important information. Many segmentation methods (such as edge detection, thresholding, region growing, and clustering) have been used to identify and count olive trees, both individually and in hybrid approaches.

Moreno-Garcia et al. introduced a technique that uses the K-means algorithm to identify and segment olive trees in satellite imagery [15]. The technique was fast when few clusters were used and could distinguish between ground and satellite data. It yielded a 0% omission error rate and a 1-in-6 commission error rate. Despite these results, the number and diversity of the images tested were insufficient. Waleed et al. used an improved K-means to count olive trees across large areas [16]. The technique includes preprocessing, image segmentation, feature extraction, and classification, with K-means segmentation used to segment the region of interest (ROI). Several classifiers were developed and yielded promising results; the random forest classifier performed well, with a training and testing accuracy of 97.5%.

Another technique applied a multilevel thresholding strategy to segment and extract olive trees from the foreground [17]. The authors proposed a robust and efficient model for accurately segmenting and detecting olive trees in a variety of environments. This technique yielded promising results, with a 96% overall accuracy. However, the presence of nonolive elements in the total count left some room for improvement.

In [2], the authors presented a color-based segmentation application that counts olive trees in images acquired from unmanned aerial systems (UASs) and uses a cloud service. The application produced promising results, counting 330 of 332 trees, but the combination of on-board processing and cloud-based services did not overcome the latency and computational time.

2.3. Detection Methods

Aerial views of trees reveal morphological features that resemble blobs. These blobs appear brighter at the tips when viewed from above, with shadows following them to their base. The Laplace operator, also known as the Laplacian, is a differential operator in the Euclidean space defined by the divergence of a function’s gradient. The Laplacian operator is primarily used for edge and blob detection [1].
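To make the role of the Laplacian in blob detection concrete, the following minimal sketch applies the standard 5-point discrete Laplacian to a toy greyscale patch. It is an illustration only, not the implementation used in the cited works; the function name `laplacian` and the toy patch are assumptions introduced here.

```python
def laplacian(image):
    """Apply the 5-point discrete Laplacian to the interior of a 2D grid."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (image[y - 1][x] + image[y + 1][x]
                         + image[y][x - 1] + image[y][x + 1]
                         - 4.0 * image[y][x])
    return out

# Toy 5 x 5 greyscale patch with one bright "crown" pixel in the middle.
patch = [[0.0] * 5 for _ in range(5)]
patch[2][2] = 1.0
response = laplacian(patch)

# A bright blob is a local intensity maximum, so its centre gives the
# strongest negative Laplacian response.
centre = min((response[y][x], (y, x)) for y in range(5) for x in range(5))[1]
```

In this toy case the most negative response falls on the bright pixel, which is why local extrema of the Laplacian are used as blob candidates.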

To detect olive trees, Karantzalos and Argialas used Laplacian spatial resolution with local maximum points [18]. Their algorithm was tested on greyscale images from the QuickBird and IKONOS satellites. Blob detection is a popular technique because of its simplicity and reliability; however, it is vulnerable to missing olive data. The technique only considered trees with circular morphology and treated every object with those characteristics as an olive tree.

Daliakopoulos et al. introduced a hybrid method that combines the Arbor crown enumerator (ACE) and the Laplacian of Gaussian (LoG) to detect olive trees as blobs in VHR satellite images [19]. The method used red band thresholding and blob detection based on the Normalized Difference Vegetation Index (NDVI) [20]. The hybrid method overcomes the disadvantages of using thresholding and blob detection separately. Detection and counting were accurate, with an estimation error of 1.3%, but this accuracy came at a high computational cost.

Waleed et al. [1] proposed a multistep technique to identify and count olive trees. This technique consists of multiple image optimization and edge detection steps. The information of the red band was extracted from RGB images acquired using the SIGPAC viewer. The single red band is sharpened, and edges are detected after it is extracted. Using morphological reconstruction, the closed edges formed by tree boundaries are transformed into white blobs. With an estimation error of 1.27%, results were generated over a variety of images capturing ground truth information.

The literature documents various techniques for the automatic detection and counting of olive trees. Simple and effective techniques such as image segmentation, as well as training and testing samples, all outperformed complex classifiers. It was also observed that accuracy improved as the image information increased. Although traditional methods have achieved high accuracy, they are not stable and are affected by many challenges of satellite imagery, such as variations in viewpoint, image scale, quality, and orientation. Furthermore, the datasets used lacked the necessary diversity in terms of number of images and ground classes.

Despite the success of the handcrafted feature methods above, their performance is affected by many factors, such as scale variations, weather changes, perspective distortions, and orientation changes. Deep learning models, such as [2, 6, 7], have recently achieved high performance and offer a promising solution to these problems. However, the lack of a standard dataset of olive farms is a major impediment to the use of deep learning techniques in this field. Therefore, we begin by creating a large-scale olive dataset for deep learning research and applications. The dataset consists of 230 RGB images collected across Al-Jouf, Saudi Arabia. We then propose an efficient deep learning model (SwinTUnet) for detecting and counting olive trees.

3. The Proposed Model

Figure 1 depicts the proposed SwinTUnet architecture, which includes an encoder, a decoder, and skip connections. The Swin Transformer block [9] is the fundamental unit of SwinTUnet. The encoder converts the inputs into a sequence of embeddings. The olive satellite images are divided into nonoverlapping patches of size 4 × 4, so each patch has a feature dimension of 4 × 4 × 3 = 48. A linear embedding layer then projects this feature dimension into an arbitrary dimension (denoted C).
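The patch partitioning step can be sketched as follows. This is a minimal pure-Python illustration on a nested-list image, not the authors' code; the function name `patch_partition` and the toy 8 × 8 input are assumptions made for the example, and the learned linear embedding to dimension C is omitted.

```python
def patch_partition(image, patch=4):
    """Split an H x W x C nested-list image into flattened, nonoverlapping patches."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])  # append the channel values of one pixel
            tokens.append(token)
    return tokens

# Toy 8 x 8 RGB image whose pixels store their own (row, column) position.
image = [[[y, x, 0] for x in range(8)] for y in range(8)]
tokens = patch_partition(image)
print(len(tokens), len(tokens[0]))  # number of tokens, feature dimension per token
```

For the toy input this gives 4 tokens of dimension 4 × 4 × 3 = 48; by the same arithmetic, a 224 × 224 input would give 56 × 56 = 3136 tokens.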

Hierarchical feature representations are created by passing the tokens (transformed patches) through several Swin Transformer blocks and patch merging layers. The patch merging layer handles downsampling and dimension increase, while the Swin Transformer block handles feature representation learning. Inspired by U-net [21], we design a symmetric transformer-based decoder, built from Swin Transformer blocks and patch expanding layers, the counterpart of patch merging. The extracted context features are fused with the multiresolution features from the encoder through skip connections to compensate for the loss of spatial information caused by downsampling.

Unlike a patch merging layer, a patch expanding layer is designed to upsample the features: it reshapes the feature maps of adjacent dimensions into a larger feature map with 2× the resolution. The final patch expanding layer performs a 4× upsampling that restores the feature maps to the input resolution (W × H). A linear projection layer is then applied to these upsampled features to produce the density map.
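The text does not state how the target density map is built from the center-point labels or how a count is read from the predicted map. A common convention in counting tasks, assumed here purely for illustration, is to place one unit-mass Gaussian at each annotated center so that the sum of the map equals the number of trees. The function name `density_map`, the value of `sigma`, and the point coordinates below are all hypothetical.

```python
import math

def density_map(points, height, width, sigma=1.5):
    """Sum one unit-mass Gaussian per annotated centre point (assumed scheme)."""
    dmap = [[0.0] * width for _ in range(height)]
    for cy, cx in points:
        kernel = [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        mass = sum(map(sum, kernel))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / mass  # each tree contributes exactly 1
    return dmap

points = [(4, 4), (10, 12), (11, 3)]   # hypothetical tree centres as (row, column)
dmap = density_map(points, height=16, width=16)
count = sum(map(sum, dmap))            # summing the map recovers the tree count
```

Under this assumption, the predicted count for an image is simply the sum of the predicted density map, which is what makes a density-map output suitable for both localization and counting.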

3.1. Swin Transformer Block

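Unlike the traditional multihead self-attention (MSA) module, the Swin Transformer block [9] is based on shifted windows. Figure 2 depicts the Swin Transformer block, which includes a LayerNorm (LN) layer, an MSA module, a residual connection, and two MLP layers. In two successive transformer blocks, the window-based MSA (W-MSA) and shifted window-based MSA (SW-MSA) modules are used, respectively. Using this window partitioning mechanism, successive Swin Transformer blocks can be formulated as follows:

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \\
z^{l} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}, \\
z^{l+1} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},
\end{aligned}
$$

where $\hat{z}^{l}$ and $z^{l}$ are the output features of the (S)W-MSA module and the MLP module of block $l$, respectively. The self-attention is calculated in the same way as in previous research [22, 23]:

$$
\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,
$$

where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ represent the query, key, and value matrices, and $d$ and $M^{2}$ denote the query or key dimension and the number of patches in a window, respectively. The values in $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.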
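As a worked illustration of the attention computation inside one window, the sketch below evaluates softmax(QK^T / sqrt(d) + B)V, the form used by the Swin Transformer [9], for a single head with plain Python lists. It is a didactic sketch rather than the model code: the helper names, the toy values, and the use of an explicit bias matrix in place of a learned relative position bias are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def window_attention(q, k, v, bias):
    """Compute softmax(q k^T / sqrt(d) + bias) v for one window and one head."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        scores = [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d) + bias[i][j]
                  for j in range(len(k))]
        weights = softmax(scores)
        out.append([sum(weights[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out

# Toy window with M^2 = 4 tokens (M = 2) and d = 2; all values are arbitrary.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
k = q
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
zero_bias = [[0.0] * 4 for _ in range(4)]
out = window_attention(q, k, v, zero_bias)
```

Each output row is a weighted average of the value rows; a large bias entry for a given key pulls every token's output toward that key's value, which is how a position bias steers attention within the window.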

3.2. Encoder

In the encoder, two successive Swin Transformer blocks are applied to the input tokens, which have a resolution of H/4 × W/4 and 48 dimensions, for representation learning; the resolution and feature dimension remain unchanged. As the network deepens, patch merging layers reduce the number of tokens to produce a hierarchical representation. The first patch merging layer concatenates the features of each group of 2 × 2 adjacent patches, which gives 4C-dimensional features, and a linear layer then sets the output dimension to 2C. The number of tokens is thus reduced by a factor of 2 × 2 = 4. Swin Transformer blocks are then applied to transform the features, keeping the resolution at H/8 × W/8. This first section of patch merging and feature transformation is referred to as “stage 2.” The procedure is repeated twice more, as “stage 3” and “stage 4,” with output resolutions of H/16 × W/16 and H/32 × W/32, respectively. Four stages are enough to learn deep feature representations, because a transformer that is too deep is difficult to converge [24].
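The bookkeeping of the four encoder stages can be traced with a short sketch. It only tracks shapes, under the assumption that each patch merging layer halves the height and width and doubles the feature dimension; the function name `encoder_shapes` is hypothetical, and `embed_dim=48` follows the 48 dimensions quoted for the stage-1 tokens (treating that as the value of C is an assumption).

```python
def encoder_shapes(height, width, embed_dim, patch=4, stages=4):
    """Trace (rows, cols, channels) of the token grid after each encoder stage."""
    h, w, c = height // patch, width // patch, embed_dim
    shapes = [(h, w, c)]  # stage 1: after patch partitioning and embedding
    for _ in range(stages - 1):
        # Patch merging: concatenate each 2 x 2 neighbourhood (4c features),
        # then a linear layer projects 4c down to 2c.
        h, w, c = h // 2, w // 2, 2 * c
        shapes.append((h, w, c))
    return shapes

shapes = encoder_shapes(224, 224, embed_dim=48)
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")
```

For a 224 × 224 input this traces token grids of 56 × 56, 28 × 28, 14 × 14, and 7 × 7, that is, H/4, H/8, H/16, and H/32, with the feature dimension doubling at each patch merging step.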

3.3. Decoder

The Swin Transformer block is the backbone of both the encoder and the decoder. Unlike the encoder, the decoder uses patch expanding layers instead of patch merging layers to upsample the extracted features. A patch expanding layer reshapes the feature maps of adjacent dimensions into a feature map with 2× the resolution and halves the feature dimension relative to its input. Consider the first patch expanding layer: before upsampling, a linear layer doubles the feature dimension from (W/32 × H/32 × 8C) to (W/32 × H/32 × 16C). A rearrange operation then doubles the resolution and reduces the feature dimension to a quarter of that size (W/32 × H/32 × 16C → W/16 × H/16 × 4C).
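The rearrange step of a patch expanding layer can be sketched on nested lists as follows. The sketch assumes the linear layer has already doubled the feature dimension and that the four channel groups of each token are laid out row by row in the 2 × 2 output neighbourhood; this ordering and the function name `patch_expand_rearrange` are assumptions, since the text only names a rearrange operation.

```python
def patch_expand_rearrange(grid):
    """Rearrange an h x w grid of D-dim vectors into a 2h x 2w grid of D/4-dim vectors."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    assert d % 4 == 0, "feature dimension must be divisible by 4"
    c = d // 4
    out = [[None] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            vec = grid[y][x]
            for p1 in range(2):
                for p2 in range(2):
                    start = (p1 * 2 + p2) * c
                    # Each quarter of the vector becomes one output position.
                    out[2 * y + p1][2 * x + p2] = vec[start:start + c]
    return out

# Toy input: 2 x 2 tokens whose dimension was already doubled to 8 by the linear layer.
grid = [[[10 * (2 * y + x) + i for i in range(8)] for x in range(2)] for y in range(2)]
expanded = patch_expand_rearrange(grid)
```

The toy 2 × 2 × 8 input becomes a 4 × 4 × 2 output: the resolution doubles and the feature dimension drops to a quarter, so relative to the layer's original input the dimension is halved overall.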

As in U-net [21], skip connections are used to fuse the multiscale features from the encoder with the upsampled features. To reduce the loss of spatial information caused by downsampling, we concatenate the shallow and deep features. A linear layer then keeps the dimension of the concatenated features the same as that of the upsampled features.

3.4. Implementation Details

We implemented the model using the PyTorch library [25] and trained and tested it on an NVIDIA GeForce RTX 2060 GPU. Before training, random data augmentations such as rotation, scaling, and flipping are applied to increase data variety. The input images are resized to 224 × 224, which avoids running out of GPU memory during training. The model parameters are initialized with weights pretrained on ImageNet-1K [26]. During training, the model is optimized by backpropagation using the SGD optimizer [27] with a momentum of 0.9 and a weight decay of 1e−4.
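To show what the stated momentum and weight decay values do in a single update, the sketch below implements one common formulation of SGD with momentum and L2 weight decay in plain Python. It is not the training code: the function name `sgd_step`, the toy parameters and gradients, and in particular the learning rate (which the text does not state) are placeholders.

```python
def sgd_step(params, grads, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay, applied in place."""
    for i, (p, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * p                  # weight decay added to the gradient
        velocity[i] = momentum * velocity[i] + g  # momentum buffer accumulates gradients
        params[i] = p - lr * velocity[i]
    return params

params = [1.0, -2.0]
velocity = [0.0, 0.0]
# lr=0.01 is a placeholder value; the learning rate is not given in the text.
sgd_step(params, grads=[0.5, 0.5], velocity=velocity, lr=0.01)
```

With a zero-initialized buffer, the first step reduces to plain gradient descent on the decayed gradient; on later steps the buffer carries 90% of the previous update direction forward.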

4. Experiments and Discussion

This section discusses the proposed olive tree dataset used to evaluate our model. It also covers the metrics that were used to assess the proposed model’s performance.

4.1. Dataset

According to 2019 statistics from the Ministry of Environment and Water Branch in Al-Jouf, the Al-Jouf region hosts 30 million trees, foremost among which are olive trees (18 million trees), which annually produce 10 thousand tons of oil [8, 9]. Our dataset therefore consists of 230 images gathered over the territory of Al-Jouf, KSA, using Satellites Pro, which provides satellite imagery and maps for most countries and cities of the world. The RGB images of the target area have a size of 512 × 512 and 32 bits of information. Some image samples are presented in Figure 3.

The olive trees are marked with center points to reduce the workload and accelerate annotation. In the first step, the olive trees in each image are labeled with bounding boxes that enclose them. The four vertices of each bounding box are denoted as $(x_i, y_i)$, $i = 1, \dots, 4$. In the second step, we calculate the centroid of each box as the center location using the following formula:

$$
(x_c, y_c) = \left(\frac{1}{4}\sum_{i=1}^{4} x_i,\ \frac{1}{4}\sum_{i=1}^{4} y_i\right).
$$
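The conversion from bounding boxes to center points can be sketched in a few lines. The centroid is taken as the mean of the four vertices; the function name `box_center` and the example coordinates are hypothetical and do not come from the dataset.

```python
def box_center(vertices):
    """Return the centroid (x, y) of a box given its four (x, y) vertices."""
    assert len(vertices) == 4, "expected four vertices"
    cx = sum(x for x, _ in vertices) / 4.0
    cy = sum(y for _, y in vertices) / 4.0
    return cx, cy

# Hypothetical bounding-box annotations, each listed as four (x, y) vertices.
boxes = [
    [(10, 20), (30, 20), (30, 40), (10, 40)],
    [(0, 0), (4, 0), (4, 2), (0, 2)],
]
centers = [box_center(b) for b in boxes]
```

Averaging the vertices gives the box center for axis-aligned and rotated rectangles alike, so the same helper would cover both annotation styles.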

4.2. Evaluation Metrics

The performance of our model was evaluated using the following metrics, which also allow different techniques to be compared on different datasets.

(1) Overall accuracy (OA): the percentage of olive trees correctly identified out of the actual total number of olive trees, that is, the share of the trees marked in the ground truth data that are identified correctly. It is expressed in terms of $N_E$, the estimated number of olive trees, and $N_A$, the actual number of olive trees.

(2) Omission error rate (OER): the percentage of positive samples that are misidentified as negative. In other words, OER is the percentage of times the proposed system fails to recognize olive trees as such. It is expressed in terms of $N_m$, the number of omitted olive trees.

(3) Commission error rate (CER): the rate at which negative samples are mistakenly identified as positive. It occurs when nonolive objects appear in the output. It is expressed in terms of $N_f$, the number of false trees detected.

(4) Estimation error (EE): the difference between the actual and estimated numbers of olive trees in the sample, divided by the actual number of olive trees:

$$
\text{EE} = \frac{\left|N_A - N_E\right|}{N_A} \times 100\%.
$$
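The count-based metrics can be sketched as small helper functions. The estimation error follows the verbal definition given above (count difference divided by the actual count). For the omission and commission error rates, the use of the actual tree count as the denominator is an assumption made for this sketch, and all counts below are hypothetical rather than results from the paper.

```python
def estimation_error(n_actual, n_estimated):
    """EE (%): absolute count difference divided by the actual count."""
    return 100.0 * abs(n_actual - n_estimated) / n_actual

def omission_error_rate(n_omitted, n_actual):
    """OER (%): missed olive trees relative to the actual count (assumed denominator)."""
    return 100.0 * n_omitted / n_actual

def commission_error_rate(n_false, n_actual):
    """CER (%): false detections relative to the actual count (assumed denominator)."""
    return 100.0 * n_false / n_actual

# Hypothetical counts for one test image, not taken from the paper.
n_actual, n_omitted, n_false = 1000, 12, 10
n_estimated = n_actual - n_omitted + n_false  # trees reported by the detector
ee = estimation_error(n_actual, n_estimated)
oer = omission_error_rate(n_omitted, n_actual)
cer = commission_error_rate(n_false, n_actual)
```

The example also shows why the estimation error alone can look optimistic: omissions and false detections partly cancel in the total count, so reporting OER and CER alongside EE gives a fuller picture.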

4.3. Overall Evaluation of the Proposed Model

On the test data, the proposed model achieved an overall estimation error of 0.94%. As shown in Table 1, over the full distribution of olive trees, nonolive trees, and other objects, about 0.97% of the nonolive data was misidentified as olive, and 1.2% of the olive data was misidentified as nonolive.

Figure 3 depicts an olive image along with its ground truth and the corresponding detection results. The image contains both widely spaced and closely planted olive trees. Our proposed model correctly identified almost all the olive trees, but it miscounted the young and closely planted trees.

4.4. Comparative Analysis with Related Work

The results of the proposed model were compared with those of existing olive detection and counting techniques. The comparison considers the dataset parameters, the number of processed images, the spectrum representing the size of the processed data, and the evaluated performance. Table 2 presents the results of comparing the proposed model with existing methods.

As shown in Figure 4, the proposed model was evaluated on a large-scale dataset and achieved high accuracy, which indicates that it is accurate and robust. It addresses the shortcomings of related techniques by accurately identifying and counting olive trees. The model was validated on RGB images with an overall accuracy of 98.3%, which surpasses related work. When tested on the large-scale dataset, which contains olive trees and other ground objects, the proposed model had the lowest overall estimation error (0.94%) among the compared techniques. It is worth noting that the proposed dataset includes 230 images containing both olive trees and other objects.

5. Conclusion

In conclusion, we have proposed an effective deep learning model (SwinTUnet) for detecting and counting olive trees from satellite imagery. SwinTUnet is a Unet-like network that includes an encoder, a decoder, and skip connections. Instead of the convolution operation, SwinTUnet adopts the Swin Transformer block to learn local and global semantic information. We also constructed a large-scale olive dataset for deep learning research and applications, consisting of 230 RGB images collected across Al-Jouf, Saudi Arabia. According to an experimental study on the proposed dataset, the SwinTUnet model outperforms related studies in overall detection, with a 0.94% estimation error.

However, the model still has some drawbacks, such as difficulty in identifying olive trees that are close to other trees. In future work, we plan to extend the proposed dataset with more images from various sources and to enhance the proposed model.

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at Jouf University for funding this work through research grant no. DSR-2021-02-0104.