Abstract

This article presents a comparative analysis of classification techniques for assigning land use and land cover (LULC) classes using different strategies (pixel-based, object-based, rule-based, distance-based, and neural-based) with a Sentinel-2A satellite image for 2016. The study area is the city of Sana’a, Yemen, which covers about 18,796.88 km2 of land area. This research aims to present the fundamentals of supervised machine learning approaches, including their limitations and strengths, together with experiments on twelve classifiers. The experiments showed that Random Forest (RF) could be a good choice of classifier for the object-based strategy, whereas Decision Tree Classification (DTC) and Support Vector Machine (SVM) were efficient in the rule-based and pixel-based strategies, respectively. Results also showed that the highest accuracy was obtained with the object-based strategy, followed by the rule-based and then the pixel-based and distance-based strategies.

1. Introduction

Artificial intelligence (AI) techniques play a significant role in LULC classification by detecting patterns in data. The popularity of AI has been growing for many years, and a considerable share of research now employs it; as noted by Alshari and Gawali (2021) [1], remote sensing aids in gathering data about the Earth through satellites.

Machine learning helps to solve a wide range of real-world computing problems. Supervised methods are of two types: classification and regression (Singh and others, 2021) [2]. Classification is a technique for categorizing data into classes (Paul and others, 2021) [3]; its primary purpose is to determine the type or class of each observation. Regression is a method for predicting a numeric output value from training data. When an algorithm divides data into two classes, this is known as binary classification; choosing among more than two classes is called multiclass classification (Alshari and Gawali, 2021) [4].

Detecting land change is a factor in conserving land and in planning its management and development (Khwarahm, 2021) [5]. LULC maps are vital for planning and management (Makwinja and others, 2021) [6] and for monitoring local, regional, and national programs (Nayak, 2021) [7]. Land use and land cover data are needed for policy-making, business (Sarif and Gupta, 2021) [8], and regulatory purposes. Because of its spatial detail, this information is likewise vital for environmental protection and spatial planning (Xie and others, 2021) [9]. Land use classification is indispensable because it provides information that can be used in modelling (Sang and others, 2021) [10], particularly environmental modelling; examples include models of environmental change and of policy development (Bhattacharya and others, 2021) [11].

This study uses the multispectral Sentinel-2A satellite. Sentinel-2A is a medium-resolution satellite and is the first optical Earth observation satellite from the European Space Agency [12].

The software used in this study is SAGA GIS, which is free, open-source software that runs on Windows and Linux computers. “SAGA” is an acronym for System for Automated Geoscientific Analysis. According to the SAGA website, the GIS was created to make the application of spatial algorithms simple and effective [13–15]. It includes an easy-to-use user interface with various visualization options and a rich, growing collection of geoscientific methods [16].

This study surveys the principles of supervised machine learning classifiers and examines their strengths and limitations. Its significance lies in applying the supervised machine learning strategies and identifying the constraints, stability, and weaknesses of each technique, as well as the opportunities and problems it presents. Twelve classifiers from five types of supervised machine learning techniques were applied to compare classification accuracy between the strategies for mapping land-use change, which helps to find flaws in supervised learning so that it can be improved. These points are essential for future users and researchers. The results showed that the object-based (OB) method outperformed all other classification approaches. The Random Forest (RF) classifier (object-based) gave the best result, with an overall accuracy of 99.92% on the Sentinel-2A map of 2016, the highest classification accuracy among the twelve classifiers. The results also indicated differences in performance between the twelve classifiers in the same year, same season, and same weather condition with the different satellites. The best four classifiers among the twelve are RF, KNN, DTC, and SVM. Results also showed that the object-based strategy gave the highest accuracy, followed by the rule-based and then the pixel-based and distance-based strategies.

The critical contributions from this study are as follows: (i) implementation of the twelve classifiers related to the five strategies of machine learning using Sentinel-2A satellite and (ii) presenting the analytical comparison of classification techniques to assign land use and land cover classes from different strategies (pixel-based, object-based, rule-based, distance-based, and neural-based) with a Sentinel-2A satellite image for the year 2016. The structure of this article is as follows: the introduction of this study is given in Section 1; related work and comparison between supervised methods for various strategies of AI techniques and multiple classifiers are discussed in Section 2; methodology and materials related to research area in a case study and data collection, LULC preprocessing, and digital classification are in Section 3; accuracy estimation and kappa coefficient are in Section 4; the results and discussion are given in Section 5; and finally conclusion is given in Section 6.

Several studies on machine learning have lately been published, each with a specific purpose [17–47]. The purpose addressed in this study has not been addressed previously: it narrows down the fundamentals of supervised machine learning and extracts their strengths and flaws. To that end, this study analyzed a large body of literature on applying supervised machine learning algorithms to classify land-use changes. Machine learning is highly popular [17] and widely used, as the previous literature shows, because of its ease, flexibility, speed, and low cost compared with deep learning and other artificial intelligence techniques. The prediction performance of machine learning techniques can be considerably improved through parameter tuning [18]. Machine learning algorithms and tasks can be complex [19], which makes it difficult to select the best learning algorithm for the application at hand [20, 21]. Choosing the wrong learning algorithm produces unanticipated outcomes [22], resulting in wasted effort and a loss of model efficacy and accuracy [23, 24]. According to previous studies, although land change attracts academic interest, the field of LULCC development remains underutilized [25]. Further investigation is needed into ways of improving knowledge discovery with artificial intelligence (AI) techniques [26], which provided a significant motivation for our effort [25, 26]. Comparing the various methods used in this study is therefore important, as described in Table 1 [38, 39].

Table 2 describes the comparison of the features of twelve classifiers implemented in this study which are Random Forest (RF), Decision Tree Classification (DTC), Maximum Likelihood Classifier (MLC), Spectral Angle Mapper (SAM), Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Minimum Distance Classification (MDC), Artificial Neural Networks (ANN), Mahalanobis, Maximum Entropy, Parallelepiped, and Normal Bayes.

Computational Complexity. Computational complexity is a broad subject that extends to areas such as cryptography and quantum computation; the concern here is training time. Suppose an algorithm takes x minutes to train on n points. What happens if it is trained on kn points instead? If the training time becomes kx, the training time is linear in the number of points. It can grow faster: if the new training time is k²x, the training time is quadratic in the number of points. If training already takes a long time on a few thousand points, do not expect to be able to run the procedure on millions of points. Assessing the complexity of a machine learning algorithm is harder than it appears: it may be implementation-dependent, properties of the data may lead to other methods being used, and training time typically depends on parameters passed to the algorithm. In addition, learning algorithms are complicated and rely on other algorithms. The approximations in Table 3 are expressed in terms of n, the number of training samples; p, the number of features; and n_trees, the number of trees (for approaches based on multiple trees) [29, 45, 46].
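The scaling argument above can be made concrete with a short sketch. The function names and the timing figures below are hypothetical and chosen only for illustration; they are not measurements from this study, and real training times rarely follow a clean power law.

```python
import math


def projected_time(x_minutes, k, exponent):
    """Projected training time when the number of points grows from n to k*n.

    exponent=1 is the linear case (k * x); exponent=2 is the quadratic
    case (k**2 * x). Both are idealized scaling assumptions.
    """
    return x_minutes * k ** exponent


def estimated_exponent(n1, t1, n2, t2):
    """Estimate the scaling exponent from two (points, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Hypothetical figures: 2 minutes on n points, then 10 times as many points.
x, k = 2.0, 10
linear = projected_time(x, k, exponent=1)      # k * x
quadratic = projected_time(x, k, exponent=2)   # k**2 * x
print(linear, quadratic)

# Two hypothetical timings: 1,000 points in 2 min and 10,000 points in 200 min.
print(estimated_exponent(1_000, 2.0, 10_000, 200.0))
```

An estimated exponent near 1 suggests roughly linear growth and one near 2 suggests quadratic growth; timing a classifier at two or three sample sizes in this way gives a rough indication of whether it will remain practical on a full scene.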

3. Methodology

The method for accomplishing this study’s aims is as follows. The theory of all machine learning strategies is reviewed, and the features and characteristics of each are determined. Twelve classifiers are implemented on a database of 92 images. The study uses multispectral Sentinel-2A imagery at 20 m resolution, and the results are compared, evaluated, and analyzed to justify the findings. The satellite data processing methodology is, in brief, as follows. Data was collected from a downloaded image product. Processing begins with identifying the real-world data: extract the images, identify the location of the study area, identify the composite band for satellite processing, construct the layers, and finally identify the classifiers. The statistical results and overall classification accuracy are calculated for each image for each satellite. The methodology steps of this study for the Sana’a region are presented in Figure 1.

3.1. Research Area

The city of Sana’a is one of the largest cities in Yemen. It is located in the governorate of the same name and is the case study for this article [28]. The city lies at approximately 15°N 44°E (latitude 15.369445, longitude 44.191006), or 15°22′10.0020″N and 44°11′27.6216″E in GPS coordinates [39]. The total area of the city of Sana’a is 18,796.88 km2 (49 sq mi) in this study. The population was 2,545,000 in 2017 [35]. It is flanked by two mountains (Jabal Naqum to the east and Jabal Eiban to the west) and is surrounded on all sides by the governorate [45]. The city is around 2,200 meters above sea level. Figure 2 shows the study area.
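As a cross-check on the coordinates quoted above, the degree-minute-second values can be converted to decimal degrees. This is a minimal sketch; the function name is hypothetical, and it assumes northern and eastern hemispheres, so no sign handling is included.

```python
def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees, minutes, and seconds to decimal degrees.

    Assumes a northern latitude or eastern longitude (positive result).
    """
    return degrees + minutes / 60 + seconds / 3600


# GPS coordinates of Sana'a as given in the text.
latitude = dms_to_decimal(15, 22, 10.0020)
longitude = dms_to_decimal(44, 11, 27.6216)
print(round(latitude, 6), round(longitude, 6))
```

Decimal degrees are the form most GIS tools expect when the study area is located or clipped by coordinates.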

3.2. Data Collection

This study used images of the Sana’a region obtained from USGS. The SOI Toposheet survey images at a scale of 1 : 50000 were used to prepare the base map [16]. The data was collected from the Sentinel-2A satellite in 2016. Using the single Sentinel-2A sensor allows calibration and comparison when assessing land change. The images generally consisted of maps of various types, dates, scales, and times. This study used images collected from the multispectral Sentinel-2A satellite (10 m resolution). The image data were collected in December 2016. The Sentinel-2A dataset in this study contains twelve images, as described in Table 4.

3.3. LULC Preprocessing

Preprocessing is the primary and essential stage of the LULC process; it includes setting the coordinate reference system and clipping the map to the specific area of interest. It also involves locating the study area exactly, as shown in Figure 3.

Figure 4 shows the identification of the data after download from the satellite. The data subject to preprocessing consist of images projected in WGS84 or WGS84/UTM. Preprocessing corrections are applied to the band combination 432. Preprocessing yields valid data with geometric and radiometric corrections, carried out in this study with QGIS and SAGA software [33]. These operations improve the satellite imagery for classification and rectify the degraded image to produce a more faithful representation of the actual scene [33].

3.4. Digital Classification

This section explains the approach used for general-level LULC mapping of Sana’a city and the specific outcomes obtained using multispectral medium-resolution satellite data. Our investigation reveals that the LULC in Sana’a saw considerable modifications in 2016. This data source can be used to characterize the extent of Sana’a city and to contribute to regional and global environmental models in the long run. There are two groups of classification models, each containing six models, built on the 2016 LULC database to train, validate, and test the methods. The band combination used for classification in this study is RGB 432. Samples were created for six classes: High Land, Mountains, Land Area, Built-up, Vegetation, and Bare Land. Vegetation has been merged with agricultural land. The samples are created from RGB color composites of the Sentinel-2A images; for example, the class Vegetation (red pixels in the RGB = 432 color composite) shows detailed changes in the region. The twelve models for the twelve classifiers are grouped in Figures 5 and 6, and the twelve classifiers are detailed in Figures 5(a)–5(f) and 6(a)–6(f).

4. Accuracy Assessment

Accuracy, the confusion matrix, log-loss, and AUC-ROC are four metrics used to assess classifier performance. This article employs the confusion matrix and the kappa coefficient for accuracy assessment; Figures 7 and 8 show the confusion matrix results for all methods used in this study. A confusion matrix (sometimes called an error matrix) (Table 5) shows how well a classification model or classifier performs on a set of test data for which the true values are known. The confusion matrix itself is simple, but the related terminology can be confusing. It is a tool for comparing the differences between two raster datasets. An error matrix is the most common way of presenting the accuracy of a classification result, including the user’s and producer’s accuracies and the insights gained from error matrices. The columns of the confusion matrix hold the classes to which the validation (ground truth) pixels belong. The confusion matrix is calculated in three steps: first, prepare a validation dataset with known outcome values; second, predict the class of every row in the test dataset; and third, tally the expected outcomes against the predictions.
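The three steps described above can be sketched in a few lines. This is an illustrative example only, not the SAGA GIS implementation: the function names are hypothetical and the eight validation pixels are invented. Following the convention in the text, columns hold the ground-truth classes and rows hold the predicted classes.

```python
from collections import Counter


def confusion_matrix(reference, predicted, classes):
    """Rows are predicted classes; columns are reference (ground truth) classes."""
    counts = Counter(zip(predicted, reference))
    return [[counts[(p, r)] for r in classes] for p in classes]


def overall_accuracy(matrix):
    """Share of validation pixels on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def producers_accuracy(matrix):
    """Diagonal divided by column total; assumes every class has reference pixels."""
    size = len(matrix)
    return [matrix[j][j] / sum(matrix[i][j] for i in range(size)) for j in range(size)]


def users_accuracy(matrix):
    """Diagonal divided by row total; assumes every class was predicted at least once."""
    return [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]


# Hypothetical validation pixels for three of the six classes.
classes = ["Built-up", "Vegetation", "Bare Land"]
reference = ["Built-up", "Built-up", "Built-up", "Vegetation",
             "Vegetation", "Bare Land", "Bare Land", "Bare Land"]
predicted = ["Built-up", "Built-up", "Bare Land", "Vegetation",
             "Vegetation", "Bare Land", "Bare Land", "Built-up"]

matrix = confusion_matrix(reference, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
print("overall accuracy:", overall_accuracy(matrix))
print("producer's accuracy:", producers_accuracy(matrix))
print("user's accuracy:", users_accuracy(matrix))
```

Producer's accuracy reads down a column (how much of a reference class was found), while user's accuracy reads along a row (how much of a mapped class is correct).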

In this study, the SAGA GIS software used for LULC classification automatically calculates the confusion matrix and kappa coefficient, and Excel was used to calculate statistical values. The kappa statistic incorporates the off-diagonal elements of the error matrix and expresses agreement after removing the proportion of agreement expected by chance [42]. Agreement is perfect when the kappa coefficient equals 1. When it is close to zero, the agreement is not much better than what would be expected by chance [43]. The kappa coefficient typically ranges from 0 to 1, with values above 0.7 deemed satisfactory, whereas values of 0.4 or less indicate a poor correspondence between the classified image and the ground truth [44]. Table 5 shows details of the kappa values, and Table 6 shows the overall accuracy and kappa coefficient with the Sentinel-2A satellite calculated in this investigation.
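For reference, the kappa coefficient can be computed from a confusion matrix as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the diagonal share) and p_e is the agreement expected by chance (from the row and column totals). The sketch below is a generic implementation of this standard formula, not the SAGA GIS code, and the matrices are invented for illustration.

```python
def kappa_coefficient(matrix):
    """Cohen's kappa from a square confusion matrix of pixel counts."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of pixels on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: products of matching row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical 3-class matrix with 6 of 8 pixels on the diagonal.
example = [[2, 0, 1],
           [0, 2, 0],
           [1, 0, 2]]
print(kappa_coefficient(example))

# A matrix with no off-diagonal pixels represents perfect agreement.
perfect = [[3, 0],
           [0, 5]]
print(kappa_coefficient(perfect))
```

Because chance agreement is subtracted, kappa is lower than overall accuracy for the same matrix whenever any pixels are misclassified, which is why the two figures are reported side by side.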

5. Results and Discussion

LULC classification was carried out, and Table 7 compares the overall accuracy by LULC strategy for the twelve classifiers. The object-based strategy gave the best result, with the RF classifier at 99.92%. The rule-based strategy was second, with the DTC classifier at 91.49%, and the pixel-based strategy was third, with SVM at 84.56%. Tables 7–9 present the results of the land change classification of Sana’a city with the RF classifier and the Sentinel-2A satellite in 2016, together with the full statistical results. This section discusses the highest and the lowest accuracies recorded in this study. With a multispectral satellite of good resolution such as Sentinel-2A (10 m), objects on the ground are seen closely and are quickly identified by the object-oriented algorithm. Random Forest provides the highest accuracy for several reasons. Compared with the other available classification methods, RF offers a superior way of working with missing data: missing values are substituted by the variable appearing most often in a particular node. The Random Forest technique can also handle big data with variables numbering in the thousands. It can automatically balance datasets when one class is rarer than the other classes in the data. The method also processes variables quickly, making it suitable for complicated tasks. In comparison, the ANN classifier performed poorly with the Sentinel-2A satellite, probably because of the ANN’s characteristics and its classification procedure, which is based solely on training on the data. This is the reason ANN has rarely been employed. In most cases, Artificial Neural Networks (ANN) attempt to identify land classes by training on data for those classes.

6. Conclusion

This study analyzed the nature and qualities of the data and the performance of the learning algorithms to determine the effectiveness and efficiency of a machine-learning-based solution. It explored the performance of twelve supervised machine learning classifiers across various classification strategies. Object-based classification using the Random Forest classifier produced the best results. Obtaining exact LULC maps in a variety of circumstances is difficult in general. Comparing the classification results showed that, when the proper parameters are combined with auxiliary data, the object-based classification technique provides high-accuracy LULC maps. This remains an open topic with much room for research aimed at improving land classification tools. Future study in this field should focus on the basics in order to devise proper methods for overcoming its challenges, such as intraclass variation, which is the first factor to consider. Finally, the effectiveness of any classification method depends heavily on a thorough understanding of the procedures and classifiers, the features of the landscape, and the competence of the user. This study provides a prediction model that future city planners can use toward a better ecosystem. It also showcases the results of twelve machine learning classifiers, which will help future researchers select satellite images and learning algorithms according to their application.

Data Availability

The executables for the 12 classifiers and the statistical results in Excel that support the findings of this study have been deposited in the Google Drive repository (https://drive.google.com/drive/u/0/my-drive). The figures and tables data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.