Abstract

Precision measurement is highly desired in the current industrial revolution, where a significant increase in living standards has also increased municipal solid waste. Current Industry 4.0 standards require accurate and efficient edge computing sensors for solid waste classification. If waste is not managed properly, it brings adverse impacts on health, the economy, and the global environment. All stakeholders need to realize their roles and responsibilities in solid waste generation and recycling, and for recycling to succeed, waste must be separated correctly and efficiently. In the context of inorganic waste classification, the performance achievable on edge computing devices is constrained by computational complexity. Existing research on waste classification used CNN architectures such as AlexNet, which contains about 62,378,344 parameters and requires over 729 million floating-point operations (FLOPs) to classify a single image. As a result, it is too heavy for applications that demand inexpensive computation. This research proposes an enhanced lightweight deep learning model for solid waste classification built on MobileNetV2, which is efficient for lightweight applications including edge computing devices and other mobile applications. The proposed model outperforms existing similar models, achieving an accuracy of 82.48% and 83.46% with Softmax and support vector machine (SVM) classifiers, respectively. Although MobileNetV2 may provide lower accuracy than larger and heavier CNN architectures, its accuracy remains comparable, and it is far more practical for edge computing devices and mobile applications.

1. Introduction

Industry 4.0 standards require cutting-edge solutions to prevent, reduce, and even eradicate solid waste to ensure a pollution-free and sustainable environment [1]. As defined by the Environmental Protection Agency (EPA), municipal solid waste (MSW) is trash from sources that include residential, commercial, and institutional locations, such as businesses, schools, and hospitals [2]. However, the EPA definition of MSW does not include industrial, hazardous, or construction and demolition (C&D) waste. MSW is handled locally by each municipality [3].

Municipal solid waste (MSW) can generally be classified into two categories: organic and inorganic [4]. To break it down further, organic waste consists of food waste and yard waste, while inorganic waste includes plastics, metals, cans, paper, glass, and others [5]. The waste composition differs from country to country, as it is influenced by many factors, such as the level of economic development, cultural norms, geographical location, energy sources, and climate. Inorganic waste such as plastics, paper, and aluminium increases, while the relative share of organic waste decreases, as a country urbanizes and its population becomes wealthier [6]. On the other hand, in low- and middle-income countries, organic waste forms the majority of the urban waste stream, ranging from 40 to 85% of the total [7]. Paper, plastic, glass, and metal fractions increase in the waste streams of middle- and high-income countries [8].

A significant advancement in the standard of living will inevitably increase MSW generation [9]. This is because when the population's spending power increases, consumption of goods and services increases as well [10]. When consumption increases, the population demands more resources such as water, energy, minerals, and land, which increases the amount of waste generated [11]. Along with a higher standard of living, migration from rural areas to urban areas has also contributed to a higher MSW generation rate [12]. Based on a World Bank estimate, about 1.3 billion tons of waste is currently generated annually worldwide, and this amount is expected to increase to 2.2 billion tons annually by 2025 [13]. If waste is not managed properly, it brings adverse impacts on health, the economy, and the global environment. It has also been reported that improper waste management ultimately costs more than managing waste appropriately in the first place. As such, all parties will need to assume more responsibility for waste generation and disposal, specifically in product design and waste separation [14].

Considering that waste is mainly a by-product of the consumer-based lifestyles that drive the world's economies, the fastest way to reduce waste generation is, in most cases, to reduce economic activity [15]. However, a reduction in economic activity is not an attractive option [16]. Consequently, the EPA has recognized MSW recycling as the second most environmentally sound strategy for dealing with urban waste [17].

Recent ICT-driven solutions have played a substantial role in MSW management. State-of-the-art technologies, i.e., computer vision, edge computing, IoT, machine learning, and deep learning, have proven to be enormously supportive in achieving Industry 4.0 standards [18]. The current industry employs the latest MSW management tools that are economically and computationally inexpensive and efficient [19].

The accurate classification of MSW employing machine learning models has considerably benefited MSW stakeholders. [20] proposed machine learning as a useful tool for accurate prediction and management of solid waste in Canada based on socio-economic and demographic variables. The study applied decision trees and neural networks to build the models and ultimately developed an integrated framework for accurate classification and management of the municipal solid waste of 220 municipalities in Ontario. Similarly, [21] developed algorithms to understand the patterns of solid waste generation using machine learning and small area estimation techniques. The study incorporated the prediction of MSW to quantify future estimates of waste generation and ease the recycling of waste materials for reusability. With these advancements, ML algorithms, i.e., regression, naïve Bayes classification, support vector machine classification, decision trees, KNN, random forests, and CNN, have provided computationally inexpensive and fruitful solutions for MSW classification. As a main advantage of these solutions, efficient resource utilisation has yielded economical solutions for local and urban waste management [22]. Thus, they have significantly assisted in achieving a sustainable and pollution-free environment and have enhanced the quality of life of ordinary people [23].

Many publications relate to forecasting and quantifying future solid waste generation, e.g., [24]. Similarly, a good number of publications address MSW management using sensors and IoT devices, e.g., [25, 26]. On the contrary, this study could find little citable work related to the optimization or enhancement of solid waste classification methods developed for edge computing devices or other lightweight applications. Among the recent citations, [27] developed an automated recognition system employing a deep learning algorithm to classify solid waste objects as biodegradable or nonbiodegradable.

Moving on, [28] used two models, namely, a support vector machine (SVM) and a CNN, for waste image classification. The SVM works by mapping features into a space of higher dimension than the original feature space; in that space, the data is divided into categories by a hyperplane, and the SVM selects the plane that produces the largest separation between categories. For the CNN, the study implemented an architecture similar to AlexNet but smaller due to computational constraints. At that point, there were no publicly available datasets, and the study initially considered the Flickr Material Database and Google Images; however, those images could not accurately represent the state of recycled goods. These issues resulted in the creation of a new dataset named TrashNet, which contains around 2,400 images in six different classes. Data augmentation techniques such as random rotation, random brightness control, random translation, random scaling, and random shearing were applied to increase the dataset size. Surprisingly, the SVM performed better than the CNN even though the SVM is a much simpler model; the authors attributed the poor performance of the CNN to suboptimal hyperparameters.

Another work, by [29], proposed an image classification model for MSW management. The study employed a camera to photograph solid waste and categorize it into five defined waste categories, trained on a dataset of 400-500 images. In a similar context, [30] employed SVM for the supervised classification of solid waste, segregating solid waste images into recyclable and nonrecyclable.

The study by [31] leveraged transfer learning with pretrained networks, utilizing AlexNet, GoogLeNet, VGG-16, and ResNet to classify waste images of different categories on the TrashNet dataset. In the final output layer of the CNN, different classifiers were used, namely, Softmax and a support vector machine (SVM). The study split the training and testing data into equal halves without performing any further data augmentation.

[32] used VGG-19 for waste image classification. The study used data augmentation such as shear, rotation, zoom, and shifts, in addition to resizing images to the required input dimensions before feeding the network. The classification layer of the pretrained VGG-19 was removed and replaced by a fully connected layer of 256 neurons with a ReLU activation function, a dropout of 0.6, and batch normalization. The model was trained using a batch size of 32, and the authors observed that a learning rate between 0.001 and 0.0001 yielded better results. The lower validation accuracy indicated overfitting, even though [31] and [32] had managed to train CNN models with high training accuracy. The CNN architectures used, e.g., AlexNet, contain about 62,378,344 parameters, and over 729 million FLOPs are required to classify a single image. A deep architecture such as AlexNet or VGG increases model complexity. As a result, the deployment of these huge CNN models is often unaffordable for common computers and mobile devices. This has given rise to the need for a lightweight CNN architecture that can still achieve good results compared to huge CNN models such as AlexNet and VGG.

To create a lightweight deep neural network for embedded vision and mobile applications, [33] proposed a new architecture named MobileNet, which is based on depthwise separable convolution. A depthwise separable convolution is a depthwise convolution followed by a pointwise convolution. Using depthwise separable layers, the number of multiplication operations required is reduced significantly compared to standard convolution. Despite being smaller in size, MobileNet has managed to achieve good results compared to other popular models such as VGG and GoogLeNet.

Following the success of MobileNet, [34] proposed an improved version named MobileNetV2. The study incorporated two changes to MobileNet. The first change is an expansion layer before the depthwise convolution: the module takes a low-dimensional compressed representation, first expands it to a high-dimensional space, and, after the depthwise convolution, projects it back to a low-dimensional representation with a pointwise convolution. Hence, the expansion layer always has more output channels than the input. The second change proposed by [34] is the inclusion of a residual block similar to ResNet. Each layer has batch normalization, and the activation function used is ReLU6. Only the output of the projection layer has no activation function, because the projection layer produces a low-dimensional output, and applying a nonlinearity to it would destroy useful information.

The magnitude of expansion is defined by the expansion factor t. For example, if there is an input with 24 channels and t is 6, the expansion layer creates a new feature map with 24 × 6 = 144 channels. Subsequently, the depthwise convolution is applied, followed by the projection layer (a 1 × 1 convolution) which projects the 144 channels back into a smaller number, e.g., 24. The expansion and projection module is named the bottleneck layer [35]. Ideally, the number of channels should be large so that more information can be extracted. Combining the expansion layer and the projection layer enables the model to extract more information while keeping the feature dimension rather small.
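For illustration, the following is a minimal sketch, assuming the Keras functional API, of one such inverted residual (bottleneck) block with the 24-channel input and expansion factor 6 described above; it is not the exact implementation from [34], and the residual handling is simplified.

from tensorflow.keras import layers

def inverted_residual(x, in_channels=24, expansion=6, stride=1):
    # Expansion: 1 x 1 pointwise convolution, 24 -> 24 * 6 = 144 channels.
    y = layers.Conv2D(in_channels * expansion, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)                  # ReLU6 activation
    # Depthwise 3 x 3 convolution on the expanded feature map.
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    # Projection: 1 x 1 convolution back to 24 channels, with no activation.
    y = layers.Conv2D(in_channels, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if stride == 1:                                    # residual connection, as in ResNet
        y = layers.Add()([x, y])
    return y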

3. Methodology

In this research, we intend to develop an enhanced image classification model based on the lightweight feature extraction model MobileNetV2 to classify waste images into different categories according to their material. Prior work in waste image classification has largely employed convolutional neural network models that are large and computationally expensive.

Let us consider a collection $X = \{x_1, x_2, \ldots, x_n\}$ that contains $n$ images from an image space. Here, an image is a vector $x$ such that $x \in \mathbb{R}^{h \times w \times c}$ with $h \times w \times c$ dimensions. Let us write $x = (x_{i,j,k})$, where $i$, $j$, and $k$ represent the row, column, and color indices, respectively, defined by $1 \le i \le h$, $1 \le j \le w$, and $1 \le k \le c$.

Figure 1 depicts the general architecture of the proposed classification model. The input vector $x$ is passed through the series of preprocessing steps explained below, and the features extracted from it form the output vector that assists in accurate classification.

We break down the preprocessing steps applied to the image vector $x$ as follows:

(1) We apply horizontal and vertical shearing to image $x$ with coordinates $(u, v)$ as described in Equations (1) and (2):

$u' = u + \lambda_h v, \quad v' = v$   (1)

$u' = u, \quad v' = v + \lambda_v u$   (2)

The horizontal and vertical shears displace the point of interest by the shear factors $\lambda_h$ and $\lambda_v$ to adjust the image for the desired shear.

(2) We apply zooming to the image vector $x$ and obtain another image $x'$ such that $x'$ is a scaled version (with factors $s_u$, $s_v$) of vector $x$. Here, $(u, v)$ is a particular point in image $x$, and zooming maps it to the point $(s_u u, s_v v)$, i.e., with a corresponding displacement. It is important to note that various zoom factors can be applied so that the network generalizes better.

(3) A horizontal flip applied to the image vector $x$ with coordinates $(u, v)$ produces an image with coordinates $(u', v')$ such that $u' = w - 1 - u$ while $v' = v$. We can describe this operation with the following loops.

Loop (1): for u in range(w), i.e., until the width of vector x
    Loop (2): for v in range(h), i.e., until the height of vector x
        x_flipped[v, w - 1 - u] = x[v, u]
    End Loop (2)
End Loop (1)
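In practice, these three operations (shear, zoom, and horizontal flip) are usually applied through a library rather than explicit loops. The following is a minimal sketch assuming the Keras ImageDataGenerator API; the numeric ranges, directory layout, and batch size are illustrative assumptions, not the settings used in this study.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    shear_range=0.2,        # random horizontal/vertical shearing
    zoom_range=0.2,         # random zooming
    horizontal_flip=True,   # random mirroring along the vertical axis
    rescale=1.0 / 255,      # scale pixel values to [0, 1]
)

train_generator = train_datagen.flow_from_directory(
    "dataset/train",        # hypothetical layout: one subfolder per waste class
    target_size=(224, 224), # MobileNetV2 default input size
    batch_size=32,
    class_mode="categorical",
)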

Let us define the component-based scenario of our convolutional neural network as

$x_1 = f_1(x; W_1), \quad x_2 = f_2(x_1; W_2), \quad \ldots, \quad \hat{y} = f_L(x_{L-1}; W_L).$

Here, the vector $x$ is an input image that is handled by the first layer of the network with weight $W_1$ (the function $f_1$ processes the input by applying weight $W_1$, and the result of this computation is forwarded to the second layer). Next, the vector $x_1$, which was the output of the first layer, serves as the input for the second layer managed by weight $W_2$. This process continues for a defined number of layers until the outcome $\hat{y} \in \mathbb{R}^{K}$ (where $K$ represents the number of output channels, i.e., the prediction of an image over two or more classes) is reached. It is important to note that the error is the discrepancy between the last predicted value $\hat{y}$ and the desired target $y$, defined by the Euclidean distance $\lVert y - \hat{y} \rVert_2$. The distance can also be measured using other measures, e.g., the Manhattan distance.

In an ideal scenario, the number of channels should be larger so as to extract more information. Integrating the two layers, i.e., the expansion and projection layers, significantly enables the model to extract additional information while keeping the feature dimension small. The ReLU6 activation function is also significant in this context.

Figure 2 shows the summarized components of the proposed classification model under various scenarios. The adjustment of weights is essential to reduce the error; gradient descent can be adopted for this purpose, as described in Figure 3 and the algorithm below.

Let us consider the slope and intercept to be defined by m_curr and b_curr, respectively, both initialized to 0. Let us assume an initial learning rate of 0.01. Let n represent the length of the input vector x, i.e., n = len(x).

Loop var in range(10000):
    update m_curr and b_curr using the gradients of the mean-squared error (see the sketch below)
End Loop

The range specifies the number of iterations; we can change the range to ensure a considerable reduction in the corresponding error value.
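As a minimal sketch, and assuming a mean-squared-error loss on a simple line fit y = m*x + b (the data vectors x and y are assumed to be given as NumPy arrays), the procedure above can be written in Python as follows.

import numpy as np

def gradient_descent(x, y, rate=0.01, iterations=10000):
    m_curr = b_curr = 0.0          # slope and intercept, both initialized to 0
    n = len(x)
    for _ in range(iterations):
        y_pred = m_curr * x + b_curr
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))   # gradient w.r.t. the slope
        db = (-2.0 / n) * np.sum(y - y_pred)         # gradient w.r.t. the intercept
        m_curr -= rate * dm
        b_curr -= rate * db
    return m_curr, b_curr

Increasing the number of iterations, or adjusting the learning rate, lowers the error further, as noted above.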

Image preprocessing, model simulation, and classification form an end-to-end process described as follows (a sketch of this pipeline is given after the list):

(1) Split the entire dataset into a training set and a test set, whereby the training set can be denoted as a set of labelled examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $N = 2{,}019$, or ~80% of the entire dataset, while the test set has the remaining 508 images.
(2) Perform data augmentation only on the training set to increase the number of training images. The types of augmentation performed are shearing, zooming, and horizontal flipping of images.
(3) Upon performing data augmentation, the size of the training set is doubled to $2N$.
(4) The training set is fed to a pretrained MobileNetV2 CNN, which acts as a feature extractor. Note that the final output layer of the pretrained MobileNetV2 is removed.
(5) The extracted features are used as input to two classifiers, i.e., Softmax and SVM, with 10-fold cross-validation.
(6) Both models are evaluated using the test set.
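A minimal sketch of this pipeline, assuming the Keras MobileNetV2 application and a scikit-learn linear SVM (the kernel choice, the preprocessed arrays X_train and y_train, and the ImageNet weights are assumptions for illustration), is as follows.

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Pretrained MobileNetV2 without its classification head acts as the feature extractor.
extractor = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# X_train: (N, 224, 224, 3) image array, y_train: integer class labels (assumed available).
features = extractor.predict(preprocess_input(X_train))
features = features.reshape(len(features), -1)            # flatten the extracted feature maps

svm = LinearSVC(C=1.0)                                     # SVM classifier on the features
scores = cross_val_score(svm, features, y_train, cv=10)    # 10-fold cross-validation
print("mean 10-fold accuracy:", scores.mean())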

4. Results and Discussion

We have employed TrashNet [32] in this research. The dataset contains 2,527 images from six waste categories, namely, cardboard, glass, metal, paper, plastic, and trash. The images were resized to the input dimensions required by MobileNetV2. Data augmentation such as shearing, rotation, zooming, and shifting was used. We split the dataset into training and test sets for comparative analysis.

We train the model on a system with 16 GB of RAM and an NVIDIA RTX 2060 GPU (6 GB). To utilize the GPU via Python, NVIDIA CUDA and cuDNN are installed on the system.

4.1. Performance of MobileNetV2 Using Softmax

The features extracted by the pretrained MobileNetV2 obtained from Keras were flattened and fed into a newly trained Softmax classifier. As there are six classes in the waste image dataset, the Softmax classifier layer has 6 nodes. This model is used as the base model, against which MobileNetV2 with an SVM classifier is compared. Table 1 shows the accuracy, loss, precision, and recall throughout the 10-fold cross-validation. The 10-fold cross-validation gives every data point an equal chance of being used in training and validation, which yields a less biased estimate of the accuracy, precision, and recall.
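A minimal sketch of this Softmax head, assuming the Keras functional API and a hypothetical flattened feature size (the 7 x 7 x 1280 shape is the typical MobileNetV2 output for a 224 x 224 input, not a value reported in this study), could look as follows.

from tensorflow.keras import layers, Model, Input

feature_dim = 7 * 7 * 1280                                 # assumed size of the flattened feature vector
inputs = Input(shape=(feature_dim,))
outputs = layers.Dense(6, activation="softmax")(inputs)    # one node per waste category
softmax_head = Model(inputs, outputs)
softmax_head.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])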

In terms of accuracy, the average accuracy over the 10-fold cross-validation is 88.85%, with an average precision of 0.91 and an average recall of 0.91. Generally, precision and recall close to 1 indicate that the model performs well in predicting all 6 categories contained in our dataset. The accuracy of the model on the test set is 79.53%.

Figure 4 presents the confusion matrix of different objects and their classification. The blue cells depict the correct identification of objects related to other objects. Further, the precision and recall are shown in Table 2.

4.2. Performance of MobileNetV2 Using SVM Classifier

Similar to the base model presented above, the MobileNetV2 SVM model received the same features extracted by the pretrained MobileNetV2; instead of the Softmax function, an SVM was used as the classifier. The C parameter of the SVM dictates the amount of regularization performed. A small C value leads the SVM classifier to set a larger margin when performing classification, while a larger C value leads to a smaller margin. A larger margin (smaller C value) increases the misclassification rate on the training set but tends to perform better on the test set as it generalizes more. In this study, the default value of C = 1.0 is used.
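To make the trade-off concrete, the following illustrative lines (assuming scikit-learn's SVC, whose default C is 1.0) show how different C values would be configured; the 0.1 and 10.0 values are arbitrary examples, not settings explored in this study.

from sklearn.svm import SVC

svm_wide_margin   = SVC(C=0.1)    # stronger regularization, wider margin
svm_default       = SVC(C=1.0)    # default value of 1.0, as used in this study
svm_narrow_margin = SVC(C=10.0)   # weaker regularization, narrower margin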

Table 3 shows the accuracy, loss, precision, and recall throughout the 10-fold cross-validation for the SVM. The SVM model resulted in an average accuracy of 94.28%, an average precision of 0.947, and an average recall of 0.931, which are higher than the 10-fold cross-validation results of the base model (Softmax classifier).

Figure 5 presents the confusion matrix of different objects and their classification. The blue cells depict the correct identification of objects related to other objects. Further, the precision and recall are shown in Table 4.

MobileNetV2 with the SVM classifier is then evaluated on the test set, where it achieves an accuracy of 83.26%, approximately 4% higher than the base model. This aligns with the results obtained by [31], which also reported that the SVM classifier is more accurate than the Softmax classifier when paired with AlexNet, GoogLeNet, ResNet, and VGG-16.

4.3. Performance of MobileNetV2 with Global Average Pooling Using Softmax Classifier

Based on the results above, it was shown that the SVM model is more accurate than the base model. However, the base model could be further enhanced to increase the accuracy.

A global average pooling layer is added to the model. To illustrate how the global average pooling layer works: given that the feature map extracted by the pretrained MobileNetV2 has spatial dimensions $h \times w$ with $c$ channels, each $h \times w$ map is reduced to its average after global average pooling, so the output dimension becomes $1 \times 1 \times c$, i.e., a $c$-dimensional vector.
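As a small illustrative sketch (the 7 x 7 x 1280 shape is assumed here as the typical MobileNetV2 output for a 224 x 224 input, not a dimension reported in this study), global average pooling behaves as follows.

import numpy as np
from tensorflow.keras import layers

feature_map = np.random.rand(1, 7, 7, 1280).astype("float32")   # one assumed feature map
pooled = layers.GlobalAveragePooling2D()(feature_map)
print(pooled.shape)   # (1, 1280): each 7 x 7 spatial map is reduced to a single average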

Table 5 depicts the 10-fold cross-validation training results using global average pooling with the Softmax classifier. With the global average-pooling layer, the model provided an average accuracy of 87.59%, an average precision of 0.913, and an average recall of 0.819 over the 10-fold cross-validation. The training accuracy is slightly lower than that of the base model.

However, the accuracy on the test set is 81.10%, which is higher than the base model. This could be due to the global average-pooling layer which may enable the model to generalize better on the test set.

Figure 6 presents the confusion matrix of different objects and their classification. The blue cells depict the correct identification of objects related to other objects. Further, the precision and recall are shown in Table 6.

4.4. Performance of MobileNetV2 with Additional Fully Connected Layer

Using the model with a global average pooling layer, an additional fully connected layer of 64 neurons was added. The new fully connected layer was added so that the model could pick up more complex features, which may be helpful for the waste image dataset.
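A minimal sketch of this enhanced head, assuming the Keras applications API with frozen ImageNet weights and the standard 224 x 224 input (the optimizer and loss are illustrative assumptions), could look as follows.

from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                # keep the pretrained extractor frozen
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(64, activation="relu")(x)            # additional fully connected layer
outputs = layers.Dense(6, activation="softmax")(x)    # six waste categories
model = Model(base.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])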

Table 7 presents the 10-fold cross-validation training results for MobileNetV2 with an additional fully connected layer. With 10-fold cross-validation, MobileNetV2 with an additional fully connected layer provided an average accuracy of 90.93%, an average precision of 0.934, and an average recall of 0.887.

Figure 7 presents the confusion matrix of different objects and their classification. The blue cells depict the correct identification of objects related to other objects.

Table 8 describes the precision and recall (test set) for MobileNetV2 using Softmax classifier with an additional fully connected layer. With the newly added fully connected layer, the model achieved an accuracy of 83.46% on the test set, which was higher than the model with a global average pooling layer (81.10%).

5. Conclusion

With the significant increase in human living standards, solid waste management, especially recycling, must be emphasised to prevent adverse environmental problems before it is too late. As recycling requires proper waste segregation by the public, data science, specifically deep learning paired with smart mobile phones, could be a tool to aid the recycling effort. The development and advancement of CNNs and the ability to perform transfer learning are crucial for image classification and model development. Architectures such as VGG16 perform very accurately for image classification. However, the sheer size of a large CNN model, i.e., the number of parameters and operations required, is not feasible for mobile applications.

Although the accuracy of the MobileNetV2 with SVM classifier for waste image classification (83.46%) is lower than the accuracy of the VGG16 model with SVM classifier (97%), using MobileNetV2, which is a more lightweight CNN architecture, is perhaps more practical if the image classification model is to be implemented on mobile devices. Moreover, MobileNetV2 is indeed specifically designed for mobile applications.

In the current research, the features extracted from MobileNetV2 are fed directly into the SVM classifier. To further improve the MobileNetV2 SVM model, the hyperparameters could be tuned, and the extracted features could be fed into multiple neural network layers before the SVM so that more complex features or relationships could be extracted. In addition, the Softmax classifier could be further enhanced by adding more layers. The scope of this study is limited to solid waste classification only; we have used cardboard, glass, metal, paper, plastic, and trash, but the approach can be scaled to a larger variety of solid waste classes.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors have no conflict of interest in this research.

Acknowledgments

The authors are grateful to the Taif University Researchers Supporting Project number (TURSP-2020/215), Taif University, Taif, Saudi Arabia. This work is also supported by the Faculty of Computer Science and Information Technology, University of Malaya, under Postgraduate Research Grant (PG035-2016A).