Turtles are among the most ancient marine animals still living today. However, their populations are threatened with extinction and need to be protected and preserved. One threat is that turtles often eat plastic waste in the ocean, whose shape, texture, and color resemble those of jellyfish. Computer vision technology, implemented on robots, can help reduce plastic bag and bottle waste in the ocean. The region-based Convolutional Neural Network (CNN) family is among the most recent approaches to image segmentation, and the Faster R-CNN algorithm offers good detection accuracy. In this study, the training images were built from two different objects, namely plastic bottles and plastic bags. The target is that the two objects can be recognized even if there are other objects in the vicinity or the image quality is affected by the color of the seawater. The results show that plastic bags and bottles can be recognized correctly in the images. Of the five color hues tested, the object detection process is valid for the normal, sepia, bandicoot, and grayscale hues. In contrast, the object detection process is invalid for black-and-white tones. The test results show that the highest accuracy is obtained for images with normal coloring, while the lowest is obtained for bandicoot. The average accuracy across all image types tested is 96.50%. However, the accuracy still needs to be improved before the method can be deployed permanently on hardware such as diving robots.

1. Introduction

Turtles are among the most ancient marine animals still living today. However, their populations are threatened with extinction and need to be protected and preserved. Of the seven sea turtle species worldwide, six are found in Indonesia [1]. Several species of turtles have become endangered, which is why there is a need for search and rescue [2]. Protection efforts through turtle conservation are needed [3] and are expected to help prevent the extinction of turtles. Jellyfish are one of the turtles' foods; however, turtles often eat plastic waste in the oceans because its shape, texture, and color are similar to those of jellyfish, so they mistake it for jellyfish. Over the last decade, breakthroughs in the domains of machine learning, statistics, and computer vision have piqued researchers' interest in advanced deep learning techniques [4].

The Faster R-CNN methodology can be used to detect plastic waste objects at sea and can be applied to diving tools or robots to reduce the spread of plastic waste in the sea. On datasets with occlusion and overlap, Faster R-CNN outperforms other methods in terms of computation precision and mean precision [5]. The dataset used in this study consists of photos with .jpg and .png extensions gathered from Google Images and Shutterstock.com, showing plastic bags and plastic bottles in the sea. It contains a total of 400 images, 140 of which are JPG files and 260 of which are PNG files. Each step in this research is shown in the flowchart in Figure 1.

2. Background

2.1. Computer Vision

Computer vision is a branch of technology that identifies, tracks, and measures targets for further image processing, using a camera and a computer in place of the human eye [6]. Deep learning approaches have made significant contributions to computer vision applications such as image classification, object detection, and image segmentation [7]. Computer vision and machine-learning algorithms have mainly been studied in a centralized setting, where all processing is done in one central location. Object detection, object classification, and extraction of useful information from photos, graphic documents, and videos are among the most recent machine-learning applications in computer vision [8].

The machine-learning paradigm for computer vision includes support vector machines, neural networks, and probabilistic graphical models. Machine learning plays an essential role in object recognition and image classification, and the TensorFlow library can be used to improve accuracy when recognizing objects [9]. Figure 2 shows the object detection process in a machine learning and computer vision environment.

As illustrated in Figure 2, after objects are detected in the image, features are extracted from the given image: every image is broken down into small pieces, each containing a collection of information. The extraction process is shown in Figure 3.

2.2. Region Convolutional Neural Network (R-CNN)

Region Convolutional Neural Network (R-CNN) is a deep learning method commonly used for object detection. R-CNN uses a selective search algorithm to propose regions: the input image is grouped into about 2000 regions selected on the basis of texture, intensity, and color. This addresses a weakness of the plain CNN, which divides the image into a very large number of regions and thereby slows the identification process, as shown in Figure 4.

Mask R-CNN is a region-based convolutional neural network that is state of the art in image segmentation and builds on the Faster R-CNN method [12]. This deep neural network variant detects objects in the image and generates a high-quality segmentation mask for each instance [13]. Image segmentation is becoming a significant task in computer vision and image processing, with essential applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression [14]. R-CNN has several weaknesses. Training is relatively slow because 2000 region proposals are used for each image. It cannot be used for real-time classification because it takes about 47–50 seconds to process each image. Finally, R-CNN relies on the selective search algorithm for region proposals and cannot use other algorithms for this step [11].

Mask R-CNN is simple to implement and adds only a slight overhead to Faster R-CNN, running at five frames per second. Furthermore, Mask R-CNN is simple to apply to other tasks.

2.3. Faster Region Convolutional Neural Network (Faster R-CNN)

Faster R-CNN is a deep learning method for object detection that was developed from the R-CNN algorithm to address its weaknesses. Its main advantage is the Region Proposal Network (RPN), a neural network that replaces selective search for proposing regions. Selective search is replaced because it is slow, taking about 2 seconds per image [15]. The RPN generates several bounding boxes, each with two probability scores indicating whether or not an object is present at that location. RPN processing is not repeated for each region as in R-CNN, which allows the whole model to be trained end to end. Figure 5 shows the general architecture of Faster R-CNN. One disadvantage of Faster R-CNN is that, for the RPN, all anchors in a minibatch are extracted from a single image. Because all samples from one image may be correlated, the network may take a long time to reach convergence. Mask R-CNN can additionally return a mask for each detected object [16, 17]. The Faster R-CNN algorithm is very effective for difficult problems such as detecting small objects but still has some limitations in detecting camouflaged objects. As a result, the test is performed on five types of image coloring, namely normal, sepia, bandicoot, grayscale, and black-and-white, to determine which types of coloring do not support a good recognition process.

Faster R-CNN is easy to deploy and train because its framework accommodates a variety of flexible architectural designs, and the mask branch adds only a small computational overhead, enabling fast systems and rapid experimentation [18]. We chose Faster R-CNN because of its high precision, which outperforms other algorithms, and its ability to detect small objects. Because plastic is transparent in water, our primary goal is to detect plastic waste with the highest possible precision [19]. As a result, we compromise on the frame rate, since the seven frames per second of Faster R-CNN is satisfactory for our purposes.

Table 1 describes the object detection performance, with the Faster R-CNN algorithm achieving the highest mean average precision (mAP).

3. Literature Review

The research in this paper builds on work reported in several publications. Table 2 summarizes these publications.

4. Methodology

A research approach is an action plan that provides direction for conducting research systematically and efficiently. This section explains the method used. Faster R-CNN integrates candidate region extraction, deep feature extraction, classification, and bounding box regression into a single deep neural network, which makes it faster.

4.1. TensorFlow Library

TensorFlow is an open-source machine-learning library for research and software development. It offers APIs for beginners and specialists to develop desktop, mobile, web, and cloud applications. To implement object detection with Faster R-CNN, TensorFlow is first installed as the backend engine. This paper focuses on the use of the TensorFlow library as the backend. The Object Detection API in TensorFlow is a powerful tool that allows anyone to quickly design and deploy practical image recognition applications [27]. Object detection entails classifying and recognizing items in a picture, localizing them, and drawing bounding boxes around them. TensorFlow is also cross-platform: it can run on GPUs, CPUs, and mobile platforms, as well as on specialized hardware for tensor math known as the tensor processing unit (TPU) [28].

4.2. Data Item Collection

The dataset is used to train and test the neural network and to develop computer vision algorithms [29]. The dataset is formed by placing the folder containing the images in a .zip archive and then uploading it to Google Colab. The collection of images contained in the dataset is shown in Table 3.
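The archive step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the script used in this study; the function name `extract_and_count` and the paths passed to it are hypothetical.

```python
import zipfile
from collections import Counter
from pathlib import Path

# The two image formats named in the dataset description.
IMAGE_EXTENSIONS = {".jpg", ".png"}


def extract_and_count(archive_path, target_dir):
    """Unpack a .zip archive and count the image files per extension."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    counts = Counter(
        path.suffix.lower()
        for path in Path(target_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )
    return dict(counts)
```

Counting the files per extension after extraction gives a quick check that the upload is complete, for example against the 140 JPG and 260 PNG files reported for the collected images.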

4.3. Label Images Annotation

Image annotation is a branch of image retrieval in which images are labeled or tagged with a set of keywords based on their content. The resulting labels can be used to group images by content for easy management [30]. Figure 6 shows the image annotation process.

Instance segmentation combines object detection, whose purpose is to classify and localize each object using bounding boxes, with segmentation, whose purpose is to assign each pixel to a certain object class [31]. In the annotation process, the point coordinates are stored in a JSON file for each image. Although minor errors sometimes occur during image annotation, they do not affect the overall model evaluation [32, 33].
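The relation between the stored point coordinates and a bounding box can be sketched as follows. The JSON layout used here (a `shapes` list whose entries have a `label` and a list of `points`) is an assumed, hypothetical schema for illustration; the annotation files in this study may be structured differently.

```python
import json


def boxes_from_annotation(json_text):
    """Convert annotated polygon points into (label, x_min, y_min, x_max, y_max)."""
    annotation = json.loads(json_text)
    boxes = []
    for shape in annotation["shapes"]:  # hypothetical key names
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        # The tightest axis-aligned box that encloses all annotated points.
        boxes.append((shape["label"], min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

Taking the minimum and maximum of the point coordinates yields the tightest axis-aligned box around the annotated outline, which is the form a bounding box detector is trained on.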

4.4. Faster R-CNN Recognition Method and Results

In this research, the proposed recognition method and its result are shown in Figure 7.

As illustrated in Figure 7, the process consists of two sessions, namely:

4.4.1. Training Session

The system is given input images of plastic bag and plastic bottle waste. The files are resized to no more than 200 KB, so the image dimensions are reduced horizontally or vertically. The convolutional layers extract and learn the image features, that is, the essential parts that characterize an object, and produce a feature map containing a vector representation of the input image. The Region Proposal Network (RPN) is a module with two convolutional layers: one is responsible for detecting the location of objects, and the other predicts bounding boxes. The output of the RPN is the set of region proposals for the image. The ROI layer is responsible for equalizing the size of the feature map regions that correspond to the proposals produced by the RPN and for sending the feature map information and proposals to the classification layer. The classification layer groups the objects detected by the RPN, labels them, and assigns a bounding box to each object. Finally, after the system completes the training process, it produces a learned model containing the trained weight information.
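The size-equalizing role of the ROI layer described above can be illustrated with a simplified max-pooling routine. This is a minimal sketch in plain Python and not the layer used in this study: the function name `roi_max_pool`, the representation of the feature map as a 2-D list, and the integer bin edges are assumptions made for illustration.

```python
def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one region of a 2-D feature map to a fixed square size.

    roi is (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    height, width = y2 - y1, x2 - x1
    pooled = []
    for row in range(output_size):
        # Integer bin edges; every bin covers at least one cell.
        r0 = y1 + (row * height) // output_size
        r1 = max(y1 + ((row + 1) * height) // output_size, r0 + 1)
        pooled_row = []
        for col in range(output_size):
            c0 = x1 + (col * width) // output_size
            c1 = max(x1 + ((col + 1) * width) // output_size, c0 + 1)
            pooled_row.append(
                max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
            )
        pooled.append(pooled_row)
    return pooled
```

Whatever the size of the proposed region, the output always has `output_size` rows and columns, which is what allows proposals of different sizes to be passed to the same classification layer.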

4.4.2. Testing Session

At the initial testing stage, the system is given input images of plastic bags and plastic bottles and then runs the load-model process to reload the model stored during the training session. In the frozen graph process, input data received through the camera are processed on the graph stored in the frozen model to identify objects and assign bounding boxes based on the weights stored in the trained model. The output is the identification result, with bounding boxes and labels for the objects that have been classified.
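The form of the testing output can be sketched as a list of detections, each with a label, a bounding box, and a confidence score. The dictionary keys, the function name `filter_detections`, and the 0.5 score threshold below are assumptions for illustration; the text does not state whether or how low-confidence detections are discarded.

```python
def filter_detections(detections, min_score=0.5):
    """Keep detections scoring at least min_score, highest score first."""
    kept = [d for d in detections if d["score"] >= min_score]
    return sorted(kept, key=lambda d: d["score"], reverse=True)


# Hypothetical detector output for one image: box is (x1, y1, x2, y2) in pixels.
example = [
    {"label": "plastic_bag", "box": (10, 20, 110, 140), "score": 0.42},
    {"label": "plastic_bottle", "box": (200, 50, 260, 190), "score": 0.97},
    {"label": "plastic_bag", "box": (300, 80, 380, 160), "score": 0.80},
]
for detection in filter_detections(example):
    print(detection["label"], detection["box"], detection["score"])
```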

5. Discussion

In this study, the plastic object detection process consists of the following stages. First, the object images are obtained with a self-made image acquisition device. Second, the images are processed, labeled, and fed into Faster R-CNN for training. Finally, the trained model is used to segment the images of the objects in order to obtain an indicator of the objects' features.

5.1. Training Image Samples

Data were gathered from the Internet for this investigation, and the dataset was separated into training, validation, and test sets for the experiment. Training was carried out on images of plastic bags and beverage bottles. Most methods for object instance segmentation require all training instances to be labeled with a segmentation mask [34]. Training images are frequently used to decide what heterogeneity should be included in a multipoint statistical reservoir model [35]. In this study, the training images are built from two different objects, namely plastic bottles and plastic bags. The target is that both objects can be identified even if there are other objects around them or the image quality is affected by the color of the seawater. The training results are shown in Figure 8.

Faster R-CNN parameters used at the training stage can be seen in Table 4.

In this study, a total of 92,998 images were used to build the detection model with Faster R-CNN, consisting of 22,461 images of plastic bags and 69,996 images of plastic bottles. Before training, a text file is prepared that contains the image name, bounding box size, and class (label) information. The data are divided into 76,990 training images and 15,467 validation images. The CNN architecture used is ResNet50, a model pretrained on the ImageNet dataset to provide good feature extraction. Following the Faster R-CNN default, nine anchors are used. Anchors are an important component used to determine the essential parts of the image (region proposals) that are passed to the RPN. The optimizer uses a learning rate of 0.00001. In addition, Stochastic Gradient Descent is used to optimize the convolutional layers, RPN weights, and fully connected layers. The epoch length is 50,000, and the total number of epochs is 25.
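The nine default anchors mentioned above can be sketched as the combination of three scales and three aspect ratios. The scale values (128, 256, 512) and ratios (1:2, 1:1, 2:1) below are those of the original Faster R-CNN formulation and are assumed here, since the text does not list them; the function name `generate_anchors` is hypothetical.

```python
import math


def generate_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) in pixels for every scale and ratio pair.

    Each anchor keeps an area of about scale * scale; ratio is width / height.
    """
    anchors = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            width = math.sqrt(area * ratio)
            height = width / ratio
            anchors.append((round(width), round(height)))
    return anchors


anchors = generate_anchors()
print(len(anchors))  # 3 scales x 3 ratios = 9 anchors
```

At every position of the feature map, the RPN scores each of these anchor shapes, so tall, square, and wide objects of several sizes can all be proposed from a single pass.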

Table 5 shows the results obtained at the training stage. According to the table, the highest accuracy, 96.66%, is obtained in the 25th epoch, the error rate decreases as training proceeds, and the execution time is 7 hours 21 minutes 12 seconds.

In this research, the Faster R-CNN recognition method and the result are shown in Figure 7.

The RPN loss is the sum of the classification loss and the bounding box regression loss. The classification loss penalizes incorrectly classified boxes using cross-entropy, and the regression loss penalizes incorrectly predicted regression coefficients using a function of the distance between the true regression coefficients and those predicted by the network. The neural network is trained with the multitask loss function

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*), \tag{1}$$

where $N_{cls}$, $N_{reg}$, and $\lambda$ are the normalization terms and the balancing weight for the classification loss and the regression loss, $i$ is the index of a candidate box in the mini-batch, and $p_i$ is the predicted probability that candidate box $i$ is a target. If candidate box $i$ is a positive target, then $p_i^* = 1$; otherwise, $p_i^* = 0$.

The classification and regression loss functions are defined in equations (2) and (3):

$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right], \tag{2}$$

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*), \tag{3}$$

where $R$ is the smooth L1 function, $t_i = \{t_x, t_y, t_w, t_h\}$ is the vector of parameterized coordinates of the predicted candidate box, and $t_i^* = \{t_x^*, t_y^*, t_w^*, t_h^*\}$ is the coordinate vector of the actual bounding box.
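The multitask RPN loss described above can be sketched in plain Python as follows. The sketch assumes the standard Faster R-CNN formulation (log loss for classification, smooth L1 for regression, and regression counted only for positive candidate boxes); the function names and the default weight of 10 for λ are assumptions, not values reported in this study.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """Normalized classification loss plus weighted regression loss.

    p      -- predicted object probabilities, one per candidate box
    p_star -- ground-truth labels (1 for a target, 0 otherwise)
    t      -- predicted box coordinates (tx, ty, tw, th) per candidate box
    t_star -- ground-truth box coordinates per candidate box
    """
    cls_loss = sum(
        -math.log(prob if label == 1 else 1.0 - prob)
        for prob, label in zip(p, p_star)
    )
    # Multiplying by the label means only positive boxes contribute.
    reg_loss = sum(
        label * sum(smooth_l1(a - b) for a, b in zip(pred, true))
        for label, pred, true in zip(p_star, t, t_star)
    )
    return cls_loss / n_cls + lam * reg_loss / n_reg
```

Multiplying the regression term by the ground-truth label ensures that background boxes, which have no true coordinates to regress toward, affect only the classification term.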

5.2. Testing

In the test, the Faster R-CNN approach is used to recognize objects in random images. The results show that plastic bags and bottles in the images are identified correctly. The results are shown in Table 6.

As shown in Table 6, the images used in the testing process have several types of color shades. This simulates conditions in seawater, whose color can be affected by certain conditions, which in turn can affect the accuracy of the object detection process. Of the five color hues tested, the object detection process is valid for the normal, sepia, bandicoot, and grayscale tones. In contrast, the object detection process is invalid for black-and-white tones. The authors assume that the black-and-white tone corresponds to seawater at night or seawater polluted by waste oil. This should be taken into account when the method is applied to a diving machine or robot, so that it does not operate when the seawater is black or very cloudy.
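Three of the color variants named above can be sketched as per-pixel transforms on RGB values. The text does not state how the test images were produced, so the luma weights, the sepia coefficients, and the threshold of 128 below are common conventions assumed for illustration; the bandicoot variant is omitted because its definition is not given.

```python
def to_grayscale(pixel):
    """Weighted luma of an (R, G, B) pixel, repeated on all three channels."""
    r, g, b = pixel
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)


def to_sepia(pixel):
    """Apply a commonly used sepia matrix, clamping each channel to 255."""
    r, g, b = pixel
    return (
        min(255, round(0.393 * r + 0.769 * g + 0.189 * b)),
        min(255, round(0.349 * r + 0.686 * g + 0.168 * b)),
        min(255, round(0.272 * r + 0.534 * g + 0.131 * b)),
    )


def to_black_and_white(pixel, threshold=128):
    """Binarize: white if the luma reaches the threshold, otherwise black."""
    y = to_grayscale(pixel)[0]
    return (255, 255, 255) if y >= threshold else (0, 0, 0)
```

The binarized variant discards all intermediate intensities, which is consistent with the observation that detection fails in black-and-white tones while it still succeeds in grayscale.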

Various images of plastic bags and plastic bottles are used in the tests. A confusion matrix is used for testing, and accuracy values are computed from it on data collected from 400 images of plastic bags and bottles (Table 7).

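The accuracy in the table is calculated using formula (4):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{4}$$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.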
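The accuracy calculation from a confusion matrix can be sketched as follows, assuming the standard definition of accuracy as the share of correct predictions among all predictions. The function name and the counts in the example are hypothetical and are not taken from Table 7.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts: 85 correct decisions out of 100.
print(accuracy(tp=45, tn=40, fp=10, fn=5))
```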

The test results in the table show that the highest accuracy is obtained for images with normal coloring, while the lowest is obtained for bandicoot. The average accuracy across all image types tested is 96.50%.

6. Conclusion

This study concludes that the extinction of turtle populations can be countered by helping reduce marine pollution from plastic waste. When applied to robotic technology or diving equipment, the Faster R-CNN approach can assist in segmenting and detecting target items. To be useful for removing plastics and other waste, object identification algorithms must be able to run in near real time on robotic platforms. The work presented here is an algorithm intended to support the protection of turtle species, which may become endangered if such measures are not implemented. The Faster R-CNN approach has limitations when the image is in black-and-white tones and is therefore recommended for use in clear seawater conditions. In the future, we want to expand on this work by evaluating analogous algorithms on a dataset collected from our own observations of marine trash in real-world settings. We would also like to consider other approaches to accomplishing this project.

Data Availability

The data are available on request.

Conflicts of Interest

The author(s) declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

The authors are thankful for the support of STMIK Professional Makassar. The present research work is self-funded.