Signal, Image and Video Processing
https://doi.org/10.1007/s11760-023-02736-3
ORIGINAL PAPER
An automatic garbage detection using optimized YOLO model
Nur Athirah Zailan1 · Anis Salwa Mohd Khairuddin1 · Khairunnisa Hasikin1 · Mohamad Haniff Junos2 · Uswah Khairuddin3
Received: 17 July 2023 / Revised: 4 August 2023 / Accepted: 8 August 2023
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023
Abstract
Garbage pollution is an increasing global concern. Hence, the adoption of innovative solutions is important for controlling garbage pollution. To develop an efficient cleaner robot, it is crucial to obtain visual information on floating garbage in the river. Deep learning has been actively applied over the past few years to tackle various problems. Deep learning models can learn high-level, semantic, and advanced features from visual information, which is important for detecting and classifying different types of floating garbage. This paper proposes an optimized You Only Look Once v4 Tiny (YOLOv4-tiny) model to detect floating garbage, mainly by improving the spatial pyramid pooling with average pooling, the Mish activation function, a concatenated densely connected neural network, and hyperparameter optimization. The proposed model achieves a mean average precision of 74.89% with a size of 16.4 MB, which offers the best trade-off among the compared models. The proposed model shows promising results in terms of model size, detection time and memory space, which makes it feasible to embed in low-cost devices.
Keywords Computer vision · Debris · Deep learning · Image processing · Object detection
1 Introduction
Garbage pollution in river ecosystems has been a major environmental issue across the globe for decades. Submerged debris can be a danger to both marine life and fishing vessels. Initiatives have been adopted to manage pollution, for example, manual and machine-based cleaning, which require constant human supervision. In addition, the manual labour required for cleaning waste can put workers at risk [1]. Hence, an autonomous cleaning robot that can remove waste from the water could contribute significantly to river pollution control. However, designing a suitable robot is a challenging task. The main tasks to be performed by cleaner robots are garbage detection and garbage collection. The detection task is particularly significant since it provides precise object location information for the cleaning robot.
✉ Anis Salwa Mohd Khairuddin
anissalwa@um.edu.my
1 Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur, Malaysia
2 School of Aerospace Engineering, Universiti Sains Malaysia, Engineering Campus, 14300 Nibong Tebal, Penang, Malaysia
3 School Malaysia Japan International Institute of Technology, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
Therefore, an efficient object detection method that incorporates computer vision is in high demand. Generally, computer vision is a field of artificial intelligence (AI) that lets computers and systems interpret information from various visual inputs. The rapid growth of machine learning technology in machine vision applications has led deep learning methods to achieve state-of-the-art results in object detection [2]. Deep learning methods can also automatically extract deep features from the input image through self-learning. Faster R-CNN, Single Shot Detector (SSD), and You Only Look Once (YOLO) are some examples of object detection algorithms that could provide visual inputs for cleaning robots. The widely used YOLOv4 algorithm integrates features of YOLOv1, YOLOv2, and YOLOv3. In a real, complicated environment, external hindrances such as obstruction and multi-scale objects still cause deficiencies in garbage detection when YOLOv4 is used directly. Some of the concerns are the long training time, high computation cost and excessive number of parameters [3–7]. Besides, varying weather and lighting conditions remain a challenge because few works have addressed datasets of this kind. The proposed model is developed around three key objectives: improving detection under various weather conditions, achieving real-time prediction, and requiring little memory storage. The two main improvements in the optimized model are as follows:
• Firstly, the Mish activation function is fine-tuned to increase regularization, expressivity, and gradient flow in obtaining a more generalized model.
• Secondly, DenseNet with Spatial Pyramid Average Pooling is implemented by adding more layers to concatenate valuable features in the same convolutional layer, thus increasing the receptive field of the network, which improves detection accuracy.
2 Related works
The visual counting method is an established way of managing floating debris, but it requires manual labour to count the visible debris in the river. Its risks include biased judgements by observers, as well as geographical limitations. With the advancement of machine learning technologies, automatic riverine monitoring systems can therefore be implemented. The most important step in developing an efficient monitoring system is having a reliable visual detector to detect the debris in the river [8–11]. The mainstream object detection algorithms are based on convolutional neural networks (CNN) and fall into one-stage and two-stage detection, which use different feature extraction methods. Object detection algorithms that adopt a two-stage detection method include R-CNN, Fast R-CNN, and Faster R-CNN, which divide the detection task into region proposal and classification. Meanwhile, the one-stage detection method integrates region proposal and classification into one step, which reduces the detection time. The mainstream one-stage methods are SSD and YOLO. SSD is recommended for object detection applications due to its significant increase in accuracy and speed. On the other hand, the idea of the YOLO detector is to apply a single neural network to the entire image, where the network splits the image into regions and concurrently predicts bounding boxes and probabilities for each region [3–7]. The work in [2] proposed a modified YOLOv3 model for garbage detection and achieved a mean average precision (mAP) of 91.431%. However, the model only detects three classes, which are bottle, bag and Styrofoam. The work in [5] showed that the YOLOv3 model performs better than the YOLOv3-tiny model in detecting garbage. The works in [12–19] modified the deep learning architecture to improve the accuracy of garbage detection. However, previous works reported shortcomings of garbage detection using computer vision, such as long training time, high computation cost and an excessive number of parameters. Hence, this work aims to improve the original network features. Besides, implementation on embedded devices for real-life applications requires the model to be small, lightweight, and fast. YOLOv4 has been shown to be a high-precision, real-time one-stage object detection algorithm, and YOLOv4-tiny is its simplified version. YOLOv4-tiny has become very practical for deployment on mobile and embedded devices due to its faster training time and detection speed [16,17]. Therefore, this work focuses on optimizing the conventional YOLOv4-tiny model to detect floating debris for a river monitoring system, satisfying the requirements mentioned earlier while maintaining accurate detection performance.
3 Proposed methodology
3.1 Dataset
This work utilizes garbage images from open-access databases [9–11]. To create an effective object detector, the training images are augmented in terms of brightness and position to prevent overfitting. The scope of this project covers five common classes of debris, namely styrofoam, plastic bag, plastic bottle, plastic container, and aluminium can. The size of an input image is 416 × 416. The proposed model has been trained and tested on a dataset with 21,358 training images and 5,845 testing images.
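The augmentation step described above can be illustrated with a minimal sketch. The paper does not specify the exact transforms, so the brightness scaling and horizontal flip below, as well as the function names and sample values, are assumptions chosen for illustration. Bounding boxes are assumed to follow the normalized (x_center, y_center, width, height) convention commonly used with YOLO.

```python
def adjust_brightness(image, factor):
    """Scale each pixel of a grayscale image (rows of 0-255 ints) and clip to 0-255."""
    return [[max(0, min(255, round(p * factor))) for p in row] for row in image]


def hflip_image(image):
    """Mirror the image left to right (a simple 'position' augmentation)."""
    return [list(reversed(row)) for row in image]


def hflip_box(box):
    """Mirror a normalized (x_center, y_center, width, height) box left to right."""
    xc, yc, w, h = box
    return (1.0 - xc, yc, w, h)


# Hypothetical 2 x 2 image and one labelled box, for illustration only.
image = [[10, 200], [120, 250]]
brighter = adjust_brightness(image, 1.5)
flipped = hflip_image(image)
flipped_box = hflip_box((0.25, 0.5, 0.2, 0.4))
```

Whenever the image geometry changes, the labels must be transformed in the same way, which is why the box is flipped together with the image.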
3.2 The proposed optimized YOLO model
In this work, an optimized model based on YOLOv4-tiny is proposed with the goals of improving overall accuracy, detection time, and model size, which are crucial for implementation in embedded devices. There are four key components in this model: the fine-tuned Mish activation function, which optimizes the usage of Mish instead of the rectified linear unit (ReLU); spatial pyramid average pooling (SPAP) in the DenseNet architecture with more concatenated layers; hyperparameter optimization through several series of experiments; and a customized anchor box mechanism generated using the K-means clustering algorithm.
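To make the anchor box step concrete, the following is a minimal sketch of K-means clustering over bounding-box widths and heights. The paper does not state the distance measure or the initialization, so the use of IoU as the similarity (a common practice for YOLO anchors), the deterministic initialization, and the sample boxes are all assumptions.

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height), assuming they share one corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100):
    """Cluster (width, height) pairs into k anchors using IoU as the similarity."""
    # Deterministic start: spread the initial centroids over the boxes sorted by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centroids = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: iou_wh(box, centroids[j]))
            clusters[best].append(box)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                updated.append((sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical box sizes in pixels: three small objects and three large ones.
boxes = [(10, 12), (12, 10), (11, 11), (50, 60), (60, 50), (55, 55)]
anchors = kmeans_anchors(boxes, k=2)
```

With these sample boxes the two anchors settle on the mean size of the small group and of the large group, which is the behaviour that lets anchors adapt to the object sizes in a particular dataset.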
3.3 The fine-tuned Mish activation function
The Mish activation function is an improved activation function that is smooth and non-monotonic [3]. It can be defined as:
$$f(x) = x \cdot \tanh(\varsigma(x)) \qquad (1)$$
where
$$\varsigma(x) = \ln(1 + e^{x}) \qquad (2)$$
By providing the scalar input to the gate through self-gating, the Mish function has a similar property to the Swish function, which makes it a useful substitute for existing activation functions, including ReLU. Implementing the Mish function in a deep learning framework is also straightforward, as it only requires specifying a custom activation layer. However, a lower learning rate than for ReLU is advisable with the Mish function for better results. The Mish activation function is bounded below, unbounded above, smooth, and non-monotonic, which increases expressivity and improves gradient flow. Hence, this work implements the fine-tuned Mish activation function in its architecture (Fig. 1). The network is modified by inserting two CBM blocks, each consisting of a 1 × 1 Conv-BatchNorm-Mish, that initially process the input. The Conv-BatchNorm-ReLU is added to provide a clear and precise transition of scalar magnitudes before performing a 3 × 3 convolution to enhance the feature extraction. The output is fed forward as the input to the next convolutions. In one of the feedforward paths, the output of the 3 × 3 convolution is divided into two parts to perform another 3 × 3 convolution before it is stacked with a 1 × 1 convolution to further integrate the channels. Finally, the parts are concatenated to obtain smoother loss functions in transition results, which benefits the generalization and optimization of the model. The Mish function is also integrated into the DenseNet structure along with the ReLU function as activation functions in its dense layers. Both functions are crucial in improving the cost efficiency and regularization of the network structure because they provide different nonlinearities, each of which typically works well for modelling a specific function.
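The properties listed above (bounded below, unbounded above, non-monotonic) can be checked numerically with a short sketch of Eqs. (1) and (2). This is a plain-Python illustration rather than the layer used in the paper; the numerically stable form of the softplus term is an implementation choice made here.

```python
import math


def softplus(x):
    """ln(1 + e^x), written in a form that does not overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU activation, shown for comparison."""
    return max(0.0, x)


# Mish keeps a small negative response where ReLU outputs exactly zero.
negative_response = mish(-1.0)   # roughly -0.30
far_negative = mish(-5.0)        # closer to zero again: the function is non-monotonic
large_positive = mish(10.0)      # approaches the identity for large inputs
```

Because the output for negative inputs is small but non-zero, gradients can still flow through units that ReLU would switch off entirely.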
3.4 DenseNet with spatial pyramid average pooling (SPAP)
Reduced gradient information is one of the concerns in deep convolutional neural networks. This happens when feature information slowly degrades as a large amount of information is transferred from the input to the output layer. Therefore, a densely connected convolutional network (DenseNet) is adopted in this work to guarantee a strong gradient flow. Generally, DenseNet reuses features in order to ensure highly varied features and deeper patterns. In this work, each layer among the multiple convolution layers of a Dense Block is called $H_i$. $H_i$ consists of batch normalization, the ReLU or Mish function, as mentioned in the previous section, and lastly, convolution. $H_i$ takes the outputs of all previous layers, together with the original input, as its inputs, namely $x_0, x_1, \ldots, x_{i-1}$:
$$H_i = b_i([x_0, x_1, \ldots, x_{i-1}]) \qquad (3)$$
where $[x_0, x_1, \ldots, x_{i-1}]$ is the concatenation of the feature maps from layers $0, 1, 2, \ldots, i-1$. On the other hand, $b_i$ represents a function that processes the linked feature maps to produce nonlinear transformations. $b_i$ may also generate $y$ feature maps, which can be expressed as follows:
$$y_i = y_0 + y \qquad (4)$$
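The effect of dense connectivity on the number of feature maps can be sketched as simple bookkeeping. The sketch below assumes the standard DenseNet convention in which every layer adds a fixed number of new feature maps (the growth rate) to the concatenation; the channel counts are hypothetical and are not taken from the proposed model.

```python
def dense_block_channels(initial_channels, growth_rate, num_layers):
    """Return the input channel count seen by each layer and the block's output size."""
    inputs_per_layer = []
    channels = initial_channels
    for _ in range(num_layers):
        # Layer H_i receives the concatenation [x_0, ..., x_{i-1}].
        inputs_per_layer.append(channels)
        # H_i then contributes growth_rate new feature maps to the concatenation.
        channels += growth_rate
    return inputs_per_layer, channels


# Hypothetical block: 64 input feature maps, growth rate 32, four layers.
inputs, out_channels = dense_block_channels(initial_channels=64, growth_rate=32, num_layers=4)
```

Each later layer sees every earlier feature map directly, which is how dense connectivity preserves feature and gradient information across the block.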
Feature maps are produced as outputs from preceding layers. Therefore, the number of feature maps grows at each layer by the number of feature maps that layer produces, which is the growth rate. Multiple Dense Blocks are composed to create a DenseNet [4–6]. Different spatial resolutions created at the neck are needed to detect objects at different scales. Therefore, the head probes feature maps that form a hierarchical structure. The neck consists of feature maps that are added from the bottom-up stream to the top-down stream to enhance the information passed on to the head. This addition is done by concatenation or by element-wise addition of neighbouring feature maps. As a result, the head's input receives spatially rich information. Furthermore, a transition block called Rn, which consists of pooling and convolution, sits between the Dense Blocks. In this work, a spatial pyramid average pooling (SPAP) layer is implemented, which, as the name suggests, uses average pooling instead of max pooling, as shown in Fig. 2.
In SPAP, the feature maps from preceding layers are taken to provide multi-scale local region feature maps of 1 × 256, 4 × 256 and 9 × 256, which translate into an output feature vector of 6 × 1024. The vector is expanded into a 13 × 13 kernel size to be passed to the convolution in the neck network. Average pooling smooths the images rather than emphasizing sharp features, which is useful given the different lighting conditions in the image datasets. This is because the SPAP layer passes on the average pixel values instead of the brightest pixels, as the conventional SPP with max pooling does.
Fig. 1 The fine-tuned module structure
Fig. 2 The spatial pyramid average pooling (SPAP)
The output produced is the outcome of the function k applied to the embedding vectors e_v that are passed to each layer. The function can be written as:
$$k(e_1, e_2, \ldots, e_{W/S}) = \frac{1}{W/S} \sum_{v} e_v \qquad (5)$$
where $e_v$ represents the $v$-th embedding vector.
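The pooling scheme can be sketched for a single channel as follows. This is an illustrative plain-Python version, assuming pyramid levels of 1 × 1, 2 × 2 and 3 × 3 bins (which give the 1, 4 and 9 regions per channel mentioned above); the bin boundaries and the 6 × 6 sample map are assumptions, not details from the paper.

```python
def average_pool_grid(feature_map, bins):
    """Average-pool a 2-D map (list of rows) into a bins x bins grid, row by row.

    Assumes the map is at least bins x bins so that no bin is empty.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for by in range(bins):
        y0, y1 = by * height // bins, (by + 1) * height // bins
        for bx in range(bins):
            x0, x1 = bx * width // bins, (bx + 1) * width // bins
            cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            pooled.append(sum(cells) / len(cells))
    return pooled


def spatial_pyramid_average_pool(feature_map, levels=(1, 2, 3)):
    """Concatenate the average-pooled grids from every pyramid level."""
    vector = []
    for bins in levels:
        vector.extend(average_pool_grid(feature_map, bins))
    return vector


# Hypothetical 6 x 6 single-channel feature map holding the values 0..35.
feature_map = [[float(6 * r + c) for c in range(6)] for r in range(6)]
vector = spatial_pyramid_average_pool(feature_map)
```

The output length (1 + 4 + 9 = 14 values per channel) is fixed regardless of the input resolution, and each value is a mean over its region rather than the maximum, so a single very bright pixel has limited influence.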
Average pooling plays an important role in conveying deeper semantic information through the embedding vectors. SPAP is used in the first and second transition layers of the DenseNet, where it focuses on the overall features of the input to be fed forward to the next layers. As a result, a native convolution structure can be obtained, as the feature maps can be clearly interpreted as category confidence maps. Furthermore, spatial information is summed at this layer, and since SPAP has no parameters to optimize, it helps prevent overfitting and makes the network more robust to spatial translations of the input. The general architecture of the proposed model is illustrated in Fig. 3.
4 Results and discussion
The experiments are carried out on a Windows 10 64-bit operating system with an x64-based processor. The machine is equipped with an AMD Ryzen 7 3750H with Radeon Vega Mobile Gfx at 2.30 GHz, 12.0 GB of installed RAM and an NVIDIA GeForce GTX 1650 graphics card. The GPU accelerator used is a Tesla K80, which is readily available on Google Colab with Jupyter Notebooks and Python 3 as the scripting language. Evaluation metrics are computed for each of the object classes and the model's performance is evaluated in terms of accuracy, mean average precision (mAP), and recall.
Fig. 3 The overall architecture of the proposed model
4.1 Experimental results
In this part, the optimized proposed model is compared with
several other models, including YOLOv3, YOLOv3-tiny,
YOLOv4 and YOLOv4-tiny. The performance of these mod-
els is evaluated based on the mean average precision, average
IoU, precision, recall, training time, model size, and compu-
tation time.
4.2 Detection performance
The models are also evaluated in terms of mean average precision (mAP) at different threshold values of 0.5, 0.75 and 0.95. Table 1 shows that the proposed model outperforms the lightweight YOLOv3-tiny and YOLOv4-tiny models. This demonstrates the efficiency of the proposed lightweight model in detecting floating debris by optimizing the usage of concatenated layers of the densely connected neural network in the backbone. In addition, the proposed work shows a substantial improvement in average IoU, which is the highest (67.67%) among all models. This indicates that the customized anchor boxes successfully increase the overlap with the ground truth of the image, which leads to the increase in IoU. Besides, the improved YOLO model attained precision and recall values of 75% and 60%, respectively, which are higher than those of the conventional YOLOv4-tiny, with 73% precision and 58% recall. These precision and recall values contribute to the highest F1-score, achieved by the proposed model (0.75), which equals that of the YOLOv4 model. The model has well-balanced precision and recall values, which are necessary to improve its overall detection performance. The receiver operating characteristic (ROC) curves of all models are also shown for comparison in Fig. 4.
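For reference, the metrics discussed above can be computed as in the following sketch, which uses the standard definitions of IoU, precision, recall and F1-score. The box coordinates and detection counts are hypothetical and are not taken from the experiments.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Two 10 x 10 boxes shifted by 5 pixels: intersection 50, union 150.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
# Hypothetical counts: 8 correct detections, 2 false alarms, 8 missed objects.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

The F1-score is the harmonic mean of precision and recall, so it is high only when both values are high.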
As mentioned in the previous section, each test prediction has only four possible outcomes: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). The YOLOv4 model has the best ROC of all models because its curve is closest in shape to that of the perfect classifier, which has a 100% true positive rate and a 0% false positive rate. In other words, the closer the curve is to the upper left corner of the graph, the better the performance of the model in terms of ROC. Following closely after YOLOv4 are the proposed model, YOLOv3, YOLOv4-tiny, and finally, YOLOv3-tiny. YOLOv3-tiny has the curve closest to the straight diagonal line, which indicates no predictive power, or random guessing. One of the benefits of using the ROC is that it helps to find the most suitable classification threshold for a specific problem, in this case, our floating garbage classifier.
Table 1 Comparison of the detection performance for different models
Model              mAP@0.50 (%)  mAP@0.75 (%)  mAP@0.95 (%)  Average IoU (%)  Precision (%)  Recall (%)  F1-score
YOLOv3-tiny        51.32         19.48         0.00          54.24            74             29          0.41
YOLOv4-tiny        70.14         28.97         0.00          54.68            73             58          0.70
YOLOv3             74.79         49.31         0.05          62.34            79             63          0.73
YOLOv4             81.83         56.26         0.15          64.46            81             47          0.75
The proposed work  74.89         31.76         0.00          67.67            75             60          0.75
Fig. 4 The ROC curve comparison for all models
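The construction of an ROC curve from scored predictions can be sketched as below. The scores and labels are hypothetical, the sketch assumes that no two scores are tied, and the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold past each score admits one more prediction.
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical confidence scores with ground-truth labels (1 = garbage present).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

A perfect classifier would reach the point (0, 1) and an area of 1, whereas random guessing follows the diagonal with an area of 0.5.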
4.3 Computational performance
Table 2 summarizes the computational performance of the models. The proposed model requires 7.247 billion floating-point operations (BFLOPs), which is 90.87% lower than the YOLOv4 model, which has the highest BFLOPs. This indicates that it is lightweight enough for the constraints of a real-life implementation. Compared to the conventional YOLOv4-tiny, the BFLOPs of the proposed work increase slightly, by 6.68%, due to the additional layers in the network. Besides, the optimized model has a size of 16.4 MB, which is the smallest among YOLOv4 (250 MB), YOLOv3 (238 MB), YOLOv4-tiny (23 MB), and YOLOv3-tiny (35 MB). The model is 1.4 times smaller than YOLOv4-tiny, which demonstrates the effectiveness of implementing the densely connected neural network in the architecture, as it reduces the number of network parameters. Besides, the training time of the proposed model is 6.7% longer than that of YOLOv4-tiny; however, this increase is not significant compared to the other outcomes.
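The percentages quoted above follow directly from the values in Table 2, as the short calculation below shows. The helper name is introduced here for illustration only.

```python
def relative_change_percent(value, reference):
    """Percentage change of value relative to reference (negative means lower)."""
    return (value - reference) / reference * 100.0


# BFLOPs of the proposed model (7.247) against YOLOv4 (79.339) and YOLOv4-tiny (6.793).
bflops_vs_yolov4 = relative_change_percent(7.247, 79.339)
bflops_vs_v4tiny = relative_change_percent(7.247, 6.793)
# Average training time in hours: proposed model (8) against YOLOv4-tiny (7.5).
training_vs_v4tiny = relative_change_percent(8.0, 7.5)
# Model size in MB: YOLOv4-tiny (23) against the proposed model (16.4).
size_ratio = 23.0 / 16.4
```

Rounded, these give a 90.87% reduction in BFLOPs against YOLOv4, a 6.68% increase against YOLOv4-tiny, a 6.7% longer training time, and a model about 1.4 times smaller than YOLOv4-tiny.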
4.4 Detection on test images
In this section, the performance of the proposed optimized model is evaluated with test images from all 5 classes. Some challenging images that are blurry, noisy, darkened or brightened can still be detected because of the wide variation of images in the datasets, as can be seen in Fig. 5. The variation introduced through image augmentation helps to mimic the actual environment as closely as possible. This suggests that the detector is reliable under various real-life weather conditions, such as rainy or sunny days.
The IoU threshold sets how much a predicted box must overlap the ground truth before the detection is counted as correct. Hence, the lower the threshold value, the more objects are counted as detected, which improves the measured performance of the model. Precision and recall are evaluated at the threshold values shown in the comparative graphs in Figs. 6 and 7.
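The influence of the threshold on precision and recall can be illustrated with a small sketch. It assumes that the threshold is applied to the IoU between each detection and its matched ground-truth box, and that each detection is matched to a different object; the IoU values are hypothetical.

```python
def precision_recall_at_iou(detection_ious, num_ground_truth, iou_threshold):
    """Precision and recall when a detection counts as correct only if its IoU
    with the matched ground-truth box reaches iou_threshold."""
    tp = sum(1 for iou in detection_ious if iou >= iou_threshold)
    fp = len(detection_ious) - tp
    precision = tp / (tp + fp) if detection_ious else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Hypothetical best IoU of five detections for one class with five labelled objects.
detection_ious = [0.95, 0.80, 0.65, 0.40, 0.20]
loose = precision_recall_at_iou(detection_ious, num_ground_truth=5, iou_threshold=0.3)
strict = precision_recall_at_iou(detection_ious, num_ground_truth=5, iou_threshold=0.9)
```

In this example the loose threshold of 0.3 accepts four detections, while the strict threshold of 0.9 accepts only one, so both precision and recall fall as the threshold rises.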
Based on Fig. 6, the plastic container class outperforms the other object classes, with the highest precision at all threshold values. At the threshold of 0.3, the second-best result is achieved by aluminium can, followed by plastic bottle, plastic bag, and Styrofoam. The plastic container class has the most true positive (TP) and the fewest false positive (FP) detections. At a threshold of 0.9, the precision of most classes drops significantly, except for the plastic container. The lowest precision, with the most FP, is obtained by the plastic bag class (13%), which means that the model detects these objects with high confidence but assigns them to the wrong classes.
Table 2 Comparison of the computational performance of the models
Model              BFLOPs  Detection time (s)  Average training time (h)  Model size (MB)  Frames per second (FPS)
YOLOv3-tiny        5.454   40.87               4.2                        35.0             66.2
YOLOv4-tiny        6.793   39.35               7.5                        23.0             66.3
YOLOv3             65.333  418.21              13                         238.0            33.1
YOLOv4             79.339  456.38              15.5                       250.0            34.8
The proposed work  7.247   38.15               8                          16.4             66.4
Fig. 5 Example of some test images
Fig. 6 Precision of each object class
Furthermore, in terms of the recall values in Fig. 7, the plastic container class also obtained the highest overall results, with significant differences compared to the other classes. However, at the threshold of 0.7, the aluminium can class shows the highest recall value of 66%, which is about 8% higher than that of the plastic container class. At this threshold, the plastic container class has more false negative (FN) results than the aluminium can class, that is, more cases where the model fails to detect objects that are present.
Generally, looking at the overall performance in terms of precision and recall, the plastic container class has the best results, followed by the aluminium can, plastic bottle, styrofoam, and plastic bag classes. The performance for each object class is affected mostly by the amount of data available, as well as the common features in terms of the shapes and colours of the objects. Plastic bags have the lowest detection results due to their indistinct shapes and colours, compared to other objects whose features are easier for the model to learn. In short, precision and recall values decrease as the threshold value increases.
Fig. 7 Recall of each object class
Table 3 Benchmark of the proposed work with previous works
References             Applications           Data           Accuracy (%)
Sherwood et al. [5]    YOLOv3                 9 Classes      48.35
Sherwood et al. [5]    YOLOv3-tiny            9 Classes      39.92
Pedersen et al. [12]   YOLOv3-tiny            5 Classes      84.58
Li et al. [2]          YOLO-v3                3 Classes      91.40
Alejandro et al. [13]  DNN                    Random debris  70.00
Zhang et al. [14]      YOLOv3                 Random debris  78.60
Zhang et al. [14]      Faster R-CNN           Random debris  81.20
Deng et al. [15]       Mask R-CNN             22 Classes     56.70
Deng et al. [15]       Improved Mask R-CNN    22 Classes     65.00
Ye et al. [16]         YOLO with VAE          3 Classes      69.7
Wu et al. [17]         GC-YOLOv5              5 Classes      99
Arulmozhi et al. [18]  FRCNN                  1 Class        80–90
Zailan et al. [19]     Modified YOLOv4-model  5 Classes      89
The proposed work      Improved YOLOv4-tiny   5 Classes      74.89
In addition, Table 3 benchmarks the proposed detection model against previous works on similar applications. The performance of the proposed improved model based on YOLOv4-tiny is evaluated on 5845 test images and has produced an mAP value of 74.89% with a smaller model size of 16.4 MB compared to YOLOv3-tiny in [12] with a 35 MB model size. Despite achieving the highest mAP values of 91.40% and 84.58%, respectively, the works in [2,12] were evaluated on far fewer test images, only 60 and 301 images. Meanwhile, the proposed
work focuses on detecting five classes of debris (styrofoam, plastic bags, plastic bottles, plastic containers, and aluminium cans), with more training and validation images, as well as up to 5845 test images. On the other hand, the work in [5] using YOLOv3 has the biggest model size (238 MB) with only 48.35% accuracy. Furthermore, Zhang et al. [14] also demonstrate quite promising results using YOLOv3 (78.60%) and Faster R-CNN (81.20%). However, the models are evaluated on random datasets of floating debris with no particular class detection. The work in [15] using Mask R-CNN (56.70%) and improved Mask R-CNN (65%) also could not beat the proposed model in terms of detection accuracy. On the other hand, the work in [17] reported an accuracy of 99% using a limited private image database with 642 images for training and 40 images for validation. The work in [18] proposed a plastic detection system using the FRCNN method with an accuracy range of 80 to 90%. Meanwhile, the work in [19] achieved an accuracy of 89% in detecting 5 classes of garbage by improving the YOLOv4 model. The work in [19] used 9554 training images and 2481 test images, which is considered limited and less robust compared to this work, which used 21,358 training images and 5845 test images. The framework in [19] focuses on improving the conventional YOLOv4 model, which includes modifying CSPDarkNet53 in the backbone to overcome limitations due to training time, and an improved PANet in the neck module to aid the feature extraction process. In contrast, this work focuses on improving the lightweight version, the YOLOv4-tiny model, to support applications in low-cost embedded devices. Hence, it can be concluded that the proposed detection model is feasible due to its ability to detect more types of debris accurately with the smallest model size compared to previous works. It offers a good trade-off between accuracy and size compared with other models, which is a huge advantage for real-life applications on low-cost embedded devices.
5 Conclusion
In conclusion, an optimized model for garbage detection has been proposed based on a modified YOLOv4-tiny model. It achieves a mean average precision of 74.89% with a 16.4 MB model size. The proposed model is expected to detect objects in images under several conditions, such as blurry, noisy, dark, and bright images, as well as objects seen from different perspectives or angles. In other words, the proposed model is feasible under different environmental conditions. The proposed model consists of three stages, which are the backbone feature extraction network, the neck network, and the object detection stage. As presented in Table 3, the proposed model shows better performance compared to other state-of-the-art models. This is achieved by increasing the number of concatenated layers of the convolutional neural network using DenseNet for better feature extraction, and by a customized anchor box mechanism, generated using the K-means clustering algorithm, to better suit this work's dataset. Furthermore, the proposed model also adopts the Mish activation function and optimized hyperparameters, which create a good balance between the overall accuracy and the model size of the object detection system for real-time detection.
Acknowledgements The research funding is provided by Universiti
Malaya with project number IMG001-2022.
Author contributions NAZ, MHJ and ASMK performed analysis,
investigation, validation, and draft manuscript. KH and UK prepared
conceptualization, methodology and figures. All authors reviewed the
manuscript.
Data availability The dataset analyzed in this study is available upon
reasonable request.
Declarations
Conflict of interest The authors declare that they have no known com-
peting financial interests or personal relationships that could have
appeared to influence the work reported in this paper. All the authors
listed have approved the manuscript that is enclosed.
Ethical approval Ethical and informed consent for data used. No ethical
data in this paper.
References
1. Chen, Y.C.: Effects of urbanization on municipal solid waste composition. Waste Manag. 79, 823–836 (2018). https://doi.org/10.1016/j.wasman.2018.04.017
2. Li, X., Tian, M., Kong, S., Wu, L., Yu, J.: A modified YOLOv3 detection method for vision-based water surface garbage capture robot. Int. J. Adv. Rob. Syst. (2020). https://doi.org/10.1109/ICCEA50009.2020.00176
3. Junos, M., Mohd Khairuddin, A., Thannirmalai, S., Dahari, M.: Automatic detection of oil palm fruits from UAV images using an improved YOLO model. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02116-3
4. Junos, M., Mohd Khairuddin, A., Dahari, M.: Automated object detection on aerial images for limited capacity embedded device using a lightweight CNN model. Alex. Eng. J. (2022). https://doi.org/10.1016/j.aej.2021.11.027
5. Sherwood, L., Tian, M., Kong, S., Wu, L., Yu, J.: Applying object detection to monitoring marine debris. In: Tropical Conservation Biology and Environmental Science TCBES Theses, vol 14, No. 8 (2020). http://hdl.handle.net/10790/5298
6. Junos, M.H., Mohd Khairuddin, A.S., Thannirmalai, S., Dahari, M.: An optimized YOLO-based object detection model for crop harvesting system. IET Image Process. 15(9), 2112–2125 (2021). https://doi.org/10.1049/ipr2.12181
7. Momin, M.A., Junos, M.H., Mohd Khairuddin, A.S., et al.: Lightweight CNN model: automated vehicle detection in aerial images. SIViP 17, 1209–1217 (2022). https://doi.org/10.1007/s11760-022-02328-7
8. Kaggle: Datasets. https://www.kaggle.com/datasets. Accessed 5 Feb 2021
9. OR&R’s Marine Debris Program: Marine Debris Monitoring and Assessment Project. https://marinedebris.noaa.gov/research/marine-debrismonitoring-and-assessment-project (2020). Accessed 12 Sept 2020
10. Litwinow, N.: Contaminants in water in the marine environment. Kaggle. https://doi.org/10.34740/KAGGLE/DS/2088659. Accessed 21 Feb 2022
11. Panwar, H.: Aquatrash. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4237900. Accessed 15 Mar 2022
12. Pedersen, M., Haurum, J.B., Moeslund, T.: Detection of marine animals in a new underwater dataset with varying visibility. In: Environmental Science, Computer Science, CVPR Workshops. https://openaccess.thecvf.com/content_CVPRW_2019/papers/AAMVEM/Pedersen_Detection_of_Marine_Animals_in_a_New_Underwater_Dataset_with_CVPRW_2019_paper.pdf (2019)
13. Alejandro, M., Toro, V.: Deep neural networks for marine debris detection in sonar images. Dissertation submitted to Heriot-Watt University, Edinburgh. arXiv:1905.0524 (2019)
14. Zhang, L., Zhang, Y., Zhang, Z., Shen, J., Wang, H.: Real-time water surface object detection based on improved faster R-CNN. Sensors (2019). https://doi.org/10.3390/s19163523
15. Deng, H., Ergu, D., Liu, F., Ma, B., Chai, Y.: An embeddable algorithm for automatic garbage detection based on complex marine environment. Sensors (2021). https://doi.org/10.3390/s21196391
16. Ye, A., Pang, B., Jin, Y., Cui, J.: A YOLO-based neural network with VAE for intelligent garbage detection and classification. In: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, pp. 1–7 (2020)
17. Wu, Z., Zhang, D., Shao, Y., Zhang, X., Zhang, X., Feng, Y., Cui, P.: Using YOLOv5 for garbage classification. In: 2021 4th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), pp. 35–38. IEEE (2021)
18. Arulmozhi, M., Iyer, N.G., Jeny Sophia, S., Sivakumar, P., Amutha, C., Sivamani, D.: Comparison of YOLO and Faster R-CNN on Garbage Detection. In: Optimization Techniques in Engineering: Advances and Applications, pp. 37–49 (2023)
19. Zailan, N.A., Azizan, M.M., Hasikin, K., Mohd Khairuddin, A.S., Khairuddin, U.: An automated solid waste detection using the optimized YOLO model for riverine management. Front. Public Health 10, 907280 (2022). https://doi.org/10.3389/fpubh.2022.907280
20. Cchangcs: Garbage classification. Kaggle. https://doi.org/10.34740/KAGGLE/DS/81794 (2018). Accessed 14 Mar 2022
Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such
publishing agreement and applicable law.