
An automatic garbage detection using optimized YOLO model

Signal, Image and Video Processing
https://doi.org/10.1007/s11760-023-02736-3
ORIGINAL PAPER
Nur Athirah Zailan¹ · Anis Salwa Mohd Khairuddin¹ · Khairunnisa Hasikin¹ · Mohamad Haniff Junos² · Uswah Khairuddin³
Received: 17 July 2023 / Revised: 4 August 2023 / Accepted: 8 August 2023
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023
Abstract
Garbage pollution is an increasing global concern, and innovative solutions are needed to control it. To develop an efficient cleaner robot, it is crucial to obtain visual information about floating garbage on the river. Deep learning has been actively applied over the past few years to tackle various problems. Deep learning models can learn high-level, semantic features from visual information, which is important for detecting and classifying different types of floating garbage. This paper proposes an optimized You Only Look Once v4 Tiny (YOLOv4-tiny) model to detect floating garbage, mainly by improving the spatial pyramid pooling with average pooling, the Mish activation function, a concatenated densely connected neural network, and hyperparameter optimization. The proposed model achieves 74.89% mean average precision with a size of 16.4 MB, which offers the best trade-off among the compared models. The proposed model shows promising results in terms of model size, detection time and memory space, making it feasible to embed in low-cost devices.
Keywords Computer vision · Debris · Deep learning · Image processing · Object detection
1 Introduction
Garbage pollution in river ecosystems has been a major environmental issue across the globe for decades. Submerged debris can be a danger to both marine life and fishing vessels. Initiatives adopted to manage pollution, such as manual and machine-based cleaning, require constant human supervision. In addition, manual cleaning of waste can be a threat to the workers involved [1]. Hence, an autonomous cleaning robot that can remove waste from the water would contribute significantly to river pollution control. However, designing a suitable robot is a challenging task. The main tasks to be performed by cleaner robots are garbage detection and garbage collection. The detection task is particularly significant since it provides precise object location information for the cleaning robot.

Corresponding author: Anis Salwa Mohd Khairuddin, anissalwa@um.edu.my
¹ Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur, Malaysia
² School of Aerospace Engineering, Universiti Sains Malaysia, Engineering Campus, 14300 Nibong Tebal, Penang, Malaysia
³ School Malaysia Japan International Institute of Technology, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia

Therefore, an efficient object detection method that incorporates computer vision is in high demand. Generally, computer vision is a field of artificial intelligence (AI) that enables computers and systems to interpret information from various visual inputs. The rapid growth of machine learning technology in machine vision applications has led to deep learning methods achieving state-of-the-art results in object detection [2]. Deep learning methods can also automatically extract deep features from the input image through self-learning. Faster R-CNN, Single Shot Detector (SSD), and You Only Look Once (YOLO) are some examples of object detection algorithms that could be used to obtain visual inputs for cleaning robots.
The YOLOv4 algorithm, which has been widely used, integrates features from YOLOv1, YOLOv2, and YOLOv3. In real, complicated environments, external hindrances such as obstruction and multi-scale objects still cause deficiencies in garbage detection when YOLOv4 is used directly. Some of the concerns are the long training time, high computation cost and excessive number of parameters [3–7]. Besides, varying weather and lighting conditions remain a challenge in most works because datasets covering these conditions have rarely been addressed. The proposed model is developed based on three key objectives: improving detection under various weather conditions, achieving real-time prediction, and requiring only small-scale memory storage. The two main improvements in the optimized model are as follows:
Firstly, the Mish activation function is fine-tuned to increase regularization, expressivity, and gradient flow in obtaining a more generalized model.
Secondly, DenseNet with Spatial Pyramid Average Pooling is implemented by adding more layers to concatenate valuable features in the same convolutional layer, thus increasing the receptive field of the network, which improves detection accuracy.
2 Related works
Visual counting is a common method for managing floating debris, but it requires manual labour to count the visible debris in the river. Its risks include biased judgements by observers as well as geographical limitations. Therefore, with the advancement of machine learning technologies, automatic riverine monitoring systems can now be implemented. The most important step in developing an efficient monitoring system is having a reliable visual detector to locate and identify the debris in the river [8–11]. Mainstream object detection algorithms are based on convolutional neural networks (CNN) and fall into one-stage and two-stage detectors, which use different feature extraction methods. Two-stage detectors include R-CNN, Fast R-CNN, and Faster R-CNN, which divide the detection task into region proposal and classification. Meanwhile, one-stage detectors integrate region proposal and classification into one step, which reduces the detection time. The mainstream one-stage detectors are SSD and YOLO. SSD is often recommended for object detection applications due to its accuracy and speed. On the other hand, the idea of the YOLO detector is to apply a single neural network to the entire image, where the network splits the image into regions and concurrently predicts bounding boxes and probabilities for each region [3–7]. The work in [2] proposed a modified YOLOv3 model for garbage detection and achieved a mean average precision (mAP) of 91.431%. However, the model only detects three classes, which are bottle, bag and styrofoam. The work in [5] showed that the YOLOv3 model performs better than the YOLOv3-tiny model in detecting garbage. The works in [12–19] modified the deep learning architecture to improve the accuracy of garbage detection. However, previous works reported shortcomings of garbage detection using computer vision, such as long training time, high computation cost and an excessive number of parameters. Hence, this work aims to improve the original network features. Besides, implementation on an embedded device for real-life applications requires the model to be small, lightweight, and fast. YOLOv4 has been shown to be a high-precision, real-time one-stage object detection algorithm. YOLOv4-tiny, on the other hand, is a simplified version of YOLOv4. It has become practical for deployment on mobile and embedded devices due to its faster training time and detection speed [16, 17]. Therefore, this work focuses on optimizing the conventional YOLOv4-tiny model to detect floating debris for a river monitoring system, which satisfies the requirements mentioned earlier while maintaining accurate detection performance.
3 Proposed methodology
3.1 Dataset
This work utilizes garbage images from open access databases [9–11]. To create an effective object detector, the training images are augmented in terms of brightness and position to prevent overfitting. This project focuses on five common classes of debris, namely styrofoam, plastic bag, plastic bottle, plastic container, and aluminium can. The input image size is 416 × 416. The proposed model is trained and tested on 21,358 and 5,845 images, respectively.
3.2 The proposed optimized YOLO model
In this work, an optimized model based on YOLOv4-tiny is proposed with the goals of improving overall accuracy, detection time, and model size, which are crucial for implementation in embedded devices. There are four key components in this model: the fine-tuned Mish activation function, which optimizes the usage of Mish instead of the rectified linear unit (ReLU); spatial pyramid average pooling (SPAP) in the DenseNet architecture with more concatenated layers; hyperparameter optimization, carried out by varying the hyperparameters over several series of experiments; and a customized anchor box mechanism, whose anchors are generated using the K-means clustering algorithm.
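
The anchor box generation mentioned above can be illustrated with a small sketch. The paper does not give the details of its clustering, so this assumes the common YOLO practice of clustering ground-truth box sizes with IoU as the similarity measure; the function names `wh_iou` and `kmeans_anchors` and the toy box sizes are hypothetical.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (width, height) boxes that share the same centre."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchors, assigning by highest IoU."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # initialise from k distinct boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: wh_iou(b, anchors[j]))
            clusters[best].append(b)
        new_anchors = []
        for j, members in enumerate(clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_anchors.append((w, h))
            else:
                new_anchors.append(anchors[j])  # keep an empty cluster's anchor
        if new_anchors == anchors:  # assignments stopped changing
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


# Toy ground-truth box sizes in pixels (assumed, not taken from the dataset).
boxes = [(10, 12), (11, 13), (12, 11), (100, 90), (95, 105), (110, 100)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

Anchors that match the typical object sizes in the dataset give the detector a better starting point for box regression, which is consistent with the higher average IoU reported later for the proposed model.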
3.3 The fine-tuned Mish activation function
The Mish activation function is an improved activation function that is smooth and non-monotonic [3]. It can be defined as:

$$f(x) = x \cdot \tanh(\varsigma(x)) \tag{1}$$

where

$$\varsigma(x) = \ln(1 + e^{x}) \tag{2}$$

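
As a hedged illustration of Eqs. (1) and (2), the sketch below evaluates Mish for scalar inputs. It is not the authors' implementation; the function names `softplus` and `mish` are chosen here for illustration, and the rearranged softplus is only a standard trick to avoid overflow for large inputs.

```python
import math


def softplus(x):
    """Eq. (2), ln(1 + e^x), rearranged so that large |x| does not overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Eq. (1): x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

For large positive inputs the function approaches the identity, while negative inputs are squashed towards a small negative floor, which is the bounded-below, unbounded-above behaviour described in this section.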
By providing the scalar input to the gate through self-gating, the Mish function has a similar property to the Swish function, which makes it a useful substitute for existing activation functions, including ReLU. The Mish function is also straightforward to implement in a deep learning framework by specifying a custom activation layer. However, for the Mish function, it is advisable to set a lower learning rate compared to ReLU for better results. The Mish activation function is bounded below, unbounded above, smooth, and non-monotonic, which results in increased expressivity and gradient flow. Hence, this work implements the fine-tuned Mish activation function in its architecture (Fig. 1). The network is modified by inserting two CBM blocks, each consisting of a 1 × 1 Conv-BatchNorm-Mish, that initially process the input. A Conv-BatchNorm-ReLU block is added to provide a clear and precise transition of scalar magnitudes before a 3 × 3 convolution is performed to enhance the feature extraction. The output is fed forward as the input to the next convolutions. In one of the feedforward paths, the output of the 3 × 3 convolution is divided into two parts, and another 3 × 3 convolution is performed before it is stacked with a 1 × 1 convolution to further integrate the channels. Finally, the parts are concatenated to obtain smoother loss functions in the transition results, which benefits the generalization and optimization of the model. The Mish function is also integrated into the DenseNet structure along with the ReLU function as the activation functions in its dense layers. Both functions are crucial in improving the cost efficiency and regularization of the network structure, because their properties provide different nonlinearities, each of which typically works well for approximating a specific kind of function.
3.4 DenseNet with spatial pyramid average pooling (SPAP)
Reduced gradient information is one of the concerns in deep convolutional neural networks. This happens when feature information slowly degrades as a large amount of information is transferred from the input layer to the output layer. Therefore, a densely connected convolutional network (DenseNet) is adopted in this work to guarantee a strong gradient flow. Generally, DenseNet reuses features in order to ensure highly varied features and deeper patterns. In this work, each layer among the multiple convolution layers of a Dense Block is called $H_i$. $H_i$ consists of batch normalization, the ReLU or Mish function, as mentioned in the previous section, and lastly, convolution. $H_i$ takes the original input and the outputs of all previous layers, $x_0, x_1, \ldots, x_{i-1}$, as its inputs:

$$H_i = b_i([x_0, x_1, \ldots, x_{i-1}]) \tag{3}$$

where $[x_0, x_1, \ldots, x_{i-1}]$ is defined as the concatenated feature maps from layers $[0, 1, 2, \ldots, i-1]$. On the other hand, $b_i$ represents a function that processes the information of the linked feature maps to produce nonlinear transformations. $b_i$ may also generate $y$ feature maps, which can be expressed as follows:

$$y_i = y_0 + y \tag{4}$$

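
The dense connectivity described above can be sketched with a toy example in which a feature map is just a list of channel values. This is a minimal illustration of standard DenseNet-style concatenation, not the paper's network; `make_toy_layer`, `dense_block` and the chosen channel counts are assumptions made for the example.

```python
def make_toy_layer(growth):
    """Return a stand-in for b_i that emits `growth` new channels."""
    def layer(channels):
        mean = sum(channels) / len(channels)
        return [mean + j for j in range(growth)]
    return layer


def dense_block(x0, layers):
    """Feed each layer the concatenation of x0 and all earlier layer outputs."""
    features = [list(x0)]
    input_widths = []
    for layer in layers:
        concatenated = [c for fmap in features for c in fmap]
        input_widths.append(len(concatenated))  # channels seen by this layer
        features.append(layer(concatenated))
    output = [c for fmap in features for c in fmap]
    return output, input_widths


# 4 input channels, three layers that each add 2 channels (toy numbers).
out, widths = dense_block([0.0] * 4, [make_toy_layer(2) for _ in range(3)])
print(widths)    # [4, 6, 8]
print(len(out))  # 10
```

Each layer sees every earlier feature map directly, so the input width grows by a fixed amount per layer while each individual layer stays narrow.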
Feature maps are produced as outputs from preceding layers. Therefore, the number of feature maps grows by the number of feature maps produced at each layer. Multiple Dense Blocks are composed to create a DenseNet [4–6]. Different spatial resolutions created at the neck are needed to detect objects at different scales. Therefore, the feature maps probed by the head form a hierarchical structure. In the neck, feature maps from the bottom-up stream are added to the top-down stream to enhance the information that is passed on to the head. This addition is done by concatenation or by element-wise addition of neighbouring feature maps. As a result, the head's input carries spatially rich information. Furthermore, a transition block called Rn, which consists of pooling and convolution, sits between the Dense Blocks. In this work, a spatial pyramid average pooling (SPAP) layer is implemented, which, as the name suggests, uses average pooling instead of max pooling, as shown in Fig. 2.
In SPAP, the feature maps from preceding layers are taken to provide multi-scale local region feature maps of 1 × 256, 4 × 256 and 9 × 256, which translate into an output feature vector of 6 × 1024. The vector is expanded into a 13 × 13 kernel size to be passed to the convolution in the neck network. Average pooling smooths the image rather than emphasizing sharp features, which is useful given the different lighting conditions in the image datasets. This is because the SPAP layer passes on the average pixel values instead of the brightest pixels, as in the conventional SPP with max pooling. The output produced is the outcome of the function $k$ applied to the embedding vectors $e_v$ that are passed to each layer. The function can be written as:

$$k(e_1, e_2, \ldots, e_{W/S}) = \frac{1}{W/S} \sum_{v=1}^{W/S} e_v \tag{5}$$

where $e_v$ represents the $v$-th embedding vector.

Fig. 1 The fine-tuned module structure
Fig. 2 The spatial pyramid average pooling (SPAP)

Average pooling plays an important role in conveying deeper semantic information through embedding vectors. SPAP is used in the first and second transition layers of the DenseNet, where it mainly acts to focus on the overall features of the input that are fed forward to the next layers. As a result, the structure is more native to convolution, since feature maps can be readily interpreted as category confidence maps. Furthermore, spatial information is summed at this layer, and because SPAP has no parameters to optimize, overfitting is avoided at this layer and the output is more robust to spatial translations of the input. The general architecture of the proposed model is illustrated in Fig. 3.
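
To make the pooling step concrete, the following sketch applies pyramid average pooling to a single-channel feature map. It assumes pyramid grids of 1 × 1, 2 × 2 and 3 × 3 bins, inferred from the 1, 4 and 9 regions per channel mentioned in this section; the function names and the 6 × 6 toy input are hypothetical, and the real layer would operate on all 256 channels inside the network.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2-D map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # floor for the start index, ceiling for the end index
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append(sum(cells) / len(cells))
    return pooled


def spap(fmap, levels=(1, 2, 3)):
    """Concatenate the average-pooled outputs of every pyramid level."""
    vector = []
    for bins in levels:
        vector.extend(adaptive_avg_pool(fmap, bins))
    return vector


# Toy 6 x 6 single-channel feature map with values 0..35.
fmap = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
vector = spap(fmap)
print(len(vector))  # 14 values: 1 + 4 + 9
print(vector[:5])
```

Replacing the mean with `max` in `adaptive_avg_pool` would give the conventional max-pooling SPP, which is the substitution this section argues against for images with varying lighting.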
4 Results and discussion
The experiments are carried out on a Windows 10 64-bit operating system with an x64-based processor. The machine is equipped with an AMD Ryzen 7 3750H with Radeon Vega Mobile Gfx at 2.30 GHz, 12.0 GB of installed RAM and an NVIDIA GeForce GTX 1650 graphics card. The GPU accelerator used is a Tesla K80, which is readily available on Google Colab with Jupyter Notebooks and Python 3 as the scripting language. Evaluation metrics are computed for each of the object classes, and the model's performance is evaluated in terms of accuracy, mean average precision (mAP), and recall.

Fig. 3 The overall architecture of the proposed model

4.1 Experimental results
In this part, the optimized proposed model is compared with
several other models, including YOLOv3, YOLOv3-tiny,
YOLOv4 and YOLOv4-tiny. The performance of these mod-
els is evaluated based on the mean average precision, average
IoU, precision, recall, training time, model size, and compu-
tation time.
4.2 Detection performance
The models are also evaluated in terms of mean average precision (mAP) at threshold values of 0.5, 0.75 and 0.95. Table 1 shows that the proposed model outperforms the lightweight YOLOv3-tiny and YOLOv4-tiny models. This indicates the efficiency of the proposed lightweight model in detecting floating debris through the concatenated layers of the densely connected neural network in the backbone. In addition, the proposed work achieves the highest average IoU (67.67%) among all models. This suggests that the customized anchor boxes successfully increase the overlap with the ground truth, which leads to the higher IoU. In terms of precision and recall, the improved YOLO model attains 75% and 60%, respectively, compared with 73% and 58% for the conventional YOLOv4-tiny. Table 1 reports an F1-score of 0.75 for the proposed model, which is the highest value and equal to that of the YOLOv4 model. The model maintains a stable balance between precision and recall, which is necessary to improve its overall detection performance. Receiver operating characteristic (ROC) curves for all models are also shown to compare their performance, as can be seen in Fig. 4.
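
For readers unfamiliar with the IoU metric used above, here is a minimal sketch of how the overlap between a predicted box and a ground-truth box is usually computed. The corner-coordinate box format and the function name `box_iou` are assumptions for illustration; the paper does not specify its evaluation code.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


prediction = (1.0, 1.0, 3.0, 3.0)
ground_truth = (0.0, 0.0, 2.0, 2.0)
print(round(box_iou(prediction, ground_truth), 4))  # 0.1429
```

An IoU of 1 means the predicted box coincides with the ground truth, and 0 means no overlap; the average IoU in Table 1 is this quantity averaged over detections.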
Table 1 Comparison of the detection performance for different models

Model              mAP@0.50 (%)  mAP@0.75 (%)  mAP@0.95 (%)  Average IoU (%)  Precision (%)  Recall (%)  F1-score
YOLOv3-tiny        51.32         19.48         0.00          54.24            74             29          0.41
YOLOv4-tiny        70.14         28.97         0.00          54.68            73             58          0.70
YOLOv3             74.79         49.31         0.05          62.34            79             63          0.73
YOLOv4             81.83         56.26         0.15          64.46            81             47          0.75
The proposed work  74.89         31.76         0.00          67.67            75             60          0.75

Fig. 4 The ROC curve comparison for all models

As mentioned in the previous section, each test prediction has only four possible outcomes: True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). The YOLOv4 model has the best ROC curve of all models because its shape is closest to that of the perfect classifier, which has a 100% true positive rate and a 0% false positive rate. In other words, the closer the curve is to the upper left corner of the graph, the better the performance of the model in terms of ROC. Following closely after YOLOv4 are the proposed model, YOLOv3, YOLOv4-tiny, and finally, YOLOv3-tiny. YOLOv3-tiny has the curve closest to the straight diagonal line, which indicates no predictive power or random guessing. One of the benefits of using the ROC curve is that it helps to find the most suitable classification threshold for a specific problem, in this case, the floating garbage classifier.
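
The ROC construction described above can be sketched as follows. The example assumes each prediction carries a confidence score and a binary ground-truth label, which is a simplification of detection evaluation; the scores, labels and the function name `roc_points` are made up for illustration.

```python
def roc_points(scores, labels, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy confidence scores and ground-truth labels (1 = garbage present).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
for fpr, tpr in roc_points(scores, labels, [1.0, 0.85, 0.5, 0.35, 0.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Sweeping the threshold from high to low traces the curve from the lower left to the upper right corner, and the threshold whose point lies closest to the upper left corner is a natural candidate for deployment.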
4.3 Computational performance
Table 2 summarizes the computational performance of the models. The proposed model requires 7.247 billion floating-point operations (BFLOPs), which is 90.87% lower than the YOLOv4 model, which has the highest BFLOPs. This indicates that it is lightweight enough for the constraints of a real-life implementation. Compared to the conventional YOLOv4-tiny, the BFLOPs of the proposed work increase slightly, by 6.68%, due to the larger number of layers in the network. Besides, the optimized model has a size of 16.4 MB, which is the smallest among YOLOv4 (250 MB), YOLOv3 (238 MB), YOLOv4-tiny (23 MB), and YOLOv3-tiny (35 MB). The 1.4-fold reduction in model size relative to YOLOv4-tiny demonstrates the effectiveness of implementing the densely connected neural network in the architecture, which reduces the number of network parameters. Besides, the training time for the proposed model is 6.7% longer than for YOLOv4-tiny; however, this is not significant when compared to the other outcomes.
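
The relative figures quoted in this section can be reproduced from the values in Table 2. The short check below is only a worked arithmetic example; the helper name `percent_change` is hypothetical.

```python
def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


# Values taken from Table 2.
bflops_vs_yolov4 = percent_change(7.247, 79.339)  # proposed vs YOLOv4
bflops_vs_tiny = percent_change(7.247, 6.793)     # proposed vs YOLOv4-tiny
size_ratio = 23.0 / 16.4                          # YOLOv4-tiny size / proposed size
training_vs_tiny = percent_change(8.0, 7.5)       # training hours

print(f"BFLOPs vs YOLOv4:        {bflops_vs_yolov4:.2f}%")
print(f"BFLOPs vs YOLOv4-tiny:   {bflops_vs_tiny:.2f}%")
print(f"Model size ratio:        {size_ratio:.2f}x")
print(f"Training vs YOLOv4-tiny: {training_vs_tiny:.2f}%")
```

The results agree with the reported 90.87% reduction, 6.68% increase, 1.4-fold size reduction and roughly 6.7% longer training time.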
4.4 Detection on test images
In this section, the performance of the proposed optimized model is evaluated with test images from all 5 classes. Some challenging images that are blurry, noisy, darkened or brightened can still be detected because of the wide variation of images in the datasets, as can be seen in Fig. 5. The variation introduced through image augmentation is intended to mimic the actual environment as closely as possible. This suggests that the detector is reliable enough to be used in various real-life weather conditions, such as rainy or sunny days.
The IoU threshold value limits the confidence the model needs to count an object as detected. Hence, the lower the threshold value, the more objects are detected, which contributes to an improvement in the overall performance of the model. Precision and recall are evaluated at the threshold values shown in the comparative graphs in Figs. 6 and 7.
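
The effect of the threshold on precision and recall can be illustrated with a small sketch. It assumes that every detection has already been matched to at most one ground-truth object and that a detection counts as a true positive when its IoU reaches the threshold; the IoU values, the object count and the function name `precision_recall` are invented for the example.

```python
def precision_recall(ious, num_ground_truth, threshold):
    """Precision and recall when a detection is a TP if its IoU >= threshold."""
    tp = sum(1 for iou in ious if iou >= threshold)
    fp = len(ious) - tp           # detections that overlap too little
    fn = num_ground_truth - tp    # ground-truth objects left undetected
    precision = tp / (tp + fp) if ious else 0.0
    recall = tp / (tp + fn) if num_ground_truth else 0.0
    return precision, recall


# Toy IoU of each detection with its matched ground-truth box; 6 real objects.
ious = [0.95, 0.80, 0.60, 0.35, 0.10]
for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall(ious, 6, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Under these assumptions, raising the threshold turns borderline detections into false positives, so both precision and recall fall as the threshold increases.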
Based on Fig. 6, the plastic container class outperforms the other object classes, with the highest overall precision at all threshold values. At the threshold of 0.3, the second-best result is achieved by aluminium can, followed by plastic bottle, plastic bag, and styrofoam. The plastic container class has the most true positive (TP) and the fewest false positive (FP) detections. At a threshold of 0.9, the precision for most classes drops significantly, except for the plastic container class. The lowest precision, with the most FP, is obtained by the plastic bag class (13%), which means that the model's confidence in detecting the object is high but, unfortunately, for the wrong classes.
Table 2 Comparison of the computational performance of the models

Model              BFLOPs   Detection time (s)  Average training time (h)  Model size (MB)  Frames per second (FPS)
YOLOv3-tiny        5.454    40.87               4.2                        35.0             66.2
YOLOv4-tiny        6.793    39.35               7.5                        23.0             66.3
YOLOv3             65.333   418.21              13                         238.0            33.1
YOLOv4             79.339   456.38              15.5                       250.0            34.8
The proposed work  7.247    38.15               8                          16.4             66.4

Fig. 5 Example of some test images
Fig. 6 Precision of each object class

Furthermore, in terms of the recall values in Fig. 7, the plastic container class also obtains the highest overall results, with significant differences compared to the other classes. However, at a threshold of 0.7, the aluminium can class shows the highest recall value of 66%, which is about 8% higher than plastic container. At this threshold, the plastic container class has more false negative (FN) results than the aluminium can class, because the model fails to detect objects when they are present.
Fig. 7 Recall of each object class

Generally, in terms of overall precision and recall, plastic container has the best results, followed by aluminium can, plastic bottle, styrofoam, and plastic bag. The performance for each object class is affected mostly by the number of images available, as well as the common features in terms of the shapes and colours of the objects. Plastic bag has the lowest detection results due to its indistinct shapes and colours, compared to other objects with features that are easier for the model to learn. In short, precision and recall values decrease as the threshold value increases. In addition, Table 3 benchmarks the proposed detection model with previous works
on similar applications. The performance of the proposed
improved model based on YOLOv4-tiny is evaluated on 5845
test images and has produced the mAP value of 74.89% with
a smaller model size of 16.4 MB compared to YOLOv3-tiny
in [12] with 35 MB model size. Despite achieving the highest
mAP value of 91.40% and 84.58%, respectively, the work in
[2, 12] is evaluated on far fewer test images, only 60 and 301 images, respectively. Meanwhile, the proposed work focuses on detecting five classes of debris (styrofoam, plastic bags, plastic bottles, plastic containers, and aluminium cans) with more training and validation images, as well as up to 5845 test images. On the other hand, the work in [5] using YOLOv3 has the biggest model size (238 MB) with only 48.35% accuracy. Furthermore, Zhang et al. [14] also demonstrate quite promising results using YOLOv3 (78.60%) and Faster R-CNN (81.20%). However, the models are evaluated on random datasets of floating debris with no particular class detection. The work in [15] using Mask R-CNN (56.70%) and improved Mask R-CNN (65%) also does not match the proposed model in terms of detection accuracy.

Table 3 Benchmark of the proposed work with previous works

References             Applications           Data           Accuracy (%)
Sherwood et al. [5]    YOLOv3                 9 Classes      48.35
Sherwood et al. [5]    YOLOv3-tiny            9 Classes      39.92
Pedersen et al. [12]   YOLOv3-tiny            5 Classes      84.58
Li et al. [2]          YOLO-v3                3 Classes      91.40
Alejandro et al. [13]  DNN                    Random debris  70.00
Zhang et al. [14]      YOLOv3                 Random debris  78.60
Zhang et al. [14]      Faster R-CNN           Random debris  81.20
Deng et al. [15]       Mask R-CNN             22 Classes     56.70
Deng et al. [15]       Improved Mask R-CNN    22 Classes     65.00
Ye et al. [16]         YOLO with VAE          3 Classes      69.7
Wu et al. [17]         GC-YOLOv5              5 Classes      99
Arulmozhi et al. [18]  FRCNN                  1 Class        80–90
Zailan et al. [19]     Modified YOLOv4-model  5 Classes      89
The proposed work      Improved YOLOv4-tiny   5 Classes      74.89

On the other hand, the work in [17] reported an accuracy of 99% using a limited private image database with 642 images for training and 40 images for validation. The work in [18] proposed a plastic detection system using the FRCNN method with an accuracy between 80 and 90%. Meanwhile, the work in [19] achieved an accuracy of 89% for detecting 5 classes of garbage by improving the YOLOv4 model. The work in [19] used 9554 training images and 2481 test images, which is considered limited and less robust compared to this work, which uses 21,358 training images and 5845 test images. The framework in [19] focuses on improving the conventional YOLOv4 model, which includes modifying CSPDarkNet53 in the backbone to overcome limitations in training time, and improving PANet in the neck module to aid the feature extraction process. In contrast, this work focuses on improving the lightweight version, the YOLOv4-tiny model, to support application in low-cost embedded devices. Hence, it can be concluded that the proposed detection model is feasible due to its ability to detect more types of debris accurately with the smallest model size compared to previous works. It offers a good trade-off among the models in terms of accuracy and size, which is a huge advantage for real-life applications on low-cost embedded devices.
5 Conclusion
In conclusion, an optimized model for garbage detection has been proposed based on a modified YOLOv4-tiny model. It achieves a mean average precision of 74.89% with a model size of 16.4 MB. The proposed model is expected to detect objects in images under several conditions, such as blurry, noisy, dark, and bright images, as well as objects seen from different perspectives or angles. In other words, the proposed model is feasible under different environmental conditions. The proposed model consists of three stages, which are the backbone feature extraction network, the neck network, and the object detection stage. As presented in Table 3, the proposed model offers a favourable trade-off between accuracy and model size compared to other state-of-the-art models. This is achieved by increasing the number of concatenated layers of the convolutional neural network using DenseNet for better feature extraction, and by a customized anchor box mechanism, generated using the K-means clustering algorithm, to better suit this work's dataset. Furthermore, the proposed model also adopts the Mish activation function and optimized hyperparameters, which create a good balance between the overall accuracy and the model size of the object detection system for real-time detection.
Acknowledgements The research funding is provided by Universiti
Malaya with project number IMG001-2022.
Author contributions NAZ, MHJ and ASMK performed analysis,
investigation, validation, and draft manuscript. KH and UK prepared
conceptualization, methodology and figures. All authors reviewed the
manuscript.
Data availability The dataset analyzed in this study is available upon
reasonable request.
Declarations
Conflict of interest The authors declare that they have no known com-
peting financial interests or personal relationships that could have
appeared to influence the work reported in this paper. All the authors
listed have approved the manuscript that is enclosed.
Ethical approval Ethical and informed consent for data used. No ethical
data in this paper.
References
1. Chen, Y.C.: Effects of urbanization on municipal solid waste composition. Waste Manag. 79, 823–836 (2018). https://doi.org/10.1016/j.wasman.2018.04.017
2. Li, X., Tian, M., Kong, S., Wu, L., Yu, J.: A modified YOLOv3 detection method for vision-based water surface garbage capture robot. Int. J. Adv. Rob. Syst. (2020). https://doi.org/10.1109/ICCEA50009.2020.00176
3. Junos, M., Mohd Khairuddin, A., Thannirmalai, S., Dahari, M.: Automatic detection of oil palm fruits from UAV images using an improved YOLO model. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02116-3
4. Junos, M., Mohd Khairuddin, A., Dahari, M.: Automated object detection on aerial images for limited capacity embedded device using a lightweight CNN model. Alex. Eng. J. (2022). https://doi.org/10.1016/j.aej.2021.11.027
5. Sherwood, L., Tian, M., Kong, S., Wu, L., Yu, J.: Applying object detection to monitoring marine debris. In: Tropical Conservation Biology and Environmental Science TCBES Theses, vol 14, No. 8 (2020). http://hdl.handle.net/10790/5298
6. Junos, M.H., Mohd Khairuddin, A.S., Thannirmalai, S., Dahari, M.: An optimized YOLO-based object detection model for crop harvesting system. IET Image Process. 15(9), 2112–2125 (2021). https://doi.org/10.1049/ipr2.12181
7. Momin, M.A., Junos, M.H., Mohd Khairuddin, A.S., et al.: Lightweight CNN model: automated vehicle detection in aerial images. SIViP 17, 1209–1217 (2022). https://doi.org/10.1007/s11760-022-02328-7
8. Kaggle: Datasets. https://www.kaggle.com/datasets. Accessed 5 Feb 2021
9. OR&R’s Marine Debris Program: Marine Debris Monitoring and Assessment Project. https://marinedebris.noaa.gov/research/marine-debrismonitoring-and-assessment-project (2020). Accessed 12 Sept 2020
10. Litwinow, N.: Contaminants in water in the marine environment. Kaggle. https://doi.org/10.34740/KAGGLE/DS/2088659. Accessed 21 Feb 2022
11. Panwar, H.: Aquatrash. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4237900. Accessed 15 Mar 2022
12. Pedersen, M., Haurum, J.B., Moeslund, T.: Detection of marine animals in a new underwater dataset with varying visibility. In: Environmental Science, Computer Science, CVPR Workshops. https://openaccess.thecvf.com/content_CVPRW_2019/papers/AAMVEM/Pedersen_Detection_of_Marine_Animals_in_a_New_Underwater_Dataset_with_CVPRW_2019_paper.pdf (2019)
13. Alejandro, M., Toro, V.: Deep neural networks for marine debris detection in sonar images. Dissertation submitted to Heriot-Watt University, Edinburgh. arXiv:1905.0524 (2019)
14. Zhang, L., Zhang, Y., Zhang, Z., Shen, J., Wang, H.: Real-time water surface object detection based on improved faster R-CNN. Sensors (2019). https://doi.org/10.3390/s19163523
15. Deng, H., Ergu, D., Liu, F., Ma, B., Chai, Y.: An embeddable algorithm for automatic garbage detection based on complex marine environment. Sensors (2021). https://doi.org/10.3390/s21196391
16. Ye, A., Pang, B., Jin, Y., Cui, J.: A YOLO-based neural network with VAE for intelligent garbage detection and classification. In: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, pp. 1–7 (2020)
17. Wu, Z., Zhang, D., Shao, Y., Zhang, X., Zhang, X., Feng, Y., Cui, P.: Using YOLOv5 for garbage classification. In: 2021 4th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), pp. 35–38. IEEE (2021)
18. Arulmozhi, M., Iyer, N.G., Jeny Sophia, S., Sivakumar, P., Amutha, C., Sivamani, D.: Comparison of YOLO and Faster R-CNN on Garbage Detection. In: Optimization Techniques in Engineering: Advances and Applications, pp. 37–49 (2023)
19. Zailan, N.A., Azizan, M.M., Hasikin, K., Mohd Khairuddin, A.S., Khairuddin, U.: An automated solid waste detection using the optimized YOLO model for riverine management. Front. Public Health 10, 907280 (2022). https://doi.org/10.3389/fpubh.2022.907280
20. Cchangcs: Garbage classification. Kaggle. https://doi.org/10.34740/KAGGLE/DS/81794 (2018). Accessed 14 Mar 2022
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such
publishing agreement and applicable law.
... The fourth enhancement involves improving modules and incorporating effective components. Zailan et al [36] enhanced pyramid average pooling using Densely Connected Convolutional Networks (DenseNet) and incorporated the Mish activation function, resulting in an optimized YOLOv4-tiny model that strikes a good balance between overall accuracy and model size in the detection of floating debris. Li et al [37] incorporated a dynamic convolution module into the YOLOv7 framework to address the issues of missed and false detection due to changes in target size, thereby enhancing the network's robustness. ...
The issue of floating debris on water surfaces is becoming increasingly prominent, posing significant threats to aquatic ecosystems and human habitats. The detection of floating debris is impeded by complex backgrounds and water currents, resulting in suboptimal detection accuracy. To enhance detection effectiveness, this study presents a floating debris detection algorithm rooted in CDW-YOLOv8. Firstly, the study augments the original C2f module by incorporating the Coordinate Attention (CA) mechanism, resulting in the C2f-CA module, to boost the model’s sensitivity to target locations. Secondly, the study substitutes the standard Upsample module with the DySample module to diminish model parameters and increase flexibility. Furthermore, the study incorporates a small object detection layer to enhance the detection performance of small floating debris. Lastly, the Complete-IOU (CIOU) loss function is substituted by the Focaler-Wise-IOU v3 (Focaler-WIoUv3) loss function, which aims to minimize the impact of low-quality anchor boxes and improve regression accuracy. Experimental results demonstrate that the improved CDW-YOLOv8 algorithm has realized a comprehensive performance improvement in accuracy, recall rate, mAP@0.5, and mAP@0.5:0.95, noting increases of 2.9%, 0.6%, 2.5%, and 1.5%, respectively, relative to the original YOLOv8 algorithm. This offers a robust reference for the intelligent detection and identification of floating debris on water surfaces.
... After training, this model outperforms current models such as YOLOv1 and Fast R-CNN, with an accuracy of 69.70% and 32.1 million parameters, processing 60 frames per second (FPS). Moreover, the study affirms that using YOLO for garbage detection is an effective means of controlling garbage pollution [23]. By improving the spatial pyramid pooling with average pooling and adopting the Mish activation function, a concatenated densely connected neural network, and hyperparameter optimization, this research provided an optimized YOLOv4-tiny model to detect floating garbage. ...
Garbage problems in urban areas are becoming more serious as the population increases, and communities, including Bangkok, the capital of Thailand, are affected by pollution from rotting waste. Therefore, this research aims to apply deep learning technology, using YOLO to detect overflowing garbage bins in 1,383 images from CCTV cameras in urban areas of Bangkok, classified into 2 classes: garbage class and bin class. Four YOLO versions were compared: YOLOv5n, YOLOv6n, YOLOv7, and YOLOv8n. The comparison results showed that YOLOv5n was able to classify classes with an accuracy of 94.50%, followed by YOLOv8n at 93.80%, YOLOv6n at 71.60%, and YOLOv7 at 24.60%. The results from this research can be applied to develop a mobile or web application that integrates with CCTV cameras installed in communities to monitor garbage that is overflowing or outside the bin and to notify relevant agencies or local residents. This will allow for faster and more efficient waste management.
Due to urbanization, solid waste pollution is an increasing concern for rivers, possibly threatening human health, ecological integrity, and ecosystem services. Riverine management in urban landscapes requires best management practices since the river is a vital component in urban ecological civilization, and it is very imperative to synchronize the connection between urban development and river protection. Thus, the implementation of proper and innovative measures is vital to control garbage pollution in the rivers. A robot that cleans the waste autonomously can be a good solution to manage river pollution efficiently. Identifying and obtaining precise positions of garbage are the most crucial parts of the visual system for a cleaning robot. Computer vision has paved a way for computers to understand and interpret the surrounding objects. The development of an accurate computer vision system is a vital step toward a robotic platform since this is the front-end observation system before consequent manipulation and grasping systems. The scope of this work is to acquire visual information about floating garbage on the river, which is vital in building a robotic platform for river cleaning robots. In this paper, an automated detection system based on the improved You Only Look Once (YOLO) model is developed to detect floating garbage under various conditions, such as fluctuating illumination, complex background, and occlusion. The proposed object detection model has been shown to promote rapid convergence which improves the training time duration. In addition, the proposed object detection model has been shown to improve detection accuracy by strengthening the non-linear feature extraction process. The results showed that the proposed model achieved a mean average precision (mAP) value of 89%. Hence, the proposed model is considered feasible for identifying five classes of garbage, such as plastic bottles, aluminum cans, plastic bags, styrofoam, and plastic containers.
Efficient vehicle detection has played an important role in Intelligent Transportation Systems (ITS) in smart cities. With the development of the Convolutional Neural Network (CNN) for object detection, new applications have been designed to enable on-road vehicle detection algorithms. Therefore, this work aims to further improve the conventional CNN model for real-time detection on low-cost embedded hardware. In this study, a lightweight CNN model is proposed based on YOLOv4 Tiny to detect vehicles from the VEDAI dataset. In the proposed method, one additional scale feature map is added to make a total of three prediction boxes in the architecture. Then, the output image size of the second and third prediction boxes is upscaled in order to improve detection accuracy for small vehicles in the aerial images. The proposed model has been evaluated on an NVIDIA GeForce 940MX GPU-based computer, Google Colab (Tesla K80) and Jetson Nano. Based on the experimental results, this study has demonstrated that the proposed model achieved better mean average precision (mAP) compared to the conventional YOLOv4 Tiny and previous works.
With the growing demand for geospatial data, challenging aerial images with high spatial, spectral, and temporal resolution achieve excellent development. Currently, deep Convolutional Neural Network (CNN) structures are applied widely for object detection. Nevertheless, existing deep CNN-based models consist of complex network structures and require immense amounts of graphics processing unit (GPU) computation power with high energy consumption. Thus, achieving efficient real-time object detection for limited memory and processing capacity embedded device is a major challenge. This paper proposes a feasible and lightweight object detection model based on deep CNN where a mobile inverted bottleneck module is adopted in the backbone structure. Moreover, an enhanced spatial pyramid pooling is adopted to increase the receptive field in the network by concatenating the multi-scale local region features. The experimental results demonstrated that the proposed model achieved higher average precision and required the smallest memory storage compared to previous works. Moreover, the proposed model offers the best trade-offs in terms of detection accuracy, model size, and detection time which has excellent potential to be deployed on limited capacity embedded device.
With the continuous development of artificial intelligence, embedding object detection algorithms into autonomous underwater detectors for marine garbage cleanup has become an emerging application area. Considering the complexity of the marine environment and the low resolution of the images taken by underwater detectors, this paper proposes an improved algorithm based on Mask R-CNN, with the aim of achieving high accuracy marine garbage detection and instance segmentation. First, the idea of dilated convolution is introduced in the Feature Pyramid Network to enhance feature extraction ability for small objects. Secondly, the spatial-channel attention mechanism is used to make features learn adaptively. It can effectively focus attention on detection objects. Third, the re-scoring branch is added to improve the accuracy of instance segmentation by scoring the predicted masks based on the method of Generalized Intersection over Union. Finally, we train the proposed algorithm in this paper on the Transcan dataset, evaluating its effectiveness by various metrics and comparing it with existing algorithms. The experimental results show that compared to the baseline provided by the Transcan dataset, the algorithm in this paper improves the mAP indexes on the two tasks of garbage detection and instance segmentation by 9.6 and 5.0, respectively, which significantly improves the algorithm performance. Thus, it can be better applied in the marine environment and achieve high precision object detection and instance segmentation.
Manual harvesting of loose fruits in the oil palm plantation is both time consuming and physically laborious. Automatic harvesting system is an alternative solution for precision agriculture which requires accurate visual information of the targets. Current state-of-the-art one-stage object detection method provides excellent detection accuracy; however, it is computationally intensive and impractical for embedded system. This paper proposed an improved YOLO model to detect oil palm loose fruits from unmanned aerial vehicle images. In order to improve the robustness of the detection system, the images are augmented by brightness, rotation, and blurring to simulate the actual natural environment. The proposed improved YOLO model adopted several improvements; densely connected neural network for better feature reuse, swish activation function, multi-layer detection to enhance detection on small targets and prior box optimization to obtain accurate bounding box information. The experimental results show that the proposed model achieves outstanding average precision of 99.76% with detection time of 34.06 ms. In addition, the proposed model is also light in weight size and requires less training time which is significant in reducing the hardware costs. The results exhibit the superiority of the proposed improved YOLO model over several existing state-of-the-art detection models.
The adoption of an automated crop harvesting system based on machine vision may improve productivity and optimize operational cost. The scope of this study is to obtain visual information at the plantation, which is crucial in developing an intelligent automated crop harvesting system. This paper aims to develop an automatic detection system with high accuracy, low computational cost, and a lightweight model. Considering the advantages of YOLOv3 tiny, an optimized YOLOv3 tiny network, namely YOLO‐P, is proposed to detect and localize three objects at a palm oil plantation, namely fresh fruit bunch, grabber, and palm tree, under various environmental conditions. The proposed YOLO‐P model incorporates a lightweight backbone based on a densely connected neural network, a multi‐scale detection architecture, and optimized anchor box sizes. The experimental results demonstrated that the proposed YOLO‐P model achieved a mean average precision of 98.68% and an F1 score of 0.97. Besides, the proposed model trained faster and generated a lightweight model of 76 MB. The proposed model was also tested to identify fresh fruit bunches of various maturities with an accuracy of 98.91%. The comprehensive experimental results show that the proposed YOLO‐P model can effectively perform robust and accurate detection at the palm oil plantation.
In this paper, we consider water surface object detection in natural scenes. Generally, background subtraction and image segmentation are the classical object detection methods. The former is highly susceptible to variable scenes, so its accuracy will be greatly reduced when detecting water surface objects due to the changing of the sunlight and waves. The latter is more sensitive to the selection of object features, which will lead to poor generalization as a result, so it cannot be applied widely. Consequently, methods based on deep learning have recently been proposed. The River Chief System has been implemented in China recently, and one of the important requirements is to detect and deal with the water surface floats in a timely fashion. In response to this case, we propose a real-time water surface object detection method in this paper which is based on the Faster R-CNN. The proposed network model includes two modules and integrates low-level features with high-level features to improve detection accuracy. Moreover, we propose to set the different scales and aspect ratios of anchors by analyzing the distribution of object scales in our dataset, so our method has good robustness and high detection accuracy for multi-scale objects in complex natural scenes. We utilized the proposed method to detect the floats on the water surface via a three-day video surveillance stream of the North Canal in Beijing, and validated its performance. The experiments show that the mean average precision (MAP) of the proposed method was 83.7%, and the detection speed was 13 frames per second. Therefore, our method can be applied in complex natural scenes and mostly meets the requirements of accuracy and speed of water surface object detection online.