High-precision apple recognition and localization method based on RGB-D and improved SOLOv2 instance segmentation
ShixiTang
1,2, ZilinXia
3*, JinanGu
3, WenboWang
3,
ZedongHuang
3 and WenhaoZhang
3
1 School of Information Engineering, Yancheng Teachers University, Yancheng, China, 2 Jiangsu
Engineering Laboratory of Cyberspace Security, Suzhou, China, 3 School of Mechanical Engineering,
Jiangsu University, Zhenjiang, China
Intelligent apple-picking robots can significantly improve the efficiency of apple picking, and the realization of fast and accurate recognition and localization of apples is the prerequisite and foundation for the operation of picking robots. Existing apple recognition and localization methods primarily focus on object detection and semantic segmentation techniques. However, these methods often suffer from localization errors when facing occlusion and overlapping issues. Furthermore, the few instance segmentation methods are also inefficient and heavily dependent on detection results. Therefore, this paper proposes an apple recognition and localization method based on RGB-D and an improved SOLOv2 instance segmentation approach. To improve the efficiency of the instance segmentation network, EfficientNetV2 is employed as the feature extraction network, known for its high parameter efficiency. To enhance segmentation accuracy when apples are occluded or overlapping, a lightweight spatial attention module is proposed. This module improves the model's position sensitivity so that positional features can differentiate between overlapping objects when their semantic features are similar. To accurately determine the apple-picking points, an RGB-D-based apple localization method is introduced. Through comparative experimental analysis, the improved SOLOv2 instance segmentation method has demonstrated remarkable performance. Compared to SOLOv2, the F1 score, mAP, and mIoU on the apple instance segmentation dataset have increased by 2.4, 3.6, and 3.8%, respectively. Additionally, the model's Params and FLOPs have decreased by 1.94 M and 31 GFLOPs, respectively. A total of 60 samples were gathered for the analysis of localization errors. The findings indicate that the proposed method achieves high precision in localization, with errors in the X, Y, and Z axes ranging from 0 to 3.95 mm, 0 to 5.16 mm, and 0 to 1 mm, respectively.
KEYWORDS: apple instance segmentation, lightweight spatial attention, EfficientNetV2, improved SOLOv2, RGB-D
OPEN ACCESS

EDITED BY: Raul Avila-Sosa, Benemérita Universidad Autónoma de Puebla, Mexico

REVIEWED BY: Jieli Duan, South China Agricultural University, China; Zhenguo Zhang, Xinjiang Agricultural University, China

*CORRESPONDENCE: Zilin Xia, xiazilin@stmail.ujs.edu.cn

RECEIVED 20 March 2024; ACCEPTED 13 May 2024; PUBLISHED 06 June 2024

TYPE: Original Research

CITATION: Tang S, Xia Z, Gu J, Wang W, Huang Z and Zhang W (2024) High-precision apple recognition and localization method based on RGB-D and improved SOLOv2 instance segmentation. Front. Sustain. Food Syst. 8:1403872. doi: 10.3389/fsufs.2024.1403872

COPYRIGHT: © 2024 Tang, Xia, Gu, Wang, Huang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

1 Introduction

Currently, apple picking relies largely on manual labor, which is time-consuming and labor-intensive, resulting in high harvesting costs and low efficiency. With the rapid development of artificial intelligence and robotics, the realization of automated apple picking has become an inevitable trend (Wang et al., 2022, 2023). Achieving rapid and accurate apple identification and localization in complex orchard environments is the key to realizing
automated apple harvesting. However, the complex orchard environment is influenced by shading, overlapping, and camera angles, making fast and accurate apple identification and positioning a greater challenge.
In recent years, along with the development of machine vision, the recognition and localization of apples have been extensively researched (Xia et al., 2022, 2023; Gai et al., 2023). Specific methods can be classified into three main categories: object detection-based (Huang et al., 2017; Hu et al., 2023), semantic segmentation-based (Jia et al., 2022b), and instance segmentation-based (Wang and He, 2022a). Object detection involves identifying and localizing objects in images and marking them with bounding boxes. Jia et al. (2022a) proposed an improved FoveaBox (Kong et al., 2020) for green apple object detection. This approach utilizes EfficientNetV2-S as the feature extraction network, employs the BiFPN (Bidirectional Feature Pyramid Network) (Tan et al., 2020) for feature fusion, and uses the ATSS (Adaptive Training Sample Selection) (Zhang et al., 2020) technique to match positive and negative samples. The overall model achieves a high recall but at reduced speed. Wu et al. (2021) presented an enhanced YOLOv4 method for apple detection in complex scenes. They replaced YOLOv4's backbone feature extraction network with EfficientNet and achieved a 96.54% F1 score on their constructed dataset. Chen et al. (2021) proposed a Des-YOLOv4 detection model tailored for apples. This method introduces the DenseNet dense residual structure into YOLOv4 and employs Soft-NMS in the post-processing phase to enhance the recall rate for overlapping apples; the overall model has fewer parameters than YOLOv4. Object detection-based apple recognition and localization methods are fast, but when the apples in a detection bounding box are obscured or overlapping, acquiring their depth information is hindered, which leads to picking failure.
Semantic segmentation can assign each pixel in an image to its corresponding category, yielding more refined object segmentation results. Ahmad et al. (2018) proposed a method based on a fuzzy inference system and fuzzy c-means to segment apples of different colors during the growth process. Zou et al. (2022) introduced a color-index-based apple segmentation method that enables rapid segmentation of orchard apples, with an average segmentation time of 20 ms. While these traditional segmentation methods offer faster speeds, their robustness is compromised in complex orchard environments. Kang and Chen (2019) introduced DasNet, a deep learning-based semantic segmentation network, to segment apples and tree branches. Li et al. (2021) proposed an improved U-Net (Ronneberger et al., 2015) method for segmenting green apples. It incorporated dilated convolutions and the ASPP (Atrous Spatial Pyramid Pooling) (Chen et al., 2017) structure into U-Net, which enlarged the receptive field and enhanced segmentation accuracy. Using semantic segmentation methods for apple segmentation can provide more detailed contours. However, in cases of overlapping apples, distinguishing between them becomes challenging with semantic segmentation, which in turn impacts the acquisition of depth information for each apple.
Instance segmentation enables the classification of each pixel's category in an image while distinguishing different instances of the same category. Kang and Chen (2020) proposed the DaSNet-V2 method for apple instance segmentation, using ResNet101 and ResNet18 as backbone feature extraction networks and achieving segmentation accuracies of 87.3 and 86.6%, respectively. Wang and He (2022b) introduced an improved Mask R-CNN method for apple instance segmentation. By incorporating attention mechanisms in the feature extraction module, this approach enhances apple segmentation accuracy, but at a slower speed. Jia et al. (2020) presented an enhanced Mask R-CNN method for apple instance segmentation. They combined the DenseNet dense connection structure into the ResNet backbone feature extraction network, thus improving segmentation accuracy and enabling recognition and segmentation of overlapping apples. Jia et al. (2021) proposed an anchor-free instance segmentation method tailored for green apples. This method adds an instance branch to FoveaBox, conducting apple detection before segmentation. Nevertheless, it exhibits subpar performance in segmenting apple edge contours. Instance segmentation-based methods can achieve apple recognition, precise localization, and mask generation. However, the majority of current research focuses on detection-based instance segmentation approaches, in which the instance branch often lacks consideration of global context, resulting in suboptimal performance in edge segmentation and slower segmentation speeds.
Acquiring depth information for apples is a critical factor in achieving accurate picking. Specific means of obtaining this information include stereo cameras, structured-light cameras, TOF (time-of-flight) cameras, and laser radar (LiDAR). Tian et al. (2019) proposed a fruit localization technique based on the Kinect V2, utilizing depth images to determine the apple's center and combining RGB data to estimate the apple's radius. However, in cases of overlap and occlusion, the depth image may not fully represent the apple's true depth information, leading to ambiguous localization. Kang et al. (2020) implemented apple localization using an Intel D435 camera. They employed RGB images for fruit detection and instance segmentation, combining depth information to fit the apple's point cloud and thus localize it. However, this method suffers from lower efficiency. Gené-Mola et al. (2019) utilized laser radar and object detection for apple localization, achieving a success rate of 87.5%. Kang et al. (2022) fused radar with the camera as input and then used instance segmentation to achieve apple localization, but this method incurs higher costs.
So far, the recognition and localization of apples have predominantly relied on object detection and semantic segmentation methods. However, these methods often lead to positioning errors when facing challenges such as occlusion and overlapping. While a few studies have explored detection-based instance segmentation methods for apple recognition and localization, these methods usually come with high parameter and computational complexity, are susceptible to the influence of detection results, and lack consideration of global information. SOLOv2 (Wang et al., 2020b) is a one-stage instance segmentation method that introduces an efficient instance mask representation scheme built on the foundation of SOLO (Wang et al., 2020a). It improves the efficiency of the overall method by decoupling instance mask generation into mask kernel and mask feature learning and by using convolution operations to generate instance masks. Compared to two-stage instance segmentation models such as Mask R-CNN, SOLOv2 eliminates the need for anchor boxes, does not rely on detection results, occupies less memory, and is more suitable for practical engineering applications. Therefore, this paper proposes an apple recognition and localization approach based on RGB-D and an improved SOLOv2 instance segmentation method. This method eliminates reliance on detection results and can achieve accurate apple positioning even in occlusion and overlapping scenarios. Specifically, the main contributions of this paper are as follows:
1. Introducing an improved SOLOv2 instance segmentation method that achieves high-precision apple instance segmentation and is independent of detection results.
2. Introducing a lightweight spatial attention mechanism into the mask prediction head of SOLOv2 to enhance the segmentation accuracy for overlapping apples.
3. Introducing an RGB-D-based apple localization method that achieves accurate positioning in scenarios with occlusion and overlapping, thereby enhancing the success rate of apple picking.
The remainder of this paper is organized as follows: Section 2 introduces the improved SOLOv2 instance segmentation method and the RGB-D-based apple-picking point localization method. Section 3 conducts comparative experiments and analyzes the experimental results. Section 4 summarizes the paper and outlines future research directions.
2 The proposed method

2.1 Apple instance segmentation method based on improved SOLOv2
In this paper, wefurther enhance the segmentation accuracy based
on SOLOv2 without introducing excessive parameters. Specically,
weintegrate the proposed lightweight spatial attention module into the
mask kernel and mask feature branches and adopt a more ecient
feature extraction network, EcientNetV2. Aer these improvements,
the improved SOLOv2 signicantly boosts instance segmentation
accuracy while maintaining eciency and avoiding the introduction
of redundant parameters. Figure1 illustrates the enhanced SOLOv2
instance segmentation method, and detailed descriptions of each
module will beelaborated in the subsequent sections.
2.1.1 Backbone feature extraction network

The feature extraction network, as a crucial component of the instance segmentation method, significantly influences the performance of the whole model. In this study, EfficientNetV2 (Tan and Le, 2021), which builds upon the improvements made in EfficientNetV1 (Tan and Le, 2019), was adopted as the backbone feature extraction network. The network's optimal width, depth, and other design parameters were determined using NAS (Neural Architecture Search) techniques. To address the slow training speed of EfficientNetV1, the shallow MBConv modules were substituted with Fused-MBConv modules; the specific MBConv and Fused-MBConv modules are illustrated in Figure 1.

As depicted in Figure 1, the MBConv module employs a 1 × 1 convolutional layer to increase feature dimensionality, followed by a 3 × 3 depthwise separable convolutional layer for feature extraction. In contrast, the Fused-MBConv module directly utilizes a 3 × 3 convolutional layer to perform feature extraction and dimensionality expansion, improving feature extraction speed. EfficientNetV2 demonstrates exceptional accuracy on the ImageNet dataset while enhancing training speed and parameter efficiency. Compared to ResNet50, EfficientNetV2 exhibits higher efficiency, achieving greater precision with equivalent parameters and computation. Additionally, EfficientNetV2 is well suited to deployment on mobile and embedded devices for tasks such as apple harvesting.
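To make the backbone swap concrete, the sketch below shows one way to tap multi-scale features (C2-C5) from a torchvision EfficientNetV2-S for use with an FPN. It is a minimal illustration under stated assumptions, not the authors' implementation: the small variant and the exact node names follow torchvision's layout and are assumptions.

```python
import torch
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# ImageNet-1K pre-trained EfficientNetV2-S as the backbone.
backbone = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)

# Tap one node per stride (4, 8, 16, 32) to obtain C2-C5 for the FPN.
# These node names follow torchvision's efficientnet_v2_s layout and are
# assumptions, not the authors' published configuration.
return_nodes = {
    "features.2": "C2",  # stride 4
    "features.3": "C3",  # stride 8
    "features.5": "C4",  # stride 16
    "features.6": "C5",  # stride 32
}
extractor = create_feature_extractor(backbone, return_nodes=return_nodes)

feats = extractor(torch.randn(1, 3, 512, 512))
for name, f in feats.items():
    print(name, tuple(f.shape))
```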
2.1.2 Instance mask generation module

SOLOv2 decouples instance mask generation into mask kernels and mask features, and then convolves the mask kernels with the mask features to obtain the final instance masks. The parameters of the mask kernels and the mask features are generated separately by the mask kernel branch and the mask feature branch.

As depicted in Figure 1, the first step is to use the FPN (Lin et al., 2017) to perform multi-scale feature fusion, aiming to achieve multi-scale segmentation. This process is detailed in Eq. 1:
$$P_2, P_3, P_4, P_5, P_6 = \mathrm{FPN}\left(C_2, C_3, C_4, C_5\right) \tag{1}$$

where $C_2, C_3, C_4, C_5$ are the effective feature layers output by EfficientNetV2, and $P_2, P_3, P_4, P_5, P_6$ are the feature layers output after feature fusion.
In the mask kernel branch, each feature layer $P_i$ is resized to a grid of size $S_i \times S_i$; if the center of a ground truth (GT) instance falls into a grid cell, that cell is responsible for predicting the instance. Specifically, as shown in Eq. 2:

$$K_i = \mathrm{KernelBranch}\left(P_i\right), \quad i = 2, 3, 4, 5, 6 \tag{2}$$

where $K_i$ is the mask kernel parameter tensor generated by the corresponding feature layer, with size $S_i \times S_i \times C$.
In the mask feature branch, the FPN output features are used to create mask features shared across different levels. This allows different levels to share the same mask features, reducing parameters and improving efficiency. The process is detailed in Eq. 3:

$$F = \mathrm{FeatureBranch}\left(P_2, P_3, P_4, P_5\right) \tag{3}$$

where $F$ denotes the shared mask features, with size $H \times W \times C$; $H$ and $W$ are one-fourth of the input height and width, respectively.
Finally, the mask kernel parameters corresponding to the grids containing objects, denoted $K_i^{pos}$, are selected. These are the grids into which the center of a GT instance falls during training, and the grids whose predicted classification score exceeds the score threshold during inference. The selected mask kernel parameters $K_i^{pos}$ are convolved with the shared mask features $F$ to generate instance masks, as shown in Eq. 4:

$$M_{i,j} = K_i^{pos} \ast F \tag{4}$$

where $K_i^{pos}$ is the mask kernel parameter tensor obtained by filtering $K_i$, with size $n \times 1 \times 1 \times C$, and $M_{i,j}$ is the instance prediction mask generated at the corresponding location. The overall instance mask generation module is illustrated in Figure 2.

FIGURE 2: Instance mask generation module structure diagram.
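The dynamic convolution in Eq. 4 reduces to a 1 × 1 convolution of the shared mask features with the selected kernels. The sketch below illustrates this with `torch.nn.functional.conv2d`; the tensor shapes and the sigmoid on the output are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def generate_instance_masks(mask_feats: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Convolve selected mask kernels with the shared mask features (Eq. 4).

    mask_feats: (E, H, W) shared mask feature map F
    kernels:    (n, E) dynamic kernel parameters K_i^pos for n positive grids
    returns:    (n, H, W) instance mask probabilities
    """
    n, e = kernels.shape
    weight = kernels.view(n, e, 1, 1)                  # n dynamic 1x1 filters
    masks = F.conv2d(mask_feats.unsqueeze(0), weight)  # (1, n, H, W)
    return masks.squeeze(0).sigmoid()

# Toy shapes: C = 256 mask-feature channels, H = W = input size / 4.
mask_feats = torch.randn(256, 128, 128)
kernels = torch.randn(5, 256)        # five positive grid cells
print(generate_instance_masks(mask_feats, kernels).shape)  # (5, 128, 128)
```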
2.1.3 Improved mask feature branch

In SOLOv2, the mask feature branch is composed solely of upsampling and convolution operations. However, instances with similar semantic features heavily rely on positional features for differentiation, and relying solely on convolution-generated positional information is insufficient. Therefore, an Att-Block is used as a replacement for the convolution operation to construct mask features containing comprehensive positional information without introducing excessive parameters. In the Att-Block, the convolution is replaced with a depthwise separable convolution, and a lightweight spatial attention mechanism is introduced to capture positional features between instances. The details are illustrated in Figure 3.

The lightweight spatial attention module works in two steps: (1) first, obtain the spatial position relationships in the vertical direction by applying a K × 1 convolutional kernel to the feature map; the computational complexity of this step is $H^2W$. (2) Then, obtain the spatial position relationships in the horizontal direction by applying a 1 × K convolutional kernel to the feature map generated in step (1); the computational complexity of this step is $HW^2$. Finally, a Sigmoid generates the spatial attention map. The overall computational complexity is $H^2W + HW^2$; compared with directly using a fully connected layer to compute the spatial attention map of the feature map, the lightweight attention module has lower computational complexity when the feature map width W and height H are large. This makes it particularly suitable for capturing feature map spatial relationships in lightweight networks.

FIGURE 3: The structure diagram of the improved mask feature branch.
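A possible PyTorch rendering of this two-step strip attention is sketched below. The kernel size K, the channel arrangement, and the residual multiplication are assumptions for illustration; the paper specifies only the K × 1 followed by 1 × K decomposition and the Sigmoid.

```python
import torch
import torch.nn as nn

class LightweightSpatialAttention(nn.Module):
    """Two-step strip attention: a Kx1 convolution captures vertical position
    relations, a 1xK convolution then captures horizontal ones, and a Sigmoid
    turns the result into a spatial attention map used to re-weight features."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        pad = k // 2
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0))
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, pad))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.horizontal(self.vertical(x)))
        return x * attn  # emphasize positions that separate nearby instances

out = LightweightSpatialAttention(256)(torch.randn(2, 256, 64, 64))
print(out.shape)  # torch.Size([2, 256, 64, 64])
```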
2.1.4 Improved mask kernel branch

With the aim of enhancing the sensitivity of the learned mask kernel parameters to positional information and improving instance segmentation accuracy, this paper also modifies the mask kernel branch. The convolution operations in the mask kernel branch are replaced with Att-Block modules to capture feature spatial relationships. This alteration enables the learned mask kernel parameters to encompass richer positional information, as depicted in Figure 4. It is important to note that the Att-Block used in the improved mask kernel branch employs regular convolutional structures rather than depthwise separable convolutions. This choice ensures that the information encoded within the mask kernel is more comprehensive.

FIGURE 4: The structure diagram of the improved mask kernel branch.
2.1.5 Label assignment method and loss calculation

SOLOv2 differs from detection-based instance segmentation methods in that it does not assign labels by IoU thresholding. Instead, it resizes different feature layers into S × S grids of different sizes, and each element of a grid is responsible for predicting one instance. Given an image, let $GT$ represent the ground-truth labels, $GT_{area}$ the area of a label, $GT_{mask}$ the mask of a label, and $GT_{label}$ the category of a label. First, the ground-truth instances are categorized into different levels based on their area, as shown in Eq. 5:

$$lb_i \le GT_{area} \le up_i \tag{5}$$

where $lb_i$ and $up_i$ represent the lower and upper bounds of the object scale predicted by the current feature layer; instances satisfying this condition are considered as $GT_i$ for the current layer. Subsequently, $GT_i$ is scaled around its center, and the grid cells within the scaled $GT_i$ are selected as positive samples, as shown in Eq. 6:

$$pos_{index}^{i} = GT_i \ast pos_{scale} \tag{6}$$

where $pos_{index}^{i}$ represents the indices of the grids within the scaled $GT_i$, which are the indices of the positive samples, and $pos_{scale}$ is the scaling factor. Then, the mask kernel parameters corresponding to the positive samples are selected using these indices and denoted $K_i^{pos}$, as shown in Eq. 7:

$$K_i^{pos} = K_i\left[pos_{index}^{i}\right] \tag{7}$$

The mask kernel parameters corresponding to positive samples from all layers are collected and denoted $K^{pos}$. Convolution is then applied to obtain the predicted masks, as shown in Eq. 8:

$$M = K^{pos} \ast F \tag{8}$$

where $F$ is the mask feature generated by the mask feature branch and $M$ is the prediction mask. Finally, the mask and classification losses are computed as in Eqs. 9 and 10:

$$L_{mask} = \mathrm{DiceLoss}\left(M, Target_{mask}\right) \tag{9}$$

$$L_{cls} = \mathrm{FocalLoss}\left(P, Target_{label}\right) \tag{10}$$

where $L_{mask}$ is the mask loss, specifically the Dice loss; $Target_{mask}$ assigns the positive-sample indices the corresponding $GT_{mask}$, and negative samples do not participate in the mask loss. $L_{cls}$ is the classification loss, specifically the focal loss, where $P$ is the classification prediction; $Target_{label}$ assigns positive samples the corresponding $GT_{label}$ and negative samples 0, and both positive and negative samples contribute to the classification loss. The overall loss function is formulated in Eq. 11:

$$L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{mask} \tag{11}$$

where $L_{total}$ is the total loss, and $\lambda_1$ and $\lambda_2$ are the weights of the classification loss and mask loss, set to 1.0 and 3.0 in this paper, respectively. The overall training label assignment and loss calculation can be seen in Algorithm 1.

ALGORITHM 1: The label assignment method and loss calculation in SOLOv2.
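As a concrete reading of Eqs. 9-11, the sketch below combines a Dice mask loss and a focal classification loss with the weights λ1 = 1.0 and λ2 = 3.0 from the paper. The focal-loss α/γ values and all tensor shapes are assumptions (standard defaults), not values reported by the authors.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    # pred/target: (n, H, W) predicted mask probabilities and binary GT masks.
    inter = (pred * target).sum(dim=(1, 2))
    denom = (pred ** 2).sum(dim=(1, 2)) + (target ** 2).sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def focal_loss(logits, target, alpha: float = 0.25, gamma: float = 2.0):
    # logits/target: (N, classes) classification scores and one-hot labels.
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

lambda1, lambda2 = 1.0, 3.0  # classification / mask loss weights (Eq. 11)
pred_masks = torch.rand(5, 128, 128)                  # positive samples only
gt_masks = (torch.rand(5, 128, 128) > 0.5).float()
logits = torch.randn(40, 1)                           # one apple class
labels = (torch.rand(40, 1) > 0.9).float()            # positives and negatives
total = lambda1 * focal_loss(logits, labels) + lambda2 * dice_loss(pred_masks, gt_masks)
print(float(total))
```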
2.2 RGB-D-based apple localization method

To achieve precise apple localization, especially in scenarios with occlusion and overlapping, this paper proposes an RGB-D-based apple localization method. The method begins by employing the improved SOLOv2 apple instance segmentation method to obtain masks for the apples in an image. These masks are then combined with the depth maps generated by an RGB-D camera to accurately locate the points where apples can be picked. The overall workflow is depicted in Figure 5 and proceeds in the following steps (an illustrative code sketch of Steps 2-5 follows the list).

Step 1: Instance segmentation. Perform segmentation on the RGB image to obtain apple masks.

Step 2: Finding the minimum enclosing circle of the mask. Use OpenCV to compute the minimum enclosing circle of each segmented apple mask. This step ensures a better fit of the mask to the apple and avoids including excessive background information.

Step 3: Calculating the IoU between the mask and its minimum enclosing circle. To ensure that the pixel information of the apple is as complete as possible, and thereby raise the success rate of picking, compute this IoU to filter out the apples that are viable for picking in the current view. A higher IoU indicates fewer obscured parts of the apple; this paper adopts an IoU threshold of 0.5.

Step 4: Confirming that the central point of the minimum enclosing circle belongs to the apple. The center point of the minimum enclosing circle of the apple mask is selected as the picking point. To do so, verify that the pixel at the circle's center corresponds to the apple; if leaves or branches obstruct this point, picking is not viable from the current viewpoint.

Step 5: Calculating the picking point coordinates. If Steps 3 and 4 are satisfied, the viewpoint allows picking. The pixel coordinates, together with the corresponding depth information and the camera intrinsics, yield the three-dimensional coordinates (x, y, z) of the picking point in the camera coordinate system, as shown in Eqs. 12 and 13:

$$x = z \cdot \frac{u - u_0}{f_x} \tag{12}$$

$$y = z \cdot \frac{v - v_0}{f_y} \tag{13}$$

where $(u, v)$ are the pixel coordinates of the center of the minimum enclosing circle in the X and Y directions, $z$ is the depth at the circle center, and $u_0$, $v_0$, $f_x$, and $f_y$ are the camera intrinsics.

FIGURE 5: Flowchart of the RGB-D-based apple localization method.
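The following sketch strings Steps 2-5 together with OpenCV: minimum enclosing circle, mask-vs-circle IoU filtering at the 0.5 threshold, a center-pixel membership check, and back-projection via Eqs. 12 and 13. The function name, array conventions, and the millimetre depth unit are assumptions for illustration, not the authors' code.

```python
import cv2
import numpy as np

def picking_point(mask, depth, fx, fy, u0, v0, iou_thresh=0.5):
    """mask: (H, W) uint8 binary apple mask; depth: aligned depth map in mm.
    Returns the (x, y, z) picking point in the camera frame, or None."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    (u, v), r = cv2.minEnclosingCircle(max(contours, key=cv2.contourArea))

    # Step 3: IoU between the mask and its minimum enclosing circle.
    circle = np.zeros_like(mask)
    cv2.circle(circle, (int(u), int(v)), int(r), 1, -1)
    inter = np.logical_and(mask > 0, circle > 0).sum()
    union = np.logical_or(mask > 0, circle > 0).sum()
    if union == 0 or inter / union < iou_thresh:
        return None  # too occluded to pick from this viewpoint

    # Step 4: the circle center must itself lie on the apple mask.
    if mask[int(v), int(u)] == 0:
        return None  # center obscured by a leaf or branch

    # Step 5: back-project the center pixel with its depth (Eqs. 12, 13).
    z = float(depth[int(v), int(u)])
    x = z * (u - u0) / fx
    y = z * (v - v0) / fy
    return x, y, z
```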
3 Experiments

3.1 Dataset

The apple instance segmentation dataset constructed in this paper consists of two parts. One part is a public dataset that includes 3,925 apple images annotated with instance labels (Gené-Mola et al., 2023). This dataset covers two growth stages of apples: approximately 70% of the images are from the growth stage in which apples are primarily green, as shown in Figure 6A, and the remaining approximately 30% are from the ripening stage, in which apples are mostly light red, as shown in Figure 6B.

The other part of the dataset was collected from orchards and consists of 300 apple images annotated with instance labels using the Labelme tool. These images were captured during the ripe stage of apples, characterized by their red color, as illustrated in Figure 6C.

Lastly, an 8:2 data split ratio was employed to ensure the effective utilization of the training data: 80% of the data were used for training and validation, totaling 3,400 images, while the remaining 20% were reserved for testing, comprising 852 images. This division aims to avoid overfitting, thereby improving the generalization ability and robustness of the model.

FIGURE 6: Samples of the apple instance segmentation dataset.
3.2 Experimental setting

The hardware setup for the experiments in this study comprised an E5-2678 v3 CPU, 32 GB of RAM, and an NVIDIA 3090 GPU with 24 GB of VRAM. The operating system was Ubuntu 18.04, with Python 3.8, and the deep learning framework was PyTorch. The training configuration comprised 40 epochs with a batch size of 4. The SGD optimizer was used with an initial learning rate of 0.01, and learning rate adjustments were applied using the StepLR strategy, in which the learning rate was multiplied by 0.1 at the 16th and 32nd epochs. To accelerate model convergence, the backbone weights of all models were initialized with weights pre-trained on ImageNet-1K. The specific experimental settings are shown in Table 1.

TABLE 1: Experimental parameter settings.

| Hyperparameter | Setting |
| --- | --- |
| Batch size | 4 |
| Epochs | 40 |
| Learning rate (epochs 1-16) | 0.01 |
| Learning rate (epochs 16-32) | 0.01 × 0.1 |
| Learning rate (epochs 32-40) | 0.01 × 0.01 |
| Optimizer | SGD |
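The schedule in Table 1 maps directly onto PyTorch's MultiStepLR; a minimal sketch follows. The momentum value and the stand-in model are assumptions, since the paper lists only the optimizer, initial rate, and decay epochs.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 8, 3)  # stand-in for the improved SOLOv2
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[16, 32], gamma=0.1)

for epoch in range(40):
    # ... train one epoch with batch size 4, calling optimizer.step() ...
    scheduler.step()  # lr: 0.01 -> 0.001 (epoch 16) -> 0.0001 (epoch 32)
```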
3.3 Evaluation metrics

To evaluate the performance of the proposed method, AP (average precision), mAP (mean average precision), mIoU (mean intersection over union), and the F1 score are used to measure accuracy, while Params (parameters), FLOPs (floating-point operations), and FPS (frames per second) are used to measure model complexity. The calculation formulas are shown below:

$$Precision = \frac{TP}{all\ detections} \tag{14}$$

$$Recall = \frac{TP}{all\ GT\ boxes} \tag{15}$$

$$AP = \int_0^1 p(r)\, dr \tag{16}$$

$$mAP = \frac{\sum AP}{N} \tag{17}$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{18}$$

$$mIoU = \frac{1}{N} \sum \frac{TP}{TP + FP + FN} \tag{19}$$

where $TP$ denotes the number of correctly detected targets among all detected targets, $FP$ denotes the number of incorrectly detected targets among all detected targets, $FN$ indicates the number of incorrectly classified negative samples, $p(r)$ stands for the Precision-Recall curve, and $N$ represents the number of categories in the dataset.

FLOPs and Params are critical metrics for evaluating model complexity and speed. FLOPs measure the amount of computation, and Params indicate the number of learnable parameters in the network. Larger computational and parameter counts typically mean higher model complexity and slower detection speed. Therefore, a model intended for edge devices, such as apple-picking robots in orchards, should have fewer parameters and a lower computational burden.
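For reference, the sketch below computes a mask-level IoU and an F1 score from TP/FP/FN counts as defined in Eqs. 14-19; the toy masks and counts are illustrative assumptions.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool);   gt[15:45, 15:45] = True
print(mask_iou(pred, gt), f1_score(tp=58, fp=6, fn=9))
```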
3.4 Experimental results of the improved method

The improved SOLOv2 is trained on the constructed apple instance segmentation dataset, and the model is evaluated every epoch. The training loss curve and the test set mAP curve are shown in Figure 7, where red represents the mAP curve and green represents the loss curve.

As shown in Figure 7, the model's loss value gradually decreases and stabilizes as training progresses, while the mAP metric steadily increases, indicating that the model is progressively converging. Selecting the weights from the last epoch as the final result, the mAP on the test set of the apple instance segmentation dataset reaches 90.1%, demonstrating that the proposed method achieves high precision and recall in apple instance segmentation tasks and that the model's overall performance is excellent.

FIGURE 7: Training loss curve and mAP curve of the improved SOLOv2 model.
3.5 Comparative experiments with other instance segmentation methods

To verify the effectiveness and advancement of the proposed method, it is compared with other mainstream instance segmentation methods: the original SOLOv2 method before improvement, the one-stage instance segmentation method Yolact (Bolya et al., 2019), and the two-stage instance segmentation methods Mask R-CNN (He et al., 2020) and MS-RCNN (Huang et al., 2019). The mAP, mIoU, and F1 scores of the various segmentation models are depicted in Figure 8. It can be observed that, compared with the other segmentation models, the improved SOLOv2 achieves the highest scores.

FIGURE 8: Comparison of F1 score, mIoU, and mAP of different segmentation models.

TABLE 2: Comparative experimental results of mIoU, mAP, F1 score, Params, FLOPs, and FPS for different segmentation models.

| Methods | F1 (%) | mAP (%) | mIoU (%) | FLOPs (GFLOPs) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- | --- |
| Mask R-CNN | 88.3 | 87.8 | 82.1 | 186 | 43.97 | 28.2 |
| MS-RCNN | 87.9 | 88.0 | 80.5 | 225 | 60.23 | 26.7 |
| Yolact | 86.2 | 85.7 | 75.3 | 61.427 | 34.73 | 51.4 |
| SOLOv2 | 86.1 | 86.5 | 79.4 | 178 | 46.23 | 30.2 |
| Improved SOLOv2 | 88.5 | 90.1 | 83.2 | 147 | 44.29 | 29.5 |

According to the results in Table 2, the improved SOLOv2 instance segmentation model performs best in the F1 score, mIoU, and mAP metrics, reaching 88.5, 83.2, and 90.1%, respectively. Compared to the original method, these three metrics improved by 2.4, 3.8, and 3.6%, respectively, highlighting the effectiveness of the improved method. Compared with the two-stage models Mask R-CNN and MS-RCNN, the improved SOLOv2 model improved the F1 score by 0.2 and 0.6%, mIoU by 1.1 and 2.7%, and mAP by 2.3 and 2.1%, respectively. Compared to the one-stage model Yolact, the improved SOLOv2 model significantly improved all accuracy metrics, including a 7.9% improvement in mIoU and improvements of 2.3 and 4.4% in F1 score and mAP, respectively. These results highlight the superior precision and recall achieved by the proposed method, resulting in more effective instance segmentation.

Furthermore, the improved SOLOv2 apple instance segmentation method is also competitive in Params, FLOPs, and FPS. Compared to the original method, it reduces Params by 1.94 M and FLOPs by 31 GFLOPs while keeping the detection speed almost unchanged, with a slight decrease of 0.7 frames per second. Compared to Mask R-CNN, Params remain similar, but FLOPs decrease by 39 GFLOPs and FPS increases by 1.3. Compared to MS-RCNN, Params and FLOPs are significantly reduced, by 15.94 M and 78 GFLOPs, respectively, with FPS increasing by 2.8. Although Yolact performs best in the speed-related metrics, the proposed method significantly improves segmentation accuracy. Overall, the proposed method strikes a balance between model accuracy and complexity, performing excellently in apple instance segmentation tasks.
Figure9 displays a comparison of Precision-Recall (P-R) curves
for each method within the apple category. e red curve represents
the proposed enhanced SOLOv2 instance segmentation method.
Notably, the red curve encompasses the largest area, and even at high
recall rates, it sustains a remarkable level of accuracy. ese ndings
underscore the enhanced method’s ability to attain superior precision
FIGURE7
Training loss curve and mAP curve of the improved SOLOv2 model.
TABLE1 Experimental parameter settings.
Hyperparameters Setting
Batch size 4
Epoch 40
Learning rate Epoch 1–16 0.01
Epoch 16–32 0.01*0.1
Epoch 32–40 0.01*0.01
Optimizer SGD
Tang et al. 10.3389/fsufs.2024.1403872
Frontiers in Sustainable Food Systems 10 frontiersin.org
and recall, showcasing improved stability and performance when
contrasted with other methods.
Figure10 illustrates a comparison of segmentation results between
the enhanced SOLOv2 and other methods on the test set of the apple
instance segmentation dataset. Notably, the improved SOLOv2
maintains accurate segmentation even in scenarios where apples are
closely spaced. In Figure10C, SOLOv2 exhibits segmentation errors
when distinguishing overlapping objects, failing to separate the two
instances. Moreover, in Figure 10D, MaskRCNN experiences
segmentation omission issues with overlapping objects. However,
Figure10B illustrates that these issues were substantially addressed
following the improvements. e improved model can accurately
segment and dierentiate overlapping instances. is further
underscores the eectiveness of the proposed lightweight spatial
attention module, which excels at distinguishing objects based on
their spatial characteristics when semantic features pose challenges
in dierentiation.
3.6 Ablation study

To further validate the impact of the improvements on model performance, this section conducts ablation experiments to assess the effectiveness of both the backbone feature extraction network and the lightweight attention module. First, we replace the original ResNet50 in the SOLOv2 backbone with EfficientNetV2 while keeping all other aspects unchanged, to evaluate how the improved backbone feature extraction network influences model performance. Subsequently, we conduct experiments that introduce the proposed lightweight attention module into the mask feature branch alone, into the mask kernel branch alone, and into both branches simultaneously, to assess the impact of the proposed lightweight attention module. The results of the ablation experiments are shown in Table 3.

As shown in Table 3, changing the backbone feature network to EfficientNetV2 results in a 0.5% increase in the F1 score and a 0.2% increase in mAP; additionally, EfficientNetV2's parameter-efficient design enhances the computational efficiency of the model. Performance also improves when the lightweight spatial attention module is introduced separately into the mask feature branch and the mask kernel branch. Specifically, adding the attention module to the mask feature branch increases mAP by 1%, while incorporating it into the mask kernel branch yields a 1.2% improvement in the F1 score and a 2.3% improvement in mAP. Adding the attention module to both branches simultaneously yields even more significant effects, with the F1 score improving by 2.4% and mAP by 3.4%. This clearly demonstrates that the proposed lightweight spatial attention module significantly enhances the precision of apple instance segmentation.

TABLE 3: Ablation experiment results (baseline: SOLOv2 with ResNet50).

| EfficientNetV2 | Att-Block (mask feature branch) | Att-Block (mask kernel branch) | F1 (%) | mAP (%) |
| --- | --- | --- | --- | --- |
| × | × | × | 86.1 | 86.5 |
| ✓ | × | × | 86.6 | 86.7 |
| ✓ | ✓ | × | 86.6 | 87.7 |
| ✓ | × | ✓ | 87.8 | 89.0 |
| ✓ | ✓ | ✓ | 88.5 | 90.1 |
3.7 Positioning error analysis

To validate the localization accuracy of the proposed RGB-D-based apple localization method, 20 sets of RGB images and their corresponding depth maps, covering about 60 apples in total, were captured using the RealSense L515 depth camera. The true picking point of an apple is defined as the three-dimensional camera coordinates $(x, y, z)$ obtained by combining the pixel coordinates of the manually annotated center of the apple's bounding rectangle, the camera intrinsic parameters, and the corresponding depth information. Subsequently, the improved SOLOv2 instance segmentation method and the depth-based apple localization method are used to derive the predicted three-dimensional coordinates $(\hat{x}, \hat{y}, \hat{z})$ of the apple's estimated picking point. Finally, the error between the predicted and true picking points is calculated to assess the positioning accuracy. Table 4 presents some true picking points, predicted picking points, and their absolute errors. Figure 11 shows box plots of the positioning errors in the X, Y, and Z directions for the approximately 60 apples.

TABLE 4: The positioning error of some picking points (all values in mm).

| $x$ | $y$ | $z$ | $\hat{x}$ | $\hat{y}$ | $\hat{z}$ | $\|x-\hat{x}\|$ | $\|y-\hat{y}\|$ | $\|z-\hat{z}\|$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20.04 | 47.83 | 733.75 | 20.04 | 47.83 | 733.75 | 0 | 0 | 0 |
| 43.68 | 53.3 | 801 | 43.66 | 53.3 | 801 | 0.02 | 0 | 0 |
| 109.13 | 12.59 | 892 | 109.51 | 12.58 | 892 | 0.38 | 0.01 | 0 |
| 87.21 | 76.57 | 823.75 | 89.17 | 75.05 | 823.75 | 1.96 | 1.52 | 0 |
| 132.2 | 104.1 | 923.5 | 130.1 | 106.2 | 923.8 | 2.1 | 2.1 | 0.3 |
| 113.29 | 149.57 | 791 | 114.77 | 148.24 | 791 | 1.48 | 1.33 | 0 |
| 279.69 | 75.82 | 653.75 | 281.47 | 74.4 | 653.75 | 1.78 | 1.42 | 0 |
| 132.23 | 104.11 | 923.5 | 130.12 | 106.19 | 923.75 | 2.11 | 2.08 | 0.25 |

FIGURE 11: X, Y, Z direction positioning errors.
Figure 11 displays the median errors represented by red line
segments. e median positioning errors in the X and Y directions are
less than 1.5 mm. Furthermore, the median positioning error in the Z
direction approaches zero, with a maximum Z-direction positioning
error of approximately 1 mm. ese observations demonstrate that the
proposed RGB-D-based apple-picking point localization method
attains remarkable precision, fullling practical picking needs.
Figure12 illustrates the process of apple-picking point localization.
Figure12A shows the original image, while Figure12B displays the
instance segmentation result. Figure12C shows the pickable apples
aer IoU ltering and conrmation of the depth information of the
center point, where the blue circles indicate the pickable apples and
the red circles indicate the non-pickable apples. Figure12D presents
the localization results of picking points in the camera coordinate
system, obtained by combining depth information and camera
intrinsic parameters with coordinates measured in meters. It can
beobserved from the gure that the proposed RGB-D-based picking
point localization method eectively achieves accurate apple
localization. Furthermore, when the depth information at the center
of the bounding circle of the apple segmentation mask does not
correspond to the apple category, the localization method can provide
correct feedback.
4 Conclusion

The orchard environment is complex, and detection- and semantic segmentation-based methods exhibit lower accuracy in recognizing and localizing overlapping or occluded apples, while detection-based instance segmentation methods such as Mask R-CNN rely heavily on detection results and do not consider global features. Therefore, this study introduces a high-precision method based on RGB-D data and an improved SOLOv2 instance segmentation method for orchard apple recognition and picking point localization. This method does not rely on detection results, performs well in the face of occlusion, and can accurately locate the apple-picking point. The specific conclusions of this research are as follows:

(1) An improved SOLOv2 high-precision apple instance segmentation method is introduced. To enhance the efficiency of the instance segmentation network, EfficientNetV2, which has a highly parameter-efficient design, is adopted as the backbone feature extraction network. For scenarios involving overlapping or occluded apples, whose semantic features are quite similar, we introduce a lightweight spatial attention module to improve segmentation accuracy. This module increases position sensitivity, allowing objects to be distinguished by positional features even when their semantic features are similar. Through comparative experimental analysis, the improved SOLOv2 instance segmentation method performs exceptionally well, achieving the highest F1 score and mAP on the apple instance segmentation dataset, 88.5 and 90.1%, respectively. Furthermore, compared to the original version, the model's parameter count and computational load decreased by 1.94 M and 31 GFLOPs, respectively.

(2) To achieve precise apple-picking point localization, an RGB-D-based apple localization method is proposed. First, the pickable apples are filtered by the IoU between each mask and its minimum enclosing circle, and it is then determined whether the center point of that circle belongs to the apple category. Finally, the 3D coordinates of the picking point are obtained from the depth information at the center point and the camera's intrinsic parameters. Experimental verification on the 60 collected samples indicates that the median
TABLE4 The positioning error of some picking points, in which the data unit is mm.
x
y
z
x
y
z
xx
yy
zz
20.04 47.83 733.75 20.04 47.83 733.75 0 0 0
43.68 53.3 801 43.66 53.3 801 0.02 0 0
109.13 12.59 892 109.51 12.58 801 0.38 0.01 0
87.21 76.57 823.75 89.17 75.05 823.75 1.96 1.52 0
132.2 104.1 923.5 130.1 106.2 923.8 2.1 2.1 0.3
113.29 149.57 791 114.77 148.24 791 1.48 1.33 0
279.69 75.82 653.75 281.47 74.4 653.75 1.78 1.42 0
132.23 104.11 923.5 130.12 106.19 923.75 2.11 2.08 0.25
FIGURE11
X, Y, Z direction positioning error.
Tang et al. 10.3389/fsufs.2024.1403872
Frontiers in Sustainable Food Systems 13 frontiersin.org
localization errors in the X and Y directions are less than 1.5 mm, while the median error in the Z direction is close to 0. Moreover, the maximum error in the Z direction is approximately 1 mm, demonstrating high accuracy.

In the future, given the high cost of obtaining instance segmentation data and the real-time performance requirements of the models, we will focus on in-depth research in two critical areas: data generation and model lightweighting. This will enable practical applications on edge devices and embedded systems.
Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions

ST: Conceptualization, Funding acquisition, Supervision, Writing – original draft, Writing – review & editing. ZX: Data curation, Methodology, Writing – original draft, Writing – review & editing, Conceptualization, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Software, Supervision, Validation, Visualization. JG: Funding acquisition, Investigation, Writing – original draft, Writing – review & editing. WW: Validation, Investigation, Writing – original draft, Writing – review & editing. ZH: Writing – review & editing, Writing – original draft, Visualization, Validation, Formal analysis. WZ: Writing – review & editing, Validation, Visualization, Software.
Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded by the Key Project of the Jiangsu Province Key Research and Development Program (No. BE2021016-3), the Jiangsu Agricultural Science and Technology Independent Innovation Fund Project (No. CX (22) 3016), and the Key R&D Program (Agricultural Research and Development) Project in Yancheng City (No. YCBN202309).
Acknowledgments

The authors express their gratitude to the editors and reviewers for their invaluable comments and suggestions.
Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References

Ahmad, M. T., Greenspan, M., Asif, M., and Marshall, J. A. (2018). Robust apple segmentation using fuzzy logic. 5th International Multi-Topic ICT Conference: Technologies for Future Generations, IMTIC 2018—Proceedings. 1–5.

Bolya, D., Zhou, C., Xiao, F., and Lee, Y. J. (2019). YOLACT: real-time instance segmentation. Proceedings of the IEEE International Conference on Computer Vision. 9157–9166.

Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017). DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848. doi: 10.1109/TPAMI.2017.2699184

Chen, W., Zhang, J., Guo, B., Wei, Q., and Zhu, Z. (2021). An apple detection method based on Des-YOLO v4 algorithm for harvesting robots in complex environment. Math. Probl. Eng. 2021, 1–12. doi: 10.1155/2021/7351470

Gai, R., Chen, N., and Yuan, H. (2023). A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 35, 13895–13906. doi: 10.1007/s00521-021-06029-z

Gené-Mola, J., Ferrer-Ferrer, M., Gregorio, E., Blok, P. M., Hemming, J., Morros, J. R., et al. (2023). Looking behind occlusions: a study on amodal segmentation for robust on-tree apple fruit size estimation. Comput. Electron. Agric. 209:107854. doi: 10.1016/j.compag.2023.107854

Gené-Mola, J., Gregorio, E., Guevara, J., Auat, F., Sanz-Cortiella, R., Escolà, A., et al. (2019). Fruit detection in an apple orchard using a mobile terrestrial laser scanner. Biosyst. Eng. 187, 171–184. doi: 10.1016/j.biosystemseng.2019.08.017

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2020). Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42, 386–397. doi: 10.1109/TPAMI.2018.2844175

Hu, T., Wang, W., Gu, J., Xia, Z., Zhang, J., and Wang, B. (2023). Research on apple object detection and localization method based on improved YOLOX and RGB-D images. Agronomy 13:1816. doi: 10.3390/agronomy13071816

Huang, Z., Huang, L., Gong, Y., Huang, C., and Wang, X. (2019). Mask scoring R-CNN. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 6409–6418.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.

Jia, W., Tian, Y., Luo, R., Zhang, Z., Lian, J., and Zheng, Y. (2020). Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot. Comput. Electron. Agric. 172:105380. doi: 10.1016/j.compag.2020.105380

Jia, W., Wang, Z., Zhang, Z., Yang, X., Hou, S., and Zheng, Y. (2022a). A fast and efficient green apple object detection model based on Foveabox. J. King Saud Univ.-Comput. Inf. Sci. 34, 5156–5169. doi: 10.1016/j.jksuci.2022.01.005

Jia, W., Zhang, Z., Shao, W., Hou, S., Ji, Z., Liu, G., et al. (2021). FoveaMask: a fast and accurate deep learning model for green fruit instance segmentation. Comput. Electron. Agric. 191:106488. doi: 10.1016/j.compag.2021.106488

Jia, W., Zhang, Z., Shao, W., Ji, Z., and Hou, S. (2022b). RS-Net: robust segmentation of green overlapped apples. Precis. Agric. 23, 492–513. doi: 10.1007/s11119-021-09846-3

Kang, H., and Chen, C. (2019). Fruit detection and segmentation for apple harvesting using visual sensor in orchards. Sensors 19:4599. doi: 10.3390/s19204599

Kang, H., and Chen, C. (2020). Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Comput. Electron. Agric. 171:105302. doi: 10.1016/j.compag.2020.105302

Kang, H., Wang, X., and Chen, C. (2022). Accurate fruit localisation using high resolution LiDAR-camera fusion and instance segmentation. Comput. Electron. Agric. 203:107450. doi: 10.1016/j.compag.2022.107450

Kang, H., Zhou, H., Wang, X., and Chen, C. (2020). Real-time fruit recognition and grasping estimation for robotic apple harvesting. Sensors 20:5670. doi: 10.3390/s20195670

Kong, T., Sun, F., Liu, H., Jiang, Y., Li, L., and Shi, J. (2020). FoveaBox: beyond anchor-based object detection. IEEE Trans. Image Process. 29, 7389–7398. doi: 10.1109/TIP.2020.3002345

Li, Q., Jia, W., Sun, M., Hou, S., and Zheng, Y. (2021). A novel green apple segmentation algorithm based on ensemble U-Net under complex orchard environment. Comput. Electron. Agric. 180:105900. doi: 10.1016/j.compag.2020.105900

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2117–2125.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI. 234–241.

Tan, M., and Le, Q. V. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. 6105–6114.

Tan, M., and Le, Q. V. (2021). EfficientNetV2: smaller models and faster training. International Conference on Machine Learning. 10096–10106.

Tan, M., Pang, R., and Le, Q. V. (2020). EfficientDet: scalable and efficient object detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 10781–10790.

Tian, Y., Duan, H., Luo, R., Zhang, Y., Jia, W., Lian, J., et al. (2019). Fast recognition and location of target fruit based on depth information. IEEE Access 7, 170553–170563. doi: 10.1109/ACCESS.2019.2955566

Wang, D., and He, D. (2022a). Apple detection and instance segmentation in natural environments using an improved mask scoring R-CNN model. Front. Plant Sci. 13:1016470. doi: 10.3389/fpls.2022.1016470

Wang, D., and He, D. (2022b). Fusion of mask RCNN and attention mechanism for instance segmentation of apples under complex background. Comput. Electron. Agric. 196:106864. doi: 10.1016/j.compag.2022.106864

Wang, X., Kang, H., Zhou, H., Au, W., Wang, M. Y., and Chen, C. (2023). Development and evaluation of a robust soft robotic gripper for apple harvesting. Comput. Electron. Agric. 204:107552. doi: 10.1016/j.compag.2022.107552

Wang, X., Kong, T., Shen, C., Jiang, Y., and Li, L. (2020a). SOLO: segmenting objects by locations. Computer Vision—ECCV 2020. 649–665.

Wang, W., Zhang, Y., Gu, J., and Wang, J. (2022). A proactive manufacturing resources assignment method based on production performance prediction for the smart factory. IEEE Trans. Ind. Inform. 18, 46–55. doi: 10.1109/TII.2021.3073404

Wang, X., Zhang, R., Kong, T., Li, L., and Shen, C. (2020b). SOLOv2: dynamic and fast instance segmentation. Advances in Neural Information Processing Systems. 17721–17732.

Wu, L., Ma, J., Zhao, Y., and Liu, H. (2021). Apple detection in complex scene using the improved YOLOv4 model. Agronomy 11:476. doi: 10.3390/agronomy11030476

Xia, Z., Gu, J., Wang, W., and Huang, Z. (2023). Research on a lightweight electronic component detection method based on knowledge distillation. Math. Biosci. Eng. 20, 20971–20994. doi: 10.3934/mbe.2023928

Xia, Z., Gu, J., Zhang, K., Wang, W., and Li, J. (2022). Research on multi-scene electronic component detection algorithm with anchor assignment based on K-means. Electronics 11:514. doi: 10.3390/electronics11040514

Zhang, S., Chi, C., Yao, Y., Lei, Z., and Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 9759–9768.

Zou, K., Ge, L., Zhou, H., Zhang, C., and Li, W. (2022). An apple image segmentation method based on a color index obtained by a genetic algorithm. Multimed. Tools Appl. 81, 8139–8153. doi: 10.1007/s11042-022-11905-4
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
As an essential part of electronic component assembly, it is crucial to rapidly and accurately detect electronic components. Therefore, a lightweight electronic component detection method based on knowledge distillation is proposed in this study. First, a lightweight student model was constructed. Then, we consider issues like the teacher and student's differing expressions. A knowledge distillation method based on the combination of feature and channel is proposed to learn the teacher's rich class-related and inter-class difference features. Finally, comparative experiments were analyzed for the dataset. The results show that the student model Params (13.32 M) are reduced by 55%, and FLOPs (28.7 GMac) are reduced by 35% compared to the teacher model. The knowledge distillation method based on the combination of feature and channel improves the student model's mAP by 3.91% and 1.13% on the Pascal VOC and electronic components detection datasets, respectively. As a result of the knowledge distillation, the constructed student model strikes a superior balance between model precision and complexity, allowing for fast and accurate detection of electronic components with a detection precision (mAP) of 97.81% and a speed of 79 FPS.
Article
Full-text available
The vision-based fruit recognition and localization system is the basis for the automatic operation of agricultural harvesting robots. Existing detection models are often constrained by high complexity and slow inference speed, which do not meet the real-time requirements of harvesting robots. Here, a method for apple object detection and localization is proposed to address the above problems. First, an improved YOLOX network is designed to detect the target region, with a multi-branch topology in the training phase and a single-branch structure in the inference phase. The spatial pyramid pooling layer (SPP) with serial structure is used to expand the receptive field of the backbone network and ensure a fixed output. Second, the RGB-D camera is used to obtain the aligned depth image and to calculate the depth value of the desired point. Finally, the three-dimensional coordinates of apple-picking points are obtained by combining two-dimensional coordinates in the RGB image and depth value. Experimental results show that the proposed method has high accuracy and real-time performance: F1 is 93%, mean average precision (mAP) is 94.09%, detection speed can reach 167.43 F/s, and the positioning errors in X, Y, and Z directions are less than 7 mm, 7 mm, and 5 mm, respectively.
Article
The detection and sizing of fruits with computer vision methods is of interest because it provides relevant information to improve the management of orchard farming. However, the presence of partially occluded fruits limits the performance of existing methods, making reliable fruit sizing a challenging task. While previous fruit segmentation works limit segmentation to the visible region of fruits (known as modal segmentation), in this work we propose an amodal segmentation algorithm to predict the complete shape, which includes its visible and occluded regions. To do so, an end-to-end convolutional neural network (CNN) for simultaneous modal and amodal instance segmentation was implemented. The predicted amodal masks were used to estimate the fruit diameters in pixels. Modal masks were used to identify the visible region and measure the distance between the apples and the camera using the depth image. Finally, the fruit diameters in millimetres (mm) were computed by applying the pinhole camera model. The method was developed with a Fuji apple dataset consisting of 3925 RGB-D images acquired at different growth stages with a total of 15,335 annotated apples, and was subsequently tested in a case study to measure the diameter of Elstar apples at different growth stages. Fruit detection results showed an F1-score of 0.86 and the fruit diameter results reported a mean absolute error (MAE) of 4.5 mm and R² = 0.80 irrespective of fruit visibility. Besides the diameter estimation, modal and amodal masks were used to automatically determine the percentage of visibility of measured apples. This feature was used as a confidence value, improving the diameter estimation to MAE = 2.93 mm and R² = 0.91 when limiting the size estimation to fruits detected with a visibility higher than 60%. The main advantages of the present methodology are its robustness for measuring partially occluded fruits and the capability to determine the visibility percentage. The main limitation is that depth images were generated by means of photogrammetry methods, which limits the efficiency of data acquisition. To overcome this limitation, future works should consider the use of commercial RGB-D sensors. The code and the dataset used to evaluate the method have been made publicly available at https://github.com/GRAP-UdL-AT/Amodal_Fruit_Sizing.
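The pinhole-model sizing step can be sketched in a few lines; the focal length and measurements below are illustrative placeholders, not values from the cited work:

```python
# Convert an amodal mask diameter in pixels to millimetres using the
# fruit's depth and the pinhole camera model.
def diameter_mm(diameter_px: float, depth_mm: float, focal_px: float) -> float:
    """Pinhole model: real size = pixel size * depth / focal length."""
    return diameter_px * depth_mm / focal_px

# Example: an 80 px amodal diameter at 900 mm depth, 1500 px focal length.
print(diameter_mm(80, 900.0, 1500.0))  # 48.0 mm
```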
Article
To enable apple-picking robots to quickly and accurately detect apples under the complex backgrounds of orchards, we propose an improved You Only Look Once version 4 (YOLOv4) model and data augmentation methods. Firstly, web crawler technology is utilized to collect pertinent apple images from the Internet for labeling. To address the problem of insufficient image data caused by random occlusion between leaves, a leaf illustration data augmentation method is proposed in this paper in addition to traditional data augmentation techniques. Secondly, because of the large size and computational cost of the YOLOv4 model, the backbone network Cross Stage Partial Darknet53 (CSPDarknet53) of the YOLOv4 model is replaced by EfficientNet, and a convolution layer (Conv2D) is added to the three outputs to further adjust and extract the features, making the model lighter and reducing its computational complexity. Finally, the apple detection experiment is performed on 2670 expanded samples. The test results show that the EfficientNet-B0-YOLOv4 model proposed in this paper has better detection performance than YOLOv3, YOLOv4, and Faster R-CNN with ResNet, which are state-of-the-art apple detection models. The average values of Recall, Precision, and F1 reach 97.43%, 95.52%, and 96.54%, respectively, and the average detection time per frame is 0.338 s, which shows that the proposed method can be well applied in the vision systems of picking robots in the apple industry.
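A leaf-overlay augmentation in the spirit of the described method might look like the following sketch, assuming Pillow and a folder of leaf cutouts with alpha channels; the paths, scale range, and leaf count are illustrative:

```python
# Paste randomly scaled/rotated leaf cutouts over an apple image to
# simulate occlusion by foliage (illustrative sketch, not the paper's code).
import random
from PIL import Image

def leaf_occlusion_augment(image_path, leaf_paths, n_leaves=3):
    """Return an RGB image with n_leaves leaf cutouts composited on top."""
    img = Image.open(image_path).convert("RGBA")
    for _ in range(n_leaves):
        leaf = Image.open(random.choice(leaf_paths)).convert("RGBA")
        scale = random.uniform(0.1, 0.3)            # leaf size vs. image width
        w = max(1, int(img.width * scale))
        h = max(1, int(leaf.height * w / leaf.width))
        leaf = leaf.resize((w, h)).rotate(random.uniform(0, 360), expand=True)
        x = random.randint(0, max(0, img.width - leaf.width))
        y = random.randint(0, max(0, img.height - leaf.height))
        img.alpha_composite(leaf, dest=(x, y))      # alpha keeps the leaf shape
    return img.convert("RGB")
```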
Article
The accurate detection and segmentation of apples during the growth stage are essential for yield estimation, timely harvesting, and retrieving growth information. However, factors such as uncertain illumination, overlaps and occlusions of apples, the homochromatic background, and the gradual change in the ground color of apples from green to red bring great challenges to the detection and segmentation of apples. To solve these problems, this study proposed an improved Mask Scoring region-based convolutional neural network (Mask Scoring R-CNN), known as MS-ADS, for accurate apple detection and instance segmentation in a natural environment. First, ResNeSt, a variant of ResNet, combined with a feature pyramid network, was used as the backbone network to improve the feature extraction ability. Second, the high-level architectures, including the R-CNN head and mask head, were modified to improve the utilization of high-level features. Convolutional layers were added to the original R-CNN head to improve the accuracy of bounding box detection (bbox_mAP), and the Dual Attention Network was added to the original mask head to improve the accuracy of instance segmentation (mask_mAP). The experimental results showed that the proposed MS-ADS model effectively detected and segmented apples under various conditions, such as apples occluded by branches, leaves, and other apples, apples with different ground colors and shadows, and apples divided into parts by branches and petioles. The recall, precision, false detection rate, and F1 score were 97.4%, 96.5%, 3.5%, and 96.9%, respectively. A bbox_mAP and mask_mAP of 0.932 and 0.920, respectively, were achieved on the test set, and the average run-time was 0.27 s per image. These results indicate that the MS-ADS method detects and segments apples in the orchard robustly and accurately, with real-time performance. This study lays a foundation for follow-up work such as yield estimation, harvesting, and automatic, long-term acquisition of apple growth information.
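For reference, the reported F1 score is consistent with the stated precision and recall: F1 = 2PR/(P + R) = 2 × 0.965 × 0.974 / (0.965 + 0.974) ≈ 0.969, i.e., the 96.9% given above.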
Article
Accurate depth sensing is crucial to securing a high success rate of robotic harvesting in natural orchard environments. Solid-state LiDAR, a recently introduced type of LiDAR sensor, can perceive high-resolution geometric information of a scene, which can be utilised to obtain accurate depth information. Meanwhile, the fusion of sensory data from the LiDAR and the camera can significantly enhance the sensing ability of harvesting robots. This work first introduces a LiDAR-camera fusion-based visual sensing and perception strategy to perform accurate fruit localisation in apple orchards. Two state-of-the-art LiDAR-camera extrinsic calibration methods are evaluated to obtain an accurate extrinsic matrix between the LiDAR and the camera. After that, the point clouds and colour images are fused to perform fruit localisation using a one-stage instance segmentation network. In addition, comprehensive experiments show that the LiDAR-camera combination achieves better visual sensing performance in natural environments, and that introducing LiDAR-camera fusion largely improves the accuracy and robustness of fruit localisation. Specifically, the standard deviations of fruit localisation using the LiDAR-camera at 0.5, 1.2, and 1.8 m are 0.253, 0.230, and 0.285 cm, respectively, during the afternoon under intensive sunlight. This measurement error is much smaller than that from the RealSense D455. Lastly, visualised point clouds of the apple trees are provided to demonstrate the highly accurate sensing results of the proposed LiDAR-camera fusion method.
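The core fusion step, projecting LiDAR points into the image plane with an extrinsic matrix T (LiDAR to camera) and an intrinsic matrix K, can be sketched as follows; both matrices below are placeholders, not calibration results:

```python
# Project LiDAR points into pixel coordinates given extrinsics and
# intrinsics (sketch only; no handling of points behind the camera).
import numpy as np

def project_lidar_to_image(points_lidar, T, K):
    """points_lidar: (N, 3) XYZ in the LiDAR frame. Returns (N, 2) pixels."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])  # homogeneous (N, 4)
    pts_cam = (T @ pts_h.T).T[:, :3]                    # camera frame (N, 3)
    pts_img = (K @ pts_cam.T).T                         # image plane (N, 3)
    return pts_img[:, :2] / pts_img[:, 2:3]             # perspective divide

T = np.eye(4)                          # placeholder extrinsic calibration
K = np.array([[900., 0., 640.],
              [0., 900., 360.],
              [0., 0., 1.]])           # placeholder intrinsics
print(project_lidar_to_image(np.array([[0.2, 0.1, 1.5]]), T, K))  # [[760. 420.]]
```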
Article
Achieving multi-scene electronic component detection is the key to automatic electronic component assembly, and the study of deep-learning-based multi-scene electronic component detection methods is an important research focus. Current object detection methods use many anchors, which often leads to extremely unbalanced positive and negative samples during training and requires manual adjustment of thresholds to divide positive and negative samples. Besides, existing methods often yield complex models with many parameters and high computational complexity. To address these issues, a new method was proposed for the detection of electronic components in multiple scenes. Firstly, a new dataset was constructed to describe the multi-scene electronic component scene. Secondly, a K-Means-based two-stage adaptive division strategy was used to solve the imbalance of positive and negative samples. Thirdly, EfficientNetV2 was selected as the backbone feature extraction network to make the method simpler and more efficient. Finally, the proposed algorithm was evaluated on both the public dataset and the constructed multi-scene electronic component dataset. Its performance was outstanding compared to current mainstream object detection algorithms: the proposed method achieved the highest mAP (83.20% and 98.59%), lower FLOPs (44.26 GMac), and smaller Params (29.3 M).
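One plausible reading of the K-Means-based division is to cluster anchor-to-ground-truth IoUs into two groups and treat the higher-IoU cluster as positives; the sketch below, assuming scikit-learn, illustrates that idea only and may differ from the paper's exact two-stage strategy:

```python
# Split anchors into positives/negatives by clustering their IoUs
# (illustrative interpretation, not the paper's exact method).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_split(ious):
    """ious: (N,) IoU of each anchor with its best-matched ground truth."""
    km = KMeans(n_clusters=2, n_init=10).fit(ious.reshape(-1, 1))
    hi = np.argmax(km.cluster_centers_.ravel())  # cluster with higher IoU
    return km.labels_ == hi                      # boolean mask of positives

ious = np.array([0.05, 0.10, 0.55, 0.62, 0.08, 0.71])
print(kmeans_split(ious))  # [False False  True  True False  True]
```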
Article
Fruit harvesting is facing challenges due to a labour shortage that has become more severe since the recent pandemic. Robotic harvesting has been attempted for autonomous fruit harvesting tasks such as apple harvesting. However, current apple-harvesting robots show limited harvesting performance in the orchard environment owing to the inefficiency of their grippers. This research presents a fruit harvesting method that includes a novel soft robotic gripper and a detachment strategy to achieve apple harvesting in a natural orchard. The soft robotic gripper includes four tapered soft robotic fingers (SRFs) and one multi-mode suction cup. The SRF is customised to avoid interference with obstacles during grasping, and its compliance and force exertion are comprehensively evaluated with FEA and experiments. The multi-mode suction cup can provide suction adhesion force, perform active extrusion/withdrawal, and present a passive compliance mode. A simultaneous twist-and-pull motion is proposed and implemented to detach the apples from the trees. The proposed robotic gripper is compact, compliant in grasping apples, and generates a large grasping force. The method is finally validated in a natural orchard, achieving detachment, damage, and harvesting rates of 75.6%, 4.55%, and 70.77%, respectively.
Article
Fruit object detection is crucial for automatic harvesting systems, serving applications such as orchard yield measurement and fruit harvesting. In order to achieve fast recognition and localization of green apples and meet the real-time working requirements of the vision systems of harvesting robots, a fast optimized Foveabox detection model (Fast-FDM) is proposed. Fast-FDM uses an optimized form of the anchor-free Foveabox to accurately and efficiently detect green apples in harvesting environments. Specifically, the EfficientNetV2-S, with fast training and small size, is used as the backbone network; a weighted bi-directional feature pyramid network (BiFPN) is employed as the feature extraction network to fuse multi-scale features easily and quickly; and the fused features are then fed to the fovea head prediction network for classification and bounding box prediction. Furthermore, an adaptive training sample selection (ATSS) method is adopted to directly select positive and negative samples, allowing green fruits of different scales to obtain higher recall and achieve more accurate green apple detection. Experimental results show that the proposed Fast-FDM realizes a mean average precision (mAP) of 62.3% for green apple detection using fewer parameters and floating-point operations (FLOPs), achieving a better trade-off between accuracy and detection efficiency.
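The ATSS rule referenced above (Zhang et al., 2020) can be sketched as follows, simplified to a single ground-truth box and one feature level; the full method also requires candidate anchor centres to fall inside the ground-truth box:

```python
# ATSS-style positive sample selection: take the k anchors closest to the
# ground-truth centre, then keep those whose IoU exceeds mean + std of the
# candidate IoUs (simplified sketch).
import numpy as np

def atss_positives(anchor_centers, anchor_ious, gt_center, k=9):
    """anchor_centers: (N, 2); anchor_ious: (N,) IoU with the GT box."""
    dists = np.linalg.norm(anchor_centers - gt_center, axis=1)
    candidates = np.argsort(dists)[:k]               # k closest anchors
    cand_ious = anchor_ious[candidates]
    thr = cand_ious.mean() + cand_ious.std()         # adaptive IoU threshold
    return candidates[cand_ious >= thr]              # positive anchor indices

# Toy example: 5 anchors, one ground truth centred at (10, 10).
centers = np.array([[10., 10.], [12., 9.], [30., 40.], [11., 11.], [50., 5.]])
ious = np.array([0.80, 0.70, 0.60, 0.75, 0.0])
print(atss_positives(centers, ious, np.array([10., 10.]), k=4))  # [0]
```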
Article
It is important to precisely segment apples in an orchard during the growth period to obtain accurate growth information. However, complex environmental factors and growth characteristics, such as fluctuating illumination, overlapping and occlusion of apples, the gradual change in the ground colour of apples from green to red, and the similarities between immature apples and background leaves, affect apple segmentation accuracy. The purpose of this study was to develop a precise apple instance segmentation method based on an improved Mask region-based convolutional neural network (Mask RCNN). An existing Mask RCNN model was improved by fusing an attention module into the backbone network to enhance its feature extraction ability. A combination of deformable convolution and transformer attention with the key-content-only term was used as the attention module in this study. The experimental results showed that the improved Mask RCNN can accurately segment apples under various conditions, such as apples with shadows and different ground colours, overlapped apples, and apples occluded by branches and leaves. A recall, precision, F1 score, and segmentation mAP of 97.1%, 95.8%, 96.4%, and 0.917, respectively, were achieved, and the average run-time on the test set was 0.25 s per image. Our method outperformed the two comparison methods, indicating that it can accurately segment apples in the growth stage with near real-time performance. This study lays the foundation for realizing accurate fruit detection and long-term automatic growth monitoring.
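The deformable-convolution half of such an attention module can be sketched with torchvision.ops.DeformConv2d; the transformer-attention term is omitted here, and channel counts and shapes are illustrative:

```python
# Deformable convolution: a small conv predicts 2 (x, y) offsets per
# kernel position, letting the sampling grid bend toward object shape.
import torch
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 64, 64, 3
offset_pred = torch.nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

x = torch.randn(1, in_ch, 32, 32)
y = deform(x, offset_pred(x))   # offsets are predicted from the input itself
print(y.shape)                  # torch.Size([1, 64, 32, 32])
```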