High-precision apple recognition and localization method based on RGB-D and improved SOLOv2 instance segmentation
ShixiTang
1,2, ZilinXia
3*, JinanGu
3, WenboWang
3,
ZedongHuang
3 and WenhaoZhang
3
1 School of Information Engineering, Yancheng Teachers University, Yancheng, China, 2 Jiangsu
Engineering Laboratory of Cyberspace Security, Suzhou, China, 3 School of Mechanical Engineering,
Jiangsu University, Zhenjiang, China
Intelligent apple-picking robots can significantly improve the efficiency of apple picking, and the realization of fast and accurate recognition and localization of apples is the prerequisite and foundation for the operation of picking robots. Existing apple recognition and localization methods primarily focus on object detection and semantic segmentation techniques. However, these methods often suffer from localization errors when facing occlusion and overlapping issues. Furthermore, the few instance segmentation methods are also inefficient and heavily dependent on detection results. Therefore, this paper proposes an apple recognition and localization method based on RGB-D and an improved SOLOv2 instance segmentation approach. To improve the efficiency of the instance segmentation network, EfficientNetV2 is employed as the feature extraction network, known for its high parameter efficiency. To enhance segmentation accuracy when apples are occluded or overlapping, a lightweight spatial attention module is proposed. This module improves the model's position sensitivity so that positional features can differentiate between overlapping objects when their semantic features are similar. To accurately determine the apple-picking points, an RGB-D-based apple localization method is introduced. Through comparative experimental analysis, the improved SOLOv2 instance segmentation method has demonstrated remarkable performance. Compared to SOLOv2, the F1 score, mAP, and mIoU on the apple instance segmentation dataset have increased by 2.4, 3.6, and 3.8%, respectively. Additionally, the model's Params and FLOPs have decreased by 1.94 M and 31 GFLOPs, respectively. A total of 60 samples were gathered for the analysis of localization errors. The findings indicate that the proposed method achieves high precision in localization, with errors in the X, Y, and Z axes ranging from 0 to 3.95 mm, 0 to 5.16 mm, and 0 to 1 mm, respectively.
KEYWORDS: apple instance segmentation, lightweight spatial attention, EfficientNetV2, improved SOLOv2, RGB-D
OPEN ACCESS

EDITED BY: Raul Avila-Sosa, Benemérita Universidad Autónoma de Puebla, Mexico

REVIEWED BY: Jieli Duan, South China Agricultural University, China; Zhenguo Zhang, Xinjiang Agricultural University, China

*CORRESPONDENCE: Zilin Xia, xiazilin@stmail.ujs.edu.cn

RECEIVED 20 March 2024; ACCEPTED 13 May 2024; PUBLISHED 06 June 2024

TYPE: Original Research

CITATION: Tang S, Xia Z, Gu J, Wang W, Huang Z and Zhang W (2024) High-precision apple recognition and localization method based on RGB-D and improved SOLOv2 instance segmentation. Front. Sustain. Food Syst. 8:1403872. doi: 10.3389/fsufs.2024.1403872

COPYRIGHT: © 2024 Tang, Xia, Gu, Wang, Huang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

1 Introduction

Currently, apple picking relies largely on manual labor, which is time-consuming and labor-intensive, resulting in high harvesting costs and low efficiency. With the rapid development of artificial intelligence and robotics, the realization of automated apple picking has become an inevitable trend (Wang et al., 2022, 2023). Achieving rapid and accurate apple identification and localization in complex orchard environments is the key to realizing
automated apple harvesting. However, the complex orchard environment is influenced by shading, overlapping, and camera angles, making fast and accurate apple identification and positioning a greater challenge.
In recent years, along with the development of machine vision, the recognition and localization of apples have been extensively researched (Xia et al., 2022, 2023; Gai et al., 2023). Specific methods can be classified into three main categories: object detection-based (Huang et al., 2017; Hu et al., 2023), semantic segmentation-based (Jia et al., 2022b), and instance segmentation-based (Wang and He, 2022a). Object detection involves identifying and localizing objects in images and marking them with bounding boxes. Jia et al. (2022a) proposed an improved FoveaBox (Kong et al., 2020) for green apple object detection. This approach utilizes EfficientNetV2-S as the feature extraction network, employs the BiFPN (Bidirectional Feature Pyramid Network) (Tan et al., 2020) for feature fusion, and uses the ATSS (Adaptive Training Sample Selection) (Zhang et al., 2020) technique to match positive and negative samples. The overall model achieves a high recall but at reduced speed. Wu et al. (2021) presented an enhanced YOLOv4 method for apple detection in complex scenes. They replaced YOLOv4's backbone feature extraction network with EfficientNet and achieved a 96.54% F1 score on their constructed dataset. Chen et al. (2021) proposed a Des-YOLOv4 detection model tailored for apples. This method introduces the DenseNet dense residual structure into YOLOv4 and employs Soft-NMS in the post-processing phase to enhance the recall rate for overlapping apples; the overall model has fewer parameters than YOLOv4. Object detection-based apple recognition and localization methods are fast, but when the apples in a detection bounding box are obscured or overlapping, acquiring their depth information is hindered, which leads to picking failure.
Semantic segmentation can assign each pixel in an image to its corresponding category, yielding more refined object segmentation results. Ahmad et al. (2018) proposed a method based on a fuzzy inference system and fuzzy c-means to segment apples of different colors during the growth process. Zou et al. (2022) introduced a color-index-based apple segmentation method that enables rapid segmentation of orchard apples, with an average segmentation time of 20 ms. While these traditional segmentation methods offer faster speeds, their robustness is compromised in complex orchard environments. Kang and Chen (2019) introduced DasNet, a deep learning-based semantic segmentation network, to segment apples and tree branches. Li et al. (2021) proposed an improved U-Net (Ronneberger et al., 2015) method for segmenting green apples. It incorporated dilated convolutions and the ASPP (Atrous Spatial Pyramid Pooling) (Chen et al., 2017) structure into U-Net, which enlarged the receptive field and enhanced segmentation accuracy. Using semantic segmentation methods for apple segmentation can provide more detailed contours. However, in cases of overlapping apples, distinguishing between them becomes challenging with semantic segmentation, which in turn impacts the acquisition of depth information for each apple.
Instance segmentation enables the classification of each pixel's category in an image while distinguishing different instances of the same category. Kang and Chen (2020) proposed the DaSNet-V2 method for apple instance segmentation, using ResNet101 and ResNet18 as backbone feature extraction networks and achieving segmentation accuracies of 87.3 and 86.6%, respectively. Wang and He (2022b) introduced an improved Mask R-CNN method for apple instance segmentation. By incorporating attention mechanisms in the feature extraction module, this approach enhances apple segmentation accuracy, but at a slower speed. Jia et al. (2020) presented an enhanced Mask R-CNN method for apple instance segmentation. They combined the DenseNet dense connection structure into the ResNet backbone feature extraction network, thus improving segmentation accuracy and enabling recognition and segmentation of overlapping apples. Jia et al. (2021) proposed an anchor-free instance segmentation method tailored for green apples. This method adds an instance branch to FoveaBox, conducting apple detection before segmentation. Nevertheless, it exhibits subpar performance in segmenting apple edge contours. Instance segmentation-based methods can achieve apple recognition, precise localization, and mask generation. However, the majority of current research focuses on detection-based instance segmentation approaches, in which the instance branch often lacks consideration of global context, resulting in suboptimal performance in edge segmentation and slower segmentation speeds.
Acquiring depth information for apples is a critical factor in achieving accurate picking. Specific means of obtaining this information include stereo cameras, structured-light cameras, TOF (time-of-flight) cameras, and laser radar (LiDAR). Tian et al. (2019) proposed a fruit localization technique based on the Kinect V2, utilizing depth images to determine the apple's center and combining RGB data to estimate the apple's radius. However, in cases of overlap and occlusion, the depth image may not fully represent the apple's true depth information, leading to ambiguous localization. Kang et al. (2020) implemented apple localization using an Intel D435 camera. They employed RGB images for fruit detection and instance segmentation, combining depth information to fit the apple's point cloud and thus localize it. However, this method suffers from lower efficiency. Gené-Mola et al. (2019) utilized laser radar and object detection for apple localization, achieving a success rate of 87.5%. Kang et al. (2022) fused radar with the camera as input and then used instance segmentation to achieve apple localization, but this method incurs higher costs.
So far, the recognition and localization of apples have predominantly relied on object detection and semantic segmentation methods. However, these methods often lead to positioning errors when facing challenges such as occlusion and overlapping. While a few studies have explored detection-based instance segmentation methods for apple recognition and localization, these methods usually come with high parameter and computational complexity, are susceptible to the influence of detection results, and lack consideration of global information. SOLOv2 (Wang et al., 2020b) is a one-stage instance segmentation method that introduces an efficient instance mask representation scheme built on the foundation of SOLO (Wang et al., 2020a). It improves the efficiency of the overall method by decoupling instance mask generation into mask kernel and mask feature learning and by using convolution operations to generate instance masks. Compared to two-stage instance segmentation models such as Mask R-CNN, SOLOv2 eliminates the need for anchor boxes, does not rely on detection results, occupies less memory, and is more suitable for practical engineering applications. Therefore, this paper proposes an apple recognition and localization approach based on RGB-D and an improved SOLOv2 instance segmentation method. This method eliminates reliance on detection results and can achieve accurate apple positioning even in occlusion and overlapping scenarios. Specifically, the main contributions of this paper are as follows:
1. Introducing an improved SOLOv2 instance segmentation method that achieves high-precision apple instance segmentation and is independent of detection results.
2. Introducing a lightweight spatial attention mechanism into the mask prediction head of SOLOv2 to enhance the segmentation accuracy for overlapping apples.
3. Introducing an RGB-D-based apple localization method that achieves accurate positioning in scenarios with occlusion and overlapping, thereby enhancing the success rate of apple picking.
The remainder of this paper is organized as follows: Section 2 introduces the improved SOLOv2 instance segmentation method and the RGB-D-based apple-picking point localization method. Section 3 conducts comparative experiments and analyzes the experimental results. Section 4 summarizes the paper and outlines future research directions.
2 The proposed method

2.1 Apple instance segmentation method based on improved SOLOv2
In this paper, wefurther enhance the segmentation accuracy based
on SOLOv2 without introducing excessive parameters. Specically,
weintegrate the proposed lightweight spatial attention module into the
mask kernel and mask feature branches and adopt a more ecient
feature extraction network, EcientNetV2. Aer these improvements,
the improved SOLOv2 signicantly boosts instance segmentation
accuracy while maintaining eciency and avoiding the introduction
of redundant parameters. Figure1 illustrates the enhanced SOLOv2
instance segmentation method, and detailed descriptions of each
module will beelaborated in the subsequent sections.
2.1.1 Backbone feature extraction network

The feature extraction network, as a crucial component of the instance segmentation method, significantly influences the performance of the whole model. In this study, EfficientNetV2 (Tan and Le, 2021), which builds upon the improvements made in EfficientNetV1 (Tan and Le, 2019), was adopted as the backbone feature extraction network. The network's optimal width, depth, and other design parameters were determined using NAS (Neural Architecture Search) techniques. To address the slow training speed of EfficientNetV1, the shallow MBConv modules were substituted with Fused-MBConv modules; the specific MBConv and Fused-MBConv modules are illustrated in Figure 1.

As depicted in Figure 1, the MBConv module employs a 1 × 1 convolutional layer to increase feature dimensionality, followed by a 3 × 3 depthwise separable convolutional layer for feature extraction. In contrast, the Fused-MBConv module directly utilizes a 3 × 3 convolutional layer to perform feature extraction and dimensionality expansion, improving feature extraction speed. EfficientNetV2 demonstrates exceptional accuracy on the ImageNet dataset while enhancing training speed and parameter efficiency. Compared to ResNet50, EfficientNetV2 exhibits higher efficiency, achieving greater precision with equivalent parameters and computation. Additionally, EfficientNetV2 is well suited to deployment on mobile and embedded devices for tasks such as apple harvesting.
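To make the backbone swap concrete, the sketch below shows one way to tap multi-scale features (C2-C5) from a torchvision EfficientNetV2-S for use with an FPN. It is a minimal illustration under stated assumptions, not the authors' implementation: the small variant and the exact node names follow torchvision's layout and are assumptions.

```python
import torch
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# ImageNet-1K pre-trained EfficientNetV2-S as the backbone.
backbone = efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1)

# Tap one node per stride (4, 8, 16, 32) to obtain C2-C5 for the FPN.
# These node names follow torchvision's efficientnet_v2_s layout and are
# assumptions, not the authors' published configuration.
return_nodes = {
    "features.2": "C2",  # stride 4
    "features.3": "C3",  # stride 8
    "features.5": "C4",  # stride 16
    "features.6": "C5",  # stride 32
}
extractor = create_feature_extractor(backbone, return_nodes=return_nodes)

feats = extractor(torch.randn(1, 3, 512, 512))
for name, f in feats.items():
    print(name, tuple(f.shape))
```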
2.1.2 Instance mask generation module

SOLOv2 decouples instance mask generation into mask kernels and mask features, and then convolves the mask kernels with the mask features to obtain the final instance masks. The parameters of the mask kernels and the mask features are generated separately by the mask kernel branch and the mask feature branch.

As depicted in Figure 1, the first step is to use the FPN (Lin et al., 2017) to perform multi-scale feature fusion, aiming to achieve multi-scale segmentation. This process is detailed in Eq. 1:
$$P_2, P_3, P_4, P_5, P_6 = \mathrm{FPN}\left(C_2, C_3, C_4, C_5\right) \tag{1}$$

where $C_2, C_3, C_4, C_5$ are the effective feature layers output by EfficientNetV2, and $P_2, P_3, P_4, P_5, P_6$ are the feature layers output after feature fusion.
In the mask kernel branch, each feature layer $P_i$ is resized to a grid of size $S_i \times S_i$; if the center of a ground truth (GT) instance falls into a grid cell, that cell is responsible for predicting the instance. Specifically, as shown in Eq. 2:

$$K_i = \mathrm{KernelBranch}\left(P_i\right), \quad i = 2, 3, 4, 5, 6 \tag{2}$$

where $K_i$ is the mask kernel parameter tensor generated by the corresponding feature layer, with size $S_i \times S_i \times C$.
In the mask feature branch, the FPN output features are used to create mask features shared across different levels. This allows different levels to share the same mask features, reducing parameters and improving efficiency. The process is detailed in Eq. 3:

$$F = \mathrm{FeatureBranch}\left(P_2, P_3, P_4, P_5\right) \tag{3}$$

where $F$ denotes the shared mask features, with size $H \times W \times C$; $H$ and $W$ are one-fourth of the input height and width, respectively.
Finally, the mask kernel parameters corresponding to the grids containing objects, denoted $K_i^{pos}$, are selected. These are the grids into which the center of a GT instance falls during training, and the grids whose predicted classification score exceeds the score threshold during inference. The selected mask kernel parameters $K_i^{pos}$ are convolved with the shared mask features $F$ to generate instance masks, as shown in Eq. 4:

$$M_{i,j} = K_i^{pos} \ast F \tag{4}$$

where $K_i^{pos}$ is the mask kernel parameter tensor obtained by filtering $K_i$, with size $n \times 1 \times 1 \times C$, and $M_{i,j}$ is the instance prediction mask generated at the corresponding location. The overall instance mask generation module is illustrated in Figure 2.

FIGURE 2: Instance mask generation module structure diagram.
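The dynamic convolution in Eq. 4 reduces to a 1 × 1 convolution of the shared mask features with the selected kernels. The sketch below illustrates this with `torch.nn.functional.conv2d`; the tensor shapes and the sigmoid on the output are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def generate_instance_masks(mask_feats: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """Convolve selected mask kernels with the shared mask features (Eq. 4).

    mask_feats: (E, H, W) shared mask feature map F
    kernels:    (n, E) dynamic kernel parameters K_i^pos for n positive grids
    returns:    (n, H, W) instance mask probabilities
    """
    n, e = kernels.shape
    weight = kernels.view(n, e, 1, 1)                  # n dynamic 1x1 filters
    masks = F.conv2d(mask_feats.unsqueeze(0), weight)  # (1, n, H, W)
    return masks.squeeze(0).sigmoid()

# Toy shapes: C = 256 mask-feature channels, H = W = input size / 4.
mask_feats = torch.randn(256, 128, 128)
kernels = torch.randn(5, 256)        # five positive grid cells
print(generate_instance_masks(mask_feats, kernels).shape)  # (5, 128, 128)
```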
2.1.3 Improved mask feature branch

In SOLOv2, the mask feature branch is composed solely of upsampling and convolution operations. However, instances with similar semantic features heavily rely on positional features for differentiation, and relying solely on convolution-generated positional information is insufficient. Therefore, an Att-Block is used as a replacement for the convolution operation to construct mask features containing comprehensive positional information without introducing excessive parameters. In the Att-Block, the convolution is replaced with a depthwise separable convolution, and a lightweight spatial attention mechanism is introduced to capture positional features between instances. The details are illustrated in Figure 3.

The lightweight spatial attention module works in two steps: (1) first, obtain the spatial position relationships in the vertical direction by applying a K × 1 convolutional kernel to the feature map; the computational complexity of this step is $H^2W$. (2) Then, obtain the spatial position relationships in the horizontal direction by applying a 1 × K convolutional kernel to the feature map generated in step (1); the computational complexity of this step is $HW^2$. Finally, a Sigmoid generates the spatial attention map. The overall computational complexity is $H^2W + HW^2$; compared with directly using a fully connected layer to compute the spatial attention map of the feature map, the lightweight attention module has lower computational complexity when the feature map width W and height H are large. This makes it particularly suitable for capturing feature map spatial relationships in lightweight networks.

FIGURE 3: The structure diagram of the improved mask feature branch.
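A possible PyTorch rendering of this two-step strip attention is sketched below. The kernel size K, the channel arrangement, and the residual multiplication are assumptions for illustration; the paper specifies only the K × 1 followed by 1 × K decomposition and the Sigmoid.

```python
import torch
import torch.nn as nn

class LightweightSpatialAttention(nn.Module):
    """Two-step strip attention: a Kx1 convolution captures vertical position
    relations, a 1xK convolution then captures horizontal ones, and a Sigmoid
    turns the result into a spatial attention map used to re-weight features."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        pad = k // 2
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0))
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, pad))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.sigmoid(self.horizontal(self.vertical(x)))
        return x * attn  # emphasize positions that separate nearby instances

out = LightweightSpatialAttention(256)(torch.randn(2, 256, 64, 64))
print(out.shape)  # torch.Size([2, 256, 64, 64])
```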
2.1.4 Improved mask kernel branch

With the aim of enhancing the sensitivity of the learned mask kernel parameters to positional information and improving instance segmentation accuracy, this paper also modifies the mask kernel branch. The convolution operations in the mask kernel branch are replaced with Att-Block modules to capture feature spatial relationships. This alteration enables the learned mask kernel parameters to encompass richer positional information, as depicted in Figure 4. It is important to note that the Att-Block used in the improved mask kernel branch employs regular convolutional structures rather than depthwise separable convolutions. This choice ensures that the information encoded within the mask kernel is more comprehensive.

FIGURE 4: The structure diagram of the improved mask kernel branch.
2.1.5 Label assignment method and loss calculation

SOLOv2 differs from detection-based instance segmentation methods in that it does not assign labels by IoU thresholding. Instead, it resizes different feature layers into S × S grids of different sizes, and each element of a grid is responsible for predicting one instance. Given an image, let $GT$ represent the ground-truth labels, $GT_{area}$ the area of a label, $GT_{mask}$ the mask of a label, and $GT_{label}$ the category of a label. First, the ground-truth instances are categorized into different levels based on their area, as shown in Eq. 5:

$$lb_i \le GT_{area} \le up_i \tag{5}$$

where $lb_i$ and $up_i$ represent the lower and upper bounds of the object scale predicted by the current feature layer; instances satisfying this condition are considered as $GT_i$ for the current layer. Subsequently, $GT_i$ is scaled around its center, and the grid cells within the scaled $GT_i$ are selected as positive samples, as shown in Eq. 6:

$$pos_{index}^{i} = GT_i \ast pos_{scale} \tag{6}$$

where $pos_{index}^{i}$ represents the indices of the grids within the scaled $GT_i$, which are the indices of the positive samples, and $pos_{scale}$ is the scaling factor. Then, the mask kernel parameters corresponding to the positive samples are selected using these indices and denoted $K_i^{pos}$, as shown in Eq. 7:

$$K_i^{pos} = K_i\left[pos_{index}^{i}\right] \tag{7}$$

The mask kernel parameters corresponding to positive samples from all layers are collected and denoted $K^{pos}$. Convolution is then applied to obtain the predicted masks, as shown in Eq. 8:

$$M = K^{pos} \ast F \tag{8}$$

where $F$ is the mask feature generated by the mask feature branch and $M$ is the prediction mask. Finally, the mask and classification losses are computed as in Eqs. 9 and 10:

$$L_{mask} = \mathrm{DiceLoss}\left(M, Target_{mask}\right) \tag{9}$$

$$L_{cls} = \mathrm{FocalLoss}\left(P, Target_{label}\right) \tag{10}$$

where $L_{mask}$ is the mask loss, specifically the Dice loss; $Target_{mask}$ assigns the positive-sample indices the corresponding $GT_{mask}$, and negative samples do not participate in the mask loss. $L_{cls}$ is the classification loss, specifically the focal loss, where $P$ is the classification prediction; $Target_{label}$ assigns positive samples the corresponding $GT_{label}$ and negative samples 0, and both positive and negative samples contribute to the classification loss. The overall loss function is formulated in Eq. 11:

$$L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{mask} \tag{11}$$

where $L_{total}$ is the total loss, and $\lambda_1$ and $\lambda_2$ are the weights of the classification loss and mask loss, set to 1.0 and 3.0 in this paper, respectively. The overall training label assignment and loss calculation can be seen in Algorithm 1.

ALGORITHM 1: The label assignment method and loss calculation in SOLOv2.
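As a concrete reading of Eqs. 9-11, the sketch below combines a Dice mask loss and a focal classification loss with the weights λ1 = 1.0 and λ2 = 3.0 from the paper. The focal-loss α/γ values and all tensor shapes are assumptions (standard defaults), not values reported by the authors.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    # pred/target: (n, H, W) predicted mask probabilities and binary GT masks.
    inter = (pred * target).sum(dim=(1, 2))
    denom = (pred ** 2).sum(dim=(1, 2)) + (target ** 2).sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def focal_loss(logits, target, alpha: float = 0.25, gamma: float = 2.0):
    # logits/target: (N, classes) classification scores and one-hot labels.
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

lambda1, lambda2 = 1.0, 3.0  # classification / mask loss weights (Eq. 11)
pred_masks = torch.rand(5, 128, 128)                  # positive samples only
gt_masks = (torch.rand(5, 128, 128) > 0.5).float()
logits = torch.randn(40, 1)                           # one apple class
labels = (torch.rand(40, 1) > 0.9).float()            # positives and negatives
total = lambda1 * focal_loss(logits, labels) + lambda2 * dice_loss(pred_masks, gt_masks)
print(float(total))
```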
2.2 RGB-D-based apple localization method

To achieve precise apple localization, especially in scenarios with occlusion and overlapping, this paper proposes an RGB-D-based apple localization method. The method begins by employing the improved SOLOv2 apple instance segmentation method to obtain masks for the apples in an image. These masks are then combined with the depth maps generated by an RGB-D camera to accurately locate the points where apples can be picked. The overall workflow is depicted in Figure 5 and proceeds in the following steps (an illustrative code sketch of Steps 2-5 follows the list).

Step 1: Instance segmentation. Perform segmentation on the RGB image to obtain apple masks.

Step 2: Finding the minimum enclosing circle of the mask. Use OpenCV to compute the minimum enclosing circle of each segmented apple mask. This step ensures a better fit of the mask to the apple and avoids including excessive background information.

Step 3: Calculating the IoU between the mask and its minimum enclosing circle. To ensure that the pixel information of the apple is as complete as possible, and thereby raise the success rate of picking, compute this IoU to filter out the apples that are viable for picking in the current view. A higher IoU indicates fewer obscured parts of the apple; this paper adopts an IoU threshold of 0.5.

Step 4: Confirming that the central point of the minimum enclosing circle belongs to the apple. The center point of the minimum enclosing circle of the apple mask is selected as the picking point. To do so, verify that the pixel at the circle's center corresponds to the apple; if leaves or branches obstruct this point, picking is not viable from the current viewpoint.

Step 5: Calculating the picking point coordinates. If Steps 3 and 4 are satisfied, the viewpoint allows picking. The pixel coordinates, together with the corresponding depth information and the camera intrinsics, yield the three-dimensional coordinates (x, y, z) of the picking point in the camera coordinate system, as shown in Eqs. 12 and 13:

$$x = z \cdot \frac{u - u_0}{f_x} \tag{12}$$

$$y = z \cdot \frac{v - v_0}{f_y} \tag{13}$$

where $(u, v)$ are the pixel coordinates of the center of the minimum enclosing circle in the X and Y directions, $z$ is the depth at the circle center, and $u_0$, $v_0$, $f_x$, and $f_y$ are the camera intrinsics.

FIGURE 5: Flowchart of the RGB-D-based apple localization method.
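The following sketch strings Steps 2-5 together with OpenCV: minimum enclosing circle, mask-vs-circle IoU filtering at the 0.5 threshold, a center-pixel membership check, and back-projection via Eqs. 12 and 13. The function name, array conventions, and the millimetre depth unit are assumptions for illustration, not the authors' code.

```python
import cv2
import numpy as np

def picking_point(mask, depth, fx, fy, u0, v0, iou_thresh=0.5):
    """mask: (H, W) uint8 binary apple mask; depth: aligned depth map in mm.
    Returns the (x, y, z) picking point in the camera frame, or None."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    (u, v), r = cv2.minEnclosingCircle(max(contours, key=cv2.contourArea))

    # Step 3: IoU between the mask and its minimum enclosing circle.
    circle = np.zeros_like(mask)
    cv2.circle(circle, (int(u), int(v)), int(r), 1, -1)
    inter = np.logical_and(mask > 0, circle > 0).sum()
    union = np.logical_or(mask > 0, circle > 0).sum()
    if union == 0 or inter / union < iou_thresh:
        return None  # too occluded to pick from this viewpoint

    # Step 4: the circle center must itself lie on the apple mask.
    if mask[int(v), int(u)] == 0:
        return None  # center obscured by a leaf or branch

    # Step 5: back-project the center pixel with its depth (Eqs. 12, 13).
    z = float(depth[int(v), int(u)])
    x = z * (u - u0) / fx
    y = z * (v - v0) / fy
    return x, y, z
```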
3 Experiments

3.1 Dataset

The apple instance segmentation dataset constructed in this paper consists of two parts. One part is a public dataset that includes 3,925 apple images annotated with instance labels (Gené-Mola et al., 2023). This dataset covers two growth stages of apples: approximately 70% of the images are from the growth stage in which apples are primarily green, as shown in Figure 6A, and the remaining approximately 30% are from the ripening stage, in which apples are mostly light red, as shown in Figure 6B.

The other part of the dataset was collected from orchards and consists of 300 apple images annotated with instance labels using the Labelme tool. These images were captured during the ripe stage of apples, characterized by their red color, as illustrated in Figure 6C.

Lastly, an 8:2 data split ratio was employed to ensure the effective utilization of the training data: 80% of the data were used for training and validation, totaling 3,400 images, while the remaining 20% were reserved for testing, comprising 852 images. This division aims to avoid overfitting, thereby improving the generalization ability and robustness of the model.

FIGURE 6: Samples of the apple instance segmentation dataset.
3.2 Experimental setting

The hardware setup for the experiments in this study comprised an E5-2678 v3 CPU, 32 GB of RAM, and an NVIDIA 3090 GPU with 24 GB of VRAM. The operating system was Ubuntu 18.04, with Python 3.8, and the deep learning framework was PyTorch. The training configuration comprised 40 epochs with a batch size of 4. The SGD optimizer was used with an initial learning rate of 0.01, and learning rate adjustments were applied using the StepLR strategy, in which the learning rate was multiplied by 0.1 at the 16th and 32nd epochs. To accelerate model convergence, the backbone weights of all models were initialized with weights pre-trained on ImageNet-1K. The specific experimental settings are shown in Table 1.

TABLE 1: Experimental parameter settings.

| Hyperparameter | Setting |
| --- | --- |
| Batch size | 4 |
| Epochs | 40 |
| Learning rate (epochs 1-16) | 0.01 |
| Learning rate (epochs 16-32) | 0.01 × 0.1 |
| Learning rate (epochs 32-40) | 0.01 × 0.01 |
| Optimizer | SGD |
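The schedule in Table 1 maps directly onto PyTorch's MultiStepLR; a minimal sketch follows. The momentum value and the stand-in model are assumptions, since the paper lists only the optimizer, initial rate, and decay epochs.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 8, 3)  # stand-in for the improved SOLOv2
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[16, 32], gamma=0.1)

for epoch in range(40):
    # ... train one epoch with batch size 4, calling optimizer.step() ...
    scheduler.step()  # lr: 0.01 -> 0.001 (epoch 16) -> 0.0001 (epoch 32)
```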
3.3 Evaluation metrics

To evaluate the performance of the proposed method, AP (average precision), mAP (mean average precision), mIoU (mean intersection over union), and the F1 score are used to measure accuracy, while Params (parameters), FLOPs (floating-point operations), and FPS (frames per second) are used to measure model complexity. The calculation formulas are shown below:

$$Precision = \frac{TP}{all\ detections} \tag{14}$$

$$Recall = \frac{TP}{all\ GT\ boxes} \tag{15}$$

$$AP = \int_0^1 p(r)\, dr \tag{16}$$

$$mAP = \frac{\sum AP}{N} \tag{17}$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{18}$$

$$mIoU = \frac{1}{N} \sum \frac{TP}{TP + FP + FN} \tag{19}$$

where $TP$ denotes the number of correctly detected targets among all detected targets, $FP$ denotes the number of incorrectly detected targets among all detected targets, $FN$ indicates the number of incorrectly classified negative samples, $p(r)$ stands for the Precision-Recall curve, and $N$ represents the number of categories in the dataset.

FLOPs and Params are critical metrics for evaluating model complexity and speed. FLOPs measure the amount of computation, and Params indicate the number of learnable parameters in the network. Larger computational and parameter counts typically mean higher model complexity and slower detection speed. Therefore, a model intended for edge devices, such as apple-picking robots in orchards, should have fewer parameters and a lower computational burden.
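For reference, the sketch below computes a mask-level IoU and an F1 score from TP/FP/FN counts as defined in Eqs. 14-19; the toy masks and counts are illustrative assumptions.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool);   gt[15:45, 15:45] = True
print(mask_iou(pred, gt), f1_score(tp=58, fp=6, fn=9))
```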
3.4 Experimental results of the improved method

The improved SOLOv2 is trained on the constructed apple instance segmentation dataset, and the model is evaluated every epoch. The training loss curve and the test set mAP curve are shown in Figure 7, where red represents the mAP curve and green represents the loss curve.

As shown in Figure 7, the model's loss value gradually decreases and stabilizes as training progresses, while the mAP metric steadily increases, indicating that the model is progressively converging. Selecting the weights from the last epoch as the final result, the mAP on the test set of the apple instance segmentation dataset reaches 90.1%, demonstrating that the proposed method achieves high precision and recall in apple instance segmentation tasks and that the model's overall performance is excellent.

FIGURE 7: Training loss curve and mAP curve of the improved SOLOv2 model.
3.5 Comparative experiments with other instance segmentation methods

To verify the effectiveness and advancement of the proposed method, it is compared with other mainstream instance segmentation methods: the original SOLOv2 method before improvement, the one-stage instance segmentation method Yolact (Bolya et al., 2019), and the two-stage instance segmentation methods Mask R-CNN (He et al., 2020) and MS-RCNN (Huang et al., 2019). The mAP, mIoU, and F1 scores of the various segmentation models are depicted in Figure 8. It can be observed that, compared with the other segmentation models, the improved SOLOv2 achieves the highest scores.

FIGURE 8: Comparison of F1 score, mIoU, and mAP of different segmentation models.

TABLE 2: Comparative experimental results of mIoU, mAP, F1 score, Params, FLOPs, and FPS for different segmentation models.

| Methods | F1 (%) | mAP (%) | mIoU (%) | FLOPs (GFLOPs) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- | --- |
| Mask R-CNN | 88.3 | 87.8 | 82.1 | 186 | 43.97 | 28.2 |
| MS-RCNN | 87.9 | 88.0 | 80.5 | 225 | 60.23 | 26.7 |
| Yolact | 86.2 | 85.7 | 75.3 | 61.427 | 34.73 | 51.4 |
| SOLOv2 | 86.1 | 86.5 | 79.4 | 178 | 46.23 | 30.2 |
| Improved SOLOv2 | 88.5 | 90.1 | 83.2 | 147 | 44.29 | 29.5 |

According to the results in Table 2, the improved SOLOv2 instance segmentation model performs best in the F1 score, mIoU, and mAP metrics, reaching 88.5, 83.2, and 90.1%, respectively. Compared to the original method, these three metrics improved by 2.4, 3.8, and 3.6%, respectively, highlighting the effectiveness of the improved method. Compared with the two-stage models Mask R-CNN and MS-RCNN, the improved SOLOv2 model improved the F1 score by 0.2 and 0.6%, mIoU by 1.1 and 2.7%, and mAP by 2.3 and 2.1%, respectively. Compared to the one-stage model Yolact, the improved SOLOv2 model significantly improved all accuracy metrics, including a 7.9% improvement in mIoU and improvements of 2.3 and 4.4% in F1 score and mAP, respectively. These results highlight the superior precision and recall achieved by the proposed method, resulting in more effective instance segmentation.

Furthermore, the improved SOLOv2 apple instance segmentation method is also competitive in Params, FLOPs, and FPS. Compared to the original method, it reduces Params by 1.94 M and FLOPs by 31 GFLOPs while keeping the detection speed almost unchanged, with a slight decrease of 0.7 frames per second. Compared to Mask R-CNN, Params remain similar, but FLOPs decrease by 39 GFLOPs and FPS increases by 1.3. Compared to MS-RCNN, Params and FLOPs are significantly reduced, by 15.94 M and 78 GFLOPs, respectively, with FPS increasing by 2.8. Although Yolact performs best in the speed-related metrics, the proposed method significantly improves segmentation accuracy. Overall, the proposed method strikes a balance between model accuracy and complexity, performing excellently in apple instance segmentation tasks.
Figure9 displays a comparison of Precision-Recall (P-R) curves
for each method within the apple category. e red curve represents
the proposed enhanced SOLOv2 instance segmentation method.
Notably, the red curve encompasses the largest area, and even at high
recall rates, it sustains a remarkable level of accuracy. ese ndings
underscore the enhanced method’s ability to attain superior precision
FIGURE7
Training loss curve and mAP curve of the improved SOLOv2 model.
TABLE1 Experimental parameter settings.
Hyperparameters Setting
Batch size 4
Epoch 40
Learning rate Epoch 1–16 0.01
Epoch 16–32 0.01*0.1
Epoch 32–40 0.01*0.01
Optimizer SGD
Tang et al. 10.3389/fsufs.2024.1403872
Frontiers in Sustainable Food Systems 10 frontiersin.org
and recall, showcasing improved stability and performance when
contrasted with other methods.
Figure10 illustrates a comparison of segmentation results between
the enhanced SOLOv2 and other methods on the test set of the apple
instance segmentation dataset. Notably, the improved SOLOv2
maintains accurate segmentation even in scenarios where apples are
closely spaced. In Figure10C, SOLOv2 exhibits segmentation errors
when distinguishing overlapping objects, failing to separate the two
instances. Moreover, in Figure 10D, MaskRCNN experiences
segmentation omission issues with overlapping objects. However,
Figure10B illustrates that these issues were substantially addressed
following the improvements. e improved model can accurately
segment and dierentiate overlapping instances. is further
underscores the eectiveness of the proposed lightweight spatial
attention module, which excels at distinguishing objects based on
their spatial characteristics when semantic features pose challenges
in dierentiation.
3.6 Ablation study

To further validate the impact of the improvements on model performance, this section conducts ablation experiments to assess the effectiveness of both the backbone feature extraction network and the lightweight attention module. First, we replace the original ResNet50 in the SOLOv2 backbone with EfficientNetV2 while keeping all other aspects unchanged, to evaluate how the improved backbone feature extraction network influences model performance. Subsequently, we conduct experiments that introduce the proposed lightweight attention module into the mask feature branch alone, into the mask kernel branch alone, and into both branches simultaneously, to assess the impact of the proposed lightweight attention module. The results of the ablation experiments are shown in Table 3.

As shown in Table 3, changing the backbone feature network to EfficientNetV2 results in a 0.5% increase in the F1 score and a 0.2% increase in mAP; additionally, EfficientNetV2's parameter-efficient design enhances the computational efficiency of the model. Performance also improves when the lightweight spatial attention module is introduced separately into the mask feature branch and the mask kernel branch. Specifically, adding the attention module to the mask feature branch increases mAP by 1%, while incorporating it into the mask kernel branch yields a 1.2% improvement in the F1 score and a 2.3% improvement in mAP. Adding the attention module to both branches simultaneously yields even more significant effects, with the F1 score improving by 2.4% and mAP by 3.4%. This clearly demonstrates that the proposed lightweight spatial attention module significantly enhances the precision of apple instance segmentation.

TABLE 3: Ablation experiment results (baseline: SOLOv2 with ResNet50).

| EfficientNetV2 | Att-Block (mask feature branch) | Att-Block (mask kernel branch) | F1 (%) | mAP (%) |
| --- | --- | --- | --- | --- |
| × | × | × | 86.1 | 86.5 |
| ✓ | × | × | 86.6 | 86.7 |
| ✓ | ✓ | × | 86.6 | 87.7 |
| ✓ | × | ✓ | 87.8 | 89.0 |
| ✓ | ✓ | ✓ | 88.5 | 90.1 |
3.7 Positioning error analysis

To validate the localization accuracy of the proposed RGB-D-based apple localization method, 20 sets of RGB images and their corresponding depth maps, covering about 60 apples in total, were captured using the RealSense L515 depth camera. The true picking point of an apple is defined as the three-dimensional camera coordinates $(x, y, z)$ obtained by combining the pixel coordinates of the manually annotated center of the apple's bounding rectangle, the camera intrinsic parameters, and the corresponding depth information. Subsequently, the improved SOLOv2 instance segmentation method and the depth-based apple localization method are used to derive the predicted three-dimensional coordinates $(\hat{x}, \hat{y}, \hat{z})$ of the apple's estimated picking point. Finally, the error between the predicted and true picking points is calculated to assess the positioning accuracy. Table 4 presents some true picking points, predicted picking points, and their absolute errors. Figure 11 shows box plots of the positioning errors in the X, Y, and Z directions for the approximately 60 apples.

TABLE 4: The positioning error of some picking points (all values in mm).

| $x$ | $y$ | $z$ | $\hat{x}$ | $\hat{y}$ | $\hat{z}$ | $\|x-\hat{x}\|$ | $\|y-\hat{y}\|$ | $\|z-\hat{z}\|$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20.04 | 47.83 | 733.75 | 20.04 | 47.83 | 733.75 | 0 | 0 | 0 |
| 43.68 | 53.3 | 801 | 43.66 | 53.3 | 801 | 0.02 | 0 | 0 |
| 109.13 | 12.59 | 892 | 109.51 | 12.58 | 892 | 0.38 | 0.01 | 0 |
| 87.21 | 76.57 | 823.75 | 89.17 | 75.05 | 823.75 | 1.96 | 1.52 | 0 |
| 132.2 | 104.1 | 923.5 | 130.1 | 106.2 | 923.8 | 2.1 | 2.1 | 0.3 |
| 113.29 | 149.57 | 791 | 114.77 | 148.24 | 791 | 1.48 | 1.33 | 0 |
| 279.69 | 75.82 | 653.75 | 281.47 | 74.4 | 653.75 | 1.78 | 1.42 | 0 |
| 132.23 | 104.11 | 923.5 | 130.12 | 106.19 | 923.75 | 2.11 | 2.08 | 0.25 |

FIGURE 11: X, Y, Z direction positioning errors.
Figure 11 displays the median errors represented by red line
segments. e median positioning errors in the X and Y directions are
less than 1.5 mm. Furthermore, the median positioning error in the Z
direction approaches zero, with a maximum Z-direction positioning
error of approximately 1 mm. ese observations demonstrate that the
proposed RGB-D-based apple-picking point localization method
attains remarkable precision, fullling practical picking needs.
Figure12 illustrates the process of apple-picking point localization.
Figure12A shows the original image, while Figure12B displays the
instance segmentation result. Figure12C shows the pickable apples
aer IoU ltering and conrmation of the depth information of the
center point, where the blue circles indicate the pickable apples and
the red circles indicate the non-pickable apples. Figure12D presents
the localization results of picking points in the camera coordinate
system, obtained by combining depth information and camera
intrinsic parameters with coordinates measured in meters. It can
beobserved from the gure that the proposed RGB-D-based picking
point localization method eectively achieves accurate apple
localization. Furthermore, when the depth information at the center
of the bounding circle of the apple segmentation mask does not
correspond to the apple category, the localization method can provide
correct feedback.
4 Conclusion

The orchard environment is complex, and detection- and semantic segmentation-based methods exhibit lower accuracy in recognizing and localizing overlapping or occluded apples, while detection-based instance segmentation methods such as Mask R-CNN rely heavily on detection results and do not consider global features. Therefore, this study introduces a high-precision method based on RGB-D data and an improved SOLOv2 instance segmentation method for orchard apple recognition and picking point localization. This method does not rely on detection results, performs well in the face of occlusion, and can accurately locate the apple-picking point. The specific conclusions of this research are as follows:

(1) An improved SOLOv2 high-precision apple instance segmentation method is introduced. To enhance the efficiency of the instance segmentation network, EfficientNetV2, which has a highly parameter-efficient design, is adopted as the backbone feature extraction network. For scenarios involving overlapping or occluded apples, whose semantic features are quite similar, we introduce a lightweight spatial attention module to improve segmentation accuracy. This module increases position sensitivity, allowing objects to be distinguished by positional features even when their semantic features are similar. Through comparative experimental analysis, the improved SOLOv2 instance segmentation method performs exceptionally well, achieving the highest F1 score and mAP on the apple instance segmentation dataset, 88.5 and 90.1%, respectively. Furthermore, compared to the original version, the model's parameter count and computational load decreased by 1.94 M and 31 GFLOPs, respectively.

(2) To achieve precise apple-picking point localization, an RGB-D-based apple localization method is proposed. First, the pickable apples are filtered by the IoU between each mask and its minimum enclosing circle, and it is then determined whether the center point of that circle belongs to the apple category. Finally, the 3D coordinates of the picking point are obtained from the depth information at the center point and the camera's intrinsic parameters. Experimental verification on the 60 collected samples indicates that the median
TABLE4 The positioning error of some picking points, in which the data unit is mm.
x
y
z
x
y
z
xx
yy
zz
20.04 47.83 733.75 20.04 47.83 733.75 0 0 0
43.68 53.3 801 43.66 53.3 801 0.02 0 0
109.13 12.59 892 109.51 12.58 801 0.38 0.01 0
87.21 76.57 823.75 89.17 75.05 823.75 1.96 1.52 0
132.2 104.1 923.5 130.1 106.2 923.8 2.1 2.1 0.3
113.29 149.57 791 114.77 148.24 791 1.48 1.33 0
279.69 75.82 653.75 281.47 74.4 653.75 1.78 1.42 0
132.23 104.11 923.5 130.12 106.19 923.75 2.11 2.08 0.25
FIGURE11
X, Y, Z direction positioning error.
Tang et al. 10.3389/fsufs.2024.1403872
Frontiers in Sustainable Food Systems 13 frontiersin.org
localization errors in the X and Y directions are less than 1.5 mm, while the median error in the Z direction is close to 0. Moreover, the maximum error in the Z direction is approximately 1 mm, demonstrating high accuracy.

In the future, given the high cost of obtaining instance segmentation data and the real-time performance requirements of the models, we will focus on in-depth research in two critical areas: data generation and model lightweighting. This will enable practical applications on edge devices and embedded systems.
Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions

ST: Conceptualization, Funding acquisition, Supervision, Writing – original draft, Writing – review & editing. ZX: Data curation, Methodology, Writing – original draft, Writing – review & editing, Conceptualization, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Software, Supervision, Validation, Visualization. JG: Funding acquisition, Investigation, Writing – original draft, Writing – review & editing. WW: Validation, Investigation, Writing – original draft, Writing – review & editing. ZH: Writing – review & editing, Writing – original draft, Visualization, Validation, Formal analysis. WZ: Writing – review & editing, Validation, Visualization, Software.
Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded by the Key Project of the Jiangsu Province Key Research and Development Program (No. BE2021016-3), the Jiangsu Agricultural Science and Technology Independent Innovation Fund Project (No. CX (22) 3016), and the Key R&D Program (Agricultural Research and Development) Project in Yancheng City (No. YCBN202309).
Acknowledgments

The authors express their gratitude to the editors and reviewers for their invaluable comments and suggestions.
Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References

Ahmad, M. T., Greenspan, M., Asif, M., and Marshall, J. A. (2018). Robust apple segmentation using fuzzy logic. 5th International Multi-Topic ICT Conference: Technologies for Future Generations, IMTIC 2018—Proceedings. 1–5.

Bolya, D., Zhou, C., Xiao, F., and Lee, Y. J. (2019). YOLACT: real-time instance segmentation. Proceedings of the IEEE International Conference on Computer Vision. 9157–9166.

Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017). DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848. doi: 10.1109/TPAMI.2017.2699184

Chen, W., Zhang, J., Guo, B., Wei, Q., and Zhu, Z. (2021). An apple detection method based on Des-YOLO v4 algorithm for harvesting robots in complex environment. Math. Probl. Eng. 2021, 1–12. doi: 10.1155/2021/7351470

Gai, R., Chen, N., and Yuan, H. (2023). A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 35, 13895–13906. doi: 10.1007/s00521-021-06029-z

Gené-Mola, J., Ferrer-Ferrer, M., Gregorio, E., Blok, P. M., Hemming, J., Morros, J. R., et al. (2023). Looking behind occlusions: a study on amodal segmentation for robust on-tree apple fruit size estimation. Comput. Electron. Agric. 209:107854. doi: 10.1016/j.compag.2023.107854

Gené-Mola, J., Gregorio, E., Guevara, J., Auat, F., Sanz-Cortiella, R., Escolà, A., et al. (2019). Fruit detection in an apple orchard using a mobile terrestrial laser scanner. Biosyst. Eng. 187, 171–184. doi: 10.1016/j.biosystemseng.2019.08.017

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2020). Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42, 386–397. doi: 10.1109/TPAMI.2018.2844175

Hu, T., Wang, W., Gu, J., Xia, Z., Zhang, J., and Wang, B. (2023). Research on apple object detection and localization method based on improved YOLOX and RGB-D images. Agronomy 13:1816. doi: 10.3390/agronomy13071816

Huang, Z., Huang, L., Gong, Y., Huang, C., and Wang, X. (2019). Mask scoring R-CNN. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 6409–6418.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.

Jia, W., Tian, Y., Luo, R., Zhang, Z., Lian, J., and Zheng, Y. (2020). Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot. Comput. Electron. Agric. 172:105380. doi: 10.1016/j.compag.2020.105380

Jia, W., Wang, Z., Zhang, Z., Yang, X., Hou, S., and Zheng, Y. (2022a). A fast and efficient green apple object detection model based on Foveabox. J. King Saud Univ.-Comput. Inf. Sci. 34, 5156–5169. doi: 10.1016/j.jksuci.2022.01.005

Jia, W., Zhang, Z., Shao, W., Hou, S., Ji, Z., Liu, G., et al. (2021). FoveaMask: a fast and accurate deep learning model for green fruit instance segmentation. Comput. Electron. Agric. 191:106488. doi: 10.1016/j.compag.2021.106488

Jia, W., Zhang, Z., Shao, W., Ji, Z., and Hou, S. (2022b). RS-Net: robust segmentation of green overlapped apples. Precis. Agric. 23, 492–513. doi: 10.1007/s11119-021-09846-3

Kang, H., and Chen, C. (2019). Fruit detection and segmentation for apple harvesting using visual sensor in orchards. Sensors 19:4599. doi: 10.3390/s19204599

Kang, H., and Chen, C. (2020). Fruit detection, segmentation and 3D visualisation of environments in apple orchards. Comput. Electron. Agric. 171:105302. doi: 10.1016/j.compag.2020.105302

Kang, H., Wang, X., and Chen, C. (2022). Accurate fruit localisation using high resolution LiDAR-camera fusion and instance segmentation. Comput. Electron. Agric. 203:107450. doi: 10.1016/j.compag.2022.107450

Kang, H., Zhou, H., Wang, X., and Chen, C. (2020). Real-time fruit recognition and grasping estimation for robotic apple harvesting. Sensors 20:5670. doi: 10.3390/s20195670

Kong, T., Sun, F., Liu, H., Jiang, Y., Li, L., and Shi, J. (2020). FoveaBox: beyond anchor-based object detection. IEEE Trans. Image Process. 29, 7389–7398. doi: 10.1109/TIP.2020.3002345

Li, Q., Jia, W., Sun, M., Hou, S., and Zheng, Y. (2021). A novel green apple segmentation algorithm based on ensemble U-Net under complex orchard environment. Comput. Electron. Agric. 180:105900. doi: 10.1016/j.compag.2020.105900

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2117–2125.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI. 234–241.

Tan, M., and Le, Q. V. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. 6105–6114.

Tan, M., and Le, Q. V. (2021). EfficientNetV2: smaller models and faster training. International Conference on Machine Learning. 10096–10106.

Tan, M., Pang, R., and Le, Q. V. (2020). EfficientDet: scalable and efficient object detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 10781–10790.

Tian, Y., Duan, H., Luo, R., Zhang, Y., Jia, W., Lian, J., et al. (2019). Fast recognition and location of target fruit based on depth information. IEEE Access 7, 170553–170563. doi: 10.1109/ACCESS.2019.2955566

Wang, D., and He, D. (2022a). Apple detection and instance segmentation in natural environments using an improved mask scoring R-CNN model. Front. Plant Sci. 13:1016470. doi: 10.3389/fpls.2022.1016470

Wang, D., and He, D. (2022b). Fusion of mask RCNN and attention mechanism for instance segmentation of apples under complex background. Comput. Electron. Agric. 196:106864. doi: 10.1016/j.compag.2022.106864

Wang, X., Kang, H., Zhou, H., Au, W., Wang, M. Y., and Chen, C. (2023). Development and evaluation of a robust soft robotic gripper for apple harvesting. Comput. Electron. Agric. 204:107552. doi: 10.1016/j.compag.2022.107552

Wang, X., Kong, T., Shen, C., Jiang, Y., and Li, L. (2020a). SOLO: segmenting objects by locations. Computer Vision—ECCV 2020. 649–665.

Wang, W., Zhang, Y., Gu, J., and Wang, J. (2022). A proactive manufacturing resources assignment method based on production performance prediction for the smart factory. IEEE Trans. Ind. Inform. 18, 46–55. doi: 10.1109/TII.2021.3073404

Wang, X., Zhang, R., Kong, T., Li, L., and Shen, C. (2020b). SOLOv2: dynamic and fast instance segmentation. Advances in Neural Information Processing Systems. 17721–17732.

Wu, L., Ma, J., Zhao, Y., and Liu, H. (2021). Apple detection in complex scene using the improved YOLOv4 model. Agronomy 11:476. doi: 10.3390/agronomy11030476

Xia, Z., Gu, J., Wang, W., and Huang, Z. (2023). Research on a lightweight electronic component detection method based on knowledge distillation. Math. Biosci. Eng. 20, 20971–20994. doi: 10.3934/mbe.2023928

Xia, Z., Gu, J., Zhang, K., Wang, W., and Li, J. (2022). Research on multi-scene electronic component detection algorithm with anchor assignment based on K-means. Electronics 11:514. doi: 10.3390/electronics11040514

Zhang, S., Chi, C., Yao, Y., Lei, Z., and Li, S. Z. (2020). Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 9759–9768.

Zou, K., Ge, L., Zhou, H., Zhang, C., and Li, W. (2022). An apple image segmentation method based on a color index obtained by a genetic algorithm. Multimed. Tools Appl. 81, 8139–8153. doi: 10.1007/s11042-022-11905-4
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
As an essential part of electronic component assembly, it is crucial to rapidly and accurately detect electronic components. Therefore, a lightweight electronic component detection method based on knowledge distillation is proposed in this study. First, a lightweight student model was constructed. Then, we consider issues like the teacher and student's differing expressions. A knowledge distillation method based on the combination of feature and channel is proposed to learn the teacher's rich class-related and inter-class difference features. Finally, comparative experiments were analyzed for the dataset. The results show that the student model Params (13.32 M) are reduced by 55%, and FLOPs (28.7 GMac) are reduced by 35% compared to the teacher model. The knowledge distillation method based on the combination of feature and channel improves the student model's mAP by 3.91% and 1.13% on the Pascal VOC and electronic components detection datasets, respectively. As a result of the knowledge distillation, the constructed student model strikes a superior balance between model precision and complexity, allowing for fast and accurate detection of electronic components with a detection precision (mAP) of 97.81% and a speed of 79 FPS.
Article
Full-text available
The vision-based fruit recognition and localization system is the basis for the automatic operation of agricultural harvesting robots. Existing detection models are often constrained by high complexity and slow inference speed, which do not meet the real-time requirements of harvesting robots. Here, a method for apple object detection and localization is proposed to address the above problems. First, an improved YOLOX network is designed to detect the target region, with a multi-branch topology in the training phase and a single-branch structure in the inference phase. The spatial pyramid pooling layer (SPP) with serial structure is used to expand the receptive field of the backbone network and ensure a fixed output. Second, the RGB-D camera is used to obtain the aligned depth image and to calculate the depth value of the desired point. Finally, the three-dimensional coordinates of apple-picking points are obtained by combining two-dimensional coordinates in the RGB image and depth value. Experimental results show that the proposed method has high accuracy and real-time performance: F1 is 93%, mean average precision (mAP) is 94.09%, detection speed can reach 167.43 F/s, and the positioning errors in X, Y, and Z directions are less than 7 mm, 7 mm, and 5 mm, respectively.
Article
The detection and sizing of fruits with computer vision methods is of interest because it provides relevant information to improve the management of orchard farming. However, the presence of partially occluded fruits limits the performance of existing methods, making reliable fruit sizing a challenging task. While previous fruit segmentation works limit segmentation to the visible region of fruits (known as modal segmentation), in this work we propose an amodal segmentation algorithm to predict the complete shape, which includes its visible and occluded regions. To do so, an end-to-end convolutional neural network (CNN) for simultaneous modal and amodal instance segmentation was implemented. The predicted amodal masks were used to estimate the fruit diameters in pixels. Modal masks were used to identify the visible region and measure the distance between the apples and the camera using the depth image. Finally, the fruit diameters in millimetres (mm) were computed by applying the pinhole camera model. The method was developed with a Fuji apple dataset consisting of 3925 RGB-D images acquired at different growth stages with a total of 15,335 annotated apples, and was subsequently tested in a case study to measure the diameter of Elstar apples at different growth stages. Fruit detection results showed an F1-score of 0.86 and the fruit diameter results reported a mean absolute error (MAE) of 4.5 mm and R² = 0.80 irrespective of fruit visibility. Besides the diameter estimation, modal and amodal masks were used to automatically determine the percentage of visibility of measured apples. This feature was used as a confidence value, improving the diameter estimation to MAE = 2.93 mm and R² = 0.91 when limiting the size estimation to fruits detected with a visibility higher than 60%. The main advantages of the present methodology are its robustness for measuring partially occluded fruits and the capability to determine the visibility percentage. The main limitation is that depth images were generated by means of photogrammetry methods, which limits the efficiency of data acquisition. To overcome this limitation, future works should consider the use of commercial RGB-D sensors. The code and the dataset used to evaluate the method have been made publicly available at https://github.com/GRAP-UdL-AT/Amodal_Fruit_Sizing.
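The pinhole-model sizing step can be sketched in a few lines; the focal length and measurements below are illustrative placeholders, not values from the cited work:

```python
# Convert an amodal mask diameter in pixels to millimetres using the
# fruit's depth and the pinhole camera model.
def diameter_mm(diameter_px: float, depth_mm: float, focal_px: float) -> float:
    """Pinhole model: real size = pixel size * depth / focal length."""
    return diameter_px * depth_mm / focal_px

# Example: an 80 px amodal diameter at 900 mm depth, 1500 px focal length.
print(diameter_mm(80, 900.0, 1500.0))  # 48.0 mm
```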
Article
To enable apple-picking robots to quickly and accurately detect apples under the complex backgrounds of orchards, we propose an improved You Only Look Once version 4 (YOLOv4) model and data augmentation methods. Firstly, web crawler technology is utilized to collect pertinent apple images from the Internet for labeling. To address the problem of insufficient image data caused by random occlusion between leaves, a leaf illustration data augmentation method is proposed in this paper in addition to traditional data augmentation techniques. Secondly, because of the large size and computational cost of the YOLOv4 model, the backbone network Cross Stage Partial Darknet53 (CSPDarknet53) of the YOLOv4 model is replaced by EfficientNet, and a convolution layer (Conv2D) is added to the three outputs to further adjust and extract the features, making the model lighter and reducing its computational complexity. Finally, the apple detection experiment is performed on 2670 expanded samples. The test results show that the EfficientNet-B0-YOLOv4 model proposed in this paper has better detection performance than YOLOv3, YOLOv4, and Faster R-CNN with ResNet, which are state-of-the-art apple detection models. The average values of Recall, Precision, and F1 reach 97.43%, 95.52%, and 96.54%, respectively, and the average detection time per frame is 0.338 s, which shows that the proposed method can be well applied in the vision systems of picking robots in the apple industry.
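A leaf-overlay augmentation in the spirit of the described method might look like the following sketch, assuming Pillow and a folder of leaf cutouts with alpha channels; the paths, scale range, and leaf count are illustrative:

```python
# Paste randomly scaled/rotated leaf cutouts over an apple image to
# simulate occlusion by foliage (illustrative sketch, not the paper's code).
import random
from PIL import Image

def leaf_occlusion_augment(image_path, leaf_paths, n_leaves=3):
    """Return an RGB image with n_leaves leaf cutouts composited on top."""
    img = Image.open(image_path).convert("RGBA")
    for _ in range(n_leaves):
        leaf = Image.open(random.choice(leaf_paths)).convert("RGBA")
        scale = random.uniform(0.1, 0.3)            # leaf size vs. image width
        w = max(1, int(img.width * scale))
        h = max(1, int(leaf.height * w / leaf.width))
        leaf = leaf.resize((w, h)).rotate(random.uniform(0, 360), expand=True)
        x = random.randint(0, max(0, img.width - leaf.width))
        y = random.randint(0, max(0, img.height - leaf.height))
        img.alpha_composite(leaf, dest=(x, y))      # alpha keeps the leaf shape
    return img.convert("RGB")
```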
Article
The accurate detection and segmentation of apples during the growth stage are essential for yield estimation, timely harvesting, and retrieving growth information. However, factors such as uncertain illumination, overlaps and occlusions of apples, the homochromatic background, and the gradual change in the ground color of apples from green to red bring great challenges to the detection and segmentation of apples. To solve these problems, this study proposed an improved Mask Scoring region-based convolutional neural network (Mask Scoring R-CNN), known as MS-ADS, for accurate apple detection and instance segmentation in a natural environment. First, ResNeSt, a variant of ResNet, combined with a feature pyramid network, was used as the backbone network to improve the feature extraction ability. Second, the high-level architectures, including the R-CNN head and mask head, were modified to improve the utilization of high-level features. Convolutional layers were added to the original R-CNN head to improve the accuracy of bounding box detection (bbox_mAP), and the Dual Attention Network was added to the original mask head to improve the accuracy of instance segmentation (mask_mAP). The experimental results showed that the proposed MS-ADS model effectively detected and segmented apples under various conditions, such as apples occluded by branches, leaves, and other apples, apples with different ground colors and shadows, and apples divided into parts by branches and petioles. The recall, precision, false detection rate, and F1 score were 97.4%, 96.5%, 3.5%, and 96.9%, respectively. A bbox_mAP and mask_mAP of 0.932 and 0.920, respectively, were achieved on the test set, and the average run-time was 0.27 s per image. These results indicate that the MS-ADS method detects and segments apples in the orchard robustly and accurately, with real-time performance. This study lays a foundation for follow-up work such as yield estimation, harvesting, and automatic, long-term acquisition of apple growth information.
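For reference, the reported F1 score is consistent with the stated precision and recall: F1 = 2PR/(P + R) = 2 × 0.965 × 0.974 / (0.965 + 0.974) ≈ 0.969, i.e., the 96.9% given above.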
Article
Accurate depth sensing is crucial to securing a high success rate of robotic harvesting in natural orchard environments. Solid-state LiDAR, a recently introduced type of LiDAR sensor, can perceive high-resolution geometric information of a scene, which can be utilised to obtain accurate depth information. Meanwhile, the fusion of sensory data from the LiDAR and the camera can significantly enhance the sensing ability of harvesting robots. This work first introduces a LiDAR-camera fusion-based visual sensing and perception strategy to perform accurate fruit localisation in apple orchards. Two state-of-the-art LiDAR-camera extrinsic calibration methods are evaluated to obtain an accurate extrinsic matrix between the LiDAR and the camera. After that, the point clouds and colour images are fused to perform fruit localisation using a one-stage instance segmentation network. In addition, comprehensive experiments show that the LiDAR-camera combination achieves better visual sensing performance in natural environments, and that introducing LiDAR-camera fusion largely improves the accuracy and robustness of fruit localisation. Specifically, the standard deviations of fruit localisation using the LiDAR-camera at 0.5, 1.2, and 1.8 m are 0.253, 0.230, and 0.285 cm, respectively, during the afternoon under intensive sunlight. This measurement error is much smaller than that from the RealSense D455. Lastly, visualised point clouds of the apple trees are provided to demonstrate the highly accurate sensing results of the proposed LiDAR-camera fusion method.
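The core fusion step, projecting LiDAR points into the image plane with an extrinsic matrix T (LiDAR to camera) and an intrinsic matrix K, can be sketched as follows; both matrices below are placeholders, not calibration results:

```python
# Project LiDAR points into pixel coordinates given extrinsics and
# intrinsics (sketch only; no handling of points behind the camera).
import numpy as np

def project_lidar_to_image(points_lidar, T, K):
    """points_lidar: (N, 3) XYZ in the LiDAR frame. Returns (N, 2) pixels."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])  # homogeneous (N, 4)
    pts_cam = (T @ pts_h.T).T[:, :3]                    # camera frame (N, 3)
    pts_img = (K @ pts_cam.T).T                         # image plane (N, 3)
    return pts_img[:, :2] / pts_img[:, 2:3]             # perspective divide

T = np.eye(4)                          # placeholder extrinsic calibration
K = np.array([[900., 0., 640.],
              [0., 900., 360.],
              [0., 0., 1.]])           # placeholder intrinsics
print(project_lidar_to_image(np.array([[0.2, 0.1, 1.5]]), T, K))  # [[760. 420.]]
```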
Article
Achieving multi-scene electronic component detection is the key to automatic electronic component assembly, and the study of deep-learning-based multi-scene electronic component detection methods is an important research focus. Current object detection methods use many anchors, which often leads to extremely unbalanced positive and negative samples during training and requires manual adjustment of thresholds to divide positive and negative samples. Besides, existing methods often yield complex models with many parameters and high computational complexity. To address these issues, a new method was proposed for the detection of electronic components in multiple scenes. Firstly, a new dataset was constructed to describe the multi-scene electronic component scene. Secondly, a K-Means-based two-stage adaptive division strategy was used to solve the imbalance of positive and negative samples. Thirdly, EfficientNetV2 was selected as the backbone feature extraction network to make the method simpler and more efficient. Finally, the proposed algorithm was evaluated on both the public dataset and the constructed multi-scene electronic component dataset. Its performance was outstanding compared to current mainstream object detection algorithms: the proposed method achieved the highest mAP (83.20% and 98.59%), lower FLOPs (44.26 GMac), and smaller Params (29.3 M).
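One plausible reading of the K-Means-based division is to cluster anchor-to-ground-truth IoUs into two groups and treat the higher-IoU cluster as positives; the sketch below, assuming scikit-learn, illustrates that idea only and may differ from the paper's exact two-stage strategy:

```python
# Split anchors into positives/negatives by clustering their IoUs
# (illustrative interpretation, not the paper's exact method).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_split(ious):
    """ious: (N,) IoU of each anchor with its best-matched ground truth."""
    km = KMeans(n_clusters=2, n_init=10).fit(ious.reshape(-1, 1))
    hi = np.argmax(km.cluster_centers_.ravel())  # cluster with higher IoU
    return km.labels_ == hi                      # boolean mask of positives

ious = np.array([0.05, 0.10, 0.55, 0.62, 0.08, 0.71])
print(kmeans_split(ious))  # [False False  True  True False  True]
```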
Article
Fruit harvesting is facing challenges due to a labour shortage that has become more severe since the recent pandemic. Robotic harvesting has been attempted for autonomous fruit harvesting tasks such as apple harvesting. However, current apple-harvesting robots show limited harvesting performance in the orchard environment owing to the inefficiency of their grippers. This research presents a fruit harvesting method that includes a novel soft robotic gripper and a detachment strategy to achieve apple harvesting in a natural orchard. The soft robotic gripper includes four tapered soft robotic fingers (SRFs) and one multi-mode suction cup. The SRF is customised to avoid interference with obstacles during grasping, and its compliance and force exertion are comprehensively evaluated with FEA and experiments. The multi-mode suction cup can provide suction adhesion force, perform active extrusion/withdrawal, and present a passive compliance mode. A simultaneous twist-and-pull motion is proposed and implemented to detach the apples from the trees. The proposed robotic gripper is compact, compliant in grasping apples, and generates a large grasping force. The method is finally validated in a natural orchard, achieving detachment, damage, and harvesting rates of 75.6%, 4.55%, and 70.77%, respectively.
Article
Fruit object detection is crucial for automatic harvesting systems, serving applications such as orchard yield measurement and fruit harvesting. In order to achieve fast recognition and localization of green apples and meet the real-time working requirements of the vision systems of harvesting robots, a fast optimized Foveabox detection model (Fast-FDM) is proposed. Fast-FDM uses an optimized form of the anchor-free Foveabox to accurately and efficiently detect green apples in harvesting environments. Specifically, the EfficientNetV2-S, with fast training and small size, is used as the backbone network; a weighted bi-directional feature pyramid network (BiFPN) is employed as the feature extraction network to fuse multi-scale features easily and quickly; and the fused features are then fed to the fovea head prediction network for classification and bounding box prediction. Furthermore, an adaptive training sample selection (ATSS) method is adopted to directly select positive and negative samples, allowing green fruits of different scales to obtain higher recall and achieve more accurate green apple detection. Experimental results show that the proposed Fast-FDM realizes a mean average precision (mAP) of 62.3% for green apple detection using fewer parameters and floating-point operations (FLOPs), achieving a better trade-off between accuracy and detection efficiency.
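The ATSS rule referenced above (Zhang et al., 2020) can be sketched as follows, simplified to a single ground-truth box and one feature level; the full method also requires candidate anchor centres to fall inside the ground-truth box:

```python
# ATSS-style positive sample selection: take the k anchors closest to the
# ground-truth centre, then keep those whose IoU exceeds mean + std of the
# candidate IoUs (simplified sketch).
import numpy as np

def atss_positives(anchor_centers, anchor_ious, gt_center, k=9):
    """anchor_centers: (N, 2); anchor_ious: (N,) IoU with the GT box."""
    dists = np.linalg.norm(anchor_centers - gt_center, axis=1)
    candidates = np.argsort(dists)[:k]               # k closest anchors
    cand_ious = anchor_ious[candidates]
    thr = cand_ious.mean() + cand_ious.std()         # adaptive IoU threshold
    return candidates[cand_ious >= thr]              # positive anchor indices

# Toy example: 5 anchors, one ground truth centred at (10, 10).
centers = np.array([[10., 10.], [12., 9.], [30., 40.], [11., 11.], [50., 5.]])
ious = np.array([0.80, 0.70, 0.60, 0.75, 0.0])
print(atss_positives(centers, ious, np.array([10., 10.]), k=4))  # [0]
```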
Article
It is important to precisely segment apples in an orchard during the growth period to obtain accurate growth information. However, complex environmental factors and growth characteristics, such as fluctuating illumination, overlapping and occlusion of apples, the gradual change in the ground colour of apples from green to red, and the similarities between immature apples and background leaves, affect apple segmentation accuracy. The purpose of this study was to develop a precise apple instance segmentation method based on an improved Mask region-based convolutional neural network (Mask RCNN). An existing Mask RCNN model was improved by fusing an attention module into the backbone network to enhance its feature extraction ability. A combination of deformable convolution and transformer attention with the key-content-only term was used as the attention module in this study. The experimental results showed that the improved Mask RCNN can accurately segment apples under various conditions, such as apples with shadows and different ground colours, overlapped apples, and apples occluded by branches and leaves. A recall, precision, F1 score, and segmentation mAP of 97.1%, 95.8%, 96.4%, and 0.917, respectively, were achieved, and the average run-time on the test set was 0.25 s per image. Our method outperformed the two comparison methods, indicating that it can accurately segment apples in the growth stage with near real-time performance. This study lays the foundation for realizing accurate fruit detection and long-term automatic growth monitoring.
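The deformable-convolution half of such an attention module can be sketched with torchvision.ops.DeformConv2d; the transformer-attention term is omitted here, and channel counts and shapes are illustrative:

```python
# Deformable convolution: a small conv predicts 2 (x, y) offsets per
# kernel position, letting the sampling grid bend toward object shape.
import torch
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 64, 64, 3
offset_pred = torch.nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

x = torch.randn(1, in_ch, 32, 32)
y = deform(x, offset_pred(x))   # offsets are predicted from the input itself
print(y.shape)                  # torch.Size([1, 64, 32, 32])
```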