Citation: Qiao, Y.; Hu, Y.; Zheng, Z.; Yang, H.; Zhang, K.; Hou, J.; Guo, J. A Counting Method of Red Jujube Based on Improved YOLOv5s. Agriculture 2022, 12, 2071. https://doi.org/10.3390/agriculture12122071

Academic Editors: Vadim Bolshev, Vladimir Panchenko and Alexey Sibirev

Received: 10 October 2022; Accepted: 30 November 2022; Published: 2 December 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
A Counting Method of Red Jujube Based on
Improved YOLOv5s
Yichen Qiao 1, Yaohua Hu 2,*, Zhouzhou Zheng 1, Huanbo Yang 1, Kaili Zhang 1, Juncai Hou 1,* and Jiapan Guo 3,4
1College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling 712100, China
2College of Optical, Mechanical, and Electrical Engineering, Zhejiang A&F University,
Hangzhou 311300, China
3Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen,
9747 AG Groningen, The Netherlands
4Data Science Center in Health (DASH), University Medical Center Groningen, University of Groningen,
9713 GZ Groningen, The Netherlands
*Correspondence: huyaohua@zafu.edu.cn (Y.H.); houjuncai@nwsuaf.edu.cn (J.H.);
Tel.: +86-15291680166 (Y.H.); +86-18792954818 (J.H.)
Abstract: Due to complex environmental factors such as illumination, shading between leaves and fruits, and shading between fruits, it is a challenging task to quickly identify and count red jujubes in orchards. A counting method of red jujube based on an improved YOLOv5s was proposed, which realized the fast and accurate detection of red jujubes and reduced the model scale and estimation error. ShuffleNet V2 was used as the backbone of the model to improve the detection ability and lighten the model. In addition, the Stem, a novel data loading module, was proposed to prevent the loss of information due to the change in feature map size. PANet was replaced by BiFPN to enhance the feature fusion capability and improve the accuracy of the model. Finally, the improved YOLOv5s detection model was used to count red jujubes. The experimental results showed that the overall performance of the improved model was better than that of YOLOv5s. Compared with YOLOv5s, the number of parameters and the model size of the improved model were 6.25% and 8.33% of those of the original network, and the Precision, Recall, F1-score, AP, and Fps were improved by 4.3%, 2.0%, 3.1%, 0.6%, and 3.6%, respectively. In addition, the RMSE and MAPE decreased by 20.87% and 5.18%, respectively. Therefore, the improved model has advantages in memory occupation and recognition accuracy, and the method provides a basis for the visual estimation of red jujube yield.
Keywords: red jujube counting; red jujube; improved YOLOv5s; ShuffleNet V2 unit; Stem; BiFPN
1. Introduction
Chinese red jujube is a characteristic fruit that is famous for its various nutritional ingredients [1]. With the increasing demand for red jujubes, it is more and more important to count red jujubes so as to provide a basis for the estimation of jujube yield through vision. Due to the increasing supply of red jujubes, counting red jujubes will play an important role in planting and production management. Therefore, it is of great significance to realize the counting of red jujubes, and it will help improve the utilization rate of red jujubes. Moreover, the development of artificial intelligence provides a new way to solve the problem of low fruit production efficiency [2].
It is an important task of orchard management to estimate fruit yield by counting the number of fruits. Deep learning has become a potential tool for counting fruits, as it enables automatic feature extraction from datasets. At the same time, by extracting the basic parameters of crop growth, intelligent agricultural technology enables farmers to estimate crop yield and thus reasonably arrange the production and processing of red jujubes [3]. Machine learning methods, such as the watershed algorithm [4] and the Kalman filter algorithm [5], are widely used to count fruit. However, because supervised machine learning methods cannot capture the nonlinear relationship between input and output variables under the uncertainty of the crop environment, it is difficult for traditional machine learning methods to develop a reliable crop counting model. In recent years, however, technological progress has made it possible to develop advanced crop counting models using deep learning. Liu et al. [6] proposed a lightweight target detection model, YOLOv5-CS, which realized the object detection and accurate counting of green citrus in the natural environment; the mAP of the model was 98.23%. Zhang et al. [7] used the YOLOX target detection network to detect and count holly fruits, and the mAP was 95%.
Owing to the improvement of computer hardware and the development of computer vision technology, deep learning has been widely used in various industries [8–10]. Object detection algorithms based on deep learning mainly fall into two categories: two-stage and one-stage. The first category comprises detection algorithms based on candidate regions, such as R-CNN (Region-based Convolutional Neural Networks) [11], Fast R-CNN [12], and Faster R-CNN [13]. The second category regards the detection of the target position as a regression problem and directly applies a CNN (Convolutional Neural Network) to the image, such as SSD (Single Shot MultiBox Detector) [14,15] and YOLO (You Only Look Once) [16–19].
Computer vision technology has also been widely used in various fields [20–23]. Image processing technology is one of the key technologies in precision agriculture, and it is mainly used in classification, localization, and yield prediction [24]. Mulyono et al. [25] proposed a texture extraction method based on a gray-level co-occurrence matrix followed by a K-nearest neighbor classifier for the classification of litchi. Sutarno et al. [26] adopted a similar idea to extract texture information and then used the learning vector quantization (LVQ) algorithm as the classifier to classify durian based on color, shape, and texture. This method had difficulty detecting subtle feature changes among different fruits, and the accuracy of fruit classification was 89%. Zhao et al. [27] proposed a matching algorithm that used the sum of absolute transformed differences (SATD) for fruit detection, followed by a support vector machine (SVM) classifier; the recognition accuracy reached more than 83%. Dorj et al. [4] proposed a method for forecasting citrus yield, which preprocessed images by color space conversion and denoising and then recognized, detected, and counted citrus with the watershed segmentation algorithm. Other researchers have also studied fruit classification, identification, and counting based on shape-invariant moments [28], decision trees [29], and the Hough transform [30] combined with the texture and color of fruits. The above methods use single features or multi-feature combinations of the texture, shape, size, and color differences of fruits to recognize them. The recognition accuracy is about 93% in complex environments with light changes, fruit overlap, leaf occlusion, etc. In addition, traditional machine learning algorithms are limited by the classifiers they rely on, and it is difficult for such algorithms to complete the object detection of fruit in a complex environment [31].
In complex orchard environments with occlusion between fruits and leaves, image transformation, and background switching, deep learning-based object detection algorithms can solve these problems quickly and effectively with their powerful learning ability and feature representation capability. Fu et al. [32] proposed a deep convolutional neural network detection model in which an improved Faster R-CNN was trained end-to-end using backpropagation, the stochastic gradient descent algorithm, and ZFNet (Zeiler and Fergus networks) for kiwifruit detection. The experiment showed that the model could improve the accuracy of fruit recognition to 96%. Liu et al. [33] fused RGB and NIR images to identify kiwifruit with VGG16; the average detection precision was 90.7%, and the detection time was 0.134 s per image. Wang et al. [34] proposed an improved lightweight SSD detection network. The model used a modified DenseNet as the backbone to replace the first three additional layers in SSD and incorporated a multi-level fusion structure. Compared with the original model, the number of parameters of the improved model was reduced by 11.14 × 10^6, and the average precision was increased by 2.02%. Classical deep learning networks have been successful in fruit identification and detection, with the advantages of high accuracy and efficiency. However, these networks are relatively large, which is not conducive to the application of mobile equipment in the agricultural field. Many researchers have already studied lightweight models. For instance, Li et al. [35] applied an adaptive spatial pyramid to detect green peppers, and the accuracy of YOLOv4-tiny reached 96.11%. Zhang et al. [31] used MobileNet-v3 as the feature extraction network of YOLOv4-LITE; the improved model reduced the model size and improved the detection speed. Therefore, it is feasible to reduce the weight of a model while ensuring its detection precision.
A lightweight model will be beneficial to the application of agricultural mobile equipment and help realize the intelligence of agricultural equipment. In order to ensure the detection accuracy of the model in complex unstructured orchards while counting fruit, a counting method of red jujube based on an improved YOLOv5s was proposed. The main goal of this research was to reduce the size of the model while ensuring its detection accuracy and speed on an embedded device. The effectiveness of counting red jujubes in a complex environment was comprehensively considered from four aspects in this research:
(1) ShuffleNet V2 was used as the backbone of the network to extract feature maps and make the model lightweight.
(2) The Stem, a novel data loading module, was proposed to reduce data information loss and improve model detection accuracy.
(3) The original PANet (Path Aggregation Network) structure was replaced with BiFPN (Bidirectional Feature Pyramid Network) for multi-scale feature fusion to enhance the feature fusion capability and improve the accuracy of the model.
(4) The improved YOLOv5s detection model was used to count red jujubes.
The second section introduces the method of making the dataset, the improved red jujube detection algorithm, the counting method of red jujubes, and the training of the network. The third section presents the test results of the model and the analysis compared with other algorithms. In the last section, the counting method of red jujubes is summarized and future work is discussed.
2. Materials and Methods
In this section, the acquisition and production of the dataset were mainly introduced. Then, a detection algorithm for red jujube based on the improved YOLOv5s was proposed, and a counting method for red jujubes was presented. Finally, the training method of the network was introduced, as shown in Figure 1.
Figure 1. A counting method of red jujube based on improved YOLOv5s.
2.1. Image Data Acquisition
The dataset of red jujube in this study, including Jun jujube and Gray jujube, was collected from a red jujube orchard in Alar City, Xinjiang, China, from 5 October to 9 October. Images of Jun jujube and Gray jujube were taken in a jujube orchard of the 13th company of a group in Alar City, Xinjiang Uygur Autonomous Region. In order to ensure the reliability of the experimental results, the jujube images were collected under different illumination at 9:00 a.m., 3:00 p.m., and 9:00 p.m. The resolution of the images was 1080 × 1920 pixels, with a total of 1026 original images, which included illumination changes, leaf shading, and fruit overlap. In order to improve the robustness of the model, each image contained one or more of these scenarios. The distribution of the dataset is shown in Table 1.
Table 1. Distribution of the dataset of red jujubes.

Dataset | Grey Jujube | Jun Jujube | Total Number
Illumination change images | 136 | 190 | 326
Leaf shading images | 132 | 225 | 357
Fruit overlap images | 139 | 204 | 343
Total Number | 407 | 619 | 1026
2.2. Data Preprocessing and Augmentation
The quality of the dataset affects the recognition performance of the target detection model: the more sufficient and comprehensive the dataset is, the better the generalization ability and robustness of the model. Therefore, the number of samples was expanded by data augmentation. In order to realistically simulate the imaging of red jujube in a complex environment and apply it to the detection network, this research used OpenCV in Python to compress and crop the images to 640 × 640. Then, the images were randomly enhanced by different image processing methods [36], such as rotating by 180°, mirroring, adding salt-and-pepper noise with the threshold set to 0.5, and changing the image brightness with the factor set to 1.3 and 0.7, as shown in Figure 2. The random image processing was repeated on each image many times. After enhancement, a total of 10,000 images were obtained as the dataset of the model.
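To make the augmentation pipeline concrete, a minimal sketch in Python with OpenCV is given below. The helper name, the noise proportion, and the use of cv2.convertScaleAbs for the brightness factors are our assumptions, not the authors' exact code.

```python
import random
import cv2
import numpy as np

def augment(image):
    """Randomly apply one of the augmentations described above to a BGR image."""
    choice = random.choice(["rotate", "mirror", "noise", "brighten", "darken"])
    if choice == "rotate":                       # rotate by 180 degrees
        return cv2.rotate(image, cv2.ROTATE_180)
    if choice == "mirror":                       # horizontal mirroring
        return cv2.flip(image, 1)
    if choice == "noise":                        # salt-and-pepper noise (ratio assumed)
        out = image.copy()
        mask = np.random.rand(*image.shape[:2])
        out[mask < 0.0025] = 0                   # pepper pixels
        out[mask > 0.9975] = 255                 # salt pixels
        return out
    factor = 1.3 if choice == "brighten" else 0.7
    return cv2.convertScaleAbs(image, alpha=factor)  # brightness scaling

img = cv2.imread("jujube.jpg")
img = cv2.resize(img, (640, 640))                # compress/crop to 640 x 640
aug = augment(img)
```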
Figure 2. Image samples after data preprocessing and augmentation. (a) original image; (b) rotation by 180°; (c) increased brightness; (d) mirrored image; (e) added noise; (f) reduced brightness.

2.3. Image Annotation and Dataset Division
In this research, LabelImg was used to label the red jujubes in the dataset with manually drawn rectangular boxes, as shown in Figure 3. The dataset was divided into 80% training, 10% validation, and 10% test sets; the final numbers of images in the training, validation, and test sets were 8000, 1000, and 1000, respectively.

Figure 3. LabelImg dataset annotation.
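The 80/10/10 division described above can be reproduced with a short script such as the following sketch; the directory layout, file extension, and random seed are illustrative assumptions.

```python
import random
from pathlib import Path

# A minimal sketch of the 80/10/10 split; paths and seed are assumptions.
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)

n = len(images)                        # 10,000 images after augmentation
n_train, n_val = int(0.8 * n), int(0.1 * n)
train = images[:n_train]               # 8000 images
val = images[n_train:n_train + n_val]  # 1000 images
test = images[n_train + n_val:]        # 1000 images

for name, split in [("train", train), ("val", val), ("test", test)]:
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in split))
```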
2.4. Methodologies
The YOLO series is effective in single-stage object detection, and its miniature detection models guarantee relatively high accuracy while taking into account faster speed and fewer parameters. Therefore, the lightweight detection models of the YOLO series are more suitable for embedded devices and the development of mobile agricultural equipment. However, due to the complexity of the agricultural production environment and the harsh working conditions, a simple detection algorithm can hardly meet the needs of agricultural production. Based on YOLOv5s, the original backbone network was replaced by the ShuffleNet V2 backbone in this research, which significantly reduced the number of parameters of the network. The Focus module was replaced by the Stem to resist the partial loss of information from the feature map. PANet was replaced by BiFPN to enhance the feature fusion capability and improve the accuracy of the model. Finally, the improved YOLOv5s detection network was used to recognize the images and count red jujubes.
2.4.1. YOLOv5s Network
YOLOv5 is improved by adding some new ideas on the basis of YOLOv4, and its detection accuracy and speed have been greatly improved. YOLOv5 can be divided into four types according to the size of the model: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, among which the YOLOv5s model is the smallest. YOLOv5s mainly consists of four parts: Input, Backbone, Neck, and Prediction.
In order to improve the speed and accuracy of the network, Mosaic data augmentation is used in YOLOv5 to stitch images by random cropping, scaling, and arranging. YOLOv5s uses adaptive anchor box calculation to set the initial anchor boxes for different datasets and calculates the difference between the bounding boxes and the ground truth. YOLOv5s updates the anchor boxes in the reverse iteration to adaptively calculate the best anchor boxes for different training sets. To adapt to different image sizes in the dataset, YOLOv5 uses adaptive image scaling to pad the scaled image with the least amount of black border, which reduces the computation and improves the speed. The Backbone performs information extraction on the feature maps, and it mainly includes the Focus, CBS, and C3 modules. The input image is sliced by the Focus module and convolved by one convolution with 32 kernels, as shown in Figure 4. CBS consists of a convolution, a batch normalization, and the SiLU activation. The SiLU is defined as follows:

SiLU(x) = \frac{x}{1 + \exp(-x)}  (1)

where x represents the feature map.
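As an illustration of the CBS block and Equation (1), a minimal PyTorch sketch follows; the class name and default arguments are ours, not the official YOLOv5 source.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic block described above (a sketch)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # SiLU(x) = x / (1 + exp(-x)) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 640, 640)
print(CBS(3, 32, k=3, s=1)(x).shape)  # torch.Size([1, 32, 640, 640])
```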
Figure 4. Focus structure.
As a new structure of BottleneckCSP, C3 contains three CBS modules and several Bottlenecks. C3 is used repeatedly in YOLOv5s to extract more information. As shown in Figure 5, the SPP (spatial pyramid pooling) module introduces three different pooling kernels of 5 × 5, 9 × 9, and 13 × 13, and it concatenates the resulting feature maps to expand the receptive field, which effectively separates the most important features and improves the accuracy of the model.
Figure 5. SPP structure.
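The SPP module of Figure 5 can be sketched in a few lines of PyTorch; the channel sizes in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling with 5x5, 9x9, and 13x13 kernels (a sketch)."""
    def __init__(self):
        super().__init__()
        # stride 1 with padding k//2 keeps the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 9, 13)
        )

    def forward(self, x):
        # concatenate the original map with the three pooled maps along channels
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

x = torch.randn(1, 256, 20, 20)
print(SPP()(x).shape)  # torch.Size([1, 1024, 20, 20])
```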
To utilize most of the backbone information, the Neck of YOLOv5 uses FPN + PAN. The Feature Pyramid Network (FPN) solves the problem of different input feature map sizes by constructing an image pyramid on the feature maps. PAN, the innovation of the Path Aggregation Network (PANet) [37], downsamples the image from FPN and then concatenates the feature maps. To improve the ability of image recognition and localization, FPN conveys the semantic features of the image from the top down, while PAN conveys the localization features of the image from the bottom up.
There are several regression loss functions used in object detection tasks, such as the Smooth L1 loss [16], IoU loss [38], GIoU loss [39], DIoU loss [40], and CIoU loss [41]. In the Prediction part, YOLOv5 uses the CIoU loss as the loss function of the bounding box. The CIoU loss function is defined as follows:

L_{CIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha\upsilon  (2)

where IOU represents the intersection over union of the prediction box and the object box, b represents the center point of the prediction box, b^{gt} represents the center point of the object box, \rho^2(b, b^{gt}) represents the squared Euclidean distance between the center points of the prediction box and the object box, c represents the diagonal length of the smallest box enclosing the two boxes, \alpha represents a positive trade-off parameter, and \upsilon represents the consistency of the aspect ratio.
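For reference, Equation (2) can be implemented as follows for boxes in (x1, y1, x2, y2) format; this is a sketch of the standard CIoU formulation rather than the exact YOLOv5 code.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (x1, y1, x2, y2) tensors; a sketch of Equation (2)."""
    # intersection over union
    iw = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    ih = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = iw * ih
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((pred[..., 0] + pred[..., 2]) - (target[..., 0] + target[..., 2])) ** 2 / 4 \
         + ((pred[..., 1] + pred[..., 3]) - (target[..., 1] + target[..., 3])) ** 2 / 4
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency v and trade-off parameter alpha
    v = (4 / math.pi ** 2) * (
        torch.atan((target[..., 2] - target[..., 0]) / (target[..., 3] - target[..., 1] + eps))
        - torch.atan((pred[..., 2] - pred[..., 0]) / (pred[..., 3] - pred[..., 1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```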
2.4.2. ShuffleNet V2 Backbone
YOLOv5s reduces the parameters of the model through C3 and improves the speed of the model, but C3 is very complicated, involves a large amount of calculation, and still needs a lot of memory. Therefore, a lightweight YOLOv5 model based on ShuffleNet V2 was designed, which greatly reduced the model parameters. The ShuffleNet V2 backbone was designed by using ShuffleNet V2 units [42], and the backbone of the original model was replaced by the ShuffleNet V2 backbone.
As a lightweight convolutional neural network suitable for mobile devices, ShuffleNet V2 was first proposed in 2018. Compared with ShuffleNet V1, ShuffleNet V2 adopts channel shuffle and divides the feature channels into two parts, ensuring that the input and output channels are the same; one part enters the bottleneck branch, and the other part is passed through unchanged. Since excessive pointwise convolution increases computational complexity, ShuffleNet V2 replaces the grouped pointwise convolution with standard pointwise convolution. ShuffleNet V2 puts the channel shuffle after the channel concatenation to prevent fragmentation of the model, and it replaces element-wise operators with concatenation to reduce the detection time of the model. The basic units of ShuffleNet V2 are divided into two types, as shown in Figure 6. In the first unit, the channels of the input feature map are divided into two branches; one branch connects directly to the concatenation, while the other branch contains two 1 × 1 pointwise convolution layers and a 3 × 3 group convolution layer. The convolution layers contain a batch normalization layer and ReLU. The other basic unit of ShuffleNet V2 differs from the previous one in that the identity branch is replaced by two convolution layers: a 3 × 3 group convolution layer with a stride of 2 and a 1 × 1 pointwise convolution layer. Finally, the two branches of the same size are spliced together. In order to extract information from feature maps of different sizes, the ShuffleNet V2 backbone was designed to replace the original backbone by using 16 ShuffleNet V2 units in YOLOv5s.
Figure 6. The structure of ShuffleNet V2 units. (a) the structure of ShuffleNet V2 Unit 1; (b) the structure of ShuffleNet V2 Unit 2.
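A minimal PyTorch sketch of the stride-1 unit in Figure 6a is given below; the stride-2 unit of Figure 6b, which transforms both branches, is omitted for brevity, and the layer hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups, the key operation of ShuffleNet V2."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleUnit1(nn.Module):
    """Stride-1 unit (Figure 6a): split channels, transform one half, concat, shuffle."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)  # one half is passed through unchanged
        return channel_shuffle(torch.cat((a, self.branch(b)), dim=1))

x = torch.randn(1, 64, 80, 80)
print(ShuffleUnit1(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```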
2.4.3. Stem Construction
Inception-v4 [43] was proposed in 2017 and confirmed that residual connectivity largely accelerates the training of Inception networks. With reference to the design idea of Inception-v4, its stem rapidly reduces the resolution of the input feature maps, and the network ultimately achieved a top-5 error rate of 3.08% on ILSVRC. The feature map is continuously reduced from 299 × 299 to 35 × 35 by the stem in the Inception-v4 network, and it has many convolution layers, which is better for feature extraction in complex tasks.
However, detecting a single target such as red jujube is a simpler task, and such a design would cause excessive calculation, so the module could be pruned to reduce the parameters of the model. The Stem proposed in this research is shown in Figure 7. Inspired by the idea of fast feature map resolution reduction, four CBS modules were adopted to make the size of the feature map suitable for the network, where 3 × 3 convolutions with a stride of 2 were used in the first and third CBS, and 1 × 1 convolutions were used in the second and fourth CBS. In contrast to the Focus module, which slices the feature map into small feature maps before concatenation, the Stem used two 3 × 3 convolutions with a stride of 2 to reduce the feature map size and concatenated the result with the feature map of the maximum pooling layer, so that the number of parameters was reduced while the feature extraction ability of the network and the accuracy were improved.
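The following sketch shows one plausible reading of the Stem described above, with two 3 × 3 stride-2 CBS blocks, two 1 × 1 CBS blocks, and a parallel max-pooling branch; the exact branch arrangement and channel widths are our assumptions based on the text and Figure 7.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):  # Conv + BatchNorm + SiLU, as in Section 2.4.1
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.m = nn.Sequential(nn.Conv2d(c1, c2, k, s, k // 2, bias=False),
                               nn.BatchNorm2d(c2), nn.SiLU())
    def forward(self, x):
        return self.m(x)

class Stem(nn.Module):
    """A sketch of the Stem; the branch arrangement is our reading of the text."""
    def __init__(self, c_in=3, c_out=32):
        super().__init__()
        self.cbs1 = CBS(c_in, c_out, k=3, s=2)        # first 3x3 stride-2 downsampling
        self.cbs2 = CBS(c_out, c_out // 2, k=1)       # 1x1 channel reduction
        self.cbs3 = CBS(c_out // 2, c_out, k=3, s=2)  # second 3x3 stride-2 downsampling
        self.pool = nn.MaxPool2d(2, 2)                # parallel max-pooling branch
        self.cbs4 = CBS(2 * c_out, c_out, k=1)        # 1x1 fusion after concatenation

    def forward(self, x):
        x = self.cbs1(x)
        y = self.cbs3(self.cbs2(x))  # convolutional downsampling branch
        z = self.pool(x)             # pooling branch keeps complementary information
        return self.cbs4(torch.cat((y, z), dim=1))

print(Stem()(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 160, 160])
```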
Figure 7. The structure of the Stem.

2.4.4. BiFPN
With the deepening of the network, the semantic information of image features gradually changes from low-dimensional to high-dimensional. As shown in Figure 8, the PANet structure was used to fuse the multi-scale features of images in the original YOLOv5s detection network. In order to improve the detection accuracy of red jujubes, BiFPN, a weighted bidirectional feature pyramid network, was applied to the detection of red jujubes. Compared with the traditional feature fusion network, BiFPN introduces weights to make the fusion more sensitive to important features and makes better use of feature information at different scales.
Figure 8. Bi-directional feature fusion networks. (a) PANet bi-directional feature fusion network; (b) BiFPN bi-directional feature fusion network.
In this research, BiFPN was introduced in the neck of YOLOv5s, as shown in Figure 9. A node that has only one input edge and performs no feature fusion contributes little to the feature fusion of the network, so deleting such a node has little effect on network feature fusion. When the original input node and the output node were in the same layer, an extra edge was added between the output node and the input node, and feature fusion was realized without adding much computational overhead. Different from the PANet structure of YOLOv5s, when performing feature fusion, each bidirectional path was used as a feature network layer, and the feature network layer was reused at the same level, thus realizing a higher level of feature fusion.
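The weighted fusion that distinguishes BiFPN from plain concatenation can be sketched as the "fast normalized fusion" below; the module name and the example shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of n same-shaped feature maps (a sketch).
    Learnable non-negative weights make the fusion sensitive to important inputs."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)            # keep weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so weights sum to ~1
        return sum(w[i] * x for i, x in enumerate(inputs))

p1, p2 = torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)
fused = WeightedFusion(2)([p1, p2])
print(fused.shape)  # torch.Size([1, 128, 40, 40])
```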
Figure 9. The structure of the improved YOLOv5s model.
2.4.5. Counting Method of Red Jujube
The counting method of red jujubes was based on the improved jujube target detection algorithm, and this research used ROS to count red jujubes. The detection steps were as follows: (1) starting the ROS core and publishing topics; (2) using the improved YOLOv5s to detect jujube fruits and obtain the target detection boxes and corresponding features; (3) counting the number of target detection boxes, as shown in Figure 10a. The detection results are shown in Figure 10b.
Figure 10. Counting method of red jujube. (a) the process of the red jujube counting method; (b) the results of the jujube counting method.
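Steps (1)–(3) can be sketched as a minimal ROS node in Python; the topic name, message type, and the detect() wrapper around the improved YOLOv5s are illustrative assumptions.

```python
# A minimal sketch of the counting node, assuming a detector wrapper `detect(image)`
# that returns the list of predicted boxes; the topic name is illustrative.
import cv2
import rospy
from std_msgs.msg import Int32

def count_jujubes(image_path, publisher):
    boxes = detect(cv2.imread(image_path))  # step (2): run the improved YOLOv5s
    publisher.publish(Int32(len(boxes)))    # step (3): the count is the number of boxes
    return len(boxes)

rospy.init_node("jujube_counter")           # step (1): start the node after `roscore`
pub = rospy.Publisher("/jujube/count", Int32, queue_size=1)
```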
2.5. Test Platform
The experiment was conducted on the improved YOLOv5s architecture with PyTorch based on Python 3.8. The details of the experimental setup are shown in Table 2.

Table 2. Experimental environment.

Configuration | Parameter
CPU | Intel(R) Core(TM) i7-10700K
GPU | NVIDIA GeForce RTX 3070
Accelerated environment | CUDA 11.1, cuDNN 8.2.1
Development environment | PyCharm 2021.3.2
Operating system | Windows 10

The batch size was 4, and the number of epochs was 400. The adaptive moment estimation algorithm (Adam) was used to optimize the model. The initial learning rate was 0.001, and the momentum was 0.9. The model weights were saved after each training epoch, and the best weights were also retained.
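The training configuration reported above corresponds to an Adam setup such as the following sketch, where the momentum of 0.9 is taken as Adam's first-moment coefficient; model, train_one_epoch, and is_best are assumed placeholders.

```python
import torch

# A sketch of the optimizer settings reported above; `model` is the improved YOLOv5s.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

for epoch in range(400):                       # epochs = 400, batch size = 4 in the loader
    train_one_epoch(model, optimizer)          # assumed training helper
    torch.save(model.state_dict(), "last.pt")  # weights saved after every epoch
    if is_best(model):                         # assumed validation check
        torch.save(model.state_dict(), "best.pt")
```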
2.6. Evaluation of Model Performance
In order to evaluate the performance of our red jujube model, Precision (P), Recall (R), Average Precision (AP), Parameters, Model Size, and detection speed (Fps) were chosen in this article, and the root mean square error (RMSE) and mean absolute percentage error (MAPE) were used as evaluation indexes of the jujube counting, where Precision, Recall, F1-score, RMSE, and MAPE were defined as follows:

Precision = \frac{TP}{TP + FP} \times 100\%  (3)

Recall = \frac{TP}{TP + FN} \times 100\%  (4)

F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}  (5)

RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}  (6)

MAPE = \frac{1}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%  (7)

where TP represents the number of true positive samples, FP represents the number of false positive samples, and FN represents the number of false negative samples. The variable y_i represents the actual number of red jujubes in each image, \hat{y}_i represents the number of red jujubes predicted by the model for each image, and m represents the number of image samples.
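Equations (6) and (7) can be computed directly from the per-image counts; the sketch below reproduces the RMSE and MAPE of our model in Table 5 from its actual and predicted counts.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Equation (6): root mean square error of the per-image counts."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Equation (7): mean absolute percentage error of the per-image counts."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example with the six test images of Table 5 (actual vs. our model's predictions)
print(rmse([10, 15, 15, 9, 10, 6], [10, 15, 13, 9, 9, 6]))  # ~0.91
print(mape([10, 15, 15, 9, 10, 6], [10, 15, 13, 9, 9, 6]))  # ~3.89
```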
3. Results and Discussion
3.1. Performance Comparison Using the Different Improved Methods
As shown in Table 3, Recall and Precision were based on a threshold of 0.5. As one of the important indicators for evaluating the model, the larger the area under the Precision-Recall curve, the higher the AP of the model.
Table 3. The model performance with different modules.

Model | Precision (%) | Recall (%) | F1-Score (%) | AP (%) | Parameters | Model Size (KB) | Fps
YOLOv5s | 89.10 | 90.30 | 89.70 | 95.60 | 7,063,542 | 14,052 | 35.10
YOLOv5s + Stem | 87.60 | 93.90 | 90.60 | 96.00 | 7,281,341 | 14,026 | 38.40
YOLOv5s + BiFPN | 88.60 | 90.90 | 89.70 | 95.30 | 7,063,542 | 14,052 | 39.40
YOLOv5s + ShuffleNet V2 | 83.80 | 91.60 | 87.50 | 94.00 | 490,205 | 1322 | 35.50
YOLOv5s + Stem + BiFPN | 89.70 | 94.50 | 92.00 | 96.20 | 7,281,341 | 14,026 | 39.40
YOLOv5s + Stem + ShuffleNet V2 | 93.70 | 89.20 | 91.40 | 95.90 | 441,606 | 1149 | 36.30
YOLOv5s + BiFPN + ShuffleNet V2 | 83.40 | 92.10 | 87.50 | 94.10 | 490,205 | 1322 | 35.50
Our model | 93.40 | 92.30 | 92.80 | 96.20 | 441,606 | 1149 | 36.50
When ShuffleNet V2 was used as the backbone of the network, the model parameters were reduced by a factor of 14.41, and the Fps increased from 35.10 to 35.50; the improved network could thus reduce model parameters and increase detection speed. When BiFPN was applied to the red jujube detection network, the experimental results showed that BiFPN improved the average accuracy of the network without increasing the number of parameters, while also improving the detection speed of the model: the average accuracy increased by 0.20%, and the Fps increased to 39.40. Therefore, BiFPN could enhance the feature fusion ability of YOLOv5s and speed up the detection of the model. When the Focus module was replaced by the Stem, the improved network improved in Recall, F1-score, AP, model size, and Fps, among which the Recall increased by 3.60%; thus, the Stem is more effective than Focus in jujube detection. When the Stem and BiFPN were used at the same time, the AP increased by 0.6% compared with YOLOv5s, but the parameters increased, which increased the computational pressure on the test equipment. When the Stem and ShuffleNet V2 were applied at the same time, the parameters were greatly reduced compared with YOLOv5s, but the detection accuracy was also lower. Our method not only reduced the model parameters but also improved the detection accuracy. The parameters and model size of the improved model were 6.25% and 8.33% of those of the original network, respectively, and the Precision, Recall, F1-score, AP, and Fps were increased by 4.30%, 2.00%, 3.10%, 0.60%, and 3.99%, respectively.
As a lightweight network model, YOLOv5s has high accuracy and can meet the detection of small targets in complex environments, but it can hardly satisfy the identification and localization of red jujubes under limited computation. When locating and recognizing overlapping fruits, the original YOLOv5s tended to identify two mutually occluded red jujubes as one, as shown in Figure 11b. The main reason was that the differences between mutually occluded fruits were small, and the original YOLOv5s did not extract enough feature information about them, causing false detections. In the recognition of small red jujube targets, the original YOLOv5s easily missed red jujubes that were obscured by a large area of leaves or that appeared small because the camera was too far away, as shown in Figure 11e. The main reason was that the outdoor environment was complex and the appearance of the red jujubes varied greatly. The improved model could accurately detect red jujubes, including occluded ones, as shown in Figure 11c, and the number of missed jujubes was obviously smaller than that of the original YOLOv5s, as shown in Figure 11f.
Figure 11. The results of different algorithms for the recognition of red jujube. (a) the original image of a dense jujube sample; (b) the detection of dense jujubes by the original model; (c) the detection of dense jujubes by the improved model; (d) the original image of leaf-obscured jujubes; (e) the detection of leaf-obscured jujubes by the original model; (f) the detection of leaf-obscured jujubes by the improved model. The red boxes are the label boxes marked manually, and the blue boxes are the detection results of the model.
3.2. Performance Comparison Using Different Lightweight Backbone Networks
In order to embed the model in mobile devices, the ShuffleNet V2 backbone network was used in YOLOv5s in this research. MobileNet V3, as the improved version of MobileNet V1 and MobileNet V2, brings a large improvement in detection efficiency. In order to verify the detection performance of the improved model, the MobileNet V3 network was used as the backbone of YOLOv5s and compared with the improved YOLOv5s using the ShuffleNet V2 backbone and with the original YOLOv5s. The results show that after adopting MobileNet V3 as the backbone, the network has a large improvement in Precision but a large decrease in Recall, resulting in the same AP for MobileNet V3-YOLOv5s and the original YOLOv5s, as shown in Table 4. In addition, some jujube fruits were missed in detection, as shown in Figure 12. MobileNet V3-YOLOv5s has a significant reduction in parameters and model size compared with YOLOv5s. Therefore, using a lightweight network as the backbone reduces the size of the model while maintaining accuracy.
Table 4. The comparison of different backbone networks.

Model | Precision (%) | Recall (%) | AP (%) | Parameters | Model Size (KB) | Fps
YOLOv5s | 89.1 | 90.3 | 95.6 | 7,063,542 | 14,052 | 35.1
MobileNet V2-YOLOv5s | 81.2 | 90.3 | 93.6 | 2,917,046 | 5423 | 23.4
MobileNet V3-YOLOv5s | 94.2 | 85.8 | 95.6 | 3,538,532 | 7189 | 22.2
Ghost-YOLOv5s | 85.4 | 92.3 | 93.4 | 3,897,605 | 8492 | 23.2
ShuffleNet V2-YOLOv5s | 83.8 | 91.6 | 94.0 | 490,205 | 1149 | 35.5
Agriculture 2022, 12, x FOR PEER REVIEW 14 of 20
(d) (e) (f)
Figure 11. the results of different algorithms for the recognition of red jujube. (a) the original image
of a dense jujube sample. (b) the original model to dense jujube detection image. (c) the improved
model to dense jujube detection image. (d) the original image of leaf-obscured jujube. (e) the original
model to leaf-obscured jujube detection image. (f) the improved model to leaf-obscured jujube de-
tection image. Where the red boxes are the label boxes marked manually, and the blue boxes are the
test results of model test.
Figure 12. Test results of different lightweight backbone networks (original image, ShuffleNet V2-YOLOv5s, YOLOv5s, MobileNet V2-YOLOv5s, MobileNet V3-YOLOv5s, and Ghost-YOLOv5s). The blue boxes are the model detection results.
The Precision and AP obtained using ShuffleNet V2 as the backbone network were slightly lower than those of the original YOLOv5s and of the improved YOLOv5s using MobileNet V3 as the backbone network. However, using ShuffleNet V2 as the backbone network provided more comprehensive red jujube detection. When MobileNet V2 and GhostNet were used as the backbone, some red jujubes were missed, as shown in Figure 12. Compared with the other four detection models, the number of parameters using ShuffleNet V2 as the backbone network was only 7.14% of that of YOLOv5s, obviously smaller than the other networks. The detection speed of the ShuffleNet V2 backbone network model was also faster than the other detection networks, as shown in Table 4. Using ShuffleNet V2 as the backbone not only greatly reduced the number of model parameters but also improved the detection speed, making it more suitable for red jujube counting and related embedded mobile devices.
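The parameter totals reported in Table 4 are the sums of the learnable tensor sizes in each network. A minimal sketch of how such counts can be obtained is shown below, assuming a PyTorch model object; the small stand-in network is illustrative only, not the actual YOLOv5s or ShuffleNet V2 backbone.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total number of learnable parameters, the quantity listed in Table 4.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Illustrative stand-in network; substituting the real detection model
# would reproduce the parameter counts in Table 4.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)
print(count_parameters(toy_backbone))  # 5088
```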
3.3. Performance Comparison in Counting Jujubes Using the Different Algorithms
To verify the effectiveness of the improved YOLOv5s for target detection, YOLOv3-tiny, YOLOv4-tiny, Faster R-CNN, SSD, YOLOvx-tiny, and YOLOv7-tiny were selected for comparison with the improved YOLOv5s. The comparison models were trained and evaluated on datasets of the same size with the same training and test sets. To ensure the reliability of the test, the number of epochs was set to 400, and the batch size was set to 4. In this research, six orchard jujube images were selected to test the yield estimation method. The comparison results are shown in Table 5, and the P-R curves of the models are shown in Figure 13.
Table 5. Detection results of red jujubes with different target detection algorithms. The actual numbers of jujubes in test images 1-6 were 10, 15, 15, 9, 10, and 6, respectively.

Model | Predicted Jujubes (Images 1-6) | Precision (%) | Recall (%) | AP (%) | RMSE | MAPE (%) | Model Size (KB)
YOLOv5s | 9, 16, 14, 8, 8, 6 | 89.10 | 90.30 | 95.60 | 1.15 | 9.07 | 14,052
YOLOv4-tiny | 10, 15, 11, 9, 8, 6 | 91.60 | 89.40 | 95.90 | 1.83 | 7.78 | 103,012
YOLOv3-tiny | 10, 14, 11, 7, 8, 6 | 92.30 | 88.70 | 95.50 | 2.04 | 12.59 | 481,391
YOLOvx-tiny | 8, 10, 11, 7, 7, 6 | 86.60 | 91.30 | 95.70 | 3.11 | 22.04 | 19,901
YOLOv7-tiny | 10, 11, 12, 7, 8, 6 | 89.20 | 90.50 | 95.10 | 2.35 | 14.81 | 23,674
SSD | 8, 11, 14, 7, 8, 6 | 88.30 | 87.10 | 90.50 | 2.19 | 15.93 | 92,782
Faster R-CNN | 9, 12, 13, 7, 7, 6 | 64.00 | 89.30 | 87.90 | 2.12 | 15.93 | 110,773
Our Model | 10, 15, 13, 9, 9, 6 | 93.40 | 92.30 | 96.20 | 0.91 | 3.89 | 1149
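The RMSE and MAPE values in Table 5 follow directly from the per-image actual and predicted counts. The minimal sketch below reproduces the YOLOv5s row from the counts above, assuming the standard RMSE and MAPE definitions:

```python
import math

def rmse(actual, predicted):
    # Root mean square error over per-image jujube counts.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    # Mean absolute percentage error over per-image jujube counts.
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Counts for YOLOv5s taken from Table 5 (images 1-6).
actual    = [10, 15, 15, 9, 10, 6]
predicted = [ 9, 16, 14, 8,  8, 6]
print(round(rmse(actual, predicted), 2))  # 1.15, matching Table 5
print(round(mape(actual, predicted), 2))  # 9.07, matching Table 5
```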
Figure 13. The P-R curves of red jujubes with different target detection algorithms.
The P-R curve plots precision (vertical axis) against recall (horizontal axis), and the area under the curve reflects the comprehensive performance of a target detection model for red jujubes. Figure 13 shows that the curve areas of YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, YOLOvx-tiny, and YOLOv7-tiny are larger than those of SSD and Faster R-CNN, which illustrates that the YOLO series detection networks have higher accuracy and better recognition of red jujubes. Although YOLOv5s is an improved detection network relative to YOLOv3-tiny and YOLOv4-tiny, it does not obtain the best detection result for red jujubes, as shown in Table 5. YOLOv4-tiny has better detection results, but YOLOv5s is smaller in model size and more suitable for use in agricultural mobile devices. Compared with the classical networks, the improved network not only maintains better detection performance but also greatly reduces the model size.
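Since AP corresponds to this area under the P-R curve, it can be approximated from sampled (recall, precision) points. The sketch below uses the monotone-envelope interpolation common to VOC-style AP evaluation; the sample points are illustrative placeholders, not the measured curves of Figure 13.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    # Append sentinel points, then make precision monotonically
    # non-increasing from right to left (the precision envelope).
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the areas of the rectangles under the envelope.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Illustrative sampled points of a P-R curve.
rec = np.array([0.2, 0.5, 0.8, 0.95])
prec = np.array([0.99, 0.97, 0.93, 0.85])
print(round(average_precision(rec, prec), 4))  # 0.8955
```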
Different detection algorithms were used to count red jujubes. YOLOvx-tiny, YOLOv5s, SSD, and Faster R-CNN all produced counting results lower than the actual number of red jujubes, as shown in Figure 14, image 1. YOLOv7-tiny, YOLOv5s, and Faster R-CNN repeatedly recognized some red jujubes during counting, which led to counting results higher than the actual number, as shown in Figure 14, images 2 and 3. Error counting occurred when SSD counted red jujubes, as shown in Figure 14, image 3. When counting image 4, only YOLOv4-tiny and Our Model counted accurately. Our Model also missed some red jujubes, but compared with the other algorithms, the number of missed detections was smaller, as shown in Figure 14, image 5. When counting the shaded red jujubes, all algorithms counted effectively, as shown in Figure 14, image 6.
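In this counting scheme, the per-image count is simply the number of detections the model retains after confidence filtering. A minimal sketch follows; the Detection structure and the 0.25 threshold are illustrative assumptions, as the post-processing settings are not restated here.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    cls: str     # predicted class label
    conf: float  # confidence score

def count_jujubes(detections, conf_thres: float = 0.25) -> int:
    # Tally boxes that survive confidence filtering; non-maximum
    # suppression is assumed to have been applied by the detector.
    return sum(1 for d in detections if d.cls == "jujube" and d.conf >= conf_thres)

dets = [Detection("jujube", 0.91), Detection("jujube", 0.55), Detection("jujube", 0.12)]
print(count_jujubes(dets))  # 2
```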
Figure 14. Test results of different algorithms (rows: original image, YOLOv5s, YOLOv4-tiny, YOLOv3-tiny, YOLOvx-tiny, YOLOv7-tiny, SSD, Faster R-CNN, and Our Model; columns: images 1-6). The blue boxes are the model detection results.
According to the experimental results, in the detection of red jujube, YOLOv5s, YOLOv4-tiny, and Faster R-CNN all missed detections, which leads to an underestimate of the number of red jujubes. YOLOv3-tiny, SSD, and Faster R-CNN all produced erroneous recognitions, which increases the estimation error of jujube yield, as shown in Figure 14. Faster R-CNN, as one of the representative networks of the two-stage detection model, has good overall detection performance for red jujubes, but its AP is lower than that of the other detection networks, and its RMSE and MAPE are the maximum values, as shown in Table 5. This difference is mainly manifested in the difficulty of recognizing fruits heavily shaded by leaves and the poor recognition of overlapping fruits. The reason for the difference is that Faster R-CNN does not build an image feature pyramid and cannot sufficiently extract features for small targets, resulting in insensitivity to small-target recognition. For both the single-stage Yolo series and SSD, the overall performance is better than Faster R-CNN. Comparing SSD with YOLOv5s, the Precision is reduced by 0.80%, the Recall is reduced by 3.20%, the AP is reduced by 5.10%, the RMSE is increased by 45.75%, and the MAPE is increased by 6.86%. The main reasons are: (1) Since YOLOv5s introduces FPN + PAN and its detection layers fuse three levels of feature layers, whereas all six feature pyramid layers of SSD come from the last layer of the FCN, YOLOv5s is better than SSD in detecting red jujubes. (2) Due to the limited number of red jujube samples and the severe occlusion between red jujubes, it is difficult for the model to learn their various states. Compared with YOLOvx-tiny and YOLOv7-tiny, the AP of the improved network increased by 0.50% and 1.10%, respectively, the RMSE decreased by 2.2 and 1.44, respectively, and the MAPE decreased by 18.15% and 10.92%, respectively. Compared with YOLOv5s, we introduced the ShuffleNet V2 backbone to reduce the size of the model, although this limits the feature extraction ability of the model. Therefore, the idea of resizing images by convolution layers was adopted, and the Stem module was added to enhance the feature extraction ability of the network. The improved model overall outperforms YOLOv5s, with Precision, Recall, and AP improving by 4.3%, 2.0%, and 0.6%. In addition, the model size, RMSE, and MAPE decreased by 91.82%, 20.87%, and 5.18%, respectively. The improved model has the highest Precision, Recall, F1-Score, and AP, and the smallest model size, RMSE, and MAPE among the compared networks.
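As a quick sanity check, these reductions can be recomputed from Tables 4 and 5. The model size and RMSE figures are relative reductions, while the 5.18% MAPE figure corresponds to the percentage-point difference (9.07% minus 3.89%):

```python
# Values taken from Tables 4 and 5 (YOLOv5s vs. Our Model).
size_yolov5s, size_ours = 14052, 1149  # model size, KB
rmse_yolov5s, rmse_ours = 1.15, 0.91
mape_yolov5s, mape_ours = 9.07, 3.89   # %

print(round(100 * (size_yolov5s - size_ours) / size_yolov5s, 2))  # 91.82 (relative %)
print(round(100 * (rmse_yolov5s - rmse_ours) / rmse_yolov5s, 2))  # 20.87 (relative %)
print(round(mape_yolov5s - mape_ours, 2))                         # 5.18 (percentage points)
```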
4. Conclusions
In this research, a counting method of red jujube based on improved YOLOv5s was proposed to achieve accurate detection and counting of red jujubes in a complex environment while reducing the model size. In order to reduce the number of parameters, ShuffleNet V2 was used as the backbone to make the model lightweight. In addition, the Stem module was designed as an intermediate module between the input and backbone to prevent the information loss caused by the change in feature map size. PANet was replaced by BiFPN for multi-scale feature fusion to enhance the model's feature fusion capability and improve its accuracy. Finally, the improved YOLOv5s detection model was used to count red jujubes. In order to verify the efficiency of the proposed model, YOLOv5s, YOLOv3-tiny, YOLOv4-tiny, SSD, Faster R-CNN, YOLOvx-tiny, and YOLOv7-tiny were compared with the improved model. The results showed that the improved model not only greatly reduced the model size but also performed better in detection than the comparison networks. Compared with YOLOv5s, Precision, Recall, and AP were improved by 4.3%, 2.0%, and 0.6%, respectively. In addition, the model size, RMSE, and MAPE decreased by 91.82%, 20.87%, and 5.18%, respectively. Therefore, the improved YOLOv5s model can not only effectively improve the detection performance of red jujubes but also accomplish the task of counting red jujubes in agricultural production. The method can provide a basis for estimating the yield of jujube by vision.
In summary, a counting method of red jujube based on improved YOLOv5s was proposed in this research, and the counting effectiveness of the method was verified by experiments. The future work on the red jujube counting method is as follows:
(1) Expand the types of datasets and increase the robustness of the model. There are only two kinds of jujube in the dataset used in this research, so it is necessary to add more kinds of jujube fruit data to enhance the robustness of the model.
(2) Construct a model of jujube fruit size and quality so that the counting method of red jujubes can be further used to accurately estimate the yield of red jujubes.
Author Contributions: Data curation, methodology, project administration, writing—original draft, writing—review and editing, Y.Q.; review & editing, supervision, funding acquisition, and project administration, Y.H.; data curation, Z.Z.; formal analysis, H.Y.; formal analysis, K.Z.; review & editing, supervision, funding acquisition, and project administration, J.H.; review & editing, supervision, J.G. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the Talent start-up Project of Zhejiang A&F University Scientific Research Development Foundation (2021LFR066) and the National Natural Science Foundation of China (C0043619, C0043628).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.