Weed Detection: A Vision Transformer Approach For Soybean Crops
Sanjay M, School of CSE, VIT-AP University, Amaravati, India (sanjaymythili2002@gmail.com)
Mithisha Brilent Tavares, School of CSE, VIT-AP University, Amaravati, India (mithishatavares8303@gmail.com)
Deepashree P. Vaideeswar, School of CSE, VIT-AP University, Amaravati, India (dvaideeswar@gmail.com)
Ch. Venkata Rami Reddy, School of CSE, VIT-AP University, Amaravati, India (chvrr58@gmail.com)
Abstract: Unwanted plants called weeds grow among
agricultural crops, competing with them for nutrients, water, and
sunshine, resulting in severe output losses. Machine learning
algorithms, particularly Deep Learning models, have automated
the weed detection process. By leveraging annotated image
datasets, these algorithms can accurately classify and distinguish
weeds from crops. They have the potential to develop real-time,
autonomous weed detection systems, empowering farmers to make
informed decisions regarding weed control measures. This study
suggests using Vision Transformers to classify and identify weeds
in soybean farms. Recently, the significance of Vision
Transformers in the area of Computer Vision has grown as a
result of their capability to identify remote dependencies in images.
The method suggested in this paper makes use of Vision
Transformer's benefits by using a Deep Learning framework,
which improves the precision of weed identification and
categorization. This study’s dataset comprises 15,336 images
encompassing soil, grass weeds, broadleaf weeds, and soybean
crops. The experimental analysis of the proposed approach shows
that this approach outperforms a number of cutting-edge
techniques for the identification and categorization of weeds. The
accuracy of the proposed approach is 98.83%. This accuracy
outperforms that of other methods, such as Convolutional Neural
Networks (CNNs), Support Vector Machines, and other Deep
Learning models, by a significant margin. The proposed method
has the capacity to be used to enhance the accuracy and
dependability of weed identification and categorization in many
different crops, not just soybean. Consequently, these findings can
lead to improved weed management strategies and increased crop
yield.
Keywords: Image Detection, Weed Detection, Vision Transformers, Agriculture.
I. INTRODUCTION
The Soybean (Glycine Max), also known as Soja Bean or
Soya Bean, is a legume that grows annually and belongs to
the pea family (Fabaceae). Its seed is edible, and it is
considered the most important bean worldwide in terms of
economic significance, as it provides vegetable protein for
millions of people and serves as a raw material for numerous
chemical products. Soybean is an affordable and highly
nutritious source of protein, widely consumed by both
humans and animals in various parts of the world. The
soybean seed is composed of 17% oil and 63% meal, with
protein constituting 50% of the meal. Due to the absence of
starch, soybeans are considered an excellent protein source
for people with diabetes. Utilising techniques to optimise
productivity and improve product quality is crucial,
considering the significance of soy in the economic
landscape. Weed is a broad term used to describe any plant
that grows in an unwanted location. Throughout history,
humans have struggled to protect their crops from invasive
weeds. However, some plants initially classified as weeds
were later discovered to have useful properties and were
therefore cultivated. Conversely, some cultivated plants that
were introduced to new environments became invasive and
turned into weeds. This means that the definition of weeds is
constantly evolving, making it a relative term. The presence
of these plants, which compete with economically valuable
crops like soybeans, can cause significant harm. They make
it challenging to operate harvesting machinery and lead to
impurities and moisture in the grain. Weeds have numerous
detrimental effects on crops, such as competing for resources
like water, light, nutrients, and space. This competition
results in increased production costs, difficulties in
harvesting, decreased product quality, increased
susceptibility to pests and diseases, and reduced commercial
value of cultivated areas. While the Transformer design is not frequently employed in Computer Vision (CV) applications, it is standard in Natural Language Processing (NLP) tasks. In CV, attention is either used in conjunction with convolutional networks or, in certain cases, as a substitute for some of their components while preserving the basic architecture of the convolutional network. Convolutional architectures,
however, continue to be widely used in computer vision tasks
[1-9]. The latest models that make use of specialised attention patterns show theoretical promise, but they have not scaled well on the hardware accelerators available today. Hence, traditional ResNet-like architectures are still at the cutting edge of large-scale image recognition, as inferred from [10] and [11]. However, when the Vision Transformer (ViT) is pre-trained on larger datasets and evaluated on image recognition benchmarks such as ImageNet, CIFAR-100, and VTAB, it achieves outstanding results in comparison to the latest convolutional networks.
Additionally, it needs fewer computational resources for its
training. ViT is a type of advanced deep learning model that
has become increasingly popular in computer vision research
in recent years. In contrast to traditional CNNs that use convolution
layers to extract spatial features from images, ViTs utilise
self-attention mechanisms to model the relations between
various image patches directly. This approach enables ViTs
to capture both local and global image information more
effectively, making them more robust and efficient than
CNNs in certain applications. The ViT architecture was
introduced in a paper by Dosovitskiy et al. in 2020 [12],
demonstrating superior performance on various image
classification benchmarks. ViTs have been extensively
applied to a diverse range of CV tasks. These include image
generation, object detection, and semantic segmentation,
among others. By leveraging self-attention mechanisms to
model image patches, ViTs can effectively learn and capture
intricate features in images, which enables them to perform
well on various visual recognition tasks. As a result, ViTs
have become an increasingly popular choice for researchers
and practitioners in the computer vision community. ViTs
have a distinct advantage over traditional CNNs in their
capability to model long-range dependencies between
different image patches, which is a challenging task for
CNNs. This attribute enables ViTs to extract more global
features of an image, resulting in more precise predictions.
Moreover, ViTs are highly scalable, which means they can be
trained on large datasets and achieve even better
performance. This scalability and flexibility have made ViTs
an appealing option for various applications in computer
vision research. The objectives of this study include the following:
• Developing a model based on Vision Transformer that can accurately detect and classify weeds in soybean crops.
• Evaluating the model on standard metrics.
• Comparing the performance of the proposed model with existing weed detection models.
II. LITERATURE REVIEW
Image classification has emerged as a promising analytical
technique in the field of agriculture. This is because image
recognition and classification technologies can assess crops
based on their appearance, thereby eliminating the need for
expert analysis and costly and time-consuming experiments.
With its ability to provide accurate and rapid results at a lower
cost, image classification has gained significant popularity in
the agriculture industry. The practical uses of image
classification in agriculture are numerous, including tasks
such as soil assessment, leaf analysis, weed detection, pest
control, plant monitoring, disease recognition, and fruit/food
grading. These applications have been identified as valuable
tools for enhancing agricultural productivity and efficiency.
An approach to monitoring plant health was proposed in [13] by constructing a system to detect iron, nitrogen, and zinc deficiencies through observation of lettuce leaves. The
first step of the proposed method was to segment images into
‘leaf’ and ‘background’ classes using Artificial Neural
Networks (ANN). Further, RGB and HSI representation of
images were extracted. These parameters were subsequently
Fig. 1. Input images of each class for prediction: (a) Soil, (b) Soybean, (c) Grass, (d) Broadleaf
given to statistical classifiers and neural networks, which
predicted the plant state with a 92% accuracy. Two
approaches were compared in [14] to identify and categorize
three different citrus illnesses. In total, 39 texture attributes
were gathered, and four subsections of the attributes were
produced. The two methods employed in the system were Fast Image Processing (FIP), which can quickly process images and provide results in real time, and Robust Crop Row Detection (RCRD), which is slower but more precise and is used to rectify errors made by the FIP.
The system combined both methods to function effectively in
a wide range of situations and provide highly precise
outcomes. It can accurately identify about 95% of weeds and
80% of crops even when there are variations in factors such
as lighting, soil moisture, and the growth of weeds/crops.
There are several studies that use ViT for multiple image
processing tasks. An empirical study was conducted in [15]
to explore the effects of several elements on the performance
of Vision Transformer (ViT) models trained on the ImageNet-21k dataset, including the volume of training data, data regularization and augmentation (AugReg), model size, and compute budget. The study found that by increasing both the compute budget and the use of AugReg, it is possible to achieve performance similar to that of models trained on ten times more data. Additionally, the ViT models trained on the ImageNet-21k dataset not only matched but also outperformed their counterparts trained on larger datasets. A research paper [16] proposes a mineral
recognition classification model based on the ViT
architecture. The model is trained and tested on a dataset
containing 2000 images of 12 different minerals, which is
augmented with data enhancement techniques to increase the
model's generalization capability. To enhance the feature
extraction process, a self-attentive mechanism is introduced.
Additionally, a new activation function is used to speed up
the convergence of the model during training. An accuracy of
96.08% is attained by the proposed model in mineral
recognition classification. ViT models with different augmentation strategies were used to classify breast ultrasound images in [17]. Due to the imbalance in the considered dataset, a weighted cross-entropy loss function was applied. The results in [17] imply that, when it comes to classifying breast ultrasound images, ViT models perform on par with or even better than CNNs. Yet another use of Vision Transformers in the medical domain can be seen in [18], where the study provided a comprehensive scrutiny of skin-lesion image segmentation using U-Net and attention-based methods. As per [18], the hybrid TransUNet, which pairs a CNN with a transformer, performed best with an accuracy of 92.11%. A ViT-based fire warning model was suggested in [19]. The ViT's performance surpassed that of CNNs such as VGG and ResNet. On small datasets, the reported accuracy was 97.4%, while on large datasets it was 97.03%. Another study [20] proposed several improvements
to enhance the performance of ViT. Firstly, it suggested that
using a hybrid architecture, which combines convolutional
neural networks (CNNs) with transformers, is much more
effective than using plain transformers alone. Second, the
study suggested adding two branches to the architecture to
gather local information from patch tokens and global
information from the categorization token, which would then
be combined to create a global image representation. Thirdly,
the proposed study suggests gathering multi-layer
characteristics from the transformer encoders of every
branch. This helps improve the representation of complex
image features. As stated earlier, the agriculture industry has
recognized the importance of weed detection due to the
detrimental impact of weeds on crop nutrition and water
absorption. As a result, different approaches for weed
detection have emerged, ranging from manual methods to
more advanced vision-based systems. Examining various
studies and approaches to weed detection revealed studies
such as [21] that detected soybean weeds using multiple CNN
architectures such as MobileNet, ResNet50, and three
different versions of CNNs. A 5-layer CNN architecture from
3 custom networks was selected for deployment as it yielded
the highest accuracy of 97.7%. A CNN-LVQ model was used in [22] for weed detection in soybean crops. A total of
4400 images captured by Unmanned Aerial Vehicles (UAVs)
were used to identify weeds in soybean fields, and the model
achieved an impressive accuracy of 99.79%. In a separate
study [23], crop and weed images taken by UAVs in fields of
beetroot, spinach, and parsley were classified using a Vision
Transformer and the self-attention paradigm. The results
showed that the Vision Transformer surpassed other
advanced techniques, including ResNet and EfficientNet,
achieving an accuracy of 99.14% for 13,596 testing images.
In other words, the Vision Transformer demonstrated
superior performance compared to other deep learning
models in accurately identifying crops and weeds in aerial
images acquired by UAVs. Rai et al. [24], in their review
study, conducted a comprehensive analysis and comparison
of 60 deep learning-based models used for site-specific weed
management, providing detailed insights into the
performance of each model included. Many of the included
models achieved an accuracy of over 90%; however, it is
important to note that these accuracies were evaluated on
different datasets. Maheswaran et al. [25] introduced an
innovative autonomous weeder aimed at effectively
eliminating weeds. Their design incorporates flexible
rotavator blades with high torque capabilities, which are
controlled using machine vision technology and Raspberry
Pi. Remarkably, their approach achieved a notable accuracy
rate of 90% in successfully removing weeds.
III. METHODOLOGY
Recent deep neural network models called Vision
Transformers (ViT) have demonstrated considerable potential
in computer vision applications including segmentation, object detection, and image classification.
Transformers, which were created initially for natural
language processing (NLP) activities, served as a foundation
for this ViT. Generally, for image classification applications,
CNNs have been the most widely used architecture in
computer vision. The Vision Transformer (ViT) is a
comparatively newer architecture that has recently acquired
popularity in the field of CV. For image classification, ViT
employs a transformer architecture that was first created for
natural language processing. This research paper aims to use
ViT for weed detection in soybean crop application.
The ViT architecture was introduced in a paper by Dosovitskiy et al. in 2020 [11]. Convolutional, pooling, and fully connected layers make up the classic CNN design. These models are computationally intensive and require a large number of parameters. On the other hand, the transformer design
is founded on self-attention mechanisms that are computationally efficient and require fewer parameters. The transformer
architecture has achieved cutting-edge performance on
various benchmark datasets and has been very effective in
NLP. Applying the transformer architecture, which was
initially created for NLP, to image data is the main concept
behind the ViT. A neural network architecture called the
transformer has proved very effective for NLP tasks like
sentiment analysis, language modelling, and machine
translation. Its foundation is the idea of self-attention, which
enables the model to concentrate on different input sequence
components when making predictions.
A. Architecture
The ViT architecture consists of three phases: the first stage
involves splitting the input image into patches, the second
stage involves flattening and linearly projecting the patches
to the transformer, and the third stage involves encoding the patches in the transformer and sending them to the feed-forward network, where the final predictions are made.
The self-attention technique, which is used to derive
correlations in long-range and contextual information from
the incoming input, is an essential component of the
transformer architecture. A ViT model will focus on various
areas of the input data according to their relevance to the job
at hand, thanks to the self-attention mechanism. Self-attention is the transformer's fundamental building block. The
self-attention mechanism calculates a weighted sum of the
input data as a result, with the weights determined by how
similar the input characteristics are. This enables the model
to give the pertinent input characteristics greater weight,
which aids in capturing more accurate representations of the
input data. Self-attention, therefore, is a computational
primitive that quantifies paired entity interactions and enables
a network to understand the hierarchies and alignments
contained in incoming data. For visual networks to acquire
greater resilience, attention has been shown to be a crucial
component. Mathematically, the output of self-attention is computed as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$  (1)

where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linearly projecting the input embeddings, and $d_k$ is the dimension of the key vectors.
Transformer is composed of numerous modules for self-
attention. The transformer in this study is made up of 8 self-
attention modules. The patch embedding layer and the
transformer encoder are the two primary parts of the ViT
architecture. The patch embedding layer divides the input
image into non-overlapping pieces before flattening the
pieces into vectors. These patch embeddings are sent to the
encoder of the transformer, which consists of multiple layers
of feedforward and self-attention neural networks. The
feedforward networks enable the model to collect local
spatial information, while the self-attention mechanism
permits the model to extract global context from the input
image. The transformer encoder's output is routed into a
classification head, which creates the model's final output.
The input image is split into a matrix of patches for the ViT,
which is then flattened and supplied into the transformer
model. In the same way that words are used as input tokens
in NLP tasks, the patches act as the model's input tokens.
After processing these input tokens, the transformer creates a
feature vector that can be applied to classification or other
subsequent tasks. The self-attention mechanism is used by the
ViT to capture both global and local correlations among the
input patches. This enables the model to acquire finer-grained
characteristics and more accurately represent the image's
semantic content.
Below pseudo code shows the algorithm that ViT uses:
Input: Image of 72x72x3 pixels
Output: Image class {broadleaf, grass, soil, soybean}
Step-I: The input image is split into smaller patches of a fixed
size (e.g., 6x6 pixels).
Step-II: Each patch is then fed through a CNN to extract a
fixed-size feature vector (embedding) for that patch. CNN is
typically a simple 3-layer network with small filters.
Step-III: The patch embeddings are then flattened and
arranged in a 2D grid, with one row per patch and one column
per embedding feature.
Fig. 2. Architecture of Vision Transformer
Step-IV: A fixed positional embedding is added to each patch embedding to help the model learn the spatial relationships between patches.
Step-V: Then patch embeddings are sent through a stack of
Transformer encoder layers (e.g., 8 layers), which allows the
model to go through different parts of the image and capture
long-range dependencies.
Step-VI: The final output of the Transformer encoder is a
single vector representing the entire image.
Step-VII: This vector is passed through a softmax classifier
to make the final prediction class label of the input image.
Step-VIII: End
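The steps above can be translated into a compact Keras sketch of the described pipeline (72x72 inputs, 6x6 patches, 8 transformer encoder layers, 4 attention heads, and a softmax over the four classes). This is a hedged illustration, not the exact code trained in this study: the 64-dimensional patch projection and the MLP width are assumed values, a learned positional embedding is used in place of the fixed one described in Step-IV, and the per-patch projection is a simple linear layer rather than the small CNN mentioned in Step-II.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES, IMAGE_SIZE, PATCH_SIZE = 4, 72, 6
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2        # 144 patches per image
PROJ_DIM, NUM_HEADS, NUM_LAYERS = 64, 4, 8           # PROJ_DIM is an assumed value

class Patches(layers.Layer):
    """Step I: split the image into non-overlapping PATCH_SIZE x PATCH_SIZE patches."""
    def call(self, images):
        patches = tf.image.extract_patches(
            images,
            sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
            strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        return tf.reshape(patches, (tf.shape(images)[0], NUM_PATCHES, -1))

class PatchEncoder(layers.Layer):
    """Steps II-IV: project each patch and add a (learned) positional embedding."""
    def __init__(self):
        super().__init__()
        self.projection = layers.Dense(PROJ_DIM)
        self.position_embedding = layers.Embedding(NUM_PATCHES, PROJ_DIM)

    def call(self, patches):
        positions = tf.range(start=0, limit=NUM_PATCHES, delta=1)
        return self.projection(patches) + self.position_embedding(positions)

def build_vit():
    inputs = layers.Input((IMAGE_SIZE, IMAGE_SIZE, 3))
    x = PatchEncoder()(Patches()(inputs))
    # Step V: a stack of transformer encoder blocks (self-attention + feed-forward).
    for _ in range(NUM_LAYERS):
        x_norm = layers.LayerNormalization()(x)
        attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=PROJ_DIM)(x_norm, x_norm)
        x = x + attn                                          # residual connection
        mlp = layers.Dense(PROJ_DIM * 2, activation="gelu")(layers.LayerNormalization()(x))
        x = x + layers.Dense(PROJ_DIM)(mlp)                   # feed-forward + residual
    # Steps VI-VII: pool into a single vector and classify with a softmax head.
    x = layers.GlobalAveragePooling1D()(layers.LayerNormalization()(x))
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_vit()
model.summary()
```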
Before fine-tuning the model for a particular job, the ViT
employs a method known as pretraining to learn a general-
purpose representation of image data. The model is trained by
a self-supervised method of learning using a
large image dataset during the pre-training phase. To do this, some input patches are randomly masked, and the model is then asked to predict the content of the masked patches. In doing so, the model develops the ability to recognise complex visual cues that can be used for a number of downstream tasks. Figure 2 shows the complete architecture of the ViT used in this research paper.
B. Loss Function
The loss function used for the ViT is the Sparse Categorical
Cross Entropy (SCCE). In the SCCE loss the model's
predicted probabilities for each class are compared to the true
label, which is represented as an integer value corresponding
to the correct class. Unlike categorical cross entropy, which
compares the predicted probability distribution to a one-hot
encoded vector, SCCE does not require the true label to be
one-hot encoded. This makes it more efficient and memory-
friendly, especially when the number of classes is large. The
SCCE loss function is given by,
 󰇛󰇜
 (2)
Where, represents the truth label and represents the
probability for the class obtained by applying softmax
function.
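For reference, a small sketch of how this loss behaves with integer labels is given below (the label and probability values are illustrative); tf.keras.losses.SparseCategoricalCrossentropy accepts the integer class indices directly, without one-hot encoding.

```python
import tensorflow as tf

# Integer class labels (0..3) are used directly; no one-hot encoding is needed.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

y_true = tf.constant([1, 3])                       # true classes as integers
y_pred = tf.constant([[0.05, 0.90, 0.03, 0.02],    # softmax probabilities over 4 classes
                      [0.10, 0.10, 0.10, 0.70]])
print(loss_fn(y_true, y_pred).numpy())             # mean of -log(0.90) and -log(0.70), about 0.231
```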
IV. EXPERIMENTAL SETUP
A. Dataset Description
The dataset considered for this study is an image dataset for
weed detection in soybean crops. It consists of a total of
15,336 image segments, of which 3,249 are soil segments,
7,376 are soybean segments, 3,520 are grass segments, and
1,191 are broadleaf weed segments. The segments were
extracted from 400 images captured by an unmanned aerial
vehicle (UAV) and were manually annotated with their
respective class. The images were pre-processed using the
SLIC algorithm implemented in the Pynovisão software,
resulting in high-quality image segments that are suitable for
training and evaluating deep learning models for weed
detection. The dataset is well-balanced, with each class
containing a significant number of segments, allowing for
accurate evaluation of model performance.
TABLE I. DATASET DESCRIPTION

Class number    Total Appearances
0               3249
1               7376
2               3520
3               1191
Total No. of Images: 15336
The purpose of this study is to detect the presence of weeds in UAV-captured soybean crop images, with high accuracy and low complexity, such that the model can be easily deployed on agricultural devices. This is achieved by training a ViT on the dataset. This study’s dataset contains 15,336 images categorized into 4 classes: “Soil” (3249), “Soybean” (7376), “Grass” (3520), and “Broadleaf Weeds” (1191). A pre-processed version of the dataset is used, in which the images were segmented and manually annotated. Further, the dataset is segregated into train (81.00%), validation (8.99%), and test (9.99%) sets.
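A hedged sketch of how such a split could be produced with TensorFlow is shown below; the directory name, seed, and the exact way the held-out portion is divided between validation and test are assumptions for illustration, not the procedure used to create the splits in this study.

```python
import tensorflow as tf

# Assumed layout: one sub-directory per class (soil, soybean, grass, broadleaf).
DATA_DIR, IMG_SIZE, BATCH = "dataset/", (72, 72), 256

# First carve out roughly 19% for validation + test, then split that portion in half.
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.19, subset="training", seed=42,
    image_size=IMG_SIZE, batch_size=BATCH)
holdout_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.19, subset="validation", seed=42,
    image_size=IMG_SIZE, batch_size=BATCH)

n_batches = tf.data.experimental.cardinality(holdout_ds).numpy()
val_ds = holdout_ds.take(n_batches // 2)    # roughly 9% of the data
test_ds = holdout_ds.skip(n_batches // 2)   # roughly 10% of the data
```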
B. Data Augmentation
The dataset used for training was modified by randomly
flipping images horizontally and vertically, as well as
zooming in or out with a factor of 0.2 for both height and
width. Additionally, random rotation with a factor of 0.2 was
applied to the images. After these modifications, the images
were normalized. Fig 3 below shows some sample images
after applying augmentation and normalization.

Fig. 3. Augmented Images
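A minimal sketch of this augmentation and normalization pipeline using Keras preprocessing layers might look as follows; it mirrors the operations described above, but the exact layer ordering and the commented adapt step are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation pipeline mirroring the description above: random horizontal and
# vertical flips, random rotation (factor 0.2), random zoom (factor 0.2 in height
# and width), plus normalization of the pixel values.
data_augmentation = tf.keras.Sequential([
    layers.Normalization(),
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(factor=0.2),
    layers.RandomZoom(height_factor=0.2, width_factor=0.2),
])

# The Normalization layer must see the training data once to compute its mean and
# variance (train_images is an assumed array or dataset of training images).
# data_augmentation.layers[0].adapt(train_images)
```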
C. Evaluation Metrics and Hyperparameter Tuning
The following hyperparameter values were considered: the learning rate is set to 0.001, the batch size to 256, the image size to 72 x 72, the patch size to 6 x 6, and the weight decay to 0.0001. The number of attention heads in the transformer is 4, and there are 8 transformer encoder layers in total. The model was trained for 500 epochs spanning over 3 hours.
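Putting the reported hyperparameters together, a training configuration along these lines could be used; the AdamW optimizer is an assumption (the study reports a weight decay value but does not name the optimizer), and model, train_ds, and val_ds refer to the earlier sketches.

```python
import tensorflow as tf

LEARNING_RATE, WEIGHT_DECAY = 0.001, 0.0001     # values reported above
NUM_EPOCHS = 500

# AdamW applies the stated weight decay; available as tf.keras.optimizers.AdamW in
# recent TensorFlow releases (or via TensorFlow Addons in older ones).
optimizer = tf.keras.optimizers.AdamW(learning_rate=LEARNING_RATE,
                                      weight_decay=WEIGHT_DECAY)

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# Batching (size 256) is already handled by the tf.data pipelines, so no
# batch_size argument is passed to fit().
history = model.fit(train_ds, validation_data=val_ds, epochs=NUM_EPOCHS)
```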
The proposed model was implemented using the Python programming language and the TensorFlow framework. The model training was done on the Kaggle platform’s kernel. The
training was conducted on a T4 x2 GPU. The standard
accuracy metric is used to evaluate the approach in this study.
As accuracy increases, the model's performance is better on
the training and test set. The other metrics used to evaluate
the model are precision, recall and F1 score. Each metric is
computed as follows:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (3)

$\mathrm{Precision} = \frac{TP}{TP + FP}$  (4)

$\mathrm{Recall} = \frac{TP}{TP + FN}$  (5)

$F1\ \mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (6)
where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively. A good model should have a high F1 score, high recall, high accuracy, and high precision. Precision and recall, however, typically trade off against one another; the F1 score is used to capture the balance between them. The confusion matrix can be used to identify the model’s strengths and weaknesses and the kinds of errors the model makes.
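These metrics and the confusion matrix can be computed from the test-set predictions, for example with scikit-learn as sketched below; the macro averaging and the assumption that test_ds is not reshuffled between passes are illustrative choices rather than details reported in this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Assumes test_ds yields (images, integer_labels) batches in a fixed order
# (e.g., built with shuffle=False), so predictions line up with the labels.
y_prob = model.predict(test_ds)                   # softmax probabilities, shape (N, 4)
y_pred = np.argmax(y_prob, axis=1)                # predicted class indices
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))           # rows = true class, cols = predicted class
```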
V. RESULTS AND ANALYSIS
The results of the ViT model demonstrate excellent performance
in image classification tasks. After training on a dataset of
images, the proposed model has attained an accuracy of
99.18% on the training set, and 99.2% accuracy on the
validation set, as shown in Table II. This high level of accuracy was achieved after training the model for 500 epochs, indicating that the model is capable of learning the patterns and features of the images over a prolonged period. Through the
hyperparameters mentioned above, a test accuracy of 98.83%
is attained with a final loss value of 0.0332, indicating that it
can generalize well to new, unseen images. Fig. 4a shows the model's accuracy after each epoch, from which it can be observed how the model's performance varies over training. Fig. 4b shows the loss of the model after each epoch; as the number of epochs increases, the loss decreases.
Fig. 5 shows the confusion matrix of the proposed model. As indicated in Fig. 5, class 1, i.e., “Grass”, has the least classification accuracy of 93%. The remaining three classes 0, 2, and 3, i.e., “Broadleaf”, “Soil”, and “Soybean”, have accuracies of 100%, 98%, and 99%, respectively. The model's attention
mechanism enables it to concentrate on crucial areas of the
image when generating predictions, allowing it to attain very
high levels of accuracy. The model's architecture, which
incorporates a multi-head attention mechanism and numerous
attention blocks, additionally influences how effectively it
performs.
Table 3 provides an overview of the performance and
accuracy of different models proposed by other studies
including the model proposed in this study. The purpose of
this table is to showcase the effectiveness of the proposed model in comparison to other existing models in the field. The second row of Table III represents the DNN models proposed by Singh, Gourisaria, et al. [26]. These models were trained
on the same dataset of 15,336 images and achieved an
Fig. 4. Model performance after each epoch: (a) model accuracy versus epochs; (b) model loss versus epochs
TABLE II. EVALUATION METRICS

Training Accuracy      99.18%
Validation Accuracy    99.2%
Testing Accuracy       98.83%
Precision              97.07%
Recall                 97.49%
F1 Score               97.28%
accuracy of 94.58%. While these models show good
performance, they are outperformed by the proposed model
of this study. Overall, the proposed model in this study
exhibits the highest accuracy among the mentioned models,
achieving an accuracy of 98.83% on a dataset of 15,336
images. This highlights the effectiveness of the proposed
model in accurately predicting the desired output and
suggests its potential for practical applications in the field.
TABLE III. COMPARISON OF RESULTS WITH EXISTING MODELS

Author/Model                                     Dataset Size     Accuracy
The proposed model in this study                 15,336 images    98.83%
Singh, Gourisaria, et al. (DNN models) [26]      15,336 images    94.58%
Zhang, Wang, et al. (EM-YOLOv4-Tiny) [27]        855 images       96.70%
True, Julian, et al. (CNN models) [21]           400 images       97.7%
VI. CONCLUSION AND FUTURE WORK
This study suggests utilizing vision transformers for detecting
weeds in soybean crops. The proposed approach achieved high accuracy, making it suitable for deployment on agricultural devices. The ViT model was trained on the dataset, and the model was evaluated using the standard accuracy metric. The
model achieved a test accuracy of 98.83% with a final loss
value of 0.0332, which indicates that the proposed approach
is effective in detecting weeds in soybean crops. The results
obtained in this study demonstrate the potential of vision
transformers in weed detection applications.
There are several possible directions for further research in
this study. One possible way is to test the effectiveness of the
proposed method on larger datasets with more diverse
categories. Another area of research is to measure the
efficiency of the proposed method under different lighting
and environmental conditions. Moreover, it would be
fascinating to explore the effectiveness of the proposed
method in real-time scenarios. Furthermore, the proposed
method can be expanded to identify other types of weeds in
various crops. Finally, the proposed approach can be
integrated with agricultural devices to provide real-time weed
detection, enabling farmers to take timely and informed
decisions to manage weeds in their crops.
REFERENCES
[1] Reddy, C.V.R., Reddy, U.S., Kishore, K.V.K. (2019). Facial emotion
recognition using NLPCA and SVM. Traitement du Signal, 36(1): 13-
22.
[2] VenkataRamiReddy, C., Kishore, K.K., Bhattacharyya, D., Kim, T.H.
(2014). Multi-feature fusion-based facial expression classification
using DLBP and DCT. International Journal of Software Engineering
and Its Applications, 8(9): 55-68.
[3] Ramireddy, C.V., Kishore, K.K. (2013). Facial expression
classification using Kernel-based PCA with fused DCT and GWT
features. In 2013 IEEE International Conference on Computational
Intelligence and Computing Research, IEEE, pp. 1-6. Reddy, C.V.R.,
Kishore, K.K., Reddy, U.S., Suneetha, M. (2016). Person identification
system using feature level fusion of multi-biometrics. In 2016 IEEE
International Conference on Computational Intelligence and
Computing Research (ICCIC), IEEE, pp. 1-6.
[4] Palakodati, S.S.S., Chirra, V.R.R., Yakobu, D., Bulla, S. (2020). Fresh
and rotten fruits classification using CNN and transfer learning. Revue
d'Intelligence Artificielle, 34(5): 617-622.
[5] Chirra, V.R.R., Uyyala, S.R., Kolli, V.K.K. (2021). Virtual facial
expression recognition using deep CNN with ensemble learning.
Journal of Ambient Intelligence and Humanized Computing, 12:
10581-10599.
[6] Chirra, V.R.R., Uyyala, S.R., Kolli, V.K.K. (2019). Deep CNN: A
machine learning approach for driver drowsiness detection based on
eye state. Revue d'Intelligence Artificielle, 33(6): 461-466.
[7] Y. LeCun et al., "Backpropagation Applied to Handwritten Zip Code
Recognition," in Neural Computation, vol. 1, no. 4, pp. 541-551, Dec.
1989.
[8] Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012).
ImageNet Classification with Deep Convolutional Neural Networks.
Neural Information Processing Systems.
[9] Mahajan, A., Taliun, D., Thurner, M. et al. Fine-mapping type 2
diabetes loci to single-variant resolution using high-density imputation
and islet-specific epigenome maps. Nat Genet 50, 1505-1513 (2018).
[10] Kolesnikov, Alexander. “Big Transfer (BiT): General Visual
Representation Learning.” arXiv.org, 24 Dec. 2019.
[11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16
words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929.
[12] Nejati, Hossein & Azimifar, Zohreh & Zamani, Mohsen. (2008). Using
fast fourier transform for weed detection in corn fields.
[13] Xavier P. Burgos-Artizzu, Angela Ribeiro, Maria Guijarro, Gonzalo
Pajares; “Real- time image processing for crop/weed discrimination
in maize fields”; Elsevier; 2010.
[14] Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., &
Beyer, L. (2021). How to train your vit? data, augmentation, and
regularization in vision transformers. arXiv preprint
arXiv:2106.10270.
[15] X. Cui, C. Peng and H. Yang, "Intelligent Mineral Identification and
Classification based on Vision Transformer," 2022 9th International
Conference on Dependable Systems and Their Applications (DSA),
Wulumuqi, China, 2022, pp.
[16] Gheflati B, Rivaz H. Vision Transformers for Classification of Breast
Ultrasound Images. Annu Int Conf IEEE Eng Med Biol Soc. 2022
Jul;2022:480-483.
[17] Gulzar Y, Khan SA. Skin Lesion Segmentation Based on Vision
Transformers and Convolutional Neural Networks - A Comparative
Study. Applied Sciences. 2022.
Fig. 5. Confusion Matrix of the Model
[18] Zhang, Kaidi, et al. “Fire Detection Using Vision Transformer on
Power Plant.” Energy Reports, vol. 8, Elsevier BV, Nov. 2022, pp.
65764.
[19] C. H. Song, J. Yoon, S. Choi and Y. Avrithis, "Boosting vision
transformers for image retrieval," 2023 IEEE/CVF Winter Conference
on Applications of Computer Vision (WACV), Waikoloa, HI, USA,
2023, pp. 107-117.
[20] True, Julian, et al. “Weed Detection in Soybean Crops Using Custom
Lightweight Deep Learning Models.” Journal of Agriculture and Food
Research, vol. 8, Elsevier BV, Apr. 2022, p. 100308.
[21] M. Anul Haq and . , "Cnn based automated weed detection system
using uav imagery," Computer Systems Science and Engineering, vol.
42, no.2, 2022.
[22] Reedha, R., Dericquebourg, E., Canals, R., & Hafiane, A. (2022).
Transformer neural network for weed and crop classification of high
resolution UAV images. Remote Sensing, 14(3), 592.
[23] Shinde, A. K., & Shukla, M. Y. (2014). Crop detection by machine
vision for weed management. International Journal of Advances in
Engineering & Technology, 7(3), 818-826.
[24] Rai, Nitin, et al. "Applications of deep learning in precision weed
management: A review." Computers and Electronics in Agriculture
206 (2023): 107698.
[25] Maheswaran, S., et al. "Design and development of chemical free green
embedded weeder for row based crops." Journal of Green Engineering
10.5 (2020): 2103-2120.
[26] Singh, Vinayak, Mahendra Kumar Gourisaria, Harshvardhan GM, and
Tanupriya Choudhury. "Weed Detection in Soybean Crop Using Deep
Neural Network." Pertanika Journal of Science & Technology
31, no. 1 (2023).
[27] Zhang H, Wang Z, Guo Y, Ma Y, Cao W, Chen D, Yang S, Gao R.
Weed Detection in Peanut Fields Based on Machine Vision.
Agriculture. 2022.