Weed Detection: A Vision Transformer Approach For Soybean Crops
Sanjay M, School of CSE, VIT-AP University, Amaravati, India (sanjaymythili2002@gmail.com)
Mithisha Brilent Tavares, School of CSE, VIT-AP University, Amaravati, India (mithishatavares8303@gmail.com)
Deepashree P. Vaideeswar, School of CSE, VIT-AP University, Amaravati, India (dvaideeswar@gmail.com)
Ch. Venkata Rami Reddy, School of CSE, VIT-AP University, Amaravati, India (chvrr58@gmail.com)
Abstract: Unwanted plants called weeds grow among
agricultural crops, competing with them for nutrients, water, and
sunshine, resulting in severe output losses. Machine learning
algorithms, particularly Deep Learning models, have automated
the weed detection process. By leveraging annotated image
datasets, these algorithms can accurately classify and distinguish
weeds from crops. They have the potential to develop real-time,
autonomous weed detection systems, empowering farmers to make
informed decisions regarding weed control measures. This study
suggests using Vision Transformers to classify and identify weeds
in soybean farms. Recently, the significance of Vision
Transformers in the area of Computer Vision has grown as a
result of their capability to identify remote dependencies in images.
The method suggested in this paper makes use of Vision
Transformer's benefits by using a Deep Learning framework,
which improves the precision of weed identification and
categorization. This study’s dataset comprises 15,336 images
encompassing soil, grass weeds, broadleaf weeds, and soybean
crops. The experimental analysis of the proposed approach shows
that this approach outperforms a number of cutting-edge
techniques for the identification and categorization of weeds. The
accuracy of the proposed approach is 98.83%. This accuracy
outperforms that of other methods, such as Convolutional Neural
Networks (CNNs), Support Vector Machines, and other Deep
Learning models, by a significant margin. The proposed method
has the capacity to be used to enhance the accuracy and
dependability of weed identification and categorization in many
different crops, not just soybean. Consequently, these findings can
lead to improved weed management strategies and increased crop
yield.
Keywords: Image Detection, Weed Detection, Vision Transformers, Agriculture.
I. INTRODUCTION
The Soybean (Glycine Max), also known as Soja Bean or
Soya Bean, is a legume that grows annually and belongs to
the pea family (Fabaceae). Its seed is edible, and it is
considered the most important bean worldwide in terms of
economic significance, as it provides vegetable protein for
millions of people and serves as a raw material for numerous
chemical products. Soybean is an affordable and highly
nutritious source of protein, widely consumed by both
humans and animals in various parts of the world. The
soybean seed is composed of 17% oil and 63% meal, with
protein constituting 50% of the meal. Due to the absence of
starch, soybeans are considered an excellent protein source
for people with diabetes. Utilising techniques to optimise
productivity and improve product quality is crucial,
considering the significance of soy in the economic
landscape. Weed is a broad term used to describe any plant
that grows in an unwanted location. Throughout history,
humans have struggled to protect their crops from invasive
weeds. However, some plants initially classified as weeds
were later discovered to have useful properties and were
therefore cultivated. Conversely, some cultivated plants that
were introduced to new environments became invasive and
turned into weeds. This means that the definition of weeds is
constantly evolving, making it a relative term. The presence
of these plants, which compete with economically valuable
crops like soybeans, can cause significant harm. They make
it challenging to operate harvesting machinery and lead to
impurities and moisture in the grain. Weeds have numerous
detrimental effects on crops, such as competing for resources
like water, light, nutrients, and space. This competition
results in increased production costs, difficulties in
harvesting, decreased product quality, increased
susceptibility to pests and diseases, and reduced commercial
value of cultivated areas. While the Transformer design is not frequently employed in Computer Vision (CV) applications, it is standard in Natural Language Processing (NLP) tasks. In CV, attention is either used in conjunction with convolutional networks or, in certain cases, as a substitute for some of their components while preserving the basic architecture of the convolutional network. Convolutional architectures,
however, continue to be widely used in computer vision tasks
[1-9]. The latest models that make use of specialised attention patterns show theoretical promise, but they have not scaled well on the hardware accelerators available today. Hence, traditional ResNet-like architectures are still at the cutting edge of large-scale image recognition, as inferred from [10] and [11]. However, when the Vision Transformer (ViT) is pre-trained on larger datasets and evaluated on image recognition benchmarks such as ImageNet, CIFAR-100, and VTAB, it achieves outstanding results in comparison to the latest convolutional networks.
Additionally, it needs fewer computational resources for its
training. ViT is a type of advanced deep learning model that
has become increasingly popular in computer vision research
in recent years. In contrast to traditional CNNs that use convolution
layers to extract spatial features from images, ViTs utilise
self-attention mechanisms to model the relations between
various image patches directly. This approach enables ViTs
to capture both local and global image information more
effectively, making them more robust and efficient than
CNNs in certain applications. The ViT architecture was
introduced in a paper by Dosovitskiy et al. in 2020 [12],
demonstrating superior performance on various image
classification benchmarks. ViTs have been extensively
applied to a diverse range of CV tasks. These include image
generation, object detection, and semantic segmentation,
among others. By leveraging self-attention mechanisms to
model image patches, ViTs can effectively learn and capture
intricate features in images, which enables them to perform
well on various visual recognition tasks. As a result, ViTs
have become an increasingly popular choice for researchers
and practitioners in the computer vision community. ViTs
have a distinct advantage over traditional CNNs in their
capability to model long-range dependencies between
different image patches, which is a challenging task for
CNNs. This attribute enables ViTs to extract more global
features of an image, resulting in more precise predictions.
Moreover, ViTs are highly scalable, which means they can be
trained on large datasets and achieve even better
performance. This scalability and flexibility have made ViTs
an appealing option for various applications in computer
vision research. The objectives of this study include the following:
• Developing a model based on Vision Transformer that can accurately detect and classify weeds in soybean crops.
• Evaluating the model on standard metrics.
• Comparing the performance of the proposed model with existing weed detection models.
II. LITERATURE REVIEW
Image classification has emerged as a promising analytical
technique in the field of agriculture. This is because image
recognition and classification technologies can assess crops
based on their appearance, thereby eliminating the need for
expert analysis and costly and time-consuming experiments.
With its ability to provide accurate and rapid results at a lower
cost, image classification has gained significant popularity in
the agriculture industry. The practical uses of image
classification in agriculture are numerous, including tasks
such as soil assessment, leaf analysis, weed detection, pest
control, plant monitoring, disease recognition, and fruit/food
grading. These applications have been identified as valuable
tools for enhancing agricultural productivity and efficiency.
An approach to monitoring plant health was proposed in [13] by constructing a system to detect iron, nitrogen, and zinc deficiencies through observation of lettuce leaves. The
first step of the proposed method was to segment images into
‘leaf’ and ‘background’ classes using Artificial Neural
Networks (ANN). Further, RGB and HSI representation of
images were extracted. These parameters were subsequently
Fig. 1. Input images of each class for prediction: (a) Soil, (b) Soybean, (c) Grass, (d) Broadleaf
given to statistical classifiers and neural networks, which
predicted the plant state with a 92% accuracy. Two
approaches were compared in [14] to identify and categorize
three different citrus illnesses. In total, 39 texture attributes
were gathered, and four subsections of the attributes were
produced. The two methods employed in the system were Fast Image Processing (FIP), which can quickly process images and provide results in real time, and Robust Crop Row Detection (RCRD), which is slower but more precise and is used to rectify errors made by the FIP.
The system combined both methods to function effectively in
a wide range of situations and provide highly precise
outcomes. It can accurately identify about 95% of weeds and
80% of crops even when there are variations in factors such
as lighting, soil moisture, and the growth of weeds/crops.
There are several studies that use ViT for multiple image
processing tasks. An empirical study was conducted in [15]
to explore the effects of several elements on the performance
of Vision Transformer (ViT) models trained on the ImageNet-21k dataset, including the volume of training data, data regularization and augmentation (AugReg), model size, and compute budget. The study found that by increasing both the compute budget and the use of AugReg, it is possible to achieve performance similar to that of models trained on ten times more data. Additionally, the ViT models trained on the ImageNet-21k dataset not only matched but also outperformed their counterparts trained on larger datasets. A research paper [16] proposes a mineral
recognition classification model based on the ViT
architecture. The model is trained and tested on a dataset
containing 2000 images of 12 different minerals, which is
augmented with data enhancement techniques to increase the
model's generalization capability. To enhance the feature
extraction process, a self-attentive mechanism is introduced.
Additionally, a new activation function is used to speed up
the convergence of the model during training. An accuracy of
96.08% is attained by the proposed model in mineral
recognition classification. ViT models with different augmentation strategies were used to classify breast ultrasound images in [17]. Due to the imbalance in the considered dataset, a weighted cross-entropy loss function was applied. The results in [17] imply that, when it comes to classifying breast ultrasound images, ViT models perform on par with or even better than CNNs. Yet another use of Vision Transformers in the medical domain can be seen in [18], where the study provided a comprehensive scrutiny of skin-lesion image segmentation using U-Net and attention-based methods. As per [18], the hybrid TransUNet, which pairs a CNN with a transformer, performed best with an accuracy of 92.11%. A ViT-based fire warning model was suggested in [19]. The ViT's performance surpassed that of CNNs such as VGG and ResNet. On small datasets, the reported accuracy was 97.4%, while on large datasets it was 97.03%. Another study [20] proposed several improvements
to enhance the performance of ViT. Firstly, it suggested that
using a hybrid architecture, which combines convolutional
neural networks (CNNs) with transformers, is much more
effective than using plain transformers alone. Second, the
study suggested adding two branches to the architecture to
gather local information from patch tokens and global
information from the categorization token, which would then
be combined to create a global image representation. Thirdly,
the proposed study suggests gathering multi-layer
characteristics from the transformer encoders of every
branch. This helps improve the representation of complex
image features. As stated earlier, the agriculture industry has
recognized the importance of weed detection due to the
detrimental impact of weeds on crop nutrition and water
absorption. As a result, different approaches for weed
detection have emerged, ranging from manual methods to
more advanced vision-based systems. Examining various
studies and approaches to weed detection revealed studies
such as [21] that detected soybean weeds using multiple CNN
architectures such as MobileNet, ResNet50, and three
different versions of CNNs. A 5-layer CNN architecture from
3 custom networks was selected for deployment as it yielded
the highest accuracy of 97.7%. A CNN-LVQ model was used in [22] for weed detection in soybean crops. A total of
4400 images captured by Unmanned Aerial Vehicles (UAVs)
were used to identify weeds in soybean fields, and the model
achieved an impressive accuracy of 99.79%. In a separate
study [23], crop and weed images taken by UAVs in fields of
beetroot, spinach, and parsley were classified using a Vision
Transformer and the self-attention paradigm. The results
showed that the Vision Transformer surpassed other
advanced techniques, including ResNet and EfficientNet,
achieving an accuracy of 99.14% for 13,596 testing images.
In other words, the Vision Transformer demonstrated
superior performance compared to other deep learning
models in accurately identifying crops and weeds in aerial
images acquired by UAVs. Rai et al. [24], in their review
study, conducted a comprehensive analysis and comparison
of 60 deep learning-based models used for site-specific weed
management, providing detailed insights into the
performance of each model included. Many of the included
models achieved an accuracy of over 90%; however, it is
important to note that these accuracies were evaluated on
different datasets. Maheswaran et al. [25] introduced an
innovative autonomous weeder aimed at effectively
eliminating weeds. Their design incorporates flexible
rotavator blades with high torque capabilities, which are
controlled using machine vision technology and Raspberry
Pi. Remarkably, their approach achieved a notable accuracy
rate of 90% in successfully removing weeds.
III. METHODOLOGY
Recent deep neural network models called Vision
Transformers (ViT) have demonstrated considerable potential
in computer vision applications including segmentation, object detection, and image classification.
Transformers, which were created initially for natural
language processing (NLP) activities, served as a foundation
for this ViT. Generally, for image classification applications,
CNNs have been the most widely used architecture in
computer vision. The Vision Transformer (ViT) is a
comparatively newer architecture that has recently acquired
popularity in the field of CV. For image classification, ViT
employs a transformer architecture that was first created for
natural language processing. This research paper aims to use
ViT for weed detection in soybean crop application.
The ViT architecture was introduced in a paper by Dosovitskiy et al. in 2020 [11]. Convolutional, pooling, and fully connected layers make up the classic CNN design. These models are computationally intensive and require a large number of parameters. On the other hand, the transformer design
is founded on self-attention mechanisms that are computationally efficient and require fewer parameters. The transformer
architecture has achieved cutting-edge performance on
various benchmark datasets and has been very effective in
NLP. Applying the transformer architecture, which was
initially created for NLP, to image data is the main concept
behind the ViT. A neural network architecture called the
transformer has proved very effective for NLP tasks like
sentiment analysis, language modelling, and machine
translation. Its foundation is the idea of self-attention, which
enables the model to concentrate on different input sequence
components when making predictions.
A. Architecture
The ViT architecture consists of three phases: the first stage
involves splitting the input image into patches, the second
stage involves flattening and linearly projecting the patches
to the transformer, and the third stage involves encoding the patches in the transformer and sending them to the feed-forward network, where the final predictions are made.
The self-attention technique, which is used to derive
correlations in long-range and contextual information from
the incoming input, is an essential component of the
transformer architecture. A ViT model will focus on various
areas of the input data according to their relevance to the job
at hand, thanks to the self-attention mechanism. Self-attention is the transformer's fundamental building block. The
self-attention mechanism calculates a weighted sum of the
input data as a result, with the weights determined by how
similar the input characteristics are. This enables the model
to give the pertinent input characteristics greater weight,
which aids in capturing more accurate representations of the
input data. Self-attention, therefore, is a computational
primitive that quantifies paired entity interactions and enables
a network to understand the hierarchies and alignments
contained in incoming data. For visual networks to acquire
greater resilience, attention has been shown to be a crucial
component. Mathematically, the output of self-attention is computed as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$  (1)

where $Q$, $K$, and $V$ are the query, key, and value matrices obtained by linearly projecting the input embeddings, and $d_k$ is the dimension of the key vectors.
Transformer is composed of numerous modules for self-
attention. The transformer in this study is made up of 8 self-
attention modules. The patch embedding layer and the
transformer encoder are the two primary parts of the ViT
architecture. The patch embedding layer divides the input
image into non-overlapping pieces before flattening the
pieces into vectors. These patch embeddings are sent to the
encoder of the transformer, which consists of multiple layers
of feedforward and self-attention neural networks. The
feedforward networks enable the model to collect local
spatial information, while the self-attention mechanism
permits the model to extract global context from the input
image. The transformer encoder's output is routed into a
classification head, which creates the model's final output.
The input image is split into a matrix of patches for the ViT,
which is then flattened and supplied into the transformer
model. In the same way that words are used as input tokens
in NLP tasks, the patches act as the model's input tokens.
After processing these input tokens, the transformer creates a
feature vector that can be applied to classification or other
subsequent tasks. The self-attention mechanism is used by the
ViT to capture both global and local correlations among the
input patches. This enables the model to acquire finer-grained
characteristics and more accurately represent the image's
semantic content.
Below pseudo code shows the algorithm that ViT uses:
Input: Image of 72x72x3 pixels
Output: Image class {broadleaf, grass, soil, soybean}
Step-I: The input image is split into smaller patches of a fixed
size (e.g., 6x6 pixels).
Step-II: Each patch is then fed through a CNN to extract a
fixed-size feature vector (embedding) for that patch. CNN is
typically a simple 3-layer network with small filters.
Step-III: The patch embeddings are then flattened and
arranged in a 2D grid, with one row per patch and one column
per embedding feature.
Fig. 2. Architecture of Vision Transformer
Step-IV: A fixed positional embedding is added to each patch embedding to help the model learn the spatial relationships between patches.
Step-V: Then patch embeddings are sent through a stack of
Transformer encoder layers (e.g., 8 layers), which allows the
model to go through different parts of the image and capture
long-range dependencies.
Step-VI: The final output of the Transformer encoder is a
single vector representing the entire image.
Step-VII: This vector is passed through a softmax classifier
to make the final prediction class label of the input image.
Step-VIII: End
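The steps above can be translated into a compact Keras sketch of the described pipeline (72x72 inputs, 6x6 patches, 8 transformer encoder layers, 4 attention heads, and a softmax over the four classes). This is a hedged illustration, not the exact code trained in this study: the 64-dimensional patch projection and the MLP width are assumed values, a learned positional embedding is used in place of the fixed one described in Step-IV, and the per-patch projection is a simple linear layer rather than the small CNN mentioned in Step-II.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES, IMAGE_SIZE, PATCH_SIZE = 4, 72, 6
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2        # 144 patches per image
PROJ_DIM, NUM_HEADS, NUM_LAYERS = 64, 4, 8           # PROJ_DIM is an assumed value

class Patches(layers.Layer):
    """Step I: split the image into non-overlapping PATCH_SIZE x PATCH_SIZE patches."""
    def call(self, images):
        patches = tf.image.extract_patches(
            images,
            sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
            strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        return tf.reshape(patches, (tf.shape(images)[0], NUM_PATCHES, -1))

class PatchEncoder(layers.Layer):
    """Steps II-IV: project each patch and add a (learned) positional embedding."""
    def __init__(self):
        super().__init__()
        self.projection = layers.Dense(PROJ_DIM)
        self.position_embedding = layers.Embedding(NUM_PATCHES, PROJ_DIM)

    def call(self, patches):
        positions = tf.range(start=0, limit=NUM_PATCHES, delta=1)
        return self.projection(patches) + self.position_embedding(positions)

def build_vit():
    inputs = layers.Input((IMAGE_SIZE, IMAGE_SIZE, 3))
    x = PatchEncoder()(Patches()(inputs))
    # Step V: a stack of transformer encoder blocks (self-attention + feed-forward).
    for _ in range(NUM_LAYERS):
        x_norm = layers.LayerNormalization()(x)
        attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=PROJ_DIM)(x_norm, x_norm)
        x = x + attn                                          # residual connection
        mlp = layers.Dense(PROJ_DIM * 2, activation="gelu")(layers.LayerNormalization()(x))
        x = x + layers.Dense(PROJ_DIM)(mlp)                   # feed-forward + residual
    # Steps VI-VII: pool into a single vector and classify with a softmax head.
    x = layers.GlobalAveragePooling1D()(layers.LayerNormalization()(x))
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_vit()
model.summary()
```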
Before fine-tuning the model for a particular job, the ViT
employs a method known as pretraining to learn a general-
purpose representation of image data. The model is trained by
a self-supervised method of learning using a
large image dataset during the pre-training phase. To do this, some input patches are randomly masked, and the model is then asked to predict the content of the masked patches. In doing so, the model develops the ability to recognise complex visual cues that can be used for a number of downstream tasks. Figure 2 shows the complete architecture of the ViT used in this research paper.
B. Loss Function
The loss function used for the ViT is the Sparse Categorical
Cross Entropy (SCCE). In the SCCE loss the model's
predicted probabilities for each class are compared to the true
label, which is represented as an integer value corresponding
to the correct class. Unlike categorical cross entropy, which
compares the predicted probability distribution to a one-hot
encoded vector, SCCE does not require the true label to be
one-hot encoded. This makes it more efficient and memory-
friendly, especially when the number of classes is large. The
SCCE loss function is given by,
 󰇛󰇜
 (2)
Where, represents the truth label and represents the
probability for the class obtained by applying softmax
function.
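For reference, a small sketch of how this loss behaves with integer labels is given below (the label and probability values are illustrative); tf.keras.losses.SparseCategoricalCrossentropy accepts the integer class indices directly, without one-hot encoding.

```python
import tensorflow as tf

# Integer class labels (0..3) are used directly; no one-hot encoding is needed.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

y_true = tf.constant([1, 3])                       # true classes as integers
y_pred = tf.constant([[0.05, 0.90, 0.03, 0.02],    # softmax probabilities over 4 classes
                      [0.10, 0.10, 0.10, 0.70]])
print(loss_fn(y_true, y_pred).numpy())             # mean of -log(0.90) and -log(0.70), about 0.231
```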
IV. EXPERIMENTAL SETUP
A. Dataset Description
The dataset considered for this study is an image dataset for
weed detection in soybean crops. It consists of a total of
15,336 image segments, of which 3,249 are soil segments,
7,376 are soybean segments, 3,520 are grass segments, and
1,191 are broadleaf weed segments. The segments were
extracted from 400 images captured by an unmanned aerial
vehicle (UAV) and were manually annotated with their
respective class. The images were pre-processed using the
SLIC algorithm implemented in the Pynovisão software,
resulting in high-quality image segments that are suitable for
training and evaluating deep learning models for weed
detection. The dataset is well-balanced, with each class
containing a significant number of segments, allowing for
accurate evaluation of model performance.
TABLE I. DATASET DESCRIPTION

Class number    Total Appearances
0               3249
1               7376
2               3520
3               1191
Total No. of Images: 15336
The purpose of this study is to detect the presence of weeds in UAV-captured soybean crop images, with high accuracy and low complexity, such that the model can be easily deployed on agricultural devices. This is achieved by training a ViT on the dataset. This study’s dataset contains 15,336 images categorized into 4 classes: “Soil” (3249), “Soybean” (7376), “Grass” (3520), and “Broadleaf Weeds” (1191). A pre-processed version of the dataset is used, in which the images were segmented and manually annotated. Further, the dataset is segregated into train (81.00%), validation (8.99%), and test (9.99%) sets.
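A hedged sketch of how such a split could be produced with TensorFlow is shown below; the directory name, seed, and the exact way the held-out portion is divided between validation and test are assumptions for illustration, not the procedure used to create the splits in this study.

```python
import tensorflow as tf

# Assumed layout: one sub-directory per class (soil, soybean, grass, broadleaf).
DATA_DIR, IMG_SIZE, BATCH = "dataset/", (72, 72), 256

# First carve out roughly 19% for validation + test, then split that portion in half.
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.19, subset="training", seed=42,
    image_size=IMG_SIZE, batch_size=BATCH)
holdout_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR, validation_split=0.19, subset="validation", seed=42,
    image_size=IMG_SIZE, batch_size=BATCH)

n_batches = tf.data.experimental.cardinality(holdout_ds).numpy()
val_ds = holdout_ds.take(n_batches // 2)    # roughly 9% of the data
test_ds = holdout_ds.skip(n_batches // 2)   # roughly 10% of the data
```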
B. Data Augmentation
The dataset used for training was modified by randomly
flipping images horizontally and vertically, as well as
zooming in or out with a factor of 0.2 for both height and
width. Additionally, random rotation with a factor of 0.2 was
applied to the images. After these modifications, the images
were normalized. Fig 3 below shows some sample images
after applying augmentation and normalization.

Fig. 3. Augmented Images
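A minimal sketch of this augmentation and normalization pipeline using Keras preprocessing layers might look as follows; it mirrors the operations described above, but the exact layer ordering and the commented adapt step are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation pipeline mirroring the description above: random horizontal and
# vertical flips, random rotation (factor 0.2), random zoom (factor 0.2 in height
# and width), plus normalization of the pixel values.
data_augmentation = tf.keras.Sequential([
    layers.Normalization(),
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(factor=0.2),
    layers.RandomZoom(height_factor=0.2, width_factor=0.2),
])

# The Normalization layer must see the training data once to compute its mean and
# variance (train_images is an assumed array or dataset of training images).
# data_augmentation.layers[0].adapt(train_images)
```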
C. Evaluation Metrics and Hyperparameter Tuning
The following hyperparameter values were considered: the learning rate is set to 0.001, the batch size to 256, the image size to 72 x 72, the patch size to 6 x 6, and the weight decay to 0.0001. The number of attention heads in the transformer is 4, and there are 8 transformer encoder layers in total. The model was trained for 500 epochs spanning over 3 hours.
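Putting the reported hyperparameters together, a training configuration along these lines could be used; the AdamW optimizer is an assumption (the study reports a weight decay value but does not name the optimizer), and model, train_ds, and val_ds refer to the earlier sketches.

```python
import tensorflow as tf

LEARNING_RATE, WEIGHT_DECAY = 0.001, 0.0001     # values reported above
NUM_EPOCHS = 500

# AdamW applies the stated weight decay; available as tf.keras.optimizers.AdamW in
# recent TensorFlow releases (or via TensorFlow Addons in older ones).
optimizer = tf.keras.optimizers.AdamW(learning_rate=LEARNING_RATE,
                                      weight_decay=WEIGHT_DECAY)

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# Batching (size 256) is already handled by the tf.data pipelines, so no
# batch_size argument is passed to fit().
history = model.fit(train_ds, validation_data=val_ds, epochs=NUM_EPOCHS)
```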
The proposed model was implemented using the Python programming language and the TensorFlow framework. The model training was done on the Kaggle platform’s kernel. The
training was conducted on a T4 x2 GPU. The standard
accuracy metric is used to evaluate the approach in this study.
As accuracy increases, the model's performance is better on
the training and test set. The other metrics used to evaluate
the model are precision, recall and F1 score. Each metric is
computed as follows:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (3)

$\mathrm{Precision} = \frac{TP}{TP + FP}$  (4)

$\mathrm{Recall} = \frac{TP}{TP + FN}$  (5)

$F1\ \mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (6)
where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively. A good model should have a high F1 score, high recall, high accuracy, and high precision. Precision and recall, however, typically trade off against one another; the F1 score is used to capture the balance between them. The confusion matrix can be used to identify the model’s strengths and weaknesses and the kinds of errors the model makes.
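These metrics and the confusion matrix can be computed from the test-set predictions, for example with scikit-learn as sketched below; the macro averaging and the assumption that test_ds is not reshuffled between passes are illustrative choices rather than details reported in this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Assumes test_ds yields (images, integer_labels) batches in a fixed order
# (e.g., built with shuffle=False), so predictions line up with the labels.
y_prob = model.predict(test_ds)                   # softmax probabilities, shape (N, 4)
y_pred = np.argmax(y_prob, axis=1)                # predicted class indices
y_true = np.concatenate([labels.numpy() for _, labels in test_ds])

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
print("F1 score :", f1_score(y_true, y_pred, average="macro"))
print(confusion_matrix(y_true, y_pred))           # rows = true class, cols = predicted class
```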
V. RESULTS AND ANALYSIS
The results of the ViT model demonstrate excellent performance
in image classification tasks. After training on a dataset of
images, the proposed model has attained an accuracy of
99.18% on the training set, and 99.2% accuracy on the
validation set, as shown in Table II. This high level of accuracy was achieved after training the model for 500 epochs, indicating that the model is capable of learning the patterns and features of the images over a prolonged period. Through the
hyperparameters mentioned above, a test accuracy of 98.83%
is attained with a final loss value of 0.0332, indicating that it
can generalize well to new, unseen images. Fig. 4a shows the model's accuracy after each epoch, from which it can be observed how the model's performance varies over training. Fig. 4b shows the loss of the model after each epoch; as the number of epochs increases, the loss decreases.
Fig. 5 shows the confusion matrix of the proposed model. As indicated in Fig. 5, class 1, i.e., “Grass”, has the least classification accuracy of 93%. The remaining three classes 0, 2, and 3, i.e., “Broadleaf”, “Soil”, and “Soybean”, have accuracies of 100%, 98%, and 99%, respectively. The model's attention
mechanism enables it to concentrate on crucial areas of the
image when generating predictions, allowing it to attain very
high levels of accuracy. The model's architecture, which
incorporates a multi-head attention mechanism and numerous
attention blocks, additionally influences how effectively it
performs.
Table 3 provides an overview of the performance and
accuracy of different models proposed by other studies
including the model proposed in this study. The purpose of
this table is to showcase the effectiveness of the proposed model in comparison to other existing models in the field. The second row of Table III represents the DNN models proposed by Singh, Gourisaria, et al. [26]. These models were trained
on the same dataset of 15,336 images and achieved an
Fig. 4. Model performance after each epoch: (a) model accuracy versus epochs; (b) model loss versus epochs
TABLE II. EVALUATION METRICS

Training Accuracy      99.18%
Validation Accuracy    99.2%
Testing Accuracy       98.83%
Precision              97.07%
Recall                 97.49%
F1 Score               97.28%
accuracy of 94.58%. While these models show good
performance, they are outperformed by the proposed model
of this study. Overall, the proposed model in this study
exhibits the highest accuracy among the mentioned models,
achieving an accuracy of 98.83% on a dataset of 15,336
images. This highlights the effectiveness of the proposed
model in accurately predicting the desired output and
suggests its potential for practical applications in the field.
TABLE III. COMPARISON OF RESULTS WITH EXISTING MODELS

Author/Model                                     Dataset Size     Accuracy
The proposed model in this study                 15,336 images    98.83%
Singh, Gourisaria, et al. (DNN models) [26]      15,336 images    94.58%
Zhang, Wang, et al. (EM-YOLOv4-Tiny) [27]        855 images       96.70%
True, Julian, et al. (CNN models) [21]           400 images       97.7%
VI. CONCLUSION AND FUTURE WORK
This study suggests utilizing vision transformers for detecting
weeds in soybean crops. The proposed approach achieved high accuracy, making it suitable for deployment on agricultural devices. The ViT model was trained on the dataset, and the model was evaluated using the standard accuracy metric. The
model achieved a test accuracy of 98.83% with a final loss
value of 0.0332, which indicates that the proposed approach
is effective in detecting weeds in soybean crops. The results
obtained in this study demonstrate the potential of vision
transformers in weed detection applications.
There are several possible directions for further research in
this study. One possible way is to test the effectiveness of the
proposed method on larger datasets with more diverse
categories. Another area of research is to measure the
efficiency of the proposed method under different lighting
and environmental conditions. Moreover, it would be
fascinating to explore the effectiveness of the proposed
method in real-time scenarios. Furthermore, the proposed
method can be expanded to identify other types of weeds in
various crops. Finally, the proposed approach can be
integrated with agricultural devices to provide real-time weed
detection, enabling farmers to take timely and informed
decisions to manage weeds in their crops.
REFERENCES
[1] Reddy, C.V.R., Reddy, U.S., Kishore, K.V.K. (2019). Facial emotion
recognition using NLPCA and SVM. Traitement du Signal, 36(1): 13-
22.
[2] VenkataRamiReddy, C., Kishore, K.K., Bhattacharyya, D., Kim, T.H.
(2014). Multi-feature fusion-based facial expression classification
using DLBP and DCT. International Journal of Software Engineering
and Its Applications, 8(9): 55-68.
[3] Ramireddy, C.V., Kishore, K.K. (2013). Facial expression
classification using Kernel-based PCA with fused DCT and GWT
features. In 2013 IEEE International Conference on Computational
Intelligence and Computing Research, IEEE, pp. 1-6. Reddy, C.V.R.,
Kishore, K.K., Reddy, U.S., Suneetha, M. (2016). Person identification
system using feature level fusion of multi-biometrics. In 2016 IEEE
International Conference on Computational Intelligence and
Computing Research (ICCIC), IEEE, pp. 1-6.
[4] Palakodati, S.S.S., Chirra, V.R.R., Yakobu, D., Bulla, S. (2020). Fresh
and rotten fruits classification using CNN and transfer learning. Revue
d'Intelligence Artificielle, 34(5): 617-622.
[5] Chirra, V.R.R., Uyyala, S.R., Kolli, V.K.K. (2021). Virtual facial
expression recognition using deep CNN with ensemble learning.
Journal of Ambient Intelligence and Humanized Computing, 12:
10581-10599.
[6] Chirra, V.R.R., Uyyala, S.R., Kolli, V.K.K. (2019). Deep CNN: A
machine learning approach for driver drowsiness detection based on
eye state. Revue d'Intelligence Artificielle, 33(6): 461-466.
[7] Y. LeCun et al., "Backpropagation Applied to Handwritten Zip Code
Recognition," in Neural Computation, vol. 1, no. 4, pp. 541-551, Dec.
1989.
[8] Krizhevsky, Alex & Sutskever, Ilya & Hinton, Geoffrey. (2012).
ImageNet Classification with Deep Convolutional Neural Networks.
Neural Information Processing Systems.
[9] Mahajan, A., Taliun, D., Thurner, M. et al. Fine-mapping type 2
diabetes loci to single-variant resolution using high-density imputation
and islet-specific epigenome maps. Nat Genet 50, 1505-1513 (2018).
[10] Kolesnikov, Alexander. “Big Transfer (BiT): General Visual
Representation Learning.” arXiv.org, 24 Dec. 2019.
[11] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16
words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929.
[12] Nejati, Hossein & Azimifar, Zohreh & Zamani, Mohsen. (2008). Using
fast fourier transform for weed detection in corn fields.
[13] Xavier P. Burgos-Artizzu, Angela Ribeiro, Maria Guijarro, Gonzalo
Pajares; “Real- time image processing for crop/weed discrimination
in maize fields”; Elsevier; 2010.
[14] Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., &
Beyer, L. (2021). How to train your vit? data, augmentation, and
regularization in vision transformers. arXiv preprint
arXiv:2106.10270.
[15] X. Cui, C. Peng and H. Yang, "Intelligent Mineral Identification and
Classification based on Vision Transformer," 2022 9th International
Conference on Dependable Systems and Their Applications (DSA),
Wulumuqi, China, 2022, pp.
[16] Gheflati B, Rivaz H. Vision Transformers for Classification of Breast
Ultrasound Images. Annu Int Conf IEEE Eng Med Biol Soc. 2022
Jul;2022:480-483.
[17] Gulzar Y, Khan SA. Skin Lesion Segmentation Based on Vision
Transformers and Convolutional Neural Networks - A Comparative
Study. Applied Sciences. 2022.
Fig. 5. Confusion Matrix of the Model
[18] Zhang, Kaidi, et al. “Fire Detection Using Vision Transformer on
Power Plant.” Energy Reports, vol. 8, Elsevier BV, Nov. 2022, pp.
65764.
[19] C. H. Song, J. Yoon, S. Choi and Y. Avrithis, "Boosting vision
transformers for image retrieval," 2023 IEEE/CVF Winter Conference
on Applications of Computer Vision (WACV), Waikoloa, HI, USA,
2023, pp. 107-117.
[20] True, Julian, et al. “Weed Detection in Soybean Crops Using Custom
Lightweight Deep Learning Models.” Journal of Agriculture and Food
Research, vol. 8, Elsevier BV, Apr. 2022, p. 100308.
[21] M. Anul Haq and . , "Cnn based automated weed detection system
using uav imagery," Computer Systems Science and Engineering, vol.
42, no.2, 2022.
[22] Reedha, R., Dericquebourg, E., Canals, R., & Hafiane, A. (2022).
Transformer neural network for weed and crop classification of high
resolution UAV images. Remote Sensing, 14(3), 592.
[23] Shinde, A. K., & Shukla, M. Y. (2014). Crop detection by machine
vision for weed management. International Journal of Advances in
Engineering & Technology, 7(3), 818-826.
[24] Rai, Nitin, et al. "Applications of deep learning in precision weed
management: A review." Computers and Electronics in Agriculture
206 (2023): 107698.
[25] Maheswaran, S., et al. "Design and development of chemical free green
embedded weeder for row based crops." Journal of Green Engineering
10.5 (2020): 2103-2120.
[26] Singh, Vinayak, Mahendra Kumar Gourisaria, Harshvardhan GM, and
Tanupriya Choudhury. "Weed Detection in Soybean Crop Using Deep
Neural Network." Pertanika Journal of Science & Technology
31, no. 1 (2023).
[27] Zhang H, Wang Z, Guo Y, Ma Y, Cao W, Chen D, Yang S, Gao R.
Weed Detection in Peanut Fields Based on Machine Vision.
Agriculture. 2022.