All content in this area was uploaded by Parameswaran Ramesh on Nov 06, 2023
979-8-3503-1414-4/23/$31.00©2023 IEEE
Machine Learning Approach for Soil Nutrient
Prediction
K Gurubaran
Department of Electronics
Engineering
Madras Institute of Technology,
Anna University
Chennai, India
gurubarankathirvel2006@gmail.com
Poornesh S
Department of Electronics
Engineering
Madras Institute of Technology,
Anna University
Chennai, India
poornesh123s@gmail.com
Shwetha L S
Department of Electronics
Engineering
Madras Institute of Technology,
Anna University
Chennai, India
shwetha.ls0712@gmail.com
Deepak Athipan A M B
Department of Information
Technology
Madras Institute of Technology,
Anna University
Chennai, India
athipan2703@gmail.com
Vidhya N
Department of Electronics
Engineering
Madras Institute of Technology,
Anna University
Chennai, India
vin.vidhya612@gmail.com
Shabana Parveen M
Department of Electronics and
Communication Engineering
Sri Sairam Engineering College,
Anna University
Chennai, India
shabanaafsar08@gmail.com
Parameswaran Ramesh
Department of Electronics
Engineering
Madras Institute of Technology,
Anna University
Chennai, India
parameswaran0789@gmail.com
Bhuvaneswari P.T.V
Department of Electronics
Engineering
Madras Institute of Technology,
Anna University
Chennai, India
ptvbmit@annauniv.edu
Abstract—Soil nutrient prediction is of utmost importance in
agricultural practices as it directly affects crop productivity and
sustainable farming. In order to forecast soil nutrient levels, this
study compares the effectiveness of two machine learning
algorithms: K-Nearest Neighbors (KNN) regression and Multiple
Linear Regression (MLR). Atomic emission spectroscopy data and
conventional laboratory data pertaining to macro and
micronutrients present in soil taken from seven different zones of
Tamil Nadu are involved in the construction of the customized
dataset that is used for this investigation. Initially, the data in the
dataset is pre-processed to filter outliers. Both KNN regression
and MLR were trained and assessed using the gathered dataset in
the comparison study. The r2_score is used as the performance
metric to evaluate the proportion of variance in nutrient
concentration that each model explains. According to the
findings, KNN regression works better than MLR at predicting
soil nutrient levels. Due to its capacity to capture complex
interactions between soil nutrient concentrations and predictor
variables, KNN regression shows improved accuracy and
generalization capabilities. The findings have important
implications for farmers and decision-makers, offering insightful
information to support wise choices for land management and
environmentally friendly farming methods.
Keywords—Machine Learning, Atomic Emission spectral Data,
K-Nearest Neighbor regression, MLR and r2_score.
I. INTRODUCTION
Soil nutrients play a crucial role in determining the fertility
and productivity of agricultural land. Understanding the
composition and significance of these nutrients is essential for
effective soil management and sustainable crop production. This
research paper aims to explore the prediction of soil nutrients
using machine learning techniques as an alternative to
conventional methods.
For a plant to grow and flourish, macronutrients and
micronutrients are essential. Macronutrients, which include
Nitrogen (N), Phosphorus (P), Potassium (K), Calcium (Ca),
Magnesium (Mg), and Sulphur (S), are needed in large amounts.
In contrast, micronutrients, namely Chlorine (Cl), Molybdenum
(Mo), Boron (B), Copper (Cu), Manganese (Mn), Zinc (Zn), and
Iron (Fe), are needed in smaller amounts to maintain plant
health [1].
Conventional methods for soil nutrient prediction have been
widely used in agricultural practices [2]. These methods include
laboratory analysis of soil samples, soil testing kits, and nutrient
extraction procedures. While these methods provide valuable
information about the nutrient content of the soil, they have
several limitations. One major limitation of conventional soil
nutrient prediction methods is their time-consuming nature.
These methods often require extensive laboratory work and can
take days or even weeks to obtain results. Additionally, they are
expensive, requiring specialised equipment and trained
personnel. Moreover, these methods provide localised
information and may not capture the spatial variability of soil
nutrient levels accurately.
To address these limitations, Machine Learning (ML)
approaches have emerged as a promising alternative for soil
nutrient prediction. They offer several advantages over
conventional methods. They are faster, more cost-effective, and
can provide real-time predictions. Furthermore, these models
have the potential to capture the complex relationships between
soil properties and nutrient levels, enabling better understanding
and management of soil nourishment.
The main objectives of this work are (i) to utilise ML
algorithms for soil nutrient prediction and (ii) to analyse the
performance of KNN regression and MLR models for soil
nutrient prediction.
The remainder of the paper is organised as follows.
Section II reviews existing work related to this research.
Section III presents the proposed methodology. The results
and discussion are presented in Section IV. Section V concludes
the study and emphasises the importance of the research.
II. LITERATURE SURVEY
Multiple Linear Regression (MLR) was employed by the
author in [3] to forecast soil macronutrients. It is a statistical
technique that forecasts the dependent or response variable
using a number of independent or explanatory variables. It
shows how the dependent and independent variables are related.
This method uses soil properties including pH, electrical
conductivity, nitrogen, phosphorus, and potassium to show the
link between these elements and predict NPK values. The
predicted NPK data is around 80% accurate when compared to
the actual dataset. These results help farmers make better
decisions about how much fertiliser to apply and how to
maximise crop output.
Real-time environmental data for Mangalore, Kodagu,
Kasaragod, and a few other Karnataka state districts are gathered
in [4]. This data comprises soil type, rainfall, humidity, and other
variables. The datasets are arranged in an orderly fashion. The
KNN approach is then used to forecast crop yield and to assess
the accuracy of the crop prediction.
In [5], the authors discussed a novel method for evaluating
soil nutrients. First, the Pearson correlation coefficient, the
least absolute shrinkage and selection operator (LASSO), and the
gradient-boosting decision tree were compared, with partial least
squares regression used to find the optimal approach. Second,
using 10-fold cross-validation, the most accurate among linear
regression-based methods and non-linear neural network-based and
support vector machine (SVM)-based algorithms were identified for
estimating soil nitrogen, total phosphorus, and total potassium
concentrations.
The author of [6] described a technique for building
regression-based ML models to evaluate fundamental
macronutrients and micronutrients in soil. The three most
significant nutrients, N, P, and K, commonly known as the key
nutrients of soils, were examined using the models in order to
ascertain how the presence of N affects other critical soil
nutrients.
The author of [7] described a practical technique for
producing a high-resolution spatial map of soil nutrients.
Because the KNN-based CNN methodology can forecast soil
nutrient values at a lower computational cost than traditional
geo-statistical methods, and with real-time processing capability,
it is thought to be useful for mapping soil nutrients. The soil
health card of a specific field sampling site is used to create the
soil nutrients N, P, K, and OC map. Nutrient distributions can
effectively represent spatial changes. This study lays the
foundation for precise fertiliser application and makes a
technical contribution to precision agriculture.
In [8], the author suggests a system for classifying soil
according to macronutrients and micronutrients and identifying
the kinds of crops that can be grown in each type of soil. It uses
ML algorithms such as SVM, bagged trees, logistic regression,
and KNN.
Several limitations are observed in the existing state of the
art. Only a few soil nutrients are predicted in recent works,
most of which focus on estimating the concentrations of N, P,
and K. The source of the dataset is often not clearly stated in
the literature. The datasets in certain works rely on
hyperspectral remote sensing data, which requires tedious
pre-processing. Some works applied ML techniques only to very
small geographical soil zones and under laboratory conditions.
The model performance reported in some works is modest.
III. PROPOSED METHODOLOGY
In this section, an ML approach for predicting soil nutrients
is discussed in detail. Procedures involved in data pre-
processing and descriptions of the KNN regression and MLR
algorithms are elaborated.
A. Dataset Collection
Figure 1 represents the flow involved in the proposed work.
Initially, dataset collection is done. In this research, two soil
nutrient datasets (i) Atomic Emission spectroscopy (AES) data
and (ii) Chemical processing-based laboratory data are
considered. 45 soil samples representing seven agro-climatic
zones of Tamil Nadu are considered in this research. To enhance
the dataset, an open-source elementwise Atomic spectral
database (ASD) is used as a reference.
Fig. 1. Block diagram of Soil Nutrient Prediction.
B. Data Preprocessing
The data pre-processing procedure is accomplished using
set-theoretic operations. Let D1 denote the AES
dataset and D2 denote the ASD dataset. The intersection of D1
and D2 is represented in Figure 2, and its notation is expressed
in equation 1.
Fig. 2. Intersection of D1 and D2.
D1∩D2 (1)
If D2 is initially a subset of D1, as shown in Figure 3,
equation 1 simplifies to:
D1∩D2 = D2 (2)
Fig. 3. D2 is a subset of D1.
If D1∩D2 ≠ D2, then D2 is modified as follows to remove
the outliers.
Outlier Removal:
Method 1: Approximation
Each data point ‘d’ in D2 is approximated by a value in D1
if such a value lies within a range of ±2 units of ‘d’. This can
be represented mathematically as:
d ∈ D2, d±2 ∈ D1 (3)
Method 2: Elimination
If no match is found even after the approximation, the data
point from D2 is eliminated. Using set difference, this can be
represented as:
D2 = D2 – {d} (4)
Combining both methods, the modified ASD dataset (D2)
can be represented as:
D2 = (D2 ∩ (D1 ± 2)) – {d} (5)
The equation for the intersection of the AES dataset and the
modified ASD dataset is:
Intersection (D1, Modified D2) = Intersection (D1, (D2 ∩ (D1 ± 2)) – {d}) (6)
The process is repeated until: Intersection (D1, Modified
D2) = Modified D2
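The approximation and elimination steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: D1 and D2 are taken to be non-empty lists of numeric values, "within ±2 units" is read as an absolute difference of at most 2, and the function name `reconcile` and the sample values are hypothetical.

```python
def reconcile(d1, d2, tol=2.0):
    """Return a modified copy of d2 whose values all occur in d1.

    Method 1 (approximation): a point d in d2 is replaced by the closest
    value in d1 when that value lies within +/- tol of d.
    Method 2 (elimination): a point with no such match is dropped.
    """
    modified = []
    for d in d2:
        nearest = min(d1, key=lambda v: abs(v - d))  # closest value in d1
        if abs(nearest - d) <= tol:
            modified.append(nearest)  # approximation
        # otherwise the point is eliminated (not appended)
    return modified


# Hypothetical values, for illustration only.
d1 = [200.0, 214.5, 230.0, 251.0]
d2 = [201.5, 214.5, 240.0]

modified_d2 = reconcile(d1, d2)
print(modified_d2)  # [200.0, 214.5]

# Every remaining value of the modified D2 lies in D1,
# i.e. Intersection(D1, Modified D2) = Modified D2.
print(set(modified_d2) <= set(d1))  # True
```

In this example 201.5 is approximated to 200.0, 214.5 matches exactly, and 240.0 is eliminated because no value of D1 lies within 2 units of it.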
C. Feature Selection
The final dataset, after the removal of outliers, contains 35
columns of independent (predictor) features and 12 columns of
dependent (response) variables. The total size of the dataset is 450 rows
by 47 columns. The feature matrix consists of wavelengths in
the columns and corresponding intensities in the rows, as shown
in Figure 4.
Fig. 4. Feature Matrix.
The conventional laboratory data on concentrations of soil
nutrients form the dependent continuous variable matrix as
shown in Figure 5.
Fig. 5. Dependent Continuous Variable matrix.
D. Model Training
The KNN regression and MLR models are trained on the
pre-processed dataset. 80% of the pre-processed dataset is used
for training, and the remaining 20% is used for testing.
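An 80/20 split of the 450-row dataset can be sketched as below. The paper does not state how rows were assigned to each subset, so the random shuffle, the seed, and the helper name `train_test_split` are assumptions made only for this illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices, then split them into training and testing sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # assumed: random assignment of rows
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(450))  # stand-ins for the 450 rows of the final dataset
train, test = train_test_split(rows)
print(len(train), len(test))  # 360 90
```

With 450 rows this gives 360 training rows and 90 testing rows, and no row appears in both subsets.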
E. Multiple Linear Regression
Multiple Linear Regression (MLR) [15] is a statistical
technique used to determine the relationship between a
dependent variable and two or more independent variables by
fitting a linear equation to the data [9]. It extends simple
linear regression, which considers only one independent
variable, to situations in which there are several predictors.
The objective of MLR is to identify the best-fitting linear
equation that captures the relationship between the dependent
variable and the independent variables.
Y = β₀ + β₁X₁₁ + β₂X₁₂ + ... + βₚXᵢⱼ + e (7)
In the above equation, Y is the concentration of nutrients;
X₁₁, X₁₂, ..., Xᵢⱼ are the intensities of soil samples; β₀, β₁, β₂, ..., βₚ
are the coefficients of intensities of soil samples (Xᵢⱼ); e is the
error term.
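The coefficients in equation 7 can be estimated by ordinary least squares. The sketch below is a dependency-free illustration, not the authors' implementation: it assumes a full-rank feature matrix, solves the normal equations by Gauss-Jordan elimination, and uses synthetic, noise-free data. The names `fit_mlr` and `predict_mlr` are hypothetical.

```python
def fit_mlr(X, y):
    """Estimate [b0, b1, ..., bp] for Y = b0 + b1*x1 + ... + bp*xp.

    Solves the normal equations (A^T A) beta = A^T y, where A is X with a
    leading column of ones for the intercept b0.
    """
    A = [[1.0] + list(row) for row in X]
    n = len(A[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in A) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(A, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def predict_mlr(beta, row):
    """Evaluate the fitted linear equation for one sample."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic two-feature data generated from Y = 1 + 2*x1 + 3*x2 (no error term).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [1 + 2 * a + 3 * b for a, b in X]

beta = fit_mlr(X, y)
print([round(b, 6) for b in beta])
print(round(predict_mlr(beta, [6.0, 7.0]), 6))
```

Because the synthetic data contain no error term, the fitted coefficients recover the generating values 1, 2, and 3 up to floating-point rounding. In the setting of this paper, one such model would be fitted per nutrient, with the 35 intensity columns as features.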
F. K-Nearest Neighbor (KNN) Regression
KNN regression is a non-parametric technique that predicts
the value of a dependent variable by locating the K nearest
neighbors of a new data point in the training set and using the
average of those neighbors' dependent variable values as the
predicted value [10]. The main hyperparameters of the KNN
regressor are K (the number of nearest neighbors to consider)
and the distance metric (Euclidean distance, Manhattan distance,
etc.) used to measure the distance between data points. The
steps involved in KNN regression are as follows:
1. Compute Euclidean distance:
d(X₀, Xᵢⱼ) = √(∑ (X₀ - Xᵢⱼ)²) (8)
where X₀ is the intensity data of the test soil sample and
Xᵢⱼ is the intensity data of a training soil sample.
2. Find K nearest neighbors: Select the K training instances
with the smallest distances to X₀.
3. Aggregate target values: To determine the anticipated
target value for X₀, average the target values of the K closest
neighbors. This can be expressed as:
Y₀ = (1/K) * ∑(Yⱼ) (9)
where Y₀ is the predicted nutrient concentration of the test soil
sample and Yⱼ is the nutrient concentration of the jth nearest
neighbor.
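The three steps above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical data, assuming a brute-force search over all training samples and an unweighted average; the function name `knn_predict` is not from the paper.

```python
import math


def knn_predict(X_train, y_train, x0, k):
    """Predict the target of x0 as the mean target of its k nearest neighbors."""
    # Step 1: Euclidean distance from x0 to every training sample (equation 8).
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x0, x))), y)
        for x, y in zip(X_train, y_train)
    ]
    # Step 2: keep the k training samples with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: average their target values (equation 9).
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical intensities (two features) and concentrations.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = [10.0, 12.0, 14.0, 40.0, 42.0]

print(knn_predict(X_train, y_train, [0.5, 0.5], k=3))  # 12.0
```

The query point is closest to the first three training samples, so the prediction is the mean of 10, 12, and 14.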
G. Accuracy Analysis
r2_score is the performance metric used to determine
accuracy. It is sometimes referred to as the coefficient of
determination, which is a statistical indicator that shows how
much of the variance in the target or dependent variable in a
regression model can be accounted for by the independent
variables. It shows how well the model fits the data as it was
observed. The dependent variable's variation is explained
entirely by the model if the R2 score is 1, whereas a score of 0
means that the model does not account for any of the variance.
An improved fit of the regression model to the data is shown by
a higher R2 value. The equation for calculating the R2 score is
as follows:
R² = 1 - (SSR / SST) (10)
Where SSR (Sum of Squared Residuals) is the sum of the
squared differences between the predicted values and the actual
values, SST (Total Sum of Squares) is the sum of the squared
differences between the actual values and the mean of the
dependent variable.
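Equation 10 can be checked with a short hand-written function. The sketch below mirrors the definitions of SSR and SST given above; the name `r_squared` and the sample values are hypothetical, and the function assumes the actual values are not all identical (otherwise SST is zero).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSR / SST."""
    mean_y = sum(y_true) / len(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # squared residuals
    sst = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ssr / sst


y_true = [3.0, 5.0, 7.0, 9.0]

print(r_squared(y_true, y_true))                 # 1.0, perfect predictions
print(r_squared(y_true, [6.0, 6.0, 6.0, 6.0]))   # 0.0, always predicting the mean
print(r_squared(y_true, [2.0, 5.0, 7.0, 10.0]))  # close to 0.9
```

The second case shows why a score of 0 corresponds to a model that explains none of the variance: predicting the mean for every sample makes SSR equal to SST.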
IV. RESULTS AND DISCUSSION
A. Performance of MLR Model
The r2_Score values of MLR are shown in Table 1. The
graphical representation of accuracy vs. concentration of
nutrients in the MLR model is shown in Figure 6.
TABLE I. R2_SCORES OF MLR MODEL
Soil Nutrients    R2_score (MLR)    Accuracy in %
N                 0.56              56
P                 0.71              71
K                 0.73              73
Ca                0.88              88
Mg                0.81              81
S                 0.50              50
Zn                0.56              56
Mn                0.53              53
Fe                0.70              70
Cu                0.65              65
B                 0.69              69
Na                0.63              63
Fig. 6. Accuracy of MLR Model.
B. Performance of KNN Regression Model
The R2 Score values of KNN regression are shown in Table
2. The graphical representation of accuracy vs. concentration
of nutrients in the KNN regression model is shown in Figure 7.
TABLE II. R2_SCORES OF KNN REGRESSION MODEL
Soil Nutrients    R2_score (KNN)    Accuracy in %
N                 0.81              81
P                 0.87              87
K                 0.88              88
Ca                0.95              95
Mg                0.98              98
S                 0.74              74
Zn                0.90              90
Mn                0.87              87
Fe                0.88              88
Cu                0.90              90
B                 0.82              82
Na                0.97              97
Fig. 7. Accuracy of KNN Regression Model.
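The two sets of scores can be compared directly. The short script below only restates the reported values from Tables 1 and 2 (MLR and KNN regression, respectively) and computes the mean score and the per-nutrient gain; the variable names are chosen for this illustration.

```python
nutrients = ["N", "P", "K", "Ca", "Mg", "S", "Zn", "Mn", "Fe", "Cu", "B", "Na"]
mlr = [0.56, 0.71, 0.73, 0.88, 0.81, 0.50, 0.56, 0.53, 0.70, 0.65, 0.69, 0.63]
knn = [0.81, 0.87, 0.88, 0.95, 0.98, 0.74, 0.90, 0.87, 0.88, 0.90, 0.82, 0.97]

# Gain in R2 score of KNN regression over MLR, per nutrient.
gains = {n: round(k - m, 2) for n, m, k in zip(nutrients, mlr, knn)}

mean_mlr = sum(mlr) / len(mlr)
mean_knn = sum(knn) / len(knn)
print(f"mean R2: MLR {mean_mlr:.2f}, KNN {mean_knn:.2f}")
# mean R2: MLR 0.66, KNN 0.88

smallest = min(gains, key=gains.get)
print(smallest, gains[smallest])  # Ca 0.07
```

KNN regression scores higher than MLR for all 12 nutrients; the smallest gain is for Ca, where MLR already reaches 0.88.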
C. Hyperparameter Optimization Technique in KNN
Regression
From the accuracy analysis, it is observed that models
trained using KNN regression have higher accuracy than those
trained using MLR. The accuracy of the model was further
increased using hyperparameter optimization, which here means
choosing the value of K (the number of nearest neighbors
considered when making a prediction) that gives the highest
accuracy. Figure 8 shows that the accuracy of the model is
highest when the value of K is 4.
Fig. 8. Hyper Parameter Optimization in KNN Regression Model.
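A search over K of this kind can be sketched as follows. The paper does not describe the search procedure, so the hold-out validation split, the candidate range 1 to 10, and the synthetic one-feature data are all assumptions made for illustration; the best K found here has no bearing on the value reported in Figure 8.

```python
import math


def knn_predict(X_train, y_train, x0, k):
    """Mean target of the k training samples closest to x0 (Euclidean)."""
    ranked = sorted(
        zip(X_train, y_train),
        key=lambda pair: math.sqrt(sum((a - b) ** 2 for a, b in zip(x0, pair[0]))),
    )
    return sum(y for _, y in ranked[:k]) / k


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSR / SST."""
    mean_y = sum(y_true) / len(y_true)
    ssr = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ssr / sst


# Synthetic data: a smooth curve plus a small deterministic perturbation.
X = [[i / 10] for i in range(60)]
y = [math.sin(x[0]) + 0.1 * ((i * 7) % 5 - 2) for i, x in enumerate(X)]

# Assumed hold-out scheme: even rows for training, odd rows for validation.
X_tr, y_tr = X[0::2], y[0::2]
X_val, y_val = X[1::2], y[1::2]

# Score each candidate K on the validation rows and keep the best one.
scores = {
    k: r_squared(y_val, [knn_predict(X_tr, y_tr, x, k) for x in X_val])
    for k in range(1, 11)
}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best K:", best_k)
```

Plotting the scores against K would give a curve of the kind shown in Figure 8.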
D. Time Complexity
The time complexity of brute-force KNN regression is O(N * d)
per prediction, where N is the number of training data points and
d is the number of features; no model fitting is required
beforehand. On the other hand, fitting MLR using ordinary least
squares (OLS) through the normal equations costs O(N * d² + d³),
after which each prediction costs only O(d). Comparing these
complexities, MLR has the higher one-time training cost, whereas
the per-prediction cost of KNN regression grows with the number
of data points. This implies that, once trained, MLR predicts
faster than KNN regression for larger datasets.
V. CONCLUSION AND FUTURE WORK
The significance of data pre-processing was emphasized in
improving the accuracy and reliability of predictions in the ML
approach for soil nutrient prediction using KNN regression and
MLR. The handling of missing values, outliers, and data scaling
contributed to effectively capturing the relationships between
predictors and soil nutrient concentrations. The superiority of
KNN regression over MLR was observed in predicting the
concentration of 12 soil nutrients, which comprise both macro
and micro nutrients. The ability of KNN regression to consider
the proximity of data points proved advantageous in capturing
the non-linear relationships present in the dataset. Considering
time complexity, MLR requires more computation at training time
because a system of linear equations must be solved.
On the other hand, the prediction cost of KNN regression
increases with the size of the training dataset, making it more
suitable for smaller datasets. Future work can explore alternative
regression techniques, namely support vector regression or
random forest regression, as well as feature engineering to
enhance the predictive power of the models. Additionally, the
investigation of ensemble models, time series analysis, field
validation, and data collection can further improve the accuracy
and applicability of ML approaches for soil nutrient prediction,
facilitating more efficient and sustainable agricultural practices.
ACKNOWLEDGMENT
The authors express their heartfelt gratitude to the Centre for
Internet of Things (CIoT), Madras Institute of Technology
Campus, Anna University for generously providing the Atomic
emission spectroscopic dataset of soil from 7 agro-climatic
zones in Tamil Nadu. This dataset has been instrumental in
conducting the research presented at this conference.
REFERENCES
[1] Revati P. Potdar, Mandar M. Shirolkar, Alok J. Verma, Pravin S. More and Atul Kulkarni, “Determination of soil nutrients (NPK) using optical methods: a mini review”, Journal of Plant Nutrition, vol. 44, no. 12, 29 February 2021.
[2] R. Parameswaran and P. Bhuvaneswari, “Detection of macro and micro nutrients in potatoes using elemental analysis techniques”, Int. J. Recent Technol. Eng. (IJRTE), vol. 8, 2020, pp. 1033–1040.
[3] Madhumathi R, Arumuganathan T, Shruthi R and Raghavendar S, “Soil NPK Prediction Using Multiple Linear Regression”, 2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 25-26 March 2022.
[4] H. K. Karthikeya, K. Sudarshan and Disha S. Shetty, “Prediction of Agricultural Crops using KNN Algorithm”, International Journal of Innovative Science and Research Technology, vol. 5, no. 5, May 2020, pp. 1422-1424.
[5] Yiping Peng, Lu Wang, Li Zhao, Zhenhua Liu, Chenjie Lin, Yueming Hu and Luo Liu, “Estimation of Soil Nutrient Content Using Hyperspectral Data”, Agriculture, vol. 11, no. 11, 11 November 2021, pp. 1-17.
[6] S. Kaur and K. Malik, “Predicting and Estimating the Major Nutrients of Soil Using Machine Learning Techniques”, Soft Computing for Intelligent Systems, Algorithms for Intelligent Systems, Springer, Singapore, 23 June 2021, pp. 539–546.
[7] Kamal Das, Subhojit Mandal and Mainak Thakur, “High Resolution Spatial Mapping of Soil Nutrients Using K-Nearest Neighbor Based CNN Approach”, 2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September 2020 - 02 October 2020, pp. 1102-1105.
[8] N. Saranya and A. Mythili, “Classification of Soil and Crop Suggestion using Machine Learning Techniques”, International Journal of Engineering Research & Technology (IJERT), vol. 9, no. 02, February 2020, pp. 671-673.
[9] https://www.simplilearn.com/what-is-multiple-linear-regression-in-machine-learning-article
[10] https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
[11] Keerthan Kumar T G, Shubha C and Sushma S A, “Random Forest Algorithm for Soil Fertility Prediction and Grading Using Machine Learning”, International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 9, no. 1, November 2019, pp. 1301-1304.
[12] Oladipe Ebenezer Oluwole, E.O. Osaghae and Fredrick D. Basaky, “Machine learning Solution for Prediction of Soil Nutrients for Crop Yield: A Survey”, Journal of Multidisciplinary Engineering Science and Technology (JMEST), vol. 9, no. 9, September 2022, pp. 15519-15523.
[13] Hao Li, Weijia Leng, Yibing Zhou, Fudi Chen, Zhilong Xiu and Dazuo Yang, “Evaluation Models for Soil Nutrient Based on Support Vector Machine and Artificial Neural Networks”, Scientific World Journal, 7 December 2014, pp. 1-7.
[14] Arpit Rawankar, Mayurkumar Nanda, Hemant Jadhav, Prem Lotekar, Rahul Pawar, Libin Sibichan and Akshay Pangare, “Detection of N, P, K Fertilizers in Agricultural Soil with NIR Laser Absorption Technique”, 2018 3rd International Conference on Microwave and Photonics (ICMAP 2018), Dhanbad, India, 09-11 February 2018, pp. 1-2.
[15] K. Sri Dhivya Krishnan and P.T.V. Bhuvaneswari, “Multiple linear regression-based prediction model to detect hexavalent chromium in drinking water”, in H.S. Behera, J. Nayak, B. Naik and A. Abraham (eds.), Computational Intelligence in Data Mining, Springer Singapore, Singapore, 2019, pp. 493–504.