
Integrating artificial intelligence, machine learning, and deep learning approaches into remediation of contaminated sites: A review

October 2023
Chemosphere 345(8):140476


DOI:10.1016/j.chemosphere.2023.140476

Authors:

Jagadeesh Kumar Janga

University of Illinois at Chicago

Krishna R. Reddy

University of Illinois at Chicago

Raviteja KVNS

SRM University-AP

The growing number of contaminated sites across the world poses a considerable threat to the environment and human health. Remediating such sites is a cumbersome process, with the complexity originating from the need for extensive sampling and testing during site characterization. Selection and design of remediation technology is further complicated by the uncertainties surrounding contaminant attributes and concentrations, as well as soil and groundwater properties, which influence the remediation efficiency. Additionally, challenges emerge in identifying contamination sources and monitoring the affected area. Often, these problems are overly simplified and the data gathered is underutilized, rendering the remediation process inefficient. The potential of artificial intelligence (AI), machine learning (ML), and deep learning (DL) to address these issues is noteworthy, as their emergence has revolutionized data management and analysis. Researchers across the world are increasingly leveraging AI/ML/DL to address remediation challenges. The current study aims to perform a comprehensive literature review on the integration of AI/ML/DL tools into contaminated site remediation. A brief introduction to various emerging and existing AI/ML/DL technologies is presented, followed by a comprehensive literature review. In essence, ML/DL-based predictive models can facilitate a thorough understanding of contamination patterns, reducing the need for extensive soil and groundwater sampling. Additionally, AI/ML/DL algorithms can play a pivotal role in identifying optimal remediation strategies by analyzing historical data, simulating scenarios through surrogate models, optimizing parameters using nature-inspired algorithms, and enhancing decision-making with AI-based tools. Overall, with supportive measures like open-data policies and data integration, AI/ML/DL has the potential to revolutionize the practice of contaminated site remediation.


Chemosphere 345 (2023) 140476

Available online 20 October 2023

Integrating articial intelligence, machine learning, and deep learning

approaches into remediation of contaminated sites: A review

Jagadeesh Kumar Janga

, Krishna R. Reddy

, K.V.N.S. Raviteja

University of Illinois Chicago, Department of Civil, Materials, and Environmental Engineering, 842 West Taylor Street, Chicago, IL 60607, USA

SRM University AP, Department of Civil Engineering, Guntur, Andhra Pradesh 522503, India

HIGHLIGHTS

• Comprehensive review of AI/ML/DL techniques in site remediation is performed.
• Bibliometric analysis showed an increasing interest across the world on this topic.
• ML based predictive models can be used for spatial contamination prediction.
• Predictive data-driven models can surrogate complicated physical models.
• AI enables effective parameter optimization for efficient remediation design.

ARTICLE INFO

Handling Editor: Dr Y Yeomin Yoon

Keywords:

Environmental remediation

Big data

Surrogate models

Data-driven approach

Optimization

Decision-making


* Corresponding author.

E-mail addresses: jreddy3@uic.edu (J.K. Janga), kreddy@uic.edu (K.R. Reddy), raviteja.k@srmap.edu.in (K.V.N.S. Raviteja).


https://doi.org/10.1016/j.chemosphere.2023.140476

Received 21 August 2023; Received in revised form 15 October 2023; Accepted 16 October 2023


1. Introduction

Articial Intelligence/Machine Learning/Deep Learning (AI/ML/

DL) technologies have emerged as powerful tools with the potential to

revolutionize various elds, including environmental sciences and en-

gineering (Zhong et al., 2021). This study focuses on their application in

the specialized domain of contaminated site remediation. Contaminated

sites are a growing concern in developing countries, while developed

nations have been grappling with this issue for years. Sources such as

waste dumps, chemical spills, and agricultural practices contribute to

soil and groundwater contamination, leading to ecological and health

risks (Sharma and Reddy, 2004). Over the past few decades, many

contaminated sites have been identied in numerous locations across

the world that are posing a problem to the earth and the environment

ever since (Singh and Naidu, 2012). The presence of these contaminated

sites gives rise to several issues, including the contamination of drinking

water sources, the pollution of groundwater that can further contami-

nate surface waters, and the overall exposure of humans and ecosystems

to associated risks (Khan et al., 2004). Soil and groundwater pollution

poses a signicant challenge, that is comparable to air and surface water

pollution (Sharma and Reddy, 2004). To tackle this problem, various

public bodies such as the United States Environmental Protection

Agency (USEPA), Central Pollution Control Board of India (CPCB),

among many others are collaborating with researchers to develop

effective solutions to remediate such sites.

The process of cleaning up contaminated sites involves several key steps: site characterization, risk assessment, devising remedial goals and comparing alternatives, remediation design, and monitoring (Sharma and Reddy, 2004). Initially, site characterization is conducted to understand the nature and extent of contamination, followed by a thorough risk assessment to evaluate the potential harm to human health and the environment. Subsequently, if a contaminated site poses more than an acceptable risk to human health or the surrounding environment, suitable remediation techniques need to be identified and employed to address the contamination effectively.

Contaminated site remediation projects pose several challenges. Firstly, these projects are inherently costly and complicated due to the variability of sites and contaminant types (Lehr et al., 2002), and they generate a significant amount of data throughout the process. One of the most challenging tasks is site characterization, which involves collecting extensive data regarding site geology, soil stratigraphy, hydraulic properties, groundwater levels, contaminant types, concentrations, and dynamic biogeochemical parameters (Laha et al., 2000; Tao et al., 2022). Groundwater flow and contaminant transport modeling further contributes to the generation of large datasets, which can be time-consuming to comprehend.

Contaminated site remediation requires numerous laboratory and field tests to assess the behavior of materials and to understand the subsurface conditions. These tests are typically descriptive, time-consuming, and expensive, and they require considerable human effort. Further, a large number of variables must be studied, such as contamination chemistry, fate and transport, geology, and hydrogeology. These properties are associated with a wide range of variability (Baecher, 2023), which demands numerous tests over large areas as well as repeated testing to ensure accuracy. Another challenge lies in the risk assessment phase, where multiple receptors, exposure pathways, and risks associated with various contaminants must be considered. The selection of the most suitable technology and the optimization of system variables are additional hurdles. Once the remediation is implemented, continuous monitoring through soil and groundwater sampling generates vast amounts of data. Unfortunately, this data is often underutilized, and problems are oversimplified with numerous assumptions. Moreover, processing and analyzing these large datasets requires substantial human resources, resulting in time-consuming and expensive processes. As a result, neglecting to consider all necessary factors can lead to ineffective remediation efforts.

AI/ML/DL technologies can be harnessed to address the challenges listed above. These technologies can assist in different aspects of the remediation process. For instance, they can be used to analyze and interpret the large amounts of data collected during site characterization and modeling, enabling a more comprehensive understanding of the contamination patterns and sources (Yaseen, 2021; Zhang et al., 2023). They can also help reduce the amount of soil and groundwater sampling and subsequent laboratory characterization required (Hanoon et al., 2021). Additionally, AI/ML/DL algorithms can support the risk assessment process by integrating multiple data sources and predicting the potential impacts of contaminants on human health and ecosystems (Li et al., 2022b). Furthermore, these technologies can assist in identifying the most suitable and efficient remediation techniques by analyzing historical data, conducting simulations through surrogate models, and optimizing the decision-making process (Li et al., 2022a). This can lead to more effective cleanup efforts, minimizing the risks associated with exposure to contaminants and improving the overall efficiency of remediation projects. Proper utilization of these technologies therefore enables more accurate site characterization, comprehensive risk assessment, identification of the best remediation strategies, and optimization of the remediation process. Ultimately, AI/ML/DL technologies can play an increasingly important role in addressing the challenges posed by contaminated site remediation, leading to more effective cleanup efforts and improved environmental outcomes.

A significant amount of research has been carried out on the integration of these technologies into various phases of contaminated site remediation. Zhang et al. (2023) performed an extensive literature review on the utilization of ML-based models for spatial prediction of contamination patterns. Yaseen (2021) provided a brief account of the literature on the use of ML models in simulating adsorption of heavy metals in soil and water bodies. In their review of advances in the control and abatement of soil heavy metal pollution, Gautam et al. (2023) emphasized the benefits of using AI in this process. In addition, researchers have extensively studied the use of AI in groundwater quality and flow modeling (Asher et al., 2015; Hanoon et al., 2021). However, to the authors’ knowledge, no review has summarized the advancements in integrating AI into the contaminated site remediation process as a whole. Hence, the current study aims to perform a comprehensive review of the literature available on utilizing AI/ML/DL-based technologies during various stages of contaminated site remediation. First, various types of emerging and existing technologies based on AI/ML/DL are briefly explained. This is followed by a comprehensive review of published studies that apply these techniques in site remediation, and by recommendations to direct future research and enhance remediation efforts.

2. Articial intelligence

Emerging technologies like AI, ML, and DL have the potential to make site remediation less expensive by reducing the human effort and the need for rigorous sampling and monitoring, while also increasing the efficiency of the remediation (Raviteja and Reddy, 2023). These technologies can be applied in engineering optimization, which otherwise requires a large number of field and laboratory tests, numerical and physical modeling, and the analysis of corresponding data to determine optimized parameters. Fig. 1 shows the evolution of the overall domain of AI, where ML is a subset of AI, and DL is a subset of ML. AI emerged as an academic discipline in the early 1950s, focusing mostly on rule-based systems built on knowledge representation for decision-making (Lu, 2019). ML was later developed in the late 1980s, focusing specifically on certain optimizations in which the algorithms learn from the data to improve prediction accuracy and decision-making capacity. DL, which consists of multiple hidden neural layers, was further developed as a newer subset of ML in the early 2010s to enhance the ability of neural networks to understand and process complex data sets (Lu, 2019).

AI imitates or surpasses human intelligence through specialized hardware and software, enabling the development and utilization of computational systems capable of discovery, inference, and prediction. These systems find applications in various domains, including computer vision, natural language processing, and data science. AI thus extends beyond mimicking human intellect. As shown in Fig. 2, AI consists of a wide variety of systems including, but not limited to, different ML algorithms, various types of nature inspired optimization (NIO) algorithms, knowledge-based systems, symbolic AI, computer vision, motion capture, and natural language processing. Although there are numerous techniques apart from NIO and ML algorithms, most research on contaminated site remediation has focused on the following aspects: optimization of remediation design using different optimization algorithms, use of rule-based systems for decision-making based on historical data, and predictive modeling of contaminated subsurface conditions using different types of ML and DL algorithms. Hence, these components of AI are briefly described in the subsequent sections.

2.1. Nature inspired optimization algorithms

Optimization of the design parameters is a key practice in any engineering discipline. Similarly, in contaminated site remediation, the goal of designing a remedial strategy is to optimize the design parameters in a way that reduces the environmental and economic costs while providing maximum remediation efficiency. As most site remediation problems involve non-linear and heterogeneous datasets, conventional simplex or gradient-based optimization algorithms are often inadequate for such complex optimization problems. Over the years, several nature-inspired optimization algorithms, shown in Fig. 2, have been created to tackle complex real-life optimization challenges by drawing on the principles of physical and biological patterns observed in various natural systems (Yang, 2014).

One such algorithm is the genetic algorithm (GA), an optimization technique rooted in natural selection that mirrors the biological evolution process described by Darwin’s theory of evolution. It can effectively solve both constrained and unconstrained optimization challenges (Gen and Cheng, 1999). This algorithm iteratively refines a population of individual solutions. In each cycle, certain individuals from the present population are chosen as parents, guiding the generation of offspring for the next iteration. Across subsequent generations, the population gradually advances towards an optimal solution.

Inspired by natural selection and genetic inheritance in living organisms, genetic algorithms search for optimal solutions to complex problems by mimicking the process of evolution through genetic operators such as selection, crossover, and mutation (Gen and Cheng, 1999). A population of possible solutions is randomly generated, and each solution is evaluated for its fitness based on a predefined objective function. The fittest individuals are then chosen to reproduce and generate a new generation of possible solutions, which undergo crossover and mutation to introduce genetic diversity. This process continues until a satisfactory solution is attained or a predefined stopping criterion is met. Other evolutionary algorithms, such as differential evolution, are also based on Darwin’s theory of evolution and offer further flexibility in operator selection and strategy design (Yang, 2014).
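The selection, crossover, and mutation loop described above can be illustrated with a minimal sketch. It assumes a real-valued encoding, tournament selection, uniform crossover, Gaussian mutation, and elitism, which are only one of many possible operator choices and are not taken from the reviewed studies. The function name `genetic_algorithm`, its parameters, and the two-variable `toy_objective` are hypothetical stand-ins for a real remediation design objective.

```python
import random


def genetic_algorithm(fitness, bounds, pop_size=30, generations=60,
                      mutation_rate=0.2, seed=0):
    """Maximize fitness(x) for a real-valued vector x within bounds."""
    rng = random.Random(seed)
    # Random initial population of candidate solutions
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def select(pop):
        # Tournament selection: the fittest of three random individuals
        return max(rng.sample(pop, 3), key=fitness)

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best solution forward
        while len(children) < pop_size:
            p1, p2 = select(population), select(population)
            # Uniform crossover: each gene comes from one of the two parents
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Gaussian mutation, clipped to the allowed range
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)


def toy_objective(x):
    # Hypothetical stand-in for a remediation objective; maximum at (3, -1)
    return -((x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2)


solution, score = genetic_algorithm(toy_objective, [(-10, 10), (-10, 10)])
```

In a remediation setting, the objective would typically combine cleanup efficiency and cost, and each gene would represent a design variable such as a well location or pumping rate. A fixed number of generations serves here as the predefined stopping criterion.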

Other nature-inspired metaheuristics, as shown in Fig. 2, include simulated annealing based on the metal annealing process (Kirkpatrick et al., 1983), artificial immune system (AIS) based on the vertebrate immune system (Farmer et al., 1986), wind driven optimization (WDO) based on the movement of air parcels in Earth’s atmosphere (Bayraktar et al., 2010), and harmony search, a music-inspired algorithm (Geem et al., 2001). Some commonly used swarm intelligence based optimization algorithms, which are inspired by the social interactions of natural swarms of species, include particle swarm optimization (PSO) (Kennedy and Eberhart, 1995), ant colony optimization (ACO) based on the behavior of social ants (Dorigo et al., 2006), artificial bee colony (ABC) based on the behavior of bees in a colony (Karaboga, 2005), and grey wolf optimization (GWO) based on the leadership hierarchy and hunting mechanism of grey wolves (Mirjalili et al., 2014).

Fig. 1. Evolution of artificial intelligence (AI), machine learning (ML), and deep learning (DL) technologies.

These nature-inspired algorithms have been used extensively in diverse fields such as engineering, computer science, economics, and biology to solve complex problems that are difficult to tackle using traditional optimization methods (Yang, 2014; Yang and He, 2016; Lambora et al., 2019). They are especially beneficial for problems that involve a large search space with inherent non-linearity and multiple conflicting objectives, which is often the case in contaminated site remediation.

3. Machine learning

ML, a subset of AI, involves creating mathematical models that facilitate predictions and decisions without requiring explicit programming (Sarker, 2021). ML models are generally trained using training data, and their performance is highly influenced by the completeness of such data. ML involves various approaches such as supervised, semi-supervised, unsupervised, and reinforcement learning, among others. Each of these methods has specific applications. Supervised learning approaches are used to solve regression and classification problems (James et al., 2023), whereas unsupervised learning approaches are used for dimensionality reduction, clustering, and association problems (Alloghani et al., 2020). Reinforcement learning is mostly used for problems that require real-time learning (Perera and Kamalaruban, 2021). Numerous algorithms have been developed over the past few decades to implement these techniques. Some of the widely used ML algorithms are presented in Fig. 3.

3.1. Supervised learning

In supervised learning, the machine learns from user-provided data that shows how the inputs are mapped to outcomes. This helps the machine build a model for predicting the outcomes of new inputs based on the trends identified from past examples (training data) (James et al., 2023). For instance, consider three types of fruit, each with a different color, where the objective is to sort them into groups by type and color. Because the user has previous experience and memory to recognize each fruit, the sorting can be done quickly without any iterations. The variables in this case are known as labeled variables, as their features are known. The learning algorithm is called supervised because the user can recognize the features and identify the group to which each variable belongs, without iterations.

One of the most basic forms of supervised ML is linear regression analysis, which is a rather simple model. However, using a simple model like linear regression for complex datasets introduces considerable bias into the predictions (James et al., 2023). Hence, numerous parametric and non-parametric models were gradually developed to handle complex non-linear data while balancing the bias-variance trade-off, to make predictions more accurate and closer to reality. Commonly employed supervised learning models, as presented in Fig. 3, include decision-tree based models such as gradient boosting and random forest, and support vector machines/regressors, for both classification and regression tasks. In contrast, k-Nearest Neighbors (kNN), Bayesian models such as naïve Bayes, and discriminant analysis models are used particularly for classification tasks. Many of the supervised learning models mentioned above can be utilized for predictive modeling of contaminated subsurface systems, for both classification and regression tasks, when ample labeled data is accessible, whether obtained through field sampling or generated via numerical modeling (process-based reactive transport models).

Fig. 2. An overview of commonly used artificial intelligence (AI) techniques.
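As an illustration of supervised classification with labeled data, the following minimal sketch implements a k-Nearest Neighbors classifier in plain Python. The feature choice (distance to a suspected source and sampling depth), the sample values, and the labels are hypothetical and serve only to show how labeled training examples are mapped to a prediction for a new location. In practice, the features would usually be scaled before computing distances.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Pair each training sample's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled samples: (distance to source in m, depth in m)
train_X = [(5, 1), (8, 2), (12, 1), (60, 3), (75, 2), (90, 4)]
train_y = ["contaminated", "contaminated", "contaminated",
           "clean", "clean", "clean"]

near_source = knn_predict(train_X, train_y, (10, 2))   # "contaminated"
far_from_source = knn_predict(train_X, train_y, (80, 3))  # "clean"
```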

3.2. Unsupervised learning

In unsupervised learning, in contrast to supervised learning, the user cannot identify the features of the variables. Therefore, the variables are said to be unlabeled. In this case, the sorting is done based on the initial expectation of the user, and the objective can be achieved only after a number of iterations. Unsupervised learning employs ML algorithms to examine and group unlabeled datasets (Alloghani et al., 2020). These algorithms uncover hidden patterns or data clusters independently. Their ability to identify similarities and differences in data makes them well suited for tasks like exploring data patterns, contaminant clustering, and image recognition, which can be used for 3-D delineation of contaminant distribution (Chen et al., 2023). The differences between supervised and unsupervised learning can be understood by comparing various parameters (Alloghani et al., 2020). The accuracy of supervised learning is generally better than that of unsupervised learning, which indicates that initial user involvement can improve the efficiency of the learning algorithms. Unsupervised learning techniques, such as clustering, can be useful in situations where detailed sampling and testing for some contaminants is expensive, and also for identifying contaminant sources (Tariq et al., 2008). In such cases, areas can be clustered based on similarities in other subsurface conditions and the relative abundance of other contaminants that can easily be measured. Smarter sampling strategies for the chemicals that are expensive to test can then be formulated to minimize expenses, and the sources of such contaminants can be identified based on the contaminant clusters. Another unsupervised learning technique, dimensionality reduction, can be applied in situations where the data is too complex to comprehend.
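The clustering idea described above can be sketched with k-means, one common clustering algorithm chosen here purely for illustration. The sketch groups sampling locations using two easily measured indicators. The function name `kmeans`, the indicator values in `samples`, and the choice of two clusters are assumptions, not data or methods from the reviewed studies.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Group points into k clusters by iteratively updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical, unlabeled measurements of two inexpensive indicators
samples = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
           (8.0, 9.0), (8.5, 9.5), (7.8, 8.7)]
centroids, clusters = kmeans(samples, k=2)
```

Each resulting cluster could then be sampled sparsely for the contaminants that are expensive to test, in line with the sampling strategy outlined above.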

3.3. Semi-supervised learning

Semi-supervised learning is a type of ML technique in which an algorithm learns from both labeled and unlabeled data. In this approach, a small amount of labeled data is used to guide the learning process, while a larger amount of unlabeled data is used to improve the accuracy of the model (Zhou, 2021). It is particularly useful in situations where labeled data is difficult or expensive to obtain and unlabeled data is abundantly available. This approach can be used in different types of applications, including image classification, anomaly detection, and natural language processing. However, the effectiveness of semi-supervised learning depends on the quality and quantity of the labeled and unlabeled data. Past studies have indicated the effectiveness of semi-supervised learning for contaminant source identification (Vesselinov et al., 2018).

3.4. Reinforcement learning

Reinforcement learning is a subfield of ML that deals with training algorithms to make decisions in dynamic environments (Sutton and Barto, 2018). It involves an agent interacting with an environment, receiving feedback in the form of rewards based on its actions, and adjusting its behavior to maximize long-term rewards. Reinforcement learning aims to develop an optimal policy that guides the agent’s actions towards achieving a specific objective. In the context of site remediation, reinforcement learning can be especially useful in cases where operational decisions have to be made based on real-time monitoring data.

Fig. 3. Various commonly used machine learning (ML) algorithms.
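The agent, environment, and reward loop of reinforcement learning can be illustrated with tabular Q-learning, one widely used algorithm that is chosen here only as an example. The toy environment is entirely hypothetical: the state is a discretized contamination level (0 means the cleanup goal is met), the two actions are pumping rates, and the reward is the negative operating cost. The names `step`, `q_learning`, and `EFFECT`, as well as all numerical values, are assumptions.

```python
import random

ACTIONS = ("low_pumping", "high_pumping")
# Hypothetical effect of each action: (contamination levels removed, cost)
EFFECT = {"low_pumping": (1, 1.0), "high_pumping": (2, 1.6)}


def step(state, action):
    """Return (next_state, reward); state 0 means the cleanup goal is met."""
    removed, cost = EFFECT[action]
    return max(0, state - removed), -cost


def q_learning(n_states=3, episodes=500, alpha=0.5, gamma=1.0,
               epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(n_states) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.randrange(1, n_states)  # start from a contaminated level
        while state != 0:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward = step(state, action)
            future = max(q[(nxt, a)] for a in ACTIONS)  # stays 0 at the goal
            # Move the estimate towards reward plus discounted future value
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)])
            state = nxt
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2)}
```

In this toy setting the learned policy uses the cheaper low pumping rate when one contamination level remains and the high rate when two remain, because a single high-rate step costs less than two low-rate steps.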

3.5. Articial neural networks

Articial neural networks (ANNs) are the algorithms that reect the

structure and function of the human brain, allowing computer programs

to recognize patterns for problem solving (Aggarwal, 2018). They are

composed of interconnected nodes, also known as neurons, which are

organized in layers. Each neuron takes input from other neurons, applies

a transformation to it, and outputs a result that is passed on to other

neurons. All the nodes are connected, and each neuron is associated with

corresponding weight and threshold value. A certain node will be acti-

vated only if the output from the neuron is greater than its threshold

value.

The general structure of a neural network comprises nodes, also called neurons, arranged in an input layer, hidden layers, and an output layer (Goodfellow et al., 2016). The input layer comprises an array of neurons that store the various features of the input variables. It is the initial layer, which receives the input data and passes it forward to the hidden layers. Hidden layers are located after the input layer and receive the data for computational analysis. As the values of neurons in these layers are unobservable and inaccessible outside the network, they are termed hidden layers. The hidden layers generally perform tasks such as extracting features of the input data, applying mathematical functions, and generating output. The output of each neuron in a hidden layer is a non-linear transformation of its input, which allows the neural network to learn complex relationships between inputs and outputs (Gurney, 1997). The number of neurons in a hidden layer is a hyperparameter that can be tuned to optimize the performance of the neural network. Too few neurons can lead to underfitting, where the model is unable to capture the complexity of the data. Too many neurons can lead to overfitting, where the model identifies nonexistent patterns in order to fit the training data closely, making it unsuitable for generalizing to new data. The number of hidden layers in a neural network depends on the dimensions and features of the input data (Goodfellow et al., 2016). If the input data is linearly separable, then no hidden layers are required for the analysis. Most engineering problems require 3–5 hidden layers for an accurate analysis. However, choosing a higher number of hidden layers for relatively simple datasets increases the complexity of the model and may result in overfitting. Hence, care must be taken while tuning the hyperparameters of neural networks in particular, and of any other ML model in general.

The output layer is the last layer of a neural network and produces the final output of the model. There can be several types of output layers, depending on the problem being solved with the neural network. The activation function is a crucial component located in the hidden and output layers of neural networks (Sharma et al., 2017). It operates on the output of each neuron in a layer, transforming it into a non-linear form that can be passed on to the next layer or used as the final output of the network.

In the hidden layers, the activation function plays a crucial role in introducing non-linearity into the network, thereby enabling it to model complex patterns in data (Sharma et al., 2017). The choice of activation function depends on both the problem being solved and the architecture of the neural network. In the output layer, the selection of the activation function is based on the nature of the problem at hand. For instance, in binary classification tasks, a sigmoid activation function may be employed to generate a probability output between 0 and 1. Alternatively, in regression tasks, a linear activation function can be employed to produce a continuous output (Sharma et al., 2017).
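The role of the layers and activation functions described above can be illustrated with a minimal forward pass through a network with one hidden layer. The sketch uses a sigmoid activation in the hidden layer and either a sigmoid output (binary classification) or a linear output (regression). The weights and biases are hand-picked, hypothetical values rather than trained parameters, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(z):
    """Identity activation, used for continuous (regression) outputs."""
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, hidden_w, hidden_b, out_w, out_b, output_activation):
    hidden = dense(x, hidden_w, hidden_b, sigmoid)  # non-linear hidden layer
    return dense(hidden, out_w, out_b, output_activation)[0]


# Hypothetical weights for 2 inputs, 2 hidden neurons, and 1 output
hidden_w = [[0.5, -0.2], [-0.3, 0.8]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

x = [1.0, 2.0]
probability = forward(x, hidden_w, hidden_b, out_w, out_b, sigmoid)
regression_output = forward(x, hidden_w, hidden_b, out_w, out_b, linear)
```

Training would adjust `hidden_w`, `hidden_b`, `out_w`, and `out_b` to reduce the prediction error. That step is omitted here.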

Based on the direction of information flow, neural networks can be divided into feed-forward networks, where information flows in only one direction without any looping, and recurrent neural networks (RNNs), which are bi-directional networks. Back-propagation algorithms are commonly used to train feed-forward networks to reduce the errors in the predicted output (Svozil et al., 1997), while RNNs use feedback loops for this purpose (Salehinejad et al., 2017).

Following this brief introduction to the basic working principles of neural networks, they can be divided into two types: shallow learning networks and DL networks (Aggarwal, 2018). Although there is no definite consensus among the ML community, networks with only one or two hidden layers are generally termed shallow learning networks, and neural networks with more than two hidden layers are termed DL networks. A single-layer perceptron is the simplest form of neural network, consisting of an input layer, an activation function, and an output (Auer et al., 2008). Extreme learning machines (ELMs), on the other hand, are ANNs with one input layer, one hidden layer, and one output layer (Huang et al., 2006). In an ELM, the weights assigned to the hidden layer are randomly generated and are not iteratively adjusted; only the output weights are computed, which makes training these models extremely rapid. Conversely, DL utilizes neural networks containing multiple hidden layers.
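The ELM idea described above can be sketched in a few lines: the hidden-layer weights are drawn once at random and left untouched, and only the output weights are obtained by solving a linear least-squares problem. The sketch below is an assumption-laden illustration rather than the formulation of Huang et al. (2006): it uses a single input, tanh hidden units, a small ridge term, and a hand-written Gaussian elimination in place of a pseudoinverse so that it stays dependency-free. The target function (a sine curve) is purely hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def hidden(x, W, B):
    """Hidden-layer outputs for a scalar input x."""
    return [math.tanh(w * x + b) for w, b in zip(W, B)]

def train_elm(xs, ys, n_hidden=10, ridge=1e-6, seed=0):
    rng = random.Random(seed)
    # Hidden weights and biases are random and never updated.
    W = [rng.uniform(-2, 2) for _ in range(n_hidden)]
    B = [rng.uniform(-2, 2) for _ in range(n_hidden)]
    H = [hidden(x, W, B) for x in xs]
    # Output weights from the normal equations (H^T H + ridge * I) beta = H^T y.
    A = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
          for j in range(n_hidden)] for i in range(n_hidden)]
    rhs = [sum(h[i] * y for h, y in zip(H, ys)) for i in range(n_hidden)]
    return W, B, solve(A, rhs)

def predict_elm(model, x):
    W, B, beta = model
    return sum(bt * h for bt, h in zip(beta, hidden(x, W, B)))

xs = [i / 10 for i in range(-30, 31)]
ys = [math.sin(x) for x in xs]          # hypothetical target
model = train_elm(xs, ys)
rmse = math.sqrt(sum((predict_elm(model, x) - y) ** 2
                     for x, y in zip(xs, ys)) / len(xs))
```

Training involves a single linear solve and no iterative weight updates, which is the source of the speed noted above.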

Overall, neural networks are versatile models able to perform supervised, unsupervised, semi-supervised, and reinforcement learning tasks, which makes their applications extensive. In the context of site remediation, both shallow and deep neural networks have found utility in diverse projects, encompassing tasks such as site characterization, contaminant source identification, and remediation design and optimization (Srivastava and Singh, 2014; Zhao et al., 2020; Zheng et al., 2022). In particular, DL networks, a subset of artificial neural networks (ANNs), have surged in popularity owing to their capability to process intricate forms of data. This surge has led to the development of numerous DL models, which are also commonly used in the literature on site remediation. Consequently, a brief overview of DL models is provided in the next section.

4. Deep learning

Deep learning (DL) is a subset of ML that deals with data of increased complexity, for which conventional ML models may be inadequate and can yield inaccurate analyses. DL typically employs neural networks with multiple hidden layers in order to estimate the output accurately. These neural networks are designed to learn and extract hierarchical representations of data across multiple layers of abstraction, which can then be used to make predictions or decisions (Goodfellow et al., 2016). DL models are loosely modeled on the behavior of neurons in the human brain. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two of the most widely used and more sophisticated algorithms for implementing DL. In addition to these, various other types of existing DL models commonly employed in various applications are presented in Fig. 4. A fully connected multi-layer neural network is commonly referred to as a multi-layer perceptron (MLP), which is generally classified as a DL model.

Further, several different models, such as Boltzmann machines, autoencoders, and deep belief networks, were developed by altering the way information flows between nodes in different layers so as to excel at specific tasks. For example, autoencoders are built from two major components, an encoder and a decoder, and are trained to approximately copy the features of the input variables to the output variables, so that they can excel at dimensionality reduction and feature learning applications (Goodfellow et al., 2016). Similarly, numerous DL network architectures have been, and continue to be, developed to enhance the way these neural networks process information. However, the optimal choice of network architecture depends on the unique features of the data being processed and the specific problem being addressed. Brief descriptions of some techniques generally used to implement DL in site remediation are presented below.

4.1. Recurrent neural networks

Recurrent neural networks (RNNs) are a category of neural networks that are specifically designed to handle sequential data, such as natural language text or time-series data. RNNs utilize feedback connections, forming loops that enable information to persist and propagate through the network over time (Yu et al., 2019). This feedback mechanism enables RNNs to model dynamic temporal dependencies in the input data, making them well suited for applications such as speech recognition, natural language processing, and video analysis.

In RNNs, the connections between the nodes create a loop, allowing the output of some nodes to affect the subsequent input of the same nodes. This loop structure enables RNNs to exhibit temporal dynamic behavior. The hidden state of an RNN has a recurrent connection, which ensures that sequential information in the input data, such as the dependencies between words in a text, is captured when making predictions. Furthermore, RNNs share the same parameters across all time steps, which minimizes the number of parameters that must be learned and enhances their efficiency in managing sequential data. Long short-term memory (LSTM) networks and gated recurrent units (GRUs) are two common types of RNN architecture, which can be especially efficient in processing sequential data (Chung et al., 2014).
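The recurrence and parameter sharing described above can be illustrated with a scalar, Elman-style RNN. This is a minimal sketch under stated assumptions: a single hidden unit, a tanh activation, and hypothetical weights and input readings chosen only for illustration.

```python
import math

def rnn_forward(sequence, w_x, w_h, b, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = h0
    states = []
    for x_t in sequence:
        # The same three parameters (w_x, w_h, b) are reused at every time step.
        h = math.tanh(w_x * x_t + w_h * h + b)
        states.append(h)
    return states

readings = [0.1, 0.4, 0.9, 0.3]   # hypothetical time series
states = rnn_forward(readings, w_x=0.8, w_h=0.5, b=0.0)

# Because the hidden state carries history, the order of inputs matters.
last_ab = rnn_forward([0.1, 0.9], 0.8, 0.5, 0.0)[-1]
last_ba = rnn_forward([0.9, 0.1], 0.8, 0.5, 0.0)[-1]
```

The number of parameters stays at three regardless of the sequence length, which is the efficiency gain attributed to parameter sharing.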

4.2. Long short-term memory

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that addresses the vanishing-gradient problem common in traditional RNNs (Sherstinsky, 2020). The LSTM architecture incorporates specialized memory cells that can retain information over extended periods, enabling them to capture long-term dependencies in sequential data (Hochreiter and Schmidhuber, 1997). These memory cells are equipped with gates that regulate the flow of information, allowing the network to selectively store or forget past information. During training, the connection weights and biases in the network are updated, analogous to the way physiological changes in synaptic strengths store long-term memories (Sherstinsky, 2020). At each time step, the activation patterns in the network change, similar to how changes in electrical firing patterns in the brain store short-term memories. Because of its effectiveness in modeling sequential data with long-term dependencies, LSTM has gained popularity in diverse fields and has also been applied in site remediation tasks (Qiu et al., 2023; Li et al., 2021).
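The gating mechanism described above can be sketched for a single scalar LSTM cell. The gate names follow common usage (forget, input, output, candidate); the parameter values are hypothetical and are chosen so that one setting keeps the stored cell state while the other erases it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p maps gate names to (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # fraction of the old cell state to keep
    i = gate("input", sigmoid)         # fraction of new content to write
    o = gate("output", sigmoid)        # fraction of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated memory cell
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Hypothetical parameters: biases push the gates almost fully open or closed.
keep = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
        "output": (0.0, 0.0, 10.0), "candidate": (1.0, 0.0, 0.0)}
erase = dict(keep, forget=(0.0, 0.0, -10.0))

_, c_keep = lstm_step(5.0, 0.0, 0.7, keep)     # memory of 0.7 is retained
_, c_erase = lstm_step(5.0, 0.0, 0.7, erase)   # memory of 0.7 is discarded
```

In a trained network these gate parameters are learned, so the cell itself decides when to store and when to forget.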

4.3. Convolutional neural networks

Convolutional neural networks (CNNs) are neural networks specifically designed for tasks involving image recognition and processing. They employ shared-weight architectures in their convolutional layers, where kernels slide over the input features, generating feature maps that capture various patterns and objects within images. This approach enables CNNs to excel in tasks related to visual data analysis (Gu et al., 2018). Kernels are the building blocks of a CNN and extract the relevant features of the input through the convolution operation. In this way, a CNN captures the spatial features of an image, which helps in identifying an object accurately, locating it, and relating it to other objects in the image.

CNNs employ convolutional layers to extract features from input data, followed by pooling layers that decrease the dimensionality of the feature maps (Gu et al., 2018). The output of the convolutional and pooling layers is then passed through one or more fully connected layers to produce the final output. CNNs are particularly effective for image classification tasks because they learn features directly from the raw pixel data rather than requiring handcrafted features. They can also tolerate variations in the input image, such as changes in lighting, rotation, and scale, which makes them very efficient. Owing to this ability, CNNs can be used in site characterization tasks to analyze spatial contaminant distribution patterns based on various imaging techniques. Past studies have used CNNs to estimate contaminant concentrations from visible and near-infrared spectroscopy (imaging) data (Pyo et al., 2020).
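The two operations named above, a kernel sliding over the input and a pooling step that shrinks the feature map, can be sketched without any DL library. The 5 × 5 "image" and the 2 × 2 edge kernel are illustrative assumptions; in a real CNN the kernel values are learned rather than hand-set.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Illustrative image: dark left side, bright right side.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]   # responds to a dark-to-bright vertical edge

fmap = conv2d(image, edge_kernel)  # 4 x 4 feature map
pooled = max_pool(fmap)            # 2 x 2 after pooling
```

The feature map is non-zero only in the column where the edge sits, and pooling keeps that response while quartering the number of values.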

4.4. Autoencoders

Autoencoders are a special class of feedforward neural networks that are trained to copy the input features to the output layer. They consist of two main parts: the encoder, which maps the input data to a latent space (the compressed representation), and the decoder, which reconstructs the input data from this compressed representation (Bank et al., 2023). The encoder and decoder are both neural networks, and they work together to minimize the difference between the input and the output (the reconstruction loss). Autoencoders are typically trained using backpropagation and gradient descent to minimize the reconstruction loss, in the same way as other feedforward neural networks. However, autoencoders are designed to copy the input imperfectly. By forcing the network to learn a compressed representation, they capture the essential features of the input data while discarding the non-essential details, which makes them useful for dimensionality reduction and feature extraction applications (Bank et al., 2023). Pyo et al. (2020) used a convolutional autoencoder for dimensionality reduction of visible and near-infrared spectroscopy (VNIRS) data to predict heavy metal contamination.

Recent developments have made autoencoders suitable for generative modeling, where they are trained to generate new data samples that are similar to the training data (Goodfellow et al., 2016). Variational autoencoders (VAEs) are a specific type of autoencoder designed for generative tasks. VAEs incorporate probabilistic techniques to generate new data points (Doersch, 2016). Kang et al. (2021) used a convolutional VAE to monitor the contamination source zone during remediation.

While DL networks such as restricted Boltzmann machines, deep belief networks, and generative adversarial networks are recognized in diverse fields, they have not been widely explored for applications in contaminated site remediation. Hence, the implementation and principles behind these algorithms are not discussed in this section. For more on the architectures and working principles of these models, refer to Goodfellow et al. (2016).

Fig. 4. A few commonly employed algorithms to implement deep learning (DL).

5. Special techniques

5.1. Fuzzy logic

Fuzzy logic is a mathematical framework that allows for reasoning with uncertain or imprecise information. Computation is based on degrees of truth rather than the usual true or false (1 or 0). Fuzzy logic includes 0 and 1 as extreme cases of truth but also admits various intermediate degrees of truth (Zadeh, 1973). This allows for more nuanced and flexible reasoning, particularly in complex systems where multiple factors may influence an outcome. Fuzzy logic proves valuable in engineering scenarios where the boundary between certainty and uncertainty is ambiguous, or when handling imprecise data, as observed in natural language processing technologies. Moreover, it excels in governing and managing machine outputs, adapting to diverse input variables. Fuzzy logic-based models, such as fuzzy c-means clustering and fuzzy-based optimization, have been employed in past studies related to contaminated site remediation (Chen et al., 2023; Hu and Chan, 2015).
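The notion of degrees of truth can be made concrete with membership functions. The sketch below assumes triangular membership functions and the conventional min/max/complement operators for fuzzy AND, OR, and NOT; the concentration value and the breakpoints of the "low" and "high" sets are hypothetical.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fuzzy_and(mu1, mu2):
    return min(mu1, mu2)

def fuzzy_or(mu1, mu2):
    return max(mu1, mu2)

def fuzzy_not(mu):
    return 1.0 - mu

conc = 60.0                                # hypothetical concentration, mg/kg
low = triangular(conc, 0.0, 25.0, 75.0)    # degree to which conc is "low"
high = triangular(conc, 50.0, 100.0, 150.0)  # degree to which conc is "high"
```

Here the same measurement is partly "low" and partly "high" at once, which a crisp true/false classification cannot express.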

5.2. Sugeno fuzzy logic

Sugeno fuzzy inference, also known as Takagi-Sugeno-Kang fuzzy inference, utilizes singleton output membership functions, which can be either a constant or a linear function of the input values (Takagi and Sugeno, 1985; Sugeno and Kang, 1988). In comparison to centroid-based defuzzification (Runkler, 1996), the defuzzification process for Sugeno systems is computationally more efficient, because it employs a weighted average or weighted sum of a few data points.
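The weighted-average defuzzification mentioned above can be sketched for a single-input system. Each rule pairs a membership function (its firing strength) with a constant or linear output; the two rules and their coefficients below are hypothetical.

```python
def sugeno_output(rules, x):
    """Single-input Sugeno inference.

    rules: list of (membership_fn, a, b); each rule outputs z = a * x + b
    (a == 0 gives a constant, zero-order rule).
    """
    num = den = 0.0
    for membership, a, b in rules:
        w = membership(x)          # firing strength of the rule
        num += w * (a * x + b)     # weighted rule output
        den += w
    if den == 0.0:
        raise ValueError("no rule fires for this input")
    return num / den               # weighted average, no centroid integration

# Hypothetical rule base on an input scaled to the range 0..10.
rules = [
    (lambda x: max(0.0, 1.0 - x / 10.0), 0.0, 2.0),        # IF x is low  THEN z = 2
    (lambda x: min(1.0, max(0.0, x / 10.0)), 0.5, 1.0),    # IF x is high THEN z = 0.5x + 1
]
z = sugeno_output(rules, 4.0)
```

The output is a handful of multiplications and one division, whereas centroid defuzzification requires integrating over an aggregated output membership function.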

Sugeno fuzzy logic (SFL) finds widespread use in applications requiring precise control, such as industrial automation, control systems, and robotics. It is also frequently employed in decision-making systems such as expert systems and rule-based systems. The principal advantage of SFL is its capacity to model complex nonlinear systems accurately and efficiently while retaining easy interpretability of the fuzzy rules. Sadeghfam et al. (2019) used SFL-based surrogate models coupled with GA-based optimization to optimize the pumpage schedule while remediating excessive total dissolved solids (TDS) in groundwater using pump, treat, and inject (PTI) technology. They observed that, in comparison to ANN-based models, SFL models can yield better computational efficiency while reporting similar accuracy.

5.3. Surrogate models

Surrogate models, also referred to as metamodels or emulators, imitate simulation models with high accuracy while requiring fewer computational resources (Cozad et al., 2014; Asher et al., 2015). These models are usually created using a small dataset generated by resource-intensive simulations or experiments. This dataset is used to fit mathematical or statistical models that can quickly and accurately predict how a system or process behaves. Surrogate models are built using a data-driven approach, with simulation outputs sampled at strategically selected points in the design parameter space. For each of these points, a full simulation is run to calculate the corresponding output.

The pairs of input (design parameters) and output values are collected into a training dataset, which is used to construct a statistical model. Rather than relying only on a predetermined dataset, surrogate models can use active learning to gradually expand their training data, which can significantly improve both the efficiency and the accuracy of the training process. When a new sample is identified, a new simulation is performed to calculate its corresponding output value, and the surrogate model is then updated with this new information. This process is repeated until the surrogate model’s accuracy meets the desired level. Surrogate models are widely applied across various fields, such as engineering, physics, chemistry, and finance. They effectively reduce the computational expense associated with complex simulations, enabling more efficient optimization and design procedures. Researchers have also widely used surrogate models in groundwater and subsurface modeling to optimize remediation design parameters and to identify contaminant sources.
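The iterative loop described above (sample, simulate, update, repeat until accurate enough) can be sketched as follows. Everything specific here is an assumption made for illustration: the "simulator" is a cheap analytic stand-in for an expensive process-based model, the surrogate is a piecewise-linear interpolant, and new samples are chosen with a simple leave-one-out heuristic rather than any particular active-learning criterion from the literature.

```python
import math

def simulator(x):
    """Stand-in for an expensive process-based simulation (hypothetical)."""
    return math.exp(-x) * math.sin(3 * x)

def interpolate(points, x):
    """Piecewise-linear surrogate through sorted (x, y) training points."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside the sampled range")

def loo_errors(points):
    """Leave-one-out error at each interior point, predicted from its neighbours."""
    errs = []
    for k in range(1, len(points) - 1):
        reduced = points[:k] + points[k + 1:]
        xk, yk = points[k]
        errs.append((abs(interpolate(reduced, xk) - yk), k))
    return errs

def build_surrogate(lo, hi, tol=0.01, max_runs=200):
    """Grow the training set where the surrogate predicts worst."""
    points = [(x, simulator(x)) for x in (lo, (lo + hi) / 2, hi)]
    while len(points) < max_runs:
        worst_err, k = max(loo_errors(points))
        if worst_err < tol:
            break                                   # desired accuracy reached
        # Run new simulations on both sides of the worst-predicted point.
        for a, b in ((points[k - 1][0], points[k][0]),
                     (points[k][0], points[k + 1][0])):
            xm = (a + b) / 2
            points.append((xm, simulator(xm)))
        points.sort()
    return points

points = build_surrogate(0.0, 3.0)
estimate = interpolate(points, 1.234)   # cheap prediction, no new simulation
```

Once built, the surrogate answers queries with a lookup and one interpolation, which is what makes it attractive inside an optimization loop that would otherwise call the simulator thousands of times.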

5.4. Ensemble learning

Ensemble learning involves creating a prediction model by harnessing the strengths of several simpler base models (Polikar, 2012). It can be divided into two main tasks: first, developing a population of base learners from the training data, and then combining these learners to create a composite predictor. Tree-based ML algorithms such as gradient boosting and random forest belong to the class of ensemble learning techniques. These algorithms learn from multiple decision trees (weak/base learners) to enhance the prediction/classification accuracy. Any number of ML models can be combined to build an ensemble-learning model, provided that doing so is not detrimental to the final prediction accuracy. Several studies have employed an ensemble of surrogate models for predictions in the context of contaminated site remediation (Chu and Lu, 2015; Hou et al., 2017; Ouyang et al., 2017b; Xing et al., 2019; Qiu et al., 2023).
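The two tasks named above, building a population of base learners and then combining them, can be sketched with bagging. The sketch assumes one-split regression stumps as the weak learners, bootstrap resampling to diversify them, and simple averaging as the combiner; it is not a random forest or gradient boosting implementation, and the step-shaped dataset is hypothetical.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Weak learner: a one-split regression stump minimising squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                       # all x identical: predict the mean
        m = statistics.fmean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def fit_bagged(xs, ys, n_learners=25, seed=0):
    """Task 1: train base learners on bootstrap resamples. Task 2: average them."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(len(xs)) for _ in xs]   # sample with replacement
        learners.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: statistics.fmean(f(x) for f in learners)

xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]   # hypothetical step-shaped response
model = fit_bagged(xs, ys)
```

Each stump places its split slightly differently because it sees a different resample, and averaging smooths out those individual choices.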

5.5. Decision making

The process of decision-making involves selecting the optimal course of action among a range of available alternatives. It requires the identification and evaluation of various options, considering potential outcomes and consequences, and ultimately choosing a path based on the available information, preferences, and goals.

To arrive at a conclusion, decision-making necessitates analyzing data from multiple sources with varying levels of certainty and merging the information by weighting certain data sources over others (Kochenderfer, 2015). An agent, which acts based on observations of its environment, interacts with the environment through an observe-act cycle. At time $t$, the agent receives an observation of the environment, represented as $O_t$, and then chooses an action $a_t$ through the decision-making process.

Given the past sequence of observations $O_1, O_2, \ldots, O_t$ and knowledge of the environment, the agent must choose an action that best achieves its objectives, considering various sources of uncertainty. Intelligent decision-making systems can be useful in making real-time decisions in contaminated site remediation, especially in the age of big data.

6. Model development

Model selection is a crucial task in the domain of AI, which involves the identification of the most appropriate algorithm or model for a given problem. Optimal model selection can significantly influence the accuracy and efficiency of prediction or classification tasks.

To select the most suitable model, various factors such as accuracy, computational complexity, interpretability, robustness, and generalization performance must be considered. The model development process typically involves dividing the available data into training, validation, and test sets. The training set is used to train different models, the validation set to select the best model, and the test set to evaluate the final model’s performance. Several techniques can be employed for model selection, including cross-validation, grid search, and Bayesian optimization (James et al., 2023). Cross-validation involves iteratively dividing the available data into training and validation sets and evaluating the model’s performance on each iteration. Grid search involves trying various combinations of hyperparameters to identify the optimal model, whereas Bayesian optimization uses probabilistic approaches to search for the most effective hyperparameters.
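Cross-validation and grid search, as described above, can be combined in a short sketch. The model being tuned is an assumption made for brevity: a one-dimensional k-nearest-neighbour regressor whose only hyperparameter is the number of neighbours. The noisy linear dataset is synthetic, and Bayesian optimization is not shown.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def k_fold_mse(data, k_neighbors, n_folds=5):
    """Mean squared error averaged over n_folds train/validation splits."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, val in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        mse = sum((knn_predict(train, x, k_neighbors) - y) ** 2
                  for x, y in val) / len(val)
        scores.append(mse)
    return sum(scores) / n_folds

def grid_search(data, grid):
    """Score every candidate hyperparameter value and return the best one."""
    results = {k: k_fold_mse(data, k) for k in grid}
    return min(results, key=results.get), results

rng = random.Random(42)
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in (i / 5 for i in range(100))]
rng.shuffle(data)                      # avoid ordered folds
best_k, results = grid_search(data, grid=[1, 3, 5, 9, 15])
```

A separate test set, untouched by this search, would still be needed to report the final model’s performance.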

6.1. Steps involved in model development

The following are typical steps for AI-based, data-driven model development (James et al., 2023):

Data collection: It is imperative to use data from a trustworthy source, since the quality of the data directly influences the model’s outcome. High-quality data is relevant to the problem being addressed, contains minimal missing and duplicated values, and represents the various subcategories/classes appropriately. It is commonly believed that more data results in a better, more accurate model. However, the quality of data is as important as the quantity: a large dataset of poor quality may not improve the model’s performance. Thus, it is essential to verify data quality before model development by checking for missing and duplicated values and confirming that the data represents the subcategories/classes present. Using high-quality data enhances the model’s accuracy, making it more reliable and useful for addressing various problems.

Data preprocessing: This is a critical step in developing accurate and reliable models. Randomizing the order of the data ensures that it is evenly distributed across the subsets, which is important for preventing bias in the model. Data cleaning is also essential to remove unwanted data, such as rows and columns with missing values and duplicate records, and to convert data types if necessary. Once the data is cleaned, it is split into two sets: a training set and a testing set. The training set is what the model learns from, and the testing set is used to check the accuracy of the model’s learning after training. Proper data preparation ensures that the model is trained on high-quality data, so that it can learn effectively and produce accurate predictions on unseen data.
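The cleaning, randomizing, and splitting steps described above can be sketched as a single helper. The record layout (tuples of sample ID, pH, and concentration), the 20% test fraction, and the use of `None` to mark a missing value are all assumptions for illustration.

```python
import random

def preprocess(rows, test_fraction=0.2, seed=0):
    """Drop incomplete and duplicate records, shuffle, then split."""
    seen, clean = set(), []
    for row in rows:
        if any(v is None for v in row):
            continue                    # record with a missing value
        if row in seen:
            continue                    # exact duplicate
        seen.add(row)
        clean.append(row)
    random.Random(seed).shuffle(clean)  # randomize order before splitting
    n_test = max(1, round(len(clean) * test_fraction))
    return clean[n_test:], clean[:n_test]   # (training set, testing set)

# Hypothetical records: (sample id, pH, concentration in mg/kg)
raw = [
    (1.0, 7.1, 12.0),
    (2.0, None, 15.0),    # missing pH
    (1.0, 7.1, 12.0),     # duplicate of the first record
    (3.0, 6.8, 9.5),
    (4.0, 7.4, 20.1),
    (5.0, 6.9, 11.2),
    (6.0, 7.0, 13.3),
]
train, test = preprocess(raw)
```

Fixing the seed makes the split reproducible, which helps when results from different models need to be compared on the same partition.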

Model selection: This is a crucial step in developing effective AI/ML/DL models that can accurately solve the task at hand. It involves assessing whether a model is suited to numerical or categorical data and to the specific problem, since different models have their own strengths and weaknesses. The complexity of the model should also be considered: overly complex models can lead to overfitting, while overly simple models may not capture the complexity of the problem. Model selection therefore requires careful consideration of the data type, the problem complexity, and the strengths and weaknesses of different models. Selecting the right model leads to more reliable and accurate results. Model accuracy can be evaluated using various performance metrics.

Model training: In model training, the prepared data is used to teach

the machine-learning model to recognize patterns and make predictions.

By training on the data, the model learns to accomplish the tasks set out

for it. As training progresses, the model becomes better at predicting

outcomes, resulting in improved prediction accuracy.

Model testing: To evaluate the performance of a model, it is necessary to test it on previously unseen data. The dataset used for testing is separate from the one used for training and is referred to as the testing set. The testing set allows for an objective evaluation of the model’s generalization ability to unseen data.

Parameter tuning: This is a crucial step in improving the accuracy of a model once it has been selected. The process involves adjusting the values of the (hyper)parameters of the model to achieve optimal performance. By fine-tuning the values of specific parameters, the model can be tailored to perform optimally for a specific task, and its accuracy can be improved significantly.

Making predictions: This is the final step in a typical ML workflow. After the model has been trained and tested with unseen data, it can be used to make predictions on new data. It is important to ensure that the data used for making predictions is of the same quality as the data used for training and testing. This will help ensure that the model makes accurate predictions and performs well in real-world applications.

Performance metrics: Performance metrics are used to evaluate the effectiveness of AI models in solving specific problems. Different types of metrics are available depending on the purpose for which the models are used (classification/regression/clustering). In studies on contaminated site remediation, regression metrics (MSE, MAE, etc.), classification metrics (precision, accuracy, recall, etc.), correlation metrics ($R^2$, $r$, etc.), and rank metrics (Spearman’s rank correlation coefficient) are generally used. Formulae for such model performance evaluation metrics are presented in Table 1.
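A few of the metrics named above can be computed directly from their standard definitions, as in the following sketch. The observed and predicted values are hypothetical, binary labels are assumed to be coded as 0/1, and no guard is included for degenerate cases such as zero predicted positives.

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 from their standard definitions."""
    n = len(y_true)
    mean = sum(y_true) / n
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean) ** 2 for y in y_true)
    return {
        "MSE": sse / n,
        "RMSE": math.sqrt(sse / n),
        "MAE": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "R2": 1.0 - sse / sst,
    }

def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy, and F1 for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / len(pairs),
        "F1": 2 * precision * recall / (precision + recall),
    }

reg = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

Regression and classification metrics answer different questions, so the choice should follow the task rather than convenience.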

7. Integration of AI/ML/DL into contaminated site remediation

7.1. Bibliometric analysis

A bibliometric analysis was conducted to analyze the scientific literature available regarding the application of AI/ML/DL in contaminated site remediation using the Scopus database developed by Elsevier. The search query used to find relevant literature was as follows: (“Groundwater” OR “Ground water” OR “Soil”) AND (“Remediation”) AND (“Artificial Intelligence” OR “Machine Learning” OR “Deep Learning” OR “Neural Networks” OR “Genetic Algorithm” OR “Optimization Algorithms”) AND NOT (“wastewater” OR “waste water”). The search yielded a total of 427 documents, including technical articles, review articles, book chapters, and conference papers.

Fig. 3a illustrates the distribution of published papers over the years. It shows that AI has been used in site remediation since the late 1990s. Initially, the research focused mainly on building decision-support tools using genetic algorithms and on optimizing various parameters using AI-based models. However, after 2010, the research diversified into different stages of site remediation. This includes predicting the concentration and spatial distribution of contaminants, employing DL models to create accurate surrogates for traditional process-based simulation models, conducting simulation-optimization of subsurface models to optimize remediation techniques, using simulation-optimization techniques for contamination source identification, and developing robust decision-support tools in data-rich fields.

Moreover, Fig. 5a indicates that the use of AI/ML/DL in site remediation has experienced significant growth in the past five years. To identify the key research areas where these technologies are applied, the co-occurrence of index-based and author keywords was analyzed using VOSviewer, an open-source software tool for constructing and visualizing bibliometric networks. The resulting keyword network is presented in Fig. 5b, and the country of origin of the co-authors of the articles is presented in Fig. 5c.

The analysis reveals that a considerable amount of research has been

devoted to groundwater contamination and remediation, while soil

contamination and remediation have been relatively less explored.

Additionally, both ML and DL models were extensively employed for

various applications in contaminated site remediation. Early research

focusing on the use of GA for optimizing remedial strategies and

building decision support tools, along with continued usage of GA for

various optimization practices, has contributed to its higher relevance in

the bibliography. In addition, as observed in Fig. 5c, most of the research regarding the integration of AI into site remediation is concentrated in the USA and China, followed by India, Iran, and the UK.

7.2. Site characterization and risk assessment

Site characterization is a crucial part of contaminated site remediation. It provides the necessary information for risk assessments and for designing effective remediation systems. The primary objective of site characterization is to identify the nature and extent of contamination. This includes determining the types of contaminants, their quantities, locations, and the phases in which they occur. Owing to the inherent nature of contaminated sites, rigorous groundwater and soil sampling for laboratory testing and the use of data-intensive statistical techniques are common for this purpose. However, with the emergence of AI-based technologies, this procedure can become significantly less cumbersome while maintaining the accuracy of the predictions (Hu et al., 2003). Table 2 lists a variety of studies centered on predicting contaminant concentrations, spatial distributions, or a combination of both, along with risk assessment activities, and details how researchers across the world have employed AI-based technologies to address problems in site characterization. A brief description of how these technologies can help enhance site characterization and risk assessment activities is provided below:

Spatial distribution of contamination: Numerous studies have reported the use of AI/ML/DL in estimating the spatial distribution of contaminants (Kanevski, 1999; Shaker et al., 2010; Zhang et al., 2023). The input data required for such predictions can be either sparsely sampled groundwater or soil monitoring data, or feature-based microscopic or drone images. Specific applications include extracting features from aerial or spectral images and predicting contaminant concentrations from the relationship between the extracted features and the measured concentrations (Jia et al., 2021b). Another way of predicting spatial distribution is to use data from groundwater monitoring wells or soil sampling, together with subsequent laboratory testing, to spatially interpolate the contaminant concentrations. A brief methodology flowchart for determining the spatial dispersion of pollutants, employing various imaging techniques, laboratory investigations, and ML/DL models, is depicted in Fig. 6. Table 2 briefly summarizes various studies reporting the application of different models to predict the spatial distribution of contaminants.
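To make the idea of interpolating concentrations from sparse sampling points concrete, the sketch below uses inverse-distance weighting (IDW). IDW is a classical deterministic interpolator, chosen here only because it fits in a few lines; it is not one of the ML/DL models used in the reviewed studies, and the well coordinates and concentrations are hypothetical.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples: iterable of (sample_x, sample_y, value) tuples.
    """
    num = den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value               # query coincides with a sampling point
        w = 1.0 / d ** power           # nearer samples get larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical monitoring wells: (easting m, northing m, concentration mg/L)
wells = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0), (10.0, 10.0, 40.0)]

centre = idw(wells, 5.0, 5.0)          # equidistant from all four wells
at_well = idw(wells, 0.0, 0.0)         # reproduces the measured value
near_high = idw(wells, 9.0, 9.0)       # pulled toward the nearest well
```

An ML interpolator would replace the fixed distance weighting with a relationship learned from the data, potentially using covariates such as soil type or depth in addition to location.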

Risk assessment: This is a critical phase following the spatial delineation of contaminant distribution, as described above. This spatial delineation provides valuable insights that can be harnessed to pinpoint areas of elevated risk. By identifying these risk zones, it becomes feasible to tailor an appropriate and targeted approach for remediation strategies. This step ensures that resources are allocated efficiently and that interventions are designed to effectively mitigate potential hazards based on the specific contamination patterns identified. In addition, ML-based frameworks can be used to identify the hotspots and drivers of contamination in soils, resulting in a comprehensive understanding of the associated risks (Yang et al., 2021).

7.3. Selection of remediation technology (decision-making)

Availability of data holds paramount importance in engineering-based fields, particularly when dealing with the growing number of contaminated sites and various types of contaminants worldwide. Identifying contamination and selecting suitable remediation techniques can be challenging, but ML models offer a valuable solution by leveraging vast amounts of pollutant and remediation data accumulated from decades of contaminated site remediation experience. For instance, Li et al. (2022a) employed a decision tree classifier on data from the CERCLA database to classify common pollutants and associated remediation techniques across 144 contaminated sites in four US states. The study revealed a decline in the growth of contaminated sites over the past decade, with physical remediation technologies being the most frequently employed.

In addition to analyzing data on prevalent remediation technologies, AI-based technologies can be utilized to evaluate the most suitable remediation technique based on in-situ site characteristics (Wijaya et al., 2023). These characteristics may include soil microbial data, physicochemical properties of the soil, the initial form of the contaminant, and the extent of contamination. By employing data analysis techniques on such information, one can assess which remediation technology aligns best with the prevalent site-specific conditions. Consequently, by leveraging past remediation project data and in-situ site characteristics, AI-based technologies empower decision-makers to make informed choices when selecting the appropriate remediation technology for a given scenario (Li et al., 2022b).

AI-based decision-support tools to select and implement an appropriate remedial action have been studied since the early 2000s (Chen et al., 2003; García et al., 2006; Dunea et al., 2014). A general

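Table 1

Various model-performance evaluation metrics used to evaluate AI/ML/DL based models.

| Metric | Formula |
|---|---|
| Coefficient of determination ($R^2$) | $1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ |
| Root mean square error (RMSE) | $\sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ |
| Mean absolute error (MAE) | $\dfrac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ |
| Mean absolute percentage error (MAPE) | $\dfrac{1}{n} \sum_{i=1}^{n} \left\lvert \dfrac{y_i - \hat{y}_i}{y_i} \right\rvert \times 100\%$ |
| Mean square error (MSE) | $\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ |
| Willmott’s index of agreement, $d$ | $1 - \dfrac{\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert}{2 \times \sum_{i=1}^{n} \lvert y_i - \bar{y} \rvert}$ |
| Relative absolute error (RAE) | $\dfrac{\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert}{\sum_{i=1}^{n} \lvert y_i - \bar{y} \rvert}$ |
| Relative error (RE) | $\sum_{i=1}^{n} \left\lvert \dfrac{y_i - \hat{y}_i}{y_i} \right\rvert \times 100\%$ |
| Precision | $\dfrac{TP}{TP + FP}$ |
| Recall | $\dfrac{TP}{TP + FN}$ |
| Accuracy | $\dfrac{TP + TN}{TP + FP + TN + FN}$ |
| Cohen’s Kappa coefficient | $\dfrac{2 \times (TP \times TN - FN \times FP)}{(TP + FP) \times (FP + TN) + (TP + FN) \times (FN + TN)}$ |
| F1 score | $\dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$ |
| AUC | Area under ROC curve |
| IQ | $Q_3 - Q_1$ |
| RPIQ | $\dfrac{IQ}{RMSE}$ |
| $\alpha$ | Probability of FP |
| $\beta$ | Probability of FN |
| $TE_{rate}$ | $\alpha + \beta$ |
| Spearman’s rank correlation coefficient | $1 - \dfrac{6 \times \sum d_i^2}{n(n^2 - 1)}$ |

Notes: $\hat{y}_i$ – predicted value, $y_i$ – actual observed value, $\bar{y}$ – mean of observed values, $n$ – total no. of test values, TP – True Positives, TN – True Negatives, FP – False Positives, FN – False Negatives, ROC – Receiver Operating Characteristic Curve, $Q_1$ – value below which 25% of samples can be found, $Q_3$ – value below which 75% of samples can be found, $d_i$ – difference in the ranks given to the two variable values for each item of the data.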


theoretical approach to the selection of remediation technology is presented in Fig. 7. Initially, when computational infrastructure was limited, decision-support tools were developed using expert inputs (Geng et al., 2001). Such tools were later converted into probabilistic tools that report the probability of achieving the required remediation efficiency if a given remediation technology is adopted (He et al., 2006; Qin et al., 2007). Subsequently, weighting systems were developed to capture the preferences of different stakeholders so that decision-support tools could account for them before choosing the remediation technology (Balasubramaniam et al., 2007). With growing concerns about climate change driven by unsustainable and environmentally unfriendly practices, including sustainability and resiliency in decision-making for site remediation is of paramount importance (Reddy et al., 2019a, 2019b). More recently, research on decision-support tools has begun to incorporate the sustainability and resiliency of the remediation technologies, along with stakeholder feedback (Li et al., 2022b; Huysegoms and Cappuyns, 2017).
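The stakeholder-weighting idea described above can be illustrated with a minimal weighted-sum ranking. This is only a sketch under stated assumptions: the function name `rank_technologies`, the four criteria, and every score and weight below are hypothetical placeholders chosen for illustration, not values or methods taken from the cited studies.

```python
def rank_technologies(scores, weights):
    """Rank candidate technologies by a normalized weighted sum of criterion scores.

    scores  : {technology: {criterion: score in [0, 1], higher is better}}
    weights : {criterion: relative stakeholder weight (any positive scale)}
    """
    total = sum(weights.values())
    norm = {c: w / total for c, w in weights.items()}  # weights now sum to 1
    ranked = sorted(
        ((sum(norm[c] * s[c] for c in norm), tech) for tech, s in scores.items()),
        reverse=True,
    )
    return [(tech, round(score, 3)) for score, tech in ranked]


# Hypothetical, made-up scores for three candidate technologies.
scores = {
    "pump_and_treat":   {"efficiency": 0.8, "cost": 0.4, "sustainability": 0.3, "acceptance": 0.6},
    "phytoremediation": {"efficiency": 0.5, "cost": 0.8, "sustainability": 0.9, "acceptance": 0.7},
    "soil_washing":     {"efficiency": 0.9, "cost": 0.3, "sustainability": 0.4, "acceptance": 0.5},
}

# Two hypothetical stakeholder weightings: balanced vs. efficiency-driven.
balanced = {"efficiency": 0.4, "cost": 0.2, "sustainability": 0.2, "acceptance": 0.2}
efficiency_first = {"efficiency": 0.7, "cost": 0.1, "sustainability": 0.1, "acceptance": 0.1}

print(rank_technologies(scores, balanced))
print(rank_technologies(scores, efficiency_first))
```

Changing the weights changes the ranking, which is the point of such weighting systems: the same technical scores can lead to different recommendations once stakeholder preferences are made explicit.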

In summary, the integration of AI-based techniques and data analysis in contaminated site remediation can revolutionize the way environmental challenges are addressed. By harnessing the power of data, AI-based technologies offer a more informed and effective approach to identifying contamination, understanding its distribution, and selecting

Fig. 5. Results of bibliometric analysis: (a) Annual scientific production of papers between 2000 and August 2023, (b) Visualization of authors’ and indexed keywords co-occurrence constructed using full counting, and (c) Visualization of the origin nations-network of co-authors from the literature analyzed.


Table 2

Various studies using AI/ML/DL based models for site characterization and risk assessment.

| Contaminant(s) under study | Model(s) used | Model accuracy parameters | Objective | Result | Reference |
|---|---|---|---|---|---|
| 1,2,3-trichloropropane (TCP) | CART, RF, and BRT | R², RMSE, and MAE | To predict the spatial distribution of TCP in the absence of groundwater monitoring, using various influencing parameters like historical land-use, dissolved oxygen, and co-contaminant (nitrate) data. To compare the performance of various models | BRT and RF models were found to be the better performing models, with R² values of 0.44 and 0.41 compared to CART (R² = 0.020). RF was successfully used to predict the spatial distribution and identify the relevant parameters affecting TCP concentration, i.e., precipitation, dissolved oxygen, and nitrate concentration. | Hauptman et al. (2023) |
| DEHP | Bi-LSTM, kNN, and RF | Confusion matrix | To develop a novel DL based model to predict the spatial variation of DEHP in the study area | The LSTM based model proposed in the study indicated a better performance compared to the other two models in predicting the spatial variation of DEHP | Zheng et al. (2022) |
| PFAS | RF, and LR | Spearman’s correlation coefficient, and AUC | To build an ML model in order to accurately predict the PFAS contamination using features such as co-contaminant fingerprint, proximity to airport and military installations, and other surface and subsurface features to allow for prioritizing the groundwater testing | The ML model built was able to predict the PFAS concentrations with high accuracy even with limited availability of co-contaminant data. The model was projected to reduce the number of groundwater wells to be tested by 70% as compared to the traditional random sampling approach | George and Dixit (2021) |
| PAHs | SVR; GA for optimization | RMSE, MAE, and MAPE | To build an SVR model to predict PAHs using TPHs as descriptors and compare optimization techniques for selecting hyperparameters to improve model accuracy. | The SVR model utilizing the GA approach with Gaussian kernel functions yielded the most accurate predictions when compared to other optimization approaches in the study. The study suggests that TPH concentrations can serve as a reliable means to predict PAH concentrations without compromising on prediction accuracy using the developed model | Akinpelu et al. (2020) |
| Arsenic | RF, ERF, SVM, MLP | Accuracy, Precision, Recall, F1 score, and Cohen’s Kappa coefficient | Mapping risk level for soil As pollution using high resolution aerial imaging (HRAI) and ML | ERF gave better predictions compared to the other three ML algorithms, with an average classification accuracy of 0.87. | Jia et al. (2021a) |
| Chromium | GRNN, and MLP with and without kriging | MAE, RMSE, Spearman’s rank correlation coefficient | To predict the concentration of chromium (Cr), which is abnormally distributed, using hybrid models consisting of ML models and residual kriging | Estimation of ML model residuals using residual kriging helps in smoothing out abnormal high and low predictions and can be superior as compared to pure ML-based models. | Tarasov et al. (2018) |
| Cr(VI) | XGBoost; kNN for imputation of missing data | R², MSE, and RMSE | To predict the long-term groundwater contamination with pollutants like Cr(VI) using an XGBoost model optimized using a Bayesian search cross-validation approach | The optimized XGBoost model can predict the contamination with good accuracy (R² = 0.99 during training, R² = 0.85 during testing) | Mazumdar et al. (2022) |
| Copper, Manganese, and Nickel | GRNN, and MLP with and without kriging | R², RMSE, Spearman’s correlation coefficient, Willmott’s index of agreement, IQ | To predict the spatial distribution of heavy metal contamination using hybrid models constructed using ML models and geostatistical kriging | The hybrid approaches employed did improve the prediction accuracy compared to basic ML models and the generally used universal kriging for spatial correlation | Sergeev et al. (2019) |
| Arsenic, Copper, and Lead | CACNN, CNN, ANN, RFR; PCA for dimensionality reduction | R², and RMSE | Estimation of heavy metal contamination from VNIRS data using DL based models | CACNN provided reasonably good estimates of all three elements, while ANN and RFR could not provide accurate estimates of all three elements at the same time | Pyo et al. (2020) |
| Iron, Manganese, and Zinc | MLP | Sum of Squares Error, RMSE, and Relative Error | Estimation and prediction of heavy metal levels using macro elements and altitude level data, at the Mount Ida national park | ANN can be effective in predicting contaminant concentrations based on various known soil parameters | Sari et al. (2022) |
| Heavy metals | DL with nearest neighbor neural network | RMSE, α, β, and TErate | To build a DL based spatial interpolation model, using residual network (ResNet) architecture, and compare it with traditional kriging based spatial interpolation models | DL algorithms provide a robust alternative to kriging based interpolation by providing higher accuracy for spatial interpolation of contaminant concentrations | Man et al. (2021) |
| Heavy metals | RF | R², ME, MSE, and RMSE | To determine the factor importance in predicting the heavy metal distribution in coastal reclaimed soils | The RF model was successfully used to establish the importance of various factors such as soil mineral composition, soil organic content, and chemical properties such as pH in determining the distribution dynamics of various heavy metals | Zhang et al. (2021) |
| Heavy metals | ANN, BP-FFNN | R², MSE, MAE, and RMSE | To determine the best pollution indexing approach to assess groundwater pollution through ML and DL based approaches | The DL (BP-FFNN) based approach was more appropriate to determine the effectiveness of the pollution indices compared to the ML (ANN) based approach in assessing the groundwater pollution | Singha et al. (2020) |
| Heavy metals | MLR1, RF with fuzzy c-means | R², and RMSE | To predict the spatial variation of heavy metal pollution, identify the key influencing factors, determine the risk levels, and delineate the risk zones | The effectiveness of the RF model in assessing the spatial distribution of contamination and identifying crucial influencing factors has been well established. Moreover, the successful application of fuzzy c-means for detecting and outlining risk zones has proven valuable in visualizing and pinpointing areas of concern. | Chen et al. (2023) |
| Soil Microplastics (MPs) | SVR-RBF, BPNN, RF, RBFN, LSTM, XGBoost, CART, RR, and LASSO regression | R², RMSE, and MAE | To assess and compare the accuracy and applicability of different ML models in predicting soil MPs abundance. To predict the spatial distribution of MPs based on the most accurate model | The SVR-RBF model was found to be the most accurate compared to other models in predicting the soil MPs abundance. The RF based ensemble model was the best in explaining the environmental factors affecting MPs distribution. | Qiu et al. (2023) |
| Boron | BPNN, SVM, and LR | MAE, and RMSE | To predict the concentration of geothermal originated boron in the study area using traditional ML models (SVM and linear regression) and a DL based model | DNN was able to predict the concentrations of boron in groundwater, surrounding the geothermal wells in the study area, with a better accuracy as compared to the other two traditional models | Tut Haklidir and Haklidir (2020) |
| Nitrate | BRT, SVM, and MDA | AUC, Kappa, and MSE | To produce contamination risk maps in the study area using the predictions from ML based models | All three models and an ensemble of the three models, weighted according to the performance, indicated good performance in predicting the groundwater contamination (AUC > 80%) along with risk levels | Sajedi-Hosseini et al. (2018) |
| Fluoride | ELM, MLP, and SVM | R², RMSE, and MAE | To investigate the ability of ELMs with various activation functions to predict fluoride contamination in groundwater as compared to that of MLP and SVM | ELM models outperformed MLP and SVM models in predicting the fluoride contamination of groundwater. ELM based on RBF gave the best results as compared to linear, polynomial, and sigmoid-based kernel functions. ELM models were also computationally more efficient | Barzegar et al. (2017) |
| Lead, Magnesium, Iron, Zinc, and Ammonia | MLR2, and RF | Confusion matrix, F1 score, Accuracy, Precision, and Recall | To predict the groundwater pollution and delineate the spatial distribution of risk zones with the help of ammonia index and other relevant variables. | The RF model showed a better performance when compared with MLR, with an accuracy of 93%. The study successfully depicted the effectiveness of ML models in predicting groundwater contamination using ammonia index and other relevant variables | Madani et al. (2022) |
| Radium | ANN, SVM | RMSE | To develop an optimized detection system based on various detector-algorithm combinations to improve the accuracy of detecting ‘hot’ particles | ML algorithms can improve detection limits compared to conventional count rate algorithms by concentrating on the spectral shape changes. ANN in particular gave better results as compared to SVM. | Varley et al. (2015) |

Note: Abbreviations - DEHP – Di(2-ethylhexyl) phthalate, PFAS – Per- and polyfluoroalkyl substances, PAHs – Polycyclic Aromatic Hydrocarbons; CART – Classification and Regression Tree, RF – Random Forest, BRT – Boosted Regression Trees, LSTM – Long Short Term Memory, Bi-LSTM – Bidirectional LSTM, kNN – k-Nearest Neighbor, LR – Linear Regression, SVR – Support Vector Regression, GA – Genetic Algorithm, SVM – Support Vector Machine, ERF – Extreme RF, MLP – Multi-Layer Perceptron, GRNN – Generalized Regression Neural Networks, CNN – Convolutional Neural Network, CACNN – Convolutional Autoencoder CNN, ANN – Artificial Neural Networks, RFR – Random Forest Regressor, PCA – Principal Component Analysis, BP-FFNN – Back-Propagated Feed Forward Neural Network, MLR1 – Multiple Linear Regression, RBF – Radial Basis Function, SVR-RBF – SVR with RBF as kernel function, BPNN – Back Propagated Neural Network, RBFN – RBF Network, XGBoost – Extreme Gradient Boosting, RR – Ridge Regression, LASSO – Least Absolute Shrinkage and Selection Operator, MDA – Multivariate Discriminant Analysis, ELM – Extreme Learning Machine, MLR2 – Multivariate Logistic Regression; VNIRS – Visible and Near Infrared Spectroscopy; TPH – Total Petroleum Hydrocarbons.

‘hot’ particles – Small highly radioactive items.

RMSE was used as a general purpose metric to improve the prediction accuracy of the models. However, actual field performance of the deployed detectors was evaluated with the help of the following metrics: Overall detection rate (ODR), Maximum detection rate (MDR), and False alarm rate (FAR).


the most appropriate remediation technique for each scenario with the

help of appropriately built decision-support tools.
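Several of the evaluation metrics listed in Table 1, and reported by the studies in Table 2, are straightforward to compute. The following is a minimal sketch using only the Python standard library; the function names and the example numbers are illustrative assumptions rather than data from any cited study, and the formulas follow the standard definitions of each metric.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (%) for paired observed/predicted values."""
    n = len(y_true)
    mean_obs = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_obs) ** 2 for y in y_true)           # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "mape": 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n,
    }


def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, accuracy, F1 and Cohen's kappa from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "kappa": 2 * (tp * tn - fn * fp)
        / ((tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)),
    }


# Made-up observed vs. predicted concentrations (arbitrary units).
reg = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
# Made-up confusion-matrix counts for a contaminated / not-contaminated classifier.
clf = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(reg)
print(clf)
```

Note that MAPE is undefined when an observed value is zero, so in practice such samples need separate handling.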

7.4. Design and optimization of remediation technologies

Following the comprehensive characterization of the site and the careful selection of the remediation technology to be implemented, a pivotal next step involves an efficient design of the chosen remediation approach. The primary objective of this design phase is to enhance the efficacy of the remediation process while concurrently minimizing both short-term and long-term costs (Sharma and Reddy, 2004). This optimization is achieved through a strategic balance that maximizes the efficiency of the remediation while ensuring its successful implementation. Various studies reporting the application of AI/ML/DL

Fig. 6. A general approach to prediction of soil contamination using limited chemical analyses and spectral imaging for site characterization.


models to design or optimize the selected remediation technique are presented in Table 3. These techniques can be harnessed to design and optimize various widely used remediation technologies (Wang et al., 2022). While Table 3 offers insights into the current application of AI in this context, the following sub-sections provide a concise overview of how these technologies are currently utilized and their potential future applications in the design and optimization of various remediation methods.

Pumping-based remediation: Several techniques, such as pump, treat, and inject or surfactant-enhanced aquifer remediation, are usually employed to remediate commonly occurring groundwater contaminants such as petroleum hydrocarbons, DNAPLs, and other water-soluble chemicals (Sharma and Reddy, 2004). Designing such systems optimally requires numerous simulations, which can become computationally expensive. ML/DL-based surrogate models have proven to be reasonably accurate while reducing the computational burden of such simulation-optimization problems (Sreekanth and Datta, 2011; Luo et al., 2013; Luo and Lu, 2014; Chu and Lu, 2015; Hou et al., 2015, 2016; Ouyang et al., 2017a, 2017b). In addition, algorithms such as GA and PSO, among others, can be used to optimize parameters such as the pumping schedule and remediation duration. A general approach to applying AI-based techniques in simulation-optimization problems, as shown in Fig. 8, consists of two main steps: first, building an appropriate surrogate model, and second, selecting and implementing an optimization algorithm that best fits the purpose. Ensemble learning, as shown in Fig. 8 and as described earlier, can be employed as necessary to improve the prediction accuracy and computational efficiency of surrogate models.
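The two-step simulation-optimization workflow described above can be sketched in a few lines. Everything here is an illustrative assumption: `simulator` is a made-up closed-form stand-in for an expensive flow-and-transport run, the surrogate is a simple inverse-distance-weighted interpolator (the cited studies use surrogates such as ANN, SVR, or kriging), and the cost coefficients, bounds, and GA settings are hypothetical.

```python
import math
import random


def simulator(rate, days):
    """Hypothetical stand-in for an expensive simulation run.
    Returns the fraction of contaminant remaining (made-up closed form)."""
    return math.exp(-rate * days / 5000.0)


def build_surrogate(samples, power=2.0):
    """Step 1: inverse-distance-weighted surrogate over (rate, days, residual) samples."""
    def predict(rate, days):
        num = den = 0.0
        for r, d, y in samples:
            # Scale both design variables so they contribute comparably.
            dist2 = ((rate - r) / 100.0) ** 2 + ((days - d) / 365.0) ** 2
            if dist2 == 0.0:
                return y  # exact hit on a training point
            w = 1.0 / dist2 ** (power / 2.0)
            num += w * y
            den += w
        return num / den
    return predict


def objective(rate, days, surrogate, target=0.1):
    """Operating cost plus a penalty when the predicted residual exceeds the target."""
    cost = 0.01 * rate * days + 0.5 * days
    violation = max(0.0, surrogate(rate, days) - target)
    return cost + 1e4 * violation


def genetic_search(surrogate, bounds, pop_size=30, generations=40, seed=0):
    """Step 2: a small genetic algorithm that queries only the cheap surrogate."""
    rng = random.Random(seed)

    def fitness(ind):
        return objective(ind[0], ind[1], surrogate)

    pop = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]  # truncation selection; the best individual survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Blend crossover followed by Gaussian mutation, clipped to the bounds.
            child = tuple(
                max(lo, min(hi, (x + y) / 2.0 + rng.gauss(0.0, 0.05 * (hi - lo))))
                for x, y, (lo, hi) in zip(a, b, bounds)
            )
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)


# 36 "expensive" runs on a coarse grid of pumping rate and duration train the surrogate.
samples = [(r, d, simulator(r, d)) for r in range(0, 101, 20) for d in range(30, 366, 67)]
surrogate = build_surrogate(samples)
bounds = [(0.0, 100.0), (30.0, 365.0)]  # hypothetical pumping rate and duration (days)
best = genetic_search(surrogate, bounds)
print("best (rate, days):", best, "true residual:", simulator(*best))
```

Because the optimizer only ever calls the surrogate, the number of expensive simulator runs is fixed by the training design (36 here) regardless of how many candidates the search evaluates. The final candidate should still be checked against the full simulator, as the last line does.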

Immobilization or containment of contaminants: In-situ remediation methods like immobilization or containment of contaminants necessitate the use of specific materials, such as nanoscale zero-valent iron (Manfron et al., 2020) or biochar (Kamdar et al., 2023; Zhang et al., 2022). These materials should possess desirable attributes, including high surface area, enhanced reactivity, and limited biodegradability, to facilitate the desired remediation outcomes. AI-based techniques can be used to determine the suitability of such materials and to predict the remediation efficiency along with the factors influencing it.

Phytoremediation: The uptake of hazardous contaminants by plants, leading to a reduction in contaminant concentration, is termed phytoremediation (Reddy et al., 2020; Reddy and Amaya-Santos, 2017). Plants often do not survive in the presence of hazardous pollutants such as heavy metals, or need soil amendments with materials such as sewage sludge, biochar, or compost for optimal performance. In this case, AI-based techniques can be used to predict suitable plant species for a given contaminant type and to determine optimal amendment conditions for effective phytoremediation based on limited experimental results.

Bioremediation: Enhancing the in-situ degradation of organic contaminants by pumping necessary nutrients or microbes and simulating favorable conditions is termed bioremediation (Decesaro et al., 2017). AI can be harnessed to optimize bioremediation by evaluating favorable reaction conditions through the analysis of available experimental data or data from past remediation projects (Jalali et al., 2023; Stef et al., 2022; Mohammadi et al., 2021).

Natural attenuation: The natural degradation of organic compounds without human intervention is termed natural attenuation (Sharma and Reddy, 2004). In this case, AI-based techniques can be employed to determine whether the in-situ conditions are suitable for natural attenuation and to predict the duration of remediation.

Electro-kinetic enhanced bioremediation: Bio-based remediation techniques such as phytoremediation and microbial degradation (bioremediation) can be less effective, and often such remediation alone is not sufficient to remove the contaminants completely. Hence, electro-kinetic remediation, in which an applied electric field helps move charged contaminants or additives, can be used to enhance the remedial efficiency of bioremediation techniques (Cameselle and Reddy, 2022). However, owing to the various physical, chemical, and biological processes involved, it is difficult to predict the performance of such techniques. Coupled process-based fate and transport models are usually employed to predict the performance of electro-kinetic remediation, but they are computationally intensive. AI-based simulation-optimization, as described earlier in “Pumping-based remediation”, can be used in this case to optimize the design and operation of electro-kinetic remediation.

Soil Washing: Soil washing is one of the oldest ex-situ remediation technologies, in which contaminated soil is excavated and washed using aqueous chemical solutions to leach the contaminants out of the soil. Selecting an appropriate washing agent to remove multiple contaminants is a challenge (Sharma and Reddy, 2004). AI can be beneficial in selecting an appropriate technology in this regard (Zhang et al., 2022). Past studies have also employed classification models like k-nearest neighbors (kNN) to predict the Cd-ion removal efficiency by soil washing (Mu’azu and Olatunji, 2023).
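As a rough illustration of how a kNN classifier of the kind mentioned above operates, the sketch below assigns a removal-efficiency class to a new washing condition by majority vote among its nearest training samples. The feature set (washing-agent concentration, pH, contact time, all pre-scaled to [0, 1]), the class labels, and all data points are hypothetical and are not taken from the cited study.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    train : list of (features, label) pairs, features scaled to comparable ranges
    query : feature tuple of the same length
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training set: (agent concentration, pH, contact time) -> removal class.
train = [
    ((0.9, 0.2, 0.8), "high"),
    ((0.8, 0.3, 0.9), "high"),
    ((0.7, 0.1, 0.7), "high"),
    ((0.2, 0.8, 0.3), "low"),
    ((0.1, 0.9, 0.2), "low"),
    ((0.3, 0.7, 0.1), "low"),
]

print(knn_predict(train, (0.85, 0.25, 0.75)))  # query close to the "high" samples
print(knn_predict(train, (0.15, 0.85, 0.20)))  # query close to the "low" samples
```

Because kNN relies on distances, features measured in different units should be scaled before use, and `k` is typically chosen by cross-validation.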

Although widely used in the design and optimization of various technologies, integration of AI/ML/DL into remediation endeavors can be further developed to employ more efficient DL models and the latest optimization techniques. Also, integration of AI/ML/DL into other

Fig. 7. A theoretical approach to building a decision-support tool for selecting remediation technology.


Table 3

Various studies demonstrating the use of AI/ML/DL based models for remediation design and optimization.

| Remediation technology optimized | Remediation of | Model(s) used | Model performance metrics | Objective | Dataset | Outcomes | Reference |
|---|---|---|---|---|---|---|---|
| Pump, Treat, and Inject (PTI) | TDS in groundwater | SFL and ANN models as surrogates and GA for optimization | R² and RMSE | To optimize the pumpage schedule in order to reduce the TDS concentrations while minimizing the costs. | MODFLOW-2000, and MT3DMS | Use of SFL instead of ANNs does not significantly affect the pumpage schedule but significantly reduces the model runtime. Employing such AI-based optimization driven by ML-based surrogates can be effective in optimizing the remediation strategies. | Sadeghfam et al. (2019) |
| Pump and Treat | Groundwater contaminated with CCl₄ | 3-D CNN | Precision, Accuracy, Sensitivity, and Specificity | To predict the future well performance based on past performance data and numerical simulations. | Field data and simulation data from a custom-made fate and transport model | (1) DL models can be effective in predicting contaminant plume distribution, helping in the decision-making process. (2) The architecture of 3D-CNN is flexible and can be readily extended to include numerous variables. | Song et al. (2023) |
| SEAR | DNAPL contaminated aquifer | PRS, RBFN, SVR, GP, and Kriging | R² and RMSE | (1) To create an efficient ensemble surrogate model with the help of five different surrogate models and verify the accuracy. (2) To verify the application of adaptive sequential sampling in the optimization of remedial strategy. | Simulations using UTCHEM model | (1) An ensemble surrogate can possibly improve the accuracy as compared to stand-alone surrogate models if the surrogates are chosen properly. (2) Adaptive sequential sampling can improve the reliability of the results obtained | Ouyang et al. (2017c) |
| SEAR | DNAPL contaminated aquifer | RBFN, SVR and GA to solve optimization problem | R², Absolute errors, and Relative errors (mean and maximum) | Utilization of set pair analysis in order to build an ensemble surrogate model to optimize remediation strategy | Simulations using UTCHEM model | Set pair analysis can be an effective method to improve the ensemble surrogate model’s selection, building, and accuracy | Hou et al. (2017) |
| SEAR | DNAPL contaminated aquifer | 3-D CNN | R², and SSIM | To employ a CNN model in order to accurately predict the removal rates of DNAPL by considering heterogeneity of the aquifer and also identify risk zones post remediation to aid in decision making | Simulations using UTCHEM model | The proposed optimization strategy using multiple realizations of k values and source zone architecture can accurately identify the optimal solutions with a 99.8% speed up as compared to conventional simulation optimization. This approach can also allow delineating risk zones based on NAPL left in the aquifer post remediation. | Du et al. (2022) |
| In-situ bioremediation | Petroleum contaminated site | Fuzzy based optimization | – | To develop a fuzzy rule based predictive control system to optimize the in-situ bioremediation process and demonstrate it using a case-study | Simulations using UTCHEM model | The developed fuzzy based predictive control system allows online, real-time, cost-effective, and optimized control of the in-situ bioremediation process during the entire cleanup duration. Its primary advantage is its real-time handling of uncertainties in the simulation model. The case-study highlights the crucial role of simulation model accuracy in developing an effective control system | Hu and Chan (2015) |
| In-situ bioremediation | Chlorinated ethenes | CART | AUC | To develop and demonstrate the value of a data mining approach in identifying the most promising in-situ remediation strategy and to identify the parameters affecting the in-situ reductive dechlorination potential | Groundwater monitoring wells data | The representative CART model was capable of effectively classifying the 3-month-ahead dechlorination potential with 75.8% and 69.5% true positive rates for the training and the test set, respectively. The study demonstrated the use of data mining to determine factors influencing in-situ dechlorination potential. | Lee et al. (2016) |
| Electro-kinetic enhanced bioremediation | Chlorinated solvents in low permeable porous media | ANN | R² | To build a surrogate model of the process-based numerical model simulating the electro-kinetic enhanced bioremediation and perform sensitivity and uncertainty analysis. | Simulation results from process-based numerical model | The surrogate model built exhibited a robust performance (R² > 0.99) in predicting the relative area of distribution and relative mass. The surrogate model was then successfully used to perform sensitivity and uncertainty analyses | Sprocati and Rolle (2021) |
| Phytoremediation | PAH contaminated soil | BP-FFNN | R² and RMSE | To predict the ideal conditions for maximum uptake of PAHs by Melilotus alba. | Experimental dataset obtained by performing pot experiments | (1) The ANN model accurately predicted the PAH levels in plant roots using several soil properties. (2) Based on the ANN output, different soil amendments were recommended for different soil pH conditions | Olawoyin (2016) |
| Phytoremediation | Cadmium contaminated soil | BP-FFNN | R², MSE, MAE, and Correlation coefficient | To predict the changes in cadmium uptake ability by Sinapis alba L. with sewage sludge modification. | Experimental dataset by performing pot experiments | The ANN based predictive model had significantly higher accuracy as compared to response surface methodology in predicting cadmium removal rate. Such models can be used to determine the percentage of sewage modification for optimal remediation. | Jaskulak et al. (2020) |
| Phytoremediation | Soil contaminated with heavy metals | XGBoost | R², RMSE, F1 score, Precision, Recall, and Accuracy | To predict the factors affecting phytoextraction of heavy metals by hyperaccumulators, using an ML based model | Data from literature | The output parameters like HM concentration in shoot, shoot yield, bio-concentration factor, metal extraction ratio, and remediation time were accurately predicted by the model based on input parameters consisting of soil, HM, and plant properties. | Shi et al. (2023) |
| Monitored natural attenuation | PAH contaminated soil | RF, and LDA | Accuracy | To predict the PAH degradation in soil using both the models. To assess the importance of different factors in the degradation of PAH | Experimental mesocosm trials | The RF model showed better accuracy than the LDA model in predicting the degradation of the investigated PAHs. The correlations between various variables and the degradation obtained using RF helped in understanding the factors affecting the degradation | Picariello et al. (2022) |
| ZVI based permeable reactive barriers | Chlorinated organic compounds | XGBoost | RMSE, MAE, and MAPE | To use an ML based model in order to select the optimal iron-based | Data from literature | An XGBoost model was developed with descriptors such as the | Ren et al. (2023) |

(continued)


commonly used soil and groundwater remediation technologies such as vitrification, soil vapor extraction, soil fracturing etc., has not yet been investigated, as far as the authors are aware.

7.5. Contaminant source identification, apportionment and monitoring

Source identification and monitoring of source zones are crucial for optimizing remedial efforts and avoiding further contamination of groundwater. A few studies have performed source identification of contaminated aquifers using various types of surrogate models (Rao, 2006; Srivastava and Singh, 2014, 2015; Zhao et al., 2016; Hou and Lu, 2018; Xing et al., 2019; Kang et al., 2021). However, the use of AI-based techniques in the monitoring of contaminants is relatively less explored. As represented in Table 4, monitoring of source zones or contaminated areas can be performed using various data types such as simulation data, remote sensing data, aerial imaging, or spectral imaging data. Meray et al. (2022) developed PyLEnM, a machine-learning-based framework for long-term contamination monitoring strategies. Such frameworks can contribute to the efficient supervision of contamination sources by determining the optimum number of monitoring wells needed and minimizing the usage of other resources. As a result, the monitoring costs can also be lowered.

Various types of studies were found in the field of source identification or apportionment and monitoring, and a brief overview of them is presented in Table 4. As understood from Table 4, increased contamination in the soil matrix can be apportioned to various sources based on the geographical data regarding enterprises, irrigation

Table 3 (continued)

| Remediation technology optimized | Remediation of | Model(s) used | Model performance metrics | Objective | Dataset | Outcomes | Reference |
|---|---|---|---|---|---|---|---|
| | | | | reactive materials for employing in permeable reactive barriers | | particle size, surface area, and pore size of ZVI based materials, and reaction conditions to predict the kinetic reaction constant of ZVI based materials. Results indicated that specific area is one of the most important factors in deciding the reaction rate. | |
| In-situ immobilization using biochar | Heavy metal contaminated soil | SVR, RF, and ANN | R², and RMSE | To predict the heavy metal immobilization efficiency of biochar-amended soil based on data of various properties of soil, biochar, and heavy metal that can affect immobilization efficiency. | Data from literature | The RF based model provided better accuracy compared to the other two models in predicting the immobilization efficiency. The RF model updated by removing redundant variables provided even better accuracy. The GUI developed based on the updated RF model performed reasonably well, with error <30% even using data outside the training dataset. | Palansooriya et al. (2022) |
| | Heavy metal contaminated soil | BP-FFNN, and RF | R², and RMSE | To predict the remediation efficiency of five heavy metals based on biochar and soil characteristics, incubation and initial HM conditions | Data from literature | Both ANN and RF models showed excellent performance, with RF being more tolerant to missing data. The analysis of influential factors revealed that the type of heavy metals, pH value of biochar, dosage, and remediation time were crucial for the remediation process. | Sun et al. (2022) |
| In-situ immobilization using nanoscale-ZVI | Arsenic contaminated soil | ANN | Correlation coefficient | To predict the arsenic immobilization efficiency of nanoscale zero-valent iron particles | Experimental data | Results indicated a good correlation between the input parameters and output efficiency using ANN. Further experimental results are needed to simulate diverse conditions for future studies. | Han et al. (2021) |

Note: Abbreviations: SEAR – Surfactant-Enhanced Aquifer Remediation, ZVI – Zero-Valent Iron; TDS – Total Dissolved Solids, DNAPL – Dense Non-Aqueous Phase Liquids, PAH – Polycyclic Aromatic Hydrocarbons; SFL – Sugeno Fuzzy Logic, ANN – Artificial Neural Networks, GA – Genetic Algorithm, CNN – Convolutional Neural Networks, 3D CNN – Three Dimensional CNN, PRS – Polynomial Response Surface, RBFN – Radial Basis Function Networks, SVR – Support Vector Regression, CART – Classification and Regression Tree, BP-FFNN – Back Propagated Feed Forward Neural Network, XGBoost – Extreme Gradient Boosting, RF – Random Forest, LDA – Linear Discriminant Analysis; Mean Error = (Error/n), SSIM – Structural Similarity Index Metric, which quantifies the structural similarity between two 2-D images; HM – Heavy Metal; UTCHEM – University of Texas Chemical Compositional Simulator; GUI – Graphical User Interface.

J.K. Janga et al.

Chemosphere 345 (2023) 140476

history and other urbanization factors with the help of ML-based classification and clustering models. However, it is important to further the research in the direction of source identification or apportionment, and monitoring, as the risk of recontamination always exists if the contamination sources are not properly managed.

8. Limitations of AI based tools

8.1. Overfitting

Overfitting occurs when a model performs well on the training data but poorly on the testing data, resulting in poor predictive accuracy on unseen data. This can happen due to high variance and low bias, excessive model complexity, or inadequate training data size. To reduce overfitting, techniques such as increasing the training data size, reducing model complexity, and shrinkage or regularization (Ridge and Lasso) can be used (James et al., 2023; Jabbar and Khan, 2015). Dropout can also be used in neural networks to tackle overfitting.
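The effect of shrinkage can be illustrated with a small, self-contained sketch. It is not taken from any of the reviewed studies: the data values, the polynomial degree, the penalty strength `lam`, and the helper names (`solve`, `fit_poly_ridge`, `predict`, `mse`) are assumptions chosen purely for illustration. A degree-5 polynomial is fitted to six noisy synthetic points with and without a Ridge (L2) penalty, and the training and test errors are printed for comparison.

```python
# Illustrative sketch: Ridge (L2) shrinkage applied to an over-flexible polynomial fit.
# All data values and the penalty strength are made up for demonstration.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_poly_ridge(xs, ys, degree, lam):
    """Polynomial least squares with penalty lam * sum(beta_k ** 2).

    Solves the normal equations (X^T X + lam * I) beta = X^T y.
    For simplicity the intercept is penalized as well.
    """
    n = degree + 1
    A = [[sum(x ** (i + j) for x in xs) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    return solve(A, b)


def predict(beta, x):
    return sum(c * x ** k for k, c in enumerate(beta))


def mse(beta, xs, ys):
    return sum((predict(beta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: underlying trend y = 2x plus fixed "measurement noise".
x_train = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
noise = [0.30, -0.40, 0.35, -0.30, 0.40, -0.35]
y_train = [2 * x + e for x, e in zip(x_train, noise)]
x_test = [-0.8, -0.4, 0.0, 0.4, 0.8]
y_test = [2 * x for x in x_test]

beta_plain = fit_poly_ridge(x_train, y_train, degree=5, lam=0.0)  # fits the noise exactly
beta_ridge = fit_poly_ridge(x_train, y_train, degree=5, lam=0.1)  # shrunk coefficients

for name, beta in (("no penalty", beta_plain), ("ridge", beta_ridge)):
    print(f"{name:>10}: train MSE = {mse(beta, x_train, y_train):.4f}, "
          f"test MSE = {mse(beta, x_test, y_test):.4f}")
```

Without the penalty, the polynomial passes through every training point, so its training error is essentially zero even though it has fitted the noise. The penalty shrinks the coefficient vector at the cost of a non-zero training error. In practice, `lam` would be selected by cross-validation rather than fixed by hand.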

8.2. Underfitting

An underfit model results in high prediction errors for both training and test data. The reasons for underfitting include high bias and low variance, the model being too simple, the size of the training data being insufficient, and the training data being uncleaned and noisy. Techniques to reduce underfitting include increasing model complexity, increasing the number of features by performing feature engineering, removing noise from the data, and increasing the number of epochs or the duration of training (Jabbar and Khan, 2015).
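As a minimal sketch of the feature-engineering remedy, the example below fits the same closed-form linear estimator twice to a synthetic, noise-free quadratic response: once on the raw input and once on an engineered squared feature. The data and the helper names (`fit_line`, `mse`) are illustrative assumptions and do not come from the reviewed studies.

```python
# Illustrative sketch: an underfit linear model versus the same model
# with one engineered feature. The data are synthetic and noise-free.

def fit_line(zs, ys):
    """Closed-form ordinary least squares for y = a * z + b."""
    n = len(zs)
    mean_z = sum(zs) / n
    mean_y = sum(ys) / n
    sxy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
    sxx = sum((z - mean_z) ** 2 for z in zs)
    a = sxy / sxx
    return a, mean_y - a * mean_z


def mse(a, b, zs, ys):
    return sum((a * z + b - y) ** 2 for z, y in zip(zs, ys)) / len(zs)


x_train = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
y_train = [x ** 2 for x in x_train]
x_test = [-1.75, -0.25, 0.75, 1.25]
y_test = [x ** 2 for x in x_test]

# Model 1: straight line in x. Too simple for a curved response.
a1, b1 = fit_line(x_train, y_train)
underfit_train = mse(a1, b1, x_train, y_train)
underfit_test = mse(a1, b1, x_test, y_test)

# Model 2: same estimator, but on the engineered feature z = x ** 2.
z_train = [x ** 2 for x in x_train]
z_test = [x ** 2 for x in x_test]
a2, b2 = fit_line(z_train, y_train)
improved_train = mse(a2, b2, z_train, y_train)
improved_test = mse(a2, b2, z_test, y_test)

print(f"line in x   : train MSE = {underfit_train:.3f}, test MSE = {underfit_test:.3f}")
print(f"line in x^2 : train MSE = {improved_train:.3f}, test MSE = {improved_test:.3f}")
```

The straight line shows the signature of underfitting, with large errors on both the training and the test points, whereas adding the squared feature lets the same simple estimator capture the relationship. With real, noisy data the improvement would be smaller, and added complexity would have to be balanced against the overfitting risk described in Section 8.1.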

8.3. Data unavailability

One of the main requirements for data-driven models is an abundance of data, so that the relationships between various variables can be understood accurately. However, data regarding contaminated site remediation is not properly organized and is not available in open-source databases, and incomplete data can lead to less accurate predictions. Therefore, to take full advantage of the prediction capabilities of ML-based models, efforts should be made to enhance data collection and sharing through collaborations and open data initiatives. Further research and development of ML-based models and algorithms and their use in remediation activities should be encouraged, and these models should be trained on larger and more diverse datasets. Moreover, integrating real-time monitoring and sensor technologies can enable timely contamination identification and prompt remediation actions. Additionally, continuous evaluation and adaptation of remediation strategies, international collaboration, and knowledge exchange are crucial for advancing the field. By implementing these recommendations, contamination patterns can be better understood, and remediation techniques can be effectively designed and implemented, leading to improved environmental outcomes.

9. Concluding remarks

The primary aim of this investigation was to perform an exhaustive

Fig. 8. A general approach to perform AI-based simulation-optimization for contaminated site remediation modeling


Table 4

Various studies reporting applications of AI/ML/DL for source identification, source apportionment and monitoring.

| Approach | Model used | Case | Model accuracy parameters | Objective | Outcome(s) | Reference |
|---|---|---|---|---|---|---|
| Source identification – inverse simulations for plume development modeling | Surrogate – KELM; Optimization – PSO, GA, QPSO, QGA | Hypothetical and actual | R², and RE | To build an accurate surrogate model and optimization algorithms which can work for both hypothetical (simulation based) and actual (monitoring data) cases | (1) KELM could accurately predict the simulation results with R² of 0.9990 and reasonably predict the actual data with R² >0.99 in 9/15 wells. (2) QGA and QPSO had better prediction accuracies in terms of source identification of the hypothetical case, while the accuracy was similar for all four algorithms in the actual case | Zhao et al. (2020) |
| | Surrogates – LSTM, Kriging, KELM, RBFN; Optimization – GA | Hypothetical | R², RE, and RMSE | To employ a DL-based model (LSTM) as a surrogate for the traditional physics-based simulation model and compare its performance with that of other surrogate models. | LSTM had the highest accuracy followed by RBFN, KELM, and kriging. The total time required to train the model was the least for kriging, followed by KELM, RBFN, and LSTM. However, compared to the computational time required for the traditional simulation/optimization approach, all four surrogate models saved >99% of the computational time required. | Li et al. (2021) |
| Spatial distribution prediction and corresponding land use, soil and water features extraction for source apportionment | RF, SVM, and BP-FFNN | Actual | R², MAE, MSE, and RMSE | To employ ML based models to identify the key factors causing heavy metal pollution in groundwater | Outcomes suggested that rapid urbanization is one of the leading causes of the increase in heavy metal concentrations in the subsurface. The prediction accuracy for different contaminants varied across the models used. Overall, RF and SVM performed better than ANN | Zhang et al. (2020) |
| | RF, NB for classification and BLMI for source apportionment | Actual | Precision, Recall, and F1 score | To use the proposed combination of models to predict spatial variation of heavy metals, identify affecting factors, and correlate using spatial clustering | The NB classifier was effectively used to identify 250 contaminating enterprises and 25 contributing medium industry types in the study area. The RF model was effective in predicting contaminant concentrations and in quantitatively determining the factors responsible. BLMI was able to successfully cluster the risk zones by using the contaminant risk levels and contributing factors. | Huang et al. (2022) |
| | Classification – NB, ANN, and SVM; source apportionment – BLMI | Actual | Accuracy, and Kappa coefficient | To classify the enterprise types using ML models and then perform source apportionment using BLMI technique | Results revealed that NB performed slightly better in classifying the enterprises. Source apportionment using BLMI indicated that increased Cd concentrations were mainly caused by excessive fertilization, and due to coal mining and metal industries, and chemical industries were the main reason for Hg pollution | Jia et al. (2019) |
| Monitoring of source zone during remediation | CVAE-EnKF | Hypothetical | NRMSE | To determine the variation in DNAPL saturation at the source zone with changing source zone architecture during remediation using DL-based model (auto-encoder) | The proposed CVAE-EnKF framework offers a physically based prior distribution for DNAPL saturations, leading to enhanced estimations of evolving DNAPL source zone architecture and associated remediation metrics throughout the remediation process. As a result, CVAE-EnKF demonstrates promising capabilities as a high-resolution method for monitoring DNAPL remediation. | Kang et al. (2021) |
| Monitoring contaminated area using remote sensing | RT, SVM, and RF | Actual | MAE, RAE, Precision, | To monitor the contamination due to oil spills in the study using remote sensing data | Spectral indices obtained from remote sensing data can be a great way to differentiate between | Kaplan et al. (2022) |

(continued on next page)


literature review encompassing the existing research landscape concerning the utilization of AI/ML/DL across diverse dimensions of contaminated site remediation. In its initial phase, the study delivered a succinct introduction to distinct categories of AI/ML/DL models while introducing an array of metrics employed for the assessment of model performance. Subsequently, a thorough bibliometric analysis was conducted to examine the prevalence and deployment of AI/ML/DL within the domain of contaminated site remediation. The findings indicated an increasing prevalence of research related to AI/ML/DL in site remediation in the past five years. More research has been dedicated to employing these techniques in groundwater remediation than in soil remediation. Furthermore, a thorough review of literature was conducted concerning the use of these technologies at different stages of site remediation, including site characterization, risk assessment, decision support tools (DST), remediation design and optimization, and source identification and monitoring. This review showed that a significant portion of research has focused on specific types of problems, such as ML/DL-based models in simulation-optimization problems for contaminated aquifers, DSTs for petroleum-contaminated sites, ML/DL-based models for spatial interpolation of contaminant concentration, and groundwater contaminant source identification. However, studies pertaining to AI/ML/DL applications in optimizing diverse remediation technologies and post-remediation monitoring were relatively scarce, possibly due to limited data availability. Nevertheless, the studies found in these cases demonstrated the immense potential of AI/ML/DL-based tools in remediation optimization and monitoring processes. To fully harness the advantages of these cutting-edge computing tools in site remediation, future research should prioritize building robust databases encompassing diverse remediation technologies and monitoring statistics. By doing so, the remediation process can be significantly enhanced through the integration of AI/ML/DL. With an emphasis on data availability and innovative research approaches, the field of contaminated site remediation can continue to progress and benefit from these emerging technologies.

CRediT author statement

Jagadeesh Kumar Janga: Conceptualization, Methodology, Investigation, Resources, Writing - Original Draft. Krishna R. Reddy: Conceptualization, Methodology, Investigation, Resources, Supervision, Project administration, Funding acquisition, Writing - Review & Editing. KVNS Raviteja: Conceptualization, Methodology, Investigation, Resources, Writing - Review & Editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

Aggarwal, C.C., 2018. Neural Networks and Deep Learning. Springer, Cham.

Akinpelu, A.A., Ali, M.E., Owolabi, T.O., Johan, M.R., Saidur, R., Olatunji, S.O.,

Chowdbury, Z., 2020. A support vector regression model for the prediction of total

polyaromatic hydrocarbons in soil: an artificial intelligent system for mapping

environmental pollution. Neural Comput. Appl. 32, 14899–14908.

Alloghani, M., Al-Jumeily, D., Mustafina, J., Hussain, A., Aljaaf, A.J., 2020. A systematic

review on supervised and unsupervised machine learning algorithms for data

science. In: Supervised and Unsupervised Learning for Data Science, pp. 3–21.

Asher, M.J., Croke, B.F., Jakeman, A.J., Peeters, L.J., 2015. A review of surrogate models

and their application to groundwater modeling. Water Resour. Res. 51 (8),

5957–5973.

Auer, P., Burgsteiner, H., Maass, W., 2008. A learning rule for very simple universal

approximators consisting of a single layer of perceptrons. Neural Network. 21 (5),

786–795.

Baecher, G.B., 2023. 2021 Terzaghi lecture: geotechnical systems, uncertainty, and risk.

J. Geotech. Geoenviron. Eng. 149 (3), 03023001.

Balasubramaniam, A., Boyle, A.R., Voulvoulis, N., 2007. Improving petroleum

contaminated land remediation decision-making through the MCA weighting

process. Chemosphere 66 (5), 791–798.

Bank, D., Koenigstein, N., Giryes, R., 2023. Autoencoders. Machine Learning for Data

Science Handbook: Data Mining and Knowledge Discovery Handbook, pp. 353–374.

Barzegar, R., Asghari Moghaddam, A., Adamowski, J., Fijani, E., 2017. Comparison of

machine learning models for predicting fluoride contamination in groundwater.

Stoch. Environ. Res. Risk Assess. 31, 2705–2718.

Bayraktar, Z., Komurcu, M., Werner, D.H., 2010. Wind Driven Optimization (WDO): a

novel nature-inspired optimization algorithm and its application to

electromagnetics. In: 2010 IEEE Antennas and Propagation Society International

Symposium. IEEE, pp. 1–4.

Cameselle, C., Reddy, K.R., 2022. Electrobioremediation: combined electrokinetics and

bioremediation technology for contaminated site remediation. Indian Geotech. J. 52

(5), 1205–1225.

Chen, Z., Huang, G.H., Chan, C.W., Geng, L.Q., Xia, J., 2003. Development of an expert

system for the remediation of petroleum-contaminated sites. Environ. Model. Assess.

8, 323–334.

Chen, D., Wang, X., Luo, X., Huang, G., Tian, Z., Li, W., Liu, F., 2023. Delineating and

identifying risk zones of soil heavy metal pollution in an industrialized region using

machine learning. Environ. Pollut. 318, 120932.

Chu, H., Lu, W., 2015. Optimization design based on ensemble surrogate models for

DNAPLs-contaminated groundwater remediation. J. Water Supply Res. Technol. -

Aqua 64 (6), 697–707.

Chung, J., Gulcehre, C., Cho, K., Bengio, Y., 2014. Empirical Evaluation of Gated

Recurrent Neural Networks on Sequence Modeling arXiv preprint arXiv:1412.3555.

Cozad, A., Sahinidis, N.V., Miller, D.C., 2014. Learning surrogate models for simulation-

based optimization. AIChE J. 60 (6), 2211–2227.

Table 4 (continued)

| Approach | Model used | Case | Model accuracy parameters | Objective | Outcome(s) | Reference |
|---|---|---|---|---|---|---|
| | | | Recall, and F1 score | | contaminated and uncontaminated areas. ML based models could effectively classify the site as contaminated or uncontaminated based on the differences in spectral values of clean and contaminated areas. Such technologies have scope to further aid in monitoring of the contaminated sites. | |

Note: Abbreviations: KELM – Kernel Extreme Learning Machine, PSO – Particle Swarm Optimization, GA – Genetic Algorithm, QPSO – Quantum PSO, QGA – Quantum GA, LSTM – Long Short Term Memory, RBFN – Radial Basis Function Network, RF – Random Forest, SVM – Support Vector Machine, BP-FFNN – Back Propagated Feed Forward Neural Network, NB – Naïve Bayes Classifier, BLMI – Bivariate Local Moran’s I, ANN – Artificial Neural Network, CVAE – Convolutional Variational Auto Encoder, EnKF – Ensemble Kalman Filter, RT – Random Tree; NRMSE – Normalized Root Mean Square Error, $\mathrm{NRMSE} = \mathrm{RMSE}/(x_{\max} - x_{\min})$.


Decesaro, A., Rampel, A., Machado, T.S., Thomé, A., Reddy, K.R., Margarites, A.C.,

Colla, L.M., 2017. Bioremediation of soil contaminated with diesel and biodiesel fuel

using biostimulation with microalgae biomass. J. Environ. Eng. 143 (4), 04016091.

Doersch, C., 2016. Tutorial on Variational Autoencoders arXiv:1606.05908.

Dorigo, M., Birattari, M., Stutzle, T., 2006. Ant colony optimization. IEEE Comput. Intell.

Mag. 1 (4), 28–39.

Du, J., Shi, X., Mo, S., Kang, X., Wu, J., 2022. Deep learning based optimization under

uncertainty for surfactant-enhanced DNAPL remediation in highly heterogeneous

aquifers. J. Hydrol. 608, 127639.

Dunea, D., Iordache, S., Pohoata, A., Neagu Frasin, L.B., 2014. Investigation and

selection of remediation technologies for petroleum-contaminated soils using a

decision support system. Water, Air, Soil Pollut. 225, 1–18.

Farmer, J.D., Packard, N.H., Perelson, A.S., 1986. The immune system, adaptation, and

machine learning. Phys. Nonlinear Phenom. 22 (1–3), 187–204.

García, M., López, E., Kumar, V., Valls, A., 2006. A multicriteria fuzzy decision system to sort contaminated soils. In: Modeling Decisions for Artificial Intelligence: Third

International Conference. Springer Berlin Heidelberg, Tarragona, Spain,

pp. 105–116.

Gautam, K., Sharma, P., Dwivedi, S., Singh, A., Gaur, V.K., Varjani, S., et al., 2023.

A Review on Control and Abatement of Soil Pollution by Heavy Metals: Emphasis on

Artificial Intelligence in Recovery of Contaminated Soil. Environmental Research,

115592.

Geem, Z.W., Kim, J.H., Loganathan, G.V., 2001. A new heuristic optimization algorithm:

harmony search. Simulation 76 (2), 60–68.

Gen, M., Cheng, R., 1999. Genetic Algorithms and Engineering Optimization, vol. 7. John

Wiley & Sons, Inc.

Geng, L., Chen, Z., Chan, C.W., Huang, G.H., 2001. An intelligent decision support

system for management of petroleum-contaminated sites. Expert Syst. Appl. 20 (3),

251–260.

George, S., Dixit, A., 2021. A machine learning approach for prioritizing groundwater

testing for per- and polyfluoroalkyl substances (PFAS). J. Environ. Manag. 295,

113359.

Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT press.

Gurney, K., 1997. An Introduction to Neural Networks, first ed. CRC Press.

Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G.,

Cai, J., Chen, T., 2018. Recent advances in convolutional neural networks. Pattern

Recogn. 77, 354–377.

Han, Z., Salawu, O.A., Zenobio, J.E., Zhao, Y., Adeleye, A.S., 2021. Emerging investigator

series: immobilization of arsenic in soil by nanoscale zerovalent iron: role of

sulfidation and application of machine learning. Environ. Sci.: Nano 8 (3), 619–633.

Hanoon, M.S., Ahmed, A.N., Fai, C.M., Birima, A.H., Razzaq, A., Sherif, M., et al., 2021.

Application of artificial intelligence models for modeling water quality in

groundwater: comprehensive review, evaluation and future trends. Water, Air, Soil

Pollut. 232, 1–41.

Hauptman, B.H., Naughton, C.C., Harmon, T.C., 2023. Using Machine Learning to

Predict 1, 2, 3-trichloropropane Contamination from Legacy Non-point Source

Pollution of Groundwater in California’s Central Valley, vol. 22. Groundwater for

Sustainable Development, 100955.

He, L., Chan, C.W., Huang, G.H., Zeng, G.M., 2006. A probabilistic reasoning-based

decision support system for selection of remediation technologies for petroleum-

contaminated sites. Expert Syst. Appl. 30 (4), 783–795.

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8),

1735–1780.

Hou, Z., Lu, W., Chu, H., Luo, J., 2015. Selecting parameter-optimized surrogate models

in DNAPL-contaminated aquifer remediation strategies. Environ. Eng. Sci. 32 (12),

1016–1026.

Hou, Z., Lu, W., Chen, M., 2016. Surrogate-based sensitivity analysis and uncertainty

analysis for DNAPL-contaminated aquifer remediation. J. Water Resour. Plann.

Manag. 142 (11), 04016043.

Hou, Z., Lu, W., Xue, H., Lin, J., 2017. A comparative research of different ensemble

surrogate models based on set pair analysis for the DNAPL-contaminated aquifer

remediation strategy optimization. J. Contam. Hydrol. 203, 28–37.

Hou, Z., Lu, W., 2018. Comparative study of surrogate models for groundwater

contamination source identification at DNAPL-contaminated sites. Hydrogeol. J. 26

(3).

Hu, Z., Chan, C.W., 2015. In-situ bioremediation for petroleum contamination: a fuzzy

rule-based model predictive control system. Eng. Appl. Artif. Intell. 38, 70–78.

Hu, Z., Chan, C.W., Huang, G.H., 2003. A fuzzy expert system for site characterization.

Expert Syst. Appl. 24 (1), 123–131.

Huang, G., Wang, X., Chen, D., Wang, Y., et al., 2022. A hybrid data-driven framework

for diagnosing contributing factors for soil heavy metal contaminations using

machine learning and spatial clustering analysis. J. Hazard Mater. 437, 129324.

Huang, G.B., Zhu, Q.Y., Siew, C.K., 2006. Extreme learning machine: theory and

applications. Neurocomputing 70 (1–3), 489–501.

Huysegoms, L., Cappuyns, V., 2017. Critical review of decision support tools for

sustainability assessment of site remediation options. J. Environ. Manag. 196,

278–296.

Jabbar, H., Khan, R.Z., 2015. Methods to avoid over-fitting and under-fitting in

supervised machine learning (comparative study). Computer Science,

Communication and Instrumentation Devices 70 (10), 978–981, 3850.

Jalali, F.M., Chahkandi, B., Gheibi, M., Eftekhari, M., Behzadian, K., Campos, L.C., 2023.

Developing a smart and clean technology for bioremediation of antibiotic

contamination in arable lands. Sustainable Chemistry and Pharmacy 33, 101127.

James, G., Witten, D., Hastie, T., Tibshirani, R., Taylor, J., 2023. An Introduction to

Statistical Learning: with Applications in Python. Springer International Publishing,

New York, USA.

Jaskulak, M., Grobelak, A., Vandenbulcke, F., 2020. Modeling and optimizing the

removal of cadmium by Sinapis alba L. from contaminated soil via Response Surface

Methodology and Artificial Neural Networks during assisted phytoremediation with

sewage sludge. Int. J. Phytoremediation 22 (12), 1321–1330.

Jia, X., Hu, B., Marchant, B.P., Zhou, L., Shi, Z., Zhu, Y., 2019. A methodological

framework for identifying potential sources of soil heavy metal pollution based on

machine learning: a case study in the Yangtze Delta, China. Environ. Pollut. 250,

601–609.

Jia, X., Cao, Y., O’Connor, D., Zhu, J., Tsang, D.C., Zou, B., Hou, D., 2021a. Mapping soil

pollution by using drone image recognition and machine learning at an arsenic-contaminated agricultural field. Environ. Pollut. 270, 116281.

Jia, X., O’Connor, D., Shi, Z., Hou, D., 2021b. VIRS based detection in combination with

machine learning for mapping soil pollution. Environ. Pollut. 268, 115845.

Kamdar, B.A., Solanki, C.H., Reddy, K.R., 2023. Moringa seed cake biochar: a novel

binder for sustainable remediation of lead-contaminated soil. J. Environ. Eng. 149

(10), 04023059.

Kanevski, M.F., 1999. Spatial predictions of soil contamination using general regression

neural networks. Syst. Res. Inf. Sci. 8, 241–256.

Kang, X., Kokkinaki, A., Power, C., Kitanidis, P.K., Shi, X., Duan, L., et al., 2021.

Integrating deep learning-based data assimilation and hydrogeophysical data for

improved monitoring of DNAPL source zones during remediation. J. Hydrol. 601,

126655.

Kaplan, G., Aydinli, H.O., Pietrelli, A., Mieyeville, F., Ferrara, V., 2022. Oil-contaminated

soil modeling and remediation monitoring in arid areas using remote sensing. Rem.

Sens. 14 (10), 2500.

Karaboga, D., 2005. An Idea Based on Honey Bee Swarm for Numerical Optimization.

Technical Report-TR06, Erciyes University, Engineering Faculty. Computer

Engineering Department.

Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. In: Proceedings of

ICNN’95-international Conference on Neural Networks, vol. 4. IEEE, pp. 1942–1948.

Khan, F.I., Husain, T., Hejazi, R., 2004. An overview and analysis of site remediation

technologies. J. Environ. Manag. 71 (2), 95–122.

Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P., 1983. Optimization by simulated annealing.

Science 220 (4598), 671–680.

Kochenderfer, M.J., 2015. Decision Making under Uncertainty: Theory and Application.

MIT Press.

Laha, S., Mukherjee, S., Nebhrajani, S.R., 2000. Information management system for site

remediation efforts. Environ. Manag. 25, 513–523.

Lambora, A., Gupta, K., Chopra, K., 2019. Genetic algorithm-A literature review. In: 2019

International Conference on Machine Learning, Big Data, Cloud and Parallel

Computing (COMITCon). IEEE, pp. 380–384.

Lee, J., Im, J., Kim, U., Löffler, F.E., 2016. A data mining approach to predict in situ detoxification potential of chlorinated ethenes. Environ. Sci. Technol. 50 (10),

5181–5188.

Lehr, J.H., Hyman, M., Gass, T., Seevers, W.J., 2002. Handbook of Complex

Environmental Remediation Problems. McGraw-Hill Education.

Li, J., Lu, W., Luo, J., 2021. Groundwater contamination sources identification based on

the Long-Short Term Memory network. J. Hydrol. 601, 126670.

Li, H., Zhou, Z., Long, T., Wei, Y., Xu, J., Liu, S., Wang, X., 2022a. Big-data analysis and

machine learning based on oil pollution remediation cases from CERCLA database.

Energies 15 (15), 5698.

Li, X., Yi, S., Cundy, A.B., Chen, W., 2022b. Sustainable decision-making for

contaminated site risk management: a decision tree model using machine learning

algorithms. J. Clean. Prod. 371, 133612.

Lu, Y., 2019. Artificial intelligence: a survey on evolution, models, applications and

future trends. Journal of Management Analytics 6 (1), 1–29.

Luo, J., Lu, W., Xin, X., Chu, H., 2013. Surrogate model application to the identification

of an optimal surfactant-enhanced aquifer remediation strategy for DNAPL-

contaminated sites. J. Earth Sci. 24 (6), 1023–1032.

Luo, J., Lu, W., 2014. A mixed-integer non-linear programming with surrogate model for

optimal remediation design of NAPLs contaminated aquifer. Int. J. Environ. Pollut.

54 (1), 1–16.

Madani, A., Hagage, M., Elbeih, S.F., 2022. Random Forest and Logistic Regression

algorithms for prediction of groundwater contamination using ammonia

concentration. Arabian J. Geosci. 15 (20), 1619.

Man, J., Zeng, L., Luo, J., Gao, W., Yao, Y., 2021. Application of the deep learning

algorithm to identify the spatial distribution of heavy metals at contaminated sites.

ACS ES&T Engineering 2 (2), 158–168.

Manfron, S., Thomé, A., Cecchim, I., Reddy, K.R., 2020. Application of zero-valent iron

nanoparticles (nZVI) on the remediation of contaminated soil and groundwater: a

review. Quím. Nova 43, 623–631.

Mazumdar, H., Murphy, M.P., Bhatkande, S., Emerson, H.P., Kaplan, D.I., Gohel, H.A.,

2022. Optimized machine learning model for predicting groundwater

contamination. In: 2022 IEEE MetroCon. IEEE, pp. 1–3.

Meray, A.O., Sturla, S., Siddiquee, M.R., Serata, R., Uhlemann, S., Gonzalez-Raymat, H.,

et al., 2022. Pylenm: a machine learning framework for long-term groundwater

contamination monitoring strategies. Environ. Sci. Technol. 56 (9), 5973–5983.

Mirjalili, S., Mirjalili, S.M., Lewis, A., 2014. Grey wolf optimizer. Adv. Eng. Software 69,

46–61.

Mohammadi, M., Gheibi, M., Fathollahi-Fard, A.M., Eftekhari, M., Kian, Z., Tian, G.,

2021. A hybrid computational intelligence approach for bioremediation of

amoxicillin based on fungus activities from soil resources and aflatoxin B1 controls.

J. Environ. Manag. 299, 113594.

Mu’azu, N.D., Olatunji, S.O., 2023. K-nearest neighbor based computational intelligence

and RSM predictive models for extraction of Cadmium from contaminated soil. Ain

Shams Eng. J. 14 (4), 101944.


Olawoyin, R., 2016. Application of backpropagation artificial neural network prediction

model for the PAH bioremediation of polluted soil. Chemosphere 161, 145–150.

Ouyang, Q., Lu, W., Lin, J., Deng, W., Cheng, W., 2017a. Conservative strategy-based

ensemble surrogate model for optimal groundwater remediation design at DNAPLs-

contaminated sites. J. Contam. Hydrol. 203, 1–8.

Ouyang, Q., Lu, W., Hou, Z., Zhang, Y., Li, S., Luo, J., 2017b. Chance-constrained multi-

objective optimization of groundwater remediation design at DNAPLs-contaminated

sites using a multi-algorithm genetically adaptive method. J. Contam. Hydrol. 200,

15–23.

Ouyang, Q., Lu, W., Miao, T., Deng, W., Jiang, C., Luo, J., 2017c. Application of ensemble

surrogates and adaptive sequential sampling to optimal groundwater remediation

design at DNAPLs-contaminated sites. J. Contam. Hydrol. 207, 31–38.

Palansooriya, K.N., Li, J., Dissanayake, P.D., Suvarna, M., et al., 2022. Prediction of soil

heavy metal immobilization by biochar using machine learning. Environ. Sci.

Technol. 56 (7), 4187–4198.

Perera, A.T.D., Kamalaruban, P., 2021. Applications of reinforcement learning in energy

systems. Renew. Sustain. Energy Rev. 137, 110618.

Picariello, E., Baldantoni, D., De Nicola, F., 2022. Investigating natural attenuation of

PAHs by soil microbial communities: insights by a machine learning approach.

Restor. Ecol. 30 (8), e13655.

Polikar, R., 2012. Ensemble learning. Ensemble Machine Learning: Methods and

Applications 1–34.

Pyo, J., Hong, S.M., Kwon, Y.S., Kim, M.S., Cho, K.H., 2020. Estimation of heavy metals

using deep neural network with visible and infrared spectroscopy of soil. Sci. Total

Environ. 741, 140162.

Qin, X.S., Huang, G.H., Huang, Y.F., Zeng, G.M., Chakma, A., Li, J.B., 2007. NRSRM: a

decision support system and visualization software for the management of

petroleum-contaminated sites. Energy Sources, Part A. 28 (3), 199–220.

Qiu, Y., Zhou, S., Zhang, C., Qin, W., Lv, C., Zou, M., 2023. Identification of potentially

contaminated areas of soil microplastic based on machine learning: a case study in

Taihu Lake region, China. Sci. Total Environ. 877, 162891.

Rao, S.V.N., 2006. A computationally efficient technique for source identification

problems in three-dimensional aquifer systems using neural networks and simulated

annealing. Environ. Forensics 7 (3), 233–240.

Raviteja, K.V.N.S., Reddy, K.R., 2023. Application of artificial intelligence, machine

learning, and deep learning in contaminated site remediation. In: Al Khaddar, R.,

et al. (Eds.), Recent Developments in Energy and Environmental Engineering,

TRACE 2022, Lecture Notes in Civil Engineering, vol. 333. Springer, Singapore.

Reddy, K.R., Amaya-Santos, G., 2017. Effects of variable site conditions on

phytoremediation of mixed contaminants: field-scale investigation at big marsh site.

J. Environ. Eng. 143 (9), 04017057.

Reddy, K.R., Cameselle, C., Adams, J.A., 2019a. Sustainable Engineering: Drivers,

Metrics, Tools, and Applications. John Wiley & Sons, Inc.

Reddy, K.R., Kumar, G., Du, Y.J., 2019b. Risk, sustainability, and resiliency

considerations in polluted site remediation. In: Zhan, L., Chen, Y., Bouazza, A. (Eds.),

8th International Congress on Environmental Geotechnics Volume 1, Environmental

Science and Engineering. Springer, Singapore.

Reddy, K.R., Chirakkara, R.A., Martins Ribeiro, L.F., 2020. Effects of elevated

concentrations of co-existing heavy metals and PAHs in soil on phytoremediation.

J. Hazardous, Toxic, and Radioactive Waste 24 (4), 04020035.

Ren, Y., Cui, M., Zhou, Y., Lee, Y., Ma, J., Han, Z., Khim, J., 2023. Zero-valent iron based

materials selection for permeable reactive barrier using machine learning. J. Hazard

Mater. 453, 131349.

Runkler, T.A., 1996. Extended defuzzification methods and their properties. In:

Proceedings of IEEE 5th International Fuzzy Systems, vol. 1. IEEE, pp. 694–700.

Sadeghfam, S., Hassanzadeh, Y., Khatibi, R., Nadiri, A.A., Moazamnia, M., 2019.

Groundwater remediation through pump-treat-inject technology using optimum

control by articial intelligence (OCAI). Water Resour. Manag. 33, 1123–1145.

Sajedi-Hosseini, F., Malekian, A., Choubin, B., Rahmati, O., Cipullo, S., Coulon, F.,

Pradhan, B., 2018. A novel machine learning-based approach for the risk assessment

of nitrate groundwater contamination. Sci. Total Environ. 644, 954–962.

Salehinejad, H., Sankar, S., Barfett, J., Colak, E., Valaee, S., 2017. Recent Advances in

Recurrent Neural Networks. arXiv preprint arXiv:1801.01078.

Sari, M., Cosgun, T., Yalcin, I.E., Taner, M., Ozyigit, I.I., 2022. Deciding heavy metal

levels in soil based on various ecological information through artificial intelligence

modeling. Appl. Artif. Intell. 36 (1), 2014189.

Sarker, I.H., 2021. Machine learning: algorithms, real-world applications and research

directions. SN Computer Science 2 (3), 160.

Sergeev, A.P., Buevich, A.G., Baglaeva, E.M., Shichkin, A.V., 2019. Combining spatial

autocorrelation with machine learning increases prediction accuracy of soil heavy

metals. Catena 174, 425–435.

Shaker, R., Tofan, L., Bucur, M., Costache, S., Sava, D., Ehlinger, T., 2010. Land cover and

landscape as predictors of groundwater contamination: a neural-network modelling

approach applied to Dobrogea, Romania. J. Environ. Protect. Ecology 11 (1),

337–348.

Sharma, H.D., Reddy, K.R., 2004. Geoenvironmental Engineering: Site Remediation,

Waste Containment, and Emerging Waste Management Technologies. John Wiley &

Sons, Inc.

Sharma, S., Sharma, S., Athaiya, A., 2017. Activation functions in neural networks. Int. J.

Eng. Appl. Sci. Technol 6 (12), 310–316.

Sherstinsky, A., 2020. Fundamentals of recurrent neural network (RNN) and long short-

term memory (LSTM) network. Phys. Nonlinear Phenom. 404, 132306.

Shi, L., Li, J., Palansooriya, K.N., Chen, Y., et al., 2023. Modeling phytoremediation of

heavy metal contaminated soils through machine learning. J. Hazard Mater. 441,

129904.

Singh, B.K., Naidu, R., 2012. Cleaning contaminated environment: a growing challenge.

Biodegradation 23, 785–786. https://doi.org/10.1007/s10532-012-9590-5.

Singha, S., Pasupuleti, S., Singha, S.S., Kumar, S., 2020. Effectiveness of groundwater

heavy metal pollution indices studies by deep-learning. J. Contam. Hydrol. 235,

103718.

Song, X., Ren, H., Hou, Z., Lin, X., Karanovic, M., Tonkin, M., et al., 2023. Predicting

future well performance for environmental remediation design using deep learning.

J. Hydrol. 617, 129110.

Sprocati, R., Rolle, M., 2021. Integrating process-based reactive transport modeling and

machine learning for electrokinetic remediation of contaminated groundwater.

Water Resour. Res. 57 (8), e2021WR029959.

Sreekanth, J., Datta, B., 2011. Coupled simulation-optimization model for coastal aquifer

management using genetic programming-based ensemble surrogate models and

multiple-realization optimization. Water Resour. Res. 47 (4).

Srivastava, D., Singh, R.M., 2014. Breakthrough curves characterization and

identication of an unknown pollution source in groundwater system using an

articial neural network (ANN). Environ. Forensics 15 (2), 175–189.

Srivastava, D., Singh, R.M., 2015. Groundwater system modeling for simultaneous

identication of pollution sources and parameters with uncertainty characterization.

Water Resour. Manag. 29, 4607–4627.

Stef, P.F., Thirumalaiyammal, B., Anburaj, R., Mishel, P.F., 2022. Articial intelligence

in bioremediation modelling and clean-up of contaminated sites: recent advances,

challenges and opportunities. In: Kumar, V., Thakur, I.S. (Eds.), Omics Insights in

Environmental Bioremediation. Springer, Singapore, pp. 683–702.

Sugeno, M., Kang, G.T., 1988. Structure identification of fuzzy model. Fuzzy Set Syst. 28

(1), 15–33.

Sun, Y., Zhang, Y., Lu, L., Wu, Y., Zhang, Y., Kamran, M.A., Chen, B., 2022. The

application of machine learning methods for prediction of metal immobilization

remediation by biochar amendment in soil. Sci. Total Environ. 829, 154668.

Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: an Introduction. MIT Press.

Svozil, D., Kvasnicka, V., Pospichal, J., 1997. Introduction to multi-layer feed-forward

neural networks. Chemometr. Intell. Lab. Syst. 39 (1), 43–62.

Takagi, T., Sugeno, M., 1985. Fuzzy identification of systems and its applications to

modeling and control. IEEE Transactions on Systems, Man, and Cybernetics (1),

116–132.

Tao, H., Liao, X., Cao, H., Zhao, D., Hou, Y., 2022. Three-dimensional delineation of soil

pollutants at contaminated sites: progress and prospects. J. Geogr. Sci. 32 (8),

1615–1634.

Tarasov, D.A., Buevich, A.G., Sergeev, A.P., Shichkin, A.V., 2018. High variation topsoil

pollution forecasting in the Russian Subarctic: using artificial neural networks

combined with residual kriging. Appl. Geochem. 88, 188–197.

Tariq, S.R., Shah, M.H., Shaheen, N., Jaffar, M., Khalique, A., 2008. Statistical source

identication of metals in groundwater exposed to industrial contamination.

Environ. Monit. Assess. 138, 159–165.

Tut Haklidir, F.S., Haklidir, M., 2020. Prediction of geothermal originated boron

contamination by deep learning approach: at Western Anatolia Geothermal Systems

in Turkey. Environ. Earth Sci. 79, 1–16.

Varley, A., Tyler, A., Smith, L., Dale, P., Davies, M., 2015. Remediating radium

contaminated legacy sites: advances made through machine learning in routine

monitoring of “hot” particles. Sci. Total Environ. 521, 270–279.

Vesselinov, V.V., Alexandrov, B.S., O’Malley, D., 2018. Contaminant source

identication using semi-supervised machine learning. J. Contam. Hydrol. 212,

134–142.

Wang, X., Li, R., Tian, Y., Zhang, B., Zhao, Y., Zhang, T., Liu, C., 2022. A computational

framework for design and optimization of risk-based soil and groundwater

remediation strategies. Processes 10 (12), 2572.

Wijaya, J., Byeon, H., Jung, W., Park, J., Oh, S., 2023. Machine learning modeling using

microbiome data reveal microbial indicator for oil-contaminated groundwater.

J. Water Proc. Eng. 53, 103610.

Xing, Z., Qu, R., Zhao, Y., Fu, Q., Ji, Y., Lu, W., 2019. Identifying the release history of a

groundwater contaminant source based on an ensemble surrogate model. J. Hydrol.

572, 501–516.

Yang, X.S., 2014. Nature-inspired Optimization Algorithms, first ed. Elsevier,

Amsterdam; Boston.

Yang, X.S., He, X., 2016. Nature-inspired optimization algorithms in engineering:

overview and applications. Nature-Inspired Comput. Engin. 1–20.

Yang, S., Taylor, D., Yang, D., He, M., Liu, X., Xu, J., 2021. A synthesis framework using

machine learning and spatial bivariate analysis to identify drivers and hotspots of

heavy metal pollution of agricultural soils. Environ. Pollut. 287, 117611.

Yaseen, Z.M., 2021. An insight into machine learning models era in simulating soil, water

bodies and adsorption heavy metals: review, challenges and solutions. Chemosphere

277, 130126.

Yu, Y., Si, X., Hu, C., Zhang, J., 2019. A review of recurrent neural networks: LSTM cells

and network architectures. Neural Comput. 31 (7), 1235–1270.

Zadeh, L.A., 1973. Outline of a new approach to the analysis of complex systems and

decision processes. IEEE Transact. Systems, Man, and Cybernetics (1), 28–44.

Zhang, H., Yin, A., Yang, X., Fan, M., et al., 2021. Use of machine-learning and receptor

models for prediction and source apportionment of heavy metals in coastal

reclaimed soils. Ecol. Indicat. 122, 107233.

Zhang, H., Yin, S., Chen, Y., Shao, S., et al., 2020. Machine learning-based source

identication and spatial prediction of heavy metals in soil in a rapid urbanization

area, eastern China. J. Clean. Prod. 273, 122858.

Zhang, Y., Ren, M., Tang, Y., Cui, X., Cui, J., Xu, C., et al., 2022. Immobilization on

anionic metal(loid)s in soil by biochar: a meta-analysis assisted by machine

learning. J. Hazard Mater. 438, 129442.

Zhang, Y., Lei, M., Li, K., Ju, T., 2023. Spatial prediction of soil contamination based on

machine learning: a review. Front. Environ. Sci. Eng. 17 (8), 93.

Zhao, Y., Lu, W., Xiao, C., 2016. A Kriging surrogate model coupled in

simulation–optimization approach for identifying release history of groundwater

sources. J. Contam. Hydrol. 185, 51–60.

Zhao, Y., Qu, R., Xing, Z., Lu, W., 2020. Identifying groundwater contaminant sources

based on a KELM surrogate model together with four heuristic optimization

algorithms. Adv. Water Resour. 138, 103540.

Zheng, S., Wang, J., Zhuo, Y., Yang, D., Liu, R., 2022. Spatial distribution model of DEHP

contamination categories in soil based on Bi-LSTM and sparse sampling. Ecotoxicol.

Environ. Saf. 229, 113092.

Zhong, S., Zhang, K., Bagheri, M., Burken, J.G., Gu, A., Li, B., Ma, X., Marrone, B.L.,

Ren, Z.J., Schrier, J., Shi, W., 2021. Machine learning: new ideas and tools in

environmental science and engineering. Environ. Sci. Technol. 55 (19),

12741–12754.

Zhou, Z.H., 2021. Machine Learning. Springer Nature. https://doi.org/10.1007/978-981-15-1967-3.
