Article

Reinforcement Learning: An Introduction

Authors: Richard S. Sutton, Andrew G. Barto

Abstract

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911)

The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome.

The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"


... In contrast, more intricate environments necessitate models with greater representational power and have higher data requirements. Model-based RL (MBRL) (Sutton & Barto, 2018) is hypothesized to be the key for scaling up deep RL agents (LeCun, 2022). Indeed, world models (Ha & Schmidhuber, 2018) offer a diverse range of capabilities: lookahead search (Schrittwieser et al., 2020; Ye et al., 2021), learning in imagination (Sutton, 1991; Hafner et al., 2023), representation learning (Schwarzer et al., 2021; D'Oro et al., 2023), and uncertainty estimation (Pathak et al., 2017; Sekar et al., 2020). ...
... We consider a Partially Observable Markov Decision Process (POMDP) (Sutton & Barto, 2018). The transition, reward, and episode termination dynamics are captured by the conditional distributions p(x_{t+1} | x_{≤t}, a_{≤t}) and p(r_t, d_t | x_{≤t}, a_{≤t}), where x_t ∈ X = ℝ^{3×h×w} is an image observation, a_t ∈ A = {1, . . . ...
... Learning in imagination (Sutton, 1991; Sutton & Barto, 2018) consists of 3 stages that are repeated alternately: experience collection, world model learning, and policy improvement. Strikingly, the agent learns behaviours purely within its world model, and real experience is only leveraged to learn the environment dynamics. ...
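The three alternating stages can be sketched as a training loop; the `env`, `world_model`, `policy`, and `replay_buffer` interfaces below are hypothetical placeholders for illustration, not the architecture of the preprint below.

```python
# Skeleton of "learning in imagination" (hypothetical interfaces): the real
# environment is used only for data collection; the policy is improved on
# imagined rollouts alone.

def train_in_imagination(env, world_model, policy, replay_buffer,
                         iterations=1000, horizon=15):
    for _ in range(iterations):
        # 1) Experience collection in the real environment.
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs)
            next_obs, reward, done = env.step(action)
            replay_buffer.add(obs, action, reward, next_obs, done)
            obs = next_obs

        # 2) World-model learning on batches of real transitions.
        world_model.update(replay_buffer.sample())

        # 3) Policy improvement purely inside the world model.
        start_states = world_model.encode(replay_buffer.sample())
        imagined_rollouts = world_model.imagine(policy, start_states, horizon)
        policy.update(imagined_rollouts)
```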
Preprint
Full-text available
Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
... Another related piece of work is [31] where a classic Q-learning agent is first trained on a powerful computer and its Q-table then transferred to the embedded device. The solution that the authors offer is limited to that one algorithm and does not support NNs, i.e. it does not allow for deep RL [32]. Furthermore, it does not come with a mechanism to automatically and repeatedly replace the agent on the target system. ...
... The experiences that are generated in the process are collected and utilised to update the agent's policy such that the cumulated reward it receives for its behaviour is maximised. [32] The mathematical foundation of the environment is a time-discrete Markov decision process (MDP) defined by the four-tuple (S, A, P, R), that is ...
... with a learning rate η ∈ ℝ. The policy is typically implemented as an NN in which case the parameter set θ consists of its weights and biases. [32] There are three key metrics to quantify how well an agent fares in a certain situation: The value function (VF) V^π_θ (4) estimates the return at a state s ∈ S when acting on-policy (i.e. when choosing actions according to the current policy) from there on. Similarly, the action-value function or Q-function Q^π_θ (5) estimates the return when taking an action a ∈ A at a state s ∈ S on the assumption that all following actions are on-policy. ...
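For reference, these two quantities have standard textbook definitions (restated here in Sutton & Barto's notation; the cited paper's equations (4) and (5) may differ in details such as a finite horizon):

$$V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right], \qquad Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\!\left[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]$$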
Article
Full-text available
Advances in artificial intelligence (AI) have led to its application in many areas of everyday life. In the context of control engineering, reinforcement learning (RL) represents a particularly promising approach as it is centred around the idea of allowing an agent to freely interact with its environment to find an optimal strategy. One of the challenges professionals face when training and deploying RL agents is that the latter often have to run on dedicated embedded devices. This could be to integrate them into an existing toolchain or to satisfy certain performance criteria like real-time constraints. Conventional RL libraries, however, cannot be easily utilised in conjunction with that kind of hardware. In this paper, we present a framework named LExCI, the Learning and Experiencing Cycle Interface, which bridges this gap and provides end-users with a free and open-source tool for training agents on embedded systems using the open-source library RLlib. Its operability is demonstrated with two state-of-the-art RL-algorithms and a rapid control prototyping system.
... Reinforcement Learning (RL) [11] stands out as a recognised self-learning paradigm in robotics [7,5]. RL enables robots to learn optimal behaviours by interacting with the environment and receiving feedback in the form of rewards or penalties. ...
... We employ a self-learning algorithm based on RL to dynamically select paths for each time step and aim to minimize the overall travelling distances, as explained in Eq. 2. The foundation of the RL model is structured as a Markov decision process [11], denoted as Υ(A, R_a(s, s′), S, P_a(s, s′)). A represents the set of possible actions, S describes the agent states within the environment, P_a(s, s′) defines the transition probability from state s to state s′ under action a at each time step t, and R_a(s, s′) denotes the reward or punishment received after transitioning from state s to state s′ under action a. ...
... where m_0 represents the initial value of m-step, and λ_t denotes the decay rate parameter. This approach is to a certain extent similar to the exploration and exploitation relation in vanilla RL [11]. By regulating the sampling rate from OR solvers using this exponential decay approach, we optimize the efficacy of knowledge transfer. ...
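The schedule itself is not reproduced in this excerpt; the sketch below shows one plausible exponential form, m(t) = m_0 · exp(−λ t), with purely illustrative parameter values rather than those from the cited paper.

```python
import math

def solver_sampling_rate(t, m0=1.0, decay=0.01):
    """Probability of sampling the next action from the OR solver at step t.
    The exponential form m0 * exp(-decay * t) and the parameter values are
    illustrative assumptions; the cited paper defines its own schedule."""
    return m0 * math.exp(-decay * t)

# Early in training the agent imitates the solver often; later it relies
# increasingly on its own learned policy, much like a decaying epsilon.
print([round(solver_sampling_rate(t), 3) for t in (0, 100, 500)])
```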
Article
Full-text available
There is a growing interest in implementing artificial intelligence for operations research in the industrial environment. While numerous classic operations research solvers ensure optimal solutions, they often struggle with real-time dynamic objectives and environments, such as dynamic routing problems, which require periodic algorithmic recalibration. To deal with dynamic environments, deep reinforcement learning has shown great potential with its capability as a self-learning and optimizing mechanism. However, the real-world applications of reinforcement learning are relatively limited due to lengthy training time and inefficiency in high-dimensional state spaces. In this study, we introduce two methods to enhance reinforcement learning for dynamic routing optimization. The first method involves transferring knowledge from classic operations research solvers to reinforcement learning during training, which accelerates exploration and reduces lengthy training time. The second method uses a state-space decomposer to transform the high-dimensional state space into a low-dimensional latent space, which allows the reinforcement learning agent to learn efficiently in the latent space. Lastly, we demonstrate the applicability of our approach in an industrial application of an automated welding process, where our approach identifies the shortest welding pathway of an industrial robotic arm to weld a set of dynamically changing target nodes, poses and sizes. The suggested method cuts computation time by 25% to 50% compared to classic routing algorithms.
... While the undiscounted cost function is always considered in optimal control, such as the linear quadratic regulator (LQR) and model predictive control (MPC), the discounted setting is more popular in RL [37], meaning that the stage costs are weighted by an exponentially decreasing term with respect to the discount factor. The discount factor can be viewed as a trade-off between short-term and long-term returns [37]. Consequently, discount factors with different values will result in control policies with different performance. ...
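The discounted objective being contrasted with the undiscounted LQR/MPC cost can be computed directly; the snippet below is a generic illustration of how the discount factor trades off short-term against long-term returns, with made-up reward values.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k: the discounted objective used in RL, in contrast
    to the undiscounted cost typically used in LQR/MPC formulations."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 20                            # made-up constant rewards
print(discounted_return(rewards, gamma=0.5))    # short-sighted: ~2.0
print(discounted_return(rewards, gamma=0.99))   # far-sighted: ~18.2
```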
Article
Full-text available
In this paper, the stability of nonlinear systems under infinite-horizon discounted optimal control via adaptive dynamic programming method is analyzed. First, considering the adoption of function approximators during value iteration (VI), the iterative value function and control policy are shown to be continuous. Then, based on a verifiable condition on the approximation errors caused by the critic network, it is proved that the approximate value functions are bounded and positive definite. Further in the stability analysis, a stability condition as the termination criterion of approximate VI (AVI) is developed, which guarantees that the control policy derived from the obtained critic network makes the controlled system \(\mathcal{K}\mathcal{L}\)-stable. Also, an upper bound function of the approximation errors caused by the action network is derived for ensuring that the system controlled by the trained action network remains \(\mathcal{K}\mathcal{L}\)-stable. The \(\mathcal{K}\mathcal{L}\)-stability of the closed-loop system is established by using the approximate value function to act as the Lyapunov function and estimate the region of attraction. Finally, the present theoretical results are applied to the simulation studies of the spacecraft rendezvous.
... RL is a reward-based method that relies on interaction between an agent and a given environment to maximise the numerical reward [27]. As a result of environmental feedback, the agent learns its behaviour and then strives to improve its actions. ...
... As a part of the mechanism of the agent's engagement with the environment, the ε-greedy strategy is used to make sure the agent maintains a balance between exploration (by taking random action a) and exploitation (action a with arg max_a Q(s_t, a)) and generates an assortment of experiences as it interacts with the environment [27]. The experience samples are stored in a database. ...
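A minimal sketch of the ε-greedy rule and experience storage described here (illustrative names and values, not the cited system's implementation):

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon take a random action (exploration),
    otherwise take argmax_a Q(s, a) (exploitation).
    q_values maps (state, action) pairs to estimated returns."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

experience_buffer = []   # stands in for the database of experience samples

# Toy usage: two actions, with a slight preference already learned for action 1.
q = {("s0", 0): 0.2, ("s0", 1): 0.5}
a = epsilon_greedy(q, "s0", actions=[0, 1], epsilon=0.1)
experience_buffer.append(("s0", a, 1.0, "s1"))   # (state, action, reward, next state)
```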
Article
Full-text available
To gain access to networks, various intrusion attack types have been developed and enhanced. The increasing importance of computer networks in daily life is a result of our growing dependence on them. Given this, it is glaringly obvious that algorithmic tools with strong detection performance and dependability are required for a variety of attack types. The objective is to develop a system for intrusion detection based on deep reinforcement learning. On the basis of the Markov decision procedure, the developed system can construct patterns appropriate for classification purposes based on extensive amounts of informative records. Deep Q‐Learning (DQL), Soft DQL, Double DQL, and Soft double DQL are examined from two perspectives. An evaluation of the authors’ methods using UNSW‐NB15 data demonstrates their superiority regarding accuracy, precision, recall, and F1 score. The validity of the model trained on the UNSW‐NB15 dataset was also checked using the BoT‐IoT and ToN‐IoT datasets, yielding competitive results.
... Reinforcement Learning [45] is a field of machine learning which consists in training an intelligent agent to make decisions in a dynamic environment in order to reach a goal. In this process, the agent receives rewards or penalties based on its actions, and tries to learn a policy that maximizes the cumulative sum of rewards received over time. ...
... actions that it predicts to yield the highest reward based on its current knowledge) as it gets close to 1. Designing reward signals is a cutting-edge research area, and this may not be the best reward function for this particular problem. The intuition for this choice came from Section 17.4 of [45]; also, of the several reward functions tested experimentally, this one led to the highest performance in practice. ...
Preprint
Optical aberrations prevent telescopes from reaching their theoretical diffraction limit. Once estimated, these aberrations can be compensated for using deformable mirrors in a closed loop. Focal plane wavefront sensing enables the estimation of the aberrations on the complete optical path, directly from the images taken by the scientific sensor. However, current focal plane wavefront sensing methods rely on physical models whose inaccuracies may limit the overall performance of the correction. The aim of this study is to develop a data-driven method using model-free reinforcement learning to automatically perform the estimation and correction of the aberrations, using only phase diversity images acquired around the focal plane as inputs. We formulate the correction problem within the framework of reinforcement learning and train an agent on simulated data. We show that the method is able to reliably learn an efficient control strategy for various realistic conditions. Our method also demonstrates robustness to a wide range of noise levels.
... Computational modeling of task behavior focuses on differences in the first free choice between H1 and H6 conditions and how this is affected by equal vs. unequal information (Wilson et al., 2014). Based on the reinforcement learning (RL) framework (Sutton and Barto, 2018), the model learns expected rewards for each option based on the forced-choice outcomes and then uses these learned expectations to select the first free choice. Formally, the value of a given bandit option is computed as follows: ...
... When comparing choice behavior, they found that, relative to the other two groups, the propranolol group was also more likely to choose the option with the higher average reward value, and that their choice patterns were generally more consistent. A range of models with different assumptions were fit to the choice data, with results indicating group differences in an "ε-greedy" model parameter that captures pure choice randomness, irrespective of expected value and uncertainty (Sutton and Barto, 2018). In particular, participants in the propranolol group showed lower values of this parameter, suggesting that reductions in random exploration were caused by lowering norepinephrine levels. ...
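A toy version of this kind of model, with a delta-rule value update and an ε parameter capturing pure choice randomness (illustrative parameter values, not the fitted models from these studies), might look like:

```python
import random

def update_expected_reward(value, outcome, learning_rate=0.3):
    """Delta-rule update: move the learned expectation toward the observed
    forced-choice outcome. The learning rate here is illustrative."""
    return value + learning_rate * (outcome - value)

def choose(values, epsilon=0.1):
    """epsilon-greedy choice: epsilon captures pure choice randomness,
    irrespective of the options' expected values."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

values = [0.0, 0.0]                    # expected rewards for two bandit options
for outcome in (1.0, 0.0, 1.0, 1.0):   # forced-choice outcomes for option 0
    values[0] = update_expected_reward(values[0], outcome)
print(values, choose(values))
```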
Preprint
Full-text available
Exploratory behaviors can serve an adaptive role within novel or changing environments. Namely, they facilitate information gain, allowing an organism to maintain accurate beliefs about the environment and select actions that better maximize reward. However, finding the optimal balance between exploration and reward-seeking behavior – the so-called explore-exploit dilemma – can be challenging, as it requires sensitivity to one’s own uncertainty and to the predictability of one’s surroundings. Here, we review computational modeling studies investigating how exploration is influenced by anxiety. While some apparent inconsistencies remain to be resolved, studies using reinforcement learning tasks suggest that directed (but not random) forms of exploration may be elevated by trait and/or cognitive anxiety, but reduced by state and/or somatic anxiety. Anxiety is also consistently associated with less exploration in foraging tasks. Some differences in exploration may further stem from how anxiety modulates changes in uncertainty over time (learning rates). Jointly, these results highlight important directions for future work in refining choice of tasks and anxiety measures and maintaining consistent methodology across studies.
... These features have been previously extracted and all-zero Q-table Q_{s(t),m(t)}. 4: Obtain π*_{[2]}(·) & Q*_{[2]}(·) by solving (4) using Q-learning [53]. 5: Compute V*_{[2],oi(t)} following eq. ...
... The term exact reinforcement learning, as opposed to approximate reinforcement learning, refers to the tabular methods where the exact value of an action-state pair is computed [53]. ...
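For concreteness, the tabular ("exact") Q-learning update referred to here can be sketched on a toy problem as follows; the tiny chain environment and hyperparameters are invented for illustration only.

```python
from collections import defaultdict
import random

# Tabular Q-learning on a toy 3-state chain (a generic sketch of the textbook
# update, not the cited paper's setup). State 2 is terminal and rewarding.
actions = [0, 1]                      # 0: stay, 1: move right
Q = defaultdict(float)                # Q[(state, action)], initialised to zero

def step(s, a):
    s_next = min(s + a, 2)
    r = 1.0 if s_next == 2 else 0.0
    return s_next, r, s_next == 2

alpha, gamma, epsilon = 0.5, 0.9, 0.1
for episode in range(200):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Q-learning update
        s = s_next

print({k: round(v, 2) for k, v in Q.items()})
```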
Article
Full-text available
With countless promising applications in various domains such as IoT and Industry 4.0, task-oriented communication design (TOCD) is getting accelerated attention from the research community. This paper presents a novel approach for designing scalable task-oriented quantization and communications in cooperative multi-agent systems (MAS). The proposed approach utilizes the TOCD framework and the value of information (VoI) concept to enable efficient communication of quantized observations among agents while maximizing the average return performance of the MAS, a parameter that quantifies the MAS’s task effectiveness. The computational complexity of learning the VoI, however, grows exponentially with the number of agents. Thus, we propose a three-step framework: (i) learning the VoI (using reinforcement learning (RL)) for a two-agent system, (ii) designing the quantization policy for an N-agent MAS using the learned VoI for a range of bit-budgets and, (iii) learning the agents’ control policies using RL while following the designed quantization policies in the earlier step. Our analytical results show the applicability of the proposed framework under a wide range of problems. Numerical results show striking improvements in reducing the computational complexity of obtaining VoI needed for the TOCD in a MAS problem without compromising the average return performance of the MAS.
... The agent refines its policy through trial and error to achieve optimal outcomes. Figure adapted from [51]. ...
Article
Full-text available
Global Navigation Satellite Systems (GNSS)-based positioning plays a crucial role in various applications, including navigation, transportation, logistics, mapping, and emergency services. Traditional GNSS positioning methods are model-based, utilizing satellite geometry and the known properties of satellite signals. However, model-based methods have limitations in challenging environments and often lack adaptability to uncertain noise models. This paper highlights recent advances in machine learning (ML) and its potential to address these limitations. It covers a broad range of ML methods, including supervised learning, unsupervised learning, deep learning, and hybrid approaches. The survey provides insights into positioning applications related to GNSS, such as signal analysis, anomaly detection, multi-sensor integration, prediction, and accuracy enhancement using ML. It discusses the strengths, limitations, and challenges of current ML-based approaches for GNSS positioning, providing a comprehensive overview of the field.
... This technique involves gathering initial experiences to stabilize the training process of actor and critic networks [61]. This study aims to create a buffer of valuable experiences by allowing the agent to explore the environment and collect a set of initial experiences through random actions independent of the current policy. ...
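A minimal sketch of such a warm-up phase, assuming a simple reset/step environment interface (an assumption for illustration, not the cited paper's actual code):

```python
import random

def warm_up_buffer(env, buffer, n_steps=1000):
    """Fill a replay buffer with transitions collected by purely random
    actions, independent of the current policy, so that early actor/critic
    updates see diverse data. The env interface (reset/step/actions) is an
    illustrative assumption."""
    obs = env.reset()
    for _ in range(n_steps):
        action = random.choice(env.actions)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return buffer
```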
Article
Full-text available
An intelligent reflecting surface (IRS) is a promising technology for future wireless communication. It comprises many hardware-efficient passive elements. The applications of unmanned aerial vehicles (UAVs) have expanded beyond military missions owing to their mobility, maneuverability, flexibility, ease of deployment, and cost-effectiveness. Combining IRS with UAVs, known as UAV-mounted IRSs (IRS-UAV), has gained significant attention owing to the unique advantages offered by both technologies. Wireless communication systems face critical challenges in physical layer security, particularly cell-free massive multiple-input multiple-output (MIMO) systems (CFMM). This study investigated physical layer security (PLS) in an IRS-UAV-assisted CFMM, involving multiple IRS-UAVs, access points, users, and passive eavesdroppers. To maximize the average secrecy downlink rate, this study proposes an optimization algorithm using deep reinforcement learning based on a deep deterministic policy gradient (DDPG) that achieves at least one locally optimal solution. However, this approach results in a relatively high computational complexity. A second approach is introduced to address this: an alternating optimization algorithm combining inner approximation (IA) methods and an advanced DDPG algorithm with a warm-up technique. The simulation results demonstrated the efficiency of both approaches in resolving complex optimization problems. Furthermore, the numerical findings confirmed that the proposed alternating optimization algorithm exhibited competitive performance and significantly reduced computational complexity compared with the DDPG-based approach.
... Deep reinforcement learning (DRL) offers a promising paradigm where agents learn and evolve based on their interactions with the environment, making decisions that optimize a specific objective function over time [1]. DRL combines reinforcement learning with deep neural networks to approximate the value function (or action-value function) and policy [2]. ...
Article
Full-text available
Trained deep reinforcement learning (DRL) based controllers can effectively control dynamic systems where classical controllers can be ineffective and difficult to tune. However, the lack of closed‐loop stability guarantees of systems controlled by trained DRL agents hinders their adoption in practical applications. This research study investigates the closed‐loop stability of dynamic systems controlled by trained DRL agents using Lyapunov analysis based on a linear‐quadratic polynomial approximation of the trained agent. In addition, this work develops an understanding of the system's stability margin to determine operational boundaries and critical thresholds of the system's physical parameters for effective operation. The proposed analysis is verified on a DRL‐controlled system for several simulated and experimental scenarios. The DRL agent is trained using a detailed dynamic model of a non‐linear system and then tested on the corresponding real‐world hardware platform without any fine‐tuning. Experiments are conducted on a wide range of system states and physical parameters and the results have confirmed the validity of the proposed stability analysis (https://youtu.be/QlpeD5sTlPU).
... Policy gradient methods (PG) are a family of reinforcement learning algorithms (Sutton & Barto, 2018) that are particularly suited for large-scale decision and control problems. Recent applications of PG algorithms range from videogames (Wurman et al., 2022), to robotics (Rudin et al., 2021), to large language models (OpenAI, 2023;Ahmadian et al., 2024), where fine-tuning of the LLM with these methods shows a concrete application within the current conversational artificial intelligence boom, and the potential impact of accelerated versions of these methods. ...
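As a concrete toy instance of the policy-gradient family, the following self-contained REINFORCE example learns a softmax policy on a three-armed bandit; all values are illustrative and it does not correspond to any of the cited systems.

```python
import numpy as np

# Toy REINFORCE on a 3-armed bandit with a softmax policy.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])    # illustrative reward means
theta = np.zeros(3)                        # one policy parameter per arm
alpha, baseline = 0.1, 0.0

for step in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()                   # softmax policy pi(a)
    a = rng.choice(3, p=probs)
    r = rng.normal(true_means[a], 0.1)
    baseline += 0.01 * (r - baseline)      # running baseline reduces variance
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                  # d log pi(a) / d theta for a softmax policy
    theta += alpha * (r - baseline) * grad_log_pi

print(np.round(probs, 3))                  # probability mass concentrates on arm 2
```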
Article
Full-text available
Several variance-reduced versions of REINFORCE based on importance sampling achieve an improved $O(\epsilon^{-3})$ sample complexity to find an $\epsilon$-stationary point, under an unrealistic assumption on the variance of the importance weights. In this paper, we propose the Defensive Policy Gradient (DEF-PG) algorithm, based on defensive importance sampling, achieving the same result without any assumption on the variance of the importance weights. We also show that this is not improvable by establishing a matching $\Omega(\epsilon^{-3})$ lower bound, and that REINFORCE with its $O(\epsilon^{-4})$ sample complexity is actually optimal under weaker assumptions on the policy class. Numerical simulations show promising results for the proposed technique compared to similar algorithms based on vanilla importance sampling.
... In a bandit framework, an agent is faced with a series of decisions, each of which has a potential reward. The goal is to make a series of decisions that maximize the total cumulative reward [Sutton and Barto, 2018, Russo et al., 2018]. When applied to credit scoring and underwriting, an agent (or lender) sequentially decides which borrowers to underwrite. ...
Preprint
This paper proposes a novel reinforcement learning (RL) framework for credit underwriting that tackles ungeneralizable contextual challenges. We adapt RL principles for credit scoring, incorporating action space renewal and multi-choice actions. Our work demonstrates that the traditional underwriting approach aligns with the RL greedy strategy. We introduce two new RL-based credit underwriting algorithms to enable more informed decision-making. Simulations show these new approaches outperform the traditional method in scenarios where the data aligns with the model. However, complex situations highlight model limitations, emphasizing the importance of powerful machine learning models for optimal performance. Future research directions include exploring more sophisticated models alongside efficient exploration mechanisms.
... [21]. Reinforcement Learning is a form of machine learning different from supervised and unsupervised learning, where an intelligent agent learns to take a series of actions by maximising a cumulative reward in order to extrapolate or generalise to situations not present in the training set [28]. While there are many approaches to solving reinforcement learning problems, this study mainly focuses on using policy gradient methods that perform better in continuous action spaces. ...
Preprint
Full-text available
Planetary exploration requires traversal in environments with rugged terrains. In addition, Mars rovers and other planetary exploration robots often carry sensitive scientific experiments and components onboard, which must be protected from mechanical harm. This paper deals with an active suspension system focused on chassis stabilisation and an efficient traversal method while encountering unavoidable obstacles. Soft Actor-Critic (SAC) was applied along with Proportional Integral Derivative (PID) control to stabilise the chassis and traverse large obstacles at low speeds. The model uses the rover's distance from surrounding obstacles, the height of the obstacle, and the chassis' orientation to actuate the control links of the suspension accurately. Simulations carried out in the Gazebo environment are used to validate the proposed active system.
... This is a type of machine learning algorithm that learns by interacting with its environment. The algorithm learns to take actions that maximize a reward signal [21]. The applications of Reinforcement Learning in Urban Drainage System Modeling include: ...
Article
Urban drainage systems may derive advantages from the implementation of machine learning methods for decision-making and cleansing operations. Conventional decision support systems are rendered ineffective in tackling the intricate and indeterminate aspects of urban planning concerns. It aims to improve model evolution and use while facilitating simpler access using suggested open sources provided in this research. This paper provides an overview of machine learning methods applied in modeling urban drainage systems, and it compares the proposed methods among different machine learning models, which are classified into five distinct approaches: supervised learning, unsupervised learning, deep learning, reinforcement learning, and finally hybrid approaches combining two or more previous algorithms. This study also explores diverse datasets and open sources related to modelling in urban drainage systems that researchers can utilize in their scientific investigations. From the study it can be concluded that the choice of machine learning for urban irrigation systems depends on the specific goals and the nature of the problem. Additionally, studies demonstrate that hybrid and deep learning approaches can solve problems with urban irrigation systems, increase system performance, and yield correct results. Accurately interpreting data and successfully resolving urban irrigation systems' problems are the goals of deep learning. Furthermore, hybrid methods facilitate advances in model development through the integration of mathematical and machine learning models. This study helps researchers improve models and promote their applications to prevent natural disasters.
... RL, a subset of machine learning, involves an agent that learns to make decisions through interactions with its environment, aiming to maximize a numerical reward. The agent executes actions, receives feedback in the form of rewards, and modifies its strategies to optimize the cumulative rewards (Andrew, 1998; Lin et al., 2024; Liu et al., 2023). This adaptive learning process allows the agent to progressively refine its decision-making policy and enhance the prediction accuracy of the system as new data becomes evident. ...
Article
This paper presents a novel Artificial Intelligence (AI)-driven tool designed to convert deflection test results into crucial soil parameters essential for quality assurance in compaction projects. The accurate determination of these parameters, such as density and void ratio, is imperative for ensuring the structural integrity of infrastructures constructed on such soils. Moreover, it facilitates the utilization of modern non-destructive equipment in compaction endeavors. The determination of these parameters is notably challenging in unsaturated soils owing to the intricate interplay among factors such as suction, moisture content, void ratio, and resulting deflection. This paper presents a pioneering tool to address these challenges. By integrating unsaturated soil mechanics with advanced AI techniques, particularly reinforcement learning, the tool leverages a diverse array of inputs, including in-situ data, experimental observations, and physics-based modeling. This integration enables dynamic adaptation to changing field conditions and ensures the tool's real-time adaptability and predictive accuracy. Field trials validated the tool's efficacy in predicting soil properties accurately without direct measurements of moisture content or suction, variables often unmeasured in practical soil compaction projects. This unique capability underscores significant advancements in real-time assessment of unsaturated soils, illustrating the transformative potential of AI in geotechnical engineering and unsaturated soil mechanics.
... Even though there were a large number of reports on the application of RL and related approaches for Go [38−42], it is only with AlphaGo [2] that deep NNs were adopted to establish the value networks to achieve high evaluation accuracy. Position evaluation [39,40,43,44] and deep learning [45,46] have been applied to programs to play the game of Go; however, none of them realized the level of success of AlphaGo [2]. The success of AlphaGo has a far-reaching impact on the research in AI. ...
Article
Full-text available
This article introduces the state-of-the-art development of adaptive dynamic programming and reinforcement learning (ADPRL). First, algorithms in reinforcement learning (RL) are introduced and their roots in dynamic programming are illustrated. Adaptive dynamic programming (ADP) is then introduced following a brief discussion of dynamic programming. Researchers in ADP and RL have enjoyed the fast developments of the past decade from algorithms, to convergence and optimality analyses, and to stability results. Several key steps in the recent theoretical developments of ADPRL are mentioned with some future perspectives. In particular, convergence and optimality results of value iteration and policy iteration are reviewed, followed by an introduction to the most recent results on stability analysis of value iteration algorithms.
... One way to learn the function Q⋆ is to use the Q-learning algorithm (Watkins & Dayan, 1992; Sutton & Barto, 2018) and its variants for continuous environment states, such as Deep Q-learning (Mnih et al., 2015). ...
Preprint
Full-text available
In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the-art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see https://github.com/ML-EDM/ml_edm).
... In this work, we learn the policy with actor-critic algorithm [47]. This algorithm jointly trains the higher-level policy function π(g|s) and the value function V (s). ...
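One common one-step form of such a joint actor-critic update (a generic textbook rule in the spirit of Sutton & Barto, not necessarily the exact variant used in [47]) is:

$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad w \leftarrow w + \alpha_v\, \delta_t\, \nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha_\pi\, \delta_t\, \nabla_\theta \log \pi_\theta(g_t \mid s_t)$$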
Preprint
Full-text available
Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of a pathology image, leading to sub-optimal recovery due to the large variation among pathology images. In this paper, we propose the first hierarchical reinforcement learning framework, named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly for addressing the aforementioned issues in the pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt the hierarchical recovery mechanism at patch level to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processing problem. Under the guidance of spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL validates the promotion in tumor diagnosis with a large margin and shows generalizability under various degradations. The source code is available at https://github.com/CUHK-AIM-Group/STAR-RL.
... For estimating the value in EDM, we used the reinforcement learning (RL) model (Sutton & Barto, 1998; Watkins & Dayan, 1992), which has been widely used to explain the value-learning mechanism in EDM (Daw & Doya, 2006; Dayan & Abbott, 2005; Dayan & Balleine, 2002; Gläscher et al., 2010; Schönberg et al., 2007). The value of the chosen option is updated based on the difference between the expected value of the item and the actual feedback (Behrens et al., 2007; Biele et al., 2011; Gluth et al., 2014; Hauser et al., 2015; Katahira et al., 2011; Lindström et al., 2014; O'Doherty et al., 2007). ...
Article
Full-text available
All humans must engage in decision-making. Decision-making processes can be broadly classified into internally guided decision-making (IDM), which is determined by individuals’ internal value criteria, such as preference, or externally guided decision-making (EDM), which is determined by environmental external value criteria, such as monetary rewards. However, real-life decisions are never made simply using one kind of decision-making, and the relationship between IDM and EDM remains unclear. This study had individuals perform gambling tasks requiring the EDM using stimuli that formed preferences through the preference judgment task as the IDM. Computational model analysis revealed that strong preferences in the IDM affected initial choice behavior in the EDM. Moreover, through the analysis of the subjective preference evaluation after the gambling tasks, we found that even when stimuli that were preferred in the IDM were perceived as less valuable in the EDM, the preference for IDM was maintained after EDM. These results indicate that although internal criteria, such as preferences, influence EDM, the results show that internal and external criteria differ.
... Reinforcement learning (Sutton & Barto, 1998), a class of algorithms for learning what actions to take based on discrete outcomes in the form of rewards and punishment, offers a very useful framework for tackling this question, because it explicitly separates the learning mechanism, controlled by a learning rate parameter, from the decision-making process, typically modelled as a softmax rule (Daw et al., 2006) controlled by a parameter called the inverse temperature. Although the two parameters are still correlated, so that increasing one can be partly compensated for by decreasing the other, this compensation is not a strict equivalence. ...
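A small sketch of this separation between learning and decision-making, with a delta-rule learning step and a softmax choice rule governed by an inverse temperature (illustrative values only, not the fitted models discussed here):

```python
import numpy as np

def softmax_choice_probs(values, inverse_temperature):
    """Decision process: a higher inverse temperature makes the choice of the
    highest-valued option more deterministic."""
    z = inverse_temperature * np.asarray(values, dtype=float)
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def value_update(value, reward, learning_rate):
    """Learning mechanism: the learning rate controls how quickly the value
    estimate tracks new outcomes, separately from the choice rule above."""
    return value + learning_rate * (reward - value)

print(softmax_choice_probs([0.2, 0.8], inverse_temperature=1.0))
print(softmax_choice_probs([0.2, 0.8], inverse_temperature=10.0))
```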
Article
Full-text available
In uncertain environments in which resources fluctuate continuously, animals must permanently decide whether to stabilise learning and exploit what they currently believe to be their best option, or instead explore potential alternatives and learn fast from new observations. While such a trade-off has been extensively studied in pretrained animals facing non-stationary decision-making tasks, it is yet unknown how they progressively tune it while learning the task structure during pretraining. Here, we compared the ability of different computational models to account for long-term changes in the behaviour of 24 rats while they learned to choose a rewarded lever in a three-armed bandit task across 24 days of pretraining. We found that the day-by-day evolution of rat performance and win-shift tendency revealed a progressive stabilisation of the way they regulated reinforcement learning parameters. We successfully captured these behavioural adaptations using a meta-learning model in which either the learning rate or the inverse temperature was controlled by the average reward rate. Keywords: decision-making, dopamine, exploration-exploitation trade-off, meta-learning
... As mentioned above, the limitations of the existing studies can be summarized as follows. Firstly, although many studies have considered one of the three factors of following distance [22,23], energy loss [24], and whether the prediction space is continuous or not [25], few studies have considered all three, failing to fully meet the complex driving environment and diversified optimization needs. Furthermore, previous research has predominantly focused on methods using RL with discrete action planning, while RL algorithms in continuous action spaces have rarely been applied to the cruise control of intelligently connected EVs. ...
Article
Full-text available
Eco-driving aims to enhance vehicle efficiency by optimizing speed profiles and driving patterns. However, ensuring safe following distances during eco-driving can lead to excessive use of lithium-ion batteries (LIBs), causing accelerated battery wear and potential safety concerns. This study addresses this issue by proposing a novel, multi-physics-constrained cruise control strategy for intelligently connected electric vehicles (EVs) using deep reinforcement learning (DRL). Integrating a DRL framework with an electrothermal model to estimate unmeasurable states, this strategy simultaneously manages battery degradation and thermal safety while maintaining safe following distances. Results from hardware-in-the-loop simulation testing demonstrated that this approach reduced overall driving costs by 18.72%, decreased battery temperatures by 4 °C to 8 °C in high-temperature environments, and reduced state-of-health (SOH) degradation by up to 46.43%. These findings highlight the strategy’s superiority in convergence efficiency, battery thermal safety, and cost reduction compared to existing methods. This research contributes to the advancement of eco-driving practices, ensuring both vehicle efficiency and battery longevity.
... The Actor network selects the action based on the probability, and the Critic network gives the value based on the action, thus speeding up the learning process. The core of the RL algorithm is the Markov Decision Process (MDP), which is introduced in ref. [24]: ...
Article
Full-text available
Path planning and obstacle avoidance are fundamental problems in unmanned ground vehicle path planning. Aiming at the limitations of Deep Reinforcement Learning (DRL) algorithms in unmanned ground vehicle path planning, such as low sampling rate, insufficient exploration, and unstable training, this paper proposes an improved algorithm called Dual Priority Experience and Ornstein–Uhlenbeck Soft Actor-Critic (DPEOU-SAC) based on Ornstein–Uhlenbeck (OU noise) and double-factor prioritized sampling experience replay (DPE) with the introduction of expert experience, which is used to help the agent achieve faster and better path planning and obstacle avoidance. Firstly, OU noise enhances the agent’s action selection quality through temporal correlation, thereby improving the agent’s detection performance in complex unknown environments. Meanwhile, the experience replay is based on double-factor preferential sampling, which has better sample continuity and sample utilization. Then, the introduced expert experience can help the agent to find the optimal path with faster training speed and avoid falling into a local optimum, thus achieving stable training. Finally, the proposed DPEOU-SAC algorithm is tested against other deep reinforcement learning algorithms in four different simulation environments. The experimental results show that the convergence speed of DPEOU-SAC is 88.99% higher than the traditional SAC algorithm, and the shortest path length of DPEOU-SAC is 27.24, which is shorter than that of SAC.
... Consistent with this contemporary perspective, recent work implicates several regions of the DMN in organizing different modes of behavior over time. For instance, DMN areas such as medial frontal cortex and posteromedial cortex appear to play an important role in shifting between information gathering versus information exploitation during reward-guided decision-making tasks (Barack et al., 2017;Foster et al., 2023;Pearson et al., 2011;Pearson et al., 2009;Schuck et al., 2015;Trudel et al., 2021)-the so-called explore/exploit trade-off (Frank et al., 2009;Sutton and Barto, 2018). Consistent with this, recent studies have argued that broad features of the DMN's activity can be explained under the auspices that it supports behavior under conditions in which performance depends on knowledge accrued across several trials rather than by immediate sensory inputs (Hayden et al., 2008;Murphy et al., 2019;Murphy et al., 2018;Vatansever et al., 2017). ...
Article
Full-text available
Adaptive motor behavior depends on the coordinated activity of multiple neural systems distributed across the brain. While the role of sensorimotor cortex in motor learning has been well established, how higher-order brain systems interact with sensorimotor cortex to guide learning is less well understood. Using functional MRI, we examined human brain activity during a reward-based motor task where subjects learned to shape their hand trajectories through reinforcement feedback. We projected patterns of cortical and striatal functional connectivity onto a low-dimensional manifold space and examined how regions expanded and contracted along the manifold during learning. During early learning, we found that several sensorimotor areas in the dorsal attention network exhibited increased covariance with areas of the salience/ventral attention network and reduced covariance with areas of the default mode network (DMN). During late learning, these effects reversed, with sensorimotor areas now exhibiting increased covariance with DMN areas. However, areas in posteromedial cortex showed the opposite pattern across learning phases, with its connectivity suggesting a role in coordinating activity across different networks over time. Our results establish the neural changes that support reward-based motor learning and identify distinct transitions in the functional coupling of sensorimotor to transmodal cortex when adapting behavior.
... The ε-greedy method is a reinforcement learning algorithm designed to solve the multi-armed bandit (MAB) problem [18]. The MAB problem was introduced by [15], and it consists of an agent who is presented with a range of actions, commonly known as "arms", where each action generates a specific reward upon selection. ...
Chapter
One of the most promising alternatives to suppress epileptic seizures in drug-resistant and neurosurgery-refractory patients is using electro-electronic devices. By applying an appropriate pulsatile electrical stimulation, the process of ictogenesis can be quickly suppressed. However, in designing such stimulation devices, a common problem is defining suitable parameters such as pulse amplitude, duration, and frequency. In this work, we propose a machine learning technique based on the epsilon-greedy algorithm to optimize the pulse frequency which could prevent abnormal neuronal activity without exceeding energy usage for the stimulation. Five different simulations were carried out in order to evaluate the contribution of the energy consumption in determining the minimum frequency. The results show the efficacy of the proposed algorithm to search the minimum pulse frequency necessary to suppress epileptic seizures.
... where, y t k = i R π kis t α,i . Following a policy gradient approach to maximize the total reward obtained during the episode, it is possible to write the following loss function (see 19,27 ): ...
Article
Full-text available
Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performances. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biological implausible ingredients, such as the detailed storage of older experiences, and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) spiking neural network in which “dreaming” (living new experiences in a model-based simulated environment) significantly boosts learning. Importantly, our model does not require the detailed storage of experiences, and learns online the world-model and the policy. Moreover, we stress that our network is composed of spiking neurons, further increasing the biological plausibility and implementability in neuromorphic hardware.
... Longer skills shorten the effective time horizon of the task by a factor of the average skill length, because the skill-based agent operates on an MDP where transitions are defined by the end of execution of a skill, which can comprise many low-level environment actions. By shortening the task time horizon, the learning efficiency of temporal-difference learning RL algorithms [72] can be improved by, for example, reducing value function bootstrapping error accumulation, as there are fewer timesteps between a sparse reward signal and the starting state. ...
Preprint
Full-text available
Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://www.jessezhang.net/projects/extract/.
... Through trial and error, the agent refines its policy to achieve optimal outcomes. Figure adapted from [51]. ...
Preprint
Full-text available
Global Navigation Satellite Systems (GNSS)-based positioning plays a crucial role in various applications, including navigation, transportation, logistics, mapping, and emergency services. Traditional GNSS positioning methods are model-based and they utilize satellite geometry and the known properties of satellite signals. However, model-based methods have limitations in challenging environments and often lack adaptability to uncertain noise models. This paper highlights recent advances in Machine Learning (ML) and its potential to address these limitations. It covers a broad range of ML methods, including supervised learning, unsupervised learning, deep learning, and hybrid approaches. The survey provides insights into positioning applications related to GNSS such as signal analysis, anomaly detection, multi-sensor integration, prediction, and accuracy enhancement using ML. It discusses the strengths, limitations, and challenges of current ML-based approaches for GNSS positioning, providing a comprehensive overview of the field.
... In RL, an agent learns to optimize its behaviors by interacting with the environment [24]. The environment is defined as a Markov decision process (MDP): (S, A, P, r, γ), where S is the state space, A is the action space, P : S × A × S → [0, 1] is the transition probability function, r : S × A × S → R is the reward function and γ ∈ (0, 1] is the discount factor, which determines the present value of future rewards. ...
Preprint
Full-text available
Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks. By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly, significantly streamlining traditional, manual penetration testing processes. The attack pathways revealed by the RL agent can provide valuable insights to the defense team, helping them identify network weak points and develop more resilient defensive measures. Experimental results on a 152-host example network confirm the effectiveness of the proposed approach, demonstrating the RL agent's capability to discover and orchestrate attacks on high-value targets while evading honeyfiles (decoy files strategically placed to detect unauthorized access).
... Formally, a policy π : S → A for an MDP maps states to actions. Reinforcement learning (RL) [11] is the process of learning an optimal policy π* that maximizes the cumulative discounted reward for this MDP. Additionally, a trajectory σ of an MDP given an initial state s_0 ∼ I and a policy π is defined accordingly as σ = (s_0 −a_0→ s_1 . . . ...
Preprint
Full-text available
Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things (IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbances in the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.
... The functionality of the software, according to the customer scenarios, generates results utilizing a specific performance roadmap (Abadeh 2019), e.g., cycle time and bottlenecks. The proposed performance learning in this paper can apply performance profiling at different levels of abstraction to satisfy run-time parameters by predicting the system performance using Reinforcement Learning (Sutton and Barto 2018; Kaelbling et al. 1996). This process can lead to a quality-driven intelligent software development approach, which is a promising goal in software development. ...
Article
Full-text available
In the rapidly evolving software development industry, the early identification of optimal design alternatives and accurate performance prediction are critical for developing efficient software products. This paper introduces a novel approach to software refinement, termed Reinforcement Learning-based Software Refinement (RLSR), which leverages Reinforcement Learning techniques to address this challenge. RLSR enables an automated software refinement process that incorporates quality-driven intelligent software development as an early decision-making strategy. By proposing a Q-learning-based approach, RLSR facilitates the automatic refinement of software in dynamic environments while optimizing the utilization of computational resources and time. Additionally, the convergence rate to an optimal policy during the refinement process is investigated. The results demonstrate that training the policy using throughput values leads to significantly faster convergence to optimal rewards. This study evaluates RLSR based on various metrics, including episode length, reward over time, and reward distributions on a running example. Furthermore, to illustrate the effectiveness and applicability of the proposed method, a comparative analysis is applied to three refinable software designs, such as the E-commerce platform, smart booking platform, and Web-based GIS transformation system. The comparison between Q-learning and the proposed algorithm reveals that the refinement outcomes achieved with the proposed algorithm are superior, particularly when an adequate number of learning steps and a comprehensive historical dataset are available. The findings emphasize the potential of leveraging reinforcement learning techniques for automating software refinement and improving the efficiency of the model-driven development process.
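For reference, the tabular Q-learning update underlying approaches like the one described above takes the following generic form; this sketch is illustrative only and is not the RLSR algorithm itself:

```python
# Generic tabular Q-learning update (illustrative only; not the RLSR algorithm).
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] initialized to 0
alpha, gamma = 0.1, 0.95               # learning rate and discount factor

def q_update(s, a, reward, s_next, actions):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
```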
... Then, the agent receives a numerical reward r_t = R(s_t, a_t, s_{t+1}) for taking action a_t in state s_t and landing in state s_{t+1}. The goal is to find a policy π* that maximizes the expected sum of discounted future rewards, given by R(t) = Σ_{i=t}^{T} γ^{i−t} r_i [30]. We used the RL algorithm soft actor-critic (SAC). ...
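Under the reconstructed formula above, the discounted return can be computed directly from a reward sequence; a minimal sketch (illustrative, not the cited implementation):

```python
# Discounted return R(t) = sum_{i=t}^{T} gamma**(i - t) * r_i, as reconstructed above.
def discounted_return(rewards, t, gamma=0.99):
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))

print(discounted_return([0.0, 1.0, 0.0, 2.0], t=1, gamma=0.9))  # 1.0 + 0.9**2 * 2.0
```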
Conference Paper
Cooking robots can enhance the home experience by reducing the burden of daily chores. However, these robots must perform their tasks dexterously and safely in shared human environments, especially when handling dangerous tools such as kitchen knives. This study focuses on enabling a robot to autonomously and safely learn food-cutting tasks. More specifically, our goal is to enable a collaborative robot or industrial robot arm to perform food-slicing tasks by adapting to varying material properties using compliance control. Our approach involves using Reinforcement Learning (RL) to train a robot to compliantly manipulate a knife, by reducing the contact forces exerted by the food items and by the cutting board. However, training the robot in the real world can be inefficient and dangerous, and can result in a lot of food waste. Therefore, we propose SliceIt!, a framework for safely and efficiently learning robot food-slicing tasks in simulation. Following a real2sim2real approach, our framework consists of collecting a small amount of real food-slicing data, calibrating our dual simulation environment (a high-fidelity cutting simulator and a robotic simulator), learning compliant control policies in the calibrated simulation environment, and finally, deploying the policies on the real robot.
... Efficient task distribution: Distributing tasks among team members in a dynamic and adaptive manner is essential for ensuring smooth project progress. Reinforcement learning algorithms can be employed to dynamically assign tasks to team members based on real-time project progress and individual performance metrics [11]. These algorithms learn from the outcomes of previous task assignments and continuously adapt their strategies to maximize overall project efficiency. ...
Article
Full-text available
The rapid advancements in artificial intelligence (AI) have transformed various aspects of software engineering, offering new opportunities for enhancing the efficiency, quality, and reliability of software development processes. This article presents a comprehensive review of the applications of AI across the software development lifecycle, focusing on the areas of planning, coding, testing, and deployment. The study explores how AI techniques, such as machine learning, natural language processing, and reinforcement learning, can augment human expertise, automate repetitive tasks, and improve decision-making in software development. The article also discusses the challenges associated with AI integration, including compatibility issues, increased complexity, ethical considerations, and the need for specialized skills. Furthermore, the review highlights future directions in AI-driven software engineering, such as explainable AI, federated learning, and collaborative human-AI development. By providing a systematic analysis of the current state and potential of AI in software engineering, this article aims to guide researchers, practitioners, and decision-makers in leveraging AI technologies to revolutionize software development practices and drive innovation in the field.
... RL has risen as a key computational paradigm in which intelligent agents are trained to make sequential decisions that maximize some notion of expected return [1]. It has been instrumental in solving complex dynamic problems across a wide range of applications, such as autonomous driving, robotics, aviation, and finance [2]–[5]. ...
Article
Full-text available
Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints. A Python implementation of the algorithm can be found at https://github.com/SAILRIT/Concurrent-Learning-of-Control-Policy-and-Unknown-Constraints-in-Reinforcement-Learning.git .
... The knowledge obtained from experimentation is used to refine the fair scheduling strategies and machine learning models [31]. The system can be further improved with strategies such as hyperparameter tuning and reinforcement learning training. ...
Research
The effective deployment of computational resources is essential for data processing in the big data era. Fair scheduling plays a central role in striking this balance by ensuring equitable resource distribution. In the context of large-scale data processing, this paper examines the mathematical foundations and practical application of fair scheduling. We explore the details of choosing tools, devising strategies, and addressing practical difficulties. Through empirical evaluation and real-world use cases, we illustrate the practical impact of fair scheduling and highlight its significance in improving efficiency and resource utilisation in big data processing systems. Our primary focus is the practical implementation of fair scheduling within a big data framework. We detail the choice of suitable tools and technologies, examine the architecture of the selected framework, and describe the design and application of fair scheduling principles and algorithms. Throughout the paper, we offer insights into the challenges faced during implementation and the solutions devised to overcome them.
... State-of-the-art solutions to tackle complex control tasks generally rely on deep Reinforcement Learning (RL) [1,2]. While this approach has led to remarkable advances in machine learning, neural networks have well-known issues regarding data efficiency, explainability, and generalization [3]. ...
Preprint
Full-text available
Determining an optimal plan to accomplish a goal is a hard problem in realistic scenarios, which often comprise dynamic and causal relationships between several entities. Although traditionally such problems have been tackled with optimal control and reinforcement learning, a recent biologically-motivated proposal casts planning and control as an inference process. Among these new approaches, one is particularly promising: active inference. This new paradigm assumes that action and perception are two complementary aspects of life whereby the role of the former is to fulfill the predictions inferred by the latter. In this study, we present an effective solution, based on active inference, to complex control tasks. The proposed architecture exploits hybrid (discrete and continuous) processing to construct a hierarchical and dynamic representation of the self and the environment, which is then used to produce a flexible plan consisting of subgoals at different temporal scales. We evaluate this deep hybrid model on a non-trivial task: reaching a moving object after having picked a moving tool. This study extends past work on planning as inference and advances an alternative direction to optimal control and reinforcement learning.
Thesis
Full-text available
In this paper, the focus is on how Alexa Skills are integrated to enhance the interaction of a smart Air Conditioning system with users in a laboratory located at ESPOL University. Additionally, this system incorporates an AI component using the Q-learning technique, allowing for optimal balancing of user comfort with the defined states in the project, while also prioritizing energy consumption.
Article
Full-text available
Ethanol production is a significant industrial bioprocess for energy. The primary objective of this study is to control the process reactor temperature to obtain the desired product, that is, ethanol. Advanced model‐based control systems face challenges due to model‐process mismatch, but Reinforcement Learning (RL) is a class of machine learning that can help by allowing agents to learn policies directly from the environment. Hence, an RL algorithm called twin delayed deep deterministic policy gradient (TD3) is employed. The control of reactor temperature is divided into two categories, namely unconstrained and constrained control approaches. TD3 with various reward functions is tested on a nonlinear bioreactor model. The results are compared with an existing popular RL algorithm, namely the deep deterministic policy gradient (DDPG) algorithm, using a performance measure such as mean squared error (MSE). In the unconstrained control of the bioreactor, the TD3-based controller designed with the integral absolute error (IAE) reward yields a lower MSE of 0.22, whereas DDPG produces an MSE of 0.29. Similarly, in the constrained case, the TD3-based controller designed with the IAE reward yields a lower MSE of 0.38, whereas DDPG produces an MSE of 0.48. In addition, the TD3-trained agent successfully rejects disturbances in the input flow rate and inlet temperature, in addition to a setpoint change, with better performance metrics.
Article
Full-text available
Industry 4.0 integrates cloud computing, the Industrial Internet of Things (IIoT), and modern communication technologies within industrial automation systems. Various devices, with network requirements such as high reliability and low latency, rely on connectivity. The 5G and Beyond (B5G) software-defined architecture facilitates Network Function Virtualization (NFV), which is an essential solution for fulfilling these stringent demands. NFV allows for the implementation and control of Virtual Network Functions (VNFs) in dynamic network environments. VNF placement optimization has been extensively studied from the 5G perspective outside industrial environments, with a focus on minimizing delay and cost, increasing VNF reliability, and improving resource efficiency. However, the complex dynamics of the wireless channel in industrial environments have a considerable impact on the delay factors that are essential for optimizing the deployment of VNFs. This study focuses on modeling a Wireless Sensor Network (WSN)-based Industry 4.0 factory automation scenario in the mmWave band, formulating an optimization problem to minimize overall delay while considering the packet loss rate in the 5G industrial wireless channel. The optimization problem is formulated as a Markov Decision Process (MDP), and two Reinforcement Learning (RL)-based algorithms, AVP-Q and AVP-DQN, are proposed for optimizing the VNF placement. The proposed algorithms are extensively evaluated against the Value Iteration algorithm, which assumes a completely known MDP model, and against two other algorithms from the literature. The simulation results show that AVP-DQN outperforms the existing algorithms for this scenario by 39% and 22.6%, respectively, and its performance closely approaches that of the Value Iteration algorithm.
Article
Full-text available
While recent advanced military operational concepts require intelligent support for command and control, Reinforcement Learning (RL) has not been actively studied in the military domain. This study points out the limitations of RL for military applications based on a literature review and aims to improve the understanding of RL for military decision support under these limitations. Above all, the black-box characteristic of deep RL, in addition to complex simulation tools, makes the internal process difficult to understand. A scalable weapon selection RL framework is built that can be solved either in a tabular form or with a neural network. Converting the Deep Q-Network (DQN) solution to a tabular form makes it easier to compare the result to the Q-learning solution. Furthermore, rather than selectively using one or two RL models as before, RL models are categorized as actors and critics and systematically compared. A random agent, Q-learning and DQN agents as critics, a Policy Gradient (PG) agent as an actor, and Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) agents as actor-critic approaches are designed, trained, and tested. The performance results show that the trained DQN and PPO agents are the best decision-support candidates for the weapon selection RL framework.
Book
This book constitutes the refereed proceedings of the 4th Latin American Workshop, LAWCN 2023, held in Envigado, Colombia, during November 28–30, 2023. The 8 full papers and 2 short papers included in this book were carefully reviewed and selected from 50 submissions. They were organized in topical sections as follows: artificial intelligence and machine learning; computational neuroscience; and brain-computer interfaces.
Conference Paper
The optimization of continuous action control tasks is a crucial step in deep reinforcement learning (DRL) applications. The goal is to identify optimal actions through the accumulation of experience, with DRL training agents to develop a policy that maximizes the cumulative rewards gained from decision-making in dynamic environments. Balancing exploration and exploitation is a crucial challenge in acquiring this policy. The Exploration Decay Policy (EDP) addresses this trade-off with a dynamic exploration noise strategy that adapts to the current training progress, enabling efficient exploration in the initial phases while gradually reducing exploration to focus on exploitation as training progresses. However, the fluctuating training stability across episodes in dynamic environments makes it challenging for exploitation policies to adapt accordingly. In this paper, we propose EDP to address the exploration–exploitation dilemma: the noise scale is dynamically modulated, gradually decreasing it during periods of high training stability to promote exploration, while reducing it to maintain exploitation during periods of low training stability. The study introduces the EDP-DDPG method, enhancing continuous control tasks in Box2D environments. EDP-DDPG outperforms the standard DDPG by achieving higher rewards and quicker convergence. Its success stems from dynamically adjusting the exploration noise every 25 episodes, balancing exploration and exploitation. This adaptive approach, reducing the noise by 10% every 25 episodes, evolves from random to strategic limb movements, optimizing policy exploitation and adaptability in dynamic settings.
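The 10%-every-25-episodes decay described in the abstract can be sketched as a simple noise schedule; the Gaussian action noise and parameter names below are assumptions made for illustration, not the authors' code:

```python
# Illustrative exploration-noise schedule in the spirit of the abstract:
# the noise scale is cut by 10% every 25 episodes (not the authors' implementation).
import numpy as np

def noise_scale(episode, initial_scale=0.2, decay=0.9, interval=25):
    return initial_scale * decay ** (episode // interval)

def noisy_action(policy_action, episode, low=-1.0, high=1.0):
    a = policy_action + np.random.normal(0.0, noise_scale(episode), size=np.shape(policy_action))
    return np.clip(a, low, high)      # keep the perturbed action within bounds
```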
Article
The significant increase in civil aircraft traffic has led to frequent congestion and various negative outcomes within the terminal area. To mitigate these challenges, it is essential to study and address the aircraft landing problem (ALP). This problem involves arranging the sequence of aircraft on airport runways and scheduling their operations while accounting for multiple operational constraints. Air traffic control is an ongoing optimization problem with multi-objective challenges. In this survey paper, we provide a comprehensive review of the latest studies on the ALP, covering a wide variety of methodologies, including mixed-integer programming, dynamic programming, heuristics, meta-heuristics, and reinforcement learning techniques. This survey also proposes a classification of the ALP and summarizes the identified constraints, objectives, the frequency of their occurrence, and the datasets used by these studies, covering one or multiple runways. In addition, based on the existing literature, we discuss the advantages of these approaches, such as delay and pollution minimization, and their disadvantages, such as poor scalability with problem size and restriction to the single-runway setting.
Chapter
A common research protocol in cognitive neuroscience is to train subjects to perform deliberately designed experiments while recording brain activity, with the aim of understanding the brain mechanisms underlying cognition. However, how the results of this protocol of research can be applied in technology is seldom discussed. Here, I review the studies on time processing of the brain as examples of this research protocol, as well as two main application areas of neuroscience (neuroengineering and brain-inspired artificial intelligence). Time processing is a fundamental dimension of cognition, and time is also an indispensable dimension of any real-world signal to be processed in technology. Therefore, one may expect that the studies of time processing in cognition profoundly influence brain-related technology. Surprisingly, I found that the results from cognitive studies on timing processing are hardly helpful in solving practical problems. This awkward situation may be due to the lack of generalizability of the results of cognitive studies, which are under well-controlled laboratory conditions, to real-life situations. This lack of generalizability may be rooted in the fundamental unknowability of the world (including cognition). Overall, this paper questions and criticizes the usefulness and prospect of the abovementioned research protocol of cognitive neuroscience. I then give three suggestions for future research. First, to improve the generalizability of research, it is better to study brain activity under real-life conditions instead of in well-controlled laboratory experiments. Second, to overcome the unknowability of the world, we can engineer an easily accessible surrogate of the object under investigation, so that we can predict the behavior of the object under investigation by experimenting on the surrogate. Third, the paper calls for technology-oriented research, with the aim of technology creation instead of knowledge discovery.
Chapter
Reinforcement learning (RL) is a very powerful tool for sequential decision making. It has already been a vital component in solving grand challenge problems like the "protein folding problem." The RL model is well suited for communication networks because it can learn complex behaviors in the target environment even when information such as channel conditions and user statistics is unavailable. Our focus here is on adversarial RL because of its ability to capture scenarios where the reward and/or the dynamics of the environment change over time, possibly in an adversarial manner. In order to perform well in such systems, the agent, e.g., the defender, needs to adapt its policy over time. However, the agent may not be able to afford frequent policy changes, especially when the reward and/or the dynamics of the environment change rapidly, e.g., due to the energy limitations that edge devices face when changing policies. Thus, in addition to the standard metric of losses, switching costs, which capture the costs of changing policies, are regarded as a critical metric in RL. This chapter introduces state-of-the-art results on both bandits and RL with switching costs, their importance for network security under limited defender resources, and interesting directions for future work.
Chapter
Software Defined Networking (SDN) is an innovative option with great potential for the future growth of the Internet. It enhances the flexibility and transparency of centralized and controlled networks. However, these advantages come with some inherent vulnerabilities and substantial risks. SDN environments, due to the vulnerabilities between the control plane and the data plane, are particularly susceptible to Denial of Service (DoS) attacks, which pose a significant threat. To address these security issues, it is essential to incorporate an Intrusion Detection System (IDS) as a security module within the SDN framework. The IDS enables continuous monitoring of the SDN environment, performs comprehensive traffic inspection, and generates alerts when malicious activities occur in the SDN network. In this chapter, we will review state-of-the-art machine learning and deep learning-based intrusion detection methods for SDN. We will focus on specific network intrusion methods designed for SDN and evaluate their effectiveness and efficiency based on metrics such as accuracy, processing time, overhead, and false positive rate. Through real testbed evaluations, we aim to identify the most suitable intrusion detection methods for SDN environments.
Preprint
Full-text available
The adaptive fitness of an organism in its ecological niche is highly reliant upon its ability to associate environmental or internal stimuli with behavioral responses through reinforcement. This simple but powerful approach has been successfully applied in computational neuroscience and reinforcement learning to model both human and animal behaviors. However, a critical challenge faced by these models is the credit assignment problem, in which the association between past behavior and a delayed reinforcement signal must be considered. In this paper, we reformulate the credit assignment problem to consider how past stimuli are linked to adaptive behavioral responses in a simple neuronal circuit. We propose a biologically plausible variant of a spiking neural network, which can model a wide variety of behavioral, learning, and evolutionary phenomena. Our model suggests one fundamental mechanism for associating a behavior with an adaptive response that may be used in the brains of both simple and complex organisms. Our results show the model's versatility and biological plausibility in a number of tasks related to classical and operant conditioning, including behavioral chaining. Additionally, we present simulations to demonstrate how adaptive behaviors such as reflexes and simple category detection may have evolved using our model. Our results indicate the potential for further modifications and extensions of our model to replicate more sophisticated and biologically plausible behavioral, learning, and intelligence phenomena found throughout the animal kingdom.
Article
Full-text available
We develop a theoretical framework that shows how mesencephalic dopamine systems could distribute to their targets a signal that represents information about future expectations. In particular, we show how activity in the cerebral cortex can make predictions about future receipt of reward and how fluctuations in the activity levels of neurons in diffuse dopamine systems above and below baseline levels would represent errors in these predictions that are delivered to cortical and subcortical targets. We present a model for how such errors could be constructed in a real brain that is consistent with physiological results for a subset of dopaminergic neurons located in the ventral tegmental area and surrounding dopaminergic neurons. The theory also makes testable predictions about human choice behavior on a simple decision-making task. Furthermore, we show that, through a simple influence on synaptic plasticity, fluctuations in dopamine release can act to change the predictions in an appropriate manner.
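The prediction-error signal described above is commonly formalized as a temporal-difference error; the following tabular TD(0) sketch is a textbook illustration, not the paper's biophysical model:

```python
# TD(0) sketch: delta = r + gamma * V(s') - V(s) plays the role of the
# dopamine-like prediction error described above (illustrative, not the paper's model).
from collections import defaultdict

V = defaultdict(float)                 # state-value estimates
alpha, gamma = 0.1, 0.95               # learning rate and discount factor

def td_update(s, reward, s_next):
    delta = reward + gamma * V[s_next] - V[s]   # prediction error
    V[s] += alpha * delta
    return delta
```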
Article
A partially observed Markov decision process (POMDP) is a generalization of a Markov decision process that allows for incomplete information regarding the state of the system. The significant applied potential for such processes remains largely unrealized, due to an historical lack of tractable solution methodologies. This paper reviews some of the current algorithmic alternatives for solving discrete-time, finite POMDPs over both finite and infinite horizons. The major impediment to exact solution is that, even with a finite set of internal system states, the set of possible information states is uncountably infinite. Finite algorithms are theoretically available for exact solution of the finite horizon problem, but these are computationally intractable for even modest-sized problems. Several approximation methodologies are reviewed that have the potential to generate computationally feasible, high precision solutions.
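The "information state" referred to above is typically the belief, a probability distribution over hidden states updated by Bayes' rule, which is why the information-state space is continuous even when the underlying state set is finite. A minimal sketch, assuming known transition and observation models T and O (hypothetical names and data layout):

```python
# Belief (information-state) update for a POMDP:
#   b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)
# Illustrative sketch assuming known models T[s][a][s_next] and O[a][s_next][o].
def belief_update(b, a, o, states, T, O):
    new_b = {}
    for s_next in states:
        pred = sum(T[s][a][s_next] * b[s] for s in states)   # prediction step
        new_b[s_next] = O[a][s_next][o] * pred               # correction step
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()} if norm > 0 else new_b
```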
Article
A comprehensive theory of cerebellar function is presented, which ties together the known anatomy and physiology of the cerebellum into a pattern-recognition data processing system. The cerebellum is postulated to be functionally and structurally equivalent to a modification of the classical Perceptron pattern-classification device. It is suggested that the mossy fiber → granule cell → Golgi cell input network performs an expansion recoding that enhances the pattern-discrimination capacity and learning speed of the cerebellar Purkinje response cells. Parallel fiber synapses of the dendritic spines of Purkinje cells, basket cells, and stellate cells are all postulated to be specifically variable in response to climbing fiber activity. It is argued that this variability is the mechanism of pattern storage. It is demonstrated that, in order for the learning process to be stable, pattern storage must be accomplished principally by weakening synaptic weights rather than by strengthening them.
Article
A translation-invariant back-propagation network is described that performs better than a sophisticated continuous acoustic parameter hidden Markov model on a noisy, 100-speaker confusable vocabulary isolated word recognition task. The network's replicated architecture permits it to extract precise information from unaligned training patterns selected by a naive segmentation rule.
Conference Paper
We present a characterization of heuristic evaluation functions which unifies their treatment in single-agent problems and two-person games. The central result is that a useful heuristic function is one which determines the outcome of a search and is invariant along a solution path. This local characterization of heuristics can be used to predict the effectiveness of given heuristics and to automatically learn useful heuristic functions for problems. In one experiment, a set of relative weights for the different chess pieces was automatically learned. 1. Introduction. Consider the following anomaly. The Manhattan distance heuristic for the Fifteen Puzzle is computed by measuring the distance along the two-dimensional grid of each tile from its current position to its goal position, and summing these values for each tile. Manhattan distance is a very effective heuristic function for solving the Fifteen Puzzle [4]. A completely analogous heuristic can be defined in three dimensions for Rubik's Cube: for each individual movable piece of the cube, count the number of twists required to bring it to its goal position and orientation, and sum these values for each component. Three-dimensional Manhattan distance, however, is effectively worthless as a heuristic function for Rubik's Cube [5]. Even though Rubik's Cube is similar to the Fifteen Puzzle, the two heuristics are virtually identical, and in both cases the goal is achieved when the value of the heuristic is minimized, the heuristic is very effective in one case and useless in the other.
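The Fifteen Puzzle heuristic described in this abstract, i.e., the per-tile grid distances summed over all tiles, can be written compactly; the sketch below is an illustration under the standard tile-numbering convention, not code from the paper:

```python
# Manhattan-distance heuristic for the Fifteen Puzzle: sum, over all tiles,
# of the grid distance from each tile's current cell to its goal cell.
def manhattan_distance(board, size=4):
    """board is a tuple of length size*size; 0 denotes the blank and is ignored."""
    total = 0
    for idx, tile in enumerate(board):
        if tile == 0:
            continue
        goal = tile - 1                              # goal index of this tile
        total += abs(idx // size - goal // size) + abs(idx % size - goal % size)
    return total

# Example: tile 15 displaced by one cell from the solved configuration.
print(manhattan_distance((*range(1, 15), 0, 15)))    # -> 1
```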
Article
Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working with anything but a fixed number of independent random variables. A major advance now appears to be in the making with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves.
Article
In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic. 1 INTRODUCTION. No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that just...