Article

Reinforcement Learning: An Introduction

Authors: Richard S. Sutton, Andrew G. Barto

Abstract

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911)

The idea of learning to make appropriate responses based on reinforcing events has its roots in early psychological theories such as Thorndike's "law of effect" (quoted above). Although several important contributions were made in the 1950s, 1960s and 1970s by illustrious luminaries such as Bellman, Minsky, Klopf and others (Farley and Clark, 1954; Bellman, 1957; Minsky, 1961; Samuel, 1963; Michie and Chambers, 1968; Grossberg, 1975; Klopf, 1982), the last two decades have witnessed perhaps the strongest advances in the mathematical foundations of reinforcement learning, in addition to several impressive demonstrations of the performance of reinforcement learning algorithms in real world tasks. The introductory book by Sutton and Barto, two of the most influential and recognized leaders in the field, is therefore both timely and welcome.

The book is divided into three parts. In the first part, the authors introduce and elaborate on the essential characteristics of the reinforcement learning problem, namely, the problem of learning "policies" or mappings from environmental states to actions so as to maximize the amount of "reward"


... In contrast, more intricate environments necessitate models with greater representational power and have higher data requirements. Model-based RL (MBRL) (Sutton & Barto, 2018) is hypothesized to be the key for scaling up deep RL agents (LeCun, 2022). Indeed, world models (Ha & Schmidhuber, 2018) offer a diverse range of capabilities: lookahead search (Schrittwieser et al., 2020; Ye et al., 2021), learning in imagination (Sutton, 1991; Hafner et al., 2023), representation learning (Schwarzer et al., 2021; D'Oro et al., 2023), and uncertainty estimation (Pathak et al., 2017; Sekar et al., 2020). ...
... We consider a Partially Observable Markov Decision Process (POMDP) (Sutton & Barto, 2018). The transition, reward, and episode termination dynamics are captured by the conditional distributions p(x_{t+1} | x_{≤t}, a_{≤t}) and p(r_t, d_t | x_{≤t}, a_{≤t}), where x_t ∈ X = ℝ^{3×h×w} is an image observation, a_t ∈ A = {1, . . . ...
... Learning in imagination (Sutton, 1991; Sutton & Barto, 2018) consists of 3 stages that are repeated alternately: experience collection, world model learning, and policy improvement. Strikingly, the agent learns behaviours purely within its world model, and real experience is only leveraged to learn the environment dynamics. ...
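The three alternating stages can be sketched as a training loop; the `env`, `world_model`, `policy`, and `replay_buffer` interfaces below are hypothetical placeholders for illustration, not the architecture of the preprint below.

```python
# Skeleton of "learning in imagination" (hypothetical interfaces): the real
# environment is used only for data collection; the policy is improved on
# imagined rollouts alone.

def train_in_imagination(env, world_model, policy, replay_buffer,
                         iterations=1000, horizon=15):
    for _ in range(iterations):
        # 1) Experience collection in the real environment.
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs)
            next_obs, reward, done = env.step(action)
            replay_buffer.add(obs, action, reward, next_obs, done)
            obs = next_obs

        # 2) World-model learning on batches of real transitions.
        world_model.update(replay_buffer.sample())

        # 3) Policy improvement purely inside the world model.
        start_states = world_model.encode(replay_buffer.sample())
        imagined_rollouts = world_model.imagine(policy, start_states, horizon)
        policy.update(imagined_rollouts)
```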
Preprint
Full-text available
Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
... Another related piece of work is [31] where a classic Q-learning agent is first trained on a powerful computer and its Q-table then transferred to the embedded device. The solution that the authors offer is limited to that one algorithm and does not support NNs, i.e. it does not allow for deep RL [32]. Furthermore, it does not come with a mechanism to automatically and repeatedly replace the agent on the target system. ...
... The experiences that are generated in the process are collected and utilised to update the agent's policy such that the cumulated reward it receives for its behaviour is maximised. [32] The mathematical foundation of the environment is a time-discrete Markov decision process (MDP) defined by the four-tuple (S, A, P, R), that is ...
... with a learning rate η ∈ ℝ. The policy is typically implemented as an NN in which case the parameter set θ consists of its weights and biases. [32] There are three key metrics to quantify how well an agent fares in a certain situation: The value function (VF) V^π_θ (4) estimates the return at a state s ∈ S when acting on-policy (i.e. when choosing actions according to the current policy) from there on. Similarly, the action-value function or Q-function Q^π_θ (5) estimates the return when taking an action a ∈ A at a state s ∈ S on the assumption that all following actions are on-policy. ...
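For reference, these two quantities have standard textbook definitions (restated here in Sutton & Barto's notation; the cited paper's equations (4) and (5) may differ in details such as a finite horizon):

$$V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right], \qquad Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\!\left[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]$$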
Article
Full-text available
Advances in artificial intelligence (AI) have led to its application in many areas of everyday life. In the context of control engineering, reinforcement learning (RL) represents a particularly promising approach as it is centred around the idea of allowing an agent to freely interact with its environment to find an optimal strategy. One of the challenges professionals face when training and deploying RL agents is that the latter often have to run on dedicated embedded devices. This could be to integrate them into an existing toolchain or to satisfy certain performance criteria like real-time constraints. Conventional RL libraries, however, cannot be easily utilised in conjunction with that kind of hardware. In this paper, we present a framework named LExCI, the Learning and Experiencing Cycle Interface, which bridges this gap and provides end-users with a free and open-source tool for training agents on embedded systems using the open-source library RLlib. Its operability is demonstrated with two state-of-the-art RL-algorithms and a rapid control prototyping system.
... Reinforcement Learning (RL) [11] stands out as a recognised self-learning paradigm in robotics [7,5]. RL enables robots to learn optimal behaviours by interacting with the environment and receiving feedback in the form of rewards or penalties. ...
... We employ a self-learning algorithm based on RL to dynamically select paths for each time step and aim to minimize the overall travelling distances, as explained in Eq. 2. The foundation of the RL model is structured as a Markov decision process [11], denoted as Υ(A, R_a(s, s′), S, P_a(s, s′)). A represents the set of possible actions, S describes the agent states within the environment, P_a(s, s′) defines the transition probability from state s to state s′ under action a at each time step t, and R_a(s, s′) denotes the reward or punishment received after transitioning from state s to state s′ under action a. ...
... where m_0 represents the initial value of m-step, and λ_t denotes the decay rate parameter. This approach is to a certain extent similar to the exploration and exploitation relation in vanilla RL [11]. By regulating the sampling rate from OR solvers using this exponential decay approach, we optimize the efficacy of knowledge transfer. ...
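The schedule itself is not reproduced in this excerpt; the sketch below shows one plausible exponential form, m(t) = m_0 · exp(−λ t), with purely illustrative parameter values rather than those from the cited paper.

```python
import math

def solver_sampling_rate(t, m0=1.0, decay=0.01):
    """Probability of sampling the next action from the OR solver at step t.
    The exponential form m0 * exp(-decay * t) and the parameter values are
    illustrative assumptions; the cited paper defines its own schedule."""
    return m0 * math.exp(-decay * t)

# Early in training the agent imitates the solver often; later it relies
# increasingly on its own learned policy, much like a decaying epsilon.
print([round(solver_sampling_rate(t), 3) for t in (0, 100, 500)])
```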
Article
Full-text available
There is a growing interest in implementing artificial intelligence for operations research in the industrial environment. While numerous classic operations research solvers ensure optimal solutions, they often struggle with real-time dynamic objectives and environments, such as dynamic routing problems, which require periodic algorithmic recalibration. To deal with dynamic environments, deep reinforcement learning has shown great potential with its capability as a self-learning and optimizing mechanism. However, the real-world applications of reinforcement learning are relatively limited due to lengthy training time and inefficiency in high-dimensional state spaces. In this study, we introduce two methods to enhance reinforcement learning for dynamic routing optimization. The first method involves transferring knowledge from classic operations research solvers to reinforcement learning during training, which accelerates exploration and reduces lengthy training time. The second method uses a state-space decomposer to transform the high-dimensional state space into a low-dimensional latent space, which allows the reinforcement learning agent to learn efficiently in the latent space. Lastly, we demonstrate the applicability of our approach in an industrial application of an automated welding process, where our approach identifies the shortest welding pathway of an industrial robotic arm to weld a set of dynamically changing target nodes, poses and sizes. The suggested method cuts computation time by 25% to 50% compared to classic routing algorithms.
... While the undiscounted cost function is always considered in optimal control, such as the linear quadratic regulator (LQR) and model predictive control (MPC), the discounted setting is more popular in RL [37], meaning that the stage costs are weighted by an exponentially decreasing term with respect to the discount factor. The discount factor can be viewed as a trade-off between short-term and long-term returns [37]. Consequently, discount factors with different values will result in control policies with different performance. ...
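The discounted objective being contrasted with the undiscounted LQR/MPC cost can be computed directly; the snippet below is a generic illustration of how the discount factor trades off short-term against long-term returns, with made-up reward values.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k: the discounted objective used in RL, in contrast
    to the undiscounted cost typically used in LQR/MPC formulations."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 20                            # made-up constant rewards
print(discounted_return(rewards, gamma=0.5))    # short-sighted: ~2.0
print(discounted_return(rewards, gamma=0.99))   # far-sighted: ~18.2
```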
Article
Full-text available
In this paper, the stability of nonlinear systems under infinite-horizon discounted optimal control via adaptive dynamic programming method is analyzed. First, considering the adoption of function approximators during value iteration (VI), the iterative value function and control policy are shown to be continuous. Then, based on a verifiable condition on the approximation errors caused by the critic network, it is proved that the approximate value functions are bounded and positive definite. Further in the stability analysis, a stability condition as the termination criterion of approximate VI (AVI) is developed, which guarantees that the control policy derived from the obtained critic network makes the controlled system \(\mathcal{K}\mathcal{L}\)-stable. Also, an upper bound function of the approximation errors caused by the action network is derived for ensuring that the system controlled by the trained action network remains \(\mathcal{K}\mathcal{L}\)-stable. The \(\mathcal{K}\mathcal{L}\)-stability of the closed-loop system is established by using the approximate value function to act as the Lyapunov function and estimate the region of attraction. Finally, the present theoretical results are applied to the simulation studies of the spacecraft rendezvous.
... RL is a reward-based method that relies on interaction between an agent and a given environment to maximise the numerical reward [27]. As a result of environmental feedback, the agent learns its behaviour and then strives to improve its actions. ...
... As a part of the mechanism of the agent's engagement with the environment, the ε-greedy strategy is used to make sure the agent maintains a balance between exploration (by taking random action a) and exploitation (action a with arg max_a Q(s_t, a)) and generates an assortment of experiences as it interacts with the environment [27]. The experience samples are stored in a database. ...
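A minimal sketch of the ε-greedy rule and experience storage described here (illustrative names and values, not the cited system's implementation):

```python
import random

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """With probability epsilon take a random action (exploration),
    otherwise take argmax_a Q(s, a) (exploitation).
    q_values maps (state, action) pairs to estimated returns."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

experience_buffer = []   # stands in for the database of experience samples

# Toy usage: two actions, with a slight preference already learned for action 1.
q = {("s0", 0): 0.2, ("s0", 1): 0.5}
a = epsilon_greedy(q, "s0", actions=[0, 1], epsilon=0.1)
experience_buffer.append(("s0", a, 1.0, "s1"))   # (state, action, reward, next state)
```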
Article
Full-text available
To gain access to networks, various intrusion attack types have been developed and enhanced. The increasing importance of computer networks in daily life is a result of our growing dependence on them. Given this, it is glaringly obvious that algorithmic tools with strong detection performance and dependability are required for a variety of attack types. The objective is to develop a system for intrusion detection based on deep reinforcement learning. On the basis of the Markov decision procedure, the developed system can construct patterns appropriate for classification purposes based on extensive amounts of informative records. Deep Q‐Learning (DQL), Soft DQL, Double DQL, and Soft double DQL are examined from two perspectives. An evaluation of the authors’ methods using UNSW‐NB15 data demonstrates their superiority regarding accuracy, precision, recall, and F1 score. The validity of the model trained on the UNSW‐NB15 dataset was also checked using the BoT‐IoT and ToN‐IoT datasets, yielding competitive results.
... Reinforcement Learning [45] is a field of machine learning which consists in training an intelligent agent to make decisions in a dynamic environment in order to reach a goal. In this process, the agent receives rewards or penalties based on its actions, and tries to learn a policy that maximizes the cumulative sum of rewards received over time. ...
... actions that it predicts to yield the highest reward based on its current knowledge) as it gets close to 1. Designing reward signals is a cutting-edge research area, and this may not be the best reward function for this particular problem. The intuition for this choice came from Section 17.4 of [45]; also, of the several reward functions tested experimentally, this one led to the highest performance in practice. ...
Preprint
Optical aberrations prevent telescopes from reaching their theoretical diffraction limit. Once estimated, these aberrations can be compensated for using deformable mirrors in a closed loop. Focal plane wavefront sensing enables the estimation of the aberrations on the complete optical path, directly from the images taken by the scientific sensor. However, current focal plane wavefront sensing methods rely on physical models whose inaccuracies may limit the overall performance of the correction. The aim of this study is to develop a data-driven method using model-free reinforcement learning to automatically perform the estimation and correction of the aberrations, using only phase diversity images acquired around the focal plane as inputs. We formulate the correction problem within the framework of reinforcement learning and train an agent on simulated data. We show that the method is able to reliably learn an efficient control strategy for various realistic conditions. Our method also demonstrates robustness to a wide range of noise levels.
... Computational modeling of task behavior focuses on differences in the first free choice between H1 and H6 conditions and how this is affected by equal vs. unequal information (Wilson et al., 2014). Based on the reinforcement learning (RL) framework (Sutton and Barto, 2018), the model learns expected rewards for each option based on the forced-choice outcomes and then uses these learned expectations to select the first free choice. Formally, the value of a given bandit option is computed as follows: ...
... When comparing choice behavior, they found that, relative to the other two groups, the propranolol group was also more likely to choose the option with the higher average reward value, and that their choice patterns were generally more consistent. A range of models with different assumptions were fit to the choice data, with results indicating group differences in an "ε-greedy" model parameter that captures pure choice randomness, irrespective of expected value and uncertainty (Sutton and Barto, 2018). In particular, participants in the propranolol group showed lower values of this parameter, suggesting that reductions in random exploration were caused by lowering norepinephrine levels. ...
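A toy version of this kind of model, with a delta-rule value update and an ε parameter capturing pure choice randomness (illustrative parameter values, not the fitted models from these studies), might look like:

```python
import random

def update_expected_reward(value, outcome, learning_rate=0.3):
    """Delta-rule update: move the learned expectation toward the observed
    forced-choice outcome. The learning rate here is illustrative."""
    return value + learning_rate * (outcome - value)

def choose(values, epsilon=0.1):
    """epsilon-greedy choice: epsilon captures pure choice randomness,
    irrespective of the options' expected values."""
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

values = [0.0, 0.0]                    # expected rewards for two bandit options
for outcome in (1.0, 0.0, 1.0, 1.0):   # forced-choice outcomes for option 0
    values[0] = update_expected_reward(values[0], outcome)
print(values, choose(values))
```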
Preprint
Full-text available
Exploratory behaviors can serve an adaptive role within novel or changing environments. Namely, they facilitate information gain, allowing an organism to maintain accurate beliefs about the environment and select actions that better maximize reward. However, finding the optimal balance between exploration and reward-seeking behavior – the so-called explore-exploit dilemma – can be challenging, as it requires sensitivity to one’s own uncertainty and to the predictability of one’s surroundings. Here, we review computational modeling studies investigating how exploration is influenced by anxiety. While some apparent inconsistencies remain to be resolved, studies using reinforcement learning tasks suggest that directed (but not random) forms of exploration may be elevated by trait and/or cognitive anxiety, but reduced by state and/or somatic anxiety. Anxiety is also consistently associated with less exploration in foraging tasks. Some differences in exploration may further stem from how anxiety modulates changes in uncertainty over time (learning rates). Jointly, these results highlight important directions for future work in refining choice of tasks and anxiety measures and maintaining consistent methodology across studies.
... These features have been previously extracted and all-zero Q-table Q_{s(t),m(t)}. 4: Obtain π*_{[2]}(·) & Q*_{[2]}(·) by solving (4) using Q-learning [53]. 5: Compute V*_{[2],oi(t)} following eq. ...
... The term exact reinforcement learning, as opposed to approximate reinforcement learning, refers to the tabular methods where the exact value of an action-state pair is computed [53]. ...
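For concreteness, the tabular ("exact") Q-learning update referred to here can be sketched on a toy problem as follows; the tiny chain environment and hyperparameters are invented for illustration only.

```python
from collections import defaultdict
import random

# Tabular Q-learning on a toy 3-state chain (a generic sketch of the textbook
# update, not the cited paper's setup). State 2 is terminal and rewarding.
actions = [0, 1]                      # 0: stay, 1: move right
Q = defaultdict(float)                # Q[(state, action)], initialised to zero

def step(s, a):
    s_next = min(s + a, 2)
    r = 1.0 if s_next == 2 else 0.0
    return s_next, r, s_next == 2

alpha, gamma, epsilon = 0.5, 0.9, 0.1
for episode in range(200):
    s, done = 0, False
    while not done:
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        s_next, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Q-learning update
        s = s_next

print({k: round(v, 2) for k, v in Q.items()})
```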
Article
Full-text available
With countless promising applications in various domains such as IoT and Industry 4.0, task-oriented communication design (TOCD) is getting accelerated attention from the research community. This paper presents a novel approach for designing scalable task-oriented quantization and communications in cooperative multi-agent systems (MAS). The proposed approach utilizes the TOCD framework and the value of information (VoI) concept to enable efficient communication of quantized observations among agents while maximizing the average return performance of the MAS, a parameter that quantifies the MAS’s task effectiveness. The computational complexity of learning the VoI, however, grows exponentially with the number of agents. Thus, we propose a three-step framework: (i) learning the VoI (using reinforcement learning (RL)) for a two-agent system, (ii) designing the quantization policy for an N-agent MAS using the learned VoI for a range of bit-budgets and, (iii) learning the agents’ control policies using RL while following the designed quantization policies in the earlier step. Our analytical results show the applicability of the proposed framework under a wide range of problems. Numerical results show striking improvements in reducing the computational complexity of obtaining VoI needed for the TOCD in a MAS problem without compromising the average return performance of the MAS.
... The agent refines its policy through trial and error to achieve optimal outcomes. Figure adapted from [51]. ...
Article
Full-text available
Global Navigation Satellite Systems (GNSS)-based positioning plays a crucial role in various applications, including navigation, transportation, logistics, mapping, and emergency services. Traditional GNSS positioning methods are model-based, utilizing satellite geometry and the known properties of satellite signals. However, model-based methods have limitations in challenging environments and often lack adaptability to uncertain noise models. This paper highlights recent advances in machine learning (ML) and its potential to address these limitations. It covers a broad range of ML methods, including supervised learning, unsupervised learning, deep learning, and hybrid approaches. The survey provides insights into positioning applications related to GNSS, such as signal analysis, anomaly detection, multi-sensor integration, prediction, and accuracy enhancement using ML. It discusses the strengths, limitations, and challenges of current ML-based approaches for GNSS positioning, providing a comprehensive overview of the field.
... This technique involves gathering initial experiences to stabilize the training process of actor and critic networks [61]. This study aims to create a buffer of valuable experiences by allowing the agent to explore the environment and collect a set of initial experiences through random actions independent of the current policy. ...
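A minimal sketch of such a warm-up phase, assuming a simple reset/step environment interface (an assumption for illustration, not the cited paper's actual code):

```python
import random

def warm_up_buffer(env, buffer, n_steps=1000):
    """Fill a replay buffer with transitions collected by purely random
    actions, independent of the current policy, so that early actor/critic
    updates see diverse data. The env interface (reset/step/actions) is an
    illustrative assumption."""
    obs = env.reset()
    for _ in range(n_steps):
        action = random.choice(env.actions)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return buffer
```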
Article
Full-text available
An intelligent reflecting surface (IRS) is a promising technology for future wireless communication. It comprises many hardware-efficient passive elements. The applications of unmanned aerial vehicles (UAVs) have expanded beyond military missions owing to their mobility, maneuverability, flexibility, ease of deployment, and cost-effectiveness. Combining IRS with UAVs, known as UAV-mounted IRSs (IRS-UAV), has gained significant attention owing to the unique advantages offered by both technologies. Wireless communication systems face critical challenges in physical layer security, particularly cell-free massive multiple-input multiple-output (MIMO) systems (CFMM). This study investigated physical layer security (PLS) in an IRS-UAV-assisted CFMM, involving multiple IRS-UAVs, access points, users, and passive eavesdroppers. To maximize the average secrecy downlink rate, this study proposes an optimization algorithm using deep reinforcement learning based on a deep deterministic policy gradient (DDPG) that achieves at least one locally optimal solution. However, this approach results in a relatively high computational complexity. A second approach is introduced to address this: an alternating optimization algorithm combining inner approximation (IA) methods and an advanced DDPG algorithm with a warm-up technique. The simulation results demonstrated the efficiency of both approaches in resolving complex optimization problems. Furthermore, the numerical findings confirmed that the proposed alternating optimization algorithm exhibited competitive performance and significantly reduced computational complexity compared with the DDPG-based approach.
... Deep reinforcement learning (DRL) offers a promising paradigm where agents learn and evolve based on their interactions with the environment, making decisions that optimize a specific objective function over time [1]. DRL combines reinforcement learning with deep neural networks to approximate the value function (or action-value function) and policy [2]. ...
Article
Full-text available
Trained deep reinforcement learning (DRL) based controllers can effectively control dynamic systems where classical controllers can be ineffective and difficult to tune. However, the lack of closed‐loop stability guarantees of systems controlled by trained DRL agents hinders their adoption in practical applications. This research study investigates the closed‐loop stability of dynamic systems controlled by trained DRL agents using Lyapunov analysis based on a linear‐quadratic polynomial approximation of the trained agent. In addition, this work develops an understanding of the system's stability margin to determine operational boundaries and critical thresholds of the system's physical parameters for effective operation. The proposed analysis is verified on a DRL‐controlled system for several simulated and experimental scenarios. The DRL agent is trained using a detailed dynamic model of a non‐linear system and then tested on the corresponding real‐world hardware platform without any fine‐tuning. Experiments are conducted on a wide range of system states and physical parameters and the results have confirmed the validity of the proposed stability analysis (https://youtu.be/QlpeD5sTlPU).
... Policy gradient methods (PG) are a family of reinforcement learning algorithms (Sutton & Barto, 2018) that are particularly suited for large-scale decision and control problems. Recent applications of PG algorithms range from videogames (Wurman et al., 2022), to robotics (Rudin et al., 2021), to large language models (OpenAI, 2023;Ahmadian et al., 2024), where fine-tuning of the LLM with these methods shows a concrete application within the current conversational artificial intelligence boom, and the potential impact of accelerated versions of these methods. ...
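As a concrete toy instance of the policy-gradient family, the following self-contained REINFORCE example learns a softmax policy on a three-armed bandit; all values are illustrative and it does not correspond to any of the cited systems.

```python
import numpy as np

# Toy REINFORCE on a 3-armed bandit with a softmax policy.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])    # illustrative reward means
theta = np.zeros(3)                        # one policy parameter per arm
alpha, baseline = 0.1, 0.0

for step in range(2000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()                   # softmax policy pi(a)
    a = rng.choice(3, p=probs)
    r = rng.normal(true_means[a], 0.1)
    baseline += 0.01 * (r - baseline)      # running baseline reduces variance
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                  # d log pi(a) / d theta for a softmax policy
    theta += alpha * (r - baseline) * grad_log_pi

print(np.round(probs, 3))                  # probability mass concentrates on arm 2
```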
Article
Full-text available
Several variance-reduced versions of REINFORCE based on importance sampling achieve an improved $O(\epsilon^{-3})$ sample complexity to find an $\epsilon$-stationary point, under an unrealistic assumption on the variance of the importance weights. In this paper, we propose the Defensive Policy Gradient (DEF-PG) algorithm, based on defensive importance sampling, achieving the same result without any assumption on the variance of the importance weights. We also show that this is not improvable by establishing a matching $\Omega(\epsilon^{-3})$ lower bound, and that REINFORCE with its $O(\epsilon^{-4})$ sample complexity is actually optimal under weaker assumptions on the policy class. Numerical simulations show promising results for the proposed technique compared to similar algorithms based on vanilla importance sampling.
... In a bandit framework, an agent is faced with a series of decisions, each of which has a potential reward. The goal is to make a series of decisions that maximize the total cumulative reward [Sutton and Barto, 2018, Russo et al., 2018]. When applied to credit scoring and underwriting, an agent (or lender) sequentially decides which borrowers to underwrite. ...
Preprint
This paper proposes a novel reinforcement learning (RL) framework for credit underwriting that tackles ungeneralizable contextual challenges. We adapt RL principles for credit scoring, incorporating action space renewal and multi-choice actions. Our work demonstrates that the traditional underwriting approach aligns with the RL greedy strategy. We introduce two new RL-based credit underwriting algorithms to enable more informed decision-making. Simulations show these new approaches outperform the traditional method in scenarios where the data aligns with the model. However, complex situations highlight model limitations, emphasizing the importance of powerful machine learning models for optimal performance. Future research directions include exploring more sophisticated models alongside efficient exploration mechanisms.
... [21]. Reinforcement Learning is a form of machine learning different from supervised and unsupervised learning, where an intelligent agent learns to take a series of actions by maximising a cumulative reward in order to extrapolate or generalise to situations not present in the training set [28]. While there are many approaches to solving reinforcement learning problems, this study mainly focuses on using policy gradient methods that perform better in continuous action spaces. ...
Preprint
Full-text available
Planetary exploration requires traversal in environments with rugged terrains. In addition, Mars rovers and other planetary exploration robots often carry sensitive scientific experiments and components onboard, which must be protected from mechanical harm. This paper deals with an active suspension system focused on chassis stabilisation and an efficient traversal method while encountering unavoidable obstacles. Soft Actor-Critic (SAC) was applied along with Proportional Integral Derivative (PID) control to stabilise the chassis and traverse large obstacles at low speeds. The model uses the rover's distance from surrounding obstacles, the height of the obstacle, and the chassis' orientation to actuate the control links of the suspension accurately. Simulations carried out in the Gazebo environment are used to validate the proposed active system.
... This is a type of machine learning algorithm that learns by interacting with its environment. The algorithm learns to take actions that maximize a reward signal [21]. The applications of Reinforcement Learning in Urban Drainage System Modeling include: ...
Article
Urban drainage systems may derive advantages from the implementation of machine learning methods for decision-making and cleansing operations. Conventional decision support systems are rendered ineffective in tackling the intricate and indeterminate aspects of urban planning concerns. It aims to improve model evolution and use while facilitating simpler access using suggested open sources provided in this research. This paper provides an overview of machine learning methods applied in modeling urban drainage systems, and it compares the proposed methods among different machine learning models, which are classified into five distinct approaches: supervised learning, unsupervised learning, deep learning, reinforcement learning, and finally hybrid approaches combining two or more previous algorithms. This study also explores diverse datasets and open sources related to modelling in urban drainage systems that researchers can utilize in their scientific investigations. From the study it can be concluded that the choice of machine learning for urban irrigation systems depends on the specific goals and the nature of the problem. Additionally, studies demonstrate that hybrid and deep learning approaches can solve problems with urban irrigation systems, increase system performance, and yield correct results. Accurately interpreting data and successfully resolving urban irrigation systems' problems are the goals of deep learning. Furthermore, hybrid methods facilitate advances in model development through the integration of mathematical and machine learning models. This study helps researchers improve models and promote their applications to prevent natural disasters.
... RL, a subset of machine learning, involves an agent that learns to make decisions through interactions with its environment, aiming to maximize a numerical reward. The agent executes actions, receives feedback in the form of rewards, and modifies its strategies to optimize the cumulative rewards (Andrew, 1998; Lin et al., 2024; Liu et al., 2023). This adaptive learning process allows the agent to progressively refine its decision-making policy and enhance the prediction accuracy of the system as new data becomes evident. ...
Article
This paper presents a novel Artificial Intelligence (AI)-driven tool designed to convert deflection test results into crucial soil parameters essential for quality assurance in compaction projects. The accurate determination of these parameters, such as density and void ratio, is imperative for ensuring the structural integrity of infrastructures constructed on such soils. Moreover, it facilitates the utilization of modern non-destructive equipment in compaction endeavors. The determination of these parameters is notably challenging in unsaturated soils owing to the intricate interplay among factors such as suction, moisture content, void ratio, and resulting deflection. This paper presents a pioneering tool to address these challenges. By integrating unsaturated soil mechanics with advanced AI techniques, particularly reinforcement learning, the tool leverages a diverse array of inputs, including in-situ data, experimental observations, and physics-based modeling. This integration enables dynamic adaptation to changing field conditions and ensures the tool's real-time adaptability and predictive accuracy. Field trials validated the tool's efficacy in predicting soil properties accurately without direct measurements of moisture content or suction, variables often unmeasured in practical soil compaction projects. This unique capability underscores significant advancements in real-time assessment of unsaturated soils, illustrating the transformative potential of AI in geotechnical engineering and unsaturated soil mechanics.
... Even though there were a large number of reports on the application of RL and related approaches for Go [38−42], it is only with AlphaGo [2] that deep NNs were adopted to establish the value networks to achieve high evaluation accuracy. Position evaluation [39,40,43,44] and deep learning [45,46] have been applied to programs to play the game of Go; however, none of them realized the level of success of AlphaGo [2]. The success of AlphaGo has a far-reaching impact on the research in AI. ...
Article
Full-text available
This article introduces the state-of-the-art development of adaptive dynamic programming and reinforcement learning (ADPRL). First, algorithms in reinforcement learning (RL) are introduced and their roots in dynamic programming are illustrated. Adaptive dynamic programming (ADP) is then introduced following a brief discussion of dynamic programming. Researchers in ADP and RL have enjoyed the fast developments of the past decade from algorithms, to convergence and optimality analyses, and to stability results. Several key steps in the recent theoretical developments of ADPRL are mentioned with some future perspectives. In particular, convergence and optimality results of value iteration and policy iteration are reviewed, followed by an introduction to the most recent results on stability analysis of value iteration algorithms.
... One way to learn the function Q⋆ is to use the Q-learning algorithm (Watkins & Dayan, 1992; Sutton & Barto, 2018) and its variants for continuous environment states, such as Deep Q-learning (Mnih et al., 2015). ...
Preprint
Full-text available
In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the-art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see https://github.com/ML-EDM/ml_edm).
... In this work, we learn the policy with actor-critic algorithm [47]. This algorithm jointly trains the higher-level policy function π(g|s) and the value function V (s). ...
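One common one-step form of such a joint actor-critic update (a generic textbook rule in the spirit of Sutton & Barto, not necessarily the exact variant used in [47]) is:

$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad w \leftarrow w + \alpha_v\, \delta_t\, \nabla_w V_w(s_t), \qquad \theta \leftarrow \theta + \alpha_\pi\, \delta_t\, \nabla_\theta \log \pi_\theta(g_t \mid s_t)$$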
Preprint
Full-text available
Pathology images are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology images in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of a pathology image, leading to sub-optimal recovery due to the large variation among pathology images. In this paper, we propose the first hierarchical reinforcement learning framework, named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly for addressing the aforementioned issues in the pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt the hierarchical recovery mechanism at patch level to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processing problem. Under the guidance of spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL validates the promotion in tumor diagnosis with a large margin and shows generalizability under various degradations. The source code is available at https://github.com/CUHK-AIM-Group/STAR-RL.
... For estimating the value in EDM, we used the reinforcement learning (RL) model (Sutton & Barto, 1998; Watkins & Dayan, 1992), which has been widely used to explain the value-learning mechanism in EDM (Daw & Doya, 2006; Dayan & Abbott, 2005; Dayan & Balleine, 2002; Gläscher et al., 2010; Schönberg et al., 2007). The value of the chosen option is updated based on the difference between the expected value of the item and the actual feedback (Behrens et al., 2007; Biele et al., 2011; Gluth et al., 2014; Hauser et al., 2015; Katahira et al., 2011; Lindström et al., 2014; O'Doherty et al., 2007). ...
Article
Full-text available
All humans must engage in decision-making. Decision-making processes can be broadly classified into internally guided decision-making (IDM), which is determined by individuals’ internal value criteria, such as preference, or externally guided decision-making (EDM), which is determined by environmental external value criteria, such as monetary rewards. However, real-life decisions are never made simply using one kind of decision-making, and the relationship between IDM and EDM remains unclear. This study had individuals perform gambling tasks requiring the EDM using stimuli that formed preferences through the preference judgment task as the IDM. Computational model analysis revealed that strong preferences in the IDM affected initial choice behavior in the EDM. Moreover, through the analysis of the subjective preference evaluation after the gambling tasks, we found that even when stimuli that were preferred in the IDM were perceived as less valuable in the EDM, the preference for IDM was maintained after EDM. These results indicate that although internal criteria, such as preferences, influence EDM, the results show that internal and external criteria differ.
... Reinforcement learning (Sutton & Barto, 1998), a class of algorithms for learning what actions to take based on discrete outcomes in the form of rewards and punishment, offers a very useful framework for tackling this question, because it explicitly separates the learning mechanism, controlled by a learning rate parameter, from the decision-making process, typically modelled as a softmax rule (Daw et al., 2006) controlled by a parameter called the inverse temperature. Although the two parameters are still correlated, so that increasing one can be partly compensated for by decreasing the other, this compensation is not a strict equivalence. ...
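A small sketch of this separation between learning and decision-making, with a delta-rule learning step and a softmax choice rule governed by an inverse temperature (illustrative values only, not the fitted models discussed here):

```python
import numpy as np

def softmax_choice_probs(values, inverse_temperature):
    """Decision process: a higher inverse temperature makes the choice of the
    highest-valued option more deterministic."""
    z = inverse_temperature * np.asarray(values, dtype=float)
    z -= z.max()                       # numerical stability
    p = np.exp(z)
    return p / p.sum()

def value_update(value, reward, learning_rate):
    """Learning mechanism: the learning rate controls how quickly the value
    estimate tracks new outcomes, separately from the choice rule above."""
    return value + learning_rate * (reward - value)

print(softmax_choice_probs([0.2, 0.8], inverse_temperature=1.0))
print(softmax_choice_probs([0.2, 0.8], inverse_temperature=10.0))
```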
Article
Full-text available
In uncertain environments in which resources fluctuate continuously, animals must permanently decide whether to stabilise learning and exploit what they currently believe to be their best option, or instead explore potential alternatives and learn fast from new observations. While such a trade-off has been extensively studied in pretrained animals facing non-stationary decision-making tasks, it is yet unknown how they progressively tune it while learning the task structure during pretraining. Here, we compared the ability of different computational models to account for long-term changes in the behaviour of 24 rats while they learned to choose a rewarded lever in a three-armed bandit task across 24 days of pretraining. We found that the day-by-day evolution of rat performance and win-shift tendency revealed a progressive stabilisation of the way they regulated reinforcement learning parameters. We successfully captured these behavioural adaptations using a meta-learning model in which either the learning rate or the inverse temperature was controlled by the average reward rate. Keywords: decision-making, dopamine, exploration-exploitation trade-off, meta-learning
... As mentioned above, the limitations of the existing studies can be summarized as follows. Firstly, although many studies have considered one of the three factors of following distance [22,23], energy loss [24], and whether the prediction space is continuous or not [25], few studies have considered all three, failing to fully meet the complex driving environment and diversified optimization needs. Furthermore, previous research has predominantly focused on methods using RL with discrete action planning, while RL algorithms in continuous action spaces have rarely been applied to the cruise control of intelligently connected EVs. ...
Article
Full-text available
Eco-driving aims to enhance vehicle efficiency by optimizing speed profiles and driving patterns. However, ensuring safe following distances during eco-driving can lead to excessive use of lithium-ion batteries (LIBs), causing accelerated battery wear and potential safety concerns. This study addresses this issue by proposing a novel, multi-physics-constrained cruise control strategy for intelligently connected electric vehicles (EVs) using deep reinforcement learning (DRL). Integrating a DRL framework with an electrothermal model to estimate unmeasurable states, this strategy simultaneously manages battery degradation and thermal safety while maintaining safe following distances. Results from hardware-in-the-loop simulation testing demonstrated that this approach reduced overall driving costs by 18.72%, decreased battery temperatures by 4 °C to 8 °C in high-temperature environments, and reduced state-of-health (SOH) degradation by up to 46.43%. These findings highlight the strategy’s superiority in convergence efficiency, battery thermal safety, and cost reduction compared to existing methods. This research contributes to the advancement of eco-driving practices, ensuring both vehicle efficiency and battery longevity.
... The Actor network selects the action based on the probability, and the Critic network gives the value based on the action, thus speeding up the learning process. The core of the RL algorithm is the Markov Decision Process (MDP), which is introduced in ref. [24]: ...
Article
Full-text available
Path planning and obstacle avoidance are fundamental problems in unmanned ground vehicle path planning. Aiming at the limitations of Deep Reinforcement Learning (DRL) algorithms in unmanned ground vehicle path planning, such as low sampling rate, insufficient exploration, and unstable training, this paper proposes an improved algorithm called Dual Priority Experience and Ornstein–Uhlenbeck Soft Actor-Critic (DPEOU-SAC) based on Ornstein–Uhlenbeck (OU noise) and double-factor prioritized sampling experience replay (DPE) with the introduction of expert experience, which is used to help the agent achieve faster and better path planning and obstacle avoidance. Firstly, OU noise enhances the agent’s action selection quality through temporal correlation, thereby improving the agent’s detection performance in complex unknown environments. Meanwhile, the experience replay is based on double-factor preferential sampling, which has better sample continuity and sample utilization. Then, the introduced expert experience can help the agent to find the optimal path with faster training speed and avoid falling into a local optimum, thus achieving stable training. Finally, the proposed DPEOU-SAC algorithm is tested against other deep reinforcement learning algorithms in four different simulation environments. The experimental results show that the convergence speed of DPEOU-SAC is 88.99% higher than the traditional SAC algorithm, and the shortest path length of DPEOU-SAC is 27.24, which is shorter than that of SAC.
... Consistent with this contemporary perspective, recent work implicates several regions of the DMN in organizing different modes of behavior over time. For instance, DMN areas such as medial frontal cortex and posteromedial cortex appear to play an important role in shifting between information gathering versus information exploitation during reward-guided decision-making tasks (Barack et al., 2017;Foster et al., 2023;Pearson et al., 2011;Pearson et al., 2009;Schuck et al., 2015;Trudel et al., 2021)-the so-called explore/exploit trade-off (Frank et al., 2009;Sutton and Barto, 2018). Consistent with this, recent studies have argued that broad features of the DMN's activity can be explained under the auspices that it supports behavior under conditions in which performance depends on knowledge accrued across several trials rather than by immediate sensory inputs (Hayden et al., 2008;Murphy et al., 2019;Murphy et al., 2018;Vatansever et al., 2017). ...
Article
Full-text available
Adaptive motor behavior depends on the coordinated activity of multiple neural systems distributed across the brain. While the role of sensorimotor cortex in motor learning has been well established, how higher-order brain systems interact with sensorimotor cortex to guide learning is less well understood. Using functional MRI, we examined human brain activity during a reward-based motor task where subjects learned to shape their hand trajectories through reinforcement feedback. We projected patterns of cortical and striatal functional connectivity onto a low-dimensional manifold space and examined how regions expanded and contracted along the manifold during learning. During early learning, we found that several sensorimotor areas in the dorsal attention network exhibited increased covariance with areas of the salience/ventral attention network and reduced covariance with areas of the default mode network (DMN). During late learning, these effects reversed, with sensorimotor areas now exhibiting increased covariance with DMN areas. However, areas in posteromedial cortex showed the opposite pattern across learning phases, with its connectivity suggesting a role in coordinating activity across different networks over time. Our results establish the neural changes that support reward-based motor learning and identify distinct transitions in the functional coupling of sensorimotor to transmodal cortex when adapting behavior.
... The ε-greedy method is a reinforcement learning algorithm designed to solve the multi-armed bandit (MAB) problem [18]. The MAB problem was introduced by [15], and it consists of an agent who is presented with a range of actions, commonly known as "arms", where each action generates a specific reward upon selection. ...
Chapter
One of the most promising alternatives to suppress epileptic seizures in drug-resistant and neurosurgery-refractory patients is using electro-electronic devices. By applying an appropriate pulsatile electrical stimulation, the process of ictogenesis can be quickly suppressed. However, in designing such stimulation devices, a common problem is defining suitable parameters such as pulse amplitude, duration, and frequency. In this work, we propose a machine learning technique based on the epsilon-greedy algorithm to optimize the pulse frequency which could prevent abnormal neuronal activity without exceeding energy usage for the stimulation. Five different simulations were carried out in order to evaluate the contribution of the energy consumption in determining the minimum frequency. The results show the efficacy of the proposed algorithm to search the minimum pulse frequency necessary to suppress epileptic seizures.
... where, y t k = i R π kis t α,i . Following a policy gradient approach to maximize the total reward obtained during the episode, it is possible to write the following loss function (see 19,27 ): ...
Article
Full-text available
Humans and animals can learn new skills after practicing for a few hours, while current reinforcement learning algorithms require a large amount of data to achieve good performances. Recent model-based approaches show promising results by reducing the number of necessary interactions with the environment to learn a desirable policy. However, these methods require biological implausible ingredients, such as the detailed storage of older experiences, and long periods of offline learning. The optimal way to learn and exploit world-models is still an open question. Taking inspiration from biology, we suggest that dreaming might be an efficient expedient to use an inner model. We propose a two-module (agent and model) spiking neural network in which “dreaming” (living new experiences in a model-based simulated environment) significantly boosts learning. Importantly, our model does not require the detailed storage of experiences, and learns online the world-model and the policy. Moreover, we stress that our network is composed of spiking neurons, further increasing the biological plausibility and implementability in neuromorphic hardware.
... Longer skills shorten the effective time horizon of the task by a factor of the average skill length, because the skill-based agent operates on an MDP where transitions are defined by the end of execution of a skill, which can comprise many low-level environment actions. By shortening the task time horizon, the learning efficiency of temporal-difference learning RL algorithms [72] can be improved by, for example, reducing value function bootstrapping error accumulation, as there are fewer timesteps between a sparse reward signal and the starting state. ...
Preprint
Full-text available
Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://www.jessezhang.net/projects/extract/.
... Through trial and error, the agent refines its policy to achieve optimal outcomes. Figure adapted from [51]. ...
Preprint
Full-text available
Global Navigation Satellite Systems (GNSS)-based positioning plays a crucial role in various applications, including navigation, transportation, logistics, mapping, and emergency services. Traditional GNSS positioning methods are model-based and they utilize satellite geometry and the known properties of satellite signals. However, model-based methods have limitations in challenging environments and often lack adaptability to uncertain noise models. This paper highlights recent advances in Machine Learning (ML) and its potential to address these limitations. It covers a broad range of ML methods, including supervised learning, unsupervised learning, deep learning, and hybrid approaches. The survey provides insights into positioning applications related to GNSS such as signal analysis, anomaly detection, multi-sensor integration, prediction, and accuracy enhancement using ML. It discusses the strengths, limitations, and challenges of current ML-based approaches for GNSS positioning, providing a comprehensive overview of the field.
... In RL, an agent learns to optimize its behaviors by interacting with the environment [24]. The environment is defined as a Markov decision process (MDP): (S, A, P, r, γ), where S is the state space, A is the action space, P : S × A × S → [0, 1] is the transition probability function, r : S × A × S → R is the reward function and γ ∈ (0, 1] is the discount factor, which determines the present value of future rewards. ...
Preprint
Full-text available
Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks. By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly, significantly streamlining traditional, manual penetration testing processes. The attack pathways revealed by the RL agent can provide valuable insights to the defense team, helping them identify network weak points and develop more resilient defensive measures. Experimental results on a 152-host example network confirm the effectiveness of the proposed approach, demonstrating the RL agent's capability to discover and orchestrate attacks on high-value targets while evading honeyfiles (decoy files strategically placed to detect unauthorized access).
... Formally, a policy π : S → A for an MDP maps states to actions. Reinforcement learning (RL) [11] is the process of learning an optimal policy π* that maximizes the cumulative discounted reward for this MDP. Additionally, a trajectory σ of an MDP given an initial state s_0 ∼ I and a policy π is defined accordingly as σ = (s_0 −a_0→ s_1 . . . ...
Preprint
Full-text available
Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things (IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbances in the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.
... The functionality of the software, according to the customer scenarios, generates results utilizing a specific performance roadmap (Abadeh 2019), e.g., cycle time and bottlenecks. The proposed performance learning in this paper can apply performance profiling at different levels of abstraction to satisfy run-time parameters by predicting the system performance using Reinforcement Learning (Sutton and Barto 2018; Kaelbling et al. 1996). This process can lead to a quality-driven intelligent software development approach, which is a promising goal in software development. ...
Article
Full-text available
In the rapidly evolving software development industry, the early identification of optimal design alternatives and accurate performance prediction are critical for developing efficient software products. This paper introduces a novel approach to software refinement, termed Reinforcement Learning-based Software Refinement (RLSR), which leverages Reinforcement Learning techniques to address this challenge. RLSR enables an automated software refinement process that incorporates quality-driven intelligent software development as an early decision-making strategy. By proposing a Q-learning-based approach, RLSR facilitates the automatic refinement of software in dynamic environments while optimizing the utilization of computational resources and time. Additionally, the convergence rate to an optimal policy during the refinement process is investigated. The results demonstrate that training the policy using throughput values leads to significantly faster convergence to optimal rewards. This study evaluates RLSR based on various metrics, including episode length, reward over time, and reward distributions on a running example. Furthermore, to illustrate the effectiveness and applicability of the proposed method, a comparative analysis is applied to three refinable software designs, such as the E-commerce platform, smart booking platform, and Web-based GIS transformation system. The comparison between Q-learning and the proposed algorithm reveals that the refinement outcomes achieved with the proposed algorithm are superior, particularly when an adequate number of learning steps and a comprehensive historical dataset are available. The findings emphasize the potential of leveraging reinforcement learning techniques for automating software refinement and improving the efficiency of the model-driven development process.
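For reference, the tabular Q-learning update underlying approaches like the one described above takes the following generic form; this sketch is illustrative only and is not the RLSR algorithm itself:

```python
# Generic tabular Q-learning update (illustrative only; not the RLSR algorithm).
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] initialized to 0
alpha, gamma = 0.1, 0.95               # learning rate and discount factor

def q_update(s, a, reward, s_next, actions):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
```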
... Then, the agent receives a numerical reward r_t = R(s_t, a_t, s_{t+1}) for taking action a_t in state s_t and landing in state s_{t+1}. The goal is to find a policy π* that maximizes the expected sum of discounted future rewards, given by R(t) = Σ_{i=t}^{T} γ^{i−t} r_i [30]. We used the RL algorithm soft actor-critic (SAC). ...
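Under the reconstructed formula above, the discounted return can be computed directly from a reward sequence; a minimal sketch (illustrative, not the cited implementation):

```python
# Discounted return R(t) = sum_{i=t}^{T} gamma**(i - t) * r_i, as reconstructed above.
def discounted_return(rewards, t, gamma=0.99):
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))

print(discounted_return([0.0, 1.0, 0.0, 2.0], t=1, gamma=0.9))  # 1.0 + 0.9**2 * 2.0
```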
Conference Paper
Cooking robots can enhance the home experience by reducing the burden of daily chores. However, these robots must perform their tasks dexterously and safely in shared human environments, especially when handling dangerous tools such as kitchen knives. This study focuses on enabling a robot to autonomously and safely learn food-cutting tasks. More specifically, our goal is to enable a collaborative robot or industrial robot arm to perform food-slicing tasks by adapting to varying material properties using compliance control. Our approach involves using Reinforcement Learning (RL) to train a robot to compliantly manipulate a knife, by reducing the contact forces exerted by the food items and by the cutting board. However, training the robot in the real world can be inefficient and dangerous, and can result in a lot of food waste. Therefore, we propose SliceIt!, a framework for safely and efficiently learning robot food-slicing tasks in simulation. Following a real2sim2real approach, our framework consists of collecting a small amount of real food-slicing data, calibrating our dual simulation environment (a high-fidelity cutting simulator and a robotic simulator), learning compliant control policies in the calibrated simulation environment, and finally, deploying the policies on the real robot.
... Efficient task distribution: Distributing tasks among team members in a dynamic and adaptive manner is essential for ensuring smooth project progress. Reinforcement learning algorithms can be employed to dynamically assign tasks to team members based on real-time project progress and individual performance metrics [11]. These algorithms learn from the outcomes of previous task assignments and continuously adapt their strategies to maximize overall project efficiency. ...
Article
Full-text available
The rapid advancements in artificial intelligence (AI) have transformed various aspects of software engineering, offering new opportunities for enhancing the efficiency, quality, and reliability of software development processes. This article presents a comprehensive review of the applications of AI across the software development lifecycle, focusing on the areas of planning, coding, testing, and deployment. The study explores how AI techniques, such as machine learning, natural language processing, and reinforcement learning, can augment human expertise, automate repetitive tasks, and improve decision-making in software development. The article also discusses the challenges associated with AI integration, including compatibility issues, increased complexity, ethical considerations, and the need for specialized skills. Furthermore, the review highlights future directions in AI-driven software engineering, such as explainable AI, federated learning, and collaborative human-AI development. By providing a systematic analysis of the current state and potential of AI in software engineering, this article aims to guide researchers, practitioners, and decision-makers in leveraging AI technologies to revolutionize software development practices and drive innovation in the field.
... RL has risen as a key computational paradigm in which intelligent agents are trained to make sequential decisions that maximize some notion of expected return [1]. It has been instrumental in solving complex dynamic problems across a wide range of applications, such as autonomous driving, robotics, aviation, and finance [2]–[5]. ...
Article
Full-text available
Reinforcement learning (RL) has revolutionized decision-making across a wide range of domains over the past few decades. Yet, deploying RL policies in real-world scenarios presents the crucial challenge of ensuring safety. Traditional safe RL approaches have predominantly focused on incorporating predefined safety constraints into the policy learning process. However, this reliance on predefined safety constraints poses limitations in dynamic and unpredictable real-world settings where such constraints may not be available or sufficiently adaptable. Bridging this gap, we propose a novel approach that concurrently learns a safe RL control policy and identifies the unknown safety constraint parameters of a given environment. Initializing with a parametric signal temporal logic (pSTL) safety specification and a small initial labeled dataset, we frame the problem as a bilevel optimization task, intricately integrating constrained policy optimization, using a Lagrangian-variant of the twin delayed deep deterministic policy gradient (TD3) algorithm, with Bayesian optimization for optimizing parameters for the given pSTL safety specification. Through experimentation in comprehensive case studies, we validate the efficacy of this approach across varying forms of environmental constraints, consistently yielding safe RL policies with high returns. Furthermore, our findings indicate successful learning of STL safety constraint parameters, exhibiting a high degree of conformity with true environmental safety constraints. The performance of our model closely mirrors that of an ideal scenario that possesses complete prior knowledge of safety constraints, demonstrating its proficiency in accurately identifying environmental safety constraints and learning safe policies that adhere to those constraints. A Python implementation of the algorithm can be found at https://github.com/SAILRIT/Concurrent-Learning-of-Control-Policy-and-Unknown-Constraints-in-Reinforcement-Learning.git .
... The knowledge obtained from experimentation is used to refine the fair scheduling strategies and machine learning models [31]. The system can be further improved with strategies such as hyperparameter tuning and reinforcement learning training. ...
Research
The effective deployment of computational resources is essential for data processing in the big data era. Fair scheduling plays a central role in striking this balance by ensuring equitable resource distribution. In the context of large-scale data processing, this paper examines the mathematical foundations and practical application of fair scheduling. We explore the details of choosing tools, devising strategies, and addressing practical difficulties. Through empirical evaluation and real-world use cases, we illustrate the practical impact of fair scheduling and highlight its significance in improving efficiency and resource utilisation in big data processing systems. Our primary focus is the practical implementation of fair scheduling within a big data framework. We detail the choice of suitable tools and technologies, examine the architecture of the selected framework, and describe the design and application of fair scheduling principles and algorithms. Throughout the paper, we offer insights into the challenges faced during implementation and the solutions devised to overcome them.
... State-of-the-art solutions to tackle complex control tasks generally rely on deep Reinforcement Learning (RL) [1,2]. While this approach has led to remarkable advances in machine learning, neural networks have well-known issues regarding data efficiency, explainability, and generalization [3]. ...
Preprint
Full-text available
Determining an optimal plan to accomplish a goal is a hard problem in realistic scenarios, which often comprise dynamic and causal relationships between several entities. Although traditionally such problems have been tackled with optimal control and reinforcement learning, a recent biologically-motivated proposal casts planning and control as an inference process. Among these new approaches, one is particularly promising: active inference. This new paradigm assumes that action and perception are two complementary aspects of life whereby the role of the former is to fulfill the predictions inferred by the latter. In this study, we present an effective solution, based on active inference, to complex control tasks. The proposed architecture exploits hybrid (discrete and continuous) processing to construct a hierarchical and dynamic representation of the self and the environment, which is then used to produce a flexible plan consisting of subgoals at different temporal scales. We evaluate this deep hybrid model on a non-trivial task: reaching a moving object after having picked a moving tool. This study extends past work on planning as inference and advances an alternative direction to optimal control and reinforcement learning.
Thesis
Full-text available
In this paper, the focus is on how Alexa Skills are integrated to enhance the interaction of a smart Air Conditioning system with users in a laboratory located at ESPOL University. Additionally, this system incorporates an AI component using the Q-learning technique, allowing for optimal balancing of user comfort with the defined states in the project, while also prioritizing energy consumption.
Article
Full-text available
Ethanol production is a significant industrial bioprocess for energy. The primary objective of this study is to control the process reactor temperature to obtain the desired product, that is, ethanol. Advanced model‐based control systems face challenges due to model‐process mismatch, but Reinforcement Learning (RL) is a class of machine learning that can help by allowing agents to learn policies directly from the environment. Hence, an RL algorithm called twin delayed deep deterministic policy gradient (TD3) is employed. The control of reactor temperature is divided into two categories, namely unconstrained and constrained control approaches. TD3 with various reward functions is tested on a nonlinear bioreactor model. The results are compared with an existing popular RL algorithm, namely the deep deterministic policy gradient (DDPG) algorithm, using a performance measure such as mean squared error (MSE). In the unconstrained control of the bioreactor, the TD3-based controller designed with the integral absolute error (IAE) reward yields a lower MSE of 0.22, whereas DDPG produces an MSE of 0.29. Similarly, in the constrained case, the TD3-based controller designed with the IAE reward yields a lower MSE of 0.38, whereas DDPG produces an MSE of 0.48. In addition, the TD3-trained agent successfully rejects disturbances in the input flow rate and inlet temperature, in addition to a setpoint change, with better performance metrics.
Article
Full-text available
Industry 4.0 integrates cloud computing, the Industrial Internet of Things (IIoT), and modern communication technologies within industrial automation systems. Various devices, with network requirements such as high reliability and low latency, rely on connectivity. The 5G and Beyond (B5G) software-defined architecture facilitates Network Function Virtualization (NFV), which is an essential solution for fulfilling these stringent demands. NFV allows for the implementation and control of Virtual Network Functions (VNFs) in dynamic network environments. VNF placement optimization has been extensively studied from the 5G perspective outside industrial environments, with a focus on minimizing delay and cost, increasing VNF reliability, and improving resource efficiency. However, the complex dynamics of the wireless channel in industrial environments have a considerable impact on the delay factors that are essential for optimizing the deployment of VNFs. This study focuses on modeling a Wireless Sensor Network (WSN)-based Industry 4.0 factory automation scenario in the mmWave band, formulating an optimization problem to minimize overall delay while considering the packet loss rate in the 5G industrial wireless channel. The optimization problem is formulated as a Markov Decision Process (MDP), and two Reinforcement Learning (RL)-based algorithms, AVP-Q and AVP-DQN, are proposed for optimizing the VNF placement. The proposed algorithms are extensively evaluated against the Value Iteration algorithm, which assumes a completely known MDP model, and against two other algorithms from the literature. The simulation results show that AVP-DQN outperforms the existing algorithms for this scenario by 39% and 22.6%, respectively, and its performance closely approaches that of the Value Iteration algorithm.
Article
Full-text available
While recent advanced military operational concepts require intelligent support for command and control, Reinforcement Learning (RL) has not been actively studied in the military domain. This study points out the limitations of RL for military applications based on a literature review and aims to improve the understanding of RL for military decision support under these limitations. Above all, the black-box characteristic of deep RL, in addition to complex simulation tools, makes the internal process difficult to understand. A scalable weapon selection RL framework is built that can be solved either in a tabular form or with a neural network. Converting the Deep Q-Network (DQN) solution to a tabular form makes it easier to compare the result to the Q-learning solution. Furthermore, rather than selectively using one or two RL models as before, RL models are categorized as actors and critics and systematically compared. A random agent, Q-learning and DQN agents as critics, a Policy Gradient (PG) agent as an actor, and Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) agents as actor-critic approaches are designed, trained, and tested. The performance results show that the trained DQN and PPO agents are the best decision-support candidates for the weapon selection RL framework.
Book
This book constitutes the refereed proceedings of the 4th Latin American Workshop, LAWCN 2023, held in Envigado, Colombia, during November 28–30, 2023. The 8 full papers and 2 short papers included in this book were carefully reviewed and selected from 50 submissions. They were organized in topical sections as follows: artificial intelligence and machine learning; computational neuroscience; and brain-computer interfaces.
Conference Paper
The optimization of continuous action control tasks is a crucial step in deep reinforcement learning (DRL) applications. The goal is to identify optimal actions through the accumulation of experience, with DRL training agents to develop a policy that maximizes the cumulative rewards gained from decision-making in dynamic environments. Balancing exploration and exploitation is a crucial challenge in acquiring this policy. The Exploration Decay Policy (EDP) addresses this trade-off with a dynamic exploration noise strategy that adapts to the current training progress, enabling efficient exploration in the initial phases while gradually reducing exploration to focus on exploitation as training progresses. However, the fluctuating training stability across episodes in dynamic environments makes it challenging for exploitation policies to adapt accordingly. In this paper, we propose EDP to address the exploration–exploitation dilemma: the noise scale is dynamically modulated, gradually decreasing it during periods of high training stability to promote exploration, while reducing it to maintain exploitation during periods of low training stability. The study introduces the EDP-DDPG method, enhancing continuous control tasks in Box2D environments. EDP-DDPG outperforms the standard DDPG by achieving higher rewards and quicker convergence. Its success stems from dynamically adjusting the exploration noise every 25 episodes, balancing exploration and exploitation. This adaptive approach, reducing the noise by 10% every 25 episodes, evolves from random to strategic limb movements, optimizing policy exploitation and adaptability in dynamic settings.
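The 10%-every-25-episodes decay described in the abstract can be sketched as a simple noise schedule; the Gaussian action noise and parameter names below are assumptions made for illustration, not the authors' code:

```python
# Illustrative exploration-noise schedule in the spirit of the abstract:
# the noise scale is cut by 10% every 25 episodes (not the authors' implementation).
import numpy as np

def noise_scale(episode, initial_scale=0.2, decay=0.9, interval=25):
    return initial_scale * decay ** (episode // interval)

def noisy_action(policy_action, episode, low=-1.0, high=1.0):
    a = policy_action + np.random.normal(0.0, noise_scale(episode), size=np.shape(policy_action))
    return np.clip(a, low, high)      # keep the perturbed action within bounds
```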
Article
The significant increase in civil aircraft traffic has led to frequent congestion and various negative outcomes within the terminal area. To mitigate these challenges, it is essential to study and address the aircraft landing problem (ALP). This problem involves arranging the sequence of aircraft on airport runways and scheduling their operations while accounting for multiple operational constraints. Air traffic control is an ongoing optimization problem with multi-objective challenges. In this survey paper, we provide a comprehensive review of the latest studies on the ALP, covering a wide variety of methodologies, including mixed-integer programming, dynamic programming, heuristics, meta-heuristics, and reinforcement learning techniques. This survey also proposes a classification of the ALP and summarizes the identified constraints, objectives, the frequency of their occurrence, and the datasets used by these studies, covering one or multiple runways. In addition, based on the existing literature, we discuss the advantages of these approaches, such as delay and pollution minimization, and their disadvantages, such as poor scalability with problem size and restriction to the single-runway setting.
Chapter
A common research protocol in cognitive neuroscience is to train subjects to perform deliberately designed experiments while recording brain activity, with the aim of understanding the brain mechanisms underlying cognition. However, how the results of this protocol of research can be applied in technology is seldom discussed. Here, I review the studies on time processing of the brain as examples of this research protocol, as well as two main application areas of neuroscience (neuroengineering and brain-inspired artificial intelligence). Time processing is a fundamental dimension of cognition, and time is also an indispensable dimension of any real-world signal to be processed in technology. Therefore, one may expect that the studies of time processing in cognition profoundly influence brain-related technology. Surprisingly, I found that the results from cognitive studies on timing processing are hardly helpful in solving practical problems. This awkward situation may be due to the lack of generalizability of the results of cognitive studies, which are under well-controlled laboratory conditions, to real-life situations. This lack of generalizability may be rooted in the fundamental unknowability of the world (including cognition). Overall, this paper questions and criticizes the usefulness and prospect of the abovementioned research protocol of cognitive neuroscience. I then give three suggestions for future research. First, to improve the generalizability of research, it is better to study brain activity under real-life conditions instead of in well-controlled laboratory experiments. Second, to overcome the unknowability of the world, we can engineer an easily accessible surrogate of the object under investigation, so that we can predict the behavior of the object under investigation by experimenting on the surrogate. Third, the paper calls for technology-oriented research, with the aim of technology creation instead of knowledge discovery.
Chapter
Reinforcement learning (RL) is a very powerful tool for sequential decision making. It has already been a vital component in solving grand challenge problems like the "protein folding problem." The RL model is well suited for communication networks because it can learn complex behaviors in the target environment even when information such as channel conditions and user statistics is unavailable. Our focus here is on adversarial RL because of its ability to capture scenarios where the reward and/or the dynamics of the environment change over time, possibly in an adversarial manner. In order to perform well in such systems, the agent, e.g., the defender, needs to adapt its policy over time. However, the agent may not be able to afford frequent policy changes, especially when the reward and/or the dynamics of the environment change rapidly, e.g., due to the energy limitations that edge devices face when changing policies. Thus, in addition to the standard metric of losses, switching costs, which capture the costs of changing policies, are regarded as a critical metric in RL. This chapter introduces state-of-the-art results on both bandits and RL with switching costs, their importance for network security under limited defender resources, and interesting directions for future work.
Chapter
Software Defined Networking (SDN) is an innovative option with great potential for the future growth of the Internet. It enhances the flexibility and transparency of centralized and controlled networks. However, these advantages come with some inherent vulnerabilities and substantial risks. SDN environments, due to the vulnerabilities between the control plane and the data plane, are particularly susceptible to Denial of Service (DoS) attacks, which pose a significant threat. To address these security issues, it is essential to incorporate an Intrusion Detection System (IDS) as a security module within the SDN framework. The IDS enables continuous monitoring of the SDN environment, performs comprehensive traffic inspection, and generates alerts when malicious activities occur in the SDN network. In this chapter, we will review state-of-the-art machine learning and deep learning-based intrusion detection methods for SDN. We will focus on specific network intrusion methods designed for SDN and evaluate their effectiveness and efficiency based on metrics such as accuracy, processing time, overhead, and false positive rate. Through real testbed evaluations, we aim to identify the most suitable intrusion detection methods for SDN environments.
Preprint
Full-text available
The adaptive fitness of an organism in its ecological niche is highly reliant upon its ability to associate environmental or internal stimuli with behavioral responses through reinforcement. This simple but powerful approach has been successfully applied in computational neuroscience and reinforcement learning to model both human and animal behaviors. However, a critical challenge faced by these models is the credit assignment problem, in which the association between past behavior and a delayed reinforcement signal must be considered. In this paper, we reformulate the credit assignment problem to consider how past stimuli are linked to adaptive behavioral responses in a simple neuronal circuit. We propose a biologically plausible variant of a spiking neural network, which can model a wide variety of behavioral, learning, and evolutionary phenomena. Our model suggests one fundamental mechanism for associating a behavior with an adaptive response that may be used in the brains of both simple and complex organisms. Our results show the model's versatility and biological plausibility in a number of tasks related to classical and operant conditioning, including behavioral chaining. Additionally, we present simulations to demonstrate how adaptive behaviors such as reflexes and simple category detection may have evolved using our model. Our results indicate the potential for further modifications and extensions of our model to replicate more sophisticated and biologically plausible behavioral, learning, and intelligence phenomena found throughout the animal kingdom.
Article
Full-text available
We develop a theoretical framework that shows how mesencephalic dopamine systems could distribute to their targets a signal that represents information about future expectations. In particular, we show how activity in the cerebral cortex can make predictions about future receipt of reward and how fluctuations in the activity levels of neurons in diffuse dopamine systems above and below baseline levels would represent errors in these predictions that are delivered to cortical and subcortical targets. We present a model for how such errors could be constructed in a real brain that is consistent with physiological results for a subset of dopaminergic neurons located in the ventral tegmental area and surrounding dopaminergic neurons. The theory also makes testable predictions about human choice behavior on a simple decision-making task. Furthermore, we show that, through a simple influence on synaptic plasticity, fluctuations in dopamine release can act to change the predictions in an appropriate manner.
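The prediction-error signal described above is commonly formalized as a temporal-difference error; the following tabular TD(0) sketch is a textbook illustration, not the paper's biophysical model:

```python
# TD(0) sketch: delta = r + gamma * V(s') - V(s) plays the role of the
# dopamine-like prediction error described above (illustrative, not the paper's model).
from collections import defaultdict

V = defaultdict(float)                 # state-value estimates
alpha, gamma = 0.1, 0.95               # learning rate and discount factor

def td_update(s, reward, s_next):
    delta = reward + gamma * V[s_next] - V[s]   # prediction error
    V[s] += alpha * delta
    return delta
```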
Article
A partially observed Markov decision process (POMDP) is a generalization of a Markov decision process that allows for incomplete information regarding the state of the system. The significant applied potential for such processes remains largely unrealized, due to an historical lack of tractable solution methodologies. This paper reviews some of the current algorithmic alternatives for solving discrete-time, finite POMDPs over both finite and infinite horizons. The major impediment to exact solution is that, even with a finite set of internal system states, the set of possible information states is uncountably infinite. Finite algorithms are theoretically available for exact solution of the finite horizon problem, but these are computationally intractable for even modest-sized problems. Several approximation methodologies are reviewed that have the potential to generate computationally feasible, high precision solutions.
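The "information state" referred to above is typically the belief, a probability distribution over hidden states updated by Bayes' rule, which is why the information-state space is continuous even when the underlying state set is finite. A minimal sketch, assuming known transition and observation models T and O (hypothetical names and data layout):

```python
# Belief (information-state) update for a POMDP:
#   b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)
# Illustrative sketch assuming known models T[s][a][s_next] and O[a][s_next][o].
def belief_update(b, a, o, states, T, O):
    new_b = {}
    for s_next in states:
        pred = sum(T[s][a][s_next] * b[s] for s in states)   # prediction step
        new_b[s_next] = O[a][s_next][o] * pred               # correction step
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()} if norm > 0 else new_b
```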
Article
A comprehensive theory of cerebellar function is presented, which ties together the known anatomy and physiology of the cerebellum into a pattern-recognition data processing system. The cerebellum is postulated to be functionally and structurally equivalent to a modification of the classical Perceptron pattern-classification device. It is suggested that the mossy fiber → granule cell → Golgi cell input network performs an expansion recoding that enhances the pattern-discrimination capacity and learning speed of the cerebellar Purkinje response cells. Parallel fiber synapses of the dendritic spines of Purkinje cells, basket cells, and stellate cells are all postulated to be specifically variable in response to climbing fiber activity. It is argued that this variability is the mechanism of pattern storage. It is demonstrated that, in order for the learning process to be stable, pattern storage must be accomplished principally by weakening synaptic weights rather than by strengthening them.
Article
A translation-invariant back-propagation network is described that performs better than a sophisticated continuous acoustic parameter hidden Markov model on a noisy, 100-speaker confusable vocabulary isolated word recognition task. The network's replicated architecture permits it to extract precise information from unaligned training patterns selected by a naive segmentation rule.
Conference Paper
We present a characterization of heuristic evaluation functions which unifies their treatment in single-agent problems and two-person games. The central result is that a useful heuristic function is one which determines the outcome of a search and is invariant along a solution path. This local characterization of heuristics can be used to predict the effectiveness of given heuristics and to automatically learn useful heuristic functions for problems. In one experiment, a set of relative weights for the different chess pieces was automatically learned. 1. Introduction. Consider the following anomaly. The Manhattan distance heuristic for the Fifteen Puzzle is computed by measuring the distance along the two-dimensional grid of each tile from its current position to its goal position, and summing these values for each tile. Manhattan distance is a very effective heuristic function for solving the Fifteen Puzzle [4]. A completely analogous heuristic can be defined in three dimensions for Rubik's Cube: for each individual movable piece of the cube, count the number of twists required to bring it to its goal position and orientation, and sum these values for each component. Three-dimensional Manhattan distance, however, is effectively worthless as a heuristic function for Rubik's Cube [5]. Even though Rubik's Cube is similar to the Fifteen Puzzle, the two heuristics are virtually identical, and in both cases the goal is achieved when the value of the heuristic is minimized, the heuristic is very effective in one case and useless in the other.
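The Fifteen Puzzle heuristic described in this abstract, i.e., the per-tile grid distances summed over all tiles, can be written compactly; the sketch below is an illustration under the standard tile-numbering convention, not code from the paper:

```python
# Manhattan-distance heuristic for the Fifteen Puzzle: sum, over all tiles,
# of the grid distance from each tile's current cell to its goal cell.
def manhattan_distance(board, size=4):
    """board is a tuple of length size*size; 0 denotes the blank and is ignored."""
    total = 0
    for idx, tile in enumerate(board):
        if tile == 0:
            continue
        goal = tile - 1                              # goal index of this tile
        total += abs(idx // size - goal // size) + abs(idx % size - goal % size)
    return total

# Example: tile 15 displaced by one cell from the solved configuration.
print(manhattan_distance((*range(1, 15), 0, 15)))    # -> 1
```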
Article
Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working with anything but a fixed number of independent random variables. A major advance now appears to be in the making with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves.
Article
In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic. 1 INTRODUCTION. No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that just...