Fig 7. Definition of state space.

Source publication
Article
Full-text available
Reinforcement learning (RL) has been widely used as a mechanism for autonomous robots to learn state-action pairs by interacting with their environment. However, most RL methods usually suffer from slow convergence when deriving an optimum policy in practical applications. To solve this problem, a stochastic shortest path-based Q-learning (SSPQL) i...
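The source article builds on standard tabular Q-learning; a minimal sketch of that baseline update is given below for reference. The environment interface (reset/step/actions) and the parameter values are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Baseline tabular Q-learning; env is assumed to expose reset(), step(a), and actions."""
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[(s_next, a_)] for a_ in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```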

Similar publications

Article
Full-text available
Differential evolution (DE) is a simple yet efficient stochastic search approach for numerical optimization. However, it tends to suffer from slow convergence when tackling complicated problems. In addition, its search ability is significantly influenced by its control parameters. To improve the performance of the basic DE, this paper proposes a se...
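For context, a minimal sketch of the basic DE/rand/1/bin scheme that such work builds on is shown below; the fixed F and CR values are illustrative assumptions, whereas the cited paper adapts its control parameters during the search.

```python
import numpy as np

def differential_evolution(f, bounds, pop_size=30, F=0.5, CR=0.9, generations=200, seed=0):
    """Basic DE/rand/1/bin with fixed control parameters (the cited paper adapts them)."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    lo, hi = np.array(bounds).T
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    fitness = np.array([f(x) for x in pop])

    for _ in range(generations):
        for i in range(pop_size):
            # mutation: combine three distinct individuals different from i
            r1, r2, r3 = rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), lo, hi)
            # binomial crossover with at least one gene taken from the mutant
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # greedy selection between trial and target vectors
            ft = f(trial)
            if ft <= fitness[i]:
                pop[i], fitness[i] = trial, ft
    return pop[fitness.argmin()], fitness.min()

# usage: minimize the sphere function in 5 dimensions
best_x, best_f = differential_evolution(lambda x: float(np.sum(x**2)), [(-5, 5)] * 5)
```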

Citations

... In this section, we will briefly describe a reinforcement learning algorithm that is based on a typical Q-learning method (see [10], [14]). For more details concerning similar applications of Q-learning, see [15] or [16]. The discussed algorithm is a model-free reinforcement learning technique; it was chosen for this paper because it is transparent to analysis and allows the idea behind our approach to be illustrated clearly. ...
... Action model-based approaches use prior experiences to construct a model that mimics the behavior of the world. The models aim to find optimal future actions via dynamic programming [61] or stochastic shortest path [62], which can then be used to provide synthetic experiences. Depending on the underlying learner, these experiences can be used to perform additional refinement (e.g., additional value iteration steps in a Q-learning framework [63]) or to guide the learner toward better solutions [62]. Other approaches aim to improve learning speed and performance by introducing "imagined" experiences to the training data provided to the learner [64]. ...
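To make the idea of synthetic experiences concrete, here is a minimal Dyna-style sketch in which a learned action model replays remembered transitions to refine the Q-values; the environment interface and parameters are assumptions for illustration, not the specific methods of [61]-[64].

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=200, planning_steps=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Dyna-style Q-learning: each real transition trains both the Q-table and a
    deterministic action model, which then replays synthetic experiences."""
    Q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state)

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    def update(s, a, r, s_next):
        best_next = max(Q[(s_next, a_)] for a_ in env.actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            update(s, a, r, s_next)          # learn from the real experience
            model[(s, a)] = (r, s_next)      # remember it in the action model
            for _ in range(planning_steps):  # replay synthetic experiences from the model
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                update(ps, pa, pr, ps_next)
            s = s_next
    return Q
```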
Thesis
Full-text available
As autonomous systems are deployed in increasingly complex and uncertain environments, safe, accurate, and robust feedback control techniques are required to ensure reliable operation. Accurate trajectory tracking is essential to complete a variety of tasks, but this may be difficult if the system’s dynamics change online, e.g., due to environmental effects or hardware degradation. As a result, uncertainty mitigation techniques are also necessary to ensure safety and accuracy. This problem is well suited to a receding-horizon optimal control formulation via Nonlinear Model Predictive Control (NMPC). NMPC employs a nonlinear model of the plant dynamics to compute non-myopic control policies, thereby improving tracking accuracy relative to reactive approaches. This formulation ensures constraints on the dynamics are satisfied and can compensate for uncertainty in the state and dynamics model via robust and adaptive extensions. However, existing NMPC techniques are computationally expensive, and many operating domains preclude reliable, high-rate communication with a base station. This is particularly difficult for small, agile systems, such as micro air vehicles, that have severely limited computation due to size, weight, and power restrictions but require high-rate feedback control to maintain stability. Therefore, the system must be able to operate safely and reliably with typically limited onboard computational resources. In this thesis, we propose a series of non-myopic, computationally-efficient, feedback control strategies that enable accurate and reliable operation in the presence of unmodeled system dynamics and state uncertainty. The key concept underlying these techniques is the reuse of past experiences to reduce online computation and enhance control performance in novel scenarios. These experiences inform an online-updated estimate of the system dynamics model and the choice of controller to optimize performance for a given scenario. We present a set of simulation and experimental studies with a small aerial robot operating in windy environments to assess the performance of the proposed control methodologies. These results demonstrate that leveraging past experiences to inform feedback control yields high-rate, constrained, robust-adaptive control and enables the deployment of predictive control techniques on systems with severe computational constraints.
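As a rough illustration of the receding-horizon idea underlying NMPC, the sketch below solves a small finite-horizon tracking problem at every step and applies only the first control input; the double-integrator model, quadratic cost, horizon, and bounds are illustrative assumptions and omit the robust/adaptive and experience-reuse machinery described in the thesis.

```python
import numpy as np
from scipy.optimize import minimize

# illustrative linear double-integrator model; real NMPC would use the nonlinear plant model
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([0.5 * dt**2, dt])

def rollout_cost(u_seq, x0, x_ref, horizon):
    """Quadratic tracking cost of one candidate control sequence."""
    x, cost = x0.copy(), 0.0
    for u in u_seq[:horizon]:
        x = A @ x + B * u
        cost += np.sum((x - x_ref) ** 2) + 0.01 * u**2
    return cost

def mpc_step(x0, x_ref, horizon=10):
    """Solve the finite-horizon problem and return only the first control input."""
    res = minimize(rollout_cost, np.zeros(horizon), args=(x0, x_ref, horizon),
                   method="L-BFGS-B", bounds=[(-1.0, 1.0)] * horizon)
    return res.x[0]

# receding-horizon loop: re-plan at every step, apply only the first input
x, x_ref = np.array([0.0, 0.0]), np.array([1.0, 0.0])
for _ in range(50):
    u = mpc_step(x, x_ref)
    x = A @ x + B * u
```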
... } can converge to the optimal values after a limited number of exploration steps [17], with π*(B) denoting the corresponding optimal strategy. Thus the action policy of the BMDP is derived, ...
Conference Paper
In cognitive radio networks (CRNs), TCP goodput is one of the key metrics used to measure performance. However, most existing research efforts on TCP performance improvement have two weaknesses. First, most of them only consider the underlying parameters to optimize physical-layer performance, while TCP performance is neglected. Second, they are largely formulated as a Markov Decision Process (MDP), which requires complete knowledge of the network and cannot be directly applied to distributed CRNs. To solve the above problems, a Q-BMDP algorithm is proposed in this paper: each user in a CRN autonomously decides the modulation type and transmitting power at the PHY layer, and the channels to access at the MAC layer, in order to find the best TCP goodput. Because of perception errors of the environment, this issue is formulated as a Partially Observable Markov Decision Process (POMDP), which is then converted to a belief-state MDP, with Q-value iteration used to find the optimal strategy. Simulation results show that the network can learn the optimal strategy and effectively improve TCP goodput in a dynamic wireless network.
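The POMDP-to-belief-MDP conversion mentioned above rests on a Bayesian belief update; a minimal numpy sketch of that standard update is shown below, with an illustrative toy model (the matrices are assumptions, not taken from the paper).

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayesian belief update: b'(s') is proportional to O[a][s', o] * sum_s T[a][s, s'] * b(s).

    b : belief vector over states, shape (S,)
    T : T[a][s, s'] transition probabilities, shape (A, S, S)
    O : O[a][s', o] observation probabilities, shape (A, S, Z)
    """
    predicted = b @ T[a]             # predict: P(s' | b, a)
    updated = O[a][:, o] * predicted # correct with the observation likelihood
    return updated / updated.sum()   # normalize (assumes the observation has nonzero probability)

# usage with a hypothetical 2-state, 1-action, 2-observation toy model
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = np.array([0.5, 0.5])
b_next = belief_update(b, a=0, o=1, T=T, O=O)
```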
... The classification process is the last and perhaps the most important step of a recognition system. Combining learning machines [1][2][3][4][5] in general, and combining classifiers [6][7][8] in particular, is a widely known approach to improving classification performance. In order to enhance the classification accuracy, we employed the approach of combining classifiers. ...
Article
Full-text available
A modified version of the Boosted Mixture of Experts (BME) is presented in this paper. While previous related works, namely BME, attempt to improve performance by incorporating complementary features of a hybrid combining framework, they have some drawbacks. Analyzing the problems of previous approaches suggested several modifications, which have led us to propose a new method called Boost-wise Pre-loaded Mixture of Experts (BPME). We present a modification of the pre-loading (initialization) procedure of ME that addresses the previous problems and overcomes them by employing a two-stage pre-loading procedure. In this approach, both error and confidence measures are used as the difficulty criteria in boost-wise partitioning of the problem space.
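As a simple illustration of classifier combination in general (not the BPME pre-loading procedure itself), a minimal weighted soft-voting combiner might look as follows; the weighting scheme is an assumption for illustration.

```python
import numpy as np

def soft_vote(probabilities, weights=None):
    """Combine per-classifier class-probability matrices by weighted averaging.

    probabilities : list of arrays, each of shape (n_samples, n_classes)
    weights       : per-classifier weights (e.g., validation accuracy); uniform if None
    """
    P = np.stack(probabilities)               # (n_classifiers, n_samples, n_classes)
    w = np.ones(len(P)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    combined = np.tensordot(w, P, axes=1)     # weighted average over classifiers
    return combined.argmax(axis=1)            # predicted class indices
```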
... We proved that RVI (MDP case) for our problem converges in span; note that RVI, which is more stable than VI, has been shown to converge via other methods, but not in the span (in [20] via Lyapunov functions, in [5] via Cauchy sequences, and in [42] via perturbation transformations). We extended a key SSP result for MDPs from [5], used widely in the literature (see, e.g., [34]), to the SMDP case and presented a two-time-scale VI algorithm which exploits this result to solve the SMDP without discretization (conversion to an MDP). For RL, we showed that eigenvalues of the transition matrix can be exploited to show boundedness of a vanishing-discount procedure, and we also analyzed the undiscounted two-time-scale procedure. ...
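For reference, a minimal sketch of standard relative value iteration (RVI) for an average-reward MDP is shown below; it illustrates the plain algorithm only, not the span-based convergence argument or the SMDP two-time-scale variant discussed in the context, and the data layout is an assumption.

```python
import numpy as np

def relative_value_iteration(P, R, ref_state=0, tol=1e-8, max_iter=10_000):
    """Standard RVI for an average-reward MDP.

    P : P[a][s, s'] transition probabilities, shape (A, S, S)
    R : R[a][s] expected one-step rewards, shape (A, S)
    Returns the bias vector h and the gain (average reward) estimate rho.
    """
    n_actions, n_states, _ = P.shape
    h = np.zeros(n_states)
    for _ in range(max_iter):
        q = R + P @ h                # Bellman backup for every state-action pair, shape (A, S)
        h_new = q.max(axis=0)        # maximize over actions
        rho = h_new[ref_state]       # subtract the reference state's value to keep iterates bounded
        h_new = h_new - rho
        if np.abs(h_new - h).max() < tol:
            return h_new, rho
        h = h_new
    return h, rho
```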
Article
Full-text available
We develop the theory for Markov and semi-Markov control using dynamic programming and reinforcement learning, in which a form of semi-variance that computes the variability of rewards below a pre-specified target is penalized. The objective is to optimize a function of the rewards and risk, where risk is penalized. Penalizing variance, which is popular in the literature, has some drawbacks that can be avoided with semi-variance. Keywords: relative value iteration, semi-Markov control, semi-variance, stochastic shortest path problem, target-sensitive
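For concreteness, the target semi-variance penalized in such formulations is commonly defined as below, with τ the pre-specified target and λ a risk-penalty weight; this is the standard definition, stated here as an illustration rather than the paper's exact objective.

```latex
\[
\operatorname{SV}_{\tau}(R) \;=\; \mathbb{E}\!\left[\bigl(\max\{\tau - R,\, 0\}\bigr)^{2}\right],
\qquad
\text{objective:}\quad \max_{\pi}\; \mathbb{E}_{\pi}[R] \;-\; \lambda\, \operatorname{SV}_{\tau}(R).
\]
```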
Conference Paper
Routing in uncertain environments is challenging as it involves a number of contextual elements, such as different environmental conditions (forecast realizations with varying spatial and temporal uncertainty), changes in mission goals while en route, and asset status. In this paper, we use an approximate dynamic programming method with Q-factors to determine a cost-to-go approximation by treating the weather forecast realization information as a stochastic state. These types of algorithms take a large amount of offline computation time to determine the cost-to-go approximation, but once obtained, the online route recommendation is nearly instantaneous and several orders of magnitude faster than previously proposed ship routing algorithms. The proposed algorithm is robust to the uncertainty present in the weather forecasts. We compare this algorithm to a well-known shortest path algorithm and apply the approach to a real-world shipping tragedy using weather forecast realizations available prior to the event.
Article
In a cognitive radio network (CRN), TCP end-to-end throughput is one of the key metrics used to measure its performance. However, most existing research efforts devoted to TCP performance improvement have two weaknesses. First, most of them only consider the underlying parameters to optimize physical-layer performance, while TCP performance is neglected. Second, they are largely formulated as a Markov decision process (MDP), which requires complete knowledge of the network and cannot be directly applied to CRNs. To solve the above problems, a Q-BMDP algorithm is proposed in this paper. Each user in the CRN combines the modulation type and transmitting power at the physical layer, the access channels at the media access control layer, and the TCP congestion control factor to maximize the TCP throughput. Because of perception errors of the environment, this issue is formulated as a partially observable Markov decision process (POMDP), which is then converted to a belief-state MDP, with Q-value iteration used to find an approximately optimal strategy. Simulation and analysis results show that the proposed algorithm approximately converges to the optimal strategy within a maximum error bound and can effectively improve TCP throughput in a dynamic wireless network under the premise of limited power consumption.
Article
The shortest-path searching algorithm must not only find a global solution to the destination, but also solve the turn penalty problem (TPP) in an urban road transportation network (URTN). Although the Dijkstra algorithm (DA), as a representative node-based algorithm, secures a global solution to the shortest path search (SPS) in the URTN by visiting all possible paths to the destination, the DA solves neither the TPP nor the slow execution speed problem (SEP), because it must search for the temporary minimum-cost node. Potts and Oliver solved the TPP by changing the visiting unit from a node to a link in a tree-building algorithm like the DA. The Multi Tree Building Algorithm (MTBA), classified as a representative Link-Based Algorithm (LBA), does not eliminate the SEP because the MTBA must search many of the origin and destination links, as well as the candidate links, in order to find the SPS. In this paper, we propose a new Link-Based Single Tree Building Algorithm that reduces the SEP of the MTBA by applying a breaking rule to the LBA, and we prove its usefulness by comparing the proposed algorithm with other algorithms, such as the node-based DA and the link-based MTBA, in terms of error rates and execution speeds.
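A minimal sketch of the general link-based idea is given below: the search state is a directed link rather than a node, so a turn penalty can be charged on each link-to-link transition, and the search stops once a destination link is settled (an early-exit analogue of a breaking rule). The data structures are illustrative assumptions, not the article's single-tree-building algorithm.

```python
import heapq
from collections import defaultdict

def link_based_shortest_path(links, turn_penalty, origin_link, dest_links):
    """Dijkstra over directed links instead of nodes, so a penalty can be charged
    when turning from one link onto the next.

    links        : dict link_id -> (from_node, to_node, cost)
    turn_penalty : dict (in_link, out_link) -> extra cost (0 if absent)
    """
    out_of = defaultdict(list)                  # index outgoing links by their entry node
    for lid, (u, v, c) in links.items():
        out_of[u].append(lid)

    dist = {origin_link: links[origin_link][2]}
    pred = {origin_link: None}
    heap = [(dist[origin_link], origin_link)]
    while heap:
        d, lid = heapq.heappop(heap)
        if d > dist.get(lid, float("inf")):
            continue
        if lid in dest_links:                   # stop at the first settled destination link
            path = []
            while lid is not None:
                path.append(lid)
                lid = pred[lid]
            return list(reversed(path)), d
        _, v, _ = links[lid]
        for nxt in out_of[v]:                   # relax link-to-link transitions with turn penalties
            c = links[nxt][2] + turn_penalty.get((lid, nxt), 0.0)
            if d + c < dist.get(nxt, float("inf")):
                dist[nxt] = d + c
                pred[nxt] = lid
                heapq.heappush(heap, (dist[nxt], nxt))
    return None, float("inf")
```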
Article
This article demonstrates that Q-learning can be accelerated by appropriately specifying initial Q-values using a dynamic wave expansion neural network. In our method, the neural network has the same topography as the robot's workspace. Each neuron corresponds to a certain discrete state. Every neuron of the network will reach an equilibrium state according to the initial environment information. When the network is stable, the activity of a particular neuron denotes the maximum cumulative reward obtained by following the optimal policy from the corresponding state. The initial Q-values are then defined as the immediate reward plus the maximum cumulative reward obtained by following the optimal policy beginning at the succeeding state. In this way, we create a mapping between the known environment information and the initial values of the Q-table based on the neural network. The prior knowledge can be incorporated into the learning system, giving robots a better learning foundation. Results of experiments on a grid world problem show that neural network-based Q-learning enables a robot to acquire an optimal policy with better learning performance compared to conventional Q-learning and potential field-based Q-learning.
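A minimal sketch of this initialization idea is shown below, with plain value iteration on the known map standing in for the dynamic wave expansion network (an assumption made here for illustration): the resulting V* gives the maximum cumulative reward from each state, and the initial Q-table is the immediate reward plus the discounted value of the successor state.

```python
import numpy as np

def initialize_q_from_map(rewards, transitions, gamma=0.95, sweeps=500):
    """Initialize a Q-table from prior map knowledge: estimate V*(s) on the known
    environment (plain value iteration stands in for the wave expansion network),
    then set Q0(s, a) = r(s, a) + gamma * V*(s').

    rewards     : rewards[s, a] immediate reward, shape (S, A)
    transitions : transitions[s, a] -> successor state index, shape (S, A), deterministic
    """
    V = np.zeros(rewards.shape[0])
    for _ in range(sweeps):
        V = np.max(rewards + gamma * V[transitions], axis=1)  # Bellman optimality backup
    Q0 = rewards + gamma * V[transitions]                     # prior-knowledge Q-table
    return Q0
```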