Figure 6 - uploaded by Assaf Zeevi
Inaccuracy of certainty-equivalence learning in a slowly and unboundedly changing environment. Panels (a) and (b) show sample paths of the inaccuracy process {Δ_t} and the histogram of the inaccuracy in period 10,000, respectively, generated under the certainty-equivalence learning policy C in Example 4. On approximately 96% of the 2,000 sample paths, the estimate θ̂_10,000 is ε-accurate for ε = 0.05.

Source publication
Article
Full-text available
We consider a dynamic learning problem where a decision maker sequentially selects a control and observes a response variable that depends on the chosen control and an unknown sensitivity parameter. After every observation, the decision maker updates his or her estimate of the unknown parameter and uses a certainty-equivalence decision rule to determin...

Contexts in source publication

Context 1
... the environment is changing slowly as in Theorem 3, then we can also characterize the maximum and minimum possible growth rates of {J_t}, and prove that the effect of said change, which is given by the first term on the right-hand side of (4.6), becomes very small eventually. Figure 6 demonstrates the accuracy of the certainty-equivalence learning policy in Example 4, which satisfies the hypotheses of Theorem 3. Observing that the inaccuracy Δ_t becomes less than 0.05 on more than 95% of the sample paths, we can deduce that the estimate sequence {θ̂_t} is asymptotically ε-accurate for ε = 0.05 in this example. This is a significant improvement over the asymptotic inaccuracy of 0.20 observed in Example 1 (see Figure 2). ...
Context 2
... show sample paths of the inaccuracy process {Δ_t}, and the histogram of the inaccuracy in period 10,000, respectively, generated under the certainty-equivalence learning policy C in Example 4. On approximately 96% of the 2,000 sample paths, the estimate θ̂_10,000 is ε-accurate for ε = 0.05. Figure 6 demonstrates the accuracy of the certainty-equivalence learning policy in Example 4, which satisfies the hypotheses of Theorem 3. Observing that the inaccuracy Δ_t becomes less than 0.05 on more than 95% of the sample paths, we can deduce that the estimate sequence {θ̂_t} is asymptotically ε-accurate for ε = 0.05 in this example. This is a significant improvement over the asymptotic inaccuracy of 0.20 observed in Example 1 (see Figure 2). ...
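For readers who want to reproduce the flavour of this experiment, the sketch below simulates a certainty-equivalence learner that uses a rolling (limited-memory) least-squares estimate in a stylised, slowly drifting environment and reports the fraction of sample paths on which the terminal inaccuracy Δ_T falls below ε = 0.05. The linear response model, the target-tracking control, the drift schedule, and the window length are illustrative assumptions only; they are not Example 4 or the policy C from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (NOT Example 4): response y_t = theta_t * x_t + noise,
# a target-tracking certainty-equivalence control, a rolling least-squares
# estimator, and a slow random-walk drift in theta_t.
T, R = 10_000, 2_000          # horizon and number of sample paths
eps, window = 0.05, 200       # accuracy threshold and estimation-window length (assumed)
y_star, sigma = 1.0, 0.5      # target response level and noise level (assumed)

theta = np.ones(R)            # true sensitivity parameter on each path
theta_hat = np.full(R, 0.5)   # initial estimates
xbuf = np.zeros((window, R))  # circular buffers holding the last `window` observations
ybuf = np.zeros((window, R))
sxx = np.zeros(R)
sxy = np.zeros(R)

for t in range(1, T + 1):
    x = y_star / theta_hat                                # certainty-equivalence control
    y = theta * x + sigma * rng.standard_normal(R)        # observed response
    k = (t - 1) % window                                  # slot of the oldest observation
    sxx += x * x - xbuf[k] ** 2
    sxy += x * y - xbuf[k] * ybuf[k]
    xbuf[k], ybuf[k] = x, y
    theta_hat = np.clip(sxy / sxx, 0.1, 10.0)             # rolling-window least squares
    theta += 0.05 * rng.standard_normal(R) / t ** 0.75    # slow, persistent drift

delta_T = np.abs(theta_hat - theta)                       # inaccuracy at the terminal period
print(f"fraction of paths with Delta_T < {eps}: {np.mean(delta_T < eps):.3f}")
```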

Citations

... Concerns regarding the consistency of the learning process (specifically, whether the parameter estimation is unbiased) are mitigated by two factors: (i) there is no uninformative state, because ∂q(p, ξ)/∂ξ > 0 (see Keskin and Zeevi (2018)); and (ii) our policies differ from typical certainty-equivalence controls because, in our setting, multiple observations are available within each decision epoch. ...
Preprint
Full-text available
Problem definition: Online asset-selling businesses, encompassing the used-car and real-estate sectors, have experienced remarkable growth in recent years. Unlike general retailing businesses, whose operations decisions are made at the stock-keeping unit (SKU) level, asset selling operates at the individual asset-unit level. We develop a dynamic pricing modeling framework capturing salient operational characteristics that distinguish asset-selling platforms from general retailing (e.g., infrequent pricing within a finite time horizon) and enabling the platform to use its access to customers' online behavioral data to maximize the payoff of each individual asset unit. Methodology/results: We develop practical algorithms to solve the dynamic pricing problem, including a deterministic approximation (DA) algorithm that omits randomness in the customer arrival rate and two online learning algorithms, Posterior-Sampling (PS) and Maximum-A-Posteriori (MAP), that integrate learning of the idiosyncratic latent value of an asset with dynamic pricing decisions. To analyze the performance of these algorithms, we propose a new asymptotic regime suitable for the online asset-selling business context, one that scales up the customer demand arrival rate within a finite time horizon, and we obtain regret bounds. An extensive numerical study and a real-data calibrated case study verify our proposed algorithms' applicability and potential value. Managerial implications: The algorithms that integrate the learning of an asset's idiosyncratic latent value with dynamic pricing decisions, alongside our novel asymptotic analysis, provide a robust framework for data-driven decision-making and demonstrate that consumer behavior data is a strategic asset for asset-selling platforms in bringing business innovation to traditional industries.
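To make the posterior-sampling idea concrete, the following is a minimal sketch of the generic recipe (sample the latent value from the current posterior, price as if the sample were the truth, then update the posterior with the observed demand). The logistic demand model, the grid posterior, and every parameter value are hypothetical illustrations, not the model or the PS algorithm from this preprint.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-asset model: each epoch, n shoppers arrive and each buys
# with probability q(p, xi) = 1 / (1 + exp(-(xi - p))), where xi is the asset's
# unknown latent value. None of this is the preprint's actual specification.
def buy_prob(p, xi):
    return 1.0 / (1.0 + np.exp(-(xi - p)))

xi_grid = np.linspace(0.0, 10.0, 201)                    # discretised support of the posterior
posterior = np.full(xi_grid.size, 1.0 / xi_grid.size)    # uniform prior
price_grid = np.linspace(0.5, 9.5, 50)
xi_true, n_shoppers, n_epochs = 6.0, 30, 20

for epoch in range(n_epochs):
    # Posterior sampling: draw one xi from the posterior and price greedily for it.
    xi_sample = rng.choice(xi_grid, p=posterior)
    revenue = price_grid * buy_prob(price_grid, xi_sample)
    p = price_grid[np.argmax(revenue)]
    # Observe demand and update the grid posterior via Bayes' rule (binomial likelihood).
    sales = rng.binomial(n_shoppers, buy_prob(p, xi_true))
    q = buy_prob(p, xi_grid)
    likelihood = q ** sales * (1.0 - q) ** (n_shoppers - sales)
    posterior = posterior * likelihood
    posterior /= posterior.sum()
    print(f"epoch {epoch:2d}: price {p:4.2f}, sales {sales:2d}/{n_shoppers}")
```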
... This behaviour results in heavily unbalanced allocations in favour of one of the two arms (eventually fixing on selecting only a single arm), even when no underlying differences between arms exist. We emphasise that this is a well-studied phenomenon in MAB problems, referred to as incomplete learning (see e.g., Keskin and Zeevi 2018) and occurring when parameter estimates fail to converge to the true value. The main reason is insufficient exploration of the arms, although recent work has pointed to some consequences of the sequential nature of data collection (see Deshpande et al. 2018). ...
Article
Full-text available
Bandit algorithms such as Thompson sampling (TS) have been put forth for decades as useful tools for conducting adaptively randomised experiments. By skewing the allocation toward superior arms, they can substantially improve particular outcomes of interest for both participants and investigators. For example, they may use participants' ratings to continuously optimise their experience with a program. However, most bandit and TS variants are based on either binary or continuous outcome models, leading to suboptimal performance on rating-scale data. Guided by behavioural experiments we conducted online, we address this problem by introducing Multinomial-TS for rating scales. After assessing its improved empirical performance in scenarios with a unique optimal arm, we explore potential considerations (including the prior's role) for calibrating uncertainty and balancing arm allocation in scenarios with no unique optimal arm.
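The incomplete-learning behaviour described in the excerpt above is easy to reproduce numerically: with two statistically identical arms, a purely greedy (certainty-equivalence) rule often locks onto whichever arm happens to look better early on. The Bernoulli(0.5) arms and the tie-breaking rule below are illustrative assumptions, not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(2)
T, runs = 1_000, 2_000
shares = []                                   # fraction of pulls given to arm 0

for _ in range(runs):
    counts = np.array([1, 1])                 # one forced pull of each arm to start
    rewards = rng.binomial(1, 0.5, size=2).astype(float)
    for _ in range(T - 2):
        means = rewards / counts
        arm = int(np.argmax(means + 1e-9 * rng.random(2)))   # greedy, random tie-break
        counts[arm] += 1
        rewards[arm] += rng.binomial(1, 0.5)  # both arms have the same mean reward
    shares.append(counts[0] / T)

shares = np.array(shares)
print(f"runs where one arm got >90% of pulls: {np.mean((shares > 0.9) | (shares < 0.1)):.2f}")
```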
... Two key concepts in RL are exploration, which corresponds to learning via interactions with the environment, and exploitation, which corresponds to optimising the objective function given accumulated information (see [25]). They are at odds with each other, as learning the environment often requires acting suboptimally with respect to existing knowledge [18]. Thus it is crucial to develop effective exploration strategies and to optimally balance exploration and exploitation. ...
Preprint
This work uses the entropy-regularised relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning but, on the other hand, introduce bias by assigning a positive probability to non-optimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularisation. We study algorithms resulting from two entropy regularisation formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalises the divergence of policies between two consecutive episodes. We analyse the finite horizon continuous-time linear-quadratic (LQ) RL problem for which both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularisation, we prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
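As background on why entropy regularisation yields Gaussian relaxed policies in the LQ case, the display below records the standard Gibbs-form characterisation from the exploratory-control literature; the notation is generic and need not match this preprint's normalisation.

```latex
% Optimal relaxed policy under entropy regularisation with temperature \lambda > 0
% (generic notation; H is the Hamiltonian and v the value function):
\[
  \pi^{*}(u \mid x) \;\propto\; \exp\!\Big( \tfrac{1}{\lambda}\,
      H\big(x, u, \partial_x v(x), \partial_{xx} v(x)\big) \Big).
\]
% If H is quadratic and strictly concave in u, say H = -\tfrac{a}{2}u^{2} + b\,u + c
% with a > 0, completing the square gives a Gaussian relaxed policy
\[
  \pi^{*}(\cdot \mid x) \;=\; \mathcal{N}\!\big(\, b/a,\; \lambda/a \,\big),
\]
% whose mean is the classical optimal feedback and whose variance grows linearly
% with the exploration temperature \lambda.
```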
... This body of literature focuses on sound heuristic solutions, numerical analyses, and asymptotic performance bounds. More recently, Keskin and Zeevi (2018) studied a dynamic control problem that includes dynamic pricing as a special application. In their research, they focus on the incomplete learning phenomenon under the certainty-equivalence policy, and show that complete learning can be achieved by restricting the (rolling) window used for parameter estimation. ...
Article
Full-text available
A New Method for Dynamic Learning and Doing
For a large class of learning-and-doing problems, two processes are intertwined in the analysis: a forward process that updates the decision maker's belief or estimate of the unknown parameter, and a backward process that computes the expected future values. The mainstream literature focuses on the former process. In contrast, in "Dynamic Learning and Decision Making via Basis Weight Vectors," Hao Zhang proposes a new method based on pure backward induction on the continuation values created by feasible continuation policies. When the unknown parameter is a continuous variable, the method represents each continuation-value function by a vector of weights placed on a set of basis functions. The weight vectors that are potentially useful for the optimal solution can be found backward in time exactly (for very small problems) or approximately (for larger problems). A simulation study demonstrates that an approximation algorithm based on this method outperforms some popular algorithms in the linear contextual bandit literature when the learning horizon is short.
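To make the phrase "a vector of weights placed on a set of basis functions" concrete, the toy sketch below evaluates continuation values of the form V(θ) = w·φ(θ) and takes their upper envelope over a small set of candidate weight vectors. The basis and the candidate vectors are invented for illustration and are not taken from Zhang's paper.

```python
import numpy as np

# Hypothetical basis over the unknown parameter theta in [0, 1]:
# a constant, a linear term, and a quadratic term.
def phi(theta):
    return np.array([1.0, theta, theta ** 2])

# Candidate weight vectors, each representing the continuation value of one
# feasible continuation policy (numbers are made up for illustration).
W = np.array([
    [0.4, 0.1, 0.0],
    [0.1, 0.9, -0.3],
    [0.0, 0.2, 0.7],
])

def continuation_value(theta):
    # The optimal continuation value is the upper envelope over candidate policies.
    return float(np.max(W @ phi(theta)))

for theta in (0.1, 0.5, 0.9):
    print(theta, round(continuation_value(theta), 3))
```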
... This means that it suffices to analyse the second term, which is the (expected) regret that has been analysed in [5,16] assuming a self-exploration property of greedy policies (see Remark 3.1). However, the following example shows that these greedy policies in general do not guarantee exploration and, consequently, convergence to the optimal solution, which is often referred to as incomplete learning in the literature (see e.g., [21]). Example 1.1. ...
... They prove that if optimal controls of the true model automatically explore the parameter space, then a greedy least-squares algorithm with suitable initialisation admits a non-asymptotic logarithmic expected regret for LQ models [5], and an O(√N) expected regret for LC models [16]. Unfortunately, as shown in Example 1.1 and in [21], such a self-exploration property may not hold in general, even for LQ models. Furthermore, the learning algorithm studied here works for an arbitrary initialisation. ...
Preprint
Full-text available
We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to study finite-time horizon stochastic control problems with linear dynamics but unknown coefficients and convex, but possibly irregular, objective function. Using probabilistic representations, we study regularity of the associated cost functions and establish precise estimates for the performance gap between applying optimal feedback control derived from estimated and true model parameters. We identify conditions under which this performance gap is quadratic, improving the linear performance gap in recent work [X. Guo, A. Hu, and Y. Zhang, arXiv preprint, arXiv:2104.09311, (2021)], which matches the results obtained for stochastic linear-quadratic problems. Next, we propose a phase-based learning algorithm for which we show how to optimise the exploration-exploitation trade-off and achieve sublinear regrets in high probability and expectation. When the assumptions needed for the quadratic performance gap hold, the algorithm achieves an order $\mathcal{O}(\sqrt{N} \ln N)$ high-probability regret in the general case, and an order $\mathcal{O}((\ln N)^2)$ expected regret in the self-exploration case, over $N$ episodes, matching the best possible results from the literature. The analysis requires novel concentration inequalities for correlated continuous-time observations, which we derive.
... In our numerical analysis (see Section 7) we also include a lookahead-type policy. Some recent work has highlighted the effectiveness of simple policies in the context of dynamic learning, such as greedy algorithms (e.g., Bastani et al., 2020) and Certainty-Equivalence (e.g., Keskin and Zeevi, 2018). The greedy algorithm behaves well when exploration is expensive, whereas in our case it is free. ...
Preprint
Full-text available
We consider a decision maker who must choose an action in order to maximize a reward function that also depends on an unknown parameter Θ. The decision maker can delay taking the action in order to experiment and gather additional information on Θ. We model the decision maker's problem using a Bayesian sequential experimentation framework and use dynamic programming and diffusion-asymptotic analysis to solve it. To do so, we scale the problem so that the average number of experiments conducted per unit of time is large while the informativeness of each individual experiment is low. In this regime, we derive a diffusion approximation for the sequential experimentation problem, which provides a number of important insights about the nature of the problem and its solution. Our solution method also shows that the complexity of the problem grows only quadratically with the cardinality of the set of actions from which the decision maker can choose. We illustrate our methodology and results using a concrete application in the context of assortment selection and new product introduction. Specifically, we study the problem of a seller who wants to select an optimal assortment of products to launch into the marketplace and is uncertain about consumers' preferences. Motivated by emerging practices in e-commerce, we assume that the seller is able to use a crowdvoting system to learn these preferences before a final assortment decision is made. In this context, we undertake an extensive numerical analysis to assess the value of learning and demonstrate the effectiveness and robustness of the heuristics derived from the diffusion approximation.
... A myopic pricing policy without a learning mechanism may permanently get stuck at an uninformative choice, which provides no improvement in the quality of the estimates of the underlying reward functions and leads to poor performance. This is also called the incomplete learning phenomenon; see Keskin and Zeevi (2018) for a comprehensive summary of the situations where the phenomenon may appear and the situations where the myopic algorithm can be directly applied without learning. ...
Article
Full-text available
We study a dynamic pricing problem where the observed cost in each selling period varies from period to period, and the demand function is unknown and only depends on the price. The decision maker needs to select a price from a menu of K prices in each period to maximize the expected cumulative profit. Motivated by the classical upper confidence bound (UCB) algorithm for the multi-armed bandit problem, we propose a UCB-Like policy to select the price. When the cost is a continuous random variable, as the cost varies, the profit of the optimal price can be arbitrarily close to that of the second-best price, making it very difficult to make the correct decision. In this situation, we show that the expected cumulative regret of our policy grows in the order of (log T)^2, where T is the number of selling periods. When the cost takes discrete values from a finite set and all prices are optimal for some costs, we show that the expected cumulative regret is upper bounded by a constant for any T. This result suggests that in this situation, the suboptimal price will only be selected in a finite number of periods, and the trade-off between earning and learning vanishes and learning is no longer necessary beyond a certain period.
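One natural optimism-based rule consistent with this description is sketched below: maintain confidence bounds on the unknown mean demand at each menu price and, after observing the period's cost, choose the price with the largest optimistic profit. The demand model, the confidence radius, and the cost distribution are assumptions for illustration, not the article's exact UCB-Like index.

```python
import numpy as np

rng = np.random.default_rng(3)

prices = np.array([2.0, 4.0, 6.0, 8.0])          # menu of K prices
true_demand = np.array([0.9, 0.6, 0.35, 0.2])    # unknown mean demand at each price (assumed)
T = 5_000

pulls = np.ones(len(prices))                      # pretend each price was tried once
dhat = rng.binomial(1, true_demand).astype(float) # initial demand estimates
profit = 0.0

for t in range(1, T + 1):
    cost = rng.uniform(0.0, 6.0)                  # cost observed before pricing (assumed distribution)
    radius = np.sqrt(2.0 * np.log(t + 1) / pulls) # confidence radius for mean demand
    margin = prices - cost
    # Optimistic profit: use the upper bound on demand when the margin is positive,
    # and the lower bound when it is negative.
    opt_demand = np.where(margin >= 0, np.clip(dhat + radius, 0, 1), np.clip(dhat - radius, 0, 1))
    k = int(np.argmax(margin * opt_demand))
    sale = rng.binomial(1, true_demand[k])        # Bernoulli demand at the chosen price
    profit += margin[k] * sale
    pulls[k] += 1
    dhat[k] += (sale - dhat[k]) / pulls[k]        # running-mean update of demand at price k

print(f"average profit per period: {profit / T:.3f}")
```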
... Here, incomplete learning means that the parameters of the demand functions cannot be learned consistently, i.e., the parameter estimators do not converge to the true values as the number of periods goes to infinity (cf. Keskin and Zeevi (2018) for a detailed introduction to incomplete learning). ...
... In all three aforementioned pricing policies, the parameters of the demand function may be estimated consistently by using forced exploration, thus avoiding incomplete learning. Recently, Keskin and Zeevi (2018) propose a limited-memory learning scheme (i.e., adaptively choosing the estimation windows) to improve the certainty-equivalence policy in both static and slowly time-varying environments without forced exploration. Furthermore, Broder and Rusmevichientong (2012) and Keskin and Zeevi (2014) prove, respectively, that the lower bounds on the expected cumulative regrets of their problems are of order √T, no matter what pricing policies are used. ...
Article
Full-text available
We consider the problem of nonparametric multi-product dynamic pricing with unknown demand and show that the problem may be formulated as an online model-free stochastic program, which can be solved by the classical Kiefer-Wolfowitz stochastic approximation (KWSA) algorithm. We prove that the expected cumulative regret of the KWSA algorithm is bounded above by κ1√T + κ2, where κ1, κ2 are positive constants and T is the number of periods, for any T = 1, 2, …. Therefore, the regret of the KWSA algorithm grows in the order of √T, which achieves the lower bounds known for parametric dynamic pricing problems and shows that the nonparametric problems are not necessarily more difficult to solve than the parametric ones. Numerical experiments further demonstrate the effectiveness and efficiency of our proposed KW pricing policy by comparing with some pricing policies in the literature.
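The Kiefer-Wolfowitz recursion itself is short to state in code: perturb the current price vector coordinate by coordinate, compare noisy profit observations, and move along the finite-difference gradient estimate with shrinking step sizes. The two-product demand model and the tuning constants below are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_profit(p):
    # Hypothetical two-product linear demand with noise; unknown to the seller.
    d = np.array([10.0 - 1.5 * p[0] + 0.3 * p[1],
                  8.0 + 0.2 * p[0] - 1.2 * p[1]]) + rng.normal(0, 0.5, 2)
    return float(p @ d)

p = np.array([2.0, 2.0])                      # initial price vector
T = 5_000
for t in range(1, T + 1):
    a_t = 0.1 / t                             # step size (classical KW schedule)
    c_t = 0.5 / t ** 0.25                     # finite-difference width
    grad = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = c_t
        grad[i] = (noisy_profit(p + e) - noisy_profit(p - e)) / (2 * c_t)
    p = np.clip(p + a_t * grad, 0.1, 20.0)    # ascend the estimated profit gradient
print("final prices:", np.round(p, 2))
```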
... A fundamental question addressed in this literature is whether a firm that follows an optimal pricing-and-learning policy will eventually obtain full information regarding the underlying demand model. Compared to the aforementioned economics work, the operations research and management science (OR/MS) literature on dynamic pricing with demand learning is more recent, and has been growing rapidly for a decade (see, for example, Araman and Caldentey 2009, Farias and van Roy 2010, Harrison et al. 2012, Broder and Rusmevichientong 2012, Keskin and Zeevi 2018, Keskin and Birge 2019, den Boer and Keskin 2020, Golrezaei et al. 2021). As in the economics literature, Bayesian models are commonly used for demand inference in the OR/MS literature (Araman and Caldentey 2009, Farias and van Roy 2010, Harrison et al. 2012, Bensoussan and Guo 2015). ...
Preprint
Full-text available
In this paper, we study a firm's dynamic pricing problem in the presence of unknown and time-varying heterogeneity in customers' preferences for quality. The firm offers a standard product as well as a premium product to deal with this heterogeneity. First, we consider a benchmark case in which the transition structure of customer heterogeneity is known. In this case, we analyze the firm's optimal pricing policy and characterize its key structural properties. Thereafter, we investigate the case of an unknown market transition structure, and design a simple and practically implementable policy, called the bounded learning policy, which is a combination of two policies that perform poorly in isolation. Measuring performance by regret (i.e., the revenue loss relative to a clairvoyant who knows the underlying changes in the market), we prove that our bounded learning policy achieves the smallest possible growth rate of regret in terms of the frequency of market shifts. Thus, our policy performs well without relying on precise knowledge of the market transition structure.
... Theorem 2 implies that (statistical) incomplete learning does not happen in our context and that many simple Bayesian policies exhibit remarkably good profit performance when there is no informed bettor (for antecedent work on incomplete learning, see, e.g., McLennan 1984, Harrison et al. 2012, Keskin and Zeevi 2018). Thus, the informed bettor's strategic manipulation, instead of incomplete learning, is the market maker's major challenge in the context of our problem formulation. ...