Monte Carlo Tree Search (MCTS), most famously used in game-play artificial intelligence (e.g., the game of Go), is a well-known strategy for constructing approximate solutions to sequential decision problems. Its primary innovation is the use of a heuristic, known as a default policy, to obtain Monte Carlo estimates of downstream values for states in a decision tree. This information is used to iteratively expand the tree towards regions of states and actions that an optimal policy might visit. However, to guarantee convergence to the optimal action, MCTS requires the entire tree to be expanded asymptotically. In this paper, we propose a new technique called Primal-Dual MCTS that utilizes sampled information relaxation upper bounds on potential actions, creating the possibility of "ignoring" parts of the tree that stem from highly suboptimal choices. This allows us to prove that despite converging to a partial decision tree in the limit, the recommended action from Primal-Dual MCTS is optimal. The new approach shows significant promise when used to optimize the behavior of a single driver navigating a graph while operating on a ride-sharing platform. Numerical experiments on a real dataset of 7,000 trips in New Jersey suggest that Primal-Dual MCTS improves upon standard MCTS by producing deeper decision trees and exhibits a reduced sensitivity to the size of the action space.
Mar 16 2017 math.OC
A widely used heuristic for solving stochastic optimization problems is to use a deterministic rolling horizon procedure, which has been modified to handle uncertainty (e.g. buffer stocks, schedule slack). This approach has been criticized for its use of a deterministic approximation of a stochastic problem, which is the major motivation for stochastic programming. We recast this debate by identifying both deterministic and stochastic approaches as policies for solving a stochastic base model, which may be a simulator or the real world. Stochastic lookahead models (stochastic programming) require a range of approximations to keep the problem tractable. By contrast, so-called deterministic models are actually parametrically modified cost function approximations which use parametric adjustments to the objective function and/or the constraints. These parameters are then optimized in a stochastic base model which does not require making any of the types of simplifications required by stochastic programming. We formalize this strategy and describe a gradient-based stochastic search strategy to optimize the parameters.
May 11 2016 math.OC
We consider the sequential decision problem faced by the manager of an electric vehicle (EV) charging station, who aims to satisfy the charging demand of the customer while minimizing cost. Since the total time needed to charge the EV up to capacity is typically less than the amount of time that the customer is away, there are opportunities to exploit electricity spot price variations within some time window. However, it is also true that the return time of the customer is uncertain, so there exists the risk of an insufficient charge. We formulate the problem as a finite horizon Markov decision process (MDP) and consider a risk-averse objective function by optimizing under a dynamic risk measure constructed using a convex combination of expected value and conditional value at risk (CVaR). For the first time in the literature, we provide an analysis of the effect that risk parameters, e.g., the risk-level $\alpha$ used in CVaR, have on the structure of the optimal policy. We show that becoming more risk-averse in the dynamic risk measure sense corresponds to the intuitively appealing notion of becoming more risk-averse in the order thresholds of the optimal policy. This result allows us to develop computational techniques for approximating a "spectrum of risk-averse policies" generated by varying the parameters of the risk measure. Finally, numerical results for a case study using spot price data from California ISO (CAISO) are shown, where the Pareto optimality of our policies when measured against practical metrics of risk and reward is examined.
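As a rough illustration of the one-step objective described above (a convex combination of expected value and CVaR), the sketch below evaluates such a combination on a sample of costs. The function names, the weight `lam`, and the tail-average CVaR estimator are assumptions made for illustration; they are not taken from the paper.

```python
import math

def empirical_cvar(costs, alpha):
    """Tail-average estimate of CVaR at level alpha for sampled costs.

    Assumes higher cost is worse, so the "risky" tail is the largest
    (1 - alpha) fraction of the sample.
    """
    ordered = sorted(costs)
    # Number of samples in the upper tail (at least one).
    k = max(1, math.ceil((1 - alpha) * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)

def mean_cvar(costs, alpha, lam):
    """Convex combination (1 - lam) * mean + lam * CVaR_alpha, with lam in [0, 1]."""
    mean = sum(costs) / len(costs)
    return (1 - lam) * mean + lam * empirical_cvar(costs, alpha)

# Hypothetical sample of charging costs.
sample_costs = [float(c) for c in range(1, 11)]
risk_neutral = mean_cvar(sample_costs, alpha=0.8, lam=0.0)  # plain expectation
risk_averse = mean_cvar(sample_costs, alpha=0.8, lam=0.5)   # weights the worst 20% more heavily
```

Setting `lam` to 0 recovers the risk-neutral expectation, while raising `lam` or `alpha` puts more weight on the worst outcomes.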
There has been widespread interest in the use of grid-level storage to handle the variability from increasing penetrations of wind and solar energy. This problem setting requires optimizing energy storage and release decisions for anywhere from a half-dozen to potentially hundreds of storage devices spread around the grid as new technologies evolve. We approach this problem using two competing algorithmic strategies. The first, developed within the stochastic programming literature, is stochastic dual dynamic programming (SDDP), which uses Benders decomposition to create multidimensional value function approximations, which have been widely used to manage hydro reservoirs. The second approach, which has evolved using the language of approximate dynamic programming, uses separable, piecewise linear value function approximations, a method which has been successfully applied to high-dimensional fleet management problems. This paper brings these two approaches together using a common notational system, and contrasts the algorithmic strategies (which are both a form of approximate dynamic programming) used by each approach. The methods are then subjected to rigorous testing using the context of optimizing grid-level storage.
In this paper, we consider a finite-horizon Markov decision process (MDP) for which the objective at each stage is to minimize a quantile-based risk measure (QBRM) of the sequence of future costs; we call the overall objective a dynamic quantile-based risk measure (DQBRM). In particular, we consider optimizing dynamic risk measures where the one-step risk measures are QBRMs, a class of risk measures that includes the popular value at risk (VaR) and the conditional value at risk (CVaR). Although there is considerable theoretical development of risk-averse MDPs in the literature, the computational challenges have not been explored as thoroughly. We propose data-driven and simulation-based approximate dynamic programming (ADP) algorithms to solve the risk-averse sequential decision problem. We address the issue of inefficient sampling for risk applications in simulated settings and present a procedure, based on importance sampling, to direct samples toward the "risky region" as the ADP algorithm progresses. Finally, we show numerical results of our algorithms in the context of an application involving risk-averse bidding for energy storage.
We present a sparse knowledge gradient (SpKG) algorithm for adaptively selecting the targeted regions within a large RNA molecule to identify which regions are most amenable to interactions with other molecules. Experimentally, such regions can be inferred from fluorescence measurements obtained by binding a complementary probe with fluorescence markers to the targeted regions. We use a biophysical model which shows that the fluorescence ratio under the log scale has a sparse linear relationship with the coefficients describing the accessibility of each nucleotide, since not all sites are accessible (due to the folding of the molecule). The SpKG algorithm uniquely combines the Bayesian ranking and selection problem with the frequentist $\ell_1$ regularized regression approach Lasso. We use this algorithm to identify the sparsity pattern of the linear model as well as sequentially decide the best regions to test before the experimental budget is exhausted. We also develop two other new algorithms: a batch SpKG algorithm, which generates multiple suggestions sequentially so that experiments can be run in parallel; and batch SpKG with a procedure which we call length mutagenesis, which dynamically adds new alternatives, in the form of new types of probes created by inserting, deleting, or mutating nucleotides within existing probes. In simulation, we demonstrate these algorithms on the Group I intron (a mid-size RNA molecule), showing that they efficiently learn the correct sparsity pattern, identify the most accessible region, and outperform several other policies.
We develop a quadratic regularization approach for the solution of high-dimensional multistage stochastic optimization problems characterized by a potentially large number of time periods/stages (e.g. hundreds), a high-dimensional resource state variable, and a Markov information process. The resulting algorithms are shown to converge to an optimal policy after a finite number of iterations under mild technical assumptions. Computational experiments are conducted using the setting of optimizing energy storage over a large transmission grid, which motivates both the spatial and temporal dimensions of our problem. Our numerical results indicate that the proposed methods exhibit significantly faster convergence than their classical counterparts, with greater gains observed for higher-dimensional problems.
We propose a sequential learning policy for noisy discrete global optimization and ranking and selection (R&S) problems with high-dimensional sparse belief functions, where there are hundreds or even thousands of features, but only a small portion of these features contain explanatory power. We aim to identify the sparsity pattern and select the best alternative before the finite budget is exhausted. We derive a knowledge gradient policy for sparse linear models (KGSpLin) with group Lasso penalty. This policy is a unique and novel hybrid of Bayesian R&S with frequentist learning. In particular, our method combines naturally with B-spline basis expansion and generalizes to the nonparametric additive model (KGSpAM) and functional ANOVA model. Theoretically, we provide the estimation error bounds of the posterior mean estimate and the functional estimate. Controlled experiments show that the algorithm efficiently learns the correct set of nonzero parameters even when the model is embedded with hundreds of dummy parameters. It also outperforms the knowledge gradient for a linear model.
Approximate dynamic programming (ADP) has proven itself in a wide range of applications spanning large-scale transportation problems, health care, revenue management, and energy systems. The design of effective ADP algorithms has many dimensions, but one crucial factor is the stepsize rule used to update a value function approximation. Many operations research applications are computationally intensive, and it is important to obtain good results quickly. Furthermore, the most popular stepsize formulas use tunable parameters and can produce very poor results if tuned improperly. We derive a new stepsize rule that optimizes the prediction error in order to improve the short-term performance of an ADP algorithm. With only one, relatively insensitive tunable parameter, the new rule adapts to the level of noise in the problem and produces faster convergence in numerical experiments.
Feb 17 2014 math.OC
There is growing interest in the use of grid-level storage to smooth variations in supply that are likely to arise with increased use of wind and solar energy. Energy arbitrage, the process of buying, storing, and selling electricity to exploit variations in electricity spot prices, is becoming an important way of paying for expensive investments into grid-level storage. Independent system operators such as the NYISO (New York Independent System Operator) require that battery storage operators place bids into an hour-ahead market (although settlements may occur in increments as small as 5 minutes, which is considered near "real-time"). The operator has to place these bids without knowing the energy level in the battery at the beginning of the hour, while simultaneously accounting for the value of leftover energy at the end of the hour. The problem is formulated as a dynamic program. We describe and employ a convergent approximate dynamic programming (ADP) algorithm that exploits monotonicity of the value function to find a revenue-generating bidding policy; using optimal benchmarks, we empirically show the computational benefits of the algorithm. Furthermore, we propose a distribution-free variant of the ADP algorithm that does not require any knowledge of the distribution of the price process (and makes no assumptions regarding a specific real-time price model). We demonstrate that a policy trained on historical real-time price data from the NYISO using this distribution-free approach is indeed effective.
Jan 09 2014 math.OC
Many sequential decision problems can be formulated as Markov Decision Processes (MDPs) where the optimal value function (or cost-to-go function) can be shown to satisfy a monotone structure in some or all of its dimensions. When the state space becomes large, traditional techniques, such as the backward dynamic programming algorithm (i.e., backward induction or value iteration), may no longer be effective in finding a solution within a reasonable time frame, and thus we are forced to consider other approaches, such as approximate dynamic programming (ADP). We propose a provably convergent ADP algorithm called Monotone-ADP that exploits the monotonicity of the value functions in order to increase the rate of convergence. In this paper, we describe a general finite-horizon problem setting where the optimal value function is monotone, present a convergence proof for Monotone-ADP under various technical assumptions, and show numerical results for three application domains: optimal stopping, energy storage/allocation, and glycemic control for diabetes patients. The empirical results indicate that by taking advantage of monotonicity, we can attain high quality solutions within a relatively small number of iterations, using up to two orders of magnitude less computation than is needed to compute the optimal solution exactly.
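To make the idea of exploiting monotonicity concrete, the sketch below shows one way an updated value estimate at a single state can be propagated to neighbouring states so that the approximation stays nondecreasing. It assumes a one-dimensional, ordered state space; the function name and the example numbers are hypothetical, and this is an illustration of the general idea rather than the exact operator or update rule used in the paper.

```python
def monotone_projection(values, idx, z):
    """Set values[idx] to z, then restore a nondecreasing shape.

    values: current value estimates, indexed by an ordered 1-D state.
    idx:    index of the state whose estimate was just updated.
    z:      the new estimate for that state.
    Returns a new list; the input is not modified.
    """
    out = list(values)
    out[idx] = z
    for j in range(len(out)):
        if j < idx and out[j] > z:
            # Smaller states may not have a larger value than z.
            out[j] = z
        elif j > idx and out[j] < z:
            # Larger states may not have a smaller value than z.
            out[j] = z
    return out

# Hypothetical estimates over five ordered states.
current = [0.0, 1.0, 2.0, 3.0, 4.0]
raised = monotone_projection(current, idx=2, z=3.5)   # lifts the state above idx that fell below 3.5
lowered = monotone_projection(current, idx=2, z=0.5)  # pulls down the state below idx that exceeded 0.5
```

A single observation therefore corrects several states at once, which is the intuition behind needing fewer iterations than an update that touches only the visited state.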
This paper studies approximate policy iteration (API) methods which use least-squares Bellman error minimization for policy evaluation. We address several of its enhancements, namely, Bellman error minimization using instrumental variables, least-squares projected Bellman error minimization, and projected Bellman error minimization using instrumental variables. We prove that for a general discrete-time stochastic control problem, Bellman error minimization using instrumental variables is equivalent to both variants of projected Bellman error minimization. An alternative to these API methods is direct policy search based on the knowledge gradient. The practical performance of these three approximate dynamic programming methods is then investigated in the context of an application in energy storage, integrated with an intermittent wind energy supply to fully serve a stochastic time-varying electricity demand. We create a library of test problems using real-world data and apply value iteration to find their optimal policies. These benchmarks are then used to compare the developed policies. Our analysis indicates that API with instrumental variables Bellman error minimization prominently outperforms API with least-squares Bellman error minimization. However, these approaches underperform our direct policy search implementation.
In this paper we study convex stochastic search problems where a noisy objective function value is observed after a decision is made. There are many stochastic search problems whose behavior depends on an exogenous state variable which affects the shape of the objective function. Currently, there is no general-purpose algorithm to solve this class of problems. We use nonparametric density estimation to take observations from the joint state-outcome distribution and use them to infer the optimal decision for a given query state. We propose two solution methods that depend on the problem characteristics: function-based and gradient-based optimization. We examine two weighting schemes, kernel-based weights and Dirichlet process-based weights, for use with the solution methods. The weights and solution methods are tested on a synthetic multi-product newsvendor problem and the hour-ahead wind commitment problem. Our results show that in some cases Dirichlet process weights offer substantial benefits over kernel-based weights and, more generally, that nonparametric estimation methods provide good solutions to otherwise intractable problems.