### If you catch errors or would like to contribute insights, please email them to me at powell@princeton.edu with the subject "ADP book" so I can find them in my pile of emails.

General notes on second edition (updated June 23, 2011)

Chapter 1 - Introduction (updated June 23, 2011)

Chapter 2 - Some Illustrative Models (updated November 12, 2012)

Chapter 3 - Introduction to Markov Decision Processes (updated May 30, 2012)

Chapter 4 - Introduction to Approximate Dynamic Programming (updated June 23, 2011)

Chapter 5 - Modeling Dynamic Programs (updated June 23, 2011)

Chapter 6 - Policies (updated July 31, 2011)

Chapter 7 - Policy Search (updated March 13, 2015)

Chapter 8 - Approximating Value Functions (updated June 23, 2011)

Chapter 9 - Learning Value Function Approximations (updated June 23, 2011)

Chapter 10 - Optimizing While Learning (updated June 23, 2011)

Chapter 11 - Adaptive Estimation and Stepsizes (updated June 23, 2011)

Chapter 12 - Exploration vs. Exploitation (updated June 23, 2011)

Chapter 13 - Value Function Approximations for Resource Allocation Problems (updated June 23, 2011)

Chapter 14 - Dynamic Resource Allocation Problems (updated June 23, 2011)

Chapter 15 - Implementation Challenges (updated June 23, 2011)

### Notes on second edition

The second edition of Approximate Dynamic Programming is a major revision, partly reflecting the growth of the field, but primarily reflecting what I learned since 2005, when major addition to the first edition drew to a close. Major changes in the second edition include:

• In the first edition, I sought to build a bridge between the traditional discrete action spaces of dynamic programming, and the vector-valued decisions of math programmg, by using "x" as a decision throughout. In this edition, I have reverted to "a" as the default notation, but only when working on problems with discrete action spaces. When I want to emphasize that an algorithm will work for vector valued decisions, I switch to "x", so that the notation reflects the nature of the problem.
• Chapter 4 has been revised so that it can be viewed as an introduction to reinforcement learning. However, I have greatly expanded the discussion of post-decision states, which seems to be a modeling device that has been largely overlooked in the ADP/RL communities.
• Chapter 6 is a new chapter devoted to a discussion of policies. After gaining a much better appreciation of the range of algorithmic strategies being used in computer science, operations research and engineering, I found that I could describe virtually every algorithmic strategy as consisting of four fundamental approaches: 1) myopic policies (max/min a one period reward/cost), 2) lookahead policies, which optimize (usually approximately) over some horizon, 3) policy function approximations, which is a new term I am introducing to describe functions that return an action given a state), and 4) policies based on value function approximations. The field of "approximate dynamic programming" has been largely associated with the last group, yet many researchers in ADP/RL can be found using any of the last three, often in combinations. My hope is that this strategy will bring what are sometimes competing, disparate communities under one umbrella.
• Chapter 7 focuses on tuning the parameters of a policy, without attempting to approximate the value of being in a state. It introduces readers to the literature on stochastic search and introduces a new strategy called the knowledge gradient.
• Chapters 8, 9 and 10 are a three-chapter sequence focusing on approximating value functions. Chapter 8 discusses some of the most popular strategies for approximating value functions (basically a statistics review). Chapter 9 discusses methods for updating estimates of a value function approximation while holding fixed the policy being evaluated. Finally, chapter 10 delves into the much more complex problem of approximating the value of being in a state while simultaneously optimizing a policy.
• Chapter 11 is a major revision of the chapter on stepsizes. Some of the old material was streamlined, and there is a new section on optimal stepsizes for dynamic programming.
• Chapter 12 is a major revision of my "Exploration vs. exploitation" chapter. It introduces bandit problems and Gittins indices, discusses various approximation strategies, and revisits the concept of the knowledge gradient (first introduced in chapter 7). We then present some very recent research which combines the knowledge gradient in the setting with a physical state.
• Chapters 13 and 14 are very minor revisions of chapter 11 and 12 in the first edition on resource allocation.

### Chapter 1 - Introduction

The usual introductory fluff. This chapter sets the tone of the book and the audience. I try to set this book in the context of the rest of the literature.

### Chapter 2 - Some Illustrative Models (minor revision for 2nd ed.)

There is tremendous richness in modeling stochastic, dynamic problems, as we show in chapter 5. In this chapter, we try to teach modeling by example. This chapter illustrates both deterministic and stochastic problems. No effort is made to explain the principles of modeling, but here we try to teach modeling by example. At the end of the chapter is a half page summary of the core elements of most stochastic, dynamic systems.

### Errata

Page 33, line 10 from bottom: "we would be willing to pay up to $300" should be "we would be willing to pay up to$370"

### Chapter 3 - Introduction to Markov Decision Processes (minor revision for 2nd ed.)

Overview of the chapter:

This is a streamlined introduction to the classic theory of discrete state, discrete action Markov decision processes, where we assume that we have access to a one-step transition matrix. The theory is quite elegant, but the algorithms are extremely limited in terms of their range of applications. However, the fundamental algorithms (value iteration, policy iteration) underlie most of approximate dynamic programming. More advanced students will enjoy the convergence proofs in the appendix.

### Errata

Chapter 3, Example 3.2 (page 79)

The example states "Let R_t be the current inventory." While I often use "R_t" as a state variable for resources, in this example it should read "Let S_t be the current inventory."

### Chapter 4 - Introduction to Approximate Dynamic Programming (major revision for 2nd ed.)

Overview of the chapter:

This chapter, which was almost completely rewritten for the second edition, could be called "Introduction to reinforcement learning." By the end of this chapter, you will have a good working idea of the most important ideas in approximate dynamic programming, presented from the perspective of the reinforcement learning community. The chapter starts with Bellman's equation, and discusses the computational challenges that arise because of 1) large state spaces, 2) computing expectations (or the one-step transition matrix), and 3) large action spaces (especially if the decision is a vector). These are known as the three curses of dimensionality.

Q-learning, SARSA and other learning issues

The chapter then introduces Q-learning and SARSA, the concepts of on-policy and off-policy learning and the exploration-exploitation problem. The algorithmic strategy known as "real-time dynamic programming" is introduced which is a highly specialized algorithm which overcomes the exploration problem but assumes that the expectation can be computed.

Approximate value iteration

Value iteration is presented next using a Monte Carlo-based method for approximating the expectation. This sets up the most fundamental ADP algorithm that is the most intuitive and easiest to implement. However, this algorithm will not generally work in practice (the presentation does not make this point very clearly). For example, the algorithm is described as an on-policy algorithm, which can easily become stuck in a subset of states (we deal with this point in much greater detail later in the book).

The post-decision state

We then provide an expanded discussion of the post-decision state, with a large number of examples. The idea of the post-decision state has been used by other authors, sometimes under different names ("after state", "end of period state"), but my sense is that this idea has been largely overlooked by the RL/ADP literature. The post-decision state is the state measured just after you have made a decision, but before any new information has arrived. For some problems, the post-decision state is simpler (sometimes dramatically so) than the pre-decision state, which can simplify the process of approximating the value function. It is almost always simpler than estimating a Q-factor. Most importantly, it eliminates the need to compute (or approximate) the expectation within the max/min operator. This makes it much easier to run an ADP/RL algorithm.

The RL community often focuses on "model free" dynamic programming. Q-learning only requires that we observe the next state, so we do not need a transition function, or the need to compute an expectation, when estimating the value of a state action pair. A state-action pair is actually a kind of post-decision state variable, but it can be much clumsier than a post-decision state, although this depends on the problem. In the book, I present about 10 examples of the post-decision state.

### Chapter 5 - Modeling Dynamic Programs (minor revision for 2nd ed.)

Overview of the chapter:

40 pages on how to model a dynamic program. This chapter covers subtle issues such as the modeling of time. When modeling a stochastic system, it is especially important to know what is, and is not, random at a point in time t.

It introduces the five core components of any dynamic system, including

• States - The information needed to make a decision, compute the cost/contribution, and compute the transition function.
• Actions - Also known as decisions or controls, this is how the system evolves
• Exogenous information - All the random stuff
• Transition function - How the system evolves over time. This is also known as the "model" or "system model" (along with many other names).
• Objective function - This is how we compute how well we are doing (costs and rewards).

This chapter also covers topics such as partially observable states, model-based and model-free dynamic programming.

I routinely find that the biggest weakness of application papers in stochastic optimization is a lack of precision in the modeling. This chapter provides a simple template for modeling problems that will provide a level of clarity and precision that is missing from most of the papers that I read.

June 29, 2011 - The objective function

The book focuses on an objective function which minimizes expected costs or maximizes expected rewards. This approach ignores risk, which arises in many applications. Of course, you can handle risk by simply adding a risk measure (such as variance) to the objective function, but a more interest approach introduces an explicit risk constraint.

A separate strategy focuses on worst-case performance. This perspective has been receiving attention under the name "robust optimization," although many authors claim that any solution to a stochastic optimization problem is "robust" relative to a solution that optimizes a deterministic approximation. The growing field of robust optimization assumes that we can place bounds on the outcomes of random variables. Instead of taking expectations, you use the worst random outcome that might happen.

### Chapter 6 - Policies (new for 2nd ed.)

Overview of the chapter:

I like to characterize the field of stochastic optimization as a jungle of methods, often presented in the academic research literature with an impenetrable style (admissible policies? filtrations? nonanticipativity constraints?). This chapter breaks down this jungle by describing four fundamental types of policies:

• Myopic policies - Here we maximize the contribution (or minimize the cost) for one period, without regard to the impact of a decision on the future. There actually are problems where myopic policies are optimal, and there are applications where they are a good approximation. (But, see the July 31, 2011 notes below on cost function approximations.)
• Lookahead policies - This covers rolling horizon procedures, stochastic programming, rollout heuristics and tree search. Any policy that involves making decisions now and in the future in order to make better decisions now.
• Policy function approximations - A PFA is an analytical function which returns an action given a state, without having to solve an optimization problem (you might have to solve many optimization problems to come up with the PFA). Policy function approximations are particularly useful when the nature of a decision rule (policy) is obvious given the structure of the problem. For example, selling a stock if it goes over some price, or ordering inventory when it goes under some level.
• Policies based on value function approximations - This is where you determine an action based on a one-period cost/contribution plus the value of the downstream destination.

Policy function approximations and value function approximations both require approximating a function (!!). There are three fundamental ways of approximating a function:

• Lookup table - Here we calculate an action (PFA) or value (VFA) given a discrete state.
• Parametric models - This is an analytical function that returns an action or value given a state. These functions are parameterized by parameters that have to be tuned.
• Nonparametric models - Here we use a nonparametric statistical technique to compute an action or value given a state.

Parametric models (for actions or states) introduce the art of coming up with the functional approximation. The science involves determining the parameters.

Although there are four fundamental classes of policies, it is possible to find hybrid policies by mixing and matching different strategies. For example, you can perform a lookahead policy by searching several steps into the future, and then using a value function approximation to limit the depth of the search. Myopic policies can be tweaked by adding tunable parameters, which can be viewed as a hybrid of a myopic policy and a policy function approximation.

July 31, 2011 -Cost function approximations

As I wrote up my four classes of policies in the summer of 2010, there was one class of policies that did not seem to fit nicely in one of these four categories. It occurred to me that I had overlooked an important class that I would group with myopic policies, and I am going to call this class cost function approximation (which nicely parallels policy function approximation and value function approximation). One way to make a myopic policy work better is to modify the costs in a heuristic way. For example, imagine that you are assign people (such as a surgeon) or a machine to tasks. Our cost function might prioritize putting the best person or machine on a task, but this might result in some lower priority tasks being delayed, which creates problems in the future. For this reason, we might heuristically add a bonus for covering tasks that have been delayed. The bonus is a tunable parameter.

This strategy can be combined with lookahead policies and policies based on value function approximations (both of which require a cost function to minimize), but in its purest form would be used with a myopic policy, where we minimize a one-period cost. In principle, cost function approximations can be lookup tables, parameteric or nonparametric, although I have only used parametric approximations in the form of adding tunable costs. We can then optimize over a class of cost function approximation by finding the tunable cost (or bonus) that works the best over time.

### Chapter 7 - Policy Search (new for 2nd ed.)

Overview of the chapter:

In this chapter, we assume we have some sort of parameterized policy. This might be a myopic policy with tunable parameters, a policy function approximation, or even a value function approximation with tunable parameters. In this chapter, we present methods for tuning these parameters using concepts from the field of stochastic search.

The chapter starts with classical search methods based on stochastic gradients, assuming that derivatives can be computed. In the appendix, proofs are presented for those interested in seeing the theory behind these techniques.

The chapter then introduces search methods assuming a finite set of policies, also known as the ranking and selection problem. Frequentist and Bayesian approaches are presented, including Bayesian updating with correlated beliefs (a powerful idea we return to). Various heuristic search algorithms are presented including epsilon-greedy, Bolzmann exploration and interval estimation.

We then introduce the knowledge gradient, which is a general search technique which chooses a set of parameters to evaluate which produce the greatest marginal value. The knowledge gradient is given for discrete alternatives with independent and correlated beliefs, as well as when a parametric belief model is used (in particular, linear regression). The knowledge gradient is a Bayesian method for learning the best set of parameters. This chapter provides only a brief introduction to this powerful method. For more information, see our optimal learning website.

We next present some important ideas from the literature known as "simulation optimization." These methods were derived in the setting of finding a set of parameters (typically a finite set) to produce the best results from a Monte Carlo simulation. We illustrate two important ideas - indifference zone selection, and the optimal computing budget allocation, which seeks to allocate a finite budget for computing time among a small set of alternatives. These methods are all frequentist.

### Errata

p. 285 - The conclusion that \beta^n goes to zero requires that we modify the assumption in equation 7.69 (p. 281) that the stepsizes alpha_n be strictly greater than 0 (rather than greater than or equal to zero).

### Chapter 8 - Approximating Value Functions (new for 2nd ed.)

Overview of the chapter:

Chapter 8 is the first of a three-chapter sequence to develop the important ideas of finding policies based on approximating value functions. This chapter can be viewed as a review of statistical techniques that have proved popular in the ADP/RL communities for approximating value functions.

Major techniques include:

• Aggregation, and in particular hierarchical aggregation, where we use a weighted sum of estimates at different levels of aggregations. The weights are a function of the state; as more observations are made around a set of states, we put more weight on more disaggregate estimates.
• Parametric models - This is a review of classical linear regression, including recursive least squares. The language of "basis functions" is introduced (another word for independent variables, or covariates). An elegant geometric view of basis functions is given, including a matrix-based derivation of the projection operation.
• Variations on regression are given, including shrinkage methods and support vector regression.
• Nonparametric methods, including nearest neighbors, kernel regression, local polynomial regression, neural networks, and a family of closely related methods known as indexed functions, tree structures and clustering.

The curse of dimensionality

We provide a section on the curse of dimensionality. ADP is sometimes presented as a method for overcoming the curse of dimensionality, which we believe is a little disengeneous. Approximating an N dimensional function in a way that captures interactions between all N dimensions is fundamentally difficult for values of N larger than about 5. Approximations of high-dimensional functions typically requires using separable or sequences of low-dimensional approximations. ADP/RL has not solved this problem.

The classical discrete state (also known as a flat representation) does not use separable (or low-dimensional) approximations, but that is why classical discrete state models do not scale to interesting problems.

### Chapter 9 - Learning Value Function Approximations (new for 2nd ed.)

Overview of the chapter:

There are many papers that focus on the problem of estimation (sometimes called prediction), which means estimating the value of being in a state while following a fixed policy. This entire chapter focuses on this problem. To further emphasize that we are fixing the policy, any quantity that depends on a policy is indexed by pi.

Our presentation starts with an overview of different methods for taking samples of the value of being in a state. This includes sampling the value of a policy over finite and infinite horizons, and bootstrapping, known widely in the reinforcement learning community as "temporal differencing".

We then introduce stochastic approximation methods for updating the parameters of a function. We then present recursive least squares for linear regression models, including updating methods for stationary, nonstationary and autoregressive methods.

We then present an in-depth discussion of Bellman's equation using linear models. We present a matrix-based derivation, a simulation-based algorithm, and presentations of least squares temporal differencing (LSTD) and least squares policy evaluation (LSPE).

We then provide an analysis of different algorithms (recursive least squares, TD(0), LSPE, LSTD) for a simplified dynamic program with a single state and single action. This simple problem allows us to give each of these estimation methods in closed form. This allows us to understand the type of stepsize rules that each requires.

The chapter then presents some exciting new material on gradient based methods for approximate value iteration. Classical methods for estimating the parameters of a value function approximation have computed gradients based on weighted squares of Bellman errors. While popular, this method has known limitations, including divergence for some applications. New research has proposed using a different objective function called the "mean squared *projected* Bellman error", or MSPBE for short. This method leads to a provably convergent algorithm, overcoming the divergent behavior of previous algorithms.

The chapter closes with a presentation of recent research on least square temporal differencing using kernel regression.

### Chapter 10 - Optimizing While Learning (new for 2nd ed.)

Overview of the chapter:

Learning a value function approximations while sampling observations from a policy that depends on the value function approximation is ... well, hard!

The issues that arise when learning while optimizing depends heavily on how we sample observations of the value of being in a state, and how we are trying to approximate the value function.

There are two fundamental methods for sampling the value of being in a state. The first uses bootstrapping, where the value of being in a state s depends on the best action we can take and our current approximation of the value of the state that the action takes us to. This is known as TD learning and is used in approximate value iteration.

The second method simulates a policy (directly or indirectly) and uses the resulting sequence of costs/rewards to get an observation of the value of being in a state. This simulation can be done by literally walking forward through time (perhaps to the end of a horizon), or by inferring the infinite horizon contribution from a single cost. This strategy is best recognized under the name of approximate policy iteration, but methods such as least squares temporal differencing accomplish the same thing.

Of particular importance is how we are approximating the value function. Recall that there are three fundamental strategies: lookup tables, parametric models, and nonparametric models. If we use a lookup table, we benefit from the fact that if we sample states and actions often enough, we will eventually learn the best value of being in a state. The key here is that we need an exploration strategy to ensure this, which means we are using off-policy learning.

Of greatest interest in the ADP community is the use of linear regression (the easiest form of parametric model). Sadly, even if we fix a policy, if we use off-policy learning, we will not learn the correct value function approximation. We have to use on-policy learning. At the same time, we do not have the same needs for exploration that are required with lookup tables. Instead of visiting all states infinitely often, we need to visit enough different states to fit the parameters. Very recent research using mean squared projected Bellman error introduces an important correction term that makes it possible to use off-policy learning. However, there are some applications where on-policy learning (where we follow a single trajectory produced by a policy which depends on the value function approximation), is the most practical.

Approximate policy iteration has attracted considerable attention because it enables trajectory following in an inner iteration where we fix the value function approximation (to fix the policy) and then simulate the policy. This has to be done many times to get a good approximation of the value function for this policy, which is then used to update the policy. Convergence proofs are available, but these require assumptions either on the completeness of the basis functions or the behavior of the policy in terms of exploring states.

There is growing interest in the use of nonparametric models to avoid the "art" of choosing basis functions (also known as features). Nonparametric methods use local observations of a function to approximate the function in a region. This means that we lose the power of parametric methods where observations in one region of the function allow us to update our approximation of the entire function. This is both a weakness and a strength, because when one observation updates the entire function, it introduces instabilities. The flip side is that we are now going to need exploration policies to ensure that we visit different portions of the function.

### Chapter 11 - Adaptive Estimation and Stepsizes (significant revision for 2nd ed.)

Overview of the chapter:

Fundamental to approximate dynamic programming is the need for recursive estimation, where a current estimate is combined with a new observation to produce an updated estimate. By far the most popular method for doing this uses a stochastic approximation algorithm, where (1-alpha_{n-1}) is multipled times the old estimate and added to alpha_{n-1} times the new observation. alpha_{n-1} is known as a stepsize (given the derivation from stochastic approximation theory), smoothing factor or a learning rate.

A stepsize of 1/n (which has the effect of averaging observations) satisfies the properties needed for convergence proofs, but has long been recognized to decline too quickly (however, under certain conditions it can be the best possible stepsize). A harmonic stepsize of a/(a+n) reduces the rate of decrease by increasing "a".

The reinforcement learning community most often uses a constant stepsize in practice, primarily because it avoids the practical problem of using a declining stepsize that declines too quickly. This choice introduces a tunable parameter, and it also forces a single stepsize on the entire problem. States (or regions of the state space) with few observations benefit from a larger stepsize, while a smaller stepsize produces more accurate estimates.

This chapter reviews deterministic stepsizes, heuristic stochastic (adaptive) stepsize rules, and reviews several stepsize rules that satisfy different optimality criterion for rate of convergence. We review three of these optimal stepsize rules:

• The learning rate from the Kalman filter literature, which depends on assumptions about specific properties on how the observations (estimates of the value of being in a state) are evolving.
• A rule we call the Bias Adjusted Kalman Filter (BAKF) rule, which strikes a balance between the apparent noise in the signal, and an estimate of the bias as the value converges to a steady state estimate.
• A new rule we call OSAVI for Optimal Stepsize for Approximate Value Iteration. This rule appears to be the first optimal stepsize rule derived specifically for bootstrapping, including approximate value iteration and Q-learning. OSAVI avoids the need of BAKF to estimate the bias (which is the Achilles heel of that procedure) by estimating the bias directly from the discount factor.

### Chapter 12 - Exploration vs. Exploitation (major revision for 2nd ed.)

Overview of the chapter:

When I wrote my "Exploration vs. Exploitation" chapter in the first edition, I realized how little I knew about this topic, and decided to make it a major area of research. This research is evolving under the name "Optimal Learning" (see http://optimallearning.princeton.edu). Some of this work is evident in chapter 7 for policy search, where we use it to help search for the best value of parameters that fix a policy. In this setting, learning is a dynamic programming problem where the state is our state of belief about the parameters. After each measurement, I can test any value of the parameters I would like.

In this chapter, we tackle the much more difficult problem of learning in the presence of a physical state (which characterizes all the dynamic programming problems in this book). The chapter starts with some basic introduction to learning (frequentist and Bayesian), and reviews some of the popular heuristic policies (although this review is incomplete - more on this in additional updates to this website).

We then review the classic theory of bandit problems and Gittins indices, which is one of the earliest contributions to active learning. We also review approximations of Gittins indices, and the technique of upper confidence bounding.

We then review the knowledge gradient policy, first introduced in chapter 7, briefly reviewing knowledge gradient for correlated beliefs, as well as on-line problems (multiarmed bandits) and off-line prolems (ranking and selection). Often overlooked in the ADP literature is that algorithms for training policies are fundamentally online, because they consider the cost/reward being earned, rather than purely the value of information.

We then present some new material for learning in the presence of a physical state. We review an elegant heuristic called the "local bandit approximation" which is an adaptation of Gittins indices to problems with a physical state. We then show how the knowledge gradient can be used, where it results in adding a "value of information" term to the objective when making a decision. We use the ability of the knowledge gradient to visit one state while learning about another state (either because of correlated beliefs for discrete states, or through parametric models). The supporting literature behind this material is quite young, and this material should be viewed as emerging research. Early numerical work is promising, but the algorithm is considerably more complicated than the simple heuristics that are used in practice. Not know at this time is whether the computational requirements of the knowledge gradient calculation is offset by the benefits of the algorithm.

### Chapter 13 - Value Function Approximations for Resource Allocation Problems (minor revision for 2nd ed.)

Overview of the chapter:

There is a large problem class that is best described as "resource allocation" where we are managing blood, money, people, water, energy or food. We have to decide not just what to do with a resource, but we usually have to deal with how much. These problems are typically convex (concave if maximizing) in the "how much" dimension, and we can exploit this property.

In this chapter, we focus on different methods for approximating value functions where we can take advantage of convexity. Approximation strategies might be linear in the resource state, piecewise linear separable, quadratic or they may use a powerful idea from mathematical programming which uses a sophisticated idea known as Benders cuts. The use of multidimensional Benders cuts is also known as "stochastic dual decomposition procedure" (SDDP) which is well known in the stochastic programming community. This can be viewed as a form of approximate dynamic programming.

We have considerable experience with stochastic resource allocation problems in our work in CASTLE Lab. Most of our applications use approximations that are piecewise linear and separable in the resource variable. This method has proven to be very robust and scalable. In our comparisons against Benders cuts, we find that Benders cuts works well up to around 10 or 20 dimensions. Above this, the method seems to become quite slow. Piecewise linear separable approximations seems to work well for problems with hundreds and thousands of dimensions. For our largest problems (millions of dimensions), the number of resources of a particular type becomes so small that we end up using approximations that are linear in the resource state.

### Chapter 14 - Dynamic Resource Allocation Problems (minor revision for 2nd ed.)

Overview of the chapter:

Here we describe a series of resource allocation problems of increasing complexity. We start with a simple inventory problem (the resource state is scalar), and then present a blood management problem that nicely illustrates how we use the post-decision state to solve a multidimensional resource allocation problem. This basic strategy generalizes to much larger problems. We also have an illustration of a financial portfolio optimization problem. Again, this is done primarily to illustrate an approximation strategy.

Throughout this chapter, we focus on optimization problems with vector-valued decisions, so we use "x" throughout as our decision variable.

Section 12.4 stops to present a general notational strategy for a fairly broad class of resource allocation problems. Our hope is that the notational style will prove useful for some modelers.

Sections 12.5 and 12.6 review two large, industrial applications arising in the management of box cars for a major railroad, and the planning of drivers (a complex resource) for a truckload motor carrier. These projects are primarily demonstrating that these methods will work in practice.

### Chapter 15 - Implementation Challenges (minor revision for 2nd ed.)

Overview of the chapter:

Anyone who works with real applications learns that there is a real art to solving this very broad class of problems. I have made an effort to summarize some of the insights that I have acquired from the applications that I have worked on.

Chapter 15 is only a minor revision from the first edition, where my focus was primarily on using value function approximations to solve problems, in the contextual domain of resource allocation. I have some general insights into whether value functions is a good strategy for your problem and steps for debugging an algorithm. I provide a random list of thoughts on discount factors, getting through the early iterations of an ADP algorithm (when value functions and policies are the worst), evaluating a policy, and dealing with the challenge of when to stop an algorithm.

I close with a section called "If it works, patent it!" Although I would actually discourage the community from patents (which simply discourages innovation in my view), the point is that getting an ADP algorithm to really work is a real breakthrough. There is a lot of theory and toy experiments in labs, but the transition to a real field implementation is quite difficult.

Good luck!