Details of Dynamic Programming Theory

I derive the Bellman equation and show that the value function is its unique bounded solution.

In this blog post, I dig into the details of how the Bellman equation is derived and show, using Banach’s fixed point theorem, that the value function is the unique bounded function that solves it.

Dynamic Programming Model

As in many classic dynamic programming models, let $t \in \mathbb{N} \cup \{0\}$ be the stage, $S \subseteq \mathbb{R}^n$ be the state space where $s_0 \in S$ is given, $A \subseteq \mathbb{R}^k$ be the action space, $\Phi: S \rightrightarrows A$ be a compact-valued, continuous correspondence giving the set of feasible actions at each state, $f: S \times A \rightarrow S$ be a continuous transition function, $r: S \times A \rightarrow \mathbb{R}$ be a bounded and continuous reward function, and $\delta \in [0,1)$ be the discount factor.

A classic dynamic programming problem involves several important objects, which I now define.

$\textbf{Definition. History}$

A $t$-history $\eta_t$ is a vector $\left(s_0, a_0, \ldots, s_{t-1}, a_{t-1}, s_t\right)$ of the state $s_\tau$ in each period $\tau$ up to $t$, the action $a_\tau$ taken that period, and the period-$t$ state $s_t$. $H_0 = S$ and $H_t$ denotes the set of all possible $t$-histories $\eta_t$.

$\textbf{Definition. Strategy}$

A strategy for period $t$ is a function $\sigma_t: H_t \rightarrow A$ such that $\sigma_t\left(s_0, a_0, \ldots, a_{t-1}, s_t\right) \in \Phi\left(s_t\right)$ for all $t$ and all $\left(s_0, a_0, \ldots, a_{t-1}, s_t\right) \in H_t$. A strategy $\sigma$ is a sequence of period strategies $\left(\sigma_t\right)_{t \in \mathbb{N} \cup \{0\}}$ where $\sigma_t$ is the strategy at period $t$. The set of all strategies will be denoted by $\Sigma$.

$\textbf{Definition. Reward, Utility, Value}$

For any strategy $\sigma \in \Sigma$, define

$$
\begin{array}{c}
a_0\left(s_0, \sigma\right)=\sigma_0\left(s_0\right) \\
s_1\left(s_0, \sigma\right)=f\left(s_0, a_0\left(s_0, \sigma\right)\right) \\
a_1\left(s_0, \sigma\right)=\sigma_1\left(s_0, a_0\left(s_0, \sigma\right), s_1\left(s_0, \sigma\right)\right)
\end{array}
$$

and other values of $s_t$ and $a_t$ by induction. The reward at period $t$ is defined as $r_t\left(s_0, \sigma\right)=r\left(s_t\left(s_0, \sigma\right), a_t\left(s_0, \sigma\right)\right)$ and the overall utility is defined as

$$
W\left(s_0, \sigma\right)=\sum_{t=0}^{\infty} \delta^t r_t\left(s_0, \sigma\right)
$$

The value function is defined as $V: S \rightarrow \mathbb{R}$ where $V(s)=\sup_{\sigma \in \Sigma} W(s, \sigma)$.
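
To make these definitions concrete, here is a minimal numerical sketch in Python of a toy model satisfying the assumptions above (the particular $S$, $A$, $f$, $r$, and $\delta$ are my own illustrative choices, not part of the derivation), together with a truncated approximation of the utility $W(s_0, \sigma)$ for a simple stationary strategy.

```python
import numpy as np

# Illustrative toy model (my own choices, not part of the derivation):
# S = A = [0, 1] discretized, delta in [0, 1), f maps S x A back into S,
# and r is bounded and continuous.
S_grid = np.linspace(0.0, 1.0, 11)     # state space S (as a grid)
A_grid = np.linspace(0.0, 1.0, 11)     # action space A (as a grid)
delta = 0.9                            # discount factor

def Phi(s):
    """Feasible action set Phi(s); here every action is feasible."""
    return A_grid

def f(s, a):
    """Transition function f: S x A -> S (clipped so f(S, A) stays in S)."""
    return float(np.clip(0.5 * s + 0.5 * a, 0.0, 1.0))

def r(s, a):
    """Bounded, continuous reward r: S x A -> R."""
    return s - 0.5 * a ** 2

def truncated_utility(s0, sigma, T=500):
    """Approximate W(s0, sigma) = sum_t delta^t r_t(s0, sigma) by truncation.
    For simplicity sigma is a stationary (state-only) strategy, a special
    case of the history-dependent strategies defined above."""
    s, total = s0, 0.0
    for t in range(T):
        a = sigma(s)
        total += delta ** t * r(s, a)
        s = f(s, a)
    return total

# Utility of the strategy that always plays a = 0.5, starting at s0 = 0.
print(truncated_utility(0.0, lambda s: 0.5))
```

The truncation error is at most $\delta^T K/(1-\delta)$, by the same geometric-series bound used in the first lemma below.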

While the first proposition mostly uses the definitions I provide above, the second proposition mostly relies on the definitions below.

$\textbf{Definition. Complete Metric Space}$

Let $M=(X, d)$ be a metric space where $X$ is the set of all bounded functions $F: D \rightarrow \mathbb{R}$ and $d$ is the metric induced by the supremum norm, $d(f, g)=\|f-g\|_D$ with $\|f\|_D=\sup_{x \in D}|f(x)|$. A sequence of functions $\left\{f_n\right\}_{n=1}^{\infty}$ is said to be Cauchy in $M$ if for each $\epsilon>0$, there exists a natural number $N$ such that $\left\|f_j-f_k\right\|_D=\sup_{x \in D}\left|f_j(x)-f_k(x)\right|<\epsilon$ for every $j, k \geq N$. A metric space is complete if every Cauchy sequence in $X$ converges to some point of $X$.
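
For intuition, here is a small sketch of this metric evaluated on a grid (the choice of $D = [0,1]$, the grid, and the example sequence are illustrative assumptions of mine, not from the post).

```python
import numpy as np

# Sup-norm distance d(F, G) = sup_{x in D} |F(x) - G(x)|, approximated on a
# finite grid standing in for D (an illustrative choice of D = [0, 1]).
D = np.linspace(0.0, 1.0, 1001)

def sup_dist(F, G):
    return np.max(np.abs(F(D) - G(D)))

# The sequence f_n(x) = x + 1/n is Cauchy in this metric: far-out terms are
# close to each other, and the sequence converges (uniformly) to f(x) = x.
f_n = lambda n: (lambda x: x + 1.0 / n)
print(sup_dist(f_n(10), f_n(1000)))        # about 0.099
print(sup_dist(f_n(1000), lambda x: x))    # about 0.001
```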

$\textbf{Definition. Contraction}$

Let $M=(X, d)$ be the metric space defined above. The operator $T: X \rightarrow X$ is a contraction if there exists $q \in[0,1)$ such that for all $w, v \in X$, $d(T(w), T(v)) \leq q\, d(w, v)$.

Two Important Propositions

The first proposition that I prove derives an alternative expression for $V(s)$, given the definition above. This alternative expression is extremely useful since it allows one to find the optimal strategy, whether by numerical or analytical methods. The second proposition proves that there is a unique bounded function that satisfies the alternative expression for $V(s)$.

$\textbf{Lemma.}$ If $r$ is bounded, then $V(s)=\sup_{\sigma \in \Sigma} W(s, \sigma)=\sup_{\sigma^{\prime} \in \Sigma} \sum_{t=0}^{\infty} \delta^t r_t\left(s, \sigma^{\prime}\right)$ is well defined.

$\textit{Proof.}$ Note that $-\frac{K}{1-\delta}=-K \sum_{t=0}^{\infty} \delta^t \leq \sum_{t=0}^{\infty} \delta^t r_t\left(s, \sigma^{\prime}\right) \leq K \sum_{t=0}^{\infty} \delta^t=\frac{K}{1-\delta}$, where $K$ is a bound on $|r|$. Hence, since the set $\{W(s, \sigma) \mid \sigma \in \Sigma\}$ is bounded and is a subset of $\mathbb{R}$, $\sup_{\sigma \in \Sigma} W(s, \sigma)$ exists.
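
A quick numerical check of this geometric-series bound (the values of $K$ and $\delta$ below are arbitrary illustrative choices):

```python
# Partial sums of K * delta^t increase toward the bound K / (1 - delta),
# so any discounted sum of rewards bounded by K in absolute value stays
# within [-K/(1-delta), K/(1-delta)]. K and delta are illustrative values.
K, delta = 2.0, 0.9
for T in (10, 100, 1000):
    print(T, sum(K * delta ** t for t in range(T)))
print("bound:", K / (1 - delta))   # 20.0
```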

$\textbf{Proposition.}$ If $r$ is bounded, then for every $s \in S$,

$$
V(s)=\sup_{a \in \Phi(s)}[r(s, a)+\delta V(f(s, a))].
$$

$\textit{Proof.}$ Note that for any strategy $\sigma \in \Sigma$ and $s \in S$,

$$
\begin{array}{l}
W(s, \sigma)=r_0(s, \sigma)+\delta\left(r_0\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right)+\cdots\right) \\
=r\left(s, \sigma_0(s)\right)+\delta \sum_{t=0}^{\infty} \delta^t r_t\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right) \\
=r\left(s, \sigma_0(s)\right)+\delta W\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right)
\end{array}
$$

where $\hat{\sigma}$ excludes $\sigma_0$ and shifts all $\sigma_t$ one step backward. Hence, by the definition of $V$, we have

$$
\begin{array}{l}
V(s)=\sup_{\sigma \in \Sigma} W(s, \sigma) \\
=\sup_{\sigma \in \Sigma}\left[r\left(s, \sigma_0(s)\right)+\delta W\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right)\right] \\
=\sup_{a \in \Phi(s)}\left(r(s, a)+\delta \sup_{\hat{\sigma} \in \Sigma} W(f(s, a), \hat{\sigma})\right) \\
=\sup_{a \in \Phi(s)}[r(s, a)+\delta V(f(s, a))].
\end{array}
$$

The second and third equalities hold by induction. The base case considers only $\sigma_0$ and $\sigma_1$. For the LHS we have,

$$
\sup_{\sigma_0 \in \Phi\left(s_0\right),\, \sigma_1 \in \Phi\left(f\left(s_0, \sigma_0\right)\right)} r\left(s_0, \sigma_0\right)+\delta W\left(f\left(s_0, \sigma_0\right), \sigma_1\right)=\sup_{\sigma_0, \sigma_1} r\left(s_0, \sigma_0\right)+\delta r\left(f\left(s_0, \sigma_0\right), \sigma_1\right)
$$

And for the RHS we have,

$$
\begin{array}{l}
\sup_{a_0 \in \Phi\left(s_0\right)}\left(r\left(s_0, a_0\right)+\delta \sup_{\sigma_1} W\left(f\left(s_0, a_0\right), \sigma_1\right)\right) \\
=\sup_{\sigma_0 \in \Phi\left(s_0\right)}\left(r\left(s_0, \sigma_0\right)+\delta \sup_{\sigma_1} W\left(f\left(s_0, \sigma_0\right), \sigma_1\right)\right) \\
=\sup_{\sigma_0, \sigma_1} r\left(s_0, \sigma_0\right)+\delta r\left(f\left(s_0, \sigma_0\right), \sigma_1\right)
\end{array}
$$

since $\sup_x(a+g(x))=a+\sup_x g(x)$ and $\sup_x(\delta g(x))=\delta \sup_x g(x)$ for any function $g$ and $\delta \geq 0$. Hence, the equality in the proposition is proven.
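
With the Bellman equation in hand, one way to approximate $V$ (and an optimal strategy) numerically is to iterate its right-hand side as an operator. Below is a minimal sketch on the toy model from the earlier snippet; the grid discretization, nearest-neighbor lookup, and the fixed number of iterations are all my own simplifications for illustration.

```python
import numpy as np

# Value iteration on the illustrative toy model: repeatedly apply
# T(w)(s) = max_{a in Phi(s)} [r(s, a) + delta * w(f(s, a))]
# on a finite grid, where the sup becomes a max and next states are snapped
# to the nearest grid point (a discretization shortcut, not part of the proof).
S_grid = np.linspace(0.0, 1.0, 11)
A_grid = np.linspace(0.0, 1.0, 11)
delta = 0.9
f = lambda s, a: np.clip(0.5 * s + 0.5 * a, 0.0, 1.0)
r = lambda s, a: s - 0.5 * a ** 2

def nearest(s):
    """Index of the grid point closest to state s."""
    return int(np.argmin(np.abs(S_grid - s)))

def T(w):
    """One application of the Bellman operator to a value vector w on S_grid."""
    return np.array([max(r(s, a) + delta * w[nearest(f(s, a))] for a in A_grid)
                     for s in S_grid])

w = np.zeros(len(S_grid))
for _ in range(300):
    w = T(w)

# w now approximately satisfies the Bellman equation w = T(w).
print("max |w - T(w)| =", np.max(np.abs(w - T(w))))
```

Reading off a maximizing action at each grid state then gives an approximately optimal stationary strategy for this discretized toy model.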

I now prove two lemmas that are needed for the second proposition. I could have included a proof of Banach’s fixed point theorem here, but I chose not to because there are many resources online that explain it.

$\textbf{Lemma.}$ The metric space $M=(X, d)$ is complete.

$\textit{Proof.}$ I first show that an arbitrary Cauchy sequence in $X$ converges pointwise (in fact, uniformly). Then I show that the function to which the sequence converges is in $X$.

Let $\left\{f_n\right\}_{n=1}^{\infty}$ be a Cauchy sequence of functions in $X$. Then for every $\epsilon>0$, there exists $N$ such that for every $j, k \geq N$, $\sup_{x \in D}\left|f_j(x)-f_k(x)\right|<\epsilon$. Hence, for any $t \in D$, $\left|f_j(t)-f_k(t)\right|<\epsilon$. Fixing $t$, this means that $\left\{f_n(t)\right\}_{n=1}^{\infty}$ is a Cauchy sequence of real numbers and therefore converges. Denote its limit by $f(t)$. We then have

$$
\lim_{n \rightarrow \infty} f_n(t)=f(t).
$$

Hence, letting $k \rightarrow \infty$, for all $j \geq N$ and $t \in D$ we have that $\left|f_j(t)-f(t)\right| \leq \epsilon$. This means that the Cauchy sequence of functions $\left\{f_n\right\}_{n=1}^{\infty}$ converges uniformly to $f$, i.e. it converges to $f$ in the metric $d$.

Now I will show that $f \in X$, that is, that $f$ is bounded. Since each function $f_n$ is bounded, there exists a constant $B_n$ such that $\left\|f_n\right\|_D=\sup_{x \in D}\left|f_n(x)\right| \leq B_n$. From above, I know that for a fixed $j \geq N$ and for any $k \geq N$, $\left\|f_j-f_k\right\|_D<\epsilon$. By the triangle inequality, for any $x \in D$,

$$
\left|f_k(x)\right| \leq\left|f_k(x)-f_j(x)\right|+\left|f_j(x)\right| \leq \epsilon+B_j.
$$

Since the above is true for all $x \in D$, we have that for all $k \geq N$

$$
\left\|f_k\right\|_D=\sup_{x \in D}\left|f_k(x)\right| \leq \epsilon+B_j.
$$

Let $B:=\epsilon+B_j$. Since the absolute value function is continuous on $\mathbb{R}$, for all $x \in D$ we have

$$
|f(x)|=\left|\lim_{k \rightarrow \infty} f_k(x)\right|=\lim_{k \rightarrow \infty}\left|f_k(x)\right| \leq B.
$$

Hence, since $|f(x)| \leq B$ for all $x \in D$, $\|f\|_D=\sup_{x \in D}|f(x)| \leq B$, so $f \in X$. Since the Cauchy sequence was arbitrary and converges to $f$ in $d$, the metric space $M$ is complete.

$\textbf{Lemma.}$ If $F, G \in X$, then

$$
\left|\sup_x F(x)-\sup_x G(x)\right| \leq \sup_x|F(x)-G(x)|.
$$

$\textit{Proof.}$ A sketch of the proof is as follows:

$$
\begin{array}{l}
\sup_x F(x)-\sup_x G(x) \\
\leq \sup_x[F(x)-G(x)] \\
\leq \sup_x|F(x)-G(x)|.
\end{array}
$$

Switching $G$ and $F$, I obtain

$$
\begin{array}{l}
\sup_x G(x)-\sup_x F(x) \\
\leq \sup_x|F(x)-G(x)|.
\end{array}
$$

Hence, this shows that $\left|\sup_x F(x)-\sup_x G(x)\right| \leq \sup_x|F(x)-G(x)|$. The first inequality in the sketch deserves a bit more justification: for every $x$, $F(x)=[F(x)-G(x)]+G(x) \leq \sup_x[F(x)-G(x)]+\sup_x G(x)$, so taking the supremum over $x$ on the left and rearranging gives $\sup_x F(x)-\sup_x G(x) \leq \sup_x[F(x)-G(x)]$.
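
The lemma is also easy to sanity-check numerically on a grid; the particular test functions below are arbitrary illustrative choices, and the check is of course not a proof.

```python
import numpy as np

# Randomized sanity check of |sup F - sup G| <= sup |F - G| on a grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
for _ in range(1000):
    c0, c1, c2 = rng.normal(size=3)
    F = c0 * np.sin(5 * x) + c1
    G = c2 * np.cos(3 * x)
    assert abs(F.max() - G.max()) <= np.max(np.abs(F - G)) + 1e-12
print("inequality held in every trial")
```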

$\textbf{Proposition.}$ There is a unique bounded function $w: S \rightarrow \mathbb{R}$ that satisfies

$$
w(s)=\sup_{a \in \Phi(s)}[r(s, a)+\delta w(f(s, a))].
$$

$\textit{Proof.}$ Let $M=(X, d)$ be the metric space defined above, taking $D=S$, so that $X$ is the set of bounded functions $w: S \rightarrow \mathbb{R}$. As shown in the lemma above, this metric space is complete. Define the mapping $T: X \rightarrow X$ by

$$
T(w)(s):=\sup_{a \in \Phi(s)}[r(s, a)+\delta w(f(s, a))]
$$

for all $s \in S$. A solution to the equation in the proposition is a fixed point of $T$ (i.e. $w(s)=T(w)(s)$ for all $s \in S$). By Banach’s fixed-point theorem, $T$ has a unique fixed point if it is a contraction mapping on the complete space $X$. $T$ is a contraction on $X$ because

$$
\begin{aligned}
d(T(w), T(v)) & =\sup_{s \in S}\left|\sup_{a \in \Phi(s)}[r(s, a)+\delta w(f(s, a))]-\sup_{a \in \Phi(s)}[r(s, a)+\delta v(f(s, a))]\right| \\
& \leq \sup_{s \in S, a \in \Phi(s)}|\delta(w(f(s, a))-v(f(s, a)))| \\
& =\delta \sup_{s \in S, a \in \Phi(s)}|w(f(s, a))-v(f(s, a))| \\
& \leq \delta \sup_{s \in S}|w(s)-v(s)| \\
& =\delta d(w, v).
\end{aligned}
$$

The first inequality is true because of the lemma above. The last inequality is true because $f(S, \Phi(S)) \subseteq S$. Hence, $T$ has a unique fixed point.
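
To see the contraction (and Banach’s conclusion) in action, here is a sketch on the same illustrative toy model used earlier: applying $T$ once to two arbitrary value vectors shrinks their sup-norm distance by a factor of at most $\delta$, and iterating from either starting point converges to the same fixed point. The model and the random starting points are my own assumptions for illustration.

```python
import numpy as np

# Contraction check for the Bellman operator T on the illustrative toy model.
S_grid = np.linspace(0.0, 1.0, 11)
A_grid = np.linspace(0.0, 1.0, 11)
delta = 0.9
f = lambda s, a: np.clip(0.5 * s + 0.5 * a, 0.0, 1.0)
r = lambda s, a: s - 0.5 * a ** 2
nearest = lambda s: int(np.argmin(np.abs(S_grid - s)))
T = lambda w: np.array([max(r(s, a) + delta * w[nearest(f(s, a))] for a in A_grid)
                        for s in S_grid])

rng = np.random.default_rng(1)
w, v = rng.normal(size=len(S_grid)), rng.normal(size=len(S_grid))

# d(T(w), T(v)) <= delta * d(w, v): one application contracts the distance.
print(np.max(np.abs(T(w) - T(v))) <= delta * np.max(np.abs(w - v)) + 1e-12)

# Iterating from two different starting points converges to the same
# (unique) fixed point, as Banach's theorem guarantees.
for _ in range(300):
    w, v = T(w), T(v)
print(np.max(np.abs(w - v)))   # essentially zero
```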