Details of Dynamic Programming Theory

I derive the Bellman equation and show that the value function is its unique bounded solution.

In this blog post, I dig into the details of how the Bellman equation is derived and show, using Banach’s fixed point theorem, that the value function is the unique bounded function that solves it.

Dynamic Programming Model

As in many classic dynamic programming models, let $t \in \mathbb{N} \cup \{0\}$ be the stage, $S \subseteq \mathbb{R}^n$ be the state space where $s_0 \in S$ is given, $A \subseteq \mathbb{R}^k$ be the action space, $\Phi: S \rightrightarrows A$ be a compact-valued, continuous correspondence giving the set of feasible actions at each state, $f: S \times A \rightarrow S$ be a continuous transition function, $r: S \times A \rightarrow \mathbb{R}$ be a bounded and continuous reward function, and $\delta \in [0,1)$ be the discount factor.

A classic dynamic programming problem involves several important objects, which I now define.

$\textbf{Definition. History}$

A $t$-history $\eta_t$ is a vector $\left(s_0, a_0, \ldots, s_{t-1}, a_{t-1}, s_t\right)$ of the state $s_\tau$ in each period $\tau$ up to $t$, the action $a_\tau$ taken that period, and the period-$t$ state $s_t$. $H_0 = S$ and $H_t$ denotes the set of all possible $t$-histories $\eta_t$.

$\textbf{Definition. Strategy}$

A strategy for period $t$ is a function $\sigma_t: H_t \rightarrow A$ such that $\sigma_t\left(s_0, a_0, \ldots, a_{t-1}, s_t\right) \in \Phi\left(s_t\right)$ for all $t$ and all $\left(s_0, a_0, \ldots, a_{t-1}, s_t\right) \in H_t$. A strategy $\sigma$ is a sequence of period strategies $\left(\sigma_t\right)_{t \in \mathbb{N} \cup \{0\}}$ where $\sigma_t$ is the strategy at period $t$. The set of all strategies will be denoted by $\Sigma$.

$\textbf{Definition. Reward, Utility, Value}$

For any strategy $\sigma \in \Sigma$, define

$$
\begin{array}{c}
a_0\left(s_0, \sigma\right)=\sigma_0\left(s_0\right) \\
s_1\left(s_0, \sigma\right)=f\left(s_0, a_0\left(s_0, \sigma\right)\right) \\
a_1\left(s_0, \sigma\right)=\sigma_1\left(s_0, a_0\left(s_0, \sigma\right), s_1\left(s_0, \sigma\right)\right)
\end{array}
$$

and other values of $s_t$ and $a_t$ by induction. The reward at period $t$ is defined as $r_t\left(s_0, \sigma\right)=r\left(s_t\left(s_0, \sigma\right), a_t\left(s_0, \sigma\right)\right)$ and the overall utility is defined as

$$
W\left(s_0, \sigma\right)=\sum_{t=0}^{\infty} \delta^t r_t\left(s_0, \sigma\right)
$$

The value function is defined as $V: S \rightarrow \mathbb{R}$ where $V(s)=\sup_{\sigma \in \Sigma} W(s, \sigma)$.
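
To make these definitions concrete, here is a minimal numerical sketch in Python of a toy model satisfying the assumptions above (the particular $S$, $A$, $f$, $r$, and $\delta$ are my own illustrative choices, not part of the derivation), together with a truncated approximation of the utility $W(s_0, \sigma)$ for a simple stationary strategy.

```python
import numpy as np

# Illustrative toy model (my own choices, not part of the derivation):
# S = A = [0, 1] discretized, delta in [0, 1), f maps S x A back into S,
# and r is bounded and continuous.
S_grid = np.linspace(0.0, 1.0, 11)     # state space S (as a grid)
A_grid = np.linspace(0.0, 1.0, 11)     # action space A (as a grid)
delta = 0.9                            # discount factor

def Phi(s):
    """Feasible action set Phi(s); here every action is feasible."""
    return A_grid

def f(s, a):
    """Transition function f: S x A -> S (clipped so f(S, A) stays in S)."""
    return float(np.clip(0.5 * s + 0.5 * a, 0.0, 1.0))

def r(s, a):
    """Bounded, continuous reward r: S x A -> R."""
    return s - 0.5 * a ** 2

def truncated_utility(s0, sigma, T=500):
    """Approximate W(s0, sigma) = sum_t delta^t r_t(s0, sigma) by truncation.
    For simplicity sigma is a stationary (state-only) strategy, a special
    case of the history-dependent strategies defined above."""
    s, total = s0, 0.0
    for t in range(T):
        a = sigma(s)
        total += delta ** t * r(s, a)
        s = f(s, a)
    return total

# Utility of the strategy that always plays a = 0.5, starting at s0 = 0.
print(truncated_utility(0.0, lambda s: 0.5))
```

The truncation error is at most $\delta^T K/(1-\delta)$, by the same geometric-series bound used in the first lemma below.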

While the first proposition mostly uses the definitions I provide above, the second proposition mostly relies on the definitions below.

$\textbf{Definition. Complete Metric Space}$

Let $M=(X, d)$ be a metric space where $X$ is the set of all bounded functions $F: D \rightarrow \mathbb{R}$ and $d$ is the metric induced by the supremum norm, $d(f, g)=\|f-g\|_D$ with $\|f\|_D=\sup_{x \in D}|f(x)|$. A sequence of functions $\left\{f_n\right\}_{n=1}^{\infty}$ is said to be Cauchy in $M$ if for each $\epsilon>0$, there exists a natural number $N$ such that $\left\|f_j-f_k\right\|_D=\sup_{x \in D}\left|f_j(x)-f_k(x)\right|<\epsilon$ for every $j, k \geq N$. A metric space is complete if every Cauchy sequence in $X$ converges to some point of $X$.
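
For intuition, here is a small sketch of this metric evaluated on a grid (the choice of $D = [0,1]$, the grid, and the example sequence are illustrative assumptions of mine, not from the post).

```python
import numpy as np

# Sup-norm distance d(F, G) = sup_{x in D} |F(x) - G(x)|, approximated on a
# finite grid standing in for D (an illustrative choice of D = [0, 1]).
D = np.linspace(0.0, 1.0, 1001)

def sup_dist(F, G):
    return np.max(np.abs(F(D) - G(D)))

# The sequence f_n(x) = x + 1/n is Cauchy in this metric: far-out terms are
# close to each other, and the sequence converges (uniformly) to f(x) = x.
f_n = lambda n: (lambda x: x + 1.0 / n)
print(sup_dist(f_n(10), f_n(1000)))        # about 0.099
print(sup_dist(f_n(1000), lambda x: x))    # about 0.001
```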

$\textbf{Definition. Contraction}$

Let $M=(X, d)$ be the metric space defined above. The operator $T: X \rightarrow X$ is a contraction if there exists $q \in[0,1)$ such that for all $w, v \in X$, $d(T(w), T(v)) \leq q\, d(w, v)$.

Two Important Propositions

The first proposition that I prove derives an alternative expression for $V(s)$, given the definition above. This alternative expression is extremely useful since it allows one to find the optimal strategy, whether by numerical or analytical methods. The second proposition proves that there is a unique bounded function that satisfies the alternative expression for $V(s)$.

$\textbf{Lemma.}$ If $r$ is bounded, then $V(s)=\sup_{\sigma \in \Sigma} W(s, \sigma)=\sup_{\sigma^{\prime} \in \Sigma} \sum_{t=0}^{\infty} \delta^t r_t\left(s, \sigma^{\prime}\right)$ is well defined.

$\textit{Proof.}$ Note that $-\frac{K}{1-\delta}=-K \sum_{t=0}^{\infty} \delta^t \leq \sum_{t=0}^{\infty} \delta^t r_t\left(s, \sigma^{\prime}\right) \leq K \sum_{t=0}^{\infty} \delta^t=\frac{K}{1-\delta}$, where $K$ is a bound on $|r|$. Hence, since the set $\{W(s, \sigma) \mid \sigma \in \Sigma\}$ is bounded and is a subset of $\mathbb{R}$, $\sup_{\sigma \in \Sigma} W(s, \sigma)$ exists.
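
A quick numerical check of this geometric-series bound (the values of $K$ and $\delta$ below are arbitrary illustrative choices):

```python
# Partial sums of K * delta^t increase toward the bound K / (1 - delta),
# so any discounted sum of rewards bounded by K in absolute value stays
# within [-K/(1-delta), K/(1-delta)]. K and delta are illustrative values.
K, delta = 2.0, 0.9
for T in (10, 100, 1000):
    print(T, sum(K * delta ** t for t in range(T)))
print("bound:", K / (1 - delta))   # 20.0
```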

$\textbf{Proposition.}$ If $r$ is bounded, then for every $s \in S$,

$$
V(s)=\sup_{a \in \Phi(s)}[r(s, a)+\delta V(f(s, a))].
$$

$\textit{Proof.}$ Note that for any strategy $\sigma \in \Sigma$ and $s \in S$,

$$
\begin{array}{l}
W(s, \sigma)=r_0(s, \sigma)+\delta\left(r_0\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right)+\cdots\right) \\
=r\left(s, \sigma_0(s)\right)+\delta \sum_{t=0}^{\infty} \delta^t r_t\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right) \\
=r\left(s, \sigma_0(s)\right)+\delta W\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right)
\end{array}
$$

where $\hat{\sigma}$ excludes $\sigma_0$ and shifts all $\sigma_t$ one step backward. Hence, by the definition of $V$, we have

$$
\begin{array}{l}
V(s)=\sup_{\sigma \in \Sigma} W(s, \sigma) \\
=\sup_{\sigma \in \Sigma}\left[r\left(s, \sigma_0(s)\right)+\delta W\left(f\left(s, \sigma_0(s)\right), \hat{\sigma}\right)\right] \\
=\sup_{a \in \Phi(s)}\left(r(s, a)+\delta \sup_{\hat{\sigma} \in \Sigma} W(f(s, a), \hat{\sigma})\right) \\
=\sup_{a \in \Phi(s)}[r(s, a)+\delta V(f(s, a))].
\end{array}
$$

The second and third equalities hold by induction. The base case considers only $\sigma_0$ and $\sigma_1$. For the LHS we have,

$$
\sup_{\sigma_0 \in \Phi\left(s_0\right),\, \sigma_1 \in \Phi\left(f\left(s_0, \sigma_0\right)\right)} r\left(s_0, \sigma_0\right)+\delta W\left(f\left(s_0, \sigma_0\right), \sigma_1\right)=\sup_{\sigma_0, \sigma_1} r\left(s_0, \sigma_0\right)+\delta r\left(f\left(s_0, \sigma_0\right), \sigma_1\right)
$$

And for the RHS we have,

$$
\begin{array}{l}
\sup_{a_0 \in \Phi\left(s_0\right)}\left(r\left(s_0, a_0\right)+\delta \sup_{\sigma_1} W\left(f\left(s_0, a_0\right), \sigma_1\right)\right) \\
=\sup_{\sigma_0 \in \Phi\left(s_0\right)}\left(r\left(s_0, \sigma_0\right)+\delta \sup_{\sigma_1} W\left(f\left(s_0, \sigma_0\right), \sigma_1\right)\right) \\
=\sup_{\sigma_0, \sigma_1} r\left(s_0, \sigma_0\right)+\delta r\left(f\left(s_0, \sigma_0\right), \sigma_1\right)
\end{array}
$$

since $\sup_x(a+g(x))=a+\sup_x g(x)$ and $\sup_x(\delta g(x))=\delta \sup_x g(x)$ for any function $g$ and $\delta \geq 0$. Hence, the equality in the proposition is proven.
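
With the Bellman equation in hand, one way to approximate $V$ (and an optimal strategy) numerically is to iterate its right-hand side as an operator. Below is a minimal sketch on the toy model from the earlier snippet; the grid discretization, nearest-neighbor lookup, and the fixed number of iterations are all my own simplifications for illustration.

```python
import numpy as np

# Value iteration on the illustrative toy model: repeatedly apply
# T(w)(s) = max_{a in Phi(s)} [r(s, a) + delta * w(f(s, a))]
# on a finite grid, where the sup becomes a max and next states are snapped
# to the nearest grid point (a discretization shortcut, not part of the proof).
S_grid = np.linspace(0.0, 1.0, 11)
A_grid = np.linspace(0.0, 1.0, 11)
delta = 0.9
f = lambda s, a: np.clip(0.5 * s + 0.5 * a, 0.0, 1.0)
r = lambda s, a: s - 0.5 * a ** 2

def nearest(s):
    """Index of the grid point closest to state s."""
    return int(np.argmin(np.abs(S_grid - s)))

def T(w):
    """One application of the Bellman operator to a value vector w on S_grid."""
    return np.array([max(r(s, a) + delta * w[nearest(f(s, a))] for a in A_grid)
                     for s in S_grid])

w = np.zeros(len(S_grid))
for _ in range(300):
    w = T(w)

# w now approximately satisfies the Bellman equation w = T(w).
print("max |w - T(w)| =", np.max(np.abs(w - T(w))))
```

Reading off a maximizing action at each grid state then gives an approximately optimal stationary strategy for this discretized toy model.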

I now prove two lemmas that are needed for the second proposition. I could have included a proof of Banach’s fixed point theorem here, but I chose not to because there are many resources online that explain it.

$\textbf{Lemma.}$ The metric space $M=(X, d)$ is complete.

$\textit{Proof.}$ I first show that an arbitrary Cauchy sequence in $X$ converges pointwise (in fact, uniformly). Then I show that the function to which the sequence converges is in $X$.

Let $\left\{f_n\right\}_{n=1}^{\infty}$ be a Cauchy sequence of functions in $X$. Then for every $\epsilon>0$, there exists $N$ such that for every $j, k \geq N$, $\sup_{x \in D}\left|f_j(x)-f_k(x)\right|<\epsilon$. Hence, for any $t \in D$, $\left|f_j(t)-f_k(t)\right|<\epsilon$. Fixing $t$, this means that $\left\{f_n(t)\right\}_{n=1}^{\infty}$ is a Cauchy sequence of real numbers and therefore converges. Denote its limit by $f(t)$. We then have

$$
\lim_{n \rightarrow \infty} f_n(t)=f(t).
$$

Hence, letting $k \rightarrow \infty$, for all $j \geq N$ and $t \in D$ we have that $\left|f_j(t)-f(t)\right| \leq \epsilon$. This means that the Cauchy sequence of functions $\left\{f_n\right\}_{n=1}^{\infty}$ converges uniformly to $f$, i.e. it converges to $f$ in the metric $d$.

Now I will show that $f \in X$, that is, that $f$ is bounded. Since each function $f_n$ is bounded, there exists a constant $B_n$ such that $\left\|f_n\right\|_D=\sup_{x \in D}\left|f_n(x)\right| \leq B_n$. From above, I know that for a fixed $j \geq N$ and for any $k \geq N$, $\left\|f_j-f_k\right\|_D<\epsilon$. By the triangle inequality, for any $x \in D$,

$$
\left|f_k(x)\right| \leq\left|f_k(x)-f_j(x)\right|+\left|f_j(x)\right| \leq \epsilon+B_j.
$$

Since the above is true for all $x \in D$, we have that for all $k \geq N$

$$
\left\|f_k\right\|_D=\sup_{x \in D}\left|f_k(x)\right| \leq \epsilon+B_j.
$$

Let $B:=\epsilon+B_j$. Since the absolute value function is continuous on $\mathbb{R}$, for all $x \in D$ we have

$$
|f(x)|=\left|\lim_{k \rightarrow \infty} f_k(x)\right|=\lim_{k \rightarrow \infty}\left|f_k(x)\right| \leq B.
$$

Hence, since $|f(x)| \leq B$ for all $x \in D$, $\|f\|_D=\sup_{x \in D}|f(x)| \leq B$, so $f \in X$. Since the Cauchy sequence was arbitrary and converges to $f$ in $d$, the metric space $M$ is complete.

$\textbf{Lemma.}$ If $F, G \in X$, then

$$
\left|\sup_x F(x)-\sup_x G(x)\right| \leq \sup_x|F(x)-G(x)|.
$$

$\textit{Proof.}$ A sketch of the proof is as follows:

$$
\begin{array}{l}
\sup_x F(x)-\sup_x G(x) \\
\leq \sup_x[F(x)-G(x)] \\
\leq \sup_x|F(x)-G(x)|.
\end{array}
$$

Switching $G$ and $F$, I obtain

$$
\begin{array}{l}
\sup_x G(x)-\sup_x F(x) \\
\leq \sup_x|F(x)-G(x)|.
\end{array}
$$

Hence, this shows that $\left|\sup_x F(x)-\sup_x G(x)\right| \leq \sup_x|F(x)-G(x)|$. The first inequality in the sketch deserves a bit more justification: for every $x$, $F(x)=[F(x)-G(x)]+G(x) \leq \sup_x[F(x)-G(x)]+\sup_x G(x)$, so taking the supremum over $x$ on the left and rearranging gives $\sup_x F(x)-\sup_x G(x) \leq \sup_x[F(x)-G(x)]$.
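
The lemma is also easy to sanity-check numerically on a grid; the particular test functions below are arbitrary illustrative choices, and the check is of course not a proof.

```python
import numpy as np

# Randomized sanity check of |sup F - sup G| <= sup |F - G| on a grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
for _ in range(1000):
    c0, c1, c2 = rng.normal(size=3)
    F = c0 * np.sin(5 * x) + c1
    G = c2 * np.cos(3 * x)
    assert abs(F.max() - G.max()) <= np.max(np.abs(F - G)) + 1e-12
print("inequality held in every trial")
```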

$\textbf{Proposition.}$ There is a unique bounded function $w: S \rightarrow \mathbb{R}$ that satisfies

$$
w(s)=\sup_{a \in \Phi(s)}[r(s, a)+\delta w(f(s, a))].
$$

$\textit{Proof.}$ Let $M=(X, d)$ be the metric space defined above, taking $D=S$, so that $X$ is the set of bounded functions $w: S \rightarrow \mathbb{R}$. As shown in the lemma above, this metric space is complete. Define the mapping $T: X \rightarrow X$ by

$$
T(w)(s):=\sup_{a \in \Phi(s)}[r(s, a)+\delta w(f(s, a))]
$$

for all $s \in S$. A solution to the equation in the proposition is a fixed point of $T$ (i.e. $w(s)=T(w)(s)$ for all $s \in S$). By Banach’s fixed-point theorem, $T$ has a unique fixed point if it is a contraction mapping on the complete space $X$. $T$ is a contraction on $X$ because

$$
\begin{aligned}
d(T(w), T(v)) & =\sup_{s \in S}\left|\sup_{a \in \Phi(s)}[r(s, a)+\delta w(f(s, a))]-\sup_{a \in \Phi(s)}[r(s, a)+\delta v(f(s, a))]\right| \\
& \leq \sup_{s \in S, a \in \Phi(s)}|\delta(w(f(s, a))-v(f(s, a)))| \\
& =\delta \sup_{s \in S, a \in \Phi(s)}|w(f(s, a))-v(f(s, a))| \\
& \leq \delta \sup_{s \in S}|w(s)-v(s)| \\
& =\delta d(w, v).
\end{aligned}
$$

The first inequality is true because of the lemma above. The last inequality is true because $f(S, \Phi(S)) \subseteq S$. Hence, $T$ has a unique fixed point.
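
To see the contraction (and Banach’s conclusion) in action, here is a sketch on the same illustrative toy model used earlier: applying $T$ once to two arbitrary value vectors shrinks their sup-norm distance by a factor of at most $\delta$, and iterating from either starting point converges to the same fixed point. The model and the random starting points are my own assumptions for illustration.

```python
import numpy as np

# Contraction check for the Bellman operator T on the illustrative toy model.
S_grid = np.linspace(0.0, 1.0, 11)
A_grid = np.linspace(0.0, 1.0, 11)
delta = 0.9
f = lambda s, a: np.clip(0.5 * s + 0.5 * a, 0.0, 1.0)
r = lambda s, a: s - 0.5 * a ** 2
nearest = lambda s: int(np.argmin(np.abs(S_grid - s)))
T = lambda w: np.array([max(r(s, a) + delta * w[nearest(f(s, a))] for a in A_grid)
                        for s in S_grid])

rng = np.random.default_rng(1)
w, v = rng.normal(size=len(S_grid)), rng.normal(size=len(S_grid))

# d(T(w), T(v)) <= delta * d(w, v): one application contracts the distance.
print(np.max(np.abs(T(w) - T(v))) <= delta * np.max(np.abs(w - v)) + 1e-12)

# Iterating from two different starting points converges to the same
# (unique) fixed point, as Banach's theorem guarantees.
for _ in range(300):
    w, v = T(w), T(v)
print(np.max(np.abs(w - v)))   # essentially zero
```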