I derive the Bellman equation and show that there is a unique bounded function that solves it.
In this article, I dig into the details of how the Bellman equation is derived and use Banach’s fixed-point theorem to show that there is a unique value function that solves it.
Dynamic Programming Model
Just like in many classic dynamic programming models, let t∈N∪{0} be the stage, S⊆Rn be the state space where the initial state s0∈S is given, A⊆Rk be the action space, Φ be a compact-valued continuous correspondence mapping each state s∈S to its set of feasible actions Φ(s)⊆A, f:S×A→S be a continuous transition function, r:S×A→R be a bounded and continuous reward function, and δ∈[0,1) be the discount factor.
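To make these primitives concrete, here is a minimal sketch in Python of one hypothetical instance of this setup: a discretized “cake-eating” problem. The specific grid, square-root reward, and discount factor below are my own illustrative assumptions, not part of the general model.

```python
import numpy as np

# Hypothetical, discretized instance of the primitives above (illustration only):
#   state s  = amount of cake left, on a grid approximating S = [0, 1]
#   Phi(s)   = {a : 0 <= a <= s}, a compact set of feasible actions
#   f(s, a)  = s - a, the transition function
#   r(s, a)  = sqrt(a), a bounded continuous reward
#   delta    = 0.95, the discount factor

GRID = np.linspace(0.0, 1.0, 101)   # discretized state/action space
DELTA = 0.95                        # delta in [0, 1)

def feasible_actions(s):
    """Phi(s): the feasible actions at state s."""
    return GRID[GRID <= s + 1e-12]

def transition(s, a):
    """f(s, a): next period's state."""
    return s - a

def reward(s, a):
    """r(s, a): the per-period reward."""
    return np.sqrt(a)
```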
A classic dynamic programming problem involves several important objects, which I now define.
Definition. History
A t-history ηt is a vector (s0,a0,…,st−1,at−1,st) of the state sτ in each period τ up to t, the action aτ taken that period, and the period-t state st. H0=S and Ht denotes the set of all possible t-histories ηt.
Definition. Strategy
A strategy for period t is a function σt:Ht→A s.t. σt(s0,a0,…,at−1,st)∈Φ(st) for all (s0,a0,…,at−1,st)∈Ht. A strategy σ is a sequence (σt)t∈N∪{0} of period strategies, where σt is the strategy for period t. The set of all strategies will be denoted by Σ.
Definition. Reward, Utility, Value
For any strategy σ∈Σ, define
a0(s0,σ)=σ0(s0)
s1(s0,σ)=f(s0,a0(s0,σ))
a1(s0,σ)=σ1(s0,a0(s0,σ),s1(s0,σ))
and other values of st and at by induction. The reward at period t is defined as rt(s0,σ)= r(st(s0,σ),at(s0,σ)) and the overall utility is defined as
W(s0,σ)=∑t=0∞δtrt(s0,σ)
The value function is defined as V:S→R where V(s)=supσ∈ΣW(s,σ).
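As a quick illustration of these definitions, the sketch below (continuing the hypothetical cake-eating example, with a stationary “eat half of what is left” strategy chosen purely for concreteness) approximates W(s0,σ) by truncating the infinite sum; since r is bounded and δ<1, the discarded tail is small.

```python
import numpy as np

# Approximating W(s0, sigma) = sum_{t>=0} delta^t r_t(s0, sigma) for the
# hypothetical cake-eating example, under the (arbitrary, illustrative)
# stationary strategy sigma_t(..., s_t) = 0.5 * s_t.

DELTA = 0.95

def transition(s, a):          # f(s, a)
    return s - a

def reward(s, a):              # r(s, a)
    return np.sqrt(a)

def strategy(s):               # this sigma_t only looks at the current state
    return 0.5 * s

def truncated_utility(s0, horizon=500):
    """Sum the first `horizon` terms of the series defining W(s0, sigma)."""
    s, total = s0, 0.0
    for t in range(horizon):
        a = strategy(s)                    # a_t(s0, sigma)
        total += DELTA**t * reward(s, a)   # delta^t * r_t(s0, sigma)
        s = transition(s, a)               # s_{t+1}(s0, sigma) = f(s_t, a_t)
    return total

print(truncated_utility(1.0))   # an estimate of W(1.0, sigma)
```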
While the first proposition mostly uses the definitions given above, the second mostly relies on the definitions below.
Definition. Complete Metric Space
Let M=(X,d) be a metric space where X is the set of all bounded functions F:D→R and d is the metric induced by the supremum norm ∥f∥D=supx∈D∣f(x)∣, i.e. d(f,g)=∥f−g∥D. A sequence of functions {fn}n=1∞ is said to be Cauchy in M if for each ϵ>0, there exists a natural number N such that ∥fj−fk∥D=supx∈D∣fj(x)−fk(x)∣<ϵ for every j,k≥N. A metric space is complete if every Cauchy sequence in X converges to some point of X.
Definition. Contraction
Let M=(X,d) be the metric space defined above. The operator T:X→X is a contraction if there exists q∈[0,1) such that for all w,v∈X,d(T(w),T(v))≤qd(w,v).
Two Important Propositions
The first proposition derives an alternative expression for V(s) from the definition given above. This alternative expression is extremely useful since it allows one to find the optimal strategy, whether by numerical or analytical methods. The second proposition proves that there is a unique bounded function satisfying this alternative expression for V(s).
Lemma.
If r is bounded, then V(s)=supσ∈ΣW(s,σ)=supσ′∈Σ∑t=0∞δtrt(s,σ′) is well defined.
Proof.
Note that −K/(1−δ)=−K∑t=0∞δt≤∑t=0∞δtrt(s,σ′)≤K∑t=0∞δt=K/(1−δ), where K is a bound on ∣r∣. Hence, the set {W(s,σ)∣σ∈Σ} is a bounded subset of R, so supσ∈ΣW(s,σ) exists.
Proposition.
If r is bounded, then for every s∈S,
V(s)=supa∈Φ(s)[r(s,a)+δV(f(s,a))].
Proof.
Note that for any strategy σ∈Σ and s∈S,
W(s,σ)=r0(s,σ)+δ(r0(f(s,σ0(s)),σ^)+⋯)
=r(s,σ0(s))+δ∑t=0∞δtrt(f(s,σ0(s)),σ^)
=r(s,σ0(s))+δW(f(s,σ0(s)),σ^)
where σ^ is the continuation strategy obtained from σ by dropping σ0 and shifting every later σt back one period. Hence, by definition of V, we have
V(s)=supσ∈ΣW(s,σ)
=supσ∈Σ[r(s,σ0(s))+δW(f(s,σ0(s)),σ^)]
=supa∈Φ(s)(r(s,a)+δsupσ^∈ΣW(f(s,a),σ^))
=supa∈Φ(s)[r(s,a)+δV(f(s,a))].
The second and third equalities hold by induction; the base case considers only σ0 and σ1. For the LHS we have
supσ0∈Φ(s0),σ1∈Φ(f(s0,σ0))[r(s0,σ0)+δW(f(s0,σ0),σ1)]=supσ0,σ1[r(s0,σ0)+δr(f(s0,σ0),σ1)]
And for the RHS we have,
supa0∈Φ(s0)(r(s0,a0)+δsupσ1W(f(s0,a0),σ1))
=supσ0∈Φ(s0)(r(s0,σ0)+δsupσ1W(f(s0,σ0),σ1))
=supσ0,σ1[r(s0,σ0)+δr(f(s0,σ0),σ1)]
since supx(a+f(x))=a+supx(f(x)) and supx(δf(x))=δsupx(f(x)). Hence, the equality in the proposition is proven.
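The practical payoff of this alternative expression is that V can be approximated by repeatedly applying its right-hand side, after which an (approximately) optimal action at each state is a maximizer of r(s,a)+δV(f(s,a)). The sketch below does exactly this on the hypothetical discretized cake-eating example from earlier; the grid, reward, and transition are my illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Value iteration on the hypothetical cake-eating example: repeatedly apply
#   V(s) <- max_{a in Phi(s)} [ r(s, a) + delta * V(f(s, a)) ]
# on a grid, then record a maximizing action at each state.

GRID = np.linspace(0.0, 1.0, 101)
DELTA = 0.95

def bellman_update(V):
    """One application of the right-hand side of the Bellman equation."""
    new_V = np.zeros_like(V)
    policy = np.zeros_like(V)
    for i, s in enumerate(GRID):
        actions = GRID[GRID <= s + 1e-12]                                # Phi(s)
        next_idx = np.rint((s - actions) * (len(GRID) - 1)).astype(int)  # f(s, a) as a grid index
        values = np.sqrt(actions) + DELTA * V[next_idx]                  # r(s, a) + delta * V(f(s, a))
        best = int(np.argmax(values))
        new_V[i], policy[i] = values[best], actions[best]
    return new_V, policy

V = np.zeros(len(GRID))
for _ in range(300):
    V, policy = bellman_update(V)

print(V[-1], policy[-1])   # approximate V(1.0) and a maximizing first action
```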
I now prove two lemmas that are needed for the second proposition. I could have included a proof of Banach’s fixed-point theorem here, but I chose not to because there are many resources online that explain and prove it.
Lemma.
The metric space M=(X,d) is complete.
Proof.
I first show that an arbitrary Cauchy sequence in X converges pointwise to some function f and that this convergence is in fact uniform, i.e. convergence in the metric d. Then I show that the limit f belongs to X.
Let {fn}n=1∞ be a Cauchy sequence of functions in X. Then for every ϵ>0, there exists N such that for every j,k≥N, supx∈D∣fj(x)−fk(x)∣<ϵ. Hence, for any t∈D, ∣fj(t)−fk(t)∣<ϵ. Fixing t, this means that {fn(t)}n=1∞ is a Cauchy sequence of real numbers, which converges because R is complete. Denote its limit by f(t), so that
limn→∞fn(t)=f(t).
Letting k→∞ in the inequality above, for all j≥N and all t∈D we have ∣fj(t)−f(t)∣≤ϵ. Since this bound does not depend on t, the Cauchy sequence of functions {fn}n=1∞ converges to f uniformly on D, i.e. in the metric d, and in particular pointwise.
Now I will show that f∈X, i.e. that f is bounded. Since each function fn is bounded, there exists a constant Bn such that ∥fn∥D=supx∈D∣fn(x)∣≤Bn. From above, I know that for a fixed j≥N and for any k≥N, ∥fj−fk∥D<ϵ. By the triangle inequality, for any x∈D and any k≥N,
∣fk(x)∣≤∣fk(x)−fj(x)∣+∣fj(x)∣≤ϵ+Bj.
Since the above is true for all x∈D, we have that for all k≥N
∥fk∥D=supx∈D∣fk(x)∣≤ϵ+Bj.
Let B:=ϵ+Bj. Since the absolute value function is continuous on R, for all x∈D,
∣f(x)∣=∣limk→∞fk(x)∣=limk→∞∣fk(x)∣≤B.
Hence, ∣f(x)∣≤B for all x∈D, so ∥f∥D=supx∈D∣f(x)∣≤B and f∈X. Since the arbitrary Cauchy sequence {fn}n=1∞ converges in d to a limit f∈X, the metric space M is complete.
Lemma.
If F,G∈X, then
∣supxF(x)−supxG(x)∣≤supx∣F(x)−G(x)∣.
Proof.
A sketch of the proof is as follows
supxF(x)−supxG(x)≤supx(F(x)−G(x))≤supx∣F(x)−G(x)∣.
When switching G and F, I obtain
supxG(x)−supxF(x)≤supx∣F(x)−G(x)∣.
Hence, this shows that ∣supxF(x)−supxG(x)∣≤supx∣F(x)−G(x)∣. The first inequality in the sketch deserves justification: for each x, F(x)=G(x)+(F(x)−G(x))≤supyG(y)+supy(F(y)−G(y)), and taking the supremum over x gives supxF(x)−supxG(x)≤supx(F(x)−G(x)).
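For readers who prefer a quick sanity check alongside the argument, the snippet below evaluates both sides of the inequality for two bounded functions on a finite grid; the particular F and G are made up for illustration.

```python
import numpy as np

# Numerical sanity check of |sup F - sup G| <= sup |F - G| for two arbitrary
# bounded functions on a finite grid (F and G are made-up examples).

x = np.linspace(0.0, 2.0 * np.pi, 1000)
F = np.sin(x)
G = 0.5 * np.cos(3.0 * x) + 0.2

lhs = abs(F.max() - G.max())    # |sup_x F(x) - sup_x G(x)|
rhs = np.max(np.abs(F - G))     # sup_x |F(x) - G(x)|
print(lhs, rhs, lhs <= rhs)     # the inequality holds
```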
Proposition.
There is a unique bounded function w:S→R that satisfies
w(s)=supa∈Φ(s)[r(s,a)+δw(f(s,a))].
Proof.
Let M=(X,d) be the metric space defined above with D=S, so that X is the set of all bounded functions w:S→R. As shown in the first lemma, this metric space is complete. Define the mapping T:X→X by
T(w)(s):=supa∈Φ(s)[r(s,a)+δw(f(s,a))]
for all s∈S. Since r is bounded and δ<1, T(w) is bounded whenever w is, so T indeed maps X into X. A solution to the equation in the proposition is a fixed point of T (i.e. w(s)=T(w)(s) for all s∈S). By Banach’s fixed-point theorem, T has a unique fixed point if it is a contraction mapping on the complete metric space X. T is a contraction on X because
d(T(w),T(v))=sups∈S∣supa∈Φ(s)[r(s,a)+δw(f(s,a))]−supa∈Φ(s)[r(s,a)+δv(f(s,a))]∣
≤sups∈S,a∈Φ(s)∣δ(w(f(s,a))−v(f(s,a)))∣
=δsups∈S,a∈Φ(s)∣w(f(s,a))−v(f(s,a))∣
≤δsups∈S∣w(s)−v(s)∣
=δd(w,v).
The first inequality holds by the second lemma, applied for each fixed s to the functions a↦r(s,a)+δw(f(s,a)) and a↦r(s,a)+δv(f(s,a)). The last inequality holds because f(s,a)∈S for every s∈S and a∈Φ(s). Hence, T has a unique fixed point, and since the first proposition shows that the (bounded) value function V satisfies this equation, V is exactly that unique fixed point.
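To see the contraction at work numerically, the sketch below applies the operator T of the hypothetical cake-eating example to two very different bounded starting functions and prints the sup-norm distance between the iterates; each application shrinks the distance by at least the factor δ, so both sequences are driven toward the same fixed point, as Banach’s theorem guarantees. As before, the grid, reward, and discount factor are illustrative assumptions.

```python
import numpy as np

# Applying the Bellman operator T of the hypothetical cake-eating example to
# two different bounded starting functions w and v.  The printed distances
# d(T^n w, T^n v) shrink by at least the factor delta per application, so both
# iterate sequences approach the same (unique) fixed point.

GRID = np.linspace(0.0, 1.0, 101)
DELTA = 0.95

def T(w):
    """T(w)(s) = max_{a in Phi(s)} [ r(s, a) + delta * w(f(s, a)) ] on the grid."""
    out = np.zeros_like(w)
    for i, s in enumerate(GRID):
        actions = GRID[GRID <= s + 1e-12]                                # Phi(s)
        next_idx = np.rint((s - actions) * (len(GRID) - 1)).astype(int)  # f(s, a) as a grid index
        out[i] = np.max(np.sqrt(actions) + DELTA * w[next_idx])
    return out

w = np.zeros(len(GRID))           # one arbitrary bounded starting function
v = 10.0 * np.ones(len(GRID))     # another one, far away in the sup norm
for n in range(10):
    print(n, np.max(np.abs(w - v)))   # d(T^n w, T^n v)
    w, v = T(w), T(v)
```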