An Alternative Algorithm to Early Stopping

I describe an algorithm for determining the number-of-iterations hyperparameter of a neural network.

In the deep learning literature, the early stopping algorithm determines the optimal number of iterations that a neural network runs in order to update its weights. One drawback of the algorithm is that it minimizes only the validation set error (which differs from the training set error). Hence, early stopping may underfit the data (as evidenced by a high training set error). Instead of focusing solely on minimizing the validation set error, what if the training set error were also incorporated into determining the optimal number of iterations? Would this be a good idea?

In theoretical economics, many models start from a few general properties and derive conclusions from them. In that spirit, I propose some general properties that I would like an algorithm to satisfy when determining the optimal number-of-iterations hyperparameter. I call this new algorithm A-Early Stopping (AES) from now on.

I now propose the two properties that AES focuses on. The first is minimizing the distance between the validation and training set errors; the second is minimizing the total error. For the former, since we want to balance these errors so as to neither underfit nor overfit, we want the two numbers to be as close to each other as possible. There are a few ways to measure the distance between them.

The first way is to use the absolute distance: given positive real numbers $x$ and $y$, the absolute distance is simply $|x-y|$. The second way is to use the relative distance: given positive real numbers $x$ and $y$, the relative distance is $\left|\frac{x}{y}-1\right|$. The primary difference between the two is that, for a fixed absolute gap, the relative distance is smaller when both numbers are large than when both are small. For example, in the table below, minimizing the relative distance would favor a training set error of 2 and a validation set error of 3 over errors of 1 and 2.

$$ \begin{array}{|c|c|c|c|} \hline \text{Training Set Error} & \text{Validation Set Error} & \text{Absolute} & \text{Relative} \\ \hline 1 & 2 & 1 & 0.5 \\ \hline 2 & 3 & 1 & \sim 0.33 \\ \hline \end{array} $$
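
To make the two distance measures concrete, here is a minimal Python sketch (the function names are my own) that reproduces the values in the table:

```python
def absolute_distance(x, y):
    """Absolute distance |x - y| between two positive reals."""
    return abs(x - y)


def relative_distance(x, y):
    """Relative distance |x/y - 1| between two positive reals."""
    return abs(x / y - 1)


# Rows of the table: (training set error, validation set error)
for t, v in [(1, 2), (2, 3)]:
    print(t, v, absolute_distance(t, v), round(relative_distance(t, v), 2))
# 1 2 1 0.5
# 2 3 1 0.33
```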

The second property of AES is simply to minimize the total error, i.e. the sum of the validation and training set errors. Hence, the objective function that the algorithm minimizes consists of two parts.

AES Algorithm

Suppose there are $R \in \mathbb{N}_{+}$ total iterations that the neural network runs in order to update its parameters. For all $i=1, \ldots, R$, let $v_{i}$ and $t_{i}$ be the validation error and training error on iteration $i$. The absolute error on any iteration $i$ is $\left|v_{i}-t_{i}\right|$. Likewise, the relative error is $\left|\frac{v_{i}}{t_{i}}-1\right|$. Let the function $P(v, t)$ be either the absolute error $|v-t|$ or the relative error $\left|\frac{v}{t}-1\right|$.

The objective function to find the optimal number of iterations is then as follows.

$$ \underset{i \in\{1, \ldots, R\}}{\operatorname{argmin}}\left(v_{i}+t_{i}\right)+P\left(v_{i}, t_{i}\right) $$

The algorithm that solves the problem above should then store the parameters associated with that optimal number of iterations.
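
Below is a minimal sketch of how AES could sit inside a training loop. The model and the helpers `train_one_iteration`, `training_error`, and `validation_error` are placeholders for whatever framework is being used; only the selection rule implements the objective above.

```python
import copy


def aes(model, train_one_iteration, training_error, validation_error, R,
        relative=True):
    """Run R iterations and keep the parameters of the iteration that
    minimizes (v_i + t_i) + P(v_i, t_i)."""
    best_objective = float("inf")
    best_model = None
    best_iteration = None

    for i in range(1, R + 1):
        train_one_iteration(model)       # one parameter update
        t = training_error(model)        # t_i
        v = validation_error(model)      # v_i
        # P(v_i, t_i): relative error |v/t - 1| or absolute error |v - t|
        p = abs(v / t - 1) if relative else abs(v - t)
        objective = (v + t) + p
        if objective < best_objective:   # store the best parameters so far
            best_objective = objective
            best_model = copy.deepcopy(model)
            best_iteration = i

    return best_model, best_iteration
```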

In order to further analyze the relative error function, I make an important definition. The error function $P_{1}$ prefers lowering the training set error over the validation set error compared to another error function $P_{2}$ if the training set error is lower than the validation set error if and only if $P_{1}(v, t)$ is less than $P_{2}(v, t)$, i.e. $t<v \iff P_{1}(v, t)<P_{2}(v, t)$. Likewise, the error function $P_{1}$ prefers lowering the validation set error over the training set error compared to $P_{2}$ if the validation set error is lower than the training set error if and only if $P_{1}(v, t)$ is less than $P_{2}(v, t)$.

Now consider another problem involving two different cases of training and validation set errors. Let $\left(t_1, v_1\right)$ be the training and validation set errors for one case and $\left(t_2, v_2\right)$ be the errors for the second case. Further, since the training error is typically lower than the validation error because of overfitting, suppose that $t_1<v_1$ and $t_2<v_2$. I want to make sure that when the training set errors are equal but the validation set error is lower in one case, the objective function prefers the case with the lower validation set error, and likewise that when the validation set errors are equal, it prefers the case with the lower training set error. For example, if $\left(t_1, v_1\right)=(1,6)$ and $\left(t_2, v_2\right)=(4,6)$, the objective function should produce a lower number for $(1,6)$ than for $(4,6)$. To capture this idea, I create another definition. The objective function $O$ is rational if whenever $t_1=t_2$ and $v_1>v_2$, $O\left(v_1, t_1\right)>O\left(v_2, t_2\right)$, and whenever $v_1=v_2$ and $t_1>t_2$, $O\left(v_1, t_1\right)>O\left(v_2, t_2\right)$.
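
As a quick numeric illustration of the example above (a sketch, using the relative error $\left|\frac{t}{v}-1\right|$ that is analyzed in the next section), the objective indeed assigns a lower value to $(1,6)$ than to $(4,6)$:

```python
def objective(t, v):
    # O(v, t) = v + t + |t/v - 1|
    return v + t + abs(t / v - 1)


print(objective(1, 6))  # ~7.83  for (t1, v1) = (1, 6)
print(objective(4, 6))  # ~10.33 for (t2, v2) = (4, 6)
# The lower value goes to (1, 6), as rationality requires when the
# validation errors are equal and the training error is smaller.
```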

Analysis of AES Algorithm

I dig a little deeper into the implications of this simple algorithm.

\textbf{Proposition.} The relative error function $\left|\frac{t}{v}-1\right|$ prefers lowering the training set error over the validation set error compared to the relative error function $\left|\frac{v}{t}-1\right|$. Further, the relative error function $\left|\frac{v}{t}-1\right|$ prefers lowering the validation set error over the training set error compared to $\left|\frac{t}{v}-1\right|$.

\textit{Proof.} Suppose $0<t<v$ and write $v=t+k$ where $k>0$. Then we have

$$ \left|\frac{t}{v}-1\right|=\left|\frac{t}{t+k}-1\right|=\left|\frac{t-(t+k)}{t+k}\right|=\left|\frac{k}{t+k}\right| $$

and

$$ \left|\frac{v}{t}-1\right|=\left|\frac{t+k}{t}-1\right|=\left|1+\frac{k}{t}-1\right|=\left|\frac{k}{t}\right|. $$

Hence, since $\left|\frac{k}{t+k}\right|<\left|\frac{k}{t}\right|$, we have that $\left|\frac{t}{v}-1\right|<\left|\frac{v}{t}-1\right|$. Conversely, $\left|\frac{t}{v}-1\right|=\frac{|t-v|}{v}<\left|\frac{v}{t}-1\right|=\frac{|v-t|}{t} \Rightarrow \frac{1}{v}<\frac{1}{t} \Rightarrow t<v$. By the same argument with the roles of $t$ and $v$ exchanged, if $0<v<t$, then $\left|\frac{v}{t}-1\right|<\left|\frac{t}{v}-1\right|$.

The above statement is relevant because the choice of the relative error function matters. For example, if one wanted to focus on reducing the validation error, then $\left|\frac{v}{t}-1\right|$ should be chosen over $\left|\frac{t}{v}-1\right|$. Hence, it should be noted that the objective function $(v+t)+\left|\frac{v}{t}-1\right|$ prefers lowering the validation error over the training error.

For the absolute error function $|t-v|$, it can be said that it is indifferent between the training and validation errors because $|t-v|=|v-t|$. Hence, when the validation and training errors are switched, the absolute error remains the same.
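
A quick numeric check of both observations, assuming a typical overfitting case with $t<v$:

```python
t, v = 2.0, 3.0                 # training error below validation error (t < v)

print(abs(t / v - 1))           # ~0.33: |t/v - 1|, the smaller relative error
print(abs(v / t - 1))           # 0.5  : |v/t - 1|, the larger relative error
print(abs(t - v), abs(v - t))   # 1.0 1.0: the absolute error is symmetric
```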

\textbf{Proposition.} The objective function $t+v+\left|\frac{t}{v}-1\right|$ is rational when $v>1$.

\textit{Proof.} Case 1. Suppose that $0<t_{2}=t_{1}<v_{2}<v_{1}$. Then we know that $\frac{1}{v_{1}}-\frac{1}{v_{2}}<0$ and $v_{1}-v_{2}>0$. Hence,

$$ \begin{gathered} v_{1}-v_{2}>t_{1}\left(\frac{1}{v_{1}}-\frac{1}{v_{2}}\right) \\ \Rightarrow v_{1}-\frac{t_{1}}{v_{1}}+1>v_{2}-\frac{t_{2}}{v_{2}}+1 \\ \Rightarrow t_{1}+v_{1}-\frac{t_{1}}{v_{1}}+1>t_{2}+v_{2}-\frac{t_{2}}{v_{2}}+1. \end{gathered} $$

Further, since $\frac{t_{1}}{v_{1}}<1$ and $\frac{t_{2}}{v_{2}}<1$, we have $-\frac{t_{1}}{v_{1}}+1>0$ and $-\frac{t_{2}}{v_{2}}+1>0$. Hence, $\left|\frac{t_{1}}{v_{1}}-1\right|=-\frac{t_{1}}{v_{1}}+1$ and $\left|\frac{t_{2}}{v_{2}}-1\right|=-\frac{t_{2}}{v_{2}}+1$. Therefore, we have

$$ t_{1}+v_{1}+\left|\frac{t_{1}}{v_{1}}-1\right|>t_{2}+v_{2}+\left|\frac{t_{2}}{v_{2}}-1\right|. $$

Case 2. Suppose $1<v_{1}=v_{2}$ and $t_{2}<t_{1}<v_{1}$ (so $t_{2}<v_{2}$ as well). Then we have $1>\frac{1}{v_{1}}$ and $t_{1}-t_{2}>0$, which imply

$$ \begin{gathered} t_{1}-t_{2}>\frac{t_{1}}{v_{1}}-\frac{t_{2}}{v_{2}} \\ \Rightarrow t_{1}-\frac{t_{1}}{v_{1}}+1>t_{2}-\frac{t_{2}}{v_{2}}+1 \\ \Rightarrow t_{1}+v_{1}+\left|\frac{t_{1}}{v_{1}}-1\right|>t_{2}+v_{2}+\left|\frac{t_{2}}{v_{2}}-1\right|. \end{gathered} $$
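
The two cases of the proposition can also be spot-checked numerically. The sketch below (an illustration, not part of the proof) samples random error pairs with $v>1$ and $t<v$ and verifies both rationality conditions for $t+v+\left|\frac{t}{v}-1\right|$:

```python
import random


def objective(t, v):
    return t + v + abs(t / v - 1)


random.seed(0)
for _ in range(1000):
    # Case 1: equal training errors, t < v2 < v1 with both v's above 1.
    t = random.uniform(0.01, 5)
    v2 = random.uniform(max(t, 1) + 0.01, 10)
    v1 = random.uniform(v2 + 0.01, 20)
    assert objective(t, v1) > objective(t, v2)

    # Case 2: equal validation errors (> 1), t2 < t1 < v.
    v = random.uniform(1.01, 10)
    t2 = random.uniform(0.01, v - 0.02)
    t1 = random.uniform(t2 + 0.001, v - 0.01)
    assert objective(t1, v) > objective(t2, v)

print("Both rationality conditions held on all sampled cases.")
```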

The above proposition essentially advocates for the objective function $t+v+\left|\frac{t}{v}-1\right|$ because of its rationality property. Although there may be other objective functions for determining the number-of-iterations hyperparameter with other desirable properties (that I have not defined in this post), I stop my analysis here. Further investigations of this topic may include finding rational objective functions for validation errors that are less than one.