I describe an algorithm for determining the number-of-iterations hyperparameter of a neural network.
In the deep learning literature, the early stopping algorithm determines the optimal number of iterations that a neural network runs in order to update its weights. One drawback of the algorithm is that it only minimizes validation set error (which differs from training set error). Hence, early stopping may underfit the data (as evidenced by high training set error). Instead of focusing solely on minimizing validation set error, what if the training set error were also incorporated into determining the optimal number of iterations for a neural network? Would this be a good idea?
In theoretical economics, many models start with some general properties in order to reach some conclusion. From this perspective, I propose some general properties that I would like an algorithm to satisfy in order to determine the optimal number-of-iterations hyperparameter. I call this new algorithm A-Early Stopping (AES) from now on.
I now propose the two properties that AES focuses on. The first relates to minimizing the distance between the validation and training set errors, and the second to minimizing the total error. For the former property, since we often want to balance these errors so as to neither underfit nor overfit, we want the two numbers to be as close to each other as possible. There are a few ways to measure the distance between them.
The first way is to find the absolute distance. Given positive real numbers $a$ and $b$, the absolute distance is simply $|a - b|$. The second way to measure the distance between validation and training set error is to find the relative distance. Given positive real numbers $a$ and $b$, the relative distance of $a$ from $b$ is $\frac{|a - b|}{b}$. The primary difference between these two methods is that when two pairs of numbers are the same absolute distance apart, the pair of larger numbers has the smaller relative distance. For example, in the table below, minimizing the relative distance would favor a training set error of 2 and a validation set error of 3 over a training set error of 1 and a validation set error of 2.

| Training error | Validation error | Absolute distance | Relative distance |
|---|---|---|---|
| 1 | 2 | 1 | 1/2 |
| 2 | 3 | 1 | 1/3 |
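As a quick numerical sketch of the two distance measures (Python; the function names are mine, not part of any library):

```python
def absolute_distance(a, b):
    """Absolute distance between two positive reals."""
    return abs(a - b)

def relative_distance(a, b):
    """Distance of a from b, measured relative to b."""
    return abs(a - b) / b

# Same absolute distance, but the larger pair has the smaller relative distance.
print(absolute_distance(1, 2), relative_distance(1, 2))  # 1 0.5
print(absolute_distance(2, 3), relative_distance(2, 3))  # 1 0.333...
```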
The second property of AES is simply to minimize the total error. Hence, the objective function that the algorithm should minimize consists of two parts.
Suppose there are $n$ total iterations that the neural network runs in order to update its parameters. For all $i \in \{1, \ldots, n\}$, let $v_i$ and $t_i$ be the validation error and training error on iteration $i$. The absolute error on any iteration $i$ is $a(t_i, v_i) = |t_i - v_i|$. Likewise, the relative error is either $r_t(t_i, v_i) = \frac{|t_i - v_i|}{t_i}$ or $r_v(t_i, v_i) = \frac{|t_i - v_i|}{v_i}$. Let the function $f$ be either the absolute error $a$ or one of the relative errors $r_t$ and $r_v$.
The objective function to find the optimal number of iterations $i^*$ is then as follows:

$$ i^* \in \operatorname*{arg\,min}_{i \in \{1, \ldots, n\}} \left[ f(t_i, v_i) + t_i + v_i \right]. $$
The algorithm that solves the problem above should then store the parameters associated with that optimal number of iterations.
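The selection step above can be sketched in a few lines of Python (all names and values here are mine; the per-iteration errors and parameter snapshots are stand-ins for whatever a training loop would record):

```python
def aes_select(train_errors, val_errors, params_history, error_fn):
    """Pick the iteration minimizing error_fn(t, v) + t + v and
    return its index together with the stored parameters."""
    best_i = min(
        range(len(train_errors)),
        key=lambda i: error_fn(train_errors[i], val_errors[i])
        + train_errors[i] + val_errors[i],
    )
    return best_i, params_history[best_i]

# Relative error with respect to the validation error.
r_v = lambda t, v: abs(t - v) / v

train_errors = [0.9, 0.5, 0.3, 0.2]        # hypothetical per-iteration errors
val_errors = [1.1, 0.8, 0.7, 0.9]
params_history = ["w1", "w2", "w3", "w4"]  # stand-ins for weight snapshots
i, params = aes_select(train_errors, val_errors, params_history, r_v)
```

Note that iteration 3 has the lowest validation error here, yet the objective picks iteration 2 (0-indexed), where the gap between the two errors is smaller relative to the total.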
In order to further analyze the relative error functions, I make an important definition. An error function $f$ prefers lowering the training set error over the validation set error compared to another error function $g$ if the training set error is lower than the validation set error exactly when $f$ is less than $g$; that is, $t < v$ if and only if $f(t, v) < g(t, v)$. Likewise, the error function $f$ prefers lowering the validation set error over the training set error compared to $g$ if the validation set error is lower than the training set error exactly when $f$ is less than $g$; that is, $v < t$ if and only if $f(t, v) < g(t, v)$.
Now consider another problem involving two different cases of training and validation set errors. Let $(t_1, v_1)$ be the training and validation set errors for one case and $(t_2, v_2)$ be the errors for the second case. Further, since training error is typically lower than validation error because of overfitting, suppose that $t_1 \le v_1$ and $t_2 \le v_2$. I want to make sure that, in cases where the training set error is the same but the validation set error is lower in one case than in the other, the objective function prefers the case with the lower validation set error. For example, if $t_1 = t_2$ and $v_1 < v_2$, the objective function should produce a lower number for $(t_1, v_1)$ than for $(t_2, v_2)$. To capture this idea, I create another definition. The objective function $F(t, v) = f(t, v) + t + v$ is rational if $F(t_1, v_1) < F(t_2, v_2)$ whenever $t_1 = t_2$ and $v_1 < v_2$, and $F(t_1, v_1) < F(t_2, v_2)$ whenever $v_1 = v_2$ and $t_1 < t_2$.
I dig a little deeper into the implications of this simple algorithm.
The relative error function $r_v(t, v) = \frac{|t - v|}{v}$ prefers lowering the training set error over the validation set error compared to the relative error function $r_t(t, v) = \frac{|t - v|}{t}$. Further, the relative error function $r_t$ prefers lowering the validation set error over the training set error compared to $r_v$.
Suppose $t < v$. Hence, let $v = t + c$ where $c > 0$. Then we have

$$ r_t(t, v) = \frac{|t - v|}{t} = \frac{c}{t} $$

and

$$ r_v(t, v) = \frac{|t - v|}{v} = \frac{c}{t + c}. $$

Hence, since $t < t + c$, we have that $\frac{c}{t + c} < \frac{c}{t}$, i.e. $r_v(t, v) < r_t(t, v)$. Likewise, the converse direction of the definition holds, since the two functions are equal when $t = v$ and, WLOG, if $v < t$, then the same argument with the roles of $t$ and $v$ reversed gives $r_t(t, v) < r_v(t, v)$.
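This ordering is easy to check numerically; a small Python sketch (function names and values are mine):

```python
r_t = lambda t, v: abs(t - v) / t  # relative to the training error
r_v = lambda t, v: abs(t - v) / v  # relative to the validation error

# When t < v, dividing by the larger number yields the smaller error...
assert r_v(0.4, 0.9) < r_t(0.4, 0.9)
# ...and the ordering flips when v < t.
assert r_t(0.9, 0.4) < r_v(0.9, 0.4)
```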
The above statement is relevant because the choice of relative error function matters. For example, if one wanted to focus on reducing validation error, then $r_t$ should be chosen over $r_v$; it should be noted that the resulting objective function would then prefer lowering validation error over training error.
For the absolute error function $a(t, v) = |t - v|$, it can be said that it is indifferent between training and validation errors because $|t - v| = |v - t|$. Hence, when the validation and training errors are switched, the absolute error remains the same.
The objective function $F(t, v) = \frac{|t - v|}{v} + t + v$ is rational when the validation error is greater than $1$.
Since $t_i \le v_i$ in both cases below, the objective being compared is $F(t, v) = \frac{|t - v|}{v} + t + v = \frac{v - t}{v} + t + v$.

Case 1. Suppose that $t_1 = t_2 = t$ and $v_1 < v_2$. Then we know that $\frac{t}{v_2} < \frac{t}{v_1}$ and hence

$$ \frac{v_1 - t}{v_1} = 1 - \frac{t}{v_1} < 1 - \frac{t}{v_2} = \frac{v_2 - t}{v_2}. $$

Further, since $v_1 < v_2$, we have $t + v_1 < t + v_2$. Adding the two inequalities, therefore, we have

$$ F(t, v_1) = \frac{v_1 - t}{v_1} + t + v_1 < \frac{v_2 - t}{v_2} + t + v_2 = F(t, v_2). $$

Case 2. Suppose $v_1 = v_2 = v$, $t_1 < t_2$, and $v > 1$. Then we have $\frac{1}{v} < 1$, which implies

$$ \frac{t_2 - t_1}{v} < t_2 - t_1, $$

and rearranging gives $t_1 - \frac{t_1}{v} < t_2 - \frac{t_2}{v}$. Adding $1 + v$ to both sides,

$$ F(t_1, v) = 1 - \frac{t_1}{v} + t_1 + v < 1 - \frac{t_2}{v} + t_2 + v = F(t_2, v). $$
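Both cases, and the failure mode for small validation errors, can be checked numerically. A Python sketch with the objective built from the validation-relative error (the specific values are illustrative, chosen by me):

```python
def F(t, v):
    """Objective: relative error (w.r.t. validation error) plus total error."""
    return abs(t - v) / v + t + v

# Case 1: equal training errors; the lower validation error wins.
assert F(1.2, 1.5) < F(1.2, 2.0)
# Case 2: equal validation errors greater than 1; the lower training error wins.
assert F(1.1, 1.5) < F(1.3, 1.5)
# For validation errors below 1, case 2 can fail: the higher
# training error here produces the lower objective value.
assert F(0.2, 0.5) < F(0.1, 0.5)
```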
The above proposition essentially advocates for using the objective function $F(t, v) = \frac{|t - v|}{v} + t + v$ because of its rationality property. Although there may be other objective functions for determining the number-of-iterations hyperparameter that have other desirable properties (which I may not have defined in this post), I stop my analysis here. Further investigations of this topic may include finding rational objective functions for validation errors that are less than one.