Overfitting Data

A sufficiently large neural network can overfit the XOR function, and the same framework extends to other types of data.

In the previous post, I showed that gradient descent could find weights at which the loss function attained a local minimum rather than a global minimum. The model I proposed had only one hidden layer with two units. Further, I purposefully chose the initial weights to be close approximations of the true weights (weights at which the squared loss is globally minimized, i.e. zero).

In this post, I investigate deeper neural networks (i.e. more hidden layers and more units in each layer) with random initial weights to approximate (i.e. overfit) the XOR function. The framework developed in this investigation should also help in building networks that overfit other problems. Once a large enough network can overfit some data, I can then work on determining the optimal number of iterations to run in order to decrease the validation error as much as possible (i.e. early stopping). I use Python's PyTorch package to perform the computations needed for backpropagation.

Model of a Deep Neural Network

Let the input matrix be $\boldsymbol{X}$, the output vector be $\boldsymbol{y}$, and the weights (a matrix or vector) and biases (a vector) in layer $i = 1, \ldots, l-1$ be $\boldsymbol{W}_i$ and $\boldsymbol{b}_i$, respectively. Let the input to layer $i = 1, \ldots, l-1$ be denoted $\boldsymbol{H}_i$, where $\boldsymbol{H}_1 = \boldsymbol{X}$. For any layer $i = 1, \ldots, l-1$, given the input dimension $m_i$, output dimension $n_i$ (note that $n_i = m_{i+1}$ if $i \neq l-1$, and $n_{l-1}$ is the dimension of the final output), and input $\boldsymbol{H}_i$, the PyTorch class torch.nn.Linear initializes the weights $\boldsymbol{W}_i$ and biases $\boldsymbol{b}_i$ to values drawn randomly from the uniform distribution $U\left(-\sqrt{1/m_i}, \sqrt{1/m_i}\right)$ and applies the transformation $\boldsymbol{A}_i = \boldsymbol{H}_i \boldsymbol{W}_i^{\top} + \left[\begin{array}{lll}\boldsymbol{b}_i & \ldots & \boldsymbol{b}_i\end{array}\right]^{\top}$. Note that $\boldsymbol{A}_i$ can be interpreted as the input to the activation function. The function torch.relu applies $\max\{0, x\}$ element-wise. Define the output of the activation function for layer $i = 1, \ldots, l-2$ to be $\boldsymbol{H}_{i+1} = \max\{0, \boldsymbol{A}_i\}$, and define the prediction as $\hat{\boldsymbol{y}} = \boldsymbol{A}_{l-1}$. Suppose that the loss function is $J(\hat{\boldsymbol{y}}) = \|\hat{\boldsymbol{y}} - \boldsymbol{y}\|^2 = (\hat{\boldsymbol{y}} - \boldsymbol{y})^{\top}(\hat{\boldsymbol{y}} - \boldsymbol{y})$.
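
To make this concrete, here is a minimal sketch (not the actual code linked at the end of this post) of how torch.nn.Linear and torch.relu can be composed into the model described above. The class name DeepNet, the sizes argument, and the squared_loss helper are names I am introducing here for illustration.

```python
import torch

# A minimal sketch of the model described above.
# Each torch.nn.Linear(m_i, n_i) draws W_i and b_i from U(-sqrt(1/m_i), sqrt(1/m_i))
# and computes A_i = H_i W_i^T + b_i; torch.relu applies max{0, x} element-wise.
class DeepNet(torch.nn.Module):
    def __init__(self, sizes):
        super().__init__()
        # sizes = [m_1, n_1, n_2, ..., n_{l-1}]
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
        )

    def forward(self, X):
        H = X
        for layer in self.layers[:-1]:
            H = torch.relu(layer(H))     # H_{i+1} = max{0, A_i}
        return self.layers[-1](H)        # y_hat = A_{l-1}, no activation on the last layer


# Squared-error loss J(y_hat) = ||y_hat - y||^2
def squared_loss(y_hat, y):
    return torch.sum((y_hat - y) ** 2)
```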

Further suppose that the learning rate in this model is 0.01. The PyTorch class torch.optim.SGD will compute $\boldsymbol{W}_{i, r} = \boldsymbol{W}_{i,(r-1)} - 0.01\,\frac{\partial J}{\partial \boldsymbol{W}_i}\left(\boldsymbol{W}_{i,(r-1)}\right)$ and $\boldsymbol{b}_{i, r} = \boldsymbol{b}_{i,(r-1)} - 0.01\,\frac{\partial J}{\partial \boldsymbol{b}_i}\left(\boldsymbol{b}_{i,(r-1)}\right)$ for all $r = 1, \ldots, R$ and $i = 1, \ldots, l-1$, where $R$ is the total number of iterations.
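
As a sketch of how these updates are carried out, the following hypothetical train function wraps torch.optim.SGD around the DeepNet and squared_loss defined in the previous sketch; the default value of R is arbitrary.

```python
# Run R iterations of gradient descent with learning rate lr, returning the
# loss at each iteration. Assumes the DeepNet and squared_loss sketched above.
def train(model, X, y, R=5000, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for r in range(R):
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = squared_loss(model(X), y)
        loss.backward()                  # backpropagation: compute dJ/dW_i and dJ/db_i
        optimizer.step()                 # W_{i,r} = W_{i,r-1} - lr * dJ/dW_i, likewise for b_i
        losses.append(loss.item())
    return losses
```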

Example of Overfitting the XOR Function

The model that I have laid out allows me to create networks that overfit the XOR problem starting from randomized weights. For example, suppose that $l = 6$, $m_1 = 2$, $m_i = 100$ for $i = 2, \ldots, 5$, and $n_5 = 1$. Then, for initial weights drawn at random from $U\left(-\sqrt{1/m_i}, \sqrt{1/m_i}\right)$, I can plot the loss with respect to the total number of iterations. Unsurprisingly, increasing the total number of iterations decreases the loss even further. In a future blog post, I will use this model to analyze more interesting problems.
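
For reference, this particular configuration could be set up roughly as follows, reusing the hypothetical DeepNet and train helpers sketched above; the random seed and iteration count are arbitrary choices, not necessarily those used to generate my plot.

```python
# Sketch of the XOR example: input dimension 2, four hidden layers of 100
# units each, and a single output (so l = 6). These are the four XOR points.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

torch.manual_seed(0)                         # random initial weights (seeded for reproducibility)
model = DeepNet([2, 100, 100, 100, 100, 1])  # m_1 = 2, m_2 = ... = m_5 = 100, n_5 = 1
losses = train(model, X, y, R=2000)

print(losses[0], losses[-1])                 # the loss shrinks as the number of iterations grows
```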

The Python code that I wrote for this example can be found here.