If batch gradient descent is used to find the weights of a neural network modeling the XOR function, a local minimum may be obtained instead of the exact solution.
The XOR function takes two binary values and returns 1 if exactly one of them is equal to 1; otherwise, it returns 0. Let the matrix $X$ contain all possible values of the two binary variables as its rows, and let the vector $y$ contain the corresponding XOR outputs. Let $W$ and $w$ be the weights, and $c$ and $b$ the biases, of the network's first and second layers, respectively. Finally, let the vector $\hat{y}$ contain the network's predicted outputs for the inputs in $X$. I would like to find the weights and biases $W$, $c$, $w$, and $b$ of the network
$$\hat{y} = f(X; W, c, w, b) = \max\left\{0,\ XW + [c \ \cdots \ c]^\top\right\} w + [b \ \cdots \ b]^\top$$
in order to represent the XOR function.
I now define some terms to simplify the following analysis. Define $A^{(1)} = XW + [c \ \cdots \ c]^\top$, $H^{(1)} = \max\{0, A^{(1)}\}$, and $a^{(2)} = \hat{y} = H^{(1)} w + [b \ \cdots \ b]^\top$. Define $q$ to be the element-wise function $\max\{0, \cdot\}$ (element-wise meaning that the same scalar function is applied to each component of a tensor, yielding a tensor of the same shape). Note that, in general, $q$ need not be specifically $\max\{0, \cdot\}$. It is important to note that the derivative of the ReLU function $\max\{0, x\}$ is $0$ if $x < 0$ and $1$ if $x > 0$; the derivative does not exist at $x = 0$, but in this model I take it to be $0$ there. Assume that the loss function is the squared error $J(\hat{y}) = \|\hat{y} - y\|^2 = (\hat{y} - y)^\top (\hat{y} - y)$.
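To make the setup concrete, here is a minimal NumPy sketch of the forward pass and loss described above. The original computations for this post were done in Mathematica; the function names `relu`, `forward`, and `loss` below are my own, and later sketches reuse them.

```python
import numpy as np

# All four binary input pairs (the rows of X) and the corresponding XOR targets y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def relu(A):
    # q(x) = max{0, x}, applied element-wise.
    return np.maximum(0.0, A)

def forward(X, W, c, w, b):
    # A1 = XW + [c ... c]^T, H1 = max{0, A1}, yhat = a2 = H1 w + [b ... b]^T.
    A1 = X @ W + c        # broadcasting adds c to every row of XW
    H1 = relu(A1)
    yhat = H1 @ w + b
    return A1, H1, yhat

def loss(yhat, y):
    # Squared-error loss J(yhat) = (yhat - y)^T (yhat - y).
    return float((yhat - y) @ (yhat - y))
```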
Backpropagation Analysis
The gradient on the output layer is $\nabla_{\hat{y}} J = \nabla_{a^{(2)}} J = 2(\hat{y} - y)$. Using the chain rule, I need to compute the second layer's gradients $\frac{\partial J}{\partial b} = (\nabla_{a^{(2)}} J)^\top \mathbf{1}$ and $\nabla_w J = \left(\frac{\partial a^{(2)}}{\partial w}\right)^\top \nabla_{a^{(2)}} J = (H^{(1)})^\top \nabla_{a^{(2)}} J$ in order to adjust the weights $b$ and $w$. Likewise, I need to compute the first layer's gradients $\nabla_c J = \left(2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})\right)^\top \mathbf{1}$ and $\frac{\partial J}{\partial W} = X^\top \left(2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})\right)$ to adjust the weights $c$ and $W$ (these first-layer formulas are derived in the Appendix).
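Continuing the NumPy sketch above, these four gradients can be computed as follows. This is only an illustrative sketch of the formulas in this section, not the Mathematica code used for the post; `Q` below denotes the quantity $2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})$.

```python
def relu_grad(A):
    # q'(x): 0 for x < 0, 1 for x > 0, and, by the convention above, 0 at x = 0.
    return (A > 0).astype(float)

def gradients(X, y, W, c, w, b):
    A1, H1, yhat = forward(X, W, c, w, b)
    g2 = 2.0 * (yhat - y)                  # gradient on the output layer, 2(yhat - y)
    grad_b = g2.sum()                      # (grad_{a2} J)^T 1
    grad_w = H1.T @ g2                     # (H1)^T grad_{a2} J
    Q = np.outer(g2, w) * relu_grad(A1)    # 2((yhat - y) w^T), element-wise times q'(A1)
    grad_c = Q.sum(axis=0)                 # Q^T 1
    grad_W = X.T @ Q                       # X^T Q
    return grad_W, grad_c, grad_w, grad_b
```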
Suppose that the weights are initialized as follows and that the learning rate is 0.01.
$$W = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \qquad c = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}, \qquad w = \begin{bmatrix} 0.5 \\ -0.6 \end{bmatrix}, \qquad b = -0.3$$
Computing the network's outputs on all of the training data with the initialized weights gives $\hat{y} = \begin{bmatrix} -0.37 & -0.46 & -0.44 & -0.53 \end{bmatrix}^\top$ when $X = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}^\top$, yielding a total loss of $4.623$.
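Batch gradient descent then repeatedly subtracts the learning rate times each gradient from the corresponding parameter. Below is a minimal sketch of this loop, reusing the functions from the earlier sketches; it is an illustration rather than the original Mathematica code, so treat its printed numbers as approximate.

```python
# Initial parameters and learning rate from the text.
W = np.array([[0.1, 0.2], [0.3, 0.4]])
c = np.array([0.1, 0.2])
w = np.array([0.5, -0.6])
b = -0.3
lr = 0.01

_, _, yhat = forward(X, W, c, w, b)
print(yhat, loss(yhat, y))   # roughly [-0.37 -0.46 -0.44 -0.53] and 4.623

for _ in range(1000):
    grad_W, grad_c, grad_w, grad_b = gradients(X, y, W, c, w, b)
    W -= lr * grad_W
    c -= lr * grad_c
    w -= lr * grad_w
    b -= lr * grad_b

_, _, yhat = forward(X, W, c, w, b)
print(W, c, w, b, loss(yhat, y))   # the text reports a loss of about 0.0001 at this point
```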
After running gradient descent for 1000 iterations, the approximate weights I obtain are as follows.
$$W = \begin{bmatrix} 0.81 & 1.06 \\ 0.81 & 1.06 \end{bmatrix}, \qquad c = \begin{bmatrix} 0.09 \\ -1.06 \end{bmatrix}, \qquad w = \begin{bmatrix} 1.21 \\ -1.86 \end{bmatrix}, \qquad b = -0.11$$
Using these weights, the total loss is about $0.0001$. If I increase the number of iterations, the total loss decreases even further. If I change the initial weights slightly by making all negative components positive (i.e. $w = \begin{bmatrix} 0.5 \\ 0.6 \end{bmatrix}$ and $b = 0.3$), I obtain a higher total loss. In both cases, I am stuck at a local minimum (albeit a better one in the former case than in the latter), since neither run reaches the exact weights, which are as follows.
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \qquad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \qquad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \qquad b = 0$$
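As a quick check, these weights reproduce the XOR outputs exactly:

$$XW + [c \ \cdots \ c]^\top = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}, \qquad H^{(1)} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}, \qquad \hat{y} = H^{(1)} w + [b \ \cdots \ b]^\top = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = y.$$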
Getting stuck at a local minimum because of the initial choice of weights is a significant problem in deep learning. An important and well-known observation is that even if the architecture of a neural network has the capacity to represent some function exactly, finding the weights that realize it may not be easy. I may attempt to address this issue in the future.
To see the Mathematica code that I wrote for the backpropagation analysis, see here.
Appendix
Proposition.
$$\nabla_c J = \left(2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})\right)^\top \mathbf{1} \qquad \text{and} \qquad \frac{\partial J}{\partial W} = X^\top \left(2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})\right)$$
Proof.
In order to find $\nabla_c J$, consider the following. By the chain rule, we have that
$$\frac{\partial J}{\partial H^{(1)}} = \sum_j \frac{\partial a^{(2)}_j}{\partial H^{(1)}}\,\frac{\partial J}{\partial a^{(2)}_j} = \begin{bmatrix} w_1 & \cdots & w_n \\ 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix} 2\left(a^{(2)}_1 - y_1\right) + \cdots + \begin{bmatrix} 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \\ w_1 & \cdots & w_n \end{bmatrix} 2\left(a^{(2)}_m - y_m\right) = 2\begin{bmatrix} a^{(2)}_1 - y_1 \\ \vdots \\ a^{(2)}_m - y_m \end{bmatrix} w^\top = 2(\hat{y} - y)w^\top. \tag{1}$$
We also have that
$$H^{(1)} = q(A^{(1)}) = \begin{bmatrix} q\left(A^{(1)}_{11}\right) & \cdots & q\left(A^{(1)}_{1n}\right) \\ \vdots & & \vdots \\ q\left(A^{(1)}_{m1}\right) & \cdots & q\left(A^{(1)}_{mn}\right) \end{bmatrix}$$
and
$$\left(\frac{\partial H^{(1)}_j}{\partial A^{(1)}}\right)_k = \frac{\partial H^{(1)}_j}{\partial A^{(1)}_k} = \begin{cases} q'\left(A^{(1)}_j\right) & \text{if } j = k \\ 0 & \text{otherwise} \end{cases}$$
implying that
$$\frac{\partial J}{\partial A^{(1)}} = \sum_j \frac{\partial H^{(1)}_j}{\partial A^{(1)}}\,\frac{\partial J}{\partial H^{(1)}_j} = 2\left((\hat{y} - y)w^\top\right)_{11} \begin{bmatrix} q'\left(A^{(1)}_{11}\right) & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix} + \cdots + 2\left((\hat{y} - y)w^\top\right)_{mn} \begin{bmatrix} 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & q'\left(A^{(1)}_{mn}\right) \end{bmatrix} = 2\begin{bmatrix} \left((\hat{y} - y)w^\top\right)_{11} q'\left(A^{(1)}_{11}\right) & \cdots & \left((\hat{y} - y)w^\top\right)_{1n} q'\left(A^{(1)}_{1n}\right) \\ \vdots & & \vdots \\ \left((\hat{y} - y)w^\top\right)_{m1} q'\left(A^{(1)}_{m1}\right) & \cdots & \left((\hat{y} - y)w^\top\right)_{mn} q'\left(A^{(1)}_{mn}\right) \end{bmatrix} = 2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)}). \tag{2}$$
Hence, writing $Q = \frac{\partial J}{\partial A^{(1)}} = 2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})$ for brevity,

$$\nabla_c J = \sum_j \frac{\partial A^{(1)}_j}{\partial c}\,\frac{\partial J}{\partial A^{(1)}_j} = \begin{bmatrix} \partial A^{(1)}_{11} / \partial c_1 \\ \vdots \\ \partial A^{(1)}_{11} / \partial c_n \end{bmatrix} Q_{11} + \cdots = \begin{bmatrix} 1 \\ \vdots \\ 0 \end{bmatrix} Q_{11} + \cdots + \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} Q_{1n} + \cdots + \begin{bmatrix} 1 \\ \vdots \\ 0 \end{bmatrix} Q_{m1} + \cdots + \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} Q_{mn} = \begin{bmatrix} Q_{11} + \cdots + Q_{m1} \\ \vdots \\ Q_{1n} + \cdots + Q_{mn} \end{bmatrix} = Q^\top \mathbf{1} = \left(2\left((\hat{y} - y)w^\top\right) \odot q'(A^{(1)})\right)^\top \mathbf{1}. \tag{3}$$
For $\frac{\partial J}{\partial W}$, we have that
$$\frac{\partial J}{\partial W} = \sum_j \frac{\partial A^{(1)}_j}{\partial W}\,\frac{\partial J}{\partial A^{(1)}_j} = \begin{bmatrix} X_{11} & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ X_{1n} & 0 & \cdots & 0 \end{bmatrix} Q_{11} + \cdots + \begin{bmatrix} X_{m1} & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ X_{mn} & 0 & \cdots & 0 \end{bmatrix} Q_{m1} + \cdots + \begin{bmatrix} 0 & \cdots & 0 & X_{11} \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & X_{1n} \end{bmatrix} Q_{1n} + \cdots + \begin{bmatrix} 0 & \cdots & 0 & X_{m1} \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & X_{mn} \end{bmatrix} Q_{mn} = \begin{bmatrix} X_{11} Q_{11} + \cdots + X_{m1} Q_{m1} & \cdots & X_{11} Q_{1n} + \cdots + X_{m1} Q_{mn} \\ \vdots & & \vdots \\ X_{1n} Q_{11} + \cdots + X_{mn} Q_{m1} & \cdots & X_{1n} Q_{1n} + \cdots + X_{mn} Q_{mn} \end{bmatrix} = X^\top Q \tag{4}$$
where $X = \begin{bmatrix} X_{11} & \cdots & X_{1n} \\ \vdots & & \vdots \\ X_{m1} & \cdots & X_{mn} \end{bmatrix}$ and $Q = \frac{\partial J}{\partial A^{(1)}}$ is as defined above.
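One way to sanity-check (3) and (4) is to compare them against central finite differences at the initial weights from the main text. The sketch below does this, reusing the `forward`, `loss`, and `gradients` functions defined earlier; it is only an illustrative check, not part of the original Mathematica analysis.

```python
def numerical_grad(param, f, eps=1e-6):
    # Central finite-difference approximation of dJ/dparam, one entry at a time.
    g = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        f_plus = f()
        param[idx] = old - eps
        f_minus = f()
        param[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

# Reset to the initial weights used in the main text.
W = np.array([[0.1, 0.2], [0.3, 0.4]])
c = np.array([0.1, 0.2])
w = np.array([0.5, -0.6])
b = -0.3

J = lambda: loss(forward(X, W, c, w, b)[2], y)
grad_W, grad_c, _, _ = gradients(X, y, W, c, w, b)
print(np.allclose(grad_c, numerical_grad(c, J)))  # expected: True
print(np.allclose(grad_W, numerical_grad(W, J)))  # expected: True
```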
Proposition.
In some models, a function is also applied to $a^{(2)}$, whose result $h^{(2)} = q(a^{(2)})$ is then the predicted value. Hence, in such models, $\nabla_{a^{(2)}} J = q'(a^{(2)}) \odot \nabla_{h^{(2)}} J$.
Proof.
We know that $J$ is a function of $h^{(2)}$ and that $h^{(2)}$ is a function of $a^{(2)}$. By definition,
$$\nabla_{a^{(2)}} J = \left(\frac{\partial h^{(2)}}{\partial a^{(2)}}\right)^\top \nabla_{h^{(2)}} J.$$
But we know that $h^{(2)} = q(a^{(2)}) = \begin{bmatrix} q\left(a^{(2)}_1\right) \\ \vdots \\ q\left(a^{(2)}_n\right) \end{bmatrix}$. Hence, $\frac{\partial h^{(2)}}{\partial a^{(2)}} = \begin{bmatrix} q'\left(a^{(2)}_1\right) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & q'\left(a^{(2)}_n\right) \end{bmatrix}$. So,
$$\left(\frac{\partial h^{(2)}}{\partial a^{(2)}}\right)^\top \nabla_{h^{(2)}} J = \begin{bmatrix} q'\left(a^{(2)}_1\right)\left(\nabla_{h^{(2)}} J\right)_1 \\ \vdots \\ q'\left(a^{(2)}_n\right)\left(\nabla_{h^{(2)}} J\right)_n \end{bmatrix} = q'(a^{(2)}) \odot \nabla_{h^{(2)}} J.$$
This proof is relevant because it establishes the relationship between the Jacobian and the Hadamard product when using the chain rule: multiplying by the (diagonal) Jacobian of an element-wise function $q$ is the same as taking a Hadamard product with $q'(a^{(2)})$.
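As a concrete illustration, the identity can be checked numerically against finite differences. This is only a sketch: the sigmoid output non-linearity below is my own choice (it is not part of the XOR model above), and the snippet reuses `loss`, `y`, and `np` from the earlier sketches.

```python
def sigmoid(a):
    # An example choice of element-wise q for the output layer.
    return 1.0 / (1.0 + np.exp(-a))

a2 = np.array([-0.37, -0.46, -0.44, -0.53])    # any values for a^(2) will do
h2 = sigmoid(a2)                               # h^(2) = q(a^(2)), the predicted value
grad_h2 = 2.0 * (h2 - y)                       # gradient of J = ||h^(2) - y||^2 w.r.t. h^(2)
grad_a2 = h2 * (1.0 - h2) * grad_h2            # q'(a^(2)) times grad_{h^(2)} J, element-wise

# Central finite-difference check of the same gradient.
eps = 1e-6
numerical = np.array([
    (loss(sigmoid(a2 + eps * e), y) - loss(sigmoid(a2 - eps * e), y)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(grad_a2, numerical))  # expected: True
```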