Gradient of the MSE:
Let’s take the first partial derivative as an example.
Using the chain rule:
The constant factors cancel out.
Only the input that multiplies the chosen weight remains from the model function, since every other variable is treated as a constant except for that weight.
Back to the other equation:
Substitute the function.
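For reference, the steps above can be written out in full under one common set of assumptions. The notation here is assumed, not taken from the original: a linear model h_w(x) = w_0 + w_1 x_1 + ... + w_n x_n, m training samples indexed by i, and a cost J(w) that carries a 1/(2m) factor so that the 2 produced by the chain rule cancels. If the cost uses 1/m instead, a factor of 2 stays in the result.

```latex
% Assumed notation (not from the original text):
%   m samples, n features, linear model h_w(x) = w_0 + w_1 x_1 + ... + w_n x_n,
%   cost J(w) defined with a 1/(2m) factor.
\begin{align*}
J(w) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 \\
\frac{\partial J}{\partial w_1}
  &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_w(x^{(i)}) - y^{(i)} \right)
     \frac{\partial}{\partial w_1} \left( h_w(x^{(i)}) - y^{(i)} \right)
     && \text{(chain rule)} \\
  &= \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)
     \frac{\partial h_w(x^{(i)})}{\partial w_1}
     && \text{(the 2 and the } \tfrac{1}{2} \text{ cancel)} \\
\frac{\partial h_w(x^{(i)})}{\partial w_1} &= x_1^{(i)}
     && \text{(all other terms are constant in } w_1 \text{)} \\
\frac{\partial J}{\partial w_1}
  &= \frac{1}{m} \sum_{i=1}^{m}
     \left( w_0 + w_1 x_1^{(i)} + \dots + w_n x_n^{(i)} - y^{(i)} \right) x_1^{(i)}
     && \text{(substituting } h_w \text{)}
\end{align*}
```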
Then do this for all the weights:
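As a minimal sketch of repeating the same partial derivative for every weight, the snippet below assumes a linear model with a separate bias term and a cost with a 1/(2m) factor. The function names (`predict`, `half_mse`, `half_mse_gradient`) and the toy data are hypothetical and chosen only for illustration.

```python
def predict(weights, bias, x):
    """Assumed linear model: bias + sum_j weights[j] * x[j]."""
    return bias + sum(w * xj for w, xj in zip(weights, x))


def half_mse(weights, bias, xs, ys):
    """Cost with the assumed 1/(2m) factor: (1 / 2m) * sum of squared errors."""
    m = len(xs)
    total = sum((predict(weights, bias, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


def half_mse_gradient(weights, bias, xs, ys):
    """Return (list of dJ/dw_j for every weight, dJ/dbias)."""
    m = len(xs)
    # Prediction error for each sample; shared by every partial derivative.
    errors = [predict(weights, bias, x) - y for x, y in zip(xs, ys)]
    # dJ/dw_j = (1/m) * sum_i error_i * x_ij: only the input that multiplies
    # w_j survives when the model is differentiated with respect to w_j.
    grad_w = [
        sum(e * x[j] for e, x in zip(errors, xs)) / m
        for j in range(len(weights))
    ]
    # The bias multiplies a constant 1, so its partial is the mean error.
    grad_b = sum(errors) / m
    return grad_w, grad_b


# Toy data (made up for illustration): three samples with two features each.
xs = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
ys = [5.0, 4.0, 8.0]
weights, bias = [0.0, 0.0], 0.0

grad_w, grad_b = half_mse_gradient(weights, bias, xs, ys)
print(grad_w, grad_b)
```

Each entry of `grad_w` is the same formula with a different input column, which is why the derivation only needs to be worked through once. A quick way to check such a gradient is to compare it against a central finite difference of `half_mse`.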