Deep Architectures 
  Neil Lawrence
  2025-09-24 
  Dedan Kimathi University of Technology, Nyeri, Kenya
 
Basis Functions to Neural Networks 
 
From Basis Functions to Neural Networks 
So far: linear models or “shallow learning” 
Consider instead function composition or “deep learning” 
 
 
Neural Network Prediction Function 
\[
f(\mathbf{ x}; \mathbf{W}) = \mathbf{ w}_4 ^\top\phi\left(\mathbf{W}_3 \phi\left(\mathbf{W}_2\phi\left(\mathbf{W}_1 \mathbf{ x}\right)\right)\right).
\] 
 
Linear Models as Basis + Weights 
\[
f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i).
\] 
A neural network. Input nodes are shown at the bottom. The hidden layer is the result of applying an affine transformation to the input layer and passing it through an activation function.
 
 
Mathematically 
\[
\begin{align*}
    \mathbf{ h}_{1} &= \phi\left(\mathbf{W}_1 \mathbf{ x}\right)\\
    \mathbf{ h}_{2} &=  \phi\left(\mathbf{W}_2\mathbf{ h}_{1}\right)\\
    \mathbf{ h}_{3} &= \phi\left(\mathbf{W}_3 \mathbf{ h}_{2}\right)\\
    f&= \mathbf{ w}_4 ^\top\mathbf{ h}_{3}
\end{align*}
\] 
 
From Shallow to Deep 
\[
f(\mathbf{ x}; \mathbf{W})  =  \mathbf{ w}_4 ^\top\phi\left(\mathbf{W}_3 \phi\left(\mathbf{W}_2\phi\left(\mathbf{W}_1 \mathbf{ x}\right)\right)\right).
\] 
 
Gradient-based optimization 
\[
E(\mathbf{W}) = \sum_{i=1}^nE(\mathbf{ y}_i, f(\mathbf{ x}_i; \mathbf{W}))
\] 
\[
\mathbf{W}_{t+1} = \mathbf{W}_t - \eta \nabla_\mathbf{W}E(\mathbf{W})
\] 
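A minimal sketch of this update rule, assuming a hypothetical function grad_E that returns \(\nabla_\mathbf{W}E(\mathbf{W})\) (not the lecture's implementation):

```python
import numpy as np

def gradient_descent(W, grad_E, eta=0.01, n_steps=100):
    """Plain gradient descent: W_{t+1} = W_t - eta * dE/dW.

    grad_E is assumed (hypothetical) to return the gradient of the
    summed error E(W) with respect to the parameters W.
    """
    for _ in range(n_steps):
        W = W - eta * grad_E(W)
    return W
```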
 
Basic Multilayer Perceptron 
\[
\small
f_L(\mathbf{ x}) = \boldsymbol{ \phi}\left(\mathbf{b}_L+ \mathbf{W}_L\boldsymbol{ \phi}\left(\mathbf{b}_{L-1} + \mathbf{W}_{L-1} \boldsymbol{ \phi}\left( \cdots \boldsymbol{ \phi}\left(\mathbf{b}_1 + \mathbf{W}_1 \mathbf{ x}\right) \cdots \right)\right)\right)
\] 
 
Recursive MLP Definition 
\[\begin{align}
f_\ell(\mathbf{ x}) &= \boldsymbol{ \phi}\left(\mathbf{W}_\ell f_{\ell-1}(\mathbf{ x}) + \mathbf{b}_\ell\right),\quad l=1,\dots,L\\
f_0(\mathbf{ x}) &= \mathbf{ x}
\end{align}\] 
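A minimal NumPy sketch of this recursion, assuming the same activation \(\boldsymbol{\phi}\) (here ReLU) at every layer:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, activation=relu):
    """Recursive MLP: f_0 = x, f_l = phi(W_l f_{l-1} + b_l) for l = 1..L."""
    f = x
    for W, b in zip(weights, biases):
        f = activation(W @ f + b)
    return f
```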
 
Rectified Linear Unit 
\[
\phi(x) =
\begin{cases}
0 & x \le 0\\
x & x > 0
\end{cases}
\]
The ReLU activation function: zero for negative inputs, identity for positive inputs.
 
 
Shallow ReLU Networks 
\[\begin{align}
\mathbf{h}(x) &= \operatorname{ReLU}(\mathbf{w}_1 x - \mathbf{b}_1)\\
f(x) &= \mathbf{w}_2^{\top} \, \mathbf{h}(x)
\end{align}\] 
Output of a 2D ReLU network with 3 hidden units, showing piecewise-linear regions.
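A small sketch of this shallow model with three hidden units (the weights below are illustrative, not taken from the lecture):

```python
import numpy as np

w1 = np.array([1.0, 1.0, 1.0])   # input weights, one per hidden unit
b1 = np.array([-1.0, 0.0, 1.0])  # offsets: kinks appear where w1*x = b1
w2 = np.array([1.0, -2.0, 1.0])  # output weights

def f(x):
    h = np.maximum(np.outer(x, w1) - b1, 0.0)  # hidden layer, shape (n, 3)
    return h @ w2                              # piecewise-linear output

x = np.linspace(-2.0, 2.0, 9)
print(f(x))  # kinks at x = -1, 0, 1
```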
 
 
ReLU Activations 
Decision boundaries for the ReLU network.
 
 
Decision Boundaries 
Decision boundaries for the ReLU network.
 
 
Decision Boundaries 
Decision boundaries for the ReLU network.
 
Complexity (1D): number of kinks \(\propto\)  width of network.
 
Depth vs Width: Representation Power 
Single hidden layer: number of kinks \(\approx O(\text{width})\)  
Deep ReLU networks: kinks \(\approx O(2^{\text{depth}})\)  
 
 
Sawtooth Construction 
\[\begin{align}
f_l(x) &= 2\vert f_{l-1}(x)\vert - 2, \quad f_0(x) = x\\
f_l(x) &= 2 
  \operatorname{ReLU}(f_{l-1}(x)) + 2 
  \operatorname{ReLU}(-f_{l-1}(x)) - 2
\end{align}\] 
Sawtooth construction showing exponential growth of linear segments with depth. Each layer doubles the number of kinks.
 
Complexity (1D): number of kinks \(\approx O(2^{\text{depth}})\) .
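A short numerical sketch of the construction (illustrative): it iterates the recursion above and counts kinks by looking for slope changes.

```python
import numpy as np

def sawtooth(x, depth):
    """f_l(x) = 2 ReLU(f_{l-1}(x)) + 2 ReLU(-f_{l-1}(x)) - 2, f_0(x) = x."""
    f = x
    for _ in range(depth):
        f = 2 * np.maximum(f, 0.0) + 2 * np.maximum(-f, 0.0) - 2
    return f

x = np.linspace(-2.0, 2.0, 100001)
for depth in range(1, 6):
    slopes = np.sign(np.diff(sawtooth(x, depth)))
    kinks = np.count_nonzero(np.diff(slopes))
    print(depth, kinks)  # the number of kinks roughly doubles with each layer
```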
 
Higher-dimensional Geometry 
Regions: piecewise-linear, often triangular/polygonal in 2D 
Continuity arises despite stitching flat regions 
Early layers: few regions; later layers: many more 
 
Deep ReLU network output showing complex piecewise-linear regions in 2D. Each region has constant gradient, but the overall function is continuous.
 
 
Chain Rule and Backpropagation 
 
Historical Context 
1957: Rosenblatt’s Perceptron with multiple layers
Association units: deep structure with random weights
Modern approach: adjust weights to minimise error 
 
Parameterised Basis Functions 
Basis functions: \(\boldsymbol{ \phi}(\mathbf{ x}; \mathbf{V})\)
Adjustable parameters: \(\mathbf{V}\) can be optimised
Non-linear models: in both inputs and parameters 
 
The Gradient Problem 
Model: \(f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i; \mathbf{V})\)
Easy gradient: \(\frac{\text{d}f_i}{\text{d}\mathbf{ w}} = \boldsymbol{ \phi}(\mathbf{ x}_i; \mathbf{V})\)
Hard gradient: \(\frac{\text{d}f_i}{\text{d}\mathbf{V}}\)  
 
Chain Rule Application 
\[\frac{\text{d}f_i}{\text{d}\mathbf{V}} = \frac{\text{d}f_i}{\text{d}\boldsymbol{ \phi}_i}\frac{\text{d}\boldsymbol{ \phi}_i}{\text{d} \mathbf{V}}\] 
 
The Matrix Challenge 
Chain rule: standard approach
New challenge: \(\frac{\text{d}\boldsymbol{ \phi}_i}{\text{d} \mathbf{V}}\)
Problem: derivative of a vector with respect to a matrix 
 
Matrix Calculus Standards 
Multiple approaches to the multivariate chain rule
Mike Brookes’ method: preferred approach
Reference: Brookes (2005)  
 
Brookes’ Approach 
Never compute matrix derivatives directly
Stacking operation: intermediate step
Vector derivatives: easier to handle 
 
Matrix Stacking 
Input: matrix \(\mathbf{A}\)
Output: vector \(\mathbf{a}\)
Process: stack columns vertically 
 
Vector Derivatives 
\[\frac{\text{d} \mathbf{a}}{\text{d} \mathbf{b}} \in \Re^{m \times n}\] 
Input: \(\mathbf{a} \in \Re^{m \times 1}\), \(\mathbf{b} \in \Re^{n \times 1}\)
Output: \(m \times n\) matrix
Standard form: vector-to-vector derivatives 
 
Matrix Stacking Notation 
Notation: \(\mathbf{C}\!:\) denotes the stacking of \(\mathbf{C}\)
Stacking: columns of \(\mathbf{C}\) → \(\Re^{mn \times 1}\) vector
Condition: \(\mathbf{C} \in \Re^{m \times n}\)  
 
Matrix-Matrix Derivatives 
\[\frac{\text{d} \mathbf{E}}{\text{d} \mathbf{C}} \in \Re^{pq \times mn}\] 
Input: \(\mathbf{E} \in \Re^{p \times q}\), \(\mathbf{C} \in \Re^{m \times n}\)
Output: \(pq \times mn\) matrix
Advantage: easier chain rule application 
 
Kronecker Products 
Notation: \(\mathbf{F} \otimes \mathbf{G}\)
Essential tool: for matrix derivatives
Key role: in chain rule applications 
 
Kronecker Products 
Notation: \(\mathbf{A} \otimes \mathbf{B}\)
Input: matrices \(\mathbf{A} \in \Re^{m \times n}\), \(\mathbf{B} \in \Re^{p \times q}\)
Output: \((mp) \times (nq)\) matrix 
 
Visual Examples 
Illustration of the Kronecker product.
 
 
Kronecker Product Visualization 
Kronecker product between two matrices.
 
 
Column vs Row Stacking 
Column stacking: \(\mathbf{I} \otimes \mathbf{K}\) - time independence
Row stacking: \(\mathbf{K} \otimes \mathbf{I}\) - dimension independence
Different structure: affects which variables are correlated 
 
Basic Properties 
Distributive: \((\mathbf{A} + \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes \mathbf{C} + \mathbf{B} \otimes \mathbf{C}\)
Associative: \((\mathbf{A} \otimes \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes (\mathbf{B} \otimes \mathbf{C})\)
Mixed product: \((\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = \mathbf{A}\mathbf{C} \otimes \mathbf{B}\mathbf{D}\)  
 
Identity and Zero 
Identity: \(\mathbf{I}_m \otimes \mathbf{I}_n = \mathbf{I}_{mn}\)
Zero: \(\mathbf{0} \otimes \mathbf{A} = \mathbf{A} \otimes \mathbf{0} = \mathbf{0}\)
Scalar: \(c \otimes \mathbf{A} = \mathbf{A} \otimes c = c\mathbf{A}\)  
 
Transpose and Inverse 
Transpose: \((\mathbf{A} \otimes \mathbf{B})^\top = \mathbf{A}^\top \otimes \mathbf{B}^\top\)
Inverse: \((\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}\) (when both exist)
Determinant: \(\det(\mathbf{A} \otimes \mathbf{B}) = \det(\mathbf{A})^q \det(\mathbf{B})^m\)  
 
Matrix Stacking and Kronecker Products 
 
Column Stacking 
Stacking: place columns of the matrix on top of each other
Notation: \(\mathbf{a} = \text{vec}(\mathbf{A})\) or \(\mathbf{a} = \mathbf{A}\!:\)
Kronecker form: \(\mathbf{I} \otimes \mathbf{K}\) for column-wise structure 
 
Row Stacking 
Alternative: stack rows instead of columns
Kronecker form: \(\mathbf{K} \otimes \mathbf{I}\) for row-wise structure
Different structure: changes which variables are independent 
 
Efficient Matrix Operations 
Inversion: \((\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}\)
Eigenvalues: \(\lambda_i(\mathbf{A} \otimes \mathbf{B}) = \lambda_j(\mathbf{A}) \lambda_k(\mathbf{B})\)
Determinant: \(\det(\mathbf{A} \otimes \mathbf{B}) = \det(\mathbf{A})^q \det(\mathbf{B})^m\)  
 
Memory and Speed Benefits 
Storage: avoid storing the full covariance matrix
Operations: work with smaller component matrices
Scaling: \(O(n^3)\) becomes \(O(n_1^3 + n_2^3)\) for \(n = n_1 n_2\)  
 
Kronecker Product Simplification 
\[(\mathbf{E}:)^{\top} (\mathbf{F} \otimes \mathbf{G})=\left(\left(\mathbf{G}^{\top} \mathbf{E F}\right):\right)^{\top}\] 
Removal rule: Kronecker products often simplified
Chain rule: typically produces this form
Key insight: avoid direct Kronecker computation 
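A quick NumPy check of the removal rule, using column stacking (vec) for the \(:\) operator (illustrative, not lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 2))
E = rng.standard_normal((3, 4))
F = rng.standard_normal((4, 5))

vec = lambda A: A.flatten(order="F")   # stack the columns of A
lhs = vec(E) @ np.kron(F, G)           # (E:)^T (F ⊗ G)
rhs = vec(G.T @ E @ F)                 # ((G^T E F):)^T
print(np.allclose(lhs, rhs))           # True
```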
 
Chain Rule Application 
\[\frac{\text{d} E}{\text{d} \mathbf{H}:} \frac{\text{d} \mathbf{H}:}{\text{d} \mathbf{J}:}=\frac{\text{d} E}{\text{d} \mathbf{J}:}\] 
\[\frac{\text{d}\mathbf{H}}{\text{d} \mathbf{J}} = \mathbf{F} \otimes \mathbf{G}\] 
 
Kronecker Product Identities 
\[\mathbf{F}^{\top} \otimes \mathbf{G}^{\top}=(\mathbf{F} \otimes \mathbf{G})^{\top}\] 
\[(\mathbf{E} \otimes \mathbf{G})(\mathbf{F} \otimes \mathbf{H})=\mathbf{E F} \otimes \mathbf{G H}\] 
 
Derivative Representations 
Two forms: row vector vs matrix
Row vector: easier computation
Matrix: better summarization 
\[\frac{\text{d} E}{\text{d} \mathbf{J}:}=\left(\left(\frac{\text{d} E}{\text{d} \mathbf{J}}\right):\right)^{\top}\] 
 
Useful Multivariate Derivatives 
 
Derivative of \(\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{A}\)  
Problem: \(\mathbf{b} = \mathbf{A}\mathbf{x}\), find \(\frac{\text{d}\mathbf{b}}{\text{d}\mathbf{A}}\)
Approach: indirect derivation via \(\mathbf{a}^\prime\)
Key insight: stacking rows vs columns 
 
Indirect Approach 
\[\frac{\text{d}}{\text{d} \mathbf{a^\prime}} \mathbf{A} \mathbf{x}\] 
Size: \(m \times mn\) matrix
Definition: \(\mathbf{a}^\prime = \left(\mathbf{A}^\top\right):\)
Stacking: columns of \(\mathbf{A}^\top\) = rows of \(\mathbf{A}\)  
 
Row-wise Structure 
\[\mathbf{A}\mathbf{x} = \begin{bmatrix}
\mathbf{a}_{1, :}^\top \mathbf{x} \\
\mathbf{a}_{2, :}^\top \mathbf{x} \\
\vdots \\
\mathbf{a}_{m, :}^\top \mathbf{x}
\end{bmatrix}\] 
Structure: inner products of rows of \(\mathbf{A}\)
Gradient: \(\frac{\text{d}\mathbf{a}_{i,:}^\top \mathbf{x}}{\text{d}\mathbf{a}_{i,:}} = \mathbf{x}\)
Expectation: repeated versions of \(\mathbf{x}\)  
 
Kronecker Product Tiling 
\[\mathbf{I}_m \otimes \mathbf{x}^\top = \begin{bmatrix}
\mathbf{x}^\top & \mathbf{0}& \cdots & \mathbf{0}\\
\mathbf{0}& \mathbf{x}^\top & \cdots & \mathbf{0}\\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0}& \mathbf{0}& \cdots& \mathbf{x}^\top
\end{bmatrix}\] 
Pattern: tiling of \(\mathbf{x}^\top\)
Size: \(m \times nm\) matrix
Structure: \(\mathbf{x}\) repeated in blocks 
 
Final Result 
\[\frac{\text{d}}{\text{d} \mathbf{a^\prime}} \mathbf{A} \mathbf{x} = \mathbf{I}_{m} \otimes \mathbf{x}^\top\] 
Each row: derivative of \(b_i = \mathbf{a}_{i, :}^\top \mathbf{x}\)
Non-zero elements: only where \(\mathbf{a}_{i, :}\) appears
Values: \(\mathbf{x}\) in appropriate positions 
 
Column Stacking Alternative 
\[\frac{\text{d}}{\text{d} \mathbf{a}} \mathbf{A} \mathbf{x} = \mathbf{x}^\top \otimes \mathbf{I}_{m}\] 
Change: \(\mathbf{a} = \mathbf{A}:\) (column stacking)
Method: reverse Kronecker product order
Result: \(\mathbf{x}^\top \otimes \mathbf{I}_{m}\)  
 
Column Stacking Structure 
\[\mathbf{x}^\top \otimes \mathbf{I}_{m} = \begin{bmatrix}
x_1 \mathbf{I}_m & x_2 \mathbf{I}_m & \cdots & x_n \mathbf{I}_m
\end{bmatrix}\] 
Pattern: \(\mathbf{x}\) elements distributed across blocks
Size: \(m \times mn\) matrix
Correct: for column stacking of \(\mathbf{A}\)  
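A numerical check of this result (illustrative): perturb each element of \(\mathbf{a} = \mathbf{A}:\) and compare finite differences against \(\mathbf{x}^\top \otimes \mathbf{I}_m\).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

analytic = np.kron(x, np.eye(m))       # x^T ⊗ I_m, shape (m, m*n)

# Finite-difference check under column stacking: vec(A)_j = A[j % m, j // m].
eps = 1e-6
numeric = np.zeros((m, m * n))
for j in range(m * n):
    dA = np.zeros((m, n))
    dA[j % m, j // m] = eps            # unstack the perturbation
    numeric[:, j] = ((A + dA) @ x - A @ x) / eps
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```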
 
Chain Rule for Neural Networks 
 
Neural Network Chain Rule 
Goal: compute gradients for neural networks
Approach: apply matrix derivative rules
Key: three fundamental gradients needed 
 
Neural Network Structure 
\[\mathbf{ f}_\ell= \mathbf{W}_\ell\boldsymbol{ \phi}_{\ell-1}\] 
\[\boldsymbol{ \phi}_{\ell} = \boldsymbol{ \phi}_{\ell}(\mathbf{ f}_{\ell})\] 
Activations: \(\mathbf{ f}_\ell\) at layer \(\ell\)
Basis functions: applied to the activations
Weight matrix: \(\mathbf{W}_\ell\)  
 
First and Final Layers 
\[\mathbf{ f}_1 = \mathbf{W}_1 \mathbf{ x}\] 
\[\mathbf{h} = \mathbf{h}(\mathbf{ f}_L)\] 
First layer: \(\mathbf{ f}_1 = \mathbf{W}_1 \mathbf{ x}\)
Final layer: inverse link function
Dummy basis: \(\mathbf{h} = \boldsymbol{ \phi}_L\)  
 
Limitations 
Skip connections: not captured in this formalism
Example: layer \(\ell-2\) → layer \(\ell\)
Sufficient: for the main aspects of deep network gradients 
 
Three Fundamental Gradients 
Activation gradient: \(\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\mathbf{ w}_\ell}\)
Basis gradient: \(\frac{\text{d}\boldsymbol{ \phi}_{\ell}}{\text{d} \mathbf{ f}_{\ell}}\)
Across-layer gradient: \(\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\boldsymbol{ \phi}_{\ell-1}}\)  
 
Activation Gradient 
\[\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\mathbf{ w}_\ell} = \boldsymbol{ \phi}_{\ell-1}^\top \otimes \mathbf{I}_{d_\ell}\] 
Size: \(d_\ell\times (d_{\ell-1} d_\ell)\)
Input: \(\mathbf{ f}_\ell\) and \(\mathbf{ w}_\ell\)
Kronecker product: key to the result 
 
Basis Gradient 
\[\frac{\text{d}\boldsymbol{ \phi}_{\ell}}{\text{d} \mathbf{ f}_{\ell}} = \boldsymbol{ \Phi}^\prime\] 
\[\frac{\text{d}\phi_i^{(\ell)}(\mathbf{ f}_{\ell})}{\text{d} f_j^{(\ell)}}\] 
Size: \(d_\ell\times d_\ell\) matrix
Elements: derivatives of basis functions
Diagonal: if basis functions depend only on their own activations 
 
Activations and Gradients 
Some different activation functions (solid line) and their gradients (dashed line).
 
 
Across-Layer Gradient 
\[\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\boldsymbol{ \phi}_{\ell-1}} = \mathbf{W}_\ell\] 
Size: \(d_\ell\times d_{\ell-1}\)
Matches: shape of \(\mathbf{W}_\ell\)
Simple: just the weight matrix itself 
 
Complete Chain Rule 
\[\frac{\text{d} \mathbf{ f}_\ell}{\text{d}\mathbf{ w}_{\ell-k}} = \left[\prod_{i=0}^{k-1} \frac{\text{d} \mathbf{ f}_{\ell - i}}{\text{d} \boldsymbol{ \phi}_{\ell - i -1}}\frac{\text{d} \boldsymbol{ \phi}_{\ell - i -1}}{\text{d} \mathbf{ f}_{\ell - i -1}}\right] \frac{\text{d} \mathbf{ f}_{\ell-k}}{\text{d} \mathbf{ w}_{\ell - k}}\] 
Product: of all intermediate gradients
Chain: from layer \(\ell\) back to layer \(\ell-k\)
Complete: gradient computation for any layer 
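A minimal NumPy sketch (not the mlai implementation) that applies these three gradients as a backward recursion, for a ReLU network without biases:

```python
import numpy as np

def relu(f):
    return np.maximum(f, 0.0)

def relu_grad(f):
    return (f > 0).astype(float)   # diagonal of Phi' as a vector

def forward(x, weights):
    """Forward pass storing activations f_l and basis outputs phi_l."""
    phis, fs = [x], []             # phi_0 = x
    for W in weights:
        fs.append(W @ phis[-1])    # f_l = W_l phi_{l-1}
        phis.append(relu(fs[-1]))  # phi_l = phi(f_l)
    return fs, phis

def backward(fs, phis, weights, dE_dfL):
    """dE/dW_l for every layer; dE_dfL is the gradient at the top activations."""
    grads, delta = [], dE_dfL
    for l in reversed(range(len(weights))):
        grads.append(np.outer(delta, phis[l]))           # dE/dW_l = delta phi_{l-1}^T
        if l > 0:
            # across-layer gradient W_l, then basis gradient diag(relu')
            delta = (weights[l].T @ delta) * relu_grad(fs[l - 1])
    return grads[::-1]
```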
 
Verify Neural Network Gradients 
 
Test loss function gradients 
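A hedged sketch of such a check, assuming a hypothetical scalar loss(w) and its claimed gradient grad(w):

```python
import numpy as np

def check_gradient(loss, grad, w, eps=1e-6, tol=1e-4):
    """Compare an analytic gradient against central finite differences."""
    g_analytic = grad(w)
    g_numeric = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g_numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return np.max(np.abs(g_analytic - g_numeric)) < tol
```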
 
Simple Deep NN Implementation 
 
Create and Test Neural Network 
 
Unified Optimisation Interface 
 
Train Network for Regression 
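A minimal PyTorch sketch of such a regression fit (illustrative only; the lecture uses the mlai implementation, and X, y are assumed to be tensors of shape (n, d) and (n, 1)):

```python
import torch

def train_regression(X, y, hidden=32, epochs=500, lr=1e-2):
    model = torch.nn.Sequential(
        torch.nn.Linear(X.shape[1], hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, 1),
    )
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()                            # clear old gradients
        loss = torch.nn.functional.mse_loss(model(X), y)
        loss.backward()                                  # reverse-mode autodiff
        optimiser.step()                                 # gradient update
    return model
```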
Training progress for the neural network fit. Left: training loss; right: training accuracy.
 
 
Visualise Decision Boundary 
Decision boundary for the trained neural network
 
 
Evaluating Derivatives 
\[
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} 
\] 
\[
\small
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \left( \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \left( \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \left( \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \left( \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \right) \right) \cdots \right) \right)
\] 
 
Or like this? 
\[
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} 
\] 
\[
\small
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \left( \left( \cdots \left( \left( \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}}  \right) \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \right) \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \right) \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \right) \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} 
\] 
 
Or in a funky way? 
\[
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} 
\] 
\[
\small
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \left( \left( \left( \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}}  \right) \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \right) \left( \left( \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \right) \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \right) \right)\frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} 
\] 
 
Automatic differentiation 
Forward-mode 
\[
\small
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \left( \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \left( \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \left( \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \left( \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \right) \right) \cdots \right) \right)
\] 
Cost: \[
\small
d_0 d_1 d_2 + d_0 d_2 d_3 + \ldots + d_0 d_{L-1} d_L= d_0 \sum_{\ell=2}^{L} d_\ell d_{\ell-1}
\] 
 
Automatic Differentiation 
Reverse-mode 
\[
\small
\frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \left( \left( \cdots \left( \left( \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}}  \right) \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \right) \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \right) \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \right) \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} 
\] 
Cost: \[
\small
 d_Ld_{L-1}d_{L-2} + d_{L}d_{L-2} d_{L-3} + \ldots + d_Ld_{1} d_0 = d_L\sum_{\ell=0}^{L-2}d_\ell d_{\ell+1}
\] 
 
Forward-mode 
Store only:
current activation \(\mathbf{ f}_l\)  (\(d_l\) ) 
running JVP of size \(d_0 \times d_l\)  
current Jacobian \(d_l \times d_{l+1}\)  
 
 
Reverse-mode 
Cache activations \(\{\mathbf{ f}_l\}\)  so local Jacobians can be applied during backward; higher memory than forward-mode.
 
Automatic differentiation 
in deep learning we’re most interested in scalar objectives 
\(d_L=1\), consequently reverse-mode is always optimal
in the context of neural networks this is backpropagation 
backprop has higher memory cost than forwardprop 
 
 
Backprop in deep networks 
 
Autodiff: Practical Details 
Terminology: forward-mode = JVP; reverse-mode = VJP
Non-scalar outputs: seed vector required (e.g. backward(grad_outputs))
Stopping grads: .detach(), no_grad() in PyTorch 
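A brief illustration of the JVP/VJP terminology on a toy function (a sketch, not lecture code):

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):                                   # toy map from R^3 to R^2
    return torch.stack([x[0] * x[1], x[1] * x[2]])

x = torch.tensor([1.0, 2.0, 3.0])
v_in = torch.tensor([1.0, 0.0, 0.0])        # direction in input space
v_out = torch.tensor([1.0, 0.0])            # seed vector in output space

_, forward_mode = jvp(f, x, v_in)           # J v_in    (forward-mode)
_, reverse_mode = vjp(f, x, v_out)          # v_out^T J (reverse-mode)
print(forward_mode, reverse_mode)
```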
 
Autodiff: Practical Details 
Grad accumulation: .grad accumulates by default; call zero_grad()
Memory: backprop caches intermediates; use checkpointing/rematerialisation
Mixed-mode: combine forward+reverse for Hessian-vector products 
 
Computational Graphs and PyTorch Autodiff 
Nodes: tensors and ops; Edges: data dependencies 
Forward pass caches intermediates needed for backward 
Backward pass applies local Jacobian-vector products 
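A small PyTorch example of the graph in action (illustrative): the forward pass builds the graph, the backward pass sweeps through it in reverse.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
W = torch.tensor([[0.5, -1.0], [2.0, 0.0]], requires_grad=True)

h = torch.relu(W @ x)      # intermediate node, cached for the backward pass
loss = (h ** 2).sum()      # scalar objective at the root of the graph

loss.backward()            # reverse-mode sweep through the graph
print(x.grad, W.grad)      # dloss/dx and dloss/dW
```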
 
 
Overparameterised Systems 
Neural networks are highly overparameterised. 
If we could examine their Hessian at the “optimum”:
Very low (or negative) eigenvalues. 
Error function is not sensitive to changes in parameters. 
Implies parameters are badly determined  
  
 
 
Whence Generalisation? 
Not enough regularisation in our objective functions to explain it. 
Neural network models are not using traditional generalisation approaches. 
The ability of these models to generalise must be coming somehow from the algorithm. 
How to explain it and control it is perhaps the most interesting theoretical question for neural networks. 
 
 
Double Descent 
Left: the traditional perspective on generalisation. There is a sweet spot of operation where the training error is still non-zero. Overfitting occurs when the variance increases. Right: the double descent phenomenon; modern models operate in an interpolation regime where they reconstruct the training data fully but are well regularised in their interpolations for test data. Figure from Belkin et al. (2019).
 
 
Neural Tangent Kernel 
Consider very wide neural networks. 
Consider particular initialisation. 
Deep neural network is regularising with a particular kernel . 
This is known as the neural tangent kernel (Jacot et al., 2018) . 
 
 
Regularization in Optimization 
Gradient flow methods allow us to study nature of optima. 
In particular systems, with given initialisations, we can show L1 and L2 norms are minimised. 
In other cases the rank of \(\mathbf{W}\)  is minimised. 
Questions remain over the nature of this regularisation in neural networks. 
 
 
Deep Linear Models 
\[
f(\mathbf{ x}; \mathbf{W}) = \mathbf{W}_4 \mathbf{W}_3 \mathbf{W}_2 \mathbf{W}_1 \mathbf{ x}.
\] 
\[
\mathbf{W}= \mathbf{W}_4 \mathbf{W}_3 \mathbf{W}_2 \mathbf{W}_1 
\] 
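A quick NumPy check (illustrative) that the composition collapses to a single linear map:

```python
import numpy as np

rng = np.random.default_rng(2)
W1, W2, W3, W4 = (rng.standard_normal((3, 3)) for _ in range(4))
x = rng.standard_normal(3)

deep = W4 @ (W3 @ (W2 @ (W1 @ x)))   # layer-by-layer evaluation
W = W4 @ W3 @ W2 @ W1                # collapsed weight matrix
print(np.allclose(deep, W @ x))      # True
```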
 
Practical Optimisation Notes 
Initialisation: variance-preserving (He, Xavier); scale affects signal propagation 
Optimisers: SGD(+momentum), Adam/AdamW; decoupled weight decay often preferable 
Learning rate schedules: cosine, step, warmup; LR is the most sensitive hyperparameter 
Batch size: affects gradient noise and implicit regularisation; tune with LR 
Normalisation: BatchNorm/LayerNorm stabilise optimisation 
Regularisation: data augmentation, dropout, weight decay; early stopping 
 
 
Further Reading 
Baydin et al. (2018) Automatic differentiation in ML (JMLR) 
Jacot, Gabriel, Hongler (2018) Neural Tangent Kernel 
Arora et al. (2019) On exact dynamics of deep linear networks 
Keskar et al. (2017) Large-batch training and sharp minima 
Novak et al. (2018) Sensitivity and generalisation in neural networks 
JAX Autodiff Cookbook (JVPs/VJPs, jacfwd/jacrev)
PyTorch Autograd Docs (computational graph, grad_fn, backward) 
 
 
Example: calculating a Hessian 
\[
H(\mathbf{ w}) = \frac{\text{d}^2}{\text{d}\mathbf{ w}\text{d}\mathbf{ w}^\top} L(\mathbf{ w})
:= \frac{\text{d}}{\text{d}\mathbf{ w}} \mathbf{g}(\mathbf{ w})
\] 
where \(\mathbf{g}(\mathbf{ w}) = \frac{\text{d}}{\text{d}\mathbf{ w}} L(\mathbf{ w})\) is the gradient.
 
Example: Hessian-vector product 
\[
\mathbf{v}^\top H(\mathbf{ w}) = \frac{\text{d}}{\text{d}\mathbf{ w}} \left( \mathbf{v}^\top \mathbf{g}(\mathbf{ w}) \right)
\]
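A sketch of this Hessian-vector product via double backward in PyTorch, for an illustrative loss \(L(\mathbf{ w}) = \sum_i w_i^4\) (not the lecture's objective):

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.0, 0.0])

L = (w ** 4).sum()
g = torch.autograd.grad(L, w, create_graph=True)[0]   # gradient g(w)
hvp = torch.autograd.grad(g @ v, w)[0]                # d/dw (v^T g) = v^T H
print(hvp)   # Hessian is diag(12 w^2), so this prints [12., 0., 0.]
```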