Deep Architectures

Neil Lawrence

Dedan Kimathi University of Technology, Nyeri, Kenya

Review

Basis Functions to Neural Networks

From Basis Functions to Neural Networks

  • So far: linear models or “shallow learning”
  • Consider instead function composition or “deep learning”

Neural Network Prediction Function

\[ f(\mathbf{ x}; \mathbf{W}) = \mathbf{ w}_4 ^\top\phi\left(\mathbf{W}_3 \phi\left(\mathbf{W}_2\phi\left(\mathbf{W}_1 \mathbf{ x}\right)\right)\right). \]

Linear Models as Basis + Weights

\[ f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i). \]


Deep Neural Network

Deep Neural Network

Mathematically

\[ \begin{align*} \mathbf{ h}_{1} &= \phi\left(\mathbf{W}_1 \mathbf{ x}\right)\\ \mathbf{ h}_{2} &= \phi\left(\mathbf{W}_2\mathbf{ h}_{1}\right)\\ \mathbf{ h}_{3} &= \phi\left(\mathbf{W}_3 \mathbf{ h}_{2}\right)\\ f&= \mathbf{ w}_4 ^\top\mathbf{ h}_{3} \end{align*} \]

From Shallow to Deep

\[ f(\mathbf{ x}; \mathbf{W}) = \mathbf{ w}_4 ^\top\phi\left(\mathbf{W}_3 \phi\left(\mathbf{W}_2\phi\left(\mathbf{W}_1 \mathbf{ x}\right)\right)\right). \]

Gradient-based optimization

\[ E(\mathbf{W}) = \sum_{i=1}^nE(\mathbf{ y}_i, f(\mathbf{ x}_i; \mathbf{W})) \]

\[ \mathbf{W}_{t+1} = \mathbf{W}_t - \eta \nabla_\mathbf{W}E(\mathbf{W}) \]
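As a minimal sketch of this update in NumPy (the objective and the gradient function `grad_E` below are illustrative placeholders, not the lecture's code):

```python
import numpy as np

def gradient_descent(W, grad_E, eta=0.01, n_steps=100):
    """Apply W_{t+1} = W_t - eta * grad E(W_t) for a fixed number of steps."""
    for _ in range(n_steps):
        W = W - eta * grad_E(W)
    return W

# Illustrative objective E(W) = 0.5 ||W||^2, whose gradient is W itself.
W0 = np.random.randn(3, 2)
W_star = gradient_descent(W0, grad_E=lambda W: W, eta=0.1, n_steps=200)
print(np.linalg.norm(W_star))  # should be close to zero
```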

Basic Multilayer Perceptron

\[ \small f_L(\mathbf{ x}) = \boldsymbol{ \phi}\left(\mathbf{b}_L+ \mathbf{W}_L\boldsymbol{ \phi}\left(\mathbf{b}_{L-1} + \mathbf{W}_{L-1} \boldsymbol{ \phi}\left( \cdots \boldsymbol{ \phi}\left(\mathbf{b}_1 + \mathbf{W}_1 \mathbf{ x}\right) \cdots \right)\right)\right) \]

Recursive MLP Definition

\[\begin{align} f_\ell(\mathbf{ x}) &= \boldsymbol{ \phi}\left(\mathbf{W}_\ell f_{\ell-1}(\mathbf{ x}) + \mathbf{b}_\ell\right),\quad \ell=1,\dots,L\\ f_0(\mathbf{ x}) &= \mathbf{ x} \end{align}\]

Rectified Linear Unit

\[ \phi(x) = \begin{cases} 0 & x \le 0\\ x & x > 0 \end{cases} \]

Shallow ReLU Networks

\[\begin{align} \mathbf{h}(x) &= \operatorname{ReLU}(\mathbf{w}_1 x - \mathbf{b}_1)\\ f(x) &= \mathbf{w}_2^{\top} \, \mathbf{h}(x) \end{align}\]
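A small NumPy sketch of this shallow ReLU network in one dimension (the width and random weights are illustrative). Each hidden unit switches on at \(x = b_i / w_i\), so the width bounds the number of kinks:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
width = 10
w1, b1, w2 = (rng.standard_normal(width) for _ in range(3))

def f(x):
    # h(x) = ReLU(w1 x - b1), f(x) = w2^T h(x)
    return w2 @ relu(np.outer(w1, x) - b1[:, None])

# Each unit contributes at most one kink, located at x = b1_i / w1_i.
breakpoints = b1 / w1
print(np.sort(breakpoints))
print(f(np.linspace(-3, 3, 7)))
```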

ReLU Activations

Decision Boundaries

Decision Boundaries

Complexity (1D): number of kinks \(\propto\) width of network.

Depth vs Width: Representation Power

  • Single hidden layer: number of kinks \(\approx O(\text{width})\)
  • Deep ReLU networks: kinks \(\approx O(2^{\text{depth}})\)

Sawtooth Construction

\[\begin{align} f_l(x) &= 2\vert f_{l-1}(x)\vert - 2, \quad f_0(x) = x\\ f_l(x) &= 2 \operatorname{ReLU}(f_{l-1}(x)) + 2 \operatorname{ReLU}(-f_{l-1}(x)) - 2 \end{align}\]

Complexity (1D): number of kinks \(\approx O(2^{\text{depth}})\).
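A NumPy sketch of the sawtooth construction, confirming that the ReLU form reproduces repeated application of \(2\vert f \vert - 2\):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sawtooth(x, depth):
    """f_l = 2 ReLU(f_{l-1}) + 2 ReLU(-f_{l-1}) - 2, starting from f_0 = x."""
    f = np.asarray(x, dtype=float)
    for _ in range(depth):
        f = 2.0 * relu(f) + 2.0 * relu(-f) - 2.0
    return f

x = np.linspace(-2.0, 2.0, 9)
print(sawtooth(x, 1))   # a single fold of the input

# The ReLU form should match iterating 2|f| - 2 directly.
g = x.copy()
for _ in range(3):
    g = 2.0 * np.abs(g) - 2.0
assert np.allclose(sawtooth(x, 3), g)
```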

Higher-dimensional Geometry

  • Regions: piecewise-linear, often triangular/polygonal in 2D
  • Continuity arises despite stitching flat regions
  • Early layers: few regions; later layers: many more

Chain Rule and Backpropagation

Historical Context

  • 1957: Rosenblatt’s Perceptron with multiple layers
  • Association units: deep structure with random weights
  • Modern approach: adjust weights to minimise error

Parameterised Basis Functions

  • Basis functions: \(\boldsymbol{ \phi}(\mathbf{ x}; \mathbf{V})\)
  • Adjustable parameters: \(\mathbf{V}\) can be optimized
  • Non-linear models: in both inputs and parameters

The Gradient Problem

  • Model: \(f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i; \mathbf{V})\)
  • Easy gradient: \(\frac{\text{d}f_i}{\text{d}\mathbf{ w}} = \boldsymbol{ \phi}(\mathbf{ x}_i; \mathbf{V})\)
  • Hard gradient: \(\frac{\text{d}f_i}{\text{d}\mathbf{V}}\)

Chain Rule Application

\[\frac{\text{d}f_i}{\text{d}\mathbf{V}} = \frac{\text{d}f_i}{\text{d}\boldsymbol{ \phi}_i}\frac{\text{d}\boldsymbol{ \phi}_i}{\text{d} \mathbf{V}}\]

The Matrix Challenge

  • Chain rule: standard approach
  • New challenge: \(\frac{\text{d}\boldsymbol{ \phi}_i}{\text{d} \mathbf{V}}\)
  • Problem: derivative of vector with respect to matrix

Matrix Vector Chain Rule

Matrix Calculus Standards

  • Multiple approaches to multivariate chain rule
  • Mike Brookes’ method: preferred approach
  • Reference: Brookes (2005)

Brookes’ Approach

  • Never compute matrix derivatives directly
  • Stacking operation: intermediate step
  • Vector derivatives: easier to handle

Matrix Stacking

  • Input: matrix \(\mathbf{A}\)
  • Output: vector \(\mathbf{a}\)
  • Process: stack columns vertically

Stacking Formula

\[\mathbf{a} = \begin{bmatrix} \mathbf{a}_{:, 1}\\ \mathbf{a}_{:, 2}\\ \vdots\\ \mathbf{a}_{:, q} \end{bmatrix}\]

  • Column stacking: each column becomes a block
  • Order: first column at top, last at bottom
  • Size: \(mn \times 1\) vector from \(m \times n\) matrix
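In NumPy this column stacking is simply a Fortran-order reshape, for example:

```python
import numpy as np

# Column stacking ("vec"): a 2x3 matrix becomes a 6x1 vector, first column on top.
A = np.array([[0, 1, 2],
              [3, 4, 5]])
a = A.reshape(-1, 1, order="F")   # Fortran order reads column by column
print(a.ravel())                  # [0 3 1 4 2 5]
```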

Vector Derivatives

\[\frac{\text{d} \mathbf{a}}{\text{d} \mathbf{b}} \in \Re^{m \times n}\]

  • Input: \(\mathbf{a} \in \Re^{m \times 1}\), \(\mathbf{b} \in \Re^{n \times 1}\)
  • Output: \(m \times n\) matrix
  • Standard form: vector-to-vector derivatives

Matrix Stacking Notation

  • \(\mathbf{C}\!:\) vector formed from matrix \(\mathbf{C}\)
  • Stacking: columns of \(\mathbf{C}\) form a \(\Re^{mn \times 1}\) vector
  • Condition: \(\mathbf{C} \in \Re^{m \times n}\)

Matrix-Matrix Derivatives

\[\frac{\text{d} \mathbf{E}}{\text{d} \mathbf{C}} \in \Re^{pq \times mn}\]

  • Input: \(\mathbf{E} \in \Re^{p \times q}\), \(\mathbf{C} \in \Re^{m \times n}\)
  • Output: \(pq \times mn\) matrix
  • Advantage: easier chain rule application

Kronecker Products

  • Notation: \(\mathbf{F} \otimes \mathbf{G}\)
  • Essential tool: for matrix derivatives
  • Key role: in chain rule applications

Kronecker Products

  • Notation: \(\mathbf{A} \otimes \mathbf{B}\)
  • Input: matrices \(\mathbf{A} \in \Re^{m \times n}\), \(\mathbf{B} \in \Re^{p \times q}\)
  • Output: \((mp) \times (nq)\) matrix

Visual Examples

Kronecker Product Visualization

Column vs Row Stacking

  • Column stacking: \(\mathbf{I} \otimes \mathbf{K}\) - time independence
  • Row stacking: \(\mathbf{K} \otimes \mathbf{I}\) - dimension independence
  • Different structure: affects which variables are correlated

Kronecker Product Formula

\[(\mathbf{A} \otimes \mathbf{B})_{ij} = a_{i_1 j_1} b_{i_2 j_2}\]

  • Index mapping: \(i = (i_1-1)p + i_2\), \(j = (j_1-1)q + j_2\)
  • Block structure: each element of \(\mathbf{A}\) scaled by \(\mathbf{B}\)
  • Size: \((mp) \times (nq)\) result

Basic Properties

  • Distributive: \((\mathbf{A} + \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes \mathbf{C} + \mathbf{B} \otimes \mathbf{C}\)
  • Associative: \((\mathbf{A} \otimes \mathbf{B}) \otimes \mathbf{C} = \mathbf{A} \otimes (\mathbf{B} \otimes \mathbf{C})\)
  • Mixed product: \((\mathbf{A} \otimes \mathbf{B})(\mathbf{C} \otimes \mathbf{D}) = \mathbf{A}\mathbf{C} \otimes \mathbf{B}\mathbf{D}\)
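These identities are easy to check numerically with `np.kron`; the shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))
C, D = rng.standard_normal((3, 2)), rng.standard_normal((5, 4))

K = np.kron(A, B)
assert K.shape == (2 * 4, 3 * 5)                               # (mp) x (nq)
# Mixed product: (A ⊗ B)(C ⊗ D) = AC ⊗ BD
assert np.allclose(K @ np.kron(C, D), np.kron(A @ C, B @ D))
# Transpose: (A ⊗ B)^T = A^T ⊗ B^T
assert np.allclose(K.T, np.kron(A.T, B.T))
```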

Identity and Zero

  • Identity: \(\mathbf{I}_m \otimes \mathbf{I}_n = \mathbf{I}_{mn}\)
  • Zero: \(\mathbf{0} \otimes \mathbf{A} = \mathbf{A} \otimes \mathbf{0} = \mathbf{0}\)
  • Scalar: \(c \otimes \mathbf{A} = \mathbf{A} \otimes c = c\mathbf{A}\)

Transpose and Inverse

  • Transpose: \((\mathbf{A} \otimes \mathbf{B})^\top = \mathbf{A}^\top \otimes \mathbf{B}^\top\)
  • Inverse: \((\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}\) (when both exist)
  • Determinant: \(\det(\mathbf{A} \otimes \mathbf{B}) = \det(\mathbf{A})^q \det(\mathbf{B})^m\) for square \(\mathbf{A} \in \Re^{m \times m}\), \(\mathbf{B} \in \Re^{q \times q}\)

Matrix Stacking and Kronecker Products

Column Stacking

  • Stacking: place columns of matrix on top of each other
  • Notation: \(\mathbf{a} = \text{vec}(\mathbf{A})\) or \(\mathbf{a} = \mathbf{A}\!:\)
  • Kronecker form: \(\mathbf{I} \otimes \mathbf{K}\) for column-wise structure

Row Stacking

  • Alternative: stack rows instead of columns
  • Kronecker form: \(\mathbf{K} \otimes \mathbf{I}\) for row-wise structure
  • Different structure: changes which variables are independent

Computational Advantages

Efficient Matrix Operations

  • Inversion: \((\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}\)
  • Eigenvalues: the eigenvalues of \(\mathbf{A} \otimes \mathbf{B}\) are the products \(\lambda_j(\mathbf{A})\,\lambda_k(\mathbf{B})\)
  • Determinant: \(\det(\mathbf{A} \otimes \mathbf{B}) = \det(\mathbf{A})^q \det(\mathbf{B})^m\) for square \(\mathbf{A} \in \Re^{m \times m}\), \(\mathbf{B} \in \Re^{q \times q}\)

Memory and Speed Benefits

  • Storage: avoid storing full covariance matrix
  • Operations: work with smaller component matrices
  • Scaling: \(O(n^3)\) becomes \(O(n_1^3 + n_2^3)\) for \(n = n_1 n_2\)

Kronecker Product Simplification

\[(\mathbf{E}:)^{\top} (\mathbf{F} \otimes \mathbf{G})=\left(\left(\mathbf{G}^{\top} \mathbf{E F}\right):\right)^{\top}\]

  • Removal rule: Kronecker products often simplified
  • Chain rule: typically produces this form
  • Key insight: avoid direct Kronecker computation
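A quick numerical check of this removal rule (matrix sizes are illustrative), using a column-stacking `vec` helper:

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single column vector."""
    return M.reshape(-1, 1, order="F")

rng = np.random.default_rng(1)
p, q, m, n = 3, 4, 2, 5
E = rng.standard_normal((p, q))
F = rng.standard_normal((q, n))
G = rng.standard_normal((p, m))

lhs = vec(E).T @ np.kron(F, G)   # (E:)^T (F ⊗ G), a 1 x (nm) row vector
rhs = vec(G.T @ E @ F).T         # ((G^T E F):)^T
assert np.allclose(lhs, rhs)
```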

Chain Rule Application

\[\frac{\text{d} E}{\text{d} \mathbf{H}:} \frac{\text{d} \mathbf{H}:}{\text{d} \mathbf{J}:}=\frac{\text{d} E}{\text{d} \mathbf{J}:}\]

\[\frac{\text{d}\mathbf{H}}{\text{d} \mathbf{J}} = \mathbf{F} \otimes \mathbf{G}\]

Kronecker Product Identities

\[\mathbf{F}^{\top} \otimes \mathbf{G}^{\top}=(\mathbf{F} \otimes \mathbf{G})^{\top}\]

\[(\mathbf{E} \otimes \mathbf{G})(\mathbf{F} \otimes \mathbf{H})=\mathbf{E F} \otimes \mathbf{G H}\]

Derivative Representations

  • Two forms: row vector vs matrix
  • Row vector: easier computation
  • Matrix: better summarization

\[\frac{\text{d} E}{\text{d} \mathbf{J}:}=\left(\left(\frac{\text{d} E}{\text{d} \mathbf{J}}\right):\right)^{\top}\]

Useful Multivariate Derivatives

Derivative of \(\mathbf{A}\mathbf{x}\) with respect to \(\mathbf{A}\)

  • Problem: \(\mathbf{b} = \mathbf{A}\mathbf{x}\), find \(\frac{\text{d}\mathbf{b}}{\text{d}\mathbf{A}}\)
  • Approach: indirect derivation via \(\mathbf{a}^\prime\)
  • Key insight: stacking rows vs columns

Indirect Approach

\[\frac{\text{d}}{\text{d} \mathbf{a^\prime}} \mathbf{A} \mathbf{x}\]

  • Size: \(m \times mn\) matrix
  • Definition: \(\mathbf{a}^\prime = \left(\mathbf{A}^\top\right):\)
  • Stacking: columns of \(\mathbf{A}^\top\) = rows of \(\mathbf{A}\)

Row-wise Structure

\[\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}_{1, :}^\top \mathbf{x} \\ \mathbf{a}_{2, :}^\top \mathbf{x} \\ \vdots \\ \mathbf{a}_{m, :}^\top \mathbf{x} \end{bmatrix}\]

  • Structure: inner products of rows of \(\mathbf{A}\)
  • Gradient: \(\frac{\text{d}\mathbf{a}_{i,:}^\top \mathbf{x}}{\text{d}\mathbf{a}_{i,:}} = \mathbf{x}\)
  • Expectation: repeated versions of \(\mathbf{x}\)

Kronecker Product Tiling

\[\mathbf{I}_m \otimes \mathbf{x}^\top = \begin{bmatrix} \mathbf{x}^\top & \mathbf{0}& \cdots & \mathbf{0}\\ \mathbf{0}& \mathbf{x}^\top & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}& \mathbf{0}& \cdots& \mathbf{x}^\top \end{bmatrix}\]

  • Pattern: tiling of \(\mathbf{x}^\top\)
  • Size: \(m \times nm\) matrix
  • Structure: \(\mathbf{x}\) repeated in blocks

Final Result

\[\frac{\text{d}}{\text{d} \mathbf{a^\prime}} \mathbf{A} \mathbf{x} = \mathbf{I}_{m} \otimes \mathbf{x}^\top\]

  • Each row: derivative of \(b_i = \mathbf{a}_{i, :}^\top \mathbf{x}\)
  • Non-zero elements: only where \(\mathbf{a}_{i, :}\) appears
  • Values: \(\mathbf{x}\) in appropriate positions

Column Stacking Alternative

\[\frac{\text{d}}{\text{d} \mathbf{a}} \mathbf{A} \mathbf{x} = \mathbf{x}^\top \otimes \mathbf{I}_{m}\]

  • Change: \(\mathbf{a} = \mathbf{A}:\) (column stacking)
  • Method: reverse Kronecker product order
  • Result: \(\mathbf{x}^\top \otimes \mathbf{I}_{m}\)

Column Stacking Structure

\[\mathbf{x}^\top \otimes \mathbf{I}_{m} = \begin{bmatrix} x_1 \mathbf{I}_m & x_2 \mathbf{I}_m & \cdots & x_n \mathbf{I}_m \end{bmatrix}\]

  • Pattern: \(\mathbf{x}\) elements distributed across blocks
  • Size: \(m \times mn\) matrix
  • Correct: for column stacking of \(\mathbf{A}\)
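A numerical check of this column-stacking result, comparing \(\mathbf{x}^\top \otimes \mathbf{I}_m\) against finite differences (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Analytic Jacobian of b = A x with respect to a = A: (column stacking).
J = np.kron(x[None, :], np.eye(m))            # m x (mn)

# Finite differences: perturb each entry of the stacked vector in turn.
eps = 1e-6
J_fd = np.zeros((m, m * n))
for k in range(m * n):
    dA = np.zeros(m * n)
    dA[k] = eps
    A_pert = A + dA.reshape(m, n, order="F")  # un-stack the perturbation
    J_fd[:, k] = (A_pert @ x - A @ x) / eps

assert np.allclose(J, J_fd, atol=1e-4)
```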

Chain Rule for Neural Networks

Neural Network Chain Rule

  • Goal: compute gradients for neural networks
  • Approach: apply matrix derivative rules
  • Key: three fundamental gradients needed

Neural Network Structure

\[\mathbf{ f}_\ell= \mathbf{W}_\ell\boldsymbol{ \phi}_{\ell-1}\]

\[\boldsymbol{ \phi}_{\ell} = \boldsymbol{ \phi}_{\ell}(\mathbf{ f}_{\ell})\]

  • Activations: \(\mathbf{ f}_\ell\) at layer \(\ell\)
  • Basis functions: applied to activations
  • Weight matrix: \(\mathbf{W}_\ell\)

First and Final Layers

\[\mathbf{ f}_1 = \mathbf{W}_1 \mathbf{ x}\]

\[\mathbf{h} = \mathbf{h}(\mathbf{ f}_L)\]

  • First layer: \(\mathbf{ f}_1 = \mathbf{W}_1 \mathbf{ x}\)
  • Final layer: inverse link function
  • Dummy basis: \(\mathbf{h} = \boldsymbol{ \phi}_L\)

Limitations

  • Skip connections: not captured in this formalism
  • Example: layer \(\ell-2\) → layer \(\ell\)
  • Sufficient: for main aspects of deep network gradients

Three Fundamental Gradients

  • Activation gradient: \(\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\mathbf{ w}_\ell}\)
  • Basis gradient: \(\frac{\text{d}\boldsymbol{ \phi}_{\ell}}{\text{d} \mathbf{ f}_{\ell}}\)
  • Across-layer gradient: \(\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\boldsymbol{ \phi}_{\ell-1}}\)

Activation Gradient

\[\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\mathbf{ w}_\ell} = \boldsymbol{ \phi}_{\ell-1}^\top \otimes \mathbf{I}_{d_\ell}\]

  • Size: \(d_\ell\times (d_{\ell-1} d_\ell)\)
  • Input: \(\mathbf{ f}_\ell\) and \(\mathbf{ w}_\ell\)
  • Kronecker product: key to the result

Basis Gradient

\[\frac{\text{d}\boldsymbol{ \phi}_{\ell}}{\text{d} \mathbf{ f}_{\ell}} = \boldsymbol{ \Phi}^\prime\]

\[\frac{\text{d}\phi_i^{(\ell)}(\mathbf{ f}_{\ell})}{\text{d} f_j^{(\ell)}}\]

  • Size: \(d_\ell\times d_\ell\) matrix
  • Elements: derivatives of basis functions
  • Diagonal: if basis functions depend only on their own activations

Activations and Gradients

Across-Layer Gradient

\[\frac{\text{d}\mathbf{ f}_\ell}{\text{d}\boldsymbol{ \phi}_{\ell-1}} = \mathbf{W}_\ell\]

  • Size: \(d_\ell\times d_{\ell-1}\)
  • Matches: shape of \(\mathbf{W}_\ell\)
  • Simple: just the weight matrix itself

Complete Chain Rule

\[\frac{\text{d} \mathbf{ f}_\ell}{\text{d}\mathbf{ w}_{\ell-k}} = \left[\prod_{i=0}^{k-1} \frac{\text{d} \mathbf{ f}_{\ell - i}}{\text{d} \boldsymbol{ \phi}_{\ell - i -1}}\frac{\text{d} \boldsymbol{ \phi}_{\ell - i -1}}{\text{d} \mathbf{ f}_{\ell - i -1}}\right] \frac{\text{d} \mathbf{ f}_{\ell-k}}{\text{d} \mathbf{ w}_{\ell - k}}\]

  • Product: of all intermediate gradients
  • Chain: from layer \(\ell\) back to layer \(\ell-k\)
  • Complete: gradient computation for any layer

Substituted Matrix Form

\[\frac{\text{d} \mathbf{ f}_\ell}{\text{d}\mathbf{ w}_{\ell-k}} = \left[\prod_{i=0}^{k-1} \mathbf{W}_{\ell - i} \boldsymbol{ \Phi}^\prime_{\ell - i -1}\right] \boldsymbol{ \phi}_{\ell-k-1}^\top \otimes \mathbf{I}_{d_{\ell-k}}\]

  • Weight matrices: \(\mathbf{W}_{\ell - i}\) for across-layer gradients
  • Basis derivatives: \(\boldsymbol{ \Phi}^\prime_{\ell - i -1}\) for activation derivatives
  • Final term: Kronecker product for activation gradient
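A small NumPy check of this substituted form for a network with two hidden ReLU layers, comparing \(\frac{\text{d}\mathbf{ f}_3}{\text{d}\mathbf{ w}_1}\) against finite differences (dimensions and activations are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(3)
d0, d1, d2, d3 = 4, 5, 6, 2
W1, W2, W3 = (rng.standard_normal((a, b))
              for a, b in [(d1, d0), (d2, d1), (d3, d2)])
x = rng.standard_normal(d0)

def forward(W1):
    f1 = W1 @ x
    f2 = W2 @ relu(f1)
    return W3 @ relu(f2), f1, f2

f3, f1, f2 = forward(W1)
Phi1 = np.diag((f1 > 0).astype(float))  # diagonal basis-derivative matrices
Phi2 = np.diag((f2 > 0).astype(float))

# d f_3 / d w_1 = W_3 Phi'_2 W_2 Phi'_1 (phi_0^T ⊗ I_{d_1}), with phi_0 = x.
J = W3 @ Phi2 @ W2 @ Phi1 @ np.kron(x[None, :], np.eye(d1))

eps = 1e-6
J_fd = np.zeros((d3, d1 * d0))
for k in range(d1 * d0):
    dW = np.zeros(d1 * d0)
    dW[k] = eps
    f3_pert, _, _ = forward(W1 + dW.reshape(d1, d0, order="F"))
    J_fd[:, k] = (f3_pert - f3) / eps

assert np.allclose(J, J_fd, atol=1e-4)
```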

Gradient Verification

Verify Neural Network Gradients

Loss Functions

Test loss function gradients
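As a minimal sketch, assuming a sum-of-squares loss (the loss used in the original notebook cell is not reproduced here), we can compare the analytic gradient with finite differences:

```python
import numpy as np

def sum_of_squares(y, f):
    """E(y, f) = 0.5 * sum (y - f)^2 and its gradient with respect to f."""
    diff = f - y
    return 0.5 * np.sum(diff ** 2), diff

rng = np.random.default_rng(4)
y, f = rng.standard_normal(10), rng.standard_normal(10)
E, grad = sum_of_squares(y, f)

# Finite-difference check of dE/df, one component at a time.
eps = 1e-6
grad_fd = np.array([(sum_of_squares(y, f + eps * np.eye(10)[i])[0] - E) / eps
                    for i in range(10)])
assert np.allclose(grad, grad_fd, atol=1e-4)
```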

Simple Deep NN Implementation

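The notebook code for this slide is not reproduced here; the following is a minimal NumPy sketch of a fully connected ReLU network with a forward pass and backpropagation. The class name `SimpleMLP` and the He-style initialisation are illustrative choices, not the lecture's `mlai` implementation.

```python
import numpy as np

class SimpleMLP:
    """Minimal fully connected ReLU network with a linear output layer."""

    def __init__(self, dims, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix W_l of size d_l x d_{l-1} per layer (no biases).
        self.W = [rng.standard_normal((dims[l + 1], dims[l])) * np.sqrt(2.0 / dims[l])
                  for l in range(len(dims) - 1)]

    def forward(self, x):
        """Return the output f_L, caching the activations needed for backprop."""
        self.f, self.phi = [], [np.asarray(x, dtype=float)]
        for l, W in enumerate(self.W):
            f = W @ self.phi[-1]
            self.f.append(f)
            # ReLU basis on hidden layers, identity on the final layer.
            self.phi.append(np.maximum(f, 0.0) if l < len(self.W) - 1 else f)
        return self.phi[-1]

    def backward(self, dE_df):
        """Given dE/df_L, return dE/dW_l for every layer (reverse-mode)."""
        grads = [None] * len(self.W)
        delta = dE_df
        for l in reversed(range(len(self.W))):
            grads[l] = np.outer(delta, self.phi[l])                  # dE/dW_l
            if l > 0:
                delta = (self.W[l].T @ delta) * (self.f[l - 1] > 0)  # through ReLU
        return grads
```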

Create and Test Neural Network

Test Backpropagation
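A sketch of such a test using the `SimpleMLP` sketch above: spot-check one backpropagated gradient entry per layer against a finite difference of a sum-of-squares loss.

```python
import numpy as np

net = SimpleMLP(dims=[3, 8, 8, 1], seed=1)
rng = np.random.default_rng(5)
x, y = rng.standard_normal(3), rng.standard_normal(1)

def loss(net):
    return 0.5 * np.sum((net.forward(x) - y) ** 2)

E = loss(net)
grads = net.backward(net.phi[-1] - y)   # dE/df_L for the sum-of-squares loss

eps = 1e-6
for l, W in enumerate(net.W):
    W[0, 0] += eps                       # perturb one weight in layer l
    fd = (loss(net) - E) / eps
    W[0, 0] -= eps
    assert np.isclose(grads[l][0, 0], fd, atol=1e-4), (l, grads[l][0, 0], fd)
```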

Train Network for Regression
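A sketch of a training loop for a small regression problem using the `SimpleMLP` sketch above; the data, layer widths and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.linspace(-1.0, 1.0, 50)
Y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(50)

net = SimpleMLP(dims=[1, 20, 20, 1], seed=2)
eta = 0.01
for epoch in range(200):
    total = 0.0
    for x, y in zip(X, Y):
        f = net.forward(np.array([x]))
        total += 0.5 * np.sum((f - y) ** 2)
        grads = net.backward(f - y)
        for W, g in zip(net.W, grads):
            W -= eta * g                  # stochastic gradient descent step
    if epoch % 50 == 0:
        print(f"epoch {epoch}: average loss {total / len(X):.4f}")
```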

Figure: Neural network training progress for a regression model.

Visualise Decision Boundary

Evaluating Derivatives

\[ \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \]

\[ \small \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \left( \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \left( \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \left( \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \left( \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \right) \right) \cdots \right) \right) \]

Or like this?

\[ \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \]

\[ \small \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \left( \left( \cdots \left( \left( \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \right) \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \right) \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \right) \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \right) \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \]

Or in a funky way?

\[ \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \]

\[ \small \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \left( \left( \left( \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \right) \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \right) \left( \left( \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \right) \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \right) \right)\frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \]

Automatic differentiation

Forward-mode

\[ \small \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \left( \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \left( \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \cdots \left( \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \left( \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \right) \right) \cdots \right) \right) \]

Cost: \[ \small d_0 d_1 d_2 + d_0 d_2 d_3 + \ldots + d_0 d_{L-1} d_L= d_0 \sum_{\ell=2}^{L} d_\ell d_{\ell-1} \]

Automatic Differentiation

Reverse-mode

\[ \small \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ w}} = \left( \left( \cdots \left( \left( \frac{\text{d} \mathbf{ f}_L}{\text{d} \mathbf{ f}_{L-1}} \frac{\text{d} \mathbf{ f}_{L-1}}{\text{d} \mathbf{ f}_{L-2}} \right) \frac{\text{d} \mathbf{ f}_{L-2}}{\text{d} \mathbf{ f}_{L-3}} \right) \cdots \frac{\text{d} \mathbf{ f}_3}{\text{d} \mathbf{ f}_{2}} \right) \frac{\text{d} \mathbf{ f}_2}{\text{d} \mathbf{ f}_{1}} \right) \frac{\text{d} \mathbf{ f}_1}{\text{d} \mathbf{ w}} \]

Cost: \[ \small d_Ld_{L-1}d_{L-2} + d_{L}d_{L-2} d_{L-3} + \ldots + d_Ld_{1} d_0 = d_L\sum_{\ell=0}^{L-2}d_\ell d_{\ell+1} \]
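As a quick illustration of these two cost formulae with made-up layer widths:

```python
# Multiply-accumulate counts for forward- vs reverse-mode accumulation,
# using illustrative layer widths d_0, ..., d_L with a scalar output.
d = [10, 100, 100, 100, 1]
L = len(d) - 1
forward_cost = d[0] * sum(d[l] * d[l - 1] for l in range(2, L + 1))
reverse_cost = d[L] * sum(d[l] * d[l + 1] for l in range(0, L - 1))
print(forward_cost, reverse_cost)   # 201000 vs 21000: reverse-mode wins when d_L = 1
```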

Memory cost of autodiff

Forward-mode

Store only:

  • current activation \(\mathbf{ f}_l\) (size \(d_l\))
  • running Jacobian (accumulated JVPs) of size \(d_l \times d_0\)
  • current local Jacobian of size \(d_{l+1} \times d_l\)

Reverse-mode

Cache activations \(\{\mathbf{ f}_l\}\) so local Jacobians can be applied during backward; higher memory than forward-mode.

Automatic differentiation

  • In deep learning we are mostly interested in scalar objectives
  • \(d_L= 1\), so reverse-mode accumulation is always optimal
  • In the context of neural networks this is backpropagation
  • Backprop has a higher memory cost than forward propagation

Backprop in deep networks

Autodiff: Practical Details

  • Terminology: forward-mode = JVP; reverse-mode = VJP
  • Non-scalar outputs: seed vector required (e.g. backward(grad_outputs))
  • Stopping grads: .detach(), no_grad() in PyTorch

Autodiff: Practical Details

  • Grad accumulation: .grad accumulates by default; call zero_grad()
  • Memory: backprop caches intermediates; use checkpointing/rematerialisation
  • Mixed-mode: combine forward+reverse for Hessian-vector products

Computational Graphs and PyTorch Autodiff

  • Nodes: tensors and ops; Edges: data dependencies
  • Forward pass caches intermediates needed for backward
  • Backward pass applies local Jacobian-vector products
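A minimal PyTorch illustration of these mechanics, using only standard autograd calls:

```python
import torch

# Build a tiny computational graph and run reverse-mode autodiff.
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

f = torch.relu(w @ x)          # forward pass; intermediates are cached for backward
loss = (f - 1.0) ** 2
loss.backward()                # reverse-mode pass: dloss/dw accumulated into w.grad
print(w.grad)

# .grad accumulates across backward calls, so zero it before the next one.
w.grad.zero_()

# detach() cuts the graph: no gradient will flow through x_const.
x_const = (2.0 * w).detach()
with torch.no_grad():          # no graph is recorded inside this block
    y = w @ x_const
```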

Overparameterised Systems

  • Neural networks are highly overparameterised.
  • If we could examine their Hessian at an “optimum” we would find:
    • Very low (or negative) eigenvalues.
    • The error function is not sensitive to changes in many parameters.
    • This implies the parameters are badly determined.

Whence Generalisation?

  • There is not enough regularisation in our objective functions to explain it.
  • Neural network models are not using traditional generalisation approaches.
  • The ability of these models to generalise must somehow come from the optimisation algorithm.
  • How to explain and control this is perhaps the most interesting theoretical question for neural networks.

Double Descent

Neural Tangent Kernel

  • Consider very wide neural networks.
  • Consider particular initialisation.
  • Deep neural network is regularising with a particular kernel.
  • This is known as the neural tangent kernel (Jacot et al., 2018).

Regularization in Optimization

  • Gradient flow methods allow us to study nature of optima.
  • In particular systems, with given initialisations, we can show L1 and L2 norms are minimised.
  • In other cases the rank of \(\mathbf{W}\) is minimised.
  • Questions remain over the nature of this regularisation in neural networks.

Deep Linear Models

\[ f(\mathbf{ x}; \mathbf{W}) = \mathbf{W}_4 \mathbf{W}_3 \mathbf{W}_2 \mathbf{W}_1 \mathbf{ x}. \]

\[ \mathbf{W}= \mathbf{W}_4 \mathbf{W}_3 \mathbf{W}_2 \mathbf{W}_1 \]

Practical Optimisation Notes

  • Initialisation: variance-preserving (He, Xavier); scale affects signal propagation
  • Optimisers: SGD(+momentum), Adam/AdamW; decoupled weight decay often preferable
  • Learning rate schedules: cosine, step, warmup; LR is the most sensitive hyperparameter
  • Batch size: affects gradient noise and implicit regularisation; tune with LR
  • Normalisation: BatchNorm/LayerNorm stabilise optimisation
  • Regularisation: data augmentation, dropout, weight decay; early stopping

Further Reading

  • Baydin et al. (2018) Automatic differentiation in ML (JMLR)
  • Jacot, Gabriel, Hongler (2018) Neural Tangent Kernel
  • Arora et al. (2019) On exact dynamics of deep linear networks
  • Keskar et al. (2017) Large-batch training and sharp minima
  • Novak et al. (2018) Sensitivity and generalisation in neural networks
  • JAX Autodiff Cookbook (JVPs/VJPs, jacfwd/jacrev)
  • PyTorch Autograd Docs (computational graph, grad_fn, backward)

Example: calculating a Hessian

\[ H(\mathbf{ w}) = \frac{\text{d}^2}{\text{d}\mathbf{ w}\text{d}\mathbf{ w}^\top} L(\mathbf{ w}) := \frac{\text{d}}{\text{d}\mathbf{ w}} \mathbf{g}(\mathbf{ w}) \]

Example: Hessian-vector product

\[ \mathbf{v}^\top H(\mathbf{ w}) = \frac{\text{d}}{\text{d}\mathbf{ w}} \left( \mathbf{v}^\top \mathbf{g}(\mathbf{ w}) \right) \]
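A sketch of this trick in PyTorch via double backward; the objective `L` is an arbitrary smooth test function, and the full Hessian is formed only to check the result. Since the Hessian is symmetric, \(H(\mathbf{ w})\mathbf{v}\) carries the same information as \(\mathbf{v}^\top H(\mathbf{ w})\).

```python
import torch

w = torch.randn(5, dtype=torch.double, requires_grad=True)
v = torch.randn(5, dtype=torch.double)

def L(w):
    return torch.sum(torch.sin(w) * w ** 2)   # arbitrary smooth test objective

# Gradient g(w) with the graph kept, then differentiate v^T g(w) to get H(w) v.
g = torch.autograd.grad(L(w), w, create_graph=True)[0]
hvp = torch.autograd.grad(g @ v, w)[0]

# Check against the explicit Hessian (only feasible for small problems).
H = torch.autograd.functional.hessian(L, w.detach())
assert torch.allclose(hvp, H @ v)
print(hvp)
```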

Further Reading

  • Chapter 8 of Bishop and Bishop (2024), Deep Learning: Foundations and Concepts

Thanks!

References

Belkin, M., Hsu, D., Ma, S., Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 116, 15849–15854.
Brookes, M., 2005. The matrix reference manual.
Jacot, A., Gabriel, F., Hongler, C., 2018. Neural tangent kernel: Convergence and generalization in neural networks, in: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (Eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., pp. 8571–8580.