Convolutional Neural Networks
Neil D. Lawrence
2025-09-30
From Deep Networks to CNNs
Review : Deep networks, chain rule
Today : How convolutional networks exploit these concepts
Chain Rule for Layered CNN Architecture
CNN Chain Rule Overview
Layered Architecture : Each layer implements forward(), backward(), parameters
Spatial Operations : Convolution, pooling, flattening require specialized gradients
Composition : CNN built by composing ConvolutionalLayer, MaxPoolingLayer, FlattenLayer
Gradient Flow : Each layer computes its own gradients independently
Verification : Finite difference testing ensures mathematical correctness
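The layer interface referred to above can be captured in a few lines of Python. This is a minimal sketch of the assumed contract, not the exact course implementation; only the method names forward(), backward() and parameters() come from the slide.

```python
class Layer:
    """Minimal layer interface assumed throughout these slides (illustrative sketch)."""

    def forward(self, X):
        """Compute the layer output for input X, caching whatever backward will need."""
        raise NotImplementedError

    def backward(self, grad_output):
        """Given dL/d(output), store parameter gradients and return dL/d(input)."""
        raise NotImplementedError

    def parameters(self):
        """Return (parameter, gradient) pairs; empty for parameter-free layers."""
        return []
```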
CNN Layer Types and Their Gradients
ConvolutionalLayer : \(\frac{\partial L}{\partial \mathbf{X}}\) , \(\frac{\partial L}{\partial \filters}\) , \(\frac{\partial L}{\partial \biases}\)
MaxPoolingLayer : \(\frac{\partial L}{\partial \mathbf{X}}\) (no parameters)
FlattenLayer : \(\frac{\partial L}{\partial \mathbf{X}}\) (no parameters)
FullyConnectedLayer : Standard neural network gradients
LinearLayer : Linear transformation gradients
Convolutional Layer Forward Pass
Input : \(\mathbf{X}\) of shape \((B, C_{in}, H, W)\)
Filters : \(\filters\) of shape \((C_{out}, C_{in}, K_h, K_w)\)
Output : \(\outputMatrix\) of shape \((B, C_{out}, H_{out}, W_{out})\)
Operation : \(\outputMatrix[b,c,h,w] = \sum_{i,j,k} \mathbf{X}[b,k,h+i,w+j] \cdot \filters[c,k,i,j] + \biases[c]\)
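A direct translation of this operation into numpy, written as explicit loops for readability (stride 1, no padding), might look as follows; the function name conv_forward is illustrative rather than the course's own.

```python
import numpy as np

def conv_forward(X, filters, biases):
    """Naive cross-correlation: X is (B, C_in, H, W), filters is (C_out, C_in, K_h, K_w)."""
    B, C_in, H, W = X.shape
    C_out, _, K_h, K_w = filters.shape
    H_out, W_out = H - K_h + 1, W - K_w + 1          # stride 1, no padding
    Y = np.zeros((B, C_out, H_out, W_out))
    for b in range(B):
        for c in range(C_out):
            for h in range(H_out):
                for w in range(W_out):
                    patch = X[b, :, h:h + K_h, w:w + K_w]
                    Y[b, c, h, w] = np.sum(patch * filters[c]) + biases[c]
    return Y
```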
Convolutional Layer Backward Pass
Output gradient : \(\frac{\partial L}{\partial \outputMatrix}\) from next layer
Filter gradient : \(\frac{\partial L}{\partial \filters[c,k,i,j]} = \sum_{b,h,w} \frac{\partial L}{\partial \outputMatrix[b,c,h,w]} \cdot \mathbf{X}[b,k,h+i,w+j]\)
Bias gradient : \(\frac{\partial L}{\partial \biases[c]} = \sum_{b,h,w} \frac{\partial L}{\partial \outputMatrix[b,c,h,w]}\)
Input gradient : \(\frac{\partial L}{\partial \mathbf{X}[b,k,h,w]} = \sum_{c,i,j} \frac{\partial L}{\partial \outputMatrix[b,c,h-i,w-j]} \cdot \filters[c,k,i,j]\)
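All three gradients can be accumulated in the same loop structure as the forward pass. The sketch below mirrors the naive conv_forward above (stride 1, no padding) and favours clarity over speed.

```python
import numpy as np

def conv_backward(X, filters, grad_Y):
    """Gradients of the naive convolution above: returns dL/dX, dL/dfilters, dL/dbiases."""
    B, C_in, H, W = X.shape
    C_out, _, K_h, K_w = filters.shape
    _, _, H_out, W_out = grad_Y.shape
    grad_X = np.zeros_like(X)
    grad_filters = np.zeros_like(filters)
    grad_biases = grad_Y.sum(axis=(0, 2, 3))          # sum over batch and spatial positions
    for b in range(B):
        for c in range(C_out):
            for h in range(H_out):
                for w in range(W_out):
                    g = grad_Y[b, c, h, w]
                    grad_filters[c] += g * X[b, :, h:h + K_h, w:w + K_w]
                    grad_X[b, :, h:h + K_h, w:w + K_w] += g * filters[c]
    return grad_X, grad_filters, grad_biases
```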
Max Pooling Layer Gradients
Forward : \(\outputMatrix[b,c,h,w] = \max_{i,j \in \poolRegion} \mathbf{X}[b,c,h \cdot \stride + i, w \cdot \stride + j]\)
Backward : \(\frac{\partial L}{\partial \mathbf{X}[b,c,h,w]} = \begin{cases} \frac{\partial L}{\partial \outputMatrix[b,c,h_{out},w_{out}]} & \text{if } (h,w) \text{ was the max in pool region} \\ 0 & \text{otherwise} \end{cases}\)
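One common way to implement this sparse routing is to record, during the forward pass, which input position achieved each window's maximum. A minimal sketch with non-overlapping windows (pool size equal to stride) follows; the function names are illustrative.

```python
import numpy as np

def max_pool_forward(X, pool=2):
    """Non-overlapping max pooling; also returns a mask marking where each max came from."""
    B, C, H, W = X.shape
    H_out, W_out = H // pool, W // pool
    Y = np.zeros((B, C, H_out, W_out))
    mask = np.zeros_like(X, dtype=bool)
    for h in range(H_out):
        for w in range(W_out):
            window = X[:, :, h * pool:(h + 1) * pool, w * pool:(w + 1) * pool]
            Y[:, :, h, w] = window.max(axis=(2, 3))
            # Record the (first) maximal position in each window for the backward pass.
            flat_idx = window.reshape(B, C, -1).argmax(axis=2)
            hi, wi = np.unravel_index(flat_idx, (pool, pool))
            for b in range(B):
                for c in range(C):
                    mask[b, c, h * pool + hi[b, c], w * pool + wi[b, c]] = True
    return Y, mask

def max_pool_backward(grad_Y, mask, pool=2):
    """Route dL/dY to the input positions that achieved the max; zeros everywhere else."""
    upsampled = np.repeat(np.repeat(grad_Y, pool, axis=2), pool, axis=3)
    return upsampled * mask
```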
Flatten Layer Gradients
Forward : \(\outputMatrix = \mathbf{X}.reshape(B, -1)\)
Backward : \(\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \outputMatrix}.reshape(\inputShape)\)
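Since flattening is only a reshape, the backward pass is the inverse reshape. A short sketch of such a layer, following the interface from the earlier slide, is shown below.

```python
class FlattenLayer:
    """Reshape (B, C, H, W) activations to (B, C*H*W) and undo the reshape on the way back."""

    def forward(self, X):
        self.input_shape = X.shape              # remember the spatial shape for backward
        return X.reshape(X.shape[0], -1)

    def backward(self, grad_output):
        return grad_output.reshape(self.input_shape)

    def parameters(self):
        return []                               # no parameters to update
```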
CNN Chain Rule Implementation
Layer-wise : Each layer computes its own gradients independently
Composition : LayeredNeuralNetwork coordinates gradient flow between layers
Spatial awareness : Gradients preserve spatial structure through convolution
Parameter updates : Each layer manages its own parameter gradients
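A sketch of how a LayeredNeuralNetwork might coordinate this flow: the forward pass runs left to right, the backward pass runs right to left, and each layer handles its own parameter gradients. The details are illustrative rather than the exact course code.

```python
class LayeredNeuralNetwork:
    """Composition sketch: forward runs left to right, backward right to left."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def backward(self, grad_output):
        # Chain rule: each layer turns dL/d(output) into dL/d(input),
        # storing its own parameter gradients along the way.
        for layer in reversed(self.layers):
            grad_output = layer.backward(grad_output)
        return grad_output
```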
Multi-Path Gradient Flow in CNNs
Convolutional path : \(\mathbf{X}\rightarrow \convOutput \rightarrow \poolOutput \rightarrow \flattenOutput\)
Parameter paths : \(\filters \rightarrow \convOutput\) , \(\biases \rightarrow \convOutput\)
Spatial paths : Each spatial location has independent gradient flow
Channel paths : Each output channel has independent gradient computation
Activation Function Integration
ReLU in convolution : \(\frac{\partial L}{\partial \convOutput} = \frac{\partial L}{\partial \activationOutput} \odot \frac{\partial \phi}{\partial \convOutput}\)
ReLU gradient : \(\frac{\partial \reluFunction}{\partial x} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}\)
Spatial activation : Applied element-wise across all spatial locations
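Because the activation and its derivative act element-wise, both directions are one-liners in numpy. A small sketch, using Z for the pre-activation tensor:

```python
import numpy as np

def relu_forward(Z):
    """Element-wise ReLU applied across all batch, channel and spatial dimensions."""
    return np.maximum(Z, 0.0)

def relu_backward(Z, grad_A):
    """Multiply the incoming gradient by the ReLU derivative (1 where Z > 0, else 0)."""
    return grad_A * (Z > 0)
```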
CNN Gradient Verification
Finite differences : Compare analytical vs numerical gradients
Spatial testing : Verify gradients at different spatial locations
Channel testing : Verify gradients for different output channels
End-to-end testing : Verify complete CNN gradient flow
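A generic central-difference check can be written once and reused for every layer and parameter. The helper below is an illustrative sketch; its name and signature may differ from the course's finite_difference_gradient.

```python
import numpy as np

def finite_difference_check(f, X, grad_analytic, eps=1e-6):
    """Compare an analytical gradient of the scalar function f(X) with central differences."""
    grad_numeric = np.zeros_like(X)
    it = np.nditer(X, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        original = X[idx]
        X[idx] = original + eps
        f_plus = f(X)
        X[idx] = original - eps
        f_minus = f(X)
        X[idx] = original                        # restore the perturbed entry
        grad_numeric[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return np.max(np.abs(grad_numeric - grad_analytic))
```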
Layered CNN Architecture Benefits
Modularity : Each layer type has clear gradient responsibilities
Testability : Each layer can be tested independently
Composability : Layers can be combined in any order
Educational : Clear separation makes learning easier
Maintainability : Easy to add new layer types
Complete CNN Gradient Flow
Input : \(\mathbf{X}\) (batch of images)
Convolution : \(\convOutput = \reluFunction(\convolution(\mathbf{X}, \filters) + \biases)\)
Pooling : \(\poolOutput = \maxPool(\convOutput)\)
Flatten : \(\flattenOutput = \flatten(\poolOutput)\)
Dense : \(\denseOutput = \reluFunction(\flattenOutput \cdot \denseWeights + \denseBiases)\)
Output : \(\outputMatrix = \denseOutput \cdot \outputWeights + \outputBiases\)
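Putting the pieces together, the forward pass above can be sketched as a single function. It reuses conv_forward, relu_forward and max_pool_forward from the earlier sketches, and all argument names are illustrative.

```python
def cnn_forward(X, filters, biases, W_dense, b_dense, W_out, b_out, pool=2):
    """Forward pass of the CNN on this slide, reusing the earlier sketch functions."""
    H1 = relu_forward(conv_forward(X, filters, biases))   # convolution + ReLU
    P1, _ = max_pool_forward(H1, pool)                    # max pooling
    F1 = P1.reshape(P1.shape[0], -1)                      # flatten
    D1 = relu_forward(F1 @ W_dense + b_dense)             # fully connected + ReLU
    return D1 @ W_out + b_out                             # linear output layer
```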
CNN Gradient Chain Rule
Output to dense : \(\frac{\partial L}{\partial \denseOutput} = \frac{\partial L}{\partial \outputMatrix} \cdot \outputWeights^\top\)
Dense to flatten : \(\frac{\partial L}{\partial \flattenOutput} = \left(\frac{\partial L}{\partial \denseOutput} \odot \reluFunction'\right) \cdot \denseWeights^\top\) (element-wise ReLU derivative at the dense pre-activation)
Flatten to pool : \(\frac{\partial L}{\partial \poolOutput} = \frac{\partial L}{\partial \flattenOutput}.reshape(\poolShape)\)
Pool to conv : \(\frac{\partial L}{\partial \convOutput} = \maxPoolGradient(\frac{\partial L}{\partial \poolOutput})\)
Conv to input : \(\frac{\partial L}{\partial \mathbf{X}} = \convolutionGradient(\frac{\partial L}{\partial \convOutput}, \filters)\)
Parameter Gradients in CNNs
Filter gradients : \(\frac{\partial L}{\partial \filters[c,k,i,j]} = \sum_{b,h,w} \frac{\partial L}{\partial \convOutput[b,c,h,w]} \cdot \mathbf{X}[b,k,h+i,w+j]\)
Bias gradients : \(\frac{\partial L}{\partial \biases[c]} = \sum_{b,h,w} \frac{\partial L}{\partial \convOutput[b,c,h,w]}\)
Dense gradients : Standard neural network parameter gradients
Output gradients : Final layer parameter gradients
Implementation Verification
Layer testing : Each layer tested independently with finite differences
Composition testing : Complete CNN tested end-to-end
Spatial testing : Gradients verified at different spatial locations
Channel testing : Gradients verified for different output channels
Parameter testing : All parameter gradients verified numerically
Educational Benefits
Clear separation : Each layer type has distinct gradient responsibilities
Modular learning : Students can understand each component independently
Visual debugging : Easy to see where gradients flow and where they don’t
Mathematical rigor : All gradients verified with finite differences
Practical implementation : Code directly maps to mathematical theory
Summary
Layered design : Each CNN layer type has specialized gradient computation
Spatial awareness : Gradients preserve spatial structure through convolution
Modular testing : Each layer can be verified independently
Educational clarity : Clear separation of concerns makes learning easier
Mathematical rigor : All gradients verified with finite differences
Practical implementation : Code directly implements the mathematical theory
Code Mapping
ConvolutionalLayer : Implements spatial convolution with filter and bias gradients
MaxPoolingLayer : Implements max pooling with sparse gradient distribution
FlattenLayer : Implements spatial-to-vector conversion with shape preservation
LayeredNeuralNetwork : Coordinates gradient flow between all layer types
Gradient testing : Comprehensive finite difference verification
Verification with Our Implementation
Gradient testing : Use finite_difference_gradient to verify analytical gradients
Spatial verification : Check gradients at different spatial locations
Channel verification : Verify gradients for different output channels
End-to-end testing : Complete CNN gradient flow verification
Parameter testing : All parameter gradients verified numerically
Simple CNN Implementation
Explore Different Image Patterns
Create and Test Convolutional Layer
Build Complete CNN with Layered Architecture
Train CNN for Image Classification
CNN Training Progress for image classification
Visualise CNN Feature Maps
CNN feature maps showing how the network learns to detect different patterns
Test Different CNN Architectures
Benefits of the New CNN Architecture
Compare CNN with Traditional Neural Networks