Convolutional Neural Networks 
  Neil D. Lawrence
  2025-09-30 
 
From Deep Networks to CNNs 
- Review: Deep networks, chain rule
- Today: How convolutional networks exploit these concepts
 
Chain Rule for Layered CNN Architecture 
 
CNN Chain Rule Overview 
- Layered Architecture: Each layer implements forward(), backward(), parameters
- Spatial Operations: Convolution, pooling, flattening require specialized gradients
- Composition: CNN built by composing ConvolutionalLayer, MaxPoolingLayer, FlattenLayer
- Gradient Flow: Each layer computes its own gradients independently
- Verification: Finite difference testing ensures mathematical correctness
 
CNN Layer Types and Their Gradients 
- ConvolutionalLayer: \(\frac{\partial L}{\partial \mathbf{X}}\), \(\frac{\partial L}{\partial \filters}\), \(\frac{\partial L}{\partial \biases}\)
- MaxPoolingLayer: \(\frac{\partial L}{\partial \mathbf{X}}\) (no parameters)
- FlattenLayer: \(\frac{\partial L}{\partial \mathbf{X}}\) (no parameters)
- FullyConnectedLayer: Standard neural network gradients
- LinearLayer: Linear transformation gradients
 
Convolutional Layer Forward Pass 
- Input: \(\mathbf{X}\) of shape \((B, C_{in}, H, W)\)
- Filters: \(\filters\) of shape \((C_{out}, C_{in}, K_h, K_w)\)
- Output: \(\outputMatrix\) of shape \((B, C_{out}, H_{out}, W_{out})\)
- Operation: \(\outputMatrix[b,c,h,w] = \sum_{i,j,k} \mathbf{X}[b,k,h+i,w+j] \cdot \filters[c,k,i,j] + \biases[c]\)
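As a concrete reference for these shapes, here is a minimal NumPy sketch of the forward pass (direct loops, valid padding, stride 1; the function and variable names are illustrative, not the course implementation):

```python
import numpy as np

def conv_forward(X, filters, biases):
    """Naive valid convolution: X is (B, C_in, H, W), filters is
    (C_out, C_in, K_h, K_w), biases is (C_out,)."""
    B, C_in, H, W = X.shape
    C_out, _, K_h, K_w = filters.shape
    H_out, W_out = H - K_h + 1, W - K_w + 1
    Y = np.zeros((B, C_out, H_out, W_out))
    for b in range(B):
        for c in range(C_out):
            for h in range(H_out):
                for w in range(W_out):
                    patch = X[b, :, h:h + K_h, w:w + K_w]
                    Y[b, c, h, w] = np.sum(patch * filters[c]) + biases[c]
    return Y
```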
 
Convolutional Layer Backward Pass 
- Output gradient: \(\frac{\partial L}{\partial \outputMatrix}\) from next layer
- Filter gradient: \(\frac{\partial L}{\partial \filters[c,k,i,j]} = \sum_{b,h,w} \frac{\partial L}{\partial \outputMatrix[b,c,h,w]} \cdot \mathbf{X}[b,k,h+i,w+j]\)
- Bias gradient: \(\frac{\partial L}{\partial \biases[c]} = \sum_{b,h,w} \frac{\partial L}{\partial \outputMatrix[b,c,h,w]}\)
- Input gradient: \(\frac{\partial L}{\partial \mathbf{X}[b,k,h,w]} = \sum_{c,i,j} \frac{\partial L}{\partial \outputMatrix[b,c,h-i,w-j]} \cdot \filters[c,k,i,j]\)
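Continuing the sketch above, the backward pass implements exactly these three summations (again an illustrative sketch, not the course's ConvolutionalLayer):

```python
def conv_backward(X, filters, dY):
    """Gradients of the naive convolution above. dY is (B, C_out, H_out, W_out).
    Returns dX, dfilters, dbiases matching the summations on the slide."""
    B, C_in, H, W = X.shape
    C_out, _, K_h, K_w = filters.shape
    _, _, H_out, W_out = dY.shape
    dX = np.zeros_like(X)
    dfilters = np.zeros_like(filters)
    dbiases = dY.sum(axis=(0, 2, 3))  # sum over batch and spatial positions
    for b in range(B):
        for c in range(C_out):
            for h in range(H_out):
                for w in range(W_out):
                    g = dY[b, c, h, w]
                    dfilters[c] += g * X[b, :, h:h + K_h, w:w + K_w]
                    dX[b, :, h:h + K_h, w:w + K_w] += g * filters[c]
    return dX, dfilters, dbiases
```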
 
Max Pooling Layer Gradients 
- Forward: \(\outputMatrix[b,c,h,w] = \max_{i,j \in \poolRegion} \mathbf{X}[b,c,h \cdot \stride + i, w \cdot \stride + j]\)
- Backward: \(\frac{\partial L}{\partial \mathbf{X}[b,c,h,w]} = \begin{cases} \frac{\partial L}{\partial \outputMatrix[b,c,h_{out},w_{out}]} & \text{if } (h,w) \text{ was the max in pool region} \\ 0 & \text{otherwise} \end{cases}\)
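A matching sketch of max pooling: the forward pass caches the argmax positions so the backward pass can route each output gradient to the single input that produced the maximum (the pool size and stride defaults are assumptions):

```python
def maxpool_forward(X, pool=2, stride=2):
    """Max pooling over pool x pool regions; also returns argmax positions."""
    B, C, H, W = X.shape
    H_out, W_out = (H - pool) // stride + 1, (W - pool) // stride + 1
    Y = np.zeros((B, C, H_out, W_out))
    argmax = np.zeros((B, C, H_out, W_out, 2), dtype=int)
    for h in range(H_out):
        for w in range(W_out):
            region = X[:, :, h * stride:h * stride + pool, w * stride:w * stride + pool]
            flat = region.reshape(B, C, -1)
            idx = flat.argmax(axis=2)
            Y[:, :, h, w] = flat.max(axis=2)
            argmax[:, :, h, w, 0] = h * stride + idx // pool
            argmax[:, :, h, w, 1] = w * stride + idx % pool
    return Y, argmax

def maxpool_backward(dY, argmax, X_shape):
    """Route each output gradient back to the input position that was the max."""
    dX = np.zeros(X_shape)
    B, C, H_out, W_out = dY.shape
    for b in range(B):
        for c in range(C):
            for h in range(H_out):
                for w in range(W_out):
                    i, j = argmax[b, c, h, w]
                    dX[b, c, i, j] += dY[b, c, h, w]
    return dX
```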
 
Flatten Layer Gradients 
- Forward: \(\outputMatrix = \mathbf{X}.reshape(B, -1)\)
- Backward: \(\frac{\partial L}{\partial \mathbf{X}} = \frac{\partial L}{\partial \outputMatrix}.reshape(\inputShape)\)
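The flatten layer has no parameters; both directions are just a reshape. A minimal sketch:

```python
def flatten_forward(X):
    """Reshape (B, C, H, W) to (B, C*H*W), remembering the original shape."""
    return X.reshape(X.shape[0], -1), X.shape

def flatten_backward(dY, input_shape):
    """Backward pass is the inverse reshape; no values change."""
    return dY.reshape(input_shape)
```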
 
CNN Chain Rule Implementation 
- Layer-wise: Each layer computes its own gradients independently
- Composition: LayeredNeuralNetwork coordinates gradient flow between layers
- Spatial awareness: Gradients preserve spatial structure through convolution
- Parameter updates: Each layer manages its own parameter gradients
 
Multi-Path Gradient Flow in CNNs 
- Convolutional path: \(\mathbf{X} \rightarrow \convOutput \rightarrow \poolOutput \rightarrow \flattenOutput\)
- Parameter paths: \(\filters \rightarrow \convOutput\), \(\biases \rightarrow \convOutput\)
- Spatial paths: Each spatial location has independent gradient flow
- Channel paths: Each output channel has independent gradient computation
 
Activation Function Integration 
- ReLU in convolution: \(\frac{\partial L}{\partial \convOutput} = \frac{\partial L}{\partial \activationOutput} \odot \frac{\partial \phi}{\partial \convOutput}\)
- ReLU gradient: \(\frac{\partial \reluFunction}{\partial x} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}\)
- Spatial activation: Applied element-wise across all spatial locations
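A sketch of the element-wise gate this describes; the same two functions apply identically to convolutional and dense pre-activations:

```python
def relu_forward(Z):
    """Element-wise ReLU."""
    return np.maximum(Z, 0.0)

def relu_backward(dA, Z):
    """Pass the gradient only where the pre-activation was positive."""
    return dA * (Z > 0)
```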
 
CNN Gradient Verification 
- Finite differences: Compare analytical vs numerical gradients
- Spatial testing: Verify gradients at different spatial locations
- Channel testing: Verify gradients for different output channels
- End-to-end testing: Verify complete CNN gradient flow
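For example, the filter gradient of the convolution sketch above can be checked against central finite differences. The scalar loss, step size and test data here are arbitrary choices for illustration, not the course's finite_difference_gradient utility:

```python
def loss(Y):
    # Simple scalar loss so the gradient check is end to end: L = 0.5 * sum(Y^2).
    return 0.5 * np.sum(Y ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 8, 8))
filters = rng.standard_normal((4, 3, 3, 3))
biases = rng.standard_normal(4)

Y = conv_forward(X, filters, biases)
dX, dfilters, dbiases = conv_backward(X, filters, dY=Y)  # dL/dY = Y for this loss

eps = 1e-5
numerical = np.zeros_like(filters)
for idx in np.ndindex(filters.shape):
    perturbed = filters.copy()
    perturbed[idx] += eps
    plus = loss(conv_forward(X, perturbed, biases))
    perturbed[idx] -= 2 * eps
    minus = loss(conv_forward(X, perturbed, biases))
    numerical[idx] = (plus - minus) / (2 * eps)

print("max filter-gradient error:", np.max(np.abs(numerical - dfilters)))
```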
 
Layered CNN Architecture Benefits 
- Modularity: Each layer type has clear gradient responsibilities
- Testability: Each layer can be tested independently
- Composability: Layers can be combined in any order
- Educational: Clear separation makes learning easier
- Maintainability: Easy to add new layer types
 
Complete CNN Gradient Flow 
- Input: \(\mathbf{X}\) (batch of images)
- Convolution: \(\convOutput = \reluFunction(\convolution(\mathbf{X}, \filters) + \biases)\)
- Pooling: \(\poolOutput = \maxPool(\convOutput)\)
- Flatten: \(\flattenOutput = \flatten(\poolOutput)\)
- Dense: \(\denseOutput = \reluFunction(\flattenOutput \cdot \denseWeights + \denseBiases)\)
- Output: \(\outputMatrix = \denseOutput \cdot \outputWeights + \outputBiases\)
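Composing the sketches above gives the whole forward pass in a few lines (the dense-layer shapes are whatever the flattened feature size dictates; names are illustrative):

```python
def cnn_forward(X, filters, biases, W_dense, b_dense, W_out, b_out):
    """Forward pass matching the flow on the slide; returns the cache
    needed for the backward chain."""
    Z_conv = conv_forward(X, filters, biases)   # convolution + bias
    A_conv = relu_forward(Z_conv)               # element-wise ReLU
    A_pool, argmax = maxpool_forward(A_conv)    # spatial down-sampling
    A_flat, pool_shape = flatten_forward(A_pool)  # (B, features)
    Z_dense = A_flat @ W_dense + b_dense
    A_dense = relu_forward(Z_dense)
    Y = A_dense @ W_out + b_out                 # final linear output
    cache = (X, Z_conv, A_conv, argmax, pool_shape, A_flat, Z_dense, A_dense)
    return Y, cache
```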
 
CNN Gradient Chain Rule 
- Output to dense: \(\frac{\partial L}{\partial \denseOutput} = \frac{\partial L}{\partial \outputMatrix} \cdot \outputWeights^\top\)
- Dense to flatten: \(\frac{\partial L}{\partial \flattenOutput} = \frac{\partial L}{\partial \denseOutput} \cdot \denseWeights^\top\)
- Flatten to pool: \(\frac{\partial L}{\partial \poolOutput} = \frac{\partial L}{\partial \flattenOutput}.reshape(\poolShape)\)
- Pool to conv: \(\frac{\partial L}{\partial \convOutput} = \maxPoolGradient(\frac{\partial L}{\partial \poolOutput})\)
- Conv to input: \(\frac{\partial L}{\partial \mathbf{X}} = \convolutionGradient(\frac{\partial L}{\partial \convOutput}, \filters)\)
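The backward chain simply runs the same steps in reverse, each stage consuming the gradient produced by the one after it. A sketch matching the forward pass above:

```python
def cnn_backward(dY, cache, filters, W_dense, W_out):
    """Backward chain in reverse order of the forward pass above."""
    X, Z_conv, A_conv, argmax, pool_shape, A_flat, Z_dense, A_dense = cache
    # Output layer
    dW_out = A_dense.T @ dY
    db_out = dY.sum(axis=0)
    dA_dense = dY @ W_out.T
    # Dense layer (through its ReLU)
    dZ_dense = relu_backward(dA_dense, Z_dense)
    dW_dense = A_flat.T @ dZ_dense
    db_dense = dZ_dense.sum(axis=0)
    dA_flat = dZ_dense @ W_dense.T
    # Flatten -> pooling -> convolution
    dA_pool = flatten_backward(dA_flat, pool_shape)
    dA_conv = maxpool_backward(dA_pool, argmax, A_conv.shape)
    dZ_conv = relu_backward(dA_conv, Z_conv)
    dX, dfilters, dbiases = conv_backward(X, filters, dZ_conv)
    return dX, dfilters, dbiases, dW_dense, db_dense, dW_out, db_out
```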
 
Parameter Gradients in CNNs 
- Filter gradients: \(\frac{\partial L}{\partial \filters[c,k,i,j]} = \sum_{b,h,w} \frac{\partial L}{\partial \convOutput[b,c,h,w]} \cdot \mathbf{X}[b,k,h+i,w+j]\)
- Bias gradients: \(\frac{\partial L}{\partial \biases[c]} = \sum_{b,h,w} \frac{\partial L}{\partial \convOutput[b,c,h,w]}\)
- Dense gradients: Standard neural network parameter gradients
- Output gradients: Final layer parameter gradients
 
Implementation Verification 
- Layer testing: Each layer tested independently with finite differences
- Composition testing: Complete CNN tested end-to-end
- Spatial testing: Gradients verified at different spatial locations
- Channel testing: Gradients verified for different output channels
- Parameter testing: All parameter gradients verified numerically
 
Educational Benefits 
- Clear separation: Each layer type has distinct gradient responsibilities
- Modular learning: Students can understand each component independently
- Visual debugging: Easy to see where gradients flow and where they don’t
- Mathematical rigor: All gradients verified with finite differences
- Practical implementation: Code directly maps to mathematical theory
 
Summary 
- Layered design: Each CNN layer type has specialized gradient computation
- Spatial awareness: Gradients preserve spatial structure through convolution
- Modular testing: Each layer can be verified independently
- Educational clarity: Clear separation of concerns makes learning easier
- Mathematical rigor: All gradients verified with finite differences
- Practical implementation: Code directly implements the mathematical theory
 
Code Mapping 
- ConvolutionalLayer: Implements spatial convolution with filter and bias gradients
- MaxPoolingLayer: Implements max pooling with sparse gradient distribution
- FlattenLayer: Implements spatial-to-vector conversion with shape preservation
- LayeredNeuralNetwork: Coordinates gradient flow between all layer types
- Gradient testing: Comprehensive finite difference verification
 
Verification with Our Implementation 
- Gradient testing: Use finite_difference_gradient to verify analytical gradients
- Spatial verification: Check gradients at different spatial locations
- Channel verification: Verify gradients for different output channels
- End-to-end testing: Complete CNN gradient flow verification
- Parameter testing: All parameter gradients verified numerically
 
Simple CNN Implementation 
 
Explore Different Image Patterns 
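The notebook cell for this slide is not reproduced here; the following is a plausible sketch of the kind of synthetic patterns such a cell might explore (simple edges and a cross that small filters can distinguish):

```python
def make_patterns(size=8):
    """Simple synthetic binary patterns a small CNN can be asked to distinguish."""
    vertical = np.zeros((size, size)); vertical[:, size // 2] = 1.0
    horizontal = np.zeros((size, size)); horizontal[size // 2, :] = 1.0
    cross = np.clip(vertical + horizontal, 0, 1)
    return {"vertical": vertical, "horizontal": horizontal, "cross": cross}

patterns = make_patterns()
for name, img in patterns.items():
    print(name)
    print(img.astype(int))
```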
 
Create and Test Convolutional Layer 
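Again the original cell is not shown; the sketch below wraps the earlier forward/backward functions in a small class so it can be created and tested in the spirit of the course's ConvolutionalLayer (the class name, constructor arguments and initialisation are assumptions):

```python
class SimpleConvLayer:
    """Illustrative stand-in for ConvolutionalLayer, built from the functions
    sketched above; the real class interface may differ."""
    def __init__(self, in_channels, out_channels, kernel_size, rng=None):
        rng = rng or np.random.default_rng(0)
        scale = 1.0 / np.sqrt(in_channels * kernel_size * kernel_size)
        self.filters = scale * rng.standard_normal(
            (out_channels, in_channels, kernel_size, kernel_size))
        self.biases = np.zeros(out_channels)

    def forward(self, X):
        self.X = X                                   # cache input for backward
        return conv_forward(X, self.filters, self.biases)

    def backward(self, dY):
        dX, self.dfilters, self.dbiases = conv_backward(self.X, self.filters, dY)
        return dX

layer = SimpleConvLayer(in_channels=1, out_channels=2, kernel_size=3)
X = patterns["cross"][None, None, :, :]              # shape (1, 1, 8, 8)
Y = layer.forward(X)
print(Y.shape)                                        # (1, 2, 6, 6)
```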
 
Build Complete CNN with Layered Architecture 
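A sketch of the composition pattern this heading refers to: the network only needs to chain forward calls in order and backward calls in reverse. Pooling, flatten and dense layers would be wrapped in the same forward/backward interface; the class name and interface details here are assumed, not the course's LayeredNeuralNetwork:

```python
class TinyLayeredCNN:
    """Illustrative layer composition: chain forward calls, reverse for backward."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

    def backward(self, dY):
        for layer in reversed(self.layers):
            dY = layer.backward(dY)
        return dY
```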
 
Train CNN for Image Classification 
CNN Training Progress for image classification
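A minimal gradient-descent loop built from the NumPy sketches above, standing in for the notebook's training cell (the toy data, target rule, learning rate and architecture sizes are all arbitrary choices for illustration):

```python
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 1, 8, 8))                    # toy "images"
labels = (X[:, 0, :, 4].sum(axis=1) > 0).astype(float)    # arbitrary target rule

filters = 0.1 * rng.standard_normal((2, 1, 3, 3)); biases = np.zeros(2)
W_dense = 0.1 * rng.standard_normal((2 * 3 * 3, 8)); b_dense = np.zeros(8)
W_out = 0.1 * rng.standard_normal((8, 1)); b_out = np.zeros(1)
lr = 0.05

for epoch in range(50):
    Y, cache = cnn_forward(X, filters, biases, W_dense, b_dense, W_out, b_out)
    err = Y[:, 0] - labels
    loss_value = 0.5 * np.mean(err ** 2)
    dY = (err / len(err))[:, None]                         # gradient of the mean squared error
    grads = cnn_backward(dY, cache, filters, W_dense, W_out)
    dX, dfilters, dbiases, dW_dense, db_dense, dW_out, db_out = grads
    filters -= lr * dfilters; biases -= lr * dbiases
    W_dense -= lr * dW_dense; b_dense -= lr * db_dense
    W_out -= lr * dW_out; b_out -= lr * db_out
    if epoch % 10 == 0:
        print(f"epoch {epoch}: loss {loss_value:.4f}")
```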
 
 
Visualise CNN Feature Maps 
CNN feature maps showing how the network learns to detect different patterns
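A possible visualisation cell, assuming matplotlib and the toy network from the previous sketch (the real notebook figure may differ):

```python
import matplotlib.pyplot as plt

# Show the post-ReLU feature maps of the toy filters for a single input image.
Y, cache = cnn_forward(X[:1], filters, biases, W_dense, b_dense, W_out, b_out)
A_conv = cache[2]                                          # activations after conv + ReLU

fig, axes = plt.subplots(1, 1 + A_conv.shape[1], figsize=(8, 3))
axes[0].imshow(X[0, 0], cmap="gray"); axes[0].set_title("input")
for c in range(A_conv.shape[1]):
    axes[1 + c].imshow(A_conv[0, c], cmap="viridis")
    axes[1 + c].set_title(f"feature map {c}")
for ax in axes:
    ax.axis("off")
plt.tight_layout()
plt.show()
```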
 
 
Test Different CNN Architectures 
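One way to compare configurations is to vary the number of filters and the kernel size and inspect parameter counts and pooled feature shapes; the specific choices below are illustrative only:

```python
# Compare a few filter/kernel configurations on one toy image.
for n_filters, kernel in [(2, 3), (4, 3), (4, 5)]:
    f = 0.1 * np.random.default_rng(0).standard_normal((n_filters, 1, kernel, kernel))
    b = np.zeros(n_filters)
    pooled = maxpool_forward(relu_forward(conv_forward(X[:1], f, b)))[0]
    n_params = f.size + b.size
    print(f"{n_filters} filters, {kernel}x{kernel} kernel -> "
          f"conv parameters: {n_params}, pooled feature shape: {pooled.shape[1:]}")
```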
 
Benefits of the New CNN Architecture 
 
Compare CNN with Traditional Neural Networks