Week 3: Hardware Ecosystem

Nic Lane

Abstract:

This lecture will look at the changes in hardware that made neural networks efficient to train and run, and at how neural network models are deployed on hardware.

Plan for the Day

Hardware at Deep Learning's birth

New York Times (1958)

ENIAC, 1950s SoTA hardware

How did we get here? Deep Learning requires petaFLOPS

0.01 PFLOP (left) = \(10^{13}\) FLOPS (right)

Credit: Our World in Data

Plan for the Day

Internal Organisation of Processors

Central Processing Unit (CPU)

  • General-purpose processor (in use since mid-1950s)
  • A CPU is composed of cores, each of which can run several hardware threads (via simultaneous multithreading).
  • Example high-end performance:
    • AMD Ryzen 9 5950X
    • No. Cores:    16
    • No. Threads:   32
    • Clock speed:   3.4GHz, boost up to 4.9GHz
    • L2 cache:     8 MB
    • L3 cache:     64 MB
import torch
from custom_imports import *

# Instantiate the model and keep it on the CPU
our_custom_net = BasicFCModel()
our_custom_net.cpu()
# OR, equivalently, via an explicit device object
device = torch.device('cpu')
our_custom_net.to(device)

Graphics Processing Unit

  • Parallelism-exploiting accelerator
  • Originally designed for graphics processing (in use since the 1970s)
  • A GPU runs a very large number of threads, organised into blocks that are scheduled onto its cores (streaming multiprocessors)
  • Example high-end performance:
    • NVIDIA GeForce RTX 3090
    • No. CUDA cores (threads):   10496
    • Clock speed:   1.4GHz, boost up to 1.7GHz
    • Memory:     24 GB GDDR6X
if torch.cuda.is_available():
    our_custom_net.cuda()
    # OR
    device = torch.device('cuda:0')
    our_custom_net.to(device)
# Remember to do the same for all inputs to the network

Graphics Processing Unit

  • Registers (per thread)
    • Hold the automatic variables declared in a kernel function
    • Lowest latency, highest bandwidth
  • Local Memory (per thread)
    • Holds per-thread variables that do not fit in registers
  • Shared Memory (per thread block)
    • Shared by all threads in a block; faster than local and global memory
    • Used for inter-thread communication within a block
    • Physically shared with the L1 cache
  • Constant Memory
    • Per-device, read-only memory
  • Texture Memory
    • Per SM, read-only cache, optimized for 2D spatial locality
  • Global Memory
    • The device's main (DRAM) memory: largest capacity, highest latency, accessible by all threads
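
The device-level numbers behind this hierarchy can be inspected from Python. A minimal sketch using the device properties PyTorch exposes (the printed fields and formatting are illustrative):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                          # GPU model
    print(props.multi_processor_count)         # number of streaming multiprocessors (SMs)
    print(f'{props.total_memory / 2**30:.1f} GiB global memory')  # device (global) memory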

A typical organisation of a DL system

  • Processors
    • CPU sits at the centre of the system
  • Accelerators
    • GPUs, TPUs, Eyeriss, and other specialised accelerators
    • Specialised hardware can be designed specifically to exploit parallelism
  • Memory hierarchy
    • Caches - smallest and fastest
    • Random Access Memory (RAM) - larger but slower than caches
    • Disk / SSD - storage
      • Stores the dataset; when RAM runs out it supplements it via swap space
      • Bandwidth can be a serious bottleneck
    • System, memory, and I/O buses
      • Closer to processor - faster
      • Designed to transport fixed-size data chunks
      • Word size is a key system parameter: 4 bytes (32-bit) or 8 bytes (64-bit); see the sketch after this list
    • Auxiliary hardware
      • Mouse, keyboard, display
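
As a quick illustration of word and element sizes, the short sketch below (illustrative, not lecture code) prints the machine's pointer/word size and the per-element storage of a few tensor dtypes:

import struct
import torch

# Pointer size in bytes: 8 on a 64-bit system, 4 on a 32-bit one
word_size = struct.calcsize('P')
print(f'Word size: {word_size} bytes ({word_size * 8}-bit system)')

# Per-element storage of common tensor dtypes moved across the memory buses
for dtype in (torch.float64, torch.float32, torch.float16, torch.int8):
    print(f'{dtype}: {torch.empty(0, dtype=dtype).element_size()} bytes per element')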

Data Movement & Parallelism

Memory and bandwidth: memory hierarchy

Memory and bandwidth: data movement

Processor comparison based on memory and bandwidth

The case for parallelism - Moore's law is slowing down

Credit: Our World in Data

Plan for the Day

The case for parallelism - Moore’s law is slowing down

Credit: Karl Rupp

Processor comparison based on parallelism

print("CPU matrix multiplication")
a, b = [torch.rand(2**10, 2**10) for _ in range(2)]
start = time()
a * b
print(f'CPU took {time() - start} seconds')
print("GPU matrix multiplication")
start = time()
a * b
print(f'GPU took {time() - start} seconds')
CPU matrix multiplication
CPU took 0.0005156993865966797 seconds
GPU matrix multiplication
GPU took 0.0002989768981933594 seconds

Plan for the Day

Parallelism in Deep Learning training

DL parallelism: parallelize backprop through an example

DL parallelism: parallelize gradient sample computation

DL parallelism: parallelize update iterations

DL parallelism: parallelize the training of multiple models

Leveraging Deep Learning parallelism

CPU training

The most a CPU can do for this setup is to:

CPU training

Consequently:

print("CPU training code")
print("CPU training of the above-defined model short example of how long it takes")
our_custom_net.cpu()
start = time()
train(lenet, MNIST_trainloader)
print(f'CPU took {time()-start:.2f} seconds')
CPU training code
CPU training of the above-defined model: a short example of how long it takes
Epoch 1, iter 469, loss 1.980: : 469it [00:02, 181.77it/s]
Epoch 2, iter 469, loss 0.932: : 469it [00:02, 182.58it/s]
CPU took 5.22 seconds

GPU training

The GPU, on the other hand, can:

GPU training

Consequently:

print("GPU training")
print("GPU training of the same example as in CPU")
lenet.cuda()

batch_size = 512
gpu_trainloader = make_MNIST_loader(batch_size=batch_size)
start = time()
gpu_train(lenet, gpu_trainloader)
print(f'GPU took {time()-start:.2f} seconds')
GPU training
GPU training of the same example as in CPU
Epoch 1, iter 118, iter loss 0.786: : 118it [00:02, 52.62it/s]
Epoch 2, iter 118, iter loss 0.760: : 118it [00:02, 57.48it/s]
GPU took 4.37 seconds

GPU parallelism: matrix multiplication example

GPU: 12 thread blocks, each with 16 threads.

Naive implementation: each thread reads one row of A and one column of B and returns one element of C.

Shared memory implementation: each thread block computes one square sub-matrix of C.
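
To make the two mappings concrete, here is a small Python sketch (illustrative, not lecture code; the 16x16 tile size and helper names are assumptions) that computes one output element the "naive" way and one square sub-matrix the "shared-memory" way, mirroring what a single thread and a single thread block would each produce:

import torch

A = torch.rand(64, 64)
B = torch.rand(64, 64)

# Naive mapping: each thread reads one row of A and one column of B
# and produces a single element of C.
def naive_element(i, j):
    return A[i, :] @ B[:, j]

# Shared-memory mapping: each thread block produces one TILE x TILE
# sub-matrix of C, loading matching tiles of A and B into fast shared
# memory and accumulating their partial products.
TILE = 16
def tiled_block(bi, bj):
    acc = torch.zeros(TILE, TILE)
    for k in range(0, A.shape[1], TILE):
        a_tile = A[bi*TILE:(bi+1)*TILE, k:k+TILE]   # tile of A held in "shared memory"
        b_tile = B[k:k+TILE, bj*TILE:(bj+1)*TILE]   # tile of B held in "shared memory"
        acc += a_tile @ b_tile
    return acc

# Both mappings agree with the full matrix product
C = A @ B
assert torch.allclose(C[0, 0], naive_element(0, 0))
assert torch.allclose(C[:TILE, :TILE], tiled_block(0, 0), atol=1e-5)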

GPU parallelism: matrix multiplication example

Multi-GPU training

With multiple GPUs we can choose one of the following:

Multi-GPU training

print("multi-GPU training")
print("GPU training of the same example as in single GPU but with two GPUs")
our_custom_net_dp = lenet
our_custom_net_dp.cuda()
our_custom_net_dp = nn.DataParallel(our_custom_net_dp, device_ids=[0, 1])
batch_size = 1024
multigpu_trainloader = make_MNIST_loader(batch_size=batch_size)
start = time()
gpu_train(our_custom_net_dp, multigpu_trainloader)
print(f'2 GPUs took {time()-start:.2f} seconds')
multi-GPU training
GPU training of the same example as in single GPU but with two GPUs
Epoch 1, iter 59, iter loss 0.745: : 59it [00:02, 21.24it/s]
Epoch 2, iter 59, iter loss 0.736: : 59it [00:01, 31.70it/s]
2 GPUs took 4.72 seconds

Multi-Machine training

In principle the same options as in multi-GPU:

In practice we would typically take advantage of one of the latter two; in extreme cases one might combine multiple options.
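
For multi-machine (or multi-process) data-parallel training, PyTorch's DistributedDataParallel is the standard tool. A minimal sketch, assuming one process per GPU and a hypothetical train_one_epoch helper (the environment variables are normally set by a launcher such as torchrun):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_train(model, dataset):
    # RANK / WORLD_SIZE (and MASTER_ADDR / MASTER_PORT) are expected in the
    # environment, e.g. set by torchrun.
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    local_gpu = rank % torch.cuda.device_count()
    ddp_model = DDP(model.cuda(local_gpu), device_ids=[local_gpu])

    # Each process trains on a disjoint shard of the data; gradients are
    # averaged across processes after every backward pass.
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    train_one_epoch(ddp_model, loader)  # hypothetical training loop
    dist.destroy_process_group()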

Parallelism summary: model and data parallelism

Parallelism bottlenecks: Synchronization & Communication

Bottlenecks beyond parallelism

print("starving GPUs")
print("show in-code what starving GPU looks like")
# Deliberately slow down data flow into the gpu 
# Do you have any suggestions how to do this in a more realistic way than just to force waiting?
print('Using only 1 worker for the dataloader, the time the GPU takes increases.')
lenet.cuda()
batch_size = 64
gpu_trainloader = make_MNIST_loader(batch_size=batch_size, num_workers=1)
start = time()
gpu_train(lenet, gpu_trainloader)
print(f'GPU took {time()-start:.2f} seconds')
starving GPUs
show in-code what a starving GPU looks like
Using only 1 worker for the dataloader, the time the GPU takes increases.
Epoch 1, iter 938, iter loss 0.699: : 938it [00:04, 214.02it/s]
Epoch 2, iter 938, iter loss 0.619: : 938it [00:04, 208.96it/s]
GPU took 8.92 seconds

Plan for the Day

Deep Learning resource characterisation

print("profiling demo")
print("in-house DL training resource profiling code & output - based on the above model and training loop")
# for both of the below, produce one figure for inference and one for training
# MACs profiling - first slide; show as pie chart

lenet.cpu()
profile_ops(lenet, shape=(1,1,28,28))
profiling demo
in-house DL training resource profiling code & output - based on the above model and training loop
Operation                              OPS      
-------------------------------------  -------  
LeNet/Conv2d[conv1]/onnx::Conv         89856    
LeNet/ReLU[relu1]/onnx::Relu           6912     
LeNet/MaxPool2d[pool1]/onnx::MaxPool   2592     
LeNet/Conv2d[conv2]/onnx::Conv         154624   
LeNet/ReLU[relu2]/onnx::Relu           2048     
LeNet/MaxPool2d[pool2]/onnx::MaxPool   768      
LeNet/Linear[fc1]/onnx::Gemm           30720    
LeNet/ReLU[relu3]/onnx::Relu           240      
LeNet/Linear[fc2]/onnx::Gemm           7200     
LeNet/ReLU[relu4]/onnx::Relu           120      
LeNet/Linear[fc3]/onnx::Gemm           600      
LeNet/ReLU[relu5]/onnx::Relu           20       
------------------------------------   ------   
Input size: (1, 1, 28, 28)
295,700 FLOPs or approx. 0.00 GFLOPs
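
As a sanity check, these per-layer counts can be reproduced by hand. Assuming conv1 is Conv2d(1, 6, kernel_size=5) applied to the 28x28 input (consistent with the table), a back-of-the-envelope calculation gives:

# conv1: 6 filters of size 5x5x1 slid over a 28x28 single-channel image
out_h = out_w = 28 - 5 + 1            # 24x24 output, no padding
macs_per_output = 5 * 5 * 1           # one multiply-accumulate per kernel weight
ops_per_output = macs_per_output + 1  # plus the bias addition
conv1_ops = out_h * out_w * 6 * ops_per_output
print(conv1_ops)  # 89856, matching the profiler's count for LeNet/Conv2d[conv1]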

Deep Learning working set

print("working set profiling")
# compute the per-layer required memory:
# memory to load weights, to load inputs, and to save outputs
# visualize as a per-layer bar chart; each bar consists of three sections - inputs, outputs, weights

profile_layer_mem(lenet)
working set profiling
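
profile_layer_mem is the course helper used above; as a rough illustration of what such a per-layer working-set estimate involves, a hand-rolled version might look like this (the layer iteration order and the flatten point before the first Linear layer are assumptions about this LeNet implementation):

import torch

def layer_working_set(model, input_shape=(1, 1, 28, 28)):
    # Rough per-layer memory estimate: bytes needed for weights, inputs and outputs
    x = torch.zeros(input_shape)
    report = []
    for name, layer in model.cpu().named_children():
        if isinstance(layer, torch.nn.Linear) and x.dim() > 2:
            x = x.flatten(1)              # LeNet flattens activations before its first Linear layer
        weight_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())
        input_bytes = x.numel() * x.element_size()
        x = layer(x)                      # forward through this layer only
        output_bytes = x.numel() * x.element_size()
        report.append((name, weight_bytes, input_bytes, output_bytes))
    return report

for name, w, i, o in layer_working_set(lenet):
    print(f'{name}: weights {w} B, inputs {i} B, outputs {o} B')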

Working Set requirement exceeding RAM

print("exceeding RAM+Swap demo")
print("exceeding working set experiment - see the latency spike over a couple of bytes of working set")
# sample* the training speed of a model whose layer working sets just fit in memory
# bump up layer dimensions that are far from the RAM limit - see that the effect on latency is limited
# bump up the layer(s) that are at the RAM limit - observe the latency spike rapidly
# add profiling graphs for each of the cases, print out latency numbers.

# *train for an epoch or two, report the latency, and give a reasonable estimate of how long the full training would take (assuming X epochs)
estimate_training_for(LeNet, 1000)
exceeding RAM+Swap demo
exceeding working set experiment - see the latency spike over a couple of bytes of working set
Using 128 hidden nodes took 2.42 seconds,        training for 1000 epochs would take ~2423.7449169158936s
Using 256 hidden nodes took 2.31 seconds,        training for 1000 epochs would take ~2311.570882797241s
Using 512 hidden nodes took 2.38 seconds,        training for 1000 epochs would take ~2383.8846683502197s
Using 1024 hidden nodes took 2.56 seconds,        training for 1000 epochs would take ~2559.4213008880615s
Using 2048 hidden nodes took 3.10 seconds,        training for 1000 epochs would take ~3098.113536834717s
Using 4096 hidden nodes took 7.20 seconds,        training for 1000 epochs would take ~7196.521997451782s
Using 6144 hidden nodes took 13.21 seconds,        training for 1000 epochs would take ~13207.558155059814s

Working Set requirement exceeding RAM + Swap

print("OOM - massive images")
print("show in-code how this can hapen - say massive images; maybe show error message")
# How could we do this without affecting the recording process?
print('Loading too many images at once causes errors.')
lenet.cuda()
batch_size = 6000
gpu_trainloader = make_MNIST_loader(batch_size=batch_size, num_workers=1)
start = time()
gpu_train(lenet, gpu_trainloader)
print(f'GPU took {time()-start:.2f} seconds')
OOM - massive images
show in-code how this can happen - say massive images; maybe show error message
Loading too many images at once causes errors.
Epoch 1, iter 10, iter loss 0.596: : 10it [00:03,  2.78it/s]
Epoch 2, iter 2, iter loss 0.592: : 2it [00:01,  1.69it/s]
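
When the working set genuinely exceeds device memory, PyTorch raises a CUDA out-of-memory RuntimeError. A common pattern (an illustrative sketch, reusing the course helpers above) is to catch it and retry with a smaller batch:

import torch

batch_size = 6000
while batch_size >= 1:
    try:
        loader = make_MNIST_loader(batch_size=batch_size)
        gpu_train(lenet, loader)
        break
    except RuntimeError as err:      # CUDA OOM surfaces as a RuntimeError
        if 'out of memory' not in str(err):
            raise                    # re-raise anything that is not an OOM
        torch.cuda.empty_cache()     # release cached blocks before retrying
        batch_size //= 2             # retry with half the batch
        print(f'OOM - retrying with batch size {batch_size}')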

Mapping Deep Models to hardware: Systolic Arrays

Core principle
Systolic system matrix multiplication

Mapping Deep Models to hardware: weight, input, and output stationarity

Weight stationary design

Input stationary design

Output stationary design

Systolic array example: weight stationary Google Tensor Processing Unit (TPU)
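
To illustrate the weight-stationary idea in plain Python (a conceptual sketch, not how a TPU is actually programmed), the toy loop below keeps a small weight matrix fixed in the "array" while input vectors stream through it and partial sums accumulate:

import torch

W = torch.rand(4, 4)         # weights loaded once and held stationary in the array
inputs = torch.rand(8, 4)    # a stream of input vectors

outputs = []
for x in inputs:             # inputs flow through the stationary weights
    acc = torch.zeros(4)     # partial sums accumulate as data moves through the array
    for j in range(4):
        acc += W[:, j] * x[j]  # each "cell" multiplies its held weight by the passing input
    outputs.append(acc)

# The streamed result equals the ordinary matrix product
assert torch.allclose(torch.stack(outputs), inputs @ W.T, atol=1e-6)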

Plan for the Day

Deep Learning stack

Beyond hardware methods

Deep Learning and accelerator co-design

AlexNet: how GPU memory defined its architecture

print("profile AlexNet layers - show memory requirements")
print("per-layer profiling of AlexNet - connects to the preceding slide")
from torchvision.models import alexnet as net
anet = net()
profile_layer_alexnet(anet)
profile AlexNet layers - show memory requirements
per-layer profiling of AlexNet - connects to the preceding slide

The actual AlexNet architecture

AlexNet's architecture had to be split down the middle so that each half fit within the 3 GB memory limit of each of its two GPUs.

Beyond hardware methods

The Hardware and the Software Lotteries

The hardware and software lottery describes a piece of software or hardware succeeding not because of its universal superiority, but because of its fit with the broader hardware and software ecosystem.
ENIAC (1950s)
All-optical NN (2019)

Summary of the Day

Thank you for your attention!

Deep Learning resource characterisation

# memory requirements profiling - second slide; show as pie chart
# show the proportion of data required for inputs, parameters, and outputs

profile_mem(lenet)