Objective Functions and Gradient Descent

Neil Lawrence

Dedan Kimathi University, Nyeri, Kenya

Objective Function

  • On Monday we introduced ML and motivated the importance of probability.
  • Today we explore the idea of the ‘objective function.’

Introduction to Classification

Classification

  • We are given a data set containing ‘inputs,’ \(\mathbf{X}\) and ‘targets,’ \(\mathbf{ y}\).
  • Each data point consists of an input vector \(\mathbf{ x}_i\) and a class label, \(y_i\).
  • For binary classification assume \(y_i\) should be either \(1\) (yes) or \(-1\) (no).
  • Input vector can be thought of as features.

Discrete Probability

  • Algorithms are based on a prediction function and an objective function.
  • For regression the codomain of the function, \(f(\mathbf{X})\), was the real numbers (or sometimes real vectors).
  • In classification we are given an input vector, \(\mathbf{ x}\), and an associated label, \(y\) which either takes the value \(-1\) or \(1\).

Classification

  • Inputs, \(\mathbf{ x}\), mapped to a label, \(y\), through a function \(f(\cdot)\) dependent on parameters, \(\mathbf{ w}\), \[ y= f(\mathbf{ x}; \mathbf{ w}). \]
  • \(f(\cdot)\) is known as the prediction function.

Classification Examples

  • Classifying handwritten digits from binary images (automatic zip code reading)
  • Detecting faces in images (e.g. digital cameras).
  • Identifying who a detected face belongs to (e.g. Facebook, DeepFace)
  • Classifying the type of cancer given gene expression data.
  • Categorization of document types (different types of news article on the internet)

Hyperplane

  • Predict class label \(y_i\)
  • Using data features \(\mathbf{ x}_i\)
  • Through the prediction function \[ f(\mathbf{ x}_i) = \text{sign}\left(\mathbf{ w}^\top \mathbf{ x}_i + b\right) \]

Hyperplane

  • Boundary for classification given by hyperplane
  • Hyperplane defined by the normal vector \(\mathbf{ w}\). \[ \mathbf{ w}^\top \mathbf{ x}= -b \]

Toy Data

  • Red crosses (+ve) and green circles (-ve).

The Perceptron

  • Developed in 1957 by Rosenblatt.
  • Take a data point, \(\mathbf{ x}_i\).
  • Predict class \(y_i=1\) if \(\sum_j w_{j} x_{i, j} + b > 0\)
  • Otherwise predict \(y_i=-1\).

Mathematical Drawing of Decision Boundary

  • Decision boundary defined by hyperplane
  • Classification: \(\text{sign}(\mathbf{ x}^\top \mathbf{ w} + b)\)
  • Two features: \(w_1x_{i,1} + w_2x_{i,2} + b\)
  • Boundary where prediction switches from -1 to +1

Reminder: Equation of Plane

  • Plane equation: \(w_1 x_{i, 1} + w_2 x_{i, 2} + b = 0\)
  • Decision boundary where \(\text{sign}(\cdot)\) argument equals zero

Reminder: Equation of Plane

  • Rearrange to plot: \(x_2 = -\frac{(b+x_1w_1)}{w_2}\)
  • Separating hyperplane divides feature space

Perceptron Algorithm: Initialisation Maths

  • Set \(\mathbf{ w}\) using a randomly selected point \(i\) \[ \mathbf{ w}= y_i \mathbf{ x}_i. \]
  • Why? Consider \[ \text{sign}(\mathbf{ w}^\top\mathbf{ x}_i) \]

Perceptron Algorithm: Initialisation Maths

  • Setting \(\mathbf{ w}\) to \(y_i\mathbf{ x}_i\) implies \[ \text{sign}(\mathbf{ w}^\top\mathbf{ x}_i) = \text{sign}(y_i\mathbf{ x}_i^\top \mathbf{ x}_i) = y_i, \] since \(\mathbf{ x}_i^\top \mathbf{ x}_i > 0\), so the chosen point is correctly classified.

Drawing Decision Boundary

  • Plot \[ x_2 = -\frac{(b+x_1 w_1)}{w_2} \]

or specify \(x_2\) and compute \(x_1\) from it, \[ x_1 = -\frac{b + x_2w_2}{w_1} \]
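To see how the formulae are used, here is a short plotting sketch; the weights, bias, and plot limits are illustrative assumptions, not values from the lecture:

import numpy as np
import matplotlib.pyplot as plt

w = np.array([1.0, 2.0])    # assumed normal vector
b = 0.5                     # assumed bias
x1 = np.array([-2.0, 2.0])  # assumed plot limits for the first feature
x2 = -(b + x1*w[0])/w[1]    # first formula: rearranged plane equation
plt.plot(x1, x2, 'r-')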

Switching Formulae

  • Whether to use the first or the second formula
  • depends on how the hyperplane leaves the plot.

Code for Perceptron
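A minimal sketch in Python (numpy); the function names, learning rate, and bias handling are assumptions rather than the lecture's exact code:

import numpy as np

def init_perceptron(X, y, seed=42):
    # initialise with a randomly selected point: w = y_i x_i
    rng = np.random.default_rng(seed)
    i = rng.integers(X.shape[0])
    return y[i]*X[i, :].copy(), float(y[i])

def update_perceptron(w, b, X, y, i, learn_rate=0.1):
    # if point i is misclassified, move the boundary towards it
    if y[i]*(np.dot(w, X[i, :]) + b) <= 0:
        w = w + learn_rate*y[i]*X[i, :]
        b = b + learn_rate*y[i]
    return w, b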

Perceptron Reflection

  • Algorithm updates are intuitive and interpretable
  • What happens when classes aren’t linearly separable?

Perceptron Reflections

  • Non-convergence indicates the data are not linearly separable
  • Possible fix: anneal learning rate
  • Non-linear extensions: basis functions, kernel methods, multi-layer networks

The Objective Function

  • Perceptron algorithm lacks explicit objective function
  • Updates provide intuition but not clear optimization target
  • Objective functions (loss/error/cost) often easier starting point
  • Connection between algorithm and objective not immediately obvious

Regression

Objective Functions and Regression

  • Classification: map feature to class label.
  • Regression: map feature to real value

Regression

  • Our prediction function is \[f(x_i) = mx_i + c\]

  • Need an algorithm to fit it.

Least Squares

  • Least squares: minimise an error.

\[E(m, c) = \sum_{i=1}^n(y_i - f(x_i))^2\]

Regression

  • Create an artificial data set.

  • True value for \(m\): m_true = 1.4

  • True value for \(c\): c_true = -3.1

We can use these values to create our artificial data. The formula \[y_i = mx_i + c\] is translated to code as follows:

y = m_true*x+c_true
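For the line above to run, the inputs x need to exist; a minimal sketch for creating them (the number of points and their distribution are assumptions):

import numpy as np
x = np.random.normal(size=4)  # four input locations, sampled at random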

Plot of Data

We can now plot the artificial data we’ve created.

Plot of Data

  • Points lie exactly on a straight line
  • Not very realistic
  • Corrupt them with Gaussian ‘noise.’

Noise Corrupted Plot
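A sketch of the corruption step, assuming zero-mean Gaussian noise with standard deviation 0.5 (the lecture's exact noise level may differ):

noise = np.random.normal(scale=0.5, size=x.shape)  # assumed noise level
y = m_true*x + c_true + noise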

Contour Plot of Error Function
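The error surface can be visualised by evaluating \(E(m, c)\) over a grid of candidate values; a sketch, with the grid ranges chosen as assumptions:

import matplotlib.pyplot as plt

m_vals = np.linspace(-3, 3, 100)  # assumed range for m
c_vals = np.linspace(-8, 2, 100)  # assumed range for c
M, C = np.meshgrid(m_vals, c_vals)
E = ((y[:, None, None] - M*x[:, None, None] - C)**2).sum(axis=0)  # E(m, c) at each grid point
plt.contour(M, C, E, levels=20)
plt.xlabel('$m$')
plt.ylabel('$c$')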

Steepest Descent

  • Minimise the sum of squares error function.
  • E.g. gradient descent.
  • Initialise with a guess for \(m\) and \(c\).
  • Update that guess by subtracting a portion of the gradient from it.
  • Like walking down a hill in the steepest direction of the hill to get to the bottom.

Algorithm

  • We start with a guess for \(m\) and \(c\).
m_star = 0.0   # initial guess for the slope
c_star = -5.0  # initial guess for the offset

Offset Gradient

  • Gradient of the error wrt \(c\), \[ \frac{\text{d}E(m, c)}{\text{d} c} = -2\sum_{i=1}^n(y_i - mx_i - c) \]

Compute as

c_grad = -2*(y - m_star*x - c_star).sum()  # dE/dc at the current guess

Slope Gradient

Gradient wrt \(m\) is similar \[ \frac{\text{d}E(m, c)}{\text{d} m} = -2\sum_{i=1}^nx_i(y_i - mx_i - c) \]

Compute as

m_grad = -2*(x*(y - m_star*x - c_star)).sum()  # dE/dm at the current guess

Update Equations

  • Gradients with respect to \(m\) and \(c\).
  • Can update our initial guesses for \(m\) and \(c\) using the gradient.

Update Equations

  • We don’t want to just subtract the gradient from \(m\) and \(c\).
  • We need to take a small step in the (negative) gradient direction.
  • Otherwise we might overshoot the minimum.

Update Equations

  • We want to follow the gradient to get to the minimum, but the gradient changes as we move.

Move in Direction of Gradient

Update Equations

  • The step size was introduced above as the ‘portion’ of the gradient we subtract.
  • It’s known as the learning rate and is denoted by \(\eta\). \[ c_\text{new}\leftarrow c_{\text{old}} - \eta\frac{\text{d}E(m, c)}{\text{d}c} \]

Step Size

  • This gives us an update for our estimate of \(c\).
  • Similarly \[ m_\text{new} \leftarrow m_{\text{old}} - \eta\frac{\text{d}E(m, c)}{\text{d}m} \] gives us an update for \(m\).

Update Code

  • These updates can be coded as
learn_rate = 0.01  # the learning rate, eta
c_star = c_star - learn_rate*c_grad  # step against the gradient in c
m_star = m_star - learn_rate*m_grad  # step against the gradient in m

Iterating Updates

  • Fit model by descending gradient.

Gradient Descent Algorithm
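Putting the gradients and updates together, the algorithm repeats them until the parameters stop changing; a sketch, with the iteration count as an assumption:

for iteration in range(1000):
    # gradients of the sum-of-squares error at the current guess
    c_grad = -2*(y - m_star*x - c_star).sum()
    m_grad = -2*(x*(y - m_star*x - c_star)).sum()
    # move a small step against each gradient
    c_star = c_star - learn_rate*c_grad
    m_star = m_star - learn_rate*m_grad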

Stochastic Gradient Descent

  • If \(n\) is small, gradient descent is fine.
  • But sometimes (e.g. on the internet) \(n\) could be a billion.
  • Stochastic gradient descent is more similar to the perceptron.
  • Look at the gradient of one data point at a time rather than summing across all data points.
  • This gives a stochastic estimate of gradient.

Stochastic Gradient Descent

  • The real gradient with respect to \(m\) is given by \[ \frac{\text{d}E(m, c)}{\text{d} m} = -2\sum_{i=1}^nx_i(y_i - mx_i - c) \]

Decompose the Sum

but it has \(n\) terms in the sum. Substituting in the gradient we can see that the full update is of the form \[\begin{aligned} m_\text{new} \leftarrow {}& m_\text{old} \\ & + 2\eta\left[x_1 (y_1 - m_\text{old}x_1 - c_\text{old}) \right. \\ & + x_2 (y_2 - m_\text{old}x_2 - c_\text{old}) \\ & \left. + \dots + x_n (y_n - m_\text{old}x_n - c_\text{old})\right] \end{aligned}\]

This could be split up into lots of individual updates \[m_1 \leftarrow m_\text{old} + 2\eta\left[x_1 (y_1 - m_\text{old}x_1 - c_\text{old})\right]\] \[m_2 \leftarrow m_1 + 2\eta\left[x_2 (y_2 - m_\text{old}x_2 - c_\text{old})\right]\] \[m_3 \leftarrow m_2 + 2\eta \left[\dots\right]\] \[m_n \leftarrow m_{n-1} + 2\eta\left[x_n (y_n - m_\text{old}x_n - c_\text{old})\right]\]

which would lead to the same final update.
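Because every term uses the same fixed \(m_\text{old}\) and \(c_\text{old}\), the chain of small updates reproduces the single full update; a quick check in code (names as defined earlier):

m_seq = m_star
for i in range(x.shape[0]):
    # each term uses the old m_star and c_star, not m_seq
    m_seq = m_seq + 2*learn_rate*x[i]*(y[i] - m_star*x[i] - c_star)
# m_seq now equals m_star - learn_rate*m_grad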

Updating \(c\) and \(m\)

  • In the sum, the \(m\) and \(c\) used to compute each gradient term don’t change between updates.
  • In stochastic gradient descent we do change them after each update.
  • This means it’s not quite the same as steepest descent.

  • But we can present the data points in a random order, like we did for the perceptron.
  • This makes the algorithm suitable for use with large-scale data.

Stochastic Gradient Descent

  • Since the data is presented in a random order we write \[ m_\text{new} \leftarrow m_\text{old} + 2\eta\left[x_i (y_i - m_\text{old}x_i - c_\text{old})\right] \]

SGD for Linear Regression

Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data.
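A sketch of the loop, reusing the data and learning rate from earlier (the number of passes is an assumption):

for epoch in range(100):
    # present the points in a random order
    for i in np.random.permutation(x.shape[0]):
        resid = y[i] - m_star*x[i] - c_star
        c_star = c_star + 2*learn_rate*resid
        m_star = m_star + 2*learn_rate*x[i]*resid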

Mini-Batch

  • In practice use mini-batches.
    • Instead of computing the gradient of one example,
    • use a small batch of examples (see the sketch below).
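A sketch of a mini-batch version, with the batch size as an assumption; averaging the per-point gradients keeps the step size comparable across batch sizes:

batch_size = 10  # assumed batch size
for epoch in range(100):
    perm = np.random.permutation(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        idx = perm[start:start+batch_size]
        resid = y[idx] - m_star*x[idx] - c_star
        c_star = c_star + 2*learn_rate*resid.mean()
        m_star = m_star + 2*learn_rate*(x[idx]*resid).mean()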

Reflection on Linear Regression and Supervised Learning

  • How does the size of the learning rate affect convergence?
  • How much do results vary between runs on smaller data sets?


Thanks!

Further Reading

  • Section 1.1.3 of Rogers and Girolami (2011)

  • Section 8.1 of Bishop and Bishop (2024)

References

Bishop, C.M., Bishop, H., 2024. Deep learning: Foundations and concepts. Springer.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.