Objective Functions and Gradient Descent

Neil Lawrence

Dedan Kimathi University, Nyeri, Kenya

Objective Function

  • On Monday we introduced ML and motivated the importance of probability.
  • Today we explore the idea of the ‘objective function.’

Introduction to Classification

Classification

  • We are given a data set containing ‘inputs,’ \(\mathbf{X}\) and ‘targets,’ \(\mathbf{ y}\).
  • Each data point consists of an input vector \(\mathbf{ x}_i\) and a class label, \(y_i\).
  • For binary classification assume \(y_i\) should be either \(1\) (yes) or \(-1\) (no).
  • Input vector can be thought of as features.

Discrete Probability

  • Algorithms are based on a prediction function and an objective function.
  • For regression the codomain of the function, \(f(\mathbf{X})\), was the real numbers (or sometimes real vectors).
  • In classification we are given an input vector, \(\mathbf{ x}\), and an associated label, \(y\) which either takes the value \(-1\) or \(1\).

Classification

  • Inputs, \(\mathbf{ x}\), mapped to a label, \(y\), through a function \(f(\cdot)\) dependent on parameters, \(\mathbf{ w}\), \[ y= f(\mathbf{ x}; \mathbf{ w}). \]
  • \(f(\cdot)\) is known as the prediction function.

Classification Examples

  • Classifying handwritten digits from binary images (automatic zip code reading)
  • Detecting faces in images (e.g. digital cameras).
  • Identifying who a detected face belongs to (e.g. Facebook, DeepFace)
  • Classifying the type of cancer given gene expression data.
  • Categorization of document types (different types of news article on the internet)

Hyperplane

  • Predict class label \(y_i\)
  • Using data features \(\mathbf{ x}_i\)
  • Through the prediction function \[ f(\mathbf{ x}_i) = \text{sign}\left(\mathbf{ w}^\top \mathbf{ x}_i + b\right) \]

Hyperplane

  • Boundary for classification given by hyperplane
  • Hyperplane defined by the normal vector \(\mathbf{ w}\). \[ \mathbf{ w}^\top \mathbf{ x}= -b \]

Toy Data

  • Red crosses (+ve) and green circles (-ve).

The Perceptron

  • Developed in 1957 by Rosenblatt.
  • Take a data point, \(\mathbf{ x}_i\).
  • Predict class \(y_i=1\) if \(\sum_j w_{j} x_{i, j} + b > 0\)
  • Otherwise predict \(y_i=-1\).

Mathematical Drawing of Decision Boundary

  • Decision boundary defined by hyperplane
  • Classification: \(\text{sign}(\mathbf{ x}^\top \mathbf{ w} + b)\)
  • Two features: \(w_1x_{i,1} + w_2x_{i,2} + b\)
  • Boundary where prediction switches from -1 to +1

Reminder: Equation of Plane

  • Plane equation: \(w_1 x_{i, 1} + w_2 x_{i, 2} + b = 0\)
  • Decision boundary where \(\text{sign}(\cdot)\) argument equals zero

Reminder: Equation of Plane

  • Rearrange to plot: \(x_2 = -\frac{(b+x_1w_1)}{w_2}\)
  • Separating hyperplane divides feature space

Perceptron Algorithm: Initialisation Maths

  • Set \(\mathbf{ w}\) using a randomly selected point \(i\) \[ \mathbf{ w}= y_i \mathbf{ x}_i. \]
  • Why? Consider \[ \text{sign}(\mathbf{ w}^\top\mathbf{ x}_i) \]

Perceptron Algorithm: Initialisation Maths

  • Setting \(\mathbf{ w}\) to \(y_i\mathbf{ x}_i\) implies \[ \text{sign}(\mathbf{ w}^\top\mathbf{ x}_i) = \text{sign}(y_i\mathbf{ x}_i^\top \mathbf{ x}_i) = y_i, \] since \(\mathbf{ x}_i^\top \mathbf{ x}_i > 0\), so the chosen point is correctly classified.

Drawing Decision Boundary

  • Plot \[ x_2 = -\frac{(b+x_1 w_1)}{w_2} \]

or specify \(x_2\) and compute \(x_1\) from it, \[ x_1 = -\frac{b + x_2w_2}{w_1} \]
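To see how the formulae are used, here is a short plotting sketch; the weights, bias, and plot limits are illustrative assumptions, not values from the lecture:

import numpy as np
import matplotlib.pyplot as plt

w = np.array([1.0, 2.0])    # assumed normal vector
b = 0.5                     # assumed bias
x1 = np.array([-2.0, 2.0])  # assumed plot limits for the first feature
x2 = -(b + x1*w[0])/w[1]    # first formula: rearranged plane equation
plt.plot(x1, x2, 'r-')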

Switching Formulae

  • Whether to use the first or the second formula
  • depends on how the hyperplane leaves the plot.

Code for Perceptron
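A minimal sketch in Python (numpy); the function names, learning rate, and bias handling are assumptions rather than the lecture's exact code:

import numpy as np

def init_perceptron(X, y, seed=42):
    # initialise with a randomly selected point: w = y_i x_i
    rng = np.random.default_rng(seed)
    i = rng.integers(X.shape[0])
    return y[i]*X[i, :].copy(), float(y[i])

def update_perceptron(w, b, X, y, i, learn_rate=0.1):
    # if point i is misclassified, move the boundary towards it
    if y[i]*(np.dot(w, X[i, :]) + b) <= 0:
        w = w + learn_rate*y[i]*X[i, :]
        b = b + learn_rate*y[i]
    return w, b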

Perceptron Reflection

  • Algorithm updates are intuitive and interpretable
  • What happens when classes aren’t linearly separable?

Perceptron Reflections

  • Non-convergence indicates the data are not linearly separable
  • Possible fix: anneal learning rate
  • Non-linear extensions: basis functions, kernel methods, multi-layer networks

The Objective Function

  • Perceptron algorithm lacks explicit objective function
  • Updates provide intuition but not clear optimization target
  • Objective functions (loss/error/cost) often easier starting point
  • Connection between algorithm and objective not immediately obvious

Regression

Objective Functions and Regression

  • Classification: map feature to class label.
  • Regression: map feature to real value

Regression

  • Our prediction function is \[f(x_i) = mx_i + c\]

  • Need an algorithm to fit it.

Least Squares

  • Least squares: minimise an error.

\[E(m, c) = \sum_{i=1}^n(y_i - f(x_i))^2\]

Regression

  • Create an artificial data set.

  • True value for \(m\): m_true = 1.4

  • True value for \(c\): c_true = -3.1

We can use these values to create our artificial data. The formula \[y_i = mx_i + c\] is translated to code as follows:

y = m_true*x+c_true
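For the line above to run, the inputs x need to exist; a minimal sketch for creating them (the number of points and their distribution are assumptions):

import numpy as np
x = np.random.normal(size=4)  # four input locations, sampled at random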

Plot of Data

We can now plot the artificial data we’ve created.

Plot of Data

  • Points lie exactly on a straight line
  • Not very realistic
  • Corrupt them with Gaussian ‘noise.’

Noise Corrupted Plot
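A sketch of the corruption step, assuming zero-mean Gaussian noise with standard deviation 0.5 (the lecture's exact noise level may differ):

noise = np.random.normal(scale=0.5, size=x.shape)  # assumed noise level
y = m_true*x + c_true + noise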

Contour Plot of Error Function
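The error surface can be visualised by evaluating \(E(m, c)\) over a grid of candidate values; a sketch, with the grid ranges chosen as assumptions:

import matplotlib.pyplot as plt

m_vals = np.linspace(-3, 3, 100)  # assumed range for m
c_vals = np.linspace(-8, 2, 100)  # assumed range for c
M, C = np.meshgrid(m_vals, c_vals)
E = ((y[:, None, None] - M*x[:, None, None] - C)**2).sum(axis=0)  # E(m, c) at each grid point
plt.contour(M, C, E, levels=20)
plt.xlabel('$m$')
plt.ylabel('$c$')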

Steepest Descent

  • Minimise the sum of squares error function.
  • E.g. gradient descent.
  • Initialise with a guess for \(m\) and \(c\).
  • Update that guess by subtracting a portion of the gradient from it.
  • Like walking down a hill in the steepest direction of the hill to get to the bottom.

Algorithm

  • We start with a guess for \(m\) and \(c\).
m_star = 0.0   # initial guess for the slope
c_star = -5.0  # initial guess for the offset

Offset Gradient

  • Gradient of the error wrt \(c\), \[ \frac{\text{d}E(m, c)}{\text{d} c} = -2\sum_{i=1}^n(y_i - mx_i - c) \]

Compute as

c_grad = -2*(y - m_star*x - c_star).sum()  # dE/dc at the current guess

Slope Gradient

Gradient wrt \(m\) is similar \[ \frac{\text{d}E(m, c)}{\text{d} m} = -2\sum_{i=1}^nx_i(y_i - mx_i - c) \]

Compute as

m_grad = -2*(x*(y - m_star*x - c_star)).sum()  # dE/dm at the current guess

Update Equations

  • Gradients with respect to \(m\) and \(c\).
  • Can update our initial guesses for \(m\) and \(c\) using the gradient.

Update Equations

  • We don’t want to just subtract the gradient from \(m\) and \(c\).
  • We need to take a small step in the (negative) gradient direction.
  • Otherwise we might overshoot the minimum.

Update Equations

  • We want to follow the gradient to get to the minimum, but the gradient changes as we move.

Move in Direction of Gradient

Update Equations

  • The step size was introduced above as the ‘portion’ of the gradient we subtract.
  • It’s known as the learning rate and is denoted by \(\eta\). \[ c_\text{new}\leftarrow c_{\text{old}} - \eta\frac{\text{d}E(m, c)}{\text{d}c} \]

Step Size

  • This gives us an update for our estimate of \(c\).
  • Similarly \[ m_\text{new} \leftarrow m_{\text{old}} - \eta\frac{\text{d}E(m, c)}{\text{d}m} \] gives us an update for \(m\).

Update Code

  • These updates can be coded as
learn_rate = 0.01  # the learning rate, eta
c_star = c_star - learn_rate*c_grad  # step against the gradient in c
m_star = m_star - learn_rate*m_grad  # step against the gradient in m

Iterating Updates

  • Fit model by descending gradient.

Gradient Descent Algorithm
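Putting the gradients and updates together, the algorithm repeats them until the parameters stop changing; a sketch, with the iteration count as an assumption:

for iteration in range(1000):
    # gradients of the sum-of-squares error at the current guess
    c_grad = -2*(y - m_star*x - c_star).sum()
    m_grad = -2*(x*(y - m_star*x - c_star)).sum()
    # move a small step against each gradient
    c_star = c_star - learn_rate*c_grad
    m_star = m_star - learn_rate*m_grad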

Stochastic Gradient Descent

  • If \(n\) is small, gradient descent is fine.
  • But sometimes (e.g. on the internet) \(n\) could be a billion.
  • Stochastic gradient descent is more similar to the perceptron.
  • Look at the gradient of one data point at a time rather than summing across all data points.
  • This gives a stochastic estimate of gradient.

Stochastic Gradient Descent

  • The real gradient with respect to \(m\) is given by \[ \frac{\text{d}E(m, c)}{\text{d} m} = -2\sum_{i=1}^nx_i(y_i - mx_i - c) \]

Decompose the Sum

but it has \(n\) terms in the sum. Substituting in the gradient we can see that the full update is of the form \[\begin{aligned} m_\text{new} \leftarrow {}& m_\text{old} \\ & + 2\eta\left[x_1 (y_1 - m_\text{old}x_1 - c_\text{old}) \right. \\ & + x_2 (y_2 - m_\text{old}x_2 - c_\text{old}) \\ & \left. + \dots + x_n (y_n - m_\text{old}x_n - c_\text{old})\right] \end{aligned}\]

This could be split up into lots of individual updates \[m_1 \leftarrow m_\text{old} + 2\eta\left[x_1 (y_1 - m_\text{old}x_1 - c_\text{old})\right]\] \[m_2 \leftarrow m_1 + 2\eta\left[x_2 (y_2 - m_\text{old}x_2 - c_\text{old})\right]\] \[m_3 \leftarrow m_2 + 2\eta \left[\dots\right]\] \[m_n \leftarrow m_{n-1} + 2\eta\left[x_n (y_n - m_\text{old}x_n - c_\text{old})\right]\]

which would lead to the same final update.
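Because every term uses the same fixed \(m_\text{old}\) and \(c_\text{old}\), the chain of small updates reproduces the single full update; a quick check in code (names as defined earlier):

m_seq = m_star
for i in range(x.shape[0]):
    # each term uses the old m_star and c_star, not m_seq
    m_seq = m_seq + 2*learn_rate*x[i]*(y[i] - m_star*x[i] - c_star)
# m_seq now equals m_star - learn_rate*m_grad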

Updating \(c\) and \(m\)

  • In the sum, the \(m\) and \(c\) used to compute each gradient term don’t change between updates.
  • In stochastic gradient descent we do change them after each update.
  • This means it’s not quite the same as steepest descent.

  • But we can present the data points in a random order, like we did for the perceptron.
  • This makes the algorithm suitable for use with large-scale data.

Stochastic Gradient Descent

  • Since the data is presented in a random order we write \[ m_\text{new} \leftarrow m_\text{old} + 2\eta\left[x_i (y_i - m_\text{old}x_i - c_\text{old})\right] \]

SGD for Linear Regression

Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data.
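A sketch of the loop, reusing the data and learning rate from earlier (the number of passes is an assumption):

for epoch in range(100):
    # present the points in a random order
    for i in np.random.permutation(x.shape[0]):
        resid = y[i] - m_star*x[i] - c_star
        c_star = c_star + 2*learn_rate*resid
        m_star = m_star + 2*learn_rate*x[i]*resid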

Mini-Batch

  • In practice use mini-batches.
    • Instead of computing the gradient of one example,
    • use a small batch of examples (see the sketch below).
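A sketch of a mini-batch version, with the batch size as an assumption; averaging the per-point gradients keeps the step size comparable across batch sizes:

batch_size = 10  # assumed batch size
for epoch in range(100):
    perm = np.random.permutation(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        idx = perm[start:start+batch_size]
        resid = y[idx] - m_star*x[idx] - c_star
        c_star = c_star + 2*learn_rate*resid.mean()
        m_star = m_star + 2*learn_rate*(x[idx]*resid).mean()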

Reflection on Linear Regression and Supervised Learning

  • How does the size of the learning rate affect convergence?
  • How much do results vary between runs on smaller data sets?


Thanks!

Further Reading

  • Section 1.1.3 of Rogers and Girolami (2011)

  • Section 8.1 of Bishop and Bishop (2024)

References

Bishop, C.M., Bishop, H., 2024. Deep learning: Foundations and concepts. Springer.
Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.