Last time: Reviewed Objective Functions and gradient descent.
Regression Examples
Predict a real value, \(y_i\) given some inputs \(\mathbf{ x}_i\).
Predict quality of meat given spectral measurements (Tecator data).
Radiocarbon dating, the C14 calibration curve: predict age given quantity of C14 isotope.
Predict quality of different Go or Backgammon moves given expert rated training data.
Latent Variables
\(y= mx+ c + \epsilon\)
point 1: \(x= 1\), \(y=3\)\[
3 = m + c + \epsilon_1
\]
point 2: \(x= 3\), \(y=1\)\[
1 = 3m + c + \epsilon_2
\]
point 3: \(x= 2\), \(y=2.5\)\[
2.5 = 2m + c + \epsilon_3
\]
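To see the latent variables concretely, here is a minimal sketch (assuming numpy; the candidate slope and offset are arbitrary): once \(m\) and \(c\) are fixed, each \(\epsilon_i\) is simply the leftover mismatch for that point.

```python
import numpy as np

# The three observed points from above.
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

# Candidate slope and offset (chosen arbitrarily for illustration).
m, c = -1.0, 4.0

# The latent noise values are whatever remains after the line explains y.
epsilon = y - (m * x + c)
print(epsilon)  # one epsilon_i per data point
```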
A Probabilistic Process
Set the mean of Gaussian to be a function. \[
p\left(y_i|x_i\right)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left(-\frac{\left(y_i-f\left(x_i\right)\right)^{2}}{2\sigma^2}\right).
\]
This gives us a ‘noisy function.’
This is known as a stochastic process.
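A small sketch of sampling from such a process (assuming numpy; the mean function \(f(x)=\sin(x)\) and the noise variance are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Example mean function (an assumption for illustration).
    return np.sin(x)

sigma2 = 0.01  # noise variance
x = np.linspace(0, 2 * np.pi, 20)

# y_i ~ N(f(x_i), sigma2): the 'noisy function'.
y = f(x) + rng.normal(0.0, np.sqrt(sigma2), size=x.shape)
```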
Olympic 100m Data
Gold medal times for Olympic 100 m runners since 1896.
One of a number of Olympic data sets collected by Rogers and Girolami (2011).
Start of the 2012 London 100m race. Image by Darren Wilkinson from Wikimedia Commons
Take the standard Gaussian, parameterized by its mean and variance, and make the mean a linear function of the input.
This leads to a regression model. \[
\begin{align*}
y_i=&f\left(x_i\right)+\epsilon_i,\\
\epsilon_i \sim & \mathscr{N}\left(0,\sigma^2\right).
\end{align*}
\]
Assume \(y_i\) is height and \(x_i\) is weight.
Data Point Likelihood
Likelihood of an individual data point \[
p\left(y_i|x_i,m,c\right)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).
\] Parameters are gradient, \(m\), offset, \(c\) of the function and noise variance \(\sigma^2\).
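A direct transcription of this density into code might look as follows (a sketch assuming numpy; the function name and test values are mine):

```python
import numpy as np

def gaussian_likelihood(y_i, x_i, m, c, sigma2):
    """p(y_i | x_i, m, c) for the straight-line model with Gaussian noise."""
    residual = y_i - (m * x_i + c)
    return np.exp(-residual**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(gaussian_likelihood(y_i=3.0, x_i=1.0, m=-1.0, c=4.0, sigma2=0.25))
```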
Data Set Likelihood
If the noise, \(\epsilon_i\), is sampled independently for each data point, then each data point is independent given \(m\) and \(c\). For independent variables: \[
p(\mathbf{ y}) = \prod_{i=1}^np(y_i)
\]\[
p(\mathbf{ y}|\mathbf{ x}, m, c) = \prod_{i=1}^np(y_i|x_i, m, c)
\]
For Gaussian
i.i.d. assumption \[
p(\mathbf{ y}|\mathbf{ x}, m, c) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi \sigma^2}}\exp \left(-\frac{\left(y_i- mx_i-c\right)^{2}}{2\sigma^2}\right).
\]
\[
p(\mathbf{ y}|\mathbf{ x}, m, c) = \frac{1}{\left(2\pi \sigma^2\right)^{\frac{n}{2}}}\exp\left(-\frac{\sum_{i=1}^n\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}\right).
\]
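A quick numerical check (a sketch assuming numpy; the data points reuse the earlier three-point example and \(\sigma^2\) is arbitrary) that the product of individual likelihoods matches the collapsed form:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])
m, c, sigma2 = -1.0, 4.0, 0.25
n = len(y)

residuals = y - (m * x + c)

# Product of the individual Gaussian likelihoods.
product_form = np.prod(
    np.exp(-residuals**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
)

# Collapsed form with the sum inside a single exponential.
collapsed_form = (np.exp(-np.sum(residuals**2) / (2 * sigma2))
                  / (2 * np.pi * sigma2) ** (n / 2))

print(np.isclose(product_form, collapsed_form))  # True
```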
Log Likelihood Function
Normally work with the log likelihood:
\[\begin{aligned}
L(m,c,\sigma^{2})= & -\frac{n}{2}\log 2\pi -\frac{n}{2}\log \sigma^2 \\& -\sum_{i=1}^{n}\frac{\left(y_i-mx_i-c\right)^{2}}{2\sigma^2}.
\end{aligned}\]
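In code, the log likelihood might be written as follows (a sketch assuming numpy; the function name is mine):

```python
import numpy as np

def log_likelihood(m, c, sigma2, x, y):
    """L(m, c, sigma^2) for the straight-line model with Gaussian noise."""
    n = len(y)
    residuals = y - (m * x + c)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - np.sum(residuals**2) / (2 * sigma2))
```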
Consistency of Maximum Likelihood
If the data were really generated according to the probability we specified, the correct parameters will be recovered in the limit as \(n\rightarrow \infty\).
Consistency of Maximum Likelihood
This can be proven through sample-based approximations (law of large numbers) of the Kullback-Leibler (KL) divergence.
Mainstay of classical statistics (Wasserman, 2003).
Probabilistic Interpretation of Error Function
The probabilistic interpretation of the error function is the negative log likelihood.
Minimizing error function is equivalent to maximizing log likelihood.
Probabilistic Interpretation of Error Function
Maximizing log likelihood is equivalent to maximizing the likelihood because \(\log\) is monotonic.
Probabilistic interpretation: minimizing the error function is equivalent to maximizing the likelihood with respect to the parameters.
Error Function
Negative log likelihood leads to an error function
\[\begin{aligned}
E(m,c,\sigma^{2})= & \frac{n}{2}\log \sigma^2 \\& +\frac{1}{2\sigma^2}\sum _{i=1}^{n}\left(y_i-mx_i-c\right)^{2}.\end{aligned}\]
Learning proceeds by minimising the error function.
Sum of Squares Error
Ignoring terms which don’t depend on \(m\) and \(c\) gives \[E(m, c) \propto \sum_{i=1}^n(y_i - f(x_i))^2\] where \(f(x_i) = mx_i + c\).
Sum of Squares Error
This is recognised as the sum of squares error function.
It is commonly used and closely associated with the Gaussian likelihood.
Reminder
Two functions involved:
Prediction function: \(f(x_i)\)
Error, or Objective function: \(E(m, c)\)
Error function depends on parameters through prediction function.
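To keep the two roles separate, a minimal sketch (assuming numpy; the names mirror the notation above):

```python
import numpy as np

def f(x, m, c):
    """Prediction function: maps inputs to predictions."""
    return m * x + c

def E(m, c, x, y):
    """Error (objective) function: depends on m and c only through f."""
    return np.sum((y - f(x, m, c)) ** 2)
```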
Mathematical Interpretation
What is the mathematical interpretation?
There is a cost function.
It expresses the mismatch between your prediction and reality. \[
E(m, c)=\sum_{i=1}^n\left(y_i - mx_i-c\right)^2
\]
This is known as the sum of squares error.
Legendre
The sum of squares error was first published by Legendre in 1805.
But Gauss claimed priority: he had already used it to recover the lost planet Ceres.
This led to a priority dispute between Legendre and Gauss.
Legendre
Olympic Marathon Data
Gold medal times for Olympic Marathon since 1896.
Marathons before 1924 didn’t have a standardized distance.
Present results using pace per km.
In 1904 the Marathon was badly organised, leading to very slow times.
What is the probability he would have won an Olympics if one had been held in 1946?
Running Example: Olympic Marathons
Maximum Likelihood: Iterative Solution
\[
E(m, c) = \sum_{i=1}^n(y_i-mx_i-c)^2
\]
Coordinate Descent
Learning is Optimisation
Learning is minimisation of the cost function.
At the minima the gradient is zero.
Coordinate descent: find the gradient in each coordinate and set it to zero. For the offset \(c\): \[\frac{\text{d}E(c)}{\text{d}c} = -2\sum_{i=1}^n\left(y_i- m x_i - c \right)\]\[0 = -2\sum_{i=1}^n\left(y_i- mx_i - c \right)\]
And similarly for the gradient, \(m\): \[\frac{\text{d}E(m)}{\text{d}m} = -2\sum_{i=1}^nx_i\left(y_i- m x_i - c \right)\]\[0 = -2\sum_{i=1}^nx_i \left(y_i-m x_i - c \right)\]
Learning is Optimisation
Fixed point equations \[0 = -2\sum_{i=1}^nx_iy_i+2\sum_{i=1}^nm x_i^2+2\sum_{i=1}^ncx_i\]\[m = \frac{\sum_{i=1}^n\left(y_i -c\right)x_i}{\sum_{i=1}^nx_i^2}\] and, from the gradient with respect to \(c\), \[c = \frac{\sum_{i=1}^n\left(y_i - mx_i\right)}{n}.\]
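The resulting coordinate descent loop might be sketched as follows (assuming numpy; the data and iteration count are placeholders), alternating the two fixed point updates:

```python
import numpy as np

# Placeholder data (any paired x, y arrays would do).
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

m, c = 0.0, 0.0  # arbitrary initialisation
for _ in range(100):
    # Update c with m held fixed, then m with c held fixed.
    c = np.sum(y - m * x) / len(y)
    m = np.sum((y - c) * x) / np.sum(x**2)

print(m, c)
```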
Multiple inputs: the model becomes \(y_i = \mathbf{ w}^{\top}\mathbf{ x}_i + \epsilon_i\), with a corresponding error function of \[E(\mathbf{ w},\sigma^2)=\frac{n}{2}\log\sigma^2 + \frac{\sum_{i=1}^{n}\left(y_i-\mathbf{ w}^{\top}\mathbf{ x}_i\right)^{2}}{2\sigma^2}.\]
For now some simple multivariate differentiation: \[\frac{\text{d}\mathbf{a}^{\top}\mathbf{ w}}{\text{d}\mathbf{ w}}=\mathbf{a}\] and \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=\left(\mathbf{A}+\mathbf{A}^{\top}\right)\mathbf{ w}\] or if \(\mathbf{A}\) is symmetric (i.e. \(\mathbf{A}=\mathbf{A}^{\top}\)) \[\frac{\text{d}\mathbf{ w}^{\top}\mathbf{A}\mathbf{ w}}{\text{d}\mathbf{ w}}=2\mathbf{A}\mathbf{ w}.\]
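These identities are easy to check numerically (a sketch assuming numpy; the finite-difference helper is mine, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def numerical_grad(fn, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function fn at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (fn(w + step) - fn(w - step)) / (2 * eps)
    return grad

w = rng.normal(size=3)
a = rng.normal(size=3)
A = rng.normal(size=(3, 3))
A = A + A.T  # make A symmetric

print(np.allclose(numerical_grad(lambda v: a @ v, w), a))              # d a'w / dw = a
print(np.allclose(numerical_grad(lambda v: v @ A @ v, w), 2 * A @ w))  # d w'Aw / dw = 2Aw
```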
Differentiate the Objective
Differentiating with respect to the vector \(\mathbf{ w}\) we obtain \[\frac{\text{d}E(\mathbf{ w},\sigma^2)}{\text{d}\mathbf{ w}} = -\frac{1}{\sigma^2}\sum_{i=1}^n\mathbf{ x}_iy_i + \frac{1}{\sigma^2}\left[\sum_{i=1}^n\mathbf{ x}_i\mathbf{ x}_i^{\top}\right]\mathbf{ w}.\]
Setting this gradient to zero, and writing \(\mathbf{X}\) for the design matrix with rows \(\mathbf{ x}_i^{\top}\) and \(\mathbf{ y}\) for the vector of targets, solve the matrix equation for \(\mathbf{ w}\). \[\mathbf{X}^\top \mathbf{X}\mathbf{ w}= \mathbf{X}^\top \mathbf{ y}\]
The equation for \(\left.\sigma^2\right.^{*}\) may also be found \[\left.\sigma^2\right.^{{*}}=\frac{\sum_{i=1}^{n}\left(y_i-\left.\mathbf{ w}^{*}\right.^{\top}\mathbf{ x}_i\right)^{2}}{n}.\]
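Putting the multivariate solution together (a sketch assuming numpy; the design matrix and targets are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder design matrix (rows are x_i^T) and targets.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(0.0, 0.1, size=n)

# Solve X'X w = X'y for the maximum likelihood weights.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood estimate of the noise variance.
sigma2_star = np.sum((y - X @ w_star) ** 2) / n

print(w_star, sigma2_star)
```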
Movie Violence Data
Data containing movie information (year, length, rating, genre, IMDB Rating).
Try to predict what factors influence a movie's IMDB rating.
Multivariate Regression on Movie Violence Data
Regress from features Year, Body_Count, Length_Minutes to IMDB_Rating.
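A sketch of how that regression might be set up (assuming pandas and numpy; the file name movie_body_count.csv is a placeholder, and the column names come from the feature list above):

```python
import numpy as np
import pandas as pd

# Placeholder file name; the columns follow the features named above.
data = pd.read_csv("movie_body_count.csv")

features = ["Year", "Body_Count", "Length_Minutes"]
X = data[features].to_numpy(dtype=float)
X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of ones for the offset
y = data["IMDB_Rating"].to_numpy(dtype=float)

# Maximum likelihood solution via the normal equations.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(dict(zip(features + ["offset"], w_star)))
```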