Generalised Linear Models

Neil D. Lawrence

2024-11-15

LT2, William Gates Building

Review

Linear Regression Reminder

Linear Regression Model

Linear regression models continuous response \(y_i\) vs inputs \(\mathbf{ x}_i\): \[y_i = f(\mathbf{ x}_i) + \epsilon_i\] where \(f(\mathbf{ x}_i) = \mathbf{ w}^\top\mathbf{ x}_i\)
Probabilistic model: \[p(y_i|\mathbf{ x}_i) = \mathcal{N}\left(y_i|\mathbf{ w}^\top\mathbf{ x}_i, \sigma^2\right)\]

Linear Regression in Matrix Form

Matrix form: \[\mathbf{ y}= \mathbf{X}\mathbf{ w}+ \boldsymbol{ \epsilon}\]
Expected prediction: \[\mathbb{E}[y_i|\mathbf{ x}_i] = \mathbf{ w}^\top\mathbf{ x}_i\]
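
A minimal sketch of the matrix form, assuming NumPy and synthetic data. The names `X`, `w_true` and the noise level are illustrative choices, not taken from the lecture. Least squares recovers \(\mathbf{ w}\) from \(\mathbf{ y}= \mathbf{X}\mathbf{ w}+ \boldsymbol{ \epsilon}\), and the expected prediction is \(\mathbf{X}\mathbf{ w}\).

```python
import numpy as np

# Synthetic data (an assumption for illustration): one input plus a bias column.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
w_true = np.array([1.0, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)  # y = Xw + epsilon

# Least-squares estimate of w (the maximum likelihood solution under Gaussian noise).
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Expected prediction E[y_i | x_i] = w^T x_i for every row of X.
y_pred = X @ w_hat
```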

Model Fit Statistics

Model fit statistics help assess overall performance:
- R-squared shows variance explained
- F-statistic tests if model is useful
- AIC/BIC help compare models

Parameter Estimates

Parameter estimates tell us about relationships:
- Coefficients show effect direction/size
- Standard errors show uncertainty
- P-values test significance

Residual Diagnostics

Residual diagnostics check assumptions:
- Tests for normality and autocorrelation
- Look for patterns that violate assumptions

Visual Inspection

Visual inspection is crucial:
- With 1D data we can plot everything
- Helps spot patterns statistics might miss
- Shows if relationship makes practical sense

Linear Regression Fit

Linear regression fit to Olympic marathon men’s times using statsmodels.

1904 St. Louis Olympics: Major outlier
- Explains non-normal residuals
- Contributes to right skew

Data Regimes

Three distinct regimes visible:
- Pre-WWI: Rapid improvement
- War years: Disrupted progress
- Post-WWII: Steady improvement

Model Improvements

Model improvements possible with extra features:
- Polynomial terms
- Period indicators
- Interaction terms
- External factors

Design Matrix

The design matrix \(\mathbf{X}\) stores our features
Each row represents one data point
Each column represents one feature
For \(n\) data points and \(p\) features: \[\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}\]
Also called the feature matrix or model matrix
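
As an illustration, a design matrix with polynomial features can be built as below. This is a sketch assuming NumPy; the helper name `polynomial_design_matrix` is hypothetical and is not the code used for the fits in the lecture.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Return a matrix with one row per data point and columns 1, x, x^2, ..., x^degree."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

# Three data points, quadratic features: rows are points, columns are features.
Phi = polynomial_design_matrix([1.0, 2.0, 3.0], degree=2)
```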

Augmented Features with Interactions Regression Fit

Polynomial regression fit to Olympic marathon men’s times using statsmodels.

Logistic Regression and GLMs

Modelling entire density allows any question to be answered (also missing data).
Comes at the possible expense of strong assumptions about data generation distribution.
In regression we model probability of \(y_i |\mathbf{ x}_i\) directly.
- Allows less flexibility in the question, but more flexibility in the model assumptions.
Can do this not just for regression, but classification.
Framework is known as generalized linear models.

Log Odds

Model the log-odds with the basis functions.
Odds are defined as the ratio of the probability of a positive outcome to the probability of a negative outcome.
Probability is between zero and one, odds are: \[ \frac{\pi}{1-\pi} \]
Odds are between \(0\) and \(\infty\).
Logarithm of odds maps them to \(-\infty\) to \(\infty\).
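
A small sketch of these mappings using only the standard library. The function names are illustrative.

```python
import math

def odds(pi):
    """Odds of a positive outcome: pi / (1 - pi), defined for 0 <= pi < 1."""
    return pi / (1.0 - pi)

def log_odds(pi):
    """Log-odds map probabilities in (0, 1) onto the whole real line."""
    return math.log(odds(pi))

# Probabilities below a half give negative log-odds, above a half positive.
values = [log_odds(p) for p in (0.2, 0.5, 0.8)]
```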

Logit Link Function

The logit function, \[g^{-1}(\pi_i) = \log\frac{\pi_i}{1-\pi_i},\] is known as a link function.
For a standard regression we take, \[f(\mathbf{ x}_i) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i),\]
For classification we perform a logistic regression. \[\log \frac{\pi_i}{1-\pi_i} = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\]

Inverse Link Function

We have defined the link function as taking the form \(g^{-1}(\cdot)\) implying that the inverse link function is given by \(g(\cdot)\). Since we have defined, \[ g^{-1}(\pi(\mathbf{ x})) = \mathbf{ w}^\top\boldsymbol{ \phi}(\mathbf{ x}) \] we can write \(\pi\) in terms of the inverse link function, \(g(\cdot)\) as \[ \pi(\mathbf{ x}) = g(\mathbf{ w}^\top\boldsymbol{ \phi}(\mathbf{ x})). \]

Logistic function

Logistic (or sigmoid) squashes real line to between 0 & 1. Sometimes also called a ‘squashing function’.
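
A sketch of the logistic function and its inverse, the logit, using only the standard library. The branch on the sign of `f` is an implementation choice made here to avoid overflow in `exp` for large negative inputs; it is not part of the definition.

```python
import math

def logistic(f):
    """Inverse link g(f) = 1 / (1 + exp(-f)), written to avoid overflow."""
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    z = math.exp(f)  # f < 0, so exp(f) cannot overflow
    return z / (1.0 + z)

def logit(pi):
    """Link function: log-odds of pi, for 0 < pi < 1."""
    return math.log(pi / (1.0 - pi))
```

Applying `logit` to the output of `logistic` recovers the original input, up to floating point error.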

Basis Function

Prediction Function

Can now write \(\pi\) as a function of the input and the parameter vector as, \[\pi(\mathbf{ x},\mathbf{ w}) = \frac{1}{1+ \exp\left(-\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x})\right)}.\]
Compute the output of a standard linear basis function composition (\(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x})\), as we did for linear regression)
Apply the inverse link function, \(g(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}))\).
Use this value in a Bernoulli distribution to form the likelihood.
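
The first two steps can be sketched as follows, assuming NumPy. The basis `[1, x]` and the weight values are hypothetical choices for illustration only.

```python
import numpy as np

def basis(x):
    """Hypothetical basis for illustration: a bias term and the raw input."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x])

def predict_proba(w, Phi):
    """pi(x, w) = 1 / (1 + exp(-w^T phi(x))) for every row of Phi."""
    f = Phi @ w                      # linear basis function composition
    return 1.0 / (1.0 + np.exp(-f))  # inverse link function

w = np.array([0.0, 1.0])
pi = predict_proba(w, basis([-2.0, 0.0, 2.0]))
```

Each entry of `pi` is then the parameter of a Bernoulli distribution for the corresponding data point.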

Bernoulli Reminder

From last time \[P(y_i|\mathbf{ w}, \mathbf{ x}) = \pi_i^{y_i} (1-\pi_i)^{1-y_i}\]
Trick for switching between probabilities

def bernoulli(y, pi):
    # Probability of a binary outcome y when P(y=1) = pi.
    if y == 1:
        return pi
    else:
        return 1-pi
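
A quick check that the exponent form \(\pi^{y}(1-\pi)^{1-y}\) gives the same values as the if/else version. The function names `bernoulli_if` and `bernoulli_power` are illustrative.

```python
def bernoulli_if(y, pi):
    """Branching version: pick pi or 1 - pi depending on y."""
    return pi if y == 1 else 1 - pi

def bernoulli_power(y, pi):
    """Exponent version: y in {0, 1} switches one of the two factors off."""
    return pi**y * (1 - pi)**(1 - y)

# Largest disagreement between the two forms over a few test values.
max_diff = max(
    abs(bernoulli_if(y, pi) - bernoulli_power(y, pi))
    for y in (0, 1)
    for pi in (0.1, 0.5, 0.9)
)
```

The exponent form is convenient because it has no branch, so taking logs gives a sum that is easy to differentiate.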

Maximum Likelihood

Conditional independence of data: \[P(\mathbf{ y}|\mathbf{ w}, \mathbf{X}) = \prod_{i=1}^nP(y_i|\mathbf{ w}, \mathbf{ x}_i). \]

Log Likelihood

\[\begin{align*} \log P(\mathbf{ y}|\mathbf{ w}, \mathbf{X}) = & \sum_{i=1}^n\log P(y_i|\mathbf{ w}, \mathbf{ x}_i) \\ = &\sum_{i=1}^ny_i \log \pi_i \\ & + \sum_{i=1}^n(1-y_i)\log (1-\pi_i) \end{align*}\]

Objective Function

Probability of positive outcome for the \(i\)th data point \[\pi_i = g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right),\] where \(g(\cdot)\) is the inverse link function
Objective function of the form \[\begin{align*} E(\mathbf{ w}) = & - \sum_{i=1}^ny_i \log g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right) \\& - \sum_{i=1}^n(1-y_i)\log \left(1-g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right)\right). \end{align*}\]
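
A sketch of this objective in NumPy, assuming a design matrix `Phi` whose rows are \(\boldsymbol{ \phi}(\mathbf{ x}_i)\) and a 0/1 label vector `y`. The clipping constant `eps` is an assumption added here to guard against `log(0)`; it is not part of the definition.

```python
import numpy as np

def neg_log_likelihood(w, Phi, y, eps=1e-12):
    """E(w) = -sum y_i log g(f_i) - sum (1 - y_i) log(1 - g(f_i))."""
    pi = 1.0 / (1.0 + np.exp(-(Phi @ w)))  # pi_i = g(w^T phi(x_i))
    pi = np.clip(pi, eps, 1.0 - eps)       # numerical guard only
    return -np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

# With w = 0 every pi_i is 0.5, so E(w) = n log 2 whatever the labels are.
Phi = np.ones((4, 2))
y = np.array([0.0, 1.0, 1.0, 0.0])
E0 = neg_log_likelihood(np.zeros(2), Phi, y)
```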

Minimize Objective

Gradient with respect to \(\mathbf{ w}\), via the chain rule through \(\pi(\mathbf{ x};\mathbf{ w})\) \[\begin{align*} \frac{\text{d}E(\mathbf{ w})}{\text{d}\mathbf{ w}} = & -\sum_{i=1}^n\frac{y_i}{g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right)}\frac{\text{d}g(f_i)}{\text{d}f_i} \boldsymbol{ \phi}(\mathbf{ x}_i) \\ & + \sum_{i=1}^n \frac{1-y_i}{1-g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right)}\frac{\text{d}g(f_i)}{\text{d}f_i} \boldsymbol{ \phi}(\mathbf{ x}_i) \end{align*}\]

Link Function Gradient

Also need gradient of inverse link function wrt its argument, \(f_i = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\). \[\begin{align*} g(f_i) &= \frac{1}{1+\exp(-f_i)}\\ &=(1+\exp(-f_i))^{-1} \end{align*}\] and the gradient can be computed as \[\begin{align*} \frac{\text{d}g(f_i)}{\text{d} f_i} & = \exp(-f_i)(1+\exp(-f_i))^{-2}\\ & = \frac{1}{1+\exp(-f_i)} \frac{\exp(-f_i)}{1+\exp(-f_i)} \\ & = g(f_i) (1-g(f_i)) \end{align*}\]
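
The identity \(\frac{\text{d}g}{\text{d}f} = g(f)(1-g(f))\) can be checked numerically with a central finite difference. This is a sketch using the standard library only; the step size `h` is an arbitrary choice.

```python
import math

def g(f):
    """Logistic inverse link."""
    return 1.0 / (1.0 + math.exp(-f))

def dg(f):
    """Analytic derivative from the derivation: g(f) (1 - g(f))."""
    return g(f) * (1.0 - g(f))

def finite_difference(func, f, h=1e-6):
    """Central difference approximation to the derivative of func at f."""
    return (func(f + h) - func(f - h)) / (2.0 * h)

# Largest gap between the analytic and numerical derivative at a few points.
max_error = max(abs(dg(f) - finite_difference(g, f)) for f in (-3.0, 0.0, 0.5, 2.0))
```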

Objective Gradient

\[\begin{align*} \frac{\text{d}E(\mathbf{ w})}{\text{d}\mathbf{ w}} = & -\sum_{i=1}^n y_i\left(1-g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right)\right) \boldsymbol{ \phi}(\mathbf{ x}_i) \\ & + \sum_{i=1}^n (1-y_i)\left(g\left(\mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\right)\right) \boldsymbol{ \phi}(\mathbf{ x}_i). \end{align*}\]
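
Since \(-y_i(1-g_i) + (1-y_i)g_i = g_i - y_i\), the two sums combine into \(\sum_{i=1}^n (g_i - y_i)\boldsymbol{ \phi}(\mathbf{ x}_i)\). The sketch below, assuming NumPy and random test data, compares this compact form against a finite-difference gradient of the objective. The helper names are illustrative.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def objective(w, Phi, y):
    """Negative log-likelihood E(w) of the logistic regression model."""
    pi = sigmoid(Phi @ w)
    return -np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

def gradient(w, Phi, y):
    """Compact gradient: sum_i (g_i - y_i) phi(x_i), written as Phi^T (g - y)."""
    return Phi.T @ (sigmoid(Phi @ w) - y)

def numerical_gradient(w, Phi, y, h=1e-6):
    """Central finite differences, one coordinate of w at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (objective(w + step, Phi, y) - objective(w - step, Phi, y)) / (2.0 * h)
    return grad

# Random test problem (an assumption for illustration).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.standard_normal(3)
analytic = gradient(w, Phi, y)
numeric = numerical_gradient(w, Phi, y)
```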

Optimization of the Function

Can’t find a stationary point of the objective function analytically.
Optimization has to proceed by numerical methods.
- Newton’s method or
- gradient based optimization methods
Similarly to matrix factorization, for large data stochastic gradient descent (the Robbins-Monro optimization procedure; Robbins and Monro, 1951) works well.
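
A minimal sketch of stochastic gradient descent for logistic regression, assuming NumPy and synthetic data. The step-size schedule, number of epochs and data-generating weights are illustrative assumptions, not values from the lecture. Each update uses the gradient contribution of a single data point, \((g_i - y_i)\boldsymbol{ \phi}(\mathbf{ x}_i)\).

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def sgd_logistic(Phi, y, learn_rate=0.1, epochs=50, seed=0):
    """Stochastic gradient descent on the logistic regression objective."""
    rng = np.random.default_rng(seed)
    n, p = Phi.shape
    w = np.zeros(p)
    for epoch in range(epochs):
        step = learn_rate / (1.0 + epoch)  # decaying step size (assumed schedule)
        for i in rng.permutation(n):       # visit data points in random order
            w -= step * (sigmoid(Phi[i] @ w) - y[i]) * Phi[i]
    return w

# Synthetic classification data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=500)
Phi = np.column_stack([np.ones_like(x), x])
w_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=500) < sigmoid(Phi @ w_true)).astype(float)

w_sgd = sgd_logistic(Phi, y)
accuracy = np.mean((sigmoid(Phi @ w_sgd) > 0.5) == (y == 1.0))
```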

Ad Matching for Facebook

This approach used in many internet companies.
Example: ad matching for Facebook.
- Millions of advertisers
- Billions of users
- How do you choose who to show what?
Logistic regression used in combination with decision trees
Paper available here

Going Further: Optimization

Other optimization techniques for generalized linear models include Newton’s method, which requires you to compute the Hessian, the matrix of second derivatives of the objective function.

Methods that are based on gradients only include L-BFGS and conjugate gradients. Can you find these in Python? Are they suitable for very large data sets?
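
One place to find both methods is `scipy.optimize.minimize`, via `method="L-BFGS-B"` and `method="CG"`. The sketch below assumes SciPy and NumPy are installed and uses synthetic data; the objective is the logistic regression negative log-likelihood, rewritten as \(\sum_i \log(1+\exp(f_i)) - y_i f_i\) for numerical stability.

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_gradient(w, Phi, y):
    """Return E(w) and dE/dw together, as expected when jac=True."""
    f = Phi @ w
    E = np.sum(np.logaddexp(0.0, f) - y * f)  # log(1 + exp(f)) - y f
    grad = Phi.T @ (1.0 / (1.0 + np.exp(-f)) - y)
    return E, grad

# Synthetic classification data (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
Phi = np.column_stack([np.ones_like(x), x])
w_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=1000) < 1.0 / (1.0 + np.exp(-(Phi @ w_true)))).astype(float)

w0 = np.zeros(2)
res_lbfgs = minimize(objective_and_gradient, w0, args=(Phi, y), jac=True, method="L-BFGS-B")
res_cg = minimize(objective_and_gradient, w0, args=(Phi, y), jac=True, method="CG")
```

Both calls evaluate the gradient over the full data set at every iteration, which is one reason stochastic methods are often preferred when the data set is very large.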

Other GLMs

Logistic regression is part of a family known as generalized linear models
They all take the form \[g^{-1}(f_i(x)) = \mathbf{ w}^\top \boldsymbol{ \phi}(\mathbf{ x}_i)\]
Other examples include Poisson regression.

Poisson Distribution

Poisson distribution is used for ‘count data’. For non-negative integers, \(y\), \[P(y) = \frac{\lambda^y}{y!}\exp(-\lambda)\]
Here \(\lambda\) is a rate parameter that can be thought of as the number of arrivals per unit time.
Poisson distributions can be used for disease count data, e.g. the number of cases of malaria in a district.
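
A sketch of the Poisson probability mass function, \(P(y) = \frac{\lambda^y}{y!}\exp(-\lambda)\), using only the standard library. The rate value of 3 is an arbitrary choice for illustration.

```python
import math

def poisson_pmf(y, lam):
    """P(y) = lam^y / y! * exp(-lam) for a non-negative integer y."""
    return lam**y / math.factorial(y) * math.exp(-lam)

lam = 3.0
# Truncating at 50 counts leaves a negligible tail for this rate.
total = sum(poisson_pmf(y, lam) for y in range(50))
mean = sum(y * poisson_pmf(y, lam) for y in range(50))
```

The probabilities sum to one and the mean equals the rate \(\lambda\), up to truncation and floating point error.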

Poisson Distribution

Poisson Regression

In a Poisson regression make rate a function of space/time. \[\log \lambda(\mathbf{ x}, t) = \mathbf{ w}_x^\top \boldsymbol{ \phi}_x(\mathbf{ x}) + \mathbf{ w}_t^\top \boldsymbol{ \phi}_t(t)\]
This is known as a log linear or log additive model.
The link function is a logarithm.
We can rewrite such a function as \[\log \lambda(\mathbf{ x}, t) = f_x(\mathbf{ x}) + f_t(t)\]
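
A minimal sketch of fitting a Poisson regression with a log link, assuming NumPy and synthetic data. It uses a single input with basis `[1, x]` rather than separate space and time terms, and Newton's method as the optimizer; both are simplifications made here for illustration. The negative log-likelihood, up to a constant, is \(\sum_i \lambda_i - y_i \log \lambda_i\) with \(\lambda_i = \exp(\mathbf{ w}^\top\boldsymbol{ \phi}(\mathbf{ x}_i))\).

```python
import numpy as np

def fit_poisson_newton(Phi, y, iterations=25):
    """Newton's method for Poisson regression with a log link."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iterations):
        lam = np.exp(Phi @ w)               # rate for each data point
        grad = Phi.T @ (lam - y)            # gradient of the negative log-likelihood
        hess = Phi.T @ (lam[:, None] * Phi) # Hessian: Phi^T diag(lam) Phi
        w -= np.linalg.solve(hess, grad)
    return w

# Synthetic count data (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
Phi = np.column_stack([np.ones_like(x), x])
w_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(Phi @ w_true))

w_hat = fit_poisson_newton(Phi, y)
final_grad = Phi.T @ (np.exp(Phi @ w_hat) - y)
```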

Multiplicative Model

Be careful though … a log additive model is really multiplicative. \[\log \lambda(\mathbf{ x}, t) = f_x(\mathbf{ x}) + f_t(t)\]
Becomes \[\lambda(\mathbf{ x}, t) = \exp(f_x(\mathbf{ x}) + f_t(t))\]
Which is equivalent to \[\lambda(\mathbf{ x}, t) = \exp(f_x(\mathbf{ x}))\exp(f_t(t))\]
Link functions can be deceptive in this way.
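
A small numerical illustration using the standard library, with made-up values for \(f_x\) and \(f_t\). Adding a fixed amount to \(f_t\) multiplies the rate by the same factor whatever the value of \(f_x\).

```python
import math

def rate(f_x, f_t):
    """lambda = exp(f_x + f_t) = exp(f_x) * exp(f_t)."""
    return math.exp(f_x + f_t)

# Increase f_t by 0.5 at two different values of f_x (illustrative numbers).
ratio_low = rate(0.0, 0.5) / rate(0.0, 0.0)
ratio_high = rate(3.0, 0.5) / rate(3.0, 0.0)

# The absolute change in rate, by contrast, depends strongly on f_x.
diff_low = rate(0.0, 0.5) - rate(0.0, 0.0)
diff_high = rate(3.0, 0.5) - rate(3.0, 0.0)
```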

Google Trends Search Queries

Google Trends Search Queries Predictions

Synthetic Example

Poisson Regression Diagnostics

Practical Tips

Feature engineering is critical
Build modular pipelines to test features
Consider interactions between variables
Document your process carefully

Model Validation

Use cross-validation wisely
Bootstrap for uncertainty
Keep hold-out test sets
Beware of temporal leakage

Diagnostic Checks

Plot residuals systematically
Check for non-linear patterns
Identify influential points
Test feature relationships

Visualization

Always plot raw data first
Create diagnostic visualizations
Check model assumptions
Communicate results clearly

Thanks!

References

Robbins, H., Monro, S., 1951. A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.

Rogers, S., Girolami, M., 2011. A first course in machine learning. CRC Press.
