Week 5: Visualisation II: Further Dimensionality Reduction
Abstract:
Building on our understanding of discrete and continuous latent variables, this lecture explores modern approaches to dimensionality reduction. We examine the limitations of linear methods like PCA, introduce powerful nonlinear techniques like t-SNE, and develop practical guidelines for choosing and implementing these methods. The lecture emphasizes the importance of understanding when methods preserve local versus global structure and how this affects their application.
Setup
notutils
This small package is a helper package for various notebook utilities used below.
The software can be installed using
%pip install notutils
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/notutils
Once notutils
is installed, it can be imported in the
usual manner.
import notutils
pods
In Sheffield we created a suite of software tools for ‘Open Data Science’. Open data science is an approach to sharing code, models and data that should make it easier for companies, health professionals and scientists to gain access to data science techniques.
You can also check this blog post on Open Data Science.
The software can be installed using
%pip install pods
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/ods
Once pods
is installed, it can be imported in the usual
manner.
import pods
mlai
The mlai
software is a suite of helper functions for
teaching and demonstrating machine learning algorithms. It was first
used in the Machine Learning and Adaptive Intelligence course in
Sheffield in 2013.
The software can be installed using
%pip install mlai
from the command prompt where you can access your python installation.
The code is also available on GitHub: https://github.com/lawrennd/mlai
Once mlai
is installed, it can be imported in the usual
manner.
import mlai
from mlai import plot
Part 1: Linear PCA
Principal Component Analysis
Principal Component Analysis (PCA) is one of the most fundamental and widely used dimensionality reduction techniques. While commonly credited to Pearson (1901), who was interested in finding “lines and planes of closest fit to systems of points in space”, the method as we will review it today was introduced and named by Hotelling (1933). The approach is often used as a way of looking for correlations, but Hotelling’s motivation was a principled alternative to Spearman’s factor analysis (Spearman, 1904).
Often PCA is introduced as a method for finding directions of maximum variance in high-dimensional data, and this is the interpretation due to Pearson (1901), but philosophically these approaches are different even though they turn out to be identical mathematically. In a very real sense “many models lead to the PCA algorithm”. But these equivalences only hold when a linear interpretation is sought. Nonlinear extensions of these ideas (maximum variance directions, eigenvalue problems, latent variable models) all lead to different algorithms. However, since the linear algorithm has so many interpretations it is a wise place to begin analysis.
The mathematical foundation of PCA relies on analyzing the sample covariance matrix of the data. For a dataset with \(n\) points, this matrix is given by:
\[ \mathbf{S}=\frac{1}{n}\sum_{i=1}^n\left(\mathbf{ y}_{i, :}-\boldsymbol{ \mu}\right)\left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}\right)^\top \]
The modern interpretation focuses on finding a set of orthogonal directions (principal components) along which the data varies the most, but Hotelling’s original formulation was derived from the idea of latent variables. The optimization problem we solve today, which maximizes the variance along each direction while maintaining orthogonality constraints, arises as the solution of the original latent variable formulation (Tipping and Bishop, 1999a).
import numpy as np
# Generate correlated 2D Gaussian samples
num_points = 200
mean = np.array([0, 0])
cov = np.array([[2.0, 1.8],
                [1.8, 2.0]])
X = np.random.multivariate_normal(mean, cov, num_points)
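As a quick check on the sample covariance formula above, here is a minimal sketch (using the X just generated) that computes \(\mathbf{S}\) directly from its definition and compares it with numpy’s built-in estimate.
# Sample covariance S = (1/n) * sum_i (y_i - mu)(y_i - mu)^T
mu = X.mean(axis=0)
X_cent = X - mu
S = X_cent.T @ X_cent / X.shape[0]
print(S)
# Should closely match numpy's biased covariance estimate
print(np.cov(X, rowvar=False, bias=True))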
Probabilistic PCA
This linear relationship between the observed data and the latent variables is at the heart of Hotelling’s original formulation of PCA. It explicitly models the idea that the observed high-dimensional data is generated from a lower-dimensional set of latent variables, with some added noise. This perspective aligns PCA more closely with factor analysis and highlights its nature as a latent variable model.
Probabilistic PCA (PPCA) provides a probabilistic interpretation of PCA by explicitly modeling the generative process that creates the observed data. The model assumes that each high-dimensional data point \(\mathbf{ y}_{i,:}\) is generated from a lower-dimensional latent variable \(\mathbf{ z}_{i,:}\) through a linear transformation with added Gaussian noise: \[ \mathbf{ y}_{i,:} = \mathbf{W}\mathbf{ z}_{i,:} + \boldsymbol{ \epsilon}_{i,:} \] where \(\boldsymbol{ \epsilon}_{i,:} \sim \mathcal{N}\left(\mathbf{0},\sigma^2\mathbf{I}\right)\) represents isotropic Gaussian noise. The model places a standard Gaussian prior over the latent variables: \[ p(\mathbf{ z}_{i,:}) = \mathcal{N}\left(\mathbf{ z}_{i,:}|\mathbf{0},\mathbf{I}\right) \] Given these assumptions, the conditional probability of observing a data point given its latent representation is: \[ p(\mathbf{ y}_{i,:}|\mathbf{ z}_{i,:},\mathbf{W}) = \mathcal{N}\left(\mathbf{ y}_{i,:}|\mathbf{W}\mathbf{ z}_{i,:},\sigma^2\mathbf{I}\right) \] By integrating out the latent variables, we obtain the marginal likelihood of the data: \[ p(\mathbf{ y}_{i,:}|\mathbf{W}) = \mathcal{N}\left(\mathbf{ y}_{i,:}|\mathbf{0},\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}\right) \] This probabilistic formulation, developed by Tipping and Bishop (Tipping and Bishop, 1999a), provides a principled framework that not only recovers classical PCA as a special case when \(\sigma^2 \to 0\), but also enables extensions like handling missing data and mixture models.
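To make the generative story concrete, here is a minimal sketch that samples synthetic data from the PPCA model: draw latent variables, map them through \(\mathbf{W}\), and add isotropic noise. The dimensions, variable names and random seed are illustrative choices, not part of the original text.
import numpy as np

rng = np.random.default_rng(0)

n, p, q = 1000, 5, 2               # points, data dimension, latent dimension
W_true = rng.normal(size=(p, q))   # linear map from latent to data space
sigma2_true = 0.1                  # isotropic noise variance

Z = rng.normal(size=(n, q))                                   # z_i ~ N(0, I)
noise = rng.normal(scale=np.sqrt(sigma2_true), size=(n, p))   # eps_i ~ N(0, sigma^2 I)
Y = Z @ W_true.T + noise                                      # y_i = W z_i + eps_i

# The marginal covariance of Y should approach W W^T + sigma^2 I
S_sample = np.cov(Y, rowvar=False, bias=True)
print(np.abs(S_sample - (W_true @ W_true.T + sigma2_true*np.eye(p))).max())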
Probabilistic PCA
In 1997 Tipping and Bishop (Tipping and Bishop, 1999b) and Roweis (Roweis, n.d.) independently revisited Hotelling’s model and considered the case where the noise variance was finite, but shared across all output dimensions. Their model can be thought of as a factor analysis where \[ \boldsymbol{\Sigma} = \sigma^2 \mathbf{I}. \] This leads to a marginal likelihood of the form \[ p(\mathbf{Y}|\mathbf{W}, \sigma^2) = \prod_{i=1}^n\mathcal{N}\left(\mathbf{ y}_{i, :}|\mathbf{0},\mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I}\right) \] where the limit of \(\sigma^2\rightarrow 0\) is not taken. This defines a proper probabilistic model. Tipping and Bishop then went on to prove that the maximum likelihood solution of this model with respect to \(\mathbf{W}\) is given by an eigenvalue problem. In the probabilistic PCA case the eigenvalues and eigenvectors are given as follows, \[ \mathbf{W}= \mathbf{U}\mathbf{L} \mathbf{R}^\top \] where \(\mathbf{U}\) contains the eigenvectors of the empirical covariance matrix \[ \mathbf{S} = \frac{1}{n}\sum_{i=1}^n(\mathbf{ y}_{i, :} - \boldsymbol{ \mu})(\mathbf{ y}_{i,:} - \boldsymbol{ \mu})^\top, \] which can be written \(\mathbf{S} = \frac{1}{n} \mathbf{Y}^\top\mathbf{Y}\) if the data is zero mean. The matrix \(\mathbf{L}\) is diagonal and depends on the eigenvalues of \(\mathbf{S}\), \(\boldsymbol{\Lambda}\). If the \(i\)th diagonal element of this matrix is given by \(\lambda_i\) then the corresponding element of \(\mathbf{L}\) is \[ \ell_i = \sqrt{\lambda_i - \sigma^2} \] where \(\sigma^2\) is the noise variance. Note that if \(\sigma^2\) is larger than any particular eigenvalue, then that eigenvalue (along with its corresponding eigenvector) is discarded from the solution.
PPCA as Manifold Learning
Probabilistic PCA can be viewed as a simple manifold learning algorithm. It assumes that:
- The data lies near a linear manifold (subspace) of the high-dimensional space
- The deviation from this manifold is Gaussian noise
- The intrinsic dimensionality is specified by the number of retained components
This view helps explain why PPCA works well when the manifold hypothesis holds and the manifold is approximately linear. When the manifold is nonlinear, we need more sophisticated methods like kernel PCA or neural network-based approaches.
Python Implementation of Probabilistic PCA
We will now implement this algorithm in Python.
import numpy as np
# probabilistic PCA algorithm
def ppca(Y, q):
    # remove mean
    Y_cent = Y - Y.mean(0)

    # Compute covariance
    S = np.dot(Y_cent.T, Y_cent)/Y.shape[0]
    # Use eigh for the symmetric matrix; it returns eigenvalues in ascending
    # order, so reorder them (and the eigenvectors) to be descending
    lambd, U = np.linalg.eigh(S)
    order = np.argsort(lambd)[::-1]
    lambd, U = lambd[order], U[:, order]

    # Estimate the noise variance from the discarded eigenvalues
    sigma2 = np.sum(lambd[q:])/(Y.shape[1]-q)
    # Scale the retained eigenvectors to give W
    l = np.sqrt(lambd[:q]-sigma2)
    W = U[:, :q]*l[None, :]
    return W, sigma2
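A quick usage sketch, assuming the synthetic Y sampled from the PPCA model above: fit a two-dimensional model and check that the implied covariance \(\mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}\) is close to the sample covariance.
# Fit a 2-dimensional probabilistic PCA model to the synthetic data
W_est, sigma2_est = ppca(Y, 2)
print("estimated noise variance:", sigma2_est)

# The model covariance W W^T + sigma^2 I should approximate the sample covariance
S_model = W_est @ W_est.T + sigma2_est*np.eye(Y.shape[1])
S_sample = np.cov(Y - Y.mean(0), rowvar=False, bias=True)
print("max discrepancy:", np.abs(S_model - S_sample).max())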
In practice we may not wish to compute the eigenvectors of the covariance matrix directly. This is because it requires us to estimate the covariance, which involves a sum of squares term, before estimating the eigenvectors. We can estimate the eigenvectors directly either through QR decomposition or singular value decomposition. We saw a similar issue arise for linear regression, where we also wished to avoid computation of \(\mathbf{Z}^\top\mathbf{Z}\) (or, in the basis function case, of \(\boldsymbol{\Phi}^\top\boldsymbol{\Phi}\)).
Posterior for Principal Component Analysis
Under the latent variable model justification for principal component analysis, we are normally interested in inferring something about the latent variables given the data. This is the distribution, \[ p(\mathbf{ z}_{i, :} | \mathbf{ y}_{i, :}) \] for any given data point. Determining this density turns out to be very similar to the approach for determining the Bayesian posterior of \(\mathbf{ w}\) in Bayesian linear regression, only this time we place the prior density over \(\mathbf{ z}_{i, :}\) instead of \(\mathbf{ w}\). The posterior is proportional to the joint density as follows, \[ p(\mathbf{ z}_{i, :} | \mathbf{ y}_{i, :}) \propto p(\mathbf{ y}_{i, :}|\mathbf{W}, \mathbf{ z}_{i, :}, \sigma^2) p(\mathbf{ z}_{i, :}) \] And as in the Bayesian linear regression case we first consider the log posterior, \[ \log p(\mathbf{ z}_{i, :} | \mathbf{ y}_{i, :}) = \log p(\mathbf{ y}_{i, :}|\mathbf{W}, \mathbf{ z}_{i, :}, \sigma^2) + \log p(\mathbf{ z}_{i, :}) + \text{const} \] where the constant is not dependent on \(\mathbf{ z}\). As before we collect the quadratic terms in \(\mathbf{ z}_{i, :}\) and we assemble them into a Gaussian density over \(\mathbf{ z}\). \[ \log p(\mathbf{ z}_{i, :} | \mathbf{ y}_{i, :}) = -\frac{1}{2\sigma^2} (\mathbf{ y}_{i, :} - \mathbf{W}\mathbf{ z}_{i, :})^\top(\mathbf{ y}_{i, :} - \mathbf{W}\mathbf{ z}_{i, :}) - \frac{1}{2} \mathbf{ z}_{i, :}^\top \mathbf{ z}_{i, :} + \text{const} \]
Exercise 1
Multiply out the terms in the brackets. Then collect the quadratic term and the linear terms together. Show that the posterior has the form \[ \mathbf{ z}_{i, :} | \mathbf{ y}_{i, :}, \mathbf{W}\sim \mathcal{N}\left(\boldsymbol{ \mu}_x,\mathbf{C}_x\right) \] where \[ \mathbf{C}_x = \left(\sigma^{-2} \mathbf{W}^\top\mathbf{W}+ \mathbf{I}\right)^{-1} \] and \[ \boldsymbol{ \mu}_x = \mathbf{C}_x \sigma^{-2}\mathbf{W}^\top \mathbf{ y}_{i, :} \] Compare this to the posterior for the Bayesian linear regression from last week. Do they have similar forms? What matches and what differs?
Python Implementation of the Posterior
Now let’s implement the system in code.
Exercise 2
Use the values for \(\mathbf{W}\) and \(\sigma^2\) you have computed, along with the data set \(\mathbf{Y}\), to compute the posterior density over \(\mathbf{Z}\). Write a function of the form
mu_x, C_x = posterior(Y, W, sigma2)
where mu_x and C_x are the posterior mean and posterior covariance for the given \(\mathbf{Y}\).
Don’t forget to subtract the mean of the data Y inside your function before computing the posterior: remember we assumed at the beginning of our analysis that the data had been centred (i.e. the mean was removed).
Numerically Stable and Efficient Version
Just as we saw for linear regression, computation of a matrix such as \(\mathbf{Y}^\top\mathbf{Y}\) (or its centred version) can be a bad idea in terms of loss of numerical accuracy. Fortunately, we can find the eigenvalues and eigenvectors of the matrix \(\mathbf{Y}^\top\mathbf{Y}\) without direct computation of the matrix. This can be done with the singular value decomposition. The singular value decomposition takes a matrix, \(\mathbf{Z}\), and represents it in the form, \[ \mathbf{Z} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top \] where \(\mathbf{U}\) is a matrix with orthonormal columns, meaning \(\mathbf{U}^\top\mathbf{U} = \mathbf{I}\). It has the same number of rows and columns as \(\mathbf{Z}\). The matrices \(\boldsymbol{\Lambda}\) and \(\mathbf{V}\) are both square with dimensionality given by the number of columns of \(\mathbf{Z}\). The matrix \(\boldsymbol{\Lambda}\) is diagonal and \(\mathbf{V}\) is an orthogonal matrix so \(\mathbf{V}^\top\mathbf{V} = \mathbf{V}\mathbf{V}^\top = \mathbf{I}\). The eigenvalues of the matrix \(\mathbf{Y}^\top\mathbf{Y}\) are then given by the squared singular values of \(\mathbf{Y}^\top\), and the eigenvectors are given by \(\mathbf{U}\).
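A minimal numerical check of this relationship (the array shapes and seed here are illustrative): the squared singular values of \(\mathbf{Y}^\top\) match the eigenvalues of \(\mathbf{Y}^\top\mathbf{Y}\), while the decomposition itself never requires the product to be formed.
import numpy as np

rng = np.random.default_rng(1)
Y_check = rng.normal(size=(100, 4))
Y_check = Y_check - Y_check.mean(0)

# Eigenvalues via the explicit product (what we want to avoid in practice)
eigvals = np.sort(np.linalg.eigvalsh(Y_check.T @ Y_check))[::-1]

# Squared singular values of Y^T, computed without forming Y^T Y
U, s, Vt = np.linalg.svd(Y_check.T, full_matrices=False)
print(np.allclose(eigvals, s**2))   # True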
Solution for \(\mathbf{W}\)
Given the singular value decomposition of \(\mathbf{Y}\) then we have \[ \mathbf{W}= \mathbf{U}\mathbf{L}\mathbf{R}^\top \] where \(\mathbf{R}\) is an arbitrary rotation matrix. This implies that the posterior is given by \[ \mathbf{C}_x = \left[\sigma^{-2}\mathbf{R}\mathbf{L}^2\mathbf{R}^\top + \mathbf{I}\right]^{-1} \] because \(\mathbf{U}^\top \mathbf{U} = \mathbf{I}\). Since, by convention, we normally take \(\mathbf{R} = \mathbf{I}\) to ensure that the principal components are orthonormal we can write \[ \mathbf{C}_x = \left[\sigma^{-2}\mathbf{L}^2 + \mathbf{I}\right]^{-1} \] which implies that \(\mathbf{C}_x\) is actually diagonal with elements given by \[ c_i = \frac{\sigma^2}{\sigma^2 + \ell^2_i} \] and allows us to write \[ \boldsymbol{ \mu}_x = [\mathbf{L}^2 + \sigma^2 \mathbf{I}]^{-1} \mathbf{L} \mathbf{U}^\top \mathbf{ y}_{i, :} \] \[ \boldsymbol{ \mu}_x = \mathbf{D}\mathbf{U}^\top \mathbf{ y}_{i, :} \] where \(\mathbf{D}\) is a diagonal matrix with diagonal elements given by \(d_{i} = \frac{\ell_i}{\sigma^2 + \ell_i^2}\).
import scipy as sp
import numpy as np
# probabilistic PCA algorithm using SVD
def ppca(Y, q, center=True):
    """Probabilistic PCA through singular value decomposition"""
    # remove mean
    if center:
        Y_cent = Y - Y.mean(0)
    else:
        Y_cent = Y

    # Compute singular values, discard 'R' as we will assume orthogonal
    U, sqlambd, _ = sp.linalg.svd(Y_cent.T, full_matrices=False)
    lambd = (sqlambd**2)/Y.shape[0]
    # Compute residual variance and extract eigenvectors
    sigma2 = np.sum(lambd[q:])/(Y.shape[1]-q)
    ell = np.sqrt(lambd[:q]-sigma2)
    return U[:, :q], ell, sigma2

def posterior(Y, U, ell, sigma2, center=True):
    """Posterior computation for the latent variables given the eigendecomposition."""
    if center:
        Y_cent = Y - Y.mean(0)
    else:
        Y_cent = Y
    # Diagonal posterior covariance: c_i = sigma^2 / (sigma^2 + ell_i^2)
    C_x = np.diag(sigma2/(sigma2 + ell**2))
    # Posterior mean: mu_x = D U^T y with d_i = ell_i / (sigma^2 + ell_i^2)
    d = ell/(sigma2 + ell**2)
    mu_x = np.dot(Y_cent, U)*d[None, :]
    return mu_x, C_x
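A usage sketch of the SVD-based functions, again assuming the synthetic Y sampled earlier in this notebook:
# Eigenvectors, scalings and noise variance via the SVD route
U_est, ell_est, sigma2_est = ppca(Y, 2)
# Posterior over the latent coordinates of every data point
mu_x, C_x = posterior(Y, U_est, ell_est, sigma2_est)

print(mu_x.shape)     # one 2-dimensional posterior mean per data point
print(np.diag(C_x))   # shared diagonal posterior covariance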
Scikit-learn implementation of PCA
We’ve implemented PCA as part of supporting the learning process, but
in practice we can use the scikit-learn
implementation.
Let’s try it on the oil flow data.
Oil Flow Data
This data set is from a physics-based simulation of oil flow in a pipeline. The data was generated as part of a project to determine the fraction of oil, water and gas in North Sea oil pipes (Bishop:gtm96?).
import pods
data = pods.datasets.oil()
The data consists of 1000 12-dimensional observations of simulated oil flow in a pipeline. Each observation is labelled according to the multi-phase flow configuration (homogeneous, annular or laminar).
# Convert data["Y"] from [1, -1, -1] in each row to rows of 0 or 1 or 2
Y = data["Y"]
# Find rows with 1 in first column (class 0)
class0 = (Y[:, 0] == 1).astype(int) * 0
# Find rows with 1 in second column (class 1)
class1 = (Y[:, 1] == 1).astype(int) * 1
# Find rows with 1 in third column (class 2)
class2 = (Y[:, 2] == 1).astype(int) * 2
# Combine into single array of class labels 0,1,2
labels = class0 + class1 + class2
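If you prefer, the same conversion can be written as a single line; this assumes the one-of-three encoding described in the comment above, where each row contains a single 1.
# Equivalent one-liner: the column index of the 1 in each row is the class label
labels = np.argmax(data["Y"], axis=1)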
The data is returned as a dictionary containing training and test inputs (‘X’, ‘Xtst’), training and test labels (‘Y’, ‘Ytst’), and the names of the features.
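A quick way to confirm the structure of the returned dictionary (the keys listed above are assumed):
# Inspect the returned dictionary
print(data.keys())
print(data['X'].shape, data['Y'].shape)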
As normal we include the citation information for the data.
print(data['citation'])
And extra information about the data is included, as standard, under
the keys info
and details
.
print(data['details'])
X = data['X']
Y = data['Y']
%pip install scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_pca = pca.transform(X)
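To visualise the result, a minimal matplotlib sketch (using the labels array constructed above) colours each projected point by its flow configuration:
import matplotlib.pyplot as plt

# Scatter the two retained components, coloured by flow configuration
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=10)
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
ax.set_title('Oil flow data: first two principal components')
plt.show()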