Week 4: Clustering and High Dimensions


Neil D. Lawrence

Abstract: In this lecture we turn to unsupervised learning. We look at clustering models and consider how data behaves in high dimensions.

ML Foundations Course Notebook Setup

We install some bespoke code for creating and saving plots as well as for loading data sets.

%%capture
%pip install notutils
%pip install git+https://github.com/lawrennd/ods.git
%pip install git+https://github.com/lawrennd/mlai.git
import notutils
import pods
import mlai
import mlai.plot as plot

Review

So far in our classes we have focussed on regression problems and generalised linear models. These are examples of supervised learning. We have considered the relationship between the likelihood and the objective function, and we have shown how we can find parameters by maximizing the likelihood (equivalent to minimizing the objective function). In this session we look at latent variables.

Clustering

Clustering is a common approach to data analysis, though we will not cover it in great depth in this course. The fundamental idea is to associate each data point \(\mathbf{ y}_{i, :}\) with one of \(k\) different discrete groups. This approach raises interesting questions - for instance, when clustering animals into groups, we might ask whether animal traits are truly discrete or continuous in nature. Similar questions arise when clustering political affiliations.

Humans seem to have a natural affinity for discrete clustering approaches. This makes clustering particularly useful when collaborating with biologists, who often think in terms of discrete categories. However, we should be mindful that this preference for discrete categories may sometimes oversimplify continuous variations in data.

There is a subtle but important distinction between clustering and vector quantisation. In true clustering, we typically expect to see reductions in data density between natural groups - essentially, gaps in the data that separate different clusters. This definition isn’t universally applied though, and vector quantization may partition data without requiring such density gaps. For our current discussion, we’ll treat them similarly, focusing on the common challenges they share: how to allocate points to groups and, more challengingly, how to determine the optimal number of groups.

Clustering methods associate data points with different labels that are allocated by the computer rather than provided by human annotators. This process is quite intuitive for humans - we naturally cluster our observations of the real world. For example, we cluster animals into groups like birds, mammals, and insects. While these labels can be provided by humans, they were originally invented through a clustering process. With computational clustering, we want to recreate that process of label invention.

When thinking about ideas, the Greek philosopher Plato considered the concept of Platonic ideals - the most quintessential version of a thing, like the most bird-like bird or chair-like chair. In clustering, we aim to define different categories by finding their Platonic ideals (cluster centers) and allocating each data point to its nearest center. This allows computers to form categorizations of data at scales too large for human processing.

\(k\)-means Clustering

To implement clustering computationally, we need to mathematically represent both our objects and cluster centres as vectors (\(\mathbf{ x}_i\) and \(\boldsymbol{ \mu}_j\) respectively) and define a notion of either similarity or distance between them. The distance function \(d_{ij} = f(\mathbf{ x}_i, \boldsymbol{ \mu}_j)\) measures how far each object is from potential cluster centres. For example, we might cluster customers by representing them through their purchase history and measuring their distance to different customer archetypes.

Squared Distance

A commonly used distance metric is the squared Euclidean distance: \(d_{ij} = \left\|\mathbf{ x}_i - \boldsymbol{ \mu}_j\right\|^2\). This metric appears frequently in machine learning - we saw it earlier measuring prediction errors in regression, and here it measures dissimilarity between data points and cluster centres.

Once we have decided on the distance or similarity function, we can decide on the number of cluster centres, \(k\). We find their location by allocating each centre a sub-set of the points and minimizing the sum of the squared errors, \[ E(\mathbf{M}) = \sum_{j=1}^{k} \sum_{i \in \mathbf{i}_j} \left\|\mathbf{ x}_i - \boldsymbol{ \mu}_j\right\|^2 \] where the notation \(\mathbf{i}_j\) represents the set of indices of the data points that have been allocated to the \(j\)th cluster, represented by the centre \(\boldsymbol{ \mu}_j\).
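
To make the objective concrete, here is a minimal NumPy sketch of how such an objective might be computed. The function names are illustrative, not part of the course code; `allocation` is assumed to be an integer vector giving the cluster index of each data point.

import numpy as np

def squared_distances(X, centres):
    """Return the n x k matrix of squared Euclidean distances to the centres."""
    diff = X[:, None, :] - centres[None, :, :]
    return np.sum(diff**2, axis=2)

def cluster_objective(X, centres, allocation):
    """Sum of squared errors between points and their allocated centres."""
    return np.sum((X - centres[allocation])**2)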

\(k\)-Means Clustering

One approach to minimizing this objective function is known as \(k\)-means clustering. It is simple and relatively quick to implement, but it is an initialization sensitive algorithm. Initialization is the process of choosing an initial set of parameters before optimization. For \(k\)-means clustering you need to choose an initial set of centres. In \(k\)-means clustering your final set of clusters is very sensitive to the initial choice of centres. For more technical details on \(k\)-means clustering you can watch a video of Alex Ihler introducing the algorithm here.

The \(k\)-means algorithm provides a straightforward approach to clustering data. It requires two key elements: a set of \(k\) cluster centres and a way to assign each data point to a cluster. The algorithm follows a simple iterative process:

  1. First, initialize cluster centres by randomly selecting \(k\) data points
  2. Assign each data point to its nearest cluster centre
  3. Update each cluster centre by computing the mean of all points assigned to it
  4. Repeat steps 2 and 3 until the cluster assignments stop changing

This process is intuitive and relatively easy to implement, though it comes with certain limitations.

The \(k\)-means algorithm works by minimizing an objective function that measures the sum of squared Euclidean distances between each point and its assigned cluster center. This objective function can be written mathematically as shown above, where \(\boldsymbol{ \mu}_{j, :}\) represents the mean of cluster \(j\).

It’s important to understand that while this algorithm will always converge to a minimum, this minimum is not guaranteed to be either global or unique. The optimization problem is non-convex, meaning there can be multiple local minima. Different initializations of the cluster centres can lead to different final solutions, which is one of the key challenges in applying \(k\)-means clustering in practice.
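
The code below uses helper functions from the mlai library. As a rough sketch of what an assignment/update step might look like under the hood (this is illustrative, not the mlai implementation):

import numpy as np

def kmeans_step(X, centres):
    """One assignment/update iteration of k-means."""
    # Assignment step: allocate each point to its nearest centre
    sq_dists = ((X[:, None, :] - centres[None, :, :])**2).sum(axis=2)
    allocation = sq_dists.argmin(axis=1)
    # Update step: move each centre to the mean of its allocated points
    new_centres = centres.copy()
    for j in range(centres.shape[0]):
        if np.any(allocation == j):
            new_centres[j] = X[allocation == j].mean(axis=0)
    return new_centres, allocation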

from mlai import generate_cluster_data
from mlai import dist2
from mlai import kmeans_assignments
from mlai import kmeans_update
from mlai import kmeans_objective
import numpy as np
# Generate synthetic data with cluster structure
np.random.seed(24)
Y = generate_cluster_data(n_points_per_cluster=30)

# Initialize cluster centres from first cluster
# Note: this is a *bad* initialisation so we can observe algorithm.
nclusters = 3
centres = Y[:nclusters, :].copy()
# Run k-means algorithm for several iterations
niters = 10
centres_store = [centres.copy()]  # Store initial centres
objective_store = [kmeans_objective(Y, centres)]  # Calculate initial objective

print(f"Initial objective: {objective_store[0]:.2f}")
print("Running k-means algorithm...")

for iteration in range(niters):
    centres, assignments = kmeans_update(Y, centres)
    objective = kmeans_objective(Y, centres)
    
    print(f"  Iteration {iteration+1}: objective = {objective:.2f}")
    
    # Store results for plotting
    centres_store.append(centres.copy())
    objective_store.append(objective)

print(f"Final objective: {objective_store[-1]:.2f}")
print(f"Objective improvement: {objective_store[0] - objective_store[-1]:.2f}")

Figure: Clustering with the \(k\)-means clustering algorithm.

Hierarchical Clustering

Hierarchical clustering builds a tree structure by iteratively merging the closest clusters. Unlike \(k\)-means, we don’t need to specify the number of clusters in advance. Instead, we start with each point as its own cluster and merge the closest pairs until we have one cluster containing all points. The result is a dendrogram - a tree that shows the complete merge history and allows us to extract any number of clusters by cutting the tree at different heights.

Linkage Criteria

The choice of linkage criterion determines how we measure distance between clusters. Single linkage uses the minimum distance between any two points in different clusters, making it sensitive to outliers. Complete linkage uses the maximum distance, creating more compact clusters. Average linkage provides a balance, while Ward linkage minimizes the increase in within-cluster variance, often producing more balanced cluster sizes.
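
If you want to explore how these criteria behave, scipy's hierarchy module implements all of them. A quick sketch on some random data (illustrative only):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(24)
X = np.random.randn(60, 2)

for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    # Cut each tree into three clusters and compare the resulting cluster sizes
    labels = fcluster(Z, t=3, criterion='maxclust')
    print(f"{method:>8}: cluster sizes {np.bincount(labels)[1:]}")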

Ward’s Criterion

Ward’s criterion minimises the increase in within-cluster sum of squares (\(E(\mathbf{M})\)) when merging two clusters. The selection criterion for merging is \[ \Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2, \] where \(\boldsymbol{ \mu}_i\) and \(\boldsymbol{ \mu}_j\) are the centroids of clusters \(i\) and \(j\), and \(n_i, n_j\) are their sizes. This formula comes from greedy minimisation of the within-cluster variance. The weighting by \(n_i\) and \(n_j\) scales the distance between cluster centroids by the cluster sizes: larger clusters are penalised more heavily for being far apart, encouraging the formation of compact, well-separated groups.

Ward’s method chooses the merge that results in the smallest possible increase in total within-cluster variance. This means that at each step, we’re making the locally optimal choice that preserves the most information about our data structure. The method is particularly effective when the underlying clusters are roughly spherical and well-separated. It creates compact, convex clusters. The weighting by cluster size (\(\frac{n_i n_j}{n_i + n_j}\)) ensures that we don’t prematurely merge large clusters with small outliers, leading to more balanced and interpretable results.

Mathematical Derivation

To understand why Ward’s method works well for spherical clusters, we can derive the relationship between cluster merging and within-cluster variance. The within-cluster sum of squares (\(E(\mathbf{M})\)) measures the total squared distance from each point to its cluster centroid, \[ E(\mathbf{M}) = \sum_{i=1}^k \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2. \] This is the same objective function that \(k\)-means clustering minimises. When we merge two clusters, we want to minimise the increase in this total variance.

Consider two clusters \(C_i\) and \(C_j\) with \(n_i\) and \(n_j\) points respectively, and centroids \(\boldsymbol{ \mu}_i\) and \(\boldsymbol{ \mu}_j\). When we merge them, the new centroid becomes \(\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}\).

The increase in \(E(\mathbf{M})\) is \(\Delta_{i,j} = E(\mathbf{M})_{\text{new}} - E(\mathbf{M})_{\text{original}}\). Let’s expand this step by step. For the new cluster we have \[ E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i \cup C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2. \]

Using the identity \(\|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 = \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 + \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 + 2(\mathbf{ x}- \boldsymbol{ \mu}_i)^T(\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij})\), we can expand the new \(E(\mathbf{M})\). The cross-term \(2(\mathbf{ x}- \boldsymbol{ \mu}_i)^T(\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij})\) sums to zero because \(\sum_{\mathbf{ x}\in C_i} (\mathbf{ x}- \boldsymbol{ \mu}_i) = 0\) by definition of centroid.

We can manipulate to find that \[ \Delta_{i,j} = n_i \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 + n_j \|\boldsymbol{ \mu}_j - \boldsymbol{ \mu}_{ij}\|^2. \] Substituting \(\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}\) and simplifying, we get the Ward distance formula \[ \Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2. \]

For spherical clusters, \(E(\mathbf{M})\) is minimised when points are as close as possible to their centroid. Ward’s method makes locally optimal choices by always choosing the merge that results in the smallest possible increase in total variance. However, this is only greedily optimal: it makes the best choice at each step without considering the global consequences. The weighting factor we derived, \(\frac{n_i n_j}{n_i + n_j}\), ensures that merging two large clusters is penalised more heavily than merging small clusters, preventing the algorithm from prematurely combining large, well-separated groups. While Ward’s method is not guaranteed to find the globally optimal hierarchical clustering, it often produces good results for spherical cluster structures.
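
We can check the derivation numerically: the increase in within-cluster sum of squares after a merge should equal \(\Delta_{i,j}\) computed from Ward’s formula. A short sketch (illustrative, not part of the course code):

import numpy as np

np.random.seed(0)
C_i = np.random.randn(8, 2)        # points in cluster i
C_j = np.random.randn(5, 2) + 3.0  # points in cluster j

def within_ss(C):
    """Within-cluster sum of squares about the centroid."""
    return np.sum((C - C.mean(axis=0))**2)

# Direct computation of the increase in E(M) after merging
delta_direct = within_ss(np.vstack([C_i, C_j])) - within_ss(C_i) - within_ss(C_j)

# Ward's formula
n_i, n_j = len(C_i), len(C_j)
mu_i, mu_j = C_i.mean(axis=0), C_j.mean(axis=0)
delta_ward = n_i*n_j/(n_i + n_j) * np.sum((mu_i - mu_j)**2)

print(delta_direct, delta_ward)  # the two values agree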

The agglomerative hierarchical clustering algorithm follows a simple iterative process. We start with each data point as its own cluster, then repeatedly find the two closest clusters and merge them. After each merge, we update our distance matrix to reflect the new cluster structure. This process continues until all points belong to a single cluster, creating a complete merge history that we can visualize as a dendrogram.

from mlai import generate_cluster_data
from mlai import ClusterModel
from mlai import WardsMethod
# Generate synthetic data with cluster structure
np.random.seed(24)
Y = generate_cluster_data(n_points_per_cluster=30)

ward = WardsMethod(Y)
ward.fit()

linkage_matrix = ward.get_linkage_matrix()
print(f'Linkage matrix shape: {linkage_matrix.shape}')

print(f"\nFinal clustering completed in {len(linkage_matrix)} steps")
print(f"Ward distances at each step: {[f'{d:.3f}' for d in linkage_matrix[:, 2]]}")

Figure: Hierarchical clustering of some artificial data. On the left we have an artificially generated data set containing three clusters. On the right we can see the dendrogram formed by clustering using Ward’s criterion.

Note that the hierarchical clustering here doesn’t always imply that the data was hierarchical in origin. It’s a visualisation of distances between points and clusters. The longer lengths in the dendrogram imply larger distances. Next we’ll demonstrate agglomerative clustering on the oil flow data set, which contains measurements from a multiphase flow facility.

Oil Flow Data

This data set is from a physics-based simulation of oil flow in a pipeline. The data was generated as part of a project to determine the fraction of oil, water and gas in North Sea oil pipes (Bishop and James, 1993).

import pods
data = pods.datasets.oil()

The data consists of 1000 12-dimensional observations of simulated oil flow in a pipeline. Each observation is labelled according to the multi-phase flow configuration (homogeneous, annular or laminar).

# Convert data["Y"] from [1, -1, -1] in each row to rows of 0 or 1 or 2
Y = data["Y"]
# Find rows with 1 in first column (class 0)
class0 = (Y[:, 0] == 1).astype(int) * 0
# Find rows with 1 in second column (class 1) 
class1 = (Y[:, 1] == 1).astype(int) * 1
# Find rows with 1 in third column (class 2)
class2 = (Y[:, 2] == 1).astype(int) * 2
# Combine into single array of class labels 0,1,2
labels = class0 + class1 + class2

The data is returned as a dictionary containing training and test inputs (‘X,’ ‘Xtst’), training and test labels (‘Y,’ ‘Ytst’), and the names of the features.

Figure: Visualization of the first two dimensions of the oil flow data from Bishop and James (1993)

As normal we include the citation information for the data.

print(data['citation'])

And extra information about the data is included, as standard, under the keys info and details.

print(data['details'])
X = data['X']
Y = data['Y']

Hierarchical Clustering of Oil Flow Data

In this example, we’ll apply hierarchical clustering to the oil flow data set. The data contains measurements from different flow regimes in a multiphase flow facility. The dendrogram shows how measurements naturally cluster into different flow types. Ward’s linkage method is used as it tends to create compact, evenly-sized clusters.

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import pods
# Perform hierarchical clustering
linked = linkage(X, 'ward')  # Ward's method for minimum variance
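
The dendrogram in the figure below can be drawn from the linkage matrix computed above using scipy's dendrogram function; something along these lines (plotting details are illustrative):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

fig, ax = plt.subplots(figsize=(10, 5))
# Truncate to the last 30 merges so the tree stays readable for 1000 points
dendrogram(linked, truncate_mode='lastp', p=30, ax=ax)
ax.set_xlabel('cluster size (or sample index)')
ax.set_ylabel('Ward distance')
plt.show()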

Figure: Hierarchical clustering applied to oil flow data. The dendrogram shows how different flow regimes are grouped based on their measurement similarities. The three main flow regimes (homogeneous, annular, and laminar) should form distinct clusters.

Phylogenetic Trees

A powerful application of hierarchical clustering is in constructing phylogenetic trees from genetic sequence data. By comparing DNA/RNA sequences across species, we can reconstruct their evolutionary relationships and estimate when species diverged from common ancestors.1 The resulting tree structure, called a phylogeny, maps out the evolutionary history and relationships between organisms.

These phylogenetic methods go beyond simple linkage methods based on distance: they incorporate models of genetic mutation and molecular evolution. These models can estimate not just the structure of relationships, but also the timing of evolutionary divergence events based on mutation rates. This has important applications in tracking the origins and spread of rapidly evolving pathogens like COVID-19, HIV and influenza viruses. Understanding viral phylogenies helps epidemiologists trace outbreak sources, track transmission patterns, and develop targeted containment strategies.

Product Clustering

An e-commerce company could apply hierarchical clustering to organize their product catalog into a taxonomy tree. Products would be grouped into increasingly specific categories - for example, Electronics might split into Phones, Computers, etc., with Phones further dividing into Smartphones, Feature Phones, and so on. This creates an intuitive hierarchical organization. However, many products naturally belong in multiple categories - for instance, running shoes could reasonably be classified as both sporting equipment and footwear. The strict tree structure of hierarchical clustering doesn’t allow for this kind of multiple categorization, which is a key limitation for product organization.

Hierarchical Clustering Challenge

Our psychological ability to form categories is far more sophisticated than hierarchical trees. Research in cognitive science has revealed that humans naturally form overlapping categories and learn abstract principles that guide classification. Josh Tenenbaum’s work (see e.g. Lake et al. (2018)) demonstrates how human concept learning combines multiple forms of inference through hierarchical Bayesian models that integrate similarity-based clustering with theory-based reasoning.

In practice the simple agglomerative clustering methods we use to merge clusters are often used to visualise the cluster structure even when we don’t expect a hierarchy to be present (as with the oil flow data and the artificial data above).

Thinking in High Dimensions

A recurring plot that is shown in talks on machine learning, particularly those on clustering, is a configuration of data points in two dimensions such as those below.

At first glance, the data appears quite realistic. The density of the data points varies considerably as we move around the plot. The data seems to be clustered, but not in a uniform manner: some clusters are tighter than others. The clusters also seem to be somewhat randomly distributed around the plot. At some point during the talk, a slide appears containing a fit to the data.

This figure shows the means and variances of a mixture of multivariate (two dimensional) Gaussian densities (McLachlan and Basford (1988)). The fit seems to be a good approximation to the data.

Models of this type can also be justified by appeals to our intuition. The way I like to think of these mixture models is as a summary of the data by a series of prototypes (represented by the means of the Gaussian components of the mixture density). These prototypes are subject to distortions. The density associated with each component represents the range of distortions that the data can undergo. In the case of the mixture of Gaussians, the distortions are of the form of adding zero mean, Gaussian distributed, random noise with a particular covariance matrix.

Mixtures of Gaussians

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=plot.big_figsize)

num_centres = 20
num_data = 200
centres = np.random.normal(size=(num_centres, 2))
w = np.random.normal(size=(num_centres, 2))*0.1
alloc = np.random.randint(0, num_centres, size=(num_data))
sigma = np.random.normal(size=(num_centres, 1))*0.05
epsilon = np.random.normal(size=(num_data,2))*sigma[alloc, :]

Y = w[alloc, :]*np.random.normal(size=(num_data, 1)) + centres[alloc, :] + epsilon

ax.plot(Y[:, 0], Y[:, 1], 'rx')
ax.set_xlabel('$y_1$', fontsize=20)
ax.set_ylabel('$y_2$', fontsize=20)

mlai.write_figure("artificial-mog-1.svg", directory="./dimred/")
pi_vals = np.linspace(-np.pi, np.pi, 200)[:, None]
for i in range(num_centres):
    ax.plot(centres[i, 0], centres[i, 1], 'o', markersize=5, color=[0, 0, 0], linewidth=2)
    x = np.hstack([np.sin(pi_vals), np.cos(pi_vals)])
    L = np.linalg.cholesky(np.outer(w[i, :],w[i, :]) + sigma[i]**2*np.eye(2))
    el = np.dot(x, L.T)
    ax.plot(centres[i, 0] + el[:, 0], centres[i, 1] + el[:, 1], linewidth=2, color=[0,0,0])
mlai.write_figure("artificial-mog-2.svg", directory="./dimred/")

Figure: Two dimensional Gaussian data set.

Figure: Two dimensional data sets. Complex structure not a problem for mixtures of Gaussians.

High Dimensional Data

Thinking in High Dimensions

Such a model seems intuitively appealing. When the expectation maximization approach to optimizing such powerful models was first described by the statistics community it must have seemed to some that the main remaining challenge for density estimation was principally computational. There are, of course, applications for which a mixture of Gaussians is an appropriate model (e.g. low dimensional data). Such applications are not the focus of this lecture. We are more interested in the failings of the Gaussian mixture. There are two foundational assumptions which underpin this model. We described them both in our early appeal to intuition. The first assumption is that the data is generated through prototypes. We will not consider this assumption further. We will focus on the second assumption: how the prototypes are corrupted to obtain the data we observe. For the Gaussian mixture model the prototypes are corrupted by Gaussian noise with a particular covariance. It is this second assumption that we would like to consider first. In particular, we are going to examine the special case where the covariance is given by a constant diagonal matrix. We will see that the behavior of a Gaussian in low dimensional space can be quite misleading when we develop intuitions as to how these densities behave in higher dimensions. By higher dimensionality we can think of dimensionality that is difficult to visualize directly, i.e. dimensionality greater than \(d=3\).

Figure: Distance from mean of the density (circle) to a given data point (square).

Dimensionality Greater than Three

In higher dimensions models that seem reasonable in two or three dimensions can fail dramatically. Note the emphasis on models. In this section, we are not making any statements about how a ‘realistic’ data set behaves in higher dimensions, we are making statements about modeling failures in higher dimensions. In particular we will focus on what assumptions the model makes about where the data is distributed in higher dimensions. We will illustrate the ideas through introducing the Gaussian egg. The Gaussian egg is a hard boiled egg that consists of three parts: the yolk of the egg, the white of the egg and a thin boundary shell between the yolk and the white. In an over-boiled egg the boundary layer is often green from iron sulfide formed by reactions between the white and the yolk, giving a bad taste. We therefore refer to this as the “green” of the egg. We will consider spherical Gaussian distributions which have covariance matrices of the form \(\sigma^2\mathbf{I}\). We define the volume of the density associated with the yolk as the part of the egg that is within \(0.95\sigma\) of the mean. The green is defined to be the region from \(0.95\sigma\) to \(1.05\sigma\). Finally, the white is all the density beyond \(1.05\sigma\). We assume the density of the egg varies according to the Gaussian density, so that the density at the center of the egg is highest, and density falls off with the exponentiated negative squared Euclidean distance from the mean. We take the overall mass of our egg to be one. The question we now ask is how much of our egg’s mass is taken up by the three component parts: the yolk, the green and the white. The proportions of the mass are dependent on the dimensionality of our egg.

The Gaussian Egg

  • See also Exercise 1.4 in Bishop (1995)

The answer is found through integrating over the Gaussian. For a one dimensional egg we can find the portion of the Gaussian that sits inside 0.95 of a standard deviation’s distance from the mean as \[ \int_{-0.95\sigma}^{0.95\sigma} \mathscr{N}\left(y|0,\sigma^{2}\right) \text{d}y. \] The other portions can be found similarly. For higher dimensional Gaussians the integral is slightly more difficult. To compute it, we will switch from considering the Gaussian density that governs the data directly to the density over squared distances from the mean that the Gaussian implies. Before we introduce that approach, we show three low dimensional Gaussian eggs below indicating their associated masses for the yolk, the green and the white.

Figure: Volumes associated with the one dimensional Gaussian egg. Here the yolk has 65.8%, the green has 4.8% and the white has 29.4% of the mass.

Figure: Volumes associated with the regions in the two dimensional Gaussian egg. The yolk contains 59.4%, the green contains 7.4% and the white 33.2%.

Figure: Volumes associated with the regions in the three dimensional Gaussian egg. Here the yolk has 56.1% the green has 9.2% the white has 34.7%.
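
As a quick check on the one dimensional case, the integral above can be evaluated with scipy's normal cumulative distribution function:

from scipy.stats import norm

# One dimensional Gaussian egg: mass within 0.95 sigma (yolk),
# between 0.95 and 1.05 sigma (green), and beyond 1.05 sigma (white)
yolk = norm.cdf(0.95) - norm.cdf(-0.95)
green = (norm.cdf(1.05) - norm.cdf(-1.05)) - yolk
white = 1.0 - yolk - green
print(f"yolk: {yolk:.1%}, green: {green:.1%}, white: {white:.1%}")
# approximately 65.8%, 4.8% and 29.4%, matching the one dimensional figure above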

The Gamma Density

The Gamma is a density over positive numbers. It has the form \[ \mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx}, \] where \(a\) is known as the shape parameter and \(b\) is known as a rate parameter. The function \(\Gamma\left(a\right)\) is known as the gamma function and is defined through the integral, \[ \Gamma\left(a\right)=\int_{0}^{\infty}x^{a-1}e^{-x}\text{d}x. \] The mean of the gamma density is given by \[ \left\langle x \right\rangle _{\mathscr{G}\left(x|a,b\right)}=\frac{a}{b} \] and the variance is given by \[ \text{var}_{\mathscr{G}\left(x|a,b\right)}\left( x \right)=\frac{a}{b^{2}}. \] Sometimes the density is defined in terms of a scale parameter, \(\beta=b^{-1}\), instead of a rate. Confusingly, this parameter is also often denoted by “\(b\).” For example, the statistics toolbox in Matlab defines things this way. The gamma density generalises several important special cases including the exponential density with rate \(b\), which is the specific case where the shape parameter is taken to be \(a=1\). The chi-squared density with one degree of freedom, denoted \(\chi_{1}^{2}\left(x\right)\), is the special case where the shape parameter is taken to be \(a=\frac{1}{2}\) and the rate parameter is \(b=\frac{1}{2}\).

The gamma density is the conjugate density for the inverse variance (precision) of a Gaussian density.

Gamma random variables have the property that, if multiple gamma variates are sampled from a density with the same rate, \(b\), and shape parameters \(\left\{a_k\right\}^d_{k=1}\), then the sum of those variates is also Gamma distributed with rate parameter \(b\) and shape parameter \(a^\prime = \sum_{k=1}^da_k\).
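
These properties are easy to check by simulation. A rough sketch using scipy's gamma density, which is parameterised by a shape and a scale \(=1/b\):

import numpy as np
from scipy.stats import gamma

a, b = 2.0, 3.0  # shape and rate

# Mean a/b and variance a/b^2
samples = gamma.rvs(a, scale=1/b, size=100000, random_state=0)
print(samples.mean(), a/b)    # both close to 0.667
print(samples.var(), a/b**2)  # both close to 0.222

# Sum of gamma variates with shapes a_1, a_2 and a common rate b
a1, a2 = 0.5, 1.5
s = (gamma.rvs(a1, scale=1/b, size=100000, random_state=1)
     + gamma.rvs(a2, scale=1/b, size=100000, random_state=2))
print(s.mean(), (a1 + a2)/b)     # both close to 0.667
print(s.var(), (a1 + a2)/b**2)   # both close to 0.222, consistent with shape a_1 + a_2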

Summary

The gamma density is a flexible probability distribution over positive numbers with two parameters: shape (\(a\)) and rate (\(b\)). The shape parameter controls the form of the distribution, while the rate parameter controls the scale. The mean and variance are both proportional to the shape parameter, with the mean being \(a/b\) and variance being \(a/b^2\). The gamma function \(\Gamma(a)\) serves as the normalizing constant.

The gamma density generalizes several important distributions. When the shape parameter \(a=1\), it becomes the exponential distribution with rate \(b\). When \(a=\frac{1}{2}\) and \(b=\frac{1}{2}\), it becomes the chi-squared distribution with one degree of freedom. The gamma distribution is particularly important in Bayesian statistics as the conjugate prior for the precision (inverse variance) of Gaussian distributions, making it essential for Bayesian inference about variance parameters.

Gamma distributions have a useful additive property: the sum of independent gamma random variables with the same rate parameter is also gamma distributed, with the shape parameter being the sum of the individual shape parameters. This property is crucial for many applications in statistics and probability. However, be careful with parameterization - some software uses a scale parameter \(\beta = b^{-1}\) instead of the rate parameter \(b\), which can lead to confusion when implementing gamma distributions in different programming environments.

Distribution of Mass against Dimensionality

For the three low dimensional Gaussians, we note that the allocation of the egg’s mass to the three zones changes as the dimensionality increases. The green and the white increase in mass and the yolk decreases in mass. Of greater interest to us is the behavior of this distribution of mass in higher dimensions. It turns out we can compute this through the cumulative distribution function of the gamma density.

We will compute the distribution of the density mass in the three regions by assuming each data point is sampled independently from our spherical covariance Gaussian density, \[ \mathbf{ y}_{i, :}\sim\mathscr{N}\left(\mathbf{0},\sigma^{2}\mathbf{I}\right), \] where \(\mathbf{ y}_{i, :}\) is the \(i\)th data point. Independence across features also means we can consider the density associated with the \(k\)th feature of the \(i\)th data point, \(y_{i,k}\), \[ y_{i,k}\sim\mathscr{N}\left(0,\sigma^{2}\right). \] We are interested in the squared distance of any given sample from the mean. Our choice of a zero mean Gaussian density means that the squared distance of each feature from the mean is easily computed as \(y_{i,k}^2\). We can exploit a useful characteristic of the Gaussian to describe the density of these squared distances. The square of a standardised Gaussian random variable is known to be distributed according to the chi-squared density, \[ \frac{y_{i,k}^{2}}{\sigma^{2}}\sim\chi_{1}^{2}. \] The chi-squared density is a special case of the gamma density, and the scaled variable \(y_{i,k}^{2}\) has a gamma density with shape parameter \(a=\frac{1}{2}\) and rate parameter \(b=\frac{1}{2\sigma^{2}}\), \[ \mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx}. \]

chi-squared Distributions

Figure: The \(\chi^2\) distribution which gives the distribution of the square of a standardised normal variable \(z^2\).

Figure: The scaled \(\chi^2\) distribution, equivalent to the sum of two standardised normal variables \(z_1^2 + z_2^2\).

Figure: The scaled \(\chi^2\) distribution, equivalent to the sum of five standardised normal variables \(\sum_{i=1}^5 z^2_i\).


import numpy as np
from scipy.stats import gamma
# Set up dimensions
x = np.arange(0, 11)
D = 2.0**x
lim1 = 0.95
lim2 = 1.05
lim3 = 100

# Compute values of cumulative gammas
y = gamma.cdf(lim1*lim1, D/2, scale=2/D)
y2 = gamma.cdf(lim2*lim2, D/2, scale=2/D)
y3 = gamma.cdf(lim3*lim3, D/2, scale=2/D)

Figure: Plot of probability mass versus dimension. Plot shows the volume of density inside 0.95 of a standard deviation (yellow), between 0.95 and 1.05 standard deviations (green), and beyond 1.05 standard deviations (white).

So we have the squared distance from the mean for a single feature being given by \[ y_{i,k}^{2}\sim\mathscr{G}\left(\frac{1}{2},\frac{1}{2\sigma^{2}}\right). \] Of course, we are interested in the distance from the mean of the data point given by \(\mathbf{ y}_{i,:}=\left[y_{i,1}, \dots, y_{i,d}\right]^\top\). The squared distance from the mean of the \(i\)th data point will be given by \(\sum_{k=1}^dy_{i,k}^2\). Fortunately, the properties of the gamma density (see the additive property described above) mean we can also compute the density of the resulting random variable, in particular we have \[ \sum_{k=1}^{d} y_{i,k}^{2}\sim\mathscr{G}\left(\frac{d}{2},\frac{1}{2\sigma^{2}}\right). \] We can compute the mean of the squared distance of each point from the mean, \[ \left\langle \sum_{k=1}^{d} y_{i,k}^{2} \right\rangle =d\sigma^{2}, \] which, we note, scales linearly with the dimensionality of the data. The average squared distance across features will be distributed as follows, \[ \frac{1}{d}\sum_{k=1}^{d}y_{i,k}^{2}\sim\mathscr{G}\left(\frac{d}{2},\frac{d}{2\sigma^{2}}\right) \] the mean for which is simply the variance of the underlying Gaussian density, \[ \left\langle \frac{1}{d}\sum_{k=1}^{d} y_{i,k}^{2} \right\rangle =\sigma^{2}. \] We can use this gamma density to work out how much of the mass of the Gaussian egg is in the different zones. The cumulative distribution function for the gamma density, \[ \mathscr{GAMMA CDF}\left(z|a,b\right) = \int_0^{z}\mathscr{G}\left(x|a,b\right) \text{d} x, \] doesn’t have a nice analytical form, but implementations of it are provided for many programming languages including R, Octave and Python. We can use the cumulative distribution to give the probability of the squared distance falling under a certain value. For the data point to be in the yolk, we expect its squared distance from the mean to be under \((0.95\sigma)^2\). For the data point to be in the green we expect its squared distance from the mean to be between \((0.95\sigma)^2\) and \((1.05\sigma)^2\). Data in the white will have a squared distance from the mean greater than \((1.05\sigma)^2\).

Looking at Gaussian Samples

The theory has shown us that data sampled from a Gaussian density in very high dimensions will live in a shell around one standard deviation from the mean of the Gaussian. This is perhaps surprising because we are used to thinking of Gaussians being distributions where most of the data is near the mean of the density. But this is because we are used to looking at low dimensional Gaussian densities. In this regime most of the data is near the mean. One useful property of the Gaussian density is that all the marginal densities are also Gaussian. This means that when we see a two dimensional Gaussian we can think of it as a three dimensional Gaussian with one dimension marginalized. The effect of this marginalization is to project the third dimension down onto the other two by summing across it. This means when we see a two dimensional Gaussian on a page, such as the one shown below, we can think of a corresponding three dimensional Gaussian which has this two dimensional Gaussian as a marginal. That three dimensional Gaussian would have data going into and coming out of the page. The projection down to two dimensions makes that data look like it is close to the mean, but when we expand up to the three dimensional Gaussian some of that data moves away from the mean. Following this line of argument, we can ask what if the plot is the two dimensional marginal of a truly four dimensional Gaussian. Now some more of the data that appears close to the mean in the two dimensional Gaussian (and perhaps was close to the mean in the three dimensional Gaussian) is also projected away. We can continue applying this argument until we can be certain that there is no data left near the mean.

So the gist of the argument is as follows. When you are looking at a low dimensional projection of a high dimensional Gaussian, it does appear that there is a large amount of data close to the mean. But, when we consider that the data near the mean has been projected down from potentially many dimensions we understand that in fact there is very little data near the mean.

import numpy as np
# Generate random covariance matrix
np.random.seed(24)  # For reproducibility
a = np.random.randn(2, 2)
cov_matrix = a @ a.T

# Sample from multivariate Gaussian
mean = np.array([0, 0])
Y = np.random.multivariate_normal(mean, cov_matrix, 300)

Figure: Looking at a projected Gaussian. This plot shows, in two dimensions, samples from a potentially very high dimensional Gaussian density. The mean of the Gaussian is at the origin. There appears to be a lot of data near the mean, but when we bear in mind that the original data was sampled from a much higher dimensional Gaussian we realize that the data has been projected down to the mean from those other dimensions that we are not visualizing.

High Dimensional Gaussians and Interpoint Distances

Our analysis above showed that for spherical Gaussians in high dimensions the density of squared distances from the mean collapses around the expected value of one standard deviation. Another related effect is the distribution of interpoint squared distances. It turns out that in very high dimensions, the interpoint squared distances become equal. This has knock on effects for many learning algorithms: for example in nearest neighbors classification algorithms a test data point is assigned a label based on the labels of the \(k\) nearest neighbors from the training data. If all interpoint squared distances were close to equal then the stability of these \(k\) neighbors with respect to small data perturbations would be poor.

We can show this for Gaussians with a similar proof to the one above, \[ y_{i,k}\sim\mathscr{N}\left(0,\sigma_{k}^{2}\right)\quad\quad y_{j,k}\sim\mathscr{N}\left(0,\sigma_{k}^{2}\right) \] \[ y_{i,k}-y_{j,k}\sim\mathscr{N}\left(0,2\sigma_{k}^{2}\right) \] \[ \left(y_{i,k}-y_{j,k}\right)^{2}\sim\mathscr{G}\left(\frac{1}{2},\frac{1}{4\sigma_{k}^{2}}\right) \] Once again we can consider the specific case where the data is spherical, \(\sigma_{k}^{2}=\sigma^{2}\), as we can always individually rescale the data input dimensions to ensure this is the case, \[ \sum_{k=1}^{d}\left(y_{i,k}-y_{j,k}\right)^{2}\sim\mathscr{G}\left(\frac{d}{2},\frac{1}{4\sigma^{2}}\right) \] \[ \frac{1}{d}\sum_{k=1}^{d}\left(y_{i,k}-y_{j,k}\right)^{2}\sim\mathscr{G}\left(\frac{d}{2},\frac{d}{4\sigma^{2}}\right) \] The dimension normalised squared distance between points is gamma distributed. The mean is \(2\sigma^{2}\) and the variance is \(\frac{8\sigma^{4}}{d}\).
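
A quick simulation confirms this concentration of interpoint distances. This is a rough check only (the pairs of points are not independent, but the marginal behaviour is what we are interested in):

import numpy as np
from scipy.spatial.distance import pdist

d, sigma = 1000, 1.0
rng = np.random.default_rng(24)
Y = rng.normal(scale=sigma, size=(200, d))

# Dimension normalised squared interpoint distances
v = pdist(Y, metric='sqeuclidean') / d

print(v.mean(), 2*sigma**2)   # mean close to 2 sigma^2
print(v.var(), 8*sigma**4/d)  # variance close to 8 sigma^4 / d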

Example Data Sets

import numpy as np
# Generate 1000D Gaussian data
np.random.seed(24)
Y = np.random.randn(1000, 1000)

Figure: A good match between theory and the samples for a 1000 dimensional Gaussian distribution.

import numpy as np
# Generate 1000D Gaussian data with 100 points
np.random.seed(24)
Y = np.random.randn(100, 1000)

Figure: A good match between theory and the samples for a 1000 dimensional Gaussian distribution with 100 points.

Simulation of oil flow
Y = data["Y"]

Figure:

OSU Motion Capture Data: Run 1

Motion capture data from the Open Motion Data Project by The Ohio State University Advanced Computing Center for the Arts and Design. Historically the data website was found here http://accad.osu.edu/research/mocap/mocap_data.htm, although it is now missing. The centre website is here: https://accad.osu.edu.

import pods

You can download different data from the site, here we download the ‘run1’ motion.

data = pods.datasets.osu_run1()

The data dictionary contains the keys ‘Y’ and ‘connect,’ which represent the data and connections that can be used to create the skeleton.

data['Y'].shape

The data has often been used in talks demonstrating GP-LVM models and comparing variants such as back constrained and temporal models.

print(data['citation'])

And extra information about the data is included, as standard, under the key details.

print(data['details'])
Ohio State University motion capture
Y = data["Y"]

Figure:

Robot Wireless Data

The robot wireless data is taken from an experiment run by Brian Ferris at the University of Washington. It consists of the measurements of WiFi access point signal strengths as Brian walked in a loop. It was published at IJCAI in 2007 (Ferris et al., 2007).

import pods
import numpy as np
data=pods.datasets.robot_wireless()

The ground truth is recorded in the data, the actual loop is given in the plot below.

Robot Wireless Ground Truth

Figure: Ground truth movement for the position taken while recording the multivariate time-course of wireless access point signal strengths.

We will ignore this ground truth in making our predictions, but see if the model can recover something similar in one of the latent layers.

Robot WiFi Data

One challenge with the data is that the signal strength ‘drops out.’ This is because the device only tracks a limited number of WiFi access points; when one of the access points drops out of the tracked set, its value disappears (in the plot below it reads -0.5). The data is missing, but it is not missing at random, because the implication is that the wireless access point must be weak to have dropped from the list of those that are tracked.

Figure: Output dimension 1 from the robot wireless data. This plot shows signal strength changing over time.

Robot wireless navigation
Y = data["Y"]

Figure:

import numpy as np
# Create low-rank covariance matrix
np.random.seed(22)
W = np.random.randn(1000, 2)
cov_matrix = W @ W.T + 1e-2 * np.eye(1000)

# Sample from correlated Gaussian
mean = np.zeros(1000)
Y = np.random.multivariate_normal(mean, cov_matrix, 1000)

Figure: Interpoint squared distance distribution for Gaussian with \(d=1000\) but low rank covariance.

from scipy.spatial.distance import pdist, squareform
from scipy.stats import gamma
# Normalize data to have variance 1 for each dimension
varY = np.var(Y, axis=0)
stdY = np.sqrt(varY)
Y_normalized = Y / stdY

# Compute distances
distances = pdist(Y_normalized, metric='sqeuclidean')
v = distances[distances > 1e-10]  # Remove very small distances

# Normalize distances
v = 2 * v / np.mean(v)

Figure: Interpoint squared distance distribution for Gaussian with \(d=1000\) and theoretical curve for \(d=2\).

Summary and Key Points

We’ve covered several key ideas about dimensionality reduction:

  1. High-dimensional spaces have counter-intuitive properties:
    • The curse of dimensionality
    • Concentration of distances
  2. Real data doesn’t behave like random high-dimensional data because:
    • It lies near lower-dimensional manifolds
    • It has structure imposed by physics, biology, or other constraints
  3. This structure makes dimensionality reduction possible:
    • PCA finds linear manifolds
    • More sophisticated methods can find nonlinear manifolds
  4. The probabilistic perspective helps us:
    • Understand when methods will work
    • Quantify uncertainty in our reduced representations
    • Connect dimensionality reduction to other machine learning approaches

Thanks!

For more information on these subjects and more you might want to check the following resources.

References

Bishop, C.M., 1995. Neural networks for pattern recognition. Oxford University Press.
Bishop, C.M., James, G.D., 1993. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580–593. https://doi.org/10.1016/0168-9002(93)90728-Z
Ferris, B.D., Fox, D., Lawrence, N.D., 2007. WiFi-SLAM using Gaussian process latent variable models, in: Veloso, M.M. (Ed.), Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007). pp. 2480–2485.
Lake, B.M., Lawrence, N.D., Tenenbaum, J.B., 2018. The emergence of organizing structure in conceptual representation. Cognitive science 42 Suppl 3, 809–832. https://doi.org/10.1111/cogs.12580
McLachlan, G.J., Basford, K.E., 1988. Mixture models: Inference and applications to clustering. Marcel Dekker, New York.

  1. Phylogenetic models incorporate molecular clock models that estimate mutation rates over time. By calibrating these with known divergence events from the fossil record, the timing of common ancestors can be estimated.↩︎