Clustering and High Dimensions
Neil D. Lawrence
2025-09-22
Dedan Kimathi University, Nyeri, Kenya
Clustering
Common approach for grouping data points
Assigns data points to discrete groups
Examples include:
Animal classification
Political affiliation grouping
Clustering vs Vector Quantisation
Clustering expects gaps between groups in data density
Vector quantization may not require density gaps
For practical purposes, both involve:
Allocating points to groups
Determining optimal number of groups
Task
Task : associate data points with different labels.
Labels are not provided by humans.
The process is intuitive for humans; we do it naturally.
Platonic Ideals
Greek philosopher Plato considered the concept of ideals
The Platonic ideal bird is the most bird-like bird
In clustering, we find these ideals as cluster centers
Data points are allocated to their nearest center
Squared Distance
Common choice: squared distance \[
d_{ij} = \|\mathbf{ x}_i - \boldsymbol{ \mu}_j\|^2
\]
Goal: find centres close to many data points
Objective Function
Given a similarity measure, we need the number of cluster centres, \(K\).
Find their location by allocating each centre a subset of the points and minimising the sum of the squared errors, \[
E(\mathbf{M}) = \sum_{j=1}^K \sum_{i \in \mathbf{i}_j} \|\mathbf{ x}_i - \boldsymbol{ \mu}_j\|^2
\] where \(\mathbf{i}_j\) is the set of indices of data points allocated to the \(j\)th centre.
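As a quick illustration, here is a minimal numpy sketch of this objective; the array names (toy_X, toy_centres, toy_assign) are illustrative rather than taken from the lecture code.

```python
import numpy as np

def kmeans_objective(X, centres, assignments):
    """Sum of squared distances from each point to its allocated centre."""
    diffs = X - centres[assignments]   # per-point difference from its own centre
    return np.sum(diffs ** 2)

# toy example: three points, two centres
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1]])
toy_centres = np.array([[0.05, 0.1], [2.0, 2.0]])
toy_assign = np.array([0, 0, 1])       # index of the centre allocated to each point
print(kmeans_objective(toy_X, toy_centres, toy_assign))
```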
\(k\) -Means Clustering
\(k\) -means clustering is simple and quick to implement.
It is very sensitive to initialisation.
Initialisation
Initialisation is the process of selecting a starting set of parameters.
Optimisation result can depend on the starting point.
For \(k\) -means clustering you need to choose an initial set of centres.
The optimisation surface has many local optima, and the algorithm gets stuck in those near the initialisation.
\(k\) -means Algorithm
Simple iterative clustering algorithm
Key steps:
Initialize with random centres
Assign points to nearest center
Update centres as cluster means
Repeat until stable
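Below is a bare-bones numpy sketch of these steps, assuming a data matrix X with one point per row; it is illustrative only, not the lecture's implementation.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Bare-bones k-means: random initialisation, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]   # random initial centres
    for _ in range(n_iters):
        # assign each point to its nearest centre
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # update each centre to the mean of its allocated points
        new_centres = np.array([
            X[assignments == k].mean(axis=0) if np.any(assignments == k) else centres[k]
            for k in range(K)
        ])
        if np.allclose(new_centres, centres):   # no change: converged
            break
        centres = new_centres
    return centres, assignments
```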
Objective Function
Minimizes sum of squared distances: \[
E=\sum_{j=1}^K \sum_{i\ \text{allocated to}\ j} \left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)^\top\left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)
\]
Solution not guaranteed to be global or unique
Represents a non-convex optimization problem
Linkage Criteria
Single linkage : Distance between closest points in clusters
Complete linkage : Distance between farthest points in clusters
Average linkage : Average distance between all point pairs
Ward linkage : Minimises within-cluster variance increase
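To make the criteria concrete, here is a small hypothetical example computed with scipy; the Ward value follows the \(\Delta_{i,j}\) expression introduced below rather than scipy's internal update convention.

```python
import numpy as np
from scipy.spatial.distance import cdist

# two small, hypothetical clusters
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

D = cdist(A, B)                  # pairwise distances between points in A and B
single = D.min()                 # single linkage: closest pair
complete = D.max()               # complete linkage: farthest pair
average = D.mean()               # average linkage: mean over all pairs
n_a, n_b = len(A), len(B)        # Ward: size-weighted squared distance between centroids
ward = (n_a * n_b) / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
print(single, complete, average, ward)
```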
Ward’s Criterion
Ward’s method minimizes the increase in within-cluster variance
Start with every data point as its own cluster of size 1.
At each step merge the pair of clusters with the smallest \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]
Where \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\) are cluster centroids and \(n_i, n_j\) are cluster sizes
Mathematical Derivation
Within-cluster sum of squares (\(E(\mathbf{M})\) ):
\(E(\mathbf{M}) = \sum_{i=1}^k \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2\)
Ward’s criterion: Minimise increase in \(E(\mathbf{M})\) when merging
This is the same objective as is used in \(k\)-means
Key insight: For spherical clusters, this is equivalent to minimising centroid distance weighted by cluster sizes
Mathematical Derivation of Ward Distance
Start with two clusters \(C_i\) and \(C_j\) with centroids \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\)
After merging: new centroid \[
\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}
\]
Increase in \(E(\mathbf{M})\) : \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]
Mathematical Derivation - II
Step 1: Original \(E(\mathbf{M})\) for separate clusters \[
E(\mathbf{M})_{\text{original}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_j\|^2
\]
Step 2: New \(E(\mathbf{M})\) after merging \[
E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i \cup C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2
\]
Step 3: Increase in \(E(\mathbf{M})\) \[
\Delta_{i,j} = E(\mathbf{M})_{\text{new}} - E(\mathbf{M})_{\text{original}}
\]
Mathematical Derivation - III
Expanding the new \(E(\mathbf{M})\) : \[
E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2
\]
Key identity
\[\begin{aligned}
\|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 = & \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 \\ & + \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + 2(\mathbf{ x}- \boldsymbol{ \mu}_i)^\top(\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij})
\end{aligned}\]
Summing over \(C_i\), the cross term vanishes because \(\sum_{\mathbf{ x}\in C_i}(\mathbf{ x}- \boldsymbol{ \mu}_i) = \mathbf{0}\)
Mathematical Derivation - IV
After simplification
\[\begin{aligned}
E(\mathbf{M})_{\text{new}} = & E(\mathbf{M})_{\text{original}} \\ & + n_i \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + n_j \|\boldsymbol{ \mu}_j - \boldsymbol{ \mu}_{ij}\|^2
\end{aligned}\]
Mathematical Derivation - V
Substituting the centroid formula \[
\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}
\]
Final result: \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]
Local Optimality
Spherical clusters have minimal \(E(\mathbf{M})\) for given number of points
Ward’s method makes locally optimal choices at each merge step
The weighting \(\frac{n_i n_j}{n_i + n_j}\) prevents premature merging of large clusters
Algorithm
Step 1 : Start with \(n\) clusters (one per point)
Step 2 : Compute distance matrix between all clusters
Step 3 : Find and merge the two closest clusters
Step 4 : Update distance matrix
Step 5 : Repeat until one cluster remains
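In practice these steps are rarely hand-coded; a typical sketch with scipy's agglomerative clustering (using Ward's criterion, on hypothetical blob data) looks like this.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# three hypothetical, well-separated blobs in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X, method="ward")                     # runs steps 1-5 above with Ward's criterion
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into three clusters
# dendrogram(Z) draws the merge tree (requires matplotlib)
```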
Ward’s Criterion on Artificial Data
Hierarchical clustering of some artificial data. On the left we have an artificially generated data set containing three clusters. On the right we can see the dendrogram formed by clustering using Ward’s criterion.
Oil Flow Data
Visualization of the first two dimensions of the oil flow data from Bishop and James (1993)
Hierarchical Clustering of Oil Flow Data
Hierarchical clustering applied to oil flow data. The dendrogram shows how different flow regimes are grouped based on their measurement similarities. The three main flow regimes (homogeneous, annular, and laminar) should form distinct clusters.
Phylogenetic Trees
Hierarchical clustering of genetic sequence data
Creates evolutionary trees showing species relationships
Estimates common ancestors and mutation timelines
Critical for tracking viral evolution and outbreaks
Product Clustering
Hierarchical clustering for e-commerce products
Creates product taxonomy trees
Splits into nested categories (e.g. Electronics → Phones → Smartphones)
Hierarchical Clustering Challenge
Many products belong in multiple clusters (e.g. running shoes are both ‘sporting goods’ and ‘clothing’)
Tree structures are too rigid for natural categorization
Human concept learning is more flexible:
Forms overlapping categories
Learns abstract rules
Builds causal theories
Thinking in High Dimensions
Mixtures of Gaussians
Two dimensional Gaussian data set.
Mixtures of Gaussians
Two dimensional data sets. Complex structure is not a problem for mixtures of Gaussians.
Thinking in High Dimensions
Two dimensional plots of Gaussians can be misleading.
Our low dimensional intuitions can fail dramatically.
Two major issues:
In high dimensions all the data moves to a ‘shell.’ There is nothing near the mean!
Distances between points become constant.
These effects apply to many densities.
Let’s consider a Gaussian “egg.”
Distance from a Mean
Distance from mean of the density (circle) to a given data point (square).
Dimensionality Greater than Three
1D Egg
Volumes associated with the one dimensional Gaussian egg. Here the yolk has 65.8%, the green has 4.8% and the white has 29.4% of the mass.
Here the yolk has 65.8%, the green has 4.8% and the white has 29.4%
2D Egg
Volumes associated with the regions in the two dimensional Gaussian egg. The yolk contains 59.4%, the green contains 7.4% and the white 33.2%.
Here the yolk has 59.4%, the green has 7.4% and the white has 33.2%
3D Egg
Volumes associated with the regions in the three dimensional Gaussian egg. Here the yolk has 56.1% the green has 9.2% the white has 34.7%.
Here the yolk has 56.1%, the green has 9.2% and the white has 34.7%
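These percentages can be reproduced numerically if we assume the yolk and green boundaries sit at 0.95 and 1.05 times \(\sqrt{d}\) standard deviations, so that the squared radius is compared against a \(\chi^2_d\) variable; a short check (this boundary convention is an assumption, chosen because it matches the quoted figures):

```python
from scipy.stats import chi2

# assumed boundaries at 0.95*sqrt(d) and 1.05*sqrt(d) standard deviations,
# so the squared-radius thresholds are d*0.95**2 and d*1.05**2 for a chi^2_d variable
for d in (1, 2, 3):
    yolk = chi2.cdf(d * 0.95 ** 2, df=d)
    green = chi2.cdf(d * 1.05 ** 2, df=d) - yolk
    white = 1.0 - yolk - green
    print(d, round(100 * yolk, 1), round(100 * green, 1), round(100 * white, 1))
# prints approximately: 1 65.8 4.8 29.4 / 2 59.4 7.4 33.2 / 3 56.1 9.2 34.7
```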
Gamma Properties
Mean: \(\frac{a}{b}\)
Variance: \(\frac{a}{b^{2}}\)
Support: \(x > 0\) (positive numbers only)
Special Cases
Exponential: \(a=1\), rate \(b\) \(\rightarrow\) exponential density \(b e^{-bx}\)
\(\chi^2\) (chi-squared, 1 degree of freedom): \(a=\frac{1}{2}\), \(b=\frac{1}{2}\) \(\rightarrow\) \(\chi_{1}^{2}\left(x\right)\)
Other Important Applications
Conjugate prior for Gaussian precision (inverse variance)
Modeling waiting times and lifetimes
Additive Property
If \[
x_k \sim \text{Gamma}(a_k, b)
\] for \(k=1,\ldots,d\)
Then \[
\sum_{k=1}^dx_k \sim \text{Gamma}(\sum_{k=1}^d a_k, b)
\]
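A sampling check of the additive property (the shape and rate values are illustrative):

```python
import numpy as np

a1, a2, b = 1.5, 2.5, 2.0        # two shape parameters, one shared rate (illustrative)
rng = np.random.default_rng(0)
x1 = rng.gamma(shape=a1, scale=1 / b, size=100_000)
x2 = rng.gamma(shape=a2, scale=1 / b, size=100_000)
s = x1 + x2

# compare sample moments with Gamma(a1 + a2, b): mean (a1 + a2)/b, variance (a1 + a2)/b^2
print(s.mean(), (a1 + a2) / b)
print(s.var(), (a1 + a2) / b ** 2)
```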
Warning: Parameterisation Confusion
Rate parameter: \(b\) (used here)
Scale parameter: \(\beta = b^{-1}\) (alternative)
Watch out for different software conventions.
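For example, scipy.stats.gamma is parameterised by shape and scale, so the rate \(b\) used in these notes must be inverted:

```python
from scipy.stats import gamma

a, b = 0.5, 2.0                  # shape a and *rate* b, as in these notes
dist = gamma(a, scale=1.0 / b)   # scipy uses a *scale* parameter, so pass 1/b
print(dist.mean(), a / b)        # both 0.25
print(dist.var(), a / b ** 2)    # both 0.125
```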
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Individual components : \[
y_{i,k} \sim \mathscr{N}\left(0,\sigma^2\right)
\]
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Squared components : \[
y_{i,k}^2 \sim \sigma^2 \chi_1^2
\] (scaled chi-squared)
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Gamma distribution : \[
y_{i,k}^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{2\sigma^2}\right)
\]
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Sum of squares : \[
\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{2\sigma^2}\right)
\]
Mathematics
What is the density of probability mass?
For a \(d\) -dimensional Gaussian distribution:
Expected value :
\[
\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = d\sigma^2
\]
Mathematics
What is the density of probability mass?
Normalized sum : \[
\frac{1}{d}\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{2\sigma^2}\right)
\]
Mathematics
What is the density of probability mass?
Expected normalised : \[
\frac{1}{d}\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = \sigma^2
\]
Distribution of Mass against Dimensionality
chi-squared Distributions
\(z^2\)
The \(\chi^2_1\) distribution, which gives the distribution of the square of a standardised normal variable, \(z^2\).
\(z_1^2 + z_2^2\)
The scaled \(\chi^2\), equivalent to the sum of the squares of two standardised normal variables, \(z_1^2 + z_2^2\).
\(\sum_{i=1}^5 z^2_i\)
The scaled \(\chi^2\), equivalent to the sum of the squares of five standardised normal variables, \(\sum_{i=1}^5 z^2_i\).
Square of Sample
Square of sample from Gaussian is scaled chi-squared density
The scaled \(\chi^2\) density is a gamma density with \(a=\frac{1}{2}\), \(b=\frac{1}{2\sigma^{2}}\), \[
\mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx}
\]
Distance Distributions
The sum of gamma random variables with the same rate is gamma distributed with the sum of the shape parameters (the \(y_{i,k}\) are independent)
Scaling a gamma random variable rescales the rate parameter (multiplying the variable by \(c\) divides the rate by \(c\))
Where is the Mass?
Plot of probability mass versus dimension, showing the volume of density inside 0.95 of a standard deviation (yellow), between 0.95 and 1.05 standard deviations (green), and over 1.05 standard deviations (white).
Proportions of volumes between yolk, green and white as \(d\rightarrow 1024\) (log scale)
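A sketch of how such proportions can be computed, again assuming boundaries at \(0.95\sqrt{d}\) and \(1.05\sqrt{d}\) standard deviations:

```python
import numpy as np
from scipy.stats import chi2

# proportions of mass in yolk, green and white as d grows to 1024,
# with the assumed boundaries at 0.95*sqrt(d) and 1.05*sqrt(d) standard deviations
for d in 2 ** np.arange(11):     # d = 1, 2, 4, ..., 1024
    yolk = chi2.cdf(d * 0.95 ** 2, df=d)
    green = chi2.cdf(d * 1.05 ** 2, df=d) - yolk
    print(int(d), round(yolk, 3), round(green, 3), round(1 - yolk - green, 3))
```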
Looking at Gaussian Samples
Looking at a projected Gaussian. This plot shows, in two dimensions, samples from a potentially very high dimensional Gaussian density. The mean of the Gaussian is at the origin. There appears to be a lot of data near the mean, but when we bear in mind that the original data was sampled from a much higher dimensional Gaussian we realize that the data has been projected down to the mean from those other dimensions that we are not visualizing.
High Dimensional Gaussians and Interpoint Distances
Interpoint Distances
The other effect in high dimensions is that all points become equidistant.
We can show this for Gaussians with a proof similar to the one above.
Interpoint Distance Analysis
For two points \(i\) and \(j\) in d-dimensional space:
Individual components : \(y_{i,k} \sim \mathscr{N}\left(0, \sigma_k^2\right)\) and \(y_{j,k} \sim \mathscr{N}\left(0, \sigma_k^2\right)\)
Difference : \(y_{i,k} - y_{j,k} \sim \mathscr{N}\left(0, 2\sigma_k^2\right)\)
Squared difference : \((y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{4\sigma_k^2}\right)\)
Interpoint Distance Analysis
For spherical Gaussian where \(\sigma_k^2 = \sigma^2\) :
Sum of squared differences : \[
\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{4\sigma^2}\right)
\]
Interpoint Distance Analysis
Normalised distance : \[
\frac{1}{d}\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{4\sigma^2}\right)
\]
Key Results
Mean squared distance: \(2\sigma^2\)
Variance: \(\frac{8\sigma^4}{d}\) (decreases with dimension!)
All points become equidistant as \(d\to \infty\)
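These results can be checked by simulation; the sketch below samples random pairs of points from a spherical Gaussian and compares the empirical mean and variance of the normalised squared distances with the predictions (the sizes are illustrative).

```python
import numpy as np

d, n, sigma2 = 1000, 500, 1.0
rng = np.random.default_rng(0)
Y = rng.normal(0.0, np.sqrt(sigma2), size=(n, d))

# normalised squared distances between randomly chosen pairs of distinct points
i = rng.integers(0, n, size=5000)
j = rng.integers(0, n, size=5000)
keep = i != j
dist2 = ((Y[i[keep]] - Y[j[keep]]) ** 2).sum(axis=1) / d

print(dist2.mean(), 2 * sigma2)           # mean close to 2 sigma^2
print(dist2.var(), 8 * sigma2 ** 2 / d)   # variance close to 8 sigma^4 / d
```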
Central Limit Theorem and Non-Gaussian Case
Analytic for spherical, independent Gaussian data.
For independent data, the central limit theorem applies.
The mean squared distance in high D is the mean of the variances.
The variance about the mean scales as \(d^{-1}\) .
Summary
In high dimensions, if the individual dimensions are independent, distributions behave counter-intuitively.
All data sits at one standard deviation from the mean.
The densities of squared distances can be analytically calculated for the Gaussian case.
Summary
For non-Gaussian systems we can invoke the central limit theorem.
Next we will consider example data sets and see how their interpoint distances are distributed.
Sanity Check
Data sampled from independent Gaussian distribution
If dimensions are independent, we expect low variance, Gaussian behavior for the distribution of squared distances.
Sanity Check
A good match between theory and the samples for a 1000 dimensional Gaussian distribution.
Distance distribution for a Gaussian with \(d=1000\) , \(n=1000\) with theoretical curve
Sanity Check
Same data generation, but fewer data points.
If dimensions are independent, we expect low variance, Gaussian behaviour for the distribution of squared distances.
Sanity Check
A good match between theory and the samples for a 1000 dimensional Gaussian distribution with 100 points.
Distance distribution for a Gaussian with \(d=1000\) and \(n=100\) with theoretical curve
Simulation of oil flow
Histogram of interpoint squared distances for the oil flow simulation.
OSU Motion Capture Data: Run 1
Ohio State University motion capture
Histogram of interpoint squared distances for the Ohio State University motion capture data.
Robot Wireless Data
Ground truth movement for the position taken while recording the multivariate time-course of wireless access point signal strengths.
Robot WiFi Data
Output dimension 1 from the robot wireless data. This plot shows signal strength changing over time.
Robot wireless navigation
Histogram of interpoint squared distances for the robot wireless navigation data.
Where does practice depart from our theory?
The situation for real data does not reflect what we expect.
Real data exhibits greater variances on interpoint distances.
Somehow the real data seems to have a smaller effective dimension.
Let’s look at another \(d=1000\) Gaussian.
1000-D Gaussian
Distance distribution for a different Gaussian with \(d=1000\)
Interpoint squared distance distribution for Gaussian with \(d=1000\) but low rank covariance.
Interpoint squared distance distribution for Gaussian with \(d=1000\) and theoretical curve for \(d=2\) .
Gaussian has a specific low rank covariance matrix \(\mathbf{C}=\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}\) .
Take \(\sigma^{2}=10^{-2}\) and sample \(\mathbf{W}\in\Re^{1000\times2}\) from \(\mathscr{N}\left(0,1\right)\).
Theoretical curve taken assuming dimensionality of 2.
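A sketch of this experiment, under the stated sampling assumptions: the empirical variance of the normalised squared distances comes out far larger than the independent \(d=1000\) prediction and close to the prediction with an effective dimensionality of 2.

```python
import numpy as np

d, n, sigma2 = 1000, 1000, 1e-2
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2))                 # W in R^{1000 x 2}, entries from N(0, 1)
# samples with covariance C = W W^T + sigma^2 I
Y = rng.normal(size=(n, 2)) @ W.T + rng.normal(scale=np.sqrt(sigma2), size=(n, d))

# normalised squared distances between random pairs of distinct points
i, j = rng.integers(0, n, 5000), rng.integers(0, n, 5000)
keep = i != j
dist2 = ((Y[i[keep]] - Y[j[keep]]) ** 2).sum(axis=1) / d

v = dist2.mean() / 2                        # implied per-dimension variance
print(dist2.var())                          # observed variance of squared distances
print(8 * v ** 2 / d, 8 * v ** 2 / 2)       # d = 1000 prediction vs effective d = 2 prediction
```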