Clustering and High Dimensions 
  Neil D. Lawrence
  2025-09-22 
  Dedan Kimathi University, Nyeri, Kenya
 
Clustering 
Common approach for grouping data points 
Assigns data points to discrete groups 
Examples include:
Animal classification 
Political affiliation grouping 
  
 
 
Clustering vs Vector Quantisation 
Clustering expects gaps between groups in data density 
Vector quantisation may not require density gaps 
For practical purposes, both involve:
Allocating points to groups 
Determining optimal number of groups 
  
 
 
Task 
Task: associate data points with different labels. The labels are not provided by humans. 
Process is intuitive for humans - we do it naturally. 
 
 
Platonic Ideals 
Greek philosopher Plato considered the concept of ideals 
The Platonic ideal bird is the most bird-like bird 
In clustering, we find these ideals as cluster centers 
Data points are allocated to their nearest center 
 
 
Squared Distance 
Common choice: squared distance \[
d_{ij} = \left(\mathbf{x}_i - \boldsymbol{\mu}_j\right)^\top\left(\mathbf{x}_i - \boldsymbol{\mu}_j\right)
\]  
Goal: find centres close to many data points 
 
 
Objective Function 
Given a similarity measure, we need the number of cluster centres, \(K\). 
Find their locations by allocating each centre a subset of the points and minimising the sum of squared errors, \[
E(\mathbf{M}) = \sum_{j=1}^K \sum_{i \in \mathbf{i}_j} \left(\mathbf{x}_i - \boldsymbol{\mu}_j\right)^\top\left(\mathbf{x}_i - \boldsymbol{\mu}_j\right)
\]  where \(\mathbf{i}_j\) is the set of indices of the data points allocated to the \(j\)th centre. 
 
 
\(k\) -Means Clustering
\(k\)-means clustering is very sensitive to initialisation. 
 
 
Initialisation 
Initialisation is the process of selecting a starting set of parameters. 
Optimisation result can depend on the starting point. 
For \(k\) -means clustering you need to choose an initial set of centres. 
The optimisation surface has many local optima; the algorithm gets stuck in those near the initialisation. 
 
 
\(k\) -means Algorithm
Simple iterative clustering algorithm 
Key steps:
Initialize with random centres 
Assign points to nearest center 
Update centres as cluster means 
Repeat until stable 
  
 
 
Objective Function 
Minimizes sum of squared distances: \[
E=\sum_{j=1}^K \sum_{i\ \text{allocated to}\ j}  \left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)^\top\left(\mathbf{ y}_{i, :} - \boldsymbol{ \mu}_{j, :}\right)
\]  
Solution not guaranteed to be global or unique 
Represents a non-convex optimization problem 
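
As a concrete illustration, here is a minimal numpy sketch of the iterative algorithm above; the synthetic three-cluster data and the choice \(K=3\) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(Y, K, max_iters=100, seed=0):
    """Simple k-means: returns centres and allocations for data Y (n x d)."""
    rng = np.random.default_rng(seed)
    # Initialise centres by picking K random data points.
    centres = Y[rng.choice(len(Y), K, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its nearest centre (squared Euclidean distance).
        d2 = ((Y[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        allocations = d2.argmin(axis=1)
        # Update each centre to the mean of its allocated points.
        new_centres = np.array([
            Y[allocations == j].mean(axis=0) if np.any(allocations == j) else centres[j]
            for j in range(K)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, allocations

# Example usage with synthetic data (three well-separated clusters).
rng = np.random.default_rng(42)
Y = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in ([0, 0], [5, 5], [0, 5])])
centres, allocations = kmeans(Y, K=3)
```

Because the objective is non-convex, different initial centres can give different allocations.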
 
 
Linkage Criteria 
Single linkage: distance between closest points in clusters 
Complete linkage: distance between farthest points in clusters 
Average linkage: average distance between all point pairs 
Ward linkage: minimises within-cluster variance increase 
 
Ward’s Criterion 
Ward’s method minimizes the increase in within-cluster variance 
Start with every data point as its own cluster of size 1. 
At each step merge the pair of clusters that minimises \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]  
Where \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\)  are cluster centroids and \(n_i, n_j\)  are cluster sizes 
 
 
Mathematical Derivation 
Within-cluster sum of squares (\(E(\mathbf{M})\) ): 
\(E(\mathbf{M}) = \sum_{i=1}^k \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2\)  
Ward’s criterion: minimise the increase in \(E(\mathbf{M})\) when merging 
This is the same criterion as used for the objective in \(k\)-means 
Key insight:  For spherical clusters, this is equivalent to minimising centroid distance weighted by cluster sizes 
 
Mathematical Derivation of Ward Distance 
Start with two clusters \(C_i\)  and \(C_j\)  with centroids \(\boldsymbol{ \mu}_i, \boldsymbol{ \mu}_j\)  
After merging: new centroid \[
\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}
\]  
Increase in \(E(\mathbf{M})\) : \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]  
 
 
Mathematical Derivation - II 
Step 1:  Original \(E(\mathbf{M})\)  for separate clusters \[
E(\mathbf{M})_{\text{original}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_j\|^2
\] Step 2:  New \(E(\mathbf{M})\)  after merging \[
E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i \cup C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2
\] Step 3:  Increase in \(E(\mathbf{M})\)  \[
\Delta_{i,j} = E(\mathbf{M})_{\text{new}} - E(\mathbf{M})_{\text{original}}
\]  
 
Mathematical Derivation - III 
Expanding the new \(E(\mathbf{M})\) :  \[
E(\mathbf{M})_{\text{new}} = \sum_{\mathbf{ x}\in C_i} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 + \sum_{\mathbf{ x}\in C_j} \|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2
\] Key identity 
\[\begin{aligned}
\|\mathbf{ x}- \boldsymbol{ \mu}_{ij}\|^2 = & \|\mathbf{ x}- \boldsymbol{ \mu}_i\|^2 \\ & + \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + 2(\mathbf{ x}- \boldsymbol{ \mu}_i)^\top(\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij})
\end{aligned}\]  
 
Mathematical Derivation - IV 
After simplification
\[\begin{aligned}
E(\mathbf{M})_{\text{new}} = & E(\mathbf{M})_{\text{original}} \\ & + n_i \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_{ij}\|^2 \\ & + n_j \|\boldsymbol{ \mu}_j - \boldsymbol{ \mu}_{ij}\|^2
\end{aligned}\]  
 
 
Mathematical Derivation - V 
Substituting the centroid formula \[
\boldsymbol{ \mu}_{ij} = \frac{n_i \boldsymbol{ \mu}_i + n_j \boldsymbol{ \mu}_j}{n_i + n_j}
\]  
Final result:  \[
\Delta_{i,j} = \frac{n_i n_j}{n_i + n_j} \|\boldsymbol{ \mu}_i - \boldsymbol{ \mu}_j\|^2
\]  
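
As a sanity check (not part of the original slides), the closed-form \(\Delta_{i,j}\) can be compared numerically against the direct increase in \(E(\mathbf{M})\); the two synthetic clusters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
Ci = rng.normal(0.0, 1.0, size=(30, 5))   # cluster i, n_i = 30
Cj = rng.normal(3.0, 1.0, size=(50, 5))   # cluster j, n_j = 50

def sse(C):
    """Within-cluster sum of squares about the centroid."""
    return ((C - C.mean(axis=0)) ** 2).sum()

# Direct computation: E(M) after merging minus E(M) of the separate clusters.
direct = sse(np.vstack([Ci, Cj])) - (sse(Ci) + sse(Cj))

# Closed-form Ward increase.
ni, nj = len(Ci), len(Cj)
mui, muj = Ci.mean(axis=0), Cj.mean(axis=0)
closed_form = ni * nj / (ni + nj) * np.sum((mui - muj) ** 2)

print(direct, closed_form)   # the two values agree up to floating point error
```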
 
Local Optimality 
Spherical clusters have minimal \(E(\mathbf{M})\)  for given number of points 
Ward’s method makes locally  optimal choices at each merge step 
The weighting \(\frac{n_i n_j}{n_i + n_j}\)  prevents premature merging of large clusters 
 
 
Algorithm 
Step 1: Start with \(n\) clusters (one per point) 
Step 2: Compute distance matrix between all clusters 
Step 3: Find and merge the two closest clusters 
Step 4: Update distance matrix 
Step 5: Repeat until one cluster remains 
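
In practice the algorithm is available off the shelf; below is a sketch using scipy's Ward linkage on synthetic three-cluster data (the data and the cut into three clusters are illustrative assumptions).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Three well-separated synthetic clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in ([0, 0], [4, 0], [2, 4])])

Z = linkage(X, method='ward')                    # merge history using Ward's criterion
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into three clusters

dendrogram(Z, no_labels=True)
plt.show()
```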
 
Ward’s Criterion on Artificial Data 
Hierarchical clustering of some artificial data. On the left we have an artificially generated data set containing three clusters. On the right we can see the dendrogram formed by clustering using Ward’s criterion.
 
 
Oil Flow Data 
Visualization of the first two dimensions of the oil flow data from Bishop and James (1993) 
 
 
Hierarchical Clustering of Oil Flow Data 
Hierarchical clustering applied to oil flow data. The dendrogram shows how different flow regimes are grouped based on their measurement similarities. The three main flow regimes (homogeneous, annular, and laminar) should form distinct clusters.
 
 
Phylogenetic Trees 
Hierarchical clustering of genetic sequence data 
Creates evolutionary trees showing species relationships 
Estimates common ancestors and mutation timelines 
Critical for tracking viral evolution and outbreaks 
 
 
Product Clustering 
Hierarchical clustering for e-commerce products 
Creates product taxonomy trees 
Splits into nested categories (e.g. Electronics → Phones → Smartphones) 
 
 
Hierarchical Clustering Challenge 
Many products belong in multiple clusters (e.g. running shoes are both ‘sporting goods’ and ‘clothing’) 
Tree structures are too rigid for natural categorization 
Human concept learning is more flexible:
Forms overlapping categories 
Learns abstract rules 
Builds causal theories 
  
 
 
Thinking in High Dimensions 
 
Mixtures of Gaussians 
Two dimensional Gaussian data set.
 
 
Mixtures of Gaussians 
Two dimensional data sets. Complex structure not a problem for mixtures of Gaussians.
 
 
Thinking in High Dimensions 
Two dimensional plots of Gaussians can be misleading. 
Our low dimensional intuitions can fail dramatically. 
Two major issues:
In high dimensions all the data moves to a ‘shell.’ There is nothing near the mean! 
Distances between points become constant. 
These effects apply to many densities. 
  
Let’s consider a Gaussian “egg.” 
 
 
Distance from a Mean 
Distance from mean of the density (circle) to a given data point (square).
 
 
Dimensionality Greater than Three 
 
1D Egg 
Volumes associated with the one dimensional Gaussian egg. Here the yolk has 65.8%, the green has 4.8% and the white has 29.4% of the mass.
 
Here the yolk  has 65.8%, the green  has 4.8% and the white has 29.4%
 
 
2D Egg 
Volumes associated with the regions in the two dimensional Gaussian egg. The yolk contains 59.4%, the green contains 7.4% and the white 33.2%.
 
Here the yolk  has 59.4%, the green  has 7.4% and the white has 33.2%
 
 
3D Egg 
Volumes associated with the regions in the three dimensional Gaussian egg. Here the yolk has 56.1%, the green has 9.2% and the white 34.7%.
 
Here the yolk  has 56.1%, the green  has 9.2% and the white has 34.7%
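
The egg proportions quoted above can be reproduced from the chi-squared form of the squared norm (derived later in these slides); a minimal sketch assuming \(\sigma=1\) and yolk/white boundaries at \(0.95\sqrt{d}\,\sigma\) and \(1.05\sqrt{d}\,\sigma\), which recovers the percentages in the captions.

```python
from scipy.stats import chi2

# For a d-dimensional standard Gaussian the squared norm is chi^2 with d degrees
# of freedom; the yolk is ||y|| < 0.95*sqrt(d), the white is ||y|| > 1.05*sqrt(d).
for d in (1, 2, 3):
    yolk = chi2.cdf(d * 0.95**2, df=d)
    green = chi2.cdf(d * 1.05**2, df=d) - yolk
    white = 1.0 - yolk - green
    print(f"d={d}: yolk {yolk:.1%}, green {green:.1%}, white {white:.1%}")
# d=1: yolk 65.8%, green 4.8%, white 29.4%  (matching the figures above)
```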
 
 
Gamma Properties 
Mean: \(\frac{a}{b}\) , 
Variance: \(\frac{a}{b^{2}}\)  
Support: \(x > 0\)  (positive numbers only) 
 
 
Special Cases 
Exponential: \(a=1\)  \(\rightarrow\)  exponential distribution with rate \(b\)  
\(\chi^2\)  (chi-squared, 1 df) \(a=\frac{1}{2}, b=\frac{1}{2}\)  \(\rightarrow\)  \(\chi_{1}^{2}\left(x\right)\)  
 
Other Important Applications 
Conjugate prior for Gaussian precision (inverse variance) 
Modeling waiting times and lifetimes 
 
 
Additive Property 
If \[
  x_k \sim \text{Gamma}(a_k, b)
  \]  for \(k=1,\ldots,d\)  
Then \[
  \sum_{k=1}^dx_k \sim \text{Gamma}(\sum_{k=1}^d a_k, b)
  \]  
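
A quick sampling check of the additive property (a sketch, not from the slides); the shapes and the shared rate are arbitrary, and note that numpy and scipy parameterise the gamma by shape and scale \(=1/b\), as the next slide warns.

```python
import numpy as np
from scipy.stats import gamma, kstest

b = 2.0                       # shared rate
shapes = [0.5, 1.5, 3.0]      # a_1, ..., a_d
n = 100_000

rng = np.random.default_rng(0)
# numpy uses shape/scale, so scale = 1/b.
samples = sum(rng.gamma(a, 1.0 / b, size=n) for a in shapes)

# The sum should be Gamma(sum of shapes, rate b); scipy also uses scale = 1/b.
print(kstest(samples, gamma(a=sum(shapes), scale=1.0 / b).cdf))
# A large p-value indicates the sampled sum is consistent with the predicted gamma.
```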
 
 
Warning: Parameterisation Confusion 
Rate parameter: \(b\)  (used here) 
Scale parameter: \(\beta = b^{-1}\)  (alternative) 
Watch out for different software conventions. 
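
For example, with the rate convention used in these slides, both scipy and numpy expect the scale \(1/b\); the values of \(a\) and \(b\) below are arbitrary.

```python
import numpy as np
from scipy.stats import gamma

a, b = 3.0, 2.0                 # shape a and *rate* b, as in these slides

dist = gamma(a, scale=1.0 / b)  # scipy.stats.gamma expects shape and scale = 1/b
print(dist.mean(), dist.var())  # a/b = 1.5 and a/b^2 = 0.75, matching the formulas above

x = np.random.default_rng(0).gamma(a, 1.0 / b, size=10_000)  # numpy: shape, scale
print(x.mean())                 # close to 1.5
```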
 
 
Mathematics 
What is the density of probability mass? 
For a \(d\) -dimensional Gaussian distribution:
Individual components : \[
y_{i,k} \sim \mathscr{N}\left(0,\sigma^2\right)
\] 
 
Mathematics 
What is the density of probability mass? 
For a \(d\) -dimensional Gaussian distribution:
Squared components : \[
y_{i,k}^2 \sim \sigma^2 \chi_1^2
\]  (scaled chi-squared)
 
Mathematics 
What is the density of probability mass? 
For a \(d\) -dimensional Gaussian distribution:
Gamma distribution : \[
y_{i,k}^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{2\sigma^2}\right)
\] 
 
Mathematics 
What is the density of probability mass? 
For a \(d\) -dimensional Gaussian distribution:
Sum of squares : \[
\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{2\sigma^2}\right)
\] 
 
Mathematics 
What is the density of probability mass? 
For a \(d\) -dimensional Gaussian distribution:
Expected value :
\[
\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = d\sigma^2
\] 
 
Mathematics 
What is the density of probability mass? 
Normalized sum : \[
\frac{1}{d}\sum_{k=1}^dy_{i,k}^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{2\sigma^2}\right)
\] 
 
Mathematics 
What is the density of probability mass? 
Expected normalised : \[
\frac{1}{d}\left\langle\sum_{k=1}^dy_{i,k}^2\right\rangle = \sigma^2
\] 
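
A sampling sketch of the last two results, showing the normalised squared distance concentrating around \(\sigma^2\) as \(d\) grows; the value \(\sigma^2=4\) and the dimensions used are arbitrary choices.

```python
import numpy as np
from scipy.stats import gamma

sigma2 = 4.0
rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    Y = rng.normal(0.0, np.sqrt(sigma2), size=(5000, d))
    r2 = (Y ** 2).mean(axis=1)   # normalised squared distance (1/d) sum_k y_{i,k}^2
    # Predicted density: Gamma(shape d/2, rate d/(2 sigma^2)), i.e. scale 2 sigma^2 / d.
    pred = gamma(a=d / 2, scale=2 * sigma2 / d)
    print(d, r2.mean(), r2.var(), pred.mean(), pred.var())
# The means stay at sigma^2 = 4 while the variance shrinks like 1/d:
# the mass moves onto a thin shell around sqrt(d) * sigma.
```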
 
Distribution of Mass against Dimensionality 
 
chi-squared Distributions 
 
\(z^2\) 
The \(\chi^2\)  distribution which gives the distribution of the square of a standardised normal variable \(z^2\) .
 
 
\(z_1^2 + z_2^2\) 
The scaled \(\chi^2\), equivalent to the sum of the squares of two standardised normal variables, \(z_1^2 + z_2^2\).
 
 
\(\sum_{i=1}^5 z^2_i\) 
The scaled \(\chi^2\), equivalent to the sum of the squares of five standardised normal variables, \(\sum_{i=1}^5 z^2_i\).
 
 
Square of Sample 
The square of a sample from a Gaussian follows a scaled chi-squared density 
The scaled \(\chi^2\) density is a gamma density with \(a=\frac{1}{2}\), \(b=\frac{1}{2\sigma^{2}}\), \[
  \mathscr{G}\left(x|a,b\right)=\frac{b^{a}}{\Gamma\left(a\right)}x^{a-1}e^{-bx}
  \]  
 
Distance Distributions 
The sum of gamma random variables with the same rate is gamma with the sum of the shape parameters (the \(y_{i,k}\) are independent) 
Scaling a gamma random variable by a constant divides the rate parameter by that constant 
 
 
Where is the Mass? 
Plot of probability mass versus dimension. The plot shows the volume of density inside 0.95 of a standard deviation (yellow), between 0.95 and 1.05 standard deviations (green), and beyond 1.05 standard deviations (white).
 
Proportions of volumes between yolk, green and white as \(d\rightarrow 1024\)  (log scale)
 
 
Looking at Gaussian Samples 
Looking at a projected Gaussian. This plot shows, in two dimensions, samples from a potentially very high dimensional Gaussian density. The mean of the Gaussian is at the origin. There appears to be a lot of data near the mean, but when we bear in mind that the original data was sampled from a much higher dimensional Gaussian we realize that the data has been projected down to the mean from those other dimensions that we are not visualizing.
 
 
High Dimensional Gaussians and Interpoint Distances 
 
Interpoint Distances 
The other effect in high dimensions is that all points become equidistant. 
We can show this for Gaussians with an argument similar to the one above. 
 
 
Interpoint Distance Analysis 
For two points \(i\)  and \(j\)  in d-dimensional space:
Individual components: \(y_{i,k} \sim \mathcal{N}(0, \sigma_k^2)\)  and \(y_{j,k} \sim \mathcal{N}(0, \sigma_k^2)\)  
Difference: \(y_{i,k} - y_{j,k} \sim \mathcal{N}(0, 2\sigma_k^2)\)  
Squared difference: \((y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{1}{2},\frac{1}{4\sigma_k^2}\right)\)  
 
Interpoint Distance Analysis 
For spherical Gaussian where \(\sigma_k^2 = \sigma^2\) :
Sum of squared differences : \[
\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{1}{4\sigma^2}\right)
\] 
 
Interpoint Distance Analysis 
Normalised distance : \[
\frac{1}{d}\sum_{k=1}^d(y_{i,k} - y_{j,k})^2 \sim \mathscr{G}\left(\frac{d}{2},\frac{d}{4\sigma^2}\right)
\] 
 
Key Results 
Mean squared distance: \(2\sigma^2\)  
Variance: \(\frac{8\sigma^4}{d}\)  (decreases with dimension!) 
All points become equidistant as \(d\to \infty\)  
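
An empirical sketch of these results for spherical Gaussian data; the sample sizes, dimensions and \(\sigma^2=1\) are arbitrary choices.

```python
import numpy as np
from scipy.spatial.distance import pdist

sigma2 = 1.0
rng = np.random.default_rng(0)

for d in (10, 100, 1000):
    Y = rng.normal(0.0, np.sqrt(sigma2), size=(500, d))
    r2 = pdist(Y, metric='sqeuclidean') / d   # normalised squared interpoint distances
    print(d, r2.mean(), r2.var())
# The mean stays near 2*sigma^2 while the variance decays like 8*sigma^4/d:
# in high dimensions all pairs of points look (almost) equidistant.
```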
 
 
Central Limit Theorem and Non-Gaussian Case 
Analytic for spherical, independent Gaussian data. 
For independent data, the central limit theorem applies. 
The mean squared distance in high D is the mean of the variances. 
The variance about the mean scales as \(d^{-1}\) . 
 
 
Summary 
In high dimensions, if the individual dimensions are independent, the distributions behave counter-intuitively. 
All data sits at one standard deviation from the mean. 
The densities of squared distances can be analytically calculated for the Gaussian case. 
 
 
Summary 
For non-Gaussian  systems we can invoke the central limit theorem. 
Next we will consider example data sets and see how their interpoint distances are distributed. 
 
 
Sanity Check 
Data sampled from independent Gaussian distribution 
If dimensions are independent, we expect low variance, Gaussian behaviour for the distribution of squared distances. 
 
 
Sanity Check 
A good match between theory and the samples for a 1000 dimensional Gaussian distribution.
 
Distance distribution for a Gaussian with \(d=1000\) , \(n=1000\)  with theoretical curve
 
 
Sanity Check 
Same data generation, but fewer data points. 
If dimensions are independent, we expect low variance, Gaussian behaviour for the distribution of squared distances. 
 
 
Sanity Check 
A good match between theory and the samples for a 1000 dimensional Gaussian distribution with 100 points.
 
Distance distribution for a Gaussian with \(d=1000\)  and \(n=100\)  with theoretical curve
 
Simulation of oil flow
 
 
Histogram for Simulation of oil flow 
 
OSU Motion Capture Data: Run 1 
Ohio State University motion capture
 
 
Histogram for Ohio State University motion capture 
 
Robot Wireless Data 
Ground truth movement for the position taken while recording the multivariate time-course of wireless access point signal strengths.
 
 
Robot WiFi Data 
Output dimension 1 from the robot wireless data. This plot shows signal strength changing over time.
 
Robot wireless navigation
 
 
Histogram for Robot wireless navigation 
 
Where does practice depart from our theory? 
The situation for real data does not reflect what we expect. 
Real data exhibits greater variance in interpoint distances.
Somehow the real data seems to have a smaller effective dimension. 
  
Let’s look at another Gaussian with \(d=1000\). 
 
 
1000-D Gaussian 
Distance distribution for a different Gaussian with \(d=1000\)  
 
Interpoint squared distance distribution for Gaussian with \(d=1000\)  but low rank covariance.
 
 
Interpoint squared distance distribution for Gaussian with \(d=1000\)  and theoretical curve for \(d=2\) .
 
 
Gaussian has a specific low rank covariance matrix \(\mathbf{C}=\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}\) .
Take \(\sigma^{2}=10^{-2}\)  and sample the elements of \(\mathbf{W}\in\Re^{1000\times2}\)  from \(\mathscr{N}\left(0,1\right)\) .
Theoretical curve taken assuming dimensionality of 2.
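
A sketch of this construction (the seed and sample size are arbitrary): sampling \(\mathbf{y}=\mathbf{W}\mathbf{z}+\boldsymbol{\epsilon}\) gives covariance \(\mathbf{W}\mathbf{W}^{\top}+\sigma^{2}\mathbf{I}\), and the normalised interpoint distances spread out far more than the full-rank \(d=1000\) theory predicts.

```python
import numpy as np
from scipy.spatial.distance import pdist

d, q, n = 1000, 2, 500
sigma2 = 1e-2
rng = np.random.default_rng(0)

W = rng.normal(0.0, 1.0, size=(d, q))     # low-rank loadings
# Sample y = W z + noise, equivalent to a Gaussian with covariance W W^T + sigma^2 I.
Z = rng.normal(size=(n, q))
Y = Z @ W.T + rng.normal(0.0, np.sqrt(sigma2), size=(n, d))

r2 = pdist(Y, metric='sqeuclidean') / d   # normalised squared interpoint distances
print(r2.mean(), r2.std())
# The spread is far larger than the d=1000 theory predicts; the distances behave
# more like those of a two-dimensional Gaussian, reflecting the low effective dimension.
```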