ST443 Machine Learning and data mining

Week 11


Philipp Sterzinger

p.sterzinger@lse.ac.uk

11 December 2025

Question 1

  1. PCA.

    1. Let \(\lambda, v\) be an eigenvalue-eigenvector pair of the sample covariance matrix \[ \hat{\Sigma} = \frac{1}{n} X^\top X \,, \] such that \(\lambda > 0\). Show that \(v\) can be expressed as a linear combination of \(x_1, \ldots, x_n\), i.e.  \[ v_j = \sum_{i = 1}^n a_i X_{i,j}, \quad j = 1,\ldots, p \,, \] where \(a_i \in \Re\) are some scalar coefficients.

Show

  • \(v = X^\top a\) for some \(a \in \Re^n\)

Know

  • \(\hat \Sigma v = \lambda v\)
  • \(\hat \Sigma = X^\top X / n\)
  • \(\lambda > 0\)

Question 1.a

Quick solution

\[ \begin{aligned} \hat{\Sigma} v &= \lambda v \\ \iff \frac{1}{n} X^\top X v &= \lambda v \\ \iff X^\top \left(\frac{1}{\lambda n}Xv\right) &= v \,, \end{aligned} \]

so that \[ v = X^\top a, \quad \text{for } a = \frac{1}{\lambda n} X v \,. \]

Question 1.a

Detailed solution

Define

  • \(\operatorname{im}(\hat \Sigma) = \{w \in \Re^p: w = \hat{\Sigma} v, \, v \in \Re^p \}\)
  • \(\operatorname{im}(X^\top) = \{z \in \Re^p: z = X^\top a, \, a \in \Re^n \}\)

Then for any \(v \in \Re^p\):

\[ \hat{\Sigma} v = \lambda v \in \operatorname{im}(\hat\Sigma) \subseteq \operatorname{im}(X^\top) \implies v = \frac{1}{\lambda}\hat\Sigma v \in \operatorname{im}(\hat\Sigma) \subseteq \operatorname{im}(X^\top) \,/ \]

Question 1.a

Detailed solution

\[ v \in \operatorname{im}(X^\top) \]

means there exists a \(a \in \Re^n: v = X^\top a\)

\[ X^\top a = \begin{bmatrix} \rule[0.5ex]{2.5ex}{0.5pt} & X_{\bullet, 1}^\top & \rule[0.5ex]{2.5ex}{0.5pt} \\ \rule[0.5ex]{2.5ex}{0.5pt} & X_{\bullet, 2}^\top & \rule[0.5ex]{2.5ex}{0.5pt} \\ &\vdots& \\ \rule[0.5ex]{2.5ex}{0.5pt} & X_{\bullet, p}^\top & \rule[0.5ex]{2.5ex}{0.5pt} \\ \end{bmatrix} \begin{bmatrix} a_1\\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \sum a_i X_{i, \bullet} \,, \]

where \(X_{\bullet, i}\) and \(X_{i, \bullet}\) are the \(i\)th column and row of \(X\), respectively.

Learnings

Connection of two aspects of PCA:

  1. Find a direction \(v\) in the column span of \(X\) that maximises variance, i.e. \(v = X^\top a\)

  2. These directions correspond to the eigenvectors of \(\hat{\Sigma}\)

Question 1.b

In Lecture 9, we considered an example where a covariance matrix was defined as \[ \Sigma = I_p + \theta e_1 e_1^\top \,, \] where \(I_p\) is the \(p \times p\) identity matrix and \(e_1 = [1, 0, \ldots,0]^\top \in \Re^p\) is the first standard basis vector.

It was claimed that \(e_1\) is the prinicpal eigenvector pertaining to eigenvalue \(1 + \theta\).

  1. Show that this is true.

  2. Furthermore, determine the other eigenvalues and the corresponding eigenvectors of \(\Sigma\).

  3. Provide the set of orthonormal eigenvectors of \(\Sigma\).

Show

  • orthonormal set of eigenvectors

Know

  • \((\lambda_i, v_i): \Sigma v_i = \lambda_i v_i\), \(v_i \neq 0_p\)
  • \(\Sigma = I_p + \theta e_1 e_1^\top\)

Question 1.b

\[ \Sigma v = \lambda v \iff (I_p + \theta e_1 e_1^\top - \lambda I_p) v = 0 \,. \]

\[ \begin{bmatrix} (1 + \theta - \lambda) & 0 & 0 & \cdots & 0 \\ 0 & (1 - \lambda) & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & \cdots & (1 - \lambda) \\ \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{bmatrix} = \begin{bmatrix} v_1 (1 + \theta - \lambda) \\ v_2 ( 1 - \lambda) \\ \vdots \\ v_p(1 - \lambda) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]

  1. \(v_1 \neq 0\): \(\lambda = 1 + \theta\) and thus must have \(v_i = 0\) for \(i > 1\)

  2. \(v_i \neq 0\) for \(i > 1\): \(\lambda = 1\) and thus must have \(v_1 = 0\)

Question 1.b

Eigenvalues & Eigenvectors

  1. \(\lambda_1 = (1 + \theta)\), all corresponding eigenvectors are of the form \[ v = \begin{bmatrix} v_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad v_1 \in \Re \]
  2. \(\lambda_2 = 1\), all corresponding eigenvectors are of the form \[ v = \begin{bmatrix} 0\\ v_2 \\ v_3 \\ \vdots \\ v_p \end{bmatrix}, \quad v_i \in \Re \]

Question 1.b

Orthonormal basis

Find \(p\)-linearly independent eigenvectors \(v_1, \ldots, v_p\) such that:

  1. \(\| v_i \|_2^2 = 1\)
  2. \(v_i^\top v_j = 0\), \(i \neq j\)

Choose the standard basis vectors \(e_1, \ldots, e_p\).

Learnings

  • Finding eigenvectors
  • Understanding orthonormal basis of eigenvectors, that is used for PCA

Question 1.c

In Lecture 9, we discussed the distribution of between-class and within-class pairwise distances between points, where the points are distributed according to either \(\mathrm{N}(c, I_d)\) or \(\mathrm{N}(-c, I_d)\), with \[ \mu = \begin{bmatrix} a \\ 0 \\ \vdots \\ 0 \end{bmatrix} \in \Re^d \] for some \(a \in \Re\). Let \(X_1\) and \(X_2\) be two independent points such that \(X_1 \sim \mathrm{N}(\mu, I_d)\) and \(X_2 \sim \mathrm{N}(-\mu, I_d)\). Note that \[ \mathrm{E}[\| X_1 - X_2 \|_2^2] \] is the expected value of the squared between-class pairwise distance.

  1. Derive an expression for \(\mathrm{E}[\| X_1 - X_2 \|_2^2]\) in terms of \(a\) and \(d\).
  2. Comment on the expression in relation to the curse of dimensionality

Question 1.c

Show

  • Expression for \(\mathrm{E}[\| X_1 - X_2 \|_2^2]\)

Know

  • Distribution of \(X_1, X_2\)

Question 1.c

\[ \begin{aligned} \| X_1 - X_2 \|_2^2 &= \sum_{i = 1}^d \left(X_{1,i} - X_{2,i}\right)^2 \\ &= \sum_{i = 1}^d \left((\mu_i + z_i) - (-\mu_i + z_i')\right)^2, \quad z_i,z_i' \sim \mathrm{N}(0,1) \\ &= (2a + [z_1 - z_2'])^2 + \sum_{i = 2}^d \left(z_i - z_i'\right)^2 \\ &= 4a^2 + 4a (z_1 - z_2) + \sum_{i = 1}^d \left(z_i - z_i'\right)^2 \\ \end{aligned} \]

\[ \mathrm{E}[\| X_1 - X_2 \|_2^2] = \mathrm{E}\left[4a^2 + 4a (z_1 - z_2) + \sum_{i = 1}^d \left(z_i - z_i'\right)^2 \right] = 4a^2 + 2d \,. \]

Question 1.c

Curse of dimensionality

\[ \mathrm{E}[\| X_1 - X_1' \|_2^2] = 2d \,. \]

\[ \lim\limits_{d \to \infty} \, \frac{\mathrm{E}[\| X_1 - X_2 \|_2^2]}{\mathrm{E}[\| X_1 - X_1' \|_2^2]} = 1 \,. \]

  • As \(d \to \infty\), the noise in the uninformative coordinates dominates the signal and the expected within and between class distance become more and more similar

Learnings

  • Details on example for curse of dimensionality that we have seen in the context of \(k\)-NN and PCA
  • This is not the curse of dimensionality of PCA, but motivates the use of (sparse) PCA in high-dimensional data when using e.g. \(k\)-NN

Question 2

The \(K\)-means problem can be stated as follows: given an integer \(K\) and a set of data points \(\mathcal X = \{x_1,\ldots, x_n\}\), where \(x_i \in \Re^p\) for every \(i = 1, \ldots , n\), find a clustering \(C = (C_1,\ldots,C_K)\) and centres \(c = (c_1,\ldots,c_K)\), where \(c_i \in \Re^p\) for every \(i = 1, \ldots ,K\), that minimise the within-cluster sum of squares \[ W(C,c) \;=\; \sum_{i=1}^K \sum_{x \in C_i} \|x-c_i\|_2^2 \,. \]

For fixed centres \(c\), define the induced objective \[ W(c) \;=\; \min_C W(C,c) \;=\; \sum_{x \in \mathcal{X}} \min_{i \in \{1,\ldots,K\}} \|x-c_i\|_2^2 \,. \]

Question 2

The \(K\)-means clustering algorithm is defined as follows:

  1. Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily
Code
library(plotly)

set.seed(123)

n <- 60
sigma <- 1
x1 <- cbind(rnorm(n, -2, sigma), rnorm(n, -2, sigma))
x2 <- cbind(rnorm(n, 0, sigma), rnorm(n, 2, sigma))
x3 <- cbind(rnorm(n, 2, sigma), rnorm(n, 0, sigma))
X <- rbind(x1, x2, x3)
X_df <- data.frame(id = 1:nrow(X), x = X[,1], y = X[,2])

K <- 3

cluster0 <- sample(rep(1:K, length.out = nrow(X_df)))

update_centers <- function(df, cluster_vec, centers_old) {
  centers_new <- centers_old
  for (k in 1:nrow(centers_old)) {
    pts <- df[cluster_vec == k, c("x","y"), drop = FALSE]
    if (nrow(pts) > 0) {
      centers_new[k,] <- colMeans(pts)
    }
  }
  centers_new
}

centers0 <- update_centers(X_df, cluster0, matrix(0, nrow = K, ncol = 2))

p_data <- plot_ly(
  X_df,
  x = ~x,
  y = ~y,
  type = "scatter",
  mode = "markers"
) %>%
  layout(
    title = "Data",
    xaxis = list(title = "x"),
    yaxis = list(title = "y")
  )

p_data
Code
p_random <- plot_ly() %>%
  add_markers(
    data = data.frame(X_df, cluster = factor(cluster0)),
    x = ~x,
    y = ~y,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    name = "points"
  ) %>%
  add_markers(
    data = data.frame(
      x = centers0[,1],
      y = centers0[,2],
      cluster = factor(1:K)
    ),
    x = ~x,
    y = ~y,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    marker = list(symbol = "x", size = 16),
    name = "centers"
  ) %>%
  layout(
    title = "Random initial assignment",
    xaxis = list(title = "x"),
    yaxis = list(title = "y")
  )

p_random

Question 2

  1. Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily

  2. For each \(i \in \{1,2, \ldots, K \}\), set the cluster \(C_i\) to be the set of points in \(\mathcal{X}\) that are closer to \(c_i\) than to \(c_j\) for all \(j \neq i\)

Code
p_random 
Code
dist_mat0 <- sapply(
  1:K,
  function(k) (X_df$x - centers0[k,1])^2 + (X_df$y - centers0[k,2])^2
)
cluster1 <- max.col(-dist_mat0)
centers1 <- update_centers(X_df, cluster1, centers0)

p_assign <- plot_ly() %>%
  add_markers(
    data = data.frame(X_df, cluster = factor(cluster1)),
    x = ~x,
    y = ~y,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    name = "points"
  ) %>%
  add_markers(
    data = data.frame(
      x = centers0[,1],
      y = centers0[,2],
      cluster = factor(1:K)
    ),
    x = ~x,
    y = ~y,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    marker = list(symbol = "x", size = 16),
    name = "centers"
  ) %>%
  layout(
    title = "Step 2: assign to closest centroid",
    xaxis = list(title = "x"),
    yaxis = list(title = "y")
  )

p_assign

Question 2

  1. Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily

  2. For each \(i \in \{1,2, \ldots, K \}\), set the cluster \(C_i\) to be the set of points in \(\mathcal{X}\) that are closer to \(c_i\) than to \(c_j\) for all \(j \neq i\)

  3. For each \(i \in \{1,2, \ldots, K \}\), set \(c_i\) to be the centre of mass of all points in \(C_i\) i.e., \(c_i = \tfrac{1}{\lvert C_i \rvert}\sum_{x \in C_i}x\)

Code
p_assign
Code
p_update <- plot_ly() %>%
  add_markers(
    data = data.frame(X_df, cluster = factor(cluster1)),
    x = ~x,
    y = ~y,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    name = "points"
  ) %>%
  add_markers(
    data = data.frame(
      x = centers1[,1],
      y = centers1[,2],
      cluster = factor(1:K)
    ),
    x = ~x,
    y = ~y,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    marker = list(symbol = "x", size = 16),
    name = "centers"
  ) %>%
  layout(
    title = "Step 3: recompute centroids",
    xaxis = list(title = "x"),
    yaxis = list(title = "y")
  )

p_update

Question 2

  1. Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily

  2. For each \(i \in \{1,2, \ldots, K \}\), set the cluster \(C_i\) to be the set of points in \(\mathcal{X}\) that are closer to \(c_i\) than to \(c_j\) for all \(j \neq i\)

  3. For each \(i \in \{1,2, \ldots, K \}\), set \(c_i\) to be the centre of mass of all points in \(C_i\) i.e., \[ c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x \]

  • Repeat Steps 2 and 3 until \(c\) no longer changes

Question 2

Code
iter_max <- 10

frames_points <- data.frame()
frames_centers <- data.frame()

frame0 <- "iter 0 - random assignment"
frames_points <- rbind(
  frames_points,
  data.frame(
    id = X_df$id,
    x = X_df$x,
    y = X_df$y,
    cluster = factor(cluster0),
    frame = frame0
  )
)
frames_centers <- rbind(
  frames_centers,
  data.frame(
    x = centers0[,1],
    y = centers0[,2],
    cluster = factor(1:K),
    frame = frame0
  )
)

current_clusters <- cluster0
current_centers <- centers0

for (iter in 1:iter_max) {
  dist_mat <- sapply(
    1:K,
    function(k) (X_df$x - current_centers[k,1])^2 + (X_df$y - current_centers[k,2])^2
  )
  new_clusters <- max.col(-dist_mat)
  
  frame_assign <- paste("iter", iter, "- assign to nearest center")
  frames_points <- rbind(
    frames_points,
    data.frame(
      id = X_df$id,
      x = X_df$x,
      y = X_df$y,
      cluster = factor(new_clusters),
      frame = frame_assign
    )
  )
  frames_centers <- rbind(
    frames_centers,
    data.frame(
      x = current_centers[,1],
      y = current_centers[,2],
      cluster = factor(1:K),
      frame = frame_assign
    )
  )
  
  new_centers <- update_centers(X_df, new_clusters, current_centers)
  frame_update <- paste("iter", iter, "- recompute centers")
  frames_points <- rbind(
    frames_points,
    data.frame(
      id = X_df$id,
      x = X_df$x,
      y = X_df$y,
      cluster = factor(new_clusters),
      frame = frame_update
    )
  )
  frames_centers <- rbind(
    frames_centers,
    data.frame(
      x = new_centers[,1],
      y = new_centers[,2],
      cluster = factor(1:K),
      frame = frame_update
    )
  )
  
  if (all(new_clusters == current_clusters)) {
    current_clusters <- new_clusters
    current_centers <- new_centers
    break
  }
  
  current_clusters <- new_clusters
  current_centers <- new_centers
}

frames_points$kind <- "point"
frames_centers$kind <- "center"

frames_all <- rbind(
  data.frame(
    x = frames_points$x,
    y = frames_points$y,
    cluster = frames_points$cluster,
    frame = frames_points$frame,
    kind = frames_points$kind
  ),
  data.frame(
    x = frames_centers$x,
    y = frames_centers$y,
    cluster = frames_centers$cluster,
    frame = frames_centers$frame,
    kind = frames_centers$kind
  )
)

frames_all$cluster <- factor(frames_all$cluster)
frames_all$kind <- factor(frames_all$kind, levels = c("point", "center"))

fig_anim <- plot_ly() %>%
  add_markers(
    data = subset(frames_all, kind == "point"),
    x = ~x,
    y = ~y,
    frame = ~frame,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    name = "points"
  ) %>%
  add_markers(
    data = subset(frames_all, kind == "center"),
    x = ~x,
    y = ~y,
    frame = ~frame,
    color = ~cluster,
    type = "scatter",
    mode = "markers",
    marker = list(symbol = "x", size = 16),
    name = "centers"
  ) %>%
  layout(
    title = "K-means: alternating assignment and centroid updates",
    xaxis = list(title = "x"),
    yaxis = list(title = "y")
  ) %>%
  animation_slider(
    currentvalue = list(prefix = "Step: ")
  )

fig_anim

Question 2.a

  1. Show that Step 2 of the \(K\)-means algorithm achieves \(W(c)\) for any given \(c\)

Show

  • \(W(c)\) value after Step 2

Know

  • Cost of a particular clustering \(C\) with centres \(c\): \[ W(C,c) = \sum_{i = 1}^K \sum_{x \in C_i} \| x - c_i \|_2^2 \]

  • Induced objective for fixed \(c\): \[ W(c) = \sum_{x \in \mathcal{X}} \min_{i \in \{1,\ldots,K\}} \| x - c_i \|_2^2 \]

  • Step 2: assign each \(x\) to its closest centre

Question 2.a

Costs

  • Recall that the cost of a particular clustering is

\[ W(C,c) = \sum_{i = 1}^K \sum_{x \in C_i} \| x - c_i \|_2^2 \,. \]

  • Given any set of centres, the lowest possible cost over all clusterings is: \[ W(c) = \min_C W(C,c) = \sum_{x \in \mathcal{X}} \min_{i \in \{1,\ldots,K\}} \| x - c_i \|_2^2 \,. \]

  • Hence, we are asked to show that given \(c\), Step 2 gives the clustering with lowest possible cost

Question 2.a

After Step 2: \[ x \in C_i \implies \| x - c_i \|_2^2 = \min_{j \in \{1,\ldots,K\}} \| x - c_j \|_2^2 \,. \]

Therefore: \[ \begin{aligned} W(C,c) &= \sum_{i = 1}^K \sum_{x \in C_i} \| x - c_i \|_2^2 \\ &= \sum_{i = 1}^K \sum_{x \in C_i} \min_{j \in \{1,\ldots,K\}} \| x - c_j \|_2^2 \\ &= \sum_{x \in \mathcal X} \min_{j \in \{1,\ldots,K\}} \| x - c_j \|_2^2 \\ &= W(c) \,. \end{aligned} \]

Question 2.b

Let \(S\) be a set of points with the centre of mass \[ c(S) = \frac{1}{\lvert S \rvert} \sum_{x \in S} x \,. \] Show that for any point \(z \in \Re^p\), \[ \sum_{x \in S} \| x - z \|_2^2 - \sum_{x \in S} \|x - c(S) \|_2^2 = \lvert S \rvert \|c(S) - z \|_2^2 \]

Show

  • \(\sum_{x \in S} \| x - z \|_2^2 - \sum_{x \in S} \|x - c(S) \|_2^2 = \lvert S \rvert \|c(S) - z \|_2^2\)

Know

  • expansion of squares

Question 2.b

For \(x \in S\), \(z \in \Re^p\),

\[ \begin{aligned} \| x - z \|_2^2 &= \| x- c(S) + (c(S) - z) \|_2^2 \\ &= (x- c(S) + (c(S) - z) )^\top (x- c(S) + (c(S) - z)) \\ &= (x - c(S))^\top (x - c(S)) + 2 (x - c(S))^\top (c(S) - z) + (c(S) - z)^\top (c(S) - z) \\ &= \| x - c(S) \|_2^2 + 2 (x - c(S))^\top (c(S) - z) + \| c(S) - z \|_2^2 \,. \end{aligned} \]

Therefore

\[ \begin{aligned} \sum_{x \in S} \| x - z \|_2^2 &= \sum_{x \in S} \left( \| x - c(S) \|_2^2 + 2 (x - c(S))^\top (c(S) - z) + \| c(S) - z \|_2^2 \right) \\ &= \sum_{x \in S} \| x - c(S) \|_2^2 + 2 \sum_{x \in S} (x - c(S))^\top (c(S) - z) + \lvert S \rvert \| c(S) - z \|_2^2 \\ \end{aligned} \]

Question 2.b

Finally note that

\[ \begin{aligned} \sum_{x \in S} (x - c(S))^\top (c(S) - z) &= (c(S) - z)^\top \left(\sum_{x \in S} (x - c(S)) \right) \\ &= (c(S) - z)^\top 0_p \,. \end{aligned} \]

Thus

\[ \sum_{x \in S} \| x - z \|_2^2 =\sum_{x \in S} \| x - c(S) \|_2^2 + \lvert S \rvert \| c(S) - z \|_2^2 \,, \]

as required

Question 2.c

  1. Use the identity in 2.b to show that Step 3 of the \(K\)-means algorithm decreases \(W(c)\) in every round where the algorithm does not terminate. We refer to the execution of Steps 2 and 3 as a round.

Show

  • Reduction in \(W(c)\) between execution rounds

Know

  • Question 2.b

Question 2.c

Fix a particular round \(t\) in the algorithm

Pre Step 2

  • We have clustering \(C^t = (C_1^t,\ldots,C_K^t)\) with centres \(c^t = (c_1^t,\ldots,c_K^t)\)

Post Step 2

  • Assigned points to closest centre \(c_i^t\), get new clusters \(C^{t+1}\)

  • From Q2.a, Step 2 is cost optimal clustering given centres \(c^t\), so \[ W(C^{t+1},c^t)=W(c^t) \] and hence \[ W(C^{t+1},c^t)\le W(C^t,c^t). \]

Question 2.c

Post Step 3

  • Have clustering \(C^{t+1}\) with updated centres \(c^{t+1}\) where \[ c_i^{t+1} = c(C_i^{t+1}) = \frac{1}{\lvert C_i^{t+1} \rvert} \sum_{x \in C_i^{t+1}} x \,. \]

  • Cost change post Step 2 to post Step 3 for each \(i\) (using Q2.b with \(S=C_i^{t+1}\), \(z=c_i^t\)): \[ \underbrace{\sum_{x \in C_i^{t+1}} \| x - c_i^t \|_2^2}_{\text{old}} - \underbrace{\sum_{x \in C_i^{t+1}} \| x - c_i^{t+1} \|_2^2}_{\text{new}} = \lvert C_i^{t+1} \rvert \| c_i^{t+1} - c_i^{t}\|_2^2 \geq 0 \,. \]

Question 2.c

Post Step 3

  • Summing over \(i\) gives a total decrease \[ W(C^{t+1},c^t)-W(C^{t+1},c^{t+1}) = \sum_{i=1}^K \lvert C_i^{t+1} \rvert \| c_i^{t+1} - c_i^t \|_2^2 \ge 0. \]

  • If \(\exists i: c_i^{t+1} \neq c_i^t\), then \(W(C^{t+1},c^{t+1}) < W(C^{t+1},c^t)=W(c^t)\), so \[ W(c^{t+1}) < W(c^t). \]

  • If \(c^{t+1} = c^t\), no update in Step 3, algorithm has converged (and Step 2 in the next round makes no change)

Question 2.d.

Show that the algorithm terminates, i.e. after a finite number of repetitions of Steps 2 and 3, \(c\) no longer changes

Show

  • Algorithm terminates

Know

  • Strict decrease in \(W(c)\) in any non-terminating round
  • There are only finitely many clusterings

Question 2.d.

  1. There are are only finitely many clusterings to consider
  2. In any non-terminating round, \(W(c)\) strictly decreases (by Q2.c), so the algorithm cannot revisit the same labelled clustering: if the same labelled clustering \((C_1,\ldots,C_K)\) is revisited, then Step 3 sets \(c_i\) to the same cluster means again, and with fixed tie-breaking Step 2 produces the same clustering again, so \(W(c)\) would not strictly decrease; contradiction
  3. Therefore the algorithm visits at most finitely many clusterings, so it must terminate in a finite number of rounds