Let \(\lambda, v\) be an eigenvalue-eigenvector pair of the sample covariance matrix \[
\hat{\Sigma} = \frac{1}{n} X^\top X \,,
\] such that \(\lambda > 0\). Show that \(v\) can be expressed as a linear combination of \(x_1, \ldots, x_n\), i.e. \[
v_j = \sum_{i = 1}^n a_i X_{i,j}, \quad j = 1,\ldots, p \,,
\] where \(a_i \in \Re\) are some scalar coefficients.
Show
\(v = X^\top a\) for some \(a \in \Re^n\)
Know
\(\hat \Sigma v = \lambda v\)
\(\hat \Sigma = X^\top X / n\)
\(\lambda > 0\)
Question 1.a
Quick solution
\[
\begin{aligned}
\hat{\Sigma} v &= \lambda v \\
\iff \frac{1}{n} X^\top X v &= \lambda v \\
\iff X^\top \left(\frac{1}{\lambda n}Xv\right) &= v \,,
\end{aligned}
\]
so that \[
v = X^\top a, \quad \text{for } a = \frac{1}{\lambda n} X v \,.
\]
Question 1.a
Detailed solution
Define
\(\operatorname{im}(\hat \Sigma) = \{w \in \Re^p: w = \hat{\Sigma} v, \, v \in \Re^p \}\)
\(\operatorname{im}(X^\top) = \{z \in \Re^p: z = X^\top a, \, a \in \Re^n \}\)
Then for any \(v \in \Re^p\):
\[
\hat{\Sigma} v = \lambda v \in \operatorname{im}(\hat\Sigma) \subseteq \operatorname{im}(X^\top)
\implies v = \frac{1}{\lambda}\hat\Sigma v \in \operatorname{im}(\hat\Sigma) \subseteq \operatorname{im}(X^\top) \,/
\]
Question 1.a
Detailed solution
\[
v \in \operatorname{im}(X^\top)
\]
means there exists a \(a \in \Re^n: v = X^\top a\)
where \(X_{\bullet, i}\) and \(X_{i, \bullet}\) are the \(i\)th column and row of \(X\), respectively.
Learnings
Connection of two aspects of PCA:
Find a direction \(v\) in the column span of \(X\) that maximises variance, i.e. \(v = X^\top a\)
These directions correspond to the eigenvectors of \(\hat{\Sigma}\)
Question 1.b
In Lecture 9, we considered an example where a covariance matrix was defined as \[
\Sigma = I_p + \theta e_1 e_1^\top \,,
\] where \(I_p\) is the \(p \times p\) identity matrix and \(e_1 = [1, 0, \ldots,0]^\top \in \Re^p\) is the first standard basis vector.
It was claimed that \(e_1\) is the prinicpal eigenvector pertaining to eigenvalue \(1 + \theta\).
Show that this is true.
Furthermore, determine the other eigenvalues and the corresponding eigenvectors of \(\Sigma\).
Provide the set of orthonormal eigenvectors of \(\Sigma\).
\(v_1 \neq 0\): \(\lambda = 1 + \theta\) and thus must have \(v_i = 0\) for \(i > 1\)
\(v_i \neq 0\) for \(i > 1\): \(\lambda = 1\) and thus must have \(v_1 = 0\)
Question 1.b
Eigenvalues & Eigenvectors
\(\lambda_1 = (1 + \theta)\), all corresponding eigenvectors are of the form \[
v = \begin{bmatrix}
v_1 \\
0 \\
\vdots \\
0
\end{bmatrix}, \quad v_1 \in \Re
\]
\(\lambda_2 = 1\), all corresponding eigenvectors are of the form \[
v = \begin{bmatrix}
0\\
v_2 \\
v_3 \\
\vdots \\
v_p
\end{bmatrix}, \quad v_i \in \Re
\]
Question 1.b
Orthonormal basis
Find \(p\)-linearly independent eigenvectors \(v_1, \ldots, v_p\) such that:
\(\| v_i \|_2^2 = 1\)
\(v_i^\top v_j = 0\), \(i \neq j\)
Choose the standard basis vectors \(e_1, \ldots, e_p\).
Learnings
Finding eigenvectors
Understanding orthonormal basis of eigenvectors, that is used for PCA
Question 1.c
In Lecture 9, we discussed the distribution of between-class and within-class pairwise distances between points, where the points are distributed according to either \(\mathrm{N}(c, I_d)\) or \(\mathrm{N}(-c, I_d)\), with \[
\mu = \begin{bmatrix}
a \\
0 \\
\vdots \\
0
\end{bmatrix}
\in \Re^d
\] for some \(a \in \Re\). Let \(X_1\) and \(X_2\) be two independent points such that \(X_1 \sim \mathrm{N}(\mu, I_d)\) and \(X_2 \sim \mathrm{N}(-\mu, I_d)\). Note that \[
\mathrm{E}[\| X_1 - X_2 \|_2^2]
\] is the expected value of the squared between-class pairwise distance.
Derive an expression for \(\mathrm{E}[\| X_1 - X_2 \|_2^2]\) in terms of \(a\) and \(d\).
Comment on the expression in relation to the curse of dimensionality
Question 1.c
Show
Expression for \(\mathrm{E}[\| X_1 - X_2 \|_2^2]\)
As \(d \to \infty\), the noise in the uninformative coordinates dominates the signal and the expected within and between class distance become more and more similar
Learnings
Details on example for curse of dimensionality that we have seen in the context of \(k\)-NN and PCA
This is not the curse of dimensionality of PCA, but motivates the use of (sparse) PCA in high-dimensional data when using e.g. \(k\)-NN
Question 2
The \(K\)-means problem can be stated as follows: given an integer \(K\) and a set of data points \(\mathcal X = \{x_1,\ldots, x_n\}\), where \(x_i \in \Re^p\) for every \(i = 1, \ldots , n\), find a clustering \(C = (C_1,\ldots,C_K)\) and centres \(c = (c_1,\ldots,c_K)\), where \(c_i \in \Re^p\) for every \(i = 1, \ldots ,K\), that minimise the within-cluster sum of squares \[
W(C,c) \;=\; \sum_{i=1}^K \sum_{x \in C_i} \|x-c_i\|_2^2 \,.
\]
Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily
For each \(i \in \{1,2, \ldots, K \}\), set the cluster \(C_i\) to be the set of points in \(\mathcal{X}\) that are closer to \(c_i\) than to \(c_j\) for all \(j \neq i\)
Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily
For each \(i \in \{1,2, \ldots, K \}\), set the cluster \(C_i\) to be the set of points in \(\mathcal{X}\) that are closer to \(c_i\) than to \(c_j\) for all \(j \neq i\)
For each \(i \in \{1,2, \ldots, K \}\), set \(c_i\) to be the centre of mass of all points in \(C_i\) i.e., \(c_i = \tfrac{1}{\lvert C_i \rvert}\sum_{x \in C_i}x\)
Choose an initial set of \(K\) centres \(c = (c_1,\ldots,c_K)\) arbitrarily
For each \(i \in \{1,2, \ldots, K \}\), set the cluster \(C_i\) to be the set of points in \(\mathcal{X}\) that are closer to \(c_i\) than to \(c_j\) for all \(j \neq i\)
For each \(i \in \{1,2, \ldots, K \}\), set \(c_i\) to be the centre of mass of all points in \(C_i\) i.e., \[
c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
\]
Repeat Steps 2 and 3 until \(c\) no longer changes
Given any set of centres, the lowest possible cost over all clusterings is: \[
W(c) = \min_C W(C,c) = \sum_{x \in \mathcal{X}} \min_{i \in \{1,\ldots,K\}} \| x - c_i \|_2^2 \,.
\]
Hence, we are asked to show that given \(c\), Step 2 gives the clustering with lowest possible cost
Question 2.a
After Step 2: \[
x \in C_i \implies \| x - c_i \|_2^2 = \min_{j \in \{1,\ldots,K\}} \| x - c_j \|_2^2 \,.
\]
Let \(S\) be a set of points with the centre of mass \[
c(S) = \frac{1}{\lvert S \rvert} \sum_{x \in S} x \,.
\] Show that for any point \(z \in \Re^p\), \[
\sum_{x \in S} \| x - z \|_2^2 - \sum_{x \in S} \|x - c(S) \|_2^2 = \lvert S \rvert \|c(S) - z \|_2^2
\]
Show
\(\sum_{x \in S} \| x - z \|_2^2 - \sum_{x \in S} \|x - c(S) \|_2^2 = \lvert S \rvert \|c(S) - z \|_2^2\)
\[
\sum_{x \in S} \| x - z \|_2^2 =\sum_{x \in S} \| x - c(S) \|_2^2 + \lvert S \rvert \| c(S) - z \|_2^2 \,,
\]
as required
Question 2.c
Use the identity in 2.b to show that Step 3 of the \(K\)-means algorithm decreases \(W(c)\) in every round where the algorithm does not terminate. We refer to the execution of Steps 2 and 3 as a round.
Show
Reduction in \(W(c)\) between execution rounds
Know
Question 2.b
Question 2.c
Fix a particular round \(t\) in the algorithm
Pre Step 2
We have clustering \(C^t = (C_1^t,\ldots,C_K^t)\) with centres \(c^t = (c_1^t,\ldots,c_K^t)\)
Post Step 2
Assigned points to closest centre \(c_i^t\), get new clusters \(C^{t+1}\)
From Q2.a, Step 2 is cost optimal clustering given centres \(c^t\), so \[
W(C^{t+1},c^t)=W(c^t)
\] and hence \[
W(C^{t+1},c^t)\le W(C^t,c^t).
\]
Question 2.c
Post Step 3
Have clustering \(C^{t+1}\) with updated centres \(c^{t+1}\) where \[
c_i^{t+1} = c(C_i^{t+1}) = \frac{1}{\lvert C_i^{t+1} \rvert} \sum_{x \in C_i^{t+1}} x \,.
\]
Cost change post Step 2 to post Step 3 for each \(i\) (using Q2.b with \(S=C_i^{t+1}\), \(z=c_i^t\)): \[
\underbrace{\sum_{x \in C_i^{t+1}} \| x - c_i^t \|_2^2}_{\text{old}} - \underbrace{\sum_{x \in C_i^{t+1}} \| x - c_i^{t+1} \|_2^2}_{\text{new}}
= \lvert C_i^{t+1} \rvert \| c_i^{t+1} - c_i^{t}\|_2^2 \geq 0 \,.
\]
Question 2.c
Post Step 3
Summing over \(i\) gives a total decrease \[
W(C^{t+1},c^t)-W(C^{t+1},c^{t+1})
= \sum_{i=1}^K \lvert C_i^{t+1} \rvert \| c_i^{t+1} - c_i^t \|_2^2 \ge 0.
\]
If \(\exists i: c_i^{t+1} \neq c_i^t\), then \(W(C^{t+1},c^{t+1}) < W(C^{t+1},c^t)=W(c^t)\), so \[
W(c^{t+1}) < W(c^t).
\]
If \(c^{t+1} = c^t\), no update in Step 3, algorithm has converged (and Step 2 in the next round makes no change)
Question 2.d.
Show that the algorithm terminates, i.e. after a finite number of repetitions of Steps 2 and 3, \(c\) no longer changes
Show
Algorithm terminates
Know
Strict decrease in \(W(c)\) in any non-terminating round
There are only finitely many clusterings
Question 2.d.
There are are only finitely many clusterings to consider
In any non-terminating round, \(W(c)\) strictly decreases (by Q2.c), so the algorithm cannot revisit the same labelled clustering: if the same labelled clustering \((C_1,\ldots,C_K)\) is revisited, then Step 3 sets \(c_i\) to the same cluster means again, and with fixed tie-breaking Step 2 produces the same clustering again, so \(W(c)\) would not strictly decrease; contradiction
Therefore the algorithm visits at most finitely many clusterings, so it must terminate in a finite number of rounds