Regularised least squares. Consider the linear regression model:
\[
y = X \beta + \epsilon
\] where \(y = [y_1, \ldots, y_n]^\top \in \Re^n\), \(X = [x_1,\ldots, x_n] \in \Re^{n \times p}\), \(\beta = [\beta_1,\ldots, \beta_p]^\top \in \Re^p\) and \(\epsilon = [\epsilon_1,\ldots, \epsilon_n]^\top \in \Re^n\) is a vector of vector of i.i.d. random variables with zero mean and variance \(\sigma^2\).
Ordinary least squares. The ordinary least squares (OLS) parameter estimate \(\hat{\beta}^{\text{OLS}}\) is defined as a minimiser of the following optimisation problem: \[
\min_{\beta \in \Re^p} L(\beta), \quad L(\beta) = \| y - X\beta \|_2^2 \,.
\] Show that the OLS estimator is given by \[
\hat{\beta}^{\text{OLS}} = (X^\top X)^{-1}X^\top y \,.
\] whenever \(X\) has full column rank (i.e. predictors are not linearly dependent). Does an OLS estimator always exist and when it exists under what condition is it unique?
Question 1.a
Show
If \(\text{rank}(X) = p\), \[\hat{\beta}^{\text{OLS}} = (X^\top X)^{-1}X^\top y\]
Know
If \(\text{rank}(X) = p\), then \(X^\top X\) is invertible
\[
\nabla_x (x^\top a) = a
\]
\[
\nabla_x (x^\top A x) = Ax + A^\top x
\]
Question 1.a
\[
\begin{aligned}
L(\beta) &= \| y - X\beta \|_2^2 \\
&= (y - X\beta)^\top (y - X\beta) \\
&= y^\top y - y^\top X\beta - \beta^\top X^\top y - \beta^\top (X^\top X) \beta \\
\end{aligned}
\] Differentiating yields: \[
\begin{aligned}
\nabla_\beta L(\beta) = -2 X^\top y - (X^\top X) \beta - (X^\top X)^\top \beta &= -2X^\top y - 2 (X^\top X) \beta \overset{!}{=} 0 \\
\iff (X^\top X) \beta &= X^\top y \\
\iff \beta &= (X^\top X)^{-1} X^\top y
\end{aligned}
\] Checking for optimality: \[
\nabla_\beta \nabla^\top_\beta L(\beta) = - 2 (X^\top X) \preceq 0, \quad \forall \beta \,,
\] So \(L(\beta)\) is convex and \(\hat{\beta}^{\text{OLS}}\) is a minimum (and unique).
Q: Why is \((X^\top X)\) invertible?
Question 1.a
Invertibility of \(X^\top X\)
If \(X^\top X\) has full rank, then it is invertible
Question 1.a
We know that \(X\) has full column rank, i.e. \[
X v = 0 \implies v = 0
\]
What this says is:
\[
Xv = \begin{bmatrix}
{\mid} & {\mid} & \cdots & {\mid} \\
x_1 & x_2 & \cdots & x_p \\
{\mid} & {\mid} & \cdots & {\mid} \\
\end{bmatrix} \begin{bmatrix}
v_1 \\
\vdots \\
v_p
\end{bmatrix} =
v_1\cdot \begin{bmatrix}
{\mid} \\
x_1 \\
{\mid}
\end{bmatrix} +
v_2 \cdot\begin{bmatrix}
{\mid} \\
x_2 \\
{\mid}
\end{bmatrix} +
\cdots +
v_p\cdot\begin{bmatrix}
{\mid} \\
x_p \\
{\mid}
\end{bmatrix}
\] is just a linear combination of the columns of \(X\).
If \(X\) has full column rank, they are linearly independent, so no linear combination of \(p-1\) columns can achieve the other one
Equivalently, \(Xv = 0\) must imply that \(v = 0\)
Question 1.a
Invertibility of \(X^\top X\)
How can we use this find the rank of \(X^\top X\)?
Ridge regression. The ridge regression parameter estimatê \(\hat{\beta}^{\text{ridge}}\) is defined as a minimiser of the following optimisation problem: \[
\min_{\beta \in \Re^p} L(\beta) + \lambda \|y\|_2^2 = \| y - X\beta \|_2^2 + \lambda \| \beta\|_2^2 \,,
\] where \(\lambda \geq 0\) is a regularisation hyperparameter. Assume that \(\lambda > 0\), (if \(\lambda = 0\), then we have the OLS problem).
Parameter vector. Show that \[
\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I_p)^{-1}X^\top y \,,
\] where \(I_p\) is a \(p \times p\) identity matrix.
Question 1.b.1
Show
\(\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I_p)^{-1}X^\top y\)
Know
\[
\nabla_x (x^\top a) = a
\]
\[
\nabla_x (x^\top A x) = Ax + A^\top x
\]
Question 1.b.1
\[
\begin{aligned}
L(\beta) + \lambda \|\beta \|_2^2 = y^\top y - 2 y^\top X \beta + \beta^\top (X^\top X) \beta + \lambda \beta^\top \beta \,.
\end{aligned}
\] Differentiating yields: \[
\begin{aligned}
\nabla_\beta \left\{ L(\beta) + \lambda \|\beta \|_2^2\right\} &= - 2 X^\top y + 2 (X^\top X) \beta + 2\lambda \beta \overset{!}{=} 0\\
\iff (X^\top X) \beta + \lambda \beta &= X^\top y \\
\iff (X^\top X + \lambda I_p ) \beta&= X^\top y \\
\iff \beta &= (X^\top X + \lambda I_p )^{-1} X^\top y \,.
\end{aligned}
\] Checking optimality: \[
\begin{aligned}
\nabla_\beta \nabla^\top_\beta \left\{ L(\beta) + \lambda \|\beta \|_2^2\right\} &= -2(X^\top X + \lambda I_p) \prec 0, \quad \forall \beta \in \Re^p \,.
\end{aligned}
\] So the ridge penalised least-squares problem is strictly convex even if \(X\) does not have full column rank. Thus, the ridge estimator exists and is unique regardless of the rank of \(X\).
Question 1.b.1
Invertibility of \((X^\top X + \lambda I_p)\)
Assume there is a \(v \in \Re^p: (X^\top X + \lambda I_p)v = 0\). Then: \[
\begin{aligned}
v^\top (X^\top X + \lambda I_p) v &= v^\top (X^\top X)v + \lambda v^\top v\\
&= \tilde v^\top \tilde v + \lambda v^\top v, \quad (\tilde v = Xv) \\
&\geq v^\top v \lambda \,.
\end{aligned}
\]
Hence \((X^\top X + \lambda I_p)v = 0\) must imply that \(v = 0\) for \(\lambda >0\), regardless of the rank of \(X\).
Question 1.b.1
Invertibility of \((X^\top X + \lambda I_p)\)
Intuition: Ridge augments the data to (potentially) increase the rank of \(X\):
Covariance of the parameter vector. Assume that we observe \(X\) but not \(y\). Hence, we may regard \(\hat{\beta}^{\text{Ridge}}\) as a random vector conforming to \[
\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I_p)^{-1}X^\top y \,.
\] Consider the eigen-decomposition \[
X^{\top}X = Q D Q^{\top}
\] where \(Q^{\top}Q = QQ^{\top} = I_p\) and \(D\) is the diagonal matrix with diagonal elements \(d_1,\ldots,d_p\).
Show that \[
\operatorname{Cov}\bigl[\hat{\beta}^{\text{Ridge}}\bigr]
= \sigma^2 Q D^* Q^{\top} \,,
\] where \(D^*\) is a diagonal matrix with diagonal elements \(d_1/(d_1+\lambda)^2,\ldots,d_p/(d_p+\lambda)^2\).
Remark: covariance of a random column vector \(Z\) is defined by \[
\operatorname{Cov}[Z] = \mathrm{E}\bigl[(Z-\mathrm{E}[Z])(Z-\mathrm{E}[Z])^{\top}\bigr]\,.
\]
To figure out the inverse of \(X^\top X + \lambda I_p\), we note that \(Q^{-1} = Q^\top\)\[
X^\top X + \lambda I_p = QDQ^\top + \lambda QQ^\top = Q(D + \lambda I_p) Q^\top \,.
\]
Using the assertions in (b.1) and (b.2), discuss how \(\hat \beta^{\text{ridge}}\) behaves as \(\lambda\) grows large.
Shrinkage
\[
\begin{aligned}
\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I_p)^{-1}X^\top y = Q(D + \lambda I_p)^{-1} Q^\top X^\top y \,.
\end{aligned}
\]
If \(\lambda \to \infty\), \(D_i / (D_i + \lambda)^2 \to 0\), so \(\hat \beta^{\text{ridge}} \to 0\)
Bias
\[
\begin{aligned}
\mathrm{E}[\hat{\beta}^{\text{ridge}}] = Q (D + \lambda I_p)^{-1} Q^\top Q D Q^\top \beta = Q (D + \lambda I_p)^{-1} D Q^\top \beta
\end{aligned}
\] If \(\lambda = 0\), Bias is zero, if \(\lambda \to \infty\), bias is \(-\beta\).
Variance
As \(\lambda \to \infty\), variance goes to zero (Classic bias variance tradeoff)
Learnings
Be comfortable with basic matrix calculus
Understand limitations of OLS when \(X\) is collinear
Comparison of Ridge to OLS in linear model
Understand existence, shrinkage, bias-variance (why we like ridge for prediction)
Check your understanding
You observe a linear model \(y = X\beta + \varepsilon\) with \(X \in \Re^{n \times p}\), and \(X\) is known to be rank-deficient (\(\operatorname{rank}(X) < p\)).
Does an OLS minimiser of \(\|y - X\beta\|_2^2\) always exist?
Answer
Yes, a minimiser of \(\|y - X\beta\|_2^2\) always exists; geometrically, we are projecting onto the closed subspace \(\operatorname{col}(X)\)
Is it necessarily unique?
Answer
It is not unique when \(\operatorname{rank}(X) < p\). There are infinitely many minimisers; if \(\hat\beta\) is one minimiser and \(v \neq 0: Xv = 0\), then \(\hat\beta + v\) is also a minimiser.
How does ridge regression change the answer to (2)?
Answer
Ridge regression minimises \(\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\) with \(\lambda > 0\). The Hessian is \(2(X^\top X + \lambda I_p)\), which is positive definite for any \(\lambda > 0\), so the solution is unique even when \(X\) is rank-deficient.
Check your understanding
Let \(X^\top X = Q D Q^\top\) with \(D = \operatorname{diag}(d_1,\dots,d_p)\) and \(Q^\top Q = I_p\). Consider the ridge estimator \[
\hat\beta^{\text{ridge}} = (X^\top X + \lambda I_p)^{-1} X^\top y.
\]
Suppose \(n \ll p\), and the columns of \(X\) are highly collinear. You care only about minimising test error, not about unbiasedness of \(\hat\beta\).
Give two clear reasons why ridge regression can outperform OLS in this regime, and one situation where making \(\lambda\) too large will start to harm predictive performance.
Answer
OLS has very high variance and is unstable when \(n \ll p\) and predictors are collinear; small perturbations in the data can cause large changes in \(\hat\beta\). Ridge stabilises the estimator by shrinking coefficients, reducing variance of predictions.
Ridge effectively downweights directions in predictor space that are poorly informed by the data (small eigenvalues of \(X^\top X\)), which reduces overfitting to noise and typically improves generalisation.
If \(\lambda\) is made too large, the fitted model is overly shrunk toward zero, underfitting the true signal. In this case, the bias in predictions becomes dominant and test error increases, even though the variance is small.
Question 1.c
Penalised least squares analysis. To gain some insights about variable selection procedures, consider the case when \(X\) has orthonormal columns, i.e. \(X^\top X = I_p\)
Separability. Consider the following penalised least squares problem: find \(\hat\beta\) that is a minimiser of the following optimisation problem: \[
\min_{\beta \in \Re^p}
\frac{1}{2} \| y - X\beta \|_2^2+ \sum_{j=1}^p c_j(|\beta_j|,\lambda) \,,
\] where each \(c_j : \Re_+ \times \Re_+^m \to \Re_+\) is an increasing function. Define \[
z = X^\top y = (z_1, \dots, z_p)^\top \,.
\] Show that the penalised least squares problem is equivalent to finding \(\hat\beta\) that is a minimiser of the following optimisation problem: \[
\min_{\beta \in \Re^p} \frac{1}{2} \sum_{j=1}^p (z_j - \beta_j)^2 + \sum_{j=1}^p c_j(|\beta_j|,\lambda) \,.
\]
Question 1.c.1
Show
penalised least squares is separable componentwise
Moreover, show that the last optimisation problem is equivalent to componentwise solving of the following penalised least squares problem: find \(\hat\theta\) that is a minimiser of the optimisation problem: \[
\min_{\theta \in \Re}
\frac{1}{2} (z - \theta)^2 + c(|\theta|,\lambda) \,,
\] where for each \(j \in \{1,\dots,p\}\), \(z \equiv z_j\) and \(c \equiv c_j\).
We can think of orthonormal designs as independence of features, in that no feature holds any information about any other feature.
In that case, it makes sense that a change in the \(j\)th feature (i.e. column of \(X\)), should only affect our choice of the \(j\)th component of \(\beta\).
Mathematically, the cross term
\[
\beta^\top (X^\top X) \beta = \sum_{i = 1}^p \sum_{j = 1}^p \beta_i \beta_j x^\top_i x_j
\] captures all the dependencies of the components of \(\beta\).
Statistically, we can think about the case where all the rows of \(X\) are i.i.d. with variance-covariance matrix \(I_p\).
In that case, we cannot learn anything about the signal by considering dependence of features, so we should ignore it.
For the \(\ell_2\) penalty, the optimisation problem is to find \(\hat{\theta}^{\text{ridge}}\) that minimises \[
\frac{1}{2} (z - \theta)^2 + \frac{\lambda}{2} \theta^2 \,,
\] over \(\theta \in \Re\).
Code
lambda <-0.1z <-3fun <-function(theta){0.5* (z - theta)^2+0.5* lambda * theta^2}thetas <-seq(from =-10, to =10, length.out =1000)dfun <-function(theta){-(z - theta) + lambda * theta} par(mfrow =c(1,2))plot(thetas, fun(thetas), type ="l", linewidth =2, main ="Ridge loss")plot(thetas, dfun(thetas), type ="l", linewidth =2, main ="Ridge loss derivative")abline(h =0, col ="red", linetype ="dash")
zs <-seq(from =-5, to =5, length.out =1000) lambda <-1ridge <-function(z){ z / (1+ lambda)}plot(zs, ridge(zs), type ="l", col ="blue", lwd =2, main ="Ridge shrinkage", xlab ="z", ylab =expression(hat(theta))) lines(zs,zs, col ="blue", lwd =2, lty ="dashed")
Question 1.c.3
For \(\ell_1\) penalty, the optimisation problem is to find \(\hat{\theta}^{\text{soft}}\) that minimises \[
L(\theta) = \frac{1}{2}(z - \theta)^2 + \lambda |\theta| \,.
\]
Question 1.c.3
Problem:
At \(\theta = 0\), the derivative is not defined
Need to deal with the case \(\theta = 0\) separately
Code
lambda <-3.z <-5fun <-function(theta){0.5* (z - theta)^2+ lambda *abs(theta)}thetas <-seq(from =-10, to =10, length.out =1000)dfun <-function(theta){-(z - theta) + lambda *sign(theta)} par(mfrow =c(1,2))plot(thetas, fun(thetas), type ="l", linewidth =2, main ="Lasso loss")abline(v =0, col ="red", linetype ="dash") plot(thetas, dfun(thetas), type ="l", linewidth =2, main ="Lasso loss derivative")abline(v =0, col ="red", linetype ="dash")
Question 1.c.3
When is \(\hat{\theta}^{\text{soft}} = 0\) optimal?
Assume that \(\hat{\theta}^{\text{soft}} > 0\) is optimal:
\[
\frac{\partial}{\partial \theta} \left(\frac{1}{2} (z - \theta)^2 + \lambda |\theta| \right) = \frac{\partial}{\partial \theta} \left(\frac{1}{2} (z - \theta)^2 + \lambda \theta\right) = -(z - \theta) + \lambda \overset{!}{=} 0
\] so that \[
\hat{\theta}^{\text{soft}} = z - \lambda > 0 \iff z > \lambda \,.
\]
Question 1.c.3
\(\hat{\theta}^{\text{soft}} \neq 0\)
Assume that \(\hat{\theta}^{\text{soft}} < 0\) is optimal:
\[
\frac{\partial}{\partial \theta} \left(\frac{1}{2} (z - \theta)^2 + \lambda |\theta|\right) = \frac{\partial}{\partial \theta} \left( \frac{1}{2} (z - \theta)^2 - \lambda \theta\right) = -(z - \theta) - \lambda \overset{!}{=} 0
\] so that \[
\hat{\theta}^{\text{soft}} = z + \lambda < 0 \iff z < -\lambda \,.
\]
Question 1.c.3
We found three candidate minimisers \(\hat{\theta}^{\text{soft}}\) and necessary conditions on \(z, \lambda\) for optimality
We need to verify that these are also sufficient
We check the three cases
\(z > \lambda\)
\(z < - \lambda\)
\(|z| \leq \lambda\)
Question 1.c.3
\(z > \lambda\)
For \(\theta > 0\), \[
\frac{\partial}{\partial \theta}L(\theta) = -(z - \theta) + \lambda = (\lambda - z) + \theta \,,
\] is strictly increasing, negative for \(\theta < z - \lambda\), positive for \(\theta > z - \lambda\).
For \(\theta < 0\), \[
\frac{\partial}{\partial \theta}L(\theta) = -(z - \theta) - \lambda = \theta - (z - \lambda) < 0\,,
\] is strictly decreasing, and the minimum is at the boundary \(\theta = 0\).
zs <-seq(from =-5, to =5, length.out =1000) lambda <-1ht <-function(z){ifelse(abs(z)<= lambda, 0, z) }hts <-ht(zs)plot(zs[hts <0], hts[hts<0], type ="l", col ="blue", lwd =2, main ="Hard thresholding", xlim =c(-5,5), ylim =c(-5,5), xlab ="z", ylab =expression(hat(theta))) lines(zs[hts ==0], hts[hts ==0], type ="l", col ="blue", lwd =2)lines(zs[hts >0], hts[hts>0], type ="l", col ="blue", lwd =2)lines(zs, zs, col ="blue", lwd =1, lty ="dashed") abline(v =-lambda, col ="grey", lwd =2, lty ="dashed") abline(v = lambda, col ="grey", lwd =2, lty ="dashed")
Question 1.c.4
\(k\)-best subset selection
Given a set \(S \subset \{1, 2, \ldots, n\}: |S| = k\), find \(\hat{\beta}_S\) is the \(k\)-vector that minimises \[
\|y - X_{S} \hat{\beta}_{S} \|_2^2 = \| Z_S - X_S^\top X_S \beta \|_2^2 + C = \| Z_S - \beta_S \|_2^2 \,,
\] where \(X_S\) is \(X\) with only columns from \(S\) and \(Z_S = X_S^\top y\). Hence \(\hat \beta_S = Z_S\)
Overall loss for \(S\): \[
\| y - X_S\hat{\beta}_S \|_2^2 = \| y - X_SX_S^\top y \| = \| y\|_2^2 - \| X_S^\top y\|_2^2 = C - \|Z_S \|_2^2
\]
Note that \(Z_S\) is the \(k\)-subvector of \(Z\) with all the entries selected by \(S\)
\[
\| y - X_S\hat{\beta}_S \|_2^2 = C - \sum_{i \in S} z_i^2
\]
and \(k\)-best subset picks the \(k\) entries for which \(|z_i|\) is largest
Question 1.c.4
\(k\)-best subset selection vs. Hard threshold
\(k\)-best subset: Pick \(k\) entries for which \(|z_i|\) is largest
zs <-seq(from =-5, to =5, length.out =1000) lambda1 <- .25lambda2 <- .5en <-function(z){as.numeric(abs(z) > lambda1) *sign(z) * (abs(z) - lambda1) / (1+2* lambda2)}plot(zs, en(zs), type ="l", col ="blue", lwd =2, main ="Elastic net shrinkage", xlab ="z", ylab =expression(hat(theta))) lines(zs, zs, col ="blue", lwd =1, lty ="dashed") abline(v =-lambda1, col ="grey", lwd =2, lty ="dashed") abline(v = lambda1, col ="grey", lwd =2, lty ="dashed")
Learnings
Understand implications of orthonormal designs
Understand shrinkage properties of different penalties
Deal with nondifferentiable, convex functions
Check your understanding
Why can ridge beat thresholding rules for prediction?
Assume the orthonormal design setting with scalar problems \[
\hat\theta(z) = \arg\min_{\theta \in \mathbb{R}} \left\{ \frac{1}{2}(z-\theta)^2 + c(|\theta|,\lambda)\right\},
\] and compare ridge to soft/hard thresholding when you only care about test MSE, not sparsity.
Answer
Stability: Ridge uses a smooth map \(z \mapsto \hat\theta^{\text{ridge}}(z) = z/(1+\lambda)\), so small changes in \(z\) give small changes in \(\hat\theta\). Thresholding rules introduce a discontinuity at \(|z| = \lambda\), so tiny perturbations around the threshold can flip a coefficient from zero to nonzero (or vice versa), which increases prediction variance.
Better use of many weak signals. With many small but real effects, ridge shrinks all of them instead of zeroing them out. Thresholding rules aggressively delete coefficients below a cut-off, which can throw away cumulative signal and hurt predictive performance even if the resulting model is sparser.
Check your understanding
Why is soft thresholding often preferred to hard thresholding for prediction?
Both rules threshold \(z\), but in different ways. Explain why the soft rule is usually safer for test error.
Answer
Soft is continuous, hard is not. Soft thresholding \[
\hat\theta^{\text{soft}}(z)
= \mathbb{1}\{|z|>\lambda\}\,\text{sign}(z)\,(|z| - \lambda)
\] is continuous in \(z\). Hard thresholding \[
\hat\theta^{\text{hard}}(z)
=
\begin{cases}
0 & \text{if } |z|\le\lambda,\\
z & \text{if } |z|>\lambda
\end{cases}
\] has a jump at \(|z|=\lambda\). This jump makes fits highly sensitive to noise near the threshold and inflates test error.
Convex vs non-convex. Soft thresholding comes from a convex \(\ell_1\) penalty, giving a unique global minimiser and well-behaved optimisation. Hard thresholding corresponds to a non-convex penalty; in general designs this leads to multiple local minima and more unstable model selection, which again tends to worsen predictive risk.
Check your understanding
Bias–variance tradeoff of ridge in the orthonormal case
Assume \(z = \beta + \epsilon\) with \(\epsilon \sim (0,\sigma^2)\) and ridge estimator \[
\hat\theta^{\text{ridge}}(z) = \frac{z}{1+\lambda}.
\] Describe how bias and variance depend on \(\lambda\), and why a nonzero \(\lambda\) can reduce test error.
Answer
Bias.\[
\mathbb{E}[\hat\theta^{\text{ridge}}]
= \frac{\beta}{1+\lambda},
\quad
\text{bias}(\hat\theta^{\text{ridge}})
= -\frac{\lambda}{1+\lambda}\,\beta.
\] Bias grows in magnitude as \(\lambda\) increases (stronger shrinkage toward \(0\)).
Variance.\[
\operatorname{Var}(\hat\theta^{\text{ridge}})
= \frac{1}{(1+\lambda)^2}\,\operatorname{Var}(z)
= \frac{\sigma^2}{(1+\lambda)^2},
\] which decreases as \(\lambda\) increases; large \(\lambda\) makes the estimator very low variance.
Tradeoff and test MSE. The risk (per coordinate) is \[
\mathbb{E}\big[(\hat\theta^{\text{ridge}} - \beta)^2\big]
= \underbrace{\frac{\lambda^2}{(1+\lambda)^2}\beta^2}_{\text{bias}^2}
+ \underbrace{\frac{\sigma^2}{(1+\lambda)^2}}_{\text{variance}}.
\] As \(\lambda\) increases, variance drops faster than bias initially, so test MSE can decrease compared to OLS (\(\lambda=0\)). If \(\lambda\) is pushed too large, bias dominates and test error increases again: that is the classical bias–variance tradeoff.