Visualising LASSO, Ridge, \(l_q\) penalties
Understanding AIC, BIC
ST443 Machine learning and data mining
Preamble
This note provides a visualisation of LASSO and Ridge regression and \(\ell_q\)-penalties to show how different penalties affect the regression coefficients. We will also look at AIC and BIC and see how their behaviour motivates different use cases. We make two 🏔️-excursions to explore the excess risk of LASSO under squared-loss and to derive the AIC and BIC and their model selection properties. Both of these are optional, but if your mathematical background allows for it, and you have not seen it before, I recommend looking at the AIC/BIC section. I think you can gain a lot of understanding from it.
This note is slightly more technical than previous ones, but I think it is important to cover these topics with appropriate rigour, as a lot of the intuition about the methods that are covered in this lecture and similar courses, rests on mathematical analysis. If this intuition is then presented without the analysis, I always struggle to make the connection. It feels like trying to build a roof without a ground floor.
Anyhow, I tried to give the big-picture along with the analysis so you do not get lost in the details.
LASSO, Ridge & \(\ell_q\)-penalties
Setup
Recall that for the linear model and given \(\lambda > 0\), the LASSO estimates are the solution to \[ \underset{\substack{\beta_0 \in \Re \\ \beta \in \Re^p}}{ \min}\, \sum_{i = 1}^n \left(y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \| \beta \|_1 \,, \tag{1}\] and the ridge estimates are the solution to \[ \underset{\substack{\beta_0 \in \Re \\ \beta \in \Re^p}}{ \min}\, \sum_{i = 1}^n \left(y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \| \beta \|_2^2 \,. \]
More generally, the \(\ell_q\)-regularised estimates are the solution to
\[ \underset{\substack{\beta_0 \in \Re \\ \beta \in \Re^p}}{ \min} \, \textrm{RSS}(\beta_0,\beta) + \lambda \| \beta \|_q^q \,, \tag{2}\]
where \[ \textrm{RSS}(\beta_0,\beta) =\sum_{i = 1}^n \left(y_i - \beta_0 - x_i^\top \beta \right)^2, \quad \| \beta \|_q = \left(\sum_{i = 1}^p |\beta_i|^q \right)^{\frac{1}{q}} \,. \]
\(\ell_q\)-regularisation as constrained optimisation
To see how different penalties affect the types of solution we get, we need to consider an equivalent optimisation problem.
You may have noticed that all the optimisation problems have the form \(\textrm{RSS}(\beta_0,\beta) + \lambda \|\beta \|_q\). But this looks very much like the Lagrangian from minimising \(\text{RSS}(\beta_0, \beta)\) under some constraint on \(\| \beta \|_q\)! Hence, we will start with a such a constrained optimisation problem and show that it is equivalent to the \(\ell_q\)-regularised estimators of Equation 2.
In particular, let us consider the constrained minimisation
\[ \underset{\substack{\beta_0 \in \Re \\ \beta \in \Re^p}}{\arg \min} \, \textrm{RSS}(\beta_0,\beta), \quad \text{such that } \| \beta \|_q^q \leq s^q \,, \tag{3}\] for some \(s > 0\). For now, to make our lives easier, let \(q > 1\). Then we have a convex, differentiable optimisation problem since \(\text{RSS}(\beta_0, \beta)\) is convex in its arguments and \(\| \beta \|_q\) is convex for \(q \geq 1\).
Since we are working with a convex, differentiable constrained optimisation problem, let us set up the Lagrangian of Equation 3, which is given by
\[ \mathcal L(\beta_0, \beta, \mu) = \textrm{RSS}(\beta_0,\beta) + \mu (\| \beta \|_q^q - s^q ) \,, \]
and the Dual optimisation problem to Equation 3 is
\[ \underset{\mu \in \Re}{\max} \underset{\substack{\beta_0 \in \Re \\ \beta \in \Re^p}}{\min} \mathcal L(\beta_0, \beta, \mu), \quad \text{such that } \mu \geq 0 \,. \tag{4}\]
The KKT conditions, listed below, give a list of necessary and sufficient conditions for an optimum of Equation 3:
\[ \begin{aligned} \frac{\partial}{\partial \beta_0} L(\beta_0, \beta, \mu) &= 0 \\ \frac{\partial}{\partial \beta_i} L(\beta_0, \beta, \mu) &= 0 \\ \| \beta \|_q^q & \leq s^q \\ \mu \geq 0 \\ \mu (\| \beta \|_q^q - s^q) &= 0 \,. \end{aligned} \tag{5}\]
Now, we said that the penalised problem Equation 2, looks like the Lagrangian of a constrained optimisation problem. Let us test this hypothesis and consider a solution \((\hat \beta_0,\hat \beta)\) to the penalised problem Equation 2 (existence holds for \(q\ge 1\) by convexity). For this, set \(s=\|\hat\beta\|_q\). You can then convince yourself \((\hat \beta_0,\hat \beta)\) also solves the constrained problem Equation 3 with that \(s\), and the KKT stationarity coincides:
- If \(\lambda>0\) and \(\hat\beta\neq 0\), the constraint is active and the optimal multiplier equals \(\mu^*=\lambda>0\).
- If \(\hat\beta=0\) (so \(s=0\)), the constraint is active but the multiplier in the constrained problem is not unique; taking \(\mu^*=\lambda\) satisfies the KKT-conditions.
Conversely, let \((\tilde\beta_0,\tilde\beta)\) solve the constrained problem Equation 3.
- If the constraint is active, i.e. \(\|\tilde\beta\|_q=s\), then with \(\lambda=\mu^*>0\) (the optimal multiplier) \((\tilde\beta_0,\tilde\beta)\) also solves Equation 2.
- If the constraint is inactive, i.e. \(\|\tilde\beta\|_q<s\), then necessarily \(\mu^*=0\). In particular, when \(s\ge \|\beta^{\mathrm{OLS}}\|_q\) the constrained solution is the OLS solution and corresponds to the penalised problem with \(\lambda=0\).
Equivalently, for any \(\lambda\ge 0\) we have the primal replacement \[\begin{multline} (\hat\beta_0(\lambda),\hat\beta(\lambda))\in \arg\min_{\beta_0,\beta}\big\{\mathrm{RSS}(\beta_0,\beta)+\lambda\|\beta\|_q^{q}\big\} \\ \iff \\ (\hat\beta_0(\lambda),\hat\beta(\lambda))\in \arg\min_{\beta_0,\beta}\big\{\mathrm{RSS}(\beta_0,\beta):\ \|\beta\|_q^{\,q}\le \|\hat\beta(\lambda)\|_q^{\,q}\big\}. \end{multline}\] The map \(s\mapsto \lambda\) is monotone decreasing while the constraint binds; once \(s\ge \|\beta^{\mathrm{OLS}}\|_q\) it flattens to \(\lambda=0\) (the constrained and penalised optima both reduce to OLS).
🏔️ Nonsmooth and nonconvex cases:
For \(q=1\) (LASSO) the penalty is convex but non-differentiable at \(\beta=0\), so in Equation 5 we need to replace gradients by the \(\ell_1\) subgradient. The penalised–constrained equivalence holds exactly as above (with the usual \(\mu^*=\lambda\) when the constraint is active).
For \(0<q<1\) the penalty is nonconvex and the ball \(\{\|\beta\|_q^q\le s^q\}\) is nonconvex and we can no longer establish equivalence of \(\ell_q\)-regularised estimation and constrained optimisation the way we did before. However, it still holds that if \((\hat\beta_0,\hat\beta)\) minimises Equation 2 for some \(\lambda\ge0\), then with \(s=\|\hat\beta\|_q\) the pair \((\hat\beta_0,\hat\beta)\) minimises Equation 3. (Proof by contradiction: otherwise one could strictly reduce \(\mathrm{RSS}\) at no greater radius and thus reduce the penalised objective.)
This is enough to convey the geometry: convex (\(q\ge1\)) gives full global equivalence; nonconvex (\(0<q<1\)) preserves only the penalised to constrained direction, and the \(\lambda\)–path can skip constrained optima.
The take-away from this section is that for any \(\ell_q\)-penalised optimisation problem Equation 2, we can find a constrained optimisation problem of the form \(\min \text{RSS}\) such that \(\| \beta\|_q^q \leq s\) that is solved by a solution to Equation 2. Hence, we can look at this constrained optimisation and see how it informs our understanding of the original problem.
Visualising the constrained optimisation
To visualise the constrained optimisation problem, we need to get an idea of how its components \(\text{RSS}(\beta_0, \beta)\) and \(\| \beta \|_q^q\) behave. First we will show that the level sets of \(\text{RSS}(\beta_0, \beta)\) are ellipsoids in \(\beta\).
Note that with \(X\in\Re^{n\times p}\) and \(\beta\in\Re^p\) \[ \mathrm{RSS}(\beta_0,\beta)=\|y-\beta_0\mathbf{1}-X\beta\|_2^2 =(y-\beta_0\mathbf{1}-X\beta)^\top(y-\beta_0\mathbf{1}-X\beta)\,. \]
Expanding and collecting terms yields
\[ \mathrm{RSS}(\beta_0,\beta) =\beta^\top (X^\top X) \beta-2 \beta^\top X^\top (y-\beta_0\mathbf{1}) +(y-\beta_0\mathbf{1})^\top (y-\beta_0\mathbf{1})\,. \] Let \(A:=(X^\top X)\) and, if \(A\) is invertible, set \[ b(\beta_0):=A^{-1}X^\top (y-\beta_0\mathbf{1}). \] Completing the square gives \[ \mathrm{RSS}(\beta_0,\beta) =\big(\beta-b(\beta_0)\big)^\top A\,\big(\beta-b(\beta_0)\big) +\underbrace{(y-\beta_0\mathbf{1})^\top (y-\beta_0\mathbf{1})-b(\beta_0)^\top A\,b(\beta_0)}_{\text{constant in }\beta} \,. \]
Hence, for fixed \(\beta_0\), the \(\beta\)–level sets \(\{\beta:\mathrm{RSS}(\beta_0,\beta)=c\}\) are ellipsoids with shape matrix \(A=X^\top X\) and center \(b(\beta_0)\).
Below you can see how these \(\beta\)-level sets of \(\text{RSS}(\beta_0,\beta)\) look in a \(3\)D space.
R code
library(jsonlite)
# ----- Ellipsoid: (beta - b)^T A (beta - b) = c -----
set.seed(1)
b <- c(0.5, -0.2, 0.8)
# SPD shape matrix A = Q diag(lambdas) Q^T
Q <- qr.Q(qr(matrix(rnorm(9), 3, 3)))
if (det(Q) < 0) Q[,1] <- -Q[,1]
lambdas <- c(6, 3, 1)
A <- Q %*% diag(lambdas) %*% t(Q)
# Levels c to animate
c_vals <- seq(1, 5, length.out = 10)
c_labs <- sprintf('%.2f', c_vals)
# Grid
lam_min <- min(lambdas)
r_max <- sqrt(max(c_vals) / lam_min)
N <- 25
xs <- seq(b[1] - r_max, b[1] + r_max, length.out = N)
ys <- seq(b[2] - r_max, b[2] + r_max, length.out = N)
zs <- seq(b[3] - r_max, b[3] + r_max, length.out = N)
grid <- expand.grid(x = xs, y = ys, z = zs)
# Quadratic form value at each grid point: val = (x-b)^T A (x-b)
D <- cbind(grid$x - b[1], grid$y - b[2], grid$z - b[3]) # N^3 x 3
val <- rowSums((D %*% A) * D)
payload <- list(
b = b,
x = grid$x,
y = grid$y,
z = grid$z,
val = as.numeric(val),
cVals = c_vals,
cLabs = c_labs
)
json <- toJSON(payload, auto_unbox = TRUE, null = "null", digits = I(16))
cat(sprintf('<script id="ellipsoid-data" type="application/json">%s</script>', json))Similarly, let us look at the level sets of the \(\ell_q\)-constraints, which are
\[ \{\beta \in \Re^p : \| \beta \|_q = c\} \,. \]
For different \(q\), they look as follows:
R code
library(jsonlite)
# -------- Parameters --------
s_fixed <- 1.2 # { (x,y,z): |x|^p + |y|^p + |z|^p <= s_fixed }
p_vals <- sort(unique(c(seq(0.5, 1, length.out = 7),
seq(1, 2, length.out = 7),
seq(2, 6, length.out = 9))))
p_labs <- sprintf('%.2f', p_vals)
# 3D grid
N <- 33
r_max <- s_fixed^(1/min(p_vals))
xs <- seq(-r_max, r_max, length.out = N)
ys <- xs
zs <- xs
grid <- expand.grid(x = xs, y = ys, z = zs)
payload <- list(
sFixed = s_fixed,
pVals = p_vals,
pLabs = p_labs,
x = grid$x,
y = grid$y,
z = grid$z,
rMax = r_max
)
json <- toJSON(payload, auto_unbox = TRUE, null = "null", digits = I(16))
cat(sprintf('<script id="lpball-data" type="application/json">%s</script>', json))It is now time that we put our understanding that Equation 2 corresponds to a constrained optimisation problem Equation 3 to use. We make the following two observations:
- The solution to the constrained optimisation problem must be a point in the \(\ell_q\)-ball i.e. the constraint must be satisfied.
- Since we try to minimise \(\text{RSS}(\beta_0, \beta)\), subject to the constraint, we need to find the \(\beta\) that lies on the ellipsoid with the smallest radius (we try to minimise), that also lies within the constraint region.
Therefore, such a \(\beta\) is exactly on the ellipsoid that is tangent to, i.e. just touches, the constraint region! This is the key insight to visualising \(\ell_q\)-regularised estimators, make sure you have understood it.
We are now in a position to visualise how different \(\ell_q\)-penalties can affect the sparsity of a solution. Below, I have plotted the LASSO and Ridge estimates as the solution to the constrained estimation problem Equation 3 (\(q = 1,2\), respectively) for varying values of \(s\). See what happens to \(\beta\) as \(s\) grows (\(\lambda\) shrinks) and vice-versa.
R code
library(jsonlite)
library(viridisLite)
library(png)
library(base64enc)
# ---------- 1) Quadratic loss ----------
Q <- matrix(c(3, 1,
1, 1.2), 2, 2, byrow = TRUE) # SPD
beta_hat <- c(1.2, 0.8)
rss_val <- function(b){
d <- b - beta_hat
as.numeric(t(d) %*% Q %*% d)
}
# ---------- 2) Grid + RSS (high res) ----------
grid_lim <- 2.2
xs <- seq(-grid_lim, grid_lim, length.out = 1000)
ys <- xs
X <- matrix(rep(xs, each = length(ys)), nrow = length(ys))
Y <- matrix(rep(ys, times = length(xs)), nrow = length(ys))
D1 <- X - beta_hat[1]; D2 <- Y - beta_hat[2]
L <- D1*(Q[1,1]*D1 + Q[1,2]*D2) + D2*(Q[2,1]*D1 + Q[2,2]*D2)
# ---------- 3) Coarse heatmap (more steps near min) as layout image ----------
N_BINS <- 12
HEAT_ALPHA <- 0.75
L_norm <- (L - min(L)) / (max(L) - min(L))
warp <- function(z) sqrt(z) # emphasize small values
cuts <- cut(warp(L_norm), breaks = seq(0,1,length.out=N_BINS+1),
include.lowest = TRUE, right = FALSE)
bin_idx <- as.integer(cuts)
pal <- viridis(N_BINS)
col_mat <- matrix(pal[bin_idx], nrow = nrow(L), ncol = ncol(L))
hex_to_rgba <- function(hex_vec, alpha = HEAT_ALPHA){
rgb <- col2rgb(hex_vec); rbind(rgb/255, rep(alpha, length(hex_vec)))
}
rgba <- hex_to_rgba(as.vector(col_mat), alpha = HEAT_ALPHA)
img <- array(0, dim = c(nrow(L), ncol(L), 4))
matR <- matrix(rgba[1,], nrow = nrow(L), ncol = ncol(L))
matG <- matrix(rgba[2,], nrow = nrow(L), ncol = ncol(L))
matB <- matrix(rgba[3,], nrow = nrow(L), ncol = ncol(L))
matA <- matrix(rgba[4,], nrow = nrow(L), ncol = ncol(L))
# flip vertically; anchor at top later
img[,,1] <- matR[nrow(L):1, ]; img[,,2] <- matG[nrow(L):1, ]
img[,,3] <- matB[nrow(L):1, ]; img[,,4] <- matA[nrow(L):1, ]
heat_uri <- dataURI(writePNG(img), mime = "image/png")
# ---------- 4) Constrained minimizers over s (L1 & L2) ----------
constrained_min <- function(L, mask, xs, ys){
Lm <- L; Lm[!mask] <- Inf
if(!any(is.finite(Lm))) return(c(NA_real_, NA_real_))
idx <- which(Lm == min(Lm, na.rm = TRUE), arr.ind = TRUE)[1,]
c(xs[idx[2]], ys[idx[1]])
}
s_vals <- seq(0.15, 2, length.out = 40)
s_labs <- sprintf("%.2f", s_vals)
lasso_sol <- ridge_sol <- matrix(NA_real_, nrow = length(s_vals), ncol = 2)
lasso_lev <- ridge_lev <- numeric(length(s_vals))
for(i in seq_along(s_vals)){
s <- s_vals[i]
l1_mask <- (abs(X) + abs(Y)) <= (s + 1e-12)
l2_mask <- (X*X + Y*Y) <= (s*s + 1e-12)
lasso_sol[i,] <- constrained_min(L, l1_mask, xs, ys)
ridge_sol[i,] <- constrained_min(L, l2_mask, xs, ys)
lasso_lev[i] <- rss_val(lasso_sol[i,])
ridge_lev[i] <- rss_val(ridge_sol[i,])
}
lasso_path <- data.frame(x = lasso_sol[,1], y = lasso_sol[,2])
ridge_path <- data.frame(x = ridge_sol[,1], y = ridge_sol[,2])
# Per-frame points
lasso_pts <- data.frame(x = lasso_sol[,1], y = lasso_sol[,2], frame = s_labs)
ridge_pts <- data.frame(x = ridge_sol[,1], y = ridge_sol[,2], frame = s_labs)
# ---------- 5) Tangency ellipses ----------
eig <- eigen(Q, symmetric = TRUE)
Qmhalf <- eig$vectors %*% diag(1/sqrt(eig$values)) %*% t(eig$vectors)
mk_ellipse <- function(level, n = 200){
t <- seq(0, 2*pi, length.out = n)
circ <- rbind(cos(t), sin(t)) * sqrt(level)
pts <- beta_hat + Qmhalf %*% circ
list(x = as.numeric(pts[1,]), y = as.numeric(pts[2,]))
}
ell_lasso <- lapply(seq_along(s_vals), function(i) mk_ellipse(lasso_lev[i], n = 200))
ell_ridge <- lapply(seq_along(s_vals), function(i) mk_ellipse(ridge_lev[i], n = 200))
# ---------- 6) Constraint polygons per s ----------
mk_diamond <- function(s) list(
x = c(0, s, 0, -s, 0),
y = c(s, 0, -s, 0, s)
)
mk_circle <- function(s, n = 200){
t <- seq(0, 2*pi, length.out = n)
list(x = s*cos(t), y = s*sin(t))
}
diam_list <- lapply(s_vals, mk_diamond)
circ_list <- lapply(s_vals, mk_circle)
# ---------- 7) Build payload ----------
payload <- list(
heatURI = heat_uri,
xRange = c(-grid_lim, grid_lim),
yRange = c(-grid_lim, grid_lim),
betaHat = beta_hat,
sVals = s_vals,
sLabs = s_labs,
lassoPath = list(x = lasso_path$x, y = lasso_path$y),
ridgePath = list(x = ridge_path$x, y = ridge_path$y),
dynamic = list(
lassoDiamond = list(x = lapply(diam_list, `[[`, "x"),
y = lapply(diam_list, `[[`, "y")),
lassoEllipse = list(x = lapply(ell_lasso, `[[`, "x"),
y = lapply(ell_lasso, `[[`, "y")),
lassoPoint = list(x = as.numeric(lasso_pts$x),
y = as.numeric(lasso_pts$y)),
ridgeCircle = list(x = lapply(circ_list, `[[`, "x"),
y = lapply(circ_list, `[[`, "y")),
ridgeEllipse = list(x = lapply(ell_ridge, `[[`, "x"),
y = lapply(ell_ridge, `[[`, "y")),
ridgePoint = list(x = as.numeric(ridge_pts$x),
y = as.numeric(ridge_pts$y))
)
)
json <- toJSON(payload, auto_unbox = TRUE, null = "null", digits = I(16))
cat(sprintf('<script id="l1l2-payload" type="application/json">%s</script>', json))For the LASSO, we see that since a solution lies exactly where the ellipsoid touches the constraint region, which has sharp edges, it is very likely that the ellipsoid level sets of \(\text{RSS}(\beta_0, \beta)\) touch the constraint at a corner, for which some of the components of \(\beta\) are zero. This is how the LASSO produces sparse solutions. For the ridge, on the other hand, the edges of the constraint region are smooth circles. Here, it is very unlikely that the ellipsoid meets the circle at a point where some of the entries of \(\beta\) are zero, as the circle bulges outwards.
In general, we can see from our animation of the constraint regions of \(\ell_q\) penalties, the constraint region bulges inwards for \(q \leq 1\), which produces the corners that give rise to sparse solutions. For \(q > 1\), the constraint region bulges outwards, giving rise to dense solutions.
🏔️ Excess risk of LASSO
Now that we understand why LASSO selects sparse solutions, let us explore its excess risk under squared loss in a linear model with sparsity. Bounding the excess risk of the LASSO is fairly advanced. I have included it in case you are interested pursuing a PhD in theoretical statistics and you want to get a feel about how research in this field looks like. I would not expect to understand everything that is happening at your first read, I certainly did not.
However, the sections on deriving the Bayes risk and the risk of oracle OLS are more in line with the level of the course and a good opportunity to sharpen your toolkit. I have structured them as an exercise, where you can click to expand the solutions. Why not give it a try yourself before checking!
Setup
We start with a vanilla linear model with gaussian errors, along with a sparsity condition on the true effect \(\beta\).
- Data: We observe pairs \((y_i, x_i) \in \Re \times \Re^p\), \((i=1,\ldots,n)\), that are i.i.d. realisations of random variables \((Y_i,X_i)\) from some unknown joint distribution \(P\). Let \(X\) be the \(n \times p\) matrix that collects the features \(X_i\) as rows
- Covariates: The marginal distribution of the covariates is a \(p\)-dimensional normal distribution, i.e. \(X_i \sim \mathrm{N}(0, I_p)\)
- Linear model: Conditional on \(X_i\), \(Y_i\) follows a normal linear model: \[Y_i = X_i^\top \beta + \varepsilon_i\,,\] where \(\varepsilon_i\) are i.i.d. \(\mathrm{N}(0,\sigma^2)\) independent of \(X_i\).
- Sparsity: \(S = \operatorname{supp}(\beta) = \{j: \beta_j \neq 0\}\) with \(|S| = s < p\). Denote by \(X_S\) the columns \(X_j\) of \(X\) for which \(\beta_j \neq 0\). Consequently the rows of \(X_S\) are distributed as \(\mathrm{N}(0, I_s)\).
- Asymptotic regime: Assume that
- \(\log(p) = o(n)\) and
- \(\log(\max \{\|\beta\|_1, \|\beta\|_2^2 \})= o(n)\), as \(n \to \infty\)
To clarify notation, let \(Y = (Y_1,\ldots, Y_n)\) be the random vector of all training labels \(Y_i\) and \(y\) be one such realisation. Similarly, let \(X\) be the \(n \times p\) random matrix that collects the random feature vectors \(X_i\) as rows and \(\varepsilon\) be the \(n\)-vector of errors \(\varepsilon_i\). Henceforth, we will denote a new random pair from the distribution of \(P\) by \((Y', X') \in \Re \times \Re^p\) and denote the noise by \(\varepsilon'\).
We have chosen a very simple data generating process with normally distributed features and a simple linear model to showcase the performance of LASSO under sparsity. The qualitative results can be generalised to more general data structures. In the LASSO literature, you will often encounter so called “fixed design” setups, where the features are treated as a non-random sequence of data and all randomness comes from the labels. This setting is a bit more general as our “random design” setting where we assume a distribution for the features. Moreover, we have not added an intercept for our linear model. Adding an intercept is equivalent to considering the augmented data matrix \(Z = [\mathbf{1}, X]\) in place of \(X\). This would complicate the algebra slightly when we consider \(\mathrm{E}[(Z^\top Z)^{-1}]\) but technique-wise, nothing changes. Our simplified model is nonetheless more than enough for our ambitions here.
Bayes risk
As a warmup, let us derive the Bayes risk of the linear model under squared loss, against which we will compare the risk of oracle OLS and LASSO. You should be experts on this by now, so why not give it a shot.
Reveal solution
We know that the Bayes estimator under squared loss is \(\psi(x) = \mathrm{E}(Y' \mid X' = x)\). In the linear model, this is simply \(\psi(x) = x^\top \beta\). Hence, the expected squared loss is then \[ \begin{aligned} \mathrm{E}[(Y' - x^\top \beta)^2 \mid X = x] &= \mathrm{E}[(\varepsilon')^2 \mid X' = x] \\ &= \sigma^2 \,, \end{aligned} \] where we used the definition of \(Y\) given \(X\) and that \(\varepsilon\) is independent of \(X\). Thus the Bayes risk is simply
\[ \begin{aligned} \mathrm{E}[(Y' - x^\top \beta)^2 ] &= \mathrm{E}[\mathrm{E}[(\varepsilon')^2 \mid X' = x]] \\ &= \sigma^2 \,. \end{aligned} \]Therefore, we have shown that the Bayes risk is \(\sigma^2\).
Oracle OLS risk
Next, let us derive the risk of the oracle OLS estimator that only selects the columns of \(X\) that are in the support of \(\beta\), that is
\[ \hat{\beta}_{S} = (X_S^\top X_S)^{-1}X_S^\top Y \,. \]
To derive the risk will adapt the following strategy:
- Derive \(\mathrm{E}[(\hat \beta_S - \beta_S)^2]\), where \(\beta_S\) is the vector of all the nonzero entries of \(\beta\); we will use the fact that \[ \mathrm{E}\left[(X_S^\top X_S)^{-1} \right] = \frac{I_s}{n - s - 1} \,. \]
- Using 1., derive \(\mathrm{E}[\{(\beta - \hat \beta_S)^\top X'\}^2]\)
- Using 2. and that \(\mathrm{E}[(Y' - \hat{\beta}_S^\top X')^2] = \mathrm{E}[(\varepsilon' - (\beta - \hat{\beta}_S)^\top X')^2]\), derive the risk of the oracle OLS estimator
Reveal solution
Note that \[ \begin{aligned} \hat{\beta}_S &= (X_S^\top X_S)^{-1}X_S^\top Y \\ &= (X_S^\top X_S)^{-1}X_S^\top(X \beta + \varepsilon) \\ &= (X_S^\top X_S)^{-1}X_S^\top(X_S \beta_S + \varepsilon) \\ &= \beta_S + (X_S^\top X_S)^{-1}X_S^\top \varepsilon\,, \end{aligned} \]
where \(\beta_S\) are all the nonzero entries of \(\beta\). Therefore
\[ \begin{aligned} \mathrm{E}[(\hat \beta_S - \beta_S)(\hat \beta_S - \beta_S)^\top] &= \mathrm{E}[((X_S^\top X_S)^{-1}X_S^\top \varepsilon)((X_S^\top X_S)^{-1}X_S^\top \varepsilon)^\top] \\ &= \mathrm{E}[(X_S^\top X_S)^{-1}X_S^\top \varepsilon \varepsilon^\top X_S (X_S^\top X_S)^{-1}] \\ &= \mathrm{E}[(X_S^\top X_S)^{-1}X_S^\top \mathrm{E}[\varepsilon \varepsilon^\top \mid X_S] X_S (X_S^\top X_S)^{-1}] \\ &= \mathrm{E}[(X_S^\top X_S)^{-1}X_S^\top \mathrm{E}[\varepsilon \varepsilon^\top ] X_S (X_S^\top X_S)^{-1}] \\ &= \mathrm{E}[(X_S^\top X_S)^{-1}X_S^\top \sigma^2 I_n X_S (X_S^\top X_S)^{-1}] \\ &= \sigma^2 \mathrm{E}[(X_S^\top X_S)^{-1}X_S^\top X_S (X_S^\top X_S)^{-1}] \\ &= \sigma^2 \mathrm{E}[(X_S^\top X_S)^{-1}] \\ &= \sigma^2 \frac{I_s}{n - s - 1} \,, \end{aligned} \tag{6}\] where we used that \(\varepsilon\) is independent of \(X_S\) and that \(\mathrm{E}\left[(X_S^\top X_S)^{-1} \right] = \frac{I_s}{n - s - 1}\). The latter fact follows from the fact that \(X_S^\top X_S\) follows a Wishart distribution with parameters \(I_s,n\) and its inverse thus follows an inverse Wishart distribution with the same parameters. The mean of this distribution is precisely \(\frac{I_s}{n - s - 1}\).
Next, since we are working with a linear model, we can easily extend Equation 6 to linear predictions, i.e.
\[ \begin{aligned} \mathrm{E}[(\hat \beta_S^\top X' - \beta_S^\top X')^2] &= \mathrm{E}[((\hat \beta_S - \beta_S)^\top X')^2] \\ &= \mathrm{E}[(\hat \beta_S - \beta_S)^\top X' X'^\top (\hat \beta_S - \beta_S)] \\ &= \mathrm{E}[(\hat \beta_S - \beta_S)^\top \mathrm{E}[ X' X'^\top \mid X, Y] (\hat \beta_S - \beta_S)] \\ &= \mathrm{E}[(\hat \beta_S - \beta_S)^\top \mathrm{E}[ X' X'^\top ] (\hat \beta_S - \beta_S)] \\ &= \mathrm{E}[(\hat \beta_S - \beta_S)^\top I_s(\hat \beta_S - \beta_S)] \\ &= \mathrm{E}[\operatorname{trace}((\hat \beta_S - \beta_S)^\top I_s (\hat \beta_S - \beta_S))] \\ &= \mathrm{E}[\operatorname{trace}((\hat \beta_S - \beta_S) (\hat \beta_S - \beta_S)^\top I_s)] \\ &= \operatorname{trace}(\mathrm{E}[(\hat \beta_S - \beta_S) (\hat \beta_S - \beta_S)^\top] I_s) \\ &= \frac{\sigma^2}{n - s -1} \operatorname{trace}(I_s^{-1} I_s) \\ &= \frac{\sigma^2}{n - s -1} \operatorname{trace}(I_s) \\ &= \sigma^2 \frac{s}{n - s - 1} \,, \end{aligned} \tag{7}\] where we used that \(\hat \beta_S\) is independent of \(X'\) and where \(\operatorname{trace}(A) = \sum_{i = 1}^p A_ii\). From this it follows that \(\operatorname{trace}(AB) = \operatorname{trace}(BA)\) and that \(\mathrm{E}[\operatorname{trace}(A)] = \operatorname{trace}(\mathrm{E}[A])\).
The last step is to derive the risk of \(\hat \beta_S\). For this, we just use the definition of risk, along with our linear modelling assumption and our result from Equation 7:
\[ \begin{aligned} \mathrm{E}[(Y' - \hat\beta_S^\top X')^2] &= \mathrm{E}[(\varepsilon' + (\beta_S - \hat \beta_S)^\top X')^2] \\ &= \mathrm{E}[(\varepsilon')^2] + 2 \mathrm{E}[\varepsilon' (\beta_S - \hat \beta_S)^\top X'] + \mathrm{E}[(\hat \beta_S^\top X' - \beta_S^\top X')^2]\\ &= \sigma^2 + 2 \mathrm{E}[\varepsilon' (\beta_S - \hat \beta_S)^\top X'] + \sigma^2 \frac{s}{n - s - 1} \\ &= \sigma^2 + 2 \mathrm{E}[\mathrm{E}[\varepsilon' \mid X, Y] (\beta_S - \hat \beta_S)^\top X'] + \sigma^2 \frac{s}{n - s - 1} \\ &= \sigma^2 + 2 \mathrm{E}[\mathrm{E}[\varepsilon' ] (\beta_S - \hat \beta_S)^\top X'] + \sigma^2 \frac{s}{n - s - 1} \\ &= \sigma^2 + \sigma^2 \frac{s}{n - s - 1} \,. \end{aligned} \]
Therefore, we have shown that the oracle OLS risk is \(\sigma^2 + \sigma^2 s / (n - s - 1)\), and thus its excess risk \(\sigma^2 s / (n - s - 1)\).
🏔️🏔️Excess risk of the LASSO
Now let us get to the main event, finding the excess risk of the LASSO. For this, let us define the LASSO for our setting without intercept. Namely \(\hat \beta\) is defined as
\[ \hat \beta = \underset{\beta \in \Re^p}{\arg \min} \, \frac{1}{2n} \sum_{i = 1}^n (Y_i - X_i^\top \beta)^2 + \lambda \sum_{i = 1}^p |\beta_i| \,. \]
Note that we have normalised the RSS by a factor of \(1 / 2n\) for convenience and in alignment with the literature. This is similar to our definition of the LASSO with shrinkage parameter \(\lambda' = 2n \lambda\).
The main steps of our derivation are the following:
- Derive a basic inequality and cone condition for \(X \Delta = X(\hat \beta - \beta)\)
- With appropriate scaling of \(\lambda\), conclude that \(\| \Delta_{S^C} \|_1 \leq 3 \| \Delta_S \|_1\) with high probability, where \(S^C\) is the complement of \(S\) and \(\Delta_I\) selects all elements of \(\Delta\) that are in the index set \(I\)
- Using the distributional assumptions about \(X\) to conclude \(\| X \Delta \|_2^2 / n \geq C \|\Delta_S\|_1^2 / s\)
- Derive the in-sample prediction error bound $| X |_2^2/n C’ ^2 s $
- For a new test point \(X'\), compare \(\Delta^\top \Delta\) to \(\| X \Delta \|_2^2/n\) to conclude that \(\mathrm{E}[(X'(\hat\beta - \beta))^2] \leq C \sigma^2 \log(p) s / n\)
So let us get started with the first point.
The LASSO objective function is
\[ \ell(\beta) = \frac{1}{2n} \| Y - X \beta \|_2^2 + \lambda \| \beta \|_1 \,, \]
and from optimality of \(\hat \beta\), we get that \(\ell(\hat \beta) \leq \ell(\beta)\). Now note that
\[ \begin{aligned} \ell(\hat \beta) - \ell(\beta) &= \frac{1}{2n} \left( \| Y - X \hat \beta \|_2^2 - \| Y - X \beta \|_2^2 \right) + \lambda \left(\|\hat \beta\|_1 - \| \beta \|_1 \right) \\ &= \frac{1}{2n} \left( \| \varepsilon - X \Delta \|_2^2 - \| \varepsilon \|_2^2 \right)+ \lambda \left(\|\hat \beta\|_1 - \| \beta \|_1 \right) \\ &= \frac{1}{2n} \left( \| \varepsilon \|_2^2 - 2 \varepsilon^\top X \Delta + \|X \Delta\|_2^2 - \|\varepsilon\|_2^2 \right) + \lambda \left(\|\hat \beta\|_1 - \| \beta \|_1 \right) \\ &= \frac{1}{2n} \|X \Delta\|_2^2 - \frac{1}{n} \varepsilon^\top X \Delta + \lambda \left(\|\beta + \Delta \|_1 - \| \beta \|_1 \right) \\ &\leq 0 \,. \end{aligned} \]
Now rearranging yields the inequality
\[ \frac{1}{2n} \|X \Delta\|_2^2 \leq \frac{1}{n} \varepsilon^\top X \Delta + \lambda \left(\| \beta \|_1 - \|\beta + \Delta \|_1 \right) \,. \tag{8}\]
Now note that \(\beta_{S^C}=0\) and \(\Delta=\Delta_S+\Delta_{S^C}\). Thus, we have \[ \|\beta\|_1-\|\beta+\Delta\|_1 = \|\beta_S\|_1 - \left(\|\beta_S+\Delta_S\|_1 + \|\Delta_{S^C}\|_1\right) \le \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1, \] since \(\|\beta_S+\Delta_S\|_1 \ge \|\beta_S\|_1 - \|\Delta_S\|_1\) by the triangle inequality. Plugging this back into our inequality of Equation 8, we get that
\[ \frac{1}{2n} \|X \Delta\|_2^2 \leq \frac{1}{n} \varepsilon^\top X \Delta + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right) \,. \tag{9}\]
Henceforth, let us condition on the event that
\[ \frac{1}{n} \| X^\top \varepsilon \|_{\infty} \leq \frac{\lambda}{2} \,, \]
which for our Gaussian covariates holds with probability at least \(1 - C / p\) for \(\lambda = C \sigma \sqrt{\log(p)/n}\) for \(C>0\) being a large enough constant and \(c>0\) being a constant that depends on \(\sigma\) and large enough \(n, p\). This is a crucial step to control the noise in the estimation process and explains the scaling of \(\lambda\).
Then Equation 9 yields that
\[ \begin{aligned} \frac{1}{2n} \|X \Delta\|_2^2 &\leq \frac{1}{n} \varepsilon^\top X \Delta + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right)\\ &\leq \frac{1}{n} |\varepsilon^\top X \Delta| + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right) \\ % &\leq \frac{1}{n} \|\varepsilon^\top X \Delta\|_1 + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right) \\ &\leq \frac{1}{n} \|\varepsilon^\top X \|_{\infty} \|\Delta\|_1 + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right) \\ &\leq \frac{\lambda}{2} \|\Delta\|_1 + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right) \\ &= \frac{\lambda}{2} \left( \|\Delta_S\|_1 + \|\Delta_{S^C}\|_1 \right) + \lambda \left( \|\Delta_S\|_1 - \|\Delta_{S^C}\|_1 \right) \\ &= \frac{3\lambda}{2} \|\Delta_S\|_1 - \frac{\lambda}{2} \|\Delta_{S^C}\|_1 \,. \end{aligned} \tag{10}\]
Since the LHS of Equation 10 is nonnegative, we get the inequalities
\[ 3 \|\Delta_S\|_1 \geq \|\Delta_{S^C}\|_1, \quad \frac{1}{2n} \|X \Delta\|_2^2 \leq \frac{3\lambda}{2} \|\Delta_S\|_1 \,. \]
Next, let us look at how \(\| X\Delta \|_2^2 /n\) behaves for \(\Delta\)s such that \(3 \|\Delta_S\|_1 \geq \|\Delta_{S^C}\|_1\).
In particular, let us try and derive a lower bound for \(\|X \Delta \|_2^2 /n\) in terms of \(\|\Delta_S\|_2^2\).
We define the cone \[ \mathcal C(S):=\{\Delta\in\Re^p:\ \|\Delta_{S^C}\|_1 \le 3\|\Delta_S\|_1\} \,. \] For a positive semidefinite matrix \(G\), define the cone–restricted Rayleigh minimum \[ \lambda_S(G)= \inf_{\substack{u\in\mathcal C(S)\\ u\neq 0}} \frac{u^\top G\,u}{\|u\|_2^2} \,. \] Then for any \(\Delta\in\mathcal C(S)\), \[ \frac{1}{n}\|X\Delta\|_2^2 = \Delta^\top\!\Big(\tfrac{1}{n}X^\top X\Big)\Delta \geq \lambda_S\!\Big(\tfrac{1}{n}X^\top X\Big)\,\|\Delta\|_2^2 \geq \lambda_S\!\Big(\tfrac{1}{n}X^\top X\Big)\,\|\Delta_S\|_2^2\,. \]
So it suffices to show that \(\lambda_S\big(\tfrac{1}{n}X^\top X\big)\) is bounded away from 0 with high probability.Recall that \(X\) is a \(n \times p\) matrix. If \(p < n\), then \(\lambda_S\big(\tfrac{1}{n}X^\top X\big)\) is lower bounded by the minimal eigenvalue of \(X^\top X\), for which we have high probability concentration bounds for Gaussian design matrices \(X\). However, LASSO operates also when \(p > n\) in which case \(X^\top X\) is singular. So we need to do better.
Proposition: Let the rows of \(X\in\Re^{n\times p}\) be i.i.d. \(\mathrm{N}(0,I_p)\). There exist universal constants \(c,C>0\) such that if \[ n \geq C\, s\,\log\left(\frac{e p}{s} \right) \,, \] then with probability at least \(1-2e^{-c n}\), \[ \lambda_S \left(\tfrac{1}{n}X^\top X\right) \geq \frac{1-7\delta}{10}\,, \] For some \(\delta < 1/6\). Consequently, for all \(\Delta\in\mathcal C(S)\), \[ \frac{1}{n}\|X\Delta\|_2^2 \geq c_0 \|\Delta\|_2^2 \geq c_0 \|\Delta_S\|_2^2 \,, \]
for \(c_0 = (1 - 7\delta) / 10\), \(\delta < 1/ 6\).
Proof.
We will make use of a very powerful tool in high-dimensional probability, called the restricted isometry property (RIP), that allows us to characterise the quadratic \(u^\top G u\) for \(G=\tfrac{1}{n}X^\top X\). We state a variant of it below.
With \(n\gtrsim s\log(ep/s)\), the Restricted Isometry/Orthogonality holds with high probability (see, for example, Theorem 9.5.7): For all vectors supported on at most \(2s\) coordinates, \[ (1-\delta)\|v\|_2^2 \le v^\top G v \le (1+\delta)\|v\|_2^2, \qquad |u^\top G v|\le \delta\,\|u\|_2\|v\|_2\,, \] for some \(\delta < 1/7\).
Note that this fairly close to what we want. The only problem is that for our \(\Delta \in \mathcal{C}(S)\), we do not have any sparsity guarantees readily available. Our strategy will thus be as follows: pick a \(u \in \mathcal{C}(S)\) and decompose it into two parts \(u_T, u_{T^C}: u = u_T + u_{T^C}\), for which we can control the sparsity. Our quadratic then becomes
\[ u^\top G u = u_T^\top G u_T + 2u_T^\top G u_{T^C} + u_{T^C}^\top G u_{T^C} \,, \tag{11}\] on which we can use RIP to derive lower bounds. So let us find such a decomposition of \(u\). For this define \(T := S \cup T_1\), where \(T_1\) indexes the \(s\) coordinates of \(u\) on \(S^C\) that are largest in absolute value. Then \(\| u_T \|_0 \leq 2s\) since \(|S| = s\) and \(|T_1|=s\). Therefore, by RIP, for \(n\) large enough and with high probability, \[ u_T^\top G u_T \geq (1 - \delta) \|u_T\|_2^2 \,. \tag{12}\]
Now this is nice but for this to be useful for us, we need the lower bound in terms of \(\| u \|_2\) not \(\|u_T\|_2\). Since \(\|u\|_2^2=\|u_T\|_2^2+\|u_{T^C}\|_2^2\) if we can upper bound \(\|u_{T^C} \|_2\) in terms of \(\|u_{T} \|_2\), we can use it in Equation 12. This is where our construction of \(T\) comes into play!
Note that \[ \|u_{T^C}\|_\infty \leq \frac{\|u_{S^C}\|_1}{s}\,, \] by construction of \(T\). Now using that \(\|v\|_2^2 \le \|v\|_\infty \|v\|_1\), \[ \|u_{T^C}\|_2^2 \leq \|u_{T^C}\|_\infty \|u_{T^C}\|_1 \leq \frac{\|u_{S^C}\|_1}{s}\,\|u_{S^C}\|_1 = \frac{\|u_{S^C}\|_1^2}{s}\,. \] If \(u\) lies in the Lasso cone \(\|u_{S^C}\|_1 \le 3\|u_S\|_1\), then \[ \|u_{T^C}\|_2 \leq \frac{\|u_{S^C}\|_1}{\sqrt{s}} \leq \frac{3\|u_S\|_1}{\sqrt{s}} \leq 3\,\|u_S\|_2 \leq 3\,\|u_T\|_2\,, \] where we used \(\|u_S\|_1 \le \sqrt{s}\,\|u_S\|_2\) and \(S\subseteq T\). Therefore,
\[\|u\|_2^2=\|u_T\|_2^2+\|u_{T^C}\|_2^2 \le 10\|u_T\|_2^2 \,,\]
which is the kind of inequality we were after.
Next, we need to contend with all terms in the quadratic expansion of Equation 11 that involve \(u_{T^C}\). Unlike for \(u_T\), we do not have explicit control over the sparsity of \(u_{T^C}\).
Partition \(T^C\) into disjoint blocks \(T_2,T_3,\dots\) of size at most \(s\), ordered by decreasing magnitudes of \(u\) on \(S^C\). Note that restricted orthogonality holds uniformly over all \(2s\) sparse vectors and thus for any \(u_{T_j}\) simultaneously. By restricted orthogonality (applied to the disjoint supports \(T\) and each \(T_j\)), \[ \begin{aligned} |u_T^\top G u_{T^C}| & =\Big|\sum_{j\ge 2} u_T^\top G u_{T_j}\Big|\\ &\le \delta\,\|u_T\|_2 \sum_{j\ge 2}\|u_{T_j}\|_2 \\ &\le \delta\,\|u_T\|_2\,\frac{1}{\sqrt{s}}\sum_{j\ge 2}\|u_{T_j}\|_1 \\ &= \delta\,\|u_T\|_2\,\frac{\|u_{T^C}\|_1}{\sqrt{s}} \\ &\le 3\delta\,\|u_T\|_2^2 \,, \end{aligned} \] where we used \(\|v\|_2\le \|v\|_1/\sqrt{s}\) on each block, the cone \(\|u_{S^C}\|_1\le 3\|u_S\|_1\), and \(S\subseteq T\).
Hence, we can decompose \[ \begin{aligned} u^\top G u &= u_T^\top G u_T + 2u_T^\top G u_{T^C} + u_{T^C}^\top G u_{T^C}\\ &\geq u_T^\top G u_T + 2u_T^\top G u_{T^C}\\ & \geq (1-\delta)\|u_T\|_2^2 - 2\cdot 3\delta\,\|u_T\|_2^2 \\ & = (1-7\delta)\|u_T\|_2^2\,. \end{aligned} \]
Since we already know that \(\|u\|_2^2=\|u_T\|_2^2+\|u_{T^C}\|_2^2 \le 10\|u_T\|_2^2\), we obtain \[ \frac{u^\top G u}{\|u\|_2^2} \geq \frac{1-7\delta}{10}\,. \] Since RIP and restricted orthogonality holds uniformly for all \(2s\)-sparse vectors, we can take the infimum over \(u\in\mathcal C(S)\) yields the claim. \(\square\)
Now using Cauchy-Schwarz, we get that
\[ \|\Delta_S \|_2^2 \geq \frac{\|\Delta_S\|_1^2}{s} \,, \]
so that with high probability
\[ \frac{1}{n} \| X \Delta \|_2^2 \geq c_0 \frac{\|\Delta_S\|_1^2}{s} \,. \tag{13}\]
Combining Equation 10 with Equation 13, we get that with high probability,
\[ \begin{aligned} \frac{1}{n} \| X \Delta \|_2^2 &\leq 3 \lambda \| \Delta_S \|_1 \\ &\leq3 \lambda \left(\frac{s}{n} \{c_0\}^{-1} \| X \Delta \|_2^2 \right)^{1/2} \,. \end{aligned} \]
Rearranging and using that \(\lambda = \sqrt{C \log(p) / n}\) for some adequately chosen \(C>0\), yields that
\[ \frac{1}{n} \|X \Delta \|_2^2 \leq \sigma C' \frac{s \log(p)}{n} \,, \]
for some constant \(C'\). Thus we have derived a high-probability in-sample prediction error, which is of order \(s \log(p) / n\). All that remains is to extrapolate to a new test point \(X'\).
Note that
\[ \begin{aligned} \mathrm{E}[(\Delta^\top X')^2] &= \mathrm{E}[\Delta^\top (X'X'^\top)\Delta] \\ &= \mathrm{E}[\Delta^\top \mathrm{E}[X'X'^\top \mid X, Y]\Delta] \\ &= \mathrm{E}[\Delta^\top \mathrm{E}[X'X'^\top ]\Delta] \\ &= \mathrm{E}[\Delta^\top I_p \Delta] \\ &= \mathrm{E}[\Delta^\top \Delta] \\ &= \mathrm{E}[\|\Delta \|_2^2] \,. \end{aligned} \]
Now on the same high probability event where the restricted eigenvalue holds, we have that
\[ \underline{c} \|\Delta \|_2^2 \leq \frac{1}{n} \| X \Delta \|_2^2 \,, \] for some constant \(\underline{c}>0\). Thus on that event, call it \(\mathcal E\),
\[ \mathrm{E}[\|\Delta \|_2^2 \mid X, Y] = \| \Delta\|_2^2 \leq \underline{c}^{-1} \frac{1}{n} \|X \Delta \|_2^2 \leq C'' \sigma \frac{s \log(p)}{n} \,, \] with probability at least \(1 - c_1 \exp\{-c_2 n \}\) for some absolute constants \(c_1,c_2 >0\).
It remains to bound \(\Delta\) on \(\mathcal{E}^C\). Note that by LASSO optimality,
\[ \begin{aligned} \frac{1}{2n} \|Y - X\hat\beta\|_2^2 + \lambda \|\hat\beta\|_1 \leq \frac{1}{2n} \|Y \|_2^2 \,, \end{aligned} \] which we got by comparing the LASSO solution to the point \(0\). Now, it follows that \[ \begin{aligned} \|\hat\beta\|_1 \leq \frac{1}{\lambda}\left(\frac{1}{2n} \|Y \|_2^2 -\frac{1}{2n} \|Y - X\hat\beta\|_2^2 \right) \\ \leq \frac{1}{\lambda} \left(\frac{1}{2n} \|Y \|_2^2 \right) \,. \end{aligned} \]
Therefore,
\[ \|\Delta \|_2 \leq \| \Delta \|_1 \leq \| \hat\beta \|_1 + \|\beta\|_1 \leq \frac{1}{\lambda} \left(\frac{1}{2n} \|Y \|_2^2 \right) + \|\beta\|_1 \,, \]
and \[ \|\Delta \|_2^2 \leq \frac{1}{2\lambda^2 n^2} \|Y \|_2^4 + 2\|\beta\|_1^2 \,, \] which holds always, and thus in particular on the event \(\mathcal{E}^C\).
Now
\[ \begin{aligned} \mathrm{E}[\| \Delta \|_2^2 \mathbb{1}\{\mathcal E^C\} ] &= \frac{1}{2 \lambda^2 n^2} \mathrm{E}[\| Y \mathbb{1}\{\mathcal E^C\} \|_2^4] + 2 \|\beta \|_1^2 \Pr(\mathcal{E}^C) \\ &\leq \frac{1}{2 \lambda^2 n^2} \mathrm{E}[\| Y \|_2^8]^{1/2} \Pr(\mathcal{E}^C)^{1/2} + 2 \|\beta \|_1^2 \Pr(\mathcal{E}^C) \\ &\leq \frac{1}{2 \lambda^2 n^2} cn^2 \|\beta\|_2^4 \Pr(\mathcal{E}^C)^{1/2} + 2 \|\beta \|_1^2 \Pr(\mathcal{E}^C) \\ &\leq c'' \max \{\|\beta\|_1^2, \|\beta\|_2^4\} \frac{n}{\log(p)} \exp\{-c''' n\} \,. \end{aligned} \]
Therefore
\[ \begin{aligned} \mathrm{E}[\| \Delta\|_2^2] &= \mathrm{E}[\| \Delta \|_2^2\mathbb{1}\{\mathcal E\}] + \mathrm{E}[\| \Delta \ \|_2^2 \mathbb{1}\{\mathcal E^C\}] \\ &\leq \frac{ \sigma C'' s \log(p)}{n} + \mathrm{E}[\| \Delta \|_2^2 \mathbb{1}\{\mathcal E^C\}] \\ &\leq \frac{\sigma C'' s \log(p)}{n} + c'' \max \{\|\beta\|_1^2, \|\beta\|_2^4\} \frac{n}{\log(p)} \exp\{-c''' n\} \\ & \leq C''' \sigma \frac{s \log(p)}{n} \,, \end{aligned} \]
for \(n\) large enough and \(p\) not growing too fast.
Hence, as was the case with the risk of oracle OLS, we have that the risk of the LASSO is
\[ \sigma^2 + \sigma^2 C''' \frac{s \log(p)}{n} \,, \]
and thus its excess risk is
\[ \sigma^2 C''' \frac{s \log(p)}{n} \,, \]
which is what we wanted to show. Note that this means that if \(\beta\) is sparse enough such that
\[ \frac{s \log(p)}{n} \to 0 \,, \]
then the LASSO has zero excess risk asymptotically. That was quite sophisticated and long, well done if you have made it through.
AIC & BIC
Introduction
In this section, we will take a closer look at AIC and BIC and what theoretical justifications they hold for model selection. Since model selection is only a small portion of this course, I will keep these notes quite short.
We will look at two important aspects of AIC and BIC:
We will give a motivation why AIC might be preferable in a prediction context
We show that BIC selects the correct model with probability going to one, if it is contained in the set of candidate models, while AIC does not
First, I will give you an informal “big-picture” rundown of AIC and BIC. We will then formally derive these properties.
AIC
To motivate the AIC, I find it useful to have this scenario in mind: Say we have some data which are i.i.d. realisations from random variables \(Z_1, \ldots, Z_n\) with distribution \(g\), and we try to estimate \(g\) from a parametric family \(\mathcal F = \{f(z; \theta), \theta \in \Theta \subseteq \Re^{d}\}\) of probability density functions, where we assume that \(g\) can be represented by \(\mathcal{F}\), i.e. there is a \(\theta_0 \in \Theta\) such that \(g = f(\cdot; \theta_0)\). The standard weapon of choice for this estimation problem is maximum likelihood estimation, which minimises the empirical risk under log-loss over \(\mathcal{F}\), i.e.
\[ \hat \theta = \underset{\theta \in \Theta}{\arg \min} \, -\sum_{i = 1}^n \log(f(Z_i; \theta)) \,. \]
Now, as so often in this course, we might ask ourselves how good of a job \(\hat \theta\) does of estimating \(\theta_0\). A natural way to measure this performance is by considering the risk of \(\hat{\theta}\) under log-loss, i.e. \[ R(\hat \theta) = -n \mathrm{E}[\log(f(Z; \hat \theta))] \,, \]
where the expectation is taken both with respect to the randomness of an independent test draw \(Z\) and \(\hat \theta\), which depends on the training data, so that \(R(\hat \theta)\) is just a number.
If you are wondering why the risk under log-loss is an appropriate measure, note that the risk under log-loss, is, up to some additive constants that only depend on \(g\), equivalent to the Kullback-Leibler divergence
\[ \mathrm{KL}(g \Vert h) = \int \log \left(\frac{g(x)}{h(x)} \right) g(x) dx \,, \]
between two distributions. The KL-divergence is a measure of how far apart two distributions are from one another over their support. Hence, the risk of the ML-estimator under log-loss can be seen as a measure of how far apart the density \(f(\cdot;\hat \theta)\) is from the target \(f(\cdot; \theta_0)\).
Unfortunately, the risk \(R(\hat \theta)\) requires knowledge of the unknown distribution \(g\), so we cannot compute it in practical contexts. However, we can try to estimate it! Note that for any fixed \(\theta\), the empirical risk
\[ R_n(\theta) = - \sum_{i = 1}^n \log(f(Z_i; \theta)) \,, \]
is an unbiased estimator of the risk \(R(\theta)\). The problem is, that when we consider \(R_n(\hat \theta)\), we use the same data to estimate \(\theta_0\) as \(\hat \theta\), and to estimate \(R_n(\theta)\). This introduces bias:
\[ R_n(\hat \theta) = R(\hat \theta) + \underbrace{(R_n(\hat \theta) - R(\hat \theta))}_{=b_n} \,, \]
with \[ \mathrm E [b_n] \neq 0 \,. \]
The AIC acknowledges this problem and tries to find an asymptotic correction to this bias. That is, it finds an expression \(B\) such that \(\mathrm E[B - b_n] \to 0\) as \(n\to \infty\). With that correction, we can devise an asymptotically unbiased estimator of \(R(\hat \theta)\) by correcting \(R_n(\hat \theta)\) by \(B\). That is
\[ \begin{aligned} \mathrm E[R_n (\hat{\theta}) - B] &= \mathrm R (\hat{\theta}) + E[b_n - B] \\ &\to \mathrm R (\hat{\theta}), \quad \text{as } n \to \infty \,. \end{aligned} \]
The AIC finds one such correction, namely \(B=-d\). Rescaling by \(2\) yields the typical AIC formula
\[ \text{AIC} = -2 \sum_{i = 1}^n \log(f(Z_i; \hat \theta)) + 2d \,, \] with \[ \mathrm E[\text{AIC}] = 2 R(\hat \theta) + o(1) \,. \]
The scaling of the AIC statistic by the factor \(2\) is due to historical reasons, where the AIC was derived in terms of KL-divergences, but is immaterial.
AIC and model selection
We now have an idea about what the AIC wants to be and how it came into being. However, it is perhaps not immediately clear how this motivates its use for model selection. Oftentimes, people jump to the incorrect conclusion that AIC approximates risk, and therefore will choose the risk-minimal model at least asymptotically. So let us see what AIC does and does not.
Model comparison: The fact that AIC is an asymptotically unbiased estimator of risk was derived under the assumption that the family of distribution functions \(\mathcal{F}\) can represent \(g\). When we do model selection, we compare the AIC from different families of distributions \(\mathcal{F}, \mathcal{F}',\ldots\) to choose our model. If \(\mathcal{F}'\) cannot represent \(g\), there is no guarantee that its AIC is an approximation of risk. Reassuringly, one can show that if given the choice between two models, one of which can represent \(g\) while the other cannot, AIC will favour the model that can represent \(g\) with probability going to one. However, this needs to be proven and does not follow immediately from the fact that AIC approximates risk in expectation.
Model selection consistency: AIC is a random variable. Although its expectation approximates risk, it will fluctuate away from its expectation. In particular, this leads to the fact that if we want to choose from two nested models that can represent \(g\), AIC is not guaranteed to pick the true model with probability going to one. Thus it is not consistent as a model selection tool.
Asymptotic efficiency for predictive risk: Under some regularity conditions, the model chosen by AIC, although not necessarily the true model, achieves expected predictive risk that is asymptotically as low as the risk of the best candidate model.
Thus AIC is prediction oriented, in that the model it chooses achieves asymptotically optimal risk, not truth-oriented in that we can select wrong models with positive probability. A final caveat for all of these points is that in these models comparisons it is assumed that at least one model can represent \(g\), for which there are no guarantees, and on the contrary, it might appear more plausible that all candidates models cannot do so. We can generalise the ideas of AIC to this case with another information criterion, the TIC (Takeuchi information criterion), but this is well beyond the scope of this note.
BIC
The BIC (Bayes information criterion) is an alternative to the AIC and is motivated from Bayesian ideas. I do not want to dwell on details for the BIC too much, but simply state that it is an approximation to the Bayes factor, which is a model selection tool in Bayesian statistics.
There are two main take-aways from the BIC:
- Model selection consistency: Unlike the AIC, the BIC is consistent for model selection. It will, from a fixed finite set of candidate models that contains the true model, select the true model with probability going to one.
- Conservative predictions: The price the BIC pays for model selection consistency is a harsher penalty compared to the AIC, which may lead to models that are too parsimonious and thus lead to worse predictive performance. Thus, in prediction contexts, the AIC is typically preferred over the BIC.
🏔️ Derivations
Setup
In this section, we will formally derive all the theoretical properties of AIC and BIC that we hinted at in the earlier parts of this note. These derivations are formal, but not overly technical, so do not get discouraged by the amount of displayed. Given the widespread use of AIC and BIC in practice, I think every statistician should have looked at their most important properties at least once.
Throughout, we will assume the following setup:
Data: Let \(Z_1, \ldots, Z_n\) be random variables from a distribution \(g\).
Candidate models: We consider a collection of likelihood-based candidate models \(\mathcal M = \{M_1, \ldots, M_J\}\).
Parametric models: Each model \(M\) specifies a parametric family \(\mathcal F_{M} = \{f_M(z; \theta), \theta \in \Theta_M \subset \Re^{d_M}\}\) of probability density functions.
Maximum likelihood estimator: For model \(M\), define its log-likelihood function as \[ \ell_M(\theta) = \sum_{i = 1}^n \log (f_M(Z_i; \theta)) \,, \tag{14}\] and the maximum-likelihood (ML) estimator as \[ \hat \theta_M \in \underset{\theta \in \Theta_M}{\arg \max} \, \ell_M(\theta) \,. \tag{15}\]
Regularity conditions: We assume the following:
- The ML estimator \(\hat \theta_M\) exists (for each \(M\)) and \(\ell_M\) is twice continuously differentiable
- There exists at least one model \(M_0 \in \mathcal M\) and a \(\theta_0 \in \Theta_{M_0}\) such that \(g = f_{M_0}(\cdot; \theta_0)\) (i.e. the DGP is contained in some candidate model)
- For any \(M\) that contains \(g\) via \(\theta_0\), the Fisher information \(I_M(\theta_0) := \mathrm{E}\!\left[- \nabla \nabla^\top \log f_M(Z; \theta_0)\right]\) is positive definite and \[ \sqrt{n}\,(\hat \theta_M - \theta_0) \ \overset{d}{\to}\ \mathrm N\!\left(0 ,\, I_M(\theta_0)^{-1}\right) \,. \tag{16}\]
- Interchange of \(\mathrm{E}\) with \(\nabla\) and \(\nabla \nabla^\top\) is valid; under correct specification the information equality holds \(\mathrm{var}\!\big(\nabla \log f_M(Z;\theta_0)\big)=I_M(\theta_0)\); and \(\mathrm{E}\|\nabla \log f_M(Z;\theta_0)\|^{4} < \infty\) (so \(\{\|s_n\|^2\}\) is uniformly integrable and \(\mathrm{E}\|s_n\|^4=O(n^{-2})\)), where \(s_n:=\nabla \hat L_{n,M}(\theta_0)\) and \(\hat L_{n,M}(\theta):=\ell_M(\theta)/n\).
- Third derivatives of \(\log f_M(z;\theta)\) are bounded in a neighbourhood of \(\theta_0\); with probability tending to one the matrix \(J_n:=-\,\nabla\nabla^\top \hat L_{n,M}(\theta_0)\) is positive definite and \(J_n \xrightarrow{p} I_M(\theta_0)\); moreover \(J_n^{-1}\xrightarrow{L^2} I_M(\theta_0)^{-1}\).
The AIC for model \(M\) is defined as \[ \mathrm{AIC}(M) = - 2 \,\ell_M(\hat \theta_M) + 2 d_M \,, \tag{17}\] and the BIC as \[ \mathrm{BIC}(M) = -2 \,\ell_M(\hat \theta_M) + d_M \log n \,. \tag{18}\]
Deriving AIC
In this section we show that, within a given model that contains \(g\), AIC is an asymptotically unbiased estimator (per observation) of the population test risk under log-loss.
Throughout this section, fix a model \(M\) that contains \(g\) via \(\theta_0\). Let \(\hat L_{n,M}(\theta) := \ell_M(\theta)/n\) and define the (per-observation) population log-likelihood as \[ L_M(\theta) := \mathrm{E}[\log f_M(Z;\theta)] \,, \tag{19}\] which is just the negative of the per-observation population test risk.
Training expansion
Set \(s_n := \nabla \hat L_{n,M}(\theta_0)\) and \(H_n := \nabla \nabla^\top \hat L_{n,M}(\theta_0)\). A first-order Taylor expansion of the score together with the score equation \(\nabla \hat L_{n,M}(\hat\theta_M)=0\) yields \[ 0 = s_n + H_n (\hat \theta_M - \theta_0) + r_n, \qquad \|r_n\| = o_p(\|\hat \theta_M - \theta_0\|) \,. \tag{20}\]
By the law of large numbers and the regularity conditions, \[ H_n \ \xrightarrow{p}\ \nabla\nabla^\top L_M(\theta_0) \,=\, -\,I_M(\theta_0), \] and \(J_n:=-H_n\) is positive definite with probability tending to \(1\). Since \(\|s_n\|=O_p(n^{-1/2})\), the score expansion Equation 20 implies \[ \hat \theta_M - \theta_0 = -\, H_n^{-1} s_n + o_p(n^{-1/2}) \,. \tag{21}\]
A second-order expansion of \(\hat L_{n,M}\) at \(\theta_0\) gives \[ \hat L_{n,M}(\hat \theta_M) = \hat L_{n,M}(\theta_0) + s_n^\top(\hat \theta_M - \theta_0) + \tfrac12(\hat \theta_M - \theta_0)^\top H_n (\hat \theta_M - \theta_0) + o_p(n^{-1}) \,. \tag{22}\] By bounded third derivatives and \(\hat\theta_M-\theta_0=O_p(n^{-1/2})\), the remainder is \(O_p(\|\hat\theta_M-\theta_0\|^3)=O_p(n^{-3/2})=o_p(n^{-1})\).
Using Equation 21 in Equation 22 and writing \(J_n:=-H_n\), we obtain \[ \begin{aligned} \hat L_{n,M}(\hat \theta_M) & = \hat L_{n,M}(\theta_0) - \tfrac12\, s_n^\top H_n^{-1} s_n + o_p(n^{-1}) \\ &= \hat L_{n,M}(\theta_0) + \tfrac12\, s_n^\top J_n^{-1} s_n + o_p(n^{-1}) \,. \end{aligned} \tag{23}\]
Taking expectations, and using \(J_n \xrightarrow{p} I_M(\theta_0)\) together with \(\mathrm{E}[s_n]=0\) and \(\mathrm{Var}(s_n)=I_M(\theta_0)/n\) (information equality), we have \[ \begin{aligned} \mathrm{E}\!\left[s_n^\top J_n^{-1} s_n\right] &=\mathrm{E}\!\left[s_n^\top I_M(\theta_0)^{-1} s_n\right] + o\!\left(\tfrac{1}{n}\right)\\ &=\tfrac{1}{n}\,\mathrm{tr}\!\big(I_M(\theta_0)^{-1} I_M(\theta_0)\big) + o\!\left(\tfrac{1}{n}\right)\\ &=\tfrac{d_M}{n} + o\!\left(\tfrac{1}{n}\right), \end{aligned} \] where the \(o(1/n)\) term follows from \(J_n^{-1}\xrightarrow{L^2} I_M(\theta_0)^{-1}\) and \(\mathrm{E}\|s_n\|^4=O(n^{-2})\) via Cauchy–Schwarz; uniform integrability then yields \(\mathrm{E}[o_p(n^{-1})]=o(n^{-1})\). Therefore, \[ \mathrm{E}\big[\hat L_{n,M}(\hat \theta_M)\big] = L_M(\theta_0) + \frac{d_M}{2n} + o\!\left(\frac{1}{n}\right). \tag{24}\]
Equivalently, \[ \mathrm{E}\!\left[-\frac{2}{n}\,\ell_M(\hat \theta_M)\right] = -2\,L_M(\theta_0) - \frac{d_M}{n} + o\!\left(\frac{1}{n}\right). \tag{25}\]
Test expansion
Since \(\nabla L_M(\theta_0) = 0\) and \(\nabla \nabla^\top L_M(\theta_0) = - I_M(\theta_0)\), a second-order Taylor expansion yields \[ L_M(\hat \theta_M) = L_M(\theta_0) - \tfrac12 (\hat \theta_M - \theta_0)^\top I_M(\theta_0) (\hat \theta_M - \theta_0) + o_p(n^{-1}) \,. \tag{26}\] Using Equation 16,
\[ \mathrm{E}\!\big[(\hat \theta_M - \theta_0)(\hat \theta_M - \theta_0)^\top\big] = I_M(\theta_0)^{-1}/n + o(n^{-1})\,,\] so \[ \mathrm{E}\big[L_M(\hat \theta_M)\big] = L_M(\theta_0) - \frac{d_M}{2n} + o\!\left(\frac{1}{n}\right)\,. \tag{27}\] By the definition \(L_M(\theta)\) in Equation 19, \[ \mathrm{E}\!\left[-\frac{2}{n}\, \ell_M(\hat \theta_M )\right] = -2\,\mathrm{E}\big[L_M(\hat \theta_M)\big] = -2\,L_M(\theta_0) + \frac{d_M}{n} + o\!\left(\frac{1}{n}\right). \tag{28}\]
Matching AIC to test risk
From Equation 17 and Equation 25, \[ \begin{aligned} \mathrm{E}\!\left[\frac{1}{n}\,\mathrm{AIC}(M)\right] &= \mathrm{E}\!\left[-\frac{2}{n}\,\ell_M(\hat \theta_M)\right] + \frac{2d_M}{n} \\ &= -2\,L_M(\theta_0) + \frac{d_M}{n} + o\!\left(\frac{1}{n}\right)\,. \end{aligned} \tag{29}\] Comparing Equation 29 with Equation 28 shows that for any model that contains \(g\), \(\mathrm{AIC}(M)/n\) is an asymptotically unbiased estimator of its population test risk per observation.
AIC with underspecified models
We have shown that for a model that can represent the true DGP, AIC is an approximation of the test risk of the ML estimator under log-loss. However, when we conduct model selection, we compare different models with one another. So, let us now see how AIC behaves when a model is underspecified, i.e. it cannot represent \(g\).
For this, assume there exists at least one model \(M_0\) with \(g = f_{M_0}(\cdot;\theta_0)\).
Let \(M\) be any model that does not contain \(g\) and define the KL–projection \[
\theta_M^\star \in \underset{\theta\in\Theta_M}{\arg\max}\ L_M(\theta) \,,
\] with \[
L_M(\theta_M^\star) = \mathrm{E}[\log g(Z)] - \mathrm{KL}\!\big(g \,\Vert\, f_M(\cdot;\theta_M^\star)\big) < \mathrm{E}[\log g(Z)] \,.
\tag{30}\]
Here, \(\mathrm{KL}(g \Vert h)\) is the Kullback-Leibler divergence between two distributions \(g,h\), which is defined as \[ \mathrm{KL}(g \Vert h) = \int \log \left(\frac{g(x)}{h(x)} \right) g(x) dx \,, \] and which is a way to quantify how far two distributions are apart from one another. The Kullback-Leibler divergence is zero if and only if \(g = h\) (\(g\) almost everywhere) and positive otherwise. Hence, the KL-projection \(\theta_M^{\star}\) is the point in \(\Theta_M\) that produces the distribution in \(M\) that is closest to \(g\) as measured by KL divergence.
By standard M-estimation arguments (uniform LLN and the argmax theorem), \[ \frac{1}{n}\,\ell_M(\hat \theta_M)\overset{p}{\to} L_M(\theta_M^\star), \qquad \frac{1}{n}\,\ell_{M_0}(\hat \theta_{M_0})\overset{p}{\to} L_{M_0}(\theta_0)=\mathrm{E}[\log g(Z)]. \tag{31}\] Therefore, \[ \begin{aligned} \frac{1}{n}\Big(\mathrm{AIC}(M)-\mathrm{AIC}(M_0)\Big) &= -2\!\left(\frac{1}{n}\,\ell_M(\hat \theta_M) - \frac{1}{n}\,\ell_{M_0}(\hat \theta_{M_0})\right) + \frac{2(d_M-d_{M_0})}{n} \\ & \overset{p}{\to} 2\,\mathrm{KL}\!\big(g \,\Vert\, f_M(\cdot;\theta_M^\star)\big) \\ &> 0 \,. \end{aligned} \tag{32}\] Hence, asymptotically AIC will not choose underspecified models (those that do not contain \(g\)).
AIC with nested models
So we know that AIC will not choose models that cannot represent \(g\). Another important aspect of model selection is to select small models. Ideally that would mean that we select the smallest model \(M\) that is able to represent \(g\) if such a model is available. We consider this next.
Suppose \(M_0 \subset M_1\) and \(g \in \mathcal F_{M_0}\) with dimensions \(d_0<d_1\). Since \(L_{M_0}(\theta_0)=L_{M_1}(\theta_0)=\mathrm{E}[\log g(Z)]\), apply Equation 28 to both \(M_0\) and \(M_1\): \[ \mathrm{E}\!\left[\frac{1}{n}\,R(M_k)\right] = -2\,\mathrm{E}[\log g(Z)] + \frac{d_k}{n} + o\!\left(\frac{1}{n}\right), \qquad k\in\{0,1\}. \tag{33}\] Thus the population test risk per observation is minimised by the smallest model that contains \(g\) (here \(M_0\)), with gap \(\frac{d_1-d_0}{n}+o(n^{-1})\).
This means that in expectation, the AIC gap between two nested models that both can represent \(g\), is strictly positive and we would prefer the smallest such model. However, as we said before, AIC is a random variable that fluctuates away from its expectation. In particular, it can be shown that AIC will select the overspecified model with positive probability, even though the population comparison prefers the smallest true model. We will show this fact once we have derived the BIC.
Deriving BIC
We derive BIC from the (log) marginal likelihood (Bayesian evidence) using a Laplace approximation around the MLE. Let \(M\) have prior \(\pi(M)\) and parameter prior \(\pi(\theta\mid M)\) with \(\pi(\theta_0\mid M)>0\). The model evidence is \[ p(Z_{1:n}\mid M) =\int \exp\{\ell_M(\theta)\}\,\pi(\theta\mid M)\,d\theta. \tag{34}\]
Assume \(g\in\mathcal F_M\) via \(\theta_0\), \(\hat\theta_M\to_p\theta_0\), and that \(\ell_M\) is twice continuously differentiable near \(\theta_0\) with \[ -\frac{1}{n}\,\nabla\nabla^\top \ell_M(\theta_0)\ \xrightarrow{p}\ I_M(\theta_0)\ \text{(pd)}. \] A second-order expansion of \(\ell_M\) at \(\hat\theta_M\) gives \[ \ell_M(\theta) =\ell_M(\hat\theta_M) -\frac{n}{2}\,(\theta-\hat\theta_M)^\top I_M(\theta_0)\,(\theta-\hat\theta_M) + o_p\!\big(n\|\theta-\hat\theta_M\|^2\big). \tag{35}\] Define the observed (per-observation) information at the MLE \[ \hat I_{n,M}(\hat\theta_M)\;:=\;-\frac{1}{n}\,\nabla\nabla^\top \ell_M(\hat\theta_M)\,. \tag{36}\]
Plugging the quadratic expansion Equation 35 into the evidence Equation 34 and applying Laplace’s method yields \[ \begin{aligned} \log p(Z_{1:n}\mid M) &=\ell_M(\hat\theta_M)-\frac{d_M}{2}\log n -\frac{1}{2}\log\det \hat I_{n,M}(\hat\theta_M) \\ &\quad + \frac{d_M}{2}\log(2\pi)+\log \pi(\hat\theta_M\mid M) + o_p(1)\,. \end{aligned} \tag{37}\]
Under regularity, \(\hat I_{n,M}(\hat\theta_M)\xrightarrow{p} I_M(\theta_0)\), so the term \(-\tfrac12\log\det \hat I_{n,M}(\hat\theta_M)\) can be replaced by \(-\tfrac12\log\det I_M(\theta_0)+o_p(1)\). Multiplying by \(-2\) and dropping \(O_p(1)\) terms that do not grow with \(n\) (they may differ across models but do not affect asymptotic selection) gives the BIC criterion \[ \mathrm{BIC}(M)\ :=\ -2\,\ell_M(\hat\theta_M)+d_M\log n\,, \] up to an additive constant.
Model selection (in)consistency with AIC and BIC
Let \(M_0\in\mathcal M\) be the true model: \(g(\cdot)=f_{M_0}(\cdot;\theta_0)\) for some \(\theta_0\in\Theta_{M_0}\). A selection rule \(S_n:\{Z_i\}_{i=1}^n\mapsto\mathcal M\) is consistent if \(\Pr\{S_n=M_0\}\to 1\) as \(n\to\infty\).
We study differences of criteria \(C_n\in\{\mathrm{AIC},\mathrm{BIC}\}\): \[ \Delta C_n(M,M_0)\ :=\ C_n(M)-C_n(M_0). \tag{38}\]
Underspecified competitors
Let \(M\) be a candidate model that cannot represent \(g\) and \(\theta_M^\star\in\arg\max_{\theta\in\Theta_M}\ \mathrm E[\log f_M(Z;\theta)]\) be the KL projection. By the LLN, \[ \begin{aligned} \frac{1}{n}\Big(\ell_M(\hat\theta_M)-\ell_{M_0}(\hat\theta_{M_0})\Big) & \overset{p}{\to} L_M(\theta_M^\star)-L_{M_0}(\theta_0) \\ &= -\,\mathrm{KL}\!\big(g\Vert f_M(\cdot;\theta_M^\star)\big) \\ & < 0 \,. \end{aligned} \tag{39}\] Hence \(\ell_{M_0}(\hat\theta_{M_0})-\ell_M(\hat\theta_M) = n\,c(M)+o_p(n)\) with \(c(M)=\mathrm{KL}(g\Vert f_M(\cdot;\theta_M^\star))>0\), so \[ \begin{aligned} \Delta\mathrm{AIC}_n(M,M_0)=2nc(M)+O_p(1)&\ \xrightarrow{p}\ +\infty\,, \\ \Delta\mathrm{BIC}_n(M,M_0)=2nc(M)+O_p(\log n)&\ \xrightarrow{p}\ +\infty \,. \end{aligned} \tag{40}\] Thus, both AIC and BIC eventually reject every underspecified \(M\).
Overspecified competitors
Let \(M\) be a candidate model that can represent \(g\) with dimension \(d_M\) such that \(d_M - d_{M_0} = k > 0\). By Wilks’ theorem, \[ 2\Big\{\ell_M(\hat\theta_M)-\ell_{M_0}(\hat\theta_{M_0})\Big\}\overset{d}{\to} \chi^2_k. \tag{41}\] Therefore, \[ \Delta\mathrm{AIC}_n(M,M_0)\ =\ -\,\chi^2_k + 2k + o_p(1)\,, \] which implies that \[ \liminf_{n\to\infty}\Pr\!\big\{\mathrm{AIC}(M)<\mathrm{AIC}(M_0)\big\} = \Pr(\chi^2_k>2k)\ > 0 \,, \tag{42}\] so AIC is not consistent: it selects some overspecified competitors with nonvanishing probability.
For BIC, \[ \Delta\mathrm{BIC}_n(M,M_0)\ =\ -\,\chi^2_k + k\log n + o_p(1)\ \xrightarrow{p}\ +\infty, \tag{43}\] hence BIC is consistent over a fixed finite candidate set: \[ \Pr\!\left\{\arg\min_{M\in\mathcal M}\mathrm{BIC}(M)=M_0\right\}\ \to\ 1. \tag{44}\]
Summary
AIC targets population test risk and, asymptotically, does not select models that fail to contain \(g\) (Equation 40). Among models that do contain \(g\), its fixed \(2d_M\) penalty implies possible over-selection in finite samples (Equation 42).
BIC arises from a Laplace approximation to the evidence. Its \(d_M\log n\) penalty yields consistency for true-model identification in fixed finite \(\mathcal M\) (Equation 43–Equation 44).