ST308 Bayesian Inference

Class 5


Philipp Sterzinger

p.sterzinger@lse.ac.uk

28 February 2025

OLS, Ridge, LASSO

  • \(y_1,\ldots, y_n \in \Re\), which are realisations of random variables \(Y_1,\ldots,Y_n\)
  • Nonrandom sequence of covariates \(x_1,\ldots, x_n \in \Re^p\)
  • Model a linear relationship between \(y\) and \(x\): \[y_i = x_i^\top \beta + \varepsilon_i\,,\] with residual \(\epsilon_i = y_i - x_i^\top \beta\).

Least-squares

\[ \hat{\beta}^{\textrm{OLS}} = \arg \underset{\beta \in \Re^p}{\min} \sum_{i = 1}^n \left(y_i - x_i^\top \beta \right)^2 = (X^\top X)^{-1}X^\top y \,. \]

If \(y_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2)\) then \(\hat{\beta}^\textrm{OLS}\) is the maximum likelihood estimator.

May be prone to overfitting, encounter issues if \(X\) is ill-conditioned

Ridge

\[ \hat{\beta}^{\textrm{R}} = \arg \underset{\beta \in \Re^p}{\min} \sum_{i = 1}^n \left(y_i - x_i^\top \beta \right)^2 + \lambda || \beta ||_2^2 = (X^\top X + \lambda I_p)^{-1}X^\top y \,. \]

If \(y_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2)\) and \(\beta \sim \mathcal{N}(0_p, \{\lambda\}^{-1} I_p)\), \(\sigma^2\) knwon, then

\[ \beta \mid_{y, X} \sim \mathcal{N}(\mu_n, \sigma \Omega_n)\,, \]

with \(\mu_n = (X^\top X + \lambda I_p)^{-1}X^\top y\) and \(\Omega_n = (X^\top X + \lambda I_p)^{-1}\). Hence \(\hat{\beta}^{\textrm{R}}\) is the Bayes estimator under quadratic loss.

  • Shrinks coefficients of \(\beta\) towards zero
  • Biased, but smaller variance than OLS, more stable, less prone to overfitting

LASSO

\[ \hat{\beta}^{\textrm{L}} = \arg \underset{\beta \in \Re^p}{\min} \sum_{i = 1}^n \left(y_i - x_i^\top \beta \right)^2 + \lambda || \beta ||_1 \]

Can be seen as the posterior mode using a Laplace(\(0\), \(1 / \gamma\)) prior for \(\lambda = \gamma / \sigma\).

  • Shrinks coefficients of \(\beta\) towards zero, and sets some exactly to zero
  • Biased

Mean-squared prediction error

Given estimate \(\hat{\beta}\), which was obtained from \(\{X_{\textrm{train}},y_{\textrm{train}}\}\), we can test the out of sample performance on new data \(\{X_{\textrm{test}},y_{\textrm{test}}\}\).

For example, we can evaluate the mean-squared prediction error on the test data

\[ \widehat{\textrm{MSPE}} = \frac{1}{n_{\textrm{test}}} \sum_{i = 1}^{n_{\textrm{test}}} \left(y_{\textrm{test}, i} - x_{\textrm{test}, i}^\top\hat{\beta} \right)^2 \,. \]

Computer class

Demo: College data

In this exercise, we will predict the number of applications received using the other variables in the College data set.

For this, we need to install the ISLR package, which contains the College dataset

install.packages("ISLR")

Let’s check out the data first

Code
library(ISLR) 

head(College, n = 1)
                             Private Apps Accept Enroll Top10perc Top25perc
Abilene Christian University     Yes 1660   1232    721        23        52
                             F.Undergrad P.Undergrad Outstate Room.Board Books
Abilene Christian University        2885         537     7440       3300   450
                             Personal PhD Terminal S.F.Ratio perc.alumni Expend
Abilene Christian University     2200  70       78      18.1          12   7041
                             Grad.Rate
Abilene Christian University        60
Code
sum(is.na(College))
[1] 0

Demo: College data

  1. Split the data set into a training set and a test set.
Code
train.size <- nrow(College) * .5

set.seed(11)
train <- sample(1 : nrow(College), train.size)
test <- - train

College.train <- College[train, ]
College.test <- College[test, ]

Demo: College data

  1. Fit a linear model using least squares on the training set, and report the test error obtained. Use Apps, the number of applications, as the response variable.
Code
lm.fit <- lm(Apps ~ . , data = College.train)
lm.pred <- predict(lm.fit, College.test)

summary(lm.fit) 

Call:
lm(formula = Apps ~ ., data = College.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3661.2  -451.5   -18.2   369.0  7117.3 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   38.06147  638.26355   0.060 0.952480    
PrivateYes  -551.14624  213.67377  -2.579 0.010283 *  
Accept         1.74987    0.05583  31.345  < 2e-16 ***
Enroll        -1.36056    0.27296  -4.984 9.57e-07 ***
Top10perc     65.56530    8.30126   7.898 3.26e-14 ***
Top25perc    -22.53363    6.56872  -3.430 0.000671 ***
F.Undergrad    0.10189    0.05128   1.987 0.047658 *  
P.Undergrad    0.01788    0.05854   0.305 0.760175    
Outstate      -0.08708    0.02963  -2.939 0.003499 ** 
Room.Board     0.15386    0.07535   2.042 0.041863 *  
Books         -0.12238    0.38163  -0.321 0.748643    
Personal       0.16196    0.10761   1.505 0.133167    
PhD          -14.29804    7.15054  -2.000 0.046277 *  
Terminal      -1.03096    8.12625  -0.127 0.899114    
S.F.Ratio      4.48254   19.90001   0.225 0.821907    
perc.alumni   -0.45501    6.51788  -0.070 0.944383    
Expend         0.05619    0.01848   3.041 0.002529 ** 
Grad.Rate      9.07464    4.61797   1.965 0.050154 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1127 on 370 degrees of freedom
Multiple R-squared:  0.9343,    Adjusted R-squared:  0.9313 
F-statistic: 309.7 on 17 and 370 DF,  p-value: < 2.2e-16
Code
pred_error_ls <- mean((College.test[, "Apps"] - lm.pred)^2)
cat("LS prediction error:", pred_error_ls)
LS prediction error: 1026096
Code
# Alternative computation of prediction error 
beta_hat <- as.vector(lm.fit$coefficients)
X_mat <- College.test[,  -match("Apps", colnames(College))]
X_mat$Private <- as.numeric(X_mat$Private == "Yes")
X_mat <- do.call(cbind, lapply(X_mat, as.numeric))

fitted_vals <- X_mat %*% beta_hat[2:length(beta_hat)] + beta_hat[1] 
pred_error_ls <- mean((College.test[, "Apps"] - fitted_vals)^2)

Demo: College data

  1. Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

For this we need the glmnet library.

install.packages("glmnet")

To see what functionality this library offers, have a look at its vignette on CRAN.

Demo: College data

Shrinkage of Ridge

Code
library(glmnet) 

train.mat <- model.matrix(Apps ~ . , data = College.train)
path.ridge <- glmnet(train.mat, College.train[, "Apps"], 
                        alpha = 0, 
                        thresh = 1e-12)
plot(path.ridge, ylim = c(-10, 10), xlim = c(0, 500))

Demo: College data

Find best \(\lambda\) using CV

Code
set.seed(1) 
train.mat <- model.matrix(Apps ~ . , data = College.train)
grid <-  10 ^ seq(4, -2, length = 100)

mod.ridge <- cv.glmnet(train.mat, College.train[, "Apps"], 
                        alpha = 0, lambda = grid, thresh = 1e-12)

plot(mod.ridge) 
Code
lambda.best <- mod.ridge$lambda.min
cat("Best lambda:", lambda.best) 
Best lambda: 0.01

Demo: College data

Inspect regression coefficients

Code
mod.ridge_c <-  glmnet(model.matrix(Apps ~ . , data = College), 
                     College[, "Apps"], alpha = 0,
                     lambda=lambda.best)
predict(mod.ridge_c, s = lambda.best, type = "coefficients")
19 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -450.49661940
(Intercept)    .         
PrivateYes  -494.51887742
Accept         1.58270364
Enroll        -0.86248869
Top10perc     49.80780909
Top25perc    -14.15023862
F.Undergrad    0.05535230
P.Undergrad    0.04454701
Outstate      -0.08573363
Room.Board     0.15164291
Books          0.02094073
Personal       0.03091771
PhD           -8.67627741
Terminal      -3.32867772
S.F.Ratio     15.40537233
perc.alumni    0.12240070
Expend         0.07792438
Grad.Rate      8.67320397

Compute prediction error on test set

Code
test.mat <- model.matrix(Apps ~ . , data = College.test)
ridge.pred <-  predict(mod.ridge, newx = test.mat, s = lambda.best)
pred_error_r <- mean((College.test[, "Apps"] - ridge.pred)^2)
cat("Ridge prediction error:", pred_error_r)
Ridge prediction error: 1026069

Demo: College data

  1. Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

Pick \(\lambda\) using College.train and report error on College.test.

Code
set.seed(1)
mod.lasso <-  cv.glmnet(train.mat, College.train[, "Apps"], 
                        alpha = 1, lambda = grid, thresh = 1e-12)
plot(mod.lasso) 

Demo: College data

Prediction error

Code
lambda.best <-  mod.lasso$lambda.min
cat("Best lambda:", lambda.best) 
Best lambda: 0.01
Code
lasso.pred <-  predict(mod.lasso, newx = test.mat, s = lambda.best)
pred_error_l <- mean((College.test[, "Apps"] - lasso.pred)^2)
cat("Lasso prediction error:", pred_error_l)
Lasso prediction error: 1026036
Code
cbind(pred_error_ls, pred_error_r, pred_error_l)
     pred_error_ls pred_error_r pred_error_l
[1,]       1026096      1026069      1026036

Demo: College data

Regression coefficients

Code
mod.lasso <-  glmnet(model.matrix(Apps ~ . , data = College), 
                     College[, "Apps"], alpha = 1)
coefs <- predict(mod.lasso, s = lambda.best, type = "coefficients")
coefs
19 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -471.39372052
(Intercept)    .         
PrivateYes  -491.04485137
Accept         1.57033288
Enroll        -0.75961467
Top10perc     48.14698892
Top25perc    -12.84690695
F.Undergrad    0.04149116
P.Undergrad    0.04438973
Outstate      -0.08328388
Room.Board     0.14943472
Books          0.01532293
Personal       0.02909954
PhD           -8.39597537
Terminal      -3.26800340
S.F.Ratio     14.59298267
perc.alumni   -0.04404771
Expend         0.07712632
Grad.Rate      8.28950241
Code
cat("# nonzero coeffs: ", sum(!coefs == 0), "out of", length(coefs))
# nonzero coeffs:  18 out of 19

Activity

We will now predict per capita crime rate in the Boston data set.

set.seed(1)
library(MASS)
train.size <-  nrow(Boston) / 2
train <-  sample(1:nrow(Boston), train.size)
test <-  -train
Boston.train <-  Boston[train, ]
Boston.test <-  Boston[test, ]
lm.fit <-  lm(medv ~ . , data = Boston.train)
lm.pred <-  predict(lm.fit, Boston.test)
mean((Boston.test[, "medv"] - lm.pred)^2)
[1] 26.86123

Repeat for Ridge and LASSO

Activity

Ridge

Code
set.seed(2)
train <-  model.matrix(medv ~ . , data = Boston.train)
test <-  model.matrix(medv ~ . , data = Boston.test)
grid <-  10 ^ seq(4, -2, length = 100)
mod.ridge <-  cv.glmnet(train, Boston.train[, "medv"], 
                        alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best <-  mod.ridge$lambda.min
lambda.best
[1] 0.1417474
Code
ridge.pred <-  predict(mod.ridge, newx = test, s = lambda.best)
mean((Boston.test[, "medv"] - ridge.pred)^2)
[1] 26.76297
Code
mod.ridge <-  glmnet(model.matrix(medv ~ . , data = Boston.train), 
                     Boston.train[, "medv"], alpha = 0,lambda=lambda.best)
predict(mod.ridge, s = lambda.best, type = "coefficients")
15 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept)  31.980150997
(Intercept)   .          
crim         -0.085860072
zn            0.026771398
indus         0.010542001
chas          2.455954335
nox         -12.292950293
rm            4.184299762
age          -0.005580903
dis          -1.191523651
rad           0.306602773
tax          -0.015479578
ptratio      -0.945685226
black         0.006733375
lstat        -0.472663974

Activity

LASSO

Code
set.seed(3)
mod.lasso <-  cv.glmnet(train, Boston.train[, "medv"], 
                        alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best <-  mod.lasso$lambda.min
lambda.best
[1] 0.01
Code
lasso.pred <-  predict(mod.lasso, newx = test, s = lambda.best)
mean((Boston.test[, "medv"] - lasso.pred)^2)
[1] 26.86046
Code
mod.rlasso <-  glmnet(model.matrix(medv ~ . , data = Boston.train), 
                     Boston.train[, "medv"], alpha = 1,lambda=lambda.best)
predict(mod.lasso, s = lambda.best, type = "coefficients")
15 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept)  33.944178029
(Intercept)   .          
crim         -0.089026132
zn            0.028425033
indus         0.022664796
chas          2.315987496
nox         -13.015010413
rm            4.116581960
age          -0.002997100
dis          -1.233974761
rad           0.360927144
tax          -0.018251130
ptratio      -0.971434256
black         0.006683536
lstat        -0.486676975