p.sterzinger@lse.ac.uk
28 February 2025
\[ \hat{\beta}^{\textrm{OLS}} = \arg \underset{\beta \in \Re^p}{\min} \sum_{i = 1}^n \left(y_i - x_i^\top \beta \right)^2 = (X^\top X)^{-1}X^\top y \,. \]
If \(y_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2)\) then \(\hat{\beta}^\textrm{OLS}\) is the maximum likelihood estimator.
May be prone to overfitting, encounter issues if \(X\) is ill-conditioned
\[ \hat{\beta}^{\textrm{R}} = \arg \underset{\beta \in \Re^p}{\min} \sum_{i = 1}^n \left(y_i - x_i^\top \beta \right)^2 + \lambda || \beta ||_2^2 = (X^\top X + \lambda I_p)^{-1}X^\top y \,. \]
If \(y_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2)\) and \(\beta \sim \mathcal{N}(0_p, \{\lambda\}^{-1} I_p)\), \(\sigma^2\) knwon, then
\[ \beta \mid_{y, X} \sim \mathcal{N}(\mu_n, \sigma \Omega_n)\,, \]
with \(\mu_n = (X^\top X + \lambda I_p)^{-1}X^\top y\) and \(\Omega_n = (X^\top X + \lambda I_p)^{-1}\). Hence \(\hat{\beta}^{\textrm{R}}\) is the Bayes estimator under quadratic loss.
\[ \hat{\beta}^{\textrm{L}} = \arg \underset{\beta \in \Re^p}{\min} \sum_{i = 1}^n \left(y_i - x_i^\top \beta \right)^2 + \lambda || \beta ||_1 \]
Can be seen as the posterior mode using a Laplace(\(0\), \(1 / \gamma\)) prior for \(\lambda = \gamma / \sigma\).
Given estimate \(\hat{\beta}\), which was obtained from \(\{X_{\textrm{train}},y_{\textrm{train}}\}\), we can test the out of sample performance on new data \(\{X_{\textrm{test}},y_{\textrm{test}}\}\).
For example, we can evaluate the mean-squared prediction error on the test data
\[ \widehat{\textrm{MSPE}} = \frac{1}{n_{\textrm{test}}} \sum_{i = 1}^{n_{\textrm{test}}} \left(y_{\textrm{test}, i} - x_{\textrm{test}, i}^\top\hat{\beta} \right)^2 \,. \]
College dataIn this exercise, we will predict the number of applications received using the other variables in the College data set.
For this, we need to install the ISLR package, which contains the College dataset
Let’s check out the data first
Private Apps Accept Enroll Top10perc Top25perc
Abilene Christian University Yes 1660 1232 721 23 52
F.Undergrad P.Undergrad Outstate Room.Board Books
Abilene Christian University 2885 537 7440 3300 450
Personal PhD Terminal S.F.Ratio perc.alumni Expend
Abilene Christian University 2200 70 78 18.1 12 7041
Grad.Rate
Abilene Christian University 60
[1] 0
College dataCollege dataApps, the number of applications, as the response variable.
Call:
lm(formula = Apps ~ ., data = College.train)
Residuals:
Min 1Q Median 3Q Max
-3661.2 -451.5 -18.2 369.0 7117.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.06147 638.26355 0.060 0.952480
PrivateYes -551.14624 213.67377 -2.579 0.010283 *
Accept 1.74987 0.05583 31.345 < 2e-16 ***
Enroll -1.36056 0.27296 -4.984 9.57e-07 ***
Top10perc 65.56530 8.30126 7.898 3.26e-14 ***
Top25perc -22.53363 6.56872 -3.430 0.000671 ***
F.Undergrad 0.10189 0.05128 1.987 0.047658 *
P.Undergrad 0.01788 0.05854 0.305 0.760175
Outstate -0.08708 0.02963 -2.939 0.003499 **
Room.Board 0.15386 0.07535 2.042 0.041863 *
Books -0.12238 0.38163 -0.321 0.748643
Personal 0.16196 0.10761 1.505 0.133167
PhD -14.29804 7.15054 -2.000 0.046277 *
Terminal -1.03096 8.12625 -0.127 0.899114
S.F.Ratio 4.48254 19.90001 0.225 0.821907
perc.alumni -0.45501 6.51788 -0.070 0.944383
Expend 0.05619 0.01848 3.041 0.002529 **
Grad.Rate 9.07464 4.61797 1.965 0.050154 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1127 on 370 degrees of freedom
Multiple R-squared: 0.9343, Adjusted R-squared: 0.9313
F-statistic: 309.7 on 17 and 370 DF, p-value: < 2.2e-16
LS prediction error: 1026096
# Alternative computation of prediction error
beta_hat <- as.vector(lm.fit$coefficients)
X_mat <- College.test[, -match("Apps", colnames(College))]
X_mat$Private <- as.numeric(X_mat$Private == "Yes")
X_mat <- do.call(cbind, lapply(X_mat, as.numeric))
fitted_vals <- X_mat %*% beta_hat[2:length(beta_hat)] + beta_hat[1]
pred_error_ls <- mean((College.test[, "Apps"] - fitted_vals)^2)College dataFor this we need the glmnet library.
To see what functionality this library offers, have a look at its vignette on CRAN.
College dataShrinkage of Ridge
College dataFind best \(\lambda\) using CV
Best lambda: 0.01
College dataInspect regression coefficients
19 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -450.49661940
(Intercept) .
PrivateYes -494.51887742
Accept 1.58270364
Enroll -0.86248869
Top10perc 49.80780909
Top25perc -14.15023862
F.Undergrad 0.05535230
P.Undergrad 0.04454701
Outstate -0.08573363
Room.Board 0.15164291
Books 0.02094073
Personal 0.03091771
PhD -8.67627741
Terminal -3.32867772
S.F.Ratio 15.40537233
perc.alumni 0.12240070
Expend 0.07792438
Grad.Rate 8.67320397
Compute prediction error on test set
College dataPick \(\lambda\) using College.train and report error on College.test.
College dataPrediction error
Best lambda: 0.01
Lasso prediction error: 1026036
pred_error_ls pred_error_r pred_error_l
[1,] 1026096 1026069 1026036
College dataRegression coefficients
19 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -471.39372052
(Intercept) .
PrivateYes -491.04485137
Accept 1.57033288
Enroll -0.75961467
Top10perc 48.14698892
Top25perc -12.84690695
F.Undergrad 0.04149116
P.Undergrad 0.04438973
Outstate -0.08328388
Room.Board 0.14943472
Books 0.01532293
Personal 0.02909954
PhD -8.39597537
Terminal -3.26800340
S.F.Ratio 14.59298267
perc.alumni -0.04404771
Expend 0.07712632
Grad.Rate 8.28950241
# nonzero coeffs: 18 out of 19
We will now predict per capita crime rate in the Boston data set.
set.seed(1)
library(MASS)
train.size <- nrow(Boston) / 2
train <- sample(1:nrow(Boston), train.size)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
lm.fit <- lm(medv ~ . , data = Boston.train)
lm.pred <- predict(lm.fit, Boston.test)
mean((Boston.test[, "medv"] - lm.pred)^2)[1] 26.86123
Repeat for Ridge and LASSO
[1] 0.1417474
[1] 26.76297
15 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 31.980150997
(Intercept) .
crim -0.085860072
zn 0.026771398
indus 0.010542001
chas 2.455954335
nox -12.292950293
rm 4.184299762
age -0.005580903
dis -1.191523651
rad 0.306602773
tax -0.015479578
ptratio -0.945685226
black 0.006733375
lstat -0.472663974
15 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 33.944178029
(Intercept) .
crim -0.089026132
zn 0.028425033
indus 0.022664796
chas 2.315987496
nox -13.015010413
rm 4.116581960
age -0.002997100
dis -1.233974761
rad 0.360927144
tax -0.018251130
ptratio -0.971434256
black 0.006683536
lstat -0.486676975
Philipp Sterzinger - ST308 Week 5