ST308 Bayesian Inference

Class 6


Philipp Sterzinger

p.sterzinger@lse.ac.uk

08 March 2025

Recap

LDA

Motivation

Have data \(x_1,x_2,\ldots, x_n\), \(x_i \in \Re^p\)

class labels \(y_1,\ldots,y_n \in \{1,\ldots, k\}\) such that \(y_i = c_k\) if \(x_i\) belongs to class \(c_k\).

Given new data \(x\), we want to find the class that best fits \(x\)

Model

We consider \(k = 2\) classes

  • \(\Pr(y_i = 1) = p \in (0,1)\)
  • If \(y_i = 0\), then \(x_i \sim \mathrm{N}(\mu_0, \Sigma)\), if \(y_i = 1\), \(x_i \sim \mathrm{N}(\mu_1, \Sigma)\)

Use Bayes’ Theorem to compute

\[ \Pr(y = c_k \mid x) = \frac{f(x \mid y = c_k) \Pr(y = c_k)}{\sum_{i = 0}^1 f(x \mid y = i) \Pr(y = i)} \propto f(x \mid y = c_k) \Pr(y = c_k) \,. \]

LDA

Decision rule (Hastie et al., Ch. 2.4)

Assign point \(x\) to class \(1\) if \(\Pr(y = 1 \mid x) > \Pr(y = 0 \mid x)\) or equivalently if

\[ \log \left( \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} \right) > 0 \,. \]

Hence, let us consider \[ \begin{aligned} \log \left( \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} \right) &= \log \left( \frac{f(x \mid y = 1)}{f(x \mid y = 0)} \frac{p}{1-p} \right) \\ &= \log \left( \frac{f(x \mid y = 1)}{f(x \mid y = 0)} \right) + \log \left(\frac{p}{1-p} \right) \\ \end{aligned} \]

and

\[ \begin{aligned} \log \left( \frac{f(x \mid y = 1)}{f(x \mid y = 0)} \right) &= \frac{1}{2} \left\{ (x - \mu_0)^\top \Sigma^{-1}(x - \mu_0) - (x - \mu_1)^\top \Sigma^{-1}(x - \mu_1) \right\} \\ &= \frac{1}{2} \left\{ x^\top \Sigma^{-1}(\mu_1 - \mu_0) - (\mu_1 + \mu_0)^\top \Sigma^{-1}(\mu_1 - \mu_2) \right\} \\ \end{aligned} \]

LDA

Decision rule

Hence

\[ \log \left( \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} \right) > 0 \iff x^\top w + b > 0 \]

for

\[ w = \Sigma^{-1} (\mu_1 - \mu_0), \quad b = \log \left( \frac{p}{1-p}\right) + \frac{1}{2} (\mu_1 + \mu_0) \Sigma^{-1} (\mu_1 - \mu_0) \,, \]

which is linear in \(x\).

LDA

Decision boundary

Code
library(ggplot2)
library(dplyr)
library(ISLR)

# Load and process the data
mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpg01 <- mpg01

train <-  (Auto$year %% 2 == 0) # if the year is even
test <-  !train
Auto.train <-  Auto[train,]
Auto.test <-  Auto[test,]
mpg01.test <-  mpg01[test]

# Create feature matrix
X <- model.matrix(mpg01 ~ -1 + horsepower + weight, data = Auto.train)
y <- Auto.train$mpg01
n <- length(y) 
n_1 <- sum(y == 1) 

# Prior probability of class 1
pr_1 <- n_1 / n 

# Compute class means
X_0 <- X[y == 0,]
X_1 <- X[y == 1,]

mu_0 <- colMeans(X_0) 
mu_1 <- colMeans(X_1)

# Compute covariance matrix
X_0_c <- sweep(X_0, 2, mu_0, "-")
X_1_c <- sweep(X_1, 2, mu_1, "-") 

Sigma <- (t(X_0_c) %*% X_0_c + t(X_1_c) %*% X_1_c) / n
Sigma_inv <- solve(Sigma) 

# Compute LDA coefficients
w <- Sigma_inv %*% (mu_1 - mu_0) 
thresh <- log((1 - pr_1) / pr_1) + 0.5 * t(mu_0 + mu_1) %*% w  

# Extract test data for plotting
X_test <- model.matrix(mpg01 ~ -1 + horsepower + weight, data = Auto.test)
Auto.test$mpg01 <- as.factor(Auto.test$mpg01)

# Decision boundary equation: w1 * horsepower + w2 * weight = thresh
# Solve for weight in terms of horsepower: weight = (-w1/w2) * horsepower + (thresh / w2)
slope <- -w[1] / w[2]
intercept <- thresh / w[2]

# Compute the decision boundary 
boundary_df <- data.frame(
  horsepower = seq(min(Auto.test$horsepower), max(Auto.test$horsepower), length.out = 100)
)
boundary_df$weight <- as.numeric(slope) * boundary_df$horsepower + as.numeric(intercept)

# Compute predictions 

Auto.test$preds <- as.numeric((X_test %*% w ) > thresh[1,1])
Auto.test$correct <- Auto.test$preds == Auto.test$mpg01

# Create the plot
ggplot(Auto.test, aes(x = horsepower, y = weight, shape = correct, color = mpg01)) +
  geom_point(size = 3) +
  geom_line(data = boundary_df, aes(x = horsepower, y = weight, linetype = "Decision Boundary"), 
            color = "grey", linewidth = 1.5, inherit.aes = FALSE) + 
  scale_linetype_manual(name = "LDA", values = c("Decision Boundary" = "dashed")) +
  scale_color_manual(name = "True Class", values = c("0" = "tomato", "1" = "steelblue")) +
  scale_shape_manual(name = "Classification", values = c("FALSE" = 4, "TRUE" = 16), labels = c("Incorrect", "Correct")) +  
  labs(title = "LDA Decision Boundary",
       x = "Horsepower",
       y = "Weight") +
  theme_minimal() +
  theme(legend.position = "right")  +
      guides(linetype = guide_legend(order = 1),
             color = guide_legend(order = 2),
             shape = guide_legend(order = 3))

LDA

Parameter estimation (Matrix cookbook)

\[ \ell(\theta \mid y, X) = \sum_{i = 1}^n y_i \left( \log(f(x_i \mid y_i)) + \log(p) \right) + \sum_{i = 1}^n (1 - y_i) \left( \log(f(x_i \mid y_i)) + \log(1 - p) \right) \]

\[ \frac{\partial }{\partial p} \ell(\theta \mid y, X ) = \frac{n_1}{p} - \frac{n_0}{1 - p} \implies \hat{p} = \frac{n_1}{n} \]

\[ \frac{\partial }{\partial \mu_1} \ell(\theta \mid y, X )= \Sigma^{-1} \left(\sum_{i: y_i = 1} (x_i - \mu_1) \right)\implies \hat{\mu}_1 = \frac{1}{n_1} \sum_{i = 1}^n y_i x_i \]

\[\begin{multline} \frac{\partial}{\partial \Sigma} \ell(\theta \mid y, X ) = -\frac{n}{2} \Sigma^{-1} \\ \left\{ I_p - \\ \frac{1}{n}\left[ \sum_{i=1}^{n} y_i (x_i - \mu_1)(x_i - \mu_1)^\top + (1 - y_i)(x_i - \mu_0)(x_i - \mu_0)^\top \right] \Sigma^{-1} \right\} \\ \implies \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i (x_i - \mu_1)(x_i - \mu_1)^\top + (1 - y_i)(x_i - \mu_0)(x_i - \mu_0)^\top \right] \end{multline}\]

Logistic regression

\[ y_i \mid_{x_i} \sim \textrm{Ber}(\mu_i), \quad \log\left(\frac{\mu_i}{1 - \mu_i} \right) = x_i^\top \beta\,, \]

i.e. class is a Bernoulli random variable whose mean is a nonlinear transformation of the linear predictor \(x_i^\top \beta\)

Maximum likelihood

\(\displaystyle l(\beta; y, X) = \sum_{i = 1}^n \left\{y_i \left(x_i^\top \beta \right) - \log\left(1 + \exp\left\{x_i^\top \beta \right\} \right) \right\}\)

\(\hat{\beta} = \arg \underset{\beta \in \Re^p}{\max} \, l(\beta; y, X)\)


In R this is done with the glm function

Code
logistic_fit <- glm(mpg01 ~ horsepower + weight, 
        family = binomial(link = "logit"), 
        data = Auto.train) 

Logistic regression

Classification

Recall that \(\log \left(\mu /(1 - \mu) \right) = x^\top \beta\) so that \(\displaystyle \mu = (1 + \exp \{-x^\top \beta \})^{-1}\)

Given ML estimate \(\hat{\beta}\) and data point \(x\), we can estimate \(\Pr(Y = 1\mid x)\) as \[ \hat{\mu} = (1 + \exp \{ -x^\top \hat{\beta} \})^{-1}\,.\]

Classify data point \(x\) according to \[\mathbb{1} \left\{\hat{\mu} > 1/2 \right\} \,,\] where the function \(\mathbb{1}\{\mathcal{A} \}\) is one when \(\mathcal{A}\) is true and zero otherwise.

Logistic regression

Probability surface

Code
#load required libraries 
library(plotly)

# Create a grid of horsepower and weight values
horsepower_ticks <- seq(min(Auto.test$horsepower), max(Auto.test$horsepower), length.out = 100)
weight_ticks <- seq(min(Auto.test$weight), max(Auto.test$weight), length.out = 100)
background_grid <- expand.grid(horsepower = horsepower_ticks, weight = weight_ticks)

# Predict probabilities for the grid
background_grid$predicted_prob <- predict(logistic_fit, newdata = background_grid, type = "response")

# Predict classifications for the actual data
Auto.test$predicted_class <- as.numeric(predict(logistic_fit, newdata = Auto.test, type = "response") > 0.5)
Auto.test$correct <- Auto.test$predicted_class == Auto.test$mpg01

Auto.test$height <- ifelse(Auto.test$mpg01 == 1, 1, 0)

plot_ly() %>%
    layout(autosize = FALSE, width = 1000, height = 500) %>%
  add_surface(
    x = ~horsepower_ticks, 
    y = ~weight_ticks, 
    z = t(matrix(background_grid$predicted_prob, nrow = length(horsepower_ticks), ncol = length(weight_ticks))), 
    colorscale = list(c(0, 1), c ("#F8766D", "#00BFC4")),
    showscale = TRUE,
    colorbar = list(title = "Probability \n high fuel", tickvals = c(.25, 0.5, .75), ticktext = c("0.25","0.5","0.75")),
    hoverinfo = "x+y+z"
  ) %>%
  add_markers(
    data = Auto.test[Auto.test$mpg01 == 0 & Auto.test$correct, ], 
    x = ~horsepower, 
    y = ~weight, 
    z = ~height, 
    marker = list(
      color = "#F8766D", 
      symbol = "circle",
      size = 5,
      line = list(width = 0.5, color = "#F8766D")
    ),
    name = "Low fuel – Correct", 
    hoverinfo = "x+y+z",
    visible = FALSE,
    key = "low_fuel_correct"
  ) %>%
  add_markers(
    data = Auto.test[Auto.test$mpg01 == 0 & !Auto.test$correct, ], 
    x = ~horsepower, 
    y = ~weight, 
    z = ~height, 
    marker = list(
      color = "#F8766D", 
      symbol = "cross",
      size = 5,
      line = list(width = 0.5, color = "#F8766D")
    ),
    name = "Low fuel – Incorrect", 
    hoverinfo = "x+y+z",
    visible = FALSE,
    key = "low_fuel_incorrect"
  ) %>%
  add_markers(
    data = Auto.test[Auto.test$mpg01 == 1 & Auto.test$correct, ], 
    x = ~horsepower, 
    y = ~weight, 
    z = ~height, 
    marker = list(
      color = "#00BFC4", 
      symbol = "circle",
      size = 5,
      line = list(width = 0.5, color = "#00BFC4")
    ),
    name = "High fuel – Correct", 
    hoverinfo = "x+y+z",
    visible = FALSE,
    key = "high_fuel_correct"
  ) %>%
  add_markers(
    data = Auto.test[Auto.test$mpg01 == 1 & !Auto.test$correct, ], 
    x = ~horsepower, 
    y = ~weight, 
    z = ~height, 
    marker = list(
      color = "#00BFC4", 
      symbol = "cross",
      size = 5,
           line = list(width = 0.5, color = "#00BFC4")
    ),
    name = "High fuel – Incorrect", 
    hoverinfo = "x+y+z",
    visible = FALSE,
    key = "high_fuel_incorrect"
  ) %>%
  layout(
    scene = list(
      xaxis = list(title = "horsepower"),
      yaxis = list(title = "weight"),
      zaxis = list(title = "Probability")
    ),
    title = " ",
    updatemenus = list(
      list(
        type = "buttons",
        buttons = list(
          list(
            method = "restyle",
            args = list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE)),
            label = "Surface Only"
          ),
          list(
            method = "restyle",
            args = list("visible", list(TRUE, TRUE, TRUE, TRUE, TRUE)),
            label = "Show All"
          ),
          list(
            method = "restyle",
            args = list("visible", list(TRUE, TRUE, FALSE, FALSE, FALSE)),
            label = "Low fuel – Correct"
          ),
          list(
            method = "restyle",
            args = list("visible", list(TRUE, FALSE, TRUE, FALSE, FALSE)),
            label = "Low fuel – Incorrect"
          ),
          list(
            method = "restyle",
            args = list("visible", list(TRUE, FALSE, FALSE, TRUE, FALSE)),
            label = "High fuel – Correct"
          ),
          list(
            method = "restyle",
            args = list("visible", list(TRUE, FALSE, FALSE, FALSE, TRUE)),
            label = "High fuel – Incorrect"
          )
        )
      )
    )
  )

Logistic regression

Classification boundary

Code
# Create a grid of horsepower and weight values
horsepower_ticks <- seq(min(Auto.test$horsepower), max(Auto.test$horsepower), length.out = 100)
weight_ticks <- seq(min(Auto.test$weight), max(Auto.test$weight), length.out = 100)
background_grid <- expand.grid(horsepower = horsepower_ticks, weight = weight_ticks)

# Predict probabilities for the grid
background_grid$predicted_prob <- predict(logistic_fit, newdata = background_grid, type = "response")

# Predict classifications for the actual data
Auto.test$predicted_class <- predict(logistic_fit, newdata = Auto.test, type = "response") > 0.5
Auto.test$correct <- as.numeric(Auto.test$predicted_class) == Auto.test$mpg01

# Calculate the decision boundary line
slope <- coef(logistic_fit)[2] / (-coef(logistic_fit)[3])
intercept <- coef(logistic_fit)[1] / (-coef(logistic_fit)[3])

# Plot the decision boundary and data points with color gradient background
ggplot() +
  geom_tile(data = background_grid, aes(x = horsepower, y = weight, fill = predicted_prob), alpha = 0.5) +
  scale_fill_gradient(low = "#F8766D", high = "#00BFC4", name = "Probability high fuel") +
  geom_point(data = Auto.test, aes(x = horsepower, y = weight, color = as.factor(mpg01), shape = correct), size = 2) +
  scale_color_manual(values = c("1" = "#00BFC4", "0" = "#F8766D"), labels = c("Low fuel", "High fuel")) +
  scale_shape_manual(values = c("TRUE" = 16, "FALSE" = 4), labels = c("Correct", "Incorrect")) +
  geom_abline(intercept = intercept, slope = slope, linetype = "dashed", color = "black", size = 1, alpha = 0.6) +
  theme_minimal() +
  labs(title = " ", x = "horsepower", y = "weight", color = "Fuel economy", shape = "Classification") +
  theme(legend.position = "right")

Exercises

Predicting fuel economy

In this problem, we will develop a model to predict whether a given car gets high or low gas mileage based on the Auto dataset from the ISLR package.

  1. Create a binary variable, mpg01, that contains a \(1\) if mpg contains a value above its median, and a \(0\) if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Code
library(ISLR)
mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpg01 <- mpg01

head(Auto, n = 3) 
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
                       name mpg01
1 chevrolet chevelle malibu     0
2         buick skylark 320     0
3        plymouth satellite     0

Predicting fuel economy

  1. Split the data into a training set and a test set.
Code
train <-  (Auto$year %% 2 == 0) # if the year is even
test <-  !train
Auto.train <-  Auto[train,]
Auto.test <-  Auto[test,]
mpg01.test <-  mpg01[test]

Predicting fuel economy

  1. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
Code
# Association of vars 
cor(Auto.train[,-c(1,8,9)])
              cylinders displacement horsepower     weight acceleration
cylinders     1.0000000    0.9512812  0.8499046  0.8976253   -0.5459842
displacement  0.9512812    1.0000000  0.9054857  0.9146319   -0.5960463
horsepower    0.8499046    0.9054857  1.0000000  0.8642406   -0.6866829
weight        0.8976253    0.9146319  0.8642406  1.0000000   -0.4318448
acceleration -0.5459842   -0.5960463 -0.6866829 -0.4318448    1.0000000
year         -0.4491209   -0.4509433 -0.4960804 -0.3831162    0.3459440
mpg01        -0.8082180   -0.7825224 -0.6984534 -0.7971422    0.3759669
                   year      mpg01
cylinders    -0.4491209 -0.8082180
displacement -0.4509433 -0.7825224
horsepower   -0.4960804 -0.6984534
weight       -0.3831162 -0.7971422
acceleration  0.3459440  0.3759669
year          1.0000000  0.4666924
mpg01         0.4666924  1.0000000
Code
# LDA
library(MASS)
lda.fit <-  lda(mpg01 ~ cylinders + weight + displacement + horsepower,
              data = Auto, subset = train)
lda.pred <-  predict(lda.fit, Auto.test)
lda_pred_error <- mean(lda.pred$class != mpg01.test)

cat("LDA classification error:", lda_pred_error) 
LDA classification error: 0.1263736

Predicting fuel economy

  1. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
Code
glm.fit <-  glm(mpg01 ~ cylinders + weight + displacement + horsepower,
              data = Auto,
              family = binomial,
              subset = train)
glm.probs <-  predict(glm.fit, Auto.test, type = "response")
glm.pred <-  rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] <- 1
lr_pred_error <- mean(glm.pred != mpg01.test)

cat("Logistic regression classification error:", lr_pred_error) 
Logistic regression classification error: 0.1208791

Predicting fuel economy

Logistic ridge

Last week, we saw how to use glmnet to get Ridge and LASSO estimates in a linear model. We can use these estimators for classification too. In this case our estimates of the class probabilities become

\[ \Pr \left(Y = 1 \mid x \right) = x^\top \hat{\beta} \,, \] where \(\hat{\beta}\) is either the ridge or LASSO estimate. This is called a linear probability model. We can conduct classification just as with logistic regression. However, this linear relationship may not always be appropriate (why?).

Instead, we can use glmnet to add a ridge or LASSO peanlty to our logistic regression model, by specifying family = "binomial" in cv.glmnet.

Code
library(glmnet)
set.seed(1)
train.mat <-  model.matrix(mpg01 ~ cylinders + weight + displacement + horsepower , data = Auto.train)
test.mat <-  model.matrix(mpg01 ~ cylinders + weight + displacement + horsepower , data = Auto.test)
grid <-  seq(0, 1, length = 100)
mod.ridge <-  cv.glmnet(train.mat, Auto.train[, "mpg01"], 
                        alpha = 0,
                        family = "binomial", 
                        type.measure = "class")
lambda.best <-  mod.ridge$lambda.min
cat("Best lambda: ", lambda.best) 
Best lambda:  0.06410877
Code
ridge.probs <-  predict(mod.ridge, newx = test.mat, s = lambda.best)
ridge.pred <-  rep(0, length(ridge.probs))
ridge.pred[ridge.probs > 0.5] <- 1

lrr_pred_error <- mean(ridge.pred != mpg01.test)

cat("Logistic ridge classification error:", lrr_pred_error) 
Logistic ridge classification error: 0.1263736

Predicting fuel economy

Logistic LASSO

Code
mod.lasso <-  cv.glmnet(train.mat, Auto.train[, "mpg01"], 
                        alpha = 1, 
                        family = "binomial",
                        type.measure = "class")
lambda.best <-  mod.lasso$lambda.min
cat("Best lambda: ", lambda.best) 
Best lambda:  0.0005978806
Code
lasso.probs <-  predict(mod.lasso, newx = test.mat, s = lambda.best)
lasso.pred <-  rep(0, length(lasso.probs))
lasso.pred[lasso.probs > 0.5] <- 1
lrl_pred_error <- mean(lasso.pred != mpg01.test)

cat("Logistic LASSO classification error:", lrl_pred_error) 
Logistic LASSO classification error: 0.1263736

Activity

Repeat the above for the Stock market data in the ISLR package

names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
[7] "Volume"    "Today"     "Direction"
dim(Smarket)
[1] 1250    9
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
           
           
           
           

Activity

Split into train and test data

train =(Smarket$Year <2005)
Smarket.2005= Smarket [! train ,]
dim(Smarket.2005)
[1] 252   9
Direction.2005= Smarket$Direction [! train]

Fit LDA and logistic regression models to the train dataset and evaluate their predictons in the test dataset. Fit LDA and logistic regression models to the train dataset and evaluate their predictons in the test dataset. Try to find a model based on some of the covariates such that it has as good accuracy rate as possible.