ST443 Machine Learning and data mining

Week 4


Philipp Sterzinger

p.sterzinger@lse.ac.uk

31 October 2025

R-lab: Cross validation, bootstrap

Motivation

Given a prediction function \(\hat{f}\), that we trained on a training dataset \(\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}\), we would like to:

  • Evaluate its performance, i.e. its risk \[ R(\hat{f}) = \mathrm{E}_{\mathcal{D}_n, Y, X}[\ell(Y, \hat{f}(X))] \] loss functions typically \(\ell(y,f) = (y-f)^2\) (regression task), \(\ell(y,f) = \mathbb{1}\{y \neq f\}\) (classification task)

  • Choose the hyperparameters that minimise test error:

    • \(\hat{f}(x)\) often depends on hyperparameters, i.e. parameters that we do not choose in the training phase of \(\hat{f}\).
    • Examples: the number of neighbours in kNN, the learning rate \(\alpha\) in gradient descent for logistic classification

Problem

We only have one dataset \(\mathcal{D}_n\), which we already used for training \(\hat{f}\).

We can look at the training error:

\[ \mathrm{Training \, error} = \frac{1}{n} \sum_{i = 1}^n \ell(Y_i, \hat{f}(X_i)) \,. \]

But, because we used the training data already for fitting \(\hat{f}\), our estimate will be too optimistic. If \(\hat{f}\) is powerful enough, it can achieve zero training error.

Validation set approach

We can split our data \(\mathcal{D}_n\) into two sets, a training set and a validation set.

We train \(\hat{f}\) on only set of the data, and evaluate on the other.

  1. Randomly split data into train and test set
  2. Train \(\hat{f}\) on training set
  3. Estimate test error on test set

Validation set approach

  • Random split introduces uncertainty
  • Less training data → worse training → worse predictions → overestimate test error

Validation set approach: Model selection

Say our prediction \(\hat{f}\) has a hyperparameter, call it \(\lambda\), that we want to tune.

We do the following:

  1. Split data into train and validation set
  2. For a range of hyperparmeters \(\lambda_i \in \{\lambda_1, \ldots, \lambda_k\}\), do:
    1. Train \(\hat{f}(\cdot;\lambda_i)\) on training data
    2. Evluate error on validation data
  3. Pick the model \(\hat{f}(\cdot; \hat{\lambda})\), where \(\hat{\lambda}\) achieves smallest error on validation set

Is the error on the validation set a good estimate for the test error of \(\hat{f}(\cdot; \hat{\lambda})\)?

No! We picked \(\lambda_i\) to achieve smallest error on the validation set. If we want to have a more honest estimate of the estimate of the test error, we need to split further: Train, validate, test.

  1. Use training data to train model for each hyperparameter
  2. Evaluate error on validation set
  3. For best choice of hyperparameter, use test set to estimate test error

Illustration: Auto data

Code
library(ISLR) 

str(Auto)
'data.frame':   392 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : num  3504 3693 3436 3433 3449 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Code
plot(mpg ~ horsepower, data = Auto)

Illustration: Auto data

We want to fit a model predicting mpg as a function of horsepower

Prediction function is a polynomial of degree \(d\) of horsepower. I.e.

  • \(d = 1\): \(\hat{f}(x; 1) = \beta_0 + \beta_1 x\)
  • \(d = 2\): \(\hat{f}(x; 2) = \beta_0 + \beta_1 x + \beta_2 x^2\)
  • \(d = 3\): \(\hat{f}(x; 2) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)

Use validation set approach to pick best hyperparameter and report estimate of test error under squared loss

Illustration: Auto data

Train, validate, test split

  • Train: half
  • Validate, test: one quarter
Code
set.seed(2022)
n <- nrow(Auto)
shuffle <- sample(n) 
Auto_train <- Auto[shuffle[1:(n/2)], ]
Auto_validate <- Auto[shuffle[(n/2+1):(3*n/4)], ]
Auto_test <- Auto[shuffle[(3*n/4+1):n], ]
dim(Auto_train)
[1] 196   9

Illustration: Auto data

Hyperparameter grid

\[ d_i \in \{1, 2, \ldots, 10\}\]

Code
MSE <- function(u, v){
mean((u - v)^2)
}

max_deg <- 10
MSE_train <- rep(0, max_deg) 
MSE_validate <- rep(0, max_deg) 
for (deg in 1:max_deg){
Auto_poly = lm(mpg ~ poly(horsepower, deg), data=Auto_train)

pred_train <- predict(Auto_poly, newdata=Auto_train)
pred_validate <- predict(Auto_poly, newdata=Auto_validate)

MSE_train[deg] <- MSE(pred_train, Auto_train$mpg)
MSE_validate[deg] <- MSE(pred_validate, Auto_validate$mpg)
}
plot(c(0,max_deg), range(c(MSE_train, MSE_validate)), type='n', 
   xlab='deg', ylab='MSE')
points(1:max_deg, MSE_train, col='blue', type='b')
points(1:max_deg, MSE_validate, col='orange', type='b')
legend("topright", lty = c(1,1), col = c("blue", "orange"), legend = c("Train", "Validate"))

Illustration: Auto data

Report estimated test error

Code
best_deg <- which.min(MSE_validate)

Auto_poly <-  lm(mpg ~ poly(horsepower, best_deg), data=Auto_train)
pred_test <- predict(Auto_poly, newdata=Auto_test)
best_mse <- MSE(pred_test, Auto_test$mpg)

cat("Test error estimate for d = ", best_deg, ": ", best_mse)
Test error estimate for d =  4 :  18.93462

Activity

Repeat the same validation set approach on another dataset, e.g. the house data from the MASS package

library(MASS) 
str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Leave-one-out cross validation

Aim to mitigate the variability of the validation set approach and be more efficient with our data

\[ \text{LOOCV}(\hat{f}) = \frac{1}{n} \sum_{i = 1}^n \ell(Y_i, \hat{f}^{(-i)}(X_i)) \]

  • Each split like train, validation split

Leave-one-out cross validation

  • Low bias: Beacuse train on almost foll data, LOOCV is a good (low bias) approximation of the error of \(\hat{f}\), i.e.  \[ \mathrm{E}_{\mathcal{D}_n}[\text{LOOCV}(\hat{f})] \approx \mathrm{E}_{X,Y,\mathcal{D}_n}[\ell(Y, \hat{f}(X))] \]

  • High variance:

    • Because each summands predicts only a single point, we get a noisy estimate of expected loss
    • Because training sets are very similar, high correlation between them

    → If datasets change, we expect to see large fluctuations in LOOCV

  • Computationally intensive: We need to fit a model for \(n\) different data folds

K-fold cross validation

Compromise between validation set approach and LOOCV:

  1. Partition data \(\mathcal{D}_n\) into \(K\) cells \(C_1, \ldots, C_K\)
  2. For fold \(k\), train \(\hat{f}^{(-k)}\) on \(\mathcal{D}_n \setminus C_k\), validate on \(C_k\)
  3. Aggregate: \[ \text{CV}_k(\hat{f}) = \frac{1}{n} \sum_{k = 1}^K \sum_{j \in C_i} \ell(Y_j, \hat{f}^{(-k)}(X_j)) \]

K-fold cross validation

  • Split uncertainty: K-fold CV still uses random data split, but variability is substantially lower than for validation set: Overlap between folds reduces impact of split

  • Bias: Typically less bias than validation set approach, more bias than LOOCV
  • Variance: Less variance than both LOOCV, and validation set approach

→ Empirically: K-fold CV with \(5-10\) folds: Sweet spot between variance, bias, and computational effort

Illustration: Auto data

Use library boot for CV

Code
library(boot) #for cv.glm
Auto_train <- rbind(Auto_train, Auto_validate)
dim(Auto_train)
[1] 294   9

Illustration: Auto data

Run LOOCV, 5-fold CV, and 10-fold CV to estimate

\[ \hat{f}(x;d) = \beta_0 + \sum_{i = 1}^d \beta_i x^i \]

Code
num <- 5
cv_error1 = rep(0, num)
cv_error5 = rep(0, num)
cv_error10 = rep(0, num)
degree = 1:num
for (deg in degree) {
glm_fit <-  glm(mpg ~ poly(horsepower, deg), data = Auto)
## CV errors using naive LOOCV
cv_error1[deg] = cv.glm(Auto, glm_fit)$delta[1]
## CV errors using K-fold-CV
cv_error5[deg] = cv.glm(Auto, glm_fit, K = 5)$delta[1]
## CV errors using K-fold-CV
cv_error10[deg] = cv.glm(Auto, glm_fit, K = 10)$delta[1]
}

plot(degree, cv_error1, type = "b", col = "black")
lines(degree, cv_error5, type = "b", col = "red")
lines(degree, cv_error10, type = "b", col = "blue")
legend("topright", legend = c("LOOCV", "5-fold CV", "10-fold CV"), col = c("black", "red", "blue"), lwd = 2)

Activity

Repeat the above with a different set of predictors.

Bootstrap

Suppose we are given data \((Y_1,X_1), \ldots, (Y_n, X_n)\) that is an i.i.d. sample from some unknown distribution \(P\).

We compute an estimate \(\hat \theta\) of some quantity of interest, e.g. the MSE of a ML-algorithm. We would like to know something about its distribution

Monte carlo method

If we had access to \(P\), we could:

  1. Draw \(B\) independent training samples \(\mathcal{D}_1,\ldots, \mathcal{D}_B\)
  2. Get estimates \(\hat{\theta}_1,\ldots, \hat{\theta}_B\)
  3. Use the empirical distribution of our estimates as an approximation of the distribution of the estimator itself

Bootstrap

Empirical distribution

The empirical distrubtion function of a sample of \(n\) i.i.d. random variables \(X\) is

\[ P_n(x) = \frac{1}{n} \sum_{i = 1}^n \mathbb{1}\{X_i \leq x\} \]

Code
set.seed(123) 
N <- 50

xs <- sort(rnorm(N)) 

emp_cdf <- 1 : N / N

x_grid <- seq(from = -3, to = 3, length.out = 1000)

plot(xs, emp_cdf, col = "blue", lwd = 2, type = "l", main = "Empirical CDF vs. CDF") 
lines(x_grid, pnorm(x_grid), col = "green", lwd = 2) 
legend("bottomright", col = c("blue", "green"), lwd = rep(2,2), legend = c("Emp. CDF", "CDF")) 

Bootstrap

Bootstrap idea

Rather than using the unknown distribution \(P\) to sample data, we use the empirical distribution \(P_n\) to sample data.

  • This corresponds to sampling from our training data \(\mathcal{D}\) uniformly at random with replacement

Illustration: Portfolio data

The Portfolio data set is in the ISLR package described in Section 5.2.

'data.frame':   100 obs. of  2 variables:
 $ X: num  -0.895 -1.562 -0.417 1.044 -0.316 ...
 $ Y: num  -0.235 -0.885 0.272 -0.734 0.842 ...

Illustration: Portfolio data

Wish to invest a fixed sum of money in two financial assets that yield returns of \(X\) and \(Y\), respectively, where X and Y are random quantities. We will invest a fraction \(\alpha\) of our money in \(X\), and will invest the remaining \(1 - \alpha\) in \(Y\).

The formula for \(\alpha\) that results in the minimum risk investment of form \(\alpha X + (1-\alpha)Y\) is

\[ \alpha = \frac{\text{Var}(Y) - \text{Cov}(X,Y)}{\text{Var}(X)+ \text{Var}(Y)- 2\text{Cov}(X,Y)} \]

Illustration: Portfolio data

To illustrate the use of the bootstrap on this data, we must first create:

  • a function, alpha_fn() that takes as input the \((X,Y)\) data and returns an estimate of alpha using the estimator (5.7) to the observations corresponding to the rows of the data index.
  • a vector indicating which observations should be used to estimate \(\alpha\). Note that \(\alpha = \arg \min \text{var}[\alpha X + (1 - \alpha Y)]\).
Code
alpha_fn <-  function(data, index) {
X <-  data$X[index]
Y <-  data$Y[index]
# solution of alpha s.t. min. var
alpha <-  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
return(alpha)
}

randomly select \(100\) observations from \(1\) to \(100\) with replacement, i.e. construct a new bootstrap data and compute the corresponding alpha

Code
set.seed(1)
alpha_fn(data <-  Portfolio, index = sample(1:100, size = 100, replace = T))
[1] 0.7368375

Illustration: Portfolio data

We can implement a bootstrap analysis by performing this command many times, recording all of the corresponding estimates for α, and computing the resulting standard. deviation.

Bootstrap using for loop:

Code
B <- 1000
boot_alpha = rep(0, B)
for(i in 1:B){
boot_index <-  sample(x = 1:100, size = 100, replace = T)
boot_alpha[i] <-  alpha_fn(data = Portfolio, index = boot_index)
}
mean(boot_alpha)
[1] 0.5739061
Code
sd(boot_alpha)
[1] 0.09285792

boot automates this approach. To produce 1000 bootstrap estimates for alpha

Code
boot(Portfolio, alpha_fn, R = 1000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = alpha_fn, R = 1000)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5758321 -0.001539618  0.08846251

Illustration: Auto data

Now we will find the bootstrap estimates of the standar error of the linear regression coefficients from the model of mpg on horsepower in Auto data

We get the estimated standard errors from lm:


Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Now we will use bootstrap to estimate the standard errors of the coefficients:

Illustration: Auto data

First define a function that returns the intercept and slope estimates for the linear regression model:

(Intercept)  horsepower 
 39.9358610  -0.1578447 
(Intercept)  horsepower 
 40.9483172  -0.1636591 

Now we use boot() to compute the standard errors of 1000 bootstrap estimates for the intercept and slope


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot_fn, R = 1000)


Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0141715259 0.842049552
t2* -0.1578447 -0.0002181104 0.007203328

Illustration: Auto data

Compare with standard formula results for the regression coefficients in a linear model

              Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81

What can you conclude from the different results?

There is some difference between the SE estimates of the parameters using bootstrap approach and the formulas coming from assuming a linear model. Standard error estimates depend on \(\sigma^2\) that first needs to be estimated. \(\hat{\sigma}^2\) depends on the assumption that a linear model is correct, if it is not, then the residuals from the linear fit will be inflated, and consequently so will \(\hat{\sigma}^2\). The bootstrap approach does not rely on these assumptions so might provide more accurate estimates.

Activity

Redo everything for polynomial regression with degree=2

Check your understanding

In cross-validation, what is the main purpose of repeatedly splitting the data into training and validation sets?

To estimate the model’s predictive performance on unseen data by mimicking out-of-sample prediction.

When estimating model accuracy, why might cross-validation give a less biased estimate of test error than the bootstrap?

Because cross-validation always evaluates on data not used for training, whereas bootstrap reuses data points for both training and testing.

True or False:

  • The bootstrap is primarily designed to assess the variability of an estimator, while cross-validation is designed to assess predictive performance.
    • TRUE
  • In k-fold cross-validation, each observation is used exactly once for testing and \(k-1\) times for training.
    • TRUE
  • Bootstrap resampling always uses disjoint training and testing sets.
    • FALSE
  • Increasing the number of folds in cross-validation always reduces bias in the error estimate without affecting variance.
    • FALSE

Summary

Cross-validation

  • Used to estimate test-errors
  • \(K\)-fold CV for \(K=5\) to \(K=10\) sweetspot between bias, variance, and computational effort

Bootstrap

  • Used to estimate distributional properties of an estimator whose distribution is unknown
  • Use to construct estimates of standard errors, CIs

Homework 2

Key concepts

  • Linear classifiers, LDA, logistic classifier (Q1 a, Q2)
  • Loss, surrogate loss (Q1 b, c)
  • Discriminant based classifiers (Q2, Q3, Q4 a)
  • ROC (Q4 b)

Question 1

In this exercise, we consider binary classification, where \(Y = −1\) denotes a negative example and \(Y = 1\) denotes a positive example.

  1. Show that the output of a logistic regression classifier with parameter \(\hat{\beta}\) can be expressed as follows \[ \psi(x) = H(x^\top \hat\beta) \,, \] where \(H(z) = 1\) if \(z \geq 0\) and \(H(z) = 0\) if \(z < 0\).

Question 1a

Show

  • For logistic classifier: \(\psi(x) = H(x^\top \hat\beta)\)

  • I.e. classifier is a \(0\)-\(1\) step function in \(x^\top \hat{\beta}\):

Code
H <- function(z){
    ifelse(z < 0, 0, 1) 
}

zs <- seq(from = -5, to = 5, length.out= 1000) 

plot(zs, H(zs), col = "blue", lwd = 2, type = "l", main = expression(H(z)))

Know

  • \(\psi(x) = \mathbb{1}\{ \hat{\eta}(x) > 1 / 2 \}\)
  • \(\hat\eta(x) = 1 / (1 + \exp\{-x^\top \hat \beta \})\)
Code
plot(zs, plogis(zs), col = "red", lwd = 2, type = "l", main = "Logistic classifier")
abline(h = 0.5, col = "grey", lty = 2, lwd = 2) 
lines(zs, H(zs), col = "blue", lwd = 2)

Question 1a

Show

Logistic classifier is a \(0\)-\(1\) step function in \(x^\top\hat{\beta}\)

Plug-in

\[\begin{equation} \begin{aligned} \psi(x) &= \mathbb{1}\{ \hat{\eta}(x) \geq 1 / 2 \} \\ &= \mathbb{1}\{ 1 / (1 + \exp\{-x^\top \hat \beta \}) \geq 1 / 2 \} \\ &= \mathbb{1}\{ 2 \geq 1 + \exp\{-x^\top \hat \beta \} \} \\ &= \mathbb{1}\{ 1 \geq \exp\{-x^\top \hat \beta \} \} \\ &= \mathbb{1}\{ 0 \geq -x^\top \hat \beta \} \\ &= \mathbb{1}\{ x^\top \hat \beta \geq 0 \} \\ &= H(x^\top \hat \beta) \,. \end{aligned} \end{equation}\]

Learnings

  • Go from definition to statement

Question 1b

Let \(\ell: \Re \to \Re\) with \(\ell(z) = \log(1 + \exp\{-z\})\). Show that for logistic regression, the negative log-likelihood (referred to as binary cross-entropy or log loss) is \(\ell(-yx^\top \beta)\).

Show

  • negative log-likelihood in logistic regression is \(\ell(-yx^\top \beta)\)

Know

  • \(\Pr(Y = 1 \mid X = x) = \frac{1}{1 + \exp\{-x^\top \hat \beta\}}\)
  • \(\Pr(Y = -1 \mid X = x) = \frac{\exp\{-x^\top \beta\}}{1 + \exp\{-x^\top \hat \beta\}}\)
  • log-likelihood: \[ \begin{aligned} &\mathbb{1}\{Y = 1\} \log(\Pr(Y = 1 \mid X = x)) \\ + &\mathbb{1}\{Y = -1\}\log(\Pr(Y = -1 \mid X = x)) \end{aligned} \]

Question 1b

log-likelihhod:

\[ \mathcal{L} = \mathbb{1}\{Y = 1\} \log(\Pr(Y = 1 \mid X = x)) + \mathbb{1}\{Y = -1\}\log(\Pr(Y = -1 \mid X = x)) \]

\(y = 1\)

\[ \begin{aligned} \mathcal{L} & = \log(\Pr(Y = 1 \mid X = x))\\ &= \log\left(\frac{1}{1 + \exp\{-x^\top \hat \beta\}} \right) \\ &= \log(1) - \log \left({1 + \exp\{-x^\top \hat \beta\}} \right) \\ &= - \ell(-y x^\top \beta) \end{aligned} \]

Question 1b

\(y = -1\)

\[ \begin{aligned} \mathcal{L} & = \log(\Pr(Y = 1 \mid X = x))\\ &= \log\left(\frac{\exp\{-x^\top \beta\}}{1 + \exp\{-x^\top \hat \beta\}} \right) \\ &= \log\left(\frac{1}{1 + 1 / \exp\{-x^\top \hat \beta\}} \right) \\ &= \log\left(\frac{1}{1 + \exp\{x^\top \hat \beta\}} \right) \\ &= - \log(1 + \exp\{x^\top \hat \beta\}) \\ &= - \ell(-y x^\top \beta) \end{aligned} \]

Question 1b

Learnings

  • Go from definition to statement
  • log-likelihood estimation minimises cross-entropy loss

Question 1c

Consider a linear classifier \(H(x^\top \hat \beta)\) with the loss \(\ell(yx^\top \hat{\beta})\), using either the \(0\)-\(1\) loss function \(\ell(z) = \mathbb{1}\{z \leq 0\}\), the hinge loss function \(\ell(z) = \max \{0, 1-z\}\), or the squared hinge loss function \(\ell(z) = \max \{0, 1-z\}^2\).

  1. Compare the values of the three loss functions over the domain R and discuss their continuity properties.
  2. Which of the three loss functions is differentiable at each point \(z \in \Re\)?
  3. How does the binary cross-entropy function compare to the above three functions (up to scaling with a multiplicative constant)?

Question 1c

Code
hinge <- function(z){
    pmax(0, 1 - z) 
}

sq_hinge <- function(z){
    pmax(0, 1 - z)^2
}

z_o <- function(z){
    ifelse(z <= 0, 1, 0) 
}

plot(zs, hinge(zs), col = "black", lwd = 2, type = "l") 
lines(zs, sq_hinge(zs), col = "blue", lwd = 2)  
lines(zs, z_o(zs), col = "red", lwd = 2)  
legend("topright", col = c("black", "blue", "red"), lwd = rep(2,3), legend = c("hinge", "sq. hinge", "0-1")) 
  • Hinge and squared hinge upper bound the \(0\)-\(1\) loss
  • \(0\)-\(1\) loss has discontinuity at \(0\), others continuous
  • Squared hinge loss is differentiable on \(\Re\), \(0\)-\(1\) loss is not differentiable at \(0\), hinge loss not differentiable at \(1\)

Question 1c

Code
cross_entropy <- function(z, a = 1 / log(2)){
    a * log(1 + exp(-z)) 
}


plot(zs, hinge(zs), col = "black", lwd = 2, type = "l") 
lines(zs, sq_hinge(zs), col = "blue", lwd = 2)  
lines(zs, z_o(zs), col = "red", lwd = 2)  
lines(zs, cross_entropy(zs), col = "orange", lwd = 2)
legend("topright", col = c("black", "blue", "red", "orange"), lwd = rep(2,43), legend = c("hinge", "sq. hinge", "0-1", "cross entropy")) 
  • upper bounds \(0\)-\(1\) loss for \(a > 1 / \log(2)\)
  • can be seen as smooth approximation to hinge-loss function

Question 1c

Learnings:

  • understand properties of difference loss functions
  • maximum-likelihood estimate in logistic regression can be seen as smooth approximation to minimising empirical \(0\)-\(1\) loss (surrogate loss)

Question 2a

Consider LDA for \(p = 1\). a. Show that the discriminator can be expressed as follows: \[ \delta_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_k) \]

Recap

Discriminant based classifiers

Classifier \(\psi: \Re^p \to \{1 , \ldots, K\}\) is a discriminant based classifier if it can be expressed as

\[ \psi(x) = \underset{k}{\arg \max} \, \delta_k(x) \,, \] where \(\delta_k: \Re^p \to \Re\) is called the discriminant function. Any monotone transformation of \(\delta_k\) gives the same predictions!

Bayes discriminant function under \(0\)-\(1\) loss

\[ \eta_k(x) = \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{k = 1}^K \pi_k f_k(x)} \]

  • Bayes’ Theorem: \(\Pr(A \mid B) = \frac{\Pr(A) \Pr(B \mid A)}{\Pr(B)}\)
  • Law of total probablity: \(\Pr(B) = \sum_{i = 1}^n \Pr(A) \Pr(B_n \mid A)\), for \(B_n\) disjoint with \(\bigcup_{i =1}^n B_n = B\)

Recap

LDA

Assumes that \(X \mid Y = k\) is \(\mathrm{N}(\mu_k, \Sigma)\), i.e. 

\[ f_k(x) = \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k)} \]

Code
library(MASS)
library(ggplot2)
library(ggnewscale)

set.seed(42)

mu1 <- c(2, 3)
mu2 <- c(7, 8)
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
n <- 1000

X1 <- as.data.frame(mvrnorm(n, mu1, Sigma))
X2 <- as.data.frame(mvrnorm(n, mu2, Sigma))
colnames(X1) <- colnames(X2) <- c("x", "y")

X1s <- X1[sample(1:n, n/10), ]
X2s <- X2[sample(1:n, n/10), ]

ggplot() +
  stat_density_2d(
    data = X1, aes(x = x, y = y, fill = after_stat(level)),
    geom = "polygon", color = NA, alpha = 0.5
  ) +
  scale_fill_gradient(low = "#c6dbef", high = "#08519c", guide = "none") +
  
  new_scale_fill() +
  
  stat_density_2d(
    data = X2, aes(x = x, y = y, fill = after_stat(level)),
    geom = "polygon", color = NA, alpha = 0.5
  ) +
  scale_fill_gradient(low = "#c7e9c0", high = "#006d2c", guide = "none") +

  geom_point(data = X1s, aes(x = x, y = y), color = "#08519c", alpha = 0.8, size = 0.9) +
  geom_point(data = X2s, aes(x = x, y = y), color = "#006d2c", alpha = 0.8, size = 0.9) +
  
  theme_minimal(base_size = 14) +
  theme(legend.position = "none") +
  labs(
    title = expression("LDA modelling assumption"),
    x = expression(X[1]),
    y = expression(X[2])
  )

Recap LDA

With Bayes discriminant function

\[ \log (\pi_k f_k(x)) = \log\pi_k - \frac{1}{2}(x-\mu_k)^\top \Sigma^{-1}(x-\mu_k) \]

LDA estimates \(\pi_k, \mu_k, \Sigma\) to get

\[ \delta_k^{\mathrm{LDA}}(x):= \log\hat \pi_k - \frac{1}{2}(x-\hat \mu_k)^\top \hat\Sigma^{-1}(x-\hat \mu_k) \]

The learnable parameters \(\pi_k, \Sigma, \mu_k\) are estimated from the data:

  • \(\hat{\pi}_k = \frac{n_k}{n} = \frac{1}{n} \sum_{i = 1}^n \mathbb{1}\{Y_i = k\}\)
  • \(\hat\mu_k = \frac{1}{n_k}\sum_{i = 1}^n \mathbb{1}\{Y_i = k\} X_i\)
  • \(\hat{\Sigma} = \frac{1}{n - K} \sum_{k = 1}^K \sum_{i = 1}^n \mathbb{1}\{Y_i = k\} (X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)^\top\)

Question 2a

Show

For \(p = 1\)

\[ \delta_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_k) \]

Know

  • \(\delta_k(x) = \log(\hat\pi_k \hat f_k(x)) + c\), where \(c\) does not depend on class
  • \(\hat{f}_k(x) = \frac{1}{\sqrt{2 \pi \hat{\sigma}^2}} \exp\{-(x - \mu_k)^2/ 2 \hat\sigma^2 \}\)

Question 2a

Plug in

\[ \begin{aligned} \log(\hat\pi_k \hat f_k(x)) &=\log(\hat{\pi}_k) -\frac{1}{2}\log(2 \pi \hat{\sigma}^2) - \frac{1}{2 \hat{\sigma}^2} (x -\hat \mu_k)^2 \\ &= \log(\hat{\pi}_k) -\frac{1}{2}\log(2 \pi \hat{\sigma}^2) - \frac{1}{2 \hat{\sigma}^2} \left(x^2 - 2 \hat \mu_k x + \mu_k^2 \right) \\ &= \log(\hat{\pi}_k) + x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat \mu_k^2}{2 \hat \sigma^2} -\frac{1}{2}\log(2 \pi \hat{\sigma}^2) - \frac{x^2}{2 \hat{\sigma}^2} \\ &= \log(\hat{\pi}_k) + x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat \mu_k^2}{2 \hat \sigma^2} + c \end{aligned} \]

Conclusion

Discriminant function discriminates based on \[ \log(\hat{\pi}_k) + x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat \mu_k^2}{2 \hat \sigma^2} \]

Question 2b

Show that if \(\hat{\mu}_1 < \hat{\mu}_2\), then every \(x \in \Re\) such that \(x < x^*\) is classified as class 1, and every \(x \in \Re\) such that \(x > x^*\) is classified as class 2, where \[ x^* = \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} + \frac{\hat\sigma^2 \left[\log(\hat{\pi}_1) - \log(\hat{\pi}_2)\right]}{\hat{\mu}_2 - \hat{\mu}_1}. \]

Show

\[ \psi(x) = \mathbb{1}\{x < x^* \} + 1 \]

Know

  • \(\psi(x) = \mathbb{1}\{\delta_1(x) > \delta_2(x)\} + 1\)
  • \(\delta_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_k)\)

Question 2b

Plug in

\[ \begin{aligned} \delta_1(x) > \delta_2(x) &\iff x \frac{\hat{\mu}_1}{\hat{\sigma}^2} - \frac{\hat{\mu}_1^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_1) > x \frac{\hat{\mu}_2}{\hat{\sigma}^2} - \frac{\hat{\mu}_2^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_2) \\ &\iff \frac{\hat \mu_2^2 - \hat{\mu}_1^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_1) - \log(\hat{\pi}_2) > x \frac{\hat{\mu}_2-\hat{\mu}_1}{\hat{\sigma}^2} \\ &\iff \frac{1}{2 } \frac{\hat \mu_2^2 - \hat{\mu}_1^2}{\hat{\mu}_2-\hat{\mu}_1} +\hat{\sigma}^2\frac{[\log(\hat{\pi}_1) - \log(\hat{\pi}_2)]}{\hat{\mu}_2-\hat{\mu}_1} > x \\ &\iff \frac{1}{2 } \frac{(\hat \mu_2 - \hat{\mu}_1)(\hat \mu_2 + \hat{\mu}_1)}{\hat{\mu}_2-\hat{\mu}_1} +\hat{\sigma}^2\frac{[\log(\hat{\pi}_1) - \log(\hat{\pi}_2)]}{\hat{\mu}_2-\hat{\mu}_1} > x \\ &\iff \frac{1}{2 } \frac{\hat \mu_2 + \hat{\mu}_1}{\hat{\mu}_2-\hat{\mu}_1} +\hat{\sigma}^2\frac{[\log(\hat{\pi}_1) - \log(\hat{\pi}_2)]}{\hat{\mu}_2-\hat{\mu}_1} > x \\ &\iff x^* > x \end{aligned} \]

Question 2b

Code
xs <- seq(from = -5, to = 5, length.out = 1000) 

mu_1 <- -0.5 
mu_2 <- 0.5

pi_1 <- 0.5
pi_2 <- 0.5 

sigma <- 1

x_star = (mu_1 + mu_2)/2 + sigma^2 * (log(pi_1) - log(pi_2)) / (mu_2 - mu_1) 

plot(xs, pi_1 * dnorm(xs, mean = mu_1, sd = sigma), type = "l", col = "green", lwd = 2, main = "Equal pis") 
lines(xs, pi_2 * dnorm(xs, mean = mu_2, sd = sigma), type = "l", col = "blue", lwd = 2)
abline(v = x_star, col = "grey", lty = 2) 
legend("topright", col = c("green", "blue", "grey"), lty = c(1,1,2), lwd = rep(2,3), legend = c("Class 1", "Class 2", "x_star")) 

Code
pi_1 <- 0.6
pi_2 <- 0.4

x_star = (mu_1 + mu_2)/2 + sigma^2 * (log(pi_1) - log(pi_2)) / (mu_2 - mu_1) 


plot(xs, pi_1 * dnorm(xs, mean = mu_1, sd = sigma), type = "l", col = "green", lwd = 2, main = "Unequal pis") 
lines(xs, pi_2 * dnorm(xs, mean = mu_2, sd = sigma), type = "l", col = "blue", lwd = 2)
abline(v = x_star, col = "grey", lty = 2) 
legend("topright", col = c("green", "blue", "grey"), lty = c(1,1,2), lwd = rep(2,3), legend = c("Class 1", "Class 2", "x_star")) 

Learnings

  • From definition to statement
  • Under equal \(\hat\pi_k\), boundary is midway point between class means
  • If \(\hat \pi_1 > \hat{\pi}_2\), second term positive, boundary shift to the right (favour class 1), and vice-versa

Question 3

Consider a three-class classification problem, with \(Y = i\) for class \(i \in \{1,2,3\}\).

Let \(X = (X_1, X_2)^\top\), with the following mean vectors:

\[ \mu_1 = (0, 0)^\top, \quad \mu_2 = (1, 1)^\top, \quad \mu_{3,1} = (0.5, 0.5)^\top, \quad \mu_{3,2} = (-0.5, 0.5)^\top \,, \]

and the covariance matrix is given by

\[ \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\,. \]

  • Suppose that \(X \mid Y = 1\) and \(X \mid Y = 2\) follow bivariate Gaussian distributions \(N(\mu_1, \Sigma)\) and \(N(\mu_2, \Sigma)\), respectively.

  • Suppose \(X \mid Y = 3\) follows a mixture of two bivariate Gaussian distributions, with a 50% chance of being from \(N(\mu_{3,1}, \Sigma)\) and a 50% chance from \(N(\mu_{3,2}, \Sigma)\).

Additionally, assume

\[ \mathbb{P}[Y=1] = \tfrac{2}{5}, \quad \mathbb{P}[Y=2] = \tfrac{2}{5}, \quad \mathbb{P}[Y=3] = \tfrac{1}{5}. \]

Classify the point \(x_0 = (0.3, 0.3)^\top\) based on the posterior probabilities.

Recap

Posterior distribution

Bayes’ Theorem:

\[ \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{k = 1}^K \pi_k f_k(x)} \]

Classification:

\[ \underset{k}{\arg \max}\, \Pr(Y = k \mid X = x) = \underset{k}{\arg \max} \, \pi_k f_k(x) \]

Recap

Mixture distribution

A random variable \(X\) follows a mixture distribution with weights \(w_1, \ldots, w_C\) and mixture densities \(f_1,\ldots,f_C\) if:

\[ \Pr(X \leq t) = \sum_{c = 1}^k w_c F_c(t)\,, \]

where \(F_c(t) = \int_{-\infty}^t f_c(x) dx\) is the CDF associated with \(f_c\).

Code
c1 <- dnorm(xs, mean = mu_1, sd = 1)
c2 <- ifelse(xs > 0, dexp(xs), 0)

plot(xs, c1, type = "l", lty = 2, col = "green", lwd = 2, main = "A mixture distribution", ylim = c(0,1)) 
lines(xs, c2, type = "l", lty = 2, col = "blue", lwd = 2)
lines(xs, pi_1 * c1 + pi_2 * c2, col = "orange", lwd = 2) 
legend("topright", col = c("green", "blue", "orange"), lty = c(2,2,1), lwd = rep(2,3), legend = c("Class 1", "Class 2", "Mixture")) 

Recap

Mixture distribution

Can think about a draw of \(X\) as follows:

  1. Draw mixture class \(c\) from classes \(1,\ldots,C\) with probabilities \(w_1, \ldots, w_C\)
  2. Draw \(X\) according to \(f_c\)

\[ \begin{aligned} \Pr (X \leq t) &= \Pr\left(\bigcup_{c = 1}^C \{\text{class } = c, X \leq t\}\right) \\ &= \sum_{c=1}^C \Pr( \text{class } = c, X \leq t) \\ &= \sum_{c=1}^C \Pr( \text{class } = c) \Pr( X \leq t \mid \text{class } = c) \\ &= \sum_{c=1}^C w_c F_c(t) \,. \end{aligned} \]

Recap

Mixture distribution

PDF of mixture distribution: \[ \begin{aligned} f(x) &= \frac{\partial}{\partial x} F(x) \\ &= \frac{\partial}{\partial x} \sum_{c = 1}^k w_c F_c(t) \\ &= \sum_{c = 1}^kw_c \frac{\partial}{\partial x} F_c(t) \\ &= \sum_{c = 1}^kw_c f_c(t) \\ \end{aligned} \]

Question 3

Show

How to classify \(x = (0.3,0.3)\) according to posterior probabilities

Know

  • \(\Pr(Y = k \mid X = x) \propto \pi_k f_k\)
  • \([\pi_1, \pi_2, \pi_3] = [0.4, 0.4, 0.2]\)
  • The conditional pdfs of \(X\) conditional on label \(Y\) (two normal, one mixture)

Question 3

Plug-in

We have:

\[ \pi_1 f_1(x_0) = \frac{2}{5} \times \frac{1}{2\pi} \exp\!\left( -\frac{1}{2}(0.3 - 0,\, 0.3 - 0)(0.3 - 0,\, 0.3 - 0)^\top \right) = 0.0582, \]

\[ \pi_2 f_2(x_0) = \frac{2}{5} \times \frac{1}{2\pi} \exp\!\left( -\frac{1}{2}(0.3 - 1,\, 0.3 - 1)(0.3 - 1,\, 0.3 - 1)^\top \right) = 0.0390, \]

and

\[ \begin{aligned} \pi_3 f_3(x_0) &= \frac{1}{5} \times \frac{1}{2} \times \frac{1}{2\pi} \exp\!\left( -\frac{1}{2}(0.3 - 0.5,\, 0.3 - 0.5)(0.3 - 0.5,\, 0.3 - 0.5)^\top \right) \\ &\quad + \frac{1}{5} \times \frac{1}{2} \times \frac{1}{2\pi} \exp\!\left( -\frac{1}{2}(0.3 + 0.5,\, 0.3 + 0.5)(0.3 - 0.5,\, 0.3 - 0.5)^\top \right)\\ &= 0.0266. \end{aligned} \]

Since \(\pi_1 f_1(x_0)\) is the largest, we classify \(x_0\) as belonging to class 1.

Question 3

Learnings

  • Generalise the discriminant approach \(\delta_k(x) = \log(\pi_k f_k)\) to unseen setting

Question 4

In this exercise, we consider binary classification. a. Consider a classifier that assigns each example to the majority label in the training dataset. What is the expected test accuracy of this classifier?

Show

Expected test accuracy of majority vote classifier

Know

  • \(\psi(x) = \mathbb{1}\left\{\frac{1}{n}\sum_{i = 1}^n Y_i > \frac{1}{2} \right\} = \psi\)
  • Accuracy of \(\psi\) at \(Y\): \(\mathbb{1}\{Y = \psi\}\)
  • Expected accuracy: \(\mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}]\) (\(\mathcal{D}\) is the training data with which we train our classifier)
  • \(Y \perp \psi\)

Question 4a

Plug in

\[ \begin{aligned} \mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}] &= \Pr\left(Y = \psi \right) \\ &= \Pr\left( \{Y = \psi = 1\} \cup \{Y = \psi = 0\} \right) \\ &= \sum_{i = 0}^1 \Pr \left( Y = \psi = i \right) \\ &= \sum_{i = 0}^1 \Pr \left( Y = i \right) \Pr(\psi = i) \\ &= \pi \pi_\psi + (1 - \pi) (1 - \pi_\psi) \,, \end{aligned} \] where \[ \pi = \Pr(Y = 1), \quad \pi_\psi = \Pr(\psi = 1) \]

Question 4a

What is \(\Pr(\psi = 1)\)?

To make further progress, we need to understand \(\psi\). Let us stick to our guns and write down what we know

\[ \Pr(\psi = 1) = \Pr\left(\mathbb{1}\left\{\frac{1}{n}\sum_{i = 1}^n Y_i > \frac{1}{2}\right\} = 1 \right) = \Pr \left( \frac{1}{n}\sum_{i = 1}^n Y_i > \frac{1}{2} \right) = \Pr \left( \sum_{i = 1}^n Y_i > \frac{n}{2} \right) \]

Now the random variable \(\sum_{i = 1}^n Y_i\) just counts the number of successes in \(n\) independent Bernoulli trials. Hence it follows a binomial distribution, and thus

\[ \begin{aligned} \Pr\left(\sum_{i = 1}^n Y_i > \frac{n}{2} \right) &= 1 - \Pr\left(\sum_{i = 1}^n Y_i \leq \frac{n}{2} \right) \\ &= 1 - \sum_{k = 1}^{\lfloor n / 2 \rfloor} {n\choose k} \pi^k (1- \pi)^{n-k} \end{aligned} \]

Question 4a

Hence we have found that

\[ \mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}] = \pi \left(1 - \sum_{k = 1}^{\lfloor n / 2 \rfloor} {n\choose k} \pi^k (1- \pi)^{n-k} \right) + (1 - \pi) \left(\sum_{k = 1}^{\lfloor n / 2 \rfloor} {n\choose k} \pi^k (1- \pi)^{n-k} \right) \,. \]

Now although this is a precise characterisation of the expected prediction accuracy of our majority vote classifier, it looks rather convoluted and altogether a bit unsatisfying.

Intuitevly, we would expect that if we had enough votes, the majority vote should be \(1\) if the individual voting probabilities \(\pi\) are greater than \(1/2\) and \(0\) if they are less than \(1/2\).

Can we formalise this thought?

Question 4a

Intuitevly, we would expect that if we had enough votes, the majority vote should be \(1\) if the individual voting probabilities \(\pi\) are greater than \(1/2\) and \(0\) if they are less than \(1/2\).

Code
n <- 10^3 

pi <- 0.7 

pi2 <- 0.3 

y1 <- rbinom(n, 1, pi) 
y2 <- rbinom(n, 1, pi2) 

av1 <- cumsum(y1) / 1:n 
av2 <- cumsum(y2) / 1:n 

plot(1:n, av1, type = "l", col = "blue", lwd = 2, main = "Average of responses", xlab = "Number of votes", ylab = "% of 1-votes") 
lines(1:n, av2, col = "orange", lwd = 2) 
abline(h = 0.5, col = "grey", lwd = 2, lty = 2) 
legend("bottomright", legend = c(expression(pi > 1/2), expression(pi < 1/2), "Decision boundary"), lty = c(1,1,2), lwd = rep(1,3), col = c("blue", "orange", "grey"))

Can we formalise this thought?

Recap

Law of large numbers

An average \(S_n = \sum_{i = 1}^n X_i / n\) of \(n\) i.i.d. random variables with expectation \(E[X] = \mu\) converges to \(\mu\) in probability as \(n \to \infty\): For all \(\epsilon, \delta\), there is \(N \in \mathbb{N}\), such that for all \(n > N\),

\[ \Pr(|S_n - \mu| > \epsilon ) < \delta \]

Code
n <- 10^4 

pi <- 0.55 

eps <- 1e-2 
ub <- pi + eps
lb <- pi - eps 

set.seed(123)
y1 <- rbinom(n, 1, pi) 

av1 <- cumsum(y1) / 1:n 

plot(1:n, av1, type = "l", col = "blue", lwd = 2, main = "High probability bands", ylim = c(pi - 0.1, pi + .1), xlab = "Number of samples", ylab = expression(S[n])) 
abline(h = ub, col = "grey", lty = 2, lwd = 2) 
abline(h = lb, col = "grey", lty = 2, lwd = 2) 

polygon(
  x = c(par("usr")[1], par("usr")[2], par("usr")[2], par("usr")[1]),
  y = c(lb, lb, ub, ub),
  col = rgb(0, 0, 1, 0.2),
  border = NA
)

legend("bottomright", legend = c(expression(S[n]), expression(mu + epsilon), expression(mu - epsilon)), lty = c(1,2,2), lwd = rep(1,3), col = c("blue", "grey", "grey"))

Question 4a

\[ S_n = \frac{1}{n}\sum_{i = 1}^n Y_i \]

is an average of \(n\) i.i.d. Bernoulli random variables with success probability \(\pi\).

By the law of large numbers: \[ S_n \overset{p}{\to} \mathrm{E}[S_n] = \pi, \quad \text{as } n \to \infty \ \]

Question 4a

\(\pi > 1/2\)

Code
n <- 10^3 

pi <- 0.6 

y1 <- rbinom(n, 1, pi) 

av1 <- cumsum(y1) / 1:n 

eps <- 1e-2 
ub <- pi + eps
lb <- pi - eps 



plot(1:n, av1, type = "l", col = "blue", lwd = 2, main = expression(pi > frac(1,2))) 
abline(h = ub, col = "grey", lty = 2, lwd = 2) 
abline(h = lb, col = "grey", lty = 2, lwd = 2) 
abline(h = 0.5, col = "red", lwd = 2, lty = 2) 
polygon(
  x = c(par("usr")[1], par("usr")[2], par("usr")[2], par("usr")[1]),
  y = c(lb, lb, ub, ub),
  col = rgb(0, 0, 1, 0.2),
  border = NA
)
polygon(
  x = c(par("usr")[1], par("usr")[2], par("usr")[2], par("usr")[1]),
  y = c(.5, .5, lb, lb),
  col = rgb(1, 0, 0, 0.2),  
  border = NA
)

legend("topright", legend = c(expression(S[n]), expression(pi + epsilon), expression(pi - epsilon), "Decision boundary"), lty = c(1,2,2,2), lwd = rep(1,4), col = c("blue", "grey", "grey", "red"))
  • \(S_n \overset{p}{\to} \pi\)
  • \(S_n - 1/2 \overset{p}{\to} \pi - 1/2 > 0\)
  • \(\Pr(S_n - 1/2 > 0) \to 1\), as \(n \to \infty\)

Question 4a

\(\pi < 1/2\)

Code
n <- 10^3 

pi <- 0.4 

set.seed(123)
y1 <- rbinom(n, 1, pi) 

av1 <- cumsum(y1) / 1:n 

eps <- 1e-2 
ub <- pi + eps
lb <- pi - eps 



plot(1:n, av1, type = "l", col = "blue", lwd = 2, main = expression(pi < frac(1,2))) 
abline(h = ub, col = "grey", lty = 2, lwd = 2) 
abline(h = lb, col = "grey", lty = 2, lwd = 2) 
abline(h = 0.5, col = "red", lwd = 2, lty = 2) 
polygon(
  x = c(par("usr")[1], par("usr")[2], par("usr")[2], par("usr")[1]),
  y = c(lb, lb, ub, ub),
  col = rgb(0, 0, 1, 0.2),
  border = NA
)
polygon(
  x = c(par("usr")[1], par("usr")[2], par("usr")[2], par("usr")[1]),
  y = c(ub, ub, 0.5, 0.5),
  col = rgb(1, 0, 0, 0.2),  
  border = NA
)

legend("bottomright", legend = c(expression(S[n]), expression(pi + epsilon), expression(pi - epsilon), "Decision boundary"), lty = c(1,2,2,2), lwd = rep(1,4), col = c("blue", "grey", "grey", "red"))
  • \(\Pr(S_n - 1/2 > 0) \to 0\), as \(n \to \infty\)

Question 4a

\(\pi = 1/2\)

Code
n <- 10^3 

pi <- 0.5

y1 <- rbinom(n, 1, pi) 

av1 <- cumsum(y1) / 1:n 

plot(1:n, av1, type = "l", col = "blue", lwd = 2, main = expression(pi==frac(1,2))) 
abline(h = 0.5, lwd = 2, lty = 2, col = "grey")

The process is \(S_n\) is symmetric. Each outcome \([y_1, \ldots, y_n]\) such that \(S_n > 1/2\) has a corresponding outcome \([1-2y_1, \ldots, 1-2y_n]\) with \(S_n < 1/2\) that has the same probability.

  • \(\Pr(S_n > 1/2) = \Pr(S_n < 1 / 2)\)
  • If \(n\) is odd: \(S_n = 1/2\) impossible: \(\Pr(S_n > 1/2) = 1/2\)
  • If \(n\) is even: \(\Pr(S_n = 1/2) = (1/2)^n\). \(\Pr(S_n > 1/2) = (1 - (1/2)^n) / 2 \to 1/2\) as \(n \to \infty\)

Question 4a

\(\pi > 1/2\)

  • \(\Pr(S_n > 1/2) = \Pr(\psi = 1) = \pi_\psi \to 1\) as \(n \to \infty\)
  • \(\mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}] = \pi \pi_\psi + (1 - \pi)(1 - \pi_\psi) \to \pi\)

\(\pi < 1/2\)

  • \(\Pr(S_n > 1/2) = \Pr(\psi = 0) = 1 - \pi_\psi \to 1\) as \(n \to \infty\)
  • \(\mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}] = \pi \pi_\psi + (1 - \pi)(1 - \pi_\psi) \to 1 - \pi\)

\(\pi = 1/2\)

  • \(\Pr(S_n > 1/2) = \pi_\psi = 1/2\)
  • \(\mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}] = \pi \pi_\psi + (1 - \pi)(1 - \pi_\psi) \to \pi = 1 - \pi\)

Question 4a

Overall: \[ \mathrm{E}_{Y, \mathcal{D}}[\mathbb{1}\{Y = \psi\}] \to \max\{\pi, 1-\pi\}, \quad \text{as } n \to \infty \]

Question 4a

Learnings

  • Use definitions to characterise unknown classifier
  • Use of rigour (definitions, LLN) and intuition (Majority class should be class with higher probability) to get a precise result
  • Majority rate classifier classifies majority class correctly with probability going to one \((\pi \neq 1/2)\)
  • Serves as a lower bound in accuracy for more sophisticated ML algorithms

Question 4b

For an email spam filter, it may be preferable to allow for some rate of spam to reach the inbox rather than risk sending an important legitimate email message to the spam folder. Assume we train a spam classifier where the positive class is spam and the negative class is non-spam.

Suppose we have three points on the ROC curve for our classifier: A: \((0.1, 0.7)\), B: \((0.2, 0.75)\) and C: \((0.3, 0.8)\). Which of these is preferable?

Recap

ROC

Binary, discriminant based classifiers are of the form \(\psi(x) = \mathbb{1}\{ s(x) < t \}\), where

  • \(s(x)\) is a score that maps \(x\) to the real numbers
  • \(t\) is treshold that translates scores into predictions

The scores induce an ordering on a test set \(X_1, \ldots, X_n\):

\[ s(X_{(1)}) < s(X_{(2)}) < \ldots < s(X_{(i)}) < t < s(X_{(i + 1)}) < \ldots < s(X_{(n)}) \]

where \(X_{(1)}\) is the feature that produces the lowest score and so on.

  • If \(t < s(X_{(1)})\), then all points are labelled zero and both the TPR and FPR are zero
  • For \(s(X_{(i+1)}) \geq t > s(X_{(i)})\), \(X_{(1)}, \ldots, X_{(i)}\) are labelled positive, the rest negative
  • ROC plots \(TPR(t), FPR(t)\) for all values of \(t\)

The ROC curve gives an indication of how well a prediction method discriminates among classes, regardless of the threshold

Recap

Confusion matrix for binary responses

Predicted Positive Predicted Negative
Actual Positive True Positive (TP) False Negative (FN)
Actual Negative False Positive (FP) True Negative (TN)

\[ \text{TPR} = \frac{TP}{TP + FN} \]

\[ \text{FPR} = \frac{FP}{FP + TN} \]

Recap

ROC curve

R code
# Load necessary libraries
library(dplyr)
library(knitr)
library(kableExtra) 

# Set seed for reproducibility
set.seed(123)

# Define the number of observations
n <- 70

# Generate random data for each covariate

# Age: Assuming cat age follows an absolute normal distribution
age <- abs(rnorm(n, sd = 4))

# Color: Ginger, Black, Brown
color <- c("Ginger", "Black", "Brown")[rbinom(n, 2, age/max(age)) + 1]

# Multiple Colors: binary with conditional probabilities
multiple_colors <- sapply(color, function(col) {
  if (col == "Ginger") {
    return(c("No", "Yes")[rbinom(1, 1, 0.6) + 1]) 
  } else if (col == "Black") {
    return(c("No", "Yes")[rbinom(1, 1, 0.05) + 1]) 
  } else {
    return(c("No", "Yes")[rbinom(1, 1, 0.25) + 1]) 
  }
})

# Generate previous_owners correlated with age
previous_owners <- sapply(age, function(a) {
  prob <- c(0.05, 0.7, 0.2, 0.05) * (a / max(age))
  prob <- prob / sum(prob) 
  return(sample(0:3, 1, prob = prob))
})
# Neutered: Binary with p = 0.9
neutered <- c("No","Yes")[rbinom(n, 1, 0.9) + 1]

# Gender: Binary with p = 0.5
gender <- c("Male", "Female")[rbinom(n, 1, ifelse(multiple_colors == "Yes", 0.1, 0.5)) + 1]

# Playfulness: range between 1 and 10, negatively correlated with age and col == black, positively with neutered and gender 
playfulness <- -0.2 * age + rnorm(n, sd = 2) + 0.25 * (neutered == "Yes") - 0.1 * (2 * (gender == "Male") - 1) - 5 * (color == "Black")
playfulness <- 10 * ((playfulness - min(playfulness)) / (max(playfulness) - min(playfulness)))

# Create a data frame
cat_adoption_data <- data.frame(
  Age = round(age, 1),
  MainColor = as.factor(color),
  MultipleColors = as.factor(multiple_colors),
  PreviousOwners = previous_owners,
  Neutered = as.factor(neutered),
  Gender = as.factor(gender),
  Playfulness = round(playfulness, 1)
)

# Expand categorical variables using model.matrix
model_data <- model.matrix(~ Age + MainColor + MultipleColors + PreviousOwners + Neutered + Gender + Playfulness, data = cat_adoption_data)

# Define a true effect vector (for simplicity, assume some arbitrary effects)
true_effect_base <- c(Age = -0.5, MainColorBrown = .3, MainColorGinger = -.5, MultipleColorsYes = 0.25, PreviousOwners = 0, NeuteredYes = 0, GenderMale = 0, Playfulness = 0.25)

true_effects <- c(-c(Intercept = mean(model_data[,-1] %*% true_effect_base)), true_effect_base) 

# Simulate the probability of adoption using the logistic function
prob_adoption <- plogis(model_data %*% true_effects)

# Simulate the binary adoption outcome
adopted <- c("No", "Yes")[rbinom(n, 1, prob_adoption) + 1]

# Add the adoption outcome to the data frame
cat_adoption_data <- cbind(cat_adoption_data, Adopted = adopted)




Question 4b

Show

  • Which point on ROC curve is most preferrable for a spam filter

Know

  • ROC points record \(TPR(t), FPR(t)\) for a particular threshold \(t\)
  • TP: Spam emails classified as spam
  • FP: Normal emails classified as spam
  • Prefer spam emails classified as normal (FN), over normal emails classified as spam (FP)
  • FNR = 1 - TPR

Question 4b

Look for point with low FPR, at expense of lower TPR

Code
x0 <- c(0,0)
x1 <- c(0.1, 0.7) 
x2 <- c(.2, 0.75) 
x3 <- c(.3, .8)
x4 <- c(1,1)

xs <- rbind(x0, x1, x2, x3, x4) 

plot(xs, type = "b", col = "blue", lwd = 2, xlab = "FPR", ylab = "TPR", main = "ROC") 
abline(a = 0, b = 1, col = "grey", lwd = 2, lty = 2) 

ROC_points <- rbind(x1,x2,x3) 
point_labels <- c("A", "B", "C") 
text(ROC_points, point_labels, pos=3, cex = 1.5)
  • A has lowest FPR, with comparable TPR, so it is most preferable