Given a prediction function \(\hat{f}\), that we trained on a training dataset \(\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}\), we would like to:
Evaluate its performance, i.e. its risk \[
R(\hat{f}) = \mathrm{E}_{\mathcal{D}_n, Y, X}[\ell(Y, \hat{f}(X))]
\] loss functions typically \(\ell(y,f) = (y-f)^2\) (regression task), \(\ell(y,f) = \mathbb{1}\{y \neq f\}\) (classification task)
Choose the hyperparameters that minimise test error:
\(\hat{f}(x)\) often depends on hyperparameters, i.e. parameters that we do not choose in the training phase of \(\hat{f}\).
Examples: the number of neighbours in kNN, the learning rate \(\alpha\) in gradient descent for logistic classification
Problem
We only have one dataset \(\mathcal{D}_n\), which we already used for training \(\hat{f}\).
But, because we used the training data already for fitting \(\hat{f}\), our estimate will be too optimistic. If \(\hat{f}\) is powerful enough, it can achieve zero training error.
Validation set approach
We can split our data \(\mathcal{D}_n\) into two sets, a training set and a validation set.
We train \(\hat{f}\) on only set of the data, and evaluate on the other.
Randomly split data into train and test set
Train \(\hat{f}\) on training set
Estimate test error on test set
Validation set approach
Random split introduces uncertainty
Less training data → worse training → worse predictions → overestimate test error
Validation set approach: Model selection
Say our prediction \(\hat{f}\) has a hyperparameter, call it \(\lambda\), that we want to tune.
We do the following:
Split data into train and validation set
For a range of hyperparmeters \(\lambda_i \in \{\lambda_1, \ldots, \lambda_k\}\), do:
Train \(\hat{f}(\cdot;\lambda_i)\) on training data
Evluate error on validation data
Pick the model \(\hat{f}(\cdot; \hat{\lambda})\), where \(\hat{\lambda}\) achieves smallest error on validation set
Is the error on the validation set a good estimate for the test error of \(\hat{f}(\cdot; \hat{\lambda})\)?
No! We picked \(\lambda_i\) to achieve smallest error on the validation set. If we want to have a more honest estimate of the estimate of the test error, we need to split further: Train, validate, test.
Use training data to train model for each hyperparameter
Evaluate error on validation set
For best choice of hyperparameter, use test set to estimate test error
Low bias: Beacuse train on almost foll data, LOOCV is a good (low bias) approximation of the error of \(\hat{f}\), i.e. \[
\mathrm{E}_{\mathcal{D}_n}[\text{LOOCV}(\hat{f})] \approx \mathrm{E}_{X,Y,\mathcal{D}_n}[\ell(Y, \hat{f}(X))]
\]
High variance:
Because each summands predicts only a single point, we get a noisy estimate of expected loss
Because training sets are very similar, high correlation between them
→ If datasets change, we expect to see large fluctuations in LOOCV
Computationally intensive: We need to fit a model for \(n\) different data folds
K-fold cross validation
Compromise between validation set approach and LOOCV:
Partition data \(\mathcal{D}_n\) into \(K\) cells \(C_1, \ldots, C_K\)
For fold \(k\), train \(\hat{f}^{(-k)}\) on \(\mathcal{D}_n \setminus C_k\), validate on \(C_k\)
Split uncertainty: K-fold CV still uses random data split, but variability is substantially lower than for validation set: Overlap between folds reduces impact of split
Bias: Typically less bias than validation set approach, more bias than LOOCV
Variance: Less variance than both LOOCV, and validation set approach
→ Empirically: K-fold CV with \(5-10\) folds: Sweet spot between variance, bias, and computational effort
num <-5cv_error1 =rep(0, num)cv_error5 =rep(0, num)cv_error10 =rep(0, num)degree =1:numfor (deg in degree) {glm_fit <-glm(mpg ~poly(horsepower, deg), data = Auto)## CV errors using naive LOOCVcv_error1[deg] =cv.glm(Auto, glm_fit)$delta[1]## CV errors using K-fold-CVcv_error5[deg] =cv.glm(Auto, glm_fit, K =5)$delta[1]## CV errors using K-fold-CVcv_error10[deg] =cv.glm(Auto, glm_fit, K =10)$delta[1]}plot(degree, cv_error1, type ="b", col ="black")lines(degree, cv_error5, type ="b", col ="red")lines(degree, cv_error10, type ="b", col ="blue")legend("topright", legend =c("LOOCV", "5-fold CV", "10-fold CV"), col =c("black", "red", "blue"), lwd =2)
Activity
Repeat the above with a different set of predictors.
Bootstrap
Suppose we are given data \((Y_1,X_1), \ldots, (Y_n, X_n)\) that is an i.i.d. sample from some unknown distribution \(P\).
We compute an estimate \(\hat \theta\) of some quantity of interest, e.g. the MSE of a ML-algorithm. We would like to know something about its distribution
Monte carlo method
If we had access to \(P\), we could:
Draw \(B\) independent training samples \(\mathcal{D}_1,\ldots, \mathcal{D}_B\)
Get estimates \(\hat{\theta}_1,\ldots, \hat{\theta}_B\)
Use the empirical distribution of our estimates as an approximation of the distribution of the estimator itself
Bootstrap
Empirical distribution
The empirical distrubtion function of a sample of \(n\) i.i.d. random variables \(X\) is
set.seed(123) N <-50xs <-sort(rnorm(N)) emp_cdf <-1: N / Nx_grid <-seq(from =-3, to =3, length.out =1000)plot(xs, emp_cdf, col ="blue", lwd =2, type ="l", main ="Empirical CDF vs. CDF") lines(x_grid, pnorm(x_grid), col ="green", lwd =2) legend("bottomright", col =c("blue", "green"), lwd =rep(2,2), legend =c("Emp. CDF", "CDF"))
Bootstrap
Bootstrap idea
Rather than using the unknown distribution \(P\) to sample data, we use the empirical distribution \(P_n\) to sample data.
This corresponds to sampling from our training data \(\mathcal{D}\) uniformly at random with replacement
Illustration: Portfolio data
The Portfolio data set is in the ISLR package described in Section 5.2.
'data.frame': 100 obs. of 2 variables:
$ X: num -0.895 -1.562 -0.417 1.044 -0.316 ...
$ Y: num -0.235 -0.885 0.272 -0.734 0.842 ...
Illustration: Portfolio data
Wish to invest a fixed sum of money in two financial assets that yield returns of \(X\) and \(Y\), respectively, where X and Y are random quantities. We will invest a fraction \(\alpha\) of our money in \(X\), and will invest the remaining \(1 - \alpha\) in \(Y\).
The formula for \(\alpha\) that results in the minimum risk investment of form \(\alpha X + (1-\alpha)Y\) is
To illustrate the use of the bootstrap on this data, we must first create:
a function, alpha_fn() that takes as input the \((X,Y)\) data and returns an estimate of alpha using the estimator (5.7) to the observations corresponding to the rows of the data index.
a vector indicating which observations should be used to estimate \(\alpha\). Note that \(\alpha = \arg \min \text{var}[\alpha X + (1 - \alpha Y)]\).
randomly select \(100\) observations from \(1\) to \(100\) with replacement, i.e. construct a new bootstrap data and compute the corresponding alpha
Code
set.seed(1)alpha_fn(data <- Portfolio, index =sample(1:100, size =100, replace = T))
[1] 0.7368375
Illustration: Portfolio data
We can implement a bootstrap analysis by performing this command many times, recording all of the corresponding estimates for α, and computing the resulting standard. deviation.
Bootstrap using for loop:
Code
B <-1000boot_alpha =rep(0, B)for(i in1:B){boot_index <-sample(x =1:100, size =100, replace = T)boot_alpha[i] <-alpha_fn(data = Portfolio, index = boot_index)}mean(boot_alpha)
[1] 0.5739061
Code
sd(boot_alpha)
[1] 0.09285792
boot automates this approach. To produce 1000 bootstrap estimates for alpha
Compare with standard formula results for the regression coefficients in a linear model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.9358610 0.717498656 55.65984 1.220362e-187
horsepower -0.1578447 0.006445501 -24.48914 7.031989e-81
What can you conclude from the different results?
There is some difference between the SE estimates of the parameters using bootstrap approach and the formulas coming from assuming a linear model. Standard error estimates depend on \(\sigma^2\) that first needs to be estimated. \(\hat{\sigma}^2\) depends on the assumption that a linear model is correct, if it is not, then the residuals from the linear fit will be inflated, and consequently so will \(\hat{\sigma}^2\). The bootstrap approach does not rely on these assumptions so might provide more accurate estimates.
Activity
Redo everything for polynomial regression with degree=2
Check your understanding
In cross-validation, what is the main purpose of repeatedly splitting the data into training and validation sets?
To estimate the model’s predictive performance on unseen data by mimicking out-of-sample prediction.
When estimating model accuracy, why might cross-validation give a less biased estimate of test error than the bootstrap?
Because cross-validation always evaluates on data not used for training, whereas bootstrap reuses data points for both training and testing.
True or False:
The bootstrap is primarily designed to assess the variability of an estimator, while cross-validation is designed to assess predictive performance.
TRUE
In k-fold cross-validation, each observation is used exactly once for testing and \(k-1\) times for training.
TRUE
Bootstrap resampling always uses disjoint training and testing sets.
FALSE
Increasing the number of folds in cross-validation always reduces bias in the error estimate without affecting variance.
FALSE
Summary
Cross-validation
Used to estimate test-errors
\(K\)-fold CV for \(K=5\) to \(K=10\) sweetspot between bias, variance, and computational effort
Bootstrap
Used to estimate distributional properties of an estimator whose distribution is unknown
Use to construct estimates of standard errors, CIs
Homework 2
Key concepts
Linear classifiers, LDA, logistic classifier (Q1 a, Q2)
Loss, surrogate loss (Q1 b, c)
Discriminant based classifiers (Q2, Q3, Q4 a)
ROC (Q4 b)
Question 1
In this exercise, we consider binary classification, where \(Y = −1\) denotes a negative example and \(Y = 1\) denotes a positive example.
Show that the output of a logistic regression classifier with parameter \(\hat{\beta}\) can be expressed as follows \[
\psi(x) = H(x^\top \hat\beta) \,,
\] where \(H(z) = 1\) if \(z \geq 0\) and \(H(z) = 0\) if \(z < 0\).
Question 1a
Show
For logistic classifier: \(\psi(x) = H(x^\top \hat\beta)\)
I.e. classifier is a \(0\)-\(1\) step function in \(x^\top \hat{\beta}\):
Code
H <-function(z){ifelse(z <0, 0, 1) }zs <-seq(from =-5, to =5, length.out=1000) plot(zs, H(zs), col ="blue", lwd =2, type ="l", main =expression(H(z)))
plot(zs, plogis(zs), col ="red", lwd =2, type ="l", main ="Logistic classifier")abline(h =0.5, col ="grey", lty =2, lwd =2) lines(zs, H(zs), col ="blue", lwd =2)
Question 1a
Show
Logistic classifier is a \(0\)-\(1\) step function in \(x^\top\hat{\beta}\)
Let \(\ell: \Re \to \Re\) with \(\ell(z) = \log(1 + \exp\{-z\})\). Show that for logistic regression, the negative log-likelihood (referred to as binary cross-entropy or log loss) is \(\ell(-yx^\top \beta)\).
Show
negative log-likelihood in logistic regression is \(\ell(-yx^\top \beta)\)
log-likelihood estimation minimises cross-entropy loss
Question 1c
Consider a linear classifier \(H(x^\top \hat \beta)\) with the loss \(\ell(yx^\top \hat{\beta})\), using either the \(0\)-\(1\) loss function \(\ell(z) = \mathbb{1}\{z \leq 0\}\), the hinge loss function \(\ell(z) = \max \{0, 1-z\}\), or the squared hinge loss function \(\ell(z) = \max \{0, 1-z\}^2\).
Compare the values of the three loss functions over the domain R and discuss their continuity properties.
Which of the three loss functions is differentiable at each point \(z \in \Re\)?
How does the binary cross-entropy function compare to the above three functions (up to scaling with a multiplicative constant)?
Question 1c
Code
hinge <-function(z){pmax(0, 1- z) }sq_hinge <-function(z){pmax(0, 1- z)^2}z_o <-function(z){ifelse(z <=0, 1, 0) }plot(zs, hinge(zs), col ="black", lwd =2, type ="l") lines(zs, sq_hinge(zs), col ="blue", lwd =2) lines(zs, z_o(zs), col ="red", lwd =2) legend("topright", col =c("black", "blue", "red"), lwd =rep(2,3), legend =c("hinge", "sq. hinge", "0-1"))
Hinge and squared hinge upper bound the \(0\)-\(1\) loss
\(0\)-\(1\) loss has discontinuity at \(0\), others continuous
Squared hinge loss is differentiable on \(\Re\), \(0\)-\(1\) loss is not differentiable at \(0\), hinge loss not differentiable at \(1\)
Question 1c
Code
cross_entropy <-function(z, a =1/log(2)){ a *log(1+exp(-z)) }plot(zs, hinge(zs), col ="black", lwd =2, type ="l") lines(zs, sq_hinge(zs), col ="blue", lwd =2) lines(zs, z_o(zs), col ="red", lwd =2) lines(zs, cross_entropy(zs), col ="orange", lwd =2)legend("topright", col =c("black", "blue", "red", "orange"), lwd =rep(2,43), legend =c("hinge", "sq. hinge", "0-1", "cross entropy"))
upper bounds \(0\)-\(1\) loss for \(a > 1 / \log(2)\)
can be seen as smooth approximation to hinge-loss function
Question 1c
Learnings:
understand properties of difference loss functions
maximum-likelihood estimate in logistic regression can be seen as smooth approximation to minimising empirical \(0\)-\(1\) loss (surrogate loss)
Question 2a
Consider LDA for \(p = 1\). a. Show that the discriminator can be expressed as follows: \[
\delta_k(x) = x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2 \hat{\sigma}^2} + \log(\hat{\pi}_k)
\]
Recap
Discriminant based classifiers
Classifier \(\psi: \Re^p \to \{1 , \ldots, K\}\) is a discriminant based classifier if it can be expressed as
\[
\psi(x) = \underset{k}{\arg \max} \, \delta_k(x) \,,
\] where \(\delta_k: \Re^p \to \Re\) is called the discriminant function. Any monotone transformation of \(\delta_k\) gives the same predictions!
Bayes discriminant function under \(0\)-\(1\) loss
\[
\eta_k(x) = \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{k = 1}^K \pi_k f_k(x)}
\]
Bayes’ Theorem: \(\Pr(A \mid B) = \frac{\Pr(A) \Pr(B \mid A)}{\Pr(B)}\)
Law of total probablity: \(\Pr(B) = \sum_{i = 1}^n \Pr(A) \Pr(B_n \mid A)\), for \(B_n\) disjoint with \(\bigcup_{i =1}^n B_n = B\)
Recap
LDA
Assumes that \(X \mid Y = k\) is \(\mathrm{N}(\mu_k, \Sigma)\), i.e.
Discriminant function discriminates based on \[
\log(\hat{\pi}_k) + x \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat \mu_k^2}{2 \hat \sigma^2}
\]
Question 2b
Show that if \(\hat{\mu}_1 < \hat{\mu}_2\), then every \(x \in \Re\) such that \(x < x^*\) is classified as class 1, and every \(x \in \Re\) such that \(x > x^*\) is classified as class 2, where \[
x^* = \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}
+ \frac{\hat\sigma^2 \left[\log(\hat{\pi}_1) - \log(\hat{\pi}_2)\right]}{\hat{\mu}_2 - \hat{\mu}_1}.
\]
Suppose that \(X \mid Y = 1\) and \(X \mid Y = 2\) follow bivariate Gaussian distributions \(N(\mu_1, \Sigma)\) and \(N(\mu_2, \Sigma)\), respectively.
Suppose \(X \mid Y = 3\) follows a mixture of two bivariate Gaussian distributions, with a 50% chance of being from \(N(\mu_{3,1}, \Sigma)\) and a 50% chance from \(N(\mu_{3,2}, \Sigma)\).
Since \(\pi_1 f_1(x_0)\) is the largest, we classify \(x_0\) as belonging to class 1.
Question 3
Learnings
Generalise the discriminant approach \(\delta_k(x) = \log(\pi_k f_k)\) to unseen setting
Question 4
In this exercise, we consider binary classification. a. Consider a classifier that assigns each example to the majority label in the training dataset. What is the expected test accuracy of this classifier?
Show
Expected test accuracy of majority vote classifier
Now the random variable \(\sum_{i = 1}^n Y_i\) just counts the number of successes in \(n\) independent Bernoulli trials. Hence it follows a binomial distribution, and thus
Now although this is a precise characterisation of the expected prediction accuracy of our majority vote classifier, it looks rather convoluted and altogether a bit unsatisfying.
Intuitevly, we would expect that if we had enough votes, the majority vote should be \(1\) if the individual voting probabilities \(\pi\) are greater than \(1/2\) and \(0\) if they are less than \(1/2\).
Can we formalise this thought?
Question 4a
Intuitevly, we would expect that if we had enough votes, the majority vote should be \(1\) if the individual voting probabilities \(\pi\) are greater than \(1/2\) and \(0\) if they are less than \(1/2\).
Code
n <-10^3pi <-0.7pi2 <-0.3y1 <-rbinom(n, 1, pi) y2 <-rbinom(n, 1, pi2) av1 <-cumsum(y1) /1:n av2 <-cumsum(y2) /1:n plot(1:n, av1, type ="l", col ="blue", lwd =2, main ="Average of responses", xlab ="Number of votes", ylab ="% of 1-votes") lines(1:n, av2, col ="orange", lwd =2) abline(h =0.5, col ="grey", lwd =2, lty =2) legend("bottomright", legend =c(expression(pi >1/2), expression(pi <1/2), "Decision boundary"), lty =c(1,1,2), lwd =rep(1,3), col =c("blue", "orange", "grey"))
Can we formalise this thought?
Recap
Law of large numbers
An average \(S_n = \sum_{i = 1}^n X_i / n\) of \(n\) i.i.d. random variables with expectation \(E[X] = \mu\) converges to \(\mu\) in probability as \(n \to \infty\): For all \(\epsilon, \delta\), there is \(N \in \mathbb{N}\), such that for all \(n > N\),
\[
\Pr(|S_n - \mu| > \epsilon ) < \delta
\]
Code
n <-10^4pi <-0.55eps <-1e-2ub <- pi + epslb <- pi - eps set.seed(123)y1 <-rbinom(n, 1, pi) av1 <-cumsum(y1) /1:n plot(1:n, av1, type ="l", col ="blue", lwd =2, main ="High probability bands", ylim =c(pi -0.1, pi + .1), xlab ="Number of samples", ylab =expression(S[n])) abline(h = ub, col ="grey", lty =2, lwd =2) abline(h = lb, col ="grey", lty =2, lwd =2) polygon(x =c(par("usr")[1], par("usr")[2], par("usr")[2], par("usr")[1]),y =c(lb, lb, ub, ub),col =rgb(0, 0, 1, 0.2),border =NA)legend("bottomright", legend =c(expression(S[n]), expression(mu + epsilon), expression(mu - epsilon)), lty =c(1,2,2), lwd =rep(1,3), col =c("blue", "grey", "grey"))
Question 4a
\[
S_n = \frac{1}{n}\sum_{i = 1}^n Y_i
\]
is an average of \(n\) i.i.d. Bernoulli random variables with success probability \(\pi\).
By the law of large numbers: \[
S_n \overset{p}{\to} \mathrm{E}[S_n] = \pi, \quad \text{as } n \to \infty \
\]
n <-10^3pi <-0.5y1 <-rbinom(n, 1, pi) av1 <-cumsum(y1) /1:n plot(1:n, av1, type ="l", col ="blue", lwd =2, main =expression(pi==frac(1,2))) abline(h =0.5, lwd =2, lty =2, col ="grey")
The process is \(S_n\) is symmetric. Each outcome \([y_1, \ldots, y_n]\) such that \(S_n > 1/2\) has a corresponding outcome \([1-2y_1, \ldots, 1-2y_n]\) with \(S_n < 1/2\) that has the same probability.
\(\Pr(S_n > 1/2) = \Pr(S_n < 1 / 2)\)
If \(n\) is odd: \(S_n = 1/2\) impossible: \(\Pr(S_n > 1/2) = 1/2\)
If \(n\) is even: \(\Pr(S_n = 1/2) = (1/2)^n\). \(\Pr(S_n > 1/2) = (1 - (1/2)^n) / 2 \to 1/2\) as \(n \to \infty\)
Use definitions to characterise unknown classifier
Use of rigour (definitions, LLN) and intuition (Majority class should be class with higher probability) to get a precise result
Majority rate classifier classifies majority class correctly with probability going to one \((\pi \neq 1/2)\)
Serves as a lower bound in accuracy for more sophisticated ML algorithms
Question 4b
For an email spam filter, it may be preferable to allow for some rate of spam to reach the inbox rather than risk sending an important legitimate email message to the spam folder. Assume we train a spam classifier where the positive class is spam and the negative class is non-spam.
Suppose we have three points on the ROC curve for our classifier: A: \((0.1, 0.7)\), B: \((0.2, 0.75)\) and C: \((0.3, 0.8)\). Which of these is preferable?
Recap
ROC
Binary, discriminant based classifiers are of the form \(\psi(x) = \mathbb{1}\{ s(x) < t \}\), where
\(s(x)\) is a score that maps \(x\) to the real numbers
\(t\) is treshold that translates scores into predictions
The scores induce an ordering on a test set \(X_1, \ldots, X_n\):