Show that \(\delta^{\mathrm{QDA}}_k(x)\) depends on \(\hat{\sigma}_k\) such that it is maximum when \(\hat{\sigma}_k = |x - \hat{\mu}_k|\) and that it is decreasing with \(\hat{\sigma}_k\) as its value is further away from \(|x - \hat{\mu}_k|\). Give an intuitive explanation for this dependence.
Recap
QDA
QDA assumes that \(X | Y = k \sim \mathrm{N}(\mu_k, \Sigma_k)\).
Show
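\(\delta^{\mathrm{QDA}}_k(x)\), as a function of \(\hat{\sigma}_k\), is maximised at \(\hat{\sigma}_k = |x - \hat{\mu}_k|\).
Setting the derivative with respect to \(\hat{\sigma}_k\) to zero gives \(\hat{\sigma}_k^2 = (x - \hat{\mu}_k)^2\). Since \(\hat{\sigma}_k > 0\), the only solution is \(\hat{\sigma}_k = |x - \hat{\mu}_k|\).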
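For reference, the calculation behind this step can be written out as follows. This is a sketch that assumes the usual one-dimensional QDA discriminant; the estimated class prior \(\hat{\pi}_k\) is notation introduced here and does not depend on \(\hat{\sigma}_k\).
\[
\begin{aligned}
\delta^{\mathrm{QDA}}_k(x) &= -\log \hat{\sigma}_k - \frac{(x - \hat{\mu}_k)^2}{2 \hat{\sigma}_k^2} + \log \hat{\pi}_k \,, \\
\frac{\partial}{\partial \hat{\sigma}_k} \delta^{\mathrm{QDA}}_k(x) &= -\frac{1}{\hat{\sigma}_k} + \frac{(x - \hat{\mu}_k)^2}{\hat{\sigma}_k^3} = \frac{(x - \hat{\mu}_k)^2 - \hat{\sigma}_k^2}{\hat{\sigma}_k^3} \,.
\end{aligned}
\]
The derivative vanishes exactly when \(\hat{\sigma}_k^2 = (x - \hat{\mu}_k)^2\).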
Question 1.b
Check for maximum
Second derivative test:
If \(f\) is twice differentiable with critical point \(x^*\) such that \[
\frac{\partial^2}{\partial x^2} f(x) \bigg|_{x = x^*} < 0 \,,
\] then \(x^*\) is a local maximum.
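Applied to the problem at hand, the test can be sketched as follows, again assuming the one-dimensional discriminant \(\delta^{\mathrm{QDA}}_k(x) = -\log \hat{\sigma}_k - (x - \hat{\mu}_k)^2 / (2 \hat{\sigma}_k^2) + \log \hat{\pi}_k\) and \(x \neq \hat{\mu}_k\):
\[
\frac{\partial^2}{\partial \hat{\sigma}_k^2} \delta^{\mathrm{QDA}}_k(x)
= \frac{1}{\hat{\sigma}_k^2} - \frac{3 (x - \hat{\mu}_k)^2}{\hat{\sigma}_k^4} \,,
\qquad
\frac{\partial^2}{\partial \hat{\sigma}_k^2} \delta^{\mathrm{QDA}}_k(x) \bigg|_{\hat{\sigma}_k = |x - \hat{\mu}_k|}
= -\frac{2}{(x - \hat{\mu}_k)^2} < 0 \,.
\]
So the critical point \(\hat{\sigma}_k = |x - \hat{\mu}_k|\) is a local maximum.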
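If \(\hat{\sigma}_k > | x - \hat \mu_k |\): the numerator \((x - \hat{\mu}_k)^2 - \hat{\sigma}_k^2\) of the derivative is \(< 0\), so \(\delta_k(\hat{\sigma}_k)\) decreases as \(\hat{\sigma}_k\) grows beyond \(| x - \hat \mu_k |\).
If \(\hat{\sigma}_k < | x - \hat \mu_k |\): the numerator is \(> 0\), so \(\delta_k(\hat{\sigma}_k)\) increases as \(\hat{\sigma}_k\) rises towards \(| x - \hat \mu_k |\).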
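The sign analysis can also be checked numerically. The sketch below assumes the one-dimensional discriminant \(-\log \sigma - (x - \mu)^2 / (2 \sigma^2) + \log \pi\); the helper `delta_qda` and the values of `x`, `mu` and the prior are made up for illustration.

```r
# Hypothetical helper: 1D QDA discriminant as a function of sigma
delta_qda <- function(sigma, x, mu, prior = 0.5) {
  -log(sigma) - (x - mu)^2 / (2 * sigma^2) + log(prior)
}

x  <- 1.7   # illustrative query point
mu <- 0.2   # illustrative class mean, so |x - mu| = 1.5

# Evaluate the discriminant on a fine grid of sigma values
sigmas <- seq(0.05, 10, by = 0.001)
vals   <- delta_qda(sigmas, x, mu)

# Location of the maximum on the grid
sigma_best <- sigmas[which.max(vals)]

# Monotone on either side of |x - mu|
rising  <- all(diff(vals[sigmas < abs(x - mu) - 1e-9]) > 0)
falling <- all(diff(vals[sigmas > abs(x - mu) + 1e-9]) < 0)

print(c(sigma_best = sigma_best, target = abs(x - mu)))
```

The grid maximiser should sit at \(|x - \mu|\) up to the grid resolution, with the discriminant rising before it and falling after it.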
Interpretation
\(|x - \mu_k|\) is much larger than \(\hat{\sigma}_k\): Conditional density \(\hat f_k\) of the feature is concentrated around its mean value and hence it is less likely that \(x\) is a sample from \(f_k\)
\(|x - \mu_k|\) is much smaller than \(\hat{\sigma}_k\): \(\hat f_k\) has a wide spread, making it less likely that \(x\) is a sample from \(f_k\).
Alternative: Conditional on \(Y = k\), \(|x - \mu_k|\) is a half-normal random variable with mean \(\sigma_k \sqrt{2 / \pi} \approx 0.80 \sigma_k\) and variance \(\sigma_k^2 (1 - 2 / \pi) \approx 0.36 \sigma_k^2\). Hence, we expect \(|x - \mu_k|\) to be of the same order as \(\sigma_k\).
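The half-normal moments can be checked by simulation. This is a minimal sketch; the value `sigma_k <- 2` and the sample size are arbitrary choices made for illustration.

```r
set.seed(1)

sigma_k <- 2                                  # illustrative standard deviation
abs_dev <- abs(rnorm(1e5, mean = 0, sd = sigma_k))  # draws of |X - mu_k|

# Empirical moments of |X - mu_k|
emp_mean <- mean(abs_dev)
emp_var  <- var(abs_dev)

# Half-normal moments
theo_mean <- sigma_k * sqrt(2 / pi)
theo_var  <- sigma_k^2 * (1 - 2 / pi)

print(rbind(empirical   = c(mean = emp_mean, var = emp_var),
            theoretical = c(mean = theo_mean, var = theo_var)))
```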
Code
```r
library(ggplot2)
library(patchwork)

set.seed(42)

# Class mean, two contrasting standard deviations, and two query points
mu <- 0
sigma_small <- 0.5
sigma_large <- 10
x_close <- 0.4
x_far <- 3.0
n <- 800

# Density curves on a common grid
xs <- seq(-6, 6, length.out = 2000)
df_small <- data.frame(x = xs, y = dnorm(xs, mu, sigma_small))
df_large <- data.frame(x = xs, y = dnorm(xs, mu, sigma_large))

# Simulated samples, shown as rug marks
samples_small <- data.frame(x = rnorm(n, mu, sigma_small))
samples_large <- data.frame(x = rnorm(n, mu, sigma_large))

# Half-normal mean E|X - mu| for each standard deviation
Eabs_small <- sigma_small * sqrt(2 / pi)
Eabs_large <- sigma_large * sqrt(2 / pi)

# Density heights at the query points
y_close_small <- dnorm(x_close, mu, sigma_small)
y_close_large <- dnorm(x_close, mu, sigma_large)
y_far_small <- dnorm(x_far, mu, sigma_small)
y_far_large <- dnorm(x_far, mu, sigma_large)

# Shared y-axis so the two panels are comparable
ylim_shared <- c(0, max(df_small$y) * 1.1)

# Small sigma_k: the query point lies far out in the tail
p_small <- ggplot() +
  annotate("rect", xmin = -Eabs_small, xmax = Eabs_small,
           ymin = 0, ymax = Inf, fill = "#08519c", alpha = 0.08) +
  geom_area(data = df_small, aes(x = x, y = y), fill = "#c6dbef", alpha = 0.95) +
  geom_line(data = df_small, aes(x = x, y = y), color = "#08519c", linewidth = 1) +
  geom_rug(data = samples_small, aes(x = x), sides = "b", color = "#08519c", alpha = 0.35) +
  geom_vline(xintercept = mu, linetype = "dashed", linewidth = 0.6) +
  geom_segment(aes(x = x_far, xend = x_far, y = 0, yend = y_far_small),
               color = "#de2d26", linewidth = 0.7) +
  geom_point(aes(x = x_far, y = y_far_small), color = "#de2d26", size = 3) +
  coord_cartesian(xlim = c(-6, 6), ylim = ylim_shared) +
  labs(title = expression("Small" ~ sigma[k]), x = "x", y = "density") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

# Large sigma_k: the query point is close to the mean, but the density is flat
p_large <- ggplot() +
  annotate("rect", xmin = -Eabs_large, xmax = Eabs_large,
           ymin = 0, ymax = Inf, fill = "#006d2c", alpha = 0.08) +
  geom_area(data = df_large, aes(x = x, y = y), fill = "#c7e9c0", alpha = 0.95) +
  geom_line(data = df_large, aes(x = x, y = y), color = "#006d2c", linewidth = 1) +
  geom_rug(data = samples_large, aes(x = x), sides = "b", color = "#006d2c", alpha = 0.35) +
  geom_vline(xintercept = mu, linetype = "dashed", linewidth = 0.6) +
  geom_segment(aes(x = x_close, xend = x_close, y = 0, yend = y_close_large),
               color = "#de2d26", linewidth = 0.7) +
  geom_point(aes(x = x_close, y = y_close_large), color = "#de2d26", size = 3) +
  coord_cartesian(xlim = c(-6, 6), ylim = ylim_shared) +
  labs(title = expression("Large" ~ sigma[k]), x = "x", y = "density") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

# Side-by-side panels
(p_small | p_large) + plot_annotation(title = "QDA in 1D")
```
Question 1
Learnings
Show properties of familiar classifiers
Interpretation of QDA behaviour (vs LDA)
Check your understanding
When might QDA perform worse than LDA, even though QDA is more flexible?
Answer
If the Bayes decision boundary is approximately linear
When the sample size is small or the number of features is large relative to the sample size. QDA estimates separate covariance matrices, leading to high variance and overfitting.
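To get a feel for the variance argument, one can count covariance parameters. The sketch below uses made-up values of `p` and `K`; a symmetric \(p \times p\) covariance matrix has \(p(p+1)/2\) free entries.

```r
# Number of free entries in a symmetric p x p covariance matrix
cov_params <- function(p) p * (p + 1) / 2

p <- 20   # illustrative number of features
K <- 3    # illustrative number of classes

lda_cov_params <- cov_params(p)       # one shared covariance matrix
qda_cov_params <- K * cov_params(p)   # one covariance matrix per class

print(c(LDA = lda_cov_params, QDA = qda_cov_params))
```

With these values QDA has three times as many covariance parameters to estimate from the same data.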
How may regularization or shrinkage of covariance matrices help in discriminant analysis?
Answer
It reduces the variance of covariance estimates by shrinking them toward a simpler structure (e.g., the identity or a common matrix), improving stability and generalization, especially in high-dimensional or small-sample settings.
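A minimal sketch of shrinkage toward a scaled identity matrix, under assumed settings: the weight `lambda`, the sample size and the dimension are all hypothetical, and the data are simulated. With fewer observations than features the sample covariance is singular, while the shrunk estimate is not.

```r
set.seed(4)

n <- 10   # illustrative sample size
p <- 20   # illustrative number of features (p > n)
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

S <- cov(X)   # sample covariance; rank at most n - 1 < p, hence singular

lambda <- 0.2                       # hypothetical shrinkage weight in [0, 1]
target <- diag(mean(diag(S)), p)    # scaled identity with the same average variance
S_shrunk <- (1 - lambda) * S + lambda * target

# Smallest eigenvalue before and after shrinkage
min_eig_raw    <- min(eigen(S, symmetric = TRUE, only.values = TRUE)$values)
min_eig_shrunk <- min(eigen(S_shrunk, symmetric = TRUE, only.values = TRUE)$values)

print(c(raw = min_eig_raw, shrunk = min_eig_shrunk))
```

The smallest eigenvalue moves from numerically zero to roughly `lambda` times the average variance, so the shrunk matrix can be inverted in the discriminant function.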
How do unequal class priors affect the LDA decision boundary?
Answer
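They shift the boundary toward the less frequent class, enlarging the region assigned to the more frequent class. Observations are then classified as the rare class only when the evidence from \(x\) is strong, so rare-class observations are misclassified more often.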
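The effect of the priors can be sketched in one dimension with two classes and a shared variance. Equating the two linear discriminants \(x \mu_k / \sigma^2 - \mu_k^2 / (2 \sigma^2) + \log \pi_k\) gives the boundary used below; the helper names and all numeric values are made up for illustration.

```r
# Hypothetical helper: 1D LDA discriminant for one class
delta_lda <- function(x, mu, sigma, prior) {
  x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(prior)
}

# Boundary between class 1 and class 2, obtained from delta_1(x) = delta_2(x)
lda_boundary <- function(mu1, mu2, sigma, prior1) {
  prior2 <- 1 - prior1
  (mu1 + mu2) / 2 + sigma^2 * log(prior2 / prior1) / (mu1 - mu2)
}

mu1 <- 0; mu2 <- 2; sigma <- 1   # illustrative class means and shared sd

b_equal   <- lda_boundary(mu1, mu2, sigma, prior1 = 0.5)   # balanced classes
b_unequal <- lda_boundary(mu1, mu2, sigma, prior1 = 0.9)   # class 2 is rare

print(c(equal_priors = b_equal, unequal_priors = b_unequal))
```

With balanced priors the boundary is the midpoint of the means; making class 2 rare pushes the boundary toward (here even past) the class-2 mean.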
How can LDA/QDA be modified to handle heteroscedastic noise or outliers?
Suppose that the training data points \[
(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{1, \ldots, K\}
\] are i.i.d. samples with \[
\log \left(
\frac{\Pr[Y_i = k \mid X_i = x]}{\Pr[Y_i = K \mid X_i = x]}
\right)
= \beta_k^\top x
\quad \text{for } i = 1, \ldots, n \text{ and } k = 1, \ldots, K,
\] where \(\beta = (\beta_1, \ldots, \beta_K)\) is a parameter in \(\mathbb{R}^{p \times K}\).
Express \(\Pr(Y = k \mid X = x)\) in terms of \(\beta, x\).
Recap
Multiclass logistic classification
\[
\log \left(
\frac{\Pr[Y_i = k \mid X_i = x]}{\Pr[Y_i = K \mid X_i = x]}
\right)
= \beta_k^\top x
\quad \text{for } i = 1, \ldots, n \text{ and } k = 1, \ldots, K,
\] where \(\beta = (\beta_1, \ldots, \beta_K)\) is a parameter in \(\mathbb{R}^{p \times K}\).
Aim to generalise the idea of binary logistic classification: \[
\log \left(
\frac{\Pr[Y_i = 1 \mid X_i = x]}{\Pr[Y_i = 0 \mid X_i = x]}
\right)
= \beta^\top x
\] to more than two classes.
Recap
Why do we need a reference class?
The last probability \(\Pr(Y = K \mid X = x)\) is redundant in that \[
\Pr(Y = K \mid X = x) = 1 - \sum_{i = 1}^{K-1}\Pr(Y = i \mid X = x)
\]
Recap
Identification
What if we model all \(K\) classes without a reference class, i.e. \[
\Pr(Y = k \mid X = x) = C \exp\{\beta_k^\top x\}, \quad k = 1, \ldots, K \,.
\]
Normalisation requires that
\[
\begin{aligned}
1 &= \sum_{i = 1}^K \Pr(Y = i \mid X = x) \\
&= C \sum_{i = 1}^K\exp\{\beta_i^\top x\} \\
\iff C &= \frac{1}{\sum_{i = 1}^K\exp\{\beta_i^\top x\}} \,,
\end{aligned}
\]
so that
\[
\Pr(Y = k \mid X = x) = \frac{\exp\{\beta_k^\top x\}}{\sum_{i = 1}^K\exp\{\beta_i^\top x\}}
\]
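A minimal R sketch of this formula. The helper name `softmax_probs`, the coefficient matrix `beta` (one column per class) and the feature vector `x` are assumptions made for illustration; the last column is set to zero so that class \(K\) acts as the reference class.

```r
# Hypothetical helper: class probabilities from a p x K coefficient matrix
softmax_probs <- function(beta, x) {
  scores <- as.vector(crossprod(beta, x))   # beta_k^T x for k = 1, ..., K
  exp(scores) / sum(exp(scores))
}

set.seed(1)
p <- 3   # illustrative number of features
K <- 4   # illustrative number of classes

beta <- matrix(rnorm(p * K), nrow = p, ncol = K)
beta[, K] <- 0            # reference class K has beta_K = 0
x <- c(1, -0.5, 2)        # illustrative feature vector

probs <- softmax_probs(beta, x)

# Log-odds of classes 1, ..., K - 1 against the reference class K
log_odds <- log(probs[1:(K - 1)] / probs[K])
linear   <- as.vector(crossprod(beta[, 1:(K - 1)], x))

print(rbind(log_odds = log_odds, linear = linear))
```

The probabilities sum to one, and the log-odds against class \(K\) reproduce the linear terms \(\beta_k^\top x\).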
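For any \(\beta_1, \ldots, \beta_K\) and any \(v \in \mathbb{R}^p\), consider the shifted vectors \(\beta_1 + v, \ldots, \beta_K + v\). They give exactly the same probabilities, because the common factor \(\exp\{v^\top x\}\) cancels between numerator and denominator. Without a constraint such as a reference class, the parameters are therefore not unique.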
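The invariance to a common shift can be verified numerically. The sketch below uses a hypothetical helper `softmax_probs` and randomly generated `beta`, `x` and `v`; none of these values come from the exercise.

```r
# Hypothetical helper: class probabilities from a p x K coefficient matrix
softmax_probs <- function(beta, x) {
  scores <- as.vector(crossprod(beta, x))   # beta_k^T x for each class
  exp(scores) / sum(exp(scores))
}

set.seed(2)
p <- 3
K <- 4
beta <- matrix(rnorm(p * K), nrow = p, ncol = K)
x <- rnorm(p)
v <- rnorm(p)

# Adding a length-p vector to a p x K matrix recycles it down each column,
# so v is added to every beta_k
beta_shifted <- beta + v

probs         <- softmax_probs(beta, x)
probs_shifted <- softmax_probs(beta_shifted, x)

# Choosing v = -beta_K gives the reference-class parametrisation
beta_ref  <- beta - beta[, K]
probs_ref <- softmax_probs(beta_ref, x)

print(rbind(probs, probs_shifted, probs_ref))
```

All three rows agree, and `beta_ref` has a zero last column, which is the representation with class \(K\) as reference.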
If we wish to minimise \(f(x)\) over \(x\), gradient descent iteratively moves in the direction of steepest descent to update the current iterate:
\[
x^{(t)} = x^{(t-1)} - \alpha_t \nabla f(x^{(t-1)}) \,,
\] where the step size is the learning rate \(\alpha_t\).
Sometimes the gradient
\[
\nabla f(x^{(t-1)}) = \sum_{i = 1}^n \nabla f(x^{(t-1)}; z_i)
\] is just a sum of gradients evaluated at the data points \(z_1, \ldots, z_n\). If computing the full gradient is costly, we may instead evaluate it on a random subsample at each step (stochastic gradient descent).
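A minimal sketch of both variants on a least-squares objective. Everything here is an assumption made for illustration: the simulated data, the step sizes, the batch size, and the choice to average rather than sum the per-observation gradients (which only rescales the learning rate).

```r
set.seed(3)

n <- 500
A <- cbind(1, rnorm(n))                  # design matrix: intercept and one feature
b_true <- c(1, -2)                       # illustrative true coefficients
y <- as.vector(A %*% b_true + rnorm(n, sd = 0.1))

# Gradient of (1 / (2m)) * sum_i (a_i^T b - y_i)^2 over the rows in idx
grad <- function(b, idx) {
  A_sub <- A[idx, , drop = FALSE]
  as.vector(crossprod(A_sub, A_sub %*% b - y[idx])) / length(idx)
}

# Gradient descent; a batch_size below n gives the stochastic variant
descend <- function(steps, alpha, batch_size) {
  b <- c(0, 0)
  for (t in seq_len(steps)) {
    idx <- if (batch_size < n) sample.int(n, batch_size) else seq_len(n)
    b <- b - alpha * grad(b, idx)        # x^(t) = x^(t-1) - alpha * gradient
  }
  b
}

b_gd  <- descend(steps = 500,  alpha = 0.1,  batch_size = n)    # full gradient
b_sgd <- descend(steps = 2000, alpha = 0.05, batch_size = 20)   # random subsample

# Closed-form least-squares solution for comparison
b_ols <- as.vector(solve(crossprod(A), crossprod(A, y)))

print(rbind(b_gd, b_sgd, b_ols))
```

The full-gradient iterates converge to the least-squares solution, while the subsampled version with a constant step size fluctuates in a small neighbourhood of it.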
Understand the modelling assumptions of multiclass logistic regression
Define maximum likelihood problem based on distribution
Check your understanding
Why do we need a reference class in multiclass logistic regression?
Answer
The softmax probabilities are unchanged if you add the same vector v to all class parameters \(\beta_k\). Without a constraint, parameters aren’t unique. Fixing a reference makes the model identifiable.
What does \(\beta_k^\top x\) represent, and how do you interpret the intercepts?
Answer
It is the log-odds of class k versus the reference class K. Intercepts capture baseline log-odds when \(x = 0\); with centered features, they primarily encode class priors.
What is the shape of decision boundaries in multiclass logistic regression?
Answer
Pairwise boundaries between classes \(k\) and \(j\) are linear hyperplanes given by \((\beta_k - \beta_j)^\top x = 0\). Regions are separated by these hyperplanes; there are no quadratic terms.
What happens if classes are linearly separable? How is this handled in practice?
Answer
The MLE can diverge (parameters grow without bound) leading to non-convergence and overconfident probabilities. Use regularization (L2/L1).
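The divergence can be illustrated on a toy problem. The sketch below assumes a binary logistic model with a single slope and no intercept, four made-up, perfectly separated points, and plain gradient ascent with an optional L2 penalty `lambda`; it is not a substitute for a proper fitting routine.

```r
# Perfectly separable toy data: negative x has y = 0, positive x has y = 1
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# Gradient ascent on the (optionally L2-penalised) log-likelihood for the slope b
fit_slope <- function(steps, lambda, alpha = 0.5) {
  b <- 0
  for (t in seq_len(steps)) {
    p_hat <- 1 / (1 + exp(-b * x))                 # fitted probabilities
    gradient <- sum((y - p_hat) * x) - lambda * b  # score minus penalty gradient
    b <- b + alpha * gradient
  }
  b
}

# Without a penalty the slope keeps growing with the number of steps
b_mle_100   <- fit_slope(steps = 100,   lambda = 0)
b_mle_10000 <- fit_slope(steps = 10000, lambda = 0)

# With an L2 penalty the iterates settle at a finite value
b_ridge_100   <- fit_slope(steps = 100,   lambda = 0.1)
b_ridge_10000 <- fit_slope(steps = 10000, lambda = 0.1)

print(c(mle_100 = b_mle_100, mle_10000 = b_mle_10000,
        ridge_100 = b_ridge_100, ridge_10000 = b_ridge_10000))
```

The unpenalised slope is still increasing after many more iterations, whereas the penalised slope is essentially the same after 100 and after 10000 steps.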