ST443 Machine Learning and data mining

Week 4


Philipp Sterzinger

p.sterzinger@lse.ac.uk

13 November 2025

Homework 3

Question 1

Consider the QDA classifier for \(p=1\).

  1. Show that the discriminator can be written as follows:

\[ \delta^{\mathrm{QDA}}_k(x) = \log(\hat{\pi}_k) - \frac{1}{2\hat{\sigma}_k^2}(x - \hat{\mu}_k)^2 - \log(\hat{\sigma}_k). \]

  1. Show that \(\delta^{\mathrm{QDA}}_k(x)\) depends on \(\hat{\sigma}_k\) such that it is maximum when \(\hat{\sigma}_k = |x - \hat{\mu}_k|\) and that it is decreasing with \(\hat{\sigma}_k\) as its value is further away from \(|x - \hat{\mu}_k|\). Give an intuitive explanation for this dependence.

Recap

QDA

QDA assumes that \(X | Y = k \sim \mathrm{N}(\mu_k, \Sigma_k)\).

\[ f_k(x) = \frac{1}{(2\pi)^{p/2}(\det\Sigma_k)^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)} \]

Code
library(MASS)
library(ggplot2)
library(ggnewscale)

set.seed(42)

mu1 <- c(2, 3)
mu2 <- c(7, 8)
Sigma_1 <- matrix(c(2, 1, 1, 2), 2, 2)

Sigma_2 <- matrix(1 * rnorm(4), 2, 2)

n <- 1000

X1 <- as.data.frame(mvrnorm(n, mu1, Sigma_1))
X2 <- as.data.frame(mvrnorm(n, mu2, Sigma_2))
colnames(X1) <- colnames(X2) <- c("x", "y")

X1s <- X1[sample(1:n, n/10), ]
X2s <- X2[sample(1:n, n/10), ]

qda_plot <- ggplot() +
  stat_density_2d(
    data = X1, aes(x = x, y = y, fill = after_stat(level)),
    geom = "polygon", color = NA, alpha = 0.5
  ) +
  scale_fill_gradient(low = "#c6dbef", high = "#08519c", guide = "none") +
  
  new_scale_fill() +
  
  stat_density_2d(
    data = X2, aes(x = x, y = y, fill = after_stat(level)),
    geom = "polygon", color = NA, alpha = 0.5
  ) +
  scale_fill_gradient(low = "#c7e9c0", high = "#006d2c", guide = "none") +

  geom_point(data = X1s, aes(x = x, y = y), color = "#08519c", alpha = 0.8, size = 0.9) +
  geom_point(data = X2s, aes(x = x, y = y), color = "#006d2c", alpha = 0.8, size = 0.9) +
  
  theme_minimal(base_size = 14) +
  theme(legend.position = "none") +
  labs(
    title = expression("QDA modelling assumption"),
    x = expression(X[1]),
    y = expression(X[2])
  )
qda_plot

Recap

LDA vs QDA

Code
library(MASS)
library(ggplot2)
library(ggnewscale)
library(patchwork)

set.seed(42)

mu1 <- c(2, 3)
mu2 <- c(7, 8)
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
n <- 1000

X1 <- as.data.frame(mvrnorm(n, mu1, Sigma))
X2 <- as.data.frame(mvrnorm(n, mu2, Sigma))
colnames(X1) <- colnames(X2) <- c("x", "y")

X1s <- X1[sample(1:n, n/10), ]
X2s <- X2[sample(1:n, n/10), ]

lda_plot <- ggplot() +
  stat_density_2d(
    data = X1, aes(x = x, y = y, fill = after_stat(level)),
    geom = "polygon", color = NA, alpha = 0.5
  ) +
  scale_fill_gradient(low = "#c6dbef", high = "#08519c", guide = "none") +
  
  new_scale_fill() +
  
  stat_density_2d(
    data = X2, aes(x = x, y = y, fill = after_stat(level)),
    geom = "polygon", color = NA, alpha = 0.5
  ) +
  scale_fill_gradient(low = "#c7e9c0", high = "#006d2c", guide = "none") +

  geom_point(data = X1s, aes(x = x, y = y), color = "#08519c", alpha = 0.8, size = 0.9) +
  geom_point(data = X2s, aes(x = x, y = y), color = "#006d2c", alpha = 0.8, size = 0.9) +
  
  theme_minimal(base_size = 14) +
  theme(legend.position = "none") +
  labs(
    title = expression("LDA modelling assumption"),
    x = expression(X[1]),
    y = expression(X[2])
  )

  qda_plot + lda_plot

Recap

QDA parameters

The learnable parameters \(\pi_k, \Sigma, \mu_k\) are estimated from the data:

  • \(\hat{\pi}_k = \frac{n_k}{n} = \frac{1}{n} \sum_{i = 1}^n \mathbb{1}\{Y_i = k\}\)
  • \(\hat\mu_k = \frac{1}{n_k}\sum_{i = 1}^n \mathbb{1}\{Y_i = k\} X_i\)
  • \(\hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i = 1}^n \mathbb{1}\{Y_i = k\} (X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)^\top\)

Recap

QDA discriminant function

\[ \delta_k(x)^{\text{QDA}} = \log(\hat \pi_k \hat f_k) + C = \log(\hat \pi_k) -\frac{1}{2} \log(\det\hat \Sigma_k) -\frac{1}{2}(x-\hat \mu_k)^\top\hat\Sigma_k^{-1}(x-\hat \mu_k) \,. \]

QDA classifier

\[ \psi(x)^{\text{QDA}} = \underset{k}{\arg \min} \, \delta_k^{\text{QDA}}(x) \]

Question 1

Consider the QDA classifier for \(p=1\).

  1. Show that the discriminator can be written as follows:

\[ \delta^{\mathrm{QDA}}_k(x) = \log(\hat{\pi}_k) - \frac{1}{2\hat{\sigma}_k^2}(x - \hat{\mu}_k)^2 - \log(\hat{\sigma}_k). \]

Show

  • particular form of \(\delta^{\mathrm{QDA}}_k(x)\) for univariate features

Know

  • \[ \hat f_k(x) = \frac{1}{\sqrt{2 \pi \hat \sigma_k^2}} \exp\left\{- \frac{(x - \hat\mu_k)^2}{2 \hat \sigma_k^2} \right\} \]
  • \(\delta^{\mathrm{QDA}}_k(x) = \log(\hat \pi \hat f_k(x)) + C\)

Question 1.a

\[ \begin{aligned} \log(\hat \pi \hat f_k(x)) &= \log(\hat \pi) + \log(\hat f_k(x)) \\ &= \log(\hat \pi) - \frac{1}{2} \log(2 \pi) - \frac{1}{2} \log(\hat \sigma_k) - \frac{1}{2 \hat{\sigma_k}^2} (x - \hat \mu_k)^2 \\ &= \log(\hat \pi) - \frac{1}{2} \log(\hat \sigma_k) - \frac{1}{2 \hat{\sigma_k}^2} (x - \hat \mu_k)^2 + C \end{aligned} \]

Question 1.b

Show that \(\delta^{\mathrm{QDA}}_k(x)\) depends on \(\hat{\sigma}_k\) such that it is maximum when \(\hat{\sigma}_k = |x - \hat{\mu}_k|\) and that it is decreasing with \(\hat{\sigma}_k\) as its value is further away from \(|x - \hat{\mu}_k|\). Give an intuitive explanation for this dependence.

Show

  • \(\delta^{\mathrm{QDA}}_k(x)\), as a function of \(\hat \sigma\), is maximised at \(| x - \hat{\mu}_k|\)

Know

  • \[ \delta_k(\hat \sigma_k) = C - \frac{1}{2\hat{\sigma}_k^2}(x - \hat{\mu}_k)^2 - \log(\hat{\sigma}_k) \]

Question 1.b

Code
sigmas <- seq(from = 0.1, to = 4, length.out = 1000) 

cent <- 2 

delta_k <- function(sigma){
    -log(sigma) - 0.5 * cent^2 / sigma^2
}

plot(sigmas, delta_k(sigmas), type = "l", col = "blue", lwd = 2, main = expression(delta[k](sigma))) 
abline(v= cent, lty = "dashed", col = "grey", lwd = 2) 

\(\delta_k(\hat \sigma_k)\) is continuous function in \(\hat{\sigma}_k\)

  • Differentiate wrt \(\hat \sigma_k\)
  • Set equal to \(0\) and solve for \(\hat{\sigma}_k\)

Question 1.b

\[ \begin{aligned} \frac{\partial}{\partial \hat \sigma_k} \delta_k(\hat \sigma) &= \frac{\partial}{\partial \hat \sigma_k} \left\{C - \frac{1}{2\hat{\sigma}_k^2}(x - \hat{\mu}_k)^2 - \log(\hat{\sigma}_k) \right\} \\ &= - \frac{(x - \hat \mu_k)^2}{2} \left( \frac{-2}{\hat \sigma^3_k} \right) - \frac{1}{\hat \sigma_k} \\ &\overset{!}{=} 0 \\ \iff \frac{(x - \hat \mu_k)^2}{\hat \sigma^3_k} &= \frac{1}{\hat \sigma_k} \\ \iff \frac{(x - \hat \mu_k)^2}{\hat \sigma^2_k} &= 1 \\ \iff \hat \sigma^2_k &= (x - \hat \mu_k)^2 \\ \end{aligned} \]

Since \(\hat \sigma > 0\), this holds for \(| x - \hat \mu_k |\).

Question 1.b

Check for maximum

Second derivative test:

If \(f\) s twice differentiable with critical point \(x^*\) such that \[ \frac{\partial^2}{\partial x^2} f(x) \bigg|_{x = x^*} < 0 \,, \] then \(x^*\) is a local maximum.

\[ \frac{\partial^2}{\partial \hat \sigma_k^2} \delta_k(\hat \sigma_k) = - 3 \frac{(x - \hat \mu_k)^2}{\hat \sigma_k^4} + \frac{1}{\hat \sigma_k^2} \]

Question 1.b

Check for maximum

\[ \begin{aligned} \frac{\partial^2}{\partial \hat \sigma_k^2} \delta_k(\hat \sigma_k) \bigg|_{\hat \sigma_k = | x - \hat \mu_k|} &= - 3 \frac{(x - \hat \mu_k)^2}{| x - \hat \mu_k|^4} + \frac{1}{| x - \hat \mu_k|^2} \\ &= -3\frac{1}{| x - \hat \mu_k|^2} + \frac{1}{| x - \hat \mu_k|^2} \\ &= -2\frac{1}{| x - \hat \mu_k|^2} \\ & < 0 \,. \end{aligned} \]

Question 1.b

Global maximum

\[ \begin{aligned} \frac{\partial}{\partial \hat \sigma_k} \delta_k(\hat \sigma) &= \frac{(x - \hat \mu_k)^2 - \hat \sigma_k^2}{\hat \sigma^3_k} \\ \end{aligned} \]

  • If \(\hat{\sigma}_k > | x - \hat \mu_k |\): Numerator \(< 0\), \(\delta_k(\hat{\sigma_k})\) decreases beyond \(| x - \hat \mu_k |\)

  • If \(\hat{\sigma}_k < | x - \hat \mu_k |\): Numerator \(> 0\), \(\delta_k(\hat{\sigma_k})\) increases beyond \(| x - \hat \mu_k |\)

Question 1.b

Interpretation

  1. \(|x - \mu_k|\) is much larger than \(\hat{\sigma}_k\): Conditional density \(\hat f_k\) of the feature is concentrated around its mean value and hence it is less likely that \(x\) is a sample from \(f_k\)

  2. \(|x - \mu_k|\) is much smaller than \(\hat{\sigma}_k\): \(\hat f_k\) has a wide spread, making it less likely that \(x\) is a sample from \(f_k\).

  3. Alternative: Conditonal on \(Y =k\), \(|x-\mu_k|\) is a half-normal random variable with mean \(\sigma_k \sqrt{2 / \pi} \approx 0.79 \sigma_k\) and variance \(\sigma (1 - 2 / \pi) \approx .036 \sigma_k\). Hence, we expect \(|x - \mu_k|\) to concentrate around \(\sigma_k\).

Question 1.b

Interpretation

Code
library(ggplot2)
library(patchwork)

set.seed(42)

mu <- 0
sigma_small <- 0.5
sigma_large <- 10

x_close <- 0.4
x_far   <- 3.0

n <- 800
xs <- seq(-6, 6, length.out = 2000)

df_small <- data.frame(x = xs, y = dnorm(xs, mu, sigma_small))
df_large <- data.frame(x = xs, y = dnorm(xs, mu, sigma_large))

samples_small <- data.frame(x = rnorm(n, mu, sigma_small))
samples_large <- data.frame(x = rnorm(n, mu, sigma_large))

Eabs_small <- sigma_small * sqrt(2/pi)
Eabs_large <- sigma_large * sqrt(2/pi)

y_close_small <- dnorm(x_close, mu, sigma_small)
y_close_large <- dnorm(x_close, mu, sigma_large)
y_far_small   <- dnorm(x_far,   mu, sigma_small)
y_far_large   <- dnorm(x_far,   mu, sigma_large)

ylim_shared <- c(0, max(df_small$y) * 1.1)

p_small <-
  ggplot() +
  annotate("rect",
           xmin = -Eabs_small, xmax = Eabs_small,
           ymin = 0, ymax = Inf, fill = "#08519c", alpha = 0.08) +
  geom_area(data = df_small, aes(x = x, y = y), fill = "#c6dbef", alpha = 0.95) +
  geom_line(data = df_small, aes(x = x, y = y), color = "#08519c", linewidth = 1) +
  geom_rug(data = samples_small, aes(x = x), sides = "b", color = "#08519c", alpha = 0.35) +
  geom_vline(xintercept = mu, linetype = "dashed", linewidth = 0.6) +
  geom_segment(aes(x = x_far, xend = x_far, y = 0, yend = y_far_small), color = "#de2d26", linewidth = 0.7) +
  geom_point(aes(x = x_far, y = y_far_small), color = "#de2d26", size = 3) +
  coord_cartesian(xlim = c(-6, 6), ylim = ylim_shared) +
  labs(title = expression("Small" ~ sigma[k]), x = "x", y = "density") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")

p_large <-
  ggplot() +
  annotate("rect",
           xmin = -Eabs_large, xmax = Eabs_large,
           ymin = 0, ymax = Inf, fill = "#006d2c", alpha = 0.08) +
  geom_area(data = df_large, aes(x = x, y = y), fill = "#c7e9c0", alpha = 0.95) +
  geom_line(data = df_large, aes(x = x, y = y), color = "#006d2c", linewidth = 1) +
  geom_rug(data = samples_large, aes(x = x), sides = "b", color = "#006d2c", alpha = 0.35) +
  geom_vline(xintercept = mu, linetype = "dashed", linewidth = 0.6) +
  geom_segment(aes(x = x_close, xend = x_close, y = 0, yend = y_close_large), color = "#de2d26", linewidth = 0.7) +
  geom_point(aes(x = x_close, y = y_close_large), color = "#de2d26", size = 3) +
  coord_cartesian(xlim = c(-6, 6), ylim = ylim_shared) +
  labs(title = expression("Large" ~ sigma[k]), x = "x", y = "density") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")


(p_small | p_large) +
  plot_annotation(title = "QDA in 1D")

Question 1

Learnings

  • Show properties of familiar classifiers
  • Interpretation of QDA behaviour (vs LDA)

Check your understanding

When might QDA perform worse than LDA, even though QDA is more flexible?

Answer
  • If the Bayes decision boundary is approximately linear
  • When the sample size is small or the number of features is large relative to the sample size. QDA estimates separate covariance matrices, leading to high variance and overfitting.

How may regularization or shrinkage of covariance matrices help in discriminant analysis?

Answer It reduces the variance of covariance estimates by shrinking them toward a simpler structure (e.g., the identity or a common matrix), improving stability and generalization, especially in high-dimensional or small-sample settings.

How do unequal class priors affect the LDA decision boundary?

Answer They shift the boundary toward the less frequent class, effectively making misclassification of rare classes less likely.

How can LDA/QDA be modified to handle heteroscedastic noise or outliers?

Answer Robust covariance estimators, shrinkage toward diagonal/identity, heavy-tailed models (e.g., t-based DA),

Question 2

Suppose that the training data points \[ (x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{1, \ldots, K\} \] are i.i.d. samples with \[ \log \left( \frac{\Pr[Y_i = k \mid X_i = x]}{\Pr[Y_i = K \mid X_i = x]} \right) = \beta_k^\top x \quad \text{for } i = 1, \ldots, n \text{ and } k = 1, \ldots, K, \] where \(\beta = (\beta_1, \ldots, \beta_K)\) is a parameter in \(\mathbb{R}^{p \times K}\).

  1. Express \(\Pr(Y = k \mid X = x)\) in terms of \(\beta, x\).

Recap

Multiclass logistic classification

\[ \log \left( \frac{\Pr[Y_i = k \mid X_i = x]}{\Pr[Y_i = K \mid X_i = x]} \right) = \beta_k^\top x \quad \text{for } i = 1, \ldots, n \text{ and } k = 1, \ldots, K, \] where \(\beta = (\beta_1, \ldots, \beta_K)\) is a parameter in \(\mathbb{R}^{p \times K}\).

  • Aim to generalise the idea of binary logistic classification: \[ \log \left( \frac{\Pr[Y_i = 1 \mid X_i = x]}{\Pr[Y_i = 0 \mid X_i = x]} \right) = \beta^\top x \] to more than two classes

Recap

Why do we need a reference class?

The last probability \(\Pr(Y = K \mid X = x)\) is redundant in that \[ \Pr(Y = K \mid X = x) = 1 - \sum_{i = 1}^{K-1}\Pr(Y = i \mid X = x) \]

Recap

Identification

What if we just model without reference class, i.e.  \[ \Pr(Y = k \mid X = x) = C \exp\{\beta_k^\top x\}, \quad k = 1, \ldots, K \,. \]

Normalisation requires that

\[ \begin{aligned} 1 &= \sum_{i = 1}^K \Pr(Y = i \mid X = x) \\ &= C \sum_{i = 1}^K\exp\{\beta_i^\top x\} \\ \iff C &= \frac{1}{\sum_{i = 1}^K\exp\{\beta_i^\top x\}} \,, \end{aligned} \]

Recap

Identification

\[ \Pr(Y = k \mid X = x) = \frac{\exp\{\beta_k^\top x\}}{\sum_{i = 1}^K\exp\{\beta_i^\top x\}} \]

Recap

Identification

Consider, for any \(\beta_1, \ldots, \beta_K\), consider the vectors \(\beta_1 + v, \ldots, \beta_K + v\) for some \(v\).

\[ \begin{aligned} \Pr(Y = k \mid X = x) &= \frac{\exp\{(\beta_k + v)^\top x\}}{\sum_{i = 1}^K\exp\{(\beta_i + v)^\top x\}} \\ &= \frac{\exp\{v^\top x + \beta_k^\top x\}}{\sum_{i = 1}^K\exp\{v^\top x + \beta_k^\top x\}} \\ &= \frac{\exp\{v^\top x\}\exp\{\beta_k^\top x\}}{\sum_{i = 1}^K\exp\{v^\top x\}\exp\{\beta_k^\top x\}} \\ &= \frac{\exp\{v^\top x\}\exp\{\beta_k^\top x\}}{\exp\{v^\top x\}\sum_{i = 1}^K\exp\{\beta_k^\top x\}} \\ &= \frac{\exp\{\beta_k^\top x\}}{\sum_{i = 1}^K\exp\{\beta_k^\top x\}} \\ \end{aligned} \]

Recap

Identification

The parameter is not uniquely identifed.

If we conduct maximum likelihood estimation, the maximum is not unique.

Numerical optimization will wander or fail to converge properly because the gradient is zero in those directions.

If we use redundancy of one probability class and set \(\beta_K = 0\), we break this issue

Question 2.a

Show

  • \(\Pr(Y = k \mid X = x)\) can be expressed solely in terms of \(\beta\), \(x\)

Know

  • \[ \log \left( \frac{\Pr[Y_i = k \mid X_i = x]}{\Pr[Y_i = K \mid X_i = x]} \right) = \beta_k^\top x \]
  • \(\beta_K = 0\)
  • \(\sum_{k = 1}^K \Pr(Y = k \mid X = x) = 1\)

Question 2.a

\[ \begin{aligned} \log \left( \frac{\Pr[Y_i = k \mid X_i = x]}{\Pr[Y_i = K \mid X_i = x]} \right) &= \beta_k^\top x \\ \iff \Pr[Y_i = k \mid X_i = x] &= \exp \{\beta_k^\top x\} \Pr[Y_i = K \mid X_i = x] \end{aligned} \] And using the summability condition \[ \begin{aligned} 1 &= \sum_{i= 1}^K \Pr[Y_i = k \mid X_i = x] \\ &= \Pr[Y_i = K \mid X_i = x] \left(1 + \sum_{i= 1}^{K-1} \exp\{\beta_i^\top x\}\right) \\ \iff \Pr[Y_i = K \mid X_i = x] &= \frac{1}{1 + \sum_{i= 1}^{K-1} \exp\{\beta_i^\top x\}} \end{aligned} \]

Question 2.a

Therefore \[ \Pr[Y_i = k \mid X_i = x] = \exp \{\beta_k^\top x\} \Pr[Y_i = K \mid X_i = x] = \frac{\exp\{\beta_k^\top x\}}{1 + \sum_{i= 1}^{K-1} \exp\{\beta_i^\top x\} } \]

Question 2.b

  1. Show that the negative log-likelihood function is equal to

\[ \ell(\beta_1, \ldots, \beta_K) = -\sum_{i=1}^n \log \left( \frac{e^{\beta_{y_i}^\top x_i}}{ e^{\beta_1^\top x_i} + \cdots + e^{\beta_K^\top x_i}} \right). \]

Recap

Maximum likelihood paradigm

  • \(Z_1, \ldots, Z_n \sim g\), i.i.d., \(g\) unknown
  • Approximate \(g\) with a family of parametric distributions \(\mathcal{F} = \{f(z;\theta): \theta \in \Theta\}\)

If the data were generated by \(f(x;\theta)\), then the likelihood of observing a particular sample is

\[ L(\theta; z_1, \ldots, z_n) = \prod_{i = 1}^n f(z_i; \theta) \,. \]

Maximum likelihood estimator

Choose the parameter \(\hat \theta \in \Theta\) that makes the observation of our data most likely, i.e. 

\[ \hat \theta = \underset{\theta \in \Theta}{\arg \max} \, L(\theta; z_1, \ldots, z_n) = \underset{\theta \in \Theta}{\arg \min} \, -\log(L(\theta; z_1, \ldots, z_n)) \]

Question 2.b

Show

  • \[ \ell(\beta) = -\sum_{i=1}^n \log \left( \frac{e^{\beta_{y_i}^\top x_i}}{ \sum_{k = 1}^K \exp\{\beta_k^\top x_i\}} \right). \]

Know

  • \(\Pr(Y = k \mid X = x) =\frac{ \exp\{\beta_k^\top x\}}{1 + \sum_{k = 1}^{K-1} \exp\{\beta_k^\top x\} }\)

  • \(\ell(\beta) = -\sum_{i = 1}^n \sum_{k = 1}^K \mathbb{1}\{Y_i = k\} \log \left( \Pr(Y = k \mid X = x) \right)\)

Question 2.b

\[ \begin{aligned} \ell(\beta) &= -\sum_{i = 1}^n \sum_{k = 1}^K \mathbb{1}\{Y_i = k\} \log \left( \Pr(Y = k \mid X = x) \right) \\ &= -\sum_{i = 1}^n \sum_{k = 1}^K \mathbb{1}\{Y_i = k\} \log \left(\frac{ \exp\{\beta_k^\top x\}}{1 + \sum_{k = 1}^{K-1} \exp\{\beta_k^\top x\} } \right)\\ &= -\sum_{i = 1}^n \log \left(\frac{ \exp\{\beta_{Y_i}^\top x\}}{1 + \sum_{k = 1}^{K-1} \exp\{\beta_k^\top x\} } \right)\\ \end{aligned} \]

Question 2.c

Show that the gradient of \(\ell\) with respect to \(\beta_k\) satisfies:

\[ \frac{\partial}{\partial \beta_k} \ell(\beta_1,\ldots, \beta_k) = \sum_{i = 1}^n \left( \frac{ \exp\{\beta_k^\top x\}}{1 + \sum_{k = 1}^{K-1} \exp\{\beta_k^\top x\}} - \mathbb{1}\{Y_i = k \} \right) \]

Question 2.c

Gradient descent

If we wish to minimise \(f(x)\) over \(x\), gradient descent iteratively picks the direction of steepest descent and updated current iterate.

\[ x^{(t)} = x^{(t-1)} - \alpha_t \nabla f(x^{(t-1)}) \,, \] where the step size is the learning rate \(\alpha_t\).

Sometimes

\[ \nabla f(x^{(t-1)}) = \sum_{i = 1}^n \nabla f(x^{(t-1)}; z_i) \,, \] is just a sum of gradients evaluated at data points. If computation of gradient is costly, then we may evluate at random subsample at each step.

Question 2.c

Write

\[ \ell(\beta_1,\ldots,\beta_K) = -\sum_{i=1}^n \left( \beta_{y_i}^\top x_i - \log\!\left( \sum_{k'=1}^K e^{\beta_{k'}^\top x_i} \right) \right). \]

\[ \frac{\partial}{\partial \beta_{k,j}} \ell(\beta_1,\ldots,\beta_K) = -\sum_{i=1}^n \left( x_{i,j}\,\mathbb{1}(y_i = k) - \frac{e^{\beta_k^\top x_i} x_{i,j}} {\sum_{k'=1}^K e^{\beta_{k'}^\top x_i}} \right). \]

\[ \frac{\partial}{\partial \beta_{k,j}} \ell(\beta_1,\ldots,\beta_K) = \sum_{i=1}^n \left( \frac{e^{\beta_k^\top x_i}} {\sum_{k'=1}^K e^{\beta_{k'}^\top x_i}} - \mathbb{1}(y_i = k) \right)x_{i,j}. \]

\[ \frac{\partial \ell(\beta_1,\ldots,\beta_K)}{\partial \beta_k} = \sum_{i=1}^n \left( \frac{e^{\beta_k^\top x_i}} {e^{\beta_1^\top x_i} + \cdots + e^{\beta_K^\top x_i}} - \mathbb{1}(y_i = k) \right) x_i. \]

Learnings

  • Understand the modelling assumptions of multiclass logistic regression
  • Define maximum likelihood problem based on distribution

Check your understanding

Why do we need a reference class in multiclass logistic regression?

Answer The softmax probabilities are unchanged if you add the same vector v to all class parameters \(\beta_k\). Without a constraint, parameters aren’t unique. Fixing a reference makes the model identifiable.

What does \(\beta_kx\) represent, and how do you interpret the intercepts?

Answer It is the log-odds of class k versus the reference class K. Intercepts capture baseline log-odds when \(x = 0\); with centered features, they primarily encode class priors.

What is the shape of decision boundaries in multiclass logistic regression?

Answer Pairwise boundaries between classes \(k\) and \(j\) are linear hyperplanes given by \((\beta_k - \beta_k)^\top x = 0\). Regions are separated by these hyperplanes; there are no quadratic terms.

What happens if classes are linearly separable? How is this handled in practice?

Answer The MLE can diverge (parameters grow without bound) leading to non-convergence and overconfident probabilities. Use regularization (L2/L1).