Department of Statistics London School of Economics and Political Science (LSE)
Published
07 October 2025
Preamble
This note visualises the “mechanics” behind linear discriminant analysis (LDA), logistic classification and the receiver operating characteristic (ROC) that we covered in ST443, lecture 2. Its primary goal is to enable you to deepen your understanding of these concepts in an interactive way. Although not incredibly involved mathematically, some parts of visualising LDA draw from geometric ideas that I do not expect you to know in detail. I have marked these parts with a 🏔️-symbol. They are there in case you want to dig a bit deeper or, otherwise, simply for your viewing pleasure.
Generally, understand this note as a helper to better understand what is covered in the lecture materials and the classes, not as additional content.
The R code used to generate the visuals is also available, if that interests you. Enjoy!
LDA
Setup
Recall that in LDA we assume that
\[
Y \sim \pi = (\pi_1, \ldots, \pi_K), \quad X \mid_{Y = k} \sim \mathrm{N}(\mu_k, \Sigma) \,,
\] so that \(f_k\), the density of \(X\) given \(Y = k\) is then \[
f_k(x) = \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k)} \,.
\]
Hence, for the \(K\)-class classification task, the Bayes-optimal LDA discriminant function under \(0\)-\(1\) loss is
In practice, we do not know \(\pi_k, \mu_k\), \((k = 1, \ldots, K)\), \(\Sigma\) and we must estimate these quantities from our training data. Today, we are only exploring the geometry of LDA and for this it is immaterial if we are looking at estimated or unknown quantities – the mechanics remain the same.
Linear decision boundary
In the lecture, I told you that the decision boundary of our LDA classifier is linear (you will also explore this in HW1), but this may not be immediately obvious, after all, the discriminant function Equation 1 is quadratic in \(x\). Before looking at the explanation below, try and think for yourself why this might be true (Hint: Expand the quadratic in Equation 1 and see what is really relevant for classification).
for \(c_k = \log \pi_k - \mu_k^\top \Sigma^{-1} \mu_k / 2\), \(b_k = \Sigma^{-1} \mu_k\) and \(q(x) = - x^\top \Sigma^{-1} x / 2\). Now think about what the decision boundary is: It is the line at which we switch from a label \(k\) to \(k'\) because \(\delta_k(x) < \delta_{k'}(x)\). Hence, the decision boundary is precisely the line \(\delta_{k}(x) - \delta_{k'}(x) = 0\). \[
\begin{aligned}
\delta_{k}(x) - \delta_{k'}(x) &= (c_k - c_{k'}) + x^\top (b_k - b_{k'}) + (q(x) - q(x)) \\
&= (c_k - c_{k'}) + x^\top (b_k - b_{k'})
\end{aligned}\,,
\] which is a linear function in \(x\) with intercept \(c_k - c_{k'}\) and slope \(b_k - b_{k'}\)!
If \(x\) has dimension three or less, we can visualise this. Let us do so using a made up dataset about cat adoption.
R code
# Load necessary librarieslibrary(dplyr)library(knitr)library(kableExtra) # Set seed for reproducibilityset.seed(123)# Define the number of observationsn <-70# Generate random data for each covariate# Age: Assuming cat age follows an absolute normal distributionage <-abs(rnorm(n, sd =4))# Color: Ginger, Black, Browncolor <-c("Ginger", "Black", "Brown")[rbinom(n, 2, age/max(age)) +1]# Multiple Colors: binary with conditional probabilitiesmultiple_colors <-sapply(color, function(col) {if (col =="Ginger") {return(c("No", "Yes")[rbinom(1, 1, 0.6) +1]) } elseif (col =="Black") {return(c("No", "Yes")[rbinom(1, 1, 0.05) +1]) } else {return(c("No", "Yes")[rbinom(1, 1, 0.25) +1]) }})# Generate previous_owners correlated with ageprevious_owners <-sapply(age, function(a) { prob <-c(0.05, 0.7, 0.2, 0.05) * (a /max(age)) prob <- prob /sum(prob) return(sample(0:3, 1, prob = prob))})# Neutered: Binary with p = 0.9neutered <-c("No","Yes")[rbinom(n, 1, 0.9) +1]# Gender: Binary with p = 0.5gender <-c("Male", "Female")[rbinom(n, 1, ifelse(multiple_colors =="Yes", 0.1, 0.5)) +1]# Playfulness: range between 1 and 10, negatively correlated with age and col == black, positively with neutered and gender playfulness <--0.2* age +rnorm(n, sd =2) +0.25* (neutered =="Yes") -0.1* (2* (gender =="Male") -1) -5* (color =="Black")playfulness <-10* ((playfulness -min(playfulness)) / (max(playfulness) -min(playfulness)))# Create a data framecat_adoption_data <-data.frame(Age =round(age, 1),MainColor =as.factor(color),MultipleColors =as.factor(multiple_colors),PreviousOwners = previous_owners,Neutered =as.factor(neutered),Gender =as.factor(gender),Playfulness =round(playfulness, 1))# Expand categorical variables using model.matrixmodel_data <-model.matrix(~ Age + MainColor + MultipleColors + PreviousOwners + Neutered + Gender + Playfulness, data = cat_adoption_data)# Define a true effect vector (for simplicity, assume some arbitrary effects)true_effect_base <-c(Age =-0.5, MainColorBrown = .3, MainColorGinger =-.5, MultipleColorsYes =0.25, PreviousOwners =0, NeuteredYes =0, GenderMale =0, Playfulness =0.25)true_effects <-c(-c(Intercept =mean(model_data[,-1] %*% true_effect_base)), true_effect_base) # Simulate the probability of adoption using the logistic functionprob_adoption <-plogis(model_data %*% true_effects)# Simulate the binary adoption outcomeadopted <-c("No", "Yes")[rbinom(n, 1, prob_adoption) +1]# Add the adoption outcome to the data framecat_adoption_data <-cbind(cat_adoption_data, Adopted = adopted)# Display the full dataset with scrollingkable(cat_adoption_data) %>%kable_material(c("striped", "hover"), font_size =18) %>%scroll_box(height ="300px")
Age
MainColor
MultipleColors
PreviousOwners
Neutered
Gender
Playfulness
Adopted
2.2
Ginger
Yes
1
Yes
Male
7.7
Yes
0.9
Ginger
Yes
2
Yes
Male
4.6
No
6.2
Brown
No
1
No
Female
6.5
Yes
0.3
Ginger
Yes
1
Yes
Male
10.0
Yes
0.5
Ginger
No
1
Yes
Male
4.4
No
6.9
Brown
No
1
Yes
Female
7.0
No
1.8
Ginger
No
1
No
Female
5.9
Yes
5.1
Brown
No
1
Yes
Male
4.1
Yes
2.7
Ginger
Yes
1
Yes
Male
5.1
Yes
1.8
Black
No
1
Yes
Male
3.1
No
4.9
Ginger
No
1
Yes
Female
5.6
No
1.4
Ginger
Yes
1
Yes
Female
4.7
Yes
1.6
Ginger
No
1
Yes
Female
6.3
No
0.4
Ginger
No
1
Yes
Male
5.3
No
2.2
Ginger
No
2
Yes
Female
7.5
Yes
7.1
Brown
No
2
Yes
Male
6.9
No
2.0
Ginger
Yes
2
Yes
Male
6.0
No
7.9
Brown
No
0
Yes
Male
3.2
No
2.8
Ginger
Yes
1
Yes
Male
6.9
Yes
1.9
Ginger
No
1
Yes
Male
3.8
No
4.3
Black
No
2
Yes
Male
2.0
No
0.9
Ginger
No
2
Yes
Male
7.2
No
4.1
Black
No
1
Yes
Female
4.1
Yes
2.9
Ginger
Yes
2
Yes
Male
7.2
Yes
2.5
Ginger
Yes
2
Yes
Male
7.7
Yes
6.7
Brown
No
1
Yes
Female
5.1
No
3.4
Black
No
1
No
Female
1.0
Yes
0.6
Ginger
Yes
1
Yes
Male
5.8
No
4.6
Black
No
1
Yes
Male
5.1
No
5.0
Brown
No
1
Yes
Male
7.4
No
1.7
Ginger
No
1
No
Female
6.0
Yes
1.2
Ginger
Yes
1
Yes
Male
5.4
No
3.6
Brown
No
1
Yes
Male
7.9
No
3.5
Black
No
2
Yes
Male
3.1
No
3.3
Black
No
1
Yes
Male
2.9
Yes
2.8
Black
No
1
Yes
Male
0.0
No
2.2
Ginger
No
2
No
Female
8.4
Yes
0.2
Ginger
Yes
2
Yes
Male
5.5
No
1.2
Black
Yes
1
Yes
Male
3.7
Yes
1.5
Ginger
No
1
Yes
Male
7.7
Yes
2.8
Black
No
2
Yes
Male
4.5
No
0.8
Ginger
Yes
2
No
Female
8.3
Yes
5.1
Black
No
1
Yes
Male
2.3
No
8.7
Brown
No
1
Yes
Male
6.3
No
4.8
Brown
No
2
Yes
Male
3.2
No
4.5
Black
No
1
Yes
Female
2.8
No
1.6
Ginger
No
1
Yes
Male
7.7
Yes
1.9
Ginger
Yes
0
Yes
Male
6.0
No
3.1
Black
No
1
Yes
Female
1.6
No
0.3
Ginger
Yes
1
Yes
Male
7.2
Yes
1.0
Black
No
3
No
Female
3.4
Yes
0.1
Ginger
No
1
No
Male
4.4
No
0.2
Ginger
No
1
Yes
Female
5.2
Yes
5.5
Ginger
No
2
No
Male
7.4
Yes
0.9
Ginger
No
2
Yes
Female
6.1
No
6.1
Ginger
Yes
1
Yes
Male
3.3
Yes
6.2
Brown
No
1
Yes
Male
6.5
No
2.3
Ginger
Yes
1
Yes
Male
5.7
Yes
0.5
Ginger
Yes
1
Yes
Male
7.8
Yes
0.9
Ginger
Yes
1
Yes
Male
7.8
Yes
1.5
Ginger
Yes
0
Yes
Male
6.1
Yes
2.0
Ginger
No
1
Yes
Male
4.6
No
1.3
Black
No
1
Yes
Female
6.6
Yes
4.1
Black
No
1
Yes
Male
3.8
No
4.3
Black
No
1
Yes
Female
3.8
No
1.2
Ginger
No
1
Yes
Female
4.3
Yes
1.8
Black
No
1
Yes
Male
4.6
Yes
0.2
Ginger
No
0
No
Male
6.1
Yes
3.7
Ginger
Yes
1
Yes
Male
5.7
No
8.2
Brown
No
1
Yes
Male
7.0
No
Let us pick two features, Playfulness and Age for our classifier. To build our LDA classifier, we need to estimate the adoption probabilities \(\pi_1\), \(\pi_0 = 1 - \pi_1\), the feature means \(\mu_{1}\), \(\mu_{0}\) and the variance-covariance matrix \(\Sigma\).
Below, I construct the linear decision boundary and the classifications by hand, in class we will see how to do this with readily available R packages. This is a good opportunity to check your understanding of LDA: Have a look at the code below and try to understand what we are doing.
The visualisation of the linear decision boundary works nicely, as long as the feature dimension is less than three. However, even if the covariate dimension is arbitrarily large, LDA has a very elegant geometric representation that allows us to reduce the effective dimension of the space, where we draw our classification boundaries to a \(K-1\) dimensional subspace. If \(K = 3\), this means we can still visualise the data in a \(2\)-dimensional plot. If \(K > 3\), we can extend this approach and reduce the decision space to some \(L < K - 1\) dimensional subspace that is optimal in some sense.
Below, we will explore the case \(K = 3\) to give you an intuition (and the mathematics) behind the dimension reducing properties of LDA. This is purely to give you another level of understanding (and appreciation) for LDA, and you are not expected to replicate this in any way.
Original \(X\)-space
R code
library(plotly)library(MASS)# Setupset.seed(7)## colors group_cols <-c("#1f77b4", "#2ca02c", "#d62728") centroid_sym <-"diamond"point_size <-3centroid_size<-8grid_grey <-"#7f7f7f"boundary_grey<-"#333333"## Helper functions trans3 <-function(A, pts) as.matrix(pts[,1:3]) %*%t(A)make_grid_lines <-function(bounds, step =1) { xs <-seq(bounds[1,1], bounds[2,1], by = step) ys <-seq(bounds[1,2], bounds[2,2], by = step) zs <-seq(bounds[1,3], bounds[2,3], by = step) lines <-list(); g <-1Lfor (y0 in ys) for (z0 in zs) { lines[[g]] <-cbind(x = xs, y =rep(y0,length(xs)), z =rep(z0,length(xs)), gid = g); g <- g+1L }for (x0 in xs) for (z0 in zs) { lines[[g]] <-cbind(x =rep(x0,length(ys)), y = ys, z =rep(z0,length(ys)), gid = g); g <- g+1L }for (x0 in xs) for (y0 in ys) { lines[[g]] <-cbind(x =rep(x0,length(zs)), y =rep(y0,length(zs)), z = zs, gid = g); g <- g+1L }do.call(rbind, lines)}## collapse grid into single vectors with NA separators collapse_grid_to_vectors <-function(grid_mat) { outx <- outy <- outz <-numeric(0) gids <-unique(grid_mat[, "gid"])for (gid in gids) { seg <- grid_mat[grid_mat[, "gid"]==gid, , drop=FALSE] outx <-c(outx, seg[, "x"], NA) outy <-c(outy, seg[, "y"], NA) outz <-c(outz, seg[, "z"], NA) }list(x = outx, y = outy, z = outz)}## ellipsoid halves using parametric surfaces: x = mu + Sigma^{1/2} [cos u cos v, sin u cos v, sin v]^Tellipsoid_halves <-function(center, Sigma_sqrt, r =1.35, nu =64, nv =32) { u <-seq(0, 2*pi, length.out = nu) v_top <-seq(0, pi/2, length.out = nv) v_bot <-seq(-pi/2, 0, length.out = nv) cu <-cos(u); su <-sin(u) make_half <-function(v) { cv <-cos(v); sv <-sin(v) U1 <-outer(cu, cv) U2 <-outer(su, cv) U3 <-matrix(rep(sv, each =length(u)), nrow =length(u)) X <- center[1] + r*( Sigma_sqrt[1,1]*U1 + Sigma_sqrt[1,2]*U2 + Sigma_sqrt[1,3]*U3 ) Y <- center[2] + r*( Sigma_sqrt[2,1]*U1 + Sigma_sqrt[2,2]*U2 + Sigma_sqrt[2,3]*U3 ) Z <- center[3] + r*( Sigma_sqrt[3,1]*U1 + Sigma_sqrt[3,2]*U2 + Sigma_sqrt[3,3]*U3 )list(x = X, y = Y, z = Z) }list(top =make_half(v_top), bottom =make_half(v_bot))}## spheres in whitened spacesphere_surface <-function(center, r =1.35, n =46) { u <-seq(0, 2*pi, length.out = n) v <-seq(0, pi, length.out = n) uu <-matrix(rep(u, each=n), n, n) vv <-matrix(rep(v, times=n), n, n) x <- center[1] + r *cos(uu) *sin(vv) y <- center[2] + r *sin(uu) *sin(vv) z <- center[3] + r *cos(vv)list(x=x, y=y, z=z)}line_in_box <-function(w, c, xmin, xmax, ymin, ymax) { a <- w[1]; b <- w[2] pts <-matrix(NA_real_, nrow=0, ncol=2)if (abs(b) >1e-10) { y1 <- (c - a*xmin)/b; if (y1>=ymin & y1<=ymax) pts <-rbind(pts, c(xmin, y1)) y2 <- (c - a*xmax)/b; if (y2>=ymin & y2<=ymax) pts <-rbind(pts, c(xmax, y2)) }if (abs(a) >1e-10) { x1 <- (c - b*ymin)/a; if (x1>=xmin & x1<=xmax) pts <-rbind(pts, c(x1, ymin)) x2 <- (c - b*ymax)/a; if (x2>=xmin & x2<=xmax) pts <-rbind(pts, c(x2, ymax)) }if (nrow(pts) >=2) pts[1:2,] elseNULL}# decision planes in 3D z-space: (m_i - m_j)^T z = 0.5(||m_i||^2 - ||m_j||^2)plane_patch_mesh <-function(normal, offset, half_width =8, n =18) { nvec <-as.vector(normal); nvec <- nvec /sqrt(sum(nvec^2)) z0 <- (offset /sum(nvec*nvec)) * nvec a <-c(1,0,0); if (abs(sum(a*nvec)) >0.9) a <-c(0,1,0) e1 <- a -sum(a*nvec)*nvec; e1 <- e1 /sqrt(sum(e1^2)) e2 <-c(nvec[2]*e1[3] - nvec[3]*e1[2], nvec[3]*e1[1] - nvec[1]*e1[3], nvec[1]*e1[2] - nvec[2]*e1[1]) s <-seq(-half_width, half_width, length.out = n) t <-seq(-half_width, half_width, length.out = n) S <-matrix(rep(s, each=n), n, n) T <-matrix(rep(t, times=n), n, n) X <- z0[1] + S*e1[1] + T*e2[1] Y <- z0[2] + S*e1[2] + T*e2[2] Z <- z0[3] + S*e1[3] + T*e2[3]list(x=X, y=Y, z=Z)}
To set the scene, let us work with three classes (\(K = 3\)) and a three-dimensional feature space (\(p=3\)). We choose \(p=3\) purely so that we can visualise the data, the derivations hold for any \(p\). \[
x_{k, i} \sim \mathrm{N}(\mu_k, \Sigma)\,, \mu_k \in \Re^p\,, \quad (i = 1, \ldots, n_k; k = 1,\ldots, K) \,,
\] with the notational convention that \(x_{k,i}\) is the \(i\)th observation of group \(k\). Below is a plot of our data, classes are colour coded, and the class means (also called centroids) are indicated by diamond shapes.
R code
# data p <-3K <-3Sigma <-matrix(c(2.0, 0.9, 0.4,0.9, 1.5, 0.2,0.4, 0.2, 1.2), nrow =3, byrow =TRUE)e <-eigen(Sigma)Sigma_sqrt <- e$vectors %*%diag(sqrt(e$values)) %*%t(e$vectors)Sigma_inv_sqrt <- e$vectors %*%diag(1/sqrt(e$values)) %*%t(e$vectors)mus <-rbind(c(2.3, 0.2, 0.1),c(0.3, 2.2, 0.8),c(0.2, 0.2, 2.6))n_per <-50X <-rbind(mvrnorm(n_per, mu = mus[1,], Sigma = Sigma),mvrnorm(n_per, mu = mus[2,], Sigma = Sigma),mvrnorm(n_per, mu = mus[3,], Sigma = Sigma))y <-rep(1:K, each = n_per)fig1 <-plot_ly()for (k inseq_len(K)) { pts <- X[y == k, , drop =FALSE] fig1 <- fig1 %>%add_markers(x = pts[,1], y = pts[,2], z = pts[,3],name =paste("class", k),legendgroup =paste0("class", k), showlegend =TRUE,marker =list(size = point_size, color = group_cols[k]),hoverinfo ="skip",type ="scatter3d", mode ="markers", opacity =0.6 )}for (k inseq_len(K)) { fig1 <- fig1 %>%add_markers(x = mus[k,1], y = mus[k,2], z = mus[k,3],name =paste0("μ<sub>", k, "</sub>"),legendgroup =paste0("class", k),showlegend =TRUE,marker =list(size = centroid_size, symbol = centroid_sym, color = group_cols[k]),type ="scatter3d", mode ="markers" )}n_traces <-2* K idx_pts <-seq_len(K)idx_ctr <- K +seq_len(K)vis_all <-rep(TRUE, n_traces)class_buttons <-lapply(seq_len(K), function(k) { v <-rep(FALSE, n_traces) v[idx_pts[k]] <-TRUE v[idx_ctr[k]] <-TRUElist(method ="update",args =list(list(visible = v)),label =paste0("Class ", k) )})btn_show_all <-list(method ="update",args =list(list(visible = vis_all)),label ="Show all")fig1 <- fig1 %>%layout(title =list( text ="(1) Original X-space", x =0.5, xanchor ="center"),scene =list(xaxis =list(title =list(text ="x<sub>1</sub>")),yaxis =list(title =list(text ="x<sub>2</sub>")),zaxis =list(title =list(text ="x<sub>3</sub>")), aspectmode ="data" ),updatemenus =list(list(type ="buttons",buttons =c(list(btn_show_all), class_buttons ) ) ) )fig1
Now recall the discriminant function \(\delta_k(x)\): \[
\begin{aligned}
\delta_k(x) &= \log \pi_k - \frac{1}{2} (x - \mu_k)^\top \Sigma^{-1}(x - \mu_k) \\
&= \log \pi_k - \frac{1}{2} \| \Sigma^{-1/2} x - \Sigma^{-1/2} \mu_k \|_2^2
\end{aligned}\,,
\tag{2}\]
where \(\Sigma^{-1/2}\) is the \(p \times p\) matrix square root of \(\Sigma^{-1}\), i.e. \(\Sigma^{-1/2}\Sigma^{-1/2} = \Sigma^{-1}\) and \(\| a \|_2 = \sqrt{a^\top a}\) is the Euclidean norm (You can try to derive this yourself).
Note that the level set \[
\left\{x \in \Re^p: \| \Sigma^{-1/2} x - \Sigma^{-1/2} \mu_k \|_2^2 = c^2 \right\} \,,
\tag{3}\] defines an ellipsoid centred at the centroid \(\mu_k\) with radii \(c\sqrt{\lambda_i}\), where \(\lambda_i\) are the eigenvalues of \(\Sigma\). You might have heard of this property in the context of a multivariate normal distribution, where the equidensity contours are ellipsoids centered at the mean parameter.
Importantly, Equation 2 and Equation 3 tells us that, ingoring the class marginals \(\pi_k\), the level sets of the discriminant function are ellipsoids:
R code
rngX <-apply(X, 2, range); pad <-0.75boundsX <-rbind(rngX[1,]-pad, rngX[2,]+pad)grid_lines_X <-make_grid_lines(boundsX, step =1)grid_axisX_vecs <-collapse_grid_to_vectors(grid_lines_X)fig1b <-plot_ly() fig1b <- fig1b %>%add_trace(x = grid_axisX_vecs$x, y = grid_axisX_vecs$y, z = grid_axisX_vecs$z,type="scatter3d", mode="lines",line=list(width=1, color=grid_grey), hoverinfo="skip", opacity=0.10,name="X-grid")for (k inseq_len(K)) { pts <- X[y == k, , drop =FALSE] fig1b <- fig1b %>%add_markers(x = pts[,1], y = pts[,2], z = pts[,3],name =paste("class", k),legendgroup =paste0("class", k), showlegend =TRUE,marker =list(size = point_size, color = group_cols[k]),hoverinfo ="skip",type ="scatter3d", mode ="markers", opacity =0.6 )}for (k inseq_len(K)) { fig1b <- fig1b %>%add_markers(x = mus[k,1], y = mus[k,2], z = mus[k,3],name =paste0("μ<sub>", k, "</sub>"),legendgroup =paste0("class", k),showlegend =TRUE,marker =list(size = centroid_size, symbol = centroid_sym, color = group_cols[k]),type ="scatter3d", mode ="markers" )}for (k in1:K) { halves <-ellipsoid_halves(mus[k,], Sigma_sqrt, r =1.35, nu =64, nv =32) cs <-list(list(0, group_cols[k]), list(1, group_cols[k])) fig1b <- fig1b %>%add_surface(x = halves$top$x, y = halves$top$y, z = halves$top$z,colorscale = cs, showscale =FALSE, opacity =0.28,name =paste0("ellipsoid ",k," (top)")) %>%add_surface(x = halves$bottom$x, y = halves$bottom$y, z = halves$bottom$z,colorscale = cs, showscale =FALSE, opacity =0.28,name =paste0("ellipsoid ",k," (bottom)"))}n_traces <-4* K +1i <-1Lidx_grid <- ii <- i +1Lidx_pts <- i:(i + K -1L) i <- i + Kidx_ctr <- i:(i + K -1L)i <- i + Kidx_sur <- i:(i +2L*K -1L)i <- i +2L*Kidx_top <- idx_sur[seq(1, length(idx_sur), by =2L)]idx_bot <- idx_top +1Lvis_all <-rep(TRUE, n_traces)ellipsoid_buttons <-lapply(seq_len(K), function(k) { v <-rep(FALSE, n_traces) v[idx_grid] <-TRUE v[idx_top[k]] <-TRUE v[idx_bot[k]] <-TRUE v[idx_ctr[k]] <-TRUE v[idx_pts[k]] <-TRUElist(method ="restyle",args =list(list(visible = v)),label =paste0("Class ", k) )})btn_show_all <-list(method ="restyle",args =list(list(visible = vis_all)),label ="Show all")fig1b <- fig1b %>%layout(title =list( text ="(2) Class ellipsoids in original X-space", x =0.5, xanchor ="center"),scene =list(xaxis =list(title =list(text ="x<sub>1</sub>")),yaxis =list(title =list(text ="x<sub>2</sub>")),zaxis =list(title =list(text ="x<sub>3</sub>")),aspectmode ="data" ),updatemenus =list(list(type ="buttons",buttons =c(list(btn_show_all), ellipsoid_buttons ) ) ) )fig1b
Hence, if we ignore the marginals \(\pi_k\) (i.e. set them all to the same value; we will consider them at the end, but for now let us forget about them), we classify a point according to the class whose smallest ellipsoid contains contains \(x\).
\(\Sigma^{-1/2}\) transformed \(X\)-space
Now let us go back to Equation 3, but now for any \(x \in \Re^p\), let \(y = \Sigma^{-1/2}x\) and \(\tilde{\mu}_k = \Sigma^{-1/2} \mu_k\), both of which lie in \(\Re^p\) because \(\Sigma\) has full rank. The pre-multiplication by \(\Sigma^{-1/2}\) defines a linear transformation of our original \(X\)-space. A linear transformation is a map that fixes the origin and carries straight lines to straight lines. In \(\Re^2\) or \(\Re^3\), you can view any linear map as some combination of rotations/reflections, direction‑dependent stretches and shears. Check out this 3blue1brown video on matrix multiplication for a great visualisation.
Now note that the linear transformation \(T:x \mapsto \Sigma^{-1/2}x\) transforms the ellipsoids in \(X\)-space to spheres in \(\Sigma^{-1/2}X\)-space: \[
\begin{aligned}
\left\{x \in \Re^p: \| \Sigma^{-1/2} x - \Sigma^{-1/2} \mu_k \|_2^2 = c^2 \right\} = \left\{y \in \Re^p: \| y - \tilde{\mu}_k \|_2^2 = c^2\right\}
\end{aligned} \,.
\]
Let us visualise the transformation \(T(x) = \Sigma^{-1/2}x\):
R code
Z <-t(Sigma_inv_sqrt %*%t(X))M <-t(Sigma_inv_sqrt %*%t(mus))m_bar <-colMeans(M)centered <-sweep(M, 2, m_bar)Q <-qr.Q(qr(t(centered))) B <- Q[,1:2, drop=FALSE] P <- B %*%t(B) fig2 <-plot_ly()rngZ <-apply(Z, 2, range); pad <-0.75boundsZ <-rbind(rngZ[1,]-pad, rngZ[2,]+pad)grid_lines_Z <-make_grid_lines(boundsZ, step =1)grid_axisZ_vecs <-collapse_grid_to_vectors(grid_lines_Z)fig2 <- fig2 %>%add_trace(x = grid_axisZ_vecs$x, y = grid_axisZ_vecs$y, z = grid_axisZ_vecs$z,type="scatter3d", mode="lines",line=list(width=1, color=grid_grey), hoverinfo="skip", opacity=0.0,name="Y-grid")for (k in1:K) { pts <- Z[y==k, ] fig2 <- fig2 %>%add_markers(x = pts[,1], y = pts[,2], z = pts[,3],name =paste("class", k),legendgroup =paste0("class", k), showlegend =TRUE,marker =list(size = point_size, color = group_cols[k]),hoverinfo ="skip",type ="scatter3d", mode ="markers", opacity =0.6 )}for (k inseq_len(K)) { fig2 <- fig2 %>%add_markers(x = M[k,1], y = M[k,2], z = M[k,3],name =paste0("μ<sub>", k, "</sub>"),legendgroup =paste0("class", k),showlegend =TRUE,marker =list(size = centroid_size, symbol = centroid_sym, color = group_cols[k]),type ="scatter3d", mode ="markers" )}for (k in1:K) { s <-sphere_surface(M[k, ], r =1.35, n =46) cs <-list(list(0, group_cols[k]), list(1, group_cols[k])) fig2 <- fig2 %>%add_surface(x = s$x, y = s$y, z = s$z,colorscale = cs, showscale =FALSE, opacity =0.28,name =paste("sphere", k) )}V <- Sigma_inv_sqrtgrid_from_transformed_basis <-function(V, boundsY, step =c(1, 1, 1)) {stopifnot(nrow(boundsY) ==2, ncol(boundsY) ==3, length(step) ==3) cornersY <-as.matrix(expand.grid(boundsY[,1], boundsY[,2], boundsY[,3])) Xcorners <-t(apply(cornersY, 1, function(y) solve(V, y))) minsX <-apply(Xcorners, 2, min); maxsX <-apply(Xcorners, 2, max) boundsX <-rbind(floor(minsX), ceiling(maxsX)) xlev <-lapply(1:3, function(i) seq(boundsX[1, i], boundsX[2, i], by = step[i])) add_line <-function(P, Q, acc) { acc$x <-c(acc$x, P[1], Q[1], NA_real_) acc$y <-c(acc$y, P[2], Q[2], NA_real_) acc$z <-c(acc$z, P[3], Q[3], NA_real_) acc } acc <-list(x =numeric(0), y =numeric(0), z =numeric(0))for (x2 in xlev[[2]]) for (x3 in xlev[[3]]) { A <-c(boundsX[1,1], x2, x3); B <-c(boundsX[2,1], x2, x3) P <-as.numeric(V %*% A); Q <-as.numeric(V %*% B) acc <-add_line(P, Q, acc) }for (x1 in xlev[[1]]) for (x3 in xlev[[3]]) { A <-c(x1, boundsX[1,2], x3); B <-c(x1, boundsX[2,2], x3) P <-as.numeric(V %*% A); Q <-as.numeric(V %*% B) acc <-add_line(P, Q, acc) }for (x1 in xlev[[1]]) for (x2 in xlev[[2]]) { A <-c(x1, x2, boundsX[1,3]); B <-c(x1, x2, boundsX[2,3]) P <-as.numeric(V %*% A); Q <-as.numeric(V %*% B) acc <-add_line(P, Q, acc) } acc}norms <-sqrt(colSums(V^2)) step_basis <-1/ norms grid_basis_vecs <-grid_from_transformed_basis(V, boundsZ, step = step_basis) fig2 <- fig2 %>%add_trace(x = grid_basis_vecs$x, y = grid_basis_vecs$y, z = grid_basis_vecs$z,type ="scatter3d", mode ="lines",line =list(width =1, dash ="dash"), hoverinfo ="skip", opacity =0.5,name ="Transformed lattice",color ="#bdbdbd",visible =FALSE)n_traces <-3* K +2i <-1Lidx_grid <- ii <- i +1Lidx_pts <- i:(i + K -1L) i <- i + Kidx_ctr <- i:(i + K -1L)i <- i + Kidx_sur <- i:(i + K -1L)i <- i + Kidx_lat <- ivis_all <-rep(TRUE, n_traces)sphere_buttons <-lapply(seq_len(K), function(k) { v <-rep(FALSE, n_traces) v[idx_grid] <-TRUE v[idx_sur[k]] <-TRUE v[idx_ctr[k]] <-TRUE v[idx_pts[k]] <-TRUElist(method ="update",args =list(list(visible = v)),label =paste0("Class ", k) )})no_lat <-rep(TRUE, n_traces) no_lat[idx_lat] <-FALSEbtn_hide_lattice <-list(method ="update",args =list(list(visible = no_lat)),label ="Hide transformed lattice")btn_show_all <-list(method ="update",args =list(list(visible = vis_all)),label ="Show transformed lattice")fig2 <- fig2 %>%layout(title =list(text ="(3) Transformed space: y = Σ^{-1/2}x", x =0.5, xanchor ="center"),scene =list(xaxis =list(title =list(text ="y<sub>1</sub>")),yaxis =list(title =list(text ="y<sub>2</sub>")),zaxis =list(title =list(text ="y<sub>3</sub>")),aspectmode ="data" ),updatemenus =list(list(type ="buttons",buttons =c(list(btn_show_all), list(btn_hide_lattice) ) ) ))fig2
In the figure above, the transformed lattice is a visualisation of the grid lines of the “old” \(X\)-space coordinate system after they have been transformed by \(\Sigma^{-1/2}\).
We now have an even simpler interpretation of LDA: After transforming our data points \(x\) and centroids by \(T\), we assign \(x\) the class label for which \(y = T(x)\) is closest to the transformed centroid!
Dimension reduction
We have now reduced the LDA problem from decision boundaries in \(\Re^p\) to finding the transformed centroid \(\tilde{\mu}_k\) that is closest to our transformed data point \(y = \Sigma^{-1/2}x\). Let us think about this a bit more and simplify things further.
Our first big realsisation should be that the \(K\) transformed centroids lie in a \(K-1\)-dimensional affine subspace \(\mathcal{A}\). For \(K=3\) you can think of it like this: You can draw a line between two points, and then rest a rigid sheet on the remaining point and your line. This (infinitely large) sheet is a \(2\)-dimensional plane on which all our transformed centroids \(\tilde{\mu}_k\) sit.
This is visualised below:
R code
to_plane <-function(z) t(B) %*% (z - m_bar)Y <-t(apply(Z, 1, to_plane)) Ym <-t(apply(M, 1, to_plane)) boundary_params_2d <-function(i, j, priors =NULL) { mi <- M[i,]; mj <- M[j,] w2 <-as.vector(t(B) %*% (mi - mj)) rhs <-0.5*(sum(mi^2) -sum(mj^2))if (!is.null(priors)) rhs <- rhs - (log(priors[i]) -log(priors[j])) c2 <- rhs -sum((mi - mj) * m_bar)list(w=w2, c=c2)}plane_patch <-function(center, B, half =7, n =24) { s <-seq(-half, half, length.out = n) t <-seq(-half, half, length.out = n) S <-matrix(rep(s, each=n), n, n) T <-matrix(rep(t, times=n), n, n) X <- center[1] + S*B[1,1] + T*B[1,2] Y <- center[2] + S*B[2,1] + T*B[2,2] Z <- center[3] + S*B[3,1] + T*B[3,2]list(x=X, y=Y, z=Z)}plane_mesh <-plane_patch(m_bar, B, half =7, n =24)grey_scale <-list(list(0.0, "#bdbdbd"), list(1.0, "#bdbdbd"))fig3 <-plot_ly()rngZ <-apply(Z, 2, range); pad <-0.75boundsZ <-rbind(rngZ[1,]-pad, rngZ[2,]+pad)grid_lines_Z <-make_grid_lines(boundsZ, step =1)grid_axisZ_vecs <-collapse_grid_to_vectors(grid_lines_Z)fig3 <- fig3 %>%add_trace(x = grid_axisZ_vecs$x, y = grid_axisZ_vecs$y, z = grid_axisZ_vecs$z,type="scatter3d", mode="lines",line=list(width=1, color=grid_grey), hoverinfo="skip", opacity=0.0,name="Y-grid")for (k in1:K) { pts <- Z[y==k, ] fig3 <- fig3 %>%add_markers(x = pts[,1], y = pts[,2], z = pts[,3],name =paste("class", k),legendgroup =paste0("class", k), showlegend =TRUE,marker =list(size = point_size, color = group_cols[k]),hoverinfo ="skip",type ="scatter3d", mode ="markers", opacity =0.6 )}for (k inseq_len(K)) { fig3 <- fig3 %>%add_markers(x = M[k,1], y = M[k,2], z = M[k,3],name =paste0("μ<sub>", k, "</sub>"),legendgroup =paste0("class", k),showlegend =TRUE,marker =list(size = centroid_size, symbol = centroid_sym, color = group_cols[k]),type ="scatter3d", mode ="markers" )}for (k in1:K) { s <-sphere_surface(M[k, ], r =1.35, n =46) cs <-list(list(0, group_cols[k]), list(1, group_cols[k])) fig3 <- fig3 %>%add_surface(x = s$x, y = s$y, z = s$z,colorscale = cs, showscale =FALSE, opacity =0.28,name =paste("sphere", k) )}fig3 <- fig3 %>%add_surface(x=plane_mesh$x, y=plane_mesh$y, z=plane_mesh$z,surfacecolor =matrix(0, nrow(plane_mesh$x), ncol(plane_mesh$x)),colorscale = grey_scale, showscale=FALSE, opacity=0.35,name="LDA plane" )n_traces <-3* K +2i <-1Lidx_grid <- ii <- i +1Lidx_pts <- i:(i + K -1L) i <- i + Kidx_ctr <- i:(i + K -1L)i <- i + Kidx_sur <- i:(i + K -1L)idx_pln <- n_traces sphere_plane_buttons <-lapply(seq_len(K), function(k) { v <-rep(TRUE, n_traces) v[idx_sur[k]] <-FALSE v[idx_ctr[k]] <-FALSE v[idx_pts[k]] <-FALSElist(method ="update",args =list(list(visible = v)),label =paste0("Hide class ", k) )})vis_all <-rep(TRUE, n_traces)btn_show_all <-list(method ="update",args =list(list(visible = vis_all)),label ="Show all")no_pln <-rep(TRUE, n_traces)no_pln[idx_pln] <-FALSEbtn_hide_pln <-list(method ="update",args =list(list(visible = no_pln)),label ="Hide LDA plane")fig3 <- fig3 %>%layout(title =list(text ="(4) Transformed space + LDA plane", x =0.5, xanchor ="center"),scene =list(xaxis =list(title =list(text ="y<sub>1</sub>")),yaxis =list(title =list(text ="y<sub>2</sub>")),zaxis =list(title =list(text ="y<sub>3</sub>")),aspectmode ="data" ),updatemenus =list(list(type ="buttons",buttons =c(list(btn_show_all), list(btn_hide_pln), sphere_plane_buttons ) ) ))fig3
The second insight is that when we look for the closest transformed centroid, we can ignore distances orthogonal to the \(K-1\)-dimensional affine subspace \(\mathcal{A}\), since they will contribute equally to each class. Hence, we can project the data directly onto \(\mathcal{A}\) and make distance comparisons there.
Mathematically, this means the following: Let \(P_{\mathcal{A}}\) the projection matrix onto \(\mathcal{A}\), then since \(\tilde{\mu}_k \in \mathcal{A}\), \(P_{\mathcal{A}} \tilde{\mu}_k = \tilde{\mu}_k\), and \((I - P_{\mathcal{A}}) \tilde{\mu}_k = 0\). Therefore
\[
\begin{aligned}
\| y - \tilde{\mu}_k \|_2^2 &= \| P_{\mathcal{A}}y - P_{\mathcal{A}}\tilde{\mu}_k + (I - P_{\mathcal{A}})y - (I - P_{\mathcal{A}})\tilde{\mu}_k \|_2^2 \\
&= \| P_{\mathcal{A}}y - \tilde{\mu}_k + [I - P_{\mathcal{A}}]y \|_2^2 \\
&= (P_{\mathcal{A}}y - \tilde{\mu}_k)^\top (P_{\mathcal{A}}y - \tilde{\mu}_k) + 2 (P_{\mathcal{A}}y - \tilde{\mu}_k )^\top ([I - P_{\mathcal{A}}]y) + ([I - P_{\mathcal{A}}]y)^\top ([I - P_{\mathcal{A}}]y) \\
&=\| P_{\mathcal{A}}y - \tilde{\mu}_k \|_2^2 + \| [I - P_{\mathcal{A}}]y \|_2^2
\end{aligned} \,,
\] where we used that \((P_{\mathcal{A}}y - \tilde{\mu}_k )^\top ([I - P_{\mathcal{A}}]y) = 0\) since \((P_{\mathcal{A}}y - \tilde{\mu}_k ) \in \mathcal{A}\) and \((I - P_{\mathcal{A}})y\) is in the subspace orthogonal to \(\mathcal{A}\).
The figure below visualises this decomposition. Try to convince yourself that the component of \(y\) that is relevant for classification is indeed \(P_{\mathcal{A}}y\), i.e. the projection of \(y\) onto \(\mathcal{A}\).
R code
m_bar <-colMeans(M) centered <-sweep(M, 2, m_bar) Q <-qr.Q(qr(t(centered))) B <- Q[, 1:2, drop =FALSE] P <- B %*%t(B) Z_proj <-sweep(Z, 2, m_bar, FUN ="-") %*% PZ_proj <-sweep(Z_proj, 2, m_bar, FUN ="+")M_proj <-sweep(M, 2, m_bar, FUN ="-") %*% PM_proj <-sweep(M_proj, 2, m_bar, FUN ="+")segments_from_points <-function(A, Bp) {stopifnot(ncol(A) ==3, ncol(Bp) ==3, nrow(A) ==nrow(Bp)) n <-nrow(A); x <- y <- z <-numeric(0)for (i inseq_len(n)) { x <-c(x, A[i,1], Bp[i,1], NA_real_) y <-c(y, A[i,2], Bp[i,2], NA_real_) z <-c(z, A[i,3], Bp[i,3], NA_real_) }list(x = x, y = y, z = z)}circle_on_affine_plane <-function(c, r, B, m_bar, n =180) { P <- B %*%t(B) c_proj <-as.numeric(m_bar + P %*% (c - m_bar)) d2 <-sum((c - c_proj)^2)if (d2 > r^2) return(list(x =NA_real_, y =NA_real_, z =NA_real_)) rprime <-sqrt(r^2- d2) t <-seq(0, 2*pi, length.out = n) u <- B[,1]; v <- B[,2] Y <-t( matrix(c_proj, nrow =3, ncol = n) + rprime * (u %o%cos(t) + v %o%sin(t)) )list(x =c(Y[,1], NA_real_), y =c(Y[,2], NA_real_), z =c(Y[,3], NA_real_))}plane_mesh_from_B_affine <-function(B, m_bar, boundsY, n =16) { cornersY <-as.matrix(expand.grid(boundsY[,1], boundsY[,2], boundsY[,3])) AB <-t(B) %*%t(cornersY -matrix(m_bar, nrow =nrow(cornersY), ncol =3, byrow =TRUE)) ar <-range(AB[1,]); br <-range(AB[2,]) a <-seq(ar[1], ar[2], length.out = n) b <-seq(br[1], br[2], length.out = n) grid <-expand.grid(a = a, b = b) Y <-t(apply(grid, 1, function(w) m_bar + B %*%c(w["a"], w["b"])))list(x =matrix(Y[,1], nrow =length(a), byrow =FALSE),y =matrix(Y[,2], nrow =length(a), byrow =FALSE),z =matrix(Y[,3], nrow =length(a), byrow =FALSE) )}rngZ <-apply(Z, 2, range); pad <-0.75boundsZ <-rbind(rngZ[1,]-pad, rngZ[2,]+pad)if (!exists("plane_mesh")) { plane_mesh <-plane_mesh_from_B_affine(B, m_bar, boundsZ, n =16)}fig4 <-plot_ly()grid_lines_Z <-make_grid_lines(boundsZ, step =1)grid_axisZ_vecs <-collapse_grid_to_vectors(grid_lines_Z)fig4 <- fig4 %>%add_trace(x = grid_axisZ_vecs$x, y = grid_axisZ_vecs$y, z = grid_axisZ_vecs$z,type="scatter3d", mode="lines",line=list(width=1, color=grid_grey), hoverinfo="skip", opacity=0.0,name="Y-grid")for (k in1:K) { pts <- Z[y==k, ] fig4 <- fig4 %>%add_markers(x = pts[,1], y = pts[,2], z = pts[,3],name =paste("class", k),legendgroup =paste0("class", k),showlegend =TRUE,marker =list(size = point_size, color = group_cols[k]),hoverinfo ="skip",type ="scatter3d", mode ="markers", opacity =0.6 )}for (k in1:K) { pts <- Z_proj[y==k, ] fig4 <- fig4 %>%add_markers(x = pts[,1], y = pts[,2], z = pts[,3],name =paste("projected class", k),legendgroup =paste0("projclass", k),showlegend =TRUE,marker =list(size = point_size, color = group_cols[k]),hoverinfo ="skip",type ="scatter3d", mode ="markers", opacity =0.85,visible =FALSE )}for (k in1:K) { idx <-which(y == k) seg <-segments_from_points(Z_proj[idx, , drop =FALSE], Z[idx, , drop =FALSE]) fig4 <- fig4 %>%add_trace(x = seg$x, y = seg$y, z = seg$z,type ="scatter3d", mode ="lines",line =list(width =2, color = group_cols[k]),name =paste0("orthogonal offsets ", k),showlegend =FALSE,visible =TRUE )}for (k in1:K) { fig4 <- fig4 %>%add_markers(x = M[k,1], y = M[k,2], z = M[k,3],name =paste0("μ<sub>", k, "</sub>"),legendgroup =paste0("class", k),showlegend =TRUE,marker =list(size = centroid_size, symbol = centroid_sym, color = group_cols[k]),type ="scatter3d", mode ="markers" )}sphere_r <-1.35for (k in1:K) { s <-sphere_surface(M[k, ], r = sphere_r, n =46) cs <-list(list(0, group_cols[k]), list(1, group_cols[k])) fig4 <- fig4 %>%add_surface(x = s$x, y = s$y, z = s$z,colorscale = cs, showscale =FALSE, opacity =0.28,name =paste("sphere", k) )}for (k in1:K) { crv <-circle_on_affine_plane(M[k,], r = sphere_r, B, m_bar, n =180) fig4 <- fig4 %>%add_trace(x = crv$x, y = crv$y, z = crv$z,type ="scatter3d", mode ="lines",line =list(width =3, color = group_cols[k]),name =paste0("sphere∩plane ", k),visible =FALSE, opacity =0.28 )}fig4 <- fig4 %>%add_surface(x = plane_mesh$x, y = plane_mesh$y, z = plane_mesh$z,surfacecolor =matrix(0, nrow(plane_mesh$x), ncol(plane_mesh$x)),colorscale = grey_scale, showscale =FALSE, opacity =0.35,name ="LDA plane")n_traces <-6* K +2i <-1Lidx_grid <- i; i <- i +1Lidx_pts <- i:(i + K -1L); i <- i + Kidx_prj <- i:(i + K -1L); i <- i + Kidx_stk <- i:(i + K -1L); i <- i + Kidx_ctr <- i:(i + K -1L); i <- i + Kidx_sur <- i:(i + K -1L); i <- i + Kidx_cir <- i:(i + K -1L); i <- i + Kidx_pln <- iv_full <-rep(FALSE, n_traces)v_full[c(idx_grid, idx_pts, idx_stk, idx_ctr, idx_sur, idx_pln)] <-TRUEv_proj <-rep(FALSE, n_traces)v_proj[c(idx_grid, idx_prj, idx_cir, idx_pln, idx_ctr)] <-TRUElg_full <-rep(FALSE, n_traces); lg_full[c(idx_pts, idx_ctr)] <-TRUElg_proj <-rep(FALSE, n_traces); lg_proj[c(idx_prj, idx_ctr)] <-TRUEbtn_full <-list(method ="update",args =list(list(visible = v_full, showlegend = lg_full)),label ="Full 3D")btn_proj <-list(method ="update",args =list(list(visible = v_proj, showlegend = lg_proj)),label ="Projection only")fig4 <- fig4 %>%layout(title =list(text ="(5) Projection onto LDA plane", x =0.5, xanchor ="center"),scene =list(xaxis =list(title =list(text ="y<sub>1</sub>")),yaxis =list(title =list(text ="y<sub>2</sub>")),zaxis =list(title =list(text ="y<sub>3</sub>")),aspectmode ="data" ),updatemenus =list(list(type ="buttons",buttons =list(btn_full, btn_proj)) ))fig4
Decision regions in the LDA plane
We now know that once we have projected the data onto \(\mathcal{A}\), all we have to do is find transformed class centroid \(\tilde{\mu}_k\) that is closest to the projected data. Geometrically, this means that the decision boundary between two classes \(k,k'\) can be constructed as follows: We draw a line \(\ell_{k,k'}\) between the two transformed centroids \(\tilde{\mu}_k, \tilde{\mu}_{k'}\). The decision boundary is then simply the line that is perpendicular to \(\ell_{k,k'}\) that intersects it in the middle.
Mathematically, this means that, ignoring the marginals \(\pi_k\) for now, in \(\mathcal A\) we classify \(y = \Sigma^{-1/2}x\) according to the nearest transformed centroid \(\tilde{\mu}_k\)\[
k = \underset{k}{\arg \min} \, \|P_{\mathcal{A}}y - \tilde{\mu}_k\|_2 \,.
\]
For two classes \(k,k'\), let \(\Delta=\tilde\mu_k-\tilde\mu_{k'}\) and \(m=(\tilde\mu_k+\tilde\mu_{k'}) / 2\). Then the decision boundary is \[
(P_{\mathcal{A}}y - m)^\top \Delta = 0 \,,
\] which is the perpendicular bisector we described geometrically.
The figure below shows the LDA plane \(\mathcal A\). Each circle is the intersection of a sphere centred at \(\tilde{\mu}_k\) with \(\mathcal{A}\), and the pairwise decision lines \((P_{\mathcal{A}}y-m)^\top \Delta= 0\) partition the plane into the shaded regions.
So far, we have ignored the marginals \(\pi_k\). Now that we have a good understanding of how LDA classification works in \(\mathcal{A}\), let us add them back into the picture. Let \(y = \Sigma^{-1/2}x\). Recall what we know about the discriminant function: \[
\begin{aligned}
\delta_k(x) &= \log \pi_k - \frac{1}{2} (x - \mu_k)^\top \Sigma^{-1}(x - \mu_k) \\
&= \log \pi_k - \frac{1}{2} \| \Sigma^{-1/2} x - \Sigma^{-1/2} \mu_k \|_2^2 \\
&= \log \pi_k - \frac{1}{2} \| y - \tilde \mu_k \|_2^2 \\
&= \log \pi_k - \frac{1}{2} \| P_{\mathcal{A}}y - \tilde \mu_k \|_2^2 - \frac{1}{2} \| [I - P_{\mathcal{A}}]\|_2^2
\end{aligned}\,.
\]
Hence, for two classes \(k,k'\), let \(\Delta=\tilde\mu_k-\tilde\mu_{k'}\) and \(m=(\tilde\mu_k+\tilde\mu_{k'}) / 2\). Then the decision boundary between \(k,k'\)\[
(P_{\mathcal{A}}y - m)^\top \Delta = -\log\frac{\pi_k}{ \pi_{k'}} \,.
\] Geometrically, this means that we move the perpendicular line \(\ell_{k,k'}\) further away from the transformed centroid \(\tilde{\mu}_k\) when \(\pi_k > \pi_{k'}\). Intuitively, this means that since class label \(k\) is more likely than \(k'\), we increase the region where we prefer \(k\) over \(k'\).
In summary, for \(y = \Sigma^{-1/2}\), \(\tilde{\mu}_k = \Sigma^{-1/2} \mu_k, (k = 1,\ldots, K)\), our LDA classifier is \[
k = \underset{k}{\arg \min} \, \frac{1}{2} \| P_{\mathcal{A}}y - \tilde{\mu}_k \|_2^2 - \log \pi_k \,.
\]
Importantly, we now understand that this is a linear classification problem in the \(K-1\)-dimensional subspace spanned by \(\tilde{\mu}_1, \ldots, \tilde{\mu}_K\), which corresponds to finding the closest transformed centroid \(\tilde{\mu}_k\) to our projected data \(P_{\mathcal{A}} \Sigma^{-1/2}x\) modulo the shift induced by the marginals \(\pi_k\).
Let us pause and understand what this means. Not only does LDA have a nice representation, but we have discovered that it reduces the decision space from \(p\) dimensions to \(K-1\), which can be pretty substantial in practice. If \(K\) is still large (e.g. too large to visualise), we can run PCA on \(\mathcal{A}\) to further reduce the dimension of the problem. But this is beyond the scope of this note. You can check out ESL, Chapter 4.3.3 for more on this. Nice one!
Logistic classification
Next, let us look at the logistic classifier. The logistic classifier does not have such an elegant low-dimensional projection like LDA, so we can only visualise it for up to \(3\)-dimensional feature spaces. In the example below, we consider a \(2\)-dimensional feature space, which is sufficient to fix an idea of how logistic classification operates.
For now, let us consider binary logistic classification. Recall, that in this setting, we assume that
\[
Y \mid_{X = x} \sim \mathrm{Bern}(\eta_1(x)), \quad \log \left( \frac{\eta_1(x)}{1 - \eta_1(x)}\right) = x^\top \beta, \quad \beta \in \Re^p \,,
\] where \(\eta_k = \Pr(Y = k \mid X = x)\), \((k = 0,1)\).
Also recall that the Bayes classifier under \(0\)-\(1\) loss is given by
In logistic classification, we have an explicit model for \(\eta_k(x), (k = 0,1)\), so all that we have to do is estimate \(\beta\) and replace \(\beta\) by our estimate \(\hat{\beta}\). We do so by maximising the log-likelihood function
\[
\ell(\beta) = \sum_{i = 1}^n y_i x_i^\top\beta - \log(1 + \exp\{x_i^\top \beta\}) \,,
\] over \(\beta \in \Re^p\). As an exercise, try to derive this log-likelihood yourself if you have never done this before.
Make the connection
Note that the maximum-likelihood estimator is the empirical risk minimiser with logistic cross-entropy loss \[
\ell(y,\pi) = y \log(\pi) + (1 - y) \log(\pi) \,,
\] with \(\pi = 1 / (1 + \exp \{-x^\top \beta \})\). You can try to derive this yourself too. This fits nicely into our mental picture from lecture 1: The gold standard in classification is the Bayes rule, which minimises the population risk with respect to some loss function. In classification, we are typically interested in the misclassification rate, so we consider the \(0\)-\(1\) loss. In practice, however, we do not have access to the Bayes classifier, so we try minimise the empirical risk under \(0\)-\(1\) loss, restricted to a particular model class instead. The \(0\)-\(1\) loss is not smooth, and minimising the empirical risk over it may not be feasible. Hence, we need to resort to ‘nicer’, surrogate risk functions instead.
Once we have our estimator, we can build our classifier as \[
\psi^{\text{logit}}(x) = \mathbb{1} \{ \hat{\eta}_1(x) > 1 / 2 \}
\]
Let us revisit the cat adoption data set, but this time, we classify with our logistic classifier:
R code
#load required libraries library(plotly)#run logistic regression logistic_fit <-glm(Adopted_num ~ Age + Playfulness, family =binomial(link ="logit"), data = cat_adoption_data)# Create a grid of Age and Playfulness valuesage_ticks <-seq(min(cat_adoption_data$Age), max(cat_adoption_data$Age), length.out =100)playfulness_ticks <-seq(min(cat_adoption_data$Playfulness), max(cat_adoption_data$Playfulness), length.out =100)background_grid <-expand.grid(Age = age_ticks, Playfulness = playfulness_ticks)# Predict probabilities for the gridbackground_grid$predicted_prob <-predict(logistic_fit, newdata = background_grid, type ="response")# Predict classifications for the actual datacat_adoption_data$predicted_class <-predict(logistic_fit, newdata = cat_adoption_data, type ="response") >0.5cat_adoption_data$correct <- cat_adoption_data$predicted_class == cat_adoption_data$Adopted_numcat_adoption_data$height <-predict(logistic_fit, newdata = cat_adoption_data, type ="response")#ifelse(cat_adoption_data$Adopted_num == 1, 1, 0)plot_ly() %>%add_surface(x =~age_ticks,y =~playfulness_ticks,z =t(matrix(0.5, nrow =length(age_ticks), ncol =length(playfulness_ticks))),colorscale =list(list(0, "#BDBDBD"), list(1, "#BDBDBD")),opacity =0.35,showscale =FALSE,hoverinfo ="skip",name ="p=0.5 plane" ) %>%add_surface(x =~age_ticks,y =~playfulness_ticks,z =t(matrix(background_grid$predicted_prob,nrow =length(age_ticks),ncol =length(playfulness_ticks))),colorscale =list(list(0, "#F8766D"), list(1, "#00BFC4")),showscale =TRUE,colorbar =list(title =list(text ="Adoption<br>Probability", side ="top"), tickvals =c(0.25, 0.5, 0.75),ticktext =c("0.25", "0.5", "0.75") ),hoverinfo ="x+y+z" ) %>%add_markers(data = cat_adoption_data[cat_adoption_data$Adopted_num ==0& cat_adoption_data$correct, ], x =~Age, y =~Playfulness, z =~height, marker =list(color ="#F8766D", symbol ="circle",size =5,line =list(width =0.5, color ="#F8766D") ),name ="Not Adopted – Correct", hoverinfo ="x+y+z",visible =FALSE,key ="not_adopted_correct" ) %>%add_markers(data = cat_adoption_data[cat_adoption_data$Adopted_num ==0&!cat_adoption_data$correct, ], x =~Age, y =~Playfulness, z =~height, marker =list(color ="#F8766D", symbol ="cross",size =5,line =list(width =0.5, color ="#F8766D") ),name ="Not Adopted – Incorrect", hoverinfo ="x+y+z",visible =FALSE,key ="not_adopted_incorrect" ) %>%add_markers(data = cat_adoption_data[cat_adoption_data$Adopted_num ==1& cat_adoption_data$correct, ], x =~Age, y =~Playfulness, z =~height, marker =list(color ="#00BFC4", symbol ="circle",size =5,line =list(width =0.5, color ="#00BFC4") ),name ="Adopted – Correct", hoverinfo ="x+y+z",visible =FALSE,key ="adopted_correct" ) %>%add_markers(data = cat_adoption_data[cat_adoption_data$Adopted_num ==1&!cat_adoption_data$correct, ], x =~Age, y =~Playfulness, z =~height, marker =list(color ="#00BFC4", symbol ="cross",size =5,line =list(width =0.5, color ="#00BFC4") ),name ="Adopted – Incorrect", hoverinfo ="x+y+z",visible =FALSE,key ="adopted_incorrect" ) %>%layout(title =list(text ="Logistic classification - probability surface", x =0.5, xanchor ="center"),scene =list(xaxis =list(title =list(text ="Age")),yaxis =list(title =list(text ="Playfulness")),zaxis =list(title =list(text ="Probability")) ),# legend = list(itemsizing='constant'),updatemenus =list(list(type ="buttons",buttons =list(list(method ="restyle",args =list("visible", list(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)),label ="Surface Only" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)),label ="Show All" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)),label ="Not Adopted – Correct" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)),label ="Not Adopted – Incorrect" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)),label ="Adopted – Correct" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)),label ="Adopted – Incorrect" ) ) ) ) )
We can now clearly see how classification works with the logistic classifier: Given our estimate \(\hat{\beta}\), we can associate each \(x\)-point with a probability that its class label is \(1\) through \(\hat{\eta}_1(x)\). Our classification is then simply deterimining whether this class probability is greater than \(1/2\). Hence, not only do we get a class label for each point, but also an estimate of the probability that this classification is correct.
When we project the decision boundary onto the \(2\)-dimensional \(X\)-space, we can see the linear nature of it even clearer:
R code
# Create a grid of Age and Playfulness valuesage_ticks <-seq(min(cat_adoption_data$Age), max(cat_adoption_data$Age), length.out =100)playfulness_ticks <-seq(min(cat_adoption_data$Playfulness), max(cat_adoption_data$Playfulness), length.out =100)background_grid <-expand.grid(Age = age_ticks, Playfulness = playfulness_ticks)# Predict probabilities for the gridbackground_grid$predicted_prob <-predict(logistic_fit, newdata = background_grid, type ="response")# Predict classifications for the actual datacat_adoption_data$predicted_class <-predict(logistic_fit, newdata = cat_adoption_data, type ="response") >0.5cat_adoption_data$correct <- cat_adoption_data$predicted_class == cat_adoption_data$Adopted_num# Calculate the decision boundary lineslope <-coef(logistic_fit)[2] / (-coef(logistic_fit)[3])intercept <-coef(logistic_fit)[1] / (-coef(logistic_fit)[3])# Plot the decision boundary and data points with color gradient backgroundggplot() +geom_tile(data = background_grid, aes(x = Age, y = Playfulness, fill = predicted_prob), alpha =0.5) +scale_fill_gradient(low ="#F8766D", high ="#00BFC4", name ="Adoption\nProbability") +geom_point(data = cat_adoption_data, aes(x = Age, y = Playfulness, color =as.factor(Adopted_num), shape = correct), size =2) +scale_color_manual(values =c("1"="#00BFC4", "0"="#F8766D"), labels =c("Not Adopted", "Adopted")) +scale_shape_manual(values =c("TRUE"=16, "FALSE"=4), labels =c("Correct", "Incorrect")) +geom_abline(intercept = intercept, slope = slope, linetype ="dashed", color ="black", linewidth =1, alpha =0.6) +guides(fill =guide_colourbar(order =1, title.position ="top"),colour =guide_legend(order =2, title.position ="top"),shape =guide_legend(order =3, title.position ="top") ) +theme_minimal() +labs(title ="Logistic classification: decision boundary", x ="Age", y ="Playfulness", color ="Adoption Status", shape ="Classification") +theme(legend.position ="right")
The linear classification boundary we see is not surprising, after all, when we invert the regression function \(\eta_1(x)\) we get that \(\eta_1(x) = 1/2 \iff x^\top \beta = 0\). Therefore \[
\psi^{\text{logit}}(x) = \mathbb{1}\{x^\top \hat{\beta} > 0 \} \,.
\]
Multiclass logistic classification
The multiclass logistic classification model is a straightforward generalisation of the binary case, nothing here will surprise us. If we have \(K\) classes, we assume that
\[
\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)} = \frac{\eta_k(x)}{\eta_K(x)} = \exp \{x^\top \beta_k \}, \quad \beta_K = 0 \,,
\] where we need to fix a reference class, usually \(K\), for which \(\beta_K = 0\) for identification purposes. Now, simple algebra yields that \[
\Pr(Y = k \mid X = x) = \eta_k(x) = \frac{\exp\{x^\top \beta_k\}}{\sum_{k' = 1}^K \exp\{x^\top \beta_{k'}\}} \,.
\] As before, \[
\psi^{\text{Bayes}}(x) = \underset{k}{\arg \max} \, \eta_k(x) \,.
\]
Just as before, to get estimates \(\hat{\beta}_1, \ldots, \hat{\beta}_{K-1}\) from our training sample, we minimise the mutlinomial log-likelihood \[
\ell(\beta_1,\ldots, \beta_{K-1}) = \sum_{i = 1}^n \sum_{k = 1}^K \mathbb{1}\{Y_i = k\} \log \left( \frac{\exp\{ x_i^\top \beta_k \}}{\sum_{k' = 1}^K \exp\{x_i^\top \beta_{k'}\}} \right) \,,
\] over \(\beta_1,\ldots, \beta_{K-1}\). Then we have the logistic classifier
Try to convince yourself that these derivation reduce the logistic classifier for \(K = 2\). Below is a visualisation of the logistic classifier for the cat adoption data. This time, we try to predict a cat’s main fur colour using the features Playfulness and Age.
We have learned that both LDA and logistic classification produce a linear score \[
s(x)=w^\top x + b\,,
\] and make predictions based on whether this score crosses a certain threshold value \(t\), i.e. we predict label \(1\) if \(s(x) > t\) and \(0\) otherwise.
Now, in LDA we found that changing the class marginals \((\pi_0,\pi_1)\) adds a constant to the rule - or equivalently, changes the threshold \(t\). In logistic classification, changing the intercept term in \(\beta\) by \(\Delta\) is the same as using a new threshold \(t'=t-\Delta\).
In both instances, these changes in the threshold result in parallel shifts of the decision boundary. Naturally, by making such shifts, we change the classification rule and the performance of the classifier: Let us assume we have a given test set and vary threshold value \(t\).
Starting from a threshold such that we label points as zeroes, if we lower the threshold \(t\), we will be more generous in labelling data points as \(1\).
As a result, eventually, the true positive rate (TPR) \[
\mathrm{TPR}(t)
= \frac{\text{Number of true positives}}{\text{Number of actual positives}}
= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \,,
\] will increase. Concurrently though, the false positive rate (FPR), \[
\mathrm{FPR}(t)
= \frac{\text{Number of false positives}}{\text{Number of actual negatives}}
= \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}\,,
\] that is the proportion of class \(0\) points incorrectly labelled as class \(1\), will increase. Hence, lowering the threshold \(t\) increases both TPR and FPR: the classifier becomes more liberal in predicting class \(1\), trading false positives for true positives.
It is natural to plot TPR(\(t\)) against FPR(\(t\)) because TPR(\(t\)) estimates \(\Pr\!\big(s(X)>t\,\big|\,Y=1\big)\) and FPR(\(t\)) estimates \(\Pr\!\big(s(X)>t\,\big|\,Y=0\big)\) on the test set.
Hence, for each threshold \(t\), the pair FPR(\(t\)),TPR(\(t\)) captures the classifier’s trade‑off between missing positives and wrongly flagging negatives.
Below you can see how the ROC curve works for LDA on the cat adoption data.
Now let us think about what the ROC curve does: Given points \(x_1,\ldots,x_n\), the classifier produces scores \(s(x_1), \ldots, s(x_n)\). Given scores \(s(x_1),\ldots,s(x_n)\), we sort the observations by score and move \(t\) from \(+\infty\) to \(-\infty\). Each time \(t\) passes a score, predicted labels update and we record (TPR(\(t\)), FPR(\(t\))). The ROC therefore depends only on the ranking induced by \(s\).
When we plot these points against each other for all values of \(t\), we get a visual summary of how the ranking of scores induced by our decision rule \(s\) impacts the classification performance. Thus, the ROC separates the model’s ordering from the choice of threshold and lets us compare models independently of the threshold value \(t\): if one ROC dominates another (lies everywhere above it), it is universally preferred.
When curves cross, a common summary is the area under the curve (AUC), which summarises the performance of a classifier over all threshold values. A perfect classifier will lie in the top left corner and have an AUC of \(1\). A classifier that just assigns labels at random (with equal probability) will lie on the diagonal line TPR = FPR and have an AUC of \(0.5\). Dominance of one classifier over another across all threshold levels is rare in practice, so AUC is often reported alongside the full curve.
Below you can see the ROC curve and the AUC for LDA and the logistic classifier for our cat adoption data using all the features.
Interestingly, even though I generated the adoption data from a logistic regression model, i.e. the modelling assumption of the logistic classifier is correct, LDA and logistic classification are practically indistinguishable here.