Visualising Bias-variance tradeoff

ST443 Machine learning and data mining


Author
Affiliation

Philipp Sterzinger
p.sterzinger@lse.ac.uk

Department of Statistics
London School of Economics and Political Science (LSE)

Published

12 September 2025

Bias-variance tradeoff

This note provides a very brief visualisation of the bias-variance tradeoff that we discussed in lecture 1. In almost all machine learning courses you will come across the MSE expansion that motivates the bias variance tradeoff and a picture a U-shaped graph. I always found that these things are best understood dynamically, not just from one bowl-shaped picture. So below we will explore the bias-variance tradeoff with some animations. I have marked parts that I do not expect you to know or replicate yourselves with a 🏔️-symbol (Understanding or deriving them is nonetheless a great exercise). I have also attached the R code that I used to generate these visuals, so feel free to explore!

A first look

Before we dive into the details of the bias-variance tradeoff, let us start out with a visual:

We are trying to learn \(f(x) = 1 - x^2\), which is a parabola shaped curve, from some data \(X\), \(Y = 1 - X^2 + \varepsilon\), where the response variables are contaminated by some additive noise. Ultimately we want to make a prediction at a new point \(x_0\).

Below you can see two fits to a training data sample, that come from the same class of algorithms. The only thing that differs is the complexity–think flexibility– of each predictor.

Code
library(ggplot2)
library(splines)
library(cowplot)

set.seed(123)
f_true <- function(x) 1 - x^2
n      <- 30
x      <- runif(n, -2, 2)
y      <- f_true(x) + rnorm(n)
x_grid <- seq(-2, 2, length.out = 400)
x_0    <- 0.5

df_data  <- data.frame(x, y)
df_truth <- data.frame(x = x_grid, y = f_true(x_grid))
df_point <- data.frame(x = x_0,   y = f_true(x_0))

col_truth <- "#7f7f7f"
col_crude <- "#1f77b4"
col_over  <- "#d62728"

p <- ggplot() +
  geom_point(data = df_data,  aes(x, y, shape = "Data"),   colour = "black", size = 2, alpha = .8) +
  geom_line (data = df_truth, aes(x, y, colour = "f(x)",        linetype = "f(x)"),        linewidth = 1) +
  stat_smooth(data = df_data, aes(x, y, colour = "Crude fit",   linetype = "Crude fit"),  method = "lm", formula = y ~ ns(x, df = 1),  se = FALSE, linewidth = 1) +
  stat_smooth(data = df_data, aes(x, y, colour = "Overfit",    linetype = "Overfit"),   method = "lm", formula = y ~ ns(x, df = 20), se = FALSE, linewidth = 1) +
  geom_point(data = df_point, aes(x, y, shape  = "f(x0)"), colour = "#00BFC4", size = 4, stroke = 1) +
  scale_colour_manual(name = "Lines",
                      breaks = c("f(x)", "Crude fit", "Overfit"),
                      values = c("f(x)"      = col_truth,
                                 "Crude fit" = col_crude,
                                 "Overfit"  = col_over)) +
  scale_linetype_manual(name = "Lines",
                        breaks = c("f(x)", "Crude fit", "Overfit"),
                        values = c("f(x)"      = "dashed",
                                   "Crude fit" = "solid",
                                   "Overfit"  = "solid")) +
  scale_shape_manual(name = "Points",
                     breaks = c("Data", "f(x0)"),
                     values = c("Data" = 16, "f(x0)" = 4),
                     labels = c("Training data", expression(f(x[0])))) +
  labs(x = "X", y = "Y") +
  theme_minimal() +
  theme(axis.text.y  = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom")

title_grob <- ggdraw() +
  draw_label(expression(Y == 1 - x^2 + epsilon), fontface = "bold", hjust = 0.5)

plot_grid(title_grob, p, ncol = 1, rel_heights = c(.12, 1))

This plot hints at two concepts that we will explore in this note and that are central to the bias-variance tradeoff: (1) underfitting: the linear model fails to learn \(f(x)\) due to its modelling constraints, resulting in poor prediction precision. (2) overfitting: the very wiggly fit fails to capture \(f(x)\) as it pays too much attention to the noise, resulting in very volatile predictions.

We will quantify these observations and relate them to overall predictive performance in this note.

MSE decomposition

Before exploring the bias variance tradeoff on a model, let us derive the MSE bias-variance decomposition. This is a machine learning course after all. For this, recall our modelling assumption:

\[ Y = f(X) + \varepsilon\,, \tag{1}\]

with \(\text{E}[\varepsilon \mid X] = 0\), \(\text{var}(\varepsilon \mid X) = \sigma^2\). We train our estimator \(\hat{f}(x)\) on training data \((Y_1, X_1), \ldots (Y_n, X_n)\) and evaluate it on a new point \(x_0\). All expectations/variances are with respect to the randomness of the training sample and conditional on \(X = x_0\).

We want to show that \[\mathrm{MSE}(x_0) = \mathrm{E}\left[(Y-\hat f(x_0))^2 \right] = \underbrace{\mathrm{Var}\!\left(\hat f(x_0)\right)}_{\text{variance}} + \underbrace{\left(\mathrm{E}\left[\hat f(x_0) \right]-f(x_0)\right)^2}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{irreducible noise}}\,. \tag{2}\]

Try and see if you can derive the bias variance decomposition of the MSE yourself. (🏔️; Hint: Plug-in the modelling assumption Equation 1 into Equation 2, and start expanding the square)

Reveal solution \[ \begin{aligned} \mathrm{MSE}(x_0) &= \mathrm{E}\left[(Y-\hat f(x_0))^2 \right] \\ &= \mathrm{E}\left[(f(x_0)+\varepsilon - \hat f(x_0))^2 \right ] \\ &= \mathrm{E}\left[(\hat f(x_0)-f(x_0))^2 \right] + 2\mathrm{E}\left[\varepsilon\,(f(x_0)-\hat f(x_0)) \right] + \mathrm{E}\left[\varepsilon^2 \right] \\ &= \mathrm{E}\left[(\hat f(x_0)-f(x_0))^2 \right] + 2\mathrm{E}\!\left[(f(x_0)-\hat f(x_0)) \underbrace{\mathrm{E}\left[\varepsilon \mid X = x_0 \right]}_{=0}\right] + \sigma^2 \\ &= \mathrm{E}\left[(\hat f(x_0)-f(x_0))^2\right] + \sigma^2 \\ &= \mathrm{E}\left[(\hat f(x_0) - \mathrm{E}[\hat f(x_0)] + \mathrm{E}[\hat f(x_0)] - f(x_0))^2 \right] + \sigma^2\\ &= \mathrm{E}\left[(\hat f(x_0) - \mathrm{E}[\hat f(x_0)])^2 \right] + 2\mathrm{E}\left[\left(\hat f(x_0) - \mathrm{E}(\hat f(x_0))\right) \left(\mathrm{E}[\hat f(x_0)]-f(x_0)\right) \right] + \mathrm{E} \left[ (\mathrm{E}(\hat f(x_0) - f(x_0)))^2 \right]+ \sigma^2\\ &= \mathrm{E}\left[(\hat f(x_0) - \mathrm{E}[\hat f(x_0)])^2 \right] + 2 \underbrace{\mathrm{E}\left[\hat f(x_0)-\mathrm{E}[\hat f(x_0)]\right]}_{=\,0} (\mathrm{E}[\hat f(x_0)]-f(x_0)) + \left(\mathrm{E}[\hat f(x_0)]-f(x_0)\right)^2 + \sigma^2\\ &= \mathrm{E}\left[(\hat f(x_0) - \mathrm{E}[\hat f(x_0)])^2 \right] + \left(\mathrm{E}[\hat f(x_0)]-f(x_0)\right)^2 + \sigma^2\\ &= \underbrace{\mathrm{Var}\!\left(\hat f(x_0)\right)}_{\text{variance}} + \underbrace{\left(\mathrm{E}\left[\hat f(x_0) \right]-f(x_0)\right)^2}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{irreducible noise}}\,. \end{aligned} \]

The story with Equation 2 typically goes like: “If we increase the model complexity, our predictions get better on average, so bias decreases. At the same time, however, our model becomes more prone to overfitting, thus increasing the variance of our predictions.” Hence, following this heuristic, the choice of model complexity is a balancing act between its accuracy (bias) and its generalizability to unseen data (variance).

However, it may not be immediately clear how this heuristic derives from the MSE-decomposition of Equation 2. So let us first understand what the bias and variance terms therein really mean.

First, let us recall \(\hat f(x)\) is a function that we learned using the training data. Thus, it a function of the random variables \((Y_1, X_1), \ldots (Y_n, X_n)\) and therefore a random variable itself. The randomness, over which we take expectations in the terms \(\mathrm{Var}(\hat f(x_0))\), \((\mathrm{E}[\hat f(x_0)] - f(x_0))^2\) pertains to \(\hat{f}(x_0)\) (as this is the only random variable involved in the expressions) and thus the expectations are with respect to the training data.

Hence, loosely speaking, the bias term \((\mathrm{E}[\hat f(x_0)] - f(x_0))^2\) answers the following question: If we consider many datasets to train \(\hat{f}\) on, how much will our prediction \(\hat{f}(x_0)\) differ from our target \(f(x_0)\) on average? (🏔️; This can be made precise by the Law of large numbers which tells us that, under regularity conditions, \(1/n \sum_{i = 1}^n \hat{f}^i(x_0)\) converges to \(\mathrm{E}[\hat{f}(x_0)]\) almost surely as the number of training samples \(n\) goes to \(\infty\). Here, \(\hat{f}^i\) is the prediction function that we trained on the \(i\)th training sample.) Bias therefore is a measure of our prediction accuracy, measured across different training samples.

However, bias does not tell the full story,

Visualising the bias-variance tradeoff

You may start to understand how the bias-variance tradeoff is tied to model complexity. If a model is very complex in that it can fit a given training data set very well, then across many training samples, we will do a pretty good job at predicting \(\hat{f}(x_0)\). However, because we fit any given training data set very well, we will also start fitting the noise, rather than the underlying function \(f(x)\). Thus, our predictions will more heavily depend on the training data at hand. (low bias, high variance)

Conversely, if we have a very crude prediction function with limited capabilities to model \(f(x)\), we might expect poorer predictions on average, as we are not able to capture the structure of \(f(x)\). At the same time, because our model is very simple, we might expect it to be less impressed by different training data. It may give poor predictions, but it does consistently so. (high bias, low variance)

It should be very intuitive that the sweet-spot lies somewhere in the middle of these extremes, and that is exactly what the MSE-decomposition of Equation 2 tells us heuristically.

Now let us see this play out with an example. We let

\[ \begin{aligned} Y &= 1 - X^2 + \varepsilon \\ \end{aligned} \,, \] with \(X \sim \mathrm{U}(-2,2)\), \(\varepsilon \sim \mathrm{N}(0, 1)\) and \(X \perp \varepsilon\) and consider the prediction point \(x_0 = 1/2\).

R code
library(ggplot2) 
library(ggplot2)
library(cowplot)

f_true <- function(x) -x^2 + 1
x_grid <- seq(-2, 2, length.out = 200)
x_0 <- 0.5
sigma <- 1

set.seed(123)
n <- 30 
eps <- rnorm(n, mean = 0, sd = sigma) 
x <- 4 * runif(n) - 2 
y <- f_true(x) + eps 


df_data   <- data.frame(x = x, y = y)
df_truth  <- data.frame(x = x_grid, y = f_true(x_grid))
df_point  <- data.frame(x = x_0, y = f_true(x_0))

base_layers <- list(
  geom_point(data = df_data,  aes(x, y), color="black", size=2, alpha=0.8),
  geom_line(data = df_truth,  aes(x, y), color="#7f7f7f", linetype="dashed", linewidth=1)
)

df_dummy <- data.frame(x=1,y=1)  

mock_legend <- ggplot(df_dummy, aes(x,y)) +
  geom_point(aes(shape="Data"),   color="black",   size=2, alpha=0.8) +
  geom_point(aes(shape="f(x0)"),  color="#00BFC4", size=4) +
  scale_shape_manual(
    name   = "Points",
    breaks = c("Data","f(x0)"),
    values = c("Data"=16, "f(x0)"=4),
    labels = c("Training data", expression(f(x[0])))
  ) +
  geom_line(aes(linetype="f(x)"),      color="#7f7f7f", linewidth=1) +
  scale_linetype_manual(
    name   = "Lines",
    breaks = c("f(x)"),
    values = c("dashed"),
    labels = c(expression(f(x)))
  ) +
  theme_void() +
  theme(legend.position="right")

p <- ggplot() + base_layers + 
     geom_point(data = df_point, aes(x, y), color="#00BFC4", size=4, shape=4, stroke=1) + 
    labs(x = "X", y = "Y") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none", 
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()
    ) + 
    coord_cartesian(xlim = c(-2,2), ylim = c(-3, 2.5))

legend_grob <- get_legend(mock_legend)

title  <- ggdraw() + draw_label(expression(Y == 1 - x^2 + epsilon),
                                fontface = 'bold', x = 0.5, hjust = 0.5)

final_plot <- plot_grid(
  title,
  plot_grid(p, legend_grob, ncol = 2, rel_widths = c(4,1)),
  ncol = 1, rel_heights = c(0.1,1)
)

final_plot

Next, we fit the training data with smoothing splines using the R package splines. You can think of them as curves that can fit the data very flexibly. How flexible they are, i.e. how well they fit the data is controlled by a smoothness parameter df, which stands for degrees of freedom. For the sake of illustration, we choose three parameterisations: (1) a crude fit, which is just a linear fit to the trainning data, (2) a good fit, which is a quadratic fit to the training data, and (3) an over fit, which is a natural cubic spline fit with \(20\) degrees of freedom.

Let us look at one such fit:

R code
library(splines)

df_crude <- 1
df_good <- 2 
df_over <- 20

fit1  <- lm(y ~ ns(x, df = df_crude),  data = df_data)
yhat1 <- predict(fit1,  data.frame(x = x_0))

fit2  <- lm(y ~ ns(x, df = df_good),  data = df_data)
yhat2 <- predict(fit2,  data.frame(x = x_0))

fit3  <- lm(y ~ ns(x, df = df_over), data = df_data)
yhat3 <- predict(fit3,  data.frame(x = x_0))

y_true <- f_true(x_0)

p1 <- ggplot() + base_layers +
        stat_smooth(data    = df_data,
                    aes(x, y),
                    method  = "lm",
                    formula = y ~ ns(x, df = df_crude),
                    se      = FALSE,
                    colour  = "#1f77b4",
                    size    = 1) +
        geom_segment(aes(x = x_0, xend = x_0, y = yhat1, yend = y_true),
               colour = "purple", linetype = "dotted", size = 1) +
        geom_point(data = df_point, aes(x, y), color="#00BFC4", size=4, shape=4, stroke=1) + 
        labs(x = " ", y = "Y") +
        theme_minimal(base_size = 14) +
        theme(plot.title = element_text(hjust = 0.5),
            legend.position = "none"
        ) + 
        coord_cartesian(xlim = c(-2,2), ylim = c(-3, 2.5))

p2 <- ggplot() + base_layers +
    stat_smooth(data    = df_data,
                aes(x, y),
                method  = "lm",
                formula = y ~ ns(x, df = df_good),
                se      = FALSE,
                colour  = "#2ca02c",
                size    = 1) +
    geom_segment(aes(x = x_0, xend = x_0, y = yhat2, yend = y_true),
            colour = "purple", linetype = "dotted", size = 1) +            
    geom_point(data = df_point, aes(x, y), color="#00BFC4", size=4, shape=4, stroke=1) + 
    labs(x = "X", y = " ") +
    theme_minimal(base_size = 14) +
    theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none", 
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()
    ) + 
    coord_cartesian(xlim = c(-2,2), ylim = c(-3, 2.5))

p3 <- ggplot() + base_layers +
    stat_smooth(data    = df_data,
                aes(x, y),
                method  = "lm",
                formula = y ~ ns(x, df = df_over),
                se      = FALSE,
                colour  = "#d62728",
                size    = 1) +
    geom_segment(aes(x = x_0, xend = x_0, y = yhat3, yend = y_true),
                colour = "purple", linetype = "dotted", size = 1) +
    geom_point(data = df_point, aes(x, y), color="#00BFC4", size=4, shape=4, stroke=1) + 
    labs(x = " ", y = " ") +
    theme_minimal(base_size = 14) +
    theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none", 
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()
    ) + 
    coord_cartesian(xlim = c(-2,2), ylim = c(-3, 2.5))


mock_legend <- ggplot(df_dummy, aes(x,y)) +
  geom_point(aes(shape="Data"),   color="black",   size=2, alpha=0.8) +
  geom_point(aes(shape="f(x0)"),  color="#00BFC4", size=4) +
  scale_shape_manual(
    name   = "Points",
    breaks = c("Data","f(x0)"),
    values = c("Data"=16, "f(x0)"=4),
    labels = c("Training data", expression(f(x[0])))
  ) +
  geom_line(aes(linetype="f(x)"), color="#7f7f7f", size=1) +
  geom_line(aes(linetype="diff"), color="purple", size=1) +
  geom_line(aes(linetype="Crude fit"), color="#1f77b4", size=1) +
  geom_line(aes(linetype="Good fit"), color="#2ca02c", size=1) +
  geom_line(aes(linetype="Overfit"), color="#d62728", size=1) +
  scale_linetype_manual(
    name   = "Lines",
    breaks = c("f(x)", "diff", "Crude fit","Good fit","Overfit"),
    values = c("dashed", "dotted", "solid","solid","solid"),
    labels = c(expression(f(x)), 
        expression(paste("|", hat(f)(x[0]) - f(x[0]), "|")),
        expression("Crude" ~ hat(f)),
        expression("Good" ~ hat(f)), 
        expression("Overfit" ~ hat(f)))
  ) +
  theme_void() +
  theme(legend.position="top")

panels <- plot_grid(p1, p2, p3, ncol = 3)

legend_grob <- get_legend(mock_legend)

final_plot <- plot_grid(legend_grob, panels, nrow = 2, rel_heights = c(0.1, 1))

final_plot

We can see that the crude fit does not learn the structure of \(f(x)\) very well. Similarly, the overfitted model is so preoccupied with achieving a good fit on the training data, that the fitted curve \(\hat f(x)\) does not capture much of \(f(x)\).

Now we said that bias and variance are quantities that are weighted across all possible training samples. Hence, to get an idea about their behaviour, let us look at the fits for a multitude of different training samples. For this, I simulate a bucnh (\(1000\)) of training samples, fit the three models, and record the test MSE, the squared bias and the variance. In the figure you can see an animation of every \(10\)th model fit. At the bottom of the figure, I also include estimates of test MSE, squared bias and variance that I computed using all samples to the point that is shown in the animation.

R code
library(ggplot2)
library(cowplot)
library(splines)
library(magick)
library(grid)

set.seed(123)
n_sims  <- 100
n_train <- 30
sigma   <- 1
x_0     <- 0.5
f_true  <- function(x) 1 - x^2
y_true  <- f_true(x_0)

df_crude <- 1
df_good  <- 2
df_over  <- 20

yhat1 <- yhat2 <- yhat3 <- numeric(n_sims)
train_x_list <- replicate(n_sims, runif(n_train, -2, 2), FALSE)
df_data_list <- vector("list", n_sims)

for (i in 1:n_sims) {
  x <- train_x_list[[i]]
  y <- f_true(x) + rnorm(n_train, 0, sigma)
  df <- data.frame(x, y)
  yhat1[i] <- predict(lm(y ~ ns(x, df = df_crude), df), data.frame(x = x_0))
  yhat2[i] <- predict(lm(y ~ ns(x, df = df_good ), df), data.frame(x = x_0))
  yhat3[i] <- predict(lm(y ~ ns(x, df = df_over ), df), data.frame(x = x_0))
  df_data_list[[i]] <- df
}

bias2_1 <- (cumsum(yhat1)/1:n_sims - y_true)^2
bias2_2 <- (cumsum(yhat2)/1:n_sims - y_true)^2
bias2_3 <- (cumsum(yhat3)/1:n_sims - y_true)^2
var_1   <- sapply(1:n_sims, \(k) var(yhat1[1:k]))
var_2   <- sapply(1:n_sims, \(k) var(yhat2[1:k]))
var_3   <- sapply(1:n_sims, \(k) var(yhat3[1:k]))
mse_1   <- bias2_1 + var_1 + sigma^2
mse_2   <- bias2_2 + var_2 + sigma^2 
mse_3   <- bias2_3 + var_3 + sigma^2 

x_grid   <- seq(-2, 2, len = 200)
df_truth <- data.frame(x = x_grid, y = f_true(x_grid))
df_point <- data.frame(x = x_0,   y = y_true)

base_layers <- list(
  geom_line(data = df_truth, aes(x, y), col = "#7f7f7f", lty = "dashed", size = 1)
)

df_dummy <- data.frame(x = 1, y = 1)
mock_legend <- ggplot(df_dummy, aes(x, y)) +
  geom_point(aes(shape = "Data"),  colour = "black",   size = 2, alpha = .8) +
  geom_point(aes(shape = "f(x0)"), colour = "#00BFC4", size = 4) +
  scale_shape_manual(
    name   = expression(Points),
    breaks = c("Data", "f(x0)"),
    values = c("Data" = 16, "f(x0)" = 4),
    labels = list(expression("Training data"), expression(f(x[0])))
  ) +
  geom_line(aes(linetype = "f(x)"),      colour = "#7f7f7f", size = 1) +
  geom_line(aes(linetype = "diff"),      colour = "purple",  size = 1) +
  geom_line(aes(linetype = "Crude fit"), colour = "#1f77b4", size = 1) +
  geom_line(aes(linetype = "Good fit"),  colour = "#2ca02c", size = 1) +
  geom_line(aes(linetype = "Overfit"),   colour = "#d62728", size = 1) +
  scale_linetype_manual(
    name   = expression(Lines),
    breaks = c("f(x)", "diff", "Crude fit", "Good fit", "Overfit"),
    values = c("dashed", "dotted", "solid", "solid", "solid"),
    labels = list(
      expression(f(x)),
      expression("|" * hat(f)(x[0]) - f(x[0]) * "|"),
      expression("Crude"   ~ hat(f)),
      expression("Good"    ~ hat(f)),
      expression("Overfit" ~ hat(f))
    )
  ) +
  theme_void() +
  theme(legend.position = "top")
legend_grob <- cowplot::get_legend(legend_plot)

img_list <- image_graph(1875, 1250, res = 200)

for (i in seq(10, n_sims, 10)) {
  df_data <- df_data_list[[i]]
  
  p1 <- ggplot() + base_layers +
    geom_point(data = df_data, mapping = aes(x, y),
               colour = "black", size = 2, alpha = .8) + 
    stat_smooth(data = df_data, aes(x, y), method = "lm",
                formula = y ~ ns(x, df = df_crude),
                se = FALSE, col = "#1f77b4") +
    geom_point(data = df_point, aes(x, y), col = "#00BFC4", shape = 4, size = 4) +
    geom_segment(
      data        = df_point,          
      aes(x      = x,                  
          xend   = x,
          y      = yhat1[i],           
          yend   = y_true),          
      inherit.aes = FALSE,
      colour      = "purple",
      linetype    = "dotted",
      size        = 1
    ) + 
    coord_cartesian(xlim = c(-2, 2), ylim = c(-3.7, 2.5)) +
    labs(caption = sprintf("MSE %.3f   Bias² %.3f   Var %.3f",
                    mse_1[i], bias2_1[i], var_1[i]), 
         x = "", y = "Y") +
    theme_minimal() +
    theme(
          legend.position = "none",
          plot.title = element_text(hjust = .5, face = "plain", size = 13), 
          plot.caption = element_text(hjust = .5, vjust = -1, size = 12),
          plot.caption.position = "plot")
  
  p2 <- ggplot() + base_layers +
    geom_point(data = df_data, mapping = aes(x, y),
               colour = "black", size = 2, alpha = .8) +
    stat_smooth(data = df_data, aes(x, y), method = "lm",
                formula = y ~ ns(x, df = df_good),
                se = FALSE, col = "#2ca02c") +
    geom_point(data = df_point, aes(x, y), col = "#00BFC4", shape = 4, size = 4) +
    geom_segment(
      data        = df_point,          
      aes(x      = x,                  
          xend   = x,
          y      = yhat2[i],           
          yend   = y_true),          
      inherit.aes = FALSE,
      colour      = "purple",
      linetype    = "dotted",
      size        = 1
    ) + 
    coord_cartesian(xlim = c(-2, 2), ylim = c(-3.7, 2.5)) +
    labs(caption = sprintf("MSE %.3f   Bias² %.3f   Var %.3f",
                    mse_2[i], bias2_2[i], var_2[i]),
         x = "X", y = "") +
    theme_minimal() +
    theme(
          legend.position = "none",
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          plot.caption = element_text(hjust = .5, vjust = -1, size = 12),
          plot.caption.position = "plot")
  
  
  p3 <- ggplot() + base_layers +
    geom_point(data = df_data, mapping = aes(x, y),
               colour = "black", size = 2, alpha = .8) + 
    stat_smooth(data = df_data, aes(x, y), method = "lm",
                formula = y ~ ns(x, df = df_over),
                se = FALSE, col = "#d62728") +
    geom_point(data = df_point, aes(x, y), col = "#00BFC4", shape = 4, size = 4) +
    geom_segment(
      data        = df_point,          
      aes(x      = x,                  
          xend   = x,
          y      = yhat3[i],           
          yend   = y_true),          
      inherit.aes = FALSE,
      colour      = "purple",
      linetype    = "dotted",
      size        = 1
    ) + 
    coord_cartesian(xlim = c(-2, 2), ylim = c(-3.7, 2.5)) +
    labs(caption = sprintf("MSE %.3f   Bias² %.3f   Var %.3f",
                    mse_3[i], bias2_3[i], var_3[i]),
         x = "", y = "") +
    theme_minimal() +
    theme(
          legend.position = "none",
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          plot.caption = element_text(hjust = .5, vjust = -1, size = 12),
          plot.caption.position = "plot") 
  
  print(plot_grid(legend_grob,
                  plot_grid(p1, p2, p3, ncol = 3),
                  ncol = 1,
                  rel_heights = c(.22, 1)))
}

dev.off()

anim <- image_animate(img_list, fps = 1)
image_write(anim, "bias_variance_tradeoff.gif")

You can now clearly see how bias and variance play out for different fits. The crude fit provides a poor representation of \(f(x)\), hence we incur a lot of bias. At the same time, even though it as a poor predictor, it is consistently so, i.e. the fitted straight line does not vary a lot across samples. The overfitted model on the other hand, has very high variance, because we mostly try to fit the training data. However, the high model flexibility means that even though we have a lot of spread, our predictions are quite accurate (weighed across) samples, meaning we achieve low bias. The good fit, strikes a great balance: It has variance comparable to the crude fit, and bias similar to the overfitted model.

We should now have a clear picture in mind of what the bias-variance tradeoff is, and why either low bias or low variance alone are not good metrics for te performance of our predictors.

With our current understanding of the relation of bias-variance tradeoff and model complexity, let us now look at the full picture. Below is another grphic, where we try to learn

\[ Y = \sin(X) \log(|X|) + \varepsilon\,, \] with \(X \sim \mathrm{U}(-10, 10)\), \(\varepsilon \sim \mathrm{N}(0, 16)\), \(X \perp \varepsilon\). Again we are using splines with varying degrees of freedom to fit the data. For each setting of degrees of freedom, I simulate \(50\ 000\) datasets that I use to estimate the training MSE, the test MSE, and the squared bias and variance. You can see these curves in the right panel of the plot. Moreover, I included an example data set, and the model fit with the current degrees of freedom. With the slider bar below, you can increase the model complexity and see how you traverse the test MSE curve. At the same time, you can see how the model fit changes with varying complexity and its implications for the performance of our predictor.

R code
library(parallel)
library(plotly)

set.seed(123)

n_sims  <- 50000
n_train <- 50
grid_x  <- seq(-10, 10, length.out = 1000)
sigma   <- 4
f_true  <- function(x) sin(x) * log(abs(x))
dfs     <- 1:30

core_fun <- function(d){
  res <- replicate(n_sims, {
    tr_x <- runif(n_train, -10, 10)
    tr_y <- f_true(tr_x) + rnorm(n_train, 0, sigma)
    fit  <- lm(tr_y ~ ns(tr_x, df = d))
    train_mse <- mean(residuals(fit)^2)
    pred_grid <- predict(fit, data.frame(tr_x = grid_x))
    test_y    <- f_true(grid_x) + rnorm(length(grid_x), 0, sigma)
    test_mse  <- mean((test_y - pred_grid)^2)
    list(train_mse = train_mse,
         test_mse  = test_mse,
         pred_grid = pred_grid)
  }, simplify = FALSE)
  train_mse_vec <- sapply(res, `[[`, "train_mse")
  test_mse_vec  <- sapply(res, `[[`, "test_mse")
  pred_mat      <- sapply(res, `[[`, "pred_grid")
  train_mse <- mean(train_mse_vec)
  test_mse  <- mean(test_mse_vec)
  bias2     <- mean((rowMeans(pred_mat) - f_true(grid_x))^2)
  var_hat   <- mean(apply(pred_mat, 1, var))
  err <- sigma
  c(train_mse, test_mse, bias2, var_hat, err)
}

core_fits <- function(d, tr_x, tr_y){
  fit  <- lm(tr_y ~ ns(tr_x, df = d))
  train_mse <- mean(residuals(fit)^2)
  predict(fit, data.frame(tr_x = grid_x))
}

out <- do.call(rbind, mclapply(dfs, core_fun, mc.cores = max(1, detectCores() - 1)))
colnames(out) <- c("train_mse", "test_mse", "bias2", "var_hat", "err")
df_out <- data.frame(df = dfs, out)

tr_x <- runif(n_train, -10, 10)
tr_y <- f_true(tr_x) + rnorm(n_train, 0, sigma)
out <- do.call(cbind, mclapply(dfs, core_fits, mc.cores = max(1, detectCores() - 1), tr_x = tr_x, tr_y = tr_y))
dfs_out <- data.frame(f_true(grid_x), grid_x, out)
df_train <- data.frame(tr_x = tr_x, tr_y = tr_y)
colnames(dfs_out) <- c("truth", "grid_x", paste0("df", dfs))

#save(df_out, file = "df_out.RData")
#save(dfs_out, file = "dfs_out.RData")
#save(df_train, file = "df_train.RData")


#load("df_out.RData")
#load("dfs_out.RData")
#load("df_train.RData")

truth_y <- dfs_out$truth
range_y <- range(df_out[,-1])

init_df <- dfs[1]
init_fit <- dfs_out[, paste0("df", init_df)]
init_mse <- df_out$test_mse[init_df]

fig_left <- plot_ly() |>
  add_markers(x = df_train$tr_x, y = df_train$tr_y, name = "Data", showlegend = FALSE, opacity = 0.6) |>
  add_lines(x = dfs_out$grid_x, y = dfs_out$truth, name = "Truth",
            line = list(color = "grey", dash = "dash"), showlegend = FALSE) |>
  add_lines(x = dfs_out$grid_x, y = init_fit, name = "Fit",
            line = list(color = "#FF5733"), showlegend = FALSE) 

fig_right <- plot_ly() |>
  add_lines(x = df_out$df, y = df_out$err, name = "Irreducible error",
            line = list(color = "purple")) |>
  add_lines(x = df_out$df, y = df_out$train_mse, name = "Train MSE",
            line = list(color = "black")) |>
  add_lines(x = df_out$df, y = df_out$test_mse, name = "Test MSE",
            line = list(color = "#1f77b4")) |>
  add_lines(x = df_out$df, y = df_out$bias2, name = "Bias²",
            line = list(color = "#2ca02c")) |>
  add_lines(x = df_out$df, y = df_out$var_hat, name = "Variance",
            line = list(color = "#d62728")) |>
  add_lines(x = c(init_df, init_df), y = range_y, opacity = 0.6,
            line = list(color = "grey", dash = "dot"), showlegend = FALSE) |>
  layout(legend = list(yanchor = "center", y = 0.5)) 

frames <- lapply(seq_along(dfs), function(k) {
  df_k <- dfs[k]
  list(
    name   = as.character(df_k),
    traces = c(2,8),
    data   = list(
      list(x = dfs_out$grid_x, y = dfs_out[, paste0("df", df_k)]),
      list(x = c(df_k, df_k), y = range_y)
    )
  )
})


steps <- lapply(seq_along(dfs), function(k) {
  df_k <- dfs[k]
  list(
    label  = as.character(df_k),
    method = "animate",
    args   = list(list(as.character(df_k)),
                  list(mode = "immediate",
                       frame = list(duration = 0, redraw = FALSE),
                       transition = list(duration = 0)))
  )
})

fig <- subplot(
  fig_left, fig_right,
  widths = c(0.5, 0.5),
  shareY  = FALSE,
  titleX  = TRUE,
  titleY  = TRUE
)

fig <- layout(
  fig,
  annotations = list(
    list(
      x       = 0.25,           
      y       = 1.05,            
      xref    = "paper",
      yref    = "paper",
      text    = "Example fit",
      showarrow = FALSE,
      xanchor = "center"
    ),
    list(
      x       = 0.75,            
      y       = 1.05,
      xref    = "paper",
      yref    = "paper",
      text    = "Errors",
      showarrow = FALSE,
      xanchor = "center"
    )
  ),
  sliders = list(
    list(
      active       = 0,
      steps        = steps,
      currentvalue = list(prefix = "Degrees of freedom = "),
      xanchor      = "center",
      x = 0.5,
      y            = -0.05
    )
  )
)

fig$x$frames <- frames

fig

Degrees of freedom = 1

In these stylised examples, we could easily pin down test-MSE optimal model complexity as we could compute the test MSE. In practice, we pick the complexity via cross-validation, which we will visit later in the course.

Hopefully, this interactive example, which uses real data and models, gives you a better understanding of the bias-variance tradeoff thant the stylised U-shaped MSE curves that you find in most machine learning text books.