Department of Statistics London School of Economics and Political Science (LSE)
Published
12 September 2025
Bias-variance tradeoff
This note provides a very brief visualisation of the bias-variance tradeoff that we discussed in lecture 1. In almost all machine learning courses you will come across the MSE expansion that motivates the bias-variance tradeoff and a picture of a U-shaped graph. I always found that these things are best understood dynamically, not just from one bowl-shaped picture. So below we will explore the bias-variance tradeoff with some animations. I have marked parts that I do not expect you to know or replicate yourselves with a 🏔️-symbol (understanding or deriving them is nonetheless a great exercise). I have also attached the R code that I used to generate these visuals, so feel free to explore!
A first look
Before we dive into the details of the bias-variance tradeoff, let us start out with a visual:
We are trying to learn \(f(x) = 1 - x^2\), which is a parabola shaped curve, from some data \(X\), \(Y = 1 - X^2 + \varepsilon\), where the response variables are contaminated by some additive noise. Ultimately we want to make a prediction at a new point \(x_0\).
Below you can see two fits to a training data sample that come from the same class of algorithms. The only thing that differs is the complexity (think flexibility) of each predictor.
This plot hints at two concepts that we will explore in this note and that are central to the bias-variance tradeoff: (1) underfitting: the linear model fails to learn \(f(x)\) due to its modelling constraints, resulting in poor prediction accuracy. (2) overfitting: the very wiggly fit fails to capture \(f(x)\) as it pays too much attention to the noise, resulting in very volatile predictions.
We will quantify these observations and relate them to overall predictive performance in this note.
MSE decomposition
Before exploring the bias-variance tradeoff on a model, let us derive the bias-variance decomposition of the MSE. This is a machine learning course after all. For this, recall our modelling assumption:
\[
Y = f(X) + \varepsilon\,,
\tag{1}\]
with \(\mathrm{E}[\varepsilon \mid X] = 0\), \(\mathrm{Var}(\varepsilon \mid X) = \sigma^2\). We train our estimator \(\hat{f}(x)\) on training data \((Y_1, X_1), \ldots, (Y_n, X_n)\) and evaluate it on a new point \(x_0\). All expectations/variances are with respect to the randomness of the training sample (and, for the MSE, of the new response \(Y\)) and conditional on \(X = x_0\).
We want to show that \[\mathrm{MSE}(x_0)
= \mathrm{E}\left[(Y-\hat f(x_0))^2 \right] = \underbrace{\mathrm{Var}\!\left(\hat f(x_0)\right)}_{\text{variance}}
+ \underbrace{\left(\mathrm{E}\left[\hat f(x_0) \right]-f(x_0)\right)^2}_{\text{bias}^2}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}\,. \tag{2}\]
Try and see if you can derive the bias-variance decomposition of the MSE yourself. (🏔️; Hint: Plug the modelling assumption Equation 1 into Equation 2 and start expanding the square.)
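If you get stuck, here is one possible route (🏔️; a sketch only, which additionally assumes that the noise \(\varepsilon\) of the new observation is independent of the training sample; the shorthand \(\mu = \mathrm{E}[\hat f(x_0)]\) is introduced here for convenience). Using \(Y = f(x_0) + \varepsilon\), split the prediction error into three pieces:
\[
Y - \hat f(x_0) = \varepsilon + \left(f(x_0) - \mu\right) + \left(\mu - \hat f(x_0)\right)\,.
\]
Squaring and taking expectations, all three cross terms vanish, because \(\mathrm{E}[\varepsilon] = 0\), \(\varepsilon\) is independent of \(\hat f(x_0)\), \(f(x_0) - \mu\) is a constant, and \(\mathrm{E}[\mu - \hat f(x_0)] = 0\). What remains is
\[
\mathrm{E}\left[(Y-\hat f(x_0))^2 \right]
= \mathrm{E}\left[\varepsilon^2\right] + \left(f(x_0) - \mu\right)^2 + \mathrm{E}\left[\left(\mu - \hat f(x_0)\right)^2\right]
= \sigma^2 + \left(\mathrm{E}\left[\hat f(x_0) \right]-f(x_0)\right)^2 + \mathrm{Var}\!\left(\hat f(x_0)\right)\,,
\]
which is Equation 2.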
The story with Equation 2 typically goes like: “If we increase the model complexity, our predictions get better on average, so bias decreases. At the same time, however, our model becomes more prone to overfitting, thus increasing the variance of our predictions.” Hence, following this heuristic, the choice of model complexity is a balancing act between its accuracy (bias) and its generalizability to unseen data (variance).
However, it may not be immediately clear how this heuristic derives from the MSE-decomposition of Equation 2. So let us first understand what the bias and variance terms therein really mean.
First, let us recall that \(\hat f(x)\) is a function that we learned using the training data. Thus, it is a function of the random variables \((Y_1, X_1), \ldots, (Y_n, X_n)\) and therefore a random variable itself. The randomness over which we take expectations in the terms \(\mathrm{Var}(\hat f(x_0))\) and \((\mathrm{E}[\hat f(x_0)] - f(x_0))^2\) pertains to \(\hat{f}(x_0)\) (as this is the only random variable involved in the expressions), and thus the expectations are with respect to the training data.
Hence, loosely speaking, the bias term \((\mathrm{E}[\hat f(x_0)] - f(x_0))^2\) answers the following question: If we consider many datasets to train \(\hat{f}\) on, how much will our prediction \(\hat{f}(x_0)\) differ from our target \(f(x_0)\) on average? (🏔️; This can be made precise by the law of large numbers, which tells us that, under regularity conditions, \(\frac{1}{m} \sum_{i = 1}^m \hat{f}^i(x_0)\) converges to \(\mathrm{E}[\hat{f}(x_0)]\) almost surely as the number of training samples \(m\) goes to \(\infty\). Here, \(\hat{f}^i\) is the prediction function that we trained on the \(i\)th training sample, and we write \(m\) to avoid a clash with the size \(n\) of each training sample.) Bias therefore is a measure of our prediction accuracy, averaged across different training samples.
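To make the "average over many training samples" idea concrete, here is a minimal Python sketch (not the R code attached to this note). It borrows the simulation setup used later in this note, \(X \sim \mathrm{U}(-2,2)\), \(\varepsilon \sim \mathrm{N}(0,1)\), \(x_0 = 1/2\), and takes a plain linear fit as the estimator. The training sample size `n_obs = 50`, the number of simulated training samples `n_sets = 2000`, the seed and all function names are assumptions made for illustration.

```python
import random

def f(x):
    """True regression function f(x) = 1 - x^2."""
    return 1.0 - x * x

def fit_line_and_predict(xs, ys, x0):
    """Ordinary least squares for y = a + b * x, evaluated at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar + slope * (x0 - x_bar)

def average_prediction(n_sets, n_obs, x0, rng):
    """Average the prediction at x0 over n_sets independent training samples."""
    total = 0.0
    for _ in range(n_sets):
        xs = [rng.uniform(-2.0, 2.0) for _ in range(n_obs)]
        ys = [f(x) + rng.gauss(0.0, 1.0) for x in xs]
        total += fit_line_and_predict(xs, ys, x0)
    return total / n_sets

rng = random.Random(0)  # fixed seed (assumption) for reproducibility
x0 = 0.5
mean_pred = average_prediction(n_sets=2000, n_obs=50, x0=x0, rng=rng)
bias = mean_pred - f(x0)  # Monte Carlo estimate of E[f_hat(x0)] - f(x0)
print(f"average prediction: {mean_pred:.3f}, estimated bias: {bias:.3f}")
```

Under this design the best linear approximation to \(1 - x^2\) is the constant \(-1/3\), so the average prediction should land near that value, well below the target \(f(x_0) = 0.75\). Increasing `n_sets` only makes the estimate of the bias more precise; it does not make the bias itself smaller.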
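However, bias does not tell the full story: the variance term \(\mathrm{Var}(\hat f(x_0))\) measures how much our prediction \(\hat f(x_0)\) fluctuates from one training sample to the next.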
Visualising the bias-variance tradeoff
You may start to understand how the bias-variance tradeoff is tied to model complexity. If a model is very complex in that it can fit a given training data set very well, then across many training samples, we will on average do a pretty good job at predicting \(f(x_0)\). However, because we fit any given training data set very well, we will also start fitting the noise, rather than the underlying function \(f(x)\). Thus, our predictions will depend more heavily on the training data at hand (low bias, high variance).
Conversely, if we have a very crude prediction function with limited capabilities to model \(f(x)\), we might expect poorer predictions on average, as we are not able to capture the structure of \(f(x)\). At the same time, because our model is very simple, we might expect it to be less impressed by different training data. It may give poor predictions, but it does so consistently (high bias, low variance).
It should be very intuitive that the sweet spot lies somewhere in the middle of these extremes, and that is exactly what the MSE decomposition of Equation 2 tells us heuristically.
Now let us see this play out with an example. We let
\[
\begin{aligned}
Y &= 1 - X^2 + \varepsilon \\
\end{aligned} \,,
\] with \(X \sim \mathrm{U}(-2,2)\), \(\varepsilon \sim \mathrm{N}(0, 1)\) and \(X \perp \varepsilon\) and consider the prediction point \(x_0 = 1/2\).
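As a minimal sketch of this data-generating process (in Python, not the R code attached to this note), one training sample can be drawn as follows. The sample size `n = 50`, the seed and the function names are assumptions for illustration; the note does not state the training sample size.

```python
import random

def f(x):
    """True regression function f(x) = 1 - x^2."""
    return 1.0 - x * x

def draw_training_sample(n, rng):
    """Draw n pairs (x, y) with X ~ U(-2, 2) and Y = f(X) + eps, eps ~ N(0, 1)."""
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, 1.0) for x in xs]  # noise drawn independently of x
    return xs, ys

rng = random.Random(1)                  # fixed seed (assumption)
xs, ys = draw_training_sample(50, rng)  # n = 50 is an assumed sample size
x0 = 0.5
print(f"target at the prediction point: f(x0) = {f(x0)}")  # prints 0.75
```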
Next, we fit the training data with regression splines using the R package splines. You can think of them as curves that can fit the data very flexibly. How flexible they are, i.e. how closely they follow the data, is controlled by the parameter df, which stands for degrees of freedom. For the sake of illustration, we choose three parameterisations: (1) a crude fit, which is just a linear fit to the training data, (2) a good fit, which is a quadratic fit to the training data, and (3) an overfit, which is a natural cubic spline fit with \(20\) degrees of freedom.
We can see that the crude fit does not learn the structure of \(f(x)\) very well. Similarly, the overfitted model is so preoccupied with achieving a good fit on the training data, that the fitted curve \(\hat f(x)\) does not capture much of \(f(x)\).
Now we said that bias and variance are quantities that are averaged across all possible training samples. Hence, to get an idea about their behaviour, let us look at the fits for a multitude of different training samples. For this, I simulate a bunch (\(1000\)) of training samples, fit the three models, and record the test MSE, the squared bias and the variance. In the figure you can see an animation of every \(10\)th model fit. At the bottom of the figure, I also include estimates of test MSE, squared bias and variance that I computed using all samples up to the point that is shown in the animation.
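If you want to experiment without the animation, the following Python sketch mimics the simulation in a simplified form. It is not the R code attached to this note, and it makes several assumptions that are not taken from the note: the training sample size `n = 50`, the seed, all function names, and, most importantly, the flexible model. Instead of a natural cubic spline with \(20\) degrees of freedom, the sketch uses a piecewise-constant fit on \(20\) equal-width bins as a stand-in, because it needs nothing beyond the standard library.

```python
import random
import statistics

def f(x):
    """True regression function f(x) = 1 - x^2."""
    return 1.0 - x * x

def solve(a, b):
    """Solve a small linear system a z = b (Gaussian elimination, partial pivoting)."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, k))
        z[r] = (m[r][k] - tail) / m[r][r]
    return z

def poly_predict(xs, ys, x0, degree):
    """Least-squares polynomial fit via the normal equations, evaluated at x0."""
    powers = range(degree + 1)
    gram = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    coef = solve(gram, rhs)
    return sum(c * x0 ** i for i, c in enumerate(coef))

def bin_index(x, bins):
    """Index of the equal-width bin on [-2, 2] that contains x."""
    return min(int((x + 2.0) * bins / 4.0), bins - 1)

def bin_predict(xs, ys, x0, bins=20):
    """Stand-in for the overfit: mean response in the bin containing x0."""
    target = bin_index(x0, bins)
    in_bin = [y for x, y in zip(xs, ys) if bin_index(x, bins) == target]
    # Fall back to the overall mean if no training point lands in the bin.
    return statistics.fmean(in_bin) if in_bin else statistics.fmean(ys)

def simulate(n_sets=1000, n=50, x0=0.5, seed=0):
    """Estimate test MSE, squared bias and variance at x0 for three fits."""
    rng = random.Random(seed)
    models = {
        "crude (linear)": lambda xs, ys: poly_predict(xs, ys, x0, 1),
        "good (quadratic)": lambda xs, ys: poly_predict(xs, ys, x0, 2),
        "overfit (20 bins)": lambda xs, ys: bin_predict(xs, ys, x0),
    }
    preds = {name: [] for name in models}
    sq_errs = {name: [] for name in models}
    for _ in range(n_sets):
        xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
        ys = [f(x) + rng.gauss(0.0, 1.0) for x in xs]
        y_new = f(x0) + rng.gauss(0.0, 1.0)  # fresh test response at x0
        for name, fit in models.items():
            p = fit(xs, ys)
            preds[name].append(p)
            sq_errs[name].append((y_new - p) ** 2)
    return {
        name: {
            "mse": statistics.fmean(sq_errs[name]),
            "bias2": (statistics.fmean(preds[name]) - f(x0)) ** 2,
            "variance": statistics.pvariance(preds[name]),
        }
        for name in models
    }

results = simulate()
for name, r in results.items():
    print(f"{name:18s} MSE={r['mse']:.2f}  bias^2={r['bias2']:.2f}  var={r['variance']:.2f}")
```

For each model, the estimated MSE should roughly equal the estimated variance plus squared bias plus \(\sigma^2 = 1\), up to Monte Carlo error, which is Equation 2 at work. With these assumed settings, the linear fit should stand out through its squared bias and the binned fit through its variance.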