🚀 The Ultimate Guide for Bias-Variance Trade-off
Ardavan Modarres - linkedin.com/in/Ardavan-Modarres

📌 The Best Visual Explanation of the Bias-Variance Trade-off

Suppose we want to estimate a continuous function from a given dataset using machine learning. It can be shown that the expected squared error between our estimate and the true underlying function decomposes into three components: bias, variance, and noise.

🔹 Bias represents the gap between the average model we can learn within our chosen hypothesis space and the true function we aim to estimate. If the model is too simple, this average deviation from the target function will be large, resulting in high bias.

🔹 Variance reflects how sensitive the model is to the specific dataset used for training. High variance indicates that the learned model strongly depends on the particular training data, which leads to poor generalization to new data.
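In symbols, this is the standard squared-error decomposition (here f is the target function, h_D is the model learned from dataset D, h̄ is the average over datasets, and σ² is the irreducible noise):

```latex
\underbrace{\mathbb{E}_D\big[(h_D(x) - y)^2\big]}_{\text{expected error}}
= \underbrace{\big(\bar{h}(x) - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}},
\qquad \bar{h}(x) = \mathbb{E}_D\big[h_D(x)\big]
```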

Let me explain with a more concrete example:

🧩 Suppose we want to learn the sine function within a specific range using data generated from the sine function itself, but without knowing that the data comes from a sine function. Imagine we have only two data points for training. We try to learn the sine function once with a constant model and once with a linear model. Each time we repeat this learning process with a fresh pair of points, we obtain the model in our hypothesis class that best fits those two points.

If we repeat this experiment many times, each time randomly selecting only two points generated by the sine function, we obtain many optimal models. By computing the average of these models, we get the Expected Hypothesis, which represents the best estimate of the sine function within the chosen hypothesis space. This Expected Hypothesis can be seen in the first row of the diagram. Using the Expected Hypothesis, we can also compute the variance of the models around the mean, which is shown in the second row of the diagram.
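The experiment above can be sketched numerically. This is a minimal simulation, assuming the target sin(πx) on [-1, 1]; the trial count and test grid are illustrative choices, not from the original post:

```python
import numpy as np

def bias_variance(n_trials=20000, seed=0):
    """Repeat the two-point experiment and estimate bias^2 and variance
    for a constant and a linear hypothesis class."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(-1.0, 1.0, size=(n_trials, 2))   # two training points per trial
    ys = np.sin(np.pi * xs)                           # noiseless targets

    grid = np.linspace(-1.0, 1.0, 201)                # test inputs
    target = np.sin(np.pi * grid)

    # Constant model h(x) = b: the least-squares fit is the mean of the two y's.
    h_const = ys.mean(axis=1)[:, None] * np.ones_like(grid)

    # Linear model h(x) = a*x + b: the unique line through the two points.
    slope = (ys[:, 1] - ys[:, 0]) / (xs[:, 1] - xs[:, 0])
    intercept = ys[:, 0] - slope * xs[:, 0]
    h_lin = slope[:, None] * grid + intercept[:, None]

    results = {}
    for name, h in (("constant", h_const), ("linear", h_lin)):
        h_bar = h.mean(axis=0)                        # the Expected Hypothesis
        bias2 = np.mean((h_bar - target) ** 2)        # squared bias, averaged over x
        var = np.mean(h.var(axis=0))                  # spread of models around h_bar
        results[name] = (bias2, var)
    return results

for name, (bias2, var) in bias_variance().items():
    print(f"{name:8s} bias^2 = {bias2:.2f}, variance = {var:.2f}")
```

With this setup the constant model typically shows the larger bias but a much smaller variance than the line through two points, matching the picture described above.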

🔍 As expected, the constant model is too simple to accurately estimate the sine function—it exhibits high bias. However, its variance is low. In contrast, the linear model, which has more degrees of freedom (slope and intercept) than the constant model (intercept only), can better approximate the sine function. But with limited data, we observe that the linear model’s variance is much higher than that of the constant model.

Since the average error decomposes into bias, variance, and noise, achieving a good estimate requires a careful trade-off between bias and variance. This demonstrates that with limited data, using more complex models with higher degrees of freedom can lead to high variance, strong sensitivity to the training data, and poor generalization. The effect of increasing the number of training data points on variance can be observed in this post.
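To see the effect of training-set size, one can grow the sample and watch the linear model's variance shrink. This is a rough sketch under the same sin(πx) assumption; the sample sizes and trial count are arbitrary:

```python
import numpy as np

def linear_variance(n_points, n_trials=3000, seed=1):
    """Average pointwise variance of least-squares lines fit to n_points
    noiseless samples of sin(pi*x) on [-1, 1]."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 201)
    preds = np.empty((n_trials, grid.size))
    for t in range(n_trials):
        x = rng.uniform(-1.0, 1.0, n_points)
        y = np.sin(np.pi * x)
        slope, intercept = np.polyfit(x, y, 1)   # least-squares linear fit
        preds[t] = slope * grid + intercept
    return float(np.mean(preds.var(axis=0)))

for n in (2, 5, 20, 100):
    print(f"n = {n:3d}: variance ~ {linear_variance(n):.3f}")
```

The variance drops quickly as more points pin down the fitted line, which is why complex models become safer as data grows.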


🤗🏆 I hope this explanation is helpful. If you’d like, you can join our channel to receive more insightful content.

Follow us for daily doses of AI 👊😉.
@NeuralBlackMagic
