Machine learning · 5 minutes

The bias-variance-noise decomposition

The MSE loss is attractive because the expected prediction error can be decomposed into the squared bias and the variance of the model, plus the variance of the noise. This is called the bias-variance-noise decomposition. In this article, we introduce this decomposition using the tools of probability theory.

In short, when $\ry = \ff(\rvx) + \epsilon$, the bias-variance-noise decomposition is:

$$\expectation\left[(\ry - \hat{\sy}_{\trainset})^2\right] = \underbrace{\left(\ff(\vx) - \expectation_{\sets}[\ff_{\trainset}(\vx)]\right)^2}_{\text{bias}^2} + \underbrace{\expectation_{\sets}\left[\left(\ff_{\trainset}(\vx) - \expectation_{\sets}[\ff_{\trainset}(\vx)]\right)^2\right]}_{\text{variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{noise}}$$

Notations

Let $(\rvx, \ry)$ be a pair of random variables on $\realvset{\sd} \times \realset$.

Assume there exists a $0$-mean random noise $\epsilon$, independent of $\rvx$, and a function $\ff$ such that:

$$\ry = \ff(\rvx) + \epsilon$$

The goal of a regression is to use a sample $\trainset$ to estimate this function:

$$\ff_{\trainset} \approx \ff$$

For instance, in a linear regression the function $\ff$ is a linear function with parameter $\vw$:

$$\ff(\vx) = \vw^\top \vx$$

And the regression aims at estimating $\vw$ from the training set:

$$\ff_{\trainset}(\vx) = \vw_{\trainset}^\top \vx$$
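As a concrete illustration, here is a minimal sketch of estimating $\vw$ from a sample. The article does not fix a particular estimator, so we assume ordinary least squares; the dimensions, sample size, and noise scale are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 200
w_true = rng.normal(size=d)          # the unknown parameter w (assumption: Gaussian)
X = rng.normal(size=(n, d))          # training inputs x_i
eps = rng.normal(scale=0.1, size=n)  # 0-mean noise
y = X @ w_true + eps                 # y = f(x) + eps with f linear

# Least-squares estimate of w from the training set
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)
print(w_true)
```

With this much data and little noise, `w_hat` lands close to `w_true`; the gap between the two is exactly the estimation error that the derivation below analyzes in expectation.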

Once the function $\ff_{\trainset}$ is estimated, we can measure the error between a prediction $\hat{\sy}_{\trainset} = \ff_{\trainset}(\vx)$ and the true value $\sy$:

$$\left(\sy - \hat{\sy}_{\trainset}\right)^2$$

The expected error in prediction, at a fixed input $\vx$, is:

$$\expectation\left[(\ry - \hat{\sy}_{\trainset})^2\right] = \expectation\left[\left(\ff(\vx) + \epsilon - \ff_{\trainset}(\vx)\right)^2\right]$$

Define $A = \ff(\vx) - \ff_{\trainset}(\vx)$ as a shorthand, so the expected error becomes:

$$\expectation\left[(\ry - \hat{\sy}_{\trainset})^2\right] = \expectation\left[(A + \epsilon)^2\right]$$

$A$ does not depend on $\epsilon$ and $\epsilon$ does not depend on $\sets$, so the expectation of the cross term factorizes:

$$\expectation\left[(A + \epsilon)^2\right] = \expectation[A^2] + 2\,\expectation[A]\,\expectation[\epsilon] + \expectation[\epsilon^2]$$
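This expansion is easy to check numerically. The sketch below draws an arbitrary $A$ (its distribution is an assumption, not from the article) independently of a $0$-mean $\epsilon$, and compares both sides by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
A = rng.uniform(-1, 2, size=n)       # any distribution works, as long as it is
eps = rng.normal(scale=0.5, size=n)  # ...independent of the 0-mean noise eps

# E[(A + eps)^2] vs. E[A^2] + 2 E[A] E[eps] + E[eps^2]
lhs = np.mean((A + eps) ** 2)
rhs = np.mean(A ** 2) + 2 * A.mean() * eps.mean() + np.mean(eps ** 2)
print(lhs, rhs)  # equal up to Monte Carlo error
```

Independence is what lets $\expectation[A\epsilon]$ split into $\expectation[A]\expectation[\epsilon]$; with correlated $A$ and $\epsilon$ the two sides would differ by twice their covariance.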

Recall that $\expectation[\epsilon] = 0$, so the cross term vanishes:

$$\expectation\left[(A + \epsilon)^2\right] = \expectation[A^2] + \expectation[\epsilon^2]$$

Since $\epsilon$ is a $0$-mean noise, its second moment is exactly its variance:

$$\expectation[\epsilon^2] = \expectation[\epsilon^2] - \expectation[\epsilon]^2 = \mathrm{Var}[\epsilon]$$

Hence, since the only randomness left in $A$ comes from the training set $\sets$:

$$\expectation\left[(\ry - \hat{\sy}_{\trainset})^2\right] = \expectation_{\sets}[A^2] + \mathrm{Var}[\epsilon]$$

The term $\expectation_{\sets}[A^2]$ is exactly the error in estimation between $\ff$ and $\ff_{\trainset}$. We can express it using the bias-variance decomposition:

$$\expectation_{\sets}[A^2] = \underbrace{\left(\ff(\vx) - \expectation_{\sets}[\ff_{\trainset}(\vx)]\right)^2}_{\text{bias}^2} + \underbrace{\expectation_{\sets}\left[\left(\ff_{\trainset}(\vx) - \expectation_{\sets}[\ff_{\trainset}(\vx)]\right)^2\right]}_{\text{variance}}$$

Finally, the bias-variance-noise decomposition is:

$$\expectation\left[(\ry - \hat{\sy}_{\trainset})^2\right] = \underbrace{\left(\ff(\vx) - \expectation_{\sets}[\ff_{\trainset}(\vx)]\right)^2}_{\text{bias}^2} + \underbrace{\expectation_{\sets}\left[\left(\ff_{\trainset}(\vx) - \expectation_{\sets}[\ff_{\trainset}(\vx)]\right)^2\right]}_{\text{variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{noise}}$$
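The decomposition can be verified empirically by resampling training sets. The sketch below is one possible setup, not from the article: we assume $\ff(x) = \sin(x)$ on scalar inputs, Gaussian noise, and a degree-1 polynomial fit, so the model is deliberately biased. We estimate each of the three terms at a fixed test point and compare their sum to the directly-measured expected squared error.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    return np.sin(x)          # the true regression function (assumption)

sigma = 0.3                   # noise standard deviation
n_train, n_sets = 30, 2000
x0 = 2.0                      # fixed test point x

# Draw many training sets S and record each fit's prediction f_S(x0).
preds = np.empty(n_sets)
for s in range(n_sets):
    x = rng.uniform(0, np.pi, size=n_train)
    y = f(x) + rng.normal(scale=sigma, size=n_train)
    coeffs = np.polyfit(x, y, deg=1)   # linear fit: a deliberately biased model
    preds[s] = np.polyval(coeffs, x0)

bias2 = (f(x0) - preds.mean()) ** 2    # squared bias at x0
variance = preds.var()                 # variance of f_S(x0) over training sets
noise = sigma ** 2                     # Var(eps)

# Expected squared error at x0, estimated directly with fresh noise draws
y0 = f(x0) + rng.normal(scale=sigma, size=n_sets)
mse = np.mean((y0 - preds) ** 2)

print(mse, bias2 + variance + noise)   # the two sides should agree closely
```

Because the linear model cannot represent $\sin$, the bias term here is large; swapping in a higher-degree fit would shrink the bias but inflate the variance, which is the usual trade-off this decomposition makes visible.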