Machine-learning 5 minutes

Linear regressions, the probabilistic viewpoint

General formulation

Let be a pair of random variables on .

A linear regression model assumes that is a linear function of :

Where the parameter is unknown.

A common use-case for this hypothesis is when is the sum of a linear signal and -mean random-noise :

Learning

Learning consists in finding an estimate for based on a sample made of independent observations of :

This can be done in two (sometimes equivalent) ways:

  1. use an appropriate estimator tailored to the distribution of ; or
  2. define a loss function and minimize the expected risk.

Let’s detail the second approach.

Loss function

The “best” value for the parameter can be defined through a loss function:

which measures the error between and . Thus, increases with how wrongly estimates .

Risk

For a fixed value of , the expected error in prediction is:

Which we will name the risk.

Ideally, we would want to minimize it:

But since we don’t know the distribution of , the true risk is unknown. So we need to find a proxy: we will first estimate the risk, then estimate the parameters.

Empirical risk

For a fixed , the best estimate for the true risk given a sample is the sample average of the loss:

Using the usual convergence theorems (law of large number, CLT, Chernoff bound) we know that:

  • the empirical risk is an unbiased estimator for the true risk; and
  • the error between them decreases as .

Learning using the empirical risk

Given a sample dedicated to training, we can compute our estimate for :

We use as subscript to reminds us that greatly depends on the training-sample.

The training-error is the empirical risk computed on the training sample:

It can’t reliably be used to estimate because of the way is computed. For instance, with extreme overfiting it can happen that the training error is , but the real risk way above .

How does the model performs on unseen data?

To assess how our model performs on unseen data, we estimate the risk , which is also called the generalization error.

As explained previously, we can do so using a sample on which does not depend. i.e.

Under this condition, the test-error:

is an unbiased estimate of the generalization error and converges at the speed of .

How big should the testing sample be?

This can be answered using a Chernoff bound:

If we assume for instance that , the probability that the true risk and the empirical risk differ by more that is less than . When the contains samples, this evaluates to