Machine-learning 5 minutes

# Linear regressions, the probabilistic viewpoint

## General formulation

Let $(\rvx, \ry)$ be a pair of random variables on $\realvset{\sd} \times \realset$.

A linear regression model assumes that $\expectation[\ry \mid \rvx = \vx]$ is a linear function of $\vx$:

$$\expectation[\ry \mid \rvx = \vx] = \vw \cdot \vx$$

where the parameter $\vw \in \realvset{\sd}$ is unknown.

A common use case for this hypothesis is when $\ry$ is the sum of a linear signal and zero-mean random noise $\epsilon$:

$$\ry = \vw \cdot \rvx + \epsilon, \qquad \expectation[\epsilon] = 0$$
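As a sketch of this data-generating process (the dimension, "true" parameter, and noise scale below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 1000                          # input dimension and sample size (arbitrary)
w_true = np.array([1.0, -2.0, 0.5])     # hypothetical "true" parameter w
X = rng.normal(size=(n, d))             # n draws of the random vector x
noise = rng.normal(scale=0.1, size=n)   # zero-mean random noise epsilon
y = X @ w_true + noise                  # y = w . x + epsilon
```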

## Learning

Learning consists in finding an estimate for $\vw$ based on a sample $\trainset$ made of $\sn$ independent observations of $(\rvx, \ry)$:

$$\trainset = \left\{ (\vx_i, \sy_i) \right\}_{1 \le i \le \sn}$$

This can be done in two (sometimes equivalent) ways:

1. use an appropriate estimator tailored to the distribution of $\ry$; or
2. define a loss function and minimize the expected risk.

Let’s detail the second approach.

## Loss function

The “best” value for the parameter $\vw$ can be defined through a loss function:

$$\indivloss(\vw \cdot \vx, \sy)$$

which measures the error between the prediction $\vw\cdot\vx$ and the observed $\sy$: $\indivloss$ increases as $\vw\cdot\vx$ estimates $\sy$ more poorly.
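The note keeps the loss generic; a common concrete choice (an assumption here, not something the text fixes) is the squared loss:

```python
def squared_loss(prediction, target):
    """Squared error between the prediction w.x and the observed y."""
    return (prediction - target) ** 2
```

Any function that grows with the estimation error would do; the squared loss is popular because it is smooth and pairs naturally with the conditional-mean formulation of the model.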

## Risk

For a fixed value of $\vw$, the expected error in prediction is:

$$\fr(\vw) = \expectation\left[ \indivloss(\vw \cdot \rvx, \ry) \right]$$

which we will call the risk.

Ideally, we would want to minimize it:

$$\vw^\star = \arg\min_{\vw \in \realvset{\sd}} \fr(\vw)$$

But since we don’t know the distribution of $(\rvx, \ry)$, the true risk is unknown. So we need to find a proxy: we will first estimate the risk, then estimate the parameters.

## Empirical risk

For a fixed $\vw$, a natural estimate of the true risk given a sample $\sets$ is the sample average of the loss:

$$\hat{\fr}(\vw, \sets) = \frac{1}{\card{\sets}} \sum_{(\vx, \sy) \in \sets} \indivloss(\vw \cdot \vx, \sy)$$

Using the usual convergence theorems (law of large numbers, CLT, Chernoff bound) we know that:

- the empirical risk is an unbiased estimator of the true risk; and
- the error between them decreases as $\mathcal{O}(1 / \sqrt{\card{\sets}})$.
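A quick Monte Carlo check of these two properties, under the (assumed) squared loss and Gaussian noise; in that setting the true risk at the true parameter is exactly the noise variance $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 3, 0.5
w_true = rng.normal(size=d)          # hypothetical true parameter

def empirical_risk(w, X, y):
    """Sample average of the squared loss over the sample (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# At w = w_true the true risk equals the noise variance sigma^2 = 0.25;
# the empirical risk fluctuates around it, more tightly as the sample grows.
for n in (100, 10_000):
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(scale=sigma, size=n)
    print(n, empirical_risk(w_true, X, y))
```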

## Learning using the empirical risk

Given a sample $\trainset$ dedicated to training, we can compute our estimate for $\vw$:

$$\hat{\vw}_{\trainset} = \arg\min_{\vw \in \realvset{\sd}} \hat{\fr}(\vw, \trainset)$$
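With the squared loss, minimizing the empirical risk is ordinary least squares, which NumPy solves directly (the data-generating values below are again invented):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 500
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=0.1, size=n)

# argmin over w of the empirical squared-loss risk = ordinary least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough samples, `w_hat` lands close to the parameter that generated the data.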

We use $\trainset$ as a subscript to remind us that $\hat{\vw}$ depends heavily on the training sample.

The training error is the empirical risk computed on the training sample:

$$\hat{\fr}(\hat{\vw}_{\trainset}, \trainset)$$

It can’t reliably be used to estimate $\fr(\hat{\vw}_{\trainset})$ because of the way $\hat{\vw}_{\trainset}$ is computed. For instance, with extreme overfitting the training error can be $0$ while the true risk stays well above $0$.
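A minimal sketch of that failure mode, assuming the squared loss: with more parameters than training points, least squares interpolates pure noise, so the training error is (numerically) zero while the true risk is not:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 50, 20                       # more parameters than training points
X_train = rng.normal(size=(n, d))
y_train = rng.normal(size=n)        # y is pure noise: no w can truly predict it

# Minimum-norm interpolating solution: fits the training set exactly.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_error = np.mean((X_train @ w_hat - y_train) ** 2)   # essentially 0

# Monte Carlo estimate of the true risk on fresh, independent data.
X_new = rng.normal(size=(10_000, d))
y_new = rng.normal(size=10_000)
risk_estimate = np.mean((X_new @ w_hat - y_new) ** 2)     # well above 0
```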

## How does the model perform on unseen data?

To assess how our model performs on unseen data, we estimate the risk $\fr(\hat{\vw}_{\trainset})$, which is also called the generalization error.

As explained previously, we can do so using a sample $\testset$ on which $\hat{\vw}_{\trainset}$ does not depend, i.e. a sample drawn independently of $\trainset$.

Under this condition, the test error:

$$\hat{\fr}(\hat{\vw}_{\trainset}, \testset) = \frac{1}{\card{\testset}} \sum_{(\vx, \sy) \in \testset} \indivloss(\hat{\vw}_{\trainset} \cdot \vx, \sy)$$

is an unbiased estimate of the generalization error and converges to it at rate $\mathcal{O}(1 / \sqrt{\card{\testset}})$.
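Putting the pieces together as a sketch (squared loss, invented parameters): fit on a training sample, then estimate the generalization error on an independent test sample:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.2                                  # noise level, so the true risk is ~ sigma^2

def sample(n):
    """Draw n independent observations of (x, y)."""
    X = rng.normal(size=(n, d))
    return X, X @ w_true + rng.normal(scale=sigma, size=n)

X_train, y_train = sample(200)               # used to compute w_hat
X_test, y_test = sample(5_000)               # independent of w_hat

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
test_error = np.mean((X_test @ w_hat - y_test) ** 2)   # estimates the generalization error
```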

## How big should the testing sample be?

This can be answered using a Chernoff bound (here, Hoeffding’s inequality): if the loss takes values in an interval $[a, b]$, then for any $\epsilon > 0$:

$$\Pr\left( \abs{\fr(\hat{\vw}_{\trainset}) - \hat{\fr}(\hat{\vw}_{\trainset}, \testset)} \ge \epsilon \right) \le 2 \exp\left( - \frac{2 \card{\testset} \, \epsilon^2}{(b - a)^2} \right)$$

For instance, if the loss is bounded with $\abs{b - a} = 1$ and we take $\epsilon = 0.1$, the probability that the true risk $\fr(\hat{\vw}_{\trainset})$ and the test error $\hat{\fr}(\hat{\vw}_{\trainset}, \testset)$ differ by more than $0.1$ is at most $2 e^{-\card{\testset} / 50}$. When $\testset$ contains $1000$ samples, this evaluates to roughly $4 \times 10^{-9}$.
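The bound is easy to evaluate, and to invert for a required test-set size; the helper names below are made up for illustration:

```python
import math

def hoeffding_bound(n, eps, loss_range):
    """P(|true risk - test error| >= eps) <= 2 exp(-2 n eps^2 / range^2),
    for a loss bounded in an interval of width loss_range = |b - a|."""
    return 2.0 * math.exp(-2.0 * n * (eps / loss_range) ** 2)

def min_test_size(eps, delta, loss_range):
    """Smallest test-set size for which the bound above drops to delta."""
    return math.ceil(loss_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For a loss bounded in an interval of width $1$, about a thousand test points already push the probability of a $0.1$ deviation below $10^{-8}$.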