
# OLS regressions from the probabilistic viewpoint

We will show that the loss function used by ordinary least-squares (OLS) stems from the statistical theory of maximum likelihood estimation applied to the normal distribution.

This fact is crucial to the model’s performance: when the noise in the data is normally distributed, minimizing the MSE loss yields the maximum likelihood estimator of the parameters.

## Foreword

If you are looking for an elementary introduction to OLS regression, check out our article: OLS regressions in simple terms.

This article is a sequel to our previous article: Linear regressions from the probabilistic viewpoint. There, we outlined the general framework for linear regressions using loss functions. Here, we tackle the same goal with a different approach, based on maximum likelihood estimation (MLE) instead of a loss function.

## Setup

Let $(\rvx, \ry)$ be a pair of random variables on $\realvset{D} \times \realset$. Let $\epsilon \sim \gaussian(0, \sigma^2)$ be a gaussian, $0$-mean random noise.

We assume the following relationship:

$$\ry = \vw\cdot\rvx + \epsilon$$

where $\vw$ is an unknown parameter. This equation says that $\ry$ is the sum of a linear signal of $\rvx$ and some gaussian random noise.

By the properties of the normal distribution, this is equivalent to saying that, given $\rvx$, $\ry$ is gaussian distributed with mean $\vw\cdot\rvx$:

$$\ry \mid \rvx \sim \gaussian(\vw\cdot\rvx, \sigma^2)$$

Given an observation $\vx$ of $\rvx$, our end goal is to be able to estimate:

$$\expectation[\ry \mid \rvx = \vx] = \vw\cdot\vx$$

Before we can do so, we need to estimate the value of the parameter $\vw$.
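To make the setup concrete, here is a minimal sketch that simulates data from this model. The names `simulate`, `w_true` and `sigma`, the chosen values, and the uniform distribution of the inputs are illustrative assumptions; the model itself only constrains how $\ry$ is produced from $\rvx$.

```python
import numpy as np

def simulate(n, w_true, sigma, seed=0):
    """Draw n pairs (x, y) with y = w_true . x + zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    d = len(w_true)
    # Arbitrary choice for the inputs: the model says nothing about how x is distributed.
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    eps = rng.normal(loc=0.0, scale=sigma, size=n)  # epsilon ~ N(0, sigma^2)
    y = X @ w_true + eps
    return X, y

w_true = np.array([2.0, -1.0, 0.5])  # hypothetical "unknown" parameter
X, y = simulate(n=10_000, w_true=w_true, sigma=0.3)

# The noise y - w.x should have a mean close to 0 and a standard deviation close to sigma.
noise = y - X @ w_true
print(noise.mean(), noise.std())
```

In a real problem we only see `X` and `y`; `w_true` is exactly what the rest of the article sets out to estimate.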

## Maximum likelihood estimation

Let $\trainset$ be a sample made of $\sn$ independent observations of $(\rvx, \ry)$:

$$\trainset = \left\{ (\vx_{\si}, \sy_{\si}) \right\}_{\si = 1}^{\sn}$$

We will use the maximum likelihood estimator (MLE) for $\vw$. This estimator is defined as the value $\hat{\vw}$ that maximizes the probability that our sample is produced by the generators $(\gx, \gy)$:

$$\hat{\vw} = \arg\max_{\vw} \prob(\trainset \mid \vw)$$

Let’s compute the solution to this equation.

Since the pairs are drawn independently, the probability of observing our whole sample is the product of the probabilities of observing each pair:

$$\prob(\trainset \mid \vw) = \prod_{\si = 1}^{\sn} \prob(\vx_{\si}, \sy_{\si} \mid \vw)$$

The logarithm is an increasing function, so maximizing the likelihood is the same as maximizing the log of the likelihood. Taking the logarithm is convenient since it transforms a product into a sum.
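Beyond algebraic convenience, working with sums of logs also helps numerically. The sketch below uses made-up per-observation likelihood values (an assumption for illustration only) to show that their raw product underflows in floating point while the sum of their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-observation likelihoods: 500 small positive numbers.
p = rng.uniform(1e-3, 1e-2, size=500)

product = np.prod(p)         # far too small for float64: underflows to 0.0
log_sum = np.sum(np.log(p))  # same information, but finite

# On a handful of values, the log of the product equals the sum of the logs.
q = p[:5]
same = np.isclose(np.log(np.prod(q)), np.sum(np.log(q)))

print(product, log_sum, same)
```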

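Thus, we want to maximize (using the function composition notation):

$$\log \circ \prob(\trainset \mid \vw) = \sum_{\si = 1}^{\sn} \log \circ \prob(\vx_{\si}, \sy_{\si} \mid \vw)$$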

The probability of observing the pair $(\vx, \sy)$, given the parameter $\vw$, is:

$$\prob(\vx, \sy \mid \vw) = \prob(\sy \mid \vx, \vw) \, \prob(\vx \mid \vw)$$

Where $\prob(\vx \mid \vw)$ is the probability that the value $\vx$ is produced by our generator $\gx$. This probability does not depend on $\vw$ and is thus a positive constant. We can ignore it since it won’t change the location of the maximum. So $\prob(\vx_{\si}, \sy_{\si} \mid \vw)$ reduces to $\prob(\sy_{\si} \mid \vx_{\si}, \vw)$, and the expression to maximize is:

$$\sum_{\si = 1}^{\sn} \log \circ \prob(\sy_{\si} \mid \vx_{\si}, \vw)$$

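Use the definition of a gaussian generator to compute $\prob(\sy_{\si} \mid \vx_{\si}, \vw)$:

$$\prob(\sy_{\si} \mid \vx_{\si}, \vw) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\sy_{\si} - \vw\cdot\vx_{\si})^2}{2\sigma^2} \right)$$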
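For reference, the intermediate step can be written out explicitly. This is a sketch that assumes the usual univariate normal density with mean $\vw\cdot\vx_{\si}$ and variance $\sigma^2$. Taking the logarithm of each term and summing over the sample gives:

$$\sum_{\si = 1}^{\sn} \log \circ \prob(\sy_{\si} \mid \vx_{\si}, \vw) = -\frac{\sn}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{\si = 1}^{\sn} (\sy_{\si} - \vw\cdot\vx_{\si})^2$$

The first term does not depend on $\vw$, and $\frac{1}{2\sigma^2}$ is a positive factor, so neither affects the location of the maximum.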

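Removing the constant terms and factors yields:

$$\hat{\vw} = \arg\max_{\vw} \; -\sum_{\si = 1}^{\sn} (\sy_{\si} - \vw\cdot\vx_{\si})^2$$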

Maximizing this expression is the same as minimizing its negative:

$$\hat{\vw} = \arg\min_{\vw} \sum_{\si = 1}^{\sn} (\sy_{\si} - \vw\cdot\vx_{\si})^2$$

This is exactly the formula that defines the solution to OLS! Hence, the solution to an OLS regression is the MLE $\hat{\vw}$.
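We can check this equivalence numerically. The sketch below, on simulated data, minimizes the gaussian negative log-likelihood with plain gradient descent and compares the result with NumPy's least-squares solver. The data, the value of `sigma` (assumed known), the step size and the iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 3, 0.5
w_true = np.array([1.5, -2.0, 0.7])  # hypothetical ground truth
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w):
    """Gaussian negative log-likelihood of the sample, with sigma assumed known."""
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

def grad(w):
    """Gradient of neg_log_likelihood with respect to w."""
    return -X.T @ (y - X @ w) / sigma**2

# Route 1: maximum likelihood, by gradient descent on the negative log-likelihood.
w_mle = np.zeros(d)
step = sigma**2 / np.linalg.norm(X, 2) ** 2  # 1 / largest curvature of the objective
for _ in range(500):
    w_mle = w_mle - step * grad(w_mle)

# Route 2: ordinary least squares, solved directly.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_mle)
print(w_ols)
```

Both routes should print the same vector up to numerical precision, and that vector should be close to `w_true`. Note that the value of `sigma` only rescales the objective: it changes the step size but not the location of the optimum.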

## Conclusion

OLS will provide us with the maximum likelihood estimator for the parameter $\vw$, provided the following key conditions are met:

1. Normality: $\ry$ is generated by a gaussian generator.
2. Linearity: $\expectation\brak{\ry \mid \rvx} = \ff(\vw, \rvx)$, which in our setup is $\vw\cdot\rvx$.
3. Homoskedasticity: $\var(\ry_\si) = \sigma^2, \forall \si \leq \sn$.
4. The output values $\ry_\si$ are uncorrelated.

When any of these conditions is not met, OLS will produce a suboptimal estimator of $\expectation[\ry \mid \rvx = \vx]$ in the best-case scenario, and complete garbage in the worst-case scenario.

In practice, we have the sample $\trainset$ but we can’t know for sure whether those conditions are met. To assess whether OLS is a good model to use, the best we can do is find implications of each assumption that we can check graphically, and visualize the corresponding plots.
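As a rough illustration of what such implications look like, here is a sketch that computes two numeric summaries of the residuals, the raw material for those plots. The data is simulated so that all assumptions hold, and the two summaries (`spread_ratio`, `lag1`) are our own illustrative choices, not a complete diagnostic procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
w_true = np.array([1.0, -0.5])  # hypothetical ground truth
y = X @ w_true + rng.normal(scale=0.4, size=n)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ w_hat
residuals = y - fitted

# Homoskedasticity implies a similar residual spread for low and high fitted values.
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2 :]]
spread_ratio = low.std() / high.std()

# Uncorrelated outputs imply little correlation between consecutive residuals.
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(spread_ratio, lag1)
```

Here the ratio should be close to 1 and the correlation close to 0, since the simulated data satisfies the assumptions. Values far from these on a real dataset would be a hint that one of the conditions is violated.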

Read next: How to graphically check whether OLS is appropriate for the dataset at hand.