
OLS regressions from the probabilistic viewpoint

We will show that the loss function used by ordinary least-squares (OLS) stems from the statistical theory of maximum likelihood estimation applied to the normal distribution.

This fact is crucial to the model’s performance: when the noise in the data is normally distributed, minimizing the MSE loss yields the maximum likelihood estimator of the parameter.


If you are looking for an elementary introduction to OLS regression, check out our article: OLS regressions in simple terms.

This article is a sequel to our previous article: Linear regressions from the probabilistic viewpoint. There, we outline the general framework for linear regressions using loss functions. In this article, we will tackle the same goal using a different approach based on the MLE instead of a loss function.


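Let $(X, Y)$ be a pair of random variables on $\mathbb{R}^d \times \mathbb{R}$. Let $\epsilon \sim \mathcal{N}(0, \sigma^2)$ be a gaussian, zero-mean random noise, independent of $X$.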

We assume the following relationship:

$$Y = \beta^\top X + \epsilon$$

Where $\beta \in \mathbb{R}^d$ is an unknown parameter. This equation says that $Y$ is the sum of a linear signal of $X$ and some gaussian random noise.

By the properties of the normal distribution, this is equivalent to saying that $Y$, given $X = x$, is gaussian distributed with mean $\beta^\top x$:

$$Y \mid X = x \sim \mathcal{N}(\beta^\top x, \sigma^2)$$

Given an observation $x$ of $X$, our end goal is to be able to estimate:

$$E[Y \mid X = x] = \beta^\top x$$

Before we can do so, we need to estimate the value of the parameter $\beta$.
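To make the model concrete, here is a minimal sketch that simulates it, assuming the scalar case ($d = 1$) and using only the Python standard library. The function name `simulate_sample`, the uniform distribution chosen for the inputs, and the values of the slope and noise level are all illustrative assumptions, not part of the derivation.

```python
import random
import statistics

def simulate_sample(beta, sigma, n, seed=0):
    """Draw n pairs (x, y) with y = beta * x + eps, eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)   # the distribution of X is an arbitrary choice
        eps = rng.gauss(0.0, sigma)  # gaussian, zero-mean noise
        sample.append((x, beta * x + eps))
    return sample

beta_true, sigma_true = 2.0, 0.5     # hypothetical parameter values
sample = simulate_sample(beta_true, sigma_true, n=10_000)

# The residuals y - beta * x are the noise draws themselves, so their
# empirical mean and standard deviation should be close to 0 and sigma.
residuals = [y - beta_true * x for x, y in sample]
residual_mean = statistics.fmean(residuals)
residual_std = statistics.pstdev(residuals)
print(residual_mean, residual_std)
```

In a real problem the slope is unknown, so these residuals cannot be computed directly. That is why the parameter has to be estimated from the sample first.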

Maximum likelihood estimation

Let $S$ be a sample made of $n$ independent observations of $(X, Y)$:

$$S = \{(x_1, y_1), \dots, (x_n, y_n)\}$$

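We will use the maximum likelihood estimator (MLE) for $\beta$. This estimator is defined as the value $\hat{\beta}$ that maximizes the probability that our sample $S$ is produced by the generators $(X, Y)$:

$$\hat{\beta} = \arg\max_{\beta} P(S \mid \beta)$$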

Let’s compute the solution to this equation.

Since the pairs $(x_i, y_i)$ are drawn independently, the probability to observe our whole sample is the product of the probabilities to observe each pair:

$$P(S \mid \beta) = \prod_{i=1}^{n} P(x_i, y_i \mid \beta)$$

The logarithm is an increasing function, so maximizing the likelihood is the same as maximizing the log of the likelihood. Taking the logarithm is convenient since it transforms a product into a sum.
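The sum form also has a practical benefit, which the following sketch illustrates with hypothetical numbers: a product of many small densities underflows to zero in floating-point arithmetic, while the sum of their logarithms stays finite. The helper `gaussian_pdf` and the chosen values (2000 observations, standard deviation 10) are assumptions made for the illustration.

```python
import math

def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical setup: 2000 observations, each with a density of about 0.04.
densities = [gaussian_pdf(0.0, 0.0, 10.0)] * 2000

# The raw product is far below the smallest representable float.
product = math.prod(densities)

# The sum of logs carries the same information and remains finite.
log_sum = sum(math.log(p) for p in densities)

# On a small subset, where nothing underflows, both routes agree.
small_product = math.prod(densities[:5])
small_from_logs = math.exp(sum(math.log(p) for p in densities[:5]))

print(product, log_sum, small_product, small_from_logs)
```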

Thus, we want to maximize (using the function composition notation):

$$\log \circ P(S \mid \beta) = \sum_{i=1}^{n} \log \circ P(x_i, y_i \mid \beta)$$

The probability to observe the pair $(x_i, y_i)$, given the parameter $\beta$, is:

$$P(x_i, y_i \mid \beta) = P(y_i \mid x_i, \beta) \, P(x_i)$$

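Where $P(x_i)$ is the probability that the value $x_i$ is produced by our generator $X$. This probability does not depend on $\beta$ and is thus a positive constant. After taking the logarithm it only contributes the additive term $\log P(x_i)$, so we can ignore it since it won’t change the location of the maximum. So $P(x_i, y_i \mid \beta)$ reduces to $P(y_i \mid x_i, \beta)$. And the expression to maximize is:

$$\sum_{i=1}^{n} \log P(y_i \mid x_i, \beta)$$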

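Use the definition of a gaussian generator to compute $P(y_i \mid x_i, \beta)$:

$$\sum_{i=1}^{n} \log P(y_i \mid x_i, \beta) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \beta^\top x_i)^2}{2\sigma^2}\right)\right) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta^\top x_i)^2$$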

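Removing the additive constant $-\frac{n}{2}\log(2\pi\sigma^2)$ and the positive factor $\frac{1}{2\sigma^2}$, neither of which depends on $\beta$, yields:

$$-\sum_{i=1}^{n} (y_i - \beta^\top x_i)^2$$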

Maximizing this expression is the same as minimizing its negative:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - \beta^\top x_i)^2$$

Which is the exact same formula that defines the solution to OLS! Hence, the solution to an OLS regression is the MLE $\hat{\beta}$.
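The equivalence can be checked numerically. The sketch below assumes the scalar case without an intercept and uses only the Python standard library: it computes the least-squares slope in closed form, then maximizes the gaussian log-likelihood with a generic golden-section search, and compares the two. The helper names, the search interval, and the simulated data are all illustrative assumptions.

```python
import math
import random

def log_likelihood(beta, sample, sigma):
    """Gaussian log-likelihood of the sample for a scalar slope beta."""
    n = len(sample)
    sse = sum((y - beta * x) ** 2 for x, y in sample)
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - sse / (2.0 * sigma ** 2)

def ols_estimate(sample):
    """Closed-form minimizer of sum (y - beta * x)^2 (scalar x, no intercept)."""
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / sxx

def maximize(f, lo, hi, iters=100):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if f(a) < f(b):
            lo = a   # the maximum lies in [a, hi]
        else:
            hi = b   # the maximum lies in [lo, b]
    return (lo + hi) / 2.0

# Simulated data with hypothetical true values.
rng = random.Random(42)
beta_true, sigma = 2.0, 0.5
sample = []
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    sample.append((x, beta_true * x + rng.gauss(0.0, sigma)))

beta_ols = ols_estimate(sample)
beta_mle = maximize(lambda b: log_likelihood(b, sample, sigma), -10.0, 10.0)

# A wrong guess for sigma rescales the log-likelihood but keeps its maximizer.
beta_mle_other_sigma = maximize(lambda b: log_likelihood(b, sample, 3.0), -10.0, 10.0)

print(beta_ols, beta_mle, beta_mle_other_sigma)
```

The two estimates should agree up to the numerical precision of the search. The last line illustrates why the constants could be dropped in the derivation: the value assumed for the noise variance does not move the location of the maximum.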


OLS will provide us with the maximum likelihood estimator for the parameter $\beta$, provided the following key conditions are met:

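  1. Normality: $Y \mid X = x$ is generated by a gaussian generator.
  2. Linearity: $E[Y \mid X = x] = \beta^\top x$.
  3. Homoskedasticity: $\mathrm{Var}[Y \mid X = x] = \sigma^2$, the same for every $x$.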
  4. The output values are uncorrelated.

When any of these conditions is not met, OLS will produce a suboptimal estimator of $\beta$ in the best-case scenario, and complete garbage in the worst-case scenario.

In practice, we have the sample $S$ but we can’t know for sure whether those conditions are met. To assess whether OLS is a good model to use, the best we can do is find implications of each assumption that we can check graphically, and visualize the corresponding plots.

Read next: How to graphically check whether OLS is appropriate for the dataset at hand.