Math 5 minutes

# The Moore-Penrose (pseudo-inverse) matrix

The Moore-Penrose inverse of a matrix is used to approximatively solve a degenerate system of linear equations.

## In short

Let $X$ be a matrix and $\vec{y}$ a vector. Consider the following equation:

When the matrix $X$ is invertible, the equation admits a unique solution:

But what if $X$ is not invertible? The next best thing to a null vector is a vector of minimal length. So let’s try and solve the following equation:

Whose solution is given by the Moore-Penrose pseudo-inverse of $X$ (when it exists):

What is this pseudo-inverse matrix?

## Setup

Let $X \in \mathbb{R}^{N\times D}$ be a matrix and $\vec{y} \in \mathbb{R}^{N}$ a vector. The following equation:

has a solution only when $\vec{y} \in \mathrm{span}(X)$, which means that $\vec{y}$ is in the linear subspace spanned by the columns of $X$.

When it is not the case, we might want an approximate solution by minimizing the norm of the residual vector:

When this norm is the euclidean 2-norm, we know from geometry that the solution to this equation is the orthogonal projection of $\vec{y}$ onto $\mathrm{span}(X)$, but let’s derive it analytically.

## Finding a solution

Recall that the 2-norm of a vector $\vec{v}$ is:

And we can minimize its square instead.

The directional derivative along a vector $\partial \vec{u}$ is (proof):

And (since the 2-norm squared is strictly convex) the minimum is reached when the directional derivatives are 0 in all direction $\partial \vec{u}$, hence:

which is the matrix form of the normal equations.

## The Moore-Penrose inverse

Rearranging the equation above, we have:

Where the matrix $X^{\dagger} = (X^{\top}X)^{-1}X^{\top}$ is called the Moore-Penrose inverse of $X$.

## Uniqueness of the solution

The matrix $X^{\top}X$ is a square matrix of dimension $D$. It is called the Gram matrix. The Moore-Penrose inverse exists when the Gram matrix is invertible.

The Gram matrix $X^{\top}X$ is invertible when the matrix $X$ has full column rank, i.e. when $\mathrm{rank}(X) = D$.

When $% $ (which necessarily happens when $% $), we are in the case of an underdetermined system and the Gram matrix is not invertible. This means that there exist several solutions. Since it is a minimization problem, we can use a numerical solver such as gradient descent or stochastic gradient descent to find one, even though in practical applications we can often reduce $D$ by removing some columns from $X$.

$\mathrm{rank}(X)$ is always at most the column rank (up to $D$) and at most the row rank (up to $N$), so it can never happen that the system is overdetermined. In other words, there is always at least one solution.

## Numerical issues

When some columns of $X$ are nearly collinear, the matrix is ill-conditioned, leading to numerical issues when solving the linear system. So it is better to use the QR factorization to solve the system.