Machine-learning 5 minutes

# What is stochastic gradient descent (SGD)?

Stochastic gradient descent is an algorithm that tries to find the minimum of a function expressed as a sum of component functions. It does so by choosing a component function at random then following its gradient.

## Sum functions

Suppose $f$ is expressed as the mean of $N$ component functions $f_n$:

## The gradient

The gradient $\nabla f(x)$ of $f$ at $x$ is a vector directed along the steepest ascent at $x$. We can try to find the minimum of $f$ by taking iterative steps in the direction of $-\nabla f$. This is what the gradient descent algorithm do.

Sometimes, computing the gradient of $f$ is too computationally expensive. Stochastic gradient descent reduces the computational cost by using the gradient $\nabla f_n$ of one component function instead.

## Stochastic gradient descent algorithm

The stochastic gradient descent algorithm aims at finding the minimum of $f = (1/N)\sum_{n=1}^{N}f_n$ by taking successive steps directed along $- \nabla f_n$, where at each step $n$ is chosen at random.

def stochastic_gradient_descent(f, x_0, max_steps, step_size):
"""
f is a list such that f[n-1]
is the n-th component function
"""
x = x_0
for step in range(max_steps):
# Choose the component function at random
n = random.randint(0, len(f)-1)
# Update step
x = x - step_size * gradient(of=f[n], at=x)
return x


## Comparison with gradient descent

The update rule for gradient descent at step t is:

While the update rule for stochastic gradient descent is:

 $x_{t+1} = x_{t} - \gamma\nabla f_n(x_{t})$ n is random

## Is it really faster than gradient descent?

For $I$ iterations, and $N$ training examples, gradient descent computes $I*N$ gradients while stochastic gradient descent computes only $I$ gradients. When $N >> I$, this yields a good optimization.

In practice SGD is worth a shot when we have a very large number of training examples. In machine learning, it often happens that we have hundreds of thousands, or millions of training examples (i.e. $N > 10^6$) and that SGD still converges after only a few hundreds of iterations (i.e. $% $).

## Why it works

This algorithm works because $\nabla f_n(x)$ is an unbiased estimate of $\nabla f(x)$ that is much simpler to compute.

### Computational cost

The gradient of a sum is the sum of the gradients, so:

Which explains why $\nabla f_n$ is computationaly cheaper than $\nabla f$.

### Correctness

Furthermore, $\nabla f_n$ is an unbiased estimate for $\nabla f$, indeed:

Where $\mathbb{E}_n$ is the expectation taken over dthe distribution of $n$ (which is chosen randomly, remember).

## Mini-batch SGD

There is a variant of stochastic gradient descent called mini-batch, where instead of using only one component function, several components functions are used. These component functions are chosen at random at each step. $\newcommand{\card}{\left|#1\right|}$

In this version, instead of simply choosing one index $n$ at random, a subset $B \subseteq \left[1; N\right]$ is chosen at random. The update step of the algorithm is the same where $f_n$ is replaced by $\frac{1}{\card{B}}\sum_{n \in B} f_n$.

### Comparison with SGD

The update rule for SGD at step t is:

 $x_{t+1} = x_{t} - \gamma\left[\nabla f_n(x_{t})\right]$ n is random

While the update rule for mini-batch SGD at step t is:

 $x_{t+1} = x_{t} - \gamma\left[\frac{1}{\card{B}}\sum_{n \in B}\nabla f_n(x_{t})\right]$ B is random

### The mini-batch algorithm

def minibatch_SGD(f, x_0, max_steps, step_size):
"""
f is a list such that f[n-1]
is the n-th component function
"""
x = x_0
for step in range(max_steps):
# Choose B=[a;b] at random
a = random.randint(0, len(f)-1)
b = random.randint(a, len(f)-1)
# Compute the sum of gradients
sigma = sum([gradient(of=f[n], at=x) for n in range(a, b+1)])
# Update step
x = x - step_size * 1/(b-a+1) * sigma
return x