## Introduction to PAC Learning

What is “learning”, and do we have a formal model for it? I’ve decided to dive into the theoretical underpinnings of machine learning, so here’s a quick introduction to...

## Understanding p-values

Hypothesis testing and p-values are often misused and misunderstood. In this article, I explain what a p-value is, and how to use it.

## Introduction to hypothesis testing

We introduce the basic vocabulary required to understand hypothesis testing and define the p-value.

## The maximum likelihood estimator (MLE)

The maximum likelihood estimator is one of the most used estimators in statistics. In this article, we introduce this estimator and study its properties.

## MLE: an information theory viewpoint

We show that the MLE is obtained by minimizing the KL-divergence from an empirical distribution and interpret what it means.
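
In symbols (a quick sketch of the idea, not taken from the article itself): writing $\hat{p}$ for the empirical distribution of a sample $x_1, \dots, x_n$, maximizing the average log-likelihood is the same as minimizing the KL-divergence from $\hat{p}$ to the model $p_\theta$, since the entropy of $\hat{p}$ does not depend on $\theta$:

$$
\hat{\theta}_{\mathrm{MLE}}
= \arg\max_{\theta} \frac{1}{n}\sum_{i=1}^{n} \log p_{\theta}(x_i)
= \arg\min_{\theta} D_{\mathrm{KL}}\!\left(\hat{p} \,\middle\|\, p_{\theta}\right)
$$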

## Introduction to statistical estimators

In this article we define what an estimator is. We focus on the theory to compare and assess estimators, rather than how to find one.

## What is ridge regression?

A ridge regression is an OLS regression that uses L2-regularization.
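
As a minimal illustration (my own toy example, not code from the article): the L2 penalty $\lambda \lVert \vw \rVert^2$ changes the OLS normal equations to $(X^\top X + \lambda I)\,\vw = X^\top \vy$, which shrinks the estimated coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # design matrix: one row per example
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0                               # regularization strength (hyperparameter)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The penalty shrinks the coefficients toward zero.
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```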

## The effect of L2-regularization

In this article, we discuss the impact of L2-regularization on the estimated parameters $\vw$ of a linear model.

## What is regularization?

Regularization is a semi-automated method to manage overfitting. The core idea is to penalize model complexity.

## Quote of the day

The problem of fitting a model to data differs from the problem of finding patterns that generalize to new data.

## What is a polynomial regression?

A polynomial regression is a linear regression where the input vectors $\vx$ have been preprocessed using polynomial basis expansion.

## Polynomial basis expansion

Polynomial basis expansion, also called polynomial features augmentation, is a machine-learning preprocessing step. It consists of adding powers of the input’s components to the input vector.
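
A hypothetical one-column example (mine, not the article’s): each input $x$ is expanded into $(x, x^2, x^3)$.

```python
import numpy as np

def poly_expand(X, degree):
    """Replace each column of X with its powers 1..degree, side by side."""
    return np.hstack([X**d for d in range(1, degree + 1)])

X = np.array([[2.0], [3.0]])
X_poly = poly_expand(X, 3)
# Each row now holds x, x^2, x^3: [2, 4, 8] and [3, 9, 27].
assert np.allclose(X_poly, [[2.0, 4.0, 8.0], [3.0, 9.0, 27.0]])
```

A linear model fitted on `X_poly` is then a polynomial model in the original input.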

## How to assess an OLS regression?

We’ve just fitted an OLS regression to our training set. How do we assess whether it was a good model to use? We will answer this question from the point of view...

## The MSE loss

The mean squared error loss quantifies the error between a target variable $\vy$ and an estimate $\hat{\vy}$ for its value.
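
As a quick illustration (my own example): the MSE averages the squared componentwise errors.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between targets y and estimates y_hat."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])
# Errors are (0, 0, -2), so the MSE is (0 + 0 + 4) / 3.
assert np.isclose(mse(y, y_hat), 4 / 3)
```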

## The bias-variance-noise decomposition

The MSE loss is attractive because the expected error in prediction can be explained by the bias and variance of the model and the variance of the noise. This is...

## The bias-variance decomposition

The MSE loss is attractive because the expected error in estimation can be explained by the bias and the variance of the model. This is called the bias-variance...

## OLS regressions in simple terms

A least-squares regression, often called ordinary least squares (OLS), is a linear regression model that uses the mean squared error loss function (MSE loss).

## OLS regressions from the probabilistic viewpoint

We will show that the loss function used by ordinary least-squares (OLS) stems from the statistical theory of maximum likelihood estimation applied to the normal distribution.

## Vector notation for linear regressions

A linear regression attempts to estimate an output value using a linear function. Those functions can be expressed concisely using vector notation. In this article, we define...

## Linear regressions in simple terms

A linear regression is a model used to predict the value of a (continuous) variable.

## Train error and test error

The train error is the error committed by a machine-learning model on the dataset it was trained on. The test error is the error committed on another dataset...

## What is a target value?

In machine learning, a target value is an output value of a supervised learning problem.

## What is a residual?

The residual $\ve$ is the error vector between the true output vector $\vy$ and its estimate $\hat{\vy}$: $\ve = \vy - \hat{\vy}$.

## What is a hyperparameter?

A hyperparameter is a parameter of the machine-learning algorithm itself. While the parameters are learned by the algorithm, a hyperparameter dictates how the algorithm learns.

## What is a feature?

In machine learning, a feature is an input variable of a supervised learning problem.

## What is the design matrix?

Given our usual dataset made of input vectors and output values, the design matrix is the matrix whose rows are the input vectors.
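
A small sketch (my own example): stacking the input vectors as rows gives a matrix of shape (number of examples, number of features).

```python
import numpy as np

# Three input vectors with two features each.
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
x3 = np.array([5.0, 6.0])

# The design matrix has one input vector per row.
X = np.vstack([x1, x2, x3])
assert X.shape == (3, 2)
```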

## Why there is more to classification than discrete regression

In a classification problem, the dataset $\dataset$ consists of $\ndataset$ pairs of input vectors $\ninputvec{\idataset}$ and discrete labels $\ioutputval{\idataset}$:

## What is a statistic and why do we care?

In this article, we explain that a statistic is a way of compressing information contained in the data, and we show how it can be used for inference.

...

## Derivative, Gradient and Jacobian unified

A summary about scalar and vector derivatives.

## What is logistic regression?

A logistic regression is a generalized linear model tailored to classification. In this article, we introduce this regression and explain its origin.

## What is a generalized linear model?

To understand what a generalized linear model does, let’s look back at linear models.

## Regression with squared error loss

In this article we study the solution to a regression with squared error loss. We start with the theoretical formulation before tackling the problem in practice.

## The geometry of the normal equations

In this article, I show that the normal equations define the orthogonal projection of a vector onto a linear subspace.

## Understanding and solving the normal equations

The normal equations arise in several branches of mathematics, from statistics to geometry. In this article, we discuss how they emerge and how to solve them.

## The Moore-Penrose (pseudo-inverse) matrix

The Moore-Penrose inverse of a matrix is used to approximately solve a degenerate system of linear equations.
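
A minimal sketch (my own example): for an overdetermined system $X\vw = \vy$ with no exact solution, the pseudo-inverse yields the least-squares solution $\vw = X^{+}\vy$.

```python
import numpy as np

# Three equations, two unknowns: no exact solution exists.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0])

# The pseudo-inverse gives the least-squares answer...
w = np.linalg.pinv(X) @ y

# ...which matches NumPy's dedicated least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_lstsq)
```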

## What is stochastic gradient descent (SGD)?

Stochastic gradient descent is an algorithm that tries to find the minimum of a function expressed as a sum of component functions. It does so by choosing a...
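
A toy sketch of the idea (my own example): $f(w) = \sum_i (w - a_i)^2$ is a sum of component functions whose minimum is at the mean of the $a_i$; SGD steps along the gradient of one randomly chosen component at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0, 6.0])   # minimizer of f is mean(a) = 3

w = 0.0
for t in range(2000):
    i = rng.integers(len(a))         # pick one component at random
    lr = 0.5 / (t + 1)               # decaying step size
    w -= lr * 2 * (w - a[i])         # gradient of the i-th component only

# w ends up close to the true minimizer despite never seeing the full sum.
assert abs(w - 3.0) < 0.3
```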

## What is gradient descent (GD)?

Gradient descent is an optimization algorithm that tries to find the minimum of a function by following its gradient.
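
A minimal sketch (my own example): to minimize $f(w) = (w - 3)^2$, repeatedly step against the gradient $f'(w) = 2(w - 3)$.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Follow the negative gradient from w0 for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
assert abs(w_min - 3) < 1e-6
```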

## Why do we care about convexity?

In machine learning, the best parameters for a model are chosen so as to minimize the training objective. Strictly convex functions are particularly interesting because they have a...

## Building my own ebook library (#5)

We will use the whoosh library to store and search our local database. Install the library with pip:

## Building my own ebook library (#3)

We will scrape the website GoodReads to perform our search.

## Building my own ebook library (#2)

Let’s tackle the providers.

## Building my own ebook library (#1)

pip install unidecode


This morning I spent over an hour sorting and renaming the ebooks I downloaded this year. There were over 300 ebooks. What a waste of my time.

## Scraping with Python3 and Scrapy

Scrapy is one of the most popular Python frameworks for large-scale web scraping. It gives you all the tools you need to efficiently extract data from websites,...

## Scraping basics with python3 and urllib

Scraping means using a program to extract data from a source. When the source is a website or a blog, we say web scraping, and today we will...

## Scraping with BeautifulSoup

In a previous article, we discussed how to use python and urllib to scrape the web. In this article, we will see how the BeautifulSoup library replaces...

## Pause and restart a python script using Pickle

You’ve just quickly crafted a python script, and you’re ready to let it run the whole night while you sleep. Not long after you’ve fallen asleep, an error...

## The geometry of (normal) parameter estimation

This article shows geometrically where the best estimates for the mean and variance of a normally distributed random vector can be found. We start with a simple question...

## Propositional logic derived as a special case of probability calculus

In this article, I will apply the rules of probability calculus to derive the rules of propositional logic (also called propositional calculus).

## Extending logic to deal with uncertainty

This article sketches a construction of probability calculus as an extension of classical logic to account for uncertainty so that by construction, it can be used to automate...

## An information theory perspective on probability

In 1948, Claude Shannon invented information theory based on probability theory. Its basic definition is entropy. Given a set of messages $m_i$, each occurring with probability...
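
As a quick illustration (my own example): the entropy of a distribution over messages, in bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of message probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly one bit of information per toss...
assert math.isclose(entropy([0.5, 0.5]), 1.0)
# ...while a biased coin carries less.
assert entropy([0.9, 0.1]) < 1.0
```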

## A Bayesian Perspective

Probability is not a property of an event or state; there is no such thing as the probability that the coin lands showing heads. Probability expresses a strength...