Why there is more to classification than discrete regression
In a classification problem, the dataset consists of pairs of input vectors and discrete labels :
While in regression, output values are numerical (), in classification the labels can take at most a finite number of values: .
Assume that the labels are binary: .
As for regression, we suppose that there exists an approximate deterministic relationship such that :
Our goal is to use a subset to train a model able to approximate this relationship.
Classification using regression
We can try to use a regression and then binarize the predicted value: values above a given threshold are set to , values under are set to .
Let’s generate a dataset made of examples in each class and fit a linear least squares regression.
In the picture below:
- The orange marks are datapoints.
- Datapoints in class are at and datapoints in class are at . We can clearly see that a point is in class if and only if .
- We fit a linear least squares line to this dataset. The line is drawn in blue.
- The orange dashed line shows the threshold value of .
- If the blue line is under the orange line, the point is classified as ; if it is above, the point is classified as .
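The setup described in the list above can be sketched in a few lines of NumPy. This is only an illustration under assumed values: labels 0 and 1, a threshold of 0.5, a separation at x = 0, and 10 evenly spaced examples per class are all choices made for this sketch, not necessarily those used for the picture.

```python
import numpy as np

# Assumed toy data: class 0 lies left of x = 0, class 1 lies right of it.
x = np.concatenate([np.linspace(-1.0, -0.1, 10), np.linspace(0.1, 1.0, 10)])
y = (x > 0).astype(float)

# Linear least squares fit y ≈ slope * x + intercept.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)

# Binarize the regression output with an assumed threshold of 0.5.
THRESHOLD = 0.5
y_pred = (slope * x + intercept > THRESHOLD).astype(float)

# The decision boundary is where the fitted line crosses the threshold.
boundary = (THRESHOLD - intercept) / slope
accuracy = float(np.mean(y_pred == y))

print(f"boundary at x = {boundary:.3f}, accuracy = {accuracy:.2f}")
```

On this balanced, symmetric sample the fitted line crosses the threshold at the true separation, so thresholding the regression classifies every point correctly.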
The problem with this approach is that the loss function used by the regression is poorly suited to classification. Even on an easy dataset like this one, where the separation lies at , the regression line may shift unexpectedly when the number of datapoints changes.
Let’s generate additional examples in class to see what happens:
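A minimal sketch of this experiment, again with assumed values (labels 0 and 1, threshold 0.5, separation at x = 0, and 20 extra class 1 examples spread between x = 2 and x = 10). The helper `fit_and_score` is a hypothetical name introduced for this example.

```python
import numpy as np

THRESHOLD = 0.5  # assumed binarization threshold


def fit_and_score(x, y):
    """Fit a least squares line, then return its decision boundary and accuracy."""
    slope, intercept = np.polyfit(x, y, deg=1)
    boundary = (THRESHOLD - intercept) / slope
    predicted_one = slope * x + intercept > THRESHOLD
    accuracy = float(np.mean(predicted_one == (y == 1)))
    return boundary, accuracy


# Balanced base dataset: class 0 left of x = 0, class 1 right of it.
x_base = np.concatenate([np.linspace(-1.0, -0.1, 10), np.linspace(0.1, 1.0, 10)])
y_base = (x_base > 0).astype(float)

# Additional class 1 examples, all far on the correct side of the separation.
x_extra = np.linspace(2.0, 10.0, 20)
x_big = np.concatenate([x_base, x_extra])
y_big = np.concatenate([y_base, np.ones_like(x_extra)])

boundary_base, acc_base = fit_and_score(x_base, y_base)
boundary_big, acc_big = fit_and_score(x_big, y_big)

print(f"base:  boundary = {boundary_base:+.3f}, accuracy = {acc_base:.2f}")
print(f"extra: boundary = {boundary_big:+.3f}, accuracy = {acc_big:.2f}")
```

In this sketch the extra points are all easy to classify, yet they flatten the fitted line and push the decision boundary to the left of the true separation, so some class 0 points near the separation end up misclassified.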
The stability is much better using a polynomial regression of degree , as shown on the picture below.
This is because a predicted value of classifies an example in class , and assuming the point is indeed in class , the classification error is but the corresponding MSE error is still .
A model with MSE loss might therefore have to work much harder than necessary in order to provide a decent upper bound on the classification error.
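The mismatch between the two losses can be made concrete with a small numerical sketch. The labels (0 and 1), the threshold (0.5), and the predicted values below are assumptions chosen for illustration; `zero_one_loss` and `squared_error` are hypothetical helper names.

```python
def zero_one_loss(y_true, y_hat, threshold=0.5):
    """Classification error: 0 if the thresholded prediction matches the label, else 1."""
    return float((y_hat > threshold) != (y_true == 1))


def squared_error(y_true, y_hat):
    """Squared error of the raw regression output."""
    return (y_true - y_hat) ** 2


# Three assumed regression outputs for an example whose true class is 1.
y_true = 1
predictions = [0.6, 1.0, 3.0]

losses = [(p, zero_one_loss(y_true, p), squared_error(y_true, p)) for p in predictions]

for p, zo, sq in losses:
    print(f"prediction = {p:.1f}  0-1 loss = {zo:.0f}  squared error = {sq:.2f}")
```

All three predictions are classified correctly, so their classification error is zero, but only the prediction equal to the label has zero squared error. A prediction far on the correct side of the threshold is penalized even more heavily than one that barely clears it, which is what pulls the regression line away from a good decision boundary.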