What’s wrong with sum of squares for logistic regression?

import numpy as np
import pandas as pd
# Safe settings for Pandas.
pd.set_option('mode.chained_assignment', 'raise')
%matplotlib inline
import matplotlib.pyplot as plt
# Make the plots look more fancy.
plt.style.use('fivethirtyeight')
# Optimization function
from scipy.optimize import minimize

This page gives some extra explanation for the logistic_regression page.

Here we say more about why we might prefer the Maximum Likelihood Estimate way of scoring potential fits to the data, rather than our more usual least squared error. If you want the gory details on this choice, see this answer on StackOverflow. Here we look at whether the sum of squared error works well with minimize. The discussion in this page corresponds to the “computational efficiency” section of the answer linked above.

The background, the data

In that page we were looking at the chronic kidney disease dataset, to see if we could predict whether a patient had “good” appetite (as opposed to “poor” appetite) from that patient’s blood hemoglobin concentration.

df = pd.read_csv('ckd_clean.csv')
# Our columns of interest.
hgb_app = df.loc[:, ['Hemoglobin', 'Appetite']].copy()
# Dummy value column containing 0 for "poor" Appetite, 1 for "good".
hgb_app['appetite_dummy'] = hgb_app['Appetite'].replace(
    ['poor', 'good'],
    [0, 1])
hgb_app.head()
   Hemoglobin Appetite  appetite_dummy
0        11.2     poor               0
1         9.5     poor               0
2        10.8     poor               0
3         5.6     poor               0
4         7.7     poor               0

We take out the columns we are interested in for our further use:

# The x (predictor) and y (to-be-predicted) variables.
hemoglobin = hgb_app['Hemoglobin']
appetite_d = hgb_app['appetite_dummy']

Here is a plot of the 0 / 1 appetite values against the hemoglobin values:

def plot_hgb_app():
    # Build plot, add custom label.
    colors = hgb_app['Appetite'].replace(['poor', 'good'], ['red', 'blue'])
    hgb_app.plot.scatter('Hemoglobin', 'appetite_dummy', c=colors)
    plt.ylabel('Appetite\n0 = poor, 1 = good')
    plt.yticks([0,1]);  # Just label 0 and 1 on the y axis.
    # Put a custom legend on the plot.  This code is a little obscure.
    plt.scatter([], [], c='blue', label='good')
    plt.scatter([], [], c='red', label='poor')

plot_hgb_app()
plt.legend();
[Plot: appetite (0 = poor, 1 = good) against hemoglobin, good in blue, poor in red]

Linear regression - the crude approach

The crude and brutal approach to predicting these values is to use simple least-squares regression. We can do this in the usual way by using scipy.optimize.minimize with a function that returns the sum of squared error between the straight line predictions and the 0 / 1 labels. Here’s the function:

def ss_cost(c_s, x_values, y_values):
    """ Cost function for sum of squares prediction error
    """
    # c_s is a list containing two elements, an intercept and a slope.
    intercept, slope = c_s
    # Values predicted from these x_values, using this intercept and slope.
    predicted = intercept + x_values * slope
    # Difference of prediction from the actual y values.
    error = y_values - predicted
    # Sum of squared error.
    return np.sum(error ** 2)
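
Before handing this function to minimize, we can check that it runs, by evaluating it by hand for one candidate line. This is just a quick illustration; we use the same intercept and slope as the starting guess below.

# Sum of squared error for one candidate line (the starting guess below).
print(ss_cost([-0.5, 0.1], hemoglobin, appetite_d))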

Start with a guess of intercept -0.5, slope 0.1:

# Use cost function with minimize.
mr_ss = minimize(ss_cost, [-0.5, 0.1], args=(hemoglobin, appetite_d))
mr_ss
      fun: 10.315509522424126
 hess_inv: array([[ 0.07498682, -0.00524735],
       [-0.00524735,  0.00038337]])
      jac: array([-3.57627869e-07, -2.38418579e-07])
  message: 'Optimization terminated successfully.'
     nfev: 24
      nit: 4
     njev: 8
   status: 0
  success: True
        x: array([-0.07904142,  0.07004927])

Store the slope and intercept, predict the values directly from the straight line:

inter_ss, slope_ss = mr_ss.x
predicted_ss = inter_ss + slope_ss * hemoglobin

Show the results:

# Do the base plot of the hemoglobin and appetite_d.
plot_hgb_app()

# A new plot on top of the old.
plt.scatter(hemoglobin, predicted_ss,
            label='LR prediction',
            color='orange')
plt.title("Linear regression with sum of squares")
# Show the legend.
plt.legend();
[Plot: linear regression (sum of squares) predictions in orange, on top of the appetite points]

Let us remind ourselves of how the sum of squared error values change as we change the slope and the intercept. First we hold the slope constant at a fairly bad guess of 0.1, and try different intercepts. For each intercept we calculate the sum of squared error:

# Some intercepts to try.
intercepts = np.linspace(-2, 1, 1000)
n_inters = len(intercepts)
ss_errors = np.zeros(n_inters)
for i in np.arange(n_inters):
    ss_errors[i] = ss_cost([intercepts[i], 0.1], hemoglobin, appetite_d)
plt.scatter(intercepts, ss_errors)
plt.xlabel('intercept')
plt.ylabel('Linear SS error')
plt.title("Errors for different intercepts, slope 0.1")
[Plot: errors for different intercepts, slope 0.1]

Notice the very simple shape of this curve. It is a parabola: it descends steeply for values far from the minimum, and more slowly as it gets closer. This is a curve that minimize finds very easy to work with, because every time it tries an intercept (in this case), the direction (up or down) of the curve tells minimize which way to go next. If the curve is going down at this point, minimize should try a larger (more positive) intercept value; if the curve is going up, it should try a smaller (more negative) intercept. The slope of the curve therefore always points minimize in the right direction. You may also have noticed that we always get this parabola shape for simple least squares cost functions like ss_cost above.
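
As a rough numerical check on this shape (not something minimize needs us to do), we can look at the second differences of the curve we just calculated. Second differences measure how the curve bends; for a convex curve such as a parabola they are all positive.

# Second differences of the sum of squared error curve above.
# These should all be positive for a convex (bowl-shaped) curve.
second_diffs = np.diff(ss_errors, n=2)
print('Smallest second difference:', np.min(second_diffs))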

Just to illustrate again, here we try holding the intercept constant at a fairly bad guess of 0.5, and vary the slopes. Notice we get the same helpful parabola shape:

# Slopes to try.
slopes = np.linspace(-0.25, 0.25, 1000)
n_slopes = len(slopes)
ss_errors = np.zeros(n_slopes)
for i in np.arange(n_slopes):
    ss_errors[i] = ss_cost([0.5, slopes[i]], hemoglobin, appetite_d)
plt.scatter(slopes, ss_errors)
plt.xlabel('slope')
plt.ylabel('Linear SS error')
plt.title("Errors for different slopes, intercept 0.5");
[Plot: errors for different slopes, intercept 0.5]

These are plots of how the value of the cost function changes as we change the parameters. The parabolas we see above are examples of curves that are convex; convex curves like parabolas are particularly easy and quick for minimize to work with.

We will see that using sum of squared error with our preferred sigmoid prediction generates cost function curves that are a lot more complicated, making it more difficult for minimize to find the best parameters. If we give minimize a bad initial guess, it can get the answer badly wrong. Put technically, this is because the cost function curves are not convex.

Sigmoid prediction with sum of squares error

For the reasons you saw in the logistic_regression page, we recoil from the very simple straight line fit above, and prefer to use a sigmoid curve to fit the 0 / 1 labels.

In that page we defined the functions that convert between the straight line predictions that we want to use with minimize, and the sigmoid predictions:

def inv_logit(y):
    """ Reverse logit transformation
    """
    odds_ratios = np.exp(y)  # Reverse the log operation.
    return odds_ratios / (odds_ratios + 1)  # Reverse odds ratios operation.


def params2pps(intercept, slope, x):
    """ Calculate predicted probabilities of 1 for each observation.
    """
    # Predicted log odds of being in class 1.
    predicted_log_odds = intercept + slope * x
    return inv_logit(predicted_log_odds)
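
As a quick sanity check on these functions (again, just for illustration), a log odds value of 0 should convert to a probability of 0.5, and large positive or negative log odds should give probabilities close to 1 or 0.

# Log odds of 0 corresponds to a probability of 0.5.
print(inv_logit(0))
# Large positive log odds give probabilities close to 1.
print(inv_logit(5))
# Large negative log odds give probabilities close to 0.
print(inv_logit(-5))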

This allowed us to build our sum of squares logit cost function. This function calculates the sum of squared differences between the sigmoid predictions and the actual 0 / 1 labels.

def ss_logit_cost(c_s, x_values, y_values):
    # Unpack intercept and slope into values.
    intercept, slope = c_s
    # Predicted p values on sigmoid
    pps = params2pps(intercept, slope, x_values)
    # Prediction errors.
    sigmoid_error = y_values - pps
    # Sum of squared error
    return np.sum(sigmoid_error ** 2)
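
Just as a check that this function runs, we can evaluate it by hand for the guessed intercept and slope we are about to pass to minimize:

# Sum of squared error for the sigmoid predictions at our starting guess.
print(ss_logit_cost([-7, 0.8], hemoglobin, appetite_d))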

Then we found our sum of squares best straight line (that corresponds to a sigmoid after transformation). Notice that we have started minimize with a guessed intercept of -7 and a guessed slope of 0.8.

mr_ss_logit = minimize(ss_logit_cost, [-7, 0.8], args=(hemoglobin, appetite_d))
mr_ss_logit
      fun: 9.621504133676334
 hess_inv: array([[ 6.28125352, -0.59480177],
       [-0.59480177,  0.05889232]])
      jac: array([1.19209290e-07, 2.38418579e-07])
  message: 'Optimization terminated successfully.'
     nfev: 42
      nit: 12
     njev: 14
   status: 0
  success: True
        x: array([-5.28090393,  0.58493976])

We can calculate the predicted sigmoid probability values, and plot them.

inter_ss_logit, slope_ss_logit = mr_ss_logit.x
predicted_ss_logit = params2pps(inter_ss_logit, slope_ss_logit, hemoglobin)
plot_hgb_app()
# A new plot on top of the old.
plt.scatter(hemoglobin, predicted_ss_logit,
            label='Logit ss solution',
            color='gold')
# Show the legend.
plt.legend();
[Plot: sigmoid predictions from the sum of squares logit fit, on top of the appetite points]

Let us have a look at what the cost function curves look like for the ss_logit_cost cost function. For now, let us look at what happens to the cost function curves as we change the intercept, holding the slope the same.

Because we will do this several times, with various intercepts and constant slopes, we make a function, so we don’t repeat ourselves:

def plot_some_intercepts(cost_function, intercepts, slope):
    """ Plot values of `cost_function` for given `intercepts` and `slope`

    Parameters
    ----------
    cost_function : function
        Function to call to get cost function value, given an intercept and a slope.
    intercepts : array
        Array of intercepts for which to calculate cost function.
    slope : number
        Slope (held constant for each intercept).
    """
    n = len(intercepts)
    ss_errors = np.zeros(n)
    for i in np.arange(n):
        # Calculate and store cost function value for intercept, slope.
        ss_errors[i] = cost_function([intercepts[i], slope],
                                     hemoglobin, appetite_d)
    plt.plot(intercepts, ss_errors, linewidth=1)
    plt.xlabel('intercept')
    plt.ylabel('Cost function value');
    plt.title('Errors for slope = %.2f' % slope)

Remember the attractive parabolas for the cost function curves in the crude case above, where we were doing simple regression.

In the next cell, we set the slope to the best slope that minimize found, and show the effect on our ss_logit_cost function, when varying the intercept.

# Cost function curve varying intercept between -20 and 5, for best slope.
plot_some_intercepts(ss_logit_cost, np.linspace(-20, 5, 1000), slope_ss_logit)
[Plot: errors for intercepts -20 to 5, slope 0.58]

There is a clear minimum at around -5.3, as we expect from the fitted intercept above, but we have lost the nice parabola shape. For intercepts greater than about 3, the graph is very flat. This could spell trouble for minimize, if it gets stuck trying a series of intercepts greater than about 3. For example, the cost function stays almost the same as minimize tries values around 5, so minimize may not discover that it needs to track back to the real minimum, near -5.
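
To get a feel for how flat this plateau is, we can evaluate the cost function at a few intercepts along it, keeping the slope at the best value found above. This is just an illustration of the numbers behind the plot; notice how little the values change from one intercept to the next.

# Cost function values on the plateau, for the best slope from above.
for intercept in [4, 6, 8, 10]:
    cost = ss_logit_cost([intercept, slope_ss_logit], hemoglobin, appetite_d)
    print('Intercept:', intercept, 'cost:', cost)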

It can get even worse when trying slopes that are further away from the optimum. In the next plot, we set the slope badly wrong, at 3, and try different intercepts.

plot_some_intercepts(ss_logit_cost, np.linspace(-30, 25, 1000), 3)
[Plot: errors for intercepts -30 to 25, slope 3]

The plot is a strange shape. Again we see a nasty plateau for intercepts above 0. If minimize is trying intercepts above 0, the cost function may not vary much, and minimize may get stuck on this plateau, for example concluding that an intercept of 6 is as good as any nearby value.

And in fact, this does happen if we set very bad starting estimates for minimize. Here we set the starting intercept to be 6, and the starting slope to be 2.5.

bad_mr_ss_logit = minimize(ss_logit_cost, [6, 2.5],
                           args=(hemoglobin, appetite_d))
bad_mr_ss_logit
      fun: 18.999997860319507
 hess_inv: array([[1, 0],
       [0, 1]])
      jac: array([2.14576721e-06, 6.67572021e-06])
  message: 'Optimization terminated successfully.'
     nfev: 3
      nit: 0
     njev: 1
   status: 0
  success: True
        x: array([6. , 2.5])

You can see that minimize has got stuck on the plateau we saw above and has given up, simply returning the terrible intercept and slope we sent it.

You can also see that minimize did not detect any problems, and returned the message “Optimization terminated successfully”.

We have this problem because of the irregular shape of the cost-function curves for our cost function that calculates the sum of squared error for the sigmoid predictions.
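
One way of seeing why minimize gave up straight away is to ask how much the cost function changes for small steps away from the bad starting guess. The sketch below nudges the intercept and then the slope by a small amount; on the plateau the resulting changes in the cost function are tiny, so the estimated gradient is close to zero, and minimize decides it is already at a minimum.

# Cost at the bad starting guess.
cost_at_start = ss_logit_cost([6, 2.5], hemoglobin, appetite_d)
# Cost after a small nudge to the intercept, then to the slope.
nudge = 0.0001
cost_nudged_inter = ss_logit_cost([6 + nudge, 2.5], hemoglobin, appetite_d)
cost_nudged_slope = ss_logit_cost([6, 2.5 + nudge], hemoglobin, appetite_d)
# The changes are tiny, so the estimated gradient is close to zero.
print('Change from intercept nudge:', cost_nudged_inter - cost_at_start)
print('Change from slope nudge:', cost_nudged_slope - cost_at_start)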

Back to maximum likelihood

The logistic_regression page proposed an alternative cost function for the sigmoid predictions - maximum likelihood. See that page for details, but here is a somewhat cleaned up version of the maximum likelihood cost function from that page.

def ml_logit_cost(intercept_and_slope, x, y):
    """ Cost function for maximum likelihood
    """
    intercept, slope = intercept_and_slope
    # Predicted probability of a 1 label for each observation.
    pp1 = params2pps(intercept, slope, x)
    # Probability the model gives to the label we actually saw (1 or 0).
    p_of_y = y * pp1 + (1 - y) * (1 - pp1)
    # Log of the likelihood is the sum of the logs of these probabilities.
    log_likelihood = np.sum(np.log(p_of_y))
    # minimize searches for a minimum, so return the negative log likelihood.
    return -log_likelihood

We find the best intercept and slope using the maximum likelihood (ML) cost function. While we are at it, we send in the same terrible starting estimates for the intercept and slope:

mr_ML = minimize(ml_logit_cost, [6, 2.5], args=(hemoglobin, appetite_d))
mr_ML
      fun: 29.167993895856437
 hess_inv: array([[ 2.62881334, -0.24431754],
       [-0.24431754,  0.02371837]])
      jac: array([0.00000000e+00, 4.76837158e-06])
  message: 'Optimization terminated successfully.'
     nfev: 90
      nit: 13
     njev: 30
   status: 0
  success: True
        x: array([-7.29187397,  0.79915455])

The fit is reasonable:

inter_ML, slope_ML = mr_ML.x
predicted_ML = inv_logit(inter_ML + slope_ML * hemoglobin)
plot_hgb_app()
plt.scatter(hemoglobin, predicted_ML, c='gold', label='ML prediction')
plt.legend();
[Plot: maximum likelihood predictions on top of the appetite points]

The ML search by minimize is more reliable than the sum-of-squares case above; it is less dependent on us choosing reasonable starting values. This is because the ML cost function is convex. Here is the cost-function curve for the ML cost function, as we vary the intercept for a fixed slope. We see a much more predictable curve that slopes smoothly downwards to a minimum and smoothly upwards after that.

plot_some_intercepts(ml_logit_cost, np.linspace(-30, 25, 1000), slope_ss_logit)
[Plot: ML cost function values for intercepts -30 to 25, slope 0.58]
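
As a rough numerical check of this convexity (again, just for illustration), we can recalculate the ML cost function values for a range of intercepts around the minimum, and look at the second differences of the resulting curve. For a convex curve these should all be positive, apart from tiny floating point error.

# ML cost function values for intercepts around the minimum, best SS slope.
check_intercepts = np.linspace(-20, 5, 1000)
ml_costs = np.array(
    [ml_logit_cost([inter, slope_ss_logit], hemoglobin, appetite_d)
     for inter in check_intercepts])
# Second differences should all be positive for a convex curve.
print('Smallest second difference:', np.min(np.diff(ml_costs, n=2)))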

However, this does not mean the ML cost function is infallible. We can push it to a state where floating point calculation errors start to overwhelm the values. Even then, ML still has the advantage, because, unlike the sum of squares case, we do get a warning:

# Starting estimates too far off.
minimize(ml_logit_cost, [6, 3], args=(hemoglobin, appetite_d))
/Users/mb312/Library/Python/3.8/lib/python/site-packages/pandas/core/arraylike.py:358: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
/Users/mb312/Library/Python/3.8/lib/python/site-packages/scipy/optimize/_numdiff.py:557: RuntimeWarning: invalid value encountered in subtract
  df = fun(x) - f0
      fun: inf
 hess_inv: array([[1, 0],
       [0, 1]])
      jac: array([nan, nan])
  message: 'NaN result encountered.'
     nfev: 3
      nit: 0
     njev: 1
   status: 3
  success: False
        x: array([6., 3.])
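
Here is a sketch of where those warnings come from, using the same too-far-off parameters. With an intercept of 6 and a slope of 3, the sigmoid predictions for the patients with higher hemoglobin are so close to 1 that they round to exactly 1 in floating point. For any such patient with a 0 (poor appetite) label, the cost function then takes np.log of exactly 0, giving negative infinity and the warning above.

# Sigmoid predictions at the too-far-off starting estimates.
pps_bad = params2pps(6, 3, hemoglobin)
# Some predicted probabilities have rounded to exactly 1 in floating point.
print('Number of predictions exactly equal to 1:', np.sum(pps_bad == 1))
# For a row with a 0 label and a prediction of exactly 1, the cost function
# calculates np.log(1 - 1) == np.log(0), which is -inf, hence the warning.
print('Value of np.log(0.):', np.log(0.))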

Another example

You will find another demonstration of this difference between sum of squares and maximum likelihood in this page on logistic cost functions.